Your laptop as an AI server
Ollama is an open-source tool that lets you run large language models (LLMs) directly on your own computer — no internet required, no API keys, no subscription. Think of it as a download manager + inference engine for AI models, wrapped in a simple command-line interface.
When you run a model with Ollama, your CPU (and GPU, if available) does all the heavy lifting. Your prompts never leave your machine. This is called on-device inference.
Ollama also exposes a local REST API on port 11434, meaning any tool that knows how to talk to an HTTP endpoint can use it — editors, scripts, harness tools like Codex CLI, and more.
Ollama is not an AI model itself. It's the runtime — the engine that loads, manages, and serves open-weight models like Gemma, Llama, Mistral, Phi, and others.
How we got here
The ability to run LLMs locally didn't happen overnight. It's the result of years of research breakthroughs, open-source activism, and clever engineering.
Ollama, released in 2023, packaged that progress into a simple, Docker-style CLI (ollama run, ollama pull) and a REST API. It dramatically lowered the barrier to entry.

Breaking it down
Complex technology should be explainable at every level. Here's Ollama explained at three levels:
It's like downloading a smart toy to your room
You know how Siri and Alexa live in the internet and need Wi-Fi to answer you? Ollama is like downloading a really smart robot brain onto your computer, so it lives in your bedroom.
Once it's there, you can talk to it and ask it questions — even if your Wi-Fi is off! It doesn't tell anyone what you said, because it never goes to the internet. It's your private robot helper.
Ollama is the tool that helps put those robot brains on your computer. The brains are called "models" — they're like different toys you can download, each one good at different things.
Running AI like a local game server
You know how in Minecraft you can play on a server with friends, but you can also start your own local server on your computer? Ollama is like setting up your own private AI server.
Big AI tools like ChatGPT run on huge computers in data centers — you're basically borrowing their power. With Ollama, you download the AI "brain" (called a model) to your own PC, and your computer does all the thinking.
The cool parts: it works offline, no one can see your chats, and you can try different models like Gemma or Llama — kind of like switching between different game characters, each with different abilities.
🎓 Developer Definition
Ollama is a locally run model inference server built on llama.cpp that provides a Docker-like CLI for pulling, running, and managing open-weight language models. It exposes an OpenAI-compatible REST API at localhost:11434, enabling drop-in replacement for cloud APIs in development workflows.
Step-by-step installation
Recommended specs for running models locally
| Hardware Tier | RAM | Models You Can Run | Speed | Status |
|---|---|---|---|---|
| Budget Laptop | 8 GB | gemma2:2b, phi3:mini, tinyllama | Slow (2–5 tok/s) | Works with patience |
| Mid-range Laptop | 16 GB | gemma3:4b, llama3.2:3b, mistral:7b | OK (5–15 tok/s) | Good daily driver |
| Gaming PC / M-series Mac | 32 GB | llama3.1:8b, qwen2.5:14b, gemma3:12b | Fast (15–50 tok/s) | Excellent |
| Workstation / Mac Studio | 64 GB+ | llama3.1:70b, qwen2.5:72b, deepseek | Very fast | Production grade |
Install Ollama
Download and install Ollama for your platform.
```bash
# macOS / Linux — one-liner
curl -fsSL https://ollama.com/install.sh | sh

# Windows — download the installer from:
#   https://ollama.com/download/windows

# Verify installation
ollama --version
```
After install, Ollama runs as a background service and listens on localhost:11434.
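A quick sanity check: the server answers plain HTTP requests as soon as it is running.

```bash
# The root endpoint replies with a short status message
curl http://localhost:11434
# -> "Ollama is running"

# The version endpoint returns the installed version as JSON
curl http://localhost:11434/api/version
```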
Pull your first model
Choose a model from the Ollama library. For beginners, Gemma3 or Llama3.2 are excellent starting points.
```bash
# Pull a model (downloads to ~/.ollama/models)
ollama pull gemma3

# Or pull a specific size variant
ollama pull gemma3:4b
ollama pull llama3.2:3b
ollama pull mistral:7b

# List all downloaded models
ollama list
```
Run the model
```bash
# Interactive chat mode
ollama run gemma3

# Single prompt mode
ollama run gemma3 "Explain recursion in simple terms"

# Check what's running
ollama ps

# Stop a loaded model
ollama stop gemma3
```
Use the API directly
Ollama exposes an OpenAI-compatible REST API. Any tool that supports OpenAI can point to Ollama instead.
```bash
# Basic API call with curl
curl http://localhost:11434/api/generate \
  -d '{
    "model": "gemma3",
    "prompt": "What is Ollama?",
    "stream": false
  }'

# OpenAI-compatible endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma3",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
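Because the /v1 routes mirror OpenAI's, many tools built on the OpenAI SDKs can be redirected with nothing more than environment variables; the API key is ignored by Ollama but keeps client libraries happy. Not every tool honors these variables, so check its docs.

```bash
# Point OpenAI-SDK-based tools at the local server
export OPENAI_BASE_URL="http://localhost:11434/v1"
export OPENAI_API_KEY="ollama"   # any non-empty string works

# List the models the OpenAI-compatible endpoint exposes
curl http://localhost:11434/v1/models
```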
Advanced: Create a custom Modelfile
A Modelfile is like a Dockerfile for AI models — define a system prompt, temperature, and parameters.
Create a file called `Modelfile`:

```
FROM gemma3

# Set a system prompt
SYSTEM """
You are a senior software engineer who gives concise, accurate code reviews.
Use markdown.
"""

# Tune parameters
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
```

Then build and run your custom model:

```bash
ollama create myreviewer -f Modelfile
ollama run myreviewer
```
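As a quick check, ollama show can print back the Modelfile that was baked into the new model:

```bash
# Dump the stored Modelfile for the custom model
ollama show myreviewer --modelfile
```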
Environment variables & configuration
```bash
# Change model storage location (default: ~/.ollama)
export OLLAMA_MODELS=/path/to/models

# Allow external access (LAN / other machines)
export OLLAMA_HOST=0.0.0.0:11434

# Set number of parallel requests
export OLLAMA_NUM_PARALLEL=2

# Keep a model loaded in memory after a request (duration)
export OLLAMA_KEEP_ALIVE="10m"

# Note: the number of GPU layers to offload is controlled per model
# with the num_gpu parameter (Modelfile PARAMETER or API options)
```
Set OLLAMA_KEEP_ALIVE="-1" to keep the model permanently loaded. This eliminates the cold-start delay between prompts at the cost of persistent RAM usage.
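The same idea is exposed per request through the API's keep_alive field, which is handy for warming up or unloading a model on demand without changing the global setting:

```bash
# Keep gemma3 loaded indefinitely after this request
curl http://localhost:11434/api/generate \
  -d '{"model": "gemma3", "prompt": "warm-up", "keep_alive": -1, "stream": false}'

# Unload it immediately (empty request with keep_alive set to 0)
curl http://localhost:11434/api/generate \
  -d '{"model": "gemma3", "keep_alive": 0}'
```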
What models can you run?
Ollama's library hosts 100+ models, from small 2B models that run on budget laptops to 70B+ models that need a workstation.
Browse or search ollama.com/library to find models. Model names with no tag default to :latest. Use specific tags like gemma3:4b-it-q4_K_M to control the quantization level.
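For example, pulling an explicit tag and then inspecting it shows exactly what you got. The tag below is the one mentioned above; available tags differ from model to model, so check the model's library page.

```bash
# Pull an explicit quantization variant
ollama pull gemma3:4b-it-q4_K_M

# Show its family, parameter count, quantization, and context length
ollama show gemma3:4b-it-q4_K_M
```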
The honest trade-offs
Advantages
- Complete privacy — prompts never leave your machine
- No recurring costs after hardware investment
- Works offline — planes, remote locations, no Wi-Fi
- No rate limits or context window throttling
- Fully customizable via Modelfiles and system prompts
- OpenAI-compatible API — drop-in for existing tools
- Run multiple models simultaneously
- No censorship or content filtering from providers
- Educational — understand how LLMs actually work
Shortcomings
- Slower than cloud — CPU inference is notably slower than datacenter GPUs
- Hardware ceiling — larger, smarter models need more RAM/VRAM
- High RAM usage — a 7B model alone uses 5–8 GB of RAM
- Model quality gap — local 7B models lag behind GPT-4 or Claude Opus
- CPU heat and battery drain on laptops
- Initial model downloads are large (2–30 GB per model)
- No internet-connected tools (web search) without extra setup
- Limited multimodal capability at smaller sizes
Ollama as an AI backend
Ollama's real superpower is acting as the engine for other tools. "Harness tools" are CLIs and editors that sit on top of a model API and give it agentic capabilities — like browsing files, writing code, and running commands.
Codex CLI
OpenAI's official CLI coding agent. Designed around OpenAI's hosted models, but it supports any OpenAI-compatible endpoint — including Ollama. It scans your codebase, understands context, and can write, edit, and run code.
Claude Code
Anthropic's agentic coding tool. It primarily uses the Claude API; pointing it at local models requires a translation proxy in front of Ollama, since Claude Code speaks Anthropic's API rather than OpenAI's. Excellent for large codebases thanks to its extended thinking capability.
Continue.dev
VS Code / JetBrains extension for AI coding. First-class Ollama support. Provides autocomplete, chat, and edit modes, all powered by your local models.
Open WebUI
A ChatGPT-like browser interface that connects directly to Ollama. Run it locally and get a full-featured chat UI with history, RAG, and model switching — all offline.
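As a concrete sketch, Open WebUI's documented Docker quick start runs the UI in a container and lets it reach the Ollama server already listening on your host; exact flags may differ between Open WebUI releases, so treat this as a starting point.

```bash
# Run Open WebUI and allow the container to reach Ollama on the host
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

# Then open http://localhost:3000 in a browser
```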
Using Codex CLI with Ollama + Gemma
Running an open-weight coding agent entirely on local hardware — zero cloud calls, full source-code privacy. Here's exactly what happens, step by step.
```bash
# 1. Pull a small, coding-capable model
ollama pull gemma3:4b

# 2. Install Codex CLI
npm install -g @openai/codex

# 3. Point Codex at Ollama's OpenAI-compatible endpoint
#    (see the config sketch below)

# 4. Start Codex inside your project
cd ~/Desktop/my-project
codex
```
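One way to wire the two together is a provider entry in Codex's config file. The sketch below assumes Codex CLI's TOML config with model_providers and profiles sections; key names can change between releases, so verify them against the Codex documentation for your version.

```toml
# ~/.codex/config.toml (sketch; confirm key names for your Codex version)

[model_providers.ollama]
name = "Ollama"
base_url = "http://localhost:11434/v1"

[profiles.local-gemma]
model_provider = "ollama"
model = "gemma3:4b"
```

With a profile like that in place, recent Codex versions can be started with it (for example `codex --profile local-gemma`) from inside your project directory.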
Running Codex CLI with gemma3:4b on a 16 GB laptop uses approximately 5–8 GB of RAM for the model and pushes the CPU to 70–100% during inference. Responses arrive in 10–30 seconds per turn. It works, but it requires patience and benefits enormously from GPU acceleration or higher-end hardware. For production use, 32 GB of RAM or an NVIDIA GPU with 8 GB+ of VRAM is recommended.
Quick Reference — Codex CLI Commands
| Command | Description |
|---|---|
| codex | Start Codex CLI in the current project, using the Ollama provider configured in its config file |
| scan the project | Ask Codex to analyze your codebase and build understanding |
| find and fix a bug in @filename | Ask Codex to diagnose and patch bugs in a specific file |
| write tests for @filename | Generate unit tests for a given module |
| /model | Switch to a different Ollama model mid-session |
| Esc | Interrupt a running inference |
Quick Reference — Ollama Commands
| Command | Description |
|---|---|
| ollama run <model> | Start an interactive chat session with a model |
| ollama pull <model> | Download a model from the Ollama library |
| ollama list | Show all downloaded models |
| ollama ps | Show currently loaded models and resource usage |
| ollama rm <model> | Delete a model from local storage |
| ollama create <name> -f Modelfile | Create a custom model from a Modelfile |
| ollama show <model> | Display model metadata, parameters, and Modelfile |
| ollama serve | Start the Ollama server manually (usually auto-started) |
Go deeper
The local AI ecosystem moves fast. Here are the best places to keep learning:
- Ollama documentation: complete reference for commands, the API, the Modelfile format, and GPU setup guides.
- Ollama model library (ollama.com/library): browse and search 100+ models with sizes, benchmarks, and pull commands.
- Ollama GitHub repository: source code, issues, community integrations, and contribution guides.
- Open WebUI: a ChatGPT-style interface that runs locally on top of Ollama.
- Continue.dev: integrate Ollama with VS Code or JetBrains for AI autocomplete and chat.
- r/LocalLLaMA: the most active community for local LLM enthusiasts; tips, benchmarks, and model comparisons.
- Hugging Face: the home of open model weights, datasets, and leaderboards for model comparison.
- llama.cpp: the C++ engine that powers Ollama under the hood, for advanced users who want direct control.
- Codex CLI: OpenAI's terminal coding agent that works with Ollama as an open-weight backend.
Glossary

| Term | Meaning |
|---|---|
| LLM | Large Language Model — a neural network trained on text to predict and generate language (GPT, Gemma, Llama, etc.) |
| Inference | Running a trained model to generate outputs. "Local inference" means your CPU/GPU does this, not a remote server. |
| Quantization | Compressing model weights from 32-bit floats to 4-bit integers, reducing RAM requirements ~4–8x with minimal quality loss. |
| GGUF | The file format Ollama uses to store quantized models. Designed for efficient CPU inference. |
| Context Window | How many tokens (words) the model can "see" at once. Larger = more memory needed. Configured via num_ctx. |
| Modelfile | A configuration file for customizing a model's behavior — like a Dockerfile for AI. Defines system prompt, parameters, etc. |
| Open-weight | Models whose weights (parameters) are publicly released. "Open source AI" — you can download and run them yourself. |
| Harness tool | A CLI or app that wraps a model API and gives it agentic capabilities (file access, code execution, tool calling). |
| tok/s | Tokens per second — a measure of inference speed. 10 tok/s ≈ ~7 words/second. Cloud APIs do 50–100+ tok/s. |