Ollama CLI Cheatsheet: ls, serve, run, ps + commands (2026 update)
Updated Ollama command list - ls, ps, run, serve, etc
This Ollama CLI cheatsheet focuses on the commands you use every day (ollama ls, ollama serve, ollama run, ollama ps, model management, and common workflows), with examples you can copy/paste.
It also includes a short “performance knobs” section to help you discover (and then deep-dive) OLLAMA_NUM_PARALLEL and related settings.

This Ollama cheatsheet focuses on CLI commands, model management, and customization, but it also includes a few curl calls.
For a full picture of where Ollama fits among local, self-hosted and cloud options—including vLLM, Docker Model Runner, LocalAI and cloud providers—see LLM Hosting: Local, Self-Hosted & Cloud Infrastructure Compared. If you’re comparing different local LLM hosting solutions, check out our comprehensive comparison of Ollama, vLLM, LocalAI, Jan, LM Studio and more. For those seeking alternatives to command-line interfaces, Docker Model Runner offers a different approach to LLM deployment.
Ollama installation (download and CLI install)
- Option 1: Download from Website
- Visit ollama.com and download the installer for your operating system (Mac, Linux, or Windows).
- Option 2: Install via Command Line
- For Mac and Linux users, use the command:
curl -fsSL https://ollama.com/install.sh | sh
- Follow the on-screen instructions and enter your password if prompted.
Ollama system requirements (RAM, storage, CPU)
- Operating System: Mac, Linux, or Windows
- Memory (RAM): 8GB minimum, 16GB or more recommended
- Storage: At least ~10GB free space (model files can be really big; for details see Move Ollama Models to Different Drive)
- Processor: A relatively modern CPU (from the last 5 years). If you’re curious about how Ollama utilizes different CPU architectures, see our analysis of how Ollama uses Intel CPU Performance and Efficient Cores.
For serious AI workloads, you might want to compare hardware options. We’ve benchmarked NVIDIA DGX Spark vs Mac Studio vs RTX-4080 performance with Ollama, and if you’re considering investing in high-end hardware, our DGX Spark pricing and capabilities comparison provides detailed cost analysis.
Basic Ollama CLI Commands
| Command | Description |
|---|---|
| `ollama serve` | Starts the Ollama server (default port 11434). |
| `ollama run <model>` | Runs the specified model in an interactive REPL. |
| `ollama pull <model>` | Downloads the specified model to your system. |
| `ollama push <model>` | Uploads a model to the Ollama registry. |
| `ollama list` | Lists all downloaded models. Same as `ollama ls`. |
| `ollama ps` | Shows currently running (loaded) models. |
| `ollama stop <model>` | Stops (unloads) a running model. |
| `ollama rm <model>` | Removes a model from your system. |
| `ollama cp <source> <dest>` | Copies a model under a new name locally. |
| `ollama show <model>` | Displays details about a model (architecture, parameters, template, etc.). |
| `ollama create <model>` | Creates a new model from a Modelfile. |
| `ollama launch [integration]` | Zero-config launch of AI coding assistants (Claude Code, Codex, Droid, OpenCode). |
| `ollama signin` | Authenticates with the Ollama registry (enables private models and cloud models). |
| `ollama signout` | Signs out from the Ollama registry. |
| `ollama help` | Provides help about any command. |
Jump links: Ollama serve command · Ollama launch command · Ollama run command · Ollama run flags · Ollama ps command · Ollama show command · Ollama signin · Ollama CLI basics · Performance knobs (OLLAMA_NUM_PARALLEL) · Parallel requests deep dive
Ollama CLI (what it is)
Ollama CLI is the command-line interface to manage models and run/serve them locally. Most workflows boil down to:
- Start the server: `ollama serve`
- Run a model: `ollama run <model>`
- See what’s loaded/running: `ollama ps`
- Manage models: `ollama pull`, `ollama list`, `ollama rm`
Ollama model management: pull and list models commands
List Models:
ollama list
or, equivalently:
ollama ls
This command lists all the models downloaded to your system, along with their size on disk (HDD/SSD), for example:
$ ollama ls
NAME ID SIZE MODIFIED
deepseek-r1:8b 6995872bfe4c 5.2 GB 2 weeks ago
gemma3:12b-it-qat 5d4fa005e7bb 8.9 GB 2 weeks ago
LoTUs5494/mistral-small-3.1:24b-instruct-2503-iq4_NL 4e994e0f85a0 13 GB 3 weeks ago
dengcao/Qwen3-Embedding-8B:Q4_K_M d3ca2355027f 4.7 GB 4 weeks ago
dengcao/Qwen3-Embedding-4B:Q5_K_M 7e8c9ad6885b 2.9 GB 4 weeks ago
qwen3:8b 500a1f067a9f 5.2 GB 5 weeks ago
qwen3:14b bdbd181c33f2 9.3 GB 5 weeks ago
qwen3:30b-a3b 0b28110b7a33 18 GB 5 weeks ago
devstral:24b c4b2fa0c33d7 14 GB 5 weeks ago
Download a Model: ollama pull
ollama pull mistral-nemo:12b-instruct-2407-q6_K
This command downloads the specified model (here, mistral-nemo:12b-instruct-2407-q6_K) to your system. Model files can be quite large, so keep an eye on the space they take up on your hard drive or SSD. You might even want to move all Ollama models from your home directory to another, bigger drive.
Upload a Model: ollama push
ollama push my-custom-model
Uploads a local model to the Ollama registry so others can pull it.
You need to be signed in first (ollama signin) and the model name must be prefixed with your Ollama username, e.g. myuser/my-model.
Use --insecure if you are pushing to a private registry over HTTP:
ollama push myuser/my-model --insecure
Copy a Model: ollama cp
ollama cp llama3.2 my-llama3-variant
Creates a local copy of a model under a new name without re-downloading anything. This is handy before editing a Modelfile — copy first, customise the copy, and keep the original intact:
ollama cp qwen3:14b qwen3-14b-custom
ollama create qwen3-14b-custom -f ./Modelfile
Ollama show command
ollama show prints information about a downloaded model.
ollama show qwen3:14b
By default it prints the model card (architecture, context length, embedding length, quantization, etc.). There are three useful flags:
| Flag | What it shows |
|---|---|
| `--modelfile` | The full Modelfile used to create the model (FROM, SYSTEM, TEMPLATE, PARAMETER lines) |
| `--parameters` | Only the parameter block (e.g. num_ctx, temperature, stop tokens) |
| `--verbose` | Extended metadata including tensor shapes and layer counts |
# See exactly what system prompt and template a model was built with
ollama show deepseek-r1:8b --modelfile
# Check the context window size and other inference parameters
ollama show qwen3:14b --parameters
# Full tensor-level detail (useful when debugging quantization)
ollama show llama3.2 --verbose
The --modelfile output is especially useful before customising a model: you can copy the base Modelfile and edit from there rather than writing one from scratch.
Ollama serve command
ollama serve starts the local Ollama server (default HTTP port 11434).
ollama serve
Example: configure the server with environment variables, then start it:
# set env vars, then start the server
# make ollama available on the host's IP address
export OLLAMA_HOST=0.0.0.0:11434
export OLLAMA_NUM_PARALLEL=2
ollama serve
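If Ollama runs as a systemd service (the default for the Linux install script), put the same variables into a drop-in override instead of exporting them in a shell. The unit name and path below assume a standard Linux install:

```
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_NUM_PARALLEL=2"
```

Apply it with sudo systemctl daemon-reload && sudo systemctl restart ollama.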
Ollama run command
Run a Model:
ollama run gpt-oss:20b
This command starts the specified model and opens an interactive REPL for interaction. Want to understand how Ollama manages multiple concurrent requests? Learn more about how Ollama handles parallel requests in our detailed analysis.
ollama run starts a model in an interactive session, so in the case of gpt-oss:120b you would see something like:
$ ollama run gpt-oss:120b
>>> Send a message (/? for help)
You can type your questions or commands and the model will reply.
>>> who are you?
Thinking...
The user asks "who are you?" Simple question. Should respond as ChatGPT, an AI language model, trained by OpenAI,
etc. Provide brief intro. Probably ask if they need help.
...done thinking.
I’m ChatGPT, an AI language model created by OpenAI. I’ve been trained on a wide range of text so I can help
answer questions, brainstorm ideas, explain concepts, draft writing, troubleshoot problems, and much more. Think
of me as a versatile virtual assistant—here to provide information, support, and conversation whenever you need
it. How can I help you today?
>>> Send a message (/? for help)
To exit the interactive ollama session, press Ctrl+D or type /bye; both give the same result:
>>> /bye
$
Ollama run command examples
To run a model and ask a single question in non-interactive mode:
printf "Give me 10 bash one-liners for log analysis.\n" | ollama run llama3.2
If you want detailed timing statistics after each reply in an ollama session, run the model with the --verbose (or -v) flag:
$ ollama run gpt-oss:20b --verbose
>>> who are you?
Thinking...
We need to respond to a simple question: "who are you?" The user is asking "who are you?" We can answer that we
are ChatGPT, a large language model trained by OpenAI. We can also mention capabilities. The user likely expects
a brief introduction. We'll keep it friendly.
...done thinking.
I’m ChatGPT, a large language model created by OpenAI. I’m here to help answer questions, offer explanations,
brainstorm ideas, and chat about a wide range of topics—everything from science and history to creative writing
and everyday advice. Just let me know what you’d like to talk about!
total duration: 1.118585707s
load duration: 106.690543ms
prompt eval count: 71 token(s)
prompt eval duration: 30.507392ms
prompt eval rate: 2327.30 tokens/s
eval count: 132 token(s)
eval duration: 945.801569ms
eval rate: 139.56 tokens/s
>>> /bye
$
Yes, that’s right: 139 tokens per second. gpt-oss:20b is very fast. If you, like me, have a GPU with 16GB of VRAM, see the detailed speed comparison in Best LLMs for Ollama on 16GB VRAM GPU.
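The eval rate is just the generated token count divided by the eval duration, so you can sanity-check the --verbose stats above yourself:

```python
# Recompute the eval rate from the --verbose stats shown above
eval_count = 132               # "eval count: 132 token(s)"
eval_duration_s = 0.945801569  # "eval duration: 945.801569ms"

rate = eval_count / eval_duration_s
print(f"eval rate: {rate:.2f} tokens/s")  # matches the reported 139.56 tokens/s
```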
Tip: If you want the model available over HTTP for multiple apps, start the server with ollama serve and use the API client instead of long interactive sessions.
Ollama run flags (full reference)
| Flag | Description |
|---|---|
| `--verbose` / `-v` | Print timing stats (tokens/s, load time, etc.) after each response |
| `-p`, `--parameters` | Pass model parameters inline without a Modelfile (see below) |
| `--format string` | Force a specific output format, e.g. `json` |
| `--nowordwrap` | Disable automatic word wrapping — useful when piping output to scripts |
| `--insecure` | Allow connecting to a registry over HTTP (for private/self-hosted registries) |
Override model parameters without a Modelfile (-p / --parameters)
The -p flag lets you change inference parameters at runtime without creating a Modelfile.
You can stack multiple -p flags:
# Increase the context window and lower temperature
ollama run qwen3:14b -p num_ctx=32768 -p temperature=0.5
# Run a coding task with deterministic output
ollama run devstral:24b -p temperature=0 -p num_ctx=65536
Common parameters you can set this way:
| Parameter | Effect |
|---|---|
| `num_ctx` | Context window size in tokens (default is model-dependent, often 2048–4096) |
| `temperature` | Randomness: 0 = deterministic, 1 = creative |
| `top_p` | Nucleus sampling threshold |
| `top_k` | Limits vocabulary to top-K tokens |
| `num_predict` | Maximum tokens to generate (-1 = unlimited) |
| `repeat_penalty` | Penalty for repeating tokens |
Multiline input in the REPL
Wrap text in triple quotes (""") to enter a multi-line prompt without submitting early:
>>> """Summarise this in one sentence:
... The quick brown fox jumps over the lazy dog.
... It happened on a Tuesday.
... """
Multimodal models (images)
For vision-capable models (e.g. gemma3, llava), pass an image path directly in the prompt:
ollama run gemma3 "What's in this image? /home/user/screenshot.png"
Generating embeddings via CLI
Embedding models output a JSON array instead of text. Pipe text directly for quick one-off embeddings:
echo "Hello world" | ollama run nomic-embed-text
For production embedding workloads use the /api/embeddings REST endpoint or the Python client instead.
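As a minimal sketch of the REST route, this builds the request body for /api/embeddings; the model name is an example, and the actual POST is shown commented out so the snippet does not require a running server:

```python
import json

# Request body for POST http://localhost:11434/api/embeddings
payload = {"model": "nomic-embed-text", "prompt": "Hello world"}
print(json.dumps(payload))

# With the server running, you would send it like this:
#   import requests
#   r = requests.post("http://localhost:11434/api/embeddings", json=payload)
#   vector = r.json()["embedding"]   # a list of floats
```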
Force JSON output (--format)
ollama run llama3.2 --format json "List 5 capital cities as JSON"
The model is instructed to return valid JSON. Useful when piping output to jq or a script that expects structured data.
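Because the output is valid JSON, a script can parse it directly. A small sketch, where the raw string stands in for a model response (actual output will vary):

```python
import json

# Stand-in for the JSON a model returns under --format json
raw = '{"capitals": ["Paris", "Tokyo", "Nairobi", "Ottawa", "Canberra"]}'

data = json.loads(raw)
print(len(data["capitals"]))  # 5 entries, as requested in the prompt
```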
Ollama stop command
This command stops the specified running model.
ollama stop llama3.1:8b-instruct-q8_0
Ollama unloads idle models automatically after a timeout (5 minutes by default, configurable via keep_alive / OLLAMA_KEEP_ALIVE).
If you don’t want to wait out the remaining time, use the ollama stop command.
You can also evict the model from VRAM by calling the /api/generate endpoint with the parameter keep_alive=0; see the description and example below.
Ollama ps command
ollama ps shows currently running models and sessions (useful to debug “why is my VRAM full?”).
ollama ps
The example of the ollama ps output is below:
NAME ID SIZE PROCESSOR CONTEXT UNTIL
gpt-oss:20b 17052f91a42e 14 GB 100% GPU 4096 4 minutes from now
You can see that on my PC, gpt-oss:20b fits into the GPU’s 16GB of VRAM very well, occupying only 14 GB.
If I execute ollama run gpt-oss:120b and then call ollama ps, the outcome is not as bright:
78% of the layers are on the CPU, and that is with a context window of only 4096 tokens. Even more would spill over if I increased the context.
NAME ID SIZE PROCESSOR CONTEXT UNTIL
gpt-oss:120b a951a23b46a1 66 GB 78%/22% CPU/GPU 4096 4 minutes from now
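A script can watch for this spill-over by parsing the PROCESSOR column of ollama ps. The line below is copied from the sample output above; the column format is an assumption based on that output:

```python
import re

# One data line from `ollama ps`
line = "gpt-oss:120b  a951a23b46a1  66 GB  78%/22% CPU/GPU  4096  4 minutes from now"

m = re.search(r"(\d+)%/(\d+)% CPU/GPU", line)
if m:
    cpu_pct, gpu_pct = int(m.group(1)), int(m.group(2))
    print(f"{cpu_pct}% of the model is on the CPU - expect a big slowdown")
elif "100% GPU" in line:
    print("Model is fully resident in VRAM")
```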
Ollama launch command (AI coding integrations)
ollama launch is a command introduced in Ollama v0.15 (January 2026) that gives you zero-config, one-line setup for popular AI coding assistants running against your local Ollama server.
Why use ollama launch?
Before ollama launch, wiring up a coding agent like Claude Code or Codex to a local Ollama backend meant manually setting environment variables, pointing the tool to the right API endpoint, and picking a compatible model. ollama launch handles all of that for you interactively.
If you already run Ollama locally and want an agentic coding assistant without paying for API calls or sending code to the cloud, ollama launch is the fastest path there.
Supported integrations
| Integration | What it is |
|---|---|
| `claude` | Anthropic’s Claude Code — agentic coding assistant |
| `codex` | OpenAI’s Codex CLI coding assistant |
| `droid` | Factory’s AI coding agent |
| `opencode` | Open-source coding assistant |
Basic usage
# Interactive picker — choose an integration from a menu
ollama launch
# Launch a specific integration directly
ollama launch claude
# Launch with a specific model
ollama launch claude --model qwen3-coder
# Configure the integration without launching it (useful to inspect settings)
ollama launch droid --config
Recommended models
Coding agents need a long context window to hold whole-file context and multi-turn conversation history. Ollama recommends models with at least 64 000 tokens of context:
| Model | Notes |
|---|---|
| `qwen3-coder` | Strong coding performance, long context, runs locally |
| `glm-4.7-flash` | Fast local option |
| `devstral:24b` | Mistral’s coding-focused model |
If your GPU cannot fit the model, Ollama also offers cloud-hosted variants (e.g. qwen3-coder:480b-cloud) which integrate the same way but route inference to Ollama’s cloud tier — requiring ollama signin.
Example: running Claude Code locally with Ollama
# 1. Make sure the model is available
ollama pull qwen3-coder
# 2. Launch Claude Code against it
ollama launch claude --model qwen3-coder
Ollama sets the necessary environment variables and starts Claude Code pointing at http://localhost:11434 automatically.
You can then use Claude Code exactly as you normally would — the only difference is that inference happens on your own hardware.
Performance knobs (OLLAMA_NUM_PARALLEL)
If you see queueing or timeouts under load, the first knob to learn is OLLAMA_NUM_PARALLEL.
- OLLAMA_NUM_PARALLEL = how many requests Ollama executes in parallel.
- A higher value can increase throughput, but may increase VRAM pressure and latency spikes.
Quick example:
OLLAMA_NUM_PARALLEL=2 ollama serve
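Why does raising it cost VRAM? Each parallel slot needs its own context worth of KV cache. A back-of-envelope estimate, where all numbers are illustrative assumptions for a roughly 8B-class model, not measurements:

```python
# Rough KV-cache size: 2 (K and V) * layers * kv_heads * head_dim * bytes per value
layers, kv_heads, head_dim = 32, 8, 128   # assumed model shape
bytes_per_value = 2                        # fp16
num_ctx = 4096                             # context per slot
parallel = 2                               # OLLAMA_NUM_PARALLEL

per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
total_bytes = per_token * num_ctx * parallel
print(f"~{total_bytes / 2**30:.2f} GiB of KV cache")  # ~1.00 GiB
```

Double the parallel slots (or the context window) and this figure doubles with them, on top of the model weights themselves.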
For a full explanation (including tuning strategies and failure modes), see How Ollama Handles Parallel Requests.
Releasing Ollama model from VRAM (keep_alive)
When a model is loaded into VRAM (GPU memory), it stays there even after you finish using it. To explicitly release a model from VRAM and free up GPU memory, you can send a request to the Ollama API with keep_alive: 0.
- Release Model from VRAM using curl:
curl http://localhost:11434/api/generate -d '{"model": "MODELNAME", "keep_alive": 0}'
Replace MODELNAME with your actual model name, for example:
curl http://localhost:11434/api/generate -d '{"model": "qwen3:14b", "keep_alive": 0}'
- Release Model from VRAM using Python:
import requests
response = requests.post(
'http://localhost:11434/api/generate',
json={'model': 'qwen3:14b', 'keep_alive': 0}
)
This is particularly useful when:
- You need to free up GPU memory for other applications
- You’re running multiple models and want to manage VRAM usage
- You’ve finished using a large model and want to release resources immediately
Note: The keep_alive parameter controls how long (in seconds) a model stays loaded in memory after the last request. Setting it to 0 immediately unloads the model from VRAM.
Customizing Ollama models (system prompt, Modelfile)
- Set System Prompt: Inside the Ollama REPL, you can set a system prompt to customize the model’s behavior:
>>> /set system For all questions asked answer in plain English avoiding technical jargon as much as possible
>>> /save ipe
>>> /bye
Then, run the customized model:
ollama run ipe
This sets a system prompt and saves the model for future use.
- Create Custom Model File: Create a text file (e.g., custom_model.txt) with the following structure:
FROM llama3.1
SYSTEM [Your custom instructions here]
Then, run:
ollama create mymodel -f custom_model.txt
ollama run mymodel
This creates a customized model based on the instructions in the file.
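For reference, a slightly fuller Modelfile sketch combining options covered in this cheatsheet (the SYSTEM text and parameter values here are illustrative, not recommendations):

```
FROM llama3.1
SYSTEM You are a concise assistant. Answer in plain English and avoid technical jargon.
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
```

Build and run it with ollama create mymodel -f Modelfile followed by ollama run mymodel.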
Ollama signin and signout (registry authentication)
ollama signin
ollama signout
ollama signin authenticates your local Ollama installation with the Ollama registry at ollama.com. Once signed in, the client stores credentials locally and reuses them automatically for subsequent commands.
What signin unlocks:
- Pulling and pushing private models from your account or organisation.
- Using cloud-hosted models (e.g. qwen3-coder:480b-cloud) that are too large to run locally.
- Publishing models to the registry with ollama push.
Alternative: API key authentication
If you are running Ollama in a CI pipeline or a headless server where interactive ollama signin is not practical, create an API key in your Ollama account settings and expose it as an environment variable:
export OLLAMA_API_KEY=ollama_...
ollama pull myorg/private-model
The OLLAMA_API_KEY variable is picked up automatically by every Ollama command and API request — no need to run ollama signin on each machine.
Using Ollama run command with files (summarize, redirect)
- Summarize Text from a File:
ollama run llama3.2 "Summarize the content of this file in 50 words." < input.txt
This command summarizes the content of input.txt using the specified model.
- Log Model Responses to a File:
ollama run llama3.2 "Tell me about renewable energy." > output.txt
This command saves the model’s response to output.txt.
Ollama CLI use cases (text generation, analysis)
- Text Generation:
  - Summarizing a large text file: ollama run llama3.2 "Summarize the following text:" < long-document.txt
  - Generating content: ollama run llama3.2 "Write a short article on the benefits of using AI in healthcare." > article.txt
  - Answering specific questions: ollama run llama3.2 "What are the latest trends in AI, and how will they affect healthcare?"
- Data Processing and Analysis:
  - Classifying text into positive, negative, or neutral sentiment: ollama run llama3.2 "Analyze the sentiment of this customer review: 'The product is fantastic, but delivery was slow.'"
  - Categorizing text into predefined categories: use similar commands to classify or categorize text based on predefined criteria.
Using Ollama with Python (client and API)
- Install the Ollama Python library:
pip install ollama
- Generate text using Python:
import ollama
response = ollama.generate(model='gemma:2b', prompt='what is a qubit?')
print(response['response'])
This code snippet generates text using the specified model and prompt.
For advanced Python integration, explore using Ollama’s Web Search API in Python, which covers web search capabilities, tool calling, and MCP server integration. If you’re building AI-powered applications, our AI Coding Assistants comparison can help you choose the right tools for development.
Looking for a web-based interface? Open WebUI provides a self-hosted interface with RAG capabilities and multi-user support. For high-performance production deployments, consider vLLM as an alternative. To compare Ollama with other local and cloud LLM infrastructure choices, see LLM Hosting: Local, Self-Hosted & Cloud Infrastructure Compared.
Useful links
Alternatives and Comparisons
- Local LLM Hosting: Complete 2026 Guide - Ollama, vLLM, LocalAI, Jan, LM Studio & More
- vLLM Quickstart: High-Performance LLM Serving
- Docker Model Runner vs Ollama: Which to Choose?
- First Signs of Ollama Enshittification
Performance and Hardware
- How Ollama Handles Parallel Requests
- How Ollama is using Intel CPU Performance and Efficient Cores
- NVIDIA DGX Spark vs Mac Studio vs RTX-4080: Ollama Performance Comparison
- DGX Spark vs. Mac Studio: A Practical, Price-Checked Look at NVIDIA’s Personal AI Supercomputer
Integration and Development
- Using Ollama Web Search API in Python
- AI Coding Assistants Comparison
- Open WebUI: Self-Hosted LLM Interface
- Open-Source Chat UIs for LLMs on Local Ollama Instances
- Constraining LLMs with Structured Output: Ollama, Qwen3 & Python or Go
- Integrating Ollama with Python: REST API and Python Client Examples
- Go SDKs for Ollama - comparison with examples