Ollama CLI Cheatsheet: ls, serve, run, ps + commands (2026 update)

Updated Ollama command list - ls, ps, run, serve, etc


This Ollama CLI cheatsheet focuses on the commands you use every day (ollama ls, ollama serve, ollama run, ollama ps, model management, and common workflows), with examples you can copy/paste.

It also includes a short “performance knobs” section to help you discover (and then deep-dive) OLLAMA_NUM_PARALLEL and related settings.


This Ollama cheatsheet focuses on CLI commands, model management, and customization, with a few curl calls mixed in as well.

For a full picture of where Ollama fits among local, self-hosted and cloud options—including vLLM, Docker Model Runner, LocalAI and cloud providers—see LLM Hosting: Local, Self-Hosted & Cloud Infrastructure Compared. If you’re comparing different local LLM hosting solutions, check out our comprehensive comparison of Ollama, vLLM, LocalAI, Jan, LM Studio and more. For those seeking alternatives to command-line interfaces, Docker Model Runner offers a different approach to LLM deployment.

Ollama installation (download and CLI install)

  • Option 1: Download from Website
    • Visit ollama.com and download the installer for your operating system (Mac, Linux, or Windows).
  • Option 2: Install via Command Line
    • For Mac and Linux users, use the command:
curl -fsSL https://ollama.com/install.sh | sh
  • Follow the on-screen instructions and enter your password if prompted.

Ollama system requirements (RAM, storage, CPU)

For serious AI workloads, you might want to compare hardware options. We’ve benchmarked NVIDIA DGX Spark vs Mac Studio vs RTX-4080 performance with Ollama, and if you’re considering investing in high-end hardware, our DGX Spark pricing and capabilities comparison provides detailed cost analysis.

Basic Ollama CLI Commands

Command Description
ollama serve Starts the Ollama server (default port 11434).
ollama run <model> Runs the specified model in an interactive REPL.
ollama pull <model> Downloads the specified model to your system.
ollama push <model> Uploads a model to the Ollama registry.
ollama list Lists all downloaded models. Same as ollama ls.
ollama ps Shows currently running (loaded) models.
ollama stop <model> Stops (unloads) a running model.
ollama rm <model> Removes a model from your system.
ollama cp <source> <dest> Copies a model under a new name locally.
ollama show <model> Displays details about a model (architecture, parameters, template, etc.).
ollama create <model> Creates a new model from a Modelfile.
ollama launch [integration] Zero-config launch of AI coding assistants (Claude Code, Codex, Droid, OpenCode).
ollama signin Authenticates with the Ollama registry (enables private models and cloud models).
ollama signout Signs out from the Ollama registry.
ollama help Provides help about any command.

Jump links: Ollama serve command · Ollama launch command · Ollama run command · Ollama run flags · Ollama ps command · Ollama show command · Ollama signin · Ollama CLI basics · Performance knobs (OLLAMA_NUM_PARALLEL) · Parallel requests deep dive

Ollama CLI (what it is)

Ollama CLI is the command-line interface to manage models and run/serve them locally. Most workflows boil down to:

  • Start the server: ollama serve
  • Run a model: ollama run <model>
  • See what’s loaded/running: ollama ps
  • Manage models: ollama pull, ollama list, ollama rm

Ollama model management: pull and list models commands

List Models:

ollama list

the same as:

ollama ls

This command lists all the models that have been downloaded to your system, with their file sizes on your HDD/SSD, like this:

$ ollama ls
NAME                                                    ID              SIZE      MODIFIED     
deepseek-r1:8b                                          6995872bfe4c    5.2 GB    2 weeks ago     
gemma3:12b-it-qat                                       5d4fa005e7bb    8.9 GB    2 weeks ago     
LoTUs5494/mistral-small-3.1:24b-instruct-2503-iq4_NL    4e994e0f85a0    13 GB     3 weeks ago     
dengcao/Qwen3-Embedding-8B:Q4_K_M                       d3ca2355027f    4.7 GB    4 weeks ago     
dengcao/Qwen3-Embedding-4B:Q5_K_M                       7e8c9ad6885b    2.9 GB    4 weeks ago     
qwen3:8b                                                500a1f067a9f    5.2 GB    5 weeks ago     
qwen3:14b                                               bdbd181c33f2    9.3 GB    5 weeks ago     
qwen3:30b-a3b                                           0b28110b7a33    18 GB     5 weeks ago     
devstral:24b                                            c4b2fa0c33d7    14 GB     5 weeks ago  

Download a Model: ollama pull

ollama pull mistral-nemo:12b-instruct-2407-q6_K

This command downloads the specified model (e.g., Gemma 2B, or mistral-nemo:12b-instruct-2407-q6_K) to your system. Model files can be quite large, so keep an eye on the disk space they consume on your hard drive or SSD. You might even want to move all Ollama models from your home directory to another, larger drive.

Upload a Model: ollama push

ollama push my-custom-model

Uploads a local model to the Ollama registry so others can pull it. You need to be signed in first (ollama signin) and the model name must be prefixed with your Ollama username, e.g. myuser/my-model. Use --insecure if you are pushing to a private registry over HTTP:

ollama push myuser/my-model --insecure

Copy a Model: ollama cp

ollama cp llama3.2 my-llama3-variant

Creates a local copy of a model under a new name without re-downloading anything. This is handy before editing a Modelfile — copy first, customise the copy, and keep the original intact:

ollama cp qwen3:14b qwen3-14b-custom
ollama create qwen3-14b-custom -f ./Modelfile

Ollama show command

ollama show prints information about a downloaded model.

ollama show qwen3:14b

By default it prints the model card (architecture, context length, embedding length, quantization, etc.). There are three useful flags:

Flag What it shows
--modelfile The full Modelfile used to create the model (FROM, SYSTEM, TEMPLATE, PARAMETER lines)
--parameters Only the parameter block (e.g. num_ctx, temperature, stop tokens)
--verbose Extended metadata including tensor shapes and layer counts

# See exactly what system prompt and template a model was built with
ollama show deepseek-r1:8b --modelfile

# Check the context window size and other inference parameters
ollama show qwen3:14b --parameters

# Full tensor-level detail (useful when debugging quantization)
ollama show llama3.2 --verbose

The --modelfile output is especially useful before customising a model: you can copy the base Modelfile and edit from there rather than writing one from scratch.

Ollama serve command

ollama serve starts the local Ollama server (default HTTP port 11434).

ollama serve

Typical “ollama serve” usage with environment variables (the same variables go into Environment= lines in a systemd unit):

# set env vars, then start the server
# make ollama available on the host's IP address
export OLLAMA_HOST=0.0.0.0:11434
export OLLAMA_NUM_PARALLEL=2
ollama serve

Ollama run command

Run a Model:

ollama run gpt-oss:20b

This command starts the specified model and opens an interactive REPL. Want to understand how Ollama manages multiple concurrent requests? Learn more about how Ollama handles parallel requests in our detailed analysis.

ollama run starts a model in an interactive session, so in the case of gpt-oss:120b you would see something like:

$ ollama run gpt-oss:120b
>>> Send a message (/? for help)

You can type your questions or commands and the model will reply:

>>> who are you?
Thinking...
The user asks "who are you?" Simple question. Should respond as ChatGPT, an AI language model, trained by OpenAI, 
etc. Provide brief intro. Probably ask if they need help.
...done thinking.

I’m ChatGPT, an AI language model created by OpenAI. I’ve been trained on a wide range of text so I can help 
answer questions, brainstorm ideas, explain concepts, draft writing, troubleshoot problems, and much more. Think 
of me as a versatile virtual assistant—here to provide information, support, and conversation whenever you need 
it. How can I help you today?

>>> Send a message (/? for help)

To exit the interactive Ollama session, press Ctrl+D or type /bye; both give the same result:

>>> /bye
$ 

Ollama run command examples

To run a model and ask a single question in non-interactive mode:

printf "Give me 10 bash one-liners for log analysis.\n" | ollama run llama3.2

If you want to see detailed timing statistics for each LLM reply in an ollama session, run the model with the --verbose (or -v) flag:

$ ollama run gpt-oss:20b --verbose
>>> who are you?
Thinking...
We need to respond to a simple question: "who are you?" The user is asking "who are you?" We can answer that we 
are ChatGPT, a large language model trained by OpenAI. We can also mention capabilities. The user likely expects 
a brief introduction. We'll keep it friendly.
...done thinking.

I’m ChatGPT, a large language model created by OpenAI. I’m here to help answer questions, offer explanations, 
brainstorm ideas, and chat about a wide range of topics—everything from science and history to creative writing 
and everyday advice. Just let me know what you’d like to talk about!

total duration:       1.118585707s
load duration:        106.690543ms
prompt eval count:    71 token(s)
prompt eval duration: 30.507392ms
prompt eval rate:     2327.30 tokens/s
eval count:           132 token(s)
eval duration:        945.801569ms
eval rate:            139.56 tokens/s
>>> /bye
$ 

Yes, that’s right: 139 tokens per second. The gpt-oss:20b model is very fast. If you, like me, have a GPU with 16GB of VRAM, see the speed comparison details in Best LLMs for Ollama on 16GB VRAM GPU.

Tip: If you want the model available over HTTP for multiple apps, start the server with ollama serve and use the API client instead of long interactive sessions.
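For instance, a one-shot, non-streaming request against the HTTP API looks like this. This is a sketch that assumes the server is already running on the default port and that llama3.2 is pulled:

```shell
# Single completion over HTTP; "stream": false returns one JSON object
# instead of a stream of partial responses
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Give me 3 bash one-liners for log analysis.",
  "stream": false
}'
```

The response JSON contains the generated text in the response field, plus the same timing stats you get from --verbose.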

Ollama run flags (full reference)

Flag Description
--verbose / -v Print timing stats (tokens/s, load time, etc.) after each response
-p, --parameters Pass model parameters inline without a Modelfile (see below)
--format string Force a specific output format, e.g. json
--nowordwrap Disable automatic word wrapping — useful when piping output to scripts
--insecure Allow connecting to a registry over HTTP (for private/self-hosted registries)

Override model parameters without a Modelfile (-p / --parameters)

The -p flag lets you change inference parameters at runtime without creating a Modelfile. You can stack multiple -p flags:

# Increase the context window and lower temperature
ollama run qwen3:14b -p num_ctx=32768 -p temperature=0.5

# Run a coding task with deterministic output
ollama run devstral:24b -p temperature=0 -p num_ctx=65536

Common parameters you can set this way:

Parameter Effect
num_ctx Context window size in tokens (default is model-dependent, often 2048–4096)
temperature Randomness: 0 = deterministic, 1 = creative
top_p Nucleus sampling threshold
top_k Limits vocabulary to top-K tokens
num_predict Maximum tokens to generate (-1 = unlimited)
repeat_penalty Penalty for repeating tokens

Multiline input in the REPL

Wrap text in triple quotes (""") to enter a multi-line prompt without submitting early:

>>> """Summarise this in one sentence:
... The quick brown fox jumps over the lazy dog.
... It happened on a Tuesday.
... """

Multimodal models (images)

For vision-capable models (e.g. gemma3, llava), pass an image path directly in the prompt:

ollama run gemma3 "What's in this image? /home/user/screenshot.png"

Generating embeddings via CLI

Embedding models output a JSON array instead of text. Pipe text directly for quick one-off embeddings:

echo "Hello world" | ollama run nomic-embed-text

For production embedding workloads use the /api/embeddings REST endpoint or the Python client instead.
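As a sketch, the equivalent curl call against the REST endpoint (again assuming the server is running on the default port and the model is pulled) would be:

```shell
# Request an embedding vector for a piece of text;
# the reply contains an "embedding" array of floats
curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "Hello world"
}'
```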

Force JSON output (--format)

ollama run llama3.2 --format json "List 5 capital cities as JSON"

The model is instructed to return valid JSON. Useful when piping output to jq or a script that expects structured data.
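When piping such output into a script, it is worth validating it before use, since a model can still occasionally emit malformed JSON. A minimal sketch in Python (the reply string below is a hypothetical capture, not real model output):

```python
import json

# Hypothetical reply captured from: ollama run llama3.2 --format json "..."
reply = '{"capitals": ["Paris", "Rome", "Berlin", "Madrid", "Lisbon"]}'

try:
    data = json.loads(reply)  # raises json.JSONDecodeError on malformed output
except json.JSONDecodeError:
    data = None  # retry the model call or fall back here

if data is not None:
    print(data["capitals"][0])  # Paris
```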

Ollama stop command

This command stops the specified running model.

ollama stop llama3.1:8b-instruct-q8_0

Ollama evicts idle models from memory automatically after a timeout (5 minutes by default), which you can configure. If you don’t want to wait out the remaining time, use the ollama stop command. You can also kick the model out of VRAM by calling the /api/generate API endpoint with the parameter keep_alive=0; see below for the description and an example.
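The idle timeout can also be changed server-wide with the OLLAMA_KEEP_ALIVE environment variable, set before starting the server (a config fragment; 30m is just an example value):

```shell
# Keep models loaded for 30 minutes of inactivity; the variable accepts
# duration strings ("10m", "24h"), a number of seconds, or -1 (never unload)
export OLLAMA_KEEP_ALIVE=30m
ollama serve
```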

Ollama ps command

ollama ps shows currently running models and sessions (useful to debug “why is my VRAM full?”).

ollama ps

An example of ollama ps output is shown below:

NAME           ID              SIZE     PROCESSOR    CONTEXT    UNTIL
gpt-oss:20b    17052f91a42e    14 GB    100% GPU     4096       4 minutes from now

As you can see, on my PC gpt-oss:20b fits into the GPU’s 16GB of VRAM very comfortably, occupying only 14 GB.

If I execute ollama run gpt-oss:120b and then call ollama ps, the outcome is not as bright: 78% of the model runs on the CPU, and that is with a context window of just 4096 tokens. The CPU share would grow further if I needed a larger context.

NAME            ID              SIZE     PROCESSOR          CONTEXT    UNTIL
gpt-oss:120b    a951a23b46a1    66 GB    78%/22% CPU/GPU    4096       4 minutes from now

Ollama launch command (AI coding integrations)

ollama launch is a command introduced in Ollama v0.15 (January 2026) that gives you zero-config, one-line setup for popular AI coding assistants running against your local Ollama server.

Why use ollama launch?

Before ollama launch, wiring up a coding agent like Claude Code or Codex to a local Ollama backend meant manually setting environment variables, pointing the tool to the right API endpoint, and picking a compatible model. ollama launch handles all of that for you interactively.

If you already run Ollama locally and want an agentic coding assistant without paying for API calls or sending code to the cloud, ollama launch is the fastest path there.

Supported integrations

Integration What it is
claude Anthropic’s Claude Code — agentic coding assistant
codex OpenAI’s Codex CLI coding assistant
droid Factory’s AI coding agent
opencode Open-source coding assistant

Basic usage

# Interactive picker — choose an integration from a menu
ollama launch

# Launch a specific integration directly
ollama launch claude

# Launch with a specific model
ollama launch claude --model qwen3-coder

# Configure the integration without launching it (useful to inspect settings)
ollama launch droid --config

Coding agents need a long context window to hold whole-file context and multi-turn conversation history. Ollama recommends models with at least 64,000 tokens of context:

Model Notes
qwen3-coder Strong coding performance, long context, runs locally
glm-4.7-flash Fast local option
devstral:24b Mistral’s coding-focused model

If your GPU cannot fit the model, Ollama also offers cloud-hosted variants (e.g. qwen3-coder:480b-cloud) which integrate the same way but route inference to Ollama’s cloud tier — requiring ollama signin.

Example: running Claude Code locally with Ollama

# 1. Make sure the model is available
ollama pull qwen3-coder

# 2. Launch Claude Code against it
ollama launch claude --model qwen3-coder

Ollama sets the necessary environment variables and starts Claude Code pointing at http://localhost:11434 automatically. You can then use Claude Code exactly as you normally would — the only difference is that inference happens on your own hardware.

Performance knobs (OLLAMA_NUM_PARALLEL)

If you see queueing or timeouts under load, the first knob to learn is OLLAMA_NUM_PARALLEL.

  • OLLAMA_NUM_PARALLEL = how many requests Ollama executes in parallel.
  • A higher value can increase throughput, but may increase VRAM pressure and latency spikes.

Quick example:

OLLAMA_NUM_PARALLEL=2 ollama serve

For a full explanation (including tuning strategies and failure modes), see the parallel requests deep dive linked at the top of this page.

Releasing Ollama model from VRAM (keep_alive)

When a model is loaded into VRAM (GPU memory), it stays there even after you finish using it. To explicitly release a model from VRAM and free up GPU memory, you can send a request to the Ollama API with keep_alive: 0.

  • Release Model from VRAM using curl:
curl http://localhost:11434/api/generate -d '{"model": "MODELNAME", "keep_alive": 0}'

Replace MODELNAME with your actual model name, for example:

curl http://localhost:11434/api/generate -d '{"model": "qwen3:14b", "keep_alive": 0}'
  • Release Model from VRAM using Python:
import requests

response = requests.post(
    'http://localhost:11434/api/generate',
    json={'model': 'qwen3:14b', 'keep_alive': 0}
)

This is particularly useful when:

  • You need to free up GPU memory for other applications
  • You’re running multiple models and want to manage VRAM usage
  • You’ve finished using a large model and want to release resources immediately

Note: The keep_alive parameter controls how long a model stays loaded in memory after the last request. A bare number is interpreted as seconds, and duration strings such as "10m" or "24h" are also accepted; setting it to 0 immediately unloads the model from VRAM, while -1 keeps it loaded indefinitely.
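Conversely, you can pin a model in VRAM so it is never evicted, by sending keep_alive: -1 to the same endpoint (server running on the default port; qwen3:14b is just an example):

```shell
# Keep qwen3:14b loaded indefinitely (until ollama stop or a server restart)
curl http://localhost:11434/api/generate -d '{"model": "qwen3:14b", "keep_alive": -1}'
```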

Customizing Ollama models (system prompt, Modelfile)

  • Set System Prompt: Inside the Ollama REPL, you can set a system prompt to customize the model’s behavior:

    >>> /set system For all questions asked answer in plain English avoiding technical jargon as much as possible
    >>> /save ipe
    >>> /bye
    

    Then, run the customized model:

    ollama run ipe
    

    This sets a system prompt and saves the model for future use.

  • Create Custom Model File: Create a text file (e.g., custom_model.txt) with the following structure:

    FROM llama3.1
    SYSTEM [Your custom instructions here]
    

    Then, run:

    ollama create mymodel -f custom_model.txt
    ollama run mymodel
    

This creates a customized model based on the instructions in the file.
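A slightly fuller Modelfile can combine the system prompt with inference parameter overrides; the values below are illustrative, not recommendations:

    # Illustrative Modelfile: base model, system prompt, two parameter overrides
    FROM llama3.1
    SYSTEM You are a concise assistant. Answer in plain English and avoid technical jargon.
    PARAMETER temperature 0.3
    PARAMETER num_ctx 8192

Build and run it the same way: ollama create mymodel -f custom_model.txt, then ollama run mymodel.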

Ollama signin and signout (registry authentication)

ollama signin
ollama signout

ollama signin authenticates your local Ollama installation with the Ollama registry at ollama.com. Once signed in, the client stores credentials locally and reuses them automatically for subsequent commands.

What signin unlocks:

  • Pulling and pushing private models from your account or organisation.
  • Using cloud-hosted models (e.g. qwen3-coder:480b-cloud) that are too large to run locally.
  • Publishing models to the registry with ollama push.

Alternative: API key authentication

If you are running Ollama in a CI pipeline or a headless server where interactive ollama signin is not practical, create an API key in your Ollama account settings and expose it as an environment variable:

export OLLAMA_API_KEY=ollama_...
ollama pull myorg/private-model

The OLLAMA_API_KEY variable is picked up automatically by every Ollama command and API request — no need to run ollama signin on each machine.

Using Ollama run command with files (summarize, redirect)

  • Summarize Text from a File:

    ollama run llama3.2 "Summarize the content of this file in 50 words." < input.txt
    

    This command summarizes the content of input.txt using the specified model.

  • Log Model Responses to a File:

    ollama run llama3.2 "Tell me about renewable energy." > output.txt
    

    This command saves the model’s response to output.txt.

Ollama CLI use cases (text generation, analysis)

  • Text Generation:

    • Summarizing a large text file:
      ollama run llama3.2 "Summarize the following text:" < long-document.txt
      
    • Generating content:
      ollama run llama3.2 "Write a short article on the benefits of using AI in healthcare." > article.txt
      
    • Answering specific questions:
      ollama run llama3.2 "What are the latest trends in AI, and how will they affect healthcare?"
      


  • Data Processing and Analysis:

    • Classifying text into positive, negative, or neutral sentiment:
      ollama run llama3.2 "Analyze the sentiment of this customer review: 'The product is fantastic, but delivery was slow.'"
      
    • Categorizing text into predefined categories: Use similar commands to classify or categorize text based on predefined criteria.

Using Ollama with Python (client and API)

  • Install Ollama Python Library:
    pip install ollama
    
  • Generate Text Using Python:
    import ollama
    
    response = ollama.generate(model='gemma:2b', prompt='what is a qubit?')
    print(response['response'])
    
    This code snippet generates text using the specified model and prompt.

For advanced Python integration, explore using Ollama’s Web Search API in Python, which covers web search capabilities, tool calling, and MCP server integration. If you’re building AI-powered applications, our AI Coding Assistants comparison can help you choose the right tools for development.

Looking for a web-based interface? Open WebUI provides a self-hosted interface with RAG capabilities and multi-user support. For high-performance production deployments, consider vLLM as an alternative. To compare Ollama with other local and cloud LLM infrastructure choices, see LLM Hosting: Local, Self-Hosted & Cloud Infrastructure Compared.
