Ollama CLI Cheatsheet: ls, serve, run, ps + commands (2026 update)

Updated Ollama command list - ls, ps, run, serve, etc


This Ollama CLI cheatsheet focuses on the commands you use every day (ollama ls, ollama serve, ollama run, ollama ps, model management, and common workflows), with examples you can copy/paste.

It also includes a short “performance knobs” section to help you discover (and then deep-dive) OLLAMA_NUM_PARALLEL and related settings.


This Ollama cheatsheet focuses on CLI commands, model management, and customization, with a few curl calls mixed in as well.

For a full picture of where Ollama fits among local, self-hosted and cloud options—including vLLM, Docker Model Runner, LocalAI and cloud providers—see LLM Hosting: Local, Self-Hosted & Cloud Infrastructure Compared. If you’re comparing different local LLM hosting solutions, check out our comprehensive comparison of Ollama, vLLM, LocalAI, Jan, LM Studio and more. For those seeking alternatives to command-line interfaces, Docker Model Runner offers a different approach to LLM deployment.

Ollama installation (download and CLI install)

  • Option 1: Download from Website
    • Visit ollama.com and download the installer for your operating system (Mac, Linux, or Windows).
  • Option 2: Install via Command Line
    • For Mac and Linux users, use the command:
curl -fsSL https://ollama.com/install.sh | sh
  • Follow the on-screen instructions and enter your password if prompted.

Ollama system requirements (RAM, storage, CPU)

As a rough rule of thumb (from the Ollama README): you should have at least 8 GB of RAM to run 7B models, 16 GB for 13B models, and 32 GB for 33B models; a GPU with enough VRAM to hold the whole model makes generation much faster, and each downloaded model takes several gigabytes of disk space. For serious AI workloads, you might want to compare hardware options. We’ve benchmarked NVIDIA DGX Spark vs Mac Studio vs RTX-4080 performance with Ollama, and if you’re considering investing in high-end hardware, our DGX Spark pricing and capabilities comparison provides detailed cost analysis.

Basic Ollama CLI Commands

Command                      Description
ollama serve                 Starts Ollama on your local system.
ollama create <new_model>    Creates a new model from an existing one for customization or training.
ollama show <model>          Displays details about a specific model, such as its configuration and release date.
ollama run <model>           Runs the specified model, making it ready for interaction.
ollama pull <model>          Downloads the specified model to your system.
ollama list                  Lists all downloaded models (same as ollama ls).
ollama ps                    Shows the currently running models.
ollama stop <model>          Stops the specified running model.
ollama rm <model>            Removes the specified model from your system.
ollama help                  Provides help about any command.

Jump links: Ollama serve command · Ollama run command · Ollama ps command · Ollama CLI basics · Performance knobs (OLLAMA_NUM_PARALLEL) · Parallel requests deep dive

Ollama CLI (what it is)

Ollama CLI is the command-line interface to manage models and run/serve them locally. Most workflows boil down to:

  • Start the server: ollama serve
  • Run a model: ollama run <model>
  • See what’s loaded/running: ollama ps
  • Manage models: ollama pull, ollama list, ollama rm

Ollama model management: pull and list models commands

List Models:

ollama list

the same as:

ollama ls

This command lists all the models downloaded to your system, along with their size on your HDD/SSD, for example:

$ ollama ls
NAME                                                    ID              SIZE      MODIFIED     
deepseek-r1:8b                                          6995872bfe4c    5.2 GB    2 weeks ago     
gemma3:12b-it-qat                                       5d4fa005e7bb    8.9 GB    2 weeks ago     
LoTUs5494/mistral-small-3.1:24b-instruct-2503-iq4_NL    4e994e0f85a0    13 GB     3 weeks ago     
dengcao/Qwen3-Embedding-8B:Q4_K_M                       d3ca2355027f    4.7 GB    4 weeks ago     
dengcao/Qwen3-Embedding-4B:Q5_K_M                       7e8c9ad6885b    2.9 GB    4 weeks ago     
qwen3:8b                                                500a1f067a9f    5.2 GB    5 weeks ago     
qwen3:14b                                               bdbd181c33f2    9.3 GB    5 weeks ago     
qwen3:30b-a3b                                           0b28110b7a33    18 GB     5 weeks ago     
devstral:24b                                            c4b2fa0c33d7    14 GB     5 weeks ago  
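If you want to post-process that listing from a script, the columns are easy to split. A minimal parsing sketch, with the column layout assumed from the sample output above:

```python
# Sketch: parse `ollama ls` output into (name, size, unit) tuples.
# Column layout (NAME, ID, SIZE, MODIFIED) is assumed from the sample above.
def parse_ollama_ls(text):
    models = []
    lines = [ln for ln in text.splitlines() if ln.strip()]
    for line in lines[1:]:                  # skip the header row
        name, _id, size, unit, *_ = line.split()
        models.append((name, float(size), unit))
    return models
```

You could feed it the real output via `subprocess.check_output(["ollama", "ls"], text=True)`.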

Download a Model: ollama pull

ollama pull mistral-nemo:12b-instruct-2407-q6_K

This command downloads the specified model (e.g., gemma:2b, or mistral-nemo:12b-instruct-2407-q6_K) to your system. Model files can be quite large, so keep an eye on the disk space they occupy on your hard drive or SSD. You might even want to move all Ollama models from your home directory to a bigger, faster drive.

Ollama serve command

ollama serve starts the local Ollama server (default HTTP port 11434).

ollama serve

“ollama serve” with environment variables (e.g., to listen on all interfaces and allow two parallel requests):

# set env vars, then start the server
# make ollama available on the host's IP address
export OLLAMA_HOST=0.0.0.0:11434
export OLLAMA_NUM_PARALLEL=2
ollama serve
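On Linux installs where Ollama runs as a systemd service, the same variables are usually set via a drop-in override (`sudo systemctl edit ollama.service`) instead of exports; a sketch:

```ini
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_NUM_PARALLEL=2"
```

Then reload and restart: `sudo systemctl daemon-reload && sudo systemctl restart ollama`.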

Ollama run command

Run a Model:

ollama run gpt-oss:20b

This command starts the specified model and opens an interactive REPL for interaction. Want to understand how Ollama manages multiple concurrent requests? Learn more about how Ollama handles parallel requests in our detailed analysis.

ollama run starts a model in an interactive session; in the case of gpt-oss:120b you would see something like

$ ollama run gpt-oss:120b
>>> Send a message (/? for help)

You can type your questions or commands and the model will reply.

>>> who are you?
Thinking...
The user asks "who are you?" Simple question. Should respond as ChatGPT, an AI language model, trained by OpenAI, 
etc. Provide brief intro. Probably ask if they need help.
...done thinking.

I’m ChatGPT, an AI language model created by OpenAI. I’ve been trained on a wide range of text so I can help 
answer questions, brainstorm ideas, explain concepts, draft writing, troubleshoot problems, and much more. Think 
of me as a versatile virtual assistant—here to provide information, support, and conversation whenever you need 
it. How can I help you today?

>>> Send a message (/? for help)

To exit the interactive ollama session, press Ctrl+D or type /bye; the result is the same:

>>> /bye
$ 

Ollama run command examples

To run a model and ask a single question in a non-interactive mode:

printf "Give me 10 bash one-liners for log analysis.\n" | ollama run llama3.2
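The same non-interactive call can be scripted from Python with the standard library. A sketch, assuming the `ollama` binary is on your PATH (the full command is passed in so the helper stays generic):

```python
import subprocess

# Sketch: run a command non-interactively, feeding the prompt on stdin
# and capturing stdout -- exactly what the printf pipe above does.
def ask(argv, prompt):
    out = subprocess.run(argv, input=prompt, capture_output=True,
                         text=True, check=True)
    return out.stdout

# Hypothetical usage:
# ask(["ollama", "run", "llama3.2"], "Give me 10 bash one-liners for log analysis.")
```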

If you want detailed statistics for each reply (load time, token counts, tokens per second), run the model with the --verbose (-v) flag:

$ ollama run gpt-oss:20b --verbose
>>> who are you?
Thinking...
We need to respond to a simple question: "who are you?" The user is asking "who are you?" We can answer that we 
are ChatGPT, a large language model trained by OpenAI. We can also mention capabilities. The user likely expects 
a brief introduction. We'll keep it friendly.
...done thinking.

I’m ChatGPT, a large language model created by OpenAI. I’m here to help answer questions, offer explanations, 
brainstorm ideas, and chat about a wide range of topics—everything from science and history to creative writing 
and everyday advice. Just let me know what you’d like to talk about!

total duration:       1.118585707s
load duration:        106.690543ms
prompt eval count:    71 token(s)
prompt eval duration: 30.507392ms
prompt eval rate:     2327.30 tokens/s
eval count:           132 token(s)
eval duration:        945.801569ms
eval rate:            139.56 tokens/s
>>> /bye
$ 

Yes, that’s right: 139 tokens per second. The gpt-oss:20b model is very fast. If you, like me, have a GPU with 16GB of VRAM, see the LLM speed comparison details in Best LLMs for Ollama on 16GB VRAM GPU.
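The eval rate line is simply the generated token count divided by the generation time; a quick sanity check of the numbers above:

```python
# tokens per second = generated tokens / generation time
def eval_rate(eval_count, eval_duration_s):
    return eval_count / eval_duration_s

# numbers from the --verbose output above: 132 tokens in ~0.946 s
print(round(eval_rate(132, 0.945801569), 2))  # 139.56
```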

Tip: If you want the model available over HTTP for multiple apps, start the server with ollama serve and use the API client instead of long interactive sessions.

Ollama stop command

This command stops the specified running model.

ollama stop llama3.1:8b-instruct-q8_0

Ollama evicts idle models automatically after a timeout (5 minutes by default; the keep_alive setting controls it). If you don’t want to wait out the remaining time, use this ollama stop command. You can also kick a model out of VRAM by calling the /generate API endpoint with the parameter keep_alive=0; see below for the description and example.

Ollama ps command

ollama ps shows currently running models and sessions (useful to debug “why is my VRAM full?”).

ollama ps

An example of ollama ps output:

NAME           ID              SIZE     PROCESSOR    CONTEXT    UNTIL
gpt-oss:20b    17052f91a42e    14 GB    100% GPU     4096       4 minutes from now

Here you can see that on my PC gpt-oss:20b fits into my GPU’s 16GB of VRAM nicely, occupying only 14 GB.

If I execute ollama run gpt-oss:120b and then call ollama ps, the outcome is not as bright: 78% of the model sits on the CPU, and that is with a context window of only 4096 tokens. The CPU share grows if I need a larger context.

NAME            ID              SIZE     PROCESSOR          CONTEXT    UNTIL
gpt-oss:120b    a951a23b46a1    66 GB    78%/22% CPU/GPU    4096       4 minutes from now

Performance knobs (OLLAMA_NUM_PARALLEL)

If you see queueing or timeouts under load, the first knob to learn is OLLAMA_NUM_PARALLEL.

  • OLLAMA_NUM_PARALLEL = the maximum number of requests each loaded model processes in parallel.
  • A higher value can increase throughput, but may increase VRAM pressure and latency spikes.

Quick example:

OLLAMA_NUM_PARALLEL=2 ollama serve
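To see the effect you need several requests in flight at once. A minimal client-side sketch using only the standard library (the generate() helper assumes the default API endpoint and a pulled llama3.2 model; any worker function can be swapped in):

```python
import concurrent.futures
import json
import urllib.request

# Sketch: fire several /api/generate requests concurrently. With
# OLLAMA_NUM_PARALLEL=2 the server works on two at a time; the rest queue.
def generate(prompt, model="llama3.2",
             url="http://localhost:11434/api/generate"):
    body = json.dumps({"model": model, "prompt": prompt,
                       "stream": False}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

def generate_parallel(prompts, worker=generate, max_workers=4):
    # Results come back in the same order as the prompts.
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, prompts))
```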

For a full explanation (including tuning strategies and failure modes), see the parallel-requests deep dive linked above.

Releasing Ollama model from VRAM (keep_alive)

When a model is loaded into VRAM (GPU memory), it stays there even after you finish using it. To explicitly release a model from VRAM and free up GPU memory, you can send a request to the Ollama API with keep_alive: 0.

  • Release Model from VRAM using curl:
curl http://localhost:11434/api/generate -d '{"model": "MODELNAME", "keep_alive": 0}'

Replace MODELNAME with your actual model name, for example:

curl http://localhost:11434/api/generate -d '{"model": "qwen3:14b", "keep_alive": 0}'
  • Release Model from VRAM using Python:
import requests

# Sending keep_alive=0 with no prompt unloads the model immediately.
response = requests.post(
    'http://localhost:11434/api/generate',
    json={'model': 'qwen3:14b', 'keep_alive': 0}
)

This is particularly useful when:

  • You need to free up GPU memory for other applications
  • You’re running multiple models and want to manage VRAM usage
  • You’ve finished using a large model and want to release resources immediately

Note: The keep_alive parameter controls how long (in seconds) a model stays loaded in memory after the last request. Setting it to 0 immediately unloads the model from VRAM.

Customizing Ollama models (system prompt, Modelfile)

  • Set System Prompt: Inside the Ollama REPL, you can set a system prompt to customize the model’s behavior:

    >>> /set system For all questions asked answer in plain English avoiding technical jargon as much as possible
    >>> /save ipe
    >>> /bye
    

    Then, run the customized model:

    ollama run ipe
    

    This sets a system prompt and saves the model for future use.

  • Create Custom Model File: Create a text file (e.g., custom_model.txt) with the following structure:

    FROM llama3.1
    SYSTEM """Your custom instructions here"""
    

    Then, run:

    ollama create mymodel -f custom_model.txt
    ollama run mymodel
    

    This creates a customized model based on the instructions in the file.
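A slightly fuller Modelfile sketch (PARAMETER names are from the Modelfile reference; the model name and system prompt are just examples):

```
FROM llama3.1
PARAMETER temperature 0.2
PARAMETER num_ctx 8192
SYSTEM """Answer in plain English, avoiding technical jargon as much as possible."""
```

Save it as Modelfile and build with, e.g., `ollama create plain-english -f Modelfile`.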

Using Ollama run command with files (summarize, redirect)

  • Summarize Text from a File:

    ollama run llama3.2 "Summarize the content of this file in 50 words." < input.txt
    

    This command summarizes the content of input.txt using the specified model.

  • Log Model Responses to a File:

    ollama run llama3.2 "Tell me about renewable energy." > output.txt
    

    This command saves the model’s response to output.txt.

Ollama CLI use cases (text generation, analysis)

  • Text Generation:

    • Summarizing a large text file:
      ollama run llama3.2 "Summarize the following text:" < long-document.txt
      
    • Generating content:
      ollama run llama3.2 "Write a short article on the benefits of using AI in healthcare." > article.txt
      
    • Answering specific questions:
      ollama run llama3.2 "What are the latest trends in AI, and how will they affect healthcare?"
      


  • Data Processing and Analysis:

    • Classifying text into positive, negative, or neutral sentiment:
      ollama run llama3.2 "Analyze the sentiment of this customer review: 'The product is fantastic, but delivery was slow.'"
      
    • Categorizing text into predefined categories: Use similar commands to classify or categorize text based on predefined criteria.
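The classification pattern above boils down to a prompt with a fixed label set. A hypothetical helper (the label list and wording are just an example):

```python
# Build a classification prompt with a fixed label set (hypothetical helper;
# pipe the result into `ollama run llama3.2` or the API).
def classify_prompt(text, labels=("positive", "negative", "neutral")):
    return ("Classify the sentiment of this review as one of: "
            + ", ".join(labels)
            + ". Reply with the label only.\nReview: " + text)
```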

Using Ollama with Python (client and API)

  • Install Ollama Python Library:
    pip install ollama
    
  • Generate Text Using Python:
    import ollama
    
    response = ollama.generate(model='gemma:2b', prompt='what is a qubit?')
    print(response['response'])
    
    This code snippet generates text using the specified model and prompt.

For advanced Python integration, explore using Ollama’s Web Search API in Python, which covers web search capabilities, tool calling, and MCP server integration. If you’re building AI-powered applications, our AI Coding Assistants comparison can help you choose the right tools for development.

Looking for a web-based interface? Open WebUI provides a self-hosted interface with RAG capabilities and multi-user support. For high-performance production deployments, consider vLLM as an alternative. To compare Ollama with other local and cloud LLM infrastructure choices, see LLM Hosting: Local, Self-Hosted & Cloud Infrastructure Compared.
