Integrating Ollama with Python: REST API and Python Client Examples

+ Specific Examples Using Thinking LLMs


In this post, we’ll explore two ways to connect your Python application to Ollama:

  1. via the HTTP REST API
  2. via the official Ollama Python library

We’ll cover both chat and generate calls, and then discuss how to use “thinking models” effectively.


Ollama has quickly become one of the most convenient ways to run large language models (LLMs) locally. With its simple interface and support for popular open models like Llama 3, Mistral, Qwen2.5, and even “thinking” variants like qwen3, it’s easy to embed AI capabilities directly into your Python projects — without relying on external cloud APIs.


🧩 Prerequisites

Before diving in, make sure you have:

  • Ollama installed and running locally (ollama serve)
  • Python 3.9+
  • Required dependencies:
pip install requests ollama

Confirm Ollama is running by executing:

ollama list

You should see available models such as llama3, mistral, or qwen3.
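
If a model you want isn’t listed yet, pull it first, for example:

ollama pull llama3.1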


⚙️ Option 1: Using Ollama’s REST API

The REST API is ideal when you want maximum control or when integrating with frameworks that already handle HTTP requests.

Example 1: Chat API

import requests
import json

url = "http://localhost:11434/api/chat"

payload = {
    "model": "llama3.1",
    "messages": [
        {"role": "system", "content": "You are a Python assistant."},
        {"role": "user", "content": "Write a function that reverses a string."}
    ]
}

response = requests.post(url, json=payload, stream=True)

for line in response.iter_lines():
    if line:
        data = json.loads(line)
        print(data.get("message", {}).get("content", ""), end="")

👉 The Ollama REST API streams its response as newline-delimited JSON, one chunk per line (similar in spirit to OpenAI’s streaming API). You can display each chunk in real time for chatbots or CLI tools, or accumulate the chunks into a single string, as in the sketch below.
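
For example, a minimal sketch that collects the streamed chunks into one string, stopping at the final chunk (which Ollama marks with "done": true):

import json
import requests

url = "http://localhost:11434/api/chat"
payload = {
    "model": "llama3.1",
    "messages": [
        {"role": "user", "content": "Write a function that reverses a string."}
    ]
}

# Collect the streamed chunks instead of printing them as they arrive
response = requests.post(url, json=payload, stream=True)
full_reply = ""
for line in response.iter_lines():
    if not line:
        continue
    data = json.loads(line)
    full_reply += data.get("message", {}).get("content", "")
    if data.get("done"):  # the last chunk carries "done": true plus timing stats
        break

print(full_reply)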


Example 2: Generate API

If you don’t need chat context or roles, use the simpler /api/generate endpoint:

import requests
import json

url = "http://localhost:11434/api/generate"
payload = {
    "model": "llama3.1",
    "prompt": "Explain recursion in one sentence."
}

# Like /api/chat, this endpoint streams newline-delimited JSON by default;
# each chunk carries a piece of the output in its "response" field.
response = requests.post(url, json=payload, stream=True)
for line in response.iter_lines():
    if line:
        data = json.loads(line)
        print(data.get("response", ""), end="")

This endpoint is great for one-shot text generation tasks — summaries, code snippets, etc.
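
If you’d rather receive the whole result at once, you can set "stream": False in the payload and Ollama returns a single JSON object; a minimal sketch:

import requests

url = "http://localhost:11434/api/generate"
payload = {
    "model": "llama3.1",
    "prompt": "Explain recursion in one sentence.",
    "stream": False  # ask for one complete JSON object instead of a stream
}

response = requests.post(url, json=payload)
print(response.json()["response"])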


🐍 Option 2: Using the Ollama Python Library

The Ollama Python client provides a cleaner interface for developers who prefer to stay fully in Python.

Example 1: Chat API

import ollama

response = ollama.chat(
    model="llama3.1",
    messages=[
        {"role": "system", "content": "You are a code assistant."},
        {"role": "user", "content": "Generate a Python script that lists all files in a directory."}
    ]
)

print(response['message']['content'])

This returns the complete response in one object; response['message']['content'] holds the reply text. If you want streaming, you can iterate over the chat stream:

stream = ollama.chat(
    model="llama3.1",
    messages=[
        {"role": "user", "content": "Write a haiku about recursion."}
    ],
    stream=True
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)

Example 2: Generate API

import ollama

output = ollama.generate(
    model="llama3.1",
    prompt="Summarize the concept of decorators in Python."
)

print(output['response'])

Or stream the result:

stream = ollama.generate(
    model="llama3.1",
    prompt="List three pros of using Python for AI projects.",
    stream=True
)

for chunk in stream:
    print(chunk['response'], end='', flush=True)

🧠 Working with “Thinking” Models

Ollama supports “thinking models” such as qwen3, designed to show their intermediate reasoning steps. These models produce structured output, often in a format like:

<think>
  Reasoning steps here...
</think>
Final answer here.

This makes them useful for:

  • Debugging model reasoning
  • Research into interpretability
  • Building tools that separate thought from output

Example: Using a Thinking Model

import ollama
import re

response = ollama.chat(
    model="qwen3",
    messages=[
        {"role": "user", "content": "What is the capital of Australia?"}
    ]
)

content = response['message']['content']

# Optionally split the inline <think>...</think> block from the final answer
thinking = re.findall(r"<think>(.*?)</think>", content, re.DOTALL)
answer = re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL)

print("🧠 Thought process:\n", thinking[0].strip() if thinking else "N/A")
print("\n✅ Final answer:\n", answer.strip())

When to Use Thinking Models

Use Case | Recommended Model | Why
Interpretability / debugging | qwen3 | View reasoning traces
Performance-sensitive apps | qwen3 (non-thinking mode) | Faster, less verbose
Educational / explanatory | qwen3 | Shows step-by-step logic
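
To get the faster non-thinking behaviour from the table, newer Ollama versions also accept a think flag over the REST API; a hedged sketch, assuming a server that supports it:

import requests

payload = {
    "model": "qwen3",
    "messages": [
        {"role": "user", "content": "What is the capital of Australia?"}
    ],
    "think": False,   # assumption: server version that supports the think flag
    "stream": False
}

reply = requests.post("http://localhost:11434/api/chat", json=payload)
print(reply.json()["message"]["content"])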

✅ Summary

Task | REST API | Python Client
Simple text generation | /api/generate | ollama.generate()
Conversational chat | /api/chat | ollama.chat()
Streaming support | Yes | Yes
Works with thinking models | Yes | Yes

Ollama’s local-first design makes it ideal for secure, offline, or privacy-sensitive AI applications. Whether you’re building an interactive chatbot or a background data enrichment service, you can integrate LLMs seamlessly into your Python workflow — with full control over models, latency, and data.
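
As a closing illustration, here is a minimal sketch of an interactive CLI chatbot built on the Python client (the model name and prompts are just placeholders):

import ollama

history = []  # keep the running conversation so the model has context

while True:
    user_input = input("You: ")
    if user_input.lower() in {"exit", "quit"}:
        break
    history.append({"role": "user", "content": user_input})

    reply = ""
    for chunk in ollama.chat(model="llama3.1", messages=history, stream=True):
        piece = chunk['message']['content']
        print(piece, end="", flush=True)
        reply += piece
    print()

    history.append({"role": "assistant", "content": reply})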