Integrating Ollama with Python: REST API and Python Client Examples

+ Specific Examples Using Thinking LLMs


In this post, we’ll explore two ways to connect your Python application to Ollama:

  1. via the HTTP REST API
  2. via the official Ollama Python library

We’ll cover both chat and generate calls, and then discuss how to use “thinking models” effectively.


Ollama has quickly become one of the most convenient ways to run large language models (LLMs) locally. With its simple interface and support for popular open models like Llama 3, Mistral, Qwen2.5, and even “thinking” variants like qwen3, it’s easy to embed AI capabilities directly into your Python projects — without relying on external cloud APIs.


🧩 Prerequisites

Before diving in, make sure you have:

  • Ollama installed and running locally (ollama serve)
  • Python 3.9+
  • Required dependencies:
pip install requests ollama

Confirm Ollama is running by executing:

ollama list

You should see available models such as llama3, mistral, or qwen3.
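
If a model you want isn’t listed yet, pull it first, for example:

ollama pull llama3.1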


⚙️ Option 1: Using Ollama’s REST API

The REST API is ideal when you want maximum control or when integrating with frameworks that already handle HTTP requests.

Example 1: Chat API

import requests
import json

url = "http://localhost:11434/api/chat"

payload = {
    "model": "llama3.1",
    "messages": [
        {"role": "system", "content": "You are a Python assistant."},
        {"role": "user", "content": "Write a function that reverses a string."}
    ]
}

response = requests.post(url, json=payload, stream=True)

for line in response.iter_lines():
    if line:
        data = json.loads(line)
        print(data.get("message", {}).get("content", ""), end="")

👉 The Ollama REST API streams its response as newline-delimited JSON, one chunk per line (similar in spirit to OpenAI’s streaming API). You can display each chunk in real time for chatbots or CLI tools, or accumulate the chunks into a single string, as in the sketch below.
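
For example, a minimal sketch that collects the streamed chunks into one string, stopping at the final chunk (which Ollama marks with "done": true):

import json
import requests

url = "http://localhost:11434/api/chat"
payload = {
    "model": "llama3.1",
    "messages": [
        {"role": "user", "content": "Write a function that reverses a string."}
    ]
}

# Collect the streamed chunks instead of printing them as they arrive
response = requests.post(url, json=payload, stream=True)
full_reply = ""
for line in response.iter_lines():
    if not line:
        continue
    data = json.loads(line)
    full_reply += data.get("message", {}).get("content", "")
    if data.get("done"):  # the last chunk carries "done": true plus timing stats
        break

print(full_reply)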


Example 2: Generate API

If you don’t need chat context or roles, use the simpler /api/generate endpoint:

import requests
import json

url = "http://localhost:11434/api/generate"
payload = {
    "model": "llama3.1",
    "prompt": "Explain recursion in one sentence."
}

# Like /api/chat, this endpoint streams newline-delimited JSON by default;
# each chunk carries a piece of the output in its "response" field.
response = requests.post(url, json=payload, stream=True)
for line in response.iter_lines():
    if line:
        data = json.loads(line)
        print(data.get("response", ""), end="")

This endpoint is great for one-shot text generation tasks — summaries, code snippets, etc.
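
If you’d rather receive the whole result at once, you can set "stream": False in the payload and Ollama returns a single JSON object; a minimal sketch:

import requests

url = "http://localhost:11434/api/generate"
payload = {
    "model": "llama3.1",
    "prompt": "Explain recursion in one sentence.",
    "stream": False  # ask for one complete JSON object instead of a stream
}

response = requests.post(url, json=payload)
print(response.json()["response"])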


🐍 Option 2: Using the Ollama Python Library

The Ollama Python client provides a cleaner interface for developers who prefer to stay fully in Python.

Example 1: Chat API

import ollama

response = ollama.chat(
    model="llama3.1",
    messages=[
        {"role": "system", "content": "You are a code assistant."},
        {"role": "user", "content": "Generate a Python script that lists all files in a directory."}
    ]
)

print(response['message']['content'])

This returns the complete response in one object; response['message']['content'] holds the reply text. If you want streaming, you can iterate over the chat stream:

stream = ollama.chat(
    model="llama3.1",
    messages=[
        {"role": "user", "content": "Write a haiku about recursion."}
    ],
    stream=True
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)

Example 2: Generate API

import ollama

output = ollama.generate(
    model="llama3.1",
    prompt="Summarize the concept of decorators in Python."
)

print(output['response'])

Or stream the result:

stream = ollama.generate(
    model="llama3.1",
    prompt="List three pros of using Python for AI projects.",
    stream=True
)

for chunk in stream:
    print(chunk['response'], end='', flush=True)

🧠 Working with “Thinking” Models

Ollama supports “thinking models” such as qwen3, designed to show their intermediate reasoning steps. These models produce structured output, often in a format like:

<think>
  Reasoning steps here...
</think>
Final answer here.

This makes them useful for:

  • Debugging model reasoning
  • Research into interpretability
  • Building tools that separate thought from output

Example: Using a Thinking Model

import ollama
import re

response = ollama.chat(
    model="qwen3",
    messages=[
        {"role": "user", "content": "What is the capital of Australia?"}
    ]
)

content = response['message']['content']

# Optionally split the inline <think>...</think> block from the final answer
thinking = re.findall(r"<think>(.*?)</think>", content, re.DOTALL)
answer = re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL)

print("🧠 Thought process:\n", thinking[0].strip() if thinking else "N/A")
print("\n✅ Final answer:\n", answer.strip())

When to Use Thinking Models

Use Case | Recommended Model | Why
Interpretability / debugging | qwen3 | View reasoning traces
Performance-sensitive apps | qwen3 (non-thinking mode) | Faster, less verbose
Educational / explanatory | qwen3 | Shows step-by-step logic
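
To get the faster non-thinking behaviour from the table, newer Ollama versions also accept a think flag over the REST API; a hedged sketch, assuming a server that supports it:

import requests

payload = {
    "model": "qwen3",
    "messages": [
        {"role": "user", "content": "What is the capital of Australia?"}
    ],
    "think": False,   # assumption: server version that supports the think flag
    "stream": False
}

reply = requests.post("http://localhost:11434/api/chat", json=payload)
print(reply.json()["message"]["content"])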

✅ Summary

Task | REST API | Python Client
Simple text generation | /api/generate | ollama.generate()
Conversational chat | /api/chat | ollama.chat()
Streaming support | Yes | Yes
Works with thinking models | Yes | Yes

Ollama’s local-first design makes it ideal for secure, offline, or privacy-sensitive AI applications. Whether you’re building an interactive chatbot or a background data enrichment service, you can integrate LLMs seamlessly into your Python workflow — with full control over models, latency, and data.
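
As a closing illustration, here is a minimal sketch of an interactive CLI chatbot built on the Python client (the model name and prompts are just placeholders):

import ollama

history = []  # keep the running conversation so the model has context

while True:
    user_input = input("You: ")
    if user_input.lower() in {"exit", "quit"}:
        break
    history.append({"role": "user", "content": user_input})

    reply = ""
    for chunk in ollama.chat(model="llama3.1", messages=history, stream=True):
        piece = chunk['message']['content']
        print(piece, end="", flush=True)
        reply += piece
    print()

    history.append({"role": "assistant", "content": reply})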