Performance
Hugo Caching Strategies for Performance
Optimize the development and serving of Hugo sites
Hugo caching strategies are essential for maximizing the performance of your static site generator. While Hugo generates static files that are inherently fast, implementing proper caching at multiple layers can dramatically improve build times, reduce server load, and enhance user experience.
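To show what the build-side layer looks like, here is a minimal sketch of Hugo's file-cache settings in hugo.toml; the directories and maxAge values are illustrative assumptions, not tuned recommendations:

```toml
[caches]
  # Cache remote JSON fetched with getJSON / resources.GetRemote
  [caches.getjson]
    dir = ":cacheDir/:project"
    maxAge = "10m"   # re-fetch remote data after 10 minutes
  # Keep processed images between builds
  [caches.images]
    dir = ":resourceDir/_gen"
    maxAge = -1      # never expire
```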
NVIDIA DGX Spark vs Mac Studio vs RTX-4080: Ollama Performance Comparison
GPT-OSS 120b benchmarks on three AI platforms
I dug up some interesting performance tests of GPT-OSS 120b running on Ollama across three different platforms: NVIDIA DGX Spark, Mac Studio, and RTX 4080. The GPT-OSS 120b model from the Ollama library weighs in at 65GB, which means it doesn’t fit into the 16GB VRAM of an RTX 4080 (or the newer RTX 5080).
Ollama GPT-OSS Structured Output Issues
Not very nice.
Ollama’s GPT-OSS models have recurring issues producing structured output, especially when used with frameworks like LangChain, the OpenAI SDK, vLLM, and others.
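As an illustration of the failure surface, here is a minimal sketch of a schema-constrained request against Ollama's /api/chat endpoint; the model tag and schema are assumptions for the example, and it is exactly this kind of call that GPT-OSS models tend to answer with malformed or empty JSON:

```python
import json

import requests

# Ask Ollama to constrain the reply to a JSON schema ("structured output").
schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["city", "population"],
}

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "gpt-oss:20b",  # assumed model tag for illustration
        "messages": [{"role": "user", "content": "Largest city in France, as JSON."}],
        "format": schema,        # schema-constrained output
        "stream": False,
    },
    timeout=120,
)
# With GPT-OSS, this parse step is where things often go wrong.
print(json.loads(resp.json()["message"]["content"]))
```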
Memory allocation and model scheduling in the new Ollama version - v0.12.1
My own test of Ollama model scheduling
Here I compare how much VRAM the new Ollama version allocates for a model versus the previous version. The new version is worse.
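If you want to reproduce this comparison on your own machine, Ollama's /api/ps endpoint reports how much of each loaded model sits in VRAM; a small sketch, assuming a local default install:

```python
import requests

# Load a model first, e.g.:  ollama run gemma3:27b "hi"
resp = requests.get("http://localhost:11434/api/ps", timeout=10)
for m in resp.json().get("models", []):
    total = m["size"]             # total bytes the model occupies
    vram = m.get("size_vram", 0)  # bytes of that resident in VRAM
    print(f"{m['name']}: {vram / 2**30:.1f} GiB of {total / 2**30:.1f} GiB in VRAM")
```

Running the same snippet against both Ollama versions makes the allocation difference directly visible.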
LLM Performance and PCIe Lanes: Key Considerations
Thinking of installing a second GPU for LLMs?
How do PCIe lanes affect LLM performance? It depends on the task: for training and multi-GPU inference, the performance drop is significant.
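Before blaming the lanes, it is worth checking what link each card actually negotiated; a sketch using nvidia-smi's query interface (assumes an NVIDIA GPU with drivers installed):

```python
import subprocess

# Report the negotiated PCIe generation and lane width per GPU;
# an x16 card running at x4 is a common culprit in multi-GPU boxes.
out = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=index,name,pcie.link.gen.current,pcie.link.width.current",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(out.stdout)
```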
Test: How Ollama Uses Intel CPU Performance and Efficient Cores
Ollama on Intel CPU: Efficient vs Performance cores
I’ve got a theory to test: does utilising ALL cores of an Intel CPU raise the speed of LLMs? It bugs me that the new gemma3 27b model (gemma3:27b, 17GB on Ollama) doesn’t fit into the 16GB VRAM of my GPU and partially runs on the CPU.
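One way to run that experiment is Ollama's num_thread option, which caps how many CPU threads the runner uses; the sketch below times the same prompt at a few thread counts (the model tag and the counts to try are my assumptions):

```python
import requests

PROMPT = "Explain CPU cache lines in one paragraph."

# On a hybrid Intel CPU, compare P-cores only vs all P+E cores.
for threads in (8, 16, 24):
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "gemma3:27b",
            "prompt": PROMPT,
            "stream": False,
            "options": {"num_thread": threads},
        },
        timeout=600,
    ).json()
    tps = resp["eval_count"] / resp["eval_duration"] * 1e9  # durations are in ns
    print(f"num_thread={threads}: {tps:.1f} tokens/s")
```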
How Ollama Handles Parallel Requests
Understand Ollama concurrency, queueing, and how to tune OLLAMA_NUM_PARALLEL for stable parallel requests.
This guide explains how Ollama handles parallel requests (concurrency, queuing, and resource limits), and how to tune it using the OLLAMA_NUM_PARALLEL environment variable (and related knobs).
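To get a feel for the behaviour, here is a small sketch that fires eight requests at once; with the server started as OLLAMA_NUM_PARALLEL=4 ollama serve, four run concurrently and the rest wait in Ollama's queue (the model tag is an assumption):

```python
from concurrent.futures import ThreadPoolExecutor
import time

import requests

def ask(i: int) -> str:
    t0 = time.time()
    requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3", "prompt": f"Say hello #{i}", "stream": False},
        timeout=300,
    ).raise_for_status()
    return f"request {i} finished in {time.time() - t0:.1f}s"

# Requests beyond OLLAMA_NUM_PARALLEL queue server-side; watch the timings.
with ThreadPoolExecutor(max_workers=8) as pool:
    for line in pool.map(ask, range(8)):
        print(line)
```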
Mistral Small, Gemma 2, Qwen 2.5, Mistral Nemo, Llama3 and Phi - LLM Test
Next round of LLM tests
Mistral Small was released not long ago. Let’s catch up and test how it performs compared to other LLMs.
Large Language Models Speed Test
Let's test the LLMs' speed on GPU vs CPU
Comparing the prediction speed of several LLMs: llama3 (Meta/Facebook), phi3 (Microsoft), gemma (Google), and mistral (open source), on both CPU and GPU.
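The metadata that /api/generate returns is enough to compute tokens per second for such a comparison; a minimal sketch, where forcing the CPU leg via num_gpu = 0 (no layers offloaded) is my assumption about how to split the runs:

```python
import requests

MODELS = ["llama3", "phi3", "gemma", "mistral"]

def tokens_per_second(model: str, cpu_only: bool) -> float:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": "Write a haiku about benchmarks.",
            "stream": False,
            # num_gpu = 0 keeps every layer on the CPU
            "options": {"num_gpu": 0} if cpu_only else {},
        },
        timeout=600,
    ).json()
    return resp["eval_count"] / resp["eval_duration"] * 1e9  # ns -> tokens/s

for model in MODELS:
    print(f"{model}: GPU {tokens_per_second(model, False):.1f} t/s, "
          f"CPU {tokens_per_second(model, True):.1f} t/s")
```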