Performance
Hugo Caching Strategies for Performance
Optimize the development and serving of Hugo sites
Hugo caching strategies are essential for maximizing the performance of your static site generator. While Hugo generates static files that are inherently fast, implementing proper caching at multiple layers can dramatically improve build times, reduce server load, and enhance user experience.
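To show what the build-side layer looks like, here is a minimal sketch of Hugo's file-cache settings in hugo.toml; the directories and maxAge values are illustrative assumptions, not tuned recommendations:

```toml
[caches]
  # Cache remote JSON fetched with getJSON / resources.GetRemote
  [caches.getjson]
    dir = ":cacheDir/:project"
    maxAge = "10m"   # re-fetch remote data after 10 minutes
  # Keep processed images between builds
  [caches.images]
    dir = ":resourceDir/_gen"
    maxAge = -1      # never expire
```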
NVIDIA DGX Spark vs Mac Studio vs RTX-4080: Ollama Performance Comparison
GPT-OSS 120b benchmarks on three AI platforms
I dug up some interesting performance tests of GPT-OSS 120b running on Ollama across three different platforms: NVIDIA DGX Spark, Mac Studio, and RTX 4080. The GPT-OSS 120b model from the Ollama library weighs in at 65GB, which means it doesn’t fit into the 16GB VRAM of an RTX 4080 (or the newer RTX 5080).
Ollama GPT-OSS Structured Output Issues
Not very nice.
Ollama’s GPT-OSS models have recurring issues producing structured output, especially when used with frameworks like LangChain, the OpenAI SDK, vLLM, and others.
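As an illustration of the failure surface, here is a minimal sketch of a schema-constrained request against Ollama's /api/chat endpoint; the model tag and schema are assumptions for the example, and it is exactly this kind of call that GPT-OSS models tend to answer with malformed or empty JSON:

```python
import json

import requests

# Ask Ollama to constrain the reply to a JSON schema ("structured output").
schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["city", "population"],
}

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "gpt-oss:20b",  # assumed model tag for illustration
        "messages": [{"role": "user", "content": "Largest city in France, as JSON."}],
        "format": schema,        # schema-constrained output
        "stream": False,
    },
    timeout=120,
)
# With GPT-OSS, this parse step is where things often go wrong.
print(json.loads(resp.json()["message"]["content"]))
```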
Memory allocation and model scheduling in the new Ollama version - v0.12.1
My own test of Ollama model scheduling
Here I compare how much VRAM the new Ollama version allocates for a model versus the previous version. The new version is worse.
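If you want to reproduce this comparison on your own machine, Ollama's /api/ps endpoint reports how much of each loaded model sits in VRAM; a small sketch, assuming a local default install:

```python
import requests

# Load a model first, e.g.:  ollama run gemma3:27b "hi"
resp = requests.get("http://localhost:11434/api/ps", timeout=10)
for m in resp.json().get("models", []):
    total = m["size"]             # total bytes the model occupies
    vram = m.get("size_vram", 0)  # bytes of that resident in VRAM
    print(f"{m['name']}: {vram / 2**30:.1f} GiB of {total / 2**30:.1f} GiB in VRAM")
```

Running the same snippet against both Ollama versions makes the allocation difference directly visible.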
LLM Performance and PCIe Lanes: Key Considerations
Thinking of installing a second GPU for LLMs?
How do PCIe lanes affect LLM performance? It depends on the task: for training and multi-GPU inference, the performance drop is significant.
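Before blaming the lanes, it is worth checking what link each card actually negotiated; a sketch using nvidia-smi's query interface (assumes an NVIDIA GPU with drivers installed):

```python
import subprocess

# Report the negotiated PCIe generation and lane width per GPU;
# an x16 card running at x4 is a common culprit in multi-GPU boxes.
out = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=index,name,pcie.link.gen.current,pcie.link.width.current",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(out.stdout)
```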
Test: How Ollama Uses Intel CPU Performance and Efficient Cores
Ollama on Intel CPU: Efficient vs Performance cores
I’ve got a theory to test: does utilising ALL cores of an Intel CPU raise the speed of LLMs? It bugs me that the new gemma3 27b model (gemma3:27b, 17GB on Ollama) doesn’t fit into the 16GB VRAM of my GPU and partially runs on the CPU.
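One way to run that experiment is Ollama's num_thread option, which caps how many CPU threads the runner uses; the sketch below times the same prompt at a few thread counts (the model tag and the counts to try are my assumptions):

```python
import requests

PROMPT = "Explain CPU cache lines in one paragraph."

# On a hybrid Intel CPU, compare P-cores only vs all P+E cores.
for threads in (8, 16, 24):
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "gemma3:27b",
            "prompt": PROMPT,
            "stream": False,
            "options": {"num_thread": threads},
        },
        timeout=600,
    ).json()
    tps = resp["eval_count"] / resp["eval_duration"] * 1e9  # durations are in ns
    print(f"num_thread={threads}: {tps:.1f} tokens/s")
```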
How Ollama Handles Parallel Requests
Understand Ollama concurrency, queueing, and how to tune OLLAMA_NUM_PARALLEL for stable parallel requests.
This guide explains how Ollama handles parallel requests (concurrency, queuing, and resource limits), and how to tune it using the OLLAMA_NUM_PARALLEL environment variable (and related knobs).
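To get a feel for the behaviour, here is a small sketch that fires eight requests at once; with the server started as OLLAMA_NUM_PARALLEL=4 ollama serve, four run concurrently and the rest wait in Ollama's queue (the model tag is an assumption):

```python
from concurrent.futures import ThreadPoolExecutor
import time

import requests

def ask(i: int) -> str:
    t0 = time.time()
    requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3", "prompt": f"Say hello #{i}", "stream": False},
        timeout=300,
    ).raise_for_status()
    return f"request {i} finished in {time.time() - t0:.1f}s"

# Requests beyond OLLAMA_NUM_PARALLEL queue server-side; watch the timings.
with ThreadPoolExecutor(max_workers=8) as pool:
    for line in pool.map(ask, range(8)):
        print(line)
```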
Mistral Small, Gemma 2, Qwen 2.5, Mistral Nemo, Llama3 and Phi - LLM Test
Next round of LLM tests
Mistral Small was released not long ago. Let’s catch up and test how it performs compared to other LLMs.
Large Language Models Speed Test
Let's test the LLMs' speed on GPU vs CPU
Comparing the prediction speed of several LLMs: llama3 (Meta/Facebook), phi3 (Microsoft), gemma (Google), and mistral (open source), on both CPU and GPU.
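The metadata that /api/generate returns is enough to compute tokens per second for such a comparison; a minimal sketch, where forcing the CPU leg via num_gpu = 0 (no layers offloaded) is my assumption about how to split the runs:

```python
import requests

MODELS = ["llama3", "phi3", "gemma", "mistral"]

def tokens_per_second(model: str, cpu_only: bool) -> float:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": "Write a haiku about benchmarks.",
            "stream": False,
            # num_gpu = 0 keeps every layer on the CPU
            "options": {"num_gpu": 0} if cpu_only else {},
        },
        timeout=600,
    ).json()
    return resp["eval_count"] / resp["eval_duration"] * 1e9  # ns -> tokens/s

for model in MODELS:
    print(f"{model}: GPU {tokens_per_second(model, False):.1f} t/s, "
          f"CPU {tokens_per_second(model, True):.1f} t/s")
```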