LLM Performance

Speculative Decoding: 20-50% Faster LLM Inference

A 70B model generates one token per forward pass, and each pass reloads weights from VRAM, computes attention across the context, and synchronizes memory. Between tokens, the GPU sits idle while it waits for sequential dependencies to resolve.

Qwen 3.6 27B and 35B MTP vs Standard on 16GB GPU

I tested speculative decoding (Multi-Token Prediction, MTP) performance in Qwen 3.6 27B and 35B on an RTX 4080 with 16 GB VRAM.

LLM Structured Output Validation in Python That Holds Up

Most LLM “structured output” tutorials are unserious. They teach you to ask for JSON politely and then hope the model behaves. That is not validation. That is optimism with braces.

Agentic LLM Inference Parameters Reference for Qwen 3.6 and Gemma 4

This page is a practical reference for agentic LLM inference tuning (temperature, top_p, top_k, penalties, and how they interact in multi-step and tool-heavy workflows).

16 GB VRAM LLM benchmarks with llama.cpp (speed and context)

Here I compare the speed of several LLMs running on a GPU with 16 GB of VRAM and choose the best one for self-hosting.

LLM Performance in 2026: Benchmarks, Bottlenecks & Optimization

A performance engineering hub for running LLMs efficiently: runtime behavior, bottlenecks, benchmarks, and the real constraints that shape throughput and latency.

Comparing LLMs performance on Ollama on 16GB VRAM GPU

Running large language models locally gives you privacy, offline capability, and zero API costs. This benchmark reveals exactly what one can expect from 14 popular LLMs on Ollama on an RTX 4080.

BAML vs Instructor: Structured LLM Outputs

When working with Large Language Models in production, getting structured, type-safe outputs is critical. Two popular frameworks - BAML and Instructor - take different approaches to solving this problem.

Reduce LLM Costs: Token Optimization Strategies

Token optimization is the critical skill separating cost-effective LLM applications from budget-draining experiments.

NVIDIA DGX Spark vs Mac Studio vs RTX-4080: Ollama Performance Comparison

I dug up some interesting performance tests of GPT-OSS 120b running on Ollama across three different platforms: NVIDIA DGX Spark, Mac Studio, and RTX 4080. The GPT-OSS 120b model from the Ollama library weighs in at 65GB, which means it doesn’t fit into the 16GB VRAM of an RTX 4080 (or the newer RTX 5080).

LLM ASICs and specialized inference chips (why they matter)

The future of AI is not only about smarter models. It is also about silicon that matches how those models are actually served. Specialized hardware for LLM inference is following a path reminiscent of Bitcoin mining’s move from GPUs to purpose-built ASICs, only with harder constraints because models and precision recipes keep evolving.

Comparison: Qwen3:30b vs GPT-OSS:20b

Here is a comparison of Qwen3:30b and GPT-OSS:20b, focusing on instruction following, performance parameters, specs, and speed.

Ollama GPT-OSS Structured Output Issues

Ollama’s GPT-OSS models have recurring issues handling structured output, especially when used with frameworks like LangChain, OpenAI SDK, vLLM, and others.

Structured output comparison across popular LLM providers - OpenAI, Gemini, Anthropic, Mistral and AWS Bedrock

Here’s a side-by-side comparison of structured output support (getting reliable JSON back) across popular LLM providers, plus minimal Python examples.

Constraining LLMs with Structured Output: Ollama, Qwen3 & Python or Go

Large Language Models (LLMs) are powerful, but in production we rarely want free-form paragraphs. Instead, we want predictable data: attributes, facts, or structured objects you can feed into an app. That’s LLM Structured Output.

Memory allocation and model scheduling in Ollama new version - v0.12.1

Here I compare how much VRAM the new version of Ollama allocates for a model against the previous Ollama version. The new version is worse.
