Qwen 3.6 27B and 35B MTP vs Standard on 16GB GPU
MTP vs standard decoding on RTX 4080 — real benchmarks
I tested Speculative decoding (Multi-Token Prediction, MTP) performance in Qwen 3.6 27B and 35B on an RTX 4080 with 16 GB VRAM.
MTP vs standard decoding on RTX 4080 — real benchmarks
I tested Speculative decoding (Multi-Token Prediction, MTP) performance in Qwen 3.6 27B and 35B on an RTX 4080 with 16 GB VRAM.
Stop parsing vibes. Validate contracts.
Most LLM “structured output” tutorials are unserious. They teach you to ask for JSON politely and then hope the model behaves. That is not validation. That is optimism with braces.
Agentic LLM tuning reference
This page is a practical reference for agentic LLM inference tuning (temperature, top_p, top_k, penalties, and how they interact in multi-step and tool-heavy workflows).
llama.cpp token speed on 16 GB VRAM (tables).
Here I am comparing speed of several LLMs running on GPU with 16GB of VRAM, and choosing the best one for self-hosting.
LLM speed test on RTX 4080 with 16GB VRAM
Running large language models locally gives you privacy, offline capability, and zero API costs. This benchmark reveals exactly what one can expect from 14 popular LLMs on Ollama on an RTX 4080.
Type-safe LLM outputs with BAML and Instructor
When working with Large Language Models in production, getting structured, type-safe outputs is critical. Two popular frameworks - BAML and Instructor - take different approaches to solving this problem.
Cut LLM costs by 80% with smart token optimization
Token optimization is the critical skill separating cost-effective LLM applications from budget-draining experiments.
GPT-OSS 120b benchmarks on three AI platforms
I dug up some interesting performance tests of GPT-OSS 120b running on Ollama across three different platforms: NVIDIA DGX Spark, Mac Studio, and RTX 4080. The GPT-OSS 120b model from the Ollama library weighs in at 65GB, which means it doesn’t fit into the 16GB VRAM of an RTX 4080 (or the newer RTX 5080).
ASICs and custom silicon push LLM inference speed and efficiency
The future of AI is not only about smarter models. It is also about silicon that matches how those models are actually served. Specialized hardware for LLM inference is following a path reminiscent of Bitcoin mining’s move from GPUs to purpose-built ASICs, only with harder constraints because models and precision recipes keep evolving.
Comparing Speed, parameters and performance of these two models
Here is a comparison between Qwen3:30b and GPT-OSS:20b focusing on instruction following and performance parameters, specs and speed.
Not very nice.
Ollama’s GPT-OSS models have recurring issues handling structured output, especially when used with frameworks like LangChain, OpenAI SDK, vllm, and others.
Slightly different APIs require special approach.
Here’s a side-by-side support comparison of structured output (getting reliable JSON back) across popular LLM providers, plus minimal Python examples
A couple of ways to get structured output from Ollama
Large Language Models (LLMs) are powerful, but in production we rarely want free-form paragraphs. Instead, we want predictable data: attributes, facts, or structured objects you can feed into an app. That’s LLM Structured Output.
My own test of ollama model scheduling
Here I am comparing how much VRAM new version of Ollama allocating for the model vs previous Ollama version. The new version is worse.
Thinking of installing second gpu for LLMs?
How PCIe Lanes Affect LLM Performance? Depending on the task. For training and multi-gpu inferrence - perdormance drop is significant.