Hermes AI Assistant Skills for Real Production Setups
Profile-first Hermes setups for serious workloads
Hermes AI assistant, officially documented as Hermes Agent, is not positioned as a simple chat wrapper.
The skills worth keeping, and the ones to skip
OpenClaw has two extension stories, and they are easy to mix up.
Plugins extend the runtime. Skills extend the agent’s behavior.
Plugins first. Skills naming in brief.
This article is about OpenClaw plugins — native gateway packages that add channels, model providers, tools, speech, memory, media, web search, and other runtime surfaces.
How real OpenClaw systems are actually structured
OpenClaw looks simple in demos. In production, it becomes a system.
Claude subscriptions no longer power agents
The quiet loophole that powered a wave of agent experimentation is now closed.
Self-hosted AI search with local LLMs
Vane is one of the more pragmatic entries in the “AI search with citations” space: a self-hosted answering engine that mixes live web retrieval with local or cloud LLMs, while keeping the whole stack under your control.
Agentic coding, now with local model backends.
Claude Code is not autocomplete with better marketing. It is an agentic coding tool: it reads your codebase, edits files, runs commands, and integrates with your development tools.
Hermes Agent install and quickstart for devs
Hermes Agent is a self-hosted, model-agnostic AI assistant that runs on a local machine or low-cost VPS, works through terminal and messaging interfaces, and improves over time by turning repeated tasks into reusable skills.
Install TGI, ship fast, debug faster
Text Generation Inference (TGI) has a very specific energy. It is not the newest kid on the inference block, but it is the one that has already learned how production breaks.
llama.cpp token speed on 16 GB VRAM (tables).
Here I compare the token speed of several LLMs running on a GPU with 16 GB of VRAM, and pick the best one for self-hosting.
RTX 5090 in AU is scarce and overpriced
Australia has RTX 5090 stock. Barely. And if you find one, you will pay a premium that feels detached from reality.
Remote Ollama access without public ports
Ollama is at its happiest when it is treated like a local daemon: the CLI and your apps talk to a loopback HTTP API, and the rest of the network never finds out it exists.
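That loopback API is plain HTTP, so talking to it needs nothing beyond the standard library. A minimal sketch, assuming a stock install on the default port 11434; the model name `llama3` is a placeholder for whatever `ollama list` shows on your machine:

```python
import json
import urllib.request

# Ollama's default loopback endpoint; adjust if you changed OLLAMA_HOST.
OLLAMA_URL = "http://127.0.0.1:11434/api/generate"

# Request body for /api/generate. With "stream": True the server instead
# returns newline-delimited JSON chunks as tokens are produced.
payload = json.dumps({
    "model": "llama3",          # placeholder model name
    "prompt": "Why is the sky blue?",
    "stream": False,
}).encode("utf-8")

request = urllib.request.Request(
    OLLAMA_URL,
    data=payload,
    headers={"Content-Type": "application/json"},
)

# With the daemon running, this returns a JSON object whose "response"
# field holds the generated text:
# with urllib.request.urlopen(request) as resp:
#     print(json.loads(resp.read())["response"])
```

Because everything binds to loopback by default, remote access is a tunnelling problem (SSH, WireGuard, or similar), not a firewall problem.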
Compose-first Ollama server with GPU and persistence.
Ollama works great on bare metal. It gets even more interesting when you treat it like a service: a stable endpoint, pinned versions, persistent storage, and a GPU that is either available or it is not.
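A minimal Compose sketch of that service shape, assuming the NVIDIA Container Toolkit is installed on the host; the volume name and tag policy are illustrative:

```yaml
services:
  ollama:
    image: ollama/ollama        # pin a concrete tag in production, not :latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_models:/root/.ollama   # persist pulled models across restarts
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

volumes:
  ollama_models:
```

If the GPU reservation cannot be satisfied, the container fails to start rather than silently falling back to CPU, which is usually what you want from a service.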
HTTPS Ollama without breaking streaming responses.
Running Ollama behind a reverse proxy is the simplest way to get HTTPS, optional access control, and predictable streaming behaviour.
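The one thing that does break streaming is response buffering in the proxy. A minimal nginx sketch, with placeholder hostname and certificate paths:

```nginx
# Terminate TLS in nginx and proxy to a local Ollama instance.
server {
    listen 443 ssl;
    server_name ollama.example.com;                      # placeholder
    ssl_certificate     /etc/ssl/certs/ollama.pem;       # placeholder
    ssl_certificate_key /etc/ssl/private/ollama.key;     # placeholder

    location / {
        proxy_pass http://127.0.0.1:11434;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_buffering off;       # critical: deliver streamed tokens as they arrive
        proxy_read_timeout 300s;   # long generations outlive the 60s default
    }
}
```

With `proxy_buffering` left at its default, nginx accumulates the whole response before forwarding it, turning a token stream into one long pause followed by a wall of text.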
RAG embeddings - Python, Ollama, OpenAI APIs.
If you are working through retrieval-augmented generation (RAG), this section walks through text embeddings in plain terms — what they are, how they fit search and retrieval, and how to call two common local setups from Python using Ollama or an OpenAI-compatible HTTP API (as many llama.cpp-based servers expose).
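Whichever backend produces the vectors, the retrieval step is the same: rank stored document embeddings by similarity to the query embedding. A minimal sketch using tiny hand-made stand-in vectors — real ones would come from an embeddings endpoint such as the two described above:

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Tiny stand-in "embeddings" (real ones have hundreds of dimensions).
docs = {
    "doc_cats": [0.9, 0.1, 0.0],
    "doc_dogs": [0.8, 0.2, 0.1],
    "doc_tax":  [0.0, 0.1, 0.9],
}
query = [0.85, 0.15, 0.05]   # pretend embedding of "tell me about pets"

# Retrieval step of RAG: sort documents by similarity to the query vector.
ranked = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
print(ranked[0])  # → doc_cats
```

The top-ranked documents are then pasted into the LLM prompt as context, which is the "augmented" part of retrieval-augmented generation.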
Serve open models fast with SGLang.
SGLang is a high-performance serving framework for large language models and multimodal models, built to deliver low-latency and high-throughput inference across everything from a single GPU to distributed clusters.