LLM - Page 4 - Rost Glukhov | Personal site and technical blog

Ollama in Docker Compose with GPU and Persistent Model Storage

Ollama works great on bare metal. It gets even more interesting when you treat it like a service: a stable endpoint, pinned versions, persistent storage, and a GPU that is either available or not.

Ollama behind a reverse proxy with Caddy or Nginx for HTTPS streaming

Running Ollama behind a reverse proxy is the simplest way to get HTTPS, optional access control, and predictable streaming behaviour.

Text embeddings for RAG and search - Python, Ollama, OpenAI-compatible APIs

If you are working through retrieval-augmented generation (RAG), this section walks through text embeddings in plain terms — what they are, how they fit search and retrieval, and how to call two common local setups from Python using Ollama or an OpenAI-compatible HTTP API (as many llama.cpp-based servers expose).

SGLang QuickStart: Install, Configure, and Serve LLMs via OpenAI API

SGLang is a high-performance serving framework for large language models and multimodal models, built to deliver low-latency and high-throughput inference across everything from a single GPU to distributed clusters.

llama-swap Model Switcher Quickstart for OpenAI-Compatible Local LLMs

Soon you are juggling vLLM, llama.cpp, and more, each stack on its own port. Everything downstream still wants one /v1 base URL; otherwise you keep shuffling ports, profiles, and one-off scripts. llama-swap is the /v1 proxy that sits in front of those stacks.

AI Systems: Self-Hosted Assistants, RAG, and Local Infrastructure

Most local AI setups start with a model and a runtime.

Oh My Opencode Review: Honest Results, Billing Risks, and When It's Worth It

Oh My Opencode promises a “virtual AI dev team” — Sisyphus orchestrating specialists, tasks running in parallel, and the magic ultrawork keyword activating all of it.

Best LLMs for OpenCode - From Gemma 4 to Qwen 3.6, Tested Locally

I tested how OpenCode works with several LLMs hosted locally on Ollama and llama.cpp, and added some free models from OpenCode Zen for comparison.

Oh My Opencode Specialised Agents Deep Dive and Model Guide

The biggest capability jump in OpenCode comes from specialised agents: deliberate separation of orchestration, planning, execution, and research.

OpenHands Coding Assistant QuickStart: Install, CLI Flags, Examples

OpenHands is an open-source, model-agnostic platform for AI-driven software development agents. It lets an agent behave more like a coding partner than a simple autocomplete tool.

LocalAI QuickStart: Run OpenAI-Compatible LLMs Locally

LocalAI is a self-hosted, local-first inference server designed to behave like a drop-in OpenAI API for running AI workloads on your own hardware (laptop, workstation, or on-prem server).

Oh My Opencode QuickStart for OpenCode: Install, Configure, Run

Oh My Opencode turns OpenCode into a multi-agent coding harness: an orchestrator delegates work to specialist agents that run in parallel.

llama.cpp Quickstart with CLI and Server

I keep coming back to llama.cpp for local inference: it gives you control that Ollama and others abstract away, and it just works. It is easy to run GGUF models interactively with llama-cli or to expose an OpenAI-compatible HTTP API with llama-server.

AI Developer Tools: The Complete Guide to AI-Powered Development

Artificial Intelligence is reshaping how software is written, reviewed, deployed, and maintained. From AI coding assistants to GitOps automation and DevOps workflows, developers now rely on AI-powered tools across the entire software lifecycle.

OpenCode Quickstart: Install, Configure, and Use the Terminal AI Coding Agent

OpenCode is an open-source AI coding agent you can run in the terminal (TUI + CLI) with optional desktop and IDE surfaces. This is the OpenCode Quickstart: install, verify, connect a model/provider, and run real workflows (CLI + API).

Monitor LLM Inference in Production (2026): Prometheus & Grafana for vLLM, TGI, llama.cpp

LLM inference looks like “just another API” — until latency spikes, queues back up, and your GPUs sit at 95% memory with no obvious explanation.