LLM Self-Hosting and AI Sovereignty
Control data and models with self-hosted LLMs
Self-hosting LLMs keeps data, models, and inference under your control: a practical path to AI sovereignty for teams, enterprises, and nations.
Self-hosted ChatGPT alternative for local LLMs
Open WebUI is a powerful, extensible, and feature-rich self-hosted web interface for interacting with large language models.
Fast LLM inference with an OpenAI-compatible API
vLLM is a high-throughput, memory-efficient inference and serving engine for Large Language Models (LLMs) developed by UC Berkeley’s Sky Computing Lab.
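Because vLLM serves an OpenAI-compatible API, the stock OpenAI Python client can talk to it directly. A minimal sketch, assuming vLLM's default port 8000 and a placeholder model name (use whatever you passed to `vllm serve`):

```python
# pip install openai -- vLLM speaks the OpenAI wire protocol, so the stock client works.
from openai import OpenAI

# Base URL assumes vLLM's default of http://localhost:8000/v1;
# the API key is unused unless the server was started with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Model name is a placeholder -- match it to the model your server is running.
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize AI sovereignty in one sentence."}],
)
print(response.choices[0].message.content)
```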
Thoughts on LLMs for self-hosted Cognee
Choosing the best LLM for Cognee means balancing graph-building quality, hallucination rates, and hardware constraints. Cognee works best with larger, low-hallucination models (32B+) served via Ollama, but mid-size options suffice for lighter setups.
Master local LLM deployment with 12+ tools compared
Local deployment of LLMs has become increasingly popular as developers and organizations seek enhanced privacy, reduced latency, and greater control over their AI infrastructure.
Configure context sizes in Docker Model Runner with workarounds
Configuring context sizes in Docker Model Runner is more complex than it should be.
Enable GPU acceleration for Docker Model Runner with NVIDIA CUDA support
Docker Model Runner is Docker’s official tool for running AI models locally, but enabling NVIDIA GPU acceleration requires specific configuration.
Quick reference for Docker Model Runner commands
Docker Model Runner (DMR) is Docker’s official solution for running AI models locally, introduced in April 2025. This cheatsheet provides a quick reference for all essential commands, configurations, and best practices.
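Beyond the CLI, DMR exposes an OpenAI-compatible endpoint, so the usual Python client pattern applies. A minimal sketch, assuming host-side TCP access has been enabled on port 12434 and DMR's documented `/engines/v1` path (port, path, and model tag are assumptions that may vary by version):

```python
from openai import OpenAI

# Assumes `docker desktop enable model-runner --tcp 12434` (or the equivalent
# Docker Desktop setting) has been applied; path follows DMR's documented layout.
client = OpenAI(base_url="http://localhost:12434/engines/v1", api_key="none")

response = client.chat.completions.create(
    model="ai/smollm2",  # placeholder -- any model fetched with `docker model pull`
    messages=[{"role": "user", "content": "Hello from Docker Model Runner"}],
)
print(response.choices[0].message.content)
```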
Compare Docker Model Runner and Ollama for local LLM
Running large language models (LLMs) locally has become increasingly popular for privacy, cost control, and offline capabilities. The landscape shifted significantly in April 2025 when Docker introduced Docker Model Runner (DMR), its official solution for AI model deployment.
Integrate Ollama with Go: SDK guide, examples, and production best practices
This guide provides a comprehensive overview of available Go SDKs for Ollama and compares their feature sets.
Connect your Python application to Ollama + specific examples using thinking LLMs
In this post, we’ll explore two ways to connect your Python application to Ollama: via the HTTP REST API, and via the official Ollama Python library.
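As a taste of both approaches, here is a minimal sketch (the model name is a placeholder, and Ollama's default port 11434 is assumed):

```python
# Way 1: raw HTTP REST API (Ollama listens on localhost:11434 by default).
import requests

r = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.1",  # placeholder -- any model you have pulled
        "messages": [{"role": "user", "content": "Why is the sky blue?"}],
        "stream": False,  # return one JSON object instead of a stream
    },
)
print(r.json()["message"]["content"])

# Way 2: the official Ollama Python library (pip install ollama).
import ollama

response = ollama.chat(
    model="llama3.1",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(response["message"]["content"])
```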
My view on the current state of Ollama development
Ollama has quickly become one of the most popular tools for running LLMs locally. Its simple CLI and streamlined model management have made it a go-to option for developers who want to work with AI models outside the cloud.
Quick overview of most prominent UIs for Ollama in 2025
Locally hosted Ollama lets you run large language models on your own machine, but using it from the command line isn’t user-friendly. Here are several open-source projects that provide ChatGPT-style interfaces for a local Ollama instance.
Translation test: qwen3 8b, 14b and 30b, devstral 24b, mistral small 24b
In this test I’m comparing how different LLMs hosted on Ollama translate a Hugo page from English to German.
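A minimal sketch of the kind of harness behind such a test, using the official Ollama Python library (model tags and the sample content are illustrative; use the tags you actually pulled):

```python
import ollama

# Illustrative model tags -- adjust to the variants available on your machine.
MODELS = ["qwen3:8b", "qwen3:14b", "qwen3:30b", "devstral:24b", "mistral-small:24b"]
PAGE = "Self-hosting LLMs keeps data, models, and inference under your control."

for model in MODELS:
    reply = ollama.chat(
        model=model,
        messages=[{
            "role": "user",
            "content": "Translate this Hugo page content from English to German, "
                       f"preserving Markdown and front matter:\n\n{PAGE}",
        }],
    )
    print(f"--- {model} ---\n{reply['message']['content']}\n")
```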
Short list of LLM providers
Using LLMs in the cloud is not very expensive; there may be no need to buy a new high-end GPU. Here is a list of cloud LLM providers and the models they host.
Comparing two deepseek-r1 models to two base ones
DeepSeek’s first generation of reasoning models offers performance comparable to OpenAI-o1 and includes six dense models distilled from DeepSeek-R1, based on Llama and Qwen.
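A sketch of how such a side-by-side run could look with the Ollama Python library (model tags are assumptions; the 8b distill is based on Llama 3.1 8B, so its base model makes a natural counterpart):

```python
import ollama

QUESTION = ("A bat and a ball cost $1.10 together; the bat costs $1.00 more "
            "than the ball. How much does the ball cost?")

# One distilled reasoning model vs. the base model it was distilled from.
for model in ["deepseek-r1:8b", "llama3.1:8b"]:  # illustrative tags
    reply = ollama.chat(model=model, messages=[{"role": "user", "content": QUESTION}])
    print(f"--- {model} ---\n{reply['message']['content']}\n")
```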