Chunking is the most underestimated hyperparameter in Retrieval-Augmented Generation (RAG):
it silently determines what your LLM “sees”,
how expensive ingestion becomes,
and how much of the LLM’s context window you burn per answer.
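To make that trade-off concrete, here is a minimal sketch of the most common baseline, fixed-size chunking with overlap; the function name and the chunk-size and overlap values are illustrative choices, not recommendations from any benchmark:

```python
# Minimal sketch: fixed-size character chunking with overlap.
# chunk_size and overlap are illustrative defaults, not tuned values.

def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-size windows."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "word " * 200  # stand-in for a real document (1000 characters)
chunks = chunk_text(doc)
print(f"{len(chunks)} chunks, first chunk is {len(chunks[0])} characters")
```

Every knob here moves all three costs at once: smaller chunks raise ingestion and retrieval counts, larger chunks burn more context per retrieved hit.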
Running large language models locally gives you privacy, offline capability, and zero API costs.
This benchmark shows exactly what you can expect from nine popular
LLMs running on Ollama on an RTX 4080.
After automatically installing a new kernel, Ubuntu 24.04 lost its Ethernet connection. This frustrating issue has now hit me twice, so I’m documenting the solution here to help others facing the same problem.
Deploy enterprise AI on budget hardware with open models
The democratization of AI is here.
With open-source LLMs like Llama 3, Mixtral, and Qwen now rivaling proprietary models, teams can build powerful AI infrastructure on consumer hardware, slashing costs while keeping complete control over data privacy and deployment.
I dug up some interesting performance tests of GPT-OSS 120b running on Ollama across three different platforms: NVIDIA DGX Spark, Mac Studio, and RTX 4080. The GPT-OSS 120b model from the Ollama library weighs in at 65GB, which means it doesn’t fit into the 16GB VRAM of an RTX 4080 (or the newer RTX 5080).
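The VRAM mismatch can be sanity-checked with back-of-envelope arithmetic; this sketch ignores KV cache, activations, and runtime overhead, so the GPU-resident fraction is an upper bound, not how Ollama actually partitions memory:

```python
# Back-of-envelope check: can a 65 GB model fit in 16 GB of VRAM?
# Simplification: ignores KV cache, activations, and runtime overhead.

model_gb = 65.0   # GPT-OSS 120b download size from the Ollama library
vram_gb = 16.0    # RTX 4080 VRAM

fits_entirely = model_gb <= vram_gb
gpu_fraction = min(vram_gb / model_gb, 1.0)  # at most this share of weights on GPU

print(f"Fits entirely in VRAM: {fits_entirely}")
print(f"GPU-resident share of weights: at most ~{gpu_fraction:.0%}")
```

With roughly three quarters of the weights forced into system RAM, throughput is dominated by CPU offload rather than the GPU itself, which is exactly what the cross-platform comparison exposes.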
Docker Model Runner (DMR) is Docker’s official solution for running AI models locally, introduced in April 2025. This cheatsheet provides a quick reference for all essential commands, configurations, and best practices.