Hermes AI Assistant Skills for Real Production Setups
Profile-first Hermes setups for serious workloads
Hermes AI assistant, officially documented as Hermes Agent, is not positioned as a simple chat wrapper.
The skills worth keeping, and the ones to skip
OpenClaw has two extension stories, and they are easy to mix up.
Plugins extend the runtime. Skills extend the agent’s behavior.
Plugins first. Skills naming in brief.
This article is about OpenClaw plugins — native gateway packages that add channels, model providers, tools, speech, memory, media, web search, and other runtime surfaces.
How real OpenClaw systems are actually structured
OpenClaw looks simple in demos. In production, it becomes a system.
Claude subscriptions no longer power agents
The quiet loophole that powered a wave of agent experimentation is now closed.
Self-hosted AI search with local LLMs
Vane is one of the more pragmatic entries in the “AI search with citations” space: a self-hosted answering engine that mixes live web retrieval with local or cloud LLMs, while keeping the whole stack under your control.
Agentic coding, now with local model backends.
Claude Code is not autocomplete with better marketing. It is an agentic coding tool: it reads your codebase, edits files, runs commands, and integrates with your development tools.
Hermes Agent install and quickstart for devs
Hermes Agent is a self-hosted, model-agnostic AI assistant that runs on a local machine or low-cost VPS, works through terminal and messaging interfaces, and improves over time by turning repeated tasks into reusable skills.
Install TGI, ship fast, debug faster
Text Generation Inference (TGI) has a very specific energy. It is not the newest kid on the inference block, but it is the one that has already learned how production breaks.
llama.cpp token speed on 16 GB VRAM (tables).
Here I compare the token speed of several LLMs running on a GPU with 16 GB of VRAM, and pick the best one for self-hosting.
RTX 5090 in AU is scarce and overpriced
Australia has RTX 5090 stock. Barely. And if you find one, you will pay a premium that feels detached from reality.
Remote Ollama access without public ports
Ollama is at its happiest when it is treated like a local daemon: the CLI and your apps talk to a loopback HTTP API, and the rest of the network never finds out it exists.
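That loopback API is plain HTTP, so talking to it needs nothing beyond the standard library. A minimal sketch, assuming a stock install on the default port 11434; the model name `llama3` is a placeholder for whatever `ollama list` shows on your machine:

```python
import json
import urllib.request

# Ollama's default loopback endpoint; adjust if you changed OLLAMA_HOST.
OLLAMA_URL = "http://127.0.0.1:11434/api/generate"

# Request body for /api/generate. With "stream": True the server instead
# returns newline-delimited JSON chunks as tokens are produced.
payload = json.dumps({
    "model": "llama3",          # placeholder model name
    "prompt": "Why is the sky blue?",
    "stream": False,
}).encode("utf-8")

request = urllib.request.Request(
    OLLAMA_URL,
    data=payload,
    headers={"Content-Type": "application/json"},
)

# With the daemon running, this returns a JSON object whose "response"
# field holds the generated text:
# with urllib.request.urlopen(request) as resp:
#     print(json.loads(resp.read())["response"])
```

Because everything binds to loopback by default, remote access is a tunnelling problem (SSH, WireGuard, or similar), not a firewall problem.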
Compose-first Ollama server with GPU and persistence.
Ollama works great on bare metal. It gets even more interesting when you treat it like a service: a stable endpoint, pinned versions, persistent storage, and a GPU that is either available or it is not.
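A minimal Compose sketch of that service shape, assuming the NVIDIA Container Toolkit is installed on the host; the volume name and tag policy are illustrative:

```yaml
services:
  ollama:
    image: ollama/ollama        # pin a concrete tag in production, not :latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_models:/root/.ollama   # persist pulled models across restarts
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

volumes:
  ollama_models:
```

If the GPU reservation cannot be satisfied, the container fails to start rather than silently falling back to CPU, which is usually what you want from a service.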
HTTPS Ollama without breaking streaming responses.
Running Ollama behind a reverse proxy is the simplest way to get HTTPS, optional access control, and predictable streaming behaviour.
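The one thing that does break streaming is response buffering in the proxy. A minimal nginx sketch, with placeholder hostname and certificate paths:

```nginx
# Terminate TLS in nginx and proxy to a local Ollama instance.
server {
    listen 443 ssl;
    server_name ollama.example.com;                      # placeholder
    ssl_certificate     /etc/ssl/certs/ollama.pem;       # placeholder
    ssl_certificate_key /etc/ssl/private/ollama.key;     # placeholder

    location / {
        proxy_pass http://127.0.0.1:11434;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_buffering off;       # critical: deliver streamed tokens as they arrive
        proxy_read_timeout 300s;   # long generations outlive the 60s default
    }
}
```

With `proxy_buffering` left at its default, nginx accumulates the whole response before forwarding it, turning a token stream into one long pause followed by a wall of text.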
RAG embeddings - Python, Ollama, OpenAI APIs.
If you are working through retrieval-augmented generation (RAG), this section walks through text embeddings in plain terms — what they are, how they fit search and retrieval, and how to call two common local setups from Python using Ollama or an OpenAI-compatible HTTP API (as many llama.cpp-based servers expose).
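Whichever backend produces the vectors, the retrieval step is the same: rank stored document embeddings by similarity to the query embedding. A minimal sketch using tiny hand-made stand-in vectors — real ones would come from an embeddings endpoint such as the two described above:

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Tiny stand-in "embeddings" (real ones have hundreds of dimensions).
docs = {
    "doc_cats": [0.9, 0.1, 0.0],
    "doc_dogs": [0.8, 0.2, 0.1],
    "doc_tax":  [0.0, 0.1, 0.9],
}
query = [0.85, 0.15, 0.05]   # pretend embedding of "tell me about pets"

# Retrieval step of RAG: sort documents by similarity to the query vector.
ranked = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
print(ranked[0])  # → doc_cats
```

The top-ranked documents are then pasted into the LLM prompt as context, which is the "augmented" part of retrieval-augmented generation.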
Serve open models fast with SGLang.
SGLang is a high-performance serving framework for large language models and multimodal models, built to deliver low-latency and high-throughput inference across everything from a single GPU to distributed clusters.