AI Systems: Self-Hosted Assistants, RAG, and Local Infrastructure

Most local AI setups start with a model and a runtime.

You download a quantized model, launch it through Ollama or another runtime, and begin prompting. For experimentation, this is more than enough. But once you move beyond curiosity — once you care about memory, retrieval quality, routing decisions, or cost awareness — the simplicity starts to show its limits.

This cluster explores a different approach: treating the AI assistant not as a single model invocation, but as a coordinated system.

That distinction may seem subtle at first, but it changes how you think about local AI entirely.

[Image: AI systems orchestration with local LLMs, RAG, and memory layers]


What Is an AI System?

An AI system is more than a model. It is an orchestration layer connecting inference, retrieval, memory, and execution into something that behaves like a coherent assistant.

Running a model locally is infrastructure work. Designing an assistant around that model is systems work.

If you have explored our broader guides on LLM hosting, RAG, performance, observability, and data infrastructure, you already know that inference is only one layer of the stack.

The AI Systems cluster sits on top of those layers. It does not replace them — it combines them.


OpenClaw: A Self-Hosted AI Assistant System

OpenClaw is an open-source, self-hosted AI assistant designed to operate across messaging platforms while running on local infrastructure.

At a practical level, it:

  • Uses local LLM runtimes such as Ollama or vLLM
  • Integrates retrieval over indexed documents
  • Maintains memory beyond a single session
  • Executes tools and automation tasks
  • Can be instrumented and observed
  • Operates within hardware constraints

It is not just a wrapper around a model: it is that orchestration layer in practice, wiring inference, retrieval, memory, and execution together into a coherent assistant.
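
To make the orchestration idea concrete, here is a rough configuration sketch in Python. The keys and values are illustrative placeholders, not OpenClaw's actual configuration schema; they simply name the layers listed above.

    # Hypothetical layout; not OpenClaw's real config schema.
    assistant_config = {
        "runtime":   {"backend": "ollama", "model": "llama3:8b-q4"},     # inference layer
        "retrieval": {"index": "local-docs", "top_k": 4},                # grounding layer
        "memory":    {"store": "sqlite", "summarize_after_tokens": 4000},
        "tools":     ["web_search", "shell"],                            # execution layer
        "metrics":   {"exporter": "prometheus", "track": ["tokens", "latency"]},
    }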

Getting started and architecture:

Context and analysis:

  • OpenClaw rise and fall timeline — the economics behind the viral spike, the April 2026 subscription cutoff, and what the collapse reveals about AI hype cycles

Extending and configuring OpenClaw:

Plugins extend the OpenClaw runtime — adding memory backends, model providers, communication channels, web tools, and observability. Skills extend agent behavior — defining how and when the agent uses those capabilities. Production configuration means combining both, shaped around who is actually using the system.
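
One way to picture that split, in generic terms rather than OpenClaw's actual extension API: a plugin registers a capability, while a skill decides when and how the agent uses it. The names below are hypothetical.

    # Generic illustration of the plugin/skill split; not OpenClaw's API.
    class SearchPlugin:
        """Plugin: provides a capability (here, a stub search backend)."""
        def search(self, query: str) -> list[str]:
            return [f"result for {query!r}"]

    def research_skill(plugins: dict, question: str) -> str:
        """Skill: decides when and how the agent uses registered capabilities."""
        hits = plugins["search"].search(question)
        return f"Answering {question!r} using {len(hits)} retrieved source(s)."

    print(research_skill({"search": SearchPlugin()}, "local LLM routing"))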


Hermes: A Persistent Agent with Skills and Tool Sandboxing

Hermes Agent is a self-hosted, model-agnostic assistant focused on persistent operation: it can run as a long-lived process, execute tools through configurable backends, and improve workflows over time through memory and reusable skills.

At a practical level, Hermes is useful when you want:

  • A terminal-first assistant that can also bridge into messaging apps
  • Provider flexibility through OpenAI-compatible endpoints and model switching
  • Tool execution boundaries via local and sandboxed backends
  • Day-two operations with diagnostics, logs, and config hygiene

Hermes profiles are fully isolated environments — each with its own config, secrets, memories, sessions, skills, and state — making profiles the real unit of production ownership, not the individual skill.
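
A rough way to picture that isolation, assuming a per-profile directory layout (the paths below are illustrative, not Hermes's actual on-disk format):

    # Illustrative per-profile layout; not Hermes's actual on-disk format.
    from pathlib import Path

    def profile_paths(root: str, profile: str) -> dict[str, Path]:
        base = Path(root).expanduser() / profile
        return {
            "config":   base / "config.yaml",
            "secrets":  base / "secrets.env",
            "memory":   base / "memory",
            "sessions": base / "sessions",
            "skills":   base / "skills",
        }

    # Two profiles never share state: swap the profile name, swap everything.
    print(profile_paths("~/.hermes", "work"))
    print(profile_paths("~/.hermes", "personal"))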


Persistent knowledge and memory

Some problems are not solved by a bigger context window alone — they need persistent knowledge (graphs, ingestion pipelines) and agent memory plugins (Honcho, Mem0, Hindsight, and similar backends) wired into assistants such as Hermes or OpenClaw.

  • AI Systems Memory hub — scope of the memory subcluster plus links to Cognee guides and stack context
  • Agent memory providers compared — full comparison of Honcho, OpenViking, Mem0, Hindsight, Holographic, RetainDB, ByteRover, and Supermemory for Hermes-style integrations

What Makes AI Systems Different

Several characteristics make AI systems worth examining more closely.

Model Routing as a Design Choice

Most local setups default to one model. AI systems make model selection an intentional decision.

That introduces questions:

  • Should small requests use smaller models?
  • When does reasoning justify a larger context window?
  • What is the cost difference per 1,000 tokens?

These questions connect directly to performance trade-offs discussed in the LLM performance guide and infrastructure decisions outlined in the LLM hosting guide.

AI systems surface those decisions instead of hiding them.
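
A minimal routing sketch makes the trade-off concrete. The model names, context sizes, and per-token costs below are made-up placeholders; the point is that the routing rule and its cost implications are written down rather than hidden.

    # Sketch of explicit model routing; names, sizes, and costs are placeholders.
    MODELS = {
        "small": {"ctx": 8_192,  "cost_per_1k_tokens": 0.0},   # local, effectively free
        "large": {"ctx": 32_768, "cost_per_1k_tokens": 0.50},  # hosted, billed per token
    }

    def pick_model(prompt_tokens: int, needs_long_context: bool) -> str:
        if needs_long_context or prompt_tokens > MODELS["small"]["ctx"]:
            return "large"
        return "small"

    def estimated_cost(model: str, total_tokens: int) -> float:
        return total_tokens / 1000 * MODELS[model]["cost_per_1k_tokens"]

    model = pick_model(prompt_tokens=1_200, needs_long_context=False)
    print(model, estimated_cost(model, total_tokens=2_000))  # small 0.0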

Retrieval Is Treated as an Evolving Component

AI systems integrate document retrieval, but not as a simplistic “embed and search” step.

They acknowledge:

  • Chunk size affects recall and cost
  • Hybrid search (BM25 + vector) may outperform pure dense retrieval
  • Reranking improves relevance at the cost of latency
  • Indexing strategy impacts memory consumption

These themes align with the deeper architectural considerations discussed in the RAG tutorial.

The difference is that AI systems embed retrieval into a living assistant rather than presenting it as an isolated demo.
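
As a sketch of what that looks like in practice, here is a hybrid scoring step, assuming each chunk already carries a BM25 score and a vector similarity normalized to the same range (the fusion weight is a tunable placeholder, not a recommended value):

    # Hybrid retrieval sketch: fuse lexical (BM25) and dense (vector) scores.
    def hybrid_rank(chunks, alpha=0.5, top_k=4):
        """chunks: dicts with 'text', plus 'bm25' and 'vector' scores in [0, 1]."""
        scored = [
            (alpha * c["bm25"] + (1 - alpha) * c["vector"], c["text"])
            for c in chunks
        ]
        return [text for _, text in sorted(scored, reverse=True)[:top_k]]

    # An optional reranker would re-score these top_k chunks with a slower,
    # more accurate model: better relevance, at the cost of extra latency.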

Memory as Infrastructure

Stateless LLMs forget everything between sessions.

AI systems introduce persistent memory layers. That immediately raises design questions:

  • What should be stored long-term?
  • When should context be summarized?
  • How do you prevent token explosion?
  • How do you index memory efficiently?

Those questions intersect directly with data-layer considerations from the data infrastructure guide. For Hermes Agent specifically — bounded two-file memory, prefix caching, external plugins — start with Hermes Agent Memory System and the cross-framework comparison Agent memory providers compared. The AI Systems Memory hub lists related Cognee and knowledge-layer guides.

Memory stops being a feature and becomes a storage problem.
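
One common answer to the token-explosion question looks like this: keep recent turns verbatim up to a budget, then fold older turns into a running summary. The token estimate and the summarizer below are stand-ins for whatever your stack actually provides.

    # Sketch: bound memory growth by summarizing old turns past a token budget.
    def rough_tokens(text: str) -> int:
        return len(text) // 4                      # crude stand-in for a real tokenizer

    class BoundedMemory:
        def __init__(self, budget_tokens: int = 4_000):
            self.budget = budget_tokens
            self.summary = ""                      # long-term, compressed
            self.turns: list[str] = []             # short-term, verbatim

        def add(self, turn: str, summarize) -> None:
            self.turns.append(turn)
            while sum(rough_tokens(t) for t in self.turns) > self.budget:
                oldest = self.turns.pop(0)
                self.summary = summarize(self.summary + "\n" + oldest)

        def context(self) -> str:
            return (self.summary + "\n" + "\n".join(self.turns)).strip()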

Observability Is Not Optional

Most local AI experiments stop at “it responds.”

AI systems make it possible to observe:

  • Token usage
  • Latency
  • Hardware utilization
  • Throughput patterns

This connects naturally with the monitoring principles described in the observability guide.

If AI runs on hardware, it should be measurable like any other workload.
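
At minimum, that can be a per-request record written next to every response. A sketch, with rough token estimates standing in for a real tokenizer and field names that are purely illustrative:

    # Sketch: record basic per-request metrics alongside every response.
    import json, time

    def timed_generation(generate, prompt: str, log_path: str = "requests.jsonl") -> str:
        start = time.perf_counter()
        reply = generate(prompt)                   # any callable: prompt -> text
        record = {
            "latency_s": round(time.perf_counter() - start, 3),
            "prompt_tokens": len(prompt) // 4,     # crude estimate; swap in a tokenizer
            "completion_tokens": len(reply) // 4,
            "ts": time.time(),
        }
        with open(log_path, "a") as f:
            f.write(json.dumps(record) + "\n")
        return reply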


What It Feels Like to Use

From the outside, an AI system may still look like a chat interface.

Under the surface, more happens.

If you ask it to summarize a technical report stored locally:

  1. It retrieves relevant document segments.
  2. It selects an appropriate model.
  3. It generates a response.
  4. It records token usage and latency.
  5. It updates persistent memory if necessary.

The visible interaction remains simple. The system behavior is layered.

That layered behavior is what differentiates a system from a demo.
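
Those five steps map almost one-to-one onto a small driver function. The sketch below assumes retrieval, routing, generation, metrics, and memory already exist as separate components; every argument is a placeholder callable, not a specific library's API.

    # Sketch: the five-step flow as a single request path.
    def answer(question, retriever, router, models, metrics, memory):
        chunks = retriever(question)               # 1. retrieve relevant document segments
        model_name = router(question, chunks)      # 2. select an appropriate model
        prompt = "\n".join(chunks) + "\n\n" + question
        reply = models[model_name](prompt)         # 3. generate a response
        metrics(model_name, prompt, reply)         # 4. record token usage and latency
        memory(question, reply)                    # 5. update persistent memory if needed
        return reply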


Where AI Systems Fit in the Stack

The AI Systems cluster sits at the intersection of several infrastructure layers:

  • LLM Hosting: The runtime layer where models execute (Ollama, vLLM, llama.cpp)
  • RAG: The retrieval layer that provides context and grounding
  • Performance: The measurement layer that tracks latency and throughput
  • Observability: The monitoring layer that provides metrics and cost tracking
  • Data Infrastructure: The storage layer that handles memory and indexing

Understanding how these layers fit together is useful on paper. Running a system yourself makes the difference concrete.

For a minimal local installation with OpenClaw, see the OpenClaw quickstart guide, which walks through a Docker-based setup using either a local Ollama model or a cloud-based Claude configuration.

If your setup depends on Claude, this policy change for agent tools clarifies why API billing is now required for third-party OpenClaw workflows.

