AI Assistant Architecture: LLM, Memory, Tools, Routing, Observability

How serious assistants are actually built.

Page content

A production AI assistant is not “an LLM with a prompt”. It is a system that accepts intent, keeps state, decides when to retrieve or act, and exposes enough runtime detail to debug failures.

That systems-level view is what the AI Systems cluster explores when assistants move beyond a single model invocation.

OpenAI describes agents as applications that plan, call tools, collaborate, and keep enough state for multi-step work, while Anthropic frames the same problem as a managed harness that can run files, commands, web access, and code securely.

The cleanest architecture splits responsibilities into five layers: LLM, Memory, Tooling, Routing, and Observability. That split matches the capabilities exposed by major provider APIs, by MCP, by self-hosted runtimes such as vLLM and llama.cpp, and by real assistant systems such as OpenClaw and Hermes.

illustration in light tones of a layered AI assistant architecture with data flow lines, memory nodes, and servers, no text.

Memory should be treated as more than “longer context”. Retrieval systems turn external knowledge into explicit non-parametric memory — the same design space covered in depth by Retrieval-Augmented Generation (RAG) — and both Anthropic’s context guidance and the “Lost in the Middle” paper warn that merely cramming more tokens into context does not guarantee reliable recall.

Tool use is a contract boundary, not magic. OpenAI function calling, Anthropic tool use, and MCP all rely on the same pattern: the model emits a structured request, some runtime executes it, and the result flows back into the conversation. If that boundary is sloppy, the assistant becomes sloppy.

My bias is simple: start boring. One orchestrator, one durable memory path, one trace per request, and one explicit policy for tool execution. Multi-agent graphs are useful, but only after you can explain your single-agent failure cases without guessing.

What an AI assistant system is

A practical definition is this: an AI assistant system is a runtime that turns user intent into a response or action by combining a model interface, context assembly, tool execution, state management, and telemetry. That is why the useful docs are not just model cards. The useful docs are API references, tool contracts, retrieval guides, routing docs, and tracing docs. OpenAI’s Responses API exposes stateful interactions, built-in tools, and function calling. Anthropic’s Claude API exposes direct Messages access as well as Managed Agents. OpenClaw and Hermes go one step further and show what happens when you put those capabilities behind persistent gateways, channels, sessions, and memory.

In other words, an assistant system has a broader contract than a chat completion. A good internal contract looks something like this:

AssistantRequest  = user intent + identity + session + attachments + policy
AssistantResponse = answer + actions + citations + state changes + trace id

That contract matters because every production disagreement eventually reduces to one of these questions: what context was visible, which tool executed, which model answered, which memory was read or written, and where the trace says the system spent time. OpenTelemetry defines traces as the path of a request through an application, which is exactly the abstraction serious assistants need. LangSmith and OpenLIT then specialise that idea for LLMs, tools, vector stores, and agent workflows.

Core components and interfaces

The component split below is the one I find most durable. It is also the split that lines up best with the official APIs and the open-source runtimes people actually operate.

Layer	Main responsibility	Typical interface	Example technologies
LLM layer	Reason, generate, decide, emit structured calls	Responses API, Messages API, OpenAI-compatible or Anthropic-compatible endpoints	OpenAI, Anthropic, vLLM, llama.cpp, Ollama
Memory layer	Hold session state, durable notes, and searchable knowledge	embeddings, vector search, memory read/write tools, retrieval APIs	OpenAI embeddings and vector stores, Pinecone, Weaviate, pgvector, Milvus, Hermes memory, OpenClaw memory
Tooling layer	Read data and perform actions outside the model	JSON-schema tools, MCP tools, file and web search, native runtime tools	OpenAI function calling, Anthropic tool use, MCP, LangChain tools, LlamaIndex query tools
Routing layer	Choose model, backend, policy, and tenant path	model aliases, failover groups, health checks, budgets, channel bindings	LiteLLM, OpenClaw multi-agent routing, Hermes provider runtime resolution
Observability	Explain what happened and why	traces, spans, logs, metrics, eval runs	OpenTelemetry, LangSmith, OpenLIT

The table above is derived from the official provider interfaces, MCP, vector database docs, and the runtime docs for vLLM, llama.cpp, OpenClaw, and Hermes.

The LLM layer should do three things well: consume a current working context, emit either a final answer or a structured action request, and return enough metadata to support retries and tracing. OpenAI’s Responses API is explicitly designed for stateful interactions plus built-in tools and function calling. Anthropic’s Messages API exposes the same core loop through tool_use blocks and tool_result returns, while Managed Agents gives you a hosted harness if you do not want to build the loop yourself. Self-hosted runtimes such as vLLM and llama.cpp matter because they preserve familiar provider-style interfaces while letting you place inference inside your own environment.

The Memory layer should be split mentally into three buckets: working memory, durable symbolic memory, and searchable semantic memory. OpenAI embeddings return vectors that can be indexed and searched; OpenAI Retrieval and File Search then layer semantic and keyword search on top of vector stores. Pinecone, Weaviate, pgvector, and Milvus represent four common storage shapes: fully managed, open-source vector-native, Postgres-native, and distributed vector database. Hermes and OpenClaw add a useful reminder that not all memory belongs in a vector store: file-backed notes, reviewed promotions, and session-scoped snapshots are often the more honest design. Memory Systems in AI Assistants maps the cross-framework model; Hermes Agent Memory System unpacks bounded core memory and frozen session snapshots in one product.

The Tooling layer is where an assistant stops being a summariser and starts being software. OpenAI function calling treats tools as schema-defined functionality the model may decide to invoke. Anthropic says the same thing more explicitly: tool use is a contract between your application and the model, and the model never executes anything on its own. MCP generalises that contract into a client-server protocol where hosts connect to one or more servers that expose tools, prompts, and resources — the same boundary described step by step in MCP Server in Go. LangChain and LlamaIndex sit comfortably here as orchestration libraries: LangChain focuses on a prebuilt agent architecture and integrations, while LlamaIndex focuses on context-augmented data access, query engines, and workflows.

The Routing layer exists because “which model?” is never the only question. You also need “which provider path, which tenant, which budget, which latency class, and which fallback?”. LiteLLM is useful because its official docs are refreshingly concrete: weighted pick, least-busy, latency-based, cost-based routing, and bounded failovers are all first-class patterns. OpenClaw extends routing upward into channel and agent isolation, while Hermes extends it downward into model slots for main and auxiliary work such as summarisation, context compression, and MCP tool routing. That is the right mental model: the router chooses more than a model, it chooses an execution lane.

The Observability layer is what prevents architecture from turning into folklore. OpenTelemetry gives you the trace abstraction. LangSmith gives you end-to-end visibility over LLM application steps and supports cloud, hybrid, and self-hosted deployment shapes. OpenLIT gives you OpenTelemetry-native AI observability with zero-code and manual instrumentation options, including support for LLMs, agent frameworks, vector databases, and GPUs. For production metrics, traces, and SLO patterns across inference and agent workflows, see Observability for LLM Systems. If your assistant has no trace per request, no span per model call, and no event history for tool execution, you do not really have an architecture yet. You have vibes.

Capture, enrich, respond

The sequence that keeps showing up in real systems is capture -> enrich -> respond -> record. Different frameworks wrap it differently, but the flow is stable enough to treat as the backbone.

sequenceDiagram participant U as User or Channel participant G as Gateway or UI participant R as Router participant M as Memory and Retrieval participant L as LLM participant T as Tools or MCP participant O as Observability U->>G: message, file, or command G->>O: start root trace G->>R: request + identity + session + policy R->>M: load session state and retrieve context M-->>R: notes, chunks, metadata R->>L: prompt + context + tool schemas L-->>R: answer or tool call alt tool call R->>T: execute tool or MCP action T-->>R: tool result R->>L: tool result + updated context L-->>R: final answer end R->>M: persist session changes and memory candidates R->>O: spans, metrics, eval events G-->>U: response

The capture step is usually more important than it looks. OpenClaw and Hermes both put a persistent gateway in front of the assistant because ingress is not just text entry. It includes channel metadata, identities, authorisation, session boundaries, direct messages, groups, cron ticks, and delivery semantics. If you skip that layer and rely on a raw chat widget abstraction, you will eventually bolt it back on as ad hoc middleware anyway.

The enrich step is where mature systems diverge from toy demos. OpenAI Retrieval and File Search make retrieval explicit through vector stores and search calls. LlamaIndex formalises the same pattern through data connectors, indexes, query engines, and workflows. Hermes goes further by splitting the model estate into main and auxiliary slots, offloading work such as compression, summarisation, and routing to smaller or more specialised models. That is a design pattern worth stealing: do not spend your most expensive model tokens on chores.

The respond step is not “generate text”. It is “close the current loop”. If the model can answer directly, it does. If it needs a tool, it emits a structured request. Anthropic’s tool-use contract and OpenAI’s function-calling guide both make this explicit. The reason this matters architecturally is that outputs now include both language and control flow. Your response object is partly prose and partly runtime plan.

The record step is where consistency semantics show up. Pinecone separates write and read paths and processes writes after durable acknowledgement. Hermes memory is injected as a frozen snapshot per session so it can preserve prefix-cache performance, which means new writes do not automatically appear in the current session prompt. OpenClaw’s Dreaming system only promotes reviewed, grounded candidates into MEMORY.md, and it is opt-in rather than always-on. The practical lesson is that memory is rarely truly read-after-write across every layer. You need to design for staged visibility.

OpenClaw and Hermes as reference systems

OpenClaw and Hermes are useful reference cases because they are not just wrappers around one provider API. Both present an assistant as a long-running system with gateways, sessions, tools, memory, and multiple model backends.

Architectural concern	OpenClaw mapping	Hermes mapping
Ingress and surfaces	Self-hosted gateway connecting chat apps and channel surfaces	Single background messaging gateway connecting many external platforms
Orchestration	Gateway-centric control plane for channels and AI interactions	`AIAgent` loop handling prompt assembly, provider selection, tool dispatch, retries, and failover
Routing	Multi-agent routing binds inbound traffic to isolated agents with separate workspaces and sessions	Main and auxiliary model slots split core reasoning from compression, summarisation, approvals, and MCP routing
Memory	File-backed memory plus optional active memory and background Dreaming promotion	`MEMORY.md` and `USER.md` injected as a frozen session snapshot, plus external memory providers
Tooling and extension	Built-in tools, session tools, provider plugins, custom and self-hosted endpoints	40+ tools, built-in MCP client, toolsets, skills, and memory-provider plugins

This mapping is grounded in the official OpenClaw and Hermes docs and repos. OpenClaw documents a gateway architecture, multi-agent routing, custom and self-hosted provider support including vLLM and Ollama, optional active memory, and Dreaming-based promotion. Hermes documents a messaging gateway, a central AIAgent loop, main and auxiliary model slots, built-in memory, and native MCP integration.

My slightly opinionated read is that both systems make the same architectural argument in different accents. OpenClaw is strongly gateway-first. Hermes is strongly agent-loop-first. But both reject the shallow idea that an assistant is just “prompt plus model”. They model channels, identities, memory semantics, tool surfaces, and backend heterogeneity as first-class concerns. That is exactly what a production architecture should do.

A practical hybrid stack inspired by both systems looks like this:

edge:
  gateway: hermes or openclaw

routing:
  proxy: litellm
  policy: latency and budget aware
  tenancy: session and channel scoped

llm:
  primary: openai responses or anthropic messages
  local_fallback: vllm
  local_dev: ollama or llama.cpp

memory:
  session: sqlite or postgres
  semantic: pgvector or weaviate
  embeddings: openai embeddings or ollama embeddings

tools:
  contract: json schema tools plus mcp
  examples: filesystem, browser, web search, internal APIs

observability:
  traces: opentelemetry
  ai_dashboards: openlit or langsmith
  evals: openai evals plus app-specific regression sets

That stack is a reasoned deployment pattern rather than a vendor-prescribed blueprint. It works because the official interfaces line up: OpenAI and Anthropic expose tool-oriented APIs, vLLM and llama.cpp emulate provider-style endpoints, Ollama handles local models and embeddings, MCP standardises external tools, LiteLLM handles routing and failover, and OpenTelemetry-compatible platforms can trace the whole path.

Patterns, tables, and tradeoffs

There are a few repeatable assistant patterns worth naming. A managed assistant keeps most of the runtime inside provider APIs. A retrieval-first assistant treats memory and search as the main differentiator. A tool-first assistant behaves more like an operator than a chatbot. A gateway assistant prioritises always-on access through messaging surfaces. A specialist mesh decomposes work into multiple agents or routes. Official docs across OpenAI, Anthropic, LlamaIndex, LiteLLM, OpenClaw, and Hermes all support versions of these patterns, even if they name them differently.

Pattern	What it optimises for	Best use case	Hidden cost
Managed assistant	Speed of delivery	Internal copilots and support bots	Provider lock-in and less control over runtime details
Retrieval-first assistant	Grounded answers over owned data	Docs, support, knowledge work	Retrieval quality becomes the real product
Tool-first assistant	Action over conversation	Ops workflows, data pulls, automations	Side effects, retries, and approvals become core concerns
Gateway assistant	Ubiquitous access	Personal and team assistants across chat surfaces	Identity, session, and security complexity
Specialist mesh	Division of labour	Complex workflows with real ownership boundaries	Harder debugging, orchestration, and eval design

The specialist mesh pattern grows into a distinct engineering discipline as agent count rises. For the six canonical coordination patterns — orchestrator-worker, sequential pipeline, fan-out, hierarchical, swarm, and mesh — with specific failure modes and a production decision framework, see Multi-Agent Orchestration Patterns.

This pattern table is a synthesis from the provider docs, framework docs, and reference systems rather than a claim from any one vendor.

Option shape	Typical components	Strength	Weakness
Managed	OpenAI Responses or Anthropic Managed Agents, hosted file search or vector stores	Fastest path, fewer moving parts, hosted tools	Lowest control over data path and runtime semantics
Hybrid	Provider API plus self-hosted router and vector store	Good balance of speed and control	More contracts to maintain
Self-hosted	vLLM or llama.cpp or Ollama, MCP, self-hosted vector DB, OTel	Strong privacy and deployment control	Highest ops burden, hardware and tuning overhead

Table notes: OpenAI hosted File Search is a managed tool, Anthropic offers a managed harness, Pinecone is a managed vector service, while vLLM, llama.cpp, Ollama, pgvector, Weaviate, Milvus, LangSmith self-hosted, and OpenLIT all support self-managed or hybrid operation to varying degrees.

Vector store	Shape	Why teams choose it	Watchout
Pinecone	Managed vector service	Strong operational simplicity and scalable managed architecture	External dependency and managed-service economics
Weaviate	Open-source vector database	Vector plus inverted indexes and flexible index choices	More cluster tuning than a hosted-only path
pgvector	Postgres extension	Keep vectors with relational data and existing SQL stack	Not the best fit for every high-scale ANN workload
Milvus	Distributed vector database	Purpose-built scale and ecosystem around managed Zilliz Cloud	Another specialist datastore to operate

Table notes: Pinecone documents a managed control plane and regional data planes. Weaviate documents vector and inverted indexes with multiple vector index types. pgvector adds exact and approximate nearest-neighbour search to Postgres. Milvus positions itself as an open-source high-performance, scalable vector database, with Zilliz Cloud as the managed option.

LLM option	Interface style	Best at	Watchout
OpenAI Responses	Stateful responses plus built-in tools	Fast start, hosted tools, structured loops	You inherit platform-specific abstractions
Anthropic Messages	Direct model access with explicit tool-use contract	Clear tool boundaries and good control in custom loops	More runtime is your responsibility unless you use Managed Agents
vLLM	OpenAI-compatible and Anthropic-compatible self-hosted serving	High-throughput self-hosted inference	Real infrastructure and model-serving work
Ollama	Simple local model and embedding runtime	Local development and small self-hosted stacks	Not the same class of serving system as a tuned distributed runtime
llama.cpp	Lightweight local server with provider-compatible routes	Edge, CPU-first, constrained environments	You do more manual tuning and capability matching

Table notes: OpenAI documents Responses as its advanced interface for stateful responses and built-in tools. Anthropic documents the Messages API and the tool-use contract separately from Managed Agents. vLLM exposes an OpenAI-compatible server plus Anthropic Messages API support. Ollama documents local embedding and model workflows. llama.cpp documents OpenAI-compatible chat, responses, and embeddings routes, plus Anthropic-compatible chat completions.

Constraint or tradeoff	Bias toward managed	Bias toward self-hosted	Practical mitigation
Latency	Often better first iteration and fewer local tuning tasks	Can win when model and data are colocated and kept warm	Use routing tiers, hot caches, and smaller auxiliary models
Cost	Easy to start, variable at token scale	Better amortisation at steady utilisation	Measure real traffic before optimising by instinct
Privacy and residency	Simpler for non-sensitive data	Stronger control for sensitive and regulated flows	Use hybrid boundaries and keep only what must move
Consistency	Hosted tools still have staged visibility semantics	Self-hosted memory pipelines also stage and promote data	Define read-after-write rules explicitly by layer
Scaling	Less control-plane pain	Better tailoring for steady, specialised workloads	Use batching, queueing, and isolated tenants
Debuggability	Easy to miss opaque provider internals	Easy to drown in self-made complexity	Trace every request and evaluate every route

This tradeoff matrix is an architectural inference from the official docs, not a vendor benchmark. The consistency row matters more than many blog posts admit: Pinecone separates write and read paths, Hermes freezes memory into session-start prompts, and OpenClaw promotes durable memory through staged review. That means “memory updated” and “memory visible to the current answer” are often different truths.

Failure modes and mitigations

Most assistants do not fail because the base model is “bad”. They fail because the surrounding system lies to the model, starves it of the right context, lets tools drift, or makes debugging impossible.

Where it breaks	Typical symptom	Usual cause	Mitigation
Prompt assembly	Confident but off-target answer	Too much irrelevant context, poor ordering	Budget context, rerank, keep key facts near the top
Retrieval	Correct tone, wrong facts	Bad chunking, stale index, weak filters	Evaluate retrieval separately, add metadata filters and hybrid search
Tool boundary	Wrong action or duplicate action	Loose schemas, retries without idempotency	Tight schemas, idempotency keys, approval gates
Routing	Wildly inconsistent behaviour by request	Cost or latency routing without quality controls	Add sticky sessions and per-route evals
Memory	Stale or poisoned recall	Over-eager writes, weak review, cross-session leakage	Separate working and durable memory, review promotions
Observability	No idea what happened	Missing traces or no span granularity	Emit root and subspans for retrieval, model, and tool calls
Hallucination control	Plausible but unsupported claims	Weak grounding or no validation pass	Reference-doc validation, self-consistency checks, eval gates

The evidence base for this table is broad but consistent. Anthropic’s tool docs make clear that tool use is a contract boundary. OpenAI Guardrails includes hallucination detection against a reference knowledge base via File Search. SelfCheckGPT shows that self-consistency across samples can help detect unsupported claims. The “Lost in the Middle” results and Anthropic’s context guidance both reinforce the same operational lesson: more tokens do not remove the need for context curation.

Preferred mitigation stack could be boring and repetitive: trace every request, version prompts, evaluate retrieval independently, keep tools idempotent, and run regression evals before you change routes or memory policy. OpenAI’s Evals docs and repo are blunt about why: without evals, it is hard and time-consuming to understand how model or prompt changes affect your use case. That applies just as much to routers and retrieval as it does to prompts.