AI Assistant Architecture: LLM, Memory, Tools, Routing, Observability
How serious assistants are actually built.
A production AI assistant is not “an LLM with a prompt”. It is a system that accepts intent, keeps state, decides when to retrieve or act, and exposes enough runtime detail to debug failures.
That systems-level view is what the AI Systems cluster explores when assistants move beyond a single model invocation.
OpenAI describes agents as applications that plan, call tools, collaborate, and keep enough state for multi-step work, while Anthropic frames the same problem as a managed harness that can run files, commands, web access, and code securely.
The cleanest architecture splits responsibilities into five layers: LLM, Memory, Tooling, Routing, and Observability. That split matches the capabilities exposed by major provider APIs, by MCP, by self-hosted runtimes such as vLLM and llama.cpp, and by real assistant systems such as OpenClaw and Hermes.

Memory should be treated as more than “longer context”. Retrieval systems turn external knowledge into explicit non-parametric memory — the same design space covered in depth by Retrieval-Augmented Generation (RAG) — and both Anthropic’s context guidance and the “Lost in the Middle” paper warn that merely cramming more tokens into context does not guarantee reliable recall.
Tool use is a contract boundary, not magic. OpenAI function calling, Anthropic tool use, and MCP all rely on the same pattern: the model emits a structured request, some runtime executes it, and the result flows back into the conversation. If that boundary is sloppy, the assistant becomes sloppy.
My bias is simple: start boring. One orchestrator, one durable memory path, one trace per request, and one explicit policy for tool execution. Multi-agent graphs are useful, but only after you can explain your single-agent failure cases without guessing.
What an AI assistant system is
A practical definition is this: an AI assistant system is a runtime that turns user intent into a response or action by combining a model interface, context assembly, tool execution, state management, and telemetry. That is why the useful docs are not just model cards. The useful docs are API references, tool contracts, retrieval guides, routing docs, and tracing docs. OpenAI’s Responses API exposes stateful interactions, built-in tools, and function calling. Anthropic’s Claude API exposes direct Messages access as well as Managed Agents. OpenClaw and Hermes go one step further and show what happens when you put those capabilities behind persistent gateways, channels, sessions, and memory.
In other words, an assistant system has a broader contract than a chat completion. A good internal contract looks something like this:
AssistantRequest = user intent + identity + session + attachments + policy
AssistantResponse = answer + actions + citations + state changes + trace id
That contract matters because every production disagreement eventually reduces to one of these questions: what context was visible, which tool executed, which model answered, which memory was read or written, and where the trace says the system spent time. OpenTelemetry defines traces as the path of a request through an application, which is exactly the abstraction serious assistants need. LangSmith and OpenLIT then specialise that idea for LLMs, tools, vector stores, and agent workflows.
Core components and interfaces
The component split below is the one I find most durable. It is also the split that lines up best with the official APIs and the open-source runtimes people actually operate.
| Layer | Main responsibility | Typical interface | Example technologies |
|---|---|---|---|
| LLM layer | Reason, generate, decide, emit structured calls | Responses API, Messages API, OpenAI-compatible or Anthropic-compatible endpoints | OpenAI, Anthropic, vLLM, llama.cpp, Ollama |
| Memory layer | Hold session state, durable notes, and searchable knowledge | embeddings, vector search, memory read/write tools, retrieval APIs | OpenAI embeddings and vector stores, Pinecone, Weaviate, pgvector, Milvus, Hermes memory, OpenClaw memory |
| Tooling layer | Read data and perform actions outside the model | JSON-schema tools, MCP tools, file and web search, native runtime tools | OpenAI function calling, Anthropic tool use, MCP, LangChain tools, LlamaIndex query tools |
| Routing layer | Choose model, backend, policy, and tenant path | model aliases, failover groups, health checks, budgets, channel bindings | LiteLLM, OpenClaw multi-agent routing, Hermes provider runtime resolution |
| Observability | Explain what happened and why | traces, spans, logs, metrics, eval runs | OpenTelemetry, LangSmith, OpenLIT |
The table above is derived from the official provider interfaces, MCP, vector database docs, and the runtime docs for vLLM, llama.cpp, OpenClaw, and Hermes.
The LLM layer should do three things well: consume a current working context, emit either a final answer or a structured action request, and return enough metadata to support retries and tracing. OpenAI’s Responses API is explicitly designed for stateful interactions plus built-in tools and function calling. Anthropic’s Messages API exposes the same core loop through tool_use blocks and tool_result returns, while Managed Agents gives you a hosted harness if you do not want to build the loop yourself. Self-hosted runtimes such as vLLM and llama.cpp matter because they preserve familiar provider-style interfaces while letting you place inference inside your own environment.
The Memory layer should be split mentally into three buckets: working memory, durable symbolic memory, and searchable semantic memory. OpenAI embeddings return vectors that can be indexed and searched; OpenAI Retrieval and File Search then layer semantic and keyword search on top of vector stores. Pinecone, Weaviate, pgvector, and Milvus represent four common storage shapes: fully managed, open-source vector-native, Postgres-native, and distributed vector database. Hermes and OpenClaw add a useful reminder that not all memory belongs in a vector store: file-backed notes, reviewed promotions, and session-scoped snapshots are often the more honest design — patterns unpacked in Hermes Agent Memory System for bounded core memory and frozen session snapshots.
The Tooling layer is where an assistant stops being a summariser and starts being software. OpenAI function calling treats tools as schema-defined functionality the model may decide to invoke. Anthropic says the same thing more explicitly: tool use is a contract between your application and the model, and the model never executes anything on its own. MCP generalises that contract into a client-server protocol where hosts connect to one or more servers that expose tools, prompts, and resources — the same boundary described step by step in MCP Server in Go. LangChain and LlamaIndex sit comfortably here as orchestration libraries: LangChain focuses on a prebuilt agent architecture and integrations, while LlamaIndex focuses on context-augmented data access, query engines, and workflows.
The Routing layer exists because “which model?” is never the only question. You also need “which provider path, which tenant, which budget, which latency class, and which fallback?”. LiteLLM is useful because its official docs are refreshingly concrete: weighted pick, least-busy, latency-based, cost-based routing, and bounded failovers are all first-class patterns. OpenClaw extends routing upward into channel and agent isolation, while Hermes extends it downward into model slots for main and auxiliary work such as summarisation, context compression, and MCP tool routing. That is the right mental model: the router chooses more than a model, it chooses an execution lane.
The Observability layer is what prevents architecture from turning into folklore. OpenTelemetry gives you the trace abstraction. LangSmith gives you end-to-end visibility over LLM application steps and supports cloud, hybrid, and self-hosted deployment shapes. OpenLIT gives you OpenTelemetry-native AI observability with zero-code and manual instrumentation options, including support for LLMs, agent frameworks, vector databases, and GPUs. For production metrics, traces, and SLO patterns across inference and agent workflows, see Observability for LLM Systems. If your assistant has no trace per request, no span per model call, and no event history for tool execution, you do not really have an architecture yet. You have vibes.
Capture, enrich, respond
The sequence that keeps showing up in real systems is capture -> enrich -> respond -> record. Different frameworks wrap it differently, but the flow is stable enough to treat as the backbone.
sequenceDiagram
participant U as User or Channel
participant G as Gateway or UI
participant R as Router
participant M as Memory and Retrieval
participant L as LLM
participant T as Tools or MCP
participant O as Observability
U->>G: message, file, or command
G->>O: start root trace
G->>R: request + identity + session + policy
R->>M: load session state and retrieve context
M-->>R: notes, chunks, metadata
R->>L: prompt + context + tool schemas
L-->>R: answer or tool call
alt tool call
R->>T: execute tool or MCP action
T-->>R: tool result
R->>L: tool result + updated context
L-->>R: final answer
end
R->>M: persist session changes and memory candidates
R->>O: spans, metrics, eval events
G-->>U: response
The capture step is usually more important than it looks. OpenClaw and Hermes both put a persistent gateway in front of the assistant because ingress is not just text entry. It includes channel metadata, identities, authorisation, session boundaries, direct messages, groups, cron ticks, and delivery semantics. If you skip that layer and rely on a raw chat widget abstraction, you will eventually bolt it back on as ad hoc middleware anyway.
The enrich step is where mature systems diverge from toy demos. OpenAI Retrieval and File Search make retrieval explicit through vector stores and search calls. LlamaIndex formalises the same pattern through data connectors, indexes, query engines, and workflows. Hermes goes further by splitting the model estate into main and auxiliary slots, offloading work such as compression, summarisation, and routing to smaller or more specialised models. That is a design pattern worth stealing: do not spend your most expensive model tokens on chores.
The respond step is not “generate text”. It is “close the current loop”. If the model can answer directly, it does. If it needs a tool, it emits a structured request. Anthropic’s tool-use contract and OpenAI’s function-calling guide both make this explicit. The reason this matters architecturally is that outputs now include both language and control flow. Your response object is partly prose and partly runtime plan.
The record step is where consistency semantics show up. Pinecone separates write and read paths and processes writes after durable acknowledgement. Hermes memory is injected as a frozen snapshot per session so it can preserve prefix-cache performance, which means new writes do not automatically appear in the current session prompt. OpenClaw’s Dreaming system only promotes reviewed, grounded candidates into MEMORY.md, and it is opt-in rather than always-on. The practical lesson is that memory is rarely truly read-after-write across every layer. You need to design for staged visibility.
OpenClaw and Hermes as reference systems
OpenClaw and Hermes are useful reference cases because they are not just wrappers around one provider API. Both present an assistant as a long-running system with gateways, sessions, tools, memory, and multiple model backends.
| Architectural concern | OpenClaw mapping | Hermes mapping |
|---|---|---|
| Ingress and surfaces | Self-hosted gateway connecting chat apps and channel surfaces | Single background messaging gateway connecting many external platforms |
| Orchestration | Gateway-centric control plane for channels and AI interactions | AIAgent loop handling prompt assembly, provider selection, tool dispatch, retries, and failover |
| Routing | Multi-agent routing binds inbound traffic to isolated agents with separate workspaces and sessions | Main and auxiliary model slots split core reasoning from compression, summarisation, approvals, and MCP routing |
| Memory | File-backed memory plus optional active memory and background Dreaming promotion | MEMORY.md and USER.md injected as a frozen session snapshot, plus external memory providers |
| Tooling and extension | Built-in tools, session tools, provider plugins, custom and self-hosted endpoints | 40+ tools, built-in MCP client, toolsets, skills, and memory-provider plugins |
This mapping is grounded in the official OpenClaw and Hermes docs and repos. OpenClaw documents a gateway architecture, multi-agent routing, custom and self-hosted provider support including vLLM and Ollama, optional active memory, and Dreaming-based promotion. Hermes documents a messaging gateway, a central AIAgent loop, main and auxiliary model slots, built-in memory, and native MCP integration.
My slightly opinionated read is that both systems make the same architectural argument in different accents. OpenClaw is strongly gateway-first. Hermes is strongly agent-loop-first. But both reject the shallow idea that an assistant is just “prompt plus model”. They model channels, identities, memory semantics, tool surfaces, and backend heterogeneity as first-class concerns. That is exactly what a production architecture should do.
A practical hybrid stack inspired by both systems looks like this:
edge:
gateway: hermes or openclaw
routing:
proxy: litellm
policy: latency and budget aware
tenancy: session and channel scoped
llm:
primary: openai responses or anthropic messages
local_fallback: vllm
local_dev: ollama or llama.cpp
memory:
session: sqlite or postgres
semantic: pgvector or weaviate
embeddings: openai embeddings or ollama embeddings
tools:
contract: json schema tools plus mcp
examples: filesystem, browser, web search, internal APIs
observability:
traces: opentelemetry
ai_dashboards: openlit or langsmith
evals: openai evals plus app-specific regression sets
That stack is a reasoned deployment pattern rather than a vendor-prescribed blueprint. It works because the official interfaces line up: OpenAI and Anthropic expose tool-oriented APIs, vLLM and llama.cpp emulate provider-style endpoints, Ollama handles local models and embeddings, MCP standardises external tools, LiteLLM handles routing and failover, and OpenTelemetry-compatible platforms can trace the whole path.
Patterns, tables, and tradeoffs
There are a few repeatable assistant patterns worth naming. A managed assistant keeps most of the runtime inside provider APIs. A retrieval-first assistant treats memory and search as the main differentiator. A tool-first assistant behaves more like an operator than a chatbot. A gateway assistant prioritises always-on access through messaging surfaces. A specialist mesh decomposes work into multiple agents or routes. Official docs across OpenAI, Anthropic, LlamaIndex, LiteLLM, OpenClaw, and Hermes all support versions of these patterns, even if they name them differently.
| Pattern | What it optimises for | Best use case | Hidden cost |
|---|---|---|---|
| Managed assistant | Speed of delivery | Internal copilots and support bots | Provider lock-in and less control over runtime details |
| Retrieval-first assistant | Grounded answers over owned data | Docs, support, knowledge work | Retrieval quality becomes the real product |
| Tool-first assistant | Action over conversation | Ops workflows, data pulls, automations | Side effects, retries, and approvals become core concerns |
| Gateway assistant | Ubiquitous access | Personal and team assistants across chat surfaces | Identity, session, and security complexity |
| Specialist mesh | Division of labour | Complex workflows with real ownership boundaries | Harder debugging, orchestration, and eval design |
This pattern table is a synthesis from the provider docs, framework docs, and reference systems rather than a claim from any one vendor.
| Option shape | Typical components | Strength | Weakness |
|---|---|---|---|
| Managed | OpenAI Responses or Anthropic Managed Agents, hosted file search or vector stores | Fastest path, fewer moving parts, hosted tools | Lowest control over data path and runtime semantics |
| Hybrid | Provider API plus self-hosted router and vector store | Good balance of speed and control | More contracts to maintain |
| Self-hosted | vLLM or llama.cpp or Ollama, MCP, self-hosted vector DB, OTel | Strong privacy and deployment control | Highest ops burden, hardware and tuning overhead |
Table notes: OpenAI hosted File Search is a managed tool, Anthropic offers a managed harness, Pinecone is a managed vector service, while vLLM, llama.cpp, Ollama, pgvector, Weaviate, Milvus, LangSmith self-hosted, and OpenLIT all support self-managed or hybrid operation to varying degrees.
| Vector store | Shape | Why teams choose it | Watchout |
|---|---|---|---|
| Pinecone | Managed vector service | Strong operational simplicity and scalable managed architecture | External dependency and managed-service economics |
| Weaviate | Open-source vector database | Vector plus inverted indexes and flexible index choices | More cluster tuning than a hosted-only path |
| pgvector | Postgres extension | Keep vectors with relational data and existing SQL stack | Not the best fit for every high-scale ANN workload |
| Milvus | Distributed vector database | Purpose-built scale and ecosystem around managed Zilliz Cloud | Another specialist datastore to operate |
Table notes: Pinecone documents a managed control plane and regional data planes. Weaviate documents vector and inverted indexes with multiple vector index types. pgvector adds exact and approximate nearest-neighbour search to Postgres. Milvus positions itself as an open-source high-performance, scalable vector database, with Zilliz Cloud as the managed option.
| LLM option | Interface style | Best at | Watchout |
|---|---|---|---|
| OpenAI Responses | Stateful responses plus built-in tools | Fast start, hosted tools, structured loops | You inherit platform-specific abstractions |
| Anthropic Messages | Direct model access with explicit tool-use contract | Clear tool boundaries and good control in custom loops | More runtime is your responsibility unless you use Managed Agents |
| vLLM | OpenAI-compatible and Anthropic-compatible self-hosted serving | High-throughput self-hosted inference | Real infrastructure and model-serving work |
| Ollama | Simple local model and embedding runtime | Local development and small self-hosted stacks | Not the same class of serving system as a tuned distributed runtime |
| llama.cpp | Lightweight local server with provider-compatible routes | Edge, CPU-first, constrained environments | You do more manual tuning and capability matching |
Table notes: OpenAI documents Responses as its advanced interface for stateful responses and built-in tools. Anthropic documents the Messages API and the tool-use contract separately from Managed Agents. vLLM exposes an OpenAI-compatible server plus Anthropic Messages API support. Ollama documents local embedding and model workflows. llama.cpp documents OpenAI-compatible chat, responses, and embeddings routes, plus Anthropic-compatible chat completions.
| Constraint or tradeoff | Bias toward managed | Bias toward self-hosted | Practical mitigation |
|---|---|---|---|
| Latency | Often better first iteration and fewer local tuning tasks | Can win when model and data are colocated and kept warm | Use routing tiers, hot caches, and smaller auxiliary models |
| Cost | Easy to start, variable at token scale | Better amortisation at steady utilisation | Measure real traffic before optimising by instinct |
| Privacy and residency | Simpler for non-sensitive data | Stronger control for sensitive and regulated flows | Use hybrid boundaries and keep only what must move |
| Consistency | Hosted tools still have staged visibility semantics | Self-hosted memory pipelines also stage and promote data | Define read-after-write rules explicitly by layer |
| Scaling | Less control-plane pain | Better tailoring for steady, specialised workloads | Use batching, queueing, and isolated tenants |
| Debuggability | Easy to miss opaque provider internals | Easy to drown in self-made complexity | Trace every request and evaluate every route |
This tradeoff matrix is an architectural inference from the official docs, not a vendor benchmark. The consistency row matters more than many blog posts admit: Pinecone separates write and read paths, Hermes freezes memory into session-start prompts, and OpenClaw promotes durable memory through staged review. That means “memory updated” and “memory visible to the current answer” are often different truths.
Failure modes and mitigations
Most assistants do not fail because the base model is “bad”. They fail because the surrounding system lies to the model, starves it of the right context, lets tools drift, or makes debugging impossible.
| Where it breaks | Typical symptom | Usual cause | Mitigation |
|---|---|---|---|
| Prompt assembly | Confident but off-target answer | Too much irrelevant context, poor ordering | Budget context, rerank, keep key facts near the top |
| Retrieval | Correct tone, wrong facts | Bad chunking, stale index, weak filters | Evaluate retrieval separately, add metadata filters and hybrid search |
| Tool boundary | Wrong action or duplicate action | Loose schemas, retries without idempotency | Tight schemas, idempotency keys, approval gates |
| Routing | Wildly inconsistent behaviour by request | Cost or latency routing without quality controls | Add sticky sessions and per-route evals |
| Memory | Stale or poisoned recall | Over-eager writes, weak review, cross-session leakage | Separate working and durable memory, review promotions |
| Observability | No idea what happened | Missing traces or no span granularity | Emit root and subspans for retrieval, model, and tool calls |
| Hallucination control | Plausible but unsupported claims | Weak grounding or no validation pass | Reference-doc validation, self-consistency checks, eval gates |
The evidence base for this table is broad but consistent. Anthropic’s tool docs make clear that tool use is a contract boundary. OpenAI Guardrails includes hallucination detection against a reference knowledge base via File Search. SelfCheckGPT shows that self-consistency across samples can help detect unsupported claims. The “Lost in the Middle” results and Anthropic’s context guidance both reinforce the same operational lesson: more tokens do not remove the need for context curation.
Preferred mitigation stack could be boring and repetitive: trace every request, version prompts, evaluate retrieval independently, keep tools idempotent, and run regression evals before you change routes or memory policy. OpenAI’s Evals docs and repo are blunt about why: without evals, it is hard and time-consuming to understand how model or prompt changes affect your use case. That applies just as much to routers and retrieval as it does to prompts.
More reading
If you want to go deeper, there are the most useful primary sources to keep open while designing or reviewing an assistant architecture.
-
OpenAI: Responses Overview, Function Calling, Using Tools, Retrieval, File Search, Evals, and MCP for remote tool servers.
-
Anthropic: API Overview, Tool Use, the tool-use contract, Managed Agents, Context Windows, and the MCP connector.
-
MCP itself: the Architecture Overview and Specification are worth reading directly, because they explain hosts, clients, servers, tools, prompts, resources, transports, and capability negotiation cleanly.
-
Frameworks and routing: LangChain Overview, LlamaIndex context-augmentation docs, LiteLLM routing docs, LangSmith observability docs.
-
Self-hosted runtimes and assistant systems: vLLM, llama.cpp server, Ollama embeddings, OpenClaw docs and repo, Hermes docs and repo.
-
Storage and observability: Pinecone, Weaviate, pgvector, Milvus, OpenTelemetry, OpenLIT.
-
Research papers: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, Lost in the Middle, and SelfCheckGPT.