Building a Multi-Agent Chatbot with LangGraph and MCP
Modern chatbots like ChatGPT and Claude can already do a lot — they search the web, write code, analyze images, and reason across documents. But have you ever wondered what it actually takes to build one of these systems yourself? How does the agent decide which tool to call? How does it loop between reasoning and acting? How do you wire up vector search, code generation, and image understanding into a single conversation?
I built HiveChat (github.com/sahilmalik27/hivechat) to learn exactly that — a multi-agent chatbot using LangGraph's ReAct pattern with 7 MCP tools, running entirely on local LLMs via Ollama.
This post is a deep dive into what I learned: the architecture, the patterns, and the things that surprised me along the way.
Why Build This?
The best way to understand how agent systems work is to build one from scratch. I wanted to get hands-on with the core building blocks: reasoning loops, tool orchestration, streaming, and state persistence. HiveChat became the learning project — a system where I could explore questions like:
- How does LangGraph's ReAct pattern actually work under the hood?
- What does it take to integrate external tools safely via MCP?
- How do you stream tokens to a browser while the agent is still reasoning?
- How do you persist multi-turn conversations across service restarts?
- What changes when you run local models instead of cloud APIs?
Architecture Overview
Here's the high-level flow:
┌──────────────────────────────────────────────────────┐
│ Frontend (Next.js) │
│ WebSocket streaming chat interface │
└─────────────────────┬──────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ Backend (FastAPI) │
│ ChatAgent with LangGraph engine │
└─────────────────────┬──────────────────────────────┘
│
┌────────────┼────────────┐
▼ ▼ ▼
┌────────┐ ┌────────┐ ┌────────┐
│ Ollama │ │Milvus │ │Postgres│
│ (LLMs) │ │(Vector)│ │(State) │
└────────┘ └────────┘ └────────┘
│ │
└────────────┼────────────┘
▼
┌────────────────────────┐
│ 7 MCP Tools via stdio │
│ (RAG, Code, Image, ...) │
└────────────────────────┘
The heart of this is the ChatAgent, a LangGraph-powered state machine that implements the ReAct pattern: Reasoning and Acting in a loop.
The ReAct Loop: Generate → Decide → Act → Repeat
LangGraph is a framework for building cyclic computation graphs. Unlike chains (which are linear), graphs can loop back, making them perfect for agent workflows.
Here's the core loop:
def _build_graph(self):
    """Build the ReAct agent graph."""
    workflow = StateGraph(AgentState)

    # The two main nodes
    workflow.add_node("generate", self._generate_node)
    workflow.add_node("action", self._action_node)

    # Edges: generate → should_continue → action → generate
    workflow.add_edge("action", "generate")
    workflow.add_conditional_edges(
        "generate",
        self._should_continue,
        {
            "continue": "action",
            "end": END,
        },
    )
    workflow.set_entry_point("generate")
    return workflow.compile(checkpointer=MemorySaver())
Let me break this down:
- generate node: The LLM reasons about the user's message and decides what to do next. Does it need to call a tool? Or can it answer directly?
- should_continue edge: A decision point. If the LLM said "I need to use a tool," route to action. If it said "Here's my final answer," route to END.
- action node: Execute the tool the LLM requested. Store the result.
- Loop back: With the tool result in context, go back to generate so the LLM can interpret what it got and decide the next step.
This loop repeats until the LLM decides it has enough information to answer.
State Management with TypedDict
The magic is in the AgentState—a TypedDict that holds everything:
class AgentState(TypedDict):
    messages: list[BaseMessage]
    iterations: int
    chat_id: str
    image_data: dict   # Parsed images from user messages
    documents: list    # Uploaded document context
    metadata: dict     # Tool results, user info, etc.
Every node (generate, action) reads from and writes to this state. When you add a message, it persists through the entire conversation:
def _generate_node(self, state: AgentState) -> AgentState:
    """Generate the next response or tool call."""
    response = self.llm.invoke(state["messages"])
    state["messages"].append(response)
    state["iterations"] += 1
    return state
MemorySaver checkpoints this state in process memory during a session; HiveChat also persists those checkpoints to PostgreSQL, which is what lets conversations survive service restarts.
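At its core, a checkpointer is just a keyed store of serialized state per conversation. Here's a toy illustration of the idea (this is not LangGraph's actual checkpointer interface; MemorySaver and the Postgres-backed savers implement a richer API, but the principle is the same):

```python
import json

class ToyCheckpointer:
    """Illustrative only: persists per-chat state as JSON blobs,
    keyed by chat id so a conversation can be resumed later."""

    def __init__(self):
        self._store = {}  # chat_id -> serialized state

    def put(self, chat_id, state):
        self._store[chat_id] = json.dumps(state)

    def get(self, chat_id):
        raw = self._store.get(chat_id)
        return json.loads(raw) if raw else None

checkpointer = ToyCheckpointer()
checkpointer.put("chat-1", {"messages": ["hi"], "iterations": 1})
restored = checkpointer.get("chat-1")
```

Swap the dict for a database table and you have restart-safe persistence.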
The 7 MCP Tools: Extending the Agent
Model Context Protocol (MCP) is an open standard for safely integrating external tools with AI models. Instead of embedding tool code directly in your service, you run tools as separate processes and connect via stdio.
HiveChat ships with 7 tools:
- RAG (Retrieval-Augmented Generation): Vector search over uploaded documents using Milvus
- Code Generation: Using Ollama with a fine-tuned code model
- Image Understanding: LLaVA vision model via Ollama
- Web Search: Real-time web information retrieval
- PDF Extraction: Parse and extract text from PDFs
- Weather: Current conditions and forecasts
- General Search: Broad information lookup
The key insight: tools are microservices, not monolithic. They have their own Docker containers, health checks, and exponential backoff retry logic.
Here's how we connect to them:
class MCPClient:
    def __init__(self, tool_config: dict):
        self.client = MultiServerMCPClient(
            servers=tool_config,
            transport_type="stdio"
        )

    async def initialize(self):
        """Initialize with exponential backoff."""
        max_retries = 5
        for attempt in range(max_retries):
            try:
                await self.client.initialize()
                return
            except Exception as e:
                wait = 2 ** attempt
                logger.warning(f"MCP init failed, retry in {wait}s: {e}")
                await asyncio.sleep(wait)
        raise RuntimeError("MCP initialization failed")
The exponential backoff is critical: some tools (like the RAG service) take time to start up, and we don't want to fail immediately.
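The same pattern generalizes to any flaky async dependency. Here's a reusable sketch that adds jitter, which keeps several restarting tools from retrying a shared dependency in lockstep (the helper name and the demo `flaky_init` call are illustrative, not HiveChat's code):

```python
import asyncio
import random

async def retry_with_backoff(make_call, max_retries=5, base_delay=1.0):
    """Retry an async operation with exponential backoff plus jitter.

    `make_call` is a zero-argument callable returning a fresh coroutine
    per attempt; jitter spreads simultaneous retries apart.
    """
    for attempt in range(max_retries):
        try:
            return await make_call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            await asyncio.sleep(delay)

# Demo: an init call that fails twice before succeeding (tiny delays here)
calls = {"n": 0}
async def flaky_init():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("service not ready")
    return "connected"

result = asyncio.run(retry_with_backoff(flaky_init, base_delay=0.001))
```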
Streaming Responses with Tool Call Parsing
Users expect ChatGPT-like streaming. We can't buffer everything and send it at the end. So we stream tokens as they come, but we also parse tool calls in real-time and show the user what the agent is doing.
Here's the WebSocket handler:
@app.websocket("/ws/{chat_id}")  # route path illustrative
async def chat(websocket: WebSocket, chat_id: str):
    await websocket.accept()
    while True:
        user_message = await websocket.receive_text()

        # Stream the agent's reasoning
        async for event in agent.stream(
            input={"messages": [HumanMessage(content=user_message)]},
            config={"configurable": {"chat_id": chat_id}},
        ):
            # Parse LLM tokens vs. tool calls
            if "generate" in event:
                chunk = event["generate"].content
                await websocket.send_json({
                    "type": "token",
                    "data": chunk,
                })
            elif "action" in event:
                tool_name = event["action"].tool
                await websocket.send_json({
                    "type": "tool_call",
                    "tool": tool_name,
                    "input": event["action"].tool_input,
                })
The frontend renders this in real-time: text streams in, tool calls appear as badges showing the agent is "thinking," then results appear as context.
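On the receiving end, the client just switches on the `type` field of each event. A small Python sketch of that dispatch makes the contract explicit (the event shapes mirror the send_json payloads in the handler; the rendering itself is hypothetical):

```python
def render_events(events):
    """Fold a stream of websocket events into a display transcript.

    Tokens are concatenated as they arrive; tool calls become
    status badges inline in the transcript.
    """
    parts = []
    for event in events:
        if event["type"] == "token":
            parts.append(event["data"])
        elif event["type"] == "tool_call":
            parts.append(f"\n[calling {event['tool']}...]\n")
    return "".join(parts)

transcript = render_events([
    {"type": "tool_call", "tool": "rag_search", "input": {"query": "HNSW"}},
    {"type": "token", "data": "HNSW builds "},
    {"type": "token", "data": "a graph index."},
])
```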
Why Local LLMs? Why Ollama?
A lot of agent systems rely on OpenAI's API. But there are real advantages to running local:
- Cost: No per-token charges. Just GPU hardware.
- Latency: Models run on your own hardware. No network round-trip to a cloud service.
- Privacy: All data stays on-premise. No logs sent to third parties.
- Control: You choose the model, can fine-tune, and get full traceability.
Ollama is the runtime that makes this practical. It handles model downloads, quantization, batching, and serves via a simple REST API. LangGraph talks to Ollama via LangChain's Ollama integration.
Example: A single question, three tool calls
Here's what actually happens when you ask HiveChat a question that requires multiple tools. Say you type:
"I uploaded a paper on vector databases. Can you summarize the key indexing strategies and write me a Python snippet that benchmarks HNSW vs. IVF?"
The agent doesn't answer directly. It enters the ReAct loop:
Iteration 1 — RAG retrieval
Agent thinking: The user referenced an uploaded paper. I should search the
vector store for relevant content before answering.
→ Tool call: rag_search(query="vector database indexing strategies HNSW IVF")
Tool result: [3 chunks retrieved]
- "HNSW builds a navigable small-world graph with logarithmic search..."
- "IVF partitions the vector space into Voronoi cells using k-means..."
- "Benchmark results show HNSW achieves 95% recall at 2ms latency..."
Iteration 2 — Code generation
Agent thinking: I have the context. Now I need to generate a benchmark script.
Let me use the code generation tool for a clean implementation.
→ Tool call: code_generator(
prompt="Python benchmark comparing HNSW vs IVF using FAISS",
language="python"
)
Tool result:
import faiss
import numpy as np
import time
...
Iteration 3 — Final response
Agent thinking: I have the paper context and the code. I can now compose
a complete answer. No more tools needed.
→ Decision: END (respond to user)
The user sees all of this streamed in real-time: first a "Searching documents..." indicator, then "Generating code...", then the final answer flows in token by token — a summary of indexing strategies grounded in their uploaded paper, followed by a working benchmark script.
The whole exchange took three iterations through the ReAct loop: generate → tool → generate → tool → generate → end. Each iteration is a full round-trip through the LangGraph state machine.
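The trace above can be reproduced with a stripped-down simulation: a scripted stand-in for the LLM that requests tools until it has what it needs, driven by the same generate → decide → act cycle. Everything here (tool names, scripted responses) is illustrative, not HiveChat's actual code:

```python
def run_react_loop(scripted_steps, tools, max_iterations=10):
    """Minimal ReAct loop: each scripted step either names a tool to call
    or gives a final answer, standing in for real LLM decisions."""
    context = []  # accumulated tool results, as the message history would be
    for iteration, step in enumerate(scripted_steps, start=1):
        if iteration > max_iterations:
            return "Exhausted reasoning steps.", context
        if step["action"] == "tool":   # generate -> action -> generate
            result = tools[step["tool"]](step["input"])
            context.append((step["tool"], result))
        else:                          # generate -> END
            return step["answer"], context
    return "No final answer produced.", context

# Fake tools standing in for the MCP services
tools = {
    "rag_search": lambda q: f"3 chunks about {q}",
    "code_generator": lambda p: "import faiss  # ...benchmark script",
}

answer, context = run_react_loop(
    [
        {"action": "tool", "tool": "rag_search", "input": "HNSW vs IVF"},
        {"action": "tool", "tool": "code_generator", "input": "benchmark"},
        {"action": "final", "answer": "Summary plus benchmark script."},
    ],
    tools,
)
```

Three scripted steps, three iterations, two tool results accumulated in context before the final answer: the same shape as the real exchange.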
Lessons Learned
1. ReAct patterns need careful step limits.
An agent can loop indefinitely. Always set a max iteration count:
if state["iterations"] > 10:
    return {"messages": [AIMessage(content="I've exhausted my reasoning steps.")]}
2. Streaming is hard; test it thoroughly.
Streaming response parsing is easy to get wrong. Tool call detection, token chunking, and error handling all need careful testing. We built a test suite that simulates various LLM response formats.
3. Tool failures shouldn't crash the agent.
When the RAG service is down or the web search times out, the agent should handle it gracefully:
async def call_tool(self, tool_name: str, params: dict):
    try:
        return await self.mcp_client.invoke(tool_name, params)
    except TimeoutError:
        return f"Tool {tool_name} timed out. Continuing with available context."
4. PostgreSQL state persistence is underrated.
Checkpointing agent state to a database is non-trivial but worth it. Users appreciate resuming a multi-turn conversation where they left off, and it makes debugging easier.
5. Observability is essential.
With multiple services, LLM calls, and tool invocations, things break in subtle ways. OpenTelemetry and tracing (Jaeger) let you see the entire flow: which tool took 5 seconds? Did the LLM hallucinate a tool name? These questions are impossible to answer without traces.
What's Next?
I'm exploring:
- Agentic function calling: Letting the agent decide how many tools to call in parallel
- Tool composition: Chains of tools (e.g., search → summarize → translate)
- Fine-tuned models: Training a smaller LLM specifically for tool selection to reduce hallucinations
- Multi-user isolation: Currently all conversations share the same agent instance; I want per-user resource limits
Try It Out
The source code lives at github.com/sahilmalik27/hivechat (currently a private repo). If you'd like access or want to discuss the project, connect with me on LinkedIn.
It runs locally with Docker Compose. It's a great starting point if you're curious about:
- Building with LangGraph
- Integrating MCP tools
- Streaming LLM responses
- Multi-agent orchestration
The architecture is meant to be modular—swap out the RAG tool for your own docs, replace Ollama with a different LLM provider, or add new tools by creating a new MCP server.
Happy building.
Thanks for reading. Follow me for more.