Building a Multi-Agent Chatbot with LangGraph and MCP
Modern chatbots like ChatGPT and Claude can already do a lot — they search the web, write code, analyze images, and reason across documents. But have you ever wondered what it actually takes to build one of these systems yourself? How does the agent decide which tool to call? How does it loop between reasoning and acting? How do you wire up vector search, code generation, and image understanding into a single conversation?
I built HiveChat (github.com/sahilmalik27/hivechat) to learn exactly that — a multi-agent chatbot using LangGraph's ReAct pattern with 7 MCP tools, running entirely on local LLMs via Ollama.
This post is a deep dive into what I learned: the architecture, the patterns, and the things that surprised me along the way.
Why Build This?
The best way to understand how agent systems work is to build one from scratch. I wanted to get hands-on with the core building blocks: reasoning loops, tool orchestration, streaming, and state persistence. HiveChat became the learning project — a system where I could explore questions like:
- How does LangGraph's ReAct pattern actually work under the hood?
- What does it take to integrate external tools safely via MCP?
- How do you stream tokens to a browser while the agent is still reasoning?
- How do you persist multi-turn conversations across service restarts?
- What changes when you run local models instead of cloud APIs?
Architecture Overview
Here's the high-level flow:
┌──────────────────────────────────────────────────────┐
│ Frontend (Next.js) │
│ WebSocket streaming chat interface │
└─────────────────────┬──────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ Backend (FastAPI) │
│ ChatAgent with LangGraph engine │
└─────────────────────┬──────────────────────────────┘
│
┌────────────┼────────────┐
▼ ▼ ▼
┌────────┐ ┌────────┐ ┌────────┐
│ Ollama │ │Milvus │ │Postgres│
│ (LLMs) │ │(Vector)│ │(State) │
└────────┘ └────────┘ └────────┘
│ │
└────────────┼────────────┘
▼
┌────────────────────────┐
│ 7 MCP Tools via stdio │
│ (RAG, Code, Image, ...) │
└────────────────────────┘
The heart of this is the ChatAgent, a LangGraph-powered state machine that implements the ReAct pattern: Reasoning and Acting in a loop.
The ReAct Loop: Generate → Decide → Act → Repeat
LangGraph is a framework for building cyclic computation graphs. Unlike chains (which are linear), graphs can loop back, making them perfect for agent workflows.
Here's the core loop:
def _build_graph(self):
    """Build the ReAct agent graph."""
    workflow = StateGraph(AgentState)

    # The two main nodes
    workflow.add_node("generate", self._generate_node)
    workflow.add_node("action", self._action_node)

    # Edges: generate → should_continue → action → generate
    workflow.add_edge("action", "generate")
    workflow.add_conditional_edges(
        "generate",
        self._should_continue,
        {
            "continue": "action",
            "end": END,
        },
    )
    workflow.set_entry_point("generate")
    return workflow.compile(checkpointer=MemorySaver())
Let me break this down:
- generate node: The LLM reasons about the user's message and decides what to do next. Does it need to call a tool? Or can it answer directly?
- should_continue edge: A decision point. If the LLM said "I need to use a tool," route to action. If it said "Here's my final answer," route to END.
- action node: Execute the tool the LLM requested. Store the result.
- Loop back: With the tool result in context, go back to generate so the LLM can interpret what it got and decide the next step.
This loop repeats until the LLM decides it has enough information to answer.
State Management with TypedDict
The magic is in the AgentState—a TypedDict that holds everything:
class AgentState(TypedDict):
    messages: list[BaseMessage]
    iterations: int
    chat_id: str
    image_data: dict   # Parsed images from user messages
    documents: list    # Uploaded document context
    metadata: dict     # Tool results, user info, etc.
Every node (generate, action) reads from and writes to this state. When you add a message, it persists through the entire conversation:
def _generate_node(self, state: AgentState) -> AgentState:
    """Generate the next response or tool call."""
    response = self.llm.invoke(state["messages"])
    state["messages"].append(response)
    state["iterations"] += 1
    return state
MemorySaver checkpoints this state in process memory during a session; HiveChat also persists those checkpoints to PostgreSQL, which is what lets conversations survive service restarts.
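At its core, a checkpointer is just a keyed store of serialized state per conversation. Here's a toy illustration of the idea (this is not LangGraph's actual checkpointer interface; MemorySaver and the Postgres-backed savers implement a richer API, but the principle is the same):

```python
import json

class ToyCheckpointer:
    """Illustrative only: persists per-chat state as JSON blobs,
    keyed by chat id so a conversation can be resumed later."""

    def __init__(self):
        self._store = {}  # chat_id -> serialized state

    def put(self, chat_id, state):
        self._store[chat_id] = json.dumps(state)

    def get(self, chat_id):
        raw = self._store.get(chat_id)
        return json.loads(raw) if raw else None

checkpointer = ToyCheckpointer()
checkpointer.put("chat-1", {"messages": ["hi"], "iterations": 1})
restored = checkpointer.get("chat-1")
```

Swap the dict for a database table and you have restart-safe persistence.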
The 7 MCP Tools: Extending the Agent
Model Context Protocol (MCP) is an open standard for safely integrating external tools with AI models. Instead of embedding tool code directly in your service, you run tools as separate processes and connect via stdio.
HiveChat ships with 7 tools:
- RAG (Retrieval-Augmented Generation): Vector search over uploaded documents using Milvus
- Code Generation: Using Ollama with a fine-tuned code model
- Image Understanding: LLaVA vision model via Ollama
- Web Search: Real-time web information retrieval
- PDF Extraction: Parse and extract text from PDFs
- Weather: Current conditions and forecasts
- General Search: Broad information lookup
The key insight: tools are microservices, not monolithic. They have their own Docker containers, health checks, and exponential backoff retry logic.
Here's how we connect to them:
class MCPClient:
    def __init__(self, tool_config: dict):
        self.client = MultiServerMCPClient(
            servers=tool_config,
            transport_type="stdio"
        )

    async def initialize(self):
        """Initialize with exponential backoff."""
        max_retries = 5
        for attempt in range(max_retries):
            try:
                await self.client.initialize()
                return
            except Exception as e:
                wait = 2 ** attempt
                logger.warning(f"MCP init failed, retry in {wait}s: {e}")
                await asyncio.sleep(wait)
        raise RuntimeError("MCP initialization failed")
The exponential backoff is critical: some tools (like the RAG service) take time to start up, and we don't want to fail immediately.
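The same pattern generalizes to any flaky async dependency. Here's a reusable sketch that adds jitter, which keeps several restarting tools from retrying a shared dependency in lockstep (the helper name and the demo `flaky_init` call are illustrative, not HiveChat's code):

```python
import asyncio
import random

async def retry_with_backoff(make_call, max_retries=5, base_delay=1.0):
    """Retry an async operation with exponential backoff plus jitter.

    `make_call` is a zero-argument callable returning a fresh coroutine
    per attempt; jitter spreads simultaneous retries apart.
    """
    for attempt in range(max_retries):
        try:
            return await make_call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            await asyncio.sleep(delay)

# Demo: an init call that fails twice before succeeding (tiny delays here)
calls = {"n": 0}
async def flaky_init():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("service not ready")
    return "connected"

result = asyncio.run(retry_with_backoff(flaky_init, base_delay=0.001))
```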
Streaming Responses with Tool Call Parsing
Users expect ChatGPT-like streaming. We can't buffer everything and send it at the end. So we stream tokens as they come, but we also parse tool calls in real-time and show the user what the agent is doing.
Here's the WebSocket handler:
@app.websocket("/ws/{chat_id}")  # route path illustrative
async def chat(websocket: WebSocket, chat_id: str):
    await websocket.accept()
    while True:
        user_message = await websocket.receive_text()

        # Stream the agent's reasoning
        async for event in agent.stream(
            input={"messages": [HumanMessage(content=user_message)]},
            config={"configurable": {"chat_id": chat_id}},
        ):
            # Parse LLM tokens vs. tool calls
            if "generate" in event:
                chunk = event["generate"].content
                await websocket.send_json({
                    "type": "token",
                    "data": chunk,
                })
            elif "action" in event:
                tool_name = event["action"].tool
                await websocket.send_json({
                    "type": "tool_call",
                    "tool": tool_name,
                    "input": event["action"].tool_input,
                })
The frontend renders this in real-time: text streams in, tool calls appear as badges showing the agent is "thinking," then results appear as context.
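On the receiving end, the client just switches on the `type` field of each event. A small Python sketch of that dispatch makes the contract explicit (the event shapes mirror the send_json payloads in the handler; the rendering itself is hypothetical):

```python
def render_events(events):
    """Fold a stream of websocket events into a display transcript.

    Tokens are concatenated as they arrive; tool calls become
    status badges inline in the transcript.
    """
    parts = []
    for event in events:
        if event["type"] == "token":
            parts.append(event["data"])
        elif event["type"] == "tool_call":
            parts.append(f"\n[calling {event['tool']}...]\n")
    return "".join(parts)

transcript = render_events([
    {"type": "tool_call", "tool": "rag_search", "input": {"query": "HNSW"}},
    {"type": "token", "data": "HNSW builds "},
    {"type": "token", "data": "a graph index."},
])
```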
Why Local LLMs? Why Ollama?
A lot of agent systems rely on OpenAI's API. But there are real advantages to running local:
- Cost: No per-token charges. Just GPU hardware.
- Latency: Models run on your own hardware. No network round-trip to a cloud service.
- Privacy: All data stays on-premise. No logs sent to third parties.
- Control: You choose the model, can fine-tune, and get full traceability.
Ollama is the runtime that makes this practical. It handles model downloads, quantization, batching, and serves via a simple REST API. LangGraph talks to Ollama via LangChain's Ollama integration.
Example: A single question, three tool calls
Here's what actually happens when you ask HiveChat a question that requires multiple tools. Say you type:
"I uploaded a paper on vector databases. Can you summarize the key indexing strategies and write me a Python snippet that benchmarks HNSW vs. IVF?"
The agent doesn't answer directly. It enters the ReAct loop:
Iteration 1 — RAG retrieval
Agent thinking: The user referenced an uploaded paper. I should search the
vector store for relevant content before answering.
→ Tool call: rag_search(query="vector database indexing strategies HNSW IVF")
Tool result: [3 chunks retrieved]
- "HNSW builds a navigable small-world graph with logarithmic search..."
- "IVF partitions the vector space into Voronoi cells using k-means..."
- "Benchmark results show HNSW achieves 95% recall at 2ms latency..."
Iteration 2 — Code generation
Agent thinking: I have the context. Now I need to generate a benchmark script.
Let me use the code generation tool for a clean implementation.
→ Tool call: code_generator(
prompt="Python benchmark comparing HNSW vs IVF using FAISS",
language="python"
)
Tool result:
import faiss
import numpy as np
import time
...
Iteration 3 — Final response
Agent thinking: I have the paper context and the code. I can now compose
a complete answer. No more tools needed.
→ Decision: END (respond to user)
The user sees all of this streamed in real-time: first a "Searching documents..." indicator, then "Generating code...", then the final answer flows in token by token — a summary of indexing strategies grounded in their uploaded paper, followed by a working benchmark script.
The whole exchange took three iterations through the ReAct loop: generate → tool → generate → tool → generate → end. Each iteration is a full round-trip through the LangGraph state machine.
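The trace above can be reproduced with a stripped-down simulation: a scripted stand-in for the LLM that requests tools until it has what it needs, driven by the same generate → decide → act cycle. Everything here (tool names, scripted responses) is illustrative, not HiveChat's actual code:

```python
def run_react_loop(scripted_steps, tools, max_iterations=10):
    """Minimal ReAct loop: each scripted step either names a tool to call
    or gives a final answer, standing in for real LLM decisions."""
    context = []  # accumulated tool results, as the message history would be
    for iteration, step in enumerate(scripted_steps, start=1):
        if iteration > max_iterations:
            return "Exhausted reasoning steps.", context
        if step["action"] == "tool":   # generate -> action -> generate
            result = tools[step["tool"]](step["input"])
            context.append((step["tool"], result))
        else:                          # generate -> END
            return step["answer"], context
    return "No final answer produced.", context

# Fake tools standing in for the MCP services
tools = {
    "rag_search": lambda q: f"3 chunks about {q}",
    "code_generator": lambda p: "import faiss  # ...benchmark script",
}

answer, context = run_react_loop(
    [
        {"action": "tool", "tool": "rag_search", "input": "HNSW vs IVF"},
        {"action": "tool", "tool": "code_generator", "input": "benchmark"},
        {"action": "final", "answer": "Summary plus benchmark script."},
    ],
    tools,
)
```

Three scripted steps, three iterations, two tool results accumulated in context before the final answer: the same shape as the real exchange.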
Lessons Learned
1. ReAct patterns need careful step limits.
An agent can loop indefinitely. Always set a max iteration count:
if state["iterations"] > 10:
    return {"messages": [AIMessage(content="I've exhausted my reasoning steps.")]}
2. Streaming is hard; test it thoroughly.
Streaming response parsing is easy to get wrong. Tool call detection, token chunking, and error handling all need careful testing. We built a test suite that simulates various LLM response formats.
3. Tool failures shouldn't crash the agent.
When the RAG service is down or the web search times out, the agent should handle it gracefully:
async def call_tool(self, tool_name: str, params: dict):
    try:
        return await self.mcp_client.invoke(tool_name, params)
    except TimeoutError:
        return f"Tool {tool_name} timed out. Continuing with available context."
4. PostgreSQL state persistence is underrated.
Checkpointing agent state to a database is non-trivial but worth it. Users appreciate resuming a multi-turn conversation where they left off, and it makes debugging easier.
5. Observability is essential.
With multiple services, LLM calls, and tool invocations, things break in subtle ways. OpenTelemetry and tracing (Jaeger) let you see the entire flow: which tool took 5 seconds? Did the LLM hallucinate a tool name? These questions are impossible to answer without traces.
What's Next?
I'm exploring:
- Agentic function calling: Letting the agent decide how many tools to call in parallel
- Tool composition: Chains of tools (e.g., search → summarize → translate)
- Fine-tuned models: Training a smaller LLM specifically for tool selection to reduce hallucinations
- Multi-user isolation: Currently all conversations share the same agent instance; I want per-user resource limits
Try It Out
The source code lives at github.com/sahilmalik27/hivechat (currently a private repo). If you'd like access or want to discuss the project, connect with me on LinkedIn.
It runs locally with Docker Compose. It's a great starting point if you're curious about:
- Building with LangGraph
- Integrating MCP tools
- Streaming LLM responses
- Multi-agent orchestration
The architecture is meant to be modular—swap out the RAG tool for your own docs, replace Ollama with a different LLM provider, or add new tools by creating a new MCP server.
Happy building.
Thanks for reading. Follow me for more.