From PDF to Podcast: Building PodForge
What if your research papers, documentation, and dense PDFs could talk to you?
That's the question behind PodForge — a system that takes any PDF and transforms it into an engaging podcast. Not a robot droning through text. Not a simple concatenation of sentences. A real podcast with characters, dialogue, pacing, and personality.
Feed it a 20-page research paper, and PodForge generates a script as if two knowledgeable hosts are discussing the findings. Feed it product documentation, and it becomes a conversational walkthrough. Feed it a dense technical spec, and it pulls out the key insights and presents them as a coherent narrative.
This post is about the design decisions, the architecture, and what it took to build a system that makes PDFs sound good.
The problem with reading PDFs
PDFs are everywhere, but they're terrible for consumption when you're driving, exercising, or doing dishes. Yet PDFs carry the most valuable information — research papers, technical specs, competitor analyses, meeting notes.
The obvious solution is text-to-speech: run a TTS engine on the PDF and you get audio. But that's just... reading. The robot voice drones through paragraphs, loses semantic meaning, and feels like punishment rather than learning.
PodForge asks a different question: What if we treated the PDF as source material for a podcast, not a document to be narrated?
The LLM does the heavy lifting. It reads the entire PDF, extracts the key ideas, structures them as a narrative, and generates a dialogue script as if two intelligent hosts are discussing the material. Then text-to-speech turns that script into audio with character voices.
The result: a 20-minute podcast that captures the essence of a 2-hour read, sounds natural, and actually holds your attention.
Architecture: A system built for real-time
Early versions of PodForge would take your PDF, process it end-to-end, and come back 15 minutes later with a complete podcast. That's fine for batch jobs, but it's a miserable user experience.
The breakthrough was this: the user's browser should see progress in real-time. Not "request submitted" → 15 minutes of silence → "done." But "extraction started" → "summarizing" → "outline generated" → "scripting dialogue" → "processing segment 3 of 8" → "rendering audio" → "done."
That requirement shaped everything. Here's the architecture:
┌────────────────────────────────────────────────┐
│                Browser / Client                │
│              (WebSocket listener)              │
└──────────┬───────────────────────┬─────────────┘
           │                       │ WebSocket: real-time status updates
           ▼                       │
   ┌────────────────┐    ┌─────────┴───────────┐
   │  API Service   │    │  ConnectionManager  │
   │ (FastAPI:8002) │    │   (pub/sub bridge)  │
   │                │    └─────────▲───────────┘
   │ - Accept PDF   │              │ subscribe
   │ - Job creation │    ┌─────────┴───────────┐
   │ - Status check │    │   Pub/sub channel:  │
   └───────┬────────┘    │ "status_updates:all"│
           │             └─────────▲───────────┘
           ▼                       │ publish
     ┌─────────────┐               │
     │    Redis    │───────────────┘
     │ (job store) │
     └─────▲───────┘
           │ status updates
   ┌───────┴──────────┬────────────────────┐
┌──┴───────────┐  ┌───┴────────────┐  ┌────┴─────────┐
│ PDF Service  │  │ Agent Service  │  │ TTS Service  │
│  (Docling)   │  │ (LLM pipeline) │  │ (edge-tts+)  │
└──────┬───────┘  └───────┬────────┘  └──────┬───────┘
       │                  │                  │
       └──────────────────┼──────────────────┘
                          │
                          ▼
                 MinIO (PDF + audio)
The magic happens in the ConnectionManager — it's the bridge between Redis pub/sub and WebSocket. It runs in a background thread, listening to the Redis channel status_updates:all. Every time a service updates job status, it publishes to that channel. The ConnectionManager reads those messages, queues them, and broadcasts them to all connected browser clients via WebSocket.
The client sees every milestone in real-time:
Uploading PDF...
Extracting content...
Summarizing document...
Generating outline...
Creating dialogue script...
Processing audio: segment 1 of 5...
Processing audio: segment 2 of 5...
...
Podcast ready! Download here.
The four services
The system breaks into four microservices, each with a single responsibility:
1. API Service (FastAPI)
The public interface. Accepts PDF uploads, creates jobs, manages the job queue, and serves status. Uses Pydantic for validation, Redis for job storage, and MinIO for PDF persistence.
When a user uploads a PDF:
- Store it in MinIO
- Create a job record in Redis with status pending
- Enqueue the job for the Agent Service
- Return a job ID to the client
The client can poll the status endpoint, which reads from Redis, or subscribe to the WebSocket stream for pushed updates.
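Those four steps fit in a few lines. Here is a minimal sketch of the job-creation path; the `create_job` helper, the `redis` client, and the `minio_put` callable are illustrative stand-ins, not PodForge's actual API:

```python
import uuid

def create_job(pdf_bytes: bytes, redis, minio_put) -> str:
    """Persist the PDF, record a pending job, and enqueue it for the Agent Service."""
    job_id = f"podcast_{uuid.uuid4().hex[:8]}"
    # 1. Store the raw PDF in object storage (MinIO in PodForge)
    minio_put(f"pdfs/{job_id}.pdf", pdf_bytes)
    # 2. Create the job record in Redis with status "pending"
    redis.hset(f"job:{job_id}", mapping={"status": "pending"})
    # 3. Enqueue the job for the Agent Service to pick up
    redis.rpush("jobs:pending", job_id)
    # 4. Hand the job ID back to the client for status checks
    return job_id
```

Everything downstream keys off that job ID: the Redis hash, the pub/sub messages, and the audio files in MinIO.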
2. PDF Service (Docling-powered)
Extracts text, tables, and metadata from the PDF. Uses the Docling library, which is purpose-built for extracting structured content from PDFs — it understands tables, figures, headers, and preserves semantic meaning.
Docling is superior to naive PDF text extraction because PDFs are a presentation format, not a content format. A table in a PDF might be 20 lines of scrambled characters in raw text, but Docling reconstructs the actual table structure.
3. Agent Service (LLM pipeline)
This is the core. It receives the extracted PDF text and orchestrates the transformation into a podcast script.
The pipeline has five stages:
Stage 1: Summarization. Read the full document and extract the key insights and main arguments. The LLM produces a 1-2 paragraph summary.

Stage 2: Outline Generation. Convert the summary into a structured outline: 5-7 main topics, each with 2-3 key points. The outline becomes the skeleton of the script.

Stage 3: Segment Processing. Split the outline into segments (one per topic) and, for each segment, generate the detailed talking points the podcast hosts will discuss.

Stage 4: Dialogue Generation. For each segment, generate a conversation between two hosts: Host A asks questions, Host B provides insights, and they riff on the implications. This stage uses Jinja2 prompt templates specialized for dialogue generation.

Stage 5: Monologue (optional). If the user prefers a single-speaker format, generate a monologue script instead of a dialogue.
Each stage publishes its status to Redis. The ConnectionManager picks it up and broadcasts to the browser.
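The sequencing itself is simple to sketch. In this illustration, the `stage_fns` mapping and the `publish` callable are stand-ins for the real LLM calls and the Redis client:

```python
import json

STAGES = ["summarization", "outline_generation",
          "segment_processing", "dialogue_generation"]

def run_pipeline(job_id, document_text, stage_fns, publish):
    """Run each stage in order, publishing a milestone after each one."""
    payload = document_text
    for i, stage in enumerate(STAGES, start=1):
        # Each stage transforms the payload (text -> summary -> outline -> ...)
        payload = stage_fns[stage](payload)
        # Milestone that the ConnectionManager relays to the browser
        publish("status_updates:all", json.dumps({
            "job_id": job_id,
            "stage": stage,
            "progress_percent": int(100 * i / len(STAGES)),
        }))
    return payload
```

Because every stage reports through the same channel, adding or reordering stages never touches the status-delivery code.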
4. TTS Service
Converts the podcast script into audio. PodForge uses a pluggable architecture here — the Strategy pattern.
Default strategy: edge-tts (free, cloud-based). Microsoft's edge-tts service is free and has reasonable voice quality. No API key required. It's the default because it works instantly and costs nothing.

GPU strategies: XTTS, Bark, Coqui (advanced). If you have a GPU, you can run voice-cloning models locally. XTTS can clone a voice from a short audio sample, Bark has creative voices, and Coqui is multilingual.
The TTS service accepts the podcast script, renders each segment in parallel, and streams the completed audio files to MinIO.
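The fan-out is plain asyncio. Here is a minimal sketch of the parallel rendering; the `synthesize` callable stands in for whichever TTS backend is configured:

```python
import asyncio

async def render_segments(segments, synthesize):
    """Render every script segment concurrently."""
    tasks = [synthesize(seg["text"], seg["voice"]) for seg in segments]
    # gather preserves input order even when segments finish out of order
    return await asyncio.gather(*tasks)

# Demo with a fake backend that just tags the text with the voice:
async def fake_tts(text, voice):
    await asyncio.sleep(0)  # pretend to do audio work
    return f"{voice}:{text}"

clips = asyncio.run(render_segments(
    [{"text": "intro", "voice": "Alice"}, {"text": "reply", "voice": "Bob"}],
    fake_tts,
))
```

The ordering guarantee matters here: segments can render concurrently, but the final podcast must concatenate them in script order.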
The patterns that make it work
1. Redis pub/sub → WebSocket bridge (ConnectionManager)
This is the heartbeat of the system:
# ConnectionManager runs a Redis listener in a background thread
import asyncio
import threading

class ConnectionManager:
    def __init__(self, redis_client, loop: asyncio.AbstractEventLoop):
        self.pubsub = redis_client.pubsub()
        self.pubsub.subscribe("status_updates:all")
        self.loop = loop                      # the FastAPI event loop
        self.message_queue = asyncio.Queue()  # consumed on that loop
        threading.Thread(target=self._listen_redis, daemon=True).start()

    def _listen_redis(self):
        # Blocks in the background thread, reading from Redis
        for message in self.pubsub.listen():
            if message['type'] == 'message':
                # Hand the message to the event loop thread-safely
                self.loop.call_soon_threadsafe(
                    self.message_queue.put_nowait, message['data']
                )

    async def broadcast_to_websockets(self, websocket_connections):
        # Main async loop: pull from the queue, broadcast to all clients
        while True:
            message = await self.message_queue.get()
            for ws in websocket_connections:
                await ws.send_json(message)

The thread safety here is critical. Redis pub/sub blocks, so it runs in a dedicated OS thread. But asyncio.Queue is not thread-safe on its own, so the listener hands each message across with loop.call_soon_threadsafe, which schedules the enqueue on the event loop. The broadcast loop then drains the queue inside the FastAPI async context.
This pattern decouples the message source (Redis) from the message delivery mechanism (WebSocket). Add a new delivery channel (Server-Sent Events, gRPC, whatever) — just consume from the same queue.
2. Job status flow
When the Agent Service processes a segment, it publishes status like this:
# In the Agent Service
job_id = "podcast_2024_0001"
status = {
    "job_id": job_id,
    "stage": "dialogue_generation",
    "segment": 3,
    "total_segments": 8,
    "progress_percent": 37,
    "message": "Generating dialogue for segment 3: Technical Implementation",
}

# Store in a Redis hash (for queries)
redis.hset(f"job:{job_id}", mapping=status)

# Publish for real-time updates
redis.publish("status_updates:all", json.dumps(status))
The Redis hash (job:{id}) is the source of truth for the job state. A client can query http://api.example.com/job/{id} at any time and get the latest status. But if they're watching, the pub/sub channel delivers updates in real-time.
This is elegant because it handles both polling and streaming gracefully. Web browsers can fall back to polling if WebSocket fails. Real-time clients get instant updates.
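The polling side is just a loop over the job hash. A minimal sketch of the fallback path, assuming a redis-like client and that the final status sets the stage field to "done" (the field name and helper are illustrative):

```python
import time

def poll_until_done(redis, job_id, interval=1.0, timeout=900.0):
    """Fallback when WebSocket is unavailable: poll the job:{id} hash."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = redis.hgetall(f"job:{job_id}")
        if status.get("stage") == "done":
            return status        # latest state, straight from the hash
        time.sleep(interval)     # back off between reads
    raise TimeoutError(f"job {job_id} still running after {timeout}s")
```

Because the hash is the source of truth, a client that reconnects mid-job loses nothing: the next read returns the current state.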
3. LLM pipeline with Jinja2 templates
The Agent Service doesn't hard-code prompts. Instead, it uses a template system:
# templates/podcast_prompts.yaml
summarization:
  template: |
    You are a senior researcher. Read this document and extract the key ideas:

    {{ content }}

    Provide a 1-2 paragraph summary that captures the main arguments.

dialogue_generation:
  template: |
    You are generating a podcast dialogue between two hosts: Alice (asks
    questions) and Bob (provides insights).

    Topic: {{ segment_title }}
    Talking points: {{ talking_points }}

    Generate a natural-sounding dialogue (600-800 words) where:
    - Alice opens with a question about the topic
    - Bob explains the concept with examples
    - Alice challenges an assumption
    - Bob clarifies and adds nuance
    - End with a takeaway

    Format as:
    Alice: [dialogue]
    Bob: [dialogue]
import jinja2
import yaml

class PodcastPrompts:
    def __init__(self, path="templates/podcast_prompts.yaml"):
        with open(path) as f:
            self.templates = yaml.safe_load(f)

    def summarize(self, content):
        template = self.templates['summarization']['template']
        prompt = jinja2.Template(template).render(content=content)
        return llm(prompt)

    def generate_dialogue(self, segment_title, talking_points):
        template = self.templates['dialogue_generation']['template']
        prompt = jinja2.Template(template).render(
            segment_title=segment_title,
            talking_points=talking_points,
        )
        return llm(prompt)
This approach is powerful because:
- Prompts live in version control and are auditable
- You can tune prompts without redeploying code
- A/B test different prompt versions easily
- Team members can iterate on prompts without touching Python
The templates separate concerns: prompt engineering from orchestration.
4. Pluggable TTS with Strategy pattern
The TTS service has multiple implementations:
from abc import ABC, abstractmethod

import edge_tts

class TTSBackend(ABC):
    @abstractmethod
    async def synthesize(self, text: str, voice: str, output_path: str):
        ...

class EdgeTTSBackend(TTSBackend):
    async def synthesize(self, text: str, voice: str, output_path: str):
        # Free, no setup required
        communicate = edge_tts.Communicate(text, voice=f"en-US-{voice}Voice")
        await communicate.save(output_path)

class XTTSBackend(TTSBackend):
    def __init__(self, speaker_reference_audio: str, gpu_device="cuda:0"):
        self.model = load_xtts_model(device=gpu_device)
        # Path to the reference audio sample whose voice we clone
        self.speaker_reference_audio = speaker_reference_audio

    async def synthesize(self, text: str, voice: str, output_path: str):
        # Clone the voice from the reference sample
        speaker_embedding = self.model.get_speaker_embedding(
            self.speaker_reference_audio
        )
        audio = self.model.tts(
            text=text,
            speaker_wav=speaker_embedding,
            language="en",
        )
        audio.save(output_path)

class TTSService:
    def __init__(self, backend_type: str = "edge-tts"):
        if backend_type == "edge-tts":
            self.backend = EdgeTTSBackend()
        elif backend_type == "xtts":
            self.backend = XTTSBackend("voices/reference.wav")
        # Add Bark, Coqui, etc.

    async def render_podcast(self, script):
        # Same interface regardless of backend
        for segment in script.segments:
            await self.backend.synthesize(
                text=segment.dialogue,
                voice=segment.voice,
                output_path=f"audio/{segment.id}.mp3",
            )
This pattern makes it trivial to add new TTS backends without touching the orchestration code. The API and Agent Service don't care whether you're using edge-tts or XTTS — they call the same interface.
Docker, MinIO, and distributed tracing
The system runs as 7+ containers:
- API Service (FastAPI, port 8002)
- PDF Service (Docling extractor)
- Agent Service (LLM orchestrator)
- TTS Service (text-to-speech)
- Redis (job store + pub/sub)
- MinIO (object storage for PDFs and audio)
- Jaeger (OpenTelemetry tracing)
Each service publishes traces to Jaeger. You can watch a single PDF flow through all four services, see which LLM call took 8 seconds, and which segment took longest to render.
The shared Python package (shared/) defines common types: Job, Status, PodcastSegment. This ensures all services agree on the data model.
Docker Compose orchestrates everything. A single docker-compose up brings up the whole system with 7+ containers, volumes, networks, and environment variables.
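To make the shape concrete, here is a trimmed compose sketch of that layout. The service names, build paths, and image tags are illustrative assumptions, not PodForge's actual file:

```yaml
# docker-compose.yml (illustrative sketch)
services:
  api:
    build: ./services/api        # FastAPI front door
    ports: ["8002:8002"]
    depends_on: [redis, minio]
  pdf:
    build: ./services/pdf        # Docling extractor
  agent:
    build: ./services/agent      # LLM pipeline
    depends_on: [redis]
  tts:
    build: ./services/tts        # text-to-speech
    depends_on: [redis, minio]
  redis:
    image: redis:7               # job store + pub/sub
  minio:
    image: minio/minio           # object storage for PDFs and audio
    command: server /data
  jaeger:
    image: jaegertracing/all-in-one  # OpenTelemetry traces
```

Everything talks over the default compose network, so services reach each other by name (redis, minio) rather than hard-coded hosts.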
Example: A research paper becomes a podcast
Let me walk through what actually happens when you feed PodForge a real PDF. Say you upload a 15-page research paper on transformer architectures.
Step 1 — Upload and extraction
You drag the PDF into the web UI. The API Service stores it in MinIO, creates a job, and the PDF Service extracts the content via Docling. Your browser shows:
✓ PDF uploaded
⏳ Extracting content...
✓ Extracted 15 pages, 4 tables, 2 figures
Step 2 — Summarization
The Agent Service receives the extracted markdown and sends it to the LLM:
Agent: Summarizing document...
LLM output: "This paper introduces a novel attention mechanism that reduces
the quadratic complexity of standard transformers to linear time. The key
insight is replacing full self-attention with a locality-sensitive hashing
scheme that groups similar tokens into buckets..."
Step 3 — Outline generation
The LLM structures the summary into a podcast outline:
{
"title": "Rethinking Attention in Transformers",
"segments": [
{"topic": "Why transformers are slow", "points": ["quadratic scaling", "memory bottleneck"]},
{"topic": "The hashing trick", "points": ["LSH attention", "bucket-based grouping"]},
{"topic": "Benchmarks and results", "points": ["speed gains", "quality tradeoffs"]},
{"topic": "What this means for the field", "points": ["scaling implications", "open questions"]}
]
}
Step 4 — Dialogue generation
For each segment, the LLM generates a natural conversation:
Alice: So the big problem with transformers is that attention scales quadratically, right?
Like, every token has to look at every other token.
Bob: Exactly. If you have a sequence of 10,000 tokens, that's 100 million attention
computations. It's why long-context models are so expensive to run.
Alice: And this paper says you don't actually need all those comparisons?
Bob: Right — their insight is that most attention weights are near zero anyway.
Only a small fraction of token pairs actually matter. So they use a hashing
trick to find the tokens that are likely to attend to each other...
Step 5 — Audio synthesis
The TTS Service renders each segment with two distinct voices (one per host), concatenates them with intro/outro music, and uploads the final MP3 to MinIO. Your browser shows:
⏳ Rendering audio: segment 1 of 4...
⏳ Rendering audio: segment 2 of 4...
⏳ Rendering audio: segment 3 of 4...
⏳ Rendering audio: segment 4 of 4...
✓ Podcast ready! Duration: 12 minutes 34 seconds
▶ [Play] [Download MP3]
The whole process takes about 8–15 minutes depending on document length and TTS backend. But because every step streams to the browser, it never feels like waiting.
What I learned
Pub/sub is underrated for long-running jobs. The standard pattern is "POST → wait → GET status." Pub/sub flips it: the server pushes updates to the client in real-time. For user experience, it's night and day. Users see progress. They don't feel abandoned.
Separate concerns more aggressively than you think necessary. The PDF service is dumb — it just extracts. The Agent Service is dumb — it just generates scripts. The TTS service is dumb — it just renders audio. But together, they compose into something sophisticated. When you need to add voice cloning or support a new document format, you touch exactly one service.
LLM pipelines need templates, not strings. Hard-coded prompts scattered across your code are a maintenance nightmare. Jinja2 templates keep prompts versioned, auditable, and separable from logic.
Real-time UI requires threading. Redis pub/sub blocks. FastAPI is async. Bridging them cleanly requires a background thread feeding into an async queue. It sounds messy, but the pattern is straightforward once you understand it.
Architecture shows up in user experience. The reason PodForge feels snappy, even though it takes 15 minutes to process a PDF, is because the architecture doesn't hide the work. Users see every step. They can step away and check back. They understand what's happening.
What's next
PodForge is on GitHub at github.com/sahilmalik27/podforge (private repo). The current version supports PDF → podcast with dialogue and monologue modes, multiple TTS backends, and real-time progress streaming. If you'd like access or want to chat about the project, connect with me on LinkedIn.
Future directions:
- Multi-language support (script generation in Spanish, Mandarin, Hindi, etc.)
- Custom voice synthesis (give it an audio sample and it learns to sound like that person)
- Interactive transcript editor (let users tweak the script before final render)
- API for programmatic access (for integration with document management systems)
If you've been sitting on a pile of PDFs wishing you could consume them while driving, give it a try. If you build something interesting with it — or want to contribute an improvement — I'd love to hear from you.
The best systems are the ones that change how people interact with information.
PodForge started as a thought experiment: "What if my research papers could talk?" It became a lesson in distributed systems, real-time communication, and designing systems where the architecture serves the user experience.
Thanks for reading. Follow me for more.