Voice Agent Lab

System diagrams

A concrete pipeline: seed thread URL → scrape + normalize → persona research → script generation → voice render → mix → publish → website playback. The public UI reads from `/api/episodes`.
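A sketch of what the UI might expect back from `/api/episodes`. The field names here are illustrative assumptions, not the project's actual schema:

```typescript
// Hypothetical shape of one entry in the /api/episodes response.
// Field names are assumptions for illustration; the real schema may differ.
interface EpisodeSummary {
  episodeId: string;
  title: string;
  audioUrl: string; // /local-episodes/... in dev, an S3/CloudFront URL in prod
  createdAt: string; // ISO timestamp
}

// Minimal runtime guard the UI could apply before rendering an episode card.
function isEpisodeSummary(x: unknown): x is EpisodeSummary {
  const e = x as Partial<EpisodeSummary> | null | undefined;
  return (
    typeof e?.episodeId === "string" &&
    typeof e?.title === "string" &&
    typeof e?.audioUrl === "string" &&
    typeof e?.createdAt === "string"
  );
}
```

Because the same route is backed by the filesystem in dev and Postgres in production, a guard like this keeps the UI agnostic about which store produced the data.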

End-to-end workflow

```mermaid
flowchart LR
    A["Seed Reddit thread URL"] --> B["Scrape post + full comment tree"]
    B --> R1["seed-thread.raw.json"]
    B --> R2["seed-thread.tree.json"]
    B --> R3["seed-thread.normalized.json"]
    R3 --> C["Research agent (optional)<br/>web search + artifact capture"]
    C --> C2["persona-pack.json"]
    R3 --> D["Writer room (planner_agents)<br/>sequential reply-linked turns"]
    C2 --> D
    D --> S1["script.json<br/>lineId, speaker, respondsToLineId, timing"]
    D --> S2["script.txt"]
    S1 --> T["TTS (ElevenLabs)<br/>per-line render"]
    T --> ST["stems/*.mp3"]
    ST --> M["Mix with overlap rules"]
    M --> OUT["episode-mix.mp3"]
    subgraph DEV["Local mode: EPISODES_STORE=fs"]
        OUT --> L1["output/episodes/{episodeId}/"]
        L1 --> LE["/local-episodes/{episodeId}/{file}"]
        LE --> API1["/api/episodes from FS"]
    end
    subgraph PROD["Production mode: EPISODES_STORE=postgres"]
        OUT --> P1["Upload to S3<br/>audio + artifacts"]
        P1 --> P2["Index in Neon<br/>podcast_episodes table"]
        P2 --> API2["/api/episodes from Postgres"]
    end
    API1 --> UI["Website UI /podcast/"]
    API2 --> UI
```

Podcast generator architecture

Conversation quality comes from explicit turn planning, reply targeting, and controlled interjections before rendering.

```mermaid
flowchart TB
    IN1["Seed artifacts<br/>normalized thread + persona pack"] --> CTX["Context builder"]
    CTX --> PLAN["Planner<br/>beats + speaking goals"]
    PLAN --> SCH["Turn scheduler<br/>who speaks next and why"]
    subgraph AGENTS["Persona agent pool"]
        HOST["Host agent"]
        POST["Post reader agent"]
        COMM["Comment reader agent"]
        PANA["Panelist A agent"]
        PANB["Panelist B agent"]
    end
    SCH --> HOST
    SCH --> POST
    SCH --> COMM
    SCH --> PANA
    SCH --> PANB
    HOST --> CAND["Candidate proposals<br/>text + intent + reply target"]
    POST --> CAND
    COMM --> CAND
    PANA --> CAND
    PANB --> CAND
    CAND --> ARB["Turn arbiter<br/>coherence, timing, anti-talkover"]
    ARB --> INT["Interjection lane<br/>short overlap comments only"]
    INT --> DIR["Performance director<br/>pace, emphasis, energy"]
    ARB --> DIR
    DIR --> SCR["Script compiler<br/>lineId, speaker, respondsToLineId, startMs"]
    SCR --> VOX["Voice renderer (ElevenLabs)"]
    VOX --> MIX["Mixer<br/>ducking, overlaps, spacing"]
    MIX --> OUT2["Final episode mix + artifacts"]
```
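The turn arbiter's selection can be sketched as scoring over the candidate proposals. The weights below are illustrative assumptions, not the real heuristics: reward a candidate that replies to the most recent line (coherence) and penalize the speaker who just spoke (anti-talkover):

```typescript
interface Candidate {
  speaker: string;
  text: string;
  respondsToLineId: string | null;
}

// Pick the next turn. Weights are illustrative assumptions:
// +2 for replying to the latest line, -3 for speaking twice in a row.
function arbitrate(
  candidates: Candidate[],
  lastLineId: string | null,
  lastSpeaker: string | null
): Candidate {
  const score = (c: Candidate): number =>
    (lastLineId !== null && c.respondsToLineId === lastLineId ? 2 : 0) +
    (c.speaker === lastSpeaker ? -3 : 0);
  return candidates.reduce((best, c) => (score(c) > score(best) ? c : best));
}
```

A real arbiter would also weigh timing and beat coverage, but even this shape explains why the diagram routes every agent's proposal through a single chokepoint.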

Why sequential turns?

The writer generates one speaker line at a time, each with an explicit reply target. That forces “actual responding” behavior instead of every agent free-associating in parallel.

Key concepts: `respondsToLineId`, turn arbitration
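Sequential generation implies a simple invariant: every `respondsToLineId` must point at a line that already exists earlier in the script. A sketch of that check (the validator itself is hypothetical; the field names follow the `script.json` fields named above):

```typescript
interface ScriptLine {
  lineId: string;
  speaker: string;
  respondsToLineId: string | null;
}

// A reply target must reference an earlier line; the opening line has none.
function validateReplyTargets(lines: ScriptLine[]): boolean {
  const seen = new Set<string>();
  for (const line of lines) {
    if (line.respondsToLineId !== null && !seen.has(line.respondsToLineId)) {
      return false; // forward or dangling reference
    }
    seen.add(line.lineId);
  }
  return true;
}
```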

Why save artifacts?

Debuggability. You can open the raw scrape, verify what was selected as “sources,” and trace every rendered line back to a script and seed.

Key artifacts: `sources.json`, `seed-thread.*.json`
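With the artifacts on disk, tracing is a pure lookup: take a rendered stem's `lineId`, find it in `script.json`, and walk the reply chain back to the line that opened the exchange. A hypothetical walk (the helper and its shapes are illustrative):

```typescript
interface ScriptLine {
  lineId: string;
  speaker: string;
  respondsToLineId: string | null;
}

// Walk from a rendered line back through its reply targets, e.g. to see
// which earlier line (and ultimately which seed content) it was reacting to.
function replyChain(lines: ScriptLine[], lineId: string): string[] {
  const byId = new Map<string, ScriptLine>(lines.map((l) => [l.lineId, l]));
  const chain: string[] = [];
  let current = byId.get(lineId);
  while (current) {
    chain.push(current.lineId);
    current = current.respondsToLineId
      ? byId.get(current.respondsToLineId)
      : undefined;
  }
  return chain;
}
```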

Serving strategy

Local dev serves files directly from disk at `/local-episodes`. Production should publish files to object storage (S3/CloudFront) and store their URLs in Postgres, so the web app can run statelessly (Vercel/AWS).

Key infrastructure: S3, Neon, Vercel
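The store switch can be reduced to one function: given the `EPISODES_STORE` mode, resolve where an episode file is served from. A minimal sketch; the CDN host below is a placeholder, not the project's actual bucket or domain:

```typescript
// Resolve the public URL for an episode file based on the store mode.
// "fs" serves straight from disk via the dev route; "postgres" mode assumes
// the file was uploaded to object storage at publish time.
function episodeFileUrl(
  store: "fs" | "postgres",
  episodeId: string,
  file: string,
  cdnBase = "https://cdn.example.com" // placeholder host, not the real CDN
): string {
  return store === "fs"
    ? `/local-episodes/${episodeId}/${file}`
    : `${cdnBase}/episodes/${episodeId}/${file}`;
}
```

Because the rest of the app only ever sees a URL, the web tier stays stateless either way.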