Power & Context

Introduction

Power & Context is a fully self-hosted, microservices-based platform that transforms any article URL into a polished, NPR-style podcast episode. It pairs local large language models with zero-shot voice-cloning text-to-speech to generate two-host conversational audio, without requiring any third-party API (OpenAI is available only as an optional backend). Built as a collection of independent FastAPI services orchestrated with Docker Compose, the stack covers the entire pipeline, from article extraction and contextual research to script writing and high-quality narration.

Status: MVP Complete — The core pipeline is functional end-to-end: articles are extracted, scripts are generated, and audio is synthesized across multiple TTS backends. The project continues to evolve with deployment hardening, additional voice options, and quality improvements.

Problem & Solution

The Problem

Turning written content into engaging audio presents several challenges:

Vendor lock-in — Most podcast and TTS solutions depend on paid cloud APIs with usage limits and recurring costs
Missing context — Naively reading an article aloud strips away the analysis, nuance, and surrounding context that make audio engaging
Robotic narration — Generic text-to-speech sounds flat and lacks the natural, conversational feel of a real podcast
Monolithic tooling — Single-application solutions are hard to scale, swap components in, or run partially offline
Privacy concerns — Sending content and source material to external services isn't always acceptable
Manual effort — Producing a podcast episode from an article traditionally requires writing, recording, and editing by hand

The Solution

Power & Context addresses these challenges with a modular, local-first architecture:

Self-hosted by default — Runs entirely on your own hardware or VPS with local LLMs (Ollama) and local TTS, no required API keys
Context-aware extraction — Crawls the source article and linked pages to assemble richer context before scripting
NPR-style script generation — Produces a two-host conversational script with analysis, framing, and natural dialogue
Pluggable TTS backends — Choose Piper, Coqui XTTS-v2, F5-TTS, or OpenAI at runtime, with zero-shot voice cloning support
Microservices architecture — Independent, individually scalable services connected over HTTP and a job queue
Async job processing — A Redis-backed RQ worker handles long-running generation without blocking the API
Production-ready deployment — Docker Compose orchestration with GitHub Actions CI/CD to a VPS

Technical Implementation

The platform is composed of four independent services plus shared infrastructure, all orchestrated through Docker Compose:

Context Service (FastAPI, Playwright)
- Extracts article content from a URL
- Optionally crawls linked pages to gather additional context
- Supports Mercury and Playwright-based extraction strategies
- Exposes a clean /api/context/from-url endpoint

Script Service (FastAPI, Ollama / OpenAI)
- Generates NPR-style two-host podcast scripts from extracted context
- Uses a local Ollama LLM by default (e.g. mistral-nemo:12b) with an OpenAI fallback
- Returns a structured episode package ready for narration

TTS Service (FastAPI, multi-backend)
- Supports Piper, Coqui XTTS-v2, F5-TTS, and OpenAI TTS
- Runtime backend selection via environment variable or per-request override
- Zero-shot voice cloning for distinct HOST1 and HOST2 voices
- Batch generation and MP3/WAV output

Podcast Service (FastAPI API + RQ Worker)
- Orchestrates the full pipeline: context → script → audio → storage
- Async job queue backed by Redis for long-running generation
- Optional Dropbox integration for episode storage and sharing
- Simple web interface for submitting URLs and tracking job status

Shared Infrastructure
- Redis for the job queue and status tracking
- Ollama as the local LLM runtime
- FFmpeg for audio processing and stitching
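
The orchestration step described above (context, then script, then audio, then storage) can be pictured as a chain of stage functions. The following is a minimal sketch, not the project's actual code: `Episode`, `run_pipeline`, and the four stage parameters are hypothetical names, and the lambda stubs stand in for HTTP calls to the real services.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Episode:
    """Hypothetical container for the result of one pipeline run."""
    url: str
    script: str
    audio: bytes
    location: str


def run_pipeline(
    url: str,
    extract: Callable[[str], str],
    write_script: Callable[[str], str],
    synthesize: Callable[[str], bytes],
    store: Callable[[bytes], str],
) -> Episode:
    """Run context -> script -> audio -> storage, one stage after another."""
    context = extract(url)
    script = write_script(context)
    audio = synthesize(script)
    location = store(audio)
    return Episode(url=url, script=script, audio=audio, location=location)


# Stub stages: in a real deployment each would call one of the services.
episode = run_pipeline(
    "https://example.com/article",
    extract=lambda url: f"context for {url}",
    write_script=lambda ctx: f"HOST1: Today we look at {ctx}.",
    synthesize=lambda script: script.encode("utf-8"),
    store=lambda audio: f"memory://episodes/{len(audio)}",
)
print(episode.location)
```

Passing the stages in as callables keeps the orchestrator independent of any one service implementation, which mirrors the swap-a-component goal of the architecture.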

Key Features

Context-Aware Article Extraction

Rather than reading a single page verbatim, the context service can crawl the source URL and its linked pages, assembling a fuller picture of the topic. This richer context feeds directly into script generation, producing episodes that explain and analyze rather than simply recite.
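
To make the crawl-and-combine idea concrete, here is a small sketch that collects links from a page and fetches a bounded number of them. It assumes a plain HTML page and uses the standard library's `html.parser` as a stand-in for the Mercury and Playwright extractors named earlier; `LinkCollector`, `gather_context`, `max_links`, and the in-memory `PAGES` site are all hypothetical.

```python
from html.parser import HTMLParser
from typing import Callable, List
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect unique absolute http(s) links from anchor tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        absolute = urljoin(self.base_url, href)
        # Skip mailto:, javascript:, and other non-web schemes.
        if urlparse(absolute).scheme in ("http", "https") and absolute not in self.links:
            self.links.append(absolute)


def gather_context(url: str, fetch: Callable[[str], str], max_links: int = 3) -> List[str]:
    """Return the source page followed by up to max_links linked pages."""
    html = fetch(url)
    collector = LinkCollector(url)
    collector.feed(html)
    linked = [link for link in collector.links if link != url][:max_links]
    return [html] + [fetch(link) for link in linked]


# A tiny in-memory "site" replaces real network access for the example.
PAGES = {
    "https://example.com/a": '<p>Story</p><a href="/b">b</a><a href="mailto:x@example.com">mail</a>',
    "https://example.com/b": "<p>Background</p>",
}
pages = gather_context("https://example.com/a", PAGES.__getitem__)
print(len(pages))
```

Capping the number of followed links keeps the assembled context, and therefore the prompt sent to the script service, to a predictable size.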

NPR-Style Script Generation

The script service generates conversational, two-host dialogue in the style of public radio — complete with framing, analysis, and natural back-and-forth. It runs on a local Ollama model by default, keeping content private, and gracefully falls back to OpenAI when configured.
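
The local-first behavior with an optional fallback can be sketched as follows. This is an illustration under assumptions rather than the service's real logic: `generate_script` and both stub generators are hypothetical, and the simulated outage exists only to exercise the fallback path.

```python
from typing import Callable, Optional, Tuple

Generator = Callable[[str], str]


def generate_script(
    context: str,
    primary: Generator,
    fallback: Optional[Generator] = None,
) -> Tuple[str, str]:
    """Return (script, source), preferring the primary generator."""
    try:
        return primary(context), "primary"
    except Exception:
        # Without a configured fallback, surface the original failure.
        if fallback is None:
            raise
        return fallback(context), "fallback"


def unreachable_local_model(context: str) -> str:
    raise ConnectionError("local model unreachable")  # simulated outage


def cloud_stub(context: str) -> str:
    return f"HOST1: Let's unpack {context}.\nHOST2: Happy to."


script, source = generate_script("the article", unreachable_local_model, cloud_stub)
print(source)
```

Leaving `fallback` unset keeps all content on local hardware: a failure then raises instead of silently sending the context to an external provider.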

Multi-Backend Voice Cloning TTS

The TTS service is the heart of the audio experience, supporting four interchangeable backends:

Piper (~200MB) — Fast and lightweight, ideal for CPU
Coqui XTTS-v2 (~2GB) — Strong quality with voice cloning
F5-TTS (~16GB+) — Highest quality, RAM-intensive
OpenAI (~50MB) — Cloud-based, no local models

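Every backend can assign distinct HOST1 and HOST2 voices, giving each episode a consistent two-host sound. Zero-shot cloning from short reference audio clips applies to the backends that support it, such as Coqui XTTS-v2 and F5-TTS; Piper and OpenAI rely on their own prebuilt voices.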
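
One common way to make backends interchangeable is a small registry keyed by name, with a per-request override taking precedence over an environment default. The sketch below assumes that design; the `TTS_BACKEND` variable, the `register` decorator, and the stub engines are hypothetical and do not call any real TTS library.

```python
import os
from typing import Callable, Dict, Mapping, Optional

# Maps a backend name to a function taking (text, speaker) and returning audio bytes.
BACKENDS: Dict[str, Callable[[str, str], bytes]] = {}


def register(name: str):
    """Decorator that adds a synthesis function to the registry."""
    def decorator(fn: Callable[[str, str], bytes]):
        BACKENDS[name] = fn
        return fn
    return decorator


@register("piper")
def piper_stub(text: str, speaker: str) -> bytes:
    return f"[piper:{speaker}] {text}".encode("utf-8")


@register("xtts")
def xtts_stub(text: str, speaker: str) -> bytes:
    return f"[xtts:{speaker}] {text}".encode("utf-8")


def synthesize(
    text: str,
    speaker: str,
    backend: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> bytes:
    """Use the per-request backend if given, else the environment default."""
    name = (backend or env.get("TTS_BACKEND", "piper")).lower()
    try:
        engine = BACKENDS[name]
    except KeyError:
        raise ValueError(f"unknown TTS backend: {name!r}") from None
    return engine(text, speaker)


print(synthesize("Welcome back.", "HOST1", backend="xtts"))
```

Adding a backend under this scheme means registering one more function; callers and the selection logic stay unchanged.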

Asynchronous Job Pipeline

Podcast generation is a long-running process, so the podcast service submits work to a Redis-backed RQ queue. The API returns immediately with a job ID, while a dedicated worker handles extraction, scripting, synthesis, and storage in the background. Clients poll for status and retrieve a download link when the episode is ready.
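
The submit-then-poll flow can be demonstrated without Redis. The sketch below substitutes an in-process `queue.Queue` and a thread for the Redis-backed RQ queue and worker, purely to show the shape of the pattern; the `jobs` dictionary, the status names, and the download path are illustrative assumptions.

```python
import queue
import threading
import uuid
from typing import Dict

jobs: Dict[str, dict] = {}   # stand-in for job state kept in Redis
work = queue.Queue()         # stand-in for the RQ queue


def submit(url: str) -> str:
    """Record the job, enqueue it, and return immediately with its ID."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"status": "queued", "url": url, "download_url": None}
    work.put(job_id)
    return job_id


def worker() -> None:
    """Process jobs until a None sentinel arrives."""
    while True:
        job_id = work.get()
        if job_id is None:
            break
        job = jobs[job_id]
        job["status"] = "started"
        # A real worker would run extraction, scripting, synthesis, and storage here.
        job["download_url"] = f"/episodes/{job_id}.mp3"
        job["status"] = "finished"
        work.task_done()


thread = threading.Thread(target=worker, daemon=True)
thread.start()

job_id = submit("https://example.com/article")  # returns before the work is done
work.join()                                      # the demo waits; a client would poll
work.put(None)                                   # stop the worker
thread.join()
print(jobs[job_id]["status"])
```

The caller only ever holds a job ID, so the API process stays responsive however long generation takes.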

Runtime Configurability

Nearly every aspect of the stack is configurable through environment variables — LLM model and endpoint, TTS backend and voice references, emotion and language settings, storage credentials, and service URLs — making it easy to tune for available hardware or swap components without code changes.
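
Environment-driven configuration of this kind is often gathered into one typed settings object at startup. The following sketch assumes that approach; the variable names (`OLLAMA_BASE_URL`, `LLM_MODEL`, `TTS_BACKEND`, `TTS_LANGUAGE`) and the defaults are hypothetical, apart from the mistral-nemo:12b model mentioned earlier and Ollama's usual local port.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings object; field names are illustrative."""
    ollama_base_url: str
    llm_model: str
    tts_backend: str
    tts_language: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from the environment, falling back to assumed defaults."""
    return Settings(
        ollama_base_url=env.get("OLLAMA_BASE_URL", "http://localhost:11434"),
        llm_model=env.get("LLM_MODEL", "mistral-nemo:12b"),
        tts_backend=env.get("TTS_BACKEND", "piper").lower(),
        tts_language=env.get("TTS_LANGUAGE", "en"),
    )


settings = load_settings({"TTS_BACKEND": "XTTS"})
print(settings.tts_backend, settings.llm_model)
```

Accepting the environment as a parameter makes the loader easy to test and keeps every default in one place.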

API Architecture

The services expose clean, focused HTTP APIs:

Context Service

POST /api/context/from-url — Extract and combine context from a URL and linked pages
GET /health — Service health and extractor availability

Script Service

POST /api/script — Generate an NPR-style episode script from context
GET /health — Service health and model availability

TTS Service

POST /api/tts — Generate narration for a chunk of text and speaker
POST /api/tts/batch — Batch-generate multiple audio chunks
GET /health — Backend status

Podcast Service

POST /api/generate — Submit an article URL for podcast generation
GET /api/job/{job_id} — Check job status and retrieve the download URL
GET /health — Pipeline and dependency health
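
A client of the Podcast Service would typically submit a URL and then poll the job endpoint. The sketch below builds the request and the polling loop with the standard library only. The endpoint paths come from the list above, while the JSON field names (`url`, `status`, `download_url`), the status values, and the port are assumptions, since the request and response schemas are not specified here.

```python
import json
import time
from typing import Callable
from urllib.request import Request


def build_generate_request(base_url: str, article_url: str) -> Request:
    """Build (but do not send) the POST /api/generate request."""
    body = json.dumps({"url": article_url}).encode("utf-8")  # field name assumed
    return Request(
        f"{base_url.rstrip('/')}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def wait_for_job(
    fetch_status: Callable[[str], dict],
    job_id: str,
    interval: float = 2.0,
    max_polls: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job reaches a terminal status or the poll budget runs out."""
    for _ in range(max_polls):
        status = fetch_status(job_id)
        if status.get("status") in ("finished", "failed"):  # status names assumed
            return status
        sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")


request = build_generate_request("http://localhost:8000", "https://example.com/article")
print(request.get_method(), request.full_url)
```

In practice `fetch_status` would issue GET /api/job/{job_id} and decode the JSON body; injecting it, together with `sleep`, lets the loop be exercised without a running service.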

Deployment & Operations

Containerized Orchestration

The entire stack is defined in Docker Compose, with separate development and production configurations. Each service builds from its own Dockerfile, declares health checks, and sets resource constraints, including memory limits for the LLM and worker containers.

CI/CD to a VPS

A GitHub Actions workflow handles deployment to a VPS over SSH, aligned with the project's other production services. This enables push-to-deploy updates for the self-hosted stack.

Local LLM Flexibility

The stack can run Ollama as a local container or point at a remote Ollama instance over a private network (e.g. Tailscale), letting heavier models run on dedicated hardware while the rest of the pipeline stays lightweight.
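
Switching between a local and a remote Ollama instance comes down to changing the base URL. The sketch below builds a non-streaming request for Ollama's /api/generate endpoint; the `OLLAMA_BASE_URL` variable, the `build_ollama_request` helper, and the remote hostname are hypothetical placeholders, not values taken from the project.

```python
import json
import os
from typing import Optional
from urllib.request import Request


def build_ollama_request(
    prompt: str,
    model: str = "mistral-nemo:12b",
    base_url: Optional[str] = None,
) -> Request:
    """Build (but do not send) a non-streaming generate request for Ollama."""
    base = base_url or os.environ.get("OLLAMA_BASE_URL", "http://localhost:11434")
    payload = {"model": model, "prompt": prompt, "stream": False}
    return Request(
        f"{base.rstrip('/')}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Same code path for both targets; only the base URL differs.
local = build_ollama_request("Summarize the article.", base_url="http://localhost:11434")
remote = build_ollama_request("Summarize the article.", base_url="http://ollama-host.example:11434/")
print(local.full_url)
print(remote.full_url)
```

Sending the request with `urllib.request.urlopen` and reading the `response` field of the returned JSON would complete the call; that step is omitted so the example runs without a live Ollama server.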

Educational Applications

This project is a practical reference for engineers exploring:

Microservices design — Decomposing a pipeline into independent, HTTP-connected services
Async job processing — Using Redis and RQ for long-running background work
Local AI integration — Running LLMs and TTS models entirely on-premises
Docker Compose orchestration — Coordinating multiple services with health checks and resource limits
Pluggable architectures — Designing systems where components (like TTS backends) can be swapped at runtime
CI/CD for self-hosted apps — Automating deployment to a VPS with GitHub Actions

Target Users

The platform is designed to serve:

Developers — Building self-hosted AI and audio pipelines
Podcasters & Content Creators — Generating audio from written content automatically
AI Engineers — Experimenting with local LLMs and voice cloning TTS
Self-Hosting Enthusiasts — Running privacy-preserving, API-free tooling
DevOps Engineers — Studying microservices orchestration and deployment patterns

Future Enhancements

Planned improvements include:

Expanded voice library — More reference voices and emotion presets
Web UI improvements — Richer dashboards for managing episodes and jobs
Additional sources — Support for RSS feeds, newsletters, and document uploads
Quality tuning — Refined prompting and audio post-processing for more natural episodes
Observability — Centralized logging, metrics, and monitoring across services
Multi-language support — Broader language coverage in scripting and narration

Conclusion

Power & Context demonstrates how a thoughtful microservices architecture can deliver a complete, private, end-to-end AI pipeline — from raw article URL to finished, two-host podcast episode — without relying on paid cloud services. By combining local LLMs, context-aware extraction, and pluggable voice-cloning TTS, it turns written content into engaging audio while keeping data and infrastructure firmly under the owner's control.

This project is actively being developed, with ongoing work on deployment hardening, voice quality, and new content sources. The complete source code is available on GitHub for reference and experimentation.