Data pipeline • scraping → summarization

News Analyzer

I built this pipeline to turn raw local news into a concise daily briefing. It scrapes, extracts, indexes, and summarizes—then ships the result through my Kubernetes stack.

[Interactive pipeline visualization]

The particles represent batches moving between stages.

Under the Hood

Automated Scraping

Uses Playwright for auth + navigation on dynamic e-editions. Downloads source documents and keeps sessions stable across runs.
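
A minimal sketch of that flow, assuming a standard email/password login form; the URLs, selectors, and the NEWS_USER / NEWS_PASSWORD environment variables are placeholders, not the real site's details:

```python
# Sketch: log in once with Playwright, persist the session to disk, and reuse
# it on later runs so most scrapes skip the login step entirely.
import os
from pathlib import Path
from playwright.sync_api import sync_playwright

STATE = Path("auth_state.json")  # cookies + localStorage from the last good login

def fetch_edition(url: str, out_path: str) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(
            storage_state=str(STATE) if STATE.exists() else None
        )
        page = context.new_page()
        page.goto(url, wait_until="networkidle")

        # Re-authenticate only if the saved session has expired.
        if page.locator("input[name='password']").count() > 0:
            page.fill("input[name='email']", os.environ["NEWS_USER"])
            page.fill("input[name='password']", os.environ["NEWS_PASSWORD"])
            page.click("button[type='submit']")
            page.wait_for_load_state("networkidle")
            context.storage_state(path=str(STATE))  # refresh the saved session

        # Grab the e-edition file behind the download link.
        with page.expect_download() as dl:
            page.click("a.download-edition")
        dl.value.save_as(out_path)
        browser.close()
```

Persisting `storage_state` is what keeps sessions stable across runs: the login branch only fires when the saved cookies have gone stale.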

Index + Retrieval

Extracted text gets chunked + embedded, then written to Qdrant / Weaviate for retrieval and dedupe across editions.
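
Roughly what the indexing step looks like, sketched with the Qdrant client and a sentence-transformers embedder; the collection name, chunk size, and model are illustrative stand-ins, and the same shape applies to Weaviate:

```python
# Sketch of chunk → embed → upsert. Point IDs are derived from content, so
# re-indexing the same edition overwrites existing points instead of duplicating them.
import hashlib
import uuid
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

client = QdrantClient(url="http://localhost:6333")
model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim embeddings, illustrative choice

if not client.collection_exists("editions"):
    client.create_collection(
        collection_name="editions",
        vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    )

def chunk(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Fixed-size character chunks with a small overlap."""
    step = size - overlap
    return [text[i : i + size] for i in range(0, len(text), step)]

def index_edition(edition_id: str, text: str) -> None:
    chunks = chunk(text)
    vectors = model.encode(chunks).tolist()
    points = [
        PointStruct(
            # Deterministic UUID from edition + content hash → idempotent upserts.
            id=str(uuid.uuid5(uuid.NAMESPACE_URL, edition_id + hashlib.sha1(c.encode()).hexdigest())),
            vector=v,
            payload={"edition": edition_id, "text": c},
        )
        for c, v in zip(chunks, vectors)
    ]
    client.upsert(collection_name="editions", points=points)
```
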

AI Summarization

Uses LLMs through LiteLLM to generate a compact, scannable briefing and extract key entities for follow-up queries.
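
Something like the following, sketched against LiteLLM's `completion` API; the model name and prompt are placeholders, and a pinned system prompt plus temperature 0 is one way to keep the briefing's shape consistent from run to run:

```python
# Sketch of the summarization call routed through LiteLLM. The model name and
# prompt are illustrative; any LiteLLM-supported provider slots in the same way.
from litellm import completion

SYSTEM = (
    "You summarize local news. Return a briefing with exactly three sections: "
    "Headlines, Key Entities, Follow-ups. Keep every bullet under 20 words."
)

def summarize(chunks: list[str]) -> str:
    response = completion(
        model="gpt-4o-mini",   # swapped per environment; LiteLLM handles the routing
        temperature=0,         # keep the briefing shape consistent run to run
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": "\n\n".join(chunks)},
        ],
    )
    return response.choices[0].message.content
```
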

What I optimize for

  • Idempotent runs (safe to retry; see the sketch after this list)
  • Dedupe + provenance (know what changed)
  • Stable prompts (same input → same shape output)
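
To make the first two points concrete, here is a minimal sketch of a provenance manifest; the manifest.json path and schema are hypothetical, not what the pipeline actually uses:

```python
# Hypothetical provenance manifest: hash each extracted chunk, diff against the
# previous run, and only process what actually changed. Re-running the same
# edition is a no-op, which is what makes retries safe.
import hashlib
import json
from pathlib import Path

MANIFEST = Path("manifest.json")  # illustrative location

def content_id(edition: str, chunk: str) -> str:
    """Stable ID derived from content, so the same chunk always maps to the same key."""
    return hashlib.sha256(f"{edition}:{chunk}".encode()).hexdigest()

def diff_run(edition: str, chunks: list[str]) -> list[str]:
    """Return only the chunks not seen in previous runs and record the new ones."""
    seen: dict[str, dict] = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}
    fresh = []
    for c in chunks:
        cid = content_id(edition, c)
        if cid not in seen:
            fresh.append(c)
            seen[cid] = {"edition": edition}  # provenance: where the chunk came from
    MANIFEST.write_text(json.dumps(seen, indent=2))
    return fresh
```

Downstream stages only see the fresh chunks, so a crashed or repeated run can simply be restarted without re-summarizing or re-indexing everything.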

Common failure modes

The “hard” part isn’t the model—it’s everything around it: logins, flaky sources, and making sure partial data doesn’t poison the index.

  • Session expiry
  • PDF layout drift
  • Duplicate editions
  • Rate limits