How this archive was built
The Big Picture
The India AI Impact Summit hosted 500+ sessions across five days. That’s an extraordinary amount of knowledge — but unless you happened to be in the right room at the right time, most of it would simply… evaporate. I wanted to fix that.
The goal was simple: take every available recording, extract the knowledge from it, and make it all searchable, browsable, and interconnected. The execution involved a pipeline of scraping, transcription, AI-powered summarization, and static site publishing — stitched together over the course of one week.
Here’s how it all came together.
```mermaid
flowchart LR
    A["🌐 Official Website (500+ sessions)"] --> B["🧹 Scrape & Clean Metadata"]
    B --> C["📥 Download Videos"]
    C --> D["🎵 Extract Audio"]
    D --> E["🗣️ Transcribe (Whisper)"]
    E --> F["🤖 Summarize (LLM)"]
    F --> G["🔗 Find Connections (Embeddings + FAISS)"]
    G --> H["📝 Generate Markdown"]
    H --> I["🚀 Publish (Quartz v4)"]
```
Step 1: Scraping the Session Data
The first step was getting a structured list of every session — titles, speakers, descriptions, tracks, and YouTube links — from the official summit website.
I used Python + BeautifulSoup to scrape the session catalog, supplemented by some manual work where the website’s structure was inconsistent or incomplete.
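The scrape itself was short; most of the effort was in the selectors. The sketch below shows the shape of it — the CSS classes (`div.session-card`, `.speaker`, `a.recording`) are hypothetical stand-ins, since the real markup varied from page to page, which is exactly where the manual work came in:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def parse_sessions(html: str) -> list[dict]:
    """Extract one record per session card from the catalog HTML."""
    soup = BeautifulSoup(html, "html.parser")
    sessions = []
    for card in soup.select("div.session-card"):  # hypothetical selector
        link = card.select_one("a.recording")     # YouTube link, if any
        sessions.append({
            "title": card.select_one("h3").get_text(strip=True),
            "speakers": [s.get_text(strip=True) for s in card.select(".speaker")],
            "youtube": link.get("href") if link else None,
        })
    return sessions
```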
Output: A structured dataset of 500+ sessions with metadata.
Step 2: Downloading the Videos
Of the 500+ sessions listed, a little over 400 had YouTube recordings available. I used yt-dlp to batch-download all of them.
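A minimal sketch of the batch download via yt-dlp's Python API. The option values here (the 720p cap, id-based filenames) are illustrative choices rather than my exact config — the important one is `ignoreerrors`, so one dead link doesn't kill a 400-video run:

```python
def download_options(out_dir: str) -> dict:
    """yt-dlp options: name files by YouTube id so they map back to metadata."""
    return {
        "format": "best[height<=720]",           # cap resolution; only the audio matters later
        "outtmpl": f"{out_dir}/%(id)s.%(ext)s",  # id-based filenames
        "ignoreerrors": True,                    # skip dead or private links, keep going
    }

def download_all(urls: list[str], out_dir: str = "videos") -> None:
    from yt_dlp import YoutubeDL  # pip install yt-dlp
    with YoutubeDL(download_options(out_dir)) as ydl:
        ydl.download(urls)
```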
This is where the first round of data cleaning pain began. Some videos were mislabeled on the summit website — a session titled one thing would link to a video of something else entirely. Others were multi-session recordings — for example, all keynote addresses for a given day stitched into a single hours-long video. These had to be manually identified and split so each session could be summarized individually.
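Once the cut points were found by hand, splitting reduces to re-cutting by timestamp — `-c copy` keeps it fast because nothing is re-encoded. The filenames and times below are made up for illustration:

```python
import subprocess

def split_cmd(src: str, start: str, end: str, dst: str) -> list[str]:
    """ffmpeg command to cut [start, end] out of src without re-encoding."""
    return ["ffmpeg", "-ss", start, "-to", end, "-i", src, "-c", "copy", "-y", dst]

def split(src: str, start: str, end: str, dst: str) -> None:
    subprocess.run(split_cmd(src, start, end, dst), check=True)

# e.g. split("day2_keynotes.mp4", "00:00:00", "00:42:10", "opening_address.mp4")
```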
Output: ~400 video files, cleaned and mapped to their correct sessions.
Step 3: Audio Extraction
Before transcription, I extracted the audio track from each video using ffmpeg, transcoding to 192 kbps to keep file sizes manageable while preserving enough quality for accurate speech recognition.
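The extraction is one ffmpeg call per file, scripted over the whole directory. A sketch, assuming MP3 as the output container:

```python
import subprocess
from pathlib import Path

def extract_audio_cmd(video: Path, audio_dir: Path) -> list[str]:
    """ffmpeg command: drop the video stream (-vn), encode audio at 192 kbps."""
    dst = audio_dir / (video.stem + ".mp3")
    return ["ffmpeg", "-i", str(video), "-vn", "-b:a", "192k", "-y", str(dst)]

def extract_all(video_dir: Path, audio_dir: Path) -> None:
    audio_dir.mkdir(exist_ok=True)
    for video in sorted(video_dir.glob("*.mp4")):
        subprocess.run(extract_audio_cmd(video, audio_dir), check=True)
```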
Output: ~400 audio files ready for transcription.
Step 4: Transcription
This was the heavy lifting. I ran OpenAI’s whisper-large-v3 locally via HuggingFace Transformers on my Nvidia RTX 2080Ti.
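Roughly how the pipeline was set up — chunked long-form decoding in fp16 is what makes a 2080Ti viable for this. The exact generation settings below are assumptions for illustration, not my saved config:

```python
def build_asr():
    """HuggingFace ASR pipeline for whisper-large-v3 (heavy: downloads a multi-GB model)."""
    import torch
    from transformers import pipeline  # pip install transformers accelerate
    return pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-large-v3",
        torch_dtype=torch.float16,  # fp16 to fit the 11 GB card
        device="cuda:0",
        chunk_length_s=30,          # Whisper's native 30-second window
    )

def format_chunks(result: dict) -> str:
    """Flatten pipeline output (run with return_timestamps=True) into '[start-end] text' lines."""
    lines = []
    for ch in result.get("chunks", []):
        start, end = ch["timestamp"]
        lines.append(f"[{start:.1f}-{end:.1f}] {ch['text'].strip()}")
    return "\n".join(lines)

# asr = build_asr()
# transcript = format_chunks(asr("audio/session_001.mp3", return_timestamps=True))
```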
The numbers:
| Metric | Value |
|---|---|
| Total audio transcribed | ~400 hours |
| GPU | Nvidia RTX 2080Ti |
| Total transcription time | ~35 hours |
| Throughput | ~11.5× real-time |
That’s roughly 400 hours of conference content transcribed in less than 2 days on a single consumer GPU. Not bad, Whisper. Not bad at all. 🎉
Output: Raw transcripts for 400+ sessions.
Step 5: Summarization
Raw transcripts are long, messy, and full of filler words. To turn them into something readable, I ran every transcript through gpt-oss-120b on OpenRouter.ai — a model I’ve found to be the best mix of quality and cost for tasks like this.
Each transcript was sent along with the session’s metadata (title, description, speaker list) and a detailed system prompt that instructed the model to:
- Detect multi-session transcripts and segment them correctly
- Identify speakers by cross-referencing the speaker list with in-transcript cues
- Produce a comprehensive, structured summary with headers, subheaders, and speaker attribution
- Surface key elements explicitly: announcements 📢, key insights 💡, data & findings 📊, recommendations 🔑, and open questions ❓
- Generate a “Key Takeaways” section with 5–10 crisp bullet points per session
- Handle edge cases gracefully — garbled audio, abrupt starts/ends, audience crosstalk, divergence from the official agenda
The prompt ran to over 2,000 words and was carefully iterated to handle the messiness of real-world conference transcripts.
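In outline, each call looked something like this. `SYSTEM_PROMPT` stands in for the real 2,000+ word prompt, and the payload follows OpenRouter's OpenAI-compatible chat completions API:

```python
import json
import os
import urllib.request

SYSTEM_PROMPT = "You summarize conference session transcripts into structured notes..."  # abridged stand-in

def build_payload(transcript: str, meta: dict) -> dict:
    """Bundle the transcript with its session metadata into one chat request."""
    user = (f"Title: {meta['title']}\n"
            f"Speakers: {', '.join(meta['speakers'])}\n"
            f"Description: {meta['description']}\n\n"
            f"Transcript:\n{transcript}")
    return {"model": "openai/gpt-oss-120b",
            "messages": [{"role": "system", "content": SYSTEM_PROMPT},
                         {"role": "user", "content": user}]}

def summarize(transcript: str, meta: dict) -> str:
    req = urllib.request.Request(
        "https://openrouter.ai/api/v1/chat/completions",
        data=json.dumps(build_payload(transcript, meta)).encode(),
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
                 "Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```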
```mermaid
flowchart LR
    T["📜 Raw Transcript"] --> LLM["🤖 gpt-oss-120b (OpenRouter.ai)"]
    M["📋 Session Metadata (title, speakers, description)"] --> LLM
    P["📐 System Prompt (2000+ words)"] --> LLM
    LLM --> S["📝 Structured Summary with Key Takeaways"]
```
The total cost of summarizing 400+ sessions (representing ~400 hours of content)?
Less than $1. 🤯
Output: Detailed, structured summaries for every session.
Step 6: Finding Connections
A summit like this has sessions that echo, complement, and build on each other — but those connections aren’t obvious from titles alone. I wanted the archive to surface them.
I generated vector embeddings of each session’s summary and metadata (titles, speakers, tracks) using sentence-transformers/all-MiniLM-L6-v2, then used Facebook’s FAISS to find semantically similar sessions. The similarity threshold was tuned by hand — running it, eyeballing the results, adjusting, and repeating until the connections felt genuinely useful rather than noisy.
Tags for each session were also derived from this embedding process.
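The search step is small enough to sketch. The real pipeline used FAISS (an `IndexFlatIP` over L2-normalized vectors is cosine similarity); the brute-force numpy version below is equivalent at this scale, and the `k`/`threshold` values are illustrative — the actual threshold was the hand-tuned one described above:

```python
import numpy as np

def related(emb: np.ndarray, k: int = 5, threshold: float = 0.5) -> list[list[int]]:
    """For each row, indices of up to k other rows with cosine similarity >= threshold."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # L2-normalize
    sims = emb @ emb.T                                      # cosine similarity matrix
    np.fill_diagonal(sims, -1.0)                            # a session isn't related to itself
    out = []
    for row in sims:
        idx = np.argsort(row)[::-1][:k]                     # top-k neighbors
        out.append([int(i) for i in idx if row[i] >= threshold])
    return out

# With sentence-transformers (assuming `summaries` is a list of strings):
# from sentence_transformers import SentenceTransformer
# emb = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2").encode(summaries)
```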
```mermaid
flowchart TD
    S["📝 Summaries + Metadata"] --> E["🧮 Embeddings (all-MiniLM-L6-v2)"]
    E --> F["🔎 FAISS Similarity Search"]
    F --> R["🔗 Related Sessions"]
    E --> Tags["🏷️ Auto-Generated Tags"]
```
Output: A web of connections between sessions, plus tags for browsable categorization.
Step 7: Data Cleaning (The Unglamorous Hero)
Data cleaning wasn’t a step — it was every step. At each stage of the pipeline, there was something to fix:
| Stage | What Needed Cleaning |
|---|---|
| Scraping | Inconsistent metadata, missing fields |
| Video mapping | Mislabeled videos, multi-session recordings that needed manual splitting |
| Transcription | Multilingual passages, garbled names, transcription artifacts |
| Summarization | Enforcing consistent formatting across 400+ outputs |
| Connections | Tuning similarity thresholds to avoid false positives |
Most of this was manual, with scripting where possible. It’s the least exciting part of the project and the part that took the most patience.
Step 8: Publishing
All the summaries, metadata, tags, and connections were collated into Markdown files and published using Quartz v4 — a fantastic static site generator designed for interconnected knowledge bases. I added a few custom components to render session metadata and tweaked the theme, but Quartz did the heavy lifting of making the archive browsable, searchable, and navigable.
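Collation amounted to writing one markdown note per session with YAML frontmatter, which Quartz reads for titles and tag pages, and wikilinks (`[[...]]`) for the related-session graph. A simplified sketch — the real notes carried more fields, and the `slug` key is an assumption:

```python
from pathlib import Path

def render_note(session: dict) -> str:
    """Render one session as a Quartz-ready markdown note with YAML frontmatter."""
    tags = "\n".join(f"  - {t}" for t in session["tags"])
    related = "\n".join(f"- [[{r}]]" for r in session["related"])
    return (f"---\ntitle: \"{session['title']}\"\ntags:\n{tags}\n---\n\n"
            f"{session['summary']}\n\n## Related Sessions\n\n{related}\n")

def write_note(session: dict, content_dir: Path) -> Path:
    path = content_dir / f"{session['slug']}.md"
    path.write_text(render_note(session), encoding="utf-8")
    return path
```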
The Full Pipeline at a Glance
```mermaid
flowchart TD
    subgraph EXTRACT ["1️⃣ Extract"]
        A["🌐 Scrape Official Website (Python + BeautifulSoup)"] --> B["📋 Session Metadata (500+ sessions)"]
        B --> C["📥 Download Videos via yt-dlp (400+ with recordings)"]
    end
    subgraph TRANSFORM ["2️⃣ Transform"]
        C --> D["🎵 Extract Audio (ffmpeg → 192kbps)"]
        D --> E["🗣️ Transcribe (whisper-large-v3 on RTX 2080Ti) ~400 hrs → 35 hrs"]
        E --> F["🤖 Summarize (gpt-oss-120b via OpenRouter) Cost: < $1"]
        F --> G["🧮 Embed & Connect (MiniLM + FAISS)"]
    end
    subgraph CLEAN ["🧹 Clean"]
        D -.->|"manual + scripted"| D
        E -.->|"manual + scripted"| E
        F -.->|"manual + scripted"| F
        G -.->|"threshold tuning"| G
    end
    subgraph PUBLISH ["3️⃣ Publish"]
        G --> H["📝 Collate into Markdown"]
        H --> I["🚀 Publish with Quartz v4"]
    end
    style EXTRACT fill:#e8f5e9,stroke:#388e3c
    style TRANSFORM fill:#e3f2fd,stroke:#1565c0
    style CLEAN fill:#fff3e0,stroke:#ef6c00
    style PUBLISH fill:#f3e5f5,stroke:#7b1fa2
```
By the Numbers
| Metric | Value |
|---|---|
| Sessions on official agenda | 500+ |
| Sessions with video recordings | 400+ |
| Total audio transcribed | ~400 hours |
| Transcription time (RTX 2080Ti) | ~35 hours |
| Summarization model | gpt-oss-120b (OpenRouter) |
| Summarization cost | < $1 |
| Embedding model | all-MiniLM-L6-v2 |
| Total project time | ~1 week |
| Coffee consumed | ☕ undisclosed |
Built with curiosity, duct tape, and a mass of open-source tooling. If you spot an error or have feedback, let me know!