How this archive was built

The Big Picture

The India AI Impact Summit hosted 500+ sessions across five days. That’s an extraordinary amount of knowledge — but unless you happened to be in the right room at the right time, most of it would simply… evaporate. I wanted to fix that.

The goal was simple: take every available recording, extract the knowledge from it, and make it all searchable, browsable, and interconnected. The execution involved a pipeline of scraping, transcription, AI-powered summarization, and static site publishing — stitched together over the course of one week.

Here’s how it all came together.

```mermaid
flowchart LR
    A["🌐 Official Website (500+ sessions)"] --> B["🧹 Scrape & Clean Metadata"]
    B --> C["📥 Download Videos"]
    C --> D["🎵 Extract Audio"]
    D --> E["🗣️ Transcribe (Whisper)"]
    E --> F["🤖 Summarize (LLM)"]
    F --> G["🔗 Find Connections (Embeddings + FAISS)"]
    G --> H["📝 Generate Markdown"]
    H --> I["🚀 Publish (Quartz v4)"]
```

Step 1: Scraping the Session Data

The first step was getting a structured list of every session — titles, speakers, descriptions, tracks, and YouTube links — from the official summit website.

I used Python + BeautifulSoup to scrape the session catalog, supplemented by some manual work where the website’s structure was inconsistent or incomplete.
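The scraper itself was a short script. A minimal sketch, assuming a hypothetical page structure — the real site's CSS classes differed per page, which is where the manual work came in:

```python
# Sketch of the catalog scrape. The selectors ("div.session-card" etc.)
# are hypothetical stand-ins for the real site's markup.
from bs4 import BeautifulSoup

def parse_sessions(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "html.parser")
    sessions = []
    for card in soup.select("div.session-card"):  # hypothetical selector
        link = card.select_one("a.video")
        sessions.append({
            "title": card.select_one("h3").get_text(strip=True),
            "speakers": [s.get_text(strip=True) for s in card.select(".speaker")],
            "youtube": link["href"] if link is not None else None,
        })
    return sessions
```

Sessions without a recording simply get `youtube: None`, which matters later when deciding what to download.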

Output: A structured dataset of 500+ sessions with metadata.


Step 2: Downloading the Videos

Of the 500+ sessions listed, roughly 400 had YouTube recordings available. I used yt-dlp to batch-download all of them.
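The download loop looked roughly like this — a sketch, with the output template naming each file by session ID (rather than video title) so files stay mapped to the scraped metadata:

```python
# Sketch of the batch download. Building the command separately from the
# call makes dry-running easy; running this for real fetches ~400 videos.
import subprocess

def ytdlp_cmd(url: str, session_id: str, out_dir: str = "videos") -> list[str]:
    return [
        "yt-dlp",
        "-f", "bv*+ba/b",                         # best video+audio available
        "-o", f"{out_dir}/{session_id}.%(ext)s",  # name by session, not title
        url,
    ]

def download_all(sessions: list[dict]) -> None:
    for s in sessions:
        if s.get("youtube"):
            subprocess.run(ytdlp_cmd(s["youtube"], s["id"]), check=True)
```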

This is where the first round of data cleaning pain began. Some videos were mislabeled on the summit website — a session titled one thing would link to a video of something else entirely. Others were multi-session recordings — for example, all keynote addresses for a given day stitched into a single hours-long video. These had to be manually identified and split so each session could be summarized individually.

Output: ~400 video files, cleaned and mapped to their correct sessions.


Step 3: Audio Extraction

Before transcription, I extracted the audio track from each video using ffmpeg, transcoding to 192 kbps to keep file sizes manageable while preserving enough quality for accurate speech recognition.
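In practice this was one ffmpeg invocation per file — a sketch, assuming MP4 inputs and MP3 outputs:

```python
# Sketch of the audio extraction: drop the video stream (-vn), encode
# audio at 192 kbps, write alongside the source video.
import subprocess
from pathlib import Path

def ffmpeg_cmd(video: Path) -> list[str]:
    return [
        "ffmpeg", "-y",
        "-i", str(video),
        "-vn",              # no video stream
        "-b:a", "192k",     # 192 kbps audio bitrate
        str(video.with_suffix(".mp3")),
    ]

if Path("videos").is_dir():
    for v in Path("videos").glob("*.mp4"):
        subprocess.run(ffmpeg_cmd(v), check=True)
```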

Output: ~400 audio files ready for transcription.


Step 4: Transcription

This was the heavy lifting. I ran OpenAI’s whisper-large-v3 locally via HuggingFace Transformers on my Nvidia RTX 2080Ti.
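The setup was roughly the standard Transformers ASR pipeline — a sketch (weights download on first run, and the batch size here is a stand-in for whatever fits your GPU):

```python
# Sketch of the transcription setup; assumes transformers + a CUDA GPU.
def build_asr():
    import torch
    from transformers import pipeline
    return pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-large-v3",
        torch_dtype=torch.float16,
        device="cuda:0",
        chunk_length_s=30,   # Whisper's native 30 s window
        batch_size=8,        # hypothetical; tune to GPU memory
    )

def realtime_factor(audio_hours: float, wall_hours: float) -> float:
    """Throughput relative to real time, e.g. ~400 h in ~35 h."""
    return audio_hours / wall_hours

# asr = build_asr()
# text = asr("videos/s001.mp3")["text"]
```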

The numbers:

| Metric | Value |
| --- | --- |
| Total audio transcribed | ~400 hours |
| GPU | Nvidia RTX 2080Ti |
| Total transcription time | ~35 hours |
| Throughput | ~11.5× real-time |

That’s roughly 400 hours of conference content transcribed in less than 2 days on a single consumer GPU. Not bad, Whisper. Not bad at all. 🎉

Output: Raw transcripts for 400+ sessions.


Step 5: Summarization

Raw transcripts are long, messy, and full of filler words. To turn them into something readable, I ran every transcript through gpt-oss-120b on OpenRouter.ai — a model I’ve found to be the best mix of quality and cost for tasks like this.

Each transcript was sent along with the session’s metadata (title, description, speaker list) and a detailed system prompt that instructed the model to:

  • Detect multi-session transcripts and segment them correctly
  • Identify speakers by cross-referencing the speaker list with in-transcript cues
  • Produce a comprehensive, structured summary with headers, subheaders, and speaker attribution
  • Surface key elements explicitly: announcements 📢, key insights 💡, data & findings 📊, recommendations 🔑, and open questions ❓
  • Generate a “Key Takeaways” section with 5–10 crisp bullet points per session
  • Handle edge cases gracefully — garbled audio, abrupt starts/ends, audience crosstalk, divergence from the official agenda

The prompt ran to over 2,000 words and was carefully iterated to handle the messiness of real-world conference transcripts.
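Each call was a standard chat-completions request through OpenRouter's OpenAI-compatible API. A sketch — the placeholder system prompt and field layout below stand in for the real ones:

```python
# Sketch of one summarization call. SYSTEM_PROMPT is a placeholder for
# the real 2,000+ word prompt described above.
SYSTEM_PROMPT = "You are a conference-session summarizer..."  # placeholder

def build_messages(meta: dict, transcript: str) -> list[dict]:
    user = (
        f"Title: {meta['title']}\n"
        f"Speakers: {', '.join(meta['speakers'])}\n"
        f"Description: {meta['description']}\n\n"
        f"Transcript:\n{transcript}"
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user},
    ]

def summarize(meta: dict, transcript: str) -> str:
    from openai import OpenAI  # pip install openai; key from openrouter.ai
    client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="...")
    resp = client.chat.completions.create(
        model="openai/gpt-oss-120b",
        messages=build_messages(meta, transcript),
    )
    return resp.choices[0].message.content
```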

```mermaid
flowchart LR
    T["📜 Raw Transcript"] --> LLM["🤖 gpt-oss-120b (OpenRouter.ai)"]
    M["📋 Session Metadata (title, speakers, description)"] --> LLM
    P["📐 System Prompt (2000+ words)"] --> LLM
    LLM --> S["📝 Structured Summary with Key Takeaways"]
```

The total cost of summarizing 400+ sessions (representing ~400 hours of content)?

Less than $1. 🤯

Output: Detailed, structured summaries for every session.


Step 6: Finding Connections

A summit like this has sessions that echo, complement, and build on each other — but those connections aren’t obvious from titles alone. I wanted the archive to surface them.

I generated vector embeddings of each session’s summary and metadata (titles, speakers, tracks) using sentence-transformers/all-MiniLM-L6-v2, then used Facebook’s FAISS to find semantically similar sessions. The similarity threshold was tuned by hand — running it, eyeballing the results, adjusting, and repeating until the connections felt genuinely useful rather than noisy.

Tags for each session were also derived from this embedding process.
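The core of this step can be sketched as follows — the FAISS index uses inner product on normalized vectors (i.e. cosine similarity), and the threshold value shown is a hypothetical stand-in for the hand-tuned one:

```python
# Sketch of the embed-and-connect step; needs sentence-transformers + faiss.
def embed_and_index(texts: list[str], k: int = 6):
    import faiss
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    vecs = model.encode(texts, normalize_embeddings=True).astype("float32")
    index = faiss.IndexFlatIP(vecs.shape[1])  # inner product == cosine here
    index.add(vecs)
    return index.search(vecs, k)  # (scores, ids) of top-k neighbours, incl. self

def related(neighbour_scores: dict[str, float], threshold: float = 0.55) -> list[str]:
    # 0.55 is hypothetical; the real value came from eyeballing results
    return [sid for sid, s in neighbour_scores.items() if s >= threshold]
```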

```mermaid
flowchart TD
    S["📝 Summaries + Metadata"] --> E["🧮 Embeddings (all-MiniLM-L6-v2)"]
    E --> F["🔎 FAISS Similarity Search"]
    F --> R["🔗 Related Sessions"]
    E --> Tags["🏷️ Auto-Generated Tags"]

Output: A web of connections between sessions, plus tags for browsable categorization.


Step 7: Data Cleaning (The Unglamorous Hero)

Data cleaning wasn’t a step — it was every step. At each stage of the pipeline, there was something to fix:

| Stage | What Needed Cleaning |
| --- | --- |
| Scraping | Inconsistent metadata, missing fields |
| Video mapping | Mislabeled videos, multi-session recordings that needed manual splitting |
| Transcription | Multilingual passages, garbled names, transcription artifacts |
| Summarization | Enforcing consistent formatting across 400+ outputs |
| Connections | Tuning similarity thresholds to avoid false positives |

Most of this was manual, with scripting where possible. It’s the least exciting part of the project and the part that took the most patience.
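One kind of check that does script well is verifying that every session has all of its expected artifacts between stages. A hypothetical sketch (the paths and stage names are illustrative, not the project's actual layout):

```python
# Hypothetical sanity check run between pipeline stages: flag sessions
# missing any expected artifact so gaps get fixed before publishing.
from pathlib import Path

STAGES = {
    "video": "videos/{}.mp4",
    "audio": "videos/{}.mp3",
    "transcript": "transcripts/{}.txt",
    "summary": "summaries/{}.md",
}

def missing_artifacts(session_id: str, root: Path = Path(".")) -> list[str]:
    return [stage for stage, pat in STAGES.items()
            if not (root / pat.format(session_id)).exists()]
```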


Step 8: Publishing

All the summaries, metadata, tags, and connections were collated into Markdown files and published using Quartz v4 — a fantastic static site generator designed for interconnected knowledge bases. I added a few custom components to render session metadata and tweaked the theme, but Quartz did the heavy lifting of making the archive browsable, searchable, and navigable.
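The collation step boiled down to emitting one Markdown file per session with YAML frontmatter, which Quartz turns into titles and tags, plus wikilinks for the related sessions. A sketch with illustrative field names:

```python
# Sketch of the per-session Markdown generation. Field names beyond
# `title` and `tags` are illustrative, not Quartz requirements.
def session_markdown(meta: dict, summary: str, related: list[str]) -> str:
    fm = [
        "---",
        f"title: \"{meta['title']}\"",
        "tags:",
        *[f"  - {t}" for t in meta["tags"]],
        "---",
        "",
    ]
    body = [summary, "", "## Related Sessions",
            *[f"- [[{slug}]]" for slug in related]]  # Quartz-style wikilinks
    return "\n".join(fm + body)
```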


The Full Pipeline at a Glance

```mermaid
flowchart TD
    subgraph EXTRACT ["1️⃣ Extract"]
        A["🌐 Scrape Official Website (Python + BeautifulSoup)"] --> B["📋 Session Metadata (500+ sessions)"]
        B --> C["📥 Download Videos via yt-dlp (400+ with recordings)"]
    end

    subgraph TRANSFORM ["2️⃣ Transform"]
        C --> D["🎵 Extract Audio (ffmpeg → 192kbps)"]
        D --> E["🗣️ Transcribe (whisper-large-v3 on RTX 2080Ti) ~400 hrs → 35 hrs"]
        E --> F["🤖 Summarize (gpt-oss-120b via OpenRouter) Cost: < $1"]
        F --> G["🧮 Embed & Connect (MiniLM + FAISS)"]
    end

    subgraph CLEAN ["🧹 Clean"]
        D -.->|"manual + scripted"| D
        E -.->|"manual + scripted"| E
        F -.->|"manual + scripted"| F
        G -.->|"threshold tuning"| G
    end

    subgraph PUBLISH ["3️⃣ Publish"]
        G --> H["📝 Collate into Markdown"]
        H --> I["🚀 Publish with Quartz v4"]
    end

    style EXTRACT fill:#e8f5e9,stroke:#388e3c
    style TRANSFORM fill:#e3f2fd,stroke:#1565c0
    style CLEAN fill:#fff3e0,stroke:#ef6c00
    style PUBLISH fill:#f3e5f5,stroke:#7b1fa2
```

By the Numbers

| Metric | Value |
| --- | --- |
| Sessions on official agenda | 500+ |
| Sessions with video recordings | 400+ |
| Total audio transcribed | ~400 hours |
| Transcription time (RTX 2080Ti) | ~35 hours |
| Summarization model | gpt-oss-120b (OpenRouter) |
| Summarization cost | < $1 |
| Embedding model | all-MiniLM-L6-v2 |
| Total project time | ~1 week |
| Coffee consumed | undisclosed |

Built with curiosity, duct tape, and a mass of open-source tooling. If you spot an error or have feedback, let me know!