AI-Ready Data: Shared Infrastructure for Innovation
Abstract
The panel explored why AI progress in India (and globally) stalls when high‑quality, "AI‑ready" data is hidden in fragmented, poorly documented silos. Speakers described the technical, governance, and trust challenges of turning administrative and alternate data into machine‑readable, interoperable assets, and presented concrete initiatives: a shared AI‑readiness framework, Google's open‑source Data Commons platform, contextual glossaries and knowledge graphs, federated metadata catalogs, and the MoSPI‑run MCP service. The discussion moved through use‑case illustrations (MSME location planning, health‑worker decision support, education‑content generation) and concluded with audience questions on data‑economy business models, the need for benchmarked trust, and how to keep data rails stable over time.
Detailed Summary
1. Framing the Problem – AI's Dependence on Data (MoSPI)
Rohit Bardawaj opened by framing AI's dependence on data:
- LLMs excel because they have scraped the public web, but enterprise data remains locked in PDFs, scanned forms and “fragmented silos.”
- Example: an entrepreneur in Nagpur cannot discover a biotechnology‑plant subsidy because the relevant government notification lives only in a PDF that LLMs do not index.
- The “information divide” means that valuable public‑sector data is not AI‑ready – it is digitised but not trusted, safe or interoperable.
Key Insight: Making data AI‑ready is a long, multi‑step journey (cleaning, linking, contextualising, exposing via APIs) and must accommodate user choice of LLMs.
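The multi‑step journey above can be sketched as a small pipeline. This is a hypothetical illustration only: the field names (`scheme_id`, `sector`, `subsidy_pct`) and registry entries are invented, not drawn from any real MoSPI dataset.

```python
# Illustrative sketch of the AI-readiness journey: clean raw records,
# link them to a shared identifier, attach context, and expose the
# result through a simple lookup "API". All names are invented.

def clean(record: dict) -> dict:
    """Normalise whitespace in raw string fields."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}

def link(record: dict, registry: dict) -> dict:
    """Attach a canonical sector code from a shared registry."""
    record["sector_code"] = registry.get(record.get("sector", "").lower(), "UNKNOWN")
    return record

def contextualise(record: dict, glossary: dict) -> dict:
    """Embed definitions so an LLM sees each field in context."""
    record["definitions"] = {k: glossary[k] for k in record if k in glossary}
    return record

def expose(records: list[dict]) -> dict:
    """Index records by id -- a stand-in for a real query API."""
    return {r["scheme_id"]: r for r in records}

registry = {"biotechnology": "BT-01"}
glossary = {"subsidy_pct": "Share of capital cost reimbursed, in percent."}

raw = [{"scheme_id": "NGP-2024-17", "sector": " Biotechnology ", "subsidy_pct": 30}]
api = expose([contextualise(link(clean(r), registry), glossary) for r in raw])
print(api["NGP-2024-17"]["sector_code"])  # -> BT-01
```

The point of the sketch is the ordering: linking and contextualising only work once the data is clean, which is why the journey is long rather than a one‑off conversion.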
2. Defining AI‑Readiness – A Framework (MoSPI)
Rohit asked the panel to articulate a common definition.
Shalini Kapoor (EkStep) highlighted two gaps:
- Lack of a shared definition – many assume AI‑readiness exists but cannot agree on its components.
- Awareness gap – users (e.g., a colleague asking ChatGPT about a regional dialect) are unaware of the metadata, semantic layers and context required for reliable AI output.
She proposed a two‑tier framework:
| Tier | Description |
|---|---|
| Core AI‑Readiness | Machine‑readable catalog, structured metadata, a context file (defining units, frequencies, glossaries). |
| Aspirational AI‑Readiness | Higher‑level governance, provenance, federated stewardship, and mechanisms for continuous quality improvement. |
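The "context file" in the core tier might look like the following sketch, here produced as JSON. The schema (`dataset`, `indicators`, `unit`, `frequency`, `glossary`) is an assumption for illustration, not an official MoSPI format.

```python
# Hypothetical context file for one indicator: units, update
# frequency, and a glossary entry. The schema is illustrative only.
import json

context = {
    "dataset": "cpi_rural",
    "indicators": {
        "cpi": {
            "unit": "index (base year = 100)",
            "frequency": "monthly",
            "glossary": "Consumer Price Index for rural households.",
        }
    },
}

context_file = json.dumps(context, indent=2)
print(context_file)
```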
Rohit agreed and outlined the first concrete step: create a concise slide deck that contrasts “what AI sees” (all files scanned) with “what a human sees” (focused on a single version). The deck will be used to seed a common framework that all institutions can adopt.
3. Data Commons – Open‑Source, Federated Knowledge Graphs (Google)
Prem Ramaswami described Google’s Data Commons initiative:
- Goal: Transform public‑sector data into machine‑readable form (structured tables with rich metadata) and place it in a global knowledge graph.
- Technical approach:
- Publish data in open‑source stacks (JSON‑LD, schema.org, CSV‑to‑graph pipelines).
- Keep the data federated – each agency hosts its own slice, but all slices share a common schema, avoiding a single point of control.
- Real‑world impact: the United Nations Statistics Division, WHO, and ILO already use Data Commons as a backend, reducing analyst time spent on column renaming by more than 80%.
Key Insight: When data is structured + searchable via an AI‑augmented interface, LLMs can retrieve relevant facts quickly, improving answer quality dramatically.
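A toy version of the CSV‑to‑graph idea can be sketched as follows: each CSV row becomes a schema.org‑flavoured JSON‑LD node so that federated slices share one vocabulary. The property names and the `Observation` shape here are illustrative; the real Data Commons pipelines use their own schema extensions.

```python
# Toy CSV-to-graph pipeline: parse CSV rows into JSON-LD-style nodes
# that share a common vocabulary. Data values are invented examples.
import csv
import io
import json

csv_text = """place,year,population
Nagpur,2023,2500000
Pune,2023,4100000
"""

def rows_to_jsonld(text: str) -> list[dict]:
    nodes = []
    for row in csv.DictReader(io.StringIO(text)):
        nodes.append({
            "@context": "https://schema.org",
            "@type": "Observation",
            "observationAbout": row["place"],
            "observationDate": row["year"],
            "value": int(row["population"]),
        })
    return nodes

graph = rows_to_jsonld(csv_text)
print(json.dumps(graph[0], indent=2))
```

Because every agency emits nodes in the same shape, the slices can stay federated (each agency hosts its own) while still joining into one queryable graph.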
4. Contextualisation & Domain Glossaries (EkStep & A4I)
Shalini illustrated a domain‑specific glossary approach:
- Problem: LLMs translate well for major languages but fail on domain‑specific jargon (e.g., sixth‑grade physics terms).
- Solution: Attach a glossary (or mini‑knowledge‑graph) to the LLM so that, before responding, the model consults the specialised term list.
- Outcome: Users receive consistent translations without seeing the glossary; the model’s internal prompt is enriched automatically.
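The enrichment step can be sketched as a pre‑processing function: before the prompt reaches the LLM, any glossary terms it contains are looked up and injected as context. The glossary entries below are invented for illustration and are not EkStep's actual term lists.

```python
# Minimal glossary-enrichment sketch: matched domain terms are
# prepended to the prompt so the model translates them consistently.
# Glossary contents are invented examples.

GLOSSARY = {
    "refraction": "bending of light as it passes between media (Grade 6 physics)",
    "momentum": "mass times velocity (Grade 6 physics)",
}

def enrich_prompt(user_prompt: str, glossary: dict[str, str]) -> str:
    hits = {t: d for t, d in glossary.items() if t in user_prompt.lower()}
    if not hits:
        return user_prompt  # nothing to enrich
    context = "\n".join(f"- {t}: {d}" for t, d in sorted(hits.items()))
    return f"Use these domain definitions:\n{context}\n\nQuestion: {user_prompt}"

enriched = enrich_prompt("Translate the chapter on refraction into Marathi", GLOSSARY)
print(enriched.splitlines()[0])  # -> Use these domain definitions:
```

The user never sees this enriched prompt; only the final, consistently translated answer, which matches the "outcome" described above.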
Ashish expanded, noting three recurring data‑quality challenges:
- Interoperability – disparate datasets must speak a common language.
- Contextualisation – raw numbers need accompanying definitions (e.g., “frequency = quarterly”).
- Verifiability – many public datasets are declared (self‑reported) rather than verified (clinically validated).
He argued that AI‑ready data must solve all three to be trustworthy.
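The three challenges can be made concrete as a record‑level check, sketched below under invented field names: interoperability (fields match a shared schema), contextualisation (definitions travel with the data), and verifiability (provenance marked verified rather than declared).

```python
# Hedged sketch of a three-way readiness check mirroring the
# challenges above. Schema and field names are invented examples.

SHARED_SCHEMA = {"district", "indicator", "value", "frequency"}

def readiness_issues(record: dict) -> list[str]:
    issues = []
    if not SHARED_SCHEMA <= set(record):
        issues.append("interoperability: missing shared-schema fields")
    if "definitions" not in record:
        issues.append("contextualisation: no accompanying definitions")
    if record.get("provenance") != "verified":
        issues.append("verifiability: data is declared, not verified")
    return issues

record = {"district": "Nagpur", "indicator": "anaemia_rate",
          "value": 22.4, "frequency": "quarterly", "provenance": "declared"}
for issue in readiness_issues(record):
    print(issue)
```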
5. Benchmarking Trust & Stability (A4I)
Ashish (again) reported ongoing work on a benchmark for answer stability:
- Motivation: The same prompt yields different answers across LLMs or even within the same LLM when queried by different users (e.g., farmers).
- Pilot: Using “Amul AI” and “Bharat Vistar” platforms to test repeatability of responses.
- Goal: Define a ground‑truth benchmark that captures variance across models and users, guiding the design of guardrails and human‑in‑the‑loop checks.
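One simple way to quantify the variance the benchmark targets is mean pairwise token overlap across repeated answers. The sketch below simulates the model responses; a real benchmark such as the one described would use live model calls and curated ground truth.

```python
# Illustrative answer-stability metric: score pairwise Jaccard token
# overlap across repeated answers to the same prompt. The "answers"
# are simulated stand-ins for real LLM outputs.
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def stability(answers: list[str]) -> float:
    """Mean pairwise Jaccard similarity; 1.0 = perfectly stable."""
    pairs = list(combinations(answers, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

answers = [
    "sow moong after the first rain",
    "sow moong after the first rain",
    "wait two weeks before sowing moong",
]
score = stability(answers)
print(round(score, 2))  # -> 0.39
```

A low score on a farmer‑facing prompt would flag exactly the divergence described above and trigger guardrails or a human‑in‑the‑loop check.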
6. Governance vs. Technology – Audience Poll
The moderator asked the audience whether making alternate/secondary data AI‑ready is a governance problem or a technical problem.
- Result: Majority (≈ 60 %) saw it as a governance issue – needing policies, stewardship, and data‑ownership rules.
- Rohit’s stance: Emphasised the need for a federated data‑stewardship model, possibly led by the National Statistics Office (NSO), but open to other trusted entities.
7. Practical Roadmap – Cataloguing, Metadata, Context Files (MoSPI)
Rohit gave a concise four‑step checklist for any agency wishing to make data AI‑ready:
| Step | Action |
|---|---|
| 1. Cataloguing | Publish a machine‑readable inventory (not a PDF) of every dataset, including indicators and definitions. |
| 2. Metadata | Attach JSON/XML metadata for each field (type, units, update frequency). |
| 3. Context File | Provide a separate file that explains domain‑specific terms (e.g., “frequency = quarterly”). |
| 4. Business Glossary / Knowledge Graph | Convert the context into a searchable graph that downstream LLMs can query. |
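Step 4 can be sketched as a tiny term graph: glossary entries become nodes, with "related indicator" edges, so a downstream LLM tool can pull in neighbouring definitions with one lookup. The terms, definitions, and edges here are invented examples, not MoSPI's actual glossary.

```python
# Toy business glossary as a searchable graph: definitions plus
# "related" edges between terms. All entries are invented examples.

glossary = {
    "CPI": "Consumer Price Index, tracking retail-level prices.",
    "WPI": "Wholesale Price Index, tracking producer-level prices.",
}
edges = {"CPI": ["WPI"], "WPI": ["CPI"]}  # related-indicator links

def lookup(term: str) -> dict:
    """Return a term's definition plus its neighbours' definitions."""
    return {
        "term": term,
        "definition": glossary[term],
        "related": {n: glossary[n] for n in edges.get(term, [])},
    }

print(lookup("CPI")["related"])  # includes WPI's definition
```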
He also described MCP (Model Context Protocol) servers now live for ten MoSPI datasets, enabling queries such as "price of moong‑dal over the past year" via both Claude and ChatGPT.
Announcement: MOSPI’s Data‑Boarding‑Pass concept – a certificate that a dataset has passed the AI‑readiness checklist, making it instantly discoverable for B2B or research use.
8. Use‑Case Illustrations
| Use‑case | Who benefits? | How AI‑ready data helps |
|---|---|---|
| MSME location planning | Small shop owners | Overlay their sales CSV with 50 k public datasets (population, foot‑traffic, price index) to evaluate site risk. |
| Health‑worker decision support | Front‑line health workers | Fuse district‑level health surveys with climate & nutrition data to prioritise interventions. |
| Education content generation | Teachers of visually‑impaired children | Combine glossaries with LLMs to produce audio‑rich STEM lessons in regional languages. |
| Policy friction detection | Statisticians | Identify discrepancies between top‑down survey data and alternate crowd‑sourced data (e.g., road‑construction status) to flag data quality issues. |
Prem added that any organization can spin up a Data Commons instance in ~20 minutes using the open‑source platform, ingest a CSV, and instantly gain the “network effect” of all existing public datasets.
9. Audience Q&A Highlights
- Business Model for Data Platforms
  - Rohit clarified that the NSO is publicly funded; data is free for research, but commercial usage incurs fees under a transparent policy.
  - Ashish introduced the GIVE framework (Guarantee, Incentive, Value, Exchangeability) to motivate data contributions and sustain a data economy.
- Infrastructure Longevity ("Digging up rails")
  - Rohit likened the AI‑ready data layer to India's Digital Public Infrastructure (UPI, Aadhaar). It may evolve, but starting now prevents costly re‑work later.
- Ensuring Default Adoption of MCP
  - Prem emphasised that the MCP connector should be baked into analysts' existing tools so users never have to "add it manually."
- Creative Community Uses
  - A participant shared a Twitter (X) project that turned a Tamil folk song about grains into a grain‑price index using public data, illustrating grassroots innovation.
Key Takeaways
- AI‑ready data is more than digitisation – it needs structured metadata, contextual glossaries, and trusted provenance to be useful for LLMs.
- A shared, federated framework (core + aspirational layers) is essential; MoSPI will drive the first version and invite other agencies to adopt it.
- Google’s Data Commons demonstrates how open‑source, federated knowledge graphs can dramatically lower analyst effort and enable multilingual, cross‑domain queries.
- Contextualisation via glossaries/knowledge graphs is a practical way to overcome LLM weaknesses in domain‑specific vocabulary and regional dialects.
- Trust & stability benchmarks are being built (e.g., Amul AI, Bharat Vistar) to measure answer variability and guide guard‑rail policies.
- Governance is the bottleneck – agencies must agree on cataloguing standards, metadata formats, and a federated stewardship model; technical solutions alone are insufficient.
- MCP servers and the Data‑Boarding‑Pass provide a ready‑to‑use, certified pipeline for developers and policy‑makers, turning raw public datasets into instantly queryable AI assets.
- Real‑world impact includes MSME site‑selection, health‑worker decision support, inclusive education, and rapid detection of data‑quality frictions.
- Sustainable data‑economy requires clear incentives (GIVE framework) and a balanced public‑private cost model: free for research, fee‑based for commercial exploitation.
- Start now – building AI‑ready data rails today prevents costly retro‑fits later and aligns India’s data infrastructure with other Digital Public Infrastructure initiatives.
See Also:
- ai-for-everyone-empowering-people-businesses-and-society
- democratizing-ai-resources-in-india
- ai-innovators-exchange-accelerating-innovation-through-startup-and-industry-synergy
- flipping-the-script-how-the-global-majority-can-recode-the-ai-economy
- scaling-ai-solutions-through-southsouth-collaboration
- ai-for-economic-growth-and-social-good-ai-for-all-driving-economic-advancement-and-societal-well-being