Data Sharing Infrastructures for AI: Building for Trust, Purpose, and Public Values
Abstract
The panel examined the emerging tensions between unlocking data for AI‑driven innovation and protecting that data for national sovereignty, community rights, and public value. Participants explored how different incentive models—value‑exchange mechanisms, marketplace designs, legal frameworks (data‑protection law, copyright, data‑sovereignty), and community‑driven licensing—can reconcile these frictions. Real‑world examples from African language work (Masakhane), health‑data digitisation in low‑resource settings, and the Gates Foundation’s data‑ecosystem pilots illustrated both the promise and the pitfalls of current approaches. The discussion culminated in a call for coordinated, “South‑to‑South” collective action, pragmatic governance architectures, and a recognition that friction itself can be a catalyst for better policy design.
Detailed Summary
1. Framing the Tension (Astha Kapoor)
Astha Kapoor opened the session by framing data‑sharing as a dual‑purpose problem:
- Innovation‑driven unlocking – making data available so AI models can be trained, delivering public services, research breakthroughs, and economic growth.
- Sovereignty‑driven safeguarding – ensuring that data exchange respects national sovereignty, cultural norms, and community rights.
She noted that the tension is “not just technical”; it is rooted in the motivations of states and societies: sovereignty, service delivery, transparency, and accountability. She invited the panel to unpack the frictions that arise when these motivations clash.
2. Value‑Exchange & Incentive Models (Vijay Suresh Kumar)
- Incentive‑centric view – Data sharing works when a clear value exchange aligns the interests of all stakeholders (government, private sector, communities).
- Stakeholder alignment – If the data steward can see a tangible benefit—financial, capacity‑building, or social impact—participation becomes viable.
- Community‑centric concern – In African language contexts, community ownership and gender dynamics affect who can claim stewardship.
- Illustration – Without community‑level benefits, external actors may “extract” data while the originating community sees no return, reinforcing inequality.
Key Insight (Vijay): Frictions become productive when they surface hidden power imbalances; they force designers to embed equitable value flows into data‑exchange architectures.
3. Sovereignty, Extraction, and the “Utopian” Vision (Rahul Matthan)
- Legal framing – Rahul referenced his co‑authored “New Deal for Data” paper, which critiques the three dominant governance regimes: data protection, copyright, and data sovereignty.
- Data‑protection law – Emphasizes data minimisation and purpose‑limited retention—principles that conflict with AI’s appetite for large, persistent datasets.
- Copyright – Traditional copyright hinges on prohibiting unauthorised copying; Rahul argued that what machine learning does with data (adjusting model weights) is not verbatim duplication, so the regime is ill‑suited to AI.
- Data sovereignty – Often framed as a state‑centric tool, but Rahul argued AI raises unprecedented cross‑border concerns that existing sovereignty tools do not address.
Recommendation (Rahul): Redesign legal architectures to decouple ownership from mere possession, enabling data contributors to retain control while permitting responsible AI training.
4. Community‑Driven Data Workflows (Chenai Chair – Masakhane)
- Masakhane’s origin – A grassroots collective of African researchers filling a vacuum left by state inaction on AI for African languages.
- Bootstrapping via global allies – Received Google Cloud credits and mentorship, demonstrating that external resources can empower community‑led initiatives.
- Collective authorship – Papers often list 20+ contributors, reflecting an agency‑building approach where many scholars co‑own the output.
- Licensing experiments – Two frameworks, ESETU and NUDO, propose an “African‑first” open‑source period (e.g., two‑year exclusive access for African actors) before global commercialization.
Key Insight (Chenai): Community‑driven models can negotiate a hybrid of open‑source ethos and protective “local‑first” licensing, preserving cultural integrity while still inviting global innovation.
5. Marketplaces, Pricing, and “Value Attribution” (Saranya Gopinath)
- Marketplace vision – A decentralised data marketplace where datasets are discoverable, auditable, and priced variably based on user class (big‑tech vs. academia).
- Challenges of valuation – Data’s value “inflates” when aggregated; attributing a precise monetary reward to any single contributor is practically impossible.
- Proposed mechanisms –
- Tiered licensing – Different cost structures for commercial vs. non‑commercial users.
- Revenue‑share contracts – Percent‑based payouts tied to downstream AI product revenues.
- Clean‑room / federated learning – Data never leaves the host environment; only model updates move, reducing risk of misuse.
Recommendation (Saranya): Pursue a multi‑tiered marketplace that offers both monetary and non‑monetary incentives (e.g., co‑authorship, capacity‑building) to accommodate diverse contributor motivations.
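The tiered-licensing and revenue-share mechanisms above can be illustrated with a small pricing sketch. The tier names, fees, and percentages below are hypothetical placeholders chosen for illustration, not figures discussed by the panel:

```python
# Illustrative sketch of tiered dataset licensing with a revenue-share clause.
# All tier names, fees, and rates are hypothetical examples.
from dataclasses import dataclass

@dataclass
class LicenceTier:
    name: str
    upfront_fee: float    # flat access fee (USD)
    revenue_share: float  # fraction of downstream product revenue owed back

TIERS = {
    "academic":   LicenceTier("academic",   upfront_fee=0.0,      revenue_share=0.0),
    "startup":    LicenceTier("startup",    upfront_fee=1_000.0,  revenue_share=0.02),
    "commercial": LicenceTier("commercial", upfront_fee=25_000.0, revenue_share=0.05),
}

def total_payout(tier_name: str, downstream_revenue: float) -> float:
    """Upfront fee plus the revenue share owed to data contributors."""
    tier = TIERS[tier_name]
    return tier.upfront_fee + tier.revenue_share * downstream_revenue

# A commercial licensee earning $1M downstream owes the upfront fee plus 5%:
print(total_payout("commercial", 1_000_000))  # 75000.0
```

The point of the sketch is that user class, not the dataset alone, drives price, which is how a marketplace can keep academic access free while commercial use funds a return loop to contributors.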
6. “Data‑Extraction Cycle” & Sustainability (Vijay Suresh Kumar)
- Digitisation bottleneck – Many low‑income economies lack even basic digitised health, agricultural, or linguistic datasets.
- Capital intensity – Building repositories, cleaning data, and creating standards requires sustained investment; one‑off grants often fail to create a lasting ecosystem.
- Ecosystem approach – The Gates Foundation’s work includes:
- Data repositories (discoverable catalogs).
- Governance frameworks (audit trails, access controls).
- Clean‑rooms & federated learning labs to keep data local.
- Toolkits for harmonisation (ontologies, standards).
- Benchmarks tailored to local contexts (e.g., Sahara, MMLU‑lite).
Key Insight (Vijay): A sustainable data‑exchange ecosystem must combine technical infrastructure (clean rooms, federated access) with financial mechanisms (ongoing funding, value‑return loops) to avoid “dead‑end” projects.
7. Licensing, Open‑Source, and the Commons Debate (Chenai & Rahul)
- Licensing tension – Open‑source licences (e.g., GPL, CC‑BY) may be too permissive for communities wishing to protect cultural heritage; proprietary licences can exclude local innovators.
- ESETU / NUDO model – “Open first for Africans for two years, then commercialisable.” This hybrid approach seeks to balance openness with later monetisation.
- Commons critique (Bertrand Montubert, Global Partnership on AI) – The commons model – shared, non‑exclusive use – is challenged by the extractive nature of foundation models: a single data feed can be trained once and then generate outsized, centrally‑controlled value.
Recommendation (Chenai & Rahul): Develop purpose‑specific commons (e.g., health‑data commons, language‑resource commons) that embed restrictive usage clauses (non‑commercial only, “model‑to‑data” training) to prevent unilateral extraction.
8. The “Model‑to‑Data” Paradigm & Negotiating Leverage
- Idea – Instead of moving data to big‑tech clouds, move the model to the data (federated learning, secure enclaves).
- Leverage for low‑resource regions – By insisting that trained model weights stay locally, contributors retain a strategic asset that can be reused or further refined without surrendering raw data.
- Negotiation tactic – Countries (or regional blocs) can demand localised model snapshots as part of data‑access contracts, ensuring that the intellectual capital remains under local stewardship.
Key Insight (Saranya): “Model‑to‑data” flips the power dynamic, turning data providers into co‑owners of the resulting AI artefact, not just passive suppliers.
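The "model‑to‑data" pattern described above is, in its simplest technical form, federated averaging: each host trains in place and only model weights travel, never raw records. A minimal sketch, using invented toy datasets and a deliberately tiny one‑parameter model:

```python
# Minimal sketch of "model-to-data": each data host computes a local model
# update on-site, and only the updated weights (not raw records) are averaged
# centrally. The datasets and update rule are toy examples for illustration.

def local_update(weights, local_data, lr=0.1):
    """One gradient step for the model y = w*x, computed where the data lives.
    Raw (x, y) pairs never leave the host; only the new weight does."""
    w = weights[0]
    grad = sum(2 * (w * x - y) * x for x, y in local_data) / len(local_data)
    return [w - lr * grad]

def federated_average(updates):
    """The central aggregator sees only model weights, never the data."""
    return [sum(ws) / len(updates) for ws in zip(*updates)]

# Two hosts, each holding data roughly generated by y = 2x.
host_a = [(1.0, 2.0), (2.0, 4.0)]
host_b = [(1.0, 2.2), (3.0, 6.0)]

weights = [0.0]
for _ in range(50):
    updates = [local_update(weights, host_a), local_update(weights, host_b)]
    weights = federated_average(updates)

print(round(weights[0], 2))  # 2.01 — close to the true slope of 2
```

In a real deployment the aggregation would run inside a clean room or secure enclave, and keeping the final `weights` under local stewardship is exactly the negotiating lever described above: the hosts end up co‑owning the trained artefact.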
9. Collective Action & South‑to‑South Collaboration
- Scale of bargaining power – A single nation or community often lacks the leverage to negotiate favourable terms with global AI players.
- Regional coalitions – Proposals for African‑wide data alliances or India‑Africa joint data‑trusts to pool datasets, share governance, and present a united front to large tech firms.
- Philanthropic facilitation – Foundations (e.g., Gates) can act as neutral conveners, providing seed funding and “bridge” infrastructure while respecting community governance.
Recommendation (All Panelists): Institutionalise South‑to‑South data coalitions with shared legal templates and technical standards, thereby creating a critical mass that can negotiate on more equal footing.
10. Audience Q&A – Clarifying Open Issues
| Question | Main Points Raised |
|---|---|
| Marketplace vs. Commons (Bertrand Montubert) | Marketplace can accommodate variable licensing; commons risk extractive “one‑time grazing” without ongoing benefit. |
| Data as bargaining chip (Mok, policy researcher) | Strong pre‑emptive legal restrictions could force a higher price for data, but overly high barriers may deter collaboration. |
| Capital concentration in foundation models (Bharat, Takshila) | Only a few large models dominate; the “last bargaining chip” will be non‑digital knowledge (traditional, health, language). |
| Licensing & enforcement (Rahul) | Legal enforcement is costly; embedding technical safeguards (clean rooms, federated learning) is more reliable. |
| Labor & data‑labeling (Audience) | Data‑labelers are forming trade‑union‑like groups, highlighting the need to recognise human labour in data pipelines. |
| Future of AI‑driven labour (Mok) | AI may replace many jobs; data‑ownership rights could become a primary source of income for marginalised groups. |
11. Concluding Remarks
- Astha Kapoor summed up the discussion: friction is necessary and productive. It forces stakeholders to confront hidden inequities and design better governance.
- Call to Action – “Start the marketplace, embed legal‑technical safeguards, and create collective South‑to‑South bargaining bodies now before the window closes.”
Key Takeaways
- Friction as a catalyst: Tensions between data openness and sovereignty surface power imbalances; acknowledging them leads to stronger, more inclusive designs.
- Three insufficient legal regimes: Existing data‑protection, copyright, and sovereignty frameworks are misaligned with AI’s data needs; a new, hybrid legal architecture is required.
- Value‑exchange is central: Sustainable data sharing hinges on clear, equitable value flows—financial, capacity‑building, or community empowerment.
- Community‑driven models work: Masakhane demonstrates that grassroots, community‑owned data initiatives can thrive with targeted external resources and “local‑first” licensing.
- Marketplace with tiered access: A decentralised data marketplace should support varying licences (commercial vs. academic) and offer clean‑room/federated learning to protect raw data.
- Model‑to‑data paradigm: Moving models to data rather than data to models restores leverage for data contributors and aligns with sovereign interests.
- Collective South‑to‑South action: Regional data coalitions amplify bargaining power against global AI giants and enable shared standards, funding, and governance.
- Licensing hybrids (ESETU/NUDO): Time‑limited exclusive access for local actors before broader commercial release balances openness with community benefit.
- Data‑labour rights: The emerging labour movement around data labelling underscores the need to recognise and remunerate human work embedded in AI pipelines.
- Urgency: The current transition window—before foundation models become wholly self‑sufficient—is narrow; immediate coordinated action is essential.
See Also:
- digital-public-goods-for-global-ai-equity
- exploring-a-regulatory-framework-for-open-data
- ai-for-the-global-south-from-governance-to-inclusion
- pathways-for-equitable-ai-compute-access
- responsible-ai-for-health-governance-implementation-and-investment-considerations
- harnessing-ai-for-health-equity-building-inclusive-human-capital-and-strengthening-researchindustry-collaboration
- toward-collective-action-a-roundtable-on-safe-and-trusted-ai
- democratizing-ai-resources-equitable-access-to-compute-and-data-for-entrepreneurship
- flipping-the-script-how-the-global-majority-can-recode-the-ai-economy
- south-south-cooperation-in-ai-policymaking-developing-a-collaboration-roadmap