Advancing Multilingual AI: Global South Governance of Non-English Model Development and Deployment

Detailed Summary

1. Framing the Problem

Speaker: Aliya Bhatia (CDT)

  • Training‑data imbalance – Over 60 % of the data used to train most “multilingual” large language models (LLMs) is English‑derived (Common Crawl, other open‑source corpora). Only ~40 % is non‑English, and this minority is not representative of the linguistic diversity of the Global South.
  • Quality of non‑English data – When non‑English data exists, it is often:
    • Machine‑translated from English (introducing systematic errors).
    • Collected from low‑quality web‑scrapes that do not reflect natural spoken or written usage.
  • Architectural bias – Many model architectures are designed around English‑centric tokenisation, syntax, and training objectives, limiting transferability to typologically divergent languages.
  • Cross‑lingual transfer misconceptions – Assuming that a model can learn low‑resource language behaviour simply by copying sentence‑structure knowledge from high‑resource languages overlooks fundamental typological differences (e.g., SVO vs. SOV order).
  • Testing‑regime shortcomings – Evaluation benchmarks are usually:
    • Machine‑translated from English, lacking cultural nuance.
    • Aggregated into a single “multilingual” score that masks language‑specific failures.
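The masking effect of a single aggregated score can be made concrete with a toy calculation. All language names and accuracy figures below are hypothetical, chosen only to illustrate the arithmetic:

```python
# Toy illustration: a single averaged "multilingual" score can hide
# severe failures in individual (often low-resource) languages.
# All numbers below are hypothetical.

per_language_accuracy = {
    "English": 0.92,
    "French": 0.88,
    "Swahili": 0.41,   # low-resource: the model fails badly here
    "Quechua": 0.28,   # but the aggregate hides it
}

aggregate = sum(per_language_accuracy.values()) / len(per_language_accuracy)
print(f"Aggregate 'multilingual' score: {aggregate:.2f}")

# Disaggregated reporting surfaces the failures the average masks.
for lang, acc in sorted(per_language_accuracy.items(), key=lambda kv: kv[1]):
    flag = "  <-- below acceptable threshold" if acc < 0.6 else ""
    print(f"{lang:8s} {acc:.2f}{flag}")
```

Here the aggregate (about 0.62) reads as a middling pass, while Quechua performance is unusable; only the per-language breakdown reveals that.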

These three buckets—training data, model architecture, and evaluation—frame the panel’s subsequent discussion.
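The architecture-side point about English-centric tokenisation can also be illustrated. Byte-level tokenisers whose merge rules were learned mostly from English text tend to fragment non-Latin scripts into many more units, raising cost and degrading modelling. UTF-8 byte counts give only a crude proxy for this asymmetry (exact counts would require a real tokeniser vocabulary, and the sample phrases below are purely illustrative):

```python
# Rough illustration of tokenisation asymmetry: for a byte-level
# tokeniser trained mostly on English, non-Latin scripts fall back
# toward raw bytes. UTF-8 byte counts are a crude upper-bound proxy
# (a real tokeniser would be needed for exact token counts).

samples = {
    "English": "Good morning",
    "Yoruba": "Ẹ kú àárọ̀",   # diacritics inflate the byte count
    "Hindi": "सुप्रभात",        # Devanagari: 3 bytes per code point
}

for language, text in samples.items():
    chars = len(text)
    utf8_bytes = len(text.encode("utf-8"))
    print(f"{language:8s} chars={chars:2d} utf8_bytes={utf8_bytes:2d} "
          f"bytes/char={utf8_bytes / chars:.2f}")
```

The English phrase costs one byte per character, while the Hindi greeting costs three; a vocabulary with few non-English merges turns that byte inflation directly into token inflation.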


2. Panelist Contributions

2.1. African‑Language Landscape

Speaker: Tajuddin Gwadabe (Masakhane African Languages Hub)

  • Data scarcity & translation bottlenecks – Most African‑language NLP work still relies on pivot translation (African → English/French → African), which discards linguistic nuance and eliminates direct African‑to‑African translation pathways.
  • Cultural relevance gap – Existing LLMs may generate syntactically correct sentences in Swahili or Yoruba, but they fail on culturally grounded queries (e.g., local breakfast customs).
  • Multimodal dataset initiative – The hub is running a grant program to create multilingual, multimodal corpora for ~40 African languages, comprising:
    • Text passages.
    • Corresponding audio recordings (native speakers).
    • Contextual images (e.g., culturally specific paintings).
      This aims to enable vision‑language and voice‑to‑voice models that respect cultural context.
  • Potential for conflict mitigation – By providing cross‑language resources (e.g., Yoruba ↔ Igbo), multilingual AI can reduce inter‑ethnic misunderstandings in multilingual societies such as Nigeria (≈ 200 M population, three dominant languages).

2.2. Indian‑Language Benchmarking & Cultural Context

Speaker: Arushi Gupta / Urvashi Aneja (Digital Futures Lab)

  • Disaggregating “localisation” – Language is only one facet of culture; a thorough localisation effort must also embed:
    • Local histories, power asymmetries, cuisine, everyday practices.
    • Example: A model trained on soap‑opera scripts reflects a narrow, middle‑class, urban Indian cultural lens.
  • Intersectional bias – Gender, caste, race, and class intersect in ways that are invisible to English‑centric data.
  • Illustrative case study – Work on a sexual‑reproductive‑health chatbot for marginalized Indian women revealed:
    • Hindi uses the same word for “abortion” and “miscarriage”, which caused dangerous misclassifications.
    • Focus‑group discussions with end‑users generated the nuanced lexical distinctions needed, which were then fed back into the training data.
  • WEIRD‑NLP critique – The dominance of “Western, Educated, Industrialized, Rich, Democratic” (WEIRD) data sources reproduces Western biases, marginalising Global‑South linguistic realities.
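The chatbot case study above can be sketched in code: because Hindi uses one word (garbhpaat, गर्भपात) for both “abortion” and “miscarriage”, a term match alone cannot recover the user’s intent, and context cues of the kind elicited in the focus groups do the disambiguating. The cue lists and rules below are a hypothetical illustration, not the actual system’s logic:

```python
# Toy disambiguation sketch: the same Hindi word ("garbhpaat") covers
# both "abortion" and "miscarriage", so matching the word alone cannot
# recover intent. Context cues (hypothetical stand-ins for terms
# supplied by focus-group participants) can tip the decision.

AMBIGUOUS_TERM = "garbhpaat"

# Hypothetical context cues elicited from end-users.
INTENTIONAL_CUES = {"clinic", "pill", "procedure", "want to end"}
SPONTANEOUS_CUES = {"suddenly", "bleeding", "lost the baby", "accident"}

def classify_query(text: str) -> str:
    """Return 'abortion', 'miscarriage', 'ambiguous', or 'other'."""
    lowered = text.lower()
    if AMBIGUOUS_TERM not in lowered:
        return "other"
    intentional = any(cue in lowered for cue in INTENTIONAL_CUES)
    spontaneous = any(cue in lowered for cue in SPONTANEOUS_CUES)
    if intentional and not spontaneous:
        return "abortion"
    if spontaneous and not intentional:
        return "miscarriage"
    return "ambiguous"  # escalate to a human rather than guess

print(classify_query("garbhpaat after a pill from the clinic"))  # abortion
print(classify_query("suddenly bleeding, garbhpaat"))            # miscarriage
print(classify_query("information about garbhpaat"))             # ambiguous
```

The key design choice mirrors the panel’s point: when the cues conflict or are absent, the system refuses to guess and routes the query to a human, rather than silently collapsing two clinically different situations.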

2.3. Indigenous Language Content Moderation (Quechua)

Speaker: Dhanaraj Thakur (GWU)

  • Systemic exclusion in policy pipelines – Moderation policies are authored in English/Spanish, then machine‑translated into Quechua for review.
  • Human‑in‑the‑loop bottleneck – Quechua‑speaking moderators receive pre‑translated content, limiting their ability to judge nuance.
  • Empirical findings – User surveys showed that harmful Quechua content often remains online because classifiers fail to detect it, whereas comparable Spanish content is reliably flagged.
  • Root cause analysis – Large tech firms pursue “one‑size‑fits‑all” scaling, deploying generic multilingual models without accounting for low‑resource language idiosyncrasies or community‑driven feedback loops.

2.4. AI‑Governance, Sovereign Models & Policy Landscape

Speaker: Chinasa T. Okolo (Technecultura / UN consultant)

  • National AI strategies – Around 20 African countries have released AI strategies that stress AI sovereignty but rarely spell out multilingual mandates.
  • Sovereign‑model examples:
    • Nigeria – “Atlas”: LLM covering Yoruba, Igbo, Hausa, and Nigerian Pidgin.
    • Latin America – “LATAM‑GPT” (Chile).
    • Southeast Asia – “Sea Lion” (regional multilingual model).
  • Motivation – Private‑sector LLMs often ignore Global‑South languages or deliver low‑quality outputs; sovereign models aim to fill that gap by investing in local data and social infrastructure.
  • Governance implications – Emphasis on reducing dependence on global tech giants, fostering local ecosystem growth, and embedding community‑led data stewardship mechanisms.

2.5. Emerging Initiatives & Cross‑Cutting Solutions

| Initiative | Lead/Institution | Core Goal |
| --- | --- | --- |
| Nwulite Obodo (data‑governance framework) | University of Pretoria & Strathmore University (Prof. Melissa Omino, Chijioke Okorie) | Ensure fair compensation for community‑contributed data used to train LLMs. |
| Masakhane Benchmark for African Languages | Masakhane African Languages Hub | Measure model performance on real‑world tasks (e.g., crop‑disease diagnosis) in rural settings. |
| Community‑Centred Red‑Team / Benchmarking Network | Digital Futures Lab + Global‑South civil‑society coalition | Co‑design evaluation datasets, capture inter‑annotator disagreement, and avoid extractive research practices. |
| Safety‑Evaluation Working Group | Multi‑institutional (academia, NGOs, industry) | Develop shared methodologies for assessing gender, caste, and other intersectional harms in LLM outputs. |
| Public‑Communication Campaigns | Panel consensus | Counter the “magical” narrative of frontier AI firms and promote nuanced risk/benefit literacy among the general public. |

3. Cross‑Cutting Themes

  1. Data Ownership & Compensation – Emerging frameworks (e.g., Nwulite Obodo) stress that communities should receive tangible benefits, not merely cash transfers, for their linguistic contributions.

  2. Community‑Centred Evaluation – Benchmarking must move beyond static datasets to participatory red‑team exercises that surface cultural nuance and disagreement.

  3. Trust & Accountability – Trust hinges on transparent communication of model limits, robust public education, and inclusive governance structures that give communities a “right of refusal” over model deployment.

  4. Policy Alignment – Sovereign AI initiatives can catalyse multilingual capability, but they need to be paired with data‑infrastructure investments and institutional mechanisms that ensure community voices shape policy.

  5. Inter‑regional Knowledge Transfer – The panel highlighted similar challenges across Africa, South Asia, and Latin America, suggesting that shared tooling (multimodal corpora, benchmark protocols) can accelerate progress globally.


4. Audience Q&A – Language as a Living, Classified System

Question (Lavanch, Cambridge University): Is it desirable to “codify” language at all, given that language is fluid and classification risks flattening cultural experience?

Panel Reflections (summarized):

  • Recognition of colonial bias – All panelists agreed that AI, as a classificatory technology, can inadvertently reproduce colonial hierarchies.
  • Negotiated participation – The solution is not to abandon codification but to embed negotiated consent and right‑of‑refusal mechanisms at every AI lifecycle stage.
  • Iterative co‑design – Emphasised the need for ongoing community dialogue, continuous updating of corpora, and transparent governance to keep models aligned with living linguistic practice.

The discussion concluded that responsible multilingual AI requires both technical codification (to make systems usable) and robust sociopolitical safeguards that prevent cultural erasure.

Key Takeaways

  • Training Data Skew – Over 60 % of multilingual LLM training data is English; non‑English data is both scarce and low‑quality, creating systemic performance gaps.
  • Architectural & Evaluation Bias – Model designs and benchmark pipelines are English‑centric, leading to inaccurate cross‑lingual transfer and misleading “multilingual” scores.
  • Cultural Relevance Matters – Successful AI must handle contextual queries (e.g., local customs, idioms) which current models largely miss.
  • Community‑Generated Multimodal Corpora – Projects like Masakhane’s 40‑language multimodal dataset aim to supply the missing text‑voice‑image triad needed for culturally aware AI.
  • Intersectional Benchmarks – Gender, caste, and other identity dimensions require community‑led evaluation; a single “fairness” metric is insufficient.
  • Indigenous Content Moderation Failures – Quechua case shows how policy written in high‑resource languages, coupled with machine translation, leaves harmful content unchecked for low‑resource speakers.
  • Sovereign Model Momentum – National initiatives (Nigeria’s Atlas, LATAM‑GPT, Sea Lion) illustrate a growing trend toward locally governed multilingual LLMs.
  • Data‑Governance & Compensation – Frameworks such as Nwulite Obodo propose that data contributors receive fair benefits, addressing exploitation concerns.
  • Trust Through Transparency – Counter‑narratives to the “magical AI” hype, public‑focused risk communication, and open‑source evaluation tools are essential for accountability.
  • Right of Refusal & Negotiated Development – Embedding community consent mechanisms throughout the AI lifecycle helps reconcile the classificatory nature of technology with living linguistic diversity.