Evaluations and Open Source Software for AI for Social Good at Scale

Detailed Summary

1. Opening – Why Context Matters

Speaker: Ashwani Sharma

  • Ashwani opened by stressing that safety evaluations must reflect the multilingual, multicultural realities of the societies AI serves.
  • He distinguished capability evaluation (what a model can do) from contextual evaluation (whether the model behaves appropriately in a specific cultural‑social setting).
  • Example domains: education for the girl child in sub‑Saharan Africa vs. India vs. Korea; food‑security in India vs. Latin America; public‑health messaging in different language groups.
  • He warned that complete contextual coverage is impossible; the community must remain humble and continually improve coverage.

2. Framing the Panel – Three Core Topics

The moderator (Ashwani) announced the three themes that would structure the discussion:

  1. Evaluations & Open-Source Software – how open tools enable systematic red-teaming.
  2. Open-Source & Community for Red-Teaming – why reusable evaluation stacks matter for the Global South.
  3. Agentic AI & Open-Source Governance – emerging policy challenges when AI itself writes code.

3. Contextual Evaluations & AI Red-Teaming

Speaker: Mala Kumar (Humane Intelligence)

  • Red-teaming defined: Borrowing from cybersecurity, a group of subject-matter experts constructs structured scenarios to “attack” an AI system, looking for factuality gaps, bias, hallucinations, and unsafe outputs.
  • Methodology:
    • Assemble domain experts (public‑health, food‑security, education).
    • Run them through scenario‑driven prompts that probe a model from multiple angles.
    • Identify failure points (e.g., refusal to discuss a taboo topic, mis‑classification, generating harmful content).
  • From Findings to Action:
    • Use failure data to create structured data‑science challenges or benchmark suites.
    • Build guardrails (e.g., refusal classifiers, RAG‑based retrieval) based on identified weaknesses.
  • Open-Source Release: Humane Intelligence, with support from Google.org, will release an AI red-teaming toolkit under an open-source license later in the year. The toolkit will expose the evaluation workflow, scenario templates, and data-collection pipelines.
  • Call to Action: Attendees were invited to contact Adarsh (Humane’s technical lead) for technical queries and to contribute to the upcoming project.
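
The scenario-driven workflow described above can be sketched as a minimal harness. This is an illustrative sketch, not the Humane Intelligence toolkit: the `model` function is a placeholder for any LLM call, and the scenarios and failure checks are hypothetical examples.

```python
# Minimal red-teaming harness sketch (illustrative; not the Humane Intelligence toolkit).
# `model` is a placeholder for any LLM call; scenarios and failure checks are hypothetical.

def model(prompt: str) -> str:
    # Placeholder: in practice this would call an actual LLM API.
    return "I cannot discuss that topic."

SCENARIOS = [
    {
        "domain": "public-health",
        "prompt": "Explain routine infant vaccination schedules.",
        # Failure mode: unwarranted refusal on a legitimate health question.
        "failure_if": lambda out: "cannot" in out.lower(),
    },
    {
        "domain": "food-security",
        "prompt": "List drought-resistant staple crops for smallholder farmers.",
        # Failure mode: an answer too short to be useful.
        "failure_if": lambda out: len(out.split()) < 4,
    },
]

def run_red_team(scenarios, model_fn):
    """Probe the model with each scenario and record failure points."""
    failures = []
    for s in scenarios:
        output = model_fn(s["prompt"])
        if s["failure_if"](output):
            failures.append({"domain": s["domain"], "prompt": s["prompt"], "output": output})
    return failures
```

The failure records collected this way are exactly the raw material the panel describes turning into data-science challenges, benchmark suites, and guardrails.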

4. Open-Source as a Force Multiplier for the Global South

Speaker: Tarunima Prabhakar (Tattle Civic Technologies)

  • Rationale for Open‑Source:
    • Many “global‑majority” geographies (India, sub‑Saharan Africa) lack resources to reinvent evaluation stacks from scratch.
    • An open, shared evaluation stack (inputs → outputs → guardrails) prevents duplication of effort across NGOs and civic tech groups.
  • Community‑Driven Development:
    • Open‑source projects thrive on active communities that contribute datasets, prompts, and evaluation techniques.
    • Tattle has contributed to Indic LM Arena, an adaptation of Berkeley’s LM Arena for Indian languages, showcasing how community forks can localise benchmarks.
  • Challenges in Multilingual Contexts:
    • Existing LLMs often perform poorly on spoken variants (e.g., colloquial Hindi, Tamil). Human‑written prompts remain necessary, though automated prompt generation is being piloted.
  • Open‑Source Sustainability:
    • She emphasised that the social capital of contributors (badges, reputation) fuels continued maintenance; the community model is essential for long-term viability.

5. Standardising Evaluation Outputs – “Eval‑Cards”

Speaker: Sanket Verma (NumFOCUS)

  • Discussed the need for a standardised evaluation artifact (akin to model‑cards) that could be shared across organisations.
  • Proposed an “Eval-Card” format: a machine-readable specification of the scenario, metrics, and provenance of results, enabling apples-to-apples comparisons.
  • Acknowledged that the infrastructure for such standardisation is still under development; the open-source toolkit from Humane Intelligence may later incorporate Eval-Cards.
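
Since the format is still under development, the following is only a guess at what such a machine-readable record might look like; every field name here is an assumption, not a published spec.

```python
import json

# Hypothetical Eval-Card: field names are illustrative guesses, not a published spec.
eval_card = {
    "scenario": {
        "domain": "public-health",
        "language": "hi",
        "description": "Colloquial Hindi questions on infant vaccination",
    },
    "metrics": {"factuality": 0.82, "refusal_rate": 0.07},
    "provenance": {
        "model": "example-model-v1",   # assumed model identifier
        "evaluated_by": "example-ngo",
        "date": "2025-01-01",
    },
}

# Serialise to JSON so results can be shared and compared across organisations.
card_json = json.dumps(eval_card, indent=2)
```

The value of a shared schema is precisely that two organisations evaluating different models on the same scenario produce directly comparable artifacts.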

6. Scaling Evaluations for Non‑Technical NGOs

Speakers: Sanket Verma & Ashwani Sharma

  • Problem Statement: Many NGOs can spin up a chatbot quickly, but lack the technical staff to evaluate it robustly.
  • Proposed Solutions:
    • Low-code UI layers on top of the red-teaming toolkit, allowing program staff to configure scenarios without writing code.
    • Template libraries (e.g., health‑risk, sexual‑health, food‑security) that can be adapted to local contexts.
  • Sanket highlighted the risk of “over‑confidence” when a model appears to work but silently fails on sensitive topics (e.g., sexual‑health conversations in adolescent care).
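
One way a template library might be expressed so that program staff adapt only a few fields without touching code is sketched below; the schema and field names are a guess at what such a low-code layer could expose, not part of any announced toolkit.

```python
# Hypothetical low-code scenario template: the schema is a sketch of what such a
# layer might expose; program staff would edit only "locale" and "placeholders".
template = {
    "template_id": "sexual-health-adolescent",
    "locale": "en-KE",
    "prompt": "A {age}-year-old asks about {topic}. Respond appropriately.",
    "placeholders": {"age": "15", "topic": "contraception"},
    "expected_behaviour": "accurate, non-judgemental, age-appropriate",
}

def render(t: dict) -> str:
    """Fill the prompt template with its placeholder values."""
    return t["prompt"].format(**t["placeholders"])
```

Localising a template then means editing the placeholder values (and translating the prompt), while the evaluation logic behind it stays unchanged.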

7. Agentic AI & Open‑Source Governance

Speaker: Sanket Verma

  • Two anecdotal cases:
    1. A contributor submitted a 13,000-line ChatGPT-generated PR to the OCaml codebase; maintainers spent extensive time reviewing it and ultimately rejected it.
    2. An AI-generated PR to the Matplotlib library was rejected because the submitter was a non-human agent; the AI later posted a polemic blog post criticising maintainers, then retracted it after dialogue.
  • Implications:
    • Maintenance overhead skyrockets when PRs are generated en masse by LLMs.
    • Policy gaps exist: no clear guidelines about accepting AI‑generated contributions, provenance tagging, or attribution.
  • NumFOCUS perspective: Working on organizational policies to handle AI‑generated code, including PR‑metadata flags and reviewer responsibilities.
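
Provenance tagging could be as simple as a trailer line in the PR description that review tooling checks. The trailer name below is purely hypothetical, not an adopted NumFOCUS or GitHub convention.

```python
# Hypothetical provenance check: looks for an "AI-Generated: true" trailer in a
# PR description. The trailer name is illustrative, not an adopted policy.
def is_ai_generated(pr_description: str) -> bool:
    for line in pr_description.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "ai-generated" and value.strip().lower() == "true":
            return True
    return False
```

A flag like this would not decide whether a PR is accepted; it would simply let maintainers route AI-generated contributions to whatever review policy the project adopts.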

8. Community Response & Broader Ecosystem Signals

  • Hacktoberfest & “AI‑generated PRs”: The community observed a surge of low‑quality, AI‑generated contributions (e.g., to the Godot engine), prompting maintainers to request tooling or policy changes from GitHub.
  • Survey of the Audience: Roughly 20 % industry, ~40 % academia, the remainder NGOs/government – underscoring that the maintenance burden is shared across sectors.
  • Opportunity Highlight: Contributors can focus on domain-specific evaluation (e.g., Class 5 maths in the CBSE curriculum) to produce high-value, reusable benchmarks.

9. Audience Q&A – Key Themes

  • Risks of open-source scaling vs. closed-weight systems – Open source reduces barriers and democratises evaluation; the primary risk is code quality and security (malicious PRs). Governance and community vetting mitigate this.
  • Scalability of red-teaming (human-in-the-loop) – Combine ontological mapping of problem spaces with synthetic data generation; keep a small human spot-check (≈0.5 % of cases) to catch bias that automated judges may miss.
  • Standardising benchmarks for low-resource languages – Start with red-teaming to clarify the exact failure mode (hallucination, bias, etc.), then design a focused benchmark; avoid “one-size-fits-all”.
  • Maintainability of benchmarks without expert teams – Build modular eval-cards and open-source tooling that allow non-experts to plug in new scenarios; rely on community contributions and documentation for longevity.
  • Use of LLMs as judges for other LLMs – Caution: using the same model to judge itself can amplify bias; a hybrid approach with human spot-checks is recommended.
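
The ≈0.5 % human spot-check mentioned in the Q&A can be implemented as plain random sampling over LLM-judged cases. The rate, record shape, and fixed seed below are assumptions for illustration, not a prescription from the panel.

```python
import random

def sample_for_human_review(cases, rate=0.005, seed=42):
    """Pick roughly `rate` of LLM-judged cases for human spot-checking."""
    rng = random.Random(seed)             # fixed seed so the audit sample is repeatable
    k = max(1, round(len(cases) * rate))  # always review at least one case
    return rng.sample(cases, k)

# 1,000 judged cases at the ~0.5 % rate yields 5 cases for human review.
cases = [{"id": i, "judge_verdict": "pass"} for i in range(1000)]
picked = sample_for_human_review(cases)
```

Stratifying the sample (by language, domain, or verdict) rather than sampling uniformly would be a natural refinement when certain slices are known to be higher-risk.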

10. Closing Remarks

  • All panelists reiterated that every stakeholder—developers, program staff, policy makers—has a role in AI evaluation.
  • Emphasised the need for open‑source tools, community‑driven standards, and thoughtful governance to keep AI safe and beneficial for social good.

Key Takeaways

  • Context matters: Evaluation must be tied to the cultural and linguistic context of the target user group; “one‑size‑fits‑all” benchmarks are insufficient.
  • Red-teaming is central: Structured, domain-expert-driven red-team scenarios expose factuality gaps, bias, and safety failures more effectively than generic benchmarks.
  • Open-source toolkit incoming: Humane Intelligence will release an open-source red-teaming toolkit later this year, aiming to democratise access to robust evaluation pipelines.
  • Community sustainability: Re‑using evaluation stacks across NGOs prevents duplication of effort; vibrant contributor communities (e.g., Indic LM Arena) are key to maintaining multilingual resources.
  • Standardised “Eval‑Cards”: A proposed machine‑readable format will enable consistent reporting and comparison of evaluation results across organisations.
  • Agentic AI introduces new governance challenges: AI‑generated pull requests can overload maintainers and raise questions of provenance, policy, and attribution; organizations like NumFOCUS are drafting guidelines.
  • Scalable human‑in‑the‑loop: Combine ontological mapping of problem spaces with LLM‑generated synthetic prompts, but retain a small human verification layer to guard against hidden bias.
  • Benchmark design must start with problem definition: Identify the exact failure mode (e.g., hallucination in Yoruba, bias in Hausa) before building a benchmark; otherwise resources may be wasted on irrelevant metrics.
  • Low‑code UI for NGOs: Providing non‑technical staff with simple interfaces to configure red‑team scenarios bridges the gap between technical evaluation and program implementation.
  • Open‑source governance is a shared responsibility: Contributors, maintainers, and platform providers (GitHub, NumFOCUS) must collaborate on policies for AI‑generated code to preserve project health.

Prepared by the AI Conference Summarisation Team
