Best practices from the International Network for Advanced AI Measurement, Evaluation and Science.
Abstract
The panel explored how the International Network for Advanced AI Measurement, Evaluation and Science (the “Network”) can help its member organisations share best practices, overcome evaluation challenges, and coordinate government‑industry collaboration worldwide. Each panelist gave a brief overview of their national AI evaluation programme, highlighted common technical and policy obstacles (rapid model evolution, benchmark degradation, agentic AI, resource constraints), and described concrete steps the Network is taking—joint testing exercises, open‑source tooling, shared reporting standards, and trusted information exchange. The audience asked about emerging agentic evaluations, budgeting disparities, and geopolitical pressures on AI safety, prompting additional insights on future priorities such as societal‑impact assessments and pre‑deployment audits.
Detailed Summary
1. Introduction
Moderator Chris Meserole opened the session by introducing the four panelists and noting that the summit was hosted in India. He invited each speaker to give a short status update on their organisation’s work.
2. Opening Remarks from the Panel
2.1. United Kingdom – Adam Beaumont
- Institutional background – UK AISI was launched at the Bletchley Park AI Summit (2023). It is part of the ten‑member “International Network for Advanced AI Measurement, Evaluation and Science.”
- Core deliverables – The UK team maintains Inspect, an open‑source evaluation framework used by frontier AI firms and government bodies.
- Measurement emphasis – Stressed that reliable measurement is a pre‑condition for understanding AI capabilities and for safe deployment.
- National investment – New funding for the National Physical Laboratory’s AI‑measurement centre aims to accelerate secure, transparent AI adoption.
- Publications – Cited the “Frontier AI Trends Report” (December 2025) summarising two years of evaluations across ~30 models.
2.2. Singapore – Wan Sie Lee
- Organisation – Leads AI governance and safety at the Infocomm Media Development Authority (IMDA) and heads the Singapore AI Safety Institute (SAISI).
- Timeline – SAISI was announced around the Seoul AI Summit (2024) and formalised shortly after.
- Evaluation focus – Conducts both practical model evaluations and research on the science of evaluations. Emphasised the need for technical capacity within government to keep pace with rapidly emerging evaluation methods.
- Academic partnership – Main research hub is the Digital Trust Centre at Nanyang Technological University.
- Collaborative work – Highlighted joint testing with other Network members and the importance of multilingual evaluation (e.g., language‑specific testing for Indian, Korean, Japanese contexts).
2.3. United States – Austin Mayron
- Institutional shift – In June 2025 the US AI Safety Institute was renamed the Center for AI Standards and Innovation (CAISI), signalling a move from pure safety to broader innovation and adoption.
- Host institution – CAISI is housed within NIST, leveraging its historic standards‑development expertise.
- Science‑first approach – CAISI develops measurement science for AI, championing transparent, reproducible, and “gold‑standard” evaluations (referencing the 2025 US Executive Order “Restoring Gold Standard Science”).
- Network role – Emphasised the need for a single, consistent voice across countries to avoid fragmented evaluation outcomes.
- Recent output – Draft NIST AI‑802 “Best Practices for Automated Benchmark Evaluations,” now publicly available for comment.
2.4. Industry – Sara Hooker (Adaption Labs)
- Career trajectory – Formerly at Google Brain, led research at Cohere, now co‑founder of Adaption Labs.
- Two‑pronged perspective – (1) Benchmarks are increasingly static, over‑fit, and gamified; (2) Government can play a unique curatorial role.
- Benchmark criticism – Open benchmarks rapidly become stale as models train on them; private test sets and secret hold‑outs are needed to retain relevance.
- Systemic harm gap – Current benchmarks miss real‑world harms (e.g., fraudulent voice‑calls, AI‑generated fake resumes, “AI companionship” misuse).
- Policy relevance – Benchmarks must drive concrete policy decisions; otherwise they become data collection for its own sake.
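Hooker’s public‑vs‑private hold‑out point can be made concrete with a toy check (the function names and data here are invented for illustration, not any real benchmark suite): compare a model’s score on an open benchmark against a secret hold‑out, and flag a large gap as a likely sign the public test set has leaked into training.

```python
def score(model, dataset):
    """Fraction of (question, answer) pairs the model gets right."""
    return sum(model(q) == a for q, a in dataset) / len(dataset)

def leaderboard_gap(model, public_set, private_set, threshold=0.10):
    """Flag a model whose public-benchmark score far exceeds its private
    hold-out score -- a symptom of the test set leaking into training."""
    public, private = score(model, public_set), score(model, private_set)
    return {"public": public, "private": private,
            "suspect_overfit": (public - private) > threshold}

# Toy "model" that has memorised the public set but not the hold-out.
public_set = [(f"q{i}", f"a{i}") for i in range(100)]
private_set = [(f"p{i}", f"a{i}") for i in range(100)]
memorised = dict(public_set)
model = lambda q: memorised.get(q, "wrong")

result = leaderboard_gap(model, public_set, private_set)
# Scores 1.0 publicly and 0.0 privately, so it is flagged as suspect.
```

This is why Hooker argues that private test sets must either stay secret or evolve continuously: once the hold‑out is public, the gap signal disappears.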
3. The Core Challenges in AI Evaluation
| Challenge | Representative Insights |
|---|---|
| Speed of model progress | Wan Sie Lee: “Capabilities develop very quickly; we struggle to keep evaluation pipelines up‑to‑date.” |
| Language & cultural coverage | Wan Sie Lee: Emphasised multilingual testing, citing Indian, Korean, Japanese language needs. |
| Agentic AI (tools, memory, autonomous behavior) | Austin Mayron & Adam Beaumont: Described the need for new scaffolding to capture tool‑calling, intermediate states, and multilingual agent interaction. |
| Benchmark degradation & “leaderboard illusion” | Sara Hooker: Open benchmarks are gamified; private benchmarks must stay secret or evolve continuously. |
| Public vs. private reporting | Adam Beaumont & Austin Mayron: Discussed a tension between transparency (gold‑standard science) and information‑hazard control. |
| Resource disparities across nations | Audience question answered by Adam Beaumont: UK shares open‑source tools (Inspect, cyber‑arena) and publishes reports to level the playing field. |
| Geopolitical pressure | Austin Mayron: US policy (Executive Order on AI) influences standards work; network helps depoliticise technical dialogue. |
| Reproducibility & reporting standards | Austin Mayron: NIST AI‑802 outlines emerging practices (e.g., publishing evaluation transcripts). |
| Societal‑impact assessment | Adam Beaumont (later): Highlighted the next frontier—evaluating broader societal consequences, not just technical metrics. |
4. The Role of the International Network
4.1. Knowledge‑Sharing & Joint Testing
- Joint exercises – Singapore, UK, US, and others have performed collaborative testing on language coverage, cybersecurity, and early agentic scenarios.
- Trust building – Wan Sie Lee stressed that the Network “creates a trusted enclave” where government experts can exchange technical details safely.
4.2. Open‑Source Toolkits
- Inspect framework – UK AISI’s open‑source platform for model assessment (released 2024).
- Cyber‑arena / Control arena – Additional UK tools for adversarial testing of AI systems.
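Frameworks like Inspect typically decompose an evaluation into a dataset of samples, a solver that produces model output, and a scorer that grades it. A stripped‑down sketch of that pattern (not Inspect’s actual API; all names and the lookup‑table “model” are invented) might look like:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Sample:
    input: str
    target: str

def run_eval(samples: List[Sample],
             solver: Callable[[str], str],
             scorer: Callable[[str, str], bool]) -> float:
    """Run the solver on every sample and return mean scorer accuracy."""
    return sum(scorer(solver(s.input), s.target) for s in samples) / len(samples)

# Hypothetical usage: a lookup-table "model" and an exact-match scorer.
samples = [Sample("capital of France?", "Paris"),
           Sample("capital of Japan?", "Tokyo")]
answers = {"capital of France?": "Paris", "capital of Japan?": "Kyoto"}
solver = lambda prompt: answers[prompt]
scorer = lambda output, target: output == target

accuracy = run_eval(samples, solver, scorer)  # 1 of 2 correct -> 0.5
```

Separating dataset, solver, and scorer is what lets the same benchmark be re‑run against new models, or the same model against new scorers, without rewriting the harness.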
4.3. Standardisation Efforts
- Common reporting templates – Discussed during the San Diego workshop; still an open question whether a single format should be adopted.
- Public‑private dichotomy – Consensus that some results must remain private to mitigate information hazards, while methodology should be shared openly.
4.4. Community‑wide Communication
- Slack & email channels – Austin Mayron highlighted informal expert‑to‑expert communication as a rapid‑response mechanism.
- Public blog posts & drafts – All four organisations publish “blog‑style” technical summaries (e.g., UK Frontier AI Trends Report, US NIST draft).
5. Audience Q&A – Key Points
| Question | Main Responses |
|---|---|
| Evaluating emergent agentic systems | Adam Beaumont: Need new scaffolding to capture intermediate tool calls. Austin Mayron: Joint testing showed language‑specific challenges in tool‑calling. |
| Resource gaps for smaller AISIs | Adam Beaumont: Open‑source tools, shared reports, and frontier‑AI trends data help less‑resourced bodies. |
| Geopolitics & US executive order | Austin Mayron: Executive order reinforces gold‑standard science; it pushes US agencies to adopt transparent, reproducible standards, which the Network can help harmonise globally. |
| Preventing capability “superposition” in models | No direct answer; panel noted that pre‑deployment audits (mentioned later) and tighter research‑stage evaluation may mitigate unchecked capability emergence. |
| Rollback / intervene‑ability mechanisms | Sara Hooker: Model checkpointing and multi‑generation pools already provide a form of rollback; emphasised need for systematic “intervenability” benchmarks. |
| Future evaluation priorities | Consensus on three fronts: (1) Agentic AI testing, (2) Societal‑impact assessments, (3) Pre‑deployment audits with industry partners. |
6. Emerging Consensus & Action Items
- Develop shared, modular evaluation scaffolding for agentic AI (tool‑calling, memory, multilingual prompts).
- Expand open‑source tooling (Inspect, cyber‑arena) and provide documentation for rapid adoption by newer AISIs.
- Standardise reporting (metadata, reproducibility checklists) while defining clear criteria for what stays private.
- Institutionalise regular joint testing exercises (e.g., yearly “Network Evaluation Hackathon”).
- Broaden the evaluation remit to include systemic societal impact metrics (e.g., labor market displacement, misinformation propagation).
- Maintain a trusted communication channel (Slack, mailing list) for real‑time sharing of emerging threats, papers, and methodological tweaks.
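As a sketch of the first action item above: agentic evaluation scaffolding must record intermediate tool calls, not just the final answer. A minimal transcript‑capturing loop (the agent, tool set, and action schema are all hypothetical, invented for illustration) could look like:

```python
def traced_agent_run(agent_step, tools, task, max_steps=5):
    """Run a toy agent loop while recording every intermediate tool call --
    the kind of transcript agentic evaluations need to capture."""
    transcript = []
    state = task
    for step in range(max_steps):
        action = agent_step(state)  # e.g. {"tool": "add_one", "args": [0]}
        if action.get("tool") == "finish":
            transcript.append({"step": step, "final": state})
            break
        result = tools[action["tool"]](*action["args"])
        transcript.append({"step": step, "tool": action["tool"],
                           "args": action["args"], "result": result})
        state = result
    return state, transcript

# Hypothetical agent that increments until it reaches 2, then finishes.
tools = {"add_one": lambda x: x + 1}
def agent_step(state):
    return {"tool": "add_one", "args": [state]} if state < 2 else {"tool": "finish"}

final, transcript = traced_agent_run(agent_step, tools, task=0)
# final == 2; the transcript records two tool calls plus the finish step.
```

Publishing transcripts of this shape, rather than final scores alone, is the direction the reporting practices discussed by the panel point toward.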
Key Takeaways
- Measurement is foundational – All panelists agreed that without robust, reproducible evaluation frameworks, safe AI deployment is impossible.
- The Network provides a vital trust hub – It enables governments to exchange technical details that would otherwise be siloed, fostering faster collective learning.
- Benchmarks are becoming obsolete – Static, public leaderboards are quickly over‑fit; private or evolving test sets are required to stay relevant.
- Agentic AI introduces new dimensions – Evaluations now must capture intermediate tool‑calls, multi‑step reasoning, and multilingual interactions.
- Open‑source tools (e.g., Inspect) democratise capability – Sharing software and reports helps lower‑resource nations participate meaningfully.
- Transparency vs. information hazard – A balance is needed between gold‑standard reproducibility and protecting sensitive evaluation data.
- Geopolitical dynamics shape standards – US executive orders and similar policies drive national adoption of rigorous evaluation norms, which the Network can help harmonise.
- Future work must move beyond technical metrics – Societal‑impact assessments and pre‑deployment audits are identified as the next frontier for the Network.
- Resource inequities can be mitigated – Shared tooling, joint reports, and community mentorship are concrete ways to level the playing field.
- Continued collaboration is essential – Regular workshops, joint testing, and a persistent communication channel are needed to keep pace with rapid AI advances.
See Also:
- panel-on-the-2026-international-ai-safety-report
- preparing-to-monitor-the-impacts-of-agents-closing-the-global-assurance-divide-for-safe-and-trusted-ai
- effective-ai-assessments-verification-and-assurance-establishing-the-foundations-for-responsible-confidence-in-ai
- the-role-of-science-in-international-ai-governance
- towards-a-safer-south-launch-of-the-global-south-network-on-ai-safety-and-evaluation
- flipping-the-script-how-the-global-majority-can-recode-the-ai-economy
- evaluations-and-open-source-software-for-ai-for-social-good-at-scale
- implementing-ai-standards-for-global-prosperity-in-an-era-of-agentic-ai
- scaling-trusted-ai-global-practices-local-impact
- international-ai-safety-coordination-what-policymakers-need-to-know