AI in Governance: Revolutionising Government Efficiency

Abstract

The session opened with Dean Karlan’s presentation of a randomized field experiment in Togo that used mobile‑phone metadata and machine‑learning models to target COVID‑era cash transfers. The study found that AI dramatically improved beneficiary targeting but failed to detect the program’s impact on poverty‑related outcomes when compared with a traditional household survey. Karlan then highlighted broader methodological challenges of evaluating fast‑moving AI tools in development settings. A moderated panel then examined the readiness of governments—globally and in India—to adopt AI, the practical hurdles faced by innovators working with the public sector, and how evidence‑based evaluation can guide procurement and scaling of AI solutions for greater state capacity.

Detailed Summary

1.1 Background & Motivation

  • Problem: Identifying the poorest households for cash‑transfer programmes is difficult in low‑ and middle‑income countries because administrative registers are sparse and census data are outdated (some from the 1940s–1980s).
  • Opportunity: Mobile‑phone metadata (call‑detail records, location traces, SMS volumes) provide a rich, real‑time source of behavioural signals that can be fed into machine‑learning pipelines.

1.2 The Novissi Cash‑Transfer Program

  • Launched during the COVID‑19 lockdown so that informal workers would not need to visit markets to earn a living.
  • Urban rollout: Used recent voter‑registration lists as a proxy for informal‑worker status (no AI involved).
  • Rural rollout: Lacked comparable registers, so the team built an AI model from six months of phone metadata covering 5.83 million subscribers and 1.3 billion calls.

1.3 Methodology

  1. Training data (“truth” dataset): A pre‑COVID household survey containing consumption, food‑security, and asset measures.
  2. Machine‑learning pipeline: Features extracted from phone data (call frequency, mobility patterns, inbound/outbound calls, SMS volume, etc.) fed into a supervised model to predict poverty status.
  3. Targeting: The model’s poverty scores were used to allocate cash transfers to the poorest villages.
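
The three steps above can be sketched in code. This is a hypothetical illustration on synthetic data, not the study's actual pipeline; every feature, coefficient, and model choice below is invented for exposition.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-ins for call-detail-record aggregates (step 2).
X = np.column_stack([
    rng.poisson(30, n).astype(float),   # calls per month
    rng.exponential(5.0, n),            # mean call length, minutes
    rng.poisson(12, n).astype(float),   # distinct towers visited (mobility)
    rng.poisson(40, n).astype(float),   # SMS volume
])

# Synthetic "truth" labels standing in for the pre-COVID survey (step 1):
# in this toy world, poorer households make fewer calls and move less.
latent = -0.10 * X[:, 0] - 0.15 * X[:, 2] + rng.normal(0, 1, n)
y = (latent > np.median(latent)).astype(int)  # 1 = poorer half

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Poverty scores that would be used to rank recipients (step 3).
poverty_scores = model.predict_proba(X_te)[:, 1]
print(f"held-out accuracy: {model.score(X_te, y_te):.2f}")
```

In practice the ranking, not the accuracy number, is what matters: transfers go to the households (or villages) with the highest predicted‑poverty scores.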

1.4 Findings – Targeting vs. Impact Measurement

  • Targeting accuracy
    • Survey‑based “truth”: High – correctly identified poor clusters.
    • Phone‑data machine learning: High – matched consumption‑based poverty maps.
  • Measured impact on outcomes
    • Survey‑based “truth”: Significant gains in food security, mental health, perceived socioeconomic status, and the aggregate welfare index.
    • Phone‑data machine learning: Zero detectable effect on all outcomes (confidence intervals included zero).

Key Insight: AI excelled at identifying poor households but failed to capture the short‑run treatment effects observed in the survey.

1.5 Why the Discrepancy?

  1. Different outcome focus: Survey captured short‑term vulnerability (food security, stress) which is hard to infer from phone behaviour; phone data better captures long‑run asset wealth.
  2. Model drift: Training data pre‑dated COVID; pandemic‑induced changes in phone usage (e.g., reduced mobility, altered call patterns) broke the relationship between features and poverty.
  3. Data limitation: No fresh consumption data were collected during COVID with which to re‑train the model, so the learned feature‑to‑poverty mapping no longer reflected pandemic‑era behaviour.
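
The model‑drift point (2) is measurable in practice: comparing a feature's distribution at training time with its distribution at deployment time flags exactly the kind of break described. A minimal sketch using the population stability index, a common drift metric; the data and feature names here are illustrative, not from the study:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population stability index between a training-time sample
    ("expected") and a deployment-time sample ("actual") of one feature."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    def shares(x):
        # Bin each observation by the training-time deciles.
        idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, bins - 1)
        return np.clip(np.bincount(idx, minlength=bins) / len(x), 1e-6, None)
    e, a = shares(expected), shares(actual)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(1)
pre = rng.exponential(5.0, 10_000)   # e.g. daily travel distance, pre-COVID
post = rng.exponential(2.0, 10_000)  # mobility collapses under lockdown

print(f"PSI = {psi(pre, post):.2f}")  # values above ~0.25 are conventionally read as severe drift
```

A check like this cannot re‑label the data, but it can warn that a model trained on pre‑pandemic behaviour is being applied out of distribution.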

1.6 Broader Lessons on AI Evaluation

  • Speed mismatch: AI tools evolve rapidly, whereas randomized impact evaluations are slow.
  • Need for short‑run proxies: Identify outcomes that can be measured quickly and are predictive of long‑run welfare (e.g., service uptake, daily transaction volume).
  • Iterative design vs. static evaluation: Many AI‑driven programs iterate continuously; evaluations must either capture the full iteration or lock the design before testing.
  • Counterfactual rigor: Without a credible control group, claims of success are vacuous. Randomized control trials (RCTs) remain the gold standard.
  • Procurement implications: Governments should embed evidence generation in AI procurement, avoiding “splashy” demos that lack impact data.
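
The counterfactual‑rigor point is mechanically simple once assignment is randomized: the treatment effect is a difference in means, and its confidence interval determines whether "zero effect" can be ruled out, as in the phone‑data results above. A minimal sketch on synthetic data (the effect size and sample size are invented):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 4000

# Random assignment to treatment (receives the transfer) or control.
treat = rng.integers(0, 2, n).astype(bool)

# Synthetic outcome, e.g. a standardized food-security index,
# with an invented true treatment effect of +0.3.
outcome = rng.normal(0.0, 1.0, n) + 0.3 * treat

diff = outcome[treat].mean() - outcome[~treat].mean()
se = np.sqrt(outcome[treat].var(ddof=1) / treat.sum()
             + outcome[~treat].var(ddof=1) / (~treat).sum())
lo, hi = diff - 1.96 * se, diff + 1.96 * se
print(f"estimated effect = {diff:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
# A CI that includes zero, as for the phone-measured outcomes above,
# means no effect is detected at the 5% level.
```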

2. Panel Discussion – From Global Vision to Ground‑Level Implementation

2.1 Opening Question (Kapil Viswanathan) – Are governments ready for AI’s impact?

Robin Scott (Global View)

  • Optimism vs. preparedness gap: >90 % of public servants are optimistic about AI, yet only 30 % of senior leaders have examined AI’s effect on their own jobs.
  • Ethics awareness: Only 26 % of self‑identified AI implementers in government understand their agency’s ethical AI framework.
  • Pilotitis: 70 % of leaders report running AI pilots, but only ≈50 % of those pilots have scaling plans and only ≈33 % have data‑readiness strategies.

Key Point: The enthusiasm is real; the systematic capacity (data, ethics, scaling) is lagging.

2.2 Indian Perspective (Mohammed Y. Safirulla)

  • Positive signals: Central and state governments are investing in up‑skilling, open data portals, and compute infrastructure.
  • Challenges: Scalability hinges on data availability, clear labelling standards, and ensuring AI benefits reach the masses.
  • Outlook: Optimistic but acknowledges a long road ahead to embed AI across ministries.

2.3 “Trenches” – Legal‑Tech and Court Automation (Utkarsh Saxena)

  • Two AI categories:

    1. LLM‑driven decision‑support (high regulatory risk, bias, hallucinations).
    2. Process automation (transcription, digitisation, workflow management) – low‑risk, high‑productivity.
  • Current focus: Adalat AI provides speech‑to‑text, image‑to‑text, and case‑flow tools that do not replace judicial decisions but relieve clerical bottlenecks.

  • Infrastructure bottleneck: Sovereign AI solutions must run on government‑owned GPUs; reliance on commercial APIs is limited for sensitive judicial data.

  • Painkiller vs. multivitamin analogy (see § 2.6).

2.4 Data‑Quality & “Pilotitis” (Robin Scott – again)

  • High‑quality domains: Finance & taxation (rich, precise transaction data) enable effective AI for expenditure tracking and fraud detection.
  • Low‑quality domains: Education, health, and social services where data are heterogeneous and noisy.
  • Success conditions: Robust, auditable experiments; careful design to avoid reinforcing existing biases.

Illustrative example: During COVID, unsupervised clustering on health records revealed a diabetes‑plus‑hypertension comorbidity that predicts higher mortality, prompting targeted policy interventions.
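
The analysis behind that example is not public here, but the general approach, clustering patient records on comorbidity indicators and then comparing mortality across the discovered clusters, can be sketched on synthetic data (all features and rates below are invented):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
n = 5000
diabetes = rng.random(n) < 0.20
hypertension = rng.random(n) < 0.30
# Invented mortality model: risk is elevated only when both co-occur.
p_death = 0.02 + 0.10 * (diabetes & hypertension)
died = rng.random(n) < p_death

# Unsupervised clustering on the comorbidity indicators (no mortality used).
X = np.column_stack([diabetes, hypertension]).astype(float)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Compare mortality across the discovered clusters.
for k in range(4):
    m = labels == k
    print(f"cluster {k}: diabetes={diabetes[m].mean():.2f} "
          f"hypertension={hypertension[m].mean():.2f} "
          f"mortality={died[m].mean():.3f}")
```

Because mortality is held out of the clustering, a cluster that turns out to have markedly higher mortality is a genuine discovery rather than an artifact of the labels.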

2.5 State‑Level Dynamics (Safirulla)

  • Kerala & other states: Active AI pilots, especially in health and agriculture, but many remain pilot‑only due to lack of validation protocols (sensitivity, specificity, third‑party audit).
  • Scaling bottleneck: Absence of systematic evidence (e.g., impact evaluations) hampers expansion beyond pockets.

2.6 Entrepreneurial Viewpoint – Market Size & Strategy (Utkarsh Saxena)

  • Painkiller first: Target high‑friction, clearly‑felt tasks (e.g., courtroom stenography). Success builds trust and opens avenues for broader “multivitamin” solutions (case‑flow management, analytics).

  • Market scope:

    • India: ~50 million pending cases; the AI solution could accelerate case resolution dramatically.
    • Global South: Similar legacy colonial judicial structures in Africa and elsewhere suggest a potentially worldwide market, though no concrete dollar figure is offered.
  • Key take‑away: Multivitamin opportunities are unbounded; the immediate market is driven by specific, acutely felt pain points.

2.7 Evaluation Practices – Designing Pilots for Scale

  • Telemetry & monitoring: Only 45 % of leaders have plans to evaluate their AI pilots, and just 20 % of public servants feel competent to assess AI skill needs. Without built‑in telemetry, post‑deployment learning is limited.
  • Design thinking: Experiments must be built with measurement in mind (e.g., define intermediate outputs such as number of judgments per day, bail orders issued).
  • Third‑party audits: Boost trust and facilitate scaling—especially important for AI systems influencing rights.
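
As a hypothetical illustration of "measurement built in": a pilot can log one event per intermediate output and derive its pre‑registered metrics directly from the event stream. The class and event names below are invented, not from any deployed system.

```python
from collections import Counter
from datetime import date

class PilotTelemetry:
    """Minimal event log from which evaluation metrics are derived."""
    def __init__(self):
        self.events = []  # (day, event_type) pairs

    def log(self, day: date, event_type: str) -> None:
        self.events.append((day, event_type))

    def daily_counts(self, event_type: str) -> Counter:
        """How many events of one type occurred on each day."""
        return Counter(d for d, e in self.events if e == event_type)

tel = PilotTelemetry()
tel.log(date(2024, 1, 8), "judgment_transcribed")
tel.log(date(2024, 1, 8), "judgment_transcribed")
tel.log(date(2024, 1, 8), "bail_order_issued")
tel.log(date(2024, 1, 9), "judgment_transcribed")

counts = tel.daily_counts("judgment_transcribed")
print(dict(counts))  # two judgments on Jan 8, one on Jan 9
```

Even this much structure makes the intermediate outputs named above (judgments per day, bail orders issued) auditable after deployment instead of being reconstructed from memory.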

2.8 Closing Reflections (Dean Karlan & Moderator)

  • Evidence‑first procurement: Governments should demand rigorous impact data before large‑scale AI contracts.
  • Balancing optimism with humility: AI offers powerful levers for crisis response (rapid cash targeting) but is not a panacea; careful, context‑specific evaluation remains indispensable.

Key Takeaways

  • AI excels at rapid targeting of beneficiaries when rich behavioural data (e.g., mobile‑phone logs) are available, as demonstrated in Togo’s cash‑transfer program.
  • Impact measurement is harder: The same AI pipeline failed to detect short‑run welfare improvements captured by household surveys, likely due to outcome mismatch, model drift, and lack of contemporaneous training data.
  • Evaluation lag vs. AI speed: Traditional RCTs are slow; developers must identify reliable short‑run proxies and embed telemetry to monitor AI performance in real time.
  • Government readiness is uneven: While >90 % of civil servants are enthusiastic, only a minority have examined AI’s effect on their own jobs, understand ethical frameworks, or have data‑readiness plans.
  • Pilotitis dominates: Most AI initiatives remain pilots; only ~50 % have scaling roadmaps and ~33 % have concrete data‑preparation strategies.
  • High‑impact domains (finance, tax, pandemic health analytics) benefit from clean, high‑frequency data; low‑quality domains (education, social services) need robust experimental design to avoid bias amplification.
  • Process‑automation (“painkillers”)—such as courtroom transcription—gains rapid government adoption; broader systemic solutions (“multivitamins”) require longer trust‑building cycles.
  • Infrastructure constraints (need for sovereign compute, limited API access) are a major barrier for sensitive sectors like the judiciary.
  • Evidence‑driven procurement and third‑party audits are critical to scaling AI tools within public institutions.
  • Market potential: Immediate, high‑pain use‑cases (court stenography, case‑flow) constitute a sizable domestic market; the longer‑term “multivitamin” space is conceptually limitless across the global south.

Prepared from the verbatim transcript of the AI in Governance session held in Delhi.
