Publicly Accessible Data and AI Training: Safeguards for Responsible Reuse

Abstract

The session examined how data‑governance frameworks shape the ethical reuse of publicly accessible data for training artificial‑intelligence models. Drawing on an emerging, Microsoft‑supported research project, the speakers outlined early findings on the legal and ethical boundaries imposed by copyright, data‑protection, and nascent AI‑governance regulations. The discussion highlighted methodological approaches, comparative insights from Brazil, Japan, and Australia, and the practical challenges innovators face when balancing openness with rights‑respecting safeguards. Audience interaction via live polls helped surface practitioners’ current data‑usage practices and perceptions.

Detailed Summary

1. Session Introduction

  • Introduction to Open Data Charter (ODC):

    • Founded a decade ago, the ODC centres on six principles, adopted by governments and civil‑society partners, that guide data collection, publication, and reuse.
    • ODC’s evolving mission now includes exploring the intersection of open data with emerging technologies such as digital public infrastructure and AI.
  • Project Overview:

    • A new research initiative, funded by Microsoft, investigates the legal and ethical limits of using publicly accessible data for AI training, focusing on copyright and data‑protection concerns.
    • The project marks ODC’s first foray beyond open‑government data into the broader realm of publicly accessible information.
  • Interactive Element:

    • Participants were invited to scan a QR code and answer an initial poll on their sector (private, civil‑society, academia, government).
    • A second poll would capture the types of data participants use for analytics or AI training and the contexts of that use.

2. Main Presentation (Fola Adeleke)

2.1. Research Motivation

  • Growth of AI Training on Web‑Scale Data:

    • AI models increasingly rely on large, openly available datasets.
    • However, regulatory guidance is fragmented: it is unclear whether publicly available data can be freely mined for AI.
  • Key Legal Gaps:

    • Copyright law: often lacks explicit text‑and‑data‑mining (TDM) exceptions for publicly accessible content.
    • Data‑protection law: applies to personal data even when publicly posted (e.g., social‑media profiles), limiting reuse without consent.
  • Governance Challenges:

    • Uncertainty hampers innovators, public‑sector users, and community projects.
    • There is a rising discourse on government‑provided data as a public‑good source for AI, especially in the Global South where state datasets are diverse and potentially under‑used.

2.2. Research Question & Scope

  • Primary Question:

    • To what extent can publicly accessible data be reused in AI training while respecting copyright, data‑protection, and emerging AI‑governance regulations?
  • Three Core Legal Frameworks Examined:

    1. Copyright – particularly TDM exceptions.
    2. Data‑Protection – purpose limitation, good‑faith, and public‑interest tests.
    3. Emerging AI‑Governance Regulations – nascent statutes that may clarify AI‑specific data uses.
  • Stakeholder Focus:

    • Private‑sector firms, non‑state actors, and state agencies.
    • How these actors interpret legal exceptions and perceive restrictions on the analytical use of public data.

2.3. Comparative Country Study

  • Countries Selected: Brazil (Latin America), Japan (Asia), Australia (Oceania).
  • Rationale: Geographic diversity and distinct legal traditions provide contrasting regulatory ecosystems.
| Country | Key Legal Features | Implications for Public Data Reuse |
| --- | --- | --- |
| Brazil | Strong open‑by‑default policy for government data; robust data‑protection law | Encourages reuse, but data protection still limits personal‑data mining. |
| Japan | Broad information‑analysis permission under copyright, provided the use is not for personal enjoyment; comprehensive data‑protection law | Allows more liberal TDM, yet personal‑data protections remain. |
| Australia | Emerging AI‑governance framework; existing copyright law lacks an explicit TDM exception | Opportunity for legislative clarification; current ambiguity creates risk. |

2.4. Methodology

  • Mixed‑Methods Approach:

    • Desk Research: Review of statutes, case law, policy documents, and industry practices (ongoing).
    • Surveys & Interviews: Planned for later phases to capture stakeholder perceptions across the three jurisdictions.
  • Current Status:

    • Initial desk research completed; empirical data collection (surveys, interviews) forthcoming.

2.5. Early Findings & Observations

  • Regulatory Silence on TDM:

    • Most copyright regimes examined lack explicit TDM provisions, leaving a gray area for AI training.
  • Data‑Protection Constraints:

    • Even anonymized public data may embed biases; misuse can violate purpose‑limitation and good‑faith requirements.
  • Bias & Ethical Risks:

    • Selection bias in publicly curated datasets can perpetuate discriminatory outcomes when fed into AI models.
  • Potential for Legislative Evolution:

    • Emerging AI‑governance statutes could embed clearer TDM allowances, offering a pathway to responsible reuse.

3. Panel Interaction & Audience Polls

  • Poll 1 (Sector Identification): Collected demographics of attendees (private, civil‑society, academia, government).

  • Poll 2 (Data Usage): Sought information on the types of publicly accessible data participants employ for analytics/AI and the contexts of use (research, commercial, academic, etc.).

  • Audience Engagement:

    • While specific poll results were not disclosed in the transcript, the moderator reiterated that the questions would be revisited after the presentations, emphasizing the interactive nature of the session.

4. Closing Remarks (Renato & Fola)

  • Final Thanks: Appreciation for participants’ endurance after a long conference day and acknowledgment of the cold venue.

  • Next Steps:

    • The research team will continue data collection, analyze survey/interview results, and share findings as they mature.
    • Contact information (email) was provided for ongoing dialogue.
  • Unanswered Questions:

    • Time constraints prevented the panel from addressing all audience questions; the discussion was framed as an opening step toward deeper exploration.

Key Takeaways

  • Regulatory Uncertainty: Current copyright and data‑protection statutes often lack clear exceptions for text‑and‑data‑mining of publicly accessible data, creating legal ambiguity for AI developers.
  • Three‑Framework Lens: The research focuses on copyright, data‑protection, and emerging AI‑governance regulations to assess responsible data reuse.
  • Comparative Insight: Brazil’s open‑by‑default policy, Japan’s permissive copyright stance, and Australia’s nascent AI‑governance framework illustrate diverse national approaches and highlight opportunities for legislative harmonisation.
  • Bias Risks Remain: Even anonymized public data can embed biases that affect AI outcomes, underscoring the need for ethical safeguards beyond legal compliance.
  • Mixed‑Methods Design: The project combines desk research with forthcoming surveys and interviews to capture both legal analysis and stakeholder perceptions.
  • Interactive Engagement: Live polling was used to map participant sectors and data‑usage practices, informing the research’s relevance to real‑world practitioners.
  • Future Directions: Ongoing data collection and analysis will refine recommendations for safeguards and mechanisms that enable responsible reuse of publicly accessible data in AI training.
  • Collaboration Across Borders: The partnership between the Open Data Charter and the Global Center on AI Governance exemplifies cross‑regional cooperation to address a global challenge.
