Governments as data providers for AI

Part of the Deep Dive: Data Governance Webinar Series

Government intervention in the AI ecosystem is often dismissed as heavy-handed regulation that will create deadweight losses and cap innovation. There is, however, real potential for mutually beneficial arrangements, with government data as the foundation.

Governments steward all kinds of data about their countries and the people living within them, collected via national archives, censuses, service provision, and more. This data is often published openly for the sake of transparency or for reference by researchers, but the advent of generative AI tooling should prompt governments to think harder about how their data can and should be used. Simply put, LLM chatbots can only benefit from access to government data: it reduces the risk of hallucination in high-stakes situations, such as when citizens ask about welfare schemes, childcare, health services, and more.

We first discuss what these benefits might tangibly look like, before exploring the current state of government datasets and their readiness for such a role. Specifically, we draw on Open Data Institute research on the UK government as a data provider for AI, published earlier this year, which uses an ablation experiment and an information leakage experiment to demonstrate how idiosyncratic and disorganised the integration of UK government data into AI currently is. We then set out how this should change and what steps would bring government data to overall AI readiness.

A recurring theme across these discussions is the siloing of governments from the AI ecosystem: too little dialogue is happening, and too little mutual benefit is being realised. Government data represents an olive branch, a way forward for the benefit of all.

Video transcript

Presenter: Neil Majithia, Researcher at Open Data Institute (ODI)

Introduction

The Open Data Institute, co-founded by Professor Sir Nigel Shadbolt and Sir Tim Berners-Lee, conducts research on data foundations for AI development. This presentation covers two key pieces of work from ODI’s data-centric AI campaign, focused primarily on the UK context and targeted at policymakers.

Part 1: Current State of Government Data and AI

Government’s Role in the AI Ecosystem

While governments are often viewed as bureaucratic regulators that stifle innovation, they can serve as providers of public goods—similar to how they provide roads and infrastructure. Government data represents a valuable public good that can support AI development.

Why Government Data Matters for AI

  • Provides objective, factual information from authoritative sources
  • Supports “citizen queries”—questions citizens have about government services and policies
  • Example: A single parent asking about child benefits and how they interact with other assistance programs
  • Offers a practical, real-world use case for AI chatbots

Research Question

To what extent is the UK government currently a data provider for AI?

Experiment 1: Ablation Study

Method: Tested how important government data is to large language models (LLMs) by comparing baseline performance against models that had “forgotten” government website content.
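
As a rough illustration of this setup, the sketch below shows one way such a baseline-versus-ablated comparison could be scored on citizen queries. The helpers ask_baseline, ask_ablated and is_hallucination are hypothetical, and the ODI study's actual ablation mechanism and scoring are not reproduced here.

```python
# Sketch of an ablation-style comparison on citizen queries.
# ask_baseline, ask_ablated and is_hallucination are hypothetical helpers;
# the ODI study's actual ablation mechanism and scoring are not shown here.

CITIZEN_QUERIES = [
    "How does Child Benefit interact with Universal Credit?",
    "Am I eligible for free childcare hours as a single parent?",
]

def hallucination_rate(answer_fn, queries, references, judge):
    """Fraction of answers that the judge flags as hallucinated."""
    flagged = 0
    for query, reference in zip(queries, references):
        if judge(answer_fn(query), reference):
            flagged += 1
    return flagged / len(queries)

# baseline = hallucination_rate(ask_baseline, CITIZEN_QUERIES, REFERENCES, is_hallucination)
# ablated  = hallucination_rate(ask_ablated, CITIZEN_QUERIES, REFERENCES, is_hallucination)
# A higher ablated rate suggests the model was leaning on government web content.
```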

Key Results:

  • Government websites are important data providers for LLMs
  • Hallucinations increased after ablation, demonstrating the value of government data
  • Importance varies by subject matter—niche queries rely more heavily on government information

Experiment 2: Information Leakage Study

Method: Tested whether structured data from data.gov.uk was present in LLM training corpora using refined prompts.
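
As a rough sketch of what such a recall probe might look like, assuming a hypothetical query_model call and an illustrative prompt template rather than the refined prompts used in the study:

```python
# Sketch of a recall probe for training-data leakage: ask the model for a
# specific value from a published dataset and compare against the ground truth.
# query_model is a hypothetical LLM call; the example record is illustrative only.

def recall_probe(query_model, dataset_title, field, entity, true_value):
    prompt = (
        f"From the data.gov.uk dataset '{dataset_title}', "
        f"what is the {field} recorded for {entity}? Answer with the value only."
    )
    answer = query_model(prompt)
    return str(true_value).lower() in answer.lower()

# Example call (illustrative arguments, not values from the study):
# recalled = recall_probe(query_model, "Road Safety Data",
#                         "number of reported collisions", "Greater Manchester, 2019",
#                         true_value="...")
```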

Key Results:

  • Recall was largely unsuccessful
  • Data.gov.uk, the UK’s premier structured data source, is not effectively serving as a data provider for foundational LLMs
  • Even data published in secondary sources (e.g., Guardian articles) couldn’t be reliably recalled

Recommendations from Part 1

  1. Make government data more openly available and AI-ready
  2. Revise data reuse policies for LLM access
  3. Develop an AI-ready national data library
  4. Equip existing data infrastructure with AI capabilities
  5. Invest in high-quality benchmarks and evaluation protocols

Part 2: Framework for AI-Ready Data

Why AI Readiness Matters

  • Poor data standards lead to problematic AI outputs (Google’s “data cascades” research showed 92% prevalence of avoidable cascades)
  • In healthcare, biased electronic health records can severely impact decision-making
  • AI-ready data can bridge the digital skills gap between public and private sectors
  • Enables effective use of AI agents and workflows

Existing Frameworks and Their Limitations

  • FAIR principles: Too broad, not sufficiently nuanced
  • Domain-specific frameworks: Too narrow (e.g., biomedical data)
  • Success story: UniProt and the Protein Data Bank enabled AlphaFold development because the data was already AI-ready

ODI’s Framework Development

Process: Conducted desk research and interviews with:

  • Computer science experts
  • Data analysts and day-to-day data users
  • Large data portal experts
  • Legal consultants
  • ODI institutional experts

The Framework: Three Core Categories

1. Dataset

  • Follow international standards and norms
  • Ensure semantic and logical consistency
  • Address class and source imbalances
  • Implement proper identification and anonymization (GDPR, EU AI Act)
  • Use appropriate file formats (prefer Apache Parquet over CSV, and CSV over Excel spreadsheets); see the conversion sketch after this list
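
To illustrate the file-format point above, a minimal sketch of republishing a CSV release as Apache Parquet with pandas; the file names are placeholders.

```python
# Sketch: republish a CSV release as Apache Parquet (columnar, typed, compressed).
# Requires pandas with a Parquet engine (e.g. pyarrow); file names are placeholders.
import pandas as pd

df = pd.read_csv("benefit_statistics.csv")  # original tabular release
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]  # consistent column names
df.to_parquet("benefit_statistics.parquet", index=False)  # typed, compressed copy for reuse
```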

2. Metadata (Arguably Most Important)

  • Provide context necessary for autonomous AI agents
  • Include legal and socio-technical information
  • Ensure machine readability (see the example record after this list)
  • Enable AI to understand modalities, semantics, biases, and synthetic data components
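
One possible shape for such a machine-readable record, loosely modelled on DCAT-style catalogue metadata; the fields and values below are illustrative, not a schema proposed in the talk.

```python
# Illustrative machine-readable metadata record, loosely DCAT-like.
# Field names and values are examples, not a prescribed schema.
import json

metadata = {
    "title": "Benefit statistics, 2024 release",
    "publisher": "Example Government Department",
    "licence": "Open Government Licence v3.0",
    "modalities": ["tabular"],
    "update_frequency": "quarterly",
    "known_biases": "Under-representation of non-digital claimants",
    "contains_synthetic_data": False,
    "contains_personal_data": False,
    "distribution": {"format": "parquet", "api": "https://example.gov/api/benefits"},
}

print(json.dumps(metadata, indent=2))  # machine-readable and human-inspectable
```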

3. Surrounding Infrastructure

  • User-centric data portal accessibility
  • API access (see the query sketch after this list)
  • Version control infrastructure
  • Discoverability for training corpus scraping
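
To illustrate the API-access point, a minimal sketch of a discovery query against a CKAN-style catalogue API of the kind data.gov.uk runs; the search term is illustrative.

```python
# Sketch: search a CKAN-style data portal API for datasets matching a query.
# data.gov.uk exposes a CKAN action API; the endpoint and search term here are
# illustrative rather than taken from the talk.
import requests

resp = requests.get(
    "https://data.gov.uk/api/action/package_search",
    params={"q": "child benefit", "rows": 5},
    timeout=30,
)
resp.raise_for_status()
for dataset in resp.json()["result"]["results"]:
    print(dataset["title"])  # titles of matching datasets
```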

Conclusion

Key Takeaways

  • Government data can be a genuine public good in the AI ecosystem
  • Everyone must align: government departments, citizens, public sector institutions, developers, researchers, legislators, and service providers
  • AI readiness requires strong central direction—a “composer” to coordinate efforts
  • Following tangible criteria (like ODI’s framework) is preferable to vague approaches
  • Leadership is needed to implement these changes

Contact: Neil Majithia, Open Data Institute. For more information, visit the ODI website or contact the presenter directly.