Part of the Deep Dive: Data Governance Webinar Series
In my talk, I will focus on how public AI goals can be achieved through policies centered on developing shared data infrastructure. I will use the European policy debate on the AI Continent Action Plan as an example.
Much of the current policy debate around public AI focuses on securing computing capacity and building infrastructural stacks for AI training and inference. In the white paper, we propose three distinct pathways that take compute, data, or models as their starting point. In my talk, I will present the data pathway in detail, arguing that establishing a viable data commons is a critical policy measure for securing public AI infrastructures. The data pathway emphasizes the creation of high-quality datasets as digital public goods, governed through commons-based frameworks. This includes developing datasets as publicly accessible resources while protecting against value extraction, and establishing public data commons with appropriate governance mechanisms.
The public AI model outlined in the white paper also assumes that such data-related developments will be closely tied to the creation of a state-of-the-art, open-source, large AI model. Only through this integration can public value be secured and the extraction of value by dominant commercial actors be avoided.
In my talk, I will apply the concept of data-driven public AI development to ongoing policy debates in Europe, offering an example of how such an initiative could be structured. In 2025, the European Commission announced the AI Continent Action Plan, which includes measures to create elements of a European public AI stack. Central to this plan is the proposal for Data Labs—institutions tied to AI Factories that are responsible for federating data from various sources and ensuring its quality. I argue that the Data Labs framework must build on data commons approaches to ensure that increased data availability truly serves the public interest.
Video transcript
By Alek Tarkowski, Director of Strategy at Open Future
Introduction
I’m Alek Tarkowski from Open Future, a European think tank focused on the digital commons. Today I’ll discuss the data pathway to building public AI, drawing on a policy paper on data governance and open-source AI that I co-authored with the Open Source Initiative (OSI) last year.
Understanding Data Commons
Data commons are crucial for public AI development. They represent:
- Community-based governance of data resources
- Alternatives to proprietary ownership models
- Protection against exploitation of shared data for private gain
- Infrastructure for generating public value beyond just community benefit
Data commons can simultaneously achieve three goals often seen as contradictory:
- Protecting fundamental rights
- Serving the public interest
- Generating economic value
Six Principles of Commons-Based Data Governance for AI
- Share as much as possible, restrict as necessary – Use open data where appropriate, but implement gated access when needed to protect the commons
- Respect choices of data subjects and content creators
- Ensure data transparency
- Establish trusted institutions for governance
- Enable community engagement
- Ensure data quality
What is Public AI?
Public AI refers to building normative alternatives to corporate AI systems: initiatives that counter the concentration of power through public engagement and a public interest orientation.
The Paris Charter on Artificial Intelligence in the Public Interest defines it as building open public goods and infrastructure that provide alternatives to market concentration while ensuring democratic participation, accountability, and environmental sustainability.
Key Clarifications
Public AI vs. Open Source: You cannot have public AI without open-source AI. Open-source models are necessary but not sufficient—they must be combined with public functions and governance.
Public AI vs. Sovereign AI: While both involve independent infrastructure, sovereign AI focuses on national competitiveness and compute ownership. Public AI emphasizes public value and democratic governance without nationalist goals.
Public Digital Infrastructure Framework
Public AI is a form of public digital infrastructure characterized by three elements:
- Public Attributes – Publicly accessible, open, or interoperable (where open-source standards matter)
- Public Functions – Contributes to public interest goals (education, health, civic participation)
- Public Control – Democratically governed or overseen by the public (not necessarily state-owned)
The AI Stack and Vertical Orchestration
Understanding AI as a layered stack is essential:
- Compute Layer – Chips and data centers
- Data Layer – Training datasets
- Model Layer – AI systems
- Applications Layer – End-user solutions
While commercial actors pursue vertical integration to concentrate power, public policy should pursue vertical orchestration—coordinating actors across layers toward shared public interest goals.
Seven Principles of Public AI
- Directionality and Purpose – Clear goals beyond just “deploying AI”
- Open Release – Open-source models and components
- Commons-Based Governance – Democratic, community-based data governance
- Conditional Computing – Public compute resources allocated with conditions tied to public goals
- Digital Rights Protection – Fundamental safeguard
- Sustainable Development – Environmentally responsible
- Reciprocity – Mechanisms ensuring value returns to the commons, not just extracted by corporations
Three Pathways to Public AI
1. The Compute Pathway
Goal: Reduce dependence on commercial computing power through publicly owned infrastructure.
Examples: the US NAIRR (National AI Research Resource) initiative, Europe’s 13 AI Factories (averaging 25,000 GPUs each)
Challenges:
- Unavoidable dependencies on chip producers (especially Nvidia) and proprietary software
- Cannot achieve fully public infrastructure—typically requires public-private partnerships
- Lacks clear purpose without additional direction
- Alone, insufficient to address power concentration
2. The Data Pathway
Goal: Develop and share high-quality datasets that are publicly accessible and governed through democratic, commons-based frameworks.
Approaches:
- Traditional open data
- New gated access models with permissions protecting the commons
- Regulatory measures (fair use, text and data mining exceptions)
Challenges:
- Downstream risk: Public data extracted for private gain, reinforcing inequalities
- Data winter: Despite calls for more data, public availability isn’t improving
- Volume vs. quality: Need to shift focus from quantity to ensuring transparency and quality
- Limited success: European common data spaces haven’t fully achieved their goals after five years
Recent Progress:
- EleutherAI’s Common Pile
- Harvard’s Institutional Data Initiative book corpus
Critical insight: Regulatory frameworks (like fair use or TDM exceptions) can enable AI training without requiring new dataset creation; the two approaches should be treated as complementary.
3. The Model Pathway
Goal: Develop open-source AI models with strong public functions focused on generating public value.
Key concept: The capstone model—a state-of-the-art, highly capable open-source model that serves as the foundation for a broader ecosystem in which specialized models can be fine-tuned or distilled.
Critical point: The data and model pathways are tightly coupled. Public/open-source model developers should be the primary users of shared datasets. Otherwise:
- Data mainly benefits large commercial players already capable of using it
- Or access must be heavily gated, limiting public value
Important caveat: Data sharing has value beyond AI development. We shouldn’t only share data because we’re building AI systems.
Case Study: European AI Continent Action Plan
The 2025 AI Continent Action Plan shows elements of a public AI approach but has gaps:
What’s Present:
- 13 AI Factories (plus 5 planned AI Gigafactories)
- Cloud and AI Development Act
- Updated Data Union Strategy
- Apply AI Strategy for adoption in key sectors
- RAISE (Resource for AI Science in Europe) for research
- Data Labs for accessing common data spaces
What’s Missing:
- Stronger commitment to a state-of-the-art, open-source, AI Act-compliant capstone model
- An institution capable of orchestrating AI development in the public interest
- Clearer vision of data as an AI enabler – Data Labs are the right anchor points but need more detail
Future Possibilities:
Enrico Letta’s 2024 report proposed a European knowledge commons with a centralized data sharing platform—similar to Open Future’s earlier “public data commons” proposal. Signs suggest this may develop, but implementation remains uncertain.
Beyond AI: Sustainability of the Information Ecosystem
A complementary perspective from my colleague Paul Keller’s paper “Beyond AI and Copyright: Finding a Sustainable Information Ecosystem”:
While enabling data sharing and developing public AI are important, we also need remuneration mechanisms to finance the commons. Without these, the information ecosystem remains unsustainable.
Conclusion
The central argument: Data alone is insufficient as a pathway to public AI. While essential, it must be combined with:
- Public compute infrastructure (with clear purpose)
- Public/open-source model development (especially capstone models)
- Clear public interest goals
- Democratic governance structures
Only through this integrated approach—vertical orchestration across the entire AI stack—can we build truly public AI that serves collective interests rather than concentrating power in corporate hands.
