What should open source AI aspire to be?

Part of the Deep Dive: Data Governance Webinar Series

The discourse around open source AI typically treats openness and transparency as key drivers for “democratizing” AI technology. However, as critics have pointed out, even “maximally open AI systems” (Widder et al. 2023) alone do not challenge power dynamics in AI research and development (not least due to the high costs of training base models). In fact, much of the open source AI ecosystem depends on infrastructures or tools developed by leading AI companies like Meta (PyTorch) or Microsoft (GitHub). Transparency is an important element in addressing concentrations of power and moving toward a fairer, more accountable ecosystem, but it is important to be specific about its role.

In this talk, we want to spark a conversation about what open source AI should aspire to be beyond being transparent. First, we argue that open source AI systems can serve as proxies that help identify points of human decision-making and accountability applicable to generative AI systems more broadly, whether those systems are transparent or not. This is possible because open source and proprietary AI systems share similar technological designs and data curation techniques. Open source AI should welcome critical inquiry that leverages this similarity, providing insights into otherwise opaque systems. Second, we argue that open source AI should strive to pioneer fairer and more democratic governance models for AI. As an example, we present a collaborative paper based on a convening organized by Mozilla and EleutherAI to discuss the development of open datasets containing only public domain or permissively licensed data. It gathered over 30 builders to document what’s working, identify shared obstacles, and exchange strategies. The paper offers a practical guide for sourcing, curating, and releasing large-scale open datasets with levels of reproducibility, fairness, and accountability not found in the opaque AI systems of leading AI companies.

This talk is based on the paper Towards Best Practices for Open Datasets for LLM Training but expands on it.

Video transcript

Part 1: Open Source AI as a Proxy for Proprietary Systems

Stefan Baack

Most discussions about open source AI focus on performance gaps with proprietary models. However, the more critical issue is open source AI’s dependency on proprietary systems and the power dynamics this creates.

The Problem of Ecosystem Capture

The open source AI ecosystem relies heavily on infrastructure controlled by major AI companies, for example PyTorch or GitHub. This gives these companies the power to shape open source AI for their benefit.

However, this dependency has an unexpected consequence: it keeps open and proprietary AI systems remarkably similar in technological design and data curation techniques.

Three Factors Maintaining Similarity

  1. Shared Roots: In the pre-ChatGPT era, companies like OpenAI were more transparent to attract top researchers. Many influential open source projects recreated their work—like EleutherAI’s “The Pile,” which replicated OpenAI’s GPT-3 training dataset.
  2. Strategic Capture: Companies actively make open source efforts compatible with their technologies. As Zuckerberg explained, setting industry standards with tools like PyTorch allows Meta to easily integrate open source innovations into their stack.
  3. Competitive Dynamics: Competitors to US-based AI leaders use openness strategically to level the playing field. DeepSeek is a prime example of how competitive and geopolitical dynamics currently help keep industry-wide technological advances transparent enough that open source AI projects can adopt them.

The Opportunity

These similarities mean studying open source AI projects can reveal decision-making patterns and accountability mechanisms that apply to all generative AI systems. Open source AI should embrace this role—turning a perceived weakness into a strength by serving as a proxy for critical research into proprietary systems.

The community should push back against “openwashing” by companies like Meta and advocate for transparency levels that strengthen its role as a useful proxy for research and civil society.

Part 2: Open Source AI as a Governance Laboratory

Kasia Odrozek

Being a mirror for closed systems is too small an ambition. Open source must be the blueprint for alternatives and serve as our governance laboratory.

Governance Through Data

Every AI dataset is a governance system. Decisions about what data to include, what to filter out, and who gets to decide all reflect values. The question isn’t whether governance exists—it’s whether it’s visible and accountable.

Closed datasets hide these choices behind corporate walls. Open datasets make them visible and contestable. Community governance has stronger incentives to uphold ethical standards when contributors see themselves as part of a public mission—like Wikipedia.
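To make this concrete, here is a minimal sketch in Python of what visible dataset governance might look like. Everything in it is hypothetical: the FilterRule structure, the rule names, and the rationales are assumptions for illustration, not the practice of any particular project.

    # Hypothetical sketch: curation decisions expressed as explicit,
    # documented rules rather than hidden heuristics.
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class FilterRule:
        name: str           # short identifier for the rule
        rationale: str      # the value judgment behind the rule, stated openly
        decided_by: str     # who made the decision and can revisit it
        predicate: Callable[[str], bool]  # returns True if a document is kept

    CURATION_POLICY = [
        FilterRule(
            name="permissive-license-only",
            rationale="Avoid text whose authors never consented to reuse.",
            decided_by="dataset working group",
            predicate=lambda doc: doc.startswith("LICENSE:CC0"),  # placeholder check
        ),
        FilterRule(
            name="min-length",
            rationale="Very short fragments add noise without context.",
            decided_by="maintainers",
            predicate=lambda doc: len(doc.split()) >= 20,
        ),
    ]

    def keep(doc: str) -> bool:
        """A document is kept only if every published rule admits it."""
        return all(rule.predicate(doc) for rule in CURATION_POLICY)

The point is not the specific rules but that the value judgments travel with the code, where they can be audited and contested.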

The Shifting Landscape

The old way is dying. Sharing large pre-training datasets, once common, is now rare due to legal risks and competitive secrecy. Platforms are blocking crawlers, content is moving behind paywalls, and the AI-web relationship is being settled in courtrooms by those with the deepest pockets.

An Opportunity, Not a Crisis

What if we asked a better question: how do we build AI not by taking, but by collaborating?

Imagine training data that’s co-created, where communities have genuine agency over its use and share its benefits. The collaborative DNA for this exists in open source tradition.

The Dataset Convening Framework

Mozilla and EleutherAI brought together over 30 open dataset builders to draft a shared framework grounded in:

  • Reproducibility and documentation
  • Autonomy and agency for data creators
  • Transparency for accountability
  • Flexible governance giving contributors control

Redefining Openness

This may sound contradictory: How can we call it open while promoting access restrictions?

For some communities, unrestricted openness means exploitation due to power imbalances. In such instances, openness can mean:

  • Transparent documentation of data origins
  • Clear, public rules for use
  • Agency for communities who created the data

The Mozilla Data Collective Example

Recently launched, this platform treats data providers as partners with autonomy. They choose who uses their data and for what purpose, and they set the terms. Licensing becomes a toolkit for empowerment.
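As a rough illustration of what such terms could look like in machine-readable form, here is a short Python sketch. The field names and example values are assumptions made for this illustration, not the actual Mozilla Data Collective schema.

    # Hypothetical dataset manifest: provenance, permitted uses, and the
    # steward's own terms recorded in one machine-readable place.
    from dataclasses import dataclass, field

    @dataclass
    class DatasetManifest:
        name: str
        origin: str                       # where the data comes from
        steward: str                      # community or group with agency over use
        permitted_uses: list[str] = field(default_factory=list)
        prohibited_uses: list[str] = field(default_factory=list)
        contact_for_exceptions: str = ""  # whom to ask about uses not listed

    manifest = DatasetManifest(
        name="example-community-speech",  # hypothetical dataset
        origin="Recordings contributed by community members",
        steward="Example Language Community Council",
        permitted_uses=["speech recognition research", "language revitalization tools"],
        prohibited_uses=["voice cloning", "surveillance applications"],
        contact_for_exceptions="data-stewards@example.org",
    )

    def use_is_permitted(purpose: str) -> bool:
        """Checks a proposed use against the steward's published terms."""
        if purpose in manifest.prohibited_uses:
            return False
        return purpose in manifest.permitted_uses

The design choice that matters is who fills in these fields: the steward, not the downstream user, defines the terms.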

When people have real stakes in the system, they’re inspired to participate. This builds AI that’s more representative, accurate, accountable, and human.

Why Governance Matters

In a recent interview, Sam Altman was asked who decides ChatGPT’s moral framework. He said that they intend to reflect the knowledge of all humanity, but when pressed, he admitted, “Ultimately, the person who can overrule these decisions is me.”

What counts as truth and whose knowledge gets scaled through AI systems matters deeply. With closed systems like ChatGPT, choices are made behind corporate walls. With open systems like Wikipedia or open datasets, rules and data are visible, contestable, and shaped fairly.

The Vision

Open source communities should aspire not just to publish code, weights, or data—but to become prototyping labs for democratic governance of knowledge. Labs that openly invite critical research and demonstrate what responsible AI looks like in action.