Part of the Deep Dive: Data Governance Webinar Series
Video transcript
Speaker: Stefano Maffulli, Executive Director, Open Source Initiative (OSI)
Open Source AI Definition
About a year ago, OSI published the first definition of open source AI, based on the four fundamental freedoms from the free software definition:
- Use
- Study
- Modify
- Share
The challenge was interpreting “source code” for AI systems, which aren’t programmed in the traditional sense but are built through a learning process that combines software, knowledge, hardware, and data.
After a year-long global co-design process with over 100 people from more than 20 organizations across 27 nations (one-third from the Global South), OSI determined that open source AI requires:
- Unrestricted access to trained parameters/weights
- Complete code used to train the system and prepare training data
- Full dataset (unless legally impossible)
The Data Governance Problem
There’s a broken social contract between AI developers (whether public, open source, or proprietary) and the general public, who both create the training data and are the subjects of the resulting AI systems.
Major commercial AI labs release models with open licenses, but discussions about training methods, code, and datasets lag behind the availability of open weights.
Two Main Challenge Categories:
Legal: Balancing exclusive rights with common interests
Technical: Ensuring content can be downloaded, remixed, and shared globally while remaining traceable to its source and rights holders
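One way to picture the traceability side of that technical challenge is a machine-readable provenance record attached to each piece of content. This is a minimal sketch under assumed conventions: the field names, the example URL, and the rights holder are all illustrative, not part of any standard discussed in the talk.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class ProvenanceRecord:
    """Illustrative record tying exact content bytes to source and rights."""
    sha256: str         # fingerprint of the exact bytes shared
    source_url: str     # where the content was obtained
    license: str        # rights statement, e.g. an SPDX-style identifier
    rights_holder: str  # who to credit or contact

def record_for(content: bytes, source_url: str,
               license: str, rights_holder: str) -> ProvenanceRecord:
    # Hash the bytes so the record stays valid only for this exact content.
    return ProvenanceRecord(
        sha256=hashlib.sha256(content).hexdigest(),
        source_url=source_url,
        license=license,
        rights_holder=rights_holder,
    )

rec = record_for(b"example image bytes",
                 "https://example.org/cat.jpg",
                 "CC-BY-4.0", "Jane Doe")
```

Because the record is keyed to a content hash rather than a filename, it can in principle travel with the bytes across remixing and redistribution, which is exactly the property plain JPEGs lack today.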
Why Copyright Isn’t Sufficient
Open image datasets typically don’t distribute the images themselves, for copyright reasons; they distribute URLs and scripts to retrieve them. As a result, each reconstruction of the dataset may differ slightly, because links die and files change.
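The consequence of that URL-and-script model is that two people who “download the same dataset” may end up with different contents. A minimal sketch of why, with the index, the URLs, and the simulated link rot all invented for illustration:

```python
import hashlib

# Hypothetical dataset index: URLs plus the SHA-256 each file had when indexed.
index = [
    ("https://example.org/a.jpg", hashlib.sha256(b"image-a-v1").hexdigest()),
    ("https://example.org/b.jpg", hashlib.sha256(b"image-b-v1").hexdigest()),
]

# Simulated web at download time: b.jpg has since been replaced.
web = {
    "https://example.org/a.jpg": b"image-a-v1",
    "https://example.org/b.jpg": b"image-b-v2",
}

def materialize(index, fetch):
    """Fetch each URL; keep only files whose hash still matches the index."""
    kept, dropped = [], []
    for url, expected in index:
        data = fetch(url)
        if data is not None and hashlib.sha256(data).hexdigest() == expected:
            kept.append(url)
        else:
            dropped.append(url)  # dead link or silently changed content
    return kept, dropped

kept, dropped = materialize(index, web.get)
```

Every run of the retrieval script against a changing web yields a slightly different dataset, which is why reproducibility and provenance are hard to guarantee with this distribution model.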
Software vs. other media: Copyright works reasonably for software because developers have 40+ years of embedded practices (copyright notices, version control). This discipline doesn’t exist for visual artists or musicians—copyright information doesn’t travel with JPEGs.
Complexity at scale: Tracking provenance becomes extremely difficult (e.g., a Creative Commons image becomes a billboard background that appears in a movie).
Jurisdictional variations: Concepts like fair use and text/data mining exceptions don’t exist uniformly worldwide.
Style vs. content: Copyright protects expression, not style. A musician might refuse to let their work train AI, while a parody artist working in a similar style might permit it. How do we track such distinctions?
Emerging Solutions
Creative Commons Signals Initiative seeks to balance three goals:
- Preserve the commons
- Encourage AI development that respects creators’ wishes
- Avoid blocking innovation
IETF AI Preferences Working Group is standardizing vocabulary for content creators to express their intentions.
Future Concerns
- Interoperability: How do we move data between AI systems without lock-in?
- Balance: Maintaining the delicate equilibrium between public interest, innovation, and individual rights
OSI continues to focus on data governance through events like this webinar series, because these technical and legal questions remain difficult to resolve yet essential to address.
