Part of the Deep Dive: Data Governance Webinar Series
This talk builds on ongoing work by the Centre for Internet and Society (CNRS) and the Open Knowledge Foundation on sustainable data commons.
On the one hand, machine reuse of openly licensed and public domain works is crucial for AI model development, and has enabled fully open training datasets such as Common Corpus and Common Pile. On the other hand, open source AI also relies on web crawl data and mixed datasets, where complete openness of the training data may not be practically or legally feasible. Further, as generative AI (including open source GenAI models) becomes more widely used, rightsholders and data stewards are articulating new expectations around attribution, reciprocity and sustainability in data reuse.
How should all of this be factored into open source AI without inadvertently enabling openwashing? In my talk, I draw on a recently published report to present new licensing initiatives for training data that offer useful insights on data governance for open source AI.
Video transcript
Presenter: Ramya Chandrasekhar, Legal Researcher at the Centre for Internet and Society (CNRS, Paris)
The Challenge
Foundation AI models are data-intensive, requiring large quantities of diverse data (text, image, audio, video) from the open web. Common Crawl, a free repository of web data, is a primary source for training datasets used by organizations like EleutherAI, the Allen Institute for AI, and OpenAI.
This creates two tensions:
- Data Availability: By 2028, the entire indexed web may be exhausted for AI training, raising questions about sustainable access to high-quality data.
- Creator Rights: Data from the open web is created by individuals and communities who have expectations about how their work should be used, expressed through copyright, moral frameworks, and social contracts.
Limitations of Existing Licenses
Traditional permissive licenses (CC BY, Apache, BSD) and copyleft licenses (CC BY-SA, GPL, AGPL) face challenges when applied to AI training data:
- Attribution: Unclear how attribution requirements transfer through the AI training lifecycle
- ShareAlike: No legal certainty on how copyleft obligations apply from training data to model outputs
- Sustainability: Public domain and openly licensed content (such as Wikipedia) faces infrastructure strain from bot access
- Enforcement: Tracking violations is resource-intensive for individual creators
Emerging License Categories
1. Responsible AI Licenses
RAIL and OpenRAIL Licenses
- Preserve open-source transparency while addressing AI-specific harms
- Include behavioral use restrictions (e.g., prohibiting military use, discriminatory applications, hate speech)
- Require downstream transmission of restrictions
Montreal Data License
- Creates granular taxonomy for data use in AI/ML contexts
- Distinguishes between:
- Standalone data rights: access, tagging, distribution, re-representation
- AI-specific rights: benchmarking, research, internal use, output commercialization, model commercialization
New Open Data Commons License
- Addresses “mixed datasets” containing both personal and non-personal data
- Compulsory elements:
- Privacy pledge (GDPR compliance)
- Right to erasure commitments
- Optional elements:
- Confidentiality and scope restrictions
- Financial contributions for dataset maintenance
2. Alternative Approaches
AI2 ImpACT License (Allen Institute for AI)
- Risk-based framework: datasets classified as high, medium, or low risk
- Requires derivative impact reports documenting new uses
- Public registry of license violators
- Note: Initially applied to the Dolma dataset, but AI2 later switched to the more permissive ODC-BY license after community pushback
Nwulite Obodo License (University of Pretoria)
- Created for African language datasets
- Tiered copyleft addressing equity gaps:
- Licensees from developing countries: standard ShareAlike obligations, royalty-free access, commercial use permitted outside Africa
- Licensees from the Global North: stronger ShareAlike obligations, plus a requirement to provide benefits (financial compensation, partnerships, know-how sharing)
Key Insights
These new licenses:
- Draw inspiration from existing open-source licensing architecture and the inverted-copyright logic of copyleft
- Recognize limitations of traditional licenses in the AI context
- Provide additional granular options for licensors
- Address unique challenges: privacy risks, extraction problems, power asymmetries
Three Dimensions of Data Openness
- Technical Openness: Vibrant digital public domain and permissive licensing
- Community Data Sovereignty: Respecting desires of communities who create and steward datasets
- Limits on Extractivism: Restricting well-resourced actors from dominating AI development
Conclusion
License proliferation, while creating compatibility and enforceability challenges, can be productive. It helps articulate reciprocity, attribution, and sustainability concerns from different communities. A one-size-fits-all approach may not work for all training datasets—instead, different licensing options can work together depending on the dataset type, community stakeholders, and intended use.
Friction can be productive and even necessary: it helps keep the open web available for AI training while supporting responsible data reuse.
