Part of the Deep Dive: Data Governance Webinar Series
With the advent of deep learning, speech recognition models like OpenAI’s Whisper are now trained on hundreds of thousands of hours of speech data, likely gathered without the consent of the speakers who contributed it. As responsible AI practices grow in prevalence and we continue to advance machine learning-enabled speech technologies, we must define and commit to a set of shared best practices for responsible speech data collection and stewardship.
Here, we outline those we see as most salient at the Mozilla Data Collective – and warmly welcome dialogue on these with like-minded practitioners and researchers.
Consent – permission to collect, steward and distribute speech data.
Control – for what, by whom and where is speech data used?
Cultural inclusion – what does it mean to handle speech data with CARE?
Contributing corpora – providing the ability to unlock speech datasets created through research or community efforts and steward them effectively.
Collective – introducing the Mozilla Data Collective for community creation, curation and control of speech data.
By focusing on facilitating the contribution of corpora collected and created by community actors – such as researchers, hobbyists and language enthusiasts – we can ensure that consent and cultural alignment are more closely intertwined with dataset stewardship. However, surfacing community-created datasets for broader re-use isn’t enough. We need to ensure that, as data stewards, our licensing models, dataset usage and governance structures align with the original intent and wishes of dataset owners and speech data contributors.
To support this position, the Mozilla Data Collective has built a platform that gives data owners the ability to create, curate and control data – and we’re now ready to iterate the platform based on community and contributor feedback. This is a play for community, consent and care – rather than simply more data.
Video transcript
The Problem: Tokenomics and the Data Crisis
What Are Tokens?
Tokens are the building blocks of AI training data—small segments of text (like words or word parts) that machine learning models use to understand and generate language. Modern AI models learn the statistical relationships between tokens to predict what comes next, functioning essentially as sophisticated autocomplete.
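To make the idea concrete, here is a toy sketch of how a word can be split into subword tokens. The tiny vocabulary and the greedy longest-match rule are illustrative assumptions only—production tokenizers (e.g. BPE or WordPiece) learn their vocabularies from data—but the splitting behavior is similar in spirit:

```python
# Toy greedy subword tokenizer: repeatedly match the longest
# vocabulary entry at the start of the remaining word. Rare or
# unseen words get split into smaller known pieces, which is why
# token counts exceed word counts in real corpora.
VOCAB = {"token", "tokens", "are", "the", "build", "ing",
         "blocks", "of", "un", "predict", "able"}

def tokenize(text: str) -> list[str]:
    out = []
    for word in text.lower().split():
        while word:
            # Longest vocabulary entry that prefixes the remaining word
            match = next(
                (word[:i] for i in range(len(word), 0, -1) if word[:i] in VOCAB),
                None,
            )
            if match is None:      # unknown character: emit it on its own
                match = word[0]
            out.append(match)
            word = word[len(match):]
    return out

print(tokenize("Tokens are the building blocks"))
# → ['tokens', 'are', 'the', 'build', 'ing', 'blocks']
```

Note how "building" becomes two tokens ("build" + "ing"): models operate on these pieces, not whole words, which is what makes the trillion-token counts below so large.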
The Token Crisis
- Massive scale needed: Modern AI models require trillions of tokens to train (GPT-5 estimated at 114 trillion tokens)
- Limited supply: The entire public web contains only 60-160 trillion tokens
- Peak token passed: We’ve reached the point where most authentic human data has been scraped
- Quality degradation: Synthetic AI-generated data (or “AI slop”) is flooding the internet, causing potential model collapse—like making photocopies of photocopies
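The "photocopies of photocopies" effect can be sketched with a toy simulation. This is an illustrative assumption, not a claim about real LLM training: each "generation" fits a simple Gaussian model to a small sample drawn from the previous generation's model, and estimation error compounds until the learned distribution drifts and narrows:

```python
import random
import statistics

# Toy "model collapse" simulation: generation t+1 is trained only on
# data generated by generation t. With a finite sample each round,
# fitting error accumulates and diversity (here, the std dev) decays.

random.seed(0)

def next_generation(mu: float, sigma: float, n: int = 50) -> tuple[float, float]:
    """Sample n points from the current model, then refit a Gaussian on them."""
    data = [random.gauss(mu, sigma) for _ in range(n)]
    return statistics.fmean(data), statistics.pstdev(data)  # maximum-likelihood fit

mu, sigma = 0.0, 1.0   # generation 0: the "real" data distribution
history = [sigma]
for _ in range(1000):
    mu, sigma = next_generation(mu, sigma)
    history.append(sigma)

print(f"std dev after 1000 generations: {sigma:.4f} (started at 1.0)")
```

Running this, the standard deviation shrinks dramatically over generations: the model ends up "confident" about a narrow sliver of what the original data contained, which is the intuition behind model collapse.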
Current Problems with Data Collection
Exploitative Scraping
- Aggressive scrapers overwhelm website infrastructure
- Value extracted from creators without permission or compensation
- Poorly paid workers label and moderate scraped content
- Power concentrated among a few major players
Bias and Representation
- Data sets over-represent specific communities (like Reddit)
- Predominantly English and a handful of other languages
- Ignores 7,000+ languages still spoken globally
- Violates data sovereignty and Indigenous data rights (CARE principles)
- Culture, history, and stories scraped without consent
Governance Gaps
- No clear legal framework for AI data sourcing
- Black-box processes serve corporate needs over data creators
- No realistic way for individuals to remove their data from models
- Emerging lawsuits finding evidence of large-scale piracy
The Solution: Mozilla Data Collective
What We Built
A platform that puts communities back in control by allowing data creators to:
- Host and share data on their own terms
- Set their own licenses and distribution settings
- Choose whether to charge for their data
- Access infrastructure without building it themselves
For Data Users
- Discover high-quality, human-generated data sets
- Access transparent documentation and licensing
- Connect directly with communities
- Ensure supply chain compliance with emerging standards (like ISO 42001)
Different from Common Voice
Common Voice (Mozilla’s open speech data set with 300 languages, 1M+ contributors, 36K hours under CC0 license) continues unchanged. The Data Collective serves communities wanting different terms than full openness.
Key Principles
The platform enables:
- Communities choosing what happens to their data
- Transparent, ethical data access
- Fair compensation for data creators
- Preservation of linguistic and cultural diversity
- Compliance with data governance standards
The Call to Action
Authentic human-generated data is becoming scarce and valuable. Current extraction methods are problematic. We need systems where individuals control their own data.
Share your vision: What would make you feel safer and more hopeful about controlling your data? What should we build next?
Connect: Mozilla Data Collective holds regular office hours
Note: All mistakes in the original talk are the speaker’s; credit for research goes to colleague Kathy Reid
