Building Public Data for LLMs

Part of the Deep Dive: Data Governance Webinar Series

As the capabilities and influence of large language models (LLMs) continue to expand, so too do the risks associated with opaque, legally ambiguous, and unrepresentative training data. While much attention has been focused on open-source models, relatively little infrastructure exists to support high-quality open datasets for pretraining. Drawing on EleutherAI’s work on the Common Pile, an 8 TB, openly licensed dataset designed for LLM training, I will outline a principled approach to constructing, curating, and evaluating open datasets that are legal, transparent, and representative.

The Common Pile serves as a case study for how community-driven, auditable pipelines can be used to aggregate and filter data from public domain and openly licensed sources. I will discuss the methods used to ensure language coverage, deduplication, quality control, and bias mitigation, and share insights into the trade-offs between dataset diversity, scale, and legal safety. I will also briefly present evaluation results from Comma, an open LLM trained exclusively on this dataset, and explore what they reveal about the viability of open data for training performant models.

This talk is intended for researchers, technologists, and open-source practitioners interested in strengthening the foundations of the open AI ecosystem. Attendees will leave with a deeper understanding of the technical, legal, and infrastructural challenges of open dataset construction—and why solving them is essential for equitable, reproducible AI development.

Video transcript

Building Openly Licensed Training Data for Large Language Models

Stella Biderman, Executive Director, EleutherAI


Introduction

I’m excited to discuss building openly licensed training data for large language models. At EleutherAI, we’ve spent five years working to make large-scale AI research more transparent, accessible, and ethical, with a strong focus on data work.

Our notable projects include:

  • The Pile: The first openly released LM pre-training corpus, alongside the GPT-Neo model
  • Pythia model suite: Established the fully open pipeline standard, since adopted by organizations like the Allen Institute for AI and now regarded as the gold standard for openness in the field

Today, I’ll discuss our work toward the next milestone: openly licensed pre-training data.

Understanding Licensing in LLM Context

When discussing licensing for LLMs, there are three key areas:

  1. Data licensing: The data the model is trained on
  2. Training permission: Whether you have permission to train on that data
  3. Model deployment: Distribution and use of the trained model

This talk focuses on data licensing—from collection through processing until the data enters the model.

Five Categories of Data Licenses

From most to least permissive:

  1. Public Domain: No restrictions (includes CC0, Unlicense)
  2. Permissive Licenses: Free to use, modify, and share for any purpose; no relicensing restrictions (e.g., MIT, Apache 2.0, CC-BY)
  3. Open Licenses: Free to use, modify, and share, but with relicensing restrictions (e.g., CC-BY-SA, GPL, ODC-BY)
  4. Restricted Use: Free for some users, some purposes (e.g., non-commercial licenses, “ethical AI” licenses)
  5. Proprietary: No permission to use without specific authorization

In this talk, “openly licensed” refers to categories 1-3.
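The five-way taxonomy can be made concrete in code. This is a minimal sketch: the identifier sets below are illustrative examples (loosely SPDX-style), not an exhaustive or authoritative mapping.

```python
# Hypothetical mapping of license identifiers onto the five categories above.
# The sets are examples only; a real system would need a far larger table.
LICENSE_CATEGORIES = {
    "public_domain": {"CC0-1.0", "Unlicense"},
    "permissive": {"MIT", "Apache-2.0", "BSD-3-Clause", "CC-BY-4.0"},
    "open": {"CC-BY-SA-4.0", "GPL-3.0-only", "ODC-By-1.0"},
    "restricted": {"CC-BY-NC-4.0", "OpenRAIL-M"},
}

# "Openly licensed" in this talk means categories 1-3.
OPENLY_LICENSED = {"public_domain", "permissive", "open"}

def categorize(license_id: str) -> str:
    """Return the category for a license identifier; unknown -> proprietary."""
    for category, ids in LICENSE_CATEGORIES.items():
        if license_id in ids:
            return category
    return "proprietary"  # no permission without explicit authorization

def is_openly_licensed(license_id: str) -> bool:
    return categorize(license_id) in OPENLY_LICENSED
```

Treating unknown licenses as proprietary reflects the conservative default the talk argues for: absent clear permission, assume you have none.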

Why Openly Licensed Data Matters

We aim to create “acceptable data” that serves three goals:

  1. Legal protection: Enable organizations to train models without fear of lawsuits (dozens have been filed already)
  2. Transparency: Return to earlier norms where companies shared details about training data composition, even if they didn’t release the actual data
  3. Research enablement: Support critical AI research on learning dynamics, social biases, and memorization—work that requires access to training data

Key Requirements

  • Ability to collect, process, and use data freely
  • Right to redistribute training data for research purposes
  • Note: No need for a unified license across the entire corpus—granularity is more important

Understanding Collection vs. Component Licenses

A crucial distinction often misunderstood:

  • Collection/database license: Covers the intellectual property in assembling the dataset
  • Component licenses: Cover individual documents within the dataset

Both must be openly licensed. The collection license doesn’t override individual component licenses.

Examples

Dolma (AI2): Has an ODC-BY collection license, but three of four main components are proprietary—insufficient for our purposes.

Institutional Books (Harvard): Contains 100% public domain books, but has a restricted-use collection license—also insufficient.
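The Dolma and Institutional Books examples fail for opposite reasons, but under the same rule: both the collection license and every component license must be open. A minimal sketch of that check, with an illustrative (not exhaustive) allowlist of license identifiers:

```python
# Example allowlist; a real system would use a much larger, vetted table.
OPEN_LICENSES = {"CC0-1.0", "Unlicense", "MIT", "Apache-2.0",
                 "CC-BY-4.0", "CC-BY-SA-4.0", "GPL-3.0-only", "ODC-By-1.0"}

def dataset_is_open(collection_license: str,
                    component_licenses: list[str]) -> bool:
    """A dataset qualifies only if the collection license AND every
    component license are open; neither overrides the other."""
    if collection_license not in OPEN_LICENSES:
        return False
    return all(lic in OPEN_LICENSES for lic in component_licenses)

# Dolma-like case: open collection license, proprietary components -> rejected
assert not dataset_is_open("ODC-By-1.0", ["Proprietary", "CC-BY-4.0"])
# Institutional-Books-like case: public domain components,
# restricted collection license -> also rejected
assert not dataset_is_open("Custom-Restricted", ["CC0-1.0", "CC0-1.0"])
```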

The Landscape of Training Data

Examining prominent datasets:

  • Public domain: Project Gutenberg, USPTO, Free Law Project, arXiv abstracts
  • Permissive: KELM, UK legislation, Stack Exchange
  • Open: Common Pile, Common Corpus, Wikipedia
  • Restricted: Institutional Books, arXiv (as a whole)
  • Proprietary: C4, The Pile, SlimPajama, Dolma, RefinedWeb, FineWeb-Edu

Reality check: If you ask someone in AI to name three pre-training datasets, all three will likely be proprietary.

Challenges in Building Openly Licensed Data

Challenge 1: Automated Tools Are Inadequate

We’ve tested both classical and AI-based NLP tools for license detection—they’re unreliable for several reasons:

Multimodal reasoning required: License information often appears as visual emblems (Creative Commons icons) rather than text. Frontier models struggle with this.

Limited website access: Many AI agents can’t access sidebars and footers where licenses are typically displayed.

License confusion: Website owners frequently:

  • Write custom licenses (often incorrectly)
  • Modify existing licenses
  • Provide contradictory statements (e.g., “CC-BY-SA, all rights reserved”)
  • Use vague terms (e.g., “MIT-ish license”)

Noisy signals: Conflicts between robots.txt files and actual licenses create ambiguity about true permissions.

Our approach: For Common Crawl subsets, we combine multi-layered human auditing with AI tools, then conduct a third audit wherever the two disagree. When a human auditor and an AI tool disagree, the human is usually correct.
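The escalation logic above can be sketched as a small resolver. This is a simplified stand-in, with hypothetical function names, for what is in practice a multi-pass human workflow:

```python
# Hedged sketch of the audit flow: an automated detector and a human
# auditor each label a document; any disagreement is escalated to a
# further (human) audit rather than trusting either side outright.
from typing import Callable

def resolve_license(doc_id: str, ai_label: str, human_label: str,
                    escalate: Callable[[str, str, str], str]) -> str:
    """Accept agreeing labels; otherwise escalate to a third audit."""
    if ai_label == human_label:
        return ai_label
    # The human is usually right, but we re-audit rather than assume so.
    return escalate(doc_id, ai_label, human_label)

# Usage: a second human auditor acts as the tiebreaker.
final = resolve_license(
    "doc-42",
    ai_label="CC-BY-4.0",
    human_label="All rights reserved",
    escalate=lambda doc, ai, human: human,  # stand-in for a real re-audit
)
```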

Challenge 2: Data Laundering

People frequently download data, modify it, and re-upload it with incorrect licensing—sometimes malicious, often benign. Either way, it creates legal problems.

Evidence from research: A 2022 Data Provenance Initiative audit of Hugging Face, GitHub, and Papers with Code found that the vast majority of documents had missing or incorrect licensing information—even at a high level (commercial vs. non-commercial).

Examples of problematic datasets:

  • YouTube Commons: High error rate in alleged CC-BY data
  • StarCoder: Code dataset with licensing information, but high error rates
  • Hacker News on Kaggle: Posted by “Y Combinator” account, but Y Combinator couldn’t confirm authorization. Shows MIT and CC0 licenses, but actual Hacker News site prohibits commercial use.

Challenge 3: Missing Document-Level Metadata

Most ML projects don’t record licensing information per document or include essential verification metadata.

  • The Pile (2020): Detailed corpus-level information, but no per-document licensing
  • PAIS dataset: Has document-level licensing, but includes vague categories like “various open data”
  • Typical approach: Subcorpus-level licensing (e.g., “GitHub data includes MIT, BSD, Apache 2.0”)—better than nothing, but insufficient for our standards

Our Approach: Common Pile v0.1

Core Principles

1. Comprehensive metadata documentation

Minimum requirements for each document:

  • Authorship/IP ownership
  • Actual license
  • Links to original document

Ideally also include: creation date, access date, unique IDs, etc.

2. High standards for licensing metadata

License information must come from:

  • Original authors, OR
  • Gold-standard provenance organizations (we only trust HathiTrust and Library of Congress), OR
  • Government datasets with legally-defined licenses (e.g., US federal data is public domain by law)

3. Manual human auditing

Humans actively look for signs of license laundering. Key breakthrough: reframing the question from “Can we use this?” to “Can you find a reason to disqualify this?” This decreased the amount of usable data but massively increased reliability.

The Reliability Question

Dilemma: If you audit 1,000 papers and find 7 with wrong licensing (0.7% error rate), do you use the other 993? What about 70 errors (7% rate)?

Current state: Each organization must make case-by-case decisions with their lawyers.

Future need: Community standards for acceptable accuracy rates and verification methods.
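One way to reason about the dilemma numerically: treat the audit as a sample and compute an upper confidence bound on the true error rate, rather than trusting the raw observed rate. A back-of-envelope sketch using the Wilson score interval (my choice of method, not one prescribed in the talk):

```python
import math

def error_rate_upper_bound(errors: int, sample: int, z: float = 1.96) -> float:
    """Upper end of the 95% Wilson score interval for the true error rate."""
    phat = errors / sample
    denom = 1 + z * z / sample
    center = (phat + z * z / (2 * sample)) / denom
    half = z * math.sqrt(phat * (1 - phat) / sample
                         + z * z / (4 * sample * sample)) / denom
    return center + half

# 7 errors in 1,000 audited documents: observed rate is 0.7%, but the
# true rate could plausibly be around twice that, so a threshold policy
# should be set against the bound, not the point estimate.
ub = error_rate_upper_bound(7, 1000)
```

This is exactly where community standards would help: agreeing on which bound, which confidence level, and what threshold counts as acceptable.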

Case Studies

Stack v2

Great code dataset from Software Heritage, filtered using ScanCode for license detection. However, the released version included data with missing licenses and didn’t meet our metadata standards.

Our solution: Reprocessed the data using ScanCode information to obtain individual file-level licenses.
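The reprocessing step amounts to filtering on per-file detected licenses. A sketch under assumptions: the record format below is a hypothetical simplification (real ScanCode output is far richer), and the allowlist is illustrative.

```python
PERMITTED = {"MIT", "Apache-2.0", "BSD-3-Clause"}  # example allowlist

def filter_files(records: list[dict]) -> list[dict]:
    """Drop files with missing or non-permitted detected licenses,
    and record the detected license as file-level metadata."""
    kept = []
    for rec in records:
        detected = rec.get("detected_license")  # detector result, may be None
        if detected in PERMITTED:
            kept.append({**rec, "license": detected})
    return kept

files = [
    {"path": "a.py", "detected_license": "MIT"},
    {"path": "b.py", "detected_license": None},           # missing -> dropped
    {"path": "c.py", "detected_license": "Proprietary"},  # not open -> dropped
]
kept = filter_files(files)
```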

YouTube

YouTube offers two upload licenses: standard YouTube license (not open) and CC-BY.

Problem: Massive amounts of obviously non-CC-BY content marked as CC-BY:

  • TV show clips (Breaking Bad shorts)
  • Full anime episodes (Sailor Moon)
  • Parody videos with extended copyrighted content (Avengers: Age of Ultron scenes)

Our solution:

  • Manually curated 1,000+ professional, government, and academic channels that upload original content
  • These organizations have financial incentives to get licensing correct
  • Verified official status (including email domain verification)
  • Filtered for CC-BY videos from these channels only
  • Spot-checked results
  • Removed entire channels if any problematic videos found
  • Transcribed with Whisper-large models and spot-checked again
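The channel-level steps above can be sketched as a small selection function. The channel and video fields here are hypothetical stand-ins for the real curation records:

```python
def select_videos(channels: list[dict]) -> list[dict]:
    """Keep CC-BY videos only from verified channels, and drop an entire
    channel if spot-checking flagged any of its CC-BY videos."""
    selected = []
    for ch in channels:
        if not ch["verified"]:            # e.g. email-domain verification
            continue
        videos = [v for v in ch["videos"] if v["license"] == "CC-BY"]
        if any(v.get("flagged") for v in videos):
            continue                      # one bad video removes the channel
        selected.extend(videos)
    return selected

channels = [
    {"verified": True, "videos": [
        {"id": "v1", "license": "CC-BY"},
        {"id": "v2", "license": "Standard"},               # not CC-BY -> skipped
    ]},
    {"verified": True, "videos": [
        {"id": "v3", "license": "CC-BY", "flagged": True}, # channel dropped
    ]},
    {"verified": False, "videos": [{"id": "v4", "license": "CC-BY"}]},
]
picked = select_videos(channels)
```

The channel-level drop rule is deliberately blunt: it trades recall for trust, mirroring the “find a reason to disqualify” framing described earlier.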

Does It Actually Work?

Experiment 1: 1.7B Parameter Models (28B Tokens)

Compared Common Pile against other corpora:

Results: Models trained on Common Pile:

  • Substantially outperformed other openly licensed data
  • Met or exceeded performance of models trained on The Pile and OSCAR (both proprietary)
  • Only meaningful gap: FineWeb-Edu (currently the world’s best pre-training corpus)

Experiment 2: 7B Parameter Models (1-2 Trillion Tokens)

These are serious models at a scale comparable to historical baselines (LLaMA 1/2, Mosaic MPT, DeepSeek-LM).

Results: Models trained on Common Pile matched or exceeded performance of similarly-trained models.

Note: Qwen 3 (shown for reference) was trained on 15 trillion tokens—7.5-15× more data than our models—representing a performance upper bound.

Ongoing Challenges

1. Books and Social Bias

Public domain books are mostly very old (pre-1928). Few openly licensed modern books exist.

Concerns:

  • Representation of Muslims, women, people of color, and Global South in 1910s literature is often problematic
  • May teach undesirable social biases reflecting outdated worldviews
  • Technological anachronisms (telegraphs, no phones/internet)

Current state: Virtually no research exists on this question. We’re working on it.

2. Locked-Up Public Domain Data

Huge amounts of public domain content trapped in difficult formats:

  • Audio recordings: Whisper-large models show promise for transcription, but expensive to scale
  • PDFs: OCR remains surprisingly difficult, especially for old text and non-standard formatting

3. International Copyright Law

Common Pile focuses on US law because international copyright is complex. We need partners with local expertise in other countries.

Conclusion

The Common Pile v0.1 represents:

  • 8 terabytes of public domain and openly licensed text
  • 2+ years of work by 20+ researchers
  • Collaboration between the University of Toronto, Hugging Face, EleutherAI, BAIR, and many others

We deliberately named it “v0.1” as a statement of intent—this is just the beginning. We aim for v0.2, v0.3, and eventually v1, v2, v3, each bigger and better than the last.

Key Collaborators and Funders

Special thanks to:

  • University of Toronto
  • Allen Institute for AI
  • Mozilla and Mozilla AI
  • Data Provenance Initiative
  • University of Maryland
  • Hugging Face
  • Lawrence Livermore National Lab

These organizations provided funding, expertise, and partnership in building a future with powerful open models trained on openly licensed data.


The bottom line: We can train competitive models on openly licensed data. The gap is closing, and the work continues.