Part of the Deep Dive: Data Governance Webinar Series
Data plays a critical role in the AI ecosystem, so much so that some model-makers have begun contacting libraries in search of new data and scraping library websites so frequently that the libraries struggle to deliver services to their traditional patrons. A new paradigm for publishing library collections is needed to respond to the rapid development of LLMs and the resulting scarcity of high-quality, open training data.
Our publication of Institutional Books 1.0, a collection of 983,004 public domain volumes from Harvard Library, is grounded in the belief that library collections are well positioned to improve the training data ecosystem. Diversifying sources, strengthening provenance chains, and increasing accountability to original source material are library practices we can bring to that ecosystem.
In preparing this collection, we took a rigorous approach to information stewardship. The output of this work is a roughly 242B-token public domain dataset spanning more than 300 languages and covering literature, law, philosophy, science, and other topics. We refined this dataset through collection-level deduplication, high-level topic classification, OCR artifact and text analysis, and OCR text post-processing.
By collaborating with institutional and AI communities to co-develop data practices, we aim to create a healthier and more efficient foundation for model development across the commercial, academic, and public spheres, one that establishes mutual benefit between the AI and library communities.
Video summary
Greg Leppert, Matteo Cargnelutti, Catherine Brobston, Institutional Data Initiative (IDI), Harvard Law School Library
IDI is a team of researchers, data scientists, and policy experts working with libraries, museums, government agencies, and cultural institutions to ensure sustainable access to data and knowledge in the AI era.
Our Core Beliefs:
- Knowledge institutions play a critical role in society that must be preserved
- AI community effort should flow back into knowledge institutions for mutual benefit
- Change happens through setting examples, not just standards—through impactful data releases
The Dataset
Matteo Cargnelutti, Principal Engineer, IDI
Institutional Books contains nearly one million public domain books scanned at Harvard Library through the Google Books project. Released for early access on Hugging Face on June 12, it includes OCR text, bibliographic records, and comprehensive analysis for each volume.
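For readers who want to look at the collection directly, the sketch below streams a few records with the Hugging Face `datasets` library. The dataset identifier, split name, and field names are assumptions for illustration; the dataset card on Hugging Face has the exact values.

```python
# Minimal sketch: stream a few volumes from the Institutional Books release.
# NOTE: the dataset ID, split name, and record schema below are assumptions;
# consult the dataset card on Hugging Face for the exact identifier and fields.
from datasets import load_dataset

DATASET_ID = "institutional/institutional-books-1.0"  # assumed identifier

# Streaming avoids downloading the full collection before inspecting it.
ds = load_dataset(DATASET_ID, split="train", streaming=True)

for i, volume in enumerate(ds):
    # Show the first few fields of each record, whatever they happen to be.
    preview = dict(list(volume.items())[:5])
    print(f"Volume {i}: {preview}")
    if i >= 2:
        break
```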
Scale and Scope
- 1.04 million volumes with text
- 394 million pages
- 248 billion GPT-4 tokens
- 60% of volumes contain 100,000+ tokens—ideal for long-context AI training
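The token figures above are reported in GPT-4 tokens. As a rough illustration of how such counts can be produced, a short sketch using OpenAI's `tiktoken` library follows; the sample text is a placeholder, not dataset content, and the released pipeline measured counts across five tokenizers rather than this one alone.

```python
# Rough sketch: count GPT-4-style tokens in a piece of OCR text with tiktoken.
# The sample string is a placeholder; in practice this runs over each volume.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")  # resolves to the cl100k_base encoding

def count_tokens(text: str) -> int:
    """Return the number of GPT-4 tokens in `text`."""
    return len(enc.encode(text))

sample = "On the Origin of Species by Means of Natural Selection, 1859."
print(count_tokens(sample))
```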
Key Processing Steps
1. Data Retrieval and Quantification
- Retrieved raw scans, OCR data, metadata, and bibliographic records from Google
- Measured text availability using page counts and token counts across five tokenizers
2. Collection Mapping
- Temporal coverage: Majority published in the 19th and 20th centuries, with a spike between 1880 and 1910
- Language detection: Combined librarian-assigned codes with text-level detection to identify ~390 languages; nearly half in English, with strong European language representation
- Quality metrics: Assessed OCR quality and text characteristics to identify artifacts and edge cases
3. Enhanced Usability
- Topic classification: Trained a model on 80,000 records to assign Library of Congress classification categories (20 main classes)
- Deduplication: Identified 41,000 likely duplicates using locality-sensitive hashing, without removing them (a minimal sketch follows this list)
- OCR post-processing: Trained a model to detect line types and reassemble text (paragraphs, headings) to improve ML/NLP usability—856,000 texts processed
- Rights determination: Used HathiTrust records to identify public domain volumes
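To make the deduplication step above more concrete, here is a simplified sketch of near-duplicate detection with MinHash and locality-sensitive hashing, using the `datasketch` library. It is a stand-in for the released pipeline, not the pipeline itself; the shingle size and similarity threshold are assumed values.

```python
# Simplified sketch of near-duplicate detection with MinHash LSH (datasketch).
# This is not the released pipeline; shingle size and threshold are illustrative.
from datasketch import MinHash, MinHashLSH

def shingles(text: str, k: int = 5):
    """Yield overlapping k-word shingles from a text."""
    words = text.lower().split()
    for i in range(max(len(words) - k + 1, 1)):
        yield " ".join(words[i:i + k])

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for sh in shingles(text):
        m.update(sh.encode("utf-8"))
    return m

volumes = {
    "vol_a": "history of the peloponnesian war translated with notes",
    "vol_b": "history of the peloponnesian war translated with notes and maps",
    "vol_c": "a treatise on celestial mechanics and planetary motion",
}

# A Jaccard threshold of 0.5 is an assumed value for this toy example.
lsh = MinHashLSH(threshold=0.5, num_perm=128)
hashes = {vid: minhash(text) for vid, text in volumes.items()}
for vid, mh in hashes.items():
    lsh.insert(vid, mh)

for vid, mh in hashes.items():
    candidates = [c for c in lsh.query(mh) if c != vid]
    if candidates:
        print(f"{vid} is a likely duplicate of: {candidates}")
```

In the actual release, volumes flagged this way are marked rather than removed, so downstream users can decide how to handle them.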
Open Resources
- Processing pipeline available on GitHub
- Topic classification model on Hugging Face (usage sketch after this list)
- Google Books extraction tool coming soon
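The topic classification model listed above assigns Library of Congress Classification main classes to volumes. A hedged sketch of how such a classifier might be called through the `transformers` pipeline API follows; the model identifier is a placeholder, so take the real ID from the project's Hugging Face page before running it.

```python
# Hedged sketch: calling a text-classification model via the transformers pipeline.
# "IDI/topic-classifier" is a placeholder, not the real model ID; look up the
# actual identifier on the project's Hugging Face page.
from transformers import pipeline

classifier = pipeline("text-classification", model="IDI/topic-classifier")

record = "A treatise on the law of contracts, with references to American cases."
predictions = classifier(record, top_k=3)  # top candidate Library of Congress classes
print(predictions)
```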
Community Impact
Catherine Brobston, Program Director, IDI
Primary Use Cases
1. Foundational Model Development
- Long-context training improves handling of extended conversations/documents
- Vetted, multilingual content (50% non-English)
- Supports transparency in model development
2. Historical Research
- Provides reliable, unaltered historical data for academic research
- Enables fine-tuning models with historical perspectives for teaching tools
- Notable project: Training models on pre-cutoff historical data to test predictive capabilities
3. Non-English Language Models
- Substantial Latin and Ancient Greek content opens new training opportunities
- Supporting development in Scottish Gaelic, Ukrainian, and other languages
Future Development
Next Steps:
- Version 2 of the processing pipeline to improve OCR handling of structured layouts (e.g., columns) and non-Latin scripts
- Potential “LLM-ready” version with common cleaning applied (removing running headers/footers and page numbers) to lower barriers to entry; a rough sketch of that kind of cleanup follows this list
- Collaboration with Boston Public Library on historic newspapers (winter release)
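The planned “LLM-ready” cleanup described above (dropping running headers, footers, and page numbers) can be approximated with simple heuristics. The sketch below is an assumed, minimal version of that kind of cleanup, not the pipeline IDI intends to release.

```python
# Assumed, minimal sketch of "LLM-ready" cleanup: drop bare page numbers and
# short lines that repeat across pages (likely running headers or footers).
# Illustrative only; this is not IDI's planned pipeline.
import re
from collections import Counter

def clean_pages(pages: list[str], repeat_threshold: int = 3) -> str:
    lines = [ln.strip() for page in pages for ln in page.splitlines()]
    # Lines that recur on many pages are probably headers or footers.
    counts = Counter(ln for ln in lines if ln)
    kept = []
    for ln in lines:
        if not ln:
            continue
        if re.fullmatch(r"[ivxlcdm\d]+", ln, flags=re.IGNORECASE):
            continue  # crude check for bare page numbers (arabic or roman)
        if counts[ln] >= repeat_threshold and len(ln) < 60:
            continue  # short, repeated line: likely a running header
        kept.append(ln)
    return "\n".join(kept)

pages = [
    "THE HISTORY OF ROME\n12\nThe consuls entered office in March.",
    "THE HISTORY OF ROME\n13\nThe senate debated the levy at length.",
    "THE HISTORY OF ROME\n14\nNews of the defeat reached the city.",
]
print(clean_pages(pages))
```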
Conclusion
Institutional Books represents the first step toward iterative, collaborative data releases that balance accountability to source material with broad utility across use cases. The goal is bidirectional support between knowledge institutions and the AI community.
Get involved: Download the dataset, use it in your research, and share your feedback with the Institutional Data Initiative at institutionaldatainitiative.org.
