In a world where artificial intelligence (AI) is evolving at a breakneck pace, the availability of quality training data is crucial. EleutherAI, an AI research organization, has just taken a major step by publishing what it claims to be one of the largest collections of licensed and public domain texts for training AI models: the Common Pile v0.1. This dataset, with an impressive size of 8 terabytes, marks a turning point in the quest for transparency and legal compliance in the AI field. Here’s everything you need to know about this groundbreaking initiative.
A two-year collaborative effort
The Common Pile v0.1 is the result of two years of hard work, carried out in collaboration with major players in the AI ecosystem, such as startups Poolside and Hugging Face, as well as academic institutions like the University of Toronto, the Vector Institute, the Allen Institute for Artificial Intelligence, and many others. This ambitious project was designed with the help of legal experts to ensure that all included data is either in the public domain or under open licenses, thus avoiding copyright issues that have often plagued AI model training practices.
The dataset brings together diverse content from 30 different sources, including:
- Scientific articles and abstracts from arXiv
- Medical texts from PubMed Central
- Millions of research articles
- Legal documents, including US patents, court decisions from the Caselaw Access Project, and transcripts of British parliamentary debates
- Books digitized by the Library of Congress and Project Gutenberg
- Free educational resources
- Code from Stack Exchange forums, GitHub discussions, and transcripts of YouTube videos from over 2,000 channels, which together account for more than half of the dataset
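To make the idea of license-aware curation concrete, here is a minimal sketch of the kind of filter such a pipeline might apply. The document schema (`text` and `license` fields) and the allowlist below are illustrative assumptions, not Common Pile's actual metadata format:

```python
# Illustrative license filter: keep only documents whose license
# appears on an allowlist of open / public-domain licenses.
# NOTE: the schema and allowlist are hypothetical examples,
# not Common Pile's real format.

OPEN_LICENSES = {"public-domain", "cc-by-4.0", "cc-by-sa-4.0", "mit"}

def filter_open_documents(documents):
    """Return only the documents carrying an allowlisted license."""
    return [
        doc for doc in documents
        if doc.get("license", "").lower() in OPEN_LICENSES
    ]

docs = [
    {"text": "A public-domain court decision.", "license": "public-domain"},
    {"text": "An all-rights-reserved novel.", "license": "proprietary"},
    {"text": "An openly licensed tutorial.", "license": "CC-BY-4.0"},
]

kept = filter_open_documents(docs)
print(len(kept))  # 2 of the 3 documents pass the filter
```

In practice this kind of check is only one step; the project's legal review went further, verifying the provenance of each source rather than trusting self-reported license tags.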
A response to controversies over training data
AI companies, including giants like OpenAI, face lawsuits over their training practices, which often rely on scraping web content, including materials protected by copyright, such as books and research journals. These practices have sparked debates about legality, ethics, and transparency in the field. Stella Biderman, executive director of EleutherAI, emphasized in a blog post on Hugging Face that these lawsuits have “drastically reduced” transparency at AI companies, harming research by making it difficult to understand how models work and their shortcomings.
EleutherAI offers an alternative with Common Pile v0.1, which stands out for its careful curation. Unlike its predecessor, The Pile, an 800 GB dataset that contained copyright-protected materials and drew criticism, this new dataset relies exclusively on open and public sources. EleutherAI also used Whisper, OpenAI’s open-source audio transcription model, to include transcribed audio content, enhancing data diversity.
For more technical details, see EleutherAI’s blog post announcing the release.
Competitive models trained on the Common Pile
In parallel, EleutherAI unveiled two new AI models, Comma v0.1-1T and Comma v0.1-2T, both with 7 billion parameters, trained respectively on 1 and 2 trillion tokens from Common Pile v0.1. These models perform comparably to alternatives trained on unlicensed data, such as Meta’s first Llama model, on benchmarks for coding, image understanding, and mathematics. According to EleutherAI, these results show that carefully curated, openly licensed data can produce high-performing models, contradicting the idea that unlicensed text is necessary to achieve strong performance.
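For a sense of scale, the token-to-parameter ratios of the two Comma models follow directly from the figures above (the ~20 tokens/parameter "Chinchilla" heuristic is cited here only as a common reference point, not as EleutherAI's training recipe):

```python
# Rough scale check: tokens seen per parameter for each Comma model.
PARAMS = 7e9  # 7 billion parameters (both models)

for name, tokens in [("Comma v0.1-1T", 1e12), ("Comma v0.1-2T", 2e12)]:
    ratio = tokens / PARAMS
    print(f"{name}: {ratio:.0f} tokens per parameter")
# Comma v0.1-1T: 143 tokens per parameter
# Comma v0.1-2T: 286 tokens per parameter
```

Both models are trained well past the ~20 tokens/parameter heuristic, i.e. they see far more data than a compute-optimal recipe would prescribe for their size, which is typical for models intended for downstream use.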
Stella Biderman stated in her post on Hugging Face: “In general, we think the common idea that unlicensed text is the key to performance is unjustified. As the amount of open and public domain data increases, we can expect improvements in the quality of models trained on this content.”
A step toward greater transparency and ethics
Common Pile v0.1 is part of EleutherAI’s effort to correct past mistakes. A few years ago, the release of The Pile had sparked controversy due to the inclusion of copyright-protected content. Today, the organization is committed to publishing open datasets more frequently, in collaboration with its research and infrastructure partners. A notable event was the “Dataset Convening” organized by Mozilla and EleutherAI in June 2024, where experts from startups, nonprofit AI labs, and civil society organizations discussed best practices for creating open datasets. The results of this event are detailed in a paper titled “Towards Best Practices for Open Datasets for LLM Training”, available on arXiv.
Why does this matter?
Common Pile v0.1 demonstrates that it is possible to build high-performing AI models without resorting to copyright-protected data, paving the way for a more transparent, ethical, and legally sound AI ecosystem. By making this dataset accessible to all, EleutherAI strengthens its commitment to open science and open source, core values of the organization. Resources, including the code used to build this dataset, are available on GitHub, and standalone tools for audio transcription and document conversion have been published in partnership with Mozilla.
With Common Pile v0.1, EleutherAI is redefining the standards of AI research by proving that quality and performance do not require compromising ethics or legality. This 8 TB dataset, combined with the Comma v0.1 models, offers a valuable resource for researchers, developers, and organizations looking to build responsible AI models.
