
Jun 12 2025

Common Pile v0.1: A License-Safe Dataset Built for Responsible AI Development

I was rifling through my typical website feeds and emails as I updated my AI Timeline and was stopped dead in my tracks by a TechCrunch article announcing that EleutherAI, an AI research organization, had released what it claims is one of the largest collections of licensed and open-domain text for training AI models.

I regularly include dataset releases in my AI Timeline and have even assigned them a dedicated tag. But this one stood out: its sheer size (8 terabytes) and the resources poured into curating it piqued my curiosity. So I started digging to verify their claim.

And I must say, in a field where most training datasets remain opaque and legally questionable, the approach taken with the Common Pile v0.1 dataset is commendable. After two years of license verification, deduplication, and filtering, this corpus provides what has been largely missing: a large-scale dataset with clear legal provenance and full documentation. Rather than relying on web scraping of uncertain copyright status, each source has been explicitly verified and attributed. For organizations concerned about the legal and ethical implications of their AI training data, this represents a viable alternative to the industry’s standard practices.

The Team

The Common Pile v0.1 dataset was developed by researchers from EleutherAI in collaboration with the Vector Institute, Hugging Face, the Allen Institute for AI, Lila Sciences, poolside, Lawrence Livermore National Laboratory, Teraflop AI (its website is down at the moment), and several academic institutions, including the University of Toronto, Cornell University, MIT, CMU, and the University of Maryland, College Park.

Unlike many large-scale corpora, Common Pile v0.1 is fully documented and legally transparent, making it suitable for both commercial and public-interest applications. The dataset and accompanying models (Comma v0.1) are available on the Hugging Face platform, and a detailed overview is provided in the project's arXiv paper.

When ‘Open’ Isn’t Really Open

A major challenge in creating ethically sourced AI training datasets is the widespread problem of inaccurate licensing information across the internet. One particularly concerning issue is ‘license laundering’, i.e. the redistribution of copyrighted material under an incorrect license, often without having the rights to do so. This creates uncertainty about whether content that appears to be openly licensed actually is.

To ensure the Common Pile v0.1 contained only genuinely open content, the research team established rigorous sourcing standards and only included data from sources where they could verify that the licensing information came directly from the actual copyright holders. This cautious approach meant excluding some potentially valuable datasets, including OpenAlex, YouTube Commons, and the Hacker News dataset on Kaggle because the licensing provenance couldn’t be definitively confirmed.

This careful curation process reflects the growing emphasis on building AI systems with transparent, legally sound training data, even when it means working with smaller datasets than might otherwise be possible.

Their Multi-Stage Data Cleaning Process

Raw data from diverse internet sources requires extensive preprocessing before it can effectively train language models. You can learn more about machine learning models you can use for preprocessing data in this filtered view of my Machine Learning Model Picker. (I include a screenshot of one of the model cards and decision tooltips to demonstrate the types of additional info I’ve made available in these nodes.)


The Common Pile research team implemented a comprehensive cleaning pipeline using the Dolma data processing toolkit. They applied both universal quality filters and source-specific optimizations so that the final dataset contains only high-quality, appropriate content while preserving the linguistic diversity essential for robust model training.

General Text Processing

  • Language filtering: They applied a FastText classifier to retain only English content.
  • Quality filtering: They used the DataComp-LM text quality classifier with a low threshold to remove noisy web text.
  • OCR error removal: They filtered out documents with pervasive OCR errors using likelihood-based filtering against a unigram model of the Trillion Word Corpus.
  • Toxicity filtering: They applied FastText toxicity classifiers trained on the Jigsaw dataset to reduce inappropriate content.
  • PII redaction: They used regexes to detect personally identifiable information (PII) like email addresses, phone numbers, and IP addresses and replace it with placeholder tokens (see the sketch after this list).
  • Source-specific cleaning: They applied regex filtering to remove repetitive text like page numbers, preambles, and license statements.
  • Global deduplication: They performed document-level fuzzy deduplication using bloom filters, marking documents as duplicates if they shared greater than 90% of 20-grams (i.e., sequences of 20 consecutive words).
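To make a couple of those steps concrete, here's a minimal sketch of the PII redaction and 20-gram fuzzy deduplication ideas in Python. The regexes, placeholder tokens, and in-memory set are illustrative simplifications of mine; the team's actual implementation lives in the Dolma toolkit and uses Bloom filters to scale to billions of documents.

import re

# Illustrative regexes and placeholder tokens (not the team's actual Dolma patterns)
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def redact_pii(text: str) -> str:
    """Replace emails, phone numbers, and IPv4 addresses with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return IPV4.sub("[IP_ADDRESS]", text)

# Toy document-level fuzzy dedup: mark a document as a duplicate if more than 90%
# of its 20-grams have already been seen (a production pipeline would use a Bloom
# filter rather than an in-memory set)
seen_ngrams = set()

def is_duplicate(text: str, n: int = 20, threshold: float = 0.9) -> bool:
    words = text.split()
    ngrams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:  # document shorter than n words
        return False
    hits = sum(ng in seen_ngrams for ng in ngrams)
    seen_ngrams.update(ngrams)
    return hits / len(ngrams) > threshold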

Code Data Processing

  • Initial filtering: They applied RedPajama V1 heuristics based on line length, alphanumeric character proportion, and alphabetical-to-token ratios (a rough sketch follows this list).
  • Language selection: They only retained code in 15 programming languages: Python, C, C++, SQL, Java, PHP, Rust, JavaScript, TypeScript, Go, Ruby, Markdown, C#, Swift, and shell. (Sorry, R users.)
  • Quality classification: They used language-specific classifiers to keep only educational and well-documented code.
  • HTML processing: They extracted plaintext from HTML documents using Trafilatura.
  • Standard filtering: They applied the same language, length, toxicity, and PII filtering used for general text.
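The RedPajama-style heuristics in the first bullet boil down to a few cheap character-level checks. The sketch below is my own approximation with made-up thresholds, not the exact filters or cutoff values used for the Common Pile.

def passes_code_heuristics(source: str,
                           max_line_len: int = 1000,
                           min_alnum_frac: float = 0.25,
                           min_alpha_per_token: float = 1.5) -> bool:
    """Cheap RedPajama-style checks; thresholds here are illustrative only."""
    lines = source.splitlines() or [""]
    # Reject files with absurdly long lines (often minified or generated code)
    if max(len(line) for line in lines) > max_line_len:
        return False
    # Reject files that are mostly non-alphanumeric noise
    if not source or sum(c.isalnum() for c in source) / len(source) < min_alnum_frac:
        return False
    # Require a reasonable amount of alphabetic content per whitespace token
    tokens = source.split() or [""]
    alpha_chars = sum(c.isalpha() for c in source)
    return alpha_chars / len(tokens) >= min_alpha_per_token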

Common Pile’s Modular Design

The Common Pile v0.1 is structured around nine curated content categories. The chart below shows the research team's breakdown of dataset sizes for the sources in each category. (A short snippet after the list shows how to pull an individual subset from Hugging Face.)

[Chart: breakdown of the dataset sizes of the sources in each of the nine categories]

Their arXiv paper provides a thorough explanation of these nine categories on page 4, but here's the TL;DR version:

  • Scientific and scholarly texts: The subset includes filtered peS2o research papers, PubMed Central medical articles, and arXiv papers in quantitative sciences. arXiv abstracts are included for all papers since they're distributed under a CC0 license regardless of full-text licensing.
  • Online discussion forums: This subset features StackExchange question-answer pairs under the CC BY-SA license, GitHub issues and comments from repositories with approved open licenses, and Ubuntu IRC chat logs from 2004 onward in the public domain. These multi-turn question-answer pairs can be useful for training language models to follow conversational structure as well as for improving performance on dialog-centric tasks. This filtered view of my AI Strategy app provides links to leaderboards that will aid you in comparing chat-oriented benchmarks across prospective models.
  • Government and legal texts: This subset incorporates US federal documents from GovInfo.gov and Regulations.gov, USPTO patent documents dating back to 1782, and UK parliamentary proceedings from Hansard. It also includes public domain court decisions from the Caselaw Access Project and Court Listener.
  • Curated task datasets: This subset contains datasets specifically designed for fine-tuning tasks like question answering and summarization. Sources are verified through the Data Provenance Initiative to ensure proper licensing and avoid license laundering. This filtered view of my AI Strategy app provides links to leaderboards that will aid you in comparing text-oriented benchmarks across prospective models.
  • Books: This subset focuses on public domain works including pre-1929 US books, titles from the Biodiversity Heritage Library, Internet Archive digitizations, Library of Congress collections, and select Project Gutenberg books.
  • Open Educational Resources (OERs): This subset draws from educational materials under Creative Commons licenses, including peer-reviewed books from the Directory of Open Access Books, the PressBooks catalog, instructional materials from OERCommons, and open-access textbooks from LibreTexts.
  • Wikis: This subset includes English-language Wikimedia Foundation wikis converted from wikitext to plain text, plus additional wikis from WikiTeam's database dumps that use open licenses.
  • Source code: This subset features the openly licensed subset of Stack V2 compiled by the Software Heritage Foundation and BigCode, plus the Python Enhancement Proposals released into the public domain. This filtered view of my AI Strategy app provides links to leaderboards that will aid you in comparing code-generation benchmarks across prospective models.
  • YouTube and web content: This subset contains transcriptions from over 2,000 manually curated YouTube channels with CC BY licensed speech content, plus web pages with verified Creative Commons licensing from Common Crawl snapshots and select news sites.
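Because each category is published as its own dataset on Hugging Face, you can pull any subset independently of the rest. Here's a minimal sketch using the datasets library; the repository ID ("common-pile/stackexchange") and the "text" column name are assumptions on my part, so check the Common Pile collection on Hugging Face for the exact names.

from datasets import load_dataset

# Stream one Common Pile subset without downloading the whole thing
# (repo ID and column name are assumed; confirm them in the common-pile org on Hugging Face)
subset = load_dataset("common-pile/stackexchange", split="train", streaming=True)

for i, doc in enumerate(subset):
    print(doc.get("text", "")[:200])  # peek at the first few documents
    if i == 2:
        break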

Comparing Data Quality: How the Common Pile Stacks Up

To see how well the Common Pile performs, the research team ran a head-to-head comparison. They trained identical 1.7B parameter models on the different datasets and tested them on eight standard benchmarks. (I break down these benchmarks by task in my AI Strategy tool. You can see an example of the leaderboards you can use to compare quality and cost benchmarks for tasks around code generation from this filtered view).

They include a chart that compares the performance of models trained on each of the datasets, with the model trained on the Common Pile labeled 'Comma'.

Before diving into the results, it's worth understanding what their 'Openly Licensed' and 'Unlicensed' categories mean. Openly licensed datasets use content whose creators have explicitly granted permission for AI training: think Creative Commons licensed articles, public domain books, or open-source code. Unlicensed datasets, on the other hand, scrape content from across the web without explicit permission; this is how most foundation language models have been trained, but it raises significant copyright concerns.

Note: You can get an idea of the types of legal liability some of these model creators are facing—particularly by publishers that are tired of having their content filched by model creators—by following the ‘legal action’ tag of my interactive AI Timeline.


Back to their chart….

What’s immediately striking is how quickly you can spot the differences. Within just a few billion training tokens, it becomes clear which datasets produce better models. You don’t need to wait for the full, expensive training run to complete.

To put this in perspective, training a language model involves feeding it billions or trillions of pieces of text, called tokens, which are measured along the x-axes of these charts. This training can take weeks or months and cost hundreds of thousands of dollars. The fact that quality differences emerge so early means researchers can evaluate their data choices after just a few days of training rather than waiting for the entire process to finish. This early signal is valuable for making decisions about which datasets to use for larger, more expensive training runs.

Among the ethically sourced datasets, the Common Pile dataset (indicated by the orange line) consistently outperforms its competitors. But it also holds its own against unlicensed datasets. The Common Pile performs nearly as well as OSCAR, a popular dataset that includes copyrighted material—and actually outperformed the original Pile across most benchmarks, according to the team.

FineWeb still edges ahead, but the gap is surprisingly small. This challenges a common assumption in the AI community: that you need potentially problematic, unlicensed data to build competitive models.

For organizations worried about the legal and ethical implications of their training data, these results are encouraging. You no longer have to choose between doing the right thing and building effective models.

Also A Fine-Tuning Treasure Trove

In addition to using the dataset in its entirety to build a language model, its subsets could be used to fine-tune models, i.e., to make an already established language model more responsive to a particular task or need.

Many organizations remain hesitant to pursue model fine-tuning due to historical cost barriers that made custom AI development prohibitively expensive for all but the largest tech companies. Previously, fine-tuning required massive computational resources, specialized infrastructure, and teams of machine learning engineers, often costing hundreds of thousands of dollars per project. However, the past year has seen dramatic cost reductions driven by more efficient training techniques like LoRA adapters, the proliferation of affordable cloud GPU services, and open-source tooling that democratizes the process. Combined with high-quality, legally safe datasets like the Common Pile, fine-tuning has evolved from an enterprise luxury to an accessible competitive advantage.
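To make that point concrete, here's a minimal LoRA fine-tuning sketch built on Hugging Face's transformers, peft, and datasets libraries. The base model, dataset repository ID, column name, and hyperparameters are placeholder assumptions of mine, not a recipe from the Common Pile team.

from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "EleutherAI/pythia-410m"  # any small causal LM works for a test run
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Wrap the base model with low-rank adapters instead of updating all weights
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                                         task_type="CAUSAL_LM"))

# Load a small slice of one Common Pile subset (repo ID and "text" column are assumptions)
dataset = load_dataset("common-pile/stackexchange", split="train[:1%]")
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="comma-lora", per_device_train_batch_size=4,
                           num_train_epochs=1, learning_rate=2e-4),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()

With a small base model and a thin slice of one subset, a sketch like this runs on a single consumer GPU; the broader point is that the Common Pile subsets drop straight into the standard Hugging Face fine-tuning stack.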

Below are a few examples of how the data subsets could be used to fine-tune a small or large language model.

Scientific and Scholarly Texts

  • A pharmaceutical company could fine-tune a model on PubMed content to create an AI assistant that helps researchers identify drug interactions and contraindications during clinical trial design
  • An academic publisher could use ArXiv papers to develop a peer review assistant that flags methodological issues and suggests relevant citations for submitted manuscripts
  • A medical device startup could leverage scientific literature to build a diagnostic support tool that interprets lab results and suggests follow-up tests based on current research

Online Discussion Forums

  • A software company could fine-tune on StackExchange data to create an internal coding assistant that answers developer questions and suggests debugging approaches in their company’s tech stack
  • A customer service platform could use GitHub issue discussions to train models that automatically categorize and route technical support tickets based on problem patterns
  • An educational technology company could leverage forum conversations to build a tutoring bot that guides students through problem-solving using conversational teaching methods

Government and Legal Texts

  • A legal tech startup could fine-tune on court decisions to create a case law research assistant that identifies relevant precedents and predicts litigation outcomes for attorneys
  • A compliance consulting firm could use regulatory documents to build an AI system that automatically audits corporate policies against current federal regulations
  • A patent law firm could leverage USPTO documents to develop a prior art search tool that identifies potential patent conflicts before filing applications

Curated Task Datasets

  • A market research company could use question-answering datasets to train models that automatically extract insights from survey responses and interview transcripts
  • A content marketing agency could fine-tune on summarization datasets to create tools that generate executive summaries from lengthy industry reports
  • An e-learning platform could leverage classification datasets to build automated grading systems that evaluate student essay responses across different subjects

Books

  • A publishing house could fine-tune on classic literature to create writing assistants that help authors maintain consistent narrative voice and style across their manuscripts
  • A museum could use historical texts to develop interactive exhibits where visitors can have conversations with AI representations of historical figures
  • A genealogy service could leverage biographical works to create tools that help users write compelling family history narratives from genealogical data

Open Educational Resources

  • An online university could fine-tune on textbook content to create personalized tutoring systems that adapt explanations to individual student learning styles
  • A corporate training company could use educational materials to build onboarding assistants that answer new employee questions about company procedures and industry knowledge
  • A language learning app could leverage multilingual educational resources to create conversation practice bots that simulate real classroom discussions

Wikis

  • A travel company could fine-tune on Wikipedia content to create trip planning assistants that provide detailed cultural and historical context for destinations
  • A journalism organization could use wiki data to build fact-checking tools that automatically verify claims in news articles against encyclopedic sources
  • A game development studio could leverage specialized wiki content to create intelligent NPCs that provide accurate lore and world-building information to players

Source Code

  • A DevOps company could fine-tune on code repositories to create automated code review tools that identify security vulnerabilities and suggest best practices
  • A startup accelerator could use programming datasets to build mentorship bots that help founders understand technical feasibility and implementation approaches
  • An enterprise software vendor could leverage code examples to create documentation generators that automatically write API guides and integration tutorials

YouTube and Web Content

  • A podcast production company could fine-tune on transcribed videos to create show note generators that automatically extract key topics and create episode summaries
  • A corporate communications team could use speech-based content to train presentation coaches that provide feedback on speaking style and content organization
  • A language therapy practice could leverage diverse speech patterns to create pronunciation training tools for clients with speech impediments or second-language learners

Final Thought

With its rigorous licensing verification and competitive performance, the Common Pile v0.1 offers organizations a clear path forward in an increasingly litigious landscape. The dataset demonstrates that doing the right thing and building effective AI systems are no longer mutually exclusive goals.

Image credit: Conny Schneider

Written by Annie Cushing · Categorized: AI
