Common Corpus

Common Corpus https://huggingface.co/datasets/PleIAs/common_corpus

Tue Nov 19 14:43:09 2024

Common Corpus is the largest open, permissively licensed text dataset, comprising over 2 trillion tokens (2,003,039,184,047 tokens). It is a diverse dataset spanning books, newspapers, scientific articles, government and legal documents, code, and more.

Announcing Common Corpus: A 2+ trillion token dataset that's fully open and accessible @ moz://a.
