The Arc Virtual Cell Atlas is a collection of high quality, curated, open datasets assembled for the purpose of accelerating the creation of virtual cell models. The Atlas includes both observational and perturbational data from over 300 million cells (and growing).
15 trillion tokens of the finest data the web has to offer.
The FineWeb dataset consists of more than 15T tokens of cleaned and deduplicated english web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the
datatrove library, our large scale data processing library.
FineWeb was originally meant to be a fully open replication of
RefinedWeb, with a release of the full dataset under the ODC-By 1.0 license. However, by carefully adding additional filtering steps, we managed to push the performance of
FineWeb well above that of the original
RefinedWeb, and models trained on our dataset also outperform models trained on other commonly used high quality web datasets (like C4, Dolma-v1.6, The Pile, SlimPajama, RedPajam2) on our aggregate group of benchmark tasks.
Related contents:
Common Corpus is the largest open and permissible licensed text dataset, comprising over 2 trillion tokens (2,003,039,184,047 tokens). It is a diverse dataset, consisting of books, newspapers, scientific articles, government and legal documents, code, and more.
Your Machine Learning and Data Science Community.
Inside Kaggle you’ll find all the code & data you need to do your data science work. Use over 50,000 public datasets and 400,000 public notebooks to conquer any analysis in no time.
Keshif is a web-based tool that lets you browse and understand datasets easily.