15 trillion tokens of the finest data the web has to offer.
The FineWeb dataset consists of more than 15T tokens of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline was optimized for LLM performance and run with datatrove, our large-scale data processing library.
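As a quick orientation, here is a minimal sketch of streaming a small slice of the dataset with the Hugging Face `datasets` library. The repository ID `HuggingFaceFW/fineweb` and the `sample-10BT` subset name are assumptions about how the release is laid out on the Hub; adjust them to the actual dataset page.

```python
# Minimal sketch: stream a FineWeb sample without downloading all 15T tokens.
# The repo ID "HuggingFaceFW/fineweb" and the "sample-10BT" subset are assumed
# names; check the dataset page for the actual configuration list.
from datasets import load_dataset

fw = load_dataset(
    "HuggingFaceFW/fineweb",  # Hub repository ID (assumed)
    name="sample-10BT",       # small sample subset (assumed)
    split="train",
    streaming=True,           # iterate lazily instead of materializing the dataset
)

# Peek at the first few documents; each record carries the cleaned page text.
for i, doc in enumerate(fw):
    print(doc["text"][:200])
    if i == 2:
        break
```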
FineWeb was originally meant to be a fully open replication of RefinedWeb, with a release of the full dataset under the ODC-By 1.0 license. However, by carefully adding further filtering steps, we managed to push the performance of FineWeb well above that of the original RefinedWeb, and models trained on our dataset also outperform models trained on other commonly used high-quality web datasets (such as C4, Dolma-v1.6, The Pile, SlimPajama, and RedPajama2) on our aggregate group of benchmark tasks.
Related content: