Search: [dataset] - Biapy Web Directory

Virtual Cell Atlas https://arcinstitute.org/tools/virtualcellatlas

Wed Feb 26 13:29:08 2025

The Arc Virtual Cell Atlas is a collection of high-quality, curated, open datasets assembled to accelerate the creation of virtual cell models. The Atlas includes both observational and perturbational data from over 300 million cells (and growing).

Virtual Cell Atlas @ GitHub.

FineWeb https://huggingface.co/datasets/HuggingFaceFW/fineweb

Sun Jan 26 15:26:31 2025

15 trillion tokens of the finest data the web has to offer.

The FineWeb dataset consists of more than 15T tokens of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the datatrove library, our large-scale data processing library.

FineWeb was originally meant to be a fully open replication of RefinedWeb, with a release of the full dataset under the ODC-By 1.0 license. However, by carefully adding additional filtering steps, we managed to push the performance of FineWeb well above that of the original RefinedWeb, and models trained on our dataset also outperform models trained on other commonly used high-quality web datasets (like C4, Dolma-v1.6, The Pile, SlimPajama, RedPajama2) on our aggregate group of benchmark tasks.

Address Database https://netsyms.com/gis/addresses

Fri Dec 13 14:52:01 2024

Self-hosted street address database

A SQLite3 database file with over 150 million U.S. and Canada address records. Indexed for fast queries, even on fairly slow hardware.
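
As a rough illustration of why an indexed SQLite file stays fast on slow hardware, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, index name, and sample rows are all hypothetical (the actual schema of the address database is not described here), and the example runs against an in-memory database rather than the real file.

```python
import sqlite3

# Hypothetical schema: the real database's tables and columns are not documented here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE addresses ("
    "number TEXT, street TEXT, city TEXT, state TEXT, zipcode TEXT)"
)
# An index on the lookup column lets SQLite search instead of scanning every row.
conn.execute("CREATE INDEX idx_addresses_zip ON addresses (zipcode)")

# Made-up sample rows, for illustration only.
conn.executemany(
    "INSERT INTO addresses VALUES (?, ?, ?, ?, ?)",
    [
        ("100", "MAIN ST", "HELENA", "MT", "59601"),
        ("200", "OAK AVE", "HELENA", "MT", "59601"),
        ("300", "PINE RD", "BOISE", "ID", "83702"),
    ],
)


def lookup_by_zip(connection, zipcode):
    """Return (number, street, city, state) rows for one postal code."""
    cur = connection.execute(
        "SELECT number, street, city, state FROM addresses "
        "WHERE zipcode = ? ORDER BY street",
        (zipcode,),
    )
    return cur.fetchall()


rows = lookup_by_zip(conn, "59601")

# EXPLAIN QUERY PLAN reports whether the index is used for this lookup;
# the human-readable detail is the last column of each plan row.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM addresses WHERE zipcode = ?",
    ("59601",),
).fetchall()
uses_index = any("idx_addresses_zip" in row[-1] for row in plan)
```

Against the real file, the same pattern would apply after replacing ":memory:" with the path to the downloaded database and adjusting the query to its actual schema.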

Common Corpus https://huggingface.co/datasets/PleIAs/common_corpus

Tue Nov 19 14:43:09 2024

Common Corpus is the largest open and permissively licensed text dataset, comprising over 2 trillion tokens (2,003,039,184,047 tokens). It is a diverse dataset, consisting of books, newspapers, scientific articles, government and legal documents, code, and more.

Announcing Common Corpus: A 2+ trillion token dataset that's fully open and accessible @ moz://a.

LAION-5B https://laion.ai/blog/laion-5b/

Mon Jun 26 08:49:38 2023

A NEW ERA OF OPEN LARGE-SCALE MULTI-MODAL DATASETS | LAION.

We present a dataset of 5.85 billion CLIP-filtered image-text pairs, 14x bigger than LAION-400M, previously the biggest openly accessible image-text dataset in the world.

RedPajama-Data https://github.com/togethercomputer/RedPajama-Data

Thu Jun 1 14:29:31 2023

The RedPajama-Data repository contains code for preparing large datasets for training large language models.

RedPajama-Data: An Open Source Recipe to Reproduce LLaMA training dataset.

Kaggle https://www.kaggle.com/

Wed Dec 1 10:52:27 2021

Your Machine Learning and Data Science Community.

Inside Kaggle you’ll find all the code & data you need to do your data science work. Use over 50,000 public datasets and 400,000 public notebooks to conquer any analysis in no time.

Keshif https://keshif.me/

Wed Mar 9 12:47:04 2016

Keshif is a web-based tool that lets you browse and understand datasets easily.

Keshif @ GitHub
