<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <title>dataset</title>
    <link rel="self" type="application/atom+xml" href="https://links.biapy.com/guest/tags/1169/feed"/>
    <updated>2026-04-25T11:57:43+00:00</updated>
    <id>https://links.biapy.com/guest/tags/1169/feed</id>
            <entry>
            <id>https://links.biapy.com/links/12019</id>
            <title type="text"><![CDATA[CredData (Credential Dataset)]]></title>
            <link rel="alternate" href="https://github.com/Samsung/CredData" />
            <link rel="via" type="application/atom+xml" href="https://links.biapy.com/links/12019"/>
            <author>
                <name><![CDATA[Biapy]]></name>
            </author>
            <summary type="text">
                <![CDATA[CredData (Credential Dataset) is a set of files including credentials in open source projects. CredData includes suspicious lines with manual review results and more information such as credential types for each suspicious line.

CredData can be used to develop new tools or improve existing ones. Furthermore, using CredData's benchmark results, users can choose a suitable tool among open source credential scanning tools according to their use case. We sincerely hope that CredData will help minimize credential leaks.

Related contents:

- [\#66 @ Erreur 403 :fr:](https://newsletter.erreur403.fr/p/erreur-403-66).]]>
            </summary>
            <updated>2026-03-05T12:10:14+00:00</updated>
        </entry>
            <entry>
            <id>https://links.biapy.com/links/11775</id>
            <title type="text"><![CDATA[Parquet]]></title>
            <link rel="alternate" href="https://parquet.apache.org/" />
            <link rel="via" type="application/atom+xml" href="https://links.biapy.com/links/11775"/>
            <author>
                <name><![CDATA[Biapy]]></name>
            </author>
            <summary type="text">
                <![CDATA[Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides high performance compression and encoding schemes to handle complex data in bulk and is supported in many programming languages and analytics tools.

- [Apache Parquet Format @ GitHub](https://github.com/apache/parquet-format).

Related contents:

- [Building Your Modern Data Analytics Stack with Python, Parquet, and DuckDB @ KD nuggets](https://www.kdnuggets.com/building-your-modern-data-analytics-stack-with-python-parquet-and-duckdb).]]>
            </summary>
            <updated>2026-02-11T10:19:02+00:00</updated>
        </entry>
            <entry>
            <id>https://links.biapy.com/links/10939</id>
            <title type="text"><![CDATA[ecosyste.ms]]></title>
            <link rel="alternate" href="https://ecosyste.ms/" />
            <link rel="via" type="application/atom+xml" href="https://links.biapy.com/links/10939"/>
            <author>
                <name><![CDATA[Biapy]]></name>
            </author>
            <summary type="text">
                <![CDATA[Open source intelligence for your project.
Tools and datasets to support, sustain, and secure critical digital infrastructure.

The world's most comprehensive and accurate dataset about open source production and use, for free.

ecosyste.ms has indexed over six billion events across nearly two thousand sources to create the world's most comprehensive and accurate dataset about open source production and use.

Related contents:

- [Episode \#665: The world of open source metadata @ Changelog Interviews](https://changelog.com/podcast/665).]]>
            </summary>
            <updated>2025-11-12T12:50:21+00:00</updated>
        </entry>
            <entry>
            <id>https://links.biapy.com/links/1434</id>
            <title type="text"><![CDATA[Virtual Cell Atlas]]></title>
            <link rel="alternate" href="https://arcinstitute.org/tools/virtualcellatlas" />
            <link rel="via" type="application/atom+xml" href="https://links.biapy.com/links/1434"/>
            <author>
                <name><![CDATA[Biapy]]></name>
            </author>
            <summary type="text">
                <![CDATA[The Arc Virtual Cell Atlas is a collection of high quality, curated, open datasets assembled for the purpose of accelerating the creation of virtual cell models. The Atlas includes both observational and perturbational data from over 300 million cells (and growing).

- [Virtual Cell Atlas @ GitHub](https://github.com/ArcInstitute/arc-virtual-cell-atlas/).]]>
            </summary>
            <updated>2025-08-28T19:55:19+00:00</updated>
        </entry>
            <entry>
            <id>https://links.biapy.com/links/1836</id>
            <title type="text"><![CDATA[FineWeb]]></title>
            <link rel="alternate" href="https://huggingface.co/datasets/HuggingFaceFW/fineweb" />
            <link rel="via" type="application/atom+xml" href="https://links.biapy.com/links/1836"/>
            <author>
                <name><![CDATA[Biapy]]></name>
            </author>
            <summary type="text">
                <![CDATA[15 trillion tokens of the finest data the 🌐 web has to offer.

The 🍷 FineWeb dataset consists of more than 15T tokens of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large scale data processing library.

🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release of the full dataset under the ODC-By 1.0 license. However, by carefully adding filtering steps, we managed to push the performance of 🍷 FineWeb well above that of the original 🦅 RefinedWeb, and models trained on our dataset also outperform models trained on other commonly used high quality web datasets (like C4, Dolma-v1.6, The Pile, SlimPajama, RedPajama2) on our aggregate group of benchmark tasks.

Related contents:

- [🍷 FineWeb: decanting the web for the finest text data at scale @ Hugging Face](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1).
- [S5E7 - Sommes-nous à l'aube d'un effondrement des IA ? @ Underscore_'s acast :fr:](https://shows.acast.com/micode-underscore/episodes/s5e7-sommes-nous-a-laube-dun-effondrement-des-ia).]]>
            </summary>
            <updated>2025-08-28T21:01:57+00:00</updated>
        </entry>
            <entry>
            <id>https://links.biapy.com/links/2454</id>
            <title type="text"><![CDATA[Address Database]]></title>
            <link rel="alternate" href="https://netsyms.com/gis/addresses" />
            <link rel="via" type="application/atom+xml" href="https://links.biapy.com/links/2454"/>
            <author>
                <name><![CDATA[Biapy]]></name>
            </author>
            <summary type="text">
                <![CDATA[Self-hosted street address database

A SQLite3 database file with over 150 million U.S. and Canada address records. Indexed for fast queries, even on fairly slow hardware.]]>
            </summary>
            <updated>2025-08-28T22:45:05+00:00</updated>
        </entry>
            <entry>
            <id>https://links.biapy.com/links/2812</id>
            <title type="text"><![CDATA[Common Corpus]]></title>
            <link rel="alternate" href="https://huggingface.co/datasets/PleIAs/common_corpus" />
            <link rel="via" type="application/atom+xml" href="https://links.biapy.com/links/2812"/>
            <author>
                <name><![CDATA[Biapy]]></name>
            </author>
            <summary type="text">
                <![CDATA[Common Corpus is the largest open and permissively licensed text dataset, comprising over 2 trillion tokens (2,003,039,184,047 tokens). It is a diverse dataset, consisting of books, newspapers, scientific articles, government and legal documents, code, and more.

- [Announcing Common Corpus: A 2+ trillion token dataset that's fully open and accessible @ moz://a](https://future.mozilla.org/builders/news_insights/announcing-common-corpus-a-2-trillion-token-dataset-thats-fully-open-and-accessible/).]]>
            </summary>
            <updated>2025-08-28T23:45:30+00:00</updated>
        </entry>
            <entry>
            <id>https://links.biapy.com/links/5353</id>
            <title type="text"><![CDATA[LAION-5B]]></title>
            <link rel="alternate" href="https://laion.ai/blog/laion-5b/" />
            <link rel="via" type="application/atom+xml" href="https://links.biapy.com/links/5353"/>
            <author>
                <name><![CDATA[Biapy]]></name>
            </author>
            <summary type="text">
                <![CDATA[A NEW ERA OF OPEN LARGE-SCALE MULTI-MODAL DATASETS | LAION.

We present a dataset of 5.85 billion CLIP-filtered image-text pairs, 14x bigger than LAION-400M, previously the biggest openly accessible image-text dataset in the world.]]>
            </summary>
            <updated>2025-08-29T06:48:47+00:00</updated>
        </entry>
            <entry>
            <id>https://links.biapy.com/links/5423</id>
            <title type="text"><![CDATA[RedPajama-Data]]></title>
            <link rel="alternate" href="https://github.com/togethercomputer/RedPajama-Data" />
            <link rel="via" type="application/atom+xml" href="https://links.biapy.com/links/5423"/>
            <author>
                <name><![CDATA[Biapy]]></name>
            </author>
            <summary type="text">
                <![CDATA[The RedPajama-Data repository contains code for preparing large datasets for training large language models.
RedPajama-Data: An Open Source Recipe to Reproduce LLaMA training dataset.]]>
            </summary>
            <updated>2025-08-29T07:00:52+00:00</updated>
        </entry>
            <entry>
            <id>https://links.biapy.com/links/7247</id>
            <title type="text"><![CDATA[Kaggle]]></title>
            <link rel="alternate" href="https://www.kaggle.com/" />
            <link rel="via" type="application/atom+xml" href="https://links.biapy.com/links/7247"/>
            <author>
                <name><![CDATA[Biapy]]></name>
            </author>
            <summary type="text">
                <![CDATA[Your Machine Learning and Data Science Community.

Inside Kaggle you’ll find all the code & data you need to do your data science work. Use over 50,000 public datasets and 400,000 public notebooks to conquer any analysis in no time.]]>
            </summary>
            <updated>2025-08-29T12:06:32+00:00</updated>
        </entry>
            <entry>
            <id>https://links.biapy.com/links/7693</id>
            <title type="text"><![CDATA[Keshif]]></title>
            <link rel="alternate" href="https://keshif.me/" />
            <link rel="via" type="application/atom+xml" href="https://links.biapy.com/links/7693"/>
            <author>
                <name><![CDATA[Biapy]]></name>
            </author>
            <summary type="text">
                <![CDATA[Keshif is a web-based tool that lets you browse and understand datasets easily.

- [Keshif @ GitHub](https://github.com/adilyalcin/keshif).]]>
            </summary>
            <updated>2025-08-29T13:19:33+00:00</updated>
        </entry>
    </feed>
