data-science
High performance array computing.
Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more. JAX provides a familiar NumPy-style API for ease of adoption by researchers and engineers.
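The three transformations named above compose freely. A minimal sketch, assuming `jax` is installed (function and variable names here are illustrative):

```python
import jax
import jax.numpy as jnp

# A plain NumPy-style function...
def loss(w, x):
    return jnp.sum((w * x - 1.0) ** 2)

# ...differentiated, JIT-compiled, and vectorized by composing transformations.
grad_loss = jax.jit(jax.grad(loss))          # d(loss)/dw, compiled
batched = jax.vmap(loss, in_axes=(None, 0))  # map loss over a batch of x

w = jnp.array([0.5, 0.5])
x = jnp.array([1.0, 2.0])
g = grad_loss(w, x)   # gradient with respect to w
b = batched(w, x)     # one loss value per element of x
```

Because the transformations are just higher-order functions, `jax.jit(jax.vmap(jax.grad(loss)))` is equally valid.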
Related contents:
Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in two lines of code for 100x faster random access, vector indexing, and data versioning. Compatible with pandas, DuckDB, Polars, PyArrow, and PyTorch, with more integrations coming.
Lance is a modern columnar data format optimized for machine learning and AI applications. It efficiently handles diverse multimodal data types while providing high-performance querying and versioning capabilities.
Related contents:
S3GD is a highly optimized, PyTorch-compatible Triton implementation of the Smoothed SignSGD optimizer, meant for reinforcement learning post-training.
Related contents:
Python Data Analysis Library.
pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool. It is a flexible and powerful data analysis/manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more.
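The labeled-data-structure idea can be shown in a few lines (column and value names here are made up for illustration):

```python
import pandas as pd

# A DataFrame: labeled columns, similar to an R data.frame.
df = pd.DataFrame(
    {"city": ["Oslo", "Oslo", "Bergen"], "temp": [3.0, 5.0, 7.0]}
)

# Split-apply-combine: mean temperature per city.
means = df.groupby("city")["temp"].mean()
```

The result is itself a labeled structure (a Series indexed by city), so further operations stay label-aware.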
Related contents:
The Data-Oriented Language for Sane Software Development.
Odin is a general-purpose programming language with distinct typing, built for high performance, modern systems, and data-oriented design. The Odin Programming Language: the C alternative for the joy of programming.
Related contents:
Scalable, Interactive Visualization. Compute & interactively visualize large embeddings.
Embedding Atlas is a tool that provides interactive visualizations for large embeddings. It allows you to visualize, cross-filter, and search embeddings and metadata.
Context Engineering for AI Systems.
TensorLake transforms unstructured documents into AI-ready data through Document Ingestion APIs and enables building scalable data processing pipelines with a serverless workflow runtime. The platform handles the complexity of document parsing, data extraction, and workflow orchestration on fully managed infrastructure including GPU acceleration.
Positron, a next-generation data science IDE.
- A free, next-generation data science IDE built by Posit PBC.
- An extensible, polyglot tool for writing code and exploring data.
- A familiar environment for reproducible authoring and publishing.
Related contents:
Open-source data multitool. Data exploration at your fingertips.
VisiData is an interactive multitool for tabular data. It combines the clarity of a spreadsheet, the efficiency of the terminal, and the power of Python into a lightweight utility that can handle millions of rows with ease.
Generate realistic datasets for demos, learning, and dashboards. Instantly preview data, export as CSV or SQL, and explore with Metabase.
NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.
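A few of the routine families listed above, on one small ndarray (values chosen only for illustration):

```python
import numpy as np

# The ndarray: a typed, multidimensional array with vectorized operations.
a = np.arange(6).reshape(2, 3)    # [[0, 1, 2], [3, 4, 5]]

row_sums = a.sum(axis=1)          # reduction along an axis
scaled = a * 10                   # broadcasting a scalar across the array
sorted_cols = np.sort(a, axis=0)  # sorting along an axis
```

Each operation runs in compiled code over the whole array, which is what makes these routines fast compared to Python loops.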
Related contents:
Look At Your Data 👀.
Data quality is the most important factor in machine learning success. Hyperparam brings exploration and analysis of massive text datasets to the browser.
Extract, Transform, Index Data. Easy and Fresh. CocoIndex is the world's first open-source engine that supports both custom transformation logic and incremental updates specialized for data indexing.
With CocoIndex, users declare the transformation; CocoIndex creates and maintains the index and keeps the derived index up to date as the sources change, with minimal recomputation.
A self-contained, lightweight and OOB research platform for modern ML.
Boson is a lightweight, fully containerized, and feature-rich machine learning research platform. It centralizes essential tools to help teams keep projects lean, organized, and reproducible, while reducing overhead and boosting productivity. Think Databricks/SageMaker, but local and free.
Boson enables engineers and researchers to iterate faster without getting bogged down by infrastructure or tooling complexity.
An implementation of the LOESS / LOWESS algorithm in Rust.
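The LOESS/LOWESS idea itself is easy to sketch in pure Python; this is an illustrative simplification (local linear fit with tricube weights), not the Rust crate's implementation:

```python
import math

def lowess(xs, ys, frac=0.5):
    """For each point, fit a weighted linear regression over the
    nearest `frac` fraction of points, using tricube weights that
    fall to zero at the bandwidth. Returns the fitted values."""
    n = len(xs)
    k = max(2, int(math.ceil(frac * n)))
    fitted = []
    for x0 in xs:
        # Bandwidth: distance to the k-th nearest neighbour of x0.
        dists = sorted(abs(x - x0) for x in xs)
        h = dists[k - 1] or 1e-12
        # Tricube weights: w = (1 - u^3)^3 for u = |x - x0| / h, clipped at 1.
        pts = [(x, y, (1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3)
               for x, y in zip(xs, ys)]
        # Weighted least squares for y = a + b*x.
        sw = sum(w for _, _, w in pts)
        swx = sum(w * x for x, _, w in pts)
        swy = sum(w * y for _, y, w in pts)
        swxx = sum(w * x * x for x, _, w in pts)
        swxy = sum(w * x * y for x, y, w in pts)
        denom = sw * swxx - swx * swx
        if abs(denom) < 1e-12:
            fitted.append(swy / sw)  # degenerate: weighted mean
        else:
            b = (sw * swxy - swx * swy) / denom
            a = (swy - b * swx) / sw
            fitted.append(a + b * x0)
    return fitted

# On exactly linear data the local fits reproduce the line.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
fit = lowess(xs, ys, frac=0.5)
```

Production implementations add robustness iterations (downweighting outliers by residual), which this sketch omits.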
As data volumes continue to grow in fields like machine learning and scientific computing, optimizing fundamental operations like matrix multiplication becomes increasingly critical. Blosc2's chunk-based approach offers a new path to efficiency in these scenarios.
Blosc is a high-performance compressor optimized for binary data (e.g., floating-point numbers, integers, and booleans, although it can handle string data too). It has been designed to transmit data to the processor cache faster than the traditional, non-compressed, direct memory fetch approach via a memcpy() call. Blosc's main goal is not just to reduce the size of large datasets on-disk or in-memory, but also to accelerate memory-bound computations.
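The chunking idea can be illustrated with the standard library alone; this uses zlib purely for demonstration, since Blosc's real codecs (with byte shuffling and SIMD) are far faster:

```python
import struct
import zlib

# Repetitive binary data: 100,000 doubles cycling through 0..99.
values = [float(i % 100) for i in range(100_000)]
raw = struct.pack(f"{len(values)}d", *values)

# Compress in cache-sized chunks rather than one monolithic block,
# so each chunk can be decompressed independently where it is needed.
CHUNK = 64 * 1024
chunks = [zlib.compress(raw[i:i + CHUNK]) for i in range(0, len(raw), CHUNK)]

compressed_size = sum(len(c) for c in chunks)
ratio = len(raw) / compressed_size  # well above 1 for repetitive data

# Decompressing chunk-by-chunk reconstructs the original buffer.
restored = b"".join(zlib.decompress(c) for c in chunks)
```

The payoff of the chunked layout is that a computation can stream one small chunk at a time through the cache instead of pulling the whole uncompressed buffer from main memory.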
Related contents:
🚀 Async-Powered Pandas.
Lightweight Pandas monkey-patch that adds async support to map, apply, applymap, aggregate, and transform, enabling seamless handling of async functions with controlled max_parallel execution.
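The core trick, bounded concurrent execution of an async function over a sequence, can be sketched with stdlib asyncio alone. This is a simplification of what such a patch must do internally, with illustrative names, not the library's actual code:

```python
import asyncio

async def async_map(fn, items, max_parallel=4):
    """Apply async `fn` to each item, running at most
    `max_parallel` coroutines concurrently."""
    sem = asyncio.Semaphore(max_parallel)

    async def bounded(item):
        async with sem:
            return await fn(item)

    # gather() preserves input order regardless of completion order.
    return await asyncio.gather(*(bounded(i) for i in items))

async def double(x):
    await asyncio.sleep(0)  # stand-in for real async I/O
    return x * 2

result = asyncio.run(async_map(double, [1, 2, 3]))
```

Grafting this onto `Series.map`/`DataFrame.apply` is then mostly a matter of detecting coroutine functions and routing them through such a runner.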
The Arc Virtual Cell Atlas is a collection of high quality, curated, open datasets assembled for the purpose of accelerating the creation of virtual cell models. The Atlas includes both observational and perturbational data from over 300 million cells (and growing).
Easy web apps for data science without the compromises. No web development skills required.
Related contents:
Reproducible Data Science Environments with Nix.
{rix} is an R package that leverages Nix, a package manager focused on reproducible builds. With Nix, you can create project-specific environments with a custom version of R, its packages, and all system dependencies (e.g., GDAL). Nix ensures full reproducibility, which is crucial for research and development projects.
Related contents:
The R Project for Statistical Computing.
R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS.
Related contents:
🪄 Create rich visualizations with AI
Data Formulator is an application from Microsoft Research that uses large language models to transform data, expediting the practice of data visualization.
Data Formulator is an AI-powered tool for analysts to iteratively create rich visualizations. Unlike most chat-based AI tools, where users must describe everything in natural language, Data Formulator combines user interface (UI) interactions and natural language (NL) inputs for easier interaction. This blended approach makes it easier for users to describe their chart designs while delegating data transformation to AI.
A Free 9-Week Course on Data Engineering Fundamentals.
Master the fundamentals of data engineering by building an end-to-end data pipeline from scratch. Gain hands-on experience with industry-standard tools and best practices.
For better or for worse, LLMs are here to stay. We all read content that they produce online, most of us interact with LLM chatbots, and many of us use them to produce content of our own.
In a series of five- to ten-minute lessons, we will explain what these machines are, how they work, and how to thrive in a world where they are everywhere.
You will learn when these systems can save you a lot of time and effort. You will learn when they are likely to steer you wrong. And you will discover how to see through the hype to tell the difference.
AI by Hand ✍️ Exercises in Excel
Research and data to make progress against the world’s largest problems.
To make progress against the pressing problems the world faces, we need to be informed by the best research and data.
Our World in Data makes this knowledge accessible and understandable, to empower those working to build a better world.
A faster way to build and share data apps. Streamlit turns data scripts into shareable web apps in minutes. All in pure Python. No front‑end experience required.
Streamlit lets you transform Python scripts into interactive web apps in minutes, instead of weeks. Build dashboards, generate reports, or create chat apps. Once you’ve created an app, you can use our Community Cloud platform to deploy, manage, and share your app.
Multi-modal modular data ingestion and retrieval.
DataBridge is an open source library for natural language search and management of multi-modal data. Get started by installing databridge now!
DataBridge is a powerful document processing and retrieval system designed for building intelligent document-based applications. It provides a robust foundation for semantic search, document processing, and AI-powered document interactions.
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrated remarkable performance on reasoning. With RL, DeepSeek-R1-Zero naturally emerged with numerous powerful and interesting reasoning behaviors. However, DeepSeek-R1-Zero encounters challenges such as endless repetition, poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates cold-start data before RL.
Related contents:
Your data tell a story. Explore. Visualize. Model. Make a difference. Better insight starts with Stata.
Stata is statistical software for data science.
Open-source document processing platform built for knowledge workers.
Rowfill helps extract, analyze, and process data from complex documents, images, PDFs and more with advanced AI capabilities.
Insights, Unlocked in Real Time.
Apache Pinot™: The real-time analytics open source platform for lightning-fast insights, effortless scaling, and cost-effective data-driven decisions.
Related contents:
We wrote this glossary to solve a problem we ran into working with GPUs here at Modal: the documentation is fragmented, making it difficult to connect concepts at different levels of the stack, like Streaming Multiprocessor Architecture, Compute Capability, and nvcc compiler flags.
Everything to Markdown.
E2M converts various file types (doc, docx, epub, html, htm, url, pdf, ppt, pptx, mp3, m4a) into Markdown. It’s easy to install, with dedicated parsers and converters, supporting custom configs. E2M offers an all-in-one, flexible, and open-source solution.
Fetch an entire site and save it as a text file (to be used with AI models).
A Supercharged IDE for Data Science.
Zasper is an IDE designed from the ground up to support massive concurrency. It provides a minimal memory footprint, exceptional speed, and the ability to handle numerous concurrent connections.
It's perfectly suited for running REPL-style data applications, e.g. Jupyter notebooks.
SQL-like Querying for Various Data Sources.
Musoq lets you use SQL-like queries on files, directories, images and other data sources without a database. It's designed to ease life for developers.
Musoq is a tool that lets developers and IT professionals query different data sources using SQL-like syntax, without needing to import data into a database first. It’s designed for scenarios where you need to analyze files, directories, archives, or other data sources quickly and efficiently.
Data Runs Better on SDF. Transform Data Better with SDF. SDF is the fastest way to build a scalable, reliable, and optimized data warehouse.
SDF is a developer platform for data that scales SQL understanding across an organization, empowering all data teams to unlock the full potential of their data.
SDF is a multi-dialect SQL compiler, transformation framework, and analytical database engine. It natively compiles SQL dialects, like Snowflake, and connects to their corresponding data warehouses to materialize models.
Convert Anything into Structured Actionable Data.
Ingest, parse, and optimize any data format ➡️ from documents to multimedia ➡️ for enhanced compatibility with GenAI frameworks.
OmniParse is a platform that ingests and parses any unstructured data into structured, actionable data optimized for GenAI (LLM) applications. Whether you are working with documents, tables, images, videos, audio files, or web pages, OmniParse prepares your data to be clean, structured, and ready for AI applications such as RAG, fine-tuning, and more.
OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference.
OpenVINO is an open-source toolkit for optimizing and deploying deep learning models from cloud to edge. It accelerates deep learning inference across various use cases, such as generative AI, video, audio, and language with models from popular frameworks like PyTorch, TensorFlow, ONNX, and more. Convert and optimize models, and deploy across a mix of Intel® hardware and environments, on-premises and on-device, in the browser or in the cloud.
Solve puzzles. Learn CUDA.
GPU architectures are critical to machine learning, and seem to be becoming even more important every day. However, you can be an expert in machine learning without ever touching GPU code, and it is hard to gain intuition when working only through abstractions.
This notebook is an attempt to teach beginner GPU programming in a completely interactive fashion. Instead of presenting concepts as text, it throws you right into coding and building GPU kernels. The exercises use Numba, which maps Python code directly to CUDA kernels. It looks like Python but is basically identical to writing low-level CUDA code. In a few hours, I think you can go from the basics to understanding the real algorithms that power 99% of deep learning today. If you do want to read the manual, it is here:
Solve Puzzles. Learn Metal 🤘.
Port of srush/GPU-Puzzles to Metal using MLX custom kernels.
GPUs are crucial in machine learning because they can process data on a massively parallel scale. While it's possible to become an expert in machine learning without writing any GPU code, building intuition is challenging when you're only working through layers of abstraction. Additionally, as models grow in complexity, the need for developers to write efficient, high-performance kernels becomes increasingly important to leverage the power of modern hardware.
The universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics.
Apache Arrow defines a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs. The Arrow memory format also supports zero-copy reads for lightning-fast data access without serialization overhead.
Related contents:
The open table format for analytic datasets.
Iceberg is a high-performance format for huge analytic tables. Iceberg brings the reliability and simplicity of SQL tables to big data, while making it possible for engines like Spark, Trino, Flink, Presto, Hive and Impala to safely work with the same tables, at the same time.
Related contents:
- PyIceberg: Current State and Roadmap @ Ju Data Engineering Newsletter.
- The Equality Delete Problem in Apache Iceberg @ Data Engineer Things's Medium.
- How I Saved Millions by Restructuring Iceberg Metadata @ Gautham Gondi's Medium.
- High Throughput Ingestion with Iceberg @ Adobe Tech Blog's Medium.
- Scaling Iceberg Writes with Confidence: A Conflict-Free Distributed Architecture for Fast, Concurrent, Consistent Append-Only Writes @ e6data.
High-Performance Klong array language in Python.
KlongPy is a Python adaptation of the Klong array language, known for its high-performance vectorized operations that leverage the power of NumPy. Embracing a "batteries included" philosophy, KlongPy combines built-in modules with Python's expansive ecosystem, facilitating rapid application development with Klong's succinct syntax.
DataFrames for the new era.
Dataframes powered by a multithreaded, vectorized query engine, written in Rust.
Polars is a DataFrame interface on top of an OLAP Query Engine implemented in Rust using Apache Arrow Columnar Format as the memory model.
Polars is an open-source library for data manipulation, known for being one of the fastest data processing solutions on a single machine. It features a well-structured, typed API that is both expressive and easy to use.
This repo has all the resources you need to become an amazing data engineer!
Data and AI reliability. Delivered.
Data breaks. Monte Carlo ensures your team is the first to know and solve with end-to-end data observability.
The Databricks Data Intelligence Platform. Databricks brings AI to your data to help you bring AI to the world.
Related contents:
Conversational Data Analysis.
PandasAI is a Python platform that makes it easy to ask questions of your data in natural language. It helps non-technical users interact with their data in a more natural way, and it helps technical users save time and effort when working with data.
PandasAI is a Python library that integrates generative artificial intelligence capabilities into pandas, making dataframes conversational. Chat with your database (SQL, CSV, pandas, Polars, MongoDB, NoSQL, etc.). PandasAI makes data analysis conversational using LLMs (GPT-3.5/4, Anthropic, VertexAI) and RAG.
Cross-Language Serialization for Relational Algebra. A cross platform way to express data transformation, relational algebra, standardized record expression and plans.
Substrait is a format for describing compute operations on structured data. It is designed for interoperability across different languages and systems.
Cloud-native orchestration of data pipelines. Ship data pipelines with extraordinary velocity. An orchestration platform for the development, production, and observation of data assets.
The cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability.
Dagster is a cloud-native data pipeline orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability.
It is designed for developing and maintaining data assets, such as tables, data sets, machine learning models, and reports.
Open and unified metadata platform for data discovery, observability, and governance.
A single place for all your data and all your data practitioners to build and manage high quality data assets at scale. Built by Collate and the founders of Apache Hadoop, Apache Atlas, and Uber Databook.
OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration.
OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration. It is one of the fastest-growing open-source projects with a vibrant community and adoption by a diverse set of companies in a variety of industry verticals. Based on Open Metadata Standards and APIs, supporting connectors to a wide range of data services, OpenMetadata enables end-to-end metadata management, giving you the freedom to unlock the value of your data assets.
Department of Education (DOE) for New South Wales (AUS) data stack in a box. With the push of one button you can have your own data stack up and running in 5 mins! 🏎️.
Docling parses documents and exports them to the desired format with ease and speed. 🗂️ Reads popular document formats (PDF, DOCX, PPTX, Images, HTML, AsciiDoc, Markdown) and exports to Markdown and JSON.
Related contents:
AI Data Management at Scale - Curate, Enrich, and Version Datasets.
DataChain is a modern Pythonic data-frame library designed for artificial intelligence. It is made to organize your unstructured data into datasets and wrangle it at scale on your local machine. DataChain does not abstract or hide the AI models and API calls, but helps to integrate them into the post-modern data stack.
DataChain enables multimodal API calls and local AI inferences to run in parallel over many samples as chained operations. The resulting datasets can be saved, versioned, and sent directly to PyTorch and TensorFlow for training. DataChain can persist features of Python objects returned by AI models, and enables vectorized analytical operations over them.
Run SQL queries on CSV files directly in your browser. No data leaves your browser. Fast, private, and easy to use.
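The same workflow, SQL over a CSV with no database server, can be reproduced locally with Python's standard library alone; this is an illustration of the idea, unrelated to the browser tool's implementation (table and column names are made up):

```python
import csv
import io
import sqlite3

csv_text = "name,price\napple,3\nbanana,2\ncherry,5\n"

conn = sqlite3.connect(":memory:")  # nothing touches disk
conn.execute("CREATE TABLE fruit (name TEXT, price REAL)")

# csv.DictReader yields dicts keyed by header, which bind
# directly to sqlite3's named parameters.
rows = list(csv.DictReader(io.StringIO(csv_text)))
conn.executemany("INSERT INTO fruit VALUES (:name, :price)", rows)

total = conn.execute("SELECT SUM(price) FROM fruit").fetchone()[0]
```

Browser tools typically do the equivalent with SQLite or DuckDB compiled to WebAssembly, which is how the data never leaves the page.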
A lightweight next-gen data explorer for Postgres, MySQL, SQLite, MongoDB, Redis, MariaDB & Elasticsearch, with a chat interface.
The powerful data exploration & web app framework for Python.
Panel is an open-source Python library designed to streamline the development of robust tools, dashboards, and complex applications entirely within Python. With a comprehensive philosophy, Panel integrates seamlessly with the PyData ecosystem, offering powerful, interactive data tables, visualizations, and much more, to unlock, visualize, share, and collaborate on your data for efficient workflows.