Solve Puzzles. Learn Metal 🤘.
Port of srush/GPU-Puzzles to Metal using MLX Custom Kernels.
GPUs are crucial in machine learning because they can process data on a massively parallel scale. While it's possible to become an expert in machine learning without writing any GPU code, building intuition is challenging when you're only working through layers of abstraction. Additionally, as models grow in complexity, the need for developers to write efficient, high-performance kernels becomes increasingly important to leverage the power of modern hardware.
The universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics.
Apache Arrow defines a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs. The Arrow memory format also supports zero-copy reads for lightning-fast data access without serialization overhead.
The open table format for analytic datasets.
Iceberg is a high-performance format for huge analytic tables. Iceberg brings the reliability and simplicity of SQL tables to big data, while making it possible for engines like Spark, Trino, Flink, Presto, Hive and Impala to safely work with the same tables, at the same time.
High-Performance Klong array language in Python.
KlongPy is a Python adaptation of the Klong array language, known for its high-performance vectorized operations that leverage the power of NumPy. Embracing a "batteries included" philosophy, KlongPy combines built-in modules with Python's expansive ecosystem, facilitating rapid application development with Klong's succinct syntax.
DataFrames for the new era.
DataFrames powered by a multithreaded, vectorized query engine, written in Rust.
Polars is a DataFrame interface on top of an OLAP Query Engine implemented in Rust using Apache Arrow Columnar Format as the memory model.
Polars is an open-source library for data manipulation, known for being one of the fastest data processing solutions on a single machine. It features a well-structured, typed API that is both expressive and easy to use.
Data and AI reliability. Delivered.
Data breaks. Monte Carlo ensures your team is the first to know and solve with end-to-end data observability.
Conversational Data Analysis.
PandasAI is a Python platform that makes it easy to ask questions of your data in natural language. It helps non-technical users interact with their data more naturally, and it helps technical users save time and effort when working with data.
PandasAI is a Python library that integrates generative artificial intelligence capabilities into pandas, making dataframes conversational.
Chat with your database (SQL, CSV, pandas, Polars, MongoDB, NoSQL, etc.). PandasAI makes data analysis conversational using LLMs (GPT 3.5 / 4, Anthropic, VertexAI) and RAG.
Cross-Language Serialization for Relational Algebra.
A cross-platform way to express data transformations, relational algebra, standardized record expressions, and plans.
Substrait is a format for describing compute operations on structured data. It is designed for interoperability across different languages and systems.
Cloud-native orchestration of data pipelines. Ship data pipelines with extraordinary velocity.
An orchestration platform for the development, production, and observation of data assets.
Dagster is a cloud-native data pipeline orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability.
It is designed for developing and maintaining data assets, such as tables, data sets, machine learning models, and reports.
Open and unified metadata platform for data discovery, observability, and governance.
A single place for all your data and all your data practitioners to build and manage high quality data assets at scale. Built by Collate and the founders of Apache Hadoop, Apache Atlas, and Uber Databook.
OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration. It is one of the fastest-growing open-source projects with a vibrant community and adoption by a diverse set of companies in a variety of industry verticals. Based on Open Metadata Standards and APIs, supporting connectors to a wide range of data services, OpenMetadata enables end-to-end metadata management, giving you the freedom to unlock the value of your data assets.
Department of Education (DOE) for New South Wales (AUS) data stack in a box.
With the push of one button, you can have your own data stack up and running in 5 minutes! 🏎️
Docling parses documents and exports them to the desired format with ease and speed.
🗂️ Reads popular document formats (PDF, DOCX, PPTX, Images, HTML, AsciiDoc, Markdown) and exports to Markdown and JSON.
AI Data Management at Scale - Curate, Enrich, and Version Datasets.
DataChain is a modern Pythonic data-frame library designed for artificial intelligence. It is made to organize your unstructured data into datasets and wrangle it at scale on your local machine. DataChain does not abstract or hide the AI models and API calls, but helps to integrate them into the postmodern data stack.
DataChain enables multimodal API calls and local AI inferences to run in parallel over many samples as chained operations. The resulting datasets can be saved, versioned, and sent directly to PyTorch and TensorFlow for training. DataChain can persist features of Python objects returned by AI models, and enables vectorized analytical operations over them.
Run SQL queries on CSV files directly in your browser. No data leaves your browser.
Fast, private, and easy to use.
A lightweight next-gen data explorer for Postgres, MySQL, SQLite, MongoDB, Redis, MariaDB & Elasticsearch, with a chat interface.
The powerful data exploration & web app framework for Python.
Panel is an open-source Python library designed to streamline the development of robust tools, dashboards, and complex applications entirely within Python. With a comprehensive philosophy, Panel integrates seamlessly with the PyData ecosystem, offering powerful, interactive data tables, visualizations, and much more, to unlock, visualize, share, and collaborate on your data for efficient workflows.
Build Python Data & AI web applications.
Turns Data and AI algorithms into production-ready web applications in no time.
Taipy is designed for data scientists and machine learning engineers to build data & AI web applications.
From simple pilots to production-ready web applications in no time. No more compromise on performance, customization, and scalability.
The Data Processor for Agents.
Marly allows your agents to extract tables & text from your PDFs, PowerPoints, etc. in a structured format, making it easy for them to take subsequent actions (database call, API call, creating a chart, etc.).