This repo has all the resources you need to become an amazing data engineer!
Data and AI reliability. Delivered.
Data breaks. Monte Carlo ensures your team is the first to know and solve with end-to-end data observability.
Conversational Data Analysis.
PandasAI is a Python platform that makes it easy to ask questions to your data in natural language. It helps non-technical users to interact with their data in a more natural way, and it helps technical users to save time, and effort when working with data.
PandasAI is a Python library that integrates generative artificial intelligence capabilities into pandas, making dataframes conversational.
Chat with your database (SQL, CSV, pandas, polars, mongodb, noSQL, etc). PandasAI makes data analysis conversational using LLMs (GPT 3.5 / 4, Anthropic, VertexAI) and RAG.
Cross-Language Serialization for Relational Algebra.
A cross platform way to express data transformation, relational algebra, standardized record expression and plans.
Substrait is a format for describing compute operations on structured data. It is designed for interoperability across different languages and systems.
Cloud-native orchestration of data pipelines. Ship data pipelines with extraordinary velocity.
An orchestration platform for the development, production, and observation of data assets.
The cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability.
Dagster is a cloud-native data pipeline orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability.
It is designed for developing and maintaining data assets, such as tables, data sets, machine learning models, and reports.
Open and unified metadata platform for data discovery, observability, and governance.
A single place for all your data and all your data practitioners to build and manage high quality data assets at scale. Built by Collate and the founders of Apache Hadoop, Apache Atlas, and Uber Databook.
OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration.
OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration. It is one of the fastest-growing open-source projects with a vibrant community and adoption by a diverse set of companies in a variety of industry verticals. Based on Open Metadata Standards and APIs, supporting connectors to a wide range of data services, OpenMetadata enables end-to-end metadata management, giving you the freedom to unlock the value of your data assets.
Department of Education (DOE) for New South Wales (AUS) data stack in a box.
With the push of one button you can have your own data stack up and running in 5 mins! 🏎️.
Docling parses documents and exports them to the desired format with ease and speed.
🗂️ Reads popular document formats (PDF, DOCX, PPTX, Images, HTML, AsciiDoc, Markdown) and exports to Markdown and JSON.
AI Data Management at Scale - Curate, Enrich, and Version Datasets.
DataChain is a modern Pythonic data-frame library designed for artificial intelligence. It is made to organize your unstructured data into datasets and wrangle it at scale on your local machine. Datachain does not abstract or hide the AI models and API calls, but helps to integrate them into the postmodern data stack.
Datachain enables multimodal API calls and local AI inferences to run in parallel over many samples as chained operations. The resulting datasets can be saved, versioned, and sent directly to PyTorch and TensorFlow for training. Datachain can persist features of Python objects returned by AI models, and enables vectorized analytical operations over them.
Run SQL queries on CSV files directly in your browser. No data leaves your browser.
Fast, private, and easy to use.
A lightweight next-gen data explorer - Postgres, MySQL, SQLite, MongoDB, Redis, MariaDB & Elastic Search with Chat interface.
The powerful data exploration & web app framework for Python.
Panel is an open-source Python library designed to streamline the development of robust tools, dashboards, and complex applications entirely within Python. With a comprehensive philosophy, Panel integrates seamlessly with the PyData ecosystem, offering powerful, interactive data tables, visualizations, and much more, to unlock, visualize, share, and collaborate on your data for efficient workflows.
Build Python Data & AI web applications.
Turns Data and AI algorithms into production-ready web applications in no time.
Taipy is designed for data scientists and machine learning engineers to build data & AI web applications.
From simple pilots to production-ready web applications in no time. No more compromise on performance, customization, and scalability.
The Data Processor for Agents.
Marly allows your agents to extract tables & text from your PDFs, Powerpoints, etc in a structured format making it easy for them to take subsequent actions (database call, API call, creating a chart etc).
Use SQL for everything. Query anything with old-school cool SQL.
Anyquery is a CLI tool to run SQL queries on any data source, no matter if it's a file, an API, logs, or a local app.
See the integrations for the full extent of what you can do.
Drasi makes it easy and efficient to detect and react to changes in databases.
Drasi is a data processing platform that simplifies detecting changes in data and taking immediate action. It is a comprehensive solution that provides built-in capabilities to track system logs and change feeds for specific events, evaluate them for relevance, and automatically initiate appropriate reactions.
"The LLVM of columnar file formats". A toolkit for working with compressed Arrow on-disk, in-memory, and over-the-wire.
Vortex is a toolkit for working with compressed Apache Arrow arrays in-memory, on-disk, and over-the-wire.
Vortex is designed to be to columnar file formats what Apache DataFusion is to query engines (or, analogously, what LLVM + Clang are to compilers): a highly extensible & extremely fast framework for building a modern columnar file format, with a state-of-the-art, "batteries included" reference implementation.
The Snowflake AI Data Cloud - Mobilize Data, Apps, and AI.
Snowflake delivers ease of use, instant elasticity, and lower TCO.