data-pipeline
Semantic Data Processing. Build data processing and data analysis pipelines that leverage the power of LLMs 🧠
Semlib is a Python library for building data processing and data analysis pipelines that leverage the power of large language models (LLMs). Semlib provides, as building blocks, familiar functional programming primitives like map, reduce, sort, and filter, but with a twist: Semlib's implementations of these operations are programmed with natural language descriptions rather than code. Under the hood, Semlib handles complexities such as prompting, parsing, concurrency control, caching, and cost tracking.
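The natural-language twist is easiest to see in code. The sketch below is illustrative only: `Session`, `filter`, and `map` here are assumptions about the style of API described above, not Semlib's documented interface; check the project's README for the real names and signatures.

```python
import asyncio

from semlib import Session  # assumed entry point

async def main():
    session = Session()  # assumed to encapsulate prompting, caching, concurrency
    reviews = [
        "Great battery life, charges fast",
        "Screen cracked after a week",
        "Arrived two days late but works fine",
    ]
    # Hypothetical semantic primitives: the predicate and the mapping are
    # plain natural language instead of code.
    defects = await session.filter(reviews, by="mentions a product defect")
    summaries = await session.map(defects, "summarize in five words: {item}")
    print(summaries)

asyncio.run(main())
```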
Unified data processing framework. Flow is a PHP-based, strongly typed data processing framework with a low memory footprint.
The most advanced data processing framework, allowing you to build scalable data processing pipelines and move data between various data sources and destinations.
A declarative, 🐻❄️-native data frame validation library.
Dataframely is a Python package to validate the schema and content of polars data frames. Its purpose is to make data pipelines more robust by ensuring that data meet expectations and more readable by adding schema information to data frame type hints.
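A minimal sketch of that declarative pattern, assuming dataframely's convention of declaring typed columns as class attributes on a `Schema` subclass; the exact column types and the `validate` signature are assumptions to verify against the package docs.

```python
import dataframely as dy
import polars as pl

class OrderSchema(dy.Schema):
    # Column declarations double as validation rules and documentation.
    order_id = dy.Int64(nullable=False)
    amount = dy.Float64(nullable=False)

df = pl.DataFrame({"order_id": [1, 2], "amount": [9.99, 25.0]})

# Fails if the frame's schema or contents violate the declared rules; the
# annotated return type makes downstream function signatures self-describing.
validated: dy.DataFrame[OrderSchema] = OrderSchema.validate(df)
```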
Magical Data Engineering Workflows.
🧙 Build, run, and manage data pipelines for integrating and transforming data.
Mage is a hybrid framework for transforming and integrating data. It combines the best of both worlds: the flexibility of notebooks with the rigor of modular code.
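The notebook-plus-modular-code hybrid looks roughly like this. A sketch assuming Mage's block decorators; in a real project each block lives in its own file scaffolded by the Mage UI, and the example data is illustrative.

```python
import pandas as pd
from mage_ai.data_preparation.decorators import data_loader, transformer

@data_loader
def load_users() -> pd.DataFrame:
    # Loader block: pull raw rows from any source.
    return pd.DataFrame({"user_id": [1, 2, 3], "spend": [10.0, 0.0, 5.5]})

@transformer
def keep_paying_users(df: pd.DataFrame) -> pd.DataFrame:
    # Transformer block: Mage feeds it the upstream block's output.
    return df[df["spend"] > 0]
```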
Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications. A minimal producer/consumer sketch follows the related links below.
Related contents:
- The New Look and Feel of Apache Kafka 4.0 @ The New Stack.
- Kafka: The End of the Beginning @ Materialized View.
- Optimizing Kafka Tracing with OpenTelemetry: Boost Visibility & Performance @ New Relic.
- Introducing Apache Kafka® 4.1.0: What’s New and How to Upgrade @ Confluent.
- Testing Kafka-based Asynchronous Workflows Using OpenTelemetry @ Signadot.
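As referenced above, a minimal round trip with the kafka-python client, assuming a broker listening on localhost:9092 with topic auto-creation enabled:

```python
from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("pipeline-events", b'{"event": "signup"}')
producer.flush()  # block until the broker acknowledges the write

consumer = KafkaConsumer(
    "pipeline-events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # start from the beginning of the topic
    consumer_timeout_ms=5000,      # stop iterating once the topic is drained
)
for message in consumer:
    print(message.topic, message.offset, message.value)
```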
superglue is an open-source server that sits as a layer between complex APIs and your application. With superglue, you always get the data that you want in the format that you expect. Fetch data from JSON and XML APIs, as well as CSV and Excel files in seconds.
Kubernetes-native platform to run massively parallel data/streaming jobs.
A Kubernetes-native, serverless platform for running scalable and reliable event-driven applications. Numaflow decouples event sources and sinks from the processing logic, allowing each component to independently auto-scale based on demand. With out-of-the-box sources and sinks, and built-in observability, developers can focus on their processing logic without worrying about event consumption, boilerplate code, or operational complexities. Each step of the pipeline can be written in any programming language, so you can pick the best-suited language for each step and keep using the languages you know best.
Data processing with ML, LLM and Vision LLM.
Sparrow is an innovative open-source solution for efficient data extraction and processing from various documents and images. It seamlessly handles forms, bank statements, invoices, receipts, and other unstructured data sources. Sparrow stands out with its modular architecture, offering independent services and pipelines all optimized for robust performance.
A Free 9-Week Course on Data Engineering Fundamentals.
Master the fundamentals of data engineering by building an end-to-end data pipeline from scratch. Gain hands-on experience with industry-standard tools and best practices.
Efficient data transformation and modeling framework that is backwards compatible with dbt.
SQLMesh is a next-generation data transformation framework designed to ship data quickly, efficiently, and without error. Data teams can efficiently run and deploy data transformations written in SQL or Python with visibility and control at any size.
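A sketch of a SQLMesh Python model, assuming the `@model` decorator pattern from SQLMesh's documentation; the model name and columns here are illustrative.

```python
from datetime import datetime

import pandas as pd
from sqlmesh import ExecutionContext, model

@model(
    "analytics.daily_orders",  # illustrative model name
    columns={"order_id": "int", "amount": "double"},
)
def execute(
    context: ExecutionContext,
    start: datetime,
    end: datetime,
    execution_time: datetime,
    **kwargs,
) -> pd.DataFrame:
    # SQLMesh invokes this for each evaluated interval [start, end).
    return pd.DataFrame([{"order_id": 1, "amount": 42.0}])
```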
Generate video, images, audio with AI. The most powerful open source node-based application for creating images, videos, and audio with GenAI.
This UI lets you design and execute advanced Stable Diffusion pipelines using a graph/nodes/flowchart-based interface. The project's examples show sample workflows and what ComfyUI can do.
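Beyond the editor, workflows can be queued programmatically. A minimal sketch assuming a ComfyUI server on its default port (8188) and a node graph previously exported from the UI in API format:

```python
import json
import urllib.request

# A workflow graph exported from the ComfyUI editor ("Save (API Format)").
with open("workflow_api.json") as f:
    workflow = json.load(f)

req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=json.dumps({"prompt": workflow}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode())  # returns a prompt id for polling results
```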
Open-Source Data Movement for LLMs. AI Platform. Data integration platform for ELT pipelines from APIs, databases & files to databases, warehouses & lakes.
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Available both self-hosted and cloud-hosted.
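For lightweight pipelines, Airbyte connectors can also be driven from Python. A sketch assuming the PyAirbyte client and its demo source-faker connector; config keys vary per connector.

```python
import airbyte as ab

source = ab.get_source(
    "source-faker",              # demo connector emitting fake users/orders
    config={"count": 100},
    install_if_missing=True,
)
source.check()                   # verify connectivity and config
source.select_all_streams()
result = source.read()           # cached locally, one dataset per stream
print(result["users"].to_pandas().head())
```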
Cloud-native orchestration of data pipelines. Ship data pipelines with extraordinary velocity. An orchestration platform for the development, production, and observation of data assets.
Dagster is a cloud-native data pipeline orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability.
It is designed for developing and maintaining data assets, such as tables, data sets, machine learning models, and reports.
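The asset-centric, declarative model in brief: assets are plain functions, and Dagster infers the dependency graph from parameter names. A minimal runnable sketch:

```python
from dagster import asset, materialize

@asset
def raw_orders() -> list[dict]:
    # In a real pipeline this would load from a source system.
    return [{"id": 1, "amount": 42.0}, {"id": 2, "amount": 8.0}]

@asset
def order_total(raw_orders: list[dict]) -> float:
    # The parameter name matches the upstream asset, wiring the lineage.
    return sum(o["amount"] for o in raw_orders)

if __name__ == "__main__":
    result = materialize([raw_orders, order_total])
    assert result.success
```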
Department of Education (DOE) for New South Wales (AUS) data stack in a box. With the push of one button you can have your own data stack up and running in 5 mins! 🏎️.
Open-source developer platform and workflow engine. Turn scripts into auto-generated UIs, APIs and cron jobs. Compose them as workflows or data pipelines. Build complex, data-intensive apps with ease.
Write and deploy software 10x faster, and run it with the highest reliability and observability on the fastest self-hostable job orchestrator.
Open-source developer platform to power your entire infra and turn scripts into webhooks, workflows and UIs. Fastest workflow engine (13x vs Airflow). Open-source alternative to Retool and Temporal.
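The script-to-UI model works by convention: Windmill derives a form, webhook, and typed API from the signature of an exported `main` function. A minimal sketch (parameter names are illustrative):

```python
def main(name: str, excited: bool = False) -> str:
    # Windmill infers a text input and a checkbox from these typed
    # parameters and exposes the script as a webhook returning JSON.
    greeting = f"Hello, {name}"
    return greeting + "!" if excited else greeting
```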
Your Data Pipeline, Simplified. GlareDB: An analytics DBMS for distributed data.
Data exists everywhere: on your laptop, in Postgres, in Snowflake, and as files in S3. It comes in various formats such as Parquet, CSV, and JSON. Regardless, getting to the insights you need usually means multiple steps spanning several systems.
GlareDB is designed to query your data wherever it lives using SQL that you already know.
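A sketch assuming GlareDB's Python bindings (`pip install glaredb`); the connect/query surface and the file-path SQL syntax are assumptions to check against GlareDB's docs.

```python
import glaredb  # assumed Python bindings

con = glaredb.connect()  # assumed: in-memory local session
# Assumed syntax for querying a local Parquet file directly with SQL.
df = con.sql("SELECT count(*) AS n FROM './data/orders.parquet'").to_pandas()
print(df)
```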
Airflow is a platform created by the community to programmatically author, schedule and monitor workflows.
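A canonical DAG using the TaskFlow API (Airflow 2.x): two Python tasks scheduled daily, with the dependency expressed as an ordinary function call.

```python
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def etl():
    @task
    def extract() -> list[int]:
        return [1, 2, 3]

    @task
    def load(values: list[int]) -> None:
        print(sum(values))

    load(extract())

etl()
```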