Search: [data-pipeline] - Biapy Web Directory

Fri Mar 28 13:50:50 2025

Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.

Apache Kafka @ GitHub.

Related contents:

The New Look and Feel of Apache Kafka 4.0 @ The New Stack.

superglue https://superglue.cloud/

Fri Feb 28 13:58:58 2025

superglue is an open-source server that sits as a layer between complex APIs and your application. With superglue, you always get the data that you want in the format that you expect. Fetch data from JSON and XML APIs, as well as CSV and Excel files in seconds.

superglue @ GitHub.

Numaflow https://numaflow.numaproj.io/

Wed Feb 19 14:16:57 2025

Kubernetes-native platform to run massively parallel data/streaming jobs.

A Kubernetes-native, serverless platform for running scalable and reliable event-driven applications. Numaflow decouples event sources and sinks from the processing logic, allowing each component to independently auto-scale based on demand. With out-of-the-box sources and sinks, and built-in observability, developers can focus on their processing logic without worrying about event consumption, writing boilerplate code, or operational complexities. Each step of the pipeline can be written in any programming language, so you can pick the best language for each step or use the ones you are most familiar with.

Numaflow @ GitHub.

Sparrow https://sparrow.katanaml.io/

Mon Feb 17 09:11:09 2025

Data processing with ML, LLM and Vision LLM.

Sparrow is an innovative open-source solution for efficient data extraction and processing from various documents and images. It seamlessly handles forms, bank statements, invoices, receipts, and other unstructured data sources. Sparrow stands out with its modular architecture, offering independent services and pipelines all optimized for robust performance.

Sparrow @ GitHub.

Related contents:

Why SQLMesh Might be The Best dbt Alternative @ The Data Toolbox.

ComfyUI https://www.comfy.org/

Mon Jan 27 05:34:22 2025

Generate video, images, audio with AI.
The most powerful open source node-based application for creating images, videos, and audio with GenAI.

This UI lets you design and execute advanced Stable Diffusion pipelines using a graph/nodes/flowchart-based interface.

ComfyUI @ GitHub.

Airbyte https://airbyte.com/

Wed Jan 8 15:44:06 2025

Open-Source Data Movement for LLMs. AI Platform.
Data integration platform for ELT pipelines from APIs, databases & files to databases, warehouses & lakes.

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

Airbyte @ GitHub.

Dagster https://dagster.io/

Fri Nov 8 08:23:41 2024

Cloud-native orchestration of data pipelines. Ship data pipelines with extraordinary velocity.
An orchestration platform for the development, production, and observation of data assets.

Dagster is a cloud-native data pipeline orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability.

It is designed for developing and maintaining data assets, such as tables, data sets, machine learning models, and reports.

Dagster @ GitHub.

data stack in a box https://github.com/wisemuffin/nsw-doe-data-stack-in-a-box

Fri Nov 8 08:07:34 2024

Department of Education (DOE) for New South Wales (AUS) data stack in a box.
With the push of one button, you can have your own data stack up and running in 5 minutes!

Windmill https://www.windmill.dev/

Mon Nov 4 09:47:58 2024

Open-source developer platform and workflow engine. Turn scripts into auto-generated UIs, APIs and cron jobs. Compose them as workflows or data pipelines. Build complex, data-intensive apps with ease.

Write and deploy software 10x faster, and run it with the highest reliability and observability on the fastest self-hostable job orchestrator.

Open-source developer platform to power your entire infra and turn scripts into webhooks, workflows and UIs. Fastest workflow engine (13x vs Airflow). Open-source alternative to Retool and Temporal.

Windmill.

GlareDB https://glaredb.com/

Thu Sep 21 14:23:47 2023

Your Data Pipeline, Simplified. GlareDB: An analytics DBMS for distributed data.

Data exists everywhere: your laptop, Postgres, Snowflake and as files in S3. It exists in various formats such as Parquet, CSV and JSON. Regardless, there will always be multiple steps spanning several destinations to get the insights you need.

GlareDB is designed to query your data wherever it lives using SQL that you already know.
