data-pipeline
Semantic Data Processing. Build data processing and data analysis pipelines that leverage the power of LLMs 🧠
Semlib is a Python library for building data processing and data analysis pipelines that leverage the power of large language models (LLMs). Semlib provides, as building blocks, familiar functional programming primitives like map, reduce, sort, and filter, but with a twist: Semlib's implementations of these operations are programmed with natural language descriptions rather than code. Under the hood, Semlib handles complexities such as prompting, parsing, concurrency control, caching, and cost tracking.
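The natural-language twist is easiest to see in code. The sketch below is illustrative only: `Session`, `filter`, and `map` here are assumptions about the style of API described above, not Semlib's documented interface; check the project's README for the real names and signatures.

```python
import asyncio

from semlib import Session  # assumed entry point

async def main():
    session = Session()  # assumed to encapsulate prompting, caching, concurrency
    reviews = [
        "Great battery life, charges fast",
        "Screen cracked after a week",
        "Arrived two days late but works fine",
    ]
    # Hypothetical semantic primitives: the predicate and the mapping are
    # plain natural language instead of code.
    defects = await session.filter(reviews, by="mentions a product defect")
    summaries = await session.map(defects, "summarize in five words: {item}")
    print(summaries)

asyncio.run(main())
```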
Unified data processing framework. Flow is a PHP-based, strongly typed data processing framework with a low memory footprint.
The most advanced data processing framework, allowing you to build scalable data processing pipelines and move data between various data sources and destinations.
A declarative, 🐻❄️-native data frame validation library.
Dataframely is a Python package to validate the schema and content of polars data frames. Its purpose is to make data pipelines more robust by ensuring that data meet expectations and more readable by adding schema information to data frame type hints.
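A minimal sketch of that declarative pattern, assuming dataframely's convention of declaring typed columns as class attributes on a `Schema` subclass; the exact column types and the `validate` signature are assumptions to verify against the package docs.

```python
import dataframely as dy
import polars as pl

class OrderSchema(dy.Schema):
    # Column declarations double as validation rules and documentation.
    order_id = dy.Int64(nullable=False)
    amount = dy.Float64(nullable=False)

df = pl.DataFrame({"order_id": [1, 2], "amount": [9.99, 25.0]})

# Fails if the frame's schema or contents violate the declared rules; the
# annotated return type makes downstream function signatures self-describing.
validated: dy.DataFrame[OrderSchema] = OrderSchema.validate(df)
```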
Magical Data Engineering Workflows.
🧙 Build, run, and manage data pipelines for integrating and transforming data.
Mage is a hybrid framework for transforming and integrating data. It combines the best of both worlds: the flexibility of notebooks with the rigor of modular code.
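The notebook-plus-modular-code hybrid looks roughly like this. A sketch assuming Mage's block decorators; in a real project each block lives in its own file scaffolded by the Mage UI, and the example data is illustrative.

```python
import pandas as pd
from mage_ai.data_preparation.decorators import data_loader, transformer

@data_loader
def load_users() -> pd.DataFrame:
    # Loader block: pull raw rows from any source.
    return pd.DataFrame({"user_id": [1, 2, 3], "spend": [10.0, 0.0, 5.5]})

@transformer
def keep_paying_users(df: pd.DataFrame) -> pd.DataFrame:
    # Transformer block: Mage feeds it the upstream block's output.
    return df[df["spend"] > 0]
```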
Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications. A minimal producer/consumer sketch follows the related links below.
Related contents:
- The New Look and Feel of Apache Kafka 4.0 @ The New Stack.
- Kafka: The End of the Beginning @ Materialized View.
- Optimizing Kafka Tracing with OpenTelemetry: Boost Visibility & Performance @ New Relic.
- Introducing Apache Kafka® 4.1.0: What’s New and How to Upgrade @ Confluent.
- Testing Kafka-based Asynchronous Workflows Using OpenTelemetry @ Signadot.
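As referenced above, a minimal round trip with the kafka-python client, assuming a broker listening on localhost:9092 with topic auto-creation enabled:

```python
from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("pipeline-events", b'{"event": "signup"}')
producer.flush()  # block until the broker acknowledges the write

consumer = KafkaConsumer(
    "pipeline-events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # start from the beginning of the topic
    consumer_timeout_ms=5000,      # stop iterating once the topic is drained
)
for message in consumer:
    print(message.topic, message.offset, message.value)
```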
superglue is an open-source server that sits as a layer between complex APIs and your application. With superglue, you always get the data that you want in the format that you expect. Fetch data from JSON and XML APIs, as well as CSV and Excel files in seconds.
Kubernetes-native platform to run massively parallel data/streaming jobs.
A Kubernetes-native, serverless platform for running scalable and reliable event-driven applications. Numaflow decouples event sources and sinks from the processing logic, allowing each component to independently auto-scale based on demand. With out-of-the-box sources and sinks, and built-in observability, developers can focus on their processing logic without worrying about event consumption, boilerplate code, or operational complexities. Each step of the pipeline can be written in any programming language, so you can pick the best-suited language for each step and keep using the languages you know best.
Data processing with ML, LLM and Vision LLM.
Sparrow is an innovative open-source solution for efficient data extraction and processing from various documents and images. It seamlessly handles forms, bank statements, invoices, receipts, and other unstructured data sources. Sparrow stands out with its modular architecture, offering independent services and pipelines all optimized for robust performance.
A Free 9-Week Course on Data Engineering Fundamentals.
Master the fundamentals of data engineering by building an end-to-end data pipeline from scratch. Gain hands-on experience with industry-standard tools and best practices.
Efficient data transformation and modeling framework that is backwards compatible with dbt.
SQLMesh is a next-generation data transformation framework designed to ship data quickly, efficiently, and without error. Data teams can efficiently run and deploy data transformations written in SQL or Python with visibility and control at any size.
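A sketch of a SQLMesh Python model, assuming the `@model` decorator pattern from SQLMesh's documentation; the model name and columns here are illustrative.

```python
from datetime import datetime

import pandas as pd
from sqlmesh import ExecutionContext, model

@model(
    "analytics.daily_orders",  # illustrative model name
    columns={"order_id": "int", "amount": "double"},
)
def execute(
    context: ExecutionContext,
    start: datetime,
    end: datetime,
    execution_time: datetime,
    **kwargs,
) -> pd.DataFrame:
    # SQLMesh invokes this for each evaluated interval [start, end).
    return pd.DataFrame([{"order_id": 1, "amount": 42.0}])
```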
Generate video, images, audio with AI. The most powerful open source node-based application for creating images, videos, and audio with GenAI.
This UI lets you design and execute advanced Stable Diffusion pipelines using a graph/nodes/flowchart-based interface. The project's examples show sample workflows and what ComfyUI can do.
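Beyond the editor, workflows can be queued programmatically. A minimal sketch assuming a ComfyUI server on its default port (8188) and a node graph previously exported from the UI in API format:

```python
import json
import urllib.request

# A workflow graph exported from the ComfyUI editor ("Save (API Format)").
with open("workflow_api.json") as f:
    workflow = json.load(f)

req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=json.dumps({"prompt": workflow}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode())  # returns a prompt id for polling results
```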
Open-Source Data Movement for LLMs. AI Platform. Data integration platform for ELT pipelines from APIs, databases & files to databases, warehouses & lakes.
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Available both self-hosted and cloud-hosted.
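For lightweight pipelines, Airbyte connectors can also be driven from Python. A sketch assuming the PyAirbyte client and its demo source-faker connector; config keys vary per connector.

```python
import airbyte as ab

source = ab.get_source(
    "source-faker",              # demo connector emitting fake users/orders
    config={"count": 100},
    install_if_missing=True,
)
source.check()                   # verify connectivity and config
source.select_all_streams()
result = source.read()           # cached locally, one dataset per stream
print(result["users"].to_pandas().head())
```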
Cloud-native orchestration of data pipelines. Ship data pipelines with extraordinary velocity. An orchestration platform for the development, production, and observation of data assets.
Dagster is a cloud-native data pipeline orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability.
It is designed for developing and maintaining data assets, such as tables, data sets, machine learning models, and reports.
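The asset-centric, declarative model in brief: assets are plain functions, and Dagster infers the dependency graph from parameter names. A minimal runnable sketch:

```python
from dagster import asset, materialize

@asset
def raw_orders() -> list[dict]:
    # In a real pipeline this would load from a source system.
    return [{"id": 1, "amount": 42.0}, {"id": 2, "amount": 8.0}]

@asset
def order_total(raw_orders: list[dict]) -> float:
    # The parameter name matches the upstream asset, wiring the lineage.
    return sum(o["amount"] for o in raw_orders)

if __name__ == "__main__":
    result = materialize([raw_orders, order_total])
    assert result.success
```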
Department of Education (DOE) for New South Wales (AUS) data stack in a box. With the push of one button you can have your own data stack up and running in 5 mins! 🏎️.
Open-source developer platform and workflow engine. Turn scripts into auto-generated UIs, APIs and cron jobs. Compose them as workflows or data pipelines. Build complex, data-intensive apps with ease.
Write and deploy software 10x faster, and run it with the highest reliability and observability on the fastest self-hostable job orchestrator.
Open-source developer platform to power your entire infra and turn scripts into webhooks, workflows and UIs. Fastest workflow engine (13x vs Airflow). Open-source alternative to Retool and Temporal.
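The script-to-UI model works by convention: Windmill derives a form, webhook, and typed API from the signature of an exported `main` function. A minimal sketch (parameter names are illustrative):

```python
def main(name: str, excited: bool = False) -> str:
    # Windmill infers a text input and a checkbox from these typed
    # parameters and exposes the script as a webhook returning JSON.
    greeting = f"Hello, {name}"
    return greeting + "!" if excited else greeting
```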
Your Data Pipeline, Simplified. GlareDB: An analytics DBMS for distributed data.
Data exists everywhere: on your laptop, in Postgres, in Snowflake, and as files in S3. It comes in various formats such as Parquet, CSV, and JSON. Regardless, getting to the insights you need usually means multiple steps spanning several systems.
GlareDB is designed to query your data wherever it lives using SQL that you already know.
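A sketch assuming GlareDB's Python bindings (`pip install glaredb`); the connect/query surface and the file-path SQL syntax are assumptions to check against GlareDB's docs.

```python
import glaredb  # assumed Python bindings

con = glaredb.connect()  # assumed: in-memory local session
# Assumed syntax for querying a local Parquet file directly with SQL.
df = con.sql("SELECT count(*) AS n FROM './data/orders.parquet'").to_pandas()
print(df)
```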
Airflow is a platform created by the community to programmatically author, schedule and monitor workflows.
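A canonical DAG using the TaskFlow API (Airflow 2.x): two Python tasks scheduled daily, with the dependency expressed as an ordinary function call.

```python
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def etl():
    @task
    def extract() -> list[int]:
        return [1, 2, 3]

    @task
    def load(values: list[int]) -> None:
        print(sum(values))

    load(extract())

etl()
```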