Biapy's Bookmarks

Apache Calcite

https://calcite.apache.org/

Dynamic data management framework. The foundation for your next high-performance database.

It contains many of the pieces that comprise a typical database management system but omits the storage primitives. It provides an industry standard SQL parser and validator, a customisable optimizer with pluggable rules and cost functions, logical and physical algebraic operators, various transformation algorithms from SQL to algebra (and the opposite), and many adapters for executing SQL queries over Cassandra, Druid, Elasticsearch, MongoDB, Kafka, and others, with minimal configuration.

Apache Calcite @ GitHub.

Related contents:

Reimagining log analytics for the modern enterprise @ OpenSearch.

apache2-licensed big-data database foss framework opensearch open-source sql

Added 3 days ago

Datafari Enterprise Search

https://www.datafari.com/

Open Source, Distributed, Big Data Enterprise Search Engine.

Datafari is an open source enterprise search solution enriched with AI. It is the perfect product for anyone who needs to search and analyze its corporate data and documents, both within the content and the metadata. Plus, with its genAI modules, it allows to easily leverage mistral, openai, or local LLMs for your company data.

Datafari @ GitHub.

apache2-licensed big-data business foss llm open-source search-engine

Added 4 months ago

BigQuery

http://BigQuery

AI data platform.

From data warehouse to autonomous data and AI platform

BigQuery is the autonomous data to AI platform, automating the entire data life cycle, from ingestion to AI-driven insights, so you can go from data to AI to action faster.

Gemini in BigQuery features are now included in BigQuery pricing models.

Related contents:

BigQuery’s Ridiculous Pricing Model Cost Us $10,000 in Just 22 Seconds!!! @ Data Engineer Things.

big-data bigquery cloud commercial data-lake gcp lakehouse

Added 5 months ago

Apache Kafka

https://kafka.apache.org/

Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.

Related contents:

apache2-licensed big-data data-integration data-pipeline data-stream event-stream foss kafka open-source self-hosted

Added 7 months ago

Apache Gravitino

https://gravitino.apache.org/

A unified metadata lake across all your sources, formats, cloud providers, and regions in a federated architecture. World's most powerful open data catalog for building a high-performance, geo-distributed and federated metadata lake.

Apache Gravitino is a high-performance, geo-distributed, and federated metadata lake. It manages metadata directly in different sources, types, and regions, providing users with unified metadata access for data and AI assets.

Apache Gravitino @ GitHub.

big-data data-lake foss metadata open-source self-hosted

Added 7 months ago

Apache Pinot™

https://pinot.apache.org/

Insights, Unlocked in Real Time.

Apache Pinot™: The real-time analytics open source platform for lightning-fast insights, effortless scaling, and cost-effective data-driven decisions.

Apache Pinot @ GitHub.

Related contents:

Serving Millions of Apache Pinot™ Queries with Neutrino @ Uber Blog.

big-data data-analytics data-science foss open-source self-hosted

Added 9 months ago

Apache Iceberg™

https://iceberg.apache.org/

The open table format for analytic datasets.

Iceberg is a high-performance format for huge analytic tables. Iceberg brings the reliability and simplicity of SQL tables to big data, while making it possible for engines like Spark, Trino, Flink, Presto, Hive and Impala to safely work with the same tables, at the same time.

Apache Iceberg @ GitHub.

Related contents:

apache2-licensed big-data database data-science foss open-source sql

Added 10 months ago

Bufstream

https://buf.build/product/bufstream

The best way of working with Protocol Buffers. Elastic, self-hosted Kafka with Advanced Semantic Intelligence Guarantee streaming data quality and slash cloud costs 10x with Bufstream, a drop-in replacement for Apache Kafka®.

Bufstream is a Kafka-compatible streaming system which stores records directly in an object storage service like S3.

big-data data-stream kafka open-source s3

Added 11 months ago

Apache CouchDB

https://couchdb.apache.org/

Seamless multi-master sync, that scales from Big Data to Mobile, with an Intuitive HTTP/JSON API and designed for Reliability.

CouchDB is a database that completely embraces the web. Store your data with JSON documents. Access your documents with your web browser, via HTTP. Query, combine, and transform your documents with JavaScript. CouchDB works well with modern web and mobile apps. You can distribute your data, efficiently using CouchDB’s incremental replication. CouchDB supports master-master setups with automatic conflict detection.

Related contents:

big-data database foss http json open-source self-hosted

Added 11 months ago

Trench

https://www.trench.dev/

Open source analytics infrastructure. Fast and scalable. No bloat. GDPR compliant.

A single production-ready Docker image built on ClickHouse, Kafka, and Node.js for tracking events, users, page views, and interactions.

Trench @ GitHub.

analytics big-data clickhouse data-stream foss gdpr kafka open-source web-app

Added 1 year ago

Apache Hadoop

https://hadoop.apache.org/

The Apache® Hadoop® project develops open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

big-data database nosql open-source

Added 1 year ago

AutoMQ

https://www.automq.com/

Source Available Reinvented Kafka. 10x Cost Efficiency.

AutoMQ is a cloud-first alternative to Kafka by decoupling durability to S3 and EBS. 10x cost-effective. Autoscale in seconds. Single-digit ms latency.

AutoMQ is a stateless Kafka on S3. 10x Cost-Effective. No Cross-AZ Traffic Cost. Autoscale in seconds. Single-digit ms latency. Multi-AZ Availability.

AutoMQ @ GitHub.

Related contents:

What If We Could Rebuild Kafka From Scratch? @ Gunnar Morling.

apache2-licensed big-data foss kafka open-source s3

Added 1 year ago

Apache Hudi

https://hudi.apache.org/

An Open Source Data Lake Platform.

Apache Hudi is a transactional data lake platform that brings database and data warehouse capabilities to the data lake. Hudi reimagines slow old-school batch data processing with a powerful new incremental processing framework for low latency minute-level analytics.

big-data data-lake foss open-source

Added 1 year ago

Data For Good

https://dataforgood.fr/

Les technologies numériques sont incroyablement puissantes et redéfinissent le fonctionnement de notre société. Pour les acteurs qui œuvrent pour l'intérêt général, la technologie peut parfois être un levier démutiplicateur d'impacts positifs, cependant et malheureusement ces acteurs n'ont souvent pas les ressources technologiques ou humaines pour accélérer leur action citoyenne. Data for Good existe pour rétablir l'équilibre.

association big-data data-analytics data-science france ong

Added 1 year ago

Amazon Athena

https://aws.amazon.com/athena/

Interactive SQL. Analyze petabyte-scale data where it lives with ease and flexibility.

Amazon Athena is a serverless, interactive analytics service built on open-source frameworks, supporting open-table and file formats. Athena provides a simplified, flexible way to analyze petabytes of data where it lives. Analyze data or build applications from an Amazon Simple Storage Service (S3) data lake and 30 data sources, including on-premises data sources or other cloud systems using SQL or Python. Athena is built on open-source Trino and Presto engines and Apache Spark frameworks, with no provisioning or configuration effort required.

286 - Data & Dev - Christophe Blefari @ <ifttd> :fr:.

amazon aws big-data commercial data data-analytics data-science serverless web-service

Added 1 year ago

Apache DataFusion

https://datafusion.apache.org/

DataFusion is a very fast, extensible query engine for building high-quality data-centric systems in Rust, using the Apache Arrow in-memory format.

DataFusion is great for building projects such as domain specific query engines, new database platforms and data pipelines, query languages and more. It lets you start quickly from a fully working engine, and then customize those features specific to your use.

DataFusion @ GitHub.

big-data database data-centric dataframe open-source self-hosted sql

Added 1 year ago

JuiceFS

https://juicefs.com/en/

Open Source Distributed POSIX File System for Cloud. JuiceFS is a distributed POSIX file system built on top of Redis and S3.

JuiceFS is a high-performance POSIX file system released under Apache License 2.0, particularly designed for the cloud-native environment. The data, stored via JuiceFS, will be persisted in Object Storage (e.g. Amazon S3), and the corresponding metadata can be persisted in various compatible database engines such as Redis, MySQL, and TiKV based on the scenarios and requirements.

With JuiceFS, massive cloud storage can be directly connected to big data, machine learning, artificial intelligence, and various application platforms in production environments. Without modifying code, the massive cloud storage can be used as efficiently as local storage.

JuiceFS @ GitHub.

big-data cloud filesystem open-source posix redis s3

Added 1 year ago

Trunk Data Platform (TDP)

https://www.trunkdataplatform.io/

open source big data platform.

Trunk Data Platform is an Open Source, free, Hadoop distribution.

big-data cloud cluster distribution hadoop open-source self-hosted

Added 1 year ago

Apache Druid

https://druid.apache.org/

Druid is a high performance, real-time analytics database that delivers sub-second queries on streaming and batch data at scale and under load.

analytics apache big-data open-source real-time streaming

Added 2 years ago

XetHub: fast, frictionless collaboration at scale

https://xethub.com/

XetHub brings speedy access and Git-based collaboration to large scale repositories of data, code, or any combination of files. Our instant mount feature makes it possible to access GBs and TBs of data in seconds at the speed of localhost, while our de-duplication algorithm stores data and differences efficiently to save money and speed up development cycles. XetHub is ideal for teams who already use Git to track their code changes, and want to leverage the power of infinite history, pull requests, and difference-based tracking for larger assets such as datasets or media files. Managing complete projects with familiar Git semantics makes change tracking and continuous integration a breeze, especially for workflows that use code to generate or augment assets.

big-data cloud collaboration data git web-service

Added 2 years ago

Neo4j

https://neo4j.com/

Graph Database Management System. Neo4j Graph Data Platform. Blazing-Fast Graph, Petabyte Scale. With proven trillion+ entity performance, developers, data scientists, and enterprises rely on Neo4j as the top choice for high-performance, scalable analytics, intelligent app development, and advanced AI/ML pipelines.

ai big-data data database dbms graph java machine-learning

Added 3 years ago

ClickHouse

https://github.com/ClickHouse/ClickHouse

ClickHouse® is a free analytics DBMS for big data. ClickHouse® is an open-source column-oriented database management system that allows generating analytical data reports in real-time.

analytics big-data database dbms real-time statistics

Added 3 years ago

Konbert

https://konbert.com/

Open big JSON, CSV Files: Online Viewer, Explorer and Converter. View and convert big data files. View large or small files right in your browser and export them in any format.

big-data converter csv json web-service

Added 3 years ago

Planet

https://www.planet.com/

Daily Earth Data to See Change and Make Better Decisions. Planet provides daily satellite data that helps businesses, governments, researchers, and journalists understand the physical world and take action.

ai big-data satellite

Added 3 years ago

Climate TRACE

https://www.climatetrace.org/

Climate TRACE was built to collect and share greenhouse gas emissions from anthropogenic (human) activities to facilitate climate action .

ai big-data climate greenhouse-effect satellite

Added 3 years ago

Robtex

https://www.robtex.com/

Robtex is used for various kinds of research of IP numbers, Domain names, etc. Robtex uses various sources to gather public information about IP numbers, domain names, host names, Autonomous systems, routes etc. It then indexes the data in a big database and provide free access to the data. We aim to make the fastest and most comprehensive free DNS lookup tool on the Internet. Our database now contains billions of documents of internet data collected over more than a decade.

api big-data data-mining dns web-service

Added 3 years ago

Luna

https://www.luna-lang.org/

A WYSIWYG language for data processing.

big-data development language statistics

Added 7 years ago

EveryPolitician

http://everypolitician.org/

Political data for 233 countries. The world’s richest open dataset on politicians

big-data web-service

Added 8 years ago

PNDA

http://pndaproject.io/

The scalable, open source big data analytics platform for networks and services.

big-data open-source web-service

Added 9 years ago

ROOT a Data analysis Framework

https://root.cern.ch/

A modular scientific software framework. It provides all the functionalities needed to deal with big data processing, statistical analysis, visualisation and storage. It is mainly written in C++ but integrated with other languages such as Python and R.

big-data foss framework

Added 9 years ago

OpenGrid

https://github.com/Chicago/opengrid

A user-friendly, map-based tool to combine and explore real-time or historical data.

big-data foss geolocation

Added 9 years ago

WhereHows

https://github.com/linkedin/WhereHows

WhereHows is a data discovery and lineage tool built at LinkedIn. It integrates with all the major data processing systems and collects both catalog and operational metadata from them.

big-data foss linkedin

Added 9 years ago

GridDB

https://github.com/griddb/griddb_nosql

high performance, high scalability and high reliability database for big data. GridDB has a KVS (Key-Value Store)-type data model that is suitable for sensor data stored in a timeseries. It is a database that can be easily scaled-out according to the number of sensors.

big-data database kvs

Added 9 years ago

HBase - Apache HBase Home

http://hbase.apache.org/

Apache HBase™ is the Hadoop database, a distributed, scalable, big data store. Use Apache HBase when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

big-data database distributed hadoop scalability server

Added 11 years ago

D3.js - Data-Driven Documents

http://d3js.org/

D3.js is a JavaScript library for manipulating documents based on data. D3 helps you bring data to life using HTML, SVG and CSS. D3’s emphasis on web standards gives you the full capabilities of modern browsers without tying yourself to a proprietary framework, combining powerful visualization components and a data-driven approach to DOM manipulation.

big-data développement data graphiques javascript statistics

Added 11 years ago

Waarp

https://github.com/Waarp/Waarp-All

Waarp provides a secure and efficient open source MFT solution.

Waarp Platform is a set of applications and tools specialized in managing and monitoring a high number of transfers in a secure and reliable way.

It relies on its own open protocol named R66, which has been designed to optimize file transfers, ensure the integrity of the data provide ways to integrate transfers in larger business transactions.

big-data file-transfer foss mft open-source self-hosted

Added 13 years ago

Apache™ Hadoop™!

http://hadoop.apache.org/

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-avaiability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-availabile service on top of a cluster of computers, each of which may be prone to failures.

big-data

Added 13 years ago