data-science
High performance array computing.
Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more. JAX provides a familiar NumPy-style API for ease of adoption by researchers and engineers.
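The three transformations named above compose freely. A minimal sketch, assuming `jax` is installed (function and variable names here are illustrative):

```python
import jax
import jax.numpy as jnp

# A plain NumPy-style function...
def loss(w, x):
    return jnp.sum((w * x - 1.0) ** 2)

# ...differentiated, JIT-compiled, and vectorized by composing transformations.
grad_loss = jax.jit(jax.grad(loss))          # d(loss)/dw, compiled
batched = jax.vmap(loss, in_axes=(None, 0))  # map loss over a batch of x

w = jnp.array([0.5, 0.5])
x = jnp.array([1.0, 2.0])
g = grad_loss(w, x)   # gradient with respect to w
b = batched(w, x)     # one loss value per element of x
```

Because the transformations are just higher-order functions, `jax.jit(jax.vmap(jax.grad(loss)))` is equally valid.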
Related contents:
Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in two lines of code for 100x faster random access, vector indexing, and data versioning. Compatible with pandas, DuckDB, Polars, PyArrow, and PyTorch, with more integrations coming.
Lance is a modern columnar data format optimized for machine learning and AI applications. It efficiently handles diverse multimodal data types while providing high-performance querying and versioning capabilities.
Related contents:
S3GD is a highly optimized, PyTorch-compatible Triton implementation of the Smoothed SignSGD optimizer, meant for reinforcement learning post-training.
Related contents:
Python Data Analysis Library.
pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool. It is a flexible and powerful data analysis/manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more.
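The labeled-data-structure idea can be shown in a few lines (column and value names here are made up for illustration):

```python
import pandas as pd

# A DataFrame: labeled columns, similar to an R data.frame.
df = pd.DataFrame(
    {"city": ["Oslo", "Oslo", "Bergen"], "temp": [3.0, 5.0, 7.0]}
)

# Split-apply-combine: mean temperature per city.
means = df.groupby("city")["temp"].mean()
```

The result is itself a labeled structure (a Series indexed by city), so further operations stay label-aware.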
Related contents:
The Data-Oriented Language for Sane Software Development.
Odin is a general-purpose programming language with distinct typing, built for high performance, modern systems, and data-oriented design. The Odin Programming Language: the C alternative for the joy of programming.
Related contents:
Scalable, Interactive Visualization. Compute & interactively visualize large embeddings.
Embedding Atlas is a tool that provides interactive visualizations for large embeddings. It allows you to visualize, cross-filter, and search embeddings and metadata.
Context Engineering for AI Systems.
TensorLake transforms unstructured documents into AI-ready data through Document Ingestion APIs and enables building scalable data processing pipelines with a serverless workflow runtime. The platform handles the complexity of document parsing, data extraction, and workflow orchestration on fully managed infrastructure including GPU acceleration.
Positron, a next-generation data science IDE.
- A free, next-generation data science IDE built by Posit PBC.
- An extensible, polyglot tool for writing code and exploring data.
- A familiar environment for reproducible authoring and publishing.
Related contents:
Open-source data multitool. Data exploration at your fingertips.
VisiData is an interactive multitool for tabular data. It combines the clarity of a spreadsheet, the efficiency of the terminal, and the power of Python into a lightweight utility that can handle millions of rows with ease.
Generate realistic datasets for demos, learning, and dashboards. Instantly preview data, export as CSV or SQL, and explore with Metabase.
NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.
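A few of the routine families listed above, on one small ndarray (values chosen only for illustration):

```python
import numpy as np

# The ndarray: a typed, multidimensional array with vectorized operations.
a = np.arange(6).reshape(2, 3)    # [[0, 1, 2], [3, 4, 5]]

row_sums = a.sum(axis=1)          # reduction along an axis
scaled = a * 10                   # broadcasting a scalar across the array
sorted_cols = np.sort(a, axis=0)  # sorting along an axis
```

Each operation runs in compiled code over the whole array, which is what makes these routines fast compared to Python loops.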
Related contents:
Look At Your Data 👀.
Data quality is the most important factor in machine learning success. Hyperparam brings exploration and analysis of massive text datasets to the browser.
Extract, Transform, Index Data. Easy and Fresh. CocoIndex is the world's first open-source engine that supports both custom transformation logic and incremental updates specialized for data indexing.
With CocoIndex, users declare the transformation; CocoIndex creates and maintains the index and keeps the derived index up to date as the sources change, with minimal recomputation.
A self-contained, lightweight and OOB research platform for modern ML.
Boson is a lightweight, fully containerized, and feature-rich machine learning research platform. It centralizes essential tools to help teams keep projects lean, organized, and reproducible, while reducing overhead and boosting productivity. Think Databricks/SageMaker, but local and free.
Boson enables engineers and researchers to iterate faster without getting bogged down by infrastructure or tooling complexity.
An implementation of the LOESS / LOWESS algorithm in Rust.
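The LOESS/LOWESS idea itself is easy to sketch in pure Python; this is an illustrative simplification (local linear fit with tricube weights), not the Rust crate's implementation:

```python
import math

def lowess(xs, ys, frac=0.5):
    """For each point, fit a weighted linear regression over the
    nearest `frac` fraction of points, using tricube weights that
    fall to zero at the bandwidth. Returns the fitted values."""
    n = len(xs)
    k = max(2, int(math.ceil(frac * n)))
    fitted = []
    for x0 in xs:
        # Bandwidth: distance to the k-th nearest neighbour of x0.
        dists = sorted(abs(x - x0) for x in xs)
        h = dists[k - 1] or 1e-12
        # Tricube weights: w = (1 - u^3)^3 for u = |x - x0| / h, clipped at 1.
        pts = [(x, y, (1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3)
               for x, y in zip(xs, ys)]
        # Weighted least squares for y = a + b*x.
        sw = sum(w for _, _, w in pts)
        swx = sum(w * x for x, _, w in pts)
        swy = sum(w * y for _, y, w in pts)
        swxx = sum(w * x * x for x, _, w in pts)
        swxy = sum(w * x * y for x, y, w in pts)
        denom = sw * swxx - swx * swx
        if abs(denom) < 1e-12:
            fitted.append(swy / sw)  # degenerate: weighted mean
        else:
            b = (sw * swxy - swx * swy) / denom
            a = (swy - b * swx) / sw
            fitted.append(a + b * x0)
    return fitted

# On exactly linear data the local fits reproduce the line.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
fit = lowess(xs, ys, frac=0.5)
```

Production implementations add robustness iterations (downweighting outliers by residual), which this sketch omits.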
As data volumes continue to grow in fields like machine learning and scientific computing, optimizing fundamental operations like matrix multiplication becomes increasingly critical. Blosc2's chunk-based approach offers a new path to efficiency in these scenarios.
Blosc is a high-performance compressor optimized for binary data (e.g., floating-point numbers, integers, and booleans, although it can handle string data too). It has been designed to transmit data to the processor cache faster than the traditional, non-compressed, direct memory fetch approach via a memcpy() call. Blosc's main goal is not just to reduce the size of large datasets on-disk or in-memory, but also to accelerate memory-bound computations.
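The chunking idea can be illustrated with the standard library alone; this uses zlib purely for demonstration, since Blosc's real codecs (with byte shuffling and SIMD) are far faster:

```python
import struct
import zlib

# Repetitive binary data: 100,000 doubles cycling through 0..99.
values = [float(i % 100) for i in range(100_000)]
raw = struct.pack(f"{len(values)}d", *values)

# Compress in cache-sized chunks rather than one monolithic block,
# so each chunk can be decompressed independently where it is needed.
CHUNK = 64 * 1024
chunks = [zlib.compress(raw[i:i + CHUNK]) for i in range(0, len(raw), CHUNK)]

compressed_size = sum(len(c) for c in chunks)
ratio = len(raw) / compressed_size  # well above 1 for repetitive data

# Decompressing chunk-by-chunk reconstructs the original buffer.
restored = b"".join(zlib.decompress(c) for c in chunks)
```

The payoff of the chunked layout is that a computation can stream one small chunk at a time through the cache instead of pulling the whole uncompressed buffer from main memory.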
Related contents:
🚀 Async-Powered Pandas.
Lightweight Pandas monkey-patch that adds async support to map, apply, applymap, aggregate, and transform, enabling seamless handling of async functions with controlled max_parallel execution.
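The core trick, bounded concurrent execution of an async function over a sequence, can be sketched with stdlib asyncio alone. This is a simplification of what such a patch must do internally, with illustrative names, not the library's actual code:

```python
import asyncio

async def async_map(fn, items, max_parallel=4):
    """Apply async `fn` to each item, running at most
    `max_parallel` coroutines concurrently."""
    sem = asyncio.Semaphore(max_parallel)

    async def bounded(item):
        async with sem:
            return await fn(item)

    # gather() preserves input order regardless of completion order.
    return await asyncio.gather(*(bounded(i) for i in items))

async def double(x):
    await asyncio.sleep(0)  # stand-in for real async I/O
    return x * 2

result = asyncio.run(async_map(double, [1, 2, 3]))
```

Grafting this onto `Series.map`/`DataFrame.apply` is then mostly a matter of detecting coroutine functions and routing them through such a runner.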
The Arc Virtual Cell Atlas is a collection of high quality, curated, open datasets assembled for the purpose of accelerating the creation of virtual cell models. The Atlas includes both observational and perturbational data from over 300 million cells (and growing).
Easy web apps for data science without the compromises. No web development skills required.
Related contents:
Reproducible Data Science Environments with Nix.
{rix} is an R package that leverages Nix, a package manager focused on reproducible builds. With Nix, you can create project-specific environments with a custom version of R, its packages, and all system dependencies (e.g., GDAL). Nix ensures full reproducibility, which is crucial for research and development projects.
Related contents:
The R Project for Statistical Computing.
R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS.
Related contents:
🪄 Create rich visualizations with AI
Data Formulator is an application from Microsoft Research that uses large language models to transform data, expediting the practice of data visualization.
Data Formulator is an AI-powered tool for analysts to iteratively create rich visualizations. Unlike most chat-based AI tools, where users must describe everything in natural language, Data Formulator combines user interface (UI) interactions and natural language (NL) inputs for easier interaction. This blended approach makes it easier for users to describe their chart designs while delegating data transformation to AI.
A Free 9-Week Course on Data Engineering Fundamentals.
Master the fundamentals of data engineering by building an end-to-end data pipeline from scratch. Gain hands-on experience with industry-standard tools and best practices.
For better or for worse, LLMs are here to stay. We all read content that they produce online, most of us interact with LLM chatbots, and many of us use them to produce content of our own.
In a series of five- to ten-minute lessons, we will explain what these machines are, how they work, and how to thrive in a world where they are everywhere.
You will learn when these systems can save you a lot of time and effort. You will learn when they are likely to steer you wrong. And you will discover how to see through the hype to tell the difference.
AI by Hand ✍️ Exercises in Excel
Research and data to make progress against the world’s largest problems.
To make progress against the pressing problems the world faces, we need to be informed by the best research and data.
Our World in Data makes this knowledge accessible and understandable, to empower those working to build a better world.
A faster way to build and share data apps. Streamlit turns data scripts into shareable web apps in minutes. All in pure Python. No front‑end experience required.
Streamlit lets you transform Python scripts into interactive web apps in minutes, instead of weeks. Build dashboards, generate reports, or create chat apps. Once you’ve created an app, you can use our Community Cloud platform to deploy, manage, and share your app.
Multi-modal modular data ingestion and retrieval.
DataBridge is an open source library for natural language search and management of multi-modal data. Get started by installing databridge now!
DataBridge is a powerful document processing and retrieval system designed for building intelligent document-based applications. It provides a robust foundation for semantic search, document processing, and AI-powered document interactions.
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrated remarkable performance on reasoning. With RL, DeepSeek-R1-Zero naturally emerged with numerous powerful and interesting reasoning behaviors. However, DeepSeek-R1-Zero encounters challenges such as endless repetition, poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates cold-start data before RL.
Related contents:
Your data tell a story. Explore. Visualize. Model. Make a difference. Better insight starts with Stata.
Stata is statistical software for data science.
Open-source document processing platform built for knowledge workers.
Rowfill helps extract, analyze, and process data from complex documents, images, PDFs and more with advanced AI capabilities.
Insights, Unlocked in Real Time.
Apache Pinot™: The real-time analytics open source platform for lightning-fast insights, effortless scaling, and cost-effective data-driven decisions.
Related contents:
We wrote this glossary to solve a problem we ran into working with GPUs here at Modal: the documentation is fragmented, making it difficult to connect concepts at different levels of the stack, like Streaming Multiprocessor Architecture, Compute Capability, and nvcc compiler flags.
Everything to Markdown.
E2M converts various file types (doc, docx, epub, html, htm, url, pdf, ppt, pptx, mp3, m4a) into Markdown. It’s easy to install, with dedicated parsers and converters, supporting custom configs. E2M offers an all-in-one, flexible, and open-source solution.
Fetch an entire site and save it as a text file (to be used with AI models).
A Supercharged IDE for Data Science.
Zasper is an IDE designed from the ground up to support massive concurrency. It provides a minimal memory footprint, exceptional speed, and the ability to handle numerous concurrent connections.
It's perfectly suited for running REPL-style data applications, e.g. Jupyter notebooks.
SQL-like Querying for Various Data Sources.
Musoq lets you use SQL-like queries on files, directories, images and other data sources without a database. It's designed to ease life for developers.
Musoq is a tool that lets developers and IT professionals query different data sources using SQL-like syntax, without needing to import data into a database first. It’s designed for scenarios where you need to analyze files, directories, archives, or other data sources quickly and efficiently.
Data Runs Better on SDF. Transform Data Better with SDF. SDF is the fastest way to build a scalable, reliable, and optimized data warehouse.
SDF is a developer platform for data that scales SQL understanding across an organization, empowering all data teams to unlock the full potential of their data.
SDF is a multi-dialect SQL compiler, transformation framework, and analytical database engine. It natively compiles SQL dialects, like Snowflake, and connects to their corresponding data warehouses to materialize models.
Convert Anything into Structured Actionable Data.
Ingest, parse, and optimize any data format ➡️ from documents to multimedia ➡️ for enhanced compatibility with GenAI frameworks.
OmniParse is a platform that ingests and parses any unstructured data into structured, actionable data optimized for GenAI (LLM) applications. Whether you are working with documents, tables, images, videos, audio files, or web pages, OmniParse prepares your data to be clean, structured, and ready for AI applications such as RAG, fine-tuning, and more.
OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference.
OpenVINO is an open-source toolkit for optimizing and deploying deep learning models from cloud to edge. It accelerates deep learning inference across various use cases, such as generative AI, video, audio, and language with models from popular frameworks like PyTorch, TensorFlow, ONNX, and more. Convert and optimize models, and deploy across a mix of Intel® hardware and environments, on-premises and on-device, in the browser or in the cloud.
Solve puzzles. Learn CUDA.
GPU architectures are critical to machine learning, and seem to be becoming even more important every day. However, you can be an expert in machine learning without ever touching GPU code, and it is hard to gain intuition when working only through abstractions.
This notebook is an attempt to teach beginner GPU programming in a completely interactive fashion. Instead of presenting concepts as text, it throws you right into coding and building GPU kernels. The exercises use Numba, which maps Python code directly to CUDA kernels. It looks like Python but is basically identical to writing low-level CUDA code. In a few hours, I think you can go from the basics to understanding the real algorithms that power 99% of deep learning today. If you do want to read the manual, it is here:
Solve Puzzles. Learn Metal 🤘.
Port of srush/GPU-Puzzles to Metal using MLX custom kernels.
GPUs are crucial in machine learning because they can process data on a massively parallel scale. While it's possible to become an expert in machine learning without writing any GPU code, building intuition is challenging when you're only working through layers of abstraction. Additionally, as models grow in complexity, the need for developers to write efficient, high-performance kernels becomes increasingly important to leverage the power of modern hardware.
The universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics.
Apache Arrow defines a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs. The Arrow memory format also supports zero-copy reads for lightning-fast data access without serialization overhead.
Related contents:
The open table format for analytic datasets.
Iceberg is a high-performance format for huge analytic tables. Iceberg brings the reliability and simplicity of SQL tables to big data, while making it possible for engines like Spark, Trino, Flink, Presto, Hive and Impala to safely work with the same tables, at the same time.
Related contents:
- PyIceberg: Current State and Roadmap @ Ju Data Engineering Newsletter.
- The Equality Delete Problem in Apache Iceberg @ Data Engineer Things's Medium.
- How I Saved Millions by Restructuring Iceberg Metadata @ Gautham Gondi's Medium.
- High Throughput Ingestion with Iceberg @ Adobe Tech Blog's Medium.
- Scaling Iceberg Writes with Confidence: A Conflict-Free Distributed Architecture for Fast, Concurrent, Consistent Append-Only Writes @ e6data.
High-Performance Klong array language in Python.
KlongPy is a Python adaptation of the Klong array language, known for its high-performance vectorized operations that leverage the power of NumPy. Embracing a "batteries included" philosophy, KlongPy combines built-in modules with Python's expansive ecosystem, facilitating rapid application development with Klong's succinct syntax.
DataFrames for the new era.
Dataframes powered by a multithreaded, vectorized query engine, written in Rust.
Polars is a DataFrame interface on top of an OLAP Query Engine implemented in Rust using Apache Arrow Columnar Format as the memory model.
Polars is an open-source library for data manipulation, known for being one of the fastest data processing solutions on a single machine. It features a well-structured, typed API that is both expressive and easy to use.
This repo has all the resources you need to become an amazing data engineer!
Data and AI reliability. Delivered.
Data breaks. Monte Carlo ensures your team is the first to know and solve with end-to-end data observability.
The Databricks Data Intelligence Platform. Databricks brings AI to your data to help you bring AI to the world.
Related contents:
Conversational Data Analysis.
PandasAI is a Python platform that makes it easy to ask questions of your data in natural language. It helps non-technical users interact with their data in a more natural way, and it helps technical users save time and effort when working with data.
PandasAI is a Python library that integrates generative artificial intelligence capabilities into pandas, making dataframes conversational. Chat with your database (SQL, CSV, pandas, Polars, MongoDB, NoSQL, etc.). PandasAI makes data analysis conversational using LLMs (GPT-3.5/4, Anthropic, VertexAI) and RAG.
Cross-Language Serialization for Relational Algebra. A cross platform way to express data transformation, relational algebra, standardized record expression and plans.
Substrait is a format for describing compute operations on structured data. It is designed for interoperability across different languages and systems.
Cloud-native orchestration of data pipelines. Ship data pipelines with extraordinary velocity. An orchestration platform for the development, production, and observation of data assets.
The cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability.
Dagster is a cloud-native data pipeline orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability.
It is designed for developing and maintaining data assets, such as tables, data sets, machine learning models, and reports.
Open and unified metadata platform for data discovery, observability, and governance.
A single place for all your data and all your data practitioners to build and manage high quality data assets at scale. Built by Collate and the founders of Apache Hadoop, Apache Atlas, and Uber Databook.
OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration.
OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration. It is one of the fastest-growing open-source projects with a vibrant community and adoption by a diverse set of companies in a variety of industry verticals. Based on Open Metadata Standards and APIs, supporting connectors to a wide range of data services, OpenMetadata enables end-to-end metadata management, giving you the freedom to unlock the value of your data assets.
Department of Education (DOE) for New South Wales (AUS) data stack in a box. With the push of one button you can have your own data stack up and running in 5 mins! 🏎️.
Docling parses documents and exports them to the desired format with ease and speed. 🗂️ Reads popular document formats (PDF, DOCX, PPTX, Images, HTML, AsciiDoc, Markdown) and exports to Markdown and JSON.
Related contents:
AI Data Management at Scale - Curate, Enrich, and Version Datasets.
DataChain is a modern Pythonic data-frame library designed for artificial intelligence. It is made to organize your unstructured data into datasets and wrangle it at scale on your local machine. DataChain does not abstract or hide the AI models and API calls, but helps to integrate them into the post-modern data stack.
DataChain enables multimodal API calls and local AI inferences to run in parallel over many samples as chained operations. The resulting datasets can be saved, versioned, and sent directly to PyTorch and TensorFlow for training. DataChain can persist features of Python objects returned by AI models, and enables vectorized analytical operations over them.
Run SQL queries on CSV files directly in your browser. No data leaves your browser. Fast, private, and easy to use.
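The same workflow, SQL over a CSV with no database server, can be reproduced locally with Python's standard library alone; this is an illustration of the idea, unrelated to the browser tool's implementation (table and column names are made up):

```python
import csv
import io
import sqlite3

csv_text = "name,price\napple,3\nbanana,2\ncherry,5\n"

conn = sqlite3.connect(":memory:")  # nothing touches disk
conn.execute("CREATE TABLE fruit (name TEXT, price REAL)")

# csv.DictReader yields dicts keyed by header, which bind
# directly to sqlite3's named parameters.
rows = list(csv.DictReader(io.StringIO(csv_text)))
conn.executemany("INSERT INTO fruit VALUES (:name, :price)", rows)

total = conn.execute("SELECT SUM(price) FROM fruit").fetchone()[0]
```

Browser tools typically do the equivalent with SQLite or DuckDB compiled to WebAssembly, which is how the data never leaves the page.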
A lightweight next-gen data explorer for Postgres, MySQL, SQLite, MongoDB, Redis, MariaDB & Elasticsearch, with a chat interface.
The powerful data exploration & web app framework for Python.
Panel is an open-source Python library designed to streamline the development of robust tools, dashboards, and complex applications entirely within Python. With a comprehensive philosophy, Panel integrates seamlessly with the PyData ecosystem, offering powerful, interactive data tables, visualizations, and much more, to unlock, visualize, share, and collaborate on your data for efficient workflows.