Search: [data-science] - Biapy Web Directory

Mon Mar 31 13:58:17 2025

As data volumes continue to grow in fields like machine learning and scientific computing, optimizing fundamental operations like matrix multiplication becomes increasingly critical. Blosc2's chunk-based approach offers a new path to efficiency in these scenarios.

Blosc is a high-performance compressor optimized for binary data (i.e. floating point numbers, integers and booleans, although it can handle string data too). It has been designed to transmit data to the processor cache faster than the traditional, non-compressed, direct memory fetch approach via a memcpy() OS call. Blosc's main goal is not just to reduce the size of large datasets on-disk or in-memory, but also to accelerate memory-bound computations.
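As a rough illustration of the chunk-based idea mentioned above, the sketch below multiplies two matrices block by block in plain Python. It does not use Blosc2 or any compression; the function name `blocked_matmul` and the `chunk` parameter are hypothetical, chosen only to show how a product can be accumulated from small tiles that would each fit in cache.

```python
def blocked_matmul(a, b, chunk=2):
    """Multiply matrices a (n x k) and b (k x m) one chunk-sized tile at a time.

    Illustrative only: this is not Blosc2 code and performs no compression.
    """
    n, k, m = len(a), len(b), len(b[0])
    out = [[0] * m for _ in range(n)]
    # Walk over tiles of the output and of the shared dimension.
    for i0 in range(0, n, chunk):
        for j0 in range(0, m, chunk):
            for k0 in range(0, k, chunk):
                # Accumulate the partial product of one pair of tiles.
                for i in range(i0, min(i0 + chunk, n)):
                    for j in range(j0, min(j0 + chunk, m)):
                        acc = out[i][j]
                        for p in range(k0, min(k0 + chunk, k)):
                            acc += a[i][p] * b[p][j]
                        out[i][j] = acc
    return out
```

Each tile pair touches only a small region of the inputs, which is the property a chunked, compressed container could exploit by decompressing one chunk at a time.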

C-Blosc2 @ GitHub.

Related contents:

Compress Better, Compute Bigger @ ironArray.

aiopandas https://github.com/telekinesis-inc/aiopandas

Mon Mar 17 13:19:54 2025

Async-Powered Pandas.

Lightweight Pandas monkey-patch that adds async support to map, apply, applymap, aggregate, and transform, enabling seamless handling of async functions with controlled max_parallel execution.
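The idea of running an async function over many values with a cap on concurrency can be sketched with the standard library alone. This is not aiopandas code and does not reproduce its API; `bounded_map` and `slow_double` are hypothetical names, and only the `max_parallel` concept is taken from the description above.

```python
import asyncio


async def bounded_map(func, values, max_parallel=2):
    """Apply an async function to each value, at most max_parallel at a time."""
    semaphore = asyncio.Semaphore(max_parallel)

    async def run_one(value):
        # The semaphore limits how many calls are in flight at once.
        async with semaphore:
            return await func(value)

    # asyncio.gather returns results in the same order as the inputs.
    return await asyncio.gather(*(run_one(v) for v in values))


async def slow_double(x):
    # Stand-in for an I/O-bound call such as an HTTP request.
    await asyncio.sleep(0.01)
    return x * 2


def demo():
    return asyncio.run(bounded_map(slow_double, [1, 2, 3, 4], max_parallel=2))
```

Calling `demo()` runs the four calls two at a time and returns the doubled values in input order.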

Virtual Cell Atlas https://arcinstitute.org/tools/virtualcellatlas

Wed Feb 26 13:29:08 2025

The Arc Virtual Cell Atlas is a collection of high quality, curated, open datasets assembled for the purpose of accelerating the creation of virtual cell models. The Atlas includes both observational and perturbational data from over 300 million cells (and growing).

Virtual Cell Atlas @ GitHub.

Shiny https://shiny.posit.co/

Wed Feb 26 07:42:21 2025

Easy web apps for data science without the compromises.
No web development skills required.

rix https://docs.ropensci.org/rix/

Wed Feb 26 07:36:06 2025

Reproducible Data Science Environments with Nix.

{rix} is an R package that leverages Nix, a package manager focused on reproducible builds. With Nix, you can create project-specific environments with a custom version of R, its packages, and all system dependencies (e.g., GDAL). Nix ensures full reproducibility, which is crucial for research and development projects.

rix @ GitHub.

Related contents:

Episode 608: R With Eric Nantz @ Coder Radio.

R https://www.r-project.org/

Wed Feb 26 07:34:15 2025

The R Project for Statistical Computing.

R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS.

Data Engineering Zoomcamp https://github.com/DataTalksClub/data-engineering-zoomcamp

Tue Feb 11 14:00:27 2025

A Free 9-Week Course on Data Engineering Fundamentals.

Master the fundamentals of data engineering by building an end-to-end data pipeline from scratch. Gain hands-on experience with industry-standard tools and best practices.

Modern-Day Oracles or Bullshit Machines? https://thebullshitmachines.com/

Mon Feb 10 14:18:05 2025

For better or for worse, LLMs are here to stay. We all read content that they produce online, most of us interact with LLM chatbots, and many of us use them to produce content of our own.

In a series of five- to ten-minute lessons, we will explain what these machines are, how they work, and how to thrive in a world where they are everywhere.

You will learn when these systems can save you a lot of time and effort. You will learn when they are likely to steer you wrong. And you will discover how to see through the hype to tell the difference.

AI by Hand Exercises in Excel https://github.com/ImagineAILab/ai-by-hand-excel

Mon Feb 10 13:48:20 2025

AI by Hand Exercises in Excel

Our World in Data https://ourworldindata.org/

Wed Jan 29 20:27:09 2025

Research and data to make progress against the world’s largest problems.

To make progress against the pressing problems the world faces, we need to be informed by the best research and data.

Our World in Data makes this knowledge accessible and understandable, to empower those working to build a better world.

Streamlit https://github.com/streamlit/streamlit

Tue Jan 28 21:21:24 2025

A faster way to build and share data apps.
Streamlit turns data scripts into shareable web apps in minutes.
All in pure Python. No front‑end experience required.

Streamlit lets you transform Python scripts into interactive web apps in minutes, instead of weeks. Build dashboards, generate reports, or create chat apps. Once you’ve created an app, you can use our Community Cloud platform to deploy, manage, and share your app.

DataBridge https://databridge.gitbook.io/databridge-docs

Sat Jan 25 11:26:42 2025

Multi-modal modular data ingestion and retrieval.

DataBridge is an open source library for natural language search and management of multi-modal data. Get started by installing databridge now!

DataBridge is a powerful document processing and retrieval system designed for building intelligent document-based applications. It provides a robust foundation for semantic search, document processing, and AI-powered document interactions.

DeepSeek-R1 https://github.com/deepseek-ai/DeepSeek-R1

Mon Jan 20 15:35:27 2025

We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrated remarkable performance on reasoning. With RL, DeepSeek-R1-Zero naturally emerged with numerous powerful and interesting reasoning behaviors. However, DeepSeek-R1-Zero encounters challenges such as endless repetition, poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates cold-start data before RL.

Related contents:

DeepSeek Crushes OpenAI o1 with an MIT-Licensed Model—Developers Are Losing It @ AIM.

Stata https://www.stata.com/

Sun Jan 19 15:07:31 2025

Your data tell a story. Explore. Visualize. Model. Make a difference.
Better insight starts with Stata.

Stata is statistical software for data science.

Rowfill https://www.rowfill.com/

Fri Jan 17 18:33:20 2025

Open-source document processing platform built for knowledge workers.

Rowfill helps extract, analyze, and process data from complex documents, images, PDFs and more with advanced AI capabilities.

Rowfill @ GitHub.

Apache Pinot https://pinot.apache.org/

Fri Jan 17 07:30:26 2025

Insights, Unlocked in Real Time.

Apache Pinot: The real-time analytics open source platform for lightning-fast insights, effortless scaling, and cost-effective data-driven decisions.

Apache Pinot @ GitHub.

Related contents:

Serving Millions of Apache Pinot Queries with Neutrino @ Uber Blog.

GPU Glossary https://modal.com/gpu-glossary/readme

Wed Jan 15 13:29:29 2025

We wrote this glossary to solve a problem we ran into working with GPUs here at Modal: the documentation is fragmented, making it difficult to connect concepts at different levels of the stack, like Streaming Multiprocessor Architecture, Compute Capability, and nvcc compiler flags.

E2M https://github.com/wisupai/e2m

Wed Jan 15 10:57:56 2025

Everything to Markdown.

E2M converts various file types (doc, docx, epub, html, htm, url, pdf, ppt, pptx, mp3, m4a) into Markdown. It’s easy to install, with dedicated parsers and converters, supporting custom configs. E2M offers an all-in-one, flexible, and open-source solution.

JAX https://jax.readthedocs.io/en/latest/

Tue Jan 14 13:58:18 2025

High performance array computing.

Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more

JAX @ GitHub.

Related contents:

The PyTorch developer's guide to JAX fundamentals @ Google Cloud Blog.

sitefetch https://github.com/egoist/sitefetch

Tue Jan 14 09:11:55 2025

Fetch an entire site and save it as a text file (to be used with AI models).
