data-analytics
Count your code, quickly.
Tokei is a program that displays statistics about your code. Tokei will show the number of files, the total lines within those files, and the code, comments, and blanks grouped by language.
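As a rough illustration of the kind of counting described above, here is a minimal Python sketch that classifies lines as code, comments, or blanks. It is not Tokei's implementation: the `count_lines` helper and its `comment_prefix` parameter are hypothetical, and it only recognizes whole-line, single-prefix comments.

```python
def count_lines(source: str, comment_prefix: str = "#") -> dict:
    """Classify each line as blank, comment, or code.

    Assumption: a comment is a line whose first non-space characters are
    `comment_prefix`. Block comments and trailing comments are not detected.
    """
    counts = {"lines": 0, "code": 0, "comments": 0, "blanks": 0}
    for line in source.splitlines():
        stripped = line.strip()
        counts["lines"] += 1
        if not stripped:
            counts["blanks"] += 1
        elif stripped.startswith(comment_prefix):
            counts["comments"] += 1
        else:
            counts["code"] += 1
    return counts


sample = "# greet the user\n\nprint('hello')\nprint('bye')  # trailing comment\n"
stats = count_lines(sample)
print(stats)
```

A line with a trailing comment is counted as code here, which is one of the simplifications a real counter has to handle per language.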
CSVs sliced, diced & analyzed.
qsv (pronounced "Quicksilver") is a command line program for indexing, slicing, analyzing, filtering, enriching, validating & joining CSV files.
Amplify the Impact of Your People, Expertise & Data.
Altair and RapidMiner share the same vision to make data analytics simple enough for all users, but scalable, governed, and safe enough for all enterprises. RapidMiner is the enterprise-ready data science platform that amplifies the collective impact of your people, expertise and data for breakthrough competitive advantage.
Protect your business, scale your security. Open Source Vulnerability Management Platform.
Security has two difficult tasks: designing smart ways of getting new information, and keeping track of findings to improve remediation efforts. With Faraday, you may focus on discovering vulnerabilities while we help you with the rest. Just use it in your terminal and get your work organized on the run. Faraday was made to let you take advantage of the available tools in the community in a truly multiuser way.
Faraday aggregates and normalizes the data you load, letting you explore it through different visualizations that are useful to managers and analysts alike.
The universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics.
Apache Arrow defines a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs. The Arrow memory format also supports zero-copy reads for lightning-fast data access without serialization overhead.
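To make the idea of a columnar layout with zero-copy reads more concrete, here is a small sketch using only the Python standard library. It does not use Arrow or reproduce its memory format; the `rows`, `ids`, and `scores` names are made up for illustration. Each column lives in its own contiguous typed buffer, and a `memoryview` slice reads from that buffer without copying it.

```python
from array import array

# Row-oriented input: one tuple per record (hypothetical sample data).
rows = [(1, 9.5), (2, 7.25), (3, 8.0)]

# Column-oriented layout: one contiguous typed buffer per column.
ids = array("q", (r[0] for r in rows))     # 64-bit signed integers
scores = array("d", (r[1] for r in rows))  # 64-bit floats

# An analytic operation scans a single column without touching the others.
total = sum(scores)

# A memoryview slice shares the underlying buffer: no data is copied.
view = memoryview(scores)[1:]

# Writing through the original column is visible through the view,
# which shows that both refer to the same memory.
scores[2] = 10.0
print(total, view[1])
```

The shared buffer is the point of the example: consumers can read a column, or a slice of it, without a serialization or copy step.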
Related contents:
PyPI Package Statistics & Analytics
Track downloads, analyze trends, and gain insights into the Python ecosystem
Data Runs Better on SDF. Transform Data Better with SDF. SDF is the fastest way to build a scalable, reliable, and optimized data warehouse.
SDF is a developer platform for data that scales SQL understanding across an organization, empowering all data teams to unlock the full potential of their data.
SDF is a multi-dialect SQL compiler, transformation framework, and analytical database engine. It natively compiles SQL dialects, like Snowflake, and connects to their corresponding data warehouses to materialize models.
Know Your User™
Open source user analytics for sovereign cybersecurity.
Tirreno is open-source user analytics software.
Tirreno is a universal analytic tool for monitoring online platforms, web applications, SaaS, communities, IoT, mobile applications, intranets, and e-commerce websites. It is effective against external threats associated with partners or customers, as well as internal risks posed by employees or suppliers.
DuckDB is an in-process SQL OLAP database management system.
DuckDB is a high-performance analytical database system. It is designed to be fast, reliable, portable, and easy to use. DuckDB provides a rich SQL dialect, with support far beyond basic SQL. DuckDB supports arbitrary and nested correlated subqueries, window functions, collations, complex types (arrays, structs, maps), and several extensions designed to make SQL easier to use.
Related contents:
- DuckDB - Le moteur SQL qui transforme vos données @ Korben :fr:.
- Why DuckDB is my first choice for data processing @ robinlinacre.
- DuckDB is Probably the Most Important Geospatial Software of the Last Decade @ dbreunig.com.
- Why Semantic Layers Matter — and How to Build One with DuckDB @ MotherDuck.
- Querying Billions of GitHub Events Using Modal and DuckDB (Part 1: Ingesting Data) @ noreasontopanic.
- DuckDB beats Polars for 1TB of data @ Confessions of a Data Guy.
- Building Your Modern Data Analytics Stack with Python, Parquet, and DuckDB @ KDnuggets.
- Building an Obsidian RAG with DuckDB and MotherDuck @ MotherDuck.
KNIME Analytics Platform is free and open source, which ensures users remain on the bleeding edge of data science, with 300+ connectors to data sources and integrations with all popular machine learning libraries.
Use SQL for everything. Query anything with old-school cool SQL.
Anyquery is a CLI tool to run SQL queries on any data source, no matter if it's a file, an API, logs, or a local app. See the integrations for the full extent of what you can do.
Open Source, SQL-driven Data Dashboards powered by DuckDB.
Build analytics dashboards simply by writing SQL.
The best dashboards are built with code. Create fast, beautiful data apps, dashboards, and reports from the command line. Write Markdown, JavaScript, SQL, Python, R… and any language you like. Free and open-source.
A static site generator for data apps, dashboards, reports, and more. Observable Framework combines JavaScript on the front-end for interactive graphics with any language on the back-end for data analysis.
SedonaDB is an open-source single-node analytical database engine with geospatial as a first-class citizen. It aims to deliver the fastest spatial analytics query speed and the most comprehensive function coverage available.
🦘 Explore multimedia datasets at scale.
Kangas is a tool for exploring, analyzing, and visualizing large-scale multimedia data. It provides a straightforward Python API for logging large tables of data, along with an intuitive visual interface for performing complex queries against your dataset.
Moose lets you develop analytical backends in pure TypeScript or Python code. The developer framework for your data & analytics stack.
Moose is an open source developer framework for building analytical backends. Moose is designed to help you quickly prototype, productionize, and scale data products, data pipelines, and data APIs - on OLAP and streaming infrastructure - using native TypeScript or Python.
KNIME offers a complete platform for end-to-end data science, from creating analytic models, to deploying them and sharing insights within the organization, through to data apps and services.
A simple way to access various statistics in a git repository. Git quick statistics is a simple and efficient way to access various statistics in a git repository.
Any git repository may contain tons of information about commits, contributors, and files. Extracting this information is not always trivial, mostly because there are a gadzillion options to a gadzillion git commands - I don't think there is a single person alive who knows them all. Probably not even Linus Torvalds himself :).
Index your Gmail account to a SQLite DB and play with the data.
This is a script to download emails from Gmail and store them in a SQLite database for further analysis. I find it extremely useful to have all my emails in a database to run queries on them. For example, I can find out how many emails I received per sender, which emails take the most space, and which emails from which sender I never read.
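The queries mentioned above could look roughly like the following sketch. The `emails` table and its `sender`, `subject`, `size_bytes`, and `is_read` columns are assumptions made for this example, not the script's actual schema, and the rows are invented sample data held in an in-memory database.

```python
import sqlite3

# In-memory database with a hypothetical schema and sample rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emails (sender TEXT, subject TEXT, size_bytes INTEGER, is_read INTEGER)"
)
rows = [
    ("alice@example.com", "Hi", 1200, 1),
    ("alice@example.com", "Report", 50000, 1),
    ("alice@example.com", "Follow-up", 2000, 0),
    ("news@example.com", "Weekly digest", 300000, 0),
    ("news@example.com", "Special offer", 250000, 0),
    ("bob@example.com", "Lunch?", 900, 1),
]
conn.executemany("INSERT INTO emails VALUES (?, ?, ?, ?)", rows)

# How many emails were received per sender.
per_sender = conn.execute(
    "SELECT sender, COUNT(*) AS n FROM emails GROUP BY sender ORDER BY n DESC, sender"
).fetchall()

# Which emails take the most space.
largest = conn.execute(
    "SELECT subject, size_bytes FROM emails ORDER BY size_bytes DESC LIMIT 2"
).fetchall()

# Senders whose emails were never read.
never_read = conn.execute(
    "SELECT sender FROM emails GROUP BY sender HAVING SUM(is_read) = 0 ORDER BY sender"
).fetchall()

print(per_sender, largest, never_read)
```

Against a real database file, only the `sqlite3.connect` path and the table and column names would need to change to match what the script actually creates.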
PostgreSQL log analyzer.
pgBadger is a PostgreSQL log analyzer built for speed with fully detailed reports and professional rendering.
Kubernetes usage analytics for CPU, Memory, and GPU — track costs and optimize cluster resources.
kube-opex-analytics is a Kubernetes usage accounting and analytics tool that helps organizations track CPU, Memory, and GPU resources consumed by their clusters over time (hourly, daily, monthly).
All-in-One Desktop App to Analyze Data Locally.
TextQuery is an all-in-one desktop app to import, query, modify, and visualize your raw data with SQL.
Zircolite is a standalone tool written in Python 3. It lets you apply SIGMA rules to MS Windows EVTX (EVTX, XML, and JSONL formats), Auditd logs, Sysmon for Linux, and EVTXtract logs.
DataEase is an open source data visualization analysis tool that helps users quickly analyze data and gain insights into business trends, thereby improving and optimizing their business. DataEase supports a wide range of data source connections, can quickly create charts by dragging and dropping, and can be easily shared with others.
krishnaik06/The-Grand-Complete-Data-Science-Materials: a GitHub repository of data science materials.
Download and parse data from Garmin Connect or a Garmin watch, FitBit CSV, and MS Health CSV files into SQLite serverless databases, and analyze the data with Jupyter notebooks.
Python scripts for parsing health data into a SQLite database and manipulating it there. SQLite is a lightweight database that doesn't require a server.
Sonarr & Radarr Media Library Insights.
Sortarr is a lightweight web dashboard for Sonarr and Radarr that helps you understand how your media library uses storage. It is not a Plex tool, but it is useful in Plex setups for spotting oversized series or movies and comparing quality vs. size trade-offs.
Efficient data transformation and modeling framework that is backwards compatible with dbt.
SQLMesh is a next-generation data transformation framework designed to ship data quickly, efficiently, and without error. Data teams can efficiently run and deploy data transformations written in SQL or Python with visibility and control at any size.
Graphic Walker is a different type of open-source alternative to Tableau. It allows data scientists to analyze data and visualize patterns with simple drag-and-drop or natural language query operations.
System for collecting, deriving and querying facts about source code.
Glean is a system for working with facts about source code. You can use it for:
- Collecting and storing detailed information about code structure. Glean is designed around an efficient storage model that enables storing information about code at scale.
- Querying information about code, to power tools and experiences from online IDE features to offline code analysis.
Source: Indexing code at scale with Glean @ Engineering at Meta.
An automated document analyzer for Paperless-ngx using OpenAI API and Ollama (Mistral, llama, phi 3, gemma 2) to automatically analyze and tag your documents.
It features: Automode, Manual Mode, Ollama and OpenAI, a Chat function to query your documents with AI, and a modern and intuitive web interface.
Real-time usage monitor for Claude Code — session limits, weekly limits, and plan tier with colour-coded progress bars
Unified Engine for large-scale data analytics.
Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
Data analysis & OSINT tool for everyone.
warning: created by ex-employee of the FSB
Rapidly Search and Hunt through Windows Forensic Artefacts.
Chainsaw provides a powerful ‘first-response’ capability to quickly identify threats within Windows forensic artefacts such as Event Logs and MFTs. Chainsaw offers a generic and fast method of searching through event logs for keywords, and by identifying threats using built-in support for Sigma detection rules, and via custom Chainsaw detection rules.
The Analytics Agent built for context engineering. Build your agent context like a file system.
Deploy a chat UI for anyone to run analytics on your data.
Slice and dice log files on the command line.
Angle-grinder allows you to parse, aggregate, sum, average, min/max, percentile, and sort your data. You can see it, live-updating, in your terminal. Angle grinder is designed for when, for whatever reason, you don't have your data in graphite/honeycomb/kibana/sumologic/splunk/etc. but still want to be able to do sophisticated analytics.
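The operations listed above (parse, aggregate, average, max, percentile) can be sketched in plain Python as follows. This is not angle-grinder's query syntax or implementation: the log format, the `log_lines` sample, and the `percentile` helper are assumptions made for illustration.

```python
import math
from collections import defaultdict

# Hypothetical access-log lines: method, path, status, latency in ms.
log_lines = [
    "GET /home 200 12",
    "GET /home 200 18",
    "GET /api 500 250",
    "GET /api 200 40",
    "GET /api 200 61",
]


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Parse: split each line into fields and group latencies by path.
latencies = defaultdict(list)
for line in log_lines:
    method, path, status, ms = line.split()
    latencies[path].append(int(ms))

# Aggregate: count, average, max, and median latency per path.
summary = {
    path: {
        "count": len(values),
        "avg": sum(values) / len(values),
        "max": max(values),
        "p50": percentile(values, 50),
    }
    for path, values in latencies.items()
}

# Sort: slowest paths first, by average latency.
for path, stats in sorted(summary.items(), key=lambda kv: kv[1]["avg"], reverse=True):
    print(path, stats)
```

A dedicated tool adds what this sketch lacks: streaming input, live-updating output, and a compact query language instead of hand-written loops.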
Analytics and data science notebook for teams. Jupyter notebook for the AI era.
- Link Snowflake, BigQuery, CSVs, and 60+ data sources
- Write in Python, SQL, R, or just prompt Deepnote Agent
- Build powerful data apps and dashboards with AI
Git-like version control CLI backed by PostgreSQL with pg-xpatch delta compression.
Library and tools for information extraction.
This project provides free (even for commercial use) state-of-the-art information extraction tools. The current release includes tools for performing named entity extraction and binary relation detection as well as tools for training custom extractors and relation detectors.
Department of Education (DOE) for New South Wales (AUS) data stack in a box. With the push of one button you can have your own data stack up and running in 5 mins! 🏎️.
Visualise your CSV files in seconds without sending your data anywhere.
Open and unified metadata platform for data discovery, observability, and governance.
A single place for all your data and all your data practitioners to build and manage high quality data assets at scale. Built by Collate and the founders of Apache Hadoop, Apache Atlas, and Uber Databook.
OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration. It is one of the fastest-growing open-source projects with a vibrant community and adoption by a diverse set of companies in a variety of industry verticals. Based on Open Metadata Standards and APIs, supporting connectors to a wide range of data services, OpenMetadata enables end-to-end metadata management, giving you the freedom to unlock the value of your data assets.
Semantic Data Processing. Build data processing and data analysis pipelines that leverage the power of LLMs 🧠
Semlib is a Python library for building data processing and data analysis pipelines that leverage the power of large language models (LLMs). Semlib provides, as building blocks, familiar functional programming primitives like map, reduce, sort, and filter, but with a twist: Semlib's implementations of these operations are programmed with natural language descriptions rather than code. Under the hood, Semlib handles complexities such as prompting, parsing, concurrency control, caching, and cost tracking.
Insights, Unlocked in Real Time.
Apache Pinot™: The real-time analytics open source platform for lightning-fast insights, effortless scaling, and cost-effective data-driven decisions.
Open Source Business Intelligence
The simplest, fastest way to get business intelligence and analytics to everyone in your company 😋
A project providing a Graphic Walker Pane for use with HoloViz Panel.
A simple way to explore your data through a Tableau-like interface directly in your Panel data applications.
panel-graphic-walker brings the power of Graphic Walker to your data science workflow, seamlessly integrating interactive data exploration into notebooks and Panel applications. Effortlessly create dynamic visualizations, analyze datasets, and build dashboards—all within a Pythonic, intuitive interface.
dbt™ is a SQL-first transformation workflow that lets teams quickly and collaboratively deploy analytics code following software engineering best practices like modularity, portability, CI/CD, and documentation. Now anyone on the data team can safely contribute to production-grade data pipelines.
Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats.
The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).
Open-source SQL AI Agent. Text2SQL made Easy!
Wren AI is an open-source SQL AI Agent that empowers data, product, and business teams to access insights through AI chat, with a well-designed, intuitive built-in UI and UX that integrates seamlessly with tools like Excel and Google Sheets.
text2vec is an R package which provides an efficient framework with a concise API for text analysis and natural language processing (NLP).
Financial data platform for analysts, quants and AI agents. The AI Workspace for Finance.
Bridge your data with AI. Build AI-powered analytics applications, faster, securely and on your terms.
Interactive SQL. Analyze petabyte-scale data where it lives with ease and flexibility.
Amazon Athena is a serverless, interactive analytics service built on open-source frameworks, supporting open-table and file formats. Athena provides a simplified, flexible way to analyze petabytes of data where it lives. Analyze data or build applications from an Amazon Simple Storage Service (S3) data lake and 30 data sources, including on-premises data sources or other cloud systems using SQL or Python. Athena is built on open-source Trino and Presto engines and Apache Spark frameworks, with no provisioning or configuration effort required.
Kylin is a high-concurrency, high-performance, and intelligent OLAP engine that provides a low-cost, ultimate data analytics experience.
AI Call Analytics. Clean, annotate, and summarize call transcripts with GPT-4.5.
Open Source AI Calling Transcriptions, Summaries, and Analytics built on OpenAI Whisper.
Understand how your team codes with AI. Coding Agent Analytics for Claude Code.
Rudel gives engineering leaders visibility into Claude Code usage across their team. Track productivity, quantify ROI, and surface quality signals, automatically.
Basic PHP resource profiler (CPU/memory), safe and useful for production sites.
phptop prints per-query and average metrics comparable to 'time' (wallclock, user, and system CPU time) along with memory and other resource usage.
It can be easily globally activated on a LAMP server and requires little resources and a single line configuration change in your php.ini. It has been used by Bearstech on many production servers for years without any problems.
Zero-ETL data analytics with Postgres.
Simple and cost-effective cloud analytics platform automatically synced with your data sources.
BemiDB is a Postgres read replica optimized for analytics. It consists of a single binary that seamlessly connects to a Postgres database, replicates the data in a compressed columnar format, and allows you to run complex queries using its Postgres-compatible analytical query engine.