parser
A Swift package for parsing iWork Keynote, Pages, and Numbers documents.
A Swift package for parsing and extracting content from Apple iWork documents (Pages, Numbers, and Keynote). WorkKit provides a straightforward API to open iWork documents and traverse their content.
a streaming JSON parser.
jsonriver is a simple JS library that will parse JSON incrementally as it streams in, e.g. from a network request or a language model. It gives you a sequence of increasingly complete values.
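The idea of "a sequence of increasingly complete values" can be sketched in a few lines. The following is not jsonriver's API or algorithm (jsonriver is a JavaScript library); it is a naive Python illustration, with hypothetical function names, that re-parses the buffered prefix after each chunk by appending whatever closing quotes and brackets are still missing. A real streaming parser would avoid re-parsing from the start each time.

```python
import json
from typing import Any, Iterable, Iterator


def close_prefix(partial: str) -> str:
    """Append the closers a JSON prefix still needs (hypothetical helper)."""
    closers = []          # pending "]" / "}" in the order they were opened
    in_string = False
    escaped = False
    for ch in partial:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "[":
            closers.append("]")
        elif ch == "{":
            closers.append("}")
        elif ch in "]}" and closers:
            closers.pop()
    return partial + ('"' if in_string else "") + "".join(reversed(closers))


def incremental_values(chunks: Iterable[str]) -> Iterator[Any]:
    """Yield a best-effort value after each chunk that leaves a parseable prefix."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        try:
            yield json.loads(close_prefix(buffer))
        except json.JSONDecodeError:
            # The prefix ends mid-token (e.g. after a comma or a key); wait for more.
            continue


if __name__ == "__main__":
    for value in incremental_values(['{"name": "Al', 'ice", "tags": ["a"', ', "b"]}']):
        print(value)
```

Each yielded value is a superset of the previous one, which is the property a UI rendering a streamed model response would rely on.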
Kong is a command-line parser for Go.
Kong aims to support arbitrarily complex command-line structures with as little developer effort as possible.
To achieve that, command-lines are expressed as Go types, with the structure and tags directing how the command line is mapped onto the struct.
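Kong itself is a Go library and its API is not shown here. As a rough analogy only, the same idea (declare the command line as a type, let field metadata drive the mapping) can be sketched in Python with a dataclass and `argparse`. The `ServeArgs` class, its fields, and the `positional` metadata key are all hypothetical names chosen for this illustration.

```python
import argparse
from dataclasses import dataclass, field, fields


@dataclass
class ServeArgs:
    # Hypothetical command definition; the metadata plays the role of struct tags.
    path: str = field(metadata={"help": "Directory to serve.", "positional": True})
    port: int = field(default=8000, metadata={"help": "Port to listen on."})
    verbose: bool = field(default=False, metadata={"help": "Enable verbose logging."})


def build_parser(cls) -> argparse.ArgumentParser:
    """Derive an argparse parser from the dataclass fields and their metadata."""
    parser = argparse.ArgumentParser(prog=cls.__name__.lower())
    for f in fields(cls):
        help_text = f.metadata.get("help")
        if f.metadata.get("positional"):
            parser.add_argument(f.name, type=f.type, help=help_text)
        elif f.type is bool:
            parser.add_argument(f"--{f.name}", action="store_true", help=help_text)
        else:
            parser.add_argument(f"--{f.name}", type=f.type, default=f.default, help=help_text)
    return parser


def parse(cls, argv):
    """Parse argv and return a populated instance of the declared type."""
    return cls(**vars(build_parser(cls).parse_args(argv)))


if __name__ == "__main__":
    print(parse(ServeArgs, ["./public", "--port", "9000", "--verbose"]))
```

The point of the design is that adding a flag means adding a field; no separate parser code needs to be kept in sync.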
Fast, all-in-one JavaScript parser and generator for RSS, Atom, RDF, and JSON Feed, with support for popular namespaces (Podcast, iTunes, Dublin Core) and OPML files.
Feedsmith offers universal and format‑specific parsers that maintain the original feed structure in a clean, object-oriented format while intelligently normalizing legacy elements. Access all feed data without compromising simplicity.
Python SQL Parser and Transpiler.
SQLGlot is a no-dependency SQL parser, transpiler, optimizer, and engine. It can be used to format SQL or translate between 31 different dialects like DuckDB, Presto / Trino, Spark / Databricks, Snowflake, and BigQuery. It aims to read a wide variety of SQL inputs and output syntactically and semantically correct SQL in the targeted dialects.
A set of Swift libraries for parsing, inspecting, generating, and transforming Swift source code.
The swift-syntax package is a set of libraries that work on a source-accurate tree representation of Swift source code, called the SwiftSyntax tree. The SwiftSyntax tree forms the backbone of Swift’s macro system – the macro expansion nodes are represented as SwiftSyntax nodes and a macro generates a SwiftSyntax tree to be inserted into the source file.
10x faster dynamic Protobuf parsing in Go that’s even 3x faster than generated code.
hyperpb is a highly optimized dynamic message library for Protobuf, aimed at read-only workloads. It is designed to be a drop-in replacement for dynamicpb, protobuf-go's canonical solution for working with completely dynamic messages.
OSV is a high-performance CSV parser for Ruby, implemented in Rust. It wraps BurntSushi's excellent csv-rs crate.
It provides a simple interface for reading CSV files with support for both hash-based and array-based row formats.
The array-based mode is faster than the hash-based mode, so if you don't need the hash keys, use the array-based mode.
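The two row formats are easy to picture with a small example. This is not OSV's Ruby interface; it is a sketch using Python's standard `csv` module, with hypothetical helper names, that shows the same distinction: array rows are plain lists of fields, while hash rows pair every field with its header name (which is the extra work a hash-based mode has to do per row).

```python
import csv
import io

SAMPLE = "id,name\n1,Ada\n2,Grace\n"


def rows_as_arrays(text: str) -> list:
    """Array-based rows: each row is a list of field strings."""
    reader = csv.reader(io.StringIO(text))
    next(reader, None)  # skip the header line
    return list(reader)


def rows_as_hashes(text: str) -> list:
    """Hash-based rows: each row maps header name to field string."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


if __name__ == "__main__":
    print(rows_as_arrays(SAMPLE))
    print(rows_as_hashes(SAMPLE))
```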
HTML-aware ERB parsing. Powerful and seamless HTML-aware ERB parsing and tooling.
Next-generation HTML+ERB parsing for smarter developer tooling and more. Herb is an HTML-aware Embedded Ruby parsing tool built on Prism, Ruby's official parser.
C Markdown parser. Fast. SAX-like interface. Compliant with the CommonMark specification.
Parsing gigabytes of JSON per second: used by Facebook/Meta Velox, the Node.js runtime, ClickHouse, WatermelonDB, Apache Doris, Milvus, StarRocks
JSON is everywhere on the Internet. Servers spend a lot of time parsing it. The simdjson library uses commonly available SIMD instructions and microparallel algorithms to break speed records.
JSON parser creating Rust objects in-memory.
Data processing with ML, LLM and Vision LLM.
Sparrow is an innovative open-source solution for efficient data extraction and processing from various documents and images. It seamlessly handles forms, bank statements, invoices, receipts, and other unstructured data sources. Sparrow stands out with its modular architecture, offering independent services and pipelines all optimized for robust performance.
Python tool for converting files and office documents to Markdown. MarkItDown is a utility for converting various files to Markdown (e.g., for indexing or text analysis).
Convert Anything into Structured Actionable Data.
Ingest, parse, and optimize any data format ➡️ from documents to multimedia ➡️ for enhanced compatibility with GenAI frameworks.
OmniParse is a platform that ingests and parses any unstructured data into structured, actionable data optimized for GenAI (LLM) applications. Whether you are working with documents, tables, images, videos, audio files, or web pages, OmniParse prepares your data to be clean, structured, and ready for AI applications such as RAG, fine-tuning, and more.
Extract structured data from PDFs. Stop wasting time extracting PDFs. Transform your PDF documents into structured data with Documind. Simple, powerful and open-source.
Documind is an advanced document processing tool that leverages AI to extract structured data from PDFs. It is built to handle PDF conversions, extract relevant information, and format results as specified by customizable schemas.
MinerU is a tool that converts PDFs into machine-readable formats (e.g., markdown, JSON), allowing for easy extraction into any format. MinerU was born during the pre-training process of InternLM. We focus on solving symbol conversion issues in scientific literature and hope to contribute to technological development in the era of large models. Compared to well-known commercial products, MinerU is still young. If you encounter any issues or if the results are not as expected, please submit an issue and attach the relevant PDF.
A comprehensive test suite for RFC 8259 compliant JSON parsers
JSON for Classic C++.
json.cpp is a baroque JSON parsing / serialization library for C++.
Docling parses documents and exports them to the desired format with ease and speed. 🗂️ Reads popular document formats (PDF, DOCX, PPTX, Images, HTML, AsciiDoc, Markdown) and exports to Markdown and JSON.
The Data Processor for Agents.
Marly allows your agents to extract tables and text from your PDFs, PowerPoints, etc. in a structured format, making it easy for them to take subsequent actions (database call, API call, creating a chart, etc.).
OmniParser is a comprehensive method for parsing user interface screenshots into structured and easy-to-understand elements, which significantly enhances the ability of GPT-4V to generate actions that can be accurately grounded in the corresponding regions of the interface.
Detect and extract tables to markdown and csv.
Tabled is a small library for detecting and extracting tables. It uses surya to find all the tables in a PDF, identifies the rows/columns, and formats cells into markdown, csv, or html.
A library and language for building parsers, interpreters, compilers, etc.
Ohm is a parsing toolkit consisting of a library and a domain-specific language. You can use it to parse custom file formats or quickly build parsers, interpreters, and compilers for programming languages.
Langium is an open source language engineering tool with first-class support for the Language Server Protocol, written in TypeScript and running in Node.js.
Parser Building Toolkit for JavaScript.
Chevrotain is a blazing fast and feature-rich Parser Building Toolkit for JavaScript with built-in support for LL(K) grammars and a 3rd party plugin for LL(*) grammars. It can be used to build parsers, compilers, and interpreters for various use cases ranging from simple configuration files to full-fledged programming languages.
A snarky 1kb Markdown parser written in JavaScript. Snarkdown is a dead simple 1kb Markdown parser.
It's designed to be as minimal as possible, for constrained use-cases where a full Markdown parser would be inappropriate.