A Swiss Army tool for scraping and extracting data from online assets, made for hackers.
Pipet is a command-line web scraper. It supports three modes of operation: HTML parsing, JSON parsing, and client-side JavaScript evaluation. It relies heavily on existing tools like curl, and it uses Unix pipes to extend its built-in capabilities.
Turn websites into LLM-ready data.
Power your AI apps with clean data crawled from any website. It's also open-source.
Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.
A Pool of Hosted Browsers, For Use With Puppeteer or Playwright.
Run your scraping, testing, screenshotting, or any other automation with our pool of browsers. Ready to connect to with Puppeteer, Playwright, or via our APIs.
CLI tool for saving complete web pages as a single HTML file.
A data hoarder’s dream come true: bundle any web page into a single HTML file. You can finally replace that gazillion of open tabs with a gazillion of .html files stored somewhere on your precious little drive.
A tool to scrape LinkedIn without API restrictions for data reconnaissance.
This tool assists in performing reconnaissance using the LinkedIn.com website/API for red team or social engineering engagements. It performs a company specific search to extract a detailed list of employees who work for the target company. Enter the name of the target company and the tool will help determine the LinkedIn company ID, which will be used to perform the search.
RSS-Bridge is a PHP project capable of generating RSS and Atom feeds for websites that don't have one. It can be used on webservers or as a stand-alone application in CLI mode.
A Python package & command-line tool to gather text on the Web.
Trafilatura is a Python package and command-line tool designed to gather text on the Web. It includes discovery, extraction and text processing components. Its main applications are web crawling, downloads, scraping, and extraction of main texts, metadata and comments. It aims at staying handy and modular: no database is required, the output can be converted to various commonly used formats.
Extract data from plots, images, and maps.
A web-based tool to extract numerical data from plot images. Supports XY, polar, and ternary diagrams, as well as maps.
It is often necessary to reverse engineer images of data visualizations to extract the underlying numerical data. WebPlotDigitizer is a semi-automated tool that makes this process extremely easy.
Guide, reference and cheatsheet on web scraping using rvest, httr and RSelenium.
Inspired by Hartley Brody, this cheat sheet is about web scraping using rvest, httr and RSelenium. It covers many topics in this blog.
While Hartley uses Python's requests and beautifulsoup libraries, this cheat sheet covers the usage of httr and rvest. While rvest is good enough for many scraping tasks, httr is required for more advanced techniques. Usage of RSelenium (web driver) is also covered.
A hassle-free web scraper to process information from websites, easily and without getting blocked.
Convert web pages into PDF, ePub, and Kindle (mobi) files.
DaProfiler lets you find emails, social media accounts, addresses, workplaces and more for your target using web scraping and Google dorking techniques. It works only for targets based in France. What sets this program apart is its ability to find your target's email addresses.
Query the Web of data on Web-scale by moving intelligence from servers to clients.
Scrape websites visually. No code required!