Search: [scraping] - Biapy Web Directory

Thu Oct 3 14:38:15 2024

A swiss-army tool for scraping and extracting data from online assets, made for hackers.

Pipet is a command line based web scraper. It supports 3 modes of operation - HTML parsing, JSON parsing, and client-side JavaScript evaluation. It relies heavily on existing tools like curl, and it uses unix pipes for extending its built-in capabilities.

Firecrawl https://www.firecrawl.dev/

Mon Sep 2 08:29:36 2024

Turn websites into LLM-ready data.

Power your AI apps with clean data crawled from any website. It's also open-source.
Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.

Firecrawl @ GitHub.

Scraperr https://github.com/jaypyles/Scraperr

Fri Aug 2 15:07:24 2024

Self-hosted webscraper.

Scraperr is a self-hosted web application that allows users to scrape data from web pages by specifying elements via XPath. Users can submit URLs and the corresponding elements to be scraped, and the results will be displayed in a table.
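As a rough illustration of selecting page elements by XPath, here is a minimal sketch using only the Python standard library. It is not Scraperr's implementation: the page fragment, the `scrape` helper and the `ELEMENTS` mapping are hypothetical, and `xml.etree.ElementTree` supports only a limited XPath subset and requires well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for illustration.
PAGE = """
<html>
  <body>
    <div class="product"><h2>Widget</h2><span class="price">9.99</span></div>
    <div class="product"><h2>Gadget</h2><span class="price">24.50</span></div>
  </body>
</html>
""".strip()

# Hypothetical mapping of column name -> XPath expression (ElementTree subset).
ELEMENTS = {
    "title": ".//div[@class='product']/h2",
    "price": ".//span[@class='price']",
}


def scrape(markup, xpaths):
    """Return a dict mapping each column name to the text of its matched elements."""
    root = ET.fromstring(markup)
    return {
        name: [element.text for element in root.findall(xpath)]
        for name, xpath in xpaths.items()
    }


table = scrape(PAGE, ELEMENTS)
for name, values in table.items():
    print(name, values)
```

Real-world HTML is rarely well-formed XML, so a production scraper would typically rely on a lenient HTML parser with full XPath support instead.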

Browserless https://www.browserless.io/

Wed Jun 19 05:32:26 2024

A Pool of Hosted Browsers, For Use With Puppeteer or Playwright.

Run your scraping, testing, screenshotting or any other automation with our pool of browsers. Ready to connect with Puppeteer, Playwright or via our APIs.

Browserless @ GitHub.

monolith https://crates.io/crates/monolith

Tue Mar 26 08:06:31 2024

CLI tool for saving complete web pages as a single HTML file.

A data hoarder’s dream come true: bundle any web page into a single HTML file. You can finally replace that gazillion of open tabs with a gazillion of .html files stored somewhere on your precious little drive.

Dark Visitors https://darkvisitors.com/

Wed Jan 31 15:31:38 2024

A List of Known AI Agents on the Internet.

Insight into the hidden ecosystem of autonomous chatbots and data scrapers crawling across the web. Protect your website from unwanted AI agent access.
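One common way to signal that an AI agent is unwelcome is a `robots.txt` rule keyed on its user agent. The sketch below checks such rules with Python's standard `urllib.robotparser`. The rules, the `is_allowed` helper and the example URL are assumptions for illustration, with `GPTBot` standing in for any listed agent; note that `robots.txt` is advisory and only restrains crawlers that choose to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: block one AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


URL = "https://example.com/articles/1"  # placeholder URL
gptbot_ok = is_allowed(ROBOTS_TXT, "GPTBot", URL)
browser_ok = is_allowed(ROBOTS_TXT, "Mozilla/5.0", URL)
print("GPTBot allowed:", gptbot_ok)
print("Browser allowed:", browser_ok)
```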

ScrapedIn https://github.com/dchrastil/ScrapedIn

Tue Dec 12 08:10:47 2023

A tool to scrape LinkedIn without API restrictions for data reconnaissance.

This tool assists in performing reconnaissance using the LinkedIn.com website/API for red team or social engineering engagements. It performs a company specific search to extract a detailed list of employees who work for the target company. Enter the name of the target company and the tool will help determine the LinkedIn company ID, which will be used to perform the search.

The Intersection of Social Engineering and Scraping Tools: Discovering ScrapedIn @ Serge Houtain's LinkedIn.

RSS-Bridge https://rss-bridge.org/

Mon Nov 27 12:29:41 2023

RSS-Bridge is a PHP project capable of generating RSS and Atom feeds for websites that don't have one. It can be used on webservers or as a stand-alone application in CLI mode.

RSS-Bridge @ GitHub.
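To give an idea of what generating a feed for a site without one involves, here is a minimal sketch that turns already-scraped entries into an RSS 2.0 document. It is not RSS-Bridge code (RSS-Bridge is written in PHP); the `build_rss` helper, the sample entries and the example URLs are all hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime


def build_rss(title, link, description, items):
    """Serialize scraped items (dicts with title, link, published) as RSS 2.0."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "link").text = item["link"]
        # RSS 2.0 expects RFC 822 style dates, which format_datetime produces.
        ET.SubElement(node, "pubDate").text = format_datetime(item["published"])
    return ET.tostring(rss, encoding="unicode")


# Hypothetical entries, as if extracted from a page that offers no feed.
SCRAPED = [
    {
        "title": "First post",
        "link": "https://example.com/posts/1",
        "published": datetime(2023, 11, 27, 12, 29, 41, tzinfo=timezone.utc),
    },
    {
        "title": "Second post",
        "link": "https://example.com/posts/2",
        "published": datetime(2023, 11, 28, 9, 0, 0, tzinfo=timezone.utc),
    },
]

feed = build_rss("Example site", "https://example.com/", "Unofficial feed", SCRAPED)
print(feed)
```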

Trafilatura https://trafilatura.readthedocs.io/en/latest/

Fri Jun 9 14:12:17 2023

A Python package & command-line tool to gather text on the Web.

Trafilatura is a Python package and command-line tool designed to gather text on the Web. It includes discovery, extraction and text processing components. Its main applications are web crawling, downloads, scraping, and extraction of main texts, metadata and comments. It aims at staying handy and modular: no database is required, the output can be converted to various commonly used formats.

Trafilatura @ GitHub

htmlq https://github.com/mgdm/htmlq

Tue Jan 31 16:05:09 2023

Like jq, but for HTML. Uses CSS selectors to extract bits of content from HTML files.
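The idea of pulling content out of HTML with a selector can be sketched with Python's standard `html.parser`. This is not htmlq (a separate command-line tool) and it handles only a simple `tag.class` selector, not full CSS; the `SelectText` class, the `select` helper and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class SelectText(HTMLParser):
    """Collect the text of elements matching a simple 'tag' or 'tag.class' selector."""

    def __init__(self, selector):
        super().__init__()
        self.tag, _, self.cls = selector.partition(".")
        self.depth = 0  # greater than 0 while inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Track nested elements with the same tag so the right end tag closes the match.
            if tag == self.tag:
                self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and (not self.cls or self.cls in classes):
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


def select(markup, selector):
    """Return the text content of every element matching the selector."""
    parser = SelectText(selector)
    parser.feed(markup)
    parser.close()
    return parser.results


# Hypothetical markup for illustration.
HTML = (
    '<ul><li class="item">First <b>one</b></li>'
    '<li class="item other">Second</li><li>Third</li></ul>'
)
matches = select(HTML, "li.item")
print(matches)
```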

WebPlotDigitizer https://automeris.io/WebPlotDigitizer/

Thu Jan 5 12:16:15 2023

Extract data from plots, images, and maps.
A web based tool to extract numerical data from plot images. Supports XY, Polar, Ternary diagrams and Maps.
It is often necessary to reverse engineer images of data visualizations to extract the underlying numerical data. WebPlotDigitizer is a semi-automated tool that makes this process extremely easy.

WebPlotDigitizer @ GitHub

Web Scraping Reference: Cheat Sheet for Web Scraping using R https://github.com/yusuzech/r-web-scraping-cheat-sheet

Tue Jan 3 13:55:24 2023

Guide, reference and cheatsheet on web scraping using rvest, httr and RSelenium.
Inspired by Hartley Brody, this cheat sheet is about web scraping using rvest, httr and RSelenium. It covers many topics in this blog.

While Hartley uses Python's requests and beautifulsoup libraries, this cheat sheet covers the usage of httr and rvest. While rvest is good enough for many scraping tasks, httr is required for more advanced techniques. Usage of RSelenium (a web driver) is also covered.

MrScraper https://mrscraper.com/

Wed Dec 7 08:18:15 2022

A hassle-free web scraper to process information from websites, easily and without getting blocked.

Browserflow https://browserflow.app/

Mon Oct 17 12:22:07 2022

Web Scraping & Web Automation.
Scrape websites. Automate tasks. No coding required.
Browserflow is a no-code/low-code Chrome extension that allows you to automate your work on any website.
Save time by automating repetitive tasks in minutes. Run in your browser or in the cloud.

Buster https://github.com/dessant/buster

Wed Aug 10 17:36:37 2022

Captcha solver extension for humans.
Buster is a browser extension which helps you to solve difficult captchas by completing reCAPTCHA audio challenges using speech recognition. Challenges are solved by clicking on the extension button at the bottom of the reCAPTCHA widget.

Txtpaper https://txtpaper.com/

Thu Feb 3 17:48:37 2022

Convert web pages into PDF, ePub, and Kindle (mobi) files

DaProfiler https://github.com/TheRealDalunacrobate/DaProfiler

Thu Dec 2 15:42:43 2021

DaProfiler allows you to get emails, social media accounts, addresses, workplaces and more on your target using web scraping and Google dorking techniques, for targets based in France only. The particularity of this program is its ability to find your target's e-mail addresses.

Linked Data Fragments http://linkeddatafragments.org/

Tue Jan 3 08:46:24 2017

Query the Web of data on Web-scale by moving intelligence from servers to clients.

RoboBrowser https://github.com/jmcarp/robobrowser

Tue May 17 07:36:47 2016

RoboBrowser is a simple, Pythonic library for browsing the web without a standalone web browser. RoboBrowser can fetch a page, click on links and buttons, and fill out and submit forms. If you need to interact with web services that don't have APIs, RoboBrowser can help.

Portia | Scrapinghub http://scrapinghub.com/portia

Sun Feb 21 17:57:08 2016

Scrape websites visually. No code required!
