Search: [scraping] - Biapy Web Directory

GitDorker https://github.com/obheda12/GitDorker

Wed Nov 20 08:18:12 2024

A Python program to scrape secrets from GitHub through usage of a large repository of dorks.

GitDorker is a tool that utilizes the GitHub Search API and an extensive list of GitHub dorks that I've compiled from various sources to provide an overview of sensitive information stored on GitHub given a search query.

The primary purpose of GitDorker is to provide the user with a clean and tailored attack surface to begin harvesting sensitive information on GitHub. GitDorker can be used with additional tools such as GitRob or Trufflehog on interesting repos or users discovered from GitDorker to produce the best results.

Scrapling https://github.com/D4Vinci/Scrapling

Wed Nov 13 07:49:01 2024

Undetectable, Lightning-Fast, and Adaptive Web Scraping for Python.

Dealing with failing web scrapers due to anti-bot protections or website changes? Meet Scrapling.

Scrapling is a high-performance, intelligent web scraping library for Python that automatically adapts to website changes while significantly outperforming popular alternatives. For both beginners and experts, Scrapling provides powerful features while maintaining simplicity.

Crawlee https://crawlee.dev/

Fri Nov 8 08:54:40 2024

Build reliable crawlers. Fast.

A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

Crawlee @ GitHub.

Crawlee for Python https://crawlee.dev/python/

Fri Nov 8 08:53:35 2024

Build your Python web crawlers using Crawlee.
It helps you build reliable Python web crawlers. Fast.

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

Crawlee for Python @ GitHub.

Maxun https://maxun-website.vercel.app/

Mon Nov 4 09:59:13 2024

Open-Source No-Code Web Data Extraction Platform.

Build custom robots to automate data scraping.

Maxun @ GitHub.

Scrappey https://scrappey.com/

Mon Oct 21 10:44:11 2024

Web Scraping API.
Tired of getting blocked while scraping the web?

Our simple-to-use API makes it easy, handling everything from rotating proxies, anti-bot technology and headless browsers to CAPTCHAs. It's never been this easy.

Cloudflare Turnstile Page & Captcha Bypass for Scraping https://github.com/sarperavci/CloudflareBypassForScraping

Mon Oct 21 10:43:33 2024

A Cloudflare verification bypass script for web scraping.

We love scraping, don't we? But sometimes, we face Cloudflare protection. This script is designed to bypass the Cloudflare protection on websites, allowing you to interact with them programmatically.

For those who would like to replace FlareSolverr @ r/yggTorrents.

Pipet https://github.com/bjesus/pipet

Thu Oct 3 14:38:15 2024

A Swiss-army tool for scraping and extracting data from online assets, made for hackers.

Pipet is a command-line web scraper. It supports three modes of operation: HTML parsing, JSON parsing, and client-side JavaScript evaluation. It relies heavily on existing tools like curl, and it uses Unix pipes to extend its built-in capabilities.

Firecrawl https://www.firecrawl.dev/

Mon Sep 2 08:29:36 2024

Turn websites into LLM-ready data.

Power your AI apps with clean data crawled from any website. It's also open-source.
🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.

Firecrawl @ GitHub.

Scraperr https://github.com/jaypyles/Scraperr

Fri Aug 2 15:07:24 2024

Self-hosted web scraper.

Scraperr is a self-hosted web application that allows users to scrape data from web pages by specifying elements via XPath. Users can submit URLs and the corresponding elements to be scraped, and the results will be displayed in a table.
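
The XPath-to-table idea can be sketched in a few lines of Python. This is only an illustration of the approach described above, not Scraperr's own code: the page snippet, the `scrape` and `as_table` helpers, and the element names are all hypothetical, and the standard library's ElementTree supports only a limited XPath subset and requires well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page snippet used only for illustration.
PAGE = """
<html>
  <body>
    <div class="product"><h2>Widget</h2><span class="price">9.99</span></div>
    <div class="product"><h2>Gadget</h2><span class="price">19.50</span></div>
  </body>
</html>
"""


def scrape(page, elements):
    """Return {name: [text, ...]} for each named XPath expression."""
    root = ET.fromstring(page)
    return {
        name: [(node.text or "").strip() for node in root.findall(xpath)]
        for name, xpath in elements.items()
    }


def as_table(results):
    """Format the scraped columns as a plain-text table, one row per match."""
    names = list(results)
    rows = zip(*(results[name] for name in names))
    lines = [" | ".join(names)]
    lines += [" | ".join(row) for row in rows]
    return "\n".join(lines)


# Each entry maps a column name to the XPath of the element to scrape.
results = scrape(PAGE, {
    "name": ".//div[@class='product']/h2",
    "price": ".//span[@class='price']",
})
print(as_table(results))
```

Real-world HTML is rarely well-formed XML, so a production scraper would more likely rely on a tolerant HTML parser with full XPath support.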

Browserless https://www.browserless.io/

Wed Jun 19 05:32:26 2024

A Pool of Hosted Browsers, For Use With Puppeteer or Playwright.

Run your scraping, testing, screenshotting or any other automation with our pool of browsers. Ready to connect to with Puppeteer, Playwright or via our APIs.

Browserless @ GitHub.

monolith https://crates.io/crates/monolith

Tue Mar 26 08:06:31 2024

⬛️ CLI tool for saving complete web pages as a single HTML file.

A data hoarder’s dream come true: bundle any web page into a single HTML file. You can finally replace that gazillion of open tabs with a gazillion of .html files stored somewhere on your precious little drive.

Dark Visitors https://darkvisitors.com/

Wed Jan 31 15:31:38 2024

A List of Known AI Agents on the Internet.

Insight into the hidden ecosystem of autonomous chatbots and data scrapers crawling across the web. Protect your website from unwanted AI agent access.
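
One common way to act on such a list is a robots.txt file that disallows specific crawler user agents. The sketch below is an assumption about how that might look, not something taken from Dark Visitors itself: the rules, the `is_allowed` helper, and the example URL are illustrative, while `GPTBot` and `CCBot` are the user agent tokens published by OpenAI and Common Crawl. It uses Python's standard `urllib.robotparser` to check what a compliant crawler would conclude.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: block two known AI crawlers, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if a crawler honoring robots_txt may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for agent in ("GPTBot", "CCBot", "Mozilla/5.0"):
    print(agent, is_allowed(ROBOTS_TXT, agent, "https://example.com/articles/1"))
```

Note that robots.txt is advisory: it only affects crawlers that choose to honor it, so blocking non-compliant agents needs server-side filtering as well.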

ScrapedIn https://github.com/dchrastil/ScrapedIn

Tue Dec 12 08:10:47 2023

A tool to scrape LinkedIn without API restrictions for data reconnaissance.

This tool assists in performing reconnaissance using the LinkedIn.com website/API for red team or social engineering engagements. It performs a company specific search to extract a detailed list of employees who work for the target company. Enter the name of the target company and the tool will help determine the LinkedIn company ID, which will be used to perform the search.

🔗🤖 The intersection of social engineering and scraping tools: discovering ScrapedIn 🔍🌐 @ Serge Houtain's LinkedIn (in French).

RSS-Bridge https://rss-bridge.org/

Mon Nov 27 12:29:41 2023

RSS-Bridge is a PHP project capable of generating RSS and Atom feeds for websites that don't have one. It can be used on webservers or as a stand-alone application in CLI mode.
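
To illustrate what generating a feed for a site without one involves, here is a minimal Python sketch that turns a list of already-scraped items into an RSS 2.0 document. It is not RSS-Bridge code (which is PHP): the `build_rss` helper, the site name, and the item data are hypothetical, and the scraping step itself is assumed to have happened elsewhere.

```python
import xml.etree.ElementTree as ET


def build_rss(title, link, description, items):
    """Build an RSS 2.0 document (as a string) from scraped items."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # title, link and description are the required channel elements.
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "link").text = item["link"]
    return ET.tostring(rss, encoding="unicode")


# Hypothetical items, as a scraper might have extracted them from a page.
feed = build_rss(
    "Example News",
    "https://example.com/",
    "Feed generated for a site that does not publish one.",
    [
        {"title": "First post", "link": "https://example.com/posts/1"},
        {"title": "Second post", "link": "https://example.com/posts/2"},
    ],
)
print(feed)
```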

RSS-Bridge @ GitHub.

Trafilatura https://trafilatura.readthedocs.io/en/latest/

Fri Jun 9 14:12:17 2023

A Python package & command-line tool to gather text on the Web.

Trafilatura is a Python package and command-line tool designed to gather text on the Web. It includes discovery, extraction and text processing components. Its main applications are web crawling, downloads, scraping, and extraction of main texts, metadata and comments. It aims at staying handy and modular: no database is required, the output can be converted to various commonly used formats.

Trafilatura @ GitHub.

htmlq https://github.com/mgdm/htmlq

Tue Jan 31 16:05:09 2023

Like jq, but for HTML. Uses CSS selectors to extract bits of content from HTML files.

WebPlotDigitizer https://automeris.io/WebPlotDigitizer/

Thu Jan 5 12:16:15 2023

Extract data from plots, images, and maps.
A web-based tool to extract numerical data from plot images. Supports XY, Polar, Ternary diagrams and Maps.
It is often necessary to reverse engineer images of data visualizations to extract the underlying numerical data. WebPlotDigitizer is a semi-automated tool that makes this process extremely easy.

WebPlotDigitizer @ GitHub.

Web Scraping Reference: Cheat Sheet for Web Scraping using R https://github.com/yusuzech/r-web-scraping-cheat-sheet

Tue Jan 3 13:55:24 2023

Guide, reference and cheatsheet on web scraping using rvest, httr and RSelenium.
Inspired by Hartley Brody, this cheat sheet is about web scraping using rvest, httr and RSelenium. It covers many of the topics in his blog.

While Hartley uses Python's requests and beautifulsoup libraries, this cheat sheet covers the usage of httr and rvest. While rvest is good enough for many scraping tasks, httr is required for more advanced techniques. Usage of RSelenium (web driver) is also covered.

MrScraper https://mrscraper.com/

Wed Dec 7 08:18:15 2022

A hassle-free web scraper to process information from websites, easily and without getting blocked.
