Search: [scraping] - Biapy Web Directory

PriceBuddy https://github.com/jez500/pricebuddy

Fri Mar 28 15:06:52 2025

email

PriceBuddy is an open source, self-hostable, web application that allows users to compare prices of products from different online retailers. Users can search for a product and view the prices of that product from different online retailers.

Anubis https://anubis.techaro.lol/

Fri Mar 21 13:47:37 2025

email

Anubis: self hostable scraper defense software.

Weighs the soul of incoming HTTP requests using proof-of-work to stop AI crawlers.

Anubis @ GitHub.

Related contents:

Fetcher MCP https://github.com/jae-jae/fetcher-mcp

Fri Mar 21 13:22:10 2025

email

MCP server for fetch web page content using Playwright headless browser.

Lightpanda Browser https://lightpanda.io/

Tue Jan 28 07:40:46 2025

email

The headless browser. The future of browser automation.

Combining the flexibility of a Headless Chrome API with an innovative tailor-made browser for unmatched performance and efficiency in browser automation.

Lightpanda Browser @ GitHub.

Related contents:

Lightpanda - Le navigateur rapide pour l'automatisation web @ Korben .

Nepenthes https://zadzmo.org/code/nepenthes/

Mon Jan 27 10:22:49 2025

email

This is a tarpit intended to catch web crawlers. Specifically, it's targetting crawlers that scrape data for LLM's - but really, like the plants it is named after, it'll eat just about anything that finds it's way inside.

It works by generating an endless sequences of pages, each of which with dozens of links, that simply go back into a the tarpit. Pages are randomly generated, but in a deterministic way, causing them to appear to be flat files that never change. Intentional delay is added to prevent crawlers from bogging down your server, in addition to wasting their time. Lastly, optional Markov-babble can be added to the pages, to give the crawlers something to scrape up and train their LLMs on, hopefully accelerating model collapse.

Related contents:

Common Crawl https://commoncrawl.org/

Sun Jan 26 15:27:53 2025

email

Open Repository of Web Crawl Data.

Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.

Related contents:

S5E7 - Sommes-nous à l'aube d'un effondrement des IA ? @ Underscore_'s acast .

DrissionPage官网

https://drissionpage.cn/

Sun Jan 12 20:59:25 2025

email

Python based web automation tool. Powerful and elegant.

DrissionPage is a Python-based web automation tool.
It can control the browser, send and receive packets, and combine the two.
You can balance the convenience of browser automation with the efficiency of requests.
It is powerful, built-in countless user-friendly design and convenient features.
Its syntax is simple and elegant, the code is small, and it is friendly to beginners.

DrissionPage @ GitHub.

Crawl4AI https://crawl4ai.com/mkdocs/

Fri Jan 10 06:54:32 2025

email

Open-Source LLM-Friendly Web Crawler & Scraper.

Crawl4AI delivers blazing-fast, AI-ready web crawling tailored for large language models, AI agents, and data pipelines. Fully open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.

Crawl4AI @ GitHub.

AIL-Framework https://github.com/supdevinci/ail-framework-docker

Wed Jan 8 07:50:24 2025

email

AIL-Framework is a powerful open-source project designed for online data analysis and web crawling, tailored for cybersecurity researchers and analysts.

Related contents:

1 Tools en 5 commandes @ Laurent Biagotti's LinkedIn .

DataFuel https://www.datafuel.dev/

Fri Dec 13 14:00:20 2024

email

Turn websites into LLM - ready data.

DataFuel API scrapes entire websites and knowledge bases in a single query. Get clean, markdown-structured web data instantly for your RAG systems and AI models. No complex scraping code needed.

Darkdump https://github.com/josh0xA/darkdump

Thu Dec 12 08:24:27 2024

email

Open Source Intelligence Interface for Deep Web Scraping.

Darkdump is a OSINT interface for carrying out deep web investgations written in python in which it allows users to enter a search query in which darkdump provides the ability to scrape .onion sites relating to that query to try to extract emails, metadata, keywords, images, social media etc. Darkdump retrieves sites via Ahmia.fi and scrapes those .onion addresses when connected via the tor network.

Related contents:

darkdump: Open Source Intelligence Interface for Deep Web Scraping @ Dark Web Informer.

Discount Bandit https://discount-bandit.cybrarist.com/

Mon Nov 25 09:17:22 2024

email

Self Hosted product tracker for Amazon, Walmart And many more.

Track products pricing across multi ecommerce stores such as amazon,ebay,walmart, target and many more.

simple-cloudflare-solver https://github.com/nlevee/simple-cloudflare-solver

Fri Nov 22 14:56:29 2024

email

simple-cloudflare-solver is an API to bypass Cloudflare's protection system. It can be used as a gateway by applications like Jackett and Prowlarr to access protected resources.

GitDorker https://github.com/obheda12/GitDorker

Wed Nov 20 08:18:12 2024

email

A Python program to scrape secrets from GitHub through usage of a large repository of dorks.

GitDorker is a tool that utilizes the GitHub Search API and an extensive list of GitHub dorks that I've compiled from various sources to provide an overview of sensitive information stored on github given a search query.

The Primary purpose of GitDorker is to provide the user with a clean and tailored attack surface to begin harvesting sensitive information on GitHub. GitDorker can be used with additional tools such as GitRob or Trufflehog on interesting repos or users discovered from GitDorker to produce best results.

Scrapling https://github.com/D4Vinci/Scrapling

Wed Nov 13 07:49:01 2024

email

Undetectable, Lightning-Fast, and Adaptive Web Scraping for Python.

Dealing with failing web scrapers due to anti-bot protections or website changes? Meet Scrapling.

Scrapling is a high-performance, intelligent web scraping library for Python that automatically adapts to website changes while significantly outperforming popular alternatives. For both beginners and experts, Scrapling provides powerful features while maintaining simplicity.

Crawlee https://crawlee.dev/

Fri Nov 8 08:54:40 2024

email

Build reliable crawlers. Fast.

A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

Crawlee @ GitHub.

Crawlee for Python https://crawlee.dev/python/

Fri Nov 8 08:53:35 2024

email

Build your Python web crawlers using Crawlee.
It helps you build reliable Python web crawlers. Fast.

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

Crawlee for Python @ GitHub.

Maxun https://maxun-website.vercel.app/

Mon Nov 4 09:59:13 2024

email

Open-Source No-Code Web Data Extraction Platform.

Build custom robots to automate data scraping.

Maxun @ GitHub.

Scrappey https://scrappey.com/

Mon Oct 21 10:44:11 2024

email

Web Scraping API.
Tired of getting blocked while Scraping the web?

Our simple-to-use API makes it easy. Rotating proxies, Anti-Bot technology and headless browsers to CAPTCHAs. It's never been this easy.

Cloudflare Turnstile Page & Captcha Bypass for Scraping https://github.com/sarperavci/CloudflareBypassForScraping

Mon Oct 21 10:43:33 2024

email

A cloudflare verification bypass script for webscraping.

We love scraping, don't we? But sometimes, we face Cloudflare protection. This script is designed to bypass the Cloudflare protection on websites, allowing you to interact with them programmatically.

Pour ceux qui voudrai remplacer FlareSolverr @ r/yggTorrents.

Links per page

Filters