Scraping
Get your valid antibot cookies yourself!
Built with SeleniumBase and FastAPI, this project aims to mimic FlareSolverr's API and functionality, providing you with HTTP cookies and headers for websites protected by anti-bot systems.
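A minimal sketch of the same idea, pairing SeleniumBase's UC mode with a FastAPI route; the endpoint shape below is hypothetical, not this project's actual API:

```python
# Hypothetical endpoint shape; illustrates the cookie-harvesting idea only.
from fastapi import FastAPI
from seleniumbase import SB

app = FastAPI()

@app.get("/cookies")
def get_cookies(url: str):
    # UC (undetected-chromedriver) mode helps pass anti-bot challenges.
    with SB(uc=True, headless=True) as sb:
        sb.open(url)
        return {"url": url, "cookies": sb.get_cookies()}
```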
Related contents:
Botasaurus is a Swiss Army knife 🔪 for web scraping and browser automation 🤖 that helps you create bots fast. ⚡️
Botasaurus is an all-in-one web scraping framework that enables you to build awesome scrapers in less time, with less code, and with more fun.
Related contents:
This document proposes a standardized vocabulary for expressing preferences related to how digital assets are used by automated processing systems. This vocabulary allows for the creation of structured declarations about restrictions or permissions for use of digital assets by such systems.
Related contents:
State-of-the-art browsing agent (72.7% on WebArena).
Meka Agent is an open-source, autonomous computer-using agent that delivers state-of-the-art browsing capabilities. The agent works and acts in the same way humans do, by purely using vision as its eyes and acting within a full computer context.
It is designed as a simple, extensible, and customizable framework, allowing flexibility in the choice of models, tools, and infrastructure providers.
AI-Powered Web Scraping & Data Enrichment. AI-powered web search with instant results and follow-up questions.
🔥 Blazing-fast AI search engine with real-time citations, streaming responses, and live data powered by Firecrawl
Related contents:
Web Robots (also known as Web Wanderers, Crawlers, or Spiders) are programs that traverse the Web automatically. Search engines such as Google use them to index web content, spammers use them to scan for email addresses, and they have many other uses.
Related contents:
No-Code Web Data Extraction Platform. Turn Websites To APIs & Spreadsheets In Minutes.
Maxun lets you train a robot in 2 minutes and scrape the web on auto-pilot. Web data extraction doesn't get easier than this!
A powerful self-hosted web scraping solution. Scrape websites without writing a single line of code.
Defuddle extracts the main content from web pages. It cleans up web pages by removing clutter like comments, sidebars, headers, footers, and other non-essential elements, leaving only the primary content.
Transparent AI, Rooted in Research, Open to All. The Open Source Deep Researcher Tool. AI-Powered Online Data Information Synthesis Assistant.
CleverBee is a powerful Python-based research assistant agent using Large Language Models (LLMs) like Claude and Gemini, Playwright for web browsing, and Chainlit for an interactive UI. It performs research assistance by browsing the web, extracting content (HTML), cleaning it, and synthesizing findings based on user research topics.
Telegram channel scraper & JSON exporter.
A fast and reliable Telegram channel scraper that fetches posts and exports them to JSON.
PriceBuddy is an open source, self-hostable, web application that allows users to compare prices of products from different online retailers. Users can search for a product and view the prices of that product from different online retailers.
Anubis: self-hostable scraper defense software.
Weighs the soul of incoming HTTP requests using proof-of-work to stop AI crawlers.
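The gate's core idea reduces to a hash puzzle; below is a minimal illustrative sketch of the concept (not Anubis's actual code, which issues the challenge to the browser in JavaScript):

```python
# Conceptual sketch of a proof-of-work gate, not Anubis's implementation.
import hashlib

def solve(challenge: str, difficulty: int = 4) -> int:
    """Client side: find a nonce whose hash has `difficulty` leading zero hex digits."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
        if digest.startswith("0" * difficulty):
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int, difficulty: int = 4) -> bool:
    """Server side: one cheap hash confirms the client did the work."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)
```

Verification costs one hash while solving costs thousands on average: negligible for a human loading one page, expensive for a crawler issuing millions of requests.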
Related contents:
- Block AI scrapers with Anubis @ Xe.
- Episode 146: When AI Attacks @ Self-Hosted.
- The surreal joy of having an overprovisioned homelab @ Xe.
- Open source devs are fighting AI crawlers with cleverness and vengeance @ TechCrunch.
- [Anubis] Using proof of work to block bots @ Pofilo.fr :fr:.
- The Day Anubis Saved Our Websites From a DDoS Attack @ fabulous.systems.
- Protecting all your sites with Anubis @ Dryusdan.space 🚀.
- A thought on JavaScript "proof of work" anti-scraper systems @ Wandering Thoughts.
- Anubis - Protect your website against AI scrapers in under 15 minutes @ Korben :fr:.
MCP server for fetching web page content using a Playwright headless browser.
The headless browser. The future of browser automation.
Combining the flexibility of a Headless Chrome API with an innovative tailor-made browser for unmatched performance and efficiency in browser automation.
Related contents:
This is a tarpit intended to catch web crawlers. Specifically, it targets crawlers that scrape data for LLMs - but really, like the plants it is named after, it'll eat just about anything that finds its way inside.
It works by generating an endless sequence of pages, each with dozens of links that simply lead back into the tarpit. Pages are randomly generated, but in a deterministic way, making them appear to be flat files that never change. Intentional delay is added to prevent crawlers from bogging down your server, in addition to wasting their time. Lastly, optional Markov babble can be added to the pages to give the crawlers something to scrape up and train their LLMs on, hopefully accelerating model collapse.
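The deterministic-generation trick can be pictured in a few lines: seed a PRNG with a hash of the request path, so every visit to the same URL renders the same page (a sketch of the concept, not the project's implementation):

```python
# Sketch of deterministic tarpit page generation, not the project's code.
import hashlib
import random

def tarpit_page(path: str, links_per_page: int = 20) -> str:
    # Seed the PRNG from the path so the same URL always yields the same page.
    seed = int.from_bytes(hashlib.sha256(path.encode()).digest(), "big")
    rng = random.Random(seed)
    links = "\n".join(
        f'<a href="/{rng.getrandbits(64):016x}">more</a>'
        for _ in range(links_per_page)
    )
    return f"<html><body>\n{links}\n</body></html>"
```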
Related contents:
Open Repository of Web Crawl Data.
Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.
Related contents:
Python based web automation tool. Powerful and elegant.
DrissionPage is a Python-based web automation tool. It can control the browser, send and receive packets, and combine the two, letting you balance the convenience of browser automation with the efficiency of raw requests. It is powerful, with countless user-friendly built-in features and conveniences. Its syntax is simple and elegant, it needs little code, and it is friendly to beginners.
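A minimal sketch of that dual-mode usage, based on DrissionPage's documented WebPage object (exact behavior may vary between versions):

```python
from DrissionPage import WebPage

page = WebPage()                     # starts in browser ("d") mode
page.get("https://example.com")      # render a JavaScript-heavy page
page.change_mode()                   # switch to requests ("s") mode, keeping cookies
page.get("https://example.com/data") # subsequent fetches skip the browser
print(page.html[:200])
```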
Open-Source LLM-Friendly Web Crawler & Scraper.
Crawl4AI delivers blazing-fast, AI-ready web crawling tailored for large language models, AI agents, and data pipelines. Fully open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.
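The quickstart pattern from the project's documentation, assuming a recent crawl4ai release:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main() -> None:
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.markdown)  # LLM-ready markdown of the page

asyncio.run(main())
```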
AIL-Framework is a powerful open-source project designed for online data analysis and web crawling, tailored for cybersecurity researchers and analysts.
Related contents:
Turn websites into LLM-ready data.
DataFuel API scrapes entire websites and knowledge bases in a single query. Get clean, markdown-structured web data instantly for your RAG systems and AI models. No complex scraping code needed.
Open Source Intelligence Interface for Deep Web Scraping.
Darkdump is an OSINT interface for carrying out deep web investigations, written in Python. Users enter a search query, and Darkdump scrapes .onion sites related to that query, attempting to extract emails, metadata, keywords, images, social media handles, and more. Darkdump retrieves sites via Ahmia.fi and scrapes those .onion addresses when connected to the Tor network.
Related contents:
Self-hosted product tracker for Amazon, Walmart, and many more.
Track product pricing across multiple e-commerce stores such as Amazon, eBay, Walmart, Target, and many more.
simple-cloudflare-solver is an API to bypass Cloudflare's protection system. It can be used as a gateway by applications like Jackett and Prowlarr to access protected resources.
A Python program to scrape secrets from GitHub through usage of a large repository of dorks.
GitDorker is a tool that utilizes the GitHub Search API and an extensive list of GitHub dorks compiled from various sources to provide an overview of sensitive information stored on GitHub given a search query.
The primary purpose of GitDorker is to provide the user with a clean and tailored attack surface to begin harvesting sensitive information on GitHub. GitDorker can be combined with additional tools such as GitRob or Trufflehog on interesting repos or users it discovers to produce the best results.
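The underlying mechanic is one authenticated call to GitHub's code-search endpoint per dork; a hedged illustration (the dork string and token placeholder are examples, not GitDorker's internals):

```python
# Illustrative single-dork query; GitDorker batches many such dorks.
import requests

resp = requests.get(
    "https://api.github.com/search/code",
    params={"q": 'filename:.env "DB_PASSWORD"', "per_page": 5},
    headers={
        "Authorization": "token <YOUR_GITHUB_TOKEN>",  # code search requires auth
        "Accept": "application/vnd.github+json",
    },
    timeout=30,
)
for item in resp.json().get("items", []):
    print(item["repository"]["full_name"], item["path"])
```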
Undetectable, Lightning-Fast, and Adaptive Web Scraping for Python.
Dealing with failing web scrapers due to anti-bot protections or website changes? Meet Scrapling.
Scrapling is a high-performance, intelligent web scraping library for Python that automatically adapts to website changes while significantly outperforming popular alternatives. For both beginners and experts, Scrapling provides powerful features while maintaining simplicity.
Build reliable crawlers. Fast.
A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
Crawlee helps you build reliable Python web crawlers. Fast.
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
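A minimal sketch following the project's documented pattern, assuming a recent Crawlee for Python release (import paths have moved between versions):

```python
import asyncio
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

async def main() -> None:
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def handler(ctx: BeautifulSoupCrawlingContext) -> None:
        ctx.log.info(f"Visiting {ctx.request.url}")
        await ctx.enqueue_links()  # queue links discovered on the page

    await crawler.run(["https://crawlee.dev"])

asyncio.run(main())
```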
Open-Source No-Code Web Data Extraction Platform.
Build custom robots to automate data scraping.
Web Scraping API. Tired of getting blocked while scraping the web?
Our simple-to-use API makes it easy: rotating proxies, anti-bot technology, and headless browsers to bypass CAPTCHAs. It's never been this easy.
A Cloudflare verification bypass script for web scraping.
We love scraping, don't we? But sometimes, we face Cloudflare protection. This script is designed to bypass the Cloudflare protection on websites, allowing you to interact with them programmatically.
A Swiss-army tool for scraping and extracting data from online assets, made for hackers.
Pipet is a command line based web scraper. It supports 3 modes of operation - HTML parsing, JSON parsing, and client-side JavaScript evaluation. It relies heavily on existing tools like curl, and it uses unix pipes for extending its built-in capabilities.
Turn websites into LLM-ready data.
Power your AI apps with clean data crawled from any website. It's also open-source. 🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.
Self-hosted web scraper.
Scraperr is a self-hosted web application that allows users to scrape data from web pages by specifying elements via XPath. Users can submit URLs and the corresponding elements to be scraped, and the results will be displayed in a table.
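The XPath-driven extraction it offers can be pictured with plain requests + lxml (an illustration of the idea, not Scraperr's code):

```python
# Fetch a page and pull out elements by XPath, as Scraperr does via its UI.
import requests
from lxml import html

tree = html.fromstring(requests.get("https://example.com", timeout=30).text)
for heading in tree.xpath("//h1/text()"):
    print(heading)
```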
A Pool of Hosted Browsers, For Use With Puppeteer or Playwright.
Run your scraping, testing, screenshotting or any other automation with our pool of browsers. Ready to connect to with Puppeteer, Playwright, or via our APIs.
⬛️ CLI tool for saving complete web pages as a single HTML file.
A data hoarder’s dream come true: bundle any web page into a single HTML file. You can finally replace that gazillion of open tabs with a gazillion of .html files stored somewhere on your precious little drive.
A List of Known AI Agents on the Internet.
Insight into the hidden ecosystem of autonomous chatbots and data scrapers crawling across the web. Protect your website from unwanted AI agent access.
A tool to scrape LinkedIn without API restrictions for data reconnaissance.
This tool assists in performing reconnaissance using the LinkedIn.com website/API for red team or social engineering engagements. It performs a company specific search to extract a detailed list of employees who work for the target company. Enter the name of the target company and the tool will help determine the LinkedIn company ID, which will be used to perform the search.
RSS-Bridge is a PHP project capable of generating RSS and Atom feeds for websites that don't have one. It can be used on webservers or as a stand-alone application in CLI mode.
A Python package & command-line tool to gather text on the Web.
Trafilatura is a Python package and command-line tool designed to gather text on the Web. It includes discovery, extraction and text processing components. Its main applications are web crawling, downloads, scraping, and extraction of main texts, metadata and comments. It aims at staying handy and modular: no database is required, the output can be converted to various commonly used formats.
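Typical usage is two calls, fetch then extract, mirroring the library's documented quickstart:

```python
import trafilatura

downloaded = trafilatura.fetch_url("https://example.org/article")
text = trafilatura.extract(downloaded, include_comments=False)
print(text)  # main text only, clutter stripped
```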
Related contents:
Like jq, but for HTML. Uses CSS selectors to extract bits of content from HTML files.
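The same select-and-print idea, transposed to Python with BeautifulSoup (the tool itself is a standalone CLI, so this is an analogy, not its usage):

```python
from bs4 import BeautifulSoup

with open("page.html") as fh:
    soup = BeautifulSoup(fh, "html.parser")
for link in soup.select("article h2 a"):  # CSS selector, jq-style targeting
    print(link.get_text(strip=True), link.get("href"))
```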
Extract data from plots, images, and maps. A web based tool to extract numerical data from plot images. Supports XY, Polar, Ternary diagrams and Maps. It is often necessary to reverse engineer images of data visualizations to extract the underlying numerical data. WebPlotDigitizer is a semi-automated tool that makes this process extremely easy.
Guide, reference and cheat sheet on web scraping using rvest, httr and RSelenium. Inspired by Hartley Brody, this cheat sheet covers web scraping with rvest, httr and RSelenium, touching on many of the topics in his blog.
While Hartley uses Python's requests and BeautifulSoup libraries, this cheat sheet covers the usage of httr and rvest. While rvest is good enough for many scraping tasks, httr is required for more advanced techniques. Usage of RSelenium (a web driver) is also covered.
A hassle-free web scraper to process information from websites, easily and without getting blocked.
Web Scraping & Web Automation. Scrape websites. Automate tasks. No coding required. Browserflow is a no-code/low-code Chrome extension that allows you to automate your work on any website. Save time by automating repetitive tasks in minutes. Run in your browser or in the cloud.
Captcha solver extension for humans. Buster is a browser extension which helps you to solve difficult captchas by completing reCAPTCHA audio challenges using speech recognition. Challenges are solved by clicking on the extension button at the bottom of the reCAPTCHA widget.
DaProfiler allows you to gather emails, social media accounts, addresses, workplaces and more on your target using web scraping and Google dorking techniques, focused on France only. The particularity of this program is its ability to find your target's e-mail addresses.
Query the Web of data on Web-scale by moving intelligence from servers to clients.
RoboBrowser is a simple, Pythonic library for browsing the web without a standalone web browser. RoboBrowser can fetch a page, click on links and buttons, and fill out and submit forms. If you need to interact with web services that don't have APIs, RoboBrowser can help.
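Form-filling follows the library's documented get_form/submit_form pattern; the URL and form id below are hypothetical:

```python
from robobrowser import RoboBrowser

browser = RoboBrowser(parser="html.parser")
browser.open("https://example.com/search")  # hypothetical page
form = browser.get_form(id="search-form")   # hypothetical form id
form["q"].value = "web scraping"
browser.submit_form(form)
print(browser.select("title")[0].get_text())
```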
Used to scrape a web page to detect what 3rd-party services are being used. Check out sherlock-segment for a collection of plugin examples.
The objective of this program is to gather emails, subdomains, hosts, employee names, open ports and banners from different public sources like search engines, PGP key servers and SHODAN computer database.
Portia is a tool for visually scraping web sites without any programming knowledge. Just annotate web pages with a point-and-click editor to indicate what data you want to extract, and Portia will learn how to scrape similar pages from the site.
An open source web scraping framework for Python.
Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
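A classic minimal spider, per Scrapy's tutorial:

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield one structured item per quote block on the page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
```

Run it with `scrapy runspider quotes_spider.py -o quotes.json` to export the items as JSON.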