Scraping
Get your valid antibot cookies yourself!
Built with SeleniumBase and FastAPI, this project aims to mimic FlareSolverr's API and functionality, providing you with HTTP cookies and headers for websites protected by anti-bot systems.
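A minimal sketch of the same idea, pairing SeleniumBase's UC mode with a FastAPI route; the endpoint shape below is hypothetical, not this project's actual API:

```python
# Hypothetical endpoint shape; illustrates the cookie-harvesting idea only.
from fastapi import FastAPI
from seleniumbase import SB

app = FastAPI()

@app.get("/cookies")
def get_cookies(url: str):
    # UC (undetected-chromedriver) mode helps pass anti-bot challenges.
    with SB(uc=True, headless=True) as sb:
        sb.open(url)
        return {"url": url, "cookies": sb.get_cookies()}
```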
Related contents:
Botasaurus is a Swiss Army knife 🔪 for web scraping and browser automation 🤖 that helps you create bots fast. ⚡️
Botasaurus is an all-in-one web scraping framework that enables you to build awesome scrapers in less time, with less code, and with more fun.
Related contents:
This document proposes a standardized vocabulary for expressing preferences related to how digital assets are used by automated processing systems. This vocabulary allows for the creation of structured declarations about restrictions or permissions for use of digital assets by such systems.
Related contents:
State-of-the-art browsing agent (72.7% on WebArena).
Meka Agent is an open-source, autonomous computer-using agent that delivers state-of-the-art browsing capabilities. The agent works and acts in the same way humans do, by purely using vision as its eyes and acting within a full computer context.
It is designed as a simple, extensible, and customizable framework, allowing flexibility in the choice of models, tools, and infrastructure providers.
AI-Powered Web Scraping & Data Enrichment. AI-powered web search with instant results and follow-up questions.
🔥 Blazing-fast AI search engine with real-time citations, streaming responses, and live data powered by Firecrawl
Related contents:
Web Robots (also known as Web Wanderers, Crawlers, or Spiders) are programs that traverse the Web automatically. Search engines such as Google use them to index web content, spammers use them to scan for email addresses, and they have many other uses.
Related contents:
No-Code Web Data Extraction Platform. Turn Websites To APIs & Spreadsheets In Minutes.
Maxun lets you train a robot in 2 minutes and scrape the web on auto-pilot. Web data extraction doesn't get easier than this!
A powerful self-hosted web scraping solution. Scrape websites without writing a single line of code.
Defuddle extracts the main content from web pages. It cleans up web pages by removing clutter like comments, sidebars, headers, footers, and other non-essential elements, leaving only the primary content.
Transparent AI, Rooted in Research, Open to All. The Open Source Deep Researcher Tool. AI-Powered Online Data Information Synthesis Assistant.
CleverBee is a powerful Python-based research assistant agent using Large Language Models (LLMs) like Claude and Gemini, Playwright for web browsing, and Chainlit for an interactive UI. It performs research assistance by browsing the web, extracting content (HTML), cleaning it, and synthesizing findings based on user research topics.
Telegram channel scraper & JSON exporter.
A fast and reliable Telegram channel scraper that fetches posts and exports them to JSON.
PriceBuddy is an open source, self-hostable, web application that allows users to compare prices of products from different online retailers. Users can search for a product and view the prices of that product from different online retailers.
Anubis: self-hostable scraper defense software.
Weighs the soul of incoming HTTP requests using proof-of-work to stop AI crawlers.
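The gate's core idea reduces to a hash puzzle; below is a minimal illustrative sketch of the concept (not Anubis's actual code, which issues the challenge to the browser in JavaScript):

```python
# Conceptual sketch of a proof-of-work gate, not Anubis's implementation.
import hashlib

def solve(challenge: str, difficulty: int = 4) -> int:
    """Client side: find a nonce whose hash has `difficulty` leading zero hex digits."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
        if digest.startswith("0" * difficulty):
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int, difficulty: int = 4) -> bool:
    """Server side: one cheap hash confirms the client did the work."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)
```

Verification costs one hash while solving costs thousands on average: negligible for a human loading one page, expensive for a crawler issuing millions of requests.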
Related contents:
- Block AI scrapers with Anubis @ Xe.
- Episode 146: When AI Attacks @ Self-Hosted.
- The surreal joy of having an overprovisioned homelab @ Xe.
- Open source devs are fighting AI crawlers with cleverness and vengeance @ TechCrunch.
- [Anubis] Using proof of work to block bots @ Pofilo.fr :fr:.
- The Day Anubis Saved Our Websites From a DDoS Attack @ fabulous.systems.
- Protecting all your sites with Anubis @ Dryusdan.space 🚀.
- A thought on JavaScript "proof of work" anti-scraper systems @ Wandering Thoughts.
- Anubis - Protect your website against AI scrapers in under 15 minutes @ Korben :fr:.
MCP server for fetching web page content using a Playwright headless browser.
The headless browser. The future of browser automation.
Combining the flexibility of a Headless Chrome API with an innovative tailor-made browser for unmatched performance and efficiency in browser automation.
Related contents:
This is a tarpit intended to catch web crawlers. Specifically, it targets crawlers that scrape data for LLMs - but really, like the plants it is named after, it'll eat just about anything that finds its way inside.
It works by generating an endless sequence of pages, each with dozens of links that simply lead back into the tarpit. Pages are randomly generated, but in a deterministic way, making them appear to be flat files that never change. Intentional delay is added to prevent crawlers from bogging down your server, in addition to wasting their time. Lastly, optional Markov babble can be added to the pages to give the crawlers something to scrape up and train their LLMs on, hopefully accelerating model collapse.
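The deterministic-generation trick can be pictured in a few lines: seed a PRNG with a hash of the request path, so every visit to the same URL renders the same page (a sketch of the concept, not the project's implementation):

```python
# Sketch of deterministic tarpit page generation, not the project's code.
import hashlib
import random

def tarpit_page(path: str, links_per_page: int = 20) -> str:
    # Seed the PRNG from the path so the same URL always yields the same page.
    seed = int.from_bytes(hashlib.sha256(path.encode()).digest(), "big")
    rng = random.Random(seed)
    links = "\n".join(
        f'<a href="/{rng.getrandbits(64):016x}">more</a>'
        for _ in range(links_per_page)
    )
    return f"<html><body>\n{links}\n</body></html>"
```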
Related contents:
Open Repository of Web Crawl Data.
Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.
Related contents:
Python based web automation tool. Powerful and elegant.
DrissionPage is a Python-based web automation tool. It can control the browser, send and receive packets, and combine the two, letting you balance the convenience of browser automation with the efficiency of raw requests. It is powerful, with countless user-friendly built-in features and conveniences. Its syntax is simple and elegant, it needs little code, and it is friendly to beginners.
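A minimal sketch of that dual-mode usage, based on DrissionPage's documented WebPage object (exact behavior may vary between versions):

```python
from DrissionPage import WebPage

page = WebPage()                     # starts in browser ("d") mode
page.get("https://example.com")      # render a JavaScript-heavy page
page.change_mode()                   # switch to requests ("s") mode, keeping cookies
page.get("https://example.com/data") # subsequent fetches skip the browser
print(page.html[:200])
```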
Open-Source LLM-Friendly Web Crawler & Scraper.
Crawl4AI delivers blazing-fast, AI-ready web crawling tailored for large language models, AI agents, and data pipelines. Fully open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.
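The quickstart pattern from the project's documentation, assuming a recent crawl4ai release:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main() -> None:
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.markdown)  # LLM-ready markdown of the page

asyncio.run(main())
```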
AIL-Framework is a powerful open-source project designed for online data analysis and web crawling, tailored for cybersecurity researchers and analysts.
Related contents:
Turn websites into LLM-ready data.
DataFuel API scrapes entire websites and knowledge bases in a single query. Get clean, markdown-structured web data instantly for your RAG systems and AI models. No complex scraping code needed.
Open Source Intelligence Interface for Deep Web Scraping.
Darkdump is an OSINT interface for carrying out deep web investigations, written in Python. Users enter a search query, and Darkdump scrapes .onion sites related to that query, attempting to extract emails, metadata, keywords, images, social media handles, and more. Darkdump retrieves sites via Ahmia.fi and scrapes those .onion addresses when connected to the Tor network.
Related contents:
Self-hosted product tracker for Amazon, Walmart, and many more.
Track product pricing across multiple e-commerce stores such as Amazon, eBay, Walmart, Target, and many more.
simple-cloudflare-solver is an API to bypass Cloudflare's protection system. It can be used as a gateway by applications like Jackett and Prowlarr to access protected resources.
A Python program to scrape secrets from GitHub through usage of a large repository of dorks.
GitDorker is a tool that utilizes the GitHub Search API and an extensive list of GitHub dorks compiled from various sources to provide an overview of sensitive information stored on GitHub given a search query.
The primary purpose of GitDorker is to provide the user with a clean and tailored attack surface to begin harvesting sensitive information on GitHub. GitDorker can be combined with additional tools such as GitRob or Trufflehog on interesting repos or users it discovers to produce the best results.
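The underlying mechanic is one authenticated call to GitHub's code-search endpoint per dork; a hedged illustration (the dork string and token placeholder are examples, not GitDorker's internals):

```python
# Illustrative single-dork query; GitDorker batches many such dorks.
import requests

resp = requests.get(
    "https://api.github.com/search/code",
    params={"q": 'filename:.env "DB_PASSWORD"', "per_page": 5},
    headers={
        "Authorization": "token <YOUR_GITHUB_TOKEN>",  # code search requires auth
        "Accept": "application/vnd.github+json",
    },
    timeout=30,
)
for item in resp.json().get("items", []):
    print(item["repository"]["full_name"], item["path"])
```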
Undetectable, Lightning-Fast, and Adaptive Web Scraping for Python.
Dealing with failing web scrapers due to anti-bot protections or website changes? Meet Scrapling.
Scrapling is a high-performance, intelligent web scraping library for Python that automatically adapts to website changes while significantly outperforming popular alternatives. For both beginners and experts, Scrapling provides powerful features while maintaining simplicity.
Build reliable crawlers. Fast.
A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
Crawlee helps you build reliable Python web crawlers. Fast.
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
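A minimal sketch following the project's documented pattern, assuming a recent Crawlee for Python release (import paths have moved between versions):

```python
import asyncio
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

async def main() -> None:
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def handler(ctx: BeautifulSoupCrawlingContext) -> None:
        ctx.log.info(f"Visiting {ctx.request.url}")
        await ctx.enqueue_links()  # queue links discovered on the page

    await crawler.run(["https://crawlee.dev"])

asyncio.run(main())
```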
Open-Source No-Code Web Data Extraction Platform.
Build custom robots to automate data scraping.
Web Scraping API. Tired of getting blocked while scraping the web?
Our simple-to-use API makes it easy: rotating proxies, anti-bot technology, and headless browsers to bypass CAPTCHAs. It's never been this easy.
A Cloudflare verification bypass script for web scraping.
We love scraping, don't we? But sometimes, we face Cloudflare protection. This script is designed to bypass the Cloudflare protection on websites, allowing you to interact with them programmatically.
A Swiss-army tool for scraping and extracting data from online assets, made for hackers.
Pipet is a command line based web scraper. It supports 3 modes of operation - HTML parsing, JSON parsing, and client-side JavaScript evaluation. It relies heavily on existing tools like curl, and it uses unix pipes for extending its built-in capabilities.
Turn websites into LLM-ready data.
Power your AI apps with clean data crawled from any website. It's also open-source. 🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.
Self-hosted web scraper.
Scraperr is a self-hosted web application that allows users to scrape data from web pages by specifying elements via XPath. Users can submit URLs and the corresponding elements to be scraped, and the results will be displayed in a table.
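The XPath-driven extraction it offers can be pictured with plain requests + lxml (an illustration of the idea, not Scraperr's code):

```python
# Fetch a page and pull out elements by XPath, as Scraperr does via its UI.
import requests
from lxml import html

tree = html.fromstring(requests.get("https://example.com", timeout=30).text)
for heading in tree.xpath("//h1/text()"):
    print(heading)
```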
A Pool of Hosted Browsers, For Use With Puppeteer or Playwright.
Run your scraping, testing, screenshotting or any other automation with our pool of browsers. Ready to connect to with Puppeteer, Playwright, or via our APIs.
⬛️ CLI tool for saving complete web pages as a single HTML file.
A data hoarder’s dream come true: bundle any web page into a single HTML file. You can finally replace that gazillion of open tabs with a gazillion of .html files stored somewhere on your precious little drive.
A List of Known AI Agents on the Internet.
Insight into the hidden ecosystem of autonomous chatbots and data scrapers crawling across the web. Protect your website from unwanted AI agent access.
A tool to scrape LinkedIn without API restrictions for data reconnaissance.
This tool assists in performing reconnaissance using the LinkedIn.com website/API for red team or social engineering engagements. It performs a company specific search to extract a detailed list of employees who work for the target company. Enter the name of the target company and the tool will help determine the LinkedIn company ID, which will be used to perform the search.
RSS-Bridge is a PHP project capable of generating RSS and Atom feeds for websites that don't have one. It can be used on webservers or as a stand-alone application in CLI mode.
A Python package & command-line tool to gather text on the Web.
Trafilatura is a Python package and command-line tool designed to gather text on the Web. It includes discovery, extraction and text processing components. Its main applications are web crawling, downloads, scraping, and extraction of main texts, metadata and comments. It aims at staying handy and modular: no database is required, the output can be converted to various commonly used formats.
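Typical usage is two calls, fetch then extract, mirroring the library's documented quickstart:

```python
import trafilatura

downloaded = trafilatura.fetch_url("https://example.org/article")
text = trafilatura.extract(downloaded, include_comments=False)
print(text)  # main text only, clutter stripped
```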
Related contents:
Like jq, but for HTML. Uses CSS selectors to extract bits of content from HTML files.
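The same select-and-print idea, transposed to Python with BeautifulSoup (the tool itself is a standalone CLI, so this is an analogy, not its usage):

```python
from bs4 import BeautifulSoup

with open("page.html") as fh:
    soup = BeautifulSoup(fh, "html.parser")
for link in soup.select("article h2 a"):  # CSS selector, jq-style targeting
    print(link.get_text(strip=True), link.get("href"))
```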
Extract data from plots, images, and maps. A web based tool to extract numerical data from plot images. Supports XY, Polar, Ternary diagrams and Maps. It is often necessary to reverse engineer images of data visualizations to extract the underlying numerical data. WebPlotDigitizer is a semi-automated tool that makes this process extremely easy.
Guide, reference and cheat sheet on web scraping using rvest, httr and RSelenium. Inspired by Hartley Brody, this cheat sheet covers web scraping with rvest, httr and RSelenium, touching on many of the topics in his blog.
While Hartley uses Python's requests and BeautifulSoup libraries, this cheat sheet covers the usage of httr and rvest. While rvest is good enough for many scraping tasks, httr is required for more advanced techniques. Usage of RSelenium (a web driver) is also covered.
A hassle-free web scraper to process information from websites, easily and without getting blocked.
Web Scraping & Web Automation. Scrape websites. Automate tasks. No coding required. Browserflow is a no-code/low-code Chrome extension that allows you to automate your work on any website. Save time by automating repetitive tasks in minutes. Run in your browser or in the cloud.
Captcha solver extension for humans. Buster is a browser extension which helps you to solve difficult captchas by completing reCAPTCHA audio challenges using speech recognition. Challenges are solved by clicking on the extension button at the bottom of the reCAPTCHA widget.
DaProfiler allows you to gather emails, social media accounts, addresses, workplaces and more on your target using web scraping and Google dorking techniques, focused on France only. The particularity of this program is its ability to find your target's e-mail addresses.
Query the Web of data on Web-scale by moving intelligence from servers to clients.
RoboBrowser is a simple, Pythonic library for browsing the web without a standalone web browser. RoboBrowser can fetch a page, click on links and buttons, and fill out and submit forms. If you need to interact with web services that don't have APIs, RoboBrowser can help.
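Form-filling follows the library's documented get_form/submit_form pattern; the URL and form id below are hypothetical:

```python
from robobrowser import RoboBrowser

browser = RoboBrowser(parser="html.parser")
browser.open("https://example.com/search")  # hypothetical page
form = browser.get_form(id="search-form")   # hypothetical form id
form["q"].value = "web scraping"
browser.submit_form(form)
print(browser.select("title")[0].get_text())
```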
Used to scrape a web page to detect what 3rd-party services are being used. Check out sherlock-segment for a collection of plugin examples.
The objective of this program is to gather emails, subdomains, hosts, employee names, open ports and banners from different public sources like search engines, PGP key servers and SHODAN computer database.
Portia is a tool for visually scraping web sites without any programming knowledge. Just annotate web pages with a point-and-click editor to indicate what data you want to extract, and Portia will learn how to scrape similar pages from the site.
An open source web scraping framework for Python.
Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
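A classic minimal spider, per Scrapy's tutorial:

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield one structured item per quote block on the page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
```

Run it with `scrapy runspider quotes_spider.py -o quotes.json` to export the items as JSON.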