Biapy's Bookmarks

SiteOne Crawler

https://crawler.siteone.io/

A free website analyzer, offline exporter, sitemap generator, and Swiss Army knife you will love.

SiteOne Crawler is a cross-platform website crawler and analyzer for SEO, security, accessibility, and performance optimization—ideal for developers, DevOps, QA engineers, and consultants. Supports Windows, macOS, and Linux (x64 and arm64).

SiteOne Crawler @ GitHub.

command-line crawler foss mit-licensed open-source seo

Added 1 month ago

SEOnaut

https://seonaut.org/

Open Source SEO audit tool.

SEOnaut is an SEO tool for website audits under the MIT license, giving you full transparency and control. Customize the tool to fit your unique needs or contribute to its ongoing development. Flexible, adaptable software you can trust.

SEOnaut @ GitHub.

audit crawler foss mit-licensed open-source self-hosted seo web-app

Added 1 month ago

A Vocabulary For Expressing AI Usage Preferences

https://ietf-wg-aipref.github.io/drafts/draft-ietf-aipref-vocab.html?cf_target_id=_blank

This document proposes a standardized vocabulary for expressing preferences related to how digital assets are used by automated processing systems. This vocabulary allows for the creation of structured declarations about restrictions or permissions for use of digital assets by such systems.

Related contents:

Perplexity is using stealth, undeclared crawlers to evade website no-crawl directives @ Cloudflare Blog.

ai crawler robot scraping standard

Added 2 months ago

The Web Robots Pages

https://www.robotstxt.org/

Web Robots (also known as Web Wanderers, Crawlers, or Spiders) are programs that traverse the Web automatically. Search engines such as Google use them to index web content, spammers use them to scan for email addresses, and they have many other uses.
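
As a rough illustration of how a well-behaved robot might consult a site's robots.txt before fetching a page, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `MyCrawler` and `ExampleBot` user agents, and the example.com URLs are all assumptions made up for this example; nothing is fetched over the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5

User-agent: ExampleBot
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without making any network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A generic crawler may fetch public pages but not anything under /private/.
print(parser.can_fetch("MyCrawler", "https://example.com/index.html"))
print(parser.can_fetch("MyCrawler", "https://example.com/private/data"))

# ExampleBot is excluded from the whole site by its own rule group.
print(parser.can_fetch("ExampleBot", "https://example.com/index.html"))

# Requested delay (in seconds) between requests for the generic group.
print(parser.crawl_delay("MyCrawler"))
```

Note that robots.txt is advisory: the check only helps when the crawler chooses to run it and honor the answer.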

Related contents:

I was wrong about robots.txt @ Evgenii Pendragon.

crawler documentation robot scraping seo standard web

Added 3 months ago

Photon

https://github.com/s0md3v/Photon

Incredibly fast crawler designed for OSINT.

command-line crawler foss open-source osint

Added 8 months ago

Nepenthes

https://zadzmo.org/code/nepenthes/

This is a tarpit intended to catch web crawlers. Specifically, it's targeting crawlers that scrape data for LLMs, but really, like the plants it is named after, it'll eat just about anything that finds its way inside.

It works by generating an endless sequence of pages, each with dozens of links that simply lead back into the tarpit. Pages are randomly generated, but in a deterministic way, so they appear to be flat files that never change. An intentional delay is added to prevent crawlers from bogging down your server, in addition to wasting their time. Lastly, optional Markov babble can be added to the pages to give the crawlers something to scrape up and train their LLMs on, hopefully accelerating model collapse.
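
The "random but deterministic" idea can be sketched in a few lines of Python. This is not Nepenthes' own code, only an illustration of the general technique: seed a pseudo-random generator from a hash of the requested path, so the same URL always yields the same page. The word list, the `/tarpit` prefix, and the function names are hypothetical, and the intentional delay and Markov babble are left out.

```python
import hashlib
import random

# Hypothetical word list and URL prefix, chosen only for illustration.
WORDS = ["amber", "basalt", "cedar", "delta", "ember", "fjord", "garnet", "harbor"]
PREFIX = "/tarpit"


def links_for(path: str, count: int = 12) -> list[str]:
    """Return the same pseudo-random links every time `path` is requested."""
    # Hash the path and use the first 8 bytes as the PRNG seed, so the
    # output depends only on the URL and never on stored state.
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [
        f"{PREFIX}/{rng.choice(WORDS)}-{rng.choice(WORDS)}-{rng.randrange(10_000)}"
        for _ in range(count)
    ]


def render_page(path: str) -> str:
    """Build a minimal HTML page whose links all lead back into the tarpit."""
    items = "\n".join(
        f'<li><a href="{href}">{href}</a></li>' for href in links_for(path)
    )
    return f"<html><body><ul>\n{items}\n</ul></body></html>"


# Requesting the same path twice produces an identical page.
print(render_page("/tarpit/start") == render_page("/tarpit/start"))
```

Because every page is derived from its own URL, nothing needs to be stored on disk, yet a crawler revisiting a link sees what looks like an unchanged static file.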

crawler foss honeypot mit-licensed open-source scraping self-hosted web-application-firewall

Added 9 months ago

Common Crawl

https://commoncrawl.org/

Open Repository of Web Crawl Data.

Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.

Related contents:

S5E7 - Sommes-nous à l'aube d'un effondrement des IA ? (Are we on the verge of an AI collapse?) @ Underscore_'s acast :fr:.

crawler llm machine-learning non-profit rag scraping web-service

Added 9 months ago

Crawl4AI

https://crawl4ai.com/mkdocs/

Open-Source LLM-Friendly Web Crawler & Scraper.

Crawl4AI delivers blazing-fast, AI-ready web crawling tailored for large language models, AI agents, and data pipelines. Fully open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.

Crawl4AI @ GitHub.

ai crawler foss llm open-source rag scraping

Added 9 months ago

Crawlee

https://crawlee.dev/

Build reliable crawlers. Fast.

A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

Crawlee @ GitHub.

crawler foss javascript library open-source scraping typescript web

Added 11 months ago

Crawlee for Python

https://crawlee.dev/python/

Build reliable Python web crawlers with Crawlee. Fast.

Crawlee is a web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

Crawlee for Python @ GitHub.

crawler development foss library open-source python scraping web

Added 11 months ago

Firecrawl

https://www.firecrawl.dev/

Turn websites into LLM-ready data.

Power your AI apps with clean data crawled from any website. It's also open-source. 🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.

Firecrawl @ GitHub.

agpl3-licensed crawler data-science foss llm machine-learning open-source scraping

Added 1 year ago

Fess

https://fess.codelibs.org/

Fess is a very powerful and easily deployable Enterprise Search Server.

Fess is a very powerful and easily deployable Enterprise Search Server. You can quickly install and run Fess on any platform where you can run the Java Runtime Environment. Fess is provided under the Apache License 2.0.

Fess is based on OpenSearch/Elasticsearch, but no knowledge of or experience with OpenSearch/Elasticsearch is required. Fess provides an easy-to-use Administration GUI to configure the system via your browser. Fess also contains a Crawler, which can crawl documents on a web server, file system, or Data Store (such as a CSV or database). Many file formats are supported, including (but not limited to) Microsoft Office, PDF, and zip.

crawler elasticsearch enterprise opensearch open-source search-engine self-hosted

Added 1 year ago

Katana

https://github.com/projectdiscovery/katana

A next-generation crawling and spidering framework

crawler data-science framework golang open-source software spider

Added 2 years ago

Greenflare SEO Web Crawler

https://greenflare.io/

The Open Source SEO Crawler

crawler seo software

Added 3 years ago

Screaming Frog SEO Spider Website Crawler

https://www.screamingfrog.co.uk/seo-spider/

The industry leading website crawler for Windows, macOS and Ubuntu, trusted by thousands of SEOs and agencies worldwide for technical SEO site audits.

crawler development linux macos seo spider ubuntu web-design windows

Added 3 years ago

Scrapy

http://scrapy.org/

An open source web scraping framework for Python.

Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Scrapy @ GitHub.

crawler development foss framework library open-source python scraping web

Added 12 years ago