Search: [crawler] - Biapy Web Directory

Photon https://github.com/s0md3v/Photon

Tue Feb 25 15:09:02 2025

email

Incredibly fast crawler designed for OSINT.

Nepenthes https://zadzmo.org/code/nepenthes/

Mon Jan 27 10:22:49 2025

email

This is a tarpit intended to catch web crawlers. Specifically, it's targetting crawlers that scrape data for LLM's - but really, like the plants it is named after, it'll eat just about anything that finds it's way inside.

It works by generating an endless sequences of pages, each of which with dozens of links, that simply go back into a the tarpit. Pages are randomly generated, but in a deterministic way, causing them to appear to be flat files that never change. Intentional delay is added to prevent crawlers from bogging down your server, in addition to wasting their time. Lastly, optional Markov-babble can be added to the pages, to give the crawlers something to scrape up and train their LLMs on, hopefully accelerating model collapse.

Related contents:

S5E7 - Sommes-nous à l'aube d'un effondrement des IA ? @ Underscore_'s acast .

Crawl4AI https://crawl4ai.com/mkdocs/

Fri Jan 10 06:54:32 2025

email

Open-Source LLM-Friendly Web Crawler & Scraper.

Crawl4AI delivers blazing-fast, AI-ready web crawling tailored for large language models, AI agents, and data pipelines. Fully open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.

Crawl4AI @ GitHub.

Crawlee https://crawlee.dev/

Fri Nov 8 08:54:40 2024

email

Build reliable crawlers. Fast.

A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

Crawlee @ GitHub.

Crawlee for Python https://crawlee.dev/python/

Fri Nov 8 08:53:35 2024

email

Build your Python web crawlers using Crawlee.
It helps you build reliable Python web crawlers. Fast.

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

Crawlee for Python @ GitHub.

Firecrawl https://www.firecrawl.dev/

Mon Sep 2 08:29:36 2024

email

Turn websites into LLM-ready data.

Power your AI apps with clean data crawled from any website. It's also open-source.
Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.

Firecrawl @ GitHub.

Fess https://fess.codelibs.org/

Thu Jul 18 15:27:20 2024

email

Fess is very powerful and easily deployable Enterprise Search Server.

Fess is a very powerful and easily deployable Enterprise Search Server. You can quickly install and run Fess on any platform where you can run the Java Runtime Environment. Fess is provided under the Apache License 2.0.

Fess is based on OpenSearch/Elasticsearch, but knowledge/experience about OpenSearch/Elasticsearch is not required. Fess provides an easy to use Administration GUI to configure the system via your browser. Fess also contains a Crawler, which can crawl documents on a web server, file system, or Data Store (such as a CSV or database). Many file formats are supported including (but not limited to): Microsoft Office, PDF, and zip.

Katana https://github.com/projectdiscovery/katana

Wed Nov 16 08:33:23 2022

email

A next-generation crawling and spidering framework

Greenflare SEO Web Crawler https://greenflare.io/

Mon May 30 16:47:26 2022

email

The Open Source SEO Crawler

Screaming Frog SEO Spider Website Crawler https://www.screamingfrog.co.uk/seo-spider/

Thu Mar 31 15:57:20 2022

email

The industry leading website crawler for Windows, macOS and Ubuntu, trusted by thousands of SEOs and agencies worldwide for technical SEO site audits.

Scrapy http://scrapy.org/

Thu Mar 14 16:37:56 2013

email

An open source web scraping framework for Python.

Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Scrapy @ GitHub

Links per page

Filters