Build reliable crawlers. Fast.
A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
Build your Python web crawlers using Crawlee.
It helps you build reliable Python web crawlers. Fast.
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
Turn websites into LLM-ready data.
Power your AI apps with clean data crawled from any website. It's also open-source.
🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.
Fess is very powerful and easily deployable Enterprise Search Server.
Fess is a very powerful and easily deployable Enterprise Search Server. You can quickly install and run Fess on any platform where you can run the Java Runtime Environment. Fess is provided under the Apache License 2.0.
Fess is based on OpenSearch/Elasticsearch, but knowledge/experience about OpenSearch/Elasticsearch is not required. Fess provides an easy to use Administration GUI to configure the system via your browser. Fess also contains a Crawler, which can crawl documents on a web server, file system, or Data Store (such as a CSV or database). Many file formats are supported including (but not limited to): Microsoft Office, PDF, and zip.
An open source web scraping framework for Python.
Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.