
AnyCrawl πŸš€: A Node.js/TypeScript crawler that turns websites into LLM-ready data and extracts structured SERP results from Google/Bing/Baidu/etc. Native multi-threading for bulk processing.

AnyCrawl

πŸ“– Overview

AnyCrawl is a high-performance web crawling and scraping application that excels in multiple domains:

  • SERP Crawling: Support for multiple search engines with batch processing capabilities
  • Web Crawling: Efficient single-page content extraction
  • Site Crawling: Comprehensive full-site crawling with intelligent traversal
  • High Performance: Multi-threading and multi-process architecture
  • Batch Processing: Efficient handling of batch crawling tasks

Built on a modern multi-threaded, multi-process architecture and optimized for LLMs (Large Language Models), AnyCrawl turns raw web pages into clean, structured data that is ready for model ingestion.

πŸš€ Quick Start

πŸ“– For detailed documentation, visit Docs

Docker Deployment

docker compose up --build

Environment Variables

| Variable | Description | Default | Example |
| --- | --- | --- | --- |
| NODE_ENV | Runtime environment | production | production, development |
| ANYCRAWL_API_PORT | API service port | 8080 | 8080 |
| ANYCRAWL_HEADLESS | Use headless mode for browser engines | true | true, false |
| ANYCRAWL_PROXY_URL | Proxy server URL (supports HTTP and SOCKS) | (none) | http://proxy:8080 |
| ANYCRAWL_IGNORE_SSL_ERROR | Ignore SSL certificate errors | true | true, false |
| ANYCRAWL_KEEP_ALIVE | Keep connections alive between requests | true | true, false |
| ANYCRAWL_AVAILABLE_ENGINES | Available scraping engines (comma-separated) | cheerio,playwright,puppeteer | playwright,puppeteer |
| ANYCRAWL_API_DB_TYPE | Database type | sqlite | sqlite, postgresql |
| ANYCRAWL_API_DB_CONNECTION | Database connection string/path | /usr/src/app/db/database.db | /path/to/db.sqlite, postgresql://user:pass@localhost/db |
| ANYCRAWL_REDIS_URL | Redis connection URL | redis://redis:6379 | redis://localhost:6379 |
| ANYCRAWL_API_AUTH_ENABLED | Enable API authentication | false | true, false |
| ANYCRAWL_API_CREDITS_ENABLED | Enable credit system | false | true, false |
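For a self-hosted deployment, these variables are typically collected in an environment file passed to Docker Compose. A minimal sketch (all values here are illustrative assumptions, not required settings):

```shell
# .env — illustrative self-hosted configuration (hypothetical values)
NODE_ENV=production
ANYCRAWL_API_PORT=8080
# Limit engines to the ones you need; browser engines use more memory
ANYCRAWL_AVAILABLE_ENGINES=cheerio,playwright
ANYCRAWL_REDIS_URL=redis://redis:6379
ANYCRAWL_API_AUTH_ENABLED=false
```

Unset variables fall back to the defaults in the table above.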

πŸ“š Usage Examples

πŸ’‘ You can use the Playground to test the API and generate code examples in your preferred programming language.

Note: If you are self-hosting AnyCrawl, make sure to replace https://api.anycrawl.dev with your own server URL.

Web Scraping

Basic Usage

curl -X POST http://localhost:8080/v1/scrape \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_ANYCRAWL_API_KEY' \
  -d '{
  "url": "https://example.com",
  "engine": "cheerio"
}'
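The same request can be made programmatically. A minimal TypeScript sketch (Node.js 18+ for the built-in fetch); `buildScrapeRequest` and `scrape` are hypothetical helpers written for this example, not part of an official AnyCrawl SDK:

```typescript
// Hypothetical helper: builds and validates the JSON body for POST /v1/scrape.
type ScrapeEngine = "cheerio" | "playwright" | "puppeteer";

interface ScrapeRequest {
  url: string;
  engine: ScrapeEngine;
  proxy?: string; // optional per-request proxy
}

function buildScrapeRequest(
  url: string,
  engine: ScrapeEngine = "cheerio",
  proxy?: string
): ScrapeRequest {
  // The API requires a valid http:// or https:// URL
  if (!/^https?:\/\//.test(url)) {
    throw new Error("url must start with http:// or https://");
  }
  return proxy ? { url, engine, proxy } : { url, engine };
}

// Hypothetical client call using the global fetch available in Node.js 18+.
async function scrape(apiKey: string, req: ScrapeRequest): Promise<unknown> {
  const res = await fetch("http://localhost:8080/v1/scrape", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify(req),
  });
  if (!res.ok) throw new Error(`scrape failed: HTTP ${res.status}`);
  return res.json();
}
```

Replace `localhost:8080` with your own server URL or https://api.anycrawl.dev as appropriate.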

Parameters

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| url | string (required) | The URL to be scraped. Must be a valid URL starting with http:// or https:// | - |
| engine | string | Scraping engine to use. Options: cheerio (static HTML parsing, fastest), playwright (JavaScript rendering with modern engine), puppeteer (JavaScript rendering with Chrome) | cheerio |
| proxy | string | Proxy URL for the request. Supports HTTP and SOCKS proxies. Format: http://[username]:[password]@proxy:port | (none) |

Search Engine Results (SERP)

Basic Usage

curl -X POST http://localhost:8080/v1/search \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_ANYCRAWL_API_KEY' \
  -d '{
  "query": "AnyCrawl",
  "limit": 10,
  "engine": "google",
  "lang": "all"
}'
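As with scraping, the search body can be constructed in code. A TypeScript sketch with the defaults from the parameter table below; `buildSearchRequest` is a hypothetical helper for this example, not an official SDK function:

```typescript
// Hypothetical helper: builds the JSON body for POST /v1/search,
// applying the documented defaults (engine: google, pages: 1, lang: en-US).
interface SearchRequest {
  query: string;
  engine: string; // currently "google" is the listed option
  pages: number;
  lang: string;
  limit?: number; // optional cap on results, as in the curl example above
}

function buildSearchRequest(
  query: string,
  opts: Partial<Omit<SearchRequest, "query">> = {}
): SearchRequest {
  if (!query.trim()) throw new Error("query is required");
  return {
    query,
    engine: opts.engine ?? "google",
    pages: opts.pages ?? 1,
    lang: opts.lang ?? "en-US",
    ...(opts.limit !== undefined ? { limit: opts.limit } : {}),
  };
}
```

The resulting object can be posted to `/v1/search` with the same headers as the scrape example.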

Parameters

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| query | string (required) | Search query to be executed | - |
| engine | string | Search engine to use. Options: google | google |
| pages | integer | Number of search result pages to retrieve | 1 |
| lang | string | Language code for search results (e.g., 'en', 'zh', 'all') | en-US |

Supported Search Engines

  • Google

❓ FAQ

Common Questions

  1. Q: Can I use proxies? A: Yes, AnyCrawl supports both HTTP and SOCKS proxies. Configure them through the ANYCRAWL_PROXY_URL environment variable.

  2. Q: How to handle JavaScript-rendered content? A: AnyCrawl supports Puppeteer and Playwright for JavaScript rendering needs.
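A proxy can be set globally via the environment variable, or per request via the `proxy` parameter. A sketch with hypothetical hostnames and credentials:

```shell
# Global proxy for all requests (hypothetical values)
export ANYCRAWL_PROXY_URL="http://user:pass@proxy.example.com:8080"

# SOCKS proxies use the same variable (scheme shown is an assumption;
# the docs only state that HTTP and SOCKS are supported)
# export ANYCRAWL_PROXY_URL="socks5://proxy.example.com:1080"
```

A per-request proxy passed in the scrape body overrides nothing documented here; it simply applies to that request, as described in the scrape parameters table.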

🀝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

🎯 Mission

Our mission is to build foundational products for the AI ecosystem, providing essential tools that empower both individuals and enterprises to develop AI applications. We are committed to accelerating the advancement of AI technology by delivering robust, scalable infrastructure that serves as the cornerstone for innovation in artificial intelligence.


Built with ❀️ by the Any4AI team
