
AnyCrawl πŸš€: A Node.js/TypeScript crawler that turns websites into LLM-ready data and extracts structured SERP results from Google/Bing/Baidu/etc. Native multi-threading for bulk processing.

AnyCrawl

πŸ“– Overview

AnyCrawl is a high-performance web crawling and scraping application that excels in multiple domains:

  • SERP Crawling: Support for multiple search engines with batch processing capabilities
  • Web Crawling: Efficient single-page content extraction
  • Site Crawling: Comprehensive full-site crawling with intelligent traversal
  • High Performance: Multi-threading and multi-process architecture
  • Batch Processing: Efficient handling of batch crawling tasks

Built on a modern multi-threaded, multi-process architecture and optimized for LLMs (Large Language Models), AnyCrawl turns raw web pages into clean, structured data that is ready for model ingestion.

πŸš€ Quick Start

πŸ“– For detailed documentation, visit Docs

Docker Deployment

docker compose up --build

Environment Variables

| Variable | Description | Default | Example |
| --- | --- | --- | --- |
| NODE_ENV | Runtime environment | production | production, development |
| ANYCRAWL_API_PORT | API service port | 8080 | 8080 |
| ANYCRAWL_HEADLESS | Use headless mode for browser engines | true | true, false |
| ANYCRAWL_PROXY_URL | Proxy server URL (supports HTTP and SOCKS) | (none) | http://proxy:8080 |
| ANYCRAWL_IGNORE_SSL_ERROR | Ignore SSL certificate errors | true | true, false |
| ANYCRAWL_KEEP_ALIVE | Keep connections alive between requests | true | true, false |
| ANYCRAWL_AVAILABLE_ENGINES | Available scraping engines (comma-separated) | cheerio,playwright,puppeteer | playwright,puppeteer |
| ANYCRAWL_API_DB_TYPE | Database type | sqlite | sqlite, postgresql |
| ANYCRAWL_API_DB_CONNECTION | Database connection string/path | /usr/src/app/db/database.db | /path/to/db.sqlite, postgresql://user:pass@localhost/db |
| ANYCRAWL_REDIS_URL | Redis connection URL | redis://redis:6379 | redis://localhost:6379 |
| ANYCRAWL_API_AUTH_ENABLED | Enable API authentication | false | true, false |
| ANYCRAWL_API_CREDITS_ENABLED | Enable credit system | false | true, false |
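For a self-hosted deployment, these variables are typically collected in an environment file passed to Docker Compose. A minimal sketch (all values here are illustrative assumptions, not required settings):

```shell
# .env — illustrative self-hosted configuration (hypothetical values)
NODE_ENV=production
ANYCRAWL_API_PORT=8080
# Limit engines to the ones you need; browser engines use more memory
ANYCRAWL_AVAILABLE_ENGINES=cheerio,playwright
ANYCRAWL_REDIS_URL=redis://redis:6379
ANYCRAWL_API_AUTH_ENABLED=false
```

Unset variables fall back to the defaults in the table above.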

πŸ“š Usage Examples

πŸ’‘ You can use the Playground to test the API and generate code examples in your preferred programming language.

Note: If you are self-hosting AnyCrawl, make sure to replace https://api.anycrawl.dev with your own server URL.

Web Scraping

Basic Usage

curl -X POST http://localhost:8080/v1/scrape \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_ANYCRAWL_API_KEY' \
  -d '{
  "url": "https://example.com",
  "engine": "cheerio"
}'
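The same request can be made programmatically. A minimal TypeScript sketch (Node.js 18+ for the built-in fetch); `buildScrapeRequest` and `scrape` are hypothetical helpers written for this example, not part of an official AnyCrawl SDK:

```typescript
// Hypothetical helper: builds and validates the JSON body for POST /v1/scrape.
type ScrapeEngine = "cheerio" | "playwright" | "puppeteer";

interface ScrapeRequest {
  url: string;
  engine: ScrapeEngine;
  proxy?: string; // optional per-request proxy
}

function buildScrapeRequest(
  url: string,
  engine: ScrapeEngine = "cheerio",
  proxy?: string
): ScrapeRequest {
  // The API requires a valid http:// or https:// URL
  if (!/^https?:\/\//.test(url)) {
    throw new Error("url must start with http:// or https://");
  }
  return proxy ? { url, engine, proxy } : { url, engine };
}

// Hypothetical client call using the global fetch available in Node.js 18+.
async function scrape(apiKey: string, req: ScrapeRequest): Promise<unknown> {
  const res = await fetch("http://localhost:8080/v1/scrape", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify(req),
  });
  if (!res.ok) throw new Error(`scrape failed: HTTP ${res.status}`);
  return res.json();
}
```

Replace `localhost:8080` with your own server URL or https://api.anycrawl.dev as appropriate.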

Parameters

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| url | string (required) | The URL to be scraped. Must be a valid URL starting with http:// or https:// | - |
| engine | string | Scraping engine to use. Options: cheerio (static HTML parsing, fastest), playwright (JavaScript rendering with modern engine), puppeteer (JavaScript rendering with Chrome) | cheerio |
| proxy | string | Proxy URL for the request. Supports HTTP and SOCKS proxies. Format: http://[username]:[password]@proxy:port | (none) |

Search Engine Results (SERP)

Basic Usage

curl -X POST http://localhost:8080/v1/search \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_ANYCRAWL_API_KEY' \
  -d '{
  "query": "AnyCrawl",
  "limit": 10,
  "engine": "google",
  "lang": "all"
}'
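As with scraping, the search body can be constructed in code. A TypeScript sketch with the defaults from the parameter table below; `buildSearchRequest` is a hypothetical helper for this example, not an official SDK function:

```typescript
// Hypothetical helper: builds the JSON body for POST /v1/search,
// applying the documented defaults (engine: google, pages: 1, lang: en-US).
interface SearchRequest {
  query: string;
  engine: string; // currently "google" is the listed option
  pages: number;
  lang: string;
  limit?: number; // optional cap on results, as in the curl example above
}

function buildSearchRequest(
  query: string,
  opts: Partial<Omit<SearchRequest, "query">> = {}
): SearchRequest {
  if (!query.trim()) throw new Error("query is required");
  return {
    query,
    engine: opts.engine ?? "google",
    pages: opts.pages ?? 1,
    lang: opts.lang ?? "en-US",
    ...(opts.limit !== undefined ? { limit: opts.limit } : {}),
  };
}
```

The resulting object can be posted to `/v1/search` with the same headers as the scrape example.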

Parameters

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| query | string (required) | Search query to be executed | - |
| engine | string | Search engine to use. Options: google | google |
| pages | integer | Number of search result pages to retrieve | 1 |
| lang | string | Language code for search results (e.g., 'en', 'zh', 'all') | en-US |

Supported Search Engines

  • Google

❓ FAQ

Common Questions

  1. Q: Can I use proxies? A: Yes, AnyCrawl supports both HTTP and SOCKS proxies. Configure them through the ANYCRAWL_PROXY_URL environment variable.

  2. Q: How to handle JavaScript-rendered content? A: AnyCrawl supports Puppeteer and Playwright for JavaScript rendering needs.
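A proxy can be set globally via the environment variable, or per request via the `proxy` parameter. A sketch with hypothetical hostnames and credentials:

```shell
# Global proxy for all requests (hypothetical values)
export ANYCRAWL_PROXY_URL="http://user:pass@proxy.example.com:8080"

# SOCKS proxies use the same variable (scheme shown is an assumption;
# the docs only state that HTTP and SOCKS are supported)
# export ANYCRAWL_PROXY_URL="socks5://proxy.example.com:1080"
```

A per-request proxy passed in the scrape body overrides nothing documented here; it simply applies to that request, as described in the scrape parameters table.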

🀝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

🎯 Mission

Our mission is to build foundational products for the AI ecosystem, providing essential tools that empower both individuals and enterprises to develop AI applications. We are committed to accelerating the advancement of AI technology by delivering robust, scalable infrastructure that serves as the cornerstone for innovation in artificial intelligence.


Built with ❀️ by the Any4AI team
