Elastic Open Crawler is a lightweight, open code web crawler designed for discovering, extracting, and indexing web content directly into Elasticsearch. This CLI-driven tool streamlines web content ingestion into Elasticsearch, enabling easy searchability through on-demand or scheduled crawls defined by configuration files.
This repository contains code for the Elastic Open Web Crawler. Docker images are available for the crawler at the Elastic Docker registry.
Important
The Open Crawler is currently in beta. Beta features are subject to change and are not covered by the support SLA of generally available (GA) features. Elastic plans to promote this feature to GA in a future release.
This documentation outlines the following ways to run the Elastic Open Web Crawler:
- Simple Docker quickstart: Run a basic crawl with zero setup. No Elasticsearch instance required.
- Ingest into Elasticsearch: Configure the Open Crawler to connect to Elasticsearch and index crawl results.
- Developer guide: Build and run Open Crawler from source, for developers who want to modify or extend the code.
Elasticsearch | Open Crawler | Operating System |
---|---|---|
8.x | v0.2.x | Linux, OSX |
9.x | v0.2.1 and above | Linux, OSX |
Let's scrape our first website using the Open Crawler running on Docker!
The following commands will create a simple config file in your local directory, which will then be used by the Dockerized crawler to run a crawl. The results will be printed to your console, so no Elasticsearch setup is required for this step.
Run the following commands from your terminal:
cat > crawl-config.yml << EOF
output_sink: console
domains:
- url: https://example.com
EOF
docker run \
-v ./crawl-config.yml:/crawl-config.yml \
-it docker.elastic.co/integrations/crawler:latest jruby bin/crawler crawl /crawl-config.yml
If everything is set up correctly, you should see the crawler start up and begin crawling example.com. It will print the following output to the screen and then return control to the terminal:
[primary] Initialized an in-memory URL queue for up to 10000 URLs
[primary] Starting the primary crawl with up to 10 parallel thread(s)...
...
<HTML Content from example.com>
...
[primary] Finished a crawl. Result: success;
To run different crawls, start by changing the - url: ... entry in the crawl-config.yml file. After each change, run the docker run command again to see the new results.
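For example, to crawl a different site, or several sites in one run, list more entries under domains. The URLs below are placeholders, so swap in whatever you actually want to crawl:

cat > crawl-config.yml << EOF
output_sink: console
domains:
  - url: https://www.elastic.co
  - url: https://example.org
EOF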
Once you're ready to run a more complex crawl, check out Connecting to Elasticsearch to ingest data into your Elasticsearch instance.
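As a rough sketch of what that looks like, an Elasticsearch-backed configuration swaps the console sink for the elasticsearch sink and adds connection details. The index name, host, and API key below are placeholders, and option names can vary between releases, so treat Connecting to Elasticsearch as the source of truth:

output_sink: elasticsearch
output_index: my-crawled-site      # placeholder index name
domains:
  - url: https://example.com
elasticsearch:
  host: http://localhost           # placeholder host
  port: 9200
  api_key: <your-api-key>          # an API key (or username/password) for your cluster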
- Crawl lifecycle: Learn how the crawler discovers, queues, and indexes content across two stages: the primary crawl and the purge crawl.
- Document schema: Review the standard fields used in Elasticsearch documents, and how to extend the current schema and mappings with custom extraction rules.
- Feature comparison: See how Open Crawler compares to Elastic Crawler, including feature support and deployment differences.
- Crawl rules: Control which URLs the Open Crawler is allowed to visit (illustrated in the configuration sketch after this list).
- Extraction rules: Define how and where the crawler extracts content from HTML or URLs.
- Binary content extraction: Extract text from downloadable files like PDFs and DOCX using MIME-type matching and ingest pipelines.
- Crawler directives: Use robots.txt, meta tags, or embedded data attributes to guide discovery and content extraction.
- Ingest pipelines: Learn how Open Crawler uses Elasticsearch ingest pipelines.
- Scheduling: Use cron-based scheduling to automate crawl jobs at fixed intervals.
- Logging: Enable system and event logging to help monitor and troubleshoot crawler activity.
- Configuration files: Understand the Open Crawler and Elasticsearch YAML configuration files and how both can be leveraged to create a complete configuration.
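To give a flavour of how several of those options fit into one configuration file, here is a hedged sketch combining crawl rules with a cron-based schedule. The patterns and cron expression are illustrative only, and exact option names may differ by version, so rely on the linked pages above for the authoritative syntax:

output_sink: console
domains:
  - url: https://example.com
    crawl_rules:
      - policy: deny          # skip everything under /archive
        type: begins
        pattern: /archive
      - policy: allow         # crawl everything else
        type: regex
        pattern: .*
schedule:
  pattern: "0 6 * * *"        # cron expression: run the crawl daily at 06:00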
The Open Crawler includes a CLI for running and managing crawl jobs, validating configs, and more. See the CLI reference for available commands and usage examples.
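For instance, assuming the Docker image and config file from the quickstart above, other subcommands are invoked through the same bin/crawler entry point (confirm the exact command names against the CLI reference for your release):

# Validate the crawl configuration before running a full crawl
docker run \
  -v ./crawl-config.yml:/crawl-config.yml \
  -it docker.elastic.co/integrations/crawler:latest jruby bin/crawler validate /crawl-config.yml

# Print the installed crawler version
docker run -it docker.elastic.co/integrations/crawler:latest jruby bin/crawler version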
You can build and run the Open Crawler locally using the provided setup instructions. Detailed setup steps, including environment requirements, are in the Developer Guide.
Want to contribute? We welcome bug reports, code contributions, and documentation improvements. Read the Contributing Guide for contribution types, PR guidelines, and coding standards.
For support and contact options, see the Getting Support page.