Elastic Open Web Crawler

Elastic Open Crawler is a lightweight, open code web crawler designed for discovering, extracting, and indexing web content directly into Elasticsearch. This CLI-driven tool streamlines web content ingestion into Elasticsearch, enabling easy searchability through on-demand or scheduled crawls defined by configuration files.

This repository contains code for the Elastic Open Web Crawler. Docker images are available for the crawler at the Elastic Docker registry.

Important

The Open Crawler is currently in beta. Beta features are subject to change and are not covered by the support SLA of generally available (GA) features. Elastic plans to promote this feature to GA in a future release.

Getting started

This documentation covers version compatibility, a simple Docker quickstart, and ingesting crawl results into Elasticsearch. See the Developer guide below for building and running from source.

Version compatibility

Elasticsearch    Open Crawler          Operating System
8.x              v0.2.x                Linux, OSX
9.x              v0.2.1 and above      Linux, OSX

Simple Docker quickstart

Let's scrape our first website using the Open Crawler running on Docker!

The following commands will create a simple config file in your local directory, which will then be used by the Dockerized crawler to run a crawl. The results will be printed to your console, so no Elasticsearch setup is required for this step.

Run the following commands from your terminal:

cat > crawl-config.yml << EOF
output_sink: console
domains:
  - url: https://example.com
EOF

docker run \
  -v ./crawl-config.yml:/crawl-config.yml \
  -it docker.elastic.co/integrations/crawler:latest jruby bin/crawler crawl /crawl-config.yml

If everything is set up correctly, you should see the crawler start up and begin crawling example.com. It prints output similar to the following and then returns control to the terminal:

[primary] Initialized an in-memory URL queue for up to 10000 URLs
[primary] Starting the primary crawl with up to 10 parallel thread(s)...
...
<HTML Content from example.com>
...
[primary] Finished a crawl. Result: success;

To run a different crawl, change the - url: ... value in the crawl-config.yml file. After each change, run the same docker run command again to see the results.
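For example, pointing the crawler at another site only requires swapping the domain URL. The config below is a minimal sketch with the same structure as the quickstart; the target URL is just a placeholder:

# crawl-config.yml -- same structure as the quickstart, different target site
output_sink: console
domains:
  - url: https://www.example.org   # placeholder; replace with the site you want to crawl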

Ingest into Elasticsearch

Once you're ready to run a more complex crawl, check out Connecting to Elasticsearch to ingest data into your Elasticsearch instance.
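As a minimal sketch of an Elasticsearch-bound config (the host, port, API key, and index name below are placeholders, and the exact key names should be checked against the Connecting to Elasticsearch page):

# Sketch only -- connection details and index name are placeholders
output_sink: elasticsearch
output_index: my-crawled-content

elasticsearch:
  host: https://localhost
  port: 9200
  api_key: <your-api-key>

domains:
  - url: https://example.com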

Documentation

Core concepts

  • Crawl lifecycle: Learn how the crawler discovers, queues, and indexes content across two stages: the primary crawl and the purge crawl.
  • Document schema: Review the standard fields used in Elasticsearch documents, and how to extend the current schema and mappings with custom extraction rules. (A rough sketch of such a document follows this list.)
  • Feature comparison: See how Open Crawler compares to Elastic Crawler, including feature support and deployment differences.
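As a loose illustration of the kind of document the crawler indexes (the field names here are assumptions for illustration only; the Document schema page is authoritative):

# Illustrative fields only -- names are assumptions, not the real mapping
id: "a1b2c3"
url: "https://example.com/blog/post"
title: "Example blog post"
body: "Extracted page text..."
last_crawled_at: "2025-01-01T12:00:00Z"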

Crawler features

  • Crawl rules: Control which URLs the Open Crawler is allowed to visit. (A combined configuration sketch follows this list.)
  • Extraction rules: Define how and where the crawler extracts content from HTML or URLs.
  • Binary content extraction: Extract text from downloadable files like PDFs and DOCX using MIME-type matching and ingest pipelines.
  • Crawler directives: Use robots.txt, meta tags, or embedded data attributes to guide discovery and content extraction.
  • Ingest pipelines: Learn how Open Crawler uses Elasticsearch ingest pipelines.
  • Scheduling: Use cron-based scheduling to automate crawl jobs at fixed intervals.
  • Logging: Enable system and event logging to help monitor and troubleshoot crawler activity.
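Several of these features are expressed in the same YAML config file. The snippet below is a hedged sketch only; the rule, extraction, and schedule syntax shown here is an assumption, and the linked feature pages are authoritative:

domains:
  - url: https://example.com
    # Crawl rules: allow or deny URL patterns (syntax is a sketch)
    crawl_rules:
      - policy: deny
        type: begins
        pattern: /archive
    # Extraction rules: pull named fields out of matching pages (sketch)
    extraction_rulesets:
      - url_filters:
          - type: begins
            pattern: /blog
        rules:
          - action: extract
            field_name: author
            selector: ".author-name"
            source: html

# Scheduling: cron-style pattern for recurring crawls (sketch)
schedule:
  pattern: "0 2 * * *"   # every day at 02:00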

Configuration

  • Configuration files: Understand the Open Crawler and Elasticsearch YAML configuration files and how both can be leveraged to create a complete configuration.
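As an illustration of how the two files can be combined at run time (the --es-config option name follows the crawler documentation, but treat this invocation as a sketch and verify it against the CLI reference):

docker run \
  -v ./crawl-config.yml:/crawl-config.yml \
  -v ./elasticsearch.yml:/elasticsearch.yml \
  -it docker.elastic.co/integrations/crawler:latest \
  jruby bin/crawler crawl /crawl-config.yml --es-config /elasticsearch.yml

Keeping Elasticsearch connection settings in their own file lets several crawl configs share the same cluster details.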

Developer guide

Crawler CLI

The Open Crawler includes a CLI for running and managing crawl jobs, validating configs, and more. See the CLI reference for available commands and usage examples.
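The crawl subcommand is the one used throughout this README; the other invocations below are sketches based on the CLI reference and should be checked against it:

# Run a one-off crawl with the given config (as in the quickstart above)
bin/crawler crawl path/to/crawl-config.yml

# Validate a config before crawling (subcommand name assumed from the CLI reference)
bin/crawler validate path/to/crawl-config.yml

# Run crawls on the schedule defined in the config (sketch)
bin/crawler schedule path/to/crawl-config.yml

# Print the installed crawler version
bin/crawler version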

Build from source

You can build and run the Open Crawler locally using the provided setup instructions. Detailed setup steps, including environment requirements, are in the Developer Guide.

Contribute

Want to contribute? We welcome bug reports, code contributions, and documentation improvements. Read the Contributing Guide for contribution types, PR guidelines, and coding standards.

Contact

For support and contact options, see the Getting Support page.
