Elastic Open Web Crawler

Elastic Open Crawler is a lightweight, open code web crawler designed for discovering, extracting, and indexing web content directly into Elasticsearch. This CLI-driven tool streamlines web content ingestion into Elasticsearch, enabling easy searchability through on-demand or scheduled crawls defined by configuration files.

This repository contains code for the Elastic Open Web Crawler. Docker images are available for the crawler at the Elastic Docker registry.

Important

The Open Crawler is currently in beta. Beta features are subject to change and are not covered by the support SLA of generally available (GA) features. Elastic plans to promote this feature to GA in a future release.

Getting started

This documentation covers version compatibility, a simple Docker quickstart, and ingesting crawl results into Elasticsearch. See the Developer guide below for building and running from source.

Version compatibility

Elasticsearch    Open Crawler          Operating System
8.x              v0.2.x                Linux, OSX
9.x              v0.2.1 and above      Linux, OSX

Simple Docker quickstart

Let's scrape our first website using the Open Crawler running on Docker!

The following commands will create a simple config file in your local directory, which will then be used by the Dockerized crawler to run a crawl. The results will be printed to your console, so no Elasticsearch setup is required for this step.

Run the following commands from your terminal:

cat > crawl-config.yml << EOF
output_sink: console
domains:
  - url: https://example.com
EOF

docker run \
  -v ./crawl-config.yml:/crawl-config.yml \
  -it docker.elastic.co/integrations/crawler:latest jruby bin/crawler crawl /crawl-config.yml

If everything is set up correctly, you should see the crawler start up and begin crawling example.com. It prints output similar to the following and then returns control to the terminal:

[primary] Initialized an in-memory URL queue for up to 10000 URLs
[primary] Starting the primary crawl with up to 10 parallel thread(s)...
...
<HTML Content from example.com>
...
[primary] Finished a crawl. Result: success;

To run a different crawl, change the - url: ... value in the crawl-config.yml file. After each change, run the same docker run command again to see the results.
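For example, pointing the crawler at another site only requires swapping the domain URL. The config below is a minimal sketch with the same structure as the quickstart; the target URL is just a placeholder:

# crawl-config.yml -- same structure as the quickstart, different target site
output_sink: console
domains:
  - url: https://www.example.org   # placeholder; replace with the site you want to crawl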

Ingest into Elasticsearch

Once you're ready to run a more complex crawl, check out Connecting to Elasticsearch to ingest data into your Elasticsearch instance.
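As a minimal sketch of an Elasticsearch-bound config (the host, port, API key, and index name below are placeholders, and the exact key names should be checked against the Connecting to Elasticsearch page):

# Sketch only -- connection details and index name are placeholders
output_sink: elasticsearch
output_index: my-crawled-content

elasticsearch:
  host: https://localhost
  port: 9200
  api_key: <your-api-key>

domains:
  - url: https://example.com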

Documentation

Core concepts

  • Crawl lifecycle: Learn how the crawler discovers, queues, and indexes content across two stages: the primary crawl and the purge crawl.
  • Document schema: Review the standard fields used in Elasticsearch documents, and how to extend the current schema and mappings with custom extraction rules. (A rough sketch of such a document follows this list.)
  • Feature comparison: See how Open Crawler compares to Elastic Crawler, including feature support and deployment differences.
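As a loose illustration of the kind of document the crawler indexes (the field names here are assumptions for illustration only; the Document schema page is authoritative):

# Illustrative fields only -- names are assumptions, not the real mapping
id: "a1b2c3"
url: "https://example.com/blog/post"
title: "Example blog post"
body: "Extracted page text..."
last_crawled_at: "2025-01-01T12:00:00Z"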

Crawler features

  • Crawl rules: Control which URLs the Open Crawler is allowed to visit. (A combined configuration sketch follows this list.)
  • Extraction rules: Define how and where the crawler extracts content from HTML or URLs.
  • Binary content extraction: Extract text from downloadable files like PDFs and DOCX using MIME-type matching and ingest pipelines.
  • Crawler directives: Use robots.txt, meta tags, or embedded data attributes to guide discovery and content extraction.
  • Ingest pipelines: Learn how Open Crawler uses Elasticsearch ingest pipelines.
  • Scheduling: Use cron-based scheduling to automate crawl jobs at fixed intervals.
  • Logging: Enable system and event logging to help monitor and troubleshoot crawler activity.
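Several of these features are expressed in the same YAML config file. The snippet below is a hedged sketch only; the rule, extraction, and schedule syntax shown here is an assumption, and the linked feature pages are authoritative:

domains:
  - url: https://example.com
    # Crawl rules: allow or deny URL patterns (syntax is a sketch)
    crawl_rules:
      - policy: deny
        type: begins
        pattern: /archive
    # Extraction rules: pull named fields out of matching pages (sketch)
    extraction_rulesets:
      - url_filters:
          - type: begins
            pattern: /blog
        rules:
          - action: extract
            field_name: author
            selector: ".author-name"
            source: html

# Scheduling: cron-style pattern for recurring crawls (sketch)
schedule:
  pattern: "0 2 * * *"   # every day at 02:00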

Configuration

  • Configuration files: Understand the Open Crawler and Elasticsearch YAML configuration files and how both can be leveraged to create a complete configuration.
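As an illustration of how the two files can be combined at run time (the --es-config option name follows the crawler documentation, but treat this invocation as a sketch and verify it against the CLI reference):

docker run \
  -v ./crawl-config.yml:/crawl-config.yml \
  -v ./elasticsearch.yml:/elasticsearch.yml \
  -it docker.elastic.co/integrations/crawler:latest \
  jruby bin/crawler crawl /crawl-config.yml --es-config /elasticsearch.yml

Keeping Elasticsearch connection settings in their own file lets several crawl configs share the same cluster details.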

Developer guide

Crawler CLI

The Open Crawler includes a CLI for running and managing crawl jobs, validating configs, and more. See the CLI reference for available commands and usage examples.
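The crawl subcommand is the one used throughout this README; the other invocations below are sketches based on the CLI reference and should be checked against it:

# Run a one-off crawl with the given config (as in the quickstart above)
bin/crawler crawl path/to/crawl-config.yml

# Validate a config before crawling (subcommand name assumed from the CLI reference)
bin/crawler validate path/to/crawl-config.yml

# Run crawls on the schedule defined in the config (sketch)
bin/crawler schedule path/to/crawl-config.yml

# Print the installed crawler version
bin/crawler version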

Build from source

You can build and run the Open Crawler locally using the provided setup instructions. Detailed setup steps, including environment requirements, are in the Developer Guide.

Contribute

Want to contribute? We welcome bug reports, code contributions, and documentation improvements. Read the Contributing Guide for contribution types, PR guidelines, and coding standards.

Contact

For support and contact options, see the Getting Support page.
