#percolator_scaling

The following provides scripts and test files for performance testing Elasticsearch Percolator using a public Best Buy dataset and complements the blog post:

[reference]

It is intended to assist with replication of the tests and to demonstrate Percolator scaling properties only. Queries have not been optimised, and the relevancy/recall of documents against queries has not been considered.

Any use case here is hypothetical. It is assumed each Percolator query represents a user's registered interest in a product category, with a set of search terms provided. Documents are assumed to be new products being listed. Documents are percolated against the queries, indicating which users would theoretically be alerted about the product. No alerting is performed by the tests, which simply percolate the documents sequentially using JMeter.

The following assumes Elasticsearch 1.7.1.

##Data Set and Scripts

The dataset can be downloaded from:

https://www.kaggle.com/c/acm-sf-chapter-hackathon-big/data

One file is utilised for Percolator testing:

A training csv file (train.csv) containing approximately 1.8 million entries. Each line represents a user clicking on an item. The information captured includes the item selected, its category and the search term used to locate the item.

The above file can be converted into Percolator queries using the Python (v2.7.x) script createQueries.py:

python createQueries.py

This produces the file queries.csv. The optional parameter -n controls the number of queries output, e.g.

python createQueries.py -n 10000

###Percolator Queries

Example Percolator Query:

{
  "_index": "best_buy",
  "_type": ".percolator",
  "_id": "AU9lnTnrPssfQ9IP0sO2",
  "_score": 1,
  "_source": {
     "query": {
        "filtered": {
           "filter": {
              "bool": {
                 "must": [
                    {
                       "term": {
                          "category_id": "abcat0101001"
                       }
                    }
                 ]
              }
           },
           "query": {
              "multi_match": {
                 "query": "Televisiones Panasonic  50 pulgadas",
                 "fields": [
                    "description",
                    "name",
                    "search_terms"
                 ]
              }
           }
        }
     },
     "category_id": "abcat0101001",
     "user": "000000df17"
  }
}

The above query would be for user "000000df17", matching on any products in category "abcat0101001" with terms "Televisiones Panasonic 50 pulgadas".
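The structure of the example can be reproduced programmatically. The following is a minimal Python 3 sketch, not the code used by createQueries.py or indexQueries.py; the function name `build_percolator_query` is hypothetical and only the field names are taken from the example above.

```python
import json


def build_percolator_query(user, category_id, search_terms):
    """Return the _source body of a percolator query for one user interest.

    Hypothetical helper: the layout mirrors the example query above.
    """
    return {
        "query": {
            "filtered": {
                # Restrict matches to products in the registered category.
                "filter": {
                    "bool": {"must": [{"term": {"category_id": category_id}}]}
                },
                # Match the user's search terms against the product text fields.
                "query": {
                    "multi_match": {
                        "query": search_terms,
                        "fields": ["description", "name", "search_terms"],
                    }
                },
            }
        },
        # Metadata stored alongside the query, as in the example above.
        "category_id": category_id,
        "user": user,
    }


if __name__ == "__main__":
    body = build_percolator_query(
        "000000df17", "abcat0101001", "Televisiones Panasonic  50 pulgadas"
    )
    print(json.dumps(body, indent=2))
```

Storing `category_id` and `user` next to the query keeps them available as metadata on the registered query.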

###Indexing Percolator queries

indexQueries.py is provided to assist with indexing Percolator queries. The script:

- Creates an index "best_buy". Caution: if the index already exists, it is deleted and re-created.
- Applies the schema provided in settings.json. The number of primary shards/replicas can also be set in settings.json.
- Indexes N queries from the file queries.csv in batches of X (default 10000). Assumes Elasticsearch is running on localhost:9200; this is easily modified in the script.
- Performs a flush on completion of indexing and forces optimisation to a single segment.

Execute as follows:

python indexQueries.py -x <batch_size> -n <number_of_docs>
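To illustrate what indexing "in batches of X" could look like, here is a minimal Python 3 sketch that splits queries into batches and serialises each batch as a bulk request body. It is an assumption about the approach, not the contents of indexQueries.py; the helper names `chunked` and `bulk_body` are hypothetical. Each body would be POSTed to the Elasticsearch `_bulk` endpoint.

```python
import json


def chunked(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch


def bulk_body(queries, index="best_buy", doc_type=".percolator"):
    """Serialise query bodies into a newline-delimited _bulk request body."""
    # One action line per query; no _id is given, so Elasticsearch generates one.
    action = json.dumps({"index": {"_index": index, "_type": doc_type}})
    lines = []
    for query in queries:
        lines.append(action)
        lines.append(json.dumps(query))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = [{"query": {"match_all": {}}, "user": "u%d" % i} for i in range(5)]
    for batch in chunked(sample, batch_size=2):
        print(bulk_body(batch), end="")
```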

###Percolator Documents

Documents for percolation have been generated from the products.tar provided by Best Buy. The following files provide documents for sample percolation, with each line representing a product:

- docs_500.txt - 500 sample docs, with no filters.
- docs_500_filtered.txt - 500 sample docs, with filters.
- docs_1000.txt - 1000 sample docs, with no filters.
- docs_1000_filtered.txt - 1000 sample docs, with filters.

Example doc (without filter):

{
  "doc": {
    "description": "Compatible with select 1998-2008 Ford vehicles; connects an aftermarket radio to a vehicle's harness",
    "category_id": [
      "cat00000",
      "abcat0300000",
      "pcmcat165900050023",
      "pcmcat165900050031",
      "pcmcat165900050034"
    ],
    "search_terms": [
      "harness",
      "wiring",
      "ford",
      "radio"
    ],
    "name": "Metra - Wiring Harness for Select 1998-2008 Ford Vehicles - Multicolored",
    "id": "347137"
  },
  "size": 10
}

Example doc (with filter):

{
  "filter": {
    "bool": {
      "must": [
        {
          "terms": {
            "category_id": [
              "cat00000",
              "abcat0300000",
              "pcmcat165900050023",
              "pcmcat165900050031",
              "pcmcat165900050034"
            ]
          }
        }
      ]
    }
  },
  "doc": {
    "description": "Compatible with select 1998-2008 Ford vehicles; connects an aftermarket radio to a vehicle's harness",
    "category_id": [
      "cat00000",
      "abcat0300000",
      "pcmcat165900050023",
      "pcmcat165900050031",
      "pcmcat165900050034"
    ],
    "search_terms": [
      "harness",
      "wiring",
      "ford",
      "radio"
    ],
    "name": "Metra - Wiring Harness for Select 1998-2008 Ford Vehicles - Multicolored",
    "id": "347137"
  },
  "size": 10
}

Documents were selected from the first N products provided by Best Buy. No attempt has been made to clean or optimise keywords/terms.

For each percolated doc:

- 10 matches are requested.
- search_terms have been added. These represent the top 10 terms used to find the product by users and have been obtained from a terms agg on the indexed queries. These have been added to ensure every document matches a percolation query.
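The two request shapes shown above differ only in the optional `filter` block. A minimal Python 3 sketch of how such requests could be built from a product record follows; the function `build_percolate_request` is hypothetical and is not the script used to generate the provided files.

```python
import json


def build_percolate_request(product, size=10, filter_by_category=False):
    """Wrap a product in a percolate request, optionally with a category filter."""
    request = {"doc": product, "size": size}
    if filter_by_category:
        # Only evaluate registered queries whose category_id metadata
        # is one of the product's categories.
        request["filter"] = {
            "bool": {"must": [{"terms": {"category_id": product["category_id"]}}]}
        }
    return request


if __name__ == "__main__":
    product = {
        "id": "347137",
        "name": "Metra - Wiring Harness for Select 1998-2008 Ford Vehicles - Multicolored",
        "category_id": ["cat00000", "abcat0300000"],
        "search_terms": ["harness", "wiring", "ford", "radio"],
    }
    # One request per line, matching the layout of the docs_*.txt files.
    print(json.dumps(build_percolate_request(product)))
    print(json.dumps(build_percolate_request(product, filter_by_category=True)))
```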

###Running Tests with JMeter

For the purposes of testing performance, documents can be percolated using the simple JMeter test provided, 'PercolatorTest.jmx'. This test executes the percolate requests sequentially in a single thread with no delay. No attempt is made to verify that responses are correct, although users can view results through the "View Results in Tree" component. A Summary Report allows users to view statistics, e.g. avg response time.
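For readers without JMeter, the same sequential, single-threaded replay can be approximated in Python 3. This is a rough sketch under stated assumptions, not a replacement for PercolatorTest.jmx: `replay`, `summarise` and `post_percolate` are hypothetical names, and the default URL simply combines the host, port and endpoint assumed elsewhere in this README.

```python
import time
import urllib.request


def post_percolate(body, url="http://localhost:9200/best_buy/.percolator/_percolate"):
    """POST one percolate request body and return the raw response bytes."""
    req = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()


def replay(lines, send, clock=time.perf_counter):
    """Send each request body in order and return per-request latencies in ms."""
    latencies = []
    for body in lines:
        body = body.strip()
        if not body:
            continue  # skip blank lines
        start = clock()
        send(body)
        latencies.append((clock() - start) * 1000.0)
    return latencies


def summarise(latencies):
    """Return the sample count and min/avg/max latency in milliseconds."""
    if not latencies:
        return {"samples": 0}
    return {
        "samples": len(latencies),
        "min_ms": min(latencies),
        "avg_ms": sum(latencies) / len(latencies),
        "max_ms": max(latencies),
    }
```

With Elasticsearch running, a run over the unfiltered sample file might look like `summarise(replay(open("docs_500.txt"), post_percolate))`. Passing `send` as a parameter keeps the timing loop testable without a live cluster.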

####Settings

Docs to be used for percolate tests. Assumes one percolate doc per line in the format shown above. The file read can be changed through the CSV data source "PercolateQueriesReader" (screenshot: Setting_Source_File.png).

Percolate endpoint. Assumed to be /best_buy/.percolator/_percolate. If you have used the indexing script this shouldn't need changing. If required, change it via the HTTP request (screenshot: Setting_Percolate_Path.png).

Host and port. Assumed to be localhost:9200. Change via the HTTP request (screenshot: Setting_Host_Port.png).

Number of percolate samples. Set to 500 by default. Consider changing if using the larger 1000 sample files. Change via the Percolate Thread Group (screenshot: Setting_Loop_Count.png).

###Blog Tests

All of the above was used to test the scaling properties of Percolator. Details can be found at:

[reference]

The following was repeated for docs_500.txt and docs_500_filtered.txt. Initially, N was set to 100,000.

- Index N queries using the indexQueries.py script provided.
- Replay the 500 percolate requests against the index and record statistics, e.g. avg response time.
- Increase N by 100,000 and repeat.

The above was repeated 10 times for a maximum test size of 1 million queries.
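The resulting schedule of index sizes can be written out explicitly. The short Python 3 sketch below only generates the values of N and the matching indexQueries.py command lines; the helper names are hypothetical, and the batch size of 10000 is the script default mentioned earlier.

```python
def query_counts(start=100000, step=100000, runs=10):
    """Return the number of indexed queries (N) for each test run."""
    return [start + i * step for i in range(runs)]


def index_commands(counts, batch_size=10000):
    """Return the indexQueries.py invocation for each value of N."""
    return [
        "python indexQueries.py -x %d -n %d" % (batch_size, n) for n in counts
    ]


if __name__ == "__main__":
    for command in index_commands(query_counts()):
        # After each indexing step, replay the 500 percolate requests
        # with JMeter and record the Summary Report statistics.
        print(command)
```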

Further details on the test environment:

- 10 matches per document percolation requested. This still requires complete evaluation of all documents on each shard to provide a total hits count. However, results are not skewed by increased response sizes.
- MacBook Pro, 3.0GHz Dual Core i7, 16GB RAM, 500GB Solid State Drive
- 8GB heap space for Elasticsearch i.e. -Xmx12g -Xms12g
- mlock enabled
- 500 documents replayed sequentially using JMeter. Single thread with no delay. JMeter v 1.0. The test file is provided above.
- Between each test the index was deleted via the indexQueries.py script, thus clearing the in-memory percolator cache.
- An optimise to a single segment was forced after indexing each batch.
