Investigate failed reindex operations
Closed, Resolved, Public

Description

Of the ~1600 (cluster, wiki) pairs run so far, 48 failed the count check after reindexing.

Example failure from mediawikiwiki.codfw.reindex.log:

general index...
    ...
    Setting index identifier...mediawikiwiki_general_1583750799
    ...
    Task: efGikGagRvaI7V5vLMKdBQ:33189921 Search Retries: 0 Bulk Retries: 0 Indexed: 708700 / 739104
    Task: efGikGagRvaI7V5vLMKdBQ:33189921 Search Retries: 0 Bulk Retries: 0 Indexed: 718900 / 739104
    Task: efGikGagRvaI7V5vLMKdBQ:33189921 Search Retries: 0 Bulk Retries: 0 Indexed: 727900 / 739104
    Task: efGikGagRvaI7V5vLMKdBQ:33189921 Search Retries: 0 Bulk Retries: 0 Indexed: 734000 / 739104
    Task: efGikGagRvaI7V5vLMKdBQ:33189921 Search Retries: 0 Bulk Retries: 0 Indexed: 739104 / 739104
    Verifying counts...Not close enough!  old=739107 new=495600 difference=0.32946109291348
Failed to load index - counts not close enough.  old=739107 new=495600 difference=0.32946109291348.  Check for warnings above.
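The reported difference is consistent with the absolute gap between the two counts divided by the old count. The following is only a sketch of that arithmetic, not the CirrusSearch code itself; the function name `relative_difference` is hypothetical, and the acceptance threshold the reindexer applies is not shown in the log, so it is left out here.

```python
def relative_difference(old: int, new: int) -> float:
    """Fraction by which the new count deviates from the old count.

    Assumption: the log's "difference" is |old - new| / old. This matches
    the logged figures but is inferred from them, not read from the source.
    """
    if old == 0:
        # Avoid dividing by zero: an empty old index only matches an empty new one.
        return 0.0 if new == 0 else 1.0
    return abs(old - new) / old


# Counts taken from the mediawikiwiki log excerpt above.
diff = relative_difference(739107, 495600)
print(f"old=739107 new=495600 difference={diff}")
```

Note that the bulk task reported 739104 / 739104 documents indexed, while the count taken right afterwards saw only 495600, which is what makes the gap so large.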

Looking at the index afterwards, though, the counts are reasonable:

ebernhardson@mwmaint1002:/home/mstyles/cirrus_log$ curl -s https://search.svc.codfw.wmnet:9243/_cat/indices | grep mediawikiwiki_general
green open mediawikiwiki_general_1564414369       fkIH5anxRAWcrTX3y9IEbQ  1 2   739612    79734    6.4gb      2gb
green open mediawikiwiki_general_1583750799       pq77hdudTlygUBHAPawZQg  1 2   739104        0    4.3gb    1.4gb
ebernhardson@mwmaint1002:/home/mstyles/cirrus_log$ curl https://search.svc.codfw.wmnet:9243/mediawikiwiki_general_1564414369/_count?pretty
{
  "count" : 739612,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}
ebernhardson@mwmaint1002:/home/mstyles/cirrus_log$ curl https://search.svc.codfw.wmnet:9243/mediawikiwiki_general_1583750799/_count?pretty
{
  "count" : 739104,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}
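The same comparison can be scripted instead of read off by eye. This is a minimal sketch that parses two `_count` response bodies shaped like the ones above; `parse_count` is a hypothetical helper, and the bodies are inlined rather than fetched so that the example runs without cluster access.

```python
import json


def parse_count(body: str) -> int:
    """Return the document count from a _count response body.

    Raises ValueError if any shard failed, since the count would then be
    a partial result and not comparable.
    """
    doc = json.loads(body)
    if doc["_shards"]["failed"] != 0:
        raise ValueError(f"{doc['_shards']['failed']} shard(s) failed")
    return doc["count"]


# Response bodies copied from the curl output above.
OLD_BODY = '{"count": 739612, "_shards": {"total": 1, "successful": 1, "skipped": 0, "failed": 0}}'
NEW_BODY = '{"count": 739104, "_shards": {"total": 1, "successful": 1, "skipped": 0, "failed": 0}}'

old_count = parse_count(OLD_BODY)
new_count = parse_count(NEW_BODY)
print(f"old={old_count} new={new_count} gap={old_count - new_count}")
```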

Event Timeline

EBernhardson added a subscriber: Mstyles.

As a bit of a guess, I've put a patch up in SWAT today that adds a 30 second wait between performing the refresh operation and performing the count. The assumption is that the refresh isn't as complete as it claims to be, since at some later point the index clearly does have all the expected docs.

Change 579077 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@master] reindex: Wait around for counts to match before giving up

https://gerrit.wikimedia.org/r/579077

Change 579077 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] reindex: Wait around for counts to match before giving up

https://gerrit.wikimedia.org/r/579077

Change 579090 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@wmf/1.35.0-wmf.22] reindex: Wait around for counts to match before giving up

https://gerrit.wikimedia.org/r/579090

Change 579091 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@wmf/1.35.0-wmf.23] reindex: Wait around for counts to match before giving up

https://gerrit.wikimedia.org/r/579091

Change 579090 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@wmf/1.35.0-wmf.22] reindex: Wait around for counts to match before giving up

https://gerrit.wikimedia.org/r/579090

Change 579091 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@wmf/1.35.0-wmf.23] reindex: Wait around for counts to match before giving up

https://gerrit.wikimedia.org/r/579091

The new patch waits in a loop, checking every 30 seconds for up to 10 minutes for the counts to match up. I re-started the kowiki reindex and will see how this goes.
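The retry behaviour described here can be sketched roughly as follows. This is not the merged CirrusSearch change (which is PHP); it is an illustrative Python loop using the 30 second interval and 10 minute limit from the comment. The names `wait_for_counts` and `fetch_new_count` and the `tolerance` value of 0.05 are assumptions made for the example.

```python
import time
from typing import Callable


def wait_for_counts(
    old_count: int,
    fetch_new_count: Callable[[], int],
    tolerance: float = 0.05,   # hypothetical threshold, not taken from the patch
    interval: float = 30.0,    # seconds between re-checks
    timeout: float = 600.0,    # give up after 10 minutes of waiting
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Re-check the new index count until it is close enough or time runs out."""
    waited = 0.0
    while True:
        new_count = fetch_new_count()
        diff = abs(old_count - new_count) / max(old_count, 1)
        if diff <= tolerance:
            return True
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval


# Demo with canned counts and a no-op sleep so it finishes immediately.
# The first two values come from the kowiki log; the third is made up.
counts = iter([442805, 443767, 487000])
ok = wait_for_counts(487533, lambda: next(counts), sleep=lambda _: None)
print("counts matched" if ok else "gave up")
```

Injecting `sleep` keeps the loop testable without real waiting; in real use the default `time.sleep` would apply.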

Seems to have worked; the reindexer now reports:

Verifying counts...Not close enough!  old=487533 new=442805 difference=0.091743533258261
Waiting to re-check counts...Not close enough!  old=487533 new=443767 difference=0.089770333495374
Waiting to re-check counts...done