As a Wikidata product manager, I want our analytics to be accurate.
As a Wikidata power user, I want queries using pp_sortkey to return the correct data.
Problem:
Since April 2022, the page props wb-claims, wbl-forms and wbl-senses often have the pp_sortkey set to NULL in the page_props database table. This means that queries using the sort key, rather than the pp_value (to optimize the query by using a covering index), are returning the wrong result.
Example:
mysql:research@dbstore1005.eqiad.wmnet [wikidatawiki]> SELECT COUNT(*) FROM page_props WHERE pp_propname IN ('wbl-forms', 'wbl-senses') AND pp_sortkey IS NULL; +----------+ | COUNT(*) | +----------+ | 1150039 | +----------+ 1 row in set (0.687 sec) mysql:research@dbstore1005.eqiad.wmnet [wikidatawiki]> SELECT COUNT(*) FROM page_props WHERE pp_propname IN ('wbl-forms', 'wbl-senses') AND pp_sortkey IS NOT NULL; +----------+ | COUNT(*) | +----------+ | 1234425 | +----------+ 1 row in set (0.729 sec) mysql:research@dbstore1005.eqiad.wmnet [wikidatawiki]> SELECT COUNT(*) FROM page_props WHERE pp_propname = 'wb-claims' AND pp_sortkey IS NULL; +----------+ | COUNT(*) | +----------+ | 946316 | +----------+ 1 row in set (0.812 sec) mysql:research@dbstore1005.eqiad.wmnet [wikidatawiki]> SELECT COUNT(*) FROM page_props WHERE pp_propname = 'wb-claims' AND pp_sortkey IS NOT NULL; +-----------+ | COUNT(*) | +-----------+ | 106516430 | +-----------+ 1 row in set (1 min 40.897 sec)
This appears to have been caused by Page properties should always be strings (T305158) – PagePropsTable::getPropertySortKeyValue() returns null for strings, even if the strings are numeric. (The corresponding Wikibase change was thankfully never merged.)
Screenshots/mockups:
This is likely responsible for the reported drop of senses and forms in Grafana:
BDD
WHEN a lexeme is created or edited
AND its page props have been written
THEN the pp_sortkey for pp_propname='wbl-senses' is non-null
AND the pp_sortkey for pp_propname='wbl-forms' is non-null
Acceptance criteria:
- The pp_sortkey is populated for new edits
- We repopulate the pp_sortkey for all affected lexemes
Open questions: