[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Page MenuHomePhabricator

Add hasrecommendation: search keyword
Closed, ResolvedPublic5 Estimated Story Points

Description

As part of the Add Links work (T268803: Add a link engineering: Search pipeline), we need a search keyword for restricting search results to pages which have link recommendations prepared for them.

Ideally this should be reusable for other recommendation types; hasrecommendation:link seems like the straightforward way to do it, in line with the existing hastemplate: keyword.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

There wasn't much discussion on whether this should be strictly filtering or filtering+ranking like articletopic: does (based on some kind of recommendation confidence score). The latter seems reasonable (and in that case maybe the naming should be different) but I don't think we have any actual use case for it - the new keyword would be used for suggesting tasks to users, and that needs to be done in a non-deterministic way to avoid collisions.

CBogen set the point value for this task to 5.Dec 7 2020, 6:18 PM

There wasn't much discussion on whether this should be strictly filtering or filtering+ranking

Filtering only seems the correct way forward here, we don't really have any information available to help decide which things are going to be better. Additionally my understanding is that the use case will involve the random sort, short circuiting any scoring done by the keyword.

We could craft a confidence score (we have scores for the individual links). But as you say we don't really have a use case for it.

We could craft a confidence score (we have scores for the individual links). But as you say we don't really have a use case for it.

For image recommendations the Structured Data team does have a use case for confidence scores, in case that future work should factor into anything here.

(For more information see the "What parameters do we need to be able to tune for image matches?" row here.)

I'm pretty sure that doesn't apply for this flag/this part of the pipeline at all but I wanted to mention it just in case.

I'm pretty sure that doesn't apply for this flag/this part of the pipeline at all but I wanted to mention it just in case.

It applies if you want to be able to filter by score.

You could say "only recommendation with a score >= X are valid", where X might be per-wiki configurable, and then that could be handled at the service level and the search index doesn't need to know about it.
Or the score could be returned as part of the recommendation data and the client could decide what to do with it; that does not require search integration either.

However, if clients need to be able to ask for different score ranges (e.g. there should be a way of calling the API that returns recommendations with 0.5+ score and another one that only returns ones with 0.9+ score), or the API needs to rank by the score (return the highest-scoring results first), then the score needs to be in the search index.

AIUI the plan is to make the recommendation field a text field, containing special words like type_image and then doing a fulltext search within that field for those words. That's relatively easy to extend to ranking (just add those words multiple times, with the number of occurrences corresponding to the score, and the standard word frequency weighting mechanics in ElasticSearch will rank up results with a higher score). I'm not sure if it's easy to extend to filtering to specific score ranges, though. And ranking is not ideal for the task queue use case because we don't want everyone to get the same search results (although that depends on how individualized the searches done by the bot operators are otherwise).

However, if clients need to be able to ask for different score ranges (e.g. there should be a way of calling the API that returns recommendations with 0.5+ score and another one that only returns ones with 0.9+ score), or the API needs to rank by the score (return the highest-scoring results first), then the score needs to be in the search index.

AIUI the plan is to make the recommendation field a text field, containing special words like type_image and then doing a fulltext search within that field for those words. That's relatively easy to extend to ranking (just add those words multiple times, with the number of occurrences corresponding to the score, and the standard word frequency weighting mechanics in ElasticSearch will rank up results with a higher score). I'm not sure if it's easy to extend to filtering to specific score ranges, though. And ranking is not ideal for the task queue use case because we don't want everyone to get the same search results (although that depends on how individualized the searches done by the bot operators are otherwise).

Structured data is definitely hoping to be able to allow bots to determine their own confidence score threshold for image recs, so the ability to ask for different score ranges is desired if it's doable.

Tagging @Ramsey-WMF as an FYI.

The analysis chain and field here is the same as used for the ores models. As such it can accept an additional integer value in 1-1000 as it's confidence score. Today since there is no information additional information provided this is set to a constant value of 1 for all pages. Were additional information to be included in the events that could be added. These confidence scores can be range filtered with a custom search query dcausse previously added to our elasticsearch plugin when creating these fields.

For this ticket it seems the keyword use case is still as a plain filter.

The analysis chain and field here is the same as used for the ores models. As such it can accept an additional integer value in 1-1000 as it's confidence score. Today since there is no information additional information provided this is set to a constant value of 1 for all pages. Were additional information to be included in the events that could be added. These confidence scores can be range filtered with a custom search query dcausse previously added to our elasticsearch plugin when creating these fields.

For this ticket it seems the keyword use case is still as a plain filter.

Sounds good to me!

For this ticket it seems the keyword use case is still as a plain filter.

Yes, for link recommendations we only need a boolean filter. I just wanted to make sure we don't run into problems in the long term.

Change 655083 had a related patch set uploaded (by ZPapierski; owner: ZPapierski):
[mediawiki/extensions/CirrusSearch@master] Add hasrecommendation keyword to search filters

https://gerrit.wikimedia.org/r/655083

Change 655083 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Add hasrecommendation keyword to search filters

https://gerrit.wikimedia.org/r/655083

This does not seem to be working. Here are a few pages for which GrowthExperiments generated recommendations:
https://test.wikipedia.org/wiki/Special:ApiSandbox#action=query&format=json&prop=cirrusdoc&pageids=114619%7C114368%7C114510%7C113318

weighted_tags looks as expected, e.g.

"weighted_tags": [
                                "classification.ores.articletopic/Culture.Biography.Biography*|0.6265578266014",
                                "classification.ores.articletopic/Geography.Regions.Americas.North America|0.78969229536699",
                                "classification.ores.articletopic/History and Society.Politics and government|0.85337919734571",
                                "recommendation.link/exists|1"
                            ]

but a search for hasrecommendation:link finds nothing:
https://test.wikipedia.org/w/index.php?search=hasrecommendation%3Alink&title=Special:Search&profile=advanced&fulltext=1&advancedSearch-current=%7B%7D&ns0=1

cirrusDumpQuery says the generated query is

"query": {
    "bool": {
        "must": [
            {
                "match_all": {}
            }
        ],
        "filter": [
            {
                "bool": {
                    "must": [
                        {
                            "match": {
                                "weighted_tags": {
                                    "query": "recommendation.link\/exists"
                                }
                            }
                        },
                        {
                            "terms": {
                                "namespace": [
                                    0
                                ]
                            }
                        }
                    ]
                }
            }
        ]
    }
},

which, to my untrained-in-Elasticsearch eyes, looks correct. Also, the search keyword works fine on my local setup.

It also works fine on beta, although we have different problems there, which might or might not be search infrastructure related: T277208: Add Link: refreshLinkRecommendations.php does not write to the search index on beta

classification.ores.articletopic/History and Society.Politics and government|0.85337919734571

I wonder if this breaks anything, the value after | should be an integer between 1 and 1000 (untested, but suspicious).

classification.ores.articletopic/History and Society.Politics and government|0.85337919734571

I wonder if this breaks anything, the value after | should be an integer between 1 and 1000 (untested, but suspicious).

Actually thats not it, @dcausse double checked and foo.bar/baz|0.1234 will be interpereted as if foo.bar/baz|0.1234|1 was provided, essentially it cant parse the value after | as an int so assumes it's part of the source.

The actual problem here is we've only run the reindex procedure against the beta cluster, prod will be coming up soon-ish. The current analytics process is currently loading data into both ores_articletopics and weighted_tags using the same formatted data, until the switchover is complete (elastic allows writing to fields that don't exist, they will simply be stored in the source document and used during the reindex that enables the field).

classification.ores.articletopic/History and Society.Politics and government|0.85337919734571

I wonder if this breaks anything, the value after | should be an integer between 1 and 1000 (untested, but suspicious).

That's my fault, I used a script to import production ORES scores for a bunch of testwiki articles, but forgot to scale up the scores. I can fix the index data if it's worth the effort.

Change 671214 had a related patch set uploaded (by Ebernhardson; owner: Ebernhardson):
[mediawiki/extensions/CirrusSearch@master] hasrecommendation: Write and query BC field as well

https://gerrit.wikimedia.org/r/671214

classification.ores.articletopic/History and Society.Politics and government|0.85337919734571

I wonder if this breaks anything, the value after | should be an integer between 1 and 1000 (untested, but suspicious).

That's my fault, I used a script to import production ORES scores for a bunch of testwiki articles, but forgot to scale up the scores. I can fix the index data if it's worth the effort.

If the effort is minimal it would be a bit cleaner, but on review it looks like these shouldn't cause any problems beyond the predictions being unfindable.

For the problem at hand, we can either reindex testwiki today and wait for the full cluster reindex, or put a little BC code in cirrus to update/query the BC field. Which makes the most sense i suppose depends on how quickly it needs to work outside testwiki. The BC patch wasn't too difficult so I put one together, but we can reindex testwiki at any time.

If the effort is minimal it would be a bit cleaner, but on review it looks like these shouldn't cause any problems beyond the predictions being unfindable.

It's not that minimal, we'd have to iterate through all pages and check their tags, as I didn't keep the logs and I don't think there's a way to find the affected pages via search. But it's not that huge either, and one should fix what one breaks so if you think it's problematic let me know.

For the problem at hand, we can either reindex testwiki today and wait for the full cluster reindex, or put a little BC code in cirrus to update/query the BC field. Which makes the most sense i suppose depends on how quickly it needs to work outside testwiki. The BC patch wasn't too difficult so I put one together, but we can reindex testwiki at any time.

Thanks for the quick fix! We won't need this outside testwiki for several weeks.

Mentioned in SAL (#wikimedia-operations) [2021-03-12T19:47:04Z] <ebernhardson> start in-place reindex testwiki in eqiad, codfw, cloudelastic cirrus clusters for T269493

Change 672447 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@master] Fix topic scores in importOresTopics.php

https://gerrit.wikimedia.org/r/672447

Change 672536 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/CirrusSearch@master] [WIP] Validate DataSender::sendUpdateWeightedTags() better

https://gerrit.wikimedia.org/r/672536

Change 672447 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@master] Fix topic scores in importOresTopics.php

https://gerrit.wikimedia.org/r/672447

Change 671214 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] hasrecommendation: Write and query BC field as well

https://gerrit.wikimedia.org/r/671214

Change 672536 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Validate DataSender::sendUpdateWeightedTags() better

https://gerrit.wikimedia.org/r/672536

Change 673732 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/CirrusSearch@master] Fix string weight handling in CirrusSearch::updateWeightedTags

https://gerrit.wikimedia.org/r/673732

Change 673732 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Fix string weight handling in CirrusSearch::updateWeightedTags

https://gerrit.wikimedia.org/r/673732