Remove usage of MEDIAWIKI_JOB_RUNNER from CirrusSearch extension
Closed, Resolved, Public

Description

The MEDIAWIKI_JOB_RUNNER global constant is not reliable and after an audit (T244828) we have decided to deprecate and remove it.

The use of MEDIAWIKI_JOB_RUNNER in CirrusSearch should be removed.

Details

	Subject	Repo	Branch	Lines +/-
	Remove usage of MEDIAWIKI_JOB_RUNNER constant	mediawiki/extensions/CirrusSearch	master	+6 -4


Related Objects

Status	Assigned	Task
Resolved	• Pchelolo	T157088 [EPIC] Develop a JobQueue backend based on EventBus
Open	daniel	T175146 JobQueue: Unify JobRunner entry points
Resolved	• Clarakosi	T244826 Create rest endpoint for executing jobs instead of /rpc/RunSingleJob
Resolved	• Clarakosi	T244828 Investigate alternatives to MEDIAWIKI_JOB_RUNNER global
Resolved	EBernhardson	T247129 Remove usage of MEDIAWIKI_JOB_RUNNER from CirrusSearch extension

Event Timeline

• Clarakosi created this task. Mar 6 2020, 9:07 PM

Restricted Application added a project: Discovery-Search. Mar 6 2020, 9:07 PM

Reedy updated the task description. Mar 6 2020, 11:50 PM

Based on the provided ticket, no replacement will be offered. I think it's still important to be able to differentiate web requests from job runner requests. As a relatively common example, a count of web requests that invoked cirrussearch isn't possible from these logs if job runner logs report the same execution context as web requests. For reference, these are not standard operational logs that only go to logstash; they are logs that feed our analytics and ML pipelines.

In T244828#5891647, @Krinkle wrote:

[extensions/CirrusSearch]: This is a string for logging purpose only. Trivially inverted. Right now it is coded as cli-or-job: cli, API: api, default: web. It can instead be coded as index.php: web, API: api, default: misc where misc captures CLI, Job and other non-user-facing contexts.

In T247129#5956863, @EBernhardson wrote:

[…] it's still important to be able to differentiate web requests from job runner requests. As a relatively common example, a count of web requests that invoked cirrussearch isn't possible from these logs if job runner logs report the same execution context as web requests.

I think the keyword here is "web request". Define web requests?

Do you consider some or all of rest.php, thumb.php, api.php, or load.php to be web requests? Under the current scheme, these were "web" except for api.php.

In the above proposal, "web" would be limited to "index.php", "api" remain the same, and the others under "misc". Would that work for you? If not, is there another way this can be grouped that would work for you?
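For illustration, the current and proposed labelling schemes can be sketched as below. This is a rough Python sketch, not the CirrusSearch code: the function names and the `entry_point` strings (`index`, `api`, `rest`, and so on) are hypothetical stand-ins for however the extension actually detects its entry point.

```python
def label_current(entry_point: str) -> str:
    """Current scheme: cli-or-job -> "cli", API -> "api", default -> "web"."""
    if entry_point in ("cli", "job"):
        return "cli"
    if entry_point == "api":
        return "api"
    return "web"


def label_proposed(entry_point: str) -> str:
    """Proposed scheme: index.php -> "web", API -> "api", default -> "misc"."""
    if entry_point == "index":
        return "web"
    if entry_point == "api":
        return "api"
    return "misc"


if __name__ == "__main__":
    # Hypothetical entry point names, used only to compare the two schemes.
    for ep in ("index", "api", "rest", "load", "thumb", "cli", "job"):
        print(f"{ep:6} current={label_current(ep):4} proposed={label_proposed(ep)}")
```

Under this sketch only the labels for rest.php, load.php, thumb.php, CLI, and jobs change; index.php and api.php keep the labels they have today.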

In T244828#5956881, @EBernhardson wrote:

These are logs that go to hadoop and feed our analytics and ML pipelines. In this context it's important to be able to filter out logs generated by the job runner infrastructure, as they are primarily concerned with user behaviour.

Are these Hadoop queries automated, or are they run by hand? In my strawman proposal, the two user-related strings "web" and "api" for the same log field would continue to exist.

If you're currently filtering out "cli", one would filter out "misc" instead - or simply look only for "web"/"api".
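As a minimal sketch of that filtering choice, assuming each log record carries the label in a field (the field name `context` and the sample records here are hypothetical, not taken from the real logs), selecting only the user-facing labels avoids having to know what falls under "misc":

```python
# Labels that stay user-facing under the proposal above.
USER_FACING = {"web", "api"}


def user_facing(records):
    """Keep only records whose label is "web" or "api"."""
    return [r for r in records if r["context"] in USER_FACING]


def without_misc(records):
    """Alternative: drop "misc", as one would previously have dropped "cli"."""
    return [r for r in records if r["context"] != "misc"]


# Made-up sample records for illustration only.
sample = [
    {"id": 1, "context": "web"},
    {"id": 2, "context": "misc"},
    {"id": 3, "context": "api"},
]

if __name__ == "__main__":
    print([r["id"] for r in user_facing(sample)])
    print([r["id"] for r in without_misc(sample)])
```

The two filters agree as long as "web", "api", and "misc" are the only labels; the allow-list form keeps working unchanged if another label (for example one for rest.php) is added later.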

Detecting job runner explicitly will no longer be possible, as jobs would no longer have their own entry point. They can run under any of cli, rpc/…php, or (soon) via w/rest.php/…/job/execute. For CirrusSearch, should rest.php get its own string label, or fall under one of api, misc, or web? If under api or web, beware that this would include jobs in the future. If placed under misc or under its own label, beware that endpoints mainly existing via api today might in the future (also) have similar endpoints under rest.php.

Krinkle mentioned this in T244828: Investigate alternatives to MEDIAWIKI_JOB_RUNNER global. Mar 11 2020, 3:55 AM

In T247129#5959078, @Krinkle wrote:

In T244828#5891647, @Krinkle wrote:

[extensions/CirrusSearch]: This is a string for logging purpose only. Trivially inverted. Right now it is coded as cli-or-job: cli, API: api, default: web. It can instead be coded as index.php: web, API: api, default: misc where misc captures CLI, Job and other non-user-facing contexts.

In T247129#5956863, @EBernhardson wrote:

[…] it's still important to be able to differentiate web requests from job runner requests. As a relatively common example, a count of web requests that invoked cirrussearch isn't possible from these logs if job runner logs report the same execution context as web requests.

I think the keyword here is "web request". Define web requests?

Do you consider some or all of rest.php, thumb.php, api.php, or load.php to be web requests? Under the current scheme, these were "web" except for api.php.

Definitions are of course hard, but in the end what the analytics this feeds care about is executions that directly return results to a user. I imagine only index.php would be fine. It looks like, in the years since this code was written, new defines have been added to index.php that we can key off of.

In the above proposal, "web" would be limited to "index.php", "api" remain the same, and the others under "misc". Would that work for you? If not, is there another way this can be grouped that would work for you?

In T244828#5956881, @EBernhardson wrote:

These are logs that go to hadoop and feed our analytics and ML pipelines. In this context it's important to be able to filter out logs generated by the job runner infrastructure, as they are primarily concerned with user behaviour.

Are these Hadoop queries automated, or are they run by hand? In my strawman proposal, the two user-related strings "web" and "api" for the same log field would continue to exist.

If you're currently filtering out "cli", one would filter out "misc" instead - or simply look only for "web"/"api".

Detecting job runner explicitly will no longer be possible, as jobs would no longer have their own entry point. They can run under any of cli, rpc/…php, or (soon) via w/rest.php/…/job/execute. For CirrusSearch, should rest.php get its own string label, or fall under one of api, misc, or web? If under api or web, beware that this would include jobs in the future. If placed under misc or under its own label, beware that endpoints mainly existing via api today might in the future (also) have similar endpoints under rest.php.

As you've surmised, we don't actually care that they come from the job runner; we simply need job runner requests to be in some easily filtered class of things. I will work on switching this over to detect index.php directly and double-check that none of the outputs change in surprising ways.

Change 579305 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@master] Remove usage of MEDIAWIKI_JOB_RUNNER constant

https://gerrit.wikimedia.org/r/579305

gerritbot added a project: Patch-For-Review. Mar 12 2020, 3:49 PM

EBernhardson claimed this task. Mar 12 2020, 7:05 PM

EBernhardson triaged this task as Medium priority.

EBernhardson moved this task from needs triage to Current work on the Discovery-Search board.

EBernhardson edited projects, added Discovery-Search (Current work); removed Discovery-Search.

EBernhardson moved this task from Incoming to Needs review on the Discovery-Search (Current work) board. Mar 12 2020, 7:21 PM

Change 579305 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Remove usage of MEDIAWIKI_JOB_RUNNER constant

https://gerrit.wikimedia.org/r/579305

Maintenance_bot removed a project: Patch-For-Review. Mar 13 2020, 12:10 AM

ReleaseTaggerBot added a project: MW-1.35-notes (1.35.0-wmf.24; 2020-03-17). Mar 13 2020, 1:00 AM

EBernhardson moved this task from Needs review to To Be Deployed on the Discovery-Search (Current work) board. Mar 13 2020, 4:52 PM