[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Page MenuHomePhabricator

Remove usage of MEDIAWIKI_JOB_RUNNER from CirrusSearch extension
Closed, ResolvedPublic

Description

The MEDIAWIKI_JOB_RUNNER global constant is not reliable and after an audit (T244828) we have decided to deprecate and remove it.

The use of MEDIAWIKI_JOB_RUNNER in CirrusSearch should be removed.

Event Timeline

Based on the provided ticket, no replacement will be offered. I think it's still important to be able to differentiate web requests from job runner requests. As a relatively common example, a count of web requests that invoked cirrussearch isn't possible from these logs if job runner logs report the same execution context as web requests. For reference these are not standard operational logs that only go to logstash, these are logs that feed our analytics and ML pipelines.

  • [extensions/CirrusSearch]: This is a string for logging purpose only. Trivially inverted. Right now it is coded as cli-or-job: cli, API: api, default: web. It can instead be coded as index.php: web, API: api, default: misc where misc captures CLI, Job and other non-user-facing contexts.

[…] it's still important to be able to differentiate web requests from job runner requests. As a relatively common example, a count of web requests that invoked cirrussearch isn't possible from these logs if job runner logs report the same execution context as web requests.

I think the keyword here is "web request". Define web requests?

Do you consider some or all of rest.php, thumb.php, api.php, or load.php to be web requests? Under the current scheme, these were "web" except for api.php.

In the above proposal, "web" would be limited to "index.php", "api" remain the same, and the others under "misc". Would that work for you? If not, is there another way this can be grouped that would work for you?

These are logs that go to hadoop and feed our analytics and ML pipelines. In this context it's important to be able to filter out logs generated by the job runner infrastructure, as they are primarily concerned with user behaviour.

Are these Hadoop queries automated or is it by hand? In my strawmen proposal, the two user-related strings "web" and "api" for the same log field would continue to exist.

If you're currently filtering out "cli", one would filter out "misc" instead - or simply look only for "web"/"api".

Detecting job runner explicitly will no longer be possible as jobs would no longer have their own entry point. They can run under any of cli, rpc/…php, or (soon) via w/rest.php/…/job/execute. For CirrusSearch, should rest.php get its own string label, or fall under one of api or misc or web. If under api or web beware that this would include jobs in the future. If placed under misc or under its own label, beware that endpoints mainly existing via api today, might in the future (also) have similar endpoints under rest.php.

  • [extensions/CirrusSearch]: This is a string for logging purpose only. Trivially inverted. Right now it is coded as cli-or-job: cli, API: api, default: web. It can instead be coded as index.php: web, API: api, default: misc where misc captures CLI, Job and other non-user-facing contexts.

[…] it's still important to be able to differentiate web requests from job runner requests. As a relatively common example, a count of web requests that invoked cirrussearch isn't possible from these logs if job runner logs report the same execution context as web requests.

I think the keyword here is "web request". Define web requests?

Do you consider some or all of rest.php, thumb.php, api.php, or load.php to be web requests? Under the current scheme, these were "web" except for api.php.

Definitions are of course hard, but in the end what the analytics this feeds cares about is executions that directly return results to a user. I imagine only index.php would be fine, it looks like in the years since this code was written and new defines have been added to index.php that we can key off of.

In the above proposal, "web" would be limited to "index.php", "api" remain the same, and the others under "misc". Would that work for you? If not, is there another way this can be grouped that would work for you?

These are logs that go to hadoop and feed our analytics and ML pipelines. In this context it's important to be able to filter out logs generated by the job runner infrastructure, as they are primarily concerned with user behaviour.

Are these Hadoop queries automated or is it by hand? In my strawmen proposal, the two user-related strings "web" and "api" for the same log field would continue to exist.

If you're currently filtering out "cli", one would filter out "misc" instead - or simply look only for "web"/"api".

Detecting job runner explicitly will no longer be possible as jobs would no longer have their own entry point. They can run under any of cli, rpc/…php, or (soon) via w/rest.php/…/job/execute. For CirrusSearch, should rest.php get its own string label, or fall under one of api or misc or web. If under api or web beware that this would include jobs in the future. If placed under misc or under its own label, beware that endpoints mainly existing via api today, might in the future (also) have similar endpoints under rest.php.

As you've surmised we don't actually care that they are job runner, we simply need job runner to be in some easily filtered class of things. I will work out switching this over to detect index.php directly and double check that none of the outputs change in surprising ways.

Change 579305 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@master] Remove usage of MEDIAWIKI_JOB_RUNNER constant

https://gerrit.wikimedia.org/r/579305

EBernhardson triaged this task as Medium priority.
EBernhardson moved this task from needs triage to Current work on the Discovery-Search board.

Change 579305 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Remove usage of MEDIAWIKI_JOB_RUNNER constant

https://gerrit.wikimedia.org/r/579305