[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Page MenuHomePhabricator

Client Developer uses client-side cache
Closed, ResolvedPublic5 Estimated Story Points

Description

"As a Client Developer, I want to use a client-side cache, so that I can speed up data access and reduce my network access times."

We've put off supporting client-side cache in the MW REST API so far. This technique is great for reducing API calls and speeding up apparent performance for clients. Browser-based clients and some HTTP library users will get this built-in automatically.

To support client-side cache, we need to:

  • Emit Last-Modified header (harder but more widely used)
  • Emit ETag header (easier but less widely used)
  • Emit Expires header (rarely applicable, but very useful)
  • Support If-Modfied-Since for requests
  • Support If-None-Match for requests

The Handler class has methods to get last modified date and etag, but the REST Router class doesn't use them.

It would be good to get this functionality into the API now, while we're still low-traffic and it's not an emergency.

One good potential endpoint for enhancing with this is GET /revision/{id}/bare, since revisions are mostly immutable.

Paths:

- [ ] /user/{name}/hello

  • /v1/page/{title}/history
  • /v1/page/{title}/history/counts/{type}
  • /v1/revision/{from}/compare/{to}
  • /v1/revision/{id}/bare
  • /coredev/v0/page/{title}/
  • /coredev/v0/page/{title}/links/language
  • /coredev/v0/page/{title}/links/media
  • /coredev/v0/file/{title}

- [ ] /coredev/v0/search/page

Event Timeline

Also important is making sure we flush the Varnish cache when an object is edited, deleted, or undeleted.

This is something that I think should be split out of this task and implemented in the next stage. Purging caches is hard, and in case of purging Varnish, we might be adding quite a lot of purge traffic if we use naive approach of just purging every possible URI the resource could be accessed via. After we split it into a separate ticket, we would need to involve the traffic team to consult us.

A generic question that needs answering - how much staleness would we allow. In order for this to work we would need to emit a cache-control: max-age header, so we would allow stale data on the client side. Should this be configured per endpoint? Do we want one number across all endpoints? Configurable via Mediawiki-config or hard-coded? My proposal is to begin with hardcoded N seconds for all handlers. Not sure about how much staleness to allow? Opinions?

In order to implement it, we need to utilize the framework support @tstarling has implemented. We need to override 2 methods in each handler - getEtag and getLastModified. This patch could be used for inspiration.

  • /user/{name}/hello

This should probably go away, so ignoring it.

/v1/page/{title}/history

Should be pretty straightforward, need to verify how revision deletions affect it.

/v1/page/{title}/history/counts/{type}

Pretty straightforward for simple case of latest revision timestamp, need to think how revision deletions and restore affect it.

/v1/revision/{from}/compare/{to}

See https://gerrit.wikimedia.org/r/c/mediawiki/core/+/542212

/v1/revision/{id}/bare

Quite straightforward to implement, but I don't know if the benefit is really there to support conditional requests - to validate the condition we need to fetch a revision, and by the time we have a revision we pretty much have the whole AP response. So, not supporting conditionals will make it overall faster, but increase the amount of data transferred slightly.

/coredev/v0/search/page

I don't see a any way to support it from the first glance. But I can be mistaken. We need help from an expert on MW search API

/coredev/v0/page/{title}/links/language

Yeah :) We can probably add some staleness for the client-side caching. As for revalidation headers - I don't see a way to do it in an efficient way. The langlinks table doesn't have a last-tocuhed field, so I'm not sure we can do anything. Not a problem, it's a pretty efficient query already.

WDoranWMF set the point value for this task to 5.Dec 3 2019, 5:51 PM

There was some misunderstanding about what 'cache-control' role in here.

The cache validation is described well in this article. Basically, this is what will happen once we provide the headers described in the ticket:

  1. When a browser fetches a response, it will cache it for max-age period of time in seconds. Without max-age, no client-side caching will happen.
  2. For the period of time of max-age the browser will not issue a new request to the server for the same URI and will reuse the cached entity. We have no control over the client-side cache, can not purge anything from it and can not force the revalidation before max-age expires. This is the maximum staleness time that I refer to.
  3. When the max-age expired, if we provided all the headers described in the ticket, the browser will perform a revalidation request - it will attach 'if-match' or 'if-modified-since' header. In response we can return 403 Unmodified with no payload and a new max-age - the cached entity in the browser will be reused for the new max-age - go to step 1.

So, nothing prevents us from implementing the machinery described in the ticket, but without setting cache-control: max-age X it will not kick in and not provide any benefit. The question is how much staleness seconds we can accept. It can not be too high, since we have absolutely no control over the browser cache until manage expires. If it's very low, we still get the benefit of not transferring the data again and again because of cache revalidation, but the network roundtrips for revalidation still happen. The larger we go with max-age, the less network roundtrips we have, but the more stale data we allow.

I think we should evaluate each use-case for appropriate maxage, but we probably shouldn't go beyond 5-10 minutes for anything. I propose that for initial version of implementation we stick to 1 minute for everything. Opinions?

@Pchelolo Thank you for writing this up. I would agree with initially going with 1 minute as the default for now. But I'm curious how can we track and tune it long term? Is there data captured on what "normal" is?

@Pchelolo Thank you for writing this up. I would agree with initially going with 1 minute as the default for now. But I'm curious how can we track and tune it long term? Is there data captured on what "normal" is?

Heh, given that's browser cache, it's not transparent to us at all. I guess here we should use our best judgment.

Change 562576 had a related patch set uploaded (by Nikki Nikkhoui; owner: Nikki Nikkhoui):
[mediawiki/core@master] PageHistoryCount Conditional Request

https://gerrit.wikimedia.org/r/562576

Change 566568 had a related patch set uploaded (by Nikki Nikkhoui; owner: Nikki Nikkhoui):
[mediawiki/tools/api-testing@master] PageHistoryCount tests for deleting/un-deleting revisions

https://gerrit.wikimedia.org/r/566568

Change 562576 merged by jenkins-bot:
[mediawiki/core@master] PageHistoryCount Conditional Request

https://gerrit.wikimedia.org/r/562576

Change 566568 merged by jenkins-bot:
[mediawiki/tools/api-testing@master] PageHistoryCount tests for deleting/un-deleting revisions

https://gerrit.wikimedia.org/r/566568

WDoranWMF raised the priority of this task from Low to High.Feb 26 2020, 7:42 PM

We really should split this up by endpoint. The caching needs and characteristics for different endpoints are really not the same. In particular, endpoints that return information about the current revision behave quite different from ones that return data for a fixed revision.

I broke out the user story for reverse proxies to T247779 so we can track that work separately. Let's keep this ticket just about client-side.

The Handler class has methods to get last modified date and etag, but the REST Router class doesn't use them.

This is incorrect, they are used via ConditionalHeadersUtil in the applyConditionalResponseHeaders() method of the Router class.

Change 580126 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] PageHistory: Add etag and last-modified headers

https://gerrit.wikimedia.org/r/580126

Change 580126 merged by jenkins-bot:
[mediawiki/core@master] REST endpoints: Add etag and last-modified headers

https://gerrit.wikimedia.org/r/580126

Change 583669 had a related patch set uploaded (by BPirkle; owner: BPirkle):
[mediawiki/core@master] Display different errors for title vs file not found in MediaFileHandler

https://gerrit.wikimedia.org/r/583669

Change 583669 merged by jenkins-bot:
[mediawiki/core@master] Display different errors for title vs file not found in MediaFileHandler

https://gerrit.wikimedia.org/r/583669

Documentation for conditional requests from a client developer perspective is live here. I reviewed this content with Nikki on wiki, but additional feedback is always welcome!

There are some elements of this feature that I think would be a better fit for an implementer's guide (T255610).

Change 820794 had a related patch set uploaded (by Thiemo Kreuz (WMDE); author: Thiemo Kreuz (WMDE)):

[mediawiki/core@master] Mark bogus cache key in RevDelRevisionList with a FIXME

https://gerrit.wikimedia.org/r/820794

Change 820794 merged by jenkins-bot:

[mediawiki/core@master] Remove now unused (bogus) cache key in RevDelRevisionList

https://gerrit.wikimedia.org/r/820794