Move MainStash out of Redis to a simpler multi-dc aware solution
Closed, Resolved (Public)

Authored By: Joe, Dec 17 2018, 2:55 PM

Description

As evidenced during the investigation of T211721, we don't just write session data to the sessions redis cluster, but we also write all data from anything in MediaWiki that uses MediaWikiServices::getInstance()->getMainObjectStash() to the local redis cluster.

This breaks in a multi-dc setup for a number of reasons, first and foremost that we replicate redis data from the master DC to the others, but not the other way around as redis doesn't support multi-master replication.

Status quo

The MediaWiki "Main stash" is backed in WMF production by a Redis cluster labelled "redis_sessions", and is co-located on the Memcached hardware. It is perceived as having the following features:

Fairly high persistence. (It is not a cache, and the data is not recoverable in case of loss. But the main stash is expected to lose data when under pressure, in hopefully rare circumstances.) Examples of user impact:
- session - User gets logged out and loses any session-backend data (e.g. book composition).
- echo - Notifications might be wrongly marked as read or unread.
- watchlist - Reviewed edits might show up again as unreviewed.
- resourceloader - Version hash churn would cause CDN and browser cache misses for a while.
Replication. (Data is eventually consistent and readable from both DCs)
Fast (low latency).

Note that:

"DC-local writable" is not on this list (mainstash only requires master writes), but given WMF is not yet multi-dc we have more or less assumed that sessions are always locally writable and we need to keep performing session writes locally in a multi-DC world.
"Replication" is on the list and implemented in one direction for Redis at WMF. This would suffice if we only needed master writes, but for sessions we need local writes as well.

Future of sessions

Move Session and Echo data out of the MainStash into their own store that supports local writes and bi-di replication. This is tracked under T206016 and T222851.

Future of mainstash

To fix the behaviour of the software in a multi-dc scenario, I see the following possibilities, depending on what type of storage guarantees we want to have:

Option 1: If we don't need data to be consistent cross-dc: After we migrate the sessions data to their own datastore, we turn off replication and leave the current Redis clusters operating separately in each DC.

Option 2: If we need cross-dc consistency, but we don't need the data to have a guaranteed persistence: We can migrate the stash to mcrouter.

Option 3: If we need all of the above, plus persistence: We might need to migrate that to the same (or a similar) service to the one we will use for sessions.

I would very much prefer to be able to avoid the last option, if at all possible.

Details

Subject	Repo	Branch	Lines +/-
objectcache: make "multiPrimaryMode" work with LB-based SqlBagOStuff	mediawiki/core	master	+37 -1
Add "db-mainstash" entry to $wgObjectCaches	operations/mediawiki-config	master	+11 -0
Switch wgMainStash back to Redis	operations/mediawiki-config	master	+1 -1
Configure FilterProfiler cache separately	mediawiki/extensions/AbuseFilter	wmf/1.39.0-wmf.16	+11 -1
Configure FilterProfiler cache separately	mediawiki/extensions/AbuseFilter	master	+11 -1
Switch AbuseFilter profiler back to redis	operations/mediawiki-config	master	+4 -0
Switch wgMainStash to db-mainstash	operations/mediawiki-config	master	+1 -1
Configure FilterProfiler cache separately	mediawiki/extensions/AbuseFilter	wmf/1.39.0-wmf.15	+11 -1
Add MariaDB grants for x2	operations/puppet	production	+15 -3
objectcache: Simplify SqlBagOStuff class configuration	mediawiki/core	master	+235 -182
objectcache: remove "multiPrimaryMode" DB type assertion	mediawiki/core	master	+10 -18
objectcache: Simplify docs of SqlBagOStuff 'purgePeriod' option	mediawiki/core	master	+2 -2
objectcache: add "globalKeyLbDomain" option to use with "globalKeyLB"	mediawiki/core	master	+21 -5

Related Objects

Status	Subtype	Assigned	Task
Resolved		aaron	T252951 ResourceLoader DepStore lock acquired twice?
Resolved		aaron	T88445 MediaWiki active/active datacenter investigation and work (tracking)
Resolved		Krinkle	T270223 FY2021-2022: Enable basic Multi-DC operations for read traffic (tracking)
Resolved		Krinkle	T113916 Switch ResourceLoader file dependency tracking to MultiDC-friendly backend
Resolved		jijiki	T267581 Phase out "redis_sessions" cluster and away from memcached cluster
Resolved		aaron	T212129 Move MainStash out of Redis to a simpler multi-dc aware solution
Resolved		Eevans	T222851 Improve Echo seentime code for multi-DC access
Resolved		aaron	T229062 Look into a simple way to have global keys with db-replicated
Resolved		Krinkle	T254634 Determine and implement multi-dc strategy for ChronologyProtector
			Unknown Object (Task)
Resolved		Papaul	T267041 (Need By: 2020-11-29) rack/setup/install db214[234]
Resolved		Marostegui	T269324 Productionize x2 databases
Resolved		aaron	T274174 Add modtoken field and flags to objectcache table
Resolved		Krinkle	T288998 Significant ParserCache space increase after 2021-08-12 (1.37.0-wmf.18 regression)
Resolved		tstarling	T306118 Notify DBA prior to sending db traffic to x2
Resolved		tstarling	T315271 db1151, db2144 X2 masters error: Could not execute Delete_rows_v1 event on table mainstash.objectstash
Resolved	PRODUCTION ERROR	jcrespo	T315274 Wikimedia\Rdbms\DBTransactionError: Explicit transaction still active; a caller might have failed to call endAtomic() or cancelAtomic().
Resolved		tstarling	T315427 Create a puppet role for x2 hosts
Resolved		Marostegui	T315853 reclone x2 codfw hosts
Resolved		tstarling	T315995 Document how to disable x2 per DC

Event Timeline


Marostegui mentioned this in T269324: Productionize x2 databases. Jan 29 2021, 3:34 AM

Krinkle reopened subtask T269324: Productionize x2 databases as Open. Feb 1 2021, 8:33 PM

Krinkle removed a subtask: Unknown Object (Task). Feb 13 2021, 11:50 PM

Krinkle added a subtask: T274174: Add modtoken field and flags to objectcache table. Feb 16 2021, 7:47 PM

Marostegui closed subtask T269324: Productionize x2 databases as Resolved. Mar 5 2021, 5:53 AM

Krinkle mentioned this in T252908: AbuseFilter warning: "doCas failed due to race condition" (from Redis). Mar 8 2021, 5:49 PM

Krinkle mentioned this in T270101: Grants not working with DB hosts with to ipv6. Mar 10 2021, 9:29 PM

Krinkle closed subtask T254634: Determine and implement multi-dc strategy for ChronologyProtector as Resolved. Mar 19 2021, 8:29 PM

Krinkle added a parent task: T113916: Switch ResourceLoader file dependency tracking to MultiDC-friendly backend. Apr 19 2021, 6:56 PM

jijiki mentioned this in T280586: Move "redis_sessions" to "redis_misc" cluster. Apr 19 2021, 7:06 PM

jijiki mentioned this in T280582: Reduce number of shards in redis_sessions cluster. Apr 19 2021, 7:15 PM

Krinkle mentioned this in T281934: Audit uses of 'db-replicated' in WMF production code and consider using MainStash DB. May 4 2021, 11:20 PM

Krinkle mentioned this in T272512: Apply outstanding schema changes for "objectcache" tables in production (exptime, flags, modtoken). May 15 2021, 11:10 PM

Krinkle mentioned this in T280599: Reduce DiscussionTools' usage of the parser cache. Jul 28 2021, 1:50 PM

Krinkle mentioned this in T288998: Significant ParserCache space increase after 2021-08-12 (1.37.0-wmf.18 regression). Aug 17 2021, 4:29 PM

aaron closed subtask T274174: Add modtoken field and flags to objectcache table as Resolved. Sep 20 2021, 6:23 PM

Krinkle added a subtask: T272512: Apply outstanding schema changes for "objectcache" tables in production (exptime, flags, modtoken). Sep 20 2021, 7:14 PM

jijiki mentioned this in T293216: Upgrade mc* and mc-gp* hosts to Debian Bullseye. Oct 13 2021, 11:52 AM

Krinkle mentioned this in T274174: Add modtoken field and flags to objectcache table. Dec 22 2021, 6:48 PM

Krinkle removed a subtask: T272512: Apply outstanding schema changes for "objectcache" tables in production (exptime, flags, modtoken). Dec 22 2021, 7:37 PM

Next steps:

Decide on which database name(s) we need on the x2 cluster.
Create them.
Try connecting from MW CLI with --wiki=aawiki, and let it auto-create the objectcache table.

For database names, I propose one of the following options:

Named after the local wiki dbname, so that each x2.##wiki database would have its own objectcache table.
- Inspiration: Like externalstore text tables (a core per-wiki table, transparently hosted on a different cluster).
- Inspiration: Like Echo and GrowthExperiments (on extension1, similarly uses per-wiki tables). See wikitech:MariaDB#x1.
- Downside: This would make garbage collection significantly more complicated (requiring an additional level of indirection and iteration). It also complicates the wiki creation process (T158730) and adds general complexity from having per-wiki separation. Fine if there's a good reason to, but not a good default strategy imho. Afaik all keys can be in the same table, same as we do today for parsercache, main cache (memc), and main stash (redis). Also, we'd still need a name for the x2 db where the global version of "objectcache" would reside.
"mainstash".
- Inspiration: Like parsercache, where the database is named after the logical cluster (e.g. pc1, pc2). For main stash there isn't an obvious name for the logical cluster right now. There's no numbering (yet), and historically it has not had a name, as there isn't a "db" name from the client perspective of Redis or Memcached.
- Inspiration: Like centralauth (on s7), and flowdb and cognate (on x1), which are also cross-wiki features with a db named after themselves. See wikitech:MariaDB#x1.
"wikishared".
- Inspiration: Like extension1 and other cross-wiki concepts with MW in WMF prod.
- Downside: Not unique. While this name is already used in multiple places, thus far for DB-related things we have (afaik) only used it for non-core tables of MW extensions, and so far they're all on the extension1 cluster. This means that for debugging purposes, there is (afaik?) only one obvious definition of what "the wikishared database" means, e.g. for sql wikishared. And there are also notable exceptions. E.g. centralauth has its own database on s7 (its tables are not under a db called "wikishared").
- Upside: Not unique. If we choose to make "wikishared" the general name for cross-wiki tables that are fully part of the MW production schema but hosted externally, then re-using it would slightly simplify moving data from one database to another if we needed to. On the other hand, given this is not actually a concrete concept in the MW ecosystem, this db name needs to be configured per-feature either way. And it seems just as easy, if not easier, to move a whole single-table database than to move a table within a larger db. And having a unique name seems somewhat useful actually in terms of documenting, debugging connections etc.

I'm slightly leaning toward mainstash, with the table thus logically known as mainstash.objectcache on x2.

I would prefer either mainstash or wikishared, but I don't have any strong opinions about any of them.

I like "mainstash". If there is ever vertical sharding by extension, then "<group>stash" could be used as a DB name on separate clusters.

Change 752806 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] objectcache: add "globalKeyLbDomain" option to use with "globalKeyLB"

https://gerrit.wikimedia.org/r/752806

Change 752807 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[operations/mediawiki-config@master] Add "db-mainstash" entry to $wgObjectCaches

https://gerrit.wikimedia.org/r/752807

Change 752806 merged by jenkins-bot:

[mediawiki/core@master] objectcache: add "globalKeyLbDomain" option to use with "globalKeyLB"

https://gerrit.wikimedia.org/r/752806

ReleaseTaggerBot added a project: MW-1.38-notes (1.38.0-wmf.20; 2022-01-31). Jan 31 2022, 8:00 PM

Change 773657 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] objectcache: make "multiPrimaryMode" work with LB-based SqlBagOStuff instances

https://gerrit.wikimedia.org/r/773657

Change 773657 merged by jenkins-bot:

[mediawiki/core@master] objectcache: make "multiPrimaryMode" work with LB-based SqlBagOStuff

https://gerrit.wikimedia.org/r/773657

ReleaseTaggerBot added a project: MW-1.39-notes (1.39.0-wmf.7; 2022-04-11). Apr 5 2022, 3:00 AM

Krinkle changed the status of subtask T306118: Notify DBA prior to sending db traffic to x2 from Open to Stalled. Apr 13 2022, 5:27 PM

Change 779964 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] objectcache: Simplify docs of SqlBagOStuff 'purgePeriod' option

https://gerrit.wikimedia.org/r/779964

Change 779964 merged by jenkins-bot:

[mediawiki/core@master] objectcache: Simplify docs of SqlBagOStuff 'purgePeriod' option

https://gerrit.wikimedia.org/r/779964

ReleaseTaggerBot edited projects, added MW-1.39-notes (1.39.0-wmf.8; 2022-04-18); removed MW-1.39-notes (1.39.0-wmf.7; 2022-04-11). Apr 14 2022, 5:00 AM

Next: Decide on how and whether to fragment the data in mainstashdb, e.g. like parser cache, like external store, or something else. @aaron to propose some ideas for DBAs to provide feedback/guidance on.

Change 780903 had a related patch set uploaded (by Krinkle; author: Aaron Schulz):

[mediawiki/core@master] objectcache: remove "multiPrimaryMode" DB type assertion

https://gerrit.wikimedia.org/r/780903

Change 780903 merged by jenkins-bot:

[mediawiki/core@master] objectcache: remove "multiPrimaryMode" DB type assertion

https://gerrit.wikimedia.org/r/780903

ReleaseTaggerBot edited projects, added MW-1.39-notes (1.39.0-wmf.9; 2022-04-25); removed MW-1.39-notes (1.39.0-wmf.8; 2022-04-18). Apr 23 2022, 9:00 PM

In T212129#7865521, @Krinkle wrote:

Next: Decide on how and whether to fragment the data in mainstashdb, e.g. like parser cache, like external store, or something else. @aaron to propose some ideas for DBAs to provide feedback/guidance on.

What I want to avoid is the kind of overhead that the parser cache has for purging expired blobs. The linked list used for overflow pages probably gets fragmented over time due to wildly varying blob sizes.

Ladsgroup subscribed. May 6 2022, 8:31 AM

tstarling subscribed. May 17 2022, 11:50 PM

Change 798030 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] Simplify SqlBagOStuff configuration

https://gerrit.wikimedia.org/r/798030

Change 798030 merged by jenkins-bot:

[mediawiki/core@master] objectcache: Simplify SqlBagOStuff class configuration

https://gerrit.wikimedia.org/r/798030

ReleaseTaggerBot edited projects, added MW-1.39-notes (1.39.0-wmf.14; 2022-05-30); removed MW-1.39-notes (1.39.0-wmf.9; 2022-04-25). May 27 2022, 4:00 PM

daniel mentioned this in T308511: [SPIKE] Determine necessity of edit session continuity during data center switchovers. May 30 2022, 12:21 PM

Here is the schema (just one table):

CREATE TABLE objectstash (
  keyname VARBINARY(255) DEFAULT '' NOT NULL,
  value MEDIUMBLOB DEFAULT NULL,
  exptime BINARY(14) NOT NULL,
  modtoken VARCHAR(22) DEFAULT '0000000000000000000000' NOT NULL,
  flags INT UNSIGNED DEFAULT NULL,
  INDEX exptime (exptime),
  PRIMARY KEY(keyname)
) ENGINE=innoDB COMMENT='MERGE_THRESHOLD=30';
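
A note on the exptime column, as a minimal sketch: it assumes the column holds a 14-digit UTC timestamp of the form YYYYMMDDHHMMSS, which the BINARY(14) width suggests but the comment above does not state. Under that assumption, plain byte-wise comparison orders values chronologically, which is what would let a range scan on the exptime index find expired rows. The toExptime() helper is hypothetical and is not a MediaWiki function.

```php
<?php
// Hypothetical helper (not a MediaWiki function): format a Unix time
// as a 14-character UTC timestamp, the assumed layout of exptime.
function toExptime( int $unixTime ): string {
	return gmdate( 'YmdHis', $unixTime );
}

$now = 1654041600; // 2022-06-01 00:00:00 UTC
$expired = toExptime( $now - 3600 ); // one hour in the past
$live = toExptime( $now + 3600 );    // one hour in the future
$cutoff = toExptime( $now );

// Fixed-width digit strings compare byte-wise in chronological order,
// so "exptime < cutoff" would select exactly the expired rows.
var_dump( strlen( $cutoff ) === 14 );
var_dump( strcmp( $expired, $cutoff ) < 0 );
var_dump( strcmp( $live, $cutoff ) > 0 );
```

If the assumption holds, a purge would only need to compare stored values against the current timestamp string, with no date parsing on the database side.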

Krinkle changed the status of subtask T306118: Notify DBA prior to sending db traffic to x2 from Stalled to Open. May 31 2022, 11:36 PM

Krinkle mentioned this in T306118: Notify DBA prior to sending db traffic to x2.

Change 799433 had a related patch set uploaded (by Krinkle; author: Tim Starling):

[operations/mediawiki-config@master] Switch wgMainStash to db-mainstash

https://gerrit.wikimedia.org/r/799433

Change 752807 merged by jenkins-bot:

[operations/mediawiki-config@master] Add "db-mainstash" entry to $wgObjectCaches

https://gerrit.wikimedia.org/r/752807

Change 802669 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/puppet@production] Add MariaDB grants for x2

https://gerrit.wikimedia.org/r/802669

Change 802669 merged by Tim Starling:

[operations/puppet@production] Add MariaDB grants for x2

https://gerrit.wikimedia.org/r/802669

I created the database and table, applied the grants, and tested it from eval.php, testing the wikiadmin@eqiad and wikiuser2022@codfw grants. In eqiad, it seems to work. In codfw, reading works, but writing fails because the LB is configured with read-only mode. That's fine, so I'm ready to deploy the switchover commit. But I'll leave it until Monday to give others a chance to review.

I'm fully alone next week as the rest of the team is gone. Can we enable this the following week instead of Monday?
Thanks.

OK, how about June 14, 05:00 UTC?

In T212129#7978237, @tstarling wrote:

OK, how about June 14, 05:00 UTC?

That would work. I will let you know for sure on Tuesday.

Change 804024 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/mediawiki-config@master] Switch wgMainStash back to Redis

https://gerrit.wikimedia.org/r/804024

Mark asked me to prepare a rollback plan which can be used to switch back to Redis if something goes wrong.

If it's immediately broken after deployment, then we can switch back without flushing Redis, but if we want to switch back some hours later, it seems best to flush Redis to avoid presenting stale data to MediaWiki.

Our use of Redis does not distinguish between mainstash keys and other callers. Using a separate DB or Redis instance would create its own risks rather than being a safe fallback. Running a FLUSHALL command would wipe CentralAuth sessions, resulting in user inconvenience since users without the "remember me" option would have to log in again. So I suggest only doing the flush if/when a rollback proves necessary.

The manual warns that FLUSHALL is "slow", presumably O(N), and will block all activity on the server while it completes (we are running 2.8, which has no ASYNC option). But the number of keys per server is only 0.8M to 1.3M, so I figure it couldn't take more than a few seconds.

To flush all relevant eqiad redis servers, on deploy1002 run mwscript shell.php --wiki=enwiki and paste the following into it:

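// Redis servers taken from deploy1002:/etc/nutcracker/nutcracker.yml
$servers = [ '10.64.0.125', '10.64.0.65', '10.64.16.21', '10.64.16.190', '10.64.32.153', '10.64.32.158', '10.64.48.91', '10.64.48.93' ];
foreach ( $servers as $host ) {
	$redis = new Redis();
	$redis->connect( $host );
	$redis->auth( $wmgRedisPassword );
	// Blocks this server until every key has been deleted
	$redis->flushAll();
	// Print the db0 keyspace summary; the entry is absent if db0 is still empty
	print $host . ": " . ( $redis->info()['db0'] ?? 'no keys' ) . "\n";
}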

This should show the number of remaining keys on each server after the flush, which should be a small number.

Then https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/804024 can be deployed.

I got the server list from deploy1002:/etc/nutcracker/nutcracker.yml. I tested all the lines of the script except the flushAll(), and I confirmed that the Redis config has no security measure which would prevent clients from running FLUSHALL.

Thank you Tim!
I will bring this up at our team meeting on Monday.

tstarling closed subtask T306118: Notify DBA prior to sending db traffic to x2 as Resolved. Jun 14 2022, 4:59 AM

Change 799433 merged by jenkins-bot:

[operations/mediawiki-config@master] Switch wgMainStash to db-mainstash

https://gerrit.wikimedia.org/r/799433

Mentioned in SAL (#wikimedia-operations) [2022-06-14T05:11:30Z] <tstarling@deploy1002> Synchronized wmf-config/InitialiseSettings.php: T212129 Switch wgMainStash to db-mainstash (duration: 03m 38s)

Change 805266 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/extensions/AbuseFilter@master] Configure FilterProfiler cache separately

https://gerrit.wikimedia.org/r/805266

Change 805268 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/mediawiki-config@master] Switch AbuseFilter profiler back to redis

https://gerrit.wikimedia.org/r/805268

Change 805160 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/extensions/AbuseFilter@wmf/1.39.0-wmf.15] Configure FilterProfiler cache separately

https://gerrit.wikimedia.org/r/805160

Change 805160 merged by jenkins-bot:

[mediawiki/extensions/AbuseFilter@wmf/1.39.0-wmf.15] Configure FilterProfiler cache separately

https://gerrit.wikimedia.org/r/805160

Change 805266 merged by jenkins-bot:

[mediawiki/extensions/AbuseFilter@master] Configure FilterProfiler cache separately

https://gerrit.wikimedia.org/r/805266

Mentioned in SAL (#wikimedia-operations) [2022-06-14T06:20:26Z] <tstarling@deploy1002> Synchronized php-1.39.0-wmf.15/extensions/AbuseFilter/extension.json: T212129 (duration: 03m 32s)

Mentioned in SAL (#wikimedia-operations) [2022-06-14T06:24:19Z] <tstarling@deploy1002> Synchronized php-1.39.0-wmf.15/extensions/AbuseFilter/includes/ServiceWiring.php: T212129 (duration: 03m 33s)

Change 805268 merged by jenkins-bot:

[operations/mediawiki-config@master] Switch AbuseFilter profiler back to redis

https://gerrit.wikimedia.org/r/805268

Mentioned in SAL (#wikimedia-operations) [2022-06-14T06:28:26Z] <tstarling@deploy1002> Synchronized wmf-config/InitialiseSettings.php: T212129 (duration: 03m 31s)

ReleaseTaggerBot edited projects, added MW-1.39-notes (1.39.0-wmf.17; 2022-06-20); removed MW-1.39-notes (1.39.0-wmf.14; 2022-05-30). Jun 14 2022, 7:00 AM

PleaseStand mentioned this in T308069: 1.39.0-wmf.16 deployment blockers. Jun 14 2022, 12:27 PM

Change 805361 had a related patch set uploaded (by Jforrester; author: Tim Starling):

[mediawiki/extensions/AbuseFilter@wmf/1.39.0-wmf.16] Configure FilterProfiler cache separately

https://gerrit.wikimedia.org/r/805361

Jdforrester-WMF added a parent task: T308069: 1.39.0-wmf.16 deployment blockers. Jun 14 2022, 2:16 PM

Change 805361 merged by jenkins-bot:

[mediawiki/extensions/AbuseFilter@wmf/1.39.0-wmf.16] Configure FilterProfiler cache separately

https://gerrit.wikimedia.org/r/805361

brennen removed a parent task: T308069: 1.39.0-wmf.16 deployment blockers. Jun 14 2022, 4:37 PM

ReleaseTaggerBot edited projects, added MW-1.39-notes (1.39.0-wmf.16; 2022-06-13); removed MW-1.39-notes (1.39.0-wmf.17; 2022-06-20). Jun 14 2022, 5:00 PM

Jdforrester-WMF mentioned this in T310532: Investigate McRouter GET request spike from wmf.15. Jun 14 2022, 9:42 PM

Metrics on db1151 look fine. Disk space usage on db1151 is growing at a rate of 5.9 GB per day, and there is 8.6 TB available, implying exhaustion in about 4 years if it keeps growing at the same rate, which it is not expected to do. I think this is done.
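
The exhaustion estimate can be checked with a quick calculation. This is only a sketch: it assumes decimal units (1 TB = 1000 GB) and a constant growth rate, which the comment above says is not expected to hold, and the variable names are illustrative.

```php
<?php
// Figures quoted in the comment above.
$availableGb = 8.6 * 1000;  // 8.6 TB free, assuming 1 TB = 1000 GB
$growthGbPerDay = 5.9;      // observed growth on db1151

// Days until the free space is used up at a constant growth rate.
$daysLeft = $availableGb / $growthGbPerDay;
$yearsLeft = $daysLeft / 365.25;

printf( "%.0f days, about %.1f years\n", $daysLeft, $yearsLeft );
```

This works out to roughly 1458 days, which matches the 4-year figure.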

Change 804024 abandoned by Tim Starling:

[operations/mediawiki-config@master] Switch wgMainStash back to Redis

Reason:

not needed

https://gerrit.wikimedia.org/r/804024