
Improve performance of Compare query for Special:Investigate
Closed, Resolved · Public · Feb 28 2020

Description

The current CheckUser tool places limits on the maximum potential cost of accessing data, in case a result set is large. We need to set similar limits for the Special:Investigate Compare tab.

The current version of the query essentially does:

SELECT cuc_user, cuc_user_text, cuc_ip, cuc_ip_hex, cuc_agent, MIN(cuc_timestamp), MAX(cuc_timestamp), COUNT(*)
FROM cu_changes
WHERE cuc_type IN (0,1) AND (cuc_user_text IN <users> OR cuc_ip IN <IPs> OR cuc_ip_hex BETWEEN <min IP> AND <max IP>)
GROUP BY cuc_user_text, cuc_ip, cuc_agent
ORDER BY cuc_user_text, cuc_ip, cuc_agent
LIMIT <limit>

This performs a filesort operation on an arbitrarily large number of rows before imposing any limit. Although an index on, say, (cuc_user, cuc_ip, cuc_agent, cuc_timestamp) (and changing cuc_user_text -> cuc_user) could avoid the filesort, there would still be no cap on how many rows are grouped before the limit is applied.
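
For illustration, such an index might look like this (a hypothetical sketch using the index-creation conventions that appear later in this task, not a concrete proposal):

CREATE INDEX /*i*/cuc_user_ip_agent_time ON /*_*/cu_changes (cuc_user, cuc_ip, cuc_agent, cuc_timestamp);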

After discussing the problem with @Catrope, here are some proposed changes to place limits on the worst potential performance.

UI changes

  • Place a low limit on the number of users and IPs that can be investigated at once by the check user
  • Allow the check user to narrow the time range for their investigation (the current CheckUser tool does this)
  • Show warnings and advice if data are truncated

Query changes
One option is to use efficient subqueries to access a limited number of rows, and to perform the grouping and pagination on this size-limited subset. We would have to decide how to limit the subset (e.g. would we rather have some data for each target, or prioritise the most recent data?). For example, to get some data for each target, we could do something like:

SELECT a.cuc_user_text, a.cuc_ip, a.cuc_agent, MIN(a.cuc_timestamp), MAX(a.cuc_timestamp), COUNT(*) FROM (

-- for each user
( SELECT cuc_user_text, cuc_ip, cuc_agent, cuc_timestamp FROM cu_changes WHERE cuc_type IN (0,1) AND cuc_user = <user> AND cuc_timestamp >= <timestamp> ORDER BY cuc_timestamp DESC LIMIT <limit> ) UNION

-- for each IP
( SELECT cuc_user_text, cuc_ip, cuc_agent, cuc_timestamp FROM cu_changes WHERE cuc_type IN (0,1) AND cuc_ip_hex = <IP> AND cuc_timestamp >= <timestamp> ORDER BY cuc_timestamp DESC LIMIT <limit> ) UNION

-- for each IP range (NB highest first, to make use of cuc_ip_hex_time index)
( SELECT cuc_user_text, cuc_ip, cuc_agent, cuc_timestamp FROM cu_changes WHERE cuc_type IN (0,1) AND cuc_ip_hex >= <min IP> AND cuc_ip_hex <= <max IP> AND cuc_timestamp >= <timestamp> ORDER BY cuc_ip_hex DESC, cuc_timestamp DESC LIMIT <limit> )

) AS a GROUP BY cuc_user_text, cuc_ip, cuc_agent ORDER BY cuc_user_text, cuc_ip, cuc_agent LIMIT <pager limit>;
  • Separate subqueries for each target entered ensure we get some data per target. Each subquery has its own limit (e.g. if the overall limit on rows we should access is 5,000 and 5 targets have been entered, each could have a limit of 1,000), and orders by descending timestamp so that the most recent data are kept
  • The outer query performs a filesort, but on a limited number of rows. (Grouping could be done in PHP instead, but that would require customizing the pager and involve more data transfer)
  • Repeating each subquery with a count can tell us whether each one hit the limit; this can't be inferred from the final aggregated results because of the union. (A sketch of such a check follows this list)
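
A rough sketch of such a per-target check, for a single user target and reusing the placeholders from the query above (the derived-table wrapper is just one way to count a LIMITed subquery):

SELECT COUNT(*) FROM (
  -- same WHERE and LIMIT as the corresponding subquery above
  SELECT 1 FROM cu_changes
  WHERE cuc_type IN (0,1) AND cuc_user = <user> AND cuc_timestamp >= <timestamp>
  LIMIT <limit>
) AS b;

If the count equals <limit>, that target's data may have been truncated and a warning can be shown.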

The biggest problem with making these (or similar) changes to the query is that a set of results may be incomplete if any of the subqueries hit their limit (and the missing results will not be accessible on the next page). Some ways to mitigate this are:

  • See if we can do some analysis on the largest cu_changes tables to find a limit that is unlikely to be hit often but which will perform tolerably
  • Reduce the likelihood of hitting the limit by limiting the number of targets that can be input at once
  • Display a warning that the results may be incomplete, whenever any subquery hits the limit
  • Display general advice to check users that they can reduce the likelihood of getting incomplete results by reducing the number of targets, reducing the width of IP ranges, and/or reducing the time window.

Event Timeline

I ran the query above on Wikidata with my two user accounts, my home IP address, and an 8-IP range around my IP address.

7 rows in set (4 min 16.03 sec)

Clearly what we have isn't going to work.

I ran ANALYZE and got this back:

id | select_type | table | type | possible_keys | key | key_len | ref | rows | r_rows | filtered | r_filtered | Extra
1 | SIMPLE | cu_changes | ALL | cuc_ip_hex_time | NULL | NULL | NULL | 63513675 | 62944720.00 | 100.00 | 0.02 | Using where; Using temporary; Using filesort

It seems that we need to add the following indexes:

/* Special:Investigate/Compare WHERE */
CREATE INDEX /*i*/cuc_user_text_type ON /*_*/cu_changes (cuc_user_text, cuc_type);
CREATE INDEX /*i*/cuc_ip_type ON /*_*/cu_changes (cuc_ip, cuc_type);
CREATE INDEX /*i*/cuc_ip_hex_type ON /*_*/cu_changes (cuc_ip_hex, cuc_type);

/* Special:Investigate/Compare GROUP BY & ORDER BY */
CREATE INDEX /*i*/cuc_user_text_ip_agent ON /*_*/cu_changes (cuc_user_text, cuc_ip, cuc_agent);

Doing this should avoid the full table scan, the temporary table, and the filesort.

If that doesn't resolve the performance issues, I think we could pre-aggregate the data (when an edit is made) as I suggested in T244579#5891317 thus removing the need for the GROUP BY in the first place. :)

We should probably keep in mind the schema changes that are happening in T233004. Not necessarily a blocker though.

@dbarratt It's worth bearing in mind the schema change process (https://wikitech.wikimedia.org/wiki/Schema_changes), which needs to be followed when adding indexes. Given the additional costs to storage and table insertions (cu_changes is large and frequently updated), adding four quite specific indexes for this query would be expensive. (Incidentally, I'd question whether the query can use all four indexes, and how efficient that would be if so.)

As mentioned in the task description, indexing doesn't solve the problem of setting a limit on the number of rows being accessed. That could be done by limiting before grouping (and showing a warning) as outlined in the task description. It could also be done by adding a new table as @dbarratt suggests, but that would incur storage/update costs, might impact the product timescale and would represent a commitment to the current product design (further tweaks might necessitate schema changes) - @Niharika, @Mooeypoo what do you think?

As a solution with no database changes, I find showing a limited dataset with a warning acceptable (especially if we can find a limit that is rarely hit). It's common for our products to have a fallback plan if the result set is too large. The current CheckUser tool:

  • Shows no data, but a list of IP addresses and a warning if an IP range yields too many edits, or too many users
  • Shows truncated data (no pagination) with a warning if a user name yields too many edits

@Niharika I seem to remember the plan outlined in the task description being acceptable to you from a product perspective when we last discussed it - is that still the case?

As mentioned in the task description, indexing doesn't solve the problem of setting a limit on the number of rows being accessed.

I guess I'm confused on why that is a problem? Setting pre-aggregation limit seems like a solution?

To clarify, I'm not opposed to setting a post aggregation limit, like limiting the IP ranges the user can enter. But it seems odd to limit the aggregation, which would result in incorrect data being displayed (and not a clear way to know it is incorrect?). Or am I misunderstanding this?

In regards to schema changes, I think it would be wise to heed this advice:

Schema changes on a live database are hard, and they should not be done lightly. That doesn't mean that schema changes should be avoided, not at all, a good database design is essential for security and performance.

If the problem is that the query is not performant enough (which is clear from the test), perhaps a schema change is precisely what we need? But yes, as you pointed out, we are committing to the product operating the way it's been designed, so I think it would be wise to ensure that this is what we want before proceeding.

Incidentally, I'd question whether the query can use all four indexes, and how efficient that would be if so.

It might not; I haven't gone through and thoroughly tested it. It's a little complicated because the user might provide a few users and no IPs or ranges, so the indexes are there more to cover the different use cases than to ensure that all of them are used. I think this will certainly require a lot more testing. :)

@Tchanders Are you talking about the following -

UI changes
• Place a low limit on the number of users and IPs that can be investigated at once by the check user
• Allow the check user to narrow the time range for their investigation (the current CheckUser tool does this)

I am okay with both of these things. It's also fine if we want to put a limit on the number of rows we pull up in the Compare tab. If we can show something like "Only first $number records shown. Please filter down the results to see more." - that is okay to begin with.
I'm also okay with us trying something out and seeing if it works for the users. We wouldn't really know how extreme the use cases are until we put it in front of the users. We'll learn a whole lot more once we have something users can play around with.

Another proposal is to create another table, something like this:

CREATE TABLE /*_*/cu_actor (
  -- Primary key
  cua_id bigint unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,

  -- actor
  cua_actor bigint unsigned NOT NULL,

  -- IP address, visible
  cua_ip VARCHAR(255) NULL default '',

  -- IP address as hexadecimal
  cua_ip_hex VARCHAR(255) default NULL,

  -- User agent
  cua_agent VARCHAR(255) BINARY default NULL,

  -- Most recent event timestamp
  cua_timestamp CHAR(14) NOT NULL default ''
) /*$wgDBTableOptions*/;

CREATE UNIQUE INDEX /*i*/cu_actor_ip_agent ON /*_*/cu_actor (cua_actor, cua_ip, cua_agent);
CREATE INDEX /*i*/cu_actor_timestamp ON /*_*/cu_actor (cua_timestamp);

Then we'd modify cu_changes like this:

ALTER TABLE /*$wgDBprefix*/cu_changes
  ADD COLUMN `cuc_cua_id` bigint unsigned default NULL
  AFTER cuc_title;

After we fill in the table (or allow it to be filled in over 30 days) we can remove the columns from the changes table:

ALTER TABLE /*$wgDBprefix*/cu_changes DROP COLUMN cuc_user;
ALTER TABLE /*$wgDBprefix*/cu_changes DROP COLUMN cuc_user_text;
ALTER TABLE /*$wgDBprefix*/cu_changes DROP COLUMN cuc_ip;
ALTER TABLE /*$wgDBprefix*/cu_changes DROP COLUMN cuc_ip_hex;
ALTER TABLE /*$wgDBprefix*/cu_changes DROP COLUMN cuc_agent;

Perhaps an even better way to do this is to create a new cu_changes table instead of altering the existing one, something like:

CREATE TABLE /*_*/cu_revision (
  -- CheckUser Actor ID
  cua_id BIGINT NOT NULL,

  -- Revision ID
  rev_id INT NOT NULL,

  PRIMARY KEY (cua_id, rev_id)
) /*$wgDBTableOptions*/;

And perhaps some additional fields as necessary.

Doing this prevents having to do a schema change on the production database.
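
For illustration only, a table like this could be kept current with an upsert whenever an action is recorded. The statement below is an assumption sketched for this comment (MySQL/MariaDB syntax, relying on the unique (cua_actor, cua_ip, cua_agent) index above), not something specified in the proposal:

INSERT INTO /*_*/cu_actor (cua_actor, cua_ip, cua_ip_hex, cua_agent, cua_timestamp)
VALUES (<actor id>, <ip>, <hex ip>, <user agent>, <timestamp>)
-- if the actor/IP/agent combination already exists, just bump the most recent event timestamp
ON DUPLICATE KEY UPDATE cua_timestamp = VALUES(cua_timestamp);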

We might want to figure out what we want to do on T175587 before we commit to the schema of the two tables.

Tchanders updated the task description.

The engineers and @Catrope (thanks!) met to discuss this, and essentially boiled it down to three proposals:

Proposal | More info | Pro | Con
Add index on (cuc_user, cuc_ip, cuc_agent) | Keeps query in task description, but should avoid the large sort. We'd need to experiment to confirm the performance is good enough. | Only solution that gets all the data. (Data may be paginated, but it will all be there.) | Time and DBA approval needed for experiment; experiment may fail
Add another table | See previous two comments (@dbarratt could you also add the query that we would use?) | Gets all the user/IP/UA combinations. (Editcount and time range may be incomplete unless the user changes their parameters.) | Time for DBA approval; need to be sure of product to avoid schema change after table is added
Truncate data before aggregation | See task description | Can implement now (no DBA involvement); can test performance on production. | May not find all user/IP/UA combinations unless the user changes their parameters

@Tchanders Thank you for summarizing.
My instinct would be to go with the third proposal, which seems easiest to implement. Can you elaborate a little more on the con there?

May not find all user/IP/UA combinations unless the user changes their parameters

Do you mean we won't be finding all the results because we truncate the result-set? Would we be able to tell the user that the results were truncated at $number? And, what do you think the truncation limit would be?

I don't know if you have any data on how often the 5000 result limit in the current tool is hit; my personal experience is that it's fairly common, particularly for mobile IP ranges. If that limit were brought even lower per IP range, e.g. if only the last 1000 edits from an IP range were considered, I think that would be a significant degradation.

Do you mean we won't be finding all the results because we truncate the result-set? Would we be able to tell the user that the results were truncated at $number? And, what do you think the truncation limit would be?

@Niharika Yes, with the proposed query, we'd limit the result from each entered target by timestamp. If the target were a single user (limit 5000), and the user's last 5000 edits were from one IP, but there were older edits from another IP, their results would be truncated and the older edits from the other IP wouldn't be returned. There would be a warning explaining that the results were truncated to a certain number. (This is also the case for the current CheckUser tool.)

@ST47 Thanks, it's really helpful to hear about common experiences with hitting the limits in the real world.

Our main constraint is keeping the limit on the maximum amount of data returned roughly the same as for the current CheckUser tool. A new feature we're adding is the ability to enter several targets at once (e.g. an IP range plus some user accounts). In this situation, assuming we want to have at least some data for every target, the limit for each target would need to decrease in order for us to keep the 5000 limit on the total amount of data. If these limits were hit, a warning would be displayed. If the search were altered to include only the IP range, the same amount of data would be returned as for the current CheckUser.

There may be something we could do with setting higher limits for IP ranges than single IPs or users.

A couple of small clarifications from the meeting. The third option is the one that's laid out in the description of this ticket, including the product implications.

The first two options are dependent on a little bit of experimentation; we've discussed trying to mock a few tens of millions of rows so that we can test the potential queries to see how they'd work. The two options seem encouraging to us, but we can't quite be sure that they will resolve the issue yet, so we should add that to the list of potential complications. That said, they do *potentially* give us more data to display.

For the limit, I am not sure what would be entirely reasonable, but I think we were talking on the scale of ~5,000 rows. The current CheckUser limits results to 5,000 or 10,000 depending on the query.

I believe (and @Tchanders / @Catrope please correct me if I'm wrong) that we will be able to tell the user that there may be further data, and that they can "drill down" more by changing the form or the timespan.

I believe (and @Tchanders / @Catrope please correct me if I'm wrong) that we will be able to tell the user that there may be further data, and that they can "drill down" more by changing the form or the timespan.

That's correct - we can even tell them which target's data were truncated (they might not all be truncated)

Only solution that gets all the data. (Data may be paginated, but it will all be there.)

I believe the second solution would get all of the data as well?

@dbarratt could you also add the query that we would use?

Based on the example in T245499#5907670 and T245499#5907824 I think it would look something like this:

SELECT cua_id, cua_actor, cua_ip, cua_agent, cua_timestamp FROM cu_actor WHERE (cua_actor IN <users> OR cua_ip IN <IPs> OR cua_ip_hex BETWEEN <min IP> AND <max IP>) ORDER BY cua_timestamp, cua_id LIMIT <limit>

I realize this changes the ORDER BY, but we'll have the same problem when T233004 is completed. We could preserve the same order with a JOIN on the actor table, but I'm not sure whether having the results ordered by user, then IP, then UA is beneficial to the user anyway?

After that is retrieved, the result set for the page would do follow-up queries for the min/max timestamp and the count of revisions like so:

SELECT MIN(revision.rev_timestamp), MAX(revision.rev_timestamp), COUNT(*) AS count FROM cu_revision INNER JOIN revision ON cu_revision.rev_id = revision.rev_id WHERE cu_revision.cua_id = <cua_id from the previous query>

That query should use the existing rev_timestamp index on the revision table as a covering index.

Editcount and time range may be incomplete unless the user changes their parameters

I'm confused about why that would be the case?

Time for DBA approval

Based on @Niharika's comment at T245662#5898887, it seems that adding a table (or two new tables?) would not require DBA approval and would not constitute a schema change. Am I misunderstanding?

need to be sure of product to avoid schema change after table is added

If we're creating a new table anyway, I assume it would be fine to destroy and re-create it if there were changes? That doesn't remove the risk, but it might help mitigate it.

The schema in T245499#5907670 could be simplified a bit more by removing cua_ip (and using cua_ip_hex instead). I can't see a reason why cua_ip would be needed.

I did a bit of testing on the proposed query changes (the first row in the table above), and it does perform significantly better.

Even if I remove the inner LIMIT completely, the query executes in less than (or precisely?) a second. I think that if we could remove it, my concerns would be resolved, as the performance appears to be sufficient. The problem with the inner limit is that the query doesn't reveal when you're getting inaccurate data (unless there is some way we could add that?).

Original Query:

Type | Count
Bot 1 | 10067
User 1 | 52
User 1 | 58
User 1 | 400
User 1 | 10
User 1 | 1
User 1 | 14

7 rows in set (5 min 25.26 sec)

Proposed Query (with inner limit):

Type | Count
Bot 1 | 4485
User 1 | 52
User 1 | 58
User 1 | 399
User 1 | 10
User 1 | 1
User 1 | 14

7 rows in set (0.92 sec)

Proposed Query (without inner limit):

Type | Count
Bot 1 | 10067
User 1 | 52
User 1 | 58
User 1 | 399
User 1 | 10
User 1 | 1
User 1 | 14

7 rows in set (1.00 sec)

I'm not sure why that 4th row is off by one in both of the proposed query examples.

NOTE: I removed the cuc_timestamp condition from the test query in all of the examples.

I forgot to add the ANALYZE results of the proposed query (without inner limit):

id | select_type | table | type | possible_keys | key | key_len | ref | rows | r_rows | filtered | r_filtered | Extra
1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 39537 | 10601.00 | 100.00 | 100.00 | Using temporary; Using filesort
2 | DERIVED | cu_changes | range | cuc_user_ip_time | cuc_user_ip_time | 4 | NULL | 545 | 544.00 | 100.00 | 98.35 | Using index condition; Using where
3 | UNION | cu_changes | ref | cuc_ip_hex_time | cuc_ip_hex_time | 258 | const | 19496 | 10596.00 | 100.00 | 99.92 | Using index condition; Using where
4 | UNION | cu_changes | range | cuc_ip_hex_time | cuc_ip_hex_time | 258 | NULL | 19496 | 10596.00 | 100.00 | 99.92 | Using index condition; Using where
NULL | UNION RESULT | <union2,3,4> | ALL | NULL | NULL | NULL | NULL | NULL | 10601.00 | NULL | NULL |

If I'm reading this correctly (doubtful, ha!), it looks like we are now avoiding the full table scan.

I'd question whether the query can use all four indexes, and how efficient that would be if so.

You're totally right, it doesn't.

Original:

7 rows in set (4 min 37.10 sec)

I changed the current query to use cuc_user instead of cuc_user_text in the WHERE:

7 rows in set (52.73 sec)

Here is the ANALYZE:

id | select_type | table | type | possible_keys | key | key_len | ref | rows | r_rows | filtered | r_filtered | Extra
1 | SIMPLE | cu_changes | ALL | cuc_ip_hex_time,cuc_user_ip_time | NULL | NULL | NULL | 56655754 | 62201657.00 | 100.00 | 0.01 | Using where; Using temporary; Using filesort

Please ignore the solution I proposed in T245499#5894792 as that doesn't seem to be of any help to anyone. :)

Based on this data, it appears that we would get the best results from:

  1. Adding the two new database tables (since that has the added advantage of not using a temporary table or a filesort, though it is possible that the subquery is still more efficient in some instances).
  2. Changing the query to what is proposed (and removing the inner LIMIT to ensure consistent results)

However, since it's less work to update the query than it is to add and populate database tables, I would recommend doing these in the reverse order. If the performance is sufficient, then we don't need to create the database tables. If it isn't, then we haven't lost any effort.

I did a few tests locally and I stumbled upon Optimizing Subqueries with Materialization in the documentation. If the query is rewritten slightly to something like this:

SELECT cuc_user_text, cuc_ip, cuc_agent, MIN(cuc_timestamp), MAX(cuc_timestamp), COUNT(*) FROM cu_changes WHERE cuc_id IN (

-- for each user
( SELECT cuc_id FROM cu_changes WHERE cuc_type IN (0,1) AND cuc_user = <user id>) UNION

-- for each IP
( SELECT cuc_id FROM cu_changes WHERE cuc_type IN (0,1) AND cuc_ip_hex = <hex ip>) UNION

-- for each IP range (NB highest first, to make use of cuc_ip_hex_time index)
( SELECT cuc_id FROM cu_changes WHERE cuc_type IN (0,1) AND cuc_ip_hex >= <hex start> AND cuc_ip_hex <= <hex end>)

) GROUP BY cuc_user_text, cuc_ip, cuc_agent ORDER BY cuc_user_text, cuc_ip, cuc_agent LIMIT <pager limit>;

Then we can benefit from an index on the ORDER BY / GROUP BY like this:

CREATE INDEX /*i*/cuc_user_text_ip_agent ON /*_*/cu_changes (cuc_user_text, cuc_ip, cuc_agent);

From my testing locally, this uses the existing indexes for the subqueries, then uses the new index for the GROUP BY and ORDER BY.

Ooh, this subquery materialization stuff is really helpful, thanks! I knew that using the indexes to get rows by user/IP should be possible, but I didn't know how to tell MySQL to do it. Adding an index to help the GROUP BY will likely improve performance too.

However, I think we'll still need inner LIMITs, to prevent runaway queries that examine very large numbers of rows. With the improvements you suggested, we may be able to get away with higher inner limits than we otherwise would, but I think there will still need to be some sort of limit on the number of rows examined, even if it's a relatively high number (e.g. 10k or more).

I was really intrigued by the speedup you found in T245499#5909620. It can be difficult to tell sometimes whether these effects are real or are caused by warmup/caching, so I tried to find out by running a test query in production:

mysql:research@dbstore1003.eqiad.wmnet [enwiki]>  analyze SELECT cuc_user_text, cuc_ip, cuc_agent, MIN(cuc_timestamp), MAX(cuc_timestamp), COUNT(*) FROM cu_changes WHERE cuc_id IN (  SELECT cuc_id FROM cu_changes WHERE cuc_type IN (0,1) AND cuc_user=22559448 UNION SELECT cuc_id FROM cu_changes WHERE cuc_type IN (0,1) AND cuc_ip_hex='22E11E38' UNION SELECT cuc_id FROM cu_changes WHERE cuc_type IN (0,1) AND cuc_ip_hex >= 'AC1001E0' AND cuc_ip_hex <= 'AC1001EF' ) GROUP BY cuc_user_text, cuc_ip, cuc_agent ORDER BY cuc_user_text, cuc_ip, cuc_agent LIMIT 500;

+------+--------------------+--------------+--------+----------------------------------------+---------+---------+------+----------+-------------+----------+------------+----------------------------------------------+
| id   | select_type        | table        | type   | possible_keys                          | key     | key_len | ref  | rows     | r_rows      | filtered | r_filtered | Extra                                        |
+------+--------------------+--------------+--------+----------------------------------------+---------+---------+------+----------+-------------+----------+------------+----------------------------------------------+
|    1 | PRIMARY            | cu_changes   | ALL    | NULL                                   | NULL    | NULL    | NULL | 16500000 | 16600000.00 |   100.00 |       1.22 | Using where; Using temporary; Using filesort |
|    2 | DEPENDENT SUBQUERY | cu_changes   | eq_ref | PRIMARY,cuc_user_time,cuc_user_ip_time | PRIMARY | 4       | func |        1 |        1.00 |   100.00 |       0.00 | Using where                                  |
|    3 | DEPENDENT UNION    | cu_changes   | eq_ref | PRIMARY,cuc_ip_hex_time                | PRIMARY | 4       | func |        1 |        1.00 |   100.00 |       0.13 | Using where                                  |
|    4 | DEPENDENT UNION    | cu_changes   | eq_ref | PRIMARY,cuc_ip_hex_time                | PRIMARY | 4       | func |        1 |        1.00 |   100.00 |       1.10 | Using where                                  |
| NULL | UNION RESULT       | <union2,3,4> | ALL    | NULL                                   | NULL    | NULL    | NULL |     NULL |        0.01 |     NULL |       NULL |                                              |
+------+--------------------+--------------+--------+----------------------------------------+---------+---------+------+----------+-------------+----------+------------+----------------------------------------------+
5 rows in set (1 min 55.10 sec)

(numbers rounded)
In this case the user subquery returned 0 results (oops, that'll teach me to use my own account on enwiki where I haven't edited for a while), the IP subquery returned about 100k results, and the IP range subquery returned about 200k results. There may well be a speedup here compared to the current query, but 2 minutes is still too slow. Of course this is without the new index, I wasn't able to test if it would have been faster with the new index.

Unfortunately, trying to add an inner limit to this query got me the following error message:

ERROR 1235 (42000): This version of MariaDB doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'

So instead I tried the query from the task description, first with limits, then without:

mysql:research@dbstore1003.eqiad.wmnet [enwiki]> analyze SELECT a.cuc_user_text, a.cuc_ip, a.cuc_agent, MIN(a.cuc_timestamp), MAX(a.cuc_timestamp), COUNT(*) FROM ( ( SELECT cuc_user_text, cuc_ip, cuc_agent, cuc_timestamp FROM cu_changes WHERE cuc_type IN (0,1) AND cuc_user=[redacted] order by cuc_timestamp desc limit 10000) union (SELECT cuc_user_text, cuc_ip, cuc_agent, cuc_timestamp FROM cu_changes WHERE cuc_type IN (0,1) AND cuc_ip_hex='[redacted]' ORDER BY cuc_timestamp desc limit 10000) union (SELECT cuc_user_text, cuc_ip, cuc_agent, cuc_timestamp FROM cu_changes WHERE cuc_type IN (0,1) AND cuc_ip_hex>= '[redacted]' AND cuc_ip_hex <= '[redacted]' order by cuc_timestamp desc limit 10000)) AS a GROUP BY cuc_user_text, cuc_ip, cuc_agent ORDER BY cuc_user_text, cuc_ip, cuc_agent LIMIT 500;
+------+--------------+--------------+-------+--------------------------------+-----------------+---------+-------+----------+-----------+----------+------------+---------------------------------+
| id   | select_type  | table        | type  | possible_keys                  | key             | key_len | ref   | rows     | r_rows    | filtered | r_filtered | Extra                           |
+------+--------------+--------------+-------+--------------------------------+-----------------+---------+-------+----------+-----------+----------+------------+---------------------------------+
|    1 | PRIMARY      | <derived2>   | ALL   | NULL                           | NULL            | NULL    | NULL  |    20001 |  19800.00 |   100.00 |     100.00 | Using temporary; Using filesort |
|    2 | DERIVED      | cu_changes   | ref   | cuc_user_time,cuc_user_ip_time | cuc_user_time   | 4       | const |        1 |      0.00 |   100.00 |     100.00 | Using where                     |
|    3 | UNION        | cu_changes   | ref   | cuc_ip_hex_time                | cuc_ip_hex_time | 258     | const |   186000 |  45000.00 |   100.00 |      22.23 | Using where                     |
|    4 | UNION        | cu_changes   | index | cuc_ip_hex_time                | cuc_timestamp   | 16      | NULL  | 16500000 | 900000.00 |     2.34 |       1.10 | Using where                     |
| NULL | UNION RESULT | <union2,3,4> | ALL   | NULL                           | NULL            | NULL    | NULL  |     NULL |  19800.00 |     NULL |       NULL |                                 |
+------+--------------+--------------+-------+--------------------------------+-----------------+---------+-------+----------+-----------+----------+------------+---------------------------------+
5 rows in set (1.65 sec)

mysql:research@dbstore1003.eqiad.wmnet [enwiki]> analyze SELECT a.cuc_user_text, a.cuc_ip, a.cuc_agent, MIN(a.cuc_timestamp), MAX(a.cuc_timestamp), COUNT(*) FROM ( ( SELECT cuc_user_text, cuc_ip, cuc_agent, cuc_timestamp FROM cu_changes WHERE cuc_type IN (0,1) AND cuc_user=[redacted] order by cuc_timestamp desc) union (SELECT cuc_user_text, cuc_ip, cuc_agent, cuc_timestamp FROM cu_changes WHERE cuc_type IN (0,1) AND cuc_ip_hex='[redacted]' ORDER BY cuc_timestamp desc ) union (SELECT cuc_user_text, cuc_ip, cuc_agent, cuc_timestamp FROM cu_changes WHERE cuc_type IN (0,1) AND cuc_ip_hex>= '[redacted]' AND cuc_ip_hex <= '[redacted]' order by cuc_timestamp desc )) AS a GROUP BY cuc_user_text, cuc_ip, cuc_agent ORDER BY cuc_user_text, cuc_ip, cuc_agent LIMIT 500;
+------+--------------+--------------+-------+--------------------------------+-----------------+---------+-------+--------+-----------+----------+------------+------------------------------------+
| id   | select_type  | table        | type  | possible_keys                  | key             | key_len | ref   | rows   | r_rows    | filtered | r_filtered | Extra                              |
+------+--------------+--------------+-------+--------------------------------+-----------------+---------+-------+--------+-----------+----------+------------+------------------------------------+
|    1 | PRIMARY      | <derived2>   | ALL   | NULL                           | NULL            | NULL    | NULL  | 570000 | 200000.00 |   100.00 |     100.00 | Using temporary; Using filesort    |
|    2 | DERIVED      | cu_changes   | ref   | cuc_user_time,cuc_user_ip_time | cuc_user_time   | 4       | const |      1 |      0.00 |   100.00 |     100.00 | Using where                        |
|    3 | UNION        | cu_changes   | ref   | cuc_ip_hex_time                | cuc_ip_hex_time | 258     | const | 186000 | 106000.00 |   100.00 |      19.79 | Using index condition; Using where |
|    4 | UNION        | cu_changes   | range | cuc_ip_hex_time                | cuc_ip_hex_time | 258     | NULL  | 387000 | 189000.00 |   100.00 |      96.28 | Using index condition; Using where |
| NULL | UNION RESULT | <union2,3,4> | ALL   | NULL                           | NULL            | NULL    | NULL  |   NULL | 200000.00 |     NULL |       NULL |                                    |
+------+--------------+--------------+-------+--------------------------------+-----------------+---------+-------+--------+-----------+----------+------------+------------------------------------+
5 rows in set (3.81 sec)

I'm honestly pretty shocked that this query completed in only 4 seconds. It didn't seem to be related to a warm cache, because running the first query again still took over a minute, and running it with different IPs also took 4 seconds. So... maybe the query in the task description is actually quite fast?

I played around some more, and decided to throw the worst at it that I could come up with. With a user with ~300k rows, an IP address with ~300k rows, and an IP range with ~800k rows (so 1.4M rows scanned in total), the query from the task description took 29 seconds. Adding LIMIT 100000 (that's 100k) to each of those subqueries took it down to 8 seconds. Adding a timestamp condition (>= 1 Feb, so 23 days' worth of data) made it 5 seconds, and then setting the limit to 33,333 (so that the limits add up to 100k) got it down to 2 seconds. That seems pretty acceptable. The worst-performing part seems to be IP range scans that return lots of results, because then MySQL is stuck between a rock (using the cuc_timestamp index for sorting, then a manual where for the IP range) and a hard place (using the cuc_ip_hex_time index for the IP range, then a filesort for ordering by timestamp).

Based on all this, I recommend the following: set the limit on each subquery to about 100k/N, where N is the number of subqueries, so that the total number of rows scanned is limited to 100k. To detect whether the limit was reached, you could then run each subquery individually, but 1) adding LIMIT 1 OFFSET K (where K=100k/N, i.e. the limit you used in the real query), so that the query returns 1 row if there's more data and 0 rows if not; and 2) dropping the ORDER BY, which can significantly speed up the query in some cases. A subquery with LIMIT 1 OFFSET 100000 without an ORDER BY takes about 0.3 seconds in the situation where there really are more than 100k rows, which isn't great, but in that situation your main query probably takes 2-3 seconds anyway.
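
A minimal sketch of that follow-up check for one subquery, with illustrative values (here K = 33333, i.e. the 100k budget split across three subqueries):

-- Returns one row if this target has more than K matching rows, zero rows otherwise;
-- no ORDER BY, so the rows can be read without a sort.
SELECT 1 FROM cu_changes
WHERE cuc_type IN (0,1) AND cuc_user = <user id>
LIMIT 1 OFFSET 33333;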

@Catrope wow thank you for all of that! :)

dropping the ORDER BY, which can significantly speed up the query in some cases.

Do you think the LIMIT on the main subquery is worth the cost of the ORDER BY on the subquery?

I tested it and it resulted in one less filesort, but no real difference in execution time. What happens with the query you were using above?

Compare (with limit and order by):

id | select_type | table | type | possible_keys | key | key_len | ref | rows | r_rows | filtered | r_filtered | Extra
1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 10550 | 5019.00 | 100.00 | 100.00 | Using temporary; Using filesort
2 | DERIVED | cu_changes | range | cuc_user_ip_time | cuc_user_ip_time | 4 | NULL | 550 | 540.00 | 100.00 | 100.00 | Using index condition; Using where; Using filesort
3 | UNION | cu_changes | ref | cuc_ip_hex_time | cuc_ip_hex_time | 258 | const | 16676 | 5009.00 | 100.00 | 99.82 | Using where
4 | UNION | cu_changes | range | cuc_ip_hex_time | cuc_ip_hex_time | 258 | NULL | 16676 | 5009.00 | 100.00 | 99.82 | Using where
NULL | UNION RESULT | <union2,3,4> | ALL | NULL | NULL | NULL | NULL | NULL | 5019.00 | NULL | NULL |

To (no limit and no order by):

id | select_type | table | type | possible_keys | key | key_len | ref | rows | r_rows | filtered | r_filtered | Extra
1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 33902 | 8772.00 | 100.00 | 100.00 | Using temporary; Using filesort
2 | DERIVED | cu_changes | range | cuc_user_ip_time | cuc_user_ip_time | 4 | NULL | 550 | 549.00 | 100.00 | 98.36 | Using index condition; Using where
3 | UNION | cu_changes | ref | cuc_ip_hex_time | cuc_ip_hex_time | 258 | const | 16676 | 8767.00 | 100.00 | 99.90 | Using index condition; Using where
4 | UNION | cu_changes | range | cuc_ip_hex_time | cuc_ip_hex_time | 258 | NULL | 16676 | 8767.00 | 100.00 | 99.90 | Using index condition; Using where
NULL | UNION RESULT | <union2,3,4> | ALL | NULL | NULL | NULL | NULL | NULL | 8772.00 | NULL | NULL |

I suppose it might make a bigger difference if that first subquery resulted in a more expensive filesort.

@Catrope wow thank you for all of that! :)

dropping the ORDER BY, which can significantly speed up the query in some cases.

Do you think the LIMIT on the main subquery is worth the cost of the ORDER BY on the subquery?

Yes. In my pathological example (which I realize is pathological, but still), adding a LIMIT reduced the query run time from 29 seconds to ~2 seconds. You are right though that the ORDER BY does make the query slower sometimes, it's just that in extreme cases the un-limited-ness makes it slower still.

I tested it and it resulted in one less filesort, but no real difference in execution time. What happens with the query you were using above?

Compare (with limit and order by):

id | select_type | table | type | possible_keys | key | key_len | ref | rows | r_rows | filtered | r_filtered | Extra
1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 10550 | 5019.00 | 100.00 | 100.00 | Using temporary; Using filesort
2 | DERIVED | cu_changes | range | cuc_user_ip_time | cuc_user_ip_time | 4 | NULL | 550 | 540.00 | 100.00 | 100.00 | Using index condition; Using where; Using filesort
3 | UNION | cu_changes | ref | cuc_ip_hex_time | cuc_ip_hex_time | 258 | const | 16676 | 5009.00 | 100.00 | 99.82 | Using where
4 | UNION | cu_changes | range | cuc_ip_hex_time | cuc_ip_hex_time | 258 | NULL | 16676 | 5009.00 | 100.00 | 99.82 | Using where
NULL | UNION RESULT | <union2,3,4> | ALL | NULL | NULL | NULL | NULL | NULL | 5019.00 | NULL | NULL |

To (no limit and no order by):

id | select_type | table | type | possible_keys | key | key_len | ref | rows | r_rows | filtered | r_filtered | Extra
1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 33902 | 8772.00 | 100.00 | 100.00 | Using temporary; Using filesort
2 | DERIVED | cu_changes | range | cuc_user_ip_time | cuc_user_ip_time | 4 | NULL | 550 | 549.00 | 100.00 | 98.36 | Using index condition; Using where
3 | UNION | cu_changes | ref | cuc_ip_hex_time | cuc_ip_hex_time | 258 | const | 16676 | 8767.00 | 100.00 | 99.90 | Using index condition; Using where
4 | UNION | cu_changes | range | cuc_ip_hex_time | cuc_ip_hex_time | 258 | NULL | 16676 | 8767.00 | 100.00 | 99.90 | Using index condition; Using where
NULL | UNION RESULT | <union2,3,4> | ALL | NULL | NULL | NULL | NULL | NULL | 8772.00 | NULL | NULL |

I suppose it might make a bigger difference if that first subquery resulted in a more expensive filesort.

In my case it was #4 (the IP range one) that filesorted, and it was doing a filesort on ~800k rows. In your case, #2 is doing a filesort on 550 rows. Obviously sorting 1000x more things is slower, and in fact it's (slightly) more than 1000x slower because sorting is O(n log n).

@Catrope Thanks so much! That's really helpful!

Since adding UI for specifying the time range needs a discussion, here's a separate task: T246261. We can use a default value for the timestamp condition until then.

Change 575013 had a related patch set uploaded (by Tchanders; owner: Tchanders):
[mediawiki/extensions/CheckUser@master] Improve performance of query for Special:Investigate Compare tab

https://gerrit.wikimedia.org/r/575013

@Niharika Here's a screenshot of the warning on the current patch. How does that look to you?

image.png (228×832 px, 20 KB)

@Tchanders I'd suggest telling them the reason the data is incomplete. Something like "Due to technical limitations we've reached the number of records that can be presented. The data returned for Admin and 1.1.1.1 is incomplete. Please try using fewer targets, smaller time window or narrower IP ranges."
It's more verbose but I think they're more likely to understand this.

Just addressing a misunderstanding about how the improved query works (discussed on https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/CheckUser/+/575013/15): namely, whether there is a separate subquery for each target, plus a separate query per target to check whether its limit has been hit (as in the task description); or a separate subquery per type of target (user, IP, range), plus a separate query per type to check whether its limit has been hit.

Here's an example to illustrate what I think are the benefits to having a separate subquery per target (originally pointed out by @Catrope):

Targets entered: UserA, UserB, UserC, IP1, IP2
Total limit on no. results: 100,000
No. rows in cu_changes:

  • UserA - 100,000
  • UserB - 1,000
  • UserC - 0
  • IP1 - 200,000
  • IP2 - 1,000

Limit-per-target method

No. subqueries: 5
Limit per subquery: 100,000 / 5 = 20,000
No. queries: 1 + 5

Results obtained:
UserA - 20,000 results
UserB - 1,000 results
UserC - 0 results
IP1 - 20,000 results
IP2 - 1,000 results

We get all results for any users with a small number of rows. We also know whether each user has complete or incomplete results and can advise the check user accordingly.

Limit-per-type method

No. subqueries: 2
Limit per subquery: 100,000 / 2 = 50,000
No. queries: 1 + 2

Results obtained:
UserA - 50,000 results
UserB - 0 results
UserC - 0 results
IP1 - 50,000 results
IP2 - 0 results

We don't get any results for UserB, UserC or IP2. Additionally, we don't know whether we're missing results (UserB and IP2 just came after the limit; UserC actually has no rows).
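
To make the two shapes concrete, here is a rough sketch of how the subqueries differ (column lists elided and limits illustrative):

-- Limit-per-target: one subquery per entered target, e.g. for UserA
( SELECT ... FROM cu_changes WHERE cuc_type IN (0,1) AND cuc_user = <UserA> ORDER BY cuc_timestamp DESC LIMIT 20000 ) UNION
-- ...and likewise one subquery each for UserB, UserC, IP1 and IP2, all with LIMIT 20000

-- Limit-per-type: one subquery per type of target
( SELECT ... FROM cu_changes WHERE cuc_type IN (0,1) AND cuc_user IN (<UserA>, <UserB>, <UserC>) ORDER BY cuc_timestamp DESC LIMIT 50000 ) UNION
( SELECT ... FROM cu_changes WHERE cuc_type IN (0,1) AND cuc_ip_hex IN (<IP1>, <IP2>) ORDER BY cuc_timestamp DESC LIMIT 50000 )

In the per-type shape, UserA's 100,000 recent rows fill the user subquery's 50,000 budget before UserB's rows are reached, which is why UserB comes back empty in the example above.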

Change 575013 merged by jenkins-bot:
[mediawiki/extensions/CheckUser@master] Improve performance of query for Special:Investigate Compare tab

https://gerrit.wikimedia.org/r/575013

@Tchanders For the subquery, for example,

SELECT  cuc_user,cuc_user_text,cuc_ip,cuc_ip_hex,cuc_agent,cuc_timestamp  FROM `cu_changes`    WHERE cuc_user = 11 AND cuc_type IN (0,1)   ORDER BY cuc_timestamp LIMIT 100000

should that be

...ORDER BY cuc_timestamp DESC LIMIT 100000

Otherwise, I don't think you are getting the latest edits.

Change 579001 had a related patch set uploaded (by Tchanders; owner: Tchanders):
[mediawiki/extensions/CheckUser@master] CompareService: Order investigation results by descending timestamp

https://gerrit.wikimedia.org/r/579001

@dom_walden Good point - the patch above fixes this.

@Tchanders I am finding that sometimes the number of edits I get for a user differs depending on whether I search for 1 or 2 targets.

I think this is because the UNION is removing duplicate data (i.e. rows with the same user, ip, agent, timestamp), which I have a lot of as an artifact of how I generated my test data.

I don't know how likely this is on production, but I guess it is possible that the same user(+IP+UA) can make two edits within a second of each other.

To be safe, would including cuc_id in the subqueries mean we only remove duplicates that are actually the same row in the DB? Is this going to work the same across the different supported databases (MariaDB, SQLite, Postgres, etc.)?
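
For illustration, selecting the row ID in each subquery makes every row distinct from the UNION's point of view, so only rows that are literally the same cu_changes row would collapse. A sketch of one subquery (the others would change in the same way):

( SELECT cuc_id, cuc_user_text, cuc_ip, cuc_agent, cuc_timestamp
  FROM cu_changes
  WHERE cuc_type IN (0,1) AND cuc_user = <user>
  ORDER BY cuc_timestamp DESC LIMIT <limit> ) UNION ...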

Change 579001 merged by jenkins-bot:
[mediawiki/extensions/CheckUser@master] CompareService: Order investigation results by descending timestamp

https://gerrit.wikimedia.org/r/579001

ARamirez_WMF changed the subtype of this task from "Task" to "Deadline". Mar 13 2020, 6:27 PM

Change 580939 had a related patch set uploaded (by Tchanders; owner: Tchanders):
[mediawiki/extensions/CheckUser@master] CompareService: Select row ID in subquery

https://gerrit.wikimedia.org/r/580939

Great catch @dom_walden - should be fixed by the latest patch.

Change 580939 merged by jenkins-bot:
[mediawiki/extensions/CheckUser@master] CompareService: Select row ID in subquery

https://gerrit.wikimedia.org/r/580939

For each target in the investigation, a subquery is created. The 100000 limit is split evenly across the targets (50000 for 2, 33333 for 3, 25000 for 4, etc.)

For example, for an investigation of user ID 5, IP 1.2.3.4 and IP range 2.3.4.5/16, the SQL looks like:

SELECT a.cuc_user,a.cuc_user_text,a.cuc_ip,a.cuc_ip_hex,a.cuc_agent,MIN(a.cuc_timestamp) AS first_edit,MAX(a.cuc_timestamp) AS last_edit,count(*) AS total_edits
FROM ((SELECT cuc_id,cuc_user,cuc_user_text,cuc_ip,cuc_ip_hex,cuc_agent,cuc_timestamp FROM cu_changes WHERE cuc_user = 5 AND cuc_type IN (0,1) ORDER BY cuc_timestamp DESC LIMIT 33333 )
UNION (SELECT cuc_id,cuc_user,cuc_user_text,cuc_ip,cuc_ip_hex,cuc_agent,cuc_timestamp FROM cu_changes WHERE cuc_ip_hex = '01020304' AND cuc_type IN (0,1) ORDER BY cuc_timestamp DESC LIMIT 33333 )
UNION (SELECT cuc_id,cuc_user,cuc_user_text,cuc_ip,cuc_ip_hex,cuc_agent,cuc_timestamp FROM cu_changes WHERE ((cuc_ip_hex >= '02030000') AND (cuc_ip_hex <='0203FFFF')) AND cuc_type IN (0,1) ORDER BY cuc_timestamp DESC LIMIT 33333 )) a
GROUP BY cuc_user_text,cuc_ip,cuc_agent ORDER BY cuc_user_text,cuc_ip,cuc_agent LIMIT 11;

As far as possible I compared the results to the results from the current CheckUser. The results are not always quite the same because:

  1. current CU has different limits:
    • when searching IPs for a user, it limits results to most recent 5000 edits (but after grouping by ip)
    • when searching users for IP/range, it limits to most recent 10000 edits (without grouping)
  2. new CU doesn't always include the same edits, e.g. we only look at cuc_type=0,1

For similar reasons, doing a performance comparison with the current CU is a bit meaningless, as they have different limits and are doing different things (e.g. the new CU combines searching for usernames and IPs into one query).

So, I did do some comparisons with SQL queries I constructed myself, although these queries mostly looked like the queries Special:Investigate was generating, so they might have the same bugs.

I also constructed a limited set of data where I knew what the expected results should be. I saw no discrepancies between actual and expected results. But this was a limited data set, not necessarily representative of production data.

Bugs in T245499#5960008 and T245499#5963431 have been fixed.