Research how much of a difference analyze / sqlite_stat1 makes · Issue #369 · simonw/sqlite-utils · GitHub

Research how much of a difference analyze / sqlite_stat1 makes #369


Closed
simonw opened this issue Jan 9, 2022 · 11 comments
Labels
research, wontfix (This will not be worked on)

Comments

simonw commented Jan 9, 2022

Is there a downside to having a sqlite_stat1 table if it has wildly incorrect statistics in it?

Originally posted by @simonw in #365 (comment)

More generally: how much of a difference does the sqlite_stat1 table created by ANALYZE make to queries?

I'm particularly interested in group by / count(*) queries, since Datasette uses those for faceting.

simonw added the research label Jan 9, 2022
simonw commented Jan 9, 2022

I'll start by running some experiments against the 11MB database file from https://global-power-plants.datasettes.com/global-power-plants.db

simonw commented Jan 9, 2022
analyze % sqlite-utils indexes global-power-plants.db -t           
table                           index_name                                           seqno    cid  name            desc  coll      key
------------------------------  -------------------------------------------------  -------  -----  ------------  ------  ------  -----
global-power-plants             "global-power-plants_owner"                              0     12  owner              0  BINARY      1
global-power-plants             "global-power-plants_country_long"                       0      1  country_long       0  BINARY      1
global-power-plants_fts_idx     sqlite_autoindex_global-power-plants_fts_idx_1           0      0  segid              0  BINARY      1
global-power-plants_fts_idx     sqlite_autoindex_global-power-plants_fts_idx_1           1      1  term               0  BINARY      1
global-power-plants_fts_config  sqlite_autoindex_global-power-plants_fts_config_1        0      0  k                  0  BINARY      1

simonw commented Jan 9, 2022
analyze % sqlite-utils global-power-plants-analyzed.db 'analyze'
[{"rows_affected": -1}]
analyze % sqlite-utils tables global-power-plants-analyzed.db 
[{"table": "global-power-plants"},
 {"table": "global-power-plants_fts"},
 {"table": "global-power-plants_fts_data"},
 {"table": "global-power-plants_fts_idx"},
 {"table": "global-power-plants_fts_docsize"},
 {"table": "global-power-plants_fts_config"},
 {"table": "sqlite_stat1"}]
analyze % sqlite-utils rows global-power-plants-analyzed.db sqlite_stat1 -t
tbl                              idx                                 stat
-------------------------------  ----------------------------------  ---------
global-power-plants_fts_config   global-power-plants_fts_config      1 1
global-power-plants_fts_docsize                                      33643
global-power-plants_fts_idx      global-power-plants_fts_idx         199 40 1
global-power-plants_fts_data                                         136
global-power-plants              "global-power-plants_owner"         33643 4
global-power-plants              "global-power-plants_country_long"  33643 202

simonw commented Jan 9, 2022

Basically no difference using this very basic benchmark:

analyze % python3 -m timeit '__import__("sqlite3").connect("global-power-plants.db").execute("select country_long, count(*) from [global-power-plants] group by country_long").fetchall()'
100 loops, best of 5: 2.39 msec per loop
analyze % python3 -m timeit '__import__("sqlite3").connect("global-power-plants-analyzed.db").execute("select country_long, count(*) from [global-power-plants] group by country_long").fetchall()'
100 loops, best of 5: 2.38 msec per loop

I should try this against a much larger database.

https://covid-19.datasettes.com/covid.db is 879MB.
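The timeit one-liners above can be wrapped in a small harness. This is only a sketch against a generated toy database (the `counties` table, `idx_fips` index, and file names here are hypothetical stand-ins, not the real covid.db / covid-analyzed.db pair):

```python
# Compare the same group-by query against an analyzed and a non-analyzed
# copy of an otherwise identical database (all names here are hypothetical).
import os
import sqlite3
import tempfile
import timeit

def build_db(path, analyze):
    db = sqlite3.connect(path)
    db.execute("create table counties (fips text, cases int)")
    db.execute("create index idx_fips on counties (fips)")
    db.executemany(
        "insert into counties values (?, ?)",
        ((f"{i % 500:05d}", i) for i in range(50_000)),
    )
    if analyze:
        db.execute("analyze")  # populates sqlite_stat1
    db.commit()
    db.close()

def bench(path):
    # Same shape as the timeit one-liners: connect, group by, fetchall.
    stmt = (
        f'__import__("sqlite3").connect({path!r}).execute('
        '"select fips, count(*) from counties group by fips").fetchall()'
    )
    return min(timeit.repeat(stmt, number=5, repeat=3)) / 5

tmp = tempfile.mkdtemp()
plain = os.path.join(tmp, "plain.db")
analyzed = os.path.join(tmp, "analyzed.db")
build_db(plain, analyze=False)
build_db(analyzed, analyze=True)
print(f"plain:    {bench(plain):.5f}s per query")
print(f"analyzed: {bench(analyzed):.5f}s per query")
```

On a query that already has an obvious covering index, the two timings should come out close, which matches the results above.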

simonw commented Jan 9, 2022

Didn't manage to spot a meaningful difference with that database either:

analyze % python3 -m timeit '__import__("sqlite3").connect("covid.db").execute("select fips, count(*) from [ny_times_us_counties] group by fips").fetchall()'         
2 loops, best of 5: 101 msec per loop
analyze % python3 -m timeit '__import__("sqlite3").connect("covid-analyzed.db").execute("select fips, count(*) from [ny_times_us_counties] group by fips").fetchall()'
2 loops, best of 5: 103 msec per loop

Maybe select fips, count(*) from [ny_times_us_counties] group by fips isn't a good query for testing this?

simonw commented Jan 9, 2022

There are some clues as to what effect ANALYZE has in https://www.sqlite.org/optoverview.html

Some quotes:

SQLite might use a skip-scan on an index if it knows that the first one or more columns contain many duplication values. If there are too few duplicates in the left-most columns of the index, then it would be faster to simply step ahead to the next value, and thus do a full table scan, than to do a binary search on an index to locate the next left-column value.

The only way that SQLite can know that there are many duplicates in the left-most columns of an index is if the ANALYZE command has been run on the database. Without the results of ANALYZE, SQLite has to guess at the "shape" of the data in the table, and the default guess is that there are an average of 10 duplicates for every value in the left-most column of the index. Skip-scan only becomes profitable (it only gets to be faster than a full table scan) when the number of duplicates is about 18 or more. Hence, a skip-scan is never used on a database that has not been analyzed.

And

Join reordering is automatic and usually works well enough that programmers do not have to think about it, especially if ANALYZE has been used to gather statistics about the available indexes, though occasionally some hints from the programmer are needed.

And

The various sqlite_statN tables contain information on how selective the various indexes are. For example, the sqlite_stat1 table might indicate that an equality constraint on column x reduces the search space to 10 rows on average, whereas an equality constraint on column y reduces the search space to 3 rows on average. In that case, SQLite would prefer to use index ex2i2 since that index is more selective.
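The skip-scan behaviour quoted above can be reproduced in miniature. This sketch uses a hypothetical table and index (not from the issue itself); with 10 distinct values of `a` across 5,000 rows, the left-most column averages ~500 duplicates, far above the ~18-duplicate threshold quoted above:

```python
# Demonstrate that skip-scan only appears once ANALYZE has run
# (hypothetical table `t` and index `i_ab`).
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table t (a int, b int)")
db.execute("create index i_ab on t (a, b)")
db.executemany("insert into t values (?, ?)",
               ((i % 10, i % 100) for i in range(5000)))

def plan(sql):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail)
    return " | ".join(row[3] for row in db.execute("explain query plan " + sql))

q = "select * from t where b = 3"
print("before ANALYZE:", plan(q))  # a plain scan: no stats, so no skip-scan
db.execute("analyze")
print("after ANALYZE: ", plan(q))  # skip-scan: the plan mentions ANY(a)
```

The query constrains only `b`, the second column of the index, so without statistics SQLite has no reason to believe stepping through each distinct `a` value is cheaper than a full scan.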

simonw commented Jan 9, 2022
EXPLAIN QUERY PLAN select country_long, count(*) from [global-power-plants] group by country_long

https://global-power-plants.datasettes.com/global-power-plants?sql=EXPLAIN+QUERY+PLAN+select+country_long%2C+count%28*%29+from+%5Bglobal-power-plants%5D+group+by+country_long

SCAN TABLE global-power-plants USING COVERING INDEX "global-power-plants_country_long"

simonw commented Jan 9, 2022

This is probably too fancy. I think maybe the way to do this is with select * from [global-power-plants] where "country_long" = 'United Kingdom' - then mess around with stats to see if I can get it to use the index or not based on them.

Here's the explain for that: https://global-power-plants.datasettes.com/global-power-plants?sql=EXPLAIN+QUERY+PLAN+select+*+from+[global-power-plants]+where+%22country_long%22+%3D+%27United+Kingdom%27
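One way to mess around with the stats: sqlite_stat1 is an ordinary writable table, so its numbers can be forged by hand and the resulting plan inspected on a fresh connection. A sketch with a hypothetical `plants` table standing in for global-power-plants:

```python
# Forge sqlite_stat1 to make an index look useless, then watch the
# query plan flip from SEARCH to SCAN (hypothetical table and data).
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "plants.db")
db = sqlite3.connect(path)
db.execute("create table plants (country_long text, capacity_mw real)")
db.execute("create index plants_country_long on plants (country_long)")
db.executemany("insert into plants values (?, ?)",
               ((f"country-{i % 200}", float(i)) for i in range(30000)))
db.execute("analyze")  # honest stats: ~150 rows per country, index is useful
db.commit()

def plan(path):
    # A fresh connection re-reads sqlite_stat1 along with the schema.
    rows = sqlite3.connect(path).execute(
        "explain query plan "
        "select * from plants where country_long = 'United Kingdom'")
    return " | ".join(r[3] for r in rows)

print("honest stats:", plan(path))  # SEARCH using the country index

# Forge the stats to claim every row shares one country value: the index
# now looks useless and the planner should fall back to a full scan.
db.execute("update sqlite_stat1 set stat = '30000 30000'")
db.commit()
print("forged stats:", plan(path))  # SCAN
```

The second integer in the stat string is the average number of rows per value in the index's first column, so `'30000 30000'` tells the planner an equality match returns the whole table.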

simonw commented Jan 9, 2022

I think the query that will help solve this is:

explain query plan select * from ny_times_us_counties where state = 1 and county = 2

In this case, the query planner needs to decide if it should use the index for the state column or the index for the county column. That's where the statistics come into play. In particular:

tbl                   idx                              stat
--------------------  -------------------------------  -------------
ny_times_us_counties  idx_ny_times_us_counties_date    2092871 2915
ny_times_us_counties  idx_ny_times_us_counties_fips    2092871 651
ny_times_us_counties  idx_ny_times_us_counties_county  2092871 1085
ny_times_us_counties  idx_ny_times_us_counties_state   2092871 37373

Those numbers are explained by this comment in the SQLite C code: https://github.com/sqlite/sqlite/blob/5622c7f97106314719740098cf0854e7eaa81802/src/analyze.c#L41-L55

** There is normally one row per index, with the index identified by the
** name in the idx column.  The tbl column is the name of the table to
** which the index belongs.  In each such row, the stat column will be
** a string consisting of a list of integers.  The first integer in this
** list is the number of rows in the index.  (This is the same as the
** number of rows in the table, except for partial indices.)  The second
** integer is the average number of rows in the index that have the same
** value in the first column of the index.

So that table is telling us that filtering on a value in the county column will narrow the search to an average of 1,085 rows, whereas filtering on the state column only narrows it to an average of 37,373 rows - so clearly the county index is the better index to use here!

Just one catch: against both my covid.db and my covid-analyzed.db databases the county index is picked for both of them - so SQLite is somehow guessing that county is a better index even though it doesn't have statistics for that.
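That two-index decision can be reproduced in miniature. This is a hypothetical stand-in for ny_times_us_counties, with `state` at 50 distinct values and `county` at 1,000, so the county index is clearly the more selective choice once statistics exist:

```python
# Two competing single-column indexes; after ANALYZE the planner should
# pick the more selective county index (hypothetical table and data).
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table counties (state int, county int)")
db.execute("create index idx_state on counties (state)")
db.execute("create index idx_county on counties (county)")
db.executemany("insert into counties values (?, ?)",
               ((i % 50, i % 1000) for i in range(20000)))

def plan():
    return " | ".join(r[3] for r in db.execute(
        "explain query plan "
        "select * from counties where state = 1 and county = 2"))

print("no stats:  ", plan())
db.execute("analyze")  # stat1: idx_state '20000 400', idx_county '20000 20'
print("with stats:", plan())
```

With statistics, an equality match on county is estimated at 20 rows versus 400 for state, so idx_county wins; without them, both indexes look identical to the planner and the choice falls back to its built-in heuristics, which is presumably why the covid.db plans came out the same.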

simonw commented Feb 3, 2022

Closing this - it was something I was curious about, but evidently not curious enough to actually do the work!

simonw closed this as completed Feb 3, 2022
simonw added the wontfix label Feb 3, 2022