8000 create-index should run analyze after creating index · Issue #365 · simonw/sqlite-utils · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

create-index should run analyze after creating index #365

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
fgregg opened this issue Jan 7, 2022 · 16 comments
Closed

create-index should run analyze after creating index #365

fgregg opened this issue Jan 7, 2022 · 16 comments
Labels
Milestone

Comments

@fgregg
Copy link
Contributor
fgregg commented Jan 7, 2022

sqlite's query planner depends upon analyze to make good use of indices. It would be nice if analyze was run as part of the create-index command.

If data is inserted later, things can get out date, but it would still probably be a net win.

@simonw
Copy link
Owner
simonw commented Jan 7, 2022

Relevant documentation: https://www.sqlite.org/lang_analyze.html

@simonw
Copy link
Owner
simonw commented Jan 7, 2022

I've not used the ANALYZE feature in SQLite at all before. Should probably add Python library methods for it.

Annoyingly I use the word "analyze" to mean something else in the CLI - for these features:

@fgregg
Copy link
Contributor Author
fgregg commented Jan 7, 2022

i added an index to one table with sqlite-utils, and then a query that used to take about 1 second started taking hundreds of seconds.

running analyze got me back to sub second speed.

@simonw
Copy link
Owner
simonw commented Jan 7, 2022

After implementing #366 I can make it so sqlite-utils create-index automatically runs db.analyze(index_name) afterwards, maybe with a --no-analyze option in case anyone wants to opt out of that for specific performance reasons.

@simonw
Copy link
Owner
simonw commented Jan 7, 2022

Or I could leave off --no-analyze and tell people that if they want to add an index without running analyze they can execute the CREATE INDEX themselves.

@simonw
Copy link
Owner
simonw commented Jan 8, 2022

The one thing that worries me a little bit about doing this by default is that it adds a surprising new table to the database - it may be confusing to users if they run create-index and their database suddenly has a new sqlite_stat1 table, see #366 (comment)

Options here are:

  • Do it anyway. People can tolerate a surprise table appearing when they create an index.
  • Only run ANALYZE if the user says sqlite-utils create-index ... --analyze
  • Use the --analyze option, but also automatically run ANALYZE if they create an index and the database they are working with already has a sqlite_stat1 table

I'm currently leading towards that third option - @fgregg any thoughts?

@simonw simonw closed this as completed Jan 8, 2022
@simonw simonw reopened this Jan 8, 2022
@fgregg
Copy link
Contributor Author
fgregg commented Jan 8, 2022

for options 2 and 3, i would worry about discoverablity.

in other db’s it is not necessary to explicitly call analyze for most indices. ie for postgres

The system regularly collects statistics on all of a table's columns. Newly-created non-expression indexes can immediately use these statistics to determine an index's usefulness.

i suppose i would propose raising a warning if the stats table is created that explains what is going on and informs users about a —no-analyze argument.

@simonw
Copy link
Owner
simonw commented Jan 8, 2022

Is there a downside to having a sqlite_stat1 table if it has wildly incorrect statistics in it?

Imagine the following sequence of events:

  • User imports a few records, creating the table, using sqlite-utils insert
  • User runs sqlite-utils create-index ... which also creates and populates the sqlite_stat1 table
  • User runs insert again to populate several million new records

The user now has a database file with several million records and a statistics table that is wildly out of date, having been populated when they only had a few.

Will this result in surprisingly bad query performance compared to it that statistics table did not exist at all?

If so, I lean much harder towards ANALYZE as a strictly opt-in optimization, maybe with the --analyze option added to sqlite-utils insert top to help users opt in to updating their statistics after running big inserts.

@simonw
Copy link
Owner
simonw commented Jan 8, 2022

The reason I'm hesitating on this is that I've not actually used ANALYZE at all in nearly five years of messing around with SQLite! So I'm nervous that there are surprise downsides I haven't thought of.

My hunch is that ANALYZE is only worth worrying about on much larger databases, in which case I'm OK supporting it as a thoroughly documented power-user feature rather than a default.

@fgregg
Copy link
Contributor Author
fgregg commented Jan 8, 2022

the table with the query ran so bad was about 50k.

i think the scenario should not be worse than no stats.

i also did not know that sqlite was so different from postgres and needed an explicit analyze call.

@fgregg
Copy link
Contributor Author
fgregg commented Jan 8, 2022

the out-of-date scenario you describe could be addressed by automatically adding an analyze to the insert or convert commands if they implicate an index

@fgregg
Copy link
Contributor Author
fgregg commented Jan 8, 2022

or using “ pragma optimize”

@simonw
Copy link
Owner
simonw commented Jan 9, 2022

Found one report on Stack Overflow from 9 years ago of someone seeing broken performance after running ANALYZE, hard to say that's a trend and not a single weird edge-case though! https://stackoverflow.com/q/12947214/6083

@fgregg
Copy link
Contributor Author
fgregg commented Jan 9, 2022

i don’t want to be such a partisan for analyze, but the query planner deciding not to use an index based on information collected by analyze is not necessarily a bug, but could be the correct choice.

the original poster in that stack overflow doesn’t say there’s a performance regression

@simonw simonw added this to the 3.21 milestone Jan 10, 2022
@simonw simonw closed this as completed in 1b84c17 Jan 11, 2022
@simonw
Copy link
Owner
simonw commented Jan 11, 2022

I decided to go with making this opt-in, mainly for consistency with the other places where I added this feature - see:

You can now run the following:

sqlite-utils create-index mydb.db mytable mycolumn --analyze

And ANALYZE will be run on the index once it has been created.

@fgregg
Copy link
Contributor Author
fgregg commented Jan 11, 2022

thanks so much! always a pleasure to see how you work through these things

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants
0