8000 Avoid re-running autogen based on constant cache by aisamanra · Pull Request #5591 · sorbet/sorbet · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Avoid re-running autogen based on constant cache #5591

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Apr 18, 2022

Conversation

aisamanra
Copy link
Contributor
@aisamanra aisamanra commented Apr 6, 2022

Motivation

This uses a constant hash along with a list of changed files to decide whether to run autogen or not. There are two new behaviors in autogen mode that are added here:

  • if --autogen-constant-cache-file is provided when other autogen flags are provided, then it'll compute the constant hash of all input files and pack them into the provided path in a msgpack format, and then runs autogen as usual. There is no way to compute the constant cache file outside of autogen: this simply should never come up in any other code path.
  • if --autogen-constant-cache-file is provided and one or more files are provided with --autogen-changed-files, then before doing any autogen work we'll first compute the constant hash for only those files listed, and if all of them compute to the same constant hash that they map to in the file itself, we simply exit without doing any other work. We should be pretty cautious about this: we will go on to do a full autogen run if any of the following are true:
    • There is no cache file present, or the cache file is unreadable in any way
    • Any of the files listed are not present in the cache file
    • There are more than a fixed number of changes (currently 100)
    • Any of the files are non-readable or have other issues

This more or less adds a "fast path" for autogen that allows us to duck out early if we can show that the intermediate edits would not have affected the output.

This technically also rewrites the constant hash from how it exists. It previously was a recursive function defined on parser::Node values, and now it uses a ShallowTreeMap to work on the post-desugared AST. The problem with the previous one was that working it into the pipeline in order to compute the constant hash of the codebase was very tricky and invasive, since it had to be worked into the middle of the indexOne function during the roughly two-line span of code where we have a parser::Node before it's turned into a desugared tree. That plumbing was invasive and error-prone, so in order to move the computation to the early part of autogen and minimize disruption to the existing path, I've allowed it to operate over our usual AST so it could get computed early in autogen.

Test plan

See included automated tests. Added a CLI test which demonstrates the behavior we want.

@aisamanra
Copy link
Contributor Author

We have a policy of testing changes to Sorbet against Stripe's codebase before
merging them. I've kicked off a test run for the current PR. When the build
finishes, I'll share with you whether or how it failed. Thanks!

Stripe employees can see the build results here:

https://go/builds/bui_LSlUW9sytVYpxY
https://go/builds/bui_LSlUVZvXz3E0BY

@aisamanra aisamanra marked this pull request as ready for review April 8, 2022 01:49
@aisamanra aisamanra requested a review from a team as a code owner April 8, 2022 01:49
@aisamanra aisamanra requested review from jez and removed request for a team April 8, 2022 01:49
@aisamanra aisamanra changed the title [WIP] Avoid re-running autogen based on constant cache Avoid re-running autogen based on constant cache Apr 8, 2022
@aisamanra
Copy link
Contributor Author

I think this almost definitely needs some tweaking before landing, but I want to get it up for review so I can get feedback. The biggest sticking point right now is that @nroman-stripe thinks we should switch it over to using a KVStore instead of the custom-written msgpack format we have here. I'm not opposed to that, but it's a bit subtle—in particular, we can only have one KVStore open at a time, and we currently have an existing one open while we're doing the exit-early check. If we feel strongly that we should use the KVStore instead of this custom format, I can think of ways of tackling this, but I figured it was worth getting eyes on it before starting down that road.

@nroman-stripe
Copy link
Contributor

Just to be clear, I'm not a blocker on that KVStore. It just seemed suited to the task, but I wasn't fully aware of the complications involved.

@jez
Copy link
Collaborator
jez commented Apr 9, 2022

Sorry, I didn't get a chance to look at this today. On the cache thing: could we store the autogen caches on the core::File? We already do this for the LSP file hashes.

@jez
Copy link
Collaborator
jez commented Apr 9, 2022

You can read more about that in these PRs:

#2760
#2762

@jez
Copy link
Collaborator
jez commented Apr 11, 2022

this looks ~fine to me? it is pretty well comtained to not LSP and not the core pipeline and doesn't introduce new external dependencies. Going to hand off review to nroman for the actual implementation review.

@jez jez requested review from ngroman and nroman-stripe and removed request for ngroman April 11, 2022 19:09
Copy link
Contributor
@nroman-stripe nroman-stripe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks legit. Some minor nits and suggestions.

// Returns `true` if the constant hash information provided tells us
// we can exit `autogen` early, and `false` otherwise.
bool autogenCanExitEarly(shared_ptr<spd::logger> &logger, options::Options &opts) {
if (opts.autogenConstantCacheFile.empty()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a timer for this

// the output of autogen. This means we can always be conservative: it's okay if this returns `false` in places
// where it could return `true`, because that'll be correct but slower.
static bool canSkipAutogen(core::GlobalState &gs, std::string_view cachePath,
std::vector<std::string> &changedFiles);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about? (Could be nice on the API for some of these other funcs as well.

Suggested change
std::vector<std::string> &changedFiles);
const std::vector<std::string> &changedFiles);

_constantHashMap.emplace(path, hash);
};

std::string pack();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

std::string pack();
std::string pack() const;
AutogenCache(UnorderedMap<std::string, unsigned int> constantHashMap) : _constantHashMap(constantHashMap){};
AutogenCache() = default;

void add(std::string path, unsigned int hash) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a big deal, but I it's more idiomatic in this codebase to use uint32_t

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this mostly because it was what the hash code was already doing and it felt consistent, c.f. https://github.com/sorbet/sorbet/blob/master/core/hashing/hashing.h#L10-L14

std::vector<std::string> &changedFiles);

static AutogenCache unpackForFiles(std::string_view path, UnorderedSet<std::string> &changedFiles);
static AutogenCache merge(std::vector<AutogenCache> &caches) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see this being used anywhere

@@ -1259,4 +1261,45 @@ bool cacheTreesAndFiles(const core::GlobalState &gs, WorkerPool &workers, vector
return written;
}

vector<ast::ParsedFile> autogenCacheFiles(const core::GlobalState &gs, string_view cachePath,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a better name for this would be something like autogenWriteCacheFile or something similarly indicative.

vector<ast::ParsedFile> autogenCacheFiles(const core::GlobalState &gs, string_view cachePath,
vector<ast::ParsedFile> what, WorkerPool &workers) {
#ifndef SORBET_REALMAIN_MIN
Timer timeit(gs.tracer(), "autogenConstantCache");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

less confusing if this matches the func name

AutogenCache(UnorderedMap<std::string, unsigned int> constantHashMap) : _constantHashMap(constantHashMap){};
AutogenCache() = default;

void add(std::string path, unsigned int hash) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about this API using a FileRef instead of string for path? Then pack() could take a GS and stream the strings into the buffer at the last second. I realize that slightly complicates things when reading for the can skip check, but I think in that case since we never re-use or pack that cache it could stay as a UnorderedMap<string, unint32_t>

AutogenCache() = default;

void add(std::string path, unsigned int hash) {
_constantHashMap.emplace(path, hash);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the name of densive programming should this ping a thread_id and ENFORCE that add's come from it?

@jez jez removed their request for review April 13, 2022 17:33
aisamanra and others added 3 commits April 13, 2022 14:41
Copy link
Contributor
@nroman-stripe nroman-stripe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@aisamanra aisamanra merged commit a10576f into master Apr 18, 2022
@aisamanra aisamanra deleted the gdritter/autogen-constant-cache branch April 18, 2022 17:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
0