-
Notifications
You must be signed in to change notification settings - Fork 565
Avoid re-running autogen based on constant cache #5591
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
We have a policy of testing changes to Sorbet against Stripe's codebase before Stripe employees can see the build results here: → https://go/builds/bui_LSlUW9sytVYpxY |
I think this almost definitely needs some tweaking before landing, but I want to get it up for review so I can get feedback. The biggest sticking point right now is that @nroman-stripe thinks we should switch it over to using a |
Just to be clear, I'm not a blocker on that KVStore. It just seemed suited to the task, but I wasn't fully aware of the complications involved. |
Sorry, I didn't get a chance to look at this today. On the cache thing: could we store the autogen caches on the core::File? We already do this for the LSP file hashes. |
this looks ~fine to me? it is pretty well comtained to not LSP and not the core pipeline and doesn't introduce new external dependencies. Going to hand off review to nroman for the actual implementation review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks legit. Some minor nits and suggestions.
// Returns `true` if the constant hash information provided tells us | ||
// we can exit `autogen` early, and `false` otherwise. | ||
bool autogenCanExitEarly(shared_ptr<spd::logger> &logger, options::Options &opts) { | ||
if (opts.autogenConstantCacheFile.empty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a timer for this
main/autogen/cache.h
Outdated
// the output of autogen. This means we can always be conservative: it's okay if this returns `false` in places | ||
// where it could return `true`, because that'll be correct but slower. | ||
static bool canSkipAutogen(core::GlobalState &gs, std::string_view cachePath, | ||
std::vector<std::string> &changedFiles); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about? (Could be nice on the API for some of these other funcs as well.
std::vector<std::string> &changedFiles); | |
const std::vector<std::string> &changedFiles); |
main/autogen/cache.h
Outdated
_constantHashMap.emplace(path, hash); | ||
}; | ||
|
||
std::string pack(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::string pack(); | |||
std::string pack() const; |
AutogenCache(UnorderedMap<std::string, unsigned int> constantHashMap) : _constantHashMap(constantHashMap){}; | ||
AutogenCache() = default; | ||
|
||
void add(std::string path, unsigned int hash) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a big deal, but I it's more idiomatic in this codebase to use uint32_t
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did this mostly because it was what the hash code was already doing and it felt consistent, c.f. https://github.com/sorbet/sorbet/blob/master/core/hashing/hashing.h#L10-L14
main/autogen/cache.h
Outdated
std::vector<std::string> &changedFiles); | ||
|
||
static AutogenCache unpackForFiles(std::string_view path, UnorderedSet<std::string> &changedFiles); | ||
static AutogenCache merge(std::vector<AutogenCache> &caches) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see this being used anywhere
main/pipeline/pipeline.cc
Outdated
@@ -1259,4 +1261,45 @@ bool cacheTreesAndFiles(const core::GlobalState &gs, WorkerPool &workers, vector | |||
return written; | |||
} | |||
|
|||
vector<ast::ParsedFile> autogenCacheFiles(const core::GlobalState &gs, string_view cachePath, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a better name for this would be something like autogenWriteCacheFile
or something similarly indicative.
main/pipeline/pipeline.cc
Outdated
vector<ast::ParsedFile> autogenCacheFiles(const core::GlobalState &gs, string_view cachePath, | ||
vector<ast::ParsedFile> what, WorkerPool &workers) { | ||
#ifndef SORBET_REALMAIN_MIN | ||
Timer timeit(gs.tracer(), "autogenConstantCache"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
less confusing if this matches the func name
AutogenCache(UnorderedMap<std::string, unsigned int> constantHashMap) : _constantHashMap(constantHashMap){}; | ||
AutogenCache() = default; | ||
|
||
void add(std::string path, unsigned int hash) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about this API using a FileRef
instead of string for path? Then pack()
could take a GS and stream the strings into the buffer at the last second. I realize that slightly complicates things when reading for the can skip check, but I think in that case since we never re-use or pack that cache it could stay as a UnorderedMap<string, unint32_t>
AutogenCache() = default; | ||
|
||
void add(std::string path, unsigned int hash) { | ||
_constantHashMap.emplace(path, hash); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in the name of densive programming should this ping a thread_id and ENFORCE that add's come from it?
Co-authored-by: Nathaniel Roman <nroman@stripe.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Motivation
This uses a constant hash along with a list of changed files to decide whether to run autogen or not. There are two new behaviors in autogen mode that are added here:
--autogen-constant-cache-file
is provided when other autogen flags are provided, then it'll compute the constant hash of all input files and pack them into the provided path in a msgpack format, and then runs autogen as usual. There is no way to compute the constant cache file outside of autogen: this simply should never come up in any other code path.--autogen-constant-cache-file
is provided and one or more files are provided with--autogen-changed-files
, then before doing any autogen work we'll first compute the constant hash for only those files listed, and if all of them compute to the same constant hash that they map to in the file itself, we simply exit without doing any other work. We should be pretty cautious about this: we will go on to do a full autogen run if any of the following are true:This more or less adds a "fast path" for autogen that allows us to duck out early if we can show that the intermediate edits would not have affected the output.
This technically also rewrites the constant hash from how it exists. It previously was a recursive function defined on
parser::Node
values, and now it uses aShallowTreeMap
to work on the post-desugared AST. The problem with the previous one was that working it into the pipeline in order to compute the constant hash of the codebase was very tricky and invasive, since it had to be worked into the middle of theindexOne
function during the roughly two-line span of code where we have aparser::Node
before it's turned into a desugared tree. That plumbing was invasive and error-prone, so in order to move the computation to the early part of autogen and minimize disruption to the existing path, I've allowed it to operate over our usual AST so it could get computed early in autogen.Test plan
See included automated tests. Added a CLI test which demonstrates the behavior we want.