🔨 Update FastAPI People Experts script, refactor and optimize data fetching to handle rate limits #13267
FastAPI People Experts have not been updated in a few months because the script was hitting GitHub API rate limits.
The first step was to separate the logic into different scripts. I had already started on that before, so now contributors, reviewers, translators, and sponsors are computed in separate scripts and GitHub Actions.
After that I was still hitting rate limits. I thought I would need to store the data in an external DB and update it gradually and continuously to avoid the rate limits, but that would have been a big side project just to get this working. 😅
Instead, for this PR, I played a bit with different ways to get the data and limit the results, and also analyzed the distribution of results a bit to optimize the queries.
I discovered that (it seems) GitHub rate limits the GraphQL API based on an estimate of how much it would cost to compute each query in a worst-case scenario, not what it actually takes to compute. So, with a query that would fetch 100 comments per discussion, I was quickly consuming the hourly rate limit. Nevertheless, only a handful of discussions have more than 50 comments. When I updated the query to fetch only the first 50 comments per discussion, that solved the rate limit problem. That's how I concluded that GitHub assigns the rate limit "points" based on the worst-case cost of computing each query, before executing it, so I was charged rate limit points before they were actually used.
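As a reference, one way to see this behavior is to request the `rateLimit` object together with the query, which reports how many points GitHub charged for that request. This is just a minimal sketch to illustrate the idea, not code from the actual scripts (the `httpx` usage, token handling, and exact fields are assumptions):

```python
# Sketch: ask GitHub to report the rate limit cost it charged for a query.
# Not part of the actual scripts; the token handling here is simplified.
import httpx

query = """
query {
  rateLimit {
    cost
    remaining
    resetAt
  }
  repository(owner: "fastapi", name: "fastapi") {
    discussions(first: 100) {
      nodes {
        number
        # Requesting up to 100 comments per discussion drives the estimated
        # (worst-case) cost up, even if most discussions have far fewer.
        comments(first: 100) {
          totalCount
        }
      }
    }
  }
}
"""

response = httpx.post(
    "https://api.github.com/graphql",
    headers={"Authorization": "Bearer <your GitHub token>"},
    json={"query": query},
)
print(response.json()["data"]["rateLimit"])
```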
And all this is just to say that the main thing that solved the problem was setting the query to fetch the first 50 comments instead of the first 100, even though most discussions have fewer than 50.
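For illustration, this is a simplified sketch of what that part of the query looks like (an approximation of the structure, not a copy of the real query):

```python
# Simplified sketch of the relevant part of the discussions query.
discussions_query = """
query ($after: String) {
  repository(owner: "fastapi", name: "fastapi") {
    discussions(first: 100, after: $after) {
      nodes {
        number
        # Before: comments(first: 100), charged for the worst case.
        # Now: only the first 50, which still covers almost all discussions.
        comments(first: 50) {
          totalCount
          nodes {
            author {
              login
            }
          }
        }
      }
    }
  }
}
"""
```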
I also had a temporary snippet to do some quick stats on the fetched data and calculate how many discussions had how many comments:
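Roughly something like this (a reconstruction, not the exact snippet; the variable names and the placeholder data are made up, in the real run the counts came from each fetched discussion's comment total):

```python
# Rough reconstruction of the throwaway stats snippet. The list below is
# placeholder data; in practice it would be each discussion's total number
# of comments taken from the fetched data.
from collections import Counter

comments_per_discussion = [0, 2, 1, 0, 7, 55, 3, 1]  # placeholder, not real results

counts = Counter(comments_per_discussion)
for number_of_comments, number_of_discussions in sorted(counts.items()):
    print(f"{number_of_discussions} discussions have {number_of_comments} comments")

# How many discussions exceed the new page size of 50 comments.
over_50 = sum(
    discussions for comments, discussions in counts.items() if comments > 50
)
print(f"Discussions with more than 50 comments: {over_50}")
```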
This was run in an interactive window, so it's not even a full script.
The results as of the moment of making this PR:
and: