Additional tweaks to improve performance of household inference by dehall · Pull Request #57 · mitre/data-owner-tools · GitHub

Additional tweaks to improve performance of household inference #57

Merged: dehall merged 1 commit into master from household_perf_again on Apr 18, 2023

@dehall (Collaborator) commented Apr 18, 2023

This PR introduces yet more tweaks to improve performance of household inference in households.py.

Summary:

  • Only read the PII columns that are actually required
  • Keep the "exploded address" columns in a separate DataFrame so we can delete the whole thing once those columns are no longer needed
  • Dump the household matched pairs to a file so that we can restart from there if the process runs out of memory
  • Add a --pairsfile arg that specifies an existing pairs file to restart from
  • Write the household_pii and mapping files at the same time, rather than accumulating the household_pii rows in an array to write later
  • Add some additional debug statements
  • Keep using MultiIndexes wherever possible rather than converting to lists of tuples because the performance seems to be better all around
  • Add the [extras] to the textdistance dependency because it includes additional libraries that speed up, e.g., Jaro-Winkler. In my testing this was about a 25% speedup.
  • Delete objects as soon as they are no longer needed and aggressively GC afterwards

@dehall dehall merged commit f6d11fc into master Apr 18, 2023
@dehall dehall deleted the household_perf_again branch April 18, 2023 18:15