Replace biopython with faster dnaio, buffer writes to reduce memory spikes during renaming by bede · Pull Request #62 · nf-core/detaxizer · GitHub



Open · wants to merge 1 commit into base: master
Conversation

@bede commented May 15, 2025

This PR contains necessary modifications to read renaming in order to use detaxizer with large datasets (50-100GB compressed) on a machine with 128GB of RAM. Pipeline execution time is also dramatically reduced. Please do not feel obliged to merge, but I wanted to share my changes in case they help someone.

  • Use buffered writes to prevent unbounded growth of the read renaming dictionary
  • Use dnaio by @marcelm instead of Biopython for FASTQ parsing, which is >10x faster
  • Change process label to process_low since memory usage is now bounded
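The buffered-write idea in the first bullet can be sketched with stdlib Python alone. This is an illustrative sketch, not the pipeline's actual code: the function name, buffer threshold, and mapping-file format are all assumptions, and the real PR uses dnaio rather than hand-rolled parsing for speed.

```python
import gzip

BUFFER_SIZE = 10_000  # hypothetical flush threshold, not the PR's actual value

def rename_reads(in_fastq: str, out_fastq: str, out_map: str, prefix: str = "read") -> int:
    """Rename FASTQ reads sequentially, flushing buffered output so memory
    stays bounded instead of accumulating every record in a dictionary."""
    buf, map_buf = [], []
    n = 0
    opener = gzip.open if in_fastq.endswith(".gz") else open
    with opener(in_fastq, "rt") as fin, \
         open(out_fastq, "w") as fout, \
         open(out_map, "w") as fmap:
        while True:
            header = fin.readline()
            if not header:
                break  # end of file
            seq, plus, qual = fin.readline(), fin.readline(), fin.readline()
            n += 1
            new_name = f"{prefix}_{n}"
            buf.append(f"@{new_name}\n{seq}{plus}{qual}")
            map_buf.append(f"{header[1:].rstrip()}\t{new_name}\n")
            if len(buf) >= BUFFER_SIZE:  # flush periodically: memory stays bounded
                fout.writelines(buf); buf.clear()
                fmap.writelines(map_buf); map_buf.clear()
        fout.writelines(buf)   # final partial buffer
        fmap.writelines(map_buf)
    return n
```

The key point is that both the renamed reads and the old-to-new name mapping are written out in chunks rather than held in memory for the whole file, so peak memory no longer scales with input size.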

Test profile runs successfully with conda and docker.

Should you consider merging, I will update the changelog and related files accordingly. These modifications appear to be working well for me, but it's possible renaming has regressed, e.g. with respect to blocked gzip support.

@nf-core-bot (Member)

Warning

Newer version of the nf-core template is available.

Your pipeline is using an old version of the nf-core template: 3.0.2.
Please update your pipeline to the latest version.

For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.

@d4straub (Contributor)

Thanks a lot for sharing your improvement! The speed of this step is indeed a bottleneck.
Any improvement, especially one easing this bottleneck, is definitely welcome.
Is blocked gzip not supported by dnaio, or why do you think such a problem might occur?
If dnaio doesn't support blocked gzip but the current solution does (sorry, I have no idea), then it might be better to keep both solutions available?

@marcelm commented May 15, 2025

Since I’ve been tagged: dnaio supports reading concatenated/multiblock gzip files.
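To illustrate what "concatenated/multiblock gzip" means: several independently compressed gzip members joined back to back still form one valid gzip stream, and a multi-member-aware reader (such as dnaio's backend, or Python's own gzip module, used here only for demonstration) decompresses all of them rather than stopping after the first member.

```python
import gzip
import io

# Two independently compressed gzip members, concatenated byte-for-byte.
blocked = gzip.compress(b"@r1\nACGT\n+\nIIII\n") + gzip.compress(b"@r2\nTTTT\n+\nIIII\n")

# A multi-member-aware reader returns the full concatenated payload.
with gzip.GzipFile(fileobj=io.BytesIO(blocked)) as f:
    data = f.read()

assert data == b"@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n"
```

Tools like bgzip produce exactly this kind of multi-member stream, which is why a naive single-member decompressor would silently truncate such files.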

@d4straub (Contributor)

> Since I’ve been tagged: dnaio supports reading concatenated/multiblock gzip files.

Great, thanks; so this improvement doesn't need to be kept alongside the old implementation.
I am still a bit concerned about the part "but it's possible renaming has regressed"; I am not a Python person, unfortunately, and cannot really judge.

Unfortunately, all the failing tests are due to the outdated nf-core template; see #60

@d4straub d4straub requested a review from jannikseidelQBiC May 16, 2025 14:07
@bede (Author) commented May 17, 2025

Thanks @d4straub, I mentioned the possibility of regression out of caution, not because I'm aware of any issues. Used with a custom config specifying memory limits, this PR was the first version that completed successfully for all of my test datasets in a single execution, without retries or crashes. If there is appetite to merge, I could also look into applying the prerequisite nf-core template updates, but likely not for another week or two.

@d4straub (Contributor)

It would be very nice to have that improvement in the pipeline. So yes, if you find the time it would be great to update the template and merge the PR (to dev though, not master, at least as a first step).
