Replace biopython with faster dnaio, buffer writes to reduce memory spikes during renaming by bede · Pull Request #62 · nf-core/detaxizer · GitHub



Open · wants to merge 1 commit into base: master
Conversation

@bede commented May 15, 2025

This PR contains necessary modifications to read renaming in order to use detaxizer with large datasets (50-100GB compressed) on a machine with 128GB of RAM. Pipeline execution time is also dramatically reduced. Please do not feel obliged to merge, but I wanted to share my changes in case they help someone.

  • Use buffered writes to prevent unbounded growth of the read renaming dictionary
  • Use dnaio by @marcelm instead of Biopython for FASTQ parsing, which is >10x faster
  • Change process label to process_low since memory usage is now bounded
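The buffered-write idea in the first bullet can be sketched with stdlib Python alone. This is an illustrative sketch, not the pipeline's actual code: the function name, buffer threshold, and mapping-file format are all assumptions, and the real PR uses dnaio rather than hand-rolled parsing for speed.

```python
import gzip

BUFFER_SIZE = 10_000  # hypothetical flush threshold, not the PR's actual value

def rename_reads(in_fastq: str, out_fastq: str, out_map: str, prefix: str = "read") -> int:
    """Rename FASTQ reads sequentially, flushing buffered output so memory
    stays bounded instead of accumulating every record in a dictionary."""
    buf, map_buf = [], []
    n = 0
    opener = gzip.open if in_fastq.endswith(".gz") else open
    with opener(in_fastq, "rt") as fin, \
         open(out_fastq, "w") as fout, \
         open(out_map, "w") as fmap:
        while True:
            header = fin.readline()
            if not header:
                break  # end of file
            seq, plus, qual = fin.readline(), fin.readline(), fin.readline()
            n += 1
            new_name = f"{prefix}_{n}"
            buf.append(f"@{new_name}\n{seq}{plus}{qual}")
            map_buf.append(f"{header[1:].rstrip()}\t{new_name}\n")
            if len(buf) >= BUFFER_SIZE:  # flush periodically: memory stays bounded
                fout.writelines(buf); buf.clear()
                fmap.writelines(map_buf); map_buf.clear()
        fout.writelines(buf)   # final partial buffer
        fmap.writelines(map_buf)
    return n
```

The key point is that both the renamed reads and the old-to-new name mapping are written out in chunks rather than held in memory for the whole file, so peak memory no longer scales with input size.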

Test profile runs successfully with conda and docker.

Should you consider merging, I will update the changelog and related files accordingly. These modifications appear to be working well for me, but it's possible renaming has regressed, e.g. with respect to blocked gzip support.

@nf-core-bot (Member)

Warning

Newer version of the nf-core template is available.

Your pipeline is using an old version of the nf-core template: 3.0.2.
Please update your pipeline to the latest version.

For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.

@d4straub (Contributor)

Thanks a lot for sharing your improvement! The speed of this step is indeed a bottleneck.
Any improvement, especially one easing this bottleneck, is definitely welcome.
Is blocked gzip not supported by dnaio, or why do you think such a problem might occur?
If dnaio doesn't support blocked gzip but the current solution does (sorry, I have no idea), then it might be better to keep both solutions available?

@marcelm commented May 15, 2025

Since I’ve been tagged: dnaio supports reading concatenated/multiblock gzip files.
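To illustrate what "concatenated/multiblock gzip" means: several independently compressed gzip members joined back to back still form one valid gzip stream, and a multi-member-aware reader (such as dnaio's backend, or Python's own gzip module, used here only for demonstration) decompresses all of them rather than stopping after the first member.

```python
import gzip
import io

# Two independently compressed gzip members, concatenated byte-for-byte.
blocked = gzip.compress(b"@r1\nACGT\n+\nIIII\n") + gzip.compress(b"@r2\nTTTT\n+\nIIII\n")

# A multi-member-aware reader returns the full concatenated payload.
with gzip.GzipFile(fileobj=io.BytesIO(blocked)) as f:
    data = f.read()

assert data == b"@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n"
```

Tools like bgzip produce exactly this kind of multi-member stream, which is why a naive single-member decompressor would silently truncate such files.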

@d4straub (Contributor)

> Since I’ve been tagged: dnaio supports reading concatenated/multiblock gzip files.

Great, thanks; so this improvement doesn't need to be kept alongside the old implementation.
I am still a bit concerned about the part "but it's possible renaming has regressed"; I am not a Python person, unfortunately, and cannot really judge.

Unfortunately, all the failing tests are due to the outdated nf-core template; see #60

@d4straub d4straub requested a review from jannikseidelQBiC May 16, 2025 14:07
@bede (Author) commented May 17, 2025

Thanks @d4straub, I mentioned the possibility of regression out of caution, not because I'm aware of any issues. Used with a custom config specifying memory limits, this PR was the first version that completed successfully for all of my test datasets in a single execution, without retries or crashes. If there is appetite to merge, I could also look into applying the prerequisite nf-core template updates, but likely not for another week or two.

@d4straub (Contributor)

It would be very nice to have that improvement in the pipeline. So yes, if you find the time it would be great to update the template and merge the PR (to dev though, not master, at least as a first step).
