Test reading concatenated/multiblock gzips works by marcelm · Pull Request #171 · pycompression/xopen · GitHub

Test reading concatenated/multiblock gzips works #171


Merged
1 commit merged from multiblock-gzip into main on May 16, 2025

Conversation

@marcelm (Collaborator) commented May 15, 2025

To support what I claim here: nf-core/detaxizer#62 (comment)
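
A rough sketch of the behaviour being tested (illustrative pytest-style code, not the test added in this PR): a file containing two concatenated gzip members should read back as one continuous stream.

import gzip
from xopen import xopen

def test_concatenated_gzip(tmp_path):
    # Two independently compressed gzip members written back to back
    path = tmp_path / "concat.gz"
    with open(path, "wb") as raw:
        raw.write(gzip.compress(b"hello\n"))
        raw.write(gzip.compress(b"world\n"))
    # xopen should read transparently across the member boundary
    with xopen(path, "rb") as f:
        assert f.read() == b"hello\nworld\n"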

@rhpvorderman (Collaborator)

Nice to know that all that work on dnaio does catch attention.

Should we also support writing blocked gzip in the future? It is mandatory for writing BAM format, so it is something to consider. (Or maybe it should be in dnaio specifically, not xopen, because it is such a bioinformatics thing.)

@rhpvorderman merged commit d1931cb into main on May 16, 2025
18 checks passed
@rhpvorderman deleted the multiblock-gzip branch on May 16, 2025 at 07:03
@marcelm (Collaborator, Author) commented May 16, 2025

Should we also support writing blocked gzip in the future?

It’s a bit special, so I think it would be ok not to support that directly if we cannot find a simple interface for it. It seems to already work to just close the file and re-open it in append mode (I guess this won’t be very efficient for many small blocks, but at least it’s possible):

from xopen import xopen

# Write the first gzip member
f = xopen("out.gz", mode="wb")
f.write(b"hello\n")
f.close()

# Re-open in append mode; the second write starts a new gzip member
f = xopen("out.gz", mode="ab")
f.write(b"world\n")
f.close()
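
A quick sanity check (a sketch; Python's standard gzip module reads concatenated members the same way zcat does):

import gzip

# Read back across all gzip members in out.gz
with gzip.open("out.gz", "rb") as f:
    assert f.read() == b"hello\nworld\n"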

Hm, but if I look at the generated file, it actually contains four gzip headers in total. Is that something that python-isal does?

@rhpvorderman (Collaborator)

Could be; in that case it is a bug worth investigating.

Using igzip.open rather than xopen

$ hexdump out.gz 
0000000 8b1f 0808 00f7 6827 ff00 756f 0074 48cb
0000010 c9cd e7c9 0002 3020 363a 0006 0000 8b1f
0000020 0808 00f7 6827 ff00 756f 0074 cf2b ca2f
0000030 e149 0002 61a8 dd38 0006 0000          

I count two instances of 8b 1f as expected.
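
(Note that hexdump prints little-endian 16-bit words, so 8b1f on screen corresponds to the bytes 1f 8b on disk.) A rough way to do that count programmatically, as a sketch that simply searches for the gzip magic bytes followed by the deflate method byte:

# Count gzip member headers by searching for the magic bytes 1f 8b
# followed by the deflate method byte 08. This is only a heuristic:
# the same byte sequence could in principle occur inside compressed data.
with open("out.gz", "rb") as f:
    data = f.read()
print(data.count(b"\x1f\x8b\x08"))  # expect 2 for the igzip.open file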

Using igzip_threaded.open

0000000 8b1f 0008 0000 0000 00ff 48ca c9cd e7c9
0000010 0002 0000 ffff 0003 3020 363a 0006 0000
0000020 8b1f 0008 0000 0000 00ff 0003 0000 0000
0000030 0000 0000 8b1f 0008 0000 0000 00ff cf2a
0000040 ca2f e149 0002 0000 ffff 0003 61a8 dd38
0000050 0006 0000 8b1f 0008 0000 0000 00ff 0003
0000060 0000 0000 0000 0000                    
0000068

Interestingly, there are indeed multiple headers here. The output of zcat is still correct. Let me check how it works in the code.

The flush call on the multithreaded gzip writer writes all the data but also ends the gzip stream. It seems that flush is called twice. The code should probably be changed: flush should not end the gzip stream, as that is not congruent with how the single-threaded implementation works.
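
To illustrate the distinction with plain zlib (a sketch, not xopen's or python-isal's actual code): a sync flush pushes out all pending data but keeps the deflate stream open, while Z_FINISH terminates it, so any later write has to start a new gzip member.

import zlib

# wbits=31 selects the gzip container with the maximum window size
c = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, 31)
part1 = c.compress(b"hello\n") + c.flush(zlib.Z_SYNC_FLUSH)  # stream stays open
part2 = c.compress(b"world\n") + c.flush(zlib.Z_FINISH)      # stream ends here
data = part1 + part2
# Still one gzip member; the sync flush only inserted the 00 00 ff ff marker
assert zlib.decompress(data, 31) == b"hello\nworld\n"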

@rhpvorderman (Collaborator)

I created a bug report. I have no time to fix this now, but it will be fixed in the future.

@marcelm (Collaborator, Author) commented May 16, 2025

Thanks! I’ll subscribe to that issue
