Test reading concatenated/multiblock gzips works by marcelm · Pull Request #171 · pycompression/xopen · GitHub

Test reading concatenated/multiblock gzips works #171


Merged
1 commit merged from multiblock-gzip into main on May 16, 2025

Conversation

@marcelm (Collaborator) commented May 15, 2025

To support what I claim here: nf-core/detaxizer#62 (comment)
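
A rough sketch of the behaviour being tested (illustrative pytest-style code, not the test added in this PR): a file containing two concatenated gzip members should read back as one continuous stream.

import gzip
from xopen import xopen

def test_concatenated_gzip(tmp_path):
    # Two independently compressed gzip members written back to back
    path = tmp_path / "concat.gz"
    with open(path, "wb") as raw:
        raw.write(gzip.compress(b"hello\n"))
        raw.write(gzip.compress(b"world\n"))
    # xopen should read transparently across the member boundary
    with xopen(path, "rb") as f:
        assert f.read() == b"hello\nworld\n"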

@rhpvorderman (Collaborator)

Nice to know that all that work on dnaio does catch attention.

Should we also support writing blocked gzip in the future? It is mandatory for writing BAM format, so it is something to consider. (Or maybe it should be in dnaio specifically, not xopen, because it is such a bioinformatics thing.)

@rhpvorderman merged commit d1931cb into main on May 16, 2025
18 checks passed
@rhpvorderman deleted the multiblock-gzip branch on May 16, 2025 at 07:03
@marcelm (Collaborator, Author) commented May 16, 2025

Should we also support writing blocked gzip in the future?

It’s a bit special, so I think it would be ok not to support that directly if we cannot find a simple interface for it. It seems to already work to just close the file and re-open it in append mode (I guess this won’t be very efficient for many small blocks, but at least it’s possible):

from xopen import xopen

# Write the first gzip member
f = xopen("out.gz", mode="wb")
f.write(b"hello\n")
f.close()

# Re-open in append mode; the second write starts a new gzip member
f = xopen("out.gz", mode="ab")
f.write(b"world\n")
f.close()
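
A quick sanity check (a sketch; Python's standard gzip module reads concatenated members the same way zcat does):

import gzip

# Read back across all gzip members in out.gz
with gzip.open("out.gz", "rb") as f:
    assert f.read() == b"hello\nworld\n"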

Hm, but if I look at the generated file, it actually contains four gzip headers in total. Is that something that python-isal does?

@rhpvorderman (Collaborator)

Could be; in that case it is a bug worth investigating.

Using igzip.open rather than xopen

$ hexdump out.gz 
0000000 8b1f 0808 00f7 6827 ff00 756f 0074 48cb
0000010 c9cd e7c9 0002 3020 363a 0006 0000 8b1f
0000020 0808 00f7 6827 ff00 756f 0074 cf2b ca2f
0000030 e149 0002 61a8 dd38 0006 0000          

I count two instances of 8b 1f as expected.
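
(Note that hexdump prints little-endian 16-bit words, so 8b1f on screen corresponds to the bytes 1f 8b on disk.) A rough way to do that count programmatically, as a sketch that simply searches for the gzip magic bytes followed by the deflate method byte:

# Count gzip member headers by searching for the magic bytes 1f 8b
# followed by the deflate method byte 08. This is only a heuristic:
# the same byte sequence could in principle occur inside compressed data.
with open("out.gz", "rb") as f:
    data = f.read()
print(data.count(b"\x1f\x8b\x08"))  # expect 2 for the igzip.open file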

Using igzip_threaded.open

0000000 8b1f 0008 0000 0000 00ff 48ca c9cd e7c9
0000010 0002 0000 ffff 0003 3020 363a 0006 0000
0000020 8b1f 0008 0000 0000 00ff 0003 0000 0000
0000030 0000 0000 8b1f 0008 0000 0000 00ff cf2a
0000040 ca2f e149 0002 0000 ffff 0003 61a8 dd38
0000050 0006 0000 8b1f 0008 0000 0000 00ff 0003
0000060 0000 0000 0000 0000                    
0000068

Interestingly, there are indeed multiple headers here. The output of zcat is still correct. Let me check how it works in the code.

The flush call on the multithreaded gzip writer writes all the data but also ends the gzip stream. It seems that flush is called twice. The code should probably be changed: flush should not end the gzip stream, as that is not congruent with how the single-threaded implementation works.
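
To illustrate the distinction with plain zlib (a sketch, not xopen's or python-isal's actual code): a sync flush pushes out all pending data but keeps the deflate stream open, while Z_FINISH terminates it, so any later write has to start a new gzip member.

import zlib

# wbits=31 selects the gzip container with the maximum window size
c = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, 31)
part1 = c.compress(b"hello\n") + c.flush(zlib.Z_SYNC_FLUSH)  # stream stays open
part2 = c.compress(b"world\n") + c.flush(zlib.Z_FINISH)      # stream ends here
data = part1 + part2
# Still one gzip member; the sync flush only inserted the 00 00 ff ff marker
assert zlib.decompress(data, 31) == b"hello\nworld\n"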

@rhpvorderman (Collaborator)

I created a bug report. I have no time to fix this now, but it will be fixed in the future.

@marcelm (Collaborator, Author) commented May 16, 2025

Thanks! I’ll subscribe to that issue
