This repository features a pure Rust
implementation of gunzip
(decompression only), created from scratch with approximately 1000 lines of code for educational purposes.
Explore branches 1 through 25 for a step-by-step understanding of the code. The main branch reflects the latest version.
The following roughly summarizes each stage
- branch 1:
main()
function and skeletal structure - branch 2:
bitread
module for reading bits from a byte stream - branch 3:
gzip
header & footer for parsing the metadata and checksum - branch 4:
inflate
for block type 0 (uncompressed) data - branch 5:
codebook
andhuffman_decoder
modules for decoding Huffman codes - branch 6:
lz77
andsliding_window
modules for decompressing LZ77-encoded data - branch 7:
inflate
for block type 1 and 2 (compressed) data using fixed or dynamic Huffman codes - branch 8:
checksum_write
module for verifying the decompressed data - branch 9: performance optimization
- branch 10: multithread support
- branch 11: streaming support
- branch 12: memory optimization
- branch 13: bug fix to reject non-compliant
.gz
file - branch 14: minor optimization for block type 0
- branch 15: performance optimization
- branch 16: performance optimization by replacing dynamic dispatch with static dispatch
- branch 17: performance optimization
- branch 18: performance optimization and remove dead code
- branch 19: performance optimization
- branch 20: reduce buffer size and hence cache misses
- branch 21: improve implementation for decode and HuffmanDecoder
- branch 22: remove unnecessary heap allocation
- branch 23: recycle heap allocation
- branch 24: optimize write to stdout
- branch 25: fix bug when codelength + extrabits > 24
- written from scratch in pure Rust without unsafe code
- supports streaming, i.e., the decompressor implements
Read
trait - supports multithreading (two threads)
- competitive in performance with other popular implementations (see Benchmark below)
use gunzip::{SingleThreadDecompressor, MultiThreadDecompressor};
fn main() -> std::io::Result<()> {
let reader = std::fs::File::open("data.bin.gz")?;
let mut writer = std::fs::File::create("data.bin")?;
let mut decompressor = SingleThreadDecompressor::new(reader);
// or MultiThreadDecompressor::new(reader);
std::io::copy(&mut decompressor, &mut writer)?;
Ok(())
}
See gunzip.rs as an example.
# Usage: target/release/gunzip [-t]
# Decompresses .gz file read from stdin and outputs to stdout
# -t: employ two threads
# single thread
$ target/release/gunzip < compressed.gz > decompressed
# two threads
$ target/release/gunzip -t < compressed.gz > decompressed
The following result is obtained by running this benchmark script. To isolate IO bottleneck, the input/output files are placed in a ramdisk.
# create a ramdisk
mkdir /ramdisk
mount -o size=4G -t tmpfs none /ramdisk
cd /ramdisk
# run benchmark script
curl https://gist.githubusercontent.com/TechHara/500a1f7b8cc08af86856f707c04b0023/raw/b668d419373bb138f71cfe467f0f9e573aaaf89f/gunzip_benchmark.sh | bash
You are welcome to contribute by submitting a PR for bug fixes or enhancements. See here for detailed documentations.
You can also contribute by porting this code to a different language so that more people can learn from it.