Improves compression ratio for small windowLog #1624
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
To ensure that all detected matches remain within window size,
the current strategy is to preemptively invalidate all indexes
that could end up being too far while compressing next block.
This is good for speed, as checking if a match is within distance is now a trivial comparison.
This roughly invalidates one block worth of data, which is about 128 KB.
It's then progressively restored as compression progress through the block. But temporarily, the window size has been truncated by up to 128 KB.
This is bad for compression ratio, but the impact is low, as long as the window size is large, relative to the block size.
As the window size is reduced, this is no longer the case.
Especially when
wlog <= 17
, the window size is now equal to the block size. Hence, this strategy invalidates all history before each block (assuming each block is full), making each block effectively independent. This has a big impact on compression ratio.In this patch, the logic is changed so that each strategy selects how to best deal with maximum offset distance, aka, the window size.
Logic for
fast
anddouble_fast
is unchanged, as these single-probe strategies want to spend very little time per attempt.Logic for all strategies >=
greedy
is modified, to accepts all candidates, up to maximum window size. This makes them more effective in combination with small window sizes.Example on
enwik7
at level19
:dev
As expected, improvement increases as window size decreases. It's negligible at
wlog = 23
, then roughly doubles with each reduction ofwlog
.On reaching
wlog = 17
, there is a visible bump. After that, the gain stabilize around +8-9%, which is quite significant.