-
Notifications
You must be signed in to change notification settings - Fork 293
perf: optimize load_indices #3762
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
8aa2be9
to
da533a9
Compare
// Pushdown only works with the legacy format for now. | ||
if data_storage_version == LanceFileVersion::Legacy { | ||
let start_bytes = get_bytes(); | ||
dataset | ||
.scan() | ||
.filter("not_indexed = 50") | ||
.unwrap() | ||
.try_into_batch() | ||
.await | ||
.unwrap(); | ||
let pushdown_scan_bytes = get_bytes() - start_bytes; | ||
|
||
assert!(pushdown_scan_bytes < filtered_scan_bytes); | ||
assert!(pushdown_scan_bytes < filtered_scan_bytes); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found that once I started caching the index metadata, there wasn't a different in bytes read for the stable format. It seems like this was only passing because the first call was calling load_indices()
, and the second call had the indices cached. Does this seem right @westonpace?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that seems correct. Currently, pushdown only gives significant benefits with the legacy format. The plan for the stable format is to add support for zone map / bloom filters as a secondary (best-effort) index. Now that we have ngram we've paved out the path for these best-effort indices pretty well.
6bc04fe
to
8a69aa9
Compare
Codecov ReportAttention: Patch coverage is
Additiona 8000 l details and impacted files@@ Coverage Diff @@
## main #3762 +/- ##
========================================
Coverage 78.58% 78.59%
========================================
Files 273 273
Lines 101936 102053 +117
Branches 101936 102053 +117
========================================
+ Hits 80102 80204 +102
- Misses 18667 18679 +12
- Partials 3167 3170 +3
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
if let Some(index_offset) = manifest.index_section { | ||
if manifest_size - index_offset <= last_block.len() { | ||
let offset_in_block = last_block.len() - (manifest_size - index_offset); | ||
let message_len = | ||
LittleEndian::read_u32(&last_block[offset_in_block..offset_in_block + 4]) | ||
as usize; | ||
let message_data = | ||
&last_block[offset_in_block + 4..offset_in_block + 4 + message_len]; | ||
let section = lance_table::format::pb::IndexSection::decode(message_data)?; | ||
let indices = section | ||
.indices | ||
.into_iter() | ||
.map(Index::try_from) | ||
.collect::<Result<Vec<_>>>()?; | ||
session.index_cache.insert_metadata( | ||
base_path.as_ref(), | ||
manifest_location.version, | ||
Arc::new(indices), | ||
); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice. I've been wondering lately if we should just bite the bullet and make the high level index metadata a part of the manifest proper instead of a secondary message that requires independent loading, just for simplicity of understanding and not performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it would be interesting to see if it's worth splitting that out in the first place.
HEAD
request to get the size of the manifest before attempting to read the index section.