8000 perf: optimize load_indices by wjones127 · Pull Request #3762 · lancedb/lance · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

perf: optimize load_indices #3762

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
May 7, 2025

Conversation

wjones127
Copy link
Contributor
@wjones127 wjones127 commented Apr 30, 2025
  • Eliminates a HEAD request to get the size of the manifest before attempting to read the index section.
  • When we open a manifest file, if we happened to read the index section, decode it eagerly so we don't have to make an additional IO request later.
  • When we write a manifest file, put the index metadata in the cache.

@wjones127 wjones127 force-pushed the perf/optimize-load-indices branch from 8aa2be9 to da533a9 Compare April 30, 2025 22:00
Comment on lines +5012 to +5206
// Pushdown only works with the legacy format for now.
if data_storage_version == LanceFileVersion::Legacy {
let start_bytes = get_bytes();
dataset
.scan()
.filter("not_indexed = 50")
.unwrap()
.try_into_batch()
.await
.unwrap();
let pushdown_scan_bytes = get_bytes() - start_bytes;

assert!(pushdown_scan_bytes < filtered_scan_bytes);
assert!(pushdown_scan_bytes < filtered_scan_bytes);
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found that once I started caching the index metadata, there wasn't a different in bytes read for the stable format. It seems like this was only passing because the first call was calling load_indices(), and the second call had the indices cached. Does this seem right @westonpace?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that seems correct. Currently, pushdown only gives significant benefits with the legacy format. The plan for the stable format is to add support for zone map / bloom filters as a secondary (best-effort) index. Now that we have ngram we've paved out the path for these best-effort indices pretty well.

@wjones127 wjones127 force-pushed the perf/optimize-load-indices branch from 6bc04fe to 8a69aa9 Compare May 6, 2025 21:12
@wjones127 wjones127 marked this pull request as ready for review May 6, 2025 21:12
@codecov-commenter
Copy link

Codecov Report

Attention: Patch coverage is 90.47619% with 16 lines in your changes missing coverage. Please review.

Project coverage is 78.59%. Comparing base (985a488) to head (8a69aa9).

Files with missing lines Patch % Lines
rust/lance/src/dataset.rs 83.01% 6 Missing and 3 partials ⚠️
rust/lance/src/io/commit.rs 72.22% 5 Missing ⚠️
rust/lance/src/dataset/cleanup.rs 0.00% 0 Missing and 1 partial ⚠️
rust/lance/src/dataset/transaction.rs 0.00% 0 Missing and 1 partial ⚠️
Additiona 8000 l details and impacted files
@@           Coverage Diff            @@
##             main    #3762    +/-   ##
========================================
  Coverage   78.58%   78.59%            
========================================
  Files         273      273            
  Lines      101936   102053   +117     
  Branches   101936   102053   +117     
========================================
+ Hits        80102    80204   +102     
- Misses      18667    18679    +12     
- Partials     3167     3170     +3     
Flag Coverage Δ
unittests 78.59% <90.47%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment on lines +444 to +464
if let Some(index_offset) = manifest.index_section {
if manifest_size - index_offset <= last_block.len() {
let offset_in_block = last_block.len() - (manifest_size - index_offset);
let message_len =
LittleEndian::read_u32(&last_block[offset_in_block..offset_in_block + 4])
as usize;
let message_data =
&last_block[offset_in_block + 4..offset_in_block + 4 + message_len];
let section = lance_table::format::pb::IndexSection::decode(message_data)?;
let indices = section
.indices
.into_iter()
.map(Index::try_from)
.collect::<Result<Vec<_>>>()?;
session.index_cache.insert_metadata(
base_path.as_ref(),
manifest_location.version,
Arc::new(indices),
);
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. I've been wondering lately if we should just bite the bullet and make the high level index metadata a part of the manifest proper instead of a secondary message that requires independent loading, just for simplicity of understanding and not performance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it would be interesting to see if it's worth splitting that out in the first place.

@wjones127 wjones127 merged commit 26286f0 into lancedb:main May 7, 2025
24 of 27 checks passed
@wjones127 wjones127 deleted the perf/optimize-load-indices branch May 7, 2025 16:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0