perf: optimize load_indices #3762

wjones127 · 2025-04-30T19:27:43Z

Eliminates a HEAD request to get the size of the manifest before attempting to read the index section.
When we open a manifest file, if we happened to read the index section, decode it eagerly so we don't have to make an additional IO request later.
When we write a manifest file, put the index metadata in the cache.

wjones127 · 2025-05-05T20:51:54Z

rust/lance/src/dataset/scanner.rs

+        // Pushdown only works with the legacy format for now.
+        if data_storage_version == LanceFileVersion::Legacy {
+            let start_bytes = get_bytes();
+            dataset
+                .scan()
+                .filter("not_indexed = 50")
+                .unwrap()
+                .try_into_batch()
+                .await
+                .unwrap();
+            let pushdown_scan_bytes = get_bytes() - start_bytes;

-        assert!(pushdown_scan_bytes < filtered_scan_bytes);
+            assert!(pushdown_scan_bytes < filtered_scan_bytes);
+        }


I found that once I started caching the index metadata, there wasn't a different in bytes read for the stable format. It seems like this was only passing because the first call was calling load_indices(), and the second call had the indices cached. Does this seem right @westonpace?

Yes, that seems correct. Currently, pushdown only gives significant benefits with the legacy format. The plan for the stable format is to add support for zone map / bloom filters as a secondary (best-effort) index. Now that we have ngram we've paved out the path for these best-effort indices pretty well.

codecov-commenter · 2025-05-06T21:46:31Z

Codecov Report

Attention: Patch coverage is 90.47619% with 16 lines in your changes missing coverage. Please review.

Project coverage is 78.59%. Comparing base (985a488) to head (8a69aa9).

Files with missing lines	Patch %	Lines
rust/lance/src/dataset.rs	83.01%	6 Missing and 3 partials ⚠️
rust/lance/src/io/commit.rs	72.22%	5 Missing ⚠️
rust/lance/src/dataset/cleanup.rs	0.00%	0 Missing and 1 partial ⚠️
rust/lance/src/dataset/transaction.rs	0.00%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@           Coverage Diff            @@
##             main    #3762    +/-   ##
========================================
  Coverage   78.58%   78.59%            
========================================
  Files         273      273            
  Lines      101936   102053   +117     
  Branches   101936   102053   +117     
========================================
+ Hits        80102    80204   +102     
- Misses      18667    18679    +12     
- Partials     3167     3170     +3

Flag	Coverage Δ
unittests	`78.59% <90.47%> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

westonpace · 2025-05-07T13:36:54Z

rust/lance/src/dataset.rs

+        if let Some(index_offset) = manifest.index_section {
+            if manifest_size - index_offset <= last_block.len() {
+                let offset_in_block = last_block.len() - (manifest_size - index_offset);
+                let message_len =
+                    LittleEndian::read_u32(&last_block[offset_in_block..offset_in_block + 4])
+                        as usize;
+                let message_data =
+                    &last_block[offset_in_block + 4..offset_in_block + 4 + message_len];
+                let section = lance_table::format::pb::IndexSection::decode(message_data)?;
+                let indices = section
+                    .indices
+                    .into_iter()
+                    .map(Index::try_from)
+                    .collect::<Result<Vec<_>>>()?;
+                session.index_cache.insert_metadata(
+                    base_path.as_ref(),
+                    manifest_location.version,
+                    Arc::new(indices),
+                );
+            }
+        }


Nice. I've been wondering lately if we should just bite the bullet and make the high level index metadata a part of the manifest proper instead of a secondary message that requires independent loading, just for simplicity of understanding and not performance.

Yeah it would be interesting to see if it's worth splitting that out in the first place.

github-actions bot added the performance label Apr 30, 2025

wjones127 force-pushed the perf/optimize-load-indices branch from 8aa2be9 to da533a9 Compare April 30, 2025 22:00

wjones127 commented May 5, 2025

View reviewed changes

wjones127 added 6 commits May 6, 2025 14:00

perf: optimize load_indices

65ed011

finish up

ae3af6f

fix

d5c1c56

fix test

b21cb5f

fix test

aa4f5ea

fix docs

8a69aa9

wjones127 force-pushed the perf/optimize-load-indices branch from 6bc04fe to 8a69aa9 Compare May 6, 2025 21:12

wjones127 marked this pull request as ready for review May 6, 2025 21:12

westonpace approved these changes May 7, 2025

View reviewed changes

wjones127 merged commit 26286f0 into lancedb:main May 7, 2025
24 of 27 checks passed

wjones127 deleted the perf/optimize-load-indices branch May 7, 2025 16:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf: optimize load_indices #3762

perf: optimize load_indices #3762

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

perf: optimize load_indices #3762

perf: optimize load_indices #3762

Uh oh!

Conversation

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Codecov Report

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!