Enable Faiss-based vector format to index larger number of vectors in a single segment by kaivalnp · Pull Request #14847 · apache/lucene

Open · kaivalnp wants to merge 3 commits into main

Conversation

@kaivalnp (Contributor) commented Jun 25, 2025

Description

I was trying to index a large number of vectors in a single segment and ran into an error caused by the way we copy vectors to native memory before calling Faiss to create an index:

Caused by: java.lang.IllegalStateException: Segment is too large to wrap as ByteBuffer. Size: 3276800000
        at java.base/jdk.internal.foreign.AbstractMemorySegmentImpl.checkArraySize(AbstractMemorySegmentImpl.java:374)
        at org.apache.lucene.index.SegmentMerger.mergeWithLogging(SegmentMerger.java:314)
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:158)

This limitation was hit because we use a ByteBuffer (backed by native memory) to copy vectors from the heap, and a ByteBuffer has a 2 GB size limit since its capacity is an int

As a fix, I've changed it to use MemorySegment-specific functions to copy vectors (also moving away from these ByteBuffers in other places, and using more appropriate IO methods)
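
To illustrate the difference, here's a minimal sketch (not the actual LibFaissC code; the class and method names are made up for illustration). The key point is that MemorySegment.copy addresses the destination with a long offset, while asByteBuffer() is capped at Integer.MAX_VALUE bytes:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

class CopyVectorsSketch {
  // Copy a batch of float vectors from heap into native memory at a (long) offset
  static void copyToNative(MemorySegment dest, long byteOffset, float[] vectors) {
    // Old approach: wrap the segment as a ByteBuffer before copying. This throws
    // "Segment is too large to wrap as ByteBuffer" once the segment exceeds 2 GB,
    // because a ByteBuffer's capacity is an int:
    //   dest.asByteBuffer().asFloatBuffer().put(vectors);

    // New approach: copy directly, with a long destination offset, so the target
    // segment itself can be arbitrarily large (only the source array is bounded
    // by Java's 2 GB array-length limit):
    MemorySegment.copy(vectors, 0, dest, ValueLayout.JAVA_FLOAT, byteOffset, vectors.length);
  }

  public static void main(String[] args) {
    try (Arena arena = Arena.ofConfined()) {
      // A 3 GB native segment: too large for asByteBuffer(), fine for copy()
      MemorySegment dest = arena.allocate(3L << 30);
      copyToNative(dest, 0L, new float[] {1f, 2f, 3f});
    }
  }
}
```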

With these changes, we no longer see the above error and are able to build and search an index. Also ran benchmarks for a case where this limit was not hit to check for performance impact:

Baseline (on main):

    type  recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
   faiss   0.997        1.855   1.819        0.981  100000   100      50       32        200         no     31.07       3218.44           32.76             1         3152.11      1562.500     1562.500       HNSW

Candidate (on this PR):

    type  recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
   faiss   0.998        1.817   1.794        0.987  100000   100      50       32        200         no     29.57       3381.46           33.20             1         3152.11      1562.500     1562.500       HNSW

..and indexing / search performance is largely unchanged

Edit: Related to #14178

… a single segment

- Moves away from a ByteBuffer (with a 2 GB limit) to direct copying of vectors to native memory
- Also simplify some other off-heap memory IO instances

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

@msokolov (Contributor) left a comment:

This LGTM - one question: do we have unit tests covering this?


@kaivalnp (Contributor, Author) commented Jun 26, 2025

@msokolov I wasn't sure about attempting to index a large amount of vector data, given that it'll take up a few GB of RAM. I've added a test for now; please let me know if I should keep it (or how to test it better). Perhaps having the test is fine, because we run Faiss tests (and only those) in a separate GH action?

The test fails deterministically when added to main:

   >     java.lang.IllegalStateException: Segment is too large to wrap as ByteBuffer. Size: 2149576700
   >         at __randomizedtesting.SeedInfo.seed([1B557576B3F191C9:6F03E13EF7CEF63A]:0)
   >         at java.base/jdk.internal.foreign.AbstractMemorySegmentImpl.checkArraySize(AbstractMemorySegmentImpl.java:374)
   >         at java.base/jdk.internal.foreign.AbstractMemorySegmentImpl.asByteBuffer(AbstractMemorySegmentImpl.java:199)
   >         at org.apache.lucene.sandbox.codecs.faiss.LibFaissC.createIndex(LibFaissC.java:224)

@msokolov (Contributor) commented Jun 26, 2025

Sorry, I was too vague - I didn't mean we should be testing the > 2GB case! I just wanted to make sure we had unit test coverage for these classes at all, because I'm not familiar with this part of the codebase.

- Also modify the test to make backporting easier
@kaivalnp (Contributor, Author) commented:

> we had unit test coverage for these classes at all

Yes, we have a test class that runs all tests in the BaseKnnVectorsFormatTestCase

We had to modify / disable a few because the format only supports float vectors and a few similarity functions..
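
For context, the wiring is roughly the following (an illustrative sketch, not the exact sandbox test: the disabled-test name and the no-arg FaissKnnVectorsFormat constructor are assumptions here):

```java
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.sandbox.codecs.faiss.FaissKnnVectorsFormat;
import org.apache.lucene.tests.index.BaseKnnVectorsFormatTestCase;
import org.apache.lucene.tests.util.TestUtil;

public class TestFaissKnnVectorsFormat extends BaseKnnVectorsFormatTestCase {
  @Override
  protected Codec getCodec() {
    // Run the shared KNN-format test suite against the Faiss-backed format
    return TestUtil.alwaysKnnVectorsFormat(new FaissKnnVectorsFormat());
  }

  // Tests that exercise unsupported features are overridden to no-ops,
  // e.g. byte-vector tests (the format only supports float vectors)
  @Override
  public void testByteVectorField() {
    // disabled: not supported by this format
  }
}
```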

We run these tests on each PR / commit via GH actions, see sample run from this PR, which ran:

> Task :lucene:sandbox:test
:lucene:sandbox:test (SUCCESS): 53 test(s), 8 skipped

> I didn't mean we should be testing the > 2GB case

I kind of like that we have this test; can we just mark it as "monster" so that we don't run it locally / from GH actions?
Also refactored a bit to make backporting easier..

I was able to run it using:

./gradlew -p lucene/sandbox -Dtests.faiss.run=true test --tests "org.apache.lucene.sandbox.codecs.faiss.*" -Dtests.monster=true -Dtests.heapsize=16g

..where it took a (relatively) long time to run:

:lucene:sandbox:test (SUCCESS): 53 test(s), 8 skipped
The slowest tests during this run:
  14.64s TestFaissKnnVectorsFormat.testLargeVectorData (:lucene:sandbox)
The slowest suites during this run:
  16.63s TestFaissKnnVectorsFormat (:lucene:sandbox)

Also, running it on main gives the same error as above
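
For reference, a heavyweight test can be gated with Lucene's @Monster test-group annotation, which keeps it out of default (and CI) runs unless -Dtests.monster=true is passed. A minimal sketch, with a hypothetical class name and a placeholder body:

```java
import org.apache.lucene.tests.util.LuceneTestCase;
import org.apache.lucene.tests.util.LuceneTestCase.Monster;

public class TestLargeFaissIndex extends LuceneTestCase {
  // Skipped unless the test runner is invoked with -Dtests.monster=true
  @Monster("indexes > 2 GB of raw vector data and needs a large heap")
  public void testLargeVectorData() throws Exception {
    // index enough float vectors that the copied native segment exceeds
    // the old 2 GB ByteBuffer limit, force-merge to one segment, and search
  }
}
```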

