fix: traversing adjacency graph for multiple query chunks #933
Conversation
jina/drivers/search.py
Outdated
-    # delete non-existed matches in reverse
-    for j in reversed(miss_idx):
-        del docs[j]
+    # # delete non-existed matches in reverse
Can you explain here how we identified the need to avoid this deletion? It was quite a case, not obvious at all.
We had a query which was split into sentences (i.e. documents with granularity 1). For some reason they completely disappeared from the data stream. Tracing backwards, we found out that something around the KVSearchDriver caused it, until ultimately we saw that driver deleting the chunks, since they have no entry in the BinaryPbIndexer (they are part of the query). The driver thinks: whatever has no value in my database is not worth returning.
this logic has to be kept. It handles the missed docs in sharding: with shards, each shard receives the same set of ids as the query, each looks them up in its local data, some hit and some miss. This logic is for deleting the misses.
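To illustrate, here is a minimal, self-contained sketch of the sharding scenario this deletion handles (`prune_misses` and the shard dicts are hypothetical, not actual Jina API):

```python
# Hypothetical sketch of per-shard miss handling, not actual Jina code.
def prune_misses(docs: list, local_store: dict) -> None:
    """Each shard receives the same query ids; drop the ones it cannot serve."""
    miss_idx = [i for i, d in enumerate(docs) if d not in local_store]
    # delete in reverse so earlier deletions do not shift the remaining indices
    for j in reversed(miss_idx):
        del docs[j]

shard_a = {'doc1': b'...', 'doc3': b'...'}  # this shard holds only part of the data
docs = ['doc1', 'doc2', 'doc3']
prune_misses(docs, shard_a)
print(docs)  # ['doc1', 'doc3'] -- 'doc2' lives on another shard and is dropped here
```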
so @maximilianwerk we need to somehow adapt the traversal logic to handle this without removing it. Maybe we need to recover the recur_on parameter?
@hanxiao it disappeared because the way we now provide granularity_range and adjacency_range is limiting. It came from an unwanted loop on chunks before matches. So we will need to tackle this.
@maximilianwerk It would be great to add a unit test so that we ensure everyone is on the same page.
Yes, adding tests was on my agenda anyhow.
It is mostly caused by the limitation in how we traverse. In one case we traverse on chunks before traversing on matches, because we set granularity_range: [1, 1] and adjacency_range: [0, 1] to go for the KVSearchDriver. So it is more or less a bug: the driver first does an iteration on chunks and does not find anything, because it is a query chunk, and therefore deletes the chunk from the query. Later, when we traverse on matches, there is no chunk left to go down to.
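A toy reproduction of that failure mode (the `Doc` class and `kv_lookup` here are simplified stand-ins, not the real driver):

```python
# Simplified stand-in for the KVSearchDriver behaviour described above.
class Doc:
    def __init__(self, id, chunks=None, matches=None):
        self.id, self.chunks, self.matches = id, chunks or [], matches or []

kv_index = {'m1', 'm2'}  # only indexed documents have metadata entries

def kv_lookup(docs):
    # query chunks have no entry in the KV index, so they count as misses
    miss_idx = [i for i, d in enumerate(docs) if d.id not in kv_index]
    for j in reversed(miss_idx):
        del docs[j]

query = Doc('q', chunks=[Doc('c1', matches=[Doc('m1'), Doc('m2')])])
kv_lookup(query.chunks)   # chunk pass runs first: the query chunk 'c1' is deleted
print(len(query.chunks))  # 0 -- the later match pass has no chunk left to descend into
```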
Force-pushed from 65e38b1 to 5ce8c17
make sure to fix the commit lint error so that CI keeps tracking the health status of this PR
Force-pushed from 66aad3b to 3150db8
Codecov Report

@@            Coverage Diff             @@
##           master     #933      +/-   ##
==========================================
+ Coverage   77.07%   77.22%   +0.15%
==========================================
  Files          68       68
  Lines        5095     5138      +43
==========================================
+ Hits         3927     3968      +41
- Misses       1168     1170       +2

Continue to review full report at Codecov.
I would add an integration test to show that the issue found for multires search is solved.
Just being paranoid: can we have a couple of these tests run with Rankers? I feel they may have some differences with the others that may deserve testing. And these tests can work as a good reference and "documentation" (note the quotes).
* fix: add recur_on param back
* fix: fix dry_run use case
* fix: made matches and chunk traversal similar again

Co-authored-by: maximilianwerk <maximilian.werk@jina.ai>
Force-pushed from 2452815 to 1eba5c8
@@ -11,7 +11,7 @@ on:
 with:
   executor: BaseKVIndexer
   granularity_range: [0, 0]
-  adjacency_range: [0, 1]
+  adjacency_range: [1, 1]
This is a change which I am not utterly sure of, but it somehow makes sense: once you have retrieved the matches, you only collect the metadata for the matches and not for the query itself. BTW: getting the metadata for the query itself broke the hello-world.
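For reference, a toy sketch of what the two adjacency levels select (the helper is hypothetical, and it assumes the inclusive reading of the bounds that these configs imply):

```python
# Hypothetical helper: adjacency level 0 is the query doc itself,
# level 1 is its matches (assuming inclusive bounds, as the configs imply).
def select_by_adjacency(query, matches, adjacency_range):
    lo, hi = adjacency_range
    selected = []
    if lo <= 0 <= hi:
        selected.append(query)    # [0, x] would also look up the query's metadata
    if lo <= 1 <= hi:
        selected.extend(matches)  # includes the retrieved matches
    return selected

print(select_by_adjacency('q', ['m1', 'm2'], (1, 1)))  # ['m1', 'm2'] -- matches only
print(select_by_adjacency('q', ['m1', 'm2'], (0, 1)))  # ['q', 'm1', 'm2'] -- query too
```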
@@ -6,11 +6,9 @@ with:
 pods:
   encode:
     uses: $RESOURCE_DIR/helloworld.encoder.yml
-    parallel: $PARALLEL
Having parallel does not make sense, since the encoder saves the used matrix, and having two in parallel means they overwrite each other's matrix. It works during querying, but half of the indexed documents will never be retrieved, since they were encoded with a thrown-away matrix.
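A toy numpy illustration of that overwrite (hypothetical, not the hello-world code):

```python
import numpy as np

rng = np.random.default_rng(0)

# two parallel encoder replicas each draw (and save) their own random matrix
matrix_a = rng.standard_normal((4, 2))
matrix_b = rng.standard_normal((4, 2))

doc = rng.standard_normal(4)
indexed_embedding = doc @ matrix_a  # half the docs get encoded with matrix A

# only one matrix survives on disk: the other replica overwrites it
saved_matrix = matrix_b
query_embedding = doc @ saved_matrix  # at query time everything uses matrix B

# the same document no longer matches its own index entry
print(np.allclose(indexed_embedding, query_embedding))  # False
```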
oh! good catch!
index:
  uses: $RESOURCE_DIR/helloworld.indexer.yml
  shards: $SHARDS
  separated_workspace: true
  polling: all
  uses_after: $RESOURCE_DIR/helloworld.reduce.yml
I guess that was an artifact from before!?
reduce is needed if used in parallel, right? But it may use _reduce anyhow? Why did it need to be removed?
It just does not provide any value and was not working before. So I thought removing complexity from a hello-world is a good thing.
All in for reducing complexity, and since you found the problem with parallelism in the encoder, it makes it more obvious: the reducer only comes into play when parallelism is up.
this is required, do not remove
@@ -20,6 +20,6 @@ requests:
 start: 0
 end: 50
 granularity_range: [0, 0]
-adjacency_range: [0, 1]
+adjacency_range: [0, 0]
The new ranges are so much better than the old ones. This slice says: get me the top 50 in the first dimension, e.g. for the first 50 queries submitted. I like it.
Please append the breaking changes introduced by this PR to #885.
Some behaviors are unexpected. At least they are controversial against our definition of right-exclusive ranges:

:param granularity_range: right-exclusive range of the recursion depth, (0, 1) for root-level only
:param adjacency_range: right-exclusive range of the recursion adjacency, (0, 1) for single matches
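A small sketch of the discrepancy (the interpretation functions are hypothetical; the inclusive reading is an assumption derived from the configs in this PR, where [1, 1] clearly has to select level 1):

```python
# The docstring claims right-exclusive bounds, but the configs in this PR
# ([1, 1], [0, 0]) only select anything under an inclusive upper bound.
def levels_exclusive(rng):
    return list(range(rng[0], rng[1]))       # right-exclusive, per the docstring

def levels_inclusive(rng):
    return list(range(rng[0], rng[1] + 1))   # right-inclusive, per the configs

print(levels_exclusive([1, 1]))  # [] -- would select no level at all
print(levels_inclusive([1, 1]))  # [1] -- selects adjacency level 1 (the matches)
```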
@pytest.fixture(autouse=True)
def run_around_tests():
    yield
    rm_files(['vec1.gz', 'vec2.gz', 'chunk1.gz', 'chunk2.gz',
              'vecidx1.bin', 'vecidx2.bin', 'kvidx1.bin', 'kvidx2.bin'])
A good pattern for later use! 👍
We should get rid of the rm_files method and rather use pytest temporary folders. The rm_files wiped my whole home folder, since I used it a bit wrongly... I don't even know what I did wrong, since it was wiped... :D
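A minimal sketch of the pytest alternative (tmp_path is a built-in pytest fixture; the file names are just examples):

```python
import pytest

@pytest.fixture
def workspace(tmp_path):
    # tmp_path is a per-test pathlib.Path created (and eventually cleaned up)
    # by pytest, so no hand-rolled rm_files call can touch files outside of it
    return tmp_path

def test_indexer_writes_to_workspace(workspace):
    vec_file = workspace / 'vec1.gz'
    vec_file.write_bytes(b'fake vector dump')
    assert vec_file.exists()
```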
There are two things to change. @maximilianwerk
Minor naming issue.
Good catch with the right-inclusivity. 👍
I think what is needed now is good documentation and a visual explanation of how to set the ranges correctly. I'll look into that after the Hackathon.
Done.
This is needed for making the lyrics demo work again. Be aware that this will break tests and is meant as a proposal. The deletion of non-found chunks might be crucial for some applications, and switching it on/off might be a parameter.
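If the on/off parameter route is taken, a sketch of what it could look like (entirely hypothetical; not what this PR implements):

```python
# Hypothetical driver flag, only to illustrate the proposal above.
class KVSearchDriverSketch:
    def __init__(self, prune_misses: bool = True):
        # True keeps today's sharding behaviour; False preserves query chunks
        self.prune_misses = prune_misses

    def lookup(self, docs: list, store: dict) -> None:
        miss_idx = [i for i, d in enumerate(docs) if d not in store]
        if self.prune_misses:
            for j in reversed(miss_idx):
                del docs[j]

docs = ['c1', 'm1']
KVSearchDriverSketch(prune_misses=False).lookup(docs, {'m1': b'...'})
print(docs)  # ['c1', 'm1'] -- the query chunk survives the lookup
```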