fix: traversing adjacency graph for multiple query chunks by maximilianwerk · Pull Request #933 · jina-ai/serve · GitHub

fix: traversing adjacency graph for multiple query chunks #933


Merged (12 commits, Sep 17, 2020)

Conversation

maximilianwerk (Member)
This is needed to make the lyrics demo work again. Be aware that this will break tests and is meant as a proposal. Deleting non-found chunks might be crucial for some applications, and switching it on/off might need to be a parameter.

# delete non-existed matches in reverse
for j in reversed(miss_idx):
    del docs[j]
# # delete non-existed matches in reverse
Contributor

Can you explain here how we identified the need to avoid this deletion? It was quite a case, not obvious at all.

Member Author

We had a query which was split into sentences (i.e. documents with granularity 1). For some reason they completely disappeared from the data stream. Working backwards, we found out that something around the KVSearchDriver caused it, until ultimately we saw that driver deleting the chunks, since they have no entry in the BinaryPbIndexer (they are part of the query). The driver thinks: whatever has no value in my database is not worth returning.

@hanxiao (Member) Sep 10, 2020

This logic has to be kept: it handles the missed docs in sharding. With shards, each shard receives the same set of ids as the query and looks each one up in its local data; some hit and some miss. This logic deletes the misses.
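For context, the deletion under discussion can be sketched roughly like this (a simplified stand-in: `prune_misses` and the plain-id doc list are illustrative, not Jina's actual API):

```python
def prune_misses(docs, local_store):
    """Drop every doc that has no entry in this shard's local storage.

    Each shard receives the same set of ids; only some are stored
    locally, so the misses must be removed before replies are merged.
    """
    miss_idx = [i for i, doc in enumerate(docs) if local_store.get(doc) is None]
    # delete in reverse so earlier indices stay valid while deleting
    for j in reversed(miss_idx):
        del docs[j]
    return docs

# this shard only stores ids 1 and 3; ids 2 and 4 are misses here
local_store = {1: b'payload-1', 3: b'payload-3'}
docs = [1, 2, 3, 4]
prune_misses(docs, local_store)
print(docs)  # → [1, 3]
```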

@hanxiao (Member) Sep 10, 2020

The reason for your disappearance is probably the drivers bound to executors by default. Please take a look at resources/

(screenshot attached)

Contributor

So @maximilianwerk, we need to somehow adapt the traversal logic to handle this without removing this code. Maybe we need to bring back the recur_on parameter?

Contributor

@hanxiao it disappeared because the way we now provide granularity_range and adjacency_range is limiting. It came from an unwanted loop on chunks before matches, so we will need to tackle this.

Member

@maximilianwerk It would be great to add a unit test so that we ensure everyone is on the same page.

Member Author

Yes, adding tests was on my agenda anyhow.

Contributor

It is mostly caused by a limitation in how we traverse. In one case we traverse on chunks before traversing on matches, because we set granularity_range: [1, 1] and adjacency_range: [0, 1] for the KVSearchDriver. So it is more or less a bug: the driver first iterates on chunks and does not find anything (because it is a query chunk), therefore it deletes the chunk from the query. Later, when we traverse on matches, it has no chunk to go down to.
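The failure mode described above can be reproduced with a toy model (all names here are illustrative, not Jina's real classes): the chunk-level pass deletes query chunks that have no stored entry, so the later match-level pass has nothing left to descend into.

```python
class Doc:
    """Minimal stand-in for a document with chunks and matches."""
    def __init__(self, doc_id, chunks=None, matches=None):
        self.id = doc_id
        self.chunks = chunks or []
        self.matches = matches or []

def kv_apply(docs, store):
    # delete every doc that has no entry in the key-value store (a miss)
    for j in reversed([i for i, d in enumerate(docs) if d.id not in store]):
        del docs[j]

def traverse(doc, store):
    kv_apply(doc.chunks, store)   # chunk pass runs first: query chunks are misses
    for c in doc.chunks:          # match pass then finds no chunks to descend into
        kv_apply(c.matches, store)

query = Doc('q', chunks=[Doc('q-c1', matches=[Doc('m1')])])
store = {'m1'}                    # only indexed docs have KV entries
traverse(query, store)
print(len(query.chunks))  # → 0: the query chunk (and its matches) is gone
```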

@maximilianwerk maximilianwerk marked this pull request as draft September 10, 2020 15:08
@maximilianwerk maximilianwerk force-pushed the fix-traversing-adjacency-graph branch from 65e38b1 to 5ce8c17 Compare September 14, 2020 10:11
@hanxiao (Member) commented Sep 14, 2020

make sure to fix the commit lint error so that CI keeps tracking the health status of this PR

@maximilianwerk maximilianwerk force-pushed the fix-traversing-adjacency-graph branch 3 times, most recently from 66aad3b to 3150db8 Compare September 15, 2020 11:29
@jina-bot jina-bot added the area/testing This issue/PR affects testing label Sep 15, 2020
codecov bot commented Sep 15, 2020

Codecov Report

Merging #933 into master will increase coverage by 0.15%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #933      +/-   ##
==========================================
+ Coverage   77.07%   77.22%   +0.15%     
==========================================
  Files          68       68              
  Lines        5095     5138      +43     
==========================================
+ Hits         3927     3968      +41     
- Misses       1168     1170       +2     
Impacted Files Coverage Δ
jina/clients/python/io.py 89.79% <100.00%> (ø)
jina/helloworld/components.py 100.00% <100.00%> (ø)
jina/logging/profile.py 51.25% <0.00%> (-0.65%) ⬇️
jina/peapods/pod.py 79.73% <0.00%> (+<0.01%) ⬆️
jina/peapods/pea.py 89.96% <0.00%> (+0.07%) ⬆️
jina/executors/compound.py 86.95% <0.00%> (+0.19%) ⬆️
jina/enums.py 95.19% <0.00%> (+0.19%) ⬆️
jina/peapods/gateway.py 77.72% <0.00%> (+0.21%) ⬆️
jina/logging/base.py 67.20% <0.00%> (+0.26%) ⬆️
jina/executors/rankers/__init__.py 92.85% <0.00%> (+0.35%) ⬆️
... and 5 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 42a9ecf...59c8d85.

@JoanFM (Contributor) left a comment

I would add an integration test to show that the multires-search scenario that exposed this issue now works.

@jina-bot jina-bot added size/M and removed size/S labels Sep 15, 2020
@JoanFM (Contributor) left a comment

Just being paranoid: can we have a couple of these tests run with Rankers? I feel they may have some differences from the others that deserve testing. And these tests can serve as a good reference and "documentation" (note the quotes).

@maximilianwerk maximilianwerk force-pushed the fix-traversing-adjacency-graph branch from 2452815 to 1eba5c8 Compare September 16, 2020 10:04
@jina-bot jina-bot added the area/helloworld This issue/PR affects the helloworld label Sep 16, 2020
@@ -11,7 +11,7 @@ on:
with:
executor: BaseKVIndexer
granularity_range: [0, 0]
- adjacency_range: [0, 1]
+ adjacency_range: [1, 1]
Member Author

This is a change I am not utterly sure of, but it somehow makes sense: once you have retrieved the matches, you only collect the metadata for the matches and not for the query itself. BTW: getting the metadata for the query itself broke the hello-world.

@@ -6,11 +6,9 @@ with:
pods:
encode:
uses: $RESOURCE_DIR/helloworld.encoder.yml
parallel: $PARALLEL
Member Author

Having parallel does not make sense, since the encoder saves the matrix it used, and having two encoders in parallel means they overwrite each other's matrix. It works during querying, but half of the indexed documents will never be retrieved, since they were encoded with a thrown-away matrix.
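A toy illustration of the problem (purely illustrative, not the real encoder): each replica holds its own randomly drawn projection as state, so documents encoded by the replica whose state is not the one saved can never be re-encoded consistently at query time.

```python
import random

class ToyEncoder:
    """Toy stateful encoder: each replica draws its own projection vector."""
    def __init__(self, seed):
        rng = random.Random(seed)
        self.matrix = [rng.random() for _ in range(4)]  # per-replica state

    def encode(self, x):
        return sum(a * b for a, b in zip(x, self.matrix))

# with parallel=2, documents are split between two independent replicas
replica_a, replica_b = ToyEncoder(seed=1), ToyEncoder(seed=2)
doc = [1.0, 1.0, 1.0, 1.0]
vec_a, vec_b = replica_a.encode(doc), replica_b.encode(doc)

# only one replica's matrix survives the save; query-time encoding uses it,
# so vectors produced by the other replica can never be matched again
saved_matrix = replica_b.matrix
query_vec = sum(a * b for a, b in zip(doc, saved_matrix))
print(query_vec == vec_b, query_vec == vec_a)  # → True False
```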

Contributor

oh! good catch!

index:
uses: $RESOURCE_DIR/helloworld.indexer.yml
shards: $SHARDS
separated_workspace: true
polling: all
uses_after: $RESOURCE_DIR/helloworld.reduce.yml
Member Author

I guess that was a leftover from old code!?

Contributor

Reduce is needed if used in parallel, right? But it may use _reduce anyhow? Why did it need to be removed?

Member Author

It just does not provide any value and was not working before. So I thought removing complexity from a hello-world is a good thing.

Contributor

I'm all in for reducing complexity. And since you found the problem with parallelism in the encoder, it makes it more obvious: the reducer only comes into play when parallelism is up.

Member

this is required, do not remove

@maximilianwerk maximilianwerk marked this pull request as ready for review September 16, 2020 10:46
@@ -20,6 +20,6 @@ requests:
start: 0
end: 50
granularity_range: [0, 0]
- adjacency_range: [0, 1]
+ adjacency_range: [0, 0]
Member Author

The new ranges are so much better than the old ones. This slice says: get me the top 50 in the first dimension, e.g. for the first 50 queries submitted. I like it.

@hanxiao (Member) left a comment

Please append the breaking changes introduced by this PR to #885.

@hanxiao hanxiao added this to the v0.6 🚨 Breaking Changes milestone Sep 16, 2020
@nan-wang (Member) left a comment

Some behaviors are unexpected. At least they are controversial against our definition of right-exclusive ranges:

        :param granularity_range: right-exclusive range of the recursion depth, (0, 1) for root-level only
        :param adjacency_range: right-exclusive range of the recursion adjacency, (0, 1) for single matches

Comment on lines +49 to +55
@pytest.fixture(autouse=True)
def run_around_tests():
    yield
    rm_files(['vec1.gz', 'vec2.gz', 'chunk1.gz', 'chunk2.gz',
              'vecidx1.bin', 'vecidx2.bin', 'kvidx1.bin', 'kvidx2.bin'])


Member

A good pattern for later usages! 👍

@maximilianwerk (Member Author) Sep 17, 2020

We should get rid of the rm_files method and rather use pytest temporary folders. rm_files wiped my whole home folder, since I used it a bit wrongly... I don't even know what I did wrong, since it was wiped... :D
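A minimal sketch of the suggested alternative (standard pytest; the test name and file contents here are made up): pytest's tmp_path fixture hands each test a fresh per-test directory and removes it afterwards, so a cleanup bug can never touch anything outside that sandbox.

```python
# Instead of cleaning up named files with rm_files, write test artifacts
# into pytest's per-test temporary directory; pytest creates and removes
# it automatically, so no manual teardown (and no accidental wipes).
def test_indexer_roundtrip(tmp_path):
    vec_file = tmp_path / 'vec1.gz'
    vec_file.write_bytes(b'fake-index-data')
    assert vec_file.read_bytes() == b'fake-index-data'
    # no teardown needed: tmp_path is garbage-collected by pytest
```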

@nan-wang (Member)

There are two things to change, @maximilianwerk:

  • Adapt the docstring in BaseRecursiveDriver: the ranges are right-inclusive.
  • Require adjacency_range to start with 1, or at least remark on this in the docstring. Otherwise this leads to confusion and is error-prone. When setting adjacency: [0, 0], recur_on=['matches', ], apply_all actually works on the chunks. This means it has the same effect as granularity: [0, 0], recur_on=['chunks', ]. Specifically, the following two tests lead to the same results:
def test_matches_with_0_0():
    docs = build_docs()
    driver = SliceQL(
        start=0,
        end=1,
        adjacency_range=(0, 0),
        granularity_range=(0, 0),
        recur_on=['matches', ]
    )
    driver._traverse_apply(docs)
    assert len(docs) == 1

def test_chunks_with_0_0():
    docs = build_docs()
    driver = SliceQL(
        start=0,
        end=1,
        adjacency_range=(0, 0),
        granularity_range=(0, 0),
        recur_on=['chunks', ]
    )
    driver._traverse_apply(docs)
    assert len(docs) == 1

@nan-wang (Member) left a comment

Minor naming issue.

@maximilianwerk (Member Author)

Good catch with the right-inclusivity. 👍

@maximilianwerk (Member Author)

I think what is needed now is good documentation and a visual explanation of how to set the ranges correctly. I'll look into that after the Hackathon.

@maximilianwerk (Member Author)

please append the breaking changes introduced by this PR to here #885

Done.

@nan-wang nan-wang merged commit 1904505 into master Sep 17, 2020
@nan-wang nan-wang deleted the fix-traversing-adjacency-graph branch September 17, 2020 06:56
Labels
area/core This issue/PR affects the core codebase area/helloworld This issue/PR affects the helloworld area/testing This issue/PR affects testing component/client component/driver component/resource size/M