fix: traversing adjacency graph for multiple query chunks #933
Conversation
jina/drivers/search.py
Outdated
-    # delete non-existed matches in reverse
-    for j in reversed(miss_idx):
-        del docs[j]
+    # # delete non-existed matches in reverse
Can you explain here how we identified the need to avoid this deletion? It was quite a case, not obvious at all.
We had a query which was split into sentences (i.e. documents with granularity 1). For some reason they completely disappeared from the data stream. Tracing backwards, we found out that something around the KVSearchDriver caused it, until ultimately we saw that driver deleting the chunks, since they have no entry in the BinaryPbIndexer (they are part of the query). The driver thinks: whatever has no value in my database is not worth returning.
this logic has to be kept. It handles the missed docs in sharding: with shards, each shard receives the same set of ids as the query, each looks them up in its local data, some hit and some miss. This logic is for deleting the misses.
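To illustrate, here is a minimal, self-contained sketch of the sharding scenario this deletion handles (`prune_misses` and the shard dicts are hypothetical, not actual Jina API):

```python
# Hypothetical sketch of per-shard miss handling, not actual Jina code.
def prune_misses(docs: list, local_store: dict) -> None:
    """Each shard receives the same query ids; drop the ones it cannot serve."""
    miss_idx = [i for i, d in enumerate(docs) if d not in local_store]
    # delete in reverse so earlier deletions do not shift the remaining indices
    for j in reversed(miss_idx):
        del docs[j]

shard_a = {'doc1': b'...', 'doc3': b'...'}  # this shard holds only part of the data
docs = ['doc1', 'doc2', 'doc3']
prune_misses(docs, shard_a)
print(docs)  # ['doc1', 'doc3'] -- 'doc2' lives on another shard and is dropped here
```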
so @maximilianwerk we need to somehow adapt the traversal logic to handle this without removing it. Maybe we need to recover the recur_on parameter?
@hanxiao it disappeared because the way we now provide granularity_range and adjacency_range is limiting. It came from an unwanted loop on chunks before matches. So we will need to tackle this.
@maximilianwerk It would be great to add a unit test so that we ensure everyone is on the same page.
Yes, adding tests was on my agenda anyhow.
It is mostly caused by the limitation in how we traverse. In one case we traverse on chunks before traversing on matches, because we set granularity_range: [1, 1] and adjacency_range: [0, 1] to go for the KVSearchDriver. So it is more or less a bug: the driver first does an iteration on chunks and does not find anything, because it is a query chunk, and therefore deletes the chunk from the query. Later, when we traverse on matches, there is no chunk left to go down to.
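A toy reproduction of that failure mode (the `Doc` class and `kv_lookup` here are simplified stand-ins, not the real driver):

```python
# Simplified stand-in for the KVSearchDriver behaviour described above.
class Doc:
    def __init__(self, id, chunks=None, matches=None):
        self.id, self.chunks, self.matches = id, chunks or [], matches or []

kv_index = {'m1', 'm2'}  # only indexed documents have metadata entries

def kv_lookup(docs):
    # query chunks have no entry in the KV index, so they count as misses
    miss_idx = [i for i, d in enumerate(docs) if d.id not in kv_index]
    for j in reversed(miss_idx):
        del docs[j]

query = Doc('q', chunks=[Doc('c1', matches=[Doc('m1'), Doc('m2')])])
kv_lookup(query.chunks)   # chunk pass runs first: the query chunk 'c1' is deleted
print(len(query.chunks))  # 0 -- the later match pass has no chunk left to descend into
```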
Force-pushed from 65e38b1 to 5ce8c17
make sure to fix the commit lint error so that CI keeps tracking the health status of this PR
Force-pushed from 66aad3b to 3150db8
Codecov Report

@@            Coverage Diff             @@
##           master     #933      +/-   ##
==========================================
+ Coverage   77.07%   77.22%   +0.15%
==========================================
  Files          68       68
  Lines        5095     5138      +43
==========================================
+ Hits         3927     3968      +41
- Misses       1168     1170       +2

Continue to review full report at Codecov.
I would add an integration test to show that the issue found for multires search is solved.
Just being paranoid: can we have a couple of these tests run with Rankers? I feel they may have some differences with the others that may deserve testing. And these tests can work as a good reference and "documentation" (note the quotes).
* fix: add recur_on param back
* fix: fix dry_run use case
* fix: made matches and chunk traversal similar again

Co-authored-by: maximilianwerk <maximilian.werk@jina.ai>
Force-pushed from 2452815 to 1eba5c8
@@ -11,7 +11,7 @@ on:
 with:
   executor: BaseKVIndexer
   granularity_range: [0, 0]
-  adjacency_range: [0, 1]
+  adjacency_range: [1, 1]
This is a change which I am not utterly sure of, but it somehow makes sense: once you have retrieved the matches, you only collect the metadata for the matches and not for the query itself. BTW: getting the metadata for the query itself broke the hello-world.
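For reference, a toy sketch of what the two adjacency levels select (the helper is hypothetical, and it assumes the inclusive reading of the bounds that these configs imply):

```python
# Hypothetical helper: adjacency level 0 is the query doc itself,
# level 1 is its matches (assuming inclusive bounds, as the configs imply).
def select_by_adjacency(query, matches, adjacency_range):
    lo, hi = adjacency_range
    selected = []
    if lo <= 0 <= hi:
        selected.append(query)    # [0, x] would also look up the query's metadata
    if lo <= 1 <= hi:
        selected.extend(matches)  # includes the retrieved matches
    return selected

print(select_by_adjacency('q', ['m1', 'm2'], (1, 1)))  # ['m1', 'm2'] -- matches only
print(select_by_adjacency('q', ['m1', 'm2'], (0, 1)))  # ['q', 'm1', 'm2'] -- query too
```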
@@ -6,11 +6,9 @@ with:
 pods:
   encode:
     uses: $RESOURCE_DIR/helloworld.encoder.yml
-    parallel: $PARALLEL
Having parallel does not make sense, since the encoder saves the used matrix, and having two in parallel means they overwrite each other's matrix. It works during querying, but half of the indexed documents will never be retrieved, since they were encoded with a thrown-away matrix.
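A toy numpy illustration of that overwrite (hypothetical, not the hello-world code):

```python
import numpy as np

rng = np.random.default_rng(0)

# two parallel encoder replicas each draw (and save) their own random matrix
matrix_a = rng.standard_normal((4, 2))
matrix_b = rng.standard_normal((4, 2))

doc = rng.standard_normal(4)
indexed_embedding = doc @ matrix_a  # half the docs get encoded with matrix A

# only one matrix survives on disk: the other replica overwrites it
saved_matrix = matrix_b
query_embedding = doc @ saved_matrix  # at query time everything uses matrix B

# the same document no longer matches its own index entry
print(np.allclose(indexed_embedding, query_embedding))  # False
```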
oh! good catch!
index:
  uses: $RESOURCE_DIR/helloworld.indexer.yml
  shards: $SHARDS
  separated_workspace: true
  polling: all
  uses_after: $RESOURCE_DIR/helloworld.reduce.yml
I guess that was an artifact from before!?
reduce is needed if used in parallel, right? But it may use _reduce anyhow? Why did it need to be removed?
It just does not provide any value and was not working before. So I thought removing complexity from a hello-world is a good thing.
All in for reducing complexity, and since you found the problem with parallelism in the encoder, it makes it more obvious: the reducer only comes into play when parallelism is up.
this is required, do not remove
@@ -20,6 +20,6 @@ requests:
 start: 0
 end: 50
 granularity_range: [0, 0]
-adjacency_range: [0, 1]
+adjacency_range: [0, 0]
The new ranges are so much better than the old ones. This slice says: get me the top 50 in the first dimension, e.g. for the first 50 queries submitted. I like it.
Please append the breaking changes introduced by this PR to #885.
Some behaviors are unexpected. At least they are controversial against our definition of right-exclusive ranges:

:param granularity_range: right-exclusive range of the recursion depth, (0, 1) for root-level only
:param adjacency_range: right-exclusive range of the recursion adjacency, (0, 1) for single matches
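A small sketch of the discrepancy (the interpretation functions are hypothetical; the inclusive reading is an assumption derived from the configs in this PR, where [1, 1] clearly has to select level 1):

```python
# The docstring claims right-exclusive bounds, but the configs in this PR
# ([1, 1], [0, 0]) only select anything under an inclusive upper bound.
def levels_exclusive(rng):
    return list(range(rng[0], rng[1]))       # right-exclusive, per the docstring

def levels_inclusive(rng):
    return list(range(rng[0], rng[1] + 1))   # right-inclusive, per the configs

print(levels_exclusive([1, 1]))  # [] -- would select no level at all
print(levels_inclusive([1, 1]))  # [1] -- selects adjacency level 1 (the matches)
```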
@pytest.fixture(autouse=True)
def run_around_tests():
    yield
    rm_files(['vec1.gz', 'vec2.gz', 'chunk1.gz', 'chunk2.gz',
              'vecidx1.bin', 'vecidx2.bin', 'kvidx1.bin', 'kvidx2.bin'])
A good pattern for later use! 👍
We should get rid of the rm_files method and rather use pytest temporary folders. The rm_files wiped my whole home folder, since I used it a bit wrongly... I don't even know what I did wrong, since it was wiped... :D
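A minimal sketch of the pytest alternative (tmp_path is a built-in pytest fixture; the file names are just examples):

```python
import pytest

@pytest.fixture
def workspace(tmp_path):
    # tmp_path is a per-test pathlib.Path created (and eventually cleaned up)
    # by pytest, so no hand-rolled rm_files call can touch files outside of it
    return tmp_path

def test_indexer_writes_to_workspace(workspace):
    vec_file = workspace / 'vec1.gz'
    vec_file.write_bytes(b'fake vector dump')
    assert vec_file.exists()
```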
There are two things to change. @maximilianwerk
Minor naming issue.
Good catch with the right-inclusivity. 👍
I think what is needed now is good documentation and a visual explanation of how to set the ranges correctly. I'll look into that after the Hackathon.
Done.
This is needed for making the lyrics demo work again. Be aware that this will break tests and is meant as a proposal. The deletion of non-found chunks might be crucial for some applications, and switching it on/off might be a parameter.
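If the on/off parameter route is taken, a sketch of what it could look like (entirely hypothetical; not what this PR implements):

```python
# Hypothetical driver flag, only to illustrate the proposal above.
class KVSearchDriverSketch:
    def __init__(self, prune_misses: bool = True):
        # True keeps today's sharding behaviour; False preserves query chunks
        self.prune_misses = prune_misses

    def lookup(self, docs: list, store: dict) -> None:
        miss_idx = [i for i, d in enumerate(docs) if d not in store]
        if self.prune_misses:
            for j in reversed(miss_idx):
                del docs[j]

docs = ['c1', 'm1']
KVSearchDriverSketch(prune_misses=False).lookup(docs, {'m1': b'...'})
print(docs)  # ['c1', 'm1'] -- the query chunk survives the lookup
```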