refactor: added fast traversal with structure #1950

maximilianwerk · 2021-02-16T10:39:43Z

This PR enables changes at the chunk and match level, even with the FastRecursiveMixin.
Discussion point (opinions welcome!):

Renaming docs to leaves in _apply_all. Reasoning: the name matches the leaves of the tree traversal, so it is semantically more precise than the generic docs. That said, I am not fully convinced about the renaming.

Some tests still need fixing; that will happen throughout the day.

tests/unit/drivers/test_encoder_driver.py

github-actions · 2021-02-16T10:50:57Z

Latency summary

Current PR yields:

😶 index QPS at 1212, delta to last 3 avg.: +0%
😶 query QPS at 20, delta to last 3 avg.: -3%

Breakdown

Version	Index QPS	Query QPS
current	1212	20
`1.0.7`	1224	20
`1.0.6`	1222	20

Backed by latency-tracking. Further commits will update this comment.

jina/drivers/__init__.py

codecov · 2021-02-16T14:58:33Z

Codecov Report

Merging #1950 (0757ab6) into master (989d068) will increase coverage by 0.11%.
The diff coverage is 99.26%.

@@            Coverage Diff             @@
##           master    #1950      +/-   ##
==========================================
+ Coverage   89.70%   89.81%   +0.11%     
==========================================
  Files         208      211       +3     
  Lines       11062    11054       -8     
==========================================
+ Hits         9923     9928       +5     
+ Misses       1139     1126      -13

Flag	Coverage Δ
daemon	`51.20% <48.16%> (+0.39%)`	⬆️
jina	`90.28% <99.25%> (+0.11%)`	⬆️

Flags with carried forward coverage won't be shown.

Impacted Files	Coverage Δ
jina/helper.py	`83.44% <71.42%> (-0.98%)`	⬇️
daemon/models/custom.py	`87.23% <100.00%> (+0.27%)`	⬆️
jina/drivers/__init__.py	`93.86% <100.00%> (+0.01%)`	⬆️
jina/drivers/convertdriver.py	`97.22% <100.00%> (ø)`
jina/drivers/craft.py	`100.00% <100.00%> (ø)`
jina/drivers/encode.py	`93.75% <100.00%> (-0.46%)`	⬇️
jina/drivers/evaluate.py	`100.00% <100.00%> (ø)`
jina/drivers/index.py	`96.15% <100.00%> (ø)`
jina/drivers/multimodal.py	`91.89% <100.00%> (ø)`
jina/drivers/predict.py	`88.70% <100.00%> (ø)`
... and 34 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

hanxiao

What is the reason for changing from DocumentSet to an Iterable of DocumentSet? This PR basically reverts the effort of refactor(driver): use traverse from document type #1938 and entangles traversal and applying again.
The real problem is not FastRecursiveMixin; the Drivers that can leverage FastRecursiveMixin are good enough. The problem is the remaining drivers: whether we can push one step further and unify their interface with docs.traverse(). However, in this PR I don't see related changes to the legacy RecursiveMixin.
I'm against calling it leaves. The reason is that recursion and applying are disentangled in the FastRecursiveMixin, and that is the idea behind the design discussed in _traverse_apply makes all BaseRecursiveDriver low-efficient #1932.
In general, FastRecursiveMixin from refactor(driver): use traverse from document type #1938 and the drivers that use it are already good enough.

If there is a new traversal strategy, creating a new Mixin is a better way to go than changing FastRecursiveMixin. Maybe create a new mixin called MutableRecursiveMixin, with the current FastRecursiveMixin considered the ImmutableRecursiveMixin.

JoanFM · 2021-02-17T06:31:28Z

The point is to have only one source of traversal logic, and the more generic one, usable by every driver, is the one returning Iterable[DocumentSet]. Flattening before or afterwards has very few implications, in my opinion. Having two traversal logics could make further optimization and maintenance even more painful.
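To make the trade-off in this exchange concrete, here is a minimal sketch of a structure-preserving traversal next to a flattened one. It assumes a toy `Doc` class with plain `chunks` and `matches` lists and a simplified path string (`'c'` for chunks, `'m'` for matches); `traverse` and `traverse_flat` are hypothetical helpers, not Jina's actual implementation.

```python
from dataclasses import dataclass, field
from itertools import chain
from typing import Iterator, List


@dataclass
class Doc:
    """Hypothetical stand-in for a Document; not Jina's class."""
    id: str
    chunks: List['Doc'] = field(default_factory=list)
    matches: List['Doc'] = field(default_factory=list)


def traverse(docs: List[Doc], path: str) -> Iterator[List[Doc]]:
    """Yield one list per parent reached by `path` (empty path = the root list)."""
    if not path:
        yield docs
        return
    step, rest = path[0], path[1:]
    for doc in docs:
        children = doc.chunks if step == 'c' else doc.matches
        yield from traverse(children, rest)


def traverse_flat(docs: List[Doc], path: str) -> List[Doc]:
    """Flatten the per-parent lists into one list; the grouping is lost."""
    return list(chain.from_iterable(traverse(docs, path)))


root = [Doc('a', chunks=[Doc('a1'), Doc('a2')]), Doc('b', chunks=[Doc('b1')])]
per_parent = [[d.id for d in seq] for seq in traverse(root, 'c')]
flat = [d.id for d in traverse_flat(root, 'c')]

# Each yielded list is the parent's own container, so in-place edits stick.
for seq in traverse(root, 'c'):
    del seq[1:]
remaining = [[d.id for d in doc.chunks] for doc in root]
```

In this sketch the flattened result is enough for read-only work, while deleting or reordering children needs the per-parent lists, which is the case the Iterable[DocumentSet] interface is argued for above.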

jina/drivers/__init__.py

JoanFM

It is looking really good!

jina/drivers/__init__.py

jina/drivers/evaluate.py

jina/types/sets/chunk.py

JoanFM · 2021-02-19T23:05:39Z

jina/types/sets/traversable.py

+    from ..document.traversable import Traversable
+
+
+class TraversableSequence:


I like this idea a lot!

JoanFM · 2021-02-19T23:06:42Z

tests/unit/drivers/test_reduce_all_driver.py

@@ -67,3 +68,27 @@ def validate(req):
        flow.index(input_fn=input_fn, on_done=response_mock)

    response_mock.assert_called()
+
+
+def test_reduce_all_root_chunks(mocker, docs):


This should, as far as possible, go into an integration test. Try not to add tests that use Flow to the unit tests.

It can be done in another PR though.

While I see your point, this was here before. I'd leave it out of this PR for now, so as not to further obscure what actually happens in this PR.

florian-hoenicke · 2021-02-24T09:19:13Z

jina/drivers/__init__.py

-
-    def __call__(self, *args, **kwargs):
-        """Call the Driver.
+        """Apply function works on a list of list of docs, modify the docs in-place.


Is it a list of docs instead of a list of lists of docs?

It is an Iterable[DocumentSet], i.e. an Iterable over Chunks or Matches, each of which is a list of Documents.

Then we should change the docstring, I guess:
iterable over docsets

jina/types/sets/traversable.py

jina/drivers/reduce.py

jina/drivers/search.py

jina/drivers/reduce.py

maximilianwerk · 2021-03-03T22:26:01Z

@JoanFM I needed to change some imports to make the mock of random_port work effectively everywhere. That is a subtle change, and I will dig up more reading material for the team about import structuring in the future.

nan-wang

LGTM👍

nan-wang · 2021-03-04T07:46:30Z

daemon/models/custom.py

@@ -38,9 +39,9 @@ def _get_pydantic_fields(parser: Callable[..., 'argparse.ArgumentParser']):
                    default_factory=random_identity,
                    example=a['default'],
                    description=a['help'])
-            elif a['default_factory'] == random_port.__name__:
+            elif a['default_factory'] == helper.random_port.__name__:


Minor comment: to my understanding, from jina.helper import random_port is more efficient.

But it is not mockable if the import happens before the mock is defined.

And efficiency in these imports is not of real importance. I first want to see in numbers how much efficiency increases before we keep these dirty imports.
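The mockability point can be reproduced in isolation. The sketch below uses a throwaway module named `demo_helper` (a hypothetical stand-in, not `jina.helper`) and `unittest.mock.patch` to show that a name bound with `from module import name` keeps pointing at the original function, while an attribute lookup on the module sees the patch.

```python
import sys
import types
from unittest import mock

# Build a throwaway module standing in for a helper module (hypothetical name).
helper = types.ModuleType('demo_helper')
helper.random_port = lambda: 12345
sys.modules['demo_helper'] = helper

# Style A: bind the function object to a local name at import time.
from demo_helper import random_port

# Style B: keep a reference to the module and look the attribute up per call.
import demo_helper


def port_via_name():
    return random_port()


def port_via_module():
    return demo_helper.random_port()


# Patching replaces the attribute on the module, not names bound earlier.
with mock.patch('demo_helper.random_port', return_value=1):
    patched_name = port_via_name()      # still calls the original function
    patched_module = port_via_module()  # calls the mock
```

Patching the name in every importing module would also work, but the module-attribute lookup lets a single patch cover all call sites, which seems to be the motivation for the `helper.random_port` change in this diff.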

nan-wang · 2021-03-04T08:00:33Z

jina/drivers/__init__.py

-
-        :param *args: *args for ``_traverse_apply``
-        :param **kwargs: **kwargs for ``_traverse_apply``
+        :param doc_sequences: the Documents that should be handled


As modify is used in the paragraphs above, I'd suggest staying consistent.

Suggested change

-        :param doc_sequences: the Documents that should be handled
+        :param doc_sequences: the Documents that should be modified

nan-wang · 2021-03-04T08:09:44Z

jina/drivers/querylang/filter.py

+        for docs in doc_sequences:
+            if self.lookups:
+                _lookups = Q(**self.lookups)
+                miss_idx = []
+                for idx, doc in enumerate(docs):
+                    if not _lookups.evaluate(doc):
+                        miss_idx.append(idx)
+
+                # delete non-exit matches in reverse
+                for j in reversed(miss_idx):
+                    del docs[j]


Suggested change

-        for docs in doc_sequences:
-            if self.lookups:
-                _lookups = Q(**self.lookups)
-                miss_idx = []
-                for idx, doc in enumerate(docs):
-                    if not _lookups.evaluate(doc):
-                        miss_idx.append(idx)
-
-                # delete non-exit matches in reverse
-                for j in reversed(miss_idx):
-                    del docs[j]
+        if not self.lookups:
+            return
+        for docs in doc_sequences:
+            _lookups = Q(**self.lookups)
+            miss_idx = []
+            for idx, doc in enumerate(docs):
+                if not _lookups.evaluate(doc):
+                    miss_idx.append(idx)
+
+            # delete non-exit matches in reverse
+            for j in reversed(miss_idx):
+                del docs[j]
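The delete-in-reverse pattern used in this driver can be illustrated on plain lists. This is a minimal sketch, assuming a generic `keep` predicate in place of the `Q(**self.lookups)` evaluation; `filter_in_place` is a hypothetical helper name, not part of the PR.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar('T')


def filter_in_place(doc_sequences: Iterable[List[T]],
                    keep: Callable[[T], bool]) -> None:
    """Remove every item failing `keep` from each list, mutating the lists."""
    for docs in doc_sequences:
        # Collect the indices of items to drop in ascending order.
        miss_idx = [idx for idx, doc in enumerate(docs) if not keep(doc)]
        # Delete from the back so earlier indices stay valid.
        for j in reversed(miss_idx):
            del docs[j]


sequences = [[1, 2, 3, 4], [5, 6], []]
filter_in_place(sequences, keep=lambda x: x % 2 == 0)
```

Deleting in ascending order would shift the remaining items left after each deletion and make the stored indices point at the wrong elements, which is why the indices are walked in reverse.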

nan-wang · 2021-03-04T08:27:32Z

jina/drivers/querylang/slice.py

+            if self.start <= 0 and (self.end is None or self.end >= len(docs)):
+                pass
+            else:
+                del docs[int(self.end):]
+                del docs[:int(self.start)]


Suggested change

-            if self.start <= 0 and (self.end is None or self.end >= len(docs)):
-                pass
-            else:
-                del docs[int(self.end):]
-                del docs[:int(self.start)]
+            if self.start > 0 or (self.end is not None and self.end < len(docs)):
+                del docs[int(self.end):]
+                del docs[:int(self.start)]
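One detail worth checking in both variants: if `end` can be `None` while `start > 0`, as the condition suggests, `int(self.end)` would raise a `TypeError`. The following is a sketch of a `None`-safe in-place slice on a plain list; `slice_in_place` is a hypothetical helper, not the driver's code, and it assumes non-negative `start` and `end`.

```python
from typing import List, Optional


def slice_in_place(docs: List, start: int = 0, end: Optional[int] = None) -> None:
    """Keep only docs[start:end], mutating the list instead of copying it."""
    # Trim the tail first so that `end` still refers to original positions.
    if end is not None:
        del docs[int(end):]
    # Then trim the head; skipped when there is nothing to remove.
    if start > 0:
        del docs[:int(start)]


a = list(range(10))
slice_in_place(a, start=2, end=5)

b = list(range(5))
slice_in_place(b, start=2)  # end is None: only the head is trimmed

c = list(range(3))
slice_in_place(c, end=10)   # end beyond the length: nothing is removed
```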

maximilianwerk requested a review from a team as a code owner February 16, 2021 10:39

maximilianwerk requested review from davidbp and imsergiy February 16, 2021 10:39

maximilianwerk marked this pull request as draft February 16, 2021 10:39

jina-bot added size/M, area/core, area/testing, component/driver, component/type labels Feb 16, 2021

JoanFM suggested changes Feb 16, 2021

tests/unit/drivers/test_encoder_driver.py

florian-hoenicke requested changes Feb 16, 2021

jina/drivers/__init__.py Outdated

maximilianwerk force-pushed the refactor-fast-traversal branch from c8dc307 to 3b4d05e Compare February 16, 2021 14:47

hanxiao previously requested changes Feb 17, 2021

JoanFM previously requested changes Feb 17, 2021

jina/drivers/__init__.py Outdated

jina/drivers/__init__.py Outdated

jina-bot added size/XL and removed size/M labels Feb 19, 2021

JoanFM reviewed Feb 19, 2021

maximilianwerk force-pushed the refactor-fast-traversal branch from aa2c7ed to f39fc5e Compare February 19, 2021 23:08

maximilianwerk marked this pull request as ready for review February 19, 2021 23:16

maximilianwerk force-pushed the refactor-fast-traversal branch from 9b86a42 to 0ce344b Compare February 22, 2021 11:43

florian-hoenicke reviewed Feb 24, 2021

CatStark reviewed Feb 24, 2021

jina/types/sets/traversable.py

florian-hoenicke reviewed Feb 24, 2021

jina/drivers/reduce.py

florian-hoenicke previously requested changes Feb 24, 2021

jina/drivers/search.py

florian-hoenicke reviewed Feb 24, 2021

jina/drivers/reduce.py

maximilianwerk force-pushed the refactor-fast-traversal branch 2 times, most recently from 8f5f6e3 to c402d16 Compare March 1, 2021 09:58

jina-bot added area/daemon, area/network, component/peapod labels Mar 3, 2021

maximilianwerk and others added 21 commits March 4, 2021 07:55

refactor: added fast traversal with structure

1e3b6c9

fix: document set requires sequences

e05b042

test: fix vector index and vector search

5024249

refactor: revert to old interface

8508449

refactor: kv search uses set recursive mixin

4675327

test: completed traversal

64379ef

feat: almost all driver adapted

eac1518

refactor: completely removed recursivemixin

b6a40cc

refactor: moved docs property to base driver

4873821

fix: repaired tests

46da98a

fix: docstrings

4a21cda

refactor: added traversal per path

e46b80b

docs: fix docstring linter errors

eb65a31

fix: random ports are now not double sampled

d8e9845

fix: port type

4ed6e67

fix: test and sync ci and cd again

28b781c

test: moved port retry to test fixture

1598980

test: fixed random port for daemon

4a4ca1a

fix: final cleanup

03f2ca1

test: make fixture more elegant

b4a5171

refactor: fixed mock related imports

0757ab6

maximilianwerk force-pushed the refactor-fast-traversal branch from b8d0b4e to 0757ab6 Compare March 4, 2021 06:57

JoanFM approved these changes Mar 4, 2021

nan-wang approved these changes Mar 4, 2021

nan-wang merged commit a151006 into master Mar 4, 2021

nan-wang deleted the refactor-fast-traversal branch March 4, 2021 10:58
