8000 refactor: added fast traversal with structure by maximilianwerk · Pull Request #1950 · jina-ai/serve · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

refactor: added fast traversal with structure #1950

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 21 commits into from
Mar 4, 2021
Merged

Conversation

maximilianwerk
Copy link
Member

This PR enables changes on chunks and matches level even with the FastRecursiveMixin.
Discussion point (opinions welcomed!):

  • renaming of docs to leaves in _apply_all. Reasoning: it fits to leaves in the tree traversal and thus it is a better semantic naming, then the general docs. Anyhow, I am not totally convinced about the renaming.

some tests still needs fixing. will happen throughout the day.

@maximilianwerk maximilianwerk requested a review from a team as a code owner February 16, 2021 10:39
@maximilianwerk maximilianwerk marked this pull request as draft February 16, 2021 10:39
@jina-bot jina-bot added size/M area/core This issue/PR affects the core codebase area/testing This issue/PR affects testing component/driver component/type labels Feb 16, 2021
@github-actions
Copy link
github-actions bot commented Feb 16, 2021

Latency summary

Current PR yields:

  • 😶 index QPS at 1212, delta to last 3 avg.: +0%
  • 😶 query QPS at 20, delta to last 3 avg.: -3%

Breakdown

Version Index QPS Query QPS
current 1212 20
1.0.7 1224 20
1.0.6 1222 20

Backed by latency-tracking. Further commits will update this comment.

@maximilianwerk maximilianwerk force-pushed the refactor-fast-traversal branch from c8dc307 to 3b4d05e Compare February 16, 2021 14:47
@codecov
Copy link
codecov bot commented Feb 16, 2021

Codecov Report

Merging #1950 (0757ab6) into master (989d068) will increase coverage by 0.11%.
The diff coverage is 99.26%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1950      +/-   ##
==========================================
+ Coverage   89.70%   89.81%   +0.11%     
==========================================
  Files         208      211       +3     
  Lines       11062    11054       -8     
==========================================
+ Hits         9923     9928       +5     
+ Misses       1139     1126      -13     
Flag Coverage Δ
daemon 51.20% <48.16%> (+0.39%) ⬆️
jina 90.28% <99.25%> (+0.11%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
jina/helper.py 83.44% <71.42%> (-0.98%) ⬇️
daemon/models/custom.py 87.23% <100.00%> (+0.27%) ⬆️
jina/drivers/__init__.py 93.86% <100.00%> (+0.01%) ⬆️
jina/drivers/convertdriver.py 97.22% <100.00%> (ø)
jina/drivers/craft.py 100.00% <100.00%> (ø)
jina/drivers/encode.py 93.75% <100.00%> (-0.46%) ⬇️
jina/drivers/evaluate.py 100.00% <100.00%> (ø)
jina/drivers/index.py 96.15% <100.00%> (ø)
jina/drivers/multimodal.py 91.89% <100.00%> (ø)
jina/drivers/predict.py 88.70% <100.00%> (ø)
... and 34 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 989d068...0757ab6. Read the comment docs.

Copy link
Member
@hanxiao hanxiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there are some new traversal strategy, create a new Mixin is a better way to go than changing FastRecursiveMixin, maybe create a new mixin called MutableRecursiveMixin, whereas the current FastRecursiveMixin is considered as ImmutableRecursiveMixin.

@JoanFM
Copy link
Contributor
JoanFM commented Feb 17, 2021

The point is to have only one source of traversal logic, and the more generic one that can be used for every driver is the one returning Iterable[DocumentSet]. Flattening before or afterwards have very little implications in my opinion. The cost of having two traversal logics can make even more painful further optimizations or mantainance in my opinion.

JoanFM
JoanFM previously requested changes Feb 17, 2021
@jina-bot jina-bot added size/XL and removed size/M labels Feb 19, 2021
Copy link
Contributor
@JoanFM JoanFM left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is looking really good!

from ..document.traversable import Traversable


class TraversableSequence:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like a lot this idea!

@@ -67,3 +68,27 @@ def validate(req):
flow.index(input_fn=input_fn, on_done=response_mock)

response_mock.assert_called()


def test_reduce_all_root_chunks(mocker, docs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should go with best possible effort into an integration test. Try not to add tests with Flow in unittests

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can be done in another PR though

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I see your point, this was here before. Not part of this PR for now, to not obfuscate more, what actually happens in this PR.

@maximilianwerk maximilianwerk force-pushed the refactor-fast-traversal branch from aa2c7ed to f39fc5e Compare February 19, 2021 23:08
@maximilianwerk maximilianwerk marked this pull request as ready for review February 19, 2021 23:16
@maximilianwerk maximilianwerk force-pushed the refactor-fast-traversal branch from 9b86a42 to 0ce344b Compare February 22, 2021 11:43

def __call__(self, *args, **kwargs):
"""Call the Driver.
"""Apply function works on a list of list of docs, modify the docs in-place.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it a list of docs instead of a list of list of docs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is an Iterable[DocumentSet] which is an Iterable over Chunks or Matches. Witch are a list of Documents.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then, we should change the docstring I guess:
iterable over docsets

@maximilianwerk maximilianwerk force-pushed the 9E88 refactor-fast-traversal branch 2 times, most recently from 8f5f6e3 to c402d16 Compare March 1, 2021 09:58
@jina-bot jina-bot added area/daemon area/network This issue/PR affects network functionality component/peapod labels Mar 3, 2021
@maximilianwerk
Copy link
Member Author

@JoanFM I needed to change some imports in order to make the mock of random_port work effectively everywhere. That is a subtle change and I will dig up more reading material for the team abort import structuring in the future.

@maximilianwerk maximilianwerk force-pushed the refactor-fast-traversal branch from b8d0b4e to 0757ab6 Compare March 4, 2021 06:57
Copy link
Member
@nan-wang nan-wang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM👍

@@ -38,9 +39,9 @@ def _get_pydantic_fields(parser: Callable[..., 'argparse.ArgumentParser']):
default_factory=random_identity,
example=a['default'],
description=a['help'])
elif a['default_factory'] == random_port.__name__:
elif a['default_factory'] == helper.random_port.__name__:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments. To my understanding, from jina.helper import random_port is more efficient.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it is not mockable, if the import happens before the mock was defined.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And efficiency in these imports is not important from real importance. I first want to see in numbers, how much efficiency increases, before we keep these dirty imports.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree


:param *args: *args for ``_traverse_apply``
:param **kwargs: **kwargs for ``_traverse_apply``
:param doc_sequences: the Documents that should be handled
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As m CEB7 odify is used in the above paragraphs, I'd suggest staying consistent

Suggested change
:param doc_sequences: the Documents that should be handled
:param doc_sequences: the Documents that should be modified

Comment on lines +41 to +51
for docs in doc_sequences:
if self.lookups:
_lookups = Q(**self.lookups)
miss_idx = []
for idx, doc in enumerate(docs):
if not _lookups.evaluate(doc):
miss_idx.append(idx)

# delete non-exit matches in reverse
for j in reversed(miss_idx):
del docs[j]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for docs in doc_sequences:
if self.lookups:
_lookups = Q(**self.lookups)
miss_idx = []
for idx, doc in enumerate(docs):
if not _lookups.evaluate(doc):
miss_idx.append(idx)
# delete non-exit matches in reverse
for j in reversed(miss_idx):
del docs[j]
if not self.lookups:
return
for docs in doc_sequences:
_lookups = Q(**self.lookups)
miss_idx = []
for idx, doc in enumerate(docs):
if not _lookups.evaluate(doc):
miss_idx.append(idx)
# delete non-exit matches in reverse
for j in reversed(miss_idx):
del docs[j]

Comment on lines +54 to +58
if self.start <= 0 and (self.end is None or self.end >= len(docs)):
pass
else:
del docs[int(self.end):]
del docs[:int(self.start)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if self.start <= 0 and (self.end is None or self.end >= len(docs)):
pass
else:
del docs[int(self.end):]
del docs[:int(self.start)]
if self.start > 0 or (self.end is not None and self.end < len(docs)):
del docs[int(self.end):]
del docs[:int(self.start)]

@nan-wang nan-wang merged commit a151006 into master Mar 4, 2021
@nan-wang nan-wang deleted the refactor-fast-traversal branch March 4, 2021 10:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/cicd This issue/PR affects the cicd pipeline area/core This issue/PR affects the core codebase area/helper This issue/PR affects the helper functionality area/housekeeping This issue/PR is housekeeping area/network This issue/PR affects network functionality area/testing This issue/PR affects testing component/driver component/peapod component/type size/XL
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants
0