-
Notifications
You must be signed in to change notification settings - Fork 2.2k
refactor: added fast traversal with structure #1950
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Latency summaryCurrent PR yields:
Breakdown
Backed by latency-tracking. Further commits will update this comment. |
c8dc307
to
3b4d05e
Compare
Codecov Report
@@ Coverage Diff @@
## master #1950 +/- ##
==========================================
+ Coverage 89.70% 89.81% +0.11%
==========================================
Files 208 211 +3
Lines 11062 11054 -8
==========================================
+ Hits 9923 9928 +5
+ Misses 1139 1126 -13
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- what's the reason behind the change from DocumentSet to Iterable of DocumentSet? This PR basically reverts the effort of refactor(driver): use traverse from document type #1938 and entangled the traversal & applying again.
- The real problem is not about
FastRecursiveMixin
, those Drivers that can leverageFastRecursiveMixin
are good enough. The problem is about the remaining drivers: if we can push one more step and unify their interface withdocs.traverse()
However, in this PR I don't see related changes on the legacyRecursiveMixin
. - I'm against calling it
leaves
, reason is that recursion and applying are detangled in theFastRecursiveMixin
and that's the idea in the design of_traverse_apply
makes all BaseRecursiveDriver low-efficient #1932. - In general,
FastRecursiveMixin
in refactor(driver): use traverse from document type #1938 and drivers used it are good enough already.
If there are some new traversal strategy, create a new Mixin
is a better way to go than changing FastRecursiveMixin
, maybe create a new mixin called MutableRecursiveMixin
, whereas the current FastRecursiveMixin
is considered as ImmutableRecursiveMixin
.
The point is to have only one source of traversal logic, and the more generic one that can be used for every driver is the one returning Iterable[DocumentSet]. Flattening before or afterwards have very little implications in my opinion. The cost of having two traversal logics can make even more painful further optimizations or mantainance in my opinion. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is looking really good!
from ..document.traversable import Traversable | ||
|
||
|
||
class TraversableSequence: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like a lot this idea!
@@ -67,3 +68,27 @@ def validate(req): | |||
flow.index(input_fn=input_fn, on_done=response_mock) | |||
|
|||
response_mock.assert_called() | |||
|
|||
|
|||
def test_reduce_all_root_chunks(mocker, docs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should go with best possible effort into an integration
test. Try not to add tests with Flow
in unittests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can be done in another PR though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I see your point, this was here before. Not part of this PR for now, to not obfuscate more, what actually happens in this PR.
aa2c7ed
to
f39fc5e
Compare
9b86a42
to
0ce344b
Compare
jina/drivers/__init__.py
Outdated
|
||
def __call__(self, *args, **kwargs): | ||
"""Call the Driver. | ||
"""Apply function works on a list of list of docs, modify the docs in-place. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it a list of docs instead of a list of list of docs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is an Iterable[DocumentSet]
which is an Iterable
over Chunks or Matches. Witch are a list of Documents.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then, we should change the docstring I guess:
iterable over docsets
8f5f6e3
to
c402d16
Compare
@JoanFM I needed to change some imports in order to make the |
b8d0b4e
to
0757ab6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM👍
@@ -38,9 +39,9 @@ def _get_pydantic_fields(parser: Callable[..., 'argparse.ArgumentParser']): | |||
default_factory=random_identity, | |||
example=a['default'], | |||
description=a['help']) | |||
elif a['default_factory'] == random_port.__name__: | |||
elif a['default_factory'] == helper.random_port.__name__: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor comments. To my understanding, from jina.helper import random_port
is more efficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But it is not mockable, if the import happens before the mock was defined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And efficiency in these imports is not important from real importance. I first want to see in numbers, how much efficiency increases, before we keep these dirty imports.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree
|
||
:param *args: *args for ``_traverse_apply`` | ||
:param **kwargs: **kwargs for ``_traverse_apply`` | ||
:param doc_sequences: the Documents that should be handled |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As m
CEB7
odify
is used in the above paragraphs, I'd suggest staying consistent
:param doc_sequences: the Documents that should be handled | |
:param doc_sequences: the Documents that should be modified |
for docs in doc_sequences: | ||
if self.lookups: | ||
_lookups = Q(**self.lookups) | ||
miss_idx = [] | ||
for idx, doc in enumerate(docs): | ||
if not _lookups.evaluate(doc): | ||
miss_idx.append(idx) | ||
|
||
# delete non-exit matches in reverse | ||
for j in reversed(miss_idx): | ||
del docs[j] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for docs in doc_sequences: | |
if self.lookups: | |
_lookups = Q(**self.lookups) | |
miss_idx = [] | |
for idx, doc in enumerate(docs): | |
if not _lookups.evaluate(doc): | |
miss_idx.append(idx) | |
# delete non-exit matches in reverse | |
for j in reversed(miss_idx): | |
del docs[j] | |
if not self.lookups: | |
return | |
for docs in doc_sequences: | |
_lookups = Q(**self.lookups) | |
miss_idx = [] | |
for idx, doc in enumerate(docs): | |
if not _lookups.evaluate(doc): | |
miss_idx.append(idx) | |
# delete non-exit matches in reverse | |
for j in reversed(miss_idx): | |
del docs[j] |
if self.start <= 0 and (self.end is None or self.end >= len(docs)): | ||
pass | ||
else: | ||
del docs[int(self.end):] | ||
del docs[:int(self.start)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if self.start <= 0 and (self.end is None or self.end >= len(docs)): | |
pass | |
else: | |
del docs[int(self.end):] | |
del docs[:int(self.start)] | |
if self.start > 0 or (self.end is not None and self.end < len(docs)): | |
del docs[int(self.end):] | |
del docs[:int(self.start)] |
This PR enables changes on chunks and matches level even with the
FastRecursiveMixin
.Discussion point (opinions welcomed!):
docs
toleaves
in_apply_all
. Reasoning: it fits to leaves in the tree traversal and thus it is a better semantic naming, then the generaldocs
. Anyhow, I am not totally convinced about the renaming.some tests still needs fixing. will happen throughout the day.