-
Notifications
You must be signed in to change notification settings - Fork 2.2k
fix: put siblings back #2234
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: put siblings back #2234
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2234 +/- ##
==========================================
- Coverage 90.59% 87.40% -3.19%
==========================================
Files 209 209
Lines 11352 11372 +20
==========================================
- Hits 10284 9940 -344
- Misses 1068 1432 +364
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
Latency summaryCurrent PR yields:
Breakdown
Backed by latency-tracking. Further commits will update this comment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe need to add tests?
Co-authored-by: Nan Wang <nan.wang@jina.ai>
76e84da
to
1cb3d52
Compare
@Yongxuanzhang thanks, I added the test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also add an integration test? Or adapt an existing one with segmentation to assert nr of siblings?
jina/drivers/segment.py
Outdated
@@ -54,8 +54,9 @@ def _apply_all(self, docs: 'DocumentSet', *args, **kwargs): | |||
|
|||
@staticmethod | |||
def _add_chunks(doc, chunks): | |||
num_siblings = len(chunks) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
num_siblings = len(chunks) | |
num_siblings = len(chunks) + len(doc.chunks) |
Co-authored-by: Nan Wang <nan.wang@jina.ai>
0d39275
to
3cba88b
Compare
Rebased to fix broken test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we recompute the siblings in the ChunkSet
and add it in every add
operation?
@@ -54,8 +54,9 @@ def _apply_all(self, docs: 'DocumentSet', *args, **kwargs): | |||
|
|||
@staticmethod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add an assertion in the segment driver test to see if the resulting chunks have the right amount of siblings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now I added a test.
@@ -48,17 +56,41 @@ def append(self, document: 'Document', **kwargs) -> 'Document': | |||
chunk.update_content_hash() | |||
return chunk | |||
|
|||
def extend(self, iterable: Iterable['Document']) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the sibling update functionality to the extend method. It is a bit inconsistent because append does not update the siblings. But if we would update this attribute for append as well it could slow down the system when having many chunks. We could argue that the system is slowed down anyways if we have a lot of chunks and have the doc.siblings = len(self)
also in the append.
Anyways, I think there is the goal to remove siblings in the long term. So
A3E2
thinking about all the cases might not be necessary
@cristianmtr I've added an integration test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM👍
Removing the length attribute broke the BiMatchRanker.
This pr restores the length attribute and renames it to siblings.