feat: add DocumentArrayMemmap by hanxiao · Pull Request #2579 · jina-ai/serve · GitHub

feat: add DocumentArrayMemmap #2579

Merged: 7 commits, Jun 8, 2021

hanxiao (Member) commented Jun 8, 2021

Content

This PR adds DocumentArrayMemmap, a memory-mapped (memmap) type that enables access to a large on-disk DocumentArray.
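
To make the general idea concrete, here is a minimal sketch of reading length-prefixed records through a memory map using only the Python standard library. It is not the implementation in this PR: the file layout, `write_records`, and `iter_records` are hypothetical names chosen for illustration.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical toy format, NOT jina's on-disk layout:
# each record is a 4-byte little-endian length followed by the payload.
HEADER = struct.Struct('<I')


def write_records(path, records):
    """Write each bytes record to `path` with a length prefix."""
    with open(path, 'wb') as f:
        for rec in records:
            f.write(HEADER.pack(len(rec)))
            f.write(rec)


def iter_records(path):
    """Yield records one at a time from a read-only memory map."""
    with open(path, 'rb') as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        pos = 0
        while pos < len(mm):
            (size,) = HEADER.unpack_from(mm, pos)
            pos += HEADER.size
            yield mm[pos:pos + size]  # slicing copies only this record
            pos += size


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'records.bin')
    write_records(path, [f'doc-{i}'.encode() for i in range(1000)])

    count, last = 0, None
    for rec in iter_records(path):
        count += 1
        last = rec

print(count, last)
```

Only the record currently being yielded is copied into a Python object; the rest of the file stays on disk until the operating system pages it in.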

Benchmark results

Write speed on 50,000 random docs (10K docs, each with 5 chunks): 1.14x speedup

jina git:(feat-da-memmap) ✗ python toy9.py
da ...	da takes 0 seconds (0.71s)
dam ...	dam takes 0 seconds (0.62s)

Read speed on 50,000 random docs (10K docs, each with 5 chunks): 1.81x speedup

jina git:(feat-da-memmap) ✗ python toy9.py
da ...	da takes 0 seconds (0.20s)
dam ...	dam takes 0 seconds (0.11s)
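
The speedup figures can be sanity-checked from the logged timings. The snippet below simply divides the `da` time by the `dam` time for each run; because the logged values are already rounded to two decimals, the recomputed ratios may differ from the quoted figures in the last digit.

```python
# Timings in seconds, copied from the logged write and read runs.
write_da, write_dam = 0.71, 0.62
read_da, read_dam = 0.20, 0.11

# Speedup = baseline time / new time.
write_speedup = write_da / write_dam
read_speedup = read_da / read_dam

print(f'write speedup: {write_speedup:.2f}x')
print(f'read speedup:  {read_speedup:.2f}x')
```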

Memory usage when loading: DAM adds a roughly constant 20 MiB, whereas DA adds about 342 MiB for this dataset and grows linearly with the number of docs.

jina git:(feat-da-memmap) ✗ python -m memory_profiler toy9.py
Filename: toy9.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    21    246.5 MiB    246.5 MiB           1   @profile
    22                                         def load_da():
    23    588.7 MiB    342.2 MiB           1       da = DocumentArray.load_binary('a.bin')
    24    588.8 MiB      0.1 MiB       10001       for _ in da:
    25    588.8 MiB      0.0 MiB       10000           pass
jina git:(feat-da-memmap) ✗ python -m memory_profiler toy9.py
Filename: toy9.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    28    246.0 MiB    246.0 MiB           1   @profile
    29                                         def load_dam():
    30    266.7 MiB     20.7 MiB           1       dam = DocumentArrayMemmap('./tmp/')
    31    267.6 MiB      0.9 MiB       10001       for _ in dam:
    32    267.6 MiB      0.0 MiB       10000           pass
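
The same eager-versus-lazy contrast can be reproduced without jina. The sketch below uses the standard library's `tracemalloc` to compare peak allocations when all records are materialized in a list against iterating them one at a time. `make_records` is a hypothetical stand-in for deserializing documents, and the absolute numbers are not comparable to the profiles above.

```python
import tracemalloc


def make_records(n):
    """Hypothetical stand-in for deserializing `n` documents from disk."""
    for _ in range(n):
        yield bytes(1024)  # 1 KiB payload per record


def peak_bytes(fn):
    """Return the peak traced allocation (in bytes) while running `fn`."""
    tracemalloc.start()
    try:
        fn()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def eager(n=10_000):
    docs = list(make_records(n))  # every record is resident at once
    for _ in docs:
        pass


def lazy(n=10_000):
    for _ in make_records(n):  # only one record is resident at a time
        pass


eager_peak = peak_bytes(eager)
lazy_peak = peak_bytes(lazy)
print(f'eager peak: {eager_peak / 2**20:.1f} MiB')
print(f'lazy peak:  {lazy_peak / 2**20:.3f} MiB')
```

The eager peak scales with the number of records, while the lazy peak stays near the size of a single record.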

Appendix: Benchmark script

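from memory_profiler import profile

from jina import DocumentArray
from jina.logging.profile import TimeContext
from jina.types.arrays.memmap import DocumentArrayMemmap
from tests import random_docs  # test helper; assumes the script runs from the jina repo root


def write():
    # Time writing the same docs with DocumentArray vs. DocumentArrayMemmap.
    docs = list(random_docs(10000))
    with TimeContext('da'):
        da = DocumentArray(docs)
        da.save_binary('a.bin')

    with TimeContext('dam'):
        dam = DocumentArrayMemmap('./tmp/')
        dam.clear()  # empty the store so earlier runs do not affect the result
        dam.extend(docs)


def read():
    # Time loading what write() produced; run write() once beforehand.
    with TimeContext('da'):
        da = DocumentArray.load_binary('a.bin')

    with TimeContext('dam'):
        dam = DocumentArrayMemmap('./tmp/')


@profile
def load_da():
    # Memory profile: load the whole binary file, then iterate.
    da = DocumentArray.load_binary('a.bin')
    for _ in da:
        pass


@profile
def load_dam():
    # Memory profile: open the memmap store, then iterate.
    dam = DocumentArrayMemmap('./tmp/')
    for _ in dam:
        pass


if __name__ == '__main__':
    # Swap in write(), load_da() or load_dam() for the other measurements.
    read()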
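
The script depends on jina's `TimeContext` and the repository's `tests.random_docs` helper. To reuse the timing pattern outside the jina repo, a standard-library context manager along these lines could serve. `timed` is a hypothetical name and makes no claim to match `TimeContext`'s behavior or output format.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Hypothetical stand-in for a timing context; not jina's TimeContext."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed  # keep the raw value for later comparison
        print(f'{label} takes {elapsed:.2f}s')


timings = {}
with timed('sum', timings):
    total = sum(range(1_000_000))
```

Collecting the raw elapsed values in a dict avoids recomputing speedups from rounded log output.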

hanxiao requested a review from a team as a code owner June 8, 2021 01:57
hanxiao requested review from nan-wang and jakobkruse1 June 8, 2021 01:57
hanxiao changed the title from "Feat da memmap" to "feat: add DocumentArrayMemmap" Jun 8, 2021
jina-bot added the size/L, area/core (this issue/PR affects the core codebase), area/testing (this issue/PR affects testing), and component/type labels Jun 8, 2021
codecov (bot) commented Jun 8, 2021

Codecov Report

Merging #2579 (48b7381) into master (502a7db) will increase coverage by 0.39%.
The diff coverage is 93.46%.

@@            Coverage Diff             @@
##           master    #2579      +/-   ##
==========================================
+ Coverage   86.03%   86.42%   +0.39%     
==========================================
  Files         152      153       +1     
  Lines        9507     9636     +129     
==========================================
+ Hits         8179     8328     +149     
+ Misses       1328     1308      -20     
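
The percentages in the diff follow from the hit and line counts. The sketch below recomputes them, assuming percentages are truncated (not rounded) to two decimals; that assumption is chosen only because it reproduces the figures shown.

```python
import math


def coverage_pct(hits, lines):
    """Coverage percentage truncated to two decimals (assumed convention)."""
    return math.floor(hits / lines * 10_000) / 100


before = coverage_pct(8179, 9507)  # master
after = coverage_pct(8328, 9636)   # with this PR
delta = round(after - before, 2)

# Misses are the lines that were not hit.
misses_before = 9507 - 8179
misses_after = 9636 - 8328

print(before, after, delta)
print(misses_before, misses_after)
```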
Flag      Coverage Δ
daemon    46.39% <28.10%> (-0.25%) ⬇️
jina      86.43% <93.46%> (+0.42%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files                      Coverage Δ
jina/types/arrays/document.py       83.78% <79.31%> (-0.11%) ⬇️
jina/types/arrays/memmap.py         96.77% <96.77%> (ø)
jina/flow/base.py                   90.59% <0.00%> (+0.91%) ⬆️
jina/peapods/runtimes/zmq/zed.py    93.80% <0.00%> (+0.95%) ⬆️
jina/peapods/peas/__init__.py       96.77% <0.00%> (+2.41%) ⬆️
jina/types/mixin.py                 93.54% <0.00%> (+3.22%) ⬆️
jina/peapods/pods/compound.py       93.23% <0.00%> (+11.27%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 26f156d...48b7381.

jina-bot added the area/housekeeping label (this issue/PR is housekeeping) Jun 8, 2021
hanxiao merged commit 37350bf into master Jun 8, 2021
hanxiao deleted the feat-da-memmap branch June 8, 2021 06:39
numb3r3 pushed a commit that referenced this pull request Jun 9, 2021
* feat: add documentarray memmap
alanthssss pushed a commit that referenced this pull request Jun 9, 2021
* feat: add documentarray memmap