bug: ensure tika extracts embedded doc with a stream · Issue #1165 · ICIJ/datashare · GitHub
bug: ensure tika extracts embedded doc with a stream #1165
@pirhoo

Description

When a user requests the content of an embedded file, datashare will extract it to a file on disk instead of holding the content in memory. These files form a cache that datashare will use to stream the content of embedded files.

We will add:

  • an option artifact-dir that sets where the cache is stored
  • a binary content file called raw and a document metadata file called raw.json, generated with the following structure:
    /artifact-dir/<id_first_two_chars>/<id_following_two_chars>/<doc_id>/raw <--- the file binary content
    /artifact-dir/<id_first_two_chars>/<id_following_two_chars>/<doc_id>/raw.json <--- the file metadata
    # and maybe later it could be extended with
    /artifact-dir/<id_first_two_chars>/<id_following_two_chars>/<doc_id>/text/1.png <-- first page text
    /artifact-dir/<id_first_two_chars>/<id_following_two_chars>/<doc_id>/text/2.png <-- second page text
    /artifact-dir/<id_first_two_chars>/<id_following_two_chars>/<doc_id>/thumbnail/1.png <-- page 1 thumbnail
    /artifact-dir/<id_first_two_chars>/<id_following_two_chars>/<doc_id>/thumbnail/2.png <-- page 2 thumbnail

For example:

/configured/path/to/artifacts/12/34/1234f5fc76b8e243c8b0ae42cbee55afd3b0c0ffe67d31a5a8f2a9b13f2998e8/raw
/configured/path/to/artifacts/12/34/1234f5fc76b8e243c8b0ae42cbee55afd3b0c0ffe67d31a5a8f2a9b13f2998e8/raw.json
# ...
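
The layout above can be sketched as a small path helper. This is only an illustration under assumptions, not datashare's implementation: the function name `artifact_paths` and its return shape are hypothetical, and only the directory scheme (first two characters, following two characters, full document id, then `raw` and `raw.json`) comes from this issue.

```python
from pathlib import Path


def artifact_paths(artifact_dir: str, doc_id: str) -> dict[str, Path]:
    """Return the cache locations for one document (hypothetical helper)."""
    if len(doc_id) < 4:
        raise ValueError("doc_id must have at least four characters")
    # Shard on the first two and the following two characters of the id,
    # then use the full id as the document directory.
    doc_dir = Path(artifact_dir) / doc_id[:2] / doc_id[2:4] / doc_id
    return {"raw": doc_dir / "raw", "metadata": doc_dir / "raw.json"}


paths = artifact_paths(
    "/configured/path/to/artifacts",
    "1234f5fc76b8e243c8b0ae42cbee55afd3b0c0ffe67d31a5a8f2a9b13f2998e8",
)
print(paths["raw"].as_posix())
print(paths["metadata"].as_posix())
```

Sharding on id prefixes keeps any single directory from accumulating one entry per document, which is presumably the motivation for the two prefix levels.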

If the artifactDir option is not provided, datashare will extract embedded content in memory as before.
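
A minimal sketch of how the cache and the in-memory fallback could fit together, assuming a hypothetical `extract` callable that stands in for the Tika extraction and writes the embedded document's bytes to a binary sink. The names `open_embedded` and `fake_extract`, and the temporary `raw.tmp` file used to avoid exposing a half-written cache entry, are assumptions made for illustration and do not come from datashare.

```python
import io
import tempfile
from pathlib import Path
from typing import BinaryIO, Callable, Optional


def open_embedded(
    doc_id: str,
    artifact_dir: Optional[str],
    extract: Callable[[BinaryIO], None],
) -> BinaryIO:
    """Return a readable stream over an embedded document's content."""
    if artifact_dir is None:
        # No cache configured: extract in memory, as before.
        buffer = io.BytesIO()
        extract(buffer)
        buffer.seek(0)
        return buffer
    raw = Path(artifact_dir) / doc_id[:2] / doc_id[2:4] / doc_id / "raw"
    if not raw.exists():
        # Cache miss: extract once to disk, then rename into place.
        raw.parent.mkdir(parents=True, exist_ok=True)
        tmp = raw.with_name("raw.tmp")
        with open(tmp, "wb") as out:
            extract(out)
        tmp.replace(raw)
    # Cache hit (or freshly filled): stream straight from the file.
    return open(raw, "rb")


calls = []


def fake_extract(sink: BinaryIO) -> None:
    """Stand-in for the real extraction; records each invocation."""
    calls.append(1)
    sink.write(b"embedded bytes")


with tempfile.TemporaryDirectory() as cache:
    for _ in range(2):  # the second request is served from the cache
        with open_embedded("1234abcd", cache, fake_extract) as stream:
            cached = stream.read()

with open_embedded("1234abcd", None, fake_extract) as stream:
    in_memory = stream.read()
```

In this sketch the extraction runs once for the two cached requests and once more for the in-memory request, so repeated reads of the same embedded file do not repeat the extraction when a cache directory is configured.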

part of #1397
