lib.run_ocr_on_images(params) - FAILURE due to lib.add_files(params, get_images=True) extracting and writing .emf images to lib image folder path (and associated db 'content_type' record) · Issue #1108 · llmware-ai/llmware · GitHub
Open
@wissamharoun

Description

After a successful library.add_files(params, get_images=True) ingestion run, the library in the db is populated. In some cases, .emf image files are extracted from documents and saved to the library image file path, and their respective records in the table set the 'content_type' key to 'image'.

Subsequently, invoking lib.run_ocr_on_images() to process those extracted images calls Parser.ocr_images_in_library().
Parser.ocr_images_in_library() relies on the 'content_type' key with value 'image' to build the workload that is passed to
output = ImageParser(params).process_ocr(image_path, img_name, preserve_spacing=False)

As a result, any .emf file present in the library is passed to tesseract, which does not support .emf files, and execution crashes.
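One possible guard, sketched here as an illustration and not as llmware's actual code: partition the image file names by extension before building the OCR workload, so that .emf files never reach tesseract. The function name `split_ocr_workload` and the extension set are assumptions; the formats tesseract can actually read depend on how its Leptonica dependency was built. The file names other than image23_1.emf are made up for the example.

```python
import os

# Assumption: extensions that Leptonica-based tesseract builds commonly read.
# The real set depends on the local build, so adjust as needed.
OCR_SUPPORTED_EXTENSIONS = {
    ".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp", ".gif", ".pnm", ".webp",
}


def split_ocr_workload(image_names):
    """Partition image file names into (supported, skipped) by file extension."""
    supported, skipped = [], []
    for name in image_names:
        ext = os.path.splitext(name)[1].lower()
        if ext in OCR_SUPPORTED_EXTENSIONS:
            supported.append(name)
        else:
            skipped.append(name)
    return supported, skipped


# Hypothetical workload: only image23_1.emf comes from the log in this issue.
supported, skipped = split_ocr_workload(
    ["image22_1.png", "image23_1.emf", "image24_1.JPG"]
)
print("to OCR:", supported)
print("skipped:", skipped)
```

Filtering at the point where the workload is built would leave the 'content_type' records untouched; whether .emf files should be written or tagged as 'image' during add_files in the first place is a separate question.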

Environment
macOS 15.x
llmware v0.3.8
db in use: sqlite


Processing image 82: image23_1.emf

DEBUG: OCR Error occurred: (1, 'Error in fopenReadStream: failed to open locally with tail \x01 for filename \x01 Leptonica Error in pixRead: image file not found: \x01 Image file \x01 cannot be read! Error during processing.')
DEBUG: Error type: <class 'pytesseract.pytesseract.TesseractError'>
DEBUG: Full traceback:
Traceback (most recent call last):
  File "/Users/user_xyz/project_directory/debugging_lib_run_ocr.py", line 28, in <module>
    lib.run_ocr_on_images(min_size=10, chunk_size=400, realtime_progress=True, add_to_library=False)
  File "/Users/user_xyz/project_directory/venv/lib/python3.12/site-packages/llmware/library.py", line 1246, in run_ocr_on_images
    output = Parser(library=self).ocr_images_in_library(add_to_library=add_to_library,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user_xyz/project_directory/venv/lib/python3.12/site-packages/llmware/parsers.py", line 4620, in ocr_images_in_library
    output = ImageParser(text_chunk_size=chunk_size).process_ocr(image_path, img_name, preserve_spacing=False)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user_xyz/project_directory/venv/lib/python3.12/site-packages/llmware/parsers.py", line 4709, in process_ocr
    text_out = pytesseract.image_to_string(os.path.join(dir_fp,fn))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user_xyz/project_directory/venv/lib/python3.12/site-packages/pytesseract/pytesseract.py", line 486, in image_to_string
    return {
           ^
  File "/Users/user_xyz/project_directory/venv/lib/python3.12/site-packages/pytesseract/pytesseract.py", line 489, in <lambda>
    Output.STRING: lambda: run_and_get_output(*args),
                           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user_xyz/project_directory/venv/lib/python3.12/site-packages/pytesseract/pytesseract.py", line 352, in run_and_get_output
    run_tesseract(**kwargs)
  File "/Users/user_xyz/project_directory/venv/lib/python3.12/site-packages/pytesseract/pytesseract.py", line 284, in run_tesseract
    raise TesseractError(proc.returncode, get_errors(error_string))
pytesseract.pytesseract.TesseractError: (1, 'Error in fopenReadStream: failed to open locally with tail \x01 for filename \x01 Leptonica Error in pixRead: image file not found: \x01 Image file \x01 cannot be read! Error during processing.')
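As a complementary workaround on the caller side, a loop can catch the per-image error so that one unreadable file does not abort the whole run. The sketch below is an assumption-laden illustration, not llmware code: `run_ocr_tolerant` and `fake_ocr` are hypothetical names, and `fake_ocr` merely stands in for a real OCR call such as pytesseract.image_to_string so that the example runs without tesseract installed.

```python
def run_ocr_tolerant(image_paths, ocr_fn):
    """Apply ocr_fn to each path, collecting failures instead of raising."""
    texts, errors = {}, {}
    for path in image_paths:
        try:
            texts[path] = ocr_fn(path)
        except Exception as exc:  # e.g. the TesseractError shown in the traceback
            errors[path] = f"{type(exc).__name__}: {exc}"
    return texts, errors


def fake_ocr(path):
    """Stand-in for a real OCR call; fails on .emf like tesseract does here."""
    if path.lower().endswith(".emf"):
        raise RuntimeError("image file cannot be read")
    return f"text from {path}"


texts, errors = run_ocr_tolerant(["image22_1.png", "image23_1.emf"], fake_ocr)
print(texts)
print(errors)
```

With a real OCR function passed as `ocr_fn`, the `errors` mapping would record which images were skipped and why, while the remaining images are still processed.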
