Description
After a successful ingestion run with library.add_files(params, get_images=True), the library in the db is populated. In some cases, .emf image files are extracted from documents and saved in the library image file path, and their respective records in the table have the 'content_type' key set to 'image'.
Subsequently, invoking lib.run_ocr_on_images() to process those extracted images calls Parser.ocr_images_in_library().
Parser.ocr_images_in_library() relies on the 'content_type' key having the value 'image' to build the workload that is passed to
output = ImageParser(params).process_ocr(image_path, img_name, preserve_spacing=False)
This results in any available .emf file being passed to Tesseract, which does not support EMF files, and execution crashes.
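One possible workaround is to filter the workload by file extension before anything reaches Tesseract. The sketch below is not llmware code: the function name split_ocr_workload and the extension set are assumptions for illustration, and the set should be adjusted to what the local Tesseract/Leptonica build can actually read.

```python
from pathlib import Path

# Assumption: a conservative set of raster formats commonly readable by
# Tesseract/Leptonica builds. This list is illustrative, not taken from llmware.
OCR_READABLE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def split_ocr_workload(image_names):
    """Partition image file names into (ocr_ready, skipped) by extension.

    Hypothetical helper: names with an extension outside
    OCR_READABLE_EXTENSIONS (for example .emf) end up in `skipped`.
    """
    ocr_ready, skipped = [], []
    for name in image_names:
        if Path(name).suffix.lower() in OCR_READABLE_EXTENSIONS:
            ocr_ready.append(name)
        else:
            skipped.append(name)
    return ocr_ready, skipped


if __name__ == "__main__":
    ready, skipped = split_ocr_workload(["image23_1.emf", "image1_0.png"])
    print("OCR:", ready)
    print("Skipped:", skipped)
```

Only the names in the first list would then be handed to process_ocr; the skipped names could be logged so that the user knows which images were not processed.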
Environment
macOS 15.x
llmware v0.3.8
db in use: sqlite
Processing image 82: image23_1.emf
DEBUG: OCR Error occurred: (1, 'Error in fopenReadStream: failed to open locally with tail \x01 for filename \x01 Leptonica Error in pixRead: image file not found: \x01 Image file \x01 cannot be read! Error during processing.')
DEBUG: Error type: <class 'pytesseract.pytesseract.TesseractError'>
DEBUG: Full traceback:
Traceback (most recent call last):
File "/Users/user_xyz/project_directory/debugging_lib_run_ocr.py", line 28, in <module>
lib.run_ocr_on_images(min_size=10, chunk_size=400, realtime_progress=True, add_to_library=False)
File "/Users/user_xyz/project_directory/venv/lib/python3.12/site-packages/llmware/library.py", line 1246, in run_ocr_on_images
output = Parser(library=self).ocr_images_in_library(add_to_library=add_to_library,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/user_xyz/project_directory/venv/lib/python3.12/site-packages/llmware/parsers.py", line 4620, in ocr_images_in_library
output = ImageParser(text_chunk_size=chunk_size).process_ocr(image_path, img_name, preserve_spacing=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/user_xyz/project_directory/venv/lib/python3.12/site-packages/llmware/parsers.py", line 4709, in process_ocr
text_out = pytesseract.image_to_string(os.path.join(dir_fp,fn))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/user_xyz/project_directory/venv/lib/python3.12/site-packages/pytesseract/pytesseract.py", line 486, in image_to_string
return {
^
File "/Users/user_xyz/project_directory/venv/lib/python3.12/site-packages/pytesseract/pytesseract.py", line 489, in <lambda>
Output.STRING: lambda: run_and_get_output(*args),
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/user_xyz/project_directory/venv/lib/python3.12/site-packages/pytesseract/pytesseract.py", line 352, in run_and_get_output
run_tesseract(**kwargs)
File "/Users/user_xyz/project_directory/venv/lib/python3.12/site-packages/pytesseract/pytesseract.py", line 284, in run_tesseract
raise TesseractError(proc.returncode, get_errors(error_string))
pytesseract.pytesseract.TesseractError: (1, 'Error in fopenReadStream: failed to open locally with tail \x01 for filename \x01 Leptonica Error in pixRead: image file not found: \x01 Image file \x01 cannot be read! Error during processing.')
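As the traceback above shows, a single unreadable file aborts the whole run. A defensive loop that records the failure and continues would avoid that. The following is a minimal sketch under stated assumptions: ocr_all and ocr_fn are hypothetical names, ocr_fn stands in for whatever performs the OCR call, and the broad except is for illustration only (in practice it would likely be narrowed to pytesseract.TesseractError).

```python
def ocr_all(image_names, ocr_fn):
    """Run ocr_fn on each image name, collecting results and failures.

    Hypothetical helper, not part of llmware. `ocr_fn` is any callable that
    takes a file name and returns the extracted text, raising on failure.
    """
    results, failures = {}, {}
    for name in image_names:
        try:
            results[name] = ocr_fn(name)
        except Exception as exc:  # assumption: narrow this in real code
            # Record the error and keep going instead of crashing the run.
            failures[name] = repr(exc)
    return results, failures


if __name__ == "__main__":
    def fake_ocr(name):
        # Stand-in for the real OCR call; fails on .emf like Tesseract does.
        if name.endswith(".emf"):
            raise RuntimeError("image file cannot be read")
        return "text from " + name

    done, failed = ocr_all(["image1_0.png", "image23_1.emf"], fake_ocr)
    print(done)
    print(failed)
```

With this shape, the remaining images are still processed and the failing file names are available for reporting afterwards.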