Description
Describe the bug
When processing a PDF using marker_single, a TypeError occurs during the line merging process.
Traceback
Traceback (most recent call last):
  File "/Users/xxxxx/.local/bin/marker_single", line 8, in <module>
    sys.exit(convert_single_cli())
             ^^^^^^^^^^^^^^^^^^^^
  File "/Users/xxxxx/.local/pipx/venvs/marker-pdf/lib/python3.12/site-packages/click/core.py", line 1161, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/xxxxx/.local/pipx/venvs/marker-pdf/lib/python3.12/site-packages/click/core.py", line 1082, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/Users/xxxxx/.local/pipx/venvs/marker-pdf/lib/python3.12/site-packages/click/core.py", line 1443, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/xxxxx/.local/pipx/venvs/marker-pdf/lib/python3.12/site-packages/click/core.py", line 788, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/xxxxx/.local/pipx/venvs/marker-pdf/lib/python3.12/site-packages/marker/scripts/convert_single.py", line 35, in convert_single_cli
    rendered = converter(fpath)
               ^^^^^^^^^^^^^^^^
  File "/Users/xxxxx/.local/pipx/venvs/marker-pdf/lib/python3.12/site-packages/marker/converters/pdf.py", line 154, in __call__
    document = self.build_document(filepath)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/xxxxx/.local/pipx/venvs/marker-pdf/lib/python3.12/site-packages/marker/converters/pdf.py", line 149, in build_document
    processor(document)
  File "/Users/xxxxx/.local/pipx/venvs/marker-pdf/lib/python3.12/site-packages/marker/processors/line_merge.py", line 130, in __call__
    self.merge_lines(lines, block)
  File "/Users/xxxxx/.local/pipx/venvs/marker-pdf/lib/python3.12/site-packages/marker/processors/line_merge.py", line 104, in merge_lines
    line.merge(other_line)
  File "/Users/xxxxx/.local/pipx/venvs/marker-pdf/lib/python3.12/site-packages/marker/schema/text/line.py", line 99, in merge
    self.structure = self.structure + other.structure
                     ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~
TypeError: can only concatenate list (not "NoneType") to list
Cause
The error occurs in the merge method of the Line class (marker/schema/text/line.py). The line self.structure = self.structure + other.structure concatenates the two structure attributes directly, so it raises a TypeError if either self.structure or other.structure is None. The message in this traceback (can only concatenate list (not "NoneType") to list) shows that here self.structure was a list and other.structure was None.
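The failure mode can be reproduced outside marker with plain Python lists. The snippet below is only an illustration: concat is a hypothetical helper standing in for the + expression in Line.merge, and no marker code is imported. It also shows that the wording of the error depends on which operand is None.

```python
# Stand-alone illustration; plain lists stand in for the structure attributes.
def concat(a, b):
    # Same shape as: self.structure + other.structure
    return a + b

# Right operand is None: this is the message seen in the traceback.
try:
    concat([1, 2], None)
except TypeError as exc:
    right_none_msg = str(exc)

# Left operand is None: Python reports a different message.
try:
    concat(None, [1, 2])
except TypeError as exc:
    left_none_msg = str(exc)

print(right_none_msg)
print(left_none_msg)
```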
Proposed Fix
Modify the merge method to handle potential None values by treating them as empty lists before concatenation:
def merge(self, other: "Line"):
    self.polygon = self.polygon.merge([other.polygon])
    # Handle potential None values for structure
    self_structure = self.structure if self.structure is not None else []
    other_structure = other.structure if other.structure is not None else []
    self.structure = self_structure + other_structure
    if self.formats is None:
        self.formats = other.formats
    elif other.formats is not None:
        self.formats = list(set(self.formats + other.formats))
I am not sure whether the fix is acceptable for the original intended purpose of merge.
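To check how the proposed logic behaves, here is a minimal sketch that runs without marker installed. FakeLine is a hypothetical stand-in for marker's Line class: polygon handling is left out, and the structure and format values are made-up placeholder strings, not real marker block IDs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FakeLine:
    """Hypothetical stand-in for marker's Line; polygon handling omitted."""
    structure: Optional[List[str]] = None
    formats: Optional[List[str]] = None

    def merge(self, other: "FakeLine") -> None:
        # Treat a missing structure as an empty list before concatenating.
        self.structure = (self.structure or []) + (other.structure or [])
        if self.formats is None:
            self.formats = other.formats
        elif other.formats is not None:
            self.formats = list(set(self.formats + other.formats))


# The case from the traceback: a list on the left, None on the right.
a = FakeLine(structure=["span-1"], formats=["plain"])
b = FakeLine(structure=None, formats=["bold"])
a.merge(b)

# Both sides None: the merge no longer raises.
c = FakeLine()
c.merge(FakeLine())
```

One side effect worth noting: after merging, a line whose structure was None ends up with an empty list instead. Whether any downstream code distinguishes None from an empty structure is something the maintainers would need to confirm.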
Environment (if relevant)
- marker-pdf version: (Please add the version you are using)
- Python version: 3.12
- OS: macOS Sonoma
Additional context
This error was encountered while processing a PDF converted from Microsoft Word; the documents are quite dense with text.