Description
Hi team,
Thank you for your work on Marker and Surya! I’ve been using the pipeline to convert PDFs to Markdown, and I’ve noticed a consistent issue with heading formatting in the output .md files:
Problem:
All headings (main headings, subheadings, and deeper subheadings) are emitted at the same level (e.g., all as ## Heading). No hierarchy is expressed (no ###, ####, etc.), even when the source PDF clearly contains sections and subsections.
Example Output:
```text
## Chapter One
## The Incidence and Costs of Lameness
## INCIDENCE OF LAMENESS
```
All of these are at the same or inconsistent levels, regardless of their actual hierarchy in the document.
Expected Behavior:
Headings should reflect their true hierarchy, such as:
```text
# Chapter One
## The Incidence and Costs of Lameness
### Incidence of Lameness
```
or similar, depending on the document structure.
Why it matters:
Proper heading hierarchy is critical for downstream LLM, RAG, and search applications.
Tools like LangChain’s MarkdownHeaderTextSplitter and other semantic chunkers rely on correct heading levels for accurate document navigation and retrieval.
Flat headings make it impossible to reconstruct the document’s logical structure programmatically.
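To illustrate the points above, here is a minimal sketch in plain Python of how a header-based chunker can derive a section path from heading levels. It is not LangChain's implementation; `heading_paths` is a hypothetical helper written only for this issue, and the two input strings assume the flat output uses `##` throughout.

```python
import re

# ATX heading: 1-6 '#' characters, whitespace, then the title.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")

def heading_paths(markdown: str) -> list[list[str]]:
    """Return the ancestor path (outermost first) for each heading."""
    stack: list[tuple[int, str]] = []  # (level, title) of currently open sections
    paths = []
    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if not match:
            continue
        level, title = len(match.group(1)), match.group(2)
        # A new heading closes every open section at the same or a deeper level.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, title))
        paths.append([t for _, t in stack])
    return paths

# Assumed shape of the current (flat) output and of the expected output.
flat = (
    "## Chapter One\n"
    "## The Incidence and Costs of Lameness\n"
    "## INCIDENCE OF LAMENESS\n"
)
nested = (
    "# Chapter One\n"
    "## The Incidence and Costs of Lameness\n"
    "### Incidence of Lameness\n"
)

flat_paths = heading_paths(flat)
nested_paths = heading_paths(nested)
```

With the flat input, every path has length one, so the last section no longer knows it belongs to "Chapter One". With the nested input, the last path carries all three ancestors, which is the metadata a chunker needs to attach to each chunk.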
Questions:
Is there a way to configure Marker or Surya to better detect heading levels?
Are there plans to improve heading hierarchy detection in future releases?
Is there a workaround or recommended processor/config for this use case?
Environment:
Marker version: 1.7.4
Surya version: 0.14.5
Thanks for your help and for all your work on these tools!