Headings in Markdown Output Are mostly all Flat (No consistent Hierarchy) · Issue #730 · datalab-to/marker
Open
@Sidclove

Hi team,

Thank you for your work on Marker and Surya! I’ve been using the pipeline to convert PDFs to Markdown, and I’ve noticed a consistent issue with heading formatting in the output .md files:

Problem:
All headings—main headings, subheadings, and further subheadings—are formatted at the same level (e.g., all as ## Heading). There is no hierarchical structure (i.e., no use of ###, ####, etc.), even when the source PDF clearly contains sections and subsections.

Example Output:

Chapter One

The Incidence and Costs of Lameness

INCIDENCE OF LAMENESS

All of these are at the same or inconsistent levels, regardless of their actual hierarchy in the document.
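
A quick way to confirm the symptom on any output file is to count headings per level. The sketch below is not part of Marker or Surya: `heading_level_counts` is a hypothetical helper, it only recognizes ATX-style headings (`#` prefixes), and the sample string is an assumed input modeled on the "all as ## Heading" case described above.

```python
import re
from collections import Counter

# Hypothetical helper, not a Marker API: counts ATX headings per level.
HEADING_RE = re.compile(r"^(#{1,6})\s+\S")
FENCE = "`" * 3  # fenced code block delimiter


def heading_level_counts(markdown: str) -> Counter:
    """Return a Counter mapping heading level (1-6) to number of headings."""
    counts: Counter = Counter()
    in_fence = False
    for line in markdown.splitlines():
        # Skip fenced code so '#' comments inside code are not counted.
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADING_RE.match(line)
        if match:
            counts[len(match.group(1))] += 1
    return counts


# Assumed sample: every heading emitted at level 2.
sample = (
    "## Chapter One\n\nText.\n\n"
    "## The Incidence and Costs of Lameness\n\n"
    "## INCIDENCE OF LAMENESS\n"
)
counts = heading_level_counts(sample)
print(dict(counts))  # prints {2: 3}
```

A single key in the result means the output is flat. A document with real sections and subsections would be expected to show at least two distinct levels.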

Expected Behavior:
Headings should reflect their true hierarchy, such as:

Chapter One

The Incidence and Costs of Lameness

Incidence of Lameness

or similar, depending on the document structure.

Why it matters:

Proper heading hierarchy is critical for downstream LLM, RAG, and search applications.

Tools like LangChain’s MarkdownHeaderTextSplitter and other semantic chunkers rely on correct heading levels for accurate document navigation and retrieval.

Flat headings make it impossible to reconstruct the document’s logical structure programmatically.
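
To make the impact concrete, here is a minimal stdlib-only sketch of header-aware chunking. It is not LangChain's implementation, only an approximation of the general idea that each chunk is tagged with the path of headings above it. `chunk_by_headers` and both sample strings are hypothetical, the heading levels in the samples are assumptions, and fenced code blocks are not handled.

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_by_headers(markdown: str) -> list[dict]:
    """Split text into chunks, each tagged with the headings above it."""
    path: list[tuple[int, str]] = []  # stack of (level, title)
    chunks: list[dict] = []
    body: list[str] = []

    def flush() -> None:
        # Emit the accumulated body under the current heading path.
        text = "\n".join(body).strip()
        if text:
            chunks.append({"path": [title for _, title in path], "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Pop headings at the same or a deeper level, then push this one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks


# Assumed flat output: every heading at the same level.
flat = (
    "## Chapter One\nIntro.\n"
    "## The Incidence and Costs of Lameness\nOverview.\n"
    "## INCIDENCE OF LAMENESS\nDetails.\n"
)
# Assumed hierarchical output for the same content.
nested = (
    "# Chapter One\nIntro.\n"
    "## The Incidence and Costs of Lameness\nOverview.\n"
    "### Incidence of Lameness\nDetails.\n"
)

print(chunk_by_headers(flat)[-1]["path"])
print(chunk_by_headers(nested)[-1]["path"])
```

With the flat input, the last chunk carries only its own heading, so the chapter and section context is lost. With the nested input, the same chunk carries all three headings, which is the context a retrieval pipeline needs.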

Questions:
Is there a way to configure Marker or Surya to better detect heading levels?

Are there plans to improve heading hierarchy detection in future releases?

Is there a workaround or recommended processor/config for this use case?

Environment:
Marker version: 1.7.4

Surya version: 0.14.5

Thanks for your help and for all your work on these tools!

snippet1.txt
