Description
Explain the problem.
The HTML reader should treat <style> elements in <body> as an empty string, not a newline.
I encountered <style> elements in <body> in the wild (they are extremely common on Wikipedia), and I noticed that Pandoc renders them very differently from a web browser:
$ echo '<p>A<style></style>B</p>' | ./pandoc --from html --to gfm
A
B
$ echo '<p>A<style></style>B</p>' | ./pandoc --from html --to native
[ Plain [ Str "A" ] , Plain [ Str "B" ] ]
I didn't check every platform, but <p>A<style></style>B</p> renders as simply AB in Firefox, Chromium, and Safari on macOS.
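To make the expected result concrete, here is a minimal sketch of the text extraction I would expect, written with Python's standard-library html.parser rather than Pandoc's actual HTML reader. The names VisibleText and visible_text are hypothetical and exist only for this example; the sketch merely illustrates that a <style> element should contribute no text and no break, and it is not a proposal for how Pandoc should implement this.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text content, skipping anything inside <style> elements.

    Hypothetical helper for illustration only; not part of Pandoc.
    """

    def __init__(self):
        super().__init__()
        self.parts = []
        self.style_depth = 0

    def handle_starttag(self, tag, attrs):
        # Entering a <style> element: suppress text until it closes.
        if tag == "style":
            self.style_depth += 1

    def handle_endtag(self, tag):
        if tag == "style" and self.style_depth > 0:
            self.style_depth -= 1

    def handle_data(self, data):
        # Stylesheet text is dropped; it adds neither text nor a break.
        if self.style_depth == 0:
            self.parts.append(data)


def visible_text(html):
    """Return the concatenated text of html, ignoring <style> content."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(visible_text("<p>A<style></style>B</p>"))
```

The same function also drops a non-empty stylesheet, so <p>A<style>p{color:red}</style>B</p> yields the same text as the empty case.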
I looked at the <style> specification to figure out what the expected behavior is here, but I found the HTML parsing specification to be extremely difficult to understand. The W3C Markup Validation Service does confirm that including <style> elements in <body> is invalid, but handling invalid HTML seems to be within Pandoc's scope according to #9090 (comment): "Well, we already handle many cases of invalid HTML. If there are other particular ones that come up, feel free to report."
I will report this problem upstream to Wikipedia at some point, but it appears to be a fundamental part of how MediaWiki templates work. I first noticed this bug on Circadian rhythm - Wikipedia, which includes 16 <style> elements in <body>. I therefore expect this invalid HTML to be difficult to change there, at least within the near future. According to Help:Markup validation - Wikipedia, they appear to want to avoid invalid markup, though.
Pandoc version?
macOS on Apple Silicon (albeit an x86_64 executable running under Rosetta 2)
pandoc 3.6.3-nightly-2025-02-24
Features: +server +lua
Scripting engine: Lua 5.4