8000 Floki removes blank text nodes without option to avoid this · Issue #75 · philss/floki · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Floki removes blank text nodes without option to avoid this #75

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
Eiji7 opened this issue Dec 15, 2016 · 14 comments
Open

Floki removes blank text nodes without option to avoid this #75

Eiji7 opened this issue Dec 15, 2016 · 14 comments
Labels

Comments

@Eiji7
Copy link
Eiji7 commented Dec 15, 2016

Actual results:

Floki.parse("<span>5</span> <span>=</span> <span>5</span>")  
[{"span", [], ["5"]}, {"span", [], ["="]}, {"span", [], ["5"]}]

Expected results:

Floki.parse("<span>5</span> <span>=</span> <span>5</span>")  
# or
Floki.parse("<span>5</span> <span>=</span> <span>5</span>", remove_text_nodes_with_whitespaces: false)

[{"span", [], ["5"]}, " ", {"span", [], ["="]}, " ", {"span", [], ["5"]}]

Note: this is really important in some cases. For example, please try parsing html generated from github markup (code samples).

@philss
Copy link
Owner
philss commented Dec 18, 2016

Hi @Eiji7.

This issue is a known issue inside Mochiweb HTML parser, which is the parser behind Floki.

The suggested workaround would replace white-spaces between tags with a &#32; or &#x20; before calling :mochiweb.parse/1.

I tried some expressions to replace and I came up with this:

"<span>5</span> <span>=</span> <span>5</span>"
|> String.replace(~r/>[ \n\r]+</, ">&#32;<")
|> Floki.parse

# => [{"span", [], ["5"]}, " ", {"span", [], ["="]}, " ", {"span", [], ["5"]}]

Not sure if this is safe to do inside Floki.parse/2, specially because this would affect code inside script tags.

Maybe the option you are suggesting could come as true by default only to preserve the current behavior that is broken, but is already known. What do you think? Do you have suggestions?
Thanks!

@Eiji7
Copy link
Author
Eiji7 commented Dec 19, 2016

Edit:
@philss: Looks better, but it should be already implemented, so developers not need to care about replace.

@aphillipo
Copy link

@Eiji7 Surely it's going to add a lot of overhead if you don't need this though? Maybe it should be Floki.preserve_inter_tag_spaces and documented to say that it will copy the entire string to do this.

@Eiji7
Copy link
Author
Eiji7 commented Dec 29, 2016

@aphillipo: yup, this way or #37.
btw. If you can explain me how parsers should work, like:

best way is to create regex, you can find sample/demo patterns at ...

then I could think about create an XML/HTML parser and validator in Elixir. I can start work on it now, but I don't worked on similar project, so I can't provide others that it will be really fast. As listed in 3rd point at #37 issue we need think about a good algorithm to avoid decrease of efficiency

@aphillipo
Copy link

Building HTML parsers is very difficult - I want to use Servo's for this eventually...

https://github.com/servo/html5ever

@Eiji7
Copy link
Author
Eiji7 commented Dec 29, 2016

@philss and @aphillipo: Did you know :xmerl and :xmerl_scan?
For example see erlang docs.
Maybe it's what we are looking for that we can easily extend?

Erlang/OTP 19 [erts-8.1] [source] [64-bit] [smp:4:4] [async-threads:10]

Interactive Elixir (1.3.4) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> :xmerl_scan.string('<p><span>5</span> <span>=</span> <span>5</span></p>')  
{{:xmlElement, :p, :p, [], {:xmlNamespace, [], []}, [], 1, [],
  [{:xmlElement, :span, :span, [], {:xmlNamespace, [], []}, [p: 1], 1, [],
    [{:xmlText, [span: 1, p: 1], 1, [], '5', :text}], [], '/home/eiji',
    :undeclared}, {:xmlText, [p: 1], 2, [], ' ', :text},
   {:xmlElement, :span, :span, [], {:xmlNamespace, [], []}, [p: 1], 3, [],
    [{:xmlText, [span: 3, p: 1], 1, [], '=', :text}], [], :undefined,
    :undeclared}, {:xmlText, [p: 1], 4, [], ' ', :text},
   {:xmlElement, :span, :span, [], {:xmlNamespace, [], []}, [p: 1], 5, [],
    [{:xmlText, [span: 5, p: 1], 1, [], '5', :text}], [], :undefined,
    :undeclared}], [], '/home/eiji', :undeclared}, []}

What do you think about it?

@philss
Copy link
Owner
philss commented Dec 29, 2016

@Eiji7 :xmerl won't work for us. This is because xmerl was made to parse XML documents, and HTML is very less strict than XML. Normally HTML parsers are more forgiven than XML parsers, due the reason that users may write invalid HTML that should be presented in the browsers.

Here are some references:

One of the possibilities to fix this is to fix mochiweb's HTML parser.

As @aphillipo said, HTML parsers are very difficult to build. I created that issue (#37) without understand that. Now I'm more inclined to the idea of reuse an existing HTML parser. Servo's html5ever is my favorite choice for now.

I did not started working on this, but help trying out this "bridge" between Elixir and Rust would be very welcome! Here is a project that is very promising: https://github.com/hansihe/Rustler.

@Eiji7
Copy link
Author
Eiji7 commented Dec 30, 2016

@philss: Ok, I see your point. I see that your way ("bridge") is faster to implement. I don't know how exactly will work. Will it be as fast (with that "bridge") as native Elixir parser? It may be interesting for me in further future.
Anyway I'm interested in :xmerl and I think that I will look at its source files and I will probably create a Elixir version of it that will work for more that one document format (XML, HTML, XHTML and maybe also HAML).

@philss philss added the Bug label Jan 12, 2017
@philss
Copy link
Owner
philss commented Jan 12, 2017

@Eiji7 sorry for the delay.

I'm not sure how it will work. It does not seems to have a big performance gap when using, for example, an Erlang NIF written in C or Rust. The problems are more related to the security of code. If the C/Rust code crash, the entire Erlang VM can crash. This is one of the reasons I want to give Rust a try with html5ever (because Rust promises a more safe environment for your programs).

Cool! The idea to write a HTML parser entirely in Elixir is not dead yet! It is only very hard to do the "right way".

I will let this open as a "bug" and try to figure out how to "quick fix" this.

Thanks!

@dariodf
Copy link
dariodf commented Mar 4, 2018

I tried to use html5ever and i got too many errors when parsing.

So here's a (quite expensive) solution for this problem:

  defp replace_whitespaces(text, last_text \\ "")
  defp replace_whitespaces(text, last_text) when last_text == text, do: text
  defp replace_whitespaces(text, _) do
    text
    |> String.replace(~r/>((&#32;)*) ( *)</, ">\\g{1}&#32;\\g{3}<")
    |> replace_whitespaces(text)
  end

Then, you use this function before your find and text:

    body
    |> replace_whitespaces
    |> Floki.find(selector)
    |> Floki.text

@lud-wj
Copy link
lud-wj commented Feb 5, 2024

Hello,

I am having the same problem. Now it seems that the mochiweb parser is distributed in floki, so the problem could be fixed directly instead of relying on third party (which does not exist anymore AFAICT).

I have a workaround with html5ever that works for small payloads, I'm gonna test it with huge, dirty HTML chunks. But it could be nice to have an option to keep all the text nodes regardless of their contents, don't you think?

@nickkaltner
Copy link
nickkaltner commented Apr 25, 2025

I ran into this too. For my use case (lots of PREs) I needed to convert spaces and newlines separately eg.

    new_html =
      Regex.replace(~r/>( +)</, html, fn _, spaces ->
        ">" <> String.duplicate("&#32;", String.length(spaces)) <> "<"
      end)

    # Preserve newlines between tags using LF entity instead of CR
    new_html = String.replace(new_html, ~r/>[\n\r]+</, ">&#10;<")

@q60
Copy link
q60 commented < 8000 a href="#issuecomment-2891638932" id="issuecomment-2891638932-permalink" class="Link--secondary js-timestamp">May 19, 2025

+1, i've run into this too. i would be glad to look at the code and try to fix it, but i don't have much time :/

@azizk
Copy link
azizk commented May 22, 2025

A simple(r) workaround without a regex would be:

"<span>5</span> <span>=</span> <span>5</span>"
|> String.replace(["> ", ">\n", ">\r"], ">&#32;")
|> Floki.parse_document!()

[{"span", [], ["5"]}, " ", {"span", [], ["="]}, " ", {"span", [], ["5"]}]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

8 participants
0