8000 escaped characters · Issue #356 · phpgt/Dom · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

escaped characters #356

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
kartofelek007 opened this issue May 29, 2022 · 18 comments · Fixed by #394
Closed

escaped characters #356

kartofelek007 opened this issue May 29, 2022 · 18 comments · Fixed by #394

Comments

@kartofelek007
Copy link

I have html with code:

<script>
p.append(" są ");
</script>

I read my html, add div and print output ($content is html get by ob_get_clean() and it good because i tested it in my other function):

$document = new HTMLDocument($content);
$h1 = $document->querySelector("#pageTitle");
$div = $document->createElement("div");
$div->innerHTML = "lorem";
$h1->after($div);
$html = $document->documentElement;

I got:

<script>
p.append(" s&#261; ");
</script>
@g105b
Copy link
Member
g105b commented May 29, 2022

Hi @kartofelek007 ,

Thanks for the issue report. I'll take a look at this and create a test case to investigate. I know the content of a script tag is treated differently in the underlying DOMDocument, so maybe there's something this library can do to patch this behaviour.

Will keep you up to date when I have a fix.

Greg.

g105b pushed a commit that referenced this issue Jul 25, 2022
@g105b
Copy link
Member
g105b commented Jul 25, 2022

Hi @kartofelek007,

Please can you check over the changes I've just pushed to branch 356-escaped-characters? I believe this has fixed the issue, and it's also improved the way escaping works for the textContent property.

Unit tests available here: https://github.com/PhpGt/Dom/blob/356-escaped-characters/test/phpunit/HTMLDocumentTest.php#L650-L667

@kartofelek007
Copy link
Author

I am simple man. I dont know how to use this with composer.

@g105b
Copy link
Member
g105b commented Jul 28, 2022

😄 I'm happy to help. If you are requiring phpgt/dom directly through composer, you can refer to a particular branch by using this syntax:

{
	"require": {
		"phpgt/dom": "dev-356-escaped-characters"
	}
}

Alternatively, you could clone this repository, use git checkout 356-escaped-characters and use it directly, like in the unit tests.

If you need any help integrating, I'm more than happy advising you - I'm really invested in helping get my open source work accessible to all.

@kartofelek007
Copy link
Author

I install it on my page.

2022-07-29_101642

In composer.lock I have something like:

{
            "name": "phpgt/dom",
            "version": "dev-356-escaped-characters",
            "source": {
                "type": "git",
                "url": "https://github.com/PhpGt/Dom.git",
                "reference": "c2f11cc299feefc7ee2faae61178de5b04a17dec"
            },
            "dist": {
                "type": "zip",
                "url": "https://api.github.com/repos/PhpGt/Dom/zipball/c2f11cc299feefc7ee2faae61178de5b04a17dec",
                "reference": "c2f11cc299feefc7ee2faae61178de5b04a17dec",
                "shasum": ""
            },

But it still not working.
Part of my course page:

<pre class="line-numbers"><code class="language-js">
const p = document.querySelector("#testNodeText");
const btn = document.querySelector(".btn-textNodeTest");

const word1 = document.createTextNode("Psy");
p.append(word1);

p.append(" są ");

const word2 = document.createTextNode("fajne");
p.append(word2);

btn.addEventListener("click", () =&gt; {
    console.dir(word1); //wypiszemy sobie by zobaczyć co możemy użyć
    word1.textContent = "Koty też";
    word2.textContent = "super!";
});
</code></pre>

<script>
    {
        const p = document.querySelector("#testNodeText");
        const btn = document.querySelector(".btn-textNodeTest");

        const word1 = document.createTextNode("Psy");
        p.append(word1);

        p.append(" s&#261; ");

        const word2 = document.createTextNode("fajne");
        p.append(word2);

        btn.addEventListener("click", () => {
            console.dir(word1); //wypiszemy sobie by zobaczy&#263; co mo&#380;emy u&#380;y&#263;

            word1.textContent = "Koty te&#380;";
            word2.textContent = "super!";
        });
    }
</script>

In normal html elements (pre, code etc) the letters are correct. But in script tag all non eglish letters are escaped (look for example at comment after console.dir

@g105b
Copy link
Member
g105b commented Jul 29, 2022

Thank you for supplying a test case. I'll add this to the unit tests to replicate in a controlled environment. If you keep your eye on the 356-escaped-characters branch you'll see the new tests appear soon.

g105b pushed a commit that referenced this issue Jul 29, 2022
@g105b
Copy link
Member
g105b commented Jul 29, 2022

Hi @kartofelek007,

I've added a test to check that the non-English characters within script tags are not escaped. The test is passing, so I think I need more information from you about your particular use-case, so I can write more tests and isolate this bug properly.

Please could you provide me the source HTML that contains the <script> tag, and the PHP code that you are using to load it into the DOM, and the PHP code that renders the DOM back to the browser?

With your help I'm sure we can figure out why this strange encoding issue is occurring.

P.S. cool DOOM terminal

@g105b
Copy link
Member
g105b commented Jul 29, 2022

Take a look at scratch.php - I've pushed this to the branch which can be tested within a real web browser.

The scratch.php loads your example JavaScript into an HTMLDocument, then echos the document to the page.

I serve the scratch.php locally using this command: php -S 0.0.0.0:8000 scratch.php and this allows me to view the page in my web browser at http://localhost:8000/

Please see this screen recording. I hope this helps get us to a solution:

Marcin

@kartofelek007
Copy link
Author

Hmmm. Your example work fine.
I paste my function to your code and get escape characters. Maybe my function is shity?
You can see what I do in real on my page: https://kursjs.pl/kurs/dom/dom- 8000 tworzenie-i-usuwanie#createTextNode (now i use my own regexp solution)

<?php
use Gt\Dom\HTMLDocument;

require "vendor/autoload.php";

$content = <<<HTML
<div class="page-container" id="pageContainer">

<h1 class="page-title" id="pageTitle">Tworzenie i usuwanie elementów</h1>

<pre class="line-numbers"><code class="language-js">
const p = document.querySelector("#testNodeText");
const btn = document.querySelector(".btn-textNodeTest");

const word1 = document.createTextNode("Psy"); //pierwsze słowo
p.append(word1);

p.append(" są "); //nie potrzebujemy tutaj referencji

const word2 = document.createTextNode("fajne"); //ostatnie słowo
p.append(word2);

btn.addEventListener("click", () => {
    console.dir(word1); //wypiszemy sobie by zobaczyć co możemy użyć
    word1.textContent = "Koty też";
    word2.textContent = "super!";
});
</code></pre>

<script>
    {
        const p = document.querySelector("#testNodeText");
        const btn = document.querySelector(".btn-textNodeTest");

        const word1 = document.createTextNode("Psy");
        p.append(word1);

        p.append(" są "); //nie potrzebujemy tutaj referencji

        const word2 = document.createTextNode("fajne");
        p.append(word2);

        btn.addEventListener("click", e => {
            console.dir(word1); //wypiszemy sobie by zobaczyć co możemy użyć

            word1.textContent = "Koty też";
            word2.textContent = "super!";
        });
    }
</script>
</div>
HTML;


function generateContentTable($content) {
    $document = new HTMLDocument($content);

    $h2 = $document->querySelectorAll("h2.subtitle");

    $html = '<div class="table-of-content" role="navigation">
                <h2 class="visuallyhidden">Spis treści</h2>
                    <ul class="table-of-content-list">';

    foreach($h2 as $el) {
        $html .= '<li><a href="#' . $el->getAttribute('id') . '">' . $el->innerText . '</a></li>';
    }
    $html .= '</ul></div>';

    $div = $document->createElement("div");
    $div->innerHTML = $html;

    $h1 = $document->querySelector("#pageTitle");
    $h1->after($div);
    $html = $document->documentElement;
    return $html->innerHTML;
}

$c = generateContentTable($content);
echo $c;

Edit. I test my funcion in two scanarios in your scratch.php:

function generateContentTable($content) {
    //$document = new HTMLDocument($content);
    //$html = $document->documentElement;
    return $content; //return OK
}

function generateContentTable($content) {
    $document = new HTMLDocument($content);
    $html = $document->documentElement;
    return $html->innerHTML; //return ESCAPED LETTERS
}

@g105b
Copy link
Member
g105b commented Jul 29, 2022

Hi! Thanks for providing more detail. I can see where your issue lies now - within the innerHTML function's implementation.

I'll see what I can do to improve this functionality, but in the meantime, I think you can resolve your issue by using the HTMLDocument's toString function.

Rather than doing this:

$html = $document->documentElement;
return $html->innerHTML;

do this instead:

return (string)$document;

You will not see any of this escaping after casting the document to a string. I will see if I can provide a fix to the innerHTML property so it behaves as expected.

There is some advice I'd like to share with you at this point, which is to try and avoid concatenating strings of HTML, and instead using the DOM functionality like createElement and appendChild. That way, there's no need to worry about valid syntax or typos. I've got myself out of a lot of trouble in the past by avoiding concatenating strings of HTML.

@kartofelek007
Copy link
Author
kartofelek007 commented Jul 29, 2022

This escape letters outside script tag:

$document = new HTMLDocument($content);
return (string)$document;

but not in script tag

g105b pushed a commit that referenced this issue Sep 5, 2022
@g105b
Copy link
Member
g105b commented Sep 5, 2022

Hello @kartofelek007

Sorry for the delay, but I've spent some time tonight trying to get UTF-8 characters working across all tests.

Please can you check this is working for you? I believe I've fixed things for you.

Greg.

P.S. It's on the 356-escaped-characters branch!

@g105b g105b linked a pull request Sep 14, 2022 that will close this issue
@kartofelek007
Copy link
Author
kartofelek007 commented Sep 15, 2022

After update this return content where every inline js script (include inline google ad script) look like this:

<script>@@@@@@@@@@@@@@@@---script-6323981498a9f---@@@@@@@@@@@@@@@@</script>

And most important - generate take long time. For example for my single page this may be 20-30s. I create simple test with 200lines html - generate may take 3-4s? If I delete all script from this html, generate is muuuuuch faster.

@g105b
Copy link
Member
g105b commented Sep 16, 2022

I've just added some more tests to the branch and can confirm that I've somehow added some weird bugs. Sorry to keep this open for so long, I'll revisit it and see if I can fix whatever weird bug I've introduced.

@kartofelek007
Copy link
Author

I'm very poor with php. In dont know how to run test. I do this in my own way: I download your repo, checkout do character branch, install composer, recreate test page (see abowe 29 jul) and run it. Then i run your library on my page.

I'm not suprised, because this php functionality is somehow broken and not refined

@g105b
Copy link
Member
g105b commented Sep 16, 2022

@kartofelek007 I made a big mistake with my previous fix. I believe I have now fixed everything in the latest commit - sorry for confusing things.

Here's how you can run the tests locally, with the latest version of the code:

git checkout 356-escaped-characters
git pull
composer install
vendor/bin/phpunit test/phpunit

I believe you should be able to see the correct characters within the <script> tags, and everything should be nice and fast without any delay.

Thank you for helping me figure this one out, I'm very interested in hearing how you get on.

@kartofelek007
Copy link
Author

First test on test page and my real page pass correct. Looks like it's working fine now. Good job!

@g105b
Copy link
Member
g105b commented Sep 16, 2022

Great! I'll keep the pull request open for a bit longer while I integrate the branch into a few live apps I'm running.

g105b pushed a commit that referenced this issue Sep 25, 2022
* bug: certain elements do not encode properly on render
closes #356

* test: test bug #356 provided by @kartofelek007

* test: extra assertions - check escaped string are not in the render

* test: Remove unnecessary echo

* scratch: Marcin's test isolation

* fix: more work towards #356

* tweak: remove scratch file

* fix: better solution to #356 using processing entity
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants
0