Puzzle: Given the same html string, playground and server side use output different metadata · Issue #21 · kepano/defuddle · GitHub

Puzzle: Given the same html string, playground and server side use output different metadata #21

Open
nbbaier opened this issue Apr 1, 2025 · 2 comments

nbbaier commented Apr 1, 2025

Given the following HTML string:

<!DOCTYPE html>
<html lang="en">
   <head>
      <meta charset="UTF-8" />
      <meta name="viewport" content="width=device-width, initial-scale=1.0" />
      <meta name="description" content="Nico Baier's personal website" />
      <title>Nico Baier</title>
      <script type="application/ld+json">
         {
            "@context": "http://schema.org",
            "@type": "Person",
            "email": "mailto:nico.baier@gmail.com",
            "jobTitle": "Natural Language Analyst",
            "name": "Nico Baier",
            "additionalName": "Nicholas Baier",
            "alumniOf": [
               "University of Michigan",
               "University of California, Berkeley"
            ],
            "birthPlace": "Ann Arbor, Michigan",
            "url": "https://nicobaier.com",
            "sameAs": [
               "https://linkedin.com/in/nbbaier",
               "https://github.com/nbbaier/",
               "https://bsky.app/profile/nicobaier.com"
            ]
         }
      </script>
   </head>
   <body>
      Dummy Content.
   </body>
</html>

The Defuddle Playground returns the following metadata:

{
  "title": "Nico Baier",
  "description": "Nico Baier's personal website",
  "domain": "nicobaier.com",
  "favicon": "https://nicobaier.com/favicon.ico",
  "image": "",
  "published": "",
  "author": "",
  "site": "",
  "schemaOrgData": [
    {
      "@context": "http://schema.org",
      "@type": "Person",
      "email": "mailto:nico.baier@gmail.com",
      "jobTitle": "Natural Language Analyst",
      "name": "Nico Baier",
      "additionalName": "Nicholas Baier",
      "alumniOf": [
        "University of Michigan",
        "University of California, Berkeley"
      ],
      "birthPlace": "Ann Arbor, Michigan",
      "url": "https://nicobaier.com",
      "sameAs": [
        "https://linkedin.com/in/nbbaier",
        "https://github.com/nbbaier/",
        "https://bsky.app/profile/nicobaier.com"
      ]
    }
  ],
  "wordCount": 2,
  "parseTime": 6
}

Using Defuddle outside of the playground, however, yields the following metadata:

{
	"title": "Nico Baier",
	"description": "Nico Baier's personal website",
	"domain": "",
	"favicon": "",
	"image": "",
	"published": "",
	"author": "",
	"site": "",
	"schemaOrgData": [
		{
			"@context": "http://schema.org",
			"@type": "Person",
			"email": "mailto:nico.baier@gmail.com",
			"jobTitle": "Natural Language Analyst",
			"name": "Nico Baier",
			"additionalName": "Nicholas Baier",
			"alumniOf": [
				"University of Michigan",
				"University of California, Berkeley"
			],
			"birthPlace": "Ann Arbor, Michigan",
			"url": "https://nicobaier.com",
			"sameAs": [
				"https://linkedin.com/in/nbbaier",
				"https://github.com/nbbaier/",
				"https://bsky.app/profile/nicobaier.com"
			]
		}
	],
	"wordCount": 2,
	"parseTime": 38
}

This is the code that produces the latter output:

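import Defuddle from "defuddle";
import { JSDOM } from "jsdom";
import "global-jsdom/register"; // registers jsdom's browser globals in Node

const html = `...`; // the HTML string shown above
// Build the DOM with jsdom; no options (such as a URL) are passed
const { document } = new JSDOM(html).window;
const article = new Defuddle(document).parse();
console.log(article);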
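
A quick way to confirm which fields actually differ between the two outputs above is a shallow key-by-key comparison. The sketch below is only an illustration: `diffKeys` is a hypothetical helper, not part of Defuddle, and the two objects are copied by hand from the outputs above, with the nested schema.org array left out because its contents are the same in both.

```javascript
// Hypothetical helper (not part of Defuddle): list the keys whose values
// differ between two metadata objects, comparing values by their JSON form.
function diffKeys(a, b) {
  const keys = new Set([...Object.keys(a), ...Object.keys(b)]);
  return [...keys].filter(
    (key) => JSON.stringify(a[key]) !== JSON.stringify(b[key])
  );
}

// Values copied from the playground output above.
const playground = {
  title: "Nico Baier",
  description: "Nico Baier's personal website",
  domain: "nicobaier.com",
  favicon: "https://nicobaier.com/favicon.ico",
  image: "",
  published: "",
  author: "",
  site: "",
  wordCount: 2,
  parseTime: 6,
};

// Values copied from the server-side output above.
const serverSide = {
  ...playground,
  domain: "",
  favicon: "",
  parseTime: 38,
};

console.log(diffKeys(playground, serverSide));
// prints [ 'domain', 'favicon', 'parseTime' ]
```

Apart from `parseTime`, which is a timing measurement, only `domain` and `favicon` differ.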

Notice that domain, favicon, image, published, author, and site are all empty strings in the latter output, whereas the playground populates domain and favicon. After some playing around, I figured out that this behavior seems to stem from the presence of the application/ld+json data in the original HTML. If you remove that, the two outputs become identical:

{
	"title": "Nico Baier",
	"description": "Nico Baier's personal website",
	"domain": "",
	"favicon": "",
	"image": "",
	"published": "",
	"author": "",
	"site": "",
	"schemaOrgData": [],
	"wordCount": 2,
	"parseTime": 26
}

So the question is: why does the presence of the schemaOrgData in the original HTML populate domain and favicon on the playground but not in server-side use?
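
One observation that may help narrow this down: both playground values can be derived from the `url` field of the schema.org block, while a jsdom document created without a `url` option reports `about:blank`, which has no hostname. The sketch below only demonstrates that arithmetic with the standard `URL` class. `deriveFromUrl` is a hypothetical helper, and whether Defuddle derives these fields this way in either environment is an assumption, not something confirmed in this thread.

```javascript
// Hypothetical helper, NOT Defuddle's implementation: derive a domain and a
// default favicon location from a page URL using the standard URL class.
function deriveFromUrl(pageUrl) {
  const empty = { domain: "", favicon: "" };
  try {
    const parsed = new URL(pageUrl);
    // URLs such as about:blank parse successfully but have no hostname.
    if (!parsed.hostname) return empty;
    return {
      domain: parsed.hostname,
      favicon: new URL("/favicon.ico", parsed).href,
    };
  } catch {
    // Not a valid absolute URL at all.
    return empty;
  }
}

// The schema.org `url` from the HTML above reproduces the playground values.
console.log(deriveFromUrl("https://nicobaier.com"));

// jsdom's default document URL reproduces the empty server-side values.
console.log(deriveFromUrl("about:blank"));
```

If the document URL does turn out to matter, jsdom accepts a `url` option (`new JSDOM(html, { url: "https://nicobaier.com" })`) that sets it. Whether that changes Defuddle's output here is untested.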

kepano (Owner) commented Apr 2, 2025

Can you try again with 0.5.2?

nbbaier (Author) commented Apr 3, 2025

Same behavior on 0.5.2 and 0.5.3
