fix schema extractor by mromanelli9 · Pull Request #154 · goose3/goose3 · GitHub

fix schema extractor #154


Merged
merged 2 commits into from
Oct 18, 2022

Conversation

mromanelli9
Contributor

Reference Issues/PRs

None

What does this implement/fix? Explain your changes.

SchemaExtractor only accepts a schema whose @context is http://schema.org, but the Schema.org URL is now https://schema.org (with https).
As a result, SchemaExtractor fails whenever it finds https://schema.org.

Additional comments

This fix just replaces the old URL with the https version. Another way could be to consider both versions with something like this:

#goose3/extractors/schema.py#L38
if context["@context"][-10:] == "schema.org" and context["@type"] in KNOWN_SCHEMA_TYPES:
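A minimal sketch of how the suffix check proposed above behaves, wrapped in a helper so it can be run on its own. The function name `matches_schema_suffix` and the contents of `KNOWN_SCHEMA_TYPES` are assumptions made for illustration; the real tuple lives in goose3 and is not shown in this PR.

```python
# Stand-in for goose3's KNOWN_SCHEMA_TYPES; the values here are assumed.
KNOWN_SCHEMA_TYPES = ("Article", "NewsArticle")


def matches_schema_suffix(context: dict) -> bool:
    """Hypothetical helper mirroring the suffix comparison from the PR comment."""
    ctx = context.get("@context")
    if not isinstance(ctx, str):
        # A missing or non-string @context can never match.
        return False
    # Same comparison as the proposal: the last 10 characters must be "schema.org".
    return ctx[-10:] == "schema.org" and context.get("@type") in KNOWN_SCHEMA_TYPES


for url in ("http://schema.org", "https://schema.org", "https://schema.org/"):
    print(url, matches_schema_suffix({"@context": url, "@type": "Article"}))
```

Under these assumptions the check accepts both the http and https forms, but it rejects a context written with a trailing slash and accepts any other host that happens to end in "schema.org".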

Contributor
@lababidi lababidi left a comment


@barrust looks good. Anything I might have missed?

@barrust
Collaborator
barrust commented Oct 17, 2022

I think checking both versions, either with what you are proposing or with an `in` expression, would allow older HTML files to continue to work. I wonder if this is why the tests are failing.
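For comparison, the `in` expression variant could look roughly like the sketch below. The names `SCHEMA_ORG_CONTEXTS` and `is_known_schema` are hypothetical, and `KNOWN_SCHEMA_TYPES` is again a stand-in with assumed values rather than the tuple defined in goose3.

```python
# Hypothetical constant listing the accepted @context values (old and new URL).
SCHEMA_ORG_CONTEXTS = ("http://schema.org", "https://schema.org")

# Stand-in for goose3's KNOWN_SCHEMA_TYPES; the values here are assumed.
KNOWN_SCHEMA_TYPES = ("Article", "NewsArticle")


def is_known_schema(context: dict) -> bool:
    """Accept a JSON-LD block only when @context is exactly one of the known URLs."""
    return (
        context.get("@context") in SCHEMA_ORG_CONTEXTS
        and context.get("@type") in KNOWN_SCHEMA_TYPES
    )


old_page = {"@context": "http://schema.org", "@type": "Article"}
new_page = {"@context": "https://schema.org", "@type": "Article"}
print(is_known_schema(old_page), is_known_schema(new_page))
```

Exact membership is stricter than a suffix comparison: unrelated hosts ending in "schema.org" no longer match, and any further accepted form would have to be added to the tuple explicitly.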

@barrust barrust self-requested a review October 17, 2022 16:16
@codecov-commenter

Codecov Report

Merging #154 (1455383) into master (e39ca61) will not change coverage.
The diff coverage is 50.00%.

Additional details and impacted files


@@           Coverage Diff           @@
##           master     #154   +/-   ##
=======================================
  Coverage   91.03%   91.03%           
=======================================
  Files          30       30           
  Lines        2409     2409           
=======================================
  Hits         2193     2193           
  Misses        216      216           
Impacted Files                 Coverage Δ
goose3/extractors/schema.py    75.00% <50.00%> (ø)

@barrust barrust merged commit a3448b3 into goose3:master Oct 18, 2022
@mromanelli9 mromanelli9 deleted the schema-fix branch October 19, 2022 07:28
4 participants