Added warnings if nltk_data `.zip` files exist without any corresponding `.xml` files. by tomaarsen · Pull Request #2908 · nltk/nltk · GitHub

Added warnings if nltk_data .zip files exist without any corresponding .xml files. #2908


Merged
merged 2 commits into from
Dec 15, 2021


tomaarsen
Member
@tomaarsen tomaarsen commented Dec 10, 2021

Resolves nltk/nltk_data#178

Hello!

Pull request overview

  • Added warnings when build_index is run in nltk.downloader and the root folder contains .zip files without any corresponding .xml files.

The issue

This PR was inspired mainly by nltk/nltk_data#178, as well as by nltk/nltk_data#176 and nltk/nltk_data#177. @ekaf noticed that several resources in nltk_data have .zip files with data, but no .xml file to specify what kind of data they contain. When there is no .xml, these .zip files are silently ignored.

We experienced this issue with omw-1.4.zip (nltk/nltk_data#176) and wordnet2021.zip (nltk/nltk_data#177) recently, and it seems there are more resources with these issues.

This PR adds a warning in these situations.
Running make pkg_index on nltk_data now also outputs the following:

[sic]\nltk\downloader.py:2294: UserWarning: listing.csv.zip exists, but listing.csv.xml cannot be found! This could mean that listing.csv can not be downloaded.
  for pkg_xml, zf, subdir in _find_packages(os.path.join(root, "packages")):
[sic]\nltk\downloader.py:2294: UserWarning: omw-1.4.zip exists, but omw-1.4.xml cannot be found! This could mean that omw-1.4 can not be downloaded.
  for pkg_xml, zf, subdir in _find_packages(os.path.join(root, "packages")):
[sic]\nltk\downloader.py:2294: UserWarning: ptb3.zip exists, but ptb3.xml cannot be found! This could mean that ptb3 can not be downloaded.
  for pkg_xml, zf, subdir in _find_packages(os.path.join(root, "packages")):
[sic]\nltk\downloader.py:2294: UserWarning: wordnet2021.zip exists, but wordnet2021.xml cannot be found! This could mean that wordnet2021 can not be downloaded.
  for pkg_xml, zf, subdir in _find_packages(os.path.join(root, "packages")):

This should prevent us from accidentally forgetting to add an .xml to a new nltk_data resource.
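
The check described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the helper names `find_zips_without_xml` and `warn_about_missing_xml` are hypothetical, and the sketch assumes that each package `.zip` is expected to sit next to an `.xml` file with the same base name. Only the wording of the warning is taken from the output shown above.

```python
import os
import warnings


def find_zips_without_xml(root):
    """Return sorted package ids for every .zip under `root` with no sibling .xml.

    Hypothetical helper for illustration only; it is not part of nltk.downloader.
    """
    missing = []
    for _dirpath, _dirnames, filenames in os.walk(root):
        names = set(filenames)
        for name in filenames:
            if not name.endswith(".zip"):
                continue
            pkg_id = name[: -len(".zip")]
            # Assumption: the descriptor lives in the same directory as the .zip.
            if pkg_id + ".xml" not in names:
                missing.append(pkg_id)
    return sorted(missing)


def warn_about_missing_xml(root):
    """Emit one UserWarning per orphaned .zip and return the affected ids."""
    missing = find_zips_without_xml(root)
    for pkg_id in missing:
        warnings.warn(
            f"{pkg_id}.zip exists, but {pkg_id}.xml cannot be found! "
            f"This could mean that {pkg_id} can not be downloaded.",
            stacklevel=2,
        )
    return missing
```

Returning the list of affected ids in addition to warning makes it easy for a caller, such as a build script, to decide whether the missing descriptors should be treated as a hard failure.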

Notes

Ignore my error in the commit message, it's meant to say .xml instead of .csv. I was distracted by the listing.csv and listing.csv.zip files. Are these files even still being used? GitHub is complaining that listing.csv isn't even correct, as the number of columns doesn't match across all rows.
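
For anyone wanting to reproduce the column-count complaint locally, something like the sketch below might help. It is an assumption-laden illustration: the function name `rows_with_unexpected_width` is hypothetical, the actual layout of listing.csv is not described here, and the first non-empty row is simply taken as the reference width.

```python
import csv
import io


def rows_with_unexpected_width(csv_text):
    """Return (line_number, column_count) for rows whose width differs
    from the first non-empty row. Hypothetical helper for illustration."""
    reader = csv.reader(io.StringIO(csv_text))
    expected = None
    bad = []
    for row in reader:
        if not row:
            # Skip blank lines rather than reporting them as zero-width rows.
            continue
        if expected is None:
            expected = len(row)
        elif len(row) != expected:
            bad.append((reader.line_num, len(row)))
    return bad
```

Passing the text of the file, for example via `open(path, newline="").read()`, gives a list of offending line numbers that can be compared against what GitHub reports.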

Furthermore, this PR shows that ptb3 (i.e. a stub of Penn Treebank 3) can not be downloaded. It seems that ptb is now being used, and that ptb3 is just an old copy of ptb:
image
Perhaps this means we can just delete ptb3 from nltk_data.

Thank you @ekaf for the suggestion.

  • Tom Aarsen

@tomaarsen tomaarsen force-pushed the feature/build_index_warnings branch from aa009ea to e4b3028 Compare December 10, 2021 15:02
@tomaarsen
Member Author

It seems there were issues with #2909, so I'll revert to not using that in this PR.

@tomaarsen
Member Author

Failing tests seem to be unrelated, and about SENNA. Unsure why they're failing as of now.

@tomaarsen
Member Author

I refreshed the tests, they re-downloaded the third party tools, and now the tests pass correctly 🎉

@stevenbird stevenbird merged commit 72d9885 into nltk:develop Dec 15, 2021
@stevenbird
Member

Thanks @tomaarsen, @purificant – great idea!

Development

Successfully merging this pull request may close these issues.

Packages missing a corresponding .xml file