Wrong encoding detected: Windows-1254 with confidence 0.62 even though some chars are invalid · Issue #292 · chardet/chardet · GitHub

Wrong encoding detected: Windows-1254 with confidence 0.62 even though some chars are invalid #292

Open
corneliusroemer opened this issue Aug 19, 2024 · 0 comments

I ran chardet on all files in the python/peps repository (via codespell -e) and was surprised to see codespell fail with an error.

It turns out chardet detects a perfectly ordinary UTF-8 file as Windows-1254 with confidence 0.62.

However, the file contains bytes that have no mapping in Windows-1254 (such as 0x90, see the traceback below).

It's surprising that chardet reports an encoding that cannot possibly be correct (Windows-1254, in which some of the file's bytes are invalid) when another encoding (UTF-8) is correct.
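
To illustrate the point, here is a minimal sketch of a strict-decode check. The helper name `can_decode` and the sample string are hypothetical (the sample is not the actual content of the linked file); it only relies on Python's built-in `cp1254` codec, which, as the traceback below shows, has no mapping for byte 0x90.

```python
def can_decode(data: bytes, encoding: str) -> bool:
    """Return True if `data` decodes under `encoding` with strict error handling."""
    try:
        data.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical sample: U+0410 (Cyrillic capital A) is encoded in UTF-8 as
# the two bytes 0xD0 0x90, so the data contains the byte 0x90.
sample = "PEP: \u0410".encode("utf-8")

print(can_decode(sample, "utf-8"))   # valid UTF-8
print(can_decode(sample, "cp1254"))  # 0x90 is undefined in cp1254
```

A check like this could, in principle, be used to rule out candidate encodings that cannot decode the input at all, before any confidence score is considered.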

This is the file with the non-Windows-1254 lines highlighted:
https://github.com/python/peps/blob/0c23a1fe311097848012f7a0561db0fec953e330/pep_sphinx_extensions/tests/pep_lint/test_pep_number.py#L55-L61

Error:

$ chardetect pep_sphinx_extensions/tests/pep_lint/test_pep_number.py
pep_sphinx_extensions/tests/pep_lint/test_pep_number.py: Windows-1254 with confidence 0.6207130055013389

$ codespell -e
Traceback (most recent call last):
  File "/Users/corneliusromer/.local/bin/codespell", line 8, in <module>
    sys.exit(_script_main())
             ^^^^^^^^^^^^^^
  File "/Users/corneliusromer/.local/pipx/venvs/codespell/lib/python3.12/site-packages/codespell_lib/_codespell.py", line 1121, in _script_main
    return main(*sys.argv[1:])
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/corneliusromer/.local/pipx/venvs/codespell/lib/python3.12/site-packages/codespell_lib/_codespell.py", line 1305, in main
    bad_count += parse_file(
                 ^^^^^^^^^^^
  File "/Users/corneliusromer/.local/pipx/venvs/codespell/lib/python3.12/site-packages/codespell_lib/_codespell.py", line 963, in parse_file
    lines, encoding = file_opener.open(filename)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/corneliusromer/.local/pipx/venvs/codespell/lib/python3.12/site-packages/codespell_lib/_codespell.py", line 232, in open
    return self.open_with_chardet(filename)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/corneliusromer/.local/pipx/venvs/codespell/lib/python3.12/site-packages/codespell_lib/_codespell.py", line 257, in open_with_chardet
    lines = f.readlines()
            ^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.12/3.12.5/Frameworks/Python.framework/Versions/3.12/lib/python3.12/encodings/cp1254.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 1349: character maps to <undefined>
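
As a possible caller-side workaround, a tool that consumes a detected encoding could fall back to other candidates when strict decoding fails. The sketch below is an assumption for illustration only: `decode_with_fallback` is a hypothetical helper, not part of chardet or codespell, and the guessed encoding is passed in as a plain string rather than obtained from a detector.

```python
from typing import Iterable, Optional, Tuple


def decode_with_fallback(
    data: bytes,
    guessed: Optional[str],
    fallbacks: Iterable[str] = ("utf-8",),
) -> Tuple[str, str]:
    """Try the guessed encoding first, then each fallback, decoding strictly.

    Returns the decoded text and the encoding that actually worked.
    """
    candidates = ([guessed] if guessed else []) + list(fallbacks)
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this candidate cannot represent the data; try the next
    raise ValueError(f"none of the candidate encodings {candidates} could decode the data")


# Hypothetical input: valid UTF-8 containing the byte 0x90 (from U+0410),
# paired with a wrong guess of cp1254.
text, used = decode_with_fallback("caf\u00e9 \u0410".encode("utf-8"), "cp1254")
print(used)
```

This does not address the detection itself; it only avoids the crash when the reported encoding cannot decode the file.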