Wrong encoding detected: Windows-1254 with confidence 0.62 even though some chars are invalid · Issue #292 · chardet/chardet · GitHub
I ran chardet on all files in the python/peps repository (via codespell -e) and was surprised to see codespell fail with an error.
It turns out chardet recognizes a very normal "UTF-8" file as "Windows-1254" with confidence 0.62.
However, the file contains bytes that have no mapping in Windows-1254.
It's surprising that chardet determines an encoding that cannot possibly be true (Windows-1254) due to invalid characters in that encoding, when another encoding is correct (UTF-8).
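To make the "cannot possibly be true" point concrete, here is a minimal sketch using only the Python standard library. The helper names `undefined_bytes` and `first_invalid_offset` are hypothetical and not part of chardet or codespell; the sample bytes are an illustration, not the contents of the file in question.

```python
from typing import List, Optional


def undefined_bytes(encoding: str) -> List[int]:
    """Return the byte values that a single-byte codec refuses to decode."""
    missing = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            missing.append(value)
    return missing


def first_invalid_offset(data: bytes, encoding: str) -> Optional[int]:
    """Return the offset of the first undecodable byte, or None if all decode."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc.start
    return None


if __name__ == "__main__":
    # Byte values with no mapping in Python's cp1254 codec.
    print([hex(b) for b in undefined_bytes("cp1254")])

    # Illustrative sample: U+0410 encodes to the UTF-8 bytes D0 90,
    # and 0x90 is the byte reported in the traceback.
    sample = "\u0410".encode("utf-8")
    print(first_invalid_offset(sample, "cp1254"))
```

Any input containing one of those byte values cannot be valid Windows-1254, so a strict decode is a cheap way to rule the guess out.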
$ chardetect pep_sphinx_extensions/tests/pep_lint/test_pep_number.py
pep_sphinx_extensions/tests/pep_lint/test_pep_number.py: Windows-1254 with confidence 0.6207130055013389
$ codespell -e
Traceback (most recent call last):
  File "/Users/corneliusromer/.local/bin/codespell", line 8, in <module>
    sys.exit(_script_main())
             ^^^^^^^^^^^^^^
  File "/Users/corneliusromer/.local/pipx/venvs/codespell/lib/python3.12/site-packages/codespell_lib/_codespell.py", line 1121, in _script_main
    return main(*sys.argv[1:])
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/corneliusromer/.local/pipx/venvs/codespell/lib/python3.12/site-packages/codespell_lib/_codespell.py", line 1305, in main
    bad_count += parse_file(
                 ^^^^^^^^^^^
  File "/Users/corneliusromer/.local/pipx/venvs/codespell/lib/python3.12/site-packages/codespell_lib/_codespell.py", line 963, in parse_file
    lines, encoding = file_opener.open(filename)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/corneliusromer/.local/pipx/venvs/codespell/lib/python3.12/site-packages/codespell_lib/_codespell.py", line 232, in open
    return self.open_with_chardet(filename)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/corneliusromer/.local/pipx/venvs/codespell/lib/python3.12/site-packages/codespell_lib/_codespell.py", line 257, in open_with_chardet
    lines = f.readlines()
            ^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.12/3.12.5/Frameworks/Python.framework/Versions/3.12/lib/python3.12/encodings/cp1254.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 1349: character maps to <undefined>
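The traceback shows the detected encoding being used for a strict decode that then fails. One possible way for a caller to guard against this is to treat the detector's answer as a guess, try it strictly, and fall back to UTF-8 when it does not decode. The sketch below assumes that approach; `decode_with_fallback` is a hypothetical helper, not how codespell or chardet actually behave, and the guessed encoding is passed in as a plain string rather than obtained from a detector.

```python
from typing import Sequence, Tuple


def decode_with_fallback(
    data: bytes,
    guessed: str,
    fallbacks: Sequence[str] = ("utf-8",),
) -> Tuple[str, str]:
    """Decode strictly with the guessed encoding, then with each fallback.

    Returns the decoded text and the encoding that succeeded.
    """
    for encoding in (guessed, *fallbacks):
        try:
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # The guess was wrong or names an unknown codec; try the next one.
            continue
    raise ValueError("none of the candidate encodings could decode the data")


if __name__ == "__main__":
    # Illustrative input: UTF-8 bytes D0 90, which include the 0x90 byte
    # that cp1254 cannot map.
    text, used = decode_with_fallback("\u0410".encode("utf-8"), "cp1254")
    print(used)
```

Trying the guess first keeps the detector's result when it is usable. A caller that expects mostly UTF-8 input could reverse the order instead, since arbitrary non-UTF-8 bytes rarely pass a strict UTF-8 decode.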
This is the file with the non-Windows-1254 lines highlighted:
https://github.com/python/peps/blob/0c23a1fe311097848012f7a0561db0fec953e330/pep_sphinx_extensions/tests/pep_lint/test_pep_number.py#L55-L61