I ran chardet on all files in the python/peps repository (via codespell -e) and was surprised to see codespell fail with an error.
It turns out chardet identifies a perfectly ordinary UTF-8 file as Windows-1254 with confidence 0.62.
However, the file contains characters that are not valid in Windows-1254.
It's surprising that chardet reports an encoding that cannot possibly be correct (Windows-1254), since the file contains bytes that are invalid in that encoding, when another encoding (UTF-8) is correct.
This is the file with the non-Windows-1254 lines highlighted:
https://github.com/python/peps/blob/0c23a1fe311097848012f7a0561db0fec953e330/pep_sphinx_extensions/tests/pep_lint/test_pep_number.py#L55-L61
Error:
$ chardetect pep_sphinx_extensions/tests/pep_lint/test_pep_number.py
pep_sphinx_extensions/tests/pep_lint/test_pep_number.py: Windows-1254 with confidence 0.6207130055013389
$ codespell -e
Traceback (most recent call last):
  File "/Users/corneliusromer/.local/bin/codespell", line 8, in <module>
    sys.exit(_script_main())
             ^^^^^^^^^^^^^^
  File "/Users/corneliusromer/.local/pipx/venvs/codespell/lib/python3.12/site-packages/codespell_lib/_codespell.py", line 1121, in _script_main
    return main(*sys.argv[1:])
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/corneliusromer/.local/pipx/venvs/codespell/lib/python3.12/site-packages/codespell_lib/_codespell.py", line 1305, in main
    bad_count += parse_file(
                 ^^^^^^^^^^^
  File "/Users/corneliusromer/.local/pipx/venvs/codespell/lib/python3.12/site-packages/codespell_lib/_codespell.py", line 963, in parse_file
    lines, encoding = file_opener.open(filename)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/corneliusromer/.local/pipx/venvs/codespell/lib/python3.12/site-packages/codespell_lib/_codespell.py", line 232, in open
    return self.open_with_chardet(filename)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/corneliusromer/.local/pipx/venvs/codespell/lib/python3.12/site-packages/codespell_lib/_codespell.py", line 257, in open_with_chardet
    lines = f.readlines()
            ^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.12/3.12.5/Frameworks/Python.framework/Versions/3.12/lib/python3.12/encodings/cp1254.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 1349: character maps to <undefined>