gcovr crashes with files which are not utf-8 encoded #148

strahlc · 2016-09-07T09:23:08Z

Unfortunately, we have some submodules which are not utf-8 encoded.
When we run gcovr on our project, we get a backtrace:

$ gcovr -r .
Traceback (most recent call last):
  File "/usr/lib/python-exec/python3.4/gcovr", line 2312, in <module>
    process_datafile(file_, covdata, options)
  File "/usr/lib/python-exec/python3.4/gcovr", line 891, in process_datafile
    process_gcov_data(fname, covdata, abs_filename, options)
  File "/usr/lib/python-exec/python3.4/gcovr", line 489, in process_gcov_data
    line = INPUT.readline()
  File "/usr/lib64/python3.4/codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 1750: invalid start byte

In projects where all files are utf-8 encoded everything works fine.
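
The failure can be reproduced outside gcovr. The following is a minimal sketch, assuming the offending files are Latin-1 encoded (the report does not say which encoding is actually in use); `try_decode` is a helper invented for this illustration.

```python
# Byte 0xfc is "ü" in Latin-1, but it can never start a UTF-8 sequence.
# The sample line and the Latin-1 assumption are illustrative only.
raw = b"// Gr\xfc\xdfe\n"


def try_decode(data, encoding):
    """Return the decoded text, or the decoder's error reason on failure."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "error: " + exc.reason


print(try_decode(raw, "utf-8"))    # decoding fails on byte 0xfc
print(try_decode(raw, "latin-1"))  # decodes to the intended text
```

The same bytes decode cleanly once the right encoding is named, which is why projects that are consistently utf-8 encoded never hit the error.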

We are using gcovr-3.3.

balegoff · 2016-09-08T15:35:16Z

We have had exactly the same issue since we updated gcovr from 3.2 to 3.3.

balegoff · 2016-09-08T15:51:39Z

gcovr -v -r . gives me this:

...
Parsing coverage data for file /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/vector
  Filtering coverage data for file /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/vector
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.5/bin/gcovr", line 4, in <module>
    __import__('pkg_resources').run_script('gcovr==3.2', 'gcovr')
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pkg_resources/__init__.py", line 735, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pkg_resources/__init__.py", line 1659, in run_script
    exec(script_code, namespace, namespace)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/gcovr-3.2-py3.5.egg/EGG-INFO/scripts/gcovr", line 1961, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/gcovr-3.2-py3.5.egg/EGG-INFO/scripts/gcovr", line 749, in process_datafile
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/gcovr-3.2-py3.5.egg/EGG-INFO/scripts/gcovr", line 416, in process_gcov_data
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 1049: invalid continuation byte

It's kind of random: it doesn't always fail on the same file, though mostly on our own files.

balegoff · 2016-09-14T10:08:22Z

It seems that I'm facing the issue with v3.2 when I install from source code.
Installing from homebrew works fine though.
Any chance of having v3.3 on homebrew?

Source files may not be properly encoded. Make the handling of such files more tolerant. Fixes gcovr#148.

Source files may not be properly encoded. While the compiler and gcov do not care it will blow up Python 3 that expects proper encoding. Make the handling of such files more tolerant by using the 'surrogateescape' error policy. On the other hand Python 2 does not care about the encoding. Wrap the open() function there to add the missing 'errors' parameter. Fixes gcovr#148.
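
As a rough illustration of the 'surrogateescape' policy named in the commit message above (this is not the code from the commit itself), the sketch below reads a file containing an invalid byte without raising, then restores the original bytes. The sample line and temporary file are made up for the example.

```python
import os
import tempfile

# A line containing bytes 0xfc and 0xdf, which are not valid UTF-8 here.
raw = b"int gr\xfc\xdfe = 1;\n"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Undecodable bytes are mapped to lone surrogates instead of raising
    # UnicodeDecodeError.
    with open(path, "r", encoding="utf-8", errors="surrogateescape") as f:
        line = f.readline()
finally:
    os.remove(path)

# Encoding with the same error policy restores the original bytes exactly.
roundtrip = line.encode("utf-8", errors="surrogateescape")
print(roundtrip == raw)
```

Note that the escaped characters cannot be written with a plain `encode("utf-8")`, so any output path has to use the same error policy or replace them first.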

shw9 · 2017-12-21T11:53:11Z

I'm facing the same issue and have no clue what is causing it.

Traceback (most recent call last):
  File "/opt/gcovr/noarch/3.3-2/bin/gcovr", line 2312, in <module>
    process_datafile(file_, covdata, options)
  File "/opt/gcovr/noarch/3.3-2/bin/gcovr", line 891, in process_datafile
    process_gcov_data(fname, covdata, abs_filename, options)
  File "/opt/gcovr/noarch/3.3-2/bin/gcovr", line 489, in process_gcov_data
    line = INPUT.readline()
  File "/opt/python/x86_64/3.5.1-1/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 427: invalid start byte

latk · 2018-02-11T20:43:22Z

As a workaround, using gcovr under Python 2.7 might sidestep these issues when using a single-byte encoding.

The gcovr source currently ignores file encoding. The PR #157 suggests a way to address these issues (inserting replacement characters when the input doesn't decode via UTF-8), but I think that solution is mostly wrong because it doesn't actually support non-UTF-8 encodings – it just paints over any errors. The --html-encoding option has a similar intention but works in reverse, by declaring the encoding of HTML files that include the source code directly.

I think the correct solution is to adapt #157 and introduce a --source-encoding switch to properly decode the input (defaulting to UTF-8). This still won't work for mixed-encoding code bases, but I don't know how such a use case can be addressed.
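
A minimal sketch of what such a switch could look like, assuming argparse and a hypothetical `read_source` helper (neither is taken from the gcovr source or from #157):

```python
import argparse
import io


def build_parser():
    """Build a demo parser; only the option name comes from the proposal."""
    parser = argparse.ArgumentParser(prog="demo")
    # Hypothetical wiring: the declared encoding defaults to UTF-8.
    parser.add_argument("--source-encoding", default="utf-8")
    return parser


def read_source(path, source_encoding):
    """Decode a source file with the declared encoding.

    A wrong declaration still raises UnicodeDecodeError, so errors are
    reported rather than painted over with replacement characters.
    """
    with io.open(path, "r", encoding=source_encoding) as f:
        return f.read()
```

With this shape, `build_parser().parse_args(["--source-encoding", "latin-1"]).source_encoding` would be passed to every place that opens a source file. A single value per run is also why mixed-encoding code bases remain unsupported.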

I'm deferring this issue because other tasks have to be done first, but I understand that gcovr is broken regarding encodings and needs to be fixed.

goriy · 2018-02-12T00:05:01Z

It's a good idea to introduce something like a --source-encoding parameter, maybe during the huge refactoring planned after the 3.4 release.

It's amazing, but I've got some problems even with utf-8 encoded sources on Windows (python 3.6)!

It seems like there is more than one default encoding:

- the one returned by sys.getdefaultencoding
- the one returned by locale.getpreferredencoding

They can differ. It seems that the default encoding for open is the one returned by locale.getpreferredencoding. I know no way to change it at the moment. It doesn't obey environment variables (LANG, LANGUAGE, PYTHONIOENCODING, LC_xxx), and some calls to locale.setlocale don't affect it either.
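
The two values can be compared directly. This is a small sketch, with `default_encodings` as a helper name invented for the example; the actual output depends on the platform and Python version, so none is shown.

```python
import locale
import sys


def default_encodings():
    """Collect the two 'default' encodings discussed above."""
    return {
        # Used for implicit str/bytes conversions; "utf-8" on Python 3.
        "sys.getdefaultencoding": sys.getdefaultencoding(),
        # Consulted by open() when no encoding argument is given
        # (documented behaviour for Python 3.6).
        "locale.getpreferredencoding": locale.getpreferredencoding(False),
    }


for name, value in default_encodings().items():
    print(name, "->", value)
```

On a Windows machine with a legacy code page, the second value is typically something like cp1251 or cp1252, which would explain the 'charmap' codec appearing in the error.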

So, if your source code is in that encoding (no matter whether it is utf-8 or not), you are lucky: it's enough to adjust the encoding of the html reports produced by gcovr by means of --html-encoding.

If your source code is not in that encoding, gcovr (and any other python3 script) crashes with:

UnicodeDecodeError: 'charmap' codec can't decode byte...

The only simple way to get it to work is to explicitly pass an encoding='...' parameter to the open() calls.

As far as I know, Python 2.7 reads files "as is", gcovr doesn't interfere either, so there should be no such problem.

goriy · 2018-02-12T00:14:57Z

I've got some sources in utf-8 encoding. gcovr crashes on Windows using Python 3.6.
I've tried this hack and it worked:

diff --git a/scripts/gcovr b/scripts/gcovr                          
index abc8108..3ecc8a1 100755
--- a/scripts/gcovr                                                 
+++ b/scripts/gcovr                                                 
@@ -456,7 +456,7 @@ def is_non_code(code):                          
 # Process a single gcov datafile                                   
 #                                                                  
 def process_gcov_data(data_fname, covdata, source_fname, options): 
-    INPUT = open(data_fname, "r")                                  
+    INPUT = open(data_fname, "r", encoding='utf-8')                
     #                                                              
     # Get the filename                                             
     #                                                              
@@ -1716,7 +1716,7 @@ def print_html_report(covdata, details):      
         data['ROWS'] = []                                          
         currdir = os.getcwd()                                      
         os.chdir(root_dir)                                         
-        INPUT = open(data['FILENAME'], 'r')                        
+        INPUT = open(data['FILENAME'], 'r', encoding='utf-8')      
         ctr = 1                                                    
         for line in INPUT:                                         
             data['ROWS'].append(                                   
@@ -1728,7 +1728,7 @@ def print_html_report(covdata, details):      
         data['ROWS'] = '\n'.join(data['ROWS'])                     
                                                                    
         htmlString = source_page.substitute(**data)                
-        OUTPUT = open(cdata._sourcefile, 'w')                      
+        OUTPUT = open(cdata._sourcefile, 'w', encoding='utf-8')    
         OUTPUT.write(htmlString + '\n')                            
         OUTPUT.close()

In this case, the --html-encoding parameter value should be passed as the encoding argument when saving files (and maybe not only for html reports).

latk · 2018-05-29T11:09:54Z

@goriy As you've previously looked into encoding issues – could you perhaps test PR #256 on your projects? I think it should fix these problems, although the PR doesn't yet have any testcases.

latk · 2018-06-03T20:16:03Z

Source file encoding support has been implemented in #256. If it doesn't address your use case, please add a comment with more information.

goriy · 2018-06-06T20:28:36Z

Sorry for the late answer.
I've just tested gcovr with the changes introduced in #256 on the same projects as before and in the same environment. It works fine now! Thanks a lot!

OS: Windows, Python version: 3.6, encodings:

input with utf-8 and output with utf-8
input with utf-8 and output with cp1251
input with cp1251 and output with cp1251
input with cp1251 and output with utf-8
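
The four combinations above can be checked with a small round trip. This is only a sketch of the idea, not gcovr's implementation: `transcode` and the sample text are invented for illustration.

```python
import io
import os
import shutil
import tempfile


def transcode(src_path, dst_path, source_encoding, output_encoding):
    """Read with the source encoding, write with the output encoding."""
    with io.open(src_path, "r", encoding=source_encoding) as src:
        text = src.read()
    with io.open(dst_path, "w", encoding=output_encoding) as dst:
        dst.write(text)


# Cyrillic sample that both utf-8 and cp1251 can represent.
sample = u"// \u041f\u0440\u0438\u0432\u0435\u0442"
results = {}
workdir = tempfile.mkdtemp()
try:
    for src_enc in ("utf-8", "cp1251"):
        for out_enc in ("utf-8", "cp1251"):
            src_path = os.path.join(workdir, "in.c")
            dst_path = os.path.join(workdir, "out.html")
            with io.open(src_path, "wb") as f:
                f.write(sample.encode(src_enc))
            transcode(src_path, dst_path, src_enc, out_enc)
            with io.open(dst_path, "rb") as f:
                results[(src_enc, out_enc)] = f.read().decode(out_enc)
finally:
    shutil.rmtree(workdir)

# Every combination should reproduce the original text.
print(all(text == sample for text in results.values()))
```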

P. S. Special thanks to @lisongmin!

jkloetzke added a commit to jkloetzke/gcovr that referenced this issue Nov 22, 2016

Fix unicode exceptions

22d36a7

Source files may not be properly encoded. Make the handling of such files more tolerant. Fixes gcovr#148.

jkloetzke mentioned this issue Nov 22, 2016

Fix unicode exceptions #156

Closed

latk added the Type: Bug label Feb 11, 2018

lisongmin added a commit to lisongmin/gcovr that referenced this issue May 20, 2018

Support non utf-8 encoding, fix gcovr#148

d45368a

lisongmin mentioned this issue May 21, 2018

Support --source-encoding option. #256

Merged

latk closed this as completed Jun 3, 2018
