Add abstract interface for reader/writer. by Spacetown · Pull Request #474 · gcovr/gcovr · GitHub

Add abstract interface for reader/writer. #474


Closed
Spacetown wants to merge 3 commits into master from oo_interface_reader_writer

Conversation

Spacetown
Member
@Spacetown Spacetown commented Feb 11, 2021

As discussed in #443, an object-oriented interface is added.

  • The writer xml.py is renamed to coveralls.py and an option was added. I think we should deprecate the option --xml because sonarqube.py also produces an XML file.
  • add_copyright.py was changed to generate files that conform to black.
  • Move generator utils to a separate file.
  • Add a reader for JSON files.
  • Move the check functions and OutputOrDefault to GcovrConfigOption to access them easily in the readers and writers.
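
For illustration, a minimal sketch of what such an abstract interface could look like; apart from check_options and the idea of a global writer list, the names here are hypothetical, not the PR's actual code:

    from abc import ABC, abstractmethod

    class BaseWriter(ABC):
        """Hypothetical base class that every output format would implement."""

        def check_options(self, options):
            """Validate the parsed options; raise on invalid combinations."""

        @abstractmethod
        def write_report(self, covdata, output_file, options):
            """Render the collected coverage data to the output file."""

    class HtmlWriter(BaseWriter):
        def write_report(self, covdata, output_file, options):
            ...  # format-specific rendering goes here

    # a new format only needs to subclass BaseWriter and register itself here
    WRITERS = [HtmlWriter()]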

@Spacetown Spacetown added this to the 4.3 milestone Feb 11, 2021
@Spacetown Spacetown force-pushed the oo_interface_reader_writer branch from 6b2b89d to a2358ae Compare February 11, 2021 20:56
@Spacetown Spacetown requested review from latk and chrta February 11, 2021 20:59
@Spacetown
Member Author

@latk I hope this is a little step to get your dream come true.

@codecov
codecov bot commented Feb 11, 2021

Codecov Report

Merging #474 (8382e1b) into master (23b7114) will increase coverage by 0.33%.
The diff coverage is 96.09%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #474      +/-   ##
==========================================
+ Coverage   95.60%   95.94%   +0.33%     
==========================================
  Files          20       25       +5     
  Lines        2480     2612     +132     
  Branches      426      438      +12     
==========================================
+ Hits         2371     2506     +135     
  Misses         49       49              
+ Partials       60       57       -3     
| Flag | Coverage Δ |
|------|------------|
| ubuntu-18.04 | 95.36% <95.80%> (+0.36%) ⬆️ |
| windows-2019 | 95.51% <96.05%> (+0.35%) ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Impacted Files | Coverage Δ |
|----------------|------------|
| gcovr/tests/test_config.py | 100.00% <ø> (ø) |
| gcovr/utils.py | 91.93% <ø> (+0.06%) ⬆️ |
| gcovr/reader/base.py | 80.00% <80.00%> (ø) |
| gcovr/__main__.py | 94.44% <89.36%> (+2.66%) ⬆️ |
| gcovr/writer/utils.py | 92.92% <92.92%> (ø) |
| gcovr/writer/coveralls.py | 93.75% <93.82%> (+0.54%) ⬆️ |
| gcovr/writer/__init__.py | 95.91% <95.91%> (ø) |
| gcovr/writer/json.py | 97.26% <95.91%> (-2.74%) ⬇️ |
| gcovr/reader/json.py | 96.72% <96.72%> (ø) |
| gcovr/writer/html.py | 96.56% <97.36%> (+0.29%) ⬆️ |

... and 15 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 23b7114...8382e1b. Read the comment docs.

@latk
Member
latk commented Feb 14, 2021

I've looked through the code, and it has some interesting ideas – the main() is much better now! It was a bit hard to review though, because it all consists of a single commit that moved a lot of code around and reformatted it.

I want to discuss two high-level aspects before doing a more thorough review:

  • consequences from dividing/grouping code one way or the other
  • data and control flow

The matrix problem, and potential consequences.

When dividing up functionality, there are different approaches with different consequences. I want us to think this through before committing to any alternative.

If we have N components with M concerns each, we somehow have to divide up these different aspects. I refer to that as a “matrix problem” since it's usually equally reasonable to separate aspects by row or by column:

|            | json | html | txt |
|------------|------|------|-----|
| arguments  |      |      |     |
| validation |      |      |     |
| writing    |      |      |     |

A general heuristic for solving this problem is: what changes together, stays together. This would suggest that the presented division by output format is correct, since arguments are likely to change together with the current writer (if at all).

Such a division might also enable pluggable writers, some way down the road. However, we are far from offering a sufficiently stable plugin interface, so I'm not particularly concerned about this.

Dividing by concern would enable something entirely different: loading writers only on demand. In the presented architecture, all writers must be loaded up front in order to discover available arguments. A throwaway experiment I tried out some time ago suggested opportunity for noticeably faster startup when lazy-loading writers (this wasn't previously possible due to utilities from the HTML writer being used from other components). I don't want to give up this potential performance win unless we are sure that it's worth it.
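
As a concrete sketch of that alternative (the module paths appear in this PR's file list; the table and function are otherwise illustrative):

    import importlib

    # argument discovery would only need this lightweight table ...
    _WRITER_MODULES = {
        "html": "gcovr.writer.html",
        "json": "gcovr.writer.json",
        "coveralls": "gcovr.writer.coveralls",
    }

    def load_writer(fmt):
        # ... while the heavy writer module is imported only on demand
        return importlib.import_module(_WRITER_MODULES[fmt])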

Data flow and control flow

In the proposed design, argument validation logic is moved to the various writer modules. There are two aspects that make me a bit uncomfortable.

First, all writers operate on the shared options object, which has long been a problem due to unclear interface boundaries. But now, these option accesses are less visible because they are spread across different files. I would tend to leave the validation as it is until a clear interface boundary can be introduced, e.g. if each writer has a constructor that extracts the required arguments into member fields.

Secondly, control flow. The validation code and some of the writer code contain application-level control flow. There should not be a sys.exit() outside the main – that is a CLI concern, not a writer concern. If we want to move this validation code, we should rather think about suitable exceptions.
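
The suggested direction could look roughly like this (the exception name and the validation check are illustrative, not actual gcovr code):

    import sys

    class ConfigurationError(Exception):
        """Hypothetical error raised by a reader/writer for invalid options."""

    def check_options(options):
        # a writer reports problems by raising instead of exiting
        if options.fail_under_line > 100.0:
            raise ConfigurationError("--fail-under-line must not exceed 100")

    def main(args=None):
        try:
            ...  # parse options, run readers and writers
        except ConfigurationError as e:
            sys.stderr.write(f"gcovr: error: {e}\n")
            sys.exit(1)  # exiting is a CLI concern, so it happens only here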

Conclusion

All in all, I'm really thankful for this PR because it enables us to have a conversation about future directions. But it's a lot of change with little discussion so I'm hesitant to say “yes” outright. On the other hand, just moving code around is a fairly low risk change.

What do you think? Which future directions would you prioritize with the “matrix problem”? Which changes can we focus on in this PR, what should be deferred to another iteration?

@Spacetown
Member Author

I've looked through the code, and it has some interesting ideas – the main() is much better now! It was a bit hard to review though, because it all consists of a single commit that moved a lot of code around and reformatted it.

Mea culpa! Next time I'll try to create several commits.

A general heuristic for solving this problem is: what changes together, stays together. This would suggest that the presented division by output format is correct, since arguments are likely to change together with the current writer (if at all).

That's also my preferred way. I don't want to add the options in file A, add the option handling in file B, and the code to write the output in file C. Another argument for this approach is that if you only change the HTML writer, you can be sure the other formats aren't affected.

Such a division might also enable pluggable writers, some way down the road. However, we are far from offering a sufficiently stable plugin interface, so I'm not particularly concerned about this.

I'm with you, this can be done in the far future.

Dividing by concern would enable something entirely different: loading writers only on demand. In the presented architecture, all writers must be loaded up front in order to discover available arguments. A throwaway experiment I tried out some time ago suggested opportunity for noticeably faster startup when lazy-loading writers (this wasn't previously possible due to utilities from the HTML writer being used from other components). I don't want to give up this potential performance win unless we are sure that it's worth it.

An improvement could be to lazy-load the parts needed for writing, as is already done in the HTML writer.

In the proposed design, argument validation logic is moved to the various writer modules. There are two aspects that make me a bit uncomfortable.

First, all writers operate on the shared options object, which has long been a problem due to unclear interface boundaries. But now, these option accesses are less visible because they are spread across different files. I would tend to leave the validation as it is until a clear interface boundary can be introduced, e.g. if each writer has a constructor that extracts the required arguments into member fields.

Do you mean the check_options of the reader/writer?

Secondly, control flow. The validation code and some of the writer code contain application-level control flow. There should not be a sys.exit() outside the main – that is a CLI concern, not a writer concern. If we want to move this validation code, we should rather think about suitable exceptions.

I was also not happy with this, but I didn't want to change too much code. I can use an ArgumentException here.

All in all, I'm really thankful for this PR because it enables us to have a conversation about future directions. But it's a lot of change with little discussion so I'm hesitant to say “yes” outright. On the other hand, just moving code around is a fairly low risk change.

What do you think? Which future directions would you prioritize with the “matrix problem”? Which changes can we focus on in this PR, what should be deferred to another iteration?

I prefer to keep all code related to a format in one file. For a new writer you only have to implement the functions of the base class and add it to the global list.
I think the following points should be part of this PR:

  • Replace sys.exit with exception handling in all readers/writers.
  • Add a "namespace" to all option variables, e.g. the option --html-absolute-paths should set html_relative_anchors instead of relative_anchors (see the sketch below).
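
A minimal illustration of the renaming with plain argparse (gcovr's own GcovrConfigOption wrapper would be the real place for this; the default shown is an assumption):

    import argparse

    parser = argparse.ArgumentParser()
    # before: dest="relative_anchors"; after: the dest carries the html_ prefix
    parser.add_argument(
        "--html-absolute-paths",
        dest="html_relative_anchors",
        action="store_false",
        default=True,
    )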

Separate PR:

  • Change the gcov stuff to a reader.
  • Implement lazy loading to improve startup performance.
  • Maybe add an option group for each writer's options, except for the options activating the writer: e.g. --html, --html-details and --html-details-syntax-highlighting would stay in the group output_options while all other options are moved to the group html_options.

@Spacetown Spacetown force-pushed the oo_interface_reader_writer branch from d474091 to 6c0ec03 Compare February 15, 2021 20:44
@Spacetown Spacetown mentioned this pull request Mar 24, 2021
6 tasks
@Spacetown Spacetown modified the milestones: 4.3, UpcomingRelease May 20, 2021
@Spacetown Spacetown force-pushed the oo_interface_reader_writer branch 2 times, most recently from 8ae5c51 to 7745a60 Compare June 14, 2021 19:24
@Spacetown
Member Author

@latk What do you think about the proposal? Should I add lazy loading for the readers with this PR?

@Spacetown Spacetown force-pushed the oo_interface_reader_writer branch from 7745a60 to 8382e1b Compare July 26, 2021 19:38
latk added a commit to latk/gcovr that referenced this pull request Sep 10, 2021
latk added a commit to latk/gcovr that referenced this pull request Sep 10, 2021
latk added a commit to latk/gcovr that referenced this pull request Sep 10, 2021
latk added a commit to latk/gcovr that referenced this pull request Sep 15, 2021
Summary: This commit introduces a completely rewritten parser for gcov's
human-readable report format. The implementation pays closer attention to the
upstream file format, e.g. with respect to the exact number formatting. Compared
to the previous parser, the data flow has been greatly clarified, making this
parser more obviously correct. The new parser will make it easier to react to
future changes in the format, and to make use of improvements from newer GCC
versions – such as the `Working directory` metadata.

Background
----------

The old parser had some problems that could be addressed by this rewrite.

* The old parser was very complex due to mixing low-level parsing and high-level
  decisions about coverage exclusion.

  Fixed in the new parser: processing is now split into multiple phases that
  clearly separate concerns. The low-level parsing phase produces a data model
  that is safe to use for the other phases.

* The old parser used an object-oriented design where parsing state was stored in
  instance fields. This created a huge "state space" that was difficult to
  track, potentially leading to bugs. For example, it was not immediately clear
  in the investigation of <gcovr#511> whether
  the problem was just that the parser forgot to update its line number
  correctly.

  Fixed in the new parser: using a more functional/procedural design, the data
  flows are very clear. State is tracked explicitly. By separating parsing
  phases, the state space is much smaller.

* To expand on the previous point: The old parser essentially consisted of
  multiple interleaved state machines. There was one state machine for
  processing coverage data, and an interleaved state machine for tracking active
  coverage exclusion markers. This interleaving made it difficult to understand
  whether the state transitions were correct.

  Fixed in the new parser: coverage exclusion patterns are collected in a
  separate phase, before the actual coverage data is processed.

* The old parser made use of very fragile parsing strategies, such as using
  `str.split()` excessively. This gave rise to fragile assumptions about the
  exact format. For example, the IndexError in
  <gcovr#226> stemmed from such a wrong
  assumption.

  The new parser uses regular expressions to parse tag lines, and only uses
  `str.split()` for the more structured source code lines. This is more
  self-documenting. Treatment of numerical values was aligned with the routines
  in the gcov source code. Should the format deviate in the future, the regexes
  will fail to match, making it possible to detect and fix the errors. (Until
  then, `--gcov-ignore-parse-errors` can be used.) A toy example of this
  matching style follows below.
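
As a toy illustration of this matching style (the actual patterns in the new
parser may differ in detail):

    import re

    # count field: '-' (non-code), '#####'/'=====' (uncovered), or a number
    _SOURCE_LINE = re.compile(r"^\s*([0-9]+\*?|-|#{5}|={5}):\s*([0-9]+):(.*)$")

    count, lineno, source = _SOURCE_LINE.match(
        "    #####:   52:foo() ? bar():"
    ).groups()
    assert (count, lineno, source) == ("#####", "52", "foo() ? bar():")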

Design of the new parser
------------------------

The new parser is more complex in the sense that there is a lot more code. But
there is a clearer separation of concerns, and the parser was closely informed
by the gcov documentation and source code. As a result, I am confident that it
handles far more edge cases correctly, in particular relating to the handling of
numbers/percentages.

There are three items for external use:

**`parse_metadata(lines)`** creates a dict of values from the metadata lines.
The old parser handled the very first line of the report separately to extract
the filename. The new parser uses the same more robust parsing code for this
metadata.

**`ParserFlags`** is a flag-enum that describes various boolean features. A single
object with flags seems simpler to handle than multiple variables like
`exclude_throw_branches`.
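
For readers unfamiliar with flag-enums, the shape is roughly as follows (the
member names are illustrative):

    import enum

    class ParserFlags(enum.Flag):
        NONE = 0
        EXCLUDE_THROW_BRANCHES = enum.auto()
        EXCLUDE_UNREACHABLE_BRANCHES = enum.auto()

    # one value carries all boolean features instead of separate variables
    flags = ParserFlags.EXCLUDE_THROW_BRANCHES
    assert flags & ParserFlags.EXCLUDE_THROW_BRANCHES
    assert not flags & ParserFlags.EXCLUDE_UNREACHABLE_BRANCHES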

**`parse_coverage(lines, ...)`** is the main function for parsing the coverage.
It performs multiple phases:

* Each input line is parsed/tokenized into an internal data model. The data
  model is strongly typed. The various classes like `_SourceLine` are
  implemented as NamedTuples, which is both very convenient and very
  memory-efficient.

  Relevant items: `_parse_line()`, `_parse_tag_line()`, data model

* Exclusion markers are extracted from source code lines and arranged into a
  data structure for later lookup (a toy version is sketched after this list).

  Relevant items: `_find_excluded_ranges()`, `_make_is_in_any_range()`

* Parsed lines are processed to populate a `FileCoverage` model. At this stage,
  exclusions are applied. The state space is very small, with only four
  variables that have to be tracked.

  Relevant items: `_ParserState`, `_gather_coverage_from_line()`,
  `_line_noncode_and_count()`, `_function_can_be_excluded()`,
  `_branch_can_be_excluded()`, `_add_coverage_for_function()`

* Warnings are reported, and any potential errors re-raised. This is equivalent
  to the previous parser.

  Relevant items: `_report_lines_with_errors()`
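
A toy version of the exclusion-range lookup from the second phase (the real
helpers may differ in signature and use a more efficient data structure):

    def make_is_in_any_range(ranges):
        """Turn [(start, stop), ...] line ranges into a lookup predicate."""
        def is_in_any_range(lineno):
            return any(start <= lineno < stop for start, stop in ranges)
        return is_in_any_range

    is_excluded = make_is_in_any_range([(10, 15), (40, 42)])
    assert is_excluded(12) and not is_excluded(20)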

Impact on tests
---------------

The new parser is almost completely bug-compatible with the old parser. This is
e.g. visible in the potentially unintuitive handling of the `noncode` status.
The new separation between low-level parsing and high-level decisions makes it
clearer what is actually going on.

There was a significant change in the **Nautilus parser test**. The input file
contains the following pattern:

    ------------------
        #####:   52:foo() ? bar():

Previously, this line 52 was not reported as uncovered. I consider that to be an
error, and have updated the expected output correspondingly. This could indicate
that the new parser is in fact more robust than the old parser when it comes to
template specialization sections.

In the **excl-branch test**, gcovr will encounter gcov input such as the following
when using GCC 8 or later:

        #####:    9:    virtual ~Bar()
        #####:   10:    {} // ...
    ------------------
    Bar::~Bar():
    function Bar::~Bar() called 0 returned 0% blocks executed 0%
        #####:    9:    virtual ~Bar()
        #####:   10:    {} // ...
    call    0 never executed
    call    1 never executed
    ------------------
    Bar::~Bar():
    function Bar::~Bar() called 0 returned 0% blocks executed 0%
        #####:    9:    virtual ~Bar()
        #####:   10:    {} // ...
    ------------------

The old parser associated the `function` annotations with line 11. This was
clearly incorrect. The test reference was updated to associate the destructor
with line 9.

Other than that, the tests were only updated to account for the different parser
APIs.

Internally, the new parser uses a lot of doctests.

Future directions
-----------------

The new parser extracts *all* available data, only to throw it away. It might
now become feasible to make use of more of this data. In particular:

* handling template specialization sections properly
* collecting block-level coverage data
* using the `working directory` metadata field

Conflicts with other development efforts
----------------------------------------

* <gcovr#503> report of excluded coverage

  Makes a small patch to the parser. The same effect can be achieved by adding a
  few lines in `_gather_coverage_from_line()`.

* <gcovr#484> tests with clang-10

  Touches neighboring lines. Will be reported as a merge conflict by Git, but
  there's no semantic conflict.

* <gcovr#474> abstract interface for reader/writer

  Small change in the parser code regarding `sys.exit(1)` (new parser:
  `raise SystemExit(1)`). It's worth noting that this is effectively unreachable
  code. Lines will only be reported if there was an exception, and if there was
  an exception it will be re-thrown.

* <gcovr#361> --no-markers to ignore exclusion markers

  Touches the exclusion handling code. This is of course totally changed by the
  new parser. But the new parser would make it even easier to implement that
  functionality.

* <gcovr#350> decision coverage

  Adds significant new parsing code, but most of it is outside of the
  gcov-parser. These changes could be ported with moderate effort to the new
  parser.
@Spacetown Spacetown modified the milestones: 5.1, Upcoming release Feb 20, 2022
@Spacetown Spacetown modified the milestones: 5.2, WishList Apr 1, 2022
@Spacetown
Member Author

This PR has too many conflicts and should be restarted from scratch.

@Spacetown Spacetown closed this Aug 5, 2022
@Spacetown Spacetown deleted the oo_interface_reader_writer branch April 24, 2023 20:33
@Spacetown Spacetown removed this from the WishList milestone Oct 7, 2024