Add abstract interface for reader/writer. by Spacetown · Pull Request #474 · gcovr/gcovr · GitHub

Add abstract interface for reader/writer. #474


Closed
Spacetown wants to merge 3 commits into master from oo_interface_reader_writer

Conversation

Spacetown
Member
@Spacetown Spacetown commented Feb 11, 2021

As discussed in #443, an object-oriented interface is added.

  • The writer xml.py is renamed to coveralls.py and an option was added. I think we should deprecate the option --xml because sonarqube.py also produces an XML file.
  • add_copyright.py was changed to generate files that conform to black.
  • Move generator utils to a separate file.
  • Add a reader for JSON files.
  • Move the check functions and OutputOrDefault to GcovrConfigOption to access them easily in the readers and writers.
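
For illustration, a minimal sketch of what such an abstract interface could look like; apart from check_options and the idea of a global writer list, the names here are hypothetical, not the PR's actual code:

    from abc import ABC, abstractmethod

    class BaseWriter(ABC):
        """Hypothetical base class that every output format would implement."""

        def check_options(self, options):
            """Validate the parsed options; raise on invalid combinations."""

        @abstractmethod
        def write_report(self, covdata, output_file, options):
            """Render the collected coverage data to the output file."""

    class HtmlWriter(BaseWriter):
        def write_report(self, covdata, output_file, options):
            ...  # format-specific rendering goes here

    # a new format only needs to subclass BaseWriter and register itself here
    WRITERS = [HtmlWriter()]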

@Spacetown Spacetown added this to the 4.3 milestone Feb 11, 2021
@Spacetown Spacetown force-pushed the oo_interface_reader_writer branch from 6b2b89d to a2358ae Compare February 11, 2021 20:56
@Spacetown Spacetown requested review from latk and chrta February 11, 2021 20:59
@Spacetown
Member Author

@latk I hope this is a little step to get your dream come true.

@codecov
codecov bot commented Feb 11, 2021

Codecov Report

Merging #474 (8382e1b) into master (23b7114) will increase coverage by 0.33%.
The diff coverage is 96.09%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #474      +/-   ##
==========================================
+ Coverage   95.60%   95.94%   +0.33%     
==========================================
  Files          20       25       +5     
  Lines        2480     2612     +132     
  Branches      426      438      +12     
==========================================
+ Hits         2371     2506     +135     
  Misses         49       49              
+ Partials       60       57       -3     
| Flag | Coverage Δ |
|------|------------|
| ubuntu-18.04 | 95.36% <95.80%> (+0.36%) ⬆️ |
| windows-2019 | 95.51% <96.05%> (+0.35%) ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Impacted Files | Coverage Δ |
|----------------|------------|
| gcovr/tests/test_config.py | 100.00% <ø> (ø) |
| gcovr/utils.py | 91.93% <ø> (+0.06%) ⬆️ |
| gcovr/reader/base.py | 80.00% <80.00%> (ø) |
| gcovr/__main__.py | 94.44% <89.36%> (+2.66%) ⬆️ |
| gcovr/writer/utils.py | 92.92% <92.92%> (ø) |
| gcovr/writer/coveralls.py | 93.75% <93.82%> (+0.54%) ⬆️ |
| gcovr/writer/__init__.py | 95.91% <95.91%> (ø) |
| gcovr/writer/json.py | 97.26% <95.91%> (-2.74%) ⬇️ |
| gcovr/reader/json.py | 96.72% <96.72%> (ø) |
| gcovr/writer/html.py | 96.56% <97.36%> (+0.29%) ⬆️ |

... and 15 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 23b7114...8382e1b. Read the comment docs.

@latk
Member
latk commented Feb 14, 2021

I've looked through the code, and it has some interesting ideas – the main() is much better now! It was a bit hard to review though, because it all consists of a single commit that moved a lot of code around and reformatted it.

I want to discuss two high-level aspects before doing a more thorough review:

  • consequences from dividing/grouping code one way or the other
  • data and control flow

The matrix problem, and potential consequences.

When dividing up functionality, there are different approaches with different consequences. I want us to think this through before committing to any alternative.

If we have N components with M concerns each, we somehow have to divide up these different aspects. I refer to that as a “matrix problem” since it's usually equally reasonable to separate aspects by row or by column:

|            | json | html | txt |
|------------|------|------|-----|
| arguments  |      |      |     |
| validation |      |      |     |
| writing    |      |      |     |

A general heuristic for solving this problem is: what changes together, stays together. This would suggest that the presented division by output format is correct, since arguments are likely to change together with the current writer (if at all).

Such a division might also enable pluggable writers, some way down the road. However, we are far from offering a sufficiently stable plugin interface, so I'm not particularly concerned about this.

Dividing by concern would enable something entirely different: loading writers only on demand. In the presented architecture, all writers must be loaded up front in order to discover available arguments. A throwaway experiment I tried out some time ago suggested opportunity for noticeably faster startup when lazy-loading writers (this wasn't previously possible due to utilities from the HTML writer being used from other components). I don't want to give up this potential performance win unless we are sure that it's worth it.
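
As a concrete sketch of that alternative (the module paths appear in this PR's file list; the table and function are otherwise illustrative):

    import importlib

    # argument discovery would only need this lightweight table ...
    _WRITER_MODULES = {
        "html": "gcovr.writer.html",
        "json": "gcovr.writer.json",
        "coveralls": "gcovr.writer.coveralls",
    }

    def load_writer(fmt):
        # ... while the heavy writer module is imported only on demand
        return importlib.import_module(_WRITER_MODULES[fmt])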

Data flow and control flow

In the proposed design, argument validation logic is moved to the various writer modules. There are two aspects that make me a bit uncomfortable.

First, all writers operate on the shared options object, which has long been a problem due to unclear interface boundaries. But now, these option accesses are less visible because they are spread across different files. I would tend to leave the validation as it is until a clear interface boundary can be introduced, e.g. if each writer has a constructor that extracts the required arguments into member fields.

Secondly, control flow. The validation code and some of the writer code contain application-level control flow. There should not be a sys.exit() outside the main – that is a CLI concern, not a writer concern. If we want to move this validation code, we should rather think about suitable exceptions.
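
The suggested direction could look roughly like this (the exception name and the validation check are illustrative, not actual gcovr code):

    import sys

    class ConfigurationError(Exception):
        """Hypothetical error raised by a reader/writer for invalid options."""

    def check_options(options):
        # a writer reports problems by raising instead of exiting
        if options.fail_under_line > 100.0:
            raise ConfigurationError("--fail-under-line must not exceed 100")

    def main(args=None):
        try:
            ...  # parse options, run readers and writers
        except ConfigurationError as e:
            sys.stderr.write(f"gcovr: error: {e}\n")
            sys.exit(1)  # exiting is a CLI concern, so it happens only here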

Conclusion

All in all, I'm really thankful for this PR because it enables us to have a conversation about future directions. But it's a lot of change with little discussion so I'm hesitant to say “yes” outright. On the other hand, just moving code around is a fairly low risk change.

What do you think? Which future directions would you prioritize with the “matrix problem”? Which changes can we focus on in this PR, what should be deferred to another iteration?

@Spacetown
Member Author

I've looked through the code, and it has some interesting ideas – the main() is much better now! It was a bit hard to review though, because it all consists of a single commit that moved a lot of code around and reformatted it.

Mea culpa! Next time I'll try to create several commits.

A general heuristic for solving this problem is: what changes together, stays together. This would suggest that the presented division by output format is correct, since arguments are likely to change together with the current writer (if at all).

That's also my preferred way. I don't want to add the options in file A, add the option handling in file B, and the code to write the output in file C. Another argument for this approach is that if you only change the HTML writer, you can be sure the other formats aren't affected.

Such a division might also enable pluggable writers, some way down the road. However, we are far from offering a sufficiently stable plugin interface, so I'm not particularly concerned about this.

I'm with you, this can be done in the far future.

Dividing by concern would enable something entirely different: loading writers only on demand. In the presented architecture, all writers must be loaded up front in order to discover available arguments. A throwaway experiment I tried out some time ago suggested opportunity for noticeably faster startup when lazy-loading writers (this wasn't previously possible due to utilities from the HTML writer being used from other components). I don't want to give up this potential performance win unless we are sure that it's worth it.

An improvement could be to lazy-load the parts needed for writing, as is already done in the HTML writer.

In the proposed design, argument validation logic is moved to the various writer modules. There are two aspects that make me a bit uncomfortable.

First, all writers operate on the shared options object, which has long been a problem due to unclear interface boundaries. But now, these option accesses are less visible because they are spread across different files. I would tend to leave the validation as it is until a clear interface boundary can be introduced, e.g. if each writer has a constructor that extracts the required arguments into member fields.

Do you mean the check_options of the reader/writer?

Secondly, control flow. The validation code and some of the writer code contain application-level control flow. There should not be a sys.exit() outside the main – that is a CLI concern, not a writer concern. If we want to move this validation code, we should rather think about suitable exceptions.

I was also not happy with this, but I didn't want to change too much code. I can use an ArgumentException here.

All in all, I'm really thankful for this PR because it enables us to have a conversation about future directions. But it's a lot of change with little discussion so I'm hesitant to say “yes” outright. On the other hand, just moving code around is a fairly low risk change.

What do you think? Which future directions would you prioritize with the “matrix problem”? Which changes can we focus on in this PR, what should be deferred to another iteration?

I prefer to keep all code related to a format in one file. For a new writer you only have to implement the functions of the base class and add it to the global list.
I think the following points should be part of this PR:

  • Replace sys.exit with exception handling in all readers/writers.
  • Add a "namespace" to all option variables, e.g. the option --html-absolute-paths should set html_relative_anchors instead of relative_anchors (see the sketch below).
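
A minimal illustration of the renaming with plain argparse (gcovr's own GcovrConfigOption wrapper would be the real place for this; the default shown is an assumption):

    import argparse

    parser = argparse.ArgumentParser()
    # before: dest="relative_anchors"; after: the dest carries the html_ prefix
    parser.add_argument(
        "--html-absolute-paths",
        dest="html_relative_anchors",
        action="store_false",
        default=True,
    )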

Separate PR:

  • Change the gcov stuff to a reader.
  • Implement lazy loading to improve startup performance.
  • Maybe add an option group for each writer's options, except for the options activating the writer: e.g. --html, --html-details and --html-details-syntax-highlighting would stay in the group output_options while all other options are moved to the group html_options.

@Spacetown Spacetown force-pushed the oo_interface_reader_writer branch from d474091 to 6c0ec03 Compare February 15, 2021 20:44
@Spacetown Spacetown mentioned this pull request Mar 24, 2021
6 tasks
@Spacetown Spacetown modified the milestones: 4.3, UpcomingRelease May 20, 2021
@Spacetown Spacetown force-pushed the oo_interface_reader_writer branch 2 times, most recently from 8ae5c51 to 7745a60 Compare June 14, 2021 19:24
@Spacetown
Member Author

@latk What do you think about the proposal? Should I add lazy loading for the readers with this PR?

@Spacetown Spacetown force-pushed the oo_interface_reader_writer branch from 7745a60 to 8382e1b Compare July 26, 2021 19:38
latk added a commit to latk/gcovr that referenced this pull request Sep 10, 2021
latk added a commit to latk/gcovr that referenced this pull request Sep 10, 2021
latk added a commit to latk/gcovr that referenced this pull request Sep 10, 2021
latk added a commit to latk/gcovr that referenced this pull request Sep 15, 2021
Summary: This commit introduces a completely rewritten parser for gcov's
human-readable report format. The implementation pays closer attention to the
upstream file format, e.g. with respect to the exact number formatting. Compared
to the previous parser, the data flow has been greatly clarified, making this
parser more obviously correct. The new parser will make it easier to react to
future changes in the format, and to make use of improvements from newer GCC
versions – such as the `Working directory` metadata.

Background
----------

The old parser had some problems that could be addressed by this rewrite.

* The old parser was very complex due to mixing low-level parsing and high-level
  decisions about coverage exclusion.

  Fixed in the new parser: processing is now split into multiple phases that
  clearly separate concerns. The low-level parsing phase produces a data model
  that is safe to use for the other phases.

* The old parser used an object-oriented design where parsing state was stored in
  instance fields. This created a huge "state space" that was difficult to
  track, potentially leading to bugs. For example, it was not immediately clear
  in the investigation of <gcovr#511> whether
  the problem was just that the parser forgot to update its line number
  correctly.

  Fixed in the new parser: using a more functional/procedural design, the data
  flows are very clear. State is tracked explicitly. By separating parsing
  phases, the state space is much smaller.

* To expand on the previous point: The old parser essentially consisted of
  multiple interleaved state machines. There was one state machine for
  processing coverage data, and an interleaved state machine for tracking active
  coverage exclusion markers. This interleaving made it difficult to understand
  whether the state transitions were correct.

  Fixed in the new parser: coverage exclusion patterns are collected in a
  separate phase, before the actual coverage data is processed.

* The old parser made use of very fragile parsing strategies, such as using
  `str.split()` excessively. This gave rise to fragile assumptions about the
  exact format. For example, the IndexError in
  <gcovr#226> stemmed from such a wrong
  assumption.

  The new parser uses regular expressions to parse tag lines, and only uses
  `str.split()` for the more structured source code lines. This is more
  self-documenting. Treatment of numerical values was aligned with the routines
  in the gcov source code. Should the format deviate in the future, the regexes
  will fail to match, making it possible to detect and fix the errors. (Until
  then, `--gcov-ignore-parse-errors` can be used.) A toy example of this
  matching style follows below.
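
As a toy illustration of this matching style (the actual patterns in the new
parser may differ in detail):

    import re

    # count field: '-' (non-code), '#####'/'=====' (uncovered), or a number
    _SOURCE_LINE = re.compile(r"^\s*([0-9]+\*?|-|#{5}|={5}):\s*([0-9]+):(.*)$")

    count, lineno, source = _SOURCE_LINE.match(
        "    #####:   52:foo() ? bar():"
    ).groups()
    assert (count, lineno, source) == ("#####", "52", "foo() ? bar():")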

Design of the new parser
------------------------

The new parser is more complex in the sense that there is a lot more code. But
there is a clearer separation of concerns, and the parser was closely informed
by the gcov documentation and source code. As a result, I am confident that it
handles far more edge cases correctly, in particular relating to the handling of
numbers/percentages.

There are three items for external use:

**`parse_metadata(lines)`** creates a dict of values from the metadata lines.
The old parser handled the very first line of the report separately to extract
the filename. The new parser uses the same more robust parsing code for this
metadata.

**`ParserFlags`** is a flag-enum that describes various boolean features. A single
object with flags seems simpler to handle than multiple variables like
`exclude_throw_branches`.
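
For readers unfamiliar with flag-enums, the shape is roughly as follows (the
member names are illustrative):

    import enum

    class ParserFlags(enum.Flag):
        NONE = 0
        EXCLUDE_THROW_BRANCHES = enum.auto()
        EXCLUDE_UNREACHABLE_BRANCHES = enum.auto()

    # one value carries all boolean features instead of separate variables
    flags = ParserFlags.EXCLUDE_THROW_BRANCHES
    assert flags & ParserFlags.EXCLUDE_THROW_BRANCHES
    assert not flags & ParserFlags.EXCLUDE_UNREACHABLE_BRANCHES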

**`parse_coverage(lines, ...)`** is the main function for parsing the coverage.
It performs multiple phases:

* Each input line is parsed/tokenized into an internal data model. The data
  model is strongly typed. The various classes like `_SourceLine` are
  implemented as NamedTuples, which is both very convenient and very
  memory-efficient.

  Relevant items: `_parse_line()`, `_parse_tag_line()`, data model

* Exclusion markers are extracted from source code lines and arranged into a
  data structure for later lookup (a toy version is sketched after this list).

  Relevant items: `_find_excluded_ranges()`, `_make_is_in_any_range()`

* Parsed lines are processed to populate a `FileCoverage` model. At this stage,
  exclusions are applied. The state space is very small, with only four
  variables that have to be tracked.

  Relevant items: `_ParserState`, `_gather_coverage_from_line()`,
  `_line_noncode_and_count()`, `_function_can_be_excluded()`,
  `_branch_can_be_excluded()`, `_add_coverage_for_function()`

* Warnings are reported, and any potential errors re-raised. This is equivalent
  to the previous parser.

  Relevant items: `_report_lines_with_errors()`
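
A toy version of the exclusion-range lookup from the second phase (the real
helpers may differ in signature and use a more efficient data structure):

    def make_is_in_any_range(ranges):
        """Turn [(start, stop), ...] line ranges into a lookup predicate."""
        def is_in_any_range(lineno):
            return any(start <= lineno < stop for start, stop in ranges)
        return is_in_any_range

    is_excluded = make_is_in_any_range([(10, 15), (40, 42)])
    assert is_excluded(12) and not is_excluded(20)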

Impact on tests
---------------

The new parser is almost completely bug-compatible with the old parser. This is
e.g. visible in the potentially unintuitive handling of the `noncode` status.
The new separation between low-level parsing and high-level decisions makes it
clearer what is actually going on.

There was a significant change in the **Nautilus parser test**. The input file
contains the following pattern:

    ------------------
        #####:   52:foo() ? bar():

Previously, this line 52 was not reported as uncovered. I consider that to be an
error, and have updated the expected output correspondingly. This could indicate
that the new parser is in fact more robust than the old parser when it comes to
template specialization sections.

In the **excl-branch test**, gcovr will encounter gcov input such as the following
when using GCC 8 or later:

        #####:    9:    virtual ~Bar()
        #####:   10:    {} // ...
    ------------------
    Bar::~Bar():
    function Bar::~Bar() called 0 returned 0% blocks executed 0%
        #####:    9:    virtual ~Bar()
        #####:   10:    {} // ...
    call    0 never executed
    call    1 never executed
    ------------------
    Bar::~Bar():
    function Bar::~Bar() called 0 returned 0% blocks executed 0%
        #####:    9:    virtual ~Bar()
        #####:   10:    {} // ...
    ------------------

The old parser associated the `function` annotations with line 11. This was
clearly incorrect. The test reference was updated to associate the destructor
with line 9.

Other than that, the tests were only updated to account for the different parser
APIs.

Internally, the new parser uses a lot of doctests.

Future directions
-----------------

The new parser extracts *all* available data, only to throw it away. It might
now become feasible to make use of more of this data. In particular:

* handling template specialization sections properly
* collecting block-level coverage data
* using the `working directory` metadata field

Conflicts with other development efforts
----------------------------------------

* <gcovr#503> report of excluded coverage

  Makes a small patch to the parser. The same effect can be achieved by adding a
  few lines in `_gather_coverage_from_line()`.

* <gcovr#484> tests with clang-10

  Touches neighboring lines. Will be reported as a merge conflict by Git, but
  there's no semantic conflict.

* <gcovr#474> abstract interface for reader/writer

  Small change in the parser code regarding `sys.exit(1)` (new parser:
  `raise SystemExit(1)`). It's worth noting that this is effectively unreachable
  code. Lines will only be reported if there was an exception, and if there was
  an exception it will be re-thrown.

* <gcovr#361> --no-markers to ignore exclusion markers

  Touches the exclusion handling code. This is of course totally changed by the
  new parser. But the new parser would make it even easier to implement that
  functionality.

* <gcovr#350> decision coverage

  Adds significant new parsing code, but most of it is outside of the
  gcov-parser. These changes could be ported with moderate effort to the new
  parser.
@Spacetown Spacetown modified the milestones: 5.1, Upcoming release Feb 20, 2022
@Spacetown Spacetown modified the milestones: 5.2, WishList Apr 1, 2022
@Spacetown
Member Author

This PR has too many conflicts and should be restarted from scratch.

@Spacetown Spacetown closed this Aug 5, 2022
@Spacetown Spacetown deleted the oo_interface_reader_writer branch April 24, 2023 20:33
@Spacetown Spacetown removed this from the WishList milestone Oct 7, 2024