feat: improve dpkg cataloger license recognition for "license agreements" #3888
Description
The DPKG cataloger can run into issues when parsing license contents from a reader.
syft/syft/pkg/cataloger/debian/parse_copyright.go
Lines 15 to 41 in 175a671
The above code follows this specification to try to read
License: <VALUE>
from the copyright file.
What this PR Solves
Some popular licenses do not include the
License:
tag and instead open with a heading:
This PR takes the above header case into consideration and builds an exception list so that these multi-line license agreements can be surfaced as license strings and returned from the copyright reader.
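To illustrate the idea, here is a minimal Go sketch of the two-step lookup: collect `License:` field values first, and fall back to a heading exception list only when no field is present. This is not the code from this PR. The names `agreementHeadings`, `findLicenses`, and `licensesFromString`, as well as the single placeholder heading, are assumptions made up for the example; the real exception list entries are not reproduced here.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// agreementHeadings is a HYPOTHETICAL exception list used only for this sketch.
// The actual headings recognized by the PR are not shown in the description.
var agreementHeadings = []string{"LICENSE AGREEMENT"}

// findLicenses returns the values of all "License:" fields found in r.
// If there are none, it returns the lines that match a known agreement heading.
func findLicenses(r io.Reader) []string {
	var fields, headings []string
	scanner := bufio.NewScanner(r)
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if strings.HasPrefix(line, "License:") {
			value := strings.TrimSpace(strings.TrimPrefix(line, "License:"))
			if value != "" {
				fields = append(fields, value)
			}
			continue
		}
		// Case-insensitive match against the exception list.
		upper := strings.ToUpper(line)
		for _, heading := range agreementHeadings {
			if strings.Contains(upper, heading) {
				headings = append(headings, line)
				break
			}
		}
	}
	if len(fields) > 0 {
		return fields
	}
	return headings
}

// licensesFromString is a small convenience wrapper for in-memory contents.
func licensesFromString(text string) []string {
	return findLicenses(strings.NewReader(text))
}

func main() {
	// A copyright file with a License: field.
	fmt.Println(licensesFromString("Files: *\nLicense: GPL-2\n"))
	// A made-up agreement that has a heading but no License: field.
	fmt.Println(licensesFromString("Example Vendor License Agreement\n\nTerms follow...\n"))
}
```

Preferring `License:` fields over heading matches keeps the existing behavior unchanged for well-formed copyright files; the exception list only comes into play when the field is absent.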
I followed the directions on the issue to check whether this PR covers the use case of detecting the NVIDIA license agreement. The snippet below shows two new licenses being detected along with their associated packages.
Alternative License Discovery Methods This PR Solves
This PR also attempts to categorize licenses that do NOT have a
License:
field by passing their contents through the new scanner from #3876.
This will create a license identified by the SHA-256 checksum of its contents. If the user sets the variable
SYFT_LICENSE_CONTENT=unknown
, the SBOM will return the full contents for these unknown licenses.
Here is an example of one of these alternatively discovered licenses with
SYFT_LICENSE_CONTENT=unknown
enabled that was not discovered previously:
Packages with no licenses for the given test image and why
Names and versions of packages still missing licenses in the final SBOM after failing to find values in
parseLicensesFromCopyright
and running the license classifier against their contents.
For some of the CUDA packages, not having a
copyright
file under /usr/share/doc/<PKG>/
means we cannot use either of the above methods for license identification. In this case there are no contents to pass to the classifier and no reader to try to extract License:
fields from.
Full list of packages with no licenses after this is merged:
Type of change
Checklist: