Return full license string instead of SHA256 hash when license string exceeds 64 characters. · Issue #3780 · anchore/syft · GitHub

Return full license string instead of SHA256 hash when license string exceeds 64 characters. #3780


Closed
Funsho-Agboola opened this issue Apr 4, 2025 · 7 comments · Fixed by #3844
Assignees
Labels
enhancement New feature or request

Comments

@Funsho-Agboola
Funsho-Agboola commented Apr 4, 2025

What would you like to be added:
I would like Syft to add a feature that returns the full license string, even when it exceeds 64 characters, instead of hashing it and returning a LicenseRef-<hash>

This could be done by:

  • Returning the original license string as well, for better traceability, or adding a flag that outputs the full license string regardless of its length.

Why is this needed:
Currently, Syft hashes license strings longer than 64 characters using SHA256, replacing license strings with:

`LicenseRef-"sha256-Hash"`

This actually limits traceability during license scans and compliance checks. The redislabs/k8s-controller:7.8.2-6 image is a good example.

The following license string was found in the RPM DB on this image using:

rpm -qa --qf '%{NAME}: %{LICENSE}\n'

glibc-minimal-langpack: LGPLv2+ and LGPLv2+ with exceptions and GPLv2+ and GPLv2+ with exceptions and BSD and Inner-Net and ISC and Public Domain and GFDL
glibc-common: LGPLv2+ and LGPLv2+ with exceptions and GPLv2+ and GPLv2+ with exceptions and BSD and Inner-Net and ISC and Public Domain and GFDL
glibc: LGPLv2+ and LGPLv2+ with exceptions and GPLv2+ and GPLv2+ with exceptions and BSD and Inner-Net and ISC and Public Domain and GFDL

Syft hashes the string and returns:

LicenseRef-cedbc2fa4301332b3d3569627696d986a63b3f3a293a2759a611c7c3deebd428

I verified this in Python:

import hashlib
print(hashlib.sha256(b"LGPLv2+ and LGPLv2+ with exceptions and GPLv2+ and GPLv2+ with exceptions and BSD and Inner-Net and ISC and Public Domain and GFDL").hexdigest())
cedbc2fa4301332b3d3569627696d986a63b3f3a293a2759a611c7c3deebd428
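Until the full string is available in the output, one way to recover traceability is a lookup table from hashed ID back to the original string. The following is a minimal sketch, not part of Syft: `hashedRefs` is a hypothetical helper that reads `name: license` lines as printed by the rpm query above and applies the same SHA256 hashing to every license string longer than 64 characters.

```go
package main

import (
	"bufio"
	"crypto/sha256"
	"fmt"
	"strings"
)

// sampleOutput mimics the rpm query output shown above.
const sampleOutput = `glibc-common: LGPLv2+ and LGPLv2+ with exceptions and GPLv2+ and GPLv2+ with exceptions and BSD and Inner-Net and ISC and Public Domain and GFDL
glibc: LGPLv2+ and LGPLv2+ with exceptions and GPLv2+ and GPLv2+ with exceptions and BSD and Inner-Net and ISC and Public Domain and GFDL
libassuan: LGPLv2+ and GPLv3+
`

// hashedRefs (hypothetical helper) maps each hashed LicenseRef ID to the
// original license string. Licenses of 64 characters or fewer are skipped,
// because they are not hashed.
func hashedRefs(rpmOutput string) map[string]string {
	refs := make(map[string]string)
	scanner := bufio.NewScanner(strings.NewReader(rpmOutput))
	for scanner.Scan() {
		// Split on the first ": " only, so the license text stays whole.
		_, license, ok := strings.Cut(scanner.Text(), ": ")
		if !ok || len(license) <= 64 {
			continue
		}
		sum := sha256.Sum256([]byte(license))
		refs[fmt.Sprintf("LicenseRef-%x", sum)] = license
	}
	return refs
}

func main() {
	// Packages sharing a license string collapse into one map entry.
	for id, license := range hashedRefs(sampleOutput) {
		fmt.Printf("%s\n  %s\n", id, license)
	}
}
```

The table has to be built from the same image the SBOM was generated from, since the hash alone cannot be reversed.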

Additional context:

The behaviour is defined here: https://github.com/anchore/syft/blob/main/syft/format/internal/spdxutil/helpers/license.go

Particularly:

		if len(l.Value) <= 64 {
			// if the license text is less than the size of the hash,
			// just use it directly so the id is more readable
			candidate.ID = spdxlicense.LicenseRefPrefix + SanitizeElementID(l.Value)
		} else {
			hash := sha256.Sum256([]byte(l.Value))
			candidate.ID = fmt.Sprintf("%s%x", spdxlicense.LicenseRefPrefix, hash)
		}
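To try the branch above in isolation, here is a self-contained sketch. `licenseRefPrefix` and `sanitize` are stand-ins for `spdxlicense.LicenseRefPrefix` and `SanitizeElementID`; the sanitizer is assumed to replace every character outside `[a-zA-Z0-9.-]` with `-`, which may differ from the real helper.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"regexp"
)

// Stand-in for spdxlicense.LicenseRefPrefix.
const licenseRefPrefix = "LicenseRef-"

// Stand-in for SanitizeElementID (assumed behaviour, see the note above).
var invalidIDChars = regexp.MustCompile(`[^a-zA-Z0-9.-]`)

func sanitize(s string) string {
	return invalidIDChars.ReplaceAllString(s, "-")
}

// licenseRefID mirrors the quoted branch: short values are embedded in the
// ID after sanitizing, longer ones are replaced by their SHA256 digest.
func licenseRefID(value string) string {
	if len(value) <= 64 {
		return licenseRefPrefix + sanitize(value)
	}
	hash := sha256.Sum256([]byte(value))
	return fmt.Sprintf("%s%x", licenseRefPrefix, hash)
}

func main() {
	// 18 characters, so it is embedded: LicenseRef-LGPLv2--and-GPLv3-
	fmt.Println(licenseRefID("LGPLv2+ and GPLv3+"))

	// Longer than 64 characters, so only the digest appears in the ID.
	fmt.Println(licenseRefID("LGPLv2+ and LGPLv2+ with exceptions and GPLv2+ and GPLv2+ with exceptions and BSD and Inner-Net and ISC and Public Domain and GFDL"))
}
```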

Environment:

Syft version:
syft 1.20.0

@Funsho-Agboola Funsho-Agboola added the enhancement New feature or request label Apr 4, 2025
@VictorHuu
Contributor

github/go-spdx#8
The problem arises not from Syft but from go-spdx.
go-spdx adopts a fail-fast policy: when it encounters a license that is unknown or that deviates from the names in the official list, such as LGPLv2 (yours) vs. LGPL-2.0 (the so-called correct one), it fails immediately.
In the real world, however, license names that differ slightly from those defined in the specification are common and acceptable.

@Funsho-Agboola
Author
Funsho-Agboola commented Apr 10, 2025

Hi @VictorHuu, thanks for your response. You are right that go-spdx fails hard if the string is not a recognized SPDX ID.

On this image, there are many packages similar to your example like:

$ rpm -qa --qf '%{NAME}: %{LICENSE}\n' | grep -i 'LGPLv2+ and GPLv3+'
libassuan: LGPLv2+ and GPLv3+
gpgme: LGPLv2+ and GPLv3+

Syft then returns these as: "licenseDeclared":"LicenseRef-LGPLv2--and-GPLv3-"

My question is specifically about Syft’s decision to hash unrecognized license strings longer than 64 characters. Is it purely a design choice in Syft to keep IDs short for clarity? go-spdx does not mandate this limit.

Also, could Syft offer an option or flag to output full license strings longer than 64 characters?

Like:
LicenseRef-"full license string more than 64 characters"

instead of hashing it like:
LicenseRef-"sha256-Hash"

I need the full license strings in my workflow even if they’re non-SPDX compliant.

@VictorHuu
Contributor
VictorHuu commented Apr 15, 2025

Hi @Funsho-Agboola, thanks for your insights.
On your first question, I think it is indeed a deliberate choice for clarity.

For traceability, you could refer to #2724 (comment) and #3450, which are about representing the full text of a license.

As for the LicenseRef, I think:

  1. go-spdx could adopt a slightly more lenient policy that marks non-compliant strings as UNKNOWN, and
  2. Syft should make the LicenseRef length limit configurable, defaulting to 64.

Sorry, I'm almost a newbie, so my suggestions might need further input from the core dev team :(.
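A configurable limit along the lines of suggestion 2 could look roughly like the sketch below. Everything here is hypothetical: `idConfig`, `maxInlineLen` and `defaultIDConfig` are illustrative names rather than Syft options, and the sanitizer is an assumed stand-in for `SanitizeElementID` that replaces characters outside `[a-zA-Z0-9.-]` with `-`.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"regexp"
)

// idConfig is a hypothetical setting; Syft does not expose it in this form.
type idConfig struct {
	// maxInlineLen is the longest license string embedded directly in the ID.
	maxInlineLen int
}

// defaultIDConfig keeps the existing behaviour: hash anything over 64 characters.
func defaultIDConfig() idConfig {
	return idConfig{maxInlineLen: 64}
}

// Assumed stand-in for SanitizeElementID.
var nonIDChars = regexp.MustCompile(`[^a-zA-Z0-9.-]`)

// licenseRefID embeds the sanitized value when it fits within the limit and
// falls back to the SHA256 digest otherwise.
func (c idConfig) licenseRefID(value string) string {
	if len(value) <= c.maxInlineLen {
		return "LicenseRef-" + nonIDChars.ReplaceAllString(value, "-")
	}
	return fmt.Sprintf("LicenseRef-%x", sha256.Sum256([]byte(value)))
}

func main() {
	value := "LGPLv2+ and LGPLv2+ with exceptions and GPLv2+ and GPLv2+ with exceptions and BSD and Inner-Net and ISC and Public Domain and GFDL"

	// Default limit: the value is hashed.
	fmt.Println(defaultIDConfig().licenseRefID(value))

	// Raised limit: the value is embedded, but in sanitized form.
	fmt.Println(idConfig{maxInlineLen: 256}.licenseRefID(value))
}
```

Under this assumption, raising the limit still does not return the exact original text, because sanitizing turns spaces and `+` into `-`. Carrying the unmodified string in a separate field would avoid that loss.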

@Funsho-Agboola
Author
Funsho-Agboola commented Apr 17, 2025

Thanks, @VictorHuu. Making the LicenseRef length limit configurable would really help. I'll reach out to the team working on #3450 to propose integrating this idea into their ongoing work on that PR.

@spiffcs
Contributor
spiffcs commented May 1, 2025

Thanks @Funsho-Agboola! Working on this now that the full-text PR has merged.

@spiffcs
Contributor
spiffcs commented May 1, 2025

I see why this isn't getting sent directly to the ID portion of the license translator for SPDX.

"LGPLv2+ and LGPLv2+ with exceptions and GPLv2+ and GPLv2+ with exceptions and BSD and Inner-Net and ISC and Public Domain and GFDL",

The above value isn't coming back as a valid SPDX license expression.

I used the following program as a quick validator:

package main

import (
	"bufio"
	"flag"
	"fmt"
	"os"

	"github.com/github/go-spdx/v2/spdxexp"
)

func main() {
	// Define --file flag
	filePath := flag.String("file", "", "Path to file containing SPDX expressions (one per line)")
	flag.StringVar(filePath, "f", "", "Shorthand for --file")
	flag.Parse()

	var expressions []string

	// If --file is provided, read expressions from file
	if *filePath != "" {
		file, err := os.Open(*filePath)
		if err != nil {
			fmt.Fprintf(os.Stderr, "Error opening file: %v\n", err)
			os.Exit(1)
		}
		defer file.Close()

		scanner := bufio.NewScanner(file)
		for scanner.Scan() {
			line := scanner.Text()
			if line != "" {
				expressions = append(expressions, line)
			}
		}
		if err := scanner.Err(); err != nil {
			fmt.Fprintf(os.Stderr, "Error reading file: %v\n", err)
			os.Exit(1)
		}
	}

	// Append any additional command-line arguments
	expressions = append(expressions, flag.Args()...)

	// Validate SPDX expressions
	if len(expressions) == 0 {
		fmt.Println("No SPDX expressions provided.")
		os.Exit(0)
	}

	_, invalid := spdxexp.ValidateLicenses(expressions)

	fmt.Printf("\nInvalid expressions:\n")
	for _, inv := range invalid {
		fmt.Printf("  ✖ %s\n", inv)
	}
}

Invalid expressions:
  ✖ LGPLv2+ and LGPLv2+ with exceptions and GPLv2+ and GPLv2+ with exceptions and BSD and Inner-Net and ISC and Public Domain and GFDL

I think with the PR referenced on this issue we might be at a place where this gets easier. Before, the value could have been full text OR some invalid shorter expression.

We chose the 64-character constraint so that we didn't include longer values as part of the ID calculation. Let me see what I can do with the latest version of main to make sure these values are no longer subject to this ID calculation.

@spiffcs spiffcs moved this from In Progress to In Review in OSS May 1, 2025
@github-project-automation github-project-automation bot moved this from In Review to Done in OSS May 2, 2025
@Funsho-Agboola
Author
Funsho-Agboola commented May 6, 2025

Hi @spiffcs, confirmed it here, thanks: long license texts now come through intact. Thanks for jumping on this.
