Update specification to v1.2 by rdicosmo · Pull Request #56 · swhid/specification · GitHub
Merged · 3 commits · Apr 23, 2025
7 changes: 4 additions & 3 deletions Chapters/0.Foreword.md
@@ -1,5 +1,9 @@
# Foreword

This document contains the Publicly Available Specification of the
SWHID identifier, which was used as a starting point for the creation
of the [ISO/IEC standard 18670](https://www.iso.org/standard/89985.html).

ISO (the International Organization for Standardization)
is a worldwide federation of national standards bodies (ISO member bodies).
The work of preparing International Standards
@@ -49,13 +53,10 @@ to the World Trade Organization (WTO) principles
in the Technical Barriers to Trade (TBT),
see [https://www.iso.org/iso/foreword.html](https://www.iso.org/iso/foreword.html).

This document was prepared by <!-- the Joint Development Foundation -->XXX.
<!--
This document was adopted,
under the PAS procedure,
by Joint Technical Committee ISO/IEC JTC 1, Information technology,
in parallel with its approval by the national bodies of ISO and IEC.
-->

Any feedback or questions on this document
should be directed to the user's national standards body.
16 changes: 8 additions & 8 deletions Chapters/0.Introduction.md
@@ -8,19 +8,19 @@ This has strengthened the need to precisely track, ensure availability, and
guarantee integrity of the components that go into a given system for a variety
of stakeholders. Academia needs to ensure that research results are
reproducible, industry needs to improve the traceability of the software supply
chain, developer communities need tools to cope with the increasing complexity.
chain, and developer communities need tools to cope with the increasing complexity.

A key building block for addressing this issue is a system of *intrinsic*
A key building block for addressing this issue is a system of intrinsic
identifiers that allows users to precisely pinpoint the exact version of any software
artifact, at all levels of granularity, *without relying* on any central registry
artifact, at all levels of granularity, without relying on any central registry
or naming authority.

With this specification, the SWHID working group makes such a system of
intrinsic identifiers, originally developed for the Software Heritage
universal source code archive, available to all stakeholders.
universal source code archive [1], available to all stakeholders.

For the sake of clarity, we will use examples drawn directly from the Software
Heritage archive, but notice that systems for the persistent archival of software
artifacts, as well as resolution of SWHIDs are out of the scope of this
specification, and the SWHID specification does not require in any way the
For the sake of clarity, examples have been drawn directly from the Software
Heritage archive; however, it is important to note that systems for the persistent archival of software
artifacts, as well as resolution of SWHIDs, are outside the scope of this
specification, which does not require the
use of Software Heritage.
10 changes: 5 additions & 5 deletions Chapters/1.Scope.md
@@ -1,11 +1,11 @@
# 1 Scope

This SoftWare Hash IDentifier (SWHID) specification
defines a standard data format for referencing digital artifacts that
This specification
defines a standard data format for referencing software artifacts that
match the data model of modern distributed version control systems.

This includes the typical tree-like structure of a filesystem hierarchy,
but also special nodes to track revisions and releases, as well as the
This format includes the typical tree-like structure of a filesystem hierarchy,
but also, special nodes to track revisions and releases, as well as the
full status of a version control system, with all its development
branches.

@@ -18,4 +18,4 @@ The computation of the SWHID identifiers is based on Merkle Acyclic Directed
Graphs, a natural generalization of Merkle trees.

The resolution of SWHIDs, that is, the process of obtaining a copy of a digital
artifact corresponding to a given SWHID, is out of the scope of this specification.
artifact corresponding to a given SWHID, is outside the scope of this specification.
8 changes: 4 additions & 4 deletions Chapters/2.Normative_references.md
@@ -12,19 +12,19 @@ the latest edition of the referenced document
RFC-3174,
*US Secure Hash Algorithm 1 (SHA1)*,
The Internet Society Network Working Group,
[https://tools.ietf.org/html/rfc3174](https://tools.ietf.org/html/rfc3174)
[*https://tools.ietf.org/html/rfc3174*](https://tools.ietf.org/html/rfc3174)

RFC-3986,
*Uniform Resource Identifier (URI): Generic Syntax*,
The Internet Society Network Working Group,
[https://tools.ietf.org/html/rfc3986](https://tools.ietf.org/html/rfc3986)
[*https://tools.ietf.org/html/rfc3986*](https://tools.ietf.org/html/rfc3986)

RFC-3987,
*Internationalized Resource Identifiers (IRIs)*,
The Internet Society Network Working Group,
[https://tools.ietf.org/html/rfc3987](https://tools.ietf.org/html/rfc3987)
[*https://tools.ietf.org/html/rfc3987*](https://tools.ietf.org/html/rfc3987)

RFC-5234,
*Augmented BNF for Syntax Specifications: ABNF*,
The Internet Society Network Working Group,
[https://tools.ietf.org/html/rfc5234](https://tools.ietf.org/html/rfc5234)
[*https://tools.ietf.org/html/rfc5234*](https://tools.ietf.org/html/rfc5234)
82 changes: 57 additions & 25 deletions Chapters/3.Terms_and_definitions.md
@@ -11,54 +11,86 @@ for use in standardization at the following addresses:
* IEC Electropedia:
available at [http://www.electropedia.org/](http://www.electropedia.org/)

## 3.1 branch
**3.1**

In the context of version control systems, a branch is a parallel line of development that stems from the main line (commonly known as the "main" or "master" branch). It allows developers to isolate their work for a particular feature or bug fix without affecting the main line of development. Once the work is complete and tested, it can be merged back into the main branch.
**branch**

## 3.2 git
parallel line of development in a *version control system* (3.7), that stems from the main line

Git is a distributed version control system created by Linus Torvalds in 2005. It allows teams of programmers to work on the same code base without overwriting each other's changes. Git is known for its speed, data integrity, and support for distributed, non-linear workflows. Each Git directory on every computer is a full-fledged repository with complete history and version tracking abilities, independent of network access or a central server.
**3.2**

## 3.3 hierarchical file system
**Git**

A hierarchical file system is a method of organizing and managing files in a computer where data is stored hierarchically. It uses directories (or 'folders') to organize files into a tree structure. Each directory can contain more files and directories, thus forming a hierarchical structure.
distributed *version control system* (3.7) created by Linus Torvalds in 2005

## 3.4 intrinsic identifier
**3.3**

An identifier that can be computed directly from the object that it identifies, without needing access to a registry. Typical examples are cryptographically strong hashes.
**hierarchical file system**

## 3.5 repository
method of organizing and managing files in a computer where data is stored hierarchically

In the context of version control systems, a repository is a storage location for software development artifacts including but not limited to source code, build scripts, and documentation. It often includes metadata about the stored items, such as version number, author, and date of the last modification. Repositories can be local or remote and are managed by version control systems like Git.
**3.4**

## 3.6 SHA1
**intrinsic identifier**

*SHA-1* (short for "Secure Hash Algorithm 1", also stylized as "*SHA1*") is a hash function that takes as input a sequence of bytes and produces a 160-bit (20-byte) hash value.
The returned value is called *SHA1 checksum*, or simply *SHA1* when there is no risk of ambiguity between the function and the returned value.
identifier that can be computed directly from the object that it identifies, without needing access to a registry

**3.5**

**repository**

storage location for *software development artifacts* (3.8) including but not limited to source code, build scripts, and documentation

**3.6**

**SHA1**

SHA1

Secure Hash Algorithm 1

hash function that takes as input a sequence of bytes and produces a 160-bit (20-byte) hash value

Note 1 to entry: The returned value is called *SHA1 checksum*, or simply *SHA1* when there is no risk of ambiguity between the function and the returned value.
A detailed description of how to compute SHA1 is available in RFC-3174.

In the wake of the [Shattered attack](https://shattered.io/) of 2017 (see paper: [Stevens2017Shattered](B.Bibliography.md)), it is now possible to produce collision-prone files that are different but return the same SHA1 checksums.
It is however possible to detect, during SHA1 computation, such SHA1-colliding files using counter-cryptanalysis (see paper: [Stevens2013Counter](B.Bibliography.md)).
In the wake of the [Shattered attack](https://shattered.io/) of 2017 (see [3]), it is now possible to produce collision-prone files that are different but return the same SHA1 checksums.
It is however possible to detect, during SHA1 computation, such SHA1-colliding files using counter-cryptanalysis (see [2]).

As collision-prone files are problematic from the point of view of unequivocal identification and integrity verification, the SWHID standard takes measures to avoid that such files are referenced using only SHA1 checksums.
For the purpose of this specification document, the SHA1 function is therefore considered to be a *partial* function, that only returns a value when a Shattered-style collision is not detectable using the techniques described in [Stevens2013Counter](B.Bibliography.md) and the reference implementation of it available at <https://github.com/cr-marcstevens/sha1collisiondetection> (version `stable-v1.0.3`, corresponding to Git commit ID `38096fc021ac5b8f8207c7e926f11feb6b5eb17c`).
For the purpose of this specification, the SHA1 function is therefore considered to be a *partial* function, that only returns a value when a Shattered-style collision is not detectable using the techniques described in [2] and the reference implementation of it available at https://github.com/cr-marcstevens/sha1collisiondetection (Git commit ID `38096fc021ac5b8f8207c7e926f11feb6b5eb17c`, or version `stable-v1.0.3`).

When such a collision is detected during SHA1 computation, no SHA1 can be obtained for the object in question and hence, depending on the context, a valid SWHID might not exist for it.

Note that in most cases SHA1 in this specification are computed on objects after adding specific headers to them, making "trivial" collision-prone files still perfectly valid and hence referenceable using SWHIDs.
In most cases, SHA1s in this specification are computed on objects after adding specific headers to them, making "trivial" collision-prone files still perfectly valid and hence referenceable using SWHIDs.
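
As a rough illustration of the header mechanism mentioned in this entry, the sketch below computes a content identifier by prepending a header to the raw bytes before hashing. The header layout (`blob <length>\0`, as Git uses for blobs) and the function name `content_swhid` are assumptions that this chapter does not state. Python's `hashlib` performs no collision detection, so this does not implement the partial SHA1 function described above.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Return a 'cnt' identifier for raw bytes (illustrative sketch only)."""
    # Assumption: the header is Git's blob header, "blob <length>\0".
    header = b"blob %d\0" % len(data)
    # Plain SHA1 without collision detection; a conforming implementation
    # would refuse to return a value when a Shattered-style collision is found.
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


if __name__ == "__main__":
    print(content_swhid(b"hello world\n"))
```

Because the header is hashed together with the data, two files that collide under bare SHA1 do not automatically collide once the header is added, which is the point made in the paragraph above.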

**3.7**

**version control system**

revision control system

source control system

software tool that helps manage different versions of *software development artifacts* (3.8)

**3.8**

**software artifact**

object

## 3.7 version control system
representation of a distinct entity identifiable by a SWHID

A version control system (VCS), also known as source control or revision control, is a software tool that helps manage different versions of software development artifacts. It keeps track of all changes made to the code, allows multiple developers to work on the same codebase, and provides mechanisms for merging changes, reverting changes, and the branching and merging of code. Examples include Git, Mercurial, and Subversion.
**3.9**

## 3.8 software artifact
**metadata**

A software artifact, also referred to as a digital artifact and a software object, represents a distinct entity identifiable by a SWHID. This entity can be as granular as a single line of code within a source file or as expansive as an entire codebase comprising multiple source files. In addition to source files, a software object can also be a binary file resulting from code compilation or multiple binary files linked together to produce an executable file.
supplementary information associated with a *software artifact* (3.8)

## 3.9 metadata
**3.10**

Within the context of this specification, metadata refers to supplementary information associated with a software object. It serves to provide a deeper understanding of the object by detailing attributes such as the programming language used, its functionality, or its dependencies. Metadata can also enumerate the individuals involved in the software's development, elucidate its licensing terms, offer a record of version history, and more. Essentially, metadata encapsulates the broader context, provenance, and attributes of the software object, ensuring a comprehensive understanding of its nature and purpose.
**UNIX epoch**

## 3.10 UNIX epoch
time reference point that denotes the precise moment at 00:00:00 Coordinated Universal Time (UTC) on 1 January 1970

The UNIX epoch is a time reference point that denotes the precise moment at 00:00:00 Coordinated Universal Time (UTC) on 1 January 1970. In UNIX-based systems, time is often represented as the total number of seconds that have transpired since this specific moment. This convention is widely used in computing for time-stamping and date-time representations.
32 changes: 19 additions & 13 deletions Chapters/4.Syntax.md
@@ -1,17 +1,18 @@
# 4 Syntax

A SWHID consists of two separate parts, a mandatory *core identifier* that can
identify any software artifact (or "object"), and an optional list of
A SWHID consists of two separate parts: a mandatory *core_identifier* that can
identify any software artifact, and an optional list of
*qualifiers* that allows specification of the context where the object is meant to be
seen and points to a subpart of the object itself.
seen and that points to a subpart of the object itself.

Syntactically, SWHIDs are generated by the `<identifier>` entry point in
the following grammar:
the following grammar (which uses notation defined by RFC-5234):

``` {.bnf}
<identifier> ::= <core_identifier> [ <qualifiers> ] ;

<core_identifier> ::= "swh" ":" <scheme_version> ":" <object_type> ":" <object_id> ;
<core_identifier> ::=
"swh" ":" <scheme_version> ":" <object_type> ":" <object_id> ;
<scheme_version> ::= "1" ;
<object_type> ::=
"snp" (* snapshot *)
@@ -20,9 +21,13 @@ the following grammar:
| "dir" (* directory *)
| "cnt" (* content *)
;
<object_id> ::= 40 * <hex_digit> ; (* intrinsic object id, as hex-encoded SHA1 *)
<dec_digit> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
<hex_digit> ::= <dec_digit> | "a" | "b" | "c" | "d" | "e" | "f" ;
<object_id> ::= 40 * <hex_digit> ; (* intrinsic id, hex-encoded *)
<dec_digit> ::=
"0" | "1" | "2" | "3" | "4"
| "5" | "6" | "7" | "8" | "9" ;
<hex_digit> ::=
<dec_digit>
| "a" | "b" | "c" | "d" | "e" | "f" ;

<qualifiers> ::= ";" <qualifier> [ <qualifiers> ] ;
<qualifier> ::=
@@ -36,8 +41,8 @@ the following grammar:
| <path_ctxt>
;
<origin_ctxt> ::= "origin" "=" <url_escaped> ;
<visit_ctxt> ::= "visit" "=" <identifier_core> ;
<anchor_ctxt> ::= "anchor" "=" <identifier_core> ;
<visit_ctxt> ::= "visit" "=" <core_identifier> ;
<anchor_ctxt> ::= "anchor" "=" <core_identifier> ;
<path_ctxt> ::= "path" "=" <path_absolute_escaped> ;
<fragment_qualifier> ::= "lines" "=" <range> | "bytes" "=" <range> ;
<range> ::= <number> ["-" <number>] ;
@@ -48,10 +53,11 @@ the following grammar:

The last two symbols are defined as:

- `<path_absolute_escaped>` is an `ipath-absolute` from RFC-3987; and
- `<url_escaped>` is an `IRI` as defined in RFC-3987.
- `<url_escaped>` is an `IRI` as defined in RFC-3987; and
- `<path_absolute_escaped>` is an `ipath-absolute` from RFC-3987.


In both of these, all occurrences of `;` (and `%`, as required by the RFC)
have been percent-encoded (as `%3B` and `%25` respectively). Other
characters *may* be percent-encoded, for example, to improve readability and/or
characters may be percent-encoded, for example, to improve readability and/or
embeddability of SWHID in other contexts.
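
The grammar and the percent-encoding rule can be sketched in a few lines of Python. This is a minimal, non-authoritative sketch: the names `parse_swhid` and `escape_value` are hypothetical, the `rel` and `rev` object types and the decimal form of `<number>` are assumed because those parts of the diff are collapsed, and `<url_escaped>` and `<path_absolute_escaped>` values are not checked against RFC-3987.

```python
import re

# "snp", "dir" and "cnt" appear in the grammar above; "rel" and "rev" are
# assumed, since that part of the diff is collapsed.
CORE_RE = re.compile(r"swh:1:(?:snp|rel|rev|dir|cnt):[0-9a-f]{40}")
# Assumption: <number> is a non-empty run of decimal digits.
RANGE_RE = re.compile(r"[0-9]+(?:-[0-9]+)?")
QUALIFIER_KEYS = {"origin", "visit", "anchor", "path", "lines", "bytes"}


def escape_value(value: str) -> str:
    """Percent-encode '%' and ';' in a raw (not yet encoded) qualifier value."""
    # '%' goes first, otherwise the '%' produced for ';' would be re-encoded.
    return value.replace("%", "%25").replace(";", "%3B")


def parse_swhid(swhid: str) -> tuple[str, dict[str, str]]:
    """Split a SWHID into its core identifier and a dict of raw qualifiers."""
    # Splitting on ';' is safe because ';' inside values is percent-encoded.
    core, *raw_qualifiers = swhid.split(";")
    if CORE_RE.fullmatch(core) is None:
        raise ValueError(f"invalid core identifier: {core!r}")
    qualifiers: dict[str, str] = {}
    for raw in raw_qualifiers:
        key, sep, value = raw.partition("=")
        if sep != "=" or key not in QUALIFIER_KEYS:
            raise ValueError(f"invalid qualifier: {raw!r}")
        if key in ("visit", "anchor") and CORE_RE.fullmatch(value) is None:
            raise ValueError(f"{key} must be a core identifier: {value!r}")
        if key in ("lines", "bytes") and RANGE_RE.fullmatch(value) is None:
            raise ValueError(f"{key} must be a range: {value!r}")
        qualifiers[key] = value  # kept as written, still percent-encoded
    return core, qualifiers
```

Values are returned in their encoded form on purpose, so that re-joining the core identifier and qualifiers with `;` reproduces the original string.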