research-article

Open access

An Example of Parallel Merkle Tree Traversal: Post-Quantum Leighton-Micali Signature on the GPU

Authors:

Qiang WangAuthors Info & Claims

ACM Transactions on Architecture and Code Optimization, Volume 21, Issue 3

Article No.: 44, Pages 1 - 25

https://doi.org/10.1145/3659209

Published: 14 September 2024 Publication History

PDF eReader

Abstract

The hash-based signature (HBS) is the most conservative and time-consuming among many post-quantum cryptography (PQC) algorithms. Two HBSs, LMS and XMSS, are the only PQC algorithms standardised by the National Institute of Standards and Technology (NIST) now. Existing HBSs are designed based on serial Merkle tree traversal, which is not conducive to taking full advantage of the computing power of parallel architectures such as CPUs and GPUs. We propose a parallel Merkle tree traversal (PMTT), which is tested by implementing LMS on the GPU. This is the first work accelerating LMS on the GPU, which performs well even with over 10,000 cores. Considering different scenarios of algorithmic parallelism and data parallelism, we implement corresponding variants for PMTT. The design of PMTT for algorithmic parallelism mainly considers the execution efficiency of a single task, while that for data parallelism starts with the full utilisation of GPU performance. In addition, we are the first to design a CPU-GPU collaborative processing solution for traversal algorithms to reduce the communication overhead between CPU and GPU. For algorithmic parallelism, our implementation is still 4.48× faster than the ideal time of the state-of-the-art traversal algorithm. For data parallelism, when the number of cores increases from 1 to 8,192, the parallel efficiency is 78.39%. In comparison, our LMS implementation outperforms most existing LMS and XMSS implementations.

1 Introduction

Cryptographic algorithms are evolving towards parallel design driven by the demand for enhanced performance. Accelerators like GPUs are extensively utilised to provide high-performance cryptographic operations. Within this domain, post-quantum cryptography (PQC), designed to defend against the attacks of quantum computers, has garnered increased attention, especially due to two pressing reasons [33]. (1) The swift advancements in quantum computing underscore the urgency of adopting PQC. Conventional cryptographic systems face vulnerability with the emergence of large-scale quantum computers, which can leverage algorithms like Shor’s [36] or Grover’s [17] to undermine cryptography. (2) Despite PQC algorithms remaining resilient against quantum attacks for the foreseeable future, they are inherently more intricate and time-intensive compared to prevailing cryptographic systems such as RSA and ECDSA.

Digital signatures are a critical aspect of cryptographic algorithms, extensively utilised in identity verification processes like website certificates. Similar to other cryptographic algorithms, digital signature algorithms impose stringent performance requirements. Among these, hash-based signatures (HBS) [34], as a form of post-quantum digital signature, offer robust security but often face significant performance challenges.

HBSs are divided into two types: stateful and stateless. The primary distinction lies in whether they require state maintenance. Stateless algorithms are more resource-intensive due to their reliance on state-independent modules. The stateful HBS algorithms lack these modules, resulting in lower overhead. However, this advantage comes with limitations on their application. These constraints do not pose security concerns under specific circumstances [9, 30]. Thus, HBS has been standardised by the National Institute of Standards and Technology (NIST).

Different from stateless HBS, stateful HBS uses Merkle tree traversal to speed up the signature process precisely due to the presence of a state association between signatures. Presently, Merkle tree traversal can be divided into two types: fractal Merkle tree traversal [22] and Merkle tree traversal in log space and time, such as Szydlo’s algorithm [41] and BDS algorithm [7].

Motivations. Existing Merkle tree traversal algorithms are mainly designed for serial execution and memory-constrained devices, This limitation hinders parallel devices such as CPUs and GPUs from exploiting their parallel capabilities. For example, the BDS algorithm uses the order relationship of signatures for optimisation, which impedes data parallelism. Fractal Merkle tree traversal only considers storing a single tree at each level, leading to insufficient parallelism for certain parameter configurations. Consequently, there is an urgent need for the development of a new Merkle tree traversal algorithm that overcomes these limitations.

Leighton-Micali Signature (LMS) [28] and eXtended Merkle Signature Scheme (XMSS) [6] are the only two PQC algorithms standardised by NIST, both being stateful HBS. We exemplify the parallel LMS implementation on the GPU to describe the parallel Merkle tree traversal (PMTT) algorithm we propose. PMTT is a general algorithm that is not limited to a specific deployment or algorithm. For example, we can also optimise XMSS on the CPU platform using PMTT. Since LMS incurs lower overhead than XMSS and the GPU is one of the parallel platforms with the highest demand for parallel algorithms, such an environment allows us to thoroughly validate PMTT.

There is lots of work accelerating LMS. Kampanakis et al. [23] explored serial optimisation, employing the pre-hashing technique to minimise the number of redundant hashing operations. Some work has been done on CPUs. Cisco [29] offered a multi-threaded CPU solution using pthread. However, such implementations are limited to the resources of a single CPU node, leading to low parallelism for cryptographic operations. To increase the parallelism, Kang et al. [24] deployed LMS on multiple CPU nodes, employing MPI for multi-node data sharing and AVX2/AVX512 vectorisation for enhanced parallelism. In addition, accelerators such as FPGAs are used for LMS. Song et al. [39] suggested an LMS keypair generation architecture to achieve low latency. Later, they [38] proposed an architecture for LMS keypair generation for low latency, followed by a complete hardware implementation of digital signature and a reconfigurable architecture supporting various parameters. Thoma et al. [43] presented the first agile FPGA implementation supporting both LMS and XMSS, both using the BDS traversal algorithm.

Despite some optimisation efforts, current LMS implementations fall short in meeting the demands of high-throughput applications like social networking platforms (e.g., TikTok) or large e-commerce websites, such as Alibaba’s web server processing 325,000 orders per second during “Singles Day” in 2017. [25]. The demand for throughput is exponentially increasing in the era of the Internet of Things. As of now, the maximum throughput of signature generation for LMS_SHA256_M32_H20(10,10) is owned to [24]: 1,563 tasks/s. Such throughput is still unable to cope with the needs of large websites. Therefore, there is an urgent need for a new parallel implementation scheme for LMS.

Related Work. XMSS is another standardised HBS algorithm by NIST PQC. Optimising the LMS algorithm can involve various strategies. One approach is to vectorise the underlying hash computations using technologies like AVX2 [10, 11] and NEON [37]. Additionally, leveraging hardware-specific instructions from manufacturers such as Intel [13] or Cadence [35] can further improve efficiency. Furthermore, utilising hardware accelerators like FPGA [8, 16, 42, 45] or ASIC [31] can significantly boost computational speed. SPHINCS\(^+\)[3] is a stateless HBS to be standardised by NIST PQC. There is some optimisation work: vectorising the underlying hash computations using AVX2 [21] or NEON [2], reducing the number of instruction cycles through hardware-specific instructions such as SHA-NI [20] and accelerators such as FPGA [1, 4]. SPHINCS is the predecessor of SPHINCS\(^+\), and optimisation work for it includes GPU optimisations [40].

Long-term research has focused on using GPUs to accelerate cryptography. Symmetric encryption algorithms can take advantage of the high parallelism of GPUs to encrypt data because each block’s computation is independent, e.g., advanced encryption standard (AES) [19] and CHACHA20 [46]. Asymmetric encryption algorithms leverage GPUs to deploy data-parallel or hybrid-parallel schemes, enabling efficient processing of encryption tasks one-to-one or many-to-one. This design allows for a flexible tradeoff between latency and throughput, which is exploited in numerous studies to speedup classical algorithms such as RSA [12] and ECDSA [14], as well as PQC algorithms like KYBER [18], SABER [26], NewHope [15], and FrodoKEM [27].

Application Scenarios. High-performance digital signatures are mainly used in two application scenarios: web servers and cloud services [44]. Their actual needs are both to achieve the highest possible throughput while satisfying a certain latency. Considering actual network traffic, the number of digital signature requests should be taken into account when designing the corresponding policy. Therefore, we design algorithmic parallelism for accelerating a single task and data parallelism for accelerating multiple tasks.

Challenges. GPUs feature numerous cores with relatively modest individual performance. Thus, the main challenges of parallelising LMS on the GPU include three parts. (1) How to make a reasonable parallel design to efficiently utilise these cores. (2) How to effectively utilise the GPU programming technology and make full use of hardware resources. (3) How to design a parallel Merkle tree traversal algorithm adapted to the GPU architecture.

Contributions. To solve these challenges, the contributions in this article are as follows:

We propose PMTT, a novel approach tailored for parallel execution, distinct from existing Merkle tree traversal primarily designed for serial execution. Parallel execution environments typically feature ample memory and heightened latency sensitivity. Hence, PMTT prioritizes trading space for time to optimise parallel efficiency while ensuring the additional space requirements remain negligible compared to the overall system resources. The reason that PMTT achieves superior performance compared to other traversal methods is that it can exploit more parallelism.

We tailor parallel designs to specific scenarios based on PMTT. For algorithmic parallelism, the priority is minimising latency, irrespective of throughput. For data parallelism, the focus is on maximising throughput while maintaining reasonable latency.

We analyse the time and space requirements of PMTT and discuss tradeoffs between time and memory for memory-constrained devices to enable their use of PMTT.

We provide the first GPU implementation for LMS that can run over 10,000 cores. Beyond standard GPU optimisation methods, we present a CPU-GPU co-processing strategy, utilising a single CPU core to reduce communication overhead. For a variety of specific situations, we design detailed communication plans.

For algorithmic parallelism, our implementation is 4.48× faster than the ideal time of the best traversal algorithm. For data parallelism, when the number of cores increases from 1 to 8192, our implementation maintains a parallel efficiency of 78.39%. Furthermore, our implementation surpasses existing LMS and XMSS implementations in performance.

The remainder of this article is organised as follows. Section 2 introduces preliminaries. Section 3 details the design for PMTT and LMS. Section 4 describes the communication optimisation on the GPU. Section 5 evaluates the PMTT and our LMS implementation, and Section 6 concludes this article.

2 Preliminaries

2.1 Notations

Table 1 shows the terms and their meanings of the LMS algorithm. For convenience, we use ap to represent algorithmic parallelism and dp to represent data parallelism.

Table 1.

Terms	Meanings	Terms	Meanings
pk	The public key of LMS	\(lmots\_pk\)	The public key of LM-OTS
sk	The private key of LMS	\(lmots\_sk\)	The private key of LM-OTS
sig	The signature of LMS	\(lmots\_sig\)	The signature of LM-OTS
h	Height of the LMS tree	L	Number of LMS tree levels
\(h^{\prime }\)	Height of the subtree	\(L^{\prime }\)	Number of subtree levels
H	A hash function	n	Size of H’s output in bytes
w	Length of a Winternitz chain	p	Number of n-byte string in OTS
q	An integer as an LMS leaf identifier	I	A string as a keypair identifier
seed	A secret random value	m	Message to be signed

Table 1. Terms and Their Meanings of the LMS Algorithm

2.2 Digital Signature Mechanism

The digital signature mechanism is a tuple of algorithms (KeyGen, Sign, Verify), which are:

(pk, sk) \(\leftarrow\) KeyGen(seed): A key generation algorithm that generates a pk and a sk based on a seed, which is generated randomly.

sig \(\leftarrow\) Sign(sk, m): A signature generation algorithm that signs a m using a sk to output a sig.

b \(\leftarrow\) Verify(pk, sig, m): A signature verification algorithm that verifies a sig using a pk and a m to output a result b, which can be either true or false.

2.3 Review of LMS

This subsection introduces LMS according to the original LMS patent [28], the Request For Comments (RFC) standard 8554 [29] and the NIST standard [9]. Without emphasising variants, LMS refers to the two variants, the single-tree variant LMS and the multi-tree variant hierarchical signature system (HSS). Table 2 shows the LMS parameter sets that correspond to the standard XMSS parameter sets as LMS does not have standard parameter sets. For example, the parameter set LMS_SHA256_M32_H20(10,10) means the hash function used is SHA256, n is 32, total height is 20, L is 2 and h is 10 for both levels. Here, the numbers from left to right represent the tree height from top to bottom. For all parameter sets, h can be 5, 10, 15, 20, or 25. It should be emphasised that if the total height of the tree is \(total\_h\), the maximum number of signatures is \(2^{total\_h}\).

Table 2.

Parameter Sets	sk	pk	sig	Maximum Number of sig
LMS_SHA256_M32_H10(10)	64	60	2,512	2\(^{10}\)
LMS_SHA256_M32_H15(15)	64	60	2,672	2\(^{15}\)
LMS_SHA256_M32_H20(20)	64	60	2,832	2\(^{20}\)
LMS_SHA256_M32_H20(10,10)	64	60	5,076	2\(^{20}\)
LMS_SHA256_M32_H20(5,5,5,5)	64	60	9,564	2\(^{20}\)
LMS_SHA256_M32_H40(20,20)	64	60	5,716	2\(^{40}\)
LMS_SHA256_M32_H40(10,10,10,10)	64	60	10,204	2\(^{40}\)
LMS_SHA256_M32_H40(5,5,5,5,5,5,5,5)	64	60	19,180	2\(^{40}\)
LMS_SHA256_M32_H60(20,20,20)	64	60	8,600	2\(^{60}\)
LMS_SHA256_M32_H60(10,10,10,10,10,10)	64	60	15,332	2\(^{60}\)

Table 2. Sizes (bytes) of sk, pk, sig and Maximum Number of sig with w = 4

2.3.1 LM-OTS.

LM-OTS, the one-time signature (OTS) scheme of LMS, serves as the fundamental operator within LMS. It involves several hash operations and comprises three main functions:

\(lmots\_pk\) \(\leftarrow\) lmots_pkgen (seed, I, q). The LM-OTS key generation function is used when generating the leaf node of the LMS tree. The leaf node is the hash of \(lmots\_pk\).

\(lmots\_sig\) \(\leftarrow\) lmots_sign (m, seed, \(I_1\), \(I_2\), q). When performing an LMS signing, the message to be signed by the LMS and every root node used all need to be signed by using this function. The function is executed once at every level when performing an LMS signing.

\(lmots\_pk\) \(\leftarrow\) lmots_pk_from_sig (\(lmots\_sig\), m, I, q, sig). The function uses \(lmots\_sig\) to generate \(lmots\_pk\), which is then compared with the one in storage. As with lmots_sign, the function is executed once at every level when performing an LMS signature verification.

2.3.2 The LMS Tree.

LMS is designed around the LMS tree, which is a variant of the Merkle tree and contains 2\(^h\) leaf nodes. In Figure 1, an LMS tree with a height (h) of 5 is depicted. The authentication paths (\(Auth_i\), where i ranges from 0 to 4) shown are for leaf 6. An authentication path comprises all necessary nodes from the leaf to the root node. Each leaf in the LMS tree can only be utilised for a single LMS signing. Reusing a leaf would compromise security [5]. Therefore, an LMS tree supports the signing of 2\(^h\) messages. The LMS tree’s branch nodes are obtained by hashing the concatenated value of the left and right child nodes. During the tree construction process, nodes at each level are built from the bottom up until the root node is obtained.

Fig. 1.

2.3.3 LMS.

When \(L=1\), the multi-tree variant HSS is the single-tree variant LMS. That is, LMS is a subset of the HSS. Thus, only HSS is described here. Similar to LM-OTS, HSS includes three functions: keypair generation, signature generation and signature verification, which are as

(pk, sk) \(\leftarrow\) LMS_KeyGen (seed). The main part of this function is to construct the top-level tree to get its root node. However, this function also constructs LMS trees at other levels if the initialisation of some traversal algorithms needs these trees.

sig \(\leftarrow\) LMS_Sign (sk, m). At each level, the main part of this function is to complete a Merkle tree traversal and an LM-OTS signing.

b \(\leftarrow\) LMS_Verify (pk, sig, m). At each level, the main part of the function is to complete an LM-OTS signature verification and compute the root node.

2.4 Merkle Tree Traversal

Efficient traversal of Merkle trees is crucial for optimising keypair and signature generation performance. This traversal involves computing the authentication path while minimising time and space overheads. The optimisation of the Merkle tree traversal revolves around a problem: how to ensure that the required node values can be computed and prepared when needed in a time- and space-saving manner. Two of the most effective algorithms in use today are highlighted below.

2.4.1 BDS Traversal.

The BDS traversal [7], an advancement over Szydlo’s approach [41], optimises performance by scheduling computations only for leaf nodes rather than all nodes. In comparison, each signature computes \((h-1)/2\) leaf nodes and \((h-3)/2\) branch nodes on average, achieving a space bound of \(3.5h - 4\) nodes. This improvement makes BDS superior to Szydlo’s method, establishing it as the leading memory-efficient serial algorithm. The rationale behind this strategy is that less space will be used if the computation of two nodes of the same height in different tree hash stacks is done serially rather than in parallel.

This assumption imposes a timing constraint on consecutive signatures, impacting the design of parallel algorithms. Specifically, in the context of single-keypair data parallelism, achieving parallelism becomes challenging. This is because using the signature from one round directly for subsequent rounds is not feasible. Although traversal algorithms like BDS are more attractive in algorithmic parallelism and multi-keypair data parallelism due to their small memory footprints, PMTT is not based on BDS given that single-keypair data parallelism is a more extensive high-performance application scenario.

2.4.2 Fractal Merkle Tree Traversal.

Compared with the previous algorithm, the fractal Merkle tree traversal divides a tree into multiple subtrees for processing [22]. This approach allows for a straightforward balance between storage and computation since the sizes of the subtrees can vary. The process consists of three phases: initialisation, output, and verification. The verification phase is identical to the traditional verification phase for Merkle trees, and it is not detailed here.

Initialisation Phase. The root node is to be computed using the Merkle algorithm and outputted in this phase. When running the Merkle algorithm, it is necessary to store all nodes of the leftmost tree, which are called existing trees. The second tree on the left is called the desired tree, and both are initialised. Algorithm 1 shows the precise steps of the initialisation phase.

Output and Update Phase. Algorithm 2 outlines the procedure, where each round consists of three steps. (1) Output. At round leaf, the existing subtrees contain nodes required for the authentication path of this round. Thus, the authentication path of the leafth leaf is selected and outputted. (2) Update. At round leaf, the leaf = 1 (mod 2\(^{ih^{\prime }}\)) nodes in the old existing tree \(Exist_i\) are discarded. The completed desired tree \(Desire_i\) becomes the new existing tree \(Exist_i\); and a new, empty desired subtree is created. (3) Grow. In this step, each desired subtree that is not yet completed is grown. Only two nodes need to be completed per turn to ensure that the desired tree is completed before the Existing tree is used up, as a total of \(2^{ih^{\prime }+1}-2\) nodes need computation.

The maximum space is \(h\cdot 2^{h^{\prime }+1}/h^{\prime }+h^2/2h^{\prime }\). In the worst case, \(2h/{\log }h\) nodes are computed per round. For a Merkle tree of a given height, the subtree height directly affects memory consumption. When \(h^{\prime }\) equals h, only 2 nodes (one leaf and one branch node) need to be processed per round, resulting in a memory requirement of \(2^{h+1}+h/2\). This represents optimal performance, with memory usage directly linked to the Merkle tree’s height.

There’s a limitation with fractal Merkle tree traversal: it can only handle one Merkle tree at a time, which can lead to poor parallel efficiency, especially when dealing with shorter trees. For instance, if a Merkle tree has a height of only 5 levels, even if the optimal solution of fractal Merkle tree traversal is used, that is \(h^{\prime }=h\), the maximum parallelism is very low, resulting in suboptimal utilisation of GPU resources.

2.5 CUDA-Enabled GPU

NVIDIA provides a parallel computing framework called Compute Unified Device Architecture (CUDA) for its GPUs. The CUDA program execution involves three primary steps: transferring input data from CPUs to GPUs, executing the kernel, and transferring data from GPUs to CPUs. Before kernel execution, determining the optimal number of blocks and threads within each block is crucial for maximising efficiency. Additionally, precise specification of data communication between CPUs and GPUs is essential. During kernel execution, data sharing between threads is the core of effective GPU parallel programming.

3 Parallel Design for LMS

We first briefly describe the design of PMTT, followed by a detailed explanation of the schemes tailored to specific application scenarios and the parallel design of LMS. In this article, a task refers to a single instance of keypair generation, signature generation or signature verification. Moreover, a thread is used to describe a basic execution unit. We call multiple threads processing a single task algorithmic parallelism and multiple threads processing multiple tasks data parallelism. Expressed in high-performance computing terms, designs related to strong scaling are involved in algorithm parallelism, and designs related to weak scaling are involved in data parallelism.

3.1 Parallel Merkle Tree Traversal (PMTT)

We propose PMTT based on fractal Merkle tree traversal. Similarly, PMTT consists of three phases: initialisation, update and output, and verification. The verification phase remains unchanged from the standard Merkle tree verification and thus needs no introduction. PMTT enhances the fractal Merkle tree traversal in several key aspects to better suit parallel execution.

Simultaneous Construction of Multiple Trees. The fractal Merkle tree traversal only constructs some nodes when performing signature generation, leading to low parallelism. Therefore, we choose to construct the next existing tree when the existing one is used up. This approach offers two benefits: (1) It reduces memory usage as there’s no need to store desired trees. (2) It boosts parallelism by enabling the construction of multiple nodes simultaneously. However, a drawback is that signature generation might be time-consuming if tree construction is slow. Yet, this isn’t a significant concern when running PMTT on GPUs, where processing times are typically short.

Storing One-Time Signatures. Although one-time signatures are not involved in Merkle tree traversal, their storage also affects performance. In the LMS implementations based on fractal Merkle tree traversal recommended by RFC [29], only Merkle tree nodes and root-related information are saved. But in our solution, we store the one-time signature corresponding to the root node to avoid redundant computations.

Storing Multiple Trees at the Same Level. When the tree height is short, it results in limited maximum parallelism during tree construction. Hence, it’s crucial to build and retain multiple trees at the same level. The greater the number of trees, the better the parallel efficiency.

Agile Initialisation. The initialisation phase is usually performed during keypair generation. If keypair and signature generation are performed at different locations, there is notable overhead associated with moving the auxiliary array. Our design does not have this limitation. If the auxiliary array is not detected during signature generation, PMTT is initialised automatically.

3.2 PMTT for Algorithmic Parallelism

3.2.1 Specific Design.

To distinguish, we call levels of the LMS tree level and levels of the subtree sublevel. Each level of an LMS tree comprises \(L^{\prime }\) sublevels of subtrees. We just store a subtree at each sublevel for algorithmic parallelism. The design goal of algorithmic parallelism is to achieve the lowest possible latency. Since the solution for constructing multiple trees aims at improving parallel efficiency and throughput, it is not necessary for algorithmic parallelism.

Initialisation Phase. Algorithm 3 provides an accurate description of the initialisation phase of PMTT for algorithmic parallelism. For each level, this phase consists of L sublevels of parallel Merkle tree construction. At each sublevel, the leftmost subtrees are constructed sequentially from top to bottom. Within each layer of the Merkle tree, \(L^{\prime }-1\) subtrees are constructed. These subtrees are located in the leftmost of the LMS tree. The height of the subtree decreases iteratively, for example, with heights of 15, 10, and 5 when \(h = 15\) and \(h^{\prime } = 5\). The upper five layers of the subtree are stored in the array \(node_{ki}\). There is a redundant computation as some nodes are generated repeatedly, but this redundancy can be optimised in implementation. Following the tree construction, LM-OTS signatures \(ots_k\) are generated. The root node is utilised to derive \(ots_k\), excluding the one-time signature generated by the root of the top-level Merkle tree, which holds no significance. In each level, PMTT marks both the LMS tree and the subtree as valid.

Update and Output Phase. Algorithm 4 accurately describes the update and output phases of PMTT for algorithmic parallelism. We do not use the desired tree of the fractal Merkle tree because of the large communication cost of the desired tree between CPU and GPU. First, the subtrees need to be constructed. At each level, the subtree needs to be judged sublevel by sublevel. If the subtree at a given sublevel is not stored, it needs to be constructed and marked as valid. If the leaf is the last node of the subtree, the subtree is marked as invalid.

Next, the OTS and authentication path are stored using a two-stage process. In the non-bottom level (lines 11–17), the OTS needs to be judged on whether it can be used for a given leaf. If it cannot be used, lmots_sign is employed. Otherwise, the OTS can be read directly from memory. Then the OTS and authentication path are appended to the LMS signature. In the bottom level (line 18), the OTS of the message and the authentication path are output. At the end of each level, the LMS tree is marked as invalid if the tree has been exhausted, with a total of L levels. Figure 2 shows the visual representation of the structure we use.

Fig. 2.

3.2.2 Memory Footprint.

The auxiliary array is used to store data used in the PMTT algorithm. It consists of two main components:

Merkle Tree Nodes. The number of nodes in a Merkle tree is determined solely by its height. For a Merkle tree with a tree height of \(h^{\prime }\), the total number of nodes is \(2^{h^{\prime }+1} - 1\), each occupying a memory size of n bytes. To accommodate memory alignment, we allocate space for all nodes using \(2^{h^{\prime }+1} \cdot n\) bytes.

LM-OTS Signatures. Each LM-OTS signature occupies \(4 + n + p\cdot n\) bytes. The LM-OTS signature is not stored at every level because the value corresponding to the message and the top-level tree’s root has no meaning to be stored and cannot be reused.

\begin{equation} Space_{ap} = (L-1)\cdot (4 + n + p\cdot n) + L \cdot \sum _{h^{\prime }=0}^{L^{\prime }-1} 2^{h^{\prime }+1} \cdot n . \end{equation}

(1)

We use the parameter sets in Table 2 for representation, where their LMS trees have the same height. The memory footprint is shown as Equation (1), encompassing memory usage of Merkle tree nodes and LM-OTS signatures exclusively. Let n be its maximum value, 32, and p be the standard value corresponding to XMSS, 67. We use a maximum total height of 60 to calculate memory usage. We refrain from recommending high subtrees for execution on GPU due to communication costs. Two maximum heights of subtrees (\(h^{\prime }\)) are used to illustrate the memory footprint of PMTT. When \(h=5,10,15,20\), \(h^{\prime }=10\), the memory usage approximates 47.42 KB, 394.64 KB, 270.39 KB and 388.26 KB, respectively. When \(h=5,10,15,20\), \(h^{\prime }=5\), the memory used is about 47.42 KB, 34.64 KB, 22.39 KB and 16.26 KB, respectively. These memory consumptions remain negligible compared to typical memory capacities, often measured in GB.

3.3 Parallel Construction of the Merkle Tree

All trees of PMTT are constructed using the parallel construction of the Merkle tree, which is detailed in [24, 47]. The first level is the parallel generation of different nodes at the same layer, with each thread computing one node. The second level is the parallel generation of the LM-OTS public key inside the leaf node, with each thread handling an OTS chain. The maximum parallelism of the two levels is the number of leaf nodes and the number of OTS chains p, respectively.

Leaf nodes. There are many solutions for generating leaf nodes.

Two Levels of Parallelism. This method offers the highest parallelism of \(leaf\_num\cdot p\). It minimises latency but demands lots of memory to retain the results of all OTS chains, approximately \(leaf\_num\cdot p\cdot n\).

First-Level Parallelism Only. Here, the memory requirement is about \(leaf\_num\cdot n\). This approach is more efficient in terms of parallelism compared to the previous one, as it excludes intra-node parallelism.

Second-Level Parallelism Only. The maximum memory footprint of this solution is \(p\cdot n\). This solution does not store nodes, so it is impossible to generate branch nodes in parallel.

Branch Nodes. When employing first-level parallelism, branch nodes can be stored in the same memory space as leaf nodes, eliminating the need for additional memory allocation.

Insufficient Parallelism. If there’s insufficient parallelism to generate all leaf nodes simultaneously, we can adjust the parallel approach, i.e., let one thread generate a subtree instead of one thread generating a leaf. By using the root nodes of these subtrees, we can efficiently build the branch nodes in parallel.

3.4 PMTT for Data Parallelism

In data parallelism, multiple threads handle multiple tasks in parallel. Unlike algorithmic parallelism, which focuses on improving the performance of a single task, data parallelism aims for high throughput. Throughput refers to the number of tasks executed per unit of time, directly linked to GPU utilisation. Data parallelism can be divided into two types according to the number of keypairs used: single-keypair and multi-keypair.

3.4.1 Multi-keypair Data Parallelism.

The parallel scheme of multi-keypair data parallelism is to use multiple threads to perform multiple tasks one-to-one. This design is beneficial to improving throughput. Although PMTT can use less memory than fractal Merkle tree traversal, it is not recommended to use multi-keypair data parallelism of PMTT because the PMTT parallel design may cause some signing to take too much time. Instead, BDS traversal and fractal Merkle tree traversal are recommended because they can make the signing time relatively stable. If PMTT must be used, it is recommended to use a lower subtree height to reduce memory usage.

3.4.2 Single-keypair Data Parallelism.

Single-key pair data parallelism is not merely assigning one thread to process one task, as the LMS trees constructed for multiple tasks remain the same. PMTT considers reusing these LMS trees to accelerate the process of signature generation. For tasks with large workloads, the overhead of reconstructing subtrees often outweighs that of transferring subtrees. Given this, PMTT no longer leverages subtrees in an LMS tree. Compared with algorithmic parallelism, PMTT stores multiple trees to improve parallelism. Without storing multiple trees, at most \(2^{h}\) signatures can be completed simultaneously, for example, only 32 signatures can be generated concurrently when \(h=5\).

Initialisation Phase. We introduce a variable, para, to describe the parallelism of the current system, which can be the number of CUDA cores in the GPU or the number of physical cores in the CPU. In addition, programmers can specify para to control CPU/GPU resource allocation. Algorithm 5 shows the initialisation phase of PMTT for single-keypair data parallelism. The algorithm processes the tree level by level, from bottom to top, to deal with the changed para. This bottom-to-top iteration is necessary to compute the cumulative \(sum_h\). Because one of the leaves of the bottom tree can be used for one signing, \(M_k\) LMS trees should be constructed. These trees are constructed in parallel, with \(M_k\) decreasing as the level increases. PMTT uses \(info_{k}\) to mark how many trees are available at level k. Finally, LMS trees and corresponding OTS are generated and stored, level by level.

Update and Output Phase. Algorithm 6 shows the specific steps of the update and output phases of PMTT for single-keypair data parallelism. Each round processes \(task\_num\) tasks (signing). The approach follows a level-by-level progression from bottom to top. First, PMTT uses \(info_k\) to identify the trees requiring updates. In our scheme, PMTT utilizes an index-based design to iteratively utilize these trees without data movement. If some trees are used up, it is necessary to construct some new LMS trees and update \(node_{kq}\) and \(ots_{kq}\) (lines 5–10). Then, PMTT outputs the authentication paths and LM-OTS signatures for each task in parallel (lines 11–14). Finally, leaf, \(task\_num\), and k are used to get the number of the remaining LMS trees, which is stored in \(info_k\) (line 16). We do data parallelism to improve parallelism of some behaviours, including LM-OTS signature generation of messages and output of OTSs and authentication paths (lines 18–20). Figure 3 shows the diagram describing how data is updated for PMTT single-keypair data parallelism. When the (\(M_k-1\))th tree at the lower level is exhausted, the subsequent leaf at the upper level is utilised. If this leaf corresponds to the last one, construction of the next tree becomes necessary.

Fig. 3.

3.4.3 Memory Used.

In multi-keypair data parallelism, each task uses the same amount of memory as in algorithmic parallelism. Here, we focus on analysing single-keypair data parallelism, whose auxiliary array only includes Merkle tree nodes and LM-OTS signatures. The maximum size of the auxiliary array is determined in Equation (2). For the calculation of parameter \(M_k\), see Algorithm 6. A total height of 60 is used to calculate the maximum memory usage. Here, the number of CUDA cores for NVIDIA RTX 3090 is used as para, 10496. Let n be 32 B. When \(h=5, 10, 15, 20\), about 1.41 MB, 1.03 MB, 8.01 MB, and 192.00 MB are used.

\begin{equation} Space_{sdp} = \sum _{L-1}^{k=0} M_k \cdot 2^{h_k+1} \cdot n + \sum _{L-1}^{k=1} M_k \cdot (4 + n + p\cdot n) . \end{equation}

(2)

3.5 LMS Parallel Design

3.5.1 LMS Keypair Generation.

Two keypair generation schemes are provided: one without requiring an auxiliary array and the other necessitating its construction. In the former, only the top-level tree needs construction, while in the latter, data for each level is generated for the auxiliary array. Opt for the first solution if the initial signature latency isn’t crucial, as it separates keypair generation from signature generation. This allows flexibility, as the server generating the keypair may not handle signature generation. However, if the initial signature latency is significant, opt for the second solution.

Scheme without Auxiliary Array. In this scheme, keypair generation involves concise input and output: parameter set and seed produce the LMS keypair. Initially, a thread calls a random number generator to generate seeds for keypair generation. To ensure each thread receives the same seed, synchronisation operations are performed. The seed is stored in a global variable in GPU global memory for access by all threads. Later, all participating threads parse the parameters for constructing the Merkle tree in parallel. Upon completion, thread 0 stores the necessary information for the LMS private key, marking the completion of its construction. Subsequently, all threads construct the top-level Merkle tree to obtain the root node, completing the LMS public key construction.

Scheme with Auxiliary Array. This solution requires referring to the initialisation phase of PMTT to construct an auxiliary array. Additionally, it requires incorporating I/O-related modules for transferring the auxiliary array to other physical machines. After generating the keypair, the relevant auxiliary array is outputted to the file system. Because of file I/O operations and the construction of multi-level trees, the computational overhead is usually much higher than the scheme without constructing the auxiliary array.

3.5.2 LMS Signature Generation.

It is necessary to assess the status of the auxiliary array during signature generation. If an auxiliary array is generated when the keypair is generated, the previous auxiliary array needs to be retrieved from the file system or memory. If not, the auxiliary array should be constructed during signature generation. To implement this agile initialisation, we output the results at the end of each level of signature generation, which differs from the design of the fractal Merkle tree traversal. If an auxiliary array exists, the authentication path is directly outputted. If no auxiliary array exists, the necessary components are constructed and the authentication path is outputted once the construction is completed.

For algorithmic parallelism, it is necessary to mark whether the current tree has been exhausted or constructed. We mark the current tree as constructed after completing the LM-OTS signing because the LM-OTS signing follows the LMS tree generation. Additionally, marking is necessary after constructing the top-level tree since the root of the top-level tree isn’t used for the LM-OTS signing. The mark that the LMS tree has been exhausted occurs at the end of the signature generation, which can be judged from the leaf index. For single-keypair data parallelism, it’s also essential to determine the number of trees available in this round and construct the necessary trees accordingly.

3.5.3 LMS Signature Verification.

For algorithmic parallelism, we optimise by parallelising LM-OTS signature verification. However, due to the maximum parallelism limit of p, the acceleration effect is not obvious. Following LM-OTS signature verification, the p OTS chain results need to be combined and hashed. In the serial version, we can perform part of the hashing operation after each OTS chain, saving memory. However, in the parallel version, we must use \(p \cdot n\) space to store the OTS chain results concurrently. Additionally, synchronisation is required to ensure result accuracy after computing the OTS chains. The data parallel scheme is straightforward: each thread handles a serial LMS signature verification.

3.6 Time Analysis of PMTT

When PMTT runs on a GPU, the main overhead arises from LM-OTS signature generation, Merkle tree construction, and memory copying. Let the overhead of LM-OTS parallel signature generations be denoted as \(O_{ls}\), the overhead of parallel Merkle trees constructions as \(O_{m}\), the overhead of coping \(lmots\_sig\) as \(m_{ls}\) and the overhead of coping authentication paths be \(m_{a}\).

For analysing the time complexity of PMTT in algorithmic parallelism, let’s consider an example with the parameter set LMS_SHA256_M32_H60(10,10,10,10,10,10), where \(h^{\prime } = 10\), total \((2^{60})\) signatures can be generated. Total \((\sum _{=00}^{5}2^{10\cdot i})\) LMS trees need to be constructed and \((\sum _{i=1}^{6}2^{10\cdot i})\) LM-OTS signatures need to be generated. In addition, 6 LM-OTS signatures and 6 levels of authentication paths need to be memory copied when generating a signature. Thus, the average overhead of an LMS signature generation for algorithmic parallelism is \(6 (m_{a} + m_{ls}) + (\sum _{i=0}^{5}2^{10\cdot i}\cdot O_{m} + \sum _{i=1}^{6}2^{10\cdot i}\cdot O_{ls})/2^{60}\).

For single-keypair data parallelism, the same parameter sets are used. The main difference is the overhead of memory coping and LM-OTS signature generation. Let the overhead of LM-OTS signature generations in a data-parallel way be donated as \(O_{pls}\), the overhead of parallel memory coping \(lmots\_sig\) as \(m_{pls}\), the overhead of memory coping authentication paths as \(m_{pa}\) and the number of tasks as \(num_t\). Because this overhead is affected by the number of tasks, we assume that the same number of signatures are performed each time. The average overhead of an LMS signature generation for single-keypair data parallelism is \(6 (m_{pa} + m_{pls}) + O_{pls} + (\sum _{i=0}^{5}2^{10\cdot i}\cdot O_{m} + \sum _{i=1}^{5}2^{10\cdot i}\cdot O_{ls})/(2^{60}/num_t)\).

Through these two examples, we can know that time analysis of PMTT cannot be universal when introducing parallelism. The reason is that parallelism is a parameter with a very wide effect. Therefore, practical analysis must be tailored to specific scenarios rather than relying on universal principles.

3.7 PMTT for other HBSs

Here, we describe how PMTT can be applied to other HBSs, such as XMSS. All stateful HBS algorithms rely on Merkle tree traversal to speedup signature generation. The differences between LMS and other stateful HBS algorithms are how to deal with bitmasks, generate leaf nodes, and format signatures. Therefore, adapting PMTT to other HBS variants only requires adjustments tailored to the specific scheme. For example, PMTT needs to call parallel interfaces of the specific HBS and store information according to the specific signature format. It can be asserted that PMTT is applicable to all stateful HBSs. However, since stateless HBSs like SPHINCS\(^+\) do not employ Merkle tree traversal, PMTT cannot be utilised in such cases.

3.8 Tradeoffs for Memory-constrained Devices

For PMTT, The memory footprint is related to the height of the subtree, the number of tasks and the height of the LMS tree. We analyze three types of parallelism separately here.

Algorithmic Parallelism. We recommend using a subtree height of 5. At this height, the maximum memory footprint is 49 KB. If 49 KB still cannot meet actual needs, reducing the subtree height, such as to 4, resulting in a footprint of 39 KB, could be considered. If less memory usage is necessary, it is recommended to use the BDS traversal, which only takes up very little memory (less than 7 KB). For less memory footprint, the classic Merkle traversal can be used, requiring less than 3 KB of memory.

Multi-keypair Data Parallelism. Likewise, a subtree height of 5 is recommended. The maximum footprint is about 509 MB when 10496 tasks are computed. If memory constraints exist, reducing the number of tasks is advisable, as the memory footprint scales proportionally with the task count. When reducing the task count to 1, the strategies outlined in algorithmic parallelism can be referenced.

Single-keypair Data Parallelism. Here, subtrees are not utilised. The maximum memory footprint, with 10,496 tasks, is around 276 MB. Because auxiliary arrays make up the largest portion of the tree, we recommend using a smaller LMS tree height and reducing the number of tasks. When the number of tasks is reduced to 1, the strategies outlined in algorithmic parallelism can be employed.

4 Implementations On GPU

We refer to [47] for GPU-related performance optimisation. Here we only describe how to perform the PMTT-related communication optimisation. Communication, the data transmission between CPU and GPU, is critical in CUDA programming. To minimise latency, it’s essential to precisely specify the amount of data for communication. However, transmitting all auxiliary arrays can lead to redundant data transfer, as some data remains unchanged post-kernel execution. For keypair generation, if an auxiliary array requires computation, it must be fully output. If the auxiliary array is not computed, it is not output. For signature generation, the transmission of auxiliary arrays can be optimised. We describe them separately in two cases: algorithmic parallelism and data parallelism.

4.1 Algorithmic Parallelism

We employ a CPU-GPU collaborative approach to minimise communication overhead, with the CPU handling the reading operation of the auxiliary array and the GPU managing tree construction. This design ensures efficient utilisation of resources without overburdening the CPU, thereby maintaining server performance. Specifically, there are five kinds of communication schemes for signature generation.

(1) Signature that No Tree Construction. In this case, we let the GPU compute \(lmots\_sig\) corresponding to the message, and the remaining information is directly read by the CPU. Since we only need to generate an OTS on the GPU, we only need to exchange OTS-related data. This communication strategy offers the most optimisation as it significantly reduces data transfer.

(2) Signature that No Auxiliary Array. If there are no auxiliary arrays on the file system and memory, the GPU is used to initialise the PMTT algorithm and output the auxiliary arrays.

(3) Signature that LMS Tree is Used Up. If the existing tree is used up, we need to update the data of the tree on the CPU. The first scheme (1. signature that no tree construction) is performed based on these new existing trees.

(4) Signature that LMS Trees Need to be Updated. We need to determine whether any subtrees have been exhausted. If there are exhausted subtrees, the GPU is used to complete the construction of the next subtree and output it to the corresponding position of the auxiliary array.

4.2 Data Parallelism

Different from the design of algorithmic parallelism, we let all operations be done on the GPU to ensure efficient output of the authentication path and \(lmots\_sig\). In this case, our communication scheme is simple. All data needs to be transferred between the CPU to the GPU.

5 Evaluations

5.1 Evaluation Criteria

Our implementations are evaluated based on average throughput and average latency to provide a more intuitive understanding. Latency, measured in milliseconds (ms), encompasses both GPU kernel execution time and data communication time between the CPU and GPU. Throughput is quantified as the number of tasks executed per unit of time, such as the number of signatures generated per second. In addition to throughput and latency, parallel efficiency and speedup are key evaluation criteria, as commonly observed in parallel studies. Speedup denotes the increase in throughput compared to a baseline execution. The baseline is represented by serial single-core execution on the GPU. Parallel efficiency, on the other hand, is the ratio of speedup to the number of physical cores. Experimental results about strong scaling and weak scaling correspond to algorithm parallelism and data parallelism, respectively.

For all tests of keypair generations and signature verifications, 100 rounds are tested to improve accuracy, except for some parameter sets that take more than 10 minutes. The performance of signature generation varies across iterations, hence we conduct numerous rounds of tests. Specifically, for algorithmic and data parallelism, we run 16,384 and 128 rounds respectively. To avoid the effect of initialisation, we do not count the running time of the first-round signature. Additionally, we incorporate a 10-round warm-up phase to mitigate initialisation effects. For memory footprint, the auxiliary data, keypairs and signatures are all included.

5.2 Setups

The testbed configuration is outlined in Table 3. We only offer the base clock and base memory bandwidth because the GPU boost is disabled. The reason is that enabling GPU boost may reduce GPU lifetime. The mainboard, PCI-E, and CPU memory all have performance restrictions on data transfers between the CPU and GPU; thus, we only provide transfer bandwidth that is directly tied to performance.

Table 3.

Item	Value	Item	Value
GPU	Geforce RTX 3090	CUDA cores	10,496
Architecture	Ampere	Base clock	1,395 MHz
Base memory bandwidth	784.4 GB/s	SMs	80
Copy engines	2	Compute capability	8.6
Compiler	CUDA-11.8
CPU	Intel Xeon Gold 5218R	Base clock for GPU	2.1 GHz
CPU to GPU bandwidth	12.4 GB/s	GPU to CPU bandwidth	13.2 GB/s

Table 3. The Details of the Testbed

The LMS algorithm relies on the SHA256 hash function for all hashing operations. We validate the correctness of our SHA256 implementations using test vectors provided by the Cryptographic Algorithm Validation Program (CAVP) from NIST [32]. Our LMS implementations are derived from the reference implementation outlined in RFC 8554 [29]. To test correctness, we manually set seeds and compare the generated keypairs, signatures, and verification results between our implementations and the reference implementation.

5.3 Tests of Algorithmic Parallelism

In this subsection, PMTT for algorithmic parallelism is evaluated. We focus on testing the performance of signature generation. Keypair generation and signature verification are not the core of PMTT. Their optimisation methods are detailed in [24, 47]. We will time three basic operations, hash function, leaf node generation and LM-OTS signature generation. For experimental accuracy, we do not record the time of communication here.

When discussing the overhead of HBS and fractal Merkle tree traversal, it’s essential to use an ideal scenario to describe their performance. Their overhead cannot be measured directly because their signature generation executes the hash function a different number of times in each round. To illustrate their ideal overhead, we need to know the overhead of some basic operations. The respective test methods are:

Hash. This setup employs a single thread to execute a hash function. The input to this hash function comprises a 64-byte message, covering the input values for all hash functions in the LMS. According to [48], the execution overhead of the hash function remains consistent for inputs ranging from 0 to 64 bytes.

AP_leaf_gen. It represents that p threads are used to generate a leaf in parallel. The time to generate leaves consists of three main steps. (1) It includes preparation operations, primarily consisting of two hash functions. (2) It involves LM-OTS public key generation, where p threads are utilised to produce p OTS chain results simultaneously. (3) It requires splicing the OTS chain results and hashing them to obtain the leaf nodes. Since this final step cannot be parallelised and involves hashing a lengthy input, it becomes the most time-consuming aspect of the process.

AP_lmots_sign. It represents that p threads are used to perform an LM-OTS signing in parallel. Because the OTS chain lengths are different, enough seeds are used to make sure we get the average. The main overhead comes from generating p OTS chains with p threads in parallel.

BDS_ideal/Fractal_ideal. It represents the ideal time of BDS traversal/Fractal Merkle tree traversal that we calculate using the conclusions in [7, 22]. The specific expressions can be found in sections 2.4.1 and 2.4.2.

PMTT_test. We just test it through our implementation.

Table 4 shows the average run time per round of some basic LMS operations and traversal algorithms on the GPU. PMTT and fractal Merkle tree traversal are both tested with a subtree height of 10 for a fair comparison. In that case, the fractal Merkle tree traversal can run fastest. However, this subtree height uses a lot of CUDA cores leading to performance degradation, so it is not the best performance of PMTT. This is the maximum memory footprint of PMTT and fractal Merkle tree traversal for given parameter sets. Employing a smaller subtree height would lead to reduced memory consumption.

Table 4.

Item	Formula about Time	Time (ms)	Memory
Hash	a	0.01554	-
AP_leaf_gen	b	0.14850	-
AP_lmots_sign	c	0.05461	-
BDS_ideal	\(6((h-3)a/2+(h-1)b/2+c)\)	4.66350	6 KB
Fractal_ideal	\(5(a+b+c)\)	1.09325	724 KB
PMTT_test	\(6 (m_{a} + m_{ls}) + (\sum _{i=0}^{5}2^{10\cdot i}\cdot O_{m} + \sum _{i=1}^{6}2^{10\cdot i}\cdot c)/2^{60}\)	0.24426	409 KB

Table 4. Average Run Time per Round for LMS_SHA256_M32_H60(10,10,10,10,10,10)

Hash. We use the SHA256 hash function for testing. We do not parallelise the hash function because there is currently no parallel strategy for short inputs [48].

AP_leaf_gen/AP_lmots_sign. These two basic operations are used by BDS and fractal Merkle tree traversal. We assume that programmers parallelise these two basic operations to improve performance when running the two traversal algorithms.

BDS_ideal. The memory footprint of the BDS algorithm is very small, but the running time on large memory devices is worse than that of PMTT and fractal Merkle tree traversal. Our implementation is 9.55× faster than the ideal time for BDS traversal.

Fractal_ideal. The main difference in the cost of PMTT compared to fractal Merkle tree traversal is the scheme of constructing the tree. Fractal Merkle tree traversal computes two nodes per round, while PMTT directly constructs the tree. As a result, PMTT boasts higher parallelism and efficiency. Our implementation is 2.59× faster than the ideal time for fractal Merkle tree traversal. Because PMTT does not store the desired tree, the memory footprint is about half that of the fractal Merkle tree traversal.

PMTT_test. Here, only the computation time on the GPU is measured because here only the performance of computation is compared. Compared to the fractal Merkle tree traversal, PMTT uses less memory because it does not store desired trees. The time is measured by our implementation, but we also list the formula of the ideal time, where the meanings of parameters are shown in Section 3.6. Here, our implementation is still 4.48× (4.66350/1.09325) faster than the ideal time of the state-of-the-art traversal algorithm (fractal Merkle tree traversal).

The latency and memory footprint with various subtree heights for algorithmic parallelism are shown in Table 5. Here, “20/4” indicates a total height of 20 with 4 levels, where LMS trees have equal heights, such as H20(5,5,5,5). Performance is influenced by factors such as computation load, CUDA cores’ computation power, and communication size. Using a lower subtree height results in fewer threads with higher computation power. Consequently, the implementation with a height of \(h^{\prime }=5\) demonstrates superior time performance and relatively small memory usage.

Table 5.

\(h^{\prime }\)	All	10	6	5	4	All	10	6	5	4
Para.	KG	Sign				Verify	Memory (KB)
10	1.09	0.12	0.13	0.12	0.18	0.19	66	7	6	4
15	9.15	0.14	0.15	0.13	0.15	0.21	68	11	8	6
20	240.33	0.13	0.19	0.13	0.14	0.23	130	15	10	7
20/2	1.1	0.17	0.18	0.15	0.27	0.37	135	17	15	11
20/4	0.55	0.25	0.20	0.19	0.54	0.65	23	23	23	20
40/2	244.84	0.19	0.23	0.17	0.18	0.43	263	32	23	17
40/4	1.09	0.23	0.23	0.18	0.31	0.71	272	36	32	25
40/8	0.55	0.35	0.26	0.24	0.59	1.27	49	49	49	42
60/3	246.4	0.23	0.26	0.19	0.21	0.63	396	49	36	27
60/6	1.11	0.29	0.27	0.22	0.34	1.05	409	55	49	39

Table 5. The Latency and Memory Footprint with Different Heights of Subtrees (\(h^{\prime }\)) for Algorithmic Parallelism

5.4 Tests of Data Parallelism

This test examines two forms of data parallelism: single-keypair and multi-keypair. Their main difference is the number of keypairs used. Because HBS have status, thus their parallel methods are quite different. PMTT outperforms BDS and fractal Merkle tree in single-keypair data parallelism but not in multi-keypair scenarios. To show the impact of communication on performance, we display the memory bandwidth utilisation, denoted as “util” = (transmitted data amount)/(whole latency × maximum bandwidth). Given the asymmetry between uplink and downlink bandwidths, we use an average value of 12.8 GB/s to represent the maximum bandwidth. If computation and communication overlap, the utilisation can theoretically approach 100%.

Multi-Keypair Data Parallelism. The latency, memory footprint and memory bandwidth utilisation with 1 or 10496 tasks for multi-keypair data parallelism are shown in Table 6, where \(h^{\prime } = 5\). In this Parallelism, each thread handles a single task, resulting in prolonged latency. Consequently, employing Parallel Multi-Tasking Technology (PMTT) is not advisable due to its significant latency. Although the average latency might seem acceptable, some signing can take a long time. PMTT does not perform well in serial executions because its designs are based on a high parallel scale. Furthermore, with 10496 tasks, the memory bandwidth utilisation reaches approximately 50%. Because our implementation does not use multiple streams to overlap computation and communication, the utilisation rate is lower.

Table 6.

Para.	KG-1	KG-10496	Sign-1		Sign-10496		Memory -10496	Util -10496
Para.	KG-1	KG-10496	all	comp	all	comp	Memory -10496	Util -10496
10	2,204.63	2,468.36	2.98	2.94	13.50	4.50	67 MB	0.39
15	71,372.43	81,557.22	3.03	2.98	17.09	4.60	89 MB	0.41
20	2,300,098.75	2,579,666.21	3.04	2.99	20.58	4.58	111 MB	0.42
20/2	2,242.17	2,540.37	3.14	3.08	26.67	5.31	155 MB	0.45
20/4	70.10	78.93	3.35	3.29	38.81	6.73	244 MB	0.49
40/2	2,300,008.21	2,575,025.95	3.16	3.10	40.80	5.43	244 MB	0.47
40/4	2,240.60	2,533.48	3.35	3.29	53.18	6.84	332 MB	0.49
40/8	70.11	78.49	3.75	3.67	81.32	9.66	509 MB	0.49
60/3	2,298,823.85	2,570,920.63	3.27	3.20	61.03	6.29	376 MB	0.48
60/6	2,242.01	2,509.17	3.56	3.49	83.52	8.56	509 MB	0.48

Table 6. The Latency and Memory Utilisation with 1 or 10,496 Tasks for Multi-Keypair Data Parallelism

Single-Keypair Data Parallelism. The LMS_SHA256_M32_H20(20) parameter set is used to test the weak scalability. Figure 4 shows the latency and throughput of LMS signature generation with different numbers of tasks for single-keypair data parallelism. As the number of tasks scales from 1 to 8,192 cores, throughput increases from 86.13 tasks/s to 553,139.77 tasks/s. resulting in a speedup of 6,421.95 times with a parallel efficiency of 78.39%. Latency remains relatively stable when there are fewer tasks, but beyond 2,048 tasks, it gradually increases. This latency increment can be attributed to two main factors: firstly, as task count rises, so does communication size. Secondly, with increased GPU core count, parallel efficiency experiences a slight decline.

Fig. 4.

Table 7 shows the latency, throughput, memory footprint and memory bandwidth utilisation (Util) of PMTT for single-keypair data parallelism with 10,496 tasks per round. Because the H10(10) parameter set can only sign 1,024 signatures, it is not tested. In this context, the memory usage encompasses not only auxiliary data but also keypairs and signatures. When the total height remains constant, various parameter sets offer distinct advantages. Parameter Sets with \(h=10\) excel in signature generation performance and minimise memory usage. Parameter sets featuring \(h=20\) exhibit superior signature verification performance. Meanwhile, sets with \(h=5\) demonstrate the highest efficiency in keypair generation. The selection of parameter sets should align with the specific requirements of the service environment. Furthermore, we can observe that the lower each level of the tree, the lower the memory bandwidth utilisation. This also indicates that the lower the tree at each level, the lower the benefits of computational and communication overlap.

Table 7.

Para.	KG-L	Sign-L	Verify-L	KG-T	Sign-T	Verify-T	Memory	Util
15	10.26	13.92	4.2	998,051	735,632	2,438,095	28.09 MB	0.16
20	239.74	15.34	4.36	42,713	667,536	2,348,624	91.66 MB	0.47
20/2	1.11	20.86	7.4	9,225,225	490,892	1,383,784	50.26 MB	0.19
20/4	0.58	246.57	13.3	17,655,172	41,530	769,925	94.07 MB	0.03
40/2	244.66	33.04	7.81	41,854	309,927	1,311,140	183.82 MB	0.43
40/4	1.14	29.7	13.91	8,982,456	344,781	736,161	100.47 MB	0.26
40/8	0.56	296.44	25.63	18,285,714	34,543	399,532	187.99 MB	0.05
60/3	247.89	49.86	11.56	41,309	205,375	885,813	275.99 MB	0.43
60/6	1.12	38.56	20.33	9,142,857	265,560	503,689	150.67 MB	0.31

Table 7. The Latency, Throughput and Memory Utilisation for Single-Keypair Data Parallelism

5.5 Comparison with Related Works

Table 8 shows a comparison with related work and our work, where AP, MDP, and SDP represent the algorithmic, multi-keypair and single-keypair parallelism. Our single-keypair data parallel implementation processes 10,240 tasks per round. With a total parameter height of 10, signatures can only be generated 1,024 times, rendering it impractical to assess the performance of the single-keypair data parallelism. For all tests, we do not initialise PMTT in the keypair generation. The performance metrics of RFC 8,554 [29] are derived from [24], which of other works mentioned are derived from those articles directly. To show the energy consumption, we assume that the device is exclusively used and operates at full power. Energy consumption is calculated according to this formula: energy consumption = maximum power times latency. We use Joules consumed per signature generation (J/Sign) for presentation.

Table 8.

Para.	Work	KG-L	Sign-L	KG-T	Sign-T	J/Sign	Device
H10(10)	[29]	12.89	1.99	77.58	502.50	0.816	Xeon Gold 6258R
	[38]	3.35	0.004	298.50	270,270	-	XCVU7P FPGA
	[43]	140.00	0.22	7.14	4,545.00	-	Artix-7 FPGA
	AP [24]	1.14	1.61	877.20	621.10	5.281	Xeon Gold 6258R
	MDP [24]	150.00	1.52	1,703.00	168,615	0.019	Xeon Gold 6258R
	AP-PMTT	1.09	0.12	917.43	8,333.33	0.042	GeForce RTX 3090
	MDP-PMTT	2,468.36	13.50	4,252.22	777,481	0.0005	GeForce RTX 3090
H20(20)	[29]	11,534	62.32	0.09	16.05	25.55	Xeon Gold 6258R
	[38]	3,427.00	0.004	0.29	270,270	-	XCVU7P FPGA
	AP [24]	605.80	9.49	1.65	105.40	31.12	Xeon Gold 6258R
	MDP [24]	154,589	606.10	1.66	422.40	7.76	Xeon Gold 6258R
	AP-PMTT	240.33	0.13	4.16	7,692.31	0.04	GeForce RTX 3090
	SDP-PMTT	239.74	15.34	4.17	667,535	0.0005	GeForce RTX 3090
	MDP-PMTT	2,579,666	20.58	4.07	510,009	0.0007	GeForce RTX 3090
H20(10,10)	[29]	14.76	15.26	67.75	65.53	6.26	Xeon Gold 6258R
	AP [24]	1.12	7.12	892.90	140.40	23.35	Xeon Gold 6258R
	MDP [24]	164.30	163.75	1,558.00	1,563.00	2.09	Xeon Gold 6258R
	AP-PMTT	1.10	0.15	909.09	6,666.67	0.053	GeForce RTX 3090
	SDP-PMTT	1.11	20.86	900.90	490,891	0.0007	GeForce RTX 3090
	MDP-PMTT	2,540.37	26.67	4,131.68	393,550	0.0009	GeForce RTX 3090

Table 8. Comparison with Related Work about Throughput (T), Latency (L) and Energy Consumption

[29]. It is a multithreaded LMS implementation using pthread provided by RFC 8,554. However, due to its reliance on shared memory, the performance is constrained by the number of CPU cores, resulting in limited parallelism. As a consequence, the overall performance of the system is adversely affected.

[38]. This FPGA implementation solely adopts the single-tree variant of the LMS algorithm, which severely limits its practical usability. During LMS signing, they employ fractal Merkle tree traversal to store all node data, causing the primary overhead to arise from a single execution of \(lmots\_sign\). The signature latency remains low due to FPGAs excelling in low parallelism scenarios. However, a drawback lies in the extended time required for keypair generation. With \(h=20\), the memory footprint significantly increases as it necessitates storing the entire tree.

[43]. The study presented the first BDS implementation on LMS for FPGA. Disadvantages include extended keypair generation time and limited to a single-level tree implementation [38].

[24]. They use fractal Merkle tree traversal to provide a complete multi-node LMS parallel implementation scheme. Despite employing 8 nodes, their performance marginally lagged behind ours due to CPU limitations. In addition, they only provide a multi-keypair data parallel implementation scheme. Here, the performance of data parallelism is measured using 256 tasks. For calculating energy consumption, two additional details are necessary: (1) Each node comprises two processors, and (2) each processor consumes 205 W of power.

PMTT. Our LMS implementation stands out for its superior performance in various aspects compared to previous work. It excels particularly in maximum signing throughput and latency of keypair generation. Additionally, we offer multi-tree variants and present three implementations: algorithmic parallelism, single-keypair data parallelism, and multi-keypair data parallelism. Notably, our PMTT implementation demonstrates greater energy efficiency compared to the CPU implementation.

6 Conclusion

We introduce PMTT, a parallel Merkle tree traversal method optimised for parallel devices, leveraging their full capabilities. Through PMTT, we developed an LMS implementation on GPU, harnessing over 10,000 cores. Compared to existing traversal methods, PMTT exhibits higher parallelism, particularly advantageous for devices with ample memory. Additionally, PMTT can be adapted for memory-constrained devices with some tradeoffs. Our novel approach integrates CPU-GPU collaboration for signature generation, minimising communication overhead while utilising only a single CPU core. Experimental results demonstrate PMTT’s superior performance over alternative traversal algorithms, with our LMS implementation surpassing existing solutions. Because Merkle trees are more complex to design on multiple GPUs, our future work includes applying PMTT to multiple GPUs.

References

[1]

Dorian Amiet, Lukas Leuenberger, Andreas Curiger, and Paul Zbinden. 2020. FPGA-based sphincs\({}^{\mbox{+}}\) implementations: Mind the glitch. In Proceedings of the 23rd Euromicro Conference on Digital System Design, DSD 2020, Kranj, Slovenia, August 26-28, 2020. IEEE, 229–237.

Abstract

1 Introduction

2 Preliminaries

2.1 Notations

2.2 Digital Signature Mechanism

2.3 Review of LMS

2.3.1 LM-OTS.

2.3.2 The LMS Tree.

2.3.3 LMS.

2.4 Merkle Tree Traversal

2.4.1 BDS Traversal.

2.4.2 Fractal Merkle Tree Traversal.

2.5 CUDA-Enabled GPU

3 Parallel Design for LMS

3.1 Parallel Merkle Tree Traversal (PMTT)

3.2 PMTT for Algorithmic Parallelism

3.2.1 Specific Design.

3.2.2 Memory Footprint.

3.3 Parallel Construction of the Merkle Tree

3.4 PMTT for Data Parallelism

3.4.1 Multi-keypair Data Parallelism.

3.4.2 Single-keypair Data Parallelism.

3.4.3 Memory Used.

3.5 LMS Parallel Design

3.5.1 LMS Keypair Generation.

3.5.2 LMS Signature Generation.

3.5.3 LMS Signature Verification.

3.6 Time Analysis of PMTT

3.7 PMTT for other HBSs

3.8 Tradeoffs for Memory-constrained Devices

4 Implementations On GPU

4.1 Algorithmic Parallelism

4.2 Data Parallelism

5 Evaluations

5.1 Evaluation Criteria

5.2 Setups

5.3 Tests of Algorithmic Parallelism

5.4 Tests of Data Parallelism

5.5 Comparison with Related Works

6 Conclusion

References

Cited By

Index Terms

Recommendations

RFC 8391: XMSS: eXtended Merkle Signature Scheme

Architecture-Aware Mapping and Optimization on a 1600-Core GPU

Hybrid Parallel Programming on GPU Clusters

Comments

Information

Published In

Publisher

Publication History

Check for updates

Author Tags

Qualifiers

Funding Sources

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

Cited By

View options

PDF

eReader

Login options

Full Access

Figures

Other

Share

Share this Publication link

Share on social media

Affiliations