research-article

Open access

empi: GPU-Accelerated Matching Pursuit with Continuous Dictionaries

Author:

Piotr T. RóżańskiAuthors Info & Claims

ACM Transactions on Mathematical Software, Volume 50, Issue 3

Article No.: 17, Pages 1 - 17

https://doi.org/10.1145/3674832

Published: 21 October 2024 Publication History

PDF eReader

Abstract

This article introduces an effective approach to performing matching pursuit calculations with continuous (quasi-infinite) dictionaries. Simulating continuous parameter space is accomplished by combining optimal dictionary construction as introduced previously with local parameter optimization. At the same time, the calculations can be performed not only with Gabor atoms, but with any type of oscillating atoms, or even multiple types of atoms at once. The proposed method is confirmed by introducing a first stable version of a high-performance implementation empi, supporting calculations with continuous as well as discrete dictionaries, further accelerated by the use of multi-threading and GPU support.

1 Introduction

One of the basic problems in signal analysis is finding a sparse approximation of the input signal by a subset of functions (time-frequency atoms) \(g_{i}\in D\) from the pre-defined set (a dictionary). Formally, we want to find a sequence of atoms \((g_{i})\) and the sequence of coefficients \(c_{i}\in\mathbb{R}\) to minimize the difference

\begin{align}\chi=\left\lVert x(t)-\sum_{i=1}^{k}c_{i} g_{i}(t)\right\rVert. \end{align}

(1)

Such a problem may have several variants: we can either fix the number of atoms \(k\) and look for the best \(k\)-sparse approximation, or we can set the threshold \(\chi_{\text{max}}\) for the difference and find the minimal \(k\) for which a representation with \(\chi\leq\chi_{\text{max}}\) exists. Nonetheless, in general, it is an NP-hard problem [3]. However, an approximate solution can be obtained by a greedy algorithm called matching pursuit (MP).

1.1 MP

MP is an iterative algorithm for signal decomposition first proposed by Mallat and Zhang in 1993 [14]. Starting with a signal \(x(t)\) and a pre-defined dictionary of functions (atoms), in each iteration a single element from the dictionary corresponding to the best match with the signal is found and subtracted from the signal. The residual signal from the given iteration becomes an input signal for the next iteration and the procedure repeats until the stopping criteria are met, e.g.,

\begin{align}\begin{split} x^{(0)} & =x \\g_{i} & =\operatorname{argmax}_{g\in D}(g\cdot x^{(i- 1)}) \\c_{i} & =g_{i}\cdot x^{(i-1)} \\x^{(i)} & =x^{(i-1)}-c_{i} g_{i},\end{split}\end{align}

(2)

where \(\cdot\) denotes the scalar product

\begin{align}g\cdot x=\int_{-\infty}^{\infty}g(t)x(t) dt\end{align}

(3)

and in case of discrete signals (e.g., time series), the integral is replaced with a summation.

The criteria for ending the algorithm can be based either on the pre-selected number of iterations or on the total residual energy being reduced below a given threshold (e.g., 1% of the initial signal energy). Also, since energies \(c_{i}^{2}\) of subsequent atoms will form a descending sequence due to inherent properties of MP [14], the stopping criteria can also be based on the last atom's energy instead.

The most popular approach to constructing a dictionary for MP is to use oscillating atoms with well-behaving and continuous envelope functions [14]. One of the possible choices are Gabor atoms associated with a Gaussian envelope function, which in addition to having good numerical properties, are also considered to be a good model for transient effects in time series. Moreover, atoms defined this way have a compact representation in the time-frequency space, which opens a way to estimating time-frequency maps of energy density based on MP [7]. A dictionary can be further supplemented with non-oscillating atoms (e.g., delta-type atoms, where each function has only one non-zero-value) or (co)sine functions, spanning the entire signal.

1.2 Continuous vs. Discrete Dictionaries

Time-frequency dictionaries introduced in the original formulation of MP [14] consist of atoms obtained by scaling, translating, and modulating a given envelope function (e.g., Gaussian envelope). Assuming a dictionary structured in this manner allows one to define a parameter space \(\Gamma_{0}=\{(s,t_{0},f_{0})\}\) where \(s\) is the atom's scale, \(t_{0}\) is its temporal position, and \(f_{0}\) is the modulation frequency. This way, constructing a dictionary \(D\) is equivalent to choosing a set of parameters \(\Gamma\subset\Gamma_{0}\) and proceeding with the construction as \(D=\{g_{\gamma}:\gamma\in\Gamma\}\) where \(g_{\gamma}\) is an atom obtained by transforming an envelope function with \(\gamma=(s,t_{0},f_{0})\). (Technical details on performing this transformation will be provided in Section 3.4.)

If \(\Gamma=\Gamma_{0}\), the dictionary \(D_{\infty}=\{g_{\gamma}:\gamma\in\Gamma_{0}\}\) is itself uncountable and highly redundant. Such dictionaries defined by a continuous parameter space will be called continuous dictionaries. On the other hand, any dictionary constructed through some arbitrary finite sampling of the parameter space will be called a discrete dictionary throughout this article. For practical reasons, the prevalent approach in actual signal processing applications has been, up until now, to restrict the calculations to discrete dictionaries.

The structure of a discrete dictionary has been partially specified in the initial article [14] as a dyadic dictionary, characterized by atoms’ time scales forming a geometric series. However, the quantities defining exact spacing in time and frequency were left as free parameters. A more precise study on the dictionary structure was presented in [13] and resulted in introducing an optimal dictionary, sampling the scale-time-frequency parameter space in a uniform way, in respect to a well-defined product-related metric. This step reduced the problem of constructing a dictionary to a single degree of freedom, a free parameter \(\epsilon\) corresponding to the density of the dictionary.

Even with the optimal dictionary construction, the accuracy of the estimated parameters increases with the size (density) of the dictionary \(D\), while at the same time, larger dictionaries (smaller \(\epsilon\)) increase the computational load significantly. In order to remove this arbitrary parameter affecting both the quality of the resulting decomposition as well as the computation time, a practical method for performing MP calculations with continuous dictionaries in an effective and mathematically provable way is highly desirable.

Another problem that would be automatically solved as a side effect of introducing continuous dictionaries relates to the statistical bias introduced by MP decomposition in a dictionary with finite and regular sampling of parameters. The solution proposed in 2001 [6] in terms of stochastic dictionaries made most of the numerical optimizations impossible, rendering the algorithm extremely hungry for resources. The algorithm proposed in this article removes the statistical bias, the necessity of an arbitrary choice of the dictionary's density, and at the same time offers a significant improvement in the speed of computation.

1.3 Rationale

The initial motivation for this study comes from the field of electroencephalography (EEG). MP with Gabor dictionaries offers a sparse representation of signals in terms of Gabor functions, parameterized explicitly by phases, time and frequency centers, and widths. Based upon these parameters, one can find correspondence between some of the functions fitted to EEG and structures known from its traditional, visual analysis. This allowed for at least partial unification of the traditional and algorithmic analysis of EEG [4].

In the previous implementations of MP used for EEG analysis up until recently [7, 8, 13, 21], accuracy of the estimated parameters increased with the size (density) of the dictionary \(D\), with larger dictionaries increasing significantly the computational load. This encouraged several attempts to the optimization of the dictionary structure [c.f. 6, 13, 14]. The implementation presented in this article can be viewed as a decomposition in a continuous, hence infinitely large, dictionary; its recent application allowed, e.g., for an efficient decomposition of \(\sim\)1,000 hours of 19-channel EEG into \(\sim 10^{7}\) atoms in [5].

1.4 Outline of the Article

This article presents a high-performance implementation of MP based on combining the optimal dictionary construction from [13] with an accurate estimation of the maximum error within a single iteration through the associated optimality factor from [14]. The combination of these two elements opens a way to simulating calculations with a continuous dictionary, while preserving the core principle of MP: a guarantee of choosing the globally best match in each iteration. Apart from removing the dependence of results on \(\epsilon\), this step eliminates altogether the statistical bias caused by an arbitrary dictionary structure.

The article is organized as follows. Section 2 describes the mathematical principles of the proposed method of simulating a continuous dictionary. Section 3 describes the technical details behind the proposed implementation empi. Section 4 presents an experimental performance evaluation of empi, including a comparison with the Matching Pursuit ToolKit (MPTK) [12] as a reference implementation. These three sections are followed by a short summary in Section 5, including a discussion on possible further development of this implementation.

2 Method

2.1 Simulating a Continuous Dictionary

The idea of using a discrete dictionary \(D\) to calculate a decomposition with a dictionary with continuous parameter space \(D_{\infty}\supset D\) seems to emerge as a natural consequence from Mallat's seminal article on MP [14]. For a given signal \(x(t)\), the problem can be seen as to find the best match \(g_{\infty}=\operatorname{argmax}_{g\in D_{\infty}}(g\cdot x)\) starting with a finite number of available scalar products \(g\cdot x\) for each \(g\in D\).

An intuitive solution would be to find the best match from the dictionary \(D\) and then use a local parameter optimization method (gradient or otherwise) to find the best match from the quasi-continuous dictionary. However, this strategy would not guarantee that the atom found this way would actually be the globally best match, as shown schematically in Figure 1.

Fig. 1.

To find the actual best match from the continuous dictionary, it is necessary to perform local parameter optimization starting not only from the single best match in \(D\) but also from the number of sub-optimal candidates. In order to identify the promising candidates, we have to re-introduce the optimality factor \(\alpha \lt 1\) from [14] as the dictionary-specific value that guarantees that

\begin{align}\forall_{g_{\infty}\in D_{\infty}}\exists_{g\in D} g\cdot x\geq\alpha(g_{ \infty}\cdot x)\end{align}

(4)

for all possible input signals \(x\). This value of \(\alpha\) is directly related to the maximum error in the single iteration that would be made by matching the signal \(x\) with the atom \(g\in D\) instead of \(g_{\infty}\).

Knowing this value for a given dictionary allows us to perform local parameter optimization only for promising best-match candidates. More specifically, we start by performing local optimization from the best match \(g^{(0)}\in D\), obtaining \(g^{(0)}_{\infty}\in D_{\infty}\) and storing this optimized atom's energy as \(c_{\text{max}}^{2}\). Then we perform the local optimization for other candidates \(g\in D\), but only if

\begin{align}c^{2}=(g\cdot x)^{2}\geq\alpha^{2}c_{\text{max}}^{2}\end{align}

(5)

and every time the optimization \(g\to g_{\infty}\) results in a better match \(c_{\infty}^{2} \gt c_{\text{max}}^{2}\), the latter is updated.

2.1.1 Assessing the Optimality Factor.

As we are mainly interested in the neighborhoods of promising best-match candidates (those with a good match to the signal), let's restrict to \(x\sim g_{\infty}\) as the worst-case scenario. In this case we re-define \(\alpha\) as

\begin{align}\alpha=\min_{g\in D}\min_{g_{\infty}\in\varepsilon(g)}(g_{\infty}\cdot g) \end{align}

(6)

where \(\varepsilon(g)\), the neighborhood of \(g\), is defined as the subset of \(D_{\infty}\) for which \(g\) is the best approximation:

\begin{align}\varepsilon(g)=\{g_{\infty}\in D_{\infty}\colon\operatorname{argmax}_{g^{\prime}\in D}(g_{\infty}\cdot g^{\prime})=g\}.\end{align}

(7)

In other words, if we defined the product-based metric (as introduced in [13]) on \(D_{\infty}\), each discrete dictionary \(D\) would define the Voronoi diagram partitioning the parameter space and the \(\varepsilon(g)\) would be the Voronoi cell associated with the seed \(g\). Due to the mismatch of grid dimensions \(\Delta_{t}\) and \(\Delta_{f}\) between neighboring scales differing by \(\Delta_{\lambda}\), the cells \(\varepsilon(g)\) may have various shapes. However, a detailed evaluation of possible configurations demonstrates that the lowest bound for \(\alpha\) is obtained with a symmetric configuration as shown in Figure 2.

Fig. 2.

However, even for such a configuration, the lower bound for \(\alpha\) cannot be easily derived analytically due to envelope-related normalization effects [19, 20]. Instead, the numerical analysis can be performed in order to find some practical approximation. The results are shown in Figure 3. It is shown that the lower bound for \(\alpha^{2}\) for middle frequencies (e.g., half of the Nyquist frequency) is only dependent on \(\epsilon^{2}\), with a good linear fit. On the other hand, the relation between the middle-frequency \(\alpha^{2}_{f_{0}=0.25}\) and the actual lower bound \(\alpha^{2}\) is dependent only on the product \(s\times f_{0}\), and neither depend on \(\epsilon\) nor scale \(s\). By combining these two observations, a simplified, yet practical model can be proposed as

\begin{align}\alpha^{2}\geq(1-a\epsilon^{2})(1-e^{P(sf_{0})}),\end{align}

(8)

where \(P\) is a polynomial with a small degree (e.g., 1). It should be emphasized that this proposed model is only a rough, practical approximation for calculating a lower bound of \(\alpha^{2}\) and it is not a result of any rigorous analytical derivation.

Fig. 3.

For Gabor atoms, one possible analytical lower bound, by no means unique, can be obtained with \(a=1.5\) and \(P(x)=-1.59x-2.11\) (Figure 3(b)). A similar analysis can be performed for any other envelope function, for example, triangular envelope [20], for which \(a=1.52\) and \(P(x)=-0.9x-1.97\) (Figure 3(d)).

3 Introducing empi

3.1 Toward the High-Performance Implementation

The first version of a high-performance implementation enhanced matching pursuit implementation (empi) was released in 2016 as a more accurate and faster replacement for the previous implementation MP5 [13] developed and used in the University of Warsaw. It has been designed with the same input/output specification, so its use with a graphical signal analysis toolkit Svarog [8] required only minor modifications. Few years later, on its way to the first stable version 1.0, the software has been rewritten from scratch with significantly smaller memory footprint, more efficient parallelization, and GPU support.

Based on the earlier extension of the optimal dictionary construction to non-Gaussian envelopes [20], this implementation also provides a practical framework for easy introduction of non-Gabor atoms, while preserving the same rigorous treatment. Combining atoms with different envelopes to form a hybrid dictionary is allowed as well.

3.2 Related Work

To the best of the author's knowledge, no open source implementation of MP with continuous dictionaries has been proposed, let alone made available so far. Even for discrete dictionaries, despite the ongoing popularity of MP and numerous papers dedicated to the mathematical properties of the algorithm, the range of available versatile open source implementations is still fairly limited, with most of them lacking proper support for multi-thread and multi-core CPUs. Instead, many published papers have used in-house implementations, usually written in Matlab or Python and not optimized thoroughly.

Apart from MPTK [12] being the most established and widely tested implementation, one should mention the implementation of [9] and the most recent implementation by [18]. Both of these implementations are restricted to Gabor atoms and offer no support for parallel (multi-threaded) calculations. While the implementation presented in [18] introduces a significant performance gain over MPTK, it relies upon an approximate update strategy using a truncated Gram matrix, which may introduce arbitrary errors into calculations. The errors may be avoided by performing arbitrary buffer resets, but this functionality is available fully only through the Matlab/Octave interface, which limits its applicability for comparative benchmarking. Also, it has not been studied how these errors propagate in case of multi-variate MP. Therefore, the performance comparison in section 4 will be limited to MPTK as a current de-facto standard implementation of MP.

3.3 General Description of empi

empi is written in a modern (2017) variant of C\(++\). The only external run-time dependency is a fast Fourier transform (FFT) implementation FFTW [10]. Code from SQLite¹ and CLI11² projects is included in the source code itself, along with the appropriate disclaimers and licenses.

The software is open source (GPL) and can be downloaded from https://github.com/develancer/empi where both source code and Linux binaries are available. Moreover, it can be compiled in various other operating systems, including OS X and Windows, provided CMake is available.

As with most computational tools, empi uses the command-line interface. Two arguments are always required: path to the input signal file and path for the resulting decomposition file to be created. Input signal is always assumed to be raw binary floating-point data. In case of multi-channel signals, data corresponding to separate channels are assumed to be multiplexed. Two output formats for the decomposition file are supported: SQLite (default) and JSON.

In addition to these two parameters, one can use a number of flags and options, controlling various aspects of program behavior. Detailed description of all options is included with the source and binary releases.

There are two modes of CPU parallelization, which can also be used together. First, a number of independent workers can be started, and each worker will process a separate subset of segments and/or channels. Second, each worker can be started with a number of concurrent CPU threads but all threads associated with a given worker will share work on the same signal segment. Either way, in addition to all workers’ threads, an additional single thread will be active and responsible for writing the decomposition results from all workers, in proper order, to the output file. The simplified workflow of the program is shown in Figure 4.

Fig. 4.

Therefore, if the input signal consists of multiple independent channels or segments to be analyzed separately, both parallelization modes can be used (Section 4.2 will provide a performance comparison between this two modes). If the input signal consist of a single segment and a single channel, or is to be processed in a multi-variate mode (where all channels have to be processed simultaneously), thread-based parallelization should be used, as even in case of multiple workers, only one worker would be assigned the entire signal anyway.

If the GPU devices are used in addition to CPU, each worker is started with one additional thread (corresponding to a separate CUDA stream) for each GPU device. The performance gain from enabling GPU devices, especially those with good double-precision floating point capabilities, is significant and will be evaluated in Section 4.

3.4 Dictionary Construction

The most computationally intensive part of the decomposition process is finding the best match from the discrete dictionary of oscillating atoms (Equation (2)). The dictionary is a set of functions constructed by scaling, translating, and modulating a pre-defined envelope function \(\Phi(t)\), as follows:

\begin{align}g_{(s,f_{0},t_{0})}(t)=K \Phi_{s}(t-t_{0})\cos\left(2\pi f_{0}(t-t_{0})+ \varphi\right)\kern-1pt,\end{align}

(9)

where \(\Phi_{s}(t)\sim\Phi(\frac{t}{s})\) is a scaled envelope function, \(K\) is the normalization factor, resulting from \(L^{2}\)-normalization, and all envelope functions (both original and scaled) are normalized as well:

\begin{align}\sum_{t=-\infty}^{+\infty}g_{(s,f_{0},t_{0})}(t)^{2}=\sum_{t=-\infty}^{+\infty }\Phi(t)^{2}=\sum_{t=-\infty}^{+\infty}\Phi_{s}(t)^{2}=1.\end{align}

(10)

The software uses the optimal dictionary construction as introduced in [13], defined by a single parameter \(\epsilon\) corresponding to the inverse of the dictionary size. (Please note this is completely unrelated to a threshold parameter \(\epsilon\) introduced in [18].) According to this construction, the values of scale parameter \(s\) form a geometric series, which allows to make a substitution \(\lambda=\log s\) and introduce a constant log-scale step \(\Delta_{\lambda}\) instead. For each scale \(s\), time (\(t_{0}\)) and frequency (\(f_{0}\)) values form a rectangular grid with steps of \(\Delta_{t}\) and \(\Delta_{f}\), respectively. Actual values of \(\Delta_{\lambda}\), \(\Delta_{t}\), and \(\Delta_{f}\) can be computed by solving equations

\begin{align}A(\Delta_{\lambda})=B(s\Delta_{f})=C(s^{-1}\Delta_{t})=1-\epsilon^{2}\end{align}

(11)

where \(A\), \(B\), and \(C\) are defined as specific integrals of the envelope function \(\Phi\). The general formulae for the these integrals as well as actual results for various envelope functions are presented in [20]. For example, using the Gaussian envelope \(\Phi(t)=2^{1/4}e^{-\pi t^{2}}\) leads to the following step values for Gabor atoms:

\begin{align}\begin{split}\Delta_{\lambda} & =\operatorname{arcosh} \frac{1}{(1-\epsilon^{2})^{2}} \\\Delta_{f} & =\frac{1}{s}\sqrt{-\frac{2}{\pi}\log(1- \epsilon^{2})} \\\Delta_{t} & =s\sqrt{-\frac{2}{\pi}\log(1-\epsilon^{2}) }.\end{split}\end{align}

(12)

In the current version of empi, the available envelope functions are Gaussian envelope (Gabor atoms) and triangular envelope. However, adding the support of further envelope functions is straightforward and requires only implementing a well-documented class interface named Family.

3.4.1 Boundary Conditions.

In order to create a realization of the dictionary based on the construction given above, one needs to set boundary conditions for all three dimensions. The design choice for empi is the following:

—

For scale (\(s\) or \(\lambda\)), the minimum and maximum values should be provided by the user. If not provided, some sensible default values are calculated as described in the documentation.

—

For frequency (\(f_{0}\)), the natural range is from 0 to the Nyquist frequency. The user can provide an alternative maximum value to restrict the dictionary atoms to lower frequencies only.

—

For position (\(t_{0}\)), the range spans the entire signal. An additional flag can be specified to ensure that no atom exceeds the signal range. If the flag is not used (default), empi assumes that signal values outside the valid range are zero.

It should be emphasized that even though for extremely narrow scales (\(s\rightarrow 0\)) the notion of atom's frequency ceases to exist, each realization of envelope function is re-normalized after discretization for a given scale \(s\) according to the construction described in [20] and therefore does not introduce additional normalization issues.

3.5 Scalar Product for the Best Match

In order to find the best match for the input signal, one needs to compute the scalar product between the input signal and each atom in the dictionary. Due to the specific structure of the dictionary, this step can be substantially optimized with the use of FFT. The scalar product between the atom \(g(t)\) and the discrete-time signal \(x(t)\) can be written as

\begin{align}g_{(s,f_{0},t_{0})}\cdot x=K\sum_{t=-\infty}^{+\infty}\Phi_{s}(t-t_{0})\cos \left(2\pi f_{0}(t-t_{0})+\varphi\right)x(t).\end{align}

(13)

On the other hand, the discrete Fourier transform of the input signal, windowed by the scaled envelope function can be written as

\begin{align}X_{(s,f_{0},t_{0})}=\sum_{t=-\infty}^{+\infty}\Phi_{s}(t-t_{0})e^{-2\pi if_{0} (t-t_{0})} x(t).\end{align}

(14)

Employing FFT in calculations allows to calculate \(X\) for all frequencies \(f_{0}\) for the given pair \((s,t_{0})\) through a single transform. However, the relation between Equations (13) and (14) is not as straightforward as it seems. The normalization factor \(K\) depends on the atom's phase \(\varphi\). On the other hand, finding the optimal phase (corresponding to the largest value of the scalar product) requires the exact value of \(K\). Therefore, it is not accurate to simply substitute \(\arg X\) as the optimal phase.

Instead, one can introduce an intermediate “corrected” value \(\tilde{X}\) defined as

\begin{align}\tilde{X}_{(s,f_{0},t_{0})}=\frac{2}{1-|F|^{2}}\left(X_{(s,f_{0},t_{0})}-X^{*} _{(s,f_{0},t_{0})}F_{(s,f_{0})}\right)\end{align}

(15)

where \(F\) is the Fourier-like sum over the scaled envelope function

\begin{align}F_{(s,f_{0})}=\sum_{t=-\infty}^{+\infty}\Phi_{s}(t)^{2}e^{-4\pi if_{0}t}.\end{align}

(16)

For sufficiently large values of \(f_{0}\), \(F\) approaches 0, and therefore, \(\tilde{X}\) differs from \(X\) only by a factor of 2. On the other hand, for very low frequencies, \(|F|\) is very close to 1 and for such cases, the implementation uses a slightly different (albeit equivalent) numerical approach.

Based on the corrected value of \(\tilde{X}\), one can finally calculate without any loss of precision

—

optimal phase \(\varphi=\arg\tilde{X}\),

—

amplitude \(=|\tilde{X}| \Phi_{s}(0)\),

—

scalar product \(c=g_{(s,f_{0},t_{0})}\cdot x\) (see Equation (2)) and associated atom energy

\begin{align}c^{2}=\frac{1}{2}\left(|\tilde{X}|^{2}+\Re(\tilde{X}^{2}F^{*})\right)\kern-1pt.\end{align}

(17)

For each scale \(s\) and time offset \(t_{0}\), only the highest value of \(c^{2}\) (among all values of \(f_{0}\)) is stored in memory, which results in a significantly reduced memory footprint. All stored values of \(c^{2}\) form a two-level heap-like static structure which can be easily updated and in which the highest value is always available.

Also, similarly to the approach used in [12], not all values are re-calculated between iterations, but only those corresponding to parts of the input signal affected by the subtraction of the previous best match. However, the affected scalar products are always updated from scratch instead of using any kind of update formula, as initially proposed in [14] and later used in [13]. There are three main reasons for this design choice:

—

using any kind of analytical formula for the scalar product of Gabor atoms would introduce a significant degree of error, especially for the very narrow scales, where the discrete Gabor atoms significantly deviate from their analog counterparts,

—

derivation of the exact analytical formulae for various envelope functions may be very difficult or even impossible, especially if the decomposition is using multiple atom types at once;

—

on the other hand, numerical computation of scalar products between all atoms would be a significant bottleneck, both time- and memory-wise.

The pseudo-code of the algorithm is shown in Algorithm 1. Calculating the scalar products in lines 3–5 and 16–17 is optimized with FFT according to Equations (14)–(17). Determining the optimal phase for each atom has been omitted for simplicity.

Algorithm 1 A Simplified Outline of the MP Implementation

1: for each atom scale \(s_{0}\) do

2: for each time shift \(t_{0}\)

3: \(c^{2}_{\text{best}}[s_{0},t_{0}]\leftarrow\max_{f} (g_{(s_{0},t_{0},f)}\cdot x) ^{2}\)

4: ⊳ find the best match between the signal \(x\) and envelope \((s_{0},t_{0})\)

5: \(f_{\text{best}}[s_{0},t_{0}]\leftarrow {\text{argmax}}_{f} (g_{(s_{0},t_{0},f)}\cdot x) ^{2}\) ⊳ and the associated frequency \(f\)

6: end for

7: end for

8: while stop conditions are not met do

9: \(s_{0},t_{0}\leftarrow {\text{argmax}} c^{2}_{\text{best}}\)

10: \(f_{0}\leftarrow f_{\text{best}}[s_{0},t_{0}]\) ⊳ \((s,t_{0},f_{0})\) defines the best match for this iteration

11: \(c\leftarrow g_{(s_{0},t_{0},f)}\cdot x\) ⊳the correct phase-dependent amplitude is computed

12: \(x\leftarrow x-c g_{(s_{0},t_{0},f_{0})}\) ⊳the matched atom is subtracted from signal

13: for each atom scale \(s\) do

14: for each time shift \(t\) do

15: if envelopes \((s_{0},t_{0})\) and \((s,t)\) overlap then

16: \(c^{2}_{\text{best}}[s,t]\leftarrow\max_{f} (g_{(s,t,f)}\cdot x)^{2}\)

17: \(f_{\text{best}}[s,t]\leftarrow {\text{argmax}}_{f} (g_{(s,t,f)}\cdot x)^{2}\)

18: end if

19: end for

20: end for

21: end while

3.5.1 Low-Level Use of FFT Libraries.

For CPU-based calculations, FFTW [10] is used to calculate all Fourier transforms. Since each iteration of MP requires calculating multiple FFTs, there is no need to parallelize individual transforms. Instead, if multiple threads are associated with the same worker, each of them is assigned a different pool of windowed Fourier transforms to be calculated.

However, for calculations using GPU, a different approach is used. In order to achieve the best performance on massively parallel computing systems, each CUDA kernel call is responsible for calculating multiple FFTs, thanks to the underlying multi-transform routines available in NVIDIA® CUDA® Fast Fourier Transform (CuFFT) [16]. Since CuFFT does not have any built-in support for calculating windowed Fourier transforms (especially if they overlap), this is accomplished with CUDA callbacks and virtual addressing. Also, if at least one GPU is used throughout the computation, all host memory which needs to be read or written in CUDA calls is allocated as page-locked (pinned) memory, accessible to all GPUs.

In each iteration, all transforms which need to be computed are sorted according to their length and placed in the doubly ended queue. This way, CPU threads consume short transforms first and leave the long transforms to GPU threads, for which the efficiency of CuFFT is far more superior.

3.5.2 Local Parameter Optimization.

Local optimization of the parameters is performed with a Nelder-Mead method [15]. More specifically, empi uses the C++ port³ of the original O’Neill code [17] with subsequent remarks by Chambers and Ertel [2], Benyon [1], and Hill [11]. The only change to this code added by the author is using simplex size instead of terminating variance as the optimalization's stopping criteria (which can be further adjusted with command-line options).

The function to be maximized is the atom's energy \(c^{2}\) (Equation (5)), and the optimization is performed in the space of \((\lambda,f_{0},t_{0})\) with the parameters scaled using local values of \(\Delta_{\lambda}\), \(\Delta_{f}\), and \(\Delta_{t}\) so the simplex size is well-defined and isotropic. Results of local optimization are stored in a result cache as a map \(g\mapsto g_{\infty}\), and each time the part of the signal overlapping any atom \(g\) changes, the cache entry \(g_{\infty}\) associated with that atom is erased.

4 Experimental Evaluation

All benchmark calculations have been performed on a 23-channel 1-hour long EEG recording with 128 Hz sampling rate, extracted from the middle of an overnight sleep recording. In all tests, the dictionary consisted of Gabor atoms only, and the scale parameter (\(s\)) spanned the range between 0.1 and 10 seconds.

4.1 Comparison with MPTK

In order to test the numerical correctness of empi as well as quantitatively measure its performance, MPTK [12] has been chosen as a reference, as being an established and widely used implementation of MP. The first step was to establish a correspondence between parameter sets of both implementations in order to calculate decompositions using the same dictionary structure. This was accomplished by adapting empi to support

—

a flag --dictionary-output, exporting the dictionary structure as a MPTK-compatible XML file, and

—

an additional switch --full-atoms-in-signal restricting the atoms in the dictionary to fully overlap with the signal's time range.

Decomposition results of artificial signal segments, simulated as single Gabor functions with randomly selected parameters, are shown in Figure 5. Lines representing functions fitted in the first iterations by MPTK and empi overlap completely, confirming bit-perfect agreement between these two implementations obtained by a suitable choice of corresponding sets of parameters. The black dotted line represents the actual signal, which may not fully coincide with the reconstruction since the simulations used a discrete dictionary with \(\epsilon^{2}=0.01\) (same for both implementations) while the Gabor atoms’ parameters were chosen randomly from continuous distributions.

Fig. 5.

To compare the performance of these two implementations on a real signal, a series of decompositions has been performed on a single channel extracted from an hour-long EEG recording. Similarly as above, the same dictionary with \(\epsilon^{2}=0.01\) and over 650 million Gabor atoms has been used for both implementations. Additionally, to make a fair comparison, empi was restricted to one CPU thread, as MPTK does not support multi-threading. The results provided in Figure 6 demonstrate that even in a single-threaded case, empi provides a similar (but less varied) performance compared to MPTK, while the resulting decomposition and the corresponding residual signal energy is exactly the same for both implementations. The gain from empi's multi-threading (both CPU and GPU) will be evaluated in Section 4.2.

Fig. 6.

The last comparison, using the same single-channel signal, was performed using a finer discrete dictionary with \(\epsilon^{2}=0.001\). For each implementation, 100 iterations were performed and the resulting decomposition was once again found to be identical. Despite using a significantly larger dictionary consisting of more than 32 billion atoms, both implementations demonstrated a very modest memory footprint of around 200 MB, which confirms the applicability of empi to analyze longer signals without any need of splitting the data into separate segments. The total wall-clock computation time was 1,313.65 seconds for MPTK and 1,002.61 seconds for a single-threaded empi run.

4.2 Multi-Threading Benchmarks

To evaluate performance in multi-threaded and hybrid CPU+GPU usage, benchmark calculations were performed on a high-performance computing system consisting of a 16-core AMD EPYC 7302 processor and two NVIDIA A100 GPUs. The EEG signal was either analyzed as a single hour-long segment or split into 60 segments (one minute each). For each segment, the decomposition was performed until the residual was found to be below 1% (\(-20\) dB) of the signal's initial energy.

4.2.1 Single-Channel Decomposition.

The first benchmark used a discrete dictionary with \(\epsilon^{2}=0.01\). Only a single channel from the recording (corresponding to Cz electrode) was used in this test. Results using various numbers of CPU workers and threads, both with and without GPU, are presented in Figure 7. As long as the signal is split into multiple segments (a), both parallelization schemes work similarly well, with a slight edge toward the scheme based on independent workers. However, if only a single segment is available (b), multiple workers cannot be used efficiently and the thread-based approach has to be used. In both cases, enabling one or two GPUs in addition to CPU processing power offers a significant (up to 10\(\times\)) reduction in computation time.

Fig. 7.

4.2.2 Multi-Channel Decomposition.

Multi-channel decomposition was computed using all 23 channels, with the same dictionary as above (\(\epsilon^{2}=0.01\)). Three variants were tested: SMP (where each channel is processed independently), MMP1 corresponding multi-channel decomposition with constant phase and MMP3 corresponding to multi-channel decomposition with variable phase. Detailed description of these three decomposition modes can be found in the literature, e.g. [13]. The signal was divided into 60 segments (1 minute each) and the computation was performed here only in one configuration, employing 16 independent workers. For each decomposition mode, 0, 1, or 2 GPUs were enabled in addition to the CPU processing power. The results shown in Table 1 demonstrate a significant speed-up gained from utilizing GPUs.

Table 1.

	SMP	MMP1	MMP3
CPU only (16 workers)	1,240.11	7,916.84	2,781.94
CPU \(+\) 1 GPU	560.06	2,262.44	1,141.13
CPU \(+\) 2 GPU	363.66	1,477.13	794.60

Table 1. Elapsed Real Computation Time (in Seconds) of Multivariate MP Decomposition of an Hour-Long 23-Channel EEG Recording

MMP, Multi-variate Matching Pursuit; SMP, Single-channel Matching Pursuit.

4.3 Continuous Dictionary

Calculations for simulated continuous dictionary were performed for a single channel only, similarly to Section 4.2.1. Two available parameter optimization modes were tested: local (where parameter optimization is performed only locally around the best match) and global (according to the systematic approach described in Section 2.1) as well as no-optimization (discrete dictionary) mode, each of them for a wide range of \(\epsilon^{2}\) values. The aim was to measure not only the total computation time, but also the quality of decomposition in each case, measured as the amount of residual energy after 10,000 iterations. The results shown in Figure 8 correspond to the same EEG signal as analyzed in the previous section.⁴

Fig. 8.

If no optimization is used, the performance shows a monotonic dependency between the size of the dictionary and the computation time (triangles in the top half of Figure 8). This is expected, as a larger dictionary always requires more computing power to calculate FFTs and related scalar products for each atom. Enlarging the dictionary also provides a decrease in the residual energy (the bottom half of Figure 8), as the atoms in subsequent iterations provide a better fit to the signal. Together, the results constitute a tradeoff between computation time and the quality of the decomposition.

For the local optimization mode, the results (squares in Figure 8) are somewhat different. The computation time is significantly longer than in the previous case due to additional time needed to locally optimize the parameters of each and every atom. Due to this additional step, the quality of the decomposition is significantly better. However, the variation in the overall quality is also significant and corresponds to the fact that the local optimization mode is not enough to fully simulate a continuous dictionary, as there is no guarantee of selecting the globally best match in each iteration. Moreover, since smaller dictionaries decrease the probability of accurately sampling the global optimum, the variation on the vertical scale grows as we decrease the dictionary size (right-most yellow-green squares on the bottom half of Figure 8).

However, for the global optimization mode (red circles in Figure 8), the computation time increases both for very fine (e.g. \(\epsilon^{2}=0.005\)) as well as for very coarse (e.g. \(\epsilon^{2}=0.079\)) dictionaries. For very small values of \(\epsilon^{2}\), the increase in computation time can be attributed to a large number of atoms, similarly as in two previous cases. However, for very large \(\epsilon^{2}\), the computation time increases as well, and this can be attributed to a large number of candidate atoms for which the local decomposition must be performed. As the \(\alpha^{2}\) (see Section 2.1) decreases, the number of atoms fulfilling Equation (5) grows as well, contributing to an increase in the overall computation time.

On the other hand, the quality of the decomposition in the global optimization mode does not depend on the selected value of \(\epsilon^{2}\). Therefore, for the global optimization mode there is always an optimal value for \(\epsilon^{2}\) providing the same results in the shortest time. For the analyzed system this value was found to be around \(\epsilon^{2}=0.04\), but it is expected to vary across different machines.

It should be emphasized that although the difference in residual energy between various modes of operation may not seem significant, and one can argue that increasing the number of iterations may make the residual energy arbitrarily small, the provided quantities are also indicators of the accuracy of the determined parameters. In many applications, including but not limited to biomedical engineering, e.g., analysis of sleep spindles (or in general, EEG sleep profiles) [21], obtaining the precise values of parameters (\(s\), \(t_{0}\), \(f_{0}\)) for each atom in the decomposition is crucial, as any mismatch in those parameters may lead to misclassification. Moreover, properly simulating the continuous dictionary removes the ambiguity associated with the \(\epsilon^{2}\) parameter and eliminates the impact of an arbitrary dictionary structure on the resulting decomposition.

5 Conclusions and Future Work

This article introduced a mathematical framework for MP algorithm with continuous dictionaries and presented an open source implementation empi with multi-threading and multi-GPU capabilities, performing exact calculations including, but not limited to, Gabor atoms. The proposed global optimization strategy based on evaluating mathematical properties of the optimal dictionary structure was shown to provide consistent results simulating a continuous dictionary. Multiple benchmarks were provided, demonstrating a significant performance gain resulting from multi-threading and/or GPU usage.

The work is (as always) far from done and there is still potential for optimization, in both GPU- and CPU-related parts of code. Especially, the implementation of finding the correct common phase for atoms in the MMP1 variant could probably be optimized, as it seems from the results in Table 1 that it is a bottleneck for multivariate decomposition. The local optimization routine, currently using a customized version of the Nelder-Mead algorithm, could benefit from introducing a different approach, but this will require a systematic evaluation of all possible solutions. Last but not least, one can notice in Figure 8 that even in the global optimization mode, there is still some deviation between results for various \(\epsilon^{2}\). This effect may be attributed to a finite accuracy of local optimization, but some further research is still needed to exclude additional factors.

Acknowledgments

The author would like to thank his PhD supervisors, Prof. Piotr Durka and Prof. Krzysztof Stencel from the University of Warsaw, for enormous patience and many valuable suggestions, as well as Dr. Marian Dovgialo for testing the software and providing constructive feedback. The benchmark calculations in Sections 4.2 and 4.3 have been performed on a high-performance computing cluster at the Faculty of Physics, Astronomy and Informatics, Nicolaus Copernicus University in Toruń.

Footnotes

https://www.sqlite.org/

https://github.com/CLIUtils/CLI11

https://github.com/develancer/nelder-mead

⁴

However, it should be noted that tests were also performed on a random (white noise) signal and demonstrated a qualitatively similar behavior as described below, but with expectedly higher residual signal energy for the same number of iterations.

References

[1]

P. R. Benyon. 1976. Remark AS R15: Function minimization using a simplex procedure. Journal of the Royal Statistical Society. Series C (Applied Statistics) 25, 1 (1976), 97–97. Retrieved from http://www.jstor.org/stable/2346539

Abstract

1 Introduction

1.1 MP

1.2 Continuous vs. Discrete Dictionaries

1.3 Rationale

1.4 Outline of the Article

2 Method

2.1 Simulating a Continuous Dictionary

2.1.1 Assessing the Optimality Factor.

3 Introducing empi

3.1 Toward the High-Performance Implementation

3.2 Related Work

3.3 General Description of empi

3.4 Dictionary Construction

3.4.1 Boundary Conditions.

3.5 Scalar Product for the Best Match

3.5.1 Low-Level Use of FFT Libraries.

3.5.2 Local Parameter Optimization.

4 Experimental Evaluation

4.1 Comparison with MPTK

4.2 Multi-Threading Benchmarks

4.2.1 Single-Channel Decomposition.

4.2.2 Multi-Channel Decomposition.

4.3 Continuous Dictionary

5 Conclusions and Future Work

Acknowledgments

Footnotes

References

Index Terms

Recommendations

Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers

Optimizing linpack benchmark on GPU-accelerated petascale supercomputer

Divide and Conquer on Hybrid GPU-Accelerated Multicore Systems

Comments

Information

Published In

Publisher

Publication History

Check for updates

Author Tags

Qualifiers

Funding Sources

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

View options

PDF

eReader

Login options

Full Access

Figures

Other

Share

Share this Publication link

Share on social media

Affiliations