1 Introduction

Most autonomous systems perceive the surrounding environment via light detection and ranging (LiDAR) [1]. This active sensing technology computes the distance to objects by measuring, at each pixel, the time-of-flight (ToF) between emitted and reflected photons. Both the emission and detection of photons consume power, which is of considerable concern in sensor and processing unit design. The current requirements on depth and spatial LiDAR resolution, however, make its power consumption prohibitive and significantly limit its application to resource-constrained platforms, including mobile devices with limited battery supply and physical space, drone scene mapping with SLAM [2], augmented reality [3] and mobile robotics [4].

A promising paradigm to reduce power consumption in resource-constrained devices is approximate computing (AC) [5], which has been widely adopted in signal processing [6], robotics [7] and machine learning [8]. One particular AC technique reduces the arithmetic precision of operations by representing numbers with fewer bits, thereby decreasing the computational cost of memory and logical units. As a result, reduced precision (RP), as it is often known, leads to significant savings in energy consumption [9]. In LiDAR applications, a side benefit of saving energy in data processing is a reduction of thermal noise in the photon detection components, which are often near the data processing unit [10].

The iterative \(\ell _1\) solver Alternating Direction Method of Multipliers (ADMM) was adopted in [11] for parallel compressive sensing (CS) of Lidar based depth image reconstruction, which allowed high 3D imaging frame rates with reduced laser power, memory use, logic cost, and power consumption. Alternatively, the accelerated proximal gradient descent (PGD) algorithm was adopted in [12] for a similar problem. Several other works have investigated the use of AC for convex optimization. In the context of depth reconstruction, Aßmann et al. [13] and Gürel et al. [14] made resource savings by reduction of precision, whereas Wills et al. [15] and Wu et al. [16] have applied approximate PGD methods to Model Predictive Control (MPC). Very recently, Wu et al. [17] deployed reduced precision on convex optimisation using ADMM and PGD for compressive depth reconstruction, not only to reduce power and resource, but also to obtain a faster implementation. However, in previous work, the approximate precision was fixed and pre-determined through the whole execution cycle.

In contrast, mixed precision computation has the potential to enable even further resource savings with overall low arithmetic error for either dense or sparse iterative computations [18]. Automated precision tuning can improve energy efficiency when employing iterative refinement [19]. However, current mixed and auto-tuned precision algorithms are either built for an instruction-driven architecture, or support only floating-point above single precision. This limits both the energy savings [20], and the application of approximation to smaller scale, resource-constrained platforms.

In this work, a new, mixed precision framework is proposed, supporting both fixed and floating point arithmetic. This is applied to 3D LiDAR CS depth reconstruction using a parallel \(\ell _1\) solver of the least absolute shrinkage and selection operator (lasso) problem [13]. Specifically, by applying mixed precision scaling to typical \(\ell _1\) solvers, ADMM and PGD, for compressive depth reconstruction, considerable savings in power, logic and memory resources are achieved.

Our contributions are summarized as follows:

  • An ADMM solver using iterative mixed precision implementation.

  • A comparative \(\ell _1\) solver, PGD, with iterative mixed precision.

  • A mixed precision framework is created for reconfigurable accelerator generation on an FPGA.

  • A comparative study of both \(\ell _1\) solvers between the existing consistent and the new mixed precision framework is evaluated in terms of depth reconstruction quality, implementation cost, and power consumption.

CS depth reconstruction based on convex optimization and the use of precision scaling are introduced in Sect. 2. In Sect. 3, the proposed framework is illustrated. The results are demonstrated in Sect. 4 with a conclusion in Sect. 5.

2 Background

2.1 Parallel Depth Reconstruction

Using time-correlated single photon counting (TCSPC), LiDAR systems sample depth information with single photon precision at every pixel that can detect photon events [21,22,23,24]. The round trip time of a photon, ToF, and thus the distance travelled by each photon, can be accurately measured. To improve the signal-to-noise ratio, it is common to increase the sampling size by accumulating many photon events in a histogram. Recent advancements in solid-state photon count arrays [10, 25, 26] enable high resolution LiDAR imaging. However, as the number of photon count pixels N, i.e., the resolution of the depth image, increases so does the number of raw histogram measurements.

For high resolution LiDAR, it can therefore be challenging to store and process these raw histogram measurements in real-time.

To address this large data volume, [11] has applied compressive sensing [27, 28] to depth imaging and devised a strategy that processes patches of the depth image independently and in parallel, achieving real-time reconstruction. The framework proposed in [11], named checkerboard compressive depth sensing (CBCS), is illustrated in Fig. 1. It uses a random pattern of illumination (structured light) that is reflected from the scene and acquired by photon pixels in a block-based manner. Each block is reconstructed by solving two lasso problems [29]: one for reconstructing a quantity called the depth-sum, \(x_Q \in \mathbb {R}^{n_B}\), where \(n_B\) is the number of pixels in block B, and one for reconstructing a quantity called the photon count intensity, \(x_I \in \mathbb {R}^{n_B}\). Specifically, given \(y_Q \in \mathbb {R}^{m_B}\) (resp. \(y_I \in \mathbb {R}^{m_B}\)) compressive measurements of the depth-sum (resp. photon count intensity), \(x_Q\) and \(x_I\) are recovered by solving

$$\begin{aligned}&\underset{x_Q}{\text {minimize}}\,\,\, \frac{1}{2} \big \Vert y_Q - \overline{A} x_Q\big \Vert _2^2 + \lambda \big \Vert F x_Q\big \Vert _1 \end{aligned}$$
(1)
$$\begin{aligned}&\underset{x_I}{\text {minimize}}\,\,\, \frac{1}{2} \big \Vert y_I - \overline{A} x_I\big \Vert _2^2 + \lambda \big \Vert F x_I\big \Vert _1\,, \end{aligned}$$
(2)

where \(\overline{A} \in \{0,1\}^{m_B \times n_B}\) is a known binary matrix that encodes the active pixels in each block for each measurement in \(y_Q\) or \(y_I\), and \(F \in \mathbb {R}^{n_B \times n_B}\) is a sparsifying matrix, e.g. a DCT matrix, that we assume is invertible. \(\lambda > 0\) is a regularisation parameter, and \(\Vert \cdot \Vert _2\) and \(\Vert \cdot \Vert _1\) are, respectively, the \(\ell _2\)- and \(\ell _1\)-norms. After recovering these quantities, the final depth image at block B is computed by dividing \(x_Q\) by \(x_I\) pointwise: \(x_D = x_Q ./ x_I \in \mathbb {R}^{n_B}\). The work in [11] also proposed a post-processing strategy to remove blocking effects. The problems in (1) are ubiquitous in signal processing and can be solved efficiently via ADMM [30] and PGD [31].

Figure 1
figure 1

Parallel Depth Reconstruction [13].

2.2 Mixed Precision and Tuning

Mixed precision combines the accelerated and less resource intensive processing of lower precision with the greater accuracy of higher precision arithmetic. It is normally deployed for iterative refinement using floating point data formats [32, 33], for example using three levels of precision [34]. Customized precision tuning instruments have been employed based on specific input sets [19]. Such precision scaling has gained increasing interest for computation-intensive applications, such as deep learning, saving computation and data storage [35] using both floating point

$$\begin{aligned} -1^\text {S}\times \text {M}\times 2^{\text {E-127}}, \end{aligned}$$
(3)

where \(\text {S}\) is the sign, \(\text {M}\) the mantissa, and \(\text {E}\) the exponent; and fixed point

$$\begin{aligned} -1^\text {S}\times ({\text {I}} + {\text {F}}), \end{aligned}$$
(4)

where \(\text {S}\) is the sign, \(\text {I}\) the integer value as sum of \(2^b\), and \(\text {F}\) the fraction value as the sum of \(2^{-b}\) from each bit position b depending on the 2’s complement format [36].

Different arithmetic types yield different dynamic ranges. Floating point with nonlinear binary representation uses less bits, while fixed point enables simpler binary operations but requires more bits. Hence, the various precisions in both floating and fixed point affect not only the algorithm’s performance, but also the embedded implementation.

By adjusting the combination of mixed precision arithmetic at compile-time, static precision tuning is performed once during the system design phase [34]. Such a tuning process can be considered as part of wider hardware/software co-design approaches, profiling the application data and code to compare with empirical measurement before running the system. On the other hand, dynamic precision tuning is invoked multiple times during the execution of the algorithm at run-time, which can incur some overhead [19].

We develop a framework for CS depth reconstruction, using ADMM and PGD to solve lasso. Its key aspect is the mixed precision design of both the arithmetic data type and the binary bit width, appropriate to the performance requirements of the sensing environment. This provides an alternative to our earlier work [17], which used constant precision, and results in a significantly more energy efficient implementation.

3 Mixed Precision Framework

Optimization-based algorithms for lasso are iterative, with the same sequence of steps occurring at each iteration. Traditionally, all computations are performed with the same precision. Here, instead, we consider the case in which different iterations use different precision. We focus firstly on ADMM, secondly on PGD. Then, we present the design flow of our mixed precision accelerator.

3.1 Mixed Precision ADMM

We consider the following reformulation of the problem in (1 and 2):

$$\begin{aligned}{}\begin{array}[t]{ll} \underset{x, z}{\text {minimize}} &{} \frac{1}{2}\big \Vert y - Ax\big \Vert _2^2 +\lambda _{\text {ADMM}}\Vert z\Vert _1\\ \text {subject to} &{} x - z = 0\,, \end{array} \end{aligned}$$
(5)

where we define \(A:= \overline{A}F^{-1} \in \mathbb {R}^{m \times n}\). Each iteration of ADMM [30] applied to (5) consists of

$$\begin{aligned} x^{k+1}&= \Big (A^\top A + \rho I_n\Big )^{-1}\Big [A^\top y + \rho (z^k - u^k)\Big ] \end{aligned}$$
(6)
$$\begin{aligned} z^{k+1}&= S_{\lambda _{\text {ADMM}/\rho }}\big (u^k + x^{k+1}\big ) \end{aligned}$$
(7)
$$\begin{aligned} u^{k+1}&= u^k + x^{k+1} - z^{k+1}\,, \end{aligned}$$
(8)

where \(\rho > 0\) is the augmented Lagrangian parameter, \(I_n \in \mathbb {R}^{n\times n}\) the identity matrix, \(u \in \mathbb {R}^n\) a dual variable, and \(S_{\lambda }(\cdot )\) the soft-thresholding operator applied component-wise: for \(v \in \mathbb {R}\), \(S_{\lambda }(v) = v - \lambda\) when \(v \ge \lambda\), \(S_{\lambda }(v) = v + \lambda\) when \(v < -\lambda\), and \(S_{\lambda }(v) = 0\) otherwise. If the parameter \(\rho\) is constant throughout the iterations, the matrix \(A^\top A + \rho I_n\) and its inverse can be precomputed. This can be done efficiently via the matrix inverse lemma [37],

$$\begin{aligned} \Big (A^\top A + \rho I_n\Big )^{-1}\!\! = \frac{1}{\rho } I_n - \frac{1}{\rho ^2}A^\top \Big (I_m + \frac{1}{\rho }AA^\top \Big )^{-1}\!\!A\, \end{aligned}$$
(9)

and by computing the inverse of the \(m\times m\) matrix in (9) via its Cholesky decomposition, \(I_m + (1/\rho )A A^\top = L L^\top\), where \(L\in \mathbb {R}^{m\times m}\) is lower triangular. The quantity \(g:= A^\top y\) can also be pre-computed. In our context, such pre-computations have the additional benefit of avoiding unnecessary accuracy loss when using mixed precision during the iterations. The resulting algorithm (with fixed precision) is shown in Algorithm 1, from which we can see that each iteration requires \(O(n + m^2)\) arithmetic operations. Step 4 of Algorithm 1 differs from (7) in that it uses over-relaxation, parameterized by \(0< \alpha < 2\), which can improve convergence [30].

figure f
figure g

Algorithm 2 implements Algorithm 1 using mixed precision. It takes as input the constant vector \(g = A^\top y\) and matrix \(H = A^\top {L^\top }^{-1} L^{-1} A\), and the maximum number of iterations \(k_{\max }\). At each iteration k, the algorithm recasts the constants g and H and the variables \(z^k\) and \(u^k\) to the precision required at the current iteration. In software such a function is implemented conceptually as a data type cast, while in hardware it is implemented through data path trimming, where the extra data path are unconnected between mismatched bit-width.

figure h

These variables are then used by the mixed precision function \(*\text {mixiter}\), described in Algorithm 3, which performs the same steps as each iteration of Algorithm 1, but with precision specified by \(l_k\). This variable represents the index to a set of static functions, each of which is written for a predefined precision, e.g., floating point with 32 bits or fixed point with 30 bits. Each of these functions uses the precomputed constants \(\alpha\), \(\hat{\alpha } := 1-\alpha\), \(\rho\), \(\hat{\rho } := 1/\rho\), and \(\kappa := \lambda _{\text {ADMM}}/\rho\). The initial index \(l_0\) is found based on the maximum \(\epsilon\) representing the precision loss given a certain threshold. This process is described in Sect. 3.2. During the iterations, \(l_k\) is modified to meet the demanding finer accuracy of the algorithm, increasing its precision and changing the elementary bit width gradually. We fixed the number of iterations to \(k_{\max }=5\), determined empirically, as no further improvement on reconstruction of depth has been observed beyond this point [17]. In the particular case of parallel computation using small blocks, \(4\times 4\) pixels, ADMM converges in very few iterations.

3.2 Mixed Precision PGD

We now consider the application of proximal gradient descent (PGD) to (1). Each of those problems can be written as

$$\begin{aligned} \underset{x}{\text {minimize}} \,\,\, \frac{1}{2}\big \Vert y - A x\big \Vert _2^2 + \lambda _{\text {PGD}}\Vert x\Vert _1\,, \end{aligned}$$
(10)

where \(A:=\overline{A}F^{-1} \in \mathbb {R}^{m\times n}\). Defining the convex differentiable function \(g(x):=(1/2)\Vert y - A x\Vert ^2_2\), the convex function \(h(x)=\lambda _{\text {PGD}}\Vert x\Vert _1\), and their sum \(f(x) = g(x) + h(x)\), PGD applied to (10) yields the iterative shrinkage thresholding (ISTA) algorithm [31, 38]:

$$\begin{aligned} \begin{aligned} x^{k+1}&= \text {prox}_{\alpha _k h}(x^k-\alpha _k \nabla g(x^k))\\&= S_{\alpha _k \lambda _{\text {PGD}}}\Big (x^k - \alpha _k A^\top A x^k + \alpha _k A^\top y\Big )\,, \end{aligned} \end{aligned}$$
(11)

where \(\alpha _k \ge 0\) is the step size at iteration k, and \(\text {prox}_{\alpha _k h}(\cdot )\) the proximal operator of \(\alpha _k h\) which, in this case, is the soft-thresholding operator \(S_{\alpha _k \lambda _{\text {PGD}}}\), defined in Sect. 3.1. PGD converges whenever \(0 < \alpha _k \le 1/L\), where L is the Lipschitz constant of the gradient of g, i.e., \(\nabla g(x) = A^\top A\). That is, L can be found as the maximum eigenvalue of \(A^T A\). We use a constant stepsize: \(\alpha _k = \alpha = 1/L\).

figure i

The general algorithm is described in Algorithm 4. It takes as input the matrix A, the vector \(g := A^\top y\), the regularizer constant \(\lambda _{\text {PGD}}\), and the maximum number of iterations \(k_{\max }\). After computing the stepsize \(\alpha\), it precomputes the vector \(g_{\alpha } := \alpha g\), the matrix \(W := \alpha A^\top A\), and the threshold \(\kappa := \alpha \lambda _{\text {PGD}}\). Steps 2 and 3 then implement exactly (11).

figure j
figure k

Algorithm 5 implements Algorithm 4 using mixed precision. Both algorithms have the same inputs and initialization, except that Algorithm 5 now also initializes the function pointer with index \(l_k\), which specifies a static function \(*\text {mixiter}^{l_k}(\cdot )\) that performs the main computations of PGD with a given precision. The order of these functions, which are accessed via \(l_k\), determines the sequence of precisions, maintained through combined search of various bit widths based on the precision loss of the precomputed parameters and a post-search fine tuning. As in mixed precision ADMM, all the relevant variables in Algorithm 5 are cast to the required precision and then used in the \(*\text {mixiter}\) function indexed by \(l_k\). \(\text {cast}_c\) only demonstrates the arithmetic precision transformation at the software level, and is not an overhead in hardware design at the appropriate precision. In this work, we only consider mixed precision within the same arithmetic type. As in mixed precision ADMM, we set the maximum number of iterations of PGD to \(k_{\max } = 5\).

3.3 Mixed Precision Accelerator

We developed the semi-automated framework illustrated in Fig. 2 based on the Matlab and Xilinx Vivado toolsets. It quickly prototypes an approximate accelerator with mixed-precision arithmetic on reconfigurable platforms. To support mixed precision design, we created an approximate generic linear algebra library calling third-party arithmetic types, which are based firstly on a customized floating-point library FloatX [32] and secondly on the Xilinx fixed-point library [39]. The library was established with C++ templates in the header only, which allows later specific arithmetic type allocation. Its major features are:

  • General (non-symmetric) real value operations.

  • Arbitrary bit width floating- and fixed-point arithmetic.

  • Basic vector/matrix algebra arithmetic (addition, subtraction, multiplication, division, inversion).

  • Triangular factorization (LU, and Cholesky decomposition).

  • Orthogonal factorization (QR decomposition).

Figure 2
figure 2

Mixed Precision Design Flow.

Based on the approximate linear algebra library, a user-defined kernel, which indicates the specific portion of signal processing application to be the target for approximation, is considered as input in Fig. 2. By compiling the input source, a list of various iterative functions are created using different arithmetic precisions, as expressed in both Algorithms 3 and 6. The source compiler is developed based on the Matlab MEX compiler API where the Design Space Exploration (DSE) with precision adaptation is evaluated based on the input pre-computed data.

The quantization effect of the data from single-precision floating point (32 bits, called full precision in the later context of this paper) to reduced precision is considered as the criterion for precision scaling. Specifically, the normalized absolute difference is used for floating point (FP) arithmetic while the absolute difference is adopted for fixed point (FXP) arithmetic due to the different functional relationships between the binary representations and decimal values. The most representative values, the maximum and minimum of those pre-computed parameters, are picked for the numerical analysis.

To measure the effects of different FP quantizations, we use the normalized absolute difference

$$\begin{aligned} \epsilon = \frac{\left\| \upsilon -\text {cast}_c(\upsilon )\right\| _1}{\left\| \upsilon \right\| _1}\,, \end{aligned}$$
(12)

where \(\upsilon \in \mathbb {R}^n\) is a vector and \(\Vert \cdot \Vert _1\) the \(\ell _1\)-norm. To measure the effects of different FXP quantizations, we simply use the absolute difference

$$\begin{aligned} \epsilon = \left\| \upsilon -\text {cast}_c(\upsilon )\right\| _1\,. \end{aligned}$$
(13)

We guarantee that in either (12) or (13), we always have \(\epsilon \le \omega = 5\times 10^{-3}\). Different arithmetic types will yield different such \(\epsilon\). Using exhaustive search, we select the bit width of different parts in the binary arithmetic format for either FP or FXP as the smallest that satisfies \(\epsilon \le \omega\). The representation with that bit width is then indexed by \(l_0\) in Algorithms 2 and 5.

The kernel with chosen precision is adopted for both the integer and fraction part representations to achieve a relatively close reconstruction performance to full precision, where heuristic fine tuning is performed at the DSE stage. The accelerator is generated with a Hardware Description Language (HDL) by calling Xilinx High-Level Synthesis (HLS) tools within the Matlab Framework. With combined algorithm evaluation and hardware implementation in a loop, it can quickly prototype mixed precision design on reconfigurable platforms.

4 Evaluation

Both \(\ell _1\) solvers, ADMM and PGD, using either floating (FP) or fixed-point (FXP) arithmetic, are illustrated and evaluated for compressive depth reconstruction on two synthetic and two real underwater sensing datasets. Reconstruction of the depth image using dSparse [13] (a non-compressive approach with oversampling) with full, single floating point (FP 32 bits) precision is adopted as a benchmark for comparison.

The FP 32 ADMM or PGD solutions are considered as the baseline, while the identified highest accuracy in the pre-scheduled list is considered as the most suitable reduced precision solution. Both are adopted for comparisons of recovered image fidelity as well as implemented hardware clock speed, resource usage and power consumed.

4.1 Depth Reconstruction Accuracy

The Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) [41] are both considered as quality metrics, showing the fidelity of reconstruction and the perceived quality based on the pixel-wise absolute difference for normalised similarity scaling. Four different sensing environments are adopted to evaluate the quality of depth reconstruction with mixed precision: short range [40] and long range [42] synthetic benchmarks, and real LiDAR data for two very short range, underwater scenes [43]. In the short range scenario, the detailed mixed precision phenomenon is evaluated firstly against the reconstructed depth image quality for both PSNR and SSIM. Then mixed precision is applied to the other three scenarios, where the corresponding depth reconstruction accuracy is also evaluated.

  • Scene 1: short range

    By using the design flow for different scenes at different stand-off distances and dynamic ranges, mixed FP from 18 to 22 bits, and FXP from 20 to 24 bits are employed for the short range synthetic scene with distances ranging from 1.8 to 3.4 meters. Figures 3 and 4 illustrate the depth reconstruction accuracy against various consistent and mixed precision for both \(\ell _1\) solvers, ADMM and PGD. The ground truth is set as dSparse using single precision floating point with 32 bits, where its PSNR and SSIM are infinity and 1 since the comparison is to itself.

    The recovered depth images using the ADMM solver with FP arithmetic are shown in Fig. 3b–e. The PSNR and SSIM are 28 dB and 0.75 respectively for the baseline FP 32 bits (Fig. 3b), while, comparatively, the worst (FP 18) and best (FP22) quality of the pre-scheduled consistent FP precision are illustrated in Fig. 3c, with 0.15 loss in SSIM, and Fig. 3d, with similar PSNR and SSIM. Increasing the bit-width is shown by \(\rightarrow\), the mixed FP precision (FP 18\(\rightarrow\)22 bits) in Fig. 3e shows only 0.022 dB loss in PSNR and 0.009 loss in SSIM. It has even slight better PSNR than the consistent FP 22 solution. Corresponding depth images recovered by using the FXP ADMM solver are shown in Fig. 3f and g. Similarly, the worst case (FXP 20) in Fig. 3f is degraded in both PSNR and SSIM while the best case (FXP 24) in Fig. 3h has even better PSNR and SSIM than the baseline in Fig. 3b. Using mixed precision FXP 20\(\rightarrow\)24 bits, there is a PSNR loss of 0.19 dB and a 0.004 loss in SSIM.

    Similarly, Fig. 4b–e illustrate the recovered depth image using the PGD solver with FP, while Fig. 4f–h are those using FXP. The baseline in Fig. 4b achieves the same SSIM as the FP baseline with slightly better PSNR. The worst case (FP 17) of pre-scheduled, consistent FP in Fig. 4c degrades in both PSNR and SSIM, while the best case in Fig. 4d is almost the same as the baseline. The worst case (FXP 18) of pre-scheduled, consistent FXP has even better PSNR than the baseline but is much worse in SSIM. Both mixed FP and FXP precision show minor loss in both PSNR and SSIM.

    To further illustrate the use of mixed precision, PSNR and SSIM are plotted against bit width for both ADMM and PGD using FP and FXP representations in Figs. 58. The consistent precision, shown as a red line, refers to the use of the same precision over all iterative computations, and is compared to the use of mixed precision shown as blue line.

    As shown in Figs. 5 and 6, for the ADMM solver, the PSNR and SSIM values using mixed precision match the highest, consistent precision in either FP or FXP. An exception is the PSNR of mixed FXP precision in Fig. 5b, which is gradually degraded as the bit width decreases.

    Similar trends exist for the PGD solver as shown in Figs. 7 and 8, where all the PSNR and SSIM values for mixed FXP precision match those of the highest consistent precision in either FP or FXP, degrading with decreasing bit width.

    Summarising, from Figs. 58 we observe that the mixed precision does achieve similar fidelity of the reconstructed depth image when compared to the highest pre-scheduled precision, and to the baseline single floating point precision. However, there are differences between the FP and FXP representations during the transition from low to high precision in the pre-scheduled mixed precision list. The FP representation has a smooth transition for both PSNR and SSIM. Due to the computational uncertainty of truncating errors, the transition of PSNR is slightly bumpy for the FXP representation, but the transition of SSIM is a smooth fucntion of the FP representation.

  • Scene 2: long range

    Using both ADMM and PGDs, mixed FP from 16 to 20 bits, and FXP from 30 to 34 bits are adopted for the long range, synthetic scene with distances ranging from 0 to 100 meters. Figure 9 illustrates the comparisons of depth reconstruction. The baselines of the reconstructed depth image using ADMM and PGD solvers are illustrated in Fig. 9b and e. The PSNR using the ADMM solver is slightly better than the PGD solver by 0.1 dB, while the SSIM using the ADMM solver is slightly worse by 0.07.

    With mixed precision using the ADMM solver, there is a 0.2 dB loss of PSNR for FP from 18 to 22 bits and 0.021 dB loss of PSNR for FXP from 28 to 32 bits. Using the PGD solver, the mixed FP precision from 18 to 22 bits shows a slightly better PSNR than the baseline, and than mixed FXP precision from 30 to 34 bits. The SSIM with both FP and FXP mixed precision are both slightly worse than the baseline by about 0.006.

    Summarising, evaluation of the use of mixed precision in the long range, as in the short range synthetic scenario, confirms the comparable fidelity of reconstructed depth image to both the baseline and the consistent precision solutions. Further evaluation on real data is now considered.

  • Scene 3: underwater A

    This scene has distances ranging from 27 to 39 centimeters. Mixed FP from 18 to 22 bits using both the ADMM and PGD solvers in Fig. 10c and f, FXP from 28 to 32 bits using the ADMM solver in Fig. 10d and FXP from 30 to 34 bits using the PGD solvers in Fig. 10g are adopted.

    There is a slight PSNR degradation of about 0.2 dB using ADMM with mixed FP precision, and an almost identical PSNR with mixed FXP precision. SSIM is degraded by around 0.01 for both mixed FP and FXP precision using ADMM. For mixed precision using the PGD solver, the PSNR with both mixed FP and FXP precision is about 0.1 dB better than the baseline, while the SSIM is slightly degraded by 0.005.

  • Scene 4: underwater B

    This scene has distances ranging from 31.4 to 31.8 centimeters. Mixed FP from 23 to 27 bits using ADMM in Fig. 11c, FP from 19 to 23 bits using PGD in Fig. 11f, FXP from 32 to 36 bits using the ADMM solver in Fig. 11d and FXP from 30 to 34 bits using the PGD solver are illustrated in Fig. 11g. For FP and FXP mixed precision using both the ADMM and PGD solvers, the PSNR amd SSIM are always maintained at a similar level to the baseline FP 32 bits.

    Hence, from our experiments on diverse synthetic and real data, illustrated in Figs. 311, we can make the following observations.

    1. i

      The use of reduced precision in both FP and FXP representations leads to depth reconstruction that has considerable fidelity to the full precision solutions.

    2. ii

      The use of mixed precision, introduced in this paper, also achieves minor losses in PSNR and SSIM when compared to even the highest, fixed precision representations, and is significantly better than the lower precision solutions. This shows a path to even greater resource savings.

    3. iii

      The use of higher precision in the later iterations of \(\ell _1\) solvers makes a more significant contribution to the quality of optimization outcome, while the use of lower precision in the earlier iterations is either tolerable or can be well compensated by the later iteration with higher precision.

    4. iv

      For different mixed precision \(\ell _1\) solver solutions, PGD incorporates lower precision in general than ADMM to achieve similar fidelity in general, since PGD has simpler computational operations with less opportunity of precision loss.

    5. v

      Given that we can achieve similar fidelity with mixed precision, we anticipate that significant cost and power savings are achievable with a mixed strategy, when compared computation using fixed precision, whether full or even reduced through all iterations.

Having shown that the use of mixed precision is possible without significant degradation in quality of reconstruction, we now proceed to show how the use of mixed precision can result in further savings in hardware utilization and power consumption in the next section.

Figure 3
figure 3

Short range dataset [40]: recovered depth images using a dSparse, consistent FP 32 bit); and ADMM solver with b consistent FP 32 bit; c consistent FP 18 bit; d consistent FP 22 bit; e mixed precision from FP 18 to 22 bit for each iteration; f consistent FXP 20 bit; g consistent FXP 24 bit; h mixed precision from FXP 20 to 24 bit for each iteration.

Figure 4
figure 4

Short range dataset [40]: recovered depth images using a dSparse with single precision, FP 32 bit; and PGD solver with b consistent FP 32 bit; c consistent FP 17 bit; d consistent FP 21 bit; e mixed precision from FP 17 to 21 bit for each iteration; f consistent FXP 18 bit; g consistent FXP 22 bit; h using mixed precision from FXP 18 to 22 bit for each iteration.

Figure 5
figure 5

ADMM:PSNR v.s. Bitwidth (Short Range Dataset).

Figure 6
figure 6

ADMM: SSIM v.s. Bitwidth (Short Range Dataset).

Figure 7
figure 7

PGD: PSNR v.s. Bitwidth (Short Range Dataset).

Figure 8
figure 8

PGD: SSIM v.s. Bitwidth (Short Range Dataset).

Figure 9
figure 9

Long range dataset [42]: recovered depth images using a dSparse with consistent FP 32 bit; b ADMM, consistent FP 32 bit; c ADMM, mixed FP 16 to 20 bits; d ADMM, mixed FXP 30 to 34 bit; e PGD, consistent FP 32 bit; f PGD, mixed FP 16 to 20 bits; g PGD, mixed FXP 30 to 34 bit.

Figure 10
figure 10

Underwater dataset A [44]: recovered depth images using a dSparse with consistent FP 32 bit; b ADMM, consistent FP 32 bit; c ADMM, mixed FP 18 to 22 bits; d ADMM, mixed FXP 28 to 32 bit; e PGD, consistent FP 32 bit; f PGD, mixed FP 18 to 22 bits; g PGD, mixed FXP 30 to 34 bit.

Figure 11
figure 11

Underwater dataset B [44]: recovered depth images using a dSparse with consistent FP 32 bit; b ADMM, consistent FP 32 bit; c ADMM, mixed FP 23 to 27 bits; d ADMM, mixed FXP 32 to 36 bit; e PGD, consistent FP 32 bit; f PGD, mixed FP 19 to 23 bits; g PGD, mixed FXP 32 to 36 bit.

4.2 \(\ell _1\) Solver Accelerators

We have designed and implemented fully pipelined architectures using single and mixed precision with FP and FXP arithmetic on a Xilinx Ultrascale+ ZCU106 architecture using Vivado 2019.1 to fast prototype our design without exhaustive hardware optimization.

Following evaluation of algorithmic performance, the corresponding pipelined architecture has been implemented for either consistent or mixed precision ADMM and PGD to solve the twin lasso optimization problem in Eq. 1, where each iteration corresponds to a pipeline stage with either the same or an allocated, different precision. This strategy can be extended to more iterations, considering mixed precision between batched sub-iterations.

An example of the dataflow architecture of implemented accelerator using PGD \(\ell _1\) solver with mixed floating point precision is illustrated in Fig. 12. The MVMul performs the matrix-vector multiplication, where other \(VS*\) blocks are vector-scalar addition, subtraction, multiplication and comparison. VSub implements the vector subtraction.

Figure 12
figure 12

An Example of PGD Architecture.

The datapath of exponent and mantissa are separated between iterative mixed precision, where the lower bit-width is connected to corresponding part of later higher bit-width pipelines. The outcomes from the twin \(\ell _1\) solver, \(x_Q\) and \(x_I\), adopt a vector element-wise division to obtain the final depth pixels \(x_D\).

Resource utilization is derived from the Xilinx tools using the place and route implementation, while the power is estimated using the Xilinx Power Estimator (XPE). Table 1 presents the resource and power comparisons of both consistent and mixed precision using the ADMM and PGD solvers for all scenarios, which also lists the performance of system reaction latency and processing speed in pixel/s given fixed number of pixel in a depth image. The mixed precision values correspond to the bit width variation chosen according to the depth scaling for each of the scenes while the consistent precision cases are those of the highest bit width in the mixed precision combination.

Table 1 Resource and quality metrics for consistent and mixed precision. For custom floating point, the bit width = 1+E+M and for custom fixed point, the bit width = 1+I+F.

The resource utilization of the Look-Up-Table (LUT) and Digital Signal Processor (DSP) units are reduced by up to 67\(\%\) and 80\(\%\) respectively for the ADMM solver using mixed 20 to 24 bit-width fixed-point precision, and by up to 74\(\%\) and 78\(\%\) for the PGD solver using mixed 18 to 22 bit-width fixed-point precision. The related power consumption is significantly reduced by up to 60\(\%\) and 55\(\%\) for ADMM and PGD solver respectively. Though some mixed fixed-point implementation consume more DSPs than mixed floating-point, they are all less than using single precision floating point.

Hence, by adopting mixed precision with either FP or FXP, there are significant hardware resource and power savings when compared to single float precision. Given the similar algorithmic performance between the ADMM and PGD solvers, the outcomes for PGD are better than ADMM in all aspects, both resource utilization and power consumption. Due to the simplicity of the computational operations, the implemented PGD solver gains not only in hardware cost but also in throughput in pixels per second, and in quantified efficiency measured as energy consumption per pixel.

For the hardware implementations, the quality of depth reconstruction is quantified against the consumed resource and power in Figs. 13 and 14 for both the ADMM and PGD solvers. Only SSIM is adopted here as it is more sensitive to individual range errors, though PSNR should have similar trend.

Overall, with an average of \(1\%\) degradation in the SSIM metrics for depth reconstruction shown in Figs. 311, the cost of the ADMM solver using mixed precision is reduced by up to \(67\%\) for Look-Up-Tables (LUT), up to \(80\%\) for DSP units, when compared to single precision, and reduced by up to 74\(\%\) for LUT and 78\(\%\) for DSP for the PGD solver. The estimated dynamic power consumption using mixed precision is also significantly reduced by over 60\(\%\) and 55\(\%\) for the ADMM and PGD solvers respectively compared to single precision, and by approximately \(10\%\) compared to the corresponding consistent precision.

Figure 13
figure 13

SSIM against Resource and Power for the ADMM solver: LUT is considered as the most important resource representing logic area while the power in mW is considered. The higher the SSIM value with less LUT and Power, the better.

Figure 14
figure 14

SSIM against Resource and Power for PGD solver.

For the same sensing scenes with similar SSIM, FXP mixed precision has similar resource utilization to its comparative consistent precision, while the cost with FP mixed precision is slightly lower than its comparative consistent precision with reductions of up to \(10\%\) LUT. Accordingly, similar phenomena occur for the power consumption between FXP or FP mixed precision and their comparative consistent precision solutions. The largest advantage of FXP mixed precision over FP occurs in the first short range scenario, which has over 50\(\%\) reduction in LUT than its corresponding mixed precision accelerator with FP for both the ADMM and PGD solvers.

The bit-width of mixed precision varies between 18 to 28 bits. According to Table 1 and Figs. 311, compared to mixed floating-point precision, mixed fixed-point is generally more efficient in terms of power per computation, as measured by μJ/Pixel. In contrast, mixed floating point uses less representation bits, which requires less bandwidth for data communication links. Both benefits are important depending on the design requirements.

Finally, we observe that the implemented \(\ell _1\) solvers are for processing a single block within the parallel compressive depth reconstruction framework. Hence, the performance values given in Table 1 are considered as an upper bound of overall real-time depth reconstruction, assuming all the blocks are processed in parallel.

5 Conclusions

In this work, a mixed precision framework using both floating and fixed point arithmetic has been introduced with a pre-computed, static schedule. It has been adopted for compressive depth reconstruction design using approximate convex optimization and linear algebra libraries. By adopting a mixed precision schedule matched to the known stand-off and dynamic range of the LiDAR-sensed scene, incremental precision scaling has been applied to the \(\ell _1\) solvers, ADMM and PGD, for depth image reconstruction lasso using convex optimization. The results show that iterative mixed precision in both floating point and fixed point enables similar performance of depth reconstruction compared to single float precision with significant cost and power reductions using a pipelined architecture; greater than \(70\%\) in the best cases.

By analyzing the difference between the ADMM and PGD solvers using either floating point and fixed arithmetic, more efficient depth image reconstruction is enabled with the PGD solver. The benefits of the different arithmetic types have been demonstrated and shown to vary across the different scenarios. In future work, the adaptive run-time strategies to set the mixed precision based on evaluation during the iterative schedule of convex optimization should be explored, according to diverse dynamic sensing ranges and different arithmetic types.