1 Introduction
Formal verification of floating-point operations is becoming ever more challenging as hardware designs reach levels of complexity previously seen only in software. Its importance in industry is well known, exemplified by the Pentium floating-point division bug [
51]. We present a new approach to the verification of fixed- and floating-point transcendental algorithms. This technique should be viable for verifying high-precision algorithms, as our findings suggest that runtimes will not rise exponentially with the precision. Our experiments cover implementations of logarithms, square root, and the sine function; however, the methodology can also be applied to many different functions implemented using algorithms of the form described in Section
2. Verification of logarithm implementations is a relevant problem as it finds applications in fields such as digital signal processing and 3D graphics [
30,
40]. In addition, it can be very simple to implement: one of the simplest floating-point logarithm examples uses the exponent as the integer part of the result, combined with a
lookup table (LUT) for the
most significant bits (msb) of the significand to generate the fractional part [
30].
Traditional techniques rely on exhaustive testing of all inputs to verify such algorithms, but this can be resource intensive, perhaps prohibitively so. The
Multiple-Precision Floating-Point Reliable (MPFR) library is a C library for multiple-precision floating-point computations with correct rounding [
25] and is widely used as a reference for many verification tasks. For example, some of the industrial implementations presented here were verified by comparing the outputs to the MPFR library. We shall see that the methodology used in this article performs more efficiently in particular cases.
The article will focus on implementations of transcendental functions in hardware that rely on piecewise polynomial approximations. Many elementary functions are traditionally calculated in software [
15,
27], but for numerically intensive algorithms such implementations may simply be too slow, leading to the development of dedicated hardware [
49,
56,
57]. Although primarily focusing on hardware, some software implementations may be amenable to the verification approach presented here, as we shall see in Section
6. De Dinechin, Lauter, and Melquiond verified a CRlibm binary64 library function using the Gappa proof assistant [
20,
Their approach is potentially the most similar method (in execution) to ours, and therefore we will also use our method to verify the same function. All the examples considered here use a binary representation, but the simplest decimal floating-point implementations, which convert the input to binary, apply the binary algorithm, and then convert the result back to decimal, would also be amenable [
34]. Other decimal floating-point implementations appear to rely more on the digit-recurrence algorithm [
11,
12], which is more challenging to reduce to a series of inequalities since decisions are typically made at each iteration based on information from the previous iterations. Reducing the problem to a series of inequalities is important as this is a form that the chosen theorem prover can solve.
To produce the required proofs we use MetiTarski [
3], an automatic theorem prover for real-valued analytic functions such as cosine and logarithm. It is a combination of a resolution theorem prover and a decision procedure for the theory of
real closed fields (RCFs). An ordered field is real closed if every positive number has a square root and every odd-degree polynomial has at least one root, which is equivalent to saying that the field has the same first-order properties as the reals. The resolution prover at the base of MetiTarski is Joe Hurd’s Metis [
39], which is modified in several respects [
1]. One key modification is to the ordering of the resolution prover [
42], which encourages the replacement of supported functions by bounds. The in-built axioms are primarily upper and lower bounds on a set of supported functions. The choice of these bounds was carefully considered. Many are based on the bounds proposed by Daumas et al. [
16]; these are typically derived from Taylor series. Other bounds are obtained from continued fraction expansions, for example:
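one representative pair (a sketch, not necessarily the exact axioms in MetiTarski's library), obtained by truncating the continued fraction of \(\ln(1+x)\), is
\[
\frac{2x}{2+x}\;\le\;\ln(1+x)\;\le\;\frac{x(6+x)}{6+4x}\qquad\text{for } x>0 .
\]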
The resolution prover applies the axioms, replacing any supported functions, and generates polynomial inequalities. With the problem reduced to the theory of RCFs, the decision procedure is then sufficient to finalize the proof. Conjectures are passed to MetiTarski as a set of inequalities that are transformed by replacing any special function by an appropriate bound. Typically proofs are found in a few seconds [
2], but if MetiTarski is unable to prove a conjecture, it does not mean that the conjecture is false. For verification it is important that MetiTarski produces machine-readable proofs that include algebraic simplification, decision procedure calls, and resolution rules [
24]. MetiTarski is among a limited number of automated tools supporting transcendental functions. Examples include the first-order logic automated reasoning tool
dReal [
28] and an approach to Satisfiability Modulo the theory of transcendental functions, which used properties from the MetiTarski suite as benchmarks [
14]. Our initial research encouraged the MetiTarski developers to add an axiom to bound the floor function. The main benefit of the update, for this article, is to simplify the syntax and construction of our conjectures.
The implementations we study will use a piecewise polynomial approach. To approximate a function
\(f(x)\), for
\(x\in \Sigma\), where
\(\Sigma\) is some input domain, we take a set of polynomials
\(p_i(x)\) for
\(i=0,1,\ldots ,K\) and
\(x \in \Sigma _i\), where
\(\Sigma = \bigcup _i \Sigma _i\) and
\(\Sigma _i \cap \Sigma _j = \emptyset , \forall i\ne j\). The piecewise polynomial implementation is generally accompanied by some claimed error bound,
\(\epsilon\), which is what we aim to prove. More precisely, we prove that
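in sketch form (assuming an absolute error bound; the precise statement may differ in detail)
\[
\bigl|\,p_i(x)-f(x)\,\bigr|\;\le\;\epsilon \qquad \text{for all } x\in\Sigma_i .
\]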
For each
\(i\), we will generate at least one problem to be automatically proven by MetiTarski. The main novelty of this contribution is in the application of MetiTarski to this verification problem. MetiTarski’s understanding of elementary functions means that it is the only theorem-proving tool required in this methodology, unlike other approaches that have combined tools [
20]. The automatic generation of problem files from a template problem makes the verification effort simpler and could be used with other automatic theorem provers.
In Section
2 we will discuss the common approaches to implementations of elementary functions. In Section
3 we will describe the verification methodology of this article, with actual applications and results of using this method presented in Sections
4 and
5. Lastly, we will compare our approach to other theorem-proving verification methods in Sections
6 and
7.
2 Transcendental Function Implementations
As mentioned above, transcendental functions are commonly implemented in software; however, the numerical algorithms used in software are often unsuitable for hardware. One example, using Chebyshev series expansion, would result in a high-area circuit due to its use of a variety of expensive operations such as floating-point multiplication [
26]. There are lots of different hardware algorithms to implement these functions, but a large proportion fall into one of the following categories: digit-recurrence [
5,
48],
CORDIC (COordinate Rotation DIgital Computer) [
4,
59,
60], or table-lookup [
55,
57]. A comparison of the different algorithms is beyond the scope of this article but has been tackled by other authors [
49,
57]. We will focus on table-lookup algorithms, as they are broadly used and are the most amenable to our methodology.
Tang’s 1991 paper provided a general framework for implementing a function
\(f\) on an interval
\(I\), which most table-driven algorithms use to some degree [
57]. According to Tang, typical table-lookup algorithms have a set of
breakpoints \(\lbrace c_1,\ldots ,c_N\rbrace\), where
\(c_k \in I\) for
\(k=1, 2, \ldots , N\), along with a table of approximations
\(T\), such that
\(f(c_k) \approx T_k\) for
\(T_k \in T\). Given
\(x \in I\), the algorithm uses the following steps to calculate
\(f(x)\):
(1)
Reduction: Find the index \(k\) minimizing \(\vert x-c_k\vert\) (i.e., the breakpoint nearest to \(x\)), and then apply a reduction transformation \(r = R(x,c_k)\).
(2)
Approximation: Approximate \(f(r)\) using a function \(p(r)\); often a polynomial is used here.
(3)
Reconstruction: Using a reconstruction function
\(S\), which is determined by
\(f\) and
\(R\), find a final approximation.
In the rest of his paper, Tang describes algorithms for
\(2^x\),
\(\log (x),\) and
\(\sin {(x)}\), for relatively narrow intervals. In these examples, tables ranging in size from 32 to 64 entries are used. To support wider intervals, further transformations of the arguments to these narrow domains are necessary. For example, Tang proposes an algorithm to compute
\(\ln (x)\) for
\(x\in [1,2]\), which uses breakpoints
\(c_k = 1 + k/64\) for
\(k=0,1,\ldots ,64\). The breakpoint \(c_k\) is chosen so that
\(\vert x-c_k\vert \lt 1/128\) and a reduced argument
\(r=2(x-c_k)/(x+c_k)\) is used to compute a polynomial approximation
\(p(r)\). The final approximation is given by
\(\ln (x) \approx T_k + p(r)\), where
\(T_k \approx \ln (c_k)\) are the tabulated values [
57]. Polynomials are a common choice for approximating with a reduced argument, and there are a number of approaches to calculating the coefficients. Some use the Remez algorithm to generate the coefficients of the minimax polynomial [
57,
58], while others opt to use carefully rounded coefficients from the function’s Chebyshev expansion [
53]. For the IA-64 architecture, Intel provided a library of table-based algorithms for computing several transcendental functions [
35]. The tables used in this library range in size from 24 to 256 double-extended entries, for the exponential and logarithm, respectively.
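To make the table-lookup scheme concrete, the following minimal Python sketch follows Tang's outline for \(\ln\) on \([1,2]\); the polynomial used for \(p(r)\) and the exact table values are illustrative assumptions rather than Tang's carefully rounded coefficients.

```python
import math

# Breakpoints c_k = 1 + k/64 with tabulated values T_k ~= ln(c_k), k = 0, ..., 64.
# In a real design the T_k would be carefully rounded table entries.
C = [1 + k / 64 for k in range(65)]
T = [math.log(c) for c in C]

def ln_table_lookup(x):
    """Approximate ln(x) for x in [1, 2] using a Tang-style table-lookup scheme."""
    # Reduction: choose the nearest breakpoint, so that |x - c_k| <= 1/128.
    k = min(range(65), key=lambda i: abs(x - C[i]))
    r = 2 * (x - C[k]) / (x + C[k])
    # Approximation: ln(x/c_k) = ln((1 + r/2)/(1 - r/2)) = r + r**3/12 + ...,
    # so a short odd polynomial in r suffices for this illustration.
    p = r + r**3 / 12
    # Reconstruction: ln(x) ~= T_k + p(r).
    return T[k] + p

print(ln_table_lookup(1.37), math.log(1.37))
```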
Table-based algorithms have been further developed and modified to use
lookup tables (LUTs) to construct piecewise polynomial approximations to some elementary functions [
50,
56]. The reduction step still uses the breakpoint method, but the table no longer returns just an approximation,
\(T_k\); it now returns multiple polynomial coefficients for each lookup. Strollo et al. present, alongside their algorithm, many different divisions of the input domain. For example, to compute
\(\ln (1+x)\) for
\(x\in [0,1]\) using a piecewise-quadratic approximation, accurate to 24 fractional bits, they required 128 table entries of coefficients [
56]. In the industrial implementation described below, a table containing 32 breakpoints is used.
To further understand the relationship between LUT size and architecture choices, we can experiment with the FloPoCo tool [
22]. FloPoCo can automatically generate piecewise polynomial approximations to any supported function that are accurate to 1
unit in the last place (ULP) for a given bit width. Note that in such tools, changing the function being approximated only changes the polynomials and does not fundamentally change the architecture. Therefore, the architecture is, to some degree, independent of the function it approximates. Figure 1 shows how the required LUT size changes with the choice of polynomial degree. The second graph highlights that, if we use quadratic polynomials and increase the precision of the approximation, exponentially more LUT entries are required. FloPoCo will be examined in greater detail in Section
5.
In this article, the input to the function will often be represented in floating-point format [
29], which stores a number as
\((-1)^{s} \times 2^{e-b} \times 1.{\it significand}\). In IEEE-754 standard single precision,
\(s\) is a single bit representing the sign, the exponent
\(e\) is an 8-bit integer, the bias
\(b\) is a constant equal to 127, and the significand is 23 bits long. The implementations of the logarithm verified in this article rely on the following identity:
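in sketch form, for a positive normalized input (sign bit zero),
\[
\log_2\!\bigl(2^{\,e-b}\times 1.\mathit{significand}\bigr)\;=\;(e-b)\;+\;\log_2\!\bigl(1.\mathit{significand}\bigr).
\]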
The reconstruction step then simply involves adding the integer exponent to an approximation to \(\log (1.{\it significand})\). This approximation passes the top \(k\) bits of the significand to a lookup table, which returns coefficients for a degree \(m\) polynomial, evaluated using the remaining low bits of the significand. Clearly, if the polynomial is a constant polynomial, then that is equivalent to the \(T_k\) described above. As this approach is essentially an enhancement to the method described by Tang, the verification of these piecewise polynomial LUT methods could also be adapted to the simpler LUT methods.
3 Verification Methodology
Given an implementation of a transcendental function following the outline above, we obtain an abstraction that is verifiable using MetiTarski. For a single-precision implementation, if the top
\(k\) bits of the significand are passed to a lookup table, for a fixed 8-bit exponent and sign bit, we reduce the verification problem to
\(2^k\) calls to MetiTarski. Therefore, the full verification over all inputs is reduced to just
\(2^{k+8+1}\) MetiTarski conjectures to be proven. In some cases, verification over all such inputs is not necessary: a bespoke hand proof may be able to confirm the correctness of the results for exponent scalings. Of course, most verification tasks use massively parallel processing to reduce the runtimes. Similar methods may be used to reduce the runtimes in our approach, as the conjectures passed to MetiTarski are independent of each other. In nearly all commercial implementations,
\(k\) is relatively small as lookup tables can have high ROM demands [
56]. Assuming that our interpolation coefficients are stored in a file and we can express the problem as a set of inequalities, the procedure follows the same basic outline (see Figure
2):
(1)
Write a template problem for the inequalities to be proven, with placeholders for all the relevant interpolation coefficients and most significant bit values.
(2)
Use a wrapper script to read the coefficients and replace them in the template problem to generate the full set of MetiTarski problems.
(3)
Run the Perl script that accompanies MetiTarski to test all of the problems.
(4)
Refine error modeling on problems that are not proven and return to step 1.
(5)
Exhaustively test regions where MetiTarski was unsuccessful.
To demonstrate the methodology, we analyze a toy implementation for computing the natural logarithm based on the outline above. The implementation takes as an input an 8-bit integer
\(x_0x_1...x_7\) and outputs an approximation to
\(\ln (1.x_0...x_7)\). The top 4 bits,
\(i = x_0..x_3\), are passed to a lookup table that returns the 10-bit interpolation coefficients
\(a_i, b_i, c_i\), for
\(i=0,\ldots ,15\). The coefficients are generated using a simple quadratic interpolation scheme. Writing
\(x=0.x_0...x_7\), the approximation generated is
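a quadratic in \(x\); a plausible form (the original expression is not reproduced here, but it is consistent with the interpolation scheme just described) is
\[
a_i + b_i\,x + c_i\,x^2 \;\approx\; \ln(1+x).
\]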
This example is designed to be simple to show the underlying principles of the verification methodology. Later, we shall adapt it to be more relevant to industrial implementations. In this case, the implementation is accurate to
\(2^{-10}\), which can easily be verified using exhaustive testing as the input space only contains
\(2^8\) values.
The first step is to generate the template problem. In this problem, the upper bits are represented by a constant,
\(\_\_y=0.x_0...x_3\), and the lower bits,
\(X\), are modeled as a MetiTarski variable, to which we assign a specified range. The coefficients, which will be replaced in step 2 of the procedure, are just placeholders,
\(\_\_a, \_\_b\) and
\(\_\_c\). The
\(\Rightarrow\) should be read as implies. If this were a real hardware implementation, the design could be improved by absorbing
\(\_\_y\), a constant, into the pre-calculated coefficients, resulting in a polynomial just in
\(X\).
This formula is the template problem; it contains a single variable, \(X\), and four placeholders, \(\_\_y, \_\_a, \_\_b\), and \(\_\_c\). A wrapper script now generates the 16 MetiTarski problems, replacing the placeholders \(\_\_a, \_\_b\), and \(\_\_c\) with the actual coefficients from the LUT and \(\_\_y\) with the relevant constant input to the LUT. A Perl script, which is supplied with MetiTarski, automates the calls to our prover, reporting whether or not each problem was proved. For our toy implementation, MetiTarski is able to provide proofs for all of these problems, and therefore steps 4 and 5 of the procedure are rendered redundant. The total runtime was 5.3 seconds, with no more than 0.4 seconds spent on any one proof. On such a small input space, exhaustive search is quicker by several orders of magnitude, taking less than a tenth of a second. However, as we shall see, our technique does not suffer from exponentially increasing runtimes as we increase the precision of the implementation. Next, we enhance the toy implementation and start to see where the refinement step becomes useful.
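Before doing so, the following minimal Python sketch illustrates steps 2 and 3 of the procedure; the file names, placeholder syntax, and coefficient format are assumptions made for illustration, and the scripts actually used may differ in detail.

```python
from pathlib import Path

# Template problem with textual placeholders __y, __a, __b, __c
# (the file names and coefficient format here are illustrative assumptions).
template = Path("template.tptp").read_text()

# Each line of coeffs.txt is assumed to hold: index y a b c
for line in Path("coeffs.txt").read_text().splitlines():
    i, y, a, b, c = line.split()
    problem = (template.replace("__y", y)
                       .replace("__a", a)
                       .replace("__b", b)
                       .replace("__c", c))
    Path(f"problem_{i}.tptp").write_text(problem)

# The generated problem_*.tptp files are then handed to MetiTarski,
# for example via the Perl driver script that accompanies it.
```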
With this basic understanding of the methodology, we shall make the toy implementation more realistic. Commercial hardware engineers have constraints on area and performance targets to meet, so they apply techniques to reduce the resource requirements. One of these is to truncate bits throughout the algorithm, reducing the number of adders required. This generally improves the performance but typically with some cost to the accuracy of the approximation. In our implementation, we choose appropriate terms to truncate in order to more closely replicate commercial algorithms. The new implementation returns an approximation of the form
In this case, the approximation is accurate to
\(2^{-7}\).
This can easily be checked using exhaustive testing, but notice that the implementation uses bit truncation on the first- and second-order terms. Since MetiTarski has no understanding of the integers, such non-analytic functions are difficult to model. In HOL Light, Harrison developed an entire theory of floating-point arithmetic to more closely model the hardware he intended to verify [
32]. For our purposes, it was sufficient to explore only simple approximations and bounds to such arithmetic. Inspired by this research, MetiTarski now includes support for a floor function. MetiTarski understands functions via axioms that give upper and lower bounds, and in the case of the floor function we simply have
\(x-1\lt\; \text{floor}\; (x)\le x\). It should be noted that this bound is poor when the inputs under investigation are close to 1; for example, MetiTarski will fail to prove
\(\text{floor}\; (0.5)\ge 0\). A simple extension could be the introduction of a bit truncation function that takes an input
\(x\) as well as the number of bits to truncate
\(y\). This would yield bounds
However, for the examples investigated in this article, the basic floor function is sufficient to verify the function to the same precision that can be verified via exhaustive testing. Therefore, using this function, it is possible to produce a MetiTarski conjecture. We now allow
\(X\) to be the integer value of the bottom 4 bits (
\(x_4x_5x_6x_7\)) and
\(\_\_y\) the integer value of the top 4 bits (
\(x_0x_1x_2x_3\)), which, as before, determines the coefficients
\(\_\_a, \_\_b, \_\_c\). We now use the integer values of
\(X\) and
\(\_\_y\), because doing so allows us to model bit truncation using the floor function. Our new template problem is then
where
Using the floor function provides a simple, but usually effective, model. However, this approach has an issue, which it shares with interval arithmetic: all correlation between errors is lost. To demonstrate this problem, consider the following equation and its MetiTarski floor function model, where
\(z\) is a 4-bit integer:
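A reconstruction consistent with the bounds derived just below (the original display is not reproduced here; the floor model replaces each bit truncation by the floor axiom \(t-1<\lfloor t\rfloor\le t\)) is
\[
\Bigl\lfloor \frac{z}{2}\Bigr\rfloor \;-\; \Bigl\lfloor \frac{z}{8}\Bigr\rfloor .
\]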
The floor function model indicates that this equation is bounded below by
\(\frac{1}{2}z - 1 - \frac{1}{8}z = \frac{3}{8}z - 1\). In actual fact the equation is bounded below by
\(\frac{3}{8}z - \frac{3}{8}\). This discrepancy is a result of disregarding any correlation between the two terms in the model and modeling it as a floor function rather than truncation. This issue also occurs in interval arithmetic, but fortunately we can deploy additional variables in our model that account for some of this correlation. By using two additional error variables our MetiTarski problem can model this behavior more accurately.
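As a sketch (not necessarily the exact formulation used, but consistent with the description that follows), the two truncations can be written with explicit error variables:
\[
\frac{z-\epsilon_0}{2}\;-\;\frac{z-\epsilon_0-\epsilon_1}{8},
\qquad \epsilon_0\in[0,1],\quad \epsilon_1\in[0,6],
\]
whose minimum over the error variables is \(\tfrac{3}{8}z-\tfrac{3}{8}\), matching the analytic bound above.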
Essentially, the error variable
\(\epsilon _0\) bounds bit 0 and
\(\epsilon _1\) bounds a chunk of 2 bits, bits 1 and 2, of the variable
\(z\), allowing us to attain the bound we found analytically above, tighter than the bound we found using the floor function approach. Clearly, if we were to introduce an error variable for every bit or collection of bits truncated, then the number of variables in our MetiTarski problem would grow to be huge for any real-world design. This is problematic as MetiTarski runtimes can be doubly exponential in the number of variables, an inherent limitation of the decision procedure on which it relies. In addition, this approach to error modeling requires significantly more user input and skill than the simple floor function method. Intellectual effort is needed to calculate the correlation between error terms introduced in the hardware algorithm. We can also model errors more carefully using a truncation model (Equation (
4)) rather than the floor function approach to improve the tightness of the error bounds.
This example is intended to highlight the limitations of the floor function method and to demonstrate that with additional human and computational effort it is possible to refine our error models. Returning to the procedure outline given above, the floor function method should be used to generate the initial template problem in step 1. If in step 3 MetiTarski fails to prove all the problems it is given, we can introduce error variables in our template problem to refine our error model. This is typically an iterative approach, as there is a tradeoff between the human effort required and the tightness of the error bounds. There may therefore be several rounds of error model refinement with MetiTarski runs until either the problems are all proven or the error model refinements are exhausted. If after exhausting all error model refinements some of the MetiTarski problems remain unproven, then the last step is to use exhaustive testing on the remaining regions. All this effort is typically not wasted, as proving a subset of the MetiTarski problems will reduce, possibly significantly, the size of the input space left for verification using exhaustive testing. In Section
4, we discuss a complex industrial implementation where this final exhaustive step was necessary to complete the verification. This is the only case where we required exhaustive testing.
To illustrate this approach, we model our updated implementation, Equation (
3), using error variables rather than the floor function. This yields a new MetiTarski problem:
where
Surprisingly, for this particular problem, using additional error variables rather than the floor function actually had minimal impact on the overall runtime for the 16 problems. However, the floor function is a recent addition to MetiTarski and we may be able to improve its performance by fine-tuning its heuristic parameters. On industrial implementations, to limit the number of variables it was sometimes necessary to combine error variables and manually calculate an upper bound on these. This was often challenging: some of the error variables can be highly correlated. The floor function method makes the process of generating an initial template problem significantly simpler. For the industrial implementations verified in this article, error variables were necessary, since tight error bounds were required.
To see why this technique is powerful, we extend the implementation above to larger input spaces. The approximation is essentially the same, using the same coefficients and lookup table, but our input now may be 10 bits rather than 8 bits, for example. Figure
3 compares the runtimes of our methodology and exhaustive testing as the input space grows. Notably, the MetiTarski method has roughly constant runtimes, as we expect: MetiTarski is an analytic tool, so increasing the space of discrete inputs only minimally alters the MetiTarski problems. Conversely, exhaustive testing runtimes suffer from exponential growth in the number of bits of the input. Of course, if the size of the lookup table increases, this will affect the MetiTarski runtimes as the number of problems will grow exponentially. The tables used in these algorithms are typically not prohibitively large; those referenced in Section
2 contained fewer than 256 entries, since large tables translate into additional silicon in hardware designs. We have already highlighted the inverse relationship between LUT size and polynomial degree in Figure
1, which suggests that higher-precision implementations can use higher-degree polynomials to limit LUT size growth. The toy example problem could be applied to any elementary function as the architecture framework, relying on piecewise quadratic interpolation, is function independent.
4 Applications and Discussion
We present some results from applying this methodology to the verification of several larger commercial implementations. The runtime results can be found in Table
1 and demonstrate that the technique has real-world relevance.
Figure
4 gives a description of the floating-point implementation verified using this methodology. This implementation was a release candidate for Cadence Design Systems. We see that the 23-bit significand is split into four used sections, while the last bit is discarded. The top 6 bits,
\(y\), are passed to the LUT, which outputs three polynomial coefficients. Using these coefficients, the algorithm computes a polynomial approximation to the logarithm. Splitting the significand reduces the number of operations, and bits that are insignificant for the accuracy target are truncated. The accuracy target for the implementation was 4 ULPs [
47], which is why, for the following conjectures, the bounds on the distance from the true logarithm are
\(2^{-21}(=\!\!4\times 2^{-23})\), since we consider inputs in the region [1,2), as justified by Equations (
1) and (
2). The exponent,
\(e\), in our input region is equal to the IEEE-754 bias
\(b\), yielding a zero exponent in the decoded value. This was the first implementation verified using the MetiTarski method, and at the time it was investigated the floor function was not yet supported. Instead of calling the floor function explicitly, the bounds on the intermediate calculations were derived by hand and combined to yield a simple bound on the error in either direction. This labor-intensive process was the inspiration behind the introduction of the floor function. The manual approach resulted in two separate template problems.
In Templates 1 and 2, uppercase letters are variables with defined ranges, which correspond to the different sections of the significand described in Figure
4. The lowercase letters are constants in each problem, which are replaced by the Python wrapper script; in particular,
\(w\in \lbrace 0,1\rbrace\) is a single bit of the significand as shown in the diagram. We note here that the splitting of the significand has forced the use of two variables,
\(X\) and
\(Z\), to model the problem. On the right-hand side of the inequality in Template 1, the
\(2^{-33}a\) is an error term introduced to model the truncation of the squared term in the algorithm. More precisely, we are explicitly using the naive floor function bound:
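in sketch form (writing \(s\) for the exact value of the squared term and assuming it is kept to 33 fractional bits; the original equation may differ in detail),
\[
s-2^{-33}\;<\;2^{-33}\bigl\lfloor 2^{33}s\bigr\rfloor\;\le\;s .
\]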
In Template 1 we are trying to prove an upper bound on the approximation. Since \(a\) is negative, the approximation is maximized when the square term is as small as possible, so we use the lower bound of the truncated square term. It should not be surprising that \(a\) is negative, since the coefficient of the square term in the Taylor expansion of \(\log_2(1+x)\) is negative. We move the error term to the right-hand side of the inequality for readability.
In these conjectures,
\(l\) is a high-precision approximation to
\(\ln (2)\), which is necessary because MetiTarski only supports the natural logarithm, meaning we need to switch the base. Generated using MPFR [
25], it was truncated at the 32nd decimal place. Therefore,
\(0\lt \ln {(2)}-l\lt 10^{-32}\). This error can be accounted for by the over-approximation of the truncation error term described by Equation (
5). This lower bound can actually be made tighter as we only truncate off at most 9 bits, so it becomes
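(in sketch form, under the same assumptions as above and consistent with the \(2^{-42}\) over-approximation mentioned next)
\[
s-\bigl(2^{-33}-2^{-42}\bigr)\;\le\;2^{-33}\bigl\lfloor 2^{33}s\bigr\rfloor\;\le\;s .
\]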
This means that we have over-approximated the error by
\(2^{-42}a\). The error introduced by using
\(l\) is
For the cases we are interested in, \(\vert \ln {v}\vert \le 2\) and \(|a|\ge 10^{-1}\) in all the polynomials. This means that the error introduced by the \(l\) approximation is less than \(2^{-42}a\) and is accounted for in Template 1.
Of course, Template 1 only proves half of the verification problem. We must also prove the other side, Template 2, which we conjoin with the first, allowing us to keep the same coefficients and variable bounds. The only difference is that we check the lower bound and also calculate a new error bound. In this case, the
\(2^{-23}\) term on the right-hand side of the inequality is to account for the final truncation and rounding of the output in order to fit it into floating-point format. It relies on the same floor function bound used in Equation (
5) but applied to the whole polynomial rather than a single term. It has the opposite sign to the previous error term and is moved to the other side of the inequality once again. There is no need to account for the error in
\(l\) in this template as
\(\ln {(2)}-l\gt 0\). If we can prove both of these conjectures, then we have obtained a proof for the given input range, in our case [1,2).
An error variable,
\(\epsilon \in [2^{-33}a, 2^{-23}]\), could have been used to reduce this to just one template problem; however, testing showed that it was more efficient to split the conjecture in two to avoid the additional variable. The proofs of these problems and minor variants for inputs in the region [0.5,1) were obtained successfully with runtimes presented in Table
1. With a LUT containing 32 entries, the total number of MetiTarski problems for this verification task was 128 (=32
\(\times\) 2
\(\times\) 2), since each entry gave two problems and we needed slightly modified templates for the [0.5,1) region.
Previously, the implementation had been checked using a basic benchmarking tool. Through our investigations, we discovered that the expected accuracy of the implementation was breached for a small input region. Since exponent scaling for \(\log _2\) simply involves adding the exponent to the approximation made on the \(1.{\it significand}\) component, a bespoke and simple proof was sufficient to complete the verification; however, this proof identified a region in which the claimed accuracy may be breached, and exhaustive testing on that region produced a large number of counterexamples. For the purpose of validating our methodology, we exhaustively tested all inputs: as predicted, everywhere outside this narrow region the implementation met its accuracy claim.
The second implementation in Table
1 is an experimental test case generated primarily to analyze how the MetiTarski method performed on higher-precision algorithms. It takes as input a fixed-point number with 8 bits before the binary point and 24 bits after. Therefore, the total number of possible inputs is
\(2^{32}\), but it is a higher-precision algorithm and is more accurate as it makes use of a larger lookup table (256 entries). The algorithm used in this implementation is very similar to the one presented in Figure
4, taking a fixed-point input; therefore, we shall not discuss the details of the implementation. The template problem files were also extremely similar, if not simpler, but the main focus of this analysis was to test how the technique performed on higher-precision algorithms, as the claimed accuracy of this implementation was
\(2^{-28}\). We found that the technique was equally successful in verifying this implementation.
We see that the speedup achieved by MetiTarski is significant on the floating-point logarithm. Notably, it was even greater for the higher-precision (8.24 fixed-point) experimental implementation. Such an observation suggests that this methodology could be viable for verification of higher-precision hardware, where traditional exhaustive techniques are infeasible. Evidence supporting this claim will be given in Section
5.
In addition to the results above, we applied our technique to a more complex algorithm, which had already been formally verified. Here we were only able to prove a subset of the problems. If we were attempting to verify a new implementation, this outcome would still be helpful as we reduce the space of inputs on which exhaustive testing is necessary. Across our experiments this was the only case where we found it necessary to resort to exhaustive testing. Through this experience we discovered a significant shortcoming of the approach. Generating the initial template problem is a non-trivial task. The verification engineer needs a deep understanding of the design implementation in order to model the errors correctly. Floor functions may be used initially to establish whether the algorithm is correct to a looser error bound, but to obtain a full proof, several refinements of the model may be required. Such refinements should take into account correlations in the error terms for the different sub-intervals. This aspect of the technique becomes time-consuming when trying to verify more involved algorithms. For example, a sequence of
if and
else if statements, which can be found in some hardware algorithms, and much more commonly in software, is difficult to model using this technique. By comparison, exhaustive testing is far simpler to implement: it requires only knowledge of the output and the correct answer, which we may reasonably assume we have given the existence of the MPFR library [
25]. This fact currently limits the circumstances under which our method is useful. For example, iterative algorithms for division and square root that use a variable number of iterations depending on the input are likely intractable.
5 Automatically Generated Designs: FloPoCo
An additional source of implementations we can verify is the FloPoCo tool [
22]. FloPoCo can generate arithmetic cores for a range of applications, with its primary focus being on FPGAs. Its operators are fully parameterizable in precision and can be given error specifications to meet. The tool outputs synthesizable VHDL, which is generally human readable. It is still actively developed and used by the academic community, winning the
Community Award of FPL 2017. FloPoCo has a wide range of methods and operators that it can generate, but we will focus on just one of these.
The FixFunctionByPiecewisePoly option generates a piecewise polynomial VHDL implementation for a given function \(f(x)\), where \(x\in [0,1]\) is represented with a user-specified number of bits. The user specifies the degree \(n\) of the polynomial that will be used in the approximation. By default, the tool generates an approximation to \(f(x)\) that is claimed to be correct to within 1 ULP.
The authors [
19] have described their method for generating approximations. They generate polynomial coefficients for each interval, where each interval spans the same input width, with the most significant bits of the input
\(x\) determining which polynomial is used. This is implemented as a LUT with one entry per polynomial/interval and width determined by the combined width of the coefficients. One of the benefits of using FloPoCo is that the user does not need to specify the number of intervals to split the domain into. The tool searches for a number of intervals for which a set of polynomials of degree \(n\) meets the error specification, reserving some of the error budget for rounding errors in implementing the polynomial evaluation. The authors do not claim that this is the smallest number of polynomials that would achieve the error budget.
The polynomial coefficients are generated using a modified Remez algorithm [
9], which essentially computes minimax polynomials over the space of polynomials whose coefficients have finite precision. This removes the need to round real coefficients, avoiding additional error. The Sollya tool [
13] handles the polynomial generation and provides an error bound.
Polynomial evaluation is then implemented using the Horner evaluation scheme:
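in its standard form (the indexing here is illustrative and may differ from FloPoCo's internal convention),
\[
p(x)\;=\;a_0 + x\bigl(a_1 + x\bigl(a_2 + \cdots + x\,a_n\bigr)\bigr).
\]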
Internal truncation is used in this implementation because
\(x\in [0,1]\) implies that there is no need to compute
\(a_n x^n\) to full precision: most of its bits will not be significant. The architecture of the implementation is described by Figure
5, which motivates the error modeling in our MetiTarski problems.
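To make the role of these truncations concrete, the following Python sketch evaluates a polynomial in Horner form while truncating each intermediate result to a fixed number of fractional bits; the bit widths and the placement of the truncations are illustrative assumptions rather than FloPoCo's exact datapath.

```python
import math

def truncate(v, frac_bits):
    """Keep frac_bits fractional bits of v, discarding (truncating) the rest."""
    scale = 1 << frac_bits
    return math.floor(v * scale) / scale

def horner_truncated(coeffs, x, frac_bits):
    """Evaluate a_0 + x*(a_1 + x*(...)) with truncation after each multiply-add.

    coeffs = [a_0, a_1, ..., a_n]. Each truncation contributes an error E_i of
    at most 2**-frac_bits, which is what the MetiTarski model has to capture.
    """
    acc = coeffs[-1]
    for a in reversed(coeffs[:-1]):
        acc = truncate(a + x * acc, frac_bits)
    return acc

# Illustration: a degree-2 Taylor approximation of sqrt(1 + x) near x = 0
# (these are not the coefficients stored in the FloPoCo LUT).
print(horner_truncated([1.0, 0.5, -0.125], 0.01, 30), math.sqrt(1.01))
```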
We will verify the authors’ 23-bit and 52-bit examples [
19], which use degree 2 and degree 4 polynomials, respectively, to approximate
\(\sqrt {1+x}\). This is equivalent to computing
\(\sqrt {1.sig}\) for a binary64 input, where the exponent scaling may be handled by a different algorithm. Table
2 shows some details of the implementations and the verification effort required.
To illustrate the method applied to this example, consider just the 52-bit case. The LUT contained 256 entries corresponding to 256 different polynomials. We extract the coefficient table from the VHDL and use a wrapper script to insert these values, as well as the most significant bits of the input, into the template problem, as described in the general methodology. The implementation uses four truncated multiply-add blocks. The truncation of the remaining input bits
\(z\) for each of these, as shown in Figure
5, forces us to split
\(z\) into four MetiTarski variables:
To center the interval, the \(z_1\) variable is treated as signed in the VHDL implementation. In addition to the truncation of their input bits, the FMAs also truncate their outputs, which we model as errors
\(E_i\) for each FMA. With an error specification that says we should be within 1 ULP of the exact square root, this yields an initial MetiTarski problem of the form
Maximizing or minimizing the total error, and splitting based on \(z_1\) positive or negative, splits this initial template into four problems. Unfortunately, MetiTarski was unable to solve these four-variable problems; four variables is close to the limit of how many MetiTarski can handle.
We solved this problem by only using three variables, essentially introducing \(z_3^{\prime } = z_3+z_4\). Then wherever we want to use just \(z_3\), we can minimize/maximize the value of \(z_4\) and use the bounds \(z_3^{\prime } - (2^{-46} - 2^{-52}) \le z_3 \le z_3^{\prime }\). Splitting the problems based on \(z_1\) positive and negative and minimizing/maximizing the error in the polynomial evaluation yields four problems per interval/polynomial. Since our output has 52 bits after the decimal place, 1 ULP corresponds to a total error bound of \(2^{-52}\). The 23-bit example uses a similar modeling approach but only needs two problems per interval/polynomial.
The results are summarized in Table
2, where we see that each 52-bit problem takes 10
\(\times\) longer to solve than each 23-bit problem on average. This is due to the extra variable and the tighter error bound as well as the increased complexity due to the higher polynomial degree. A 3-hour verification runtime is certainly reasonable for a binary64 implementation, which clearly cannot be exhaustively tested.
6 A Comparison with Gappa
Having given several examples of how our approach can be used, we shall now look at a comparison with an existing tool. For verification of small floating-point functions, the Gappa tool [
7,
21] has been tested and is still being developed. Gappa uses interval arithmetic in order to prove conjectures, while MetiTarski relies on cylindrical algebraic decomposition to generate proofs. Both methods are valid; interval arithmetic is weaker but considerably more efficient. The approach using the Gappa tool is the most comparable method to that described in this article; therefore, we reproduced the verification of the polynomial approximation to the sine function implemented as part of the CRlibm project [
20,
41]. The function verified here is a binary64 software implementation using
round to nearest, which not only demonstrates the versatility of MetiTarski but also shows that the axioms bounding the sine function are sufficiently accurate to verify high-precision implementations.
The implementation verified here does not use lookup tables and so is a simpler case than we have been considering in the previous sections. In essence, we can think of this as just using a single table entry to yield the polynomial coefficients for all inputs, rather than sub-dividing the input domain. As a result, the verification method deployed here uses a simpler approach than that described by Figure
2, eliminating the need for template problems, wrapper scripts, or hundreds of MetiTarski calls.
The following C code quoted from [
20] is compiled respecting the C99 and IEEE-754 standards, with all variables being binary64.
In this code, the input is represented as a sum of two binary64 arguments, y = yh + yl, where yh represents the most significant part and yl the least significant, and the result, s, is similarly represented as a sum sh + sl. The Fast2Sum algorithm provides an exact representation of the sum of two floating-point numbers as a pair of floating-point numbers; namely, no rounding errors are incurred. This approximation to sine is only valid for a limited input region, namely values of magnitude less than
\(6.3\times 10^{-3}\). The code is an approximation to the Taylor expansion of sine, where the terms involving the least significant bits are discarded since they are sufficiently small.
The Gappa approach makes use of the “IEEEdouble” shortcut for the IEEE-compliant binary64
round to nearest mode, in order to encode the rounding errors in this implementation. Since the IEEE-754 standard requires correctly rounded arithmetic in binary64
round to nearest mode, the maximum relative error incurred in each arithmetic operation is
\(2^{-53}\), since binary64 uses a 52-bit significand. This keeps the error modeling for our MetiTarski problems far simpler than in some hardware algorithms, which can operate at the bit level. Essentially, for each of the operations in the C code, a maximum relative error of
\(2^{-53}\) can be incurred. This property is commonly described as the
\((1+\epsilon)\) lemma. Harrison describes the bounds on
\(\epsilon\) in [
33].
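As a sketch, for any basic operation \(\circ\in\lbrace +,-,\times ,\div\rbrace\) on binary64 operands in round-to-nearest mode (and in the absence of underflow or overflow),
\[
\operatorname{fl}(a\circ b)\;=\;(a\circ b)(1+\epsilon),\qquad \vert\epsilon\vert\le 2^{-53}.
\]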
The full Gappa script is given in [
20], and for the purpose of comparison we use the same outline of the problem to prove, just using more explicit error modeling. They also introduce the bound on the input variable
\(|Y|\ge 2^{-200}\), which is necessary to make sure that the relative errors due to underflow remain bounded. In the Gappa script, four variables are used, since the sine is represented by a variable
\(S\in [0,1]\). MetiTarski supports axiomatic bounds on functions such as sine, so this variable is dropped in order to reduce the problem to three variables. As Gappa has no notion of sine, the approximation error between the exact polynomial approximation and sine first had to be calculated. This approximation error takes no account of any floating-point rounding errors. For this purpose the Sollya tool [
13] was used to prove that
PolySinY represents the exact polynomial specified in the C code above with no floating-point rounding errors. In MetiTarski, no additional tools were required, as the in-built axiomatic bounds on the sine function were sufficient. The variable
\(Y\) is the sum of our two binary64 variables
\(H\) (msb) and
\(L\) (lsb), with a bound on the relative error in the argument reduction stage, also present in the Gappa paper:
The goal is to prove that the maximum relative error in our approximation of
\(\sin {(Y)}\) is less than
\(2^{-67}\). Since we explicitly calculate the error modeling and have used placeholder replacement as a strategy throughout the previous sections, the template problems include error placeholders
\(e_i\) for
\(i=0,\ldots ,7\), where each
\(e_i \in [1-2^{-53},1+2^{-53}]\), representing the relative error in each arithmetic operation. With these errors inserted, the polynomial approximation to sine after expanding out all the brackets reduces to
Ideally we would introduce a new MetiTarski variable for each
\(e_i\); however, MetiTarski is unlikely to terminate when using more than four or five variables, for the reasons described in [
1], so we need to make choices to bound these errors. By considering only
\(H\gt 0\) and choosing appropriate values for
\(e_i=1\pm 2^{-53}\), we obtain
We consider only
\(H\gt 0\), which implies that
\(\sin {(Y)}\gt 0\). We combine this with the constraints described above.
Using our bounding functions (Equation (
8)), we can reduce the problem to proving the four inequalities
Due to the error bounding approach and the anti-symmetry of sine and the polynomial in
\(H\), proving these four inequalities is sufficient to verify the original problem (Equation (
9)).
MetiTarski required some minor assistance with a basic rewrite for one of the inequalities, just replacing the variable
\(L\) with its relevant bound. The full proof scripts are given in Appendix
9, where the final inequality takes a slightly different form due to the required rewrite. The four problems were provable using MetiTarski, with a combined runtime of 460 seconds on an Intel I7-3517U CPU.
We will briefly digress, as these experiments also highlighted some interesting MetiTarski behavior. MetiTarski invokes an RCF solver during its operation, and it can be configured to use any one of Z3 [
23], QEPCAD B [
10], or Mathematica (invoked in batch mode) [
36]. Given that Mathematica and Z3 are widely used in academia and industry, we choose to trust their results. Only Mathematica terminates in a reasonable amount of time in this problem domain. In fact, using Z3 or QEPCAD B, MetiTarski hangs when trying to prove the significantly easier problem:
This problem introduces no error terms into our function \(g\). This observation suggests that there is room for further development of RCF decision procedures in this problem domain.
To conclude this section, we should compare the Gappa method to ours. Gappa’s understanding of floating-point rounding via the shortcut described above makes it simpler to generate problem scripts when analyzing pure floating-point algorithms. By pure we mean that the algorithm only uses standard floating-point arithmetic with no bit-level manipulations. Such algorithms are common in software (the CRlibm sine function is one example), but in hardware, designers rarely stick to pure floating-point arithmetic, which would likely pose a different challenge for generating Gappa scripts. The Gappa approach had to involve other tools in its solution, while MetiTarski could be used in a standalone manner. Gappa takes less than a second to generate a proof, so it clearly demonstrates stronger raw performance. However, Gappa required six additional “hints,” with the authors describing the hint-writing process as the most time-consuming part. By comparison, MetiTarski required just one rewrite of a problem in order to prove it, and no rewrites (or “hints”) were required in our other examples, so the raw runtime comparison is not an even one. In both approaches it is evident that writing the respective scripts is the time-consuming component rather than generating the proofs.
Given the availability of tools with in-built floating-point rounding, such as Gappa, most software implementations will be more amenable to these tools rather than MetiTarski. However, this comparison has shown that the process of obtaining a proof involving an implementation of sine, using MetiTarski, is similar to the approach described for the logarithm earlier.
7 Related Work
An important problem for verification will be the interplay of floating-point and bit-level operations, which is common in real code. As observed by de Dinechin et al., expert developers will arrange code in order to take advantage of situations where floating-point arithmetic is exact, such as multiplication by a power of two, to name just one example [
21]. Another trick, computing
\(2^n\) in double precision, can be done by shifting
\(n+1023\) left by 52 bits [
45]. Such techniques make verification more challenging, particularly in the context of high-precision transcendental functions [
38]. Code obfuscation can easily occur in the optimization stages, when the underlying mathematical approximations become obscured as expressions are manipulated and re-ordered in order to maximize floating-point accuracy.
We have already described the Gappa approach, but many other tools have been applied to floating-point algorithm verification. John Harrison conducted some of the earliest and most rigorous work via interactive theorem proving, using the HOL Light proof assistant [
31]. Above, we highlighted how Harrison had formalized the theory of IEEE floating-point arithmetic [
32]. He defined the set of representable floating-point numbers as a parameterized triple
\((E,p,N)\), which is the set of real numbers representable in the form
\((-1)^s 2^{e-N} k\) with
\(e\lt E\),
\(k\lt 2^p\), and
\(s\lt 2\). Formalizations of the four standard rounding modes (nearest, down, up, zero), the widely used
\((1+\epsilon)\) lemma, and other lemmas about exactness are constructed in HOL Light. As a result of this underlying theory, most floating-point problems are side-stepped and he is able to reason about real numbers and straightforward algebraic calculations [
32]. Such an approach is labor intensive and much of the labor is specific to the algorithm at hand. Note, however, that Harrison’s proofs are more detailed, including results about rounding modes.
The Coq system has also been the subject of several attempts to develop libraries [
8,
18,
44] and was combined with Gappa [
7,
17] in order to verify a wider range of floating-point algorithms. As well as these, Boldo combined Coq with Caduceus for the verification of floating-point C programs [
6].
CoqInterval [
43] targets a very similar goal to the one presented in this article. It is primarily concerned with verifying floating-point mathematical libraries, and as a result many of the problems look similar to ours, although there is less focus on piece-wise polynomial approximations. As a result, CoqInterval is likely capable of proving many of the problems presented here. The key distinction is that CoqInterval is based on interval arithmetic, which is generally weaker than cylindrical algebraic decomposition (MetiTarski’s underlying technology). The CoqInterval authors actually compared CoqInterval to MetiTarski, noting that MetiTarski was generally faster and could tackle more complex problems [
43]. However, CoqInterval uses automated Taylor expansion, while MetiTarski relies on its built-in axioms and so is more restricted. Also, CoqInterval proofs are formally verified by Coq, which may account for some of the performance difference.
FPTaylor is another recent tool that bounds floating-point round-off errors, again using Taylor expansion [
54]. FPTaylor’s benchmarks show that it is capable of tackling problems with up to six variables, with support for elementary functions. In addition to these studies, several theorem provers have been applied to floating-point verification problems, but in the cases referenced here additional libraries were developed to handle floating-point numbers. The ACL2 prover was applied to floating-point division, multiplication, and square root algorithms [
46,
52]. Lastly, the PVS prover was used to verify the floating-point unit of a complete microprocessor [
37]. In general, the main difference between this article and those referenced above is that we target higher-level algorithms for verification and do not attempt to encode floating-point operations in MetiTarski explicitly.