
As-Is Approximate Computing

Published: 17 November 2022

Abstract

Although approximate computing promises better performance for applications that can tolerate marginal errors, the dearth of hardware support and the lack of run-time accuracy guarantees make it difficult to adopt. We present As-Is, an Anytime Speculative Interruptible System that takes an approximate program and executes it with time-proportional approximations. That is, an approximate version of the program output is generated early and is gradually refined over time, thus providing the run-time guarantee of eventually reaching 100% accuracy. The novelty of our As-Is architecture lies in its ability to conceptually marry approximate computing and speculative computing. We show how existing innovations in speculative architectures can be repurposed for anytime, best-effort approximation, reducing the design effort and overhead needed for approximate hardware support. As-Is provides a platform for real-time constraints and interactive users to interrupt programs early and accept their current approximate results as is; 100% accuracy is always guaranteed if more time can be spared. Our evaluations demonstrate favorable performance-accuracy tradeoffs for a range of approximate applications.

1 Introduction

Approximate computing is an emerging paradigm shown to be effective for real-time applications and interactive services [15, 26, 39]. Such applications often employ algorithms in machine learning and computer vision and process noisy sensor data and multimedia, all of which can tolerate inaccuracies in output [15, 16, 24, 25, 47]. For such applications, approximate computing can trade acceptable quality loss for improved performance.
Many datacenter workloads are interactive and approximate in nature [26]. However, different users and new inputs make each running instance of the same program unique. Under such scenarios, meeting the quality-of-service (QoS, i.e., latency) expectations while ensuring quality-of-results (QoR, i.e., accuracy) becomes challenging. Our proposed approach to approximate computing presents an opportunity to improve the efficiency of datacenter workloads by enabling an execution model where QoR scales in direct proportion to QoS, as shown in Figure 1.
Fig. 1. QoS vs. QoR tradeoff in a datacenter setting. (a) Highest QoS and QoR under light server load. (b) Low QoS but high QoR under heavy server load. (c) High QoS and high QoR with anytime approximate computing even in a heavily loaded server.
Challenges of Approximate Computing. Despite the potential of this paradigm, it has not been widely adopted in current systems due to the following conventional wisdom.
No run-time guarantees on output accuracy. State-of-the-art approximation techniques lack the ability to dynamically enforce guarantees on accuracy while executing. The convention is to (1) employ offline or online accuracy models [4, 14] or (2) present experimental results that demonstrate empirically low error [4, 25, 47]. Unfortunately, these approaches are still unable to guarantee acceptability of all application outputs on the fly. Furthermore, it is unclear if this problem can ever be solved, since the notion of acceptability is subjective and impossible to safely quantify.
Too big a departure from commodity systems. To achieve efficiency gains, approximation techniques often impose significant changes to the underlying hardware, requiring fundamentally new designs for compute [13, 14, 48] and storage [20, 37]. The challenge is that different approximations have varying impacts on different applications, in terms of both performance and accuracy. Intuitively, there is no one-size-fits-all solution. It is hard to justify investing significant chip real estate and verification efforts in an approximation mechanism that only helps some applications and potentially compromises the correctness of others.
As-Is Computing. We propose As-Is, an Anytime Speculative Interruptible System that maps approximate computing to speculative computing, overcoming the key challenges above. Conceptually, we marry the theoretical properties of anytime automata [39] with the practical benefits of speculative parallel architectures [21]. The former allows us to take any approximate application and execute it in a way that produces time-proportional approximations. That is, an approximate version of the program output is generated early and is gradually refined over time, thus providing the run-time guarantee of eventually reaching 100% accuracy. The latter allows us to rework existing innovations in speculative architectures for the purpose of approximation. This lets us implement approximations that work generally on a wide range of applications without having to commit to risky hardware changes.
Contributions. In this article, we offer the following:
We propose As-Is, a novel architecture that takes an approximate program and executes it with time-proportional approximations. This allows a latency constraint or interactive user to interrupt the program early and accept the current approximate result as is. 100% accuracy is always guaranteed if more time can be spared.
We present the unique challenges of building such a system that marries approximation and speculation. We show how together they can turn a typical speculative multiprocessor into a best-effort, approximate multiprocessor, with low additional hardware complexity.
We evaluate As-Is and demonstrate favorable performance-accuracy tradeoffs on a range of approximate programs from machine learning, computer vision, graph analytics, and image processing.

2 Background

We discuss the background concepts that are necessary to realize the marriage between approximate computing and speculative computing: anytime automata and ordered irregular parallelism.

2.1 Approximate Computing

Owing to the performance and energy gains from approximation, various approximation frameworks have been explored in the literature. State-of-the-art general-purpose approximation frameworks are often hard to adopt since they lack execution-time-proportional gains in accuracy and the ability to interrupt execution once accuracy expectations have been met [13, 14, 48]. Moreover, these frameworks often rely on accuracy recovery mechanisms in the event that the output is not adequately accurate [22]. Even though a few specialized approaches enable time-proportional approximation and support interruptibility, they are limited to the specific algorithms being accelerated with approximation and cannot be extended to benefit other workloads [10, 16, 27, 35, 52]. Table 1 compares the proposed As-Is system against relevant prior work.
KickStarter leverages iterative computation for graph algorithms and calculates intermediate values for a subset of vertices; however, unlike As-Is, it does not explore other benchmarks or guarantee interruptibility of the application [52]. Brainiac also supports time-proportional approximation using neural accelerators, but its implementation does not claim that approximate outputs can be retrieved at any time; As-Is, on the other hand, employs a best-effort approach that allows retrieval of both precise and intermediate outputs [16]. Samoyed supports intermittent systems by adopting smaller operations that can complete when a larger operation requires more energy than the device can offer; however, unlike As-Is, it does not employ any approximation to achieve this [35]. Truffle achieves high energy savings by supporting a low-voltage knob for approximate calculations, but this does not guarantee outputs with varying degrees of accuracy based on how long the system is allowed to run; As-Is handles the latter scenario along with energy savings [13]. Aloe effectively demonstrates program reliability, but if the program is only partially executed (i.e., stopped early), its checkers consider any intermediate outputs unacceptable [22].
Related Work                   General Purpose   Time-Proportional Approximation   Interruptible
KickStarter [27, 52]           X                 \(\checkmark\)                    X
Brainiac [10, 16]              X                 \(\checkmark\)                    X
Samoyed [35]                   \(\checkmark\)    X                                 \(\checkmark\)
Truffle [13], EnerJ [14, 48]   \(\checkmark\)    X                                 X
Aloe [22]                      \(\checkmark\)    \(\checkmark\)                    X
As-Is                          \(\checkmark\)    \(\checkmark\)                    \(\checkmark\)
Table 1. Comparison with Contemporary Approaches

2.2 The Anytime Automaton

The anytime automaton [39] leverages anytime algorithms [11, 19, 29], which have been widely used in decision planning and artificial intelligence, to tackle key challenges in approximate computing. In an anytime automaton, increasingly accurate approximate versions of the final output are computed and used to eventually produce the final precise output. A key feature that the anytime automaton provides is program interruptibility. At any point, if the program is interrupted, the automaton simply produces a best-effort, as-is output based on previously executed computations. This feature is useful since the application can be stopped as soon as the output quality meets user expectations, saving both runtime and energy. In cases where the output quality falls below expectations, the user simply lets the application run longer.
The anytime automaton computation model provides a mechanism to extract significant parallelism from seemingly sequential applications (details of applications in Section 4.2). It accomplishes this by breaking down an application into a pipeline of fine-grained anytime computation tasks that can execute in parallel.

2.3 Ordered Irregular Parallelism

An ordered irregular parallel application consists of tasks with the following properties [21]. First, tasks follow a total or partial order. Second, tasks are dynamically created and creation order is different from execution order. Third, tasks may have data dependencies that are not known a priori. Ordered irregular parallel algorithms are common in data mining [50], graph analytics [18, 43], and event-driven systems [41].
An anytime automaton can also be considered an ordered irregular parallel application. This is because, in an anytime automaton, each computation stage \(A\) is broken down into \(N\) tasks (i.e., \(A_1\), \(A_2\), ..., \(A_N\)), where \(N\) is input-dependent and may not be known a priori. The parent-child relationships between tasks are known, but the data dependencies between tasks are unknown in advance. Each task is also assigned a timestamp and may create and enqueue child tasks with any timestamp equal to or greater than its own, leading to a total order among tasks.

2.4 Anytime Execution Model

Figure 2(a) shows an example program broken down into two computation stages, \(S\) and \(f\); we use the same example throughout the article. Stage \(S\) samples all the pixels of an input image and builds a histogram of pixel intensities. Stage \(f\) takes a histogram as input and computes the arithmetic average of all the pixels. Stage \(S\) is made anytime with bit-reverse sampling, and the anytime stage \(f\) leverages reduced-precision approximation. Stage \(S\) is broken down into three tasks and stage \(f\) into two tasks (Figure 2(b)). An execution schedule of tasks is shown in Figure 2(c).
Fig. 2. Overview of anytime execution model. (a) Approximate program with anytime stages \(S\) and \(f\). (b) Dataflow with three tasks for stage \(S\) and two tasks for stage \(f\). (c) Dataflow execution of \(\mu\)Tasks (described in Section 3).
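As a concrete reference, below is a minimal C sketch of the precise versions of these two stages; the function names and signatures are ours, not from the paper's Listing 1. The anytime variants would visit pixels in bit-reverse order in stage \(S\) and use reduced-precision arithmetic in stage \(f\).

```c
#include <stdint.h>
#include <stddef.h>

#define BINS 256

/* Stage S: scan the image and build a histogram of pixel intensities.
 * The caller is expected to zero-initialize hist. */
void stage_S(const uint8_t *pixels, size_t n, uint32_t hist[BINS]) {
    for (size_t i = 0; i < n; i++)   /* anytime version: bit-reverse order */
        hist[pixels[i]]++;
}

/* Stage f: compute the arithmetic average of all pixels from the histogram. */
double stage_f(const uint32_t hist[BINS]) {
    uint64_t sum = 0, count = 0;
    for (int v = 0; v < BINS; v++) { /* anytime version: reduced precision */
        sum   += (uint64_t)v * hist[v];
        count += hist[v];
    }
    return count ? (double)sum / (double)count : 0.0;
}
```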

3 As-Is Computing

In this section, we present the As-Is system, which maps anytime approximate computing to speculative multiprocessing. As-Is is a system-level realization of the anytime automaton theoretical model [39] and provides a practical implementation of task-level speculation for ordered irregular parallel applications.
As-Is benefits from the theoretical properties guaranteed by anytime automaton [39]. It is also inspired by prior works in conventional speculative computing [21]. However, prior works are not well suited for anytime computing because of the following challenges, which As-Is aims to address:
(1) No programming model specification. As-Is provides a task-based programming model for expressing an application as an anytime automaton.
(2) No means to exploit the exposed parallelism. We recognize that we can maximize and extract the anytime parallelism by framing it as an ordered irregular speculative parallel application.
(3) No mapping to general-purpose hardware. As-Is builds a multiprocessor architecture and run-time system to efficiently execute anytime applications and handle task creation, scheduling, execution, and retirement.
(4) No support for best-effort execution. The As-Is system maps the theoretical iterative and diffusive stages to physical mandatory and best-effort tasks that are recognized and supported by the underlying architecture.
(5) No data dependence handling and conflict resolution. This work provides a formal understanding of all possible true and false data dependencies among tasks and outlines the conflict detection and resolution mechanisms needed to guarantee the eventual precise output.
(6) No support for explicit forms of approximation. As-Is augments the conventional speculative computing substrate with support for approximation techniques such as random sampling, bit-reverse sampling, and reduced-precision arithmetic.
We first present the As-Is programming model that translates an application into an anytime automaton. We then outline the As-Is runtime system and provide details on task management for the creation, scheduling, execution, and retirement of \(\mu\)Tasks. We then describe architectural support for approximation, conflict detection, and resolution. The section concludes with a walk-through example.

3.1 As-Is Programming Model

Listing 1. Example As-Is program snippet for anytime execution model in Figure 2.
As-Is uses a task-based programming model to declare programs in the form of anytime automata. Programs are composed of computation stages arranged in a feed-forward dataflow, as shown in Figure 2(a). Explicit buffers are used to pass information between parent and child stages (Figure 2(b)). The programmer specifies \(M\) computation stages, \(S_0 ... S_{M-1}\), in a high-level language such as C, along with per-stage information regarding the input buffer location, output buffer location, and any applicable approximation techniques (e.g., random input sampling and reduced precision). We discuss supported approximations later in Section 3.3.3. An example annotated program for the anytime model in Figure 2 is shown in Listing 1.
Computation Stages. There are three types of computation stages that can be declared.
Precise: A precise stage applies no approximations and is executed exactly as defined.
Iterative: An iterative anytime stage performs its computation in multiple iterations, each iteration with higher accuracy than the previous. Iterations are independent of one another; each iteration overwrites the previous iteration’s (now stale) result in the output buffer upon completion.
Diffusive: A diffusive anytime stage performs its computation in multiple steps, each step building on the accumulated result to bring accuracy closer to 100%. Steps may be dependent on one another and successful execution of each step is necessary for 100% accuracy.
In the nomenclature of numerical methods, the diffusive stages of As-Is correspond to iterative methods, where repeated state updates improve the solution at some convergence rate. However, we use the above terminology to remain consistent with the prior work this article is based on [39].
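The distinction can be summarized with two loop skeletons. This is an illustrative sketch only; the helper functions are hypothetical stand-ins for real stage computations.

```c
typedef double output_t;

output_t compute_from_scratch(int fidelity);  /* e.g., recompute over more samples */
void     refine(output_t *acc, int step);     /* e.g., fold more samples into acc  */

/* Iterative: task j is independent of task j-1; each task overwrites the
 * previous (now stale) result, and only the last task's output is needed
 * for 100% accuracy. */
void iterative_stage(output_t *out, int N) {
    for (int j = 1; j <= N; j++)
        *out = compute_from_scratch(j);
}

/* Diffusive: task j builds on the accumulated state; all N steps must
 * complete to reach the 100%-accurate output. */
void diffusive_stage(output_t *out, int N) {
    for (int j = 1; j <= N; j++)
        refine(out, j);
}
```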
Computation Tasks. Each anytime computation stage \(S_i\) is decomposed into \(N\) tasks: \(T_{Si,1} ...\) \(T_{Si,N}\). These tasks are either iterations of an iterative stage or steps of a diffusive stage. The output of an iterative stage is 100% accurate upon completion of task \(T_{Si,N}\), while the output of a diffusive stage is 100% accurate upon completion of all \(N\) tasks. In the Figure 2(b) example, anytime stages \(S\) and \(f\) are executed as three and two tasks, respectively.
The number of tasks per stage can either be programmer-specified or dynamically selected at run-time. In our implementation, we profiled our applications and manually specified the number of tasks. Automating this process is left for future work.
Pipeline. The feed-forward composition of tasks can be executed as a pipeline as shown in Figure 3. There are two types of pipeline interactions listed below. Our As-Is run-time system can infer the appropriate pipeline depending on the types of parent and child stages.
Fig. 3. Varying configurations of anytime stages and pipelines for the example program in Listing 1. The top shows the computations performed by the two stages over time (vertical axis), while the bottom shows how the \(\mu\)Tasks are scheduled for pipelined execution (time is horizontal axis). Best-effort \(\mu\)Tasks are shown with a light border.
Asynchronous: An asynchronous pipeline can be formed between any type of parent and child stages given that the stages can execute concurrently and independently. Each parent task spawns a new instance of the child stage, which implies multiple (redundant) instances of each child task. Parent and child tasks can execute concurrently, independent of one another.
Synchronous: A synchronous pipeline can be formed only if (1) the parent stage is diffusive, and (2) the child's computation is distributive over the parent's computation. Parent and child tasks can execute concurrently but must synchronize to produce the 100%-accurate output and to reduce the redundant work performed in the asynchronous pipeline.
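For the running example, the distributivity condition can be checked directly: stage \(f\) accumulates sums over the histogram, and sums distribute over the partial histograms. As a sketch (our notation, not the paper's), let \(h_1\) and \(h_2\) be partial histograms contributed by two parent tasks:
\[ \sum_{v=0}^{255} v \left(h_1[v] + h_2[v]\right) \;=\; \sum_{v=0}^{255} v\, h_1[v] \;+\; \sum_{v=0}^{255} v\, h_2[v]. \]
The pixel count in the denominator of the average accumulates the same way, so a synchronous child task can fold in each parent step as it arrives instead of recomputing from scratch.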
Example. Figure 3 illustrates these concepts in practice for the example program in Listing 1. Figure 3(a) shows the baseline sequential execution of the program. The first stage, sampling, is composed of three tasks, while the second stage, f, is composed of two tasks. With iterative stages in an asynchronous pipeline (Figure 3(b)), each later task potentially recomputes results that were previously computed and overwrites the current output with a more accurate version. f(11xx) denotes the output of the first task of stage f after processing the output of its parent (the first task of stage S), whereas f(1122) denotes the second and final output generated by stage f over that same parent output.
With diffusive stages in an asynchronous pipeline, the child stage f processes different versions of the outputs produced by the parent stage S, rather than mere updates of them. With diffusive stages in a synchronous pipeline (Figure 3(d)), however, each later task builds directly on top of previously computed results, making the cumulative output more accurate.

3.2 As-Is Runtime System and Task Management

3.2.1 Definitions.

We first define key terms, which will be described in detail in the following sections.
\(\mu\)Task: A unique instance of a task. For example, in Figure 3, the second stage f is composed of two tasks for each of its parent’s three tasks, yielding 6 \(\mu\)Tasks.
Mandatory \(\mu\)Task: A \(\mu\)Task whose result is necessary to obtain 100% accuracy in the final program output.
Best-Effort \(\mu\)Task: A \(\mu\)Task whose result is not needed to obtain 100% accuracy in the final program output.
Complete: A \(\mu\)Task completes when it finishes producing its output and schedules its child \(\mu\)Tasks for execution.
Retire: A \(\mu\)Task retires when it has completed and is the oldest non-retired \(\mu\)Task in the system.
Abort: A \(\mu\)Task aborts (and re-executes if mandatory) when its data accesses conflict with another \(\mu\)Task that takes priority.
As described in previous sections, computation stages in the program are approximated in an anytime manner. This leads to multiple \(\mu\)Tasks, each contributing towards higher accuracy in the program output. The subsequent subsections detail the lifetime of each \(\mu\)Task.

3.2.2 μTask Creation.

Each \(\mu\)Task is uniquely defined by an \(M\)-digit \(\mu\)ID, where \(M\) is the number of computation stages. This is shown in Figure 2(c). A \(\mu\)Task’s \(\mu\)ID is set such that it satisfies all of the following:
\(\mu\)ID is greater than its parent’s \(\mu\)ID.
\(\mu\)ID is greater than its predecessor’s \(\mu\)ID (e.g., \(S_2\)’s predecessor is \(S_1\)).
\(\mu\)ID is less than the \(\mu\)IDs of all its ancestors’ successors (e.g., the successors to \(\mu\)Task 12’s ancestors are 21 and 22).
\(\mu\)IDs are assigned at the time of \(\mu\)Task creation. They specify the total order in which \(\mu\)Tasks appear to execute such that the program output increases in accuracy over time and is guaranteed to reach 100%. Similar to timestamps in ordered irregular speculation [21], \(\mu\)IDs are used to resolve all data conflicts among best-effort and mandatory tasks. The details of data dependencies and conflict resolution are described later in Sections 3.2.6 and 3.3.4, respectively.
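As a sketch of one encoding that satisfies these constraints, consider assigning one decimal digit per stage, consistent with the two-stage example of Figure 2(c); the paper does not prescribe this exact scheme, and the decimal encoding assumes fewer than ten tasks per stage.

```c
#include <stdint.h>

/* Child task `task_index` of stage `stage` (0-based) inherits its parent's
 * μID and fills in the digit at its stage's place value; M is the total
 * number of stages. */
uint64_t make_uid(uint64_t parent_uid, int stage, int task_index, int M) {
    uint64_t weight = 1;
    for (int i = 0; i < M - 1 - stage; i++)
        weight *= 10;                 /* place value of this stage's digit */
    return parent_uid + (uint64_t)task_index * weight;
}

/* Example with M = 2: the stage-S tasks get make_uid(0, 0, k, 2) = 10, 20, 30.
 * The children of S_1 get make_uid(10, 1, 1, 2) = 11 and make_uid(10, 1, 2, 2)
 * = 12, which satisfy all three rules: 11 and 12 exceed parent 10; 12 exceeds
 * predecessor 11; and both are below 21, the successor of their ancestor. */
```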
To schedule \(\mu\)Tasks on available cores, the As-Is run-time system employs a priority queue that schedules tasks based on their \(\mu\)ID.

3.2.3 μTask Execution.

\(\mu\)Tasks with lower \(\mu\)IDs are scheduled for execution whenever there is an available core. All \(\mu\)Tasks may run concurrently and any data dependencies among them are resolved dynamically.

3.2.4 μTask Completion and Retirement.

Due to the concurrent execution of \(\mu\)Tasks, a \(\mu\)Task with a higher \(\mu\)ID may complete before one with a lower \(\mu\)ID. Such a \(\mu\)Task resides in the task queue (Figure 5, discussed in Section 3.3) until it has the lowest \(\mu\)ID in the system, upon which it retires.

3.2.5 Best-effort and Mandatory μTasks.

The As-Is architecture leverages speculative execution to perform its approximations and extract pipeline parallelism inherent to the anytime automaton execution model. In conventional architectures, speculative tasks are squashed, rolled back, and re-executed when a mis-speculation is detected. However, in As-Is, not all \(\mu\)Tasks are mandatory. That is, some \(\mu\)Tasks are executed only for the purpose of producing an approximate intermediate result before the final output is generated. Such tasks may be best-effort, implying that they do not contribute to the 100%-accurate output and thus do not need to be re-executed if aborted.
Formal Definitions. Formally, given a task \(T_{Si,j}\), which is the jth task in stage \(S_i\) (consisting of \(N\) tasks in total), this task is defined to be best-effort if \(S_i\) is iterative and \(j\lt N\). Consequently, a \(\mu\)Task instance of \(T_{Si,j}\) is best-effort if either (1) task \(T_{Si,j}\) is best-effort, or (2) its parent \(\mu\)Task is best-effort. All other \(\mu\)Tasks are mandatory. Figure 4 shows which \(\mu\)Tasks are best-effort depending on different stage-pipeline organizations.
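This definition translates almost directly into code; the following is a minimal sketch with illustrative metadata fields (the struct layout is ours, not the hardware's).

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { PRECISE, ITERATIVE, DIFFUSIVE } stage_kind;

typedef struct utask {
    stage_kind kind;            /* kind of the stage this task belongs to */
    int j, N;                   /* task index j within a stage of N tasks */
    const struct utask *parent; /* parent μTask, NULL at the root */
} utask;

/* T_{Si,j} is best-effort iff its stage is iterative and j < N. */
static bool task_is_best_effort(const utask *t) {
    return t->kind == ITERATIVE && t->j < t->N;
}

/* A μTask is best-effort iff its own task is best-effort or any ancestor
 * μTask is; all other μTasks are mandatory. */
bool utask_is_best_effort(const utask *t) {
    for (; t != NULL; t = t->parent)
        if (task_is_best_effort(t))
            return true;
    return false;
}
```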
Fig. 4. Illustration of which \(\mu\)Tasks are best-effort or mandatory (critical) and which data hazards are possible (and how they are resolved), depending on the stage-pipeline organization: (a) iterative stages in asynchronous pipeline; (b) diffusive stages in asynchronous pipeline; (c) diffusive stages in synchronous pipeline.
Practical Implications. As-Is executes all tasks speculating that there are no data dependencies between them. If a dependence violation is detected for a task, the task is said to have been mis-speculated. Best-effort \(\mu\)Tasks mitigate the penalties of mis-speculation for intermediate approximate outputs. Upon a data conflict, a best-effort \(\mu\)Task is aborted without any re-execution. Other \(\mu\)Tasks that are descendants of the aborting \(\mu\)Task are also terminated; descendants are found by walking the task queue and identifying any \(\mu\)Tasks that share the same most significant digits of their \(\mu\)IDs with the aborting \(\mu\)Task. Aborted best-effort \(\mu\)Tasks may reduce the number of intermediate outputs produced, but the computations for the final precise output remain unaltered. Successful execution of best-effort \(\mu\)Tasks is desirable but not critical to the program execution. In contrast, a mandatory \(\mu\)Task is considered critical for the application, since each mandatory task contributes to the final precise output, as demonstrated by the example in Figure 3. Thus, mandatory \(\mu\)Tasks are always re-executed upon a conflict. Since mandatory \(\mu\)Tasks are critical to the execution, they are strictly ordered in completion, whereas best-effort \(\mu\)Tasks can complete in any order.
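With the hypothetical decimal \(\mu\)ID encoding sketched in Section 3.2.2, the descendant walk reduces to a prefix comparison; again, a sketch rather than the exact hardware logic.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if `uid` belongs to a descendant of the aborting μTask.
 * `digits` is the number of leading (stage) digits the aborting μTask
 * has filled in, out of M stages total. */
bool is_descendant(uint64_t uid, uint64_t aborting_uid, int digits, int M) {
    uint64_t div = 1;
    for (int i = 0; i < M - digits; i++)
        div *= 10;                 /* strip the trailing stage digits */
    return uid != aborting_uid && uid / div == aborting_uid / div;
}

/* Example (M = 2): if μTask 10 aborts, μTasks 11 and 12 share its leading
 * digit and are terminated, while μTask 21 is unrelated and survives. */
```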

3.2.6 μTask Data Dependencies.

We describe in detail how data dependencies arise for the different possible stage-pipeline configurations, shown in Figure 4.
Iterative Stages, Asynchronous Pipeline. In this case, each \(\mu\)Task reads whatever data is currently available in its input buffer. As a result, there are no dependencies between concurrently executing \(\mu\)Tasks of the same stage. In the example shown in Figure 4(a), only \(\mu\)Tasks \(S_3\) \(\rightarrow\) \(f_2\) are needed for the precise (i.e., 100%-accurate) program output. All other \(\mu\)Tasks do not fall on this critical-accuracy path and are thus best-effort. To guarantee that the final precise output is eventually reached in this case, mandatory \(\mu\)Tasks need to win all of their data conflicts. Since the output buffer is shared by all the \(\mu\)Tasks in a computation stage, the highest \(\mu\)ID at every computation stage wins all write-after-write (\(WAW\)) conflicts. The writer \(\mu\)Task among producer-consumer stages wins a write-after-read (\(WAR\)) conflict, as the reader task is expected to be operating on older, stale data at the time of the update. Since all stages are iterative, \(\mu\)Tasks within the same stage do not depend on each other's results; thus, read-after-write (\(RAW\)) dependencies do not exist.
Diffusive Stages, Asynchronous Pipeline. With diffusive stages, \(\mu\)Tasks use the previously computed output and refine them for more accurate outputs. Thus with the diffusive model, the final \(\mu\)Task of every computation stage needs to complete for the final precise output. In this case, the critical-accuracy path consists of \(\mu\)Tasks \(S_1\) \(\rightarrow\) \(S_2\) \(\rightarrow\) \(S_3\) \(\rightarrow\) \(f_1\) \(\rightarrow\) \(f_2\) as shown in Figure 4(b), each of which is treated as a mandatory task. False dependencies (\(WAW\) and \(WAR\)) are handled as previously discussed for an asynchronous pipeline. True (\(RAW\)) dependencies may now exist. If \(S_2\) depends on \(S_1\)’s output, and a conflict arises during their concurrent execution, then \(S_2\) must be aborted and re-executed.
Diffusive Stages, Synchronous Pipeline. In a synchronous pipeline, all \(\mu\)Tasks compute a portion of the final precise output and are essential for correct execution. Thus all \(\mu\)Tasks are mandatory as depicted in Figure 4(c). On every conflict, \(\mu\)Tasks with a lower \(\mu\)ID take priority, and higher \(\mu\)IDs are re-executed. Even though the synchronous pipeline resembles a conventional task-speculative architecture, As-Is is able to produce time-proportional approximate outputs and can be stopped as soon as the intermediate program output is deemed acceptable.

3.3 As-Is Architecture

This section describes the As-Is architectural components and how they operate in detail.

3.3.1 Overview.

Figure 5 shows the organization of the As-Is architecture. As-Is is built as a cache-coherent shared-memory multiprocessor. Any data dependencies between tasks are detected and resolved dynamically, inspired by task-level speculation [21]. Though we implement a homogeneous multiprocessor, As-Is can support heterogeneous cores, which allows flexibility in where and how approximations are applied. All in all, this organization enables time-proportional approximations. For latency-critical or user-interactive applications, the program may be interrupted early, accepting the current approximate result as is. However, if accuracy becomes a concern depending on application and system variations, 100% accuracy is still guaranteed eventually; i.e., As-Is can always operate as a traditional speculative multiprocessor.
Fig. 5. As-Is architecture.

3.3.2 Task Queue.

All \(\mu\)Tasks dispatched for execution are allocated an entry in the task queue (Figure 5) to track their progress. The task queue can be small; our implementation uses a task queue size of 32 entries. A larger task queue can be easily supported for more processor cores. All \(\mu\)Tasks may run concurrently; any data dependencies among them are resolved by the conflict resolution units (CRs in Figure 5), discussed later. Note that during execution, the task queue entries are not modified; they are only needed during \(\mu\)Task creation and completion. When a \(\mu\)Task completes, it updates its task queue entry and pushes its output data to memory. This is because we opt for a lazy versioning scheme, detailed in later sections. Even after completion, a \(\mu\)Task may still be aborted due to a data conflict. Upon retirement, the task queue entry is removed, and any log due to versioning is cleared.

3.3.3 Approximation.

We implement hardware support for three fundamental approximation primitives that can be applied to an anytime stage:
Input sampling with a pseudo-random permutation: Instead of fetching and operating on input data elements sequentially, a linear-feedback shift register (LFSR) is used to process them in a pseudo-random order. This is useful for many computations that perform reductions or collect statistics of their input data. The input data elements are viewed as part of a probability distribution and are processed in an unbiased manner via the LFSR. Each processor core is equipped with an LFSR counter coupled to the register file, acting in lieu of the loop iterator register when input sampling is enabled. \(\mu\)Tasks of an input-sampled stage use the same LFSR seed values, and each \(\mu\)Task processes exponentially more elements than its predecessor.
Output sampling with a bit-reverse (tree) permutation: For sampling data elements of the output, we implement logic to permute bits of the loop iterator register in a bit-reverse manner. This effectively produces the output elements such that the output gradually increases in resolution. An example is shown in Figure 3 for a 4 \(\times\) 4 dataset. Naturally, each \(\mu\)Task in an output-sampled stage processes exponentially more elements than its predecessor.
Bit sampling (i.e., reduced precision): We support low-precision units for long-latency integer arithmetic operations (e.g., multiplication), drawing from prior work on subword-level approximation [15]. The most-significant bits are processed first, yielding an early estimate of the final value.
Though other approximation techniques can be supported, we focus our efforts on these three primitives since they are the most general and cover a wide range of applications. Furthermore, they require only minor changes to the microarchitecture. The Approx unit in Figure 5 consists of three subunits—(1) Reduced precision functional unit (RPFU) to support reduced precision approximation, (2) LFSR for pseudo random input sampling, and (3) Counter number generator (CNG) to enable bit-reverse output sampling.
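The three primitives have simple software analogues, sketched below; the 16-bit LFSR polynomial and the fixed-point layout are illustrative choices of ours, not the hardware's exact parameters.

```c
#include <stdint.h>

/* Input sampling: a Galois LFSR steps through indices 1..2^16-1 in a fixed
 * pseudo-random permutation, standing in for the per-core LFSR counter. */
uint16_t lfsr_next(uint16_t s) {
    return (uint16_t)((s >> 1) ^ (-(uint32_t)(s & 1u) & 0xB400u));
}

/* Output sampling: reverse the low `bits` bits of the loop iterator, so
 * output elements fill in at progressively finer resolution (e.g., for a
 * 4 x 4 dataset with bits = 4, the visit order is 0, 8, 4, 12, 2, ...). */
uint32_t bit_reverse(uint32_t x, int bits) {
    uint32_t r = 0;
    for (int i = 0; i < bits; i++) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}

/* Bit sampling: keep the top k of 32 bits of a fixed-point value
 * (1 <= k <= 32), giving a most-significant-bits-first early estimate. */
uint32_t truncate_precision(uint32_t fixp, int k) {
    return fixp & (~0u << (32 - k));
}
```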

3.3.4 Conflict Detection and Resolution.

This section details how true and false data dependencies are handled. While executing, all \(\mu\)Tasks maintain a read set and a write set (implemented via bloom filters in the conflict resolution units, Figure 5) that track all memory addresses accessed via load and store instructions, respectively. Note that conflict detection operates at cache-block granularity to simplify the hardware implementation and piggyback on cache coherence messages. As-Is employs a lazy versioning and conflict resolution scheme, where all store values are buffered in a redo log and pushed to memory only upon \(\mu\)Task completion. We opt for a lazy scheme since it simplifies how false dependencies are handled, unique to our asynchronous and synchronous pipeline organizations. Upon completion, since it is still possible for a best-effort \(\mu\)Task to be aborted prior to retirement, an undo log is atomically created while pushing the store values to memory. The read and write sets are also copied from the \(\mu\)Task's local conflict resolution unit to its task queue entry so that other \(\mu\)Tasks can check them for conflicts.
Data Restructuring. When applying input or output sampling, data elements are processed in a non-sequential order that breaks spatial locality. To address this, As-Is supports restructuring the datasets such that the elements are prearranged in memory in the same order as the corresponding permutation (i.e., LFSR or tree). This is straightforward since these permutations are deterministic and statically known. Thus, As-Is is able to retain the inherent spatial locality of the original program.
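A minimal software sketch of this check follows, with exact address sets in place of bloom filters; it covers the RAW/WAW comparison an older \(\mu\)Task performs against younger completed \(\mu\)Tasks (the full design also compares read sets for WAR hazards).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SET_MAX 1024

typedef struct {
    uintptr_t blocks[SET_MAX];  /* cache-block addresses touched */
    size_t n;
} addr_set;

static bool set_contains(const addr_set *s, uintptr_t blk) {
    for (size_t i = 0; i < s->n; i++)
        if (s->blocks[i] == blk)
            return true;
    return false;
}

/* On completion/retirement, an older μTask checks whether a younger,
 * already-completed μTask read or wrote any block the older one wrote. */
bool conflicts(const addr_set *older_writes,
               const addr_set *younger_reads,
               const addr_set *younger_writes) {
    for (size_t i = 0; i < older_writes->n; i++) {
        uintptr_t blk = older_writes->blocks[i];
        if (set_contains(younger_reads, blk) ||
            set_contains(younger_writes, blk))
            return true;        /* younger μTask aborts (re-executes only
                                   if it is mandatory) */
    }
    return false;
}
```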
True Dependencies. We outline how two tasks Y and X, with \(\mu\)IDs of 1,000 and 2,000, respectively, execute on our architecture in the presence of a \(RAW\) dependence between them. There are three possible scenarios that can occur at run-time:
(1) If X loads the dependent memory location A after Y completes, there is no hazard, as X reads the data value written by Y. This is ensured via traditional cache coherence.
(2) If X loads A and completes before Y completes (which is only possible if Y is best-effort), as shown in Figure 6(a), then when Y completes and retires, it walks the task queue and compares its read and write sets against all completed \(\mu\)Tasks with higher \(\mu\)IDs. If such a conflicting \(\mu\)Task X is found, it is aborted and re-executed.
Fig. 6. As-Is conflict resolution in the presence of true dependencies.
(3) If X loads A and completes after Y, as shown in Figure 6(b), it receives a coherence invalidation from Y while executing. We augment coherence messages to include the \(\mu\)IDs of the requesting \(\mu\)Tasks. This is then used by the local conflict resolution unit to catch the conflict and abort (and re-execute) X.
False Dependencies. We outline how the two tasks from our previous discussion, Y and X, execute in the presence of false dependencies between them. There are four possible scenarios:
(1) If X completes after Y completes, there is no hazard, as task X overwrites the dependent memory location A after Y has accessed it, as shown in Figure 7(a).
Fig. 7. As-Is conflict resolution in the presence of false dependencies.
(2) If Y is mandatory (i.e., executing speculatively), then no other \(\mu\)Task with a higher \(\mu\)ID is allowed to complete before Y, since completion is strictly ordered with respect to mandatory \(\mu\)Tasks. Thus, the scenario shown in Figure 7(b) is impossible.
(3) If X completes before best-effort Y, as shown in Figure 7(c), then when X pushes its store values, an invalidation request is sent to the core executing Y. Since Y is best-effort, it is simply aborted by the conflict resolution unit without re-execution.
(4) If X completes before best-effort Y has even accessed A, then when Y attempts to complete, it observes the conflict with X (a higher \(\mu\)ID) at the task queue and aborts itself. This is shown in Figure 7(d).
The CR unit in Figure 5 consists of a finite state machine that responds to incoming coherence messages based on a block's current state. The CR unit is presented as a look-up table, with a sample entry for the example in Figure 6(b).

3.4 Walk-through Example

In this section, we present an example of histogram equalization (histeq) executing on the As-Is system, as depicted in Figure 8. histeq is commonly used to enhance the contrast of an image by scaling its pixel intensities. In this example, there are three computation stages: A, B, and C. Stage A is a diffusive stage and uses input sampling with a pseudo-random permutation. Stage B is a precise stage without any approximations. Stage C is a diffusive stage that employs output sampling with a bit-reverse permutation. Stage A has two tasks, \(A_1\) and \(A_2\), stage B has one task, and stage C has two tasks, \(C_1\) and \(C_2\). Even though stage B has a single task, it has two \(\mu\)Tasks since there are two parent tasks, \(A_1\) and \(A_2\). Similarly, stage C has four \(\mu\)Tasks. As the \(\mu\)Tasks execute, they produce intermediate outputs \(O_1\), \(O_2\), and \(O_3\) with increasing accuracy, which translates to improved contrast for histeq. The precise output \(O_4\) is produced at the end.
Fig. 8. Execution of histogram equalization benchmark on the As-Is architecture.

4 Methodology

This section describes our experimental configuration and applications.

4.1 Experimental Setup

We simulate the proposed As-Is architecture on a custom cycle-level, event-driven microarchitectural simulator based on Pin [34]. The simulator models simple x86-based in-order cores running at 2 GHz. We simulate a two-level cache hierarchy where each core has a 32 kB private, write-through L1 cache, and all cores share a unified 2 MB L2 cache. In addition to traditional microarchitectural components, the simulator models As-Is overheads such as conflict detection, mis-speculation penalties of aborted tasks, and re-execution of aborted tasks in detail. For power analysis, we use McPAT’s [32] in-order core and memory hierarchy models at 32 nm logic.

4.2 Applications

We evaluate As-Is on a range of approximate applications from PERFECT [5], AxBench [53], and SNAP [28]. PERFECT contains kernels from the embedded computing domain, whereas AxBench is an approximate computing benchmark suite. SNAP is used for analyzing the scope of approximation in large graphs. We focus on six widely used kernels from diverse application domains that can benefit from approximate computing. We divide each benchmark into computation stages and apply appropriate approximation techniques per computation stage. Each anytime computation stage is further split into \(\mu\)Tasks at the time of execution. Table 2 lists all the benchmarks along with the configuration used in the evaluation. We use normalized root mean square error (NRMSE) for measuring the error of intermediate approximate outputs. We compute accuracy (i.e., QoR) as (1 − NRMSE) of the approximate output against the precise output.
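For reference, assuming range normalization (NRMSE is sometimes normalized by the mean instead), the metric for a precise output \(y\) and an approximate output \(\hat{y}\) with \(n\) elements is:
\[ \mathrm{NRMSE} = \frac{1}{y_{\max} - y_{\min}} \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(\hat{y}_i - y_i\right)^2}, \qquad \mathrm{QoR} = 1 - \mathrm{NRMSE}. \]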
Application                           Stage     Is Anytime?   # \(\mu\)Tasks   Computation Mode   Approximation Technique
Histogram Equalization (histeq)       Stage 1   Yes           2, 4, 8          Diffusive          Random Input Sampling
                                      Stage 2   No            1                Precise            -
                                      Stage 3   Yes           2                Diffusive          Tree Output Sampling
2D Convolution (2dconv)               Stage 1   Yes           8, 16, 32        Diffusive          Reduced Precision, Tree Output Sampling
Discrete Wavelet Transform (dwt)      Stage 1   Yes           8, 16, 32        Diffusive          Tree Output Sampling
K-Means Clustering (kmeans)           Stage 1   Yes           2, 4, 8          Diffusive          Random Input Sampling
                                      Stage 2   No            1                Precise            -
                                      Stage 3   Yes           2                Diffusive          Tree Output Sampling
Debayering (debayer)                  Stage 1   Yes           8, 16, 32        Diffusive          Tree Output Sampling
Betweenness Centrality (centrality)   Stage 1   Yes           8, 16, 32        Diffusive          Random Input Sampling
PageRank (pagerank)                   Stage 1   Yes           8, 16, 32        Diffusive          Random Input Sampling
                                      Stage 2   No            1                Precise            -
Table 2. As-Is Configuration Parameters for Different Applications
The following approximation techniques are used in our evaluation:
Input data sampling with random permutation.
Output data sampling with bit-reverse or tree permutation.
Reduced fixed-point precision.
Sampling techniques are amenable to anytime computing, as data elements can be sampled at progressively finer granularity. For unordered data, to avoid bias from the memory ordering, data is sampled in a random manner; in our experiments, we use a simple LFSR implementation for random number generation. When the position of data is significant, such as in images and videos, sampling data in a bit-reverse or tree permutation can improve the accuracy of intermediate outputs. Reduced fixed-point precision techniques can also be made anytime by computing data bits in order of significance (most significant bit to least significant bit).
2dconv. 2D convolution is used in computer vision, signal processing, and machine learning. Each output pixel is computed as the averaged dot product of a window of input pixels with the filter. We retain all the computations of 2dconv in a single-stage anytime automaton. 2dconv benefits from both tree sampling and reduced fixed-point precision.
histeq. Histogram equalization improves image contrast by calculating the cumulative distribution of pixel intensities and is commonly used for thermal, satellite, and x-ray images. The anytime automaton of this benchmark consists of three computation stages: Stage 1 processes all the input pixels, Stage 2 calculates the scaling factor of each pixel for better contrast, and Stage 3 builds the final image.
dwt. The discrete wavelet transform is employed in data compression. We construct an automaton with a single stage that performs a discretely-sampled wavelet transform on an image. We evaluate the accuracy of this kernel by executing the inverse transform precisely and comparing against the precise output.
kmeans. K-means clustering partitions the pixels of an image into clusters based on their distance from the nearest mean and is used in data mining applications. In this three-stage automaton, the first stage computes the cluster centroids and assigns pixels to clusters based on their Euclidean distances, the second stage computes the average centroid values, and the last stage samples the pixels and re-assigns them to clusters.
debayer. The debayering algorithm takes an undersampled image from an image sensor and overlays a color filter array to create a complete color image. A single-stage automaton suffices for the interpolation needed to re-create the complete image. To accommodate multiple \(\mu\)Tasks, we apply tree-based sampling.
centrality. Betweenness centrality is a metric based on shortest paths between nodes and is used extensively to analyze social networks. It takes an unweighted graph and uses the Brandes algorithm to calculate betweenness centrality, the fraction of shortest paths that pass through a node n [28].
pagerank. PageRank is used by search engines to evaluate the importance of a web page when deciding the order of search results. The pages act as nodes and the hyperlinks as the edges of a graph.
In our experiments, each benchmark is run with a variable number of \(\mu\)Tasks per anytime computation stage. The runtime numbers for these experiments are normalized to a baseline system that executes each computation stage as a single \(\mu\)Task.

5 Evaluation

We evaluate the As-Is system for program output accuracy, runtime benefits, and power consumption. We present our results, analysis, and design considerations in this section.

5.1 Runtime-Accuracy Tradeoffs

In this section, we present the performance-accuracy tradeoffs enabled by our As-Is system. Figure 9 shows runtime-accuracy profiles for all benchmarks under three \(\mu\)Task configurations (8, 16, and 32) on a 4-processor system. The y-axis represents accuracy in terms of (1 − NRMSE), and the x-axis is runtime normalized to the sequential execution of the same program. For a fair comparison, we also indicate the execution time of a parallel implementation of the same program using pthreads with 32 threads and 4 processors.
Fig. 9. Runtime-accuracy profiles for all benchmarks for \(\mu\)Tasks = (8, 16, 32) on 4 processors.
For all the applications, the As-Is system produces several outputs of high accuracy much earlier than the parallel baseline. In debayer, dwt, and 2dconv, outputs with over 90% accuracy are produced as early as 1% of the sequential execution time, as seen in the runtime-accuracy profiles in Figure 9(a), (b), and (c). These applications are embarrassingly parallel; the precise output is thus reached faster with an increasing number of \(\mu\)Tasks, since many tasks execute concurrently and the probability of hazards under random sampling is low.
It is worth noting that 2dconv takes longer to reach the precise output. This is because 2dconv employs reduced-precision approximation, which saves only the latency of multiplies, and in our experiments multiplies take only a small fraction of the runtime. Even though reduced precision is one of the most heavily studied approximation techniques, we find its utility limited for the applications we studied, for two reasons: (1) reduced precision only helps applications where the latency of arithmetic instructions dominates the execution time, and (2) reducing the precision creates RAW hazards among the tasks, which prolong the time to reach the precise output.
K-means and histeq consist of three computation stages and perform many redundant computations to generate intermediate approximate outputs. For very large unweighted graphs, the centrality or pagerank computed after processing a small subset of nodes is highly representative; thus, both metrics start from a fairly high accuracy. However, in centrality, all tasks end up updating each node's centrality, leading to more conflicts and thus an almost sequential runtime. In contrast, pagerank is embarrassingly parallel and takes 60% of the sequential runtime to reach the precise output. Conflicts in pagerank primarily occur between \(\mu\)Tasks from two stages, unlike centrality.

5.2 Sensitivity Study: μTasks and Processors

We now examine the trends observed in the runtime-accuracy profiles by showing how long it takes to achieve 100% accuracy as we vary the number of \(\mu\)Tasks and processors in Figure 10. More processors present more opportunities to exploit parallelism. As a result, increasing the number of processors reduces the execution time for almost all the applications. Similarly, increasing the number of \(\mu\)Tasks allows more tasks to execute concurrently. Thus, increasing the \(\mu\)Tasks from 8 to 32 in embarrassingly parallel applications like debayer, dwt, and 2dconv reduces the runtime, as seen in Figure 10(a), (b), (c), and (f).
Fig. 10. Time taken to achieve 100% accuracy for different numbers of \(\mu\)Tasks = (8, 16, 32) and processors = (2, 4, 8).
Though the first stage of pagerank is embarrassingly parallel, pagerank has a second stage that is non-anytime. Thus, a higher number of \(\mu\)Tasks in the first stage leads to more redundant executions of the non-anytime second stage. This is the general trend for all benchmarks that have a non-anytime stage, i.e., pagerank, histeq, and kmeans. With many tasks trying to access similar data, the conflict rate of tasks increases. Thus, for these benchmarks, the runtime to achieve 100% accuracy increases with the number of \(\mu\)Tasks due to increased conflicts and redundant execution. As all the \(\mu\)Tasks in centrality update the same data, the runtime is less affected by an increase in the number of available processors. Centrality is also more prone to hazards as the number of \(\mu\)Tasks increases, since all tasks modify the same centrality table.

5.3 As-Is and Software-Only Implementation

We compare As-Is with a software-only implementation in Figure 11, for an 8-\(\mu\)Task, 4-processor configuration. As-Is reaches 95% accuracy 1.22\(\times\)–11.89\(\times\) faster than the software implementation, which suffers significant software overheads. As expected, the applications that take longer to achieve 95% accuracy see greater benefits from As-Is than the benchmarks that achieve high accuracy within the first few \(\mu\)Tasks.
Fig. 11. Speedup achieved by As-Is over software implementation for 95% accuracy.

5.4 As-Is and Conventional Speculative Architectures

Figure 12 presents a comparison of the As-Is hardware and software components with the closest relevant speculative architecture [21], for a 32-\(\mu\)Task, 4-processor configuration. We analyze the speedup from the As-Is system by comparing the runtime-accuracy profiles of three configurations: (1) As-Is hardware and software, (2) As-Is software with baseline speculative hardware, and (3) the baseline application with baseline speculative hardware.
Fig. 12. Comparison between software and hardware components of As-Is and conventional speculative architecture.
The As-Is software component breaks the application down into \(\mu\)Tasks that generate increasingly accurate outputs, whereas the baseline software application generates only one final output. In As-Is, the potential conflicts among tasks are statically predetermined due to the nature of our synchronous and asynchronous pipelines. As-Is leverages this knowledge of potential conflicts to utilize the system better, letting tasks abort without re-execution while trying to produce best-effort approximate outputs. A baseline speculative system, however, would treat all tasks as mandatory and force them to finish by re-executing them on every conflict. Moreover, in As-Is, producer-consumer relationships between stages are known a priori. This allows As-Is to deal with the false dependencies introduced by the shared buffer between stages by opting for lazy versioning and conflict detection, in contrast to the eager versioning and eager conflict detection of the baseline speculative system.
For all the benchmarks except centrality, As-Is achieves highly accurate outputs earlier than the other configurations. Centrality has many hazards, and the faster conflict resolution from eager conflict detection and versioning speeds it up on the baseline architecture.

5.5 Benefits from Approximation vs. Speculation

Figure 13 presents the distribution of benefits from approximation alone versus approximation along with speculation. For most benchmarks, unlocking the available parallelism with speculation gives significantly more benefit than employing approximation techniques alone. However, there are two exceptions to this trend: centrality and histeq. Because of the partitioning of the histogram buckets, there are many read and write conflicts in building the histogram, and the earlier tasks get aborted very frequently upon parallelization. Similarly, centrality is also hurt by speculation because of frequent collisions between tasks.
Fig. 13. Comparison between benefits from approximation only and speculation along with approximation for \(\mu\)Tasks = 32 on 4 processors.

5.6 Sensitivity to Cache Block Size

Figure 14 presents the impact of cache block size on As-Is system performance. We study the performance of the debayer application with cache block sizes of 4 B, 16 B, and 64 B. Large cache blocks reduce the number of cold misses; however, they also introduce false sharing of data. Our results indicate that the locality benefits of a larger cache block outweigh the resulting false sharing and increased conflicts among tasks.
Fig. 14. Sensitivity to cache block sizes for debayer.

5.7 Mean and Variance Across Different Inputs

The results reported so far correspond to a single input image or graph. To account for variability across different inputs, we run all the benchmarks across four different inputs; the mean and standard deviation of the runtimes to reach 90% accuracy, normalized to the sequential baseline on a 32-\(\mu\)Task, 4-processor system, are reported in Figure 15.
Fig. 15. Mean and standard deviation of runtime to reach 90% accuracy for different inputs across all the benchmarks.

5.8 Area Overheads

As-Is adds three additional hardware overheads: (1) a 32-entry task queue (0.34 mm\(^2\)), (2) 16 kB dual-port SRAM bloom filters (0.05 mm\(^2\)), and (3) seven registers for sampling approximation (0.0007 mm\(^2\)). We use CACTI [33] to compute these areas at the 32 nm ITRS-HP technology node. Overall, these structures consume 4.4% additional area over the baseline speculative system [21].

5.9 Power Overheads

Table 3 presents the power overheads of the As-Is system, reporting the average power to reach 100% accuracy on As-Is normalized to the average power of the pthreads implementation. In As-Is, the benchmarks have 32 \(\mu\)Tasks in total, and the pthreads implementation employs 32 threads. Both configurations are run on a 4-processor system.
Benchmark    Runtime Power   Bloom Filter   Task Queue    Approximation   Logic Overhead   Total Power
2dconv       1.641534808     0.018709115    4.24045E-08   0               0.0069083        1.667152266
debayer      1.887845233     0.001436962    4.45023E-08   0               0                1.88928224
dwt          1.235862257     0.00118626     4.97517E-08   0               0                1.237048567
histeq       1.048868066     0.001406349    2.6365E-08    3.33101E-08     0                1.050274474
kmeans       2.284551799     0.008963389    3.67173E-08   4.63893E-08     0                2.293515271
centrality   1.050264271     0.062085359    5.3236E-08    6.72593E-08     0                1.112349751
pagerank     1.119802284     4.89692E-05    4.77938E-08   6.03836E-08     0                1.119851361
Table 3. Average Power Consumption for 32 \(\mu\)Tasks on 4 Processors with Breakdown for Additional Overheads
Kmeans, debayer, and 2dconv incur high power overheads. In fact, kmeans takes 2.2\(\times\) the power to reach 100% accuracy due to increased runtime and conflicts with a larger number of \(\mu\)Tasks, as observed in Figures 9 and 10. However, the power consumption of histeq, dwt, centrality, and pagerank is on par with the baseline.

5.10 Energy Savings

As-Is does enable significant energy savings for earlier approximate outputs. The fraction of energy spent on As-Is to achieve 80%, 85%, and 90% accuracy, normalized to the energy consumed by the parallel baseline, is reported in Figure 16 for a 32-\(\mu\)Task, 4-processor configuration. Even though the power of kmeans is 2.3\(\times\) that of the parallel baseline, as shown in Table 3, and its runtime to reach 100% accuracy is also 2\(\times\) that of the parallel baseline, as shown in Figure 9, As-Is is able to produce an 80%-accurate output with only 0.84\(\times\) the energy of the parallel baseline.
Fig. 16. Energy consumption on As-Is to achieve 90%, 85%, and 80% accuracy normalized to the parallel baseline.

5.11 Comparison with State-of-the-art

In Figure 17, we compare As-Is with other state-of-the-art approximate computing approaches like EnerJ [48]. For the 2dconv benchmark with 4-bit reduced precision, EnerJ achieves 97.04% accuracy; to reach the same level of accuracy, As-Is consumes 0.65\(\times\) the energy of EnerJ.
Fig. 17. Energy consumption on As-Is and EnerJ for 2dconv.

5.12 Iterative vs. Diffusive Stages

Since diffusive tasks build on top of the previous output whereas iterative tasks overwrite the output buffer, all diffusive computation stages can be expressed as iterative stages. Thus, from an efficiency perspective, we opted to implement stages as diffusive instead of iterative whenever possible, to minimize redundant computations. However, we perform a case study comparing diffusive and iterative implementations of the single-stage pagerank automaton in Figure 18, for a 32-\(\mu\)Task, 4-processor configuration. With iterative stages, even though it takes much longer for the precise output to be generated, outputs with high accuracy are still generated within 5% of the baseline runtime.
Fig. 18. Diffusive vs. iterative computation stage for pagerank.

5.13 Dynamic Accuracy Evaluation

Because As-Is is an interruptible system, it is hard to provide a dynamic accuracy mechanism that automatically infers when it is appropriate to terminate the program given a desired user accuracy threshold. As a first step, we train several regression-based models to estimate the runtime required to reach a programmer-specified accuracy.

5.13.1 Implementation.

For this case study, we examine 2dconv and dwt operating with 32 \(\mu\)Tasks on 4 processors. The training data is gathered by collecting runtime-accuracy numbers across 10 different images, and the test dataset comprises one image. Each image generates 16 data points on the runtime-accuracy curve for 2dconv and 6–8 for dwt. We train the following models to gauge the runtime required to achieve a desired user-specified accuracy: (1) Linear Regression, (2) Support Vector Regression (SVR), (3) SGD Regressor, (4) Bayesian Ridge, and (5) TheilSen Regressor.

5.13.2 Evaluation.

To evaluate our models, we use a custom score metric for interruptible execution. The score reflects the difference between the expected value and the value predicted by a model.
When the predicted runtime is greater than the expected runtime (as defined by the user-specified accuracy), the difference between the two is the model score. In this case, there is no accuracy deficit, so the score reflects how close the execution is to the ideal latency. On the other hand, if the predicted runtime is shorter, there is a chance of accuracy loss; the score reflects this loss, since the model terminates execution earlier than ideal. The scoring is summarized by Equation (1). The score is computed by accumulating the per-point values via root mean square; a lower score indicates a better prediction by the model.
\begin{equation}
\begin{aligned}
Expected_{Runtime} &= \text{Actual Runtime}(\text{User-Specified Accuracy}) \\
Score &= {\left\lbrace \begin{array}{ll} \Delta Runtime, & \text{if}~Predicted_{Runtime} \gt Expected_{Runtime} \\ \Delta Accuracy, & \text{otherwise} \end{array}\right.}
\end{aligned}
\tag{1}
\end{equation}
The accumulated scores of all test points are shown in Figure 19. Of all the regression models, the TheilSen Regressor fits best for 2dconv, whereas the SGD Regressor performs best for dwt. Across these two applications, the SGD Regressor has the best overall fit given the scores. However, given the small discrepancy in scores, any of our evaluated predictor models would be effective.
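A small Python sketch of how Equation (1) and the RMS accumulation might be computed (the helper names and test points are hypothetical; it assumes the accuracy actually reached at the predicted runtime is known):

import math

def point_score(pred_runtime, expected_runtime,
                target_accuracy, accuracy_at_pred):
    """Score one test point per Equation (1): runtime slack if we
    over-predict, accuracy deficit if we under-predict and stop early."""
    if pred_runtime > expected_runtime:
        return pred_runtime - expected_runtime   # no accuracy loss, latency slack
    return target_accuracy - accuracy_at_pred    # early termination, accuracy loss

def model_score(points):
    """Accumulate per-point scores via root mean square; lower is better."""
    return math.sqrt(sum(point_score(*p) ** 2 for p in points) / len(points))

# Hypothetical test points: (predicted, expected, target_acc, acc_at_predicted).
points = [(1.10, 1.00, 0.90, 0.90),   # over-prediction: 0.10 runtime slack
          (0.80, 1.00, 0.90, 0.86)]   # under-prediction: 0.04 accuracy loss
print(f"score = {model_score(points):.4f}")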
Fig. 19. Dynamic accuracy using regression models.

5.14 Limitations Discussion

Currently, As-Is imposes programmer effort to port an application: breaking it into anytime computing stages and annotating it with the relevant information. With follow-on work on compilers, much of this effort could be avoided or reduced; however, such work is outside the scope of this article. Our evaluation is based on simulation, and even though this is common for research in the architecture community, we acknowledge that the benefits of the As-Is platform may look different on a real system. We believe that the current evaluation is highly reflective of the performance and energy savings of a realized As-Is system. We have not performed a comprehensive search of the application space, particularly of larger, end-to-end applications, but we expect such applications to only strengthen our evaluation.

6 Related Work

Approximate Computing. Approximate computing allows trading off accuracy and reliability for better performance and energy efficiency and has been widely explored in prior work [3, 4, 13, 14, 16, 25, 38, 40, 46, 47, 48]. Reduced precision [23], loop perforation [49], load value approximation [38], and memoization [1, 44, 45] are among the approximation techniques that have been explored. These approaches provide several mechanisms for programmers to specify approximation opportunities at a fine granularity, such as instructions, variables, functions, and patterns. Several techniques also employ neural networks to approximate code regions [2, 12, 14, 16, 17, 25, 31, 42, 51, 54]. In contrast, As-Is uncovers the benefits of approximation at a coarse, task-level granularity and exploits the accuracy guarantees and interruptibility of anytime execution.
Best Effort Computing. Best-effort computing has been studied in the context of parallel computing [6], data mining [8, 36], and hardware transactional memory [7, 9, 30]. As-Is, in contrast, focuses on marrying best-effort computing with approximation. Moreover, in approaches like [6], best-effort and mandatory task execution must be isolated in both hardware and software, whereas As-Is supports both computations on the same substrate and relies on the underlying architecture to resolve data dependencies. Earlier techniques also rely on programmers to annotate best-effort computations, whereas the As-Is runtime system automatically classifies \(\mu\)Tasks based on the computation stage-pipeline configuration.
Intermittent Computing. A task-level programming model similar to that of As-Is is also leveraged by research in intermittent computing for recoverability [15, 35]. However, that work has focused on software-based techniques that ensure forward progress within tight energy constraints, with little emphasis on the performance of the underlying system. As-Is, in contrast, is built to extract performance at an acceptable accuracy loss, aided by speculation to uncover the parallelism present in anytime computing. Even though software approaches may enable anytime-like computing models, hardware-level optimizations are integral to system performance.

7 Conclusion

We propose As-Is, an Anytime Speculative Interruptible System that supports the architectural realization of anytime approximate computing through speculative computing. The As-Is task-based programming model expresses an application as an anytime automaton. The theoretical anytime stages are formalized into physical mandatory and best-effort \(\mu\)Tasks that are created, scheduled, executed, resolved, and retired by the underlying architecture and runtime system. Our results indicate favorable runtime-vs.-accuracy tradeoffs over a wide range of benchmarks.

References

[1]
C. Alvarez, J. Corbal, and M. Valero. 2005. Fuzzy memoization for floating-point multimedia applications. IEEE Transactions on Computers 54, 7 (2005), 922–927.
[2]
Renée St. Amant, Amir Yazdanbakhsh, Jongse Park, Bradley Thwaites, Hadi Esmaeilzadeh, Arjang Hassibi, Luis Ceze, and Doug Burger. 2014. General-purpose code acceleration with limited-precision analog computation. In Proceedings of the 2014 ACM/IEEE 41st International Symposium on Computer Architecture. 505–516.
[3]
Mohammad Ashraful Anam, Paul N. Whatmough, and Yiannis Andreopoulos. 2013. Precision-energy-throughput scaling of generic matrix multiplication and discrete convolution kernels via linear projections. In Proceedings of the 11th IEEE Symposium on Embedded Systems for Real-time Multimedia. 21–30.
[4]
Woongki Baek and Trishul M. Chilimbi. 2010. Green: A framework for supporting energy-conscious programming using controlled approximation. In Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation.
[5]
Kevin Barker, Thomas Benson, Dan Campbell, David Ediger, Roberto Gioiosa, Adolfy Hoisie, Darren Kerbyson, Joseph Manzano, Andres Marquez, Leon Song, Nathan Tallent, and Antonino Tumeo. 2013. PERFECT (Power Efficiency Revolution For Embedded Computing Technologies) Benchmark Suite Manual. Pacific Northwest National Laboratory and Georgia Tech Research Institute. Retrieved from http://hpc.pnnl.gov/projects/PERFECT/.
[6]
Srimat T. Chakradhar and Anand Raghunathan. 2010. Best-effort computing: Re-thinking parallel software and hardware. In Proceedings of the 47th Design Automation Conference. 865–870.
[7]
S. Chaudhry, R. Cypher, M. Ekman, M. Karlsson, A. Landin, S. Yip, H. Zeffer, and M. Tremblay. 2009. Rock: A high-performance SPARC CMT processor. IEEE Micro 29, 2 (2009), 6–16.
[8]
Vinay K. Chippa, Srimat T. Chakradhar, Kaushik Roy, and Anand Raghunathan. 2013. Analysis and characterization of inherent application resilience for approximate computing. In Proceedings of the 50th Annual Design Automation Conference 2013. 113:1–113:9.
[9]
Luke Dalessandro, François Carouge, Sean White, Yossi Lev, Mark Moir, Michael L. Scott, and Michael F. Spear. 2011. Hybrid NOrec: A case study in the effectiveness of best effort hardware transactional memory. In Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems. 39–52.
[10]
Christopher De Sa, Matthew Feldman, Christopher Ré, and Kunle Olukotun. 2017. Understanding and optimizing asynchronous low-precision stochastic gradient descent. In Proceedings of the 44th Annual International Symposium on Computer Architecture. 561–574.
[11]
Thomas Dean and Mark Boddy. 1988. An analysis of time-dependent planning. In Proceedings of the 7th AAAI National Conference on Artificial Intelligence. AAAI, 49–54. Retrieved from http://dl.acm.org/citation.cfm?id=2887965.2887974.
[12]
Schuyler Eldridge, Florian Raudies, David Zou, and Ajay Joshi. 2014. Neural network-based accelerators for transcendental function approximation. In Proceedings of the 24th Edition of the Great Lakes Symposium on VLSI. Association for Computing Machinery, 169–174.
[13]
Hadi Esmaeilzadeh, Adrian Sampson, Luis Ceze, and Doug Burger. 2012. Architecture support for disciplined approximate programming. In Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems. 301–312.
[14]
Hadi Esmaeilzadeh, Adrian Sampson, Luis Ceze, and Doug Burger. 2015. Neural acceleration for general-purpose approximate programs. Communications of the ACM 58, 1 (2015), 105–115.
[15]
Karthik Ganesan, Joshua San Miguel, and Natalie Enright Jerger. 2019. The what’s next intermittent computing architecture. In Proceedings of the 2019 IEEE International Symposium on High Performance Computer Architecture.
[16]
Beayna Grigorian, Nazanin Farahpour, and Glenn Reinman. 2015. BRAINIAC: Bringing reliable accuracy into neurally-implemented approximate computing. In Proceedings of the 21st IEEE International Symposium on High Performance Computer Architecture. 615–626.
[17]
Beayna Grigorian and Glenn Reinman. 2014. Dynamically adaptive and reliable approximate computing using light-weight error analysis. In Proceedings of the 2014 NASA/ESA Conference on Adaptive Hardware and Systems. 248–255.
[18]
Muhammad Amber Hassaan, Martin Burtscher, and Keshav Pingali. 2011. Ordered vs. unordered: A comparison of parallelism and work-efficiency in irregular algorithms. In Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming. ACM, 3–12.
[19]
Eric J. Horvitz. 1987. Reasoning about beliefs and actions under computational resource constraints. In Proceedings of the 3rd Conference on Uncertainty in Artificial Intelligence. AUAI, 429–447.
[20]
A. Jain, P. Hill, S. Lin, M. Khan, M. E. Haque, M. A. Laurenzano, S. Mahlke, L. Tang, and J. Mars. 2016. Concise loads and stores: The case for an asymmetric compute-memory architecture for approximation. In Proceedings of the 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture. 1–13.
[21]
M. C. Jeffrey, S. Subramanian, C. Yan, J. Emer, and D. Sanchez. 2016. Unlocking ordered parallelism with the swarm architecture. IEEE Micro 36, 3 (2016), 105–117.
[22]
Keyur Joshi, Vimuth Fernando, and Sasa Misailovic. 2020. Aloe: Verifying reliability of approximate programs in the presence of recovery mechanisms. In Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization. 56–67.
[23]
Patrick Judd, Jorge Albericio, Tayler H. Hetherington, Tor M. Aamodt, Natalie D. Enright Jerger, Raquel Urtasun, and Andreas Moshovos. 2015. Reduced-precision strategies for bounded memory in deep neural nets. arXiv:1511.05236. Retrieved from https://arxiv.org/abs/1511.05236.
[24]
Patrick Judd, Jorge Albericio, and Andreas Moshovos. 2017. Stripes: Bit-serial deep neural network computing. Computer Architecture Letters 16, 1 (2017), 80–83.
[25]
Daya Shanker Khudia, Babak Zamirai, Mehrzad Samadi, and Scott A. Mahlke. 2015. Rumba: An online quality management system for approximate computing. In Proceedings of the 42nd Annual International Symposium on Computer Architecture. 554–566.
[26]
N. Kulkarni, F. Qi, and C. Delimitrou. 2018. Leveraging approximation to improve datacenter resource efficiency. IEEE Computer Architecture Letters 17, 2 (2018), 171–174.
[27]
Amlan Kusum, Keval Vora, Rajiv Gupta, and Iulian Neamtiu. 2016. Efficient processing of large graphs via input reduction. In Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing. 245–257.
[28]
Jure Leskovec and Rok Sosič. 2016. SNAP: A general-purpose network analysis and graph-mining library. ACM Transactions on Intelligent Systems and Technology 8, 1 (2016), 1.
[29]
Victor R. Lesser, Jasmina Pavlin, and Edmund Durfee. 1988. Approximate processing in real-time problem solving. AI Magazine 9, 1 (1988), 49.
[30]
Yossi Lev and Jan-Willem Maessen. 2008. Split hardware transactions: True nesting of transactions using best-effort hardware transactional memory. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 197–206.
[31]
Boxun Li, Peng Gu, Yi Shan, Yu Wang, Yiran Chen, and Huazhong Yang. 2015. RRAM-based analog approximate computing. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 34, 12 (2015), 1905–1917.
[32]
Sheng Li, Jung Ho Ahn, Richard D. Strong, Jay B. Brockman, Dean M. Tullsen, and Norman P. Jouppi. 2009. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture.
[33]
Sheng Li, Ke Chen, Jung Ho Ahn, Jay B. Brockman, and Norman P. Jouppi. 2011. CACTI-P: Architecture-level modeling for SRAM-based structures with advanced leakage reduction techniques. In Proceedings of the International Conference on Computer-Aided Design. IEEE.
[34]
Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood. 2005. Pin: Building customized program analysis tools with dynamic instrumentation. In Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, 190–200.
[35]
Kiwan Maeng and Brandon Lucia. 2019. Supporting peripherals in intermittent systems with just-in-time checkpoints. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation.
[36]
Jiayuan Meng, Srimat T. Chakradhar, and Anand Raghunathan. 2009. Best-effort parallel execution framework for Recognition and mining applications. In Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing. 1–12.
[37]
J. S. Miguel, J. Albericio, A. Moshovos, and N. E. Jerger. 2015. Doppelgänger: A cache for approximate computing. In Proceedings of the 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture. 50–61.
[38]
J. S. Miguel, M. Badr, and N. E. Jerger. 2014. Load value approximation. In Proceedings of the 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.
[39]
Joshua San Miguel and Natalie Enright Jerger. 2016. The anytime automaton. In Proceedings of the 43rd International Symposium on Computer Architecture. IEEE, 545–557.
[40]
Sasa Misailovic, Michael Carbin, Sara Achour, Zichao Qi, and Martin C. Rinard. 2014. Chisel: Reliability- and accuracy-aware optimization of approximate computational kernels. In Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA 2014. 309–328.
[41]
Jayadev Misra. 1986. Distributed discrete-event simulation. ACM Computing Surveys 18, 1 (1986), 39–65.
[42]
Thierry Moreau, Mark Wyse, Jacob Nelson, Adrian Sampson, Hadi Esmaeilzadeh, Luis Ceze, and Mark Oskin. 2015. SNNAP: Approximate computing on programmable SoCs via neural acceleration. In Proceedings of the 21st IEEE International Symposium on High Performance Computer Architecture, HPCA 2015. 603–614.
[43]
Keshav Pingali, Donald Nguyen, Milind Kulkarni, Martin Burtscher, M. Amber Hassaan, Rashid Kaleem, Tsung-Hsien Lee, Andrew Lenharth, Roman Manevich, Mario Méndez-Lojo, Dimitrios Prountzos, and Xin Sui. 2011. The Tao of parallelism in algorithms. In Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, 12–25.
[44]
Abbas Rahimi, Luca Benini, and Rajesh K. Gupta. 2013. Spatial memoization: Concurrent instruction reuse to correct timing errors in SIMD architectures. IEEE Transactions on Circuits and Systems II: Express Briefs 60, 12 (2013), 847–851.
[45]
Michael Ringenburg, Adrian Sampson, Isaac Ackerman, Luis Ceze, and Dan Grossman. 2015. Monitoring and debugging the quality of results in approximate programs. In Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems. Association for Computing Machinery, 399–411.
[46]
Mehrzad Samadi, Davoud Anoushe Jamshidi, Janghaeng Lee, and Scott Mahlke. 2014. Paraprox: Pattern-based approximation for data parallel applications. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems.
[47]
Mehrzad Samadi, Janghaeng Lee, D. Anoushe Jamshidi, Amir Hormati, and Scott Mahlke. 2013. SAGE: Self-tuning approximation for graphics engines. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture.
[48]
Adrian Sampson, Werner Dietl, Emily Fortuna, Danushen Gnanapragasam, Luis Ceze, and Dan Grossman. 2011. EnerJ: Approximate data types for safe and general low-power computation. In Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation. 164–174.
[49]
Stelios Sidiroglou-Douskos, Sasa Misailovic, Henry Hoffmann, and Martin C. Rinard. 2011. Managing performance vs. accuracy tradeoffs with loop perforation. In Proceedings of the SIGSOFT/FSE’11 19th ACM SIGSOFT Symposium on the Foundations of Software Engineering and ESEC’11: 13th European Software Engineering Conference. 124–134.
[50]
Pang-Ning Tan. 2018. Introduction to Data Mining. Pearson Education India.
[51]
Swagath Venkataramani, Ashish Ranjan, Kaushik Roy, and Anand Raghunathan. 2014. AxNN: Energy-efficient neuromorphic systems using approximate computing. In Proceedings of the 2014 IEEE/ACM International Symposium on Low Power Electronics and Design. 27–32.
[52]
Keval Vora, Rajiv Gupta, and Guoqing Xu. 2017. KickStarter: Fast and accurate computations on streaming graphs via trimmed approximations. In Proceedings of the 22nd International Conference on Architectural Support for Programming Languages and Operating Systems. Association for Computing Machinery, 237–251.
[53]
Amir Yazdanbakhsh, Divya Mahajan, Hadi Esmaeilzadeh, and Pejman Lotfi-Kamran. 2017. AxBench: A multiplatform benchmark suite for approximate computing. IEEE Design and Test 34, 2 (2017), 60–68.
[54]
Qian Zhang, Ting Wang, Ye Tian, Feng Yuan, and Qiang Xu. 2015. ApproxANN: An approximate computing framework for artificial neural network. In Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition. 701–706.
