1 Introduction
With each processor generation, architects aim to improve core performance while maintaining energy efficiency. To achieve high levels of performance, a processor must be able to build aggressive schedules that exploit instruction-level parallelism (ILP) and memory-level parallelism. One of the main challenges in this process is reordering instructions in a scalable, energy-efficient manner. Traditional out-of-order processors schedule ready instructions using complex schedulers that dynamically build dataflow dependencies and implicitly learn instruction delays. They achieve this by monitoring, waking up, and issuing instructions once their operands are produced. However, as previous studies have shown [29, 30], this technique relies on power-hungry hardware structures that scale processor performance inefficiently.
To build efficient, scalable hardware that provides both high performance and energy efficiency, previous research has proposed a number of techniques covering both in-order and out-of-order processors. Some examples include parking non-ready instructions to better utilize available resources [20, 21, 34], bypassing stalled instructions or filtering instructions based on their criticality to reduce stalling delays [1, 2, 6, 19], replaying stalled instructions to avoid blocking the instruction queue [12], and instruction prescheduling using dataflow dependencies [26, 31]. Some solutions observe that the schedules of general-purpose applications are highly regular and repeat during execution; these propose issue time prediction processors that try to explicitly predict when instructions will be ready to issue, using dataflow dependencies and pre-defined instruction delays [4, 5, 15, 17, 24, 33, 37, 42, 43]. Unfortunately, without explicit knowledge of real-time instruction delays, issue time predictions are not accurate enough to reach close-to out-of-order core performance in an efficient way. Other works [29, 30] propose hybrid processors where an out-of-order core produces repeated instruction schedules, taking true memory access latency into account, and offloads them to simple in-order cores. However, these solutions require the implementation of two cores, which increases design cost.
In this work, we aim to overcome the limitations of issue time prediction processors by dynamically building knowledge of real-time instruction delays with low-cost hardware. In addition, we introduce an instruction reordering technique that uses this knowledge to prioritize instructions based on dataflow and timing information in a highly efficient way. We achieve high performance (and in some cases outperform cores with the expensive on-demand issue structures used by traditional out-of-order cores) with a lightweight structure that understands program dependencies and timing information to prioritize key instructions when necessary. We do this with a delay-based scheduling mechanism that uses latency information as seen by the core itself, instead of the pre-defined values used by all previous works to date.
In this article, we propose a processor microarchitecture that dynamically prioritizes the issuing of instructions, just in time for execution, by recording real-time delays of repeated loads (i.e., in loops) and learning dataflow dependencies of instructions to accurately predict the issue times of the same instructions in future appearances. It improves energy efficiency by replacing reservation-station-based instruction queues with priority queues that reorder instructions using the predicted issue time as their ordering policy, and it reduces complexity by issuing only from the heads of the queues.
In this work, we make the following contributions:
• An efficient issue time prediction processor with prioritization hardware that enables instruction reordering to achieve 86.2% of the performance of the upper bound (a traditional out-of-order baseline), while consuming 30% less power (Section 4);
• An issue time prediction algorithm that uses real-time load delays to enable accurate prediction of issue times for repeated instructions. This prediction facilitates the prioritization of key instructions to fill the gaps between stalled instructions, improving the performance of issue time prediction processors by 5.5–25% on average (Section 3);
• A comprehensive evaluation of the proposed microarchitecture with a quantitative comparison to state-of-the-art issue time prediction processors (Sections 5 and 6).
2 Motivation and Overview
One of the key reasons why out-of-order processors are able to achieve high performance is their aggressive scheduling of instructions. Past research suggests that out-of-order schedules are repeated [25] across loop iterations and can be learned [29, 30] or predicted by assuming a pre-defined delay, as specified for each instruction type [4, 5, 26, 31, 43]. Although many instructions execute with static delays, such as addition and multiplication, load instructions that miss in the L1 can have variable latency, depending on the level of the memory hierarchy they access.
Our study shows that even for accesses to the same level of the memory hierarchy, different load instructions can have different delays due to bandwidth contention in the memory hierarchy and the variability in DRAM access times. More specifically, across all PCs we see variations of as many as four cycles for L2 accesses, and DRAM accesses can take more than 2\(\times\) the cycle count defined by the implementation's specifications. Therefore, assuming a single minimum pre-defined delay for memory accesses is not sufficient to accurately predict instruction issue times. However, Figure 1 shows that memory access times of loads in different appearances are repeated over consecutive iterations, on average 92.8% of the time.
The key insight of this article is that accurately predicting repeated instruction issue times and building a high-performance schedule requires learning the latest delay of memory accesses. In this work, we build schedules and reorder instructions in an energy-efficient way using three core components that can replace the traditional out-of-order scheduler: (1) an instruction delay learning mechanism that tracks delays of load instructions over repeated appearances, (2) an issue time predictor that dynamically predicts when an instruction will be ready to issue, and (3) priority queue reordering that uses these predictions to prioritize key instructions, even after dispatch has occurred.
3 Issue Time Prediction
To achieve aggressive, high-performing schedules, we need (1) to schedule instructions in a dataflow manner that satisfies producer–consumer relationships and (2) to reorder instructions such that idle cycles between dependent instructions are filled with independent work. Dataflow dependencies provide a scheduling policy where the execution of instructions follows the flow of the data from producers to consumers. The processor dynamically derives dataflow dependencies from the input and output operands of instructions. Because instructions require a certain amount of time to produce their output data, gaps of idle cycles are formed between dependent instructions. To achieve a high-performing schedule, these gaps must be filled with independent instructions that are ready to execute. In this work, we identify these gaps by learning real-time delays of load instructions that miss in the L1 cache. All other operations have static delays; therefore, learning is not required. Assuming instructions in repeated code (e.g., loops) appear more than once, we combine instruction delays with their dependencies to predict their issue time in future appearances.
3.1 Prediction Algorithm
The predicted issue time of a consuming instruction \(c\), \(T_{Predicted}(c)\), is estimated as the maximum, over all of its producers, of the sum of a producer's predicted issue time \(T_{Predicted}(p)\) and its delay \(T_{Delay}(p)\) (Equation (2)). The delay \(T_{Delay}(p)\) of each producer \(p\) is calculated as the difference between its completion time \(T_{Complete}(p)\) and its issue time \(T_{Issue}(p)\) (Equation (1)). An instruction can be issued only after all its producers have completed, and therefore the algorithm chooses the maximum value. By using the predicted issue time of the producers, the algorithm inherently propagates dataflow dependency chain delays to all instructions, and thus the resulting predicted issue time can directly be used to order instructions.
An instruction's delay is calculated as
\[ T_{Delay}(p) = T_{Complete}(p) - T_{Issue}(p) \quad (1) \]
An instruction's predicted issue time is estimated as
\[ T_{Predicted}(c) = \max_{p \,\in\, \mathit{producers}(c)} \big( T_{Predicted}(p) + T_{Delay}(p) \big) \quad (2) \]
The proposed technique requires the core to observe and store delay information \((T_{Delay}(p))\) of repeated instructions (storing delays is only required for load instructions that miss in the L1 cache, as all other instructions have static delays). In the absence of this information (the first appearance of an instruction, or non-repeated instructions), the algorithm assumes the lowest delay to avoid unnecessary stalls in the execution. Because load instructions access different levels of the memory hierarchy in different iterations in an unpredictable way, constant monitoring and retraining of the predictor are required to keep the delay information up to date. Therefore, stored load delays are updated every iteration.
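To make Equations (1) and (2) concrete, the following Python sketch illustrates how recorded producer delays and predicted issue times can be combined; the dictionary-based stand-ins for the DelayCache and the producer lookup, as well as the four-cycle default L1-hit delay, are simplifications for illustration and not the exact hardware organization.

```python
# Minimal sketch of the issue time prediction algorithm (Equations (1) and (2)).
# delay_cache and predicted_issue are simplified software stand-ins for the
# hardware structures of Section 4; the 4-cycle L1-hit default is an assumption.

L1_HIT_DELAY = 4  # cycles; used when no recorded delay exists for a producer


class IssueTimePredictor:
    def __init__(self):
        self.delay_cache = {}       # producer PC -> last observed delay (L1-miss loads only)
        self.predicted_issue = {}   # PC -> predicted issue time (relative cycles)

    def record_delay(self, pc, t_issue, t_complete):
        # Equation (1): T_Delay(p) = T_Complete(p) - T_Issue(p)
        self.delay_cache[pc] = t_complete - t_issue

    def predict(self, pc, producer_pcs):
        # Equation (2): T_Predicted(c) = max_p (T_Predicted(p) + T_Delay(p))
        if not producer_pcs:
            prediction = 0          # no producers: ready immediately after dispatch
        else:
            prediction = max(
                self.predicted_issue.get(p, 0) + self.delay_cache.get(p, L1_HIT_DELAY)
                for p in producer_pcs
            )
        self.predicted_issue[pc] = prediction
        return prediction
```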
3.2 Example
To demonstrate how the issue time prediction algorithm works, we annotate a code region (see Table 1) that illustrates two loop iterations of a code snippet from the hmmer application of the SPEC CPU2006 benchmark suite. Although this example demonstrates reordering within one basic block, in normal execution there are no restrictions on reordering instructions between different blocks. Also, for simplicity, the example uses instructions; the actual implementation operates on uops.
For the purpose of this example, we assume a simple in-order core with one issue per cycle, where loads/stores take four cycles to execute and all other instructions execute in one cycle. \(T_{Issue}\) corresponds to the issue cycle, \(T_{Delay}\) is the instruction's delay in the current iteration, and \(T_{Predicted}\) is the predicted issue time for its next appearance in relation to its producers. \(\Delta _{this}\) and \(\Delta _{ooo}\) are the number of cycles an instruction was issued earlier, compared to a traditional in-order execution, for the proposed solution and a fully out-of-order core, respectively. Note that "now" is a relative time and is different for each instruction; it merely means that an instruction is ready for execution immediately after it is dispatched.
As shown in the dataflow diagram in Table 1, load instruction ➊ produces a result for instruction ➋. Instruction ➌ is a store that depends on ➋, while ➍ and ➎ are loads producing the operands for instruction ➏. Instructions ➌ and ➏ are producers of instructions ➐ and ➑, while store instruction ➒ is a consumer of instruction ➑. Based on these dependencies, the loop consists of two major dependency chains: ➊ \(\rightarrow\) ➋ \(\rightarrow\) ➌ and ➍, ➎ \(\rightarrow\) ➏. These chains are independent of one another, and therefore instructions can be reordered between them as needed. Instruction ➋ must wait for four cycles before it can issue because of its dependence on load instruction ➊. An in-order core will stall for four cycles between the two instructions, but an out-of-order core will fill these idle cycles by issuing instructions ➍ and ➎ earlier. To emulate this, we keep track of timing information for the relevant instructions. During the first iteration, we collect the issue cycle (\(T_{Issue}\)) and the delay (\(T_{Delay}\)) of every instruction and associate them with the dataflow dependencies to predict the issue time (\(T_{Predicted}\)) of the same instructions in future appearances. In the second iteration, instructions ➍ and ➎ (marked in green) bypass independent instructions that have a higher predicted issue time.
Observing the \(\Delta\)s in the second iteration allows us to see the benefit of this technique. After one iteration, the prediction algorithm builds a schedule that is the same as the schedule of the out-of-order core, as shown by the matching deltas (\(\Delta _{this}\) and \(\Delta _{ooo}\)). The \(\Delta _{this}\) of instructions ➍ and ➎ is \(-5\), because they are issued five cycles earlier compared to execution on an in-order core. Instructions that are part of the same dependency chain also benefit and can issue earlier (instructions ➏, ➐, ➑, and ➒ in the example, marked in blue). The \(\Delta _{ooo}\) is the same for both iterations, because the out-of-order core can reorder instructions in every iteration.
The issue time predictor requires just one iteration to learn real-time load instruction delays before applying them in the prioritization algorithm that will reorder instructions accordingly. However, in our implementation, instructions are also reordered in the first iteration by assuming L1 hit access time for all load instructions to avoid unnecessary stalls.
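To illustrate the learning step on a dependency pattern shaped like the one above (one load feeding a dependent chain, and two independent loads feeding another consumer), the short snippet below drives the predictor sketch from Section 3.1; the PC labels and cycle numbers are illustrative values, not the actual Table 1 trace.

```python
# Illustrative use of the IssueTimePredictor sketch from Section 3.1 on a
# pattern shaped like the example: i1 -> i2 and i4, i5 -> i6 (independent chains).
# PC labels and timing values are made up for illustration.
pred = IssueTimePredictor()
pred.record_delay("i1", t_issue=0, t_complete=4)   # load i1 took 4 cycles
pred.record_delay("i4", t_issue=5, t_complete=9)
pred.record_delay("i5", t_issue=6, t_complete=10)

print(pred.predict("i1", []))            # 0: no producers, ready at dispatch
print(pred.predict("i2", ["i1"]))        # 4: must wait for the producing load i1
print(pred.predict("i4", []))            # 0: independent, can fill the gap
print(pred.predict("i5", []))            # 0: independent, can fill the gap
print(pred.predict("i6", ["i4", "i5"]))  # 4: after both producing loads
```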
4 Proposed Microarchitecture
By tracking real-time load delays and instruction dependencies, we can more accurately predict instruction issue times and build aggressive schedules that mimic those of an out-of-order core, as shown in the example in Section 3.2. Using priority-based ordering hardware with the predicted issue times as the priority index, the core can efficiently reorder instructions. Figure 2 shows a schematic representation of the proposed architecture. Green-colored components are the structures added to implement the Instruction Delay Learning process, while blue-colored structures implement the Issue Time Prediction and Priority Queue Reordering in the execution unit.
In step ① of the Instruction Delay Learning process, instruction dependencies are stored in the Dependency Table (DT), which contains an entry for each physical register and maps it to the instruction pointer that last wrote to that register. In step ②, the issue and completion times of load instructions that miss in the L1 cache are stored per PC in a direct-mapped memory structure called the DelayCache. For every dispatched instruction, the Issue Time Prediction algorithm identifies its producers from the DT in step 🅐 and their delays from the DelayCache in step 🅑, to calculate the Predicted Issue Time in step 🅒, which is used by the Priority Queues (PQs) as a priority index to reorder instructions in the execution engine.
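As a rough software model of steps ① and 🅐, the sketch below shows a Dependency Table keeping, per physical register, the PC of its last writer; the table size and the lookup interface are assumptions made for illustration rather than the exact hardware design.

```python
# Sketch of the Dependency Table (DT): one entry per physical register, mapping
# it to the PC of the instruction that last wrote it. Sizes are illustrative.
class DependencyTable:
    def __init__(self, num_phys_regs=128):
        self.last_writer = [None] * num_phys_regs   # phys reg -> producer PC

    def update(self, dest_phys_reg, pc):
        # Step 1 (rename/dispatch): record this instruction as the register's producer.
        if dest_phys_reg is not None:
            self.last_writer[dest_phys_reg] = pc

    def producers_of(self, src_phys_regs):
        # Step A (prediction): look up the producer PC of each source operand.
        return [self.last_writer[r] for r in src_phys_regs
                if self.last_writer[r] is not None]
```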
4.1 Instruction Delay Learning
While instructions are being fetched, decoded, and renamed, their dependencies are recorded in the DT. As instructions start executing, the delays of instructions that caused upcoming instructions to stall (L1 cache misses) are stored in the DelayCache, initiating the training of the prediction mechanism. In this work, the delay represents the execution time of an instruction with respect to its issue time. An alternative approach, not used in our final design, stores the delay of an instruction with respect to its dispatch time, but our study shows large slowdowns in such a design due to the unpredictability of structural hazards (see Figure 8(a)).
Although most instructions require a static delay before delivering their result, loads cause the majority of the stalls, and their delay can be variable and unknown at dispatch time. In this implementation, we only store the delays of loads that miss in the L1; for all other instructions we use their predefined delay (derived from their type, or the L1 access time for loads that do not miss). This allows us to minimize the storage overhead and power requirements when implementing the DelayCache.
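The following sketch models the DelayCache as a small direct-mapped, PC-indexed table that only records loads that missed in the L1; the entry count, the modulo indexing, and the four-cycle fallback are illustrative assumptions, not the actual hardware parameters.

```python
# Sketch of a direct-mapped DelayCache indexed by load PC. Only loads that miss
# in the L1 are recorded; entry count and tag handling are illustrative choices.
L1_HIT_DELAY = 4  # cycles (assumed L1 hit latency)


class DelayCache:
    def __init__(self, num_entries=256):
        self.num_entries = num_entries
        self.tags = [None] * num_entries
        self.delays = [0] * num_entries

    def update(self, pc, delay, l1_miss):
        if not l1_miss:
            return                      # L1 hits keep their predefined delay
        index = pc % self.num_entries   # direct-mapped indexing on PC bits
        self.tags[index] = pc
        self.delays[index] = delay      # always overwrite with the latest delay

    def lookup(self, pc):
        index = pc % self.num_entries
        if self.tags[index] == pc:
            return self.delays[index]
        return L1_HIT_DELAY             # miss: assume the lowest (L1-hit) delay
```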
Due to application characteristics related to branch behavior and memory access patterns, load delays vary unpredictably across iterations (see Section 6.5/Figure 9(a)). Therefore, the DelayCache is continuously updated with the latest delay for every stored instruction, and the issue time predictor is retrained in every iteration of a repeated code region. Our experiments show that, for the majority of the applications tested, training every iteration produces the highest performance. Activity-based power analysis shows that training is not expensive, as only a subset of load instructions have a variable delay that requires an update to the DelayCache. Note that in a context-switch scenario, flushing the DT and the DelayCache is not required, as both units use a least-recently-used replacement policy that replaces the older application's instruction delays with the new application's delays. If the new application does not use all entries, the older application's data remain in place until the core switches back to that application.
Although our implementation trains the issue time predictor using load delay information, the issue prediction mechanism can be applied as-is to other instructions with variable delay, such as floating point division and transcendental functions. In this work, we do not cover their potential performance benefits, as they do not occur often in the applications we evaluate.
4.2 Issue Time Prediction and Dispatch
The delays of instructions stored in the DelayCache, combined with the dependencies from the DT, provide the necessary inputs for predicting the issue time of instructions, as described in Section 3. For every renamed instruction, the DT is queried using the instruction's input operands to find possible producers. On a DT hit, the DelayCache is queried with the producers' addresses, and on a DelayCache hit, the delay is retrieved to calculate the current instruction's issue time. On a DelayCache miss, the value of an L1 hit (four cycles in our microarchitecture) is used to avoid unnecessary delays in the absence of misses.
The Execution Engine, which has the primary task of reordering instructions, is built using multiple priority instruction queues, with each functional unit having its own dedicated queue. Although instructions from multiple queues can execute out of order, instructions in a single queue can be issued only from the head of the queue and only to the corresponding functional unit. Because each queue corresponds to a specific functional unit, instructions are dispatched to the queues according to their type. If an instruction matches more than one queue, then dataflow dependencies are used to steer it to the first queue that has a producing instruction at its tail; otherwise, it goes to the queue with the fewest instructions. Our studies indicate that round-robin and global dependence steering schemes reduce performance compared to our scheduling methodology (see Figure 10 for more details).
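A minimal sketch of this steering policy is shown below, assuming simplified queue and instruction objects (the fields supported_types, entries, and producer_pcs are hypothetical); it is meant only to make explicit the tail-producer-first, then least-occupied rule.

```python
from dataclasses import dataclass, field
from typing import List, Set

# Sketch of the dispatch-time steering policy described above. The Queue and
# Instr shapes are hypothetical stand-ins, not the actual hardware interfaces.
@dataclass
class Queue:
    supported_types: Set[str]
    entries: List["Instr"] = field(default_factory=list)

@dataclass
class Instr:
    pc: int
    op_type: str
    producer_pcs: Set[int]

def steer(instr: Instr, queues: List[Queue]) -> Queue:
    candidates = [q for q in queues if instr.op_type in q.supported_types]
    for q in candidates:
        tail = q.entries[-1] if q.entries else None
        if tail is not None and tail.pc in instr.producer_pcs:
            return q                                       # keep dependent chains together
    return min(candidates, key=lambda q: len(q.entries))   # least-occupied queue
```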
Issue times are also predicted for first-appearing or non-repeated instructions, even in the absence of delay information, by assuming the lowest delay (an L1 hit) to avoid unnecessary execution stalls.
4.3 Priority Queue Reordering and Issue
In the proposed architecture, we remove the traditional reservation-station-based scheduler and instead reorder instructions using lightweight and efficient priority instruction queues in the execution engine. PQs are built using Systolic Priority Queues [22], where instructions are reordered based on a priority index (their predicted issue time in this case). Insertion and removal in a priority queue happen at the head, as described in Reference [22]; thus, the highest-priority inserted instruction is directly available from the head of a PQ on the next cycle, and back-to-back instruction execution is achieved. A free list of queue entries is also used so that new entries can be inserted at a free position. Because each functional unit has its own instruction queue and only instructions at the head of each queue can be issued, complex selection logic is not required to decide which instructions to issue every cycle. When the instruction at the head of a queue has unresolved data dependencies, the queue blocks. However, instructions in other queues are not affected, as only instructions in a blocked queue stall.
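The behavioral sketch below models one per-functional-unit priority queue with a software binary heap keyed on the predicted issue time; this is only an analogue of the systolic priority queue hardware of Reference [22], intended to show the issue-from-head and per-queue-blocking behavior.

```python
import heapq

# Behavioral model of one per-functional-unit priority queue: instructions are
# ordered by predicted issue time and issued only from the head. A heap is a
# software analogue of the systolic priority queue used in hardware.
class PriorityIssueQueue:
    def __init__(self):
        self.heap = []                        # (predicted_issue_time, seq, instr)
        self.seq = 0                          # tie-breaker to keep ordering stable

    def insert(self, predicted_issue_time, instr):
        heapq.heappush(self.heap, (predicted_issue_time, self.seq, instr))
        self.seq += 1

    def try_issue(self, operands_ready):
        # Only the head may issue; if its operands are not ready, this queue
        # blocks without affecting the other queues.
        if self.heap and operands_ready(self.heap[0][2]):
            return heapq.heappop(self.heap)[2]
        return None
```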
4.4 Register Renaming
Register renaming works in the same way as in traditional out-of-order processors. Renaming replaces destination architectural registers with physical registers to eliminate the name dependencies (output dependencies and anti-dependencies) between instructions, while automatically recognizing true dependencies, which allows for a more flexible execution order. Maintaining a status for each register, indicating whether or not it has been computed yet, allows instructions to execute out of order when there are no true data dependencies.
4.5 Memory Dependencies
Memory operations are also reordered to maximize performance. Contrary to register dependencies, which can be resolved at decode time, store-to-load memory dependencies with overlapping memory addresses can lead to incorrect execution if loads or stores execute before older stores that refer to the same address. Memory dependencies are accurately predicted by identifying the stores upon which a load depends (its store set) and communicating that information to the issue time predictor [9]. Similarly to a traditional out-of-order processor, the ROB and the LSU prevent memory violations. The LSU tracks executing memory operations and makes sure that they are committed in program order. Instructions are verified before commit to ensure that no memory violations become visible to the architectural state.
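A simplified, software-level sketch of store-set tracking in the spirit of Reference [9] is given below; the table organization and interfaces are illustrative assumptions, not the exact predictor used in the design.

```python
# Simplified sketch of store-set tracking in the spirit of Reference [9]:
# loads that previously conflicted with a store are linked to that store's set,
# so the predictor can treat those stores as extra producers of the load.
# Table organization and sizing are illustrative, not the actual hardware.
class StoreSetPredictor:
    def __init__(self):
        self.store_set_of_pc = {}    # PC (load or store) -> store set id
        self.next_set_id = 0

    def record_violation(self, load_pc, store_pc):
        # A load executed before an older store to the same address: put both
        # in the same store set so future instances are ordered correctly.
        set_id = self.store_set_of_pc.get(store_pc)
        if set_id is None:
            set_id = self.next_set_id
            self.next_set_id += 1
            self.store_set_of_pc[store_pc] = set_id
        self.store_set_of_pc[load_pc] = set_id

    def predicted_store_producers(self, load_pc, in_flight_store_pcs):
        # Stores in the same set as the load act as additional producers
        # when predicting the load's issue time.
        set_id = self.store_set_of_pc.get(load_pc)
        if set_id is None:
            return []
        return [s for s in in_flight_store_pcs
                if self.store_set_of_pc.get(s) == set_id]
```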
4.6 Commit
The commit stage checks for exceptions before it releases structures such as store buffer entries and rename registers. Instructions enter the ROB in order during dispatch, record their completion out of order, and leave the ROB in order. Interrupts and branch misspeculation events are handled as in conventional processors. However, retraining of the issue time predictor is not required in this case, and if the core matches a repeated instruction in the DelayCache, it will be reordered immediately.
4.7 Multi-Core Support
In a multi-core implementation, new connections are added in the memory hierarchy for loads accessing remote memory locations. Issue time prediction in the proposed design is based on the per-core memory access latency at any level of the memory hierarchy; therefore, by replicating the new hardware structures (DT and DelayCache) in every core, the prediction algorithm adapts accordingly and learns remote access delays. A two-core simulation, with a variety of application pairings, resulted in gains similar to those reported for the single-core scenario in Section 6.1. Specifically, the average performance difference of the multi-core implementation relative to an out-of-order multi-core baseline is within 0.6% of the single-core performance difference. Because coherence misses and Simultaneous Multi-Threading (SMT) scheduling could be less predictable, handling them would require new studies and, potentially, changes to the proposed structures; however, this is out of the scope of this work. The core, as implemented, does not change any significant components in the back-end of the processor and is therefore compatible with the original coherence and consistency models.
5 Experimental Setup
The performance evaluation of this work was performed on a modified version of the Sniper Multi-Core Simulator [7], version 6.2, that uses the Instruction-Window-Centric core model [8]. We use a detailed DRAM model that takes into account DRAM page locality and other low-level details that account for all DRAM delays. Power and energy analysis was conducted with McPAT [23] version 1.3, modified to support our microarchitecture. Applications were compiled with the GCC compiler (-O2 optimization flag) and executed with the reference inputs of the SPEC CPU2006 benchmarks, using a single, representative (SimPoint-based [36]) trace of 750 million instructions. Average results are computed by combining the results of common workloads (with different inputs) into a weighted value before averaging across applications. The details of the structures added to the core, with area and average power consumption, are listed in Table 2, and the details of the simulated microarchitectures are listed in Table 3. Performance is measured in Instructions per Cycle and energy efficiency in Million Instructions Per Second per Watt (MIPS/W) and Energy Delay Product (EDP). Unless explicitly stated, all summary results are weighted average values across all applications, while black bars represent results of the proposed design configuration described in Table 3.
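For reference, the small helper below shows how the reported efficiency metrics can be derived from raw simulator outputs; the function and variable names are our own illustrative choices.

```python
# How the reported efficiency metrics are derived from raw simulator outputs.
# Function and variable names are illustrative, not part of the toolchain.
def efficiency_metrics(instructions, cycles, frequency_hz, avg_power_watts):
    seconds = cycles / frequency_hz
    ipc = instructions / cycles                           # Instructions per Cycle
    mips_per_watt = (instructions / seconds) / 1e6 / avg_power_watts
    edp = (avg_power_watts * seconds) * seconds           # Energy x Delay (J*s)
    return ipc, mips_per_watt, edp
```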
7 Related Work
There has been extensive work in the past on instruction reordering to reduce runtime delays and improve processor performance. Table 4 presents state-of-the-art hardware solutions for instruction reordering, the delays they try to mitigate, the processor stage where the reordering takes place, and the type of scheduler used to reorder instructions. High performance comes from mitigating both Static and Dynamic delays, while reordering instructions in the backend of the processor provides higher flexibility. Unfortunately, the majority of past solutions use an RS-based scheduler for reordering instructions, which limits energy efficiency improvements. In this work, we argue that smarter solutions are needed to significantly improve energy efficiency, using a simpler and more scalable scheduler (PQ) to reorder instructions in the backend, while achieving high performance by addressing both Static and Dynamic delays. In this section, we discuss different categories of solutions that address runtime delays in instruction scheduling.
RS-based Schedulers. Many solutions use dataflow dependencies to preschedule or prioritize instructions to improve the performance or the efficiency of an out-of-order processor. Dataflow Prescheduling [26] fetches and reorders instructions in a prescheduling buffer using dataflow dependencies. This provides a larger effective window size while keeping the issue buffer small. However, it does not take variable-delay instructions into account and assumes static delays for all instructions (all loads are presumed to hit in the L1). Segmented Instruction Queues [33] divide large instruction queues into smaller segments that can be clocked at higher frequencies. They use dynamic dependence-based scheduling to promote instructions from segment to segment until they reach a small issue buffer. Data Cache Hit-Miss Prediction [43] tries to predict L1 hits and reschedule load-dependent instructions based on that information. But predicting only L1 hits does not account for the off-chip memory delays that have the greatest impact on processor performance. In a more complex implementation, Look-ahead Prediction [24] tries to predict load delays using a value predictor. Dynamic solutions such as References [21, 34] predict and prioritize critical or independent instructions. In Reference [21], instructions that depend on long-latency operations are moved from the issue queue to a much larger waiting instruction buffer until their long-latency producer completes. Long Term Parking (LTP) [34] analyzes instructions and parks non-critical instructions from the main instruction stream to prioritize critical ones (address-generating instructions and loads). Similarly, N-Use [5] uses an associative table to park non-ready instructions, Distance Issue Logic [4] assumes unknown load delays and parks all their consumers in an RS IQ until their operands are produced, while Deterministic Issue Logic [5] assumes a static delay for all loads and only parks stalled consumers in the RS IQ. FIFOrder [2] and Delay and Bypass [1] use the knowledge that an OoO-core instruction scheduler offers (the availability of instruction operands) to dispatch ready instructions to FIFO queues, reducing the size and power consumption of complex instruction queues. They differ in the type of instructions sent to the FIFO queues, based on their criticality and readiness. All these solutions require additional hardware to implement and still employ a traditional out-of-order scheduler to handle the reordering and compensate for timing mispredictions of their techniques.
FIFO-based Schedulers. Due to their low power consumption, in-order processors are highly energy efficient. However, they achieve significantly lower performance compared to an out-of-order processor. Complexity-Effective [31] reorders instructions based on their dependencies. Instructions that belong to the same dataflow dependency chain are directed to dedicated in-order queues, while selection logic is used to issue instructions from the heads of the queues. The Load Slice Core [6] extends an in-order, stall-on-use core with a second in-order pipeline that allows memory accesses and address-generating instructions to bypass stalled instructions in the main pipeline. Unfortunately, these solutions do not take dynamic delays into account. This creates large gaps between load-dependent instructions in a real execution, which stall until the producing load returns from memory, thus limiting their performance improvement.
Cyclone [12] uses a store-set dependence predictor to monitor memory dependencies, while mispredicted instructions are replayed from the tail of the queue. But this implementation potentially scrambles the ordering of other instructions in the instruction window, creating a performance bottleneck. Wakeup-free scheduling [17] addresses this structural constraint by using a collapsing scheme that does not allow instructions to move while their latency counters are decreasing. But its evaluation uses a perfect L1-hit predictor for load latencies, which does not account for the off-chip memory delays that have the greatest impact on processor performance. iCFP [15] uses a Continual Flow Pipeline that switches to an advance execution mode when it encounters an L1 or L2 cache miss. Miss-dependent instructions are diverted into a slice buffer, un-blocking the pipeline so that miss-independent instructions can execute. Although it achieves low power consumption, its performance is limited to 68% of the performance of an out-of-order processor [15]. Freeway [20] is an orthogonal solution that implements a technique similar to LTP [34] on top of an in-order core and manages to improve its performance by 80%, while we achieve a 180% increase in performance over our in-order baseline core (on the same applications). CASINO [19] uses two in-order queues to filter instructions that block the issuing queue. However, this solution takes no real-time information into consideration when filtering, which can lead to even more excessive delays when one of the queues fills up, depending on the applications executed.
Heterogeneous Processors. Mirage Cores [30] and its predecessor Dynamos [29] employ a full out-of-order core to produce fast out-of-order schedules that are stored in a local cache structure and executed by a number of in-order cores on the same processor. WiDGET [42] enables dynamic customization of different combinations of small and/or powerful cores as a way to increase performance and reduce power consumption depending on the executing workload. The design complexity and cost of these solutions, however, make them inefficient, as they still require the implementation of an out-of-order core to learn aggressive instruction schedules.
Software Implementations. Compile-time application analysis is also used to categorize and prioritize instructions by predicting the critical path of the execution [13, 41]. Solutions with a good balance between performance and energy efficiency use modified hardware with the appropriate compile-time support to statically reorder instructions in advance [3, 10, 18, 38, 39, 40, 44]. But, unlike our work, these solutions require modification of the application itself and do not provide backward compatibility for already deployed applications.
SMT. In a multi-threaded architecture, independent instructions from different threads can be used to overcome dependency stalls from a single thread [16, 37]. This increases processor throughput in throughput-sensitive parallel applications. However, these techniques do not address single-thread performance.
Prefetching. Prefetching attempts to minimize cache misses by executing additional instructions [11, 14, 28]. Runahead [28] allows execution to continue past a stall to pre-execute instructions and generate new cache misses that fetch data earlier for future instructions. Continuous runahead [14] extends previous solutions by dynamically filtering the instruction stream to identify the chains of operations that cause the pipeline to stall. Unfortunately, prefetching techniques alone are not enough, as they only try to hide memory latency. All solutions referenced here still use a complex out-of-order scheduler to handle instruction reordering.