GB2625512A

GB2625512A - Triggered-producer and triggered-consumer instructions

Info

Publication number: GB2625512A
Application number: GB2218617.5A
Authority: GB
Inventors: Wang Wei; Eyole Mbou; Gabrielli Giacomo
Original assignee: ARM Ltd; Advanced Risc Machines Ltd
Current assignee: ARM Ltd
Priority date: 2022-12-12
Filing date: 2022-12-12
Publication date: 2024-06-26
Also published as: WO2024126969A1; GB202218617D0

Abstract

Execution circuitry executes processing operations in response to triggered instructions according to a triggered instruction architecture. Candidate instruction storage circuitry stores triggered instructions, each specifying condition information 40 indicating at least one condition. Issue circuitry issues, in response to determining/predicting that the condition indicated by a given triggered instruction is met, that instruction for execution. The execution circuitry is responsive to state update information specified by the instruction to cause machine state 82 (e.g. predicate registers) to be updated. When the instruction comprises a triggered producer instruction, the execution circuitry is responsive to completion of execution of that instruction to cause dependency state 84 to be updated, indicating that a corresponding triggered consumer instruction can be issued. The issue circuitry evaluates, when the instruction comprises a triggered consumer instruction, whether the condition is determined/predicted to be met in dependence on both the machine state and the dependency state. Delay circuitry 90 may also be provided.

Description

TRIGGERED-PRODUCER AND TRIGGERED-CONSUMER INSTRUCTIONS

The present technique relates to the field of data processing.

A triggered instruction architecture (TIA) is an instruction set architecture (ISA) in which there is no program counter (PC), and instead instructions specify conditions under which they are issued ("triggered"). The conditions specified by each instruction are sometimes referred to as "triggers", "trigger conditions" or "predicates", and instructions which specify conditions may be referred to as triggered instructions, condition-dependent instructions or condition-specifying instructions, for example.

The conditions specified by the triggered instructions are monitored and, if the system state (e.g. updates to the system state due to execution of previous instructions, and updates due to hardware events) matches the state defined in the conditions, then the corresponding instruction is issued ("triggered").
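The trigger check described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the system state is a small bit vector of predicates and that each instruction carries a `mask` and an `expected` value; these names and this encoding are hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriggeredInstruction:
    name: str
    mask: int      # hypothetical: which predicate bits the trigger inspects
    expected: int  # hypothetical: required values of the inspected bits


def is_triggered(instr: TriggeredInstruction, predicates: int) -> bool:
    """True when the inspected predicate bits match the expected values."""
    return (predicates & instr.mask) == instr.expected


def ready_instructions(pool, predicates: int):
    """Names of all instructions whose trigger matches the current state."""
    return [instr.name for instr in pool if is_triggered(instr, predicates)]


pool = [
    TriggeredInstruction("add", mask=0b011, expected=0b001),
    TriggeredInstruction("store", mask=0b110, expected=0b110),
    TriggeredInstruction("load", mask=0b001, expected=0b001),
]

print(ready_instructions(pool, 0b001))  # ['add', 'load']
print(ready_instructions(pool, 0b111))  # ['store', 'load']
```

No program counter appears anywhere in the sketch: which instructions are eligible depends only on the current state.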

A key advantage of using a TIA is that the instruction fetch, decode and issue logic is much simpler than in a normal processor (such as a central processing unit, CPU) and hence more transistors (and therefore more circuit area and power budget) can be dedicated to the datapath, increasing the compute density.

Moreover, by controlling instruction execution in dependence on system state rather than a program counter, the number of control-flow instructions (e.g. branch instructions) to be executed can be reduced, and a processing element (PE) based on a TIA can react quickly to incoming data (the incoming data can "trigger" the appropriate instructions) and events. For this reason, a TIA is also sometimes referred to as an event-driven architecture.

Viewed from a first example of the present technique, there is provided an apparatus comprising: at least one triggered-instruction processing element, a given triggered-instruction processing element comprising execution circuitry to execute processing operations in response to triggered instructions according to a triggered instruction architecture; candidate instruction storage circuitry to store a plurality of triggered instructions, each triggered instruction specifying condition information indicating at least one condition; and issue circuitry to issue, in response to a determination or a prediction of the at least one condition indicated by the condition information specified by a given triggered instruction being met, the given triggered instruction for execution by the execution circuitry, wherein: the execution circuitry is responsive to state update information specified by the given triggered instruction to cause machine state information to be updated in dependence on the state update information; when the given triggered instruction comprises a triggered-producer instruction, the execution circuitry is responsive to completion of execution of a processing operation performed in response to the triggered-producer instruction to cause dependency state information to be updated to indicate that at least one corresponding triggered-consumer instruction can be issued for execution; and the issue circuitry is configured to evaluate, when the given instruction comprises a triggered-consumer instruction, whether the at least one condition is determined or predicted to be met in dependence on both the machine state information and the dependency state information.

Viewed from another example of the present technique, there is provided a method comprising: executing processing operations in response to triggered instructions according to a triggered instruction architecture; storing a plurality of triggered instructions, each triggered instruction specifying condition information indicating at least one condition; and issuing, in response to a determination or a prediction of the at least one condition indicated by the condition information specified by a given triggered instruction being met, the given triggered instruction for execution; causing, in response to state update information specified by the given triggered instruction, machine state information to be updated in dependence on the state update information; in response to completion of execution of a processing operation performed in response to the given triggered instruction, when the given triggered instruction comprises a triggered-producer instruction, causing dependency state information to be updated to indicate that at least one corresponding triggered-consumer instruction can be issued for execution; and evaluating, when the given instruction comprises a triggered-consumer instruction, whether the at least one condition is determined or predicted to be met in dependence on both the machine state information and the dependency state information.

Viewed from another example of the present technique there is provided a computer program comprising instructions which, when executed on a computer, control the computer to provide: processing program logic to execute processing operations in response to triggered instructions according to a triggered instruction architecture; candidate instruction storage program logic to maintain a candidate instruction storage data structure to store a plurality of triggered instructions, each triggered instruction specifying condition information indicating at least one condition; and issue program logic to issue, in response to a determination or a prediction of the at least one condition indicated by the condition information specified by a given triggered instruction being met, the given triggered instruction for execution by the processing program logic, wherein: the processing program logic is responsive to state update information specified by the given triggered instruction to cause machine state information to be updated in dependence on the state update information; when the given triggered instruction comprises a triggered-producer instruction, the processing program logic is responsive to completion of execution of a processing operation performed in response to the triggered-producer instruction to cause dependency state information to be updated to indicate that at least one corresponding triggered-consumer instruction can be issued for execution; and the issue program logic is configured to evaluate, when the given instruction comprises a triggered-consumer instruction, whether the at least one condition is determined or predicted to be met in dependence on both the machine state information and the dependency state information.

Viewed from another example of the present technique, there is provided a computer-readable storage medium storing the computer program described above. The computer-readable storage medium can be a transitory medium or a non-transitory medium.

Further aspects, features and advantages of the present technique will be apparent from the following description of examples, which is to be read in conjunction with the accompanying drawings, in which:
Figure 1 illustrates an apparatus comprising a number of triggered-instruction processing elements (PEs) coupled by an on-chip communication interconnect;
Figure 2 illustrates an example of a triggered instruction;
Figure 3 illustrates an example of a tagged data item received on an input channel;
Figure 4 illustrates an example of a triggered-instruction processing element (PE);
Figures 5 to 7 illustrate examples of triggered-producer and triggered-consumer instructions;
Figure 8 illustrates another example of a triggered-instruction processing element (PE);
Figure 9 is a flow diagram illustrating a method of executing instructions in a TIA that supports triggered-producer and triggered-consumer instructions; and
Figure 10 illustrates a virtual machine implementation.

Before discussing example implementations with reference to the accompanying figures, the following description of example implementations and associated advantages is provided.

In accordance with one example configuration there is provided an apparatus comprising at least one triggered-instruction processing element (PE), a given triggered-instruction processing element comprising execution circuitry to execute processing operations in response to triggered instructions according to a triggered instruction architecture (TIA).

For example, the apparatus may comprise a single triggered-instruction PE (e.g. the given triggered-instruction PE may be the only triggered-instruction PE), but in other examples the apparatus may comprise multiple triggered-instruction PEs (e.g. the given triggered-instruction PE may be one of multiple triggered-instruction PEs, for example in an array of triggered-instruction PEs providing an event-driven spatial fabric). In some examples, the apparatus could comprise a mix of triggered-instruction PEs and PEs based on a different type of instruction set architecture (ISA) such as a program counter (PC)-based architecture. The PEs may also be referred to as processors or processing circuitry, for example.

The execution circuitry executes the triggered instructions by performing (e.g. executing) processing operations. For example, the processing operations to be executed may be determined by information in the instruction encoding, such as an opcode, and could include, for example, arithmetic/logic operations, load/store operations and floating point (FP) operations.

In the given triggered-instruction PE, at least some of the instructions that are executed are triggered instructions which are executed in accordance with a TIA. However, it should be appreciated that it is also possible for the given triggered-instruction PE to be capable of processing some instructions which are not triggered instructions.

The apparatus also comprises candidate instruction storage circuitry to store a plurality of triggered instructions, each triggered instruction specifying condition information indicating at least one condition. The apparatus also comprises issue circuitry to issue, in response to a determination or a prediction of the at least one condition indicated by the condition information specified by a given triggered instruction being met, the given triggered instruction for execution by the execution circuitry.

The plurality of triggered-instructions stored in the candidate instruction storage circuitry may also be referred to as a pool of triggered instructions, and may comprise a number of triggered-instructions whose condition information is to be monitored in order to determine when a condition indicated by the condition information is determined or predicted to have been met (and hence when the associated triggered instruction can be issued). For example, a condition indicated by the condition information of a particular triggered instruction may be determined or predicted to be met based on machine state, which could in turn be dependent on (for example) hardware events detected or generated by the triggered-instruction PE and/or previously executed instructions. Note that there may be some circumstances in which an instruction cannot be issued immediately after its condition is determined or predicted to be met; for example, if the conditions indicated by the condition information of multiple triggered instructions are determined or predicted to have been met at the same time, the multiple triggered instructions may be issued one after the other (e.g. this could be based on the order in which they are stored in memory). Alternatively, a particular triggered instruction whose condition has been predicted or determined to have been met may not be issued until any data it requires becomes available. Also, note that the issue circuitry of the present technique is configured to issue the given triggered-instruction either when its condition is determined to be met, or when it is predicted to be met (e.g. it is predicted that the condition has been or soon will be met); hence, the issue circuitry of the present technique may support speculation of the conditions of triggered instructions.
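As a rough illustration of the two issue constraints mentioned above (several instructions becoming ready at once, and an instruction waiting for its data), the following sketch selects the next instruction to issue. It assumes storage order is used as the tie-break and models data availability as a set of input names; the `Candidate` fields and the `select_next` helper are hypothetical and not taken from the specification.

```python
from typing import NamedTuple, Optional


class Candidate(NamedTuple):
    name: str
    mask: int                          # hypothetical: predicate bits inspected
    expected: int                      # hypothetical: required bit values
    needs_input: Optional[str] = None  # hypothetical: input the instruction reads


def select_next(pool, predicates: int, available_inputs) -> Optional[str]:
    """Pick the first candidate, in storage order, that can issue now."""
    for cand in pool:
        if (predicates & cand.mask) != cand.expected:
            continue  # condition not met
        if cand.needs_input is not None and cand.needs_input not in available_inputs:
            continue  # condition met, but the required data has not arrived
        return cand.name
    return None


pool = [
    Candidate("recv", mask=0b1, expected=0b1, needs_input="ch0"),
    Candidate("add", mask=0b1, expected=0b1),
]

print(select_next(pool, 0b1, set()))    # add  (recv is ready but lacks data)
print(select_next(pool, 0b1, {"ch0"}))  # recv (earlier in storage order)
```

Speculative issue on a predicted condition is not modelled here.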

In accordance with the present technique, the execution circuitry is responsive to state update information specified by the given triggered instruction to cause machine state information to be updated in dependence on the state update information.

Hence, the given triggered instruction specifies state update information as well as condition information; execution of the given triggered instruction can thus cause the machine state information to change, which can in turn lead to another triggered instruction being issued. However, while the given triggered instruction specifies both condition information and state update information, some triggered instructions may only specify condition information.
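The way one instruction's state update can trigger another may be sketched as follows. The sketch assumes the state update information can be modelled as a pair of bit masks (`set_bits`, `clear_bits`) applied to a predicate bit vector, and that one instruction issues per step; these names and simplifications are illustrative assumptions only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    mask: int            # hypothetical: predicate bits inspected by the trigger
    expected: int        # hypothetical: required values of those bits
    set_bits: int = 0    # hypothetical state update: predicate bits to set
    clear_bits: int = 0  # hypothetical state update: predicate bits to clear


def step(pool, predicates: int):
    """Issue the first ready instruction and apply its state update."""
    for instr in pool:
        if (predicates & instr.mask) == instr.expected:
            new_state = (predicates | instr.set_bits) & ~instr.clear_bits
            return instr.name, new_state
    return None, predicates


def run(pool, predicates: int, max_steps: int = 10):
    """Step until nothing is ready; return the issue trace and final state."""
    trace = []
    for _ in range(max_steps):
        name, predicates = step(pool, predicates)
        if name is None:
            break
        trace.append(name)
    return trace, predicates


pool = [
    Instr("A", mask=0b01, expected=0b01, set_bits=0b10, clear_bits=0b01),
    Instr("B", mask=0b10, expected=0b10, clear_bits=0b10),
]

print(run(pool, 0b01))  # (['A', 'B'], 0)
```

Here instruction A clears its own trigger and sets the bit that B waits for, so B issues only after A.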

In accordance with the present technique, when the given triggered instruction comprises a triggered-producer instruction, the execution circuitry is responsive to completion of execution of a processing operation performed in response to the triggered-producer instruction to cause dependency state information to be updated to indicate that at least one corresponding triggered-consumer instruction can be issued for execution. The issue circuitry is configured to evaluate, when the given instruction comprises a triggered-consumer instruction, whether the at least one condition is determined or predicted to be met in dependence on both the machine state information and the dependency state information. Note that in this application, completion of a processing operation performed in response to an instruction is sometimes referred to as completion of execution of the instruction or completion of the instruction. This completion is normally indicated by the availability of results from the processing operation, for example.

Hence, in accordance with the present technique, the issue of triggered-consumer instructions is dependent not only on the machine state information (e.g. which is updated based on the state update information of executed triggered instructions, and may also be updated based on other events, such as hardware events), but also on the dependency state information (e.g. which is updated in response to completion of execution of a processing operation performed in response to a triggered-producer instruction). This differs from the way one might expect instructions to be issued in a typical TIA, where the conditions indicated by triggered instructions may depend only on the machine state information.

In examples of the present technique, the dependency state information provides a mechanism for expressing fine-grained dependencies between small groups of two or more triggered instructions. In particular, since the dependency state is updated in response to completion of a triggered-producer instruction (e.g. completion of a processing operation performed in response to the triggered-producer instruction), the present technique provides a mechanism for expressing dependencies where a particular triggered instruction (a triggered-consumer instruction) cannot be issued until execution of another triggered instruction (a triggered-producer instruction) has completed. In particular, such a dependency can be expressed in the present technique even in cases where the machine state is updated before completion of execution of an instruction. For example, the present technique may be especially useful in scenarios where execution of the triggered-producer instruction involves one or more long-latency operations (e.g. operations performed over multiple clock cycles), in which case completion of execution of the triggered-producer instruction may occur one or more cycles after the machine state is updated (e.g. because the long-latency operation(s) may take a greater number of cycles to complete than the number of cycles it takes to update the machine state).
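The timing problem described above can be sketched with a small cycle-count model. It assumes, for illustration only, that the producer issues at cycle 0, that its machine state update becomes visible at cycle 1, and that its result becomes available `producer_latency` cycles after issue; the function name and these timings are hypothetical.

```python
from typing import Optional


def consumer_issue_cycle(producer_latency: int,
                         use_dependency_state: bool,
                         max_cycles: int = 32) -> Optional[int]:
    """Cycle at which the consumer issues, for a producer issued at cycle 0."""
    machine_bit = False     # predicate written by the producer's state update
    dependency_bit = False  # set only when the producer's operation completes
    for cycle in range(max_cycles):
        if cycle == 1:
            machine_bit = True      # assumed: state update visible after one cycle
        if cycle == producer_latency:
            dependency_bit = True   # assumed: completion after `producer_latency` cycles
        ready = machine_bit and (dependency_bit or not use_dependency_state)
        if ready:
            return cycle
    return None


# Machine state alone: the consumer issues at cycle 1, before the
# 5-cycle producer has finished.
print(consumer_issue_cycle(5, use_dependency_state=False))  # 1

# With dependency state: the consumer waits for completion at cycle 5.
print(consumer_issue_cycle(5, use_dependency_state=True))   # 5
```

For a single-cycle producer both variants issue the consumer at the same cycle, which is consistent with the observation that the mechanism matters most for long-latency operations.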

In PC-based architectures, one might use barrier instructions to enforce dependencies between instructions. For example, some barrier instructions may prevent any instructions issued after the barrier from being executed before the barrier instruction (and any instructions issued before the barrier instruction) completes. However, barriers (also referred to as fences) are typically fairly restrictive (e.g. in the case of a full barrier, blocking execution of all instructions after the barrier, which can have a significant performance impact). Moreover, barriers typically rely on assuming sequential consistency of instructions (e.g. where program order implies execution order, at least for a single thread, from the perspective of a programmer). For example, in most PC-based architectures, there is a predefined program order, and even if out-of-order execution is supported, the result of executing the instructions is constrained to be equivalent to the result of executing the instructions in the program order.

However, instruction issue in a TIA is not naturally sequential: there is generally no predefined program order for triggered instructions in a TIA, and even in cases where some instructions are issued in sequence, the sequence itself is frequently triggered by the machine state meeting a specified condition. Hence, barrier instructions are not suitable for a TIA.

In a TIA, triggered-instruction dependencies are typically expressed in the instruction itself, for example in the condition information. Hence, one might assume that fine-grained instruction dependencies could be expressed using the condition information of triggered-consumer instructions. One might, therefore, consider it unnecessary to define dependency state information in addition to the machine state information. However, the inventors realised (as explained above) that this may not be effective in situations where execution of a triggered-consumer instruction is dependent on completion of execution of a triggered-producer instruction, particularly when execution of the triggered-producer instruction involves at least one long-latency operation.

Hence, the inventors of the present technique have proposed making the condition(s) indicated by the condition information of a triggered-consumer instruction dependent on dependency state information, in addition to machine state information, with the dependency state information being set in response to completion of execution of a triggered-producer instruction. This allows fine-grained instruction dependencies to be expressed in a TIA.

In some examples, the execution circuitry is permitted to cause the machine state information to be updated before completion of execution of the processing operation performed in response to the triggered-producer instruction.

In such examples, the machine state information may not reliably indicate when execution of the triggered-producer instruction has completed, and hence some producer-consumer instruction dependencies may not be expressed accurately using conditions that are based on the machine state information alone. Hence, the present technique can be particularly useful when the machine state information is permitted to be updated before execution of the triggered-producer instruction has completed, since the dependency state information, which is set on completion, provides an additional mechanism for expressing these dependencies.

In some examples, the apparatus comprises a set of predicate registers, the set of predicate registers including one or more predicate registers to store the machine state information.

There are many ways in which the machine state information could be stored, but in these examples the machine state is represented by one or more predicate registers. Note that these one or more predicate registers could be all of the registers in the set of predicate registers, or a proper subset (some but not all) of the registers in the set. In particular examples, each register is a one-bit register, but it is also possible for each register to hold more than one bit. The number of predicate registers and the number of bits that each predicate register can hold are implementation dependent. Note that the number of conditions which can be represented by the machine state is dependent on the number of bits available in the one or more predicate registers for holding the machine state.

In some examples, the set of predicate registers comprises at least one predicate register to store the dependency state information, and the issue circuitry is configured to determine that the at least one condition indicated by the condition information of the given triggered instruction is met when values stored in at least a subset of the predicate registers match expected values indicated by the condition information.

Hence, in these examples, at least one of the predicate registers in the set (e.g. other than the one or more predicate registers used for representing the machine state information) is used to represent the dependency state information. This approach can be advantageous, because it makes use of circuitry that may already exist in a PE based on a TIA. Moreover, this approach allows the conditions specified by triggered-instructions to remain dependent on the same set of predicate registers, regardless of whether the triggered-instructions are triggered-consumer instructions. It should be noted that the number of bits (and hence the number and size of predicate registers) used to represent the dependency state information is not particularly limited. Providing more bits for representing the dependency state information increases the number of instruction dependencies that can be represented at any given time, but may also reduce the number of bits available for representing the machine state information. Hence, for a given number of bits in the set of predicate registers, there is a trade-off between providing more bits for representing the machine state and providing more bits for representing the dependency information. However, the inventors of the present technique realised that there are often some unused predicate registers in a typical TIA PE.

In some alternative examples, the apparatus comprises dependency tag storage circuitry to store the dependency state information, and the issue circuitry is configured to determine that the at least one condition indicated by the condition information of the given triggered instruction is met when:
* values stored in at least a subset of the predicate registers match expected values indicated by the condition information; and
* the dependency state information stored in the dependency tag storage circuitry matches expected dependency state information indicated by the condition information.

While the example above makes use of the predicate registers for representing the dependency state information, this example provides additional storage circuitry to store the dependency state information as one or more dependency tags, separate from the predicate registers. This can be advantageous, since it leaves the entire set of predicate registers available for representing the machine state.
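The two storage options can be contrasted in a short sketch. It assumes predicates are held as a bit vector and, for the second option, that dependency tags are modelled as a set of labels with "matches" interpreted as "all expected tags are present"; both the encoding and that interpretation are assumptions made for illustration.

```python
def met_with_shared_predicates(predicates: int, mask: int, expected: int) -> bool:
    """Option 1: dependency bits live in otherwise unused predicate registers,
    so a single masked comparison covers machine state and dependency state."""
    return (predicates & mask) == expected


def met_with_dependency_tags(predicates: int, mask: int, expected: int,
                             tags: set, expected_tags: set) -> bool:
    """Option 2: predicates hold machine state only; dependency state is kept
    in separate tag storage and checked alongside the predicate match."""
    return (predicates & mask) == expected and expected_tags <= tags


# Option 1: bit 3 is (hypothetically) reserved as a dependency bit.
DEP_BIT = 0b1000
print(met_with_shared_predicates(0b0001, mask=0b0001 | DEP_BIT,
                                 expected=0b0001 | DEP_BIT))  # False
print(met_with_shared_predicates(0b1001, mask=0b0001 | DEP_BIT,
                                 expected=0b0001 | DEP_BIT))  # True

# Option 2: all predicate bits remain available for machine state.
print(met_with_dependency_tags(0b0001, 0b0001, 0b0001,
                               tags=set(), expected_tags={"t0"}))   # False
print(met_with_dependency_tags(0b0001, 0b0001, 0b0001,
                               tags={"t0"}, expected_tags={"t0"}))  # True
```

The first option reuses the existing comparison logic at the cost of predicate bits; the second keeps every predicate bit for machine state at the cost of extra storage and a second comparison.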

In some examples, when the given instruction comprises the triggered-producer instruction, the execution circuitry is responsive to the completion of execution of the processing operation performed in response to the triggered-producer instruction to issue a dependency state update signal to cause the dependency state information to be updated to indicate that the at least one corresponding triggered-consumer instruction can be issued for execution.

This is an example of the mechanism by which the dependency state information may be updated.

In some particular examples, the apparatus comprises delay circuitry responsive to the dependency state update signal to cause a delay to be introduced between the dependency state update signal being issued and the dependency state information being updated.

Introduction of a delay in this way (which may, for example, be a delay of a predetermined number of clock cycles between issue of the dependency state update signal and updating of the dependency state information) can be advantageous since it can allow other instructions (other than the triggered-consumer instruction whose issue is dependent on the dependency state information) to proceed before the triggered-consumer instruction is issued. For example, these other instructions could be instructions whose issue is triggered by the update applied to the machine state information. This can avoid a situation where the intervening non-consumer instructions have a low likelihood of being issued before issue of the triggered-consumer instruction due, for example, to the triggered-producer instruction completing more quickly and/or more frequently than expected.

In some examples, the delay circuitry is responsive to the dependency state update signal to cause the delay to be introduced unless a time between the triggered-producer instruction being issued and the triggered-producer instruction being completed is determined to be greater than a predetermined threshold duration.

As noted above, introducing a delay between issue of the dependency state update signal and updating of the dependency state information can be particularly useful when the triggered-producer instruction completes quickly (e.g. without significant latency). However, the inventors realised that if the triggered-producer instruction does not complete quickly (e.g. when the triggered-producer instruction requires execution of one or more long-latency operations), there is a possibility for the delay between completion of the triggered-producer instruction and issue of the triggered-consumer instruction to be longer than is needed to issue all of the intervening non-consumer instructions. This could lead to down time, where the execution circuitry is available but is not executing any instructions, which can negatively impact the performance of the apparatus. Hence, in this example, the delay is not introduced unless the time (e.g. this could be measured as a number of clock cycles) between issue and completion of the triggered-producer instruction is less than or equal to the predetermined threshold duration (which also could, for example, be a threshold number of clock cycles). The delay circuitry may, in this example, also be referred to as dependency state update delay circuitry.
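The threshold rule described above can be expressed as a small function. It assumes, for illustration, that time is counted in whole clock cycles and that the dependency state update signal is issued in the cycle in which the producer completes; the function and parameter names are hypothetical.

```python
def dependency_update_cycle(issue_cycle: int, complete_cycle: int,
                            delay: int, threshold: int) -> int:
    """Cycle at which the dependency state is actually updated.

    The delay is applied only when the producer's issue-to-completion time is
    less than or equal to the threshold; a slower producer has already left
    room for the intervening non-consumer instructions.
    """
    latency = complete_cycle - issue_cycle
    if latency > threshold:
        return complete_cycle        # long-latency producer: no extra delay
    return complete_cycle + delay    # fast producer: hold back the update


# Fast producer (2 cycles, threshold 3): update delayed by 4 cycles.
print(dependency_update_cycle(0, 2, delay=4, threshold=3))   # 6

# Slow producer (10 cycles): update applied immediately on completion.
print(dependency_update_cycle(0, 10, delay=4, threshold=3))  # 10
```

A producer whose latency exactly equals the threshold is still delayed in this sketch, matching the "less than or equal to" wording above.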

In some examples, the predetermined threshold duration is based on a number of cycles required to issue one or more selected non-consumer instructions after issuing the triggered-producer instruction.

In this way, the predetermined threshold duration can be set such that there is expected to be enough time between issue of the triggered-producer instruction and issue of the triggered-consumer instruction for all of the selected non-consumer instructions to be issued. Note that, in particular examples, the customary execution latencies of the non-consumer instructions could be taken into account as well.

In some examples, the predetermined threshold duration is configurable by software. Hence, the threshold duration can be variable, and can be set by a programmer.

In some examples, the execution circuitry is responsive to the state update information to issue a machine state update signal to cause the machine state information to be updated in dependence on the state update information.

This is an example of the mechanism by which the machine state information can be updated.

In some particular examples, the apparatus comprises delay circuitry responsive to the machine state update signal to cause a delay to be introduced between the machine state update signal being issued and the machine state information being updated.

Hence, while delay circuitry can be provided to introduce a delay between issue of the dependency state update signal and updating of the dependency state information (as explained above), delay circuitry can also (or instead) be provided to introduce a delay between issuing of the machine state update signal and updating the machine state information. The delay circuitry may, in this example, also be referred to as machine state update delay circuitry.

Note that, in examples where both the dependency state update delay circuitry and the machine state update delay circuitry are provided, these could be separate circuits (e.g. separate hardware), or the same delay circuitry could perform both functions. In any case, introduction of these delays can be particularly useful for instruction scheduling, e.g. where multiple triggered-instructions are triggered by the same condition of the machine state.

In some examples, the length of the delay is configurable by software.

Note that this can apply to the delay introduced by the dependency state update delay circuitry and/or the delay introduced by the machine state update delay circuitry.

In some examples, the at least one corresponding triggered-consumer instruction comprises a triggered instruction whose execution is dependent on completion of execution of the processing operation performed in response to the triggered-producer instruction. In this application, the term "triggered-producer instruction" generally refers to a triggered instruction which is expected to complete before one or more corresponding "triggered-consumer" instructions can be issued.

The techniques discussed above can be implemented in a hardware apparatus which has circuit hardware implementing the triggered-instruction processing element(s), execution circuitry, candidate instruction storage circuitry and issue circuitry described above, which support the triggered-producer and triggered-consumer instructions as part of the native instruction set architecture supported by the decode circuitry and processing circuitry. However, in another example the same techniques may be implemented in a computer program (e.g. an architecture simulator or model) which may be provided for controlling a host data processing apparatus to provide an instruction execution environment for execution of instructions from target code. The computer program may include instruction decoding program logic for decoding instructions of the target code so as to control a host data processing apparatus to perform data processing. Also, the program may include register maintenance program logic which maintains a data structure (within the memory or architectural registers of the host apparatus) which represents (emulates) the architectural registers (including any predicate registers) of the instruction set architecture being simulated by the program. The emulated registers may include any of the plurality of predicate registers described in some examples above. The program may also include issue program logic to emulate the issue circuitry described above, and candidate instruction storage program logic to maintain a candidate instruction storage data structure to emulate the candidate instruction storage circuitry described above. The instruction decoding program logic includes support for the triggered instructions described above, including the triggered-producer and triggered-consumer instructions, which have the same functionality as described above for the hardware example. 
Hence, such a simulator computer program may present, to target code executing on the simulator computer program, a similar instruction execution environment to that which would be provided by an actual hardware apparatus capable of directly executing the target instruction set, even though there may not be any actual hardware providing these features in the host computer which is executing the simulator program. This can be useful for executing code written for one instruction set architecture on a host platform which does not actually support that architecture. Also, the simulator can be useful during development of software for a new version of an instruction set architecture while software development is being performed in parallel with development of hardware devices supporting the new architecture. This can allow software to be developed and tested on the simulator so that software development can start before the hardware devices supporting the new architecture are available. The computer program may be provided on a computer-readable storage medium, which could be transitory or non-transitory.

Particular embodiments will now be described with reference to the figures.

Figure 1 shows an example of a data processing apparatus 10 in which one or more TIA PEs may be implemented. In particular, Figure 1 schematically illustrates a data processing apparatus 10 arranged as a spatial architecture according to various examples of the present techniques. Spatial architectures can accelerate some applications by unrolling or unfolding the computations, which form the most time-consuming portion of program execution, in space rather than in time. Computations are unrolled in space by using a plurality of hardware units capable of concurrent operation. In addition to taking advantage of the concurrency opportunities offered by disaggregated applications which have been spread out on a chip, spatial architectures, such as data processing apparatus 10, also take advantage of distributed on-chip memories. In this way, each processing element is associated with one or more memory blocks in close proximity to it. As a result, spatial architectures can circumvent the von Neumann bottleneck which hinders performance of many traditional architectures.

The data processing apparatus 10 comprises an array of processing elements (compute/memory access clusters) connected via an on-chip communication interconnect, such as a network on chip. The network is connected to a cache hierarchy or main memory via interface nodes, which are otherwise referred to as interface tiles (ITs) and are connected to the network via multiplexers (X). Each processing element comprises one or more compute tiles (CTs) and a memory tile (MT). While Figure 1 shows a 1:1 mapping between CTs and MTs, other examples could share a MT between more than one CT. The CTs perform the bulk of the data processing operations and arithmetic computations performed by a given processing element (PE). The MTs act as memory access control circuitry and have the role of performing data accesses to locally connected memory (local storage circuitry), data transfers to/from the more remote regions of memory, and inter-processing element memory transfers between the processing element and other processing elements.

In some example configurations each of the PEs comprises local storage circuitry connected to each memory control circuit (MT) and each memory control circuit (MT) has direct connections to one processing circuit (CT). Each PE is connected to a network-on-chip which is used to transfer data between memory control circuits (MTs) and between each memory control circuit (MT) and the interface node (IT).

In alternative configurations local storage circuitry is provided between plural processing elements and is accessible by multiple memory control circuits (MTs). Alternatively, a single MT can be shared between plural CTs.

The processing circuitry formed by the respective compute/memory access clusters (CTs/MTs) shown in Figure 1 may, for example, be used as a hardware accelerator to accelerate certain processing tasks, such as machine learning processing (e.g. neural network processing), encryption, etc. The ITs may be used to communicate with other portions of a system on chip (not shown in Figure 1), such as memory storage and other types of processing unit (e.g. central processing unit (CPU) or graphics processing unit (GPU)). Configuration of control data used to control the operation of the CTs/MTs may be performed by software executing on a CPU or other processing unit of the system.

The CTs (or the cluster of CTs and MTs as a whole) can be seen as triggered-instruction processing elements, which execute instructions according to a triggered instruction architecture, rather than a program-counter-based architecture.

In a conventional program-counter-based architecture, a program counter is used to track sequential stepping of program flow through a program according to a predefined order defined by the programmer or compiler (other than at branch points marked by branch instructions). The correct sequence through the program is sequential other than at the branch points. At a branch point there are only two options for the next step in the program flow (taken or not-taken). Although a processor implementation may use techniques such as out of order processing and speculation to execute instructions in a different order from the program order, the results generated must be consistent with the results that would have been generated if the instructions were executed in program order.

In contrast, for a triggered instruction architecture (TIA), a number of triggered instructions (also referred to as condition-dependent instructions) are defined by the programmer or compiler which have no predefined order in which they are supposed to be executed. Instead, each triggered instruction specifies the trigger conditions to be satisfied by the machine state of the processing element for that instruction to validly be issued for execution. In a given cycle of determining the next instruction to issue, a triggered-instruction processing element can monitor multiple triggered instructions in the same cycle to check whether their trigger conditions are satisfied (rather than examining, at most, the conditions for taking or not-taking a single branch instruction in order to determine the next instruction to be executed after the branch, as in a program-counter based architecture).

It is possible for a triggered-instruction processing element to use speculation to predict which instructions will satisfy their respective trigger conditions, so that instructions can be issued before the trigger conditions are actually satisfied. This helps allow a processing pipeline to be more fully utilised (compared to the case in the absence of speculation, when the processing element waits for a given instruction to update the machine state before evaluating whether the machine state satisfies trigger conditions for another instruction). Such speculation can help to improve performance. However, even if speculation is used so that instructions are issued for execution before their trigger conditions are actually satisfied, the end result should be consistent with the result that would have been achieved if the update to machine state by one instruction was made before evaluating the trigger conditions for selecting the next instruction to be issued for execution. Hence, if the speculation was incorrect and an instruction was issued for execution but it is determined later that the trigger conditions for that instruction were not satisfied, then a recovery operation may be performed to flush results which could be incorrect and resume execution from a correct point prior to the misspeculation.

Event-driven (triggered) spatial architectures such as this can reduce the control flow overhead in program execution and effectively map out applications into "space" rather than time alone. In a typical event-driven spatial architecture, the PEs are configured for each specific application, which entails loading the instructions into the instruction memory of each PE and loading configuration settings into control registers. Two key goals of this established event-driven approach are to (1) reduce the complexity and physical area of the hardware used to issue instructions and (2) reduce the number of instructions required to manage program control-flow through issuing instructions based on data availability. In these many-core systems, PE area is a primary design constraint.

However, while a spatial architecture is an example of a situation in which a TIA can be particularly useful, it should be noted that TIAs are not limited to spatial architectures such as that shown in Figure 1.

Figure 2 illustrates an example of a triggered instruction. The lower part of Figure 2 illustrates an example of fields of an instruction encoding, while the upper part of Figure 2 shows information specified (in high level code representation) for an example of a triggered instruction by a programmer/compiler. The triggered instruction specifies:

* trigger condition information 40 (e.g. a condition field indicative of at least one condition) indicating one or more trigger conditions which are to be satisfied by machine state of the processing element for the instruction to be validly issued for execution;

* an opcode 42 identifying the type of processing operation to be performed in response to the instruction (e.g. add in the high-level code example of Figure 2);

* one or more operands 44 for the processing operation;

* a destination location 46 to which the result of the processing operation is to be output; and

* trigger action information 48 (state update information) indicating one or more updates to machine state of the processing element to be made in response to the execution of the triggered instruction.

It will be appreciated that while the fields of the instructions are shown in a particular order in Figure 2, other implementations could order the fields differently. Also, information shown as a single field in the encoding of Figure 2 could be split between two or more discontiguous sets of bits within the instruction encoding.
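By way of illustration only, the fields listed above could be modelled in software (for example in an architecture simulator) as a simple record. The following Python sketch is not the encoding of Figure 2: the class name `TriggeredInstruction`, its field names, the predicate bit values and the `%o1.0xF` spelling of the tagged destination are all assumptions made for the example. Only the opcode `add`, the operands `%r3` and `%i0`, the input condition `%i0.0` and the output channel `%o1` with tag 0xF are taken from the description.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TriggeredInstruction:
    # Trigger condition information 40: predicate bits that must be set.
    condition_predicates: int
    # Trigger condition information 40: (input channel, required tag or None).
    condition_channel: Optional[Tuple[int, Optional[int]]]
    opcode: str                      # opcode 42
    operands: Tuple[str, ...]        # operands 44
    destination: str                 # destination location 46
    # Trigger action information 48: predicate bits to set on execution.
    action_predicates: int


# Hypothetical instance loosely following the high-level example of Figure 2.
example = TriggeredInstruction(
    condition_predicates=0b0000_0001,   # assumed value, not from the figure
    condition_channel=(0, 0),           # "%i0.0": tag 0 on input channel 0
    opcode="add",
    operands=("%r3", "%i0"),
    destination="%o1.0xF",              # assumed spelling: channel %o1, tag 0xF
    action_predicates=0b0000_0010,      # assumed value, not from the figure
)
```

A record of this kind keeps the condition and action information separate from the computation fields, mirroring the way the issue circuitry only needs the condition information while the execution circuitry needs the remainder.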

In this example, the trigger condition information includes predicate information and input channel availability information. The predicate information and input channel availability information could be encoded separately in the trigger condition information, or represented by a common encoding.

The predicate information specifies one or more events which are to occur for the instruction to be validly issued. Although other encodings of the predicate information are also possible (e.g. with each value of the predicate information representing a certain combination of events that are to occur, not necessarily with each event represented by a separate bit in the encoding), a relatively simple encoding can be for each bit of the predicate indication to correspond to a different event and indicate whether that event is required to have occurred for the instruction to be validly issued for execution. Hence, if multiple bits are set, the trigger conditions require each of those events to occur for the instruction to be issued. An "event" represented by the predicate information could, for example, be any of:

* occurrence of a hardware-signalled event (e.g. a reset, an interrupt, a memory fault, or an error signal being asserted).

* a buffer full/empty event caused by one of the buffer structures described below becoming full or empty.

* a software-defined event which has no particular hardware-defined meaning. Software can use such predicate bits to impose ordering restrictions on instructions. For example, if a first instruction should not be executed until a second instruction has executed, the second instruction can specify (in its trigger action information 48) that a selected predicate bit should be set in response to the second instruction, and the first instruction can specify (in its trigger condition information 40) that the selected predicate bit should be set in order for the first instruction to validly be issued for execution.
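The bit-per-event encoding and the software-defined ordering example can be sketched as follows. This Python fragment assumes the simple encoding in which each set bit of the condition names an event that must have occurred and clear bits are ignored; the function name `predicates_satisfied` and the choice of bit 4 as the ordering bit are hypothetical.

```python
def predicates_satisfied(required_mask: int, predicate_state: int) -> bool:
    """True when every bit set in required_mask is also set in predicate_state."""
    return (predicate_state & required_mask) == required_mask


ORDER_BIT = 1 << 4          # hypothetical software-defined predicate bit

state = 0b0000_0000
first_condition = ORDER_BIT   # first instruction requires the bit to be set
second_action = ORDER_BIT     # second instruction sets the bit when executed

before = predicates_satisfied(first_condition, state)
state |= second_action        # trigger action of the second instruction
after = predicates_satisfied(first_condition, state)
```

With this encoding the first instruction cannot be validly issued until the second instruction has applied its trigger action, which is the ordering restriction described above.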

The meaning of particular predicate bits may also depend on control state stored in a configuration register or other configuration storage, which affects the interpretation of the predicate bits. For example, Figure 2 shows an 8-bit predicate field which allows for 256 different combinations of events to be encoded (e.g. a combination of 8 different events in any independent combination of ON/OFF settings for each event if a bit per event is allocated, or 256 more arbitrary combinations of events if the encoding does not allocate a separate bit per event). The configuration register may store control state specifying which sets of events are represented by each encoding, selecting events from a larger set of events supported in hardware.

The trigger action information 48 can be defined using output predicates in a corresponding way to the input predicates defined for the trigger condition information 40. As noted above, the predicate bits (also referred to as condition information) can be set in response to execution of an instruction. It should be noted that the timing of setting the predicate bits is not particularly limited, and it can take place at any time during execution of the instruction - for example, the predicate bits may be set in parallel with executing any processing operations associated with the instruction. This can mean that the predicate bits are set before execution of the instruction has completed - for example, this could happen if the processing operation being executed in response to the instruction is a relatively long-latency operation (meaning that it takes a significant number (e.g. more than 1) of clock cycles to complete).

A given triggered-instruction processing element (CT) may receive input data from a number of input channels, where each input channel may be a physical signal path receiving input data from a particular source. The source of the input data could be, for example, the memory tile MT associated with that CT or a MT shared between a cluster of CTs including the given CT, or could be the on-chip network linking with other sets of CTs, or could be a dedicated signal path (separate from the main network on chip between CTs) between a particular pair of CTs or cluster of CTs. As shown in Figure 3, a given input channel n receives tagged data items 50 comprising a tag value 52 and data value 54. The tag value 52 is an identifier used to identify the purpose of the data and can be used by the triggered-instruction processing element (CT) to control the triggering of triggered instructions.

Hence, as shown in Figure 2, the trigger condition information 40 could also include an input data availability condition which indicates that valid issue of the instruction also depends on availability of input data on a particular input data channel. For example, the high level instruction shown at the top of Figure 2 indicates in its trigger conditions an identifier "%i0.0" signifying that valid issue requires availability of input data having a particular tag value "0" on a particular input channel (%i0). Of course, the indication "%i0.0" is just an example representation of this information at a high level and it will be appreciated that, in the instruction encoding itself, the trigger condition information 40 may encode in other ways the fact that triggering of the instruction depends on input data availability of data having a specified tag value on a specified input channel. It is not essential to always specify a particular tag value required to be seen in order for the trigger conditions to be satisfied. The triggered instruction architecture may also support the ability for the instruction to be triggered based on availability of input data (with any tag value) on a specified input channel.
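As a purely illustrative sketch, the input data availability condition could be evaluated as below. The Python fragment assumes that each input channel behaves as a first-in-first-out queue of (tag, data) pairs and that only the item at the head of the queue is examined; neither assumption is stated in the description, and the name `channel_condition_met` is hypothetical.

```python
from collections import deque
from typing import Deque, Optional, Tuple

TaggedItem = Tuple[int, int]   # (tag value 52, data value 54)


def channel_condition_met(channel: Deque[TaggedItem],
                          required_tag: Optional[int] = None) -> bool:
    """Check whether the head of an input channel satisfies the condition.

    required_tag=None models triggering on input data with any tag value.
    """
    if not channel:
        return False
    head_tag, _ = channel[0]
    return required_tag is None or head_tag == required_tag


i0: Deque[TaggedItem] = deque()
empty_ready = channel_condition_met(i0, 0)     # nothing has arrived yet
i0.append((0x5, 42))                           # item with tag 0x5 arrives
any_tag_ready = channel_condition_met(i0)      # any tag is acceptable
tag0_ready = channel_condition_met(i0, 0)      # "%i0.0" requires tag 0
tag5_ready = channel_condition_met(i0, 0x5)    # "%i0.0x5" requires tag 0x5
```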

The operands 44 for the triggered instruction can be specified in different ways. While Figure 2 shows an instruction having two operands, other instructions may have fewer operands or a greater number of operands. An operand can be identified as being stored in a register addressable using the local register address space of the triggered-instruction processing element (CT). See for example the operand identified using the identifier "%r3" in Figure 2, indicating that the operand is to be taken from register number 3. Also, an operand can be identified as being the data value taken from a particular input channel, such as input channel "%i0" as shown in Figure 2. Again, while Figure 2 shows the generic case where any data from input channel %i0 may be processed by the instruction, it may also be possible to specify that data having a particular tag value should be used as the operand (e.g. the operand could be specified as %i0.0x5, indicating that the operand is the data value having tag 0x5 on input channel %i0).

Similarly, the destination location 46 for the instruction could be either a register in the CT's local register address space or (as in the example of Figure 2) an indication of an output data channel onto which the result of the instruction should be output. The output data channel may be a signal path passing data to the same CT or another CT, or to the CT's or other CT's MT, or to the network on chip. The destination location 46 can identify a tag value to be specified in the tagged data item 50 to be output on the output channel. For example, the instruction in Figure 2 is specifying that a data value tagged with tag value 0xF should be output on output channel %o1.

Figure 4 illustrates an example of circuitry included in a given triggered-instruction processing element (in particular, the CT of the processing element) for processing triggered instructions. Triggered-instruction storage circuitry 11 includes a number of storage locations 60 for storing respective triggered instructions. The trigger condition information 40 of those instructions is made available to instruction issuing circuitry 12 which analyses whether the trigger conditions 40 for the pool of triggered instructions are determined to be satisfied by the machine state and dependency state 22 (and, if applicable for a given instruction, also determines whether the trigger conditions are satisfied based on input channel data availability status of input channel data which has been received from input channels and is being held in input channel data holding storage 18). The machine state and dependency state 22 used to evaluate trigger conditions may include hardware event signals indicating whether various hardware events have occurred, as well as predicate indications set based on trigger actions from previous triggered instructions as discussed earlier. Interpretation of the predicates may depend on configuration information stored in a trigger condition/action configuration register 20.

Some examples may support speculative issue of triggered instructions, in which case the instruction issuing circuitry 12 includes condition prediction circuitry 30 for predicting whether the trigger conditions for a given triggered instruction will be satisfied. The prediction can be based on prediction state maintained based on outcomes of previous attempts at executing the instructions (e.g. the prediction state may correlate an earlier event or identification of an earlier instruction with an identification of a later set of one or more instructions expected to be executed some time after the earlier event or instruction). If the prediction is incorrect and an instruction is incorrectly issued despite its trigger conditions not turning out to be satisfied, then the effects of the instruction can be reversed (e.g. by flushing the pipeline and resuming processing from a previous correct point of execution).

If multiple ready-to-issue triggered instructions are available, which each are determined or predicted to have their trigger conditions satisfied in the same cycle of selecting a next instruction to issue, the instruction issuing circuitry 12 selects between the ready-to-issue triggered instructions based on a predefined priority order. For example, the priority order may be in a predetermined sequence of the storage locations 60 for the triggered-instruction storage circuitry 11 (with the instructions being allocated to those storage locations 60 in an order corresponding to the order in which the instructions appear in the memory address space from which those instructions are fetched - hence the programmer or compiler may influence the priority order by defining the order in which the instructions appear in memory). Alternatively, explicit priority indications may be assigned to each instruction to indicate their relative priority.
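The two selection policies mentioned above can be sketched in a few lines of Python. The sketch assumes that a lower storage location index means higher priority in the first policy and that a larger number means higher priority in the second; both conventions, and the function names, are assumptions made for illustration.

```python
from typing import Optional, Sequence


def select_by_storage_order(ready_flags: Sequence[bool]) -> Optional[int]:
    """Return the index of the first ready storage location, or None."""
    for index, ready in enumerate(ready_flags):
        if ready:
            return index
    return None


def select_by_explicit_priority(ready_flags: Sequence[bool],
                                priorities: Sequence[int]) -> Optional[int]:
    """Return the ready index with the largest explicit priority value.

    Ties are resolved in favour of the lowest storage location index.
    """
    candidates = [i for i, ready in enumerate(ready_flags) if ready]
    if not candidates:
        return None
    return max(candidates, key=lambda i: priorities[i])


storage_choice = select_by_storage_order([False, True, True])
explicit_choice = select_by_explicit_priority([True, True, False], [1, 5, 9])
```

In the second call, location 2 has the largest priority value but is not ready, so location 1 is selected.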

When a triggered instruction is selected for issue, it is sent to the execution circuitry 14 of the processing element (CT), which comprises a number of execution units 15 for executing instructions of different types or classes. For example, execution units 15 could include an adder to perform addition/subtraction operations, a multiplier to perform multiplication operations, etc. Operands for a given operation performed by the execution circuitry 14 can be derived either from input channel data from the input channel data holding storage 18, or from register data read from local register storage 16 of the processing element (or, as mentioned below, from further register storage in an input processing block which is accessible based on a register address in the register address space used to access the local register storage 16). The result of a given operation performed by the execution circuitry can be output either as output channel data 17 to be output over a given output channel (to the same CT or other CTs, those CTs' associated MTs, or the network on chip) or could be written to a destination register of the local register storage 16 (or to the register storage in the input processing block). In addition to outputting the computational result of the executed instruction, the execution circuitry 14 also updates the machine state based on any trigger action specified by the trigger action information 48 of the executed instruction (e.g. one or more predicate bits may be set or cleared as specified by the trigger action information 48), and may also update the dependency state, as will be discussed in more detail below.

Hence, since a triggered instruction specifies the conditions required for its own valid processing and can also perform a computation operation in addition to setting the predicates for controlling subsequent program flow, there is no need for dedicated branch instructions which only control program flow but do not carry out a corresponding computation operation. This helps to increase the compute density (amount of computational workloads achieved per instruction) and hence can improve performance.

Triggered spatial processing elements (PEs) typically have several input (and output) channels through which packets of data are fed into (and out of) them. The input packets comprise tagged data values 50 having a tag 52 and data 54 as shown in Figure 3. The tag changes the system conditions, represented as predicate bits, and can therefore result in a specific instruction being triggered, based on the value of the tag. An advantage of the triggered instruction paradigm is that it reacts to incoming data streams efficiently, based on data availability.

As explained above, the machine state is permitted to be set based on state update information (trigger action information) specified by an instruction before execution of the instruction has completed. As a result, this can mean that the predicate bits in a triggered instruction cannot always accurately represent dependencies between instructions, particularly in the case of producer-consumer instruction dependencies (where one or more triggered-consumer instructions are not permitted to be issued until execution of a corresponding triggered-producer instruction has completed). Figure 5 illustrates this issue.

Figure 5 shows an example of a pair of instructions - a bulk data move instruction (bstw.w) and a remote store instruction (rstw.w) - where a programmer may wish to make execution of the second instruction (rstw.w) dependent on the first instruction (bstw.w) completing. Figure 5 shows how one might expect to represent this dependency in a typical TIA. In particular, the first (bstw.w) instruction specifies values 48 to which the predicates (part of the machine state 22) are to be set when the instruction is executed, and the second (rstw.w) instruction specifies values 40 to which the predicates are expected to be set in order for the instruction to be issued. In this case, the first instruction sets the predicates to the same values (0000_1011 - i.e. predicate bits p0, p1 and p3 being set, where p0 is the right-most bit in this representation) as are required by the condition information 40 of the second instruction, meaning that issue of the second instruction is dependent on execution of the first instruction.

However, the inventors realised that since the predicates are permitted to be set before execution of the first instruction completes, the predicates do not provide a mechanism for expressing a dependency where issue of the second instruction depends on execution of the first instruction completing. For example, the processing operations associated with the bstw.w instruction may be associated with a relatively long latency, which could mean that execution of these operations has not completed by the time the predicates are updated; this may mean that the rstw.w instruction is issued while execution of the bstw.w instruction is ongoing. In some examples, this could be problematic - for example, if execution of the bstw.w instruction stalls or fails after execution of the rstw.w instruction has begun. Accordingly, the bstw.w and rstw.w instructions are examples of instructions for which implementation of the present technique could be useful. Note, however, that these are just examples of instructions for which the present technique could be implemented - in practice, the present technique could be applied to any producer/consumer groups of instructions in a triggered architecture, particularly where the producer instructions have a long latency.

Figures 6 and 7 illustrate examples of how the bstw.w and rstw.w instructions can be implemented as triggered-producer and triggered-consumer instructions, in accordance with the present technique.

In Figure 6, the bstw.w instruction is modified so that it requires a specified one of the predicates (p7) to be set on completion of execution of the instruction. This update is in addition to setting the values of the other predicate registers (p0 to p6) in response to the state update information 48. The condition information specified by the rstw.w instruction is then modified to require predicate bit p7 to be set in addition to predicate bits p0, p1 and p3 (i.e. the condition information for the rstw.w instruction requires the predicate registers to be set to 1000_1011).

In the example of Figure 6, predicate bit p7 represents dependency state information, and the dependency state information (which is set in response to a triggered-producer instruction - bstw.w in this case - completing execution) indicates that a corresponding triggered-consumer instruction (rstw.w in this case) can be issued. This dependency state information thus provides a mechanism to express producer-consumer instruction dependencies.
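The difference between the conditions of Figure 5 and Figure 6 can be made concrete with a small timing model. The Python sketch below assumes that the producer's state update is applied in the cycle in which it issues (cycle 0) and that the producer has a latency of 8 cycles; both values are hypothetical, as is the function name `first_ready_cycle`. Only the bit patterns 0000_1011 and 1000_1011 come from the description.

```python
P_STATE = 0b0000_1011   # p0, p1, p3: set by the state update information 48
P7_DEP = 0b1000_0000    # p7: set only when the producer completes execution


def first_ready_cycle(consumer_condition: int, producer_latency: int,
                      max_cycles: int = 100):
    """Return the first cycle at which the consumer's condition is met.

    Cycle 0: the producer issues and its state update sets p0, p1 and p3.
    Cycle producer_latency: the producer completes and p7 is set.
    """
    predicates = 0
    for cycle in range(max_cycles):
        if cycle == 0:
            predicates |= P_STATE
        if cycle == producer_latency:
            predicates |= P7_DEP
        if (predicates & consumer_condition) == consumer_condition:
            return cycle
    return None


figure5_ready = first_ready_cycle(0b0000_1011, producer_latency=8)
figure6_ready = first_ready_cycle(0b1000_1011, producer_latency=8)
```

Under these assumptions the Figure 5 condition is already met in cycle 0, while the producer is still executing, whereas the Figure 6 condition is not met until the producer completes in cycle 8.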

Note that additional non-consumer instructions may be triggered between the setting of predicate registers p0 to p6 (in response to the state update information 48) and the setting of predicate register p7 (in response to the completion of the bstw.w instruction). These instructions (which may, for example, specify condition information that requires the predicate registers to be set to 0000_1011) may then be executed concurrently with the bstw.w instruction. In the example of Figure 6, a single predicate bit is used to represent the dependency state information. However, it should be appreciated that the number of different instruction dependencies that can be represented at any given time can be increased by increasing the number of predicate bits that are made available for representing the dependency state information.

One might consider it counter-intuitive to use some of the predicate bits to represent dependency state information, since this can reduce the number of predicate bits that are available for representing the machine state. Extra predicate registers can be provided, but this may increase the circuit area required for the PE, and may also increase the number of bits that need to be provided within the encoding of each instruction to represent the predicate information - this is not ideal, since there can be significant pressure for encoding space within an instruction. However, the inventors realised that in a typical TIA, there are often spare predicate bits; the example shown in Figure 6 thus takes advantage of these extra predicate bits (both in the predicate registers and in the encoding of the triggered-producer and triggered-consumer instructions) to represent the dependency state information.

Figure 7 illustrates an alternative representation of triggered-producer and triggered-consumer instructions. In this example, it is assumed that one or more tag bits (e.g. completion flags) are provided alongside (in addition to) the predicate registers, and the dependency state information is represented using these tag bits. Hence, in Figure 7, the bstw.w instruction is modified so that the tag bit p.1 is set when execution of the instruction completes, and the condition information of the rstw.w instruction is modified so that it requires the tag bit to be set in addition to predicate registers p0, p1 and p3 being set. In this example, more predicate bits are available for representing the machine state than in the example in Figure 6, but additional circuitry may be needed to represent the tag bits.
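A software model of the Figure 7 arrangement might keep the tag bits in a field separate from the predicate registers, as in the following Python sketch. The class names `TriggerState` and `Condition`, the single-bit tag mask and the function `condition_met` are assumptions for illustration; the sketch also assumes that an instruction which is not a consumer simply leaves its tag mask at zero.

```python
from dataclasses import dataclass


@dataclass
class TriggerState:
    predicates: int = 0        # predicate registers (machine state)
    dependency_tags: int = 0   # separate tag bits (dependency state)


@dataclass(frozen=True)
class Condition:
    predicate_mask: int
    tag_mask: int = 0          # zero for instructions that are not consumers


def condition_met(cond: Condition, state: TriggerState) -> bool:
    """A consumer needs both its predicate bits and its tag bits to be set."""
    preds_ok = (state.predicates & cond.predicate_mask) == cond.predicate_mask
    tags_ok = (state.dependency_tags & cond.tag_mask) == cond.tag_mask
    return preds_ok and tags_ok


state = TriggerState()
consumer = Condition(predicate_mask=0b0000_1011, tag_mask=0b1)   # rstw.w
other = Condition(predicate_mask=0b0000_1011)                    # non-consumer

state.predicates |= 0b0000_1011     # producer's state update (p0, p1, p3)
consumer_early = condition_met(consumer, state)
other_early = condition_met(other, state)
state.dependency_tags |= 0b1        # producer completes, tag bit is set
consumer_late = condition_met(consumer, state)
```

Because the tag bit lives outside the predicate field, all eight predicate bits remain available for machine state, at the cost of the extra storage and comparison.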

Turning now to Figure 8, this is another illustration of the triggered-instruction processing element CT. As shown in this Figure, a number of candidate triggered instructions are stored in a set of storage locations 60, each being stored with corresponding condition information 40. Trigger resolution circuitry 80 monitors the conditions 40, a set of predicate registers 82 (which, in this example, comprises eight 1-bit predicate registers), and one or more dependency tags 84. Note that in some examples, the dependency tags are not provided, and dependency state information is instead represented in a subset of the predicate registers 82 (as discussed above). The trigger resolution circuitry 80 also monitors data tags and channel status, as discussed above. When it is determined or predicted that the condition(s) associated with one of the candidate instructions is met, a priority encoder 86 issues a signal which acts as a control signal for a multiplexer 88, to cause the triggered instruction to be issued for execution. Accordingly, the trigger resolution circuitry 80, the priority encoder 86 and the multiplexer 88 may collectively be considered to be issue circuitry. As shown in the figure, once a triggered instruction is issued, it is executed. If the triggered instruction is part of a block of triggered instructions (other than the last instruction in the block) which are all associated with the same condition information, the issue circuitry will advance to the next instruction in the block once the preceding instruction is issued. For example, note that the lengths of the instructions in I-Memory (60) are different, illustrating the possibility that a sequence of triggered instructions can share the same I-Triggers (40). This represents a hybrid dataflow design, where not every single instruction needs to be triggered and a block of instructions can be triggered just once.

On the other hand, if the triggered instruction is not part of a block of instructions, or is the final instruction in the block of instructions, execution of the triggered instruction also involves updating the predicate registers 82 based on state update information specified by the instruction. In addition, if the triggered instruction is a triggered-producer instruction, once execution of the instruction has completed the tag bit 84 will also be updated.

As shown in the figure, delay circuitry 90 may also be provided. The delay circuitry intercepts an update signal issued by execution circuitry to cause the predicate registers or the dependency tag to be updated, and is capable of creating a delay between the signal being issued and the corresponding update being performed. This delay may be selective (e.g. it may not always be applied), and the length of the delay may be variable. Two instances of the delay circuitry 90 are shown in Figure 8, to illustrate that there are multiple positions in which the delay circuitry 90 could be provided. For example, delay circuitry 90a could be provided on the path to updating the tag bit, to delay updates to the tag bit without impacting updates to the predicates. In another example, delay circuitry 90b could be provided on both the path to update the tag bit and the path to update the predicate registers, so that a delay can also be introduced when updating the predicate registers. In yet another example (not shown in the figure), the delay may be introduced in respect of updating the predicate registers but not the tag bit.
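
The behaviour of the delay circuitry can be sketched as a small queue of pending updates. The class name DelayedUpdates, the cycle-based tick() method and the dictionary used for state are assumptions made for illustration. The example corresponds to the 90a arrangement, in which only the tag bit update is delayed.

```python
class DelayedUpdates:
    """Holds intercepted update signals until their delay has elapsed."""

    def __init__(self):
        self.pending = []  # entries of [cycles_remaining, key, value]

    def post(self, state, key, value, delay=0):
        """Intercept an update; a delay of 0 means the delay is not applied."""
        if delay == 0:
            state[key] = value
        else:
            self.pending.append([delay, key, value])

    def tick(self, state):
        """Advance one cycle and perform any updates that are now due."""
        still_pending = []
        for entry in self.pending:
            entry[0] -= 1
            if entry[0] == 0:
                state[entry[1]] = entry[2]
            else:
                still_pending.append(entry)
        self.pending = still_pending


state = {"p0": False, "tag": False}
delay_line = DelayedUpdates()

delay_line.post(state, "p0", True)            # predicate path: no delay
delay_line.post(state, "tag", True, delay=2)  # tag path: two-cycle delay
snapshot_at_issue = dict(state)

delay_line.tick(state)
snapshot_after_one = dict(state)

delay_line.tick(state)
snapshot_after_two = dict(state)
```

Passing a different delay value per call models the variable delay length, and passing 0 models the case in which the delay is selectively not applied.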

Figure 9 is a flow diagram illustrating processing of triggered instructions. At step 100, the instruction issuing circuitry 12 of the processing element determines whether the machine and dependency state 22 (and input channel data availability, if relevant for any particular instruction) satisfy, or are predicted to satisfy, the trigger conditions for any of the pool of triggered instructions stored in the triggered-instruction storage circuitry 11. If not, then the instruction issuing circuitry 12 waits for a time when an instruction is determined or predicted to satisfy its trigger conditions. If multiple triggered instructions are ready to issue (step 102), then at step 104 the issuing circuitry issues one of the ready-to-issue instructions, which is selected based on a predetermined priority order (e.g. the storage order of the instructions in memory). Otherwise, if there is only one instruction ready to issue, that instruction is issued at step 106. At step 108 the execution circuitry 14 executes the issued instruction on one or more operands to generate a result value. The operands can be read from local registers 16 or from input channel data stored in the input channel data holding area 18, or can be dequeued data which is dequeued from one of the input data buffers managed by the input channel processing circuitry 70. The result value can be written to a local register 16, output as output channel data, or enqueued onto one of the buffers managed by the input channel processing circuitry 70. At step 110, the execution circuitry 14 also triggers an update to the machine state 22 based on the trigger action information 48 specified by the executed instruction. At step 112, the dependency state is updated; however, this step is not performed until execution of the instruction has completed.
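
The flow of Figure 9 can be approximated by a cycle-based loop, as sketched below. This is a simplified model under several assumptions that are not taken from the figure: at most one instruction is issued per cycle, each instruction has a fixed latency, the machine state is updated at issue, a single dependency tag is used, and the tag is never cleared. The names Instr and run are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    expect: dict            # predicate index -> required value (trigger)
    update: dict            # predicate index -> new value (trigger action)
    latency: int = 1        # assumed cycles from issue to completion
    producer: bool = False  # sets the dependency tag on completion
    consumer: bool = False  # additionally requires the dependency tag


def run(pool, predicates, cycles):
    tag = False
    in_flight = []  # entries of [cycles_remaining, instr]
    trace = []
    for cycle in range(cycles):
        # Step 112: dependency state is updated only when execution completes.
        for entry in in_flight:
            entry[0] -= 1
        for remaining, instr in in_flight:
            if remaining == 0 and instr.producer:
                tag = True
        in_flight = [entry for entry in in_flight if entry[0] > 0]

        # Steps 100 to 106: find ready instructions, pick by storage order.
        ready = [
            instr for instr in pool
            if all(predicates[k] == v for k, v in instr.expect.items())
            and (tag or not instr.consumer)
        ]
        if not ready:
            continue  # wait for a later cycle
        issued = ready[0]
        trace.append((cycle, issued.name))

        # Step 110: machine state update based on the trigger action.
        for k, v in issued.update.items():
            predicates[k] = v
        # Step 108: execution occupies the instruction's latency.
        in_flight.append([issued.latency, issued])
    return trace


pool = [
    Instr("load", {0: 0, 1: 0}, {0: 1}, latency=3, producer=True),
    Instr("use", {0: 1}, {0: 0, 1: 1}, consumer=True),
    Instr("other", {0: 1, 2: 0}, {2: 1}),
]
trace = run(pool, [0, 0, 0], cycles=6)
```

In this run the independent instruction "other" issues while "load" is still executing, whereas "use" is held until the cycle in which "load" completes and the dependency tag is set.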

Figure 10 illustrates a simulator implementation that may be used. Whilst the earlier described embodiments implement the present invention in terms of apparatus and methods for operating specific processing hardware supporting the techniques concerned, it is also possible to provide an instruction execution environment in accordance with the embodiments described herein which is implemented through the use of a computer program. Such computer programs are often referred to as simulators, insofar as they provide a software-based implementation of a hardware architecture. Varieties of simulator computer programs include emulators, virtual machines, models, and binary translators, including dynamic binary translators. Typically, a simulator implementation may run on a host processor 210, optionally running a host operating system 208, supporting the simulator program 202. In some arrangements, there may be multiple layers of simulation between the hardware and the provided instruction execution environment, and/or multiple distinct instruction execution environments provided on the same host processor. Historically, powerful processors have been required to provide simulator implementations which execute at a reasonable speed, but such an approach may be justified in certain circumstances, such as when there is a desire to run code native to another processor for compatibility or re-use reasons. For example, the simulator implementation may provide an instruction execution environment with additional functionality which is not supported by the host processor hardware, or provide an instruction execution environment typically associated with a different hardware architecture. An overview of simulation is given in "Some Efficient Architecture Simulation Techniques", Robert Bedichek, Winter 1990 USENIX Conference, Pages 53-63.

To the extent that embodiments have previously been described with reference to particular hardware constructs or features, in a simulated embodiment, equivalent functionality may be provided by suitable software constructs or features. For example, particular circuitry may be implemented in a simulated embodiment as computer program logic. Similarly, memory hardware, such as a register or cache, may be implemented in a simulated embodiment as a software data structure. For example, as shown in Figure 10, the simulator code may comprise processing program logic 204 to emulate the execution circuitry described above, and issue program logic 206 to emulate the issue circuitry described above. In arrangements where one or more of the hardware elements referenced in the previously described embodiments are present on the host hardware (for example, host processor 210), some simulated embodiments may make use of the host hardware, where suitable.

The simulator program 202 may be stored on a computer-readable storage medium (which may be a non-transitory medium), and provides a program interface (instruction execution environment) to the target code 200 (which may include applications, operating systems and a hypervisor) which is the same as the interface of the hardware architecture being modelled by the simulator program 202. Thus, the program instructions of the target code 200, which may include triggered instructions such as the triggered-producer and triggered-consumer instructions described above, may be executed from within the instruction execution environment using the simulator program 202, so that a host computer 210 which does not actually have the hardware features of the apparatus CT discussed above can emulate these features.

In the present application, the words "configured to..." are used to mean that an element of an apparatus has a configuration able to carry out the defined operation. In this context, a "configuration" means an arrangement or manner of interconnection of hardware or software. For example, the apparatus may have dedicated hardware which provides the defined operation, or a processor or other processing device may be programmed to perform the function. "Configured to" does not imply that the apparatus element needs to be changed in any way in order to provide the defined operation.

Further, the words "comprising at least one of..." in the present application are used to mean that any one of the following options or any combination of the following options is included. For example, "at least one of: A; B and C" is intended to mean A or B or C or any combination of A, B and C (e.g. A and B or A and C or B and C).

Although illustrative embodiments of the invention have been described in detail herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various changes and modifications can be effected therein by one skilled in the art without departing from the scope of the invention as defined by the appended claims.

Claims

1. An apparatus comprising: at least one triggered-instruction processing element, a given triggered-instruction processing element comprising execution circuitry to execute processing operations in response to triggered instructions according to a triggered instruction architecture; candidate instruction storage circuitry to store a plurality of triggered instructions, each triggered instruction specifying condition information indicating at least one condition; and issue circuitry to issue, in response to a determination or a prediction of the at least one condition indicated by the condition information specified by a given triggered instruction being met, the given triggered instruction for execution by the execution circuitry, wherein: the execution circuitry is responsive to state update information specified by the given triggered instruction to cause machine state information to be updated in dependence on the state update information; when the given triggered instruction comprises a triggered-producer instruction, the execution circuitry is responsive to completion of execution of a processing operation performed in response to the triggered-producer instruction to cause dependency state information to be updated to indicate that at least one corresponding triggered-consumer instruction can be issued for execution; and the issue circuitry is configured to evaluate, when the given instruction comprises a triggered-consumer instruction, whether the at least one condition is determined or predicted to be met in dependence on both the machine state information and the dependency state information.
2. The apparatus of claim 1, wherein the execution circuitry is permitted to cause the machine state information to be updated before completion of execution of the processing operation performed in response to the triggered-producer instruction.
3. The apparatus of claim 1 or claim 2, comprising a set of predicate registers, the set of predicate registers including one or more predicate registers to store the machine state information.
4. The apparatus of claim 3, wherein: the set of predicate registers comprises at least one predicate register to store the dependency state information; and the issue circuitry is configured to determine that the at least one condition indicated by the condition information of the given triggered instruction is met when values stored in at least a subset of the predicate registers match expected values indicated by the condition information.
5. The apparatus of claim 3, comprising: dependency tag storage circuitry to store the dependency state information, wherein the issue circuitry is configured to determine that the at least one condition indicated by the condition information of the given triggered instruction is met when values stored in at least a subset of the predicate registers match expected values indicated by the condition information; and the dependency state information stored in the dependency tag storage circuitry matches expected dependency state information indicated by the condition information.
6. The apparatus of any preceding claim, wherein: when the given instruction comprises the triggered-producer instruction, the execution circuitry is responsive to the completion of execution of the processing operation performed in response to the triggered-producer instruction to issue a dependency state update signal to cause the dependency state information to be updated to indicate that the at least one corresponding triggered-consumer instruction can be issued for execution; and the apparatus comprises delay circuitry responsive to the dependency state update signal to cause a delay to be introduced between the dependency state update signal being issued and the dependency state information being updated.
7. The apparatus of claim 6, wherein the delay circuitry is responsive to the dependency state update signal to cause the delay to be introduced unless a time between the triggered-producer instruction being issued and the triggered-producer instruction being completed is determined to be greater than a predetermined threshold duration.
8. The apparatus of claim 7, wherein the predetermined threshold duration is based on a number of cycles required to issue one or more selected non-consumer instructions after issuing the triggered-producer instruction.
9. The apparatus of claim 7 or claim 8, wherein the predetermined threshold duration is configurable by software.
10. The apparatus of any preceding claim, wherein: the execution circuitry is responsive to the state update information to issue a machine state update signal to cause the machine state information to be updated in dependence on the state update information; and the apparatus comprises delay circuitry responsive to the machine state update signal to cause a delay to be introduced between the dependency state update signal being issued and the dependency state information being updated.
11. The apparatus of any of claims 6 to 10, wherein the length of the delay is configurable by software.
12. The apparatus of any preceding claim, wherein the at least one corresponding triggered-consumer instruction comprises a triggered instruction whose execution is dependent on completion of execution of the processing operation performed in response to the triggered-producer instruction.
13. A method comprising: executing processing operations in response to triggered instructions according to a triggered instruction architecture; storing a plurality of triggered instructions, each triggered instruction specifying condition information indicating at least one condition; and issuing, in response to a determination or a prediction of the at least one condition indicated by the condition information specified by a given triggered instruction being met, the given triggered instruction for execution; causing, in response to state update information specified by the given triggered instruction, machine state information to be updated in dependence on the state update information; in response to completion of execution of a processing operation performed in response to the given triggered instruction, when the given triggered instruction comprises a triggered-producer instruction, causing dependency state information to be updated to indicate that at least one corresponding triggered-consumer instruction can be issued for execution; and evaluating, when the given instruction comprises a triggered-consumer instruction, whether the at least one condition is determined or predicted to be met in dependence on both the machine state information and the dependency state information.
14. A computer program comprising instructions which, when executed on a computer, control the computer to provide: processing program logic to execute processing operations in response to triggered instructions according to a triggered instruction architecture; candidate instruction storage program logic to maintain a candidate instruction storage data structure to store a plurality of triggered instructions, each triggered instruction specifying condition information indicating at least one condition; and issue program logic to issue, in response to a determination or a prediction of the at least one condition indicated by the condition information specified by a given triggered instruction being met, the given triggered instruction for execution by the processing program logic, wherein: the processing program logic is responsive to state update information specified by the given triggered instruction to cause machine state information to be updated in dependence on the state update information; when the given triggered instruction comprises a triggered-producer instruction, the processing program logic is responsive to completion of execution of a processing operation performed in response to the triggered-producer instruction to cause dependency state information to be updated to indicate that at least one corresponding triggered-consumer instruction can be issued for execution; and the issue program logic is configured to evaluate, when the given instruction comprises a triggered-consumer instruction, whether the at least one condition is determined or predicted to be met in dependence on both the machine state information and the dependency state information.
15. A computer-readable storage medium storing the computer program of claim 14.