EP3942407A1

EP3942407A1 - Processor device and a method for parallel processing instructions in a processor device

Info

Publication number: EP3942407A1
Application number: EP19712991.9A
Authority: EP
Inventors: Paolo Bernardi; Marcelo SAMSONIUK
Original assignee: Unify Patente GmbH and Co KG
Current assignee: Unify Patente GmbH and Co KG
Priority date: 2019-03-21
Filing date: 2019-03-21
Publication date: 2022-01-26
Also published as: WO2020187421A1

Abstract

A processor device having a pipelined processor architecture is provided, the processor device comprising: a set of n single registers, a set of instructions consisting of a combination of several picoopcodes, a set of opcodes consisting of a concatenation of several picoinstructions, a decoding device, which is configured to decode each picoopcode into a picoinstruction, a set of possible picoinstructions for each register, which have a variable number of m bits, wherein each register of the set of n register is configured to execute its picoinstruction per clock cycle, which may use arbitrary but fixed clock cycles to fetch, decode and execute the picoinstructions or access other register sets, memories or external devices. Further, a method for parallel processing instructions in a processor device is provided.

Description

Processor device and a method for parallel processing instructions in a processor device

The present invention relates to a processor device and a method for parallel processing instructions in a processor device.

Data processing has long been, and remains, an important issue. In particular, new approaches for optimizing processor speed are constantly being developed. Thus, various methods and technologies are known in prior art for improving processor speed, either directly by means of clock rate improvements, or indirectly, by means of processor architecture improvements.

As to architecture improvements, several different approaches are known in prior art. There are various documented methods for implementing computer architectures (CISC, RISC, VLIW, etc.), whereby every one of the known architectures has its strengths and weaknesses. While CISC (Complex Instruction Set Computer) and RISC (Reduced Instruction Set Computer) are currently designed to provide one instruction per clock (scalar processor) or more instructions per clock (super-scalar processors), VLIW (Very Large Instruction Word) processors are able to execute several instructions in one clock, however, at the cost of severe processor complexity.

Although the initial processors started with a small set of registers (the accumulator and some special registers, such as memory pointers), all architectures quickly evolved towards register oriented architectures. Such processors usually group a large number of registers into a small block of addressable memory, named “register file”, which stores the internal registers used in all basic operations, as illustrated exemplarily for a register file with n registers in Fig. 1. Although the register file oriented architectures are dominant due to better performance as compared to memory oriented architectures, stack oriented architectures and accumulator oriented architectures, the main problem with addressable registers remains: Only one single register is writable at a time. Typically, the instruction set defines one destination register (D-REG) and one source register (S-REG), as illustrated in Fig. 2. The S-REG and D-REG fields are typically configured such that an S-REG or D-REG field of 2^m bits addresses a memory with n entries, corresponding to n registers, as exemplified in Fig. 3.

Although there are solutions for the above mentioned limitation, most solutions are based on register file replication: shadowing registers, renaming registers, windowing registers, etc. All known solutions are fully transparent to the software, but they use large amounts of logic and are susceptible to failure, so that they do not work at all times. Also, the processor needs to use pipeline interlocks in order to avoid problems, such as register collisions, and these interlocks impact its performance.

Moreover, in prior art, a more traditional solution for the above mentioned problem relates to vectoring the register file, which is very similar to register shadowing, but which is not transparent to the software. Namely, the software knows about the register replication and uses this fact in order to compute multiple results in parallel, as illustrated in Fig. 4. Although the performance easily may be scaled up, the vector architecture has a severe limitation: The operation must be the same for all computed registers. This results from the instruction set, which is not more complex than the scalar architecture, but also typically defines one destination vector register (D-VREG) and one source vector register (S-VREG), which points to the register file, as illustrated in Fig. 5.

Advanced techniques, such as register masking and register chaining, can bypass some of the above mentioned limitations, but they do affect the overall performance and increase the complexity. The vector architecture is commonly found in super-computers and Graphics Processing Units (GPU), as well as media accelerators, such as Digital Signal Processors (DSP), where multiple data can be more easily processed with a single vector instruction.

However, in some cases, data processing is rather complex and, thus, vector processing is not possible in some applications. A typical workaround for the limitation is the use of the VLIW architecture, in a way such that different register sets will be able to work with different instructions, as illustrated in Fig. 6.

Nevertheless, although far more flexible than the other architectures, the VLIW solution causes a deep impact on the instruction word, which is much greater as compared to other architectures, as illustrated in Fig. 7.

Although the differences do not appear to be significant, the main difference is that all instructions are explicitly executed in parallel and, in order to always keep the execution unit working, all the instructions must be transferred at the same time, which explains the VL (Very Large) in the VLIW.

Therefore, a designer of a new processor has to select the appropriate architecture in consideration of how much logic space is available and which performance is desired, resulting in a direct relation according to which the performance is proportional to the logic space and, as a consequence, to the costs. Further, in prior art solutions concerning the improvement of processor speed and performance, there exists a so-called instruction interlock problem.

A classical 5-stage pipeline RISC processor has a structure as illustrated in Fig. 8, which means that, for the classic RISC processor, each instruction is executed in five clock cycles. Each clock can execute different phases, and theoretically, one instruction per clock can still be accomplished, as illustrated in Fig. 9.

An interlock problem will occur if an instruction n needs to use a register that is being modified by a predecessor instruction. For example, in a hypothetical case where a value is attributed to a register R1 and the value of R1 is used by the subsequent instructions, a scenario as illustrated in Fig. 10 will occur, in which case the value R1 = 10 will be finalized only at the end of the 5th cycle, which means that instruction 2 and instruction 3 cannot be executed as long as instruction 1 is not completed. In this case, the simplest approach is to stop the processor until the processor receives the instruction result, as illustrated in Fig. 11. Consequently, however, processing time is wasted, since the processor is stopped for two clock cycles waiting for the last instruction result.
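The stall described above can be reproduced with a small cycle-counting sketch. It is only an illustration under stated assumptions: the function name `count_stall_cycles`, the tuple-based program encoding and the rule that a register written in the WB stage can be read by the ID stage in the same cycle are hypothetical and not taken from the figures.

```python
def count_stall_cycles(program, id_offset=1, wb_offset=4):
    """Count stall cycles in a simplified 5-stage pipeline without forwarding.

    program   -- list of (dest, sources) tuples, e.g. ("R2", ["R1"])
    id_offset -- cycles between IF and ID of the same instruction (assumed 1)
    wb_offset -- cycles between IF and WB of the same instruction (assumed 4)

    Assumption: a register written in WB may be read by ID in the same cycle.
    """
    ready_cycle = {}  # register name -> cycle in which its new value is readable
    issue = 1         # cycle in which the next instruction enters IF
    stalls = 0
    for dest, sources in program:
        # Earliest cycle in which ID is allowed to read all source registers.
        needed = max((ready_cycle.get(src, 0) for src in sources), default=0)
        decode = issue + id_offset
        if decode < needed:
            delay = needed - decode  # hold the instruction back
            stalls += delay
            issue += delay
        if dest is not None:
            ready_cycle[dest] = issue + wb_offset
        issue += 1
    return stalls


# R1 = 10, followed by two instructions that read R1.
dependent = [("R1", []), ("R2", ["R1"]), ("R3", ["R1"])]
independent = [("R1", []), ("R2", []), ("R3", [])]
print(count_stall_cycles(dependent))    # 2
print(count_stall_cycles(independent))  # 0
```

Under these assumptions the dependent sequence loses two clock cycles, while a sequence without register dependencies keeps the ideal rate of one instruction per clock.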

Summarizing the above, there are several techniques known in prior art which provide different approaches to optimize processors and anticipate the instruction execution such that modern processors are able to execute one or more instructions per clock cycle. However, known modern techniques are rather expensive due to additional logic, as explained above, and also, they use more power, while the final processor frequency is lowered.

Therefore, the present invention is based on the object to provide a processor device and a corresponding method for parallel processing of instructions in a processor device according to which processor speed can be improved without the drawbacks involved in prior art solutions. In particular, a processor device and a method for parallel processing of instructions in a processor device should be provided which allow for parallel execution of instructions in one clock cycle without increasing complexity and costs.

This object is solved by a processor device having the features according to claim 1 and a method for parallel processing of instructions in a processor device having the features according to claim 10. Preferred embodiments of the invention are defined in the respective dependent claims.

According to the invention, a processor device having a pipelined processor architecture is provided, the processor device comprising: a set of n single registers, a set of instructions consisting of a concatenation of n different picoopcodes, a set of opcodes consisting of a combination of p possible picoinstructions for each register, a set of decoding devices, which is configured to decode each picoopcode into a picoinstruction, wherein the set of possible picoinstructions p(n) ^p for each register n has a variable number of m bits, wherein each register of the set of n registers is configured to execute its picoopcode per clock cycle, preferably in a way that the ROC architecture can be adapted to use an arbitrary number of clock cycles to fetch, decode and execute the picoinstructions and/or access other register sets, such as register banks, register files, etc., memories such as RAM, ROM, etc., or external devices such as IO or memory mapped peripherals, as well as coprocessors and closely attached devices.

Preferably, the decoding device is a multiplexer, a dedicated adder, a multiplier, a shift-register, or a combination thereof.

According to a preferred embodiment, each register from the set of n registers is a single register, a small array of registers, in particular, a register bank or register file, or a large array of registers, in particular, a memory.

According to another preferred embodiment, each register from the set of n registers has up to 2^m picoopcodes. According to still another preferred embodiment, each register from the set of n registers is configured to execute up to 2^m different picoinstructions. Preferably, each picoinstruction is assigned to an own register of the set of n registers, in which the result of the picoinstruction is stored.

Also, it is advantageous, if the processor device is a register-oriented computer.

Further, according to another preferred embodiment the pipelined processor architecture comprises an instruction fetch IF stage, an instruction decode ID stage, an instruction execute EX stage, and a memory access MA stage. Moreover, according to still another preferred embodiment, the pipelined processor architecture comprises a write back WB stage.

Further, according to the invention, a method for parallel processing instructions in a processor device is provided, wherein the method comprises the steps of:

defining a set of registers;

defining a number of picoinstruction sets, each picoinstruction having fixed and well-defined source registers and one fixed destination register file for the whole picoinstruction set, wherein each register of the set of registers comprises its own picoinstruction set;

concatenating an arbitrary number of picoopcodes in a single opcode;

decoding, in parallel, of each picoopcode into picoinstructions.

Preferably, the method further comprises a step of accessing, in parallel, if necessary, picoinstructions to other picoopcodes or register sets, including register banks, internal register banks, internal memories, and external buses to external memories. According to another preferred embodiment, the method further comprises a step of executing, in parallel, each decoded picoinstruction.

Further, the method may comprise a step of storing the results of each picoinstruction into the destination register file at the end of the execution of the picoopcode.

By the inventive processor device and the corresponding method, the problems described above in view of prior art processing devices are overcome. Namely, by the present invention, since each picoinstruction has its own register assigned to store a result, truly parallel execution of n picoinstructions or n picoopcodes at the same clock cycle is enabled for an instruction set. At the same time, the processor device, or in particular, register oriented computer, does not specify a destination register due to the fact that each picoinstruction has a register assigned to it for storing the result, and thus, the write back WB stage is always deterministic. Thus, no instruction-related interlock problems will occur. Altogether, the inventive configuration allows for more parallel instructions being processed at the same time, as well as improved instruction flows, however, without any interlock problems occurring. Processor speed and performance can thus be improved significantly without increasing complexity and costs.

The invention and embodiments thereof will be described below in further detail in connection with the drawing.

Fig. 1 shows an example for a register file according to prior art;

Fig. 2 shows an instruction set according to prior art;

Fig. 3 illustrates the addressing procedure for the instruction set shown in Fig. 2;

Fig. 4 shows an example for vectoring a register file according to prior art;

Fig. 5 shows another instruction set according to prior art;

Fig. 6 shows an implementation of a VLIW architecture according to prior art;

Fig. 7 shows an example for a VLIW instruction with multiple parallel instructions according to prior art;

Fig. 8 shows an example for the structure of a 5-stage pipeline RISC processor according to prior art;

Fig. 9 shows a scheme for processing instructions by means of the processor shown in Fig. 8;

Fig. 10 illustrates the occurrence of an interlock problem in the RISC processor shown in Fig. 8;

Fig. 11 illustrates how to resolve the interlock problem illustrated in Fig. 10;

Fig. 12 shows three examples for a register which may be implemented in an embodiment of the processing device of the present invention;

Fig. 13 illustrates the assignment of instructions to registers according to an embodiment of the processing device of the present invention;

Fig. 14 shows an example for a multiplexed decoding device for a picoopcode;

Fig. 15 shows an example of an adder decoder device for a picoopcode, which uses a segment of the own picoopcode as adder input;

Fig. 16 illustrates the storing procedure to the register for a given picoinstruction set;

Fig. 17 shows the varying size of the picoopcodes;

Fig. 18 shows two examples for an opcode set of the processing device according to an embodiment of the invention;

Fig. 19 shows an example for a ROC processor with the execution of two picoinstructions according to the processing device according to an embodiment of the invention; and

Fig. 20 shows a scheme for a one instruction per cycle, 4-stage pipeline execution ROC architecture according to an embodiment of the invention.

Fig. 1 shows an example for a register file or register bank 1 according to prior art, wherein a processor with a traditional register architecture is concerned. Here, a number of individual registers 2, 2’, 2”, etc. are grouped into a small block of addressable memory, which is referred to as the “register file”, and in which the internal registers used in all basic operations are stored.

Fig. 2 shows an instruction 3 according to prior art, which typically defines one opcode 4, one destination register D-REG 6 and one source register S-REG 5. The opcode defines the type of operation and the S-REG and D-REG define the operands.

Fig. 3 illustrates the addressing procedure for the instruction shown in Fig. 2. Here, the S-REG 5 and D-REG 6 fields are typically configured such that an S-REG field or a D-REG field of 2^m bits addresses a memory with n entries via two address buses, one for the source operand 9 (e.g., a register pointer for read operation) and another for destination operand 10 (e.g., a register pointer for write operation), corresponding to a register file 1 of n registers. According to the opcode 4, the data will be written (indicated by reference numeral 7 for write data) or read (indicated by reference numeral 8 for read data).

Fig. 4 shows an example for vectoring a register file 1 according to prior art (as shown in Fig. 1). Here, the software knows about the register replication and uses the fact so as to compute multiple results in parallel, as illustrated in the figure.

Fig. 5 shows another instruction 3 according to prior art, which in particular is an instruction with a vectorial source register 5 and destination register field 6, as well as a vector opcode 4. The source 9 and destination 10 buses are connected to the same buses as shown in Fig. 4. The organization here is almost the same as in the case of Fig. 3, but with multiple results computed in parallel.

Fig. 6 shows an implementation of a VLIW (very large instruction word) architecture 12 according to prior art, according to which different register sets are working with different instructions. Although similar to vectoring, the difference here is that separate write buses (register pointers 9, 9’, 9” for write operation (destination register) for instruction 0, instruction 1, instruction 2, respectively) and read buses (register pointers 10, 10’, 10” for read operation (source register) for instruction 0, instruction 1, instruction 2, respectively) are used.

Fig. 7 shows an example for a VLIW instruction with multiple parallel instructions 3, 3’, 3” according to prior art, each one composed of different opcodes 4, source registers 5 and destination registers 6, for illustrating the deep impact on the instruction word, which is much stronger compared to other architectures. Also, as long as each instruction 3, 3’, 3” has m bits, the final VLIW instruction 13 will have m x 3 bits, as long as the example defines 3 parallel units (of course, the VLIW architecture can be parallelized ad infinitum).

Fig. 8 shows an example for the structure of a processor device 14 according to prior art embodied as a 5-stage pipeline RISC with a first stage “Instruction Fetch” IF 20, a second stage “Instruction Decode” ID 21, a third stage “Instruction Execute” EX 22, a fourth stage “Memory Access” MA 23, and a fifth stage “Write Back” WB 24. The RISC processor 14 comprises a program counter unit 15, an instruction memory 16, a register bank 1 comprising a plurality of individual registers 2, an Arithmetic Logic Unit ALU 17, a data memory 18, and a multiplex unit 19.

Fig. 9 shows a scheme for processing instructions by means of the RISC processor 14 (shown in Fig. 8), wherein each instruction 1, instruction 2, instruction 3, instruction 4, instruction 5 ... instruction n is executed in five clock cycles, and each clock is able to execute different phases, so that theoretically the processing of one instruction per clock may still be achieved.

However, as can be seen in Fig. 10, in the same RISC processor configuration (as processor 14 illustrated in Fig. 8), an interlock problem may occur if an instruction n needs to use a register which is currently being modified by a predecessor instruction. Here, in a hypothetical case in which a value, as for example, a value “10”, is attributed to register R1 and the value is then used for the following instruction, the value R1 = 10 will only be finalized at the end of the 5th cycle, which means that the second and third instructions cannot be executed while the first instruction is not finalized.

Fig. 11 illustrates how to resolve the interlock problem illustrated in Fig. 10. In a simple approach, the processor is stopped (here, for two clock cycles) until it has the instruction result.

Fig. 12 shows three examples for a register file 1 which may be implemented in an embodiment of the processing device, in particular, in a register oriented controller according to an embodiment of the present invention. The first uppermost example is a large array of registers embodied as a memory 23. The second or middle example is a small array of registers embodied as a register file or register bank 1. The third example shows a plurality of single individual registers 2, 2’, 2”, etc.

Fig. 13 illustrates the assignment of instructions to registers according to an embodiment of the processing device of the present invention. Again, there is a plurality of single registers 2, 2’, 2”, etc. Each one of the plurality of single registers 2, 2’, 2” is assigned one instruction in one clock cycle, in a way that a plurality of instructions 25, 25’, 25”, etc. is assigned to a plurality of registers 2, 2’, 2”, etc. in one clock cycle. Thus, register 2 will execute the instruction 25 in one clock cycle, register 2’ will execute the instruction 25’ in the same single clock cycle, the register 2” will execute the instruction 25” in the same single clock cycle, and so on. The instructions 25, 25’, 25”, etc. are so-called picoinstructions, for which there is no need to define the destination register, since the destination register is implied by the associated register.

Fig. 14 shows an example of a decoding circuit which uses a decoder 27 to choose between several picoinstructions 25, 25’, 25”, etc. in order to decode a picoopcode 26. As expected, the picoopcode 26 with a width of m bits can decode up to 2^m different picoinstructions 25, 25’, 25”, etc.
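The multiplexed decoding can be sketched in software as a table lookup. The following is a minimal model, not the circuit of Fig. 14: the helper `make_mux_decoder`, the representation of picoinstructions as Python callables and the register names in the example are assumptions made purely for illustration.

```python
def make_mux_decoder(picoinstructions, m):
    """Return a decoder that maps an m-bit picoopcode to one picoinstruction.

    picoinstructions -- list of callables, one per selectable picoinstruction
    m                -- picoopcode width in bits; at most 2**m entries fit
    """
    if len(picoinstructions) > 2 ** m:
        raise ValueError("an m-bit picoopcode selects at most 2**m picoinstructions")

    def decode(picoopcode):
        if not 0 <= picoopcode < 2 ** m:
            raise ValueError("picoopcode does not fit into m bits")
        if picoopcode >= len(picoinstructions):
            raise ValueError("picoopcode has no picoinstruction assigned")
        return picoinstructions[picoopcode]

    return decode


# Hypothetical picoinstruction set for one register, selected by 2 bits.
decode = make_mux_decoder(
    [
        lambda regs: regs["R1"],               # keep the current value
        lambda regs: regs["R1"] + regs["R2"],  # add another register
        lambda regs: 0,                        # clear
    ],
    m=2,
)
picoinstruction = decode(1)
print(picoinstruction({"R1": 4, "R2": 6}))  # 10
```

With m = 2, up to four picoinstructions could be selected; the fourth code is left unassigned in this example and is rejected by the decoder.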

Fig. 15 shows an example of a decoding circuit which uses an adder 28 to implement its own picoopcode 26 as adder input from the same register as defined by the path 30, resulting in a picoinstruction 25 which is expressed as “Pn = Rn + On”.
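A minimal sketch of the relation Pn = Rn + On is given below. The function name, the 4-bit operand segment, the 32-bit register width and the wrap-around behaviour on overflow are assumptions chosen for illustration; the description does not fix any of them.

```python
def adder_picoinstruction(rn, picoopcode, operand_bits=4, width=32):
    """Compute Pn = Rn + On, where On is a segment of the picoopcode itself.

    rn           -- current value of the register Rn
    picoopcode   -- raw picoopcode; its low operand_bits bits are taken as On
    operand_bits -- size of the operand segment (assumed, not from the source)
    width        -- register width in bits; the sum wraps around (assumed)
    """
    on = picoopcode & ((1 << operand_bits) - 1)  # extract the operand segment
    return (rn + on) & ((1 << width) - 1)        # add and truncate to the register width


print(adder_picoinstruction(10, 0b0101))     # 15
print(adder_picoinstruction(0xFFFFFFFF, 1))  # 0 (wraps around at 32 bits)
```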

Further, it is noted that each picoinstruction 25, 25’, 25” may either have or may not have other registers as input. Some examples of valid picoinstructions are listed below:

Pn(1) = R1

Pn(2) = R1 + R2

Pn(3) = R2 - 1

Pn(4) = R2 * R3

Pn(5) = R1 >> 2

Pn(6) = R1 OR R2

Pn(7) = 0
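How such picoinstructions might be grouped per register and executed in the same clock cycle can be sketched as follows. The table `PICO_SETS`, the distribution of the example picoinstructions over three registers and the function `step` are hypothetical; the sketch only models the principle that every picoinstruction reads the register values from before the cycle and writes its result to its own register.

```python
# Hypothetical picoinstruction sets: each register owns a list of
# picoinstructions, and the picoopcode is the index into that list.
PICO_SETS = {
    "R1": [
        lambda r: r["R1"],            # R1
        lambda r: r["R1"] + r["R2"],  # R1 + R2
        lambda r: r["R2"] - 1,        # R2 - 1
        lambda r: 0,                  # 0
    ],
    "R2": [
        lambda r: r["R2"],            # keep value
        lambda r: r["R2"] * r["R3"],  # R2 * R3
        lambda r: r["R1"] >> 2,       # R1 >> 2
        lambda r: r["R1"] | r["R2"],  # R1 OR R2
    ],
    "R3": [
        lambda r: r["R3"],            # keep value
        lambda r: 0,                  # 0
    ],
}


def step(regs, picoopcodes):
    """Execute one picoinstruction per register within one modelled clock cycle.

    All picoinstructions read the same snapshot of the old register values;
    each result is stored in the register that owns the picoinstruction, so
    no destination field is needed and no two results target the same register.
    """
    snapshot = dict(regs)
    return {name: PICO_SETS[name][picoopcodes[name]](snapshot) for name in regs}


regs = {"R1": 8, "R2": 3, "R3": 5}
print(step(regs, {"R1": 1, "R2": 1, "R3": 1}))  # {'R1': 11, 'R2': 15, 'R3': 0}
```

In the example, R2 is computed from the old value of R3 even though R3 is cleared in the same cycle, which is the behaviour the snapshot is meant to model.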

Fig. 16 illustrates the storing procedure for a given plurality of picoinstructions 25, 25’, 25”, etc., namely, it can be seen that for a given picoinstruction Pn(m) with opcode 26, the result will always be stored in the corresponding register Rn 2 via the multiplex unit 19.

Fig. 17 shows the possible varying size of the picoopcodes 26, 26’, 26” inside an instruction 3. Here, a first picoopcode 26 has a length of a few bits, a second picoopcode 26’ has a length of a large number of bits, a third picoopcode 26” has an intermediate length, and so on. Also, it is noted that different registers 2, 2’, 2”, etc. have different picoopcode 26, 26’, 26”, etc. lengths, which results in different sets of picoinstructions.
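The concatenation of picoopcodes of different lengths into one instruction word can be illustrated with a pair of pack and unpack helpers. The function names, the example widths of 2, 5 and 3 bits and the convention that the first picoopcode occupies the most significant bits are assumptions; the figure does not prescribe a bit order.

```python
def pack_instruction(fields, widths):
    """Concatenate picoopcodes of varying bit widths into one instruction word.

    Assumption: the first picoopcode occupies the most significant bits.
    """
    if len(fields) != len(widths):
        raise ValueError("one width is required per picoopcode")
    word = 0
    for value, width in zip(fields, widths):
        if not 0 <= value < (1 << width):
            raise ValueError("picoopcode does not fit into its width")
        word = (word << width) | value
    return word


def unpack_instruction(word, widths):
    """Split an instruction word back into its individual picoopcodes."""
    total = sum(widths)
    if not 0 <= word < (1 << total):
        raise ValueError("instruction word does not fit the given widths")
    fields = []
    shift = total
    for width in widths:
        shift -= width  # position of the least significant bit of this field
        fields.append((word >> shift) & ((1 << width) - 1))
    return fields


widths = [2, 5, 3]  # hypothetical: short, long and intermediate picoopcode
word = pack_instruction([3, 17, 4], widths)
print(word)                              # 908
print(unpack_instruction(word, widths))  # [3, 17, 4]
```

Because each width is fixed per register, every picoopcode can be extracted independently of the others, which is what allows the decoding to proceed in parallel.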

Fig. 18 shows two examples of instructions 3 and 3’ of the processing device according to an embodiment of the invention. Here, it can be seen that the instruction set of the processor device consists of a concatenation of several picoopcodes 26, 26’, 26”, etc. in parallel, whereby different picoopcodes may be used or may not be used for each instruction according to a prefix code 31 or 31’. As to the lower example with the prefix code 31’, starting with the condition COND 32, there may be a conditional execution, for example, comparing values of two or more different registers or checking, for example, if a register is positive or negative.

Fig. 19 shows an example of a ROC processor 34 for the execution of two parallel picoopcodes 26, 26’, each one with a different picoinstruction set 25, 25’, 25”, etc. and 33, 33’, 33”, etc. according to the processing device according to an embodiment of the invention. There are no restrictions regarding how long or short the pipelines for each picoinstruction set can be. One picoinstruction can be executed in one clock cycle or in a long pipeline with several clock cycles. Further, different picoinstruction sets may have different execution cycles.

The above described configuration solves the problems involved with prior art technologies described above. Namely, a processor with an instruction set that allows truly parallel execution of n picoinstructions or n picoopcodes in the same clock cycle is provided. Further, the ROC does not specify a destination register, thus, there is no WB stage, as illustrated in Fig. 20, which provides the advantage that no instruction-related interlock problems will occur in a one-cycle execution ROC architecture.

Reference numerals

1 register file or bank

2, 2’, 2” individual registers

3 instruction

4 opcode

5 source register

6 destination register

7 write data

8 read data

9 source operand

10 destination operand

12 VLIW

13 VLIW instruction

14, 34 processor device

15 program counter unit

16 instruction memory

17 ALU

18 data memory

19 multiplex unit

20 first stage IF

21 second stage ID

22 third stage EX

23 fourth stage MA

24 fifth stage WB

25, 25’, 25”, 33, 33’, 33” picoinstruction

26 picoopcode

27 decoder

28 adder

31, 31’ prefix code

32 condition COND

Claims

1. Processor device (34) having a pipelined processor architecture, the processor device (34) comprising:

- a set of n single registers (2, 2’),

- a set of instructions (3) consisting of a concatenation of n different picoopcodes (26, 26’),

- a set of opcodes (4) consisting of a combination of p possible picoinstructions for each register (25, 25’, 25” for register 1; 33, 33’, 33” for register 2; ... picoinstruction ^p for register n),

- a set of decoding devices (27, 27’), which is configured to decode each picoopcode (26, 26’) into a picoinstruction (25, 25’, 25”, 33, 33’, 33”), wherein the set of possible picoinstructions (25, 25’, 25”, 33, 33’, 33”) p(n) ^p for each register n has a variable number of m bits, wherein each register of the set of n registers (2, 2’) is configured to execute its picoopcode (26, 26’) per clock cycle, preferably in a way that the ROC architecture can be adapted to use an arbitrary number of clock cycles to fetch, decode and execute the picoinstructions and/or access other register sets, memories, or external devices.

2. Processor device (34) according to claim 1, wherein the decoding device (27) is a multiplexer (19), a dedicated adder (28), a multiplier, a shift-register, or a combination thereof.

3. Processor device (34) according to claim 1 or claim 2, wherein each register (2, 2’) from the set of n registers is a single register, a small array of registers, in particular, a register bank (1), or a large array of registers, in particular, a memory (18).

4. Processor device (34) according to any one of the preceding claims, wherein each register (2, 2’) from the set of n registers (2, 2’) has up to 2^m picoopcodes (26, 26’).

5. Processor device (34) according to any one of the preceding claims, wherein each register (2, 2’) from the set of n registers (2, 2’) is configured to execute up to 2^m different picoinstructions (25, 25’, 25”, 33, 33’, 33”).

6. Processor device (34) according to any one of the preceding claims, wherein each picoinstruction (25, 25’, 25”, 33, 33’, 33”) is assigned to an own register (2, 2’) of the set of n registers (2, 2’), in which the result of the picoinstruction (25, 25’, 25”, 33, 33’, 33”) is stored.

7. Processor device (34) according to any one of the preceding claims, wherein the processor device (34) is a register-oriented computer.

8. Processor device (34) according to any one of the preceding claims, wherein the pipelined processor architecture comprises an instruction fetch IF stage (20), an instruction decode ID stage (21), an instruction execute EX stage (22), and a memory access MA stage (23).

9. Processor device (34) according to any one of the preceding claims, wherein the pipelined processor architecture comprises a write back WB stage (24).

10. Method for parallel processing instructions in a processor device (34), having the features according to any one of the preceding claims 1 to 9, wherein the method comprises the steps of:

- defining a set of registers (2, 2’);

- defining a number of picoinstruction sets (25, 25’, 25”, 33, 33’, 33”), each picoinstruction (25, 25’, 25”, 33, 33’, 33”) having fixed and well-defined source register files (5) and one fixed destination register file (6) for the whole picoinstruction set (25, 25’, 25”, 33, 33’, 33”), wherein each register (2, 2’) of the set of registers (2, 2’) comprises its own picoinstruction set (25, 25’, 25”, 33, 33’, 33”);

- concatenating an arbitrary number of picoopcodes (26, 26’) in a single opcode;

- decoding, in parallel, of each picoopcode (26, 26’) into picoinstructions (25, 25’, 25”, 33, 33’, 33”).

11. Method according to claim 10, wherein the method further comprises

- accessing, in parallel, if necessary, picoinstructions (25, 25’, 25”, 33, 33’, 33”) to other picoopcodes (26, 26’) or register sets, including register banks, internal register banks, internal memories, and external buses to external memories.

12. Method according to claim 10 or claim 11, wherein the method further comprises a step of

- executing, in parallel, each decoded picoinstruction (25, 25’, 25”, 33, 33’, 33”).

13. Method according to any one of claims 10 to 12, wherein the method further comprises a step of storing the results of each picoinstruction (25, 25’, 25”, 33, 33’, 33”) into the destination register file (1) at the end of the execution of the picoopcode (26, 26’).