
EP3942407A1 - Processor device and a method for parallel processing instructions in a processor device - Google Patents

Processor device and a method for parallel processing instructions in a processor device

Info

Publication number
EP3942407A1
Authority
EP
European Patent Office
Prior art keywords
register
registers
processor device
picoinstruction
instruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP19712991.9A
Other languages
German (de)
French (fr)
Inventor
Paolo Bernardi
Marcelo SAMSONIUK
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unify Patente GmbH and Co KG
Original Assignee
Unify Patente GmbH and Co KG
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unify Patente GmbH and Co KG filed Critical Unify Patente GmbH and Co KG
Publication of EP3942407A1
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3818Decoding for concurrent execution
    • G06F9/3822Parallel decoding, e.g. parallel decode units
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30145Instruction analysis, e.g. decoding, instruction word fields
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30145Instruction analysis, e.g. decoding, instruction word fields
    • G06F9/30149Instruction analysis, e.g. decoding, instruction word fields of variable length instructions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3836Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3853Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution of compound instructions

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Executing Machine-Instructions (AREA)
  • Advance Control (AREA)

Abstract

A processor device having a pipelined processor architecture is provided, the processor device comprising: a set of n single registers, a set of instructions consisting of a combination of several picoopcodes, a set of opcodes consisting of a concatenation of several picoinstructions, a decoding device configured to decode each picoopcode into a picoinstruction, and a set of possible picoinstructions for each register, which have a variable number of m bits, wherein each register of the set of n registers is configured to execute its picoinstruction per clock cycle and may use an arbitrary but fixed number of clock cycles to fetch, decode and execute the picoinstructions or to access other register sets, memories or external devices. Further, a method for parallel processing instructions in a processor device is provided.

Description

Processor device and a method for parallel processing instructions in a processor device

The present invention relates to a processor device and a method for parallel processing instructions in a processor device.
In the past and up to now, data processing has been a constant and important issue. In particular, new approaches for optimizing processor speed are constantly being developed. Thus, various methods and technologies are known in the prior art for improving processor speed, either directly, by means of clock rate improvements, or indirectly, by means of processor architecture improvements. As to architecture improvements, several different approaches are known in the prior art. There are various documented methods for implementing computer architectures (CISC, RISC, VLIW, etc.), whereby every one of the known architectures has its strengths and weaknesses. While CISC (Complex Instruction Set Computer) and RISC (Reduced Instruction Set Computer) processors are currently designed to provide one instruction per clock (scalar processors) or more instructions per clock (super-scalar processors), VLIW (Very Large Instruction Word) processors are able to execute several instructions in one clock, however, at the cost of severe processor complexity.

Although the initial processors started with a small set of registers (the accumulator and some special registers, such as memory pointers), all architectures quickly evolved towards register-oriented architectures. Such processors usually group a large number of registers into a small block of addressable memory, named "register file", which stores the internal registers used in all basic operations, as illustrated exemplarily for a register file with n registers in Fig. 1. Although register-file-oriented architectures are dominant due to better performance as compared to memory-oriented, stack-oriented and accumulator-oriented architectures, the main problem with addressable registers remains: only one single register is writable at a time. Typically, the instruction set defines one destination register (D-REG) and one source register (S-REG), as illustrated in Fig. 2. The S-REG and D-REG fields are typically configured such that an S-REG or D-REG field of m bits addresses a memory with 2^m = n entries, corresponding to n registers, as exemplified in Fig. 3.
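For illustration only, the following Python sketch models the classical encoding described above. The 16-bit word layout, the opcode values and the register count are assumptions made for the example and are not taken from the patent; the point is merely that each instruction names exactly one destination register, so only one register can be written per instruction.

```python
# Minimal sketch of the classical register-file instruction of Figs. 2 and 3,
# assuming a hypothetical 16-bit encoding: 8-bit opcode, 4-bit D-REG, 4-bit S-REG.

N_REGS = 16                      # n = 2**m registers, here m = 4
regs = [0] * N_REGS              # the register file of Fig. 1

def decode(instr: int):
    """Split a 16-bit word into opcode | D-REG | S-REG fields."""
    opcode = (instr >> 8) & 0xFF
    d_reg  = (instr >> 4) & 0x0F   # destination register pointer (write bus)
    s_reg  = instr & 0x0F          # source register pointer (read bus)
    return opcode, d_reg, s_reg

def execute(instr: int):
    opcode, d, s = decode(instr)
    if opcode == 0x01:             # hypothetical ADD: D <- D + S
        regs[d] = regs[d] + regs[s]
    elif opcode == 0x02:           # hypothetical MOV: D <- S
        regs[d] = regs[s]
    # whatever the opcode does, only regs[d] is ever written

regs[2] = 5
execute(0x0112)                    # ADD R1, R2 -> writes only R1
print(regs[1])                     # 5
```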
Although there are solutions for the above-mentioned limitation, most solutions are based on register file replication: shadowing registers, renaming registers, windowing registers, etc. All known solutions are fully transparent to the software, but they use large amounts of logic and do not work in all cases. Also, the processor needs to use pipeline interlocks in order to avoid problems such as register collisions, which otherwise impact its performance.
Moreover, in the prior art, a more traditional solution for the above-mentioned problem relates to vectoring the register file, which is very similar to register shadowing, but which is not transparent to the software. Namely, the software knows about the register replication and uses this fact in order to compute multiple results in parallel, as illustrated in Fig. 4. Although the performance may easily be scaled up, the vector architecture has a severe limitation: the operation must be the same for all computed registers. This results from the instruction set, which is not more complex than in the scalar architecture, but which also typically defines one destination vector register (D-VREG) and one source vector register (S-VREG), which point to the register file, as illustrated in Fig. 5.
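The following sketch, again with a made-up lane count and register names, illustrates the vector limitation described above: a single D-VREG/S-VREG instruction updates all lanes at once, but every lane necessarily performs the same operation.

```python
# Toy model of the vector register file of Figs. 4 and 5, assuming
# hypothetical 4-lane vector registers.

LANES = 4
vregs = {"V0": [0] * LANES, "V1": [1, 2, 3, 4], "V2": [10, 20, 30, 40]}

def vadd(d_vreg: str, s_vreg: str):
    """Single vector opcode: element-wise add applied to every lane."""
    vregs[d_vreg] = [a + b for a, b in zip(vregs[d_vreg], vregs[s_vreg])]

vadd("V1", "V2")
print(vregs["V1"])   # [11, 22, 33, 44] -- all lanes ran the same ADD
```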
Advanced techniques, such as register masking and register chaining, can bypass some of the above-mentioned limitations, but they do affect the overall performance and increase the complexity. The vector architecture is commonly found in super-computers and Graphics Processing Units (GPUs), as well as in media accelerators, such as Digital Signal Processors (DSPs), where multiple data can be more easily processed with a single vector instruction.
However, in some cases, data processing is rather complex and, thus, vector processing is not possible in some applications. A typical workaround for the limitation is the use of the VLIW architecture, in a way such that different register sets will be able to work with different instructions, as illustrated in Fig. 6.
Nevertheless, although far more flexible than the other architectures, the VLIW solution has a deep impact on the instruction word, which becomes much larger as compared to other architectures, as illustrated in Fig. 7.
Although the differences do not appear to be significant, the main difference is that all instructions are explicitly executed in parallel and, in order to always keep the execution unit working, all the instructions must be transferred at the same time, which explains the VL (Very Large) in the VLIW.
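As a rough illustration of this instruction-word growth, the sketch below packs three hypothetical m-bit sub-instructions into one VLIW word of m x 3 bits; the 16-bit slot width is an assumption for the example, not a value from the patent.

```python
# Packing three independent sub-instructions into one VLIW word, as in Fig. 7.
# Even an unused slot (NOP) still occupies its full m bits in the word.

M_BITS = 16

def pack_vliw(slot0: int, slot1: int, slot2: int) -> int:
    """Concatenate three m-bit sub-instructions into one 3*m-bit word."""
    assert all(0 <= s < (1 << M_BITS) for s in (slot0, slot1, slot2))
    return (slot2 << (2 * M_BITS)) | (slot1 << M_BITS) | slot0

def unpack_vliw(word: int):
    mask = (1 << M_BITS) - 1
    return word & mask, (word >> M_BITS) & mask, (word >> (2 * M_BITS)) & mask

word = pack_vliw(0x0112, 0x0234, 0x0000)        # third slot is a NOP but still costs 16 bits
print(hex(word), unpack_vliw(word))             # 0x2340112 (274, 564, 0)
```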
Therefore, a designer of a new processor has to select the appropriate architecture in consideration of how much logic space is available and which performance is desired, resulting in a direct equation according to which the performance is proportional to the logic space and, as a consequence, to the costs. Further, in prior art solutions concerning the improvement of processor speed and performance, there exists a so-called instruction interlock problem.
A classical 5-stage pipeline RISC processor has a structure as illustrated in Fig. 8, which means that, for the classic RISC processor, each instruction is executed in five clock cycles. Each clock cycle can execute a different phase and, theoretically, one instruction per clock can still be accomplished, as illustrated in Fig. 9.
An interlock problem will occur if an instruction n needs to use a register that is being modified by a predecessor instruction. For example, in a hypothetical case where a value is attributed to a register R1 and R1's value is used in the subsequent instructions, a scenario as illustrated in Fig. 10 will occur. In this case, the value R1 = 10 will be finalized only at the end of the 5th cycle, which means that instruction 2 and instruction 3 cannot be executed as long as instruction 1 is not completed. Here, the simplest approach is to stop the processor until it receives the instruction result, as illustrated in Fig. 11. Consequently, however, processing time is wasted, since the processor is stopped for two clock cycles waiting for the last instruction result.
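The interlock can be made concrete with a small back-of-the-envelope model. The sketch below assumes a 5-stage pipeline in which a source register is read in ID and a result written in WB can be read in that same cycle; under those assumptions, the stall for the R1 example above comes out to the two wasted cycles mentioned.

```python
# Toy trace of the interlock of Figs. 10 and 11 for a 5-stage pipeline
# (IF, ID, EX, MA, WB). Instruction 1 writes R1 = 10; instruction 2 reads R1
# in its ID stage and therefore has to wait for instruction 1's WB stage.

def id_cycle(if_cycle: int) -> int:
    return if_cycle + 1           # ID is the second stage

def wb_cycle(if_cycle: int) -> int:
    return if_cycle + 4           # WB is the fifth stage

i1_if = 1                         # instruction 1 enters IF in cycle 1
i2_if = 2                         # instruction 2 would enter IF in cycle 2

stall = max(0, wb_cycle(i1_if) - id_cycle(i2_if))
print("instruction 2 stalls for", stall, "cycles")   # -> 2
```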
Summarizing the above, there are several techniques known in the prior art which provide different approaches to optimize processors and anticipate the instruction execution, such that modern processors are able to execute one or more instructions per clock cycle. However, known modern techniques are rather expensive due to additional logic, as explained above, and they also use more power, while the final processor frequency is lowered.
Therefore, the present invention is based on the object to provide a processor device and a corresponding method for parallel processing of instructions in a processor device according to which processor speed can be improved without the drawbacks involved in prior art solutions. In particular, a processor device and a method for parallel processing of instructions in a processor device should be provided which allow for parallel execution of instructions in one clock cycle without increasing complexity and costs.
This object is solved by a processor device having the features according to claim 1 and a method for parallel processing of instructions in a processor device having the features according to claim 10. Preferred embodiments of the invention are defined in the respective dependent claims.
According to the invention, a processor device having a pipelined processor architecture is provided, the processor device comprising: a set of n single registers, a set of instructions consisting of a concatenation of n different picoopcodes, a set of opcodes consisting of a combination of p possible picoinstructions for each register, and a set of decoding devices configured to decode each picoopcode into a picoinstruction, wherein the set of p possible picoinstructions p(n) for each register n has a variable number of m bits, and wherein each register of the set of n registers is configured to execute its picoopcode per clock cycle, preferably in such a way that the ROC (register-oriented computer) architecture can be adapted to use an arbitrary number of clock cycles to fetch, decode and execute the picoinstructions and/or to access other register sets, such as register banks, register files, etc., memories, such as RAM, ROM, etc., or external devices, such as IO or memory-mapped peripherals, as well as coprocessors and closely attached devices.
Preferably, the decoding device is a multiplexer, a dedicated adder, a multiplier, a shift-register, or a combination thereof.
According to a preferred embodiment, each register from the set of n registers is a single register, a small array of registers, in particular, a register bank or register file, or a large array of registers, in particular, a memory.
According to another preferred embodiment, each register from the set of n registers has up to 2^m picoopcodes. According to still another preferred embodiment, each register from the set of n registers is configured to execute up to 2^m different picoinstructions. Preferably, each picoinstruction is assigned to an own register of the set of n registers, in which the result of the picoinstruction is stored.
Also, it is advantageous, if the processor device is a register-oriented computer.
Further, according to another preferred embodiment the pipelined processor architecture comprises an instruction fetch IF stage, an instruction decode ID stage, an instruction execute EX stage, and a memory access MA stage. Moreover, according to still another preferred embodiment, the pipelined processor architecture comprises a write back WB stage.
Further, according to the invention, a method for parallel processing instructions in a processor device is provided, wherein the method comprises the steps of:
defining a set of registers;
defining a number of picoinstruction sets, each picoinstruction having fixed and well-defined source registers and one fixed destination register file for the whole picoinstruction set, wherein each register of the set of registers comprises its own picoinstruction set;
concatenating an arbitrary number of picoopcodes in a single opcode;
decoding, in parallel, of each picoopcode into picoinstructions.
Preferably, the method further comprises a step of accessing, in parallel, if necessary, picoinstructions to other picoopcodes or register sets, including register banks, internal register banks, internal memories, and external buses to external memories. According to another preferred embodiment, the method further comprises a step of executing, in parallel, each decoded picoinstruction.
Further, the method may comprise a step of storing the results of each picoinstruction into the destination register file at the end of the execution of the picoopcode.
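To make the method concrete, the following toy model walks through one clock cycle; the 4-register, 4-bit-per-picoopcode encoding and the particular picoinstruction sets are inventions for the example and not the patent's encoding. The instruction word is split into its concatenated picoopcodes, each picoopcode is decoded against its register's own picoinstruction set, and every result is stored in its own, implicit destination register.

```python
# One clock cycle of the method sketched above, for a hypothetical machine
# with n = 4 registers and a fixed m = 4-bit picoopcode per register. No
# destination field exists; each result goes to its own register, so all
# four registers can be written in the same cycle without a write-back conflict.

N, M = 4, 4                              # n registers, m bits per picoopcode
regs = [3, 5, 7, 0]                      # R0..R3

# hypothetical per-register picoinstruction sets: picoopcode value -> behaviour
pico_sets = [
    {0x0: lambda r: r[0], 0x1: lambda r: r[1] + r[2]},     # R0's set
    {0x0: lambda r: r[1], 0x1: lambda r: r[1] - 1},         # R1's set
    {0x0: lambda r: r[2], 0x1: lambda r: r[0] | r[1]},      # R2's set
    {0x0: lambda r: r[3], 0x1: lambda r: 0},                # R3's set
]

def step(instr: int):
    """Decode the concatenated picoopcodes and execute them in parallel."""
    picoopcodes = [(instr >> (M * i)) & ((1 << M) - 1) for i in range(N)]
    # all picoinstructions read the *old* register state ...
    results = [pico_sets[i][op](regs) for i, op in enumerate(picoopcodes)]
    # ... and each result is stored in its own, implicit register
    for i, value in enumerate(results):
        regs[i] = value

step(0x1111)      # every register executes its picoopcode 0x1 in one cycle
print(regs)       # [12, 4, 7, 0]
```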
By the inventive processor device and the corresponding method, the problems described above in view of prior art processing devices are overcome. Namely, by the present invention, since each picoinstruction has its own register assigned to store a result, truly parallel execution of n picoinstructions or n picoopcodes at the same clock cycle is enabled for an instruction set. At the same time, the processor device, or in particular, register oriented computer, does not specify a destination register due to the fact that each picoinstruction has a register assigned to it for storing the result, and thus, the write back WB stage is always deterministic. Thus, no instruction-related interlock problems will occur. Altogether, the inventive configuration allows for more parallel instructions being processed at the same time, as well as improved instruction flows, however, without any interlock problems occurring. Processor speed and performance can thus be improved significantly without increasing complexity and costs.
The invention and embodiments thereof will be described below in further detail in connection with the drawing.
Fig. 1 shows an example for a register file according to prior art;
Fig. 2 shows an instruction set according to prior art;
Fig. 3 illustrates the addressing procedure for the instruction set shown in Fig. 2;
Fig. 4 shows an example for vectoring a register file according to prior art;
Fig. 5 shows another instruction set according to prior art;
Fig. 6 shows an implementation of a VLIW architecture according to prior art;
Fig. 7 shows an example for a VLIW instruction with multiple parallel instructions according to prior art;
Fig. 8 shows an example for the structure of a 5-stage pipeline RISC processor according to prior art;
Fig. 9 shows a scheme for processing instructions by means of the processor shown in Fig. 8;
Fig. 10 illustrates the occurrence of an interlock problem in the RISC processor shown in Fig. 8;
Fig. 11 illustrates how to resolve the interlock problem illustrated in Fig. 10;
Fig. 12 shows three examples for a register which may be implemented in an embodiment of the processing device of the present invention;
Fig. 13 illustrates the assignment of instructions to registers according to an embodiment of the processing device of the present invention;
Fig. 14 shows an example for a multiplexed decoding device for a picoopcode;
Fig. 15 shows an example of an adder decoder device for a picoopcode, which uses a segment of its own picoopcode as adder input;
Fig. 16 illustrates the storing procedure to the register for a given picoinstruction set;
Fig. 17 shows the varying size of the picoopcodes;
Fig. 18 shows two examples for an opcode set of the processing device according to an embodiment of the invention;
Fig. 19 shows an example for a ROC processor with the execution of two picoinstructions according to the processing device according to an embodiment of the invention; and
Fig. 20 shows a scheme for a one instruction per cycle, 4-stage pipeline execution ROC architecture according to an embodiment of the invention.
Fig. 1 shows an example for a register file or register bank 1 according to prior art, wherein a processor with a traditional register architecture is concerned. Here, a number of individual registers 2, 2’, 2”, etc. are grouped into a small block of addressable memory, which is referred to as the "register file", and in which the internal registers used in all basic operations are stored.
Fig. 2 shows an instruction 3 according to prior art, which typically defines one opcode 4, one destination register D-REG 6 and one source register S-REG 5. The opcode defines the type of operation and the S-REG and D-REG define the operands.
Fig. 3 illustrates the addressing procedure for the instruction shown in Fig. 2. Here, the S-REG 5 and D-REG 6 fields are typically configured such that an S-REG field or a D-REG field of m bits addresses a memory with 2^m = n entries via two address buses, one for the source operand 9 (e.g., a register pointer for the read operation) and another for the destination operand 10 (e.g., a register pointer for the write operation), corresponding to a register file 1 of n registers. According to the opcode 4, the data will be written (indicated by reference numeral 7 for write data) or read (indicated by reference numeral 8 for read data).

Fig. 4 shows an example for vectoring a register file 1 according to prior art (as shown in Fig. 1). Here, the software knows about the register replication and uses this fact so as to compute multiple results in parallel, as illustrated in the figure.

Fig. 5 shows another instruction 3 according to prior art, which in particular is an instruction with a vectorial source register field 5 and destination register field 6, as well as a vector opcode 4. The source 9 and destination 10 buses are connected to the same buses as shown in Fig. 4. The organization here is almost the same as in the case of Fig. 3, but with multiple results computed in parallel.
Fig. 6 shows an implementation of a VLIW (very large instruction word) architecture 12 according to prior art, according to which different register sets are working with different instructions. Although similar to vectoring, the difference here is that separate write buses (register pointers 9, 9’, 9” for the write operation (destination register) of instruction 0, instruction 1 and instruction 2, respectively) and read buses (register pointers 10, 10’, 10” for the read operation (source register) of instruction 0, instruction 1 and instruction 2, respectively) are used.
Fig. 7 shows an example for a VLIW instruction with multiple parallel instructions 3, 3’, 3” according to prior art, each one composed of different opcodes 4, source registers 5 and destination registers 6, illustrating the deep impact on the instruction word, which is much stronger compared to other architectures. Also, as long as each instruction 3, 3’, 3” has m bits, the final VLIW instruction 13 will have m x 3 bits, since the example defines 3 parallel units (of course, the VLIW architecture can be parallelized ad infinitum).

Fig. 8 shows an example for the structure of a processor device 14 according to prior art embodied as a 5-stage pipeline RISC with a first stage "Instruction Fetch" IF 20, a second stage "Instruction Decode" ID 21, a third stage "Instruction Execute" EX 22, a fourth stage "Memory Access" MA 23, and a fifth stage "Write Back" WB 24. The RISC processor 14 comprises a program counter unit 15, an instruction memory 16, a register bank 1 comprising a plurality of individual registers 2, an Arithmetic Logic Unit ALU 17, a data memory 18, and a multiplex unit 19.
Fig. 9 shows a scheme for processing instructions by means of the RISC processor 14 (shown in Fig. 8), wherein each instruction 1, instruction 2, instruction 3, instruction 4, instruction 5, ..., instruction n is executed in five clock cycles, and each clock cycle is able to execute a different phase, so that, theoretically, the processing of one instruction per clock may still be achieved.
However, as can be seen in Fig. 10, in the same RISC processor configuration (as processor 14 illustrated in Fig. 8), an interlock problem may occur if an instruction n needs to use a register which is currently being modified by a predecessor instruction. Here, in a hypothetical case in which a value, as for example a value "10", is attributed to register R1 and the value is then used in the following instructions, the value R1 = 10 will only be finalized at the end of the 5th cycle, which means that the second and third instructions cannot be executed while the first instruction is not finalized.
Fig. 11 illustrates how to resolve the interlock problem illustrated in Fig. 10. In a simple approach, the processor is stopped (here, for two clock cycles) until it has the instruction result.
Fig. 12 shows three examples for a register file 1 which may be implemented in an embodiment of the processing device, in particular, in a register oriented controller according to an embodiment of the present invention. The first uppermost example is a large array of registers embodied as a memory 23. The second or middle example is a small array of registers embodied as a register file or register bank 1. The third example shows a plurality of single individual registers 2, 2’, 2”, etc.
Fig. 13 illustrates the assignment of instructions to registers according to an embodiment of the processing device of the present invention. Again, there is a plurality of single registers 2, 2’, 2”, etc. Each one of the plurality of single registers 2, 2’, 2” is assigned one instruction in one clock cycle, in a way that a plurality of instructions 25, 25’, 25”, etc. is assigned to a plurality of registers 2, 2’, 2”, etc. in one clock cycle. Thus, register 2 will execute the instruction 25 in one clock cycle, register 2’ will execute the instruction 25’ in the same single clock cycle, the register 2” will execute the instruction 25” in the same single clock cycle, and so on. The instructions 25, 25’, 25”, etc. are so-called picoinstructions, in a way that there is no need to define the destination register, since the destination register is implicit in the associated register.
Fig. 14 shows an example of a decoding circuit which uses a decoder 27 to choose between several picoinstructions 25, 25’, 25”, etc. in order to decode a picoopcode 26. As expected, the picoopcode 26 with a width of m bits can decode up to 2^m different picoinstructions 25, 25’, 25”, etc.
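A multiplexer-style decoder of this kind can be sketched as a simple lookup: the m-bit picoopcode selects one of up to 2^m behaviours for register Rn. The width m = 3 and the particular behaviours below are assumptions for the example only.

```python
# Sketch of the multiplexer decoder of Fig. 14 for a single register Rn,
# assuming a hypothetical picoopcode width of m = 3 bits. No destination
# field is decoded because the destination is always Rn itself.

M = 3
# the "inputs" of the multiplexer: up to 2**m picoinstruction behaviours
pico_mux = [
    lambda r: r["Rn"],            # 0b000: keep value
    lambda r: r["R1"],            # 0b001: copy R1
    lambda r: r["R1"] + r["R2"],  # 0b010: add
    lambda r: r["R2"] - 1,        # 0b011: decrement R2
    lambda r: r["R1"] >> 2,       # 0b100: shift
    lambda r: r["R1"] | r["R2"],  # 0b101: or
    lambda r: 0,                  # 0b110: clear
    lambda r: r["Rn"] + 1,        # 0b111: increment
]

def decode_and_execute(picoopcode: int, state: dict) -> int:
    assert 0 <= picoopcode < (1 << M)
    return pico_mux[picoopcode](state)   # the decoder acts as a selector

state = {"Rn": 9, "R1": 4, "R2": 6}
state["Rn"] = decode_and_execute(0b010, state)
print(state["Rn"])   # 10
```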
Fig. 15 shows an example of a decoding circuit which uses an adder 28 that takes a segment of its own picoopcode 26 as adder input together with the value of the same register, as defined by the path 30, resulting in a picoinstruction 25 which is expressed as "Pn = Rn + On".
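The adder-style decoder can likewise be sketched in a few lines; the 6-bit picoopcode width and the 4-bit immediate segment below are assumptions for the example, the essential point being that a segment On of the picoopcode itself is added to Rn.

```python
# Sketch of the adder decoder of Fig. 15: the decoder feeds a segment of the
# picoopcode into a dedicated adder, producing the picoinstruction Pn = Rn + On.

IMM_BITS = 4

def adder_decode(rn: int, picoopcode: int) -> int:
    on = picoopcode & ((1 << IMM_BITS) - 1)   # segment of the picoopcode (path 30)
    return rn + on                            # Pn = Rn + On, stored back into Rn

print(adder_decode(rn=40, picoopcode=0b10_0011))   # 40 + 3 = 43
```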
Further, it is noted that each picoinstruction 25, 25’, 25” may or may not have other registers as input. Some examples of valid picoinstructions are listed below:
Pn(1) = R1
Pn(2) = R1 + R2
Pn(3) = R2 - 1
Pn(4) = R2 * R3
Pn(5) = R1 >> 2
Pn(6) = R1 OR R2
Pn(7) = 0
Fig. 16 illustrates the storing procedure for a given plurality of picoinstructions 25, 25’, 25”, etc., namely, it can be seen that for a given picoinstruction Pn(m) with opcode 26, the result will always be stored in the corresponding register Rn 2 via the multiplex unit 19.
Fig. 17 shows the possible varying size of the picoopcodes 26, 26’, 26” inside an instruction 3. Here, a first picoopcode 26 has a length of a few bits, a second picoopcode 26’ has a length of a large number of bits, a third picoopcode 26” has an intermediate length, and so on. Also, it is noted that different registers 2, 2’, 2”, etc. have different picoopcode 26, 26’, 26”, etc. lengths, which results in different sets of picoinstructions.
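The variable picoopcode widths can be modelled as a simple bit-slicing of the instruction word; the particular widths below (2, 7 and 4 bits) are invented for the example, the point being only that each register can have its own picoopcode length and therefore its own picoinstruction set size of up to 2^m entries.

```python
# Unpacking picoopcodes of different widths from a single instruction word,
# as suggested by Fig. 17.

WIDTHS = [2, 7, 4]    # hypothetical m for registers R0, R1, R2

def unpack(instr: int):
    """Slice the concatenated instruction into one picoopcode per register."""
    picoopcodes, offset = [], 0
    for m in WIDTHS:
        picoopcodes.append((instr >> offset) & ((1 << m) - 1))
        offset += m
    return picoopcodes

instr = (0b1010 << 9) | (0b0110011 << 2) | 0b01   # concatenation of 3 picoopcodes
print(unpack(instr))   # [1, 51, 10]
```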
Fig. 18 shows two examples of instructions 3 and 3’ of the processing device according to an embodiment of the invention. Here, it can be seen that the instruction set of the processor device consists of a concatenation of several picoopcodes 26, 26’, 26”, etc. in parallel, whereby different picoopcodes may or may not be used for each instruction according to a prefix code 31 or 31’. As to the lower example with the prefix code 31’, starting with the condition COND 32, there may be a conditional execution, for example, comparing the values of two or more different registers or checking, for example, whether a register is positive or negative.
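The conditional prefix can be sketched as a guard evaluated before the picoopcode bundle takes effect; the 2-bit prefix encoding and the "R1 is non-negative" condition below are assumptions made for the example, not the patent's encoding.

```python
# Sketch of the prefix code of Fig. 18 with a COND field: if the condition
# fails, none of the concatenated picoopcodes in this instruction take effect.

def condition_holds(prefix: int, regs: dict) -> bool:
    if prefix == 0b00:                 # unconditional instruction (prefix 31)
        return True
    if prefix == 0b01:                 # conditional instruction (prefix 31')
        return regs["R1"] >= 0
    return False                       # other prefix encodings not modelled

def execute_instruction(prefix: int, decoded_picos, regs: dict):
    """decoded_picos maps the old register state to the new per-register values."""
    if not condition_holds(prefix, regs):
        return                          # the whole picoopcode bundle is skipped
    for reg_name, new_value in decoded_picos(regs).items():
        regs[reg_name] = new_value      # each result goes to its own register

regs = {"R1": -5, "R2": 8}
execute_instruction(0b01, lambda r: {"R1": r["R1"] + 1, "R2": 0}, regs)
print(regs)   # unchanged: {'R1': -5, 'R2': 8}, because R1 is negative
```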
Fig. 19 shows an example of a ROC processor 34 for the execution of two parallel picoopcodes 26, 26’, each one with a different picoinstruction set 25, 25’, 25”, etc. and 33, 33’, 33”, etc., according to the processing device according to an embodiment of the invention. There are no restrictions regarding how long or short the pipelines for each picoinstruction set can be. One picoinstruction can be executed in one clock cycle or in a long pipeline with several clock cycles. Further, different picoinstruction sets may have different execution cycles.
The above described configuration solves all problems involved with prior art technologies. Namely, a processor with an instruction set that allows truly parallel execution of n picoinstructions or n picoopcodes in the same clock cycle is provided. Further, the ROC does not specify a destination register and, thus, there is no WB stage, as illustrated in Fig. 20, which provides the advantage that no instruction-related interlock problems will occur in a one-cycle execution ROC architecture.
Reference numerals
1 register file or bank
2, 2’, 2” individual registers
3 instruction
4 opcode
5 source register
6 destination register
7 write data
8 read data
9 source operand
10 destination operand
12 VLIW
13 VLIW instruction
14, 34 processor device
15 program counter unit
16 instruction memory
17 ALU
18 data memory
19 multiplex unit
20 first stage IF
21 second stage ID
22 third stage EX
23 fourth stage MA
24 fifth stage WB
25, 25’, 25”, 33, 33’, 33” picoinstruction
26 picoopcode
27 decoder
28 adder
31, 31’ prefix code
32 condition COND

Claims

1. Processor device (34) having a pipelined processor architecture, the processor device (34) comprising:
- a set of n single registers (2, 2’),
- a set of instructions (3) consisting of a concatenation of n different picoopcodes (26, 26’),
- a set of opcodes (4) consisting of a combination of p possible picoinstructions for each register (25, 25’, 25” for register 1 ; 33, 33’,
33” for register 2; ... picoinstruction p for register n),
- a set of decoding devices (27, 27’), which is configured to decode each picoopcode (26, 26’) into a picoinstruction (25, 25’, 25”, 33, 33’, 33”), wherein the set of p possible picoinstructions (25, 25’, 25”, 33, 33’, 33”) p(n) for each register n has a variable number of m bits, wherein each register of the set of n registers (2, 2’) is configured to execute its picoopcode (26, 26’) per clock cycle, preferably in such a way that the ROC architecture can be adapted to use an arbitrary number of clock cycles to fetch, decode and execute the picoinstructions and/or access other register sets, memories, or external devices.
2. Processor device (34) according to claim 1, wherein the decoding device (27) is a multiplexer (19), a dedicated adder (28), a multiplier, a shift-register, or a combination thereof.
3. Processor device (34) according to claim 1 or claim 2, wherein each register (2, 2’) from the set of n registers is a single register, a small array of registers, in particular, a register bank (1), or a large array of registers, in particular, a memory (18).
4. Processor device (34) according to any one of the preceding claims, wherein each register (2, 2’) from the set of n registers (2, 2’) has up to 2^m picoopcodes (26, 26’).
5. Processor device (34) according to any one of the preceding claims, wherein each register (2, 2’) from the set of n registers (2, 2’) is configured to execute up to 2^m different picoinstructions (25, 25’, 25”, 33, 33’, 33”).
6. Processor device (34) according to any one of the preceding claims, wherein each picoinstruction (25, 25’, 25”, 33, 33’, 33”) is assigned to an own register (2, 2’) of the set of n registers (2, 2’), in which the result of the picoinstruction (25, 25’, 25”, 33, 33’, 33”) is stored.
7. Processor device (34) according to any one of the preceding claims, wherein the processor device (34) is a register-oriented computer.
8. Processor device (34) according to any one of the preceding claims, wherein the pipelined processor architecture comprises an instruction fetch IF stage (20), an instruction decode ID stage (21), an instruction execute EX stage (22), and a memory access MA stage (23).
9. Processor device (34) according to any one of the preceding claims, wherein the pipelined processor architecture comprises a write back WB stage (24).
10. Method for parallel processing instructions in a processor device (34), having the features according to any one of the preceding claims 1 to 9, wherein the method comprises the steps of:
- defining a set of registers (2, 2’);
- defining a number of picoinstruction sets (25, 25’, 25”, 33, 33’, 33”), each picoinstruction (25, 25’, 25”, 33, 33’, 33”) having fixed and well-defined source register files (5) and one fixed destination register file (6) for the whole picoinstruction set (25, 25’, 25”, 33, 33’, 33”), wherein each register (2, 2’) of the set of registers (2, 2’) comprises its own picoinstruction set (25, 25’, 25”, 33, 33’, 33”);
- concatenating an arbitrary number of picoopcodes (26, 26’) in a single opcode;
- decoding, in parallel, of each picoopcode (26, 26’) into picoinstructions (25, 25’, 25”, 33, 33’, 33”).
11. Method according to claim 10, wherein the method further comprises
- accessing, in parallel, if necessary, picoinstructions (25, 25’, 25”, 33, 33’, 33”) to other picoopcodes (26, 26’) or register sets, including register banks, internal register banks, internal memories, and external buses to external memories.
12. Method according to claim 10 or claim 11, wherein the method further comprises a step of
- executing, in parallel, each decoded picoinstruction (25, 25’, 25”, 33,
33’, 33”).
13. Method according to any one of claims 10 to 12, wherein the method further comprises a step of storing the results of each picoinstruction (25, 25’, 25”, 33, 33’, 33”) into the destination register file (1) at the end of the execution of the picoopcode (26, 26’).
EP19712991.9A 2019-03-21 2019-03-21 Processor device and a method for parallel processing instructions in a processor device Pending EP3942407A1 (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2019/057129 WO2020187421A1 (en) 2019-03-21 2019-03-21 Processor device and a method for parallel processing instructions in a processor device

Publications (1)

Publication Number Publication Date
EP3942407A1 true EP3942407A1 (en) 2022-01-26

Family

ID=65904432

Family Applications (1)

Application Number Title Priority Date Filing Date
EP19712991.9A Pending EP3942407A1 (en) 2019-03-21 2019-03-21 Processor device and a method for parallel processing instructions in a processor device

Country Status (2)

Country Link
EP (1) EP3942407A1 (en)
WO (1) WO2020187421A1 (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6101592A (en) * 1998-12-18 2000-08-08 Billions Of Operations Per Second, Inc. Methods and apparatus for scalable instruction set architecture with dynamic compact instructions
US7526633B2 (en) * 2005-03-23 2009-04-28 Qualcomm Incorporated Method and system for encoding variable length packets with variable instruction sizes
CN104424128B (en) * 2013-08-19 2019-12-13 上海芯豪微电子有限公司 Variable length instruction word processor system and method

Also Published As

Publication number Publication date
WO2020187421A1 (en) 2020-09-24

Similar Documents

Publication Publication Date Title
US11188330B2 (en) Vector multiply-add instruction
CN110580175A (en) Variable format, variable sparse matrix multiply instruction
US9164763B2 (en) Single instruction group information processing apparatus for dynamically performing transient processing associated with a repeat instruction
US7346881B2 (en) Method and apparatus for adding advanced instructions in an extensible processor architecture
US10628155B2 (en) Complex multiply instruction
JPH09311786A (en) Data processor
US9965275B2 (en) Element size increasing instruction
WO2015114305A1 (en) A data processing apparatus and method for executing a vector scan instruction
JPH03218523A (en) Data processor
US10303399B2 (en) Data processing apparatus and method for controlling vector memory accesses
TWI770079B (en) Vector generating instruction
TW202223633A (en) Apparatuses, methods, and systems for instructions for 16-bit floating-point matrix dot product instructions
EP3942407A1 (en) Processor device and a method for parallel processing instructions in a processor device
JP5786719B2 (en) Vector processor
JP7006097B2 (en) Code generator, code generator and code generator
US20070061551A1 (en) Computer Processor Architecture Comprising Operand Stack and Addressable Registers

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: UNKNOWN

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20210906

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

17Q First examination report despatched

Effective date: 20240612

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

17Q First examination report despatched

Effective date: 20241002