EP3942407A1 - Processor device and a method for parallel processing instructions in a processor device - Google Patents
Processor device and a method for parallel processing instructions in a processor device
- Publication number
- EP3942407A1 (application number EP19712991.9A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- register
- registers
- processor device
- picoinstruction
- instruction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3818—Decoding for concurrent execution
- G06F9/3822—Parallel decoding, e.g. parallel decode units
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30145—Instruction analysis, e.g. decoding, instruction word fields
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30145—Instruction analysis, e.g. decoding, instruction word fields
- G06F9/30149—Instruction analysis, e.g. decoding, instruction word fields of variable length instructions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3836—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
- G06F9/3853—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution of compound instructions
Definitions
- the present invention relates to a processor device and a method for parallel processing instructions in a processor device.
- CISC Complex Instruction Set Computer
- RISC Reduced Instruction Set Computer
- VLIW Very Long Instruction Word
- such processors are able to execute several instructions in one clock cycle, however at the cost of severe processor complexity.
- while the initial processors started with a small set of registers (the accumulator and some special registers, such as memory pointers), all architectures quickly evolved towards register-oriented architectures.
- Such processors usually group a large number of registers into a small block of addressable memory, named “register file”, which stores the internal registers used in all basic operations, as illustrated exemplarily for a register file with n registers in Fig. 1.
- the instruction set defines one destination register (D-REG) and one source register (S-REG), as illustrated in Fig. 2.
- D-REG destination register
- S-REG source register
- the S-REG and D-REG fields are typically configured such that an S-REG or D-REG field of m bits addresses a memory with up to n = 2^m entries, corresponding to n registers, as exemplified in Fig. 3.
- a more traditional solution for the above mentioned problem relates to vectoring the register file, which is very similar to register shadowing, but which is not transparent to the software.
- the software knows about the register replication and uses this fact in order to compute multiple results in parallel, as illustrated in Fig. 4.
- the vector architecture has a severe limitation: The operation must be the same for all computed registers. This results from the instruction set, which is not more complex than the scalar architecture, but also typically defines one destination vector register (D-VREG) and one source vector register (S-VREG), which points to the register file, as illustrated in Fig. 5.
- D-VREG destination vector register
- S-VREG source vector register
- the VLIW solution has a deep impact on the instruction word, which is much greater than in other architectures, as illustrated in Fig. 7.
- a classical 5-stage pipeline RISC processor has a structure as illustrated in Fig. 8, which means that, for the classic RISC processor, each instruction is executed in five clock cycles. Each clock cycle can execute different phases of different instructions, so that, theoretically, a throughput of one instruction per clock can still be achieved, as illustrated in Fig. 9.
- the present invention is based on the object to provide a processor device and a corresponding method for parallel processing of instructions in a processor device according to which processor speed can be improved without the drawbacks involved in prior art solutions.
- a processor device and a method for parallel processing of instructions in a processor device should be provided which allow for parallel execution of instructions in one clock cycle without increasing complexity and costs.
- a processor device having a pipelined processor architecture comprising: a set of n single registers; a set of instructions consisting of a concatenation of n different picoopcodes; a set of opcodes consisting of a combination of p possible picoinstructions for each register; and a set of decoding devices configured to decode each picoopcode into a picoinstruction, wherein the set of possible picoinstructions p(n) p for each register n has a variable number of m bits, and wherein each register of the set of n registers is configured to execute its picoopcode per clock cycle, preferably in such a way that the ROC architecture can be adapted to use an arbitrary number of clock cycles to fetch, decode and execute the picoinstructions and/or to access other register sets (such as register banks or register files), memories (such as RAM or ROM), or external devices (such as I/O or memory-mapped peripherals), as well as coprocessors
- the decoding device is a multiplexer, a dedicated adder, a multiplier, a shift-register, or a combination thereof.
- each register from the set of n registers is a single register, a small array of registers, in particular, a register bank or register file, or a large array of registers, in particular, a memory.
- each register from the set of n registers has up to 2^m picoopcodes.
- each register from the set of n registers is configured to execute up to 2^m different picoinstructions.
- each picoinstruction is assigned to an own register of the set of n registers, in which the result of the picoinstruction is stored.
- the processor device is a register-oriented computer.
- the pipelined processor architecture comprises an instruction fetch IF stage, an instruction decode ID stage, an instruction execute EX stage, and a memory access MA stage. Moreover, according to still another preferred embodiment, the pipelined processor architecture comprises a write back WB stage.
- a method for parallel processing instructions in a processor device comprises the steps of:
- each picoinstruction having fixed and well-defined source registers and one fixed destination register file for the whole picoinstruction set, wherein each register of the set of registers comprises its own picoinstruction set;
- the method further comprises a step of accessing, in parallel, if necessary, picoinstructions to other picoopcodes or register sets, including register banks, internal register banks, internal memories, and external buses to external memories.
- the method further comprises a step of executing, in parallel, each decoded picoinstruction.
- the method may comprise a step of storing the results of each picoinstruction into the destination register file at the end of the execution of the picoopcode.
- By the inventive processor device and the corresponding method, the problems described above in view of prior art processing devices are overcome. Namely, since each picoinstruction has its own register assigned to store a result, truly parallel execution of n picoinstructions or n picoopcodes in the same clock cycle is enabled for an instruction set. At the same time, the processor device, in particular a register-oriented computer, does not specify a destination register, because each picoinstruction has a register assigned to it for storing the result; the write back WB stage is therefore always deterministic, and no instruction-related interlock problems occur. Altogether, the inventive configuration allows more instructions to be processed in parallel and improves instruction flow without interlock problems. Processor speed and performance can thus be improved significantly without increasing complexity and costs.
- Fig. 1 shows an example for a register file according to prior art
- Fig. 2 shows an instruction set according to prior art
- Fig. 3 illustrates the addressing procedure for the instruction set shown in Fig. 2
- Fig. 4 shows an example for vectoring a register file according to prior art
- Fig. 5 shows another instruction set according to prior art
- Fig. 6 shows an implementation of a VLIW architecture according to prior art
- Fig. 7 shows an example for a VLIW instruction with multiple parallel instructions according to prior art
- Fig. 8 shows an example for the structure of a 5-stage pipeline RISC processor according to prior art
- Fig. 9 shows a scheme for processing instructions by means of the processor shown in Fig. 8.
- Fig. 10 illustrates the occurrence of an interlock problem in the RISC processor shown in Fig. 8;
- Fig. 11 illustrates how to resolve the interlock problem illustrated in Fig. 10;
- Fig. 12 shows three examples for a register which may be implemented in an embodiment of the processing device of the present invention.
- Fig. 13 illustrates the assignment of instructions to registers according to an embodiment of the processing device of the present invention
- Fig. 14 shows an example for a multiplexed decoding device for a picoopcode
- Fig. 15 shows an example of an adder decoder device for a picoopcode, which uses a segment of its own picoopcode as adder input
- Fig. 16 illustrates the storing procedure to the register for a given picoinstruction set
- Fig. 17 shows the varying size of the picoopcodes
- Fig. 18 shows two examples for an opcode set of the processing device according to an embodiment of the invention.
- Fig. 19 shows an example for a ROC processor with the execution of two picoinstructions according to an embodiment of the invention.
- Fig. 20 shows a scheme for a one instruction per cycle, 4-stage pipeline execution ROC architecture according to an embodiment of the invention.
- Fig. 1 shows an example for a register file or register bank 1 according to prior art, wherein a processor with a traditional register architecture is concerned.
- Here a number of individual registers 2, 2’, 2”, etc. are grouped into a small block of addressable memory, which is referred to as the “register file”, and in which the internal registers used in all basic operations are stored.
- Fig. 2 shows an instruction 3 according to prior art, which typically defines one opcode 4, one destination register D-REG 6 and one source register S-REG 5.
- the opcode defines the type of operation and the S-REG and D-REG define the operands.
- Fig. 3 illustrates the addressing procedure for the instruction shown in Fig. 2.
- the S-REG 5 and D-REG 6 fields are typically configured such that an S-REG field or a D-REG field of m bits addresses a memory with up to n = 2^m entries via two address buses, one for the source operand 9 (e.g., a register pointer for read operation) and another for the destination operand 10 (e.g., a register pointer for write operation), corresponding to a register file 1 of n registers.
- the data will be written (indicated by reference numeral 7 for write data) or read (indicated by reference numeral 8 for read data).
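The addressing scheme described for Fig. 2 and Fig. 3 can be sketched in a few lines. This is only an illustration under stated assumptions: the field width `M_BITS`, the field order (opcode, D-REG, S-REG from most to least significant bit) and the two opcodes `MOV` and `ADD` are hypothetical and not taken from the patent.

```python
from typing import NamedTuple

M_BITS = 3              # assumed width m of each register field
N_REGS = 1 << M_BITS    # an m-bit field can address up to 2**m registers


class Decoded(NamedTuple):
    opcode: int
    d_reg: int
    s_reg: int


def decode(word: int) -> Decoded:
    """Split an instruction word into opcode, D-REG and S-REG fields."""
    mask = N_REGS - 1
    return Decoded(
        opcode=word >> (2 * M_BITS),
        d_reg=(word >> M_BITS) & mask,
        s_reg=word & mask,
    )


def execute(regs: list, word: int) -> None:
    """Run one instruction against the register file (hypothetical opcodes)."""
    op, d, s = decode(word)
    if op == 0:        # hypothetical MOV: D-REG <- S-REG
        regs[d] = regs[s]
    elif op == 1:      # hypothetical ADD: D-REG <- D-REG + S-REG
        regs[d] = regs[d] + regs[s]
    else:
        raise ValueError(f"unknown opcode {op}")


regs = list(range(N_REGS))            # register i initially holds i
execute(regs, (1 << 6) | (2 << 3) | 5)  # ADD with D-REG = 2, S-REG = 5
```

The S-REG field drives the read pointer and the D-REG field drives the write pointer, so a single instruction can update only one register per cycle in this scheme.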
- Fig. 4 shows an example for vectoring a register file 1 (as shown in Fig. 1) according to prior art.
- the software knows about the register replication and uses the fact so as to compute multiple results in parallel, as illustrated in the figure.
- Fig. 5 shows another instruction 3 according to prior art, which in particular is an instruction with a vectorial source register field 5 and destination register field 6, as well as a vector opcode 4.
- the source 9 and destination 10 buses are connected to the same buses as shown in Fig. 4.
- the organization here is almost the same as in the case of Fig. 3, but with multiple results computed in parallel.
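The limitation of the vector approach, namely that one opcode is applied to every lane, can be made concrete with a small sketch. The function name `vector_execute` and the list-based register model are assumptions made for illustration only.

```python
import operator


def vector_execute(op, d_vreg, s_vreg):
    """Apply the same two-operand operation to every lane of a vector register.

    d_vreg and s_vreg are lists modelling the replicated registers; the
    result would be written back to the destination vector register.
    """
    if len(d_vreg) != len(s_vreg):
        raise ValueError("vector registers must have the same number of lanes")
    return [op(d, s) for d, s in zip(d_vreg, s_vreg)]


# One vector ADD computes four results in parallel, but all lanes add.
result = vector_execute(operator.add, [1, 2, 3, 4], [10, 20, 30, 40])
```

There is no way in this model to add in one lane and subtract in another within the same instruction, which is the restriction the description points out.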
- Fig. 6 shows an implementation of a VLIW (very long instruction word) architecture 12 according to prior art, according to which different register sets are working with different instructions.
- VLIW very long instruction word
- the difference here is that separate write buses (register pointers 9, 9’, 9” for write operation (destination register) for instruction 0, instruction 1 and instruction 2, respectively) and read buses (register pointers 10, 10’, 10” for read operation (source register) for instruction 0, instruction 1 and instruction 2, respectively) are used.
- Fig. 7 shows an example for a VLIW instruction with multiple parallel instructions 3, 3’, 3” according to prior art, each one composed of its own opcode 4, source register 5 and destination register 6, illustrating the deep impact on the instruction word, which is much stronger than in other architectures. Also, as long as each instruction 3, 3’, 3” has m bits, the final VLIW instruction 13 will have m x 3 bits, since the example defines 3 parallel units (of course, the VLIW architecture can be parallelized ad infinitum).
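The growth of the VLIW instruction word can be checked with a short packing sketch. The helper name `pack_vliw` and the choice of placing the first slot in the most significant bits are assumptions for illustration.

```python
def pack_vliw(slots, m_bits):
    """Concatenate m-bit instructions into one VLIW word.

    Returns the packed word and its total width in bits, which is
    m_bits multiplied by the number of parallel slots.
    """
    word = 0
    for slot in slots:
        if not 0 <= slot < (1 << m_bits):
            raise ValueError("slot does not fit into m_bits")
        word = (word << m_bits) | slot
    return word, m_bits * len(slots)


# Three 4-bit instructions give a 12-bit VLIW instruction.
word, width = pack_vliw([0xA, 0xB, 0xC], 4)
```

Every additional parallel unit adds another full m-bit instruction, including its own source and destination fields, to each instruction word.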
- the RISC processor 14 comprises a program counter unit 15, an instruction memory 16, a register bank 1 comprising a plurality of individual registers 2, an Arithmetic Logic Unit ALU 17, a data memory 18, and a multiplex unit 19.
- Fig. 9 shows a scheme for processing instructions by means of the RISC processor 14 (shown in Fig. 8), wherein each instruction 1, instruction 2, instruction 3, instruction 4, instruction 5 ... instruction n is executed in five clock cycles, and each clock cycle is able to execute different phases, so that theoretically the processing of one instruction per clock may still be achieved.
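The overlap described for Fig. 9 can be tabulated with a minimal sketch of an ideal pipeline without stalls. The stage names IF, ID, EX, MA and WB are the conventional ones and are assumed here for the prior-art RISC processor; the function names are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MA", "WB"]  # assumed classic five stages


def schedule(n_instructions):
    """Return {cycle: [(instruction index, stage), ...]} for an ideal pipeline."""
    table = {}
    for i in range(n_instructions):
        for s, stage in enumerate(STAGES):
            # Instruction i enters stage s in cycle i + s.
            table.setdefault(i + s, []).append((i, stage))
    return table


def total_cycles(n_instructions):
    """Cycles needed until the last instruction leaves the pipeline."""
    if n_instructions <= 0:
        return 0
    return n_instructions + len(STAGES) - 1


plan = schedule(5)
```

Each instruction still takes five cycles from fetch to write back, but once the pipeline is full one instruction completes per cycle, so five instructions finish in nine cycles instead of twenty-five.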
- an interlock problem may occur, if an instruction n needs to use a register which is currently being modified by a predecessor instruction.
- a value, for example, a value “10”
- Fig. 11 illustrates how to resolve the interlock problem illustrated in Fig. 10.
- the processor is stopped (here, for two clock cycles) until it has the instruction result.
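The stall behaviour can be reproduced with a small timing model. It is a sketch under explicit assumptions that are not stated in the source: a five-stage pipeline, operands read in the ID stage, results written in the WB stage, no forwarding, and a register written in WB being readable by an ID stage in the same cycle. The function name `issue_cycles` and the register names are hypothetical.

```python
def issue_cycles(program):
    """Return the fetch (IF) cycle of each instruction.

    program is a list of (dest, sources) pairs, where dest is a register
    name and sources is a list of register names read by the instruction.
    """
    ready = {}      # register -> earliest cycle in which an ID stage may read it
    cycles = []
    next_if = 0
    for dest, sources in program:
        # ID follows IF by one cycle, so delay IF until all sources are readable.
        earliest_id = max([ready.get(r, 0) for r in sources], default=0)
        if_cycle = max(next_if, earliest_id - 1)
        cycles.append(if_cycle)
        ready[dest] = if_cycle + 4   # WB happens four cycles after IF
        next_if = if_cycle + 1
    return cycles


program = [
    ("r1", []),       # writes r1
    ("r2", ["r1"]),   # needs r1 and has to wait for the write back
    ("r3", []),       # independent, but queued behind the stalled instruction
]
fetch = issue_cycles(program)
```

Under these assumptions the dependent instruction is fetched in cycle 3 instead of cycle 1, which corresponds to the two stall cycles mentioned for Fig. 11.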
- Fig. 12 shows three examples for a register file 1 which may be implemented in an embodiment of the processing device, in particular, in a register oriented controller according to an embodiment of the present invention.
- the first uppermost example is a large array of registers embodied as a memory 23.
- the second or middle example is a small array of registers embodied as a register file or register bank 1.
- the third example shows a plurality of single individual registers 2, 2’, 2”, etc.
- Fig. 13 illustrates the assignment of instructions to registers according to an embodiment of the processing device of the present invention.
- Each one of the plurality of single registers 2, 2’, 2” is assigned one instruction in one clock cycle, such that a plurality of instructions 25, 25’, 25”, etc. is assigned to a plurality of registers 2, 2’, 2”, etc. in one clock cycle.
- register 2 will execute the instruction 25 in one clock cycle
- register 2’ will execute the instruction 25’ in the same single clock cycle
- the register 2” will execute the instruction 25” in the same single clock cycle, and so on.
- the instructions 25, 25’, 25”, etc. are so-called picoinstructions, for which there is no need to define the destination register, since the destination register is implicitly given by the associated register.
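The assignment of one picoinstruction per register described for Fig. 13 can be sketched as a single clock step. This is an illustration only: the assumption that all picoinstructions read the register values from the start of the cycle and commit together is not stated in the source, and the function name `clock` and the lambdas are hypothetical.

```python
def clock(regs, picoinstructions):
    """Execute one picoinstruction per register in a single clock cycle.

    picoinstructions[i] computes the next value of register i, so the
    destination is implicit. Every picoinstruction sees the same snapshot
    of the register values taken at the start of the cycle.
    """
    if len(regs) != len(picoinstructions):
        raise ValueError("exactly one picoinstruction per register is required")
    snapshot = tuple(regs)
    return [op(snapshot) for op in picoinstructions]


# Three different operations in the same cycle:
# R0 <- R1, R1 <- R0 (a swap), and R2 <- R0 + R1.
regs = clock(
    [1, 2, 0],
    [
        lambda r: r[1],
        lambda r: r[0],
        lambda r: r[0] + r[1],
    ],
)
```

Because each result can only land in the register that owns the picoinstruction, no two operations compete for the same write port in this model.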
- Fig. 14 shows an example of a decoding circuit which uses a decoder 27 to choose between several picoinstructions 25, 25’, 25”, etc. in order to decode a picoopcode 26.
- the picoopcode 26 with a width of m bits can decode up to 2^m different picoinstructions 25, 25’, 25”, etc.
- each picoinstruction 25, 25’, 25” may either have or may not have other registers as input.
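The multiplexer-style decoding described for Fig. 14 can be sketched as a lookup into a per-register table. The helper `make_decoder` and the four example picoinstructions for a register R0 are assumptions for illustration; the patent does not define a concrete picoinstruction set.

```python
def make_decoder(m_bits, table):
    """Build a decoder that selects a picoinstruction by its m-bit picoopcode."""
    if len(table) > (1 << m_bits):
        raise ValueError("an m-bit picoopcode can select at most 2**m entries")

    def decode(picoopcode):
        if not 0 <= picoopcode < (1 << m_bits):
            raise ValueError("picoopcode does not fit into m bits")
        if picoopcode >= len(table):
            raise ValueError("unassigned picoopcode")
        return table[picoopcode]

    return decode


# Hypothetical picoinstruction set for register R0, selected by 2 bits.
# Each entry maps the current register values to the next value of R0.
r0_set = [
    lambda r: r[0],        # 0: keep R0, no other register as input
    lambda r: r[0] + 1,    # 1: increment R0
    lambda r: r[1],        # 2: load R0 from R1, another register as input
    lambda r: 0,           # 3: clear R0
]
decode_r0 = make_decoder(2, r0_set)
next_r0 = decode_r0(2)([5, 9])
```

The table may be shorter than 2**m entries, which models a register that uses only part of its picoopcode space.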
- Fig. 16 illustrates the storing procedure for a given plurality of picoinstructions 25, 25’, 25”, etc., namely, it can be seen that for a given picoinstruction Pn(m) with opcode 26, the result will always be stored in the corresponding register Rn 2 via the multiplex unit 19.
- Fig. 17 shows the possible varying size of the picoopcodes 26, 26’, 26” inside an instruction 3.
- a first picoopcode 26 has a length of a few bits
- a second picoopcode 26’ has a length of a large number of bits
- a third picoopcode 26” has an intermediate length, and so on.
- different registers 2, 2’, 2”, etc. have picoopcodes 26, 26’, 26”, etc. of different lengths, which results in different sets of picoinstructions.
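Splitting an instruction into picoopcodes of different widths, as described for Fig. 17, can be sketched as follows. The function name `split_instruction`, the example widths and the convention that the first picoopcode occupies the most significant bits are assumptions; the patent does not fix a bit layout.

```python
def split_instruction(word, widths):
    """Split an instruction word into picoopcodes of the given bit widths.

    widths[i] is the number of bits of the picoopcode for register i.
    The first picoopcode is assumed to sit in the most significant bits.
    """
    total = sum(widths)
    if word < 0 or word >> total:
        raise ValueError("instruction word does not fit the given widths")
    fields = []
    shift = total
    for width in widths:
        shift -= width
        fields.append((word >> shift) & ((1 << width) - 1))
    return fields


# A short, a long and an intermediate picoopcode: 2 + 5 + 3 = 10 bits.
fields = split_instruction(0b10_10101_011, [2, 5, 3])
```

A register with a 2-bit picoopcode can choose among at most 4 picoinstructions, while the one with 5 bits can choose among up to 32, which is how different lengths lead to different picoinstruction sets.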
- Fig. 18 shows two examples of instructions 3 and 3’ of the processing device according to an embodiment of the invention.
- the instruction set of the processor device consists of a concatenation of several picoopcodes 26, 26’, 26”, etc. in parallel, whereby different picoopcodes may be used or may not be used for each instruction according to a prefix code 31 or 31’.
- with the prefix code 31’ starting with the condition COND 32, there may be a conditional execution, for example, comparing values of two or more different registers or checking, for example, if a register is positive or negative.
- Fig. 19 shows an example of a ROC processor 34 for the execution of two parallel picoopcodes 26, 26’, each one with a different picoinstruction set 25, 25’, 25”, etc. and 33, 33’, 33”, etc., according to an embodiment of the invention.
- a picoinstruction set can be executed in one clock cycle or in a long pipeline with several clock cycles.
- different picoinstruction sets may have different execution cycles.
- a processor with an instruction set that allows truly parallel execution of n picoinstructions or n picoopcodes in the same clock cycle is provided.
- the ROC does not specify a destination register; thus, there is no WB stage, as illustrated in Fig. 20, which provides the advantage that no instruction-related interlock problems will occur in a one-cycle execution ROC architecture.
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Executing Machine-Instructions (AREA)
- Advance Control (AREA)
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/EP2019/057129 WO2020187421A1 (en) | 2019-03-21 | 2019-03-21 | Processor device and a method for parallel processing instructions in a processor device |
Publications (1)
Publication Number | Publication Date |
---|---|
EP3942407A1 (en) | 2022-01-26 |
Family
ID=65904432
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP19712991.9A Pending EP3942407A1 (en) | 2019-03-21 | 2019-03-21 | Processor device and a method for parallel processing instructions in a processor device |
Country Status (2)
Country | Link |
---|---|
EP (1) | EP3942407A1 (en) |
WO (1) | WO2020187421A1 (en) |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6101592A (en) * | 1998-12-18 | 2000-08-08 | Billions Of Operations Per Second, Inc. | Methods and apparatus for scalable instruction set architecture with dynamic compact instructions |
US7526633B2 (en) * | 2005-03-23 | 2009-04-28 | Qualcomm Incorporated | Method and system for encoding variable length packets with variable instruction sizes |
CN104424128B (en) * | 2013-08-19 | 2019-12-13 | 上海芯豪微电子有限公司 | Variable length instruction word processor system and method |
-
2019
- 2019-03-21 WO PCT/EP2019/057129 patent/WO2020187421A1/en active Application Filing
- 2019-03-21 EP EP19712991.9A patent/EP3942407A1/en active Pending
Also Published As
Publication number | Publication date |
---|---|
WO2020187421A1 (en) | 2020-09-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11188330B2 (en) | Vector multiply-add instruction | |
CN110580175A (en) | Variable format, variable sparse matrix multiply instruction | |
US9164763B2 (en) | Single instruction group information processing apparatus for dynamically performing transient processing associated with a repeat instruction | |
US7346881B2 (en) | Method and apparatus for adding advanced instructions in an extensible processor architecture | |
US10628155B2 (en) | Complex multiply instruction | |
JPH09311786A (en) | Data processor | |
US9965275B2 (en) | Element size increasing instruction | |
WO2015114305A1 (en) | A data processing apparatus and method for executing a vector scan instruction | |
JPH03218523A (en) | Data processor | |
US10303399B2 (en) | Data processing apparatus and method for controlling vector memory accesses | |
TWI770079B (en) | Vector generating instruction | |
TW202223633A (en) | Apparatuses, methods, and systems for instructions for 16-bit floating-point matrix dot product instructions | |
EP3942407A1 (en) | Processor device and a method for parallel processing instructions in a processor device | |
JP5786719B2 (en) | Vector processor | |
JP7006097B2 (en) | Code generator, code generator and code generator | |
US20070061551A1 (en) | Computer Processor Architecture Comprising Operand Stack and Addressable Registers |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: UNKNOWN |
| | STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE |
| | PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase | Free format text: ORIGINAL CODE: 0009012 |
| | STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE |
| | 17P | Request for examination filed | Effective date: 20210906 |
| | AK | Designated contracting states | Kind code of ref document: A1 Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
| | DAV | Request for validation of the european patent (deleted) | |
| | DAX | Request for extension of the european patent (deleted) | |
| | STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: EXAMINATION IS IN PROGRESS |
| | STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: EXAMINATION IS IN PROGRESS |
| | 17Q | First examination report despatched | Effective date: 20240612 |
| | STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE |
| | STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: EXAMINATION IS IN PROGRESS |
| | 17Q | First examination report despatched | Effective date: 20241002 |