CN108287730B - Processor pipeline device

Info

Publication number: CN108287730B
Application number: CN201810338781.3A
Authority: CN
Inventors: 胡振波
Original assignee: Wuhan Silicon Integrated Co Ltd
Current assignee: Wuhan Silicon Integrated Co Ltd
Priority date: 2018-03-14
Filing date: 2018-04-16
Publication date: 2023-12-29
Anticipated expiration: 2038-04-16
Also published as: CN108287730A

Abstract

The invention discloses a processor pipeline structure comprising an instruction storage unit, an instruction fetch unit, an execution unit, a memory access unit and a write-back unit, wherein the write-back unit comprises a first write-back module and a second write-back module. The first write-back module arbitrates the write-back order of the execution results of the long-period instructions output by the execution unit or the memory access unit, so that the write-back order is consistent with the dispatch order of the corresponding long-period instructions. The second write-back module arbitrates the write-back order between the single-period instruction results output by the execution unit and the long-period instruction results output by the first write-back module, with long-period instructions having the higher priority. By improving the internal structure of the pipeline units at each stage, the invention solves the problem that existing processor pipeline structures cannot simultaneously achieve low power consumption, low cost, small area and high performance.

Description

Processor pipeline device

Technical Field

The invention belongs to the technical field of processor hardware design, and particularly relates to an ultra-low power consumption high-performance processor pipeline structure.

Background

In recent years, with the continuous improvement of integrated circuit manufacturing processes, the integration level and performance of processors have improved steadily, and power consumption has risen correspondingly. With the wide use of mobile devices and the rapid development of the Internet of Things, demand for low-power, small-area, low-cost and high-performance processors keeps growing. How to deliver high performance while reducing power consumption and cost has therefore become a new research focus for designers.

The invention patent with grant publication number CN 104699463B discloses a novel pipeline structure that constructs a register stack to reduce the large amount of dynamic power consumed by register toggling on the data path. That pipeline structure is mainly suited to reducing the dynamic power caused by register toggling during large-scale, high-speed transmission; it brings no benefit where the transmission scale is small and the error rate is low, and it also increases design complexity and processor area, raising cost.

The invention patent with grant publication number CN 101464721B discloses a design method that controls performance and power consumption by monitoring the performance of a pipelined processor and reconfiguring the pipeline to switch from a high-performance mode to a low-performance mode when a decrease in processor throughput is detected. The system and design are quite complex and are not suitable for low-cost, small-area processor designs; its own embodiments acknowledge this, mentioning that the development effort is complex and time-consuming.

The invention patent with grant publication number CN 103218029B discloses a pipeline structure that controls the supply voltage. It changes the structure of the registers in an existing pipeline and adds an error correction circuit inside the registers and an error correction circuit outside the pipeline, so that the supply voltage can be lowered further and regulated in real time according to the number of errors, thereby further reducing the power consumption of the core. That design does not consider reducing processor area and cost; although it reduces power consumption to some extent, it also increases the complexity and cost of the system.

In summary, existing processors, while reducing power consumption to some extent, increase system complexity, increase processor area and increase cost.

Disclosure of Invention

Aiming at the defects or improvement demands of the prior art, the invention provides an ultra-low power consumption high performance processor pipeline structure, and aims to solve the problem that the existing processor pipeline structure cannot simultaneously realize low power consumption, low cost, small area and high performance by improving the internal structure of pipeline units at all levels.

To achieve the above object, according to one aspect of the present invention, there is provided a processor pipeline structure including an instruction storage unit, an instruction fetch unit, an execution unit, a memory access unit, and a write-back unit; the first end of the instruction fetch unit is connected with the instruction storage unit, and the second end of the instruction fetch unit is connected with the first end of the execution unit; the second end of the execution unit is connected with the first end of the write-back unit, and the third end of the execution unit is connected with the first end of the memory access unit; the second end of the memory access unit is connected with the second end of the write-back unit;

the instruction fetching unit fetches an instruction from the instruction storage unit in one clock cycle; the execution unit is used for decoding and executing the instruction output by the instruction fetching unit, and the result of instruction execution is written back into the register group through the write-back unit;

the write-back unit comprises a first write-back module and a second write-back module; the first end of the first write-back module is connected with the second end of the execution unit, the second end of the first write-back module is connected with the second end of the memory access unit, and the third end of the first write-back module is connected with the first end of the second write-back module; the second end of the second write-back module is connected with the fourth end of the execution unit;

the first write-back module is used for arbitrating the write-back order of the execution results of the long-period instructions output by the execution unit or the memory access unit, so that the write-back order is consistent with the dispatch order of the corresponding long-period instructions; the second write-back module is used for arbitrating the write-back order between the single-period instruction results output by the execution unit and the long-period instruction results output by the first write-back module, with the long-period instructions having the higher priority.
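The priority rule of the second write-back module can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the circuit of the invention: the function name `second_level_arbiter`, the `(register, value)` tuples, and the assumption of a single register-file write port per cycle are all hypothetical. The long-period candidate is assumed to have already been put in dispatch order by the first write-back module.

```python
def second_level_arbiter(long_result, single_result):
    """Model one write-back cycle of the second write-back module.

    long_result:   result released by the first write-back module, or None.
    single_result: result from the single-period operation module, or None.
    Returns (granted, held): the result written this cycle and the result
    that has to wait. One write port per cycle is an assumption.
    """
    if long_result is not None:
        # Long-period instructions have the higher priority.
        return long_result, single_result
    # Idle cycle for long-period write-back: a single-period result may use it.
    return single_result, None


# Both candidates present: the long-period result is granted, the other is held.
granted, held = second_level_arbiter(("x5", 42), ("x6", 7))
```

In this sketch a held single-period result would simply be presented again in the next cycle; how the hardware actually holds it is not specified here.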

Preferably, the instruction fetch unit of the processor pipeline structure includes a first program counter, a second program counter, a PC generation module, a partial decoding module, a branch prediction module, and an instruction register;

the first end of the instruction register is connected with the first end of the instruction storage unit, and the second end of the instruction register is connected with the first end of the execution unit; the first end of the partial decoding module is connected with the second end of the instruction storage unit, the second end of the partial decoding module is connected with the first end of the branch prediction module, and the third end of the partial decoding module is connected with the first end of the PC generation module; the second end of the branch prediction module is connected with the second end of the PC generation module; the third end of the PC generation module is connected with the first end of the first program counter, the fourth end of the PC generation module is connected with the second end of the first program counter, the fifth end of the PC generation module is connected with the third end of the instruction storage unit, and the sixth end of the PC generation module is connected with the third end of the execution unit; the third end of the first program counter is connected with the first end of the second program counter; the second end of the second program counter is connected with the third end of the execution unit;

the partial decoding module is used for decoding the current instruction fetched from the instruction storage unit to judge whether the type of the current instruction is a common instruction or a branch jump instruction, and if the type of the current instruction is the common instruction, the partial decoding module directly sends the current instruction to the PC generation module; the PC generation module generates an address of a next instruction to be fetched according to the current instruction and the current instruction address sent by the first program counter;

if the current instruction is a branch jump instruction, the partial decoding module sends the current instruction to the branch prediction module; the branch prediction module acquires the jump target address of the current instruction through static prediction, and the PC generation module generates the address of the next instruction to be fetched according to the jump target address of the current instruction acquired by the branch prediction module;

the partial decoding module, the branch prediction module and the PC generation module are combinational logic, so the decoding of the current instruction, the branch prediction and the generation of the address of the next instruction to be fetched are all completed in the same clock cycle.

Preferably, the execution unit of the processor pipeline structure comprises a decoding module, a dispatch module, an instruction tracking module, a single-cycle instruction operation module, a long-cycle instruction operation module and a delivery module;

the first end of the decoding module is connected with the second end of the instruction register, the second end of the decoding module is connected with the second end of the second program counter, and the third end of the decoding module is connected with the first end of the dispatch module; the second end of the dispatch module is connected with the first end of the instruction tracking module, the third end of the dispatch module is connected with the first end of the single-period instruction operation module, the fourth end of the dispatch module is connected with the first end of the long-period instruction operation module, and the fifth end of the dispatch module is connected with the first end of the access unit; the second end of the long-period instruction operation module is connected with the first end of the first write-back module; the second end of the instruction tracking module is connected with the second end of the first write-back module; the second end of the single-period instruction operation module is connected with the second end of the second write-back module, the third end of the single-period instruction operation module is connected with the second end of the memory access unit, and the fourth end of the single-period instruction operation module is connected with the first end of the delivery module; the second end of the delivery module is connected with the third end of the second write-back module, and the third end is connected with the sixth end of the PC generation module;

the instruction tracking module is used for storing information on long-period instructions that have been dispatched but not yet written back; when the dispatch module dispatches an instruction, the information of the instruction being dispatched is compared with each piece of long-period instruction information stored in the instruction tracking module to judge whether the current instruction has a data dependency on a long-period instruction that has been dispatched but not yet written back; if not, the instruction is dispatched normally; if so, dispatch is paused until the relevant long-period instruction finishes executing and the data dependency is released, and dispatch then continues.
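The comparison performed at dispatch can be sketched as follows. This is a minimal illustration that assumes the check covers the RAW and WAW cases named later in this description; the names `TrackedEntry` and `must_stall` are hypothetical, and the entry fields mirror the source operand register index and result register index that a table entry is said to hold.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class TrackedEntry:
    """One outstanding long-period instruction (hypothetical layout)."""
    src_regs: Tuple[int, ...]   # source operand register indexes
    dst_reg: Optional[int]      # result register index, None if no result


def must_stall(cur_srcs: Tuple[int, ...],
               cur_dst: Optional[int],
               outstanding: Iterable[TrackedEntry]) -> bool:
    """Return True if the instruction being dispatched depends on any
    long-period instruction that is dispatched but not yet written back."""
    for entry in outstanding:
        if entry.dst_reg is None:
            continue
        raw = entry.dst_reg in cur_srcs                         # read-after-write
        waw = cur_dst is not None and cur_dst == entry.dst_reg  # write-after-write
        if raw or waw:
            return True
    return False


# An outstanding long-period instruction that will write register 5.
pending = [TrackedEntry(src_regs=(1, 2), dst_reg=5)]
```

With `pending` as above, an instruction reading register 5 or writing register 5 would be held at dispatch, while one touching only other registers would be dispatched normally.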

Preferably, the delivery module of the processor pipeline structure comprises an abnormality judgment sub-module and a branch prediction analysis sub-module;

the first end of the branch prediction analysis sub-module and the first end of the abnormality judgment sub-module are connected with the fourth end of the single-period instruction operation module, and the second end of the branch prediction analysis sub-module and the second end of the abnormality judgment sub-module are connected with the sixth end of the PC generation module; the third end of the abnormality judgment sub-module is connected with the third end of the second write-back module;

the branch prediction analysis sub-module is used for judging, according to the operation result of the single-period instruction operation module, whether the address of the next instruction to be fetched generated by the PC generation module is correct; if so, no action is taken; if not, the erroneous address is cleared, a new address of the next instruction to be fetched is generated, and the new address is fed back to the PC generation module;

the abnormality judgment sub-module is used for judging, according to the operation result of the single-period instruction operation module, whether an error occurred while executing the current instruction; if not, no action is taken; if so, the current instruction address is cleared, a new address is generated, and the new address is fed back to the PC generation module.

Preferably, the instruction tracking module of the processor pipeline structure comprises a plurality of table entries for storing long-period instruction information, wherein one table entry correspondingly stores information of one long-period instruction, and the information comprises a source operand register index and a result register index.

Preferably, in the processor pipeline structure, the instruction tracking module is implemented as a FIFO; when the first write-back module writes back a plurality of long-period instructions, the write-back order of the different long-period instructions is arbitrated according to the order indicated by the read pointer of the FIFO; after a long-period instruction has been written back, the instruction tracking module deletes the information corresponding to that instruction.
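A FIFO-based tracking module of this kind can be modeled in a few lines. The sketch below is an assumption-laden illustration: the class name `InstructionTracker`, the `tag` used to identify an instruction, and the depth of 4 entries are hypothetical. It shows only the ordering behavior: the entry at the read pointer is the only one allowed to write back, and it is deleted once it does.

```python
from collections import deque


class InstructionTracker:
    """Behavioral model of a FIFO that tracks outstanding long-period
    instructions (depth and field names are assumptions)."""

    def __init__(self, depth=4):
        self.depth = depth
        self.fifo = deque()

    def is_full(self):
        return len(self.fifo) >= self.depth

    def allocate(self, tag, src_regs, dst_reg):
        """Record a long-period instruction at dispatch time."""
        if self.is_full():
            raise RuntimeError("tracker full: dispatch would have to wait")
        self.fifo.append({"tag": tag, "src": src_regs, "dst": dst_reg})

    def try_write_back(self, finished_tags):
        """Release the entry at the read pointer only if it has finished.
        Younger finished instructions wait, so write-back follows dispatch order."""
        if self.fifo and self.fifo[0]["tag"] in finished_tags:
            return self.fifo.popleft()   # entry is deleted after write-back
        return None


tracker = InstructionTracker()
tracker.allocate("div", src_regs=(1, 2), dst_reg=5)
tracker.allocate("load", src_regs=(3,), dst_reg=6)
```

Here, if the load finishes before the divide, `try_write_back({"load"})` returns nothing, and the load is released only after the divide has been written back.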

Preferably, in the processor pipeline structure, the single-cycle instruction operation module is further configured to generate a memory access address;

the memory access unit serves as the control module for memory accesses; by address decoding of the memory access address, it obtains the corresponding instruction from the instruction storage component or the corresponding data from the data storage component.
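The address judgment step can be pictured as a simple range decode. The sketch below is purely illustrative: the base addresses, sizes, and the name `select_target` are hypothetical, since the description does not give an address map.

```python
# Hypothetical address map; the description does not specify these values.
ITCM_BASE, ITCM_SIZE = 0x8000_0000, 0x1_0000
DATA_BASE, DATA_SIZE = 0x9000_0000, 0x1_0000


def select_target(addr):
    """Decide which storage component a memory access address falls in."""
    if ITCM_BASE <= addr < ITCM_BASE + ITCM_SIZE:
        return "instruction_storage"
    if DATA_BASE <= addr < DATA_BASE + DATA_SIZE:
        return "data_storage"
    raise ValueError(f"address {addr:#x} is outside the assumed address map")
```

For example, under this assumed map an access to `0x8000_0010` would be routed to the instruction storage component and an access to `0x9000_0010` to the data storage component.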

Preferably, the instruction storage unit of the processor pipeline structure is implemented using an instruction tightly coupled memory.

In general, the above technical solutions conceived by the present invention, compared with the prior art, enable the following beneficial effects to be obtained:

(1) In the processor pipeline structure provided by the invention, two-stage write-back of instructions is realized through the first write-back module and the second write-back module. The first write-back module cooperates with the instruction tracking module to write back the different long-period instructions so that the write-back order is strictly consistent with the dispatch order, which keeps the hardware simple and reduces the processor area. The second write-back module arbitrates the write-back order of all single-period and long-period instructions, with long-period instructions having priority; in idle cycles when no long-period instruction is being written back, single-period instructions can be written back freely. This two-stage write-back strategy separates the delivery of a long-period instruction from its write-back, so that even while a long-period instruction taking multiple cycles is executing, the pipeline is not blocked and subsequent single-period instructions can be written back and delivered smoothly, improving processor performance;

(2) In the processor pipeline structure provided by the invention, data dependencies are handled through the cooperation of the instruction tracking module and the dispatch module. The instruction tracking module stores information on long-period instructions that have been dispatched but not yet written back; when the dispatch module dispatches an instruction, the information of the instruction being dispatched is compared with each piece of long-period instruction information in the instruction tracking module to judge whether the current instruction has a RAW or WAW dependency on such a long-period instruction. If there is no data dependency, the dispatch module dispatches the instruction normally; if there is, dispatch is suspended until the relevant long-period instruction finishes executing and the dependency is released, and dispatch then continues. The invention resolves data dependencies by stalling the pipeline rather than by directly bypassing the result of a long-period instruction to a subsequent instruction awaiting dispatch, thereby reducing the power consumption and area of the processor;

(3) The processor pipeline structure provided by the invention uses an ITCM with single-cycle access as the instruction memory, so the instruction fetch unit can fetch an instruction from the ITCM in one cycle. Using the ITCM in place of a traditional Cache meets the real-time requirements of an ultra-low-power, small-area processor while reducing the cost and area of the processor. The partial decoding module, the branch prediction module and the PC generation module are combinational logic, so the instruction fetch unit completes instruction reading, partial decoding, branch prediction and generation of the PC of the next instruction to be fetched within one cycle; this achieves continuous instruction fetching and greatly improves processor performance.

Drawings

FIG. 1 is a block diagram of a processor pipeline architecture according to an embodiment of the present invention;

FIG. 2 is a block diagram of an instruction fetch unit of a processor pipeline structure according to an embodiment of the present invention;

FIG. 3 is a block diagram of an execution unit of a processor pipeline architecture provided by an embodiment of the present invention;

FIG. 4 is a block diagram of a write-back unit of a processor pipeline structure, according to an embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

The processor pipeline structure provided by the embodiment of the invention is mainly suitable for low-cost, small-area processors designed for embedded ultra-low-power scenarios, such as the Hummingbird E200 processor core developed in-house by the company on the RISC-V architecture. The pipeline structure comprises multi-stage pipeline units, specifically an instruction storage unit, an instruction fetch unit, an execution unit, a memory access unit and a write-back unit; the first end of the instruction fetch unit is connected with the instruction storage unit, and the second end of the instruction fetch unit is connected with the first end of the execution unit; the second end of the execution unit is connected with the first end of the write-back unit, and the third end of the execution unit is connected with the first end of the memory access unit; the second end of the memory access unit is connected with the second end of the write-back unit.

The invention divides the pipeline into stages mainly by clock cycle. The instruction storage unit and the instruction fetch unit belong to the first-stage pipeline: the instruction storage unit stores instructions, and the instruction fetch unit fetches instructions from the instruction storage unit continuously and without interruption. The execution unit and the write-back unit belong to the second-stage pipeline: the execution unit decodes and executes the instruction output by the instruction fetch unit, and the write-back unit writes the result of instruction execution back into the general register file (Regfile). Because an instruction is decoded, executed and written back within the same clock cycle, the execution unit and the write-back unit are both assigned to the second stage of the pipeline structure.

Some single-period instructions need only these two pipeline stages, while some long-period instructions also need the memory access function of the memory access unit. The memory access unit belongs to the third-stage pipeline, but its output is written back to the general register file through the write-back unit in the second-stage pipeline. The ultra-low-power, high-performance processor pipeline structure provided by this embodiment is therefore a variable-length pipeline; compared with a traditional linear pipeline structure it has fewer pipeline stages, which reduces the processor area and cost.

The instruction fetch unit fetches instructions from the instruction storage unit; to improve processor performance, instruction fetching needs to be fast and continuous. Because the pipeline structure provided by this embodiment is mainly suited to low-cost, small-area processors designed for embedded ultra-low-power scenarios, and the program code size for embedded processor cores of this class is small, this embodiment uses an instruction tightly coupled memory (Instruction Tightly Coupled Memory, ITCM) as the instruction storage unit. The instruction fetch unit can fetch one instruction from the ITCM in one clock cycle, realizing fast instruction fetching; compared with a traditional I-Cache, this reduces the cost and area of the processor while still meeting the real-time requirements of an ultra-low-power, small-area processor.

The instruction fetching unit comprises a first program counter, a second program counter, a PC generating module, a partial decoding module, a branch prediction module and an instruction register;

the first end of the instruction register is connected with the first end of the ITCM, and the second end of the instruction register is connected with the first end of the execution unit; the first end of the partial decoding module is connected with the second end of the ITCM, the second end of the partial decoding module is connected with the first end of the branch prediction module, and the third end of the partial decoding module is connected with the first end of the PC generation module; the second end of the branch prediction module is connected with the second end of the PC generation module; the third end of the PC generation module is connected with the first end of the first program counter, the fourth end of the PC generation module is connected with the second end of the first program counter, the fifth end of the PC generation module is connected with the third end of the ITCM, and the sixth end of the PC generation module is connected with the second end of the execution unit; the third end of the first program counter is connected with the first end of the second program counter; the second end of the second program counter is connected with the third end of the execution unit;

the instruction register is used for storing an instruction (called a current instruction) fetched from the ITCM in a certain clock cycle and sending the instruction to the execution unit in the next clock cycle; the second program counter is used for receiving the current instruction address sent by the first program counter and sending the current instruction address to the execution unit in the next clock cycle; the execution unit receives the current instruction and the current instruction address synchronously.

The partial decoding module is used for decoding a current instruction taken out from the ITCM to judge whether the type of the current instruction is a common instruction or a branch jump instruction, and if the type of the current instruction is the common instruction, the partial decoding module directly sends the current instruction to the PC generation module; the PC generation module generates an address (namely a PC value) of a next instruction to be fetched according to the current instruction and the current instruction address sent by the first program counter; if the current instruction is a branch jump instruction, the partial decoding module sends the current instruction to the branch prediction module; the branch prediction module acquires the jump target address of the current instruction through static prediction, and the PC generation module generates the address of the next instruction to be fetched according to the jump target address of the current instruction acquired by the branch prediction module and respectively sends the address to the first program counter and the ITCM;

the partial decoding module, the branch prediction module and the PC generation module are combinational logic, so the decoding of the current instruction, the branch prediction and the generation of the address of the next instruction to be fetched are all completed in the same clock cycle; the instruction fetch unit provided by this embodiment can therefore obtain the current instruction and generate the address of the next instruction to be fetched within one clock cycle, which realizes continuous instruction fetching and improves processor performance.
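The one-cycle behavior of the fetch stage can be sketched as a cycle-level model. This is an illustration under assumptions, not the actual design: the class `FetchUnit`, the callback `predict_target`, the word-indexed list standing in for the ITCM, and the fixed 4-byte instruction size are all hypothetical. Each call to `step` models one clock cycle: the current instruction is read, the next PC is computed combinationally, and the instruction register and second program counter are latched together.

```python
class FetchUnit:
    """Cycle-level sketch of the instruction fetch stage (names are assumptions)."""

    def __init__(self, itcm, predict_target, reset_pc=0):
        self.itcm = itcm                      # list standing in for the ITCM
        self.predict_target = predict_target  # partial decode + branch prediction
        self.pc = reset_pc                    # first program counter
        self.pc_ex = None                     # second program counter
        self.ir = None                        # instruction register

    def step(self):
        """Model one clock cycle of fetching."""
        instr = self.itcm[self.pc // 4]       # single-cycle ITCM read
        # Combinational path: a predicted target for a jump, None otherwise.
        target = self.predict_target(self.pc, instr)
        next_pc = target if target is not None else self.pc + 4
        # Clock edge: the execution unit sees instruction and address together.
        self.ir, self.pc_ex = instr, self.pc
        self.pc = next_pc


# Toy program: the second instruction is predicted to jump back to address 0.
fetch = FetchUnit(
    itcm=["add", "jump_to_0", "sub"],
    predict_target=lambda pc, instr: 0 if instr == "jump_to_0" else None,
)
fetch.step()
fetch.step()
```

After the two steps above, the instruction register holds the jump, the second program counter holds its address, and the first program counter already points at the predicted target, so fetching continues without a gap.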

The branch prediction module adopts a simple and flexible static prediction method: a conditional branch instruction that jumps backward is predicted to be taken, and a conditional branch instruction that jumps forward is predicted not to be taken. Specifically:

1. For conditional branch instructions such as BEQ and BNE, the static prediction method described above is used (that is, a backward jump is predicted to be taken; otherwise the branch is predicted not to be taken). The jump target address is obtained by adding the offset represented by the immediate to the PC.

2. For an unconditional direct jump instruction, such as the jal instruction, the jump direction need not be predicted because the instruction always jumps. The jump target address is obtained by adding the offset represented by the immediate to the PC.

3. For an unconditional indirect jump instruction, such as the jalr instruction, the jump direction need not be predicted because the instruction always jumps. For the jump target address, the base address operand required by the calculation is given by the rs1 index and must be read from the general register file, and it may also form a RAW data dependency with an instruction being executed in the execution unit. The invention adopts different schemes for different rs1 indexes: if the index corresponds to the constant register, the constant is used directly without reading the register; if the index corresponds to the commonly used link register, the register is wired out directly, and to prevent a read-after-write data dependency it must be confirmed that the write-back unit in the second-stage pipeline is not performing a write-back to that register. Specifically:

3.1. If the rs1 index is x0, the base address calculation uses the constant 0 directly (x0 represents the constant 0 according to the RISC-V architecture definition), without reading from the Regfile.

3.2. If the rs1 index is x1, since x1 is commonly used as the link register for function-return jumps, x1 is wired directly out of the Regfile in the execution unit and does not occupy a Read Port of the Regfile. Because an instruction being executed in the execution unit may need to write back to the link register and thereby cause a RAW data dependency, the branch prediction module needs to determine whether an instruction is currently writing back to the link register.

3.3. If the rs1 index is a register other than x0 and x1 (abbreviated as xn), xn must be read from the Regfile through a Read Port of the Regfile, so it must first be determined that the Read Port is currently idle and no resource conflict exists. Meanwhile, because an instruction being executed in the execution unit may write back xn and cause a RAW data dependency, the branch prediction module needs to determine whether any instruction is currently about to write back to the Regfile.
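The static prediction rules listed above can be sketched as a small function. This is a simplified illustration, not the module itself: the names `jalr_base` and `predict_next_pc`, the pre-decoded `kind` and `imm` arguments, the `read_port` callback, and the 4-byte sequential step are assumptions. The hazard checks (pending write-back to x1 or xn, busy Read Port) are deliberately left out. Clearing the lowest bit of the jalr target follows the RISC-V ISA definition rather than this description.

```python
def jalr_base(rs1, x1_value, read_port):
    """Pick the base address source for an indirect jump by rs1 index."""
    if rs1 == 0:
        return 0                  # x0 is the constant 0, no register read
    if rs1 == 1:
        return x1_value           # x1 is wired out directly, no Read Port used
    return read_port(rs1)         # any other xn goes through a Regfile Read Port


def predict_next_pc(kind, pc, imm, rs1=0, x1_value=0, read_port=None, step=4):
    """Return the predicted address of the next instruction to fetch.

    kind: "branch", "jal", "jalr", or anything else for an ordinary instruction.
    imm:  signed offset already extracted from the instruction (assumption).
    """
    if kind == "branch":
        # Backward jump (negative offset) is predicted taken, forward not taken.
        return pc + imm if imm < 0 else pc + step
    if kind == "jal":
        return pc + imm           # always jumps, target is PC plus offset
    if kind == "jalr":
        base = jalr_base(rs1, x1_value, read_port)
        return (base + imm) & ~1  # RISC-V jalr clears bit 0 of the target
    return pc + step              # ordinary instruction: sequential fetch
```

For instance, a conditional branch at `0x100` with offset `-16` is predicted to go to `0xF0`, while the same branch with offset `+16` is predicted to fall through to `0x104`.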

The execution unit comprises a decoding module, a dispatch module, an instruction tracking module, a single-period instruction operation module, a long-period instruction operation module and a delivery module;

the first end of the decoding module is connected with the second end of the instruction register, the second end of the decoding module is connected with the second end of the second program counter, and the third end of the decoding module is connected with the first end of the dispatch module; the second end of the dispatch module is connected with the first end of the instruction tracking module, the third end of the dispatch module is connected with the first end of the single-period instruction operation module, the fourth end of the dispatch module is connected with the first end of the long-period instruction operation module, and the fifth end of the dispatch module is connected with the first end of the access unit; the second end of the instruction tracking module is connected with the first end of the write-back unit; the second end of the single-period instruction operation module is connected with the second end of the write-back unit, the third end of the single-period instruction operation module is connected with the second end of the memory access unit, and the fourth end of the single-period instruction operation module is connected with the first end of the delivery module; the second end of the long-period instruction operation module is connected with the third end of the write-back unit; the second end of the delivery module is connected with the fourth end of the write-back unit, and the third end of the delivery module is connected with the sixth end of the PC generation module;

the decoding module is used for decoding the acquired current instruction and current instruction address to obtain the operand register indexes, and for obtaining the corresponding operand data from Read-Regfile according to those indexes;

the dispatch module is used for dispatching the operation data acquired by the decoding module to different operation units for execution according to the instruction type; the single-period instruction operation module is mainly used for operation and execution of single-period instructions, and the long-period instruction operation module is mainly used for operation and execution of long-period instructions; the execution results of the single-cycle instruction and the long-cycle instruction are written back to Write-Regfile through a Write-back unit;

the delivery module is used for delivering the calculation result of the single-period instruction operation module to the PC generation module; the delivery module comprises an abnormality judgment sub-module and a branch prediction analysis sub-module;

the first end of the branch prediction analysis sub-module and the first end of the abnormality judgment sub-module are connected with the fourth end of the single-period instruction operation module, and the second end of the branch prediction analysis sub-module and the second end of the abnormality judgment sub-module are connected with the sixth end of the PC generation module; the third end of the abnormality judgment sub-module is connected with the fourth end of the write-back unit;

the branch prediction analysis sub-module is used for judging, according to the calculation result of the single-period instruction operation module, whether the next instruction address to be fetched generated by the PC generation module is correct; if so, no action is taken; if not, the erroneous address is cleared, and a new next instruction address to be fetched is generated and fed back to the PC generation module; the abnormality judgment sub-module is used for judging, according to the calculation result of the single-period instruction operation module, whether an error has occurred during execution of the current instruction; if not, no action is taken; if so, the current instruction address is cleared, and a new address is generated and fed back to the PC generation module; the PC generation module sends the new address to the ITCM, and the instruction fetching unit fetches the instruction again from the ITCM and sends it to the execution unit for decoding and execution.
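The check performed by the branch prediction analysis sub-module can be illustrated with a minimal sketch. It assumes a fixed 4-byte instruction length and a simple taken/target result from the single-period instruction operation module; the names `Redirect` and `resolve_branch` are hypothetical and do not appear in the original disclosure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Redirect:
    flush: bool             # True if the wrongly fetched address must be cleared
    new_pc: Optional[int]   # corrected fetch address, or None if nothing changes


def resolve_branch(predicted_next_pc: int, pc: int, taken: bool,
                   target: int, inst_len: int = 4) -> Redirect:
    """Compare the predicted next fetch address with the computed one.

    inst_len = 4 is an assumption made only for this sketch.
    """
    actual_next_pc = target if taken else pc + inst_len
    if actual_next_pc == predicted_next_pc:
        # Prediction was correct: no action is taken.
        return Redirect(flush=False, new_pc=None)
    # Prediction was wrong: clear the erroneous address and feed the
    # corrected address back to the PC generation module.
    return Redirect(flush=True, new_pc=actual_next_pc)
```

For example, a branch at address 0x100 that was predicted not taken (next address 0x104) but is actually taken to 0x200 yields a flush with the new address 0x200.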

Because some long-period instructions may encounter errors during execution, the write-back unit needs to interface with the abnormality judgment sub-module so that an exception can be triggered; if an exception occurs, the execution result of the long-period instruction is not written back to Write-Regfile.

The dispatch module is based on an in-order dispatch micro-architecture; when each instruction is dispatched, it must be checked whether a data dependency exists between that instruction and any instruction that was previously dispatched for execution but has not yet been written back; data dependencies are divided into Write-After-Read (WAR), Read-After-Write (RAW), and Write-After-Write (WAW);

1. WAR correlation: since the pipeline structure provided by the invention is suitable for a processor micro-architecture based on sequential dispatch and sequential write-back, and source operands are read from the general register set when an instruction is dispatched, the Write-Regfile operation of a subsequently executed instruction cannot happen before the Read-Regfile operand read of a previously executed instruction; therefore, data conflicts caused by WAR correlation cannot occur.

2. RAW correlation: the instruction being dispatched is at the second stage of the pipeline; if the previously dispatched instruction (referred to as the leading instruction for short) is a single-cycle instruction (which is also written back at the second stage of the pipeline), the leading single-cycle instruction has already completed execution and written its result back to Write-Regfile, so the instruction being dispatched cannot have a data conflict caused by a RAW dependency on the leading single-cycle instruction; however, if the leading instruction is a long-cycle instruction, the long-cycle instruction needs multiple cycles before its result is written back, so the instruction being dispatched may have a RAW dependency on the leading long-cycle instruction.

3. WAW correlation: the instruction being dispatched is at the second stage of the pipeline; if the leading instruction is a single-cycle instruction, the leading single-cycle instruction has already completed execution and written its result back to Write-Regfile, so the instruction being dispatched cannot have a data conflict caused by a WAW dependency on the leading single-cycle instruction; however, if the leading instruction is a long-cycle instruction, the long-cycle instruction needs multiple cycles before its result is written back, so the instruction being dispatched may have a WAW dependency on the leading long-cycle instruction.

In summary, in the pipeline structure provided by the present invention, an instruction being dispatched can only have RAW and WAW dependencies with a long-cycle instruction that has been dispatched but has not yet completed execution.
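The three dependency types and the conclusion above can be illustrated with a short sketch. The function names and the representation of an instruction as a list of source register indexes plus one result register index are assumptions made only for illustration.

```python
def classify_dependency(lead_srcs, lead_dst, cur_srcs, cur_dst):
    """Return the set of dependency types between a leading instruction
    and the instruction being dispatched (register indexes are integers,
    a result index of None means the instruction writes no register)."""
    deps = set()
    if lead_dst is not None and lead_dst in cur_srcs:
        deps.add("RAW")   # current instruction reads what the leader writes
    if lead_dst is not None and lead_dst == cur_dst:
        deps.add("WAW")   # both instructions write the same register
    if cur_dst is not None and cur_dst in lead_srcs:
        deps.add("WAR")   # current instruction writes what the leader reads
    return deps


def hazards_to_check(leading_is_long_cycle, deps):
    """Keep only the dependencies that can cause a conflict in this pipeline:
    none for a leading single-cycle instruction, and only RAW and WAW for a
    leading long-cycle instruction (WAR never causes a conflict)."""
    if not leading_is_long_cycle:
        return set()
    return deps & {"RAW", "WAW"}
```

With a leading long-cycle instruction that writes x5, a dispatched instruction that reads x5 and also writes x5 has both RAW and WAW dependencies; the same pattern behind a single-cycle leader requires no check.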

In order to detect RAW and WAW dependencies between the instruction currently being dispatched and a leading long-cycle instruction, the present invention provides an instruction tracking module in the execution unit for storing information on long-cycle instructions that have been dispatched but not yet written back, including, but not limited to, the source operand register index and the result register index of each long-cycle instruction;

the instruction tracking module is preferably implemented as a first-in first-out (FIFO) structure; each time the dispatch module dispatches a long-period instruction, an entry is allocated to that instruction in the instruction tracking module, and the entry stores the source operand register index and the result register index of the long-period instruction; after the write-back unit writes the execution result of the long-period instruction back to Write-Regfile, the instruction tracking module removes the corresponding entry, so that the instruction tracking module always holds the information of the long-period instructions that have been dispatched but not yet written back; when the dispatch module dispatches an instruction, the source operand register index and the result register index of the instruction being dispatched are compared with the information in each entry of the instruction tracking module to judge whether the current instruction has a RAW or WAW dependency with a long-period instruction that has been dispatched but not yet written back; if there is no data dependency, the instruction is dispatched normally; if there is a data dependency, dispatch is suspended until the related long-period instruction has finished executing and the data dependency is removed. The depth of the FIFO defaults to two entries, so that the information of two long-period instructions can be stored at the same time; the number of entries preferably does not exceed four, otherwise the running speed of the processor will be reduced.
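The behaviour of the instruction tracking module described above can be modelled in software as a rough sketch. The class and method names are hypothetical, and stalling a long-period instruction when the FIFO is full is an assumption of this sketch; the original text does not state what happens in that case.

```python
from collections import deque


class InstructionTracker:
    """Software model of a FIFO of dispatched, not yet written back
    long-period instructions (default depth of two entries)."""

    def __init__(self, depth=2):
        self.depth = depth
        self.entries = deque()   # oldest entry at the left

    def has_dependency(self, srcs, dst):
        """Compare the dispatched instruction with every stored entry."""
        for _lead_srcs, lead_dst in self.entries:
            if lead_dst is None:
                continue
            if lead_dst in srcs:
                return True      # RAW dependency
            if dst is not None and lead_dst == dst:
                return True      # WAW dependency
        return False

    def try_dispatch(self, srcs, dst, long_cycle):
        """Return True if the instruction is dispatched, False if stalled."""
        if self.has_dependency(srcs, dst):
            return False         # stall until the dependency is removed
        if long_cycle:
            if len(self.entries) >= self.depth:
                return False     # assumption: stall when no entry is free
            self.entries.append((tuple(srcs), dst))
        return True

    def retire_oldest(self):
        """Remove the entry of the long-period instruction just written back."""
        return self.entries.popleft()
```

In this model, a long-period instruction writing x3 blocks any later instruction that reads or writes x3 until `retire_oldest` is called, while unrelated instructions are dispatched normally.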

To avoid conflicts caused by data dependencies, the pipeline structure provided by the invention stalls the pipeline rather than bypassing the result of a long-period instruction directly to a subsequent instruction to be dispatched, which reduces the power consumption and area of the processor.

Some long-cycle instructions, such as Load and Store instructions and "A" extension instructions, require the memory access function of the memory access unit; after such an instruction is dispatched to the single-period instruction operation module by the dispatch module, the single-period instruction operation module computes the memory access address and sends it to the memory access unit; the memory access unit, acting as the control module for memory access, determines from the address whether to obtain the corresponding instruction from the instruction storage component or the corresponding data from the data storage component.
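The address generation and address judgment steps can be sketched as follows. The 32-bit address width, the base-plus-offset address calculation, and the address ranges of the two storage components are all hypothetical values chosen for illustration; the original text does not specify an address map.

```python
# Hypothetical address map (not taken from the original disclosure).
ITCM_BASE, ITCM_SIZE = 0x8000_0000, 0x1_0000
DTCM_BASE, DTCM_SIZE = 0x9000_0000, 0x1_0000


def gen_address(base, offset):
    """Address computed by the single-period instruction operation module,
    assumed here to be base + offset truncated to 32 bits."""
    return (base + offset) & 0xFFFF_FFFF


def route_access(addr):
    """Address judgment in the memory access unit: select the storage
    component that the address falls into."""
    if ITCM_BASE <= addr < ITCM_BASE + ITCM_SIZE:
        return "instruction_storage"
    if DTCM_BASE <= addr < DTCM_BASE + DTCM_SIZE:
        return "data_storage"
    return "unmapped"
```

Under these assumed ranges, a Load with base 0x90000000 and offset 8 is routed to the data storage component.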

The write-back unit comprises a first write-back module and a second write-back module, wherein the first end of the first write-back module is connected with the second end of the long-period instruction operation module, the second end of the first write-back module is connected with the second end of the instruction tracking module, the third end of the first write-back module is connected with the first end of the second write-back module, and the fourth end of the first write-back module is connected with the third end of the access unit; the second end of the second write-back module is connected with the second end of the single-period instruction operation module, and the third end of the second write-back module is connected with the third end of the abnormality judgment sub-module;

the first write-back module is mainly used for arbitrating the write-back of the execution results of the long-period instructions; as shown in fig. 4, the operation result of a long-period instruction processed by the long-period instruction operation module or the memory access unit first enters the first write-back module; in addition, the operation results of long-period instructions received by the first write-back module may also come from a multiplier-divider, an FPU, an EAI coprocessor and the like; in theory, these long-period instructions do not have to be written back strictly in their dispatch order: the dispatch order only needs to be followed when their registers conflict, and otherwise they can be written back out of order. However, to keep the hardware simple, the present embodiment chooses to write back the operation results strictly in the dispatch order of the long-period instructions; because the execution cycle counts of different long-period instructions differ, and the cycle count of some long-period instructions is even dynamic, the order among the long-period instructions cannot easily be determined, so it needs to be recorded in advance.

The instruction tracking module provided in this embodiment is configured to record information of a long-period instruction, and if the dispatch module dispatches a long-period instruction, a table entry is allocated to the long-period instruction in the instruction tracking module to record the information of the long-period instruction; the FIFO Pointer (Pointer) of this entry acts as the Instruction Tag (ITAG) for the long-period Instruction; the long-period instruction carries the corresponding ITAG from the dispatch to the time when the operation result is written back;

the instruction tracking module and the first write-back module cooperate to complete write-back operation of all long-period instructions, and the operation result of the long-period instructions received by the first write-back module comprises ITAG corresponding to the long-period instructions; because the instruction tracking module is a first-in first-out FIFO, a read pointer (ReadPointer) of the FIFO points to an item which enters the instruction tracking module first, the first write-back module sends an operation result of a long-period instruction corresponding to the item to the second write-back module, and meanwhile, the instruction tracking module deletes the item corresponding to the long-period instruction; the first write-back module determines the write-back sequence of the long-period instruction operation result according to the pointing sequence of the instruction tracking module, and ensures that the write-back sequence and the dispatch sequence are strictly consistent.
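The cooperation between the instruction tracking module and the first write-back module can be modelled as a small sketch in which results may arrive in any order but are released strictly in dispatch order. The class name, the method names, and the use of a dictionary to hold results that have arrived early are assumptions of this sketch.

```python
class FirstWriteBack:
    """Model of ITAG-ordered write-back: the FIFO write pointer is used as
    the ITAG at dispatch, and only the result whose ITAG matches the FIFO
    read pointer may be forwarded to the second write-back module."""

    def __init__(self, depth=2):
        self.depth = depth
        self.write_ptr = 0
        self.read_ptr = 0
        self.count = 0
        self.pending = {}        # itag -> result waiting for its turn

    def allocate(self):
        """Called when a long-period instruction is dispatched; returns its ITAG."""
        if self.count == self.depth:
            raise RuntimeError("instruction tracking FIFO is full")
        itag = self.write_ptr
        self.write_ptr = (self.write_ptr + 1) % self.depth
        self.count += 1
        return itag

    def result_ready(self, itag, value):
        """An operation unit delivers a result together with its ITAG."""
        self.pending[itag] = value

    def arbitrate(self):
        """Release the oldest instruction's result if it has arrived."""
        if self.count == 0 or self.read_ptr not in self.pending:
            return None
        itag = self.read_ptr
        value = self.pending.pop(itag)
        self.read_ptr = (self.read_ptr + 1) % self.depth
        self.count -= 1          # the entry is deleted from the FIFO
        return itag, value
```

If the second dispatched instruction finishes first, `arbitrate` returns nothing until the first instruction's result has also arrived, after which the two results are released in dispatch order.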

The second write-back module is mainly used for receiving the operation result of a single-period instruction sent by the single-period instruction operation module and the operation result of a long-period instruction arbitrated by the first write-back module, and for arbitrating the write-back order of all instructions by priority. A single-cycle instruction can be written back freely in any idle cycle in which no long-cycle instruction is being written back; that is, a single-cycle instruction at a later location in the program stream may be written back earlier than a long-cycle instruction at an earlier location (if there is no data dependency), so the pipeline structure provided by embodiments of the present invention also has the ability to write back out of order.
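The priority arbitration of the second write-back module can be sketched as a single function, following the rule stated in the abstract that the long-period instruction has the higher priority. The function name and the return convention (the granted result and the result that must wait) are hypothetical.

```python
def second_write_back(long_result, single_result):
    """One arbitration cycle of the second write-back module.

    Each argument is either None (no request this cycle) or a result.
    Returns (granted, waiting): the result written back this cycle and
    the result, if any, that has to wait for a later cycle.
    """
    if long_result is not None:
        # A long-period result from the first write-back module wins.
        return long_result, single_result
    # Idle cycle for long-period results: the single-period result,
    # if present, is written back immediately.
    return single_result, None
```

When both kinds of result request write-back in the same cycle, the long-period result is granted and the single-period result waits; in every other cycle the single-period result is written back at once.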

Compared with existing processor pipeline structures, the processor pipeline structure provided by the invention improves the internal structure of each pipeline stage: a partial decoding module, a branch prediction module and a PC generation module, all implemented as combinational logic, are arranged in the instruction fetching unit, so that acquiring the current instruction and generating the address of the next instruction to be fetched can be completed within one clock cycle, which enables continuous instruction fetching and improves processor performance; an instruction tracking module is arranged in the execution unit, and data dependencies are resolved by stalling the pipeline rather than by bypassing the result of a long-period instruction directly to a subsequent instruction to be dispatched, which reduces the power consumption and area of the processor; the write-back unit is divided into a first write-back module and a second write-back module, and the delivery and write-back of long-period instructions are separated by a two-stage write-back strategy, so that the pipeline is not blocked even while a multi-cycle long-period instruction is executing, and subsequent single-period instructions can still be written back and delivered, which improves processor performance; the invention thereby solves the problem that existing processor pipeline structures cannot simultaneously achieve low power consumption, low cost, small area and high performance.

It will be readily appreciated by those skilled in the art that the foregoing description is merely a preferred embodiment of the invention and is not intended to limit the invention, but any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims

1. The processor pipeline device comprises an instruction storage unit, an instruction fetching unit, an execution unit, a memory access unit and a write-back unit, and is characterized in that a first end of the instruction fetching unit is connected with the instruction storage unit, and a second end of the instruction fetching unit is connected with a first end of the execution unit; the second end of the execution unit is connected with the first end of the write-back unit, and the third end of the execution unit is connected with the first end of the access unit; the second end of the access unit is connected with the second end of the write-back unit;

the write-back unit comprises a first write-back module and a second write-back module; the first end of the first write-back module is connected with the second end of the execution unit, the second end of the first write-back module is connected with the second end of the memory access unit, and the third end of the first write-back module is connected with the first end of the second write-back module; the second end of the second write-back module is connected with the fourth end of the execution unit;

2. The processor pipeline apparatus of claim 1 wherein the instruction fetch unit comprises a first program counter, a second program counter, a PC generation module, a partial decode module, a branch prediction module, and an instruction register;

3. The processor pipeline apparatus of claim 1 or 2, wherein the execution unit comprises a decoding module, a dispatch module, an instruction tracking module, a single-period instruction operation module, a long-period instruction operation module, and a delivery module;

the first end of the decoding module is connected with the second end of the instruction register, the second end of the decoding module is connected with the second end of the second program counter, and the third end of the decoding module is connected with the first end of the dispatch module; the second end of the dispatch module is connected with the first end of the instruction tracking module, the third end of the dispatch module is connected with the first end of the single-period instruction operation module, the fourth end of the dispatch module is connected with the first end of the long-period instruction operation module, and the fifth end of the dispatch module is connected with the first end of the access unit; the second end of the long-period instruction operation module is connected with the first end of the first write-back module; the second end of the instruction tracking module is connected with the second end of the first write-back module; the second end of the single-period instruction operation module is connected with the second end of the second write-back module, the third end of the single-period instruction operation module is connected with the second end of the memory access unit, and the fourth end of the single-period instruction operation module is connected with the first end of the delivery module; the second end of the delivery module is connected with the third end of the second write-back module, and the third end of the delivery module is connected with the sixth end of the PC generation module; the instruction tracking module is used for storing long-period instruction information which is sent out and not written back, and when the dispatching module dispatches the instruction, the information of the current dispatched instruction is compared with each long-period instruction information stored in the instruction tracking module so as to judge whether the current instruction has data correlation with the long-period instruction which is sent out and not written back, if not, the current instruction is dispatched normally; if yes, the dispatch is paused until the data correlation is released after the execution of the relevant long-period instruction is completed, and the dispatch is continued.

4. The processor pipeline apparatus of claim 3, wherein the delivery module includes an exception determination sub-module and a branch prediction resolution sub-module;

5. A processor pipeline apparatus as claimed in claim 3, wherein said instruction tracking module comprises a plurality of entries for storing long-cycle instruction information, one of said entries corresponding to information storing one long-cycle instruction, said information comprising a source operand register index and a result register index.

6. The processor pipeline apparatus of claim 5 wherein the instruction tracking module is implemented using a FIFO, and wherein when the first write-back module performs write-back operation on the plurality of long-period instructions, the write-back order of the different long-period instructions is arbitrated according to the pointing order of the read pointers of the FIFO; after a certain long-period instruction is written back, the instruction tracking module deletes the information corresponding to the long-period instruction.

7. The processor pipeline apparatus of claim 3 wherein the single cycle instruction operation module is further to generate a memory access address;

the access unit is used as a control module of memory access, and corresponding instructions are obtained from the instruction storage component or corresponding data are obtained from the data storage component according to the memory access address through address judgment.

8. The processor pipeline apparatus of claim 1 wherein the instruction storage unit is implemented using instruction tightly coupled memory.