CN106775591B - Hardware loop processing method and system for a processor - Google Patents
- Publication number
- CN106775591B (application CN201611021587.XA)
- Authority
- CN
- China
- Prior art keywords
- instruction
- loop
- decoding unit
- circular buffer
- loop body
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/3005—Arrangements for executing specific machine instructions to perform operations for flow control
- G06F9/30065—Loop control instructions; iterative instructions, e.g. LOOP, REPEAT
- G06F9/30069—Instruction skipping instructions, e.g. SKIP
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Advance Control (AREA)
Abstract
A hardware loop processing system and method for a processor. A circular buffer is added to the existing fetch unit, and the first N instructions of each loop body are output directly from the circular buffer to the downstream decoding unit. This eliminates the extra wait cycles that the read latency of the program memory would otherwise cause on every jump from the last instruction of the loop body back to its first instruction, so the hardware loop jumps with zero delay. The design is simple: it only adds one hardware circular buffer and a corresponding selecting module to the existing system. The method also reduces the fetch unit's accesses to the program memory, which lowers the processor's power consumption.
Description
Technical field
The present invention relates to hardware loop processing, and in particular to a hardware loop processing technique that improves the fetch-stage architecture of a processor or DSP.
Background
Loops are a very common program structure in processors and DSPs. In the prior art, a loop is usually implemented with ordinary jump instructions working together with general-purpose registers. At the end of the loop body, however, this approach needs an extra instruction to test whether the loop has finished and, if it has not, a jump instruction to return to the first instruction of the loop body. This adds many extra cycles to the loop body, and with nested loops the execution efficiency suffers significantly.
To make loops execute more efficiently, some processors and DSPs now handle them with hardware loops: the hardware itself determines whether the loop has finished and jumps to the first instruction of the loop body. This removes the extra test and jump instructions, but because the program memory has read latency, every jump from the last instruction of the loop body to the first still costs extra wait cycles before the first instruction can be read. If the loop body contains only a few instructions, these wait cycles still seriously reduce the efficiency of the loop.
To address this problem, the hardware loop processing method and system for a processor or DSP disclosed here eliminate the extra wait cycles on each jump from the last instruction of the loop body to the first, achieve zero-delay loop jumps, and greatly improve loop efficiency. They also reduce the number of program memory reads and therefore the power consumption of the processor or DSP.
Summary of the invention
To overcome the shortcomings of the prior art, the present invention aims to provide a hardware loop processing method and system for a processor.
First, to achieve this aim, the invention proposes a processor hardware loop processing system. The system comprises a program memory, a fetch unit, a circular buffer, a selecting module, a decoding unit and an execution unit.
The program memory is connected to the input of the fetch unit. The instruction output of the fetch unit is connected to one input of the selecting module, and the other input of the selecting module is connected to the output of the circular buffer. The output of the selecting module is connected to the input of the decoding unit, and the output of the decoding unit is connected to the input of the execution unit.
The fetch unit is also connected to the first control signal terminal of the decoding unit, and the circular buffer is also connected to the second control signal terminal of the decoding unit.
Further, in the above system, the first control signal terminal of the decoding unit serves two purposes. While the decoding unit has not read a loop-node instruction, it controls the fetch unit to output the next instruction to be decoded to the decoding unit through the selecting module. After the decoding unit reads the first instruction of a new loop, it also controls the fetch unit to output the 1st to N-th instructions of the new loop body to the circular buffer in step with their output through the selecting module. The circular buffer receives the first N instructions of the new loop body and pushes them onto its stack for storage. Here the loop-node instruction is the last instruction of the loop body while the loop is still in progress. The number of instructions the circular buffer receives for a new loop body is N = min{n, m}, where n is the number of instructions in the loop body and m is the number of clock cycles an instruction takes to travel from the program memory to the decoding unit, minus 1.
When the decoding unit reads a loop-node instruction, its second control signal terminal makes the circular buffer output the 1st to N-th instructions of the current loop body, in order, to the decoding unit through the selecting module.
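The rule N = min{n, m} can be written as a short calculation. The following is a minimal sketch, not part of the patent text: the function name `buffered_instruction_count` and its parameter names are hypothetical, and the latency is expressed as the total number of cycles from the program memory to the decoding unit, so that m is that latency minus 1.

```python
def buffered_instruction_count(loop_length: int, fetch_to_decode_cycles: int) -> int:
    """Return N = min{n, m}, where m = fetch_to_decode_cycles - 1.

    loop_length            -- n, the number of instructions in the loop body
    fetch_to_decode_cycles -- cycles an instruction needs to travel from the
                              program memory to the decoding unit (hypothetical name)
    """
    if loop_length < 1:
        raise ValueError("a loop body needs at least one instruction")
    if fetch_to_decode_cycles < 1:
        raise ValueError("the fetch-to-decode latency must be at least one cycle")
    m = fetch_to_decode_cycles - 1
    return min(loop_length, m)


# A 2-cycle latency gives m = 1, so a single instruction is buffered.
print(buffered_instruction_count(loop_length=5, fetch_to_decode_cycles=2))
```

The `min` matters for short loops: a one-instruction loop body on a 4-cycle path buffers one instruction, not three.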
In this way the selecting module decides, from the instruction decoded by the decoding unit and that instruction's position in the loop body, whether the next instruction is read directly from the circular buffer or must still be read stage by stage from the program memory. The first N instructions of the loop body are output straight from the circular buffer in the fetch stage to the decoding unit, which avoids the extra wait cycles caused by passing instructions stage by stage from the program memory to the decoding unit on every iteration. Besides giving zero-delay hardware loop jumps, the invention reduces accesses to the program memory and so lowers power consumption.
Further, in the above system, the circular buffer is a stack made up of two or more instruction buffers. The number of stack layers equals the number of instruction buffers, which in turn equals the number of loop nesting levels the processor supports. The stack is read first in, last out.
With the stack, the N instructions of the outermost loop sit at the bottom and the N instructions of the innermost loop sit at the top. When instructions are read, the loop structure is followed: the innermost loop is processed to completion first, and only then are the instructions of the next loop outward processed. The first-in, last-out read order ensures that each loop level is handled according to its nesting position, so no ordering errors occur.
Further, in the above system, each instruction buffer is a memory of size N. Each memory stores the 1st to N-th instructions of its loop body in order at addresses 0 to N-1.
In the above system, the memory in the circular buffer is a RAM (random-access memory) or a set of registers whose size is N instruction lengths.
Second, to achieve the above aim, a processor pipeline fetch-stage processing method is also proposed; it replaces the fetch-stage processing of an existing pipeline. In this method, the fetch stage handles each instruction it receives from the previous stage as follows:
The first step: determine whether the instruction received from the previous stage is the first instruction of a new loop body. If so, the fetch stage stores the 1st to N-th instructions of the current loop body in order while outputting them to the next stage. Otherwise, go to the third step.
This determination is made by the decoding unit in the stage after the fetch stage, which sends control signals to the fetch unit and to the circular buffer through its two control signal terminals.
Here N = min{n, m}, where n is the number of instructions in the current loop body and m is the number of clock cycles the processor pipeline needs from the memory read to the decode stage, minus 1.
The second step: if the instruction is the last instruction of the loop body and the loop has not finished, output the 1st to N-th instructions of the current loop body stored in the fetch stage to the next stage in order, then wait for a new instruction from the previous stage and return to the first step. Otherwise, go to the third step.
The third step: the fetch stage outputs the instruction it received from the previous stage to the next stage, which ends this round of fetch-stage processing.
One round of fetch-stage processing is the process of outputting one instruction to the next stage and corresponds to one clock cycle of the processor. The fetch stage applies the above steps to every instruction it receives.
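The three steps above can be sketched as a small behavioural model. This is an illustrative sketch under stated assumptions, not the patented hardware: it models a single, non-nested loop, ignores timing, and the names `fetch_stage_stream`, `body`, `iterations` and the source labels `"memory"` and `"buffer"` are all hypothetical.

```python
from typing import Iterator, List, Tuple


def fetch_stage_stream(body: List[str], iterations: int, m: int) -> Iterator[Tuple[str, str]]:
    """Yield (instruction, source) pairs as the fetch stage hands them to decode.

    source is "memory" for the normal fetch path and "buffer" for the
    circular-buffer path chosen by the selecting module.
    """
    n_buffered = min(len(body), m)  # N = min{n, m}
    buffer: List[str] = []
    for iteration in range(iterations):
        for index, instruction in enumerate(body):
            if iteration == 0:
                # First step: first pass through a new loop body. Forward each
                # instruction and store the first N of them as they go by.
                if index < n_buffered:
                    buffer.append(instruction)
                yield instruction, "memory"
            elif index < n_buffered:
                # Second step: after the last instruction of an unfinished
                # loop, replay the stored first N instructions.
                yield buffer[index], "buffer"
            else:
                # Third step: every other instruction comes from the
                # previous stage unchanged.
                yield instruction, "memory"


for item in fetch_stage_stream(["a", "b", "c"], iterations=2, m=1):
    print(item)
```

With a three-instruction body and m = 1, only instruction `a` of the second iteration is served from the buffer; the remaining instructions still arrive over the normal fetch path.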
Further, in the above method, the fetch stage includes a stack whose depth is the number of loop nesting levels the processor supports. In the first step, the stack stores the 1st to N-th instructions of the current loop body as follows:
Step 101: determine whether the instruction is the first instruction of a new loop body. If so, push the instructions of all loop bodies currently stored in the stack down one layer of storage units (one layer deeper into the stack), then go to step 102. Otherwise, go directly to step 102.
Step 102: store the 1st to N-th instructions of the current loop body in the top layer of storage units (the top layer of the stack), in order at addresses 0 to N-1.
Likewise, in the second step above, the stack outputs the 1st to N-th instructions of the current loop body stored in the fetch stage as follows:
Step 201: determine whether the instruction is the N-th instruction of the current loop body. If so, and the loop has ended, lift the instructions of all loop bodies stored in the stack up one layer of storage units (one layer toward the top of the stack), then go to step 202. Otherwise, go directly to step 202.
Step 202: output the 1st to N-th instructions of the current loop body from the top layer of storage units (the top layer of the stack), in order at addresses 0 to N-1.
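Steps 101 to 202 amount to a fixed-depth stack whose layers shift down on loop entry and up on loop exit. The sketch below is one possible software model under stated assumptions: the class `LoopBufferStack` and its method names are hypothetical, layers are Python lists rather than RAM or registers, and the upward shift is applied only when the loop has ended.

```python
from typing import List, Optional


class LoopBufferStack:
    """Fixed-depth stack of instruction buffers, one layer per nesting level."""

    def __init__(self, depth: int, n: int) -> None:
        self.depth = depth  # number of supported loop nesting levels
        self.n = n          # N, instructions stored per layer
        # layers[0] is the top layer and holds the innermost active loop.
        self.layers: List[Optional[List[str]]] = [None] * depth

    def enter_loop(self, first_instructions: List[str]) -> None:
        """Steps 101-102: shift stored layers down, then fill the top layer."""
        if self.layers[-1] is not None:
            raise OverflowError("nesting deeper than the stack supports")
        self.layers = [list(first_instructions[: self.n])] + self.layers[:-1]

    def replay(self) -> List[str]:
        """Step 202: read the top layer in address order 0..N-1."""
        if self.layers[0] is None:
            raise LookupError("no active loop")
        return list(self.layers[0])

    def exit_loop(self) -> None:
        """Step 201: the loop has ended, so shift stored layers up by one."""
        self.layers = self.layers[1:] + [None]


stack = LoopBufferStack(depth=3, n=1)
stack.enter_loop(["outer_first"])
stack.enter_loop(["inner_first"])
print(stack.replay())  # the innermost loop is served first
stack.exit_loop()
print(stack.replay())  # then the next loop outward
```

Keeping the innermost loop at index 0 mirrors the first-in, last-out read order: the most recently entered loop is always the one replayed.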
Beneficial effects
The invention adds a circular buffer to the fetch unit of an existing DSP; the circular buffer stores and outputs the first N instructions of the corresponding loop body. This eliminates the extra wait cycles that the program memory's read latency causes on every jump from the last instruction of the loop body to the first, and so achieves zero-delay hardware loop jumps. The circular buffer also effectively reduces the processor's accesses to the program memory, which lowers power consumption.
In the method and processor system of the invention, when the program first enters a loop, the first N instructions of the loop body are stored in the circular buffer layer that corresponds to that loop level. The number N of stored instructions equals the number of clock cycles from issuing the read command to the program memory until the instruction reaches the decoding unit, minus 1, but never exceeds the number of instructions in the loop body. Without the buffer, every jump from the last instruction of the loop body to the first costs N extra wait cycles. With the instructions for those N cycles stored in advance, the first N instructions are supplied from the buffer once the last instruction of the loop body has executed, and fetching from the program memory resumes directly at the (N+1)-th instruction with no extra wait cycles.
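As a rough worked estimate of the saving described above, assume a single loop of n instructions that runs for K iterations, one instruction decoded per cycle, and no stalls other than the N wait cycles per backward jump. K and the cycle counts C are symbols introduced here for illustration only.

\[
C_{\text{without}} = K\,n + (K-1)\,N, \qquad
C_{\text{with}} = K\,n, \qquad
C_{\text{without}} - C_{\text{with}} = (K-1)\,N .
\]

For example, with n = 4, N = 1 and K = 100, the loop takes 499 cycles without the buffer and 400 with it, a saving of 99 cycles, or about 19.8% of the unbuffered total. The relative saving grows as the loop body gets shorter.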
Further, for nested loops the invention uses a stack: the N instructions of the outermost loop are pushed to the bottom of the stack and the N instructions of the innermost loop are at the top. When instructions are read, the loop structure is followed, so the innermost loop is processed first and then the next loop outward. Because the stack is read first in, last out, the instructions of the innermost loop are always handled first. The design therefore ensures that each loop level runs according to its nesting position, keeps the order of operations correct, and eliminates the extra wait cycles.
A system designed with this method not only achieves zero-delay hardware loop jumps but also reduces the number of program memory reads; with nested loops in particular, the number of memory reads drops sharply and power consumption is effectively reduced.
Other features and advantages of the invention are set out in the description that follows; some will be apparent from the description, and others can be learned by practicing the invention.
Brief description of the drawings
The accompanying drawings provide a further understanding of the invention and form part of the specification. Together with the embodiments they serve to explain the invention and do not limit it. In the drawings:
Fig. 1 is a pipeline architecture diagram according to the invention;
Fig. 2 is a block diagram of the circular buffer according to the invention;
Fig. 3 is a schematic diagram of loop execution without the circular buffer;
Fig. 4 is a schematic diagram of loop execution with the circular buffer added.
Detailed description of the embodiments
Preferred embodiments of the invention are described below with reference to the drawings. The embodiments described here only illustrate and explain the invention and do not limit it. The circular buffer can be designed into any processor or DSP by following the method given here.
Fig. 1 is a pipeline architecture diagram according to the invention. The processor or DSP shown comprises a program memory 100, a fetch unit 101, a decoding unit 103 and an execution unit 108 connected in series in that order, corresponding to the four pipeline stages F0 to E0. F0 is the first instruction-fetch stage, in which the fetch unit 101 issues a read request to the program memory 100. F1 is the second instruction-fetch stage, in which the program bus 104 returns the instruction requested in F0 to the fetch unit 101; after selection between it and the instruction output by the circular buffer 102, the result is passed to the decoding unit 103. D0 is the decode stage, which decodes the instruction. E0 is the execute stage, which executes the decoded instruction.
The processor hardware loop processing system of this embodiment differs from existing systems in that a selecting module 107 and a circular buffer 102 are placed between the instruction output of the fetch unit 101 and the instruction input of the decoding unit 103. The two inputs of the selecting module 107 are connected to the output of the fetch unit 101 and the output of the circular buffer 102, respectively.
That is, the 4-stage pipeline processor of this embodiment comprises six modules: the program memory 100, the fetch unit 101, the circular buffer 102, the selecting module 107, the decoding unit 103 and the execution unit 108. The program memory 100 is connected to the input of the fetch unit 101. The instruction output of the fetch unit 101 is connected to one input of the selecting module 107, and the other input of the selecting module 107 is connected to the output of the circular buffer 102. The output of the selecting module 107 is connected to the input of the decoding unit 103, and the output of the decoding unit 103 is connected to the input of the execution unit 108.
In this structure, the fetch unit 101 is also connected to the first control signal terminal 105 of the decoding unit 103, and the circular buffer 102 is also connected to the second control signal terminal 106 of the decoding unit 103.
The first control signal terminal 105 of the decoding unit 103 serves two purposes. While the decoding unit 103 has not read a loop-node instruction, it controls the fetch unit 101 to output the next instruction to be decoded to the decoding unit 103 through the selecting module 107. After the decoding unit 103 reads the first instruction of a new loop, it also controls the fetch unit 101 to output the 1st to N-th instructions of the new loop body to the circular buffer 102 in step with their output through the selecting module 107. The circular buffer 102 receives the first N instructions of the new loop body and pushes them onto its stack for storage. Here the loop-node instruction is the last instruction of the loop body while the loop is still in progress. The number of instructions the circular buffer 102 receives for a new loop body is N = min{n, m}, where n is the number of instructions in the loop body and m is the number of clock cycles an instruction takes to travel from the program memory 100 to the decoding unit 103, minus 1.
When the decoding unit 103 reads a loop-node instruction, its second control signal terminal 106 makes the circular buffer 102 output the 1st to N-th instructions of the current loop body, in order, to the decoding unit 103 through the selecting module 107.
In this example, two cycles pass between issuing the read request to the program memory 100 and the instruction reaching the D0 stage, so m = 2 - 1 = 1 and only the first instruction of each loop level's body needs to be stored in the circular buffer 102; that is, N equals 1. As Fig. 1 shows, once an instruction has been decoded in the D0 stage, if it is found to be an instruction inside a loop body and the loop is being entered for the first time, the instruction passed to the decoding unit 103 is also stored in the circular buffer 102.
The circular buffer 102 is a stack made up of two or more instruction buffers. The number of stack layers equals the number of instruction buffers, which in turn equals the number of loop nesting levels the processor supports. The stack is read first in, last out.
In this embodiment, the circular buffer 102 is structured as shown in Fig. 2. Assume the processor supports 3 levels of hardware loop nesting. As described above, the first instruction of each loop level's body must be stored in the circular buffer 102, so the circular buffer 102 is in effect a stack of depth 3 in which each layer stores the first instruction of the hardware loop at the corresponding level. Each time a new hardware loop is entered, the first instruction of its body is pushed into instruction buffer 0, and at the same time the valid instructions in instruction buffer 0 and instruction buffer 1 are pushed into instruction buffer 1 and instruction buffer 2, respectively. (In Fig. 2, instruction buffer 0 therefore holds the instruction of the innermost loop, instruction buffer 1 that of the middle loop, and instruction buffer 2 that of the outermost loop.) Whenever the last instruction of a hardware loop body is executed, the instruction stored in instruction buffer 0 is popped and sent to the decoding unit in the next cycle, and the valid instructions in instruction buffer 2 and instruction buffer 1 are moved into instruction buffer 1 and instruction buffer 0.
Each instruction buffer is a memory of size N that stores the 1st to N-th instructions of its loop body in order at addresses 0 to N-1. In this embodiment each instruction buffer stores exactly one instruction, and the 3 instruction buffers form a stack of depth 3.
Each layer's instruction buffer is a memory of size N (N = 1 in this embodiment), which can be implemented with RAM or registers and stores the first N instructions of the loop body once the program first enters the loop. The number N of stored instructions equals the number of clock cycles from issuing the read command to the program memory until the instruction reaches the decoding unit, minus 1, but never exceeds the number of instructions in the loop body. As described above for the circular buffer, the read and write addresses of each layer's instruction buffer step through 0 to N-1 in order.
The operation of the circular buffer 102 in the embodiment of Fig. 1 is described in detail below.
After the fetch unit 101 issues a read request to the program memory 100, the instruction read reaches the D0 stage 2 cycles later. Once the decoding unit 103 has decoded it, if the instruction is found to be inside a loop body and the loop is being entered for the first time, the instruction is sent to the execution unit 108 and also pushed into the circular buffer 102. When the decoding unit 103 finds that the decoded instruction is the last instruction of the loop body and the loop has not finished, the fetch unit 101 issues a read request to the program memory for the 2nd instruction of the loop body. Because the newly read instruction needs 2 cycles to reach the D0 stage, the first instruction of the loop body is taken from the circular buffer 102 and decoded in the first clock cycle, and the newly fetched second instruction of the loop body is decoded in the second clock cycle.
Note that if the loop body has only one instruction, every iteration after the first reads the loop body instruction from the circular buffer 102, with no repeated reads of the program memory 100. This is because, when the program first enters the loop, the first N instructions of the loop body are stored in the circular buffer layer for that loop level, where N equals the number of clock cycles from issuing the read command to the program memory until the instruction reaches the decoding unit, minus 1. If the loop body has N instructions or fewer, its instructions are read from the program memory 100 and stored in the circular buffer 102 only during the first iteration; afterwards the whole loop body is read from the circular buffer 102 until the loop ends. If the loop body has more than N instructions, then on each return to the start of the loop the N stored instructions are read from the circular buffer first, until the (N+1)-th instruction of the loop body, read back from the program memory, reaches the decoding unit. This achieves zero-delay hardware loop jumps.
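The effect on program memory traffic can be estimated with a short calculation. The sketch below is a simplified model, not a measurement: it counts only reads of loop-body instructions for a single, non-nested loop, assumes the fetch unit issues no speculative reads, and the function name `program_memory_reads` and its parameters are hypothetical.

```python
def program_memory_reads(n: int, iterations: int, m: int, use_buffer: bool) -> int:
    """Count program-memory reads of loop-body instructions.

    n          -- instructions in the loop body
    iterations -- number of times the loop body runs
    m          -- fetch-to-decode latency in cycles, minus 1
    use_buffer -- whether the first N = min{n, m} instructions are buffered
    """
    if n < 1 or iterations < 1 or m < 0:
        raise ValueError("n and iterations must be >= 1 and m must be >= 0")
    if not use_buffer:
        # Every instruction of every iteration is read from program memory.
        return n * iterations
    buffered = min(n, m)
    # The first iteration reads the whole body; later iterations skip the
    # first N instructions because the buffer supplies them.
    return n + (iterations - 1) * (n - buffered)


print(program_memory_reads(n=1, iterations=100, m=1, use_buffer=False))
print(program_memory_reads(n=1, iterations=100, m=1, use_buffer=True))
```

For the one-instruction loop described above, the buffered case needs a single read no matter how many iterations run, which is the situation where the power saving is largest.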
In other words, for a processor pipeline that needs m+1 clock cycles from the memory read to the decode stage, the fetch stage outputs instructions as follows:
The first step: determine whether the instruction input to the fetch stage from the previous stage is the first instruction of a new loop body. If so, the fetch stage stores the 1st to N-th instructions of the current loop body in order while outputting them to the next stage (the decode stage). Otherwise, go to the third step.
Here N = min{n, m}, where n is the number of instructions in the current loop body and m is the number of clock cycles the processor pipeline needs from the memory read to the decode stage, minus 1.
The second step: if the instruction is the last instruction of the loop body and the loop has not finished, the fetch stage outputs the 1st to N-th instructions of the current loop body stored in the fetch stage to the next stage (the decode stage) in order; when the output is complete it waits for a new instruction from the previous stage and returns to the first step. Otherwise, go to the third step.
The third step: the fetch stage outputs the instruction it received from the previous stage to the next stage (the decode stage), which ends this round of fetch-stage processing.
Here the fetch stage includes a stack whose depth is the number of loop nesting levels the processor supports. In the first step above, the stack stores the 1st to N-th instructions of the current loop body as follows:
Step 101: determine whether the instruction is the first instruction of a new loop body. If so, push the instructions of all loop bodies stored in the stack one layer deeper (the instruction in the first layer of storage units moves to the second layer, the instruction in the second layer moves to the third layer, and so on), then go to step 102. Otherwise, go directly to step 102.
Step 102: store the 1st to N-th instructions of the current loop body in the top layer of the stack (the top layer of storage units), in order at addresses 0 to N-1.
Likewise, in the second step above, the stack outputs the 1st to N-th instructions of the current loop body stored in the fetch stage as follows:
Step 201: determine whether the instruction is the N-th instruction of the current loop body. If so, and the loop has ended, lift the instructions of all loop bodies stored in the stack one layer toward the top (the instruction in the third layer of storage units moves to the second layer, the instruction in the second layer moves to the first layer, and so on), then go to step 202. Otherwise, go directly to step 202.
Step 202: output the 1st to N-th instructions of the current loop body from the top layer of storage units (the top layer of the stack), in order at addresses 0 to N-1.
Comparing with the loop execution without the circular buffer shown in Fig. 3, it is easy to see that, because the program memory has latency, each jump from the last instruction of the loop body to the first requires a one-cycle bubble to be inserted before the first instruction of the loop body that was read can reach the D0 stage. When the loop body has few instructions, or when loops are nested, these inserted bubbles have a considerable effect on program execution efficiency.
With the circular buffer added, as in Fig. 4, the bubble inserted in Fig. 3 is replaced by the first instruction of the loop body read from the circular buffer 102, which gives zero-delay hardware loop jumps. The method also reduces the number of program memory reads; with nested loops in particular, the number of memory reads drops sharply and power consumption is effectively reduced.
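The difference between Fig. 3 and Fig. 4 can be illustrated by listing what the decode stage sees in each cycle. The sketch below models the stated behaviour rather than simulating the pipeline: it assumes a single loop, m bubble cycles per backward jump when no buffer is present, and none when the buffer is present. The function name `decode_trace` and the label `"bubble"` are hypothetical.

```python
from typing import List


def decode_trace(body: List[str], iterations: int, m: int, use_buffer: bool) -> List[str]:
    """List what the decode stage sees per cycle; "bubble" marks an idle cycle."""
    trace: List[str] = []
    for iteration in range(iterations):
        if iteration > 0 and not use_buffer:
            # Without the buffer, each jump back to the first instruction
            # waits m cycles for the program memory.
            trace.extend(["bubble"] * m)
        trace.extend(body)
    return trace


without_buffer = decode_trace(["i1", "i2"], iterations=3, m=1, use_buffer=False)
with_buffer = decode_trace(["i1", "i2"], iterations=3, m=1, use_buffer=True)
print(without_buffer)  # bubbles appear between iterations
print(with_buffer)     # instructions follow back to back
print(len(without_buffer) - len(with_buffer), "cycles saved")
```

With a two-instruction body, three iterations and m = 1, the unbuffered trace is two cycles longer than the buffered one, one bubble for each backward jump.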
As this embodiment shows, the circular buffer is a stack of depth M, where M is the number of hardware loop nesting levels supported by the processor. Each level of the stack corresponds to one level of storage unit described in the method flow, i.e., one of the instruction buffers in Fig. 2. Each time a new loop is entered, the first N instructions of the new loop are stored in instruction buffer 0 in instruction order, and at the same time the valid instructions in instruction buffer 0 and instruction buffer 1 are pushed down into instruction buffer 1 and instruction buffer 2, respectively. Each time the last instruction of the hardware loop body is reached, the instructions stored in instruction buffer 0 are output in order and sent to the decoding unit; once all the instructions in instruction buffer 0 have been popped, the valid instructions in instruction buffer 2 and instruction buffer 1 are moved up into instruction buffer 1 and instruction buffer 0, respectively.
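The push-down and move-up behaviour of the instruction buffers can be modelled in software as a small stack. This is a minimal sketch under stated assumptions, not the patented hardware: the class name `LoopBufferStack` and its methods are hypothetical, `depth` stands for M, `width` stands for the per-level capacity, and the exact cycle at which a level is popped is simplified to an explicit `exit_loop` call.

```python
class LoopBufferStack:
    """Software model of a depth-M stack of instruction buffers."""

    def __init__(self, depth: int, width: int):
        self.depth = depth    # M: supported loop nesting levels
        self.width = width    # capacity of each instruction buffer
        self.levels = []      # levels[0] plays the role of instruction buffer 0

    def enter_loop(self, body):
        """Push existing levels down and store the new loop's first instructions."""
        if len(self.levels) == self.depth:
            raise OverflowError("nesting exceeds the supported depth M")
        self.levels.insert(0, list(body[:self.width]))

    def replay(self):
        """Return the top level's instructions in address order 0..N-1."""
        if not self.levels:
            raise LookupError("no loop is active")
        return list(self.levels[0])

    def exit_loop(self):
        """Discard the finished loop and move the remaining levels up."""
        if not self.levels:
            raise LookupError("no loop is active")
        self.levels.pop(0)


buf = LoopBufferStack(depth=3, width=2)
buf.enter_loop(["o0", "o1", "o2"])   # outer loop: only the first 2 are kept
buf.enter_loop(["i0", "i1"])         # inner loop lands on top
inner_first = buf.replay()           # inner loop is served first
buf.exit_loop()                      # inner loop done, outer loop moves up
outer_next = buf.replay()
```

Keeping the innermost loop at index 0 mirrors the text: the most recently entered loop is always the one replayed, and finishing it exposes the next outer loop.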
In this way, the stack holds the N instructions of the outermost loop at the bottom and the N instructions of the innermost nested loop at the top. When instructions are read, the innermost loop is processed first, following the loop structure, and then the instructions of the next outer loop are processed. Because the stack is read first-in, last-out, the instructions of the innermost loop are always processed first. This design therefore ensures that each loop level executes according to its nesting position, which keeps the order of operations correct while also eliminating the extra wait cycles during execution.
The hardware loop processing method and system for a processor or DSP disclosed herein add a circular buffer to the existing fetch unit, which eliminates the extra wait cycle incurred each time execution jumps from the last instruction of the loop body to its first instruction. The method is simple in design and adds very few hardware resources to the existing system, yet achieves zero-delay jumps for hardware loops. In addition, it reduces the fetch unit's accesses to program memory and thereby lowers power consumption.
Those of ordinary skill in the art will understand that the foregoing is merely a preferred embodiment of the present invention and is not intended to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions recorded in those embodiments or make equivalent replacements of some of their technical features. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (7)
1. A processor hardware loop processing system, characterized in that the system comprises a program memory (100), a fetch unit (101), a circular buffer (102), a selection module (107), a decoding unit (103) and an execution unit (108);
the program memory (100) is connected to the input of the fetch unit (101); the instruction output of the fetch unit (101) is connected to one input of the selection module (107), and the other input of the selection module (107) is connected to the output of the circular buffer (102); the output of the selection module (107) is connected to the input of the decoding unit (103); the output of the decoding unit (103) is connected to the input of the execution unit (108);
the fetch unit (101) is further connected to the first control signal terminal (105) of the decoding unit (103), and the circular buffer (102) is further connected to the second control signal terminal (106) of the decoding unit (103); the first control signal terminal (105) of the decoding unit (103) is used, when the decoding unit (103) has not read a loop-node instruction, to control the fetch unit (101) to output, through the selection module (107), the next instruction to be decoded to the decoding unit (103); meanwhile, after the decoding unit (103) reads the first instruction of a new loop, it also controls the fetch unit (101), through the first control signal terminal (105), to synchronously output the first to N-th instructions of the new loop body to the circular buffer (102) through the selection module (107); the circular buffer (102) is used to receive the first N instructions of the new loop body and to push the received N instructions onto the stack for storage; wherein the loop node is the last instruction of the loop body while the loop is in progress; the number of instructions of the new loop body received by the circular buffer (102) is N = min{n, m}, where n is the number of instructions in the loop body and m is the number of clock cycles an instruction takes to travel from the program memory (100) to the decoding unit (103), minus 1;
the second control signal terminal (106) of the decoding unit (103) is used, when the decoding unit (103) reads a loop-node instruction, to control the circular buffer (102) to output the first to N-th instructions of the current loop body in sequence to the decoding unit (103) through the selection module (107).
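The relation N = min{n, m} stated in claim 1 can be illustrated numerically. This is a hedged sketch for illustration only: the function name `buffered_instruction_count` and its parameter names are hypothetical, and the example latencies are invented rather than taken from the patent.

```python
def buffered_instruction_count(body_len: int, memory_to_decode_cycles: int) -> int:
    """Compute N = min{n, m} for one loop body.

    n is the number of instructions in the loop body; m is the number of
    clock cycles an instruction needs to travel from program memory to the
    decoding unit, minus 1.
    """
    if body_len < 1:
        raise ValueError("a loop body needs at least one instruction")
    if memory_to_decode_cycles < 1:
        raise ValueError("latency must be at least one clock cycle")
    m = memory_to_decode_cycles - 1
    return min(body_len, m)


# Example values (assumed): a 3-cycle path from memory to decode gives m = 2.
n_long_body = buffered_instruction_count(body_len=10, memory_to_decode_cycles=3)
n_short_body = buffered_instruction_count(body_len=1, memory_to_decode_cycles=3)
```

With these example values a 10-instruction body buffers 2 instructions, while a 1-instruction body buffers only itself, since N can never exceed the body length.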
2. The hardware loop processing system according to claim 1, characterized in that the circular buffer (102) is a stack formed of two or more instruction buffers; the number of levels of the stack equals the number of instruction buffers, the number of instruction buffers equals the number of loop nesting levels supported by the processor, and the stack is read first-in, last-out.
3. The hardware loop processing system according to claim 2, characterized in that each instruction buffer is a memory of size N, and each memory of the circular buffer (102) stores the first to N-th instructions of the corresponding loop body in order of address from 0 to N-1.
4. The hardware loop processing system according to claim 1 or 2, characterized in that the memory in the circular buffer (102) is a random access memory or a register whose size is N instruction lengths.
5. A processor pipeline fetch-stage processing method for the processor hardware loop processing system according to any one of claims 1 to 4, characterized in that the method comprises the following steps:
First step: determine whether the instruction input from the previous stage is the first instruction of a new loop body; if so, the fetch stage stores, in sequence, the first to N-th instructions of the current loop body and synchronously outputs them to the next stage; otherwise, go to the third step;
wherein N = min{n, m}, n is the number of instructions in the current loop body, and m is the number of clock cycles in the processor pipeline from the memory read to the decoding stage, minus 1;
Second step: if the instruction is the last instruction of the loop body and the loop has not finished, the fetch stage outputs the stored first to N-th instructions of the current loop body to the next stage in sequence, then waits for a new instruction from the previous stage and goes to the first step; otherwise, go to the third step;
Third step: after the fetch stage outputs the instruction obtained from the previous stage to the next stage, the current round of fetch-stage processing ends.
6. The processor pipeline fetch-stage processing method according to claim 5, characterized in that, in the first step, the first to N-th instructions of the current loop body are stored as follows:
Step 101: determine whether the instruction is the first instruction of a new loop body; if so, push the instructions of all currently stored loop bodies down to the next-level storage unit, then go to step 102; otherwise, go directly to step 102;
Step 102: in the topmost storage unit, store the first to N-th instructions of the current loop body in order of address from 0 to N-1.
7. The processor pipeline fetch-stage processing method according to claim 5 or 6, characterized in that, in the second step, the first to N-th instructions of the current loop body stored in the fetch stage are output as follows:
Step 201: if the instruction is the N-th instruction of the current loop body and the loop has finished, move the instructions of all currently stored loop bodies up to the next-higher storage unit, then go to step 202; otherwise, go directly to step 202;
Step 202: from the topmost storage unit, output the first to N-th instructions of the current loop body in order of address from 0 to N-1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611021587.XA CN106775591B (en) | 2016-11-21 | 2016-11-21 | A kind of hardware loop processing method and system of processor |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106775591A CN106775591A (en) | 2017-05-31 |
CN106775591B true CN106775591B (en) | 2019-06-18 |
Family
ID=58969971
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611021587.XA Active CN106775591B (en) | 2016-11-21 | 2016-11-21 | A kind of hardware loop processing method and system of processor |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106775591B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109032665B (en) * | 2017-06-09 | 2021-01-26 | 龙芯中科技术股份有限公司 | Method and device for processing instruction output in microprocessor |
CN107368287B (en) * | 2017-06-12 | 2020-11-13 | 北京中科睿芯科技有限公司 | Acceleration system, acceleration device and acceleration method for cyclic dependence of data stream structure |
CN107423148A (en) * | 2017-07-26 | 2017-12-01 | 广州路派电子科技有限公司 | A kind of double buffering protocol data analysis system being applied under multi-task scheduling environment |
CN107729054B (en) * | 2017-10-18 | 2020-07-24 | 珠海市杰理科技股份有限公司 | Method and device for realizing execution of processor on loop body |
CN111522584B (en) * | 2020-04-10 | 2023-10-31 | 深圳优矽科技有限公司 | Hardware circulation acceleration processor and hardware circulation acceleration method executed by same |
CN114116010B (en) * | 2022-01-27 | 2022-05-03 | 广东省新一代通信与网络创新研究院 | Architecture optimization method and device for processor cycle body |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1012693B1 (en) * | 1996-04-29 | 2006-04-12 | Atmel Corporation | Program memory and signal processing system storing instructions encoded for reducing power consumption during reads |
CN102436367A (en) * | 2011-09-26 | 2012-05-02 | 杭州中天微系统有限公司 | 16/32 bits mixed framework order prefetched buffer device |
CN102637149A (en) * | 2012-03-23 | 2012-08-15 | 山东极芯电子科技有限公司 | Processor and operation method thereof |
Also Published As
Publication number | Publication date |
---|---|
CN106775591A (en) | 2017-05-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106775591B (en) | A kind of hardware loop processing method and system of processor | |
CN101807144B (en) | Prospective multi-threaded parallel execution optimization method | |
CN101221541B (en) | Programmable communication controller for SOC and its programming model | |
US9170816B2 (en) | Enhancing processing efficiency in large instruction width processors | |
CN105426160A (en) | Instruction classified multi-emitting method based on SPRAC V8 instruction set | |
CN106575220A (en) | Multiple clustered very long instruction word processing core | |
US20040255103A1 (en) | Method and system for terminating unnecessary processing of a conditional instruction in a processor | |
CN103890718A (en) | Digital signal processor and baseband communication device | |
US8949575B2 (en) | Reversing processing order in half-pumped SIMD execution units to achieve K cycle issue-to-issue latency | |
US20110264892A1 (en) | Data processing device | |
CN110928832A (en) | Asynchronous pipeline processor circuit, device and data processing method | |
CN105242904B (en) | For processor instruction buffering and the device and its operating method of circular buffering | |
CN112527393A (en) | Instruction scheduling optimization device and method for master-slave fusion architecture processor | |
CN104503733B (en) | The merging method and device of a kind of state machine | |
KR101545701B1 (en) | A processor and a method for decompressing instruction bundles | |
CN117389731B (en) | Data processing method and device, chip, device and storage medium | |
CN104516829A (en) | Microprocessor and method for using an instruction loop cache thereof | |
CN107924310A (en) | Produced using the memory instructions for avoiding producing in table (PAT) prediction computer processor | |
JP2014215624A (en) | Arithmetic processing device | |
JP4771079B2 (en) | VLIW processor | |
US10454670B2 (en) | Memory optimization for nested hash operations | |
CN116048627A (en) | Instruction buffering method, apparatus, processor, electronic device and readable storage medium | |
CN116257174A (en) | Heterogeneous space optimizer based on tensor asynchronous hard disk read-write | |
US20130151817A1 (en) | Method, apparatus, and computer program product for parallel functional units in multicore processors | |
JP5630798B1 (en) | Processor and method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||