
CN106775591B - Hardware loop processing method and system for a processor - Google Patents

Hardware loop processing method and system for a processor

Info

Publication number
CN106775591B
CN106775591B (application number CN201611021587.XA)
Authority
CN
China
Prior art keywords
instruction
loop
decoding unit
circular buffer
loop body
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611021587.XA
Other languages
Chinese (zh)
Other versions
CN106775591A (en)
Inventor
李炜
陶建平
韩景通
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Grand Cloud Co Ltd
Original Assignee
Jiangsu Grand Cloud Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Grand Cloud Co Ltd
Priority to CN201611021587.XA
Publication of CN106775591A
Application granted
Publication of CN106775591B
Status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/3005Arrangements for executing specific machine instructions to perform operations for flow control
    • G06F9/30065Loop control instructions; iterative instructions, e.g. LOOP, REPEAT
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/3005Arrangements for executing specific machine instructions to perform operations for flow control
    • G06F9/30069Instruction skipping instructions, e.g. SKIP

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)

Abstract

A hardware loop processing system and method for a processor. A circular buffer is added to the existing fetch unit; it stores the first N instructions of the loop body and outputs them directly to the downstream decoding unit. This eliminates the extra wait cycles that the program-memory read latency would otherwise cause during loop processing each time execution jumps from the last instruction of the loop body back to its first instruction, so the hardware loop jump incurs zero delay. The design of the method is simple: only one hardware circular buffer and a corresponding selection module need to be added to the existing system to achieve zero-delay hardware-loop jumps. In addition, the method reduces the fetch unit's accesses to the program memory and therefore the power consumption of the processor.

Description

Hardware loop processing method and system for a processor
Technical field
The present invention relates to hardware loop processing techniques, and in particular to an improved hardware loop processing technique applied at the instruction-fetch stage of a processor or DSP.
Background art
In a processor or DSP, loops are a very common type of program. In the prior art, loops are usually implemented with ordinary jump instructions used together with general-purpose registers. With this approach, however, an extra instruction is needed at the end of the loop body to test whether the loop has finished and, if it has not, a jump instruction is needed to return to the first instruction of the loop body. This adds many extra cycles to the loop body, and in the case of nested loops the execution efficiency suffers significantly as a result.
To improve loop execution efficiency, some processors and DSPs now handle loops with hardware loops: the hardware automatically decides whether the loop has finished and jumps back to the first instruction of the loop body. Although this approach uses no extra test and jump instructions, the program memory still has a read latency, so each time execution jumps from the last instruction of the loop body to its first instruction, additional wait cycles are still needed before the first instruction of the loop body can be read. If the loop body contains only a few instructions, these extra wait cycles still seriously degrade loop efficiency.
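To make the cost concrete, the following arithmetic sketch (Python; the function name and the numbers are illustrative and not taken from the patent) counts the decode cycles spent on a short loop when every jump back to the start of the body stalls for the program-memory latency:

```python
# Illustrative only: decode cycles for a loop of n instructions iterated k
# times, when every back edge stalls the decoder for `stall` extra cycles
# while the first instruction is re-read from program memory.

def loop_decode_cycles(n: int, k: int, stall: int) -> int:
    return n * k + stall * (k - 1)

plain = loop_decode_cycles(n=2, k=100, stall=1)   # 299 cycles with the bubble
ideal = loop_decode_cycles(n=2, k=100, stall=0)   # 200 cycles without it
print(plain, ideal)  # for a 2-instruction body the stall adds roughly 50%
```

The shorter the loop body, the larger the relative loss, which is why short bodies and nested loops are singled out above.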
To address this problem of the prior art, the hardware loop processing method and system for a processor or DSP disclosed by the present invention eliminate the extra wait cycles incurred each time execution jumps from the last instruction of the loop body to its first instruction, achieve zero-delay loop jumps, and substantially improve loop efficiency. They also reduce the number of program-memory reads and hence the power consumption of the processor or DSP.
Summary of the invention
To overcome the above shortcomings of the prior art, the object of the present invention is to provide a hardware loop processing method and system for a processor.
First, to achieve the above object, the present invention proposes a processor hardware loop processing system. The system comprises a program memory, a fetch unit, a circular buffer, a selection module, a decoding unit and an execution unit;
The program memory is connected to the input of the fetch unit; the instruction output of the fetch unit is connected to one input of the selection module, and the other input of the selection module is connected to the output of the circular buffer; the output of the selection module is connected to the input of the decoding unit; the output of the decoding unit is connected to the input of the execution unit;
The fetch unit is further connected to a first control signal terminal of the decoding unit, and the circular buffer is further connected to a second control signal terminal of the decoding unit.
Further, in the above system, the first control signal terminal of the decoding unit is used, when the decoding unit has not read a loop-node instruction, to make the fetch unit output the next instruction to be decoded to the decoding unit through the selection module; at the same time, after the decoding unit reads the first instruction of a new loop, it also uses the first control signal terminal to make the fetch unit synchronously output the first through Nth instructions of the new loop body to the circular buffer through the selection module. The circular buffer receives the first N instructions of the new loop body and pushes the received N instructions onto a stack for storage. Here, the loop-node instruction is the last instruction of the loop body while the loop is in progress. The number of instructions of the new loop body received by the circular buffer is N = min{n, m}, where n is the number of instructions in the loop body and m is the number of clock cycles an instruction takes to travel from the program memory to the decoding unit, minus 1;
The second control signal terminal of the decoding unit is used, when the decoding unit reads a loop-node instruction, to make the circular buffer output the first through Nth instructions of the current loop body to the decoding unit in sequence through the selection module.
In this way, according to the instruction being decoded by the decoding unit and its position within the loop body, the selection module decides whether the instruction is read directly from the circular buffer or must still be read, stage by stage, from the program memory. The first N instructions of the loop body are delivered to the decoding unit directly from the circular buffer at the fetch stage, which avoids the extra wait cycles otherwise caused, on every iteration, by passing instructions stage by stage from the program memory to the decoding unit. While achieving zero-delay hardware-loop jumps, the present invention also reduces accesses to the program memory and thereby the power consumption.
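As an informal illustration of the control described above, the sketch below (Python; flag names such as `is_loop_node` are ours for illustration and are not the patent's interface) shows how the decoder's two signals could steer the selection module and the buffer fill:

```python
# Hypothetical behavioral sketch of the decoder-driven selection: based on the
# instruction just decoded, choose (a) which source the selection module
# forwards to the decoder next and (b) whether the fetch unit should mirror
# the instruction into the circular buffer.

def decode_control(is_first_of_new_loop: bool,
                   is_loop_node: bool,
                   loop_finished: bool):
    mirror_into_buffer = is_first_of_new_loop        # start latching first N
    if is_loop_node and not loop_finished:
        source = "circular_buffer"                   # replay first N, no stall
    else:
        source = "fetch_unit"                        # normal path from memory
    return source, mirror_into_buffer

# The last instruction of an unfinished loop switches the selector.
print(decode_control(False, True, False))   # ('circular_buffer', False)
print(decode_control(True, False, False))   # ('fetch_unit', True)
```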
Further, in the above system, the circular buffer is a stack composed of two or more instruction caches. The number of stack layers equals the number of instruction caches, which in turn equals the number of loop-nesting levels supported by the processor, and the stack is read in first-in, last-out order.
Here, through the stack, the N instructions corresponding to the outermost loop are pushed to the bottom of the stack and the N instructions of the innermost nested loop are placed at its top. When instructions are read, the innermost loop is processed to completion first, according to the loop structure, and only then are the instructions of the next outer loop processed. The first-in, last-out reading of the stack ensures that each loop level is handled according to its nesting position, so no error in execution order can occur.
Further, in the above system, each instruction cache is a memory of size N, and each such memory stores the first through Nth instructions of the corresponding loop body at addresses 0 through N-1 in order.
In the above system, the memories in the circular buffer are random-access memories (RAMs) or registers whose size is N instruction lengths.
Secondly, to achieve the above object, a processor-pipeline fetch-stage processing method is also proposed. The fetch-stage method provided by the present invention replaces the fetch-stage processing of an existing pipeline. In this method, the fetch stage handles each instruction input from the previous stage according to the following steps:
In the first step, it is judged whether the instruction input from the previous stage is the first instruction of a new loop body. If so, the fetch stage stores, and synchronously outputs to the next stage, the first through Nth instructions of the current loop body; otherwise, go to the third step;
This judgment is made by the decoding unit located at the stage after the fetch stage, and is implemented through the two control signal terminals of the decoding unit, which send control signals to the fetch unit and the circular buffer respectively.
Here, N = min{n, m}, where n is the number of instructions in the current loop body and m is the number of clock cycles required in the processor pipeline from reading the memory to the decode stage, minus 1;
In the second step, if the instruction is the last instruction of the loop body and the loop has not yet finished, then after the first through Nth instructions of the current loop body stored in the fetch stage have been output in sequence to the next stage, the fetch stage waits to receive a new instruction from the previous stage and returns to the first step; otherwise, go to the third step;
In the third step, after the fetch stage has output to the next stage the instruction it obtained from the previous stage, the current round of fetch-stage processing ends.
One round of fetch-stage processing refers to the process of outputting one instruction to the next stage and corresponds to one clock cycle of the processor. The fetch stage handles every instruction it receives according to the steps above.
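The three steps can be modelled roughly as follows (a simplified sketch under our own naming; the patent fixes only the behavior, not this interface, and the `Instr` flags are assumptions). The sketch latches instructions as they pass the fetch stage rather than modelling the synchronous mirroring through the selection module:

```python
# Simplified model of the per-instruction fetch-stage procedure. `stack` is a
# list of per-loop buffers (last element = innermost active loop); `emit`
# sends an instruction to the decode stage.

from dataclasses import dataclass

@dataclass
class Instr:
    name: str
    first_of_new_body: bool = False
    last_of_body: bool = False
    loop_finished: bool = False

def fetch_stage(instr: Instr, stack: list, n: int, emit) -> None:
    if instr.first_of_new_body:                      # first step
        stack.append([])                             # open a buffer layer
    if stack and len(stack[-1]) < n:
        stack[-1].append(instr)                      # latch the first N
    emit(instr)                                      # third step: forward
    if instr.last_of_body and not instr.loop_finished:
        for buffered in stack[-1]:                   # second step: replay the
            emit(buffered)                           # stored first N instrs
    elif instr.last_of_body and instr.loop_finished:
        stack.pop()                                  # leave this nesting level
```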
Further, in the above method, the fetch stage includes a stack whose depth equals the number of loop-nesting levels supported by the processor. In the first step, the stack stores the first through Nth instructions of the current loop body as follows:
Step 101: judge whether the instruction is the first instruction of a new loop body. If so, push all loop-body instructions currently stored in the stack down by one storage layer (one layer deeper into the stack) and then go to step 102; otherwise, go directly to step 102;
Step 102: in the topmost storage layer (i.e. the top layer of the stack), store the first through Nth instructions of the current loop body in address order from 0 to N-1.
Meanwhile, in the above method, the stack outputs, in the second step, the first through Nth instructions of the current loop body stored in the fetch stage as follows:
Step 201: judge whether the instruction is the Nth instruction of the current loop body. If so, and the loop has finished, raise all loop-body instructions stored in the stack by one storage layer (one layer toward the top of the stack) and then go to step 202; otherwise, go directly to step 202;
Step 202: starting from the topmost storage layer (i.e. the top layer of the stack), output the first through Nth instructions of the current loop body in address order from 0 to N-1.
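A compact, slightly simplified sketch of these stack operations (Python; the layer handling and the N = 1 example values are ours, not the patent's):

```python
# Steps 101/102 push the existing layers down and latch the new body's first N
# instructions at the top; steps 201/202 output from the top layer in address
# order and, when the loop has finished, raise the remaining layers by one.

def step_101_102(stack, first_n_instrs):
    stack.append(list(first_n_instrs))     # new top layer, addresses 0..N-1

def step_201_202(stack, loop_finished):
    if loop_finished:
        stack.pop()                        # raise the remaining layers
        return None
    return list(stack[-1])                 # replay first..Nth instruction

stack = []
step_101_102(stack, ["outer_i0"])                 # enter outermost loop, N = 1
step_101_102(stack, ["inner_i0"])                 # enter nested inner loop
print(step_201_202(stack, loop_finished=False))   # ['inner_i0'] replayed
step_201_202(stack, loop_finished=True)           # inner loop done, layer dropped
print(step_201_202(stack, loop_finished=False))   # ['outer_i0'] replayed
```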
Beneficial effects
The present invention adds a circular buffer to the fetch unit of an existing DSP; the circular buffer stores and outputs the first N instructions of the corresponding loop body. This eliminates the extra wait cycles caused by the program-memory read latency each time execution jumps, during loop processing, from the last instruction of the loop body to its first instruction, so the hardware loop jump incurs zero delay. At the same time, the circular buffer effectively reduces the processor's accesses to the program memory and therefore its power consumption.
In the method and processor system of the present invention, when the program first enters a loop, the first N instructions of that loop body are stored in the circular buffer corresponding to that loop level. The number of instructions N stored in the circular buffer equals the number of clock cycles from issuing the program-memory read request until the instruction reaches the decoding unit, minus 1, but never exceeds the number of instructions in the loop body. Each jump from the last instruction of the loop body to its first instruction would otherwise require N extra wait cycles; with the instructions for these N cycles stored in advance, once the loop body has executed its last instruction it can proceed directly to the (N+1)th instruction, without any extra wait cycles.
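As a worked example of the sizing rule N = min{n, m} (the function name and the numbers are ours, for illustration only):

```python
# N = min(n, m): n is the loop-body length; m is the memory-to-decoder latency
# in clock cycles minus 1. The buffer therefore never holds more than the whole
# body and never more than is needed to cover the read latency.

def buffered_count(body_len: int, mem_to_decode_cycles: int) -> int:
    return min(body_len, mem_to_decode_cycles - 1)

print(buffered_count(body_len=8, mem_to_decode_cycles=2))  # 1 (the 2-cycle embodiment below)
print(buffered_count(body_len=1, mem_to_decode_cycles=4))  # 1 (the whole body fits)
print(buffered_count(body_len=8, mem_to_decode_cycles=4))  # 3
```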
Further, for nested loops, the present invention uses the stack to push the N instructions corresponding to the outermost loop to the bottom of the stack and to place the N instructions of the innermost nested loop at its top. When instructions are read, the innermost loop is processed to completion first, according to the loop structure, and then the instructions of the next outer loop are processed. Because the stack is read first-in, last-out, the instructions of the innermost loop are always processed first; such a design therefore guarantees that each loop level proceeds according to its nesting position and, while keeping the execution order correct, also eliminates the extra wait cycles during computation.
A system designed with this method not only achieves zero-delay hardware-loop jumps but also reduces the number of program-memory reads; in the case of nested loops, in particular, the number of memory reads can be greatly reduced, effectively reducing power consumption.
Other features and advantages of the present invention will be set forth in the following description and will in part become apparent from the description or be understood by practicing the invention.
Brief description of the drawings
The drawings are provided for a further understanding of the present invention; they constitute part of the specification and, together with the embodiments of the present invention, serve to explain the invention without limiting it. In the drawings:
Fig. 1 is a pipeline architecture diagram according to the present invention;
Fig. 2 is an architecture diagram of the circular buffer according to the present invention;
Fig. 3 is a schematic diagram of loop execution without the circular buffer;
Fig. 4 is a schematic diagram of loop execution with the circular buffer added.
Specific embodiments
Preferred embodiments of the present invention are described below with reference to the drawings. It should be understood that the preferred embodiments described here are only intended to illustrate and explain the present invention and are not intended to limit it. The circular buffer can be designed according to the scheme provided here in any processor or DSP.
Fig. 1 is a pipeline architecture diagram according to the present invention. The processor or DSP shown in the figure comprises, connected in series in order, a program memory 100, a fetch unit 101, a decoding unit 103 and an execution unit 108, corresponding to the four pipeline stages F0 to E0 of the processor or DSP. F0 is the first instruction-fetch pipeline stage, in which the fetch unit 101 issues a read request to the program memory 100. F1 is the second instruction-fetch pipeline stage, in which the program bus 104 returns the instruction read by the F0 stage to the fetch unit 101; after selection between this instruction and the instruction output by the circular buffer 102, the result is delivered to the decoding unit 103. D0 is the instruction-decode pipeline stage, which decodes instructions. E0 is the instruction-execution pipeline stage, which executes the decoded instructions.
The processor hardware loop processing system provided in this embodiment differs from the existing system in that a selection module 107 and a circular buffer 102 are additionally provided between the instruction output of the fetch unit 101 and the instruction input of the decoding unit 103; the two inputs of the selection module 107 are connected respectively to the outputs of the fetch unit 101 and of the circular buffer 102. That is, the 4-stage-pipeline processor in this embodiment comprises six modules in total: the program memory 100, the fetch unit 101, the circular buffer 102, the selection module 107, the decoding unit 103 and the execution unit 108. The program memory 100 is connected to the input of the fetch unit 101; the instruction output of the fetch unit 101 is connected to one input of the selection module 107, and the other input of the selection module 107 is connected to the output of the circular buffer 102; the output of the selection module 107 is connected to the input of the decoding unit 103; and the output of the decoding unit 103 is connected to the input of the execution unit 108;
Under this system structure, the fetch unit 101 is further connected to the first control signal terminal 105 of the decoding unit 103, and the circular buffer 102 is further connected to the second control signal terminal 106 of the decoding unit 103.
The first control signal terminal 105 of the decoding unit 103 is used, when the decoding unit 103 has not read a loop-node instruction, to make the fetch unit 101 output the next instruction to be decoded to the decoding unit 103 through the selection module 107; at the same time, after the decoding unit 103 reads the first instruction of a new loop, it also uses the first control signal terminal 105 to make the fetch unit 101 synchronously output the first through Nth instructions of the new loop body to the circular buffer 102 through the selection module 107. The circular buffer 102 receives the first N instructions of the new loop body and pushes them onto the stack for storage. Here, the loop-node instruction is the last instruction of the loop body while the loop is in progress. The number of instructions of the new loop body received by the circular buffer 102 is N = min{n, m}, where n is the number of instructions in the loop body and m is the number of clock cycles an instruction takes to travel from the program memory 100 to the decoding unit 103, minus 1;
The second control signal terminal 106 of the decoding unit 103 is used, when the decoding unit 103 reads a loop-node instruction, to make the circular buffer 102 output the first through Nth instructions of the current loop body to the decoding unit 103 in sequence through the selection module 107.
In the present example, an instruction needs 2 cycles from the issuing of the read request to the program memory 100 until it reaches the D0 stage, so only the first instruction of each loop-body level needs to be stored in the corresponding circular buffer 102, i.e. N equals 1. As can be seen in Fig. 1, after an instruction finishes decoding at the D0 stage, if it is judged to be an instruction inside a loop body and the loop is being entered for the first time, the instruction delivered to the decoding unit 103 is also stored into the circular buffer 102.
The circular buffer 102 is a stack composed of two or more instruction caches. The number of stack layers equals the number of instruction caches, which equals the number of loop-nesting levels supported by the processor, and the stack is read in first-in, last-out order.
In this embodiment, the structure of the circular buffer 102 is shown in Fig. 2. Suppose the current processor supports 3 levels of hardware-loop nesting. As described above, the first instruction of each loop-body level needs to be stored in the corresponding circular buffer 102, so the circular buffer 102 is in fact a stack of depth 3, each layer of which stores the first instruction of the hardware loop at the corresponding level. Whenever a new hardware loop is entered, the first instruction of its loop body is pushed into instruction cache 0, and at the same time the valid instructions in instruction cache 0 and instruction cache 1 are pushed into instruction cache 1 and instruction cache 2 respectively (that is, in Fig. 2, instruction cache 0 stores the instruction of the innermost nested loop, instruction cache 1 stores the instruction of the middle nesting level, and instruction cache 2 stores the instruction of the outermost loop). Whenever the last instruction of a hardware loop body is reached, the instruction stored in instruction cache 0 is popped and fed to the decoding unit in the next cycle, while the valid instructions in instruction cache 2 and instruction cache 1 are pushed into instruction cache 1 and instruction cache 0.
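The shift behavior of the three instruction caches can be sketched as follows (a Python model of Fig. 2 with N = 1; function names such as `enter_loop` are ours and this is not RTL):

```python
# Instruction cache 0 holds the innermost active loop's first instruction and
# cache 2 the outermost; entering a loop shifts the valid entries one level
# deeper, leaving a loop shifts them back toward cache 0.

caches = [None, None, None]                       # instruction caches 0, 1, 2

def enter_loop(first_instr):
    caches[2], caches[1], caches[0] = caches[1], caches[0], first_instr

def replay():                                     # fed to the decoder on the
    return caches[0]                              # cycle after the body's end

def exit_loop():
    caches[0], caches[1], caches[2] = caches[1], caches[2], None

enter_loop("outer_first"); enter_loop("mid_first"); enter_loop("inner_first")
assert replay() == "inner_first"                  # innermost iterations first
exit_loop(); assert replay() == "mid_first"       # then the middle loop
exit_loop(); assert replay() == "outer_first"     # then the outermost one
```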
Each of the above instruction caches is a memory of size N, and each such memory stores the first through Nth instructions of the corresponding loop body at addresses 0 through N-1 in order. In this embodiment each instruction cache therefore stores exactly one instruction, and the 3 instruction caches form a stack of depth 3.
Here each layer of instruction cache is a memory of size N (N = 1 in this embodiment), which can be implemented with RAM or registers and which, after the program first enters a loop, stores the first N instructions of that loop body. The number of instructions N stored in the circular buffer equals the number of clock cycles from issuing the program-memory read request until the instruction reaches the decoding unit, minus 1, but never exceeds the number of instructions in the loop body. According to the description of the circular buffer above, the read/write addresses of each instruction-cache layer all cycle through the sequence 0 to N-1.
The operation of the circular buffer 102 of the embodiment shown in Fig. 1 is described in detail below.
After the fetch unit 101 issues a read request to the program memory 100, the instruction read arrives at the D0 stage 2 cycles later. After the decoding unit 103 finishes decoding, if the instruction is judged to be inside a loop body and the loop is being entered for the first time, the instruction is pushed into the circular buffer 102 at the same time as it is sent to the execution unit 108. When the decoding unit 103 judges that the decoded instruction is the last instruction of the loop body and the loop has not finished, the fetch unit 101 issues a read request to the program memory, and the instruction read is the 2nd instruction of the loop body. Since a newly read instruction needs 2 cycles to reach the D0 stage, the first instruction of the loop body is taken out of the circular buffer 102 and decoded in the first clock cycle, and the decoding of the newly fetched 2nd instruction of the loop body is completed in the second cycle.
It should be noted that if the loop body contains only one instruction, every iteration after the first reads the loop-body instruction from the circular buffer 102, without any repeated read of the program memory 100. This is because, after the program first enters a loop, the first N instructions of the loop body are stored in the circular buffer corresponding to that loop level, where N equals the number of clock cycles from issuing the program-memory read request until the instruction reaches the decoding unit, minus 1. If the number of instructions in the loop body is less than or equal to N, the loop-body instructions are read from the program memory 100 and stored into the circular buffer 102 only during the first iteration; afterwards all loop-body instructions can be read from the circular buffer 102 until the loop ends. If the number of instructions in the loop body is greater than N, then on re-entering the loop the N instructions stored earlier are read from the circular buffer first, until the (N+1)th instruction of the loop body, read back from the program memory, reaches the decoding unit. In this way the hardware loop jump incurs zero delay.
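An illustrative decode-slot trace for the N = 1 case (Python; the instruction names and iteration count are made up for illustration and do not come from the patent):

```python
# A 2-instruction body {i0, i1} iterated 3 times with a 1-cycle memory stall
# on every back edge. Without the circular buffer each back edge inserts a
# bubble; with it, the buffered i0 fills that slot while i1 of the next
# iteration is still being fetched from program memory.

def decode_slots(iterations: int, with_buffer: bool):
    slots = []
    for it in range(iterations):
        if it > 0 and not with_buffer:
            slots.append("bubble")        # waiting on program memory
        slots.extend(["i0", "i1"])        # i0 comes from the buffer if enabled
    return slots

print(decode_slots(3, with_buffer=False))
# ['i0', 'i1', 'bubble', 'i0', 'i1', 'bubble', 'i0', 'i1']
print(decode_slots(3, with_buffer=True))
# ['i0', 'i1', 'i0', 'i1', 'i0', 'i1']
```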
That is, for a processor pipeline that needs m+1 clock cycles from reading the memory to the decode stage, the fetch stage outputs instructions according to the following steps:
In the first step, it is judged whether the instruction input from the previous stage to the fetch stage is the first instruction of a new loop body. If so, the fetch stage stores, and synchronously outputs to the next stage (i.e. the decode stage), the first through Nth instructions of the current loop body; otherwise, go to the third step;
Here, N = min{n, m}, where n is the number of instructions in the current loop body and m is the number of clock cycles required in the processor pipeline from reading the memory to the decode stage, minus 1;
In the second step, if the instruction is the last instruction of the loop body and the loop has not yet finished, the fetch stage outputs in sequence to the next stage (i.e. the decode stage) the first through Nth instructions of the current loop body stored in the fetch stage, then waits to receive a new instruction from the previous stage and returns to the first step; otherwise, go to the third step;
In the third step, the fetch stage outputs to the next stage (i.e. the decode stage) the instruction it obtained from the previous stage, and the current round of fetch-stage processing ends.
Here the fetch stage includes a stack whose depth equals the number of loop-nesting levels supported by the processor, and the stack performs the storage of the first through Nth instructions of the current loop body in the first step above as follows:
Step 101: judge whether the instruction is the first instruction of a new loop body. If so, push all loop-body instructions currently stored in the stack one layer deeper (the instructions originally stored in the first storage layer are pushed into the second storage layer, the instructions originally stored in the second layer are pushed into the third layer, and so on) and then go to step 102; otherwise, go directly to step 102;
Step 102: in the top layer of the stack (i.e. the topmost storage layer), store the first through Nth instructions of the current loop body in address order from 0 to N-1.
Meanwhile, the stack performs the output, in the second step above, of the first through Nth instructions of the current loop body stored in the fetch stage as follows:
Step 201: judge whether the instruction is the Nth instruction of the current loop body. If so, and the loop has finished, raise all loop-body instructions stored in the stack one layer toward the top of the stack (that is, the instructions originally stored in the third storage layer are raised to the second layer, the instructions originally stored in the second layer are raised to the first layer, and so on) and then go to step 202; otherwise, go directly to step 202;
Step 202: starting from the topmost storage layer (i.e. the top layer of the stack), output the first through Nth instructions of the current loop body in address order from 0 to N-1.
Comparing this with the loop execution without a circular buffer shown in Fig. 3, it is easy to see that, because the program memory has a read latency, a one-cycle bubble must be inserted each time execution jumps from the last instruction of the loop body to its first instruction, before the first instruction of the loop body read from memory can reach the D0 stage. When the loop body contains few instructions, or when loops are nested, this inserted bubble therefore has a considerable impact on program execution efficiency.
After the circular buffer is added, as in Fig. 4, the bubble originally inserted in Fig. 3 is replaced by the first instruction of the loop body read from the circular buffer 102, and the hardware loop jump incurs zero delay. This method also reduces the number of program-memory reads; in the case of nested loops, in particular, the number of memory reads can be greatly reduced, effectively reducing power consumption.
From this embodiment it can be seen that the circular buffer is a stack of depth M, where M is the number of hardware-loop-nesting levels supported by the processor; each stack layer corresponds to one storage layer of the method flow, i.e. one instruction cache in Fig. 2. Whenever a new loop is entered, the first N instructions of the new loop are stored into instruction cache 0 in instruction order, and at the same time the valid instructions in instruction cache 0 and instruction cache 1 are pushed into instruction cache 1 and instruction cache 2. Whenever the last instruction of a hardware loop body is reached, the instructions stored in instruction cache 0 are popped in sequence and fed to the decoding unit; after all instructions in instruction cache 0 have been popped, the valid instructions in instruction cache 2 and instruction cache 1 are pushed into instruction cache 1 and instruction cache 0.
In this way, through the stack, the N instructions corresponding to the outermost loop are pushed to the bottom of the stack and the N instructions of the innermost nested loop are placed at its top. When instructions are read, the innermost loop is processed to completion first, according to the loop structure, and then the instructions of the next outer loop are processed. Because the stack is read first-in, last-out, the instructions of the innermost loop are always processed first; such a design therefore guarantees that each loop level proceeds according to its nesting position and, while keeping the execution order correct, also eliminates the extra wait cycles during computation.
The hardware loop processing method and system for a processor or DSP disclosed herein add a circular buffer to the existing fetch unit and can thereby eliminate the extra wait cycles incurred each time execution jumps from the last instruction of the loop body to its first instruction. The design is simple, adds very little hardware to the existing system, and achieves zero-delay hardware-loop jumps. In addition, the method also reduces the fetch unit's accesses to the program memory, thereby reducing power consumption.
It will be understood by those of ordinary skill in the art that the foregoing are merely preferred embodiments of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions recorded in the foregoing embodiments or replace some of their technical features with equivalents. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (7)

1. A processor hardware loop processing system, characterized in that the system comprises a program memory (100), a fetch unit (101), a circular buffer (102), a selection module (107), a decoding unit (103) and an execution unit (108);
the program memory (100) is connected to the input of the fetch unit (101); the instruction output of the fetch unit (101) is connected to one input of the selection module (107), and the other input of the selection module (107) is connected to the output of the circular buffer (102); the output of the selection module (107) is connected to the input of the decoding unit (103); the output of the decoding unit (103) is connected to the input of the execution unit (108);
the fetch unit (101) is further connected to a first control signal terminal (105) of the decoding unit (103), and the circular buffer (102) is further connected to a second control signal terminal (106) of the decoding unit (103); the first control signal terminal (105) of the decoding unit (103) is used, when the decoding unit (103) has not read a loop-node instruction, to make the fetch unit (101) output the next instruction to be decoded to the decoding unit (103) through the selection module (107); at the same time, after the decoding unit (103) reads the first instruction of a new loop, it also uses the first control signal terminal (105) to make the fetch unit (101) synchronously output the first through Nth instructions of the new loop body to the circular buffer (102) through the selection module (107);
the circular buffer (102) is configured to receive the first N instructions of the new loop body and push the received N instructions onto a stack for storage; wherein the loop-node instruction is the last instruction of the loop body while the loop is in progress; the number of instructions of the new loop body received by the circular buffer (102) is N = min{n, m}, where n is the number of instructions in the loop body and m is the number of clock cycles an instruction takes to travel from the program memory (100) to the decoding unit (103), minus 1;
the second control signal terminal (106) of the decoding unit (103) is used, when the decoding unit (103) reads a loop-node instruction, to make the circular buffer (102) output the first through Nth instructions of the current loop body to the decoding unit (103) in sequence through the selection module (107).
2. The hardware loop processing system of claim 1, characterized in that the circular buffer (102) is a stack composed of two or more instruction caches, the number of stack layers equals the number of instruction caches, the number of instruction caches equals the number of loop-nesting levels supported by the processor, and the stack is read in first-in, last-out order.
3. The hardware loop processing system of claim 2, characterized in that each instruction cache is a memory of size N, and each memory of the circular buffer (102) stores the first through Nth instructions of the corresponding loop body at addresses 0 through N-1 in order.
4. The hardware loop processing system of claim 1 or 2, characterized in that the memories in the circular buffer (102) are random-access memories or registers whose size is N instruction lengths.
5. A processor-pipeline fetch-stage processing method of the processor hardware loop processing system of any one of claims 1 to 4, characterized in that the method comprises the following processing steps:
in the first step, judging whether the instruction input from the previous stage is the first instruction of a new loop body; if so, the fetch stage stores, and synchronously outputs to the next stage, the first through Nth instructions of the current loop body; otherwise, going to the third step;
wherein N = min{n, m}, n is the number of instructions in the current loop body, and m is the number of clock cycles required in the processor pipeline from reading the memory to the decode stage, minus 1;
in the second step, if the instruction is the last instruction of the loop body and the loop has not yet finished, then after the first through Nth instructions of the current loop body stored in the fetch stage have been output in sequence to the next stage, waiting to receive a new instruction from the previous stage and returning to the first step; otherwise, going to the third step;
in the third step, after the instruction the fetch stage obtained from the previous stage has been output to the next stage, ending the current round of fetch-stage processing.
6. The processor-pipeline fetch-stage processing method of claim 5, characterized in that, in the first step, the first through Nth instructions of the current loop body are stored as follows:
step 101: judging whether the instruction is the first instruction of a new loop body; if so, pushing all currently stored loop-body instructions down by one storage layer and then going to step 102; otherwise, going directly to step 102;
step 102: in the topmost storage layer, storing the first through Nth instructions of the current loop body in address order from 0 to N-1.
7. The processor-pipeline fetch-stage processing method of claim 5 or 6, characterized in that, in the second step, the first through Nth instructions of the current loop body stored in the fetch stage are output as follows:
step 201: if the instruction is the Nth instruction of the current loop body and the loop has finished, raising all currently stored loop-body instructions by one storage layer and then going to step 202; otherwise, going directly to step 202;
step 202: from the topmost storage layer, outputting the first through Nth instructions of the current loop body in address order from 0 to N-1.
CN201611021587.XA 2016-11-21 2016-11-21 Hardware loop processing method and system for a processor Active CN106775591B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611021587.XA CN106775591B (en) 2016-11-21 2016-11-21 Hardware loop processing method and system for a processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611021587.XA CN106775591B (en) 2016-11-21 2016-11-21 Hardware loop processing method and system for a processor

Publications (2)

Publication Number Publication Date
CN106775591A CN106775591A (en) 2017-05-31
CN106775591B (en) 2019-06-18

Family

ID=58969971

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611021587.XA Active CN106775591B (en) Hardware loop processing method and system for a processor

Country Status (1)

Country Link
CN (1) CN106775591B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109032665B (en) * 2017-06-09 2021-01-26 龙芯中科技术股份有限公司 Method and device for processing instruction output in microprocessor
CN107368287B (en) * 2017-06-12 2020-11-13 北京中科睿芯科技有限公司 Acceleration system, acceleration device and acceleration method for cyclic dependence of data stream structure
CN107423148A (en) * 2017-07-26 2017-12-01 广州路派电子科技有限公司 A kind of double buffering protocol data analysis system being applied under multi-task scheduling environment
CN107729054B (en) * 2017-10-18 2020-07-24 珠海市杰理科技股份有限公司 Method and device for realizing execution of processor on loop body
CN111522584B (en) * 2020-04-10 2023-10-31 深圳优矽科技有限公司 Hardware circulation acceleration processor and hardware circulation acceleration method executed by same
CN114116010B (en) * 2022-01-27 2022-05-03 广东省新一代通信与网络创新研究院 Architecture optimization method and device for processor cycle body


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1012693B1 (en) * 1996-04-29 2006-04-12 Atmel Corporation Program memory and signal processing system storing instructions encoded for reducing power consumption during reads
CN102436367A (en) * 2011-09-26 2012-05-02 杭州中天微系统有限公司 16/32 bits mixed framework order prefetched buffer device
CN102637149A (en) * 2012-03-23 2012-08-15 山东极芯电子科技有限公司 Processor and operation method thereof

Also Published As

Publication number Publication date
CN106775591A (en) 2017-05-31

Similar Documents

Publication Publication Date Title
CN106775591B (en) Hardware loop processing method and system for a processor
CN101807144B (en) Prospective multi-threaded parallel execution optimization method
CN101221541B (en) Programmable communication controller for SOC and its programming model
US9170816B2 (en) Enhancing processing efficiency in large instruction width processors
CN105426160A (en) Instruction classified multi-emitting method based on SPRAC V8 instruction set
CN106575220A (en) Multiple clustered very long instruction word processing core
US20040255103A1 (en) Method and system for terminating unnecessary processing of a conditional instruction in a processor
CN103890718A (en) Digital signal processor and baseband communication device
US8949575B2 (en) Reversing processing order in half-pumped SIMD execution units to achieve K cycle issue-to-issue latency
US20110264892A1 (en) Data processing device
CN110928832A (en) Asynchronous pipeline processor circuit, device and data processing method
CN105242904B (en) For processor instruction buffering and the device and its operating method of circular buffering
CN112527393A (en) Instruction scheduling optimization device and method for master-slave fusion architecture processor
CN104503733B (en) The merging method and device of a kind of state machine
KR101545701B1 (en) A processor and a method for decompressing instruction bundles
CN117389731B (en) Data processing method and device, chip, device and storage medium
CN104516829A (en) Microprocessor and method for using an instruction loop cache thereof
CN107924310A (en) Produced using the memory instructions for avoiding producing in table (PAT) prediction computer processor
JP2014215624A (en) Arithmetic processing device
JP4771079B2 (en) VLIW processor
US10454670B2 (en) Memory optimization for nested hash operations
CN116048627A (en) Instruction buffering method, apparatus, processor, electronic device and readable storage medium
CN116257174A (en) Heterogeneous space optimizer based on tensor asynchronous hard disk read-write
US20130151817A1 (en) Method, apparatus, and computer program product for parallel functional units in multicore processors
JP5630798B1 (en) Processor and method

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant