EP1460532A2 - Computer processor data fetch unit and related method - Google Patents
- Publication number
- EP1460532A2 (application EP04251079A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- instruction
- level cache
- operands
- clock speed
- processor
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3802—Instruction prefetching
- G06F9/3824—Operand accessing
- G06F9/383—Operand prefetching
- G06F9/3861—Recovery, e.g. branch miss-prediction, exception handling
- G06F9/3877—Concurrent instruction execution using a slave processor, e.g. coprocessor
- G06F9/3879—Concurrent instruction execution using a slave processor for non-native instruction execution, e.g. executing a command; for Java instruction set
- G06F9/3885—Concurrent instruction execution using a plurality of independent parallel functional units
Definitions
- the present invention is directed to the field of microelectronics, and more particularly, to microprocessors.
- the memory stores information, such as instructions and data, for use by the processor.
- the instructions direct the processor in data manipulation, and the data are acted on by the processor in accordance with the instructions.
- processors today are typically designed in a pipelined architecture.
- the processor processes an instruction in different stages. This enables the processor to process more than one instruction simultaneously, one at each of several stages.
- prior art processors received an instruction stream, such as from a software program, and processed the instruction stream through four different stages: 1) retrieve, also termed fetch, an instruction from memory; 2) decode the instruction and retrieve the operands needed for the instruction from memory; 3) execute the instruction on the operands to obtain a result; and 4) store the result in memory.
- these stages were implemented at the same clock speed.
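The four stages described above can be pictured with a toy model. The three-instruction program, the instruction format, and all names below are illustrative inventions, not the patented design:

```python
# Toy model of the four classic pipeline stages: fetch, decode, execute, store.
# MEMORY maps addresses to (opcode, operand address); DATA holds operand values.
MEMORY = {0: ("LOAD", 100), 1: ("ADD", 101), 2: ("STORE", 102)}
DATA = {100: 7, 101: 5, 102: 0}

def fetch(pc):
    """Stage 1: retrieve an instruction from memory."""
    return MEMORY[pc]

def decode(instr):
    """Stage 2: decode the instruction and retrieve its operand."""
    op, addr = instr
    return op, addr, DATA.get(addr)

def execute(op, operand, acc):
    """Stage 3: execute the instruction on the operand to obtain a result."""
    if op == "LOAD":
        return operand
    if op == "ADD":
        return acc + operand
    return acc  # STORE computes nothing new

def store(op, addr, acc):
    """Stage 4: take the result and store it in memory."""
    if op == "STORE":
        DATA[addr] = acc

acc = 0
for pc in range(3):
    op, addr, operand = decode(fetch(pc))
    acc = execute(op, operand, acc)
    store(op, addr, acc)

print(DATA[102])  # 7 + 5 = 12
```

In the prior art described here, all four of these stage functions would run at the same clock speed.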
- In order to execute program instructions quickly, a computer's processor must have instructions and operands from memory available at the processor at the time they are needed in the instruction stream. New processors are continually being designed that execute instructions at increasingly faster rates; however, the time to access data in memory, also termed memory latency, is not decreasing at a similar rate. As a result, processors often have to wait for memory accesses to complete operations. This considerably reduces the overall performance of the processor and prevents systems using the processor from taking full advantage of the increased processor speeds.
- Caches are small, fast memories that are located physically closer to the processor than the main memory.
- a first level cache, also called an L1 cache, is a small, fast memory, typically co-located with the processor on the same semiconductor chip for fast access speed.
- Higher level caches such as L2, L3, etc., are also often used in the computer systems, but are typically located farther from the processor than the L1 cache.
- Caches partially solve the memory latency problem as they can more closely match processor speeds; however, caches are typically too small to hold very much data, and are therefore limited in their ability to solve the memory latency problem.
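The latency gap the passage describes can be illustrated numerically. The cycle counts below are made-up round numbers for illustration only; the patent gives no figures:

```python
# Illustrative access latencies, in processor cycles, for each memory level.
LATENCY = {"L1": 1, "L2": 10, "L3": 40, "main": 200}

def miss_penalty(hit_level):
    """Total cycles spent probing each level in turn until the hit level."""
    for level in ["L1", "L2", "L3", "main"]:
        yield_cycles = LATENCY[level]
        if level == hit_level:
            return sum(LATENCY[l] for l in ["L1", "L2", "L3", "main"]
                       [: ["L1", "L2", "L3", "main"].index(level) + 1])
    raise ValueError(hit_level)

print(miss_penalty("L1"))    # 1
print(miss_penalty("main"))  # 251 -- two orders of magnitude worse than an L1 hit
```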
- FIG. 1 illustrates a block diagram of a computer system 100 including a processor 110.
- computer system 100 includes processor 110, an L1 cache 112, an L2 cache 116, an L3 cache 118, and a main memory 120.
- Computer system 100 is further connected to a display 102 for displaying information on computer system 100 and one or more input device(s) 104 for inputting information into computer system 100.
- L1 cache 112 is co-located on the same semiconductor chip 114 as processor 110, with L2 cache 116 and L3 cache 118 existing off semiconductor chip 114 of processor 110, and a main memory 120 located elsewhere in the computer system.
- processor 110 fetched an instruction from L1 cache 112, decoded the instruction and determined the needed operands, executed the instruction, and then stored the result in L1 cache. If the instruction and/or operands were not in L1 cache 112, termed a cache miss, processor 110 would wait while the instruction and/or operand was retrieved from L2 cache 116, L3 cache 118, or main memory 120. Due to the small size of L1 cache 112, cache misses could be frequent.
- processor 110 could only get information quickly and efficiently from L1 cache 112. If the information was in a higher level cache, e.g., L2 cache 116 or L3 cache 118, or main memory 120, the processor had to wait to receive the information, and the processor typically did nothing while it waited. Thus, prior art processors spent most of their time waiting for information to be retrieved from caches or memory so the processor could act on the information. This was inefficient and expensive in terms of lost processor productivity.
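The cost of the prior-art stall can be sketched as follows; the 200-cycle miss penalty and single-cycle hit are assumed round numbers, not values from the patent:

```python
# Toy model of a prior-art processor that idles for the full retrieval
# time whenever an access misses in the L1 cache.
def execute_cycles(addr, l1_contents, miss_penalty=200):
    """Cycles to complete one access: 1 on an L1 hit, 1 + penalty on a miss."""
    return 1 if addr in l1_contents else 1 + miss_penalty

l1 = {0, 1, 2}                   # addresses currently resident in L1
program = [0, 1, 5, 2]           # address 5 will miss in L1
print(sum(execute_cycles(a, l1) for a in program))  # 204
```

A single miss dominates the total: three hits cost 3 cycles, while the one miss costs 201, which is the "lost processor productivity" the passage describes.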
- the invention seeks to provide for a computer processor unit having advantages over known such units.
- a device fetching information for a computer processor having a first clock speed includes: a first level cache interface; an instruction decoder coupled with the first level cache interface; a program counter coupled with the instruction decoder and the first level cache interface; an arithmetic logic unit coupled with the instruction decoder; and a branch prediction logic unit coupled with the instruction decoder, wherein the device operates at a second clock speed, the second clock speed being faster than the first clock speed.
- a prefetch unit receives the same instruction stream as a processor.
- the prefetch unit is run at a faster clock speed than the processor allowing the prefetch unit to run ahead of the processor in the instruction stream and to prefetch information for the processor.
- the prefetch unit advantageously requests instructions and operands from a first level (L1) cache.
- the L1 cache sends the requested instructions and operands to the prefetch unit and automatically stores the requested instructions and operands until needed by the processor.
- the prefetch unit improves processor performance by reducing the number of cache misses and by reducing memory latency.
- a fetch unit includes: a first level cache interface, the first level cache interface for receiving instructions and operands from a first level cache, and for sending requests for instructions and operands to the first level cache; an instruction decoder coupled with the first level cache interface, the instruction decoder for decoding at least one instruction and for determining any operands needed by the at least one instruction; a program counter coupled with the first level cache interface and the instruction decoder, the program counter for storing a location of the at least one instruction; an arithmetic logic unit coupled with the instruction decoder, the arithmetic logic unit for calculating addresses and other mathematical operations; and a branch prediction logic unit coupled with the instruction decoder, the branch prediction logic unit for selecting an instruction branch of a conditional branch instruction.
- a device for fetching information for a computer processor having a first clock speed that includes: a first level cache interface for requesting and receiving instructions and operands from a first level cache; an instruction decoder coupled with the first level cache interface, the instruction decoder for decoding a received instruction and for determining whether or not one or more operands are required by the instruction; a program counter coupled with the first level cache interface and the instruction decoder, the program counter for storing a location of the received instruction; an arithmetic logic unit coupled with the instruction decoder, the arithmetic logic unit for calculating addresses of instructions and operands and other mathematical operations; and a branch prediction logic unit coupled with the instruction decoder, the branch prediction logic unit for selecting an instruction branch of a conditional branch instruction, wherein the device operates at a second clock speed, the second clock speed being faster than the first clock speed.
- a device for fetching information for a computer processor having a first clock speed includes: means for requesting an instruction from a first level cache; means for receiving the instruction from the first level cache; means for decoding the instruction; means for determining whether or not one or more operands are required by the instruction; means for requesting the one or more operands from the first level cache if one or more operands are required by the instruction; means for receiving the one or more operands, if any, from the first level cache; and means for calculating a next instruction.
- a computer system includes: a processor, the processor operating at a first clock speed; a prefetch unit coupled with the processor, the prefetch unit operating at a second clock speed, the second clock speed being faster than the first clock speed; a first level cache coupled with the processor and the prefetch unit; and a main memory communicatively coupled with the first level cache.
- the invention provides for a method for prefetching information for a computer processor having a first clock speed and which includes: requesting an instruction from a first level cache; receiving the instruction from the first level cache; decoding the instruction; determining whether or not one or more operands are required by the instruction; if one or more operands are required by the instruction, requesting the one or more operands from the first level cache; receiving the one or more operands from the first level cache; and calculating a next instruction.
- a method for fetching information for a computer processor having a first clock speed can also be provided and which includes: requesting an instruction from a first level cache, the first level cache automatically storing the instruction; receiving the instruction from the first level cache; decoding the instruction; determining whether or not one or more operands are required by the instruction; if one or more operands are required by the instruction, requesting the one or more operands from the first level cache, the first level cache automatically storing the one or more operands; receiving the one or more operands, if any, from the first level cache; and calculating a next instruction, wherein the method is performed at a second clock speed, the second clock speed being faster than the first clock speed.
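The method steps above can be sketched as a toy prefetch loop. The `L1Cache` class, the `(opcode, operand address, next address)` instruction encoding, and every name below are illustrative assumptions, not the patented implementation:

```python
# Sketch of the prefetch method: request instruction, decode, request any
# operands, calculate the next instruction -- warming L1 along the way.
class L1Cache:
    def __init__(self, backing):
        self.lines = {}          # lines automatically stored for the processor
        self.backing = backing   # stands in for L2/L3/main memory

    def request(self, addr):
        # On a miss, fill from the backing store and automatically keep a copy.
        if addr not in self.lines:
            self.lines[addr] = self.backing[addr]
        return self.lines[addr]

def prefetch(l1, pc, steps):
    """Run ahead of the processor for a number of steps, warming the L1 cache."""
    for _ in range(steps):
        op, operand_addr, next_pc = l1.request(pc)   # request + decode
        if operand_addr is not None:                 # instruction needs an operand
            l1.request(operand_addr)                 # warm the operand line too
        pc = next_pc                                 # calculate the next instruction

backing = {0: ("LOAD", 100, 1), 1: ("NOP", None, 2), 2: ("ADD", 101, 0),
           100: 7, 101: 5}
l1 = L1Cache(backing)
prefetch(l1, pc=0, steps=3)
print(sorted(l1.lines))  # [0, 1, 2, 100, 101] -- all warmed before the processor arrives
```

Because the loop touches both instruction and operand addresses, every line the processor will need is already resident in L1 when it gets there.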
- the fetch unit serves as a prefetch unit which improves processor performance by reducing the number of cache misses and by reducing memory latency.
- the present invention provides for methods and devices that prefetch information, such as instructions and data, in advance of the processor needing them.
- the present invention takes the first stage of the prior art pipelined architecture, e.g., retrieve an instruction, separates this stage from the other processor stages, and runs it at a faster clock speed than the other stages implemented by the processor.
- the present invention is implemented as a prefetch unit on the same semiconductor chip as the processor.
- the prefetch unit receives the same instruction stream as the processor, and, due to the faster clock speed, is able to run ahead of the processor in the instruction stream to prefetch information in advance of the processor needing the information.
- the prefetch unit requests instructions and operands from a first level (L1) cache.
- the L1 cache sends the requested instructions and operands to the prefetch unit and automatically stores the requested instructions and operands until needed by the processor.
- FIG. 2 illustrates a block diagram of a computer system 200 including a prefetch unit 222 according to one embodiment of the present invention.
- Computer system 200 further includes: a processor 210, an L1 cache 212, two higher level caches -- an L2 cache 216 and an L3 cache 218, and a main memory 220.
- Computer system 200 is illustrated as further including a display 202 and one or more input device(s) 204. It is understood by those of skill in the art that in other embodiments, computer system 200 can be differently configured and that the present illustration is for exemplary purposes only to aid in describing the present invention. In particular, display 202, input device(s) 204, L2 cache 216, and L3 cache 218 are not required.
- In one embodiment, prefetch unit 222 is co-located on the same semiconductor chip 214 with processor 210 and L1 cache 212; in other embodiments, prefetch unit 222 shares semiconductor chip 214 with processor 210 alone. Although prefetch unit 222 is illustrated physically separate from processor 210, logically it operates as part of processor 210. In one embodiment, prefetch unit 222 operates, or runs, at a faster clock speed than processor 210. A clock input (not shown) can be externally supplied to prefetch unit 222 or internally generated by prefetch unit 222.
- the instruction stream of computer system 200 flows from main memory 220 to L3 cache 218, to L2 cache 216, and to L1 cache 212. From L1 cache 212, the instruction stream is sent to both prefetch unit 222 and processor 210. Prefetch unit 222 requests an instruction or operand from L1 cache 212, e.g., the address of an instruction or operand. If L1 cache 212 does not have the requested instruction or operand, L1 cache 212 obtains the requested instruction or operand from L2 cache 216, L3 cache 218, or main memory 220 in advance of processor 210 needing the instruction or operand.
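The fill path just described can be sketched as follows. Whether every level keeps a copy on the way back (an inclusive hierarchy) is a common design choice assumed here for illustration; all names are hypothetical:

```python
# A request that misses in L1 walks up the hierarchy; each level below the
# hit keeps a copy on the way back, so the line is in L1 before the
# processor needs it.
def lookup(levels, memory, addr):
    """Probe caches nearest-first, fall back to memory, fill on return."""
    for i, cache in enumerate(levels):
        if addr in cache:
            value = cache[addr]
            break
    else:
        i, value = len(levels), memory[addr]
    for cache in levels[:i]:      # fill every level below the hit point
        cache[addr] = value
    return value

l1, l2, l3 = {}, {}, {}
main = {0x10: "instr"}
lookup([l1, l2, l3], main, 0x10)
print(0x10 in l1 and 0x10 in l2 and 0x10 in l3)  # True
```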
- FIG. 3 illustrates a block diagram of prefetch unit 222 of FIG. 2 according to one embodiment of the present invention.
- prefetch unit 222 includes: 1) a first level (L1) cache interface 334; 2) an instruction decoder 330; 3) a program counter 336; 4) an arithmetic logic unit (ALU) 332; and 5) a branch prediction logic unit 338.
- L1 cache interface 334 is utilized for sending requests for instructions and operands to L1 cache 212 (FIG. 2) and for receiving instructions and operands from L1 cache 212 (FIG. 2), such as instructions in an instruction stream, or requested instructions and operands.
- Instruction decoder 330 is utilized for decoding the instruction and for determining any operands needed by the instruction.
- Program counter 336 is utilized to keep track of where prefetch unit 222 is in the instruction stream and stores the current location.
- ALU 332 is utilized for calculating addresses and other mathematical operations.
- Branch prediction logic unit 338 is utilized for selecting an instruction branch of a conditional branch instruction. Prefetch unit 222 is further described herein with reference to FIG. 4.
- FIG. 4 is a process flow diagram of a process 400 implemented by prefetch unit 222 for prefetching information for use by processor 210 of FIG. 2 according to one embodiment of the present invention.
- Process 400 is automatically implemented by prefetch unit 222.
- prefetch unit 222 requests a first instruction from L1 cache 212. This request is made through L1 cache interface 334. If the instruction is stored in L1 cache 212, L1 cache 212 sends the requested instruction to prefetch unit 222 via L1 cache interface 334.
- If L1 cache 212 does not have the instruction, L1 cache 212 obtains the instruction from a higher level cache, such as L2 cache 216 or L3 cache 218, or from main memory 220. When L1 cache 212 obtains the requested instruction, L1 cache 212 automatically stores the instruction and sends the requested instruction to prefetch unit 222 via L1 cache interface 334.
- prefetch unit 222 receives the requested instruction from L1 cache 212 via L1 cache interface 334.
- Instruction decoder 330 of prefetch unit 222 receives the requested instruction from L1 cache interface 334, and program counter 336 stores the current location of the instruction.
- instruction decoder 330 decodes the instruction. Generally, instruction decoder 330 receives a bit pattern and determines what type of instruction has been received. Instruction decoding is well known to those of skill in the art and not further described herein.
- instruction decoder 330 of prefetch unit 222 determines if the instruction requires operands. Instruction decoder 330 determines from the bit pattern what operands are required for the instruction (if any). This operation can also involve ALU 332, if mathematical operations are required.
- prefetch unit 222 requests the operands from L1 cache 212 via L1 cache interface 334. If L1 cache 212 does not have the operands, L1 cache 212 retrieves the operands from a higher level cache, such as L2 cache 216 or L3 cache 218, or from main memory 220. The retrieved operands are automatically stored in L1 cache 212 and sent to prefetch unit 222. Prefetch unit 222 may or may not act on the operands dependent upon whether or not the operands are needed by prefetch unit 222, such as for address calculation or branch prediction.
- prefetch unit 222 calculates a next instruction to be fetched and returns to operation 402.
- Prefetch unit 222 holds the address of the current instruction in program counter 336, so calculation of the next address is made from the current instruction address held in program counter 336.
- the next instruction may be the next instruction in the instruction stream, or it may be an instruction in a different instruction branch. Execution of different instruction branches by prefetch unit 222 is further described herein with reference to a conditional branch instruction.
- a conditional branch instruction is a program instruction that directs the computer system, e.g., computer system 200, to jump to another location in the program if a specified condition is met. This other location in the program is termed a conditional instruction branch or, simply, an instruction branch.
- prefetch unit 222 may not have the information necessary to determine which instruction branch to choose. For example, at operation 416, prefetch unit 222 may not have the information necessary to calculate whether the condition is met as the needed information is a variable number or a calculated number supplied from another operation or component of computer system 200. Thus, prefetch unit 222 needs some technique for choosing an instruction branch. In one embodiment, prefetch unit 222 utilizes a process termed branch prediction to select an instruction branch and execute the instructions in the selected instruction branch.
- Prior art processors typically implemented a process termed speculative execution when a conditional branch instruction was received in the instruction stream.
- the processor speculated which instruction branch might be the correct branch to execute next and started retrieving the instructions from that instruction branch.
- the next stage in the processor then began retrieving the operands for that instruction, and the following stage in the processor then began operating on the operands in accordance with the instruction.
- the processor was thus taking actions and changing data while it might still have lacked determinative information as to whether the selected instruction branch was the correct one.
- When prefetch unit 222 receives a conditional branch instruction, prefetch unit 222 may not have the information necessary to determine if the condition is met. Consequently, in one embodiment, branch prediction logic unit 338 of prefetch unit 222 utilizes branch prediction to select an instruction branch. Branch prediction is well known to those of skill in the art and not further described herein.
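The patent leaves the prediction scheme unspecified; one widely used technique is a 2-bit saturating counter per branch address, sketched here as an illustrative stand-in for what a branch prediction logic unit might do:

```python
# A 2-bit saturating-counter branch predictor. Each branch address maps to
# a state 0..3; states 2-3 predict "taken", 0-1 predict "not taken". Two
# consecutive wrong outcomes are needed to flip a strong prediction.
class TwoBitPredictor:
    def __init__(self):
        self.counters = {}   # branch address -> saturating counter state

    def predict(self, addr):
        """Predict taken when the counter is in a 'taken' state (2 or 3)."""
        return self.counters.get(addr, 1) >= 2

    def update(self, addr, taken):
        """Nudge the counter toward the actual outcome, saturating at 0 and 3."""
        state = self.counters.get(addr, 1)
        state = min(state + 1, 3) if taken else max(state - 1, 0)
        self.counters[addr] = state

p = TwoBitPredictor()
for outcome in [True, True, True, False, True]:
    p.update(0x40, outcome)
print(p.predict(0x40))  # True -- one not-taken outcome does not flip a strong "taken"
```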
- If prefetch unit 222 selects the correct instruction branch, there is no disadvantage to having made the selection. If the selection is incorrect, prefetch unit 222 simply discards the wrong instructions. Because prefetch unit 222 receives the instruction stream in advance of processor 210 and prefetches instructions and operands at a faster rate than processor 210, an incorrect branch selection by prefetch unit 222 merely increases the cache miss probability. Consequently, the present invention avoids the costly undo operations of speculative execution by prior art processors and marks a significant improvement over the prior art.
- After a stall, prefetch unit 222 can catch back up to its position ahead of processor 210 in the instruction stream because, according to the invention, prefetch unit 222 is clocked faster and retrieves instructions and operands at a faster rate than processor 210. Thus, cache misses by prefetch unit 222 can cause prefetch unit 222 to stall, but most likely not processor 210.
- It is very inexpensive, in terms of processing time, for prefetch unit 222 to engage in branch prediction rather than have processor 210 engage in speculative execution: unlike a processor, prefetch unit 222 does not have to undo an incorrect branch prediction, and stalls due to cache misses occur at prefetch unit 222 rather than at processor 210.
- processor interrupts are unpredictable. Further, they have to be attended to right away and not delayed. While interrupts happen frequently, processors typically don't spend much of their total time attending to interrupts.
- a processor receives an interrupt during a program that is being currently executed, the processor jumps to a separate set of instructions associated with the interrupt. When the processor is done executing the instructions associated with the interrupt, the processor returns to the program it was executing prior to the interrupt.
- prefetch unit 222 can't prefetch interrupt code from the instruction stream in advance of processor 210, because the interrupt, and, therefore, what code to execute, is unpredictable.
- prefetch unit 222 stops operating in advance of processor 210 and enters a pause mode during the interrupt. In pause mode, prefetch unit 222 retrieves the next instruction for processor 210, but doesn't retrieve the subsequent instructions far in advance of processor 210 receiving them.
- prefetch unit 222 resumes operation, e.g., exits pause mode.
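The pause-mode behaviour during an interrupt can be pictured as throttling how far ahead the unit fetches; the run-ahead depth of 8 below is an invented example value, not a figure from the patent:

```python
# In pause mode the prefetch unit fetches only the next instruction for the
# processor; otherwise it runs a configurable number of instructions ahead.
def prefetch_depth(in_pause_mode, run_ahead=8):
    """How many instructions ahead of the processor to fetch."""
    return 1 if in_pause_mode else run_ahead

print(prefetch_depth(False))  # 8 -- normal operation, running well ahead
print(prefetch_depth(True))   # 1 -- pause mode during an interrupt
```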
- prefetch unit 222 calculates the addresses of data it may need. Address calculations are typically simple, but in some instances are quite complex -- too complex for prefetch unit 222. For example, array address calculations often require index calculations that can be complex, such as in the case of hashing algorithms. Thus, in designing different embodiments of the present invention, a tradeoff can be made in the design of prefetch unit 222 between speed and computational ability: a simple prefetch unit that runs faster but can't handle very complex calculations, or a more complex prefetch unit that runs slower but can handle more complex calculations.
- prefetch unit 222 By limiting prefetch unit 222 functions to operations that are relatively simple, and not complex, prefetch unit 222 can be designed to run very fast, for example, in one embodiment, 2-10 times faster than the processor speed of processor 210. In embodiments in which functions of prefetch unit 222 are limited, prefetch unit 222 retrieves instructions and operands, but doesn't operate on them extensively. The operations prefetch unit 222 does perform in these embodiments, such as calculating addresses, are simple and fast allowing prefetch unit 222 to retrieve instructions and data far in advance of processor 210 needing the information.
- prefetch unit 222 In instances where prefetch unit 222 can't act, such as when address calculations are too complex, in one embodiment, prefetch unit 222 enters the pause mode. In these instances, processor 210 performs the complex calculations, and prefetch unit 222 fetches instructions and operands when they are needed. When the complex calculations are complete, prefetch unit 222 resumes operation, e.g., exits pause mode.
- the methods and devices of the present invention prefetch information, such as instructions and operands, in advance of a processor, so that the needed instructions and operands are present in L1 cache when the processor needs the instructions and operands.
- the present invention separates the prior art processor stage of retrieving an instruction (or operand), from the other processor stages, and runs it at a faster clock speed than that of the processor.
- the prefetch unit improves processor performance by reducing the number of cache misses and by reducing memory latency.
- branch prediction of conditional branch instructions is performed without the need to undo incorrect instruction branch selections, as seen with speculative execution by prior art processors, further improving processor performance.
Description
- FIG. 1 illustrates a block diagram of a computer system 100 including a processor 110. As illustrated, computer system 100 includes processor 110, an L1 cache 112, an L2 cache 116, an L3 cache 118, and a main memory 120. Computer system 100 is further connected to a display 102 for displaying information on computer system 100 and one or more input device(s) 104 for inputting information into computer system 100. As illustrated, L1 cache 112 is co-located on the same semiconductor chip 114 as processor 110, with L2 cache 116 and L3 cache 118 existing off semiconductor chip 114 of processor 110, and a main memory 120 located elsewhere in the computer system. - Generally, in the prior art, instructions flowed from
main memory 120 to L3 cache 118, to L2 cache 116, to L1 cache 112, and then to processor 110. Processor 110 then advanced the instructions through the pipelined processor stages (not shown) earlier described. In the prior art, processor 110 fetched an instruction from L1 cache 112, decoded the instruction and determined the needed operands, executed the instruction, and then stored the result in L1 cache. If the instruction and/or operands were not in L1 cache 112, termed a cache miss, processor 110 would wait while the instruction and/or operand was retrieved from L2 cache 116, L3 cache 118, or main memory 120. Due to the small size of L1 cache 112, cache misses could be frequent. - A disadvantage of this prior art approach was that
processor 110 could only get information quickly and efficiently from L1 cache 112. If the information was in a higher level cache, e.g., L2 cache 116 or L3 cache 118, or in main memory 120, the processor had to wait to receive the information, and the processor typically did nothing while it waited. Thus, prior art processors spent most of their time waiting for information to be retrieved from caches or memory so that the processor could act on the information. This was inefficient and expensive in terms of lost processor productivity. - The invention seeks to provide a computer processor unit having advantages over such known units.
- According to one aspect of the present invention, there is provided a device for fetching information for a computer processor having a first clock speed, the device including: a first level cache interface; an instruction decoder coupled with the first level cache interface; a program counter coupled with the instruction decoder and the first level cache interface; an arithmetic logic unit coupled with the instruction decoder; and a branch prediction logic unit coupled with the instruction decoder, wherein the device operates at a second clock speed, the second clock speed being faster than the first clock speed.
- Advantageously, information, such as instructions and operands, is prefetched in advance of the processor needing it. In one embodiment, a prefetch unit receives the same instruction stream as the processor. The prefetch unit runs at a faster clock speed than the processor, allowing the prefetch unit to run ahead of the processor in the instruction stream and to prefetch information for the processor.
- The prefetch unit advantageously requests instructions and operands from a first level (L1) cache. The L1 cache sends the requested instructions and operands to the prefetch unit and automatically stores the requested instructions and operands until needed by the processor. By prefetching the information, the prefetch unit improves processor performance by reducing the number of cache misses and by reducing memory latency.
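The performance effect described above can be illustrated with a deliberately idealized miss-count model. The address stream, the lookahead depth, and the assumption of an unbounded cache with instant fills are all invented for this sketch; real caches have finite capacity and nonzero fetch time:

```python
# Idealized model: with no prefetching, every first touch of an address misses
# in L1; with a prefetcher running `lookahead` slots ahead of the processor
# (and L1 automatically keeping what was fetched on its behalf), addresses are
# already cached by the time the processor reaches them.
def count_misses(stream, prefetch=False, lookahead=3):
    l1 = set()
    misses = 0
    for i, addr in enumerate(stream):
        if prefetch:
            # everything the prefetcher has walked past so far is now in L1
            l1.update(stream[:i + lookahead])
        if addr not in l1:
            misses += 1
        l1.add(addr)  # a processor miss also fills the line
    return misses
```

For the stream `[1, 2, 3, 1, 2, 4]`, the model counts 4 processor-visible misses without prefetching and 0 with it, which is the sense in which the prefetch unit "reduces the number of cache misses."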
- According to another aspect, a fetch unit includes: a first level cache interface, the first level cache interface for receiving instructions and operands from a first level cache, and for sending requests for instructions and operands to the first level cache; an instruction decoder coupled with the first level cache interface, the instruction decoder for decoding at least one instruction and for determining any operands needed by the at least one instruction; a program counter coupled with the first level cache interface and the instruction decoder, the program counter for storing a location of the at least one instruction; an arithmetic logic unit coupled with the instruction decoder, the arithmetic logic unit for calculating addresses and other mathematical operations; and a branch prediction logic unit coupled with the instruction decoder, the branch prediction logic unit for selecting an instruction branch of a conditional branch instruction.
- Yet further, a device can be provided for fetching information for a computer processor having a first clock speed that includes: a first level cache interface for requesting and receiving instructions and operands from a first level cache; an instruction decoder coupled with the first level cache interface, the instruction decoder for decoding a received instruction and for determining whether or not one or more operands are required by the instruction; a program counter coupled with the first level cache interface and the instruction decoder, the program counter for storing a location of the received instruction; an arithmetic logic unit coupled with the instruction decoder, the arithmetic logic unit for calculating addresses of instructions and operands and other mathematical operations; and a branch prediction logic unit coupled with the instruction decoder, the branch prediction logic unit for selecting an instruction branch of a conditional branch instruction, wherein the device operates at a second clock speed, the second clock speed being faster than the first clock speed.
- In another embodiment, a device for fetching information for a computer processor having a first clock speed includes: means for requesting an instruction from a first level cache; means for receiving the instruction from the first level cache; means for decoding the instruction; means for determining whether or not one or more operands are required by the instruction; means for requesting the one or more operands from the first level cache if one or more operands are required by the instruction; means for receiving the one or more operands, if any, from the first level cache; and means for calculating a next instruction.
- According to still a further embodiment, a computer system includes: a processor, the processor operating at a first clock speed; a prefetch unit coupled with the processor, the prefetch unit operating at a second clock speed, the second clock speed being faster than the first clock speed; a first level cache coupled with the processor and the prefetch unit; and a main memory communicatively coupled with the first level cache.
- In another aspect, the invention provides a method for prefetching information for a computer processor having a first clock speed, the method including: requesting an instruction from a first level cache; receiving the instruction from the first level cache; decoding the instruction; determining whether or not one or more operands are required by the instruction; if one or more operands are required by the instruction, requesting the one or more operands from the first level cache; receiving the one or more operands from the first level cache; and calculating a next instruction.
- A method for fetching information for a computer processor having a first clock speed can also be provided, which includes: requesting an instruction from a first level cache, the first level cache automatically storing the instruction; receiving the instruction from the first level cache; decoding the instruction; determining whether or not one or more operands are required by the instruction; if one or more operands are required by the instruction, requesting the one or more operands from the first level cache, the first level cache automatically storing the one or more operands; receiving the one or more operands, if any, from the first level cache; and calculating a next instruction, wherein the method is performed at a second clock speed, the second clock speed being faster than the first clock speed.
- By prefetching information, the fetch unit serves as a prefetch unit, which improves processor performance by reducing the number of cache misses and by reducing memory latency.
- It is to be understood that both the foregoing general description and the following detailed description are intended only to exemplify and explain the invention as claimed.
- The invention is described further hereinafter by way of example only with reference to the accompanying drawings in which:
- FIG. 1 illustrates a block diagram of a computer system including a processor;
- FIG. 2 illustrates a block diagram of a computer system including a prefetch unit according to an embodiment of the present invention;
- FIG. 3 illustrates a block diagram of the prefetch unit of FIG. 2 according to an embodiment of the present invention; and
- FIG. 4 illustrates a process flow diagram of a method for prefetching instructions and data for a processor according to an embodiment of the present invention.
- In the drawings and throughout the following description, the same reference numbers may be used to refer to the same or like parts.
- The present invention provides methods and devices that prefetch information, such as instructions and data, in advance of the processor needing them. Broadly viewed, the present invention takes the first stage of the prior art pipelined architecture, i.e., retrieving an instruction, separates this stage from the other processor stages, and runs it at a faster clock speed than the other stages implemented by the processor.
- In one embodiment, the present invention is implemented as a prefetch unit on the same semiconductor chip as the processor. The prefetch unit receives the same instruction stream as the processor, and, due to the faster clock speed, is able to run ahead of the processor in the instruction stream to prefetch information in advance of the processor needing the information. In one embodiment, the prefetch unit requests instructions and operands from a first level (L1) cache. The L1 cache sends the requested instructions and operands to the prefetch unit and automatically stores the requested instructions and operands until needed by the processor.
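The interaction just described, in which a request that misses in L1 causes L1 to fetch the data from a lower level and automatically keep a copy for the processor, can be sketched as follows. The cache class, the instruction encoding, and all names are invented for this sketch:

```python
class TinyCache:
    """Stand-in for an L1 cache: on a miss, the line is fetched from the lower
    levels (one dict standing in for L2/L3/main memory) and automatically
    kept, so a later request for the same address finds it resident."""
    def __init__(self, lower_levels):
        self.lower_levels = lower_levels
        self.lines = {}

    def read(self, addr):
        if addr not in self.lines:           # cache miss
            self.lines[addr] = self.lower_levels[addr]
        return self.lines[addr]

def prefetch_loop(l1, start_pc, steps):
    """Invented encoding: each instruction is (opcode, operand_addrs).
    Request an instruction, request its operands, then compute the next
    address, leaving everything touched resident in L1."""
    pc = start_pc                            # current location in the stream
    for _ in range(steps):
        opcode, operand_addrs = l1.read(pc)  # request/receive the instruction
        for addr in operand_addrs:           # request any needed operands
            l1.read(addr)
        pc += 1                              # next instruction address
    return pc
```

After the loop runs, both the instructions and their operands sit in `l1.lines`, which is the state the processor is intended to find when it arrives at the same addresses.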
- FIG. 2 illustrates a block diagram of a computer system 200 including a prefetch unit 222 according to one embodiment of the present invention. Computer system 200 further includes: a processor 210, an L1 cache 212, two higher level caches -- an L2 cache 216 and an L3 cache 218 -- and a main memory 220. Computer system 200 is illustrated as further including a display 202 and one or more input device(s) 204. It is understood by those of skill in the art that in other embodiments, computer system 200 can be differently configured and that the present illustration is for exemplary purposes only to aid in describing the present invention. In particular, the presence of display 202, input device(s) 204, L2 cache 216, and L3 cache 218 is not required. - In FIG. 2, in one embodiment,
prefetch unit 222 is co-located on the same semiconductor chip 214 with processor 210 and L1 cache 212. In other embodiments, prefetch unit 222 is co-located on the same semiconductor chip 214 with processor 210. Although prefetch unit 222 is illustrated as physically separate from processor 210, logically it operates as part of processor 210. In one embodiment, prefetch unit 222 operates, or runs, at a faster clock speed than processor 210. A clock input (not shown) can be externally supplied to prefetch unit 222 or internally generated by prefetch unit 222. - In the present illustration, the instruction stream of
computer system 200 flows from main memory 220 to L3 cache 218, to L2 cache 216, and to L1 cache 212. From L1 cache 212, the instruction stream is sent to both prefetch unit 222 and processor 210. Prefetch unit 222 requests an instruction or operand from L1 cache 212, e.g., by the address of the instruction or operand. If L1 cache 212 does not have the requested instruction or operand, L1 cache 212 obtains the requested instruction or operand from L2 cache 216, L3 cache 218, or main memory 220 in advance of processor 210 needing the instruction or operand. - FIG. 3 illustrates a block diagram of
prefetch unit 222 of FIG. 2 according to one embodiment of the present invention. As illustrated in FIG. 3, in one embodiment, prefetch unit 222 includes: 1) a first level (L1) cache interface 334; 2) an instruction decoder 330; 3) a program counter 336; 4) an arithmetic logic unit (ALU) 332; and 5) a branch prediction logic unit 338. -
L1 cache interface 334 is utilized for sending requests for instructions and operands to L1 cache 212 (FIG. 2) and for receiving instructions and operands from L1 cache 212 (FIG. 2), such as instructions in an instruction stream, or requested instructions and operands. Instruction decoder 330 is utilized for decoding the instruction and for determining any operands needed by the instruction. Program counter 336 is utilized to keep track of where prefetch unit 222 is in the instruction stream and stores the current location. ALU 332 is utilized for calculating addresses and performing other mathematical operations. Branch prediction logic unit 338 is utilized for selecting an instruction branch of a conditional branch instruction. Prefetch unit 222 is further described herein with reference to FIG. 4. - FIG. 4 is a process flow diagram of a
process 400 implemented by prefetch unit 222 for prefetching information for use by processor 210 of FIG. 2 according to one embodiment of the present invention. Process 400 is automatically implemented by prefetch unit 222. Referring now to FIGS. 2, 3 and 4 together, according to process 400, in one embodiment, at operation 402, when a new program starts, prefetch unit 222 requests a first instruction from L1 cache 212. This request is made through L1 cache interface 334. If the instruction is stored in L1 cache 212, L1 cache 212 sends the requested instruction to prefetch unit 222 via L1 cache interface 334. - If
L1 cache 212 does not have the instruction, L1 cache 212 obtains the instruction from a higher level cache, such as L2 cache 216 or L3 cache 218, or from main memory 220. When L1 cache 212 obtains the requested instruction, L1 cache 212 automatically stores the instruction and sends the requested instruction to prefetch unit 222 via L1 cache interface 334. - At
operation 404, prefetch unit 222 receives the requested instruction from L1 cache 212 via L1 cache interface 334. Instruction decoder 330 of prefetch unit 222 receives the requested instruction from L1 cache interface 334, and program counter 336 stores the current location of the instruction. - At
operation 406, upon receipt of the instruction, instruction decoder 330 decodes the instruction. Generally, instruction decoder 330 receives a bit pattern and determines what type of instruction has been received. Instruction decoding is well known to those of skill in the art and is not further described herein. At operation 408, instruction decoder 330 of prefetch unit 222 determines whether the instruction requires operands. Instruction decoder 330 determines from the bit pattern what operands, if any, are required for the instruction. This operation can also involve ALU 332 if mathematical operations are required. - If the instruction requires operands, at
operation 410, prefetch unit 222 requests the operands from L1 cache 212 via L1 cache interface 334. If L1 cache 212 does not have the operands, L1 cache 212 retrieves the operands from a higher level cache, such as L2 cache 216 or L3 cache 218, or from main memory 220. The retrieved operands are automatically stored in L1 cache 212 and sent to prefetch unit 222. Prefetch unit 222 may or may not act on the operands, depending upon whether the operands are needed by prefetch unit 222, such as for address calculation or branch prediction. - At operation 416,
prefetch unit 222 calculates a next instruction to be fetched and returns to operation 402. Prefetch unit 222 holds the address of the current instruction in program counter 336, so calculation of the next address is made from the current instruction address held in program counter 336. The next instruction may be the next instruction in the instruction stream, or it may be an instruction in a different instruction branch. Execution of different instruction branches by prefetch unit 222 is further described herein with reference to a conditional branch instruction. - Frequently, processors receive a set of instructions that contains one or more conditional branch instructions. A conditional branch instruction is a program instruction that directs the computer system, e.g.,
computer system 200, to jump to another location in the program if a specified condition is met. This other location in the program is termed a conditional instruction branch or, simply, an instruction branch. - As
prefetch unit 222 runs ahead of processor 210 in the instruction stream (due to the faster clock speed), prefetch unit 222 may not have the information necessary to determine which instruction branch to choose. For example, at operation 416, prefetch unit 222 may not have the information necessary to calculate whether the condition is met, because the needed information is a variable or a calculated value supplied by another operation or component of computer system 200. Thus, prefetch unit 222 needs some technique for choosing an instruction branch. In one embodiment, prefetch unit 222 utilizes a process termed branch prediction to select an instruction branch and execute the instructions in the selected instruction branch.
- If the selection was correct, there were, typically, no disadvantages to the selection by the processor. However, if the selection was incorrect, the processor had to undo the actions taken under the incorrect speculation and then execute the correct instruction branch. Undo processes were complex processes and costly in terms of lost processing time.
- When
prefetch unit 222 receives a conditional branch instruction, prefetch unit 222 may not have the information necessary to determine if the condition is met. Consequently, in one embodiment, branch prediction logic unit 338 of prefetch unit 222 utilizes branch prediction to select an instruction branch. Branch prediction is well known to those of skill in the art and is not further described herein. - If
prefetch unit 222 selects the correct instruction branch, there are no disadvantages to having made the selection. However, if the selection is incorrect, prefetch unit 222 simply throws out the wrong instructions. As prefetch unit 222 receives the instruction stream in advance of processor 210 and prefetches the instructions and operands at a faster rate than processor 210, selecting the wrong instruction branch merely increases the cache miss probability. Consequently, the present invention avoids the costly undo processes of speculative execution by prior art processors and marks a significant improvement over the prior art. - In instances where a cache miss causes
prefetch unit 222 to stall, once prefetch unit 222 resumes operation, prefetch unit 222 can catch up to where processor 210 is in the instruction stream because, according to the invention, prefetch unit 222 is clocked faster and retrieves instructions and operands at a faster rate than processor 210. Thus, cache misses by prefetch unit 222 can cause prefetch unit 222 to stall, but most likely not processor 210. Consequently, it is very inexpensive in terms of processing time for prefetch unit 222 to engage in branch prediction rather than have processor 210 engage in speculative execution: unlike a processor, prefetch unit 222 does not have to undo an incorrect branch prediction, and stalls due to cache misses occur at prefetch unit 222 rather than at processor 210. - Unlike conditional branch instructions which can occur in an instruction stream that is being input to
prefetch unit 222, processor interrupts are unpredictable. Further, they must be attended to right away and cannot be delayed. While interrupts happen frequently, processors typically do not spend much of their total time attending to interrupts. When a processor receives an interrupt during a program that is currently being executed, the processor jumps to a separate set of instructions associated with the interrupt. When the processor is done executing the instructions associated with the interrupt, the processor returns to the program it was executing prior to the interrupt. - When an interrupt occurs in
computer system 200, prefetch unit 222 cannot prefetch interrupt code from the instruction stream in advance of processor 210, because the interrupt, and, therefore, what code to execute, is unpredictable. Thus, in one embodiment, during an interrupt, prefetch unit 222 stops operating in advance of processor 210 and enters a pause mode. In pause mode, prefetch unit 222 retrieves the next instruction for processor 210, but does not retrieve subsequent instructions far in advance of processor 210 receiving them. When the interrupt code is complete, prefetch unit 222 resumes normal operation, e.g., exits pause mode. - In order for
prefetch unit 222 to run efficiently, prefetch unit 222 calculates the addresses of data it may need. Address calculations are typically simple, but in some instances they are quite complex, too complex for prefetch unit 222. For example, array address calculations often require index calculations that can be complex, such as in the case of hashing algorithms. Thus, in designing different embodiments of the present invention, a tradeoff can be made in the design of prefetch unit 222 between speed and the ability to perform more complex calculations: a simple prefetch unit that runs faster but cannot handle very complex calculations, or a more complex prefetch unit that runs slower but can handle more complex calculations. - By limiting
prefetch unit 222 functions to operations that are relatively simple, prefetch unit 222 can be designed to run very fast, for example, in one embodiment, 2-10 times faster than the processor speed of processor 210. In embodiments in which the functions of prefetch unit 222 are limited, prefetch unit 222 retrieves instructions and operands, but does not operate on them extensively. The operations prefetch unit 222 does perform in these embodiments, such as calculating addresses, are simple and fast, allowing prefetch unit 222 to retrieve instructions and data far in advance of processor 210 needing the information. - In instances where
prefetch unit 222 cannot act, such as when address calculations are too complex, in one embodiment prefetch unit 222 enters the pause mode. In these instances, processor 210 performs the complex calculations, and prefetch unit 222 fetches instructions and operands as they are needed. When the complex calculations are complete, prefetch unit 222 resumes normal operation, e.g., exits pause mode. - As shown above, according to the present invention, and unlike the prior art, the methods and devices of the present invention prefetch information, such as instructions and operands, in advance of a processor, so that the needed instructions and operands are present in the L1 cache when the processor needs them. The present invention separates the prior art processor stage of retrieving an instruction (or operand) from the other processor stages, and runs it at a faster clock speed than that of the processor. By prefetching information, the prefetch unit improves processor performance by reducing the number of cache misses and by reducing memory latency.
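The pause-mode behavior described above, entered both for interrupts and for calculations too complex for the prefetch unit, can be sketched as a two-state machine. The state names, event names, and lookahead depth here are invented for illustration:

```python
# Two-state sketch of pause mode: while paused, the prefetch unit fetches
# only the processor's next instruction; while running, it works several
# instructions ahead. Events and the depth of 8 are invented values.
RUNNING, PAUSED = "running", "paused"

def next_state(state, event):
    if event in ("interrupt_start", "too_complex"):
        return PAUSED          # stop running ahead of the processor
    if event in ("interrupt_done", "calculation_done"):
        return RUNNING         # resume normal (ahead-of-processor) operation
    return state

def lookahead_depth(state, normal_depth=8):
    # In pause mode, retrieve only the next instruction; otherwise run ahead.
    return 1 if state == PAUSED else normal_depth
```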
- Additionally, branch prediction of conditional branch instructions is performed without the need to undo incorrect instruction branch selections, as is required with speculative execution by prior art processors, further improving processor performance.
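The cost asymmetry described above can be made concrete with a toy model. Both cycle counts below are invented for illustration; the point is only the relative ordering:

```python
# Invented cycle costs: a wrong guess by the prefetch unit costs at most an
# extra cache miss on the correct path, while a wrong guess under speculative
# execution additionally requires undoing the wrongly executed work.
MISS_PENALTY = 10   # extra cycles when a needed line is not yet in L1
UNDO_PENALTY = 50   # cycles to roll back wrongly executed instructions

def prefetch_mispredict_cost():
    # Wrong-path lines are simply thrown away; only miss latency is risked.
    return MISS_PENALTY

def speculative_mispredict_cost():
    # The pipeline must be flushed and its effects undone, then refetched.
    return MISS_PENALTY + UNDO_PENALTY
```

Under any positive undo penalty, the prefetch-side misprediction is the cheaper of the two, which is the argument the description above makes in prose.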
- The foregoing description of an implementation of the invention has been presented for purposes of illustration and description only, and therefore is not exhaustive and does not limit the invention to the precise form disclosed. Modifications and variations are possible in light of the above teachings or can be acquired from practicing the invention.
- Consequently, the scope of protection is not limited to the specific embodiments which are shown for illustrative purposes.
Claims (20)
- A device for fetching information for a computer processor, the computer processor having a first clock speed, the device comprising: a first level cache interface; an instruction decoder coupled with the first level cache interface; a program counter coupled with the instruction decoder and the first level cache interface; an arithmetic logic unit coupled with the instruction decoder; and a branch prediction logic unit coupled with the instruction decoder, wherein the device is arranged to operate at a second clock speed, the second clock speed being faster than the first clock speed.
- The device of Claim 1, wherein the device and the computer processor are co-located on a same semiconductor chip.
- A fetch unit comprising: a first level cache interface arranged for receiving instructions and operands from a first level cache and for sending requests for instructions and operands to the first level cache; an instruction decoder coupled with the first level cache interface and arranged for decoding at least one instruction and for determining any operands needed by the at least one instruction; a program counter coupled with the first level cache interface and the instruction decoder, and arranged for storing a location of the at least one instruction; an arithmetic logic unit coupled with the instruction decoder and arranged for calculating addresses and other mathematical operations; and a branch prediction logic unit coupled with the instruction decoder and arranged for selecting an instruction branch of a conditional branch instruction.
- The unit of Claim 3, and arranged to prefetch the instruction and operands for a computer processor arranged to operate at a first clock speed, the unit being arranged to operate at a second clock speed, which is faster than the first clock speed.
- A device for fetching information for a computer processor arranged to operate in accordance with a first clock speed, the device comprising: a first level cache interface for requesting and receiving instructions and operands from a first level cache; an instruction decoder coupled with the first level cache interface, and arranged for decoding a received instruction and for determining whether or not one or more operands are required by the instruction; a program counter coupled with the first level cache interface and the instruction decoder, and arranged for storing a location of the received instruction; an arithmetic logic unit coupled with the instruction decoder, and arranged for calculating addresses of instructions and operands and other mathematical operations; and a branch prediction logic unit coupled with the instruction decoder, and arranged for selecting an instruction branch of a conditional branch instruction, wherein the device is arranged to operate at a second clock speed, the second clock speed being faster than the first clock speed.
- The device of Claim 5, and arranged such that if the received instruction is a conditional branch instruction, the branch prediction logic unit selects an instruction branch of the conditional branch instruction using branch prediction.
- A device for fetching information for a computer processor having a first clock speed, the device comprising: means for requesting an instruction from a first level cache; means for receiving the instruction from the first level cache; means for decoding the instruction; means for determining whether or not one or more operands are required by the instruction; means for requesting the one or more operands from the first level cache if one or more operands are required by the instruction; means for receiving the one or more operands, if any, from the first level cache; and means for calculating a next instruction.
- The device of Claim 7, wherein the device operates at a second clock speed, the second clock speed being faster than the first clock speed.
- The device of Claim 7 or 8, wherein means are provided for selecting an instruction branch of a conditional branch instruction using branch prediction if the instruction is a conditional branch instruction.
- A computer system comprising: a processor arranged to operate at a first clock speed; a fetch unit coupled with the processor and arranged to operate at a second clock speed, the second clock speed being faster than the first clock speed; a first level cache coupled with the processor and the fetch unit; and a main memory communicatively coupled with the first level cache.
- A computer system of Claim 10, wherein the fetch unit further comprises: a first level cache interface; an instruction decoder coupled with the first level cache interface; a program counter coupled with the instruction decoder and the first level cache interface; an arithmetic logic unit coupled with the instruction decoder; and a branch prediction logic unit coupled with the instruction decoder.
- A computer system of Claim 10 or 11, and further comprising: one or more higher level caches.
- A computer system of Claim 10, 11 or 12, wherein the fetch unit is external to the processor.
- A computer system of Claim 10, 11 or 12, wherein the fetch unit is internal to the processor.
- A computer system of Claim 10, 11, 12, 13 or 14, wherein the fetch unit and the processor are co-located on the same semiconductor chip.
- A method for fetching information for a computer processor having a first clock speed, the method comprising: requesting an instruction from a first level cache; receiving the instruction from the first level cache; decoding the instruction; determining whether or not one or more operands are required by the instruction; if one or more operands are required by the instruction, requesting the one or more operands from the first level cache; receiving the one or more operands from the first level cache; and calculating a next instruction.
- The method of Claim 16, wherein the method is performed at a second clock speed, the second clock speed being faster than the first clock speed.
- The method of Claim 16 or 17, wherein the instruction and the one or more operands, if any, are automatically stored in the first level cache.
- The method of Claim 16, 17 or 18, further comprising: if the instruction is a conditional branch instruction, selecting an instruction branch of the conditional branch instruction using branch prediction.
- A method for fetching information for a computer processor having a first clock speed, the method comprising: requesting an instruction from a first level cache, the first level cache automatically storing the instruction; receiving the instruction from the first level cache; decoding the instruction; determining whether or not one or more operands are required by the instruction; if one or more operands are required by the instruction, requesting the one or more operands from the first level cache, the first level cache automatically storing the one or more operands; receiving the one or more operands, if any, from the first level cache; and calculating a next instruction, wherein the method is performed at a second clock speed, the second clock speed being faster than the first clock speed.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/393,890 US20040186960A1 (en) | 2003-03-20 | 2003-03-20 | Computer processor data prefetch unit |
US393890 | 2003-03-20 |
Publications (2)
Publication Number | Publication Date |
---|---|
EP1460532A2 true EP1460532A2 (en) | 2004-09-22 |
EP1460532A3 EP1460532A3 (en) | 2006-06-21 |
Family
ID=32824915
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP04251079A Withdrawn EP1460532A3 (en) | 2003-03-20 | 2004-02-26 | Computer processor data fetch unit and related method |
Country Status (2)
Country | Link |
---|---|
US (1) | US20040186960A1 (en) |
EP (1) | EP1460532A3 (en) |
Families Citing this family (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7194582B1 (en) * | 2003-05-30 | 2007-03-20 | Mips Technologies, Inc. | Microprocessor with improved data stream prefetching |
US7177985B1 (en) * | 2003-05-30 | 2007-02-13 | Mips Technologies, Inc. | Microprocessor with improved data stream prefetching |
US7260686B2 (en) * | 2004-08-17 | 2007-08-21 | Nvidia Corporation | System, apparatus and method for performing look-ahead lookup on predictive information in a cache memory |
US8874885B2 (en) * | 2008-02-12 | 2014-10-28 | International Business Machines Corporation | Mitigating lookahead branch prediction latency by purposely stalling a branch instruction until a delayed branch prediction is received or a timeout occurs |
US10963255B2 (en) | 2013-07-15 | 2021-03-30 | Texas Instruments Incorporated | Implied fence on stream open |
US9015422B2 (en) * | 2013-07-16 | 2015-04-21 | Apple Inc. | Access map-pattern match based prefetch unit for a processor |
US9710271B2 (en) | 2014-06-30 | 2017-07-18 | International Business Machines Corporation | Collecting transactional execution characteristics during transactional execution |
US9336047B2 (en) | 2014-06-30 | 2016-05-10 | International Business Machines Corporation | Prefetching of discontiguous storage locations in anticipation of transactional execution |
US9348643B2 (en) | 2014-06-30 | 2016-05-24 | International Business Machines Corporation | Prefetching of discontiguous storage locations as part of transactional execution |
US9600286B2 (en) | 2014-06-30 | 2017-03-21 | International Business Machines Corporation | Latent modification instruction for transactional execution |
US9448939B2 (en) | 2014-06-30 | 2016-09-20 | International Business Machines Corporation | Collecting memory operand access characteristics during transactional execution |
US11334355B2 (en) | 2017-05-04 | 2022-05-17 | Futurewei Technologies, Inc. | Main processor prefetching operands for coprocessor operations |
US12135876B2 (en) | 2018-02-05 | 2024-11-05 | Micron Technology, Inc. | Memory systems having controllers embedded in packages of integrated circuit memory |
US11416395B2 (en) | 2018-02-05 | 2022-08-16 | Micron Technology, Inc. | Memory virtualization for accessing heterogeneous memory components |
US10782908B2 (en) | 2018-02-05 | 2020-09-22 | Micron Technology, Inc. | Predictive data orchestration in multi-tier memory systems |
US11099789B2 (en) | 2018-02-05 | 2021-08-24 | Micron Technology, Inc. | Remote direct memory access in multi-tier memory systems |
US10880401B2 (en) * | 2018-02-12 | 2020-12-29 | Micron Technology, Inc. | Optimization of data access and communication in memory systems |
US10877892B2 (en) | 2018-07-11 | 2020-12-29 | Micron Technology, Inc. | Predictive paging to accelerate memory access |
US10852949B2 (en) | 2019-04-15 | 2020-12-01 | Micron Technology, Inc. | Predictive data pre-fetching in a data storage device |
US11755333B2 (en) | 2021-09-23 | 2023-09-12 | Apple Inc. | Coprocessor prefetcher |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5835951A (en) * | 1994-10-18 | 1998-11-10 | National Semiconductor | Branch processing unit with target cache read prioritization protocol for handling multiple hits |
JP3000961B2 (en) * | 1997-06-06 | 2000-01-17 | 日本電気株式会社 | Semiconductor integrated circuit |
US6317810B1 (en) * | 1997-06-25 | 2001-11-13 | Sun Microsystems, Inc. | Microprocessor having a prefetch cache |
US6212603B1 (en) * | 1998-04-09 | 2001-04-03 | Institute For The Development Of Emerging Architectures, L.L.C. | Processor with apparatus for tracking prefetch and demand fetch instructions serviced by cache memory |
US7243204B2 (en) * | 2003-11-25 | 2007-07-10 | International Business Machines Corporation | Reducing bus width by data compaction |
- 2003
  - 2003-03-20 US US10/393,890 patent/US20040186960A1/en not_active Abandoned
- 2004
  - 2004-02-26 EP EP04251079A patent/EP1460532A3/en not_active Withdrawn
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0478132A1 (en) * | 1990-09-28 | 1992-04-01 | Tandem Computers Incorporated | Multiple-clocked synchronous processor unit |
WO1998021659A1 (en) * | 1996-11-13 | 1998-05-22 | Intel Corporation | Data cache with data storage and tag logic with different clocks |
US20020095563A1 (en) * | 2000-09-08 | 2002-07-18 | Shailender Chaudhry | Method and apparatus for using an assist processor to prefetch instructions for a primary processor |
Non-Patent Citations (2)
Title |
---|
AUSTIN T M ET AL: "STREAMLINING DATA CACHE ACCESS WITH FAST ADDRESS CALCULATION" COMPUTER ARCHITECTURE NEWS, ACM, NEW YORK, NY, US, vol. 23, no. 2, 1 May 1995 (1995-05-01), pages 369-380, XP000525188 ISSN: 0163-5964 * |
CRAGO: "Reducing the Traffic of Loop-Based Programs Using a Prefetch Processor" [Online] 25 March 1997 (1997-03-25), UNIV. OF SOUTHERN CALIFORNIA, XP002377825 Retrieved from the Internet: URL:http://scholar.google.com/url?sa=U&q=http://www.east.isi.edu/~crago/tr-96-09.ps> [retrieved on 2006-04-10] * paragraph [03.1] * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8639886B2 (en) | 2009-02-03 | 2014-01-28 | International Business Machines Corporation | Store-to-load forwarding mechanism for processor runahead mode operation |
GB2548871A (en) * | 2016-03-31 | 2017-10-04 | Advanced Risc Mach Ltd | Instruction prefetching |
GB2548871B (en) * | 2016-03-31 | 2019-02-06 | Advanced Risc Mach Ltd | Instruction prefetching |
US10620953B2 (en) | 2016-03-31 | 2020-04-14 | Arm Limited | Instruction prefetch halting upon predecoding predetermined instruction types |
Also Published As
Publication number | Publication date |
---|---|
EP1460532A3 (en) | 2006-06-21 |
US20040186960A1 (en) | 2004-09-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP1460532A2 (en) | Computer processor data fetch unit and related method | |
US5649138A (en) | Time dependent rerouting of instructions in plurality of reservation stations of a superscalar microprocessor | |
JP5410281B2 (en) | Method and apparatus for prefetching non-sequential instruction addresses | |
US6523110B1 (en) | Decoupled fetch-execute engine with static branch prediction support | |
US6185676B1 (en) | Method and apparatus for performing early branch prediction in a microprocessor | |
EP2330500B1 (en) | System and method for using a branch mis-prediction buffer | |
EP1889152B1 (en) | A method and apparatus for predicting branch instructions | |
EP1089170A2 (en) | Processor architecture and method of processing branch control instructions | |
US20070288736A1 (en) | Local and Global Branch Prediction Information Storage | |
JP2001142705A (en) | Processor and microprocessor | |
WO2000014628A1 (en) | A method and apparatus for branch prediction using a second level branch prediction table | |
US20070260853A1 (en) | Switching processor threads during long latencies | |
US8943301B2 (en) | Storing branch information in an address table of a processor | |
US6735687B1 (en) | Multithreaded microprocessor with asymmetrical central processing units | |
US20090204791A1 (en) | Compound Instruction Group Formation and Execution | |
JP5335440B2 (en) | Early conditional selection of operands | |
US6446143B1 (en) | Methods and apparatus for minimizing the impact of excessive instruction retrieval | |
US20050216713A1 (en) | Instruction text controlled selectively stated branches for prediction via a branch target buffer | |
JP2009524167A5 (en) | ||
US20040225866A1 (en) | Branch prediction in a data processing system | |
US20200150967A1 (en) | Misprediction of predicted taken branches in a data processing apparatus | |
US9395985B2 (en) | Efficient central processing unit (CPU) return address and instruction cache | |
US5829031A (en) | Microprocessor configured to detect a group of instructions and to perform a specific function upon detection | |
US7908463B2 (en) | Immediate and displacement extraction and decode mechanism | |
JP2001060152A (en) | Information processor and information processing method capable of suppressing branch prediction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
AK | Designated contracting states |
Kind code of ref document: A2 Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LI LU MC NL PT RO SE SI SK TR |
|
AX | Request for extension of the european patent |
Extension state: AL LT LV MK |
|
PUAL | Search report despatched |
Free format text: ORIGINAL CODE: 0009013 |
|
AK | Designated contracting states |
Kind code of ref document: A3 Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LI LU MC NL PT RO SE SI SK TR |
|
AX | Request for extension of the european patent |
Extension state: AL LT LV MK |
|
AKX | Designation fees paid |
Designated state(s): DE FR GB |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
|
18D | Application deemed to be withdrawn |
Effective date: 20061222 |