WO2000033315A1 - Apparatus and method for optimizing die utilization and speed performance by register file splitting - Google Patents
- Publication number
- WO2000033315A1 (PCT/US1999/028467)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- register file
- storage array
- ports
- storages
- write
Classifications
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11C—STATIC STORES
- G11C8/00—Arrangements for selecting an address in a digital store
- G11C8/16—Multiple access memory array, e.g. addressing one storage element via at least two independent addressing line groups
Definitions
- the present invention relates to storage or memory in a processor. More specifically, the present invention relates to a multiple-port storage array.
- a processor includes storage or memory to store program data and instructions.
- the memory storage includes cells for storing information and lines for accessing the cells according to defined address locations.
- the information is arranged in words that contain a plurality of cells.
- the cells in a word are connected by word lines.
- the cells in a plurality of words that are located at corresponding positions in the words are connected by bit lines.
- a particular address in the memory is accessed by applying address signals to decoding circuitry called an address port.
- the address port sends an address select signal to a word line at the selected location in the memory array.
- when the address select signal matches the address of a word in memory, data is transferred from or to the individual memory cells at the specified address. Data of each cell is transferred on the associated bit line.
- for arrays having more than one address port, called multi-port arrays, more than one address may be decoded and more than one data transfer made during a single read/write cycle.
- a multi-port memory array has several common bit lines for each memory cell in the array.
- a register file is one type of memory array.
- a word line is associated with each address in a memory array or each register in a register file.
- a separate word line is used at each address to control each of the separate read bit lines and each of the separate write bit lines.
- Each of the separate word lines is connected to an address port. Since for every cell in an array the number of bit lines may be equal to the number of word lines or an integer multiplier of the number of word lines and the number of word lines for each address is equal to the number of ports in the array, the size of the multi-port memory array increases as a square of the number of ports to the array.
- an address is applied to a port and decoded, forming an address signal that is sent via the word line associated with the port to the decoded address location.
- the address signal on the word line causes the contents of the memory cells at the selected address to be written if the address is applied to a write port or read if the address is applied to a read port.
- Data is transferred to or from the memory cell via write bit lines and read bit lines, respectively.
- Each of the read bit lines and write bit lines is associated with a separate word line (port).
- the processor performs a plurality of read operations up to the total number of read ports and a plurality of write operations up to the total number of write ports.
- the read addresses and write addresses may be different or the same. Because more than one read operation may be made from a particular memory address during one read/write cycle, the maximum amount of current applied to the memory cell is determined by the number of read ports in the array.
- Each memory cell is associated with a word line, a bit line, and pass transistors, resulting in a size or pitch of the memory array that is relatively large.
- the pitch size of the individual cells corresponds to a large overall size of the memory array and usage of a large percentage of the area on an integrated circuit die.
- the large area of the circuit results in a reduced manufacturing yield and increased fabrication cost of the circuit.
- the relatively large size of the memory array lengthens the average access time of data in the memory array in several aspects. First, a larger overall size in a memory array results in longer word lines and bit lines, lengthening the time for a signal to pass along the line.
- the pass transistors, word line, and bit line associated with a cell increase the capacitive loading on the cell, reducing the capability of the finite charge stored in each cell to drive a selected differential bit line pair.
- a multi-ported register file is typically metal-limited, so that the area consumed by the circuit is proportional to the square of the number of ports. It has been discovered that a processor having a register file structure divided into a plurality of separate and independent register files forms a layout structure with an improved layout efficiency. The read ports of the total register file structure are allocated among the separate and individual register files. Each of the separate and individual register files has write ports that correspond to the total number of write ports in the total register file structure. Writes are fully broadcast so that all of the separate and individual register files are coherent.
- a 16-port register file structure with twelve read ports and four write ports is split into four separate and individual 7-port register files, each with three read ports and four write ports.
- the area of a single 16-port register file would have a size proportional to 16 times 16 or 256.
- Each of the separate and individual register files has a size proportional to 7 times 7 or 49 for a total of 4 times 49 or 196.
- the capacity of a single 16-port register file and the four 7-port register files is identical, with the split register file structure advantageously having a significantly reduced area.
- the reduced area advantageously corresponds to an improvement in access time of a register file and thus speed performance due to a reduction in the length of word lines and bit lines connecting the array cells that reduces the time for a signal to pass on the lines.
- the improvement in speed performance is highly advantageous due to strict time budgets that are imposed by the specification of high-performance processors and also to attain a large capacity register file that is operational at high speed.
- a 17-port register file structure includes twelve read ports and five write ports. Each of the separate and individual register files has 5 write ports. The area of a single 17-port register file would have a size proportional to 17 times 17 or 289. Each of the separate and individual register files has a size proportional to 8 times 8 or 64 for a total of 4 times 64 or 256.
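The area comparison above can be checked with a short calculation. The following is a minimal sketch, assuming only the port-squared area model stated in the text; the function names are hypothetical and chosen for illustration.

```python
def port_squared_area(read_ports: int, write_ports: int) -> int:
    """Relative area of one multi-port array, modeled as (total ports) squared."""
    total_ports = read_ports + write_ports
    return total_ports * total_ports


def split_area(read_ports: int, write_ports: int, segments: int) -> int:
    """Relative area when read ports are divided evenly among the segments.

    Every segment keeps all write ports, because writes are broadcast.
    """
    if segments <= 0 or read_ports % segments != 0:
        raise ValueError("read ports must divide evenly among the segments")
    per_segment = port_squared_area(read_ports // segments, write_ports)
    return segments * per_segment


# The two configurations discussed in the text: 12R/4W and 12R/5W, four segments.
for reads, writes in [(12, 4), (12, 5)]:
    single = port_squared_area(reads, writes)
    split = split_area(reads, writes, 4)
    print(f"{reads}R/{writes}W: single={single}, split into 4={split}")
```

The saving shrinks as write ports are added, since each segment carries every write port.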
- a storage array structure for a processor having R read ports and W write ports includes a plurality of storage array storages.
- the storage array storages have a reduced number of read ports allocated from the R read ports so that the total number of read ports for the plurality of storage array storages is R.
- the storage array storages each have W write ports.
- a register file structure for a processor having R read ports and W write ports includes a plurality of register file storages.
- the register file storages have a reduced number of read ports allocated from the R read ports so that the total number of read ports for the plurality of register file storages is R.
- the register file storages each have W write ports.
- a processor in accordance with another embodiment of the present invention, includes an instruction supplying circuit and a plurality of functional units.
- the processor includes a register file structure coupled to the instruction supplying circuit and coupled to the plurality of functional units.
- the register file structure has R read ports and W write ports and includes a plurality of register file storages.
- the register file storages have a reduced number of read ports allocated from the R read ports so that the total number of read ports for the plurality of register file storages is R.
- the register file storages each have W write ports.
- FIGURE 1 is a schematic block diagram illustrating a single integrated circuit chip implementation of a processor in accordance with an embodiment of the present invention.
- FIGURE 2 is a schematic block diagram showing the core of the processor.
- FIGURE 3 is a schematic block diagram that illustrates an embodiment of the split register file that is suitable for usage in the processor.
- FIGURE 4 is a schematic block diagram that shows a logical view of the register file and functional units in the processor.
- FIGURE 5 is a pictorial schematic diagram depicting an example of instruction execution among a plurality of media functional units.
- FIGURE 6 illustrates a schematic block diagram of an SRAM array used for the multi-port split register file.
- FIGURES 7A and 7B are, respectively, a schematic block diagram and a pictorial diagram that illustrate the register file and a memory array insert of the register file.
- FIGURE 8 is a schematic block diagram showing an arrangement of the register file into the four register file segments.
- FIGURE 9 is a schematic timing diagram that illustrates timing of the processor pipeline.
- Referring to FIGURE 1, a schematic block diagram illustrates a single integrated circuit chip implementation of a processor 100 that includes a memory interface 102, a geometry decompressor 104, two media processing units 110 and 112, a shared data cache 106, and several interface controllers.
- the interface controllers support an interactive graphics environment with real-time constraints by integrating fundamental components of memory, graphics, and input/output bridge functionality on a single die.
- the components are mutually linked and closely linked to the processor core with high bandwidth, low-latency communication channels to manage multiple high-bandwidth data streams efficiently and with a low response time.
- the interface controllers include an UltraPort Architecture Interconnect (UPA) controller 116 and a peripheral component interconnect (PCI) controller 120.
- the illustrative memory interface 102 is a direct Rambus dynamic RAM (DRDRAM) controller.
- the shared data cache 106 is a dual-ported storage that is shared among the media processing units 110 and 112 with one port allocated to each media processing unit.
- the data cache 106 is four-way set associative, follows a write-back protocol, and supports hits in the fill buffer (not shown).
- the data cache 106 allows fast data sharing and eliminates the need for a complex, error-prone cache coherency protocol between the media processing units 110 and 112.
- the UPA controller 116 is a custom interface that attains a suitable balance between high-performance computational and graphic subsystems.
- the UPA is a cache-coherent, processor-memory interconnect.
- the UPA attains several advantageous characteristics including a scaleable bandwidth through support of multiple bused interconnects for data and addresses, packets that are switched for improved bus utilization, higher bandwidth, and precise interrupt processing.
- the UPA performs low latency memory accesses with high throughput paths to memory.
- the UPA includes a buffered cross-bar memory interface for increased bandwidth and improved scaleability.
- the UPA supports high-performance graphics with two-cycle single-word writes on the 64-bit UPA interconnect.
- the UPA interconnect architecture utilizes point-to-point packet switched messages from a centralized system controller to maintain cache coherence. Packet switching improves bus bandwidth utilization by removing the latencies commonly associated with transaction-based designs.
- the PCI controller 120 is used as the primary system I/O interface for connecting standard, high-volume, low-cost peripheral devices, although other standard interfaces may also be used.
- the PCI bus effectively transfers data among high bandwidth peripherals and low bandwidth peripherals, such as CD-ROM players, DVD players, and digital cameras.
- Two media processing units 110 and 112 are included in a single integrated circuit chip to support an execution environment exploiting thread level parallelism in which two independent threads can execute simultaneously.
- the threads may arise from any sources such as the same application, different applications, the operating system, or the runtime environment.
- Parallelism is exploited at the thread level since parallelism is rare beyond four, or even two, instructions per cycle in general purpose code.
- the illustrative processor 100 is an eight-wide machine with eight execution units for executing instructions.
- a typical "general-purpose" processing code has an instruction level parallelism of about two so that, on average, most (about six) of the eight execution units would be idle at any time.
- the illustrative processor 100 employs thread level parallelism and operates on two independent threads, possibly attaining twice the performance of a processor having the same resources and clock rate but utilizing traditional non-thread parallelism.
- Thread level parallelism is particularly useful for Java™ applications which are bound to have multiple threads of execution.
- Java methods including “suspend”, “resume”, “sleep”, and the like include effective support for threaded program code.
- Java™ class libraries are thread-safe to promote parallelism.
- Sun, Sun Microsystems and the Sun logo are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. All SPARC trademarks, including UltraSPARC I and UltraSPARC II, are used under license and are trademarks of SPARC International, Inc. in the United States and other countries.
- the thread model of the processor 100 supports a dynamic compiler which runs as a separate thread using one media processing unit 110 while the second media processing unit 112 is used by the current application.
- the compiler applies optimizations based on "on-the-fly" profile feedback information while dynamically modifying the executing code to improve execution on each subsequent run. For example, a "garbage collector" may be executed on a first media processing unit 110, copying objects or gathering pointer information, while the application is executing on the other media processing unit 112.
- although the processor 100 shown in FIGURE 1 includes two processing units on an integrated circuit chip, the architecture is highly scaleable so that one to several closely-coupled processors may be formed in a message-based coherent architecture and resident on the same die to process multiple threads of execution.
- in the processor 100, a limitation on the number of processors formed on a single die thus arises from capacity constraints of integrated circuit technology rather than from architectural constraints relating to the interactions and interconnections between processors.
- the media processing units 110 and 112 each include an instruction cache 210, an instruction aligner 212, an instruction buffer 214, a pipeline control unit 226, a split register file 216, a plurality of execution units, and a load/store unit 218.
- the media processing units 110 and 112 use a plurality of execution units for executing instructions.
- the execution units for a media processing unit 110 include three media functional units (MFU) 220 and one general functional unit (GFU) 222.
- the media functional units 220 are multiple single-instruction-multiple-datapath (MSIMD) media functional units. Each of the media functional units 220 is capable of processing parallel 16-bit components.
- the media functional units 220 operate in combination as tightly-coupled digital signal processors (DSPs). Each media functional unit 220 has a separate and individual sub-instruction stream, but all three media functional units 220 execute synchronously so that the subinstructions progress lock-step through pipeline stages.
- the general functional unit 222 is a RISC processor capable of executing arithmetic logic unit (ALU) operations, loads and stores, branches, and various specialized and esoteric functions such as parallel power operations, reciprocal square root operations, and many others.
- the general functional unit 222 supports less common parallel operations such as the parallel reciprocal square root instruction.
- the illustrative instruction cache 210 has a 16 Kbyte capacity and includes hardware support to maintain coherence, allowing dynamic optimizations through self-modifying code.
- Software is used to indicate that the instruction storage is being modified when modifications occur.
- the 16K capacity is suitable for performing graphic loops, other multimedia tasks or processes, and general-purpose Java™ code.
- Coherency is maintained by hardware that supports write-through, non-allocating caching.
- Self-modifying code is supported through explicit use of "store-to-instruction-space" instructions (store2i).
- Software uses the store2i instruction to maintain coherency with the instruction cache 210 so that the instruction caches 210 do not have to be snooped on every single store operation issued by the media processing unit 110.
- the pipeline control unit 226 is connected between the instruction buffer 214 and the functional units and schedules the transfer of instructions to the functional units.
- the pipeline control unit 226 also receives status signals from the functional units and the load/store unit 218 and uses the status signals to perform several control functions.
- the pipeline control unit 226 maintains a scoreboard and generates stalls and bypass controls.
- the pipeline control unit 226 also generates traps and maintains special registers.
- Each media processing unit 110 and 112 includes a split register file 216, a single logical register file including 128 thirty-two bit registers.
- the split register file 216 is split into a plurality of register file segments 224 to form a multi-ported structure that is replicated to reduce the integrated circuit die area and to reduce access time.
- a separate register file segment 224 is allocated to each of the media functional units 220 and the general functional unit 222.
- each register file segment 224 has 128 32-bit registers.
- the first 96 registers (0-95) in the register file segment 224 are global registers. All functional units can write to the 96 global registers.
- the global registers are coherent across all functional units (MFU and GFU) so that any write operation to a global register by any functional unit is broadcast to all register file segments 224.
- Registers 96-127 in the register file segments 224 are local registers. Local registers allocated to a functional unit are not accessible or "visible" to other functional units.
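The global/local register split described above can be modeled behaviorally. The following is a minimal sketch, assuming four segments (one per functional unit) and 32-bit values as stated in the text; the class and method names are hypothetical, and the model ignores ports and timing entirely.

```python
GLOBAL_REGS = 96   # registers 0-95 are global (writes are broadcast)
TOTAL_REGS = 128   # registers 96-127 are local to one functional unit


class SplitRegisterFile:
    """Behavioral model: one full register copy per functional unit."""

    def __init__(self, num_segments: int = 4) -> None:
        self.segments = [[0] * TOTAL_REGS for _ in range(num_segments)]

    def _check(self, unit: int, reg: int) -> None:
        if not 0 <= unit < len(self.segments):
            raise IndexError("no such functional unit")
        if not 0 <= reg < TOTAL_REGS:
            raise IndexError("no such register")

    def write(self, unit: int, reg: int, value: int) -> None:
        self._check(unit, reg)
        value &= 0xFFFFFFFF  # registers are 32 bits wide
        if reg < GLOBAL_REGS:
            # Global write: broadcast so that every segment stays coherent.
            for segment in self.segments:
                segment[reg] = value
        else:
            # Local write: visible only to the writing unit's segment.
            self.segments[unit][reg] = value

    def read(self, unit: int, reg: int) -> int:
        # Each unit reads only from its own segment.
        self._check(unit, reg)
        return self.segments[unit][reg]
```

For example, after `rf.write(0, 5, 42)` every unit reads 42 from register 5, while `rf.write(1, 100, 7)` changes register 100 only as seen by unit 1.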
- the media processing units 110 and 112 are highly structured computation blocks that execute software-scheduled data computation operations with fixed, deterministic and relatively short instruction latencies, operational characteristics yielding simplification in both function and cycle time.
- the operational characteristics support multiple instruction issue through a pragmatic very large instruction word (VLIW) approach that avoids hardware interlocks to account for software that does not schedule operations properly.
- a VLIW instruction word always includes one instruction that executes in the general functional unit (GFU) 222 and from zero to three instructions that execute in the media functional units (MFU) 220.
- an MFU instruction field within the VLIW instruction word includes an operation code (opcode) field, three source register (or immediate) fields, and one destination register field.
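The packet structure described above can be sketched as a small data model. This is an illustration only: the type names are hypothetical, field widths and encodings are not given in the text, and the GFU instruction is assumed here to have the same shape as an MFU instruction.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Instruction:
    """One sub-instruction: opcode, three sources, one destination."""
    opcode: str
    sources: Tuple[int, int, int]  # register numbers or immediates
    dest: int                      # destination register number


@dataclass
class VliwPacket:
    """One GFU instruction plus zero to three MFU instructions."""
    gfu: Instruction
    mfu: List[Instruction] = field(default_factory=list)

    def __post_init__(self) -> None:
        if len(self.mfu) > 3:
            raise ValueError("a packet holds at most three MFU instructions")


# A packet with one GFU instruction and two MFU instructions.
packet = VliwPacket(
    gfu=Instruction("add", (1, 2, 0), 3),
    mfu=[
        Instruction("padd", (4, 5, 0), 6),
        Instruction("pmuladd", (7, 8, 9), 10),
    ],
)
```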
- Instructions are executed in-order in the processor 100 but loads can finish out-of-order with respect to other instructions and with respect to other loads, allowing loads to be moved up in the instruction stream so that data can be streamed from main memory.
- the execution model eliminates the usage and overhead resources of an instruction window, reservation stations, a re-order buffer, or other blocks for handling instruction ordering. Elimination of the instruction ordering structures and overhead resources is highly advantageous since the eliminated blocks typically consume a large portion of an integrated circuit die. For example, the eliminated blocks consume about 30% of the die area of a Pentium II processor.
- the media processing units 110 and 112 are high-performance but simplified with respect to both compilation and execution.
- the media processing units 110 and 112 are most generally classified as a simple 2-scalar execution engine with full bypassing and hardware interlocks on load operations.
- the instructions include loads, stores, arithmetic and logic (ALU) instructions, and branch instructions so that scheduling for the processor 100 is essentially equivalent to scheduling for a simple 2-scalar execution engine for each of the two media processing units 110 and 112.
- the processor 100 supports full bypasses between the first two execution units within the media processing units 110 and 112 and has a scoreboard in the general functional unit 222 for load operations so that the compiler does not need to handle nondeterministic latencies due to cache misses.
- the processor 100 scoreboards long latency operations that are executed in the general functional unit 222, for example a reciprocal square-root operation, to simplify scheduling across execution units.
- the scoreboard (not shown) operates by tracking a record of an instruction packet or group from the time the instruction enters a functional unit until the instruction is finished and the result becomes available.
- a VLIW instruction packet contains one GFU instruction and from zero to three MFU instructions. The source and destination registers of all instructions in an incoming VLIW instruction packet are checked against the scoreboard.
- any true dependencies or output dependencies stall the entire packet until the result is ready.
- Use of a scoreboarded result as an operand causes instruction issue to stall for a sufficient number of cycles to allow the result to become available. If the referencing instruction that provokes the stall executes on the general functional unit 222 or the first media functional unit 220, then the stall only endures until the result is available for intra-unit bypass. For the case of a load instruction that hits in the data cache 106, the stall may last only one cycle. If the referencing instruction is on the second or third media functional units 220, then the stall endures until the result reaches the writeback stage in the pipeline where the result is bypassed in transmission to the split register file 216.
- the scoreboard automatically manages load delays that occur during a load hit.
- all loads enter the scoreboard to simplify software scheduling and eliminate NOPs in the instruction stream.
- the scoreboard is used to manage most interlocks between the general functional unit 222 and the media functional units 220. All loads and non-pipelined long-latency operations of the general functional unit 222 are scoreboarded. The long-latency operations include division (idiv, fdiv) instructions, reciprocal square root (frecsqrt, precsqrt) instructions, and power (ppower) instructions. None of the results of the media functional units 220 is scoreboarded. Non-scoreboarded results are available to subsequent operations on the functional unit that produces the results following the latency of the instruction.
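The packet-level dependency check described above can be sketched as follows. This is a simplified model, assuming the scoreboard is just a set of destination registers with results still pending; the class and method names are hypothetical, and stall duration and bypassing are not modeled.

```python
from typing import Iterable, Sequence, Set, Tuple

# Each instruction is reduced to (source registers, destination register).
Instr = Tuple[Sequence[int], int]


class Scoreboard:
    """Tracks destination registers of loads and long-latency GFU operations."""

    def __init__(self) -> None:
        self.pending: Set[int] = set()

    def mark(self, dest: int) -> None:
        """Record that a result for `dest` is in flight."""
        self.pending.add(dest)

    def complete(self, dest: int) -> None:
        """Record that the result for `dest` has become available."""
        self.pending.discard(dest)

    def must_stall(self, packet: Iterable[Instr]) -> bool:
        """True if any instruction in the packet has a true or output dependency."""
        for sources, dest in packet:
            if dest in self.pending:                     # output dependency
                return True
            if any(src in self.pending for src in sources):  # true dependency
                return True
        return False


sb = Scoreboard()
sb.mark(12)                           # e.g. a load writing register 12 is in flight
packet = [((1, 2, 3), 4), ((12, 5, 6), 7)]
print(sb.must_stall(packet))          # second instruction reads register 12
sb.complete(12)
print(sb.must_stall(packet))
```

One dependent instruction stalls the whole packet, matching the text's statement that the entire packet waits until the result is ready.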
- the illustrative processor 100 has a rendering rate of over fifty million triangles per second without accounting for operating system overhead. Therefore, data feeding specifications of the processor 100 are far beyond the capabilities of cost-effective memory systems.
- Sufficient data bandwidth is achieved by rendering of compressed geometry using the geometry decompressor 104, an on-chip real-time geometry decompression engine.
- Data geometry is stored in main memory in a compressed format. At render time, the data geometry is fetched and decompressed in real-time on the integrated circuit of the processor 100.
- the geometry decompressor 104 advantageously saves memory space and memory transfer bandwidth.
- the compressed geometry uses an optimized generalized mesh structure that explicitly calls out most shared vertices between triangles, allowing the processor 100 to transform and light most vertices only once.
- the triangle throughput of the transform-and-light stage is increased by a factor of four or more over the throughput for isolated triangles.
- multiple vertices are operated upon in parallel so that the utilization rate of resources is high, achieving effective spatial software pipelining.
- operations are overlapped in time by operating on several vertices simultaneously, rather than overlapping several loop iterations in time.
- high trip count loops are software-pipelined so that most media functional units 220 are fully utilized.
- Referring to FIGURE 3, a schematic block diagram illustrates an embodiment of the split register file 216 that is suitable for usage in the processor 100.
- the split register file 216 supplies all operands of processor instructions that execute in the media functional units 220 and the general functional units 222 and receives results of the instruction execution from the execution units.
- the split register file 216 operates as an interface to the geometry decompressor 104.
- the split register file 216 is the source and destination of store and load operations, respectively.
- the split register file 216 in each of the media processing units 110 and 112 has 128 registers. Graphics processing places a heavy burden on register usage. Therefore, a large number of registers is supplied by the split register file 216 so that performance is not limited by loads and stores or handling of intermediate results including graphics "fills" and "spills".
- the illustrative split register file 216 includes twelve read ports and five write ports, supplying total data read and write capacity between the central registers of the split register file 216 and all media functional units 220 and the general functional unit 222.
- the five write ports include one 64-bit write port that is dedicated to load operations. The remaining four write ports are 32 bits wide and are used to write operations of the general functional unit 222 and the media functional units 220.
- Total read and write capacity promotes flexibility and facility in programming both of hand-coded routines and compiler-generated code.
- a sixteen port file is roughly proportional in size and speed to a value of 256.
- the illustrative split register file 216 is divided into four register file segments 310, 312, 314, and 316, each having three read ports and four write ports so that each register file segment has a size and speed proportional to 49 for a total area for the four segments that is proportional to 196. The split structure is therefore potentially smaller and faster than a single central register file. Write operations are fully broadcast so that all files are maintained coherent. Logically, the split register file 216 is no different from a single central register file. However, from the perspective of layout efficiency, the split register file 216 is highly advantageous, allowing for reduced size and improved performance.
- the new media data that is operated upon by the processor 100 is typically heavily compressed. Data transfers are communicated in a compressed format from main memory and input/output devices to pins of the processor 100, subsequently decompressed on the integrated circuit holding the processor 100, and passed to the split register file 216.
- the register file 216 is a focal point for attaining the very large bandwidth of the processor 100.
- the processor 100 transfers data using a plurality of data transfer techniques.
- cacheable data is loaded into the split register file 216 through normal load operations at a low rate of up to eight bytes per cycle.
- streaming data is transferred to the split register file 216 through group load operations which transfer thirty-two bytes from memory directly into eight consecutive 32-bit registers.
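A group load as described above moves one 32-byte block into eight consecutive 32-bit registers. The following is a minimal sketch, assuming big-endian byte order (not stated in the text) and a flat byte string standing in for memory; the function name and signature are hypothetical.

```python
import struct
from typing import List

GROUP_BYTES = 32   # bytes moved by one group load
GROUP_REGS = 8     # consecutive 32-bit registers filled


def group_load(registers: List[int], first_reg: int,
               memory: bytes, address: int) -> None:
    """Copy 32 bytes from `memory` into eight consecutive registers."""
    if first_reg < 0 or first_reg + GROUP_REGS > len(registers):
        raise IndexError("register range out of bounds")
    if address < 0:
        raise ValueError("negative address")
    block = memory[address:address + GROUP_BYTES]
    if len(block) != GROUP_BYTES:
        raise ValueError("block runs past the end of memory")
    # ">8I" = eight unsigned 32-bit words, big-endian (an assumption).
    words = struct.unpack(">8I", block)
    registers[first_reg:first_reg + GROUP_REGS] = words


regs = [0] * 128
mem = bytes(range(64))
group_load(regs, 8, mem, 0)
print([hex(w) for w in regs[8:16]])
```

By comparison, a normal load in the text moves at most eight bytes per cycle, so a single group load carries four times as much data.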
- the processor 100 utilizes the streaming data operation to receive compressed video data for decompression.
- Compressed graphics data is received via a direct memory access (DMA) unit in the geometry decompressor 104.
- the compressed graphics data is decompressed by the geometry decompressor 104 and loaded at a high bandwidth rate into the split register file 216 via group load operations that are mapped to the geometry decompressor 104.
- Load operations are non-blocking and scoreboarded so that a long latency inherent to loads can be hidden by early scheduling.
- Referring to FIGURE 4, a schematic block diagram shows a logical view of the register file 216 and functional units in the processor 100.
- the physical implementation of the core processor 100 is simplified by replicating a single functional unit to form the three media functional units 220.
- the media functional units 220 include circuits that execute various arithmetic and logical operations including general-purpose code, graphics code, and video-image-speech (VIS) processing.
- VIS processing includes video processing, image processing, digital signal processing (DSP) loops, speech processing, and voice recognition algorithms, for example.
- Referring to FIGURE 5, a simplified pictorial schematic diagram depicts an example of instruction execution among a plurality of media functional units 220.
- Results generated by various internal function blocks within a first individual media functional unit are immediately accessible internally to the first media functional unit 510 but are only accessible globally by other media functional units 512 and 514 and by the general functional unit five cycles after the instruction enters the first media functional unit 510, regardless of the actual latency of the instruction. Therefore, instructions executing within a functional unit can be scheduled by software to execute immediately, taking into consideration the actual latency of the instruction. In contrast, software that schedules instructions executing in different functional units is expected to account for the five cycle latency.
- the shaded areas represent the stage at which the pipeline completes execution of an instruction and generates final result values.
- media processing unit instructions have three different latencies - four cycles for instructions such as fmuladd and fadd, two cycles for instructions such as pmuladd, and one cycle for instructions like padd and xor.
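The scheduling rule described above (actual latency within a unit, a fixed five cycles across units) can be written down directly. This is a minimal sketch: the latency table lists only the opcodes named in the text, the function name is hypothetical, and it assumes that "enters the functional unit" corresponds to the producer's issue cycle.

```python
# Latencies in cycles for the instructions named in the text.
LATENCY = {"fmuladd": 4, "fadd": 4, "pmuladd": 2, "padd": 1, "xor": 1}

# Results become visible to other functional units after five cycles,
# regardless of the producing instruction's actual latency.
GLOBAL_LATENCY = 5


def earliest_consumer_cycle(opcode: str, issue_cycle: int,
                            producer_unit: int, consumer_unit: int) -> int:
    """Earliest cycle at which a dependent instruction can use the result."""
    if producer_unit == consumer_unit:
        return issue_cycle + LATENCY[opcode]
    return issue_cycle + GLOBAL_LATENCY


# A padd issued at cycle 10 on unit 0:
print(earliest_consumer_cycle("padd", 10, 0, 0))  # consumer on the same unit
print(earliest_consumer_cycle("padd", 10, 0, 1))  # consumer on another unit
```

Under this model a compiler gains the most from keeping chains of short-latency operations on one functional unit.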
- Referring to FIGURE 6, a schematic block diagram depicts an embodiment of the multiport register file 216.
- a plurality of read address buses RA1 through RAN carry read addresses that are applied to decoder ports 616-1 through 616-N, respectively.
- Decoder circuits are well known to those of ordinary skill in the art, and any of several implementations could be used as the decoder ports 616-1 through 616-N.
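As one illustration of what a decoder port does, the sketch below turns an address into a one-hot word line selection, with one independent decoder per port. It is a behavioral model only, not the circuit of the embodiment; the function names are hypothetical.

```python
from typing import List


def decode_address(address: int, num_words: int) -> List[int]:
    """Return a one-hot list: exactly one word line is asserted."""
    if not 0 <= address < num_words:
        raise ValueError("address out of range")
    return [1 if line == address else 0 for line in range(num_words)]


def decode_ports(addresses: List[int], num_words: int) -> List[List[int]]:
    """Decode one address per port; each port drives its own set of word lines."""
    return [decode_address(addr, num_words) for addr in addresses]


# Three read ports addressing a 128-register file in the same cycle.
selections = decode_ports([0, 5, 5], 128)
print([sel.index(1) for sel in selections])
```

Two ports may select the same register in one cycle, which is why each port needs its own word line per address.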
- Referring to FIGURES 7A and 7B, a schematic block diagram and a pictorial diagram, respectively, illustrate the register file 216 and a memory array insert 710.
- the register file 216 is connected to four functional units 720, 722, 724, and 726 that supply information for performing operations such as arithmetic, logical, graphics, data handling operations and the like.
- the illustrative register file 216 has twelve read ports 730 and four write ports 732.
- the twelve read ports 730 are illustratively allocated with three ports connected to each of the four functional units.
- the four write ports 732 are connected to receive data from all of the four functional units.
- the register file 216 includes a decoder, as is shown in FIGURE 6, for each of the sixteen read and write ports.
- the register file 216 includes a memory array 740 that is partially shown in the insert 710 illustrated in FIGURE 7B and includes a plurality of word lines 744 and bit lines 746.
- the word lines 744 and bit lines 746 are simply a set of wires that connect transistors (not shown) within the memory array 740.
- the word lines 744 select registers so that a particular word line selects a register of the register file 216.
- the bit lines 746 are a second set of wires that connect the transistors in the memory array 740. Typically, the word lines 744 and bit lines 746 are laid out at right angles.
- the word lines 744 and the bit lines 746 are constructed of metal laid out in different planes such as a metal 2 layer for the word lines 744 and a metal 3 layer for the bit lines 746.
- In other embodiments, bit lines and word lines may be constructed of other materials, such as polysilicon, or may reside at levels other than those described in the illustrative embodiment, as is known in the art of semiconductor manufacture.
- In the illustrative example, the word lines 744 are separated by a distance of about 1 μm and the bit lines 746 are separated by approximately 1 μm. Other circuit dimensions may be constructed for various processes.
- Although the illustrative example shows one bit line per port, other embodiments may use multiple bit lines per port.
- When a particular functional unit reads a particular register in the register file 216, the functional unit sends an address signal via the read ports 730 that activates the appropriate word lines to access the register.
- In a register file having a conventional structure and twelve read ports, each cell, storing a single bit of information, is connected to twelve word lines to select an address and twelve bit lines to carry data read from the address.
- the four write ports 732 address registers in the register file using four word lines 744 and four bit lines 746 connected to each cell.
- the four word lines 744 address a cell and the four bit lines 746 carry data to the cell.
- If the illustrative register file 216 were laid out in a conventional manner with twelve read ports 730 and four write ports 732 for a total of sixteen ports and the ports were 1 μm apart, one memory cell would have an integrated circuit area of 256 μm² (16 × 16). The area is proportional to the square of the number of ports.
- the register file 216 is alternatively implemented to perform single-ended reads and/or single-ended writes utilizing a single bit line per port per cell, or implemented to perform differential reads and/or differential writes using two bit lines per port per cell.
- the register file 216 is not laid out in the conventional manner and instead is split into a plurality of separate and individual register file segments 224.
- FIGURE 8 a schematic block diagram shows an arrangement of the register file 216 into the four register file segments 224.
- the register file 216 remains operational as a single logical register file in the sense that the four of the register file segments 224 contain the same number of registers and the same register values as a conventional register file of the same capacity that is not split.
- The separated register file segments 224 differ from a register file that is not split through elimination of lines that would otherwise connect ports to the memory cells. Accordingly, each register file segment 224 has connections to only three of the twelve read ports 730; lines connecting a register file segment to the other nine read ports are eliminated.
- each of the four register file segments 224 has connections to all four write ports 732.
- each of the four register file segments 224 has three read ports and four write ports for a total of seven ports.
- The individual cells are connected to seven word lines and seven bit lines so that a memory array with a spacing of 1 μm between lines has an area of approximately 49 μm².
- Each of the four register file segments 224 has an area proportional to seven squared. The total area of the four register file segments 224 is therefore proportional to 49 times 4, a total of 196.
- The split register file thus advantageously reduces the area of the memory array by a ratio of approximately 256/196, or about 1.3X: the conventional array is roughly 30% larger than the split array, which corresponds to an area saving of about 23%.
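The area arithmetic above can be reproduced with a few lines of code. This is only a sketch of the proportionality argument in the text: the function name `cell_area_um2` is hypothetical, and it assumes one word line and one bit line per port at a uniform 1 μm pitch, so that a cell is a square whose side is the port count times the pitch.

```python
def cell_area_um2(read_ports, write_ports, pitch_um=1.0):
    """Estimate one cell's area, assuming one word line and one bit line per port."""
    ports = read_ports + write_ports
    side_um = ports * pitch_um
    return side_um * side_um


# Conventional layout: twelve read ports and four write ports on every cell.
conventional = cell_area_um2(12, 4)

# Split layout: four segments, each with three read ports and all four write ports.
segment = cell_area_um2(3, 4)
split_total = 4 * segment

ratio = conventional / split_total          # how much larger the conventional cell is
reduction = 1 - split_total / conventional  # fractional area saved by splitting

print(conventional, segment, split_total)
print(round(ratio, 2), round(reduction, 2))
```

Under these assumptions the conventional cell comes to 256 μm², one segment cell to 49 μm², and the four segments together to 196 μm².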
- the reduction in area further advantageously corresponds to an improvement in speed performance due to a reduction in the length of the word lines 744 and the bit lines 746 connecting the array cells that reduces the time for a signal to pass on the lines.
- The improvement in speed performance is highly advantageous due to strict time budgets that are imposed by the specification of high-performance processors and also to attain a large capacity register file that is operational at high speed.
- the operation of reading the register file 216 typically takes place in a single clock cycle.
- a cycle time of two nanoseconds is imposed for accessing the register file 216.
- Conventional register files typically have only up to about 32 registers, in comparison to the 128 registers in the illustrative register file 216 of the processor 100.
- a register file 216 that is substantially larger than the register file in conventional processors is highly advantageous in high-performance operations such as video and graphic processing.
- the reduced size of the register file 216 is highly useful for complying with time budgets in a large capacity register file.
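The split organization described above, in which every segment holds the same register values, receives all four write ports, and serves only three read ports, can be modeled behaviorally as follows. The class `SplitRegisterFile` and its method names are hypothetical, and the model deliberately ignores timing and write conflicts; it only shows why the four segments still behave as one logical register file.

```python
class SplitRegisterFile:
    """Behavioral sketch: several segments, each holding a full copy of the registers."""

    def __init__(self, num_segments=4, num_registers=128,
                 reads_per_segment=3, write_ports=4):
        self.reads_per_segment = reads_per_segment
        self.write_ports = write_ports
        self.segments = [[0] * num_registers for _ in range(num_segments)]

    def write(self, writes):
        """Apply up to `write_ports` (register, value) pairs to every segment."""
        if len(writes) > self.write_ports:
            raise ValueError("too many writes for one cycle")
        for segment in self.segments:
            for register, value in writes:
                segment[register] = value

    def read(self, unit, registers):
        """Read up to `reads_per_segment` registers from the segment local to `unit`."""
        if len(registers) > self.reads_per_segment:
            raise ValueError("too many reads for one segment")
        return [self.segments[unit][r] for r in registers]
```

Because each write is broadcast to all segments, a functional unit reading only its local segment sees the same values that a single unsplit register file would return.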
- a simplified schematic timing diagram illustrates timing of the processor pipeline 900.
- the pipeline 900 includes nine stages including three initiating stages, a plurality of execution phases, and two terminating stages.
- the three initiating stages are optimized to include only those operations necessary for decoding instructions so that jump and call instructions, which are pervasive in the Java language, execute quickly. Optimization of the initiating stages advantageously facilitates branch prediction since branches, jumps, and calls execute quickly and do not introduce many bubbles.
- the first of the initiating stages is a fetch stage 910 during which the processor 100 fetches instructions from the 16Kbyte two-way set-associative instruction cache 210.
- the fetched instructions are aligned in the instruction aligner 212 and forwarded to the instruction buffer 214 in an align stage 912, a second stage of the initiating stages.
- the aligning operation properly positions the instructions for storage in a particular segment of the four register file segments 310, 312, 314, and 316 and for execution in an associated functional unit of the three media functional units 220 and one general functional unit 222.
- In a decoding stage 914 of the initiating stages, the fetched and aligned VLIW instruction packet is decoded and the scoreboard (not shown) is read and updated in parallel.
- The four register file segments 310, 312, 314, and 316 each hold either floating-point data or integer data.
- the register files are read in the decoding (D) stage.
- The two terminating stages include a trap-handling stage 960 and a write-back stage 962 during which result data is written back to the split register file 216.
- Although the illustrative register file has one bit line per port, in other embodiments more bit lines may be allocated for a port.
- the described word lines and bit lines are formed of a metal. In other examples, other conductive materials such as doped polysilicon may be employed for interconnects.
- the described register file uses single-ended reads and writes so that a single bit line is employed per bit and per port. In other processors, differential reads and writes with dual-ended sense amplifiers may be used so that two bit lines are allocated per bit and per port, resulting in a bigger pitch. Dual-ended sense amplifiers improve memory fidelity but greatly increase the size of a memory array, imposing a heavy burden on speed performance.
- The spacing between bit lines and word lines is described to be approximately 1 μm. In some processors, the spacing may be greater than 1 μm. In other processors the spacing between lines is less than 1 μm.
Landscapes
- Engineering & Computer Science (AREA)
- Microelectronics & Electronic Packaging (AREA)
- Advance Control (AREA)
- Executing Machine-Instructions (AREA)
- Static Random-Access Memory (AREA)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
DE69906809T DE69906809T2 (en) | 1998-12-03 | 1999-12-02 | ARRANGEMENT AND METHOD FOR OPTIMIZING CHIP INTEGRATION AND SPEED PERFORMANCE BY REGISTER MEMORY DISTRIBUTION |
EP99965078A EP1147519B1 (en) | 1998-12-03 | 1999-12-02 | Apparatus and method for optimizing die utilization and speed performance by register file splitting |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US09/204,481 US6343348B1 (en) | 1998-12-03 | 1998-12-03 | Apparatus and method for optimizing die utilization and speed performance by register file splitting |
US09/204,481 | 1998-12-03 |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2000033315A1 true WO2000033315A1 (en) | 2000-06-08 |
WO2000033315B1 WO2000033315B1 (en) | 2000-08-24 |
Family
ID=22758075
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US1999/028467 WO2000033315A1 (en) | 1998-12-03 | 1999-12-02 | Apparatus and method for optimizing die utilization and speed performance by register file splitting |
Country Status (4)
Country | Link |
---|---|
US (1) | US6343348B1 (en) |
EP (1) | EP1147519B1 (en) |
DE (1) | DE69906809T2 (en) |
WO (1) | WO2000033315A1 (en) |
Families Citing this family (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6718457B2 (en) * | 1998-12-03 | 2004-04-06 | Sun Microsystems, Inc. | Multiple-thread processor for threaded software applications |
US7139898B1 (en) | 2000-11-03 | 2006-11-21 | Mips Technologies, Inc. | Fetch and dispatch disassociation apparatus for multistreaming processors |
US7035998B1 (en) * | 2000-11-03 | 2006-04-25 | Mips Technologies, Inc. | Clustering stream and/or instruction queues for multi-streaming processors |
US7127588B2 (en) * | 2000-12-05 | 2006-10-24 | Mindspeed Technologies, Inc. | Apparatus and method for an improved performance VLIW processor |
GB2374746B (en) * | 2001-04-19 | 2005-04-13 | Discreet Logic Inc | Displaying image data |
US7310710B1 (en) | 2003-03-11 | 2007-12-18 | Marvell International Ltd. | Register file with integrated routing to execution units for multi-threaded processors |
CN1333356C (en) * | 2004-07-23 | 2007-08-22 | 中国人民解放军国防科学技术大学 | Write serialization and resource duplication combined multi-port register file design method |
US7523295B2 (en) * | 2005-03-21 | 2009-04-21 | Qualcomm Incorporated | Processor and method of grouping and executing dependent instructions in a packet |
US20060229638A1 (en) * | 2005-03-29 | 2006-10-12 | Abrams Robert M | Articulating retrieval device |
US7277353B2 (en) * | 2005-08-22 | 2007-10-02 | P.A. Semi, Inc. | Register file |
US7187606B1 (en) * | 2005-08-22 | 2007-03-06 | P.A. Semi, Inc. | Read port circuit for register file |
US7366032B1 (en) * | 2005-11-21 | 2008-04-29 | Advanced Micro Devices, Inc. | Multi-ported register cell with randomly accessible history |
US8347037B2 (en) * | 2008-10-22 | 2013-01-01 | International Business Machines Corporation | Victim cache replacement |
US8209489B2 (en) * | 2008-10-22 | 2012-06-26 | International Business Machines Corporation | Victim cache prefetching |
US8499124B2 (en) * | 2008-12-16 | 2013-07-30 | International Business Machines Corporation | Handling castout cache lines in a victim cache |
US8225045B2 (en) * | 2008-12-16 | 2012-07-17 | International Business Machines Corporation | Lateral cache-to-cache cast-in |
US8489819B2 (en) * | 2008-12-19 | 2013-07-16 | International Business Machines Corporation | Victim cache lateral castout targeting |
US7990780B2 (en) * | 2009-02-20 | 2011-08-02 | Apple Inc. | Multiple threshold voltage register file cell |
US8949540B2 (en) * | 2009-03-11 | 2015-02-03 | International Business Machines Corporation | Lateral castout (LCO) of victim cache line in data-invalid state |
US8095733B2 (en) * | 2009-04-07 | 2012-01-10 | International Business Machines Corporation | Virtual barrier synchronization cache castout election |
US8131935B2 (en) * | 2009-04-07 | 2012-03-06 | International Business Machines Corporation | Virtual barrier synchronization cache |
US8347036B2 (en) * | 2009-04-09 | 2013-01-01 | International Business Machines Corporation | Empirically based dynamic control of transmission of victim cache lateral castouts |
US8327073B2 (en) * | 2009-04-09 | 2012-12-04 | International Business Machines Corporation | Empirically based dynamic control of acceptance of victim cache lateral castouts |
US8312220B2 (en) * | 2009-04-09 | 2012-11-13 | International Business Machines Corporation | Mode-based castout destination selection |
US9189403B2 (en) * | 2009-12-30 | 2015-11-17 | International Business Machines Corporation | Selective cache-to-cache lateral castouts |
US10007518B2 (en) * | 2013-07-09 | 2018-06-26 | Texas Instruments Incorporated | Register file structures combining vector and scalar data with global and local accesses |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0520788A2 (en) * | 1991-06-25 | 1992-12-30 | Fujitsu Limited | Semiconductor memory devices |
EP0520425A2 (en) * | 1991-06-27 | 1992-12-30 | Nec Corporation | Semiconductor memory device |
US5642325A (en) * | 1995-09-27 | 1997-06-24 | Philips Electronics North America Corporation | Register file read/write cell |
US5822341A (en) * | 1995-04-06 | 1998-10-13 | Advanced Hardware Architectures, Inc. | Multiport RAM for use within a viterbi decoder |
Family Cites Families (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5111431A (en) * | 1990-11-02 | 1992-05-05 | Analog Devices, Inc. | Register forwarding multi-port register file |
JP3676411B2 (en) | 1994-01-21 | 2005-07-27 | サン・マイクロシステムズ・インコーポレイテッド | Register file device and register file access method |
US5761475A (en) | 1994-12-15 | 1998-06-02 | Sun Microsystems, Inc. | Computer processor having a register file with reduced read and/or write port bandwidth |
US5713039A (en) * | 1995-12-05 | 1998-01-27 | Advanced Micro Devices, Inc. | Register file having multiple register storages for storing data from multiple data streams |
US5764943A (en) | 1995-12-28 | 1998-06-09 | Intel Corporation | Data path circuitry for processor having multiple instruction pipelines |
US5828623A (en) * | 1996-02-23 | 1998-10-27 | Integrated Device Technology, Inc. | Parallel write logic for multi-port memory arrays |
US5657291A (en) * | 1996-04-30 | 1997-08-12 | Sun Microsystems, Inc. | Multiport register file memory cell configuration for read operation |
US5778248A (en) | 1996-06-17 | 1998-07-07 | Sun Microsystems, Inc. | Fast microprocessor stage bypass logic enable |
US5742557A (en) * | 1996-06-20 | 1998-04-21 | Northern Telecom Limited | Multi-port random access memory |
KR100228339B1 (en) * | 1996-11-21 | 1999-11-01 | 김영환 | Multi-port access memory for sharing read port and write port |
US5946262A (en) * | 1997-03-07 | 1999-08-31 | Mitsubishi Semiconductor America, Inc. | RAM having multiple ports sharing common memory locations |
JPH117773A (en) * | 1997-06-18 | 1999-01-12 | Sony Corp | Semiconductor memory device |
KR100289386B1 (en) * | 1997-12-27 | 2001-06-01 | 김영환 | Multi-port sram |
US6144609A (en) * | 1999-07-26 | 2000-11-07 | International Business Machines Corporation | Multiport memory cell having a reduced number of write wordlines |
1998
- 1998-12-03 US US09/204,481 patent/US6343348B1/en not_active Expired - Lifetime
1999
- 1999-12-02 DE DE69906809T patent/DE69906809T2/en not_active Expired - Lifetime
- 1999-12-02 EP EP99965078A patent/EP1147519B1/en not_active Expired - Lifetime
- 1999-12-02 WO PCT/US1999/028467 patent/WO2000033315A1/en active IP Right Grant
Also Published As
Publication number | Publication date |
---|---|
EP1147519A1 (en) | 2001-10-24 |
US6343348B1 (en) | 2002-01-29 |
EP1147519B1 (en) | 2003-04-09 |
WO2000033315B1 (en) | 2000-08-24 |
DE69906809D1 (en) | 2003-05-15 |
DE69906809T2 (en) | 2003-12-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7437534B2 (en) | Local and global register partitioning technique | |
EP1147519B1 (en) | Apparatus and method for optimizing die utilization and speed performance by register file splitting | |
US6718457B2 (en) | Multiple-thread processor for threaded software applications | |
US7490228B2 (en) | Processor with register dirty bit tracking for efficient context switch | |
US6279100B1 (en) | Local stall control method and structure in a microprocessor | |
WO2000033183A9 (en) | Method and structure for local stall control in a microprocessor | |
US7028170B2 (en) | Processing architecture having a compare capability | |
US6757820B2 (en) | Decompression bit processing with a general purpose alignment tool | |
US7117342B2 (en) | Implicitly derived register specifiers in a processor | |
US20010042187A1 (en) | Variable issue-width vliw processor | |
US6615338B1 (en) | Clustered architecture in a VLIW processor | |
US6374351B2 (en) | Software branch prediction filtering for a microprocessor | |
Yu et al. | An energy-efficient mobile vertex processor with multithread expanded VLIW architecture and vertex caches | |
US6625634B1 (en) | Efficient implementation of multiprecision arithmetic |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states | Kind code of ref document: A1 Designated state(s): JP KR |
AL | Designated countries for regional patents | Kind code of ref document: A1 Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE |
121 | Ep: the epo has been informed by wipo that ep was designated in this application | |
AK | Designated states | Kind code of ref document: B1 Designated state(s): JP KR |
AL | Designated countries for regional patents | Kind code of ref document: B1 Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE |
B | Later publication of amended claims | |
WWE | Wipo information: entry into national phase | Ref document number: 1999965078 Country of ref document: EP |
WWP | Wipo information: published in national office | Ref document number: 1999965078 Country of ref document: EP |
WWG | Wipo information: grant in national office | Ref document number: 1999965078 Country of ref document: EP |