EP3327570A1 - Dual mode local data store - Google Patents
Dual mode local data store
- Publication number
- EP3327570A1 (application EP16203672.7A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- requestor
- access
- requestors
- partitions
- shared resource
- Prior art date
- Legal status: Ceased (assumed status, not a legal conclusion)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0844—Multiple simultaneous or quasi-simultaneous cache accessing
- G06F12/0846—Cache with multiple tag or data arrays being simultaneously accessible
- G06F12/0848—Partitioned cache, e.g. separate instruction and operand caches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
- G06F9/3887—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
- G06F9/3888—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple threads [SIMT] in parallel
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
- G06F9/5077—Logical partitioning of resources; Management or configuration of virtualized resources
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/52—Program synchronisation; Mutual exclusion, e.g. by means of semaphores
- G06F9/526—Mutual exclusion algorithms
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/28—Using a specific disk cache architecture
- G06F2212/282—Partitioned cache
Abstract
A system and method for efficiently processing access requests for a shared resource are described. Each of many requestors is assigned to a partition of a shared resource. When a controller determines no requestor generates an access request for an unassigned partition, the controller permits simultaneous access to the assigned partitions for active requestors. When the controller determines at least one active requestor generates an access request for an unassigned partition, the controller allows a single active requestor to gain exclusive access to the entire shared resource while stalling access for the other active requestors. The controller alternates exclusive access among the active requestors. In various embodiments, the shared resource is a local data store in a graphics processing unit and each of the multiple requestors is a single instruction multiple data (SIMD) compute unit.
Description
- The parallelization of tasks is used to increase the throughput of computer systems. To this end, compilers or the software programmer extract parallelized tasks from program code to execute in parallel on the system hardware. Out-of-order execution, deep pipelines, speculative execution and multi-threaded execution are used to exploit instruction level parallelism, and thus, increase throughput. To further increase parallel execution on the hardware, a parallel architecture processor is included in the system to exploit data level parallelism and offload computationally intensive and repetitive tasks from conventional general-purpose processors. Examples of these tasks include video graphics rendering, cryptography, garbage collection and other vector instruction applications.
- Various examples of the above systems exploiting data level parallelism include a single instruction multiple data (SIMD) processor as the parallel architecture processor. A graphics processing unit (GPU) is one example of a SIMD processor. The GPU includes one or more SIMD compute units, each with multiple lanes of processing resources for executing instructions of a respective thread. The instructions are the same in the threads executing across the lanes but with data elements particular to a given lane. An operating system scheduler or a programmer via a software programming platform schedules the threads on the lanes of the SIMD compute units.
- Without the use of a local data store, the result data generated by a given lane within the SIMD compute unit is inaccessible to other lanes without the costly latency of storing the result data to, and later retrieving it from, other forms of data storage. Although the multiple lanes of the SIMD compute unit share the local data store, such systems do not provide an architecture that allows the number of lanes to dynamically change and thus alter the amount of storage to share within the local data store. Therefore, the systems do not support conflict resolution and full accessibility (addressability) of the local data store.
- In view of the above, efficient methods and systems for processing access requests for a shared resource are desired.
FIG. 1 is a generalized diagram of one embodiment of a computing system supporting access of a shared resource.
FIG. 2 is a generalized diagram of one embodiment of a parallel architecture processor.
FIG. 3 is a generalized diagram of one embodiment of a method for processing access requests targeting a shared resource.
FIG. 4 is a generalized diagram of another embodiment of a method for processing access requests targeting a shared resource.
FIG. 5 is a generalized diagram of one embodiment of a method for selecting sources of access requests for use of a shared resource.
- While the invention is susceptible to various modifications and alternative forms, specific embodiments are shown by way of example in the drawings and are herein described in detail. It should be understood, however, that drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the invention is to cover all modifications, equivalents and alternatives falling within the scope of the present invention as defined by the appended claims.
- In the following description, numerous specific details are set forth to provide a thorough understanding of the present invention. However, one having ordinary skill in the art should recognize that the invention might be practiced without these specific details. In some instances, well-known circuits, structures, and techniques have not been shown in detail to avoid obscuring the present invention. Further, it will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements are exaggerated relative to other elements.
- Systems and methods for efficiently processing access requests for a shared resource are contemplated. In various embodiments, each of many requestors is assigned to a partition of a shared resource. In some embodiments, each partition is a separate partition, which is non-overlapping with other partitions of the shared resource. A controller is used to support access to the shared resource. When the controller determines no requestor generates an access request for an unassigned partition, the controller permits simultaneous access to the assigned partitions for active requestors. However, when the controller determines at least one active requestor generates an access request for an unassigned partition, the controller allows a single active requestor to gain access to the entire shared resource while stalling access for the other active requestors.
- The controller performs arbitration by selecting an active requestor. In some embodiments, the selection is based on least recently used criteria. The controller stalls access of the shared resource for unselected requestors while permitting access for the selected requestor. In some embodiments, the controller sets a limit on a number of access requests performed for the selected requestor or sets a limit on an amount of time for performing access requests for the selected requestor such as a number of clock cycles. If the active requestors have more access requests, the controller stalls access of the shared resource for the selected requestor and marks it as the most recently selected active requestor. Afterward, the controller deselects the requestor and again performs arbitration by selecting another active requestor to have exclusive access to the entire shared resource.
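- As an illustration of this dual-mode behavior, the following Python sketch models the controller; the class name, the dictionary-based interface and the single-winner policy are assumptions made for the example, not details taken from the claims:

```python
# Behavioral sketch (illustrative, not the patent's hardware) of the
# dual-mode controller: simultaneous partitioned access when no request
# targets an unassigned partition, otherwise exclusive access rotated on
# a least-recently-selected basis.
class DualModeController:
    def __init__(self, assignments):
        # assignments: requestor id -> its assigned partition id
        self.assignments = assignments
        # Least-recently-selected order; the front is selected next.
        self.lrs = list(assignments)

    def arbitrate(self, requests):
        """requests: active requestor id -> partition id targeted now."""
        if all(self.assignments[r] == p for r, p in requests.items()):
            # No requestor targets an unassigned partition: grant all
            # active requestors simultaneous access to their partitions.
            return dict(requests)
        # Otherwise one active requestor gains access to the entire
        # shared resource while the other active requestors are stalled.
        winner = next(r for r in self.lrs if r in requests)
        self.lrs.remove(winner)
        self.lrs.append(winner)  # mark as most recently selected
        return {winner: "ENTIRE_RESOURCE"}
```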
- In various embodiments, the shared resource is a local data store in a graphics processing unit and each of the multiple requestors is a single instruction multiple data (SIMD) compute unit. In some embodiments, the controller detects access requests to unassigned partitions by detecting accesses to regions of the local data store external to the assigned memory address boundaries for the SIMD compute units. In various embodiments, when a given SIMD compute unit has exclusive access to the entire local data store, it has exclusive access for a single clock cycle before arbitration reoccurs and another SIMD compute unit gains exclusive access. However, another number of clock cycles is possible and contemplated. Alternatively, in other embodiments, the controller monitors a number of access requests and when the number reaches a limit, arbitration reoccurs. In various embodiments, each SIMD compute unit includes read and write ports to the local data store, which are used to provide access to the local data store for another SIMD compute unit when the other SIMD compute unit has exclusive access to the local data store.
- Turning to FIG. 1, a generalized block diagram of one embodiment of a computing system supporting access of a shared resource is shown. In the shown embodiment, the computing system includes requestors 110A-110H accessing the shared resource 140 via the arbitration control unit 120. In some embodiments, the shared resource 140 is a shared memory and the arbitration control unit 120 is a memory controller. In other embodiments, the shared resource 140 is a unit with specific intensive computational functionality or a unit for providing switching access to a network. Other examples of a resource and any associated controller are possible and contemplated.
- The requestors 110A-110H include the computational resources 112A-112H. In various embodiments, the computational resources 112A-112H include pipeline registers, data structures for storing intermediate results, circuitry for performing integer arithmetic, floating-point arithmetic, Boolean logic operations, branch condition comparisons and so forth. As shown, the shared resource 140 is partitioned into multiple partitions 142A-142H. In some embodiments, each of the partitions 142A-142H includes a same amount of data storage, a same amount of intensive computational functionality and so forth. In other embodiments, one or more of the partitions 142A-142H includes less or more data storage or intensive computational functionality than other ones of the partitions 142A-142H.
- In various embodiments, each of the partitions 142A-142H is a separate partition which does not overlap with any other partition of the partitions 142A-142H. In other embodiments, overlapping is used. In various embodiments, each partition of the partitions 142A-142H is assigned to one of the computational resources 112A-112H. In other embodiments, two or more of the computational resources 112A-112H are assigned to a same one of the partitions 142A-142H.
- In some embodiments, the assignments between the computational resources 112A-112H and the partitions 142A-142H, in addition to the sizes of the partitions 142A-142H, are set by programmable control and status registers (not shown). Firmware, an executing software application or other software is used to update the control and status registers to initially assign and subsequently reassign the computational resources 112A-112H to the partitions 142A-142H and the sizes of the partitions 142A-142H. In other embodiments, control logic implemented by hardware circuitry within the requestors 110A-110H and/or the arbitration control unit 120 performs the initial assignment and sizing and subsequent reassignments and resizing.
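- The following is an illustrative model of such programmable assignment; the register fields and the base/size encoding are assumptions for the sketch:

```python
# Illustrative sketch of partition assignment and sizing held in
# programmable control/status registers, one entry per requestor.
# Field names and the base/size encoding are assumptions.
from dataclasses import dataclass

@dataclass
class PartitionCSR:
    base: int   # first address of the assigned partition
    size: int   # partition size in bytes

# Firmware or other software initially assigns requestors to partitions.
csr = {
    "110A": PartitionCSR(base=0x0000, size=0x2000),
    "110B": PartitionCSR(base=0x2000, size=0x2000),
}

def reassign(csr, requestor, base, size):
    """Subsequent reassignment or resizing by rewriting the registers."""
    csr[requestor] = PartitionCSR(base, size)
```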
- As one or more of the computational resources 112A-112H process instructions of one or more applications, one or more of the requestors 110A-110H generate access requests for the shared resource 140. In various embodiments, the generated access requests identify one of the partitions 142A-142H. By identifying one of the partitions 142A-142H, the generated access request targets the identified partition. The targeted partition is either an assigned partition or an unassigned partition.
- If no access request generated by the requestors 110A-110H targets an unassigned one of the partitions 142A-142H, then the access requests are serviced based on the assignments. Each access request is permitted by the arbitration control unit 120 to access its assigned partition. The selection logic implemented by the multiplexer ("mux") gates 130A-130H selects access information 134A-134H based on the grant signal(s) 132A-132H. The grant signal(s) 132A-132H are asserted by the arbitration control unit 120 in a manner to select the assigned one of the requestors 110A-110H based on the earlier set assignments. Therefore, each of the partitions 142A-142H is accessed by its assigned one of the requestors 110A-110H. In various embodiments, two or more of the partitions 142A-142H are accessed simultaneously when there are no conflicts based on the assignments.
- If any access request generated by the requestors 110A-110H targets an unassigned one of the partitions 142A-142H, then the requestors 110A-110H gain exclusive access to the partitions 142A-142H. The exclusive access occurs based on arbitration provided by the arbitration control unit 120. For example, in various embodiments, each active requestor of the requestors 110A-110H gains exclusive access for a clock cycle on a least recently selected basis. In other embodiments, a number of clock cycles or a number of access requests is used by the arbitration control unit 120 to determine when to allow another active requestor of the requestors 110A-110H to gain exclusive access to the partitions 142A-142H.
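- To make the two service cases concrete, a hypothetical walk-through of the controller sketch above follows; the requestor and partition identifiers reuse the reference numerals only as labels:

```python
# Hypothetical walk-through of the two cases described above, using the
# DualModeController sketch from earlier in this description.
ctrl = DualModeController(assignments={"110A": "142A", "110B": "142B"})

# Case 1: both active requestors target their own partitions, so both
# are granted simultaneous access.
print(ctrl.arbitrate({"110A": "142A", "110B": "142B"}))

# Case 2: requestor 110B targets unassigned partition 142A, so exclusive
# access rotates: 110A wins this cycle, 110B the next, and so on.
print(ctrl.arbitrate({"110A": "142A", "110B": "142A"}))
print(ctrl.arbitrate({"110A": "142A", "110B": "142A"}))
```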
- In some embodiments, the computing system includes a hybrid arbitration scheme wherein the arbitration control unit 120 includes a centralized arbiter and one or more of the requestors 110A-110H include distributed arbitration logic. For example, one or more of the requestors 110A-110H includes an arbiter for selecting a given request to send to the arbitration control unit 120 from multiple requests generated by multiple sources within the computational resources 112A-112H. The arbitration control unit 120 selects one or more requests to send to the shared resource 140 from multiple requests received from the requestors 110A-110H. The grant signals 132A-132H are asserted based on the received requests and detecting whether any received request targets an assigned one of the partitions 142A-142H. In addition, in some embodiments, the arbitration control unit 120 adjusts the number of clock cycles or the number of access requests for exclusive access to the shared resource 140 based on an encoded priority along with the least-recently-selected scheme.
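- One possible composition of the two arbitration levels is sketched below; the pick-lowest local policy is a placeholder assumption, since the description leaves the distributed policy open, and the sketch reuses the DualModeController model from above:

```python
# Two-level (hybrid) arbitration sketch: a distributed arbiter inside each
# requestor picks one request among its internal sources, then the central
# arbitration control unit arbitrates among the requestors. The policies
# shown are placeholder assumptions.
def local_arbiter(source_requests):
    """Pick one pending request from a requestor's internal sources.
    source_requests: source id -> targeted partition id, or None."""
    pending = {s: p for s, p in source_requests.items() if p is not None}
    if not pending:
        return None
    return pending[min(pending)]  # e.g. lowest-numbered source first

def central_arbiter(ctrl, per_requestor_sources):
    """Forward at most one request per requestor to the central unit."""
    picked = {r: local_arbiter(s) for r, s in per_requestor_sources.items()}
    picked = {r: p for r, p in picked.items() if p is not None}
    return ctrl.arbitrate(picked)
```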
- Responses 150 for the requests are shown as being sent back to the arbitration control unit 120. In other embodiments, the responses 150 are sent directly to the requestors 110A-110H, such as via a bus. In some embodiments, polling logic within the interfaces of the requestors 110A-110H is used to retrieve the associated response data 150 from the bus or the arbitration control unit 120. In various other embodiments, the responses 150 are sent to other computational units (not shown) within the computing system.
- Referring now to FIG. 2, one embodiment of a parallel architecture processor 200 is shown. In various embodiments, the parallel architecture processor 200 is a graphics processing unit (GPU) with compute units 210A-210D accessing the local data store 260 via the arbitration control unit 250. Generally, a GPU includes a separate local data share for each of the compute units 210A-210D for sharing data among the lanes 220A-220M. Here, however, the local data share 260 is shared among the compute units 210A-210D. Therefore, it is possible for one or more of the lanes 220A-220M within the compute unit 210A to share result data with one or more lanes 220A-220M within the compute unit 210D based on an operating mode.
- As described earlier, the parallel architecture processor 200 includes special-purpose integrated circuitry optimized for highly parallel data applications such as single instruction multiple data (SIMD) operations. In various embodiments, the parallel architecture processor 200 is a graphics processing unit (GPU) used for video graphics rendering. As shown, each of the lanes 220A-220M within the compute unit 210A comprises registers 222A and an arithmetic logic unit (ALU) 224A. Lanes within other compute units of the compute units 210A-210D also include similar components.
- In various embodiments, the registers 222A are storage elements used as a register file for storing operands and results.
- In various embodiments, the data flow within the ALU 224A is pipelined. The ALU 224A includes pipeline registers, data structures for storing intermediate results and circuitry for performing integer arithmetic, floating-point arithmetic, Boolean logic operations, branch condition comparisons and so forth. These components are not shown for ease of illustration. Each of the computation units within a given row across the lanes 220A-220M is the same computation unit. Each of these computation units operates on a same instruction, but different data associated with a different thread.
- Each of the lanes 220A-220M within the compute unit 210A accesses the cache 230 for instructions. In addition, the cache 230 stores operand data to load into the registers 222A. For embodiments performing video graphics rendering, the cache 230 is referred to as a level one (L1) texture cache. Each of the compute units 210A-210D has further access to a shared L2 cache (not shown) which acts as a global data share for the compute units 210A-210D. For example, in various embodiments, each of the compute units 210A-210D includes a cache controller placed logically at the top above the cache 230 to store and retrieve data from the shared L2 cache.
- As described earlier, each of the lanes 220A-220M processes data for a separate thread. Each of the compute units 210A-210D processes threads for a given work unit. An operating system (OS) scheduler or a user-level scheduler schedules workloads running on a computer system with the parallel architecture processor 200 using a variety of schemes such as a round-robin scheme, a priority scheme, an availability scheme or a combination. Alternatively, a programmer schedules the workloads in combination with the runtime system. In such a case, the programmer utilizes a software platform to perform the scheduling. For example, the OpenCL® (Open Computing Language) framework supports programming across heterogeneous computing environments and includes a low-level application programming interface (API) for heterogeneous computing.
- The OpenCL framework (generally referred to herein as "OpenCL") includes a C-like language interface used to define execution queues, wherein each queue is associated with an OpenCL device. An OpenCL device may be a general-purpose central processing unit (CPU), a GPU, or other unit with at least one processor core within a heterogeneous multi-core architecture. In the OpenCL framework, a function call is referred to as an OpenCL compute kernel, or simply a "compute kernel". A software programmer schedules the compute kernels in the execution queues. A compute kernel is matched with one or more records of data to produce one or more work units of computation. Each work unit has a unique identifier (ID). Each of the compute units 210A-210D is assigned one of the many work units by the OS or by the software programmer. Each of the lanes 220A-220M within a given one of the compute units 210A-210D is assigned a thread within the assigned work unit.
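- As an illustration of this assignment scheme, the sketch below maps work units onto compute units and threads onto lanes; the counts and the round-robin placement are assumptions for the example:

```python
# Illustrative mapping of work units to compute units and threads to lanes.
# The unit/lane counts and the round-robin placement are assumptions.
NUM_COMPUTE_UNITS = 4   # e.g. compute units 210A-210D
NUM_LANES = 16          # lanes per compute unit (count assumed)

def schedule(work_units):
    """work_units: dict of work-unit id -> number of threads in the unit."""
    placement = {}
    for i, (wu_id, n_threads) in enumerate(work_units.items()):
        placement[wu_id] = {
            "compute_unit": i % NUM_COMPUTE_UNITS,  # one work unit per unit
            "thread_to_lane": {t: t % NUM_LANES for t in range(n_threads)},
        }
    return placement

print(schedule({0: 16, 1: 16}))
```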
- Each of the lanes 220A-220M accesses the local data share 260. For example, in various embodiments, each of the lanes 220A-220M has allocated space within the local data share 260. Each of the lanes 220A-220M within a given one of the compute units 210A-210D has access to the allocated space of the other lanes within the same given compute unit. For example, lane 220A within the compute unit 210A has access to the allocated space within the local data store 260 assigned to the lane 220M within the compute unit 210A. The lanes 220A-220M within the compute unit 210A have access to each other's allocated space due to processing a same work unit.
- The requests generated by each of the lanes 220A-220M seek to access a block of data. In various embodiments, the block of data, or data block, is a set of bytes stored in contiguous memory locations. The number of bytes in a data block is varied according to design choice, and may be of any size. The scheduler 240 is used to schedule the access requests generated by the lanes 220A-220M within the compute unit 210A. The generated access requests are sent from the scheduler 240 to the local data store 260 via the arbitration control unit 250.
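- A minimal model of such an access request is sketched below; the field names are illustrative, and the 64-byte block is only an example of the size being a design choice:

```python
# Minimal model of an access request for a data block: a set of bytes
# stored in contiguous memory locations. Field names are illustrative.
from dataclasses import dataclass

@dataclass
class AccessRequest:
    lane_id: int
    address: int     # first byte of the data block
    num_bytes: int   # block size; any size, per design choice
    is_write: bool

# e.g. lane 3 reads a 64-byte block starting at address 0x1040
req = AccessRequest(lane_id=3, address=0x1040, num_bytes=64, is_write=False)
```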
- As shown, the local data share 260 is divided into multiple partitions 262A-262D. In various embodiments, each of the partitions 262A-262D is a separate partition which does not overlap with any other partition of the partitions 262A-262D. In some embodiments, each of the partitions 262A-262D includes a same amount of data storage. In other embodiments, one or more of the partitions 262A-262D includes less or more data storage than other ones of the partitions 262A-262D.
- In various embodiments, the assignments between the compute units 210A-210D and the partitions 262A-262D, in addition to the sizes of the partitions 262A-262D, are set by an operating system, a software programmer, a dedicated control unit or other. For example, in some embodiments, programmable control and status registers (not shown) store particular values to set the assignments. Firmware, an executing software application or other software is used to update the control and status registers to initially assign and subsequently reassign the compute units 210A-210D and the partitions 262A-262D in addition to defining the sizes of the partitions 262A-262D. In other embodiments, control logic implemented by hardware circuitry within the compute units 210A-210D and/or the arbitration control unit 250 performs the initial assignment, subsequent reassignments and resizing.
- In various embodiments, the arbitration control unit 250 is used to provide shared memory capability across the compute units 210A-210D. For example, in various embodiments, threads of a same work unit are scheduled across two or more of the compute units 210A-210D, rather than scheduled to a single one of the compute units 210A-210D. For efficient processing, communication between the lanes should expand beyond a single one of the compute units 210A-210D.
- In one example, the compute unit 210A is assigned to the partition 262A and the compute unit 210D is assigned to the partition 262D. However, later, threads of a same work unit are scheduled across the two compute units 210A and 210D. It is now possible for efficient execution that one or more of the lanes 220A-220M in the compute unit 210A needs to communicate with one or more lanes 220A-220M in the compute unit 210D. The arbitration control unit 250 identifies this situation and provides exclusive access to the local data store 260 for a selected one of the compute units 210A and 210D.
- The compute unit selected by the arbitration control unit 250 has exclusive access for a given duration of time. In various embodiments, the given duration is a single clock cycle. Therefore, in the above example, the compute units 210A and 210D alternate having exclusive access of the local data store 260 each clock cycle. In various embodiments, the given duration is programmable. In other embodiments, the duration is measured based on another number of clock cycles. In yet other embodiments, the given duration is measured based on a number of access requests, an encoded priority, an identifier (ID) of the requestor, an ID of a destination for the response data, a least-recently-selected scheme, and so forth. Further details of the logic used by the arbitration control unit 250 are next described.
- Referring now to FIG. 3, one embodiment of a method 300 for processing access requests targeting a shared resource is shown. For purposes of discussion, the steps in this embodiment (as well as in Figures 4-5) are shown in sequential order. However, in other embodiments some steps occur in a different order than shown, some steps are performed concurrently, some steps are combined with other steps, and some steps are absent.
- In various embodiments, multiple requestors are set up in a computing system to access a shared resource. The shared resource is divided into multiple partitions. Part of the setup process is assigning each of the multiple requestors to one of the multiple partitions (block 302). The assignments are based on logic implemented in hardware, software or a combination of the two. An operating system, a software programmer, a dedicated control unit or another entity performs the assignments. In addition, in some embodiments, the sizes of the partitions are also set during the setup process. When the last requestor has been assigned ("yes" branch of the conditional block 304), instructions of one or more software applications are processed by the computing system (block 306).
- During the processing of the one or more software applications, the active requestors generate access requests for the shared resource (block 308). In various embodiments, each generated access request identifies one of the multiple partitions. In some embodiments, the identification includes an identifier (ID) of a partition. In other embodiments, an indication, such as a field or encoding, indirectly identifies the partition, and control logic determines the identification based on the indication. In yet other embodiments, an address indirectly identifies the partition by indicating a data storage location within a given address range associated with the partition. By identifying one of the multiple partitions, the generated access request targets the identified partition. The targeted partition is either an assigned partition or an unassigned partition.
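To make the three identification styles concrete (an explicit partition ID, an indirect encoding decoded by control logic, and an address falling within a partition's range), here is a small sketch; the partition names, the encoding map and the address ranges are invented for the example.

```python
# Decode which partition an access request targets (illustrative only).

PARTITION_RANGES = {              # assumed address layout
    "262A": (0x0000, 0x4000),
    "262B": (0x4000, 0x8000),
}
ENCODING_MAP = {0: "262A", 1: "262B"}    # assumed indirect encoding

def identify_partition(request: dict) -> str:
    if "partition_id" in request:         # explicit partition ID
        return request["partition_id"]
    if "encoding" in request:             # indirect field decoded by logic
        return ENCODING_MAP[request["encoding"]]
    addr = request["address"]             # address within a partition's range
    for part, (base, limit) in PARTITION_RANGES.items():
        if base <= addr < limit:
            return part
    raise ValueError("address outside the shared resource")

print(identify_partition({"address": 0x4100}))   # -> 262B
```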
- If no generated access requests target an unassigned partition ("no" branch of the conditional block 310), then the access requests are serviced based on the assignments (block 312). Each access request is permitted to access its assigned partition. However, if any generated access request targets an unassigned partition ("yes" branch of the conditional block 310), then the access requests are serviced based on the arbitration allowing exclusive access to the entire shared resource (block 314). For example, each one of the active requestors gains exclusive access to the entire shared resource for a given duration. In various embodiments, the given duration is measured based on a number of clock cycles. In other embodiments, the given duration is measured based on a number of access requests. In various embodiments, the given duration is programmable. In some embodiments, the given duration is further based on an encoded priority, an identifier (ID) of the requestor, an ID of a destination for the response data, a least-recently-selected scheme, and so forth.
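The conditional at block 310 is the heart of the dual-mode behavior. A minimal sketch, assuming simple (requestor, partition) pairs and a dictionary of assignments, neither of which is specified by the embodiments above:

```python
# Blocks 310-314 as a toy decision.

def service(requests, assignment):
    """requests: list of (requestor, partition) pairs;
    assignment: requestor -> assigned partition."""
    conflict = any(assignment[r] != p for r, p in requests)
    if not conflict:
        # Block 312: every requestor accesses only its assigned partition.
        return [f"{r} accesses assigned partition {p}" for r, p in requests]
    # Block 314: serialize; each active requestor gains exclusive access
    # to the entire shared resource in turn.
    return [f"{r} gains exclusive access to all partitions" for r, _ in requests]

assignment = {"CU0": "P0", "CU1": "P1"}
print(service([("CU0", "P0"), ("CU1", "P1")], assignment))   # first mode
print(service([("CU0", "P1"), ("CU1", "P1")], assignment))   # second mode
```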
- Turning now to
FIG. 4, another embodiment of a method 400 for processing access requests targeting a shared resource is shown. Multiple requestors have been assigned to partitions within a shared resource. As described earlier, the requestors generate access requests identifying one of the partitions. If no generated access requests target an unassigned partition ("no" branch of the conditional block 402), then the access requests are serviced based on accessing the assigned partitions (block 404). Each access request is permitted to access its assigned partition. In various embodiments, unshared partitions are accessed simultaneously. The processing of the instructions continues (block 406) and the requestors generate access requests.
- If any generated access request targets an unassigned partition ("yes" branch of the conditional block 402), then one requestor is selected for non-conflicting access of the shared resource (block 408). In various embodiments, the selected requestor is the requestor that generated the access request targeting the unassigned partition. In other embodiments, the selected requestor is the requestor which is currently the least-recently-selected requestor. In some embodiments, being the least-recently-selected requestor is based on the time since the last access request was serviced for the requestor. In other embodiments, being the least-recently-selected requestor is based on a number of access requests serviced for the requestor. In some embodiments, selection is further based on an encoded priority, an ID of the requestor, identification of the operations being processed by computational units associated with the requestor, and so forth.
- The unselected requestors are stalled (block 410). In some embodiments, stalling includes preventing the unselected requestors from sending access requests for the shared resource. In other embodiments, stalling includes not selecting access requests stored in a request queue from the unselected requestors. In some embodiments, the IDs of the unselected requestors are used to identify the access requests to ignore in the queue.
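The queue-filtering style of stalling can be pictured as skipping entries whose requestor ID is not the currently selected one. In the sketch below, the request format and the use of a simple deque are assumptions.

```python
# Stall by ignoring queued requests from unselected requestors (toy model).
from collections import deque

def next_serviceable(queue: deque, selected_id: str):
    """Return the oldest request from the selected requestor; requests
    from stalled requestors stay in the queue and are not serviced."""
    for _ in range(len(queue)):
        req = queue.popleft()
        if req["requestor"] == selected_id:
            return req
        queue.append(req)        # stalled requestor: keep queued, skip
    return None                  # nothing serviceable this round

q = deque([{"requestor": "CU1", "op": "load"},
           {"requestor": "CU0", "op": "store"}])
print(next_serviceable(q, selected_id="CU0"))   # CU0's store is serviced
```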
- Any partition in the shared resource is available for access by the access requests generated by the selected requestor (block 412). Access requests generated by the selected requestor have exclusive access to the shared resource for a given duration of time. As described earlier, in some embodiments, the given duration is measured based on a number of clock cycles. In other embodiments, the given duration is measured based on a number of access requests. In various embodiments, the given duration is programmable. In some embodiments, the given duration is further based on an encoded priority, an identifier (ID) of the requestor, an ID of a destination for the response data, a least-recently-selected scheme, and so forth.
- When the given duration is reached, an indication is set to switch selection of requestors using arbitration. The currently selected requestor is deselected and stalled. Another active requestor is selected based on the arbitration criteria used earlier such as the criteria described for the selecting step in
block 408. The selection based on arbitration logic continues until the current workload is completed or a reset is forced. The processing of the instructions continues (block 406) and the requestors generate access requests. As can be seen from the above, the access requests are processed in one of two modes. If no generated access requests target an unassigned partition, then processing continues in a first mode where the assigned partitions are available for servicing the access requests. However, if any generated access request targets an unassigned partition, then processing switches to a second mode where the requestors are selected in turn for exclusive access to the entire shared resource.
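Putting the two modes together, the sketch below runs in the first (partitioned) mode until a request targets an unassigned partition, then rotates exclusive grants among the requestors, each grant lasting a fixed duration. The schedule format, the rotation order and the default one-cycle duration are assumptions, not the claimed implementation.

```python
# Two-mode arbitration loop (illustrative sketch).

def arbitrate(schedule, assignment, duration=1):
    """schedule: per-cycle lists of (requestor, partition) requests."""
    requestors = sorted(assignment)
    mode, owner_idx, held = "partitioned", 0, 0
    for clk, requests in enumerate(schedule):
        if mode == "partitioned" and any(assignment[r] != p for r, p in requests):
            mode = "exclusive"                     # switch to the second mode
        if mode == "partitioned":
            print(f"cycle {clk}: requestors use their assigned partitions")
        else:
            owner = requestors[owner_idx]
            print(f"cycle {clk}: {owner} holds the entire shared resource")
            held += 1
            if held >= duration:                   # grant duration reached
                owner_idx = (owner_idx + 1) % len(requestors)
                held = 0

assignment = {"CU0": "P0", "CU1": "P1"}
arbitrate([[("CU0", "P0")], [("CU0", "P1")], [("CU1", "P0")]], assignment)
```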
- Turning now to FIG. 5, a generalized block diagram of one embodiment of a method 500 for selecting sources of access requests for use of a shared resource is shown. Multiple requestors have been assigned to partitions within a shared resource. As described earlier, the requestors generate access requests identifying one of the partitions. It is determined that at least one active requestor requests access to an unassigned partition of the resource (block 502). One of the active requestors is selected as the next requestor to have exclusive access to the entire resource (block 504). As described earlier, many factors are considered for selection, such as a least-recently-selected scheme, an encoded priority, a number of pending access requests, a number of access requests already serviced, an indication of the computation being performed by an associated computational unit, an age of the current outstanding requests, and so forth.
- In various embodiments, the selected requestor has exclusive access to each partition of the shared resource for a given duration. As described earlier, the given duration is based on a variety of factors. If the selected requestor has not yet accessed the shared resource for the given duration ("no" branch of the conditional block 506), then the selected requestor maintains selection and continues to access the shared resource with exclusive access (block 508). However, if the selected requestor did access the shared resource for the given duration ("yes" branch of the conditional block 506), then the selected requestor is deselected (block 510).
- An indication is set indicating the requestor is the most-recently-selected requestor (block 512). If the workload for the requestors is not yet completed ("no" branch of the conditional block 514), then control flow of method 500 returns to block 504, where another requestor is selected for exclusive access to the shared resource. If the workload for the requestors is completed ("yes" branch of the conditional block 514), then selection of the requestors is also completed (block 516). Should another workload be assigned to the requestors, in some embodiments, the mode of operation resets to providing access to only the assigned partitions of the shared resource.
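One way to picture the loop through blocks 504-512 is to key selection on when each requestor last held the grant. In the sketch below, per-requestor timestamps stand in for the most-recently-selected indication; the bookkeeping and the fixed duration are assumptions.

```python
# Least-recently-selected rotation for exclusive grants (toy model).

def run_workload(requestors, grants_needed, duration=2):
    last_selected = {r: -1 for r in requestors}    # -1: never selected
    clk = 0
    for _ in range(grants_needed):
        # Block 504: pick the least-recently-selected active requestor.
        owner = min(requestors, key=lambda r: last_selected[r])
        clk += duration                            # blocks 506/508: hold grant
        last_selected[owner] = clk                 # block 512: mark as most recent
        print(f"t={clk}: {owner} completed its exclusive turn")

run_workload(["CU0", "CU1", "CU2"], grants_needed=4)
```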
- It is noted that one or more of the above-described embodiments include software. In such embodiments, the program instructions that implement the methods and/or mechanisms are conveyed or stored on a computer readable medium. Numerous types of media which are configured to store program instructions are available and include hard disks, floppy disks, CD-ROM, DVD, flash memory, programmable ROMs (PROM), random access memory (RAM), and various other forms of volatile or non-volatile storage. Generally speaking, a computer accessible storage medium includes any storage media accessible by a computer during use to provide instructions and/or data to the computer. For example, a computer accessible storage medium includes storage media such as magnetic or optical media, e.g., disk (fixed or removable), tape, CD-ROM, DVD-ROM, CD-R, CD-RW, DVD-R, DVD-RW, or Blu-Ray. Storage media further include volatile or non-volatile memory media such as RAM (e.g., synchronous dynamic RAM (SDRAM), double data rate (DDR, DDR2, DDR3, etc.) SDRAM, low-power DDR (LPDDR2, etc.) SDRAM, Rambus DRAM (RDRAM), static RAM (SRAM), etc.), ROM, and non-volatile memory (e.g., Flash memory) accessible via a peripheral interface such as the Universal Serial Bus (USB) interface. Storage media include microelectromechanical systems (MEMS), as well as storage media accessible via a communication medium such as a network and/or a wireless link.
- Additionally, in various embodiments, program instructions include behavioral-level descriptions or register-transfer level (RTL) descriptions of the hardware functionality in a high level programming language such as C, a hardware description language (HDL) such as Verilog or VHDL, or a database format such as the GDS II stream format (GDSII). In some cases the description is read by a synthesis tool, which synthesizes the description to produce a netlist including a list of gates from a synthesis library. The netlist includes a set of gates which also represent the functionality of the hardware including the system. The netlist is then placed and routed to produce a data set describing geometric shapes to be applied to masks. The masks are then used in various semiconductor fabrication steps to produce a semiconductor circuit or circuits corresponding to the system. Alternatively, the instructions on the computer accessible storage medium are the netlist (with or without the synthesis library) or the data set, as desired. Additionally, the instructions are utilized for purposes of emulation by a hardware-based emulator from such vendors as Cadence®, EVE®, and Mentor Graphics®.
- Although the embodiments above have been described in considerable detail, numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.
Claims (14)
- A computing system comprising:
  a shared resource comprising a plurality of partitions;
  a plurality of requestors, each assigned to a different partition of the plurality of partitions of the shared resource; and
  a controller coupled to the shared resource, wherein in response to receiving a request for access to a given partition from a first requestor of the plurality of requestors, the controller is configured to:
  provide the first requestor with access to only the given partition, in response to determining the given partition is assigned to the first requestor; and
  provide the first requestor with access to all partitions of the plurality of partitions, in response to determining the given partition is not assigned to the first requestor.
- The computing system as claimed in claim 1, wherein the controller is further configured to:
  stall access of the shared resource for the first requestor; and
  mark the first requestor as the most recently selected active requestor of the plurality of requestors.
- The computing system as claimed in claim 2, wherein the controller is further configured to:
  select a second requestor different from the first requestor of the plurality of requestors;
  remove the stall for the selected second requestor; and
  provide the second requestor with access to all partitions of the plurality of partitions.
- The computing system as claimed in claim 1, wherein the shared resource is a local data store in a graphics processing unit and each of the plurality of requestors is a single instruction multiple data (SIMD) compute unit.
- A method comprising:
  assigning each of a plurality of requestors to a different partition of a plurality of partitions of a shared resource;
  in response to receiving a request for access to a given partition from a first requestor of a plurality of requestors:
  providing the first requestor with access to only the given partition, in response to determining the given partition is assigned to the first requestor; and
  providing the first requestor with access to all partitions of the plurality of partitions, in response to determining the given partition is not assigned to the first requestor.
- The method as claimed in claim 5, wherein the first requestor is a least recently selected active requestor of the plurality of requestors.
- The method as claimed in claim 5, further comprising deselecting the first requestor responsive to:
  determining completion of a given number of access requests for the first requestor; and
  determining the plurality of requestors have more access requests.
- The method as claimed in claim 7, wherein the given number of access requests is a number of access requests serviced within a single clock cycle.
- The method as claimed in claim 8, further comprising:
  stalling access of the shared resource for the first requestor; and
  marking the first requestor as the most recently selected active requestor of the plurality of requestors.
- The method as claimed in claim 7, further comprising:
  selecting a second requestor different from the first requestor of the plurality of requestors;
  removing the stall for the selected second requestor; and
  permitting access of any of the plurality of partitions for the second requestor.
- The method as claimed in claim 7, wherein the shared resource is a local data store in a graphics processing unit and each of the plurality of requestors is a single instruction multiple data (SIMD) compute unit.
- A controller comprising:
  a first interface coupled to a shared resource comprising a plurality of partitions;
  a second interface coupled to a plurality of requestors, each assigned to a different partition of the plurality of partitions of the shared resource; and
  a control unit, wherein in response to receiving a request for access to a given partition from a first requestor of the plurality of requestors, the control unit is configured to:
  provide the first requestor with access to only the given partition, in response to determining the given partition is assigned to the first requestor; and
  provide the first requestor with access to all partitions of the plurality of partitions, in response to determining the given partition is not assigned to the first requestor.
- The controller as claimed in claim 12, wherein the control unit is further configured to deselect the first requestor responsive to:
  determining completion of a given number of access requests for the first requestor; and
  determining the plurality of requestors have more access requests.
- The controller as claimed in claim 12, further comprising stalling access to the shared resource for each of the plurality of requestors other than the first requestor when providing the first requestor with access to all partitions.
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/360,205 US10073783B2 (en) | 2016-11-23 | 2016-11-23 | Dual mode local data store |
Publications (1)
Publication Number | Publication Date |
---|---|
EP3327570A1 true EP3327570A1 (en) | 2018-05-30 |
Family
ID=57544333
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP16203672.7A Ceased EP3327570A1 (en) | 2016-11-23 | 2016-12-13 | Dual mode local data store |
Country Status (6)
Country | Link |
---|---|
US (1) | US10073783B2 (en) |
EP (1) | EP3327570A1 (en) |
JP (1) | JP7246308B2 (en) |
KR (1) | KR102493859B1 (en) |
CN (1) | CN110023904B (en) |
WO (1) | WO2018098183A1 (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10445850B2 (en) * | 2015-08-26 | 2019-10-15 | Intel Corporation | Technologies for offloading network packet processing to a GPU |
JP6979139B2 (en) * | 2019-03-08 | 2021-12-08 | モービルアイ ヴィジョン テクノロジーズ リミテッド | Priority-based management of access to shared resources |
CN111506350A (en) * | 2020-04-30 | 2020-08-07 | 中科院计算所西部高等技术研究院 | Streaming processor with OODA circular partitioning mechanism |
TWI817039B (en) * | 2020-09-08 | 2023-10-01 | 以色列商無比視視覺科技有限公司 | Method and system for managing access of multiple initiators to shared resources and related computer program product |
CN112631780A (en) * | 2020-12-28 | 2021-04-09 | 浙江大华技术股份有限公司 | Resource scheduling method and device, storage medium and electronic equipment |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2009145917A1 (en) * | 2008-05-30 | 2009-12-03 | Advanced Micro Devices, Inc. | Local and global data share |
US8667200B1 (en) * | 2009-09-22 | 2014-03-04 | Nvidia Corporation | Fast and highly scalable quota-based weighted arbitration |
Family Cites Families (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5265232A (en) | 1991-04-03 | 1993-11-23 | International Business Machines Corporation | Coherence control by data invalidation in selected processor caches without broadcasting to processor caches not having the data |
US5584017A (en) | 1991-12-19 | 1996-12-10 | Intel Corporation | Cache control which inhibits snoop cycles if processor accessing memory is the only processor allowed to cache the memory location |
JP2906805B2 (en) * | 1992-02-20 | 1999-06-21 | 富士通株式会社 | Memory sharing type multiprocessor system |
US6044446A (en) | 1997-07-01 | 2000-03-28 | Sun Microsystems, Inc. | Mechanism to reduce interprocessor traffic in a shared memory multi-processor computer system |
US7096323B1 (en) | 2002-09-27 | 2006-08-22 | Advanced Micro Devices, Inc. | Computer system with processor cache that stores remote cache presence information |
US6868485B1 (en) | 2002-09-27 | 2005-03-15 | Advanced Micro Devices, Inc. | Computer system with integrated directory and processor cache |
US7047322B1 (en) | 2003-09-30 | 2006-05-16 | Unisys Corporation | System and method for performing conflict resolution and flow control in a multiprocessor system |
US7023445B1 (en) | 2004-04-12 | 2006-04-04 | Advanced Micro Devices, Inc. | CPU and graphics unit with shared cache |
US7360032B2 (en) | 2005-07-19 | 2008-04-15 | International Business Machines Corporation | Method, apparatus, and computer program product for a cache coherency protocol state that predicts locations of modified memory blocks |
US7472253B1 (en) | 2006-09-22 | 2008-12-30 | Sun Microsystems, Inc. | System and method for managing table lookaside buffer performance |
US7669011B2 (en) | 2006-12-21 | 2010-02-23 | Advanced Micro Devices, Inc. | Method and apparatus for detecting and tracking private pages in a shared memory multiprocessor |
JP5233437B2 (en) | 2008-06-23 | 2013-07-10 | トヨタ車体株式会社 | Automotive display device, protective panel shape design method and shape design device |
US20120159090A1 (en) * | 2010-12-16 | 2012-06-21 | Microsoft Corporation | Scalable multimedia computer system architecture with qos guarantees |
US10860384B2 (en) * | 2012-02-03 | 2020-12-08 | Microsoft Technology Licensing, Llc | Managing partitions in a scalable environment |
US9864638B2 (en) * | 2012-06-22 | 2018-01-09 | Intel Corporation | Techniques for accessing a graphical processing unit memory by an application |
US10546558B2 (en) * | 2014-04-25 | 2020-01-28 | Apple Inc. | Request aggregation with opportunism |
CN105224886B (en) * | 2014-06-26 | 2018-12-07 | 中国移动通信集团甘肃有限公司 | A kind of mobile terminal safety partition method, device and mobile terminal |
- 2016
  - 2016-11-23: US US15/360,205 patent/US10073783B2/en active Active
  - 2016-12-13: EP EP16203672.7A patent/EP3327570A1/en not_active Ceased
- 2017
  - 2017-11-21: CN CN201780072221.1A patent/CN110023904B/en active Active
  - 2017-11-21: WO PCT/US2017/062853 patent/WO2018098183A1/en active Application Filing
  - 2017-11-21: KR KR1020197017661A patent/KR102493859B1/en active IP Right Grant
  - 2017-11-21: JP JP2019527881A patent/JP7246308B2/en active Active
Also Published As
Publication number | Publication date |
---|---|
KR102493859B1 (en) | 2023-01-31 |
US10073783B2 (en) | 2018-09-11 |
CN110023904A (en) | 2019-07-16 |
JP7246308B2 (en) | 2023-03-27 |
US20180143907A1 (en) | 2018-05-24 |
WO2018098183A1 (en) | 2018-05-31 |
KR20190082308A (en) | 2019-07-09 |
JP2020500379A (en) | 2020-01-09 |
CN110023904B (en) | 2021-11-02 |
Similar Documents
Publication | Title
---|---
JP7313381B2 | Embedded scheduling of hardware resources for hardware acceleration
US10592279B2 | Multi-processor apparatus and method of detection and acceleration of lagging tasks
US10073783B2 | Dual mode local data store
JP6571078B2 | Parallel processing device for accessing memory, computer-implemented method, system, computer-readable medium
KR101835056B1 | Dynamic mapping of logical cores
US10007527B2 | Uniform load processing for parallel thread sub-sets
US9710306B2 | Methods and apparatus for auto-throttling encapsulated compute tasks
US9069609B2 | Scheduling and execution of compute tasks
JP6260303B2 | Arithmetic processing device and control method of arithmetic processing device
CN103197953A | Speculative execution and rollback
US10019283B2 | Predicting a context portion to move between a context buffer and registers based on context portions previously used by at least one other thread
KR20130116166A | Multithread application-aware memory scheduling scheme for multi-core processors
US11537397B2 | Compiler-assisted inter-SIMD-group register sharing
US11934698B2 | Process isolation for a processor-in-memory ("PIM") device
US9715413B2 | Execution state analysis for assigning tasks to streaming multiprocessors
JP6201591B2 | Information processing apparatus and information processing apparatus control method
US9760969B2 | Graphic processing system and method thereof
US20140189701A1 | Methods, systems and apparatuses for processor selection in multi-processor systems
US20070101332A1 | Method and apparatus for resource-based thread allocation in a multiprocessor computer system
US8910181B2 | Divided central data processing
US9898333B2 | Method and apparatus for selecting preemption technique
US11809874B2 | Conditional instructions distribution and execution on pipelines having different latencies for mispredictions
CN117859114A | Processing apparatus and method for sharing storage among cache memory, local data storage, and register file
US20220206851A1 | Regenerative work-groups
TW202242638A | Instruction dispatch for superscalar processors
Legal Events
Code | Title | Description
---|---|---
STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: EXAMINATION IS IN PROGRESS
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase | Free format text: ORIGINAL CODE: 0009012
17P | Request for examination filed | Effective date: 20161213
AK | Designated contracting states | Kind code of ref document: A1. Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR
AX | Request for extension of the european patent | Extension state: BA ME
STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: THE APPLICATION HAS BEEN REFUSED
18R | Application refused | Effective date: 20190114