CN118279125B - Method for realizing light general graphic processor - Google Patents
- Publication number
- CN118279125B (application CN202410712527.0A)
- Authority
- CN
- China
- Prior art keywords
- instruction
- data
- subunit
- memory
- graphics processor
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T1/00—General purpose image data processing
- G06T1/20—Processor architectures; Processor configuration, e.g. pipelining
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5011—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
- G06F9/5016—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to the technical field of general-purpose graphics processors, and in particular to a method for implementing a lightweight general-purpose graphics processor. The implementation method is based on the OpenCL (Open Computing Language) programming framework and the RISC-V fifth-generation reduced instruction set, adopts a single-instruction multiple-thread (SIMT) computing model, and uses a high-bandwidth memory (HBM) cache as the global memory, enabling rapid deployment of a lightweight general-purpose graphics processor. By adopting a pipelined design, the method minimizes memory usage and software complexity, improves the utilization of hardware resources, reduces energy consumption, and achieves a higher energy-efficiency ratio; at the same time, the RISC-V instruction set is easier to deploy and develop, offering flexible customizability and adaptability.
Description
Technical Field
The invention relates to the technical field of general-purpose graphics processors, and in particular to a method for implementing a lightweight general-purpose graphics processor.
Background
The recent rise of artificial intelligence involves a great deal of data processing and model training. Deep learning, a common method in artificial intelligence, requires matrix operations on large amounts of data and therefore involves extensive parallelization and vectorization.
Compared with a traditional CPU, a general-purpose graphics processor (GPGPU) handles high-performance computing tasks by exploiting its many-core structure, multithreading, and high memory bandwidth; it has more computing units and higher bandwidth for executing parallelized and vectorized operations. However, the large number of computing cores also means high power consumption, and balancing performance against power consumption has gradually become the main optimization direction for current general-purpose graphics processors.
Addressing the low energy-efficiency ratio of mainstream general-purpose graphics processors, the invention provides a method for implementing a lightweight general-purpose graphics processor.
Disclosure of Invention
To remedy the defects of the prior art, the invention provides a simple and efficient method for implementing a lightweight general-purpose graphics processor.
The invention is realized by the following technical scheme:
A method for implementing a lightweight general-purpose graphics processor adopts a single-instruction multiple-thread SIMT (Single Instruction Multiple Threads) computing model based on the OpenCL (Open Computing Language) programming framework and the RISC-V fifth-generation reduced instruction set, and uses a high-bandwidth memory HBM (High Bandwidth Memory) cache as the global memory, thereby achieving rapid deployment of a lightweight general-purpose graphics processor with a high energy-efficiency ratio.
The computing platform is divided into a host side (host) and a device side (device) based on the OpenCL programming framework, and the general-purpose graphics processor core is deployed on the device side;
The general-purpose graphics processor core integrates a lightweight RISC-V fifth-generation reduced-instruction processor unit and a stream compute engine unit; the RISC-V fifth-generation reduced-instruction processor unit is responsible only for scheduling instructions and data, and offloads the execution of the data to the Stream Compute Engine Unit (SCU);
Before entering the stream compute engine unit, the data is in a compressed state. After entering the stream compute engine unit, the data is decompressed by the data packet transformation subunit and then passed to the data packet computation subunit for the operation; after the operation is completed, the data is compressed again by the data packet transformation subunit and cached to memory through the write-back subunit of the RISC-V fifth-generation reduced-instruction processor unit.
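A minimal Python sketch of this compressed-in/compressed-out data flow: `zlib` stands in for the patent's unspecified packet compression scheme, and the element-wise square is a placeholder operation, not something taken from the patent.

```python
import json
import zlib


def shaper_decompress(packet: bytes) -> list:
    """Packet-transformation stage: recover the original operands from a compressed packet."""
    return json.loads(zlib.decompress(packet))


def shaper_compress(values: list) -> bytes:
    """Packet-transformation stage: re-compress results before write-back."""
    return zlib.compress(json.dumps(values).encode())


def stream_compute_engine(packet: bytes, op) -> bytes:
    """Model of the SCU: decompress -> compute element-wise -> recompress."""
    operands = shaper_decompress(packet)   # transformation subunit: unpack
    results = [op(x) for x in operands]    # computation subunit: operate
    return shaper_compress(results)        # transformation subunit: repack


# Data stays compressed outside the engine, as in the described flow.
inbound = shaper_compress([1.0, 2.0, 3.0])
outbound = stream_compute_engine(inbound, lambda x: x * x)
```

Keeping data compressed everywhere outside the engine is what lets the write-back path and caches move smaller packets, which is the stated motivation for this arrangement.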
The host side is responsible for data interaction, resource allocation, and device management;
the device side is responsible for completing the deployment of the lightweight general-purpose graphics processor and executing the general-purpose graphics processor kernel;
The computing resource on the device side consists of several Compute Units (CUs); each compute unit in turn consists of several Processing Elements (PEs), and the computation on the device side is completed in the processing elements.
Each instance of the general-purpose graphics processor kernel executing on the device side is called a work-item (or thread); several instances are organized into a thread bundle (wavefront), and threads in the same wavefront execute in parallel.
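The grouping of kernel instances into wavefronts can be sketched as follows; the wavefront size of 32 is an illustrative assumption, since the patent does not specify one.

```python
def build_wavefronts(num_work_items: int, wavefront_size: int = 32):
    """Group work-item (thread) ids into wavefronts of at most
    `wavefront_size` threads; the last wavefront may be partial."""
    return [
        list(range(start, min(start + wavefront_size, num_work_items)))
        for start in range(0, num_work_items, wavefront_size)
    ]


# 70 kernel instances -> two full wavefronts and one partial one.
waves = build_wavefronts(70)
```

Threads inside one such group are the ones the scheduler treats as a unit and issues in lockstep.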
The RISC-V fifth-generation reduced-instruction processor unit adopts a classic five-stage pipeline comprising a thread-scheduling subunit, an instruction-fetch subunit, a decode subunit, a dispatch subunit, and a write-back subunit.
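A toy model of these five subunits chained together; the textual instruction format and the `add`/`mul` opcodes are illustrative assumptions, not taken from the patent.

```python
def schedule(program):            # thread-scheduling subunit: pick the next PC
    for pc in range(len(program)):
        yield pc

def fetch(program, pc):           # instruction-fetch subunit
    return program[pc]

def decode(raw):                  # decode subunit: split opcode and operands
    op, *args = raw.split()
    return {"op": op, "args": [int(a) for a in args]}

def dispatch(inst):               # dispatch subunit: route to an executor
    executors = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
    return executors[inst["op"]](*inst["args"])

def write_back(results, value):   # write-back subunit: commit the result
    results.append(value)

program = ["add 2 3", "mul 4 5"]
results = []
for pc in schedule(program):
    write_back(results, dispatch(decode(fetch(program, pc))))
```

Each function corresponds to one named subunit, so an instruction flows through the same five stages the description enumerates.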
The host side generates a large number of parallel computing tasks. By calling the general-purpose graphics processor kernel application programming interface API (Application Programming Interface) functions integrated with the OpenCL programming framework, it transmits instruction parameters to the general-purpose graphics processor core deployed on the device side via the direct memory access XDMA technique, transfers the compressed data to be operated on from host memory to the global memory (high-bandwidth memory HBM), and generates a start signal through the Xilinx Runtime (XRT) API to run the general-purpose graphics processor core.
The instruction parameters are issued via the direct memory access XDMA technique to an Instruction pre-Decode Unit (IDU) of the general-purpose graphics processor core, which pre-parses them and completes the restructuring of their data layout; the instruction parameters are then forwarded to the instruction cache L1-Icache, and the compressed data to be operated on is forwarded to the data cache L1-Dcache.
In the RISC-V fifth-generation reduced-instruction processor unit, if the thread-scheduling subunit (Warp Schedule Unit, WSU) detects an instruction cached in the instruction cache L1-Icache (Instruction Cache), it reads the instruction and builds a thread schedule, classifying each thread bundle into one of three states: active, blocked, or waiting:
When a thread bundle is in the active state, the Fetch Unit (FU) is notified to fetch an instruction from the instruction cache L1-Icache according to the instruction address and forward it to the decode subunit; in the decode subunit (Decode Unit, DU), the instruction and operation data are parsed and cached in the instruction buffer Ibuffer (Instruction Buffer, IB); if the dispatch subunit (Issue Unit, IU) detects an instruction to be dispatched, the compressed data to be operated on is read according to the instruction and forwarded to the stream compute engine unit.
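The thread-scheduling subunit's three-state bookkeeping might look like the following sketch. Only the state names come from the description; the round-robin selection policy is an assumption.

```python
from collections import deque

# The three thread-bundle states named in the description.
ACTIVE, BLOCKED, WAITING = "active", "blocked", "waiting"


class WarpScheduler:
    """Round-robin model of the WSU: repeatedly hand out the next
    active warp, skipping blocked and waiting ones."""

    def __init__(self, states):
        self.states = dict(states)             # warp id -> state
        self.queue = deque(sorted(self.states))

    def next_active(self):
        """Return the next warp eligible for instruction fetch, or None."""
        for _ in range(len(self.queue)):
            warp = self.queue[0]
            self.queue.rotate(-1)              # advance round-robin pointer
            if self.states[warp] == ACTIVE:
                return warp
        return None


sched = WarpScheduler({0: BLOCKED, 1: ACTIVE, 2: WAITING, 3: ACTIVE})
```

Only warps returned by `next_active` would trigger the fetch notification described above; blocked and waiting warps stay parked until their state changes.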
After the compressed data to be operated on is forwarded to the stream compute engine unit, the data packet transformation subunit (Package Shaper Unit, PSU) decompresses it to recover the original data and forwards the original data to the data packet computation subunit (Package Compute Unit, PCU); according to the operational requirement of the instruction, logic operations, floating-point operations and/or special-function operations are completed in the data packet computation subunit; after the operation finishes, the result is written back to shared memory and then dispatched back to the global memory (high-bandwidth memory HBM) through the RISC-V fifth-generation reduced-instruction processor unit, and the device side generates a configuration-complete Done signal to notify the host side that the operation has finished.
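The three operation classes handled by the data packet computation subunit can be sketched as a dispatch table; the individual opcodes (`and`, `fadd`, `rsqrt`, ...) are hypothetical examples, since the patent names only the classes.

```python
import math

# Illustrative grouping of the three operation classes the PCU completes.
LOGIC_OPS = {"and": lambda a, b: a & b, "or": lambda a, b: a | b}
FLOAT_OPS = {"fadd": lambda a, b: a + b, "fmul": lambda a, b: a * b}
SPECIAL_OPS = {"rsqrt": lambda a: 1.0 / math.sqrt(a), "exp": math.exp}


def pcu_execute(op: str, *args):
    """Route an instruction to the logic, floating-point, or
    special-function path according to its opcode."""
    for table in (LOGIC_OPS, FLOAT_OPS, SPECIAL_OPS):
        if op in table:
            return table[op](*args)
    raise ValueError(f"unsupported opcode: {op}")
```

Splitting the tables this way mirrors the description's "logic operation, floating-point operation and/or special-function operation" partition of the PCU's work.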
After the host side polls the Done signal sent by the device side, the run of the general-purpose graphics processor kernel ends: the operation result in the device-side global memory (HBM) is synchronized back to host memory via the direct memory access XDMA technique, and the transfer link between host memory and the device-side global memory (HBM) is released, completing the operation.
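A toy model of this completion handshake, with a dict standing in for the device-side HBM and a method call standing in for the XDMA copy; all names here are illustrative.

```python
class Link:
    """Models the host <-> device transfer link and the Done handshake."""

    def __init__(self):
        self.open = True
        self.done = False
        self.device_hbm = {}

    def device_finish(self, results):
        """Device side: write results to global HBM and raise Done."""
        self.device_hbm["out"] = results
        self.done = True

    def host_poll_and_sync(self):
        """Host side: check Done, copy results back, release the link."""
        if not self.done:
            raise RuntimeError("kernel still running")
        host_mem = list(self.device_hbm["out"])  # stands in for the XDMA copy
        self.open = False                        # release the transfer link
        return host_mem


link = Link()
link.device_finish([42, 7])
result = link.host_poll_and_sync()
```

A real host would poll `done` in a loop rather than raise; the sketch only shows the order of events: Done raised, result synchronized back, link released.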
An implementation device of a lightweight general-purpose graphics processor includes a memory and a processor; the memory is used for storing a computer program, and the processor is used for implementing the method steps described above when executing the computer program.
A readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method steps as described above.
The beneficial effects of the invention are as follows: the implementation method of the lightweight general-purpose graphics processor adopts a pipelined mode, thereby minimizing memory usage and software complexity, improving the utilization of hardware resources, reducing energy consumption, and achieving a higher energy-efficiency ratio; meanwhile, the RISC-V instruction set is easier to deploy and develop, offering flexible customizability and adaptability.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and other drawings can be derived from them by a person skilled in the art without inventive effort.
FIG. 1 is a schematic diagram of a deployment method of a lightweight general-purpose graphics processor according to the present invention.
FIG. 2 is a schematic diagram of a lightweight general purpose graphics processor module architecture according to the present invention.
Detailed Description
To enable those skilled in the art to better understand the technical solution of the present invention, the technical solution is described below clearly and completely in combination with the embodiments of the present invention. It is apparent that the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.
RISC-V, the fifth-generation reduced instruction set, is an open-source reduced instruction set architecture (ISA) that attracts attention for its efficiency, controllability, and autonomy.
The above examples are only some specific embodiments of the present invention; ordinary changes and substitutions made by those skilled in the art within the scope of the technical solution of the present invention shall fall within the protection scope of the present invention.
Claims (8)
1. A method for implementing a lightweight general-purpose graphics processor, characterized in that: based on the open computing language OpenCL programming framework and the RISC-V fifth-generation reduced instruction set, a single-instruction multiple-thread SIMT computing model is adopted, and a high-bandwidth memory HBM cache is adopted as the global memory;
Dividing a computing platform into a host end and a device end based on an open computing language OpenCL programming framework, and arranging a general-purpose graphic processor kernel at the device end;
the general-purpose graphics processor core integrates a lightweight RISC-V fifth-generation reduced-instruction processor unit and a stream compute engine unit; the RISC-V fifth-generation reduced-instruction processor unit is responsible only for scheduling instructions and data, and offloads the execution of the data to the stream compute engine unit;
The RISC-V fifth generation reduced instruction processor unit adopts a classical five-stage pipeline and comprises a thread scheduling subunit, a fetching subunit, a decoding subunit, a dispatch subunit and a write-back subunit;
In the RISC-V fifth-generation reduced-instruction processor unit, if the thread-scheduling subunit detects an instruction cached in the instruction cache L1-Icache, the instruction is read and a thread schedule is constructed, classifying each thread bundle into one of three states: active, blocked, or waiting;
When the thread bundle is in the active state, the instruction-fetch subunit is notified to fetch an instruction from the instruction cache L1-Icache according to the instruction address and forward it to the decode subunit; the instruction and the operation data are parsed in the decode subunit and cached in the instruction buffer Ibuffer; if the dispatch subunit detects an instruction to be dispatched, the compressed data to be operated on is read according to the instruction and forwarded to the stream compute engine unit;
Before entering the stream compute engine unit, the data is in a compressed state; after entering the stream compute engine unit, the data is decompressed by the data packet transformation subunit and then passed to the data packet computation subunit for the operation; after the operation is completed, the data is compressed by the data packet transformation subunit and cached to memory through the write-back subunit of the RISC-V fifth-generation reduced-instruction processor unit.
2. The method for implementing a lightweight general purpose graphics processor as claimed in claim 1, wherein: the host side is responsible for realizing data interaction, resource allocation and equipment management;
the device end is responsible for completing the deployment of the lightweight general-purpose graphics processor and executing the general-purpose graphics processor kernel;
The computing resource on the device end consists of several computing units; each computing unit in turn consists of several processing units, and the computation on the device end is completed in the processing units.
3. The method for implementing a lightweight general purpose graphics processor according to claim 1 or 2, wherein: each instance of the general purpose graphics processor core when executing at the device end is called a work item or a thread, and a plurality of instances are organized into a thread bundle, and threads in the same thread bundle execute in parallel.
4. The method for implementing a lightweight general-purpose graphics processor as claimed in claim 3, wherein: the host side generates parallel computing tasks, transmits instruction parameters to the general-purpose graphics processor core deployed on the device side via the direct memory access XDMA technique by calling the general-purpose graphics processor kernel application programming interface API functions integrated with the open computing language OpenCL programming framework, transfers the compressed data to be operated on from the host memory to the global memory (high-bandwidth memory HBM), and generates a start signal through the Xilinx Runtime (XRT) API to run the general-purpose graphics processor core.
5. The method for implementing the lightweight general purpose graphics processor as claimed in claim 4, wherein: the instruction parameters are issued to an instruction pre-analyzing unit of the general graphics processor core through a direct memory access XDMA technology, the instruction parameters are pre-analyzed, the data structure of the instruction parameters is recombined, the instruction parameters are forwarded to an instruction cache L1-Icache, and compressed data to be operated are forwarded to a data cache L1-Dcache.
6. The method for implementing the lightweight general-purpose graphics processor as claimed in claim 5, wherein: after the compressed data to be operated on is forwarded to the stream compute engine unit, the data packet transformation subunit decompresses it to obtain the original data and forwards the original data to the data packet computation subunit; according to the operational requirement of the instruction, logic operations, floating-point operations and/or special-function operations are completed in the data packet computation subunit; after the operation is completed, the result is written back to the shared memory and then dispatched back to the global memory (high-bandwidth memory HBM) through the RISC-V fifth-generation reduced-instruction processor unit, and the device end generates a configuration-complete Done signal to notify the host end that the operation is finished;
After the host end polls the Done signal sent by the device end, the run of the general-purpose graphics processor kernel ends: the operation result in the device-end global memory (HBM) is synchronized back to the host memory via the direct memory access XDMA technique, and the transfer link between the host memory and the device-end global memory (HBM) is released, completing the operation.
7. An implementation device of a lightweight general-purpose graphics processor, characterized in that: comprising a memory and a processor; the memory is configured to store a computer program, the processor being configured to implement the method according to any one of claims 1 to 6 when the computer program is executed.
8. A readable storage medium, characterized by: the readable storage medium has stored thereon a computer program which, when executed by a processor, implements the method according to any of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410712527.0A CN118279125B (en) | 2024-06-04 | 2024-06-04 | Method for realizing light general graphic processor |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118279125A CN118279125A (en) | 2024-07-02 |
CN118279125B true CN118279125B (en) | 2024-08-06 |
Family
ID=91647120
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410712527.0A Active CN118279125B (en) | 2024-06-04 | 2024-06-04 | Method for realizing light general graphic processor |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118279125B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112581585A (en) * | 2020-12-24 | 2021-03-30 | 西安翔腾微电子科技有限公司 | TLM device of GPU command processing module based on SysML view and operation method |
CN114239806A (en) * | 2021-12-16 | 2022-03-25 | 浙江大学 | RISC-V structured multi-core neural network processor chip |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8782645B2 (en) * | 2011-05-11 | 2014-07-15 | Advanced Micro Devices, Inc. | Automatic load balancing for heterogeneous cores |
US9582287B2 (en) * | 2012-09-27 | 2017-02-28 | Intel Corporation | Processor having multiple cores, shared core extension logic, and shared core extension utilization instructions |
RU2584470C2 (en) * | 2014-03-18 | 2016-05-20 | Федеральное государственное учреждение "Федеральный научный центр Научно-исследовательский институт системных исследований Российской академии наук" (ФГУ ФНЦ НИИСИ РАН) | Hybrid flow microprocessor |
CN104503950B (en) * | 2014-12-09 | 2017-10-24 | 中国航空工业集团公司第六三一研究所 | A kind of graphics processor towards OpenGL API |
CN105630441B (en) * | 2015-12-11 | 2018-12-25 | 中国航空工业集团公司西安航空计算技术研究所 | A kind of GPU system based on unified staining technique |
EP3625939A1 (en) * | 2017-07-10 | 2020-03-25 | Fungible, Inc. | Access node for data centers |
CN109144573A (en) * | 2018-08-16 | 2019-01-04 | 胡振波 | Two-level pipeline framework based on RISC-V instruction set |
CN110007961B (en) * | 2019-02-01 | 2023-07-18 | 中山大学 | RISC-V-based edge computing hardware architecture |
US11461097B2 (en) * | 2021-01-15 | 2022-10-04 | Cornell University | Content-addressable processing engine |
CN117151180A (en) * | 2023-09-19 | 2023-12-01 | 厦门壹普智慧科技有限公司 | Processor for simplifying data flow instruction set |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||