
CN118279125B - Method for implementing a lightweight general-purpose graphics processor - Google Patents

Method for implementing a lightweight general-purpose graphics processor

Info

Publication number
CN118279125B
Authority
CN
China
Prior art keywords
instruction
data
subunit
memory
graphics processor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410712527.0A
Other languages
Chinese (zh)
Other versions
CN118279125A (en)
Inventor
李乐乐
郭鑫斐
王帅
赵鑫鑫
姜凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Inspur Science Research Institute Co Ltd
Original Assignee
Shandong Inspur Science Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Inspur Science Research Institute Co Ltd filed Critical Shandong Inspur Science Research Institute Co Ltd
Priority to CN202410712527.0A priority Critical patent/CN118279125B/en
Publication of CN118279125A publication Critical patent/CN118279125A/en
Application granted granted Critical
Publication of CN118279125B publication Critical patent/CN118279125B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00: General purpose image data processing
    • G06T 1/20: Processor architectures; Processor configuration, e.g. pipelining
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005: Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5011: Allocation of resources to service a request, the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F 9/5016: Allocation of resources to service a request, the resource being the memory
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005: Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027: Allocation of resources to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Generation (AREA)

Abstract

The invention relates to the technical field of general-purpose graphics processors, and in particular to a method for implementing a lightweight general-purpose graphics processor. The method is based on the open computing language OpenCL programming framework and the RISC-V fifth-generation reduced instruction set, adopts a single-instruction multiple-thread (SIMT) computing model, and uses high-bandwidth memory (HBM) as the global memory, enabling rapid deployment of a lightweight general-purpose graphics processor. By adopting a pipelined design, the method minimizes memory usage and software complexity, improves hardware resource utilization, reduces energy consumption, and achieves a higher energy-efficiency ratio; at the same time, the RISC-V instruction set is easier to deploy and develop, offering flexible tailorability and adaptability.

Description

Method for implementing a lightweight general-purpose graphics processor
Technical Field
The invention relates to the technical field of general-purpose graphics processors, and in particular to a method for implementing a lightweight general-purpose graphics processor.
Background
With the recent rise of artificial intelligence, a great deal of data processing and model training is involved. Deep learning, a common approach in artificial intelligence, requires matrix operations on large amounts of data and therefore involves extensive parallelization and vectorization.
Compared with a traditional CPU, a general-purpose graphics processor (GPGPU) handles high-performance computing tasks by exploiting its many-core structure, multithreading, and high memory bandwidth; it has more computing units and higher bandwidth with which to execute parallelized and vectorized operations. However, the large number of computing cores also means high power consumption, and balancing performance against power consumption has gradually become the main optimization direction for current general-purpose graphics processors.
To address the low energy-efficiency ratio of mainstream general-purpose graphics processors, the invention provides a method for implementing a lightweight general-purpose graphics processor.
Disclosure of Invention
To remedy the deficiencies of the prior art, the invention provides a simple and efficient method for implementing a lightweight general-purpose graphics processor.
The invention is realized by the following technical scheme:
A method for implementing a lightweight general-purpose graphics processor adopts a single-instruction multiple-thread (SIMT, Single Instruction Multiple Threads) computing model based on the open computing language OpenCL (Open Computing Language) programming framework and the RISC-V fifth-generation reduced instruction set, and adopts high-bandwidth memory (HBM, High Bandwidth Memory) as the global memory, thereby enabling rapid deployment of a lightweight general-purpose graphics processor with a high energy-efficiency ratio.
Based on the OpenCL programming framework, the computing platform is divided into a host side (host) and a device side (device), and the general-purpose graphics processor core is deployed on the device side;
The general-purpose graphics processor core integrates a lightweight RISC-V fifth-generation reduced instruction processor unit and a stream compute engine unit; the RISC-V processor unit is responsible only for scheduling instructions and data, and offloads data execution to the Stream Compute Engine Unit (SCU);
Before entering the stream compute engine unit, the data is in a compressed state. After entering the stream compute engine unit, the data is decompressed by the data packet transformation subunit, passed to the data packet calculation subunit for the operation, re-compressed by the data packet transformation subunit once the operation completes, and cached to memory through the write-back subunit of the RISC-V processor unit.
The host side is responsible for data interaction, resource allocation, and device management;
the device side is responsible for deploying the lightweight general-purpose graphics processor and executing the general-purpose graphics processor kernel;
The computing resources on the device side consist of multiple Compute Units (CUs), each of which in turn consists of multiple Processing Elements (PEs); the computation on the device side is carried out in the processing elements.
Each instance of the general-purpose graphics processor kernel executing on the device side is called a work-item or thread; several instances are organized into a thread bundle (wavefront), and threads within the same thread bundle execute in parallel.
The RISC-V processor unit adopts a classical five-stage pipeline comprising a thread scheduling subunit, an instruction fetch subunit, a decode subunit, a dispatch subunit, and a write-back subunit.
The host side generates a large number of parallel computing tasks. By calling the application programming interface (API) functions for the general-purpose graphics processor integrated in the OpenCL programming framework, it transmits instruction parameters to the general-purpose graphics processor core deployed on the device side via direct memory access (XDMA), transfers the compressed data to be operated on from host memory to the HBM global memory, and generates a start signal through the Xilinx Runtime (XRT) API to run the general-purpose graphics processor kernel.
The instruction parameters are issued via XDMA to the instruction pre-analysis unit (IDU) of the general-purpose graphics processor core, where they are pre-parsed and their data structures are reorganized; the instruction parameters are then forwarded to the instruction cache L1-Icache, and the compressed data to be operated on is forwarded to the data cache L1-Dcache.
In the RISC-V processor unit, when the thread scheduling subunit (Warp Schedule Unit, WSU) detects an instruction cached in the instruction cache L1-Icache (Instruction Cache), it reads the instruction and builds a thread schedule table, classifying each thread bundle into one of three states: active, blocked, or waiting:
When a thread bundle is in the active state, the instruction fetch subunit (Fetch Unit, FU) is notified to fetch the instruction from the L1-Icache according to its instruction address and forward it to the decode subunit; the decode subunit (Decode Unit, DU) parses the instruction and its operand data and caches them in the instruction buffer Ibuffer (Instruction Buffer, IB). If the dispatch subunit (IU) detects an instruction to be dispatched, the compressed data to be operated on is read according to the instruction and forwarded to the stream compute engine unit.
After the compressed data to be operated on is forwarded to the stream compute engine unit, the data packet transformation subunit (Package Shaper Unit, PSU) decompresses it to recover the original data and forwards that data to the data packet calculation subunit (Package Compute Unit, PCU); according to the operation required by the instruction, the data packet calculation subunit performs logic operations, floating-point operations and/or special-function operations. When the operation finishes, the result is written back to shared memory and then dispatched back to the HBM global memory through the RISC-V processor unit, and the device side generates a configuration-complete (Done) signal to notify the host side that the operation has finished.
Once the host polls the Done signal sent by the device, execution of the general-purpose graphics processor kernel ends; the operation result in the device-side HBM global memory is synchronized back to host memory via XDMA, and the transfer link between host memory and the device-side HBM global memory is released, completing the operation.
A device for implementing a lightweight general-purpose graphics processor includes a memory and a processor; the memory stores a computer program, and the processor implements the method steps described above when executing the computer program.
A readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method steps as described above.
The beneficial effects of the invention are as follows: by adopting a pipelined design, the method for implementing a lightweight general-purpose graphics processor minimizes memory usage and software complexity, improves hardware resource utilization, reduces energy consumption, and achieves a higher energy-efficiency ratio; at the same time, the RISC-V instruction set is easier to deploy and develop, offering flexible tailorability and adaptability.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. It is apparent that the drawings described below show only some embodiments of the present invention, and that a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of a deployment method of a lightweight general-purpose graphics processor according to the present invention.
FIG. 2 is a schematic diagram of a lightweight general purpose graphics processor module architecture according to the present invention.
Detailed Description
In order to enable those skilled in the art to better understand the technical solution of the present invention, the technical solution is described below clearly and completely in combination with the embodiments of the present invention. It is apparent that the described embodiments are only some, and not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of protection of the present invention.
RISC-V, the fifth-generation reduced instruction set, is an open-source instruction set architecture (ISA) that has attracted attention for its efficiency, controllability, and autonomy.
The method for implementing a lightweight general-purpose graphics processor is based on the open computing language OpenCL (Open Computing Language) programming framework and the RISC-V fifth-generation reduced instruction set, adopts a single-instruction multiple-thread (SIMT, Single Instruction Multiple Threads) computing model, and adopts high-bandwidth memory (HBM, High Bandwidth Memory) as the global memory, thereby enabling rapid deployment of a lightweight general-purpose graphics processor with a high energy-efficiency ratio.
Based on the OpenCL programming framework, the computing platform is divided into a host side (host) and a device side (device), and the general-purpose graphics processor core (CORE) is deployed on the device side;
The general-purpose graphics processor core integrates a lightweight RISC-V fifth-generation reduced instruction processor unit and a stream compute engine unit; the RISC-V processor unit is responsible only for scheduling instructions and data, and offloads data execution to the Stream Compute Engine Unit (SCU);
Before entering the stream compute engine unit, the data is in a compressed state. After entering the stream compute engine unit, the data is decompressed by the data packet transformation subunit, passed to the data packet calculation subunit for the operation, re-compressed by the data packet transformation subunit once the operation completes, and cached to memory through the write-back subunit of the RISC-V processor unit.
The host side is responsible for data interaction, resource allocation, and device management;
the device side is responsible for deploying the lightweight general-purpose graphics processor and executing the general-purpose graphics processor kernel;
The computing resources on the device side consist of multiple Compute Units (CUs), each of which in turn consists of multiple Processing Elements (PEs); the computation on the device side is carried out in the processing elements.
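For illustration, the following minimal sketch shows how the host side could enumerate such a device through standard OpenCL APIs and query how many compute units it exposes; the accelerator device-type filter and the absence of error handling are simplifying assumptions, not part of the invention.

#include <CL/cl.h>
#include <cstdio>

int main() {
    // Pick the first platform and the first accelerator-class device on it.
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, nullptr);

    cl_uint num_cus = 0;   // number of Compute Units (CUs)
    size_t wg_size = 0;    // upper bound on work-items per work-group
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(num_cus), &num_cus, nullptr);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(wg_size), &wg_size, nullptr);
    printf("compute units: %u, max work-group size: %zu\n", num_cus, wg_size);
    return 0;
}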
Each instance of the general-purpose graphics processor kernel executing on the device side is called a work-item or thread; several instances are organized into a thread bundle (wavefront), and threads within the same thread bundle execute in parallel.
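As a hedged illustration of this execution model, the OpenCL C kernel below (held as a C++ raw string; the kernel name and its vector-add body are purely illustrative and not taken from the patent) is instantiated once per work-item, and the hardware groups those work-items into thread bundles that execute in parallel.

// Each work-item (one kernel instance / thread) handles one element.
static const char* kKernelSrc = R"CLC(
__kernel void vec_add(__global const float* a,
                      __global const float* b,
                      __global float* c) {
    size_t gid = get_global_id(0);   // unique global index of this work-item
    c[gid] = a[gid] + b[gid];
}
)CLC";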
The RISC-V processor unit adopts a classical five-stage pipeline comprising a thread scheduling subunit, an instruction fetch subunit, a decode subunit, a dispatch subunit, and a write-back subunit.
The host side generates a large number of parallel computing tasks. By calling the application programming interface (API) functions for the general-purpose graphics processor integrated in the OpenCL programming framework, it transmits instruction parameters to the general-purpose graphics processor core deployed on the device side via direct memory access (XDMA), transfers the compressed data to be operated on from host memory to the HBM global memory, and generates a start signal through the Xilinx Runtime (XRT) API to run the general-purpose graphics processor kernel.
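A minimal host-side sketch of this launch flow is given below, assuming the graphics processor core is exposed to the host as an ordinary OpenCL kernel object; the function and variable names are illustrative. In the flow the patent describes, the buffer write would travel over XDMA into HBM and the kernel enqueue would raise the start signal through the Xilinx runtime, so the host program itself only issues standard OpenCL calls.

#include <CL/cl.h>
#include <cstddef>

// Writes the compressed operand data into device global memory and launches
// the kernel; returns the output buffer.  Error handling is omitted.
cl_mem launch_gpgpu_core(cl_device_id device, cl_kernel kernel,
                         const void* compressed_host_data,
                         size_t in_bytes, size_t out_bytes,
                         size_t num_work_items, size_t wavefront_size) {
    cl_int err = CL_SUCCESS;
    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
    cl_command_queue q =
        clCreateCommandQueueWithProperties(ctx, device, nullptr, &err);

    cl_mem d_in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  in_bytes,  nullptr, &err);
    cl_mem d_out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, out_bytes, nullptr, &err);

    // Host memory -> device global memory (XDMA/HBM on the target platform).
    clEnqueueWriteBuffer(q, d_in, CL_TRUE, 0, in_bytes,
                         compressed_host_data, 0, nullptr, nullptr);

    // Pass the buffers as kernel arguments and start the kernel.
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_in);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_out);
    size_t global = num_work_items, local = wavefront_size;
    clEnqueueNDRangeKernel(q, kernel, 1, nullptr, &global, &local,
                           0, nullptr, nullptr);
    return d_out;
}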
The instruction parameters are issued via XDMA to the instruction pre-analysis unit (IDU) of the general-purpose graphics processor core, where they are pre-parsed and their data structures are reorganized; the instruction parameters are then forwarded to the instruction cache L1-Icache, and the compressed data to be operated on is forwarded to the data cache L1-Dcache.
In the RISC-V processor unit, when the thread scheduling subunit (Warp Schedule Unit, WSU) detects an instruction cached in the instruction cache L1-Icache (Instruction Cache), it reads the instruction and builds a thread schedule table, classifying each thread bundle into one of three states: active, blocked, or waiting:
When a thread bundle is in the active state, the instruction fetch subunit (Fetch Unit, FU) is notified to fetch the instruction from the L1-Icache according to its instruction address and forward it to the decode subunit; the decode subunit (Decode Unit, DU) parses the instruction and its operand data and caches them in the instruction buffer Ibuffer (Instruction Buffer, IB). If the dispatch subunit (IU) detects an instruction to be dispatched, the compressed data to be operated on is read according to the instruction and forwarded to the stream compute engine unit.
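The following behavioural sketch models the three thread-bundle states and one scheduling step in software; it is only an illustration of the control flow described above, and the fetch/decode helpers are hypothetical placeholders for the corresponding hardware subunits.

#include <cstdint>
#include <queue>
#include <vector>

enum class WarpState { Active, Blocked, Waiting };

struct Warp {
    uint32_t  pc = 0;                        // next instruction address in L1-Icache
    WarpState state = WarpState::Waiting;    // state tracked by the scheduler
};

// Hypothetical placeholders for the fetch and decode subunits.
uint32_t fetch_from_icache(uint32_t pc);
uint32_t decode(uint32_t raw_instruction);

// One scheduling step: pick an active warp, fetch and decode its instruction,
// and hand the decoded instruction to the dispatch stage.
void schedule_step(std::vector<Warp>& warps, std::queue<uint32_t>& dispatch_q) {
    for (Warp& w : warps) {
        if (w.state != WarpState::Active) continue;   // skip blocked/waiting warps
        uint32_t raw = fetch_from_icache(w.pc);       // instruction fetch subunit
        dispatch_q.push(decode(raw));                 // decode subunit -> Ibuffer
        w.pc += 4;                                    // 32-bit RISC-V instruction
        break;                                        // issue one instruction per step
    }
}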
After the compressed data to be operated on is forwarded to the stream compute engine unit, the data packet transformation subunit (Package Shaper Unit, PSU) decompresses it to recover the original data and forwards that data to the data packet calculation subunit (Package Compute Unit, PCU); according to the operation required by the instruction, the data packet calculation subunit performs logic operations, floating-point operations and/or special-function operations. When the operation finishes, the result is written back to shared memory and then dispatched back to the HBM global memory through the RISC-V processor unit, and the device side generates a configuration-complete (Done) signal to notify the host side that the operation has finished.
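A behavioural sketch of this decompress, compute, and re-compress path is shown below; the compression format is not specified in the text, so the packing routines are placeholders, and the element-wise squaring stands in for whichever logic, floating-point, or special-function operation the instruction requires.

#include <cstdint>
#include <vector>

// Placeholders for whatever packing the data packet transformation subunit uses.
std::vector<float> psu_decompress(const std::vector<uint8_t>& packed);
std::vector<uint8_t> psu_compress(const std::vector<float>& values);

std::vector<uint8_t> scu_execute(const std::vector<uint8_t>& packed_in) {
    // PSU: unpack the compressed operands into raw values.
    std::vector<float> data = psu_decompress(packed_in);

    // PCU: perform the operation required by the instruction
    // (an element-wise square is used here purely as an example).
    for (float& x : data) x = x * x;

    // PSU again: repack the result before the write-back subunit
    // dispatches it to the HBM global memory.
    return psu_compress(data);
}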
Once the host polls the Done signal sent by the device, execution of the general-purpose graphics processor kernel ends; the operation result in the device-side HBM global memory is synchronized back to host memory via XDMA, and the transfer link between host memory and the device-side HBM global memory is released, completing the operation.
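From the host program's perspective this completion step could look like the sketch below, where waiting on the kernel's OpenCL event plays the role of polling the Done signal; names are illustrative and error handling is omitted.

#include <CL/cl.h>
#include <cstddef>

void collect_result(cl_command_queue q, cl_event kernel_done,
                    cl_mem d_out, void* host_result, size_t out_bytes) {
    clWaitForEvents(1, &kernel_done);                     // "poll" the Done signal
    clEnqueueReadBuffer(q, d_out, CL_TRUE, 0, out_bytes,  // HBM -> host memory
                        host_result, 0, nullptr, nullptr);
    clReleaseMemObject(d_out);                            // release device buffer
    clReleaseCommandQueue(q);                             // release the transfer link
}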
The device for implementing the lightweight general-purpose graphics processor comprises a memory and a processor; the memory stores a computer program, and the processor implements the method steps described above when executing the computer program.
The readable storage medium has stored thereon a computer program which, when executed by a processor, implements the method steps described above.
The above embodiment is only one specific implementation of the present invention; ordinary changes and substitutions made by those skilled in the art within the scope of the technical solution of the present invention shall fall within the protection scope of the present invention.

Claims (8)

1. A method for implementing a lightweight general-purpose graphics processor, characterized in that: based on the open computing language OpenCL programming framework and the RISC-V fifth-generation reduced instruction set, a single-instruction multiple-thread (SIMT) computing model is adopted, and high-bandwidth memory (HBM) is adopted as the global memory;
the computing platform is divided into a host side and a device side based on the OpenCL programming framework, and the general-purpose graphics processor core is deployed on the device side;
the general-purpose graphics processor core integrates a lightweight RISC-V fifth-generation reduced instruction processor unit and a stream compute engine unit; the RISC-V fifth-generation reduced instruction processor unit is responsible only for scheduling instructions and data, and offloads data execution to the stream compute engine unit;
the RISC-V fifth-generation reduced instruction processor unit adopts a classical five-stage pipeline comprising a thread scheduling subunit, an instruction fetch subunit, a decode subunit, a dispatch subunit and a write-back subunit;
in the RISC-V fifth-generation reduced instruction processor unit, if the thread scheduling subunit detects an instruction cached in the instruction cache L1-Icache, the instruction is read and a thread schedule table is constructed, classifying each thread bundle into one of three states: active, blocked, or waiting;
when a thread bundle is in the active state, the instruction fetch subunit is notified to fetch the instruction from the instruction cache L1-Icache according to the instruction address and forward it to the decode subunit; the decode subunit parses the instruction and its operand data and caches them in the instruction buffer Ibuffer; if the dispatch subunit detects an instruction to be dispatched, the compressed data to be operated on is read according to the instruction and forwarded to the stream compute engine unit;
before entering the stream compute engine unit, the data is in a compressed state; after entering the stream compute engine unit, the data is decompressed by the data packet transformation subunit and passed to the data packet calculation subunit for the operation; after the operation is completed, the data is re-compressed by the data packet transformation subunit and cached to memory through the write-back subunit of the RISC-V fifth-generation reduced instruction processor unit.
2. The method for implementing a lightweight general-purpose graphics processor as claimed in claim 1, wherein: the host side is responsible for data interaction, resource allocation and device management;
the device side is responsible for deploying the lightweight general-purpose graphics processor and executing the general-purpose graphics processor kernel;
the computing resources on the device side consist of a plurality of compute units, each compute unit further consists of a plurality of processing elements, and the computation on the device side is completed in the processing elements.
3. The method for implementing a lightweight general-purpose graphics processor according to claim 1 or 2, wherein: each instance of the general-purpose graphics processor kernel executing on the device side is called a work-item or a thread, a plurality of instances are organized into a thread bundle, and threads in the same thread bundle execute in parallel.
4. The method for implementing a lightweight general-purpose graphics processor as claimed in claim 3, wherein: the host side generates parallel computing tasks; by calling the application programming interface (API) functions for the general-purpose graphics processor core integrated in the open computing language OpenCL programming framework, it transmits instruction parameters to the general-purpose graphics processor core deployed on the device side via direct memory access XDMA, transfers the compressed data to be operated on from host memory to the global memory high-bandwidth memory HBM, and generates a start signal through the Xilinx Runtime (XRT) API to run the general-purpose graphics processor kernel.
5. The method for implementing the lightweight general-purpose graphics processor as claimed in claim 4, wherein: the instruction parameters are issued via direct memory access XDMA to the instruction pre-analysis unit of the general-purpose graphics processor core, where they are pre-parsed and their data structures are reorganized; the instruction parameters are then forwarded to the instruction cache L1-Icache, and the compressed data to be operated on is forwarded to the data cache L1-Dcache.
6. The method for implementing the lightweight general-purpose graphics processor as claimed in claim 5, wherein: after the compressed data to be operated on is forwarded to the stream compute engine unit, the data packet transformation subunit decompresses it to recover the original data and forwards that data to the data packet calculation subunit; according to the operation required by the instruction, the data packet calculation subunit performs logic operations, floating-point operations and/or special-function operations; after the operation is finished, the result is written back to shared memory and then dispatched back to the global memory high-bandwidth memory HBM through the RISC-V fifth-generation reduced instruction processor unit, and the device side generates a configuration-complete Done signal to notify the host side that the operation has finished;
after the host polls the Done signal sent by the device, execution of the general-purpose graphics processor kernel ends; the operation result in the device-side global memory high-bandwidth memory HBM is synchronized back to host memory via direct memory access XDMA, and the transfer link between host memory and the device-side global memory high-bandwidth memory HBM is released, completing the operation.
7. An implementation device of a lightweight general-purpose graphics processor, characterized in that: comprising a memory and a processor; the memory is configured to store a computer program, the processor being configured to implement the method according to any one of claims 1 to 6 when the computer program is executed.
8. A readable storage medium, characterized by: the readable storage medium has stored thereon a computer program which, when executed by a processor, implements the method according to any of claims 1 to 6.
CN202410712527.0A 2024-06-04 2024-06-04 Method for realizing light general graphic processor Active CN118279125B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410712527.0A CN118279125B (en) 2024-06-04 2024-06-04 Method for realizing light general graphic processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410712527.0A CN118279125B (en) 2024-06-04 2024-06-04 Method for realizing light general graphic processor

Publications (2)

Publication Number Publication Date
CN118279125A CN118279125A (en) 2024-07-02
CN118279125B true CN118279125B (en) 2024-08-06

Family

ID=91647120

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410712527.0A Active CN118279125B (en) 2024-06-04 2024-06-04 Method for realizing light general graphic processor

Country Status (1)

Country Link
CN (1) CN118279125B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112581585A (en) * 2020-12-24 2021-03-30 西安翔腾微电子科技有限公司 TLM device of GPU command processing module based on SysML view and operation method
CN114239806A (en) * 2021-12-16 2022-03-25 浙江大学 RISC-V structured multi-core neural network processor chip

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8782645B2 (en) * 2011-05-11 2014-07-15 Advanced Micro Devices, Inc. Automatic load balancing for heterogeneous cores
US9582287B2 (en) * 2012-09-27 2017-02-28 Intel Corporation Processor having multiple cores, shared core extension logic, and shared core extension utilization instructions
RU2584470C2 (en) * 2014-03-18 2016-05-20 Федеральное государственное учреждение "Федеральный научный центр Научно-исследовательский институт системных исследований Российской академии наук" (ФГУ ФНЦ НИИСИ РАН) Hybrid flow microprocessor
CN104503950B (en) * 2014-12-09 2017-10-24 中国航空工业集团公司第六三一研究所 A kind of graphics processor towards OpenGL API
CN105630441B (en) * 2015-12-11 2018-12-25 中国航空工业集团公司西安航空计算技术研究所 A kind of GPU system based on unified staining technique
EP3625939A1 (en) * 2017-07-10 2020-03-25 Fungible, Inc. Access node for data centers
CN109144573A (en) * 2018-08-16 2019-01-04 胡振波 Two-level pipeline framework based on RISC-V instruction set
CN110007961B (en) * 2019-02-01 2023-07-18 中山大学 RISC-V-based edge computing hardware architecture
US11461097B2 (en) * 2021-01-15 2022-10-04 Cornell University Content-addressable processing engine
CN117151180A (en) * 2023-09-19 2023-12-01 厦门壹普智慧科技有限公司 Processor for simplifying data flow instruction set

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112581585A (en) * 2020-12-24 2021-03-30 西安翔腾微电子科技有限公司 TLM device of GPU command processing module based on SysML view and operation method
CN114239806A (en) * 2021-12-16 2022-03-25 浙江大学 RISC-V structured multi-core neural network processor chip

Also Published As

Publication number Publication date
CN118279125A (en) 2024-07-02

Similar Documents

Publication Publication Date Title
US20120256922A1 (en) Multithreaded Processor and Method for Realizing Functions of Central Processing Unit and Graphics Processing Unit
US11609792B2 (en) Maximizing resource utilization of neural network computing system
Orr et al. Fine-grain task aggregation and coordination on GPUs
WO2020103706A1 (en) Data processing system and data processing method
JP6336399B2 (en) Multi-threaded computing
WO2022134729A1 (en) Risc-v-based artificial intelligence inference method and system
CN112580792B (en) Neural network multi-core tensor processor
WO2022078400A1 (en) Device and method for processing multi-dimensional data, and computer program product
Chen et al. Characterizing scalar opportunities in GPGPU applications
CN111026444A (en) GPU parallel array SIMT instruction processing model
CN118279125B (en) Method for realizing light general graphic processor
WO2021008257A1 (en) Coprocessor and data processing acceleration method therefor
CN108549935B (en) Device and method for realizing neural network model
JP2024538829A (en) Artificial intelligence core, artificial intelligence core system, and load/store method for artificial intelligence core system
US20190272460A1 (en) Configurable neural network processor for machine learning workloads
Ho et al. Improving gpu throughput through parallel execution using tensor cores and cuda cores
CN104636207B (en) Coordinated dispatching method and system based on GPGPU architectures
Lin et al. swFLOW: A dataflow deep learning framework on sunway taihulight supercomputer
KR101420592B1 (en) Computer system
US20110247018A1 (en) API For Launching Work On a Processor
Maitre et al. Fast evaluation of GP trees on GPGPU by optimizing hardware scheduling
CN111443898A (en) Method for designing flow program control software based on priority queue and finite-state machine
Zhang et al. CPU-assisted GPU thread pool model for dynamic task parallelism
Kwon et al. Mobile GPU shader processor based on non-blocking coarse grained reconfigurable arrays architecture
Falahati et al. ISP: Using idle SMs in hardware-based prefetching

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant