
CN118279125B - Method for implementing a lightweight general-purpose graphics processor - Google Patents

Method for implementing a lightweight general-purpose graphics processor

Info

Publication number
CN118279125B
Authority
CN
China
Prior art keywords
instruction
data
subunit
memory
graphics processor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410712527.0A
Other languages
Chinese (zh)
Other versions
CN118279125A (en)
Inventor
李乐乐
郭鑫斐
王帅
赵鑫鑫
姜凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Inspur Science Research Institute Co Ltd
Original Assignee
Shandong Inspur Science Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Inspur Science Research Institute Co Ltd filed Critical Shandong Inspur Science Research Institute Co Ltd
Priority to CN202410712527.0A priority Critical patent/CN118279125B/en
Publication of CN118279125A publication Critical patent/CN118279125A/en
Application granted granted Critical
Publication of CN118279125B publication Critical patent/CN118279125B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00: General purpose image data processing
    • G06T 1/20: Processor architectures; Processor configuration, e.g. pipelining
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005: Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5011: Allocation of resources to service a request, the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F 9/5016: Allocation of resources to service a request, the resource being the memory
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005: Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027: Allocation of resources to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Generation (AREA)

Abstract

The invention relates to the technical field of general-purpose graphics processors, and in particular to a method for implementing a lightweight general-purpose graphics processor. The method is based on the open computing language OpenCL programming framework and the RISC-V fifth-generation reduced instruction set, adopts a single-instruction multiple-thread (SIMT) computing model, and uses high-bandwidth memory (HBM) as the global memory, enabling rapid deployment of a lightweight general-purpose graphics processor. By adopting a pipelined design, the method minimizes memory usage and software complexity, improves hardware resource utilization, reduces energy consumption, and achieves a higher energy-efficiency ratio; at the same time, the RISC-V instruction set is easier to deploy and develop, offering flexible tailorability and adaptability.

Description

Method for implementing a lightweight general-purpose graphics processor
Technical Field
The invention relates to the technical field of general-purpose graphics processors, and in particular to a method for implementing a lightweight general-purpose graphics processor.
Background
With the recent rise of artificial intelligence, a great deal of data processing and model training is involved. Deep learning, a common approach in artificial intelligence, requires matrix operations on large amounts of data and therefore involves extensive parallelization and vectorization.
Compared with a traditional CPU, a general-purpose graphics processor (GPGPU) handles high-performance computing tasks by exploiting its many-core structure, multithreading, and high memory bandwidth; it has more computing units and higher bandwidth with which to execute parallelized and vectorized operations. However, the large number of computing cores also means high power consumption, and balancing performance against power consumption has gradually become the main optimization direction for current general-purpose graphics processors.
To address the low energy-efficiency ratio of mainstream general-purpose graphics processors, the invention provides a method for implementing a lightweight general-purpose graphics processor.
Disclosure of Invention
To remedy the deficiencies of the prior art, the invention provides a simple and efficient method for implementing a lightweight general-purpose graphics processor.
The invention is realized by the following technical scheme:
A method for implementing a lightweight general-purpose graphics processor adopts a single-instruction multiple-thread (SIMT, Single Instruction Multiple Threads) computing model based on the open computing language OpenCL (Open Computing Language) programming framework and the RISC-V fifth-generation reduced instruction set, and adopts high-bandwidth memory (HBM, High Bandwidth Memory) as the global memory, thereby enabling rapid deployment of a lightweight general-purpose graphics processor with a high energy-efficiency ratio.
Based on the OpenCL programming framework, the computing platform is divided into a host side (host) and a device side (device), and the general-purpose graphics processor core is deployed on the device side;
The general-purpose graphics processor core integrates a lightweight RISC-V fifth-generation reduced instruction processor unit and a stream compute engine unit; the RISC-V processor unit is responsible only for scheduling instructions and data, and offloads data execution to the Stream Compute Engine Unit (SCU);
Before entering the stream compute engine unit, the data is in a compressed state. After entering the stream compute engine unit, the data is decompressed by the data packet transformation subunit, passed to the data packet calculation subunit for the operation, re-compressed by the data packet transformation subunit once the operation completes, and cached to memory through the write-back subunit of the RISC-V processor unit.
The host side is responsible for data interaction, resource allocation, and device management;
the device side is responsible for deploying the lightweight general-purpose graphics processor and executing the general-purpose graphics processor kernel;
The computing resources on the device side consist of multiple Compute Units (CUs), each of which in turn consists of multiple Processing Elements (PEs); the computation on the device side is carried out in the processing elements.
Each instance of the general-purpose graphics processor kernel executing on the device side is called a work-item or thread; several instances are organized into a thread bundle (wavefront), and threads within the same thread bundle execute in parallel.
The RISC-V processor unit adopts a classical five-stage pipeline comprising a thread scheduling subunit, an instruction fetch subunit, a decode subunit, a dispatch subunit, and a write-back subunit.
The host side generates a large number of parallel computing tasks. By calling the application programming interface (API) functions for the general-purpose graphics processor integrated in the OpenCL programming framework, it transmits instruction parameters to the general-purpose graphics processor core deployed on the device side via direct memory access (XDMA), transfers the compressed data to be operated on from host memory to the HBM global memory, and generates a start signal through the Xilinx Runtime (XRT) API to run the general-purpose graphics processor kernel.
The instruction parameters are issued via XDMA to the instruction pre-analysis unit (IDU) of the general-purpose graphics processor core, where they are pre-parsed and their data structures are reorganized; the instruction parameters are then forwarded to the instruction cache L1-Icache, and the compressed data to be operated on is forwarded to the data cache L1-Dcache.
In the RISC-V processor unit, when the thread scheduling subunit (Warp Schedule Unit, WSU) detects an instruction cached in the instruction cache L1-Icache (Instruction Cache), it reads the instruction and builds a thread schedule table, classifying each thread bundle into one of three states: active, blocked, or waiting:
When a thread bundle is in the active state, the instruction fetch subunit (Fetch Unit, FU) is notified to fetch the instruction from the L1-Icache according to its instruction address and forward it to the decode subunit; the decode subunit (Decode Unit, DU) parses the instruction and its operand data and caches them in the instruction buffer Ibuffer (Instruction Buffer, IB). If the dispatch subunit (IU) detects an instruction to be dispatched, the compressed data to be operated on is read according to the instruction and forwarded to the stream compute engine unit.
After the compressed data to be operated on is forwarded to the stream compute engine unit, the data packet transformation subunit (Package Shaper Unit, PSU) decompresses it to recover the original data and forwards that data to the data packet calculation subunit (Package Compute Unit, PCU); according to the operation required by the instruction, the data packet calculation subunit performs logic operations, floating-point operations and/or special-function operations. When the operation finishes, the result is written back to shared memory and then dispatched back to the HBM global memory through the RISC-V processor unit, and the device side generates a configuration-complete (Done) signal to notify the host side that the operation has finished.
Once the host polls the Done signal sent by the device, execution of the general-purpose graphics processor kernel ends; the operation result in the device-side HBM global memory is synchronized back to host memory via XDMA, and the transfer link between host memory and the device-side HBM global memory is released, completing the operation.
A device for implementing a lightweight general-purpose graphics processor includes a memory and a processor; the memory stores a computer program, and the processor implements the method steps described above when executing the computer program.
A readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method steps as described above.
The beneficial effects of the invention are as follows: by adopting a pipelined design, the method for implementing a lightweight general-purpose graphics processor minimizes memory usage and software complexity, improves hardware resource utilization, reduces energy consumption, and achieves a higher energy-efficiency ratio; at the same time, the RISC-V instruction set is easier to deploy and develop, offering flexible tailorability and adaptability.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. It is apparent that the drawings described below show only some embodiments of the present invention, and that a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of a deployment method of a lightweight general-purpose graphics processor according to the present invention.
FIG. 2 is a schematic diagram of a lightweight general purpose graphics processor module architecture according to the present invention.
Detailed Description
In order to enable those skilled in the art to better understand the technical solution of the present invention, the technical solution is described below clearly and completely in combination with the embodiments of the present invention. It is apparent that the described embodiments are only some, and not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of protection of the present invention.
RISC-V, the fifth-generation reduced instruction set, is an open-source instruction set architecture (ISA) that has attracted attention for its efficiency, controllability, and autonomy.
The method for implementing a lightweight general-purpose graphics processor is based on the open computing language OpenCL (Open Computing Language) programming framework and the RISC-V fifth-generation reduced instruction set, adopts a single-instruction multiple-thread (SIMT, Single Instruction Multiple Threads) computing model, and adopts high-bandwidth memory (HBM, High Bandwidth Memory) as the global memory, thereby enabling rapid deployment of a lightweight general-purpose graphics processor with a high energy-efficiency ratio.
Based on the OpenCL programming framework, the computing platform is divided into a host side (host) and a device side (device), and the general-purpose graphics processor core (CORE) is deployed on the device side;
The general-purpose graphics processor core integrates a lightweight RISC-V fifth-generation reduced instruction processor unit and a stream compute engine unit; the RISC-V processor unit is responsible only for scheduling instructions and data, and offloads data execution to the Stream Compute Engine Unit (SCU);
Before entering the stream compute engine unit, the data is in a compressed state. After entering the stream compute engine unit, the data is decompressed by the data packet transformation subunit, passed to the data packet calculation subunit for the operation, re-compressed by the data packet transformation subunit once the operation completes, and cached to memory through the write-back subunit of the RISC-V processor unit.
The host side is responsible for data interaction, resource allocation, and device management;
the device side is responsible for deploying the lightweight general-purpose graphics processor and executing the general-purpose graphics processor kernel;
The computing resources on the device side consist of multiple Compute Units (CUs), each of which in turn consists of multiple Processing Elements (PEs); the computation on the device side is carried out in the processing elements.
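For illustration, the following minimal sketch shows how the host side could enumerate such a device through standard OpenCL APIs and query how many compute units it exposes; the accelerator device-type filter and the absence of error handling are simplifying assumptions, not part of the invention.

#include <CL/cl.h>
#include <cstdio>

int main() {
    // Pick the first platform and the first accelerator-class device on it.
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, nullptr);

    cl_uint num_cus = 0;   // number of Compute Units (CUs)
    size_t wg_size = 0;    // upper bound on work-items per work-group
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(num_cus), &num_cus, nullptr);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(wg_size), &wg_size, nullptr);
    printf("compute units: %u, max work-group size: %zu\n", num_cus, wg_size);
    return 0;
}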
Each instance of the general-purpose graphics processor kernel executing on the device side is called a work-item or thread; several instances are organized into a thread bundle (wavefront), and threads within the same thread bundle execute in parallel.
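As a hedged illustration of this execution model, the OpenCL C kernel below (held as a C++ raw string; the kernel name and its vector-add body are purely illustrative and not taken from the patent) is instantiated once per work-item, and the hardware groups those work-items into thread bundles that execute in parallel.

// Each work-item (one kernel instance / thread) handles one element.
static const char* kKernelSrc = R"CLC(
__kernel void vec_add(__global const float* a,
                      __global const float* b,
                      __global float* c) {
    size_t gid = get_global_id(0);   // unique global index of this work-item
    c[gid] = a[gid] + b[gid];
}
)CLC";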
The RISC-V processor unit adopts a classical five-stage pipeline comprising a thread scheduling subunit, an instruction fetch subunit, a decode subunit, a dispatch subunit, and a write-back subunit.
The host side generates a large number of parallel computing tasks. By calling the application programming interface (API) functions for the general-purpose graphics processor integrated in the OpenCL programming framework, it transmits instruction parameters to the general-purpose graphics processor core deployed on the device side via direct memory access (XDMA), transfers the compressed data to be operated on from host memory to the HBM global memory, and generates a start signal through the Xilinx Runtime (XRT) API to run the general-purpose graphics processor kernel.
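A minimal host-side sketch of this launch flow is given below, assuming the graphics processor core is exposed to the host as an ordinary OpenCL kernel object; the function and variable names are illustrative. In the flow the patent describes, the buffer write would travel over XDMA into HBM and the kernel enqueue would raise the start signal through the Xilinx runtime, so the host program itself only issues standard OpenCL calls.

#include <CL/cl.h>
#include <cstddef>

// Writes the compressed operand data into device global memory and launches
// the kernel; returns the output buffer.  Error handling is omitted.
cl_mem launch_gpgpu_core(cl_device_id device, cl_kernel kernel,
                         const void* compressed_host_data,
                         size_t in_bytes, size_t out_bytes,
                         size_t num_work_items, size_t wavefront_size) {
    cl_int err = CL_SUCCESS;
    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
    cl_command_queue q =
        clCreateCommandQueueWithProperties(ctx, device, nullptr, &err);

    cl_mem d_in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  in_bytes,  nullptr, &err);
    cl_mem d_out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, out_bytes, nullptr, &err);

    // Host memory -> device global memory (XDMA/HBM on the target platform).
    clEnqueueWriteBuffer(q, d_in, CL_TRUE, 0, in_bytes,
                         compressed_host_data, 0, nullptr, nullptr);

    // Pass the buffers as kernel arguments and start the kernel.
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_in);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_out);
    size_t global = num_work_items, local = wavefront_size;
    clEnqueueNDRangeKernel(q, kernel, 1, nullptr, &global, &local,
                           0, nullptr, nullptr);
    return d_out;
}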
The instruction parameters are issued via XDMA to the instruction pre-analysis unit (IDU) of the general-purpose graphics processor core, where they are pre-parsed and their data structures are reorganized; the instruction parameters are then forwarded to the instruction cache L1-Icache, and the compressed data to be operated on is forwarded to the data cache L1-Dcache.
In the RISC-V processor unit, when the thread scheduling subunit (Warp Schedule Unit, WSU) detects an instruction cached in the instruction cache L1-Icache (Instruction Cache), it reads the instruction and builds a thread schedule table, classifying each thread bundle into one of three states: active, blocked, or waiting:
When a thread bundle is in the active state, the instruction fetch subunit (Fetch Unit, FU) is notified to fetch the instruction from the L1-Icache according to its instruction address and forward it to the decode subunit; the decode subunit (Decode Unit, DU) parses the instruction and its operand data and caches them in the instruction buffer Ibuffer (Instruction Buffer, IB). If the dispatch subunit (IU) detects an instruction to be dispatched, the compressed data to be operated on is read according to the instruction and forwarded to the stream compute engine unit.
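The following behavioural sketch models the three thread-bundle states and one scheduling step in software; it is only an illustration of the control flow described above, and the fetch/decode helpers are hypothetical placeholders for the corresponding hardware subunits.

#include <cstdint>
#include <queue>
#include <vector>

enum class WarpState { Active, Blocked, Waiting };

struct Warp {
    uint32_t  pc = 0;                        // next instruction address in L1-Icache
    WarpState state = WarpState::Waiting;    // state tracked by the scheduler
};

// Hypothetical placeholders for the fetch and decode subunits.
uint32_t fetch_from_icache(uint32_t pc);
uint32_t decode(uint32_t raw_instruction);

// One scheduling step: pick an active warp, fetch and decode its instruction,
// and hand the decoded instruction to the dispatch stage.
void schedule_step(std::vector<Warp>& warps, std::queue<uint32_t>& dispatch_q) {
    for (Warp& w : warps) {
        if (w.state != WarpState::Active) continue;   // skip blocked/waiting warps
        uint32_t raw = fetch_from_icache(w.pc);       // instruction fetch subunit
        dispatch_q.push(decode(raw));                 // decode subunit -> Ibuffer
        w.pc += 4;                                    // 32-bit RISC-V instruction
        break;                                        // issue one instruction per step
    }
}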
After the compressed data to be operated on is forwarded to the stream compute engine unit, the data packet transformation subunit (Package Shaper Unit, PSU) decompresses it to recover the original data and forwards that data to the data packet calculation subunit (Package Compute Unit, PCU); according to the operation required by the instruction, the data packet calculation subunit performs logic operations, floating-point operations and/or special-function operations. When the operation finishes, the result is written back to shared memory and then dispatched back to the HBM global memory through the RISC-V processor unit, and the device side generates a configuration-complete (Done) signal to notify the host side that the operation has finished.
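A behavioural sketch of this decompress, compute, and re-compress path is shown below; the compression format is not specified in the text, so the packing routines are placeholders, and the element-wise squaring stands in for whichever logic, floating-point, or special-function operation the instruction requires.

#include <cstdint>
#include <vector>

// Placeholders for whatever packing the data packet transformation subunit uses.
std::vector<float> psu_decompress(const std::vector<uint8_t>& packed);
std::vector<uint8_t> psu_compress(const std::vector<float>& values);

std::vector<uint8_t> scu_execute(const std::vector<uint8_t>& packed_in) {
    // PSU: unpack the compressed operands into raw values.
    std::vector<float> data = psu_decompress(packed_in);

    // PCU: perform the operation required by the instruction
    // (an element-wise square is used here purely as an example).
    for (float& x : data) x = x * x;

    // PSU again: repack the result before the write-back subunit
    // dispatches it to the HBM global memory.
    return psu_compress(data);
}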
Once the host polls the Done signal sent by the device, execution of the general-purpose graphics processor kernel ends; the operation result in the device-side HBM global memory is synchronized back to host memory via XDMA, and the transfer link between host memory and the device-side HBM global memory is released, completing the operation.
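From the host program's perspective this completion step could look like the sketch below, where waiting on the kernel's OpenCL event plays the role of polling the Done signal; names are illustrative and error handling is omitted.

#include <CL/cl.h>
#include <cstddef>

void collect_result(cl_command_queue q, cl_event kernel_done,
                    cl_mem d_out, void* host_result, size_t out_bytes) {
    clWaitForEvents(1, &kernel_done);                     // "poll" the Done signal
    clEnqueueReadBuffer(q, d_out, CL_TRUE, 0, out_bytes,  // HBM -> host memory
                        host_result, 0, nullptr, nullptr);
    clReleaseMemObject(d_out);                            // release device buffer
    clReleaseCommandQueue(q);                             // release the transfer link
}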
The device for implementing the lightweight general-purpose graphics processor comprises a memory and a processor; the memory stores a computer program, and the processor implements the method steps described above when executing the computer program.
The readable storage medium has stored thereon a computer program which, when executed by a processor, implements the method steps described above.
The above embodiment is only one specific implementation of the present invention; ordinary changes and substitutions made by those skilled in the art within the scope of the technical solution of the present invention shall fall within the protection scope of the present invention.

Claims (8)

1. A method for implementing a lightweight general-purpose graphics processor, characterized in that: based on the open computing language OpenCL programming framework and the RISC-V fifth-generation reduced instruction set, a single-instruction multiple-thread (SIMT) computing model is adopted, and high-bandwidth memory (HBM) is adopted as the global memory;
the computing platform is divided into a host side and a device side based on the OpenCL programming framework, and the general-purpose graphics processor core is deployed on the device side;
the general-purpose graphics processor core integrates a lightweight RISC-V fifth-generation reduced instruction processor unit and a stream compute engine unit; the RISC-V fifth-generation reduced instruction processor unit is responsible only for scheduling instructions and data, and offloads data execution to the stream compute engine unit;
the RISC-V fifth-generation reduced instruction processor unit adopts a classical five-stage pipeline comprising a thread scheduling subunit, an instruction fetch subunit, a decode subunit, a dispatch subunit and a write-back subunit;
in the RISC-V fifth-generation reduced instruction processor unit, if the thread scheduling subunit detects an instruction cached in the instruction cache L1-Icache, the instruction is read and a thread schedule table is constructed, classifying each thread bundle into one of three states: active, blocked, or waiting;
when a thread bundle is in the active state, the instruction fetch subunit is notified to fetch the instruction from the instruction cache L1-Icache according to the instruction address and forward it to the decode subunit; the decode subunit parses the instruction and its operand data and caches them in the instruction buffer Ibuffer; if the dispatch subunit detects an instruction to be dispatched, the compressed data to be operated on is read according to the instruction and forwarded to the stream compute engine unit;
before entering the stream compute engine unit, the data is in a compressed state; after entering the stream compute engine unit, the data is decompressed by the data packet transformation subunit and passed to the data packet calculation subunit for the operation; after the operation is completed, the data is re-compressed by the data packet transformation subunit and cached to memory through the write-back subunit of the RISC-V fifth-generation reduced instruction processor unit.
2. The method for implementing a lightweight general-purpose graphics processor as claimed in claim 1, wherein: the host side is responsible for data interaction, resource allocation and device management;
the device side is responsible for deploying the lightweight general-purpose graphics processor and executing the general-purpose graphics processor kernel;
the computing resources on the device side consist of a plurality of compute units, each compute unit further consists of a plurality of processing elements, and the computation on the device side is completed in the processing elements.
3. The method for implementing a lightweight general-purpose graphics processor according to claim 1 or 2, wherein: each instance of the general-purpose graphics processor kernel executing on the device side is called a work-item or a thread, a plurality of instances are organized into a thread bundle, and threads in the same thread bundle execute in parallel.
4. The method for implementing a lightweight general-purpose graphics processor as claimed in claim 3, wherein: the host side generates parallel computing tasks; by calling the application programming interface (API) functions for the general-purpose graphics processor core integrated in the open computing language OpenCL programming framework, it transmits instruction parameters to the general-purpose graphics processor core deployed on the device side via direct memory access XDMA, transfers the compressed data to be operated on from host memory to the global memory high-bandwidth memory HBM, and generates a start signal through the Xilinx Runtime (XRT) API to run the general-purpose graphics processor kernel.
5. The method for implementing the lightweight general-purpose graphics processor as claimed in claim 4, wherein: the instruction parameters are issued via direct memory access XDMA to the instruction pre-analysis unit of the general-purpose graphics processor core, where they are pre-parsed and their data structures are reorganized; the instruction parameters are then forwarded to the instruction cache L1-Icache, and the compressed data to be operated on is forwarded to the data cache L1-Dcache.
6. The method for implementing the lightweight general-purpose graphics processor as claimed in claim 5, wherein: after the compressed data to be operated on is forwarded to the stream compute engine unit, the data packet transformation subunit decompresses it to recover the original data and forwards that data to the data packet calculation subunit; according to the operation required by the instruction, the data packet calculation subunit performs logic operations, floating-point operations and/or special-function operations; after the operation is finished, the result is written back to shared memory and then dispatched back to the global memory high-bandwidth memory HBM through the RISC-V fifth-generation reduced instruction processor unit, and the device side generates a configuration-complete Done signal to notify the host side that the operation has finished;
after the host polls the Done signal sent by the device, execution of the general-purpose graphics processor kernel ends; the operation result in the device-side global memory high-bandwidth memory HBM is synchronized back to host memory via direct memory access XDMA, and the transfer link between host memory and the device-side global memory high-bandwidth memory HBM is released, completing the operation.
7. An implementation device of a lightweight general-purpose graphics processor, characterized in that: comprising a memory and a processor; the memory is configured to store a computer program, the processor being configured to implement the method according to any one of claims 1 to 6 when the computer program is executed.
8. A readable storage medium, characterized by: the readable storage medium has stored thereon a computer program which, when executed by a processor, implements the method according to any of claims 1 to 6.
CN202410712527.0A 2024-06-04 2024-06-04 Method for realizing light general graphic processor Active CN118279125B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410712527.0A CN118279125B (en) 2024-06-04 2024-06-04 Method for realizing light general graphic processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410712527.0A CN118279125B (en) 2024-06-04 2024-06-04 Method for realizing light general graphic processor

Publications (2)

Publication Number Publication Date
CN118279125A CN118279125A (en) 2024-07-02
CN118279125B true CN118279125B (en) 2024-08-06

Family

ID=91647120

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410712527.0A Active CN118279125B (en) 2024-06-04 2024-06-04 Method for realizing light general graphic processor

Country Status (1)

Country Link
CN (1) CN118279125B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112581585A (en) * 2020-12-24 2021-03-30 西安翔腾微电子科技有限公司 TLM device of GPU command processing module based on SysML view and operation method
CN114239806A (en) * 2021-12-16 2022-03-25 浙江大学 RISC-V structured multi-core neural network processor chip

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8782645B2 (en) * 2011-05-11 2014-07-15 Advanced Micro Devices, Inc. Automatic load balancing for heterogeneous cores
US9582287B2 (en) * 2012-09-27 2017-02-28 Intel Corporation Processor having multiple cores, shared core extension logic, and shared core extension utilization instructions
RU2584470C2 (en) * 2014-03-18 2016-05-20 Федеральное государственное учреждение "Федеральный научный центр Научно-исследовательский институт системных исследований Российской академии наук" (ФГУ ФНЦ НИИСИ РАН) Hybrid flow microprocessor
CN104503950B (en) * 2014-12-09 2017-10-24 中国航空工业集团公司第六三一研究所 A kind of graphics processor towards OpenGL API
CN105630441B (en) * 2015-12-11 2018-12-25 中国航空工业集团公司西安航空计算技术研究所 A kind of GPU system based on unified staining technique
EP3625939A1 (en) * 2017-07-10 2020-03-25 Fungible, Inc. Access node for data centers
CN109144573A (en) * 2018-08-16 2019-01-04 胡振波 Two-level pipeline framework based on RISC-V instruction set
CN110007961B (en) * 2019-02-01 2023-07-18 中山大学 RISC-V-based edge computing hardware architecture
US11461097B2 (en) * 2021-01-15 2022-10-04 Cornell University Content-addressable processing engine
CN117151180A (en) * 2023-09-19 2023-12-01 厦门壹普智慧科技有限公司 Processor for simplifying data flow instruction set

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112581585A (en) * 2020-12-24 2021-03-30 西安翔腾微电子科技有限公司 TLM device of GPU command processing module based on SysML view and operation method
CN114239806A (en) * 2021-12-16 2022-03-25 浙江大学 RISC-V structured multi-core neural network processor chip

Also Published As

Publication number Publication date
CN118279125A (en) 2024-07-02

Similar Documents

Publication Publication Date Title
US20120256922A1 (en) Multithreaded Processor and Method for Realizing Functions of Central Processing Unit and Graphics Processing Unit
US11609792B2 (en) Maximizing resource utilization of neural network computing system
Orr et al. Fine-grain task aggregation and coordination on GPUs
WO2020103706A1 (en) Data processing system and data processing method
JP6336399B2 (en) Multi-threaded computing
WO2022134729A1 (en) Risc-v-based artificial intelligence inference method and system
CN112580792B (en) Neural network multi-core tensor processor
WO2022078400A1 (en) Device and method for processing multi-dimensional data, and computer program product
Chen et al. Characterizing scalar opportunities in GPGPU applications
CN111026444A (en) GPU parallel array SIMT instruction processing model
CN118279125B (en) Method for realizing light general graphic processor
WO2021008257A1 (en) Coprocessor and data processing acceleration method therefor
CN108549935B (en) Device and method for realizing neural network model
JP2024538829A (en) Artificial intelligence core, artificial intelligence core system, and load/store method for artificial intelligence core system
US20190272460A1 (en) Configurable neural network processor for machine learning workloads
Ho et al. Improving gpu throughput through parallel execution using tensor cores and cuda cores
CN104636207B (en) Coordinated dispatching method and system based on GPGPU architectures
Lin et al. swFLOW: A dataflow deep learning framework on sunway taihulight supercomputer
KR101420592B1 (en) Computer system
US20110247018A1 (en) API For Launching Work On a Processor
Maitre et al. Fast evaluation of GP trees on GPGPU by optimizing hardware scheduling
CN111443898A (en) Method for designing flow program control software based on priority queue and finite-state machine
Zhang et al. CPU-assisted GPU thread pool model for dynamic task parallelism
Kwon et al. Mobile GPU shader processor based on non-blocking coarse grained reconfigurable arrays architecture
Falahati et al. ISP: Using idle SMs in hardware-based prefetching

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant