CN110135569B - Heterogeneous platform neuron positioning three-stage pipeline parallel method, system and medium

Heterogeneous platform neuron positioning three-stage pipeline parallel method, system and medium

Info

Publication number
CN110135569B
CN110135569B
Authority
CN
China
Prior art keywords
memory space
pointer variable
pointer
CPU
GPU
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910289495.7A
Other languages
Chinese (zh)
Other versions
CN110135569A (en)
Inventor
邹丹
朱小谦
朱敏
王文珂
李金才
汪祥
陆丽娜
甘新标
孟祥飞
夏飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN201910289495.7A priority Critical patent/CN110135569B/en
Publication of CN110135569A publication Critical patent/CN110135569A/en
Application granted granted Critical
Publication of CN110135569B publication Critical patent/CN110135569B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3877Concurrent instruction execution, e.g. pipeline or look ahead using a slave processor, e.g. coprocessor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/061Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using biological neurons, e.g. biological neurons connected to an integrated circuit

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Neurology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Molecular Biology (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a heterogeneous platform neuron positioning three-stage pipeline parallel method, system and medium. Blocking parameters for the slice image data are calculated according to the image size and the calculation granularity; storage space is allocated at the CPU end and the GPU end respectively based on the blocking parameters; variables and storage space are initialized; the CPU performs task scheduling, and the CPU and the GPU simultaneously execute calculation tasks in a three-stage pipeline mode. Each calculation task comprises three steps: data read-in, positioning calculation and data write-back; each intermediate calculation task executes the data write-back of the previous task and the data read-in of the next task while performing its own positioning calculation, so that the three steps proceed in parallel. The method improves the processing speed of neuron positioning, and has the advantages of fast neuron positioning, short total program execution time, a flexible three-stage pipeline implementation, support for parameter configuration, and easy porting and popularization.

Description

Heterogeneous platform neuron positioning three-stage pipeline parallel method, system and medium
Technical Field
The invention relates to a method for analyzing the fine structure of neural circuits, and in particular to a heterogeneous platform neuron positioning three-stage pipeline parallel method, system and medium for realizing parallel neuron positioning computation on a CPU-GPU heterogeneous computing platform.
Background
Neural circuit information is key to understanding brain function and the mechanisms of brain disease, and how to automatically track neural circuit big data is one of the key scientific problems faced by brain science and related fields of neuroscience research. Neuron positioning is central to neural circuit data analysis: obtaining accurate neuron soma positions by analyzing neural circuit image data is the basis of subsequent quantitative analysis.
The typical large-scale neuron positioning method is based on the biological fact that each cell has one and only one soma: a biophysical model is established by a mathematical method (such as the idea of 1-norm minimization), and large-scale neuron positioning is performed by solving the model. The method is robust to the various cell types, shapes, sizes and distribution densities encountered over a large scale, and is the main method for large-scale neuron positioning in current high-precision neural circuit image data sets; however, the image scale it can process is limited by the memory capacity of a single computing node, and its processing speed is limited by the computing performance of a single computing node.
With the continuous progress of observation technology, the data scale of high-precision neural circuit image data sets is growing rapidly; in particular, major advances in optical molecular labeling and microscopic imaging technology have made high-resolution acquisition of whole-brain data practical. Because of the large brain volume of primates, imaging a 10 cubic centimeter region at 1 micron resolution in each direction will, by the standards of current MOST imaging techniques, yield hundreds of TB of data. With the existing neuron positioning method, processing 1 GB of data takes 1 hour, so processing 1 TB of data takes 1000 hours, i.e., more than 40 days. How to efficiently position neurons in TB-scale data containing dense neuron populations remains a huge image processing challenge, and has become a bottleneck that seriously restricts whether acquired data can be converted into knowledge.
Graphics Processing Units (GPUs) employ a design architecture completely different from that of conventional general-purpose multi-core processors. The GPU is designed for large-scale data-parallel computing patterns, whose typical applications include graphics and video processing, large-scale matrix computation, and numerical simulation. Unlike a general-purpose multi-core processor, a GPU largely adopts a SIMD (Single Instruction Multiple Data) structure to achieve data and instruction parallelism within the same processor. With the continuous improvement of GPU programmability, particularly the appearance of programming environments such as CUDA and a series of advanced debugging tools, the complexity of general-purpose GPU programming has been greatly reduced, opening a new era of general-purpose computing on GPUs. General-purpose computing on graphics processing units (GPGPU) has turned the GPU into a highly parallel, multi-threaded, many-core processor with powerful computational capability and high memory bandwidth.
Compared with a homogeneous parallel architecture, a heterogeneous parallel architecture composed of a general-purpose CPU and a GPU coprocessor is better suited to large-scale computation-intensive tasks. A heterogeneous parallel architecture can effectively adapt to the complexity of program characteristics across many application fields, achieves high efficiency in practical applications, conforms to the trend of rapidly increasing VLSI chip capacity, and can meet the increasingly diverse characteristics of application programs. A heterogeneous architecture comprises processors of different kinds, a transaction-oriented general-purpose CPU and a computation-oriented special-purpose GPU, and gains its advantage by assigning different tasks to the type of processor best suited to them.
However, because the CPU-GPU heterogeneous computing model differs from the traditional homogeneous CPU computing model, existing CPU-based programs cannot run directly on the GPU. Moreover, because the GPU cannot directly access the CPU's storage space, in order to use the GPU's computing power, input data must be transferred from CPU memory to GPU memory before a calculation starts, and the result must be transferred from GPU memory back to CPU memory after the calculation ends, and so on until all calculation tasks have been executed. Frequent data transfers between the CPU and the GPU occupy a large amount of program running time and greatly affect program efficiency. How to improve the computational efficiency of the CPU and GPU while reducing the data transfer overhead between them is the difficulty in developing a neuron positioning algorithm for a CPU-GPU heterogeneous architecture. At present, no technical scheme for neuron positioning using a CPU-GPU has been disclosed.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: aiming at the problems in the prior art, the invention provides a heterogeneous platform neuron positioning three-stage pipeline parallel method, system and medium.
In order to solve the technical problems, the invention adopts the technical scheme that:
a heterogeneous platform neuron positioning three-stage pipeline parallel method, comprising the following implementation steps:
1) calculating blocking parameters of the slice image data according to the image size and the calculation granularity;
2) respectively allocating storage space at the CPU end and the GPU end based on the blocking parameters;
3) initializing variables and storage space;
4) the CPU carries out task scheduling, and the CPU and the GPU simultaneously execute calculation tasks in a three-stage pipeline mode; each calculation task comprises three steps: data read-in, positioning calculation and data write-back, and each intermediate calculation task executes the data write-back of the previous task and the data read-in of the next task while performing its own positioning calculation, so that the three steps of data read-in, positioning calculation and data write-back proceed in parallel.
Preferably, the detailed steps of step 1) include:
1.1) calculating the maximum data block size gSizeMax that can be supported for calculation on the GPU, where gSizeMax is a positive integer;
1.2) determining the block size and the number of blocks in the x, y and z directions and the total number of blocks: if xDim < gSizeMax, setting the value of the x-direction block size xScale to xDim, otherwise setting xScale to gSizeMax, and setting the value of the x-direction block number xNum to ⌈xDim/xScale⌉; if yDim < gSizeMax, setting the value of the y-direction block size yScale to yDim, otherwise setting yScale to gSizeMax, and setting the value of the y-direction block number yNum to ⌈yDim/yScale⌉; if zDim < gSizeMax, setting the value of the z-direction block size zScale to zDim, otherwise setting zScale to gSizeMax, and setting the value of the z-direction block number zNum to ⌈zDim/zScale⌉; wherein xDim, yDim and zDim are preset parameters; setting the value of the total number of blocks bNum to xNum × yNum × zNum, and numbering the bNum data blocks sequentially from 1.
Preferably, when storage space is allocated at the CPU end and the GPU end respectively based on the blocking parameters in step 2), three pointer variables gReadPtr, gProcPtr and gWritePtr are declared at the GPU end for GPU-side storage allocation, and the video memory capacity allocated to each pointer is gSizeMax³, where gReadPtr points to the next image block to be processed, gProcPtr points to the image block currently being processed, and gWritePtr points to the most recently processed image block; two pointer variables cReadBuf and cWriteBuf are declared at the CPU end for CPU-side storage allocation, and the memory capacity allocated to each pointer is gSizeMax³, where cReadBuf is used for data buffering between gReadPtr and the disk, and cWriteBuf is used for data buffering between gWritePtr and the disk; the size of the data blocks calculated on the CPU is also set to gSizeMax, and three pointer variables cReadPtr, cProcPtr and cWritePtr are declared at the CPU end, the memory capacity allocated to each pointer being gSizeMax³; here gSizeMax is the maximum data block size that can be supported for calculation on the GPU.
Preferably, the detailed steps of initializing variables and storage space in step 3) include: applying for a mutex loop variable idx on the CPU and initializing idx to 2; reading data block No. 1 from the disk into the memory space pointed to by the pointer variable cProcPtr; reading data block No. 2 from the disk into the memory space pointed to by the pointer variable cReadBuf; and then transferring data block No. 2 from the memory space pointed to by cReadBuf to the video memory space pointed to by the GPU-end pointer variable gProcPtr.
Preferably, the detailed steps in step 4) include:
4.1) starting processes No. 0 to No. 2, responsible for organizing the calculation tasks and data transfers on the GPU, and processes No. 3 to No. 5, responsible for organizing the calculation tasks and data transfers on the CPU;
4.2) calling the GPU through processes No. 0 to No. 2 to execute calculation tasks in a three-stage pipeline mode while calling the CPU through processes No. 3 to No. 5 to execute calculation tasks in a three-stage pipeline mode; while the neurons of each image block are being positioned, the data block required for positioning the next group of neurons is read in and the data block whose neurons have already been positioned is written back to the disk, so that disk read/write operations and neuron positioning operations proceed in parallel;
4.3) synchronizing the processes No. 0, 1, 2, 3, 4 and 5, and finishing the calculation.
Preferably, the detailed steps of calling the GPU through processes No. 0 to No. 2 in step 4.2) to execute calculation tasks in a three-stage pipeline mode include:
4.2.1A) according to the number ncGPU of available computing cores on the GPU, starting ncGPU threads on the GPU through process No. 0, all GPU threads performing neuron positioning calculation in parallel on the data block pointed to by the pointer variable gProcPtr; through process No. 1, adding 1 to the mutex loop variable idx and comparing it with the total number of blocks bNum, and if idx ≤ bNum, reading data block No. idx from the disk into the memory space pointed to by the pointer variable cReadBuf and then transferring it from the CPU-end memory space pointed to by cReadBuf to the GPU-end video memory space pointed to by the pointer variable gReadPtr; through process No. 2, checking the video memory space pointed to by the pointer variable gWritePtr, and if a data block is stored there, transferring it from the video memory space pointed to by gWritePtr to the memory space pointed to by the pointer variable cWriteBuf, then saving it from cWriteBuf to the disk and clearing the video memory space pointed to by gWritePtr;
4.2.2A) synchronizing processes No. 0, No. 1 and No. 2; after synchronization, the calculation of the current GPU data block is finished; process No. 0 then performs the GPU video memory pointer exchange, specifically: declaring a temporary pointer variable gtPtr and setting gtPtr = gProcPtr, gProcPtr = gReadPtr, gReadPtr = gWritePtr and gWritePtr = gtPtr; process No. 0 then checks the video memory space pointed to by gProcPtr, and if its content is empty, step 4.2.3A) is executed, otherwise step 4.2.1A) is executed;
4.2.3A) process No. 0 transfers the data block in the video memory space pointed to by the pointer variable gWritePtr to the memory space pointed to by the pointer variable cWriteBuf, and then writes it back to the disk from cWriteBuf; the video memory space pointed to by the GPU-end pointer variables gReadPtr, gProcPtr and gWritePtr is reclaimed, and the memory space pointed to by the CPU-end pointer variables cReadBuf and cWriteBuf is reclaimed.
Preferably, the detailed steps of calling the CPU through processes No. 3 to No. 5 in step 4.2) to simultaneously execute calculation tasks in a three-stage pipeline mode include:
4.2.1B) according to the number ncCPU of available computing cores on the CPU, starting ncCPU threads on the CPU through process No. 3, all CPU threads performing neuron positioning calculation in parallel on the data block pointed to by the pointer variable cProcPtr; through process No. 4, adding 1 to the mutex loop variable idx and comparing it with the total number of blocks bNum, and if idx ≤ bNum, reading data block No. idx from the disk into the memory space pointed to by the pointer variable cReadPtr; through process No. 5, checking the memory space pointed to by the pointer variable cWritePtr, and if a data block is stored there, saving it to the disk and clearing the memory space pointed to by cWritePtr;
4.2.2B) synchronizing processes No. 3, No. 4 and No. 5; after synchronization, the calculation of the current CPU data block is finished; process No. 3 then performs the CPU memory pointer exchange, specifically: declaring a temporary pointer variable ctPtr and setting ctPtr = cProcPtr, cProcPtr = cReadPtr, cReadPtr = cWritePtr and cWritePtr = ctPtr; process No. 3 then checks the memory space pointed to by cProcPtr, and if its content is empty, step 4.2.3B) is executed, otherwise step 4.2.1B) is executed;
4.2.3B) process No. 3 writes the data block in the memory space pointed to by the pointer variable cWritePtr back to the disk; the memory space pointed to by the CPU-end pointer variables cReadPtr, cProcPtr and cWritePtr is reclaimed.
The invention also provides a heterogeneous platform neuron positioning three-stage pipeline parallel system, comprising a computer device with a GPU, wherein the computer device is programmed to execute the steps of the aforementioned heterogeneous platform neuron positioning three-stage pipeline parallel method of the invention.
The invention also provides a heterogeneous platform neuron positioning three-stage pipeline parallel system, comprising a computer device with a GPU, wherein a storage medium of the computer device stores a computer program programmed to execute the aforementioned heterogeneous platform neuron positioning three-stage pipeline parallel method of the invention.
The invention also provides a computer-readable storage medium storing a computer program programmed to execute the aforementioned heterogeneous platform neuron positioning three-stage pipeline parallel method of the invention.
Compared with the prior art, the invention has the following advantages: blocking parameters for the slice image data are calculated according to the image size and the calculation granularity; storage space is allocated at the CPU end and the GPU end respectively based on the blocking parameters; variables and storage space are initialized; the CPU performs task scheduling, and the CPU and the GPU simultaneously execute calculation tasks in a three-stage pipeline mode, reading the data block required for positioning the next group of neurons while the neurons of each image block are being positioned and writing the data block whose neurons have already been positioned back to the disk, so that disk read/write operations and neuron positioning operations proceed in parallel. The method improves the processing speed of neuron positioning, and has the advantages of fast neuron positioning, short total program execution time, a flexible three-stage pipeline implementation, support for parameter configuration, and easy porting and popularization.
Drawings
FIG. 1 is a schematic diagram of a basic flow of a method according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of the three-stage pipeline parallel principle in the method according to the embodiment of the present invention.
Detailed Description
The heterogeneous platform neuron positioning three-stage pipeline parallel method, system and medium of the present invention are described in further detail below, taking as the heterogeneous platform a server equipped with two twelve-core 2.4 GHz CPUs and an NVIDIA GTX 1080Ti GPU. The hard disk capacity of the server is 24 TB, the memory capacity is 256 GB, and the GPU video memory is 11 GB. The input data consist of a sequence of 10000 single-layer images, each with a resolution of 40000 × 40000.
As shown in FIG. 1, the heterogeneous platform neuron positioning three-stage pipeline parallel method of this embodiment includes the following steps:
1) calculating blocking parameters of the slice image data according to the image size and the calculation granularity;
2) respectively allocating storage space at the CPU end and the GPU end based on the blocking parameters;
3) initializing variables and storage space;
4) the CPU carries out task scheduling, and the CPU and the GPU simultaneously execute calculation tasks in a three-stage pipeline mode; each calculation task comprises three steps: data read-in, positioning calculation and data write-back, and each intermediate calculation task executes the data write-back of the previous task and the data read-in of the next task while performing its own positioning calculation, so that the three steps of data read-in, positioning calculation and data write-back proceed in parallel.
In this embodiment, the detailed steps of step 1) include:
1.1) calculating the maximum data block size gSizeMax that can be supported for calculation on the GPU, where gSizeMax is a positive integer;
1.2) determining the block size and the number of blocks in the x, y and z directions and the total number of blocks: if xDim < gSizeMax, setting the value of the x-direction block size xScale to xDim, otherwise setting xScale to gSizeMax, and setting the value of the x-direction block number xNum to ⌈xDim/xScale⌉; if yDim < gSizeMax, setting the value of the y-direction block size yScale to yDim, otherwise setting yScale to gSizeMax, and setting the value of the y-direction block number yNum to ⌈yDim/yScale⌉; if zDim < gSizeMax, setting the value of the z-direction block size zScale to zDim, otherwise setting zScale to gSizeMax, and setting the value of the z-direction block number zNum to ⌈zDim/zScale⌉; wherein xDim, yDim and zDim are preset parameters; setting the value of the total number of blocks bNum to xNum × yNum × zNum, and numbering the bNum data blocks sequentially from 1. The main variables are defined as follows: cMem: CPU-end memory capacity; gMem: GPU-end video memory capacity; gNum: number of GPUs; xDim: number of pixels in the x direction of each layer; yDim: number of pixels in the y direction of each layer; zDim: number of layers.
In this embodiment, the maximum data block size gSizeMax that can be supported for calculation on the GPU is calculated from the GPU-end video memory capacity gMem [equation rendered as an image in the source]; for the example platform, with gMem = 11 GB, gSizeMax = 154.
In this embodiment, the number of x-direction blocks xNum = ⌈40000/154⌉ = 260, the number of y-direction blocks yNum = ⌈40000/154⌉ = 260, and the number of z-direction blocks zNum = ⌈10000/154⌉ = 65. The total number of blocks bNum = 260 × 260 × 65 = 4394000, the blocks are numbered sequentially from 1, and the size of each data block is 154³ B ≈ 3.65 MB.
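For illustration, the blocking computation of steps 1.1) and 1.2) can be sketched in host-side CUDA C code as follows; this is a minimal sketch under the block-numbering convention above, and the type and function names are illustrative rather than taken from the patent:

    /* Blocking parameters of step 1): block sizes, block counts and the
       total number of blocks, derived from the image extent and gSizeMax. */
    typedef struct {
        int xScale, yScale, zScale;   /* block edge length per direction  */
        int xNum,   yNum,   zNum;     /* number of blocks per direction   */
        long long bNum;               /* total number of blocks (1..bNum) */
    } BlockParams;

    BlockParams computeBlockParams(int xDim, int yDim, int zDim, int gSizeMax)
    {
        BlockParams p;
        p.xScale = (xDim < gSizeMax) ? xDim : gSizeMax;
        p.yScale = (yDim < gSizeMax) ? yDim : gSizeMax;
        p.zScale = (zDim < gSizeMax) ? zDim : gSizeMax;
        p.xNum = (xDim + p.xScale - 1) / p.xScale;   /* ceil(xDim/xScale) */
        p.yNum = (yDim + p.yScale - 1) / p.yScale;   /* ceil(yDim/yScale) */
        p.zNum = (zDim + p.zScale - 1) / p.zScale;   /* ceil(zDim/zScale) */
        p.bNum = (long long)p.xNum * p.yNum * p.zNum;
        return p;
    }

With xDim = yDim = 40000, zDim = 10000 and gSizeMax = 154 as above, computeBlockParams returns xNum = yNum = 260, zNum = 65 and bNum = 4394000, matching the figures given in this embodiment.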
In this embodiment, when storage space is allocated at the CPU end and the GPU end respectively based on the blocking parameters in step 2), three pointer variables gReadPtr, gProcPtr and gWritePtr are declared at the GPU end for GPU-side storage allocation, and the video memory capacity allocated to each pointer is gSizeMax³, where gReadPtr points to the next image block to be processed, gProcPtr points to the image block currently being processed, and gWritePtr points to the most recently processed image block; two pointer variables cReadBuf and cWriteBuf are declared at the CPU end for CPU-side storage allocation, and the memory capacity allocated to each pointer is gSizeMax³, where cReadBuf is used for data buffering between gReadPtr and the disk, and cWriteBuf is used for data buffering between gWritePtr and the disk; the size of the data blocks calculated on the CPU is also set to gSizeMax, and three pointer variables cReadPtr, cProcPtr and cWritePtr are declared at the CPU end, the memory capacity allocated to each pointer being gSizeMax³; here gSizeMax is the maximum data block size that can be supported for calculation on the GPU.
Specifically, in this embodiment, three pointer variables gReadPtr, gProcPtr and gWritePtr are declared at the GPU end, and the video memory capacity allocated to each pointer is 154³ B ≈ 3.65 MB, where gReadPtr points to the next image block to be processed, gProcPtr points to the image block currently being processed, and gWritePtr points to the most recently processed image block. Two pointer variables cReadBuf and cWriteBuf are declared on the CPU, and the memory capacity allocated to each pointer is 154³ B ≈ 3.65 MB, where cReadBuf is used for data buffering between gReadPtr and the disk, and cWriteBuf is used for data buffering between gWritePtr and the disk. The size of the data blocks calculated on the CPU is set to 3.65 MB; correspondingly, three pointer variables cReadPtr, cProcPtr and cWritePtr are declared at the CPU end, and the memory capacity allocated to each pointer is 3.65 MB.
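A minimal CUDA sketch of this allocation step, assuming one byte per voxel (consistent with the 154³ B block size above) and omitting error handling; allocateBuffers is an illustrative name:

    #include <cuda_runtime.h>
    #include <stdlib.h>

    /* Buffer pointers of step 2); file-scope so all six processes can see them. */
    unsigned char *gReadPtr, *gProcPtr, *gWritePtr;   /* GPU end              */
    unsigned char *cReadBuf, *cWriteBuf;              /* CPU staging buffers  */
    unsigned char *cReadPtr, *cProcPtr, *cWritePtr;   /* CPU compute buffers  */

    void allocateBuffers(int gSizeMax)                /* gSizeMax = 154 here  */
    {
        size_t blockBytes = (size_t)gSizeMax * gSizeMax * gSizeMax; /* gSizeMax^3 */
        cudaMalloc((void **)&gReadPtr,  blockBytes);  /* block being read in  */
        cudaMalloc((void **)&gProcPtr,  blockBytes);  /* block being processed */
        cudaMalloc((void **)&gWritePtr, blockBytes);  /* block awaiting write-back */
        cReadBuf  = (unsigned char *)malloc(blockBytes);
        cWriteBuf = (unsigned char *)malloc(blockBytes);
        cReadPtr  = (unsigned char *)malloc(blockBytes);
        cProcPtr  = (unsigned char *)malloc(blockBytes);
        cWritePtr = (unsigned char *)malloc(blockBytes);
    }

In practice, the staging buffers cReadBuf and cWriteBuf could also be allocated with cudaMallocHost, so that host-device copies can overlap with kernel execution.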
The detailed steps of initializing variables and storage space in step 3) in this embodiment include: applying for a mutex loop variable idx on the CPU and initializing idx to 2; reading data block No. 1 from the disk into the memory space pointed to by the pointer variable cProcPtr; reading data block No. 2 from the disk into the memory space pointed to by the pointer variable cReadBuf; and then transferring data block No. 2 from the memory space pointed to by cReadBuf to the video memory space pointed to by the GPU-end pointer variable gProcPtr.
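Building on the allocation sketch above, this initialization amounts to two disk reads and one host-to-device copy; readBlock is a hypothetical helper that loads data block No. k from the disk into a host buffer:

    void readBlock(long long k, unsigned char *dst, size_t nbytes); /* hypothetical */

    long long idx;   /* mutex loop variable shared by processes No. 1 and No. 4 */

    void initPipeline(size_t blockBytes)
    {
        idx = 2;                              /* initialize idx to 2               */
        readBlock(1, cProcPtr, blockBytes);   /* block No. 1 -> CPU compute buffer */
        readBlock(2, cReadBuf, blockBytes);   /* block No. 2 -> CPU staging buffer */
        /* block No. 2: CPU staging buffer -> GPU compute buffer */
        cudaMemcpy(gProcPtr, cReadBuf, blockBytes, cudaMemcpyHostToDevice);
    }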
In this embodiment, the detailed steps in step 4) include:
4.1) starting processes No. 0 to No. 2, responsible for organizing the calculation tasks and data transfers on the GPU, and processes No. 3 to No. 5, responsible for organizing the calculation tasks and data transfers on the CPU (a sketch of this process organization follows the list);
4.2) calling the GPU through processes No. 0 to No. 2 to execute calculation tasks in a three-stage pipeline mode while calling the CPU through processes No. 3 to No. 5 to execute calculation tasks in a three-stage pipeline mode; while the neurons of each image block are being positioned, the data block required for positioning the next group of neurons is read in and the data block whose neurons have already been positioned is written back to the disk, so that disk read/write operations and neuron positioning operations proceed in parallel; in this embodiment, executing calculation tasks on the GPU through processes No. 0 to No. 2 while executing calculation tasks on the CPU through processes No. 3 to No. 5 means that neuron positioning runs on the CPU and the GPU simultaneously, which improves computational efficiency and reduces computation time;
4.3) synchronizing processes No. 0, 1, 2, 3, 4 and 5, and finishing the calculation.
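The six schedulers can be sketched as follows. The embodiment organizes them as processes; the sketch below uses host threads to show the same structure, and gpuStage and cpuStage are hypothetical functions standing for the loops of steps 4.2.1A)-4.2.3A) and 4.2.1B)-4.2.3B):

    #include <thread>
    #include <vector>

    void gpuStage(int id);   /* hypothetical: body of processes No. 0-2 */
    void cpuStage(int id);   /* hypothetical: body of processes No. 3-5 */

    int main()
    {
        std::vector<std::thread> workers;
        for (int id = 0; id <= 2; ++id)          /* GPU pipeline: compute,  */
            workers.emplace_back(gpuStage, id);  /* read-in, write-back     */
        for (int id = 3; id <= 5; ++id)          /* CPU pipeline: compute,  */
            workers.emplace_back(cpuStage, id);  /* read-in, write-back     */
        for (auto &w : workers) w.join();        /* step 4.3: synchronize 0-5 */
        return 0;
    }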
In this embodiment, the detailed steps of calling the GPU through processes No. 0 to No. 2 in step 4.2) to execute calculation tasks in a three-stage pipeline mode include:
4.2.1A) according to the number ncGPU of available computing cores on the GPU, starting ncGPU threads on the GPU through process No. 0, all GPU threads performing neuron positioning calculation in parallel on the data block pointed to by the pointer variable gProcPtr (in this embodiment, process No. 0 starts 3584 threads on the GPU according to the 3584 available computing cores on the GPU); through process No. 1, adding 1 to the mutex loop variable idx and comparing it with the total number of blocks bNum, and if idx ≤ bNum, reading data block No. idx from the disk into the memory space pointed to by the pointer variable cReadBuf and then transferring it from the CPU-end memory space pointed to by cReadBuf to the GPU-end video memory space pointed to by the pointer variable gReadPtr; through process No. 2, checking the video memory space pointed to by the pointer variable gWritePtr, and if a data block is stored there, transferring it from the video memory space pointed to by gWritePtr to the memory space pointed to by the pointer variable cWriteBuf, then saving it from cWriteBuf to the disk and clearing the video memory space pointed to by gWritePtr; in step 4.2.1A), processes No. 0 to No. 2 simultaneously perform data block read-in, data block calculation and data block write-back for the GPU, so that data transfer overlaps with calculation in time and the data transfer overhead of the GPU is reduced;
4.2.2A) synchronizing processes No. 0, No. 1 and No. 2; after synchronization, the calculation of the current GPU data block is finished; process No. 0 then performs the GPU video memory pointer exchange, specifically: declaring a temporary pointer variable gtPtr and setting gtPtr = gProcPtr, gProcPtr = gReadPtr, gReadPtr = gWritePtr and gWritePtr = gtPtr (a sketch of this rotation follows step 4.2.3A)); process No. 0 then checks the video memory space pointed to by gProcPtr, and if its content is empty, step 4.2.3A) is executed, otherwise step 4.2.1A) is executed; in step 4.2.2A), data exchange is achieved by exchanging pointers, which avoids copying large memory regions and improves the space-time efficiency of memory space management;
4.2.3A) process No. 0 transfers the data block in the video memory space pointed to by the pointer variable gWritePtr to the memory space pointed to by the pointer variable cWriteBuf, and then writes it back to the disk from cWriteBuf; the video memory space pointed to by the GPU-end pointer variables gReadPtr, gProcPtr and gWritePtr is reclaimed, and the memory space pointed to by the CPU-end pointer variables cReadBuf and cWriteBuf is reclaimed.
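The pointer exchange of step 4.2.2A) is a three-way rotation of buffer roles; a minimal sketch using the pointers declared above:

    /* Step 4.2.2A): rotate the three GPU buffer roles by exchanging
       pointers only; no block data is copied. */
    void rotateGpuBuffers(void)
    {
        unsigned char *gtPtr = gProcPtr;  /* temporary pointer                  */
        gProcPtr  = gReadPtr;             /* freshly read block is processed next */
        gReadPtr  = gWritePtr;            /* flushed buffer is reused for reading  */
        gWritePtr = gtPtr;                /* processed block awaits write-back  */
    }

Only four pointer assignments are executed per round, which is what gives the exchange its low cost.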
In this embodiment, the detailed steps of calling the CPU through processes No. 3 to No. 5 in step 4.2) to simultaneously execute calculation tasks in a three-stage pipeline mode include:
4.2.1B) according to the number ncCPU of available computing cores on the CPU, starting ncCPU threads on the CPU through process No. 3, all CPU threads performing neuron positioning calculation in parallel on the data block pointed to by the pointer variable cProcPtr; through process No. 4, adding 1 to the mutex loop variable idx and comparing it with the total number of blocks bNum (4394000), and if idx ≤ bNum, reading data block No. idx from the disk into the memory space pointed to by the pointer variable cReadPtr (the mutex-guarded increment shared with process No. 1 is sketched after step 4.2.3B)); through process No. 5, checking the memory space pointed to by the pointer variable cWritePtr, and if a data block is stored there, saving it to the disk and clearing the memory space pointed to by cWritePtr; in step 4.2.1B), processes No. 3 to No. 5 simultaneously perform data block read-in, data block calculation and data block write-back at the CPU end, so that data transfer overlaps with calculation in time and the data transfer overhead at the CPU end is reduced;
4.2.2B) synchronizing processes No. 3, No. 4 and No. 5; after synchronization, the calculation of the current CPU data block is finished; process No. 3 then performs the CPU memory pointer exchange, specifically: declaring a temporary pointer variable ctPtr and setting ctPtr = cProcPtr, cProcPtr = cReadPtr, cReadPtr = cWritePtr and cWritePtr = ctPtr; process No. 3 then checks the memory space pointed to by cProcPtr, and if its content is empty, step 4.2.3B) is executed, otherwise step 4.2.1B) is executed; in step 4.2.2B), data exchange is achieved by exchanging pointers, which avoids copying large memory regions and improves the space-time efficiency of memory space management;
4.2.3B) process No. 3 writes the data block in the memory space pointed to by the pointer variable cWritePtr back to the disk; the memory space pointed to by the CPU-end pointer variables cReadPtr, cProcPtr and cWritePtr is reclaimed.
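Because the loop variable idx is shared by the GPU-side reader (process No. 1) and the CPU-side reader (process No. 4), its increment-and-test must be mutually exclusive; one way to guard it, sketched here with a mutex (nextBlock is an illustrative name):

    #include <mutex>

    std::mutex idxMutex;
    long long idx = 2;    /* initialized in step 3) */

    /* Atomically claim the next block number; returns 0 once all bNum
       blocks have been handed out. Called by processes No. 1 and No. 4. */
    long long nextBlock(long long bNum)
    {
        std::lock_guard<std::mutex> lock(idxMutex);
        idx += 1;
        return (idx <= bNum) ? idx : 0;
    }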
As shown in FIG. 2, because the positioning calculation tasks executed by the CPU and the GPU each comprise the three steps of data read-in, positioning calculation and data write-back, and the data of the three steps are interdependent, the first round (Round 1) only reads data into memory to generate three-dimensional image volume data; the second round (Round 2) reads in memory data to generate three-dimensional image volume data while performing neuron positioning calculation; and from the third round (Round 3) up to the third-from-last round (Round n-2) before the positioning calculation completes, the three steps of data read-in, positioning calculation and data write-back are all performed simultaneously in each round. The data read-in handles the next group of slice image data, the positioning calculation processes the volume data corresponding to the current group of slice image data that has already been read in, and the data write-back writes the neuron positioning results of the previous group of slice image data back to the disk array. Through this technical approach, the time for data read-in and data write-back is effectively hidden within the neuron positioning calculation step.
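The resulting schedule can be tabulated as follows (an illustration of FIG. 2, with Bk denoting data block k and n the total number of rounds, so that n-2 data blocks are processed):

    Round:        1     2     3     4    ...   n-2     n-1     n
    read-in      B1    B2    B3    B4    ...  B(n-2)    -      -
    compute       -    B1    B2    B3    ...  B(n-3)  B(n-2)   -
    write-back    -     -    B1    B2    ...  B(n-4)  B(n-3)  B(n-2)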
In summary, in this embodiment the CPU organizes calculation and data transfer based on a blocking method, the CPU and the GPU both perform neuron positioning with multiple threads, and the data transfer steps among CPU memory, GPU video memory and the disk adopt a multi-stage pipeline: while each image block is being processed, the next image block to be processed is read in and the previously processed image block is written back to the disk, so that data transfer operations and data processing operations proceed in parallel. This embodiment realizes heterogeneous platform neuron positioning three-stage pipeline parallelism through a mixed multi-process and multi-thread parallel technique; on the CPU-GPU heterogeneous parallel computing platform, the CPU multi-core processor and the GPU many-core coprocessor perform neuron positioning calculation simultaneously, and the multi-stage pipeline technique overlaps calculation with data transfer time, so the neuron positioning speed can be improved. Statistics of the running data show that, compared with a neuron positioning algorithm running on two twelve-core CPUs, the heterogeneous platform neuron positioning three-stage pipeline parallel method improves the neuron positioning speed by more than 3 times.
In addition, this embodiment further provides a heterogeneous platform neuron positioning three-stage pipeline parallel system, comprising a computer device with a GPU, the computer device being programmed to execute the steps of the heterogeneous platform neuron positioning three-stage pipeline parallel method of this embodiment. This embodiment also provides a heterogeneous platform neuron positioning three-stage pipeline parallel system, comprising a computer device with a GPU, a storage medium of the computer device storing a computer program programmed to execute the heterogeneous platform neuron positioning three-stage pipeline parallel method of this embodiment. This embodiment also provides a computer-readable storage medium storing a computer program programmed to execute the heterogeneous platform neuron positioning three-stage pipeline parallel method of this embodiment.
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may occur to those skilled in the art without departing from the principle of the invention, and are considered to be within the scope of the invention.

Claims (9)

1. A heterogeneous platform neuron positioning three-stage pipeline parallel method, characterized by comprising the following implementation steps:
1) calculating blocking parameters of the slice image data according to the image size and the calculation granularity;
2) respectively allocating storage space at the CPU end and the GPU end based on the blocking parameters;
3) initializing variables and storage space;
4) the CPU performs task scheduling, and the CPU and the GPU simultaneously execute calculation tasks in a three-stage pipeline mode; each calculation task comprises three steps: data read-in, positioning calculation and data write-back, and each intermediate calculation task executes the data write-back of the previous task and the data read-in of the next task while performing its own positioning calculation, so that the three steps of data read-in, positioning calculation and data write-back proceed in parallel;
the detailed steps of the step 1) comprise:
1.1) calculating the maximum data block size gSizeMax that can be supported for calculation on the GPU, where gSizeMax is a positive integer;
1.2) determining the block size and the number of blocks in the x, y and z directions and the total number of blocks: if xDim < gSizeMax, setting the value of the x-direction block size xScale to xDim, otherwise setting xScale to gSizeMax, and setting the value of the x-direction block number xNum to ⌈xDim/xScale⌉; if yDim < gSizeMax, setting the value of the y-direction block size yScale to yDim, otherwise setting yScale to gSizeMax, and setting the value of the y-direction block number yNum to ⌈yDim/yScale⌉; if zDim < gSizeMax, setting the value of the z-direction block size zScale to zDim, otherwise setting zScale to gSizeMax, and setting the value of the z-direction block number zNum to ⌈zDim/zScale⌉; wherein xDim, yDim and zDim are preset parameters; setting the value of the total number of blocks bNum to xNum × yNum × zNum, and numbering the bNum data blocks sequentially from 1.
2. The heterogeneous platform neuron positioning three-stage pipeline parallel method according to claim 1, wherein in step 2), when storage space is allocated at the CPU end and the GPU end respectively based on the blocking parameters, three pointer variables gReadPtr, gProcPtr and gWritePtr are declared at the GPU end for GPU-side storage allocation, and the video memory capacity allocated to each pointer is gSizeMax³, where gReadPtr points to the next image block to be processed, gProcPtr points to the image block currently being processed, and gWritePtr points to the most recently processed image block; two pointer variables cReadBuf and cWriteBuf are declared at the CPU end for CPU-side storage allocation, and the memory capacity allocated to each pointer is gSizeMax³, where cReadBuf is used for data buffering between gReadPtr and the disk, and cWriteBuf is used for data buffering between gWritePtr and the disk; and the size of the data blocks calculated on the CPU is set to gSizeMax, three pointer variables cReadPtr, cProcPtr and cWritePtr are declared at the CPU end, and the memory capacity allocated to each pointer is gSizeMax³, where gSizeMax is the maximum data block size that can be supported for calculation on the GPU.
3. The heterogeneous platform neuron positioning three-stage pipeline parallel method according to claim 2, wherein the detailed steps of initializing variables and storage space in step 3) comprise: applying for a mutex loop variable idx on the CPU and initializing idx to 2; reading data block No. 1 from the disk into the memory space pointed to by the pointer variable cProcPtr; reading data block No. 2 from the disk into the memory space pointed to by the pointer variable cReadBuf; and then transferring data block No. 2 from the memory space pointed to by cReadBuf to the video memory space pointed to by the GPU-end pointer variable gProcPtr.
4. The heterogeneous platform neuron positioning three-stage pipeline parallel method according to claim 3, wherein the detailed steps in step 4) comprise:
4.1) starting processes No. 0 to No. 2, responsible for organizing the calculation tasks and data transfers on the GPU, and processes No. 3 to No. 5, responsible for organizing the calculation tasks and data transfers on the CPU;
4.2) calling the GPU through processes No. 0 to No. 2 to execute calculation tasks in a three-stage pipeline mode while calling the CPU through processes No. 3 to No. 5 to execute calculation tasks in a three-stage pipeline mode; while the neurons of each image block are being positioned, the data block required for positioning the next group of neurons is read in and the data block whose neurons have already been positioned is written back to the disk, so that disk read/write operations and neuron positioning operations proceed in parallel;
4.3) synchronizing the processes No. 0, 1, 2, 3, 4 and 5, and finishing the calculation.
5. The heterogeneous platform neuron positioning three-stage pipeline parallel method according to claim 4, wherein the detailed steps of calling the GPU through processes No. 0 to No. 2 in step 4.2) to execute calculation tasks in a three-stage pipeline mode comprise:
4.2.1A) according to the number ncGPU of available computing cores on the GPU, starting ncGPU threads on the GPU through process No. 0, all GPU threads performing neuron positioning calculation in parallel on the data block pointed to by the pointer variable gProcPtr; through process No. 1, adding 1 to the mutex loop variable idx and comparing it with the total number of blocks bNum, and if idx ≤ bNum, reading data block No. idx from the disk into the memory space pointed to by the pointer variable cReadBuf and then transferring it from the CPU-end memory space pointed to by cReadBuf to the GPU-end video memory space pointed to by the pointer variable gReadPtr; through process No. 2, checking the video memory space pointed to by the pointer variable gWritePtr, and if a data block is stored there, transferring it from the video memory space pointed to by gWritePtr to the memory space pointed to by the pointer variable cWriteBuf, then saving it from cWriteBuf to the disk and clearing the video memory space pointed to by gWritePtr;
4.2.2A) synchronizing processes No. 0, No. 1 and No. 2; after synchronization, the calculation of the current GPU data block is finished; process No. 0 then performs the GPU video memory pointer exchange, specifically: declaring a temporary pointer variable gtPtr and setting gtPtr = gProcPtr, gProcPtr = gReadPtr, gReadPtr = gWritePtr and gWritePtr = gtPtr; process No. 0 then checks the video memory space pointed to by gProcPtr, and if its content is empty, step 4.2.3A) is executed, otherwise step 4.2.1A) is executed;
4.2.3A) process No. 0 transfers the data block in the video memory space pointed to by the pointer variable gWritePtr to the memory space pointed to by the pointer variable cWriteBuf, and then writes it back to the disk from cWriteBuf; the video memory space pointed to by the GPU-end pointer variables gReadPtr, gProcPtr and gWritePtr is reclaimed, and the memory space pointed to by the CPU-end pointer variables cReadBuf and cWriteBuf is reclaimed.
6. The heterogeneous platform neuron positioning three-stage pipeline parallel method according to claim 5, wherein the detailed steps of calling the CPU through processes No. 3 to No. 5 in step 4.2) to simultaneously execute calculation tasks in a three-stage pipeline mode comprise:
4.2.1B) according to the number ncCPU of available computing cores on the CPU, starting ncCPU threads on the CPU through process No. 3, all CPU threads performing neuron positioning calculation in parallel on the data block pointed to by the pointer variable cProcPtr; through process No. 4, adding 1 to the mutex loop variable idx and comparing it with the total number of blocks bNum, and if idx ≤ bNum, reading data block No. idx from the disk into the memory space pointed to by the pointer variable cReadPtr; through process No. 5, checking the memory space pointed to by the pointer variable cWritePtr, and if a data block is stored there, saving it to the disk and clearing the memory space pointed to by cWritePtr;
4.2.2B) synchronizing processes No. 3, No. 4 and No. 5; after synchronization, the calculation of the current CPU data block is finished; process No. 3 then performs the CPU memory pointer exchange, specifically: declaring a temporary pointer variable ctPtr and setting ctPtr = cProcPtr, cProcPtr = cReadPtr, cReadPtr = cWritePtr and cWritePtr = ctPtr; process No. 3 then checks the memory space pointed to by cProcPtr, and if its content is empty, step 4.2.3B) is executed, otherwise step 4.2.1B) is executed;
4.2.3B) process No. 3 writes the data block in the memory space pointed to by the pointer variable cWritePtr back to the disk; the memory space pointed to by the CPU-end pointer variables cReadPtr, cProcPtr and cWritePtr is reclaimed.
7. A heterogeneous platform neuron positioning three-stage pipeline parallel system, comprising a computer device with a GPU, characterized in that the computer device is programmed to perform the steps of the heterogeneous platform neuron positioning three-stage pipeline parallel method according to any one of claims 1-6.
8. A heterogeneous platform neuron positioning three-stage pipeline parallel system, comprising a computer device with a GPU, characterized in that a storage medium of the computer device stores a computer program programmed to perform the heterogeneous platform neuron positioning three-stage pipeline parallel method according to any one of claims 1-6.
9. A computer-readable storage medium having stored thereon a computer program programmed to perform the heterogeneous platform neuron positioning three-stage pipeline parallel method according to any one of claims 1-6.
CN201910289495.7A 2019-04-11 2019-04-11 Heterogeneous platform neuron positioning three-stage pipeline parallel method, system and medium Active CN110135569B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910289495.7A CN110135569B (en) 2019-04-11 2019-04-11 Heterogeneous platform neuron positioning three-stage pipeline parallel method, system and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910289495.7A CN110135569B (en) 2019-04-11 2019-04-11 Heterogeneous platform neuron positioning three-stage pipeline parallel method, system and medium

Publications (2)

Publication Number Publication Date
CN110135569A CN110135569A (en) 2019-08-16
CN110135569B true CN110135569B (en) 2021-09-21

Family

ID=67569648

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910289495.7A Active CN110135569B (en) Heterogeneous platform neuron positioning three-stage pipeline parallel method, system and medium

Country Status (1)

Country Link
CN (1) CN110135569B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110516795B (en) * 2019-08-28 2022-05-10 北京达佳互联信息技术有限公司 Method and device for allocating processors to model variables and electronic equipment
CN110543940B (en) * 2019-08-29 2022-09-23 中国人民解放军国防科技大学 Neural circuit body data processing method, system and medium based on hierarchical storage
CN110992241A (en) * 2019-11-21 2020-04-10 支付宝(杭州)信息技术有限公司 Heterogeneous embedded system and method for accelerating neural network target detection
CN112529763B (en) * 2020-12-16 2024-06-21 航天科工微电子系统研究院有限公司 Image processing system and tracking system based on soft and hard coupling
CN113806067B (en) * 2021-07-28 2024-03-29 卡斯柯信号有限公司 Safety data verification method, device, equipment and medium based on vehicle-to-vehicle communication
CN113918356B (en) * 2021-12-13 2022-02-18 广东睿江云计算股份有限公司 Method and device for quickly synchronizing data based on CUDA (compute unified device architecture), computer equipment and storage medium
CN117689025B (en) * 2023-12-07 2024-06-14 上海交通大学 Quick large model reasoning service method and system suitable for consumer display card

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103617626A (en) * 2013-12-16 2014-03-05 武汉狮图空间信息技术有限公司 Central processing unit (CPU) and ground power unit (GPU)-based remote-sensing image multi-scale heterogeneous parallel segmentation method
CN104267940A (en) * 2014-09-17 2015-01-07 武汉狮图空间信息技术有限公司 Quick map tile generation method based on CPU+GPU
CN104375807A (en) * 2014-12-09 2015-02-25 中国人民解放军国防科学技术大学 Three-level flow sequence comparison method based on many-core co-processor
CN106815807A (en) * 2017-01-11 2017-06-09 重庆市地理信息中心 A kind of unmanned plane image Fast Mosaic method based on GPU CPU collaborations

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9058680B2 (en) * 2011-12-28 2015-06-16 Think Silicon Ltd Multi-threaded multi-format blending device for computer graphics operations
CN109451322B (en) * 2018-09-14 2021-02-02 北京航天控制仪器研究所 Acceleration implementation method of DCT (discrete cosine transform) algorithm and DWT (discrete wavelet transform) algorithm based on CUDA (compute unified device architecture) for image compression

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103617626A (en) * 2013-12-16 2014-03-05 武汉狮图空间信息技术有限公司 Central processing unit (CPU) and ground power unit (GPU)-based remote-sensing image multi-scale heterogeneous parallel segmentation method
CN104267940A (en) * 2014-09-17 2015-01-07 武汉狮图空间信息技术有限公司 Quick map tile generation method based on CPU+GPU
CN104375807A (en) * 2014-12-09 2015-02-25 中国人民解放军国防科学技术大学 Three-level flow sequence comparison method based on many-core co-processor
CN106815807A (en) * 2017-01-11 2017-06-09 重庆市地理信息中心 A kind of unmanned plane image Fast Mosaic method based on GPU CPU collaborations

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Dual buffer rotation four-stage pipeline for CPU-GPU cooperative computing; Tao Li; Springer; 2017-09-06; entire document *
Research on a naive Bayes image classification algorithm based on heterogeneous system architecture; Xiao Nan; China Masters' Theses Full-text Database, Information Science and Technology; 2018-07-15; Chapter 4 *
A parallel template-matching target recognition algorithm for CPU+GPU heterogeneous platforms; Ma Yongjun et al.; Journal of Tianjin University of Science & Technology; 2014-08-30; pp. 48-52 *

Also Published As

Publication number Publication date
CN110135569A (en) 2019-08-16

Similar Documents

Publication Publication Date Title
CN110135569B (en) Heterogeneous platform neuron positioning three-stage pipeline parallel method, system and medium
US11080051B2 (en) Techniques for efficiently transferring data to a processor
Huang et al. Xmalloc: A scalable lock-free dynamic memory allocator for many-core machines
US10725837B1 (en) Persistent scratchpad memory for data exchange between programs
US11907717B2 (en) Techniques for efficiently transferring data to a processor
US20200264970A1 (en) Memory management system
Gmys et al. A GPU-based Branch-and-Bound algorithm using Integer–Vector–Matrix data structure
CN112749120A (en) Techniques for efficiently transferring data to a processor
Munekawa et al. Design and implementation of the Smith-Waterman algorithm on the CUDA-compatible GPU
CN109408867B (en) Explicit R-K time propulsion acceleration method based on MIC coprocessor
Park et al. mGEMM: low-latency convolution with minimal memory overhead optimized for mobile devices
CN106971369B (en) Data scheduling and distributing method based on GPU (graphics processing Unit) for terrain visual field analysis
CN106484532B (en) GPGPU parallel calculating method towards SPH fluid simulation
US20230289242A1 (en) Hardware accelerated synchronization with asynchronous transaction support
Wu et al. A vectorized k-means algorithm for intel many integrated core architecture
Rapaport GPU molecular dynamics: Algorithms and performance
US20230144553A1 (en) Software-directed register file sharing
Ino et al. Performance study of LU decomposition on the programmable GPU
Dudnik et al. Cuda architecture analysis as the driving force Of parallel calculation organization
Nelson et al. Don't forget about synchronization! Guidelines for using locks on graphics processing units
Aji et al. Accelerating data-serial applications on data-parallel GPGPUs: a systems approach
US20230297643A1 (en) Non-rectangular matrix computations and data pattern processing using tensor cores
CN118519787B (en) Memory access and parallel efficiency optimization method based on synchronization-free SpTRSV algorithm
US20240311163A1 (en) Hardware-driven call stack attribution
US20230101085A1 (en) Techniques for accelerating smith-waterman sequence alignments

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant