CN110135569B - Heterogeneous platform neuron positioning three-stage pipeline parallel method, system and medium

Heterogeneous platform neuron positioning three-stage pipeline parallel method, system and medium

Info

Publication number
CN110135569B
CN110135569B
Authority
CN
China
Prior art keywords
memory space
pointer variable
pointer
CPU
GPU
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910289495.7A
Other languages
Chinese (zh)
Other versions
CN110135569A (en)
Inventor
邹丹
朱小谦
朱敏
王文珂
李金才
汪祥
陆丽娜
甘新标
孟祥飞
夏飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN201910289495.7A priority Critical patent/CN110135569B/en
Publication of CN110135569A publication Critical patent/CN110135569A/en
Application granted granted Critical
Publication of CN110135569B publication Critical patent/CN110135569B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3877Concurrent instruction execution, e.g. pipeline or look ahead using a slave processor, e.g. coprocessor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/061Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using biological neurons, e.g. biological neurons connected to an integrated circuit

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Neurology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Molecular Biology (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a heterogeneous platform neuron positioning three-stage pipeline parallel method, system and medium. Blocking parameters for the slice image data are calculated according to the image size and the calculation granularity; storage space is allocated at the CPU end and the GPU end respectively based on the blocking parameters; variables and storage space are initialized; the CPU performs task scheduling, and the CPU and the GPU simultaneously execute calculation tasks in a three-stage pipeline mode. Each calculation task comprises three steps: data read-in, positioning calculation and data write-back; each intermediate calculation task executes the data write-back of the previous task and the data read-in of the next task while performing its own positioning calculation, so that the three steps proceed in parallel. The method improves the processing speed of neuron positioning, and has the advantages of fast neuron positioning, short total program execution time, a flexible three-stage pipeline implementation, support for parameter configuration, and easy porting and popularization.

Description

Heterogeneous platform neuron positioning three-stage pipeline parallel method, system and medium
Technical Field
The invention relates to a method for analyzing the fine structure of neural circuits, and in particular to a heterogeneous platform neuron positioning three-stage pipeline parallel method, system and medium for realizing parallel neuron positioning computation on a CPU-GPU heterogeneous computing platform.
Background
Neural circuit information is key to understanding brain function and the mechanisms of brain disease, and how to automatically track neural circuit big data is one of the key scientific problems faced by brain science and related fields of neuroscience research. Neuron positioning is central to neural circuit data analysis: obtaining accurate neuron soma positions by analyzing neural circuit image data is the basis of subsequent quantitative analysis.
The typical large-scale neuron positioning method is based on the biological fact that each cell has one and only one soma: a biophysical model is established by a mathematical method (such as the idea of 1-norm minimization), and large-scale neuron positioning is performed by solving the model. The method is robust to the various cell types, shapes, sizes and distribution densities encountered over a large scale, and is the main method for large-scale neuron positioning in current high-precision neural circuit image data sets; however, the image scale it can process is limited by the memory capacity of a single computing node, and its processing speed is limited by the computing performance of a single computing node.
With the continuous progress of observation technology, the data scale of high-precision neural circuit image data sets is growing rapidly; in particular, major advances in optical molecular labeling and microscopic imaging technology have made high-resolution acquisition of whole-brain data practical. Because of the large brain volume of primates, imaging a 10 cubic centimeter region at 1 micron resolution in each direction will, by the standards of current MOST imaging techniques, yield hundreds of TB of data. With the existing neuron positioning method, processing 1 GB of data takes 1 hour, so processing 1 TB of data takes 1000 hours, i.e., more than 40 days. How to efficiently position neurons in TB-scale data containing dense neuron populations remains a huge image processing challenge, and has become a bottleneck that seriously restricts whether acquired data can be converted into knowledge.
Graphics Processing Units (GPUs) employ a design architecture completely different from that of conventional general-purpose multi-core processors. The GPU is designed for large-scale data-parallel computing patterns, whose typical applications include graphics and video processing, large-scale matrix computation, and numerical simulation. Unlike a general-purpose multi-core processor, a GPU largely adopts a SIMD (Single Instruction Multiple Data) structure to achieve data and instruction parallelism within the same processor. With the continuous improvement of GPU programmability, particularly the appearance of programming environments such as CUDA and a series of advanced debugging tools, the complexity of general-purpose GPU programming has been greatly reduced, opening a new era of general-purpose computing on GPUs. General-purpose computing on graphics processing units (GPGPU) has turned the GPU into a highly parallel, multi-threaded, many-core processor with powerful computational capability and high memory bandwidth.
Compared with a homogeneous parallel architecture, a heterogeneous parallel architecture composed of a general-purpose CPU and a GPU coprocessor is better suited to large-scale computation-intensive tasks. A heterogeneous parallel architecture can effectively adapt to the complexity of program characteristics across many application fields, achieves high efficiency in practical applications, conforms to the trend of rapidly increasing VLSI chip capacity, and can meet the increasingly diverse characteristics of application programs. A heterogeneous architecture comprises processors of different kinds, a transaction-oriented general-purpose CPU and a computation-oriented special-purpose GPU, and gains its advantage by assigning different tasks to the type of processor best suited to them.
However, because the CPU-GPU heterogeneous computing model differs from the traditional homogeneous CPU computing model, existing CPU-based programs cannot run directly on the GPU. Moreover, because the GPU cannot directly access the CPU's storage space, in order to use the GPU's computing power, input data must be transferred from CPU memory to GPU memory before a calculation starts, and the result must be transferred from GPU memory back to CPU memory after the calculation ends, and so on until all calculation tasks have been executed. Frequent data transfers between the CPU and the GPU occupy a large amount of program running time and greatly affect program efficiency. How to improve the computational efficiency of the CPU and GPU while reducing the data transfer overhead between them is the difficulty in developing a neuron positioning algorithm for a CPU-GPU heterogeneous architecture. At present, no technical scheme for neuron positioning using a CPU-GPU has been disclosed.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: aiming at the problems in the prior art, the invention provides a heterogeneous platform neuron positioning three-stage pipeline parallel method, system and medium.
In order to solve the technical problems, the invention adopts the technical scheme that:
a heterogeneous platform neuron positioning three-stage pipeline parallel method, comprising the following implementation steps:
1) calculating blocking parameters of the slice image data according to the image size and the calculation granularity;
2) respectively allocating storage space at the CPU end and the GPU end based on the blocking parameters;
3) initializing variables and storage space;
4) the CPU carries out task scheduling, and the CPU and the GPU simultaneously execute calculation tasks in a three-stage pipeline mode; each calculation task comprises three steps: data read-in, positioning calculation and data write-back, and each intermediate calculation task executes the data write-back of the previous task and the data read-in of the next task while performing its own positioning calculation, so that the three steps of data read-in, positioning calculation and data write-back proceed in parallel.
Preferably, the detailed steps of step 1) include:
1.1) calculating the maximum data block size gSizeMax that can be supported for calculation on the GPU, where gSizeMax is a positive integer;
1.2) determining the block size and the number of blocks in the x, y and z directions and the total number of blocks: if xDim < gSizeMax, setting the value of the x-direction block size xScale to xDim, otherwise setting xScale to gSizeMax, and setting the value of the x-direction block number xNum to ⌈xDim/xScale⌉; if yDim < gSizeMax, setting the value of the y-direction block size yScale to yDim, otherwise setting yScale to gSizeMax, and setting the value of the y-direction block number yNum to ⌈yDim/yScale⌉; if zDim < gSizeMax, setting the value of the z-direction block size zScale to zDim, otherwise setting zScale to gSizeMax, and setting the value of the z-direction block number zNum to ⌈zDim/zScale⌉; wherein xDim, yDim and zDim are preset parameters; setting the value of the total number of blocks bNum to xNum × yNum × zNum, and numbering the bNum data blocks sequentially from 1.
Preferably, when storage space is allocated at the CPU end and the GPU end respectively based on the blocking parameters in step 2), three pointer variables gReadPtr, gProcPtr and gWritePtr are declared at the GPU end for GPU-side storage allocation, and the video memory capacity allocated to each pointer is gSizeMax³, where gReadPtr points to the next image block to be processed, gProcPtr points to the image block currently being processed, and gWritePtr points to the most recently processed image block; two pointer variables cReadBuf and cWriteBuf are declared at the CPU end for CPU-side storage allocation, and the memory capacity allocated to each pointer is gSizeMax³, where cReadBuf is used for data buffering between gReadPtr and the disk, and cWriteBuf is used for data buffering between gWritePtr and the disk; the size of the data blocks calculated on the CPU is also set to gSizeMax, and three pointer variables cReadPtr, cProcPtr and cWritePtr are declared at the CPU end, the memory capacity allocated to each pointer being gSizeMax³; here gSizeMax is the maximum data block size that can be supported for calculation on the GPU.
Preferably, the detailed steps of initializing variables and storage space in step 3) include: applying for a mutex loop variable idx on the CPU and initializing idx to 2; reading data block No. 1 from the disk into the memory space pointed to by the pointer variable cProcPtr; reading data block No. 2 from the disk into the memory space pointed to by the pointer variable cReadBuf; and then transferring data block No. 2 from the memory space pointed to by cReadBuf to the video memory space pointed to by the GPU-end pointer variable gProcPtr.
Preferably, the detailed steps in step 4) include:
4.1) starting processes No. 0 to No. 2, responsible for organizing the calculation tasks and data transfers on the GPU, and processes No. 3 to No. 5, responsible for organizing the calculation tasks and data transfers on the CPU;
4.2) calling the GPU through processes No. 0 to No. 2 to execute calculation tasks in a three-stage pipeline mode while calling the CPU through processes No. 3 to No. 5 to execute calculation tasks in a three-stage pipeline mode; while the neurons of each image block are being positioned, the data block required for positioning the next group of neurons is read in and the data block whose neurons have already been positioned is written back to the disk, so that disk read/write operations and neuron positioning operations proceed in parallel;
4.3) synchronizing the processes No. 0, 1, 2, 3, 4 and 5, and finishing the calculation.
Preferably, the detailed steps of calling the GPU through processes No. 0 to No. 2 in step 4.2) to execute calculation tasks in a three-stage pipeline mode include:
4.2.1A) according to the number ncGPU of available computing cores on the GPU, starting ncGPU threads on the GPU through process No. 0, all GPU threads performing neuron positioning calculation in parallel on the data block pointed to by the pointer variable gProcPtr; through process No. 1, adding 1 to the mutex loop variable idx and comparing it with the total number of blocks bNum, and if idx ≤ bNum, reading data block No. idx from the disk into the memory space pointed to by the pointer variable cReadBuf and then transferring it from the CPU-end memory space pointed to by cReadBuf to the GPU-end video memory space pointed to by the pointer variable gReadPtr; through process No. 2, checking the video memory space pointed to by the pointer variable gWritePtr, and if a data block is stored there, transferring it from the video memory space pointed to by gWritePtr to the memory space pointed to by the pointer variable cWriteBuf, then saving it from cWriteBuf to the disk and clearing the video memory space pointed to by gWritePtr;
4.2.2A) synchronizing processes No. 0, No. 1 and No. 2; after synchronization, the calculation of the current GPU data block is finished; process No. 0 then performs the GPU video memory pointer exchange, specifically: declaring a temporary pointer variable gtPtr and setting gtPtr = gProcPtr, gProcPtr = gReadPtr, gReadPtr = gWritePtr and gWritePtr = gtPtr; process No. 0 then checks the video memory space pointed to by gProcPtr, and if its content is empty, step 4.2.3A) is executed, otherwise step 4.2.1A) is executed;
4.2.3A) process No. 0 transfers the data block in the video memory space pointed to by the pointer variable gWritePtr to the memory space pointed to by the pointer variable cWriteBuf, and then writes it back to the disk from cWriteBuf; the video memory space pointed to by the GPU-end pointer variables gReadPtr, gProcPtr and gWritePtr is reclaimed, and the memory space pointed to by the CPU-end pointer variables cReadBuf and cWriteBuf is reclaimed.
Preferably, the detailed steps of calling the CPU through processes No. 3 to No. 5 in step 4.2) to simultaneously execute calculation tasks in a three-stage pipeline mode include:
4.2.1B) according to the number ncCPU of available computing cores on the CPU, starting ncCPU threads on the CPU through process No. 3, all CPU threads performing neuron positioning calculation in parallel on the data block pointed to by the pointer variable cProcPtr; through process No. 4, adding 1 to the mutex loop variable idx and comparing it with the total number of blocks bNum, and if idx ≤ bNum, reading data block No. idx from the disk into the memory space pointed to by the pointer variable cReadPtr; through process No. 5, checking the memory space pointed to by the pointer variable cWritePtr, and if a data block is stored there, saving it to the disk and clearing the memory space pointed to by cWritePtr;
4.2.2B) synchronizing processes No. 3, No. 4 and No. 5; after synchronization, the calculation of the current CPU data block is finished; process No. 3 then performs the CPU memory pointer exchange, specifically: declaring a temporary pointer variable ctPtr and setting ctPtr = cProcPtr, cProcPtr = cReadPtr, cReadPtr = cWritePtr and cWritePtr = ctPtr; process No. 3 then checks the memory space pointed to by cProcPtr, and if its content is empty, step 4.2.3B) is executed, otherwise step 4.2.1B) is executed;
4.2.3B) process No. 3 writes the data block in the memory space pointed to by the pointer variable cWritePtr back to the disk; the memory space pointed to by the CPU-end pointer variables cReadPtr, cProcPtr and cWritePtr is reclaimed.
The invention also provides a heterogeneous platform neuron positioning three-stage pipeline parallel system, comprising a computer device with a GPU, wherein the computer device is programmed to execute the steps of the aforementioned heterogeneous platform neuron positioning three-stage pipeline parallel method of the invention.
The invention also provides a heterogeneous platform neuron positioning three-stage pipeline parallel system, comprising a computer device with a GPU, wherein a storage medium of the computer device stores a computer program programmed to execute the aforementioned heterogeneous platform neuron positioning three-stage pipeline parallel method of the invention.
The invention also provides a computer-readable storage medium storing a computer program programmed to execute the aforementioned heterogeneous platform neuron positioning three-stage pipeline parallel method of the invention.
Compared with the prior art, the invention has the following advantages: blocking parameters for the slice image data are calculated according to the image size and the calculation granularity; storage space is allocated at the CPU end and the GPU end respectively based on the blocking parameters; variables and storage space are initialized; the CPU performs task scheduling, and the CPU and the GPU simultaneously execute calculation tasks in a three-stage pipeline mode, reading the data block required for positioning the next group of neurons while the neurons of each image block are being positioned and writing the data block whose neurons have already been positioned back to the disk, so that disk read/write operations and neuron positioning operations proceed in parallel. The method improves the processing speed of neuron positioning, and has the advantages of fast neuron positioning, short total program execution time, a flexible three-stage pipeline implementation, support for parameter configuration, and easy porting and popularization.
Drawings
FIG. 1 is a schematic diagram of a basic flow of a method according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of the three-stage pipeline parallel principle in the method according to the embodiment of the present invention.
Detailed Description
The heterogeneous platform neuron positioning three-stage pipeline parallel method, system and medium of the present invention are described in further detail below, taking as the heterogeneous platform a server equipped with two twelve-core 2.4 GHz CPUs and an NVIDIA GTX 1080Ti GPU. The hard disk capacity of the server is 24 TB, the memory capacity is 256 GB, and the GPU video memory is 11 GB. The input data consist of a sequence of 10000 single-layer images, each with a resolution of 40000 × 40000.
As shown in FIG. 1, the heterogeneous platform neuron positioning three-stage pipeline parallel method of this embodiment includes the following steps:
1) calculating blocking parameters of the slice image data according to the image size and the calculation granularity;
2) respectively allocating storage space at the CPU end and the GPU end based on the blocking parameters;
3) initializing variables and storage space;
4) the CPU carries out task scheduling, and the CPU and the GPU simultaneously execute calculation tasks in a three-stage pipeline mode; each calculation task comprises three steps: data read-in, positioning calculation and data write-back, and each intermediate calculation task executes the data write-back of the previous task and the data read-in of the next task while performing its own positioning calculation, so that the three steps of data read-in, positioning calculation and data write-back proceed in parallel.
In this embodiment, the detailed steps of step 1) include:
1.1) calculating the maximum data block size gSizeMax that can be supported for calculation on the GPU, where gSizeMax is a positive integer;
1.2) determining the block size and the number of blocks in the x, y and z directions and the total number of blocks: if xDim < gSizeMax, setting the value of the x-direction block size xScale to xDim, otherwise setting xScale to gSizeMax, and setting the value of the x-direction block number xNum to ⌈xDim/xScale⌉; if yDim < gSizeMax, setting the value of the y-direction block size yScale to yDim, otherwise setting yScale to gSizeMax, and setting the value of the y-direction block number yNum to ⌈yDim/yScale⌉; if zDim < gSizeMax, setting the value of the z-direction block size zScale to zDim, otherwise setting zScale to gSizeMax, and setting the value of the z-direction block number zNum to ⌈zDim/zScale⌉; wherein xDim, yDim and zDim are preset parameters; setting the value of the total number of blocks bNum to xNum × yNum × zNum, and numbering the bNum data blocks sequentially from 1. The main variables are defined as follows: cMem: CPU-end memory capacity; gMem: GPU-end video memory capacity; gNum: number of GPUs; xDim: number of pixels in the x direction of each layer; yDim: number of pixels in the y direction of each layer; zDim: number of layers.
In this embodiment, the maximum data block size gSizeMax that can be supported for calculation on the GPU is calculated from the GPU-end video memory capacity gMem [equation rendered as an image in the source]; for the example platform, with gMem = 11 GB, gSizeMax = 154.
In this embodiment, the number of x-direction blocks xNum = ⌈40000/154⌉ = 260, the number of y-direction blocks yNum = ⌈40000/154⌉ = 260, and the number of z-direction blocks zNum = ⌈10000/154⌉ = 65. The total number of blocks bNum = 260 × 260 × 65 = 4394000, the blocks are numbered sequentially from 1, and the size of each data block is 154³ B ≈ 3.65 MB.
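For illustration, the blocking computation of steps 1.1) and 1.2) can be sketched in host-side CUDA C code as follows; this is a minimal sketch under the block-numbering convention above, and the type and function names are illustrative rather than taken from the patent:

    /* Blocking parameters of step 1): block sizes, block counts and the
       total number of blocks, derived from the image extent and gSizeMax. */
    typedef struct {
        int xScale, yScale, zScale;   /* block edge length per direction  */
        int xNum,   yNum,   zNum;     /* number of blocks per direction   */
        long long bNum;               /* total number of blocks (1..bNum) */
    } BlockParams;

    BlockParams computeBlockParams(int xDim, int yDim, int zDim, int gSizeMax)
    {
        BlockParams p;
        p.xScale = (xDim < gSizeMax) ? xDim : gSizeMax;
        p.yScale = (yDim < gSizeMax) ? yDim : gSizeMax;
        p.zScale = (zDim < gSizeMax) ? zDim : gSizeMax;
        p.xNum = (xDim + p.xScale - 1) / p.xScale;   /* ceil(xDim/xScale) */
        p.yNum = (yDim + p.yScale - 1) / p.yScale;   /* ceil(yDim/yScale) */
        p.zNum = (zDim + p.zScale - 1) / p.zScale;   /* ceil(zDim/zScale) */
        p.bNum = (long long)p.xNum * p.yNum * p.zNum;
        return p;
    }

With xDim = yDim = 40000, zDim = 10000 and gSizeMax = 154 as above, computeBlockParams returns xNum = yNum = 260, zNum = 65 and bNum = 4394000, matching the figures given in this embodiment.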
In this embodiment, when storage space is allocated at the CPU end and the GPU end respectively based on the blocking parameters in step 2), three pointer variables gReadPtr, gProcPtr and gWritePtr are declared at the GPU end for GPU-side storage allocation, and the video memory capacity allocated to each pointer is gSizeMax³, where gReadPtr points to the next image block to be processed, gProcPtr points to the image block currently being processed, and gWritePtr points to the most recently processed image block; two pointer variables cReadBuf and cWriteBuf are declared at the CPU end for CPU-side storage allocation, and the memory capacity allocated to each pointer is gSizeMax³, where cReadBuf is used for data buffering between gReadPtr and the disk, and cWriteBuf is used for data buffering between gWritePtr and the disk; the size of the data blocks calculated on the CPU is also set to gSizeMax, and three pointer variables cReadPtr, cProcPtr and cWritePtr are declared at the CPU end, the memory capacity allocated to each pointer being gSizeMax³; here gSizeMax is the maximum data block size that can be supported for calculation on the GPU.
Specifically, in this embodiment, three pointer variables gReadPtr, gProcPtr and gWritePtr are declared at the GPU end, and the video memory capacity allocated to each pointer is 154³ B ≈ 3.65 MB, where gReadPtr points to the next image block to be processed, gProcPtr points to the image block currently being processed, and gWritePtr points to the most recently processed image block. Two pointer variables cReadBuf and cWriteBuf are declared on the CPU, and the memory capacity allocated to each pointer is 154³ B ≈ 3.65 MB, where cReadBuf is used for data buffering between gReadPtr and the disk, and cWriteBuf is used for data buffering between gWritePtr and the disk. The size of the data blocks calculated on the CPU is set to 3.65 MB; correspondingly, three pointer variables cReadPtr, cProcPtr and cWritePtr are declared at the CPU end, and the memory capacity allocated to each pointer is 3.65 MB.
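A minimal CUDA sketch of this allocation step, assuming one byte per voxel (consistent with the 154³ B block size above) and omitting error handling; allocateBuffers is an illustrative name:

    #include <cuda_runtime.h>
    #include <stdlib.h>

    /* Buffer pointers of step 2); file-scope so all six processes can see them. */
    unsigned char *gReadPtr, *gProcPtr, *gWritePtr;   /* GPU end              */
    unsigned char *cReadBuf, *cWriteBuf;              /* CPU staging buffers  */
    unsigned char *cReadPtr, *cProcPtr, *cWritePtr;   /* CPU compute buffers  */

    void allocateBuffers(int gSizeMax)                /* gSizeMax = 154 here  */
    {
        size_t blockBytes = (size_t)gSizeMax * gSizeMax * gSizeMax; /* gSizeMax^3 */
        cudaMalloc((void **)&gReadPtr,  blockBytes);  /* block being read in  */
        cudaMalloc((void **)&gProcPtr,  blockBytes);  /* block being processed */
        cudaMalloc((void **)&gWritePtr, blockBytes);  /* block awaiting write-back */
        cReadBuf  = (unsigned char *)malloc(blockBytes);
        cWriteBuf = (unsigned char *)malloc(blockBytes);
        cReadPtr  = (unsigned char *)malloc(blockBytes);
        cProcPtr  = (unsigned char *)malloc(blockBytes);
        cWritePtr = (unsigned char *)malloc(blockBytes);
    }

In practice, the staging buffers cReadBuf and cWriteBuf could also be allocated with cudaMallocHost, so that host-device copies can overlap with kernel execution.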
The detailed steps of initializing variables and storage space in step 3) in this embodiment include: applying for a mutex loop variable idx on the CPU and initializing idx to 2; reading data block No. 1 from the disk into the memory space pointed to by the pointer variable cProcPtr; reading data block No. 2 from the disk into the memory space pointed to by the pointer variable cReadBuf; and then transferring data block No. 2 from the memory space pointed to by cReadBuf to the video memory space pointed to by the GPU-end pointer variable gProcPtr.
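Building on the allocation sketch above, this initialization amounts to two disk reads and one host-to-device copy; readBlock is a hypothetical helper that loads data block No. k from the disk into a host buffer:

    void readBlock(long long k, unsigned char *dst, size_t nbytes); /* hypothetical */

    long long idx;   /* mutex loop variable shared by processes No. 1 and No. 4 */

    void initPipeline(size_t blockBytes)
    {
        idx = 2;                              /* initialize idx to 2               */
        readBlock(1, cProcPtr, blockBytes);   /* block No. 1 -> CPU compute buffer */
        readBlock(2, cReadBuf, blockBytes);   /* block No. 2 -> CPU staging buffer */
        /* block No. 2: CPU staging buffer -> GPU compute buffer */
        cudaMemcpy(gProcPtr, cReadBuf, blockBytes, cudaMemcpyHostToDevice);
    }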
In this embodiment, the detailed steps in step 4) include:
4.1) starting processes No. 0 to No. 2, responsible for organizing the calculation tasks and data transfers on the GPU, and processes No. 3 to No. 5, responsible for organizing the calculation tasks and data transfers on the CPU (a sketch of this process organization follows the list);
4.2) calling the GPU through processes No. 0 to No. 2 to execute calculation tasks in a three-stage pipeline mode while calling the CPU through processes No. 3 to No. 5 to execute calculation tasks in a three-stage pipeline mode; while the neurons of each image block are being positioned, the data block required for positioning the next group of neurons is read in and the data block whose neurons have already been positioned is written back to the disk, so that disk read/write operations and neuron positioning operations proceed in parallel; in this embodiment, executing calculation tasks on the GPU through processes No. 0 to No. 2 while executing calculation tasks on the CPU through processes No. 3 to No. 5 means that neuron positioning runs on the CPU and the GPU simultaneously, which improves computational efficiency and reduces computation time;
4.3) synchronizing processes No. 0, 1, 2, 3, 4 and 5, and finishing the calculation.
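The six schedulers can be sketched as follows. The embodiment organizes them as processes; the sketch below uses host threads to show the same structure, and gpuStage and cpuStage are hypothetical functions standing for the loops of steps 4.2.1A)-4.2.3A) and 4.2.1B)-4.2.3B):

    #include <thread>
    #include <vector>

    void gpuStage(int id);   /* hypothetical: body of processes No. 0-2 */
    void cpuStage(int id);   /* hypothetical: body of processes No. 3-5 */

    int main()
    {
        std::vector<std::thread> workers;
        for (int id = 0; id <= 2; ++id)          /* GPU pipeline: compute,  */
            workers.emplace_back(gpuStage, id);  /* read-in, write-back     */
        for (int id = 3; id <= 5; ++id)          /* CPU pipeline: compute,  */
            workers.emplace_back(cpuStage, id);  /* read-in, write-back     */
        for (auto &w : workers) w.join();        /* step 4.3: synchronize 0-5 */
        return 0;
    }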
In this embodiment, the detailed steps of calling the GPU through processes No. 0 to No. 2 in step 4.2) to execute calculation tasks in a three-stage pipeline mode include:
4.2.1A) according to the number ncGPU of available computing cores on the GPU, starting ncGPU threads on the GPU through process No. 0, all GPU threads performing neuron positioning calculation in parallel on the data block pointed to by the pointer variable gProcPtr (in this embodiment, process No. 0 starts 3584 threads on the GPU according to the 3584 available computing cores on the GPU); through process No. 1, adding 1 to the mutex loop variable idx and comparing it with the total number of blocks bNum, and if idx ≤ bNum, reading data block No. idx from the disk into the memory space pointed to by the pointer variable cReadBuf and then transferring it from the CPU-end memory space pointed to by cReadBuf to the GPU-end video memory space pointed to by the pointer variable gReadPtr; through process No. 2, checking the video memory space pointed to by the pointer variable gWritePtr, and if a data block is stored there, transferring it from the video memory space pointed to by gWritePtr to the memory space pointed to by the pointer variable cWriteBuf, then saving it from cWriteBuf to the disk and clearing the video memory space pointed to by gWritePtr; in step 4.2.1A), processes No. 0 to No. 2 simultaneously perform data block read-in, data block calculation and data block write-back for the GPU, so that data transfer overlaps with calculation in time and the data transfer overhead of the GPU is reduced;
4.2.2A) synchronizing processes No. 0, No. 1 and No. 2; after synchronization, the calculation of the current GPU data block is finished; process No. 0 then performs the GPU video memory pointer exchange, specifically: declaring a temporary pointer variable gtPtr and setting gtPtr = gProcPtr, gProcPtr = gReadPtr, gReadPtr = gWritePtr and gWritePtr = gtPtr (a sketch of this rotation follows step 4.2.3A)); process No. 0 then checks the video memory space pointed to by gProcPtr, and if its content is empty, step 4.2.3A) is executed, otherwise step 4.2.1A) is executed; in step 4.2.2A), data exchange is achieved by exchanging pointers, which avoids copying large memory regions and improves the space-time efficiency of memory space management;
4.2.3A) process No. 0 transfers the data block in the video memory space pointed to by the pointer variable gWritePtr to the memory space pointed to by the pointer variable cWriteBuf, and then writes it back to the disk from cWriteBuf; the video memory space pointed to by the GPU-end pointer variables gReadPtr, gProcPtr and gWritePtr is reclaimed, and the memory space pointed to by the CPU-end pointer variables cReadBuf and cWriteBuf is reclaimed.
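The pointer exchange of step 4.2.2A) is a three-way rotation of buffer roles; a minimal sketch using the pointers declared above:

    /* Step 4.2.2A): rotate the three GPU buffer roles by exchanging
       pointers only; no block data is copied. */
    void rotateGpuBuffers(void)
    {
        unsigned char *gtPtr = gProcPtr;  /* temporary pointer                  */
        gProcPtr  = gReadPtr;             /* freshly read block is processed next */
        gReadPtr  = gWritePtr;            /* flushed buffer is reused for reading  */
        gWritePtr = gtPtr;                /* processed block awaits write-back  */
    }

Only four pointer assignments are executed per round, which is what gives the exchange its low cost.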
In this embodiment, the detailed steps of calling the CPU through processes No. 3 to No. 5 in step 4.2) to simultaneously execute calculation tasks in a three-stage pipeline mode include:
4.2.1B) according to the number ncCPU of available computing cores on the CPU, starting ncCPU threads on the CPU through process No. 3, all CPU threads performing neuron positioning calculation in parallel on the data block pointed to by the pointer variable cProcPtr; through process No. 4, adding 1 to the mutex loop variable idx and comparing it with the total number of blocks bNum (4394000), and if idx ≤ bNum, reading data block No. idx from the disk into the memory space pointed to by the pointer variable cReadPtr (the mutex-guarded increment shared with process No. 1 is sketched after step 4.2.3B)); through process No. 5, checking the memory space pointed to by the pointer variable cWritePtr, and if a data block is stored there, saving it to the disk and clearing the memory space pointed to by cWritePtr; in step 4.2.1B), processes No. 3 to No. 5 simultaneously perform data block read-in, data block calculation and data block write-back at the CPU end, so that data transfer overlaps with calculation in time and the data transfer overhead at the CPU end is reduced;
4.2.2B) synchronizing processes No. 3, No. 4 and No. 5; after synchronization, the calculation of the current CPU data block is finished; process No. 3 then performs the CPU memory pointer exchange, specifically: declaring a temporary pointer variable ctPtr and setting ctPtr = cProcPtr, cProcPtr = cReadPtr, cReadPtr = cWritePtr and cWritePtr = ctPtr; process No. 3 then checks the memory space pointed to by cProcPtr, and if its content is empty, step 4.2.3B) is executed, otherwise step 4.2.1B) is executed; in step 4.2.2B), data exchange is achieved by exchanging pointers, which avoids copying large memory regions and improves the space-time efficiency of memory space management;
4.2.3B) process No. 3 writes the data block in the memory space pointed to by the pointer variable cWritePtr back to the disk; the memory space pointed to by the CPU-end pointer variables cReadPtr, cProcPtr and cWritePtr is reclaimed.
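Because the loop variable idx is shared by the GPU-side reader (process No. 1) and the CPU-side reader (process No. 4), its increment-and-test must be mutually exclusive; one way to guard it, sketched here with a mutex (nextBlock is an illustrative name):

    #include <mutex>

    std::mutex idxMutex;
    long long idx = 2;    /* initialized in step 3) */

    /* Atomically claim the next block number; returns 0 once all bNum
       blocks have been handed out. Called by processes No. 1 and No. 4. */
    long long nextBlock(long long bNum)
    {
        std::lock_guard<std::mutex> lock(idxMutex);
        idx += 1;
        return (idx <= bNum) ? idx : 0;
    }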
As shown in FIG. 2, because the positioning calculation tasks executed by the CPU and the GPU each comprise the three steps of data read-in, positioning calculation and data write-back, and the data of the three steps are interdependent, the first round (Round 1) only reads data into memory to generate three-dimensional image volume data; the second round (Round 2) reads in memory data to generate three-dimensional image volume data while performing neuron positioning calculation; and from the third round (Round 3) up to the third-from-last round (Round n-2) before the positioning calculation completes, the three steps of data read-in, positioning calculation and data write-back are all performed simultaneously in each round. The data read-in handles the next group of slice image data, the positioning calculation processes the volume data corresponding to the current group of slice image data that has already been read in, and the data write-back writes the neuron positioning results of the previous group of slice image data back to the disk array. Through this technical approach, the time for data read-in and data write-back is effectively hidden within the neuron positioning calculation step.
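The resulting schedule can be tabulated as follows (an illustration of FIG. 2, with Bk denoting data block k and n the total number of rounds, so that n-2 data blocks are processed):

    Round:        1     2     3     4    ...   n-2     n-1     n
    read-in      B1    B2    B3    B4    ...  B(n-2)    -      -
    compute       -    B1    B2    B3    ...  B(n-3)  B(n-2)   -
    write-back    -     -    B1    B2    ...  B(n-4)  B(n-3)  B(n-2)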
In summary, in this embodiment the CPU organizes calculation and data transfer based on a blocking method, the CPU and the GPU both perform neuron positioning with multiple threads, and the data transfer steps among CPU memory, GPU video memory and the disk adopt a multi-stage pipeline: while each image block is being processed, the next image block to be processed is read in and the previously processed image block is written back to the disk, so that data transfer operations and data processing operations proceed in parallel. This embodiment realizes heterogeneous platform neuron positioning three-stage pipeline parallelism through a mixed multi-process and multi-thread parallel technique; on the CPU-GPU heterogeneous parallel computing platform, the CPU multi-core processor and the GPU many-core coprocessor perform neuron positioning calculation simultaneously, and the multi-stage pipeline technique overlaps calculation with data transfer time, so the neuron positioning speed can be improved. Statistics of the running data show that, compared with a neuron positioning algorithm running on two twelve-core CPUs, the heterogeneous platform neuron positioning three-stage pipeline parallel method improves the neuron positioning speed by more than 3 times.
In addition, this embodiment further provides a heterogeneous platform neuron positioning three-stage pipeline parallel system, comprising a computer device with a GPU, the computer device being programmed to execute the steps of the heterogeneous platform neuron positioning three-stage pipeline parallel method of this embodiment. This embodiment also provides a heterogeneous platform neuron positioning three-stage pipeline parallel system, comprising a computer device with a GPU, a storage medium of the computer device storing a computer program programmed to execute the heterogeneous platform neuron positioning three-stage pipeline parallel method of this embodiment. This embodiment also provides a computer-readable storage medium storing a computer program programmed to execute the heterogeneous platform neuron positioning three-stage pipeline parallel method of this embodiment.
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may occur to those skilled in the art without departing from the principle of the invention, and are considered to be within the scope of the invention.

Claims (9)

1. A heterogeneous platform neuron positioning three-stage pipeline parallel method, characterized by comprising the following implementation steps:
1) calculating blocking parameters of the slice image data according to the image size and the calculation granularity;
2) respectively allocating storage space at the CPU end and the GPU end based on the blocking parameters;
3) initializing variables and storage space;
4) the CPU performs task scheduling, and the CPU and the GPU simultaneously execute calculation tasks in a three-stage pipeline mode; each calculation task comprises three steps: data read-in, positioning calculation and data write-back, and each intermediate calculation task executes the data write-back of the previous task and the data read-in of the next task while performing its own positioning calculation, so that the three steps of data read-in, positioning calculation and data write-back proceed in parallel;
the detailed steps of the step 1) comprise:
1.1) calculating the maximum data block size gSizeMax that can be supported for calculation on the GPU, where gSizeMax is a positive integer;
1.2) determining the block size and the number of blocks in the x, y and z directions and the total number of blocks: if xDim < gSizeMax, setting the value of the x-direction block size xScale to xDim, otherwise setting xScale to gSizeMax, and setting the value of the x-direction block number xNum to ⌈xDim/xScale⌉; if yDim < gSizeMax, setting the value of the y-direction block size yScale to yDim, otherwise setting yScale to gSizeMax, and setting the value of the y-direction block number yNum to ⌈yDim/yScale⌉; if zDim < gSizeMax, setting the value of the z-direction block size zScale to zDim, otherwise setting zScale to gSizeMax, and setting the value of the z-direction block number zNum to ⌈zDim/zScale⌉; wherein xDim, yDim and zDim are preset parameters; setting the value of the total number of blocks bNum to xNum × yNum × zNum, and numbering the bNum data blocks sequentially from 1.
2. The heterogeneous platform neuron positioning three-stage pipeline parallel method according to claim 1, wherein in step 2), when storage space is allocated at the CPU end and the GPU end respectively based on the blocking parameters, three pointer variables gReadPtr, gProcPtr and gWritePtr are declared at the GPU end for GPU-side storage allocation, and the video memory capacity allocated to each pointer is gSizeMax³, where gReadPtr points to the next image block to be processed, gProcPtr points to the image block currently being processed, and gWritePtr points to the most recently processed image block; two pointer variables cReadBuf and cWriteBuf are declared at the CPU end for CPU-side storage allocation, and the memory capacity allocated to each pointer is gSizeMax³, where cReadBuf is used for data buffering between gReadPtr and the disk, and cWriteBuf is used for data buffering between gWritePtr and the disk; and the size of the data blocks calculated on the CPU is set to gSizeMax, three pointer variables cReadPtr, cProcPtr and cWritePtr are declared at the CPU end, and the memory capacity allocated to each pointer is gSizeMax³, where gSizeMax is the maximum data block size that can be supported for calculation on the GPU.
3. The heterogeneous platform neuron positioning three-stage pipeline parallel method according to claim 2, wherein the detailed steps of initializing variables and storage space in step 3) comprise: applying for a mutex loop variable idx on the CPU and initializing idx to 2; reading data block No. 1 from the disk into the memory space pointed to by the pointer variable cProcPtr; reading data block No. 2 from the disk into the memory space pointed to by the pointer variable cReadBuf; and then transferring data block No. 2 from the memory space pointed to by cReadBuf to the video memory space pointed to by the GPU-end pointer variable gProcPtr.
4. The heterogeneous platform neuron positioning three-stage pipeline parallel method according to claim 3, wherein the detailed steps in step 4) comprise:
4.1) starting processes No. 0 to No. 2, responsible for organizing the calculation tasks and data transfers on the GPU, and processes No. 3 to No. 5, responsible for organizing the calculation tasks and data transfers on the CPU;
4.2) calling the GPU through processes No. 0 to No. 2 to execute calculation tasks in a three-stage pipeline mode while calling the CPU through processes No. 3 to No. 5 to execute calculation tasks in a three-stage pipeline mode; while the neurons of each image block are being positioned, the data block required for positioning the next group of neurons is read in and the data block whose neurons have already been positioned is written back to the disk, so that disk read/write operations and neuron positioning operations proceed in parallel;
4.3) synchronizing the processes No. 0, 1, 2, 3, 4 and 5, and finishing the calculation.
5. The heterogeneous platform neuron positioning three-stage pipeline parallel method according to claim 4, wherein the detailed steps of calling the GPU through processes No. 0 to No. 2 in step 4.2) to execute calculation tasks in a three-stage pipeline mode comprise:
4.2.1A) according to the number ncGPU of available computing cores on the GPU, starting ncGPU threads on the GPU through process No. 0, all GPU threads performing neuron positioning calculation in parallel on the data block pointed to by the pointer variable gProcPtr; through process No. 1, adding 1 to the mutex loop variable idx and comparing it with the total number of blocks bNum, and if idx ≤ bNum, reading data block No. idx from the disk into the memory space pointed to by the pointer variable cReadBuf and then transferring it from the CPU-end memory space pointed to by cReadBuf to the GPU-end video memory space pointed to by the pointer variable gReadPtr; through process No. 2, checking the video memory space pointed to by the pointer variable gWritePtr, and if a data block is stored there, transferring it from the video memory space pointed to by gWritePtr to the memory space pointed to by the pointer variable cWriteBuf, then saving it from cWriteBuf to the disk and clearing the video memory space pointed to by gWritePtr;
4.2.2A) synchronizing processes No. 0, No. 1 and No. 2; after synchronization, the calculation of the current GPU data block is finished; process No. 0 then performs the GPU video memory pointer exchange, specifically: declaring a temporary pointer variable gtPtr and setting gtPtr = gProcPtr, gProcPtr = gReadPtr, gReadPtr = gWritePtr and gWritePtr = gtPtr; process No. 0 then checks the video memory space pointed to by gProcPtr, and if its content is empty, step 4.2.3A) is executed, otherwise step 4.2.1A) is executed;
4.2.3A) process No. 0 transfers the data block in the video memory space pointed to by the pointer variable gWritePtr to the memory space pointed to by the pointer variable cWriteBuf, and then writes it back to the disk from cWriteBuf; the video memory space pointed to by the GPU-end pointer variables gReadPtr, gProcPtr and gWritePtr is reclaimed, and the memory space pointed to by the CPU-end pointer variables cReadBuf and cWriteBuf is reclaimed.
6. The heterogeneous platform neuron positioning three-stage pipeline parallel method according to claim 5, wherein the detailed steps of calling the CPU through processes No. 3 to No. 5 in step 4.2) to simultaneously execute calculation tasks in a three-stage pipeline mode comprise:
4.2.1B) according to the number ncCPU of available computing cores on the CPU, starting ncCPU threads on the CPU through process No. 3, all CPU threads performing neuron positioning calculation in parallel on the data block pointed to by the pointer variable cProcPtr; through process No. 4, adding 1 to the mutex loop variable idx and comparing it with the total number of blocks bNum, and if idx ≤ bNum, reading data block No. idx from the disk into the memory space pointed to by the pointer variable cReadPtr; through process No. 5, checking the memory space pointed to by the pointer variable cWritePtr, and if a data block is stored there, saving it to the disk and clearing the memory space pointed to by cWritePtr;
4.2.2B) synchronizing processes No. 3, No. 4 and No. 5; after synchronization, the calculation of the current CPU data block is finished; process No. 3 then performs the CPU memory pointer exchange, specifically: declaring a temporary pointer variable ctPtr and setting ctPtr = cProcPtr, cProcPtr = cReadPtr, cReadPtr = cWritePtr and cWritePtr = ctPtr; process No. 3 then checks the memory space pointed to by cProcPtr, and if its content is empty, step 4.2.3B) is executed, otherwise step 4.2.1B) is executed;
4.2.3B) process No. 3 writes the data block in the memory space pointed to by the pointer variable cWritePtr back to the disk; the memory space pointed to by the CPU-end pointer variables cReadPtr, cProcPtr and cWritePtr is reclaimed.
7. A heterogeneous platform neuron positioning three-stage pipeline parallel system, comprising a computer device with a GPU, characterized in that the computer device is programmed to perform the steps of the heterogeneous platform neuron positioning three-stage pipeline parallel method according to any one of claims 1-6.
8. A heterogeneous platform neuron positioning three-stage pipeline parallel system, comprising a computer device with a GPU, characterized in that a storage medium of the computer device stores a computer program programmed to perform the heterogeneous platform neuron positioning three-stage pipeline parallel method according to any one of claims 1-6.
9. A computer-readable storage medium having stored thereon a computer program programmed to perform the heterogeneous platform neuron positioning three-stage pipeline parallel method according to any one of claims 1-6.
CN201910289495.7A 2019-04-11 2019-04-11 Heterogeneous platform neuron positioning three-stage pipeline parallel method, system and medium Active CN110135569B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910289495.7A CN110135569B (en) 2019-04-11 2019-04-11 Heterogeneous platform neuron positioning three-stage pipeline parallel method, system and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910289495.7A CN110135569B (en) 2019-04-11 2019-04-11 Heterogeneous platform neuron positioning three-stage pipeline parallel method, system and medium

Publications (2)

Publication Number Publication Date
CN110135569A CN110135569A (en) 2019-08-16
CN110135569B true CN110135569B (en) 2021-09-21

Family

ID=67569648

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910289495.7A Active CN110135569B (en) Heterogeneous platform neuron positioning three-stage pipeline parallel method, system and medium

Country Status (1)

Country Link
CN (1) CN110135569B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110516795B (en) * 2019-08-28 2022-05-10 北京达佳互联信息技术有限公司 Method and device for allocating processors to model variables and electronic equipment
CN110543940B (en) * 2019-08-29 2022-09-23 中国人民解放军国防科技大学 Neural circuit body data processing method, system and medium based on hierarchical storage
CN110992241A (en) * 2019-11-21 2020-04-10 支付宝(杭州)信息技术有限公司 Heterogeneous embedded system and method for accelerating neural network target detection
CN112529763B (en) * 2020-12-16 2024-06-21 航天科工微电子系统研究院有限公司 Image processing system and tracking system based on soft and hard coupling
CN113806067B (en) * 2021-07-28 2024-03-29 卡斯柯信号有限公司 Safety data verification method, device, equipment and medium based on vehicle-to-vehicle communication
CN113918356B (en) * 2021-12-13 2022-02-18 广东睿江云计算股份有限公司 Method and device for quickly synchronizing data based on CUDA (compute unified device architecture), computer equipment and storage medium
CN117689025B (en) * 2023-12-07 2024-06-14 上海交通大学 Quick large model reasoning service method and system suitable for consumer display card

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103617626A (en) * 2013-12-16 2014-03-05 武汉狮图空间信息技术有限公司 Central processing unit (CPU) and ground power unit (GPU)-based remote-sensing image multi-scale heterogeneous parallel segmentation method
CN104267940A (en) * 2014-09-17 2015-01-07 武汉狮图空间信息技术有限公司 Quick map tile generation method based on CPU+GPU
CN104375807A (en) * 2014-12-09 2015-02-25 中国人民解放军国防科学技术大学 Three-level flow sequence comparison method based on many-core co-processor
CN106815807A (en) * 2017-01-11 2017-06-09 重庆市地理信息中心 A kind of unmanned plane image Fast Mosaic method based on GPU CPU collaborations

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9058680B2 (en) * 2011-12-28 2015-06-16 Think Silicon Ltd Multi-threaded multi-format blending device for computer graphics operations
CN109451322B (en) * 2018-09-14 2021-02-02 北京航天控制仪器研究所 Acceleration implementation method of DCT (discrete cosine transform) algorithm and DWT (discrete wavelet transform) algorithm based on CUDA (compute unified device architecture) for image compression

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103617626A (en) * 2013-12-16 2014-03-05 武汉狮图空间信息技术有限公司 Central processing unit (CPU) and ground power unit (GPU)-based remote-sensing image multi-scale heterogeneous parallel segmentation method
CN104267940A (en) * 2014-09-17 2015-01-07 武汉狮图空间信息技术有限公司 Quick map tile generation method based on CPU+GPU
CN104375807A (en) * 2014-12-09 2015-02-25 中国人民解放军国防科学技术大学 Three-level flow sequence comparison method based on many-core co-processor
CN106815807A (en) * 2017-01-11 2017-06-09 重庆市地理信息中心 A kind of unmanned plane image Fast Mosaic method based on GPU CPU collaborations

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Dual buffer rotation four-stage pipeline for CPU-GPU cooperative computing; Tao Li; Springer; 2017-09-06; entire document *
Research on a naive Bayes image classification algorithm based on heterogeneous system architecture; Xiao Nan; China Masters' Theses Full-text Database, Information Science and Technology; 2018-07-15; Chapter 4 *
A parallel template-matching target recognition algorithm for CPU+GPU heterogeneous platforms; Ma Yongjun et al.; Journal of Tianjin University of Science & Technology; 2014-08-30; pp. 48-52 *

Also Published As

Publication number Publication date
CN110135569A (en) 2019-08-16

Similar Documents

Publication Publication Date Title
CN110135569B (en) Heterogeneous platform neuron positioning three-stage pipeline parallel method, system and medium
US11080051B2 (en) Techniques for efficiently transferring data to a processor
Huang et al. Xmalloc: A scalable lock-free dynamic memory allocator for many-core machines
US10725837B1 (en) Persistent scratchpad memory for data exchange between programs
US11907717B2 (en) Techniques for efficiently transferring data to a processor
US20200264970A1 (en) Memory management system
Gmys et al. A GPU-based Branch-and-Bound algorithm using Integer–Vector–Matrix data structure
CN112749120A (en) Techniques for efficiently transferring data to a processor
Munekawa et al. Design and implementation of the Smith-Waterman algorithm on the CUDA-compatible GPU
CN109408867B (en) Explicit R-K time propulsion acceleration method based on MIC coprocessor
Park et al. mGEMM: low-latency convolution with minimal memory overhead optimized for mobile devices
CN106971369B (en) Data scheduling and distributing method based on GPU (graphics processing Unit) for terrain visual field analysis
CN106484532B (en) GPGPU parallel calculating method towards SPH fluid simulation
US20230289242A1 (en) Hardware accelerated synchronization with asynchronous transaction support
Wu et al. A vectorized k-means algorithm for intel many integrated core architecture
Rapaport GPU molecular dynamics: Algorithms and performance
US20230144553A1 (en) Software-directed register file sharing
Ino et al. Performance study of LU decomposition on the programmable GPU
Dudnik et al. Cuda architecture analysis as the driving force Of parallel calculation organization
Nelson et al. Don't forget about synchronization! Guidelines for using locks on graphics processing units
Aji et al. Accelerating data-serial applications on data-parallel GPGPUs: a systems approach
US20230297643A1 (en) Non-rectangular matrix computations and data pattern processing using tensor cores
CN118519787B (en) Memory access and parallel efficiency optimization method based on synchronization-free SpTRSV algorithm
US20240311163A1 (en) Hardware-driven call stack attribution
US20230101085A1 (en) Techniques for accelerating smith-waterman sequence alignments

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant