CN114880109B - Data processing method and device based on CPU-GPU heterogeneous architecture and storage medium - Google Patents
- Publication number
- CN114880109B (application CN202111539679.8A)
- Authority
- CN
- China
- Prior art keywords
- stage
- data
- gpu
- multiexp
- processing method
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/50—Indexing scheme relating to G06F9/50
- G06F2209/5018—Thread allocation
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The application provides a data processing method and device based on a CPU-GPU heterogeneous architecture, and a storage medium. The data processing method based on the CPU-GPU heterogeneous architecture comprises the following steps: acquiring a computing task of zero knowledge proof; inputting the data of the zero knowledge proof into a SYNTHESIZE stage for processing, and inputting the output data of the SYNTHESIZE stage into an FFT stage, a MULTIEXP B stage and a MULTIEXP C stage respectively; inputting the output data of the FFT stage into a MULTIEXP A stage, the MULTIEXP A stage outputting first proof information; processing the MULTIEXP B stage and the MULTIEXP C stage in parallel, and outputting second proof information and third proof information respectively; and combining the first proof information, the second proof information and the third proof information to generate a final proof result. Through the data processing method, a zero knowledge proof performance optimization scheme is provided, which removes the application obstacles caused by performance problems and accelerates the adoption of zero knowledge proof technology in application scenarios.
Description
Technical Field
The present application relates to the field of zero-knowledge proof technology, and in particular, to a data processing method, device, and storage medium based on a CPU-GPU heterogeneous architecture.
Background
A zero knowledge proof means that a prover can convince a verifier that a statement is correct without exposing any useful information to the verifier. It can therefore address problems such as data security and privacy leakage.
At present, generating a zero knowledge proof requires a huge amount of computation, and its application is limited by time and economic cost. Under a homogeneous computing mode, the CPU cannot meet the intensive computation requirements of zero knowledge proof. For example, in a distributed storage system project, a zero knowledge proof must be completed before a packed block can be submitted to the chain. If the CPU alone completes the zero knowledge proof, the time to mine a block far exceeds the specified block time, and the block becomes invalid in the blockchain. The GPU not only has strong floating-point computing capability but is also suitable for parallel computing over large-scale data, so CPU-GPU heterogeneous computing can greatly improve the efficiency of zero knowledge proof. However, on a CPU-GPU heterogeneous architecture, coordinating devices of different architectures to maximize the utilization of all devices and form the most efficient system is more complicated than in the homogeneous case.
Disclosure of Invention
The application provides a data processing method and device based on a CPU-GPU heterogeneous architecture and a storage medium.
The application provides a data processing method based on a CPU-GPU heterogeneous architecture, which comprises the following steps:
acquiring a computing task of zero knowledge proof;
dividing the zero knowledge proof calculation task into three stages, namely a SYNTHESIZE stage, an FFT stage and a MULTIEXP stage, wherein the MULTIEXP stage is divided into a MULTIEXP A stage, a MULTIEXP B stage and a MULTIEXP C stage according to different input data;
inputting the data of the zero knowledge proof into the SYNTHESIZE stage for processing, and inputting the output data of the SYNTHESIZE stage into the FFT stage, the MULTIEXP B stage and the MULTIEXP C stage respectively;
inputting the output data of the FFT stage into the MULTIEXP A stage, wherein the MULTIEXP A stage outputs first proof information;
processing the MULTIEXP B stage and the MULTIEXP C stage in parallel, and outputting second proof information and third proof information respectively;
and combining the first proof information, the second proof information and the third proof information to generate a final proof result.
The data processing method further comprises the following steps:
dividing the data input into the FFT stage into a plurality of parts of sub-data;
processing the first part of sub-data through the FFT stage to obtain a first part of output sub-data;
and transmitting the first part of output sub-data to the MULTIEXP A stage for data processing while processing a second part of sub-data through the FFT stage, until the processing and transmission of all sub-data are completed.
The data processing method further comprises the following steps:
during the data preprocessing performed by the CPU for the FFT stage and the MULTIEXP stage, reading the repeatedly used parameters in advance in parallel.
The data processing method further comprises the following steps:
calculating the theoretical data maximum processing capacity of a single GPU task;
and determining the data amount processed by the single GPU based on the theoretical data maximum processing amount.
Wherein calculating the theoretical data maximum processing amount of a single GPU task comprises the following steps:
acquiring a total video memory of the GPU;
calculating a first video memory occupied by a thread in the GPU;
acquiring the residual video memory of the GPU based on the difference value of the total video memory and the first video memory;
and acquiring the maximum processing capacity of the theoretical data based on the ratio of the residual video memory of the GPU to the data quantity of the input data.
Wherein calculating the first video memory occupied by the threads in the GPU comprises:
acquiring the total number of threads of the GPU;
calculating a second video memory of one thread based on the size of a storage unit and the number of storage units in one thread in the GPU;
and calculating the first video memory based on the total number of threads and the second video memory.
The data processing method further comprises the following steps:
increasing the total number of threads of the GPU;
and increasing the data amount processed by a single GPU based on the increased total number of threads of the GPU, so as to reduce the number of transmissions of the GPU.
The CPU in the CPU-GPU heterogeneous architecture is responsible for logic control and data preprocessing, and the GPU is responsible for processing intensive and parallelizable computation.
The application also provides a terminal device comprising a memory and a processor, wherein the memory is coupled to the processor;
the memory is used for storing program data, and the processor is used for executing the program data to realize the data processing method.
The present application also provides a computer storage medium for storing program data which, when executed by a processor, is used to implement the data processing method described above.
The beneficial effects of the application are: the terminal device acquires a computing task of zero knowledge proof; divides the zero knowledge proof calculation task into three stages, namely a SYNTHESIZE stage, an FFT stage and a MULTIEXP stage, wherein the MULTIEXP stage is divided into a MULTIEXP A stage, a MULTIEXP B stage and a MULTIEXP C stage according to different input data; inputs the data of the zero knowledge proof into the SYNTHESIZE stage for processing, and inputs the output data of the SYNTHESIZE stage into the FFT stage, the MULTIEXP B stage and the MULTIEXP C stage respectively; inputs the output data of the FFT stage into the MULTIEXP A stage, which outputs first proof information; processes the MULTIEXP B stage and the MULTIEXP C stage in parallel, outputting second proof information and third proof information respectively; and combines the first proof information, the second proof information and the third proof information to generate a final proof result. Through the data processing method, a zero knowledge proof performance optimization scheme is provided, which removes the application obstacles caused by performance problems and accelerates the adoption of zero knowledge proof technology in application scenarios.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art from these drawings without creative effort. Wherein:
FIG. 1 is a schematic flowchart illustrating an embodiment of a data processing method based on a CPU-GPU heterogeneous architecture according to the present disclosure;
FIG. 2 is a schematic diagram of a CPU-GPU heterogeneous architecture-based zero-knowledge proof computation data flow provided herein;
FIG. 3 is a detailed sub-step of step S14 of the data processing method of FIG. 1;
FIG. 4 is a schematic flowchart of another embodiment of a data processing method based on a CPU-GPU heterogeneous architecture provided in the present application;
FIG. 5 is a flow diagram illustrating serial execution of a prior art data processing method;
FIG. 6 is a schematic flow chart of parallel execution of the data processing method provided in the present application;
FIG. 7 shows the total execution time under different optimization schemes provided by the present application;
FIG. 8 shows the execution time of the MULTIEXP stage under different optimization schemes provided by the present application;
fig. 9 is a schematic structural diagram of an embodiment of a terminal device provided in the present application;
FIG. 10 is a schematic structural diagram of an embodiment of a computer storage medium provided in the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Due to the algorithmic complexity, huge data volume and computation amount of zero knowledge proof, and the system complexity brought by the CPU-GPU heterogeneous architecture, the utilization rate of the CPU-GPU heterogeneous architecture is low. To solve this problem, in the implementation based on the CPU-GPU heterogeneous architecture, the CPU is made responsible for logic control and data preprocessing, and the GPU is made responsible for intensive and parallelizable computation. A zero knowledge proof performance optimization method is thereby provided, which removes the application obstacles caused by performance problems and accelerates the adoption of zero knowledge proof technology in application scenarios.
Specifically, referring to fig. 1 and fig. 2, fig. 1 is a schematic flowchart of an embodiment of a data processing method based on a CPU-GPU heterogeneous architecture provided in the present application, and fig. 2 is a schematic diagram of zero-knowledge proof calculation data flow based on a CPU-GPU heterogeneous architecture provided in the present application.
As shown in fig. 1, the data processing method based on the CPU-GPU heterogeneous architecture according to the embodiment of the present application specifically includes the following steps:
step S11: and acquiring a computing task with zero knowledge proof.
Step S12: the zero knowledge proof calculation task is divided into three stages, namely a SYNTHESIZE (namely circuit generation) stage, an FFT (namely fast Fourier transform) stage and a MULTIXP (namely large number multiplication and addition) stage, wherein the MULTIXP stage is divided into a MULTIXPA stage (namely large number multiplication and addition A stage), a MULTIXPB stage (namely large number multiplication and addition B stage) and a MULTIXPC stage (namely large number multiplication and addition C stage) according to different input data.
In the embodiment of the present application, as shown in fig. 2, a parallel execution scheme for parallelization of computation is provided based on a CPU-GPU heterogeneous architecture.
Specifically, the calculation of the zero knowledge proof is divided into three stages, namely a SYNTHESIZE stage, an FFT stage and a MULTIEXP stage, wherein the calculation of the MULTIEXP stage can be divided into three parts, namely a MULTIEXP A stage, a MULTIEXP B stage and a MULTIEXP C stage, according to the input data. As shown in fig. 2, the data output from the SYNTHESIZE stage is divided into three parts: one part is used as the input of the FFT stage, and the other two parts are used as the inputs of the MULTIEXP B stage and the MULTIEXP C stage, respectively. The output of the FFT stage is the input of the MULTIEXP A stage. Finally, the outputs of the MULTIEXP A stage, the MULTIEXP B stage and the MULTIEXP C stage generate the proof result PROOF.
As can be seen from the calculation data flow shown in fig. 2, the operation of the optimized CPU-GPU heterogeneous architecture is mainly divided into two parts: the pipelining of the FFT stage and the MULTIEXP A stage, and the parallelization of the MULTIEXP B stage and the MULTIEXP C stage. These two parts are described separately in detail below:
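The data flow just described can be sketched in code. The following is a hypothetical Python illustration of the stage graph only: the real SYNTHESIZE, FFT and MULTIEXP computations are heavy GPU kernels, and every function body here is a placeholder string operation.

```python
# Hypothetical sketch of the zero knowledge proof stage graph described above.
# Stage names follow the patent; the bodies are placeholders, not real kernels.
from concurrent.futures import ThreadPoolExecutor

def synthesize(task):
    # Circuit generation: its output is split into three parts.
    return {"fft_in": task + "-a", "b_in": task + "-b", "c_in": task + "-c"}

def fft(data):
    return data + "-fft"

def multiexp(name, data):
    return f"proof_{name}({data})"

def prove(task):
    out = synthesize(task)
    with ThreadPoolExecutor(max_workers=3) as pool:
        # MULTIEXP B and MULTIEXP C have no data dependency on the FFT chain,
        # so they run concurrently with the FFT -> MULTIEXP A path.
        fut_b = pool.submit(multiexp, "B", out["b_in"])
        fut_c = pool.submit(multiexp, "C", out["c_in"])
        proof_a = multiexp("A", fft(out["fft_in"]))
        return (proof_a, fut_b.result(), fut_c.result())
```

Running `prove` on a task walks the graph exactly as in fig. 2: SYNTHESIZE fans out, the FFT feeds MULTIEXP A, and the three partial proofs are combined at the end.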
step S13: inputting the data of zero knowledge proof into SYNTHESIZE stage for processing, and inputting the output data of SYNTHESIZE stage into FFT stage, MULTIIEXP B stage and MULTIIEXP C stage respectively.
In the embodiment of the application, the terminal equipment inputs data proved by zero knowledge into a synchronization stage for processing, and inputs output data of the synchronization stage into an FFT stage, a MULTIEXP B stage and a MULTIEXP C stage in parallel for data processing.
The acceleration of disk IO and data preprocessing can be realized by setting the CPU in charge of logic control and data preprocessing in the zero knowledge proving process. Specifically, for the CPU, repeated parameters are continuously adopted to perform data preprocessing in the data preprocessing process, and the data preprocessing time of the CPU is longer in the FFT stage and the multi xp stage, so that in order to improve the utilization efficiency of the CPU, the CPU can perform parallel reading of the repeatedly used parameters in advance, and reduce continuous reading and calling of the repeatedly used parameters.
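As a rough illustration of reading reused parameters in advance, the sketch below starts the parameter load on a background thread so that a later lookup does not block on disk IO. `ParamCache` and its loader are hypothetical names, not the patent's actual implementation.

```python
# Minimal sketch of parallel pre-reading of repeatedly used parameters.
# The loader stands in for a slow disk read; all names are hypothetical.
import threading

class ParamCache:
    def __init__(self, loader):
        self._loader = loader
        self._results = {}
        self._threads = {}

    def prefetch(self, key):
        # Start loading in the background so a later get() finds the
        # parameters already in memory instead of re-reading them.
        t = threading.Thread(
            target=lambda: self._results.setdefault(key, self._loader(key)))
        self._threads[key] = t
        t.start()

    def get(self, key):
        t = self._threads.get(key)
        if t is not None:
            t.join()  # wait for an in-flight prefetch to finish
        if key not in self._results:
            self._results[key] = self._loader(key)  # fall back to a direct read
        return self._results[key]
```

Prefetching the FFT and MULTIEXP parameters at the start of a proof overlaps their load with the SYNTHESIZE stage, which is the overlap the paragraph above describes.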
Step S14: the output data of the FFT stage is input into a MULTIEXP A stage, and the MULTIEXP A stage outputs first proof information.
In the embodiment of the present application, since the FFT stage and the multimedia phase have data dependency, the data processing procedures of the two stages cannot be directly parallel, but the pipeline of the FFT stage and the multimedia phase is to be implemented. Specifically, the terminal device needs to input the output result of the synchronize stage into the FFT stage, and then input the result into the multi stage after being processed by the FFT stage.
In order to further improve the utilization rate of the CPU-GPU heterogeneous architecture, the embodiment of the present application further provides a two-stage pipeline technical solution, and please specifically refer to fig. 3, where fig. 3 is a specific sub-step of step S14 of the data processing method shown in fig. 1.
As shown in fig. 3, the data processing method provided in the embodiment of the present application further includes:
step S141: the data input to the FFT stage is divided into several sub-data.
In the embodiment of the application, since both the FFT stage and the MULTIEXP A stage can divide their data into a plurality of parts and calculate each part independently, the CPU-GPU heterogeneous architecture can pipeline the FFT stage and the MULTIEXP A stage in a two-stage pipelined manner.
Step S142: and processing the first sub-data through an FFT stage to obtain a first output sub-data.
Step S143: and while processing the second part of subdata through the FFT stage, transmitting the first part of output subdata to the MULTIEXPA stage for data processing until the processing and transmission of all subdata are completed.
Specifically, the FFT stage and the multi xpa stage divide the data to be processed into N parts, and when the x-th data is output by the thread calculating the FFT stage, the x-th data is immediately transmitted to the thread calculating the multi xpa stage, so that the x-th data is calculated by the thread calculating the multi xpa stage. At the same time, the FFT stage may begin processing the x +1 th data.
By the two-section type pipeline mode, the data processing efficiency of the FFT stage and the MULTIEXP A stage can be further improved, and even the data processing speed of the FFT stage and the data processing speed of the MULTIEXP B stage can be kept consistent with that of the MULTIEXP C stage.
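The two-stage pipeline can be sketched with a hand-off queue between an FFT thread and a MULTIEXP A thread. This is a minimal Python illustration under the assumption that both stage functions are pure per-chunk operations; the real stages are GPU computations.

```python
# Sketch of the two-stage pipeline: while the FFT thread works on part x+1,
# part x is already being consumed by the MULTIEXP A thread.
import queue
import threading

def pipeline(chunks, fft_fn, multiexp_a_fn):
    q = queue.Queue(maxsize=1)  # hand-off buffer between the two stages
    results = []

    def producer():
        for chunk in chunks:
            q.put(fft_fn(chunk))  # part x is handed over immediately...
        q.put(None)               # sentinel: all N parts processed

    def consumer():
        while (item := q.get()) is not None:
            results.append(multiexp_a_fn(item))  # ...while FFT starts part x+1

    t = threading.Thread(target=producer)
    t.start()
    consumer()
    t.join()
    return results
```

With the bounded queue, the FFT thread moves on to part x+1 as soon as part x has been handed over, which is the overlap described above.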
Step S15: and processing the multiple B stage and the multiple C stage in parallel, and respectively outputting second certification information and third certification information.
In the embodiment of the application, because there is no data dependency relationship between the multi xp B phase and the multi xp C phase, and the video memory usage rate is low, the data processing processes of the two phases can be directly performed in parallel and simultaneously calculated.
Step S16: and combining the first certification information, the second certification information and the third certification information to generate a final certification result.
And finally, the terminal equipment generates a final PROOF result PROOF by combining the first PROOF information obtained by the calculation in the MULTIXPA stage, the second PROOF information obtained by the calculation in the MULTIXP B stage and the third PROOF information obtained by the calculation in the MULTIXP C stage.
In the embodiment of the application, the terminal equipment acquires a computing task of zero knowledge proof; dividing a calculation task with zero knowledge proof into three stages, namely a SYNTHESIZE stage, an FFT stage and a MULTIEXP stage, wherein the MULTIEXP stage is divided into a MULTIEXPA stage, a MULTIEXP B stage and a MULTIEXP C stage according to different input data; inputting data proved by zero knowledge into a SYNTHESIZE stage for processing, and respectively inputting output data of the SYNTHESIZE stage into an FFT stage, a MULTIIEXP B stage and a MULTIIEXP C stage; inputting output data of the FFT stage into a MULTIEXPA stage, and outputting first proof information at the MULTIEXPA stage; processing the multi-IEXP B stage and the multi-IEXP C stage in parallel, and respectively outputting second certification information and third certification information; and combining the first certification information, the second certification information and the third certification information to generate a final certification result. By the aid of the data processing method, the problem of application obstruction caused by performance problems is solved by providing a zero-knowledge proof performance optimization mode, and landing of the zero-knowledge proof technology in an application scene is accelerated.
Referring to fig. 4, fig. 4 is a schematic flowchart illustrating a data processing method based on a CPU-GPU heterogeneous architecture according to another embodiment of the present disclosure. In the data processing method of the embodiment of the application, the CPU-GPU heterogeneous architecture optimizes the performance of the CPU-GPU heterogeneous architecture by reducing the transmission times of the CPU-GPU and increasing the number of GPU computing threads, and the use efficiency of the architecture is improved.
As shown in fig. 4, a data processing method provided in the embodiment of the present application includes:
step S21: and calculating the theoretical data maximum processing capacity of the single GPU task.
In the embodiment of the application, the terminal device may calculate the theoretical data maximum processing amount d_max of a single GPU task by the following formula:
d_max = (mem - i × cores × (2^window_size - 1) × buck_size) / (k_size + p_size)
where mem is the video memory size of the GPU, i is the ratio of the actual thread number to the maximum number of parallel threads of the GPU, and cores is the number of stream processors in the GPU. By equating the GPU maximum parallel threads to the number of stream processors, i × cores can be used to represent the total number of threads of the GPU. window_size is the size of one window of the GPU, 2^window_size - 1 is the number of buckets owned by one thread, and buck_size is the size of one bucket of the GPU, so (2^window_size - 1) × buck_size is the video memory occupied by the buckets in one thread, namely the second video memory, and i × cores × (2^window_size - 1) × buck_size is the video memory occupied by the buckets in all threads of the GPU, namely the first video memory.
Further, k_size is the size of one scalar datum and p_size is the size of one vector datum. Dividing the video memory remaining after the buckets in all threads of the GPU by the size of one piece of input data gives the theoretical data maximum processing amount of a single GPU task.
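The formula above can be checked numerically. In the sketch below the concrete sizes (an 8 GiB card, 4096 stream processors, a 10-bit window, 96-byte buckets, 32-byte scalars, 64-byte points) are illustrative assumptions, not values from the patent.

```python
# Numeric sketch of the d_max formula above, with hypothetical sizes.
def d_max(mem, i, cores, window_size, buck_size, k_size, p_size):
    total_threads = i * cores                    # i x cores: total GPU threads
    buckets_per_thread = (1 << window_size) - 1  # buckets owned by one thread
    # "first video memory": buckets across all threads
    bucket_mem = total_threads * buckets_per_thread * buck_size
    remaining = mem - bucket_mem                 # residual video memory
    # elements of input data (scalar + point) that still fit
    return remaining // (k_size + p_size)

capacity = d_max(mem=8 << 30, i=2, cores=4096, window_size=10,
                 buck_size=96, k_size=32, p_size=64)
```

With these assumed sizes a single launch can hold on the order of 8 × 10^7 input elements.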
Step S22: and determining the data amount processed by the single GPU based on the theoretical maximum data processing amount.
In the embodiment of the application, the CPU-GPU heterogeneous architecture sets the data amount processed by a single GPU according to the calculated theoretical data maximum processing amount and processes the data of the zero knowledge proof accordingly, so that the number of transmissions can be reduced to the maximum extent and the processing efficiency improved.
Further, the CPU-GPU heterogeneous architecture may also increase i, for example from 2 to 4, thereby increasing the actual number of threads, increasing the data amount processed by a single GPU, and further reducing the number of transmissions.
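The effect on the transfer count is simple arithmetic: for a fixed number of input elements, the number of CPU-GPU transmissions is the ceiling of the element count divided by the per-launch capacity, so a larger capacity means fewer round trips. The numbers below are illustrative assumptions.

```python
# Arithmetic sketch of the transfer-count reduction described above.
import math

def transfer_count(n_elements, per_launch_capacity):
    # Each launch moves at most per_launch_capacity elements to the GPU.
    return math.ceil(n_elements / per_launch_capacity)
```

For example, with 4 × 10^8 elements, a per-launch capacity of 5 × 10^7 needs 8 transmissions, while doubling the capacity to 10^8 needs only 4.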
In order to verify the data processing method based on the CPU-GPU heterogeneous architecture, the optimization schemes are verified on a computing task with 32GB of input data, on the premise that a correct proof can be generated. Specifically, referring to fig. 5 to 8, fig. 5 is a schematic flow chart of serial execution of a prior art data processing method, fig. 6 is a schematic flow chart of parallel execution of the data processing method provided in the present application, fig. 7 shows the total execution time under different optimization schemes provided in the present application, and fig. 8 shows the execution time of the MULTIEXP stage under different optimization schemes provided in the present application.
Comparing the serial execution flow chart of fig. 5 with the parallel execution flow chart of fig. 6, it can be seen that with the parallel execution scheme provided by the present application, the video memory utilization of the GPU is greatly improved.
As shown in fig. 7, parallelization and data preprocessing acceleration based on bellperson improve the speed by 9% and 37%, respectively. Parallelization increases the thread scheduling overhead of the CPU and GPU, so its speed improvement is not obvious; the data preprocessing time, by contrast, is long, and code analysis found many repeated redundant operations, so the preprocessing-acceleration scheme achieves a good speedup.
As shown in fig. 8, the speed is improved by 35% by reducing the number of transmission times and increasing the number of threads on the basis of the acceleration of the preprocessing.
In summary, the performance of zero knowledge proof on the CPU-GPU heterogeneous architecture is optimized in the following three aspects, continuously improving the performance of the CPU-GPU heterogeneous architecture:
1) Parallelization of computation.
2) Acceleration of disk IO and data preprocessing.
3) Reduction of the number of CPU-GPU transmissions and increase of the number of GPU computing threads.
It will be understood by those of skill in the art that in the above method of the present embodiment, the order of writing the steps does not imply a strict order of execution and does not impose any limitations on the implementation, as the order of execution of the steps should be determined by their function and possibly inherent logic.
To implement the data processing method based on the CPU-GPU heterogeneous architecture according to the foregoing embodiment, the present application further provides a terminal device, and specifically please refer to fig. 9, where fig. 9 is a schematic structural diagram of an embodiment of the terminal device provided in the present application.
The terminal device 500 of the embodiment of the present application includes a memory 51 and a processor 52, wherein the memory 51 and the processor 52 are coupled.
The memory 51 is used for storing program data, and the processor 52 is used for executing the program data to implement the data processing method based on the CPU-GPU heterogeneous architecture described in the above embodiments.
In the present embodiment, the processor 52 may also be referred to as a CPU (Central Processing Unit). Processor 52 may be an integrated circuit chip having signal processing capabilities. The processor 52 may also be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete gate or transistor logic, discrete hardware components. A general purpose processor may be a microprocessor or the processor 52 may be any conventional processor or the like.
The present application further provides a computer storage medium, as shown in fig. 10, the computer storage medium 600 is used to store program data 61, and when the program data 61 is executed by a processor, the data processing method based on the CPU-GPU heterogeneous architecture as described in the foregoing embodiment is implemented.
The present application further provides a computer program product, where the computer program product includes a computer program operable to cause a computer to execute the data processing method based on the CPU-GPU heterogeneous architecture according to the embodiment of the present application. The computer program product may be a software installation package.
The data processing method based on the CPU-GPU heterogeneous architecture according to the above embodiments of the present application may exist in the form of a software functional unit when being implemented, and may be stored in a device, for example, a computer readable storage medium, when being sold or used as an independent product. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only an embodiment of the present application, and is not intended to limit the scope of the present application, and all equivalent structures or equivalent processes performed by the present application and the contents of the attached drawings, which are directly or indirectly applied to other related technical fields, are also included in the scope of the present application.
Claims (9)
1. A data processing method based on a CPU-GPU heterogeneous architecture is characterized by comprising the following steps:
acquiring a zero-knowledge proof computing task;
dividing the zero-knowledge proof computing task into three stages, namely a SYNTHESIZE stage, an FFT stage and a MULTIEXP stage, wherein the MULTIEXP stage is divided into a MULTIEXP A stage, a MULTIEXP B stage and a MULTIEXP C stage according to different input data;
inputting the zero-knowledge proof data into the SYNTHESIZE stage for processing, and respectively inputting the output data of the SYNTHESIZE stage into the FFT stage, the MULTIEXP B stage and the MULTIEXP C stage, wherein the MULTIEXP B stage and the MULTIEXP C stage have no data dependency relationship, and the FFT stage and the MULTIEXP A stage have a data dependency relationship;
inputting the output data of the FFT stage into the MULTIEXP A stage, wherein the MULTIEXP A stage outputs first proof information;
processing the MULTIEXP B stage and the MULTIEXP C stage in parallel, and respectively outputting second proof information and third proof information;
generating a final proof result by combining the first proof information, the second proof information and the third proof information;
wherein the data processing method further comprises:
calculating a theoretical maximum data processing capacity of a single GPU task;
determining the amount of data processed by a single GPU based on the theoretical maximum data processing capacity;
wherein the theoretical maximum data processing capacity of a single GPU task is calculated by the following formula:
n = (M − α × S × T × w × b) / (s + v)

wherein M is the size of the video memory of the GPU, α is the ratio of the actual number of threads to the maximum number of parallel threads of the GPU, S is the number of stream processors in the GPU, T is the total number of threads of the GPU, w is the size of one window of the GPU, b is the size of one bucket of the GPU, s is the size of one piece of scalar data, and v is the size of one piece of vector data.
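The staged pipeline of claim 1 can be sketched in ordinary Python. This is a minimal illustration only, assuming hypothetical stand-in stage functions (`synthesize`, `fft`, `multiexp_a`, `multiexp_b`, `multiexp_c`) in place of the real GPU kernels: MULTIEXP B and MULTIEXP C run in parallel with the dependent FFT → MULTIEXP A chain, and the three pieces of proof information are then combined.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the real GPU stage kernels.
def synthesize(task):
    return {"fft_in": task, "b_in": task, "c_in": task}

def fft(data):
    return ("fft", data)

def multiexp_a(data):
    return ("proof_a", data)

def multiexp_b(data):
    return ("proof_b", data)

def multiexp_c(data):
    return ("proof_c", data)

def prove(task):
    out = synthesize(task)  # SYNTHESIZE feeds all downstream stages
    with ThreadPoolExecutor(max_workers=3) as pool:
        # MULTIEXP B and MULTIEXP C have no data dependency and run in
        # parallel with the dependent FFT -> MULTIEXP A chain.
        fut_a = pool.submit(lambda d: multiexp_a(fft(d)), out["fft_in"])
        fut_b = pool.submit(multiexp_b, out["b_in"])
        fut_c = pool.submit(multiexp_c, out["c_in"])
        # Combine the three pieces of proof information into the final result.
        return fut_a.result(), fut_b.result(), fut_c.result()
```

In a real prover the thread pool would be replaced by CUDA streams or separate GPU tasks; the dependency structure is what the claim fixes, not the scheduling mechanism.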
2. The data processing method of claim 1,
the data processing method further comprises:
dividing the data input into the FFT stage into a plurality of portions of sub-data;
processing a first portion of sub-data through the FFT stage to obtain first output sub-data;
and transmitting the first output sub-data to the MULTIEXP A stage for data processing while processing a second portion of sub-data through the FFT stage, until the processing and transmission of all sub-data are completed.
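The chunked overlap of claim 2 can be sketched as a simple software pipeline; the function below and its parameters (`fft_stage`, `multiexp_a_stage`) are hypothetical placeholders for the actual kernels and device-to-device transfers, not the patented implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def pipelined_fft_multiexp(chunks, fft_stage, multiexp_a_stage):
    """Overlap FFT processing of chunk i with handing chunk i-1's output
    to the MULTIEXP A stage, as in claim 2's sub-data pipelining."""
    results, pending = [], None
    with ThreadPoolExecutor(max_workers=1) as consumer:
        for chunk in chunks:
            fft_out = fft_stage(chunk)  # process the current portion of sub-data
            if pending is not None:
                results.append(pending.result())  # collect the previous portion
            # Hand this output to MULTIEXP A asynchronously so the next
            # FFT portion is processed while this one is transmitted/consumed.
            pending = consumer.submit(multiexp_a_stage, fft_out)
        if pending is not None:
            results.append(pending.result())
    return results
```

With a single consumer worker the results come back in submission order, so the pipelining changes only the overlap, not the output.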
3. The data processing method of claim 1,
the data processing method further comprises:
performing repeated parameter reads in advance and in parallel during the data preprocessing performed by the CPU for the FFT stage and the MULTIEXP stage.
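The prefetching of claim 3 amounts to starting the repeated parameter reads early and concurrently with the rest of the CPU-side preprocessing. A minimal sketch, with entirely hypothetical names (`load_param`, `build_inputs`):

```python
from concurrent.futures import ThreadPoolExecutor

def preprocess_with_prefetch(load_param, param_names, build_inputs):
    """Kick off all repeated parameter reads up front and in parallel,
    so they overlap the remaining CPU-side preprocessing."""
    with ThreadPoolExecutor() as pool:
        # Start every parameter read immediately ("in advance and in parallel").
        futures = {name: pool.submit(load_param, name) for name in param_names}
        inputs = build_inputs()  # other preprocessing overlaps the reads
        params = {name: fut.result() for name, fut in futures.items()}
    return inputs, params
```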
4. The data processing method of claim 1,
the calculating of the theoretical maximum data processing capacity of a single GPU task comprises:
acquiring the total video memory of the GPU;
calculating a first video memory occupied by the threads in the GPU;
acquiring the remaining video memory of the GPU based on the difference between the total video memory and the first video memory;
and acquiring the theoretical maximum data processing capacity based on the ratio of the remaining video memory of the GPU to the data size of the input data.
5. The data processing method of claim 4,
the calculating of the first video memory occupied by the threads in the GPU comprises:
acquiring the total number of threads of the GPU;
calculating a second video memory of one thread based on the size of a storage unit and the number of storage units in one thread of the GPU;
and calculating the first video memory based on the total number of threads and the second video memory.
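The capacity calculation of claims 4 and 5 reduces to simple integer arithmetic. The helper below is an illustrative sketch with assumed parameter names, not code from the patent:

```python
def max_elements_per_task(total_mem, total_threads, unit_size, units_per_thread,
                          bytes_per_element):
    # Claim 5: video memory of one thread = storage-unit size * unit count.
    per_thread_mem = unit_size * units_per_thread
    # First video memory = total thread count * per-thread memory.
    thread_mem = total_threads * per_thread_mem
    # Claim 4: remaining memory divided by the size of one input element
    # gives the theoretical maximum number of elements per GPU task.
    remaining = total_mem - thread_mem
    return remaining // bytes_per_element
```

Doubling the total memory while holding the thread footprint fixed more than doubles the per-task capacity, which is why claim 6's larger batches reduce the number of host-device transfers.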
6. The data processing method of claim 5,
the data processing method further comprises:
increasing the total number of threads of the GPU;
and increasing the amount of data processed by a single GPU based on the increased total number of threads of the GPU, so as to reduce the number of transmissions of the GPU.
7. The data processing method of claim 1,
the CPU in the CPU-GPU heterogeneous architecture is responsible for logic control and data preprocessing, and the GPU is responsible for processing intensive and parallelizable computation.
8. A terminal device, comprising a memory and a processor, wherein the memory is coupled to the processor;
wherein the memory is adapted to store program data and the processor is adapted to execute the program data to implement the data processing method of any of claims 1-7.
9. A computer storage medium for storing program data for implementing a data processing method according to any one of claims 1 to 7 when executed by a processor.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111539679.8A CN114880109B (en) | 2021-12-15 | 2021-12-15 | Data processing method and device based on CPU-GPU heterogeneous architecture and storage medium |
PCT/CN2021/141312 WO2023108801A1 (en) | 2021-12-15 | 2021-12-24 | Data processing method based on cpu-gpu heterogeneous architecture, device and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111539679.8A CN114880109B (en) | 2021-12-15 | 2021-12-15 | Data processing method and device based on CPU-GPU heterogeneous architecture and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114880109A CN114880109A (en) | 2022-08-09 |
CN114880109B true CN114880109B (en) | 2023-04-14 |
Family
ID=82667419
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111539679.8A Active CN114880109B (en) | 2021-12-15 | 2021-12-15 | Data processing method and device based on CPU-GPU heterogeneous architecture and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN114880109B (en) |
WO (1) | WO2023108801A1 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118377537B (en) * | 2024-06-21 | 2024-09-06 | 之江实验室 | Pulse double-star Fourier domain acceleration search GPU parallel search method and device |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111585770A (en) * | 2020-01-21 | 2020-08-25 | 上海致居信息科技有限公司 | Method, device, medium and system for distributed acquisition of zero-knowledge proof |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2378706A4 (en) * | 2008-12-11 | 2017-06-28 | NEC Corporation | Zero-knowledge proof system, zero-knowledge proof device, zero-knowledge verification device, zero-knowledge proof method and program therefor |
US9189617B2 (en) * | 2013-09-27 | 2015-11-17 | Intel Corporation | Apparatus and method for implementing zero-knowledge proof security techniques on a computing platform |
CN104572587B (en) * | 2014-12-23 | 2017-11-14 | 中国电子科技集团公司第三十八研究所 | The acceleration operation method and device that data matrix is multiplied |
JP6724828B2 (en) * | 2017-03-15 | 2020-07-15 | カシオ計算機株式会社 | Filter calculation processing device, filter calculation method, and effect imparting device |
CN111373694B (en) * | 2020-02-21 | 2023-05-02 | 香港应用科技研究院有限公司 | Zero knowledge proof hardware accelerator and method thereof |
CN112698094B (en) * | 2020-12-04 | 2022-06-24 | 中山大学 | Multi-channel multi-acquisition-mode high-speed acquisition system and method |
CN113114377B (en) * | 2021-03-05 | 2022-03-04 | 北京遥测技术研究所 | QPSK signal frequency offset estimation method for spatial coherent laser communication |
CN113177225B (en) * | 2021-03-16 | 2022-03-18 | 深圳市名竹科技有限公司 | Block chain-based data storage certification method, device, equipment and storage medium |
2021
- 2021-12-15: CN application CN202111539679.8A granted as patent CN114880109B (Active)
- 2021-12-24: WO application PCT/CN2021/141312 published as WO2023108801A1 (status unknown)
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111585770A (en) * | 2020-01-21 | 2020-08-25 | 上海致居信息科技有限公司 | Method, device, medium and system for distributed acquisition of zero-knowledge proof |
Non-Patent Citations (1)
Title |
---|
Wang Panfeng et al., "A parallel fusion algorithm for remote sensing images based on complex wavelet transform," Computer Engineering and Science, 2008, No. 3, pp. 35-39. *
Also Published As
Publication number | Publication date |
---|---|
CN114880109A (en) | 2022-08-09 |
WO2023108801A1 (en) | 2023-06-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113628094B (en) | High-throughput SM2 digital signature computing system and method based on GPU | |
Pham et al. | Double SHA-256 hardware architecture with compact message expander for bitcoin mining | |
CN111796797B (en) | Method and device for realizing loop polynomial multiplication calculation acceleration by using AI accelerator | |
Wang et al. | HE-Booster: an efficient polynomial arithmetic acceleration on GPUs for fully homomorphic encryption | |
CN114880109B (en) | Data processing method and device based on CPU-GPU heterogeneous architecture and storage medium | |
Ni et al. | Enabling zero knowledge proof by accelerating zk-SNARK kernels on GPU | |
Aasaraai et al. | Fpga acceleration of multi-scalar multiplication: Cyclonemsm | |
US20150095389A1 (en) | Method and system for generating pseudorandom numbers in parallel | |
Zhao et al. | Hardware acceleration of number theoretic transform for zk‐SNARK | |
CN111738703B (en) | Accelerator for accelerating secure hash algorithm | |
CN117908835B (en) | Method for accelerating SM2 cryptographic algorithm based on floating point number computing capability | |
US11546161B2 (en) | Zero knowledge proof hardware accelerator and the method thereof | |
CN110704193B (en) | Method and device for realizing multi-core software architecture suitable for vector processing | |
CN115344526B (en) | Hardware acceleration method and device of data flow architecture | |
WO2023108800A1 (en) | Performance analysis method based on cpu-gpu heterogeneous architecture, and device and storage medium | |
Khan et al. | Accelerating SpMV multiplication in probabilistic model checkers using GPUs | |
CN112799637B (en) | High-throughput modular inverse computation method and system in parallel environment | |
WO2023125463A1 (en) | Heterogeneous computing framework-based processing method and apparatus, and device and medium | |
CN117560140A (en) | SM3 cryptographic algorithm optimization method based on RISC-V | |
TW202029063A (en) | Systems and methods for accelerating nonlinear mathematical computing | |
CN113806059B (en) | Proof method, system, electronic device and storage medium for zero-knowledge proof | |
CN114463008A (en) | Block chain transaction execution method and device based on parallel computing model | |
Falcao et al. | Heterogeneous implementation of a voronoi cell-based svp solver | |
Zeng et al. | The implementation of polynomial multiplication for lattice-based cryptography: A survey | |
CN113901394B (en) | Bullet proving method and device based on graphic processor |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||