CN112083956B - Heterogeneous platform-oriented automatic management system for complex pointer data structure - Google Patents
- Publication number: CN112083956B
- Application number: CN202010971038.9A
- Authority: CN (China)
- Prior art keywords: information, serial, parallel, node, class
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS; G06—COMPUTING, CALCULATING OR COUNTING; G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/06—Arrangements for program control using stored programs; G06F9/30—Arrangements for executing machine instructions; G06F9/30145—Instruction analysis, e.g. decoding, instruction word fields
- G06F8/40—Transformation of program code; G06F8/41—Compilation; G06F8/42—Syntactic analysis; G06F8/425—Lexical analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Devices For Executing Special Programs (AREA)
- Stored Programmes (AREA)
Abstract
Description
Technical Field
The present invention relates to a system for the automatic management of complex pointer data structures in OpenMP Offloading programs, and belongs to the technical field of heterogeneous programming.
Background
OpenMP is a set of compiler directives for multiprocessor programming on shared-memory parallel systems, proposed under the leadership of the OpenMP Architecture Review Board and now widely adopted [1]. The programming languages supported by OpenMP include C, C++, and Fortran; compilers that support OpenMP include the Sun, GNU, and Intel compilers. OpenMP provides a high-level abstract description of parallel algorithms: programmers state their intent by adding dedicated pragma directives to the source code, from which the compiler can automatically parallelize the program and insert synchronization, mutual exclusion, and communication where necessary.
In the field of high-performance computing, accelerators of various kinds (GPU, FPGA, DSP, MLU, etc.) have become an important source of computing power alongside the CPU. Starting from version 4.0, OpenMP added the Offloading feature, which supports a CPU + accelerator heterogeneous programming model; through versions 4.5 and 5.0, OpenMP Offloading has been progressively refined. OpenMP Offloading makes it possible for OpenMP programs to fully exploit the computing power of heterogeneous platforms, but modifying an existing OpenMP CPU program to use the Offloading feature remains difficult, tedious, and error-prone, especially when the program contains complex pointer data structures such as class-nested pointers, vector containers, and multi-level nested pointers.
Although the OpenMP Offloading syntax is simple, users must still use the relevant pragma directives to explicitly manage data-transfer operations between the CPU and the accelerator, which is a major inconvenience for developers, above all when complex pointer data structures are involved. For example, memory allocation for the C++ vector container is implicit, so developers cannot easily control its allocation and data transfer, which makes the Offloading feature hard to use. For nested pointers and multi-level nested pointers, handling the allocation, release, and data transfer of the memory referenced by pointers at different levels, in both CPU and accelerator memory spaces, is tedious and highly error-prone work that likewise deters developers from using the Offloading feature.
The CUDA programming model on the NVIDIA GPU platform has supported the Unified Memory (UM) feature since version 6.0; this feature unifies the CPU and GPU address spaces and automatically manages data transfer between them. Unified Memory therefore provides a feasible technical means for automating the handling of complex pointer data structures when developing OpenMP Offloading programs.
Summary of the Invention
The technical problem to be solved by the present invention is as follows:
The purpose of the present invention is to propose a heterogeneous platform-oriented system for the automatic management of complex pointer data structures in OpenMP Offloading programs, in order to solve the following problem of the prior art: starting from an OpenMP CPU program, the prior art cannot automatically rewrite memory allocation and release statements and insert the relevant pragma directives, cannot automatically manage data transfer between the CPU and the accelerator, and cannot guarantee the data consistency of programs containing complex pointer data structures on heterogeneous computing platforms, which degrades program performance.
The technical scheme adopted by the present invention to solve the above technical problem is as follows:
A heterogeneous platform-oriented automatic management system for complex pointer data structures, used to realize the automatic management of complex pointer data structures of OpenMP Offloading programs on heterogeneous computing platforms while guaranteeing data consistency.
The system comprises the following three modules:
Information collection module. This module has two functions: 1) statically analyze the source program to collect program information; 2) based on the collected information, build an abstract representation of the source program, namely a serial-parallel control flow graph.
The module works in two steps: 1) using the Clang compiler, generate the abstract syntax tree (AST) and control flow graph (CFG) for the C or C++ source code, traverse the AST and CFG, distinguish serial and parallel regions, and obtain detailed program information; 2) generate the serial-parallel control flow graph from this information.
Automatic conversion module, mainly responsible for inserting runtime APIs into the source code based on the serial-parallel control flow graph, thereby completing the code conversion. First, the type of the complex pointer data is determined from the complex pointer variable information stored in the serial-parallel control flow graph; then, according to the type, the appropriate runtime API is inserted at the appropriate position in the source code to complete the conversion. In this way, the memory allocation, release, and data-transfer operations involving complex pointer variables are all taken over by the runtime, so that complex pointer variables can be managed automatically and data consistency between the CPU and the GPU is guaranteed.
Runtime module, mainly responsible for implementing the following operations on complex pointer data types on top of Unified Memory: memory allocation and release on the CPU and GPU, and automatic data transfer between the CPU and GPU. The runtime module consists of the UMMaper class and the allocator class, which expose memory allocation and release to the outside in the form of API interfaces.
Further, the information collection function of the information collection module is realized as follows:
First, the Clang compiler performs lexical and semantic analysis on the target program and generates the abstract syntax tree (AST) and control flow graph (CFG).
Then the following three static analyses are performed on the AST and CFG to obtain function call relationship information, variable-related information, and serial/parallel region information, providing support for the subsequent construction of the serial-parallel control flow graph and for code conversion.
Analysis 1, function call relationship analysis: for an AST, perform the following two steps:
1) recursively traverse every node of the AST;
2) if the current node is a function definition node, record the function name and the sub-functions it calls.
Analysis 2, variable information analysis: for an AST, perform the following three steps:
1) recursively traverse every node of the AST;
2) if the current node involves a variable definition or reference, record the variable definition information, variable reference information, and variable scope information;
3) if the current node involves memory allocation or release, record the memory allocation information and memory release information.
Analysis 3, serial/parallel region analysis: for a CFG, perform the following three steps:
1) recursively traverse every node of the CFG and record the relationships between nodes;
2) if the current node lies within the scope of an OpenMP #pragma parallel directive, mark it as a parallel node and record its type and range information;
3) if the current node does not lie within the scope of an OpenMP #pragma parallel directive, mark it as a serial node and record its type and range information.
Through the three static analyses above, the following information is obtained: function call relationship information, variable definition information, variable reference information, variable scope information, memory allocation information, memory release information, serial/parallel node type information, serial/parallel node range information, and inter-node relationship information. This information supports the construction of the serial-parallel control flow graph.
Further, the serial-parallel control flow graph in the information collection module is defined as follows:
The serial-parallel control flow graph consists of serial nodes, parallel nodes, and directed edges between nodes. A serial node represents a branch-free code fragment that lies outside the scope of any OpenMP #pragma parallel directive and executes serially; the code fragment of a serial node executes on the CPU, and a serial node is also denoted a SEQ node.
A parallel node represents a code fragment within the scope of an OpenMP #pragma parallel directive that executes in parallel; the code fragment of a parallel node is offloaded to the GPU for execution, and a parallel node is also denoted an OMP node.
Serial and parallel nodes store function call information and variable-related information.
A directed edge between nodes represents the execution order of the code fragments corresponding to the nodes.
Further, the construction of the serial-parallel control flow graph in the information collection module proceeds as follows:
The construction takes the function as the basic unit of processing. For the whole source program, if the serial-parallel control flow graph of a single function can be built, then by combining it with the collected function call relationship information, the serial-parallel control flow graph of the entire source program can be built recursively.
For a single function, based on the collected information, the serial-parallel control flow graph is built in the following steps:
1) use the node type and node range information to create the individual, initially isolated serial and parallel nodes;
2) use the inter-node relationship information to create directed edges between nodes, connecting the serial and parallel nodes into a graph;
3) according to the node range information, store the function call information, variable definition information, variable reference information, variable scope information, memory allocation information, and memory release information in the corresponding serial or parallel node.
Further, complex pointer data is divided into three types according to where the pointer is located:
class-nested pointers, vector containers, and multi-level nested pointers.
A class-nested pointer is a pointer contained in a class; a vector container is the vector container provided by the C++ standard library; a multi-level nested pointer is a pointer of two or more levels of indirection.
Further, class-nested pointers are processed in the following steps:
(1) recursively traverse the serial-parallel control flow graph built above and, from the variable definition and reference information stored in the serial/parallel nodes, find the class-nested pointers that are defined in a serial node and referenced in a parallel node, together with the C++ classes that contain them;
(2) for each C++ class found in step (1), modify the class's type definition in the source code so that it inherits from the UMMaper base class provided by the runtime;
(3) for each class-nested pointer found in step (1), modify its memory allocation and release statements in the source code to allocate with cudaMallocManaged() and release with cudaFree();
(4) for each C++ class instance found in step (1), modify the instance's definition statement in the source code to create the instance with the overloaded new operator, and pass the address of the memory allocated in step (3) to the corresponding nested pointer in the class instance.
Further, vector containers are processed in the following steps:
(1) recursively traverse the serial-parallel control flow graph built above and, from the variable definition and reference information stored in the serial/parallel nodes, find the vector containers that are defined in a serial node and referenced in a parallel node;
(2) for each vector container instance found in step (1), modify its definition statement to insert an explicit use of the custom allocator provided by the runtime.
Further, multi-level nested pointers are processed in the following steps:
(1) recursively traverse the serial-parallel control flow graph built above and, from the variable definition and reference information stored in the serial/parallel nodes, find the multi-level nested pointers that are defined in a serial node and referenced in a parallel node, together with their sub-pointers at every level;
(2) for all multi-level nested pointers and sub-pointers found in step (1), modify their memory allocation and release statements in the source code to allocate with cudaMallocManaged() and release with cudaFree().
Further, the UMMaper class of the runtime module is implemented as follows: to handle class-nested pointers, the UMMaper class is designed to manage the memory allocation and release of C++ classes. In the UMMaper class, the default C++ new and delete operators are overloaded using cudaMallocManaged() and cudaFree(); the allocation, release, and data-transfer operations of the memory of any class derived from UMMaper, on both the CPU and the GPU, are then managed automatically by Unified Memory.
Further, the custom allocator class of the runtime module is implemented as follows: to handle vector containers, a custom allocator class is designed to manage their memory allocation and release. By default the vector container uses the allocator of the C++ standard library to manage memory allocation and release, so a custom allocator class can be implemented on top of Unified Memory, thereby automating the management of vector memory. In the custom allocator class, the _allocate() function is implemented with cudaMallocManaged() and the _deallocate() function with cudaFree(); by explicitly naming the custom allocator in the vector declaration, the allocation, release, and data-transfer operations of the vector's memory on the CPU and GPU are managed automatically by Unified Memory.
The present invention has the following beneficial technical effects:
On the basis of an OpenMP CPU program, the system of the present invention can automatically rewrite memory allocation and release statements, automatically insert the relevant pragma directives, and automatically manage data transfer between the CPU and the accelerator, thereby guaranteeing the data consistency of programs containing complex pointer data structures on heterogeneous computing platforms and improving program performance.
The heterogeneous programming scheme studied in the present invention targets the automatic management of complex pointer data structures in OpenMP programs; specifically, it automatically rewrites memory allocation and release statements, automatically inserts the relevant pragma directives, and automatically manages data transfer between the CPU and the accelerator, thereby guaranteeing the data consistency of programs on heterogeneous computing platforms and improving program performance.
Because OpenMP Offloading supports accelerator programming, OpenMP code running on the CPU can be offloaded to the GPU for execution, and the objective conditions for such offloading are in place. Under these conditions, offloading OpenMP code to the GPU can both improve the running efficiency of the program and fully exploit the acceleration offered by the GPU. However, performing the offloading by hand cannot guarantee the correctness of the converted program, while also consuming considerable manpower and material resources at low efficiency. The present invention therefore proposes an automatic management scheme for complex pointer data structures to solve the most difficult problems in the offloading process: the memory allocation and data transfer of complex pointer data structures between the CPU and the accelerator.
Comparative experiments on general-purpose test suites (PolyBench, Rodinia, etc.) demonstrate that the present invention can automatically manage complex pointer data structures in OpenMP Offloading programs, guarantee program correctness, and improve program performance.
The complex pointer data structures described in the present invention refer to complex pointer data types.
Brief Description of the Drawings
Figure 1 is the overall framework diagram of the system of the present invention; Figure 2 is a structural block diagram of the AST analysis method; Figure 3 is a serial-parallel control flow graph of a function; Figure 4 is a schematic diagram of Unified Memory; Figure 5 is a schematic diagram of the automatic offloading of class-nested pointers (comparison of programs with and without Unified Memory); Figure 6 is a schematic diagram of the automatic offloading of vector containers (comparison of programs with and without the space allocator); Figure 7 is a schematic diagram of the automatic offloading of multi-level nested pointers; Figure 8 is the overall framework diagram of the experimental design of the present invention;
Figure 9 is a histogram of running times on the 2080Ti platform with the Large data size, in which Figure 9(a) is the Large-2080Ti running-time comparison (data set a) and Figure 9(b) is the Large-2080Ti running-time comparison (data set b); Figures 9(a) and 9(b) are in essence one figure, split in two because of the number of data sets;
Figure 10 compares the running times of the Rodinia test suite on the 2080Ti; Figure 11 compares the speedups of Polybench-OAO for different data sizes (2080Ti); Figure 12 compares Polybench speedups (2080Ti); Figure 13 shows the runtime overhead (2080Ti); Figure 14 is a detailed test chart for complex pointer data structures (K40);
Figure 15 is a program conversion example containing vector; Figure 16 is a program conversion example containing structnest; Figure 17 is a program conversion example containing multilevel. In Figures 15 to 17, vector is the C++ vector container, structnest is a class-nested pointer, and multilevel is a multi-level nested pointer.
Detailed Description
With reference to Figures 1 to 17, the heterogeneous platform-oriented automatic management system for complex pointer data structures of the present invention is described as follows:
The main task of the present invention is to automatically manage the complex pointer data structures (class-nested pointers, vector containers, multi-level nested pointers, etc.) in OpenMP Offloading programs, that is, to automatically rewrite allocation and release statements and automatically manage data transfer, guaranteeing data consistency. The present invention mainly comprises the following three modules:
Information collection module. This module has two functions: 1) statically analyze the source program to collect program information; 2) based on the collected information, build an abstract representation of the source program, namely a serial-parallel control flow graph. The module works in two steps: 1) using the Clang compiler, generate the abstract syntax tree (AST) and control flow graph (CFG) for the C or C++ source code, traverse the AST and CFG, distinguish serial and parallel regions, and obtain detailed program information; 2) generate the serial-parallel control flow graph from this information.
Automatic conversion module, mainly responsible for inserting runtime APIs into the source code based on the serial-parallel control flow graph, thereby completing the code conversion. First, the type of each complex pointer variable is determined from the complex pointer variable information stored in the serial-parallel control flow graph. Then, according to the type, the appropriate runtime API is inserted at the appropriate position in the source code to complete the conversion. In this way, the memory allocation, release, and data-transfer operations involving complex pointer variables are all taken over by the runtime, so that complex pointer variables can be managed automatically and data consistency between the CPU and the GPU is guaranteed.
Runtime module, mainly responsible for implementing the following operations on complex pointer data structures on top of Unified Memory: memory allocation and release on the CPU and GPU, and automatic data transfer between the CPU and GPU. The default C++ interfaces new, delete, and allocator are re-implemented using cudaMallocManaged() and cudaFree(), and memory allocation and release are exposed to the outside in the form of API interfaces.
As a result, the source program is converted into a new program carrying runtime APIs; the overall framework of the system is shown in Figure 1.
1 Information Collection Module
The information collection module implements two functions: statically analyzing the program and collecting program information, and building the serial-parallel control flow graph.
1.1 Information Collection
First, the Clang compiler performs lexical and semantic analysis on the target program and generates the abstract syntax tree (AST) and control flow graph (CFG).
Then the following three static analyses are performed on the AST and CFG to obtain function call relationship information, variable-related information, and serial/parallel region information, providing support for the subsequent construction of the serial-parallel control flow graph and for code conversion.
Analysis 1, function call relationship analysis: for an AST, perform the following two steps:
1) Recursively traverse each node of the AST;
2) If the current node is a function definition node, record the function name and the sub-functions it calls.
Analysis 2, variable information analysis: for an AST, perform the following three steps:
1) Recursively traverse each node of the AST;
2) If the current node involves a variable definition or reference, record the variable definition information, variable reference information, and variable scope information;
3) If the current node involves memory allocation or release, record the memory allocation information and memory release information.
Analysis 3, serial/parallel region analysis: for a CFG, perform the following three steps:
1) Recursively traverse each node of the CFG and record the relationship information between nodes;
2) If the current node lies within the scope of an OpenMP #pragma parallel directive, mark it as a parallel node and record its type information and range information;
3) If the current node does not lie within the scope of an OpenMP #pragma parallel directive, mark it as a serial node and record its type information and range information.
These three static analyses yield the following information: function call relationships, variable definitions, variable references, variable scopes, memory allocations, memory releases, serial/parallel node types, serial/parallel node ranges, and inter-node relationships. This information supports the construction of the serial-parallel control flow graph.
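As a minimal sketch of how such a traversal can collect call relationships (Analysis 1), the following uses a toy AST rather than Clang's actual classes; the node kinds and field names are illustrative assumptions, not the patent's implementation:

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

// Toy AST: just enough structure to illustrate Analysis 1 (function call
// relationships). These are not Clang's AST classes.
struct AstNode {
    enum Kind { FunctionDef, Call, Other } kind = Other;
    std::string name;                               // function name for FunctionDef/Call
    std::vector<std::unique_ptr<AstNode>> children;
};

// Recursively traverse the AST; for each function definition node, record
// the function name and every sub-function called inside its body.
void collectCalls(const AstNode& n, const std::string& current,
                  std::map<std::string, std::vector<std::string>>& callGraph) {
    std::string next = current;
    if (n.kind == AstNode::FunctionDef) {
        callGraph[n.name];                          // register even if it calls nothing
        next = n.name;
    } else if (n.kind == AstNode::Call && !current.empty()) {
        callGraph[current].push_back(n.name);       // record callee under current function
    }
    for (const auto& c : n.children) collectCalls(*c, next, callGraph);
}
```
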
1.2 Building the serial-parallel control flow graph
Definition: a serial-parallel control flow graph consists of serial nodes, parallel nodes, and directed edges between the nodes. A serial node represents a branch-free, serially executed code fragment outside the scope of any OpenMP #pragma parallel directive; the code fragment corresponding to a serial node executes on the CPU, and a serial node is also written as a SEQ node.
A parallel node represents a code fragment within the scope of an OpenMP #pragma parallel directive that is executed in parallel; the corresponding code fragment is offloaded to the GPU for execution, and a parallel node is also written as an OMP node. Serial and parallel nodes store function call information and variable-related information. The directed edges between nodes represent the order in which the corresponding code fragments execute.
The serial-parallel control flow graph is built with the function as the basic processing unit: for the entire source program, if the graph of one function can be built, then, combined with the function call relationship information, the serial-parallel control flow graph of the whole source program can be built recursively.
For a function, based on the collected program information, the serial-parallel control flow graph is built in the following steps:
1) Use the node type and node range information to create isolated serial and parallel nodes;
2) Use the inter-node relationship information to create directed edges between nodes, connecting the serial and parallel nodes into a graph;
3) According to the node range information, attach the function call, variable definition, variable reference, variable scope, memory allocation, and memory release information to the corresponding serial or parallel nodes.
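The three steps above can be sketched as follows; NodeInfo, the edge list, and the (line, variable) reference pairs are assumed data layouts for illustration, not the patent's actual structures:

```cpp
#include <string>
#include <utility>
#include <vector>

// Assumed input layout: one entry per serial/parallel region found by the
// static analyses.
struct NodeInfo { bool parallel; int beginLine, endLine; };

struct GraphNode {
    bool parallel;                      // OMP node if true, SEQ node otherwise
    int beginLine, endLine;             // source range covered by this node
    std::vector<int> successors;        // directed edges (indices into the graph)
    std::vector<std::string> varRefs;   // variable info attached in step 3
};

std::vector<GraphNode> buildGraph(
        const std::vector<NodeInfo>& nodes,
        const std::vector<std::pair<int, int>>& edges,
        const std::vector<std::pair<int, std::string>>& refs /* (line, var) */) {
    std::vector<GraphNode> g;
    for (const auto& n : nodes)                 // step 1: isolated SEQ/OMP nodes
        g.push_back({n.parallel, n.beginLine, n.endLine, {}, {}});
    for (const auto& e : edges)                 // step 2: directed edges
        g[e.first].successors.push_back(e.second);
    for (const auto& r : refs)                  // step 3: attach info by node range
        for (auto& node : g)
            if (r.first >= node.beginLine && r.first <= node.endLine)
                node.varRefs.push_back(r.second);
    return g;
}
```
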
2 Automatic Conversion Module
The automatic conversion module divides complex pointer data types into three categories: class-nested pointers, vector containers, and multi-level nested pointers.
The key technology for handling complex pointer data structures is Unified Memory (UM), so the concept and principle of unified memory are introduced first. UM maintains a single memory pool shared between the CPU and the GPU; memory is allocated only once, and the resulting data pointer is usable on both the host side (CPU) and the device side (GPU). Because a single pointer refers to the managed memory, data transfers between devices are performed automatically by the UM runtime system, which also lets the GPU process data sets that exceed its memory capacity. The principle of unified memory is shown in Figure 4.
Based on unified memory, a different handling method is designed for each of the three complex pointer data types.
2.1 Class-nested pointers
A class-nested pointer is a pointer contained inside a class, as shown in Figure 5. It is handled in the following steps:
① Recursively traverse the serial-parallel control flow graph and, using the variable definition and reference information stored in the serial/parallel nodes, find the class-nested pointers that are defined in a serial node and referenced in a parallel node, together with the classes that contain them.
② For each class type found in step ①, modify the class's type definition in the source code so that it derives from the UMMapper base class provided by the runtime (see the runtime design);
③ For each class-nested pointer found in step ①, modify the pointer's memory allocation and release statements in the source code to allocate with cudaMallocManaged() and release with cudaFree();
④ For each class instance found in step ①, modify the instance's definition statement in the source code to create the instance with the overloaded new operator, and pass the address of the memory allocated in step ③ to the corresponding nested pointer in the class.
An example of handling class-nested pointers is shown in Figure 5.
2.2 vector containers
A vector container is the vector container provided by the C++ standard library. vector containers are handled in the following steps:
① Recursively traverse the serial-parallel control flow graph and, using the variable definition and reference information stored in the serial/parallel nodes, find the vector containers that are defined in a serial node and referenced in a parallel node.
② For each vector container instance found in step ①, modify the instance's definition statement to insert an explicit use of the custom allocator provided by the runtime (see the runtime design).
An example of handling vector containers is shown in Figure 6.
2.3 Multi-level nested pointers
A multi-level nested pointer is a two-level pointer or a pointer with more than two levels of indirection. Multi-level nested pointers are handled in the following steps:
① Recursively traverse the serial-parallel control flow graph and, using the variable definition and reference information stored in the serial/parallel nodes, find the multi-level nested pointers that are defined in a serial node and referenced in a parallel node, together with their sub-pointers at every level;
② For all the multi-level nested pointers and sub-pointers found in step ①, modify their memory allocation and release statements in the source code to allocate with cudaMallocManaged() and release with cudaFree().
An example of handling multi-level nested pointers is shown in Figure 7.
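A sketch of step ② for a two-level pointer: every level, including each sub-pointer, is allocated from managed memory so the same addresses remain valid on both host and device. Here managedAlloc()/managedFree() are malloc/free stand-ins for cudaMallocManaged()/cudaFree(), so the sketch runs without a GPU; the function names are illustrative:

```cpp
#include <cstddef>
#include <cstdlib>

// Stand-ins: on a CUDA platform these would be cudaMallocManaged()/cudaFree().
void* managedAlloc(std::size_t bytes) { return std::malloc(bytes); }
void  managedFree(void* p)            { std::free(p); }

// Allocate a two-level pointer (int**): the top-level pointer array and
// every sub-pointer's storage all come from managed memory.
int** allocMatrix(int rows, int cols) {
    int** x = static_cast<int**>(managedAlloc(rows * sizeof(int*)));
    for (int i = 0; i < rows; ++i)          // each sub-pointer at the next level
        x[i] = static_cast<int*>(managedAlloc(cols * sizeof(int)));
    return x;
}

// Release in reverse order: sub-pointers first, then the top level.
void freeMatrix(int** x, int rows) {
    for (int i = 0; i < rows; ++i) managedFree(x[i]);
    managedFree(x);
}
```
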
3 Runtime
To keep the converted code close to the format of the source code and minimize the modifications to it, this invention proposes inserting a small number of runtime API calls into the original program and encapsulating all unified-memory-based memory allocation inside the runtime.
For classes and their nested pointers, the runtime implements the UMMapper base class, which is responsible for memory allocation and release, as shown in Figure 5. Inside UMMapper, the default C++ new and delete operators are overloaded using cudaMallocManaged() and cudaFree(). A class then only needs to derive from UMMapper at declaration time and create instances with new; the memory allocation, release, and data transfer operations of the derived class are managed automatically through unified memory.
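A minimal sketch of what such a UMMapper base class could look like; malloc/free stand in for cudaMallocManaged()/cudaFree() so the sketch compiles and runs without CUDA, and the Particle class is an invented example:

```cpp
#include <cstddef>
#include <cstdlib>

// Base class that routes allocation of derived-class instances through
// managed memory by overloading operator new/delete. On a CUDA platform the
// bodies would call cudaMallocManaged()/cudaFree().
struct UMMapper {
    void* operator new(std::size_t n)   { return std::malloc(n); }  // cudaMallocManaged(&p, n)
    void  operator delete(void* p)      { std::free(p); }           // cudaFree(p)
    void* operator new[](std::size_t n) { return std::malloc(n); }
    void  operator delete[](void* p)    { std::free(p); }
};

// Example user class (invented): deriving from UMMapper means `new Particle()`
// places the instance in managed memory, so its nested pointer fields, when
// also allocated from managed memory, stay valid on both CPU and GPU.
struct Particle : UMMapper {
    float* pos = nullptr;   // class-nested pointer, allocated separately
    int n = 0;
};
```
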
The vector container uses the allocator of the C++ standard library to manage memory allocation and release, so a custom allocator class can be implemented on top of unified memory to manage the allocation and release of vector memory automatically. In the custom allocator class, the allocate() function is implemented with cudaMallocManaged() and the deallocate() function with cudaFree(). Explicitly instantiating the vector container with this custom allocator at declaration time lets unified memory automatically manage the allocation, release, and data transfer of the vector's memory on the CPU and GPU, as shown in Figure 6.
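A minimal sketch of such a custom allocator; the class name UMAllocator is an assumption, and malloc/free stand in for cudaMallocManaged()/cudaFree() so it runs without a GPU:

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Minimal C++11-style allocator whose allocate()/deallocate() would, on a
// CUDA platform, call cudaMallocManaged()/cudaFree() so vector storage lives
// in unified memory.
template <class T>
struct UMAllocator {
    using value_type = T;
    UMAllocator() = default;
    template <class U> UMAllocator(const UMAllocator<U>&) {}

    T* allocate(std::size_t n) {              // cudaMallocManaged(&p, n * sizeof(T))
        return static_cast<T*>(std::malloc(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) {      // cudaFree(p)
        std::free(p);
    }
};
template <class T, class U>
bool operator==(const UMAllocator<T>&, const UMAllocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const UMAllocator<T>&, const UMAllocator<U>&) { return false; }
```

A converted declaration would then read `std::vector<float, UMAllocator<float>> v;` in place of `std::vector<float> v;`.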
The technical effect of the invention is verified as follows:
To verify and analyze the effect of the code offloading scheme, we ran tests on the RTX 2080Ti platform using the PolyBench and Rodinia benchmark suites, comparing the results against another well-known source-to-source compiler (DawnCC), against manual conversion (Manual), and against CPU-parallel code. Our scheme is named OAO (OpenMP Automated Offloading with complex data structure support).
Obtaining the experimental results involves four stages: integrating the runtime, writing and running test scripts, extracting data with Python, and analyzing it with OriginLab. The overall experimental process is shown in Figure 8.
1 Running time
The running times on the 2080Ti platform with the Large data size are shown in Figure 9.
Figure 10 compares the running times of the Rodinia test suite on the 2080Ti.
From the collection and comparison of these running times, the following conclusions can be drawn:
(1) The running time graphs show that OAO handles all 23 test programs, whereas DawnCC handles only 15. OAO improves performance on 9 programs on the K40 platform and on 15 programs on the 2080Ti platform, and OAO outperforms manual conversion on all test programs on all platforms.
(2) In Rodinia-2080Ti, with the Rodinia data sizes determined by tests run before the experiment, OAO's running time is shorter than OMP's on 5 of the 8 data sets.
(3) In PolyBench-Large, OAO's running time beats DawnCC, Manual, and native; overall the running times follow the trend OAO < DawnCC < Manual < native. Native is slowest because of the transfer statements it selects and its lack of parallel optimization; Manual includes redundant transfers of some data, so it runs longer than OAO; and OAO also generally outperforms DawnCC in data set handling.
2 Speedup
As Figure 11 shows, among speedups greater than 1, the Large data size yields a higher speedup than Medium. This matches the nature of parallel acceleration: the larger the amount of data processed, the more pronounced the effect of parallel speedup.
The speedup is computed as the ratio OMP/Type, where Type is one of the other versions; results are plotted against the 1X line, with positive speedup above it and slowdown below it. Figure 12 shows that, on every platform, OAO's speedup is better than Manual's on all test programs. On the PolyBench programs OAO is slightly better than DawnCC, and since DawnCC cannot handle the Rodinia programs at all, OAO delivers the best data transfer optimization on all test programs, handles complex test programs that DawnCC cannot, and has a wider range of applicability.
3 Runtime overhead
The runtime overhead on the PolyBench test suite is shown in Figure 13.
For all test programs, the runtime overhead is more than two orders of magnitude smaller than the running time, so the extra overhead OAO introduces at run time is negligible: adding the runtime costs the source program essentially no performance.
4 Unified memory
The detailed unified memory tests cover the vector container, class-nested pointers (structnest), and multi-level nested pointers (multilevel). Figure 14 compares the running times and data transfer times of the complex pointer data structures.
In terms of data transfer time, the H2D and D2H times are far smaller than the total running time, and all complex pointer data structures are handled correctly, so the OAO compiler correctly supports GPU offloading of complex pointer data structures.
Figures 15, 16, and 17 compare the simple conversion programs for the vector container, structnest, and multilevel, respectively.
As shown in Figure 15, the vector variable's declaration is rewritten to use the QY::allocator overload. In Figure 16, the pointers x, y, and z nested inside the class are allocated with cudaMallocManaged at declaration time, while the instantiated class a is marked by a transfer statement, adding map(tofrom:a[0:1]). In Figure 17, x is a two-level pointer and y is a first-level sub-pointer of x; at declaration time both x and y are allocated with cudaMallocManaged.
Table 1 Unified memory test results
The unified memory test results are shown in Table 1. During data transfer, one preliminary transfer is made before the formal transfers: 4 B for H2D and 1 B for D2H (the 4+1 pattern). This is why, in the table, the bidirectional transfer counts of vector (52 vs. 49) and structnest (36 vs. 33) each differ by 3. Beyond that, vector performs only two transfers, X and Y, of 24 B each, and structnest performs one transfer, A, of 32 B.
In multilevel, z is transferred one way using the data transfer model, while x and y are multi-level nested pointers managed through unified memory. Apart from the 4+1 pattern of the preliminary transfer, the difference in transfer volume between H2D and D2H arises because z is not transferred back, whereas x is.
The experimental results demonstrate that OAO can use unified memory to handle complex pointer data structures correctly, realizing support for complex pointer data structure processing.
Claims (10)

Application CN202010971038.9A, filed 2020-09-15; published as CN112083956A on 2020-12-15; granted as CN112083956B on 2022-12-09.