CN112083956B - Heterogeneous platform-oriented automatic management system for complex pointer data structure - Google Patents
- Publication number: CN112083956B
- Application number: CN202010971038.9A
- Authority: CN (China)
- Prior art keywords: information, serial, parallel, node, class
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS; G06—COMPUTING, CALCULATING OR COUNTING; G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/06—Arrangements for program control using stored programs; G06F9/30—Arrangements for executing machine instructions; G06F9/30145—Instruction analysis, e.g. decoding, instruction word fields
- G06F8/40—Transformation of program code; G06F8/41—Compilation; G06F8/42—Syntactic analysis; G06F8/425—Lexical analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Devices For Executing Special Programs (AREA)
- Stored Programmes (AREA)
Abstract
Description
Technical Field
The present invention relates to a system for the automatic management of complex pointer data structures in OpenMP Offloading programs, and belongs to the technical field of heterogeneous programming.
Background
OpenMP is a set of compiler directives for multiprocessor programming on shared-memory parallel systems, proposed under the leadership of the OpenMP Architecture Review Board and now widely adopted [1]. The programming languages supported by OpenMP include C, C++, and Fortran; compilers that support OpenMP include the Sun, GNU, and Intel compilers. OpenMP provides a high-level abstract description of parallel algorithms: programmers state their intent by adding dedicated pragma directives to the source code, from which the compiler can automatically parallelize the program and insert synchronization, mutual exclusion, and communication where necessary.
In the field of high-performance computing, accelerators of various kinds (GPU, FPGA, DSP, MLU, etc.) have become an important source of computing power alongside the CPU. Starting from version 4.0, OpenMP added the Offloading feature, which supports a CPU + accelerator heterogeneous programming model; through versions 4.5 and 5.0, OpenMP Offloading has been progressively refined. OpenMP Offloading makes it possible for OpenMP programs to fully exploit the computing power of heterogeneous platforms, but modifying an existing OpenMP CPU program to use the Offloading feature remains difficult, tedious, and error-prone, especially when the program contains complex pointer data structures such as class-nested pointers, vector containers, and multi-level nested pointers.
Although the OpenMP Offloading syntax is simple, users must still use the relevant pragma directives to explicitly manage data-transfer operations between the CPU and the accelerator, which is a major inconvenience for developers, above all when complex pointer data structures are involved. For example, memory allocation for the C++ vector container is implicit, so developers cannot easily control its allocation and data transfer, which makes the Offloading feature hard to use. For nested pointers and multi-level nested pointers, handling the allocation, release, and data transfer of the memory referenced by pointers at different levels, in both CPU and accelerator memory spaces, is tedious and highly error-prone work that likewise deters developers from using the Offloading feature.
The CUDA programming model on the NVIDIA GPU platform has supported the Unified Memory (UM) feature since version 6.0; this feature unifies the CPU and GPU address spaces and automatically manages data transfer between them. Unified Memory therefore provides a feasible technical means for automating the handling of complex pointer data structures when developing OpenMP Offloading programs.
Summary of the Invention
The technical problem to be solved by the present invention is as follows:
The purpose of the present invention is to propose a heterogeneous platform-oriented system for the automatic management of complex pointer data structures in OpenMP Offloading programs, in order to solve the following problem of the prior art: starting from an OpenMP CPU program, the prior art cannot automatically rewrite memory allocation and release statements and insert the relevant pragma directives, cannot automatically manage data transfer between the CPU and the accelerator, and cannot guarantee the data consistency of programs containing complex pointer data structures on heterogeneous computing platforms, which degrades program performance.
The technical scheme adopted by the present invention to solve the above technical problem is as follows:
A heterogeneous platform-oriented automatic management system for complex pointer data structures, used to realize the automatic management of complex pointer data structures of OpenMP Offloading programs on heterogeneous computing platforms while guaranteeing data consistency.
The system comprises the following three modules:
Information collection module. This module has two functions: 1) statically analyze the source program to collect program information; 2) based on the collected information, build an abstract representation of the source program, namely a serial-parallel control flow graph.
The module works in two steps: 1) using the Clang compiler, generate the abstract syntax tree (AST) and control flow graph (CFG) for the C or C++ source code, traverse the AST and CFG, distinguish serial and parallel regions, and obtain detailed program information; 2) generate the serial-parallel control flow graph from this information.
Automatic conversion module, mainly responsible for inserting runtime APIs into the source code based on the serial-parallel control flow graph, thereby completing the code conversion. First, the type of the complex pointer data is determined from the complex pointer variable information stored in the serial-parallel control flow graph; then, according to the type, the appropriate runtime API is inserted at the appropriate position in the source code to complete the conversion. In this way, the memory allocation, release, and data-transfer operations involving complex pointer variables are all taken over by the runtime, so that complex pointer variables can be managed automatically and data consistency between the CPU and the GPU is guaranteed.
Runtime module, mainly responsible for implementing the following operations on complex pointer data types on top of Unified Memory: memory allocation and release on the CPU and GPU, and automatic data transfer between the CPU and GPU. The runtime module consists of the UMMaper class and the allocator class, which expose memory allocation and release to the outside in the form of API interfaces.
Further, the information collection function of the information collection module is realized as follows:
First, the Clang compiler performs lexical and semantic analysis on the target program and generates the abstract syntax tree (AST) and control flow graph (CFG).
Then the following three static analyses are performed on the AST and CFG to obtain function call relationship information, variable-related information, and serial/parallel region information, providing support for the subsequent construction of the serial-parallel control flow graph and for code conversion.
Analysis 1, function call relationship analysis: for an AST, perform the following two steps:
1) recursively traverse every node of the AST;
2) if the current node is a function definition node, record the function name and the sub-functions it calls.
Analysis 2, variable information analysis: for an AST, perform the following three steps:
1) recursively traverse every node of the AST;
2) if the current node involves a variable definition or reference, record the variable definition information, variable reference information, and variable scope information;
3) if the current node involves memory allocation or release, record the memory allocation information and memory release information.
Analysis 3, serial/parallel region analysis: for a CFG, perform the following three steps:
1) recursively traverse every node of the CFG and record the relationships between nodes;
2) if the current node lies within the scope of an OpenMP #pragma parallel directive, mark it as a parallel node and record its type and range information;
3) if the current node does not lie within the scope of an OpenMP #pragma parallel directive, mark it as a serial node and record its type and range information.
Through the three static analyses above, the following information is obtained: function call relationship information, variable definition information, variable reference information, variable scope information, memory allocation information, memory release information, serial/parallel node type information, serial/parallel node range information, and inter-node relationship information. This information supports the construction of the serial-parallel control flow graph.
Further, the serial-parallel control flow graph in the information collection module is defined as follows:
The serial-parallel control flow graph consists of serial nodes, parallel nodes, and directed edges between nodes. A serial node represents a branch-free code fragment that lies outside the scope of any OpenMP #pragma parallel directive and executes serially; the code fragment of a serial node executes on the CPU, and a serial node is also denoted a SEQ node.
A parallel node represents a code fragment within the scope of an OpenMP #pragma parallel directive that executes in parallel; the code fragment of a parallel node is offloaded to the GPU for execution, and a parallel node is also denoted an OMP node.
Serial and parallel nodes store function call information and variable-related information.
A directed edge between nodes represents the execution order of the code fragments corresponding to the nodes.
Further, the construction of the serial-parallel control flow graph in the information collection module proceeds as follows:
The construction takes the function as the basic unit of processing. For the whole source program, if the serial-parallel control flow graph of a single function can be built, then by combining it with the collected function call relationship information, the serial-parallel control flow graph of the entire source program can be built recursively.
For a single function, based on the collected information, the serial-parallel control flow graph is built in the following steps:
1) use the node type and node range information to create the individual, initially isolated serial and parallel nodes;
2) use the inter-node relationship information to create directed edges between nodes, connecting the serial and parallel nodes into a graph;
3) according to the node range information, store the function call information, variable definition information, variable reference information, variable scope information, memory allocation information, and memory release information in the corresponding serial or parallel node.
Further, complex pointer data is divided into three types according to where the pointer is located:
class-nested pointers, vector containers, and multi-level nested pointers.
A class-nested pointer is a pointer contained in a class; a vector container is the vector container provided by the C++ standard library; a multi-level nested pointer is a pointer of two or more levels of indirection.
Further, class-nested pointers are processed in the following steps:
(1) recursively traverse the serial-parallel control flow graph built above and, from the variable definition and reference information stored in the serial/parallel nodes, find the class-nested pointers that are defined in a serial node and referenced in a parallel node, together with the C++ classes that contain them;
(2) for each C++ class found in step (1), modify the class's type definition in the source code so that it inherits from the UMMaper base class provided by the runtime;
(3) for each class-nested pointer found in step (1), modify its memory allocation and release statements in the source code to allocate with cudaMallocManaged() and release with cudaFree();
(4) for each C++ class instance found in step (1), modify the instance's definition statement in the source code to create the instance with the overloaded new operator, and pass the address of the memory allocated in step (3) to the corresponding nested pointer in the class instance.
Further, vector containers are processed in the following steps:
(1) recursively traverse the serial-parallel control flow graph built above and, from the variable definition and reference information stored in the serial/parallel nodes, find the vector containers that are defined in a serial node and referenced in a parallel node;
(2) for each vector container instance found in step (1), modify its definition statement to insert an explicit use of the custom allocator provided by the runtime.
Further, multi-level nested pointers are processed in the following steps:
(1) recursively traverse the serial-parallel control flow graph built above and, from the variable definition and reference information stored in the serial/parallel nodes, find the multi-level nested pointers that are defined in a serial node and referenced in a parallel node, together with their sub-pointers at every level;
(2) for all multi-level nested pointers and sub-pointers found in step (1), modify their memory allocation and release statements in the source code to allocate with cudaMallocManaged() and release with cudaFree().
Further, the UMMaper class of the runtime module is implemented as follows: to handle class-nested pointers, the UMMaper class is designed to manage the memory allocation and release of C++ classes. In the UMMaper class, the default C++ new and delete operators are overloaded using cudaMallocManaged() and cudaFree(); the allocation, release, and data-transfer operations of the memory of any class derived from UMMaper, on both the CPU and the GPU, are then managed automatically by Unified Memory.
Further, the custom allocator class of the runtime module is implemented as follows: to handle vector containers, a custom allocator class is designed to manage their memory allocation and release. By default the vector container uses the allocator of the C++ standard library to manage memory allocation and release, so a custom allocator class can be implemented on top of Unified Memory, thereby automating the management of vector memory. In the custom allocator class, the _allocate() function is implemented with cudaMallocManaged() and the _deallocate() function with cudaFree(); by explicitly naming the custom allocator in the vector declaration, the allocation, release, and data-transfer operations of the vector's memory on the CPU and GPU are managed automatically by Unified Memory.
The present invention has the following beneficial technical effects:
On the basis of an OpenMP CPU program, the system of the present invention can automatically rewrite memory allocation and release statements, automatically insert the relevant pragma directives, and automatically manage data transfer between the CPU and the accelerator, thereby guaranteeing the data consistency of programs containing complex pointer data structures on heterogeneous computing platforms and improving program performance.
The heterogeneous programming scheme studied in the present invention targets the automatic management of complex pointer data structures in OpenMP programs; specifically, it automatically rewrites memory allocation and release statements, automatically inserts the relevant pragma directives, and automatically manages data transfer between the CPU and the accelerator, thereby guaranteeing the data consistency of programs on heterogeneous computing platforms and improving program performance.
Because OpenMP Offloading supports accelerator programming, OpenMP code running on the CPU can be offloaded to the GPU for execution, and the objective conditions for such offloading are in place. Under these conditions, offloading OpenMP code to the GPU can both improve the running efficiency of the program and fully exploit the acceleration offered by the GPU. However, performing the offloading by hand cannot guarantee the correctness of the converted program, while also consuming considerable manpower and material resources at low efficiency. The present invention therefore proposes an automatic management scheme for complex pointer data structures to solve the most difficult problems in the offloading process: the memory allocation and data transfer of complex pointer data structures between the CPU and the accelerator.
Comparative experiments on general-purpose test suites (PolyBench, Rodinia, etc.) demonstrate that the present invention can automatically manage complex pointer data structures in OpenMP Offloading programs, guarantee program correctness, and improve program performance.
The complex pointer data structures described in the present invention refer to complex pointer data types.
Brief Description of the Drawings
Figure 1 is the overall framework diagram of the system of the present invention; Figure 2 is a structural block diagram of the AST analysis method; Figure 3 is a serial-parallel control flow graph of a function; Figure 4 is a schematic diagram of Unified Memory; Figure 5 is a schematic diagram of the automatic offloading of class-nested pointers (comparison of programs with and without Unified Memory); Figure 6 is a schematic diagram of the automatic offloading of vector containers (comparison of programs with and without the space allocator); Figure 7 is a schematic diagram of the automatic offloading of multi-level nested pointers; Figure 8 is the overall framework diagram of the experimental design of the present invention;
Figure 9 is a histogram of running times on the 2080Ti platform with the Large data size, in which Figure 9(a) is the Large-2080Ti running-time comparison (data set a) and Figure 9(b) is the Large-2080Ti running-time comparison (data set b); Figures 9(a) and 9(b) are in essence one figure, split in two because of the number of data sets;
Figure 10 compares the running times of the Rodinia test suite on the 2080Ti; Figure 11 compares the speedups of Polybench-OAO for different data sizes (2080Ti); Figure 12 compares Polybench speedups (2080Ti); Figure 13 shows the runtime overhead (2080Ti); Figure 14 is a detailed test chart for complex pointer data structures (K40);
Figure 15 is a program conversion example containing vector; Figure 16 is a program conversion example containing structnest; Figure 17 is a program conversion example containing multilevel. In Figures 15 to 17, vector is the C++ vector container, structnest is a class-nested pointer, and multilevel is a multi-level nested pointer.
Detailed Description
With reference to Figures 1 to 17, the heterogeneous platform-oriented automatic management system for complex pointer data structures of the present invention is described as follows:
The main task of the present invention is to automatically manage the complex pointer data structures (class-nested pointers, vector containers, multi-level nested pointers, etc.) in OpenMP Offloading programs, that is, to automatically rewrite allocation and release statements and automatically manage data transfer, guaranteeing data consistency. The present invention mainly comprises the following three modules:
Information collection module. This module has two functions: 1) statically analyze the source program to collect program information; 2) based on the collected information, build an abstract representation of the source program, namely a serial-parallel control flow graph. The module works in two steps: 1) using the Clang compiler, generate the abstract syntax tree (AST) and control flow graph (CFG) for the C or C++ source code, traverse the AST and CFG, distinguish serial and parallel regions, and obtain detailed program information; 2) generate the serial-parallel control flow graph from this information.
Automatic conversion module, mainly responsible for inserting runtime APIs into the source code based on the serial-parallel control flow graph, thereby completing the code conversion. First, the type of each complex pointer variable is determined from the complex pointer variable information stored in the serial-parallel control flow graph. Then, according to the type, the appropriate runtime API is inserted at the appropriate position in the source code to complete the conversion. In this way, the memory allocation, release, and data-transfer operations involving complex pointer variables are all taken over by the runtime, so that complex pointer variables can be managed automatically and data consistency between the CPU and the GPU is guaranteed.
Runtime module, mainly responsible for implementing the following operations on complex pointer data structures on top of Unified Memory: memory allocation and release on the CPU and GPU, and automatic data transfer between the CPU and GPU. The default C++ interfaces new, delete, and allocator are re-implemented using cudaMallocManaged() and cudaFree(), and memory allocation and release are exposed to the outside in the form of API interfaces.
As a result, the source program is converted into a new program carrying runtime APIs; the overall framework of the system is shown in Figure 1.
1 Information Collection Module
The information collection module implements two functions: statically analyzing the program and collecting program information, and building the serial-parallel control flow graph.
1.1 Information Collection
First, the Clang compiler performs lexical and semantic analysis on the target program and generates the abstract syntax tree (AST) and control flow graph (CFG).
Then the following three static analyses are performed on the AST and CFG to obtain function call relationship information, variable-related information, and serial/parallel region information, providing support for the subsequent construction of the serial-parallel control flow graph and for code conversion.
Analysis 1, function call relationship analysis: for an AST, perform the following two steps:
1) Recursively traverse each node of the AST;
2) If the current node is a function definition node, record the function name and the sub-functions it calls.
Analysis 2, variable information analysis: for an AST, perform the following three steps:
1) Recursively traverse each node of the AST;
2) If the current node involves a variable definition or reference, record the variable definition information, variable reference information, and variable scope information;
3) If the current node involves memory allocation or release, record the memory allocation information and memory release information.
Analysis 3, serial/parallel region analysis: for a CFG, perform the following three steps:
1) Recursively traverse each node of the CFG and record the relationship information between nodes;
2) If the current node lies within the scope of an OpenMP #pragma parallel directive, mark it as a parallel node and record its type information and range information;
3) If the current node does not lie within the scope of an OpenMP #pragma parallel directive, mark it as a serial node and record its type information and range information.
These three static analyses yield the following information: function call relationships, variable definitions, variable references, variable scopes, memory allocations, memory releases, serial/parallel node types, serial/parallel node ranges, and inter-node relationships. This information supports the construction of the serial-parallel control flow graph.
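As a minimal sketch of how such a traversal can collect call relationships (Analysis 1), the following uses a toy AST rather than Clang's actual classes; the node kinds and field names are illustrative assumptions, not the patent's implementation:

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

// Toy AST: just enough structure to illustrate Analysis 1 (function call
// relationships). These are not Clang's AST classes.
struct AstNode {
    enum Kind { FunctionDef, Call, Other } kind = Other;
    std::string name;                               // function name for FunctionDef/Call
    std::vector<std::unique_ptr<AstNode>> children;
};

// Recursively traverse the AST; for each function definition node, record
// the function name and every sub-function called inside its body.
void collectCalls(const AstNode& n, const std::string& current,
                  std::map<std::string, std::vector<std::string>>& callGraph) {
    std::string next = current;
    if (n.kind == AstNode::FunctionDef) {
        callGraph[n.name];                          // register even if it calls nothing
        next = n.name;
    } else if (n.kind == AstNode::Call && !current.empty()) {
        callGraph[current].push_back(n.name);       // record callee under current function
    }
    for (const auto& c : n.children) collectCalls(*c, next, callGraph);
}
```
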
1.2 Building the serial-parallel control flow graph
Definition: a serial-parallel control flow graph consists of serial nodes, parallel nodes, and directed edges between the nodes. A serial node represents a branch-free, serially executed code fragment outside the scope of any OpenMP #pragma parallel directive; the code fragment corresponding to a serial node executes on the CPU, and a serial node is also written as a SEQ node.
A parallel node represents a code fragment within the scope of an OpenMP #pragma parallel directive that is executed in parallel; the corresponding code fragment is offloaded to the GPU for execution, and a parallel node is also written as an OMP node. Serial and parallel nodes store function call information and variable-related information. The directed edges between nodes represent the order in which the corresponding code fragments execute.
The serial-parallel control flow graph is built with the function as the basic processing unit: for the entire source program, if the graph of one function can be built, then, combined with the function call relationship information, the serial-parallel control flow graph of the whole source program can be built recursively.
For a function, based on the collected program information, the serial-parallel control flow graph is built in the following steps:
1) Use the node type and node range information to create isolated serial and parallel nodes;
2) Use the inter-node relationship information to create directed edges between nodes, connecting the serial and parallel nodes into a graph;
3) According to the node range information, attach the function call, variable definition, variable reference, variable scope, memory allocation, and memory release information to the corresponding serial or parallel nodes.
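The three steps above can be sketched as follows; NodeInfo, the edge list, and the (line, variable) reference pairs are assumed data layouts for illustration, not the patent's actual structures:

```cpp
#include <string>
#include <utility>
#include <vector>

// Assumed input layout: one entry per serial/parallel region found by the
// static analyses.
struct NodeInfo { bool parallel; int beginLine, endLine; };

struct GraphNode {
    bool parallel;                      // OMP node if true, SEQ node otherwise
    int beginLine, endLine;             // source range covered by this node
    std::vector<int> successors;        // directed edges (indices into the graph)
    std::vector<std::string> varRefs;   // variable info attached in step 3
};

std::vector<GraphNode> buildGraph(
        const std::vector<NodeInfo>& nodes,
        const std::vector<std::pair<int, int>>& edges,
        const std::vector<std::pair<int, std::string>>& refs /* (line, var) */) {
    std::vector<GraphNode> g;
    for (const auto& n : nodes)                 // step 1: isolated SEQ/OMP nodes
        g.push_back({n.parallel, n.beginLine, n.endLine, {}, {}});
    for (const auto& e : edges)                 // step 2: directed edges
        g[e.first].successors.push_back(e.second);
    for (const auto& r : refs)                  // step 3: attach info by node range
        for (auto& node : g)
            if (r.first >= node.beginLine && r.first <= node.endLine)
                node.varRefs.push_back(r.second);
    return g;
}
```
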
2 Automatic Conversion Module
The automatic conversion module divides complex pointer data types into three categories: class-nested pointers, vector containers, and multi-level nested pointers.
The key technology for handling complex pointer data structures is Unified Memory (UM), so the concept and principle of unified memory are introduced first. UM maintains a single memory pool shared between the CPU and the GPU; memory is allocated only once, and the resulting data pointer is usable on both the host side (CPU) and the device side (GPU). Because a single pointer refers to the managed memory, data transfers between devices are performed automatically by the UM runtime system, which also lets the GPU process data sets that exceed its memory capacity. The principle of unified memory is shown in Figure 4.
Based on unified memory, a different handling method is designed for each of the three complex pointer data types.
2.1 Class-nested pointers
A class-nested pointer is a pointer contained inside a class, as shown in Figure 5. It is handled in the following steps:
① Recursively traverse the serial-parallel control flow graph and, using the variable definition and reference information stored in the serial/parallel nodes, find the class-nested pointers that are defined in a serial node and referenced in a parallel node, together with the classes that contain them.
② For each class type found in step ①, modify the class's type definition in the source code so that it derives from the UMMapper base class provided by the runtime (see the runtime design);
③ For each class-nested pointer found in step ①, modify the pointer's memory allocation and release statements in the source code to allocate with cudaMallocManaged() and release with cudaFree();
④ For each class instance found in step ①, modify the instance's definition statement in the source code to create the instance with the overloaded new operator, and pass the address of the memory allocated in step ③ to the corresponding nested pointer in the class.
An example of handling class-nested pointers is shown in Figure 5.
2.2 vector containers
A vector container is the vector container provided by the C++ standard library. vector containers are handled in the following steps:
① Recursively traverse the serial-parallel control flow graph and, using the variable definition and reference information stored in the serial/parallel nodes, find the vector containers that are defined in a serial node and referenced in a parallel node.
② For each vector container instance found in step ①, modify the instance's definition statement to insert an explicit use of the custom allocator provided by the runtime (see the runtime design).
An example of handling vector containers is shown in Figure 6.
2.3 Multi-level nested pointers
A multi-level nested pointer is a two-level pointer or a pointer with more than two levels of indirection. Multi-level nested pointers are handled in the following steps:
① Recursively traverse the serial-parallel control flow graph and, using the variable definition and reference information stored in the serial/parallel nodes, find the multi-level nested pointers that are defined in a serial node and referenced in a parallel node, together with their sub-pointers at every level;
② For all the multi-level nested pointers and sub-pointers found in step ①, modify their memory allocation and release statements in the source code to allocate with cudaMallocManaged() and release with cudaFree().
An example of handling multi-level nested pointers is shown in Figure 7.
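A sketch of step ② for a two-level pointer: every level, including each sub-pointer, is allocated from managed memory so the same addresses remain valid on both host and device. Here managedAlloc()/managedFree() are malloc/free stand-ins for cudaMallocManaged()/cudaFree(), so the sketch runs without a GPU; the function names are illustrative:

```cpp
#include <cstddef>
#include <cstdlib>

// Stand-ins: on a CUDA platform these would be cudaMallocManaged()/cudaFree().
void* managedAlloc(std::size_t bytes) { return std::malloc(bytes); }
void  managedFree(void* p)            { std::free(p); }

// Allocate a two-level pointer (int**): the top-level pointer array and
// every sub-pointer's storage all come from managed memory.
int** allocMatrix(int rows, int cols) {
    int** x = static_cast<int**>(managedAlloc(rows * sizeof(int*)));
    for (int i = 0; i < rows; ++i)          // each sub-pointer at the next level
        x[i] = static_cast<int*>(managedAlloc(cols * sizeof(int)));
    return x;
}

// Release in reverse order: sub-pointers first, then the top level.
void freeMatrix(int** x, int rows) {
    for (int i = 0; i < rows; ++i) managedFree(x[i]);
    managedFree(x);
}
```
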
3 Runtime
To keep the converted code close to the format of the source code and minimize the modifications to it, this invention proposes inserting a small number of runtime API calls into the original program and encapsulating all unified-memory-based memory allocation inside the runtime.
For classes and their nested pointers, the runtime implements the UMMapper base class, which is responsible for memory allocation and release, as shown in Figure 5. Inside UMMapper, the default C++ new and delete operators are overloaded using cudaMallocManaged() and cudaFree(). A class then only needs to derive from UMMapper at declaration time and create instances with new; the memory allocation, release, and data transfer operations of the derived class are managed automatically through unified memory.
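A minimal sketch of what such a UMMapper base class could look like; malloc/free stand in for cudaMallocManaged()/cudaFree() so the sketch compiles and runs without CUDA, and the Particle class is an invented example:

```cpp
#include <cstddef>
#include <cstdlib>

// Base class that routes allocation of derived-class instances through
// managed memory by overloading operator new/delete. On a CUDA platform the
// bodies would call cudaMallocManaged()/cudaFree().
struct UMMapper {
    void* operator new(std::size_t n)   { return std::malloc(n); }  // cudaMallocManaged(&p, n)
    void  operator delete(void* p)      { std::free(p); }           // cudaFree(p)
    void* operator new[](std::size_t n) { return std::malloc(n); }
    void  operator delete[](void* p)    { std::free(p); }
};

// Example user class (invented): deriving from UMMapper means `new Particle()`
// places the instance in managed memory, so its nested pointer fields, when
// also allocated from managed memory, stay valid on both CPU and GPU.
struct Particle : UMMapper {
    float* pos = nullptr;   // class-nested pointer, allocated separately
    int n = 0;
};
```
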
The vector container uses the allocator of the C++ standard library to manage memory allocation and release, so a custom allocator class can be implemented on top of unified memory to manage the allocation and release of vector memory automatically. In the custom allocator class, the allocate() function is implemented with cudaMallocManaged() and the deallocate() function with cudaFree(). Explicitly instantiating the vector container with this custom allocator at declaration time lets unified memory automatically manage the allocation, release, and data transfer of the vector's memory on the CPU and GPU, as shown in Figure 6.
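A minimal sketch of such a custom allocator; the class name UMAllocator is an assumption, and malloc/free stand in for cudaMallocManaged()/cudaFree() so it runs without a GPU:

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Minimal C++11-style allocator whose allocate()/deallocate() would, on a
// CUDA platform, call cudaMallocManaged()/cudaFree() so vector storage lives
// in unified memory.
template <class T>
struct UMAllocator {
    using value_type = T;
    UMAllocator() = default;
    template <class U> UMAllocator(const UMAllocator<U>&) {}

    T* allocate(std::size_t n) {              // cudaMallocManaged(&p, n * sizeof(T))
        return static_cast<T*>(std::malloc(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) {      // cudaFree(p)
        std::free(p);
    }
};
template <class T, class U>
bool operator==(const UMAllocator<T>&, const UMAllocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const UMAllocator<T>&, const UMAllocator<U>&) { return false; }
```

A converted declaration would then read `std::vector<float, UMAllocator<float>> v;` in place of `std::vector<float> v;`.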
The technical effect of the invention is verified as follows:
To verify and analyze the effect of the code offloading scheme, we ran tests on the RTX 2080Ti platform using the PolyBench and Rodinia benchmark suites, comparing the results against another well-known source-to-source compiler (DawnCC), against manual conversion (Manual), and against CPU-parallel code. Our scheme is named OAO (OpenMP Automated Offloading with complex data structure support).
Obtaining the experimental results involves four stages: integrating the runtime, writing and running test scripts, extracting data with Python, and analyzing it with OriginLab. The overall experimental process is shown in Figure 8.
1 Running time
The running times on the 2080Ti platform with the Large data size are shown in Figure 9.
Figure 10 compares the running times of the Rodinia test suite on the 2080Ti.
From the collection and comparison of these running times, the following conclusions can be drawn:
(1) The running time graphs show that OAO handles all 23 test programs, whereas DawnCC handles only 15. OAO improves performance on 9 programs on the K40 platform and on 15 programs on the 2080Ti platform, and OAO outperforms manual conversion on all test programs on all platforms.
(2) In Rodinia-2080Ti, with the Rodinia data sizes determined by tests run before the experiment, OAO's running time is shorter than OMP's on 5 of the 8 data sets.
(3) In PolyBench-Large, OAO's running time beats DawnCC, Manual, and native; overall the running times follow the trend OAO < DawnCC < Manual < native. Native is slowest because of the transfer statements it selects and its lack of parallel optimization; Manual includes redundant transfers of some data, so it runs longer than OAO; and OAO also generally outperforms DawnCC in data set handling.
2 Speedup
As Figure 11 shows, among speedups greater than 1, the Large data size yields a higher speedup than Medium. This matches the nature of parallel acceleration: the larger the amount of data processed, the more pronounced the effect of parallel speedup.
The speedup is computed as the ratio OMP/Type, where Type is one of the other versions; results are plotted against the 1X line, with positive speedup above it and slowdown below it. Figure 12 shows that, on every platform, OAO's speedup is better than Manual's on all test programs. On the PolyBench programs OAO is slightly better than DawnCC, and since DawnCC cannot handle the Rodinia programs at all, OAO delivers the best data transfer optimization on all test programs, handles complex test programs that DawnCC cannot, and has a wider range of applicability.
3 Runtime overhead
The runtime overhead on the PolyBench test suite is shown in Figure 13.
For all test programs, the runtime overhead is more than two orders of magnitude smaller than the running time, so the extra overhead OAO introduces at run time is negligible: adding the runtime costs the source program essentially no performance.
4 Unified memory
The detailed unified memory tests cover the vector container, class-nested pointers (structnest), and multi-level nested pointers (multilevel). Figure 14 compares the running times and data transfer times of the complex pointer data structures.
In terms of data transfer time, the H2D and D2H times are far smaller than the total running time, and all complex pointer data structures are handled correctly, so the OAO compiler correctly supports GPU offloading of complex pointer data structures.
Figures 15, 16, and 17 compare the simple conversion programs for the vector container, structnest, and multilevel, respectively.
As shown in Figure 15, the vector variable's declaration is rewritten to use the QY::allocator overload. In Figure 16, the pointers x, y, and z nested inside the class are allocated with cudaMallocManaged at declaration time, while the instantiated class a is marked by a transfer statement, adding map(tofrom:a[0:1]). In Figure 17, x is a two-level pointer and y is a first-level sub-pointer of x; at declaration time both x and y are allocated with cudaMallocManaged.
Table 1 Unified memory test results
The unified memory test results are shown in Table 1. During data transfer, one preliminary transfer is made before the formal transfers: 4 B for H2D and 1 B for D2H (the 4+1 pattern). This is why, in the table, the bidirectional transfer counts of vector (52 vs. 49) and structnest (36 vs. 33) each differ by 3. Beyond that, vector performs only two transfers, X and Y, of 24 B each, and structnest performs one transfer, A, of 32 B.
In multilevel, z is transferred one way using the data transfer model, while x and y are multi-level nested pointers managed through unified memory. Apart from the 4+1 pattern of the preliminary transfer, the difference in transfer volume between H2D and D2H arises because z is not transferred back, whereas x is.
The experimental results demonstrate that OAO can use unified memory to handle complex pointer data structures correctly, realizing support for complex pointer data structure processing.
Claims (10)

Application CN202010971038.9A, filed 2020-09-15; published as CN112083956A on 2020-12-15; granted as CN112083956B on 2022-12-09.