JP7521597B2

JP7521597B2 - OFFLOAD SERVER, OFFLOAD CONTROL METHOD, AND OFFLOAD PROGRAM

Info

Publication number: JP7521597B2
Application number: JP2022560579A
Authority: JP
Inventors: 庸次山登
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2020-11-05
Filing date: 2020-11-05
Publication date: 2024-07-24
Anticipated expiration: 2040-11-05
Also published as: JPWO2022097245A1; WO2022097245A1

Description

The present invention relates to an offload server, an offload control method, and an offload program that automatically offload functional processing to an accelerator such as a GPU (Graphics Processing Unit).

In recent years, it has been suggested that Moore's Law, under which the semiconductor integration density of CPUs doubles every 1.5 years, may be slowing down. Against this background, devices such as FPGAs (Field Programmable Gate Arrays) and GPUs (Graphics Processing Units) are increasingly used alongside few-core CPUs. For example, Microsoft (registered trademark) uses FPGAs to improve the search efficiency of Bing, and Amazon (registered trademark) offers FPGAs, GPUs, and similar devices as cloud instances.

To make proper use of devices other than few-core CPUs in a system, configuration and programming must take the device characteristics into account. This requires knowledge of technologies such as OpenMP (Open Multi-Processing), OpenCL (Open Computing Language), and CUDA (Compute Unified Device Architecture), which is a high skill barrier for most programmers.

Systems that use devices other than few-core CPUs, such as GPUs, FPGAs, and many-core CPUs, are expected to become more common, but the technical barriers to exploiting them fully are high. To remove these barriers, there is a demand for a platform that takes software in which a programmer has described the processing logic, adaptively converts and configures it for the environment in which it is deployed (FPGA, GPU, many-core CPU, etc.), and makes it operate in a way suited to that environment.

Non-Patent Document 1 describes environment-adaptive software, which aims to run applications with high performance by automatically performing conversion, resource configuration, and related steps so that code written once can use the GPUs, FPGAs, many-core CPUs, and other devices present in the deployment environment.

Non-Patent Documents 2, 3, and 4 describe, as elements of environment-adaptive software, methods for automatically offloading loop statements and function blocks of application code to FPGAs and GPUs.

CUDA has become widespread as an environment for GPGPU (General Purpose GPU), which applies the parallel computing power of GPUs to tasks other than image processing. CUDA is NVIDIA's (registered trademark) environment for GPGPU. OpenCL is a specification for handling heterogeneous devices such as FPGAs, many-core CPUs, and GPUs in a uniform way, and its development environments are maturing. Both CUDA and OpenCL are programmed through extensions of the C language, and programming with them is difficult: for example, the copying and release of memory data between a kernel on a device such as an FPGA and the host CPU must be written explicitly.

Technologies that make heterogeneous devices easier to use than CUDA or OpenCL include OpenACC and OpenMP, with compilers such as the PGI compiler and gcc (registered trademark). With these compilers, the locations to be processed in parallel are marked with directive lines, and the compiler generates an executable file for a GPU, many-core CPU, or similar device according to those directives.

Technical specifications such as CUDA, OpenCL, OpenACC, and OpenMP make it possible to offload to FPGAs, GPUs, and many-core CPUs. However, even when processing on the device itself is possible, achieving a speedup remains a challenge. For example, the Intel compiler (registered trademark) and similar compilers provide automatic parallelization for multi-core CPUs. During automatic parallelization, they extract the parts of the loop statements in the code that can be processed in parallel and parallelize them. Because of memory processing and other effects, however, simply parallelizing every parallelizable loop statement often fails to deliver performance. To achieve a speedup on FPGAs, GPUs, and similar devices, OpenCL or CUDA engineers repeat tuning, or use an OpenACC compiler or the like to search for and try out an appropriate range for parallel processing.

For this reason, it is difficult for programmers with limited technical skills to speed up applications using FPGAs, GPUs, or many-core CPUs, and even with automatic parallelization technology, effort such as trial and error to find the locations to process in parallel has been required. At present, offloading to heterogeneous devices is mainly done by hand.

Non-Patent Document 1: Y. Yamato, H. Noguchi, M. Kataoka and T. Isoda, “Proposal of Environment Adaptive Software,” The 2nd International Conference on Control and Computer Vision (ICCCV 2019), pp.102-108, Jeju, June 2019.
Non-Patent Document 2: Y. Yamato, “Study of parallel processing area extraction and data transfer number reduction for automatic GPU offloading of IoT applications,” Journal of Intelligent Information Systems, Springer, DOI: 10.1007/s10844-019-00575-8, Aug. 2019.
Non-Patent Document 3: Y. Yamato, “Proposal of Automatic FPGA Offloading for Applications Loop Statements,” The 7th Annual Conference on Engineering and Information Technology (ACEAIT 2020), pp.111-123, 2020.
Non-Patent Document 4: Y. Yamato, “Proposal of Automatic Offloading for Function Blocks of Applications,” The 8th IIAE International Conference on Industrial Application Engineering 2020 (ICIAE 2020), pp.4-11, Mar. 2020.

The techniques described in Non-Patent Documents 1 to 4 offload C language programs to a GPU or FPGA and do not assume diverse source languages such as Python (registered trademark) and Java (registered trademark).
There is a demand for automatic offloading of application programs even when the source languages are diverse, covering not only the C language but also Python and Java.

The present invention was made in view of these points. Its object is to eliminate the need to design and implement processing separately for each source language, and to offload application programs automatically even when the source languages are diverse.

To solve the above problem, the present invention provides an offload server that offloads specific processing of an application program to an accelerator, wherein the application program is a Python application program, and the offload server comprises:
an application code analysis unit that analyzes the source code of the Python application program using a syntax analysis tool for Python;
a data transfer designation unit that analyzes the reference relationships of variables used in loop statements of the Python application program and, for data that may be transferred outside a loop, designates the data transfer using an explicit designation line that explicitly specifies data transfer outside the loop;
a parallel processing designation unit that identifies the loop statements of the Python application program and, for each identified loop statement, when Python code to which CUDA instructions have been added is interpreted with pyCUDA, designates GPU processing using CUDA grammar for the interpretation;
a parallel processing pattern creation unit that excludes loop statements that cause a compile error from the offload targets and creates parallel processing patterns that specify, for each loop statement that does not cause a compile error, whether or not it is processed in parallel;
a performance measurement unit that compiles the Python application program of each parallel processing pattern, deploys it on an accelerator verification device, and executes processing for measuring the performance obtained when offloading to the accelerator; and
an executable file creation unit that, based on the performance measurement results, selects a plurality of high-performance parallel processing patterns from the plurality of parallel processing patterns, creates a plurality of further parallel processing patterns from them by crossover and mutation, and repeats the process up to performance measurement; that, after a designated number of performance measurements, selects the parallel processing pattern with the highest processing performance based on the performance measurement results; that, after GA processing for a designated number of generations has finished, takes as the solution the highest-performance parallel processing pattern, which contains the Python application code corresponding to the highest-performance gene sequence; and that compiles this highest-performance parallel processing pattern to create an executable file.
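The pattern creation step described above can be illustrated with a short Python sketch. It is a minimal illustration and not the claimed implementation: `compiles_ok` is a hypothetical stand-in for the trial compilation of a single loop statement, the loop identifiers are made up, and the 0/1 encoding (1 = process on the GPU, 0 = leave on the CPU) is an assumed gene representation.

```python
import random


def offloadable_loops(loop_ids, compiles_ok):
    """Keep only the loops whose trial compilation succeeds (offload candidates)."""
    return [loop_id for loop_id in loop_ids if compiles_ok(loop_id)]


def random_pattern(candidates, rng):
    """Create one parallel processing pattern: a 0/1 gene per candidate loop."""
    return {loop_id: rng.randint(0, 1) for loop_id in candidates}


# Hypothetical input: five loop statements, of which loops 1 and 3 fail to compile.
failing = {1, 3}
candidates = offloadable_loops(range(5), lambda loop_id: loop_id not in failing)
pattern = random_pattern(candidates, random.Random(0))
print(candidates, pattern)
```

Loops that fail the trial compilation never receive a gene, so the gene length equals the number of remaining candidate loops.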

According to the present invention, there is no need to design or implement processing separately for each source language, and application programs can be offloaded automatically even when the source languages are diverse.

FIG. 1 is a functional block diagram showing a configuration example of the offload server according to the first embodiment of the present invention.
FIG. 2 is a diagram showing automatic offload processing using a GA in the offload server according to the first embodiment.
FIG. 3 is a diagram showing a conceptual image of the search performed with Simple GA by the control unit (automatic offload function unit) of the offload server according to the first embodiment, and the gene sequence mapping of for statements.
FIG. 4 is a diagram showing loop statements (repetition statements) in the source code of an application program processed by the automatic offload function unit of the offload server according to the first embodiment.
FIG. 5A is a flowchart of <<Loop statement offload: common>> of the offload server according to the first embodiment.
FIG. 5B is a flowchart of <<Loop statement offload: common>> of the offload server according to the first embodiment.
FIG. 6A is a flowchart of <<Loop statement offload: C language>> of the offload server according to the first embodiment.
FIG. 6B is a flowchart of <<Loop statement offload: C language>> of the offload server according to the first embodiment.
FIG. 7A is a flowchart of <<Loop statement offload: Python>> using the method of interpreting Python code with pyCUDA in the offload server according to the first embodiment.
FIG. 7B is a flowchart of <<Loop statement offload: Python>> using the method of interpreting Python code with pyCUDA in the offload server according to the first embodiment.
FIG. 8A is a diagram showing a for statement when pyACC is used in the offload server according to the first embodiment.
FIG. 8B is a diagram showing the code pattern created from the for statement of FIG. 8A.
FIG. 9A is a flowchart of <<Loop statement offload: Python>> using the pyACC method in the offload server according to the first embodiment.
FIG. 9B is a flowchart of <<Loop statement offload: Python>> using the pyACC method in the offload server according to the first embodiment.
FIG. 10A is a diagram showing a for statement when the IBM JDK is used in the offload server according to the first embodiment.
FIG. 10B is a diagram showing the code pattern created from the for statement of FIG. 8A.
FIG. 11A is a flowchart of <<Loop statement offload: Java>> using the pyACC method in the offload server according to the first embodiment.
FIG. 11B is a flowchart of <<Loop statement offload: Java>> using the pyACC method in the offload server according to the first embodiment.
FIG. 12 is a functional block diagram showing a configuration example of the offload server according to the second embodiment of the present invention.
FIG. 13 is a diagram showing function block offload processing in the offload server according to the second embodiment.
FIG. 14 is a flowchart for the case where the control unit (automatic offload function unit) of the offload server according to the second embodiment executes <Process A-1>, <Process B-1>, and <Process C-1> in the offload processing of <<Function block offload: common>>.
FIG. 15 is a flowchart for the case where the control unit (automatic offload function unit) of the offload server according to the second embodiment executes <Process A-2>, <Process B-2>, and <Process C-2> in the function block offload processing.
FIG. 16 is a flowchart for the case where the control unit (automatic offload function unit) of the offload server according to the second embodiment executes <Process A-1>, <Process B-1>, and <Process C-1> in the offload processing of <<Function block offload: C language>>.
FIG. 17 is a flowchart for the case where the control unit (automatic offload function unit) of the offload server according to the second embodiment executes <Process A-2>, <Process B-2>, and <Process C-2> in the function block offload processing.
FIG. 18 is a flowchart for the case where the control unit (automatic offload function unit) of the offload server according to the second embodiment executes <Process A-1>, <Process B-1>, and <Process C-1> in the offload processing of <<Function block: Python>>.
FIG. 19 is a flowchart for the case where the control unit (automatic offload function unit) of the offload server according to the second embodiment executes <Process A-2>, <Process B-2>, and <Process C-2> in the function block offload processing.
FIG. 20 is a flowchart for the case where the control unit (automatic offload function unit) of the offload server according to the second embodiment executes <Process A-1>, <Process B-1>, and <Process C-1> in the offload processing of <<Function block: Java>>.
FIG. 21 is a flowchart for the case where the control unit (automatic offload function unit) of the offload server according to the second embodiment executes <Process A-2>, <Process B-2>, and <Process C-2> in the function block offload processing.
A further drawing is a hardware configuration diagram showing an example of a computer that realizes the functions of the offload server according to the embodiments of the present invention.

Next, an offload server and related elements in a mode for carrying out the present invention will be described.
In the following description, examples are given that assume three destination environments: a GPU, an FPGA, and a many-core CPU. The present invention is applicable to programmable logic devices in general.

(Basic approach to supporting diverse source languages)
・Source languages
This embodiment targets three source languages: the C language, Python, and Java. These three languages are the top three in the programming language popularity ranking published monthly by TIOBE (registered trademark) and have large programmer populations. They also cover a range of execution models: C is compiled, Python is interpreted, and Java takes an intermediate approach. A method that works in common for these three is therefore expected to be easy to extend to more languages.

In this embodiment, to offload programs in diverse source languages automatically and with high performance when the destination environment is not a plain CPU, performance is measured on real machines in a verification environment and combined with techniques such as evolutionary computation, so that progressively faster offload patterns are found. The reason is that performance depends heavily not only on the code structure but also on the specifications of the hardware that performs the processing, the compiler or interpreter, and processing details such as data size and loop counts. Static prediction is therefore difficult, and dynamic measurement is required. Automatic parallelizing compilers that find loop statements and parallelize them at compile time are in fact commercially available, but when performance is measured, merely parallelizing the parallelizable loop statements often turns out to be slower, so performance measurement is necessary.
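The measure-and-evolve approach described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the procedure of the embodiment: `measure` is a hypothetical stand-in that returns made-up processing times instead of running code on a verification machine, and truncation selection with one-point crossover is a simplification chosen for brevity. The population size, generation count, and mutation rate are arbitrary illustrative values.

```python
import random


def measure(pattern):
    """Hypothetical stand-in for a measurement on the verification machine.

    Returns a made-up processing time in seconds (lower is better): loops 0
    and 2 are assumed to gain from the GPU, loop 1 to lose to transfer cost.
    """
    gain = {0: -3.0, 1: 2.0, 2: -1.5}
    return 10.0 + sum(gain[i] for i, gene in enumerate(pattern) if gene)


def evolve(num_loops, population_size=6, generations=10, mutation_rate=0.1, seed=0):
    """Search 0/1 offload patterns; num_loops must be at least 2."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(num_loops)]
                  for _ in range(population_size)]
    best = min(population, key=measure)
    history = [measure(best)]
    for _ in range(generations):
        # Selection: keep the faster half of the population as parents.
        parents = sorted(population, key=measure)[:population_size // 2]
        children = []
        while len(children) < population_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, num_loops - 1)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Mutation: flip each gene with a small probability.
            child = [g ^ 1 if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
        best = min(population + [best], key=measure)
        history.append(measure(best))
    return best, history


best_pattern, history = evolve(num_loops=3)
print(best_pattern, history)
```

Because the best pattern found so far is carried over between generations, the recorded best time never gets worse from one generation to the next.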

・Offload targets
The offload targets are the loop statements and function blocks of application programs. Loop statements are natural targets because most of the processing time of long-running programs is spent in loops. Function blocks, on the other hand, can sometimes be accelerated far more than by parallelizing individual loop statements, because accelerating a specific kind of processing usually involves an algorithm suited to the processing content and the processing hardware. Frequently used function blocks such as matrix multiplication and Fourier transforms are accelerated by replacing each block with processing implemented using an algorithm suited to the processing device such as a GPU (for example, a CUDA library).

・Destination environments
Three destination environments are assumed: GPU, FPGA, and many-core CPU. Offloading of C language programs in an environment where these are mixed is also disclosed. Because the problem addressed by the present invention is the automatic offloading of applications when the source languages are diverse, the destination environment used for evaluation is not limited. A GPU is used as the example destination environment; once a common method has been confirmed for GPUs, FPGAs and many-core CPUs can be supported by extending it.

・Common GPU offload methods
The common GPU offload methods are divided into "automatic GPU offloading of loop statements" (hereinafter, loop statement offload) and "automatic offloading of function blocks" (hereinafter, function block offload), and the two use different techniques.
In the following description, the first embodiment covers loop statement offload and the second embodiment covers function block offload. For each of the two embodiments, the configuration, the common processing, and the processing for the C language, Python, and Java are described. The table of contents is as follows.

(Table of contents)
・Configuration of the first embodiment ("loop statement offload") (FIG. 1)
Common processing (common across source languages) (FIGS. 2 to 4)
Common flowcharts (FIGS. 5A and 5B)
Flowcharts for the C language (FIGS. 6A and 6B)
Explanatory diagrams for Python (FIGS. 7A, 7B, 8A, and 8B)
Flowcharts for Python (FIGS. 9A and 9B)
Explanatory diagrams for Java (FIGS. 10A and 10B)
Flowcharts for Java (FIGS. 11A and 11B)

・Configuration of the second embodiment ("function block offload") (FIGS. 12 and 13)
Common flowcharts (FIGS. 14 and 15)
Flowcharts for the C language (FIGS. 16 and 17)
Flowcharts for Python (FIGS. 18 and 19)
Flowcharts for Java (FIGS. 20 and 21)

(First embodiment)
The first embodiment describes loop statement offload.
The following describes a configuration example in which the offload server 1 according to the first embodiment performs offload processing in the background while a user uses a service in an environment-adaptive software system.
When the service is provided, it is assumed that on the first day it is offered to the user as a trial or similar, with offload processing for image analysis and the like carried out in the background, and that from the following day the image analysis is offloaded to an FPGA so that a monitoring service can be provided at a reasonable price.

FIG. 1 is a functional block diagram showing a configuration example of the offload server 1 according to the first embodiment of the present invention.
The offload server 1 is a device that automatically offloads specific processing of an application to an accelerator.
As shown in FIG. 1, the offload server 1 includes a control unit 11, an input/output unit 12, a storage unit 13, and a verification machine 14 (accelerator verification device).

The input/output unit 12 consists of a communication interface for exchanging information with the devices belonging to the cloud layer, the network layer, and the device layer, and an input/output interface for exchanging information with input devices such as a touch panel or keyboard and output devices such as a monitor.

The storage unit 13 consists of a hard disk, flash memory, RAM (Random Access Memory), or the like.
The storage unit 13 stores a test case DB (Test case database) 131 and temporarily stores the program (offload program) for executing each function of the control unit 11 and the information needed for the processing of the control unit 11 (for example, an intermediate language file (Intermediate file) 132).

The test case DB 131 stores the data of test items corresponding to the software to be verified. For a database system such as MySQL, for example, the test item data is the data of a transaction test such as TPC-C.

The control unit 11 is an automatic offload function unit (Automatic Offloading function) that controls the offload server 1 as a whole. The control unit 11 is realized, for example, by a CPU (Central Processing Unit), not shown, loading the application program (offload program) stored in the storage unit 13 into RAM and executing it.

The application program includes at least one selected from the C language, Python, and Java.

The control unit 11 includes an application code designation unit (Specify application code) 111, an application code analysis unit (Analyze application code) 112, a data transfer designation unit 113, a parallel processing designation unit 114, a parallel processing pattern creation unit 115, a performance measurement unit 116, an executable file creation unit 117, a production environment deployment unit (Deploy final binary files to production environment) 118, a performance measurement test extraction and execution unit (Extract performance test cases and run automatically) 119, and a user provision unit (Provide price and performance to a user to judge) 120.

<Application code designation unit 111>
The application code designation unit 111 designates the input application code. Specifically, it passes the application code contained in the received file to the application code analysis unit 112.

<Application code analysis unit 112>
The application code analysis unit 112 analyzes the source code of the processing function and identifies structures such as loop statements and FFT library calls.

<Data transfer specification unit 113>
The data transfer specification unit 113 analyzes the reference relationships of the variables used in the loop statements of the application program. For data that may be transferred outside a loop, it specifies the data transfer using an explicit specification line that explicitly specifies data transfer outside the loop.

The data transfer specification unit 113 specifies data transfer using three kinds of explicit specification lines: one that explicitly specifies data transfer from the CPU to the GPU, one that explicitly specifies data transfer from the GPU to the CPU, and, when a CPU-to-GPU transfer and a GPU-to-CPU transfer overlap for the same variable, one that explicitly specifies the round-trip data copy together.
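The choice among the three kinds of explicit specification lines can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the lines are realized as OpenACC `#pragma acc data` directives with the standard `copyin`, `copyout`, and `copy` clauses, and the function names `choose_data_clause` and `data_directive` and the variables `a`, `b`, `c` are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def choose_data_clause(var: str, cpu_to_gpu: bool, gpu_to_cpu: bool) -> Optional[str]:
    """Pick one OpenACC data clause for a variable, or None if no transfer is needed."""
    if cpu_to_gpu and gpu_to_cpu:
        return f"copy({var})"      # both directions overlap: one round-trip line
    if cpu_to_gpu:
        return f"copyin({var})"    # CPU -> GPU only
    if gpu_to_cpu:
        return f"copyout({var})"   # GPU -> CPU only
    return None


def data_directive(transfers: Iterable[Tuple[str, bool, bool]]) -> str:
    """Build one explicit line from (name, cpu_to_gpu, gpu_to_cpu) triples."""
    clauses = [choose_data_clause(*t) for t in transfers]
    clauses = [c for c in clauses if c is not None]
    return "#pragma acc data " + " ".join(clauses) if clauses else ""


# Hypothetical variables: a is only read on the GPU, b only written, c both.
line = data_directive([("a", True, False), ("b", False, True), ("c", True, True)])
```

Merging the two directions into a single `copy` clause mirrors the "round trip specified together" case in the text.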

When a variable defined on the CPU program side overlaps with a variable referenced on the GPU program side, the data transfer specification unit 113 instructs a data transfer from the CPU to the GPU. The transfer is specified at the GPU-processed loop statement or a higher-level loop statement, namely the highest-level loop that contains no setting or definition of the variable in question. Likewise, when a variable set on the GPU program side overlaps with a variable referenced on the CPU program side, the data transfer specification unit 113 instructs a data transfer from the GPU to the CPU. In this case the transfer is specified at the GPU-processed loop statement or a higher-level loop statement, namely the highest-level loop that contains no reference, setting, or definition of the variable in question.
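The placement rule above can be sketched as a walk outward from the GPU-processed loop. The model is an assumption made for illustration: each loop records the variables that CPU-side code sets or defines (`cpu_writes`) or references (`cpu_reads`) anywhere inside it, and the names `Loop` and `transfer_position` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Loop:
    name: str
    # Variables set or defined by CPU-side code anywhere inside this loop.
    cpu_writes: Set[str] = field(default_factory=set)
    # Variables referenced by CPU-side code anywhere inside this loop.
    cpu_reads: Set[str] = field(default_factory=set)


def transfer_position(nest: List[Loop], var: str, direction: str) -> str:
    """Return the name of the loop at which the transfer is specified.

    nest is ordered outermost first; its last entry is the GPU-processed loop.
    direction is "cpu_to_gpu" or "gpu_to_cpu".
    """
    chosen = nest[-1]  # fall back to the GPU-processed loop itself
    for loop in reversed(nest):  # walk from the GPU loop outward
        blocked = var in loop.cpu_writes
        if direction == "gpu_to_cpu":
            blocked = blocked or var in loop.cpu_reads
        if blocked:
            break
        chosen = loop
    return chosen.name


# Hypothetical three-level nest: CPU code in "outer" sets a and reads c.
nest = [Loop("outer", cpu_writes={"a"}, cpu_reads={"c"}), Loop("middle"), Loop("inner")]
```

In this nest the transfer of `a` cannot be hoisted past `outer`, so it lands at `middle`. A variable that the CPU never touches inside the nest is hoisted all the way to `outer`.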

<Parallel processing specification unit 114>
The parallel processing specification unit 114 identifies the loop statements (repetitive statements) of the application program and, for each loop statement, specifies a parallel processing specification statement for the accelerator and compiles it.
The parallel processing specification unit 114 includes an offload range extraction unit (Extract offloadable area) 114a and an intermediate language file output unit (Output intermediate file) 114b.

The offload range extraction unit 114a identifies processing that can be offloaded to a GPU or FPGA, such as loop statements and FFTs, and extracts an intermediate language corresponding to the offload processing.

The intermediate language file output unit 114b outputs the extracted intermediate language file 132. Intermediate language extraction is not done only once. It is repeated, with trial executions, to optimize the search for an appropriate offload area.

<Parallel processing pattern creation unit 115>
The parallel processing pattern creation unit 115 creates parallel processing patterns that exclude from offloading any loop statement (repetitive statement) that causes a compilation error, and that specify, for each loop statement that does not cause a compilation error, whether or not to process it in parallel.

<Performance measurement unit 116>
The performance measurement unit 116 compiles the application program of a parallel processing pattern, deploys it on the verification machine 14, and executes the processing for measuring performance when offloaded to the accelerator.
The performance measurement unit 116 includes a binary file deployment unit (Deploy binary files) 116a. The binary file deployment unit 116a deploys an executable file derived from the intermediate language on the verification machine 14, which is equipped with a GPU and an FPGA.

The performance measurement unit 116 executes the deployed binary file, measures the performance when offloaded, and returns the performance measurement result to the offload range extraction unit 114a. In this case, the offload range extraction unit 114a extracts another parallel processing pattern, and the intermediate language file output unit 114b attempts performance measurement based on the extracted intermediate language (see symbol a in FIG. 2, described later).

<Executable file creation unit 117>
Based on the performance measurement results repeated a predetermined number of times, the executable file creation unit 117 selects a plurality of high-performance parallel processing patterns from the plurality of parallel processing patterns, creates a plurality of other parallel processing patterns from them by crossover and mutation, and carries the new patterns through to performance measurement. After the specified number of performance measurements, it selects the parallel processing pattern with the highest processing performance based on the measurement results, and compiles that pattern to create an executable file.

<Production environment deployment unit 118>
The production environment deployment unit 118 deploys the created executable file in the production environment for the user ("Deployment of final binary file in production environment"). The production environment deployment unit 118 determines the pattern that specifies the final offload area and deploys it in the production environment for the user.

<Performance measurement test extraction and execution unit 119>
After the executable file is deployed, the performance measurement test extraction and execution unit 119 extracts performance test items from the test case DB 131 and executes the performance tests ("Deployment of final binary file in production environment").
After the executable file is deployed, the performance measurement test extraction and execution unit 119 extracts the performance test items from the test case DB 131 and automatically executes the extracted performance tests in order to show the performance to the user.

<User provision unit 120>
The user provision unit 120 presents information such as price and performance, based on the performance test results, to the user ("Provision of information such as price and performance to the user"). Performance test items are stored in the test case DB 131. Based on the results of the performance tests corresponding to the test items stored in the test case DB 131, the user provision unit 120 presents data such as price and performance to the user together with those performance test results. The user decides whether to start paid use of the service based on the presented price, performance, and other information. Here, the technology of the non-patent literature (Y. Yamato, M. Muroi, K. Tanaka and M. Uchimura, “Development of Template Management Technology for Easy Deployment of Virtual Resources on OpenStack,” Journal of Cloud Computing, Springer, 2014, 3:7, DOI: 10.1186/s13677-014-0007-3, 12 pages, June 2014.) may be used for batch deployment to the production environment, and the technology of the non-patent literature (Y. Yamato, “Automatic verification technology of software patches for user virtual environments on IaaS cloud,” Journal of Cloud Computing, Springer, 2015, 4:4, DOI: 10.1186/s13677-015-0028-6, 14 pages, Feb. 2015.) may be used for automatic performance testing.

[Application of genetic algorithm]
The offload server 1 can use a genetic algorithm (GA) to optimize offloading. The configuration of the offload server 1 when GA is used is as follows.
Based on the genetic algorithm, the parallel processing specification unit 114 sets the gene length to the number of loop statements (repetitive statements) that do not cause a compilation error. The parallel processing pattern creation unit 115 maps whether or not accelerator processing is performed onto a gene pattern, assigning either 1 or 0 to the case where accelerator processing is performed and the other value, 0 or 1, to the case where it is not.
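The gene mapping can be sketched as follows, assuming the convention that 1 means accelerator processing and 0 means CPU processing. The helper names `gene_length` and `apply_gene` and the example flags are hypothetical.

```python
from typing import List, Sequence


def gene_length(compile_ok: Sequence[bool]) -> int:
    """Gene length = number of loop statements that compile without error."""
    return sum(1 for ok in compile_ok if ok)


def apply_gene(compile_ok: Sequence[bool], gene: Sequence[int]) -> List[bool]:
    """Expand a gene into one offload flag per loop statement.

    Loops that cause a compilation error are never offloaded; the gene bits
    are consumed, in order, only by the loops that compile.
    """
    if len(gene) != gene_length(compile_ok):
        raise ValueError("gene length must equal the number of compilable loops")
    bits = iter(gene)
    return [bool(next(bits)) if ok else False for ok in compile_ok]


# Hypothetical program with four loops; the second fails the trial compile.
compile_ok = [True, False, True, True]
flags = apply_gene(compile_ok, [1, 0, 1])
```

Here the gene has three bits for four loops, and the loop that failed the trial compile stays on the CPU whatever the gene says.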

The parallel processing pattern creation unit 115 prepares gene patterns for a specified number of individuals, in which each gene value is randomly set to 1 or 0. For each individual, the performance measurement unit 116 compiles the application code in which the parallel processing specification statements for the accelerator have been specified, and deploys it on the verification machine 14. The performance measurement unit 116 then executes the performance measurement processing on the verification machine 14.

Here, if a gene with the same parallel processing pattern as an earlier one appears in an intermediate generation, the performance measurement unit 116 neither compiles nor measures the application code for that pattern again, and reuses the earlier value as the performance measurement value.
In addition, for application code that causes a compilation error and application code whose performance measurement does not finish within a predetermined time, the performance measurement unit 116 treats the case as a timeout and sets the performance measurement value to a predetermined (long) time.
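The two rules above (reuse of earlier measurements and the timeout penalty) can be sketched as a small cache around the measurement call. Everything named here is an assumption for illustration: `CachedMeasurer`, the 1000-second penalty, and the stub `fake_run`, which stands in for compiling and running on the verification machine and returns `None` on a compilation error or timeout.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

Gene = Tuple[int, ...]


class CachedMeasurer:
    """Measure each distinct pattern once; penalize failures with a long time."""

    def __init__(self, run: Callable[[Gene], Optional[float]], penalty: float = 1000.0):
        self.run = run          # returns seconds, or None on compile error / timeout
        self.penalty = penalty  # assumed "long time" assigned to failed patterns
        self.cache: Dict[Gene, float] = {}
        self.runs = 0           # how many real measurements were performed

    def measure(self, gene: Sequence[int]) -> float:
        key = tuple(gene)
        if key not in self.cache:  # same pattern as before: skip compile and run
            self.runs += 1
            seconds = self.run(key)
            self.cache[key] = self.penalty if seconds is None else seconds
        return self.cache[key]


def fake_run(gene: Gene) -> Optional[float]:
    """Hypothetical stand-in for a real compile-and-run on the verification machine."""
    return None if gene == (1, 1) else 2.0


measurer = CachedMeasurer(fake_run)
first = measurer.measure([1, 0])
second = measurer.measure([1, 0])  # served from the cache
failed = measurer.measure([1, 1])  # treated as a timeout
```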

The executable file creation unit 117 measures the performance of all individuals and evaluates them so that an individual with a shorter processing time has a higher fitness. From all individuals, the executable file creation unit 117 selects as high-performance individuals those whose fitness is higher than a predetermined value (for example, the top n% of all individuals or the top m individuals, where n and m are natural numbers), and applies crossover and mutation to the selected individuals to create the next generation. After processing for the specified number of generations is complete, the executable file creation unit 117 selects the parallel processing pattern with the highest performance as the solution.
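A minimal sketch of the evaluation and the "top n% or top m" selection follows. The text only requires that shorter processing times give higher fitness; the reciprocal used here is an assumption, as are the names `fitness` and `select_top` and the example times.

```python
import math
from typing import List, Optional, Sequence, Tuple

Gene = Tuple[int, ...]


def fitness(seconds: float) -> float:
    """Assumed fitness: the reciprocal of processing time (shorter is fitter)."""
    return 1.0 / seconds


def select_top(population: Sequence[Gene], times: Sequence[float],
               percent: Optional[float] = None, count: Optional[int] = None) -> List[Gene]:
    """Keep the top `count` individuals, or the top `percent` percent of them."""
    if percent is None and count is None:
        raise ValueError("give either percent or count")
    ranked = sorted(zip(population, times), key=lambda pair: pair[1])  # fastest first
    if count is None:
        count = max(1, math.ceil(len(ranked) * percent / 100))
    return [gene for gene, _ in ranked[:count]]


# Hypothetical measured times in seconds for four individuals.
population = [(1, 0, 1), (0, 1, 0), (0, 0, 1), (1, 1, 1)]
times = [2.0, 5.0, 3.0, 9.0]
top_two = select_top(population, times, count=2)
top_half = select_top(population, times, percent=50)
```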

The automatic offload operation of the offload server 1 configured as described above is explained below.
[Automatic offload operation]
FIG. 2 is a diagram showing the automatic offload processing of the offload server 1 using GA.
As shown in FIG. 2, the offload server 1 is applied to an elemental technology of environment adaptive software. The offload server 1 has the control unit (automatic offload function unit) 11, the test case DB 131, the intermediate language file 132, and the verification machine 14.
The offload server 1 acquires the application code 125 used by the user.

The user is, for example, a person who has contracted to use various devices (Device 151, a device 152 having a CPU and a GPU, a device 153 having a CPU and an FPGA, and a device 154 having a CPU).
The offload server 1 automatically offloads functional processing to the accelerators of the device 152 having a CPU and a GPU and the device 153 having a CPU and an FPGA.

The operation of each unit is described below with reference to the step numbers in FIG. 2.
<Step S11: Specify application code>
In step S11, the application code specification unit 111 (see FIG. 1) passes the application code described in the received file to the application code analysis unit 112.

<Step S12: Analyze application code>
In step S12, the application code analysis unit 112 (see FIG. 1) analyzes the source code of the processing function and identifies structures such as loop statements and FFT library calls.

<Step S13: Extract offloadable area>
In step S13, the parallel processing specification unit 114 (see FIG. 1) identifies the loop statements (repetitive statements) of the application and, for each loop statement, specifies a parallel processing specification statement for the accelerator and compiles it. Specifically, the offload range extraction unit 114a (see FIG. 1) identifies processing that can be offloaded to a GPU or FPGA, such as loop statements and FFTs, and extracts an intermediate language corresponding to the offload processing.

<Step S14: Output intermediate file>
In step S14, the intermediate language file output unit 114b (see FIG. 1) outputs the intermediate language file 132. Intermediate language extraction is not done only once. It is repeated, with trial executions, to optimize the search for an appropriate offload area.

<Step S15: Compile error>
In step S15, the parallel processing pattern creation unit 115 (see FIG. 1) creates parallel processing patterns that exclude from offloading any loop statement that causes a compilation error, and that specify, for each loop statement that does not cause a compilation error, whether or not to process it in parallel.

<Step S21: Deploy binary files>
In step S21, the binary file deployment unit 116a (see FIG. 1) deploys the executable file derived from the intermediate language on the verification machine 14 equipped with a GPU and an FPGA.

<Step S22: Measure performances>
In step S22, the performance measurement unit 116 (see FIG. 1) executes the deployed file and measures the performance when offloaded.
To make the offloaded area more appropriate, this performance measurement result is returned to the offload range extraction unit 114a, which extracts another pattern. The intermediate language file output unit 114b then attempts performance measurement based on the extracted intermediate language (see symbol a in FIG. 2).

As indicated by symbol a in FIG. 2, the control unit 11 repeatedly executes steps S12 to S22. The automatic offload function of the control unit 11 can be summarized as follows. The parallel processing specification unit 114 identifies the loop statements (repetitive statements) of the application program and, for each loop statement, specifies a parallel processing specification statement for the GPU and compiles it. The parallel processing pattern creation unit 115 then creates parallel processing patterns that exclude loop statements causing a compilation error from offloading and that specify, for each loop statement that does not cause a compilation error, whether or not to process it in parallel. The binary file deployment unit 116a compiles the application program of the relevant parallel processing pattern and deploys it on the verification machine 14, and the performance measurement unit 116 executes the performance measurement processing on the verification machine 14. Based on the performance measurement results repeated a predetermined number of times, the executable file creation unit 117 selects the pattern with the highest processing performance from the plurality of parallel processing patterns and compiles the selected pattern to create an executable file.

<Step S23: Deploy final binary files to production environment>
In step S23, the production environment deployment unit 118 determines the pattern that specifies the final offload area and deploys it in the production environment for the user.

<Step S24: Extract performance test cases and run automatically>
In step S24, after the executable file is deployed, the performance measurement test extraction and execution unit 119 extracts performance test items from the test case DB 131 and automatically executes the extracted performance tests in order to show the performance to the user.

<Step S25: Provide price and performance to a user to judge>
In step S25, the user provision unit 120 presents information such as price and performance, based on the performance test results, to the user. The user decides whether to start paid use of the service based on the presented price, performance, and other information.

Steps S11 to S25 above are assumed to be performed, for example, in the background while the user uses the service, such as during the first day of trial use. To reduce cost, the processing performed in the background may target only function placement optimization and GPU/FPGA offloading.

As described above, when applied to an elemental technology of environment adaptive software, the control unit (automatic offload function unit) 11 of the offload server 1 extracts the area to be offloaded from the source code of the application program used by the user and outputs an intermediate language, in order to offload functional processing (steps S11 to S15). The control unit 11 deploys and executes the executable file derived from the intermediate language on the verification machine 14 and verifies the offload effect (steps S21 to S22). After repeating the verification and determining an appropriate offload area, the control unit 11 deploys the executable file in the production environment actually provided to the user and provides it as a service (steps S23 to S25).

[Automatic GPU offloading using GA]
Automatic GPU offloading is the processing that repeats steps S12 to S22 in FIG. 2 for the GPU and obtains the offload code that is finally deployed in step S23.

A GPU is a device that generally does not guarantee latency but is suited to raising throughput through parallel processing. Typical workloads in environment adaptive software are encryption processing, image processing for camera video analysis, and machine learning processing for analyzing large volumes of sensor data, all of which contain many repetitive operations. The aim is therefore to speed up an application by automatically offloading its loop statements to the GPU.

However, as described in the prior art, appropriate parallel processing is necessary to achieve a speedup. In particular, when a GPU is used, memory transfer between the CPU and the GPU means that performance often does not improve unless the data size and the number of loop iterations are large. In addition, depending on factors such as the timing of memory data transfers, the combination of loop statements (repetitive statements) that can each be accelerated individually is not always the fastest. For example, if among ten for statements the first, fifth, and tenth can each be made faster than on the CPU, offloading the combination of the first, fifth, and tenth is not necessarily the fastest option.

To specify appropriate parallel regions, there have been attempts to optimize by trial and error whether each for statement is parallelized, using the PGI compiler. However, trial and error takes a great deal of labor, and when the technique is provided as a service, it delays the start of use by the user and raises the cost.

In this embodiment, therefore, appropriate offload areas are extracted automatically from a general-purpose program that was not written with parallelization in mind. To this end, the for statements are first checked for parallelizability, and then GA is used on the group of parallelizable for statements to repeat performance verification trials in the verification environment and search for appropriate areas. By narrowing the candidates down to parallelizable for statements, and then holding and recombining the parallel processing patterns that can be accelerated in the form of gene segments, patterns that can be accelerated are searched for efficiently among the enormous number of possible parallel processing patterns.

[Search image of the control unit (automatic offload function unit) 11 using Simple GA]
FIG. 3 is a diagram showing a search image of the processing of the control unit (automatic offload function unit) 11 using Simple GA and the mapping of for statements to a gene sequence.
GA is a combinatorial optimization method that imitates the evolutionary process of living organisms. The flow of GA is initialization → evaluation → selection → crossover → mutation → termination determination.
In this embodiment, Simple GA, a GA with simplified processing, is used. Simple GA is a simplified GA in which genes take only the values 1 and 0 and which uses roulette selection, one-point crossover, and a mutation that inverts the value of a single gene.

<Initialization>
In initialization, all for statements in the application code are checked for parallelizability, and the parallelizable for statements are then mapped to a gene sequence. A gene value of 1 means GPU processing and 0 means no GPU processing. Genes are prepared for a specified number of individuals M, and 1 or 0 is assigned at random to each for statement.
Specifically, the control unit (automatic offload function unit) 11 (see FIG. 1) acquires the application code 130 (see FIG. 2) used by the user and, as shown in FIG. 3, checks whether the for statements can be parallelized from the code patterns 141 of the application code 130. When three for statements are found in the code pattern 141 as shown in FIG. 3 (see symbol b in FIG. 3), one digit is assigned to each for statement, so here three digits of 1 or 0 are assigned to the three for statements. For example, 0 means processing on the CPU and 1 means offloading to the GPU. At this stage, however, 1 or 0 is assigned at random.
The code corresponding to the gene length has three digits, and a code with a gene length of three digits has 2^3 patterns, for example 100, 101, and so on. In FIG. 3, the circles (○) in the code pattern 141 represent the codes schematically.
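The initialization step can be sketched as follows: the first helper enumerates the 2^3 possible codes for three for statements, and the second draws M random individuals. The names `all_patterns` and `init_population` and the seed are hypothetical.

```python
import itertools
import random
from typing import List


def all_patterns(gene_len: int) -> List[str]:
    """Enumerate every 0/1 code of the given gene length (2**gene_len codes)."""
    return ["".join(bits) for bits in itertools.product("01", repeat=gene_len)]


def init_population(gene_len: int, m: int, rng: random.Random) -> List[List[int]]:
    """Prepare M individuals, assigning 1 or 0 at random to each for statement."""
    return [[rng.randint(0, 1) for _ in range(gene_len)] for _ in range(m)]


patterns = all_patterns(3)  # three parallelizable for statements
population = init_population(3, 5, random.Random(0))
```

Exhaustive enumeration is only practical for short genes, which is why the search proceeds from a random population instead.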

<Evaluation>
In evaluation, deployment and performance measurement (Deploy & performance measurement) are performed (see symbol c in FIG. 3). That is, the performance measurement unit 116 (see FIG. 1) compiles the code corresponding to each gene, deploys it on the verification machine 14, and executes it. The performance measurement unit 116 performs benchmark performance measurement. Genes of patterns (parallel processing patterns) with better performance are given a higher fitness.

<Selection>
In selection, high-performance code patterns are selected based on fitness (Select high performance code patterns) (see symbol d in FIG. 3). The performance measurement unit 116 (see FIG. 1) selects the specified number of high-fitness genes based on fitness. In this embodiment, roulette selection according to fitness and elite selection of the gene with the highest fitness are performed.
FIG. 3 shows, as a search image, that the circles (○) in the selected code patterns (Select code patterns) 142 have been reduced to three.
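Roulette selection combined with elite selection can be sketched as follows. This is one common formulation, assumed here for illustration: the fittest individual is always kept, and the remaining slots are drawn with probability proportional to fitness, with replacement. The function name `select` and the fitness values are hypothetical.

```python
import random
from typing import List, Sequence


def select(population: Sequence[str], fitnesses: Sequence[float],
           k: int, rng: random.Random) -> List[str]:
    """Return k individuals: the elite first, then k - 1 roulette draws."""
    best = max(range(len(population)), key=lambda i: fitnesses[i])
    chosen = [population[best]]  # elite selection of the highest-fitness gene
    # Roulette selection: draw probability is proportional to fitness.
    chosen += rng.choices(population, weights=fitnesses, k=k - 1)
    return chosen


population = ["101", "010", "001"]
fitnesses = [0.5, 0.2, 0.1]  # hypothetical values; higher means faster
selected = select(population, fitnesses, 3, random.Random(1))
```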

<Crossover>
In crossover, at a fixed crossover rate Pc, some genes are exchanged at a single point between selected individuals to create child individuals.
The genes of one roulette-selected pattern (parallel processing pattern) are crossed with those of another pattern. The position of the one-point crossover is arbitrary; for example, the crossover is made at the second digit of the three-digit code above.
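One-point crossover at rate Pc can be sketched as follows. The cut index convention (the children swap everything from index `point` onward) is an assumption, and the names `one_point_crossover` and `maybe_crossover` are hypothetical.

```python
import random
from typing import Tuple


def one_point_crossover(a: str, b: str, point: int) -> Tuple[str, str]:
    """Swap the tails of two equal-length codes from index `point` onward."""
    return a[:point] + b[point:], b[:point] + a[point:]


def maybe_crossover(a: str, b: str, pc: float, rng: random.Random) -> Tuple[str, str]:
    """Apply one-point crossover with probability pc, else return the parents."""
    if len(a) > 1 and rng.random() < pc:
        point = rng.randint(1, len(a) - 1)  # arbitrary cut position
        return one_point_crossover(a, b, point)
    return a, b


children = one_point_crossover("101", "010", 1)
```

With the cut at index 1, the parents 101 and 010 keep their first digit and exchange the last two.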

<Mutation>
Mutation is introduced to avoid local solutions. An embodiment without mutation is also possible in order to reduce the amount of computation. In mutation, at a fixed mutation rate Pm, each gene value of an individual is changed from 0 to 1 or from 1 to 0.
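Mutation at rate Pm can be sketched as an independent flip of each gene value. The function name `mutate` is hypothetical.

```python
import random
from typing import List, Sequence


def mutate(gene: Sequence[int], pm: float, rng: random.Random) -> List[int]:
    """Flip each gene value (0 -> 1 or 1 -> 0) independently with probability pm."""
    return [1 - g if rng.random() < pm else g for g in gene]


unchanged = mutate([1, 0, 1], 0.0, random.Random(0))  # pm = 0: no mutation
inverted = mutate([1, 0, 1], 1.0, random.Random(0))   # pm = 1: every bit flips
```

Setting Pm to 0 corresponds to the embodiment without mutation mentioned above.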

<Termination determination>
As shown in FIG. 3, the next-generation code patterns are generated after crossover and mutation (Generate next generation code patterns after crossover & mutation) (see symbol e in FIG. 3).
In the termination determination, the processing ends after being repeated for the specified number of generations T, and the gene with the highest fitness is taken as the solution.
For example, performance is measured and the three fastest patterns, 101, 010, and 001, are selected. For the next generation, GA recombines these three to create new patterns (parallel processing patterns), for example 101. At this point, mutations such as arbitrarily changing a 0 to a 1 are introduced into the recombined patterns. This is repeated to find the fastest pattern. The number of generations (for example, 20) is fixed in advance, and the pattern remaining in the final generation is taken as the final solution.

<Deployment>
The parallel processing pattern with the highest processing performance, corresponding to the gene with the highest fitness, is deployed anew to the production environment and provided to the user.

The following describes the case in which a considerable number of for statements (loop statements; repetition statements) cannot be offloaded to the GPU. For example, even if there are 200 for statements, only about 30 of them can be offloaded to the GPU. Here, the statements that cause errors are excluded, and the GA is run on the remaining 30.

For example, for OpenACC, a specification for GPU processing of C/C++ code, there are compilers that extract bytecode for the GPU from the code designated by the directive #pragma acc kernels and make GPU offloading possible by executing it. For Python and Java, GPU processing can be specified with CUDA, Java lambda expressions, and the like.

When C/C++ is used, the C/C++ code is analyzed to find for statements. When a for statement is found, the OpenACC parallel processing directive #pragma acc kernels is written for it. Specifically, each for statement is placed, one at a time, under an otherwise empty #pragma acc kernels and compiled; if an error occurs, that for statement cannot be processed by the GPU in the first place and is excluded. The remaining for statements are found in this way. The number of for statements that produce no error is the length (gene length): if five for statements are error-free, the gene length is 5, and if ten are error-free, the gene length is 10. Parallel processing is not possible when there is a data dependency, such as when the result of a preceding operation is used in the next operation.
This completes the preparation stage. Next, the GA processing is performed.
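The preparation stage can be sketched as a filter over the for statements. In the sketch below, `compiles_ok` is a hypothetical callback standing in for a trial compilation of one for statement under #pragma acc kernels; the function name and the use of integers as loop identifiers are likewise assumptions for illustration.

```python
def offloadable_loops(loop_ids, compiles_ok):
    """Return the loops that compile when offloaded one at a time.

    compiles_ok(loop_id) is a hypothetical callback: True when the trial
    compilation of that single loop under '#pragma acc kernels' succeeds.
    The length of the returned list is the gene length.
    """
    return [loop_id for loop_id in loop_ids if compiles_ok(loop_id)]

# Toy stand-in for the compiler: loops 2 and 5 are assumed to fail.
failing = {2, 5}
kept = offloadable_loops(range(1, 8), lambda loop_id: loop_id not in failing)
gene_length = len(kept)  # 7 candidate loops, 2 excluded, gene length 5
```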

A code pattern with a gene length corresponding to the number of for statements has now been obtained. Initially, parallel processing patterns 100, 010, 001, ... are assigned at random. GA processing is then performed and the code is compiled. At this point an error may occur even for a for statement that can be offloaded. This happens when for statements are nested (GPU processing is possible if only one of the levels is designated). In this case, the for statement that caused the error may be kept. Specifically, one method is to treat the pattern as having a long processing time, so that it is handled as a timeout.

The code is deployed on the verification machine 14 and benchmarked; for image processing, for example, the benchmark is that image processing. The shorter the processing time, the higher the fitness. For example, the fitness is based on the reciprocal of the processing time: 1 for a processing time of 10 seconds, 0.1 for 100 seconds, and 10 for 1 second (in these figures, 10 divided by the processing time in seconds).
Patterns with high fitness are selected, for example 3 to 5 out of 10, and are recombined to create new code patterns. During this process, a pattern identical to an earlier one may be produced. In that case the same benchmark need not be run again, and the earlier data is reused. In this embodiment, each code pattern and its processing time are stored in the storage unit 13.
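The fitness values quoted above and the reuse of earlier measurements can be sketched as follows. The scale factor of 10 is inferred from the three quoted values, and the names `fitness_from_time`, `MeasurementCache`, and the `measure` callback are hypothetical; the in-memory dictionary merely stands in for the storage unit 13.

```python
def fitness_from_time(seconds, scale=10.0):
    """Fitness inversely proportional to processing time.

    With scale=10 this reproduces the quoted values:
    10 s -> 1, 100 s -> 0.1, 1 s -> 10.
    """
    return scale / seconds

class MeasurementCache:
    """Remember the processing time of every pattern already benchmarked."""

    def __init__(self, measure):
        self._measure = measure  # hypothetical benchmark callback: pattern -> seconds
        self._times = {}         # pattern -> measured processing time

    def time_for(self, pattern):
        # Run the benchmark only the first time a pattern is seen.
        if pattern not in self._times:
            self._times[pattern] = self._measure(pattern)
        return self._times[pattern]
```

Keying the cache on the pattern string means that an individual regenerated by crossover or mutation costs no additional compilation or benchmark run.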

The above describes how the control unit (automatic offload function unit) 11 searches using a Simple GA. Next, the batch processing method for data transfer is described.

[Batch processing method for data transfer]
As described above, a genetic algorithm is used to automatically tune the parallel processing portions that are effective for GPU processing. However, for some applications performance could not be improved because of data transfer between CPU and GPU memory. For this reason, it is difficult for an unskilled user to improve application performance with a GPU, and even with automatic parallelization technology, trial and error is needed to determine whether parallel processing is possible, and speedup is sometimes not achieved.

Therefore, this embodiment aims to automatically improve the performance of more applications using a GPU, and provides a technique that reduces the number of data transfers to the GPU.

<Basic concept>
Specifications such as OpenACC define, in addition to directives that specify parallel processing on the GPU, directives that explicitly specify data transfer from the CPU to the GPU and vice versa (hereinafter, "explicit directives"). A directive is an instruction or designation command written with a special symbol at the beginning of the line. The explicit directives of OpenACC and the like include "#pragma acc data copyin" for data transfer from the CPU to the GPU, "#pragma acc data copyout" for data transfer from the GPU to the CPU, and "#pragma acc data copy" for data transfer from the CPU to the GPU and back to the CPU.

In this embodiment, to reduce inefficient data transfer, data transfer is designated with explicit directives together with the extraction of parallel processing by the GA.
For each individual generated by the GA, the reference relationships of the variable data used in the loop statements are analyzed, and for data that may be transferred outside a loop rather than on every iteration, data transfer outside the loop is explicitly designated.

<Specific example>
The processing is described concretely below.
There are two types of data transfer: from the CPU to the GPU, and from the GPU to the CPU.

FIG. 4 shows loop statements (repetition statements) in the source code of an application program processed by the automatic offload function unit of the specific example, for the case where a variable defined on the CPU program side overlaps a variable referenced on the GPU program side.
The automatic offload function unit of the specific example is the control unit (automatic offload function unit) 11 of FIG. 1 with the data transfer designation unit 113 removed or not executed.

Data transfer from the CPU to the GPU in the specific example is taken as an example.
FIG. 4 shows loop statements for data transfer from the CPU to the GPU in the specific example, where a variable defined on the CPU program side overlaps a variable referenced on the GPU program side. The labels <1> to <4> at the head of the loop statements in the following description and in FIG. 4 are added for convenience of explanation (the same applies to the other figures and their descriptions).
The loop statements of the specific example shown in FIG. 4 are written on the CPU program side:
<1> loop [for | do | while] {
}
contains
<2> loop [for | do | while] {
}
which contains
<3> loop [for | do | while] {
}
which in turn contains
<4> loop [for] {
}

In addition, variable a is set in the <1> loop [for | do | while] { }, and variable a is referenced in the <4> loop [for | do | while] { }.

Furthermore, in the <3> loop [for | do | while] { }, a portion that the PGI compiler can process in parallel, such as a for statement, is designated with the OpenACC directive #pragma acc kernels (parallel processing designation statement); details are described later.

In the loop statements of the comparative example shown in FIG. 4, data is transferred from the CPU to the GPU every time at the timing indicated by symbol f in FIG. 4. The number of data transfers to the GPU therefore needs to be reduced.
The same applies to data transfer from the GPU to the CPU, so its description is omitted.

As described above, in this embodiment data transfer is explicitly designated so that it is performed in a batch at as high a loop level as possible, which avoids the inefficiency of transferring data on every loop iteration.
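The benefit of moving a transfer to a higher loop level can be made concrete with a small calculation: a transfer placed inside a set of loops is executed once per combined iteration of those loops. The sketch below is illustrative only; the function name `transfer_count` and the trip counts of 100 and 50 are hypothetical and do not come from FIG. 4.

```python
from math import prod

def transfer_count(enclosing_trip_counts):
    """Number of times a data transfer placed inside the given loops is executed.

    enclosing_trip_counts lists the iteration counts of every loop that
    encloses the transfer, outermost first (hypothetical values).
    """
    return prod(enclosing_trip_counts)

# Transfer issued inside two enclosing loops of 100 and 50 iterations.
per_iteration = transfer_count([100, 50])  # 5000 transfers
# Same transfer hoisted so that only the 100-iteration loop encloses it.
hoisted = transfer_count([100])            # 100 transfers
```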

[GPU offload processing]
The batch processing method for data transfer described above makes it possible to extract loop statements suitable for offloading while avoiding inefficient data transfer.
However, even with this method, some programs are not suited to GPU offloading. Effective GPU offloading requires that the offloaded processing have a large number of loop iterations.

Therefore, in this embodiment, a profiling tool is used to investigate loop counts as a step preceding the full-scale offload search. Because a profiling tool can report how many times each line is executed, programs can be sorted in advance; for example, only programs with loops of 50 million iterations or more are made targets of the offload search. This is described concretely below (partly overlapping the description of FIG. 3 above).

In this embodiment, the control unit (automatic offload function unit) 11 (see FIG. 1) first analyzes the application program and identifies loop statements such as for, do, and while. Next, it runs sample processing, uses a profiling tool to investigate the loop count of each loop statement, and decides whether to perform the full-scale search according to whether there is a loop whose count is at least a given value.
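The decision on whether to proceed to the full-scale search can be sketched as a simple threshold test. The function name `worth_searching` and the dictionary format of the loop counts are assumptions for this example; an actual run would obtain the counts from a profiling tool.

```python
def worth_searching(loop_counts, threshold):
    """Return True if any loop ran at least `threshold` times in the sample run.

    loop_counts maps a loop identifier to its measured iteration count
    (hypothetical format; real counts would come from a profiling tool).
    """
    return any(count >= threshold for count in loop_counts.values())
```

For example, with a threshold of 50 million, a program whose busiest loop ran 60 million times is searched, while one whose busiest loop ran 200 times is not.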

If the full-scale search is to be performed, GA processing begins (see FIG. 3 above). In the initialization step, all loop statements in the application code are checked for parallelizability, and each parallelizable loop statement is then mapped to the gene sequence as 1 if it is to be processed by the GPU and 0 if not. Genes are prepared for the specified number of individuals, and each gene value is randomly assigned 1 or 0.
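The initialization step can be sketched as the generation of random bit strings. The function name `initial_population` and the string representation of genes are assumptions for this example.

```python
import random

def initial_population(gene_length, n_individuals, rng=None):
    """Create n_individuals random genes, one bit per parallelizable loop statement."""
    rng = rng or random.Random()
    return ["".join(rng.choice("01") for _ in range(gene_length))
            for _ in range(n_individuals)]
```

Each position in a gene corresponds to one parallelizable loop statement: 1 means the loop is processed by the GPU and 0 means it is not.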

Here, in the code corresponding to a gene, explicit data transfer directives (#pragma acc data copyin/copyout/copy when OpenACC is used) are added according to the reference relationships of the variable data in the loop statements designated for GPU processing.

In the evaluation step, the code corresponding to each gene is compiled, deployed to the verification machine, and executed, and benchmark performance is measured. Genes of patterns with good performance are given higher fitness. As described above, parallel processing directives (see, for example, symbol f in FIG. 4) have been inserted into the code corresponding to a gene.

In the selection step, a specified number of individuals whose genes have high fitness are selected based on fitness. In this embodiment, roulette selection according to fitness and elite selection of the gene with the highest fitness are performed. In the crossover step, at a fixed crossover rate Pc, selected individuals exchange part of their genes at a single point to create child individuals. In the mutation step, each gene value of an individual is changed from 0 to 1 or from 1 to 0 at a fixed mutation rate Pm.

When the mutation step is finished and the genes of the next generation have been created for the specified number of individuals, explicit data transfer directives are added as in the initialization step, and the evaluation, selection, crossover, and mutation steps are repeated.

Finally, in the termination determination step, the process ends after being repeated for the specified number of generations, and the gene with the highest fitness is taken as the solution. The code pattern with the highest performance, corresponding to the gene with the highest fitness, is deployed anew to the production environment and provided to the user.

The implementation of the offload server 1 is described below. This implementation is intended to confirm the effectiveness of this embodiment.
[Implementation]
An implementation that automatically offloads C/C++ applications using the general-purpose PGI compiler is described.
Because the purpose of this implementation is to confirm the effectiveness of automatic GPU offloading, the target applications are C/C++ applications, and the conventional PGI compiler is used for the GPU processing itself.

The C/C++ language ranks among the most popular languages for developing OSS (Open Source Software) and proprietary software, and a large number of applications are developed in C/C++. To confirm the offloading of application programs used by general users, general-purpose OSS applications such as encryption processing and image processing are used.

GPU processing is performed by the PGI compiler, a compiler for C/C++/Fortran that interprets OpenACC. In this embodiment, parallelizable portions such as for statements are designated with the OpenACC directive #pragma acc kernels (parallel processing designation statement). The compiler thereby extracts bytecode for the GPU, and executing it enables GPU offloading. In addition, the compiler reports an error when, for example, data in a for statement has dependencies that prevent parallel processing, or when multiple different levels of nested for statements are designated. Explicit data transfer can also be designated with directives such as #pragma acc data copyin/copyout/copy.

In accordance with the designation by #pragma acc kernels (parallel processing designation statement) above, explicit data transfer is designated by inserting #pragma acc data copyin(a[…]), which uses the OpenACC copyin clause, at the position described above.

<Overview of implementation operation>
An overview of the operation of the implementation is given below.
The C/C++ application program to be accelerated and a benchmark tool for measuring its performance are prepared.

When a request to use a C/C++ application program is received, the implementation first analyzes the C/C++ application code to find for statements and to grasp the program structure, such as the variable data used within the for statements. For the syntax analysis, the LLVM/Clang parsing library (the Python binding of libClang) or the like is used.

To estimate whether the application is likely to benefit from GPU offloading, the implementation first runs the benchmark and determines the loop counts of the for statements found by the syntax analysis. gcov (GNU coverage) or the like is used to determine loop counts. "GNU profiler (gprof)" and "GNU coverage (gcov)" are known profiling tools; either may be used, since both can report the number of times each line is executed. For example, only application programs with a loop count of 10 million or more may be targeted, and this value can be changed.

General-purpose application programs for CPUs are not implemented with parallelization in mind. For statements that cannot be processed by the GPU at all must therefore be excluded first. To do this, insertion of the parallel processing directive #pragma acc kernels is tried for each for statement, one at a time, and it is determined whether a compile error occurs. There are several kinds of compile errors: an external routine is called in the for statement, different levels of a nested for statement are designated redundantly, the for statement is exited midway by break or the like, or the data in the for statement has a data dependency. The kinds of compile errors vary from application program to application program and other cases exist, but any for statement that causes a compile error is excluded from processing and no #pragma directive is inserted for it.

Compile errors are difficult to handle automatically, and handling them often brings no benefit. An external routine call can sometimes be dealt with by #pragma acc routine, but most external calls are to libraries, and even if they are included in GPU processing the call becomes a bottleneck and performance does not improve. Because the for statements are tried one at a time, compile errors due to nesting do not occur at this stage. When a loop is exited midway by break or the like, the loop count must be fixed for parallel processing, which requires modifying the program. When there is a data dependency, parallel processing is not possible in the first place.

Here, if the number of loop statements that cause no error when processed in parallel is a, then a is the gene length. A gene value of 1 corresponds to the presence of a parallel processing directive and 0 to its absence, and the application code is mapped to a gene of length a.
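The mapping between a gene and the application code can be sketched as a text transformation that writes #pragma acc kernels in front of every loop whose gene value is 1. This is a minimal sketch: the function name `insert_kernels` and the list `loop_line_numbers` (zero-based positions of the error-free for statements, assumed to come from the syntax analysis) are hypothetical.

```python
def insert_kernels(source_lines, loop_line_numbers, gene):
    """Return a copy of source_lines with '#pragma acc kernels' before each
    loop whose gene bit is '1'.

    loop_line_numbers[i] is the zero-based index of the line on which the
    i-th candidate for statement starts (hypothetical input format).
    """
    if len(loop_line_numbers) != len(gene):
        raise ValueError("one gene bit is needed per candidate loop")
    marked = {line for line, bit in zip(loop_line_numbers, gene) if bit == "1"}
    result = []
    for number, text in enumerate(source_lines):
        if number in marked:
            indent = text[:len(text) - len(text.lstrip())]
            result.append(indent + "#pragma acc kernels")  # keep the loop's indentation
        result.append(text)
    return result
```

With two candidate loops, the gene "10" annotates only the first loop and the gene "00" leaves the source unchanged.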

Next, gene sequences for the specified number of individuals are prepared as initial values. As described with reference to FIG. 3, each gene value is created by randomly assigning 0 or 1. According to the prepared gene sequence, where a gene value is 1, the directive #pragma acc kernels, which designates parallel processing, is inserted into the C/C++ code. At this stage, the portions of the code corresponding to a given gene that are to be processed by the GPU are determined. Based on the reference relationships of the variable data in the for statements analyzed with Clang as described above, directives for data transfer from the CPU to the GPU and in the reverse direction are designated according to the rules described above.
Specifically, variables that need to be transferred from the CPU to the GPU are designated with #pragma acc data copyin (not shown), and variables that need to be transferred from the GPU to the CPU are designated with #pragma acc data copyout (not shown). When copyin and copyout both apply to the same variable, they are combined into #pragma acc data copy to simplify the description.
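The choice among copyin, copyout, and copy can be sketched with set operations. The function name `data_directive` and its two input collections are hypothetical, and array ranges in the clauses are deliberately omitted, so the output shows only the shape of the directive.

```python
def data_directive(to_gpu, to_cpu):
    """Build one '#pragma acc data' line from the variables to transfer.

    to_gpu: names of variables that must go from the CPU to the GPU.
    to_cpu: names of variables that must come back from the GPU to the CPU.
    Variables in both collections are merged into a single copy clause.
    """
    to_gpu, to_cpu = set(to_gpu), set(to_cpu)
    clauses = []
    for name, variables in (("copyin", to_gpu - to_cpu),
                            ("copyout", to_cpu - to_gpu),
                            ("copy", to_gpu & to_cpu)):
        if variables:
            clauses.append("%s(%s)" % (name, ", ".join(sorted(variables))))
    if not clauses:
        return ""
    return "#pragma acc data " + " ".join(clauses)
```

For instance, with a and b needed on the GPU and b and c needed back on the CPU, the result is `#pragma acc data copyin(a) copyout(c) copy(b)`.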

The C/C++ code into which the parallel processing and data transfer directives have been inserted is compiled with the PGI compiler on a machine equipped with a GPU. The compiled executable file is deployed, and its performance is measured with the benchmark tool.

After benchmark performance has been measured for all individuals, the fitness of each gene sequence is set according to the benchmark processing time. The individuals to keep are selected according to the set fitness. GA processing (crossover, mutation, and copying as-is) is applied to the selected individuals to create the next-generation population.

Directive insertion, compilation, performance measurement, fitness setting, selection, crossover, and mutation are then performed on the next-generation individuals. If a gene with the same pattern as before arises during GA processing, that individual is neither compiled nor measured, and the earlier measured value is used.

After GA processing has finished for the specified number of generations, the C/C++ code with directives that corresponds to the gene sequence with the highest performance is taken as the solution.

The number of individuals, number of generations, crossover rate, mutation rate, fitness setting, and selection method are GA parameters and are specified separately. By automating the above processing, the proposed technique automates GPU offloading, which has conventionally required the time and skill of specialist engineers.
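Taken together, the steps above form a loop over generations with a loop over individuals inside it. The following is a compact sketch of that flow, not the implementation evaluated in this embodiment: `measure` is a hypothetical callback that stands in for directive insertion, compilation, deployment, and benchmarking, and the default values of `pc`, `pm`, and `seed` are arbitrary assumptions.

```python
import random

def run_ga(gene_length, n_individuals, n_generations, measure,
           pc=0.9, pm=0.05, seed=0):
    """Search for the bit pattern with the shortest measured processing time."""
    if n_generations < 1:
        raise ValueError("at least one generation is required")
    rng = random.Random(seed)
    measured = {}  # pattern -> processing time; repeated patterns are not re-measured

    def time_of(pattern):
        if pattern not in measured:
            measured[pattern] = measure(pattern)
        return measured[pattern]

    population = ["".join(rng.choice("01") for _ in range(gene_length))
                  for _ in range(n_individuals)]
    for _ in range(n_generations):
        # Evaluation: a shorter processing time gives a higher fitness.
        fitness = [1.0 / time_of(p) for p in population]
        # Elite selection: the best individual is copied unchanged.
        next_generation = [population[fitness.index(max(fitness))]]
        while len(next_generation) < n_individuals:
            a, b = rng.choices(population, weights=fitness, k=2)  # roulette selection
            if gene_length > 1 and rng.random() < pc:             # one-point crossover
                cut = rng.randint(1, gene_length - 1)
                a = a[:cut] + b[cut:]
            child = "".join(                                      # bit-flip mutation
                ("1" if bit == "0" else "0") if rng.random() < pm else bit
                for bit in a)
            next_generation.append(child)
        population = next_generation
    best = min(measured, key=measured.get)
    return best, measured[best]
```

The function returns the fastest pattern among those actually measured, together with its processing time. `measure` must return a positive time, since the fitness is its reciprocal.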

[Flowchart of <<Loop statement offload: common>>]
FIGS. 5A and 5B are a flowchart of <<Loop statement offload: common>>; FIG. 5A and FIG. 5B are joined by a connector.

<Code analysis>
In step S101, the application code analysis unit 112 (see FIG. 1) analyzes the code of the application program.

<Loop statement identification>
In step S102, the parallel processing designation unit 114 (see FIG. 1) identifies the loop statements and reference relationships of the application program.

<Loop count of loop statements>
In step S103, the parallel processing designation unit 114 runs the benchmark tool, determines the loop count of each loop statement, and sorts the loop statements against a threshold.

<Parallelizability of loop statements>
In step S104, the parallel processing designation unit 114 checks whether each loop statement can be processed in parallel.

<Repetition over loop statements>
The control unit (automatic offload function unit) 11 repeats steps S106 to S107, between the loop start of step S105 and the loop end of step S108, as many times as there are loop statements.
In step S106, the parallel processing designation unit 114 designates GPU processing for each loop statement by a method appropriate to the language and compiles or interprets the code.
In step S107, if an error occurs, the parallel processing designation unit 114 removes the GPU processing designation from the for statement concerned.
In step S109, the parallel processing designation unit 114 counts the for statements that cause no compile error and uses the count as the gene length.

<Preparation of patterns for the specified number of individuals>
Next, as initial values, the parallel processing designation unit 114 prepares gene sequences for the specified number of individuals. These are created by randomly assigning 0 and 1.
In step S110, the parallel processing designation unit 114 maps the code of the application program to genes. The patterns for the specified number of individuals are prepared by mapping gene sequences with randomly assigned 0s and 1s to the genes.
According to the prepared gene sequences, where a gene value is 1, a directive designating parallel processing is inserted into the code of the application program (see, for example, the #pragma directive in FIG. 3).

The control unit (automatic offload function unit) 11 repeats steps S112 to S119, between the loop start of step S111 and the loop end of step S120, for the specified number of generations.
Within that repetition, steps S113 to S116 are further repeated, between the loop start of step S112 and the loop end of step S117, for the specified number of individuals. That is, the repetition over the specified number of individuals is nested inside the repetition over the specified number of generations.

<Data transfer designation>
In step S113, the data transfer designation unit 113 designates data transfer from the variable reference relationships by a method appropriate to the language.

<Compilation>
In step S114, the parallel processing pattern creation unit 115 (see FIG. 1) compiles or interprets the code on the GPU processing platform according to the gene pattern. That is, the parallel processing pattern creation unit 115 compiles or interprets the created application program code with the PGI compiler on the verification machine 14 equipped with a GPU.
A compile error may occur here, for example when multiple levels of nested for statements are designated for parallel processing. Such a case is handled in the same way as a timeout of the processing time during performance measurement.

In step S115, the performance measurement unit 116 (see FIG. 1) deploys the executable file to the verification machine 14 equipped with a CPU and a GPU.
In step S116, the performance measurement unit 116 executes the deployed binary file and measures the benchmark performance with offloading.

In intermediate generations, a gene with the same pattern as one already measured is not measured again. That is, if the GA processing produces a gene pattern that has appeared before, the individual is neither compiled nor measured, and the earlier measured value is reused.
In step S118, the executable file creation unit 117 (see FIG. 1) evaluates the individuals so that a shorter processing time gives a higher fitness, and selects high-performance individuals.
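
Reusing earlier measurements for repeated gene patterns amounts to memoization keyed by the gene sequence. The following Python sketch shows one way to do this; `make_cached_measure` and `fake_measure` are hypothetical names, and the stand-in measurement function merely replaces the real compile-and-benchmark step.

```python
def make_cached_measure(measure):
    """Wrap `measure` so each distinct gene pattern is measured only once."""
    cache = {}

    def cached(gene):
        key = tuple(gene)              # lists are not hashable, tuples are
        if key not in cache:
            cache[key] = measure(gene)  # compile + benchmark would happen here
        return cache[key]

    return cached, cache


# Stand-in for the real measurement; it records every call it receives.
calls = []


def fake_measure(gene):
    calls.append(tuple(gene))
    return 1.0 + sum(gene)


cached_measure, cache = make_cached_measure(fake_measure)
first = cached_measure([1, 0, 1])
again = cached_measure([1, 0, 1])   # same pattern: served from the cache
other = cached_measure([0, 0, 1])
```

Because each measurement involves compilation and a benchmark run, skipping repeated patterns directly reduces the total search time.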

In step S119, the executable file creation unit 117 applies crossover and mutation to the selected individuals to create the next-generation individuals. The next-generation individuals then undergo compilation, performance measurement, fitness setting, selection, crossover, and mutation in turn.
That is, after benchmark performance has been measured for all individuals, the fitness of each gene sequence is set according to its benchmark processing time. The individuals to keep are selected according to that fitness. The selected individuals undergo the GA operations of crossover, mutation, and unchanged copy to create the next-generation population.
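
A minimal Python sketch of the crossover, mutation, and copy operations is given below. The specification does not state which crossover operator is used, so one-point crossover is an assumption here, as are the function names and the pairing of adjacent selected individuals.

```python
import random


def crossover(parent_a, parent_b, rng):
    """One-point crossover; returns two children of the same length."""
    if len(parent_a) < 2:
        return list(parent_a), list(parent_b)
    point = rng.randrange(1, len(parent_a))
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])


def mutate(gene, pm, rng):
    """Flip each bit independently with probability pm."""
    return [1 - bit if rng.random() < pm else bit for bit in gene]


def next_generation(selected, pc, pm, rng):
    """Pair up selected individuals; cross each pair with probability pc,
    otherwise copy it, then apply mutation with per-bit probability pm."""
    children = []
    for i in range(0, len(selected) - 1, 2):
        a, b = list(selected[i]), list(selected[i + 1])
        if rng.random() < pc:
            a, b = crossover(a, b, rng)
        children.append(mutate(a, pm, rng))
        children.append(mutate(b, pm, rng))
    if len(selected) % 2 == 1:
        # An unpaired individual is copied and then subjected to mutation.
        children.append(mutate(list(selected[-1]), pm, rng))
    return children


rng = random.Random(1)
parents = [[1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 0, 0]]
children = next_generation(parents, pc=0.9, pm=0.05, rng=rng)
```

The population size and gene length are preserved from one generation to the next, so the same measurement loop can be applied to each generation.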

In step S121, after the GA processing for the designated number of generations has finished, the executable file creation unit 117 takes the C/C++ code corresponding to the highest-performance gene sequence (the highest-performance parallel processing pattern) as the solution.

<GA parameters>
The number of individuals, number of generations, crossover rate, mutation rate, fitness setting, and selection method described above are the parameters of the GA. The parameters and conditions of the Simple GA to be executed may be set, for example, as follows.
Gene length: number of loop statements that can be parallelized
Number of individuals M: gene length or less
Number of generations T: gene length or less
Fitness: (processing time)^(-1/2)
With this setting, the shorter the benchmark processing time, the higher the fitness. Setting the fitness to (processing time)^(-1/2) also prevents the fitness of a particular individual with a short processing time from becoming so high that the search range narrows. If a performance measurement does not finish within a fixed time, it is timed out and the fitness is calculated as if the processing time were a long time such as 1000 seconds. The timeout can be changed to suit the characteristics of the performance measurement.
Selection: roulette selection
In addition, elite preservation is performed: the gene with the highest fitness in a generation is carried over to the next generation without crossover or mutation.
Crossover rate Pc: 0.9
Mutation rate Pm: 0.05
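
The fitness setting and the selection rule above can be sketched in Python as follows. The function names are hypothetical, and the use of `random.Random.choices` for roulette selection is an implementation choice made for this illustration, not something stated in the specification.

```python
import random

TIMEOUT_PENALTY_SECONDS = 1000.0  # processing time assumed for a timed-out run


def fitness(processing_time, timed_out=False):
    """Fitness = (processing time)^(-1/2); timeouts count as a long run."""
    t = TIMEOUT_PENALTY_SECONDS if timed_out else processing_time
    if t <= 0:
        raise ValueError("processing time must be positive")
    return t ** -0.5


def select_with_elite(population, fitnesses, rng):
    """Keep the fittest individual unchanged (elite preservation) and fill
    the remaining slots by roulette selection weighted by fitness."""
    best = max(range(len(population)), key=lambda i: fitnesses[i])
    elite = population[best]
    rest = rng.choices(population, weights=fitnesses, k=len(population) - 1)
    return [elite] + rest


times = [1.0, 100.0, 25.0]                      # hypothetical benchmark times
genes = [[1, 0, 1], [0, 0, 0], [1, 1, 0]]
scores = [fitness(t) for t in times]
selected = select_with_elite(genes, scores, random.Random(0))
```

The square root moderates the selection pressure: a run of 1 second against one of 100 seconds gives a fitness ratio of 10 to 1, whereas a plain reciprocal would give 100 to 1 and let the fast individual dominate the roulette wheel.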

In this embodiment, applications that contain many loops and take a long time to execute are identified in advance with gcov, gprof, or the like, and offloading is then attempted on them. This makes it possible to find applications that can be accelerated efficiently.

To search for the offload portion in a shorter time, performance can be measured in parallel on multiple verification machines, one per individual. Adjusting the timeout to the application program also shortens the search; for example, a run can be timed out when the offloaded processing takes twice as long as execution on the CPU. A larger number of individuals and generations raises the chance of finding a high-performance solution. However, when each parameter is maximized, compilation and performance benchmarking must be performed (number of individuals) x (number of generations) times, which lengthens the time before production service can begin. In this embodiment, the numbers of individuals and generations are small for a GA, but the crossover rate Pc is set to a high value of 0.9 so that a wide range is searched and a solution with reasonable performance is found early.
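
A timeout tied to the CPU execution time can be sketched in Python as follows. The helper name `measure`, the factor of 2 as a default argument, and the use of `None` to signal a timeout or a failed run are assumptions for illustration; the commands launched here are trivial stand-ins for a real benchmark binary.

```python
import subprocess
import sys
import time


def measure(cmd, cpu_time, factor=2.0):
    """Run `cmd` and return its wall-clock time in seconds.

    Returns None if the run exceeds factor * cpu_time or exits with an
    error, so both cases can be handled like a timeout by the caller.
    """
    limit = cpu_time * factor
    start = time.perf_counter()
    try:
        subprocess.run(cmd, check=True, timeout=limit,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return None
    return time.perf_counter() - start


# Stand-in benchmarks: one finishes at once, one sleeps past its limit.
quick = measure([sys.executable, "-c", "pass"], cpu_time=30.0)
slow = measure([sys.executable, "-c", "import time; time.sleep(5)"],
               cpu_time=0.25)
```

A caller would map a `None` result to the long penalty time described under the GA parameters before computing the fitness.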

This concludes the description of <<Loop Statement Offload: Common>>. Next, <<Loop Statement Offload: C Language>> is described.

[<<Loop Statement Offload: C Language>>]
For loop statement offloading in the C language, the basic flow is the same as in <<Loop Statement Offload: Common>> above and can be made language-independent. The following describes in detail which steps depend on the C language and which do not.
In the code analysis for loop statement offloading, the code is parsed with a syntax analysis tool for the C language, such as Clang. Once the parser results are obtained, the loops and variables can be managed abstractly, independent of the language. Encoding whether each loop is processed on the GPU as a gene is also language-independent. When the gene information is turned into code, code to be executed on the GPU is created according to the gene information, so GPU processing and variable transfer are specified with OpenACC, an extended syntax for the C language.

For compilation, the OpenACC code is compiled with the PGI compiler or the like. Performance is measured using automatic measurement tools suited to the language, such as Jenkins and Selenium. Creating the next generation of genes, in which fitness is set from the performance measurement results and operations such as crossover are applied, is language-independent. Iterative execution and the determination of the final solution are also language-independent.
As described above, in loop statement offloading, the management of loops and variables and the gene processing of the GA can be applied independently of the language.

FIGS. 6A-B are flowcharts of <<Loop Statement Offload: C Language>>; FIG. 6A and FIG. 6B are joined by a connector.
The following processing is performed using an OpenACC compiler for C/C++.

<Code analysis>
In step S201, the application code analysis unit 112 (see FIG. 1) analyzes the code of the C/C++ application program.

<Loop statement identification>
In step S202, the parallel processing specification unit 114 (see FIG. 1) identifies the loop statements and reference relationships in the C/C++ application program.

<Loop iteration count>
In step S203, the parallel processing specification unit 114 runs a benchmark tool to determine the iteration count of each loop statement and sorts the loop statements by a threshold.

<Parallelizability of loop statements>
In step S204, the parallel processing specification unit 114 checks whether each loop statement can be processed in parallel.

<Repetition over loop statements>
The control unit (automatic offload function unit) 11 repeats the processing of steps S206-S207 once per loop statement, between the loop start of step S205 and the loop end of step S208.
In step S206, the parallel processing specification unit 114 specifies GPU processing for each loop statement with #pragma acc kernels, using the OpenACC syntax, and compiles the code.
In step S207, if an error occurs, the parallel processing specification unit 114 deletes #pragma acc kernels from the corresponding for statement.
In step S209, the parallel processing specification unit 114 counts the for statements that produce no compilation error and takes that count as the gene length.
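
Steps S205-S209 can be sketched in Python as a trial compilation per loop statement. The sketch assumes that each loop is tried on its own, which the text does not state explicitly; the callback `compiles_with_directive` is a hypothetical stand-in for adding #pragma acc kernels to one loop and invoking the compiler.

```python
def find_offloadable_loops(loop_ids, compiles_with_directive):
    """Try the GPU directive on each loop in turn and keep the loops that
    compile. The number of kept loops is used as the gene length."""
    offloadable = []
    for loop_id in loop_ids:
        if compiles_with_directive(loop_id):
            offloadable.append(loop_id)
        # On a compile error the directive is simply not kept for this loop.
    return offloadable, len(offloadable)


# Stand-in for the compiler: pretend loops 2 and 5 fail to compile.
failing = {2, 5}


def fake_compile(loop_id):
    return loop_id not in failing


candidates, gene_length = find_offloadable_loops(range(1, 7), fake_compile)
```

Each position of a gene sequence then corresponds to one entry of `candidates`, so loops that cannot be compiled for the GPU never enter the search.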

<Preparation of patterns for the designated number of individuals>
Next, as initial values, the parallel processing specification unit 114 prepares gene sequences for the designated number of individuals. Here, they are created by randomly assigning 0 and 1.
In step S210, the parallel processing specification unit 114 maps the C/C++ application code to genes. The patterns for the designated number of individuals are prepared by mapping gene sequences, to which 0 and 1 have been randomly assigned, to the genes.
According to each prepared gene sequence, a directive that specifies parallel processing is inserted into the C/C++ application code wherever the gene value is 1 (see, for example, the #pragma directive in FIG. 3).

The control unit (automatic offload function unit) 11 repeats the processing of steps S212-S219 for the designated number of generations, between the loop start of step S211 and the loop end of step S220.
Within each generation, the processing of steps S213-S216 is further repeated for the designated number of individuals, between the loop start of step S212 and the loop end of step S217. That is, the loop over individuals is nested inside the loop over generations.

<Data transfer specification>
In step S213, the data transfer specification unit 113 specifies data transfers from the variable reference relationships using explicit directives (#pragma acc data copy/in/out). Data transfer specification with explicit directives (#pragma acc data copy/in/out) was described with reference to FIG. 4.
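
As a rough illustration of how variable reference relationships could be turned into an explicit data directive, the following Python sketch builds a directive string from two sets of variable names. The classification rule used here (read only on the GPU gives copyin, written only gives copyout, both gives copy) is a simplifying assumption and is not the rule of FIG. 4; `data_directive` is a hypothetical helper, while copy, copyin, and copyout are standard OpenACC data clauses.

```python
def data_directive(read_on_gpu, written_on_gpu):
    """Build a '#pragma acc data' line from variables read and written
    inside the offloaded region. Returns '' if nothing is transferred."""
    reads, writes = set(read_on_gpu), set(written_on_gpu)
    groups = (
        ("copy", reads & writes),     # needed on the GPU and updated there
        ("copyin", reads - writes),   # only read on the GPU
        ("copyout", writes - reads),  # only produced on the GPU
    )
    clauses = [f"{name}({', '.join(sorted(names))})"
               for name, names in groups if names]
    if not clauses:
        return ""
    return "#pragma acc data " + " ".join(clauses)


# Hypothetical loop body: b = a + b; c = 2 * b
directive = data_directive(read_on_gpu=["a", "b"], written_on_gpu=["b", "c"])
```

Stating the transfers explicitly in this way avoids leaving the transfer decision for every variable to the compiler.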

<Compile>
In step S214, the parallel processing pattern creation unit 115 (see FIG. 1) compiles the C/C++ code, with directives specified according to the gene pattern, using the PGI compiler. That is, the parallel processing pattern creation unit 115 compiles the created C/C++ application code with the PGI compiler on the verification machine 14 equipped with a GPU.
A compilation error may occur, for example, when several nested for statements are specified for parallel processing at once. Such a case is treated in the same way as a timeout during performance measurement.

In step S215, the performance measurement unit 116 (see FIG. 1) deploys the executable file to the verification machine 14 equipped with a CPU and a GPU.
In step S216, the performance measurement unit 116 executes the deployed binary file and measures the benchmark performance with offloading.

In intermediate generations, a gene with the same pattern as one already measured is not measured again. That is, if the GA processing produces a gene pattern that has appeared before, the individual is neither compiled nor measured, and the earlier measured value is reused.
In step S218, the executable file creation unit 117 (see FIG. 1) evaluates the individuals so that a shorter processing time gives a higher fitness, and selects high-performance individuals.

In step S219, the executable file creation unit 117 applies crossover and mutation to the selected individuals to create the next-generation individuals. The next-generation individuals then undergo compilation, performance measurement, fitness setting, selection, crossover, and mutation in turn.
That is, after benchmark performance has been measured for all individuals, the fitness of each gene sequence is set according to its benchmark processing time. The individuals to keep are selected according to that fitness. The selected individuals undergo the GA operations of crossover, mutation, and unchanged copy to create the next-generation population.

In step S221, after the GA processing for the designated number of generations has finished, the executable file creation unit 117 takes the C/C++ code corresponding to the highest-performance gene sequence (the highest-performance parallel processing pattern) as the solution.

This concludes the description of <<Loop Statement Offload: C Language>>. Next, <<Loop Statement Offload: Python>> is described.

[<<Loop Statement Offload: Python>>]
For loop statement offloading in Python, there are two methods: interpreting the Python code with pyCUDA (see FIGS. 7A-B), and using pyACC, an interpreter that understands OpenACC (see FIGS. 8A-B and 9A-B). They are described in turn below.

<Method of interpreting Python code with pyCUDA>
In the code analysis for loop statement offloading, the code is parsed with a syntax analysis tool for Python, such as ast. Once the parser results are obtained, the loops and variables can be managed abstractly, independent of the language.
Encoding whether each loop is processed on the GPU as a gene is also language-independent. When the gene information is turned into code, code to be executed on the GPU is created according to the gene information, so GPU processing and variable transfer are specified in CUDA syntax.
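
As an illustration of the parsing step, the following Python sketch uses the standard `ast` module to list the for statements in a piece of source code together with the variable names that appear in each. The function name `find_for_loops` and the sample code are assumptions; a real analysis would also need to distinguish reads from writes.

```python
import ast


def find_for_loops(source):
    """Return (line number, sorted names used) for every for statement."""
    tree = ast.parse(source)
    loops = []
    for node in ast.walk(tree):
        if isinstance(node, ast.For):
            names = sorted({n.id for n in ast.walk(node)
                            if isinstance(n, ast.Name)})
            loops.append((node.lineno, names))
    return sorted(loops)


# Hypothetical application code with a doubly nested loop.
sample = """\
for i in range(n):
    for j in range(m):
        c[i][j] = a[i][j] + b[i][j]
"""
loops = find_for_loops(sample)
```

The resulting list of loops and names is a language-neutral record of the kind that the later gene mapping can work from.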

In the interpretation step, the Python code with the CUDA instructions added is interpreted with pyCUDA. Performance is measured using automatic measurement tools suited to the language, such as Jenkins. Creating the next generation of genes, in which fitness is set from the performance measurement results and operations such as crossover are applied, is language-independent. Iterative execution and the determination of the final solution are also language-independent. Instead of pyCUDA, pyACC, an interpreter that understands OpenACC, may be used. In that case, GPU processing of loop statements is specified in OpenACC syntax, as in the C language (described later).

In Python, the code that specifies GPU processing depends on the implementation; here, an implementation using Cupy, an open-source library that uses NVIDIA GPUs, is described.
In operation, for a loop statement specified for GPU processing in the Python code, NVIDIA CUDA commands are executed through the Cupy library and the processing is carried out on an NVIDIA GPU.

In this embodiment, as with the C language, whether each Python loop statement is processed on the GPU is selected by a genetic algorithm to find an appropriate offload pattern.

An example using Cupy is described below.
FIGS. 7A-B illustrate the method of interpreting Python code with pyCUDA. FIG. 7A shows an example of the code before conversion, and FIG. 7B shows an example after conversion (when the outermost of three nested for statements is specified for GPU processing).
As shown in FIG. 7A, a Python for statement is specified as a matrix operation. Cupy calls CUDA commands, and CUDA runs the processing on the GPU.

When Cupy is used, the for statement to be offloaded is not marked with a directive, as it is with #pragma in OpenACC for the C language.
When GPU processing is performed from Cupy through CUDA, the parallel operations on the GPU are matrix operations. The portion to be processed on the GPU is rewritten as shown in FIG. 7B:
for an arithmetic expression that lies inside a single or multiply nested for statement and has "array[subscript]" on its right-hand or left-hand side,
the part "[subscript 1][subscript 2]…" is rewritten as "[range start 1:range end 1, range start 2:range end 2, …]".
Replacing each subscript with a range turns the whole expression into a matrix operation.
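
The subscript-to-range rewriting can be sketched as a small text transformation in Python. This is only an illustration under stated assumptions: the function `to_slice_expr` is hypothetical, it handles only subscripts that are plain loop variables, and the rewritten expression is equivalent to the loop only when the loop body is an element-wise operation.

```python
import re


def to_slice_expr(expr, ranges):
    """Rewrite name[i][j]... into name[start_i:end_i,start_j:end_j,...].

    `ranges` maps each loop variable to its (start, end) pair. Subscript
    chains that use anything other than known loop variables are left
    unchanged.
    """
    def rewrite(match):
        name, chain = match.group(1), match.group(2)
        subs = re.findall(r"\[\s*(\w+)\s*\]", chain)
        if not all(s in ranges for s in subs):
            return match.group(0)
        parts = [f"{ranges[s][0]}:{ranges[s][1]}" for s in subs]
        return f"{name}[{','.join(parts)}]"

    return re.sub(r"(\w+)((?:\[\s*\w+\s*\])+)", rewrite, expr)


# Body of a hypothetical doubly nested loop over i in [0, n) and j in [0, m).
converted = to_slice_expr("c[i][j] = a[i][j] + b[i][j]",
                          {"i": (0, "n"), "j": (0, "m")})
```

After the rewrite, the enclosing for statements are no longer needed for this expression, and an array library can evaluate the whole range in one operation.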

This concludes the description of the method of interpreting Python code with pyCUDA. Next, the method using pyACC is described.

<Method using pyACC>
For loop statement offloading in Python, pyACC, an interpreter that understands OpenACC, may be used instead of pyCUDA above. In that case, GPU processing of loop statements is specified in OpenACC syntax, as in the C language. The method using pyACC is described below.

FIGS. 8A-B show code patterns when pyACC is used. FIG. 8A shows a for statement when pyACC is used, and FIG. 8B shows the code pattern created from the for statement of FIG. 8A.
The code pattern shown in FIG. 8B is used in place of the code pattern of FIG. 3.

FIGS. 9A-B are flowcharts of <<Loop Statement Offload: Python>>; FIG. 9A and FIG. 9B are joined by a connector. The code pattern shown in FIG. 8B is used in place of the code pattern of FIG. 3.
The following processing is performed using an OpenACC compiler for C/C++.

<Code analysis>
In step S301, the application code analysis unit 112 (see FIG. 1) analyzes the code of the Python application program.

<Loop statement identification>
In step S302, the parallel processing specification unit 114 (see FIG. 1) identifies the loop statements and reference relationships in the Python application program.

<Loop iteration count>
In step S303, the parallel processing specification unit 114 runs a benchmark tool to determine the iteration count of each loop statement and sorts the loop statements by a threshold.

<Parallelizability of loop statements>
In step S304, the parallel processing specification unit 114 checks whether each loop statement can be processed in parallel.

<Repetition over loop statements>
The control unit (automatic offload function unit) 11 repeats the processing of steps S306-S307 once per loop statement, between the loop start of step S305 and the loop end of step S308.
In step S306, the parallel processing specification unit 114 specifies GPU processing for each loop statement using a method suited to the GPU processing platform, and interprets the code. For example, when pyACC is used, #pragma acc kernels of OpenACC is used; when Cupy is used, the computation of the target loop is converted into a matrix computation; and when pyCUDA is used directly, CUDA syntax is used.

In step S307, if an error occurs, the parallel processing specification unit 114 removes the GPU processing specification from the corresponding for statement.
In step S309, the parallel processing specification unit 114 counts the for statements that produce no compilation error and takes that count as the gene length.

<Preparation of patterns for the designated number of individuals>
Next, as initial values, the parallel processing specification unit 114 prepares gene sequences for the designated number of individuals. Here, they are created by randomly assigning 0 and 1.
In step S310, the parallel processing specification unit 114 maps the Python application code to genes and prepares the patterns for the designated number of individuals.
According to each prepared gene sequence, a directive that specifies parallel processing is inserted into the Python application code wherever the gene value is 1.

The control unit (automatic offload function unit) 11 repeats the processing of steps S312-S319 for the designated number of generations, between the loop start of step S311 and the loop end of step S320.
Within each generation, the processing of steps S313-S316 is further repeated for the designated number of individuals, between the loop start of step S312 and the loop end of step S317. That is, the loop over individuals is nested inside the loop over generations.

<Data transfer specification>
In step S313, the data transfer specification unit 113 specifies data transfers from the variable reference relationships, using a method suited to the GPU processing platform.

<Compile>
In step S314, the parallel processing pattern creation unit 115 (see FIG. 1) interprets the Python application code, with directives specified according to the gene pattern, on the GPU processing platform.
A compilation error may occur, for example, when several nested for statements are specified for parallel processing at once. Such a case is treated in the same way as a timeout during performance measurement.

In step S315, the performance measurement unit 116 (see FIG. 1) deploys the executable file to the verification machine 14 equipped with a CPU and a GPU.
In step S316, the performance measurement unit 116 executes the deployed binary file and measures the benchmark performance with offloading.

In intermediate generations, a gene with the same pattern as one already measured is not measured again. That is, if the GA processing produces a gene pattern that has appeared before, the individual is neither compiled nor measured, and the earlier measured value is reused.
In step S318, the executable file creation unit 117 (see FIG. 1) evaluates the individuals so that a shorter processing time gives a higher fitness, and selects high-performance individuals.

In step S319, the executable file creation unit 117 applies crossover and mutation to the selected individuals to create the next-generation individuals. The next-generation individuals then undergo compilation, performance measurement, fitness setting, selection, crossover, and mutation in turn.
That is, after benchmark performance has been measured for all individuals, the fitness of each gene sequence is set according to its benchmark processing time. The individuals to keep are selected according to that fitness. The selected individuals undergo the GA operations of crossover, mutation, and unchanged copy to create the next-generation population.

In step S321, after the GA processing for the designated number of generations has finished, the executable file creation unit 117 takes the Python application code corresponding to the highest-performance gene sequence (the highest-performance parallel processing pattern) as the solution.
This concludes the description of <<Loop Statement Offload: Python>>. Next, <<Loop Statement Offload: Java>> is described.

[<<Loop Statement Offload: Java>>]
For loop statement offloading in Java, the code analysis parses the code with a syntax analysis tool for Java, such as JavaParser. Once the parser results are obtained, the loops and variables can be managed abstractly, independent of the language. Encoding whether each loop is processed on the GPU as a gene is also language-independent. When the gene information is turned into code, code to be executed on the GPU is created according to the gene information, so GPU processing or variable transfer is specified with Java lambda expressions.

The execution environment is IBM JDK (registered trademark), which can parallelize Java lambda expressions onto a GPU. IBM JDK is a virtual machine that executes parallel processing on the GPU according to Java lambda expressions.
Performance is measured using automatic measurement tools suited to the language, such as Jenkins (registered trademark). Creating the next generation of genes, in which fitness is set from the performance measurement results and operations such as crossover are applied, is language-independent. Iterative execution and the determination of the final solution are also language-independent.

FIGS. 10A-B show code patterns when IBM JDK is used. FIG. 10A shows a for statement when IBM JDK is used, and FIG. 10B shows the code pattern created from the for statement of FIG. 10A.
The code pattern shown in FIG. 10B is used in place of the code pattern of FIG. 3.

FIGS. 11A-B are flowcharts of <<Loop Statement Offload: Java>>; FIG. 11A and FIG. 11B are joined by a connector. The code pattern shown in FIG. 10B is used in place of the code pattern of FIG. 3.

<コード解析>
ステップＳ４０１で、アプリケーションコード分析部１１２（図１参照）は、Javaアプリケーションプログラムのコード解析を行う。 <Code Analysis>
In step S401, the application code analysis unit 112 (see FIG. 1) performs code analysis of the Java application program.

<ループ文特定>
ステップＳ４０２で、並列処理指定部１１４（図１参照）は、Javaアプリケーションプログラムのループ文、参照関係を特定する。 <Loop statement identification>
In step S402, the parallel processing specification unit 114 (see FIG. 1) identifies loop statements and reference relationships in the Java application program.

<ループ文ループ回数>
ステップＳ４０３で、並列処理指定部１１４は、ベンチマークツールを動作させ、ループ文ループ回数を把握し、閾値振分けする。 <Loop statement loop count>
In step S403, the parallel processing specification unit 114 runs a benchmark tool to grasp the number of loops of the loop statement and allocate a threshold value.

<ループ文の並列処理可能性>
ステップＳ４０４で、並列処理指定部１１４は、各ループ文の並列処理可能性をチェックする。 <Parallel processing of loop statements>
In step S404, the parallel processing specification unit 114 checks whether each loop statement can be processed in parallel.

<ループ文の繰り返し>
制御部（自動オフロード機能部）１１は、ステップＳ４０５のループ始端とステップＳ４０８のループ終端間で、ステップＳ４０６－Ｓ４０７の処理についてループ文の数だけ繰り返す。
ステップＳ４０６で、並列処理指定部１１４は、各ループ文に対して、Javaのlambda式を用いて、java.util.stream.IntStream.range(0,n).parallel().forEach(i -> {});でＧＰＵ処理を指定してコンパイルする。 <Repetition over loop statements>
The control unit (automatic offload function unit) 11 repeats the processing of steps S406-S407, between the loop start of step S405 and the loop end of step S408, once for each loop statement.
In step S406, the parallel processing specification unit 114 specifies GPU processing for each loop statement with a Java lambda expression of the form java.util.stream.IntStream.range(0,n).parallel().forEach(i -> {}); and compiles the result.

ステップＳ４０７で、並列処理指定部１１４は、エラー時は、該当for文からは、java.util.stream.IntStream.range(0,n).parallel().forEach(i -> {});を削除する。
ステップＳ４０９で、並列処理指定部１１４は、コンパイルエラーが出ないfor文の数をカウントし、遺伝子長とする。 In step S407, if the compilation produces an error, the parallel processing specification unit 114 removes java.util.stream.IntStream.range(0,n).parallel().forEach(i -> {}); from the corresponding for statement.
In step S409, the parallel processing specification unit 114 counts the for statements that compile without error and uses that count as the gene length.
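Steps S405 to S409 can be pictured with the following minimal Python sketch. It is an illustration under stated assumptions, not the implementation of the embodiment: the `Loop` record, the helper names and the `fake_compiles` check are hypothetical, every loop is assumed to count from 0, and the real flow would build each candidate with the IBM JDK instead of calling a stand-in.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Loop:
    """Hypothetical record for one for statement found by the parser."""
    var: str    # loop counter name
    upper: str  # exclusive upper bound expression
    body: str   # loop body as Java source


def to_parallel_lambda(loop: Loop) -> str:
    """Render the loop in the lambda form named in step S406."""
    return (f"java.util.stream.IntStream.range(0, {loop.upper})"
            f".parallel().forEach({loop.var} -> {{ {loop.body} }});")


def offload_candidates(loops: List[Loop],
                       compiles: Callable[[str], bool]) -> List[Loop]:
    """Steps S405-S408: keep only the loops whose lambda form compiles."""
    return [loop for loop in loops if compiles(to_parallel_lambda(loop))]


def fake_compiles(java_src: str) -> bool:
    """Stand-in for the real build (assumption): reject bodies that
    reassign a captured local, which a Java lambda is not allowed to do."""
    return "sum +=" not in java_src


loops = [
    Loop("i", "n", "c[i] = a[i] + b[i];"),
    Loop("j", "n", "sum += a[j];"),
    Loop("k", "m", "d[k] = 2 * d[k];"),
]
candidates = offload_candidates(loops, fake_compiles)
gene_length = len(candidates)  # step S409: one gene position per surviving loop
```

Only the loops that survive the trial compilation receive a gene position, so the search space explored by the later GA steps stays as small as the code allows.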

<指定個体数パターン準備>
次に、初期値として、並列処理指定部１１４は、指定個体数の遺伝子配列を準備する。ここでは、０と１をランダムに割当てて作成する。
ステップＳ４１０で、並列処理指定部１１４は、Javaアプリコードを、遺伝子にマッピングする。０と１がランダムに割当てられた指定個体数の遺伝子配列を遺伝子にマッピングするすることで、指定個体数パターンを準備する。
準備された遺伝子配列に応じて、遺伝子の値が１の場合は並列処理を指定するディレクティブをJavaアプリコードに挿入する。 <Preparation of designated population patterns>
Next, as initial values, the parallel processing specification unit 114 prepares gene sequences for the designated number of individuals. Here, they are created by randomly assigning 0 and 1.
In step S410, the parallel processing specification unit 114 maps the Java application code to genes. Mapping the gene sequences of the designated number of individuals, each with 0 and 1 randomly assigned, onto the code prepares the designated number of patterns.
According to the prepared gene sequence, a directive specifying parallel processing is inserted into the Java application code wherever the gene value is 1.
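The preparation of the initial patterns can be sketched as follows. This is only an illustration: `initial_population` and `apply_gene` are hypothetical helper names, and the loops are represented as plain strings holding the sequential and the parallel form of each for statement.

```python
import random
from typing import List


def initial_population(size: int, gene_length: int,
                       rng: random.Random) -> List[List[int]]:
    """Create `size` gene sequences with 0 and 1 assigned at random."""
    return [[rng.randint(0, 1) for _ in range(gene_length)]
            for _ in range(size)]


def apply_gene(gene: List[int], sequential: List[str],
               parallel: List[str]) -> List[str]:
    """Use the parallel form where the gene value is 1,
    and keep the original for statement where it is 0."""
    if not (len(gene) == len(sequential) == len(parallel)):
        raise ValueError("one gene position is needed per loop")
    return [p if bit == 1 else s
            for bit, s, p in zip(gene, sequential, parallel)]


population = initial_population(4, 3, random.Random(0))
pattern = apply_gene([1, 0],
                     ["for A (sequential)", "for B (sequential)"],
                     ["for A (parallel)", "for B (parallel)"])
```

Each gene sequence then corresponds to exactly one variant of the application code, which is what the following steps compile and measure.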

制御部（自動オフロード機能部）１１は、ステップＳ４１１のループ始端とステップＳ４２０のループ終端間で、ステップＳ４１２－Ｓ４１９の処理について指定世代数繰り返す。
また、上記指定世代数繰り返しにおいて、さらにステップＳ４１２のループ始端とステップＳ４１７のループ終端間で、ステップＳ４１３－Ｓ４１６の処理について指定個体数繰り返す。すなわち、指定世代数繰り返しの中で、指定個体数の繰り返しが入れ子状態で処理される。 The control unit (automatic offload function unit) 11 repeats the processing of steps S412-S419 for a designated number of generations between the loop start point of step S411 and the loop end point of step S420.
Within each of these generations, the processing of steps S413-S416 is further repeated for the designated number of individuals, between the loop start of step S412 and the loop end of step S417. In other words, the repetition over individuals is nested inside the repetition over generations.

<データ転送指定>
ステップＳ４１３で、データ転送指定部１１３は、変数参照関係から、Javaの記述でデータ転送を指定する。 <Data transfer specification>
In step S413, the data transfer specification unit 113 specifies data transfer in a Java description based on the variable reference relationships.

<コンパイル>
ステップＳ４１４で、並列処理パターン作成部１１５（図１参照）は、遺伝子パターンに応じてディレクティブ指定したJavaアプリコードをIBM JDKでビルドする。
ここで、ネストfor文を複数並列指定する場合等でコンパイルエラーとなることがある。この場合は、性能測定時の処理時間がタイムアウトした場合と同様に扱う。 <Compile>
In step S414, the parallel processing pattern creation unit 115 (see FIG. 1) builds the Java application code with directives specified according to the gene pattern using IBM JDK.
Here, a compilation error may occur, for example when several nested for statements are specified for parallel processing at once. Such a case is handled in the same way as a timeout of the processing time during performance measurement.

ステップＳ４１５で、性能測定部１１６（図１参照）は、ＣＰＵ-ＧＰＵ搭載の検証用マシン１４に、実行ファイルをデプロイする。
ステップＳ４１６で、性能測定部１１６は、配置したバイナリファイルを実行し、オフロードした際のベンチマーク性能を測定する。 In step S415, the performance measurement unit 116 (see FIG. 1) deploys the executable file to the verification machine 14 equipped with a CPU and GPU.
In step S416, the performance measurement unit 116 executes the deployed binary file and measures the benchmark performance when offloaded.

ここで、途中世代で、以前と同じパターンの遺伝子については測定せず、同じ値を使う。つまり、ＧＡ処理の中で、以前と同じパターンの遺伝子が生じた場合は、その個体についてはコンパイルや性能測定をせず、以前と同じ測定値を用いる。
ステップＳ４１８で、実行ファイル作成部１１７（図１参照）は、処理時間が短い個体ほど適合度が高くなるように評価し、性能の高い個体を選択する。 Here, in the intermediate generation, genes with the same pattern as before are not measured, and the same values are used. In other words, if a gene with the same pattern as before is generated during the GA process, compilation or performance measurement is not performed for that individual, and the same measured value as before is used.
In step S418, the executable file creation unit 117 (see FIG. 1) evaluates the individuals so that a shorter processing time gives a higher fitness, and selects high-performing individuals.

ステップＳ４１９で、実行ファイル作成部１１７は、選択された個体に対して、交叉、突然変異の処理を行い、次世代の個体を作成する。次世代の個体に対して、コンパイル、性能測定、適合度設定、選択、交叉、突然変異処理を行う。
すなわち、全個体に対して、ベンチマーク性能測定後、ベンチマーク処理時間に応じて、各遺伝子配列の適合度を設定する。設定された適合度に応じて、残す個体の選択を行う。選択された個体に対して、交叉処理、突然変異処理、そのままコピー処理のＧＡ処理を行い、次世代の個体群を作成する。 In step S419, the executable file creation unit 117 performs crossover and mutation processes on the selected individuals to create the next generation individuals. The next generation individuals are subjected to compilation, performance measurement, fitness setting, selection, crossover, and mutation processes.
That is, after the benchmark performance of all individuals has been measured, the fitness of each gene sequence is set according to its benchmark processing time. The individuals to be retained are selected according to that fitness. The selected individuals then undergo the GA operations of crossover, mutation, and copying as is, which creates the next-generation population.
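One generation of this GA processing might look like the sketch below. Several choices are assumptions made for illustration and are not stated in the description: fitness is taken as the reciprocal of the processing time, selection is roulette-wheel selection, crossover is single-point (so genes need at least two positions), and `toy_measure` replaces the real build-and-benchmark step. A pattern that fails to compile would simply be given the timeout value by the measurement function.

```python
import random
from typing import Callable, Dict, List, Tuple

Gene = Tuple[int, ...]


def measure_cached(gene: Gene, measure: Callable[[Gene], float],
                   cache: Dict[Gene, float]) -> float:
    """Measure a pattern once; reuse the stored time if it reappears."""
    if gene not in cache:
        cache[gene] = measure(gene)
    return cache[gene]


def next_generation(population: List[Gene],
                    measure: Callable[[Gene], float],
                    cache: Dict[Gene, float], rng: random.Random,
                    mutation_rate: float = 0.05) -> List[Gene]:
    times = [measure_cached(g, measure, cache) for g in population]
    weights = [1.0 / t for t in times]  # assumption: fitness = 1 / time
    children = [population[times.index(min(times))]]  # copy the fastest as is
    while len(children) < len(population):
        p1, p2 = rng.choices(population, weights=weights, k=2)  # selection
        cut = rng.randrange(1, len(p1))                         # crossover
        child = p1[:cut] + p2[cut:]
        children.append(tuple(b ^ 1 if rng.random() < mutation_rate else b
                              for b in child))                  # mutation
    return children


calls: List[Gene] = []


def toy_measure(gene: Gene) -> float:
    """Hypothetical benchmark: each offloaded loop saves two seconds."""
    calls.append(gene)
    return 10.0 - 2.0 * sum(gene)


rng = random.Random(1)
population = [tuple(rng.randint(0, 1) for _ in range(4)) for _ in range(6)]
cache: Dict[Gene, float] = {}
best_times = []
for _ in range(5):  # designated number of generations
    population = next_generation(population, toy_measure, cache, rng)
    best_times.append(min(cache.values()))
```

Keeping the measured times in a dictionary keyed by the gene sequence is one simple way to avoid compiling and measuring a pattern that has already appeared in an earlier generation.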

ステップＳ４２１で、実行ファイル作成部１１７は、指定世代数のＧＡ処理終了後、最高性能の遺伝子配列に該当するJavaアプリコード（最高性能の並列処理パターン）を解とする。
以上、第１の実施形態（「ループ文オフロード」）について説明した。 In step S421, the executable file creation unit 117 determines, after the GA process for the designated number of generations is completed, the Java application code corresponding to the gene sequence with the highest performance (parallel processing pattern with the highest performance) as a solution.
The first embodiment ("loop statement offload") has been described above.

（第２の実施形態）
第２の実施形態は、機能ブロックオフロードについて記載する。
図１２～図１３を参照して、機能ブロックオフロードの全体構成および動作を説明し、以下、機能ブロックオフロード：共通（図１４，図１５）、機能ブロックオフロード：Ｃ言語（図１６，図１７）、機能ブロックオフロード：Python（図１８，図１９）、機能ブロックオフロード：Java（図２０，図２１）を順に説明する。 Second Embodiment
The second embodiment describes function block offloading.
The overall configuration and operation of function block offload will be described with reference to Figures 12 and 13, and then the following will be described in order: function block offload: common (Figures 14 and 15), function block offload: C language (Figures 16 and 17), function block offload: Python (Figures 18 and 19), and function block offload: Java (Figures 20 and 21).

図１３は、本発明の第２の実施形態に係るオフロードサーバ２００の構成例を示す機能ブロック図である。図２と同一構成部分には、同一符号を付して重複箇所の説明を省略する。
オフロードサーバ２００は、アプリケーションプログラムの特定処理をアクセラレータに自動的にオフロードする装置である。 FIG. 13 is a functional block diagram showing a configuration example of an offload server 200 according to the second embodiment of the present invention. The same components as those in FIG. 2 are denoted by the same reference numerals, and the description of the overlapping parts will be omitted.
The offload server 200 is a device that automatically offloads specific processing of an application program to an accelerator.

図１２に示すように、オフロードサーバ２００は、制御部２１０と、入出力部１２と、記憶部１３０と、検証用マシン１４ (アクセラレータ検証用装置)と、を含んで構成される。As shown in FIG. 12, the offload server 200 includes a control unit 210, an input/output unit 12, a memory unit 130, and a verification machine 14 (accelerator verification device).

入出力部１２は、各機器等との間で情報の送受信を行うための通信インタフェースと、タッチパネルやキーボード等の入力装置や、モニタ等の出力装置との間で情報の送受信を行うための入出力インタフェースとから構成される。The input/output unit 12 is composed of a communication interface for transmitting and receiving information between each device, etc., and an input/output interface for transmitting and receiving information between input devices such as a touch panel or keyboard, and output devices such as a monitor.

記憶部１３０は、ハードディスクやフラッシュメモリ、ＲＡＭ（Random Access Memory）等により構成され、制御部２１０の各機能を実行させるためのプログラム（オフロードプログラム）や、制御部２１０の処理に必要な情報（例えば、中間言語ファイル(Intermediate file)１３２）が一時的に記憶される。The memory unit 130 is composed of a hard disk, flash memory, RAM (Random Access Memory), etc., and temporarily stores programs (offload programs) for executing each function of the control unit 210, and information necessary for processing by the control unit 210 (e.g., intermediate language file 132).

記憶部１３０は、コードパターンＤＢ（Code pattern database）２３０（後記）、テストケースＤＢ（Test case database）１３１を備える。 The memory unit 130 has a code pattern database 230 (described below) and a test case database 131.

テストケースＤＢ１３１には、性能試験項目が格納される。テストケースＤＢ１３１は、高速化するアプリケーションの性能を測定するような試験を行うための情報が格納される。例えば、画像分析処理の深層学習アプリケーションであれば、サンプルの画像とそれを実行する試験項目である。 The test case DB 131 stores performance test items, that is, the information needed to run tests that measure the performance of the application being accelerated. For a deep learning application for image analysis, for example, this is a sample image and a test item that runs the application on it.

検証用マシン１４は、環境適応ソフトウェアの検証用環境として、ＣＰＵ（Central Processing Unit）、ＧＰＵ、ＦＰＧＡを備える。The verification machine 14 is equipped with a CPU (Central Processing Unit), GPU, and FPGA as a verification environment for the environment-adaptive software.

<コードパターンＤＢ２３０>
・ＧＰＵライブラリ、ＩＰコアの記憶
コードパターンＤＢ２３０は、ＧＰＵやＦＰＧＡ等にオフロード可能なライブラリおよびＩＰコア（後記）を記憶する。すなわち、コードパターンＤＢ２３０は、後記<処理Ｂ－１>のために、特定のライブラリ、機能ブロックを高速化するＧＰＵ用ライブラリ（ＧＰＵライブラリ）やＦＰＧＡ用ＩＰコア（ＩＰコア）とそれに関連する情報を保持する。例えば、コードパターンＤＢ２３０は、ＦＦＴ等算術計算等のライブラリリスト（外部ライブラリリスト）を保持する。 <Code Pattern DB230>
・Storage of GPU libraries and IP cores
The code pattern DB 230 stores libraries and IP cores (described later) that can be offloaded to a GPU, FPGA, etc. That is, for <Process B-1> described later, the code pattern DB 230 holds GPU libraries and FPGA IP cores that speed up specific libraries and functional blocks, together with related information. For example, the code pattern DB 230 holds a list of libraries (external library list) for arithmetic calculations such as FFT.

・ＣＵＤＡライブラリの記憶
コードパターンＤＢ２３０は、ＧＰＵライブラリとして、例えばＣＵＤＡライブラリと当該ＣＵＤＡライブラリを利用するためのライブラリ利用手順とを記憶する。すなわち、後記<処理Ｃ－１>において、置換するライブラリやＩＰコアをＧＰＵやＦＰＧＡに実装し、ホスト側（ＣＰＵ）プログラムと繋ぐ場合、ライブラリ利用手順も含めて登録しておき、その手順に従って利用する。例えば、ＣＵＤＡライブラリでは、Ｃ言語コードからＣＵＤＡライブラリを利用する手順がライブラリとともに公開されているため、コードパターンＤＢ２３０にライブラリ利用手順も含めて登録しておく。 Storage of CUDA library The code pattern DB 230 stores, as a GPU library, for example, a CUDA library and a library usage procedure for using the CUDA library. That is, in <Process C-1> described later, when a library or IP core to be replaced is implemented in a GPU or FPGA and connected to a host side (CPU) program, the library usage procedure is also registered and used according to the procedure. For example, in the CUDA library, the procedure for using the CUDA library from C language code is published together with the library, so the library usage procedure is also registered in the code pattern DB 230.

・クラス、構造体の記憶
コードパターンＤＢ２３０は、ホストで計算する場合に記述が同様になる処理のクラスまたは構造体を記憶する。すなわち、後記<処理Ｂ－２>において、登録されていないライブラリ呼び出し以外の機能処理を検出するため、構文解析にてソースコードの定義記述からクラス、構造体等を検出する。コードパターンＤＢ２３０は、後記<処理Ｂ－２>のために、ホストで計算する場合に記述が同様になる処理のクラスまたは構造体を登録しておく。なお、クラスまたは構造体の機能処理に対して、高速化するライブラリやＩＰコアがあることは、類似性検出ツール（後記）で検出する。 Storage of classes and structures The code pattern DB 230 stores classes or structures of processes that will be written in a similar manner when calculated on the host. That is, in <Process B-2> described later, in order to detect functional processes other than unregistered library calls, classes, structures, etc. are detected from the definition description of the source code by syntax analysis. For <Process B-2> described later, the code pattern DB 230 registers classes or structures of processes that will be written in a similar manner when calculated on the host. The existence of libraries or IP cores that speed up the functional processes of the classes or structures is detected by a similarity detection tool (described later).

・OpenCLコードの記憶
コードパターンＤＢ２３０は、ＩＰコア関連の情報としてOpenCLコードを記憶する。コードパターンＤＢ２３０に、OpenCLコードを記憶しておくことで、OpenCLコードから、OpenCLインタフェースを用いたＣＰＵとＦＰＧＡの接続および、ＦＰＧＡへのＩＰコア実装が、XilinxやIntel等のＦＰＧＡベンダの高位合成ツール（後記）を介して行うことができる。 Storage of OpenCL code The code pattern DB 230 stores OpenCL code as information related to IP cores. By storing the OpenCL code in the code pattern DB 230, it is possible to connect a CPU and an FPGA using an OpenCL interface and to implement an IP core in an FPGA from the OpenCL code via a high-level synthesis tool (described later) of an FPGA vendor such as Xilinx or Intel.

<制御部２１０>
制御部２１０は、オフロードサーバ２００全体の制御を司る自動オフロード機能部（Automatic Offloading function）であり、記憶部１３０に格納されたプログラム（オフロードプログラム）を不図示のＣＰＵが、ＲＡＭに展開し実行することにより実現される。 <Control Unit 210>
The control unit 210 is an automatic offloading function that controls the entire offload server 200, and is realized by a CPU (not shown) that expands a program (offload program) stored in the memory unit 130 into RAM and executes it.

特に、制御部２１０は、ＣＰＵ向けの既存プログラムコードの中にＦＰＧＡやＧＰＵへオフロードすることで処理を高速化できる機能ブロックを検出し、検出した機能ブロックをＧＰＵ向けライブラリやＦＰＧＡ向けＩＰコア等に置き換えることで高速化をする機能ブロックのオフロード処理を行う。In particular, the control unit 210 detects functional blocks in existing program code for the CPU that can be offloaded to an FPGA or GPU to speed up processing, and performs offload processing of the functional blocks to speed up processing by replacing the detected functional blocks with libraries for the GPU or IP cores for the FPGA, etc.

制御部２１０は、アプリケーションコード指定部（Specify application code）１１１と、アプリケーションコード分析部（Analyze application code）１１２と、置換機能検出部２１３と、置換処理部２１４と、オフロードパターン作成部２１５と、性能測定部１１６と、実行ファイル作成部１１７と、本番環境配置部（Deploy final binary files to production environment）１１８と、性能測定テスト抽出実行部（Extract performance test cases and run automatically）１１９と、ユーザ提供部（Provide price and performance to a user to judge）１２０と、を備える。The control unit 210 includes an application code specification unit 111, an application code analysis unit 112, a replacement function detection unit 213, a replacement processing unit 214, an offload pattern creation unit 215, a performance measurement unit 116, an executable file creation unit 117, a production environment deployment unit 118, a performance measurement test extraction and execution unit 119, and a user provision unit 120.

<アプリケーションコード分析部１１２>
アプリケーションコード分析部１１２は、後記<処理Ａ－１>において、アプリケーションプログラムのソースコードを分析して、当該ソースコードに含まれる外部ライブラリの呼び出しを検出する。具体的には、アプリケーションコード分析部１１２は、Clang等の構文解析ツールを用いて、ループ文構造等とともに、コードに含まれるライブラリ呼び出しや、機能処理を分析するソースコードの分析を行う。 <Application Code Analysis Unit 112>
In <Process A-1> described later, the application code analysis unit 112 analyzes the source code of the application program and detects calls to external libraries contained in the source code. Specifically, the application code analysis unit 112 uses a syntax analysis tool such as Clang to analyze the source code to analyze library calls and functional processing contained in the code, along with loop statement structures, etc.

上述したコード分析は、オフロードするデバイスを想定した分析が必要になるため、一般化は難しい。ただし、ループ文や変数の参照関係等のコードの構造を把握したり、機能ブロックとしてＦＦＴ処理を行う機能ブロックであることや、ＦＦＴ処理を行うライブラリを呼び出している等を把握することは可能である。機能ブロックの判断は、オフロードサーバが自動判断することは難しい。これもDeckard等の類似性検出ツールを用いて類似度判定等で把握することは可能である。ここで、Clangは、C/C++向けツールであるが、解析する言語に合わせたツールを選ぶ必要がある。 The code analysis described above is difficult to generalize, as it requires analysis that takes into account the device to be offloaded. However, it is possible to grasp the code structure, such as loop statements and variable reference relationships, and to determine that a functional block performs FFT processing and that it calls a library that performs FFT processing. It is difficult for the offload server to automatically determine the functional block. This can also be grasped by using a similarity detection tool such as Deckard to determine the similarity. Here, Clang is a tool for C/C++, but it is necessary to choose a tool that suits the language to be analyzed.

また、アプリケーションコード分析部１１２は、後記<処理Ａ－２>において、ソースコードからクラスまたは構造体のコードを検出する。 In addition, in <Process A-2> described below, the application code analysis unit 112 detects class or structure code from the source code.

<置換機能検出部２１３>
置換機能検出部２１３は、後記<処理Ｂ－１>において、検出された呼び出しをキーにして、コードパターンＤＢ２３０からＧＰＵライブラリおよびＩＰコアを取得する。具体的には、置換機能検出部２１３は、検出したライブラリ呼び出しに対して、ライブラリ名をキーとして、コードパターンＤＢ２３０と照合することで、ＧＰＵ、ＦＰＧＡにオフロードできるオフロード可能処理を抽出する。 <Replacement function detection unit 213>
In <Process B-1> described later, the replacement function detection unit 213 uses the detected call as a key to acquire a GPU library and an IP core from the code pattern DB 230. Specifically, the replacement function detection unit 213 uses the library name as a key for the detected library call and compares it with the code pattern DB 230 to extract offloadable processing that can be offloaded to a GPU or FPGA.
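The name-keyed matching of <Process B-1> can be sketched as a dictionary lookup. The entries, field names and library names below are hypothetical placeholders; the actual contents and schema of the code pattern DB 230 are not given in the description.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Replacement:
    """Hypothetical code pattern DB entry."""
    gpu_library: Optional[str]   # GPU library that replaces the call, if any
    fpga_ip_core: Optional[str]  # FPGA IP core that replaces the call, if any
    usage_procedure: str         # registered library usage procedure


CODE_PATTERN_DB: Dict[str, Replacement] = {
    "fft2d": Replacement("gpu_fft2d", "fft2d_ip_core",
                         "plan, copy in, execute, copy out"),
    "matmul": Replacement("gpu_matmul", None,
                          "copy in, multiply, copy out"),
}


def find_offloadable(calls: Iterable[str],
                     db: Dict[str, Replacement]) -> Dict[str, Replacement]:
    """Keep only the detected library calls that are registered in the DB."""
    return {name: db[name] for name in calls if name in db}


found = find_offloadable(["printf", "fft2d", "malloc"], CODE_PATTERN_DB)
```

Calls that are not registered, such as `printf` and `malloc` in this toy input, are left untouched and continue to run on the CPU.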

ここで、コードパターンＤＢ２３０は、ＧＰＵライブラリとして、例えばＣＵＤＡライブラリと当該ＣＵＤＡライブラリを利用するためのライブラリ利用手順とを記憶している。そして、置換機能検出部２１３は、ライブラリ利用手順をもとに、コードパターンＤＢ２３０からＣＵＤＡライブラリを取得する。Here, the code pattern DB 230 stores, as a GPU library, for example, a CUDA library and a library usage procedure for using the CUDA library. Then, the replacement function detection unit 213 obtains the CUDA library from the code pattern DB 230 based on the library usage procedure.

置換機能検出部２１３は、後記<処理Ｂ－２>において、検出されたクラスまたは構造体（後記）の定義記述コードをキーにして、コードパターンＤＢ２３０からＧＰＵライブラリおよびＩＰコアを取得する。具体的には、置換機能検出部２１３は、コピーコードやコピー後変更した定義記述コードを検出する類似性検出ツールを用いて、置換元コードに含まれるクラスや構造体に対して、コードパターンＤＢ２３０から類似のクラスまたは構造体に紐づいて管理されているＧＰＵ、ＦＰＧＡにオフロードできるＧＰＵライブラリおよびＩＰコアを抽出する。In <Process B-2> described below, the replacement function detection unit 213 uses the definition description code of the detected class or structure (described below) as a key to obtain a GPU library and an IP core from the code pattern DB 230. Specifically, the replacement function detection unit 213 uses a similarity detection tool that detects copied code and definition description code that has been changed after copying to extract GPU libraries and IP cores that can be offloaded to a GPU or FPGA that are linked to a similar class or structure from the code pattern DB 230 for the class or structure included in the source code to be replaced.
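The similarity-based matching of <Process B-2> can be illustrated as follows. The description names Deckard, a tree-based clone detector; this sketch swaps in a much simpler token-sequence comparison from Python's standard `difflib` module purely for illustration. The threshold of 0.8 and the registered entry are assumptions.

```python
import difflib
import re
from typing import Dict, List, Optional


def tokens(src: str) -> List[str]:
    """Split source text into identifiers, numbers and single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", src)


def similarity(a: str, b: str) -> float:
    """Token-level similarity in [0, 1] (stand-in for a clone detector)."""
    return difflib.SequenceMatcher(None, tokens(a), tokens(b)).ratio()


def best_match(candidate: str, registered: Dict[str, str],
               threshold: float = 0.8) -> Optional[str]:
    """Return the registered definition most similar to the candidate,
    or None when nothing reaches the threshold."""
    scored = [(similarity(candidate, src), name)
              for name, src in registered.items()]
    if not scored:
        return None
    score, name = max(scored)
    return name if score >= threshold else None


REGISTERED = {
    "matrix_struct": "struct Matrix { int rows; int cols; double *data; };",
}
copied = "struct Matrix { int rows; int cols; float *data; };"
unrelated = "int main() { return 0; }"
```

A definition that was copied and then slightly changed, such as `copied` above, still scores close to the registered one, whereas unrelated code falls below the threshold.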

<置換処理部２１４>
置換処理部２１４は、後記<処理Ｃ－１>において、アプリケーションプログラムのソースコードの置換元の処理記述を、置換機能検出部２１３が取得した置換先のライブラリおよびＩＰコアの処理記述に置換する。具体的には、置換処理部２１４は、抽出したオフロード可能処理を、ＧＰＵ向けのライブラリやＦＰＧＡ向けのＩＰコア等に置換する。
また、置換処理部２１４は、置換したライブラリおよびＩＰコアの処理記述を、オフロード対象の機能ブロックとして、ＧＰＵやＦＰＧＡ等にオフロードする。具体的には、置換処理部２１４は、ＧＰＵ向けのライブラリやＦＰＧＡ向けのＩＰコア等に置換した機能ブロックを、ＣＰＵプログラムとのインタフェースを作成することでオフロードする。置換処理部２１４は、ＣＵＤＡ,OpenCL等の中間言語ファイル１３２を出力する。 <Replacement Processing Unit 214>
In <Processing C-1> described later, the replacement processing unit 214 replaces the processing description of the source of the application program source code with the processing description of the library and IP core of the replacement destination acquired by the replacement function detection unit 213. Specifically, the replacement processing unit 214 replaces the extracted offloadable processing with a library for GPU, an IP core for FPGA, or the like.
The replacement processing unit 214 also offloads the processing description of the replaced library and IP core to a GPU, FPGA, or the like as a functional block to be offloaded. Specifically, the replacement processing unit 214 offloads the functional block replaced by a library for a GPU or an IP core for an FPGA, or the like, by creating an interface with the CPU program. The replacement processing unit 214 outputs an intermediate language file 132 such as CUDA or OpenCL.

置換処理部２１４は、後記<処理Ｃ－２>において、アプリケーションプログラムのソースコードの置換元の処理記述を、取得したライブラリおよびＩＰコアの処理記述に置換するとともに、置換元と置換先で引数、戻り値の数または型が異なる場合に、その確認を通知する。 In <Process C-2> described later, the replacement processing unit 214 replaces the replacement-source processing description in the application program source code with the processing description of the acquired library and IP core and, if the number or types of arguments or return values differ between the replacement source and the replacement destination, sends a notification asking for the difference to be confirmed.
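The interface check of <Process C-2> amounts to comparing two signatures. A minimal sketch, assuming signatures are available as tuples of type names (the `Signature` record and the message wording are hypothetical):

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Signature:
    """Hypothetical view of a function interface."""
    params: Tuple[str, ...]   # argument types
    returns: Tuple[str, ...]  # return value types


def interface_differences(src: Signature, dst: Signature) -> List[str]:
    """List the differences that would need confirmation before replacing."""
    notes = []
    for label, a, b in (("argument", src.params, dst.params),
                        ("return value", src.returns, dst.returns)):
        if len(a) != len(b):
            notes.append(f"{label} count differs: {len(a)} vs {len(b)}")
        elif a != b:
            notes.append(f"{label} types differ: {a} vs {b}")
    return notes


host = Signature(params=("double*", "int"), returns=("void",))
gpu = Signature(params=("float*", "int", "int"), returns=("void",))
notes = interface_differences(host, gpu)
```

An empty list means the replacement can proceed without asking; any entry would be passed on as the confirmation request.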

置換処理部２１４は、《機能ブロックオフロード：Ｃ言語》では、ＣＵＤＡのライブラリ呼び出しを、ＰＧＩコンパイラに指定する。
置換処理部２１４は、《機能ブロックオフロード：Python》では、ＣＵＤＡのライブラリ呼び出しを、pyCudaで指定する。
置換処理部２１４は、《機能ブロックオフロード：Java》では、ＣＵＤＡのライブラリ呼び出しを、Jcudaで指定する。 In the case of <<Function block offload: C language>>, the replacement processing unit 214 specifies a CUDA library call to the PGI compiler.
In <<Function block offload: Python>>, the replacement processing unit 214 specifies a CUDA library call using pyCuda.
In <<Function Block Offload: Java>>, the replacement processing unit 214 specifies a CUDA library call using Jcuda.

<オフロードパターン作成部２１５>
オフロードパターン作成部２１５は、１以上のオフロードするパターンを作成する。具体的には、ホストプログラムとのインタフェースを作成し、検証環境での性能測定を通じて、オフロードするしないを試行することで、より高速となるオフロードパターンを抽出する。 <Offload Pattern Creation Unit 215>
The offload pattern creation unit 215 creates one or more offloading patterns. Specifically, an interface with the host program is created, and the offloading and non-offloading patterns are tried through performance measurement in a verification environment, thereby extracting an offloading pattern that provides a higher speed.

ここで、コードパターンＤＢ２３０は、ＩＰコア関連の情報としてOpenCLコードを記憶している。オフロードパターン作成部２１５は、ＦＰＧＡ等のＰＬＤにオフロードする場合は、OpenCLコードをもとにOpenCLインタフェースを用いてホストとＰＬＤとを接続するとともに、OpenCLコードをもとにＰＬＤへのＩＰコアの実装を行う。Here, the code pattern DB 230 stores OpenCL code as information related to IP cores. When offloading to a PLD such as an FPGA, the offload pattern creation unit 215 connects the host and the PLD using an OpenCL interface based on the OpenCL code, and implements the IP core in the PLD based on the OpenCL code.

OpenCLのＡＰＩに沿う、カーネルプログラムとホストプログラムのインタフェース記述について述べる。なお、下記説明は、後記［処理Ｃ］（ホスト側とのインタフェースの整合）の<処理Ｃ－１>の具体例に対応する。This section describes the interface description between the kernel program and the host program in accordance with the OpenCL API. Note that the following explanation corresponds to a specific example of <Process C-1> in [Process C] (matching the interface with the host side) described below.

OpenCLのＣ言語の文法に沿って作成したカーネルは、OpenCLのＣ言語のランタイムＡＰＩを利用して、作成するホスト（例えば、ＣＰＵ）側のプログラムによりデバイス（例えば、ＦＰＧＡ）で実行される。カーネル関数hello()をホスト側から呼び出す部分は、OpenCLランタイムＡＰＩの一つであるclEnqueueTask()を呼び出すことである。
ホストコードで記述するOpenCLの初期化、実行、終了の基本フローは、下記ステップ１～１３である。このステップ１～１３のうち、ステップ１～１０がカーネル関数hello()をホスト側から呼び出すまでの手続（準備）であり、ステップ１１でカーネルの実行となる。 A kernel created according to the OpenCL C language syntax is executed on a device (e.g., FPGA) by a program created on the host (e.g., CPU) side using the OpenCL C language runtime API. The part where the kernel function hello() is called from the host side is to call clEnqueueTask(), which is one of the OpenCL runtime APIs.
The basic flow of OpenCL initialization, execution, and termination described in the host code consists of the following steps 1 to 13. Of these steps 1 to 13, steps 1 to 10 are the procedure (preparation) until the kernel function hello() is called from the host side, and step 11 is the execution of the kernel.

１．プラットフォーム特定
OpenCLランタイムＡＰＩで定義されているプラットフォーム特定機能を提供する関数clGetPlatformIDs()を用いて、OpenCLが動作するプラットフォームを特定する。 1. Platform identification
The platform on which OpenCL runs is identified using a function clGetPlatformIDs( ) that provides platform identification functionality defined in the OpenCL runtime API.

２．デバイス特定
OpenCLランタイムＡＰＩで定義されているデバイス特定機能を提供する関数clGetDeviceIDs()を用いて、プラットフォームで使用するＧＰＵ等のデバイスを特定する。 2. Device Identification
A device such as a GPU to be used on the platform is identified using a function clGetDeviceIDs( ) that provides a device identification function defined in the OpenCL runtime API.

３．コンテキスト作成
OpenCLランタイムＡＰＩで定義されているコンテキスト作成機能を提供する関数clCreateContext()を用いて、OpenCLを動作させる実行環境となるOpenCLコンテキストを作成する。 3. Creating a context
An OpenCL context that serves as an execution environment for running OpenCL is created using a function clCreateContext( ) that provides a context creation function defined in the OpenCL runtime API.

４．コマンドキュー作成
OpenCLランタイムＡＰＩで定義されているコマンドキュー作成機能を提供する関数clCreateCommandQueue()を用いて、デバイスを制御する準備であるコマンドキューを作成する。OpenCLでは、コマンドキューを通して、ホストからデバイスに対する働きかけ（カーネル実行コマンドやホスト－デバイス間のメモリコピーコマンドの発行）を実行する。 4. Creating a command queue
A command queue is created in preparation for controlling the device using the function clCreateCommandQueue(), which provides a command queue creation function defined in the OpenCL runtime API. In OpenCL, the host issues commands to the device (such as issuing kernel execution commands and memory copy commands between the host and device) through the command queue.

５．メモリオブジェクト作成
OpenCLランタイムＡＰＩで定義されているデバイス上にメモリを確保する機能を提供する関数clCreateBuffer()を用いて、ホスト側からメモリオブジェクトを参照できるようにするメモリオブジェクトを作成する。 5. Creating a memory object
A memory object, through which the host side can refer to memory on the device, is created using clCreateBuffer(), the OpenCL runtime API function that allocates memory on the device.

６．カーネルファイル読み込み
デバイスで実行するカーネルは、その実行自体をホスト側のプログラムで制御する。このため、ホストプログラムは、まずカーネルプログラムを読み込む必要がある。カーネルプログラムには、OpenCLコンパイラで作成したバイナリデータや、OpenCL Ｃ言語で記述されたソースコードがある。このカーネルファイルを読み込む（記述省略）。なお、カーネルファイル読み込みでは、OpenCLランタイムＡＰＩは使用しない。 6. Loading a kernel file The execution of the kernel executed on the device is controlled by the host program. For this reason, the host program must first load the kernel program. The kernel program includes binary data created by the OpenCL compiler and source code written in the OpenCL C language. This kernel file is loaded (description omitted). Note that the OpenCL runtime API is not used when loading the kernel file.

７．プログラムオブジェクト作成
OpenCLでは、カーネルプログラムをプログラムオブジェクトとして認識する。この手続きがプログラムオブジェクト作成である。
OpenCLランタイムＡＰＩで定義されているプログラムオブジェクト作成機能を提供する関数clCreateProgramWithSource()を用いて、ホスト側からメモリオブジェクトを参照できるようにするプログラムオブジェクトを作成する。カーネルプログラムのコンパイル済みバイナリ列から作成する場合は、clCreateProgramWithBinary()を使用する。 7. Creating a program object
In OpenCL, a kernel program is recognized as a program object. This procedure is called program object creation.
A program object is created using clCreateProgramWithSource(), the OpenCL runtime API function that creates a program object from kernel source code, so that the host side can refer to the kernel program. To create the program object from a compiled binary of the kernel program instead, clCreateProgramWithBinary() is used.

８．ビルド
ソースコードとして登録したプログラムオブジェクトを OpenCL Ｃコンパイラ・リンカを使いビルドする。
OpenCLランタイムＡＰＩで定義されているOpenCL Ｃコンパイラ・リンカによるビルドを実行する関数clBuildProgram()を用いて、プログラムオブジェクトをビルドする。なお、clCreateProgramWithBinary()でコンパイル済みのバイナリ列からプログラムオブジェクトを生成した場合、このコンパイル手続は不要である。 8. Build Build the program object registered as source code using the OpenCL C compiler and linker.
A program object is built using the function clBuildProgram(), which executes a build by the OpenCL C compiler and linker defined in the OpenCL runtime API. Note that if a program object is generated from a compiled binary string using clCreateProgramWithBinary(), this compilation procedure is not necessary.

９．カーネルオブジェクト作成
OpenCLランタイムＡＰＩで定義されているカーネルオブジェクト作成機能を提供する関数clCreateKernel()を用いて、カーネルオブジェクトを作成する。１つのカーネルオブジェクトは、１つのカーネル関数に対応するので、カーネルオブジェクト作成時には、カーネル関数の名前(hello)を指定する。また、複数のカーネル関数を１つのプログラムオブジェクトとして記述した場合、１つのカーネルオブジェクトは、１つのカーネル関数に１対１で対応するので、clCreateKernel()を複数回呼び出す。 9. Creating a kernel object
A kernel object is created using the function clCreateKernel(), which provides a kernel object creation function defined in the OpenCL runtime API. Since one kernel object corresponds to one kernel function, the name of the kernel function (hello) is specified when creating the kernel object. Furthermore, if multiple kernel functions are written as one program object, one kernel object corresponds one-to-one to one kernel function, so clCreateKernel() is called multiple times.

１０．カーネル引数設定
OpenCLランタイムＡＰＩで定義されているカーネルへ引数を与える（カーネル関数が持つ引数へ値を渡す）機能を提供する関数clSetKernelArg()を用いて、カーネル引数を設定する。
以上、上記ステップ１～１０で準備が整い、ホスト側からデバイスでカーネルを実行するステップ１１に入る。 10. Kernel argument settings
The kernel arguments are set using clSetKernelArg(), the OpenCL runtime API function that gives arguments to the kernel (that is, passes values to the arguments of the kernel function).
With steps 1 to 10 above, preparation is complete, and the flow proceeds to step 11, in which the host side has the kernel executed on the device.

１１．カーネル実行
カーネル実行（コマンドキューへ投入）は、デバイスに対する働きかけとなるので、コマンドキューへのキューイング関数となる。
OpenCLランタイムＡＰＩで定義されているカーネル実行機能を提供する関数clEnqueueTask()を用いて、カーネルhelloをデバイスで実行するコマンドをキューイングする。カーネルhelloを実行するコマンドがキューイングされた後、デバイス上の実行可能な演算ユニットで実行されることになる。 11. Kernel Execution Kernel execution (submission to the command queue) is an action on the device, so it is a queuing function for the command queue.
A command to execute the kernel hello on the device is queued using a function clEnqueueTask() that provides a kernel execution function defined in the OpenCL runtime API. After the command to execute the kernel hello is queued, it will be executed on an executable computing unit on the device.

１２．メモリオブジェクトからの読み込み
OpenCLランタイムＡＰＩで定義されているデバイス側のメモリからホスト側のメモリへデータをコピーする機能を提供する関数clEnqueueReadBuffer()を用いて、デバイス側のメモリ領域からホスト側のメモリ領域にデータをコピーする。また、ホスト側からデバイス側のメモリへデータをコピーする機能を提供する関数clEnqueueWriteBuffer()を用いて、ホスト側のメモリ領域からデバイス側のメモリ領域にデータをコピーする。なお、これらの関数は、デバイスに対する働きかけとなるので、一度コマンドキューへコピーコマンドがキューイングされてからデータコピーが始まることになる。 12. Reading from a memory object
Data is copied from the device memory area to the host memory area using clEnqueueReadBuffer(), the OpenCL runtime API function that copies data from device memory to host memory. Conversely, data is copied from the host memory area to the device memory area using clEnqueueWriteBuffer(), the function that copies data from host memory to device memory. Because these functions act on the device, the copy command is first queued in the command queue, and the data copy begins only after that.

１３．オブジェクト解放
最後に、ここまでに作成してきた各種オブジェクトを解放する。
以上、OpenCL Ｃ言語に沿って作成されたカーネルの、デバイス実行について説明した。 13. Releasing objects Finally, release the various objects that you have created up to this point.
The above describes device execution of a kernel written in accordance with the OpenCL C language.

<性能測定部１１６>
性能測定部１１６は、作成された処理パターンのアプリケーションプログラムをコンパイルして、検証用マシン１４に配置し、ＧＰＵやＦＰＧＡ等にオフロードした際の性能測定用処理を実行する。
性能測定部１１６は、バイナリファイル配置部（Deploy binary files）１１６ａを備える。バイナリファイル配置部１１６ａは、ＧＰＵやＦＰＧＡを備えた検証用マシン１４に、中間言語から導かれるバイナリファイルをデプロイ(配置)する。 <Performance Measuring Unit 116>
The performance measurement unit 116 compiles an application program of the created processing pattern, places it on the verification machine 14, and executes processing for performance measurement when offloaded to a GPU, FPGA, or the like.
The performance measurement unit 116 includes a binary file deployment unit 116a. The binary file deployment unit 116a deploys (places) binary files derived from the intermediate language in the verification machine 14 that includes a GPU and/or FPGA.

性能測定部１１６は、配置したバイナリファイルを実行し、オフロードした際の性能を測定するとともに、性能測定結果を、バイナリファイル配置部１１６ａに戻す。この場合、性能測定部１１６は、抽出された別の処理パターンを用いて、抽出された中間言語をもとに、性能測定を試行する（後記図１３の符号ｇ参照）。The performance measurement unit 116 executes the placed binary file, measures the performance when offloaded, and returns the performance measurement results to the binary file placement unit 116a. In this case, the performance measurement unit 116 attempts to measure performance based on the extracted intermediate language using another extracted processing pattern (see symbol g in Figure 13 below).

A specific example of performance measurement is as follows.
The offload pattern creation unit 215 creates processing patterns that offload functional blocks that can be offloaded to a GPU or FPGA, and the executable file creation unit 117 compiles the intermediate language of each created processing pattern. The performance measurement unit 116 measures the performance of the compiled program ("first performance measurement").

The offload pattern creation unit 215 then lists the measured processing patterns that achieved higher performance than the CPU. It creates new processing patterns that offload combinations of the patterns in the list, generates the intermediate language for each combined offload processing pattern, and the executable file creation unit 117 compiles that intermediate language.
The performance measurement unit 116 measures the performance of the compiled program ("second performance measurement").
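The two measurement rounds described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `select_offload_pattern`, the `measure` callback (standing in for compiling a pattern and timing it on the verification machine), and the use of elapsed time as the only metric are all assumptions made for the example.

```python
from itertools import combinations


def select_offload_pattern(blocks, measure, cpu_time):
    """Two-round search: measure each block alone, then combine the winners.

    blocks   -- names of offloadable functional blocks (hypothetical)
    measure  -- callable taking a frozenset of block names, returning seconds
    cpu_time -- elapsed time of the CPU-only program
    """
    # Round 1: offload one functional block at a time.
    single = {frozenset([b]): measure(frozenset([b])) for b in blocks}
    faster = [p for p, t in single.items() if t < cpu_time]

    # Round 2: combine the patterns that beat the CPU-only baseline.
    results = dict(single)
    for r in range(2, len(faster) + 1):
        for combo in combinations(faster, r):
            pattern = frozenset().union(*combo)
            results[pattern] = measure(pattern)

    # Keep the CPU-only pattern as a fallback when nothing is faster.
    results[frozenset()] = cpu_time
    return min(results.items(), key=lambda kv: kv[1])
```

The function returns the fastest pattern together with its measured time, which corresponds to choosing the pattern from which the final executable file is built. Enumerating every combination of the first-round winners is a simplification that only stays practical while that list is short.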

<Executable File Creation Unit 117>
The executable file creation unit 117 compiles the intermediate language of the processing pattern to be offloaded and creates an executable file. Based on the results of performance measurements repeated a fixed number of times, it selects the processing pattern with the highest processing performance from the one or more processing patterns and compiles that pattern to create the final executable file.

<Performance Measurement Test Extraction Execution Unit 119>
After the executable file is deployed, the performance measurement test extraction execution unit 119 extracts performance test items from the test case DB 131 and executes the performance tests.
It runs the extracted tests automatically so that the resulting performance can be shown to the user.

<User Providing Unit 120>
The user providing unit 120 presents the user with information such as price and performance based on the performance test results ("Providing information such as price and performance to the user"). The test case DB 131 stores data for automatically running tests that measure application performance. The user providing unit 120 presents the user with the results of running the test data in the test case DB 131, together with the price of the entire system, which is determined from the unit prices of the resources used in the system (virtual machines, FPGA instances, GPU instances, etc.). Based on the presented price and performance information, the user decides whether to start paid use of the service.
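As a rough illustration of how the system price could be derived from per-resource unit prices, consider the sketch below. The resource names, the prices, and the function names `system_price` and `build_offer` are hypothetical. The source only states that the total is determined from the unit prices of the resources used.

```python
def system_price(resources, unit_prices):
    """Sum unit price x quantity over every resource the system uses."""
    return sum(unit_prices[name] * count for name, count in resources.items())


def build_offer(test_results, resources, unit_prices):
    """Bundle what the user sees: measured performance plus the total price."""
    return {
        "performance": test_results,
        "price": system_price(resources, unit_prices),
    }
```

For example, a system using two virtual machines and one GPU instance would be priced as twice the virtual machine unit price plus the GPU instance unit price.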

The functional block offload processing of the offload server 200 configured as described above is explained below.

An outline of the functional block offload processing and the points to consider are described first.
Because hardware circuit design for FPGAs takes a great deal of time, functions that have been designed once are often made reusable in the form of IP cores (Intellectual Property Cores). An IP core is partial circuit information used to build semiconductors such as FPGAs, ICs, and LSIs, and is typically packaged per function. Typical IP core functions include encryption/decryption, arithmetic operations such as FFT (Fast Fourier Transform), image processing, and audio processing. Many IP cores require a license fee, but some are provided free of charge.

In the second embodiment, IP cores are used for automatic offloading to FPGAs. For GPUs, the term IP core is not used, but FFT, linear algebra operations, and the like are typical functions, and libraries such as cuFFT and cuBLAS, implemented using CUDA, are provided free of charge as GPU libraries. The second embodiment makes use of these libraries for GPUs.

In the second embodiment, when existing program code written for a CPU contains functional blocks, such as FFT processing, that can be accelerated by offloading them to a GPU or FPGA, acceleration is achieved by replacing those blocks with GPU libraries, FPGA IP cores, or the like.

[Outline of offload processing of functional blocks]
The offload server 200 of the second embodiment is an example in which the elemental technology of environment-adaptive software is applied to automatic offloading of user application logic to GPUs and FPGAs.
FIG. 13 is a diagram showing the offload processing of functional blocks by the offload server 200.
As shown in FIG. 13, the offload server 200 is applied to an elemental technology of environment-adaptive software. The offload server 200 has a control unit (automatic offload function unit) 11, a code pattern DB 230, a test case DB 131, an intermediate language file 132, and a verification machine 14.
The offload server 200 acquires the application code 130 used by the user.

The user uses, for example, various devices 151, a device 152 having a CPU and a GPU, a device 153 having a CPU and an FPGA, and a device 154 having a CPU. The offload server 200 automatically offloads functional processing to the accelerators of the device 152 having a CPU and a GPU and the device 153 having a CPU and an FPGA.

The operation of each unit is described below with reference to the step numbers in FIG. 13.
<Step S31: Specify application code>
In step S31, the application code designation unit 111 (see FIG. 12) passes the application code described in the received file to the application code analysis unit 112.

<Step S32: Analyze application code> (Code analysis)
In step S32, the application code analysis unit 112 (see FIG. 12) analyzes the source code using a syntax analysis tool such as Clang, identifying loop statement structures as well as the library calls and functional processing contained in the code.

<Step S33: Extract offloadable area> (Extraction of offloadable processing)
In step S33, the replacement function detection unit 213 (see FIG. 12) compares the identified library calls and functional processing with the code pattern DB 230 to extract offloadable processing that can be offloaded to a GPU or FPGA.

<Step S34: Output intermediate file> (Output of intermediate file for offloading)
In step S34, the replacement processing unit 214 (see FIG. 12) replaces the extracted offloadable processing with GPU libraries, FPGA IP cores, or the like. The replacement processing unit 214 offloads the replaced functional blocks by creating an interface with the CPU program, and outputs an intermediate language file 132 in CUDA, OpenCL, or the like. Intermediate language extraction is not completed in a single pass. It is repeated so that execution can be tried and optimized in the search for an appropriate offload area.
At this point it is not known whether the offloadable processing will actually lead to a speedup or whether it will be sufficiently cost-effective. The offload pattern creation unit 215 therefore tries offloading and not offloading each candidate through performance measurements in the verification environment, described later, and extracts the offload pattern that is faster.

<Step S21: Deploy binary files> (Deployment and performance measurement trial)
In step S21, the binary file deployment unit 116a (see FIG. 12) deploys the executable file derived from the intermediate language on the verification machine 14 equipped with a GPU and an FPGA. The binary file deployment unit 116a starts the deployed file, executes the assumed test cases, and measures the performance when offloaded.

<Step S22: Measure performances>
In step S22, the performance measurement unit 116 (see FIG. 12) executes the deployed file and measures the performance when offloaded.

As shown by symbol g in FIG. 13, the control unit 210 repeatedly executes steps S12 to S22 above. The automatic offload function of the control unit 210 can be summarized as follows. The application code analysis unit 112 analyzes the source code using a syntax analysis tool such as Clang, identifying loop statement structures as well as the library calls and functional processing contained in the code. The replacement function detection unit 213 compares the detected library calls and functional processing with the code pattern DB 230 to extract offloadable processing that can be offloaded to a GPU or FPGA. The replacement processing unit 214 replaces the extracted offloadable processing with GPU libraries, FPGA IP cores, or the like. The offload pattern creation unit 215 then offloads the replaced functional blocks by creating an interface with the CPU program.

Steps S11 to S25 above are carried out in the background while the user is using the service, and are assumed to be carried out, for example, during the first day of trial use.

As described above, when applied to the elemental technology of environment-adaptive software, the control unit (automatic offload function unit) 210 of the offload server 200 extracts the area to be offloaded from the source code of the application program used by the user and outputs an intermediate language in order to offload functional processing (steps S31 to S34). The control unit 210 deploys and executes the executable file derived from the intermediate language on the verification machine 14 and verifies the offload effect (steps S21 to S22). After repeating the verification and determining an appropriate offload area, the control unit 210 deploys the executable file in the production environment actually provided to the user and provides it as a service (steps S23 to S25).

In general, it is difficult to automatically discover the setting that gives maximum performance in a single attempt. A feature of the present invention is therefore that offload patterns are tried by repeating performance measurements several times in a verification environment in order to find a pattern that achieves a speedup.

[Details of offload processing of functional blocks]
Offloading a functional block requires three elements to be considered: detecting the functional block (hereinafter, "Process A"), detecting whether an existing library, IP core, or the like for offloading is available for that functional block (hereinafter, "Process B"), and matching the interface with the host side when the functional block is replaced with the library, IP core, or the like (hereinafter, "Process C"). The offload processing of functional blocks is described in detail below in terms of these three elements.

[Process A] (Detection of functional blocks)
Process A (detection of functional blocks) is divided into <Process A-1>, which detects library function calls and treats each call as a functional block, and <Process A-2>, which, when the call is to an unregistered library, detects classes, structures, and the like and treats them as functional blocks. In other words, <Process A-1> detects function calls to existing libraries and treats them as functional blocks, and <Process A-2> extracts functional blocks from classes or structures when <Process A-1> does not detect a functional block.

<Process A-1>
The application code analysis unit 112 uses syntax analysis to detect that the source code calls a function of an external library. In detail, the code pattern DB 230 holds a list of libraries for arithmetic calculations such as FFT. The application code analysis unit 112 parses the source code and compares it with the library list held in the code pattern DB 230 to detect that a function of an external library is being called.
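The matching step in <Process A-1> can be sketched as below. This is only an illustration under stated assumptions: a regular expression stands in for a real parser such as Clang (so comments, strings, macros, and function definitions are not handled correctly), and `LIBRARY_LIST` with the names `fft2d` and `matmul` is a hypothetical stand-in for the library list held in the code pattern DB 230.

```python
import re

# Hypothetical library list standing in for the one held in code pattern DB 230.
LIBRARY_LIST = {"fft2d", "matmul"}

# Crude approximation of "identifier followed by an opening parenthesis".
# A real implementation would walk the syntax tree produced by a parser.
CALL_RE = re.compile(r"\b([A-Za-z_]\w*)\s*\(")


def detect_library_calls(c_source, library_list=LIBRARY_LIST):
    """Return registered library functions called in the source, in order of first use."""
    found = []
    for name in CALL_RE.findall(c_source):
        if name in library_list and name not in found:
            found.append(name)
    return found
```

Keywords such as `for` and unregistered functions such as `printf` also match the pattern, but they are filtered out because they do not appear in the library list.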

<Process A-2>
To detect as functional blocks those functional processes that are not calls to a registered library, the application code analysis unit 112 uses syntax analysis to detect the functional processing of classes or structures from the definitions in the source code. For example, the application code analysis unit 112 detects structures, which are types that group several variables and are defined with struct in the C language, and classes, which are reference types, in contrast to structures, whose instantiated objects are value types. The application code analysis unit 112 also detects classes that are used in place of structures, for example in Java (registered trademark).

[Process B] (Detection of offloadable functions)
Process B (detection of offloadable functions) is divided into <Process B-1>, which follows <Process A-1> and refers to the code pattern DB 230 to obtain replaceable GPU libraries and IP cores, and <Process B-2>, which follows <Process A-2> and replaces the source processing description in the application code with the processing description of the destination GPU library or IP core. In other words, <Process B-1> obtains replaceable GPU libraries and IP cores from the code pattern DB 230 using the library name as the key. <Process B-2> detects replaceable GPU libraries and IP cores using the code of a class, structure, or the like as the key, and replaces the source processing description in the application code with the processing description of the destination GPU library or IP core.

As a premise for Process B, the code pattern DB 230 holds GPU libraries and FPGA IP cores that accelerate specific libraries and functional blocks, together with related information. For the libraries and functional blocks to be replaced, the code pattern DB 230 also registers the code and executable files along with the function names.

<Process B-1>
For each library call detected by the application code analysis unit 112 in <Process A-1>, the replacement function detection unit 213 searches the code pattern DB 230 using the library name as the key and obtains from it a replaceable GPU library (a GPU library that can provide acceleration) or an FPGA IP core.

An example of <Process B-1> is as follows.
For example, if the processing to be replaced is 2D FFT processing (code is available in Non-Patent Document 4 and elsewhere), the replacement function detection unit 213 uses the external library name as the key and detects OpenCL code (host program, kernel program, etc.) as the FPGA processing that performs the 2D FFT. The OpenCL code is stored in the code pattern DB 230.

Likewise, if the processing to be replaced is 2D FFT processing, the replacement function detection unit 213 replaces it with a function call in cuFFT, which is detected as the GPU library. The GPU library is stored in the code pattern DB 230.
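The lookup in <Process B-1> amounts to a keyed search, which can be sketched as follows. The dictionary layout, the key `fft2d`, and the entry name `fft2d_opencl` are hypothetical. The source only states that the code pattern DB 230 is searched by library name and returns a GPU library (cuFFT in the 2D FFT example) or OpenCL code for an FPGA.

```python
# Hypothetical contents of code pattern DB 230, keyed by the CPU library name.
CODE_PATTERN_DB = {
    "fft2d": {
        "gpu_library": "cuFFT",          # GPU library named in the 2D FFT example
        "fpga_ip_core": "fft2d_opencl",  # hypothetical OpenCL host + kernel entry
    },
}


def find_replacements(library_calls, db=CODE_PATTERN_DB):
    """Map each detected library call to its registered GPU/FPGA replacements."""
    return {name: db[name] for name in library_calls if name in db}
```

Calls with no entry in the database are simply left out of the result, which corresponds to leaving them on the CPU.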

<Process B-2>
Using the code of the classes, structures, and the like detected by the application code analysis unit 112 in <Process A-2> as the key, the replacement function detection unit 213 searches the code pattern DB 230 with a similarity detection tool and obtains a replaceable GPU library (a GPU library that can provide acceleration) or an FPGA IP core. A similarity detection tool, such as Deckard, is a tool for detecting copied code and code that was modified after being copied. With such a tool, the replacement function detection unit 213 can detect some of the processing whose description tends to look the same when written for a CPU, such as matrix calculation code, as well as processing made by copying and modifying someone else's code. Classes that were newly and independently written are out of scope, because the similarity detection tool has difficulty detecting them.

An example of <Process B-2> is as follows.
For the classes and structures detected in the source CPU code, the replacement function detection unit 213 searches for similar classes and structures registered in the code pattern DB 230 using a similarity detection tool such as Deckard. For example, if the processing to be replaced (code is available in Non-Patent Document 4 and elsewhere) is a 2D FFT class, the 2D FFT class registered in the code pattern DB 230 is detected as its similar class. IP cores and GPU libraries to which 2D FFT can be offloaded are registered in the code pattern DB 230. Therefore, as in <Process B-1>, OpenCL code (host program, kernel program, etc.) or a GPU library is detected for the 2D FFT.
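The idea of matching a detected class or structure against registered code by similarity can be illustrated with the sketch below. It does not reproduce Deckard, which works on syntax trees. As a swapped-in, much simpler technique, it normalizes identifiers and numbers and compares token sequences with Python's `difflib.SequenceMatcher`. The function names, the keyword set, and the 0.8 threshold are assumptions.

```python
import difflib
import re

KEYWORDS = {"for", "int", "float", "double", "return", "struct", "void"}


def normalize(code):
    """Replace identifiers and numbers with placeholders so renamed copies still match."""
    out = []
    for tok in re.findall(r"[A-Za-z_]\w*|\d+|\S", code):
        if tok in KEYWORDS:
            out.append(tok)
        elif re.match(r"[A-Za-z_]", tok):
            out.append("ID")
        elif tok.isdigit():
            out.append("NUM")
        else:
            out.append(tok)
    return out


def most_similar(code, registered, threshold=0.8):
    """Return the name of the closest registered block, or None if below threshold."""
    target = normalize(code)
    best_name, best_score = None, 0.0
    for name, ref_code in registered.items():
        score = difflib.SequenceMatcher(None, target, normalize(ref_code)).ratio()
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None
```

A copy of registered code with renamed variables normalizes to the same token sequence and is therefore matched, whereas independently written code with a different structure falls below the threshold. This mirrors the limitation noted for similarity detection tools.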

[Process C] (Matching the interface with the host side)
Process C (matching the interface with the host side) consists of <Process C-1> and <Process C-2>. <Process C-1> follows <Process B-1>: it replaces the source processing description in the application code with the processing description of the destination GPU library or IP core, and writes the interface processing for calling the GPU library or IP core. <Process C-2> follows <Process B-2> and does the same for the blocks found there. Writing the interface processing for calling the GPU library or IP core is what is meant by "matching the interface with the host side".

<Process C-1>
The replacement processing unit 214 replaces the source processing description in the application code with the processing description of the destination GPU library or IP core. The replacement processing unit 214 then writes the interface processing for calling the GPU library or IP core (OpenCL API, etc.) and compiles the created pattern.

<Process C-1> is described in more detail below.
For the library call detected in <Process A-1>, the replacement function detection unit 213 has already found the corresponding library or IP core in <Process B-1>. The replacement processing unit 214 therefore implements the replacement library or IP core on the GPU or FPGA and performs the interface processing that connects it to the host-side (CPU) program.

For GPUs, libraries such as those for CUDA are assumed, and the method for using a CUDA library from C language code is published together with the library. The library usage method is therefore registered in the code pattern DB 230 as well. Following the registered usage method, the replacement processing unit 214 replaces the source processing description in the application code with the destination GPU library and writes the required code, such as calls to the functions used in the GPU library.

For FPGAs, IP cores described in HDL (Hardware Description Language) or the like are assumed. In this case, OpenCL code is also held in the code pattern DB 230 as information related to the IP core. The replacement processing unit 214 can perform the interface processing with the FPGA via a high-level synthesis tool (for example, Xilinx Vivado or Intel HLS Compiler). For example, starting from the OpenCL code, the replacement processing unit 214 connects the CPU and the FPGA through the OpenCL interface via the high-level synthesis tool. Similarly, the replacement processing unit 214 implements the IP core on the FPGA via the high-level synthesis tool of an FPGA vendor such as Xilinx or Intel.

<Process C-2>
The replacement processing unit 214 replaces the source processing description in the application code with the processing description of the destination GPU library or IP core. If the number or types of arguments or return values differ between the source and the destination, the replacement processing unit 214 checks with the user, writes the interface processing for calling the GPU library or IP core (OpenCL API, etc.), and compiles the created pattern. In <Process C-2>, the libraries and IP cores that can provide acceleration have already been found in <Process B-2> for the classes, structures, and the like detected in <Process A-2>. The replacement processing unit 214 therefore implements the corresponding library or IP core on the GPU or FPGA.

<Process C-2> is described in more detail below.
In <Process C-1>, the library or IP core is one that accelerates a specific library call, so although the interface part still has to be generated, the number and types of arguments and return values assumed by the GPU or FPGA and by the host-side program already match. In <Process B-2>, however, the decision is based on similarity and the like, so there is no guarantee that basics such as the number and types of arguments and return values match. Libraries and IP cores embody existing know-how and cannot readily be changed even when the number or types of arguments and return values do not match. The user who requested offloading is therefore asked whether the number and types of arguments and return values in the original code may be changed to match the library or IP core. Once the user has confirmed and approved the change, the offload performance test is tried.

If a type difference can be resolved by a simple cast, such as between float and double, the cast may be added when the processing pattern is created, and the performance measurement trial may start without asking the user. The number of arguments or return values may also differ between the original program and the library or IP core. For example, if arguments 1 and 2 are required and argument 3 is optional in the CPU program, while only arguments 1 and 2 are required by the library or IP core, optional argument 3 can be omitted without any problem. In such a case, the optional argument may automatically be treated as absent when the processing pattern is created, without asking the user. If the number and types of arguments and return values match exactly, the same processing as in <Process C-1> is sufficient.
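The decision rules described for <Process C-2> can be sketched as a small classifier. This is an illustrative reading of the text, not the patented logic: the `Signature` class, the result labels "same", "auto", and "confirm", and the assumption that only float and double can be bridged by a cast are all introduced for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Signature:
    required: list                                 # types of required arguments, in order
    optional: list = field(default_factory=list)   # types of optional arguments
    returns: str = "void"


# Type pairs that a simple cast can bridge (assumption for this sketch).
CASTABLE = {frozenset({"float", "double"})}


def compatible(a, b):
    """True if the two types are equal or can be bridged by a cast."""
    return a == b or frozenset({a, b}) in CASTABLE


def match_interface(cpu, accel):
    """Classify the replacement as 'same', 'auto', or 'confirm'.

    'same'    -- signatures match exactly; handle as in Process C-1
    'auto'    -- casts or dropped optional arguments suffice; no user confirmation
    'confirm' -- required arguments or types differ; ask the user first
    """
    if len(cpu.required) != len(accel.required):
        return "confirm"
    pairs = list(zip(cpu.required, accel.required)) + [(cpu.returns, accel.returns)]
    if not all(compatible(a, b) for a, b in pairs):
        return "confirm"
    exact = all(a == b for a, b in pairs) and cpu.optional == accel.optional
    return "same" if exact else "auto"
```

With this rule, a CPU function taking two required arguments and one optional argument maps to "auto" when the library takes only the two required arguments, while a mismatch in the number of required arguments maps to "confirm".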

[<<Function block offload: common>> flowchart]
Next, an outline of the <<Function block offload: common>> operation of the offload server 200 is described with reference to FIG. 14 and FIG. 15.

・Flowchart of <Process A-1>, <Process B-1>, and <Process C-1>
FIG. 14 is a flowchart for the case where the control unit (automatic offload function unit) 210 of the offload server 200 executes <Process A-1>, <Process B-1>, and <Process C-1> in the offload processing of <<Function block offload: common>>.
In step S501, the application code analysis unit 112 (see FIG. 12) analyzes the source code of the application program to be offloaded. Specifically, the application code analysis unit 112 uses a syntax analysis tool such as Clang to identify loop statement structures as well as the library calls and functional processing contained in the code.

ステップＳ５０２で置換機能検出部２１３（図１２参照）は、アプリケーションプログラムの外部ライブラリ呼び出しを検出する。In step S502, the replacement function detection unit 213 (see Figure 12) detects an external library call of the application program.

ステップＳ５０３で置換機能検出部２１３は、コードパターンＤＢ２３０から、ライブラリ名をキーに、置換可能ＧＰＵライブラリを取得する。具体的には、置換機能検出部２１３は、把握した外部ライブラリ呼び出しについて、コードパターンＤＢ２３０と照合することで、検出した置換可能ＧＰＵライブラリ・ＩＰコアを、ＧＰＵ、ＦＰＧＡにオフロードできるオフロード可能な機能ブロックとして取得する。In step S503, the replacement function detection unit 213 obtains a replaceable GPU library from the code pattern DB 230 using the library name as a key. Specifically, the replacement function detection unit 213 compares the identified external library call with the code pattern DB 230, and obtains the detected replaceable GPU library/IP core as an offloadable functional block that can be offloaded to a GPU or FPGA.

ステップＳ５０４で置換処理部２１４は、アプリケーションのソースコードの置換元の処理記述を、置換先のＧＰＵライブラリの処理記述に置換する。In step S504, the replacement processing unit 214 replaces the source processing description of the application source code with the destination processing description of the GPU library.

ステップＳ５０５で置換処理部２１４は、置換したＧＰＵライブラリの処理記述を、オフロード対象の機能ブロックとして、ＧＰＵにオフロードする。 In step S505, the replacement processing unit 214 offloads the processing description of the replaced GPU library to the GPU as a functional block to be offloaded.

ステップＳ５０６で置換処理部２１４は、ＧＰＵライブラリ呼び出しのためのインタフェース処理を記述する。
ステップＳ５０７で実行ファイル作成部１１７は、作成したパターンをコンパイルまたはインタプリットする。 In step S506, the replacement processing unit 214 describes an interface process for calling the GPU library.
In step S507, the executable file creation unit 117 compiles or interprets the created pattern.

ステップＳ５０８で性能測定部１１６は、作成したパターンを検証環境で性能測定する（「１回目の性能測定」）。
ステップＳ５０９で実行ファイル作成部１１７は、１回目測定時に高速化できたパターンについて組合せパターンを作成する。 In step S508, the performance measurement unit 116 measures the performance of the created pattern in a verification environment ("first performance measurement").
In step S509, the executable file creating unit 117 creates a combination pattern for the pattern that was able to increase the speed during the first measurement.

ステップＳ５１０で実行ファイル作成部１１７は、作成した組合せパターンをコンパイルまたはインタプリットする。
ステップＳ５１１で性能測定部１１６は、作成した組合せパターンを検証環境で性能測定する（「２回目の性能測定」）。 In step S510, the executable file creation unit 117 compiles or interprets the created combination pattern.
In step S511, the performance measurement unit 116 measures the performance of the created combination pattern in a verification environment ("second performance measurement").

ステップＳ５１２で本番環境配置部１１８は、１回目と２回目の測定の中で最高性能のパターンを選択して本フローの処理を終了する。 In step S512, the production environment deployment unit 118 selects the pattern with the best performance between the first and second measurements and ends the processing of this flow.
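The two measurement rounds in steps S507 to S512 can be sketched as follows. This is a simplified illustration under stated assumptions: `measure` stands in for compiling a pattern and timing it in the verification environment, `baseline` is the CPU-only execution time, and all names are hypothetical.

```python
from typing import Callable, Dict, FrozenSet, Iterable, Tuple

Pattern = FrozenSet[str]  # set of function block names that are offloaded


def select_best_pattern(
    blocks: Iterable[str],
    measure: Callable[[Pattern], float],
    baseline: float,
) -> Tuple[Pattern, float]:
    """Measure single-block patterns, then one combination of the blocks that
    were faster than the baseline, and return the fastest measured pattern."""
    # First performance measurement: each block offloaded on its own.
    results: Dict[Pattern, float] = {}
    for block in blocks:
        pattern = frozenset([block])
        results[pattern] = measure(pattern)

    # Combination pattern built from the blocks that sped up in round one.
    faster = [p for p, seconds in results.items() if seconds < baseline]
    if len(faster) > 1:
        combined = frozenset().union(*faster)
        # Second performance measurement.
        results[combined] = measure(combined)

    best = min(results, key=results.get)
    return best, results[best]


# Toy timings in seconds (made-up values, for illustration only).
TOY_TIMES = {
    frozenset({"fft"}): 6.0,
    frozenset({"matmul"}): 7.5,
    frozenset({"sort"}): 12.0,
    frozenset({"fft", "matmul"}): 4.0,
}
best_pattern, best_time = select_best_pattern(
    ["fft", "matmul", "sort"], TOY_TIMES.__getitem__, baseline=10.0
)
```

The combination is measured rather than assumed to be fastest, because offloading several blocks together does not necessarily add up their individual gains.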

・<処理Ａ－２>と<処理Ｂ－２>と<処理Ｃ－２>のフローチャート
図１５は、オフロードサーバ２００の制御部（自動オフロード機能部）２１０が、機能ブロックのオフロード処理において<処理Ａ－２>と<処理Ｂ－２>と<処理Ｃ－２>とを実行する場合のフローチャートである。なお、<処理Ａ－２>からの処理は、<処理Ａ－１>からの処理と並行して行えばよい。
ステップＳ６０１でアプリケーションコード分析部１１２（図１２参照）は、アプリケーションのオフロードしたいソースコードの分析を行う。具体的には、アプリケーションコード分析部１１２は、Clang等の構文解析ツールを用いて、ループ文構造等とともに、コードに含まれるライブラリ呼び出しや、機能処理を分析するソースコードの分析を行う。
Flowchart of <Processing A-2>, <Processing B-2>, and <Processing C-2>
FIG. 15 is a flowchart of the case where the control unit (automatic offload function unit) 210 of the offload server 200 executes <Processing A-2>, <Processing B-2>, and <Processing C-2> in the offload processing of the functional block. Note that the processing from <Processing A-2> may be performed in parallel with the processing from <Processing A-1>.
In step S601, the application code analysis unit 112 (see FIG. 12) analyzes the source code of the application to be offloaded. Specifically, the application code analysis unit 112 uses a syntax analysis tool such as Clang to analyze the source code, identifying the library calls and functional processing included in the code together with the loop statement structure and the like.

ステップＳ６０２で置換機能検出部２１３（図１２参照）は、ソースコードからクラスまたは構造体の定義記述コードを検出する。In step S602, the replacement function detection unit 213 (see Figure 12) detects the definition description code of a class or structure from the source code.

ステップＳ６０３で置換機能検出部２１３は、コードパターンＤＢ２３０から、類似性検出ツールを用いて、クラスまたは構造体の定義記述コードをキーにして、置換可能ＧＰＵライブラリを取得する。In step S603, the replacement function detection unit 213 uses a similarity detection tool to obtain a replaceable GPU library from the code pattern DB 230 using the definition description code of a class or structure as a key.

ステップＳ６０４で置換処理部２１４は、アプリケーションのソースコードの置換元の処理記述を、置換先のＧＰＵライブラリ処理記述に置換する。In step S604, the replacement processing unit 214 replaces the source processing description of the application source code with the destination GPU library processing description.

ステップＳ６０５で置換処理部２１４は、置換元と置換先で引数、戻り値の数や型が異なる場合に、ユーザに確認する。 In step S605, the replacement processing unit 214 prompts the user if the number or type of arguments or return values differ between the source and destination.

ステップＳ６０６で置換機能検出部２１３は、置換または確認したＧＰＵライブラリの処理記述を、オフロード対象の機能ブロックとして、ＧＰＵにオフロードする。 In step S606, the replacement function detection unit 213 offloads the processing description of the replaced or confirmed GPU library to the GPU as a functional block to be offloaded.

ステップＳ６０７で置換処理部２１４は、ＧＰＵライブラリ呼び出しのためのインタフェース処理を記述する。 In step S607, the replacement processing unit 214 describes the interface processing for calling the GPU library.

ステップＳ６０８で実行ファイル作成部１１７は、作成したパターンをコンパイルまたはインタプリットする。
ステップＳ６０９で性能測定部１１６は、作成したパターンを検証環境で性能測定する（「１回目の性能測定」）。 In step S608, the executable file creation unit 117 compiles or interprets the created pattern.
In step S609, the performance measurement unit 116 measures the performance of the created pattern in a verification environment ("first performance measurement").

ステップＳ６１０で実行ファイル作成部１１７は、１回目測定時に高速化できたパターンについて組合せパターンを作成する。In step S610, the executable file creation unit 117 creates a combination pattern for the pattern that was able to be sped up during the first measurement.

ステップＳ６１１で実行ファイル作成部１１７は、作成した組合せパターンをコンパイルまたはインタプリットする。
ステップＳ６１２で性能測定部１１６は、作成した組合せパターンを検証環境で性能測定する（「２回目の性能測定」）。 In step S611, the executable file creation unit 117 compiles or interprets the created combination pattern.
In step S612, the performance measurement unit 116 measures the performance of the created combination pattern in a verification environment ("second performance measurement").

ステップＳ６１３で本番環境配置部１１８は、１回目と２回目の測定の中で最高性能のパターンを選択して本フローの処理を終了する。 In step S613, the production environment deployment unit 118 selects the pattern with the best performance from among the first and second measurements and ends the processing of this flow.
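The similarity-based lookup of steps S602 and S603 can be illustrated with a rough sketch. Note the substitution: the standard-library `difflib.SequenceMatcher` is used here as a plain text-similarity stand-in, not as the similarity detection tool the specification refers to, and the database contents, the threshold of 0.8, and the function names are assumptions made for illustration.

```python
import difflib
import re
from typing import Dict, Optional, Tuple


def normalize(code: str) -> str:
    """Collapse whitespace so that layout differences do not affect the score."""
    return re.sub(r"\s+", " ", code).strip()


def find_similar(
    definition: str, pattern_db: Dict[str, str], threshold: float = 0.8
) -> Tuple[Optional[str], float]:
    """Return the best-matching entry name and its score, or (None, score)
    when no registered definition reaches the threshold."""
    best_name, best_score = None, 0.0
    for name, registered in pattern_db.items():
        score = difflib.SequenceMatcher(
            None, normalize(definition), normalize(registered)
        ).ratio()
        if score > best_score:
            best_name, best_score = name, score
    if best_score >= threshold:
        return best_name, best_score
    return None, best_score


# Hypothetical code pattern DB: entry name -> registered definition code.
PATTERN_DB = {
    "gpu_fft_lib": "struct complex { double re; double im; }; "
                   "void fft(struct complex *x, int n);",
    "gpu_matmul_lib": "void matmul(const float *a, const float *b, float *c, int n);",
}
```

A match found this way is only a candidate. As step S605 states, the user is still asked for confirmation when the number or types of arguments and return values differ.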

［《機能ブロックオフロード：Ｃ言語》］
機能ブロックオフロード：Ｃ言語の、コードの分析では、Ｃ言語を解析するClang（登録商標）等の構文解析ツールを用いて構文解析する。機能ブロックの把握については、構文解析ツールの結果を用いて、次処理のマッチング探索に用いるため、言語に非依存の機能ブロックとして管理する。
[<<Function block offload: C language>>]
In the code analysis of function block offload for the C language, the code is parsed with a syntax analysis tool for C such as Clang (registered trademark). Function blocks are identified from the output of the syntax analysis tool and are managed as language-independent function blocks so that they can be used in the matching search of the next process.

オフロード可能機能ブロックの探索では、ライブラリ等の名前一致でのマッチングと、Deckard（登録商標）等のＣ言語機能ブロックの類似性検出ツールを用いた類似性検知による、探索が行われる。オフロード機能ブロックへの置換は、ＣＵＤＡライブラリ呼び出し等、その言語からのオフロード機能利用に合わせた処理に置換する必要がある。 When searching for offloadable function blocks, the search is performed by matching the names of libraries, etc., and by detecting similarities using a similarity detection tool for C language function blocks such as Deckard (registered trademark). Replacement with an offload function block requires replacement with processing that matches the use of offload functions from that language, such as calling a CUDA library.

コンパイルは、ＣＵＤＡライブラリ呼び出し等のＣ言語コードをＰＧＩコンパイラ等でコンパイルする。性能測定は、言語に合わせて、Jenkins（登録商標）等の自動測定ツールも用いて行う。オフロード可能機能ブロックが複数の際は反復実行され、最高性能のパターンが最終解として決定される。 Compilation involves compiling C language code, such as CUDA library calls, with a PGI compiler or similar. Performance measurement is performed using an automated measurement tool, such as Jenkins (registered trademark), depending on the language. When there are multiple offloadable function blocks, they are executed repeatedly, and the pattern with the best performance is determined as the final solution.

このように、機能ブロック文オフロードでは、処理に関しては、機能ブロックの管理と機能ブロックの名前一致でのマッチングについては言語に非依存に適用できる。 In this way, in function block offloading, the management of function blocks and the name-based matching of function blocks can be applied independently of the language.
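As a minimal sketch of why these two parts are language-independent, a function block can be stored as a plain record and matched by name alone. The record fields, the database entries, and the replacement names below are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class FunctionBlock:
    name: str      # callee or library name reported by the language's parser
    language: str  # "c", "python", "java", ...
    location: str  # where the block was found, e.g. "main.c:42"


# Hypothetical code pattern DB: library name -> replacement offload function.
CODE_PATTERN_DB: Dict[str, str] = {
    "fft2d": "gpu_fft2d",
    "matmul": "gpu_matmul",
}


def match_by_name(
    blocks: Iterable[FunctionBlock], db: Dict[str, str]
) -> List[Tuple[FunctionBlock, str]]:
    """Pair each block whose name is registered in the DB with its replacement.
    The lookup never inspects block.language."""
    return [(block, db[block.name]) for block in blocks if block.name in db]


BLOCKS = [
    FunctionBlock("fft2d", "c", "main.c:42"),
    FunctionBlock("fft2d", "python", "app.py:10"),
    FunctionBlock("print_log", "java", "App.java:7"),
]
MATCHES = match_by_name(BLOCKS, CODE_PATTERN_DB)
```

Only the steps before and after this lookup (parsing the source, and rewriting the call in the form each language needs) depend on the language.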

［《機能ブロックオフロード：Ｃ言語》フローチャート］
次に、図１６および図１７を参照してオフロードサーバ２００の《機能ブロックオフロード：Ｃ言語》の動作概要を説明する。
[<<Function block offload: C language>> flowchart]
Next, an outline of the <<Function block offload: C language>> operation of the offload server 200 will be described with reference to FIG. 16 and FIG. 17.

・<処理Ａ－１>と<処理Ｂ－１>と<処理Ｃ－１>のフローチャート
図１６は、オフロードサーバ２００の制御部（自動オフロード機能部）２１０が、《機能ブロックオフロード：Ｃ言語》のオフロード処理において<処理Ａ－１>と<処理Ｂ－１>と<処理Ｃ－１>とを実行する場合のフローチャートである。
ステップＳ７０１でアプリケーションコード分析部１１２（図１２参照）は、アプリケーションプログラムのオフロードしたいソースコードの分析を行う。具体的には、アプリケーションコード分析部１１２は、Clang等の構文解析ツールを用いて、ループ文構造等とともに、コードに含まれるライブラリ呼び出しや、機能処理を分析するソースコードの分析を行う。 Flowchart of <Processing A-1>, <Processing B-1>, and <Processing C-1> FIG. 16 is a flowchart when the control unit (automatic offload function unit) 210 of the offload server 200 executes <Processing A-1>, <Processing B-1>, and <Processing C-1> in the offload processing of <<Function block offload: C language>>.
In step S701, the application code analysis unit 112 (see FIG. 12) analyzes the source code of the application program to be offloaded. Specifically, the application code analysis unit 112 uses a syntax analysis tool such as Clang to analyze the source code, including the loop statement structure and the like, as well as library calls and functional processing included in the code.

ステップＳ７０２で置換機能検出部２１３（図１２参照）は、アプリケーションの外部ライブラリ呼び出しを検出する。In step S702, the replacement function detection unit 213 (see Figure 12) detects an external library call of the application.

ステップＳ７０３で置換機能検出部２１３は、コードパターンＤＢ２３０から、ライブラリ名をキーに、置換可能ＧＰＵライブラリを取得する。具体的には、置換機能検出部２１３は、把握した外部ライブラリ呼び出しについて、コードパターンＤＢ２３０と照合することで、検出した置換可能ＧＰＵライブラリを、ＧＰＵ、ＦＰＧＡにオフロードできるオフロード可能な機能ブロックとして取得する。In step S703, the replacement function detection unit 213 obtains a replaceable GPU library from the code pattern DB 230 using the library name as a key. Specifically, the replacement function detection unit 213 compares the identified external library call with the code pattern DB 230, and obtains the detected replaceable GPU library as an offloadable functional block that can be offloaded to a GPU or FPGA.

ステップＳ７０４で置換処理部２１４は、アプリケーションコードの置換元の処理記述を、置換先のＧＰＵライブラリの処理記述に置換する。In step S704, the replacement processing unit 214 replaces the source processing description of the application code with the destination processing description of the GPU library.

ステップＳ７０５で置換処理部２１４は、置換したＧＰＵライブラリの処理記述を、オフロード対象の機能ブロックとして、ＧＰＵにオフロードする。 In step S705, the replacement processing unit 214 offloads the processing description of the replaced GPU library to the GPU as a functional block to be offloaded.

ステップＳ７０６で置換処理部２１４は、ＧＰＵライブラリ呼び出しのためのインタフェース処理を記述する。 In step S706, the replacement processing unit 214 describes the interface processing for calling the GPU library.

ステップＳ７０７で置換処理部２１４は、ＣＵＤＡのライブラリ呼び出しを、ＰＧＩコンパイラに指定する。 In step S707, the replacement processing unit 214 specifies the CUDA library call to the PGI compiler.

ステップＳ７０８で実行ファイル作成部１１７は、作成したパターンをＰＧＩコンパイラでコンパイルする。 In step S708, the executable file creation unit 117 compiles the created pattern using a PGI compiler.

ステップＳ７０９で性能測定部１１６は、作成したパターンを検証環境で性能測定する（「１回目の性能測定」）。
ステップＳ７１０で実行ファイル作成部１１７は、１回目測定時に高速化できたパターンについて組合せパターンを作成する。 In step S709, the performance measurement unit 116 measures the performance of the created pattern in a verification environment ("first performance measurement").
In step S710, the executable file creation unit 117 creates a combination pattern for the pattern that was able to increase the speed during the first measurement.

ステップＳ７１１で実行ファイル作成部１１７は、作成した組合せパターンをＰＧＩコンパイラでコンパイルする。 In step S711, the executable file creation unit 117 compiles the created combination pattern using a PGI compiler.

ステップＳ７１２で性能測定部１１６は、作成した組合せパターンを検証環境で性能測定する（「２回目の性能測定」）。In step S712, the performance measurement unit 116 measures the performance of the created combination pattern in a verification environment ("second performance measurement").

ステップＳ７１３で本番環境配置部１１８は、１回目と２回目の測定の中で最高性能のパターンを選択して本フローの処理を終了する。 In step S713, the production environment deployment unit 118 selects the pattern with the best performance between the first and second measurements and ends the processing of this flow.

・<処理Ａ－２>と<処理Ｂ－２>と<処理Ｃ－２>のフローチャート
図１７は、オフロードサーバ２００の制御部（自動オフロード機能部）２１０が、機能ブロックのオフロード処理において<処理Ａ－２>と<処理Ｂ－２>と<処理Ｃ－２>とを実行する場合のフローチャートである。なお、<処理Ａ－２>からの処理は、<処理Ａ－１>からの処理と並行して行えばよい。
ステップＳ８０１でアプリケーションコード分析部１１２（図１２参照）は、アプリケーションプログラムのオフロードしたいソースコードの分析を行う。具体的には、アプリケーションコード分析部１１２は、Clang等の構文解析ツールを用いて、ループ文構造等とともに、コードに含まれるライブラリ呼び出しや、機能処理を分析するソースコードの分析を行う。
Flowchart of <Processing A-2>, <Processing B-2>, and <Processing C-2>
FIG. 17 is a flowchart of the case where the control unit (automatic offload function unit) 210 of the offload server 200 executes <Processing A-2>, <Processing B-2>, and <Processing C-2> in the offload processing of the functional block. Note that the processing from <Processing A-2> may be performed in parallel with the processing from <Processing A-1>.
In step S801, the application code analysis unit 112 (see FIG. 12) analyzes the source code of the application program to be offloaded. Specifically, the application code analysis unit 112 uses a syntax analysis tool such as Clang to analyze the source code, identifying the library calls and functional processing included in the code together with the loop statement structure and the like.

ステップＳ８０２で置換機能検出部２１３（図１２参照）は、ソースコードからクラスまたは構造体の定義記述コードを検出する。In step S802, the replacement function detection unit 213 (see Figure 12) detects the definition description code of a class or structure from the source code.

ステップＳ８０３で置換機能検出部２１３は、コードパターンＤＢ２３０から、類似性検出ツールを用いて、クラスまたは構造体の定義記述コードをキーにして、置換可能ＧＰＵライブラリを取得する。In step S803, the replacement function detection unit 213 uses a similarity detection tool to obtain a replaceable GPU library from the code pattern DB 230 using the definition description code of a class or structure as a key.

ステップＳ８０４で置換処理部２１４は、アプリケーションのソースコードの置換元の処理記述を、置換先のＧＰＵライブラリの処理記述に置換する。In step S804, the replacement processing unit 214 replaces the source processing description of the application source code with the destination processing description of the GPU library.

ステップＳ８０５で置換処理部２１４は、置換元と置換先で引数、戻り値の数や型が異なる場合に、ユーザに確認する。 In step S805, the replacement processing unit 214 prompts the user if the number or type of arguments or return values differ between the source and destination.

ステップＳ８０６で置換機能検出部２１３は、置換または確認したＧＰＵライブラリの置換元の処理記述を、オフロード対象の機能ブロックとして、ＧＰＵにオフロードする。In step S806, the replacement function detection unit 213 offloads the processing description of the replaced or confirmed GPU library to the GPU as a functional block to be offloaded.

ステップＳ８０７で置換処理部２１４は、ＧＰＵライブラリ呼び出しのためのインタフェース処理を記述する。 In step S807, the replacement processing unit 214 describes the interface processing for calling the GPU library.

ステップＳ８０８で置換処理部２１４は、ＣＵＤＡのライブラリ呼び出しを、ＰＧＩコンパイラに指定する。 In step S808, the replacement processing unit 214 specifies the CUDA library call to the PGI compiler.

ステップＳ８０９で実行ファイル作成部１１７は、作成したパターンをＰＧＩコンパイラでコンパイルする。 In step S809, the executable file creation unit 117 compiles the created pattern using a PGI compiler.

ステップＳ８１０で性能測定部１１６は、作成したパターンを検証環境で性能測定する（「１回目の性能測定」）。In step S810, the performance measurement unit 116 measures the performance of the created pattern in a verification environment ("first performance measurement").

ステップＳ８１１で実行ファイル作成部１１７は、１回目測定時に高速化できたパターンについて組合せパターンを作成する。In step S811, the executable file creation unit 117 creates a combination pattern for the pattern that was able to be sped up during the first measurement.

ステップＳ８１２で実行ファイル作成部１１７は、作成した組合せパターンをＰＧＩコンパイラでコンパイルする。 In step S812, the executable file creation unit 117 compiles the created combination pattern using a PGI compiler.

ステップＳ８１３で性能測定部１１６は、作成した組合せパターンを検証環境で性能測定する（「２回目の性能測定」）。In step S813, the performance measurement unit 116 measures the performance of the created combination pattern in a verification environment ("second performance measurement").

ステップＳ８１４で本番環境配置部１１８は、１回目と２回目の測定の中で最高性能のパターンを選択して本フローの処理を終了する。 In step S814, the production environment deployment unit 118 selects the pattern with the best performance from among the first and second measurements and ends the processing of this flow.

以上、《機能ブロックオフロード：Ｃ言語》について説明した、次に、《機能ブロックオフロード：Python》について説明する。 We have explained "Function Block Offload: C Language" above. Next, we will explain "Function Block Offload: Python".

［《機能ブロックオフロード：Python》］
機能ブロックオフロード：Pythonは、機能ブロックオフロードの、コードの分析では、Pythonを解析するast（登録商標）等の構文解析ツールを用いて構文解析する。機能ブロックの把握については、構文解析ツールの結果を用いて、次処理のマッチング探索に用いるため、言語に非依存の機能ブロックとして管理する。
[<<Function block offload: Python>>]
In the code analysis of function block offload for Python, the code is parsed with a syntax analysis tool for Python such as ast (registered trademark). Function blocks are identified from the output of the syntax analysis tool and are managed as language-independent function blocks so that they can be used in the matching search of the next process.
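As an illustration of this parsing step, the sketch below uses Python's standard `ast` module to list calls that go through imported libraries. It is a simplified example rather than the analysis unit itself: it resolves only names bound by `import` statements, and the function name `find_library_calls` and the sample source are assumptions.

```python
import ast
from typing import Dict, List, Tuple


def find_library_calls(source: str) -> List[Tuple[str, int]]:
    """Return (fully qualified callee, line number) for calls made through
    names that were bound by import statements."""
    tree = ast.parse(source)

    # Map each local name to the module or object path it was imported from.
    aliases: Dict[str, str] = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                top = alias.name.split(".")[0]
                aliases[alias.asname or top] = alias.name if alias.asname else top
        elif isinstance(node, ast.ImportFrom) and node.module:
            for alias in node.names:
                aliases[alias.asname or alias.name] = f"{node.module}.{alias.name}"

    calls: List[Tuple[str, int]] = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        # Unwind attribute chains such as np.fft.fft down to the base name.
        parts: List[str] = []
        target = node.func
        while isinstance(target, ast.Attribute):
            parts.append(target.attr)
            target = target.value
        if isinstance(target, ast.Name) and target.id in aliases:
            parts.append(aliases[target.id])
            calls.append((".".join(reversed(parts)), node.lineno))
    return sorted(calls, key=lambda call: call[1])


SAMPLE = (
    "import numpy as np\n"
    "from scipy.linalg import solve\n"
    "\n"
    "def run(a, b):\n"
    "    spectrum = np.fft.fft(a)\n"
    "    x = solve(a, b)\n"
    "    return helper(spectrum, x)\n"
)
LIBRARY_CALLS = find_library_calls(SAMPLE)
```

The resulting names can then serve as keys for the name-based lookup in the code pattern DB. The sample source is only parsed, never executed, so the listed libraries do not need to be installed.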

オフロード可能機能ブロックの探索では、ライブラリ等の名前一致でのマッチングと、CloneDigger（登録商標）等のPython機能ブロックの類似性検出ツールを用いた類似性検知による、探索が行われる。 When searching for offloadable function blocks, the search is performed by matching the names of libraries, etc., and by similarity detection using a Python function block similarity detection tool such as CloneDigger (registered trademark).

オフロード機能ブロックへの置換は、ＧＰＵ処理のpyCUDAでの呼び出し等、その言語からのオフロード機能利用に合わせた処理に置換する必要がある。インタプリタは、ＣＵＤＡに合わせたPythonコードをpyCUDAでインタプリットする。 When replacing with an offload function block, it is necessary to replace it with a process that matches the use of the offload function from the language, such as calling GPU processing with pyCUDA. The interpreter interprets Python code that is compatible with CUDA using pyCUDA.

性能測定は、言語に合わせて、Jenkins（登録商標）等の自動測定ツールも用いて行う。オフロード可能機能ブロックが複数の際は反復実行され、最高性能のパターンが最終解として決定される。 Performance measurements are also performed using automated measurement tools such as Jenkins (registered trademark) depending on the language. When there are multiple offloadable function blocks, they are executed repeatedly and the pattern with the best performance is determined as the final solution.

［《機能ブロックオフロード：Python》フローチャート］
次に、図１８および図１９を参照してオフロードサーバ２００の《機能ブロック：Python》の動作概要を説明する。
[<<Function block offload: Python>> flowchart]
Next, an outline of the <<Function block: Python>> operation of the offload server 200 will be described with reference to FIG. 18 and FIG. 19.

・<処理Ａ－１>と<処理Ｂ－１>と<処理Ｃ－１>のフローチャート
図１８は、オフロードサーバ２００の制御部（自動オフロード機能部）２１０が、《機能ブロック：Python》のオフロード処理において<処理Ａ－１>と<処理Ｂ－１>と<処理Ｃ－１>とを実行する場合のフローチャートである。
ステップＳ９０１でアプリケーションコード分析部１１２（図１２参照）は、アプリケーションプログラムのオフロードしたいソースコードの分析を行う。具体的には、アプリケーションコード分析部１１２は、Clang等の構文解析ツールを用いて、ループ文構造等とともに、コードに含まれるライブラリ呼び出しや、機能処理を分析するソースコードの分析を行う。 Flowchart of <Processing A-1>, <Processing B-1>, and <Processing C-1> Figure 18 is a flowchart when the control unit (automatic offload function unit) 210 of the offload server 200 executes <Processing A-1>, <Processing B-1>, and <Processing C-1> in the offload processing of <Function block: Python>.
In step S901, the application code analysis unit 112 (see FIG. 12) analyzes the source code of the application program to be offloaded. Specifically, the application code analysis unit 112 uses a syntax analysis tool such as Clang to analyze the source code, including the loop statement structure and the like, as well as library calls and functional processing included in the code.

ステップＳ９０２で置換機能検出部２１３（図１２参照）は、アプリケーションの外部ライブラリ呼び出しを検出する。In step S902, the replacement function detection unit 213 (see Figure 12) detects an external library call of the application.

ステップＳ９０３で置換機能検出部２１３は、コードパターンＤＢ２３０から、ライブラリ名をキーに、置換可能ＧＰＵライブラリを取得する。具体的には、置換機能検出部２１３は、把握した外部ライブラリ呼び出しについて、コードパターンＤＢ２３０と照合することで、検出した置換可能ＧＰＵライブラリを、ＧＰＵ、ＦＰＧＡにオフロードできるオフロード可能な機能ブロックとして取得する。In step S903, the replacement function detection unit 213 obtains a replaceable GPU library from the code pattern DB 230 using the library name as a key. Specifically, the replacement function detection unit 213 compares the identified external library call with the code pattern DB 230, and obtains the detected replaceable GPU library as an offloadable functional block that can be offloaded to a GPU or FPGA.

ステップＳ９０４で置換処理部２１４は、アプリケーションのソースコードの置換元の処理記述を、置換先のＧＰＵライブラリの処理記述に置換する。In step S904, the replacement processing unit 214 replaces the source processing description of the application source code with the destination processing description of the GPU library.

ステップＳ９０５で置換処理部２１４は、置換したＧＰＵライブラリの処理記述を、オフロード対象の機能ブロックとして、ＧＰＵにオフロードする。 In step S905, the replacement processing unit 214 offloads the processing description of the replaced GPU library to the GPU as a functional block to be offloaded.

ステップＳ９０６で置換処理部２１４は、ＧＰＵライブラリ呼び出しのためのインタフェース処理を記述する。 In step S906, the replacement processing unit 214 describes the interface processing for calling the GPU library.

ステップＳ９０７で置換処理部２１４は、ＣＵＤＡのライブラリ呼び出しを、pyCudaで指定する。 In step S907, the replacement processing unit 214 specifies the CUDA library call using pyCuda.

ステップＳ９０８で実行ファイル作成部１１７は、作成したパターンをpyCudaでインタプリットする。 In step S908, the executable file creation unit 117 interprets the created pattern using pyCuda.

ステップＳ９０９で性能測定部１１６は、作成したパターンを検証環境で性能測定する（「１回目の性能測定」）。
ステップＳ９１０で実行ファイル作成部１１７は、１回目測定時に高速化できたパターンについて組合せパターンを作成する。 In step S909, the performance measurement unit 116 measures the performance of the created pattern in a verification environment ("first performance measurement").
In step S910, the executable file creation unit 117 creates a combination pattern for the pattern that was able to increase the speed during the first measurement.

ステップＳ９１１で実行ファイル作成部１１７は、作成した組合せパターンをpyCudaでインタプリットする。 In step S911, the executable file creation unit 117 interprets the created combination pattern with pyCuda.
ステップＳ９１２で性能測定部１１６は、作成した組合せパターンを検証環境で性能測定する（「２回目の性能測定」）。 In step S912, the performance measurement unit 116 measures the performance of the created combination pattern in a verification environment ("second performance measurement").

ステップＳ９１３で本番環境配置部１１８は、１回目と２回目の測定の中で最高性能のパターンを選択して本フローの処理を終了する。 In step S913, the production environment deployment unit 118 selects the pattern with the best performance between the first and second measurements and ends the processing of this flow.
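The replacement in step S904 can be sketched for Python source with the standard `ast` module (Python 3.9 or later for `ast.unparse`). This is an illustrative assumption about how such a rewrite could look: the class `CallReplacer`, the mapping, and the replacement name `gpu_lib.fft` are hypothetical and are not pyCUDA APIs.

```python
import ast
from typing import Dict, List, Tuple


class CallReplacer(ast.NodeTransformer):
    """Rewrite calls whose callee text matches a key in `mapping`."""

    def __init__(self, mapping: Dict[str, str]) -> None:
        self.mapping = mapping
        self.replaced: List[str] = []

    def visit_Call(self, node: ast.Call) -> ast.Call:
        self.generic_visit(node)  # rewrite nested calls first
        callee = ast.unparse(node.func)
        if callee in self.mapping:
            # Parse the dotted replacement name into an expression node.
            node.func = ast.parse(self.mapping[callee], mode="eval").body
            self.replaced.append(callee)
        return node


def replace_calls(source: str, mapping: Dict[str, str]) -> Tuple[str, List[str]]:
    """Return the rewritten source and the list of callees that were replaced."""
    replacer = CallReplacer(mapping)
    tree = replacer.visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree), replacer.replaced


ORIGINAL = "import numpy as np\ny = np.fft.fft(x)\nz = np.sum(y)\n"
NEW_SOURCE, REPLACED = replace_calls(ORIGINAL, {"np.fft.fft": "gpu_lib.fft"})
```

The rewritten source does not yet import the replacement library. Adding that import and any data transfer code would belong to the interface processing of step S906, which this sketch leaves out.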

・<処理Ａ－２>と<処理Ｂ－２>と<処理Ｃ－２>のフローチャート
図１９は、オフロードサーバ２００の制御部（自動オフロード機能部）２１０が、機能ブロックのオフロード処理において<処理Ａ－２>と<処理Ｂ－２>と<処理Ｃ－２>とを実行する場合のフローチャートである。なお、<処理Ａ－２>からの処理は、<処理Ａ－１>からの処理と並行して行えばよい。
ステップＳ１００１でアプリケーションコード分析部１１２（図１２参照）は、アプリケーションプログラムのオフロードしたいソースコードの分析を行う。具体的には、アプリケーションコード分析部１１２は、Clang等の構文解析ツールを用いて、ループ文構造等とともに、コードに含まれるライブラリ呼び出しや、機能処理を分析するソースコードの分析を行う。
Flowchart of <Processing A-2>, <Processing B-2>, and <Processing C-2>
FIG. 19 is a flowchart of the case where the control unit (automatic offload function unit) 210 of the offload server 200 executes <Processing A-2>, <Processing B-2>, and <Processing C-2> in the offload processing of the functional block. Note that the processing from <Processing A-2> may be performed in parallel with the processing from <Processing A-1>.
In step S1001, the application code analysis unit 112 (see FIG. 12) analyzes the source code of the application program to be offloaded. Specifically, the application code analysis unit 112 uses a syntax analysis tool such as Clang to analyze the source code, identifying the library calls and functional processing included in the code together with the loop statement structure and the like.

ステップＳ１００２で置換機能検出部２１３（図１２参照）は、ソースコードからクラスまたは構造体の定義記述コードを検出する。In step S1002, the replacement function detection unit 213 (see Figure 12) detects definition description code of a class or structure from the source code.

ステップＳ１００３で置換機能検出部２１３は、コードパターンＤＢ２３０から、類似性検出ツールを用いて、クラスまたは構造体の定義記述コードをキーにして、置換可能ＧＰＵライブラリを取得する。In step S1003, the replacement function detection unit 213 uses a similarity detection tool to obtain a replaceable GPU library from the code pattern DB 230 using the definition description code of a class or structure as a key.

ステップＳ１００４で置換処理部２１４は、アプリケーションのソースコードの置換元の処理記述を、置換先のＧＰＵライブラリ処理記述に置換する。In step S1004, the replacement processing unit 214 replaces the source processing description of the application source code with the destination GPU library processing description.

ステップＳ１００５で置換処理部２１４は、置換元と置換先で引数、戻り値の数や型が異なる場合に、ユーザに確認する。 In step S1005, the replacement processing unit 214 prompts the user if the number or type of arguments or return values differ between the source and destination.

ステップＳ１００６で置換機能検出部２１３は、置換または確認したＧＰＵライブラリの処理記述を、オフロード対象の機能ブロックとして、ＧＰＵにオフロードする。 In step S1006, the replacement function detection unit 213 offloads the processing description of the replaced or confirmed GPU library to the GPU as a functional block to be offloaded.

ステップＳ１００７で置換処理部２１４は、ＧＰＵライブラリ呼び出しのためのインタフェース処理を記述する。 In step S1007, the replacement processing unit 214 describes the interface processing for calling the GPU library.

ステップＳ１００８で置換処理部２１４は、ＣＵＤＡのライブラリ呼び出しを、pyCudaで指定する。 In step S1008, the replacement processing unit 214 specifies the CUDA library call using pyCuda.

ステップＳ１００９で実行ファイル作成部１１７は、作成したパターンをpyCudaでインタプリットする。 In step S1009, the executable file creation unit 117 interprets the created pattern with pyCuda.

ステップＳ１０１０で性能測定部１１６は、作成したパターンを検証環境で性能測定する（「１回目の性能測定」）。In step S1010, the performance measurement unit 116 measures the performance of the created pattern in a verification environment ("first performance measurement").

ステップＳ１０１１で実行ファイル作成部１１７は、１回目測定時に高速化できたパターンについて組合せパターンを作成する。In step S1011, the executable file creation unit 117 creates a combination pattern for the pattern that was able to be sped up during the first measurement.

ステップＳ１０１２で実行ファイル作成部１１７は、作成した組合せパターンをpyCudaでインタプリットする。
ステップＳ１０１３で性能測定部１１６は、作成した組合せパターンを検証環境で性能測定する（「２回目の性能測定」）。 In step S1012, the executable file creation unit 117 interprets the created combination pattern with pyCuda.
In step S1013, the performance measurement unit 116 measures the performance of the created combination pattern in a verification environment ("second performance measurement").

ステップＳ１０１４で本番環境配置部１１８は、１回目と２回目の測定の中で最高性能のパターンを選択して本フローの処理を終了する。 In step S1014, the production environment deployment unit 118 selects the pattern with the best performance from among the first and second measurements and ends the processing of this flow.

以上、《機能ブロックオフロード：Python》について説明した、次に、《機能ブロックオフロード：Java》について説明する。 We have explained "Function Block Offload: Python" above. Next, we will explain "Function Block Offload: Java".

［《機能ブロックオフロード：Java》］
機能ブロックオフロード：Javaの、コードの分析では、Javaを解析するJavaParser（登録商標）等の構文解析ツールを用いて構文解析する。機能ブロックの把握については、構文解析ツールの結果を用いて、次処理のマッチング探索に用いるため、言語に非依存の機能ブロックとして管理する。
[<<Function block offload: Java>>]
In the code analysis of function block offload for Java, the code is parsed with a syntax analysis tool for Java such as JavaParser (registered trademark). Function blocks are identified from the output of the syntax analysis tool and are managed as language-independent function blocks so that they can be used in the matching search of the next process.

オフロード可能機能ブロックの探索では、ライブラリ等の名前一致でのマッチングと、Deckard（登録商標）等のJava機能ブロックの類似性検出ツールを用いた類似性検知による、探索が行われる。オフロード機能ブロックへの置換は、ＧＰＵ処理のＣＵＤＡライブラリの呼び出し等、その言語からのオフロード機能利用に合わせた処理に置換する必要がある。 When searching for offloadable function blocks, searches are performed by matching the names of libraries, etc., and by detecting similarities using similarity detection tools for Java function blocks such as Deckard (registered trademark). Replacement with an offload function block requires replacement with processing that matches the use of offload functions from that language, such as calling the CUDA library for GPU processing.

実行環境は、Javaのラムダ記述での処理をＧＰＵに対して行うことができるIBM JDK（登録商標）を用いる。性能測定は、言語に合わせて、Jenkins（登録商標）等の自動測定ツールも用いて行う。オフロード可能機能ブロックが複数の際は反復実行され、最高性能のパターンが最終解として決定される。 The execution environment uses IBM JDK (registered trademark), which allows processing using Java lambda statements on the GPU. Performance measurements are also performed using automated measurement tools such as Jenkins (registered trademark) depending on the language. When there are multiple offloadable function blocks, they are executed repeatedly, and the pattern with the best performance is determined as the final solution.

［《機能ブロックオフロード：Java》フローチャート］
次に、図２０および図２１を参照してオフロードサーバ２００の《機能ブロック：Java》の動作概要を説明する。
[<<Function block offload: Java>> flowchart]
Next, an outline of the <<Function block: Java>> operation of the offload server 200 will be described with reference to FIG. 20 and FIG. 21.

・<処理Ａ－１>と<処理Ｂ－１>と<処理Ｃ－１>のフローチャート
図２０は、オフロードサーバ２００の制御部（自動オフロード機能部）２１０が、《機能ブロック：Java》のオフロード処理において<処理Ａ－１>と<処理Ｂ－１>と<処理Ｃ－１>とを実行する場合のフローチャートである。
ステップＳ１１０１でアプリケーションコード分析部１１２（図１２参照）は、アプリケーションプログラムのオフロードしたいソースコードの分析を行う。具体的には、アプリケーションコード分析部１１２は、Clang等の構文解析ツールを用いて、ループ文構造等とともに、コードに含まれるライブラリ呼び出しや、機能処理を分析するソースコードの分析を行う。 Flowchart of <Processing A-1>, <Processing B-1>, and <Processing C-1> FIG. 20 is a flowchart when the control unit (automatic offload function unit) 210 of the offload server 200 executes <Processing A-1>, <Processing B-1>, and <Processing C-1> in the offload processing of <Function block: Java>.
In step S1101, the application code analysis unit 112 (see FIG. 12) analyzes the source code of the application program to be offloaded. Specifically, the application code analysis unit 112 uses a syntax analysis tool such as Clang to analyze the source code to analyze the library calls and functional processing included in the code, together with the loop statement structure and the like.

ステップＳ１１０２で置換機能検出部２１３（図１２参照）は、アプリケーションプログラムの外部ライブラリ呼び出しを検出する。 In step S1102, the replacement function detection unit 213 (see Figure 12) detects an external library call of the application program.

ステップＳ１１０３で置換機能検出部２１３は、コードパターンＤＢ２３０から、ライブラリ名をキーに、置換可能ＧＰＵライブラリを取得する。具体的には、置換機能検出部２１３は、把握した外部ライブラリ呼び出しについて、コードパターンＤＢ２３０と照合することで、検出した置換可能ＧＰＵライブラリを、ＧＰＵ、ＦＰＧＡにオフロードできるオフロード可能な機能ブロックとして取得する。In step S1103, the replacement function detection unit 213 obtains a replaceable GPU library from the code pattern DB 230 using the library name as a key. Specifically, the replacement function detection unit 213 compares the identified external library call with the code pattern DB 230, and obtains the detected replaceable GPU library as an offloadable function block that can be offloaded to a GPU or FPGA.

ステップＳ１１０４で置換処理部２１４は、アプリケーションのソースコードの置換元の処理記述を、置換先のＧＰＵライブラリ、ＩＰコアの処理記述に置換する。In step S1104, the replacement processing unit 214 replaces the source processing description of the application source code with the destination processing description of the GPU library and IP core.

ステップＳ１１０５で置換処理部２１４は、置換したＧＰＵライブラリの処理記述を、オフロード対象の機能ブロックとして、ＧＰＵにオフロードする。 In step S1105, the replacement processing unit 214 offloads the processing description of the replaced GPU library to the GPU as a functional block to be offloaded.

ステップＳ１１０６で置換処理部２１４は、ＧＰＵライブラリ呼び出しのためのインタフェース処理を記述する。 In step S1106, the replacement processing unit 214 describes the interface processing for calling the GPU library.

ステップＳ１１０７で置換処理部２１４は、ＣＵＤＡのライブラリ呼び出しを、Jcudaで指定する。 In step S1107, the replacement processing unit 214 specifies the CUDA library call using Jcuda.

ステップＳ１１０８で実行ファイル作成部１１７は、作成したパターンをJcudaでビルドする。 In step S1108, the executable file creation unit 117 builds the created pattern using Jcuda.

In step S1109, the performance measurement unit 116 measures the performance of the created pattern in the verification environment (the "first performance measurement").
In step S1110, the executable file creation unit 117 creates a combination pattern from the patterns that achieved a speedup in the first measurement.

In step S1111, the executable file creation unit 117 builds the created combination pattern with Jcuda.
In step S1112, the performance measurement unit 116 measures the performance of the created combination pattern in the verification environment (the "second performance measurement").

In step S1113, the production environment deployment unit 118 selects the highest-performing pattern from the first and second measurements, and the processing of this flow ends.

・Flowchart of <Process A-2>, <Process B-2>, and <Process C-2>
FIG. 21 is a flowchart for the case where the control unit (automatic offload function unit) 210 of the offload server 200 executes <Process A-2>, <Process B-2>, and <Process C-2> in the offload processing of function blocks. The processing from <Process A-2> may be performed in parallel with the processing from <Process A-1>.
In step S1201, the application code analysis unit 112 (see FIG. 12) analyzes the source code of the application to be offloaded. Specifically, the application code analysis unit 112 uses a syntax analysis tool such as Clang to analyze the library calls and functional processing contained in the code, together with loop statement structures and the like.

In step S1202, the replacement function detection unit 213 (see FIG. 12) detects the definition code of classes or structures in the source code.

In step S1203, the replacement function detection unit 213 uses a similarity detection tool to retrieve a replaceable GPU library from the code pattern DB 230, using the definition code of the class or structure as the key.

In step S1204, the replacement processing unit 214 replaces the source-side processing description in the application source code with the processing description of the destination GPU library.

In step S1205, the replacement processing unit 214 asks the user for confirmation when the number or types of arguments or return values differ between the replacement source and the replacement destination.

In step S1206, the replacement function detection unit 213 offloads the replaced or confirmed GPU library processing description to the GPU as the function block to be offloaded.

In step S1207, the replacement processing unit 214 writes the interface processing needed to call the GPU library.

In step S1208, the replacement processing unit 214 specifies the CUDA library call using Jcuda.

In step S1209, the executable file creation unit 117 builds the created pattern with Jcuda.

In step S1210, the performance measurement unit 116 measures the performance of the created pattern in the verification environment (the "first performance measurement").

In step S1211, the executable file creation unit 117 creates a combination pattern from the patterns that achieved a speedup in the first measurement.

In step S1212, the executable file creation unit 117 builds the created combination pattern with Jcuda.

In step S1213, the performance measurement unit 116 measures the performance of the created combination pattern in the verification environment (the "second performance measurement").

In step S1214, the production environment deployment unit 118 selects the highest-performing pattern from the first and second measurements, and the processing of this flow ends.
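The interface check of step S1205 can be illustrated with a minimal Python sketch. It is only a loose analogy: it compares the signatures of two Python functions with the standard `inspect` module, whereas the description concerns the arguments and return values of the source code being replaced and of a GPU library. The functions `cpu_fft` and `gpu_fft` and the helper `interface_differences` are hypothetical names introduced for illustration.

```python
import inspect


def interface_differences(original, replacement):
    """List differences in argument count, argument types, and return type."""
    src = inspect.signature(original)
    dst = inspect.signature(replacement)
    diffs = []
    if len(src.parameters) != len(dst.parameters):
        diffs.append(
            f"argument count: {len(src.parameters)} -> {len(dst.parameters)}"
        )
    # Compare annotated types for the arguments both sides have in common.
    for s, d in zip(src.parameters.values(), dst.parameters.values()):
        if s.annotation != d.annotation:
            diffs.append(f"type of argument '{s.name}' differs")
    if src.return_annotation != dst.return_annotation:
        diffs.append("return type differs")
    return diffs


# Hypothetical replacement source (CPU code) and destination (GPU library).
def cpu_fft(data: list, n: int) -> list:
    return data


def gpu_fft(data: list, n: int, stream: int) -> list:
    return data


diffs = interface_differences(cpu_fft, gpu_fft)
# A non-empty list means the user would be asked to confirm the change.
needs_user_confirmation = bool(diffs)
```

In this sketch, an empty list corresponds to the case where the replacement can proceed without asking the user.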

[Implementation Example]
An implementation example of the first embodiment ("loop statement offloading") and the second embodiment ("function block offloading") is described below.

<Tools Used>
The target application programs are written in C/C++, Python, or Java.
For GPU processing of C/C++, PGI Compiler 19.10 is used. The PGI compiler is a C/C++ compiler that interprets OpenACC. It can also handle calls to CUDA libraries such as cuFFT.

For Python, pyCUDA 2019.1.2 is used. pyCUDA is an interpreter for executing processing on the GPU from Python. Alternatively, PyACC is used for Python. PyACC is an interpreter for interpreting and executing OpenACC from Python.

For Java, IBM JDK (registered trademark) is used. IBM JDK is a virtual machine that executes parallel processing on the GPU according to Java lambda descriptions.

<Syntax Analysis>
For C/C++, the syntax analysis library of LLVM/Clang 6.0 (the Python binding of libClang (registered trademark)) is used. For Python, ast is used. For Java, Java Parser (registered trademark) is used.
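As an illustration of the Python case, the following minimal sketch uses the standard `ast` module to list the loop statements and the called functions in a piece of source code, which is the kind of information the analysis step needs. The sample source and the helper name `analyze` are assumptions made for this example, and `ast.unparse` requires Python 3.9 or later.

```python
import ast

# Hypothetical application code to analyze (it is parsed, not executed).
SAMPLE_SOURCE = """
import numpy as np

def smooth(signal):
    spectrum = np.fft.fft(signal)
    for i in range(len(spectrum)):
        spectrum[i] = spectrum[i] * 0.5
    return np.fft.ifft(spectrum)
"""


def analyze(source):
    """Return (line numbers of loop statements, names of called functions)."""
    tree = ast.parse(source)
    loops, calls = [], []
    for node in ast.walk(tree):
        if isinstance(node, (ast.For, ast.While)):
            loops.append(node.lineno)
        elif isinstance(node, ast.Call):
            # ast.unparse turns the callee expression back into dotted text.
            calls.append(ast.unparse(node.func))
    return sorted(loops), sorted(set(calls))


loops, calls = analyze(SAMPLE_SOURCE)
```

The loop line numbers would feed loop statement offloading, while the call names (here including `np.fft.fft`) are candidates for lookup as external library calls.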

<Similarity Detection Tool>
For C/C++ and Java, Deckard v2.0 (registered trademark) is used as the similarity detection tool. To widen the range of code to which function block offloading applies, so that functions that were copied and then modified can be offloaded in addition to library calls, Deckard determines the similarity between the partial code being matched and the code registered in the DB. For Python, CloneDigger (registered trademark) is used.

<Code Pattern DB>
The code pattern DB 230 (see FIG. 12) used for matching is implemented with MySQL 8. It holds records for looking up libraries and the like that can provide a speedup, keyed by the name of the library being called. Each library entry holds the associated name, code, and executable file. How to use the executable file is also registered. The code pattern DB 230 also holds the correspondence with comparison code, which is used to detect libraries and the like by similarity detection.
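The lookup keyed by library name can be sketched as follows. This is a minimal illustration only: it uses Python's built-in `sqlite3` in place of MySQL 8 so that it runs without a server, and the table name, column names, and the `fftw` to `cuFFT` record are hypothetical, not taken from the description.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL 8 database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE code_pattern (
        library_name    TEXT PRIMARY KEY,  -- library called by the application
        gpu_library     TEXT,              -- replaceable accelerated library
        usage           TEXT,              -- how to use the registered file
        comparison_code TEXT               -- code used for similarity detection
    )
    """
)
conn.execute(
    "INSERT INTO code_pattern VALUES (?, ?, ?, ?)",
    (
        "fftw",
        "cuFFT",
        "call through the CUDA library interface",
        "void fft(double *in, double *out, int n);",
    ),
)
conn.commit()


def find_replacement(library_name):
    """Return (gpu_library, usage) for a called library, or None if absent."""
    return conn.execute(
        "SELECT gpu_library, usage FROM code_pattern WHERE library_name = ?",
        (library_name,),
    ).fetchone()


hit = find_replacement("fftw")
miss = find_replacement("libpng")
```

A `None` result corresponds to a library call for which no replaceable library is registered, so that call is simply left on the CPU.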

<Implementation Operation>
An overview of the implementation's operation follows.
When a request to use an application is received, the implementation analyzes the code using the syntax analysis library. It then attempts function block offloading first and loop statement offloading second. The reason is that, between loop statements and function blocks, function block offloading yields the greater speedup because it offloads in a form matched to the processing content, including the algorithm. If function block offloading was possible, the subsequent loop statement offloading is attempted on the code with the offloadable function block portions removed.
The pattern with the highest measured performance is taken as the solution.
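The ordering described above can be sketched in a few lines of Python. The data model is an assumption made for illustration: function blocks are represented by name, and each loop records the function block that contains it (or `None` for a loop outside any block). The names `plan_offload`, `fft2d`, `matmul`, and `parse_input` are hypothetical.

```python
def plan_offload(function_blocks, loops, replaceable):
    """Try function blocks first, then loops outside the offloaded blocks.

    function_blocks: names of function blocks found by code analysis
    loops: list of (loop_name, enclosing_block_or_None)
    replaceable: set of block names with a registered GPU library or IP core
    """
    # Step 1: function block offloading.
    offloaded_blocks = [b for b in function_blocks if b in replaceable]
    # Step 2: loop statement offloading on the code that remains after
    # removing the function blocks that could be offloaded.
    loop_candidates = [
        name for name, block in loops if block not in offloaded_blocks
    ]
    return offloaded_blocks, loop_candidates


blocks, candidates = plan_offload(
    function_blocks=["fft2d", "matmul", "parse_input"],
    loops=[("loop1", "fft2d"), ("loop2", None), ("loop3", "parse_input")],
    replaceable={"fft2d", "matmul"},
)
```

Here `loop1` is excluded from loop statement offloading because its enclosing block `fft2d` is already offloaded as a whole.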

The above describes the case of library calls.
Next, the case where the replacement function detection unit 213 (see FIG. 12) performs similarity detection using a similarity detection tool is described. Similarity detection is processed in parallel with the replacement description above. In the implementation example, when the replacement function detection unit 213 performs similarity detection, it uses Deckard in <Process B-2> to detect the similarity between partial code, such as detected classes and structures, and the comparison code registered in the code pattern DB 230. The replacement function detection unit 213 then detects the function blocks whose similarity exceeds the threshold, together with the corresponding GPU libraries or FPGA IP cores. As in <Process B-1>, the replacement function detection unit 213 acquires the executable file or OpenCL code. The implementation example then creates an executable file as in <Process C-1>. However, when the interface (arguments, return values, types, and so on) of the destination library or IP core differs from that of the code being replaced, the implementation asks the user who requested the offload whether the interface may be changed to match the destination library or IP core, and creates the executable file after confirmation.
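The threshold-based matching can be illustrated with a small Python sketch. It does not use Deckard: as a stand-in, it scores text similarity with `difflib.SequenceMatcher` from the standard library, which is a much simpler technique than tree-based clone detection. The threshold value 0.8, the helper name, and the code snippets are assumptions for illustration.

```python
from difflib import SequenceMatcher

# Comparison code registered in the DB (hypothetical entry).
REGISTERED = {
    "gpu_matmul": (
        "for i in range(n):\n"
        "    for j in range(n):\n"
        "        c[i][j] = sum(a[i][k] * b[k][j] for k in range(n))\n"
    ),
}

# Partial code detected in the application (hypothetical).
CANDIDATES = {
    # Copied from the registered code with one variable renamed (c -> r).
    "my_product": (
        "for i in range(n):\n"
        "    for j in range(n):\n"
        "        r[i][j] = sum(a[i][k] * b[k][j] for k in range(n))\n"
    ),
    "say_hello": "print('hello')",
}


def find_similar_blocks(candidates, registered, threshold=0.8):
    """Return (candidate, library) pairs whose similarity exceeds threshold."""
    matches = []
    for name, code in candidates.items():
        for library, comparison_code in registered.items():
            score = SequenceMatcher(None, code, comparison_code).ratio()
            if score > threshold:
                matches.append((name, library))
    return matches


matches = find_similar_blocks(CANDIDATES, REGISTERED)
```

Only `my_product` exceeds the threshold, so only that block would be paired with the registered library as a replacement candidate.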

At this point, an executable file has been created whose performance can be measured on the GPU or FPGA in the verification environment. For function block offloading, if there is only one function block to replace, the only choice is whether or not to offload it. If there are several, verification patterns are created in which each block is offloaded or not, one at a time, and performance is measured to find a fast solution. This is necessary because, even when a block is expected to be faster, whether it actually is under the given conditions cannot be known without measuring. For example, suppose there are five offloadable function blocks and the first measurement shows that offloading block 2 and offloading block 4 each give a speedup. A second measurement is then made with a pattern that offloads both blocks 2 and 4, and if it is faster than offloading block 2 or block 4 alone, that pattern is selected as the solution.
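The two-round selection in this example can be sketched as follows. The measured times are made-up numbers supplied by a fake `measure` function, since real values would come from the verification environment. The helper name `choose_pattern` and the simplification of combining all individually faster blocks into one pattern are assumptions.

```python
def choose_pattern(blocks, measure, baseline):
    """Two-round search: single-block patterns, then their combination.

    measure(pattern) returns the processing time of a pattern, where a
    pattern is the frozenset of offloaded blocks; baseline is the time
    with nothing offloaded. Lower is better.
    """
    results = {frozenset(): baseline}
    # First measurement: offload each block on its own.
    for block in blocks:
        pattern = frozenset([block])
        results[pattern] = measure(pattern)
    faster = [b for b in blocks if results[frozenset([b])] < baseline]
    # Second measurement: combine the blocks that were faster individually.
    if len(faster) >= 2:
        combined = frozenset(faster)
        results[combined] = measure(combined)
    best = min(results, key=results.get)
    return best, results[best]


# Hypothetical processing times in seconds; CPU-only baseline is 10.0.
FAKE_TIMES = {
    frozenset([1]): 10.5,
    frozenset([2]): 8.0,
    frozenset([3]): 11.0,
    frozenset([4]): 9.0,
    frozenset([5]): 10.2,
    frozenset([2, 4]): 7.2,
}

best, best_time = choose_pattern([1, 2, 3, 4, 5], FAKE_TIMES.__getitem__, 10.0)
```

With these numbers the combined pattern of blocks 2 and 4 wins. If the combination had been slower than block 2 alone, the single-block pattern would have been returned instead.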

[Hardware Configuration]
The offload server according to the first and second embodiments is realized by a computer 900, a physical device configured, for example, as shown in FIG. 22.
FIG. 22 is a hardware configuration diagram showing an example of a computer that realizes the functions of the offload servers 1 and 200. The computer 900 includes a CPU (Central Processing Unit) 901, a ROM (Read Only Memory) 902, a RAM 903, an HDD (Hard Disk Drive) 904, an input/output I/F (Interface) 905, a communication I/F 906, and a media I/F 907.

The CPU 901 operates based on programs stored in the ROM 902 or the HDD 904 and performs the control of each processing unit of the offload servers 1 and 200 shown in FIGS. 1 and 12. The ROM 902 stores a boot program executed by the CPU 901 when the computer 900 starts up, programs related to the hardware of the computer 900, and the like.

The CPU 901 controls an input device 910 such as a mouse or keyboard and an output device 911 such as a display via the input/output I/F 905. The CPU 901 acquires data from the input device 910 and outputs generated data to the output device 911 via the input/output I/F 905.

The HDD 904 stores the programs executed by the CPU 901, the data used by those programs, and the like. The communication I/F 906 receives data from other devices via a communication network (for example, NW (Network) 920) and outputs it to the CPU 901, and transmits data generated by the CPU 901 to other devices via the communication network.

The media I/F 907 reads a program or data stored on a recording medium 912 and outputs it to the CPU 901 via the RAM 903. The CPU 901 loads the program for the target processing from the recording medium 912 onto the RAM 903 via the media I/F 907 and executes the loaded program. The recording medium 912 is an optical recording medium such as a DVD (Digital Versatile Disc) or PD (Phase change rewritable Disk), a magneto-optical recording medium such as an MO (Magneto Optical disk), a magnetic recording medium, a tape medium, a semiconductor memory, or the like.

For example, when the computer 900 functions as the offload server 1 or 200 according to the first or second embodiment, the CPU 901 of the computer 900 realizes the functions of the offload server 1 or 200 by executing the program loaded onto the RAM 903. The HDD 904 stores the data in the RAM 903. The CPU 901 reads the program for the target processing from the recording medium 912 and executes it. Alternatively, the CPU 901 may read the program for the target processing from another device via the communication network (NW 920).

[Effects]
The effects of the offload server and related aspects of the present invention are described below.
The offload server 1 according to the first embodiment (see FIG. 1) is an offload server that offloads specific processing of an application program to an accelerator, where the application program is written in at least one language selected from C, Python, and Java. The offload server 1 includes: an application code analysis unit 112 that analyzes the source code of the application program; a data transfer specification unit 113 that analyzes the reference relationships of variables used in the loop statements of the application program and, for data that may be transferred outside a loop, specifies the data transfer using an explicit specification line that explicitly specifies data transfer outside the loop; a parallel processing specification unit 114 that identifies the loop statements of the application program and compiles each identified loop statement with a parallel processing specification statement for the accelerator; and a parallel processing pattern creation unit 115 that excludes loop statements that produce a compilation error from offloading and creates parallel processing patterns that specify, for each loop statement that does not produce a compilation error, whether or not to process it in parallel.
The offload server 1 further includes: a performance measurement unit 116 that compiles the application program of each parallel processing pattern, places it on the verification machine 14, and executes processing for measuring the performance when offloaded to the accelerator; and an executable file creation unit that, based on the performance measurement results, selects a plurality of high-performance parallel processing patterns from the plurality of parallel processing patterns, creates a plurality of new parallel processing patterns from them by crossover and mutation, and repeats the process up to a new performance measurement, and that, after the designated number of performance measurements, selects the parallel processing pattern with the highest processing performance based on the performance measurement results and compiles it to create an executable file.
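The search over parallel processing patterns can be sketched as a small genetic algorithm. This is a minimal illustration under stated assumptions, not the patented implementation: each pattern is a tuple with one bit per loop statement (1 = offload to the GPU, 0 = keep on the CPU), the population size, number of generations, and mutation rate are arbitrary, and `fake_measure` returns made-up timings in place of a real measurement on the verification machine.

```python
import random


def run_ga(num_loops, measure, population_size=6, generations=5, seed=0):
    """Search for the fastest offload pattern; requires num_loops >= 2."""
    rng = random.Random(seed)
    population = [
        tuple(rng.randint(0, 1) for _ in range(num_loops))
        for _ in range(population_size)
    ]
    measured = {}  # pattern -> measured processing time (lower is better)
    for generation in range(generations):
        for pattern in population:
            if pattern not in measured:
                measured[pattern] = measure(pattern)
        if generation == generations - 1:
            break
        # Select the high-performance patterns as parents.
        parents = sorted(population, key=measured.get)[: max(2, population_size // 2)]
        children = []
        while len(children) < population_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, num_loops - 1)  # one-point crossover
            child = list(a[:cut] + b[cut:])
            if rng.random() < 0.1:  # mutation: flip one bit
                i = rng.randrange(num_loops)
                child[i] = 1 - child[i]
            children.append(tuple(child))
        population = children
    best = min(measured, key=measured.get)
    return best, measured[best]


# Hypothetical per-loop processing times on each device.
CPU_TIME = [4.0, 1.0, 6.0, 0.5]
GPU_TIME = [1.0, 2.0, 1.5, 3.0]


def fake_measure(pattern):
    return sum(GPU_TIME[i] if bit else CPU_TIME[i] for i, bit in enumerate(pattern))


best_pattern, best_time = run_ga(4, fake_measure)
```

Caching results in `measured` avoids re-measuring a pattern that reappears in a later generation, which matters when each measurement is a real build and run.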

In this way, the offload server 1 can automatically offload to the GPU by a common method even when the source language is one of a variety of languages, including C, Python, and Java. This removes the need to design and implement processing separately for each source language and thus reduces cost.

Furthermore, whether the source language is C, Python, or Java, the offload server 1 can improve overall processing capability by automatically offloading specific processing of the application program to the accelerator while reducing the number of data transfers between the CPU and the GPU. This allows even users without skills such as CUDA to perform high-performance processing on a GPU. It also makes it possible to speed up general-purpose CPU applications for which GPU acceleration has not previously been considered, and to offload to the GPU of a general-purpose machine that is not a high-performance computing server.

The offload server 200 according to the second embodiment (see FIG. 12) is an offload server that offloads specific processing of an application program to a GPU or PLD, where the application program is written in at least one language selected from C, Python, and Java. The offload server 200 includes: a code pattern DB 230 that stores libraries and IP cores that can be offloaded to the GPU or PLD; an application code analysis unit 112 that analyzes the source code of the application program and detects the external library calls contained in it; a replacement function detection unit 213 that retrieves libraries and IP cores from the code pattern DB 230 using the detected external library calls as keys; a replacement processing unit 214 that replaces the source-side processing description in the source code of the application program with the processing description of the destination library or IP core retrieved by the replacement function detection unit 213, and offloads the replaced library or IP core processing description to the GPU or PLD as the function block to be offloaded; an offload pattern creation unit 215 that creates the interface with the host program; an executable file creation unit 117 that compiles the application of each created GPU or PLD processing pattern to create an executable file; and a performance measurement unit 116 that places the created executable file on the accelerator verification device and executes processing for measuring the performance when offloaded to the GPU or PLD.
Based on the performance measurement results from that processing, the executable file creation unit 117 selects the GPU or PLD processing pattern with the highest processing performance from the plurality of GPU or PLD processing patterns and compiles it to create the final executable file.

In this way, the offload server 200 can automatically offload to the GPU by a common method even when the source language is one of a variety of languages, including C, Python, and Java. This removes the need to design and implement processing separately for each source language and thus reduces cost.

Furthermore, whether the source language is C, Python, or Java, the offload server 200 replaces the source-side processing description in the application code with the processing description of the destination library or IP core and offloads it to a GPU or PLD (such as an FPGA) as an offloadable function block. That is, rather than individual loop statements, it offloads function blocks in larger units such as matrix multiplication or Fourier transform, implemented together with algorithms suited to hardware such as FPGAs and GPUs. Offloading in units of function blocks thereby speeds up the offloaded processing in automatic offloading to GPUs and PLDs (such as FPGAs). As a result, as environments diversify to include GPUs, FPGAs, IoT devices, and so on, applications can be adapted to their environment and run with high performance.

In the offload servers 1 and 200 according to the first and second embodiments, when the application program is written in C, the GPU processing of loop statements is specified in OpenACC syntax, the GPU processing of function blocks is performed by calling CUDA libraries, and GPU offloading is carried out using a C compiler.

In this way, when the source language is C, automatic offloading to the GPU can be performed by a common method.

In the offload servers 1 and 200 according to the first and second embodiments, when the application program is written in Python, the GPU processing of loop statements is specified in CUDA syntax, the GPU processing of function blocks is performed by calling CUDA libraries, and GPU offloading is carried out using pyCUDA.

In this way, when the source language is Python, automatic offloading to the GPU can be performed by a common method.

In the offload servers 1 and 200 according to the first and second embodiments, when the application program is written in Python, the GPU processing of loop statements is specified in OpenACC syntax, the GPU processing of function blocks is performed by calling CUDA libraries, and GPU offloading is carried out using pyACC.

In this way, when the source language is Python, automatic offloading to the GPU can be performed by a common method, just as in the case of C.

In the offload servers 1 and 200 according to the first and second embodiments, when the application program is written in Java, the GPU processing of loop statements is specified in Java lambda syntax, the GPU processing of function blocks is performed by calling CUDA libraries, and GPU offloading is carried out using a Java virtual machine.

In this way, when the source language is Java, automatic offloading to the GPU can be performed by a common method.

The present invention is also an offload program for causing a computer to function as the offload server described above.

In this way, each function of the offload server 200 can be realized using a general-purpose computer.

[Modifications]
The offload server 1 according to the first embodiment and the offload server 200 according to the second embodiment may be combined. In that configuration, based on the result of the code analysis performed by the application code analysis unit 112, the data transfer specification unit 113 (see FIG. 1) performs data transfer so that function block offloading is attempted first and loop statement offloading second, and, if function block offloading was possible, performs data transfer so that loop statement offloading is attempted on the code with the offloadable function block portions removed.

With this configuration, when a request to use an application program is received, the offload server first analyzes the code using the syntax analysis library and then attempts function block offloading followed by loop statement offloading. If function block offloading was possible, it attempts loop statement offloading on the code with the offloadable function block portions removed, and takes the pattern with the highest measured performance as the solution. Between loop statements and function blocks, function block offloading yields the greater speedup because it offloads in a form matched to the processing content, including the algorithm. Attempting function block offloading first and loop statement offloading second therefore speeds up processing and improves overall processing capability.

Of the processes described in the above embodiments, all or part of those described as being performed automatically may be performed manually, and all or part of those described as being performed manually may be performed automatically by known methods. In addition, the processing procedures, control procedures, specific names, and information including various data and parameters shown in the above description and drawings may be changed arbitrarily unless otherwise specified.
The components of each illustrated device are functional concepts and do not necessarily have to be physically configured as illustrated. That is, the specific form of distribution and integration of each device is not limited to what is illustrated, and all or part of each device may be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like.

1, 200 Offload server
11, 210 Control unit
12 Input/output unit
13, 130 Storage unit
14 Verification machine (accelerator verification device)
15 Commercial environment
111 Application code specification unit
112 Application code analysis unit
113 Data transfer specification unit
114 Parallel processing specification unit
114a Offload range extraction unit
114b Intermediate language file output unit
115 Parallel processing pattern creation unit
116 Performance measurement unit
116a Binary file placement unit
117 Executable file creation unit
118 Production environment deployment unit
119 Performance measurement test extraction and execution unit
120 User provision unit
125 Application code
131 Test case DB
132 Intermediate language file
151 Various devices
152 Device having CPU-GPU
153 Device having CPU-FPGA
154 Device having CPU
213 Replacement function detection unit
214 Replacement processing unit
215 Offload pattern creation unit
230 Code pattern DB

Claims

An offload server that offloads a specific process of an application program to an accelerator, wherein the application program is a Python application program, the offload server comprising:
an application code analysis unit that analyzes a source code of the Python application program using a syntax analysis tool that analyzes Python;
a data transfer specification unit that analyzes reference relationships between variables used in a loop statement of the Python application program and, for data that may be transferred outside the loop, specifies data transfer using an explicit specification line that explicitly specifies data transfer outside the loop;
a parallel processing specification unit that identifies loop statements of the Python application program and, when interpreting, using pyCUDA, Python code to which CUDA instructions have been added for each of the identified loop statements, specifies GPU processing using CUDA grammar and performs the interpretation;
a parallel processing pattern creation unit that creates a parallel processing pattern that excludes a loop statement that generates a compilation error from being subject to offloading and specifies whether or not a loop statement that does not generate a compilation error should be processed in parallel;
a performance measurement unit that compiles the Python application program of the parallel processing pattern, places it in an accelerator verification device, and executes a process for performance measurement when offloaded to the accelerator; and
an executable file creation unit that selects a plurality of parallel processing patterns with high processing performance from the plurality of parallel processing patterns based on a performance measurement result, crosses the parallel processing patterns with high processing performance, creates a plurality of other parallel processing patterns by mutation processing, and performs new performance measurement; after a designated number of performance measurements, selects a parallel processing pattern with the highest processing performance from the plurality of parallel processing patterns based on the performance measurement result; after GA processing for a designated number of generations is completed, sets, as a solution, the parallel processing pattern with the highest processing performance, which includes the Python application code corresponding to the gene sequence with the highest performance; and compiles the parallel processing pattern with the highest processing performance to create an executable file.
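The selection, crossover, and mutation performed by the executable file creation unit can be pictured with a minimal sketch. Everything named here is an assumption made for illustration: each gene is one bit per loop statement (1 = offload to the GPU, 0 = keep on the CPU), and `measure_time`, `CPU_TIME`, and `GPU_TIME` are hypothetical stand-ins for compiling a pattern and timing it on the accelerator verification device. The population size, mutation rate, and one-point crossover are likewise illustrative choices, not values taken from the claims.

```python
import random

# Hypothetical per-loop timings in seconds (GPU values include data transfer).
CPU_TIME = [5.0, 1.0, 8.0, 0.5, 3.0]
GPU_TIME = [1.0, 2.0, 1.5, 2.5, 0.8]


def measure_time(pattern):
    """Stand-in for a real measurement on the verification device (lower is better)."""
    return sum(g if bit else c for bit, c, g in zip(pattern, CPU_TIME, GPU_TIME))


def search_pattern(n_loops, population=8, generations=10, mutation_rate=0.05, seed=0):
    """Return the fastest offload pattern found after the given number of generations."""
    rng = random.Random(seed)
    # Start from the no-offload baseline plus random patterns.
    pop = [[0] * n_loops]
    pop += [[rng.randint(0, 1) for _ in range(n_loops)] for _ in range(population - 1)]
    best = min(pop, key=measure_time)
    for _ in range(generations):
        # Selection: keep the faster half of the population as parents.
        parents = sorted(pop, key=measure_time)[: population // 2]
        pop = []
        while len(pop) < population:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, n_loops - 1)   # one-point crossover
            child = a[:cut] + b[cut:]
            # Mutation: flip each bit with a small probability.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit for bit in child]
            pop.append(child)
        # Remember the fastest pattern measured so far.
        best = min(pop + [best], key=measure_time)
    return best
```

Calling `search_pattern(5)` returns a list of five bits. Because the no-offload baseline is part of the initial population and the best pattern is retained across generations, the returned pattern is never slower than running every loop on the CPU under this cost model.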

An offload server that offloads a specific process of an application program to an accelerator including a GPU (Graphics Processing Unit), wherein the application program is a Python application program, the offload server comprising:
an application code analysis unit that analyzes a source code of the Python application program using a syntax analysis tool that analyzes Python;
a data transfer specification unit that analyzes reference relationships between variables used in a loop statement of the Python application program and specifies data transfer for data that may be transferred outside the loop;
a parallel processing specification unit in which, for a loop statement for which GPU processing is specified in the Python code, CuPy calls a CUDA command via a library and CUDA performs the GPU processing;
a parallel processing pattern creation unit that creates a parallel processing pattern that excludes a loop statement that generates a compilation error from being subject to offloading and specifies whether or not a loop statement that does not generate a compilation error should be processed in parallel;
a performance measurement unit that compiles the Python application program of the parallel processing pattern, places it in an accelerator verification device, and executes a process for performance measurement when offloaded to the accelerator;
an executable file creation unit which selects a plurality of parallel processing patterns with high processing performance from the plurality of parallel processing patterns based on a performance measurement result, crosses the parallel processing patterns with high processing performance, creates a plurality of other parallel processing patterns by mutation processing, and performs new performance measurement, and after a designated number of performance measurements, selects a parallel processing pattern with the highest processing performance from the plurality of parallel processing patterns based on the performance measurement result, and after GA processing for a designated number of generations is completed, sets the parallel processing pattern with the highest processing performance including a Python application code corresponding to the gene sequence with the highest performance as a solution, and compiles the parallel processing pattern with the highest processing performance to create an executable file,
wherein the parallel processing specification unit, when performing GPU processing via CUDA, rewrites an arithmetic expression of a for statement that has arrays on its right and left sides into an expression with a range start and a range end, converts the entire expression into a matrix operation expression, and specifies it.
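The rewrite of a for statement into a range-start/range-end expression can be sketched as a source-to-source transformation. The sketch below is an illustration only, assuming Python 3.9 or later (for `ast.unparse`) and handling just the simplest shape, `for i in range(start, stop):` with a single assignment whose subscripts are all exactly the loop variable. The helper names `SliceRewriter` and `vectorize_loop` are hypothetical. With CuPy arrays, the resulting slice expression would be evaluated on the GPU; CuPy itself is not imported here.

```python
import ast


class SliceRewriter(ast.NodeTransformer):
    """Replace every subscript x[i] by x[start:stop], where i is the loop variable."""

    def __init__(self, var, start, stop):
        self.var, self.start, self.stop = var, start, stop

    def visit_Subscript(self, node):
        self.generic_visit(node)
        if isinstance(node.slice, ast.Name) and node.slice.id == self.var:
            node.slice = ast.Slice(lower=self.start, upper=self.stop)
        return node


def vectorize_loop(src):
    """Return a slice expression for a simple element-wise loop, else src unchanged."""
    loop = ast.parse(src).body[0]
    simple = (
        isinstance(loop, ast.For)
        and isinstance(loop.target, ast.Name)
        and isinstance(loop.iter, ast.Call)
        and isinstance(loop.iter.func, ast.Name)
        and loop.iter.func.id == "range"
        and len(loop.iter.args) == 2
        and len(loop.body) == 1
        and isinstance(loop.body[0], ast.Assign)
    )
    if not simple:
        return src
    var = loop.target.id
    start, stop = loop.iter.args
    stmt = SliceRewriter(var, start, stop).visit(loop.body[0])
    # If the loop variable is still used (e.g. a[i - 1]), the loop is not a plain
    # element-wise operation, so leave the original code untouched.
    if any(isinstance(n, ast.Name) and n.id == var for n in ast.walk(stmt)):
        return src
    return ast.unparse(ast.fix_missing_locations(stmt))
```

For example, the loop `for i in range(s, e): c[i] = a[i] + b[i]` becomes `c[s:e] = a[s:e] + b[s:e]`, while a loop with a dependence such as `c[i] = c[i - 1] + a[i]` is returned unchanged.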

An offload server that offloads a specific process of an application program to an accelerator, wherein the application program is a Python application program, the offload server comprising:
an application code analysis unit that analyzes a source code of the Python application program using a syntax analysis tool that analyzes Python;
a data transfer specification unit that analyzes reference relationships between variables used in a loop statement of the Python application program and, for data that may be transferred outside the loop, specifies data transfer using an explicit specification line that explicitly specifies data transfer outside the loop;
a parallel processing specification unit that identifies loop statements of the Python application program and, when interpreting Python code to which a CUDA instruction has been added for each of the identified loop statements using pyACC, which interprets OpenACC, specifies the OpenACC directive #pragma acc kernels;
a parallel processing pattern creation unit that creates a parallel processing pattern that excludes a loop statement that generates a compilation error from being subject to offloading and specifies whether or not a loop statement that does not generate a compilation error should be processed in parallel;
a performance measurement unit that compiles the Python application program of the parallel processing pattern, places it in an accelerator verification device, and executes a process for performance measurement when offloaded to the accelerator; and
an executable file creation unit that selects a plurality of parallel processing patterns with high processing performance from the plurality of parallel processing patterns based on a performance measurement result, crosses the parallel processing patterns with high processing performance, creates a plurality of other parallel processing patterns by mutation processing, and performs new performance measurement; after a designated number of performance measurements, selects a parallel processing pattern with the highest processing performance from the plurality of parallel processing patterns based on the performance measurement result; after GA processing for a designated number of generations is completed, sets, as a solution, the parallel processing pattern with the highest processing performance, which includes the Python application code corresponding to the gene sequence with the highest performance; and compiles the parallel processing pattern with the highest processing performance to create an executable file.
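The loop identification and variable reference analysis recited for the application code analysis unit and the data transfer specification unit can be sketched with Python's standard `ast` module as the syntax analysis tool. This is a minimal sketch under assumptions: the function names are hypothetical, only subscripted variables (arrays) are tracked, and the classification into `copyin`, `copyout`, and `copy` borrows OpenACC data clause names purely as labels for "read only", "written only", and "read and written" inside the loop.

```python
import ast


def base_name(node):
    """Return the array name at the root of a subscript chain such as c[i][j]."""
    while isinstance(node, ast.Subscript):
        node = node.value
    return node.id if isinstance(node, ast.Name) else None


def loop_variable_report(src):
    """For each for statement, classify the arrays it reads and writes."""
    report = []
    for node in ast.walk(ast.parse(src)):
        if not isinstance(node, ast.For):
            continue
        reads, writes = set(), set()
        for sub in ast.walk(node):
            # An augmented assignment (c[i] += ...) also reads its target.
            if isinstance(sub, ast.AugAssign) and isinstance(sub.target, ast.Subscript):
                name = base_name(sub.target)
                if name:
                    reads.add(name)
            if isinstance(sub, ast.Subscript):
                name = base_name(sub)
                if name:
                    (writes if isinstance(sub.ctx, ast.Store) else reads).add(name)
        report.append({
            "line": node.lineno,
            "copyin": sorted(reads - writes),   # read only: transfer to the device once
            "copyout": sorted(writes - reads),  # written only: transfer back once
            "copy": sorted(reads & writes),     # read and written: transfer both ways
        })
    return report
```

Arrays listed for the outermost loop are candidates for a single transfer placed outside that loop rather than one transfer per iteration. The analysis is deliberately conservative: a write to `c[i][j]` also counts as a read of `c`.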

An offload server that offloads a specific process of an application program to a graphics processing unit (GPU) or a programmable logic device (PLD), wherein the application program is a Python application program, the offload server comprising:
a storage unit that stores a GPU library and an IP core that can be offloaded to the GPU or the PLD;
an application code analysis unit that analyzes a source code of the Python application program using a syntax analysis tool that analyzes Python;
a replacement function detection unit that acquires the GPU library and the IP core from the storage unit using the detected external library call as a key;
a replacement processing unit that, when a GPU library call is specified by pyCUDA and a source process description in the source code of the Python application program is replaced with a destination process description of the GPU library and the IP core acquired by the replacement function detection unit, has the Python code to which a CUDA instruction has been added interpreted by pyCUDA and replaces the description with a process that matches the GPU process call made by pyCUDA, and that offloads the replaced processing description of the GPU library and the IP core to the GPU or the PLD as a functional block to be offloaded;
an offload pattern creation unit that creates an interface with a host program, and extracts an offload pattern that will be faster by testing whether or not to offload through performance measurement in a verification environment;
an executable file creation unit that compiles the Python application program of the created GPU or PLD processing pattern to create an executable file;
a performance measurement unit that places the created executable file in an accelerator verification device and executes a process for measuring performance when the executable file is offloaded to the GPU or the PLD;
wherein the executable file creation unit selects the GPU or PLD processing pattern with the highest processing performance from the plurality of GPU or PLD processing patterns based on the performance measurement results of the performance measurement process, and compiles the GPU or PLD processing pattern with the highest processing performance to create a final executable file.
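The lookup performed by the replacement function detection unit, which uses a detected external library call as a key, can be sketched as follows. The sketch rests on assumptions: `CODE_PATTERN_DB` is a hypothetical in-memory stand-in for the code pattern DB, its two entries are examples only, and only `import ... as ...` aliases are resolved. The function reports candidate replacements; it does not rewrite the source or decide whether offloading is actually faster, which the claim leaves to performance measurement.

```python
import ast

# Hypothetical code pattern DB: external library call -> GPU library replacement.
CODE_PATTERN_DB = {
    "numpy.fft.fft": "cupy.fft.fft",
    "numpy.linalg.inv": "cupy.linalg.inv",
}


def dotted_name(node):
    """Turn an attribute chain such as np.fft.fft into the string 'np.fft.fft'."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def find_replaceable_calls(src, db=CODE_PATTERN_DB):
    """Return (line, detected call, replacement) for every call found in the DB."""
    tree = ast.parse(src)
    # Map import aliases back to module names, e.g. {"np": "numpy"}.
    aliases = {
        alias.asname or alias.name: alias.name
        for node in ast.walk(tree)
        if isinstance(node, ast.Import)
        for alias in node.names
    }
    hits = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        name = dotted_name(node.func)
        if name is None:
            continue
        head, _, rest = name.partition(".")
        full = aliases.get(head, head) + ("." + rest if rest else "")
        if full in db:
            hits.append((node.lineno, full, db[full]))
    return sorted(hits)
```

Each reported hit is one offloadable function block. Whether to apply the replacement would then be decided per block by measuring performance in the verification environment.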

5. The offload server according to claim 1, wherein the performance measurement unit executes the process for measuring the performance when the offload is performed by using an automatic measurement tool including Jenkins.

6. The offload server according to claim 4, wherein the performance measurement unit repeats the execution when there are a plurality of offloadable function blocks, and determines the pattern with the highest performance as the final solution.

An offload control method for an offload server that offloads a specific process of an application program to an accelerator, the application program being a Python application program, the method comprising:
analyzing the source code of the Python application program with a parser that parses Python;
analyzing reference relationships between variables used in a loop statement of the Python application program and, for data that may be transferred outside the loop, specifying data transfer using an explicit specification line that explicitly specifies data transfer outside the loop;
identifying loop statements of the Python application program and, when interpreting, using pyCUDA, Python code to which CUDA instructions have been added for each of the identified loop statements, specifying GPU processing using CUDA grammar and performing the interpretation;
creating a parallel processing pattern that excludes a loop statement that generates a compilation error from being subject to offloading and specifies whether or not a loop statement that does not generate a compilation error should be processed in parallel;
compiling the Python application program of the parallel processing pattern, placing it in an accelerator verification device, and executing a process for measuring performance when the Python application program is offloaded to the accelerator; and
selecting a plurality of parallel processing patterns with high processing performance from the plurality of parallel processing patterns based on performance measurement results, crossing the parallel processing patterns with high processing performance, creating a plurality of other parallel processing patterns by mutation processing, and performing new performance measurements; after a designated number of performance measurements, selecting a parallel processing pattern with the highest processing performance from the plurality of parallel processing patterns based on the performance measurement results; after GA processing for a designated number of generations is completed, setting, as a solution, the parallel processing pattern with the highest processing performance, which includes the Python application code corresponding to the gene sequence with the highest performance; and compiling the parallel processing pattern with the highest processing performance to create an executable file.

An offload program for causing a computer to function as the offload server according to any one of claims 1 to 6.