KR20230120850A

KR20230120850A - Deep-learning compiler for supporting heterogeneous computing platform and method thereof

Info

Publication number: KR20230120850A
Application number: KR1020220017588A
Authority: KR
Inventors: 유미선; 권용인; 김태호; 박제만; 이제민
Original assignee: 한국전자통신연구원
Priority date: 2022-02-10
Filing date: 2022-02-10
Publication date: 2023-08-17

Abstract

Disclosed are a deep learning compile system supporting a heterogeneous computer platform and an operating method of a deep learning compiler supporting the heterogeneous computer platform. The deep learning compiler supporting the heterogeneous computing platform according to an embodiment of the present invention comprises: a memory wherein at least one program is recorded; and a processor that executes the program, wherein the program may perform a step of converting a deep neural net model into a graph connected according to data flow of operators, and a step of converting the deep neural net model into a code by considering execution performance measurement results of the operators in at least two or more heterogeneous accelerators included in a target platform and an execution order of the operators divided and assigned to the heterogeneous accelerators. Therefore, the present invention is capable of mapping the accelerator wherein the relevant graph will execute.

Description

Deep learning compiler supporting heterogeneous computing platform and its method

기재된 실시예는 딥뉴럴넷을 복수의 이종 가속기들을 포함하는 하드웨어 플랫폼 용 프로그램으로 자동 변환해주는 기술에 관한 것이다.The described embodiment relates to a technology for automatically converting a deep neural network into a program for a hardware platform including a plurality of heterogeneous accelerators.

딥뉴럴넷을 편리하게 생성하고 실행할 수 있도록 지원하는 Pytorch, TensorFlow 및 Caffe와 같은 다양한 딥러닝 플랫폼들이 여러 분야에서 널리 사용되고 있다. 그러나, 이러한 딥러닝 플랫폼들은 리소스 제약이 많은 임베디드 시스템상에 설치되어 실행되기가 어렵다. Various deep learning platforms such as Pytorch, TensorFlow, and Caffe that support the convenient creation and execution of deep neural networks are widely used in various fields. However, these deep learning platforms are difficult to install and run on resource-constrained embedded systems.

따라서, 최근 딥러닝 플랫폼 설치없이 딥뉴럴넷을 하드웨어 플랫폼용 머신코드 또는 컴파일 가능한 소스코드(이하 '코드'로 기재함)로 변환해주는 딥러닝 컴파일러들이 제안되고 있다. TVM, Glow 및 XLA 등으로 대표되는 딥러닝 컴파일러들은 딥러닝 플랫폼에서 생성된 딥뉴럴넷 모델을 타겟 HW의 아키텍처에서 빠르게 동작할 수 있는 최적화된 코드로 변환해준다.Accordingly, deep learning compilers that convert deep neural networks into machine codes for hardware platforms or compilable source codes (hereinafter referred to as 'code') without installing a deep learning platform have been proposed. Deep learning compilers represented by TVM, Glow, and XLA convert deep neural net models created in deep learning platforms into optimized codes that can operate quickly in the architecture of the target HW.

이러한 기존의 딥러닝 컴파일러들은 기본적으로 타겟 CPU(X86, ARM 등)용 코드를 생성하고, 추가적으로 뉴럴넷의 특정 연산자를 GPU나 NPU상에서 동작 가능한 코드로 변환할 수 있다. 예를 들어, XLA는 GPU나 TPU 코드를 생성할 수 있고, TVM은 VTA라 불리는 NPU(Neural Processing Unit) 코드를 생성할 수 있다. These existing deep learning compilers basically generate codes for target CPUs (X86, ARM, etc.) and can additionally convert specific operators of neural networks into codes that can be operated on GPUs or NPUs. For example, XLA can generate GPU or TPU code, and TVM can generate NPU (Neural Processing Unit) code called VTA.

이를 위해 딥러닝 컴파일러들은 입력된 딥뉴럴넷 모델을 계산 그래프로 변환하고, 변환된 계산 그래프를 기반으로 연산자 퓨징/제거(fusing/elimination) 및 연산자 간 순서 변경(operator sinking)과 같은 그래프 수준의 최적화를 수행한 후 타겟 HW 상에서 실행 가능한 코드를 생성한다. To this end, deep learning compilers convert the input deep neural network model into a computation graph, and based on the converted computation graph, perform graph-level optimization such as fusing/elimination and operator sinking. After execution, executable code is generated on the target HW.

그러나 기존의 딥러닝 컴파일러들은 타겟 HW의 종류를 한 가지만 지정 가능하다. 즉, 하나의 가속기에서만 뉴럴넷 연산자를 실행하는 코드를 생성할 수 있다. 만약 지정된 가속기가 특정 연산자를 실행하지 못하는 경우 CPU를 사용하도록 오프로딩 (offloading)하는 기능을 제공하는 딥러닝 컴파일러도 있으나, 기본적으로 타겟 HW는 한 가지만 지정 가능하다.However, existing deep learning compilers can specify only one type of target HW. That is, you can generate code that runs neural net operators on only one accelerator. There are also deep learning compilers that provide a function of offloading the CPU to use if the designated accelerator cannot execute a specific operator, but basically only one target HW can be specified.

그러나 최신 모바일 디바이스와 같은 임베디드 시스템의 경우 CPU, GPU 및 NPU 등의 다양한 이종 가속기들을 모두 내장하고 있다. 이들 이종 가속기들은 모두 다른 하드웨어적 특성을 가지고 있어 뉴럴넷 연산자의 종류에 따라 수행 속도 및 소모 전력 등이 다르다. 따라서, 시스템 내의 가용한 이종 가속기들 각각의 상이한 특성을 모두 활용하여 뉴럴넷의 실행 속도나 소모 전력을 최소화시킬 수 있는 기술이 요구된다.However, in the case of embedded systems such as the latest mobile devices, various heterogeneous accelerators such as CPU, GPU, and NPU are all embedded. All of these heterogeneous accelerators have different hardware characteristics, so their execution speed and power consumption differ depending on the type of neural network operator. Therefore, a technology capable of minimizing the execution speed or power consumption of the neural net by utilizing all the different characteristics of each of the available heterogeneous accelerators in the system is required.

기재된 실시예는 복수의 이종 가속기들을 내장한 이종 컴퓨팅 플랫폼에서 동작하는 고성능 딥뉴럴넷 실행코드를 자동 생성 가능한 딥러닝 컴파일 기술을 제공하는데 그 목적이 있다. An object of the described embodiment is to provide a deep learning compilation technology capable of automatically generating high-performance deep neural network execution codes operating on a heterogeneous computing platform incorporating a plurality of heterogeneous accelerators.

기재된 실시예는 복수의 이종 가속기들 각각의 상이한 특성을 모두 활용하여 실행 속도나 소모 전력을 최소화시킬 수 있도록 딥 뉴럴넷을 그래프 기반으로 분할하여 이종 가속기 상에서 병렬 실행 가능한 코드로 변환하는 기술을 제공하는 데 그 목적이 있다.The disclosed embodiment is to provide a technology for converting a deep neural network into a code that can be executed in parallel on a heterogeneous accelerator by dividing a deep neural network based on a graph so as to minimize execution speed or power consumption by utilizing all the different characteristics of each of a plurality of heterogeneous accelerators. It has a purpose.

실시예에 따른 이종 컴퓨팅 플랫폼 지원 딥러닝 컴파일러는, 적어도 하나의 프로그램이 기록된 메모리 및 프로그램을 실행하는 프로세서를 포함하며, 프로그램은, 딥 뉴럴넷 모델을 연산자들의 데이터 흐름에 따라 연결한 그래프로 변환하는 단계, 및 타겟 플랫폼에 포함된 적어도 둘 이상의 이종 가속기들에서의 연산자들의 실행 성능 측정 결과와 이종 가속기들에 분할되어 할당된 연산자들의 실행 순서를 고려하여 딥 뉴럴넷 모델을 코드로 변환하는 단계를 수행할 수 있다. The heterogeneous computing platform support deep learning compiler according to the embodiment includes a memory in which at least one program is recorded and a processor that executes the program, and the program converts the deep neural net model into a graph connected according to the data flow of operators. and converting the deep neural net model into code by considering the execution performance measurement results of operators in at least two or more heterogeneous accelerators included in the target platform and the execution order of operators divided and allocated to the heterogeneous accelerators. can

이 때, 실행 성능 측정 결과는 그래프를 기반으로 타겟 플랫폼에 포함된 적어도 둘 이상의 이종 가속기들에서의 연산자들의 실행 성능 측정을 통해 획득되거나 또는 외부로부터 입력될 수 있다.In this case, the execution performance measurement result may be obtained through measurement of execution performance of operators in at least two or more heterogeneous accelerators included in the target platform based on the graph, or may be input from the outside.

이 때, 프로그램은, 실행 성능 측정을 수행하는 경우에, 입력된 타겟 플랫폼 가속기 명세 정보를 기반으로 측정된 가속기 별 단위 연산자의 실행 성능 결과로 단위 성능 명세 파일을 생성하는 단계; 및 입력된 퓨징 연산자 명세 정보를 기반으로 측정된 가속기 별 연산자 그룹의 실행 성능 결과로 퓨징 연산자 성능 명세 파일을 생성하는 단계를 수행할 수 있다.At this time, when the program performs execution performance measurement, generating a unit performance specification file as an execution performance result of the unit operator for each accelerator measured based on the input target platform accelerator specification information; and generating a fusing operator performance specification file as a result of execution performance of operator groups for each accelerator measured based on the input fusing operator specification information.

이 때, 실행 순서는 실행 성능 측정 결과를 고려한 스케줄링을 통해 결정되거나 또는 외부로부터 입력될 수 있다.In this case, the execution order may be determined through scheduling in consideration of execution performance measurement results or may be input from the outside.

이 때, 프로그램은, 스케줄링을 수행하는 경우에, 실행 성능 측정 결과를 기반으로 연산자에 최대 실행 성능을 가지는 가속기를 매칭하는 단계; 연산자들을 순서대로 순회하면서 그래프를 분할하여 파티션을 생성하는 단계; 및 생성된 파티션들의 실행 순서를 결정하는 단계를 수행할 수 있다.At this time, when the program performs scheduling, matching the accelerator having the maximum execution performance to the operator based on the execution performance measurement result; generating a partition by partitioning the graph while traversing the operators in order; and determining an execution order of the created partitions.

이 때, 프로그램은, 스케줄링을 수행하기 이전에, 변환된 그래프를 하드웨어 독립적으로 최적화하는 단계를 더 수행하되, 최적화를 위해, 불필요한 연산자 제거, 연속적으로 발생되는 트랜스포즈(Transpose) 연산자 병합, 트랜스포즈(Transpose) 연산을 그래프의 하단으로 이동 및 연속 발생 Concat 연산자 병합 중 적어도 하나를 수행할 수 있다.At this time, before performing scheduling, the program further performs a step of optimizing the transformed graph independently of hardware, but for optimization, removing unnecessary operators, merging continuously generated transpose operators, and transposing At least one of moving the (Transpose) operation to the bottom of the graph and merging consecutively occurring Concat operators can be performed.

이 때, 프로그램은, 스케줄링을 수행하는 경우에, 둘 이상의 이종 가속기들에서의 연산자들의 실행 성능 측정 결과를 기반으로 그래프를 적어도 하나의 연산자를 포함하는 복수의 브랜치들로 분리하는 단계를 더 수행하되, 매칭하는 단계는, 복수의 브랜치들 각각에 포함된 적어도 하나의 연산자 각각에 대해 최대 실행 성능을 가지는 가속기를 매칭하는 단계; 복수의 브랜치들 각각에 포함된 연산자 그룹에 대해 퓨징 연산자 성능 명세 파일을 기반으로 최대 실행 성능을 가지는 가속기를 검색하는 단계; 및 연산자 그룹에서의 실행 성능이 개별 연산자의 실행 성능보다 양호할 경우, 해당 개별 연산자에 매칭된 가속기를 검색된 가속기로 재 매칭하는 단계를 더 수행할 수 있다.At this time, in the case of performing scheduling, the program further performs the step of dividing the graph into a plurality of branches including at least one operator based on the measurement results of the execution performance of operators in two or more heterogeneous accelerators, , The matching step may include matching an accelerator having a maximum execution performance for each of at least one operator included in each of a plurality of branches; Searching for an accelerator having maximum execution performance for an operator group included in each of a plurality of branches, based on a fusing operator performance specification file; and re-matching an accelerator matched to the individual operator with the searched accelerator when the execution performance of the operator group is better than that of the individual operator.

이 때, 프로그램은, 스케줄링을 수행하는 경우에, 복수의 브랜치들의 순차적 또는 병렬적 실행 여부를 구분하는 단계를 더 수행하되, 매칭하는 단계는, 병렬적으로 실행되는 브랜치들에 동일한 가속기가 매칭될 경우, 병렬성을 이용하여 성능이 향상되면, 서로 다른 가속기를 재 할당하는 단계를 더 수행할 수 있다.At this time, in the case of performing scheduling, the program further performs a step of distinguishing whether a plurality of branches are executed sequentially or in parallel, but in the matching step, the same accelerator is matched to the branches executed in parallel. In this case, if performance is improved by using parallelism, a step of reallocating different accelerators may be further performed.

이 때, 프로그램은, 파티션을 생성하는 단계에서, 브랜치 내의 연산자들을 순서대로 순회하면서 현재 연산자가 이전 연산자와 동일한 가속기가 매칭되어 있는 경우, 동일한 파티션으로 결정하고, 현재 연산자가 이전 연산자와 동일한 가속기가 매칭되어 있지 않은 경우, 새로운 파티션을 생성하여 추가하는 단계를 더 수행할 수 있다.At this time, in the step of creating a partition, the program traverses the operators in the branch in order, and if the current operator matches the previous operator and the same accelerator, it determines the same partition, and the current operator has the same accelerator as the previous operator. If not matched, a step of creating and adding a new partition may be further performed.

이 때, 프로그램은, 실행 순서를 결정하는 단계에서, 너비 우선 탐색 (breath-first) 방식으로 순회하면서 순차적 실행 순서를 결정하고, 동일한 병렬 브랜치 집합에 포함되어 있고 서로 상이한 가속기가 할당되어 있는 파티션을 병렬 파티션으로 결정할 수 있다.At this time, in the step of determining the execution order, the program determines the sequential execution order while traversing the breadth-first search method, and determines the partitions included in the same parallel branch set and to which different accelerators are assigned. It can be determined by parallel partitioning.

이 때, 결정된 파티션의 순차적 실행 순서와 병렬 파티션 정보는, 특정 플랫폼에 비의존적인 형태로 파티션들의 실행 정보를 표현하는 파티션 스케줄 계획 및 특정 플랫폼에서 컴파일 가능한 코드 형태인 파티션 스케줄 코드 중 하나로 저장될 수 있다.At this time, the sequential execution order and parallel partition information of the determined partitions may be stored as one of a partition schedule plan expressing execution information of partitions in a form independent of a specific platform and a partition schedule code that is a code compilable on a specific platform. .

실시예에 따른 이종 컴퓨팅 플랫폼 지원 딥러닝 컴파일 방법은, 딥 뉴럴넷 모델을 연산자들의 데이터 흐름에 따라 연결한 그래프로 변환하는 단계, 및 타겟 플랫폼에 포함된 적어도 둘 이상의 이종 가속기들에서의 연산자들의 실행 성능 측정 결과와 이종 가속기들에 분할되어 할당된 연산자들의 실행 순서를 고려하여 딥 뉴럴넷 모델을 코드로 변환하는 단계를 수행할 수 있다. The deep learning compilation method supporting heterogeneous computing platforms according to the embodiment includes the steps of converting a deep neural net model into a graph connected according to the data flow of operators, and the execution performance of operators in at least two or more heterogeneous accelerators included in a target platform. A step of converting the deep neural network model into code may be performed in consideration of the measurement result and the execution order of operators divided and assigned to heterogeneous accelerators.

이 때, 실행 성능 측정을 수행하는 경우에, 입력된 타겟 플랫폼 가속기 명세 정보를 기반으로 측정된 가속기 별 단위 연산자의 실행 성능 결과로 단위 성능 명세 파일을 생성하는 단계; 및 입력된 퓨징 연산자 명세 정보를 기반으로 측정된 가속기 별 연산자 그룹의 실행 성능 결과로 퓨징 연산자 성능 명세 파일을 생성하는 단계를 수행할 수 있다.At this time, when performing execution performance measurement, generating a unit performance specification file based on the execution performance result of the unit operator for each accelerator measured based on the input target platform accelerator specification information; and generating a fusing operator performance specification file as a result of execution performance of operator groups for each accelerator measured based on the input fusing operator specification information.

이 때, 스케줄링을 수행하는 경우에, 실행 성능 측정 결과를 기반으로 연산자에 최대 실행 성능을 가지는 가속기를 매칭하는 단계; 연산자들을 순서대로 순회하면서 그래프를 분할하여 파티션을 생성하는 단계; 및 생성된 파티션들의 실행 순서를 결정하는 단계를 수행할 수 있다.At this time, in the case of performing scheduling, matching an accelerator having maximum execution performance to an operator based on an execution performance measurement result; generating a partition by partitioning the graph while traversing the operators in order; and determining an execution order of the created partitions.

이 때, 스케줄링을 수행하기 이전에, 변환된 그래프를 하드웨어 독립적으로 최적화하는 단계를 더 수행하되, 최적화를 위해, 불필요한 연산자 제거, 연속적으로 발생되는 트랜스포즈(Transpose) 연산자 병합, 트랜스포즈(Transpose) 연산을 그래프의 하단으로 이동 및 연속 발생 Concat 연산자 병합 중 적어도 하나를 수행할 수 있다.At this time, before performing scheduling, a step of optimizing the transformed graph independently of hardware is further performed, but for optimization, unnecessary operators are removed, consecutively generated transpose operators are merged, and transpose At least one of moving the operation to the bottom of the graph and merging consecutively occurring Concat operators can be performed.

이 때, 스케줄링을 수행하는 경우에, 둘 이상의 이종 가속기들에서의 연산자들의 실행 성능 측정 결과를 기반으로 그래프를 적어도 하나의 연산자를 포함하는 복수의 브랜치들로 분리하는 단계를 더 수행하되, 매칭하는 단계는, 복수의 브랜치들 각각에 포함된 적어도 하나의 연산자 각각에 대해 최대 실행 성능을 가지는 가속기를 매칭하는 단계; 복수의 브랜치들 각각에 포함된 연산자 그룹에 대해 퓨징 연산자 성능 명세 파일을 기반으로 최대 실행 성능을 가지는 가속기를 검색하는 단계; 및 연산자 그룹에서의 실행 성능이 개별 연산자의 실행 성능보다 양호할 경우, 해당 개별 연산자에 매칭된 가속기를 검색된 가속기로 재 매칭하는 단계를 더 수행할 수 있다.At this time, in the case of performing scheduling, further performing the step of separating the graph into a plurality of branches including at least one operator based on the measurement results of the execution performance of operators in two or more heterogeneous accelerators, but matching The step may include matching an accelerator having a maximum execution performance for each of at least one operator included in each of a plurality of branches; Searching for an accelerator having maximum execution performance for an operator group included in each of a plurality of branches, based on a fusing operator performance specification file; and re-matching an accelerator matched to the individual operator with the searched accelerator when the execution performance of the operator group is better than that of the individual operator.

이 때, 스케줄링을 수행하는 경우에, 복수의 브랜치들의 순차적 또는 병렬적 실행 여부를 구분하는 단계를 더 수행하되, 매칭하는 단계는, 병렬적으로 실행되는 브랜치들에 동일한 가속기가 매칭될 경우, 병렬성을 이용하여 성능이 향상되면, 서로 다른 가속기를 재 할당하는 단계를 더 수행할 수 있다.At this time, in the case of performing scheduling, a step of distinguishing whether sequential or parallel execution of a plurality of branches is further performed, but the matching step is, when the same accelerator is matched to branches executed in parallel, parallelism If performance is improved by using , a step of reallocating different accelerators may be further performed.

이 때, 파티션을 생성하는 단계에서, 브랜치 내의 연산자들을 순서대로 순회하면서 현재 연산자가 이전 연산자와 동일한 가속기가 매칭되어 있는 경우, 동일한 파티션으로 결정하고, 현재 연산자가 이전 연산자와 동일한 가속기가 매칭되어 있지 않은 경우, 새로운 파티션을 생성하여 추가하는 단계를 더 수행할 수 있다.At this time, in the step of creating partitions, while traversing the operators in the branch in order, if the current operator matches the same accelerator as the previous operator, it is determined as the same partition, and the current operator does not match the same accelerator as the previous operator. If not, a step of creating and adding a new partition may be further performed.

이 때, 실행 순서를 결정하는 단계에서, 너비 우선 탐색 (breath-first) 방식으로 순회하면서 순차적 실행 순서를 결정하고, 동일한 병렬 브랜치 집합에 포함되어 있고 서로 상이한 가속기가 할당되어 있는 파티션을 병렬 파티션으로 결정할 수 있다.At this time, in the step of determining the execution order, the sequential execution order is determined by traversing in a breadth-first search method, and partitions included in the same parallel branch set and to which different accelerators are assigned are designated as parallel partitions. can decide

실시예에 따른 이종 컴퓨팅 플랫폼 지원 딥러닝 컴파일을 위한 그래프 분할 및 스케줄 생성 방법은, 딥 뉴럴넷 모델을 연산자들의 데이터 흐름에 따라 연결한 그래프를 기반으로 타겟 플랫폼에 포함된 적어도 둘 이상의 이종 가속기들에서의 연산자들의 실행 성능을 측정하는 단계, 실행 성능 측정 결과를 기반으로 연산자에 최대 실행 성능을 가지는 가속기를 매칭하는 단계, 연산자들을 순서대로 순회하면서 그래프를 분할하여 파티션을 생성하는 단계 및 생성된 파티션들의 실행 순서를 결정하는 단계를 수행할 수 있다. A graph division and schedule generation method for heterogeneous computing platform-supported deep learning compilation according to an embodiment is based on a graph connecting a deep neural net model according to data flow of operators in at least two heterogeneous accelerators included in a target platform. Measuring the execution performance of the operators, matching the accelerator with the maximum execution performance to the operator based on the execution performance measurement result, generating partitions by traversing the operators in order and dividing the graph, and executing the created partitions You can perform steps to determine the order.

이 때, 실행 성능을 측정하는 단계는, 입력된 타겟 플랫폼 가속기 명세 정보를 기반으로 측정된 가속기 별 단위 연산자의 실행 성능 결과로 단위 성능 명세 파일을 생성하는 단계 및 입력된 퓨징 연산자 명세 정보를 기반으로 측정된 가속기 별 연산자 그룹의 실행 성능 결과로 퓨징 연산자 성능 명세 파일을 생성하는 단계를 수행할 수 있다. At this time, the step of measuring the execution performance is the step of generating a unit performance specification file as a result of the execution performance of the unit operator for each accelerator measured based on the input target platform accelerator specification information and the input fusing operator specification information. A step of generating a fusing operator performance specification file may be performed as a result of the measured execution performance of the operator group for each accelerator.

실시예에 따른 이종 컴퓨팅 플랫폼 지원 딥러닝 컴파일을 위한 그래프 분할 및 스케줄 생성 방법은, 둘 이상의 이종 가속기들에서의 연산자들의 실행 성능 측정 결과를 기반으로 그래프를 적어도 하나의 연산자를 포함하는 복수의 브랜치들로 분리하는 단계를 더 포함하되, 매칭하는 단계는, 복수의 브랜치들 각각에 포함된 적어도 하나의 연산자 각각에 대해 최대 실행 성능을 가지는 가속기를 매칭하는 단계, 복수의 브랜치들 각각에 포함된 연산자 그룹에 대해 퓨징 연산자 성능 명세 파일을 기반으로 최대 실행 성능을 가지는 가속기를 검색하는 단계 및 연산자 그룹에서의 실행 성능이 개별 연산자의 실행 성능보다 양호할 경우, 해당 개별 연산자에 매칭된 가속기를 검색된 가속기로 재 매칭하는 단계를 수행할 수 있다. A method for dividing a graph and generating a schedule for deep learning compilation supporting heterogeneous computing platforms according to an embodiment converts a graph into a plurality of branches including at least one operator, based on results of measurement of execution performance of operators in two or more heterogeneous accelerators. Further comprising the step of separating into , wherein the matching step comprises: matching an accelerator having a maximum execution performance for each of the at least one operator included in each of a plurality of branches; and an operator group included in each of the plurality of branches. Searching for an accelerator having the maximum execution performance based on the fusing operator performance specification file, and if the execution performance in the operator group is better than the execution performance of an individual operator, the accelerator matched to the individual operator is replayed as the searched accelerator. Matching steps can be performed.

실시예에 따른 이종 컴퓨팅 플랫폼 지원 딥러닝 컴파일을 위한 그래프 분할 및 스케줄 생성 방법은, 복수의 브랜치들의 순차적 또는 병렬적 실행 여부를 구분하는 단계를 더 포함하되, 매칭하는 단계는, 병렬적으로 실행되는 브랜치들에 동일한 가속기가 매칭될 경우, 병렬성을 이용하여 성능이 향상되면, 서로 다른 가속기를 재 할당하는 단계를 더 수행할 수 있다.The graph division and schedule generation method for deep learning compilation supporting heterogeneous computing platforms according to an embodiment further includes determining whether a plurality of branches are executed sequentially or in parallel, and the matching step is executed in parallel. If the same accelerator is matched to the branches and performance is improved by using parallelism, a step of reallocating different accelerators may be further performed.

기재된 실시예에 따라, 딥뉴럴넷 모델을 복수의 이종 가속기들을 내장한 이종 컴퓨팅 플랫폼에서 동작하는 고성능 딥뉴럴넷 실행코드로 자동 변환할 수 있다.According to the described embodiment, a deep neural network model can be automatically converted into a high-performance deep neural network execution code that operates on a heterogeneous computing platform incorporating a plurality of heterogeneous accelerators.

기재된 실시예에 따라, 복수의 이종 가속기들 각각의 상이한 특성을 모두 활용하여 실행 속도나 소모 전력을 최소화시킬 수 있도록 딥 뉴럴넷을 그래프 기반으로 분할하고, 해당 그래프가 실행 될 가속기를 매핑할 수 있다.According to the described embodiment, a deep neural network may be segmented based on a graph to minimize execution speed or power consumption by utilizing all of the different characteristics of each of a plurality of heterogeneous accelerators, and an accelerator on which the corresponding graph will be executed may be mapped.

기재된 실시예에 따라, 타겟 플랫폼에 내장된 가속기 상에서 뉴럴넷 연산자들을 실제로 실행해보고 측정한 성능 결과를 기반으로 그래프 파티셔닝을 수행하므로 수행 시간이나 전력 소모 측면에서 최적의 파티션을 정확하게 찾아내는 것이 가능하다.According to the described embodiment, since graph partitioning is performed based on performance results measured by actually executing neural net operators on an accelerator built into a target platform, it is possible to accurately find an optimal partition in terms of execution time or power consumption.

기재된 실시예에 따라, 프로파일러를 이용하여 자동으로 생성된 성능 결과를 이용하므로 최소 수행시간 또는 최소 전력을 소요하는 최적 파티션을 한 번에 빠르게 찾아내는 것이 가능하다.According to the described embodiment, since a performance result automatically generated using a profiler is used, it is possible to quickly find an optimum partition consuming the minimum execution time or the minimum power at once.

도 1은 실시예에 따른 이종 컴퓨터 플랫폼 지원 딥러닝 컴파일러를 개략적으로 설명하기 위한 구성도이다.
도 2는 종래의 딥러닝 컴파일러가 생성한 코드의 실행 시간 예를 도시한 도면이다.
도 3은 실시예에 따른 딥러닝 컴파일러가 생성한 코드의 실행 시간 예를 도시한 도면이다.
도 4는 실시예에 따른 이종 컴퓨터 플랫폼 지원 딥러닝 컴파일러의 내부 블록 구성도이다.
도 5는 실시예에 따라 도 4에 도시된 스케줄 생성부의 상세 내부 블록 구성도이다.
도 6은 실시예에 따라 도 5에 도시된 스케줄 생성부의 동작을 설명하기 위한 순서도이다.
도 7은 다른 실시예에 따른 이종 컴퓨터 플랫폼 지원 딥러닝 컴파일러의 내부 블록 구성도이다.
도 8은 실시예에 따른 그래프 기반 가속기별 실행 성능 측정 동작을 상세히 설명하기 위한 순서도이다.
도 9는 그래프를 브랜치 단위로 구분한 예시도이다.
도 10은 실시예에 따른 그래프의 연산자와 가속기를 매칭하는 단계를 상세히 설명하기 위한 순서도이다.
도 11은 다수의 브랜치를 포함하는 병렬 브랜치 집합의 예시도이다.
도 12는 실시예에 따른 그래프 분할 예시도이다.
도 13은 실시예에 따른 컴퓨터 시스템 구성을 나타낸 도면이다.1 is a configuration diagram schematically illustrating a deep learning compiler supporting heterogeneous computer platforms according to an embodiment.
2 is a diagram showing an example of execution time of a code generated by a conventional deep learning compiler.
3 is a diagram illustrating an example of execution time of a code generated by a deep learning compiler according to an embodiment.
4 is an internal block diagram of a deep learning compiler supporting heterogeneous computer platforms according to an embodiment.
5 is a detailed internal block configuration diagram of the schedule generating unit shown in FIG. 4 according to an embodiment.
6 is a flowchart for explaining the operation of the schedule generating unit shown in FIG. 5 according to an embodiment.
7 is an internal block diagram of a deep learning compiler supporting heterogeneous computer platforms according to another embodiment.
8 is a flowchart for explaining in detail an execution performance measurement operation for each graph-based accelerator according to an embodiment.
9 is an exemplary diagram in which a graph is divided into branches.
10 is a flowchart for explaining in detail a step of matching an operator and an accelerator of a graph according to an embodiment.
11 is an exemplary diagram of a parallel branch set including multiple branches.
12 is an exemplary view of graph division according to an embodiment.
13 is a diagram showing a configuration of a computer system according to an embodiment.

본 발명의 이점 및 특징, 그리고 그것들을 달성하는 방법은 첨부되는 도면과 함께 상세하게 후술되어 있는 실시예들을 참조하면 명확해질 것이다. 그러나 본 발명은 이하에서 개시되는 실시예들에 한정되는 것이 아니라 서로 다른 다양한 형태로 구현될 것이며, 단지 본 실시예들은 본 발명의 개시가 완전하도록 하며, 본 발명이 속하는 기술분야에서 통상의 지식을 가진 자에게 발명의 범주를 완전하게 알려주기 위해 제공되는 것이며, 본 발명은 청구항의 범주에 의해 정의될 뿐이다. 명세서 전체에 걸쳐 동일 참조 부호는 동일 구성 요소를 지칭한다.Advantages and features of the present invention, and methods of achieving them, will become clear with reference to the detailed description of the following embodiments taken in conjunction with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below, but will be implemented in various different forms, only these embodiments make the disclosure of the present invention complete, and common knowledge in the art to which the present invention belongs. It is provided to fully inform the holder of the scope of the invention, and the present invention is only defined by the scope of the claims. Like reference numbers designate like elements throughout the specification.

비록 "제1" 또는 "제2" 등이 다양한 구성요소를 서술하기 위해서 사용되나, 이러한 구성요소는 상기와 같은 용어에 의해 제한되지 않는다. 상기와 같은 용어는 단지 하나의 구성요소를 다른 구성요소와 구별하기 위하여 사용될 수 있다. 따라서, 이하에서 언급되는 제1 구성요소는 본 발명의 기술적 사상 내에서 제2 구성요소일 수도 있다.Although "first" or "second" is used to describe various elements, these elements are not limited by the above terms. Such terms may only be used to distinguish one component from another. Therefore, the first component mentioned below may also be the second component within the technical spirit of the present invention.

본 명세서에서 사용된 용어는 실시예를 설명하기 위한 것이며 본 발명을 제한하고자 하는 것은 아니다. 본 명세서에서, 단수형은 문구에서 특별히 언급하지 않는 한 복수형도 포함한다. 명세서에서 사용되는 "포함한다(comprises)" 또는 "포함하는(comprising)"은 언급된 구성요소 또는 단계가 하나 이상의 다른 구성요소 또는 단계의 존재 또는 추가를 배제하지 않는다는 의미를 내포한다.Terms used in this specification are for describing embodiments and are not intended to limit the present invention. In this specification, singular forms also include plural forms unless specifically stated otherwise in a phrase. As used herein, "comprises" or "comprising" implies that a stated component or step does not preclude the presence or addition of one or more other components or steps.

다른 정의가 없다면, 본 명세서에서 사용되는 모든 용어는 본 발명이 속하는 기술분야에서 통상의 지식을 가진 자에게 공통적으로 이해될 수 있는 의미로 해석될 수 있다. 또한, 일반적으로 사용되는 사전에 정의되어 있는 용어들은 명백하게 특별히 정의되어 있지 않는 한 이상적으로 또는 과도하게 해석되지 않는다.Unless otherwise defined, all terms used herein may be interpreted as meanings commonly understood by those of ordinary skill in the art to which the present invention belongs. In addition, terms defined in commonly used dictionaries are not interpreted ideally or excessively unless explicitly specifically defined.

이하에서는, 도 1 내지 도 13을 참조하여 실시예에 따른 이종 플랫폼 지원 딥러닝 컴파일일 시스템 및 방법이 상세히 설명된다.Hereinafter, with reference to FIGS. 1 to 13 , a heterogeneous platform supporting deep learning compilation system and method according to an embodiment will be described in detail.

도 1은 실시예에 따른 이종 컴퓨터 플랫폼 지원 딥러닝 컴파일 시스템을 개략적으로 설명하기 위한 구성도이고 도 2는 종래의 딥러닝 컴파일러가 생성한 코드의 실행 시간 예를 도시한 도면이고, 도 3은 실시예에 따른 딥러닝 컴파일러가 생성한 코드의 실행 시간 예를 도시한 도면이다. 1 is a configuration diagram for schematically explaining a deep learning compilation system supporting heterogeneous computer platforms according to an embodiment, and FIG. 2 is a diagram showing an example of execution time of a code generated by a conventional deep learning compiler, and FIG. It is a diagram showing an example of the execution time of the code generated by the deep learning compiler according to the example.

도 1을 참조하면, 실시예에 따른 이종 컴퓨터 플랫폼 지원 딥러닝 컴파일러(이하 '컴파일러'로 기재함)(100)는 입력된 딥뉴럴넷 모델을 복수의 이종 가속기들(10-1,..., 10-N)을 포함하는 이종 컴퓨팅 플랫폼(이하 '타겟 플랫폼')(10)에 최적화된 실행 코드로 변환한다. Referring to FIG. 1 , a deep learning compiler (hereinafter referred to as 'compiler') 100 supporting heterogeneous computer platforms according to an embodiment converts an input deep neural network model to a plurality of heterogeneous accelerators 10-1, ..., 10-N) into an execution code optimized for a heterogeneous computing platform (hereinafter referred to as 'target platform') 10 including.

여기서, 딥 뉴럴넷 모델은, 예컨대, TensorFlow, Pytorch 및 Caffe 등과 같은 딥러닝 플랫폼에서 미리 학습되어 생성된 것으로, 복수의 연산자들로 구성된 것일 수 있다.Here, the deep neural net model may be created by learning in advance on a deep learning platform such as TensorFlow, Pytorch, and Caffe, and may be composed of a plurality of operators.

이때, 가속기는, CPU(Central Processing Unit), GPU(Graphics Processing Unit) 및 NPU(Neural Processing Unit)와 같이 딥뉴럴넷 연산을 수행하는 프로세싱 유닛일 수 있다. In this case, the accelerator may be a processing unit that performs a deep neural network operation, such as a central processing unit (CPU), a graphics processing unit (GPU), and a neural processing unit (NPU).

실시예에 따른 컴파일러(100)는 입력 딥뉴럴넷 모델을 타겟 플랫폼(10) 상에서 효과적을 동작하는 코드로 변환한다. 컴파일러(100)는 타겟 플랫폼(10)에서 최적의 성능을 내는 코드를 생성하기 위해, 가장 먼저 타겟 플랫폼(10)에 포함된 복수의 이종 가속기들(10-1,..., 10-N)의 성능을 측정하기 위한 가속기 성능 프로파일링을 수행한다. The compiler 100 according to the embodiment converts the input deep neural net model into code that effectively runs on the target platform 10 . The compiler 100 first uses a plurality of heterogeneous accelerators 10-1, ..., 10-N included in the target platform 10 to generate code with optimal performance in the target platform 10. Perform accelerator performance profiling to measure the performance of

이때, 가속기 성능 프로파일링은, 각 가속기의 딥뉴럴넷 연산자 (또는 연산자 그룹) 실행에 소모되는 시간과 전력을 측정하는 작업을 의미한다. At this time, accelerator performance profiling means the task of measuring the time and power consumed in the execution of deep neural net operators (or groups of operators) in each accelerator.

그 다음, 컴파일러(100)는, 프로파일링 결과를 기반으로 딥뉴럴넷 모델을 어떤 방식으로 분할(Partitioning)하여 어느 가속기에서 실행할 지를 결정한다. 이때, 복수의 가속기들을 병렬로 활용하여 수행할 수 있는 딥뉴럴넷 파티션들도 찾아낸다. Next, the compiler 100 determines which accelerator to execute by partitioning the deep neural network model based on the profiling result. At this time, deep neural net partitions that can be performed using multiple accelerators in parallel are also found.

최종적으로, 컴파일러(100)는, 이러한 정보를 기반으로 타겟 플랫폼에서 최적의 성능을 낼 수 있는 병렬성 높은 코드를 생성한다. Finally, the compiler 100 generates code with high parallelism that can produce optimal performance in the target platform based on this information.

도 2 및 3은 타겟 플랫폼(10)이 CPU 1개와 NPU 1개를 포함하고 있다고 가정할 때, 기존 딥러닝 컴파일러들이 생성한 코드와 실시예에 따른 컴파일러(100)가 생성한 코드의 실행 시간 측면에서의 성능 비교 예제를 보여준다. 2 and 3 show the execution time aspects of codes generated by existing deep learning compilers and codes generated by the compiler 100 according to the embodiment, assuming that the target platform 10 includes one CPU and one NPU. A performance comparison example is shown.

도 2 및 3에 도시된 그래프는 딥러닝 컴파일러들이 입력 딥뉴럴넷 모델을 실행하기 위해 내부적으로 생성한 계산 그래프(이하 '그래프'로 기재함)이다. The graphs shown in FIGS. 2 and 3 are computational graphs (hereinafter referred to as 'graphs') internally generated by deep learning compilers to execute an input deep neural network model.

그래프에서 각 노드는 연산자이며, 에지는 연산자 간 데이터 흐름을 나타낸다. 또한, 연산자 옆에 쓰여있는 숫자는 연산자 수행 시간이다. 점선 원은 가속기에서 실행될 그래프의 파티션을 표시한다. 여기서, 파티션은 동일한 가속기가 할당된 연속된 연산자의 집합을 의미한다. Each node in the graph is an operator, and edges represent data flow between operators. Also, the number written next to the operator is the operator execution time. Dotted circles mark the partitions of the graph that will run on the accelerator. Here, a partition means a set of consecutive operators to which the same accelerator is assigned.

도 2의 우측에 도시된 도면에서 실선으로 묶인 연산자들은 VTA에서 연속으로 실행했을 때, 각각을 실행할 때보다 더 빠르게 실행되는 연산자들을 나타내며, 실선 원 옆에 쓰인 숫자는 실행 시간이다.In the diagram shown on the right side of FIG. 2, operators enclosed by solid lines represent operators that are executed faster than when each is executed in series when executed in VTA, and the number written next to the circle of the solid line is the execution time.

도 2는 파티셔닝 없이 그래프 전체를 CPU 및 NPU에서 실행한 경우의 실행시간을, 도 3은 가속기 성능 프로파일링을 수행한 결과를 바탕으로 그래프를 복수의 파티션으로 분할하여 실행한 경우의 실행 시간을 보여준다. Figure 2 shows the execution time when the entire graph is executed on CPU and NPU without partitioning, and Figure 3 shows the execution time when the graph is divided into a plurality of partitions and executed based on the results of accelerator performance profiling. .

기존 딥러닝 컴파일러들은 사용자가 가속기의 타입을 하나만 지정 가능하기 때문에 도 2에 도시된 바와 같이 그래프의 전체 연산자에 동일한 가속기를 할당한 후, 해당 가속기에서 실행 가능한 코드를 생성한다. 따라서, 그래프의 전체 수행 시간은 사용자가 선택한 가속기 상에서 연산자들을 연속적으로 수행하는데 걸리는 시간들의 합이다. Existing deep learning compilers allow users to designate only one type of accelerator, so as shown in FIG. 2, the same accelerator is assigned to all operators in the graph, and then code executable in the accelerator is generated. Thus, the total execution time of the graph is the sum of the times it takes to continuously execute the operators on the user-selected accelerator.

도 2 및 3을 비교하면, 연산 및 연산 그룹이 빠르게 동작 가능한 가속기 정보를 바탕으로 그래프를 분할하여 실행하면 단일 가속기에서 전체 그래프를 실행할 때보다 더 빠르게 실행하는 것이 가능하다. 특히, 데이터 의존성이 없는 파티션의 경우 서로 다른 가속기에 나누어 동시에 실행 가능하므로, 그래프를 파티셔닝하여 실행하면 병렬 실행을 통한 속도 향상 효과를 추가적으로 얻을 수도 있다.Comparing FIGS. 2 and 3 , if the graph is divided and executed based on accelerator information that allows calculation and calculation groups to operate quickly, it is possible to execute the graph faster than when the entire graph is executed in a single accelerator. In particular, since partitions that do not depend on data can be divided into different accelerators and executed at the same time, if the graph is partitioned and executed, an additional speed-up effect through parallel execution can be obtained.

도 4는 실시예에 따른 이종 컴퓨터 플랫폼 지원 딥러닝 컴파일러의 내부 블록 구성도이고, 4 is an internal block diagram of a deep learning compiler supporting heterogeneous computer platforms according to an embodiment;

도 4를 참조하면, 실시예에 따른 컴파일러(100)는 입력된 딥뉴럴넷 모델을 컴파일러 설정을 위한 사용자 입력 정보를 기반으로 타겟 플랫폼(10)에 내장된 복수의 이종 가속기들을 최대한 활용하는 고성능 코드로 변환한다.Referring to FIG. 4 , the compiler 100 according to the embodiment transforms the input deep neural net model into high-performance code that utilizes the plurality of heterogeneous accelerators built into the target platform 10 to the maximum based on user input information for setting the compiler. convert

여기서, 컴파일러 설정을 위해 사용자 입력 정보는 타겟 플랫폼 가속기 명세 및 퓨징 연산자 명세를 포함할 수 있다. Here, user input information for compiler setting may include target platform accelerator specifications and fusing operator specifications.

이 중, 타겟 플랫폼 가속기 명세는, 타겟 플랫폼(10)이 내장하고 있는 이종 가속기들(10-1,...,10-N) 각각의 속성을 명시하는데, 일 예는 다음의 <표 1>과 같다.Among them, the target platform accelerator specification specifies the properties of each of the heterogeneous accelerators 10-1, ..., 10-N built into the target platform 10, an example of which is the following <Table 1>. Same as

<표 1>을 참조하면, 가속기 속성은, 가속기 이름(backendName), 식별자(ID) 및 지원되지 않는 연산자의 종류(nonSupportedNodes) 등을 포함할 수 있다. Referring to <Table 1>, accelerator properties may include an accelerator name (backendName), an identifier (ID), and types of unsupported operators (nonSupportedNodes).

컴파일러(100)는 이러한 타겟 플랫폼 가속기 명세에 기술된 내용을 참조하여 프로파일링, 파티셔닝 및 파티션 별 가속기 할당을 수행할 수 있다. The compiler 100 may perform profiling, partitioning, and accelerator allocation for each partition by referring to the content described in the target platform accelerator specification.

또한, 퓨징 연산자 그룹 명세는, 이종 가속기들 별로 각각의 연산을 개별적으로 수행할 때보다 연속적으로 수행할 때 더 높은 성능을 제공하는 특정 연산자 집합에 대한 정보를 나타낸다. 이는 어떠한 연산자 집합이 어떤 가속기에 효율적인가는 가속기의 HW 아키텍처와 컴파일러의 코드 생성 모듈의 구현에 따라 달라질 수 있기 때문이다. 일 예는 다음의 <표 2>와 같다. In addition, the fusing operator group specification indicates information about a specific operator set that provides higher performance when each operation is sequentially performed than when each operation is individually performed for each type of accelerator. This is because which set of operators is efficient for which accelerator may vary depending on the HW architecture of the accelerator and the implementation of the compiler's code generation module. An example is shown in <Table 2> below.

이종 컴퓨팅 플랫폼 지원 딥러닝 컴파일러(100)는 <표 2>와 같은 형태로 사용자가 명시한 연산자 그룹(퓨징 연산자 그룹)과 일치하는 연산자 그룹이 그래프 내에 존재하는 경우, 해당 연산자 그룹을 코드로 생성하여 모든 가속기 상에서 성능을 측정하는 프로파일링 작업을 수행한다. 그런데, 퓨징 연산자 그룹 명세가 주어지지 않으면 그래프 내의 모든 연속된 연산자들을 대상으로 성능을 측정한다. 연속적으로 수행 가능한 연산자의 개수는 컴파일러 내부적으로 설정한 값을 사용한다.Heterogeneous computing platform support The deep learning compiler 100, in the form of <Table 2>, if an operator group identical to the operator group (fusing operator group) specified by the user exists in the graph, the corresponding operator group is generated as a code and all Perform profiling to measure performance on the accelerator. However, if the fusing operator group specification is not given, the performance is measured for all consecutive operators in the graph. The number of operators that can be executed consecutively uses the value set internally by the compiler.

다시 도 4를 참조하면, 컴파일러(100)는, 딥뉴럴넷 모델 로더(110), 그래프 기반 하드웨어 독립적 최적화부(이하 '최적화부'로 기재함)(120) 및 하드웨어 의존적 최적화 및 가속기 코드 생성부(이하 '코드 생성부'로 기재함)(140)를 포함할 수 있다. 실시예에 따라 추가적으로, 그래프 분할 및 스케줄 생성부(이하 '스케줄 생성부'로 기재함(130)을 통합하여 타겟 플랫폼(20)에 내장된 복수의 이종 가속기들(10-1, ..., 10-N)을 최대한 활용하는 고성능 코드를 생성한다. Referring back to FIG. 4 , the compiler 100 includes a deep neural network model loader 110, a graph-based hardware independent optimization unit (hereinafter referred to as an 'optimization unit') 120 and a hardware dependent optimization and accelerator code generation unit ( (hereinafter referred to as 'code generation unit') 140 may be included. Depending on the embodiment, a plurality of heterogeneous accelerators (10-1, ..., 10-1, ..., 10-N) to generate high-performance code.

스케줄 생성부(130)은 최적화부(120)로부터 출력된 HW 독립적 최적화가 완료된 그래프의 파티셔닝을 수행한다. 그리고, 파티셔닝 수행이 종료되면, 파티션 리스트, 파티션 리스트에 포함된 파티션들의 실행 순서를 나타내는 스케줄 계획, 스케쥴 계획에 맞게 파티션들을 실행하는 코드인 스케줄 코드가 생성된다. The schedule generation unit 130 performs partitioning of the HW independent optimization-completed graph output from the optimization unit 120 . When the partitioning is completed, a partition list, a schedule plan indicating the execution order of the partitions included in the partition list, and a schedule code, which is code for executing partitions according to the schedule plan, are generated.

이 중, 파티션 리스트는 코드 생성부(140)로 전달되어 복수의 이종 가속기들(10-1, ..., 10-N)에서 동작할 수 있는 파티션 코드로 변환된다. 이와 같이 생성된 파티션 코드와 스케줄 코드는 함께 컴파일되어 딥뉴럴넷 추론을 수행하는 실행코드로 변환된다.Among them, the partition list is transmitted to the code generator 140 and converted into a partition code that can operate in the plurality of heterogeneous accelerators 10-1, ..., 10-N. The generated partition code and schedule code are compiled together and converted into executable code that performs deep neural network inference.

그러면, 도 4에 도시된 각 구성 요소들에 대해 상세히 살펴보기로 한다. Then, each component shown in FIG. 4 will be looked at in detail.

딥뉴럴넷 모델 로더(110)는, 입력되는 딥뉴럴넷 모델을 그래프 형태의 내부 표현으로 변환한다. 도 2 및 도 3에 도시된 바와 같이, 그래프는 딥뉴럴넷 모델의 연산자를 나타내는 노드(21)와, 노드(21)가 연산자들 간의 데이터의 흐름을 나타내는 에지(22)로 연결된 형태를 갖는다.The deep neural net model loader 110 converts the input deep neural net model into an internal representation in the form of a graph. As shown in FIGS. 2 and 3 , the graph has a form in which nodes 21 representing operators of the deep neural network model are connected by edges 22 representing the flow of data between operators.

최적화부(120)는, 딥뉴럴넷 모델 로더(110)가 생성한 그래프를 특정 가속기에 의존적이지 않은 최적화를 수행한다. The optimizer 120 optimizes the graph generated by the deep neural network model loader 110 not dependent on a specific accelerator.

이때, 하드웨어 독립적 최적화는, 예컨대, 불필요한 연산자 제거, 연속 발생 Transpose 연산자 병합, Transpose 연산을 그래프의 하단으로 이동 및 연속 발생 Concat 연산자 병합 등을 포함할 수 있다. In this case, hardware-independent optimization may include, for example, removing unnecessary operators, merging continuously occurring transpose operators, moving transpose operations to the bottom of the graph, and merging consecutively occurring concat operators.

스케줄 생성부(130)는, 타겟 플랫폼(10)에 포함된 적어도 둘 이상의 이종 가속기들(10-1, ..., 10-N)에서의 연산자들의 실행 성능 측정 결과에 따라 연산자들을 분할하여 이종 가속기들에 할당하고, 실행 순서를 스케줄링한다. The schedule generator 130 divides the operators according to the results of measuring the execution performance of the operators in the at least two or more heterogeneous accelerators 10-1, ..., 10-N included in the target platform 10, and divides the operators into heterogeneous accelerators. Allocate accelerators and schedule execution order.

즉, 프로파일러를 이용하여 타겟 플랫폼 가속기 명세에 명시된 가속기 상에서 뉴럴넷의 개별 연산자 및 연산자 그룹의 실행 성능(수행 시간 및 소모 전력)을 측정하고, 측정된 결과를 바탕으로 입력 그래프를 가장 신속히 또는 최소의 전력으로 실행할 수 있도록 연산자와 가속기를 매칭한다. In other words, the execution performance (execution time and power consumption) of individual operators and operator groups of the neural net is measured on the accelerator specified in the target platform accelerator specification using a profiler, and the input graph is drawn in the fastest or smallest way based on the measured results. Match operators and accelerators to run on full power.

이와 같이 매칭된 정보를 바탕으로 그래프 파티셔닝을 수행하고, 파티션들이 순차적으로 연결되어 있는 파티션 리스트를 생성한다. 이때, 파티션은 가속기가 할당된 서브그래프를 의미한다.Graph partitioning is performed based on this matched information, and a partition list in which partitions are sequentially connected is created. In this case, the partition refers to a subgraph to which an accelerator is assigned.

또한, 그래프 토폴로지를 분석하여 병렬로 실행할 수 있는 파티션들 정보를 추출하고 파티션 스케줄 계획 및 코드를 생성한다. 또한, 파티션 리스트에 포함된 각 파티션들을 코드 생성부(140)로 전달되어 가속기용으로 최적화된 파티션 코드가 생성되도록 한다. In addition, by analyzing the graph topology, information on partitions that can be executed in parallel is extracted and partition schedule plans and codes are created. In addition, each partition included in the partition list is transmitted to the code generator 140 to generate a partition code optimized for the accelerator.

이러한 스케줄 생성부(130)에 대한 상세한 설명은 도 5를 참조하여 후술하기로 한다. A detailed description of the schedule generating unit 130 will be described later with reference to FIG. 5 .

코드 생성부(140)는, 스케줄 생성부(130)에 의해 그래프 기반 스케줄 결과를 기반으로 딥 뉴럴넷 모델을 연산자들 각각에 할당된 가속기 및 실행 순서에 최적화된 코드로 변환한다. 즉, 입력으로 받은 파티션을 가속기 아키텍처를 고려하여 최적의 성능으로 동작하는 코드로 변환한다.The code generator 140 converts the deep neural network model into code optimized for an accelerator assigned to each operator and an execution order based on the result of the graph-based schedule by the schedule generator 130 . That is, the partition received as input is converted into code that operates with optimal performance considering the accelerator architecture.

도 5는 실시예에 따른 스케줄 생성부의 상세 내부 블록 구성도이고, 도 6은 실시예에 따른 스케줄 생성부의 동작을 설명하기 위한 순서도이다. 5 is a detailed internal block diagram of a schedule generator according to an embodiment, and FIG. 6 is a flowchart for explaining the operation of a schedule generator according to an embodiment.

도 5를 참조하면, 실시예에 따른 스케줄 생성부(130)는, 프로파일러(131), 브랜치 분석부(132), 가속기 할당부(133), 그래프 분할부(134) 및 파티션 스케줄러(135)를 포함할 수 있다. Referring to FIG. 5 , the schedule generator 130 according to the embodiment includes a profiler 131, a branch analyzer 132, an accelerator allocator 133, a graph divider 134, and a partition scheduler 135. can include

프로파일러(131)는, 타겟 플랫폼(10)에서 최적의 성능을 내는 실행 코드를 생성하기 위해, 타겟 플랫폼(10)에 포함된 복수의 이종 가속기들(10-1, ..., 10-N) 각각의 성능을 측정하기 위한 가속기 성능 프로파일링을 수행한다(S210). The profiler 131 uses a plurality of heterogeneous accelerators 10-1, ..., 10-N included in the target platform 10 to generate execution code that produces optimal performance in the target platform 10. ) performs accelerator performance profiling to measure each performance (S210).

이때, 가속기 성능 프로파일링은, 복수의 이종 가속기들(10-1, ..., 10-N) 각각이 딥뉴럴넷 모델을 구성하는 연산자 또는 연산자 그룹의 실행에 소모되는 시간 및 전력을 측정하여, 성능 명세 파일을 생성하는 것이다. At this time, the accelerator performance profiling measures the time and power consumed by each of the plurality of heterogeneous accelerators 10-1, ..., 10-N to execute an operator or operator group constituting the deep neural network model, To create a performance specification file.

이 때, 가속기 성능 프로파일링은, 도 7에 도시된 것처럼 컴파일러의 외부에 위치하는 프로파일러(150)를 통해 수행될 수도 있으며, 외부의 프로파일러(150)를 통해 생성된 성능 명세 파일을 입력받아 사용할 수 있다. At this time, the accelerator performance profiling may be performed through the profiler 150 located outside the compiler as shown in FIG. 7, and receives the performance specification file generated through the external profiler 150 as input. can be used

도 8은 실시예에 따른 그래프 기반 가속기별 실행 성능 측정 동작을 상세히 설명하기 위한 순서도이다.8 is a flowchart for explaining in detail an execution performance measurement operation for each graph-based accelerator according to an embodiment.

도 8을 참조하면, 그래프 기반 가속기별 실행 성능 측정하는 단계(S210)는, 크게 프로파일링을 통해 개별 가속기 별로 단위 연산자 성능 명세 파일을 생성하는 단계(S211~S213) 및 파티션에 상응하는 퓨징 연산자 그룹 성능 명세 파일을 생성하는 단계(S214~S216)를 포함할 수 있다. Referring to FIG. 8, the step of measuring the execution performance of each graph-based accelerator (S210) is largely a step of generating a unit operator performance specification file for each accelerator through profiling (S211 to S213) and a fusing operator group corresponding to the partition. It may include generating a performance specification file (S214 to S216).

즉, 프로파일러(131)는, 그래프를 개별 연산자 단위로 분할하여 하나의 연산자만을 포함하는 파티션들을 생성한다(S211).That is, the profiler 131 creates partitions including only one operator by dividing the graph into individual operator units (S211).

그런 후, 프로파일러(131)는, 하드웨어 의존적 최적화 모듈(120)에 의해 생성된 서브 그래프들로부터 타겟 플랫폼 가속기 명세에 명시된 가속기에서 실행 가능한 코드를 생성한다(S212).Then, the profiler 131 generates code executable in the accelerator specified in the target platform accelerator specification from the subgraphs generated by the hardware-dependent optimization module 120 (S212).

그리고, 프로파일러(131)는, 각각의 가속기 상에서 생성된 실행 코드를 반복 실행하면서 연산자들의 평균 실행 시간 및 소모 전력을 특정하여 기록하여 단위 연산자 성능 명세 파일을 생성한다. 일 예로, 단위 연산자 성능 명세 파일은 다음의 <표 3>과 같다. Then, the profiler 131 generates a unit operator performance specification file by recording the average execution time and power consumption of operators while repeatedly executing the execution code generated on each accelerator. As an example, the unit operator performance specification file is shown in <Table 3> below.

<표 3>을 참조하면, 단위 연산자 성능 명세 파일은, 연산자 식별자(ID), 수행 시간 및 전력 소모량을 포함한다. Referring to Table 3, the unit operator performance specification file includes an operator identifier (ID), execution time, and power consumption.

이때, 연산자 ID는, 연산자의 종류, 입력 데이터 타입, 입력 배열의 크기 및 사용되는 파라미터 값 중 적어도 하나 이상을 이용하여 생성된다. 이때, 동일한 ID를 가진 연산자는 동일한 연산자로 간주된다. In this case, the operator ID is generated using at least one of the operator type, input data type, input array size, and used parameter value. In this case, operators with the same ID are regarded as the same operator.

다음으로, 프로파일러(131)는, 그래프를 순회하면서 퓨징 연산 명세에 명시된 연속된 연산자들을 찾아 독립된 파티션을 생성한다(S214).Next, the profiler 131 traverses the graph, finds consecutive operators specified in the fusing operation specification, and creates independent partitions (S214).

그런 후, 프로파일러(131)는, 하드웨어 의존적 최적화 모듈(120)에 의해 생성된 파티션들로부터 타겟 플랫폼 가속기 명세에 명시된 가속기에서 실행 가능한 코드를 생성한다(S215).Then, the profiler 131 generates code executable in the accelerator specified in the target platform accelerator specification from the partitions generated by the hardware-dependent optimization module 120 (S215).

그리고, 프로파일러(131)는, 각각의 가속기 상에서 생성된 실행 코드를 실행하면서 실행 시간 및 소모 전력을 특정하여 기록하여 퓨징 연산자 성능 명세 파일을 생성한다. 일 예로, 퓨징 연산자 성능 명세 파일은 다음의 <표 4>와 같다. Then, the profiler 131 creates a fusing operator performance specification file by specifying and recording the execution time and power consumption while executing the execution code generated on each accelerator. As an example, a fusing operator performance specification file is shown in <Table 4> below.

<표 4>와 같이, 퓨징 연산자 그룹 성능 명세 파일은 퓨징 연산자 명세에 명시된 연산자 그룹의 ID, 수행 시간, 전력 소모량을 포함한다. 이때, 연산자 그룹의 ID는 연산자 그룹에 포함된 각 연산자의 ID들을 '-'로 연결한 값이다. As shown in <Table 4>, the fusing operator group performance specification file includes the operator group ID, execution time, and power consumption specified in the fusing operator specification. At this time, the ID of the operator group is a value obtained by connecting the IDs of each operator included in the operator group with '-'.

다시 도 5 및 도 6을 참조하면, 브랜치 분석부(132)는, S210에서 측정된 가속기별 실행 성능을 기반으로 입력 그래프를 복수의 브랜치들로 분리한다(S220). 이때, 그래프에서 순차적으로 실행해야 하는 브랜치와 병렬적으로 실행해도 되는 브랜치를 추출한다. Referring back to FIGS. 5 and 6 , the branch analyzer 132 divides the input graph into a plurality of branches based on the execution performance of each accelerator measured in S210 (S220). At this time, branches that should be executed sequentially and branches that may be executed in parallel are extracted from the graph.

도 9는 그래프를 브랜치 단위로 구분한 예시도이다. 9 is an exemplary diagram in which a graph is divided into branches.

도 9를 참조하면, 그래프를 G(V, E)라고 할 때, V는 G의 노드 집합이고, E는 에지 집합이다. 각 노드는 단일 연산자 정보를 포함하고 있다.Referring to FIG. 9, when the graph is G(V, E), V is a set of nodes of G and E is a set of edges. Each node contains information about a single operator.

이때, 브랜치는, G(V,E)에 대해 분기하는 경로(path)를 포함하고 있지 않고, 연속된 노드로만 구성된 서브 그래프로 정의될 수 있다. 예컨대, 도 8에 도시된 G로부터 4개의 브랜치들, 즉 브랜치 1, 브랜치 2-1, 브랜치 2-2 및 브랜치 3이 추출된다. In this case, a branch may be defined as a subgraph composed of only consecutive nodes and not including a path branching from G(V,E). For example, four branches are extracted from G shown in FIG. 8: branch 1, branch 2-1, branch 2-2 and branch 3.

이때, 병렬 브랜치는, 두 브랜치들이 입력 노드와 출력 노드를 공유하고 있는 것으로 정의될 수 있다. 예컨대, 도 8에 도시된 브랜치 2-1과 브랜치 2-2는 병렬 브랜치로 추출될 수 있다. In this case, a parallel branch may be defined as two branches sharing an input node and an output node. For example, branches 2-1 and 2-2 shown in FIG. 8 may be extracted as parallel branches.

다시 도 5 및 6을 참조하면, 가속기 결정부(130)는, 브랜치들 별로 성능을 최적화 할 수 있도록 노드 별 가속기를 할당한다(S230). Referring back to FIGS. 5 and 6 , the accelerator determining unit 130 allocates an accelerator for each node to optimize performance for each branch (S230).

이때, 가속기 결정은 최고의 성능을 낼 수 있는 연산자와 가속기를 매칭하는 것이다. At this time, the accelerator decision is to match the operator and accelerator that can produce the best performance.

이때, 성능의 종류는 최소 수행 시간 또는 최소 소모 전력이 될 수 있으나, 이는 일 실시 예일 뿐 본 발명은 이에 한정되지 않는다. 즉, 성능의 종류는 사용자로부터 입력 받거나 컴파일러 내부적으로 정의 가능하다. In this case, the type of performance may be a minimum execution time or a minimum power consumption, but this is only an embodiment and the present invention is not limited thereto. That is, the type of performance can be input from the user or defined internally by the compiler.

도 10은 실시예에 따른 그래프의 연산자와 가속기를 매칭하는 단계를 상세히 설명하기 위한 순서도이다.10 is a flowchart for explaining in detail a step of matching an operator and an accelerator of a graph according to an embodiment.

도 10을 참조하면, 가속기 할당부(133)는, 단위 연산자 성능 명세 파일과 퓨징 연산자 성능 명세 파일로부터 개별 연산자들 및 연산자 그룹들 각각의 수행 속도 및 전력 소모량을 로드한다(S231).Referring to FIG. 10 , the accelerator allocator 133 loads the execution speed and power consumption of individual operators and operator groups from the unit operator performance specification file and the fusing operator performance specification file (S231).

그런 후, 가속기 할당부(133)는, 각 브랜치에 포함된 개별 연산자를 순회하면서 각 연산자를 가장 빨리 실행하거나 최소의 전력을 소모하는 가속기를 찾아 매칭한다(S232).Then, the accelerator allocator 133 traverses the individual operators included in each branch to find and match the accelerator that executes each operator the fastest or consumes the least power (S232).

또한, 가속기 할당부(133)는, 각 브랜치에 포함된 연산자 중 퓨징 연산 명세에 포함된 연산자 그룹을 찾아 해당 그룹의 수행 시간 및 전력 소모량이 최소인 가속기를 찾아낸다(S233).In addition, the accelerator allocation unit 133 searches for an operator group included in the fusing operation specification among operators included in each branch and finds an accelerator having the minimum execution time and power consumption of the corresponding group (S233).

가속기 할당부(133)는, S233에서 찾은 가속기에서 연산자 그룹을 실행하는 시간이 S232에서 찾은 가속기들 상에서 개별 노드를 실행하는 성능보다 양호할 경우, 그룹에 포함된 모든 연산자들을 S233에서 찾은 가속기와 재 매칭한다(S234).The accelerator allocator 133, if the execution time of the operator group in the accelerator found in S233 is better than the performance of executing individual nodes on the accelerators found in S232, all the operators included in the group are transferred to the accelerator found in S233. Matching (S234).

그런 후, 가속기 할당부(133)는, 병렬 브랜치들에 동일한 가속기가 매칭되어 있는 경우, 병렬성을 이용하여 성능이 향상된다고 판단되면 서로 상이한 가속기를 재 할당한다(S235). Then, when the same accelerator is matched to the parallel branches, the accelerator allocation unit 133 reallocates different accelerators when it is determined that performance is improved by using parallelism (S235).

도 11은 다수의 브랜치를 포함하는 병렬 브랜치 집합의 예시도이다.11 is an exemplary diagram of a parallel branch set including multiple branches.

도 11을 참조하면, 가속기 할당부(131)는, 동일한 병렬 브랜치 그룹에 속한 브랜치들을(b2, b3) 가장 빠르게 실행할 수 있는 가속기를 할당할 수 있다. 만약, 가속기 할당 기준이 전력 소모량일 경우에는 가장 적은 전력을 소모하는 가속기를 할당할 수도 있다.Referring to FIG. 11 , the accelerator allocator 131 may allocate an accelerator that can execute branches b2 and b3 belonging to the same parallel branch group the fastest. If the criterion for allocating an accelerator is power consumption, an accelerator consuming the least amount of power may be allocated.

다시 도 5 및 도 6을 참조하면, 그래프 분할부(134)는, 그래프를 여러 연산자 그룹으로 분할한다(S240). 즉, 그래프 분할부(134)는, 브랜치 내의 연산자들을 순서대로 순회하면서 현재 연산자가 이전 연산자와 동일한 가속기가 매칭되어 있는 경우 동일한 파티션으로 묶고, 그러지 않으면 새로운 파티션을 만들어서 추가한다. 이때, 맨 처음 나오는 연산자는 무조건 새로운 파티션에 추가된다.Referring back to FIGS. 5 and 6 , the graph division unit 134 divides the graph into several operator groups (S240). That is, the graph partitioning unit 134 traverses the operators in the branch in order, and if the current operator matches the previous operator with the same accelerator, the graph division unit 134 groups them into the same partition, otherwise creates and adds a new partition. At this time, the first operator is unconditionally added to the new partition.

도 12는 실시예에 따른 그래프 분할 예시도이다. 12 is an exemplary view of graph division according to an embodiment.

도 12를 참조하면, 왼쪽 그래프는 브랜치 정보와 연산자 별 가속기 매칭 정보를 바탕으로 만들어진 파티션 (p1-p4)들을 보여준다. Referring to FIG. 12, the graph on the left shows partitions (p1-p4) created based on branch information and accelerator matching information for each operator.

다시 도 5 및 도 6을 참조하면, 파티션 스케줄러(135)는, S240에서의 브랜치 분석 정보를 기반으로 입력 딥 뉴럴넷의 실행 시간을 최소화하도록 파티션 실행 순서를 결정한다(S250).Referring back to FIGS. 5 and 6 , the partition scheduler 135 determines the partition execution order to minimize the execution time of the input deep neural net based on the branch analysis information in S240 (S250).

즉, 파티션 스케줄러(135)는, 파티션들을 너비 우선 탐색(breath-first) 방식으로 순회하면서 순차적 실행 순서를 결정한다. 예컨대, 도 12를 참조하면, 파티션 스케줄러가 결정한 순차적 실행 순서는 p1->p2->p3->p4가 될 수 있다. That is, the partition scheduler 135 determines a sequential execution order while traversing the partitions in a breath-first manner. For example, referring to FIG. 12 , the sequential execution order determined by the partition scheduler may be p1->p2->p3->p4.

그런 후, 파티션 스케줄러(135)는, 동일한 병렬 브랜치 집합에 포함되어 있고, 서로 다른 가속기가 할당되어 있는 파티션을 찾아 병렬 파티션으로 결정한다. 예컨대, 도 12를 참조하면, p2와 p3가 병렬 파티션이 될 수 있다.Then, the partition scheduler 135 finds a partition included in the same parallel branch set and to which different accelerators are assigned, and determines it as a parallel partition. For example, referring to FIG. 12, p2 and p3 may be parallel partitions.

한편, 결정된 파티션의 순차적 실행 순서와 병렬 파티션 정보의 저장 형태는 두 가지 실시예가 가능하다. Meanwhile, two embodiments are possible for the sequential execution order of the determined partition and the storage form of parallel partition information.

일 실시예에 따라, 특정 플랫폼에 의존적이지 않은 형태로 파티션들의 실행 정보를 표현하는 파티션 스케줄 계획이다. According to one embodiment, it is a partition schedule plan that expresses execution information of partitions in a form that is not dependent on a specific platform.

여기서, 파티션 스케줄 계획은, 전체 파티션의 개수, 각 파티션을 실행할 가속기 이름, 파티션에 포함 된 노드 이름 및 병렬 파티션들의 정보를 포함할 수 있다. 그리고, 파티션들은 병렬 파티션을 제외하고는 파티션 이름 필드(partitionNames)에 명시된 순서대로 실행된다. Here, the partition schedule plan may include the total number of partitions, the names of accelerators to execute each partition, the names of nodes included in the partitions, and information on parallel partitions. And, except for parallel partitions, partitions are executed in the order specified in the partition name field (partitionNames).

다른 실시예에 따라, 플랫폼에서 컴파일 가능한 코드 형태인 파티션 스케줄 코드이다. According to another embodiment, the partition schedule code is in the form of code compilable on the platform.

여기서, 파티션 스케줄 코드는, 파티션 스케줄 계획에 명시된 파티션을 실행하는 C 언어 코드이다. Here, the partition schedule code is a C language code that executes a partition specified in the partition schedule plan.

그리고, 타겟 플랫폼에서 파티션 코드와 파티션 스케줄 코드를 함께 컴파일하면 플랫폼에서 실행 가능한 실행 코드가 된다. In addition, if the partition code and the partition schedule code are compiled together in the target platform, they become executable codes that can be executed on the platform.

생성된 실행 코드는 컴파일러의 입력 딥뉴럴넷 모델의 입출력 형태와 동일한 입출력 형태를 가지고 있고 동일한 추론 연산을 수행하는 타겟 플랫폼 의존적 머신 코드일 수 있다. The generated execution code may be target platform-dependent machine code that has the same input/output type as that of the deep neural network model input by the compiler and performs the same inference operation.

또한, 도면에는 도시하지 아니하였으나, 본 발명의 일실시예에 따른 실행 순서는 별도의 스케줄링 작업을 수행하지 않고 외부로부터 입력된 정보를 적용할 수도 있다.In addition, although not shown in the drawing, the execution sequence according to an embodiment of the present invention may apply information input from the outside without performing a separate scheduling task.

도 13은 실시예에 따른 컴퓨터 시스템 구성을 나타낸 도면이다.13 is a diagram showing a configuration of a computer system according to an embodiment.

실시예에 따른 이종 컴퓨터 플랫폼 지원 딥러닝 컴파일러(100)는 컴퓨터로 읽을 수 있는 기록매체와 같은 컴퓨터 시스템(1000)에서 구현될 수 있다.The heterogeneous computer platform supporting deep learning compiler 100 according to the embodiment may be implemented in a computer system 1000 such as a computer-readable recording medium.

컴퓨터 시스템(1000)은 버스(1020)를 통하여 서로 통신하는 하나 이상의 프로세서(1010), 메모리(1030), 사용자 인터페이스 입력 장치(1040), 사용자 인터페이스 출력 장치(1050) 및 스토리지(1060)를 포함할 수 있다. 또한, 컴퓨터 시스템(1000)은 네트워크(1080)에 연결되는 네트워크 인터페이스(1070)를 더 포함할 수 있다. 프로세서(1010)는 중앙 처리 장치 또는 메모리(1030)나 스토리지(1060)에 저장된 프로그램 또는 프로세싱 인스트럭션들을 실행하는 반도체 장치일 수 있다. 메모리(1030) 및 스토리지(1060)는 휘발성 매체, 비휘발성 매체, 분리형 매체, 비분리형 매체, 통신 매체, 또는 정보 전달 매체 중에서 적어도 하나 이상을 포함하는 저장 매체일 수 있다. 예를 들어, 메모리(1030)는 ROM(1031)이나 RAM(1032)을 포함할 수 있다.Computer system 1000 may include one or more processors 1010, memory 1030, user interface input devices 1040, user interface output devices 1050, and storage 1060 that communicate with each other via a bus 1020. can In addition, computer system 1000 may further include a network interface 1070 coupled to network 1080 . The processor 1010 may be a central processing unit or a semiconductor device that executes programs or processing instructions stored in the memory 1030 or the storage 1060 . The memory 1030 and the storage 1060 may be storage media including at least one of volatile media, nonvolatile media, removable media, non-removable media, communication media, and information delivery media. For example, memory 1030 may include ROM 1031 or RAM 1032 .

이상에서 첨부된 도면을 참조하여 본 발명의 실시예들을 설명하였지만, 본 발명이 속하는 기술분야에서 통상의 지식을 가진 자는 본 발명이 그 기술적 사상이나 필수적인 특징을 변경하지 않고서 다른 구체적인 형태로 실시될 수 있다는 것을 이해할 수 있을 것이다. 그러므로 이상에서 기술한 실시예들은 모든 면에서 예시적인 것이며 한정적이 아닌 것으로 이해해야만 한다.Although the embodiments of the present invention have been described with reference to the accompanying drawings, those skilled in the art can implement the present invention in other specific forms without changing its technical spirit or essential features. You will understand that there is Therefore, the embodiments described above should be understood as illustrative in all respects and not limiting.

100 : 이종 컴퓨터 플랫폼 지원 딥러닝 컴파일러
110 : 딥뉴럴넷 모델 로더
120 : 그래프 기반 하드웨어 독립적 최적화부
130 : 그래프 분할 및 스케줄 생성부 131 : 프로파일러
132 : 브랜치 분석부 133 : 가속기 할당
134 : 그래프 분할부 135 : 파티션 스케줄러
140 : 하드웨어 의존적 최적화 및 가속기 코드 생성부100: heterogeneous computer platform support deep learning compiler
110: Deep Neural Net Model Loader
120: graph-based hardware independent optimization unit
130: graph division and schedule generation unit 131: profiler
132: branch analysis unit 133: accelerator assignment
134: graph division unit 135: partition scheduler
140: hardware dependent optimization and accelerator code generation unit

Claims

a memory in which at least one program is recorded; and
A processor that executes a program;
program,
converting the deep neural network model into a graph connected according to the data flow of operators; and
A heterogeneous computing platform that performs a step of converting a deep neural network model into code by considering the execution performance measurement results of operators in at least two heterogeneous accelerators included in the target platform and the execution order of operators divided and allocated to the heterogeneous accelerators. Support deep learning compiler.

The method of claim 1, wherein the execution performance measurement result is
A deep learning compiler supporting heterogeneous computing platforms, obtained through measurement of execution performance of operators in at least two heterogeneous accelerators included in a target platform based on a graph or input from the outside.

The method of claim 2, wherein the program, when performing performance measurement,
generating a unit performance specification file as a result of execution performance of unit operators for each accelerator measured based on input target platform accelerator specification information; and
A deep learning compiler supporting heterogeneous computing platforms, including generating a fusing operator performance specification file as a result of execution performance of operator groups for each accelerator measured based on input fusing operator specification information.

4. The method of claim 3, wherein the order of execution is
Deep learning compiler supporting heterogeneous computing platforms, determined through scheduling considering execution performance measurement results or input from the outside.

The method of claim 4, wherein the program, when performing scheduling,
matching an accelerator with maximum execution performance to an operator based on an execution performance measurement result;
generating a partition by partitioning the graph while traversing the operators in order; and
A deep learning compiler supporting heterogeneous computing platforms, which performs a step of determining the execution order of the generated partitions.

The method of claim 4, wherein the program,
Before performing scheduling, further performing the step of optimizing the transformed graph independently of hardware,
For optimization, heterogeneous computing platform support deep, which performs at least one of: removing unnecessary operators, merging consecutively occurring Transpose operators, moving Transpose operations to the bottom of the graph, and merging consecutively occurring Concat operators. running compiler.

The method of claim 5, wherein the program,
When performing scheduling,
Further performing the step of separating the graph into a plurality of branches including at least one operator based on the measurement results of the execution performance of the operators in the two or more heterogeneous accelerators,
The matching step is
matching an accelerator having a maximum execution performance for each of at least one operator included in each of a plurality of branches;
Searching for an accelerator having maximum execution performance for an operator group included in each of a plurality of branches, based on a fusing operator performance specification file; and
If the execution performance of the operator group is better than the execution performance of individual operators, a deep learning compiler supporting heterogeneous computing platforms further performs a step of re-matching the accelerator matched to the individual operator with the found accelerator.

The method of claim 5, wherein the program,
When performing scheduling,
Further performing the step of distinguishing whether the plurality of branches are executed sequentially or in parallel,
The matching step is
A deep learning compiler supporting heterogeneous computing platforms, which further performs a step of reallocating different accelerators when performance is improved by using parallelism when the same accelerator is matched to branches executed in parallel.

The method of claim 5, wherein the program,
At the stage of creating a partition,
While traversing the operators in the branch in order, if the current operator matches the previous operator and the same accelerator, it is determined as the same partition,
A deep learning compiler supporting heterogeneous computing platforms that performs additional steps of creating and adding new partitions when the current operator does not match the same accelerator as the previous operator.

The method of claim 8, wherein the program,
At the stage of determining the execution order,
determine the sequential order of execution by traversing in a breadth-first manner;
A deep learning compiler supporting heterogeneous computing platforms that determines partitions that are included in the same set of parallel branches and have different accelerators assigned to them as parallel partitions.

The method of claim 10, wherein the sequential execution order and parallel partition information of the determined partition are:
A deep learning compiler supporting heterogeneous computing platforms that is stored as one of a partition schedule plan that expresses execution information of partitions in a form independent of a specific platform and a partition schedule code that is a code form that can be compiled on a specific platform.

converting the deep neural network model into a graph connected according to the data flow of operators; and
A heterogeneous computing platform that performs a step of converting a deep neural network model into code by considering the execution performance measurement results of operators in at least two heterogeneous accelerators included in the target platform and the execution order of operators divided and allocated to the heterogeneous accelerators. Support deep learning compilation method.

According to claim 12,
The execution performance measurement result is,
Obtained through measuring the execution performance of operators in at least two or more heterogeneous accelerators included in the target platform based on the graph or input from the outside,
When performing performance measurements,
generating a unit performance specification file as a result of execution performance of unit operators for each accelerator measured based on input target platform accelerator specification information; and
A deep learning compilation method supporting heterogeneous computing platforms, which performs a step of generating a fusing operator performance specification file as a result of the execution performance of an operator group for each accelerator measured based on input fusing operator specification information.

According to claim 13,
The order of execution is
It is determined through scheduling considering the execution performance measurement result or input from the outside,
When performing scheduling,
matching an accelerator with maximum execution performance to an operator based on an execution performance measurement result;
generating a partition by partitioning the graph while traversing the operators in order; and
A deep learning compilation method supporting heterogeneous computing platforms, which performs a step of determining an execution order of generated partitions.

According to claim 14,
Before performing scheduling, further performing the step of optimizing the transformed graph independently of hardware,
For optimization, heterogeneous computing platform support deep, which performs at least one of: removing unnecessary operators, merging consecutively occurring Transpose operators, moving Transpose operations to the bottom of the graph, and merging consecutively occurring Concat operators. Running compile method.

According to claim 14,
When performing scheduling,
Further performing the step of separating the graph into a plurality of branches including at least one operator based on the measurement results of the execution performance of the operators in the two or more heterogeneous accelerators,
The matching step is
matching an accelerator having a maximum execution performance for each of at least one operator included in each of a plurality of branches;
Searching for an accelerator having maximum execution performance for an operator group included in each of a plurality of branches, based on a fusing operator performance specification file; and
If the execution performance in the operator group is better than the execution performance of the individual operator, a deep learning compilation method supporting heterogeneous computing platforms further performing a step of re-matching the accelerator matched to the individual operator with the found accelerator.

measuring execution performance of operators in at least two or more heterogeneous accelerators included in a target platform based on a graph in which a deep neural network model is connected according to data flows of operators;
matching an accelerator with maximum execution performance to an operator based on an execution performance measurement result;
generating a partition by partitioning the graph while traversing the operators in order; and
A graph partitioning and schedule generation method for deep learning compilation supporting heterogeneous computing platforms, which performs a step of determining an execution order of generated partitions.

18. The method of claim 17, wherein measuring execution performance comprises:
generating a unit performance specification file as a result of execution performance of unit operators for each accelerator measured based on input target platform accelerator specification information; and
A graph division and schedule generation method for deep learning compilation supporting heterogeneous computing platforms, which performs steps of generating a fusing operator performance specification file as a result of the execution performance of operator groups for each accelerator measured based on input fusing operator specification information.

According to claim 18,
Separating the graph into a plurality of branches including at least one operator based on the measurement results of the execution performance of the operators in the two or more heterogeneous accelerators,
The matching step is
matching an accelerator having a maximum execution performance for each of at least one operator included in each of a plurality of branches;
Searching for an accelerator having maximum execution performance for an operator group included in each of a plurality of branches, based on a fusing operator performance specification file; and
If the execution performance of an operator group is better than that of an individual operator, a step of re-matching an accelerator matched to the individual operator with a searched accelerator is performed, and graph division and schedule generation for deep learning compilation supporting heterogeneous computing platforms are performed. method.

According to claim 19,
Further comprising the step of distinguishing whether the plurality of branches are executed sequentially or in parallel,
The matching step is
When the same accelerator is matched to branches executed in parallel, if performance is improved by using parallelism, a step of reallocating different accelerators is further performed, graph division and schedule for deep learning compilation supporting heterogeneous computing platforms How to create.