KR102718583B1

KR102718583B1 - Method and apparatus of data processing for neural network

Info

Publication number: KR102718583B1
Application number: KR1020190127258A
Authority: KR
Inventors: 손진우; 손창용; 유재형; 이서형; 정상일; 최창인; 한재준
Original assignee: Samsung Electronics Co., Ltd. (삼성전자주식회사)
Priority date: 2019-08-13
Filing date: 2019-10-14
Publication date: 2024-10-18
Also published as: KR20210019917A

Abstract

A data processing method and apparatus for a neural network are disclosed. According to one embodiment, the data processing method includes generating accumulated data by accumulating multiplication results between at least some of the input elements in an input plane and at least some of the weight elements in a weight plane, and generating, based on the accumulated data, an output plane corresponding to an output channel from among the output planes of an output feature map, each of which corresponds to one of the output channels.

Description

METHOD AND APPARATUS OF DATA PROCESSING FOR NEURAL NETWORK

The following embodiments relate to a data processing method and apparatus for a neural network.

Technical automation of recognition processes has been implemented, for example, through neural network models implemented by processors as specialized computational structures, which, after substantial training, can provide computationally intuitive mappings between input patterns and output patterns. The trained ability to generate such mappings may be referred to as the learning ability of the neural network. Furthermore, because of the specialized training, such a specially trained neural network may have a generalization ability, for example, the ability to generate relatively accurate outputs for input patterns on which it was not trained.

According to one embodiment, a data processing method for a neural network includes: obtaining a first input plane corresponding to a first input channel from among the input planes of an input feature map, each of which corresponds to one of a plurality of input channels; obtaining a first weight plane corresponding to the first input channel from among the weight planes of a weight kernel, each of which corresponds to one of the input channels; generating first accumulated data by accumulating multiplication results between at least some of the first input elements in the first input plane and at least some of the first weight elements in the first weight plane; and generating, based on the first accumulated data, a first output plane corresponding to a first output channel from among the output planes of an output feature map, each of which corresponds to one of a plurality of output channels.

Generating the first output plane may include generating the first output plane based on a sum of the accumulated data for each input channel, including the first accumulated data. The data processing method may further include: obtaining a second input plane corresponding to a second input channel from among the input planes; obtaining a second weight plane corresponding to the second input channel from among the weight planes; and generating second accumulated data by accumulating multiplication results between at least some of the second input elements in the second input plane and at least some of the second weight elements in the second weight plane. Generating the first output plane may include generating the first output plane based on a sum of the first accumulated data and the second accumulated data.
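The per-channel accumulation and cross-channel summation described above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the patented implementation: it assumes stride 1, an odd kernel size K, K // 2 zeros of padding on every side (so the output plane keeps the input plane's size), and nested lists as planes. The names `accumulate_channel` and `output_plane` are hypothetical. Rather than materializing each channel's accumulated data and adding them afterwards, the sketch accumulates every channel into one buffer, which produces the same sum.

```python
from typing import List

Plane = List[List[float]]


def accumulate_channel(inp: Plane, wgt: Plane, acc: Plane) -> None:
    """Add one input channel's contribution into the accumulator plane `acc`.

    Assumes stride 1, odd K, and K // 2 zeros of padding on every side, so the
    output plane has the same H x W size as the input plane. No kernel flip is
    applied (cross-correlation, as in common CNN frameworks).
    """
    H, W, K = len(inp), len(inp[0]), len(wgt)
    p = K // 2
    for ky in range(K):
        for kx in range(K):
            w = wgt[ky][kx]  # one scalar weight element
            for y in range(H):
                iy = y + ky - p
                if not 0 <= iy < H:
                    continue  # this row reads only padding zeros
                for x in range(W):
                    ix = x + kx - p
                    if 0 <= ix < W:
                        acc[y][x] += w * inp[iy][ix]


def output_plane(input_planes: List[Plane], weight_planes: List[Plane]) -> Plane:
    """Sum of the per-channel accumulated data for one weight kernel."""
    H, W = len(input_planes[0]), len(input_planes[0][0])
    acc = [[0.0] * W for _ in range(H)]
    for inp, wgt in zip(input_planes, weight_planes):
        accumulate_channel(inp, wgt, acc)
    return acc
```

For two 3*3 all-ones input planes and two 3*3 all-ones weight planes, the centre output element receives 9 contributions per channel, 18 in total, while a corner element receives 4 per channel because the rest of its window falls in the zero padding.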

Generating the first accumulated data may include: extracting, from the first input plane, first input element vectors corresponding to the at least some of the first weight elements; generating first weighted input element vectors corresponding to the results of multiplying the first input element vectors by the at least some of the first weight elements; and generating the first accumulated data by accumulating the first weighted input element vectors. Extracting the first input element vectors may include: determining offsets corresponding to the first input element vectors based on the indices of the at least some of the first weight elements; and extracting the first input element vectors from the first input plane based on the offsets. The size of the first input element vectors and of the first weighted input element vectors may correspond to a single instruction multiple data (SIMD) operation unit.
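The offset-based extraction can be illustrated on a zero-padded input plane flattened row-major. In this sketch, which is an assumption rather than the patent's code, the weight element with flat index i maps to the offset (i // K) * Wp + (i % K), where Wp is the padded row width, and each input element vector is a slice of VLEN consecutive elements standing in for one SIMD register. VLEN = 4 and the function names are hypothetical; Python lists only emulate the vector lanes.

```python
from typing import List

VLEN = 4  # hypothetical SIMD width: elements per vector register


def pad_and_flatten(plane: List[List[float]], p: int) -> List[float]:
    """Zero-pad a 2-D plane by p on every side and flatten it row-major."""
    Wp = len(plane[0]) + 2 * p
    flat = [0.0] * (p * Wp)  # top padding rows
    for row in plane:
        flat += [0.0] * p + list(row) + [0.0] * p
    flat += [0.0] * (p * Wp)  # bottom padding rows
    return flat


def accumulate_vector(flat: List[float], Wp: int, weights: List[float], K: int,
                      y: int, x0: int, vlen: int = VLEN) -> List[float]:
    """Accumulated data for output elements (y, x0) .. (y, x0 + vlen - 1).

    `weights` is the K*K weight plane flattened row-major; `flat` is the input
    plane padded by K // 2 and flattened with row width Wp.
    """
    assert x0 + vlen <= Wp - K + 1, "vector must not run past the row"
    base = y * Wp + x0
    acc = [0.0] * vlen
    for i, w in enumerate(weights):
        offset = (i // K) * Wp + (i % K)                 # offset from the weight index
        vec = flat[base + offset: base + offset + vlen]  # input element vector
        weighted = [w * v for v in vec]                  # weighted input element vector
        acc = [a + b for a, b in zip(acc, weighted)]     # lane-wise accumulation
    return acc
```

Because every vector is a contiguous slice of the stored plane, no reshaped copy of the input is built; only the start offset changes from one weight element to the next.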

When the first accumulated data is generated, multiplication operations between the at least some of the first input elements and zero weight elements, that is, those of the at least some of the first weight elements whose value is 0, may be omitted. The data processing method may further include: determining the number of non-zero weight elements, that is, first weight elements whose value is not 0; and selecting, from among operation types each configured to perform an operation in a predetermined manner, an operation type corresponding to the determined number of non-zero weight elements. Generating the first accumulated data may include generating the first accumulated data by accumulating, based on the selected operation type, the multiplication results between the at least some of the first input elements and the non-zero weight elements among the at least some of the first weight elements. Generating the first accumulated data may include: extracting, from the first input plane, first input element vectors corresponding to the non-zero weight elements based on the indices of the non-zero weight elements; generating first weighted input element vectors corresponding to the results of multiplying the first input element vectors by the non-zero weight elements; and generating the first accumulated data by accumulating the first weighted input element vectors.
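Zero-skipping combined with operation-type selection could look like the following sketch. It assumes a zero-padded input plane flattened row-major with row width Wp; the table of operation types keyed by the number of non-zero weights is a hypothetical stand-in for the predetermined operation types. In compiled code each entry would plausibly be a kernel specialized (for example, unrolled) for that count; in Python every entry runs the same loop, so the sketch only illustrates the dispatch and the skipped multiplications. All names are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

OpType = Callable[[Sequence[Sequence[float]], Sequence[float], int], List[float]]


def make_op_type(n: int) -> OpType:
    """Hypothetical operation type for exactly n non-zero weight elements."""
    def op(vectors, weights, vlen):
        assert len(vectors) == len(weights) == n
        acc = [0.0] * vlen
        for w, vec in zip(weights, vectors):
            for lane in range(vlen):
                acc[lane] += w * vec[lane]
        return acc
    return op


def build_op_types(K: int) -> Dict[int, OpType]:
    """One operation type per possible non-zero count, 0 .. K*K."""
    return {n: make_op_type(n) for n in range(K * K + 1)}


def accumulate_nonzero(flat: List[float], Wp: int, weights: List[float], K: int,
                       base: int, vlen: int,
                       op_types: Dict[int, OpType]) -> Tuple[List[float], int]:
    """Accumulate only the non-zero weights; return (accumulated vector, count)."""
    nz = [i for i, w in enumerate(weights) if w != 0]  # indices of non-zero weights
    op = op_types[len(nz)]                             # select by non-zero count
    vectors = []
    for i in nz:
        start = base + (i // K) * Wp + (i % K)         # offset from the weight index
        vectors.append(flat[start:start + vlen])
    return op(vectors, [weights[i] for i in nz], vlen), len(nz)
```

With a 3*3 weight plane in which only 2 of the 9 weights are non-zero and a vector width of 4, this performs 8 multiplications (2 weights times 4 lanes) instead of 36.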

According to one embodiment, a data processing apparatus for a neural network includes a processor and a memory including instructions executable by the processor. When the instructions are executed by the processor, the processor obtains a first input plane corresponding to a first input channel from among the input planes of an input feature map, each of which corresponds to one of a plurality of input channels; obtains a first weight plane corresponding to the first input channel from among the weight planes of a weight kernel, each of which corresponds to one of the input channels; generates first accumulated data by accumulating multiplication results between at least some of the first input elements in the first input plane and at least some of the first weight elements in the first weight plane; and generates, based on the first accumulated data, a first output plane corresponding to a first output channel from among the output planes of an output feature map, each of which corresponds to one of a plurality of output channels.

The specific structures or functions disclosed below are provided merely as examples for explaining technical concepts; the embodiments may be implemented in various forms other than those disclosed below, and the disclosure below does not limit the embodiments of this specification.

Terms such as "first" and "second" may be used to describe various components, but these terms should be understood only as distinguishing one component from another. For example, a first component may be referred to as a second component, and similarly, a second component may be referred to as a first component.

A singular expression includes the plural unless the context clearly indicates otherwise. In this specification, terms such as "comprises" are intended to specify the presence of the stated features, numbers, steps, operations, components, parts, or combinations thereof, and should be understood not to preclude the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art. Terms defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the relevant art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined herein.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings. Like reference numerals in the drawings denote like elements.

FIG. 1 is a diagram illustrating a processing apparatus that processes data for a neural network according to one embodiment. Referring to FIG. 1, a processing apparatus (100) may include a neural network (110) and may process operations related to the neural network (110). For example, operations related to the neural network (110) may include an object recognition operation and a user authentication operation.

The neural network (110) may perform an object recognition operation or a user authentication operation by mapping input data and output data that have a nonlinear relationship to each other, based on deep learning. Deep learning is a machine learning technique for solving problems such as image or speech recognition from big data sets. Deep learning may be understood as an optimization process that searches for the point at which energy is minimized while training the neural network (110) with prepared training data. Through supervised or unsupervised deep learning, the structure of the neural network (110), or the weights corresponding to the model, may be obtained, and the input data and the output data may be mapped to each other through these weights.

The neural network (110) may correspond to a deep neural network (DNN) including a plurality of layers. The plurality of layers may include an input layer, at least one hidden layer, and an output layer. The first layer, the second layer, and the n-th layer illustrated in FIG. 1 may be at least some of these layers. The neural network (110) may include at least one of a fully connected network, a convolutional neural network (CNN), and a recurrent neural network (RNN). For example, at least some of the layers in the neural network (110) may correspond to a CNN, and others may correspond to a fully connected network.

In a CNN, the data input to each layer may be referred to as an input feature map, and the data output from each layer may be referred to as an output feature map. The input feature map and the output feature map may also be referred to as activation data. In the input layer, the input feature map may correspond to the input data.

To process operations of the neural network (110), the processing apparatus (100) may perform, for each convolutional layer, a convolution operation between the input feature map and a weight kernel, and may generate an output feature map based on the result of the convolution operation. If the width and depth of the neural network (110) are sufficiently large, it may have enough capacity to implement an arbitrary function. If the neural network (110) learns a sufficiently large amount of training data through an appropriate training process, it may achieve optimal performance.

The weight kernel may be described as being determined 'in advance', where 'in advance' may mean before the neural network (110) is 'started'. The neural network (110) being 'started' may mean that the neural network (110) is ready for inference. For example, the neural network (110) being 'started' may include the neural network (110) having been loaded into memory, or input data for inference having been input to the neural network (110) after the neural network (110) was loaded into memory.

As described in more detail below, the convolution operation according to the embodiments is performed by accumulating intermediate results of the convolution operation in the output feature map, and it does not require a buffering operation that transforms the weight kernel or the input feature map into a form suited to convolution and stores it in a buffer. In other words, the convolution operation according to the embodiments can use the data of the input feature map exactly as stored, in planar form. Accordingly, the efficiency of the convolution operation can be significantly improved. In addition, in the convolution operation according to the embodiments, multiplying one weight element, which is a scalar, by one input plane, which is a matrix, may constitute one unit operation. Accordingly, zero-skipping for weight elements having a value of 0 can be handled efficiently in software.
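To illustrate that accumulation only reorders the arithmetic of a sliding-window convolution, the sketch below computes one single-channel output plane both ways. The accumulation version performs one unit operation per weight element (a scalar times a shifted input plane, added into the output plane) and drops the whole unit operation when the weight is 0. Stride 1, "same" zero padding, and the function names are assumptions for illustration only.

```python
from typing import List, Tuple

Plane = List[List[float]]


def sliding_window(inp: Plane, wgt: Plane) -> Plane:
    """Reference order: for each output element, slide the kernel over it."""
    H, W, K = len(inp), len(inp[0]), len(wgt)
    p = K // 2
    out = [[0.0] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            s = 0.0
            for ky in range(K):
                for kx in range(K):
                    iy, ix = y + ky - p, x + kx - p
                    if 0 <= iy < H and 0 <= ix < W:
                        s += wgt[ky][kx] * inp[iy][ix]
            out[y][x] = s
    return out


def accumulate(inp: Plane, wgt: Plane) -> Tuple[Plane, int]:
    """Accumulation order: one unit operation (scalar x shifted plane) per weight.

    Returns the output plane and the number of multiplications performed.
    """
    H, W, K = len(inp), len(inp[0]), len(wgt)
    p = K // 2
    out = [[0.0] * W for _ in range(H)]
    mults = 0
    for ky in range(K):
        for kx in range(K):
            w = wgt[ky][kx]
            if w == 0:
                continue  # zero-skip: the whole unit operation is dropped
            # Only output positions whose shifted input element lies inside the plane.
            for y in range(max(0, p - ky), min(H, H + p - ky)):
                for x in range(max(0, p - kx), min(W, W + p - kx)):
                    out[y][x] += w * inp[y + ky - p][x + kx - p]
                    mults += 1
    return out, mults
```

With all nine weights equal to 1 on a 4*4 input, the accumulation version performs 100 multiplications, since padding positions are never touched; setting the centre weight to 0 removes one complete unit operation of 16 multiplications, leaving 84.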

FIG. 2 is a diagram illustrating a convolution operation process according to one embodiment. Referring to FIG. 2, an output feature map (230) is generated by a convolution operation between a weight kernel (210) and an input feature map (220). The data of the weight kernel (210), the input feature map (220), and the output feature map (230), as stored in memory, may each be represented as planes. For example, each of weight kernels (1) to (D) may include C weight planes, the input feature map (220) may include C input planes, and the output feature map (230) may include D output planes. The C weight planes and the C input planes may each correspond to the input channels, and the D output planes may each correspond to the output channels. In addition, C may correspond to the number of input channels, and D may correspond to the number of output channels.

Each plane may include elements of a predetermined bit width. For example, each weight plane may have a size of K*K, and each input plane and each output plane may have a size of W*H, where W, K, and H each denote a number of elements. An element of a weight plane may be referred to as a weight element, an element of an input plane may be referred to as an input element, and an element of an output plane may be referred to as an output element. A convolution operation according to an embodiment may be performed element by element.

For convenience of explanation, the width and height of a weight plane are both assumed to be K, and the input plane and the output plane are assumed to have the same size, W*H. However, depending on the embodiment, the width and height of a weight plane may differ from each other, and the input plane and the output plane may differ in size.
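A planar layout for these tensors can be sketched with simple index arithmetic. The sketch assumes that each plane is stored row-major, that the planes of a tensor are stored one after another (the C weight planes of kernel d, then those of kernel d+1), and that each input and output plane has H rows and W columns. These layout details and the function names are assumptions; the text only states that the data are stored as planes.

```python
def weight_offset(d: int, c: int, ky: int, kx: int, C: int, K: int) -> int:
    """Flat index of element (ky, kx) in weight plane c of weight kernel d."""
    return ((d * C + c) * K + ky) * K + kx


def plane_offset(ch: int, y: int, x: int, H: int, W: int) -> int:
    """Flat index of element (y, x) in plane ch of a planar feature map."""
    return (ch * H + y) * W + x


def storage_sizes(C: int, D: int, K: int, H: int, W: int) -> dict:
    """Element counts of the weight kernels, input feature map, and output feature map."""
    return {"weights": D * C * K * K, "input": C * H * W, "output": D * H * W}
```

Under this layout, consecutive x values map to consecutive addresses, so each input plane is one contiguous block that a scalar-times-plane operation can stream through.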

FIG. 3 is a diagram illustrating a sliding-window convolution operation. In the sliding-window method, the weight kernel (310) slides over the input feature map (320) while the convolution operation is performed, generating the output feature map (330).

The sliding-window method is the approach conventionally used for convolution operations, and it differs from the accumulation method according to the embodiments. For example, in the sliding-window method, a buffering operation may be performed on the input feature map (320) to generate column vectors. In the accumulation method according to the embodiments, the intermediate results of the convolution operation are accumulated in the output feature map (330), so no buffering operation like that of the sliding-window method is required.

In the sliding-window convolution, as the weight kernel (310) slides over the input feature map (320), it operates on data stored at non-contiguous addresses of the input feature map (320). To increase processing speed, the input feature map (320) may therefore be transformed into contiguous data of a suitable form. For example, in FIG. 3, it is assumed that the sliding stride is 1 and that zero padding with two lines of zero element vectors is applied in each of the horizontal and vertical directions of the input feature map (320). In this case, row vectors of size K²*C corresponding to the weight kernel (310) may be defined, and the input feature map (320) may be transformed into column vectors of size K²*C.
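The column-vector transformation (commonly called im2col) can be sketched as follows. The sketch assumes stride 1 and `pad` zero elements on each side of every input plane; for example, pad = 1 adds two lines of zeros in each direction, one on each side. Each output position gets one column of length K²*C, and the weight kernel is flattened into a row of the same length in the same (channel, row, column) order, so each output element is one dot product. The function names are hypothetical.

```python
from typing import List

Plane = List[List[float]]


def im2col(inp: List[Plane], K: int, pad: int) -> List[List[float]]:
    """One column vector of length K*K*C per output position (stride 1)."""
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    Ho, Wo = H + 2 * pad - K + 1, W + 2 * pad - K + 1
    cols = []
    for y in range(Ho):
        for x in range(Wo):
            col = []
            for c in range(C):
                for ky in range(K):
                    for kx in range(K):
                        iy, ix = y + ky - pad, x + kx - pad
                        inside = 0 <= iy < H and 0 <= ix < W
                        col.append(inp[c][iy][ix] if inside else 0.0)  # zero padding
            cols.append(col)
    return cols


def conv_im2col(inp: List[Plane], kernel: List[Plane], pad: int) -> Plane:
    """Output plane for one weight kernel (C weight planes of size K*K)."""
    K = len(kernel[0])
    row = [w for plane in kernel for r in plane for w in r]  # row vector, length K*K*C
    Wo = len(inp[0][0]) + 2 * pad - K + 1
    flat = [sum(a * b for a, b in zip(row, col)) for col in im2col(inp, K, pad)]
    return [flat[i:i + Wo] for i in range(0, len(flat), Wo)]
```

The `cols` list is exactly the extra buffer the accumulation method avoids: for a W*H plane it holds roughly K*K copies of every input element.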

Column vectors may be buffered into a column buffer from an input feature map (320) having a planar structure or an interleaved structure. With the planar structure, buffering the input feature map (320) into column vectors may require, to determine one output element, up to as many non-contiguous memory accesses as the product of the kernel height (K) and the number of input channels (C). With the interleaved structure, buffering the input feature map (320) into column vectors may require, to determine one output element, up to as many non-contiguous memory accesses as the kernel height (K).
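These access counts can be checked by listing the flat addresses that one K*K*C patch touches and counting the runs of consecutive addresses among them. The sketch assumes a channel-major (C, H, W) layout for the planar structure and a channel-last (H, W, C) layout for the interleaved structure, an interior patch (no padding), and W > K; the helper names are hypothetical.

```python
from typing import Callable, List

Addr = Callable[[int, int, int], int]


def patch_addresses(addr: Addr, C: int, K: int, y0: int, x0: int) -> List[int]:
    """Sorted flat addresses of the K*K*C input elements behind one output element."""
    return sorted(addr(c, y0 + ky, x0 + kx)
                  for c in range(C) for ky in range(K) for kx in range(K))


def count_runs(addresses: List[int]) -> int:
    """Number of maximal runs of consecutive addresses (non-contiguous accesses)."""
    runs = 1
    for prev, cur in zip(addresses, addresses[1:]):
        if cur != prev + 1:
            runs += 1
    return runs


def runs_planar(C: int, H: int, W: int, K: int, y0: int, x0: int) -> int:
    # planar (C, H, W): each plane stored row-major, planes one after another
    return count_runs(patch_addresses(lambda c, y, x: (c * H + y) * W + x, C, K, y0, x0))


def runs_interleaved(C: int, H: int, W: int, K: int, y0: int, x0: int) -> int:
    # interleaved (H, W, C): the C channel values of each pixel stored together
    return count_runs(patch_addresses(lambda c, y, x: (y * W + x) * C + c, C, K, y0, x0))
```

For K = 3, C = 4, and an 8*8 plane, the planar layout yields 12 separate runs and the interleaved layout 3, matching K*C and K respectively.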

In the accumulative convolution according to the embodiments, the intermediate results of the convolution operation are accumulated into the output feature map (330). No separate buffering operation is therefore required to transform the input feature map (320) into a planar or interleaved structure. As a result, the accumulative convolution can minimize memory accesses and maximize the processing speed of the convolution operation.

FIGS. 4 and 5 are diagrams illustrating a process of generating one output plane through an accumulative convolution operation according to one embodiment. For example, the output feature map may include D output planes, and FIGS. 4 and 5 may correspond to the process of generating one of the D output planes. The output feature map may be generated by repeating the process illustrated in FIGS. 4 and 5 for each of the D output planes.

Referring to FIG. 4, an output plane (430) is generated through a convolution operation between an input feature map (410) and a weight kernel (420). For example, the weight kernel (420) may be the d-th of D weight kernels, and the output plane (430) may be the d-th of D output planes. The input feature map (410) may include the input planes (510) of FIG. 5, and the weight kernel (420) may include the weight planes (520) of FIG. 5. The output plane (430) may correspond to the output plane (540) of FIG. 5.

Referring to FIG. 5, the input planes (511, 512, 513) correspond to the input planes (510), and their number corresponds to the number of input channels (C). Below, it is assumed that C is 3. This is only for convenience of explanation, and C may take values other than 3. The weight planes (521, 522, 523) correspond to the weight planes (520), and the accumulation planes (531, 532, 533) correspond to the accumulation planes (530).

An accumulation plane (531) is generated through a MAC operation between the input plane (511) and the weight plane (521). Likewise, an accumulation plane (532) is generated through a MAC operation between the input plane (512) and the weight plane (522), and an accumulation plane (533) through a MAC operation between the input plane (513) and the weight plane (523). The MAC operation is described in detail later. Once the accumulation planes (531, 532, 533) are generated, the output plane (540) may be generated from them, for example, as the sum of the accumulation planes (531, 532, 533).

FIGS. 6 and 7 are diagrams illustrating a MAC (multiply-and-accumulate) operation between an input plane and a weight plane for an accumulative convolution operation according to one embodiment.

Referring to FIG. 6, an accumulation plane (630) is generated based on MAC operations between the input elements of the input plane (610) and the weight elements of the weight plane (620). The weight plane (620) includes weight elements w₁ to w₉. The weight plane (620) is described as having a size of 3*3 for convenience of description, but it may have various other sizes. Although not shown in FIG. 6, the input plane (610) and the accumulation plane (630) may each also include a plurality of elements, and the convolution operation may be performed element by element.

Referring to FIG. 7, the input plane (711) corresponds to the input plane (610) of FIG. 6. The weight elements (w₁ to w₉) of FIG. 7 correspond to the weight elements (w₁ to w₉) of the weight plane (620) of FIG. 6, and the accumulation plane (740) corresponds to the accumulation plane (630) of FIG. 6. Zero padding may be applied to the input plane (711) based on the sliding stride to generate the input plane (712). For example, if the input plane (711) has a size of W*H and the sliding stride is 1, the input plane (712) may have a size of (W+2)*(H+2).
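The offsets in the following paragraphs are flat (row-major) addresses into the padded plane, so it helps to store the padded plane as a single list. Below is a minimal sketch. It assumes a 3*3 weight plane and therefore one element of zero padding on each side, which yields the (W+2)*(H+2) size above. `pad_flat` is a hypothetical helper.

```python
def pad_flat(plane, p=1):
    """Zero-pad an H x W plane by p elements on each side.

    Returns (flat, pw): the padded plane as a row-major list and its width pw = W + 2p.
    """
    h, w = len(plane), len(plane[0])
    pw = w + 2 * p
    flat = [0] * (pw * (h + 2 * p))
    for y in range(h):
        for x in range(w):
            # Element (y, x) of the input plane lands at padded row y+p, column x+p.
            flat[(y + p) * pw + (x + p)] = plane[y][x]
    return flat, pw
```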

Assume that a sliding-window convolution operation is performed between the weight plane including the weight elements (w₁ to w₉) and the input plane (712). Reaction regions (721 to 729), to which the respective weight elements (w₁ to w₉) react, may then be defined on the input plane (712). For example, when the sliding-window convolution operation is performed, the input elements in the reaction region (721) react to the weight element (w₁), the input elements in the reaction region (722) react to the weight element (w₂), and the input elements in the reaction region (729) react to the weight element (w₉).

Each of the reaction regions (721 to 729) has the same size as the input plane (711). The offset of each reaction region may be determined based on the index of the corresponding weight element (w₁ to w₉). For example, when the width of the zero-padded input plane (712) is W+2, the offset of each of the reaction regions (721 to 729) may be defined as (W+2)*a+b. The offset is measured from a reference point of the input plane (e.g., the origin of the zero-padded input plane). Here, a is the quotient and b is the remainder of (i-1) divided by K, where i is the index of the weight element (w₁ to w₉) and K is the width of the weight kernel. Accordingly, the offset of the reaction region (721) is 0, the offset of the reaction region (722) is 1, and the offset of the reaction region (729) is (W+2)*2+2.
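The offset rule can be written directly. In the sketch below, the function name is hypothetical and `padded_width` stands for W+2 in the example above. It returns the offsets of the reaction regions for weight elements 1 to K*K, in index order.

```python
def reaction_offsets(padded_width, k):
    """Offset of the reaction region of each weight element w_1 .. w_(k*k).

    For the 1-based index i: a, b = divmod(i - 1, k) and offset = padded_width * a + b.
    """
    offsets = []
    for i in range(1, k * k + 1):
        a, b = divmod(i - 1, k)   # a: quotient (row shift), b: remainder (column shift)
        offsets.append(padded_width * a + b)
    return offsets
```

With W = 5, the padded width is 7, so the offsets are 0, 1, 2, 7, 8, 9, 14, 15, 16. These match 0, 1, 2, (W+2), (W+2)+1, ..., (W+2)*2+2.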

Multiplication results (731 to 739) may be generated by multiplying the input elements in each reaction region (721 to 729) by the corresponding weight element (w₁ to w₉). The accumulation plane (740) may then be generated by accumulating the multiplication results (731 to 739). Each element in the multiplication results (731 to 739) may be referred to as a multiplication result element. For example, the output plane may be generated as the sum of C accumulation planes, and FIG. 7 may correspond to the process of generating the accumulation plane (740), which is one of the C accumulation planes. The output plane may be generated by repeating the process illustrated in FIG. 7 for the C accumulation planes. In addition, when the output feature map includes D output planes, the output feature map may be determined from the D output planes generated by accumulation.
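The accumulation of FIG. 7 can be sketched in plain Python. Each weight element scales its entire reaction region, and the scaled regions are summed into the accumulation plane, with no column buffer. The sketch assumes stride 1 and (K-1)/2 elements of zero padding per side, and `accumulation_plane` and `sliding_window` are hypothetical names. The second function is only a conventional reference used to check that both loop orders give the same result (CNN-style convolution, that is, cross-correlation without kernel flipping).

```python
def accumulation_plane(plane, weight):
    """Accumulate w_i * (reaction region i) over all K*K weight elements."""
    h, w, k = len(plane), len(plane[0]), len(weight)
    p = (k - 1) // 2
    padded = [[0] * (w + 2 * p) for _ in range(h + 2 * p)]
    for y in range(h):
        for x in range(w):
            padded[y + p][x + p] = plane[y][x]
    acc = [[0] * w for _ in range(h)]
    for i in range(1, k * k + 1):
        a, b = divmod(i - 1, k)            # row/column shift of reaction region i
        wi = weight[a][b]
        for y in range(h):                 # the reaction region is H x W, like the input
            for x in range(w):
                acc[y][x] += wi * padded[y + a][x + b]   # multiplication result element
    return acc

def sliding_window(plane, weight):
    """Reference: conventional sliding-window convolution with 'same' zero padding."""
    h, w, k = len(plane), len(plane[0]), len(weight)
    p = (k - 1) // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            s = 0
            for ky in range(k):
                for kx in range(k):
                    yy, xx = y + ky - p, x + kx - p
                    if 0 <= yy < h and 0 <= xx < w:
                        s += weight[ky][kx] * plane[yy][xx]
            out[y][x] = s
    return out
```

The outer loop runs over weight elements rather than output positions. Each pass therefore streams through one contiguous, shifted copy of the input plane.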

According to the embodiments, the output feature map is generated by accumulating the multiplication results (731 to 739), which correspond to intermediate results of the convolution operation. The input feature map therefore does not need to be transformed into contiguous data and stored in a buffer. This removes the time that transformation and buffering would take, which speeds up the convolution operation, and saves the memory space that would otherwise hold the transformed data.

FIGS. 8 to 10 are diagrams illustrating an accumulative convolution operation using SIMD (single instruction, multiple data) processing according to one embodiment. SIMD refers to a processing scheme in which a processor processes multiple data items with a single instruction. As described in detail below, the accumulative convolution operation according to an embodiment may be performed using SIMD.

Referring to FIG. 8, the weight plane (810) may slide over a sliding region (821) of the input plane (820) while MAC operations are performed, to determine an accumulation region (831) of the accumulation plane (830). Similarly, the weight plane (810) may slide over the sliding region (822) to determine the accumulation region (832), and over the sliding region (823) to determine the accumulation region (833). The height of each sliding region (821 to 823) corresponds to the height of the weight plane (810), and the height of each accumulation region (831 to 833) corresponds to one element. In this manner, a correspondence between sliding regions and accumulation regions may be established.

Referring to FIG. 9, the sliding region (910) in the input plane (900) includes the reaction regions (911 to 919) of the weight elements (w₁ to w₉). The offset of each reaction region (911 to 919) may be determined based on the index of the corresponding weight element (w₁ to w₉). For example, as described for FIG. 7, the offset of each reaction region (911 to 919) may be defined as (W+2)*a+b, with the offset measured from the sliding region (e.g., the origin of each sliding region). In this case, the offsets of the reaction regions (911 to 919) may be 0, 1, 2, (W+2), (W+2)+1, (W+2)+2, (W+2)*2, (W+2)*2+1, and (W+2)*2+2, respectively.

Input element vectors are extracted from the reaction regions (911 to 919) and stored in registers (r1 to r9). For example, a first input element vector of the reaction region (911) may be stored in the register (r1), and a second input element vector of the reaction region (912) may be stored in the register (r2). In this way, the input element vectors may be stored sequentially in the registers (r1 to r9).

Each input element vector is multiplied, element by element, by its corresponding weight element among w₁ to w₉ to generate a weighted input element vector. For example, the first input element vector of the reaction region (911) may be stored in the register (r1) and multiplied by the weight element (w₁) to generate a first weighted input element vector. Similarly, the second input element vector of the reaction region (912) may be stored in the register (r2) and multiplied by the weight element (w₂) to generate a second weighted input element vector. The sizes of the reaction regions (911 to 919), the input element vectors, and the weighted input element vectors may correspond to the SIMD operation unit.

The weighted input element vectors generated in this way may be accumulated to generate an accumulated vector corresponding to the sliding region (910). Repeating this process for each sliding region generates an accumulated vector for each sliding region, and these accumulated vectors together form an accumulation plane. The accumulation plane and the accumulated vector refer to different forms of accumulated data and may be collectively referred to as accumulated data.
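The SIMD flow of FIGS. 8 and 9 can be emulated with Python lists standing in for vector registers. This is an illustrative sketch only, and several parts of it are assumptions:

- the 4-lane width `SIMD_WIDTH`;
- the helpers `vload`, `vmul_scalar`, and `vadd`;
- the row-major padded layout.

A real implementation would use the target's vector intrinsics instead. For each output row and each group of `SIMD_WIDTH` columns, the nine registers r1 to r9 are loaded from the base address plus the reaction-region offsets. Each register is scaled by its weight element (w₁ to w₉), and the results are summed into one accumulated vector.

```python
SIMD_WIDTH = 4  # hypothetical number of lanes per vector register

def vload(mem, addr, n=SIMD_WIDTH):
    """Emulate a vector load of n consecutive elements starting at addr."""
    return mem[addr:addr + n]

def vmul_scalar(vec, s):
    """Multiply every lane by one weight element."""
    return [v * s for v in vec]

def vadd(u, v):
    """Lane-wise addition."""
    return [a + b for a, b in zip(u, v)]

def accumulate_row_simd(padded, pw, weights_flat, k, y, x0):
    """Accumulated vector for output row y, columns x0 .. x0 + SIMD_WIDTH - 1."""
    base = y * pw + x0                           # origin of the sliding region
    acc = [0] * SIMD_WIDTH
    for i, wi in enumerate(weights_flat):        # i = 0 .. k*k-1, weight w_(i+1)
        a, b = divmod(i, k)
        r = vload(padded, base + pw * a + b)     # register r_(i+1)
        acc = vadd(acc, vmul_scalar(r, wi))      # add the weighted input element vector
    return acc

def accumulation_plane_simd(plane, weight):
    """Accumulation plane built one SIMD-width vector at a time."""
    h, w, k = len(plane), len(plane[0]), len(weight)
    if w % SIMD_WIDTH:
        raise ValueError("sketch assumes W is a multiple of SIMD_WIDTH")
    p = (k - 1) // 2
    pw = w + 2 * p
    padded = [0] * (pw * (h + 2 * p))
    for y in range(h):
        for x in range(w):
            padded[(y + p) * pw + x + p] = plane[y][x]
    weights_flat = [v for row in weight for v in row]
    out = []
    for y in range(h):
        row = []
        for x0 in range(0, w, SIMD_WIDTH):
            row.extend(accumulate_row_simd(padded, pw, weights_flat, k, y, x0))
        out.append(row)
    return out
```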

Referring to FIG. 10, an accumulated vector previously stored in an output area (1011) of the output plane (1010) (hereinafter, a first accumulated vector) is loaded from the output area (1011) and stored in a register (r10). When a new accumulated vector (hereinafter, a second accumulated vector) is generated using the registers (r1 to r9), the first and second accumulated vectors are accumulated in the register (r10), and the result is stored in the output area (1011).

FIG. 10 assumes that an accumulated vector has already been stored in the output area (1011) at least once. For example, FIG. 10 may correspond to the following sequence:

1. A first accumulated vector is generated through a MAC operation between a first input plane and a first weight plane, both corresponding to a first input channel, and is stored in the output area (1011).
2. A second accumulated vector is generated through a MAC operation between a second input plane and a second weight plane, both corresponding to a second input channel.
3. The first and second accumulated vectors are accumulated and stored in the output area (1011).

If the output area (1011) holds only an initial value, that is, if the accumulated vector is being generated for the first time, loading the accumulated vector from the output area (1011) may be omitted. The newly generated accumulated vector may then be stored in the output area (1011) without a separate accumulation operation.
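The load, accumulate, and store step on register r10 can be sketched as below. The function name is hypothetical, and the output plane is assumed to be a flat list. The `first` flag models an output area that still holds only its initial value, in which case the load and the addition are skipped. The function returns the number of vector additions it performed, which makes the "one fewer accumulation than input channels" count easy to check.

```python
def store_accumulated(output, start, vec, first):
    """Store an accumulated vector into output[start:start + len(vec)].

    first=True: the area holds only its initial value, so store without load or add.
    Returns the number of vector additions performed (0 or 1).
    """
    end = start + len(vec)
    if first:
        output[start:end] = vec
        return 0
    r10 = output[start:end]                    # load the previously stored accumulated vector
    r10 = [a + b for a, b in zip(r10, vec)]    # accumulate the new one in r10
    output[start:end] = r10                    # store the result back to the output area
    return 1

# One output area (elements 4..7) receiving accumulated vectors from C = 3 input channels.
out = [0] * 8
vectors = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
adds = sum(store_accumulated(out, 4, v, first=(c == 0)) for c, v in enumerate(vectors))
```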

Accumulated vectors are stored in the output area (1011) as many times as there are input channels, so the number of accumulations is one less than the number of input channels. Once this is done, the output element vector corresponding to the output area (1011) can be determined. When the same process is performed for the remaining output areas in the output plane (1010), the output plane (1010) can be determined. In this way, the accumulative convolution operation according to the embodiment can be implemented using SIMD.

FIG. 11 is a diagram illustrating a zero-skip process of the accumulative convolution operation according to one embodiment. The convolution operation according to the embodiments is performed in units of input planes (more precisely, in units of reaction regions within an input plane). Zero-skipping can therefore be handled efficiently in software.

Referring to FIG. 11, multiplication results (1141 to 1143) may be generated by multiplying the input elements in each reaction region (1121 to 1123) by the corresponding weight element (w₁ to w₉). In the embodiment of FIG. 11, the weight elements w₃ to w₅, w₈, and w₉ are assumed to be 0. Hereinafter, a weight element equal to 0 is referred to as a zero weight element, and a weight element not equal to 0 is referred to as a non-zero weight element. Multiplication results based on zero weight elements, such as the multiplication result (1143), do not affect the data of the accumulation plane or the output plane, so the operations producing them can be omitted.

FIG. 12 is a diagram illustrating a process of performing zero-skip using predefined operation types according to one embodiment. Referring to FIG. 12, zero encoding is performed in step (1210). Zero encoding determines the number of non-zero weight elements among the weight elements. For example, in FIG. 12, zero encoding determines that there are 4 non-zero weight elements.
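Zero encoding amounts to scanning the weight plane once. Below is a minimal sketch with a hypothetical `zero_encode` helper. It returns the count of non-zero weight elements together with their 1-based indices and values. The example weights follow the pattern of FIGS. 11 and 12, where w₁, w₂, w₆, and w₇ are non-zero; the actual values are made up.

```python
def zero_encode(weight_plane):
    """Return (number of non-zero weights, their 1-based indices, their values)."""
    flat = [v for row in weight_plane for v in row]      # w_1 .. w_(K*K) in row-major order
    indices = [i + 1 for i, v in enumerate(flat) if v != 0]
    values = [flat[i - 1] for i in indices]
    return len(indices), indices, values

# w_1, w_2, w_6, w_7 non-zero (illustrative values).
count, indices, values = zero_encode([[2, -1, 0], [0, 0, 3], [5, 0, 0]])
```

The count selects the operation type. The indices give the offsets of the input element vectors that have to be loaded.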

In step (1220), the operation type corresponding to the number of non-zero weight elements is selected from among the operation types, and data corresponding to the non-zero weight elements is loaded into registers. In FIG. 12, operation type 4, which corresponds to four non-zero weight elements, is selected. Each operation type may be configured to perform operations in a predetermined manner for a particular number of non-zero weight elements. For example, an operation type may be defined for every case, from the case in which the weight elements include no non-zero weight element to the case in which all weight elements are non-zero. If the number of operation types is N and the number of weight elements is K*K, then N = K*K+1. FIG. 12 shows the case in which K = 3 and N = 10.

The data loaded into a register may correspond to at least a part of the input plane. For example, an input element vector corresponding to a non-zero weight element may be loaded into a register. An offset for the input element vector may be determined based on the index of the non-zero weight element. The input element vector may then be extracted from the input plane using that offset and stored in the register. In FIG. 12, offsets of 0, 1, (W+2)+2, and (W+2)*2 are determined from the non-zero weight elements w₁, w₂, w₆, and w₇. The input element vectors at those offsets may be loaded into the registers reg1, reg2, reg3, and reg4.

The operation in the predetermined manner may include performing MAC operations between the non-zero weight elements and the data loaded into the registers to generate accumulated data. Here, the data may be loaded into the registers based on the number of non-zero weight elements and their offsets. For example, MAC operations may be performed between the non-zero weight elements and the input element vectors stored in the registers. In FIG. 12, weighted input element vectors are generated by multiplying the non-zero weight elements (w₁, w₂, w₆, w₇) by the input element vectors stored in the registers (reg1, reg2, reg3, reg4). Accumulated data may then be generated by accumulating the weighted input element vectors.

In step (1230), the source code corresponding to the selected operation type may be executed. For example, source code for each of operation types 0 to 9 may be stored in a memory code area, and the source code corresponding to the selected operation type may be loaded from the memory code area and executed. In FIG. 12, the source code corresponding to operation type 4 may be executed. This source code occupies little memory space, so using it does not significantly reduce memory efficiency.
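The operation types can be modelled as a table of N = K*K+1 routines indexed by the number of non-zero weight elements. This sketch is an assumption about structure rather than the patent's code. Python closures stand in for the per-type routines held in the memory code area; in compiled code, each type would typically be a separate, fully unrolled routine. `make_op_type`, `OP_TYPES`, and `accumulate_vector_zero_skip` are hypothetical names. Only the input vectors at the offsets of non-zero weight elements are loaded, so zero weight elements cost neither a load nor a multiplication.

```python
K = 3  # the weight plane is assumed to be K x K

def make_op_type(n):
    """Return 'operation type n': a MAC over exactly n (weight, register) pairs."""
    def op(weights, registers, width):
        acc = [0] * width
        for j in range(n):
            acc = [s + weights[j] * r for s, r in zip(acc, registers[j])]
        return acc
    return op

# N = K*K + 1 operation types: from no non-zero weights up to all K*K non-zero.
OP_TYPES = [make_op_type(n) for n in range(K * K + 1)]

def accumulate_vector_zero_skip(padded, pw, weight, y, x0, width):
    """Accumulated vector for output row y, columns x0 .. x0+width-1, skipping zero weights.

    padded: zero-padded input plane as a flat row-major list of width pw.
    """
    if len(weight) != K:
        raise ValueError("sketch assumes a K x K weight plane")
    flat = [v for row in weight for v in row]
    nonzero = [(i, v) for i, v in enumerate(flat) if v != 0]   # zero encoding
    op = OP_TYPES[len(nonzero)]                                # select the operation type
    base = y * pw + x0
    weights, registers = [], []
    for i, v in nonzero:
        a, b = divmod(i, K)                 # i is 0-based here, i.e. (index - 1)
        addr = base + pw * a + b            # offset (W+2)*a + b from the sliding region
        registers.append(padded[addr:addr + width])   # reg1, reg2, ...
        weights.append(v)
    return op(weights, registers, width)
```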

FIG. 13 is a flowchart illustrating an accumulative convolution operation process according to one embodiment. Referring to FIG. 13, a weight kernel (wᵈ) is obtained in step (1301). Here, d is the index of the output channel, a natural number from 1 to D that is initially set to 1. Each weight kernel may correspond to one output channel. For example, the weight kernel (w¹) may correspond to a first output channel, and the weight kernel (w²) may correspond to a second output channel.

In step (1302), an input plane (i_c) is obtained. In step (1303), the weight plane of the weight kernel (wᵈ) corresponding to the c-th input channel is obtained. Here, c is the index of the input channel, a natural number from 1 to C that is initially set to 1. The input planes and the weight planes each correspond to input channels. For example, the input plane (i₁) and the first weight plane of the weight kernel (wᵈ) may each correspond to the first input channel, and the input plane (i₂) and the second weight plane may each correspond to the second input channel.

In step (1306), a MAC operation is performed. For example, accumulated data may be generated by accumulating the multiplication results between at least some of the input elements in the input plane (i_c) and at least some of the weight elements in the corresponding weight plane. More specifically:

1. Input element vectors corresponding to at least some of the weight elements are extracted from the input plane (i_c).
2. Weighted input element vectors are generated from the products of those input element vectors and the weight elements.
3. The weighted input element vectors are accumulated to generate the accumulated data.

The offsets of the input element vectors may be determined from the indices of the weight elements concerned, and the input element vectors may be extracted from the input plane using those offsets.

According to one embodiment, zero-skip may be implemented through steps (1304, 1305). Zero encoding is performed in step (1304), and an operation type is selected in step (1305). Once zero encoding determines the number of non-zero weight elements, the operation type corresponding to that number is selected. The input elements corresponding to the non-zero weight elements may then be loaded into registers, for example as input element vectors.

Operations according to a predetermined process may be performed depending on the selected operation type. For example, the operations may include multiplying the non-zero weight elements by the input elements in the registers (e.g., input element vectors) and accumulating the products to generate accumulated data (e.g., an accumulated vector). Accordingly, when the accumulated data is generated, multiplications between zero weight elements and input elements can be omitted.

In step (1307), the output is accumulated; that is, the accumulated data output by the MAC operation may be accumulated. The iterations proceed as follows:

- First iteration (c=1): the input plane (i₁) and the first weight plane are obtained. First accumulated data may be generated by accumulating the multiplication results between at least some of the first input elements in the input plane (i₁) and at least some of the first weight elements in the first weight plane.
- Second iteration (c=2): the input plane (i₂) and the second weight plane are obtained. Second accumulated data may be generated in the same way from the second input elements in the input plane (i₂) and the second weight elements in the second weight plane. The first accumulated data and the second accumulated data may then be accumulated.
- C-th iteration (c=C): the output plane may be generated based on the sum of the accumulated data for all input channels.

In step (1308), c is compared with C. If c differs from C (for example, if c is smaller than C), c is incremented by 1 in step (1309), and step (1302) is performed again. If c equals C, d is compared with D in step (1310). If d differs from D (for example, if d is smaller than D), d is incremented by 1 in step (1311), and step (1301) is performed again. Through steps (1308, 1309), convolution is performed over all input channels while the output channel is fixed. Through steps (1310, 1311), convolution is performed for all output channels by changing the output channel.
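Putting the steps of FIG. 13 together, the loop order can be sketched as follows: output channel outermost, then input channel, then non-zero weight elements. This is a plain-Python illustration under the same assumptions as the earlier sketches: stride 1, "same" zero padding, nested lists, and a hypothetical function name. Padding is handled implicitly by bounds checks rather than by a padded copy. Each channel's products are added straight into the output plane, which is equivalent to summing the per-channel accumulated data.

```python
def accumulative_conv(fmap, kernels):
    """Accumulation-based convolution following the loop order of FIG. 13.

    fmap: C input planes of H x W. kernels: D weight kernels, each with C weight planes of K x K.
    """
    c_in, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    out = []
    for kernel in kernels:                        # d = 1 .. D (steps 1301, 1310, 1311)
        plane_out = [[0] * w for _ in range(h)]
        for c in range(c_in):                     # c = 1 .. C (steps 1302, 1308, 1309)
            i_c, w_c = fmap[c], kernel[c]
            k = len(w_c)
            p = (k - 1) // 2
            for idx, wv in enumerate(v for row in w_c for v in row):
                if wv == 0:                       # zero-skip (steps 1304, 1305)
                    continue
                a, b = divmod(idx, k)             # shift of this weight's reaction region
                for y in range(h):
                    for x in range(w):
                        yy, xx = y + a - p, x + b - p
                        if 0 <= yy < h and 0 <= xx < w:          # implicit zero padding
                            plane_out[y][x] += wv * i_c[yy][xx]  # MAC and accumulate (1306, 1307)
        out.append(plane_out)
    return out
```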

FIG. 14 is a diagram illustrating a data processing method for a neural network according to an embodiment. Referring to FIG. 14, in step (1410), the processing device obtains, from among the input planes of an input feature map, each corresponding to one of the input channels, a first input plane corresponding to a first input channel; obtains, from among the weight planes of a weight kernel, each corresponding to one of the input channels, a first weight plane corresponding to the first input channel; generates first accumulated data by accumulating the multiplication results between at least some of the first input elements in the first input plane and at least some of the first weight elements in the first weight plane; and generates, based on the first accumulated data, a first output plane corresponding to a first output channel from among the output planes of an output feature map, each corresponding to one of the output channels. In addition, the description given with reference to FIGS. 1 to 13 may apply to this data processing method.

FIG. 15 is a block diagram illustrating a processing device that processes data for a neural network according to an embodiment. Referring to FIG. 15, a processing device (1500) may receive input data and process a neural network operation related to the input data, such as an object recognition operation or a user authentication operation. The processing device (1500) may perform one or more of the operations described or illustrated herein in relation to neural network processing, and may provide the processing result to a user. The processing device (1500) may perform accumulative convolution while processing the neural network operation.

The processing device (1500) may include one or more processors (1510) and a memory (1520). The memory (1520) may be coupled to the processor (1510) and may store instructions executable by the processor (1510), data to be operated on by the processor (1510), or data processed by the processor (1510). The memory (1520) may include a non-transitory computer-readable medium, such as a high-speed random access memory and/or a non-volatile computer-readable storage medium (e.g., one or more disk storage devices, flash memory devices, or other non-volatile solid-state memory devices).

The processor (1510) may execute instructions for performing one or more of the operations described with reference to FIGS. 1 to 14. For example, when instructions stored in the memory (1520) are executed by the processor (1510), the processor (1510) may obtain, from among the input planes of an input feature map, each corresponding to one of the input channels, a first input plane corresponding to a first input channel; obtain, from among the weight planes of a weight kernel, each corresponding to one of the input channels, a first weight plane corresponding to the first input channel; generate first accumulated data by accumulating the multiplication results between at least some of the first input elements in the first input plane and at least some of the first weight elements in the first weight plane; and generate, based on the first accumulated data, a first output plane corresponding to a first output channel from among the output planes of an output feature map, each corresponding to one of the output channels.

FIG. 16 is a diagram illustrating an electronic device according to an embodiment. Referring to FIG. 16, an electronic device (1600) may receive input data and process a neural network operation related to the input data, such as an object recognition operation or a user authentication operation. The electronic device (1600) may perform the accumulative convolution described above while processing the neural network operation. The electronic device (1600) may include the processing device described with reference to FIGS. 1 to 15, or may perform the functions of that processing device.

The electronic device (1600) may include a processor (1610), a memory (1620), a camera (1630), a storage device (1640), an input device (1650), an output device (1660), and a network interface (1670), which may communicate with each other via a communication bus (1680).

The processor (1610) executes functions and instructions within the electronic device (1600). For example, the processor (1610) may process instructions stored in the memory (1620) or the storage device (1640). The processor (1610) may perform one or more of the operations described with reference to FIGS. 1 to 15.

The memory (1620) stores information for processing the neural network operation. The memory (1620) may include a computer-readable storage medium or a computer-readable storage device. The memory (1620) may store instructions to be executed by the processor (1610) and may store related information while software or an application is being executed by the electronic device (1600).

The camera (1630) may capture still images, video, or both. For example, the camera (1630) may capture the face region that a user presents when attempting face authentication. The camera (1630) may also provide 3D images that include depth information about objects.

The storage device (1640) includes a computer-readable storage medium or a computer-readable storage device. According to an embodiment, the storage device (1640) may store a larger amount of information than the memory (1620) and may store it for a longer period. For example, the storage device (1640) may include a magnetic hard disk, an optical disk, flash memory, a floppy disk, or any other form of non-volatile memory known in the art.

The input device (1650) may receive input from a user through traditional input methods, such as a keyboard and mouse, and through newer methods, such as touch, voice, and image input. For example, the input device (1650) may include a keyboard, a mouse, a touch screen, a microphone, or any other device that can detect user input and transmit it to the electronic device (1600).

The output device (1660) may provide the output of the electronic device (1600) to the user through a visual, auditory, or tactile channel. The output device (1660) may include, for example, a display, a touch screen, a speaker, a vibration generator, or any other device capable of providing output to the user. The network interface (1670) may communicate with external devices over a wired or wireless network.

The embodiments described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the devices, methods, and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications running on the OS. The processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, the processing device is sometimes described as a single device, but those skilled in the art will appreciate that it may include multiple processing elements and/or multiple types of processing elements. For example, the processing device may include multiple processors, or a processor and a controller. Other processing configurations, such as parallel processors, are also possible.

The software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may instruct the processing device independently or collectively. The software and/or data may be permanently or temporarily embodied in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, so as to be interpreted by the processing device or to provide instructions or data to it. The software may also be distributed over network-coupled computer systems and stored or executed in a distributed manner. The software and data may be stored on one or more computer-readable recording media.

The methods according to the embodiments may be implemented as program instructions executable by various computer means and recorded on computer-readable media. The computer-readable media may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the media may be specially designed and configured for the embodiments, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code, such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the embodiments, and vice versa.

Although the embodiments have been described with reference to a limited number of drawings, those skilled in the art may apply various technical modifications and variations based on the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from that described, and/or components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from that described, or are replaced or substituted by other components or equivalents.

Claims

A data processing method for a neural network, performed by a processor, the method comprising:
obtaining a first input plane corresponding to a first input channel from among input planes of an input feature map, each input plane corresponding to one of a plurality of input channels;
obtaining a first weight plane corresponding to the first input channel from among weight planes of a weight kernel, each weight plane corresponding to one of the input channels;
generating first accumulated data by accumulating multiplication results between at least some of first input elements in the first input plane and at least some of first weight elements in the first weight plane;
storing the first accumulated data;
obtaining a second input plane corresponding to a second input channel from among the input planes;
obtaining a second weight plane corresponding to the second input channel from among the weight planes;
generating second accumulated data by accumulating multiplication results between at least some of second input elements in the second input plane and at least some of second weight elements in the second weight plane;
loading the first accumulated data to generate a sum of the first accumulated data and the second accumulated data; and
generating, based on the sum of the first accumulated data and the second accumulated data, a first output plane corresponding to a first output channel from among output planes of an output feature map, each output plane corresponding to one of a plurality of output channels.

The method of claim 1, wherein generating the first output plane comprises:
generating the first output plane based on a sum of the accumulated data of each of the input channels, including the first accumulated data.

(Deleted)

The method of claim 1, wherein generating the first output plane comprises:
generating the first output plane based on the sum of the first accumulated data and the second accumulated data.

The method of claim 1, wherein generating the first accumulated data comprises:
extracting, from the first input plane, first input element vectors corresponding to at least some of the first weight elements;
generating first weighted input element vectors corresponding to results of multiplication between the first input element vectors and at least some of the first weight elements; and
generating the first accumulated data by accumulating the first weighted input element vectors.
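One way to read claim 5 is sketched below in Python. Instead of sliding a window, each weight element selects an input element vector (the input elements it multiplies at every output position), that vector is scaled by the weight element to form a weighted input element vector, and the weighted vectors are summed. This is an illustrative interpretation assuming stride 1, no padding, and a flattened output; `accumulate_by_weight_element` is a hypothetical name.

```python
def accumulate_by_weight_element(input_plane, weight_plane):
    """Build flattened accumulated data one weight element at a time."""
    kh, kw = len(weight_plane), len(weight_plane[0])
    oh = len(input_plane) - kh + 1
    ow = len(input_plane[0]) - kw + 1
    acc = [0] * (oh * ow)
    for i in range(kh):
        for j in range(kw):
            w = weight_plane[i][j]
            # input element vector: every input element that w[i][j] touches
            vec = [input_plane[y + i][x + j] for y in range(oh) for x in range(ow)]
            # weighted input element vector, added element-wise to acc
            acc = [a + w * v for a, v in zip(acc, vec)]
    return acc
```

Each inner step is a scalar-times-vector multiply-accumulate over a contiguous vector, which maps naturally onto vector hardware.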

The method of claim 5, wherein extracting the first input element vectors comprises:
determining offsets corresponding to the first input element vectors based on indices of at least some of the first weight elements; and
extracting the first input element vectors from the first input plane based on the offsets.
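As an illustration of claim 6, assume a row-major flattened input plane of width W. This is an assumption, since the claim does not fix a memory layout. The offset for weight element (i, j) can then be taken as i·W + j, and each output row's input element vector starts at that offset plus a multiple of W. The helper names below are hypothetical.

```python
def element_offset(weight_index, input_width):
    """Offset of weight element (i, j) in a row-major flattened input plane."""
    i, j = weight_index
    return i * input_width + j


def extract_input_vectors(flat_input, input_width, weight_index, out_h, out_w):
    """Input element vectors (one per output row) for a single weight element."""
    base = element_offset(weight_index, input_width)
    return [flat_input[base + y * input_width: base + y * input_width + out_w]
            for y in range(out_h)]
```

For a 3×3 input flattened as [1, 2, ..., 9] and a 2×2 kernel, weight element (1, 1) has offset 4, and its input element vectors are [5, 6] and [8, 9].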

The method of claim 5, wherein sizes of the first input element vectors and the first weighted input element vectors correspond to a single instruction multiple data (SIMD) operation unit.
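The sketch below illustrates what matching the vector size to a SIMD operation unit means. It processes a vector multiply-accumulate in chunks of a fixed lane count, with each chunk standing in for one SIMD instruction. The lane count of 4 (for example, four 32-bit values in a 128-bit register) and the name `simd_mac` are assumptions for illustration. Real code would use hardware intrinsics rather than Python lists.

```python
LANES = 4  # assumed SIMD width, e.g. four 32-bit lanes of a 128-bit register


def simd_mac(acc, vec, w, lanes=LANES):
    """Return acc + w * vec, computed one SIMD-width chunk at a time.

    Each chunk models one vector load, one vector multiply and one vector add.
    The vector length must be a multiple of `lanes` (pad it otherwise).
    """
    if len(acc) != len(vec) or len(vec) % lanes:
        raise ValueError("vector length must match acc and be a multiple of lanes")
    out = list(acc)
    for start in range(0, len(vec), lanes):
        chunk = vec[start:start + lanes]           # vector load
        weighted = [w * v for v in chunk]          # vector multiply
        out[start:start + lanes] = [a + p for a, p in
                                    zip(out[start:start + lanes], weighted)]  # vector add
    return out
```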

The method of claim 1, wherein, when the first accumulated data is generated, multiplication between at least some of the first input elements and zero weight elements, i.e., those of the at least some of the first weight elements whose value is 0, is omitted.
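Zero-weight skipping can be sketched by leaving out the vector multiply-accumulate for every weight element equal to 0 and counting how many are actually issued. This is a hypothetical illustration (stride 1, no padding, flattened output), not the claimed implementation; `accumulate_skip_zeros` is an illustrative name.

```python
def accumulate_skip_zeros(input_plane, weight_plane):
    """Return (flattened accumulated data, number of vector MACs performed)."""
    kh, kw = len(weight_plane), len(weight_plane[0])
    oh = len(input_plane) - kh + 1
    ow = len(input_plane[0]) - kw + 1
    acc = [0] * (oh * ow)
    macs = 0
    for i in range(kh):
        for j in range(kw):
            w = weight_plane[i][j]
            if w == 0:
                continue  # zero weight element: its multiplication is omitted
            vec = [input_plane[y + i][x + j] for y in range(oh) for x in range(ow)]
            acc = [a + w * v for a, v in zip(acc, vec)]
            macs += 1
    return acc, macs
```

With the weight plane [[1, 0], [0, 1]] on the 3×3 input [[1, 2, 3], [4, 5, 6], [7, 8, 9]], only 2 of the 4 vector operations are performed, and the result is [6, 8, 12, 14].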

The method of claim 1, further comprising:
determining a number of non-zero weight elements, i.e., first weight elements whose value is not 0; and
selecting, from among operation types each set to perform an operation in a predetermined manner, an operation type corresponding to the determined number of non-zero weight elements.

The method of claim 9, wherein generating the first accumulated data comprises:
generating the first accumulated data by accumulating, based on the selected operation type, multiplication results between at least some of the first input elements and the non-zero weight elements corresponding to at least some of the first weight elements.

The method of claim 9, wherein generating the first accumulated data comprises:
extracting, from the first input plane, first input element vectors corresponding to the non-zero weight elements based on indices of the non-zero weight elements;
generating first weighted input element vectors corresponding to results of multiplication between the first input element vectors and the non-zero weight elements corresponding to at least some of the first weight elements; and
generating the first accumulated data by accumulating the first weighted input element vectors.
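In claim 11, the indices of the non-zero weight elements drive vector extraction. A possible sketch precomputes one (offset, weight) pair per non-zero element for a row-major flattened input plane and gathers only those vectors. The flattened layout and the function names are assumptions for illustration.

```python
def nonzero_offsets(weight_plane, input_width):
    """(offset, weight) for each non-zero weight element, row-major layout."""
    return [(i * input_width + j, w)
            for i, row in enumerate(weight_plane)
            for j, w in enumerate(row) if w != 0]


def accumulate_nonzero(flat_input, input_height, input_width, weight_plane):
    """Flattened accumulated data built only from non-zero weight elements."""
    kh, kw = len(weight_plane), len(weight_plane[0])
    oh, ow = input_height - kh + 1, input_width - kw + 1
    acc = [0] * (oh * ow)
    for base, w in nonzero_offsets(weight_plane, input_width):
        # input element vector selected by this non-zero weight's index
        vec = [flat_input[base + y * input_width + x]
               for y in range(oh) for x in range(ow)]
        # add the weighted input element vector into the accumulated data
        acc = [a + w * v for a, v in zip(acc, vec)]
    return acc
```

Because the offset list depends only on the weights, it can be computed once per weight plane and reused for every input plane that the plane is applied to.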

A computer program stored on a medium for executing the method of any one of claims 1, 2, and 4 to 11 in combination with hardware.

A data processing device for a neural network, the device comprising:
a processor; and
a memory containing instructions executable by the processor,
wherein, when the instructions are executed by the processor, the processor is configured to: obtain a first input plane corresponding to a first input channel from among input planes of an input feature map, each input plane corresponding to one of a plurality of input channels; obtain a first weight plane corresponding to the first input channel from among weight planes of a weight kernel, each weight plane corresponding to one of the input channels; generate first accumulated data by accumulating multiplication results between at least some of first input elements in the first input plane and at least some of first weight elements in the first weight plane; store the first accumulated data;
obtain a second input plane corresponding to a second input channel from among the input planes; obtain a second weight plane corresponding to the second input channel from among the weight planes; generate second accumulated data by accumulating multiplication results between at least some of second input elements in the second input plane and at least some of second weight elements in the second weight plane; load the first accumulated data to generate a sum of the first accumulated data and the second accumulated data; and
generate, based on the sum of the first accumulated data and the second accumulated data, a first output plane corresponding to a first output channel from among output planes of an output feature map, each output plane corresponding to one of a plurality of output channels.

The device of claim 13, wherein the processor is configured to generate the first output plane based on a sum of the accumulated data of each of the input channels, including the first accumulated data.

(Deleted)

The device of claim 13, wherein the processor is configured to generate the first output plane based on the sum of the first accumulated data and the second accumulated data.

The device of claim 13, wherein the processor is configured to extract, from the first input plane, first input element vectors corresponding to at least some of the first weight elements, generate first weighted input element vectors corresponding to results of multiplication between the first input element vectors and at least some of the first weight elements, and generate the first accumulated data by accumulating the first weighted input element vectors.

The device of claim 17, wherein the processor is configured to determine offsets corresponding to the first input element vectors based on indices of at least some of the first weight elements, and to extract the first input element vectors from the first input plane based on the offsets.

The device of claim 17, wherein sizes of the first input element vectors and the first weighted input element vectors correspond to a single instruction multiple data (SIMD) operation unit.

The device of claim 13, wherein, when the first accumulated data is generated, multiplication between at least some of the first input elements and zero weight elements, i.e., those of the at least some of the first weight elements whose value is 0, is omitted.

The device of claim 13, wherein the processor is configured to determine a number of non-zero weight elements, i.e., first weight elements whose value is not 0, and to select, from among operation types each set to perform an operation in a predetermined manner, an operation type corresponding to the determined number of non-zero weight elements.

The device of claim 21, wherein the processor is configured to generate the first accumulated data by accumulating, based on the selected operation type, multiplication results between at least some of the first input elements and the non-zero weight elements corresponding to at least some of the first weight elements.

The device of claim 21, wherein the processor is configured to extract, from the first input plane, first input element vectors corresponding to the non-zero weight elements based on indices of the non-zero weight elements, generate first weighted input element vectors corresponding to results of multiplication between the first input element vectors and the non-zero weight elements corresponding to at least some of the first weight elements, and generate the first accumulated data by accumulating the first weighted input element vectors.

(Deleted)