WO2024143611A1 - Efficient deep learning operation method and device - Google Patents
- Publication number
- WO2024143611A1 (PCT/KR2022/021578)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- deep learning
- output
- pes
- weights
- adder tree
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/15—Correlation function computation including computation of convolution operations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/15—Correlation function computation including computation of convolution operations
- G06F17/153—Multidimensional correlation or convolution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/544—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/544—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
- G06F7/5443—Sum of products
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Software Systems (AREA)
- Biophysics (AREA)
- Health & Medical Sciences (AREA)
- Pure & Applied Mathematics (AREA)
- Mathematical Analysis (AREA)
- Computational Mathematics (AREA)
- Mathematical Optimization (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Algebra (AREA)
- Databases & Information Systems (AREA)
- Neurology (AREA)
- Mobile Radio Communication Systems (AREA)
- User Interface Of Digital Computer (AREA)
- Feedback Control In General (AREA)
- Complex Calculations (AREA)
Abstract
An efficient deep learning operation method and device are provided. The deep learning operation device according to an embodiment of the present invention comprises PEs, each of which outputs the result of multiplying an input and a weight for a convolution operation, and an adder tree that sums and accumulates multiplication results output from the PEs. Therefore, a deep learning operation for a complex deep learning network can be performed with the lowered complexity of deep learning accelerator hardware, so that reduced hardware size and lower power consumption can be both achieved.
Description
The present invention relates to deep learning computation and, more specifically, to a hardware design that enables high-speed, low-power deep learning computation by modifying the operator used for computation acceleration in deep learning inference and training.
A deep learning accelerator aims to receive input data (the input feature map) and convolution parameters (weights), perform deep learning operations quickly, and generate output data (the output feature map).
The convolution operation, the core of deep learning computation, is performed by multiple PEs (Processing Elements) through MAC (Multiplier & Adder) operations. However, increasing the number of PEs raises the complexity of the deep learning accelerator.
In particular, as deep learning networks grow larger and structurally more complex, the number of PEs must increase further, which aggravates this problem.
The present invention was devised to solve the above problem. Its object is to provide a deep learning accelerator whose hardware structure reduces complexity when performing deep learning operations for complex deep learning networks.
To achieve the above object, a deep learning operation device according to an embodiment of the present invention includes: PEs that output the result of multiplying an input and a weight for a convolution operation; and an Adder Tree that sums and accumulates the multiplication results output from the PEs.
The PEs may output the multiplication results directly, without calculating a partial sum of the multiplication results. The PEs may perform multiplication operations in the channel progression direction.
The Adder Tree may sum and accumulate the multiplication results on a per-pixel basis. An accumulator for accumulating the summed multiplication results may be located at the final stage of the Adder Tree. The Adder Tree may include as many adders as there are output channels.
The deep learning operation device according to an embodiment of the present invention may further include the steps of: normalizing the output of the Adder Tree; applying an activation function to the normalization result; and applying a Maxpool operation to the activation values.
A deep learning operation method according to another embodiment of the present invention includes: outputting, by PEs, the result of multiplying an input and a weight for a convolution operation; and summing and accumulating, by an Adder Tree, the multiplication results output from the PEs.
A deep learning operation device according to yet another embodiment of the present invention includes: an RDMA that reads input data and weights stored in an external memory; an input buffer that stores the input data and weights read by the RDMA; an operator that performs a convolution operation using the input data and weights stored in the input buffer; an output buffer that stores the output data of the operator; and a WDMA that reads the output data stored in the output buffer and stores it in the external memory. The operator includes: PEs that output the result of multiplying the input data and a weight for the convolution operation; and an Adder Tree that sums and accumulates the multiplication results output from the PEs.
A deep learning operation method according to yet another embodiment of the present invention includes: reading input data and weights stored in an external memory; storing the read input data and weights; an operation step of performing a convolution operation using the stored input data and weights; storing the output data of the operation step; and reading the stored output data and storing it in the external memory. The operation step includes: outputting, by PEs (Processing Elements), the result of multiplying the input data and a weight for the convolution operation; and summing and accumulating, by an Adder Tree, the multiplication results output from the PEs.
As described above, according to embodiments of the present invention, deep learning operations for complex deep learning networks can be performed while lowering the complexity of the deep learning accelerator hardware, which reduces both hardware size and power consumption.
FIG. 1 is a diagram showing a deep learning accelerator to which the present invention can be applied;
FIG. 2 is a diagram showing the detailed structure of the operator shown in FIG. 1;
FIG. 3 shows the structure of the existing PE and Adder Tree for deep learning inference;
FIG. 4 shows the structure of the PE and Adder Tree according to an embodiment of the present invention for deep learning inference;
FIG. 5 shows the structure of the existing PE and Adder Tree for deep learning training;
FIG. 6 shows the structure of the PE and Adder Tree according to an embodiment of the present invention for deep learning training;
FIG. 7 is a diagram showing the structure of the final PE tile presented in an embodiment of the present invention.
Hereinafter, the present invention is described in more detail with reference to the drawings.
An embodiment of the present invention presents an efficient deep learning operation method and device, and a deep learning accelerator to which they are applied. It is a hardware design technique that enables high-speed, low-power computation by modifying the operator used for computation acceleration in deep learning inference and training.
Specifically, the accumulators provided in the PEs that perform the convolution operation are moved to the Adder Tree, which reduces the total number of accumulators in the deep learning accelerator.
FIG. 1 is a diagram showing a deep learning accelerator to which the present invention can be applied. The illustrated deep learning accelerator includes an RDMA (Read Direct Memory Access) 110, an input buffer 120, an operator 130, an output buffer 140, and a WDMA (Write Direct Memory Access) 150.
The deep learning accelerator receives data from the external memory 10, performs deep learning operations, and outputs the operation results to the external memory 10 for storage.
The data received from the external memory 10 are the IFmap (Input Feature map: feature data of the input image) and the Weight (convolution parameters of the deep learning network). The deep learning operation result output to the external memory 10 is the OFmap (Output Feature map).
Accordingly, the RDMA 110 reads the IFmap and Weight stored in the external memory 10 and stores them in the input buffer 120, and the WDMA 150 reads the OFmap stored in the output buffer 140 and stores it in the external memory 10.
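The data path described above can be sketched in software as follows. This is only an illustrative model under stated assumptions: `ExternalMemory`, `run_accelerator`, and `dot_operator` are hypothetical names, the buffers are plain Python containers, and a single dot product stands in for the convolution performed by the operator 130.

```python
from dataclasses import dataclass, field


@dataclass
class ExternalMemory:
    """Hypothetical stand-in for the external memory (10)."""
    ifmap: list
    weight: list
    ofmap: list = field(default_factory=list)


def run_accelerator(mem, operator):
    # RDMA (110): read IFmap and Weight from external memory into the input buffer (120).
    input_buffer = {"ifmap": list(mem.ifmap), "weight": list(mem.weight)}
    # Operator (130): compute on the buffered data; the result fills the output buffer (140).
    output_buffer = operator(input_buffer["ifmap"], input_buffer["weight"])
    # WDMA (150): write the OFmap from the output buffer back to external memory.
    mem.ofmap = list(output_buffer)
    return mem.ofmap


def dot_operator(ifmap, weight):
    """Placeholder operator (assumption): one dot product instead of a full convolution."""
    return [sum(x * w for x, w in zip(ifmap, weight))]


mem = ExternalMemory(ifmap=[1, 2, 3], weight=[4, 5, 6])
result = run_accelerator(mem, dot_operator)
```

The sketch only fixes the order of the stages (read, buffer, compute, buffer, write); real DMA engines and buffers would operate on tiles and overlap transfers with computation, which the text does not detail.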
The operator 130 performs deep learning operations on the data stored in the input buffer 120. FIG. 2 is a block diagram showing the detailed structure of the operator 130 shown in FIG. 1.
As shown, the operator 130 includes a convolution operation module 131 required for deep learning operations, an Adder Tree module 132 that sums the convolution results, a Batch Normalization module 133 that normalizes the summed results, an Activation module 134 that applies an activation function to the normalized results, and a Maxpool module 135 that applies a Maxpool operation to the activation results.
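The module chain of the operator 130 can be illustrated with a minimal one-dimensional sketch. Several details here are assumptions that the text does not specify: ReLU is used as the activation function, batch normalization uses scalar `mean`, `var`, `gamma`, and `beta` values, and the Maxpool window is 2. All function names are hypothetical.

```python
import math


def adder_tree(products):
    # Adder Tree module (132): sum the products coming from the convolution module (131).
    return sum(products)


def batch_norm(x, mean, var, gamma=1.0, beta=0.0, eps=1e-5):
    # Batch Normalization module (133), with assumed scalar statistics.
    return gamma * (x - mean) / math.sqrt(var + eps) + beta


def relu(x):
    # Activation module (134); ReLU is an assumption, the text names no specific function.
    return max(0.0, x)


def maxpool(values, window=2):
    # Maxpool module (135): maximum over non-overlapping windows.
    return [max(values[i:i + window]) for i in range(0, len(values), window)]


def operator_pipeline(product_groups, mean, var):
    sums = [adder_tree(p) for p in product_groups]
    normed = [batch_norm(s, mean, var) for s in sums]
    activated = [relu(n) for n in normed]
    return maxpool(activated)


# Each inner list holds the products that contribute to one output value.
out = operator_pipeline([[1.0, 2.0], [-3.0, -1.0], [0.5, 0.5], [4.0, 0.0]], mean=0.0, var=1.0)
```

With these inputs the four sums are 3, -4, 1, and 4, so after normalization, ReLU, and pooling the result holds two values close to 3 and 4.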
The convolution operation module 131 consists of multiple PEs (Processing Elements). As shown in FIG. 3, the existing PE 10 performs a MAC (Multiplier & Adder) operation, so as the number of PEs performing this operation grows (as in large deep learning networks), hardware complexity inevitably increases linearly.
The embodiment of the present invention proposes removing the accumulator 15 from the PE 10 and adding it to the Adder Tree 20. Because the PE tile of the operator 130 contains far more PEs than Adder Trees, this limits the growth in hardware complexity.
FIG. 4 is a diagram showing the structure of the PE and Adder Tree (AT) for deep learning inference presented in an embodiment of the present invention. As shown, the accumulator (Acc) is removed from the PE, and an accumulator (Acc) is added to the Adder Tree (AT) instead.
Specifically, an accumulator (Acc) for accumulating the summed multiplication results is added at the final stage of the Adder Tree (AT).
Because the PE presented in the embodiment has no accumulator (Acc), it does not calculate a partial sum of the products of inputs and weights. It outputs each multiplication result directly to the Adder Tree (AT).
The Adder Tree (AT) sums the multiplication results output from the PEs using adders and accumulates the sum in the accumulator (Acc).
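The structure described above can be modeled in a few lines. This is a behavioral sketch, not the patented circuit: `pe_multiply` and `AdderTreeWithAcc` are hypothetical names, the pairwise reduction is one assumed way to arrange the adders, and each call to `step` stands for one cycle in which every PE delivers one product.

```python
def pe_multiply(x, w):
    # Proposed PE: multiplier only, no internal accumulator.
    return x * w


class AdderTreeWithAcc:
    """Adder tree with a single accumulator at its final stage."""

    def __init__(self):
        self.acc = 0

    def step(self, products):
        # Reduce the products pairwise, level by level, as a tree of adders would.
        level = list(products)
        while len(level) > 1:
            nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
            if len(level) % 2:
                nxt.append(level[-1])  # an odd element passes through to the next level
            level = nxt
        # Final stage: accumulate the tree output.
        self.acc += level[0] if level else 0
        return self.acc


# Two PEs, two cycles; each tuple is the (input, weight) pair given to one PE.
cycles = [[(1, 1), (3, 2)], [(2, 1), (4, 2)]]
tree = AdderTreeWithAcc()
history = [tree.step([pe_multiply(x, w) for x, w in cycle]) for cycle in cycles]
```

The first cycle contributes 1 + 6 = 7 and the second 2 + 8 = 10, so the accumulator ends at 17. Only one accumulator is needed per tree, regardless of how many PEs feed it.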
Whereas the existing PE shown in FIG. 3 performs its operations in the kernel progression direction, the PE presented in the embodiment shown in FIG. 4 performs its operations in the channel progression direction. Because the PE operation proceeds in the channel progression direction, the Adder Tree (AT) sums and accumulates the multiplication results on a per-pixel basis.
This is because the Adder Tree (AT) operation is performed before the accumulation in the accumulator (Acc). To process multiple channels, the existing Adder Tree 20 required as many adders as there are input channels, whereas the Adder Tree (AT) of the embodiment requires as many adders as there are output channels.
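The change in operation order can be checked with a small example for a single output pixel. The mapping assumed here (one PE per input channel, `x[c][k]` and `w[c][k]` holding the input and weight for channel `c` and kernel tap `k`) and both function names are illustrative assumptions, not details taken from the figures.

```python
def conv_pixel_accumulate_first(x, w):
    """Existing order: each PE accumulates its own partial sum over the kernel taps,
    then the adder tree sums the partial sums."""
    partial_sums = [sum(x[c][k] * w[c][k] for k in range(len(x[c]))) for c in range(len(x))]
    return sum(partial_sums)


def conv_pixel_tree_first(x, w):
    """Proposed order: the adder tree first sums the products from all PEs,
    then one accumulator at the end of the tree accumulates over the steps."""
    acc = 0
    for k in range(len(x[0])):
        acc += sum(x[c][k] * w[c][k] for c in range(len(x)))
    return acc


x = [[1, 2, 3], [4, 5, 6]]    # 2 channels x 3 kernel taps
w = [[1, 0, -1], [2, 1, 0]]
a = conv_pixel_accumulate_first(x, w)
b = conv_pixel_tree_first(x, w)
```

Both orders sum the same set of products, so for integer or fixed-point data the results are identical. With floating-point data the two orders may differ slightly in rounding.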
FIG. 5 shows the operation of the existing PE 10 during training. When processing weight gradients during training, unlike in inference, data is computed without passing through the Adder Tree 20, so it is common to use the accumulator 15 inside the PE 10.
However, because PEs are usually shared between inference and training acceleration, the embodiment of the present invention, as shown in FIG. 6, removes the accumulator from the PE and uses as many adders as there are output channels in the Adder Tree (AT), with one additional adder.
The resulting PE tile structure presented in the embodiment is shown in FIG. 7. As shown, the accumulator is removed from the PE, one adder is added to the Adder Tree (AT), and an accumulator (Acc) for accumulating the summed multiplication results is added at the final stage of the tree.
Assuming the structure shown in FIG. 7, with 4096 PEs, 4096 accumulators are removed from inside the PEs, and a total of 544 adders and accumulators are added to the Adder Trees. Hardware complexity therefore decreases, which reduces hardware area and enables low-power operation.
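The figures quoted above can be put side by side with simple arithmetic. The sketch below uses only the two numbers stated in the text; treating every removed or added component as one "unit" is a simplifying assumption, since an adder and an accumulator do not necessarily cost the same area or power.

```python
num_pes = 4096
removed_accumulators = num_pes   # one accumulator removed per PE, per the text
added_units = 544                # adders and accumulators added to the Adder Trees, per the text

# Net number of components removed, counting each component as one unit (assumption).
net_units_removed = removed_accumulators - added_units

# Added components as a fraction of the removed accumulators.
added_fraction = added_units / removed_accumulators
```

Under this rough count, 3552 components are removed on balance, and the added components amount to about 13% of the accumulators that were taken out of the PEs.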
Preferred embodiments have been described in detail above for a hardware design that enables high-speed, low-power deep learning computation by modifying the operator used for computation acceleration in deep learning inference and training.
As a flexible deep learning accelerator, it is applicable to deep learning networks and layers of various structures, and it is a method that can sharply reduce the number of high-complexity deep learning operators.
This enables low-power operation of the deep learning accelerator, and the method is applicable to environments that require low power, such as edge devices, mobile devices, and servers.
Furthermore, embodiments of the present invention are expected to slow the continuous growth in the hardware size required for inference and training.
In addition, although preferred embodiments of the present invention have been shown and described above, the present invention is not limited to the specific embodiments described. Various modifications can be made by those of ordinary skill in the art to which the invention pertains without departing from the gist of the invention as claimed in the claims, and such modifications should not be understood separately from the technical idea or outlook of the present invention.
Claims (10)
- A deep learning operation device comprising: PEs (Processing Elements) that output the result of multiplying an input and a weight for a convolution operation; and an Adder Tree that sums and accumulates the multiplication results output from the PEs.
- The deep learning operation device of claim 1, wherein the PEs output the multiplication results directly, without calculating a partial sum of the multiplication results.
- The deep learning operation device of claim 1, wherein the PEs perform multiplication operations in the channel progression direction.
- The deep learning operation device of claim 1, wherein the Adder Tree sums and accumulates the multiplication results on a per-pixel basis.
- The deep learning operation device of claim 4, wherein an accumulator for accumulating the summed multiplication results is located at the final stage of the Adder Tree.
- The deep learning operation device of claim 5, wherein the Adder Tree includes as many adders as there are output channels.
- The deep learning operation device of claim 1, further comprising the steps of: normalizing the output of the Adder Tree; applying an activation function to the normalization result; and applying a Maxpool operation to the activation values.
- A deep learning operation method comprising: outputting, by PEs (Processing Elements), the result of multiplying an input and a weight for a convolution operation; and summing and accumulating, by an Adder Tree, the multiplication results output from the PEs.
- A deep learning accelerator comprising: an RDMA (Read Direct Memory Access) that reads input data and weights stored in an external memory; an input buffer that stores the input data and weights read by the RDMA; an operator that performs a convolution operation using the input data and weights stored in the input buffer; an output buffer that stores the output data of the operator; and a WDMA (Write Direct Memory Access) that reads the output data stored in the output buffer and stores it in the external memory, wherein the operator comprises: PEs (Processing Elements) that output the result of multiplying the input data and a weight for the convolution operation; and an Adder Tree that sums and accumulates the multiplication results output from the PEs.
- A deep learning acceleration method comprising: reading input data and weights stored in an external memory; storing the read input data and weights; an operation step of performing a convolution operation using the stored input data and weights; storing the output data of the operation step; and reading the stored output data and storing it in the external memory, wherein the operation step comprises: outputting, by PEs (Processing Elements), the result of multiplying the input data and a weight for the convolution operation; and summing and accumulating, by an Adder Tree, the multiplication results output from the PEs.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020220188229A KR20240105809A (en) | 2022-12-29 | 2022-12-29 | Efficient deep learning computation method and apparatus |
KR10-2022-0188229 | 2022-12-29 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024143611A1 (en) | 2024-07-04 |
Family
ID=91717903
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/KR2022/021578 WO2024143611A1 (en) | 2022-12-29 | 2022-12-29 | Efficient deep learning operation method and device |
Country Status (2)
Country | Link |
---|---|
KR (1) | KR20240105809A (en) |
WO (1) | WO2024143611A1 (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20180034853A (en) * | 2016-09-28 | 2018-04-05 | 에스케이하이닉스 주식회사 | Apparatus and method test operating of convolutional neural network |
KR20190030564A (en) * | 2017-09-14 | 2019-03-22 | 한국전자통신연구원 | Neural network accelerator including bidirectional processing element array |
US10698657B2 (en) * | 2016-08-12 | 2020-06-30 | Xilinx, Inc. | Hardware accelerator for compressed RNN on FPGA |
KR20210074707A (en) * | 2019-12-12 | 2021-06-22 | 한국전자기술연구원 | Processing Device and Method with High Throughput for Neural Network Processor |
KR20220143333A (en) * | 2021-04-16 | 2022-10-25 | 포항공과대학교 산학협력단 | Mobilenet hardware accelator with distributed sram architecture and channel stationary data flow desigh method thereof |
Also Published As
Publication number | Publication date |
---|---|
KR20240105809A (en) | 2024-07-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20200320369A1 (en) | Image recognition method, apparatus, electronic device and storage medium | |
CN111859775B (en) | Software and hardware collaborative design for accelerating deep learning inference | |
CN110929865A (en) | Network quantification method, service processing method and related product | |
WO2023065983A1 (en) | Computing apparatus, neural network processing device, chip, and data processing method | |
CN111814957B (en) | Neural network operation method and related equipment | |
WO2024143611A1 (en) | Efficient deep learning operation method and device | |
US5768167A (en) | Two-dimensional discrete cosine transformation circuit | |
CN116911366A (en) | Computing system neural network optimization method and device | |
CN116450086B (en) | Chip comprising multiply-accumulator, terminal and control method | |
WO2022145713A1 (en) | Method and system for lightweighting artificial neural network model, and non-transitory computer-readable recording medium | |
WO2022260392A1 (en) | Method and system for generating image processing artificial neural network model operating in terminal | |
WO2021020848A2 (en) | Matrix operator and matrix operation method for artificial neural network | |
CN112748898B (en) | Complex vector computing device and computing method | |
WO2023120788A1 (en) | Data processing system and method capable of snn/cnn simultaneous drive | |
WO2023128024A1 (en) | Method and system for quantizing deep-learning network | |
WO2023085442A1 (en) | High-accuracy deep learning computing device | |
WO2024135862A1 (en) | Data processing and manipulation device supporting unstructured data processing | |
WO2024135860A1 (en) | Data pruning method for lightweight deep-learning hardware device | |
WO2024090600A1 (en) | Deep learning model training method and deep learning computation apparatus applied with same | |
WO2021107170A1 (en) | Low-power deep learning accelerator | |
WO2021049829A1 (en) | Method, system, and non-transitory computer-readable recording medium for performing artificial neural network operation | |
WO2024135870A1 (en) | Image recognition device performing input unit network quantization method for efficient object detection | |
WO2023214608A1 (en) | Quantum circuit simulation hardware | |
WO2022107927A1 (en) | Deep learning apparatus enabling rapid post-processing | |
WO2024091106A1 (en) | Method and system for selecting an artificial intelligence (ai) model in neural architecture search (nas) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 22970250; Country of ref document: EP; Kind code of ref document: A1 |