
WO2024143611A1 - Efficient deep learning operation method and device - Google Patents


Info

Publication number
WO2024143611A1
Authority
WO
WIPO (PCT)
Prior art keywords
deep learning
output
PEs
weights
adder tree
Prior art date
Application number
PCT/KR2022/021578
Other languages
French (fr)
Korean (ko)
Inventor
이상설
이은총
김경호
Original Assignee
Korea Electronics Technology Institute (한국전자기술연구원)
Priority date
Filing date
Publication date
Application filed by Korea Electronics Technology Institute (한국전자기술연구원)
Publication of WO2024143611A1 publication Critical patent/WO2024143611A1/en

Classifications

    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
    • G06F17/15 Correlation function computation including computation of convolution operations
    • G06F17/153 Multidimensional correlation or convolution
    • G06F7/544 Methods or arrangements for performing computations using exclusively denominational number representation, for evaluating functions by calculation
    • G06F7/5443 Sum of products
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/048 Activation functions

Abstract

An efficient deep learning operation method and device are provided. The deep learning operation device according to an embodiment of the present invention comprises PEs, each of which outputs the result of multiplying an input by a weight for a convolution operation, and an adder tree that sums and accumulates the multiplication results output from the PEs. Deep learning operations for complex deep learning networks can therefore be performed with reduced deep learning accelerator hardware complexity, so that both a smaller hardware size and lower power consumption can be achieved.

Description

Efficient deep learning operation method and device
The present invention relates to deep learning computation, and more specifically to a hardware design capable of high-speed, low-power deep learning computation by modifying the operator used to accelerate deep learning inference and training.
A deep learning accelerator aims to receive input data (an input feature map) and convolution parameters (weights) and to perform the deep learning computation quickly, producing output data (an output feature map).
Convolution, the core of deep learning computation, is performed by a large number of PEs (Processing Elements) through MAC (Multiplier & Adder) operations. However, increasing the number of PEs increases the complexity of the deep learning accelerator.
In particular, as deep learning networks grow in size and their structures become more complex, the number of PEs inevitably increases, further aggravating this problem.
The present invention was conceived to solve the above problems, and its object is to provide a deep learning accelerator with a hardware structure that lowers the complexity of performing deep learning computations for complex deep learning networks.
A deep learning computing device according to an embodiment of the present invention for achieving the above object includes PEs that output the result of multiplying an input by a weight for a convolution operation, and an Adder Tree that sums and accumulates the multiplication results output from the PEs.
The PEs may output the multiplication results directly, without computing a partial sum of the multiplication results. The PEs may perform the multiplications in the channel direction.
The Adder Tree may sum and accumulate the multiplication results on a per-pixel basis. An accumulator for accumulating the summed multiplication results may be located at the final stage of the Adder Tree. The Adder Tree may include as many adders as the number of output channels.
The deep learning computing device according to an embodiment of the present invention may further include: normalizing the output of the Adder Tree; applying an activation function to the normalization result; and applying a Maxpool operation to the activation values.
Meanwhile, a deep learning computation method according to another embodiment of the present invention includes: the PEs outputting the result of multiplying an input by a weight for a convolution operation; and the Adder Tree summing and accumulating the multiplication results output from the PEs.
Meanwhile, a deep learning computing device according to yet another embodiment of the present invention includes: an RDMA that reads input data and weights stored in an external memory; an input buffer in which the input data and weights read by the RDMA are stored; an operator that performs a convolution operation using the input data and weights stored in the input buffer; an output buffer in which the output data of the operator is stored; and a WDMA that reads the output data stored in the output buffer and stores it in the external memory, wherein the operator includes PEs that output the result of multiplying the input data by the weights for the convolution operation, and an Adder Tree that sums and accumulates the multiplication results output from the PEs.
Meanwhile, a deep learning computation method according to yet another embodiment of the present invention includes: reading input data and weights stored in an external memory; storing the read input data and weights; an operation step of performing a convolution operation using the stored input data and weights; storing the output data of the operation step; and reading the stored output data and storing it in the external memory, wherein the operation step includes the PEs (Processing Elements) outputting the result of multiplying the input data by the weights for the convolution operation, and the Adder Tree summing and accumulating the multiplication results output from the PEs.
As described above, according to the embodiments of the present invention, the complexity of the deep learning accelerator hardware can be lowered while still performing deep learning computations for complex deep learning networks, so that both the hardware size and the power consumption can be reduced.
Figure 1 is a diagram showing a deep learning accelerator to which the present invention can be applied;
Figure 2 is a diagram showing the detailed structure of the operator shown in Figure 1;
Figure 3 shows the structure of a conventional PE and Adder Tree for deep learning inference;
Figure 4 shows the structure of a PE and Adder Tree according to an embodiment of the present invention for deep learning inference;
Figure 5 shows the structure of a conventional PE and Adder Tree for deep learning training;
Figure 6 shows the structure of a PE and Adder Tree according to an embodiment of the present invention for deep learning training;
Figure 7 is a diagram showing the structure of the final PE tile presented in an embodiment of the present invention.
Hereinafter, the present invention will be described in more detail with reference to the drawings.
An embodiment of the present invention presents an efficient deep learning computation method and device, and a deep learning accelerator applying them. It is a hardware design technique that enables high-speed, low-power computation by modifying the operator used to accelerate deep learning inference and training.
Specifically, by moving the accumulators provided in the PEs that perform the convolution operations into the Adder Tree, the total number of accumulators in the deep learning accelerator is reduced.
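The size of this saving follows from a simple count (the symbols $P$ and $T$ below are introduced only for this illustration and are not taken from the embodiment): if the PE tile contains $P$ PEs, each of which conventionally holds its own accumulator, and their products are gathered by $T$ adder trees with $P \gg T$, then moving the accumulators out of the PEs and into the roots of the trees reduces the number of accumulators from $P$ to $T$:

$$\text{accumulators removed} = P - T, \qquad P \gg T .$$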
Figure 1 is a diagram showing a deep learning accelerator to which the present invention can be applied. The illustrated deep learning accelerator comprises an RDMA (Read Direct Memory Access) 110, an input buffer 120, an operator 130, an output buffer 140, and a WDMA (Write Direct Memory Access) 150.
The deep learning accelerator receives data from the external memory 10, performs the deep learning computation, and outputs the computation result to the external memory 10 for storage.
The data received from the external memory 10 are the IFmap (Input Feature map: the feature data of the input image) and the Weights (the convolution parameters of the deep learning network), and the deep learning computation result output to the external memory 10 is the OFmap (Output Feature map).
Accordingly, the RDMA 110 reads the IFmap and Weights stored in the external memory 10 and stores them in the input buffer 120, and the WDMA 150 reads the OFmap stored in the output buffer 140 and stores it in the external memory 10.
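As an aid to reading Figure 1, the following is a minimal functional sketch of this top-level data flow in Python (a behavioural illustration only; the names rdma_read, wdma_write, accelerator_step and the dictionary-based external memory are placeholders chosen for this sketch, not identifiers from the patent). The operator itself is sketched separately after the description of Figure 2.

```python
import numpy as np

def rdma_read(external_memory, key):
    # RDMA 110: copy a tensor from external memory into the on-chip input buffer.
    return np.array(external_memory[key], copy=True)

def wdma_write(external_memory, key, data):
    # WDMA 150: copy the output buffer contents back to external memory.
    external_memory[key] = np.array(data, copy=True)

def accelerator_step(external_memory, operator_compute):
    # Functional model of Figure 1: read IFmap and Weights, compute, write OFmap.
    input_buffer = {
        "ifmap": rdma_read(external_memory, "ifmap"),    # input buffer 120
        "weight": rdma_read(external_memory, "weight"),
    }
    # operator 130 consumes the input buffer and fills the output buffer 140
    output_buffer = operator_compute(input_buffer["ifmap"], input_buffer["weight"])
    wdma_write(external_memory, "ofmap", output_buffer)  # back to external memory 10
```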
The operator 130 performs the deep learning computation on the data stored in the input buffer 120. Figure 2 is a block diagram showing the detailed structure of the operator 130 shown in Figure 1.
As shown, the operator 130 comprises a convolution operation module 131 required for the deep learning computation, an Adder Tree module 132 that sums the convolution results, a Batch Normalization module 133 that normalizes the summed results, an Activation module 134 that applies an activation function to the normalized results, and a Maxpool module 135 that applies a Maxpool operation to the activation results.
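A corresponding behavioural sketch of the operator 130 is shown below (a reference model only, assuming a stride-1 "valid" convolution, a ReLU activation and 2x2 max pooling; the batch-normalization parameters gamma, beta, mean, var and the function name operator_compute are placeholders for this illustration, not values from the patent). It can be passed directly to the accelerator_step sketch given above.

```python
import numpy as np

def operator_compute(ifmap, weight, gamma=1.0, beta=0.0, mean=0.0, var=1.0, eps=1e-5):
    # ifmap: (C_in, H, W); weight: (C_out, C_in, K, K); stride-1 "valid" convolution assumed.
    _, h, w = ifmap.shape
    c_out, _, k, _ = weight.shape
    conv = np.zeros((c_out, h - k + 1, w - k + 1))
    for oc in range(c_out):
        for y in range(h - k + 1):
            for x in range(w - k + 1):
                products = ifmap[:, y:y + k, x:x + k] * weight[oc]  # 131: PE multiplications
                conv[oc, y, x] = products.sum()                     # 132: Adder Tree summation
    bn = gamma * (conv - mean) / np.sqrt(var + eps) + beta          # 133: Batch Normalization
    act = np.maximum(bn, 0.0)                                       # 134: Activation (ReLU assumed)
    c, hh, ww = act.shape                                           # 135: 2x2 Maxpool
    pooled = act[:, :hh - hh % 2, :ww - ww % 2].reshape(c, hh // 2, 2, ww // 2, 2).max(axis=(2, 4))
    return pooled
```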
The convolution operation module 131 is composed of a large number of PEs (Processing Elements). As shown in Figure 3, a conventional PE 10 performs MAC (Multiplier & Adder) operations, and as the number of PEs performing these operations increases (as it must for large deep learning networks), the hardware complexity inevitably grows linearly.
The embodiment of the present invention proposes removing the accumulator 15 from the PE 10 and adding it to the Adder Tree 20 instead. Since the number of PEs in the PE tile of the operator 130 is far greater than the number of Adder Trees, this solution limits the growth in hardware complexity.
Figure 4 is a diagram showing the structure of the PE and Adder Tree (AT) for deep learning inference presented in an embodiment of the present invention. As shown, the accumulator (Acc) has been removed from the PE and has instead been added to the Adder Tree (AT).
Specifically, an accumulator (Acc) is added at the final stage of the Adder Tree (AT) to accumulate the summed multiplication results.
Accordingly, since the PE presented in the embodiment of the present invention has no accumulator (Acc), it does not compute a partial sum of the input-by-weight multiplication results, but outputs each multiplication result directly to the Adder Tree (AT).
The Adder Tree (AT) sums the multiplication results output from the PEs with its adders and accumulates them in the accumulator (Acc).
Meanwhile, whereas the conventional PE shown in Figure 3 performs its operations in the kernel direction, the PE presented in the embodiment of the present invention shown in Figure 4 performs its operations in the channel direction. Because the PE operations proceed in the channel direction, the Adder Tree (AT) sums and accumulates the multiplication results on a per-pixel basis.
This is because the Adder Tree (AT) summation is performed before the accumulation in the accumulator (Acc). To handle multiple channels, the conventional Adder Tree 20 required as many adders as the number of input channels, whereas the Adder Tree (AT) presented in the embodiment of the present invention requires as many adders as the number of output channels.
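The contrast between the two organisations can be expressed in a short functional sketch (an illustration under assumed tensor shapes, not the circuit itself; conventional_pe and proposed_pe_and_adder_tree are names made up for this sketch): in the Figure 3 form each PE keeps a running partial sum over the kernel window, while in the Figure 4 form the PEs emit raw products for one pixel along the channel direction, the adder tree sums them, and a single accumulator at the root of the tree accumulates across kernel positions.

```python
import numpy as np

def conventional_pe(ifmap_patch, weight):
    # Figure 3 style: the PE itself accumulates (MAC) over the kernel window.
    acc = 0.0                                    # accumulator inside the PE
    for value, w in zip(ifmap_patch.ravel(), weight.ravel()):
        acc += value * w                         # multiply-accumulate per step
    return acc

def proposed_pe_and_adder_tree(ifmap_pixel, weight_pixel, acc):
    # Figure 4 style: PEs only multiply (channel direction, one pixel); the adder
    # tree sums the products, and the accumulator sits at the root of the tree.
    products = ifmap_pixel * weight_pixel        # PE outputs: no partial sums inside the PEs
    tree_sum = products.sum()                    # adder tree: per-pixel summation
    return acc + tree_sum                        # Acc at the final stage of the tree

# Tiny numeric check that both organisations give the same convolution result for
# one output pixel of one output channel (C_in = 4, K = 3, values arbitrary).
rng = np.random.default_rng(0)
patch = rng.standard_normal((4, 3, 3))           # input patch: channels x kernel window
kernel = rng.standard_normal((4, 3, 3))          # one output channel's weights
ref = conventional_pe(patch, kernel)
acc = 0.0
for ky in range(3):
    for kx in range(3):                          # accumulate over kernel positions in time
        acc = proposed_pe_and_adder_tree(patch[:, ky, kx], kernel[:, ky, kx], acc)
assert np.isclose(ref, acc)
```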
Figure 5 shows the operation of the conventional PE 10 during the training process. When processing weight gradients during training, unlike in inference, the data are computed without passing through the Adder Tree 20, so it is common to use the accumulator 15 inside the PE 10.
However, since the PEs are usually shared between inference and training acceleration, the embodiment of the present invention removes the accumulator from the PE as shown in Figure 6, uses as many adders as the number of output channels in the Adder Tree (AT), and adds one more adder to it.
Accordingly, the PE Tile structure presented in the embodiment of the present invention is as shown in Figure 7. As shown, the accumulator has been removed from the PE, one adder has been added to the Adder Tree (AT), and an accumulator (Acc) has been added at the final stage of the tree to accumulate the summed multiplication results.
Assuming the structure shown in Figure 7, if the number of PEs is 4096, then 4096 accumulators are removed from inside the PEs while a total of 544 adders and accumulators are added to the Adder Tree. Hardware complexity is therefore reduced, the hardware area shrinks, and low-power operation becomes possible.
So far, a hardware design capable of high-speed, low-power deep learning computation through modification of the operator used to accelerate deep learning inference and training has been described in detail with reference to preferred embodiments.
As a flexible deep learning accelerator, it is applicable to deep learning networks and layers of various structures, and it is a method that can dramatically reduce the number of high-complexity deep learning operators.
This enables low-power operation of the deep learning accelerator, and the method is applicable to edge devices, mobile devices, and servers, all environments in which low power is required.
Furthermore, the embodiments of the present invention are expected to slow the continuing growth of the hardware size required for inference and training.
In addition, although preferred embodiments of the present invention have been shown and described above, the present invention is not limited to the specific embodiments described above. Various modifications may of course be made by those of ordinary skill in the art to which the invention pertains without departing from the gist of the present invention as set out in the claims, and such modifications should not be understood separately from the technical idea or outlook of the present invention.

Claims (10)

  1. A deep learning computing device comprising:
    PEs (Processing Elements) that output the result of multiplying an input by a weight for a convolution operation; and
    an Adder Tree that sums and accumulates the multiplication results output from the PEs.
  2. The deep learning computing device of claim 1, wherein the PEs output the multiplication results directly, without computing a partial sum of the multiplication results.
  3. The deep learning computing device of claim 1, wherein the PEs perform the multiplication operations in the channel direction.
  4. The deep learning computing device of claim 1, wherein the Adder Tree sums and accumulates the multiplication results on a per-pixel basis.
  5. The deep learning computing device of claim 4, wherein an accumulator for accumulating the summed multiplication results is located at the final stage of the Adder Tree.
  6. The deep learning computing device of claim 5, wherein the Adder Tree includes as many adders as the number of output channels.
  7. The deep learning computing device of claim 1, further comprising:
    normalizing the output of the Adder Tree;
    applying an activation function to the normalization result; and
    applying a Maxpool operation to the activation values.
  8. A deep learning computation method comprising:
    PEs (Processing Elements) outputting the result of multiplying an input by a weight for a convolution operation; and
    an Adder Tree summing and accumulating the multiplication results output from the PEs.
  9. A deep learning acceleration device comprising:
    an RDMA (Read Direct Memory Access) that reads input data and weights stored in an external memory;
    an input buffer in which the input data and weights read by the RDMA are stored;
    an operator that performs a convolution operation using the input data and weights stored in the input buffer;
    an output buffer in which output data of the operator is stored; and
    a WDMA (Write Direct Memory Access) that reads the output data stored in the output buffer and stores it in the external memory,
    wherein the operator comprises:
    PEs (Processing Elements) that output the result of multiplying the input data by the weights for the convolution operation; and
    an Adder Tree that sums and accumulates the multiplication results output from the PEs.
  10. A deep learning acceleration method comprising:
    reading input data and weights stored in an external memory;
    storing the read input data and weights;
    an operation step of performing a convolution operation using the stored input data and weights;
    storing output data of the operation step; and
    reading the stored output data and storing it in the external memory,
    wherein the operation step comprises:
    PEs (Processing Elements) outputting the result of multiplying the input data by the weights for the convolution operation; and
    an Adder Tree summing and accumulating the multiplication results output from the PEs.
PCT/KR2022/021578 2022-12-29 2022-12-29 Efficient deep learning operation method and device WO2024143611A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020220188229A KR20240105809A (en) 2022-12-29 2022-12-29 Efficient deep learning computation method and apparatus
KR10-2022-0188229 2022-12-29

Publications (1)

Publication Number Publication Date
WO2024143611A1 true WO2024143611A1 (en) 2024-07-04

Family

ID=91717903

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2022/021578 WO2024143611A1 (en) 2022-12-29 2022-12-29 Efficient deep learning operation method and device

Country Status (2)

Country Link
KR (1) KR20240105809A (en)
WO (1) WO2024143611A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10698657B2 (en) * 2016-08-12 2020-06-30 Xilinx, Inc. Hardware accelerator for compressed RNN on FPGA
KR20180034853A (en) * 2016-09-28 2018-04-05 에스케이하이닉스 주식회사 Apparatus and method test operating of convolutional neural network
KR20190030564A (en) * 2017-09-14 2019-03-22 한국전자통신연구원 Neural network accelerator including bidirectional processing element array
KR20210074707A (en) * 2019-12-12 2021-06-22 한국전자기술연구원 Processing Device and Method with High Throughput for Neural Network Processor
KR20220143333A (en) * 2021-04-16 2022-10-25 포항공과대학교 산학협력단 Mobilenet hardware accelator with distributed sram architecture and channel stationary data flow desigh method thereof

Also Published As

Publication number Publication date
KR20240105809A (en) 2024-07-08

Similar Documents

Publication Publication Date Title
US20200320369A1 (en) Image recognition method, apparatus, electronic device and storage medium
CN111859775B (en) Software and hardware collaborative design for accelerating deep learning inference
CN110929865A (en) Network quantification method, service processing method and related product
WO2023065983A1 (en) Computing apparatus, neural network processing device, chip, and data processing method
CN111814957B (en) Neural network operation method and related equipment
WO2024143611A1 (en) Efficient deep learning operation method and device
US5768167A (en) Two-dimensional discrete cosine transformation circuit
CN116911366A (en) Computing system neural network optimization method and device
CN116450086B (en) Chip comprising multiply-accumulator, terminal and control method
WO2022145713A1 (en) Method and system for lightweighting artificial neural network model, and non-transitory computer-readable recording medium
WO2022260392A1 (en) Method and system for generating image processing artificial neural network model operating in terminal
WO2021020848A2 (en) Matrix operator and matrix operation method for artificial neural network
CN112748898B (en) Complex vector computing device and computing method
WO2023120788A1 (en) Data processing system and method capable of snn/cnn simultaneous drive
WO2023128024A1 (en) Method and system for quantizing deep-learning network
WO2023085442A1 (en) High-accuracy deep learning computing device
WO2024135862A1 (en) Data processing and manipulation device supporting unstructured data processing
WO2024135860A1 (en) Data pruning method for lightweight deep-learning hardware device
WO2024090600A1 (en) Deep learning model training method and deep learning computation apparatus applied with same
WO2021107170A1 (en) Low-power deep learning accelerator
WO2021049829A1 (en) Method, system, and non-transitory computer-readable recording medium for performing artificial neural network operation
WO2024135870A1 (en) Image recognition device performing input unit network quantization method for efficient object detection
WO2023214608A1 (en) Quantum circuit simulation hardware
WO2022107927A1 (en) Deep learning apparatus enabling rapid post-processing
WO2024091106A1 (en) Method and system for selecting an artificial intelligence (ai) model in neural architecture search (nas)

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22970250

Country of ref document: EP

Kind code of ref document: A1