CN114692862B - Method for adaptively adjusting activation quantization bit width
- Publication number: CN114692862B
- Application number: CN202011622451.0A
- Authority: CN (China)
- Prior art keywords: data, bit width, quantized, quantization, conv
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06N3/045 — Computing arrangements based on biological models; neural networks; architecture: combinations of networks
- G06N3/08 — Computing arrangements based on biological models; neural networks: learning methods
- G06F9/28 — Arrangements for program control: enhancement of operational speed, e.g. by using several microcontrol devices operating in parallel
- G06F9/3887 — Concurrent instruction execution using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
Abstract
The invention provides a method for adaptively adjusting the activation quantization bit width, which aims to overcome the defects of the prior art and solve the problem that a quantized model cannot achieve the optimal acceleration ratio and precision. The method comprises the following steps. S1, data quantization: quantize the data to be quantized to obtain low-bit data. S2, when training the low-bit model, transmit the data to the next layer; for the activation, adopt Relu, and carry out convolution after quantization. S3, during inference, with the weight channels unchanged, reducing the activation bit width reduces the cases in which the convolution accumulation result exceeds 16 bits: if conv(W_sqf, F_uqf) > 1.0, reduce the activation quantization bit width until conv(W_sqf, F_uqf) ≤ 1.0. The same operation is performed on each output channel of the layer, so that the corresponding bit width is determined according to the distribution of each channel.
Description
Technical Field
The invention relates to the technical field of convolutional neural network acceleration, and in particular to a method for adaptively adjusting the activation quantization bit width.
Background
In recent years, with the rapid development of technology, the era of big data has arrived. Deep learning, with deep neural networks (DNNs) as its models, has achieved remarkable results in many key fields of artificial intelligence, such as image recognition, reinforcement learning, and semantic analysis. The convolutional neural network (CNN), a typical DNN structure, can effectively extract the hidden-layer features of images and classify them accurately, and has been widely applied to image recognition and detection in recent years.
In existing quantization methods, either all layers of the model are quantized to the same bit width, or the quantization bit width of a particular layer is adjusted manually.
However, different layers of a neural network model incur different accuracy losses when quantized to different bit widths. When the whole model is quantized to a single bit width, either the overall bit width cannot be reduced further or the model converges poorly, so the optimal acceleration ratio cannot be achieved.
Technical terms commonly used in the prior art include:
Convolutional neural network (CNN): a type of feedforward neural network that performs convolution calculations and has a deep structure.
Quantization: the process of approximating the continuous values of a signal (or a large number of possible values) by a finite number of (or fewer) discrete values.
Low bits: data quantized to 8-bit, 4-bit, or 2-bit width.
SIMD (Single Instruction, Multiple Data): a technique for achieving spatial parallelism in which a single controller drives multiple processing units, applying the same operation to each element of a set of data (a "data vector") simultaneously. In image processing, common pixel formats such as RGB565, RGBA8888, and YUV422 represent each pixel component with 8 bits or fewer. On a conventional processor with 32-bit or 64-bit registers, computing on such data uses only the lower 8 bits of each register, which is inefficient. If a 64-bit register is instead treated as eight 8-bit lanes, eight operations complete simultaneously and efficiency improves eightfold. This is the core idea of SIMD instructions.
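To make the lane-splitting idea concrete, here is a minimal numpy sketch (not real SIMD intrinsics; the byte order shown assumes a little-endian platform) that views one 64-bit word as eight independent 8-bit lanes and updates all of them with a single vectorized operation:

```python
import numpy as np

# One 64-bit "register" value.
word = np.array([0x0102030405060708], dtype=np.uint64)

# Reinterpret the same 8 bytes as eight independent 8-bit lanes
# (little-endian byte order assumed).
lanes = word.view(np.uint8)   # array([8, 7, 6, 5, 4, 3, 2, 1], dtype=uint8)

# A single vectorized operation updates all eight lanes at once,
# mimicking how one SIMD instruction processes eight 8-bit pixels.
lanes += 1
print(lanes)                  # [9 8 7 6 5 4 3 2]
print(hex(int(word[0])))      # 0x203040506070809
```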
Disclosure of Invention
In order to solve the above problems, the present invention aims to overcome the defects of the prior art, namely that a quantized model cannot achieve the optimal acceleration ratio and precision.
Specifically, the invention provides a method for adaptively adjusting the activation quantization bit width, which comprises the following steps:
S1, data quantization: quantize the data to be quantized to obtain low-bit data;
S2, when training the low-bit model, transmit the data to the next layer; for the activation, adopt Relu, and carry out convolution after quantization, with the result as follows:
This equation explains the relationship between conv(W_sqf, F_uqf) and conv(W_q, F_q), where wb and fb are the quantization bit widths of the weights and the feature map respectively, W_sqf is the weight data quantized to low bits and normalized to [-1, 1], F_uqf is the feature-map data quantized to low bits and normalized to [0, 1], and W_q and F_q are the weight and feature-map data quantized to low bits, respectively;
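The equation referred to here is not reproduced in the text; a plausible form, assuming the normalizations W_sqf = W_q / 2^(wb-1) and F_uqf = F_q / 2^fb implied by the definitions above, is:

conv(W_q, F_q) = 2^(wb-1) · 2^fb · conv(W_sqf, F_uqf)

With wb = fb = 8 this gives conv(W_q, F_q) = 2^15 · conv(W_sqf, F_uqf), so the accumulation fits a signed 16-bit range exactly when conv(W_sqf, F_uqf) ≤ 1.0, which is the condition used in step S3.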
S3, during inference, with the weight channels unchanged, reducing the activation bit width reduces the cases in which the convolution accumulation result exceeds 16 bits; if conv(W_sqf, F_uqf) > 1.0, reduce the activation quantization bit width until conv(W_sqf, F_uqf) ≤ 1.0. The same operation is performed on each output channel of the layer, so that the corresponding bit width is determined according to the distribution of each channel.
The step S1 includes:
1) Signed data quantization:
W_f = min(max(W_f, -max_w), max_w)
W_q = clamp(-2^(b-1), 2^(b-1)-1, W_int)
2) Unsigned data quantization:
W_f = min(max(W_f, 0), max_w)
W_q = clamp(0, 2^b-1, W_int)
Description of variables: W_f is the full-precision data, W_q is the simulated quantized data, W_int is the rounded integer representation of W_f, max_w is the maximum value in the full-precision data W_f, and b is the quantization bit width.
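A minimal Python sketch of these two quantizers follows (an illustration only: the text does not specify how W_int is produced, so the scale-and-round step mapping W_f onto the integer grid via max_w is an assumption):

```python
import numpy as np

def quantize_signed(w_f: np.ndarray, b: int) -> np.ndarray:
    """Signed quantization of step S1 (e.g., for weights)."""
    max_w = np.abs(w_f).max()
    w_f = np.clip(w_f, -max_w, max_w)                   # W_f = min(max(W_f, -max_w), max_w)
    w_int = np.round(w_f / max_w * (2 ** (b - 1) - 1))  # assumed scaling for W_int
    return np.clip(w_int, -2 ** (b - 1), 2 ** (b - 1) - 1)  # W_q = clamp(...)

def quantize_unsigned(f_f: np.ndarray, b: int) -> np.ndarray:
    """Unsigned quantization of step S1 (e.g., for post-Relu activations)."""
    max_w = f_f.max()
    f_f = np.clip(f_f, 0, max_w)                  # W_f = min(max(W_f, 0), max_w)
    f_int = np.round(f_f / max_w * (2 ** b - 1))  # assumed scaling for W_int
    return np.clip(f_int, 0, 2 ** b - 1)          # W_q = clamp(0, 2^b - 1, W_int)

# Example: quantize random weights to 8-bit signed values.
w_q = quantize_signed(np.random.randn(16), b=8)
```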
In the step S2, the data transferred to the next layer is:
if the data is signed;
if the data is unsigned.
The Relu used is Relu6, whose formula is as follows:
relu6(x) = min(max(x, 0), 6) ∈ [0, 6].
In the step S3, the convolution operation is accelerated by using a SIMD acceleration method.
The operation in the step S3 can also be completed during model training: if conv(W_sqf, F_uqf) > 1.0 at training step n, then fb_{n+1} = fb_n - 1 at step n+1; if conv(W_sqf, F_uqf) ≤ 1.0 at step n, then fb_{n+1} = fb_n.
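A minimal sketch of this per-step rule (the lower bound fb_min, which keeps the bit width from collapsing to zero, is an added assumption):

```python
def adjust_fb(conv_sqf_uqf: float, fb: int, fb_min: int = 2) -> int:
    """One training-step update of the activation bit width fb:
    fb_{n+1} = fb_n - 1 if conv(W_sqf, F_uqf) > 1.0 at step n,
    fb_{n+1} = fb_n otherwise."""
    if conv_sqf_uqf > 1.0 and fb > fb_min:
        return fb - 1
    return fb
```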
The method performs full-precision model training.
Thus, the present application has the advantage that adjusting the activation bit width reduces the accumulated result of the convolution, compressing the accumulated sum to within 16 bits and improving the acceleration obtained from SIMD.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate the application and together with the description serve to explain it.
FIG. 1 is a schematic flow chart of the method of the present invention.
Detailed Description
In order that the technical content and advantages of the present invention may be more clearly understood, the invention is described below in further detail with reference to the accompanying drawings.
As shown in FIG. 1, the present invention relates to a method for improving model accuracy during convolutional neural network quantization, and in particular to a method for adaptively adjusting the activation quantization bit width.
A method of adaptively adjusting the activation quantization bit width comprises the following steps:
S1, data quantization: quantize the data to be quantized to obtain low-bit data;
S2, when training the low-bit model, transmit the data to the next layer; for the activation, adopt Relu, and carry out convolution after quantization, with the result as follows:
This equation explains the relationship between conv(W_sqf, F_uqf) and conv(W_q, F_q), where wb and fb are the quantization bit widths of the weights and the feature map respectively, W_sqf is the weight data quantized to low bits and normalized to [-1, 1], F_uqf is the feature-map data quantized to low bits and normalized to [0, 1], and W_q and F_q are the weight and feature-map data quantized to low bits, respectively;
S3, at inference time, with the weight channels unchanged, reducing the activation bit width reduces the cases in which the convolution accumulation result exceeds 16 bits: if conv(W_sqf, F_uqf) > 1.0, the activation quantization bit width is reduced until conv(W_sqf, F_uqf) ≤ 1.0. This operation can be completed during model training: if conv(W_sqf, F_uqf) > 1.0 at training step n, then fb_{n+1} = fb_n - 1 at step n+1; if conv(W_sqf, F_uqf) ≤ 1.0 at step n, then fb_{n+1} = fb_n. The same operation is performed on each output channel of the layer, so that the corresponding bit width is determined according to the distribution of each channel. In other words, to improve the acceleration obtained from SIMD, the result of conv(W_q, F_q) at inference time should fit within 16 bits. When wb = 8 and fb = 8, it follows from the above equation that for conv(W_q, F_q) to fit within 16 bits, conv(W_sqf, F_uqf) must be less than 1.0; and conv(W_sqf, F_uqf) can be adjusted by changing fb without changing wb, thereby achieving adaptive adjustment of the activation quantization bit width.
Specifically, the method first performs full-precision model training, and then:
1. Data quantization: quantize the data to be quantized according to the formulas shown to obtain low-bit data,
as shown in equation set 1, which consists of signed and unsigned quantization:
Signed quantization:
W_f = min(max(W_f, -max_w), max_w)
W_q = clamp(-2^(b-1), 2^(b-1)-1, W_int)
Unsigned quantization:
W_f = min(max(W_f, 0), max_w)
W_q = clamp(0, 2^b-1, W_int)
Description of variables: W_f is the full-precision data, W_q is the simulated quantized data, W_int is the rounded integer representation of W_f, max_w is the maximum value in the full-precision data W_f, and b is the quantization bit width.
2. When training the low-bit model, the data passed to the next layer is as shown in equation set 2:
For the activation, Relu is adopted, and the result of the convolution after quantization is as shown in equation 3:
3. Since SIMD is used to accelerate the convolution operation at inference time, and by the characteristics of SIMD an accumulation result stored in 16 bits is processed twice as fast as one stored in 32 bits, it follows from equation set 1 and equation 3 that, with the weight channels unchanged, reducing the activation bit width reduces the cases in which the convolution accumulation result exceeds 16 bits. Therefore, according to equation 3, if conv(W_sqf, F_uqf) > 1.0, the activation quantization bit width is reduced until conv(W_sqf, F_uqf) ≤ 1.0. The same operation can be performed on each output channel of the layer, so that the corresponding bit width is determined according to the distribution of each channel.
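A numpy sketch of this per-channel procedure follows; it checks the equivalent integer condition |conv(W_q, F_q)| < 2^15 (a signed 16-bit accumulator) directly, and the shapes, names, and fb floor are illustrative assumptions:

```python
import numpy as np

def choose_fb_per_channel(w_q: np.ndarray, f_uqf: np.ndarray,
                          fb_start: int = 8, fb_min: int = 2) -> list:
    """Pick an activation bit width fb for each output channel such that
    the convolution accumulation fits a signed 16-bit accumulator.

    w_q:   (C_out, K) weights already quantized to signed 8-bit (wb = 8).
    f_uqf: (K,) activations normalized to [0, 1].
    """
    fbs = []
    for w_c in w_q.astype(np.int64):
        fb = fb_start
        while fb > fb_min:
            f_q = np.round(f_uqf * (2 ** fb - 1)).astype(np.int64)
            acc = int(np.dot(w_c, f_q))   # convolution accumulation for this channel
            if abs(acc) < 2 ** 15:        # fits the 16-bit SIMD accumulator
                break
            fb -= 1                       # reduce the activation bit width
        fbs.append(fb)
    return fbs

# Example: 4 output channels, 64 accumulated terms each.
fbs = choose_fb_per_channel(np.random.randint(-128, 128, (4, 64)),
                            np.random.rand(64))
```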
The above description covers only preferred embodiments of the present invention and is not intended to limit it; those skilled in the art may make various modifications and variations to these embodiments. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention shall fall within its scope of protection.
Claims (6)
1. A method of adaptively adjusting the activation quantization bit width, the method comprising the steps of:
S1, data quantization: quantizing the data to be quantized to obtain low-bit data;
S2, when training the low-bit model, transmitting the data to the next layer; for the activation, adopting Relu, and carrying out convolution after quantization, with the result as follows:
This equation explains the relationship between conv(W_sqf, F_uqf) and conv(W_q, F_q), where wb and fb are the quantization bit widths of the weights and the feature map respectively, W_sqf is the weight data quantized to low bits and normalized to [-1, 1], F_uqf is the feature-map data quantized to low bits and normalized to [0, 1], and W_q and F_q are the weight and feature-map data quantized to low bits, respectively;
S3, during inference, with the weight channels unchanged, reducing the activation bit width reduces the cases in which the convolution accumulation result exceeds 16 bits; if conv(W_sqf, F_uqf) > 1.0, reducing the activation quantization bit width until conv(W_sqf, F_uqf) ≤ 1.0; and performing the same operation on each output channel of the layer, so that the corresponding bit width is determined according to the distribution of each channel; wherein the convolution operation is accelerated using SIMD.
2. The method according to claim 1, wherein the step S1 comprises:
1) Signed data quantization:
W_f = min(max(W_f, -max_w), max_w)
W_q = clamp(-2^(b-1), 2^(b-1)-1, W_int)
2) Unsigned data quantization:
W_f = min(max(W_f, 0), max_w)
W_q = clamp(0, 2^b-1, W_int)
Description of variables: W_f is the full-precision data, W_q is the simulated quantized data, max_w is the maximum value in the full-precision data W_f, and b is the quantization bit width.
3. The method according to claim 1, wherein in the step S2, the data transferred to the next layer is:
if the data is signed;
if the data is unsigned.
4. The method according to claim 1, wherein the Relu used is Relu6, whose formula is as follows:
relu6(x) = min(max(x, 0), 6) ∈ [0, 6].
5. The method according to claim 1, wherein the operation in the step S3 is completed during model training: if conv(W_sqf, F_uqf) > 1.0 at training step n, then fb_{n+1} = fb_n - 1 at step n+1; and if conv(W_sqf, F_uqf) ≤ 1.0 at step n, then fb_{n+1} = fb_n.
6. The method according to claim 1, wherein the method performs full-precision model training.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202011622451.0A | 2020-12-31 | 2020-12-31 | Method for adaptively adjusting activation quantization bit width
Publications (2)
Publication Number | Publication Date
---|---
CN114692862A | 2022-07-01
CN114692862B | 2024-10-15
Family
ID=82133796
Family Applications (1)
Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202011622451.0A | Method for adaptively adjusting activation quantization bit width | 2020-12-31 | 2020-12-31
Country Status (1)
Country | Link
---|---
CN | CN114692862B (en)
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107480770A (en) * | 2017-07-27 | 2017-12-15 | 中国科学院自动化研究所 | The adjustable neutral net for quantifying bit wide quantifies the method and device with compression |
CN109902745A (en) * | 2019-03-01 | 2019-06-18 | 成都康乔电子有限责任公司 | A kind of low precision training based on CNN and 8 integers quantization inference methods |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110555508B (en) * | 2018-05-31 | 2022-07-12 | 赛灵思电子科技(北京)有限公司 | Artificial neural network adjusting method and device |
US11676029B2 (en) * | 2019-06-12 | 2023-06-13 | Shanghai Cambricon Information Technology Co., Ltd | Neural network quantization parameter determination method and related products |
CN110852439B (en) * | 2019-11-20 | 2024-02-02 | 字节跳动有限公司 | Data processing method and device and storage medium |
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant