Recurrent neural networks (RNNs) are the backbone of many text and speech applications. These architectures are typically made up of several computationally complex components, such as non-linear activation functions, normalization, bi-directional dependence, and attention. To maintain good accuracy, these components are frequently run using full-precision floating-point computation, making them slow, inefficient, and difficult to deploy on edge devices. In addition, the complex nature of these operations makes them challenging to quantize using standard quantization methods without a significant performance drop. We present a quantization-aware training method for obtaining a highly accurate integer-only recurrent neural network (iRNN). Our approach supports layer normalization, attention, and an adaptive piecewise linear (PWL) approximation of activation functions, to serve a wide range of state-of-the-art RNNs. The proposed method enables RNN-based language models to run on edge devices with a \(2\times\) improvement in runtime and a \(4\times\) reduction in model size while maintaining accuracy similar to that of their full-precision counterparts.
Appendix A
A.1 Details on LSTM-Based Models
For BiLSTM cells, nothing stated in Sect. “Integer-only LSTM network” changes, except that we force the forward LSTM hidden state \(\overrightarrow{\textbf{h}}_t\) and the backward LSTM hidden state \(\overleftarrow{\textbf{h}}_t\) to share the same quantization parameters so that they can be concatenated into a single vector. If the model has embedding layers, they are quantized to 8-bit, as we found they were not sensitive to quantization. If the model has residual connections (e.g., between LSTM cells), they are quantized to 8-bit integers. In encoder-decoder models, the attention layers are quantized using the method described in Sect. “Integer-Only Attention”. The weights of the last fully-connected layer of the model are quantized to 8-bit to allow for 8-bit matrix multiplication. We do not quantize the outputs and leave them as 32-bit integers, since this is usually the point at which the model has done its job and postprocessing (e.g., beam search) takes over.
A.2 Experimental Details
In this section, we provide further details of our experimental setups. The number of parameters and training time of the model are reported in Table 7.
A.2.1 LayerNorm LSTM on PTB
The dataset was preprocessed as in [37]. The vocabulary size is 10K. We report the best perplexity per word on the validation set and test set for a language model of embedding size 200 with one LayerNormLSTM cell of state size 200. The lower the perplexity, the better the model performs. These experiments focus on the relative increase of perplexity between the full-precision models and their 8-bit quantized counterparts. We did not aim to reproduce state-of-the-art performance on PTB and went with a naïve set of hyper-parameters. The full-precision network is trained for 100 epochs with batch size 20 and a BPTT [38] window size of 35. We used the SGD optimizer with a weight decay of \(10^{-5}\) and a learning rate of 20, which is divided by 4 when the loss plateaus for more than two epochs without a relative decrease of \(10^{-4}\) in perplexity. We use gradient clipping of 0.25. We initialize the quantized models from the best full-precision checkpoint and train for another 100 epochs. For the first five epochs, we did not enable quantization; instead, we gathered range statistics from which the quantization parameters are computed.
A.2.2 Mogrifier LSTM on WikiText2
We describe the experimental setup for Mogrifier LSTM on WikiText2. Note that we follow the setup of [33], which uses neither dynamic evaluation [39] nor Monte Carlo dropout [40]. The vocabulary size is 33279. We use a two-layer Mogrifier LSTM with embedding dimension 272, state dimension 1366, and capped input gates. We use six modulation rounds per Mogrifier layer with low-rank dimension 48. We use 2 Mixture-of-Softmax layers [41]. The input and output embeddings are tied. We use a batch size of 64 and a BPTT window size of 70. We train the full-precision Mogrifier LSTM for 340 epochs, after which we enable Stochastic Weight Averaging (SWA) [42] for 70 epochs. For the optimizer, we used Adam [43] with a learning rate of \(\approx 3\times 10^{-3}\), \(\beta _1=0\), \(\beta _2=0.999\), and weight decay \(\approx 1.8\times 10^{-4}\). We clip the gradient norm to 10. We use the same hyper-parameters for the quantized models, which we initialize from a pre-trained full-precision model and continue to train for 200 epochs. During the first two epochs, we do not perform QAT, but we gather min and max statistics in the network to have a correct starting estimate of the quantization parameters. After that, we enable 8-bit QAT on every component of the Mogrifier LSTM: weights, matrix multiplications, element-wise operations, and activations. Then we replace the activation functions in the model with quantization-aware PWLs and continue training for 100 epochs. We perform a complete ablation of our method to study the effect of each component. Quantizing the weights and matrix multiplications accounts for about 0.1 of the perplexity increase. There is a clear performance drop after adding quantization of element-wise operations, with an increase in perplexity of about 0.3. This is because the element-wise operations in the cell and hidden state computations affect the flow of information across timesteps and the residual connections across layers. Adding quantization of the activations does not impact the performance of the network.
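The calibration phase described above (gathering min and max statistics before enabling QAT) can be illustrated with a minimal Python sketch. The class name `RangeObserver` and its methods are hypothetical, and the affine formulas used here (scale \(S=(r_{\max}-r_{\min})/(2^b-1)\), zero-point \(Z=\lfloor -r_{\min}/S \rceil\)) are an assumption chosen because they reproduce the parameters quoted in Appendix B; this is not the authors' training code.

```python
import math


class RangeObserver:
    """Tracks the running min/max of a tensor's values (hypothetical helper)."""

    def __init__(self):
        self.r_min = math.inf
        self.r_max = -math.inf

    def update(self, values):
        # Called on every batch while quantization is still disabled.
        self.r_min = min(self.r_min, min(values))
        self.r_max = max(self.r_max, max(values))

    def quant_params(self, bits=8):
        # Assumed affine scheme: map [r_min, r_max] onto [0, 2^bits - 1].
        levels = (1 << bits) - 1
        span = self.r_max - self.r_min
        scale = span / levels
        # Round half up; computed as -r_min * levels / span to avoid
        # dividing by an already-rounded scale.
        zero_point = math.floor(-self.r_min * levels / span + 0.5)
        return scale, zero_point


obs = RangeObserver()
obs.update([-1.0, 0.2, 0.4])   # statistics from a first batch
obs.update([-0.3, 0.5, 1.0])   # statistics from a second batch
scale, zero_point = obs.quant_params()
print(round(scale, 4), zero_point)  # 0.0078 128
```

For the range \([-1, 1]\) this yields \(S \approx 0.0078\) and \(Z = 128\), the same parameters used for \(u\) in Appendix B.1.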
A.2.3 ESPRESSO LSTM on LibriSpeech
The encoder comprises 4 CNN-BatchNorm-ReLU blocks followed by 4 BiLSTM layers with 1024 units. The decoder consists of 3 LSTM layers with 1024 units, with Bahdanau’s attention on the hidden states of the encoder and residual connections between each layer. The dataset preprocessing is precisely the same as in [34]. We train the model for 30 epochs on one V100 GPU, which takes approximately six days to complete. We use a batch size of 24 while limiting the maximum number of tokens in a mini-batch to 26000. Adam is used with a starting learning rate of 0.001, which is divided by 2 when the validation set metric plateaus without a relative decrease of \(10^{-4}\) in performance. Cross-entropy with uniform label smoothing \(\alpha =0.1\) [44] is used as the loss function. The model predictions are weighted at evaluation time using a pre-trained full-precision 4-layer LSTM language model (shallow fusion). We consider this language model an external component to the ESPRESSO LSTM; we do not quantize it due to the lack of resources. In our language modeling experiments, we have already shown that quantized language models retain their performance. We refer the reader to [34] and the training script (footnote 5) for a complete description of the experimental setup.
We initialize the quantized model from the pre-trained full-precision ESPRESSO LSTM. Due to the lack of resources, we trained the quantized model for only four epochs. The quantized model is trained on 6 V100 GPUs, where each epoch takes two days, for a total of 48 GPU days. The mini-batch size is set to 8 per GPU with a maximum of 8600 tokens. We made these changes because, otherwise, the GPU would run out of VRAM due to the added fake quantization operations. For the first half of the first epoch, we gathered statistics for the quantization parameters, and then we enabled QAT. The activation functions are swapped with quantization-aware PWLs in the last epoch. The number of pieces for the quantization-aware PWLs is 96, except for the exponential function in the attention, which uses 160, as we found that more pieces were necessary because of its curvature. The number of pieces is higher than in the language modeling experiments because the inputs to the activation functions are 16-bit rather than 8-bit, although the outputs are still quantized to 8-bit; more pieces are needed to capture the finer input resolution. Note that it would not be feasible to use a 16-bit Look-Up Table (LUT) to compute the activation functions due to its size and the resulting cache misses, whereas using 96 pieces allows for a \(170\times\) reduction in memory consumption compared to a LUT.
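To give a feel for what a 96-piece PWL approximation looks like, the following Python sketch approximates the sigmoid with uniformly spaced knots and floating-point interpolation. This is only an illustration under simplifying assumptions: the knots here are uniform on an assumed input range \([-8, 8]\), whereas the method in the paper uses adaptive, quantization-aware PWLs evaluated in integer arithmetic, none of which is reproduced here. The function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def build_pwl(fn, lo, hi, num_pieces):
    """Sample fn at num_pieces + 1 uniformly spaced knots on [lo, hi]."""
    step = (hi - lo) / num_pieces
    knots = [lo + i * step for i in range(num_pieces + 1)]
    values = [fn(k) for k in knots]
    return knots, values


def eval_pwl(x, knots, values):
    """Linear interpolation between knots; constant outside [lo, hi]."""
    lo, hi = knots[0], knots[-1]
    if x <= lo:
        return values[0]
    if x >= hi:
        return values[-1]
    step = (hi - lo) / (len(knots) - 1)
    i = min(int((x - lo) / step), len(knots) - 2)  # index of the active piece
    t = (x - knots[i]) / (knots[i + 1] - knots[i])
    return values[i] + t * (values[i + 1] - values[i])


knots, values = build_pwl(sigmoid, -8.0, 8.0, 96)

# Worst-case absolute error on a grid over [-10, 10].
max_err = max(
    abs(eval_pwl(x / 100, knots, values) - sigmoid(x / 100))
    for x in range(-1000, 1001)
)
print(max_err < 1e-3)  # True
```

Only the 97 knot values need to be stored, in contrast to a table with one entry per possible 16-bit input.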
Appendix B
The following section provides some examples of integer-only arithmetic and more details on fixed-point scaling.
B.1 Multiplication
To illustrate how integer-only multiplication is achieved, we define an example utilizing (4). Defining \(u \in [u_{\min }, u_{\max }]=[-1, 1]\) and \(w \in [w_{\min }, w_{\max }]=[0,5]\), the product of two numbers from those ranges will fall into \([z_{\min }, z_{\max }]=[-5, 5]\). From (3), for 8-bit quantization, we have \(S_u \approx 0.0078\), \(Z_u=128\), \(S_w \approx 0.0196\), \(Z_w=0\), \(S_z \approx 0.0392\), \(Z_z=128\). Given \(u=-0.8\) and \(w=2.3\), we have \(q_u=25\) and \(q_w=117\). Therefore, following (4),
\[q_z = \left\lfloor \frac{S_u S_w}{S_z}\,(q_u - Z_u)(q_w - Z_w) \right\rceil + Z_z = \left\lfloor 0.0039 \times (-103) \times 117 \right\rceil + 128 = 81.\]
Using (2), the floating-point representation of \(q_z\) is \(r_z=S_z(q_z-Z_z)=-1.8424\), which is close to \(uw=-1.84\). Note that we lost precision at two levels: first when quantizing \(u\) and \(w\), and then when quantizing \(z\), the multiplication output.
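The example can be checked with a short Python sketch. It uses the rounded scaling factors quoted in the text so that the integers match, and it carries out the rescaling by \(S_u S_w / S_z\) with an integer multiplier and a right shift. The choice of 24 fractional bits for that multiplier and all function names are assumptions made for illustration, not details taken from the paper.

```python
import math


def round_half_up(x):
    return math.floor(x + 0.5)


def quantize(r, scale, zero_point, bits=8):
    """Affine quantization of a real value to an unsigned integer."""
    q = round_half_up(r / scale) + zero_point
    return max(0, min((1 << bits) - 1, q))


def dequantize(q, scale, zero_point):
    return scale * (q - zero_point)


# Quantization parameters as quoted (rounded) in the text.
S_u, Z_u = 0.0078, 128
S_w, Z_w = 0.0196, 0
S_z, Z_z = 0.0392, 128

q_u = quantize(-0.8, S_u, Z_u)
q_w = quantize(2.3, S_w, Z_w)

# Offline step: encode S_u * S_w / S_z as an integer with 24 fractional
# bits (assumed width), so inference needs no floating point.
FRAC_BITS = 24
M_INT = round_half_up(S_u * S_w / S_z * (1 << FRAC_BITS))


def int_multiply(q_a, q_b):
    """Integer-only product: integer accumulate, multiply, rounding shift."""
    acc = (q_a - Z_u) * (q_b - Z_w)
    scaled = (acc * M_INT + (1 << (FRAC_BITS - 1))) >> FRAC_BITS
    return max(0, min(255, scaled + Z_z))


q_z = int_multiply(q_u, q_w)
r_z = dequantize(q_z, S_z, Z_z)
print(q_u, q_w, q_z, round(r_z, 4))  # 25 117 81 -1.8424
```

Adding \(2^{23}\) before the arithmetic right shift implements round-to-nearest, so the integer path gives the same \(q_z\) as rounding the real-valued product.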
B.2 Addition
As mentioned in Sect. “Integer-only arithmetic”, addition with quantized numbers can take two forms. The first form is when the two numbers to be added share the same scaling factor and zero-point. For instance, given \(x_1=-0.3\), \(x_2=0.7\) from \([-1,1]\), and \(S_x=0.0078\), \(Z_x=128\), we have \(q_{x_1}=90\) and \(q_{x_2}=218\). The result \(y\) will fall into the range \([-2,2]\), therefore \(S_y \approx 0.0157\) and \(Z_y=128\). Then, because the inputs share the same quantization parameters, following (5),
\[q_y = \left\lfloor \frac{S_x}{S_y}\,(q_{x_1} + q_{x_2} - 2Z_x) \right\rceil + Z_y = \left\lfloor \frac{0.0078}{0.0157} \times 52 \right\rceil + 128 = 154.\]
We have \(r_y=0.4082\), while \(x_1 + x_2 = 0.4\). The second form is when the two numbers do not share the same scaling factor and zero-point. Define \(a \in [a_{\min }, a_{\max }]=[-1, 1]\) and \(b \in [b_{\min }, b_{\max }]=[0,5]\); the sum of two numbers from those ranges will fall into \([c_{\min }, c_{\max }]=[-1, 6]\). We get \(S_a \approx 0.0078\), \(Z_a=128\), \(S_b \approx 0.0196\), \(Z_b=0\), \(S_c \approx 0.0274\), \(Z_c=36\). For \(a=-0.9\), \(b=3.9\), we have \(q_a=13\) and \(q_b=199\). The quantized addition result \(q_c\), following (6), is
\[q_c = \left\lfloor \frac{S_a}{S_c}\,(q_a - Z_a) + \frac{S_b}{S_c}\,(q_b - Z_b) \right\rceil + Z_c = \left\lfloor \frac{0.0078}{0.0274} \times (-115) + \frac{0.0196}{0.0274} \times 199 \right\rceil + 36 = 146,\]
and \(r_c=3.0140\) while \(a+b=3.0\).
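Both forms of addition can be checked numerically with the sketch below, which uses the rounded scaling factors quoted in the text. For brevity the rescaling ratios are evaluated with Python floats; in an integer-only deployment they would be replaced by fixed-point multipliers, which is an assumption of this sketch rather than something shown here. Function names are hypothetical.

```python
import math


def round_half_up(x):
    return math.floor(x + 0.5)


def quantize(r, scale, zero_point, bits=8):
    q = round_half_up(r / scale) + zero_point
    return max(0, min((1 << bits) - 1, q))


# Form 1: both inputs share (S_x, Z_x).
S_x, Z_x = 0.0078, 128
S_y, Z_y = 0.0157, 128
q_x1 = quantize(-0.3, S_x, Z_x)
q_x2 = quantize(0.7, S_x, Z_x)
# A single rescale of the integer sum is enough.
q_y = round_half_up(S_x / S_y * (q_x1 + q_x2 - 2 * Z_x)) + Z_y
r_y = S_y * (q_y - Z_y)

# Form 2: inputs have different scales and zero-points.
S_a, Z_a = 0.0078, 128
S_b, Z_b = 0.0196, 0
S_c, Z_c = 0.0274, 36
q_a = quantize(-0.9, S_a, Z_a)
q_b = quantize(3.9, S_b, Z_b)
# Each operand is rescaled to the output scale before summing.
q_c = round_half_up(
    S_a / S_c * (q_a - Z_a) + S_b / S_c * (q_b - Z_b)
) + Z_c
r_c = S_c * (q_c - Z_c)

print(q_x1, q_x2, q_y, round(r_y, 4))  # 90 218 154 0.4082
print(q_a, q_b, q_c, round(r_c, 4))    # 13 199 146 3.014
```

The dequantized results match the values 0.4082 and 3.0140 quoted in the text.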
B.3 Fixed Point Arithmetic
Even with the most careful rounding, fixed-point values represented with a scaling factor S may have an error of up to \(\pm 0.5\) in the stored integer, that is, \(\pm 0.5 S\) in the value. Therefore, smaller scaling factors generally produce more accurate results. On the other hand, a smaller scaling factor means a smaller range of values that can be stored in a given program variable. The maximum fixed-point value that can be stored in a variable is the largest integer value that can be stored in it, multiplied by the scaling factor; and similarly for the minimum value. For example, Table 8 gives the implied scaling factor S, the minimum and maximum representable values, and the accuracy \(\delta = S/2\) of values that can be represented in 16-bit signed binary fixed-point format, depending on the number f of implied fraction bits.
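The quantities described above follow directly from the number of fraction bits. The sketch below computes them from the definitions for a 16-bit signed format; it does not reproduce the contents of Table 8, and the function name and the chosen values of f are illustrative assumptions.

```python
def fixed_point_format(frac_bits, total_bits=16):
    """Scale, representable range, and accuracy of a signed fixed-point format."""
    scale = 2.0 ** -frac_bits                 # implied scaling factor S
    q_min = -(1 << (total_bits - 1))          # smallest stored integer
    q_max = (1 << (total_bits - 1)) - 1       # largest stored integer
    return {
        "S": scale,
        "min": q_min * scale,
        "max": q_max * scale,
        "delta": scale / 2,                   # worst-case rounding error
    }


for f in (0, 4, 8, 15):
    fmt = fixed_point_format(f)
    print(f, fmt["S"], fmt["min"], fmt["max"], fmt["delta"])
```

With f = 8, for instance, the format covers \([-128, 127.99609375]\) in steps of \(1/256\): each extra fraction bit halves both the step and the representable range.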
To convert a number from floating point to fixed point, one may divide it by the scaling factor S, then round the result to the nearest integer. Care must be taken to ensure that the result fits in the destination variable or register. Depending on the scaling factor, the storage size, and the range of input numbers, the conversion may not entail any rounding. To convert a fixed-point number to floating point, in contrast, one may convert the integer to floating point and then multiply it by the scaling factor S. This conversion may entail rounding if the integer’s absolute value is greater than \(2^{24}\) (for binary single-precision IEEE floating point) or \(2^{53}\) (for double precision). In addition, overflow or underflow may occur if \(\left|S\right|\) is very large or very small. Most computers with binary arithmetic have fast bit shift instructions that can multiply or divide an integer by any power of 2, in particular an arithmetic shift instruction. These instructions can be used to quickly change scaling factors that are powers of 2 while preserving the sign of the number.
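The conversions and the shift-based change of scaling factor can be sketched as follows, assuming power-of-2 scaling factors \(S = 2^{-f}\) and a 16-bit signed destination. The function names are hypothetical. Python integers are unbounded, so the range check is done explicitly, and Python's `>>` on negative integers behaves like an arithmetic shift (it rounds toward negative infinity).

```python
import math


def to_fixed(x, frac_bits, total_bits=16):
    """Float -> fixed point: divide by S = 2^-frac_bits, round to nearest."""
    q = math.floor(x * (1 << frac_bits) + 0.5)
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    if not lo <= q <= hi:
        raise OverflowError(f"{x} does not fit with {frac_bits} fraction bits")
    return q


def to_float(q, frac_bits):
    """Fixed point -> float: multiply the integer by S = 2^-frac_bits."""
    return q / (1 << frac_bits)


def rescale(q, from_frac_bits, to_frac_bits):
    """Change the scaling factor with a shift; the sign is preserved."""
    shift = to_frac_bits - from_frac_bits
    return q << shift if shift >= 0 else q >> -shift


q = to_fixed(3.14159, 8)
print(q, to_float(q, 8))        # 804 3.140625
print(rescale(q, 8, 4))         # 50
print(rescale(-q, 8, 4))        # -51
print(float(2**53 + 1) == float(2**53))  # True: rounding above 2^53
```

The last line illustrates the double-precision limit mentioned above: \(2^{53}+1\) is not exactly representable, so the conversion rounds.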
Cite this article
Nia, V.P., Sari, E., Courville, V. et al. Training Integer-Only Deep Recurrent Neural Networks. SN COMPUT. SCI. 4, 501 (2023). https://doi.org/10.1007/s42979-023-01920-z