On-chip memory based binarized convolutional deep neural network applying batch normalization free technique on an FPGA
H Yonekawa, H Nakahara - 2017 IEEE International Parallel and Distributed Processing …, 2017 - ieeexplore.ieee.org
A pre-trained convolutional deep neural network (CNN) is a feed-forward computation that is widely used in embedded systems, which demand high power and area efficiency. This paper proposes a binarized CNN on an FPGA that uses only binary values (+1/-1) for the inputs and the weights. In this case, each multiplier is replaced with an XNOR gate instead of a dedicated DSP block, so binarized inputs and weights are well suited to hardware implementation. However, a binarized CNN requires batch normalization to retain classification accuracy; the extra multiplications and additions require additional hardware, and the memory accesses for the normalization parameters reduce system performance. In this paper, we propose a batch-normalization-free binarized CNN that is mathematically equivalent to one using batch normalization. The proposed CNN operates on the binarized inputs and weights with an integer bias. We implemented the VGG-16 benchmark CNN on the Xilinx Inc. Zynq UltraScale+ MPSoC ZCU102 evaluation board. Our binarized CNN stores all weights, inputs, and outputs in on-chip BRAMs, which are faster and dissipate less power than off-chip memory such as DDR4 SDRAM. Compared with conventional FPGA realizations, although the classification accuracy drops by 6.5%, the performance is 2.45 times faster, the power efficiency is slightly better, and the area efficiency is 2.68 times better. Compared with the ARM Cortex-A57, our design is 136.8 times faster, dissipates 3.1 times as much power, and delivers 44.7 times better performance per watt. Compared with the Maxwell embedded GPU, it is 4.9 times faster, dissipates 1.3 times as much power, and delivers 3.8 times better performance per watt. Thus, our method is suitable for embedded computer systems.
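To illustrate why binarization lets an XNOR circuit replace a multiplier, here is a minimal Python sketch of a binarized dot product. The bit encoding (1 for +1, 0 for -1) and the function name `binarized_dot` are assumptions for this sketch, not the paper's implementation; on the FPGA the same computation corresponds to an XNOR array followed by a popcount.

```python
def binarized_dot(x_bits: int, w_bits: int, n: int) -> int:
    """Dot product of two {+1, -1} vectors packed as n-bit integers.

    Encoding (an assumption of this sketch): bit 1 stands for +1,
    bit 0 for -1. XNOR marks the positions where the two factors
    agree (product +1); everywhere else the product is -1, so the
    sum is matches - (n - matches) = 2 * matches - n.
    """
    mask = (1 << n) - 1                          # keep only the n used bits
    matches = bin(~(x_bits ^ w_bits) & mask).count("1")  # popcount of XNOR
    return 2 * matches - n

# Example: x = (+1, -1, +1, +1), w = (+1, +1, -1, +1),
# packed LSB-first as 0b1101 and 0b1011; the true dot product is 0.
assert binarized_dot(0b1101, 0b1011, 4) == 0
```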
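The batch-normalization-free claim rests on the observation that a sign activation applied after batch normalization can be replaced by a comparison against a precomputed threshold, which becomes an exact integer because the pre-activation is an integer popcount sum. The sketch below shows this standard folding under our own assumptions (parameter names mu, sigma, gamma, beta; gamma, sigma > 0); it is consistent with the abstract's "integer bias" but is not necessarily the paper's exact formulation.

```python
import math

def fold_bn_to_int_threshold(mu: float, sigma: float,
                             gamma: float, beta: float) -> int:
    """Fold batch normalization into an integer threshold.

    For a sign activation, sign(gamma * (x - mu) / sigma + beta) with
    gamma, sigma > 0 equals sign(x - thr), where
    thr = mu - beta * sigma / gamma. Because the pre-activation x is
    an integer (an XNOR/popcount sum), x >= thr iff x >= ceil(thr),
    so the floating-point threshold becomes an exact integer bias.
    (Assumption of this sketch: gamma > 0; a negative gamma would
    flip the comparison direction.)
    """
    thr = mu - beta * sigma / gamma
    return math.ceil(thr)

def bn_free_activation(x: int, int_thr: int) -> int:
    """Binarized activation: +1 if x >= threshold, else -1."""
    return 1 if x >= int_thr else -1
```

Folding this way removes the per-activation multiply and add of batch normalization and the memory traffic for its parameters, which is the hardware saving the abstract describes.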