2023 Volume E106.D Issue 7 Pages 1230-1239
In industry, automatic speech recognition has come to be a competitive feature for embedded products with poor hardware resources. In this work, we propose a tiny end-to-end speech recognition model that is lightweight and easily deployable on edge platforms. First, instead of sophisticated network structures, such as recurrent neural networks, transformers, etc., the model we propose mainly uses convolutional neural networks as its backbone. This ensures that our model is supported by most software development kits for embedded devices. Second, we adopt the basic unit of MobileNet-v3, which performs well in computer vision tasks, and integrate the features of the hidden layer at different scales, thus compressing the number of parameters of the model to less than 1 M and achieving an accuracy greater than that of some traditional models. Third, in order to further reduce the CPU computation, we directly extract acoustic representations from 1-dimensional speech waveforms and use a self-supervised learning approach to encourage the convergence of the model. Finally, to solve some problems where hardware resources are relatively weak, we use a prefix beam search decoder to dynamically extend the search path with an optimized pruning strategy and an additional initialism language model to capture the probability of between-words in advance and thus avoid premature pruning of correct words. In our experiments, according to a number of evaluation categories, our end-to-end model outperformed several tiny speech recognition models used for embedded devices in related work.