CN117351939A - Multilingual voice recognition system and method
- Publication number: CN117351939A (application CN202311577487.5A)
- Authority: CN (China)
- Prior art keywords: language, acoustic model, acoustic, unified, model
- Prior art date: 2023-11-24
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
All classifications fall under G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING; G10L15/00—Speech recognition:
- G10L15/005—Language recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/04—Segmentation; Word boundary detection: G10L15/05—Word boundary detection
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice: G10L15/063—Training
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L15/26—Speech to text systems
Abstract
A multilingual voice recognition system and method belonging to the technical field of voice recognition. Independent acoustic models and language models are first constructed for multiple languages; the modeling units of the acoustic models are then linearly projected onto a universal mixed modeling unit for output. The independently constructed per-language language models are combined with the unified modeling unit to construct the network (such as a WFST network) required for decoding by the multilingual voice recognition system, and the per-language networks are then decoded in parallel. The system and method realize multilingual voice recognition by combining a unified acoustic modeling unit with independent per-language language models.
Description
Technical Field
The invention belongs to the technical field of intelligent voice interaction, and particularly relates to a multilingual voice recognition system and method.
Background
With the development of society and electronic information technology, artificial intelligence products have become indispensable in people's lives, such as smart speakers, smart cars, smart televisions and smart air conditioners, and many other devices and applications are rapidly evolving toward intelligence. At the same time, users' functional requirements for artificial intelligence products are rich and varied: beyond the basic functions, they expect video entertainment, life services, interconnection and more.
As intelligent applications grow richer and AI technology develops, consumers' understanding of intelligent products keeps rising, and product experience is shifting from single functions to more intelligent scenarios, so many product functions need to be redefined. Meanwhile, great progress in artificial intelligence, 5G, human-machine interaction devices, operating systems and related technologies is accelerating the development of intelligent applications to meet users' continuously growing awareness and demands.
In modern human-machine dialogue systems, speech recognition is a fundamental capability. As more and more devices provide voice interaction, the languages users speak also become diverse. A user may interact with a smart device in multiple languages, so the device must be able to automatically recognize utterances in multiple languages. Automatic multilingual recognition gives users a more natural and simpler way to interact.
If, following the conventional art, separate models are built for each individual language and for mixed languages, and multilingual utterances are then recognized separately, the number of models grows and optimization and maintenance become difficult.
Two main approaches to supporting multilingual speech recognition are known in the prior art:
(1) Unifying the modeling units: the modeling units of different languages are mapped into the same modeling unit, and a language-independent speech recognition system is constructed. See in particular: HU Wenxuan, et al. Multi-lingual Speech Recognition Research Based on End-to-end Model [J]. Journal of Signal Processing, 2021, 37(10): 1816-1824. DOI: 10.16798/j.issn.1003-0530.2021.10.004.
(2) Constructing a multilingual network model: different languages share some parameters, and through this sharing a common representation across languages can be discovered. See in particular: Alec Radford, et al. Robust Speech Recognition via Large-Scale Weak Supervision. arXiv:2212.04356v1, 2022.
Method (1) cannot effectively optimize specific languages independently, and its overall effect is generally worse than that of method (2); however, method (1) can achieve good results for many low-resource languages that lack large amounts of training data. Method (2) typically trains each language's network model independently as a pre-trained language sub-network, then adds a multi-language shared sub-network structure connecting the languages, and trains on this basis using mixed data from all languages.
The multilingual recognition systems in the prior art face the following problems:
(1) Multiple models must be provided to recognize different mixed-language combinations, such as a Chinese-English model, a Chinese model, a Cantonese model, a Mandarin-Cantonese model and so on, and these occupy a large amount of storage space.
(2) In the performance optimization stage, the relevant components of each model must be optimized separately. For example, when optimizing Chinese performance, the Chinese-English model, the Chinese model and the Mandarin-Cantonese model all need to be optimized.
Disclosure of Invention
The invention provides a system and method in which independent acoustic models and language models are first built for each language; the modeling units of the acoustic models are then linearly projected onto a universal mixed modeling unit for output. The independently built per-language language models are combined with the unified modeling unit to construct the network (such as a WFST network) required for decoding by the multilingual speech recognition system, and the per-language networks are then decoded in parallel. The multilingual speech recognition system is realized by combining a unified acoustic modeling unit with independent per-language language models.
The invention provides a multilingual recognition system, which comprises a multilingual model selection module, an acoustic model library, a language model library, an endpoint detection device, a recognition engine and an output unit; wherein,
the multilingual model selection module is used for selecting, from the acoustic model library, one acoustic model for the corresponding languages according to the language range input by a user, and selecting, from the language model library, one or more language models for the corresponding languages;
the endpoint detection device is used for, after the user speaks, finding the start and end positions of the user speech and transmitting the detected user speech to the recognition engine;
the recognition engine comprises a feature extraction unit, an acoustic similarity calculation unit, a unified acoustic model unit, a language decoder group and a result arbitration unit;
the feature extraction unit is used for extracting features of the detected user speech;
the unified acoustic model unit is used for constructing and storing the unified acoustic model;
the acoustic similarity calculation unit is used for calculating the acoustic similarity of the features extracted by the feature extraction unit using the selected acoustic model, and projecting the acoustic similarity calculation result onto the unified acoustic model according to the correspondence between the acoustic model and the unified acoustic model;
the language decoder group is used for performing decoding, based on the unified acoustic model, with a plurality of decoders corresponding to the plurality of language models, to obtain each language's recognition result;
and the result arbitration unit is used for comprehensively selecting a final result from the plurality of recognition results produced for the plurality of language models in the language decoder group.
Further, when the unified acoustic model is constructed, each acoustic model in the acoustic model library is projected onto the unified acoustic model by way of linear projection.
The invention provides a multilingual recognition method, which is characterized in that,
selecting an acoustic model from the acoustic model library according to the language range input by the user;
selecting one or more language models in a language model library according to the language range input by the user;
extracting features of user speech detected by an end point detection device, and then calculating acoustic similarity of the extracted features by using the acoustic model;
projecting the acoustic similarity calculation result onto the region of the unified acoustic model corresponding to the acoustic model;
based on the unified acoustic model, decoding by using a language decoder corresponding to the language model to obtain recognition results of various languages;
and carrying out result arbitration on the recognition results of the various languages to obtain an optimal result.
Further, when the unified acoustic model is constructed, each acoustic model in the acoustic model library is projected onto the unified acoustic model by way of linear projection.
The invention also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of the invention.
The present invention also provides an electronic device including:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform the steps of the methods of the present invention.
The system and the method can effectively solve the problem of resource occupation of the multilingual recognition system and the problem of multi-model resource management.
Taking four languages (Mandarin, English, Cantonese, Sichuanese) with a 400 M single-language model as an example, the user may speak any language within the supported range. Covering every language combination, a prior-art system needs 400 M × 4 + 800 M × 6 + 1200 M × 4 + 1600 M = 12800 M of storage space, with a maximum working memory of 1600 M. With the system and method of the invention, the required storage space is 400 M × 4 = 1600 M, and the maximum memory is likewise 400 M × 4 = 1600 M. The comparison shows that the method of the invention greatly reduces resource occupation.
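The storage arithmetic can be checked with a short script; the only assumption, taken from the example itself, is that a model covering k languages occupies k × 400 M:

```python
from itertools import combinations

LANGS = ["Mandarin", "English", "Cantonese", "Sichuanese"]
SINGLE_MODEL_MB = 400  # assumed size of one single-language model

# Prior art: one model per non-empty language combination, sized
# proportionally to the number of languages it covers.
prior_art_total = sum(
    len(combo) * SINGLE_MODEL_MB
    for r in range(1, len(LANGS) + 1)
    for combo in combinations(LANGS, r)
)
prior_art_peak = len(LANGS) * SINGLE_MODEL_MB   # the all-language model

# Proposed system: one model per language, all sharing the unified
# modeling unit and loaded together at recognition time.
proposed_total = len(LANGS) * SINGLE_MODEL_MB
proposed_peak = proposed_total

print(prior_art_total, prior_art_peak)  # 12800 1600
print(proposed_total, proposed_peak)    # 1600 1600
```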
Drawings
For a more complete understanding of the present invention, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a schematic diagram of a multilingual recognition system in the prior art;
FIG. 2 is a schematic diagram of a multilingual recognition system according to the present invention.
Detailed Description
To make the problems to be solved, the technical solutions, and the technical effects of the present invention clearer, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings in the embodiments. It is apparent that the described embodiments are only some embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments and examples that a person skilled in the art would obtain without making any inventive effort fall within the scope of the invention.
Before describing the exemplary embodiments of the present invention in more detail, it should be noted that, although the flow charts of the present invention describe the operations as a sequential process, many of the steps may be performed in parallel or concurrently, and the order of the operations may be rearranged as desired. In addition, a process may terminate when its operations are completed, but may also include additional steps not shown in the drawings.
The prior-art multilingual recognition system is shown in fig. 1: input audio passes through endpoint detection (Voice Activity Detection, VAD) and a decoder to produce the text result, and the multilingual model selection module selects and loads the corresponding model according to the language range input by the user.
The invention provides a multilingual recognition system which is optimized on the basis of the multilingual recognition system in the prior art, as shown in fig. 2.
Specifically, the multilingual recognition system provided by the invention comprises: a multilingual model selection module, an acoustic model library, a language model library, an endpoint detection device, a recognition engine, an output unit, and the like.
The multilingual model selection module is configured to select, according to the language range input by the user, one acoustic model M1 in the acoustic model library corresponding to the language list, and to select, in the language model library, a plurality of language models L1, L2, ... corresponding to the language list.
For example, if the user selects Chinese and English as the current language range, a Chinese-English acoustic model M1 is selected, together with two language models: Chinese L1 and English L2.
The endpoint detection device is used for, after the user speaks, locating the start and end positions of the audio signal and transmitting the detected user speech to the recognition engine.
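The patent does not prescribe a particular endpoint detection algorithm. As a minimal, hedged illustration, a frame-level VAD such as the open-source webrtcvad package can locate the start and end of speech; the frame length and aggressiveness below are arbitrary choices, not values from the patent:

```python
import webrtcvad

def detect_endpoints(pcm16: bytes, sample_rate: int = 16000,
                     frame_ms: int = 30, aggressiveness: int = 2):
    """Return (start, end) byte offsets of the voiced region, or None.

    pcm16 must be 16-bit mono PCM; webrtcvad accepts frames of 10, 20,
    or 30 ms at 8/16/32/48 kHz.
    """
    vad = webrtcvad.Vad(aggressiveness)
    frame_bytes = sample_rate * frame_ms // 1000 * 2  # 2 bytes per sample
    voiced_offsets = [
        i for i in range(0, len(pcm16) - frame_bytes + 1, frame_bytes)
        if vad.is_speech(pcm16[i:i + frame_bytes], sample_rate)
    ]
    if not voiced_offsets:
        return None
    return voiced_offsets[0], voiced_offsets[-1] + frame_bytes
```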
The recognition engine comprises a feature extraction unit, an acoustic similarity calculation unit, a unified acoustic model unit, a language decoder group and a result arbitration unit.
The feature extraction unit is used for extracting features from the detected user speech.
The unified acoustic model unit is used for constructing and storing the unified acoustic model. When the unified acoustic model is constructed, each acoustic model in the acoustic model library can be projected onto the unified acoustic model by way of linear projection.
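A minimal sketch of such a unified model follows, assuming the simplest possible linear projection, namely placing each acoustic model's output states into a reserved contiguous region of one unified output vector; the class name and interface are illustrative, not taken from the patent:

```python
import numpy as np

class UnifiedAcousticModel:
    """Unified modeling unit: one output region per acoustic model.

    Each model owns a contiguous region [offset, offset + n_states) of the
    unified output vector. The projection used here is the simplest linear
    one (placement into the region, i.e. an identity projection); a learned
    per-model projection matrix W would slot into project() the same way.
    """

    def __init__(self, state_counts: dict[str, int]):
        self.regions = {}
        offset = 0
        for name, n_states in state_counts.items():
            self.regions[name] = (offset, offset + n_states)
            offset += n_states
        self.total_states = offset

    def project(self, name: str, posteriors: np.ndarray) -> np.ndarray:
        """Place one model's frame-level scores into the unified vector."""
        lo, hi = self.regions[name]
        unified = np.zeros(self.total_states)
        unified[lo:hi] = posteriors  # identity placement; could be W @ posteriors
        return unified
```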
The acoustic similarity calculation unit is used for calculating the acoustic similarity of the features extracted by the feature extraction unit using the selected acoustic model M1, and projecting the acoustic similarity calculation result onto the unified acoustic model according to the correspondence between acoustic model M1 and the unified acoustic output.
The unified acoustic output includes the output of acoustic model M1; the outputs of the other acoustic models M2, M3, ... are included as well, so that the unified acoustic model contains acoustic output from every language.
For example, if acoustic model M1 is an English model whose N output states are numbered [0, N], and the English region of the unified acoustic model is [K, K+N], then the similarity scores of acoustic model M1 are used to assign values to the [K, K+N] region of the unified acoustic model unit.
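Continuing the sketch above, assigning M1's similarity scores to its reserved region [K, K+N] might look like this (the state counts are made-up numbers):

```python
import numpy as np

# Made-up state counts for illustration:
uam = UnifiedAcousticModel({"mandarin": 9000, "english": 6000})
K, _ = uam.regions["english"]          # here K = 9000
m1_scores = np.random.rand(6000)       # one frame of similarities from M1
unified = uam.project("english", m1_scores)
assert (unified[K:K + 6000] == m1_scores).all()
```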
Compared with the prior art: in the prior-art system of fig. 1, only the acoustic similarity of a single acoustic model is computed, and decoding uses that result directly; in the present invention, as shown in fig. 2, after the acoustic similarity calculation the result must additionally be mapped onto the unified acoustic model unit.
The language decoder group is used for decoding based on the unified acoustic model, with one decoder per language, producing each language's recognition result.
Compared with the prior art: in fig. 1 only one language model is involved, so a single decoder suffices; in the present invention, as shown in fig. 2, there are multiple language models, so multiple corresponding decoders are provided, forming a multilingual decoder group. Because these decoders share the same unified acoustic model, techniques such as pruning can be applied effectively across the decoder group.
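One plausible reading of the decoder group is sketched below; the decode() interface is a hypothetical stand-in for, e.g., a WFST decoder compiled from one language's LM over the unified modeling unit, and is assumed to return a (text, acoustic_log_prob) pair:

```python
def decode_parallel(unified_scores_seq, decoders):
    """Decode one shared unified score sequence with one decoder per language.

    decoders maps language -> a decoder object whose (hypothetical) decode()
    method consumes frames of unified scores and returns a
    (text, acoustic_log_prob) pair. Because every decoder indexes into the
    same unified vectors, acoustic scoring runs once per frame rather than
    once per language, and pruning thresholds can be shared.
    """
    return {lang: dec.decode(unified_scores_seq) for lang, dec in decoders.items()}
```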
The result arbitration unit is configured to comprehensively select a final result, based on acoustic probability, language prior information, and the like, from the recognition results produced for the language models L1, L2, ... in the language decoder group.
Typically, the most similar result is selected using the probability information output by the unified acoustic model; the result arbitration unit can further adjust this choice.
For example, if the user speaks the English word "young", the Chinese recognizer may output 杨 ("poplar"), a near-homophone; the result arbitration unit can then select the correct final result according to context.
Compared with the prior art: in fig. 1 there is only one model, so its output is the final result; in the present invention, as shown in fig. 2, a result arbitration module is included because there are multiple language models. This makes it easier to apply per-language prior information and to select the final result by combining the individual recognition results.
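A hedged sketch of the arbitration rule follows. The patent only says the final result is selected according to acoustic probability, language prior information, and the like, so the log-domain combination below is an illustrative choice, not the claimed formula:

```python
def arbitrate(hypotheses, language_priors):
    """Pick the final result from per-language hypotheses.

    hypotheses maps language -> (text, acoustic_log_prob); language_priors
    maps language -> log-domain prior weight (e.g., from usage history or
    dialogue context). Both structures are illustrative assumptions.
    """
    return max(
        hypotheses.items(),
        key=lambda item: item[1][1] + language_priors.get(item[0], 0.0),
    )
```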
The output unit is used for outputting the final result obtained by the result arbitration unit.
Corresponding to the multilingual recognition system above, the invention provides a multilingual recognition method, which specifically comprises the following steps:
step S01, selecting an acoustic model M1 from the acoustic model library according to the language range input by the user.
According to the language range list input by the user, the best acoustic model M1 supporting that list is found in the acoustic model library.
For example, if the user selects Chinese and English as the current range, a Chinese-English acoustic model is selected.
Step S02, selecting one or more language models L1, L2, … … from the language model library according to the language range input by the user.
According to the language range list input by the user, the matching language models L1, L2, ... are found in the language model library.
For example, if the user selects Chinese and English as the current range, the Chinese and English language models are selected.
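Steps S01 and S02 amount to a lookup over the model libraries, sketched below. The ranking rule (prefer the smallest acoustic model that covers the requested range) is an assumption; the patent only calls M1 "the best acoustic model supporting the list":

```python
def select_models(language_range, acoustic_library, language_library):
    """Steps S01/S02: pick one acoustic model and per-language LMs.

    acoustic_library maps a frozenset of languages -> acoustic model;
    language_library maps a single language -> language model. "Best" is
    taken here to mean the smallest model covering the requested range.
    """
    requested = frozenset(language_range)
    covering = [langs for langs in acoustic_library if requested <= langs]
    if not covering:
        raise ValueError(f"no acoustic model covers {sorted(requested)}")
    m1 = acoustic_library[min(covering, key=len)]
    lms = {lang: language_library[lang] for lang in requested}
    return m1, lms

# e.g. m1, lms = select_models({"chinese", "english"}, acoustic_lib, language_lib)
```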
Step S03, performing feature extraction on the user speech detected by endpoint detection, and then calculating the acoustic similarity of the extracted features using the acoustic model M1 selected in step S01.
That is, after the user speaks, feature extraction is performed on the speech detected by endpoint detection, and acoustic similarity is then computed on those features with the selected acoustic model M1.
And step S04, projecting the acoustic similarity calculation result to a region of the unified acoustic model corresponding to the acoustic model.
The result of step S03 is projected onto the unified acoustic model unit according to the correspondence between acoustic model M1 and the unified acoustic output.
For example, if acoustic model M1 is an English model whose N output states are numbered [0, N], and the English region of the unified acoustic model unit is [K, K+N], then the similarity scores of acoustic model M1 are used to assign values to the [K, K+N] region of the unified acoustic model unit.
Step S05, decoding, based on the unified acoustic model, with the language decoders corresponding to the language models, to obtain the recognition results of the various languages.
Using the result of step S04, each language's decoder performs decoding, so that the recognition result of each language is obtained.
Step S06, carrying out result arbitration on the recognition results of the various languages to obtain the optimal result.
From the result of step S05, a plurality of recognition results corresponding to the language models L1, L2, ... are obtained; the final result is comprehensively selected according to acoustic probability, language prior information, and the like, and is output.
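Putting steps S01 through S06 together, the whole method might be composed as follows. extract_features(), score(), region_id, and build_decoder() are hypothetical interfaces (the patent describes the data flow, not these APIs), and the unified model's regions are assumed to be keyed by an acoustic model's region identifier:

```python
def recognize(pcm16, language_range, acoustic_lib, language_lib, uam, priors):
    """End-to-end sketch of steps S01-S06, composing the snippets above."""
    m1, lms = select_models(language_range, acoustic_lib, language_lib)   # S01, S02
    span = detect_endpoints(pcm16)                                        # endpoint detection
    if span is None:
        return None                                                       # no speech found
    feats = m1.extract_features(pcm16[span[0]:span[1]])                   # S03: features (hypothetical)
    frame_scores = m1.score(feats)                                        # S03: similarities (hypothetical)
    unified = [uam.project(m1.region_id, s) for s in frame_scores]        # S04: projection
    decoders = {lang: lm.build_decoder(uam) for lang, lm in lms.items()}  # S05: per-language decoders
    hyps = decode_parallel(unified, decoders)                             # S05: parallel decoding
    lang, (text, _) = arbitrate(hyps, priors)                             # S06: arbitration
    return lang, text
```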
Furthermore, the present invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, realizes the steps of the method according to the present invention.
The present invention also provides an electronic device including:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform the steps of the methods of the present invention.
It is noted that portions of the present invention may be implemented as a computer program product, such as computer program instructions, which, when executed by an intelligent electronic device (such as a smart mobile phone or a tablet computer), may invoke or provide methods and/or solutions according to the present invention through the operation of that device. Program instructions invoking the methods of the invention may be stored in fixed or removable recording media, and/or transmitted via a data stream in a broadcast or other signal-bearing medium, and/or stored in the working memory of an intelligent electronic device operating according to the program instructions. An embodiment according to the invention comprises an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the apparatus to operate the methods and/or solutions according to the embodiments of the invention described above.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from its spirit or essential characteristics. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means, in software or hardware. The terms first, second, etc. are used to denote names, not any particular order.
Claims (6)
1. A multilingual recognition system comprises a multilingual model selection module, an acoustic model library, a language model library, an endpoint detection device, a recognition engine and an output unit; wherein,
the multilingual model selection module is used for selecting, from the acoustic model library, one acoustic model for the corresponding languages according to the language range input by a user, and selecting, from the language model library, one or more language models for the corresponding languages;
the endpoint detection device is used for, after the user speaks, finding the start and end positions of the user speech and transmitting the detected user speech to the recognition engine;
the recognition engine comprises a feature extraction unit, an acoustic similarity calculation unit, a unified acoustic model unit, a language decoder group and a result arbitration unit;
the feature extraction unit is used for extracting features of the detected user speech;
the unified acoustic model unit is used for constructing and storing the unified acoustic model;
the acoustic similarity calculation unit is used for calculating the acoustic similarity of the features extracted by the feature extraction unit using the selected acoustic model, and projecting the acoustic similarity calculation result onto the unified acoustic model according to the correspondence between the acoustic model and the unified acoustic model;
the language decoder group is used for performing decoding, based on the unified acoustic model, with a plurality of decoders corresponding to the plurality of language models, to obtain each language's recognition result;
and the result arbitration unit is used for comprehensively selecting a final result from the plurality of recognition results produced for the plurality of language models in the language decoder group.
2. The system of claim 1, wherein,
when the unified acoustic model is constructed, each acoustic model in the acoustic model library is projected onto the unified acoustic model by way of linear projection.
3. A multilingual recognition method is characterized in that,
selecting an acoustic model from the acoustic model library according to the language range input by the user;
selecting one or more language models in a language model library according to the language range input by the user;
extracting features of user speech detected by an end point detection device, and then calculating acoustic similarity of the extracted features by using the acoustic model;
projecting the acoustic similarity calculation result onto the region of the unified acoustic model corresponding to the acoustic model;
based on the unified acoustic model, decoding by using a language decoder corresponding to the language model to obtain recognition results of various languages;
and carrying out result arbitration on the recognition results of the various languages to obtain an optimal result.
4. The method of claim 3, wherein,
when the unified acoustic model is constructed, each acoustic model in the acoustic model library is projected onto the unified acoustic model by way of linear projection.
5. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of claim 3 or 4.
6. An electronic device, comprising:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform the steps of the method of claim 3 or 4.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202311577487.5A | 2023-11-24 | 2023-11-24 | Multilingual voice recognition system and method |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN117351939A | 2024-01-05 |
Family

ID: 89369604

| Application Number | Title | Priority Date | Filing Date | Status |
|---|---|---|---|---|
| CN202311577487.5A | Multilingual voice recognition system and method | 2023-11-24 | 2023-11-24 | Pending |
Legal Events

| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |