
CN114283788A - Pronunciation evaluation method, training method, device and equipment of pronunciation evaluation system - Google Patents

Pronunciation evaluation method, training method, device and equipment of pronunciation evaluation system

Info

Publication number
CN114283788A
Authority
CN
China
Prior art keywords
pronunciation
audio
phoneme
language
evaluated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011034404.4A
Other languages
Chinese (zh)
Inventor
石睿亨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yimu Intelligent Technology Co ltd
Huawei Technologies Co Ltd
Original Assignee
Yimu Intelligent Technology Co ltd
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yimu Intelligent Technology Co ltd, Huawei Technologies Co Ltd filed Critical Yimu Intelligent Technology Co ltd
Priority to CN202011034404.4A priority Critical patent/CN114283788A/en
Publication of CN114283788A publication Critical patent/CN114283788A/en
Pending legal-status Critical Current

Links

Images

Landscapes

  • Electrically Operated Instructional Devices (AREA)

Abstract

The application discloses a pronunciation evaluation method, a training method of a pronunciation evaluation system, a pronunciation evaluation device and pronunciation evaluation equipment, and belongs to the technical field of artificial intelligence. The method comprises the following steps: acquiring pronunciation audio to be evaluated; extracting phoneme information of the pronunciation audio to be evaluated; obtaining the similarity between the pronunciation audio to be evaluated and the corresponding teacher pronunciation audio based on the phoneme information of the two; and determining the pronunciation score of the pronunciation audio to be evaluated based on the similarity. In this technical solution, the similarity between a student pronunciation in any language and its corresponding teacher pronunciation is determined by comparing their universal phonemes, which are applicable to multiple languages, and the student pronunciation is scored accordingly. Multi-language pronunciation evaluation therefore no longer requires a separate model per language, which effectively reduces the complexity of pronunciation evaluation and saves storage resources.

Description

Pronunciation evaluation method, training method of pronunciation evaluation system, pronunciation evaluation device and pronunciation evaluation equipment
Technical Field
The application relates to the technical field of artificial intelligence, and in particular to a pronunciation evaluation method, a training method of a pronunciation evaluation system, a pronunciation evaluation device and pronunciation evaluation equipment.
Background
In online learning scenarios such as foreign-language learning, evaluating a user's pronunciation is one of the most common functions provided to users.
Currently, a commonly used method for evaluating user pronunciation is as follows: for the learning scenarios of different languages, different pronunciation evaluation algorithms or models are designed, and pronunciation audio in each language is scored by its own algorithm or model.
This method of evaluating user pronunciation has high complexity and consumes a large amount of storage resources.
Disclosure of Invention
The embodiment of the application provides a pronunciation evaluation method, a training method of a pronunciation evaluation system, a pronunciation evaluation device and pronunciation evaluation equipment, which can realize multi-language pronunciation evaluation, effectively reduce the complexity of pronunciation evaluation and save storage resources. The technical solutions are as follows:
according to an aspect of an embodiment of the present application, there is provided a pronunciation evaluation method, including:
acquiring a pronunciation audio to be evaluated;
extracting phoneme information of the pronunciation audio to be evaluated, wherein the phoneme information comprises universal phonemes of each pronunciation in the pronunciation audio, and the universal phonemes are phonemes obtained by integrating special phonemes of a plurality of languages;
obtaining the similarity between the pronunciation audio to be evaluated and the teacher pronunciation audio based on the phoneme information of the pronunciation audio to be evaluated and the phoneme information of the teacher pronunciation audio corresponding to the pronunciation audio to be evaluated;
and determining the pronunciation score of the pronunciation audio to be evaluated based on the similarity.
The similarity between a student pronunciation in any language and its corresponding teacher pronunciation is determined by comparing their universal phonemes, which are applicable to multiple languages, and the student pronunciation is scored accordingly. Multi-language pronunciation evaluation is thus realized without multiple models, which effectively reduces the complexity of pronunciation evaluation and saves storage resources.
In one possible design, the extracting the phoneme information of the pronunciation audio to be evaluated includes:
dividing the pronunciation audio to be evaluated into at least one audio frame;
acquiring the frequency domain characteristics of each audio frame in the pronunciation audio to be evaluated;
processing the frequency domain characteristics of each audio frame in the pronunciation audio to be evaluated through a multi-language phoneme detection model to obtain phoneme information of the pronunciation audio to be evaluated, wherein the phoneme information of the pronunciation audio to be evaluated comprises probability distribution of each audio frame in the pronunciation audio to be evaluated, and the probability distribution reflects the probability of pronunciation of the audio frame on each universal phoneme;
the multi-language phoneme detection model is a machine learning model used for determining universal phonemes corresponding to all audio frames in pronunciation audio.
In this way, the probability of each audio frame on each universal phoneme is detected, and the phoneme information is expressed in a quantified form.
In one possible design, the processing the frequency domain characteristics of each audio frame in the pronunciation audio to be evaluated through the multi-language phoneme detection model to obtain the phoneme information of the pronunciation audio to be evaluated includes:
acquiring a target language of the pronunciation audio to be evaluated;
selecting target model parameters corresponding to the target language; the multi-language phoneme detection model is provided with a plurality of groups of model parameters, and different model parameters correspond to different languages;
and processing the frequency domain characteristics of each audio frame in the pronunciation audio to be evaluated through the multi-language phoneme detection model according to the target model parameters to obtain the phoneme information of the pronunciation audio to be evaluated.
In this way, phoneme recognition is performed with model parameters specific to each language, making phoneme detection more targeted and improving the accuracy of multi-language phoneme detection.
In one possible design, the obtaining the similarity between the pronunciation audio to be evaluated and the teacher pronunciation audio based on the phoneme information of the pronunciation audio to be evaluated and the phoneme information of the teacher pronunciation audio corresponding to the pronunciation audio to be evaluated includes:
performing dynamic time warping processing on the distribution probability vectors of the audio frames in the pronunciation audio to be evaluated and the distribution probability vectors of the audio frames in the teacher pronunciation audio to obtain a time sequence matching result, wherein the time sequence matching result is a combination of matched distribution probability vectors obtained by aligning the pronunciation audio to be evaluated and the teacher pronunciation audio in time sequence according to the same universal phonemes;
and obtaining the similarity between the pronunciation audio to be evaluated and the teacher pronunciation audio based on the similarity between the matched distribution probability vectors determined by the time sequence matching result.
In this way, the distribution probability vectors are accurately matched.
In one possible design, the determining the pronunciation score of the pronunciation audio to be evaluated based on the similarity includes:
and determining the pronunciation score of the pronunciation audio to be evaluated according to the similarity through a scoring model, wherein the scoring model is a regression model used for quantitatively describing the statistical relationship between the similarity and the pronunciation score.
In this way, pronunciation scores are output accurately.
According to an aspect of an embodiment of the present application, there is provided a method for training a pronunciation evaluation system, the method including:
acquiring a first training sample set and a second training sample set, wherein the first training sample set comprises at least one first training sample, sample data of the first training sample comprises teacher pronunciation audio in any language, tag data of the first training sample comprises universal phonemes of each pronunciation in the teacher pronunciation audio, the second training sample set comprises at least one second training sample, sample data of the second training sample comprises teacher pronunciation audio and student pronunciation audio for the same language, and tag data of the second training sample comprises pronunciation scores of the student pronunciation audio;
training the multi-language phoneme detection model based on the first training sample set, wherein the multi-language phoneme detection model is used for extracting phoneme information of pronunciation audio, the phoneme information comprises universal phonemes of all pronunciations in the pronunciation audio, and the universal phonemes are phonemes obtained by integrating special phonemes of multiple languages;
extracting phoneme information of the second training sample through the multi-language phoneme detection model, wherein the phoneme information of the second training sample comprises phoneme information of the teacher pronunciation audio and phoneme information of the student pronunciation audio in the second training sample;
obtaining the similarity of the teacher pronunciation audio and the student pronunciation audio based on the phoneme information of the second training sample;
and fitting the scoring model according to the similarity and the pronunciation score of the student pronunciation audio, wherein the scoring model is used for scoring the audio to be evaluated.
The multi-language phoneme detection model is trained on sample audio labeled with universal phoneme information, so that it can accurately detect the universal phoneme information corresponding to the teacher pronunciation and the student pronunciation of the same content. On this basis, the similarity between the teacher pronunciation and the student pronunciation is calculated, and a regression model is fitted to describe the statistical relationship between that similarity and the score labels carried by the student pronunciations, finally completing the training of the pronunciation evaluation system. This expands the application range of the model, realizes a multi-language user pronunciation evaluation function, effectively reduces the complexity of pronunciation evaluation, and saves storage resources.
In one possible design, the training the multi-lingual phoneme detection model based on the first training sample set includes:
training an initialized multi-lingual phoneme detection model based on the first training sample set, the initialized multi-lingual phoneme detection model being a pre-training model for detecting universal phonemes;
for a target language, acquiring a first training sample of the target language from the first training sample set;
adjusting the model parameters of the initialized multi-language phoneme detection model by adopting the first training sample of the target language to obtain model parameters corresponding to the target language;
and obtaining the multi-language phoneme detection model based on the model parameters respectively corresponding to the languages.
In this way, the model training process is simplified while the accuracy of the trained model is ensured.
In a possible design, the adjusting the model parameters of the initialized multi-language phoneme detection model by using the first training sample of the target language to obtain the model parameters corresponding to the target language includes:
fixing parameters of an intermediate layer in the initialized multi-language phoneme detection model;
replacing the full-pass layers at the input end and the output end of the initialized multi-language phoneme detection model with fully connected layers, wherein a full-pass layer does not perform any processing on data, and a fully connected layer is a language-related layer for mining language features;
and adjusting the model parameters of the full connection layer by adopting the first training sample of the target language to obtain the model parameters corresponding to the target language.
In this way, only the model parameters at the input and output ends of the pre-trained model are fine-tuned, which improves the detection accuracy of the model and simplifies the training process.
In one possible design, the obtaining a first training sample set includes:
acquiring original tag data of the teacher pronunciation audio of each language, wherein the original tag data comprises the special phoneme of each pronunciation in the teacher pronunciation audio, and a special phoneme is a phoneme specific to one language;
acquiring a universal phoneme set, wherein the universal phoneme set is a set formed by the universal phonemes;
and replacing the special phonemes of each pronunciation corresponding to each audio frame in the teacher pronunciation audio based on the universal phoneme set to obtain the label data of the first training sample.
In this way, training samples that use universal phonemes as label data are obtained, providing sample support for model training.
In one possible design, the obtaining a universal phone set includes:
respectively training the phoneme recognizers of the languages based on the teacher pronunciation audio of the languages and the original label data of the teacher pronunciation audio of the languages;
sending the teacher pronunciation audio of each language to a phoneme recognizer of each language, and determining similar phonemes in each language, wherein the similar phonemes refer to special phonemes with similar pronunciations in each language;
and binding and merging the similar phonemes into the universal phoneme to obtain the universal phoneme set.
In this way, a universal phoneme set applicable to multiple languages is obtained, providing a basis for evaluating pronunciation quality in multiple languages simultaneously.
In one possible design, the sending the teacher pronunciation audio of each language to the phoneme recognizer of each language to determine similar phonemes in the languages comprises:
for a target language, sending the teacher pronunciation audio of the target language to the phoneme recognizer of the target language to obtain probability values of each audio frame in the teacher pronunciation audio of the target language under each special phoneme of the target language;
determining the special phoneme corresponding to each audio frame based on the probability values of the audio frame under the special phonemes of the target language;
inputting an audio frame corresponding to a first special phoneme of a first language into the phoneme recognizer of a second language to obtain probability values of that audio frame under each special phoneme of the second language, wherein the first language and the second language are different languages;
taking the probability value of that audio frame under each special phoneme of the second language as the inverse of the distance from the first special phoneme to that special phoneme of the second language;
and determining similar phonemes in the languages through a clustering algorithm based on the distances.
In this way, phonemes with the same or similar pronunciations can be accurately identified among the special phonemes of the various languages.
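Purely as an illustration of this clustering step (not part of the patent disclosure), the following Python sketch clusters phonemes over a precomputed cross-language distance matrix; it assumes a recent scikit-learn, and the phoneme labels, distance values and threshold are all hypothetical placeholders:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical: dist[i][j] is derived from 1 / p, where p is the probability
# assigned by one language's recognizer to phoneme j for audio frames of
# another language's phoneme i, symmetrized into a valid distance matrix.
dist = np.array([
    [0.0, 0.8, 5.0],
    [0.8, 0.0, 4.2],
    [5.0, 4.2, 0.0],
])
phonemes = ["en:/ae/", "fr:/a/", "en:/sh/"]  # illustrative labels

clusterer = AgglomerativeClustering(
    n_clusters=None, metric="precomputed",
    linkage="average", distance_threshold=2.0)
labels = clusterer.fit_predict(dist)
# Phonemes sharing a cluster label would be merged into one universal phoneme.
print(dict(zip(phonemes, labels)))  # e.g. {'en:/ae/': 0, 'fr:/a/': 0, 'en:/sh/': 1}
```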
In one possible design, the obtaining of the original tag data of the teacher's pronunciation audio of each language includes:
acquiring a phoneme set of each language;
and aligning the teacher pronunciation audio of each language based on the phoneme set of each language to obtain the original label data of the teacher pronunciation audio of each language.
In this way, labels are added to the training samples automatically.
According to an aspect of an embodiment of the present application, there is provided a pronunciation evaluation device, including:
the audio acquisition module is used for acquiring the pronunciation audio to be evaluated;
the feature extraction module is used for extracting phoneme information of the pronunciation audio to be evaluated, wherein the phoneme information comprises universal phonemes of each pronunciation in the pronunciation audio, and the universal phonemes are phonemes obtained by integrating special phonemes of a plurality of languages;
the similarity determining module is used for obtaining the similarity between the pronunciation audio to be evaluated and the teacher pronunciation audio based on the phoneme information of the pronunciation audio to be evaluated and the phoneme information of the teacher pronunciation audio corresponding to the pronunciation audio to be evaluated;
and the score determining module is used for determining the pronunciation score of the pronunciation audio to be evaluated based on the similarity.
According to an aspect of an embodiment of the present application, there is provided a training apparatus for a pronunciation evaluation system, the apparatus including:
a sample acquisition module, configured to acquire a first training sample set and a second training sample set, wherein the first training sample set comprises at least one first training sample, sample data of the first training sample comprises teacher pronunciation audio in any language, label data of the first training sample comprises universal phonemes of each pronunciation in the teacher pronunciation audio, the second training sample set comprises at least one second training sample, sample data of the second training sample comprises teacher pronunciation audio and student pronunciation audio for the same language, and label data of the second training sample comprises a pronunciation score of the student pronunciation audio;
a model training module, configured to train a multi-language phoneme detection model based on the first training sample set, where the multi-language phoneme detection model is configured to extract phoneme information of pronunciation audio, the phoneme information includes universal phonemes of each pronunciation in the pronunciation audio, and the universal phonemes are phonemes obtained by integrating the special phonemes of multiple languages;
a feature extraction module, configured to extract phoneme information of the second training sample through the multi-language phoneme detection model, where the phoneme information of the second training sample includes phoneme information of the teacher pronunciation audio and phoneme information of the student pronunciation audio in the second training sample;
the similarity determining module is used for obtaining the similarity of the teacher pronunciation audio and the student pronunciation audio based on the phoneme information of the second training sample;
and the model fitting module is used for fitting the scoring model according to the similarity and the pronunciation score of the student pronunciation audio, and the scoring model is used for scoring the audio to be evaluated.
According to an aspect of the embodiments of the present application, there is provided a computer device, the computer device includes a processor and a memory, the memory stores a computer program, and the computer program is loaded and executed by the processor to implement the pronunciation assessment method.
According to an aspect of the embodiments of the present application, there is provided a computer device, the computer device includes a processor and a memory, the memory stores a computer program, and the computer program is loaded and executed by the processor to implement the training method of the pronunciation evaluation system.
According to an aspect of the embodiments of the present application, there is provided a computer-readable storage medium, in which a computer program is stored, and the computer program is loaded and executed by a processor to implement the pronunciation evaluation method.
According to an aspect of the embodiments of the present application, there is provided a computer-readable storage medium, in which a computer program is stored, and the computer program is loaded and executed by a processor to implement the above-mentioned training method of the pronunciation evaluation system.
According to an aspect of embodiments herein, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to enable the computer device to execute the pronunciation evaluation method.
According to an aspect of embodiments herein, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the computer device executes the training method of the pronunciation evaluation system.
The technical scheme provided by the embodiment of the application can bring the following beneficial effects:
the similarity between the student pronunciation and the teacher pronunciation is determined by comparing the universal phonemes suitable for multiple languages of the student pronunciation and the teacher pronunciation in any language, so that the score of the student pronunciation is obtained, the multi-language pronunciation evaluation is realized without multiple models, the pronunciation evaluation complexity is effectively reduced, and meanwhile, the storage resource is saved.
The multi-language phoneme detection model is trained on sample audio labeled with universal phoneme information, so that it can accurately detect the universal phoneme information corresponding to the teacher pronunciation and the student pronunciation of the same content. On this basis, the similarity between the teacher pronunciation and the student pronunciation is calculated, and a regression model is fitted to describe the statistical relationship between that similarity and the score labels carried by the student pronunciations, finally completing the training of the pronunciation evaluation system. This expands the application range of the model, realizes a multi-language user pronunciation evaluation function, effectively reduces the complexity of pronunciation evaluation, and saves storage resources.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a schematic diagram of an application execution environment provided by one embodiment of the present application;
FIG. 2 is a flow chart of a pronunciation assessment method according to an embodiment of the present application;
FIG. 3 is a flow chart of a pronunciation assessment method according to another embodiment of the present application;
FIG. 4 is a diagram illustrating an exemplary time sequence matching result;
FIG. 5 is a schematic diagram illustrating a pronunciation assessment system;
FIG. 6 is a flowchart of a training method of the pronunciation assessment system according to an embodiment of the present application;
FIG. 7 illustrates a diagram of a multilingual phoneme detection model;
FIG. 8 is a flow chart of a method of obtaining a first set of training samples provided by an embodiment of the present application;
FIG. 9 is a block diagram of a pronunciation evaluation device according to an embodiment of the present application;
FIG. 10 is a block diagram of a pronunciation evaluation device according to another embodiment of the present application;
FIG. 11 is a block diagram of a training apparatus of the pronunciation assessment system according to an embodiment of the present application;
FIG. 12 is a block diagram of a training apparatus of a pronunciation assessment system according to another embodiment of the present application;
FIG. 13 is a block diagram of a computer device according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
First, some terms in the present application are explained so as to be easily understood by those skilled in the art.
Automatic Speech Recognition (ASR), also known as speech recognition technology, aims to convert the vocabulary content of human speech into computer-readable input, such as keystrokes, binary codes or character sequences.
Mel-Frequency Cepstral Coefficients (MFCCs): in the field of sound processing, the Mel-frequency cepstrum is a linear transformation of the log energy spectrum based on the nonlinear Mel scale of sound frequencies. MFCCs are widely used in speech recognition.
The KL divergence (Kullback-Leibler divergence), also known as relative entropy, is an asymmetric measure of the difference between two probability distributions. In information theory, the relative entropy is equivalent to the difference between the information entropies (Shannon entropies) of two probability distributions. Phoneme information can be regarded as the probability distribution of each audio frame over the universal phonemes. Relative entropy measures the distance between two probability distributions: it is zero when the two distributions are identical, and increases as the difference between them increases. The relative entropy can therefore be used to compare the similarity of two distributions.
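As an illustrative aside (not part of the patent text), the relative entropy between two discrete distributions can be computed as in the following Python sketch; the function name and the example distributions are hypothetical:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """Relative entropy D(P || Q) between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Identical distributions give zero divergence; it grows as they differ more.
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
print(kl_divergence(p, p))  # 0.0
print(kl_divergence(p, q))  # > 0
```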
Neural Networks (NN) are complex network systems formed by a large number of simple processing units (called neurons) widely interconnected, reflect many basic features of human brain functions, and are highly complex nonlinear dynamical learning systems.
The neural network may be composed of neural units (simply "neurons"). A neural unit may be an arithmetic unit that takes inputs $x_s$ and an intercept of 1, and whose output may be:

$$h_{W,b}(x) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)$$

wherein $s = 1, 2, \ldots, n$, $n$ is a natural number greater than 1, $W_s$ is the weight of $x_s$, and $b$ is the bias of the neural unit. $f$ is the activation function of the neural unit, used to introduce a nonlinear characteristic into the neural network so as to convert the input signal of the neural unit into an output signal. The output signal of the activation function may be used as the input of the next convolutional layer. The activation function may be a sigmoid function. A neural network is a network formed by joining many such single neural units together, i.e., the output of one neural unit may be the input of another neural unit. The input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of that local receptive field, and the local receptive field may be a region composed of several neural units.
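For concreteness, a minimal numerical sketch of the single neural unit above (a sigmoid activation and all values are illustrative assumptions, not details from the patent):

```python
import numpy as np

def neural_unit(x, W, b):
    """Output f(sum_s W_s * x_s + b) with a sigmoid activation f."""
    z = np.dot(W, x) + b
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid

x = np.array([0.5, -1.2, 3.0])   # inputs x_s
W = np.array([0.1, 0.4, -0.2])   # weights W_s
print(neural_unit(x, W, b=0.3))
```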
A Convolutional Neural Network (CNN) is a deep neural Network with a Convolutional structure. The convolutional neural network includes a feature extractor consisting of convolutional layers and sub-sampling layers. The feature extractor may be viewed as a filter and the convolution process may be viewed as convolving an input image or convolved feature plane (feature map) with a trainable filter. The convolutional layer is a neuron layer for performing convolutional processing on an input signal in a convolutional neural network. In convolutional layers of convolutional neural networks, one neuron may be connected to only a portion of the neighbor neurons. In a convolutional layer, there are usually several characteristic planes, and each characteristic plane may be composed of several neural units arranged in a rectangular shape. The neural units of the same feature plane share weights, where the shared weights are convolution kernels. Sharing weights may be understood as the way image information is extracted is location independent. The underlying principle is: the statistics of a certain part of the image are the same as the other parts. Meaning that image information learned in one part can also be used in another part. The same learned image information can be used for all positions on the image. In the same convolution layer, a plurality of convolution kernels can be used to extract different image information, and generally, the greater the number of convolution kernels, the more abundant the image information reflected by the convolution operation.
The convolution kernel can be initialized in the form of a matrix of random size, and can be learned to obtain reasonable weights in the training process of the convolutional neural network. In addition, sharing weights brings the direct benefit of reducing connections between layers of the convolutional neural network, while reducing the risk of overfitting.
LSTM (Long Short-Term Memory) is a specific form of RNN (Recurrent Neural Network); RNN is a generic term for a family of neural networks that process sequence data. There are many variants of RNN, such as the bidirectional RNN. However, RNNs encounter great difficulty with long-term dependencies (nodes that are far apart in the time series), because computing the connections between distant nodes involves repeated multiplications of Jacobian matrices, which leads to the problems of gradient vanishing (which often occurs) or gradient explosion (which occurs less often). The most widespread solution to this problem is the gated RNN, and LSTM is the best-known gated RNN. The leaky unit allows an RNN to accumulate long-term dependencies between distant nodes by designing weight coefficients between connections; the gated RNN generalizes this idea, allowing the coefficients to change at different moments and allowing the network to forget information that has already been accumulated. LSTM is such a gated RNN. By adding an input gate, a forget gate and an output gate, LSTM makes the self-recurrent weight variable, so that the integration scale at different moments can change dynamically even while the model parameters are fixed, thereby avoiding gradient vanishing or gradient explosion.
A loss function. In training a deep neural network, because the output of the network is expected to be as close as possible to the value actually desired, the weight vector of each layer can be updated according to the difference between the current predicted value and the truly desired target value (usually after an initialization step in which parameters are preconfigured for each layer of the deep neural network). For example, if the network's predicted value is too high, the weight vectors are adjusted so that it predicts slightly lower, and the adjustment continues until the deep neural network can predict the truly desired target value or a value very close to it. It is therefore necessary to define in advance how to measure the difference between the predicted value and the target value; this is the role of loss functions (or objective functions), which are important equations for measuring that difference. Taking the loss function as an example, a higher output value (loss) indicates a larger difference, so training the deep neural network becomes the process of reducing this loss as much as possible.
Back propagation algorithm. A neural network can use the back propagation (BP) algorithm to correct the parameters of the initial neural network model during training, so that the reconstruction error loss of the model becomes smaller and smaller. Specifically, the input signal is propagated forward until the output produces an error loss, and the parameters of the initial neural network model are updated by propagating the error loss information backward, so that the error loss converges. The back propagation algorithm is a backward pass dominated by the error loss, and aims to obtain the optimal parameters of the neural network model, such as its weight matrices.
Refer to fig. 1, which illustrates a schematic diagram of an environment for implementing an embodiment of the present application. The embodiment implementation environment can be implemented as a pronunciation evaluation system. The embodiment implementation environment may include: a terminal 10 and a server 20.
The terminal 10 may be an electronic device such as a mobile phone, a tablet computer, a multimedia player, a wearable device, a PC (Personal Computer), a language learning terminal, an intelligent teaching machine, and the like. The terminal 10 may be configured with or coupled to a microphone through which audio is collected. A client running an application program may be installed in the terminal 10, and the application program may include a pronunciation evaluation function. In the embodiment of the present application, the type of the application is not limited, and may be a language learning application, an education tutoring application, an instant messaging application, a language assessment application, an educational examination application, or the like.
The server 20 may be an independent physical server, a server cluster or a distributed system including a plurality of physical servers, or a cloud server providing a cloud computing service. Server 20 may be a backend server for the application described above to provide backend services for the application.
The terminal 10 and the server 20 may communicate with each other through a network, and the present application is not limited thereto.
In the pronunciation evaluation method provided in the embodiment of the present application, the execution subject of each step may be the server 20, or may be the terminal 10 (for example, a client of an application program running in the terminal 10), or may be executed by the terminal 10 and the server 20 in an interactive cooperation manner. For convenience of explanation, in the following method embodiments, only the execution subject of each step is described as a computer device, but the present invention is not limited thereto.
Referring to fig. 2, a flowchart of a pronunciation assessment method according to an embodiment of the present application is shown. The method comprises the following steps (210-240):
and step 210, obtaining the pronunciation audio to be evaluated.
The pronunciation audio to be evaluated refers to audio of a pronunciation whose quality is to be evaluated. The audio data format includes, but is not limited to, the Moving Picture Experts Group Audio Layer III (MP3) format, the Moving Picture Experts Group (MPEG) format, the Audio Interchange File Format (AIFF) and the Windows Media Audio (WMA) format, and the embodiments of the present application are not limited thereto.
Optionally, in the language learning scenario, the pronunciation audio to be evaluated is student pronunciation audio, for example, the student simulates the recorded voice of the teacher pronunciation according to a piece of language content.
Step 220, extracting the phoneme information of the pronunciation audio to be evaluated.
The phoneme information includes universal phonemes for respective pronunciations in the pronunciation audio, which are phonemes obtained by integrating exclusive phonemes for a plurality of languages.
A phoneme (phone) is the minimum unit of speech divided according to the natural properties of speech; a piece of pronunciation audio may include a plurality of phonemes. From the acoustic point of view, a phoneme is the minimum speech unit divided according to sound quality; from the physiological point of view, one pronunciation action forms one phoneme. For example, "ma" includes the two pronunciation actions "m" and "a", i.e., two phonemes.
Each language has its corresponding phonemes, such as Chinese phonemes, English phonemes, etc.; the present application refers to the phonemes belonging to a particular language as special phonemes. A universal phoneme is a new phoneme obtained by integrating, according to the principle of pronunciation similarity, special phonemes of multiple languages that have the same or similar pronunciations.
The phoneme information reflects information of phonemes corresponding to each pronunciation in the pronunciation audio, and the phonemes may be proprietary phonemes or universal phonemes.
Extracting the phoneme information of the pronunciation audio to be evaluated refers to determining the phoneme corresponding to each pronunciation according to its pronunciation characteristics, which may be the frequency, tone, loudness, etc. of the pronunciation. The detailed process of extracting phoneme information is described in the following embodiments; only an overview is given here.
Step 230, obtaining the similarity between the pronunciation audio to be evaluated and the teacher pronunciation audio based on the phoneme information of the pronunciation audio to be evaluated and the phoneme information of the teacher pronunciation audio corresponding to the pronunciation audio to be evaluated.
Phoneme information of the teacher pronunciation audio is extracted in the same way as that of the pronunciation audio to be evaluated; only the object of extraction differs.
The phoneme information of the pronunciation audio to be evaluated is compared with the phoneme information of the corresponding teacher pronunciation audio, and the similarity between the two audios is evaluated according to their degree of difference. Optionally, the smaller the degree of difference, the higher the similarity; conversely, the larger the difference, the lower the similarity.
Step 240, determining the pronunciation score of the pronunciation audio to be evaluated based on the similarity.
The similarity between the pronunciation audio to be evaluated and the teacher pronunciation audio is used as the judgment basis for scoring the pronunciation audio to be evaluated. The pronunciation score is an index for quantitatively evaluating the pronunciation quality of the pronunciation audio to be evaluated. Optionally, the higher the similarity, the higher the pronunciation score, indicating that the pronunciation quality is higher and the pronunciation more standard. Optionally, the pronunciation score may use a hundred-point scale or another scale, which is not limited in the embodiments of the application.
In summary, in the technical solution provided by this embodiment, the similarity between a student pronunciation in any language and its corresponding teacher pronunciation is determined by comparing their universal phonemes, which are applicable to multiple languages, so that the student pronunciation is scored. Multi-language pronunciation evaluation is realized without multiple models, which effectively reduces the complexity of pronunciation evaluation and saves storage resources.
Please refer to fig. 3, which shows a flowchart of a pronunciation assessment method according to another embodiment of the present application. The method may comprise the following steps (310-370):
and step 310, acquiring a pronunciation audio to be evaluated.
Optionally, after the pronunciation audio to be evaluated is obtained, it is preprocessed. Preprocessing here refers to processing the data of the input pronunciation audio by cutting off the silent regions at the beginning and end of the speech, reducing interference with subsequent steps; this is generally referred to as Voice Activity Detection (VAD).
Step 320, dividing the pronunciation audio to be evaluated into at least one audio frame.
Since the audio data to be recognized has time-varying characteristics, but its characteristics are substantially stable within a short time range (e.g., 10 ms to 30 ms), the audio data may be divided into segments for feature analysis, and each segment can be understood as an audio frame. The duration of an audio frame in the present application may be 20 ms to 25 ms; this is only an illustration, and other values may also be used in practical applications, which is not limited here. The concept of an audio frame is less clear-cut than that of a video frame: almost all video coding formats can simply treat one frame as one coded picture, whereas audio frames are tied to the coding format and are defined by each coding standard itself. PCM (non-encoded audio data) does not need the concept of frames at all and can be played back directly according to the sampling rate and sampling precision; for example, for audio with a sampling rate of 44.1 kHz and a sampling precision of 16 bits, the bit rate is 44100 × 16 bps and the audio data per second is fixed at 44100 × 16 / 8 bytes. By contrast, when decoding AAC (Advanced Audio Coding) audio with a sampling rate of 44.1 kHz, the decoding time of one frame must be controlled within 23.22 milliseconds; typically one frame contains 1024 sample points. The sampling frequency is the number of samples of the acoustic amplitude taken per second when an analog acoustic waveform is digitized. According to the Nyquist sampling theorem, the sampling frequency should be around 40 kHz to ensure that the sound is not distorted. Commonly used audio sampling frequencies include 8 kHz, 11.025 kHz, 16 kHz, 22.05 kHz, 37.8 kHz, 44.1 kHz and 48 kHz; with a higher sampling frequency, DVD sound quality can be achieved.
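As a minimal illustrative sketch of step 320 (assuming 16 kHz mono PCM and typical 25 ms frames with a 10 ms hop; these values are assumptions, not requirements of the patent):

```python
import numpy as np

def split_frames(signal, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Split a 1-D PCM signal into overlapping frames (frame_ms long, hop_ms apart)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = max(1, 1 + (len(signal) - frame_len) // hop_len)
    return np.stack([signal[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])

audio = np.random.randn(16000)   # one second of dummy audio
frames = split_frames(audio)
print(frames.shape)              # (98, 400): 98 frames of 400 samples each
```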
Optionally, before dividing the audio to be evaluated into at least one audio frame, pre-emphasis processing is performed on the audio to be evaluated, where the pre-emphasis processing refers to boosting the high-frequency part of the audio data to be recognized, and a digital filter may be generally used to implement the pre-emphasis processing.
Step 330, acquiring the frequency domain characteristics of each audio frame in the pronunciation audio to be evaluated.
Optionally, after the pronunciation audio to be evaluated is framed, windowing may be performed. The purpose of windowing is to emphasize the speech waveform around each sample and attenuate the rest of the waveform. Each short segment of the pronunciation audio to be evaluated is processed by applying a window function, for example one of three common window functions: the rectangular window, the Hamming window and the Hanning window.
Frequency domain conversion processing is performed on each audio frame in the pronunciation audio to be evaluated; that is, the data of each audio frame is converted from the time domain to the frequency domain. Since audio data is formed by superimposing different frequencies at the same moment, it is difficult to reflect the differences between frequencies in the time domain; therefore, audio data in the time domain needs to be converted to the frequency domain for analysis, which also makes separation (for example, separating vocals from accompaniment) more convenient. Frequency domain transformation methods include, but are not limited to, the Fast Fourier Transform (FFT) and the Discrete Fourier Transform (DFT).
The data of each audio frame in the pronunciation audio to be evaluated is converted from the time domain to the frequency domain to obtain the frequency domain data of each audio frame, and features are extracted from each audio frame in the frequency domain to obtain its frequency domain features. Optionally, the frequency domain feature is the Mel Frequency Band Energy (MFBE), which may be obtained through a Mel filter bank; cepstral analysis of the Mel spectrum further yields the Mel-Frequency Cepstral Coefficients (MFCCs). Optionally, the frequency domain features are used as the basis for identifying the phoneme corresponding to each audio frame.
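A hedged sketch of the windowing, frequency domain conversion and Mel filtering steps using the librosa library (the input file name, sampling rate and filter parameters are illustrative assumptions, not values specified by the patent):

```python
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical input file

# Hamming-windowed STFT, then a Mel filter bank over the power spectrum,
# gives the Mel band energies (MFBE) of each audio frame.
mel_energy = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=512, hop_length=160, window="hamming", n_mels=40)

# Log + DCT (cepstral analysis) of the Mel spectrum yields MFCCs.
mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel_energy), n_mfcc=13)
print(mel_energy.shape, mfcc.shape)  # (40, n_frames), (13, n_frames)
```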
Step 340, processing the frequency domain characteristics of each audio frame in the pronunciation audio to be evaluated through the multi-language phoneme detection model to obtain the phoneme information of the pronunciation audio to be evaluated.
Optionally, the phoneme information of the pronunciation audio to be evaluated includes phoneme information of each audio frame in the pronunciation audio to be evaluated, for example, the pronunciation audio to be evaluated includes M audio frames, and at this time, the phoneme information of the pronunciation audio to be evaluated has M probability distributions. Optionally, there is a one-to-one correspondence between the audio frames and the probability distributions, and one audio frame corresponds to one probability distribution. Optionally, the probability distribution reflects a probability of the pronunciation of the audio frame on each universal phoneme, for example, if the number of the universal phonemes is N, a dimension of the probability distribution is N, and a numerical value of each dimension in the probability distribution is a probability that the audio frame is a universal phoneme corresponding to the dimension.
The multi-lingual phoneme detection model is a machine learning model for determining universal phonemes corresponding to each audio frame in the pronunciation audio. Optionally, the multi-language phoneme detection model is a machine learning model constructed based on a long-short term memory neural network and a convolutional neural network, and includes an LSTM layer and a CNN layer, which are used for analyzing and identifying phonemes corresponding to respective pronunciations in the pronunciation audio.
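Purely for illustration, a PyTorch-style sketch of a model with the described structure, a CNN layer followed by an LSTM layer and a per-frame softmax over the universal phonemes; the framework choice and all layer sizes are assumptions, not details from the patent:

```python
import torch
import torch.nn as nn

class MultiLingualPhonemeDetector(nn.Module):
    """Maps per-frame frequency-domain features to a probability
    distribution over N universal phonemes (illustrative sizes)."""
    def __init__(self, feat_dim=40, hidden=128, n_universal=100):
        super().__init__()
        self.cnn = nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_universal)

    def forward(self, feats):                 # feats: (batch, frames, feat_dim)
        h = self.cnn(feats.transpose(1, 2))   # (batch, hidden, frames)
        h, _ = self.lstm(h.transpose(1, 2))   # (batch, frames, hidden)
        return torch.softmax(self.out(h), dim=-1)  # per-frame distributions

model = MultiLingualPhonemeDetector()
probs = model(torch.randn(1, 200, 40))  # 200 frames of 40-dim features
print(probs.shape)                      # torch.Size([1, 200, 100])
```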
In an exemplary embodiment, the above step 340 includes the following sub-steps:
step 341, obtaining the target language of the pronunciation audio to be evaluated.
Optionally, the target language is any language. Optionally, the target language of the pronunciation audio to be evaluated is determined according to the language type selected by the user, or through a language identification technology. The manner of acquiring the target language of the pronunciation audio to be evaluated is not limited in this application and can be selected according to the actual situation.
Step 342, selecting target model parameters corresponding to the target language.
Optionally, the multi-language phoneme detection model has multiple sets of model parameters, and different model parameters correspond to different languages, so that the pronunciation audio of different languages has stronger pertinence when the multi-language phoneme detection model detects the universal phoneme, and the detection result is more accurate.
Step 343, processing the frequency domain characteristics of each audio frame in the pronunciation audio to be evaluated according to the target model parameters through the multi-language phoneme detection model to obtain the phoneme information of the pronunciation audio to be evaluated.
For example, if the language of the pronunciation audio to be evaluated is English, the multi-language phoneme detection model calls the model parameters corresponding to English to process the frequency domain characteristics of each audio frame, thereby detecting the universal phoneme corresponding to each audio frame. Because a universal phoneme is a new phoneme formed by integrating the special phonemes of the various languages according to the pronunciation similarity principle, there is a correspondence between universal phonemes and special phonemes, and there may be slight differences between the pronunciation characteristics of a universal phoneme and those of its corresponding special phoneme. The model parameters corresponding to the language can absorb these slight differences, making the detection result more accurate.
Optionally, the target model parameters are part of the parameters of the multi-language phoneme detection model.
And 350, performing dynamic time normalization on the distribution probability vector of each audio frame in the pronunciation audio to be evaluated and the distribution probability vector of each audio frame in the teacher pronunciation audio to obtain a time sequence matching result.
Optionally, before performing step 350, the following steps are further performed:
and generating a distribution probability vector of each audio frame in the pronunciation audio to be evaluated based on the probability distribution of each audio frame in the pronunciation audio to be evaluated.
And generating a distribution probability vector sequence of the pronunciation audio to be evaluated based on the distribution probability vector of each audio frame in the pronunciation audio to be evaluated.
The distribution probability vector is a vector composed of the probabilities of an audio frame's pronunciation on the respective universal phonemes. For example, if the probabilities of a certain audio frame on the universal phonemes are P_1, P_2, P_3, P_4, …, P_n, then the distribution probability vector of the audio frame is (P_1, P_2, P_3, P_4, …, P_n).
Optionally, the distribution probability vector of each audio frame in the pronunciation audio to be evaluated forms a distribution probability vector sequence of the pronunciation audio to be evaluated. Optionally, the distributed probability vectors of the audio frames are arranged in the sequence of distributed probability vectors in chronological order of the audio frames.
Optionally, using a Dynamic Time Warping (DTW) algorithm, the distribution probability vectors of the audio frames in the pronunciation audio to be evaluated and those of the audio frames in the teacher pronunciation audio are aligned in time sequence based on the cosine distances between them; a vector from the pronunciation audio to be evaluated and the vector it is aligned with in the teacher pronunciation audio form a pair of matched distribution probability vectors.
Optionally, the time sequence matching result is a combination of matched distribution probability vectors obtained by aligning the pronunciation audio to be evaluated and the teacher pronunciation audio according to the same universal phoneme in time sequence, and the time sequence of the matched distribution probability vectors is maintained.
In one example, as shown in FIG. 4, a diagram of a time sequence matching result is illustrated. The distribution probability vector 411 of the first audio frame in the distribution probability vector sequence 41 of the teacher pronunciation audio and the distribution probability vector 421 of the first audio frame in the distribution probability vector sequence 42 of the student pronunciation audio are matched distribution probability vectors. The distribution probability vector 412 of the third audio frame in the distribution probability vector sequence 41 of the teacher pronunciation audio and the distribution probability vector 422 of the second audio frame in the distribution probability vector sequence 42 of the student pronunciation audio are matched distribution probability vectors.
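A minimal sketch of such a DTW alignment over cosine distances between two sequences of distribution probability vectors (a textbook DTW implementation shown only for illustration; libraries such as dtw-python could be used instead):

```python
import numpy as np

def cosine_dist(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def dtw_align(student, teacher):
    """Return (i, j) index pairs aligning two sequences of probability vectors."""
    n, m = len(student), len(teacher)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cosine_dist(student[i - 1], teacher[j - 1])
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack from (n, m) to recover the matched pairs.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```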
Step 360, obtaining the similarity between the pronunciation audio to be evaluated and the teacher pronunciation audio based on the similarities between the matched distribution probability vectors determined by the time sequence matching result.
Optionally, the similarity is measured by the KL divergence.
Optionally, the average of the similarities between the matched pairs of distribution probability vectors (i.e., between the distribution probability vectors of the audio frames in the pronunciation audio to be evaluated and their matched distribution probability vectors in the teacher pronunciation audio) is taken as the similarity between the pronunciation audio to be evaluated and the teacher pronunciation audio.
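Continuing the illustrative sketches above, the utterance-level similarity might be computed by averaging the per-pair KL divergence over the DTW path (a lower average divergence corresponds to a higher similarity; the exact mapping is an assumption, not a detail from the patent):

```python
def utterance_similarity(student, teacher, path):
    """Average per-pair KL divergence over the DTW path.

    Reuses kl_divergence() and dtw_align() from the sketches above;
    a lower value means the two pronunciations are more similar."""
    divs = [kl_divergence(teacher[j], student[i]) for i, j in path]
    return sum(divs) / len(divs)
```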
And step 370, determining the pronunciation score of the pronunciation audio to be evaluated according to the similarity through the scoring model.
The scoring model is a regression model for quantitatively describing the statistical relationship between the similarity and the pronunciation score.
Optionally, the similarity between the pronunciation audio to be evaluated and the teacher pronunciation audio is input into a scoring model, and the scoring model outputs the pronunciation score of the pronunciation audio to be evaluated according to the statistical relationship between the similarity and the pronunciation score.
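For illustration, assuming the scoring model is a fitted scikit-learn linear regression (the patent only requires some regression model, so the concrete regression family is an assumption), applying it could look like this:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def score_pronunciation(scoring_model: LinearRegression, similarity: float) -> float:
    # Map the similarity to a pronunciation score via the fitted
    # statistical relationship between similarity and score.
    return float(scoring_model.predict(np.array([[similarity]]))[0])
```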
In one example, please refer to FIG. 5, which illustrates a schematic diagram of a pronunciation assessment system. The teacher recording 50 and the student recording 51 are input into the multi-lingual phoneme detection model 52, which outputs the probability of each frame in the teacher recording 50 mapping to each universal phoneme in the universal phoneme set and the probability of each frame in the student recording 51 mapping to each universal phoneme in the universal phoneme set. These probabilities are used to generate the distribution probability vector sequence 53 of the teacher recording 50 and the distribution probability vector sequence 54 of the student recording 51. The two sequences are aligned by the DTW algorithm and the similarity between them is calculated; the similarity is input into the regression model 55, and finally the pronunciation score 56 of the student recording 51 is output.
In summary, according to the technical scheme provided by the embodiment of the application, the model parameters corresponding to each language are set in the multi-language phoneme detection model, the probability distribution of the student pronunciation and the teacher pronunciation of any language under each universal phoneme is detected according to the model parameters corresponding to the languages, the similarity between the student pronunciation and the teacher pronunciation is determined based on the probability distribution of the student pronunciation and the probability distribution of the teacher pronunciation, and the score of the student pronunciation is output through the regression model, so that the number of models required by the multi-language pronunciation detection is reduced, the accuracy of pronunciation evaluation under the multi-language condition is ensured, and the storage resource is saved while the pronunciation evaluation complexity is effectively reduced.
Referring to fig. 6, a flowchart of a training method of the pronunciation assessment system according to an embodiment of the present application is shown. The pronunciation evaluating system comprises a multilingual phoneme detection model and a scoring model. The method may comprise the following steps (610-650):
step 610, a first training sample set and a second training sample set are obtained.
The first training sample set comprises at least one first training sample, sample data of the first training sample comprises teacher pronunciation audio in any language, and tag data of the first training sample comprises universal phonemes of all pronunciations in the teacher pronunciation audio.
Alternatively, the first training sample set may be referred to as an ARAT (Audio Recordings Along with Transcriptions) dataset. Optionally, the first training sample set further includes student pronunciation audio in any language, and the label data of the first training sample further includes the universal phonemes of the respective pronunciations in the student pronunciation audio.
Optionally, the first training sample set includes the transcription text corresponding to the teacher pronunciation audio and the student pronunciation audio. The transcription text is the text content corresponding to the pronunciation audio generated by the teacher.
Optionally, the first training sample is stored in the form of a label file, such as an LBL file.
The second training sample set comprises at least one second training sample, the sample data of the second training sample comprises teacher pronunciation audio and student pronunciation audio for the same session, and the label data of the second training sample comprises pronunciation scores of the student pronunciation audio.
Alternatively, the second training sample set may be referred to as a TSR (Teacher-Student Recording) dataset.
It is to be understood that the terms "teacher" and "student" in the embodiments of the present application are not limited to the everyday concepts of teacher and student: "teacher" generally refers to a native speaker of a language whose pronunciation is standard, and "student" generally refers to a user whose pronunciation quality is to be evaluated, not necessarily someone with the identity of a student.
Step 620, training the multi-lingual phoneme detection model based on the first training sample set.
The multi-language phoneme detection model is used for extracting phoneme information of pronunciation audio, wherein the phoneme information comprises universal phonemes of all pronunciations in the pronunciation audio, and the universal phonemes are phonemes obtained by integrating special phonemes of multiple languages.
The first training samples in the first training sample set are fed into the multi-language phoneme detection model, which extracts the phoneme information of the pronunciation audio (teacher pronunciation audio or student pronunciation audio) to obtain the predicted universal phonemes of the respective pronunciations in the audio.
The multi-lingual phoneme detection model is modified based on the difference between the label data in the first training sample, i.e., the universal phonemes for each pronunciation in the labeled teacher pronunciation audio, and the universal phonemes for each pronunciation in the predicted pronunciation audio. Optionally, the multi-lingual phoneme detection model is modified by a loss function so that the universal phonemes of each pronunciation in the predicted pronunciation audio are consistent with the universal phonemes of each pronunciation in the labeled teacher pronunciation audio, thereby ensuring the accuracy of the multi-lingual phoneme detection model.
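One way such a correction step could look, sketched in PyTorch with frame-level cross-entropy as the loss; the patent does not name the loss function or the model interface, so this choice and all identifiers below are assumptions:

```python
import torch
import torch.nn as nn

def train_step(model: nn.Module, optimizer: torch.optim.Optimizer,
               features: torch.Tensor, phoneme_labels: torch.Tensor) -> float:
    """One supervised update on a first-set training sample.

    features: (num_frames, feat_dim) frequency-domain features of one
    pronunciation audio; phoneme_labels: (num_frames,) index of the
    labelled universal phoneme for each audio frame.
    """
    model.train()
    optimizer.zero_grad()
    logits = model(features)  # (num_frames, num_universal_phonemes)
    # Penalize disagreement between predicted and labelled universal phonemes.
    loss = nn.functional.cross_entropy(logits, phoneme_labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```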
In an exemplary embodiment, the step 620 includes:
step 621, train the initialized multi-lingual phoneme detection model based on the first training sample set.
The initialized multi-lingual phoneme detection model is a pre-trained model for detecting universal phonemes.
A pre-training model generally refers to a large-scale neural network trained on large-scale training samples (for example, the first training sample set constructed in this application) for a designed universal phoneme detection training task; the resulting large-scale network is the pre-training model, on the basis of which other tasks can perform feature extraction or task fine-tuning to achieve their specific goals. The idea of pre-training is to train one task to obtain a set of model parameters, initialize the network with this set of parameters, and then train other tasks from the initialized network to obtain models adapted to those tasks. By pre-training on large-scale training samples, the initialized multi-language phoneme detection model can learn a strong phoneme detection capability and extract rich phoneme information from audio. By directly fine-tuning downstream tasks on the pre-trained model, downstream task-specific models can be obtained conveniently and quickly.
Fine tuning (fine-tune) refers to performing small-scale training of a pre-training model with a specific task objective (the downstream task) and task data (the downstream data), slightly adjusting the parameters of the pre-training model so as to finally obtain a model adapted to the specific data and task.
Step 622, for the target language, a first training sample of the target language is obtained from the first training sample set.
For example, for english, a first training sample in english is obtained from a first set of training samples.
Step 623, adjusting the model parameters of the initialized multi-language phoneme detection model by using the first training sample of the target language to obtain model parameters corresponding to the target language.
Adjusting the model parameters of the initialized multi-language phoneme detection model with the first training samples of the target language can be understood as the fine-tuning process described above, made concrete in the embodiments of the present application: with language as the dividing basis, the first training samples of different languages serve as different task data for small-scale training of the initialized multi-language phoneme detection model, yielding model parameters corresponding to the respective languages, so that universal phoneme detection on pronunciation audio of different languages is better targeted.
In an exemplary embodiment, the above step 623 includes the following sub-steps.
Step 623a, parameters of the intermediate layer in the initialized multi-lingual phoneme detection model are fixed.
The initialized multi-language phoneme detection model comprises an all-pass layer at the input end, an all-pass layer at the output end, and an intermediate layer composed of a Long Short-Term Memory (LSTM) network and a Convolutional Neural Network (CNN). The intermediate layer refers to this LSTM/CNN portion of the initialized multi-language phoneme detection model. It already has the capability of extracting phoneme information, and to reduce training complexity its model parameters do not need to be adjusted again.
Step 623b, replacing the all-pass layers at the input end and the output end of the initialized multi-language phoneme detection model with fully connected layers.
The all-pass layer does not process data: its output is identical to its input. It serves as a placeholder in the initialized multi-language phoneme detection model, facilitating the later insertion of fully connected layers into the model.
A Fully Connected (FC) layer here acts as a language-related layer for mining language features.
A fully connected layer can map fixed-dimension feature vectors (e.g., frequency-domain feature vectors) to the different recognized language classes. Each node of a fully connected layer is connected to all nodes of the previous layer, integrating the features extracted earlier. Owing to this full connectivity, the fully connected layer typically also holds the most parameters.
In a CNN structure, one or more fully connected layers follow the convolutional and pooling layers; as in an MLP, each neuron in a fully connected layer is fully connected to all neurons of the previous layer. To improve the performance of the CNN, the ReLU function is generally adopted as the activation function of the fully connected layer's neurons. The output values of the last fully connected layer are passed to an output layer, which may perform classification using softmax logistic regression; this output layer is also referred to as the softmax layer.
Step 623c, adjusting the model parameters of the full connection layer by using the first training sample of the target language to obtain the model parameters corresponding to the target language.
The fine-tuning process may be embodied as adjusting the model parameters of the fully connected layers: a set of model parameters corresponding to one language is generated using that language's training samples, finally yielding, in the fully connected layers, multiple sets of model parameters corresponding to the respective languages.
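A hypothetical PyTorch sketch of steps 623a-623c: freezing the shared trunk, swapping the all-pass (identity) layers for fully connected layers, and keeping one parameter set per language. The attribute names `input_layer` and `output_layer` are invented for illustration and are not from the patent:

```python
import torch.nn as nn

def make_language_specific(model: nn.Module, feat_dim: int, num_phonemes: int) -> nn.Module:
    # Step 623a: fix the parameters of the LSTM/CNN intermediate layer.
    for p in model.parameters():
        p.requires_grad = False
    # Step 623b: replace the input/output all-pass layers with fully
    # connected layers (their fresh parameters are trainable by default).
    model.input_layer = nn.Linear(feat_dim, feat_dim)
    model.output_layer = nn.Linear(feat_dim, num_phonemes)
    return model

# Step 623c: fine-tune only the new layers on the target language's
# samples, then keep the resulting parameters per language, e.g.:
# language_params["en"] = {k: v for k, v in model.state_dict().items()
#                          if "input_layer" in k or "output_layer" in k}
```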
And step 624, obtaining a multi-language phoneme detection model based on the model parameters respectively corresponding to the languages.
The first training samples of each language are processed as in step 623c, so that the model parameters in the fully connected layers corresponding to each language are obtained; the multi-language phoneme detection model is obtained after the trained model parameters corresponding to the respective languages are set.
In one example, as shown in FIG. 7, a diagram of a multilingual phoneme detection model is illustrated. An audio frame 70 of the pronunciation audio in the figure passes through the feature extraction layer 71 to generate a frequency domain feature vector (not shown in the figure) of the audio frame 70, and the frequency domain feature vector of the audio frame 70 is input into the multi-language phoneme detection model 72 to obtain a probability value 74 of the audio frame 70 on each universal phoneme 73.
Step 630, extracting the phoneme information of the second training sample through the multi-lingual phoneme detection model.
The phoneme information of the second training sample comprises phoneme information of the teacher pronunciation audio and phoneme information of the student pronunciation audio in the second training sample.
For the explanation of the phoneme information, please refer to the above embodiments, which are not described herein again.
And step 640, obtaining the similarity of the teacher pronunciation audio and the student pronunciation audio based on the phoneme information of the second training sample.
Please refer to the above embodiments for the description of the similarity between the teacher pronunciation audio and the student pronunciation audio, which is not described herein.
And 650, fitting a scoring model according to the similarity and the pronunciation score of the pronunciation audio of the student, wherein the scoring model is used for scoring the audio to be evaluated.
Optionally, the similarity between the student pronunciation audio and the teacher pronunciation audio in the second training sample set is used as the input of the regression model, the pronunciation score the teacher assigns to the student pronunciation audio is used as its output, the statistical relationship between similarity and pronunciation score is fitted, and the resulting regression model serves as the scoring model in the embodiments of the present application.
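A sketch of this fitting step, again assuming an ordinary least-squares regression in scikit-learn; the patent leaves the concrete regression family open, and the names below are illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_scoring_model(similarities, teacher_scores) -> LinearRegression:
    """Fit the regression used as the scoring model.

    similarities: one student/teacher similarity per second-set sample;
    teacher_scores: the teacher-assigned pronunciation score labels.
    """
    X = np.asarray(similarities, dtype=float).reshape(-1, 1)
    y = np.asarray(teacher_scores, dtype=float)
    return LinearRegression().fit(X, y)
```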
In summary, according to the technical scheme provided by the embodiment of the application, the multi-language phoneme detection model is trained through the sample audio with the universal phoneme information as the label, so that the multi-language phoneme detection model can accurately detect the universal phoneme information corresponding to the teacher pronunciation and the student pronunciation of the same session, the similarity between the teacher pronunciation and the student pronunciation is calculated on the basis, a regression model is fitted to describe the statistical relationship between the scoring label carried by the student pronunciation and the similarity, the pronunciation evaluation system is trained and completed finally, the application range of the model is expanded, the multi-language user pronunciation evaluation function is realized, the pronunciation evaluation complexity is effectively reduced, and the storage resource is saved.
In addition, aiming at different languages, a plurality of groups of corresponding model parameters are set in the multi-language phoneme detection model, so that the pronunciation evaluation accuracy under the multi-language condition is ensured.
Referring to fig. 8, a flowchart of a method for obtaining a first training sample set according to an embodiment of the present application is shown. The pronunciation evaluation system comprises a multilingual phoneme detection model and a scoring model. The method may comprise the following steps (810-830):
and step 810, acquiring original label data of the teacher pronunciation audio of each language.
The raw label data includes the individual phonemes of each pronunciation in the teacher's pronunciation audio, the individual phonemes being phonemes specific to a language. The above-mentioned original label data is different from the label data of the above first training sample only in phonemes corresponding to the respective pronunciations.
In an exemplary embodiment, the above step 810 includes the following sub-steps:
step 811, obtain the phone set of each language.
The Phone Set (PS) of each language refers to a Set of exclusive phones of each language.
And step 812, aligning the teacher pronunciation audio of each language based on the phoneme set of each language to obtain the original tag data of the teacher pronunciation audio of each language.
Alternatively, the spoken audio and the transcribed text are aligned at the phone level using the open-source Montreal Forced Aligner tool.
The input of the Montreal Forced Aligner comprises:
and preparing and storing files corresponding to the characters and the phonemes. In one possible implementation, words can be mapped to phonemes by constructing a dictionary (lexicon), which is a bridge used to connect acoustic models and language models, which can create a mapping from words to phonemes. The lexicon contains words that the speech recognition system can recognize and indicates pronunciations. English words use words and phonetic symbols, Chinese words use Chinese characters and pinyin for correspondence, so that an acoustic model and a language model are connected to form a state space for searching, and preparation is made for decoding.
Lab files with the same names as the audio files, storing the phonemes corresponding to the audio.
The pronunciation audio files, whose sampling rate is officially required to be 16 kHz or higher.
The Montreal Forced Aligner outputs the phoneme corresponding to each audio frame; the phonemes and their start and end times across the audio frames serve as the original label data of the teacher pronunciation audio of each language.
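For illustration, preparing the corpus layout the aligner expects — one same-named .lab file next to each audio file — might look like the following sketch; the paths and the `utterances` mapping are hypothetical:

```python
from pathlib import Path

def prepare_alignment_corpus(utterances: dict, corpus_dir: str) -> None:
    """Write one .lab file next to each same-named audio file.

    `utterances` maps an audio file stem (e.g. "teacher_0001" for
    teacher_0001.wav) to the content to be stored for that audio.
    """
    root = Path(corpus_dir)
    root.mkdir(parents=True, exist_ok=True)
    for stem, content in utterances.items():
        # Same base name as the audio file, .lab extension.
        (root / (stem + ".lab")).write_text(content, encoding="utf-8")
```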
Step 820, obtain a universal phone set.
The universal phone set is a collection of universal phones.
In an exemplary embodiment, the above step 820 includes the following sub-steps:
in step 821, the phoneme recognizers of the languages are trained respectively based on the teacher pronunciation audio of each language and the original label data of the teacher pronunciation audio of each language.
The structure of the Phone Recognizers (PR) of the respective languages is the same as that of the initialized multi-language phoneme detection model described above, and the training process is similar; the difference lies in the label data of the training samples, and therefore in the phoneme information extracted. The label data in the training samples of the initialized multi-language phoneme detection model are the universal phonemes corresponding to the audio frames, whereas the label data in the training samples of each language's phone recognizer are that language's proprietary phonemes corresponding to the audio frames; for example, the label data of the English phone recognizer are the English phonemes corresponding to the audio frames. Accordingly, the initialized multi-language phoneme detection model extracts the universal phoneme corresponding to each audio frame, while each language's phone recognizer extracts the proprietary phoneme corresponding to each audio frame.
Optionally, the number of parameters of each language's phone recognizer differs from that of the initialized multi-language phoneme detection model: the latter needs its parameter count increased appropriately, controlled to within 2 times the parameter count of a single language's phone recognizer.
Step 822, the teacher pronunciation audio of each language is sent to the phoneme recognizer of each language to determine the similar phonemes in each language.
Similar phonemes refer to proprietary phonemes that are similar or identical in pronunciation in each language.
In a possible implementation manner, the implementation procedure of the step 822 is as follows:
step 822a, for the target language, sending the teacher pronunciation audio of the target language to the phoneme recognizer of the target language to obtain the probability value of each audio frame in the teacher pronunciation audio of the target language under each specific phoneme of the target language.
For example, a section of recording of english content is sent to an english phoneme recognizer to obtain probability values of each audio frame in the recording under each english phoneme, that is, to obtain a probability distribution for describing the possibility that the phoneme corresponding to the audio frame is each phoneme.
Step 822b, determining the special phoneme corresponding to each audio frame based on the probability value of each audio frame under each special phoneme of the target language.
The proprietary phoneme corresponding to each audio frame is determined according to the relative magnitudes of its probability values under the proprietary phonemes of the target language. For example, if the probability value under the "a" phoneme is the largest among the probability values of a certain audio frame under the respective proprietary phonemes, the phoneme corresponding to that audio frame is determined to be "a".
Step 822c, inputting the audio frame corresponding to the first specific phoneme in the first language into the phoneme recognizer of the second language to obtain the probability value of the audio frame corresponding to the first specific phoneme under each specific phoneme of the second language.
The first language and the second language are different languages. The first proprietary phoneme is a proprietary phoneme of the first language. Calculating the probability values of the audio frames corresponding to the first proprietary phoneme under the proprietary phonemes of the second language quantitatively reflects how similar in pronunciation each proprietary phoneme of the second language is to the first proprietary phoneme of the first language.
Step 822d, taking the probability value of the audio frame corresponding to the first proprietary phoneme under each proprietary phoneme of the second language as the reciprocal of the distance from the first proprietary phoneme to each proprietary phoneme of the second language.
Here, the pronunciation similarity between each proprietary phoneme of the second language and the first proprietary phoneme of the first language is measured by the distance from the first proprietary phoneme to each proprietary phoneme of the second language. For example, if the probability value of phoneme 1 of language 1 under phoneme 2 of language 2 is high, the distance from phoneme 1 of language 1 to phoneme 2 of language 2 is small, meaning the pronunciation of phoneme 1 of language 1 is similar to that of phoneme 2 of language 2.
Step 822e, determining similar phonemes in each language through a clustering algorithm based on the distance.
Optionally, proprietary phonemes across languages are regarded as similar when the distance between them satisfies a threshold condition; the threshold is a preset numerical value used to discriminate distances, and its magnitude is not limited in the embodiments of the present application.
A clustering algorithm here refers to cluster analysis (also called group analysis), a statistical method for studying classification problems and an important algorithm in data mining. A cluster is composed of several patterns, where a pattern is typically a vector of a measure or a point in a multidimensional space. Cluster analysis is based on similarity: patterns within one cluster are more similar to one another than to patterns in other clusters. Optionally, the present application adopts the k-means clustering algorithm.
Step 823, binding and merging the similar phonemes into universal phonemes to obtain a universal phoneme set.
Optionally, a plurality of similar phonemes are bound and combined into a universal phoneme, and the name of the universal phoneme can be determined according to actual conditions.
Alternatively, the proprietary phonemes without similar phonemes are individually combined into a universal phoneme.
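A possible sketch of steps 822-823, clustering proprietary phonemes by their cross-language recognizer profiles with k-means. The patent measures distance as the reciprocal of the probability; here each phoneme's average probability vector under the other languages' recognizers is used directly as its feature, which is one assumption among several ways to realize the same idea, and all names are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def merge_into_universal_phonemes(phoneme_profiles: np.ndarray,
                                  phoneme_names: list,
                                  num_universal: int) -> dict:
    """Cluster proprietary phonemes with similar cross-language profiles.

    phoneme_profiles: one row per proprietary phoneme, e.g. its average
    probability vector under the other languages' phone recognizers
    (the reciprocal-of-probability distance of step 822d is a view of
    the same quantity). num_universal is the chosen size of the
    universal phoneme set. Returns {proprietary phoneme name ->
    universal phoneme id}; proprietary phonemes sharing an id are bound
    and merged into one universal phoneme (step 823).
    """
    labels = KMeans(n_clusters=num_universal, n_init=10).fit_predict(phoneme_profiles)
    return {name: int(label) for name, label in zip(phoneme_names, labels)}
```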
And step 830, replacing the special phonemes of each pronunciation corresponding to each audio frame in the teacher pronunciation audio based on the universal phoneme set to obtain the label data of the first training sample.
Optionally, the universal phoneme is used to replace the proprietary phonemes of the respective pronunciations corresponding to the respective audio frames in the teacher pronunciation audio according to the corresponding relationship between the similar phonemes and the universal phoneme.
In summary, in the technical scheme provided by the embodiment of the present application, phonemes with similar pronunciations in various languages are bound to be universal phonemes through a clustering algorithm, so as to generate a universal phoneme set, and then replace the special phonemes of various languages, thereby laying a foundation for realizing a user pronunciation evaluation function of multiple languages.
In an exemplary embodiment, the first training sample in the first training sample set is required to satisfy at least one of the following conditions.
Firstly, the recording is noiseless, or the SNR is more than 30 dB.
The signal-to-noise ratio (SNR or S/N) is the ratio of the signal power to the noise power output at the same time, usually expressed in decibels (dB). A higher SNR indicates that a device produces less noise: the larger the SNR, the less noise is mixed into the signal and the higher the quality of the reproduced sound, and vice versa. The peak value is usually used for impulse noise and the root mean square for random noise. The normalized SNR is the ratio of the signal energy per bit to the noise power spectral density, also expressed in dB; it may also be taken as the ratio of signal power to noise power at a given point in a communication system, with different expressions used for different systems as required. In an analog communication system, SNR commonly refers to the ratio of the average power of the useful signal to the average noise power at the demodulator output of a terminal; in a digital communication system, to the ratio of the average signal energy per symbol (bit) at the output of the digital demodulator and decoder to the noise power per unit bandwidth. SNR is an important parameter for measuring how strongly noise affects a signal, and can be increased by improving the transmission means and enhancing device capability. "Noise" here is simply defined as signals self-generated by the device during processing, independent of the input signal.
Second, the recording should contain no reverberation, or the RT60 reverberation time should be < 0.3 seconds.
RT60 refers to the reverberation time of a room: the time, in seconds, taken for the sound field to decay by 60 dB. The larger the RT60, the longer sound produced in the room lingers before disappearing. The accepted definition of reverberation time is the time required for the sound energy density to drop to 10^-6 of its initial value, equivalent to a 60 dB decay in sound pressure level: after the source stops sounding, the residual sound is repeatedly absorbed by the room's sound-absorbing materials until the average sound energy density decays to one millionth of its original value; this quantity is denoted T60 or RT. If the reverberation time is too short, the sound is dry and unnatural; if too long, the sound becomes muddled; when appropriate, the sound is mellow.
Third, the duration of the recording should be less than 10 minutes, and each audio segment should preferably contain a single speaker. Optionally, speakers are evenly distributed across genders and an age range of 15 to 60 years.
Fourth, the total duration of recordings in one language in the ARAT data set should exceed 200 hours, representing more than 200 speakers.
Fifth, the text is transcribed without requiring time stamps accurate to each word.
Sixth, the audio should be recorded using a microphone with a flat response in the 200-500 Hz range.
Seventh, the speech signal is considered wideband (8 kHz bandwidth) and is stored in PCM format at a sampling rate of 16 kHz with a bit depth of 16 bits.
PCM (Pulse Code Modulation) is one of the encoding methods for digital communication. Its main process is to sample analog signals such as speech and images at regular intervals to discretize them, quantize the sampled values by rounding them to hierarchical levels, and represent the amplitude of each sampled pulse with a group of binary codes. An illustrative format check is sketched after this list.
Eighth, the spoken words in the corpus should cover a wide variety of words so that the model learns a balanced acoustic representation of the underlying phonemes.
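An illustrative check of the purely format-related conditions above (16 kHz, 16-bit PCM, duration under 10 minutes) using only the Python standard library; noise (SNR) and reverberation (RT60) conditions would need separate signal analysis:

```python
import wave

def meets_format_conditions(path: str) -> bool:
    """Check a recording against the format-related conditions above:
    16 kHz sampling rate, 16-bit PCM samples, duration under 10 minutes."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        bit_depth = wav.getsampwidth() * 8
        duration_seconds = wav.getnframes() / rate
    return rate == 16000 and bit_depth == 16 and duration_seconds < 600
```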
In an exemplary embodiment, the second training samples in the second training sample set satisfy at least one of the following conditions.
Firstly, the recording is noiseless, or the SNR is more than 30 dB.
Second, the recording needs no reverberation, or RT60 reverberation time <0.3 seconds.
Third, the duration of the recording should be less than 10 seconds, and each audio segment should contain a single speaker (teacher or student). Optionally, speakers are evenly distributed across genders and an age range of 15 to 60 years.
Fourth, the audio should be recorded using a microphone with a flat response in the 200-500 Hz range.
Fifth, the speech signal is considered wideband (8 kHz bandwidth) and is stored in PCM format at a sampling rate of 16 kHz with a bit depth of 16 bits.
Sixth, each language contains at least 30 sentences.
Seventh, at least 10 native speakers (teachers) record each sentence, i.e., a minimum of 300 teacher recordings needs to be prepared.
Eighth, each teacher's recordings are imitated by 10 students, yielding at least 3000 student recordings.
Ninth, each teacher scores 300 student recordings according to the pronunciation standard.
In summary, according to the technical scheme provided by the embodiment of the present application, training samples are acquired according to multiple conditions, so as to obtain high-quality training samples, and ensure the detection accuracy of the trained multi-language phoneme detection model.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Referring to fig. 9, a block diagram of a pronunciation evaluation device according to an embodiment of the present application is shown. The device has the function of realizing the pronunciation evaluation method. The apparatus 900 may include: an audio acquisition module 910, a feature extraction module 920, a similarity determination module 930, and a score determination module 940.
And the audio acquisition module 910 is configured to acquire a pronunciation audio to be evaluated.
The feature extraction module 920 is configured to extract phoneme information of the pronunciation audio to be evaluated, where the phoneme information includes universal phonemes of each pronunciation in the pronunciation audio, and the universal phonemes are phonemes obtained by integrating special phonemes of multiple languages.
The similarity determining module 930 is configured to obtain a similarity between the pronunciation audio to be evaluated and the teacher pronunciation audio based on the phoneme information of the pronunciation audio to be evaluated and the phoneme information of the teacher pronunciation audio corresponding to the pronunciation audio to be evaluated.
And a score determining module 940, configured to determine a pronunciation score of the pronunciation audio to be evaluated based on the similarity.
In an exemplary embodiment, referring to fig. 10, the feature extraction module 920 includes: an audio frame dividing unit 921, a frequency domain converting unit 922, and a phoneme predicting unit 923.
And the audio frame dividing unit 921 is configured to divide the pronunciation audio to be evaluated into at least one audio frame.
And the frequency domain conversion unit 922 is configured to obtain frequency domain characteristics of each audio frame in the pronunciation audio to be evaluated.
A phoneme prediction unit 923, configured to process the frequency domain characteristics of each audio frame in the pronunciation audio to be evaluated through a multi-language phoneme detection model to obtain phoneme information of the pronunciation audio to be evaluated, where the phoneme information of the pronunciation audio to be evaluated includes probability distribution of each audio frame in the pronunciation audio to be evaluated, and the probability distribution reflects probability of pronunciation of the audio frame on each universal phoneme;
the multi-language phoneme detection model is a machine learning model used for determining universal phonemes corresponding to all audio frames in pronunciation audio.
In an exemplary embodiment, the phoneme prediction unit 923 is configured to:
acquiring a target language of the pronunciation audio to be evaluated;
selecting target model parameters corresponding to the target language; the multi-language phoneme detection model is provided with a plurality of groups of model parameters, and different model parameters correspond to different languages;
and processing the frequency domain characteristics of each audio frame in the pronunciation audio to be evaluated through the multi-language phoneme detection model according to the target model parameters to obtain the phoneme information of the pronunciation audio to be evaluated.
In an exemplary embodiment, the similarity determination module 930 is configured to:
performing dynamic time normalization on the distribution probability vector of each audio frame in the pronunciation audio to be evaluated and the distribution probability vector of each audio frame in the teacher pronunciation audio to obtain a time sequence matching result, wherein the distribution probability vector is a vector formed by the probabilities of pronunciations of the audio frames on all the universal phonemes, and the time sequence matching result is a combination of matched distribution probability vectors obtained by aligning the pronunciation audio to be evaluated and the teacher pronunciation audio according to the same universal phonemes in time sequence;
and obtaining the similarity between the pronunciation audio to be evaluated and the teacher pronunciation audio based on the similarity between the matched distribution probability vectors determined by the time sequence matching result.
In an exemplary embodiment, the score determination module 940 is configured to:
and determining the pronunciation score of the pronunciation audio to be evaluated according to the similarity through a scoring model, wherein the scoring model is a regression model used for quantitatively describing the statistical relationship between the similarity and the pronunciation score.
In summary, according to the technical scheme provided by the embodiment of the application, the similarity between the student pronunciation and the teacher pronunciation is determined by comparing the universal phonemes suitable for multiple languages corresponding to the student pronunciation and the teacher pronunciation in any language, so that the score of the student pronunciation is obtained, the multi-language pronunciation evaluation is realized without multiple models, the pronunciation evaluation complexity is effectively reduced, and the storage resource is saved.
In addition, model parameters corresponding to various languages are set in the multi-language phoneme detection model, probability distribution of student pronunciations and teacher pronunciations of any language under various universal phonemes is detected according to the model parameters corresponding to the languages, the similarity of the student pronunciations and the teacher pronunciations is determined based on the probability distribution of the student pronunciations and the teacher pronunciations, scores of the student pronunciations are output through the regression model, the number of models required by multi-language pronunciation detection is reduced, and the pronunciation evaluation accuracy under the multi-language condition is also ensured.
Referring to fig. 11, a block diagram of a training device of a pronunciation assessment system according to an embodiment of the present application is shown. The pronunciation evaluating system comprises a multilingual phoneme detection model and a scoring model. The device has the function of realizing the training method of the pronunciation evaluation system. The apparatus 1100 may include: a sample acquisition module 1110, a model training module 1120, a feature extraction module 1130, a similarity determination module 1140, and a model fitting module 1150.
The sample acquiring module 1110 is configured to acquire a first training sample set and a second training sample set, where the first training sample set includes at least one first training sample, sample data of the first training sample includes teacher pronunciation audio in any language, tag data of the first training sample includes universal phonemes of each pronunciation in the teacher pronunciation audio, the second training sample set includes at least one second training sample, sample data of the second training sample includes teacher pronunciation audio and student pronunciation audio for the same language, and tag data of the second training sample includes pronunciation scores of the student pronunciation audio.
A model training module 1120, configured to train the multi-language phoneme detection model based on the first training sample set, where the multi-language phoneme detection model is configured to extract phoneme information of pronunciation audio, where the phoneme information includes universal phonemes for each pronunciation in the pronunciation audio, and the universal phonemes are phonemes obtained by integrating dedicated phonemes of multiple languages.
A feature extraction module 1130, configured to extract phoneme information of the second training sample through the multi-lingual phoneme detection model, where the phoneme information of the second training sample includes phoneme information of the teacher pronunciation audio and phoneme information of the student pronunciation audio in the second training sample.
A similarity determination module 1140, configured to obtain a similarity between the teacher pronunciation audio and the student pronunciation audio based on the phoneme information of the second training sample.
And the model fitting module 1150 is configured to fit the scoring model according to the similarity and the pronunciation score of the student pronunciation audio, where the scoring model is used to score the audio to be evaluated.
In an exemplary embodiment, referring to fig. 12, the model training module 1120 includes: a model pre-training unit 1121, a sample selecting unit 1122, a parameter adjusting unit 1123, and a model generating unit 1124.
A model pre-training unit 1121 configured to train an initialized multi-lingual phoneme detection model based on the first training sample set, where the initialized multi-lingual phoneme detection model is a pre-training model for detecting universal phonemes.
A sample selecting unit 1122, configured to, for a target language, obtain a first training sample of the target language from the first training sample set.
A parameter adjusting unit 1123, configured to adjust the model parameters of the initialized multi-language phoneme detection model by using the first training sample of the target language, so as to obtain model parameters corresponding to the target language.
The model generating unit 1124 is configured to obtain the multi-language phoneme detection model based on model parameters corresponding to a plurality of languages.
In an exemplary embodiment, the parameter adjusting unit 1123 is configured to:
fixing parameters of an intermediate layer in the initialized multi-language phoneme detection model;
replacing a full-pass layer of an input end and an output end in the initialized multi-language phoneme detection model with a full-connection layer, wherein the full-pass layer does not process data, and the full-connection layer is a related language layer for mining language features;
and adjusting the model parameters of the full connection layer by adopting the first training sample of the target language to obtain the model parameters corresponding to the target language.
In an exemplary embodiment, referring to fig. 12, the sample acquiring module 1110 includes: original tag acquisition unit 1111, universal phoneme acquisition unit 1112, and phoneme replacement unit 1113.
The original tag obtaining unit 1111 is configured to obtain original tag data of a teacher pronunciation audio of each language, where the original tag data includes a specific phoneme of each pronunciation in the teacher pronunciation audio, and the specific phoneme is a phoneme specific to one language.
A universal phoneme obtaining unit 1112 is configured to obtain a universal phoneme set, which is a set of the universal phonemes.
A phoneme replacing unit 1113, configured to perform replacement processing on the special phonemes of each pronunciation corresponding to each audio frame in the teacher pronunciation audio based on the universal phoneme set, so as to obtain tag data of the first training sample.
In an exemplary embodiment, the universal phoneme acquisition unit 1112 is configured to:
respectively training the phoneme recognizers of the languages based on the teacher pronunciation audio of the languages and the original label data of the teacher pronunciation audio of the languages;
sending the teacher pronunciation audio of each language to a phoneme recognizer of each language, and determining similar phonemes in each language, wherein the similar phonemes refer to special phonemes with similar pronunciations in each language;
and binding and merging the similar phonemes into the universal phoneme to obtain the universal phoneme set.
In an exemplary embodiment, the universal phoneme obtaining unit 1112 is further configured to:
for a target language, sending a teacher pronunciation audio of the target language to a phoneme recognizer of the target language to obtain probability values of each audio frame in the teacher pronunciation audio of the target language under each special phoneme of the target language;
determining the special phonemes corresponding to the audio frames based on the probability values of the audio frames under the special phonemes of the target language;
inputting an audio frame corresponding to a first special phoneme in a first language into a phoneme recognizer of a second language to obtain a probability value of the audio frame corresponding to the first special phoneme under each special phoneme of the second language, wherein the first language and the second language are different languages;
taking a probability value of the audio frame corresponding to the first proprietary phoneme under each proprietary phoneme of the second language as an inverse of a distance from the first proprietary phoneme to each proprietary phoneme of the second language;
and determining similar phonemes in the languages through a clustering algorithm based on the distance.
In an exemplary embodiment, the original tag obtaining unit 1111 is configured to:
acquiring a phoneme set of each language;
and aligning the teacher pronunciation audio of each language based on the phoneme set of each language to obtain the original label data of the teacher pronunciation audio of each language.
In summary, according to the technical scheme provided by the embodiment of the application, the multi-language phoneme detection model is trained through the sample audio with the universal phoneme information as the label, so that the multi-language phoneme detection model can accurately detect the universal phoneme information corresponding to the teacher pronunciation and the student pronunciation of the same session, the similarity between the teacher pronunciation and the student pronunciation is calculated on the basis, a regression model is fitted to describe the statistical relationship between the scoring label carried by the student pronunciation and the similarity, the pronunciation evaluation system is trained and completed finally, the application range of the model is expanded, the multi-language user pronunciation evaluation function is realized, the pronunciation evaluation complexity is effectively reduced, and the storage resource is saved.
In addition, phonemes with similar pronunciations in various languages are bound into universal phonemes through a clustering algorithm to generate a universal phoneme set, and then the universal phonemes of various languages are replaced, so that a foundation is laid for realizing a multi-language user pronunciation evaluation function. And aiming at different languages, a plurality of groups of corresponding model parameters are set in the multi-language phoneme detection model, so that the pronunciation evaluation accuracy under the multi-language condition is ensured.
Referring to fig. 13, a block diagram of a computer device 1300 according to an embodiment of the present application is shown. The computer device 1300 may be an electronic device such as a mobile phone, a tablet computer, a multimedia playing device, a wearable device, a PC (Personal Computer), a language learning terminal, an intelligent teaching machine, and the like. The computer device is used for implementing the pronunciation evaluation method or the training method of the pronunciation evaluation system provided in the above embodiments. The computer device may be the terminal 10 or the server 20 in the application execution environment shown in fig. 1.
Generally, computer device 1300 includes: a processor 1301 and a memory 1302.
Processor 1301 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 1301 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 1301 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also referred to as a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 1301 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing content that the display screen needs to display. In some embodiments, processor 1301 may further include an AI (Artificial Intelligence) processor for processing computational operations related to machine learning.
Memory 1302 may include one or more computer-readable storage media, which may be non-transitory. The memory 1302 may also include high speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer-readable storage medium in the memory 1302 is used to store at least one instruction, at least one program, set of codes, or set of instructions configured to be executed by one or more processors to implement the above-described pronunciation assessment method or the training method of the pronunciation assessment system.
In some embodiments, computer device 1300 may also optionally include: a peripheral interface 1303 and at least one peripheral. Processor 1301, memory 1302, and peripheral interface 1303 may be connected by a bus or signal line. Each peripheral device may be connected to the peripheral device interface 1303 via a bus, signal line, or circuit board.
Those skilled in the art will appreciate that the architecture shown in FIG. 13 is not intended to be limiting of the computer device 1300, and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be used.
In an exemplary embodiment, a computer-readable storage medium is also provided, in which a computer program is stored, which computer program, when being executed by a processor, is adapted to carry out the above-mentioned pronunciation assessment method.
In an exemplary embodiment, a computer-readable storage medium is further provided, in which a computer program is stored, which when executed by a processor implements the training method of the pronunciation evaluation system described above.
Optionally, the computer-readable storage medium may include: ROM (Read-Only Memory), RAM (Random Access Memory), SSD (Solid State Drive), or an optical disc. The random access memory may include ReRAM (Resistive Random Access Memory) and DRAM (Dynamic Random Access Memory).
In an exemplary embodiment, a computer program product or computer program is also provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to enable the computer device to execute the pronunciation evaluation method.
In an exemplary embodiment, a computer program product or computer program is also provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the computer device executes the training method of the pronunciation evaluation system.
It should be understood that reference to "a plurality" herein means two or more. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. In addition, the step numbers described herein only exemplarily show one possible execution sequence among the steps, and in some other embodiments, the steps may also be executed out of the numbering sequence, for example, two steps with different numbers are executed simultaneously, or two steps with different numbers are executed in a reverse order to the order shown in the figure, which is not limited by the embodiment of the present application.
The above description is only exemplary of the present application and should not be taken as limiting the present application, and any modifications, equivalents, improvements and the like that are made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (26)

1. A pronunciation assessment method, the method comprising:
acquiring a pronunciation audio to be evaluated;
extracting phoneme information of the pronunciation audio to be evaluated, wherein the phoneme information comprises universal phonemes of each pronunciation in the pronunciation audio, and the universal phonemes are phonemes obtained by integrating special phonemes of a plurality of languages;
obtaining the similarity between the pronunciation audio to be evaluated and the teacher pronunciation audio based on the phoneme information of the pronunciation audio to be evaluated and the phoneme information of the teacher pronunciation audio corresponding to the pronunciation audio to be evaluated;
and determining the pronunciation score of the pronunciation audio to be evaluated based on the similarity.
2. The method according to claim 1, wherein the extracting of the phoneme information of the pronunciation audio to be evaluated comprises:
dividing the pronunciation audio to be evaluated into at least one audio frame;
acquiring the frequency domain characteristics of each audio frame in the pronunciation audio to be evaluated;
processing the frequency domain characteristics of each audio frame in the pronunciation audio to be evaluated through a multi-language phoneme detection model to obtain phoneme information of the pronunciation audio to be evaluated, wherein the phoneme information of the pronunciation audio to be evaluated comprises probability distribution of each audio frame in the pronunciation audio to be evaluated, and the probability distribution reflects the probability of pronunciation of the audio frame on each universal phoneme;
the multi-language phoneme detection model is a machine learning model used for determining universal phonemes corresponding to all audio frames in pronunciation audio.
3. The method according to claim 2, wherein the processing the frequency domain features of each audio frame in the pronunciation audio to be evaluated through a multi-language phoneme detection model to obtain the phoneme information of the pronunciation audio to be evaluated comprises:
acquiring a target language of the pronunciation audio to be evaluated;
selecting target model parameters corresponding to the target language; the multi-language phoneme detection model is provided with a plurality of groups of model parameters, and different model parameters correspond to different languages;
and processing the frequency domain characteristics of each audio frame in the pronunciation audio to be evaluated through the multi-language phoneme detection model according to the target model parameters to obtain the phoneme information of the pronunciation audio to be evaluated.
4. The method according to any one of claims 1 to 3, wherein the obtaining of the similarity between the pronunciation audio to be evaluated and the teacher pronunciation audio based on the phoneme information of the pronunciation audio to be evaluated and the phoneme information of the teacher pronunciation audio corresponding to the pronunciation audio to be evaluated comprises:
performing dynamic time warping on the distribution probability vector of each audio frame in the pronunciation audio to be evaluated and the distribution probability vector of each audio frame in the teacher pronunciation audio to obtain a time sequence matching result, wherein a distribution probability vector is a vector formed by the probabilities of an audio frame's pronunciation over all the universal phonemes, and the time sequence matching result is a combination of matched distribution probability vectors obtained by aligning the pronunciation audio to be evaluated and the teacher pronunciation audio in time sequence according to the same universal phonemes;
and obtaining the similarity between the pronunciation audio to be evaluated and the teacher pronunciation audio based on the similarity between the matched distribution probability vectors determined by the time sequence matching result.
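The alignment in claim 4 can be realized with textbook dynamic time warping over the two sequences of distribution probability vectors, averaging the similarity of the matched pairs. This sketch assumes cosine similarity as the vector-similarity measure, which the claim itself does not fix.

```python
import numpy as np

def dtw_similarity(probs_a, probs_b):
    """Align two sequences of per-frame phoneme distributions with DTW,
    then average the cosine similarity of the matched vector pairs."""
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

    n, m = len(probs_a), len(probs_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 1.0 - cos(probs_a[i - 1], probs_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1],
                                 cost[i - 1, j - 1])

    # Backtrack to recover the matched pairs (the "time sequence
    # matching result"), then average their similarities.
    i, j, sims = n, m, []
    while i > 0 and j > 0:
        sims.append(cos(probs_a[i - 1], probs_b[j - 1]))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return float(np.mean(sims))
```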
5. The method according to any one of claims 1 to 3, wherein the determining the pronunciation score of the pronunciation audio to be evaluated based on the similarity comprises:
and determining the pronunciation score of the pronunciation audio to be evaluated according to the similarity through a scoring model, wherein the scoring model is a regression model used for quantitatively describing the statistical relationship between the similarity and the pronunciation score.
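A minimal sketch of claim 5's scoring model, using an ordinary least-squares regressor from scikit-learn as a stand-in; the similarity/score pairs below are made-up illustrative numbers, not data from the application.

```python
import numpy as np
from sklearn.linear_model import LinearRegression  # stand-in regressor

# Fit a regression from similarity to pronunciation score on labelled
# pairs, then use it at evaluation time. Data is fabricated for the demo.
similarities = np.array([[0.42], [0.63], [0.71], [0.88], [0.95]])
scores = np.array([35.0, 58.0, 66.0, 84.0, 93.0])

score_model = LinearRegression().fit(similarities, scores)
print(score_model.predict([[0.80]]))   # predicted pronunciation score
```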
6. A training method for a pronunciation evaluation system, wherein the pronunciation evaluation system comprises a multi-language phoneme detection model and a scoring model, the method comprising:
acquiring a first training sample set and a second training sample set, wherein the first training sample set comprises at least one first training sample, sample data of the first training sample comprises teacher pronunciation audio in any language, tag data of the first training sample comprises universal phonemes of each pronunciation in the teacher pronunciation audio, the second training sample set comprises at least one second training sample, sample data of the second training sample comprises teacher pronunciation audio and student pronunciation audio for the same language, and tag data of the second training sample comprises pronunciation scores of the student pronunciation audio;
training the multi-language phoneme detection model based on the first training sample set, wherein the multi-language phoneme detection model is used for extracting phoneme information of pronunciation audio, the phoneme information comprises universal phonemes of all pronunciations in the pronunciation audio, and the universal phonemes are phonemes obtained by integrating special phonemes of multiple languages;
extracting phoneme information of the second training sample through the multi-language phoneme detection model, wherein the phoneme information of the second training sample comprises phoneme information of the teacher pronunciation audio and phoneme information of the student pronunciation audio in the second training sample;
obtaining the similarity of the teacher pronunciation audio and the student pronunciation audio based on the phoneme information of the second training sample;
and fitting the scoring model according to the similarity and the pronunciation score of the student pronunciation audio, wherein the scoring model is used for scoring the audio to be evaluated.
7. The method of claim 6, wherein training the multi-language phoneme detection model based on the first training sample set comprises:
training an initialized multi-language phoneme detection model based on the first training sample set, the initialized multi-language phoneme detection model being a pre-trained model for detecting universal phonemes;
for a target language, acquiring a first training sample of the target language from the first training sample set;
adjusting the model parameters of the initialized multi-language phoneme detection model by adopting the first training sample of the target language to obtain model parameters corresponding to the target language;
and obtaining the multi-language phoneme detection model based on the model parameters respectively corresponding to the languages.
8. The method according to claim 7, wherein said adjusting the model parameters of the initialized multi-language phoneme detection model by using the first training sample of the target language to obtain the model parameters corresponding to the target language comprises:
fixing parameters of an intermediate layer in the initialized multi-language phoneme detection model;
replacing full-pass layers at an input end and an output end of the initialized multi-language phoneme detection model with fully connected layers, wherein a full-pass layer passes data through without processing it, and a fully connected layer is a language-related layer for mining language features;
and adjusting the model parameters of the full connection layer by adopting the first training sample of the target language to obtain the model parameters corresponding to the target language.
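Claims 7 and 8 can be sketched in PyTorch as follows: a pre-trained backbone whose input and output ends are pass-through (Identity) layers is adapted per target language by freezing the middle and replacing the two ends with trainable fully connected layers. Layer sizes, depth, and optimizer settings are assumptions of this sketch.

```python
import torch
import torch.nn as nn

# Minimal sketch of claims 7-8. During pre-training, the input and
# output ends are "full-pass" layers (nn.Identity) that do not process data.
model = nn.Sequential(
    nn.Identity(),                  # input-end full-pass layer
    nn.Linear(40, 256), nn.ReLU(),  # language-independent middle
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 120),            # logits over universal phonemes
    nn.Identity(),                  # output-end full-pass layer
)

def adapt_to_language(m):
    """Freeze the intermediate layers, then replace both full-pass layers
    with trainable fully connected layers that mine language-specific
    features; only those new layers are fine-tuned."""
    for p in m.parameters():
        p.requires_grad = False     # fix the intermediate layers
    m[0] = nn.Linear(40, 40)        # input-end fully connected layer
    m[6] = nn.Linear(120, 120)      # output-end fully connected layer
    new_params = list(m[0].parameters()) + list(m[6].parameters())
    return torch.optim.Adam(new_params, lr=1e-4)

optimizer = adapt_to_language(model)  # repeat once per target language
```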
9. The method of any of claims 6 to 8, wherein the obtaining a first set of training samples comprises:
acquiring original tag data of teacher pronunciation audio of each language, wherein the original tag data comprises a special phoneme of each pronunciation in the teacher pronunciation audio, and a special phoneme is a phoneme specific to a single language;
acquiring a universal phoneme set, wherein the universal phoneme set is a set formed by the universal phonemes;
and replacing the special phonemes of each pronunciation corresponding to each audio frame in the teacher pronunciation audio based on the universal phoneme set to obtain the label data of the first training sample.
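The relabeling step in claim 9 amounts to a lookup from (language, special phoneme) to universal phoneme. A toy sketch, with an invented two-language mapping:

```python
# Relabel per-frame special phonemes with universal phonemes (claim 9).
# The mapping below is a made-up fragment of a universal phoneme set.
UNIVERSAL = {("en", "p"): "P", ("fr", "p"): "P",
             ("en", "i:"): "I", ("fr", "i"): "I"}

def to_universal(labels, language):
    """labels: one special phoneme per audio frame of the teacher audio."""
    return [UNIVERSAL[(language, ph)] for ph in labels]

print(to_universal(["p", "i:"], "en"))   # ['P', 'I']
```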
10. The method of claim 9, wherein said obtaining a universal phone set comprises:
respectively training the phoneme recognizers of the languages based on the teacher pronunciation audio of the languages and the original label data of the teacher pronunciation audio of the languages;
sending the teacher pronunciation audio of each language to the phoneme recognizer of each language, and determining similar phonemes across the languages, wherein similar phonemes are special phonemes of different languages with similar pronunciations;
and merging the similar phonemes into universal phonemes to obtain the universal phoneme set.
11. The method of claim 10, wherein the sending the teacher pronunciation audio of each language to the phoneme recognizer of each language and determining similar phonemes across the languages comprises:
for a target language, sending a teacher pronunciation audio of the target language to a phoneme recognizer of the target language to obtain probability values of each audio frame in the teacher pronunciation audio of the target language under each special phoneme of the target language;
determining the special phonemes corresponding to the audio frames based on the probability values of the audio frames under the special phonemes of the target language;
inputting an audio frame corresponding to a first special phoneme in a first language into a phoneme recognizer of a second language to obtain a probability value of the audio frame corresponding to the first special phoneme under each special phoneme of the second language, wherein the first language and the second language are different languages;
taking the probability value of the audio frame corresponding to the first special phoneme under each special phoneme of the second language as the inverse of the distance from the first special phoneme to that special phoneme of the second language;
and determining similar phonemes in the languages through a clustering algorithm based on the distance.
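Claim 11's distance construction and clustering might look like the following sketch: the recognizer's probabilities become inverse distances between phonemes of two languages, a joint distance matrix is assembled, and agglomerative clustering groups similar phonemes into universal phonemes. All numbers are fabricated for illustration, and setting within-language distances to a large constant is an extra assumption the claim does not state.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Rows: language-A phonemes; columns: language-B phonemes; entries:
# average recognizer probability of A-phoneme frames under each B-phoneme.
prob = np.array([
    [0.70, 0.20, 0.10],   # A:/p/ is recognized mostly as B:/p/
    [0.15, 0.75, 0.10],   # A:/b/ is recognized mostly as B:/b/
    [0.10, 0.10, 0.80],   # A:/m/ is recognized mostly as B:/m/
])
dist_ab = 1.0 / prob      # probability as the inverse of the distance

# Joint symmetric distance matrix over all 6 phonemes; within-language
# distances are a large constant so distinct phonemes do not merge.
n_a, n_b = prob.shape
joint = np.full((n_a + n_b, n_a + n_b), 100.0)
np.fill_diagonal(joint, 0.0)
joint[:n_a, n_a:] = dist_ab
joint[n_a:, :n_a] = dist_ab.T

# Agglomerative clustering; each cluster becomes one universal phoneme.
Z = linkage(squareform(joint, checks=False), method="average")
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)   # phonemes sharing a label merge into one universal phoneme
```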
12. The method of claim 9, wherein obtaining raw tag data of teacher pronunciation audio of each language comprises:
acquiring a phoneme set of each language;
and aligning the teacher pronunciation audio of each language based on the phoneme set of each language to obtain the original label data of the teacher pronunciation audio of each language.
13. A pronunciation evaluation device, the device comprising:
the audio acquisition module is used for acquiring the pronunciation audio to be evaluated;
the feature extraction module is used for extracting phoneme information of the pronunciation audio to be evaluated, wherein the phoneme information comprises universal phonemes of each pronunciation in the pronunciation audio, and the universal phonemes are phonemes obtained by integrating special phonemes of a plurality of languages;
the similarity determining module is used for obtaining the similarity between the pronunciation audio to be evaluated and the teacher pronunciation audio based on the phoneme information of the pronunciation audio to be evaluated and the phoneme information of the teacher pronunciation audio corresponding to the pronunciation audio to be evaluated;
and the score determining module is used for determining the pronunciation score of the pronunciation audio to be evaluated based on the similarity.
14. The apparatus of claim 13, wherein the feature extraction module comprises:
the audio frame dividing unit is used for dividing the pronunciation audio to be evaluated into at least one audio frame;
the frequency domain conversion unit is used for acquiring the frequency domain characteristics of each audio frame in the pronunciation audio to be evaluated;
the phoneme prediction unit is used for processing the frequency domain characteristics of each audio frame in the pronunciation audio to be evaluated through a multi-language phoneme detection model to obtain phoneme information of the pronunciation audio to be evaluated, wherein the phoneme information of the pronunciation audio to be evaluated comprises probability distribution of each audio frame in the pronunciation audio to be evaluated, and the probability distribution reflects the probability of pronunciation of the audio frame on each universal phoneme;
the multi-language phoneme detection model is a machine learning model used for determining universal phonemes corresponding to all audio frames in pronunciation audio.
15. The apparatus of claim 14, wherein the phoneme prediction unit is configured to:
acquiring a target language of the pronunciation audio to be evaluated;
selecting target model parameters corresponding to the target language; the multi-language phoneme detection model is provided with a plurality of groups of model parameters, and different model parameters correspond to different languages;
and processing the frequency domain characteristics of each audio frame in the pronunciation audio to be evaluated through the multi-language phoneme detection model according to the target model parameters to obtain the phoneme information of the pronunciation audio to be evaluated.
16. The apparatus of any one of claims 13 to 15, wherein the similarity determination module is configured to:
performing dynamic time warping on the distribution probability vector of each audio frame in the pronunciation audio to be evaluated and the distribution probability vector of each audio frame in the teacher pronunciation audio to obtain a time sequence matching result, wherein a distribution probability vector is a vector formed by the probabilities of an audio frame's pronunciation over all the universal phonemes, and the time sequence matching result is a combination of matched distribution probability vectors obtained by aligning the pronunciation audio to be evaluated and the teacher pronunciation audio in time sequence according to the same universal phonemes;
and obtaining the similarity between the pronunciation audio to be evaluated and the teacher pronunciation audio based on the similarity between the matched distribution probability vectors determined by the time sequence matching result.
17. The apparatus of any one of claims 13 to 15, wherein the score determination module is configured to:
and determining the pronunciation score of the pronunciation audio to be evaluated according to the similarity through a scoring model, wherein the scoring model is a regression model used for quantitatively describing the statistical relationship between the similarity and the pronunciation score.
18. A training device for a pronunciation evaluation system, wherein the pronunciation evaluation system comprises a multi-language phoneme detection model and a scoring model, the device comprising:
a sample acquisition module, configured to acquire a first training sample set and a second training sample set, wherein the first training sample set comprises at least one first training sample, sample data of the first training sample comprises teacher pronunciation audio in any language, tag data of the first training sample comprises universal phonemes of each pronunciation in the teacher pronunciation audio, the second training sample set comprises at least one second training sample, sample data of the second training sample comprises teacher pronunciation audio and student pronunciation audio for the same language, and tag data of the second training sample comprises pronunciation scores of the student pronunciation audio;
a model training module, configured to train the multi-language phoneme detection model based on the first training sample set, where the multi-language phoneme detection model is configured to extract phoneme information of pronunciation audio, where the phoneme information includes universal phonemes for each pronunciation in the pronunciation audio, and the universal phonemes are phonemes obtained by integrating dedicated phonemes of multiple languages;
a feature extraction module, configured to extract phoneme information of the second training sample through the multi-language phoneme detection model, where the phoneme information of the second training sample includes phoneme information of the teacher pronunciation audio and phoneme information of the student pronunciation audio in the second training sample;
the similarity determining module is used for obtaining the similarity of the teacher pronunciation audio and the student pronunciation audio based on the phoneme information of the second training sample;
and the model fitting module is used for fitting the scoring model according to the similarity and the pronunciation score of the student pronunciation audio, and the scoring model is used for scoring the audio to be evaluated.
19. The apparatus of claim 18, wherein the model training module comprises:
a model pre-training unit, configured to train an initialized multi-language phoneme detection model based on the first training sample set, the initialized multi-language phoneme detection model being a pre-trained model for detecting universal phonemes;
a sample selecting unit, configured to, for a target language, obtain a first training sample of the target language from the first training sample set;
a parameter adjusting unit, configured to adjust a model parameter of the initialized multi-language phoneme detection model by using the first training sample of the target language to obtain a model parameter corresponding to the target language;
and the model generating unit is used for obtaining the multi-language phoneme detection model based on the model parameters respectively corresponding to the languages.
20. The apparatus of claim 19, wherein the parameter adjusting unit is configured to:
fixing parameters of an intermediate layer in the initialized multi-language phoneme detection model;
replacing full-pass layers at an input end and an output end of the initialized multi-language phoneme detection model with fully connected layers, wherein a full-pass layer passes data through without processing it, and a fully connected layer is a language-related layer for mining language features;
and adjusting the model parameters of the full connection layer by adopting the first training sample of the target language to obtain the model parameters corresponding to the target language.
21. The apparatus of any one of claims 18 to 20, wherein the sample acquisition module comprises:
an original tag obtaining unit, configured to obtain original tag data of teacher pronunciation audio of each language, wherein the original tag data comprises a special phoneme of each pronunciation in the teacher pronunciation audio, and a special phoneme is a phoneme specific to a single language;
a universal phoneme obtaining unit configured to obtain a universal phoneme set, which is a set of the universal phonemes;
and the phoneme replacing unit is used for replacing the special phonemes of each pronunciation corresponding to each audio frame in the teacher pronunciation audio based on the universal phoneme set to obtain the label data of the first training sample.
22. The apparatus of claim 21, wherein the universal phoneme retrieving unit is configured to:
respectively training the phoneme recognizers of the languages based on the teacher pronunciation audio of the languages and the original label data of the teacher pronunciation audio of the languages;
sending the teacher pronunciation audio of each language to the phoneme recognizer of each language, and determining similar phonemes across the languages, wherein similar phonemes are special phonemes of different languages with similar pronunciations;
and merging the similar phonemes into universal phonemes to obtain the universal phoneme set.
23. The apparatus of claim 22, wherein the universal phoneme retrieving unit is further configured to:
for a target language, sending a teacher pronunciation audio of the target language to a phoneme recognizer of the target language to obtain probability values of each audio frame in the teacher pronunciation audio of the target language under each special phoneme of the target language;
determining the special phonemes corresponding to the audio frames based on the probability values of the audio frames under the special phonemes of the target language;
inputting an audio frame corresponding to a first special phoneme in a first language into a phoneme recognizer of a second language to obtain a probability value of the audio frame corresponding to the first special phoneme under each special phoneme of the second language, wherein the first language and the second language are different languages;
taking the probability value of the audio frame corresponding to the first special phoneme under each special phoneme of the second language as the inverse of the distance from the first special phoneme to that special phoneme of the second language;
and determining similar phonemes in the languages through a clustering algorithm based on the distance.
24. The apparatus of claim 21, wherein the original tag obtaining unit is configured to:
acquiring a phoneme set of each language;
and aligning the teacher pronunciation audio of each language based on the phoneme set of each language to obtain the original label data of the teacher pronunciation audio of each language.
25. A computer device, comprising a processor and a memory, wherein the memory stores a computer program that is loaded and executed by the processor to implement the method according to any one of claims 1 to 5, or to implement the method according to any one of claims 6 to 12.
26. A computer-readable storage medium having a computer program stored therein, wherein the computer program is loaded and executed by a processor to implement the method according to any one of claims 1 to 5 or the method according to any one of claims 6 to 12.
CN202011034404.4A 2020-09-27 2020-09-27 Pronunciation evaluation method, training method, device and equipment of pronunciation evaluation system Pending CN114283788A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011034404.4A CN114283788A (en) 2020-09-27 2020-09-27 Pronunciation evaluation method, training method, device and equipment of pronunciation evaluation system


Publications (1)

Publication Number Publication Date
CN114283788A true CN114283788A (en) 2022-04-05

Family

ID=80867934

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011034404.4A Pending CN114283788A (en) 2020-09-27 2020-09-27 Pronunciation evaluation method, training method, device and equipment of pronunciation evaluation system

Country Status (1)

Country Link
CN (1) CN114283788A (en)


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100996164B1 (en) * 2009-10-20 2010-11-23 주식회사 예스피치 System and method for learning language using statics language model
CN102214462A (en) * 2011-06-08 2011-10-12 北京爱说吧科技有限公司 Method and system for estimating pronunciation
CN106782603A (en) * 2016-12-22 2017-05-31 上海语知义信息技术有限公司 Intelligent sound evaluating method and system
CN109686383A (en) * 2017-10-18 2019-04-26 腾讯科技(深圳)有限公司 A kind of speech analysis method, device and storage medium
CN108766415A (en) * 2018-05-22 2018-11-06 清华大学 A kind of voice assessment method
CN109300339A (en) * 2018-11-19 2019-02-01 王泓懿 A kind of exercising method and system of Oral English Practice
CN111640452A (en) * 2019-03-01 2020-09-08 北京搜狗科技发展有限公司 Data processing method and device and data processing device
JP2020144213A (en) * 2019-03-06 2020-09-10 Kddi株式会社 Program, device and method for pronunciation evaluation using inter-model distance
CN110223673A (en) * 2019-06-21 2019-09-10 龙马智芯(珠海横琴)科技有限公司 The processing method and processing device of voice, storage medium, electronic equipment
CN110415725A (en) * 2019-07-15 2019-11-05 北京语言大学 Use the method and system of first language data assessment second language pronunciation quality

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HU, Yuanling; KOU, Yuanyuan: "Design of a Phoneme-Based Automatic English Pronunciation Evaluation System", Automation & Instrumentation, no. 11, 25 November 2018 (2018-11-25) *
HUANG, Shuang; LI, Jing; WANG, Hongying; YANG, Jun; ZHANG, Bo: "Pronunciation Quality Evaluation Algorithm Based on Confusable Pronunciation Models", Computer Applications, no. 2, 28 December 2006 (2006-12-28) *

Similar Documents

Publication Publication Date Title
CN111933129B (en) Audio processing method, language model training method and device and computer equipment
O’Shaughnessy Automatic speech recognition: History, methods and challenges
US11335324B2 (en) Synthesized data augmentation using voice conversion and speech recognition models
Darabkh et al. An efficient speech recognition system for arm‐disabled students based on isolated words
CN110970036B (en) Voiceprint recognition method and device, computer storage medium and electronic equipment
CN101777347A (en) Model complementary Chinese accent identification method and system
CN109658918A (en) A kind of intelligence Oral English Practice repetition topic methods of marking and system
Chelali et al. Text dependant speaker recognition using MFCC, LPC and DWT
Kumar et al. Machine learning based speech emotions recognition system
Maqsood et al. An efficientmis pronunciation detection system using discriminative acoustic phonetic features for arabic consonants.
Liu et al. AI recognition method of pronunciation errors in oral English speech with the help of big data for personalized learning
Hassan et al. Improvement in automatic speech recognition of south asian accent using transfer learning of deepspeech2
Dua et al. Noise robust automatic speech recognition: review and analysis
Gaudani et al. Comparative study of robust feature extraction techniques for ASR for limited resource Hindi language
CN113053409B (en) Audio evaluation method and device
CN117711398A (en) A voice interactive teaching method, device and glasses
Kurian et al. Connected digit speech recognition system for Malayalam language
Luo The improving effect of intelligent speech recognition System on english learning
CN114283788A (en) Pronunciation evaluation method, training method, device and equipment of pronunciation evaluation system
CN114724589A (en) Voice quality inspection method and device, electronic equipment and storage medium
Zheng [Retracted] An Analysis and Research on Chinese College Students’ Psychological Barriers in Oral English Output from a Cross‐Cultural Perspective
Camarena-Ibarrola et al. Speaker identification using entropygrams and convolutional neural networks
Liu et al. A New Speech Encoder Based on Dynamic Framing Approach.
Lin et al. A Neural Network-Based Noise Compensation Method for Pronunciation Assessment.
Dutta et al. A comparison of three spectral features for phone recognition in sub-optimal environments

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination