CN118072714A - Method, device and computer program for English pronunciation assessment - Google Patents
Method, device and computer program for English pronunciation assessment
- Publication number
- CN118072714A (application number CN202410224412.7A)
- Authority
- CN
- China
- Prior art keywords
- audio
- pronunciation
- english
- frequency
- voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/01—Assessment or evaluation of speech recognition systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/142—Hidden Markov Models [HMMs]
- G10L15/144—Training of HMMs
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/45—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window
Abstract
The invention relates to the technical field of pronunciation assessment and discloses a method for evaluating English pronunciation, which comprises the following steps: acquiring English pronunciation audio and pronunciation video to be scored; transmitting the English pronunciation audio and pronunciation video to a recognition model, and carrying out data preprocessing and feature extraction to obtain English pronunciation audio features and English pronunciation video features. The data preprocessing comprises preprocessing the collected raw data to eliminate interference and noise, improve the precision of speech-signal feature extraction, and provide effective data for feature extraction and speech recognition; at the same time, the video frames corresponding to each individual English word are grouped and labelled. The method extracts features of spoken English pronunciation using Mel-frequency cepstral coefficients, standardizes and fills the obtained result, and performs error-elimination recognition calculation according to attribute planning, thereby effectively reducing the number of pronunciation mistakes.
Description
Technical Field
The invention relates to the technical field of pronunciation assessment, and in particular to a method, device and computer program for English pronunciation assessment.
Background
As the number of English learners increases, so too does the number of English training institutions and English learning software. However, spoken English and spoken Chinese share very few pronunciation features; teachers tend to emphasize writing, reading and comprehension over spoken language; and learners lack the environment and time to practice spoken English.
To address the problem that traditional methods of evaluating spoken English pronunciation cannot correct and give feedback on students' pronunciation errors in time, a method for evaluating English pronunciation is designed.
Disclosure of Invention
The invention provides a method, device and computer program for English pronunciation assessment, which have the advantage of timely correction and solve the problems identified in the background above.
The invention provides the following technical solution: a method of English pronunciation assessment comprising the following steps:
step S1: acquiring English pronunciation audio and pronunciation video to be scored;
step S2: transmitting English pronunciation audio and pronunciation video to the recognition model, and carrying out data preprocessing and feature extraction to obtain English pronunciation audio features and English pronunciation video features;
The data preprocessing comprises preprocessing the collected raw data to eliminate interference and noise, improve the precision of speech-signal feature extraction, and provide effective data for feature extraction and speech recognition; at the same time, the video frames corresponding to each individual English word are grouped and labelled;
The scoring model is used to accurately recognize and classify the input speech. A clustering cross-grouping algorithm is added on top of a single-HMM recognition system: a probability calculation is performed for each input utterance by the clustering cross-grouping algorithm, and the speech is mapped to the HMM parameters within the matching feature cluster, finally yielding an accurate speech recognition result;
Step S3: inputting the English pronunciation audio characteristics into the trained scoring model, and outputting the scoring result of English pronunciation;
The scoring model comprises a spoken-pronunciation training model, which is trained as an HMM using the Baum-Welch method;
The process is as follows:
Initialize the model λ = (A, B, π); using the training observation sequence O, derive new parameters λ̄ = (Ā, B̄, π̄); repeat the previous step, optimizing the model parameters until P(O|λ) converges;
Preferably, A represents the time-independent state-transition probabilities of the HMM; B represents the observation probabilities for a given state in the HMM; π represents the initial state distribution of the HMM;
A MATLAB GUI is used to collect speech and video of English words for the HMM model, with the sampling frequency set to 8 kHz; 20 English-major students from a university are selected and each speaks 100 words, giving 2,000 collected samples; the data set mainly comprises the sampling frequency, the number of states, and the speech character sequence of the speech and video parameters;
Step S4: the scoring model includes mispronunciation recognition and mispronunciation correction;
Step S5: mispronunciations in the English pronunciation audio are extracted from the results of the mispronunciation recognition module, the English words in the audio are classified, and the ratio of mispronounced words to the total number of words is calculated to obtain the score of the English pronunciation audio;
Step S6: meanwhile, the mispronounced-word results are obtained from the mispronunciation recognition module; the video frames corresponding to each mispronounced word are retrieved by looking up its frame-number position, and the incorrect-pronunciation frames and the correct-pronunciation frames of the word are shown on a display, so that the learner can compare them and adjust his or her own learning;
Step S7: the English pronunciation is corrected through mispronunciation correction and the corrected audio is played back through the device.
Preferably, step S2 specifically includes: after data preprocessing is completed, speech features are extracted using Mel-frequency cepstral coefficients (MFCCs); the MFCC feature extraction flow is shown in FIG. 2;
The MFCC feature extraction flow can be divided into three steps:
(1) After endpoint detection, the audio signal is converted into a digital signal, and an N-point FFT is applied to each frame of the speech signal x(n) to obtain the spectrum X(k), whose expression is:
(2) The power spectrum |X(k)|² is obtained from X(k); a Mel spectrum is obtained by applying the Mel-frequency filter bank H_m(k); taking the logarithm of the Mel spectrum gives the log-energy spectrum S(m),
where H_m(k) denotes the filter bank and M denotes the number of frequency groups (filters);
If the highest and lowest frequencies of the filter bank H_m(k) are F_h and F_l respectively, and the sampling frequency of the speech signal is F_s, the Mel center frequency f(m) is expressed as:
where B(f) represents the conversion of a linear frequency f in hertz to the Mel scale, and B⁻¹(f) represents its inverse;
(3) The log-energy spectrum S(m) is processed by the discrete cosine transform (DCT) to obtain the i-th order MFCC cepstral coefficients C(i), where i generally lies in the range 12-16; the expression of C(i) is as follows:
To better express the correlation between frame signals, cepstral difference coefficients are calculated from the cepstral coefficients C(i) according to the following formula:
Here k is the difference parameter and takes the value 2. The first-order difference coefficients of the MFCC, denoted ΔMFCC, are obtained using formula (1); substituting ΔMFCC back into formula (1) gives the second-order difference coefficients Δ²MFCC. Finally, the 24-dimensional MFCC feature parameters are obtained, completing feature extraction.
Preferably, step S3 specifically includes mispronunciation recognition in the scoring model. Mispronunciation recognition is performed on the extracted pronunciation features in an error-elimination manner. Let n denote the fluctuation extremum of the frequency vibration of the extracted spoken pronunciation; p the trough extremum of the frequency oscillation; d the correct period of the frequency audio; N the amplitude of the transmission-medium frequency; AH the standard amplitude of the spoken utterance; and t a frequency parameter. The vibration audio E of the extracted spoken pronunciation is then:
The audio is standardized and filled, namely:
Preferably, η_E describes the discrete values in the audio filling; n′ describes the difference between the maximum and minimum of the filling weight function; t describes the number of hops between two different audio nodes; d_ij describes the closest distance between nodes i and j. Attribute planning can then be performed on the filled data to obtain:
Preferably, U_i represents an audio index and i represents the specific period parameter of the appropriate audio; the attributes of these parameters are labelled and the error-elimination identification operation is carried out, yielding formula (5):
Preferably, A_T represents audio jitter, a parameter for measuring notes; S⁻¹ represents the value of the audio attribute set, a parameter for error-elimination identification of the audio; m represents the corresponding audio matching factor; r represents the elevation weight contained in the high-level audio; V_i represents the limit value obtained after error-elimination identification of the audio.
Preferably, step S3 further specifies that the scoring model includes mispronunciation correction: the result obtained from formula (5) is input into the feedback control device of the system for correction, and the feedback control device needs to calculate the feedback path in advance, namely:
Preferably, ω_i represents the connector of the feedback audio; ω_j represents a scale communication numerical parameter; P denotes the audio category parameter coding.
The feedback path and related parameters are obtained from formula (6). If M represents the audio failure value of the pronunciation, expressing the audio failure state as a sequence allows the audio to be compared more effectively, giving formula (7):
M = (dT − P − 1) mod 180    (7)
Preferably, d represents the weight of the comparison audio path; the comparison results are arranged numerically in order, which effectively improves the correction accuracy, and the comparison formula of this process is:
Preferably, M_T represents the optimal parameter for audio measurement; n represents the accuracy of the audio correction within the specified range. If the pronunciation accuracy is C_i and the real pronunciation is C_q, the resulting functional relationship is:
Preferably, t represents a feedback factor and l represents an extremum frequency. To ensure the accuracy of the correction result during feedback, the feedback process is made robust, giving:
Preferably, the four max values respectively represent the robust upper limit, lower limit, load and transmission limit of the feedback system; α represents the rhythmic performance of the pronunciation audio; r represents the built-in value of an attribute. Using the robust algorithm directly shows the operating condition of the system and controls the upper and lower ordering of the audio so that the system works normally; the calculation formula of this process is:
ω_ij(k+1) = ω_ij(k) + η(d_j − y_j) y_j (1 − y_j) · f(1 − f(u_ij)) x_ij    (11)
Preferably, η represents a learning coefficient; k represents the number of iterations; ω_ij represents the audio collection weight; d_j denotes the inter-audio-node distance; y_j denotes an audio output value; u_ij denotes the audio collection speed; x_ij represents the measured audio node.
Preferably, the apparatus comprises:
A memory for storing the processor-executable instructions;
And a processor for executing program instructions stored in the memory to perform the following:
Receiving an audio and video file comprising English voice, video and text transcripts corresponding to the English voice; and inputting the audio and video signals included in the audio and video files into a recognition model to obtain the voice information of each phoneme in each word of the English voice.
Preferably, the computer program is adapted to perform the method of English pronunciation assessment according to any one of the preceding claims 1-4.
The invention has the following beneficial effects:
1. According to the method, device and computer program for English pronunciation assessment, the features of spoken English pronunciation are extracted using Mel-frequency cepstral coefficients, the obtained result is standardized and filled, and error-elimination recognition calculation is performed according to attribute planning, so that the number of pronunciation mistakes is effectively reduced.
2. According to the method, device and computer program for English pronunciation assessment, the mispronounced-word results are obtained from the mispronunciation recognition module, the video frames corresponding to each mispronounced word are retrieved by looking up its frame-number position, and the incorrect-pronunciation frames and correct-pronunciation frames of the word are shown on a display, so that the learner can compare them and adjust his or her own learning;
the English pronunciation audio is corrected through mispronunciation correction and played back through the device;
this solves the problem that traditional methods of evaluating spoken English pronunciation cannot correct and give feedback on students' pronunciation errors in time.
3. According to the method, device and computer program for English pronunciation assessment, the HMM feature-clustering grouping algorithm improves the accuracy of English robot speech recognition and reduces recognition time, further improving recognition efficiency. Simulation results show that the average recognition rate of the model is 99.65% and the average recognition time is 1.1037 s, so its recognition accuracy is better than that of the traditional model; compared with the deep-learning ANN model, whose recognition rate is 97.83%, the recognition rate of this model is 99.65%, which is higher, and the recognition effect is better.
Drawings
FIG. 1 is a schematic diagram of a feature clustering grouping algorithm of an HMM in the invention;
FIG. 2 is a flow chart of MFCC feature extraction in accordance with the present invention;
FIG. 3 is a graph of speech recognition results for two methods;
Fig. 4 is a graph of the overall recognition effect of the two methods.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to FIGS. 1-4, the method for evaluating English pronunciation includes the following steps:
step S1: acquiring English pronunciation audio and pronunciation video to be scored;
step S2: transmitting English pronunciation audio and pronunciation video to the recognition model, and carrying out data preprocessing and feature extraction to obtain English pronunciation audio features and English pronunciation video features;
the data preprocessing comprises preprocessing the collected raw data to eliminate interference and noise, improve the precision of speech-signal feature extraction, and provide effective data for feature extraction and speech recognition; at the same time, the video frames corresponding to each individual English word are grouped and labelled;
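The disclosure does not spell out the individual preprocessing operations. As a hedged illustration of what such preprocessing typically involves, the sketch below applies pre-emphasis, framing and Hamming windowing to a raw 8 kHz signal; the frame length, hop size and pre-emphasis coefficient are assumptions, not values taken from the patent.

```python
import numpy as np

def preprocess(signal, sample_rate=8000, frame_ms=25, hop_ms=10, pre_emph=0.97):
    """Pre-emphasize, frame and window a raw speech signal (illustrative parameters)."""
    # Pre-emphasis boosts high frequencies and suppresses low-frequency noise energy.
    emphasized = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])

    frame_len = int(sample_rate * frame_ms / 1000)   # 200 samples at 8 kHz
    hop_len = int(sample_rate * hop_ms / 1000)       # 80 samples at 8 kHz
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop_len)

    frames = np.stack([emphasized[i * hop_len:i * hop_len + frame_len]
                       for i in range(n_frames)])
    # A Hamming window per frame reduces spectral leakage before the FFT stage.
    return frames * np.hamming(frame_len)
```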
Step S3: inputting the English pronunciation audio characteristics into the trained scoring model, and outputting the scoring result of English pronunciation;
The scoring model comprises a spoken-pronunciation training model, which is trained as an HMM using the Baum-Welch method;
The process is as follows:
Initialize the model λ = (A, B, π); using the training observation sequence O, derive new parameters λ̄ = (Ā, B̄, π̄); repeat the previous step, optimizing the model parameters until P(O|λ) converges;
wherein A represents the time-independent state-transition probabilities of the HMM; B represents the observation probabilities for a given state in the HMM; π represents the initial state distribution of the HMM;
A MATLAB GUI is used to collect speech and video of English words for the HMM model, with the sampling frequency set to 8 kHz; 20 English-major students from a university are selected and each speaks 100 words, giving 2,000 collected samples; the data set mainly comprises the sampling frequency, the number of states, and the speech character sequence of the speech and video parameters;
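For orientation, the Baum-Welch re-estimation loop described above can be reproduced with an off-the-shelf HMM library. The sketch below uses hmmlearn's Gaussian HMM to fit one model per word from its MFCC sequences; the number of states, covariance type and iteration settings are illustrative assumptions rather than values fixed by the patent.

```python
import numpy as np
from hmmlearn import hmm

def train_word_model(mfcc_sequences, n_states=5, n_iter=50):
    """Fit one HMM per word from its MFCC sequences via Baum-Welch (EM)."""
    # hmmlearn expects all sequences stacked, plus each sequence's length.
    X = np.vstack(mfcc_sequences)
    lengths = [len(seq) for seq in mfcc_sequences]

    model = hmm.GaussianHMM(
        n_components=n_states,   # hidden states per word model (assumed value)
        covariance_type="diag",
        n_iter=n_iter,           # EM iterations, i.e. until P(O|lambda) stops improving
        tol=1e-4,
    )
    model.fit(X, lengths)        # Baum-Welch re-estimation of (A, B, pi)
    return model

# Scoring a new utterance: model.score(features) returns log P(O | lambda).
```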
The scoring model is used to accurately recognize and classify the input speech. A clustering cross-grouping algorithm is added on top of a single-HMM recognition system: a probability calculation is performed for each input utterance by the clustering cross-grouping algorithm, and the speech is mapped to the HMM parameters within the matching feature cluster, finally yielding an accurate speech recognition result.
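The text does not detail the clustering cross-grouping algorithm. One hedged reading, sketched below, is that the per-word HMMs are grouped by clustering the mean MFCC vectors of their training data, and an incoming utterance is scored only against the HMMs in its nearest cluster; the clustering criterion, cluster count and k-means choice are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_clusters(word_models, word_features, n_clusters=8):
    """Group word HMMs by the mean MFCC vector of their training data (assumed criterion)."""
    centroids = np.stack([feats.mean(axis=0) for feats in word_features])
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(centroids)
    clusters = {c: [] for c in range(n_clusters)}
    for model, label in zip(word_models, km.labels_):
        clusters[label].append(model)
    return km, clusters

def recognize(utterance_mfcc, km, clusters):
    """Score the utterance only against the HMMs in its nearest feature cluster."""
    label = int(km.predict(utterance_mfcc.mean(axis=0, keepdims=True))[0])
    return max(clusters[label], key=lambda m: m.score(utterance_mfcc))
```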
Step S4: the scoring model comprises mispronunciation recognition and mispronunciation correction;
Step S5: mispronunciations in the English pronunciation audio are extracted from the results of the mispronunciation recognition module, the English words in the audio are classified, and the ratio of mispronounced words to the total number of words is calculated to obtain the score of the English pronunciation audio;
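The exact scoring formula is not given. A minimal arithmetic sketch, assuming the score is simply the share of correctly pronounced words expressed as a percentage:

```python
def pronunciation_score(words, mispronounced):
    """Score = proportion of correctly pronounced words, as a percentage (assumed formula)."""
    if not words:
        return 0.0
    error_ratio = len(mispronounced) / len(words)
    return round(100.0 * (1.0 - error_ratio), 1)

# Example: 3 mispronounced words out of 20 -> 85.0
```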
Step S6: meanwhile, the mispronounced-word results are obtained from the mispronunciation recognition module; the video frames corresponding to each mispronounced word are retrieved by looking up its frame-number position, and the incorrect-pronunciation frames and the correct-pronunciation frames of the word are shown on a display, so that the learner can compare them and adjust his or her own learning;
Step S7: the English pronunciation is corrected through mispronunciation correction and the corrected audio is played back through the device.
Step S2 specifically includes: after data preprocessing is completed, speech features are extracted using Mel-frequency cepstral coefficients (MFCCs); the MFCC feature extraction flow is shown in FIG. 2;
The MFCC feature extraction flow can be divided into three steps:
(1) After endpoint detection, the audio signal is converted into a digital signal, and an N-point FFT is applied to each frame of the speech signal x(n) to obtain the spectrum X(k), whose expression is:
(2) The power spectrum |X(k)|² is obtained from X(k); a Mel spectrum is obtained by applying the Mel-frequency filter bank H_m(k); taking the logarithm of the Mel spectrum gives the log-energy spectrum S(m),
where H_m(k) denotes the filter bank and M denotes the number of frequency groups (filters);
If the highest and lowest frequencies of the filter bank H_m(k) are F_h and F_l respectively, and the sampling frequency of the speech signal is F_s, the Mel center frequency f(m) is expressed as:
where B(f) represents the conversion of a linear frequency f in hertz to the Mel scale, and B⁻¹(f) represents its inverse;
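The equations referenced above are not reproduced in this text. For reference only, the standard definitions commonly used for these quantities (consistent with the symbols above, but not necessarily the patent's exact formulas) are:

```latex
% Standard MFCC-pipeline definitions (assumed; the patent's own equations are not shown)
X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi n k / N}, \qquad 0 \le k \le N-1
S(m) = \ln\!\Big( \sum_{k=0}^{N-1} |X(k)|^{2}\, H_m(k) \Big), \qquad 0 \le m < M
B(f) = 2595 \log_{10}\!\big(1 + f/700\big), \qquad
f(m) = \frac{N}{F_s}\, B^{-1}\!\Big( B(F_l) + m\,\frac{B(F_h) - B(F_l)}{M+1} \Big)
```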
(3) The log-energy spectrum S(m) is processed by the discrete cosine transform (DCT) to obtain the i-th order MFCC cepstral coefficients C(i), where i generally lies in the range 12-16; the expression of C(i) is as follows:
To better express the correlation between frame signals, cepstral difference coefficients are calculated from the cepstral coefficients C(i) according to the following formula:
Here k is the difference parameter and takes the value 2. The first-order difference coefficients of the MFCC, denoted ΔMFCC, are obtained using formula (1); substituting ΔMFCC back into formula (1) gives the second-order difference coefficients Δ²MFCC. Finally, the 24-dimensional MFCC feature parameters are obtained, completing feature extraction.
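A compact way to reproduce an MFCC-plus-delta feature pipeline of this kind is shown below using librosa. The frame length, hop size and coefficient count are assumptions chosen to match the 8 kHz, 24-dimensional setup described above (8 static + 8 Δ + 8 Δ² is only one possible reading of the 24-dimensional figure).

```python
import numpy as np
import librosa

def mfcc_features(path, sr=8000, n_mfcc=8):
    """Static MFCCs plus first- and second-order deltas, stacked frame-wise."""
    y, sr = librosa.load(path, sr=sr)
    # Static cepstral coefficients C(i): mel filter bank -> log -> DCT.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, n_fft=256, hop_length=80)
    # Difference (delta) coefficients capture correlation between frames.
    d1 = librosa.feature.delta(mfcc)           # delta-MFCC
    d2 = librosa.feature.delta(mfcc, order=2)  # delta^2-MFCC
    feats = np.vstack([mfcc, d1, d2])          # 24 coefficients per frame (8 + 8 + 8)
    return feats.T                             # frames x 24
```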
Step S3 specifically includes mispronunciation recognition in the scoring model. Mispronunciation recognition is performed on the extracted pronunciation features in an error-elimination manner. Let n denote the fluctuation extremum of the frequency vibration of the extracted spoken pronunciation; p the trough extremum of the frequency oscillation; d the correct period of the frequency audio; N the amplitude of the transmission-medium frequency; AH the standard amplitude of the spoken utterance; and t a frequency parameter. The vibration audio E of the extracted spoken pronunciation is then:
The audio is standardized and filled, namely:
wherein η_E describes the discrete values in the audio filling; n′ describes the difference between the maximum and minimum of the filling weight function; t describes the number of hops between two different audio nodes; d_ij describes the closest distance between nodes i and j. Attribute planning can then be performed on the filled data to obtain:
wherein U_i represents an audio index and i represents the specific period parameter of the appropriate audio; the attributes of these parameters are labelled and the error-elimination identification operation is carried out, yielding formula (5):
wherein A_T represents audio jitter, a parameter for measuring notes; S⁻¹ represents the value of the audio attribute set, a parameter for error-elimination identification of the audio; m represents the corresponding audio matching factor; r represents the elevation weight contained in the high-level audio; V_i represents the limit value obtained after error-elimination identification of the audio.
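The error-elimination formulas (3)-(5) above are not reproduced here. As a hedged illustration of how mispronunciation recognition can be realized in practice, the sketch below flags a word as mispronounced when its average per-frame log-likelihood under its own word HMM falls below a threshold; the likelihood criterion and the threshold value are assumptions, not the patent's exact procedure.

```python
def detect_mispronunciations(word_segments, word_models, threshold=-55.0):
    """Flag words whose average per-frame log-likelihood under their own word HMM
    is below a threshold (illustrative criterion and value)."""
    errors = []
    for word, mfcc in word_segments:       # [(word, frames x coeffs array), ...]
        model = word_models[word]          # HMM trained for this word
        avg_loglik = model.score(mfcc) / len(mfcc)
        if avg_loglik < threshold:
            errors.append(word)
    return errors
```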
Step S3 further specifies that the scoring model includes mispronunciation correction: the result obtained from formula (5) is input into the feedback control device of the system for correction, and the feedback control device needs to calculate the feedback path in advance, namely:
wherein ω_i represents the connector of the feedback audio; ω_j represents a scale communication numerical parameter; P denotes the audio category parameter coding.
The feedback path and related parameters are obtained from formula (6). If M represents the audio failure value of the pronunciation, expressing the audio failure state as a sequence allows the audio to be compared more effectively, giving formula (7):
M = (dT − P − 1) mod 180    (7)
wherein d represents the weight of the comparison audio path; the comparison results are arranged numerically in order, which effectively improves the correction accuracy, and the comparison formula of this process is:
wherein M_T represents the optimal parameter for audio measurement; n represents the accuracy of the audio correction within the specified range. If the pronunciation accuracy is C_i and the real pronunciation is C_q, the resulting functional relationship is:
wherein t represents a feedback factor and l represents an extremum frequency. To ensure the accuracy of the correction result during feedback, the feedback process is made robust, giving:
wherein the four max values respectively represent the robust upper limit, lower limit, load and transmission limit of the feedback system; α represents the rhythmic performance of the pronunciation audio; r represents the built-in value of an attribute. Using the robust algorithm directly shows the operating condition of the system and controls the upper and lower ordering of the audio so that the system works normally; the calculation formula of this process is:
ω_ij(k+1) = ω_ij(k) + η(d_j − y_j) y_j (1 − y_j) · f(1 − f(u_ij)) x_ij    (11)
wherein η represents a learning coefficient; k represents the number of iterations; ω_ij represents the audio collection weight; d_j denotes the inter-audio-node distance; y_j denotes an audio output value; u_ij denotes the audio collection speed; x_ij represents the measured audio node.
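Formula (11) is a sigmoid-style delta-rule weight update and can be transcribed directly into code. In the sketch below, f is taken to be the logistic function, which is an assumption since f is not defined in the text; w, u and x stand for the matrices ω_ij, u_ij and x_ij, while d and y are the vectors d_j and y_j.

```python
import numpy as np

def f(u):
    """Assumed activation: logistic sigmoid (f is not defined in the text)."""
    return 1.0 / (1.0 + np.exp(-u))

def update_weights(w, x, u, d, y, eta=0.1):
    """One step of formula (11):
    w_ij(k+1) = w_ij(k) + eta*(d_j - y_j)*y_j*(1 - y_j)*f(1 - f(u_ij))*x_ij."""
    return w + eta * (d - y) * y * (1.0 - y) * f(1.0 - f(u)) * x
```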
Wherein the apparatus comprises:
A memory for storing processor-executable instructions;
And a processor for executing program instructions stored in the memory to perform the following:
Receiving audio and video files comprising English speech, video and text transcripts corresponding to the English speech; the audio and video signals included in the audio and video files are input into the recognition model to obtain the phonetic information of each phoneme in each word of the English speech.
Wherein the computer program is for performing the method of English pronunciation assessment according to any one of the preceding claims 1-4.
In order to verify whether the proposed English pronunciation assessment method is effective, an experiment compares it with the traditional method: both are applied to an English robot recognition system, and the recognition rate and the time taken by the two models are compared. The English sentence "Does Renmin Road turn right from this intersection?" is input into the English robot speech recognition system, and the comparison results are shown in FIG. 3 and FIG. 4.
As can be seen from the comparison result in FIG. 3, compared with the conventional English pronunciation assessment method, the matching time of the proposed model for the input English sentence is only 1.344 s and its total recognition time is 1.4133 s, whereas the matching time and total recognition time of the conventional method are 1.7834 s and 1.8527 s respectively, both higher than those of the proposed model.
Therefore, the model has shorter voice recognition time for English sentences, faster recognition speed and higher recognition efficiency.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (6)
1. The English pronunciation assessment method is characterized by comprising the following steps of:
step S1: acquiring English pronunciation audio and pronunciation video to be scored;
step S2: transmitting English pronunciation audio and pronunciation video to the recognition model, and carrying out data preprocessing and feature extraction to obtain English pronunciation audio features and English pronunciation video features;
The data preprocessing comprises preprocessing the collected raw data to eliminate interference and noise, improve the precision of speech-signal feature extraction, and provide effective data for feature extraction and speech recognition; at the same time, the video frames corresponding to each individual English word are grouped and labelled;
The scoring model is used to accurately recognize and classify the input speech. A clustering cross-grouping algorithm is added on top of a single-HMM recognition system: a probability calculation is performed for each input utterance by the clustering cross-grouping algorithm, and the speech is mapped to the HMM parameters within the matching feature cluster, finally yielding an accurate speech recognition result;
Step S3: inputting the English pronunciation audio characteristics into the trained scoring model, and outputting the scoring result of English pronunciation;
The scoring model comprises a spoken-pronunciation training model, which is trained as an HMM using the Baum-Welch method;
The process is as follows:
Initialize the model λ = (A, B, π); using the training observation sequence O, derive new parameters λ̄ = (Ā, B̄, π̄); repeat the previous step, optimizing the model parameters until P(O|λ) converges;
wherein A represents the time-independent state-transition probabilities of the HMM; B represents the observation probabilities for a given state in the HMM; π represents the initial state distribution of the HMM;
A MATLAB GUI is used to collect speech and video of English words for the HMM model, with the sampling frequency set to 8 kHz; 20 English-major students from a university are selected and each speaks 100 words, giving 2,000 collected samples; the data set mainly comprises the sampling frequency, the number of states, and the speech character sequence of the speech and video parameters;
Step S4: the scoring model includes mispronunciation recognition and mispronunciation correction;
Step S5: mispronunciations in the English pronunciation audio are extracted from the results of the mispronunciation recognition module, the English words in the audio are classified, and the ratio of mispronounced words to the total number of words is calculated to obtain the score of the English pronunciation audio;
Step S6: meanwhile, the mispronounced-word results are obtained from the mispronunciation recognition module; the video frames corresponding to each mispronounced word are retrieved by looking up its frame-number position, and the incorrect-pronunciation frames and the correct-pronunciation frames of the word are shown on a display, so that the learner can compare them and adjust his or her own learning;
Step S7: the English pronunciation is corrected through mispronunciation correction and the corrected audio is played back through the device.
2. The method for English pronunciation assessment according to claim 1, wherein: step S2 specifically comprises feature extraction, namely, after data preprocessing is completed, speech features are extracted using Mel-frequency cepstral coefficients (MFCCs); the MFCC feature extraction flow is shown in FIG. 2;
The MFCC feature extraction flow can be divided into three steps:
(1) After endpoint detection, the audio signal is converted into a digital signal, and an N-point FFT is applied to each frame of the speech signal x(n) to obtain the spectrum X(k), whose expression is:
(2) The power spectrum |X(k)|² is obtained from X(k); a Mel spectrum is obtained by applying the Mel-frequency filter bank H_m(k); taking the logarithm of the Mel spectrum gives the log-energy spectrum S(m),
where H_m(k) denotes the filter bank and M denotes the number of frequency groups (filters);
If the highest and lowest frequencies of the filter bank H_m(k) are F_h and F_l respectively, and the sampling frequency of the speech signal is F_s, the Mel center frequency f(m) is expressed as:
where B(f) represents the conversion of a linear frequency f in hertz to the Mel scale, and B⁻¹(f) represents its inverse;
(3) The log-energy spectrum S(m) is processed by the discrete cosine transform (DCT) to obtain the i-th order MFCC cepstral coefficients C(i), where i generally lies in the range 12-16; the expression of C(i) is as follows:
To better express the correlation between frame signals, cepstral difference coefficients are calculated from the cepstral coefficients C(i) according to the following formula:
Here k is the difference parameter and takes the value 2. The first-order difference coefficients of the MFCC, denoted ΔMFCC, are obtained using formula (1); substituting ΔMFCC back into formula (1) gives the second-order difference coefficients Δ²MFCC. Finally, the 24-dimensional MFCC feature parameters are obtained, completing feature extraction.
3. The method for English pronunciation assessment according to claim 1, wherein: step S3 specifically includes mispronunciation recognition in the scoring model. Mispronunciation recognition is performed on the extracted pronunciation features in an error-elimination manner. Let n denote the fluctuation extremum of the frequency vibration of the extracted spoken pronunciation; p the trough extremum of the frequency oscillation; d the correct period of the frequency audio; N the amplitude of the transmission-medium frequency; AH the standard amplitude of the spoken utterance; and t a frequency parameter. The vibration audio E of the extracted spoken pronunciation is then:
The audio is standardized and filled, namely:
wherein η_E describes the discrete values in the audio filling; n′ describes the difference between the maximum and minimum of the filling weight function; t describes the number of hops between two different audio nodes; d_ij describes the closest distance between nodes i and j. Attribute planning can then be performed on the filled data to obtain:
wherein U_i represents an audio index and i represents the specific period parameter of the appropriate audio; the attributes of these parameters are labelled and the error-elimination identification operation is carried out, yielding formula (5):
n = 1, 2, …, N − 1    (5)
wherein A_T represents audio jitter, a parameter for measuring notes; S⁻¹ represents the value of the audio attribute set, a parameter for error-elimination identification of the audio; m represents the corresponding audio matching factor; r represents the elevation weight contained in the high-level audio; V_i represents the limit value obtained after error-elimination identification of the audio.
4. A method of English pronunciation assessment according to claim 3, wherein: step S3 specifically includes mispronunciation correction in the scoring model: the result obtained from formula (5) is input into the feedback control device of the system for correction, and the feedback control device needs to calculate the feedback path in advance, namely:
wherein ω_i represents the connector of the feedback audio; ω_j represents a scale communication numerical parameter; P denotes the audio category parameter coding.
The feedback path and related parameters are obtained from formula (6). If M represents the audio failure value of the pronunciation, expressing the audio failure state as a sequence allows the audio to be compared more effectively, giving formula (7):
M = (dT − P − 1) mod 180    (7)
wherein d represents the weight of the comparison audio path; the comparison results are arranged numerically in order, which effectively improves the correction accuracy, and the comparison formula of this process is:
wherein M_T represents the optimal parameter for audio measurement; n represents the accuracy of the audio correction within the specified range. If the pronunciation accuracy is C_i and the real pronunciation is C_q, the resulting functional relationship is:
wherein t represents a feedback factor and l represents an extremum frequency. To ensure the accuracy of the correction result during feedback, the feedback process is made robust, giving:
wherein the four max values respectively represent the robust upper limit, lower limit, load and transmission limit of the feedback system; α represents the rhythmic performance of the pronunciation audio; r represents the built-in value of an attribute. Using the robust algorithm directly shows the operating condition of the system and controls the upper and lower ordering of the audio so that the system works normally; the calculation formula of this process is:
ω_ij(k+1) = ω_ij(k) + η(d_j − y_j) y_j (1 − y_j) · f(1 − f(u_ij)) x_ij    (11)
wherein η represents a learning coefficient; k represents the number of iterations; ω_ij represents the audio collection weight; d_j denotes the inter-audio-node distance; y_j denotes an audio output value; u_ij denotes the audio collection speed; x_ij represents the measured audio node.
5. The apparatus according to claim 1, wherein: the apparatus comprises:
A memory for storing the processor-executable instructions;
And a processor for executing program instructions stored in the memory to perform the following:
Receiving an audio and video file comprising English voice, video and text transcripts corresponding to the English voice; and inputting the audio and video signals included in the audio and video files into a recognition model to obtain the voice information of each phoneme in each word of the English voice.
6. A computer program as claimed in claim 1, characterized in that: the computer program is for performing the method of English pronunciation assessment according to any one of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410224412.7A CN118072714A (en) | 2024-02-29 | 2024-02-29 | Method, device and computer program for English pronunciation assessment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410224412.7A CN118072714A (en) | 2024-02-29 | 2024-02-29 | Method, device and computer program for English pronunciation assessment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118072714A true CN118072714A (en) | 2024-05-24 |
Family
ID=91098616
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410224412.7A Pending CN118072714A (en) | 2024-02-29 | 2024-02-29 | Method, device and computer program for English pronunciation assessment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118072714A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118629430A (en) * | 2024-08-15 | 2024-09-10 | 北京英格福科贸有限公司 | A trading terminal and system for multi-channel real-time voice quotation |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8949125B1 (en) | Annotating maps with user-contributed pronunciations | |
CN110457432B (en) | Interview scoring method, interview scoring device, interview scoring equipment and interview scoring storage medium | |
CN112885336B (en) | Training and recognition method and device of voice recognition system and electronic equipment | |
US11282511B2 (en) | System and method for automatic speech analysis | |
WO2022148176A1 (en) | Method, device, and computer program product for english pronunciation assessment | |
US9224383B2 (en) | Unsupervised language model adaptation for automated speech scoring | |
CN103559892A (en) | Method and system for evaluating spoken language | |
CN112466279B (en) | Automatic correction method and device for spoken English pronunciation | |
CN103594087A (en) | Method and system for improving oral evaluation performance | |
CN109658918A (en) | A kind of intelligence Oral English Practice repetition topic methods of marking and system | |
CN111915940A (en) | Method, system, terminal and storage medium for evaluating and teaching spoken language pronunciation | |
CN117292680A (en) | Voice recognition method for power transmission operation detection based on small sample synthesis | |
Liu et al. | AI recognition method of pronunciation errors in oral English speech with the help of big data for personalized learning | |
CN118173118A (en) | Spoken question-answer scoring method, device, equipment, storage medium and program product | |
CN118072714A (en) | Method, device and computer program for English pronunciation assessment | |
Mary et al. | Searching speech databases: features, techniques and evaluation measures | |
CN110349567B (en) | Speech signal recognition method and device, storage medium and electronic device | |
CN112700795B (en) | Spoken language pronunciation quality evaluation method, device, equipment and storage medium | |
CN113053409B (en) | Audio evaluation method and device | |
CN114220419A (en) | A voice evaluation method, device, medium and equipment | |
CN111798867A (en) | English voice analysis and reinforcement learning system and method | |
CN115376547B (en) | Pronunciation evaluation method, pronunciation evaluation device, computer equipment and storage medium | |
CN113035237B (en) | Voice evaluation method and device and computer equipment | |
CN111199750B (en) | Pronunciation evaluation method and device, electronic equipment and storage medium | |
CN115440193A (en) | Pronunciation evaluation scoring method based on deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |