
CN103928023A - Voice scoring method and system - Google Patents

Voice scoring method and system

Info

Publication number
CN103928023A
CN103928023A (application CN201410178813.XA)
Authority
CN
China
Prior art keywords
voice
examination paper
scoring
marked
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410178813.XA
Other languages
Chinese (zh)
Other versions
CN103928023B (en)
Inventor
李心广
李苏梅
何智明
陈泽群
李婷婷
陈广豪
马晓纯
王晓杰
陈嘉华
徐集优
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Foreign Studies
Original Assignee
Guangdong University of Foreign Studies
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Foreign Studies
Priority to CN201410178813.XA
Publication of CN103928023A
Application granted
Publication of CN103928023B
Legal status: Expired - Fee Related
Anticipated expiration

Landscapes

  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a voice scoring method. The method comprises the following steps: first, the examinee's answer speech is recorded; second, the answer speech is preprocessed to obtain an answer speech corpus; third, feature parameters of the answer speech corpus are extracted; fourth, the feature parameters are matched against a standard speech template using a speech recognition method based on a hybrid HMM/ANN model, the content of the answer speech is recognized, and a preliminary score is given; fifth, if the preliminary score is below a threshold, it becomes the final score, otherwise the speech is scored on sub-indexes such as accuracy, fluency, speech rate, rhythm, stress and intonation; sixth, the sub-scores are combined to obtain the final score of the answer speech. The invention further discloses a voice scoring system. With the hybrid-model speech recognition method, recognition is more accurate, and spoken-test recordings stored as files can be scored objectively against classified evaluation criteria.

Description

Speech scoring method and system
Technical field
The present invention relates to speech recognition and assessment technology, and in particular to a speech scoring method and system.
Background art
Speech recognition technology is conventionally divided into two classes of application: speaker-dependent recognition and speaker-independent recognition. Speaker-dependent recognition is tuned to one specific person, that is, it only recognizes that person's voice, and is therefore unsuitable for a broad population. Speaker-independent recognition, by contrast, can meet the recognition needs of different people and is suitable for large-scale applications.
At present, the IBM speech research group holds a leading position in large-vocabulary speech recognition. AT&T Bell Laboratories has also carried out a series of experiments on speaker-independent speech recognition, and its results established methods for building standard templates for speaker-independent recognition.
The major advances of this period include:
(1) The hidden Markov model (Hidden Markov Models, HMM) technique matured and was continuously refined, becoming the mainstream approach to speech recognition;
(2) In continuous speech recognition, beyond acoustic information, more linguistic knowledge is used, such as word formation, syntax, semantics and dialogue context, to help recognize and understand the speech; at the same time, statistical language models emerged in speech recognition research;
(3) Applied research on artificial neural networks in speech recognition rose. Most of this work adopts multilayer perceptron networks trained with the back-propagation (BP) algorithm; in addition, there are feedforward networks, which have a simple structure, are easy to implement and carry no feedback signal, and feedback networks, in which system stability and associative memory are closely tied to the feedback connections between neurons. Artificial neural networks can discriminate complex classification boundaries, which clearly helps with pattern classification.
In addition, continuous-speech dictation technology for personal use has been gradually improved. The most representative systems are IBM's ViaVoice and Dragon's Dragon Dictate. These systems have speaker-adaptation capability: a new user does not need to train the whole vocabulary, and recognition accuracy keeps improving with use.
In China, speech recognition has been developed by research institutes and universities such as the Institute of Acoustics and the Institute of Automation of the Chinese Academy of Sciences, Tsinghua University and Northern Jiaotong University, and also by Harbin Institute of Technology, the University of Science and Technology of China, Sichuan University and others. Many domestic speech recognition systems have been developed successfully, each with its own strengths: for large-vocabulary isolated-word recognition, the most representative is the THED-919 speaker-dependent real-time speech recognition and understanding system jointly developed by the Department of Electronic Engineering of Tsinghua University and the China Electronics Devices Corporation; for continuous speech recognition, the computer center of Sichuan University has implemented a topic-constrained, speaker-dependent continuous English-Chinese speech translation demonstration system on a microcomputer; for speaker-independent recognition, the voice-controlled telephone directory system developed by the Department of Computer Science and Technology of Tsinghua University has been put into actual use.
In addition, iFLYTEK, the largest intelligent speech technology provider in China, released the world's first mobile-Internet intelligent voice interaction platform, the iFLYTEK Voice Cloud, in 2010, announcing the arrival of the mobile-Internet voice dictation era.
iFLYTEK has accumulated long-term research in intelligent speech technology and holds internationally leading results in Chinese speech synthesis, speech recognition and speech evaluation. Speech synthesis and speech recognition are the two key technologies needed to build a voice system that can listen and speak. Automatic speech recognition (Auto Speech Recognize, ASR) aims to let a computer "understand" human speech and extract the textual information it contains. Speech evaluation, also called computer-assisted language learning (Computer Assisted Language Learning) technology, is a research frontier of intelligent speech processing in which a machine automatically scores pronunciation, detects errors and provides corrective feedback. Voiceprint recognition, also called speaker recognition (Speaker Recognition), extracts features that characterize the speaker's identity from the speech signal (such as fundamental-frequency features reflecting glottal vibration and spectral features reflecting the size and shape of the oral cavity and the length of the vocal tract) and then identifies the speaker. Natural language has been an indispensable element of human life, work and study for thousands of years, and the computer is one of the greatest inventions of the 20th century; enabling computers to process and even understand human natural language, so that they possess human abilities of listening, speaking, reading and writing, has always been a research topic that institutions at home and abroad pay special attention to and actively pursue.
Summary of the invention
The technical problem to be solved by the present invention is to provide a speech scoring method and system that can grade test papers quickly and accurately and score examinees against objective criteria. The present invention combines the advantages of existing objective speech-quality evaluation models to obtain a better-performing speech recognition model, a better speech training model and a more accurate spoken-language scoring scheme, and it can objectively score spoken-test recordings stored as files against multiple evaluation indexes. The invention is more stable and more efficient, lays a foundation for putting the research results into practice, and helps achieve the goal of automatic marking for large-scale spoken English tests.
To solve the above technical problem, the invention provides a speech scoring method comprising the steps of:
S1, recording the examinee's answer speech;
S2, preprocessing the examinee's answer speech to obtain an answer speech corpus;
S3, extracting feature parameters of the answer speech corpus;
S4, matching the feature parameters of the answer speech corpus against a standard speech template using a speech recognition method based on a hybrid HMM/ANN model, recognizing the content of the answer speech, and giving a preliminary score;
S5, if the preliminary score is below a preset threshold, taking the preliminary score as the final score of the answer speech and marking it as a problem paper; if the preliminary score is above the preset threshold, scoring the answer speech on the sub-indexes of accuracy, fluency, speech rate, rhythm, stress and intonation;
S6, weighting the sub-index scores to obtain the final score of the answer speech.
Further, a step S0 is performed before step S1, and step S0 specifically comprises:
S01, recording experts' standard speech;
S02, preprocessing the standard speech to obtain a standard speech corpus;
S03, extracting feature parameters of the standard speech corpus;
S04, training a model on the feature parameters of the standard speech corpus to obtain the standard speech template.
Further, in step S4, the speech recognition method based on the hybrid HMM/ANN model specifically comprises:
S41, building an HMM for the feature parameters of the answer speech corpus and obtaining the cumulative probabilities of all states in the HMM;
S42, feeding all the state cumulative probabilities into an ANN classifier as input features and outputting the recognition result;
S43, matching the recognition result against the standard speech template to identify the content of the answer speech.
Further, the preprocessing in step S2 specifically comprises pre-emphasis, framing, windowing, noise reduction, endpoint detection and word segmentation, wherein the noise reduction uses a blank speech segment of the recording as the noise baseline and denoises the subsequent speech against it.
Further, the word segmentation specifically comprises:
S21, extracting the MFCC parameters of each phoneme in the speech and building an HMM for the corresponding phoneme;
S22, roughly segmenting the speech to obtain effective speech segments;
S23, recognizing the words of the speech segments with the phoneme HMMs, thereby converting the speech into a set of words.
Further, the feature extraction in step S3 specifically extracts MFCC feature parameters: the corpus obtained after preprocessing is subjected to a fast Fourier transform (FFT), triangular-window filtering, taking the logarithm and a discrete cosine transform to obtain the MFCC feature parameters.
Further, the accuracy scoring in step S5 specifically comprises:
normalizing the sentence to be scored towards the standard sentence by interpolation/decimation; extracting the intensity curves of the sentence to be scored and the standard sentence using short-time energy as the feature; and scoring by comparing how well the two intensity curves fit.
Further, the fluency scoring in step S5 specifically comprises:
cutting the speech to be scored into a front part and a back part and segmenting each part into words to obtain the effective speech segments; dividing the length of the effective speech in each part by the total length of the speech to be scored and comparing the resulting values with the corresponding thresholds: if both exceed their thresholds the speech is judged fluent, otherwise it is judged not fluent.
The speech-rate scoring specifically comprises: calculating the proportion of the speech to be scored that is voiced and scoring the speech rate according to this proportion.
The rhythm scoring specifically comprises: calculating the rhythm of the speech to be scored with an improved dPVI parameter formula.
The stress scoring specifically comprises: on the normalized intensity curve, dividing stress units by setting a stress threshold and a non-stress threshold as a double threshold together with the stressed-vowel duration as a feature, and matching the sentence to be scored against the standard sentence with the DTW algorithm to score the stress.
The intonation scoring specifically comprises: extracting the formants of the speech to be scored and of the standard speech, and scoring the intonation according to how well the formant trend of the speech to be scored fits the formant trend of the standard speech.
The present invention also provides a speech scoring system, comprising:
a voice recording module for recording the examinee's answer speech;
a preprocessing module for preprocessing the examinee's answer speech to obtain an answer speech corpus;
a feature parameter extraction module for extracting the feature parameters of the answer speech corpus;
a speech recognition module for matching the feature parameters of the answer speech corpus against a standard speech template using a speech recognition method based on a hybrid HMM/ANN model, recognizing the content of the answer speech and giving a preliminary score;
a speech evaluation module for scoring accuracy, fluency, speech rate, rhythm, stress and intonation of answer speech whose preliminary score is above the set threshold; and
a comprehensive scoring module for combining the accuracy, fluency, speech-rate, rhythm, stress and intonation scores to obtain the final score of answer speech whose preliminary score is above the set threshold.
Implementing the present invention has the following beneficial effects:
1. practical noise reduction and word segmentation are added to the preprocessing module, yielding a higher-quality speech corpus;
2. the speech recognition method based on the hybrid HMM/ANN model performs better and recognizes more accurately;
3. the multi-index analysis of speech rate, rhythm, stress and intonation makes the scoring indexes more diverse than the original read-aloud scoring, and the results more objective;
4. the combined analysis of accuracy and fluency extends scoring from read-aloud items to non-read-aloud items such as translation, question-and-answer and repetition items, establishing a reasonably complete speech scoring method and system that grades papers quickly and accurately and scores examinees against objective criteria;
5. the invention is more stable, more efficient, practical and widely applicable; it can be applied to the marking of spoken English tests, significantly shortening marking time, improving processing efficiency and increasing the objectivity of marking.
Brief description of the drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of the speech scoring method provided by an embodiment of the present invention;
Fig. 2 is a schematic flowchart of the specific steps of step S0;
Fig. 3 is a schematic flowchart of the specific preprocessing steps in Fig. 1;
Fig. 4 is a schematic flowchart of the specific word-segmentation steps in Fig. 3;
Fig. 5 is a schematic flowchart of the specific steps of MFCC feature extraction;
Fig. 6 is a schematic flowchart of the specific steps of the speech recognition method based on the hybrid HMM/ANN model;
Fig. 7 is a structural diagram of the speech scoring system provided by an embodiment of the present invention.
Embodiment
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art without creative effort based on the embodiments of the present invention fall within the protection scope of the present invention.
An embodiment of the present invention provides a speech scoring method which, as shown in Fig. 1, comprises the steps of:
S1, recording the examinee's answer speech;
S2, preprocessing the examinee's answer speech to obtain an answer speech corpus;
S3, extracting feature parameters of the answer speech corpus;
S4, matching the feature parameters of the answer speech corpus against a standard speech template using a speech recognition method based on a hybrid model of a hidden Markov model (Hidden Markov Models, HMM) and an artificial neural network (Artificial Neural Networks, ANN), recognizing the content of the answer speech and giving a preliminary score;
S5, if the preliminary score is below a preset threshold, taking the preliminary score as the final score of the answer speech and marking it as a problem paper; if the preliminary score is above the preset threshold, scoring the answer speech on the sub-indexes of accuracy, fluency, speech rate, rhythm, stress and intonation;
S6, weighting the sub-index scores to obtain the final score of the answer speech, as sketched below.
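For illustration only, the following Python sketch shows the decision logic of steps S5 and S6. The threshold value, the weight values and the function names are assumptions introduced here; the patent does not fix concrete numbers.

# Sketch of steps S5-S6: threshold gate plus weighted combination of sub-scores.
# The threshold and the weights are illustrative assumptions, not values from the patent.
SUB_INDEX_WEIGHTS = {          # hypothetical weights, assumed to sum to 1.0
    "accuracy": 0.30,
    "fluency": 0.20,
    "speech_rate": 0.10,
    "rhythm": 0.15,
    "stress": 0.10,
    "intonation": 0.15,
}
PROBLEM_PAPER_THRESHOLD = 30.0  # hypothetical preliminary-score threshold


def final_score(preliminary: float, sub_scores: dict[str, float]) -> tuple[float, bool]:
    """Return (final score, is_problem_paper) for one answer recording."""
    if preliminary < PROBLEM_PAPER_THRESHOLD:
        # S5: low preliminary score -> keep it as the final score, flag a problem paper.
        return preliminary, True
    # S6: weighted sum of the six sub-index scores.
    weighted = sum(SUB_INDEX_WEIGHTS[name] * score for name, score in sub_scores.items())
    return weighted, False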
Further, a step S0 is performed before step S1 and, as shown in Fig. 2, specifically comprises:
S01, recording experts' standard speech;
the standard speech is recorded by several professionals in a controlled environment, and its content corresponds to the content of the spoken English test;
S02, preprocessing the standard speech to obtain a standard speech corpus;
S03, extracting feature parameters of the standard speech corpus;
S04, training a model on the feature parameters of the standard speech corpus to obtain the standard speech template.
Here, training the standard speech model means deriving, from a large number of known patterns and according to a given criterion, the model parameters that characterize the essential features of the pattern, i.e. the standard speech template. The training process iteratively adjusts the parameters of the system template (including the state-transition probabilities and the variances, means and weights of the Gaussian mixture models) from the initial construction data so that the performance of the recognition system approaches an optimum. Because professionals' standard speech differs to some extent from examinees' speech, and the scoring targets of the invention are ordinary speakers, the corpus should be expanded from specific professionals to ordinary people and from a controlled environment to ordinary environments, covering speakers of different genders, ages and accents. A minimal training sketch follows.
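The sketch below fits one Gaussian HMM on the experts' MFCC features, which stands in for the template training of step S04. hmmlearn is used only as an illustration; the library, the number of states and the function name are assumptions, not part of the patent.

# Sketch of step S0 (template training): fit one Gaussian HMM per prompt on the
# experts' standard-speech MFCC features. hmmlearn is a stand-in for the
# patent's unspecified training implementation.
import numpy as np
from hmmlearn.hmm import GaussianHMM


def train_standard_template(mfcc_utterances: list, n_states: int = 5) -> GaussianHMM:
    """mfcc_utterances: list of (n_frames, n_mfcc) arrays, one per expert recording."""
    X = np.vstack(mfcc_utterances)                    # stack all frames
    lengths = [u.shape[0] for u in mfcc_utterances]   # frame count per utterance
    hmm = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
    hmm.fit(X, lengths)                               # iterative (Baum-Welch) re-estimation
    return hmm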
Each step is described in detail below.
1. Preprocessing
As shown in Fig. 3, the preprocessing in step S2 specifically comprises noise reduction, pre-emphasis, framing, windowing, endpoint detection and word segmentation. The purpose of preprocessing is to remove the influence that the speaker's vocal organs and the recording equipment have on speech quality, and to supply high-quality parameters for feature extraction, thereby improving the quality of subsequent speech processing.
The noise reduction uses a blank segment of the recording as the noise baseline and denoises the subsequent speech against it. Research shows that, before starting to speak, an examinee is usually silent for a short period at the beginning of the recording; this short stretch is not truly blank but contains noise. By taking the audio of this stretch as the noise baseline, the later recording can be denoised and the noise in unvoiced segments removed as well.
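As a sketch of this baseline idea, the code below estimates a noise spectrum from the leading stretch of the recording and subtracts it from later frames (simple spectral subtraction). The frame length, the 0.3 s noise window and the function name are illustrative assumptions, not values given in the patent.

# Sketch of the noise reduction: estimate a noise baseline from the leading
# "silent" stretch and subtract it spectrally from the rest of the signal.
import numpy as np


def denoise(signal: np.ndarray, sr: int, noise_seconds: float = 0.3,
            frame: int = 512) -> np.ndarray:
    noise = signal[: int(noise_seconds * sr)].astype(float)
    noise_frames = [noise[i:i + frame] for i in range(0, len(noise) - frame + 1, frame)]
    noise_mag = np.mean([np.abs(np.fft.rfft(f)) for f in noise_frames], axis=0)  # noise baseline

    out = np.zeros_like(signal, dtype=float)
    for start in range(0, len(signal) - frame + 1, frame):
        spec = np.fft.rfft(signal[start:start + frame].astype(float))
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)            # spectral subtraction
        out[start:start + frame] = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=frame)
    return out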
Word segmentation means cutting a sentence into individual words or phrases so that the machine can "understand" the examinee's utterance by recognizing them one by one, preparing for the later analysis of bonus and penalty factors and for the final automatic scoring. As shown in Fig. 4, the word segmentation specifically comprises:
S21, extracting the Mel frequency cepstral coefficient (Mel Frequency Cepstrum Coefficient, MFCC) parameters of each phoneme in the speech and building an HMM for the corresponding phoneme;
S22, roughly segmenting the speech to obtain effective speech segments;
The rough segmentation has two purposes: first, to reduce the amount of computation and thereby the segmentation time; second, to increase segmentation accuracy. It uses the double-threshold method to cut out obviously blank stretches, but with lower thresholds, so that effective speech segments are retained (a sketch of this double-threshold segmentation follows below);
S23, recognizing the words of the speech segments with the phoneme HMMs, thereby converting the speech into a set of words.
This word-segmentation method has the advantages of a high recognition rate, high accuracy and small error: 1) the number of recognition templates is fixed, so the HMM accuracy is very high, and there is no need to tune an output-probability threshold, which greatly improves the recognition rate; 2) after segmentation the pronunciation of each word is available, which assists keyword matching and reduces word-matching errors.
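The sketch below shows one common double-threshold scheme applied to short-time energy, in line with step S22: a low threshold opens a candidate segment and a higher threshold confirms it. The frame sizes and threshold values are illustrative assumptions.

# Sketch of the double-threshold rough segmentation of step S22.
import numpy as np


def rough_segments(signal: np.ndarray, frame: int = 400, hop: int = 160,
                   low: float = 0.02, high: float = 0.1) -> list:
    energy = np.array([np.mean(signal[i:i + frame].astype(float) ** 2)
                       for i in range(0, len(signal) - frame, hop)])
    energy = energy / (energy.max() + 1e-12)               # normalize short-time energy

    segments, start = [], None
    for i, e in enumerate(energy):
        if start is None and e > low:
            start = i                                       # tentative segment start
        elif start is not None and e <= low:
            if energy[start:i].max() > high:                # confirm with the high threshold
                segments.append((start * hop, i * hop + frame))
            start = None
    if start is not None and energy[start:].max() > high:
        segments.append((start * hop, len(signal)))
    return segments                                         # list of (start, end) sample indices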
2. Feature parameter extraction
The feature extraction in step S3 specifically extracts MFCC feature parameters. As shown in Fig. 5, the preprocessed corpus is subjected to a fast Fourier transform (FFT), triangular-window filtering, taking the logarithm and a discrete cosine transform to obtain the MFCC feature parameters. MFCC parameters are used because they take the auditory properties of the human ear into account: the spectrum is converted to a nonlinear Mel-frequency scale and then transformed to the cepstral domain. Without any assumptions about the signal, the auditory properties of the ear are simulated mathematically with a bank of triangular filters densely placed in the low-frequency region to capture the spectral information of the speech. In addition, MFCC parameters are robust to noise and spectral distortion, which further improves the recognition performance of the system.
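A minimal sketch of this pipeline on already framed and windowed speech follows: FFT, triangular Mel filter bank, logarithm, DCT. librosa is used only to build the triangular filters; the frame size, filter count and coefficient count are illustrative assumptions.

# Sketch of the MFCC pipeline of step S3.
import numpy as np
import librosa
from scipy.fftpack import dct


def mfcc_from_frames(frames: np.ndarray, sr: int = 16000,
                     n_fft: int = 512, n_mels: int = 26, n_mfcc: int = 13) -> np.ndarray:
    """frames: (n_frames, n_fft) windowed speech frames -> (n_frames, n_mfcc) MFCCs."""
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2        # FFT, power spectrum
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)  # triangular Mel filters
    mel_energy = power @ mel_fb.T                                    # apply the filter bank
    log_mel = np.log(mel_energy + 1e-10)                             # take the logarithm
    return dct(log_mel, type=2, axis=1, norm="ortho")[:, :n_mfcc]    # DCT -> cepstral coefficients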
3. Speech content recognition
Step S4 adopts a speech recognition method based on a hybrid HMM/ANN model. The HMM approach requires prior statistical knowledge of the speech signal, has weak classification and decision ability, a complex structure, and needs many training samples and a large amount of computation. The ANN approach has advantages in decision-making, but it still describes dynamic time signals poorly, and neural-network speech recognition suffers from long training and recognition times. To overcome these respective shortcomings, the present invention organically combines the HMM, with its strong temporal modeling ability, and the ANN, with its strong classification ability, further improving the robustness and accuracy of speech recognition. This method overcomes the overlap between pattern classes that the HMM alone cannot resolve, improving the recognition of easily confused words, and also overcomes the ANN's limitation of handling only fixed-length input patterns, saving complex time-normalization operations. Specifically, as shown in Fig. 6, the speech recognition method based on the hybrid HMM/ANN model in step S4 comprises:
S41, building an HMM for the feature parameters of the answer speech corpus and obtaining the cumulative probabilities of all states in the HMM;
S42, feeding all the state cumulative probabilities into an ANN classifier (specifically a self-organizing neural network) as input features and outputting the recognition result;
S43, matching the recognition result against the standard speech template to identify the content of the answer speech.
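The sketch below illustrates the data flow of steps S41-S42: each candidate HMM yields per-state posteriors, their per-state sums ("state cumulative probabilities") form the feature vector, and an ANN classifies it. hmmlearn and scikit-learn are stand-ins, and an MLP replaces the self-organizing network named in the text purely for illustration; all function names here are assumptions.

# Sketch of the hybrid HMM/ANN recognizer of steps S41-S42.
import numpy as np
from hmmlearn.hmm import GaussianHMM
from sklearn.neural_network import MLPClassifier


def state_cumulative_probs(hmms: list, mfcc: np.ndarray) -> np.ndarray:
    """Concatenate, over all fitted word HMMs, the summed state posteriors of one utterance."""
    feats = []
    for hmm in hmms:
        _, posteriors = hmm.score_samples(mfcc)   # (n_frames, n_states) state posteriors
        feats.append(posteriors.sum(axis=0))      # cumulative probability per state
    return np.concatenate(feats)


def train_ann(hmms: list, train_mfccs: list, labels: list) -> MLPClassifier:
    X = np.array([state_cumulative_probs(hmms, m) for m in train_mfccs])
    ann = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)
    return ann.fit(X, labels)                     # labels: word/content classes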
4. Speech evaluation
In practice, some examinees do not complete the spoken test within the allotted time, so the recorded answer speech may be largely blank or unrecognizable; such recordings are marked as problem papers. Problem papers include blank recordings and various unrecognizable recordings, such as speech in a language other than English or recordings with excessive noise. Step S4 therefore not only recognizes the content read by the examinee but also detects problem papers and assigns them a low score according to the actual situation; such speech does not need to be scored for accuracy, fluency, speech rate, rhythm, stress and intonation. Further evaluation is carried out only when the preliminary score exceeds the preset threshold.
(1) The accuracy scoring in step S5 specifically comprises: normalizing the sentence to be scored towards the standard sentence by interpolation/decimation; extracting the intensity curves of the sentence to be scored and of the standard sentence using short-time energy as the feature; and scoring by comparing how well the two intensity curves fit.
The intensity of a sentence reflects how the speech signal varies over time. Stressed syllables are louder, which appears in the time domain as higher short-time energy. However, different speakers (and the same speaker at different times) differ in loudness and in the duration of the same sentence, so directly template-matching the intensity curve of the sentence to be scored against that of the standard sentence would harm the objectivity of the evaluation. The present invention therefore refines an intensity-curve extraction method based on the standard sentence: when the sentence to be scored is shorter than the standard sentence, interpolation is used to lengthen it; when it is longer than the standard sentence, decimation is used to shorten it; finally, the intensity curve of the sentence to be scored is normalized in amplitude using the intensity maximum of the standard sentence's curve.
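A minimal sketch of this accuracy measure follows: short-time-energy intensity curves, interpolation/decimation to the standard length, amplitude normalization to the standard maximum, and a correlation-based fitting degree. The frame sizes and the 0-100 mapping are illustrative assumptions.

# Sketch of the accuracy score of step S5 (1).
import numpy as np


def intensity_curve(signal: np.ndarray, frame: int = 400, hop: int = 160) -> np.ndarray:
    return np.array([np.sum(signal[i:i + frame].astype(float) ** 2)   # short-time energy
                     for i in range(0, len(signal) - frame, hop)])


def accuracy_score(test_curve: np.ndarray, std_curve: np.ndarray) -> float:
    # Interpolation lengthens a short test curve, decimation shortens a long one.
    warped = np.interp(np.linspace(0, len(test_curve) - 1, num=len(std_curve)),
                       np.arange(len(test_curve)), test_curve)
    warped *= std_curve.max() / (warped.max() + 1e-12)   # intensity normalization to the standard
    r = np.corrcoef(warped, std_curve)[0, 1]             # fitting degree of the two curves
    return float(max(r, 0.0) * 100.0)                    # map to a 0-100 score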
(2) The fluency scoring specifically comprises: cutting the speech to be scored into a front part and a back part and segmenting each part into words to obtain the effective speech segments; dividing the length of the effective speech in each part by the total length of the speech to be scored and comparing the resulting values with the corresponding thresholds: if both exceed their thresholds the speech is judged fluent, otherwise it is judged not fluent.
Sentence-level fluency is intended to measure how smoothly the sentence is expressed; the standard speech is used to compute a rhythm score of the pronunciation, and the two are fused into a sentence fluency diagnosis model. This sentence fluency scoring method can also be applied to passage-level fluency. Because it considers the smoothness of the speaker's delivery, it correlates better with human judgments than classical methods and can therefore be applied in a speech scoring system.
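The following sketch reuses rough_segments() from the word-segmentation sketch above to make the two-half fluency check concrete. The two threshold values and the function name are illustrative assumptions.

# Sketch of the fluency judgment of step S5 (2).
import numpy as np


def is_fluent(signal: np.ndarray, front_thr: float = 0.35, back_thr: float = 0.35) -> bool:
    half, total = len(signal) // 2, len(signal)
    ratios = []
    for part in (signal[:half], signal[half:]):
        voiced = sum(end - start for start, end in rough_segments(part))
        ratios.append(voiced / total)                 # effective length / total length
    return ratios[0] > front_thr and ratios[1] > back_thr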
(3) The speech-rate scoring specifically comprises: calculating the proportion of the speech to be scored that is voiced and scoring the speech rate according to this proportion.
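As a small sketch, the voiced proportion can be mapped to a score around a target ratio; the target value and the linear mapping are assumptions introduced here, not values from the patent.

# Sketch of the speech-rate score of step S5 (3).
def speech_rate_score(signal, target: float = 0.6) -> float:
    voiced = sum(end - start for start, end in rough_segments(signal))
    ratio = voiced / len(signal)                      # voiced duration / total duration
    return max(0.0, 100.0 - abs(ratio - target) / target * 100.0)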
(4) The rhythm scoring specifically comprises: calculating the rhythm of the speech to be scored with an improved distinct pairwise variability index (Distinct Pairwise Variability Index, dPVI) formula. Based on the variability of speech-unit durations, dPVI compares the durations of corresponding syllable-unit segments of the standard sentence and of the sentence to be scored, and the resulting parameter serves as the basis for objective evaluation and feedback.

$$\mathrm{dPVI} = 100 \times \left( \sum_{k=1}^{m-1} \left| d_{1k} - d_{2k} \right| + \left| d_{1t} - d_{2t} \right| \right) / Len_{Std}$$

where d is the duration of a speech-unit segment of the sentence (for example, d_{1k} is the duration of the k-th unit of the standard sentence and d_{2k} that of the sentence to be scored), m = min(number of units in the standard sentence, number of units in the sentence to be scored), d_{1t} and d_{2t} are the durations of the final units, and Len_{Std} is the duration of the standard sentence. Because the duration of the sentence to be scored has already been normalized to be comparable with the standard sentence before the PVI computation, only Len_{Std} is used as the normalizing term.
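A direct sketch of this formula, assuming the per-unit durations of both sentences have already been obtained from the segmentation step and that the sentence to be scored has been duration-normalized:

# Sketch of the dPVI rhythm measure defined above.
def dpvi(std_durations: list, test_durations: list) -> float:
    m = min(len(std_durations), len(test_durations))
    len_std = sum(std_durations)                                       # Len_Std
    diff = sum(abs(std_durations[k] - test_durations[k]) for k in range(m - 1))
    diff += abs(std_durations[-1] - test_durations[-1])                # last-unit term |d1t - d2t|
    return 100.0 * diff / len_std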
(5) The stress scoring specifically comprises: on the normalized intensity curve, dividing stress units by setting a stress threshold and a non-stress threshold as a double threshold together with the stressed-vowel duration as a feature, and matching the sentence to be scored against the standard sentence with the dynamic time warping (Dynamic Time Warping, DTW) algorithm to score the stress.
Stress refers to the emphasized sounds in words, phrases and sentences. The basic principle of DTW is dynamic time warping: the originally mismatched time spans of the test template and the reference template are aligned. Similarity is computed with the conventional Euclidean distance; if the reference template is R and the test template is T, a smaller distance D[T, R] means higher similarity. The drawback of the conventional DTW algorithm is that during template matching all frames carry the same weight and all templates must be matched, so the amount of computation is large and grows quickly as the number of templates increases. The present invention therefore uses an improved DTW algorithm to match the sentence to be scored against the standard sentence: the frames are weighted by importance, which remedies the shortcoming of conventional DTW, greatly reduces the amount of computation and makes the result more accurate.
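The sketch below is conventional Euclidean-distance DTW over the stress-feature sequences; the optional per-frame weights only hint at the improved weighting, since the patent does not give its exact form.

# Sketch of DTW matching of the test and standard stress-feature sequences.
import numpy as np


def dtw_distance(test: np.ndarray, ref: np.ndarray, weights=None) -> float:
    n, m = len(test), len(ref)
    w = np.ones(n) if weights is None else weights
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = w[i - 1] * np.linalg.norm(test[i - 1] - ref[j - 1])   # weighted frame cost
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])        # smaller distance -> higher similarity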
(6) The intonation scoring specifically comprises: extracting the formants of the speech to be scored and of the standard speech, and scoring the intonation according to how well the formant trend of the speech to be scored fits the formant trend of the standard speech.
Intonation is an important sign of expressive ability in spoken English; it reflects the speaker's attitude in using the language and is perceived as the rise and fall and the modulation of the voice.
In digital speech signal processing, the formants of the speech signal are very important parameters. A formant is a region of the spectrum in which sound energy is relatively concentrated; it is not the sole determinant of timbre, but it reflects the physical characteristics of the vocal tract (the resonant cavity). When sound passes through the resonant cavity it is filtered: the energy at different frequencies is redistributed, part of it reinforced by the resonance of the cavity and part attenuated, and the reinforced frequencies appear as dense dark bands on a time-frequency spectrogram. Because the energy distribution is uneven, the strong parts look like mountain peaks, hence the name formant. Formants are key features of vocal-tract resonance and a direct carrier of pronunciation information that humans use in speech perception, so they are very important characteristic parameters in speech signal processing. Formants are the set of resonant frequencies produced when the quasi-periodic pulse excitation enters the vocal tract. Formant parameters include the formant frequencies and their bandwidths, and they are important for distinguishing different vowels. Formant information is contained in the spectral envelope, so the key to formant extraction is to estimate the spectral envelope of natural speech; the maxima of the spectral envelope are generally regarded as the formants.
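For illustration, the sketch below estimates a first-formant track per frame via LPC root finding and compares the trend of the test track with that of the standard track. librosa's LPC is used as a stand-in, and the LPC order, the first-formant heuristic and the score mapping are all assumptions introduced here.

# Sketch of the intonation score of step S5 (6).
import numpy as np
import librosa


def formant_track(frames: np.ndarray, sr: int = 16000, order: int = 12) -> np.ndarray:
    track = []
    for frame in frames:                                   # frames: (n_frames, frame_len)
        a = librosa.lpc(frame.astype(float), order=order)  # LPC coefficients
        roots = [r for r in np.roots(a) if np.imag(r) > 0]
        freqs = sorted(np.angle(roots) * sr / (2 * np.pi))
        track.append(freqs[0] if freqs else 0.0)           # rough first-formant estimate
    return np.array(track)


def intonation_score(test_track: np.ndarray, std_track: np.ndarray) -> float:
    warped = np.interp(np.linspace(0, len(test_track) - 1, num=len(std_track)),
                       np.arange(len(test_track)), test_track)
    r = np.corrcoef(np.diff(warped), np.diff(std_track))[0, 1]   # trend fitting degree
    return float(max(r, 0.0) * 100.0)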
The present invention also provides a speech scoring system which, as shown in Fig. 7, comprises:
a voice recording module 101 for recording the examinee's answer speech;
a preprocessing module 102 for preprocessing the examinee's answer speech to obtain an answer speech corpus;
a feature parameter extraction module 103 for extracting the feature parameters of the answer speech corpus;
a speech recognition module 104 for matching the feature parameters of the answer speech corpus against a standard speech template using the speech recognition method based on the hybrid HMM/ANN model, recognizing the content of the answer speech and giving a preliminary score;
a speech evaluation module 105 for scoring accuracy, fluency, speech rate, rhythm, stress and intonation of answer speech whose preliminary score is above the set threshold; and
a comprehensive scoring module 106 for combining the accuracy, fluency, speech-rate, rhythm, stress and intonation scores to obtain the final score of answer speech whose preliminary score is above the set threshold.
The speech scoring system corresponds to the speech scoring method, so the specific processing steps of each module follow the steps of the speech scoring method and are not repeated here.
Implementing the present invention has the following beneficial effects:
(1) practical noise reduction and word segmentation are added to the preprocessing module, yielding a higher-quality speech corpus;
(2) the speech recognition method based on the hybrid HMM/ANN model performs better and recognizes more accurately;
(3) the multi-index analysis of speech rate, rhythm, stress and intonation makes the scoring indexes more diverse than the original read-aloud scoring, and the results more objective;
(4) the combined analysis of accuracy and fluency extends scoring from read-aloud items to non-read-aloud items such as translation, question-and-answer and repetition items, establishing a reasonably complete speech scoring method and system that grades papers quickly and accurately and scores examinees against objective criteria;
(5) the invention is more stable, more efficient, practical and widely applicable; it can be applied to the marking of spoken English tests, significantly shortening marking time, improving processing efficiency and increasing the objectivity of marking.
What is disclosed above is only a preferred embodiment of the present invention and certainly cannot limit the scope of the claims; equivalent variations made according to the claims of the present invention therefore still fall within the scope of the present invention.

Claims (10)

1. A speech scoring method, characterized by comprising the steps of:
S1, recording the examinee's answer speech;
S2, preprocessing the examinee's answer speech to obtain an answer speech corpus;
S3, extracting feature parameters of the answer speech corpus;
S4, matching the feature parameters of the answer speech corpus against a standard speech template using a speech recognition method based on a hybrid HMM/ANN model, recognizing the content of the answer speech, and giving a preliminary score;
S5, if the preliminary score is below a preset threshold, taking the preliminary score as the final score of the answer speech and marking it as a problem paper; if the preliminary score is above the preset threshold, scoring the answer speech on the sub-indexes of accuracy, fluency, speech rate, rhythm, stress and intonation;
S6, weighting the sub-index scores to obtain the final score of the answer speech.
2. The speech scoring method according to claim 1, characterized in that a step S0 is performed before step S1, and step S0 specifically comprises:
S01, recording experts' standard speech;
S02, preprocessing the standard speech to obtain a standard speech corpus;
S03, extracting feature parameters of the standard speech corpus;
S04, training a model on the feature parameters of the standard speech corpus to obtain the standard speech template.
3. The speech scoring method according to claim 1, characterized in that the speech recognition method based on the hybrid HMM/ANN model in step S4 specifically comprises:
S41, building an HMM for the feature parameters of the answer speech corpus and obtaining the cumulative probabilities of all states in the HMM;
S42, feeding all the state cumulative probabilities into an ANN classifier as input features and outputting the recognition result;
S43, matching the recognition result against the standard speech template to identify the content of the answer speech.
4. The speech scoring method according to claim 1, characterized in that the preprocessing in step S2 specifically comprises noise reduction, pre-emphasis, framing, windowing, endpoint detection and word segmentation, wherein the noise reduction uses a blank speech segment of the recording as the noise baseline and denoises the subsequent speech against it.
5. The speech scoring method according to claim 4, characterized in that the word segmentation specifically comprises:
S21, extracting the MFCC parameters of each phoneme in the speech and building an HMM for the corresponding phoneme;
S22, roughly segmenting the speech to obtain effective speech segments;
S23, recognizing the words of the speech segments with the phoneme HMMs, thereby converting the speech into a set of words.
6. The speech scoring method according to claim 1, characterized in that the feature extraction in step S3 specifically extracts MFCC feature parameters: the corpus obtained after preprocessing is subjected to a fast Fourier transform (FFT), triangular-window filtering, taking the logarithm and a discrete cosine transform to obtain the MFCC feature parameters.
7. The speech scoring method according to claim 1, characterized in that the accuracy scoring in step S5 specifically comprises:
normalizing the sentence to be scored towards the standard sentence by interpolation/decimation; extracting the intensity curves of the sentence to be scored and the standard sentence using short-time energy as the feature; and scoring by comparing how well the two intensity curves fit.
8. The speech scoring method according to claim 1, characterized in that the fluency scoring in step S5 specifically comprises:
cutting the speech to be scored into a front part and a back part and segmenting each part into words to obtain the effective speech segments; dividing the length of the effective speech in each part by the total length of the speech to be scored and comparing the resulting values with the corresponding thresholds: if both exceed their thresholds the speech is judged fluent, otherwise it is judged not fluent.
9. The speech scoring method according to claim 1, characterized in that, in step S5,
the speech-rate scoring specifically comprises: calculating the proportion of the speech to be scored that is voiced and scoring the speech rate according to this proportion;
the rhythm scoring specifically comprises: calculating the rhythm of the speech to be scored with an improved dPVI parameter formula;
the stress scoring specifically comprises: on the normalized intensity curve, dividing stress units by setting a stress threshold and a non-stress threshold as a double threshold together with the stressed-vowel duration as a feature, and matching the sentence to be scored against the standard sentence with the DTW algorithm to score the stress;
the intonation scoring specifically comprises: extracting the formants of the speech to be scored and of the standard speech, and scoring the intonation according to how well the formant trend of the speech to be scored fits the formant trend of the standard speech.
10. A speech scoring system, characterized by comprising:
a voice recording module for recording the examinee's answer speech;
a preprocessing module for preprocessing the examinee's answer speech to obtain an answer speech corpus;
a feature parameter extraction module for extracting the feature parameters of the answer speech corpus;
a speech recognition module for matching the feature parameters of the answer speech corpus against a standard speech template using a speech recognition method based on a hybrid HMM/ANN model, recognizing the content of the answer speech, giving a preliminary score and marking whether the speech is a problem paper;
a speech evaluation module for scoring accuracy, fluency, speech rate, rhythm, stress and intonation of non-problem answer speech whose preliminary score is above the preset threshold; and
a comprehensive scoring module for combining the accuracy, fluency, speech-rate, rhythm, stress and intonation scores to obtain the final score of answer speech whose preliminary score is above the set threshold.
CN201410178813.XA 2014-04-29 2014-04-29 Speech scoring method and system Expired - Fee Related CN103928023B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410178813.XA CN103928023B (en) 2014-04-29 2014-04-29 Speech scoring method and system


Publications (2)

Publication Number Publication Date
CN103928023A true CN103928023A (en) 2014-07-16
CN103928023B CN103928023B (en) 2017-04-05

Family

ID=51146222

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410178813.XA Expired - Fee Related CN103928023B (en) 2014-04-29 2014-04-29 Speech scoring method and system

Country Status (1)

Country Link
CN (1) CN103928023B (en)

Cited By (61)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104361896A (en) * 2014-12-04 2015-02-18 上海流利说信息技术有限公司 Voice quality evaluation equipment, method and system
CN104361895A (en) * 2014-12-04 2015-02-18 上海流利说信息技术有限公司 Voice quality evaluation equipment, method and system
CN104464423A (en) * 2014-12-19 2015-03-25 科大讯飞股份有限公司 Calibration optimization method and system for speaking test evaluation
CN104485105A (en) * 2014-12-31 2015-04-01 中国科学院深圳先进技术研究院 Electronic medical record generating method and electronic medical record system
CN104505103A (en) * 2014-12-04 2015-04-08 上海流利说信息技术有限公司 Voice quality evaluation equipment, method and system
CN104732352A (en) * 2015-04-02 2015-06-24 张可 Method for question bank quality evaluation
CN104732977A (en) * 2015-03-09 2015-06-24 广东外语外贸大学 On-line spoken language pronunciation quality evaluation method and system
CN104810017A (en) * 2015-04-08 2015-07-29 广东外语外贸大学 Semantic analysis-based oral language evaluating method and system
CN105608960A (en) * 2016-01-27 2016-05-25 广东外语外贸大学 Spoken language formative teaching method and system based on multi-parameter analysis
CN105632488A (en) * 2016-02-23 2016-06-01 深圳市海云天教育测评有限公司 Voice evaluation method and device
CN105654785A (en) * 2016-03-18 2016-06-08 上海语知义信息技术有限公司 Personalized spoken foreign language learning system and method
CN105681920A (en) * 2015-12-30 2016-06-15 深圳市鹰硕音频科技有限公司 Network teaching method and system with voice recognition function
CN105825852A (en) * 2016-05-23 2016-08-03 渤海大学 Oral English reading test scoring method
CN105989839A (en) * 2015-06-03 2016-10-05 乐视致新电子科技(天津)有限公司 Speech recognition method and speech recognition device
CN106531182A (en) * 2016-12-16 2017-03-22 上海斐讯数据通信技术有限公司 Language learning system
CN106548673A (en) * 2016-10-25 2017-03-29 合肥东上多媒体科技有限公司 A kind of Teaching Management Method based on intelligent Matching
CN106652622A (en) * 2017-02-07 2017-05-10 广东小天才科技有限公司 Text training method and device
CN106710348A (en) * 2016-12-20 2017-05-24 江苏前景信息科技有限公司 Civil air defense interactive experience method and system
CN106971711A (en) * 2016-01-14 2017-07-21 芋头科技(杭州)有限公司 A kind of adaptive method for recognizing sound-groove and system
CN107221318A (en) * 2017-05-12 2017-09-29 广东外语外贸大学 Oral English Practice pronunciation methods of marking and system
CN107230171A (en) * 2017-05-31 2017-10-03 中南大学 A kind of student, which chooses a job, is orientated evaluation method and system
CN107239897A (en) * 2017-05-31 2017-10-10 中南大学 A kind of personality occupation type method of testing and system
CN107274738A (en) * 2017-06-23 2017-10-20 广东外语外贸大学 Chinese-English translation teaching points-scoring system based on mobile Internet
CN107293286A (en) * 2017-05-27 2017-10-24 华南理工大学 A kind of speech samples collection method that game is dubbed based on network
CN107292496A (en) * 2017-05-31 2017-10-24 中南大学 A kind of work values cognitive system and method
CN107578778A (en) * 2017-08-16 2018-01-12 南京高讯信息科技有限公司 A kind of method of spoken scoring
CN107785011A (en) * 2017-09-15 2018-03-09 北京理工大学 Word speed estimates training, word speed method of estimation, device, equipment and the medium of model
CN107818797A (en) * 2017-12-07 2018-03-20 苏州科达科技股份有限公司 Voice quality assessment method, apparatus and its system
CN108428382A (en) * 2018-02-14 2018-08-21 广东外语外贸大学 It is a kind of spoken to repeat methods of marking and system
CN108429932A (en) * 2018-04-25 2018-08-21 北京比特智学科技有限公司 Method for processing video frequency and device
CN108831503A (en) * 2018-06-07 2018-11-16 深圳习习网络科技有限公司 A kind of method and device for oral evaluation
CN108986786A (en) * 2018-07-27 2018-12-11 中国电子产品可靠性与环境试验研究所((工业和信息化部电子第五研究所)(中国赛宝实验室)) Interactive voice equipment ranking method, system, computer equipment and storage medium
CN109036429A (en) * 2018-07-25 2018-12-18 浪潮电子信息产业股份有限公司 A kind of voice match scoring querying method and system based on cloud service
CN109147823A (en) * 2018-10-31 2019-01-04 河南职业技术学院 Oral English Practice assessment method and Oral English Practice assessment device
CN109214616A (en) * 2017-06-29 2019-01-15 上海寒武纪信息科技有限公司 A kind of information processing unit, system and method
CN109493658A (en) * 2019-01-08 2019-03-19 上海健坤教育科技有限公司 Situated human-computer dialogue formula spoken language interactive learning method
WO2019075828A1 (en) * 2017-10-20 2019-04-25 深圳市鹰硕音频科技有限公司 Voice evaluation method and apparatus
CN109727608A (en) * 2017-10-25 2019-05-07 香港中文大学深圳研究院 A kind of ill voice appraisal procedure based on Chinese speech
CN109979484A (en) * 2019-04-03 2019-07-05 北京儒博科技有限公司 Pronounce error-detecting method, device, electronic equipment and storage medium
CN110135492A (en) * 2019-05-13 2019-08-16 山东大学 Equipment fault diagnosis and method for detecting abnormality and system based on more Gauss models
CN110211607A (en) * 2019-07-04 2019-09-06 山东中医药高等专科学校 A kind of English learning system based on sensing network
CN110600052A (en) * 2019-08-19 2019-12-20 天闻数媒科技(北京)有限公司 Voice evaluation method and device
CN111294468A (en) * 2020-02-07 2020-06-16 普强时代(珠海横琴)信息技术有限公司 Tone quality detection and analysis system for customer service center calling
CN111358428A (en) * 2020-01-20 2020-07-03 书丸子(北京)科技有限公司 Observation capability test evaluation method and device
CN111583961A (en) * 2020-05-07 2020-08-25 北京一起教育信息咨询有限责任公司 Stress evaluation method and device and electronic equipment
CN111599234A (en) * 2020-05-19 2020-08-28 黑龙江工业学院 Automatic English spoken language scoring system based on voice recognition
CN111612324A (en) * 2020-05-15 2020-09-01 深圳看齐信息有限公司 Multi-dimensional assessment method based on oral English examination
CN111612352A (en) * 2020-05-22 2020-09-01 北京易华录信息技术股份有限公司 Student expression ability assessment method and device
CN111640452A (en) * 2019-03-01 2020-09-08 北京搜狗科技发展有限公司 Data processing method and device and data processing device
CN111696524A (en) * 2020-04-21 2020-09-22 厦门快商通科技股份有限公司 Character-overlapping voice recognition method and system
CN111816169A (en) * 2020-07-23 2020-10-23 苏州思必驰信息科技有限公司 Method and device for training Chinese and English hybrid speech recognition model
CN112349300A (en) * 2020-11-06 2021-02-09 北京乐学帮网络技术有限公司 Voice evaluation method and device
CN112634692A (en) * 2020-12-15 2021-04-09 成都职业技术学院 Emergency evacuation deduction training system for crew cabins
CN112750465A (en) * 2020-12-29 2021-05-04 昆山杜克大学 Cloud language ability evaluation system and wearable recording terminal
CN113035238A (en) * 2021-05-20 2021-06-25 北京世纪好未来教育科技有限公司 Audio evaluation method, device, electronic equipment and medium
WO2021196475A1 (en) * 2020-04-01 2021-10-07 深圳壹账通智能科技有限公司 Intelligent language fluency recognition method and apparatus, computer device, and storage medium
CN113571043A (en) * 2021-07-27 2021-10-29 广州欢城文化传媒有限公司 Dialect imitation ability evaluation method and device, electronic equipment and storage medium
CN113807813A (en) * 2021-09-14 2021-12-17 广东德诚科教有限公司 Grading system and method based on man-machine conversation examination
US11656910B2 (en) 2017-08-21 2023-05-23 Shanghai Cambricon Information Technology Co., Ltd Data sharing system and data sharing method therefor
US11687467B2 (en) 2018-04-28 2023-06-27 Shanghai Cambricon Information Technology Co., Ltd Data sharing system and data sharing method therefor
US11726844B2 (en) 2017-06-26 2023-08-15 Shanghai Cambricon Information Technology Co., Ltd Data sharing system and data sharing method therefor

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102354495A (en) * 2011-08-31 2012-02-15 中国科学院自动化研究所 Testing method and system of semi-opened spoken language examination questions
CN102800314A (en) * 2012-07-17 2012-11-28 广东外语外贸大学 English sentence recognizing and evaluating system with feedback guidance and method of system
CN103559894A (en) * 2013-11-08 2014-02-05 安徽科大讯飞信息科技股份有限公司 Method and system for evaluating spoken language
CN103617799A (en) * 2013-11-28 2014-03-05 广东外语外贸大学 Method for detecting English statement pronunciation quality suitable for mobile device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Meng Ping: "Design and Implementation of an Automatic Pronunciation Evaluation System", China Masters' Theses Full-text Database, Information Science and Technology Series *
Zhang Wenzhong et al.: "A Quantitative Study of Second Language Oral Fluency Development", Modern Foreign Languages (Quarterly) *
Li Xinguang et al.: "Research on an Objective Evaluation System for English Sentences Considering Stress and Prosody", Computer Engineering and Applications *
Li Jingjiao et al.: "A Hybrid Model Combining HMM and Self-Organizing Neural Networks in Speech Recognition", Journal of Northeastern University *

Cited By (79)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104505103B (en) * 2014-12-04 2018-07-03 上海流利说信息技术有限公司 Voice quality assessment equipment, method and system
CN104361895A (en) * 2014-12-04 2015-02-18 上海流利说信息技术有限公司 Voice quality evaluation equipment, method and system
CN104505103A (en) * 2014-12-04 2015-04-08 上海流利说信息技术有限公司 Voice quality evaluation equipment, method and system
CN104361896A (en) * 2014-12-04 2015-02-18 上海流利说信息技术有限公司 Voice quality evaluation equipment, method and system
CN104361896B (en) * 2014-12-04 2018-04-13 上海流利说信息技术有限公司 Voice quality assessment equipment, method and system
CN104361895B (en) * 2014-12-04 2018-12-18 上海流利说信息技术有限公司 Voice quality assessment equipment, method and system
CN104464423A (en) * 2014-12-19 2015-03-25 科大讯飞股份有限公司 Calibration optimization method and system for speaking test evaluation
CN104485105A (en) * 2014-12-31 2015-04-01 中国科学院深圳先进技术研究院 Electronic medical record generating method and electronic medical record system
CN104485105B (en) * 2014-12-31 2018-04-13 中国科学院深圳先进技术研究院 A kind of electronic health record generation method and electronic medical record system
CN104732977A (en) * 2015-03-09 2015-06-24 广东外语外贸大学 On-line spoken language pronunciation quality evaluation method and system
CN104732977B (en) * 2015-03-09 2018-05-11 广东外语外贸大学 A kind of online spoken language pronunciation quality evaluating method and system
CN104732352A (en) * 2015-04-02 2015-06-24 张可 Method for question bank quality evaluation
CN104810017B (en) * 2015-04-08 2018-07-17 广东外语外贸大学 Oral evaluation method and system based on semantic analysis
CN104810017A (en) * 2015-04-08 2015-07-29 广东外语外贸大学 Semantic analysis-based oral language evaluating method and system
CN105989839B (en) * 2015-06-03 2019-12-13 乐融致新电子科技(天津)有限公司 Speech recognition method and device
CN105989839A (en) * 2015-06-03 2016-10-05 乐视致新电子科技(天津)有限公司 Speech recognition method and speech recognition device
CN105681920A (en) * 2015-12-30 2016-06-15 深圳市鹰硕音频科技有限公司 Network teaching method and system with voice recognition function
CN105681920B (en) * 2015-12-30 2017-03-15 深圳市鹰硕音频科技有限公司 A kind of Network teaching method and system with speech identifying function
CN106971711A (en) * 2016-01-14 2017-07-21 芋头科技(杭州)有限公司 A kind of adaptive voiceprint recognition method and system
CN105608960A (en) * 2016-01-27 2016-05-25 广东外语外贸大学 Spoken language formative teaching method and system based on multi-parameter analysis
CN105632488A (en) * 2016-02-23 2016-06-01 深圳市海云天教育测评有限公司 Voice evaluation method and device
CN105654785A (en) * 2016-03-18 2016-06-08 上海语知义信息技术有限公司 Personalized spoken foreign language learning system and method
CN105825852A (en) * 2016-05-23 2016-08-03 渤海大学 Oral English reading test scoring method
CN106548673A (en) * 2016-10-25 2017-03-29 合肥东上多媒体科技有限公司 A kind of Teaching Management Method based on intelligent Matching
CN106531182A (en) * 2016-12-16 2017-03-22 上海斐讯数据通信技术有限公司 Language learning system
CN106710348A (en) * 2016-12-20 2017-05-24 江苏前景信息科技有限公司 Civil air defense interactive experience method and system
CN106652622A (en) * 2017-02-07 2017-05-10 广东小天才科技有限公司 Text training method and device
CN107221318B (en) * 2017-05-12 2020-03-31 广东外语外贸大学 English spoken language pronunciation scoring method and system
CN107221318A (en) * 2017-05-12 2017-09-29 广东外语外贸大学 Oral English Practice pronunciation methods of marking and system
CN107293286A (en) * 2017-05-27 2017-10-24 华南理工大学 A kind of speech sample collection method based on online dubbing games
CN107230171A (en) * 2017-05-31 2017-10-03 中南大学 A kind of student career orientation evaluation method and system
CN107292496A (en) * 2017-05-31 2017-10-24 中南大学 A kind of work values cognitive system and method
CN107239897A (en) * 2017-05-31 2017-10-10 中南大学 A kind of personality occupation type method of testing and system
CN107274738A (en) * 2017-06-23 2017-10-20 广东外语外贸大学 Chinese-English translation teaching points-scoring system based on mobile Internet
US11726844B2 (en) 2017-06-26 2023-08-15 Shanghai Cambricon Information Technology Co., Ltd Data sharing system and data sharing method therefor
US11537843B2 (en) 2017-06-29 2022-12-27 Shanghai Cambricon Information Technology Co., Ltd Data sharing system and data sharing method therefor
CN109214616A (en) * 2017-06-29 2019-01-15 上海寒武纪信息科技有限公司 A kind of information processing unit, system and method
CN107578778A (en) * 2017-08-16 2018-01-12 南京高讯信息科技有限公司 A kind of method of spoken scoring
US11656910B2 (en) 2017-08-21 2023-05-23 Shanghai Cambricon Information Technology Co., Ltd Data sharing system and data sharing method therefor
CN107785011B (en) * 2017-09-15 2020-07-03 北京理工大学 Training method, device, equipment and medium of speech rate estimation model and speech rate estimation method, device and equipment
CN107785011A (en) * 2017-09-15 2018-03-09 北京理工大学 Speech rate estimation model training method and speech rate estimation method, device, equipment and medium
WO2019075828A1 (en) * 2017-10-20 2019-04-25 深圳市鹰硕音频科技有限公司 Voice evaluation method and apparatus
CN109727608A (en) * 2017-10-25 2019-05-07 香港中文大学深圳研究院 A kind of pathological voice assessment method based on Chinese speech
CN109727608B (en) * 2017-10-25 2020-07-24 香港中文大学深圳研究院 Chinese speech-based pathological voice evaluation system
CN107818797A (en) * 2017-12-07 2018-03-20 苏州科达科技股份有限公司 Voice quality assessment method, apparatus and its system
CN108428382A (en) * 2018-02-14 2018-08-21 广东外语外贸大学 A kind of spoken repetition scoring method and system
CN108429932A (en) * 2018-04-25 2018-08-21 北京比特智学科技有限公司 Video processing method and device
US11687467B2 (en) 2018-04-28 2023-06-27 Shanghai Cambricon Information Technology Co., Ltd Data sharing system and data sharing method therefor
CN108831503A (en) * 2018-06-07 2018-11-16 深圳习习网络科技有限公司 A kind of method and device for oral evaluation
CN109036429A (en) * 2018-07-25 2018-12-18 浪潮电子信息产业股份有限公司 A kind of voice match scoring querying method and system based on cloud service
CN108986786A (en) * 2018-07-27 2018-12-11 中国电子产品可靠性与环境试验研究所((工业和信息化部电子第五研究所)(中国赛宝实验室)) Interactive voice equipment ranking method, system, computer equipment and storage medium
CN109147823A (en) * 2018-10-31 2019-01-04 河南职业技术学院 Oral English Practice assessment method and Oral English Practice assessment device
CN109493658A (en) * 2019-01-08 2019-03-19 上海健坤教育科技有限公司 Situated human-computer dialogue formula spoken language interactive learning method
CN111640452A (en) * 2019-03-01 2020-09-08 北京搜狗科技发展有限公司 Data processing method and device and data processing device
CN111640452B (en) * 2019-03-01 2024-05-07 北京搜狗科技发展有限公司 Data processing method and device for data processing
CN109979484B (en) * 2019-04-03 2021-06-08 北京儒博科技有限公司 Pronunciation error detection method and device, electronic equipment and storage medium
CN109979484A (en) * 2019-04-03 2019-07-05 北京儒博科技有限公司 Pronunciation error detection method, device, electronic equipment and storage medium
CN110135492A (en) * 2019-05-13 2019-08-16 山东大学 Equipment fault diagnosis and method for detecting abnormality and system based on more Gauss models
CN110211607A (en) * 2019-07-04 2019-09-06 山东中医药高等专科学校 A kind of English learning system based on sensing network
CN110600052A (en) * 2019-08-19 2019-12-20 天闻数媒科技(北京)有限公司 Voice evaluation method and device
CN111358428A (en) * 2020-01-20 2020-07-03 书丸子(北京)科技有限公司 Observation capability test evaluation method and device
CN111294468A (en) * 2020-02-07 2020-06-16 普强时代(珠海横琴)信息技术有限公司 Tone quality detection and analysis system for customer service center calling
WO2021196475A1 (en) * 2020-04-01 2021-10-07 深圳壹账通智能科技有限公司 Intelligent language fluency recognition method and apparatus, computer device, and storage medium
CN111696524A (en) * 2020-04-21 2020-09-22 厦门快商通科技股份有限公司 Character-overlapping voice recognition method and system
CN111696524B (en) * 2020-04-21 2023-02-14 厦门快商通科技股份有限公司 Character-overlapping voice recognition method and system
CN111583961A (en) * 2020-05-07 2020-08-25 北京一起教育信息咨询有限责任公司 Stress evaluation method and device and electronic equipment
CN111612324A (en) * 2020-05-15 2020-09-01 深圳看齐信息有限公司 Multi-dimensional assessment method based on oral English examination
CN111612324B (en) * 2020-05-15 2021-02-19 深圳看齐信息有限公司 Multi-dimensional assessment method based on oral English examination
CN111599234A (en) * 2020-05-19 2020-08-28 黑龙江工业学院 Automatic English spoken language scoring system based on voice recognition
CN111612352A (en) * 2020-05-22 2020-09-01 北京易华录信息技术股份有限公司 Student expression ability assessment method and device
CN111816169A (en) * 2020-07-23 2020-10-23 苏州思必驰信息科技有限公司 Method and device for training Chinese and English hybrid speech recognition model
CN112349300A (en) * 2020-11-06 2021-02-09 北京乐学帮网络技术有限公司 Voice evaluation method and device
CN112634692A (en) * 2020-12-15 2021-04-09 成都职业技术学院 Emergency evacuation deduction training system for crew cabins
CN112750465A (en) * 2020-12-29 2021-05-04 昆山杜克大学 Cloud language ability evaluation system and wearable recording terminal
CN112750465B (en) * 2020-12-29 2024-04-30 昆山杜克大学 Cloud language ability evaluation system and wearable recording terminal
CN113035238A (en) * 2021-05-20 2021-06-25 北京世纪好未来教育科技有限公司 Audio evaluation method, device, electronic equipment and medium
CN113571043A (en) * 2021-07-27 2021-10-29 广州欢城文化传媒有限公司 Dialect imitation ability evaluation method and device, electronic equipment and storage medium
CN113571043B (en) * 2021-07-27 2024-06-04 广州欢城文化传媒有限公司 Dialect imitation ability evaluation method and device, electronic equipment and storage medium
CN113807813A (en) * 2021-09-14 2021-12-17 广东德诚科教有限公司 Grading system and method based on man-machine conversation examination

Also Published As

Publication number Publication date
CN103928023B (en) 2017-04-05

Similar Documents

Publication Publication Date Title
CN103928023B (en) A kind of speech assessment method and system
CN102800314B (en) English sentence recognizing and evaluating system with feedback guidance and method
US11335324B2 (en) Synthesized data augmentation using voice conversion and speech recognition models
US20100004931A1 (en) Apparatus and method for speech utterance verification
CN104050965A (en) English phonetic pronunciation quality evaluation system with emotion recognition function and method thereof
CN109285535A (en) Phoneme synthesizing method based on Front-end Design
CN106548775A (en) A kind of audio recognition method and system
Duan et al. A Preliminary study on ASR-based detection of Chinese mispronunciation by Japanese learners
Liu et al. AI recognition method of pronunciation errors in oral English speech with the help of big data for personalized learning
Goyal et al. A comparison of Laryngeal effect in the dialects of Punjabi language
TW201411602A (en) Speaking-rate controlled prosodic-information generating device and speaking-rate dependent hierarchical prosodic module
TWI467566B (en) Polyglot speech synthesis method
Farooq et al. Mispronunciation detection in articulation points of Arabic letters using machine learning
Sinha et al. Empirical analysis of linguistic and paralinguistic information for automatic dialect classification
CN112133292A (en) End-to-end automatic voice recognition method for civil aviation land-air communication field
CN102880906A (en) Chinese vowel pronunciation method based on DIVA nerve network model
Dai [Retracted] An Automatic Pronunciation Error Detection and Correction Mechanism in English Teaching Based on an Improved Random Forest Model
Hacioglu et al. Parsing speech into articulatory events
Sharma et al. Soft-Computational Techniques and Spectro-Temporal Features for Telephonic Speech Recognition: an overview and review of current state of the art
CN117012230A (en) Evaluation model for singing pronunciation and character biting
Dalva Automatic speech recognition system for Turkish spoken language
Vyas et al. Study of Speech Recognition Technology and its Significance in Human-Machine Interface
Sinha et al. Spectral and prosodic features-based speech pattern classification
CN111179902B (en) Speech synthesis method, equipment and medium for simulating resonance cavity based on Gaussian model
Miyazaki et al. Connectionist temporal classification-based sound event encoder for converting sound events into onomatopoeic representations

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170405

Termination date: 20200429