CN109272992A - Spoken language assessment method and device, and device for generating a spoken language assessment model

Info

Publication number: CN109272992A
Application number: CN201811422784.1A
Authority: CN
Inventors: 夏龙; 高强; 吴凡; 陈冬晓; 马楠; 崔成鹏; 郭常圳
Original assignee: Beijing Chalk Future Technology Co Ltd
Current assignee: Beijing Chalk Future Technology Co Ltd
Priority date: 2018-11-27
Filing date: 2018-11-27
Publication date: 2019-01-25
Anticipated expiration: 2038-11-27
Also published as: CN109272992B

Abstract

The application provides a spoken language assessment method and device, and a device for generating a spoken language assessment model. The spoken language assessment method, applied to the data processing end of a spoken language assessment system, includes: obtaining speech data to be assessed; extracting the spoken-language attribute feature information of the speech data to be assessed; and obtaining a scoring result from that attribute feature information based on a pre-generated spoken language assessment model.

Description

Spoken language assessment method and device, and device for generating a spoken language assessment model

Technical field

This application relates to the technical field of language learning, and in particular to a spoken language assessment method and device, a device for generating a spoken language assessment model, a computing device, and a computer-readable storage medium.

Background

With the development of global economic integration, English plays an increasingly important role as an international language. Commercial exchange, cultural exchange, and international tourism are increasingly frequent, and more and more people need to learn a foreign language. Improving oral communication skills has therefore become an urgent need in foreign language learning.

The overwhelming majority of existing oral evaluation techniques target adults, and their speech training data comes essentially from adults. Because a child's vocal tract is not fully developed and annotated training data from children is scarce, these models struggle to perform well in children's oral evaluation. Other models require a large amount of example audio, which makes them very inflexible.

Summary of the invention

In view of this, embodiments of the present application provide a spoken language assessment method and device, a device for generating a spoken language assessment model, a computing device, and a computer-readable storage medium, to address technical deficiencies in the prior art.

An embodiment of the present application discloses a spoken language assessment method, applied to the data processing end of a spoken language assessment system, including:

obtaining speech data to be assessed;

extracting the spoken-language attribute feature information of the speech data to be assessed;

obtaining a scoring result from the spoken-language attribute feature information of the speech data to be assessed, based on a pre-generated spoken language assessment model.

Optionally, after obtaining the speech data to be assessed and before extracting its spoken-language attribute feature information, the method includes:

performing end-of-speech detection on the speech data to be assessed;

performing vocal tract length normalization on the speech data to be assessed after end-of-speech detection;

performing speech recognition on the normalized speech data to be assessed to obtain its speech frame sequence and the converted text corresponding to that speech frame sequence;

performing forced alignment between the converted text and a reference text, and generating forced alignment information, where the forced alignment information includes the correspondence between the speech frame sequence of the converted text and the speech frame sequence of the reference text.

Optionally, performing end-of-speech detection on the speech data to be assessed includes:

recognizing the speech data to be assessed, and converting the recognized speech data into an end-detection speech text that contains time information;

matching the reference text against the end-detection speech text to obtain a matched-text time point;

determining the end time point of the speech data to be assessed according to a preset voice activity detection method and the matched-text time point.

Optionally, the spoken-language attribute feature information includes:

pronunciation attribute feature information;

intonation attribute feature information;

and fluency attribute feature information.

Optionally, the spoken-language attribute feature information further includes word annotation feature information;

after generating the forced alignment information, the method further includes:

judging, according to the forced alignment information, whether the converted text matches the reference text, and if not, adding word annotation feature information to the unmatched words in the reference text.

Optionally, the speech data to be assessed that is obtained is fragmented and compressed speech data;

obtaining the speech data to be assessed includes:

receiving the fragmented and compressed speech data to be assessed;

decompressing the fragmented and compressed speech data to be assessed.

Optionally, during forced alignment, a decoding graph with Kleene closure is used to filter out, according to the reference text, speech data that is repeated in the speech data to be assessed.

Optionally, obtaining the speech data to be assessed and extracting its spoken-language attribute feature information includes:

obtaining initial speech data;

extracting target speech data from the initial speech data as the speech data to be assessed, using a speech separation method based on deep clustering;

extracting the spoken-language attribute feature information of the speech data to be assessed.

In another aspect, the application also provides a method for generating a spoken language assessment model, including:

generating a spoken-language training sample set from spoken-language training sample groups, where each group contains spoken-language training samples, and each sample includes spoken-language attribute feature information and the scoring result corresponding to that information;

training a spoken language assessment model on the spoken-language training sample set, so that the model associates the spoken-language attribute feature information of each training sample with the corresponding scoring result.

Optionally, before training the spoken language assessment model on the training sample set, the method includes:

performing vocal tract length normalization on the speech data of the spoken-language training samples, so that the vocal tract lengths of the training samples in all training sample groups are unified.

Optionally, the spoken-language attribute feature information includes:

pronunciation attribute feature information;

intonation attribute feature information;

and fluency attribute feature information.

In another aspect, the application also provides a spoken language assessment device, including:

a speech data obtaining module, configured to obtain speech data to be assessed;

an attribute feature extraction module, configured to extract the spoken-language attribute feature information of the speech data to be assessed;

a scoring result generation module, configured to obtain a scoring result from the spoken-language attribute feature information of the speech data to be assessed, based on a pre-generated spoken language assessment model.

Optionally, the device further includes:

an end-of-speech detection module, configured to perform end-of-speech detection on the speech data to be assessed;

a vocal tract length normalization module, configured to perform vocal tract length normalization on the speech data to be assessed after end-of-speech detection;

a speech recognition module, configured to perform speech recognition on the normalized speech data to be assessed to obtain its converted text;

a forced alignment module, configured to perform forced alignment between the converted text and a reference text and to generate forced alignment information, where the forced alignment information includes the correspondence between the speech frame sequence of the converted text and the speech frame sequence of the reference text.

Optionally, the end-of-speech detection module includes:

a recognition unit, configured to recognize the speech data to be assessed and convert the recognized speech data into an end-detection speech text that contains time information;

a matching unit, configured to match the reference text against the end-detection speech text to obtain a matched-text time point;

a determination unit, configured to determine the end time point of the speech data to be assessed according to a preset voice activity detection method and the matched-text time point.

Optionally, the device further includes:

a marking module, configured to judge, according to the forced alignment information, whether the converted text matches the reference text, and if not, to add word annotation feature information to the unmatched words in the reference text.

Optionally, the speech data obtaining module includes:

a first obtaining unit, configured to obtain fragmented and compressed speech data;

a decompression unit, configured to decompress the fragmented and compressed speech data to be assessed.

Optionally, the forced alignment module uses a decoding graph with Kleene closure to filter out, according to the reference text, speech data that is repeated in the speech data to be assessed.

Optionally, the speech data obtaining module includes:

a second obtaining unit, configured to obtain initial speech data;

an extraction unit, configured to extract target speech data from the initial speech data as the speech data to be assessed, using a speech separation method based on deep clustering.

In another aspect, the application also provides a device for generating a spoken language assessment model, including:

a training sample generation module, configured to generate a spoken-language training sample set from spoken-language training sample groups, where each group contains spoken-language training samples, and each sample includes spoken-language attribute feature information and the scoring result corresponding to that information;

a model generation module, configured to train a spoken language assessment model on the spoken-language training sample set to obtain the spoken language assessment model, which associates the spoken-language attribute feature information of each training sample with the corresponding scoring result.

Optionally, the device further includes a vocal tract length normalization module, configured to perform vocal tract length normalization on the speech data of the spoken-language training samples, so that the vocal tract lengths of the training samples in all training sample groups are unified.

In another aspect, the application also provides a computing device, including a memory, a processor, and computer instructions stored in the memory and executable on the processor, where the processor, when executing the instructions, implements the steps of the spoken language assessment method or of the method for generating a spoken language assessment model.

In another aspect, the application also provides a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the spoken language assessment method or of the method for generating a spoken language assessment model.

The spoken language assessment method and device, the device for generating a spoken language assessment model, the computing device, and the computer-readable storage medium provided by this application perform vocal tract length normalization and model training on spoken-language training samples, so that the method and device can assess the spoken language of different groups of people and obtain accurate assessment results, in particular solving the problem of children's oral evaluation. In addition, end-of-speech detection improves the fluency of the whole assessment process and lowers the barrier to use. Furthermore, the method uses Kleene closure to solve the problem of repeated reading-along, and the speech separation method based on deep clustering effectively avoids overlapping or alternating voices in the speech data to be assessed.

Brief description of the drawings

Fig. 1 is a structural block diagram of the computing device of an embodiment of the application;

Fig. 2 is a schematic flow chart of the spoken language assessment method of an embodiment of the application;

Fig. 3 is a schematic flow chart of the spoken language assessment method of an embodiment of the application;

Fig. 4 is a schematic flow chart of generating the spoken language assessment model of an embodiment of the application;

Fig. 5 is a schematic diagram of the TDNN network structure of an embodiment of the application;

Fig. 6 is a schematic diagram of part of the TDNN network structure of an embodiment of the application;

Fig. 7 is a structural schematic diagram of the spoken language assessment device of an embodiment of the application;

Fig. 8 is a structural schematic diagram of the device for generating a spoken language assessment model of an embodiment of the application.

Detailed description of the embodiments

Many specific details are set forth in the following description to provide a thorough understanding of the application. However, the application can be implemented in many ways other than those described here, and those skilled in the art can make similar generalizations without departing from its spirit; the application is therefore not limited by the specific implementations disclosed below.

This application provides a spoken language assessment method and device, a device for generating a spoken language assessment model, a computing device, and a computer-readable storage medium, each of which is described in detail in the following embodiments.

Fig. 1 shows a structural block diagram of a computing device 100 according to an embodiment of this specification. The components of the computing device 100 include, but are not limited to, a memory 110 and a processor 120. The processor 120 is connected to the memory 110 via a bus 130. A database 150 is used to store data, and a network 160 is used to receive the data stored in the database 150.

The computing device 100 further includes an access device 140, which enables the computing device 100 to communicate via one or more networks 160. Examples of these networks include the public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet. The access device 140 may include one or more of any type of wired or wireless network interface (for example, a network interface card (NIC)), such as an IEEE 802.11 wireless local area network (WLAN) interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a near-field communication (NFC) interface, and so on.

In an embodiment of this specification, the above components of the computing device 100 and other components not shown in Fig. 1 may also be connected to each other, for example via a bus. It should be understood that the block diagram of the computing device shown in Fig. 1 is for illustrative purposes only and does not limit the scope of this specification. Those skilled in the art can add or replace other components as needed.

The computing device 100 can be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (for example, a tablet computer, personal digital assistant, laptop computer, notebook computer, netbook, etc.), a mobile phone (for example, a smartphone), a wearable computing device (for example, a smartwatch, smart glasses, etc.) or another type of mobile device, or a stationary computing device such as a desktop computer or PC. The computing device 100 can also be a mobile or stationary server.

The processor 120 can execute the steps of the method shown in Fig. 2. Fig. 2 shows a schematic flow chart of a spoken language assessment method according to an embodiment of the application. The method is applied to the data processing end of a spoken language assessment system and includes steps 202 to 206:

Step 202: obtain speech data to be assessed.

Step 204: extract the spoken-language attribute feature information of the speech data to be assessed.

The spoken-language attribute feature information includes pronunciation attribute feature information, intonation attribute feature information, and fluency attribute feature information.

Step 206: obtain a scoring result from the spoken-language attribute feature information of the speech data to be assessed, based on a pre-generated spoken language assessment model.

Once the speech data to be assessed has been obtained, the spoken language assessment model can produce a scoring result from its attribute feature information. The procedure is simple and can greatly shorten the time the user waits for a result. Because the model is trained in advance, it can score the speech data to be assessed without a large stock of example audio, which widens the range of use of the method.

The content of the attribute feature information reflects the scoring dimensions of the model: the model scores the pronunciation, intonation, and fluency of the speech data to be assessed and gives a single score that evaluates its accuracy.

For example, if the score is given on a hundred-point scale, then after a segment of speech is input, the model scores it on pronunciation, intonation, and fluency and might output a scoring result of 98 points.

Referring to Fig. 3, Fig. 3 shows a schematic flow chart of a spoken language assessment method according to an embodiment of the application. The method includes steps 302 to 322:

Step 302: obtain fragmented and compressed speech data to be assessed; this speech data is the initial speech data.

When the data processing end is a server, having the server fetch uncompressed speech data directly from the client would consume considerable network traffic on the user's side, introduce significant delay, and put pressure on the server's network bandwidth. The speech data to be assessed is therefore fragmented and compressed, and transmitted fragment by fragment.

The fragment size affects the performance of the whole oral evaluation. If fragments are too short, transmission delay is low but server-side CPU/GPU utilization is low. If fragments are too long, CPU/GPU utilization is high but network delay is large.

In practice, the optimal fragment length for the speech data to be assessed can be determined through extensive experiments.
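The fragmenting part of step 302 can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the function name `split_into_fragments`, the 16 kHz 16-bit PCM input, and the 200 ms default are all assumptions chosen for the example, and the AAC compression of each fragment is deliberately left out.

```python
from typing import Iterator


def split_into_fragments(pcm: bytes,
                         sample_rate: int = 16000,
                         bytes_per_sample: int = 2,
                         fragment_ms: int = 200) -> Iterator[bytes]:
    """Yield consecutive fragments of raw PCM audio, each about fragment_ms long.

    Hypothetical helper: the fragment length is the tunable quantity that the
    text says must be found experimentally.
    """
    if fragment_ms <= 0:
        raise ValueError("fragment_ms must be positive")
    frag_bytes = sample_rate * bytes_per_sample * fragment_ms // 1000
    # Keep whole samples only, and never let the step drop to zero.
    frag_bytes = max(bytes_per_sample, frag_bytes - frag_bytes % bytes_per_sample)
    for start in range(0, len(pcm), frag_bytes):
        yield pcm[start:start + frag_bytes]


# One second of silence split into 200 ms fragments.
one_second = bytes(16000 * 2)
fragments = list(split_into_fragments(one_second, fragment_ms=200))
```

Shorter values of `fragment_ms` give more, smaller requests (lower delay per fragment, less work per server call); longer values give the opposite, which is the trade-off described above.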

Audio compression algorithms fall into two broad classes, lossless and lossy. Lossless compression loses no information, but its compression ratio is too low; lossy compression loses some information, but its compression ratio is higher and it has almost no effect on speech recognition results. There are many mature compression coding methods; an embodiment of the application uses the AAC (Advanced Audio Coding) format.

Step 304: decompress the fragmented and compressed speech data to be assessed (the inverse transform of AAC).

Decompressing the fragmented and compressed speech data facilitates its subsequent processing.

Step 306: extract target speech data from the initial speech data as the speech data to be assessed, using a speech separation method based on deep clustering.

When a parent accompanies a child while reading, the two voices may overlap or alternate. The deep-clustering speech separation method can accurately extract the child's speech from speech data that also contains the accompanying parent's voice, improving the precision of the assessment.
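The clustering stage of such a separation method can be sketched as follows. In deep clustering, a neural network (not shown here, and assumed to exist) maps every time-frequency bin of the mixture to an embedding vector; bins whose embeddings fall in the same cluster are attributed to the same speaker. The sketch below only illustrates that grouping step with plain k-means; the function name `cluster_bins` and the toy two-dimensional embeddings are assumptions made for the example.

```python
from typing import List, Sequence

Vector = Sequence[float]


def _sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster_bins(embeddings: List[Vector], n_sources: int = 2,
                 n_iter: int = 20) -> List[int]:
    """Assign each time-frequency bin embedding to a source with k-means."""
    # Deterministic farthest-point initialisation of the cluster centres.
    centers = [list(embeddings[0])]
    while len(centers) < n_sources:
        far = max(embeddings, key=lambda e: min(_sq_dist(e, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(embeddings)
    for _ in range(n_iter):
        # Assignment step: nearest centre for every bin.
        labels = [min(range(n_sources), key=lambda k: _sq_dist(e, centers[k]))
                  for e in embeddings]
        # Update step: each centre moves to the mean of its members.
        for k in range(n_sources):
            members = [e for e, lab in zip(embeddings, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy embeddings: the first two bins belong to one voice, the last two to another.
toy = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
labels = cluster_bins(toy)
child_mask = [lab == labels[0] for lab in labels]  # binary mask for one source
```

The resulting mask would then be applied to the mixture spectrogram to keep only the target speaker's bins; deciding which cluster is the child is a separate problem that this sketch does not address.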

Step 308: perform end-of-speech detection on the speech data to be assessed.

Specifically, this step includes the following:

recognizing the speech data to be assessed, and converting the recognized speech data into an end-detection speech text that contains time information;

matching the reference text against the end-detection speech text to obtain a matched-text time point, where the matched-text time point is the time point obtained by matching the time points in the time information of the reference text with those in the time information of the end-detection speech text;

determining the end time point of the speech data to be assessed according to a preset voice activity detection method and the matched-text time point.

The reference text is a text that includes the content of the user's speech and time information. The preset voice activity detection method can include, but is not limited to, VAD (Voice Activity Detection, also known as speech endpoint detection or speech boundary detection). During end-of-speech detection, determining the end time point from the VAD and the matched-text time point makes it possible to judge accurately whether the user has finished reading. The end of the user's speech is thus detected automatically, without manual intervention, which improves the fluency of the whole assessment and lowers the barrier to using the method, making it especially suitable for children using it independently without a parent present.

Traditional VAD techniques based on dominant frequency components produce a large number of false positives in very noisy environments. Because the spoken-language training samples in an embodiment of the application cover a variety of real noise scenes, and the text that the user reads along with is preset, the speech-recognition-based VAD is more robust and the false positive rate can be greatly reduced.
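The combination of text matching and voice activity detection in step 308 can be sketched as follows. This is a simplified illustration under stated assumptions: the recogniser output is taken to be a list of `(word, end_time)` pairs, the VAD is replaced by a plain frame-energy threshold, and the names `matched_text_time` and `end_of_speech` and all default thresholds are hypothetical.

```python
from typing import List, Optional, Tuple


def matched_text_time(reference: List[str],
                      recognized: List[Tuple[str, float]]) -> Optional[float]:
    """Return the end time of the recognised word that completes the reference text."""
    pos = 0
    for word, end_time in recognized:
        if pos < len(reference) and word.lower() == reference[pos].lower():
            pos += 1
            if pos == len(reference):
                return end_time
    return None  # the reference text has not been fully read yet


def end_of_speech(reference: List[str],
                  recognized: List[Tuple[str, float]],
                  frame_energy: List[float],
                  frame_s: float = 0.01,
                  energy_threshold: float = 0.1,
                  min_silence_frames: int = 30) -> Optional[float]:
    """Declare the end point at the first long silence after the text is matched."""
    t_match = matched_text_time(reference, recognized)
    if t_match is None:
        return None
    silent = 0
    for i in range(round(t_match / frame_s), len(frame_energy)):
        # Energy threshold used here as a stand-in for a real VAD decision.
        silent = silent + 1 if frame_energy[i] < energy_threshold else 0
        if silent >= min_silence_frames:
            return (i - silent + 1) * frame_s
    return None


# One second of speech followed by half a second of silence, 10 ms frames.
energy = [1.0] * 100 + [0.0] * 50
end_time = end_of_speech(["good", "morning"],
                         [("good", 0.4), ("morning", 0.9)], energy)
```

Requiring the text match before looking for silence is what keeps a pause in the middle of the sentence from being mistaken for the end of the reading.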

Step 310: perform vocal tract length normalization on the speech data to be assessed after end-of-speech detection.

In this step, VTLN (Vocal Tract Length Normalisation) is used to remove the effect of vocal tract shape on recognition performance, widening the applicability of the method while preserving assessment accuracy.
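One common way to realise VTLN is to warp the frequency axis with a speaker-specific factor before computing filterbank features. The sketch below shows a piecewise-linear warp of that kind; it is an illustration only, since the text does not say which warping function the embodiment uses. The function name `vtln_warp`, the 8 kHz Nyquist frequency, and the 0.85 cutoff ratio are assumptions, and the choice of the warp factor `alpha` for each speaker is not shown.

```python
def vtln_warp(freq_hz: float, alpha: float,
              nyquist_hz: float = 8000.0,
              cutoff_ratio: float = 0.85) -> float:
    """Piecewise-linear frequency warp.

    Frequencies below the cutoff are scaled by alpha; above it, the warp is
    interpolated linearly so that the Nyquist frequency maps to itself.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if not 0.0 <= freq_hz <= nyquist_hz:
        raise ValueError("freq_hz must lie in [0, nyquist_hz]")
    # Shrink the cutoff for alpha > 1 so the scaled cutoff stays below Nyquist.
    cutoff = cutoff_ratio * nyquist_hz * min(1.0, 1.0 / alpha)
    if freq_hz <= cutoff:
        return alpha * freq_hz
    slope = (nyquist_hz - alpha * cutoff) / (nyquist_hz - cutoff)
    return alpha * cutoff + slope * (freq_hz - cutoff)


# Warp a few filterbank centre frequencies with a hypothetical factor of 0.9.
warped = [vtln_warp(f, alpha=0.9) for f in (0.0, 1000.0, 4000.0, 8000.0)]
```

With `alpha = 1.0` the warp is the identity, so a speaker whose vocal tract matches the reference is left unchanged.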

Step 312: perform speech recognition on the normalized speech data to be assessed to obtain its speech frame sequence and the converted text corresponding to that speech frame sequence.

Performing speech recognition on the speech data after vocal tract length normalization yields a more faithful representation of the speech data to be assessed, which helps improve scoring accuracy.

Step 314: perform forced alignment between the converted text and the reference text, and generate forced alignment information, where the forced alignment information includes the correspondence between the speech frame sequence of the converted text and the speech frame sequence of the reference text.

This step uses the Viterbi algorithm, a dynamic programming algorithm whose optimal substructure is given by formula (1):

$$\alpha(t, j) = b_j(o_t)\,\max_i\,\alpha(t-1, i)\,a_{ij} \qquad (1)$$

where $\alpha(t, j)$ denotes the probability of the optimal path that ends in hidden state $j$ at time $t$; $b_j(o_t)$ denotes the emission probability of generating $o_t$ given hidden state $j$; $a_{ij}$ denotes the transition probability from hidden state $i$ to hidden state $j$; and $o_t$ denotes the observation at time $t$.
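The recursion of formula (1), taking the best predecessor state at each step, can be sketched as follows. The code works directly with probabilities on a toy two-state model; the state names, observation symbols, and probability tables are invented for the example, and a practical aligner would normally work with log probabilities over acoustic frames.

```python
from typing import Dict, List, Sequence, Tuple


def viterbi(observations: Sequence[str],
            states: Sequence[str],
            start_p: Dict[str, float],
            trans_p: Dict[str, Dict[str, float]],
            emit_p: Dict[str, Dict[str, float]]) -> Tuple[float, List[str]]:
    """Return the probability and state sequence of the most likely path."""
    # alpha[t][j]: probability of the best path ending in state j at time t.
    alpha = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back: List[Dict[str, str]] = [{}]
    for t in range(1, len(observations)):
        alpha.append({})
        back.append({})
        for j in states:
            best_i = max(states, key=lambda i: alpha[t - 1][i] * trans_p[i][j])
            alpha[t][j] = (emit_p[j][observations[t]]
                           * alpha[t - 1][best_i] * trans_p[best_i][j])
            back[t][j] = best_i
    # Trace the best path backwards from the best final state.
    last = max(states, key=lambda s: alpha[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return alpha[-1][last], path


states = ("s1", "s2")
start_p = {"s1": 0.6, "s2": 0.4}
trans_p = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
emit_p = {"s1": {"x": 0.5, "y": 0.4, "z": 0.1},
          "s2": {"x": 0.1, "y": 0.3, "z": 0.6}}
prob, path = viterbi(("x", "y", "z"), states, start_p, trans_p, emit_p)
# path == ["s1", "s1", "s2"], prob is approximately 0.01512
```

In forced alignment the hidden states would be the phone or word states of the reference text, and the backtraced path gives the frame-to-text correspondence that the forced alignment information records.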

The forced alignment information is the detailed alignment between the speech frame sequences of the user's speech and of the reference text, and provides the basic information for subsequent speech assessment.

During forced alignment, a decoding graph with Kleene closure is used to filter out, according to the reference text, speech data that is repeated in the speech data to be assessed. Kleene closure is widely used in speech recognition based on weighted finite-state transducer (WFST) decoding, and here it solves the problem of repeated reading-along.
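The role of the Kleene closure can be illustrated at the text level with a regular expression, which is a loose analogy only: the embodiment applies the closure inside a WFST decoding graph over speech frames, not over word strings. In the sketch, each reference word `w` is turned into `w (w )*`, that is, the word followed by the Kleene closure of itself, so that immediate repetitions are accepted and then discarded. The function name `strip_repeats` is hypothetical.

```python
import re
from typing import List, Optional


def strip_repeats(reference: List[str], recognized: List[str]) -> Optional[List[str]]:
    """Return the reference words if `recognized` is the reference with some
    words repeated in place, otherwise None."""
    # "w (?:w )*" accepts one occurrence of w followed by zero or more repeats.
    pattern = "".join(f"{re.escape(w)} (?:{re.escape(w)} )*" for w in reference)
    text = "".join(w + " " for w in recognized)
    return list(reference) if re.fullmatch(pattern, text) else None


cleaned = strip_repeats(["the", "cat", "sat"],
                        ["the", "the", "cat", "sat", "sat"])
rejected = strip_repeats(["the", "cat", "sat"], ["the", "dog", "sat"])
```

This version only handles a word repeated immediately after itself; restarting a whole phrase would need a closure over a longer span, which a decoding graph can express in the same way.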

Step 316: extract the spoken-language attribute feature information of the speech data to be assessed.

The spoken-language attribute feature information includes:

pronunciation attribute feature information;

intonation attribute feature information;

and fluency attribute feature information.

The content of the spoken-language attribute feature information reflects the scoring dimensions of the spoken language assessment model, that is, the pronunciation, intonation, and fluency of the speech data to be assessed are scored.

The pronunciation attribute feature information consists of the speech recognition prediction probabilities for each word, syllable, and phoneme in the speech data to be assessed; the intonation attribute feature information is determined by the duration and fundamental frequency of each word, syllable, and phoneme; and the fluency attribute feature information is determined by the duration, fundamental frequency, and speaking rate of each word, syllable, and phoneme.
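How the three groups of features might be assembled from word-level alignment output can be sketched as follows. This is a simplified illustration: the `AlignedWord` record, its field names, and the flat list layout of each feature group are assumptions, and only word-level units are shown, whereas the text also covers syllables and phonemes.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AlignedWord:
    """Hypothetical per-word record produced by recognition and alignment."""
    text: str
    start_s: float
    end_s: float
    posterior: float    # recognition prediction probability for the word
    mean_f0_hz: float   # mean fundamental frequency over the word


def attribute_features(words: List[AlignedWord]) -> Dict[str, List[float]]:
    """Group raw measurements into pronunciation, intonation, and fluency features."""
    if not words:
        raise ValueError("at least one aligned word is required")
    durations = [w.end_s - w.start_s for w in words]
    f0 = [w.mean_f0_hz for w in words]
    total_s = words[-1].end_s - words[0].start_s
    speaking_rate = len(words) / total_s if total_s > 0 else 0.0  # words per second
    return {
        "pronunciation": [w.posterior for w in words],
        "intonation": durations + f0,
        "fluency": durations + f0 + [speaking_rate],
    }


words = [AlignedWord("good", 0.0, 0.5, 0.9, 220.0),
         AlignedWord("morning", 0.5, 1.0, 0.8, 200.0)]
features = attribute_features(words)
```

The fluency group reuses the duration and fundamental frequency values and adds the speaking rate, mirroring the description above.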

Step 318: obtain a scoring result from the spoken-language attribute feature information of the speech data to be assessed, based on a pre-generated spoken language assessment model.

The series of preprocessing steps applied to the speech data to be assessed (decompression, deep-clustering speech separation, forced alignment, and so on) improves the scoring accuracy of the spoken language assessment model and widens the applicability of the method, so that different groups of people can all use it for spoken language assessment and obtain accurate scoring results.

Referring to Fig. 4, Fig. 4 shows a schematic flow chart of generating a spoken language assessment model according to an embodiment of the application. In this embodiment, the spoken language assessment model is obtained through steps 3092 to 3096:

Step 3092: generate a spoken-language training sample set from spoken-language training sample groups, where each group contains spoken-language training samples, and each sample includes spoken-language attribute feature information and the scoring result corresponding to that information.

As users use the system and training samples accumulate, an embodiment of the application uses an unsupervised method to continuously expand the training sample set and iterate the model. The user's speech can thus be assessed without providing example audio, which makes the method more general and more robust.

Step 3094: perform vocal tract length normalization on the speech data of the spoken-language training samples, so that the vocal tract lengths of the training samples in all training sample groups are unified.

In this step, VTLN (vocal tract length normalization) is used to remove the effect of differing vocal tract shapes in the training sample set on recognition performance, ensuring that the spoken language assessment model assesses users with different vocal tract shapes accurately.

Step 3096: train the spoken language assessment model on the spoken-language training sample set, so that the model associates the spoken-language attribute feature information of each training sample with the corresponding scoring result.
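The idea of associating attribute features with scoring results can be sketched with a much simpler model than the one the embodiment uses. The code below fits a linear scorer by batch gradient descent as a stand-in; the function names `train_linear_scorer` and `predict_score`, the learning rate, the epoch count, and the sample values are all assumptions made for the illustration.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], float]  # (attribute features, scoring result)


def train_linear_scorer(samples: List[Sample], lr: float = 0.1,
                        epochs: int = 5000) -> List[float]:
    """Fit weights (the last entry is the bias) by batch gradient descent
    on the squared error between predicted and annotated scores."""
    n_features = len(samples[0][0])
    weights = [0.0] * (n_features + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for features, score in samples:
            x = list(features) + [1.0]
            err = sum(w * xi for w, xi in zip(weights, x)) - score
            for k, xi in enumerate(x):
                grad[k] += err * xi / len(samples)
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


def predict_score(weights: Sequence[float], features: Sequence[float]) -> float:
    x = list(features) + [1.0]
    return sum(w * xi for w, xi in zip(weights, x))


# Hypothetical samples: (pronunciation, intonation, fluency) in [0, 1] with a score.
training_set = [([0.9, 0.8, 0.9], 92.0),
                ([0.5, 0.6, 0.4], 55.0),
                ([0.2, 0.3, 0.2], 25.0)]
scorer = train_linear_scorer(training_set)
estimate = predict_score(scorer, [0.7, 0.7, 0.6])
```

Features are assumed to be scaled to roughly the range 0 to 1; with larger raw values the fixed learning rate used here could diverge.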

The spoken language assessment model uses a TDNN (time-delay neural network) with sub-sampling. A TDNN is a special kind of convolutional network that can efficiently model long-range temporal dependencies; it has low computational cost and good recognition performance, and suits real-time speech recognition. Extensive experiments show that a TDNN with symmetric context performs better than one with asymmetric context. The TDNN network structure is shown in Fig. 5, where layers 1 to 6 are hidden layers and layer 7 is the output layer.

Referring to Fig. 6, Fig. 6 shows partial structure 500 of the TDNN network structure in Fig. 5. After speech frames are extracted from the input layer, a splicing layer splices them into multiple input channels. The splicing layer then passes the spliced speech frames to a fully connected layer, which applies a full connection to them. The fully connected speech frames are then passed to a pooling layer, which reduces their output size.
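The splice, full connection, and pooling sequence described for Fig. 6 can be sketched in a few lines. This is a toy illustration, not the network of the embodiment: the context offsets, the ReLU non-linearity, max pooling over time, and every weight value are assumptions chosen to keep the example small.

```python
from typing import List, Sequence

Frame = List[float]


def splice(frames: List[Frame], offsets: Sequence[int], t: int) -> Frame:
    """Concatenate the frames at t + offset, clamping at the sequence edges."""
    out: Frame = []
    for off in offsets:
        idx = min(max(t + off, 0), len(frames) - 1)
        out.extend(frames[idx])
    return out


def tdnn_layer(frames: List[Frame], weights: List[List[float]],
               bias: List[float], offsets: Sequence[int] = (-2, 2)) -> List[Frame]:
    """Splice a sparse temporal context, then apply a fully connected layer with ReLU."""
    outputs = []
    for t in range(len(frames)):
        x = splice(frames, offsets, t)
        outputs.append([max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
                        for row, b in zip(weights, bias)])
    return outputs


def max_pool(frames: List[Frame], width: int = 2) -> List[Frame]:
    """Reduce the number of output frames by taking the maximum over each window."""
    return [[max(col) for col in zip(*frames[i:i + width])]
            for i in range(0, len(frames), width)]


inputs = [[1.0], [2.0], [3.0], [4.0]]
hidden = tdnn_layer(inputs, weights=[[1.0, 1.0]], bias=[0.0], offsets=(-1, 1))
pooled = max_pool(hidden, width=2)
# hidden == [[3.0], [4.0], [6.0], [7.0]], pooled == [[4.0], [7.0]]
```

Using only the two end points of the context window, such as offsets `(-2, 2)` instead of every frame from -2 to 2, is what sub-sampling refers to in this kind of network: the layer sees the same temporal span with fewer inputs.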

Step 320: the spoken-language attribute feature information further includes word annotation feature information.

According to the forced alignment information, it is judged whether the converted text matches the reference text; if not, word annotation feature information is added to the unmatched words in the reference text. Specifically, if the texts do not match and a segment of speech key frames in the speech frame sequence of the reference text is aligned with blank speech frames in the speech frame sequence of the converted text, the reference text content corresponding to that segment of key frames is determined to be a skipped part, and the remaining unmatched content in the reference text is determined to be a misread part.
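The distinction between skipped and misread words can be sketched at the word level with a generic sequence alignment. This is a simplification: the embodiment decides on the basis of frame-level forced alignment information, whereas the sketch compares word lists with Python's `difflib`. The function name `annotate_words` and the output layout are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def annotate_words(reference: List[str], converted: List[str]) -> Dict[str, List[str]]:
    """Label reference words that have no counterpart as skipped, and reference
    words that were replaced by something else as misread."""
    skipped: List[str] = []
    misread: List[str] = []
    matcher = SequenceMatcher(a=reference, b=converted, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "delete":       # present in the reference, absent from the speech
            skipped.extend(reference[i1:i2])
        elif tag == "replace":    # present in both, but the words differ
            misread.extend(reference[i1:i2])
    return {"skipped": skipped, "misread": misread}


marks = annotate_words(["the", "quick", "brown", "fox"],
                       ["the", "brown", "box"])
# marks == {"skipped": ["quick"], "misread": ["fox"]}
```

When a run of several reference words is replaced by a different number of recognised words, this word-level version labels the whole run as misread; the frame-level information used in the embodiment can separate such cases more finely.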

Based on the word-marking feature information, the words that the user skipped or misread can be identified, so that the spoken language assessment model can score according to the pronunciation, intonation, and fluency of the voice data to be assessed as well as the skipped and misread words. Compared with a spoken language assessment model that scores only according to the pronunciation, intonation, and fluency of the voice data to be assessed, the scoring result is more accurate.

In addition, based on the word-marking feature information, words can be marked differently according to their error type (misread or skipped), so that misread and skipped words can subsequently be displayed in different ways. The different ways include, but are not limited to, applying different font colors, font sizes, and/or typefaces to the skipped and misread words, so as to distinguish which words were skipped and which were misread.
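By way of example only, the different display modes could be realized by mapping each mark to a text style. The sketch below renders the reference text as HTML; the `STYLE_BY_MARK` table, the `render` function, and the particular colors are hypothetical choices and are not prescribed by the specification.

```python
from html import escape
from typing import Dict, List

# Hypothetical styles: one per error type.
STYLE_BY_MARK = {
    "skipped": "color:gray;font-style:italic",
    "misread": "color:red;font-weight:bold",
}


def render(words: List[str], marks: Dict[int, str]) -> str:
    """Wrap each marked word in a styled span; unmarked words are left plain."""
    parts = []
    for i, word in enumerate(words):
        style = STYLE_BY_MARK.get(marks.get(i))
        text = escape(word)
        parts.append(f'<span style="{style}">{text}</span>' if style else text)
    return " ".join(parts)


html_text = render(["the", "quick", "fox"], {1: "skipped"})
```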

Step 322: According to the word-marking feature information, display the unmatched words, and display the scoring result.

Displaying the unmatched words together with the scoring result lets the user know not only the final scoring result but also which words were skipped or misread. Vocal tract length normalization makes it possible both to assess the spoken language of different groups of people and to obtain accurate assessment results, which in particular solves the problem of assessing children's spoken language.

By applying vocal tract length normalization to the spoken-language training samples and training the model on them, the spoken language assessment method can assess the spoken language of different groups of people and obtain accurate assessment results, which in particular solves the problem of assessing children's spoken language. In addition, the method uses end-of-speech detection to make the whole assessment flow more smoothly and to lower the barrier to use. Furthermore, the method uses a Kleene closure to solve the problem of repeated read-along, and the speech separation method based on deep clustering effectively prevents overlapping or alternating speech in the voice data to be assessed.

One embodiment of this specification further provides a spoken language assessment device. Referring to Fig. 7, which shows a structural schematic diagram of a spoken language assessment device 700 according to an embodiment of the present application, the device comprises:

a voice data acquisition module 702, configured to acquire voice data to be assessed;

an attribute feature extraction module 704, configured to extract spoken-language attribute feature information of the voice data to be assessed;

a scoring result generation module 706, configured to obtain a scoring result from the spoken-language attribute feature information of the voice data to be assessed, based on a pre-generated spoken language assessment model.

After the voice data to be assessed is acquired, the attribute feature extraction module extracts its spoken-language attribute feature information, and the scoring result generation module uses the spoken language assessment model to obtain a scoring result from that information. The steps are simple, which greatly shortens the time the user waits for a result. Moreover, because the spoken language assessment model is trained in advance, the voice data to be assessed can be scored without storing a large number of example audio recordings, which broadens the range of use of the device.

One embodiment of this specification further provides a spoken language assessment device, comprising:

In an optional embodiment, the spoken language assessment device further comprises:

a speech recognition module, configured to perform speech recognition on the voice data to be assessed that has undergone vocal tract length normalization, to obtain the converted text of the voice data to be assessed.

By applying vocal tract length normalization to the spoken-language training samples and training the model on them, the spoken language assessment device can assess the spoken language of different groups of people and obtain accurate assessment results, which in particular solves the problem of assessing children's spoken language. In addition, the device uses end-of-speech detection to make the whole assessment flow more smoothly and to lower the barrier to use. Furthermore, a Kleene closure is used to solve the problem of repeated read-along, and the speech separation method based on deep clustering effectively prevents overlapping or alternating speech in the voice data to be assessed.

In an optional embodiment, the end-of-speech detection module comprises:

During end-of-speech detection, determining the end time point of the voice data to be assessed from the matched-text time point makes it possible to judge accurately whether the user has finished reading along. The end of speech can therefore be detected automatically without manual intervention, which makes the whole assessment flow more smoothly and lowers the barrier to using the device. This is especially suitable for scenarios in which a child uses the device independently without a parent present.
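The following sketch illustrates one possible way of combining the matched-text time point with a voice activity detection result. It assumes that the recognizer returns each word with an end time in milliseconds and that the voice activity detector returns one speech/non-speech flag per frame; the function names, the in-order word matching, and the 300 ms silence threshold are assumptions made for illustration only.

```python
from typing import List, Optional, Tuple


def matched_text_time(reference: List[str],
                      recognized: List[Tuple[str, int]]) -> Optional[int]:
    """Return the end time (ms) of the recognized word that completes the
    reference text, or None if the reference has not yet been fully read.

    Reference words are matched in order; other recognized words are ignored.
    """
    pos = 0
    for word, end_ms in recognized:
        if pos < len(reference) and word.lower() == reference[pos].lower():
            pos += 1
            if pos == len(reference):
                return end_ms
    return None


def end_time_point(match_ms: Optional[int], vad_frames: List[bool],
                   frame_ms: int = 10, min_silence_ms: int = 300) -> Optional[int]:
    """Return the time (ms) at which enough continuous silence has been seen
    after the matched-text time point, or None if that has not happened yet.

    vad_frames[i] is True when frame i contains speech.
    """
    if match_ms is None:
        return None
    needed = -(-min_silence_ms // frame_ms)  # ceiling division
    run = 0
    for i in range(match_ms // frame_ms, len(vad_frames)):
        run = 0 if vad_frames[i] else run + 1
        if run >= needed:
            return (i + 1) * frame_ms
    return None


# Hypothetical recognizer and VAD output with 100 ms frames.
match_ms = matched_text_time(["a", "b"], [("a", 200), ("um", 300), ("b", 500)])
vad = [True, True, True, True, True, False, True, False, False, False]
end_ms = end_time_point(match_ms, vad, frame_ms=100, min_silence_ms=300)
```

Requiring the reference text to be matched before silence is counted is what keeps a mid-sentence pause from being mistaken for the end of the reading.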

In an optional embodiment, the spoken-language attribute feature information comprises:

pronunciation attribute feature information, intonation attribute feature information, fluency attribute feature information, and word-marking feature information.

In an optional embodiment, the spoken language assessment device further comprises:

a marking module, which determines, according to the forced alignment information, whether the converted text matches the reference text and, if not, adds word-marking feature information to the unmatched words in the reference text.

Based on the word-marking feature information, the marking module can identify the words that the user skipped or misread, so that the spoken language assessment model can score according to the pronunciation, intonation, and fluency of the voice data to be assessed as well as the skipped and misread words. Compared with a spoken language assessment model that scores only according to the pronunciation, intonation, and fluency of the voice data to be assessed, the scoring result is more accurate.

In addition, based on the word-marking feature information, the marking module can mark the words corresponding to different error types differently, so that misread and skipped words can subsequently be displayed in different ways. The different ways include, but are not limited to, applying different font colors, font sizes, and/or typefaces to the skipped and misread words.

In an optional embodiment, the spoken language assessment device further comprises:

a display module, configured to display the unmatched words according to the word-marking feature information, and to display the scoring result.

The display module displays the unmatched words together with the scoring result, so that the user knows not only the scoring result but also which words were skipped or misread. Vocal tract length normalization makes it possible both to assess the spoken language of different groups of people and to obtain accurate assessment results, which in particular solves the problem of assessing children's spoken language.

In an optional embodiment, the voice data acquisition module comprises:

The first acquisition unit acquires the chunked and compressed voice data, so that the spoken language assessment device receives the voice data as compressed chunks.

The decompression unit decompresses the chunked and compressed voice data to be assessed, which facilitates subsequent processing of the voice data to be assessed.
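A minimal sketch of chunked compression on the sending side and decompression on the receiving side is given below. The specification does not name a codec, so the use of zlib, the independent compression of each chunk, and the function names are assumptions for illustration; a practical system might use an audio codec instead.

```python
import zlib
from typing import List


def compress_in_chunks(audio: bytes, chunk_size: int) -> List[bytes]:
    """Split the raw audio bytes into chunks and compress each chunk independently."""
    return [
        zlib.compress(audio[i:i + chunk_size])
        for i in range(0, len(audio), chunk_size)
    ]


def decompress_chunks(chunks: List[bytes]) -> bytes:
    """Decompress each received chunk and concatenate the results in order."""
    return b"".join(zlib.decompress(chunk) for chunk in chunks)


# Hypothetical audio payload, for illustration only.
audio = bytes(range(256)) * 10
chunks = compress_in_chunks(audio, 300)
restored = decompress_chunks(chunks)
```

Because each chunk is compressed on its own, the receiving side can begin decompressing as soon as the first chunk arrives rather than waiting for the whole recording.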

In an optional embodiment, the forced alignment module uses a decoding graph with a Kleene closure to filter out, according to the reference text, voice data that is repeated in the voice data to be assessed.

By using a decoding graph with a Kleene closure, the forced alignment module can filter out voice data that is repeated in the voice data to be assessed, which solves the problem of repeated read-along.
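The effect of the Kleene closure can be illustrated with a simplified, word-level stand-in for the decoding graph: a linear chain of states, one per reference word, in which every state carries a self-loop so that a word may be repeated. An actual decoding graph operates on speech frames rather than words, and applying the closure per word is an assumption; the function name is likewise hypothetical.

```python
from typing import List, Optional


def filter_repeats(reference: List[str], recognized: List[str]) -> Optional[List[int]]:
    """Walk a linear graph over the reference words in which each state has a
    self-loop, so that a repeated word is accepted and filtered out.

    Returns the indices of the recognized words that are kept (the first
    occurrence of each reference word), or None if the sequence is rejected.
    """
    kept: List[int] = []
    pos = -1  # index of the reference word consumed most recently
    for i, word in enumerate(recognized):
        if pos + 1 < len(reference) and word == reference[pos + 1]:
            pos += 1          # advance to the next reference word
            kept.append(i)
        elif pos >= 0 and word == reference[pos]:
            continue          # self-loop: a repetition, filtered out
        else:
            return None       # not a path through the graph
    return kept if pos == len(reference) - 1 else None


kept = filter_repeats(["i", "like", "apples"], ["i", "i", "like", "like", "apples"])
```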

In an optional embodiment, the voice data acquisition module comprises:

a second acquisition unit, configured to acquire initial voice data;

The second acquisition unit acquires the initial voice data, and the extraction unit extracts the target voice data from the initial voice data as the voice data to be assessed, so that the voice data to be assessed can be scored truthfully and reliably.

One embodiment of this specification further provides a device for generating a spoken language assessment model. Referring to Fig. 8, which shows a structural schematic diagram of a device for generating a spoken language assessment model according to an embodiment of the present application, the device comprises:

a training sample generation module 802, configured to generate a spoken-language training sample set from spoken-language training sample groups, wherein each spoken-language training sample group comprises spoken-language training samples, and each spoken-language training sample comprises spoken-language attribute feature information and the scoring result corresponding to that spoken-language attribute feature information;

a model generation module 804, configured to train a spoken language assessment model on the spoken-language training sample set to obtain the spoken language assessment model, wherein the spoken language assessment model associates the spoken-language attribute feature information of each spoken-language training sample with the corresponding scoring result.

In the training stage of the spoken language assessment model, the spoken-language attribute feature information of each spoken-language training sample is associated with the corresponding scoring result, so that the spoken language assessment model can accurately score the voice to be assessed according to its spoken-language attribute feature information. The spoken language assessment model can be implemented with various neural network models; in this embodiment, a TDNN network model can be used.
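To illustrate what associating feature information with scoring results means in training, the sketch below fits a model to (feature, score) pairs by gradient descent. For brevity a linear model is used in place of the TDNN of this embodiment, and the three-element feature vectors, the sample values, and the hyperparameters are invented purely for illustration.

```python
from typing import List, Tuple

Sample = Tuple[List[float], float]  # (attribute features, scoring result)


def predict(weights: List[float], bias: float, features: List[float]) -> float:
    """Score a feature vector with the (stand-in) linear model."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def train(samples: List[Sample], epochs: int = 2000, lr: float = 0.1):
    """Fit weights and bias by batch gradient descent on the squared error."""
    n_features = len(samples[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, score in samples:
            error = predict(weights, bias, features) - score
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - lr * g / len(samples) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(samples)
    return weights, bias


# Hypothetical training set: [pronunciation, intonation, fluency] -> score.
samples = [
    ([1.0, 0.0, 0.0], 50.0),
    ([0.0, 1.0, 0.0], 30.0),
    ([0.0, 0.0, 1.0], 20.0),
    ([1.0, 1.0, 1.0], 100.0),
]
weights, bias = train(samples)
score = predict(weights, bias, [1.0, 1.0, 1.0])
```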

In an optional embodiment, the device for generating a spoken language assessment model further comprises: a vocal tract length normalization module, configured to perform vocal tract length normalization on the voice data of the spoken-language training samples, so as to unify the vocal tract lengths of the spoken-language training samples in all spoken-language training sample groups.
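The specification does not state how the vocal tract length normalization is computed. One commonly used approach, shown here only as an assumption, is to warp the frequency axis with a speaker-specific factor using a piecewise-linear function before feature extraction. The function name, the warp factor `alpha`, the Nyquist frequency, and the cutoff ratio are hypothetical parameters.

```python
def warp_frequency(freq: float, alpha: float,
                   nyquist: float = 8000.0, cutoff_ratio: float = 0.85) -> float:
    """Piecewise-linear frequency warp.

    Below the knee the frequency is scaled by `alpha`; above it, a second
    linear segment is used so that the Nyquist frequency maps to itself.
    """
    knee = cutoff_ratio * nyquist * min(1.0, 1.0 / alpha)
    if freq <= knee:
        return alpha * freq
    slope = (nyquist - alpha * knee) / (nyquist - knee)
    return alpha * knee + slope * (freq - knee)


# Hypothetical warp factor for a speaker with a short vocal tract.
warped = [warp_frequency(f, 1.1) for f in (0.0, 1000.0, 4000.0, 8000.0)]
```

With a factor of 1.0 the function leaves every frequency unchanged, so a speaker whose vocal tract length already matches the reference is not affected.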

pronunciation attribute feature information;

intonation attribute feature information;

and fluency attribute feature information.

Extending the content of the spoken-language attribute feature information diversifies the dimensions along which the spoken language assessment model evaluates, so that the scoring result of the spoken language assessment model is more truthful and reliable.

One embodiment of this specification further provides a computing device, comprising a memory, a processor, and computer instructions that are stored in the memory and executable on the processor, wherein the processor, when executing the instructions, implements the steps of the spoken language assessment method or of the method for generating a spoken language assessment model.

One embodiment of this specification further provides a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the spoken language assessment method or of the method for generating a spoken language assessment model.

The above is an exemplary scheme of the computer-readable storage medium of this embodiment. It should be noted that the technical solution of the storage medium belongs to the same concept as the technical solution of the above spoken language assessment method or method for generating a spoken language assessment model; for details not described in the technical solution of the storage medium, refer to the description of the technical solution of the above spoken language assessment method.

The computer instructions include computer program code, which may be in source code form, object code form, an executable file, certain intermediate forms, and so on. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and so on. It should be noted that the content covered by the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in a jurisdiction; for example, in certain jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.

It should be noted that, for brevity, each of the foregoing method embodiments is described as a series of combined actions. However, those skilled in the art should understand that the present application is not limited by the described order of actions, because according to the present application certain steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present application.

In the above embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, refer to the related descriptions of other embodiments.

The preferred embodiments of the present application disclosed above are intended only to help explain the present application. The optional embodiments do not describe all details exhaustively, nor do they limit the invention to the specific embodiments described. Obviously, many modifications and variations can be made in light of the content of this specification. This specification selects and specifically describes these embodiments in order to better explain the principles and practical applications of the present application, so that those skilled in the art can well understand and use the present application. The present application is limited only by the claims and their full scope and equivalents.

Claims

1. A spoken language assessment method, applied to the data processing end of a spoken language assessment system, characterized by comprising:

acquiring voice data to be assessed;

obtaining a scoring result from the spoken-language attribute feature information of the voice data to be assessed, based on a pre-generated spoken language assessment model.

2. The spoken language assessment method according to claim 1, characterized in that, after acquiring the voice data to be assessed and before extracting the spoken-language attribute feature information of the voice data to be assessed, the method comprises:

performing speech recognition on the voice data to be assessed that has undergone vocal tract length normalization, to obtain a speech frame sequence of the voice data to be assessed and converted text corresponding to the speech frame sequence;

3. The spoken language assessment method according to claim 2, characterized in that performing end-of-speech detection on the voice data to be assessed comprises:

recognizing the voice data to be assessed, and converting the recognized voice data to be assessed into speech text for end detection that contains time information;

determining the end time point of the voice data to be assessed according to a preset voice activity detection method and the matched-text time point.

4. The spoken language assessment method according to claim 3, characterized in that

the spoken-language attribute feature information comprises:

pronunciation attribute feature information;

intonation attribute feature information;

and fluency attribute feature information.

5. The spoken language assessment method according to claim 4, characterized in that the spoken-language attribute feature information further comprises word-marking feature information;

after the forced alignment information is generated, the method further comprises:

determining, according to the forced alignment information, whether the converted text matches the reference text, and if not, adding word-marking feature information to the unmatched words in the reference text.

6. The spoken language assessment method according to claim 1, characterized in that acquiring the voice data to be assessed comprises:

acquiring chunked and compressed voice data;

decompressing the chunked and compressed voice data to obtain the voice data to be assessed.

7. The spoken language assessment method according to claim 2, characterized in that, during the forced alignment processing, a decoding graph with a Kleene closure is used to filter out voice data that is repeated in the voice data to be assessed.

8. The spoken language assessment method according to claim 1, characterized in that acquiring the voice data to be assessed comprises:

acquiring initial voice data;

extracting target voice data from the initial voice data as the voice data to be assessed, using a speech separation method based on deep clustering.

9. A method for generating a spoken language assessment model, characterized by comprising:

generating a spoken-language training sample set from spoken-language training sample groups, wherein each spoken-language training sample group comprises spoken-language training samples, and each spoken-language training sample comprises spoken-language attribute feature information and the scoring result corresponding to that spoken-language attribute feature information;

training a spoken language assessment model on the spoken-language training sample set, wherein the spoken language assessment model associates the spoken-language attribute feature information of each spoken-language training sample with the corresponding scoring result.

10. The spoken language assessment method according to claim 9, characterized in that, before training the spoken language assessment model on the spoken-language training sample set, the method comprises:

performing vocal tract length normalization on the voice data of the spoken-language training samples, so as to unify the vocal tract lengths of the spoken-language training samples in all spoken-language training sample groups.

11. A spoken language assessment device, characterized by comprising:

a scoring result generation module, configured to obtain a scoring result from the spoken-language attribute feature information of the voice data to be assessed, based on a pre-generated spoken language assessment model.

12. The device according to claim 11, characterized by further comprising:

a vocal tract length normalization module, configured to perform vocal tract length normalization on the voice data to be assessed after end-of-speech detection;

a speech recognition module, configured to perform speech recognition on the voice data to be assessed that has undergone vocal tract length normalization, to obtain the converted text of the voice data to be assessed;

a forced alignment module, configured to perform forced alignment processing on the converted text and the reference text and to generate forced alignment information, wherein the forced alignment information comprises the correspondence between the speech frame sequence of the converted text and the speech frame sequence of the reference text.

13. The device according to claim 12, characterized in that the end-of-speech detection module comprises:

a recognition unit, configured to recognize the voice data to be assessed and to convert the recognized voice data to be assessed into speech text for end detection that contains time information;

a matching unit, configured to match the reference text against the speech text for end detection to obtain a matched-text time point;

a determination unit, configured to determine the end time point of the voice data to be assessed according to a preset voice activity detection method and the matched-text time point.

14. The device according to claim 12, characterized by further comprising:

a marking module, configured to determine, according to the forced alignment information, whether the converted text matches the reference text, and if not, to add word-marking feature information to the unmatched words in the reference text.

15. The device according to claim 12, characterized in that the voice data acquisition module comprises:

a decompression unit, configured to decompress the chunked and compressed voice data to obtain the voice data to be assessed.

16. The device according to claim 12, characterized in that the forced alignment module uses a decoding graph with a Kleene closure to filter out, according to the reference text, voice data that is repeated in the voice data to be assessed.

17. The device according to claim 11, characterized in that the voice data acquisition module comprises:

a second acquisition unit, configured to acquire initial voice data;

an extraction unit, configured to extract target voice data from the initial voice data as the voice data to be assessed, using a speech separation method based on deep clustering.

18. A device for generating a spoken language assessment model, characterized by comprising:

a training sample generation module, configured to generate a spoken-language training sample set from spoken-language training sample groups, wherein each spoken-language training sample group comprises spoken-language training samples, and each spoken-language training sample comprises spoken-language attribute feature information and the scoring result corresponding to that spoken-language attribute feature information;

a model generation module, configured to train a spoken language assessment model on the spoken-language training sample set to obtain the spoken language assessment model, wherein the spoken language assessment model associates the spoken-language attribute feature information of each spoken-language training sample with the corresponding scoring result.

19. The device according to claim 18, characterized by further comprising:

a vocal tract length normalization module, configured to perform vocal tract length normalization on the voice data of the spoken-language training samples, so as to unify the vocal tract lengths of the spoken-language training samples in all spoken-language training sample groups.

20. A computing device, comprising a memory, a processor, and computer instructions that are stored in the memory and executable on the processor, characterized in that the processor, when executing the instructions, implements the steps of the method according to any one of claims 1-10.

21. A computer-readable storage medium storing computer instructions, characterized in that the instructions, when executed by a processor, implement the steps of the method according to any one of claims 1-10.