GB2399932A - Speech recognition - Google Patents

Info

Publication number
GB2399932A
Authority
GB
United Kingdom
Prior art keywords
speech
utterance
speaker
word
measure
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB0406932A
Other versions
GB0406932D0 (en
Inventor
Pamela Mary Enderby
Philip Duncan Green
Mark S Hawley
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Sheffield
Barnsley District General Hospital NHS Trust
Original Assignee
University of Sheffield
Barnsley District General Hospital NHS Trust
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Sheffield, Barnsley District General Hospital NHS Trust
Publication of GB0406932D0
Publication of GB2399932A
Legal status: Withdrawn

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/14Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142Hidden Markov Models [HMMs]
    • G10L15/144Training of HMMs
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/18Speech classification or search using natural language modelling
    • G10L15/1815Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/18Speech classification or search using natural language modelling
    • G10L15/183Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/187Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

Assistive technology supports the control of an environment by a dysarthric speaker. The accuracy of the control exerted by the dysarthric speaker is improved by the use of consistency and confusability measures within the speech-processing engine. These measures increase the accuracy of the recognition of utterances of the speaker and the accuracy of articulation of utterances of the dysarthric speaker.

Description

DATA PROCESSING SYSTEM AND METHOD
Field of the Invention
The present invention relates to assistive technology and, more particularly, to technology to assist dysarthric speakers with communication and to assist in the control of their environment, and to a speech processor.
Background to the Invention
Dysarthria is a neurogenic motor speech disorder that impairs motor function and interferes with the process of speech production. This results in, at best, imprecise articulation of words or parts of words and, at worst, speech that is unintelligible to all but the most skilled listeners.
In severe cases, dysarthric speakers might be highly dependent upon the presence and skill of a care-worker to act, effectively, as a translator in a communication with a third party.
Furthermore, the dysarthric individual might also rely upon the care-worker to perform basic tasks such as, for example, switching the television or lights on and off on their behalf.
Speech produced by dysarthric speakers can be very difficult for listeners unfamiliar with the speaker to understand. Since motor-neurone disease or trauma often affects the cognitive and physical processes responsible for speech production, dysarthric symptoms often accompany neurological conditions such as cerebral palsy, head injury and multiple sclerosis.
Many people with dysarthria are often physically incapacitated to the extent that spoken commands become an attractive alternative to normal controls for equipment. However, it is acknowledged that the robustness of automatic speech recognition of the speech of dysarthric speakers is variable for mild to moderate dysarthria, and that robust recognition is extremely difficult for severely dysarthric speech. For severely dysarthric speech, recognisers trained on a normal speech corpus cannot be expected to work well. Conventional automatic speech recognition systems are insufficient to deal with the abnormalities and word-level variances of severely dysarthric speech, since the vocal articulations or vocalisations to be recognised are greatly variable, that is, less consistent, as compared to non-dysarthric speech.
The inability of commercially available automatic speech recognition systems to deal with severely dysarthric speech often results in frustration of the dysarthric speaker since, in the event of the system failing to recognise an utterance, the dysarthric speaker may be invited to repeat the utterance. Repeated invitations to articulate a particular word may result in the dysarthric speaker becoming both fatigued and frustrated. Conventionally, automatic speech recognition systems improve their accuracy of recognition as the underlying model is refined. However, this refinement may require a relatively large body of training material and significant time and effort on the part of the person whose speech is to be recognised. It will be appreciated that the need to articulate an utterance too many times such as, for example, ... or more times, might lead to a dysarthric speaker becoming, again, both fatigued and frustrated.
It is an object of embodiments of the present invention at least to mitigate some of the problems of the prior art.
Summary of Invention
Accordingly, a first aspect of embodiments of the present invention provides an assistive technology system comprising a speech processor operable to process an input utterance to identify that utterance, and means to output a control signal corresponding to the identified utterance for influencing the operation of respective equipment; the system being characterized by the speech processor comprising means to calculate a confusability measure that reflects a degree of correlation between the input utterance and at least a further utterance, each of the input utterance and the further utterance corresponding to respective words of a vocabulary of words; means to replace at least one of the respective words with at least a further word having a corresponding utterance having a different degree of correlation with at least one of the input utterance and the further utterance; and means to associate the further word with a respective control signal.
Advantageously, dysarthric speakers can communicate more effectively and control their environment more effectively than previously.
Embodiments provide an assistive technology system in which the means to calculate the confusability measure comprises means to subject the input utterance, having a corresponding speech model, to a speech model for the further utterance to determine the response of the speech model for the further utterance, and means to provide the confusability measure according to that response.
Preferred embodiments provide an assistive technology system in which the means to calculate the confusability measure between the words, $W_i$ and $W_j$, of the input utterance and the further utterance respectively comprises means to calculate
$C_{ij} = \left(\sum_k L_{ijk}\right) / n_j$
where $L_{ijk}$ is a per-frame likelihood of each speech model generating each example of each word on a Viterbi path, j and k represent the kth repetition of the jth word in a training set comprising N words $W_1$ to $W_N$, and $n_j$ is the number of examples of $W_j$ in the training set.
A dysarthric speaker may often attempt to improve their speech by practice. Suitably, embodiments provide an assistive technology system further comprising means to calculate a consistency measure for the input utterance and means to output a visual indication of the consistency measure.
Preferred embodiments provide an assistive technology system in which the means to calculate the consistency measure comprises means to calculate
$\delta_i = \left(\sum_k L_{iik}\right) / n_i$
where $L_{iik}$ is a per-frame likelihood of each speech model generating each example of each word on a Viterbi path, i and k represent the kth repetition of the ith word in a training set comprising N words $W_1$ to $W_N$, and $n_i$ is the number of examples of $W_i$ in the training set.
Embodiments provide an assistive technology system in which the means to calculate the consistency measure comprises means to calculate
$\Delta = \left(\sum_i \delta_i\right) / N$
Embodiments can be realised in which the means to output a visual indication of the consistency measure comprises means to present a bar chart comparison of the consistency measure with an average consistency measure for that word for a given speaker.
Preferred embodiments provide a system in which the assistive technology system is a dysarthric speech assistive technology system.
A second aspect of embodiments of the present invention provides a method of training or treating a speech impaired speaker comprising the steps of processing an input utterance of the speech impaired speaker using a corresponding speech model; and providing a visual indication of the degree of correlation between the input utterance and a predetermined utterance of the speech impaired speaker for the corresponding speech model.
Preferred embodiments provide a method of training or treating a speech impaired speaker in which determining the degree of correlation between the input utterance and the predetermined utterance of the speech impaired speaker for the corresponding speech model comprises the step of calculating
$\delta_i = \left(\sum_k L_{iik}\right) / n_i$
where $L_{iik}$ is a per-frame likelihood of each speech model generating each example of each word on a Viterbi path, i and k represent the kth repetition of the ith word in a training set comprising N words $W_1$ to $W_N$, and $n_i$ is the number of examples of $W_i$ in the training set.
Embodiments provide a method of training or treating a speech impaired speaker further comprising the step of establishing the predetermined utterance of the speech impaired speaker for the speech model.
Embodiments provide a method of training or treating a speech impaired speaker further comprising the step of processing a plurality of utterances corresponding to the same word and calculating a measure of the average of the plurality of utterances.
Preferred embodiments provide a method of training or treating a speech impaired speaker further comprising the steps of establishing a plurality of utterances corresponding to respective words of a plurality of words; calculating a measure of confusability between utterances corresponding to at least a selected pair of words; and selecting an alternative word to replace one of the selected pair of words in the vocabulary, the alternative word having an improved measure of confusability between a respective utterance for the alternative word and the utterance corresponding to the remaining word of the selected pair of words.
Embodiments provide a method of training or treating a speech impaired speaker in which the speech-impaired speaker is a dysarthric speaker.
A third aspect of embodiments of the present invention provides a speech processor operable to process an input utterance to identify that utterance, the speech processor comprising means to calculate a confusability measure that reflects a degree of correlation between the input utterance and at least a further utterance; each of the input utterance and the further utterance corresponding to respective words of a vocabulary of words.
Embodiments provide a speech processor further comprising means to replace at least one of the respective words with at least a further word having a corresponding utterance having a different degree of correlation with the at least one of the input utterance and the further utterance.
Preferably, embodiments provide a speech processor further comprising means to associate the further word with a respective control signal.
Embodiments provide a speech processor in which the means to calculate the confusability measure comprises means to subject the utterance, having a corresponding speech model, to a speech model for the further utterance to determine the response of the speech model for the further utterance, and means to provide the confusability measure according to that response.
Preferred embodiments provide a speech processor in which the means to calculate the confusability measure between the words, $W_i$ and $W_j$, of the input utterance and the further utterance respectively comprises means to calculate
$C_{ij} = \left(\sum_k L_{ijk}\right) / n_j$
where $L_{ijk}$ is a per-frame likelihood of each speech model generating each example of each word on a Viterbi path, j and k represent the kth repetition of the jth word in a training set comprising N words $W_1$ to $W_N$, and $n_j$ is the number of examples of $W_j$ in the training set.
Preferably, embodiments of the speech processor further comprise means to calculate a consistency measure for the input utterance and means to output a visual indication of the consistency measure.
Embodiments preferably provide a speech processor in which the means to calculate the consistency measure comprises means to calculate
$\delta_i = \left(\sum_k L_{iik}\right) / n_i$
where $L_{iik}$ is a per-frame likelihood of each speech model generating each example of each word on a Viterbi path, i and k represent the kth repetition of the ith word in a training set comprising N words $W_1$ to $W_N$, and $n_i$ is the number of examples of $W_i$ in the training set.
Embodiments provide a speech processor in which the means to calculate the consistency measure comprises means to calculate
$\Delta = \left(\sum_i \delta_i\right) / N$
Embodiments provide a speech processor in which the means to output a visual indication of the consistency measure comprises means to present a bar chart comparison of the consistency measure with an average consistency measure for that word for a given speaker.
Preferred embodiments provide a speech processor which is an impaired-speech speech processor and is more preferably a dysarthric speech processor.
Brief Description of the Drawings
Embodiments of the present invention will now be described by way of example only with reference to the accompanying drawings in which:
figure 1 illustrates an assistive technology system according to an embodiment;
figure 2 illustrates a confusability matrix according to an embodiment;
figure 3 shows a flowchart of processing performed by the first embodiment; and
figure 4 shows a further flowchart of further processing performed by the embodiment.
Detailed Description of the Preferred Embodiments
Referring to Figure 1 there is shown an assistive technology system 100 for assisting a dysarthric speaker (not shown). The system 100 comprises a computer system 102 having a speech-processing engine or recogniser 104 that uses a number of speech models 106 to recognise speech detected by a microphone 108. It will be appreciated that the speech processing engine represents an embodiment of at least part of a speech processor. The computer system 102 is also provided with an input device 110 that is adapted to the needs of the dysarthric speaker. The input device may be, for example, a relatively easy to activate switch. The computer system 102 also comprises device control software 112 which, in response to outputs of at least one of the speech processing engine 104 and the switch 110, produces control signals that are used to control respective items of equipment 114 to 118 such as, for example, a television, a radio or satellite receiver. Although the embodiment illustrated is shown as having a hardware interface 120, which may be any type of hardware interface, preferred embodiments are realised in which the computer system 102 communicates with the equipment 114 to 118 wirelessly using, for example, infrared communication, Bluetooth, IEEE 802.11b or the like according to the capabilities of the equipment and the interface 120.
The computer system 102 is provided with access to non-volatile storage 122 in the form of, for example, an HDD. The non-volatile storage 122 is used to store speech models 124 for respective words that form a vocabulary that the speech processing engine 104 is expected to recognise. It can be seen that a number of individual speech models 126 and 128 are illustrated. Also illustrated are the training sets or training corpuses 124' for each of the speech models. Again, it can be appreciated that two training corpuses 126' and 128' are illustrated that correspond to respective speech models 126 and 128.
In general terms, the computer system 102 provides a voice interface via which the dysarthric speaker can control the various items of equipment 114 to 118. The dysarthric speaker, using either the microphone alone or the microphone 108 in conjunction with a switch 110, utters a word such as, for example, "TV" to control the operation of the TV. This aspect of embodiments of the present invention will be described in greater detail with reference to figure 4.
The speech models may be constructed using the well-known HTK toolkit, available from Cambridge University Engineering Department under licence from Microsoft Corporation, that produces Continuous Density Hidden Markov Models. The models have the following characteristics: they are whole-word based rather than phone-level based; they typically have 11 HMM states, with a mixture of 3 Gaussian distributions per state; they are "straight through" models that allow only self-transitions and transitions to the next state; the acoustic vectors comprise Mel Frequency Cepstral Coefficients, typically with differences but without overall energy (dysarthric speakers often have difficulty maintaining a steady volume); training data is labelled at the word level using "silence | word | silence"; and the sampling rate for audio data is 16 kHz, with a 10 ms frame rate.
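The model characteristics listed above can be summarised in a short Python sketch. This is only an illustration under stated assumptions: the names `NUM_STATES`, `allowed_transitions` and `samples_per_frame` are hypothetical and are not part of HTK, and the sketch encodes only the "straight through" topology and the frame arithmetic, not any trained parameters.

```python
# Illustrative constants taken from the description (typical values).
NUM_STATES = 11            # HMM states per whole-word model
MIXTURES_PER_STATE = 3     # Gaussian components per state
SAMPLE_RATE_HZ = 16_000    # audio sampling rate
FRAME_PERIOD_MS = 10       # one acoustic vector every 10 ms


def allowed_transitions(num_states):
    """Boolean mask for a 'straight through' model.

    Entry [s][t] is True only when t == s (self-transition) or
    t == s + 1 (transition to the next state).
    """
    return [
        [t == s or t == s + 1 for t in range(num_states)]
        for s in range(num_states)
    ]


def samples_per_frame(sample_rate_hz, frame_period_ms):
    """Number of audio samples between successive acoustic vectors."""
    return sample_rate_hz * frame_period_ms // 1000


mask = allowed_transitions(NUM_STATES)
step = samples_per_frame(SAMPLE_RATE_HZ, FRAME_PERIOD_MS)
```

With the typical values quoted, successive acoustic vectors are 160 samples apart, and every state except the last has exactly two permitted transitions.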
Preferably, the speech processing engine or recogniser 104 is configured to be able to modify a dysarthric speaker's vocabulary. It is usual for a dysarthric speaker to produce some words more consistently than others. For example, "TV" might be an easier proposition than "television". While clinical assessment might help in identifying such words that may be articulated more consistently, the speech processing engine is arranged to provide a quantitative measure of word-level consistency. Furthermore, preferred embodiments provide a measure of the overall consistency of the speech in any given training corpus across all or selected words of that corpus. Such a measure of overall consistency might be used to assess the severity of the dysarthria and to record a dysarthric speaker's progress as any therapy proceeds. Still further embodiments can be realised that track utterance-level consistency to provide an indication of the correlation between the probability scores returned by a dysarthric speaker's individual pronunciations of a given word and their norm for that word. Utterance-level consistency might be used by a clinician to identify outlier utterances and, if warranted, to remove such utterances from the training corpus. The utterance-level consistency might also be used to identify cases where a dysarthric speaker shows two production or articulation styles of the same word, in which case two different speech models for that word might be provided.
Having some means of predicting confusion errors might allow more robust recognition to be realised. Therefore, the speech processing engine 104, in preferred embodiments, provides a measure of confusability. The confusability measure is arranged to allow words or groups of words that might be confused with one another to be modified or removed from the speech models 124. Preferred embodiments provide a measure of confusability using forced alignment based upon automatic speech recognition derived probability scores rather than the more conventional phonetically-based approach of the prior art; that is, embodiments of the present invention are statistically based rather than phonetically based. "Alignment" in this context means that utterances are processed to identify word boundaries and the speech unit or units defined by such word boundaries are subjected to the speech models.
These consistency and confusability measures should be based on the training set, that is, the training corpus, and the trained models. The training set or corpus is stored together with the trained models. Further embodiments use rules to implement forced-alignment of training set utterances against the models under the following assumptions: a training set for a vocabulary has N words, $W_1 \ldots W_N$; a CDHMM, $M_i$, is provided for each word $W_i$; and $w_{jk}$ is the kth repetition of the jth word of the training set. A per-frame log likelihood $L_{ijk}$ is calculated for each model generating each example of each word on a Viterbi path. The consistency, $\delta_i$, of a word, $W_i$, is obtained by:
$\delta_i = \left(\sum_k L_{iik}\right) / n_i$ (1)
where $n_i$ is the number of examples of a given word, $W_i$, in a training set. An average score for a word is obtained by aligning all examples of that word against the model for that word. Conventionally, the more variation that there is in training data for each speech unit, the larger the variances will be in that speech unit's HMM state distributions.
The forced alignment likelihoods will be lower for an inconsistently spoken word than for a consistently spoken word since its distributions will be flatter.
The overall consistency of the training corpus, $\Delta$, is the average of all consistencies for all words within that corpus. Therefore, the overall consistency is given by:
$\Delta = \left(\sum_i \delta_i\right) / N$ (2)
As indicated above, the measure of overall consistency of a training corpus might be used to assess the severity of dysarthria and/or to record a dysarthric speaker's progress as any therapy proceeds.
The confusability between two given words, $W_i$ and $W_j$, is defined by:
$C_{ij} = \left(\sum_k L_{ijk}\right) / n_j$ (3)
which is a measure of the average score obtained by aligning examples of a given word, $W_j$, against the CDHMM, $M_i$, for a different word, $W_i$. A higher value of $C_{ij}$ implies a greater likelihood that $W_j$ will be misrecognised as $W_i$.
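As a minimal sketch of how the word-level consistency, the overall consistency and the confusability measures might be computed once the forced-alignment scores are available, the following Python fragment assumes a nested list `L` in which `L[i][j][k]` holds the per-frame log likelihood of model i generating the kth example of word j. The function names and the numerical values are illustrative assumptions only; they are not taken from the description.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of scores."""
    return sum(values) / len(values)


def consistency(L, i):
    """Word-level consistency: examples of word i scored by its own model."""
    return mean(L[i][i])


def overall_consistency(L):
    """Corpus-level consistency: average of the word-level consistencies."""
    n_words = len(L)
    return sum(consistency(L, i) for i in range(n_words)) / n_words


def confusability(L, i, j):
    """Average score of the examples of word j aligned against model i."""
    return mean(L[i][j])


# Made-up scores for a two-word vocabulary with two examples per word.
# L[i][j] lists the scores of model i on each example of word j.
L = [
    [[-60.0, -62.0], [-61.0, -63.0]],   # model 0 on word 0, then on word 1
    [[-80.0, -90.0], [-55.0, -57.0]],   # model 1 on word 0, then on word 1
]

delta_0 = consistency(L, 0)        # -61.0
corpus = overall_consistency(L)    # -58.5
c_01 = confusability(L, 0, 1)      # -62.0
c_10 = confusability(L, 1, 0)      # -85.0
```

Evaluating `confusability` for every pair of words yields the entries of a confusability matrix of the kind shown in figure 2.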
Training aid software 130 can be used by a clinician in analysing the confusability results. Figure 2 shows an example 200 of the output produced by the speech processing engine 104 for a severely dysarthric speaker. The output 200 is provided with a confusability variance calibration grey-scale 202, which provides an indication of the degree of confusability of any two words forming part of a confusability matrix 204. The confusability variance calibration grey-scale 202 is arranged such that the darker the shade, the greater the confusability. From the example of the confusability matrix 204, it can be appreciated that there is a higher risk of confusion between the word TV and the set of words {alarm, lamp, channel, down, radio, volume} as compared to a risk of confusion between the word TV and the set of words {on, off, up}. The confusability matrix 204 can be used by a clinician to tailor the vocabulary recognised by the speech processing engine 104 to reduce the probability or risk of confusion between any two words. This should, in turn, improve the response of the speech processing engine 104 to the dysarthric speaker's utterances. It will be appreciated that in some cases, the risk of confusion can exist between several words, in which cases a number of alternative ...
The words in the first column or first row of the confusability matrix represent the whole or part of a dysarthric speaker's vocabulary. A clinician will construct such a confusability matrix using either a test or training set of words or an intended vocabulary of words for that speaker only. Such an initial vocabulary of words might be refined to remove words and to introduce alternative words if it is noted that there is significant confusion between selected words.
In a preferred embodiment, the training aid software 130 has a further mode of operation, which allows a dysarthric speaker, using the microphone 108 and switch 110, to practise their articulation of selected words.
The dysarthric speaker can select a word to be practised from the whole, or part, of their intended vocabulary. Alternatively, a clinician might make that selection on behalf of the dysarthric speaker as part of an interactive therapy session. Using the switch 110, the dysarthric speaker can arrange for the speech processing engine 104 to record and process their utterances. The utterance is compared to the speech model that corresponds to the norm of that word for that dysarthric speaker. The speech processing engine 104 returns a probability score to the training aid software 130 that reflects the closeness of match of the utterance with the speech model representing the norm or average of the utterance for that dysarthric speaker. This measure of consistency is preferably presented to the dysarthric speaker visually as a bar chart comprising two bars. The first bar represents the probability score of the utterance in the training corpus with a score closest to the norm, and the second bar represents the probability score of the utterance; that is, it represents a means by which the dysarthric speaker can compare their most recent utterance with the norm for that utterance. Preferably, the dysarthric speaker can use the switch 110 to play the utterance corresponding to their norm for that utterance in advance of practising that utterance. The training aid software 130 preferably records all utterances of such practice sessions to allow the recorded data to be analysed by a clinician or speech therapist.
In preferred embodiments, utterances that deviate by more than a predetermined value such as, for example, 20%, from the norm of that utterance for that dysarthric speaker are highlighted or drawn to the attention of the clinician or speech therapist. Identifying such anomalous utterances allows them to be removed from any data that might be used to influence the performance of the corresponding CDHMM.
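The highlighting of anomalous utterances might be sketched in Python as follows. The description does not state how the deviation from the norm is measured, so this sketch assumes a relative difference between an utterance's probability score and the norm score; the function name `flag_outliers` and the scores are hypothetical.

```python
def flag_outliers(scores, norm, tolerance=0.20):
    """Return the indices of utterances whose score deviates from the norm
    by more than `tolerance`, expressed as a fraction of the norm's magnitude.

    Assumption: deviation is the absolute difference between the utterance
    score and the norm score, relative to the size of the norm score.
    """
    limit = tolerance * abs(norm)
    return [k for k, score in enumerate(scores) if abs(score - norm) > limit]


# Made-up per-utterance scores from a practice session and a made-up norm.
session_scores = [-50.0, -55.0, -70.0, -48.0]
norm_score = -50.0

flagged = flag_outliers(session_scores, norm_score)
```

Here only the third utterance (index 2) differs from the norm by more than 20%, so it alone would be drawn to the clinician's attention.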
Providing a measure of the closeness of the fit of the recent utterance with the dysarthric speaker's norm for that utterance allows the dysarthric speaker to practise that utterance so that they might be able to articulate the utterance in a manner more consistent with their norm. It should be noted that more consistent articulation of any given utterance does not necessarily imply that the utterance will be intelligible to a person unfamiliar with the dysarthric speaker.
It will be appreciated that as the accuracy of the speech processing engine or recogniser 104 increases, the degree of control exerted by a dysarthric speaker over the equipment 114 to 118 also increases. This increased degree of control will lead to an improved quality of life for a dysarthric speaker.
Referring to figure 3, there is shown a flowchart 300 that illustrates the basic steps undertaken by the training aid software 130 in conjunction with the recogniser 104 in allowing a dysarthric speaker to practise utterances of a selected or target word. At step 302, the target word is selected from a number of displayed words. An articulated utterance corresponding to the selected word is recorded by the recogniser 104 at step 304. The recogniser 104 compares the utterance with an appropriate speech model corresponding to the selected word at step 306. An analysis of the closeness of fit of the most recently recorded and processed utterance with the norm of the utterance for that speaker is performed at step 308 and, at step 310, feedback is provided to the dysarthric speaker on the closeness of fit of their most recent utterance with their norm for that utterance.
In an alternative embodiment, or additionally, rather than a dysarthric speaker or clinician selecting a target word at step 302, the system can be arranged so that the dysarthric speaker may merely actuate the switch 110 to provide an indication to the speech processing engine that the next utterance is intended to be a control or communication command. In such embodiments, the recognition step 306 compares the data produced by the signal processing step 304 with all of the speech models 124 for that dysarthric speaker to identify the best match. Once a match has been identified, the device control software 112 determines whether or not a control signal should be output in response to that match. It will be appreciated by one skilled in the art that all possible command phrases in the recogniser's vocabulary are associated with either speech output or a control signal by the clinician when the system is trained and configured. For example, appropriate codes are stored together with command speech models, which are used to produce corresponding infrared signals, or other types of signal appropriate to the device to be controlled, via the hardware interface. These codes are tailored to the particular types of hardware and hardware interface present. Therefore, it will be appreciated that if the hardware interface or an item of hardware was changed, the information stored would also be changed accordingly. If appropriate, the device control software 112 produces a control signal via the hardware interface 120 that is suitable for controlling a corresponding item of equipment 114 to 118.
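The association of each command phrase with either a control signal or speech output can be pictured as a simple lookup table. The Python sketch below is illustrative only: the phrases, the code strings such as `IR_TV_POWER_ON` and the function `dispatch` are hypothetical placeholders, since the description does not give the stored codes or their format.

```python
# Hypothetical table set up when the system is trained and configured.
# Each recognised phrase maps to ("control", code) or ("speech", text).
ACTIONS = {
    "TV on": ("control", "IR_TV_POWER_ON"),
    "lamp off": ("control", "IR_LAMP_OFF"),
    "hello": ("speech", "Hello"),
}


def dispatch(recognised_phrase):
    """Look up the action for a recognised phrase.

    Returns a (kind, payload) pair, or None when no action is configured,
    in which case no control signal or speech would be produced.
    """
    return ACTIONS.get(recognised_phrase)


tv_action = dispatch("TV on")
unknown_action = dispatch("radio on")
```

Keeping the codes in a table separate from the recognition logic reflects the point made above that the stored information changes when the hardware or hardware interface changes.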
In alternative embodiments, or additionally, rather than outputting a control signal, a speech synthesis engine 132 can be arranged to output intelligible speech. In such an embodiment, it will be appreciated that the computer system 102 is acting, effectively, as a translation aid that translates between dysarthric speech and conventional speech that is intelligible to those unfamiliar with the dysarthric speaker. For example, this would allow a dysarthric speaker to greet a friend using the word "hello". It would also support a greater verbal interaction between a dysarthric speaker and a person unfamiliar with the dysarthric speaker. This is illustrated in figure 4.
Figure 4 shows a flowchart 400 of the processing performed by the computer system 102 to assist the dysarthric speaker in communicating with people or interacting with their environment. The computer system 102 is arranged to operate in a command/control and communication mode. In this mode, the dysarthric speaker provides an indication to the speech processing engine that it should enter a record mode of operation to record an utterance of the dysarthric speaker. Such an indication can be provided using a mouse or keyboard of the computer system, a specially adapted input device according to the physical capabilities of the dysarthric speaker, or via the automatic detection of the speaker's voice close to the microphone without using an additional input device. The latter might be achieved using, for example, a volume threshold such that the system infers that an utterance intended for processing by the system has been made if the volume of that utterance is above that volume threshold, with all other detected utterances below the threshold being ignored.
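A volume threshold of this kind could be sketched in Python as below. The description does not say how volume is measured, so the sketch assumes the root-mean-square amplitude of each frame of samples; the function names, the sample values and the threshold of 0.1 are all illustrative assumptions.

```python
import math


def rms(frame):
    """Root-mean-square amplitude of one frame of audio samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def frames_above_threshold(frames, threshold):
    """Mark each frame as loud enough (True) or to be ignored (False)."""
    return [rms(frame) > threshold for frame in frames]


# Made-up frames: background noise, speech close to the microphone, noise.
frames = [
    [0.01, -0.01],
    [0.50, -0.50],
    [0.02, 0.00],
]

loud = frames_above_threshold(frames, threshold=0.1)
```

Only the second frame exceeds the threshold in this example, so only that portion of the signal would be treated as an utterance intended for the system.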
Therefore, at step 402, the speech processing engine receives a signal that is indicative of the input device having been actuated. Any speech uttered following actuation of the input device is recorded by the speech processing engine 104 and converted into a suitable form for processing by the CDHMM recogniser at step 404. The speech processing engine 104 performs speech recognition at step 406 to determine the best or most appropriate correlation between the utterance and one of the speech models 124.
Processing is performed, in light of the recognition at step 406, to determine whether the dysarthric speaker has issued a command that needs to be parsed. This processing is performed at step 408. For example, the dysarthric speaker may have uttered "TV on", which is a command to switch on the television. The speech processing engine 104 is arranged, when identifying a first word of an utterance, to check the utterance, or any part of the utterance, against all speech models 124. Having matched at least part of the utterance with one of the speech models 124, the speech processing engine 104 is sufficiently sophisticated to compare the next, or the last, part of the utterance with a limited set of speech models selected from the whole set of speech models 124. This will reduce the processing burden imposed upon the speech processing engine 104. For example, if the utterance is "TV on" and the speech processing engine has identified the first part of the utterance as "TV", the speech processing engine might then expect the second part of the utterance to be a command such as "on", "off", "volume", "up", "channel", "up" or "down".
Hence, the second part of the utterance would be processed by speech models corresponding to those words, that is, a limited set of the speech models is used in identifying subsequent parts of an overall utterance.
A further example of comparing a later part of an utterance with a very limited set of the speech models, having identified an earlier part of an utterance, would be the commands that control a lamp. The lamp has two states, namely on and off, and a speech processing engine, having identified part of the utterance as "lamp", would then only expect a subsequent part of the utterance to be "on" or "off" and would, therefore, only need to use two further speech models in fully processing the utterances "lamp on" and "lamp off".
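The narrowing described above can be sketched as a lookup from an already recognised first word to the words allowed to follow it. This is a minimal sketch rather than the recogniser's actual implementation: the `FOLLOWERS` table, the helper names, and the numeric scores are hypothetical, and the scores merely stand in for whatever likelihood the recogniser assigns to each speech model.

```python
from typing import Dict, Iterable, Optional, Set

# Hypothetical grammar: words that may follow a recognised first word.
FOLLOWERS: Dict[str, Set[str]] = {
    "TV": {"on", "off", "volume", "channel", "up", "down"},
    "lamp": {"on", "off"},
}


def candidate_words(first_word: Optional[str], vocabulary: Iterable[str]) -> Set[str]:
    """Return the words whose speech models need to be evaluated next."""
    vocab = set(vocabulary)
    if first_word is None:
        # Nothing recognised yet: check against the whole vocabulary.
        return vocab
    # An unknown first word falls back to the whole vocabulary.
    return FOLLOWERS.get(first_word, vocab) & vocab


def best_match(scores: Dict[str, float], candidates: Set[str]) -> str:
    """Pick the highest-scoring word, considering the candidate models only."""
    allowed = [word for word in candidates if word in scores]
    if not allowed:
        raise ValueError("no candidate word has a score")
    return max(allowed, key=lambda word: scores[word])
```

With this arrangement, a word outside the candidate set cannot be selected even when its model happens to score well, which is the source of both the reduced processing burden and the reduced scope for confusion.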
The reduced set of speech models selected for use in subsequent processing may be narrowed further if the speech processing engine 104 also stores data relating to the current state of an item of equipment. For example, once a television has been switched on, it is unlikely that a subsequent part of an utterance following "TV" will be "on" since the television is already in the on-state. Therefore, the speech model for recognising "on" need not be considered in such circumstances. It will be appreciated that such embodiments might assume that a given item of equipment is in a desired state following a command or that some form of feedback from the equipment to the system is provided to confirm that it has assumed the desired state. Having parsed an utterance at step 408, a determination is made, at step 409, as to whether or not the parsed utterance is a command. Therefore, if appropriate, respective control signals for controlling the equipment 114 to 118 are generated and output at step 410. The device control software 112 has access to a table 134 of parsed commands 136 that are mapped to corresponding control signals 138.
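The state-based pruning can be sketched as below. This is an illustration under stated assumptions: the function names and the string state labels are hypothetical, and `apply_command` takes the "assume the desired state was reached" option from the text rather than modelling feedback from the equipment.

```python
from typing import Dict, Set


def prune_by_state(device: str, candidates: Set[str], states: Dict[str, str]) -> Set[str]:
    """Drop candidate words that would request the state the device is already in."""
    current = states.get(device)
    return {word for word in candidates if word != current}


def apply_command(device: str, word: str, states: Dict[str, str]) -> None:
    """Record the new state, assuming the device obeyed (no feedback channel)."""
    if word in ("on", "off"):
        states[device] = word
```

A short usage example: with `states = {"TV": "off"}`, the candidates after "TV" still include "on"; after `apply_command("TV", "on", states)` the model for "on" is no longer considered.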
A further example in which knowledge of the current state of the computer equipment can be used to improve the ease with which a dysarthric speaker can interact with their environment is when, for example, only the radio is switched on. Assume that the dysarthric speaker issues the command "volume up". Knowing that only the radio is currently switched on, the speech processing engine 104 is sufficiently sophisticated to be able to ignore all speech models but for those relating to the radio.
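One way to picture this disambiguation is to resolve a device-less command against the devices that are currently switched on. The sketch below is hypothetical: the `supports` table, the state labels, and the rule of returning `None` when the target is ambiguous are assumptions made for illustration.

```python
from typing import Dict, Optional, Set


def resolve_device(
    command: str,
    states: Dict[str, str],
    supports: Dict[str, Set[str]],
) -> Optional[str]:
    """Return the single switched-on device that accepts the command, if there is exactly one."""
    active = [
        device
        for device, state in states.items()
        if state == "on" and command in supports.get(device, set())
    ]
    # Zero or several matching devices: the command cannot be resolved from state alone.
    return active[0] if len(active) == 1 else None
```

If both the television and the radio were on, this sketch would return `None`, and the system would need the speaker to name the device explicitly.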
If the parsed utterance is not recognised at step 409 as a command, it is assumed to be speech intended to be communicated to a listener. That speech communication is converted into a more intelligible form, that is, it is converted into a form that may be understood by someone unfamiliar with the dysarthric speaker, and subsequently output. The subsequent output might be in the form of text on the screen of the computer system 102 or in the form of synthesized speech. This processing takes place at step 412.
Embodiments of the present invention may use a table comprising a mapping between dysarthric utterances and conventional speech or text. The speech synthesiser may output the speech or the text may be output for display. Alternatively, or additionally, the text may be converted to speech by the speech synthesiser.
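The decision between steps 410 and 412 can be sketched as two table lookups. This is a minimal sketch, not the device control software itself: the table contents, the signal code strings, and the function name are invented for illustration, with `COMMAND_TABLE` standing in for the table of parsed commands mapped to control signals and `PHRASE_TABLE` for the mapping from recognised utterances to conventional text.

```python
from typing import Dict, Tuple

# Hypothetical parsed-command table: recognised word sequence -> control signal code.
COMMAND_TABLE: Dict[Tuple[str, ...], str] = {
    ("TV", "on"): "IR:TV_POWER_ON",
    ("lamp", "off"): "IR:LAMP_OFF",
}

# Hypothetical mapping from recognised words to conventional text for display or synthesis.
PHRASE_TABLE: Dict[str, str] = {
    "hello": "Hello",
}


def dispatch(parsed: Tuple[str, ...]) -> Tuple[str, str]:
    """Return ("control", signal) for a command, otherwise ("speech", text)."""
    if parsed in COMMAND_TABLE:
        return ("control", COMMAND_TABLE[parsed])
    # Not a command: treat it as communication and convert it to intelligible text.
    text = " ".join(PHRASE_TABLE.get(word, word) for word in parsed)
    return ("speech", text)
```

The returned text could then be shown on screen or handed to a speech synthesiser, as the description allows either output.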

Claims (24)

  1. A speech processor operable to process an input utterance to identify that utterance; the speech processor comprising means to calculate a confusability measure that reflects a degree of correlation between the input utterance and at least a further utterance; each of the input utterance and the further utterance corresponding to respective words of a vocabulary of words.
  2. A speech processor as claimed in claim 1, further comprising means to replace at least one of the respective words with at least a further word having a corresponding utterance having a different degree of correlation with the at least one of the input utterance and the further utterance.
  3. A speech processor as claimed in claim 2, further comprising means to associate the further word with a respective control signal.
  4. A speech processor as claimed in any preceding claim in which the means to calculate the confusability measure comprises means to subject the input utterance, having a corresponding speech model, to a speech model for the further utterance to determine the response of the speech model for the further utterance and means to provide the confusability measure according to that response.
  5. A speech processor as claimed in any preceding claim in which the means to calculate the confusability measure between the words, $W_i$ and $W_j$, of the input utterance and the further utterance respectively comprises means to calculate $C_{ij} = (\sum_k L_{ijk}) / n_j$, where $L_{ijk}$ is a per-frame likelihood of each speech model generating each example of each word on a Viterbi path, j and k represent the kth repetition of the jth word in a training set comprising N, and $n_j$ is the number of examples of $W_j$ in the training set.
  6. A speech processor as claimed in any preceding claim further comprising means to calculate a consistency measure for the input utterance and means to output a visual indication of the consistency measure.
  7. A speech processor as claimed in claim 6 in which the means to calculate the consistency measure comprises means to calculate $\delta_i = (\sum_k L_{i,k}) / n_i$, where $L_{i,k}$ is a per-frame likelihood of each speech model generating each example of each word on a Viterbi path, i and k represent the kth repetition of the ith word in a training set comprising N, and $n_i$ is the number of examples of $W_i$ in the training set.
  8. A speech processor as claimed in claim 6 in which the means to calculate the consistency measure comprises means to calculate $\Delta = (\sum_i \delta_i) / N$.
  9. A speech processor as claimed in any of claims 6 to 8 in which the means to output a visual indication of the consistency measure comprises means to present a bar chart comparison of the consistency measure with an average consistency measure for that word for a given speaker.
  10. A speech processor as claimed in any preceding claim in which the speech processor is a dysarthric speech processor.
  11. An assistive technology system comprising a speech processor as claimed in any preceding claim, further comprising means to output a control signal corresponding to the identified utterance for influencing the operation of respective equipment.
  12. A dysarthric speech assistive technology system comprising an assistive technology system as claimed in claim 11.
  13. A method of training or treating a speech impaired speaker comprising the steps of processing an input utterance of the speech impaired speaker using a corresponding speech model; providing a visual indication of the degree of correlation between the input utterance and a predetermined utterance of the speech impaired speaker for the corresponding speech model.
  14. A method of training or treating a speech impaired speaker as claimed in claim 13 in which the degree of correlation between the input utterance and the predetermined utterance of the speech impaired speaker for the corresponding speech model comprises the step of calculating $\delta_i = (\sum_k L_{i,k}) / n_i$, where $L_{i,k}$ is a per-frame likelihood of each speech model generating each example of each word on a Viterbi path, i and k represent the kth repetition of the ith word in a training set comprising N, and $n_i$ is the number of examples of $W_i$ in the training set.
  15. A method of training or treating a speech impaired speaker as claimed in either of claims 13 and 14 further comprising the step of establishing the predetermined utterance of the speech impaired speaker for the speech model.
  16. A method of training or treating a speech impaired speaker as claimed in any of claims 13 to 15 further comprising the step of processing a plurality of utterances corresponding to the same word and calculating a measure of the average of the plurality of utterances.
  17. A method of training or treating a speech impaired speaker as claimed in any of claims 15 to 16, further comprising the steps of establishing a plurality of utterances corresponding to respective words of a plurality of words, calculating a measure of confusability between utterances corresponding to at least a selected pair of words, and selecting an alternative word to replace one of the selected pair of words in the vocabulary; the alternative word having an improved measure of confusability between a respective utterance for the alternative word and the utterance corresponding to remaining word of the selected pair of words.
  18. A method of training or treating as claimed in any of claims 13 to 17 in which the speech impaired speaker is a dysarthric speaker.
  19. An assistive technology system or speech processor comprising means to implement a method as claimed in any of claims 13 to 18.
  20. A system or processor substantially as described herein with reference to and/or as illustrated in the accompanying drawings.
  21. An assistive technology system or speech processor substantially as described herein with reference to and/or as illustrated in the accompanying drawings.
  22. A method of training or treating a speech impaired person substantially as described herein with reference to and/or as illustrated in the accompanying drawings.
  23. A computer program comprising computer program code means to implement a method or system as claimed in any preceding claim.
  24. A computer readable storage storing a computer program as claimed in claim 23.
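The consistency and confusability measures recited in the claims can be sketched numerically. This is an illustration under stated assumptions, not the claimed means: it assumes each measure is an average of per-frame likelihoods over the examples of a word, that is, a per-word consistency δ_i = (Σ_k L_{i,k}) / n_i, an overall consistency Δ = (Σ_i δ_i) / N, and a confusability obtained by scoring the examples of one word against the model of another. The data layout, the function names, and the sample values are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Sequence, Tuple

# likelihoods[(model_word, example_word)] holds one per-frame likelihood value
# for each recorded example k of example_word, scored by the model of model_word.
Likelihoods = Dict[Tuple[str, str], List[float]]


def confusability(likelihoods: Likelihoods, model_word: str, example_word: str) -> float:
    """Average per-frame likelihood of one word's model over another word's examples."""
    return mean(likelihoods[(model_word, example_word)])


def consistency(likelihoods: Likelihoods, word: str) -> float:
    """Per-word consistency: a word's model scored on that word's own examples."""
    return confusability(likelihoods, word, word)


def overall_consistency(likelihoods: Likelihoods, words: Sequence[str]) -> float:
    """Average of the per-word consistencies over the vocabulary."""
    return mean([consistency(likelihoods, word) for word in words])
```

Under these assumptions, a pair of words whose cross-model score is close to their own-model scores would be a candidate for replacing one word with a less confusable alternative.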
GB0406932A 2003-03-28 2004-03-29 Speech recognition Withdrawn GB2399932A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB0307201A GB2399931A (en) 2003-03-28 2003-03-28 Assistive technology

Publications (2)

Publication Number Publication Date
GB0406932D0 GB0406932D0 (en) 2004-04-28
GB2399932A true GB2399932A (en) 2004-09-29

Family

ID=9955748

Family Applications (2)

Application Number Title Priority Date Filing Date
GB0307201A Withdrawn GB2399931A (en) 2003-03-28 2003-03-28 Assistive technology
GB0406932A Withdrawn GB2399932A (en) 2003-03-28 2004-03-29 Speech recognition

Country Status (1)

Country Link
GB (2) GB2399931A (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103405217B (en) * 2013-07-08 2015-01-14 泰亿格电子(上海)有限公司 System and method for multi-dimensional measurement of dysarthria based on real-time articulation modeling technology
CN105719662B (en) * 2016-04-25 2019-10-25 广东顺德中山大学卡内基梅隆大学国际联合研究院 Dysarthrosis detection method and system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2187586A (en) * 1986-02-06 1987-09-09 Reginald Alfred King Acoustic recognition
EP1022722A2 (en) * 1999-01-22 2000-07-26 Matsushita Electric Industrial Co., Ltd. Speaker adaptation based on eigenvoices
US6185530B1 (en) * 1998-08-14 2001-02-06 International Business Machines Corporation Apparatus and methods for identifying potential acoustic confusibility among words in a speech recognition system
EP1079370A2 (en) * 1999-08-26 2001-02-28 Canon Kabushiki Kaisha Method for training a speech recognition system with detection of confusable words
EP1217609A2 (en) * 2000-12-22 2002-06-26 Hewlett-Packard Company Speech recognition
US20020116191A1 (en) * 2000-12-26 2002-08-22 International Business Machines Corporation Augmentation of alternate word lists by acoustic confusability criterion
GB2385698A (en) * 2002-02-26 2003-08-27 Canon Kk Speech recognition

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090119109A1 (en) * 2006-05-22 2009-05-07 Koninklijke Philips Electronics N.V. System and method of training a dysarthric speaker
US9508268B2 (en) * 2006-05-22 2016-11-29 Koninklijke Philips N.V. System and method of training a dysarthric speaker

Also Published As

Publication number Publication date
GB0307201D0 (en) 2003-04-30
GB0406932D0 (en) 2004-04-28
GB2399931A (en) 2004-09-29

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)