CA2158064C - Speech processing
- Publication number: CA2158064C (application CA2158064A)
- Authority
- CA
- Canada
- Prior art keywords
- path
- nodes
- node
- vocabulary
- speech
- Prior art date
- Legal status: Expired - Lifetime
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/142—Hidden Markov Models [HMMs]
Abstract
A path link passing speech recognition system and method for recognising input connected speech, the recognition system having a plurality of vocabulary nodes (24) associated with word representation models, at least one of the vocabulary nodes (24) of the network being able to process more than one path link simultaneously, so allowing for more than one recognition result.
Description
SPEECH PROCESSING
The present invention relates to speech processing and in particular to a system for processing alternative parses of connected speech.
Speech processing includes speaker recognition, in which the identity of a speaker is detected or verified, and speech recognition. Speech recognition includes so-called speaker independent recognition, wherein a system may be used by anyone without requiring recogniser training, and so-called speaker dependent recognition, in which the users allowed to operate a system are restricted and a training phase is necessary to derive information from each allowed user. It is common in recognition processing to input speech data, typically in digital form, to a so-called front-end processor, which derives from the stream of input speech data a more compact, perceptually significant set of data referred to as a front-end feature set or vector. For example, speech is typically input via a microphone, sampled (e.g. at 8 kHz), digitised, segmented into frames of length 10-20 ms and, for each frame, a set of coefficients is calculated. In speech recognition, the speaker is normally assumed to be speaking one of a known set of words or phrases. A stored representation of the word or phrase, known as a template or model, comprises a reference feature matrix of that word as previously derived from, in the case of speaker independent recognition, multiple speakers. The input feature vector is matched with the model and a measure of similarity between the two is produced.
Speech recognition (whether human or machine) is susceptible to error and may result in the misrecognition of words. If a word or phrase is incorrectly recognised, the speech recogniser may then offer another attempt at recognition, which may or may not be correct.
Various ways have been suggested for processing speech to select the best or alternative matches between input speech and stored speech templates or models. In isolated word recognition systems, the production of alternative matches is fairly straightforward: each word is a separate 'path' in a transition network representing the words to be recognised, and the independent word paths join only at the final point in the network. Ordering all the paths exiting the network in terms of their similarity to the stored templates or the like will give the best and alternative matches.
In most connected recognition systems, and some isolated word recognition systems based on connected recognition techniques, however, it is not always possible to recombine all the paths at the final point of the network, and thus neither the best nor alternative matches are directly obtainable from the information available at the exit point of the network. One solution to the problem of producing a best match is discussed in "Token Passing: a Simple Conceptual Model for Connected Speech Recognition Systems" by S. J. Young, N. H. Russell and J. H. S. Thornton, 1989, which relates to passing packets of information, known as tokens, through a transition network.
A token contains information relating to the partial path travelled as well as an accumulated score indicative of the degree of similarity between the input and the portion of the network processed thus far.
As described by Young et al, at each input of a frame of speech to a transition network, any tokens that are present at the input of a node are passed into the node and the current frame of speech is matched within the word models associated with those nodes. New tokens then appear at the output of the nodes (having "travelled" through the model associated with the node). Only the best scoring token is then passed on to the inputs of the following nodes. When the end of speech has been signalled (by an external device such as a pause detector), a single token will be present at the final node. From this token the entire path through the network can be extracted by tracing back along the path by means of the previous path information contained within the token to provide the best match to the input speech.
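As a concrete illustration of the scheme described by Young et al, the following is a minimal sketch of token passing; the names (Token, pass_token, frame_score) are illustrative, not taken from the cited paper.

```python
# Minimal sketch of token passing, assuming log-domain scores where
# higher is better. Each node keeps only the best incoming token,
# which carries its partial path with it.
from dataclasses import dataclass

@dataclass
class Token:
    score: float    # accumulated similarity between the input and the partial path
    history: tuple  # the nodes "travelled" through so far

def pass_token(incoming, node_id, frame_score):
    """Process one frame at one node: only the best incoming token survives."""
    if not incoming:
        return None  # node is inactive at this frame
    best = max(incoming, key=lambda t: t.score)
    return Token(best.score + frame_score, best.history + (node_id,))
```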
The article "A unified direction mechanism for automatic speech recognition using Hidden Markov Models" by S.C. Austin and F. Fallside, ICASSP
1989, Vol. 1, pages 667-670, relates to a connected word speech recogniser which operates in a manner similar to that described by Young et al, as described above. A history relating to the progress of the recognition through the transition network is updated on exiting the word model. At the end of recognition, the result of recognition is derived from the history presented to the output which has the best score. Again only one history is possible for each path terminating at the final node.
Such known arrangements do not allow for an alternative choice to be readily available at the output of the network.
In accordance with the invention a speech recognition apparatus comprises means for deriving a recognition feature vector from an input speech signal for each predetermined time frame; means for modelling expected input speech comprising a plurality of vocabulary nodes each of which has an associated word representation model and links between said vocabulary nodes; processing means for comparing the recognition feature vectors with the modelled input speech and for generating a path link for each node and time frame, said path links indicating the most likely prior sequence of vocabulary nodes for each vocabulary node and time frame, each path link comprising a field for storing an accumulated recognition score and a field for storing a reference to the most likely previous path link in the sequence; and means for indicating recognition of the input speech signal in dependence upon the comparison; characterised in that the processing means (351) is capable of processing more than one path link for at least one vocabulary node, other than the final node, in a single time frame.
Such an arrangement means that more than one incoming path link can be processed by a node in a single time frame and hence that more than one recognition result may be obtained.
Each node has associated path links comprising fields for storing a pointer to the previous path link, an accumulated score for a path, a pointer to a previous node and a time index for segmentation information. Preferably, the vocabulary nodes which may have more than one path link processed in a single time frame have more than one identical associated word representation model.
The provision that at least one of the vocabulary nodes other than the final node of the network has more than one associated word representation model allows the processor to process multiple paths for the same time frame and so allows more than one path link to be propagated across each inter-node link at each input frame. In effect, the invention creates multiple layers of a transition network along which several alternative paths may be propagated. The best scoring path may be used by the first model of a node, the next best by the second and so on until either parallel models or incoming paths run out.
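A path link with the fields just listed might be represented as in the following sketch; the field names are assumptions for illustration rather than the patent's own.

```python
# Hedged sketch of the path link record: previous-link pointer, accumulated
# score, pointer to the node previously exited, and a time index.
from dataclasses import dataclass
from typing import Optional

@dataclass
class PathLink:
    prev: Optional["PathLink"]  # most likely previous path link in the sequence
    score: float                # accumulated recognition score
    node: Optional[int]         # node previously exited (None for the empty initial link)
    time: int                   # frame index, giving segmentation information
```

A vocabulary node holding n identical models can then accept up to the n best-scoring incoming links in one frame, one per model.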
In general terms "network" includes directed acyclic graphs (DAGs) and trees. A DAG is a network with no cycles and a tree is a network in which the only meeting of paths occurs conceptually right at the end of the network.
The term "word" here denotes a basic recognition unit, which may be a word but equally well may be a diphone, phoneme, allophone, etc. Recognition is the process of matching an unknown utterance with a predefined transition network, the network having been designed to be compatible with what a user is likely to say.
In order to identify the phrase that has been recognised, the apparatus may include means for tracing the path link back through the network.
Alternatively, the apparatus may also include means for assigning a signature to at least some of the nodes having associated word representation models and means for comparing the signature of each path to determine the path with the best match to the input speech and that with the second best alternative match.
This arrangement allows for an alternative which is necessarily different in character to the best match and does not differ merely in segmentation or noise matches.
The word representation models may be Hidden Markov Models (HMMs), as described generally in Cox, "Hidden Markov Models for automatic speech recognition: theory and application", British Telecom Technology Journal, April 1988, Vol. 6, No. 2, page 105; templates; Dynamic Time Warping models; or any other suitable word representation model. The processing which occurs within a model is irrelevant as far as this invention is concerned.
It is not necessary for all the nodes having associated word models to have a signature assigned to them. Depending on the structure of the transition network, it may be sufficient only to assign signatures to those nodes which appear before a decision point within a network. A decision point as used herein relates to a point in the network which has more than one incoming path.
Partial paths may be examined at certain decision points in the network, certain constraints being imposed at these decision points so that only paths conforming to the constraints are propagated, as described in the applicants' International patent application filed on 31st March 1994 entitled "Connected Speech Recognition", No. WO 94/23425, published 13th October 1994. Each decision point is associated with a set of valid signatures and any path links with signatures that are not in the set are discarded.
The accumulated signature may be used to identify the complete path, resulting in extra efficiency of operation as the path links need not be traversed to determine the path identity, and the partial path information of the token may not be generated at all. In this case the signature field must be large enough to identify all paths uniquely.
For efficient operation of the apparatus according to the invention, the signal processing of path signatures is preferably carried out in a single operation to increase processing speed.
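The patent does not spell out the single operation; one plausible scheme, assumed here for illustration, is to pack a fixed number of bits per signed node into an integer using a shift-and-or.

```python
# Hedged sketch of single-operation signature accumulation. With 3 bits
# per digit signature, a 32 bit field identifies digit strings of length
# up to about ten; the bit width per node is an assumption.
BITS_PER_NODE = 3

def extend_signature(signature: int, node_signature: int) -> int:
    # one shift-and-or per signed vocabulary node traversed
    return (signature << BITS_PER_NODE) | node_signature

sig = 0
for digit in (1, 2, 1):          # the '1 2 1' path used as an example later
    sig = extend_signature(sig, digit)
assert sig == 0b001_010_001      # each digit occupies its own 3-bit field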
Other aspects and preferred embodiments of the invention are as disclosed and claimed herein, with advantages that will be apparent hereafter.
The invention will now be described further, by way of example only, with reference to the accompanying drawings in which:
Figure 1 shows schematically the employment of a recognition processor according to the invention in a telecommunications environment;
Figure 2 is a block diagram showing schematically the functional elements of a recognition processor according to the invention;
Figure 3 is a block diagram indicating schematically the components of a classifier forming part of Figure 2;
Figure 4 is a block diagram showing schematically the structure of a sequence parser forming part of the embodiment of Figure 2;
Figure 5 shows schematically the content of a field within a store forming part of Figure 4;
Figure 6 is a schematic representation of one embodiment of a transition network applicable with the processor of the sequence parser of Figure 4;
Figure 7a shows a node of a network and Figure 7b shows a path link as employed according to the invention;
Figures 8 to 10 show the progression of path links through the network of Figure 6;
Figure 11 is a schematic representation of a second embodiment of a transition network of an apparatus according to the invention;
Figure 12 is a schematic representation of a third embodiment of a transition network of an apparatus according to the invention.
Referring to Figure 1, a telecommunications system including speech recognition generally comprises a microphone 1, typically forming part of a telephone handset, a telecommunications network (typically a public switched telecommunications network (PSTN)) 2, a recognition processor 3 connected to receive a voice signal from the network 2, and a utilising apparatus 4 connected to the recognition processor 3 and arranged to receive therefrom a voice recognition signal, indicating recognition or otherwise of a particular word or phrase, and to take action in response thereto. For example, the utilising apparatus 4 may be a remotely operated banking terminal for effecting banking transactions.
In many cases, the utilising apparatus 4 will generate an auditory response to the speaker, transmitted via the network 2 to a loudspeaker 5 typically forming a part of the subscriber handset.
In operation, a speaker speaks into the microphone 1 and an analog speech signal is transmitted from the microphone 1 into the network 2 to the recognition processor 3, where the speech signal is analysed and a signal indicating identification or otherwise of a particular word or phrase is generated and transmitted to the utilising apparatus 4, which then takes appropriate action in the event of recognition of the speech.
Typically, the recognition processor needs to acquire data concerning the speech against which to verify the speech signal, and this data acquisition may be performed by the recognition processor in a second mode of operation in which the recognition processor 3 is not connected to the utilising apparatus 4, but receives a speech signal from the microphone 1 to form the recognition data for that word or phrase. However, other methods of acquiring the speech recognition data are also possible.
Typically, the recognition processor 3 is ignorant of the route taken by the signal from the microphone 1 to and through the network 2; the signal may originate from any one of a large variety of types and qualities of receiver handset. Likewise, within the network 2, any one of a large variety of transmission paths may be taken, including radio links, analog and digital paths and so on. Accordingly, the speech signal Y reaching the recognition processor 3 corresponds to the speech signal S received at the microphone 1, convolved with the transfer characteristics of the microphone 1, the link to the network 2, the channel through the network 2, and the link to the recognition processor 3, which may be lumped and designated by a single transfer characteristic H.
Referring to Figure 2, the recognition processor 3 comprises an input 31 for receiving speech in digital form (either from a digital network or from an analog to digital converter); a frame processor 32 for partitioning the succession of digital samples into a succession of frames of contiguous samples; a feature extractor 33 for generating from a frame of samples a corresponding feature vector; a classifier 34 receiving the succession of feature vectors and operating on each with a plurality of model states, to generate recognition results; a sequencer 35 which is arranged to receive the classification results from the classifier 34 and to determine the predetermined utterance to which the sequence of classifier outputs indicates the greatest similarity; and an output port 38 at which a recognition signal is supplied indicating the speech utterance which has been recognised.
The frame generator 32 is arranged to receive speech samples at a rate of, for example, 8,000 samples per second, and to form frames comprising 256 contiguous samples, at a frame rate of 1 frame every 16 ms.
Preferably, each frame is windowed (i.e. the samples towards the edge of the frame are multiplied by predetermined weighting constants) using, for example, a Hamming window to reduce spurious artifacts generated by the frames' edges. In a preferred embodiment, the frames overlap (for example by 50%) so as to ameliorate the effects of the windowing.
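Under the example parameters given above (8,000 samples per second, 256-sample frames, one frame every 16 ms, Hamming windowing), the framing stage can be sketched as follows; the parameter values are the text's examples, not requirements.

```python
# Sketch of frame generation with 50% overlap and a Hamming window.
import numpy as np

FRAME_LEN = 256                 # 32 ms of speech at 8 kHz
STEP = 128                      # one frame every 16 ms => 50% overlap
WINDOW = np.hamming(FRAME_LEN)  # tapers the samples towards the frame edges

def frames(samples: np.ndarray):
    for start in range(0, len(samples) - FRAME_LEN + 1, STEP):
        yield samples[start:start + FRAME_LEN] * WINDOW
```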
Feature Extractor 33

The feature extractor 33 receives frames from the frame generator 32 and generates, in each case, a set or vector of features. The features may, for example, comprise cepstral coefficients (for example, LPC cepstral coefficients or mel frequency cepstral coefficients as described in "On the Evaluation of Speech Recognisers and Databases using a Reference System", Chollet & Gagnoulet, 1982 Proc. IEEE p. 2026), or differential values of such coefficients comprising, for each coefficient, the differences between the coefficient and the corresponding coefficient value in the preceding vector, as described in "On the use of Instantaneous and Transitional Spectral Information in Speaker Recognition", Soong & Rosenberg, 1988 IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. 36, No. 6, p. 871. Equally, a mixture of several types of feature coefficient may be used.
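The differential coefficients just described reduce, per coefficient, to a frame-to-frame difference; a minimal sketch:

```python
# Sketch of differential ("delta") features: the difference between each
# coefficient and the corresponding coefficient in the preceding vector.
import numpy as np

def delta_features(cepstra: np.ndarray) -> np.ndarray:
    """cepstra has shape (num_frames, num_coeffs); the first delta row is zero."""
    deltas = np.zeros_like(cepstra)
    deltas[1:] = cepstra[1:] - cepstra[:-1]
    return deltas
```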
The feature extractor 33 outputs a frame number, incremented for each successive frame. The output of the feature extractor 33 is also passed to an end pointer 36, the output of which is connected to the classifier 34. The end pointer 36 detects the end of speech and various types are well known in this field.
The frame generator 32 and feature extractor 33 are, in this embodiment, provided by a single suitably programmed digital signal processor (DSP) device (such as the Motorola DSP 56000, or the Texas Instruments TMS320) or similar device.
Referring to Figure 3, in this embodiment, the classifier 34 comprises a classifying processor 341 and a state memory 342.
The state memory 342 comprises a state field 3421, 3422, ...., for each of the plurality of speech states.
For example, each allophone to be recognised by the recognition processor comprises 3 states, and accordingly 3 state fields are provided in the state memory 342 for each allophone.
The classification processor 341 is arranged to read each state field within the memory 342 in turn, and to calculate for each, using the current input feature coefficient set, the probability that the input feature set or vector corresponds to the corresponding state.
Accordingly, the output of the classification processor is a plurality of state probabilities P, one for each state in the state memory 342, indicating the likelihood that the input feature vector corresponds to each state.
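The patent does not fix the form of the state models. Assuming, purely for illustration, a diagonal-covariance Gaussian per state field, the classifier's per-frame output can be sketched as:

```python
# Hedged sketch: one log probability per state field (3421, 3422, ...)
# for the current feature vector. The Gaussian state model is an
# assumption; the patent leaves the state modelling unspecified.
import numpy as np

def state_log_prob(x: np.ndarray, mean: np.ndarray, var: np.ndarray) -> float:
    # log N(x; mean, diag(var))
    return float(-0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var))

def classify(x: np.ndarray, state_fields) -> list:
    return [state_log_prob(x, mean, var) for (mean, var) in state_fields]
```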
The classifying processor 341 may be a suitably programmed digital signal processing (DSP) device; in particular, it may be the same digital signal processing device as the feature extractor 33.
Sequencer 35

Referring to Figure 4, the sequencer 35 in this embodiment comprises a state sequence memory 352, a parsing processor 351, and a sequencer output buffer 354.
Also provided is a state probability memory 353 which stores, for each frame processed, the state probabilities output by the classifier processor 341. The state sequence memory 352 comprises a plurality of state sequence fields 3521, 3522, ...., each corresponding to a word or phrase sequence to be recognised consisting of a string of allophones.
Each state sequence in the state sequence memory 352 comprises, as illustrated in Figure 5, a number of states P1, P2, ... PN (where N is a multiple of 3) and, for each state, two probabilities: a repeat probability (Pi1) and a transition probability to the following state (Pi2). The states of the sequence are a plurality of groups of three states, each group relating to a single allophone. The observed sequence of states associated with a series of frames may therefore comprise several repetitions of each state Pi in each state sequence model 3521 etc.; for example:
Frame Number: 1  2  3  4  5  6  7  8  9  ... Z  Z+1
State:        P1 P1 P1 P2 P2 P2 P2 P2 P2 ... Pn Pn

The parsing processor 351 is arranged to read, at each frame, the state probabilities output by the classifier processor 341 and the previously stored state probabilities in the state probability memory 353, to calculate the most likely path of states to date over time, and to compare this with each of the state sequences stored in the state sequence memory 352.
The calculation employs the well-known HMM, as discussed in the above-referenced Cox paper. Conveniently, the HMM processing performed by the parsing processor 351 uses the well-known Viterbi algorithm. The parsing processor 351 may, for example, be a microprocessor such as the Intel(TM) i-486(TM) microprocessor or a Motorola(TM) microprocessor, or may alternatively be a DSP device (for example, the same DSP device as is employed for any of the preceding processors).
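Specialised to one left-to-right state sequence with per-state repeat and transition probabilities, the Viterbi recursion just mentioned can be sketched as follows; obs, p_rep and p_next are assumed names, with obs[t][i] the classifier's log probability for state i at frame t.

```python
# Hedged sketch of Viterbi scoring for one left-to-right state sequence.
# p_rep[i] and p_next[i] correspond to the repeat (Pi1) and transition
# (Pi2) probabilities of state i, assumed non-zero.
import math

def viterbi_score(obs, p_rep, p_next):
    n = len(p_rep)
    NEG = float("-inf")
    v = [obs[0][0]] + [NEG] * (n - 1)       # paths must start in the first state
    for t in range(1, len(obs)):
        new = []
        for i in range(n):
            stay = v[i] + math.log(p_rep[i])
            move = v[i - 1] + math.log(p_next[i - 1]) if i > 0 else NEG
            new.append(max(stay, move) + obs[t][i])
        v = new
    return v[-1]                             # best path ending in the final state
```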
Accordingly, for each state sequence (corresponding to a word, phrase or other speech sequence to be recognised) a probability score is output by the parsing processor 351 at each frame of input speech. For example, the state sequences may comprise the names in a telephone directory.
When the end of the utterance is detected, a label signal indicating the most probable state sequence is output from the parsing processor 351 to the output port 38, to indicate that the corresponding name, word or phrase has been recognised.
The parsing processor 351 comprises a network which is specifically configured to recognise certain phrases or words, for example a string of digits.
Figure 6 shows a simple transition network for recognising a string of words, in this case either a string of four words or a string of three. Each node 12 of the network is associated with a word representation model 13, for example an HMM, which is stored in a model list. Several nodes can be associated with each model and each node includes a pointer to its associated model (as can be seen in Figures 6 and 7a). In order to produce a best match and a single alternative parse, the final node 14 is associated with two models, so allowing this node to process two paths. If n parses are required, the final node 14 of the network is associated with n identical word models.
As shown in Figure 7b, a path link 15 contains information relating to a pointer to the previous path link, an accumulated score, a pointer to the node previously exited and a time index. At the start of an utterance, an empty path link 15' is inserted into the first node 16 as shown in Figure 8. The first node now contains a path link and is therefore active whereas the remaining nodes are inactive. At each clock tick (i.e. with each incoming frame of speech) any active nodes accumulate a score in their path link.
If the first model can match, say, a minimum of seven frames of speech, then at the seventh clock pulse a path link 15" is output from the first node with the score for matching the seven frames to the model and pointers to the entry path link and the node just matched. The path link is fed to all of the following nodes 12, as shown in Figure 9. Now the first three nodes are active. The input frame of speech is then matched in the models associated with the active nodes and new path links outputted.
This processing continues, with the first node 16 producing further path links as its model matches increasingly longer parts of the utterance and the succeeding nodes performing similar calculations.
When the input speech has been processed as far as the final node 18 of the network, path links from each 'branch' of the network may be presented to this node 18. If, at any given time frame, there is a single path link (i.e. only one of the parallel paths has been completed), that path link is taken to be the best (and only) match and is processed by the final node 18. However, if there are two path links presented to the final node 18, both are processed by that node, since the final node 18 is able to process more than one path. The output path links are continuously updated at each frame of speech. When the utterance is completed there will be two path links 15''' at the output of the network, as shown in Figure 10 (from which the pointers to previous path links and nodes have been excluded for the sake of clarity).
The full path can be found by following the pointers to the previous path links, and the nodes on the recognised path (and hence the input speech deemed to be recognised) can be identified by looking at the pointers to the nodes exited.
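Using the PathLink sketch given earlier, this trace-back is a short loop over the previous-link pointers:

```python
# Sketch of trace-back: follow the previous-link pointers, collect the
# exited nodes, and reverse to obtain the recognised node sequence.
def trace_back(link):
    nodes = []
    while link is not None:
        if link.node is not None:   # the initial empty link carries no node
            nodes.append(link.node)
        link = link.prev
    nodes.reverse()
    return nodes
```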
Figure 11 represents a second embodiment of a network configured to recognise strings of three digits. The grey nodes 22 are null nodes in the network;
the white nodes are active nodes, which may be divided into vocabulary nodes 24 with associated word representation models (not shown) for matching incoming speech, and noise nodes 25 which represent arbitrary noise.
If all of the active nodes 24, 25 after and including the third null node 22' are each capable of having three paths for each time frame (i.e. each vocabulary node 24 is associated with three word representation models), the output of the network will comprise path links relating to the three top scoring paths of the system. As described with reference to Figures 8 to 10, the three paths can be found by following, for each path, the pointers to the previous path links. The nodes on the paths (and hence the input speech deemed to be recognised) can be identified by looking at the pointers to the exited nodes.
In a further development of the invention, the path links may be augmented with signatures which represent the significant nodes of the network. These significant nodes may, for example, include all vocabulary nodes 24. In the embodiment of Figure 11, each vocabulary node 24 is assigned a signature; for example, the nodes representing the digit 1 are assigned a signature '1', the nodes 24'' representing the digit 2 are assigned a signature '2' and so on.
At the start of parsing, a single empty path link is passed into a network entry node 26. Since this is a null node, the path link is passed to the next node, a noise node 25. The input frame is matched in the noise model (not shown) of this node and an updated path link produced at the output. This path link is then passed to the next active nodes, i.e. the first vocabulary nodes 24, each having an associated model (not shown). Each vocabulary node 24 processes the frame of speech in its associated word model and produces an updated path link. The signature field of the path link is also updated. At the end of each time frame, the updated path links are sorted to retain the three (n) top scoring paths which have different signature fields. A list ordered by score is maintained, with the added constraint that accumulated signatures are unique: if a second path link with the same signature enters, the better of the two is retained. The list contains only the top 'n' different paths, the rest being ignored.
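The score-ordered, signature-unique list can be maintained as in the sketch below, assuming each path link carries a signature field as described above; merge_unique is an illustrative name.

```python
# Hedged sketch of the n-best list with unique accumulated signatures:
# if two links share a signature only the better-scoring one is retained,
# and at most n links survive, ordered by score.
def merge_unique(links, n=3):
    best = {}
    for link in links:
        held = best.get(link.signature)
        if held is None or link.score > held.score:
            best[link.signature] = link
    return sorted(best.values(), key=lambda l: l.score, reverse=True)[:n]
```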
The path links are then propagated through the next null node 22' to the following noise node 25 and vocabulary nodes 24'', each of which is associated with three identical word representation models. After this, model processing takes place, resulting in the updating of the lists of path links and the extending of the paths into further nodes 24''', 25. It should be clear that the signature fields of the path links are not updated after processing by the null nodes 22 or the noise nodes 25, since these nodes do not have assigned signatures.
The path links are propagated along paths which pass through the remaining active nodes to produce, at an output node 28, up to three path links indicating the relative scores and signatures, for example 1 2 1, of the paths taken through the network. The path links are continuously updated until the end of speech is detected (for example by an external device such as a pause detector, or until a time out is reached). At this point, the pointers or the accumulated signatures of the path links at the output node 28 are examined to determine the recognition results.
For example, presume that three path links, each carrying a score and a signature, are presented to the output node 28 at some time instant: path A with the highest score, path B with the second best score, and path C with the third. Path A, the highest scoring path, is the best match. However, although path B has the second best score, it would be rejected as an alternative parse since its signature, and hence the speech deemed to be recognised, is the same as path A's. Path C would therefore be retained as the second best parse.
If the strings to be recognised have more structure than that discussed above, for example spelt names, signatures need only be assigned to nodes just before decision points, rather than at every vocabulary node.
Figure 12 shows a network for recognising the spelling of the names "Phil", "Paul" and "Peter". For simplicity, no noise is illustrated. The square nodes 44 indicate where the signature should be augmented.
The system can distinguish between the 'PHI' and 'PAU' paths at the 'L' node because the signatures of the path links created at the previous nodes are different. The following node 47 will be able to distinguish between all three independent paths since the signatures of the square nodes 44 are different.
Only the 'L' node and the final noise node 48 need to be associated with more than one identical word model, so that these models are capable of having more than one path for a single time frame.
In all cases, each network illustrating the speech to be recognised requires analysis to determine which nodes are to be assigned signatures. In addition the network is configured to be compatible with what a user is likely to say.
Savings in memory size and processing speed may be achieved by limiting the signatures that a node will propagate, as described in the applicants' International patent application filed on 31st March 1994 entitled "Connected Speech Recognition", No. WO 94/23425, published 13th October 1994. For instance, say the only valid input speech to a recogniser having the network of Figure 11 is the following four numbers only: 111, 112, 121, 211. Certain nodes within the network are associated with a set of valid signatures and a path will only be propagated by such a 'constrained' node if a path link having one of these signatures is presented. To achieve this, the signature fields of the path links entering a constrained node, e.g. the third null node 22', are examined. If the signature field contains a signature other than 1 or 2, the path link is discarded and the path is not propagated any further. If an allowable path link is presented, it is passed on to the next node. The next constrained node is the null node 22'' after the next vocabulary nodes. This null node is constrained to propagate only path links having a signature 11, 12 or 21. The null node 22''' after the next vocabulary nodes is constrained to propagate only path links having the signature 111, 112, 121 or 211. Such an arrangement significantly reduces the necessary processing and allows for a saving in the memory capacity of the apparatus. Only some of the nodes at decision points in the network need be so constrained. In practice, a 32 bit signature has proved to be suitable for sequences of up to 9 digits, and a somewhat longer signature appears suitable for a 12 character alphanumeric string.
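A constrained node of this kind amounts to a set-membership test on the accumulated signature; a sketch follows, with signatures shown as tuples for readability (the packed-integer form sketched earlier would serve equally).

```python
# Hedged sketch of a constrained null node for the 111/112/121/211 example:
# any incoming path link whose accumulated signature is not in the node's
# valid set is discarded and its path propagates no further.
def filter_links(links, valid_signatures):
    return [link for link in links if link.signature in valid_signatures]

# e.g. the null node 22'' would hold the set of valid two-digit prefixes:
VALID_AT_SECOND_NULL = {(1, 1), (1, 2), (2, 1)}
```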
End of speech detection and various other aspects of speech recognition relevant to the present invention are more fully set out in the applicants' International Patent Application filed on 25th March 1994 entitled "Speech Recognition", No WO 94/22131, published 29th September 1994.
In the above described embodiments, recognition processing apparatus suitable to be coupled to a telecommunications exchange has been described.
However, in another embodiment, the invention may be embodied on simple apparatus connected to a conventional subscriber station (mobile or fixed) connected to the telephone network; in this case, analog to digital conversion means may be provided for digitising the incoming analog telephone signal.
The present invention relates to speech processing and in particular to a system for processing alternative parses of connected speech.
Speech processing includes speaker recognition, in which the identity of a speaker is detected or verified, and speech recognition, wherein a system may be used by anyone without requiring recogniser training, and so-called speaker dependent recognition, in which the users allowed to operate a system are restricted and a training phase is necessary to derive information from each allowed user. It is common in recognition processing to input speech data, typically in digital form, to a so-called front-end processor, which derives from the stream of input speech data a more compact, perceptually significant set of data referred to as a front-end feature set or vector. For example, speech is typically input via a microphone, sampled, digitised, segmented into frames of length 10-20ms (e. g. sampled at 8 kHz) and, for each frame, a set of coefficients is calculated. In speech recognition, the speaker is normally assumed to be speaking one of a known set of words or phrases. A stored representation of the word or phrase, known as a template or model, comprises a reference feature matrix of that word as previously derived from, in the case of speaker independent recognition, multiple speakers. The input feature vector is matched with the model and a measure of similarity between the two is produced.
Speech recognition (whether human or machine) is susceptible to error and may result in the misrecognition of words. If a word or phrase is incorrectly recognised, the speech recogniser may then offer another attempt at recognition, which may or may not be correct.
Various ways have been suggested for processing speech to select the best or alternative matches between input - _ 2 _ speech and stored speech templates or models. In isolated word recognition systems, the production of alternative matches is fairly straightforward: each word is a separate ' path' in a transition network representing the words to be recognised and the independent word paths join only at the , final point in the network. Ordering all the paths exiting the network in terms of their similarity to the stored , templates or the like will give the best and alternative matches.
In most connected recognition systems and some isolated word recognition systems based on connected recognition techniques however, it is not always possible to recombine all the paths at the final point of the network and thus neither the best nor alternative matches are directly obtainable from the information available at the exit point of the network. One solution to the problem of producing a best match is discussed in "Token Passing:
a Simple Conceptual Model for Connected Speech Recognition Sys terns " by S. J. Young, N. H. Rus s el l and J. H. S.
Thornton 1989, which relates to passing packets of information, known as tokens, through a transition network.
A token contains information relating to the partial path travelled as well as an accumulated score indicative of the degree of similarity between the input and the portion of the network processed thus far.
As described by Young et al, at each input of a frame of speech to a transition network, any tokens that are present at the input of a node are passed into the node and the current frame of speech matched within the word models, associated with those nodes. New tokens then appear at the output of the nodes (having "travelled" through the model associated with the node). Only the best scoring token is then passed onto the inputs of the following nodes. When the end of speech has been signalled (by an external device such as a pause detector), a single token will be present at the final node. From this token the entire path through the network can be extracted by tracing back along the path by means of the previous path information contained within the token to provide the best match to the input speech.
The article "A unified direction mechanism for automatic speech recognition using Hidden Markov Models" by S.C. Austin and F. Fallside, ICASSP
1989, Vol. 1, pages 667-670, relates to a connected word speech recogniser which operates in a manner similar to that described by Young et al, as described above. A history relating to the progress of the recognition through the transition network is updated on exiting the word model. At the end of recognition, the result of recognition is derived from the history presented to the output which has the best score. Again only one history is possible for each path terminating at the final node.
Such known arrangements do not allow for an alternative choice to be readily available at the output of the network.
In accordance with the invention a speech recognition apparatus comprises means for deriving a recognition feature vector from an input speech signal for each predetermined time frame; means for modelling expected input speech comprising a plurality of vocabulary nodes each of which has an associated word representation model and links between said vocabulary nodes; processing means for comparing the recognition feature vectors with the modelled input speech and for generating a path link for each node and time frame, said path links indicating the most likely prior sequence of vocabulary nodes for each vocabulary node and time frame, each path link comprising a field for storing an accumulated recognition score and a field for storing a reference to the most likely previous path link in the sequence; and means for indicating recognition of the input speech signal in dependence upon the comparison; characterised in that the processing means (351 ) is capable of processing more than one path link for at least one vocabulary node, other than the final node, in a single time frame.
Such an arrangement means that more than one incoming path link can be processed by a node in a single time frame and hence that more than one recognition result may be obtained.
Each node has associated path links comprising fields for storing a pointer to the previous path link, an accumulated score for a path, a pointer to a previous node and a time index for segmentation information. Preferably, the vocabulary nodes which may have more than one path link processed in a single time frame have more than one identical associated word representation model.
The provision that at least one of the vocabulary nodes other than the final node of the network has more than one associated word representation model allows the processor to process multiple paths for the same time frame and so allows more than one path link to be propagated across each inter-node link at each input frame. In effect, the invention creates multiple layers of a transition network along which several alternative paths may be propagated. The best scoring path may be used by the first model of a node, the next best by the second and so on until either parallel models or incoming paths run out.
In general terms "network" includes directed acyclic graphs (DAGs1 and trees. A DAG is a network with no cycles and a tree is. a network in which the only meeting of paths occurs conceptually right at the end of the network.
The term "word" here denotes a basic recognition unit, which may be a word but equally well may be a diphone, phoneme, allophone, etc. Recognition is the process of matching an unknown utterance with a predefined transition network, the network having been designed to be compatible with what a user is likely to say.
In order to identify the phrase that has been recognised, the apparatus may include means for tracing the path link back through the network.
Alternatively, the apparatus may also include means for assigning a signature to at least some of the nodes having associated word representation models and means for comparing the signature of each path, to determine the path with the best match to the input speech and that with the second best alternative match.
This arrangement allows for an alternative which is necessarily different in character to the best match and does not differ merely in segmentation or noise matches.
5 The word representation models may be Hidden Markov Models (HMMs) as described generally in British Telecom Technology Journal, April 1988, Vol, 6, no. 2, page 105: Cox, "Hidden Markov Models for automatic speech recognition:
theory and application", templates, Dynamic Time Warping models or any other suitable word representation model. The processing which occurs within a model is irrelevant as far as this invention is concerned.
It is not necessary for all the nodes having associated word models to have a signature assigned to them. Depending on the structure of the transition network, it may be sufficient only to assign signatures to those nodes which appear before a decision point within a network. A decision point as used herein relates to a point in the network which has more than one incoming path.
Partial paths may be examined at certain decision points in the network, certain constraints being imposed at these decision points so that only paths conforming to the constraints are propagated, as described in the applicants' International patent application filed on 31 st March 1994 entitled "Connected Speech Recognition", No WO/23425, published 13th October 1994.. Each decision point is associated with a set of valid signatures and any path links with signatures that are not in the set are discarded.
The accumulated signature may be used to identify the complete path, resulting in extra efficiency of operation as the path links need not be traversed to determine the path identity, and the partial path information of the token may not be generated at all. In this case the signature field must be large enough to identify all paths uniquely.
For efficient operation of the apparatus according to the invention, the signal processing of path signatures is preferably carried out in a single operation to increase processing speed.
Other aspects and preferred embodiments of the invention are as disclosed and claimed herein, with advantages that will be apparent hereafter.
The invention will now be described further, by way of example only, with reference to the accompanying drawings in which:
Figure 1 shows schematically the employment of a recognition processor according to the invention in a telecommunications environment;
Figure 2 is a block diagram showing schematically the functional elements of a recognition processor according to the invention;
Figure 3 is a block diagram indicating schematically the components of a classifier forming part of Figure 2;
Figure 4 is block diagram showing schematically the structure of a sequence parser forming part of the embodiment of Figure 2;
Figure 5 shows schematically the content of a field within a store forming part of Figure 4;
Figure 6 is a schematic representation of one embodiment of a transition network applicable with the processor of the sequence parser of Figure 4;
Figure 7a shows a node of a network and Figure 7b shows a path link as employed according to the invention;
Figures 8 to 10 show the progression of path links through the network of Figure 6;
Figure 11 is a schematic representation of a second embodiment of a transition network of a apparatus according to the invention;
Figure 12 is a schematic representation of a third embodiment of a transition network of an apparatus according to the invention.
Referring to Figure 1, a telecommunications system including speech recognition generally comprises a microphone 1, typically forming part of a telephone handset, a telecommunications network (typically a public switched telecommunications network (PSTIV1) 2, a recognition processor 3, connected to receive a voice signal from the network 2, and a utilising apparatus 4 connected to the recognition processor 3 and arranged to receive therefrom a voice recognition signal, indicating recognition or otherwise of a particular word or phrase, and to take action in response thereto. For example, the utilising apparatus 4 may be a remotely operated banking terminal for effecting banking transactions.
In many cases, the utilising apparatus 4 will generate an auditory response to the speaker, transmitted via the network 2 to a loudspeaker 5 typically forming a part of the subscriber handset.
In operation, a speaker speaks into the microphone 1 and an analog speech signal is transmitted from the microphone 1 into the network 2 to the recognition processor 3, where the speech signal is analysed and a signal indicating identification or otherwise of a particular word or phrase is generated and transmitted to the utilising apparatus 4, which then takes appropriate action in the event of recognition of the speech.
Typically, the recognition processor needs to acquire data concerning the speech against which to verify the speech signal, and this data acquisition may be performed by the recognition processor in a second mode of operation in which the recognition processor 3 is not connected to the utilising apparatus 4, but receives a speech signal from the microphone 1 to form the recognition data for that word or phrase. However, other methods of acquiring the speech recognition data are also possible.
_ g _ Typically, the recognition processor 3 is ignorant of the route taken by the signal from the microphone 1 to and through the network 2; any one of a large variety of types and qualities of receiver handset. Likewise, withi n the network 2, any one of a large variety of transmission paths may be taken, including radio links, analog and digital paths and so on. Accordingly, the speech signal Y reaching , the recognition proces~.or 3 corresponds to the speech signal S received at the microphone l, convolved with the transfer characteristics of the microphone 1, link to network 2, channel through the network 2, and link to the recognition processor 3, which may be lumped and designated by a single transfer characteristic H.
Referring to Figure 2, the recognition processor 3 comprises an input 31 for receiving speech in digital form (either from a digital network or from an analog to digital converter?. a frame processor 32 for partioning the succession of digital samples into a succession of frames of contiguous samples; a feature extractor 33 for generating from a frame of samples a corresponding feature vector; a classifier 34 receiving the succession of feature vectors and operating on each with a plurality of model states, to generate recognition results; a sequencer 35 which is arranged to receive the classification results form the classifier 34 and to determine the predetermined utterance to which the sequence of classifier output indicates the greatest similarity; and an output port 38 at which a recognition signal is supplied indicating the speech utterance which has been recognised.
The frame generator 32 is arranged to receive speech samples at a rate of, for example, 8,000 samples per second, and to form frames comprising 256 contiguous samples, at a frame rate of 1 frame every l6ms.
Preferably, each frame is windowed (i. e. the samples towards the edge of the frame are multiplied by _ g _ predetermined weighting constants) using, for example, a Hamming window to reduce spurious artifacts, generated by the frames edges. In a preferred embodiment, the frames are overlapping ( for example by 50% ) so as to ameliorate the effects of the windowing.
Feature Extractor 33 The feature extractor 33 receives frames from the frame generator 32 and generates, in each case, a set or vector of features. The features may, for example, comprise cepstral coefficients (for example, LPC cepstral coefficients or mel frequency cepstral coefficients as described in "On the Evaluation of Speech Recognisers and Databases using a Reference System", Chollet & Gagnoulet, 1982 proc. IEEE p2026), or differential values of such coefficients comprising, for each coefficient, the differences between the coefficient and the corresponding coefficient value in the preceding vector, as described in "On the use of Instantaneous and Transitional Spectral Information in Speaker Recognition", Soong & Rosenberg, 1988 IEEE Trans. on Acoustics, Speech and Signal Processing Vol 36 No. 6 p871. Equally, a mixture of several types of feature coefficient may be used.
The feature extractor 33 outputs a frame number, incremented for each successive frame. The output of the feature extractor 33 is also passed to an end pointer 36, the output of which is connected to the classifier 34. The end pointer 36 detects the end of speech and various types are well known in this field.
The frame generator 32 and feature extractor 33 are, in this embodiment, provided by a single suitably programmed digital signal processor (DSP) device (such as the Motorola DSP 56000, or the Texas Instruments TMS C 320) or similar device.
Referring to Figure 3, in this embodiment, the classifier 34 comprises a classifying processor 341 and a state memory 342.
The state memory 342 comprises a state field 3421, 3422, ...., for each of the plurality of speech states.
For example, each allophone to be recognised by the recognition processor comprises 3 states, and accordingly , 3 state fields are provided in the state memory 342 for each allophone.
The classification processor 34 is arranged to read each state field within the memory 342 in turn, and calculate for each, using the current input feature coefficient set, the probability that the input feature set or vector corresponds to the corresponding state.
Accordingly, the output of the classification processor is a plurality of state probabilities P, one for each state in the state memory 342, indicating the likelihood that the input feature vector corresponds to each state.
The classifying processor 341 may beY~a suitably programmed digital signal processing (DSP) device, may in particular be the same digital signal processing device as the feature extractor 33.
~gq~ssncer 35 Referring to Figure 4, the sequencer 35 in this embodiment comprises a state sequence memory 352, a parsing processor 351, and a sequencer output buffer 354.
Also provided is a state probability memory 353 which stores, for each frame processed, the state probabilities output by the classifier processor 341. The state sequence memory 352 comprises a plurality of state sequence fields 3521, 3522, ...., each corresponding to a word or phase sequence to be recognised consisting of a string of allophones.
Each state sequence in the state sequence memory 352 comprises, as illustrated in Figure 5, a number of states - il -Pi, P2, . . . PN (where N is a multiple of 3 ) and, for each state, two probabilities; a repeat probability (P~1) and a transition probability to the following state (P~2). The states of the sequence are a plurality of groups of three states each relating to a single allophone. The observed sequence of states associated with a series of frames may ~ therefore comprise several repetitions of each state P~ in each state sequence model 3521.etc; for example:
Frame Number 1 2 3 4 5 6 7 8 9 ... Z Z+1 State P1 P1 P1 P2 P2 P2 P2 P2 P2 ... Pn Pn The parsing processor 351 is arranged to read, at each frame, the state probabilities output by the classifier processor 341, and the previous stored state probabilities in the state probability memory 353, and to calculate the most likely path of states to date over time, and to e0m~are this with each of the state sequences stored in the 5~1-e sequence memory 352.
The calculation employs the well known HMM, as discussed in the above referenced Cox paper. Conveniently, the HMM processing performed by the parsing processor 351 uses the well. known Viterbi algorithm. The parsing processor 351 may, for example, be a microprocessor such as the Intel~TM~ i-486~TM~ microprocessor or the Motorola~TM~
microprocessor, or may alternatively be a DSP device (for example, the same DSP device as is employed for any of the preceding processors).
Accordingly for each state sequence (corresponding to a word, phrase or other speech sequence to be recognised) a probability score is output by the parser processor 351 at each frame of input speech. For example the state sequences may comprise the names in a telephone directory.
When the end of the utterance is detected, a label signal indicating the most probable state sequence is output from the parsing processor 351 to the output port 38, to indicate that the corresponding name, word or phrase has been recognised.
The parsing processor 351 comprises a network which is specifically configured to recognise certain phrases or words, for example a string of digits.
Figure 6 shows a simple transition network for recognising a string of words, in this case either a string of four words or a string of three. Each node 12 of the network is associated with a word representation model 13, for example an HMM, which is stored in a model list. Several nodes can be associated with each model, and each node includes a pointer to its associated model (as can be seen in Figures 6 and 7a). In order to produce a best match and a single alternative parse, the final node 14 is associated with two models, allowing this node to process two paths. If n parses are required, the final node 14 of the network is associated with n identical word models.
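A possible in-memory representation of such a network is sketched below, assuming nodes simply hold pointers into a shared model list; the structures and names are illustrative only, not the patent's own:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    model_ids: List[int]                              # pointers into the model list
    successors: List["Node"] = field(default_factory=list)

# Model list shared by all nodes; several nodes may point at one model.
model_list = ["word_model_A", "word_model_B", "word_model_C"]

first = Node(model_ids=[0])
middle = Node(model_ids=[1])
final = Node(model_ids=[2, 2])    # two identical models -> the node can hold two parses
first.successors = [middle]
middle.successors = [final]
```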
As shown in Figure 7b, a path link 15 contains a pointer to the previous path link, an accumulated score, a pointer to the node previously exited, and a time index. At the start of an utterance, an empty path link 15' is inserted into the first node 16 as shown in Figure 8. The first node now contains a path link and is therefore active, whereas the remaining nodes are inactive. At each clock tick (i.e. with each incoming frame of speech) any active nodes accumulate a score in their path link.
If the first model can match, say, a minimum of seven frames of speech, then at the seventh clock pulse a path link 15" is output from the first node with the score for matching the seven frames to the model and pointers to the entry path link and the node just matched. The path link is fed to all of the following nodes 12, as shown in Figure 9. Now the first three nodes are active. The input frame of speech is then matched in the models associated with the active nodes and new path links are output.
This processing continues, with the first node 16 producing further path links as its model matches increasingly longer parts of the utterance and the succeeding nodes performing similar calculations.
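The following sketch shows one plausible encoding of the path-link record of Figure 7b and of the link a node emits once its model has matched up to a given frame; the field names and helper functions are assumptions for illustration, since the patent specifies only the information each link carries:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PathLink:
    prev: Optional["PathLink"]  # pointer to the previous path link
    score: float                # accumulated recognition score
    node_exited: Optional[int]  # node whose model was just matched
    time_index: int             # frame at which the link was created

def start_of_utterance() -> PathLink:
    """Empty path link inserted into the first node (Figure 8)."""
    return PathLink(prev=None, score=0.0, node_exited=None, time_index=0)

def emit_link(entry: PathLink, node_id: int, model_score: float, t: int) -> PathLink:
    """New link output by a node once its model has matched up to frame t."""
    return PathLink(prev=entry, score=entry.score + model_score,
                    node_exited=node_id, time_index=t)
```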
When the input speech has been processed as far as the final node 18 of the network, path links from each 'branch' of the network may be presented to this node 18. If, at any given time frame, there is a single path link (i.e. only one of the parallel paths has been completed), that path link is taken to be the best (and only) match and is processed by the final node 18. However, if there are two path links presented to the final node 18, both are processed by that node, since the final node 18 is able to process more than one path. The output path links are continuously updated at each frame of speech. When the utterance is completed there will be two path links 15"' at the output of the network, as shown in Figure 10 (from which the pointers to previous path links and nodes have been excluded for the sake of clarity).
The full path can be found by following the pointers to the previous path links, and the nodes on the recognised path (and hence the input speech deemed to be recognised) can be identified by looking at the pointers to the nodes exited.
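Tracing back might look like the following sketch, which assumes the PathLink record from the earlier sketch (any object with prev and node_exited fields would do):

```python
def trace_back(link):
    """Follow prev pointers from a final path link back to the start."""
    nodes_exited = []
    while link is not None:
        if link.node_exited is not None:   # the empty start link has no node
            nodes_exited.append(link.node_exited)
        link = link.prev
    nodes_exited.reverse()                 # pointers run end-to-start
    return nodes_exited
```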
Figure 11 represents a second embodiment of a network configured to recognise strings of three digits. The grey nodes 22 are null nodes in the network; the white nodes are active nodes, which may be divided into vocabulary nodes 24 with associated word representation models (not shown) for matching incoming speech, and noise nodes 25 which represent arbitrary noise.
If all of the active nodes 24, 25 after and including the third null node 22' are each capable of having three paths for each time frame (i.e. each vocabulary node 24 is associated with three word representation models), the output of the network will comprise path links relating to the three top scoring paths of the system. As described with reference to Figures 8 to 10, the three paths can be found by following, for each path, the pointers to the previous path links. The nodes on the paths (and hence the input speech deemed to be recognised) can be identified by looking at the pointers to the exited nodes.
In a further development of the invention, the path links may be augmented with signatures which represent the significant nodes of the network. These significant nodes may, for example, include all vocabulary nodes 24. In the embodiment of Figure 11, each vocabulary node 24 is assigned a signature; for example, the nodes representing the digit 1 are assigned a signature '1', the nodes 24" representing the digit 2 are assigned a signature '2', and so on.
At the start of parsing, a single empty path link is passed into a network entry node 26. Since this is a null node, the path link is passed to the next node, a noise node 25. The input frame is matched in the noise model (not shown) of this node and an updated path link produced at the output. This path link is then passed to the next active nodes, i.e. the first vocabulary nodes 24 having an associated model (not shown). Each vocabulary node 24 processes the frame of speech in its associated word model and produces an updated path link. The signature field of the path link is also updated. At the end of each time frame, the updated path links are sorted to retain the three (n) top scoring paths which have different signature fields. A list ordered by score is maintained, with the added constraint that accumulated signatures are unique: if a second path link with the same signature enters, the better of the two is retained. The list contains only the top 'n' different paths, the rest being ignored.
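A minimal sketch of this sort-and-prune step, assuming each candidate is a (score, signature, link) tuple and signatures are hashable values such as digit strings; the representation is an assumption, not the patent's:

```python
def prune_links(links, n=3):
    """Keep the n top-scoring links with mutually different signatures."""
    best_by_sig = {}
    for score, sig, link in links:
        kept = best_by_sig.get(sig)
        if kept is None or score > kept[0]:
            best_by_sig[sig] = (score, sig, link)  # better of two same-signature links
    ranked = sorted(best_by_sig.values(), key=lambda x: x[0], reverse=True)
    return ranked[:n]  # only the top n different paths; the rest are ignored

# e.g. with the Path A/B/C situation discussed below:
# prune_links([(9.1, "121", "A"), (8.7, "121", "B"), (8.2, "211", "C")])
# drops B (same signature as A, lower score) and returns A then C.
```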
Then path links are propagated through the next null node 22' to the following noise node 25 and vocabulary nodes 24", each of which is associated with three identical word representation models. After this, model processing takes place, resulting in the updating of the lists of path links and the extending of the paths into further nodes 24"', 25. It should be clear that the signature fields of the path links are not updated after processing by the null nodes 22 or the noise nodes 25, since these nodes do not have assigned signatures.
The path links are propagated along paths which pass through the remaining active nodes to produce, at an output node 28, up to three path links indicating the relative scores and signatures, for example 1 2 1, of the paths taken through the network. The path links are continuously updated until the end of speech is detected (for example by an external device such as a pause detector, or until a time-out is reached). At this point, the pointers or the accumulated signatures of the path links at the output node 28 are examined to determine the recognition results.
For example, presuming that the following three path links are presented to the output node 28 at some time instant:
SCORE SIGNATURE
Path A, the highest scoring path, is the best match.
However, although path B has the second best score, it would be rejected as an alternative parse since its signature, and hence the speech deemed to be recognised, is the same as path A. Path C would therefore be retained as the second best parse.
If the strings to be recognised have more structure than that discussed above, for example spelt names, signatures need only be assigned to nodes just before decision points, rather than at every vocabulary node.
Figure 12 shows a network for recognising the spelling of the names "Phil", "Paul" and "Peter". For simplicity, no noise is illustrated. The square nodes 44 indicate where the signature should be augmented.
The system can distinguish between the 'PHI' and 'PAU' paths at the 'L' node because the signatures of the path links created at the previous nodes are different. The following node 47 will be able to distinguish between all three independent paths since the signatures of the square nodes 44 are different.
Only the 'L' node and the final noise node 48 need to be associated with more than one identical word model, so that these models are capable of having more than one path for a single time frame.
In all cases, each network illustrating the speech to be recognised requires analysis to determine which nodes are to be assigned signatures. In addition the network is configured to be compatible with what a user is likely to say.
Savings in memory size and processing speed may be achieved by limiting the signatures that a node will propagate, as described in the applicants' International patent application filed on 31st March 1994 entitled "Connected Speech Recognition", No. WO 94/23425, published 13th October 1994.

For instance, suppose the only valid input speech to a recogniser having the network of Figure 11 is one of the four digit strings 111, 112, 121 and 211. Certain nodes within the network are associated with a set of valid signatures, and a path will only be propagated by such a 'constrained' node if a path link having one of these signatures is presented. To achieve this, the signature fields of the path links entering a constrained node, e.g. the third null node 22', are examined. If the signature field contains a signature other than 1 or 2, the path link is discarded and the path is not propagated any further. If an allowable path link is presented, it is passed on to the next node. The next constrained node is the null node 22" after the next vocabulary nodes; this null node is constrained to propagate only path links having a signature 11, 12 or 21. The null node 22"' after the next vocabulary nodes is constrained to propagate only path links having the signature 111, 112, 121 or 211. Such an arrangement significantly reduces the necessary processing and allows for a saving in the memory capacity of the apparatus. Only some of the nodes at decision points in the network need to be so constrained. In practice, a 32-bit signature has proved to be suitable for sequences of up to 9 digits, and a larger signature appears suitable for a 12-character alphanumeric string.
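A sketch of such a constrained node's filtering step, assuming signatures are accumulated as digit strings and each constrained null node holds the set of signatures it will pass on; the names and representation are illustrative:

```python
def propagate_constrained(path_links, valid_signatures):
    """Forward only those links whose accumulated signature is allowed."""
    return [link for link in path_links if link["signature"] in valid_signatures]

# Constraint sets for the example of valid strings 111, 112, 121 and 211:
constraints = {
    "null_22'":   {"1", "2"},                    # after the first vocabulary nodes
    "null_22''":  {"11", "12", "21"},            # after the second
    "null_22'''": {"111", "112", "121", "211"},  # after the third
}

links = [{"signature": "12", "score": 8.7}, {"signature": "22", "score": 9.0}]
print(propagate_constrained(links, constraints["null_22''"]))  # keeps only "12"
```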
End of speech detection and various other aspects of speech recognition relevant to the present invention are more fully set out in the applicants' International Patent Application filed on 25th March 1994 entitled "Speech Recognition", No WO 94/22131, published 29th September 1994.
In the above described embodiments, recognition processing apparatus suitable to be coupled to a telecommunications exchange has been described.
However, in another embodiment, the invention may be embodied in simple apparatus connected to a conventional subscriber station (mobile or fixed) connected to the telephone network; in this case, analog-to-digital conversion means may be provided for digitising the incoming analog telephone signal.
Claims (19)
1. A speech recognition apparatus comprising:
means for deriving a recognition feature vector from an input speech signal for each predetermined time frame;
means for modelling expected input speech comprising a plurality of vocabulary nodes each of which has an associated word representation model and links between said vocabulary nodes;
processing means for comparing the recognition feature vectors with the modelled input speech and for generating a path link for each node and time frame, said path links indicating the most likely prior sequence of vocabulary nodes for each vocabulary node and time frame, each path link comprising a field for storing an accumulated recognition score and a field for storing a reference to the most likely previous path link in the sequence; and means for indicating recognition of the input speech signal in dependence upon the comparison;
characterised in that the processing means (351) is capable of processing more than one path link for at least one vocabulary node, other than the final node, in a single time frame.
2. A speech recognition apparatus according to Claim 1 characterised in that at least one of the vocabulary nodes is associated with more than one identical word representation model.
3. A speech recognition apparatus according to Claim 2 characterised in that the word representation models are Hidden Markov Models.
4. A speech recognition apparatus according to any one of Claims 1, 2 or 3 characterised in that all the vocabulary nodes have signatures assigned to them.
5. A speech recognition apparatus according to any one of claims 1, 2 or 3 characterised in that only the vocabulary nodes occurring before a decision point have signatures assigned to them.
6. A speech recognition apparatus according to either Claim 4 or 5 characterised in that the path links include an accumulated signature.
7. A speech recognition apparatus according to any one of Claims 4, 5 or 6 characterised in that at least some of the nodes are constrained only to propagate path links having certain predetermined signatures.
8. A speech recognition apparatus according to any one of Claims 4 to 7 characterised in that the recognition indicating means includes means for comparing the score and signature of the path links to determine the path with the best match to the input connected speech and those with the next best alternative matches.
9. A method of speech recognition comprising:
deriving a recognition feature vector from an input speech signal for each predetermined time frame;
modelling expected input speech;
comparing the feature data with the modelled input speech by generating a network containing a plurality of vocabulary nodes associated with word representation models and generating a path link for each node and time frame, the path link indicating the most likely prior sequence of vocabulary nodes for each vocabulary node and time frame, each path link comprising a field for storing an accumulated recognition score and a field for storing a reference to the most likely previous path link in the sequence;
indicating recognition of the speech in dependence upon the comparison; characterised in that more than one path link is processed in a single time frame for at least one vocabulary node other than the final node.
10. A method according to Claim 9 characterised in that the at least one of the vocabulary nodes is associated with more than one identical word representation model.
11. A method according to Claim 10 characterised in that the at least one of the vocabulary nodes is associated with a number of identical word representation models equal to the number of desired recognition results.
12. A method according to either Claim 10 or 11 characterised in that the scores of the path links are compared at each decision point of the network, only the n top scoring path links being propagated to the next node(s).
13. A method according to any one of claims 10, 11 or 12 characterised by assigning signatures to all the vocabulary nodes.
14. A method according to any one of claims 10, 11 or 12 characterised by only assigning signatures to vocabulary nodes occurring before a decision point in the network.
15. A method according to either Claim 13 or 14 when appended to Claim 12 characterised in that the signatures of the path links are also compared, only path links having different signatures being propagated to the next node(s).
16. A method according to any one of Claims 13, 14 or 15 characterised by constraining at least some nodes only to pass path links having certain predetermined signatures in their signature fields.
17. A method according to any one of Claims 10 to 16 characterised in that the input speech signal deemed to be recognised is determined by tracing the path links back through the network.
18. A method according to any one of Claims 13 to 16 characterised in that the input speech signal deemed to be recognised is determined by the accumulated signature of each path link.
19. A method according to any one of Claims 10 to 18 characterised in that the best scoring path link is processed by the first word representation model of a vocabulary node, the next best by the second and so on until either parallel models or incoming path links run out.
Applications Claiming Priority (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP93302538.9 | 1993-03-31 | ||
EP93302538 | 1993-03-31 | ||
EP93304993 | 1993-06-25 | ||
EP93304993.4 | 1993-06-25 | ||
PCT/GB1994/000704 WO1994023424A1 (en) | 1993-03-31 | 1994-03-31 | Speech processing |
Publications (2)
Publication Number | Publication Date |
---|---|
CA2158064A1 CA2158064A1 (en) | 1994-10-13 |
CA2158064C true CA2158064C (en) | 2000-10-17 |
Family
ID=26134252
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CA002158064A Expired - Lifetime CA2158064C (en) | 1993-03-31 | 1994-03-31 | Speech processing |
Country Status (12)
Country | Link |
---|---|
JP (1) | JPH08508350A (en) |
KR (1) | KR100309205B1 (en) |
CN (1) | CN1196104C (en) |
AU (1) | AU682177B2 (en) |
CA (1) | CA2158064C (en) |
DE (1) | DE69416670T2 (en) |
FI (1) | FI954572A (en) |
HK (1) | HK1014390A1 (en) |
NO (1) | NO308756B1 (en) |
NZ (1) | NZ263223A (en) |
SG (1) | SG47716A1 (en) |
WO (1) | WO1994023424A1 (en) |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7680652B2 (en) | 2004-10-26 | 2010-03-16 | Qnx Software Systems (Wavemakers), Inc. | Periodic signal enhancement system |
US7716046B2 (en) | 2004-10-26 | 2010-05-11 | Qnx Software Systems (Wavemakers), Inc. | Advanced periodic signal enhancement |
US7725315B2 (en) | 2003-02-21 | 2010-05-25 | Qnx Software Systems (Wavemakers), Inc. | Minimization of transient noises in a voice signal |
US7885420B2 (en) | 2003-02-21 | 2011-02-08 | Qnx Software Systems Co. | Wind noise suppression system |
US7895036B2 (en) | 2003-02-21 | 2011-02-22 | Qnx Software Systems Co. | System for suppressing wind noise |
US7949520B2 (en) | 2004-10-26 | 2011-05-24 | QNX Software Sytems Co. | Adaptive filter pitch extraction |
US7949522B2 (en) | 2003-02-21 | 2011-05-24 | Qnx Software Systems Co. | System for suppressing rain noise |
US8073689B2 (en) | 2003-02-21 | 2011-12-06 | Qnx Software Systems Co. | Repetitive transient noise removal |
US8170879B2 (en) | 2004-10-26 | 2012-05-01 | Qnx Software Systems Limited | Periodic signal enhancement system |
US8209514B2 (en) | 2008-02-04 | 2012-06-26 | Qnx Software Systems Limited | Media processing system having resource partitioning |
US8271279B2 (en) | 2003-02-21 | 2012-09-18 | Qnx Software Systems Limited | Signature noise removal |
US8306821B2 (en) | 2004-10-26 | 2012-11-06 | Qnx Software Systems Limited | Sub-band periodic signal enhancement system |
US8311819B2 (en) | 2005-06-15 | 2012-11-13 | Qnx Software Systems Limited | System for detecting speech with background voice estimates and noise estimates |
US8326621B2 (en) | 2003-02-21 | 2012-12-04 | Qnx Software Systems Limited | Repetitive transient noise removal |
US8543390B2 (en) | 2004-10-26 | 2013-09-24 | Qnx Software Systems Limited | Multi-channel periodic signal enhancement system |
US8554564B2 (en) | 2005-06-15 | 2013-10-08 | Qnx Software Systems Limited | Speech end-pointer |
US8694310B2 (en) | 2007-09-17 | 2014-04-08 | Qnx Software Systems Limited | Remote control server protocol system |
US8850154B2 (en) | 2007-09-11 | 2014-09-30 | 2236008 Ontario Inc. | Processing system having memory partitioning |
US8904400B2 (en) | 2007-09-11 | 2014-12-02 | 2236008 Ontario Inc. | Processing system having a partitioning component for resource partitioning |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
DE59601778D1 (en) * | 1995-03-07 | 1999-06-02 | Siemens Ag | METHOD FOR DETECTING AT LEAST ONE DEFINED PATTERN MODELED BY HIDDEN-MARKOV MODELS IN A TIME-VARIANT MEASURING SIGNAL, WHICH IS OVERLAYED BY AT LEAST ONE INTERFERENCE SIGNAL |
US7117149B1 (en) | 1999-08-30 | 2006-10-03 | Harman Becker Automotive Systems-Wavemakers, Inc. | Sound source classification |
US8284947B2 (en) | 2004-12-01 | 2012-10-09 | Qnx Software Systems Limited | Reverberation estimation and suppression system |
US8027833B2 (en) | 2005-05-09 | 2011-09-27 | Qnx Software Systems Co. | System for suppressing passing tire hiss |
US7844453B2 (en) | 2006-05-12 | 2010-11-30 | Qnx Software Systems Co. | Robust noise estimation |
US8326620B2 (en) | 2008-04-30 | 2012-12-04 | Qnx Software Systems Limited | Robust downlink speech and noise detector |
US8335685B2 (en) | 2006-12-22 | 2012-12-18 | Qnx Software Systems Limited | Ambient noise compensation system robust to high excitation noise |
CN103035243B (en) * | 2012-12-18 | 2014-12-24 | 中国科学院自动化研究所 | Real-time feedback method and system of long voice continuous recognition and recognition result |
CN105913848A (en) * | 2016-04-13 | 2016-08-31 | 乐视控股(北京)有限公司 | Path storing method and path storing system based on minimal heap, and speech recognizer |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4980918A (en) * | 1985-05-09 | 1990-12-25 | International Business Machines Corporation | Speech recognition system with efficient storage and rapid assembly of phonological graphs |
DE3750199T2 (en) * | 1986-06-02 | 1995-01-19 | Motorola Inc | Continuous speech recognition system. |
US5388183A (en) * | 1991-09-30 | 1995-02-07 | Kurzweil Applied Intelligence, Inc. | Speech recognition providing multiple outputs |
-
1994
- 1994-03-31 AU AU63829/94A patent/AU682177B2/en not_active Expired
- 1994-03-31 NZ NZ263223A patent/NZ263223A/en unknown
- 1994-03-31 KR KR1019950704196A patent/KR100309205B1/en not_active IP Right Cessation
- 1994-03-31 DE DE69416670T patent/DE69416670T2/en not_active Expired - Lifetime
- 1994-03-31 CN CNB941916529A patent/CN1196104C/en not_active Expired - Lifetime
- 1994-03-31 SG SG1996004023A patent/SG47716A1/en unknown
- 1994-03-31 CA CA002158064A patent/CA2158064C/en not_active Expired - Lifetime
- 1994-03-31 WO PCT/GB1994/000704 patent/WO1994023424A1/en active IP Right Grant
- 1994-03-31 JP JP6521853A patent/JPH08508350A/en not_active Ceased
-
1995
- 1995-09-27 FI FI954572A patent/FI954572A/en unknown
- 1995-09-29 NO NO953895A patent/NO308756B1/en not_active IP Right Cessation
-
1998
- 1998-12-24 HK HK98115660A patent/HK1014390A1/en not_active IP Right Cessation
Cited By (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8271279B2 (en) | 2003-02-21 | 2012-09-18 | Qnx Software Systems Limited | Signature noise removal |
US7725315B2 (en) | 2003-02-21 | 2010-05-25 | Qnx Software Systems (Wavemakers), Inc. | Minimization of transient noises in a voice signal |
US8612222B2 (en) | 2003-02-21 | 2013-12-17 | Qnx Software Systems Limited | Signature noise removal |
US8374855B2 (en) | 2003-02-21 | 2013-02-12 | Qnx Software Systems Limited | System for suppressing rain noise |
US7895036B2 (en) | 2003-02-21 | 2011-02-22 | Qnx Software Systems Co. | System for suppressing wind noise |
US8326621B2 (en) | 2003-02-21 | 2012-12-04 | Qnx Software Systems Limited | Repetitive transient noise removal |
US7949522B2 (en) | 2003-02-21 | 2011-05-24 | Qnx Software Systems Co. | System for suppressing rain noise |
US8073689B2 (en) | 2003-02-21 | 2011-12-06 | Qnx Software Systems Co. | Repetitive transient noise removal |
US7885420B2 (en) | 2003-02-21 | 2011-02-08 | Qnx Software Systems Co. | Wind noise suppression system |
US8165875B2 (en) | 2003-02-21 | 2012-04-24 | Qnx Software Systems Limited | System for suppressing wind noise |
US9373340B2 (en) | 2003-02-21 | 2016-06-21 | 2236008 Ontario, Inc. | Method and apparatus for suppressing wind noise |
US8306821B2 (en) | 2004-10-26 | 2012-11-06 | Qnx Software Systems Limited | Sub-band periodic signal enhancement system |
US8150682B2 (en) | 2004-10-26 | 2012-04-03 | Qnx Software Systems Limited | Adaptive filter pitch extraction |
US7716046B2 (en) | 2004-10-26 | 2010-05-11 | Qnx Software Systems (Wavemakers), Inc. | Advanced periodic signal enhancement |
US8543390B2 (en) | 2004-10-26 | 2013-09-24 | Qnx Software Systems Limited | Multi-channel periodic signal enhancement system |
US7949520B2 (en) | 2004-10-26 | 2011-05-24 | QNX Software Sytems Co. | Adaptive filter pitch extraction |
US7680652B2 (en) | 2004-10-26 | 2010-03-16 | Qnx Software Systems (Wavemakers), Inc. | Periodic signal enhancement system |
US8170879B2 (en) | 2004-10-26 | 2012-05-01 | Qnx Software Systems Limited | Periodic signal enhancement system |
US8554564B2 (en) | 2005-06-15 | 2013-10-08 | Qnx Software Systems Limited | Speech end-pointer |
US8311819B2 (en) | 2005-06-15 | 2012-11-13 | Qnx Software Systems Limited | System for detecting speech with background voice estimates and noise estimates |
US8457961B2 (en) | 2005-06-15 | 2013-06-04 | Qnx Software Systems Limited | System for detecting speech with background voice estimates and noise estimates |
US8850154B2 (en) | 2007-09-11 | 2014-09-30 | 2236008 Ontario Inc. | Processing system having memory partitioning |
US8904400B2 (en) | 2007-09-11 | 2014-12-02 | 2236008 Ontario Inc. | Processing system having a partitioning component for resource partitioning |
US9122575B2 (en) | 2007-09-11 | 2015-09-01 | 2236008 Ontario Inc. | Processing system having memory partitioning |
US8694310B2 (en) | 2007-09-17 | 2014-04-08 | Qnx Software Systems Limited | Remote control server protocol system |
US8209514B2 (en) | 2008-02-04 | 2012-06-26 | Qnx Software Systems Limited | Media processing system having resource partitioning |
Also Published As
Publication number | Publication date |
---|---|
AU6382994A (en) | 1994-10-24 |
AU682177B2 (en) | 1997-09-25 |
CN1196104C (en) | 2005-04-06 |
DE69416670T2 (en) | 1999-06-24 |
NO308756B1 (en) | 2000-10-23 |
FI954572A0 (en) | 1995-09-27 |
JPH08508350A (en) | 1996-09-03 |
DE69416670D1 (en) | 1999-04-01 |
SG47716A1 (en) | 1998-04-17 |
CA2158064A1 (en) | 1994-10-13 |
FI954572A (en) | 1995-09-27 |
HK1014390A1 (en) | 1999-09-24 |
NO953895L (en) | 1995-11-28 |
KR100309205B1 (en) | 2001-12-17 |
NO953895D0 (en) | 1995-09-29 |
NZ263223A (en) | 1997-11-24 |
WO1994023424A1 (en) | 1994-10-13 |
CN1120372A (en) | 1996-04-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CA2158064C (en) | Speech processing | |
EP0695453B1 (en) | Connected speech recognition | |
Soong et al. | A tree-trellis based fast search for finding the n best sentence hypotheses in continuous speech recognition | |
US6195634B1 (en) | Selection of decoys for non-vocabulary utterances rejection | |
US5033087A (en) | Method and apparatus for the automatic determination of phonological rules as for a continuous speech recognition system | |
US5072452A (en) | Automatic determination of labels and Markov word models in a speech recognition system | |
EP0314908B1 (en) | Automatic determination of labels and markov word models in a speech recognition system | |
EP0769184B1 (en) | Speech recognition methods and apparatus on the basis of the modelling of new words | |
US6922668B1 (en) | Speaker recognition | |
US5615299A (en) | Speech recognition using dynamic features | |
US6230128B1 (en) | Path link passing speech recognition with vocabulary node being capable of simultaneously processing plural path links | |
EP0475759A2 (en) | Phoneme discrimination method | |
US4783809A (en) | Automatic speech recognizer for real time operation | |
EP0692134B1 (en) | Speech processing | |
JP4199927B2 (en) | Method and apparatus for recognizing at least one keyword in spoken language by a calculator | |
Bush et al. | Network-based connected digit recognition using vector quantization | |
Kanazawa et al. | A hybrid wordspotting method for spontaneous speech understanding using word-based pattern matching and phoneme-based HMM | |
Forgie et al. | An overview of the Lincoln Laboratory speech recognition system | |
CA2013263C (en) | Rejection method for speech recognition | |
JPH05303391A (en) | Speech recognition device | |
Niimi et al. | A voice-input programming system using BASIC-like language | |
Okada et al. | Word spotting system based on stochastic models of phonemic segments | |
Kawabata et al. | Word spotting method based on top-down phoneme verification | |
JPS6346499A (en) | Big vocaburary word voice recognition system | |
Smyth | Segmental sub-word unit classification using a multilayer perceptron |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
EEER | Examination request | ||
MKEX | Expiry |
Effective date: 20140331 |