US20080071529A1 - Using non-speech sounds during text-to-speech synthesis - Google Patents
Using non-speech sounds during text-to-speech synthesis
Info
- Publication number
- US20080071529A1 (application US11/532,470; US53247006A)
- Authority
- US
- United States
- Prior art keywords
- speech
- units
- unit
- voice sample
- input string
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0316—Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
- G10L21/0364—Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude for improving intelligibility
Definitions
- the following disclosure generally relates to information systems.
- conventional text-to-speech application programs produce audible speech from written text.
- the text can be displayed, for example, in an application program executing on a personal computer or other device.
- a blind or sight-impaired user of a personal computer can have text from a web page read aloud from the personal computer.
- Other text to speech applications include those that read from a textual database and provide corresponding audio to a user by way of a communication device, such as a telephone, cellular telephone, portable music player, in-vehicle navigation system or the like.
- Speech from conventional text-to-speech applications typically sounds artificial or machine-like when compared to human speech.
- One reason for this result is that current text-to-speech applications often synthesize momentary pauses in speech with silence.
- the location and length of pauses is typically determined by parsing the written text and the punctuation in the text such as commas, periods, and paragraph delimiters.
- using empty silence to synthesize pauses, as conventional synthesis applications do, can lead listeners to feel a sense of breathlessness, particularly after lengthy exposure to the results of such synthesis.
- pauses can actually consist of breath intakes, mouth clicks and other non-speech sounds. These non-speech sounds provide subtle clues about the sounds and words that are about to follow. These clues are missing when pauses are synthesized as silence, thus requiring more listener effort to comprehend the synthesized speech.
- Some text-to-speech applications produce speech that can include emotive vocal gestures such as laughing, sobbing, crying, scoffing and grunting. However, in general such gestures do not improve comprehension of the resultant speech. Moreover, these techniques rely on explicitly annotated input text to determine where to include the vocal gestures in the speech. Such annotated text may, for example, appear as follows: “What? <laugh 1> You mean to tell me this is an improvement? <laugh 4>.” The text ‘<laugh 1>’ is an example of a specific textual command that directs the synthesis to produce a specific associated sound (e.g., a mocking laugh).
- Non-speech sounds can be identified from pre-recorded speech that can include meta-data such as the grammatical and phrasal structure of words and sounds that precede and succeed non-speech sounds.
- a non-speech sound can be selected for use in synthesized speech based on the words, punctuation, grammatical and phrasal structure of text from which the speech is being synthesized, or other characteristics.
- a method includes augmenting a synthesized speech with a non-speech sound other than silence, the augmentation based on characteristics of the synthesized speech.
- the method can include replacing pauses in the synthesized speech with a non-speech sound.
- Augmenting can include identifying the non-speech sound based on punctuation, grammatical or phrasal structure of text associated with the synthesized speech.
- the non-speech sound can include the sound of one or more of: inhalation; exhalation; mouth clicks; lip smacks; tongue flicks; and salivation.
- a method in another aspect includes identifying a non-speech unit in a received input string where the non-speech unit is not associated with a specific textual reference in the input string.
- the non-speech unit is matched to an audio segment, which is a voice sample of a non-speech sound.
- the input string is synthesized, which includes combining the audio segments matched with the non-speech unit.
- the method can include identifying the non-speech unit based on punctuation, grammatical and phrasal structure of the input string.
- the method can include identifying the non-speech unit based on non-speech codes in the input string.
- the method can include determining the duration of the non-speech unit.
- the method can include matching the non-speech unit with non-speech sounds based on duration of the non speech unit.
- the method can include generating metadata associated with the plurality of audio segments.
- Generating the metadata can include receiving a voice sample; determining two or more portions of the voice sample having properties; generating a portion of the metadata associated with a first portion of the voice sample to associate a second portion of the voice sample with the first portion of the voice sample; and generating a portion of the metadata associated with the second portion of the voice sample to associate the first portion of the voice sample with the second portion of the voice sample.
- Generating the metadata can include receiving a voice sample; delimiting a portion of the voice sample in which articulation relationships are substantially self-contained; and generating a portion of the metadata to describe the portion of the voice sample.
- the method can include identifying a speech unit in a received input string, the speech unit preceding or following the non-speech unit; and matching the non-speech unit with the non-speech sound based on the speech unit.
- the method can include parsing the speech unit into sub units, at least one sub unit preceding or following the non-speech unit; and matching the non-speech unit with non-speech sounds based on the at least one sub unit.
- Speech units can be phrases, words, or sub-words in the input string.
- the method can include limiting synthesizing non-speech units based on a proximity to preceding synthesized pauses.
- the input string can include ASCII or Unicode characters.
- the method can include outputting amplified speech comprising the combined audio segments.
- a method in another aspect includes receiving audio segments.
- the audio segments are parsed into speech units and non-speech units. Properties are defined of or between speech units and non-speech units. The units and the properties are stored.
- the method can include parsing the speech units into sub units; defining properties of or between the sub units; and storing the sub units and properties.
- the method can include parsing a received input string into speech units and non-speech units; determining properties of or between the speech units and non-speech units if any; matching units to stored units using the properties; and synthesizing the input string including combining the audio segments matched with the speech-units and non-speech units.
- the method can include defining properties between speech-units and non-speech units.
- a system that synthesizes speech with non-speech sounds can more accurately mimic patterns of human spoken communication.
- the resultant synthesized speech sounds more human and less artificial than methods that do not use non-speech sounds.
- Non-speech sounds add audible information and context to speech.
- Synthesized speech with non-speech sounds requires less cognitive effort to comprehend and is more likely to be understood when listening conditions are less than ideal.
- proper inclusion of non-speech sounds add pleasantness to the experience of listening to the resultant speech, making the task more enjoyable and engaging for the listener.
- Speech that includes non-speech sounds can lend a sense of personality and approachableness to the device that is speaking.
- FIG. 1 is a block diagram illustrating a proposed system for text-to-speech synthesis.
- FIG. 2 is a block diagram illustrating a synthesis block of the proposed system of FIG. 1 .
- FIG. 3A is a flow diagram illustrating one method for synthesizing text into speech.
- FIG. 3B is a flow diagram illustrating a second method for synthesizing text into speech.
- FIG. 4 is a flow diagram illustrating a method for providing a plurality of audio segments having defined properties that can be used in the method shown in FIG. 3 .
- FIG. 5 is a schematic diagram illustrating linked segments.
- FIG. 6 is a schematic diagram illustrating another example of linked segments.
- FIG. 7 is a flow diagram illustrating a method for matching units from a stream of text to audio segments at a highest possible unit level.
- FIG. 8 is a schematic diagram illustrating linked segments.
- Non-speech sounds are sounds that are not normally captured by the phonetic or any other linguistic description of a spoken language, including: breathing sounds, lip-smacks, tongue flicks, mouth clicks, salivation sounds, sighs and the like.
- An exemplary system and method is described for mapping text input to speech. Part of the mapping includes the consideration of non-speech sounds that may be appropriate to provide in the audio output.
- a system maps an input stream of text to audio segments that take into account properties of and relationships (including articulation relationships) among units from the text stream.
- Articulation relationships refer to dependencies between sounds, including non-speech sounds, when spoken by a human.
- the dependencies can be caused by physical limitations of humans (e.g., limitations of lip movement, vocal cords, human lung capacity, speed of air intake or outtake, etc.) when, for example, speaking without adequate pause, speaking at a fast rate, slurring, and the like.
- Properties can include those related to pitch, duration, accentuation, spectral characteristics and the like.
- Properties of a given unit can be used to identify follow on units that are a best match for combination in producing synthesized speech.
- properties and relationships that are used to determine units that can be selected from to produce the synthesized speech are referred to collectively simply as properties.
- FIG. 1 is a block diagram illustrating a system 100 for text-to-speech synthesis that includes non-speech sounds.
- System 100 includes one or more applications such as application 110 , an operating system 120 , a synthesis block 130 , an audio storage 135 , a digital to analog converter (D/A) 140 , and one or more speakers 145 .
- the system 100 is merely exemplary.
- the proposed system can be distributed, in that the input, output and processing of the various streams and data can be performed in several or one location. The input and capture, processing and storage of samples can be separate from the processing of a textual entry.
- the textual processing can be distributed, where for example the text that is identified or received can be at a device that is separate from the processing device that performs the text to speech processing.
- the output device that provides the audio can be separate or integrated with the textual processing device.
- a client server architecture can be provided where the client provides or identifies the textual input, and the server provides the textual processing, returning a processed signal to the client device. The client device can in turn take the processed signal and provide an audio output. Other configurations are possible.
- application 110 can output a stream of text, having individual text strings, to synthesis block 130 either directly or indirectly through operating system 120 .
- Application 110 can be, for example, a software program such as a word processing application, an Internet browser, a spreadsheet application, a video game, a messaging application (e.g., an e-mail application, an SMS application, an instant messenger, etc.), a multimedia application (e.g., MP3 software), a cellular telephone application, and the like.
- application 110 displays text strings from various sources (e.g., received as user input, received from a remote user, received from a data file, etc.).
- a text string can be separated from a continuous text stream through various delimiting techniques described below.
- Text strings can be included in, for example, a document, a spread sheet, or a message (e.g., e-mail, SMS, instant message, etc.) as a paragraph, a sentence, a phrase, a word, a partial word (i.e., sub-word), phonetic segment and the like.
- Text strings can include, for example, ASCII or Unicode characters or other representations of words. Text strings can also contain detailed explicit representations of desired phonemes or sub-phoneme articulatory gestures, possibly associated with pitch and/or duration specifications.
- application 110 includes a portion of synthesis block 130 (e.g., a daemon or capture routine) to identify and initially process text strings for output.
- application 110 provides a designation for speech output of associated text strings (e.g., enable/disable button).
- Operating system 120 can output text strings to synthesis block 130 .
- the text strings can be generated within operating system 120 or be passed from application 110 .
- Operating system 120 can be, for example, a MAC OS X operating system by Apple Computer, Inc. of Cupertino, Calif., a Microsoft Windows operating system, a mobile operating system (e.g., Windows CE or Palm OS), control software embedded within a portable device such as a music player or an in-vehicle navigation system, a cellular telephone control software, and the like.
- Operating system 120 may generate text strings related to user interactions (e.g., responsive to a user selecting an icon), states of user hardware (e.g., responsive to low battery power or a system shutting down), and the like.
- a portion or all of synthesis block 130 is integrated within operating system 120 .
- synthesis block 130 interrogates operating system 120 to identify and provide text strings to synthesis block 130 .
- a kernel layer in operating system 120 can be responsible for general management of system resources and processing time.
- a core layer can provide a set of interfaces, programs and services for use by the kernel layer.
- a core layer can manage interactions with application 110 .
- a user interface layer can include APIs (Application Program Interfaces), services and programs to support user applications.
- a user interface can display a UI (user interface) associated with application 110 and associated text strings in a window or panel.
- One or more of the layers can provide text streams or text strings to synthesis block 130 .
- Synthesis block 130 receives text strings or text string information as described. Synthesis block 130 is also in communication with audio segments 135 and D/A converter 140 .
- Synthesis block 130 can be, for example, a software program, a plug-in, a daemon, or a process and include one or more engines for parsing and correlation functions as discussed below in association with FIG. 2 .
- synthesis block 130 can be executed on a dedicated software thread or hardware thread.
- Synthesis block 130 can be initiated at boot-up, by application, explicitly by a user or by other means.
- synthesis block 130 provides a combination of audio samples that, when combined together, correspond to text strings.
- At least some of the audio samples can be selected to include properties or have relationships with other audio samples in order to provide a natural sounding (i.e., less machine-like) combination of audio samples and can include non-speech sounds. Further details in association with synthesis block 130 are given below.
- Audio storage 135 can be, for example, a database or other file structure stored in a memory device (e.g., hard drive, flash drive, CD, DVD, RAM, ROM, network storage, audio tape, and the like). Audio storage 135 includes a collection of audio segments and associated metadata (e.g., properties). Individual audio segments can be sound files of various formats such as AIFF (Apple Audio Interchange File Format Audio) by Apple Computer, Inc., MP3, MIDI, WAV, and the like. Sound files can be analog or digital and recorded at frequencies such as 22 khz, 44 khz, or 96 khz and, if digital, at various bit rates.
- These segments can also be more abstract representations of the acoustic speech signal such as spectral energy peaks, resonances, or even representations of the movements of the mouth, tongue, lips, and other articulators. They can also be indices into codebooks of any of the representations.
- the synthesis block can also perform other manipulations of the audio units, such as spectral smoothing, adjustments of pitch or duration, volume normalization, power compression, filtering, and addition of audio effects such as reverberation or echo.
- D/A converter 140 receives a combination of audio samples from synthesis block 130 .
- D/A converter 140 produces analog or digital audio information to speaker 145 .
- D/A converter 140 can provide post-processing to a combination of audio samples to improve sound quality. For example, D/A converter 140 can normalize volume levels or pitch rates, perform sound decoding or formatting, and other signal processing.
- Speakers 145 can receive audio information from D/A converter 140 .
- the audio information can be pre-amplified (e.g., by a sound card) or amplified internally by speakers 145 .
- speakers 145 produce speech synthesized by synthesis block 130 and cognizable by a human.
- the speech can include individual units of sound, non-speech sounds or other properties that produce more human like speech.
- FIG. 2 is a more detailed block diagram illustrating synthesis block 130 .
- Synthesis block 130 includes an input capture routine 210 , a parsing engine 220 , a unit matching engine 230 , an optional modeling block 235 and an output block 240 .
- Input capture routine 210 can be, for example, an application program, a module of an application program, a plug-in, a daemon, a script, or a process. In some implementations, input capture routine 210 is integrated within operating system 120 . In some implementations, input capture routine 210 operates as a separate application program or part of a separate application program. In general, input capture routine 210 monitors, captures, identifies and/or receives text strings or other information for generating speech.
- Parsing engine 220 delimits a text stream or text string into units.
- parsing engine 220 can separate a text string into phrase units and non-speech units.
- Non-speech units specify that a non-speech sound should be synthesized.
- a non-speech unit can fill, partially fill, or specify a momentary pause having a specific duration.
- Non-speech units can be identified from punctuation found in the text, including commas, semi-colons, colons, hyphens, periods, ellipses, brackets, paragraph delimiters (e.g., a carriage return followed immediately by a tab) and other punctuation.
- Non-speech units can also be identified by a grammatical or other analysis of the text even when there is no accompanying punctuation, such as at the boundary between the main topic in a sentence and the subsequent predicate.
- non-speech codes in the text can denote non-speech units.
- non-speech codes can be a specific series of characters that indicate that a non-speech unit should be produced (e.g., <breath, 40>, to denote a breath of 40 milliseconds).
- phrase units can be further separated into word units, word units into sub-word units, and/or sub-word units into phonetic segment units (e.g., a phoneme, a diphone (phoneme-to-phoneme transition), a triphone (phoneme in context), a syllable or a demisyllable (half of a syllable) or other similar structure).
- the parsing can separate text into a hierarchy of units where each unit can be relative to and depend on surrounding units. For example, the text “the cat sat on the mattress, happily. The dog came in” can be divided into phrase units 521 and non-speech units 526 (see FIG. 5 ). Phrase units 521 can be further divided into word units 531 for each word (e.g., phrases divided as necessary into a single word). In addition, word units 531 can be divided into phonetic segment units 541 or sub-word units (e.g., a single word divided into phonetic segments).
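- As an illustrative sketch only (not part of the patent), the following Python shows one way a parser could split a text string into phrase units and non-speech units based on punctuation and explicit non-speech codes such as <breath, 40>; the Unit class, the pause-duration table and the regular expression are assumptions, not the implementation of parsing engine 220.

```python
import re
from dataclasses import dataclass

# Assumed default pause durations (ms) per punctuation mark; the patent does
# not specify particular values.
PAUSE_MS = {",": 80, ";": 120, ":": 120, "-": 80, ".": 200}

@dataclass
class Unit:
    kind: str              # "phrase" or "non-speech"
    text: str = ""         # surface text for phrase units
    duration_ms: int = 0   # requested pause length for non-speech units

def parse_text_string(text: str) -> list[Unit]:
    """Split a text string into phrase units and non-speech units.

    Non-speech units are inferred from punctuation; an explicit non-speech
    code such as "<breath, 40>" requests a non-speech sound of a given length.
    """
    units = []
    pattern = r"<\s*breath\s*,\s*(\d+)\s*>|([,;:.\-])|([^,;:.\-<]+)"
    for code_ms, punct, phrase in re.findall(pattern, text):
        if code_ms:
            units.append(Unit("non-speech", duration_ms=int(code_ms)))
        elif punct:
            units.append(Unit("non-speech", duration_ms=PAUSE_MS.get(punct, 100)))
        elif phrase.strip():
            units.append(Unit("phrase", text=phrase.strip()))
    return units

# Example: phrase units separated by pauses inferred from the comma and period,
# plus one explicitly coded 40 ms breath.
print(parse_text_string("The cat sat on the mattress, happily. <breath, 40> The dog came in."))
```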
- Parsing engine 220 analyzes units to determine properties and relationships and generates information describing the same. The analysis is described in greater detail below.
- Unit matching engine 230 matches units from a text string to audio segments at a highest possible level in a unit hierarchy. Other text matching schemes are possible. Matching can be based on the properties of one or more units.
- Properties of the preceding or following synthesized audio segment, and the proposed matches can be analyzed to determine a best match.
- Properties can include those associated with the unit and concatenation costs.
- Unit costs can include considerations of one or more of pitch, duration, accentuation, and spectral characteristics. Unit cost can also reflect whether the non-speech unit is of an appropriate length. Unit costs measure the similarity or difference from an ideal model. Predictive models can be used to create ideal pitch, duration etc. predictors that can be used to evaluate which unit from a group of similar units (e.g., similar text unit but different audio sample) should be selected. Models are discussed more below in association with modeling block 235 .
- Concatenation costs can include those associated with articulation relationships such as adjacency between units in samples. Concatenation costs measure how well a unit fits with a neighbor unit.
- segments can be analyzed grammatically, semantically, phonetically or otherwise to determine a best matching segment from a group of audio segments. Metadata can be stored and used to evaluate best matches.
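- As an illustrative sketch only, the following Python shows one way unit costs and concatenation costs could be combined to pick a best-matching segment from a group of candidates; the feature names, weights, dictionary layout and greedy selection are assumptions (a full synthesizer could instead optimize over whole sequences, e.g. with a Viterbi search).

```python
def unit_cost(candidate, target):
    """Distance of a candidate audio segment from the model-predicted ideal
    (pitch, duration, accentuation); smaller is better."""
    return (abs(candidate["pitch_hz"] - target["pitch_hz"]) / 10.0
            + abs(candidate["duration_ms"] - target["duration_ms"]) / 100.0
            + (0.0 if candidate["accented"] == target["accented"] else 1.0))

def concatenation_cost(prev_choice, candidate):
    """How well a candidate joins the previously chosen segment; zero if the
    two were adjacent in the original recording."""
    if prev_choice is None:
        return 0.0
    if candidate["segment_id"] == prev_choice["segment_id"] + 1:
        return 0.0  # adjacent in the source recording: ideal join
    # Otherwise penalize mismatch at the join (a stand-in spectral feature).
    return abs(candidate["spectrum_start"] - prev_choice["spectrum_end"])

def best_candidate(candidates, target, prev_choice, w_unit=1.0, w_concat=1.0):
    """Greedy selection of the lowest combined cost candidate."""
    return min(candidates,
               key=lambda c: w_unit * unit_cost(c, target)
                             + w_concat * concatenation_cost(prev_choice, c))

cands = [{"pitch_hz": 120, "duration_ms": 90, "accented": False,
          "segment_id": 12, "spectrum_start": 0.4, "spectrum_end": 0.5},
         {"pitch_hz": 180, "duration_ms": 200, "accented": True,
          "segment_id": 41, "spectrum_start": 0.9, "spectrum_end": 0.2}]
target = {"pitch_hz": 125, "duration_ms": 100, "accented": False}
print(best_candidate(cands, target, prev_choice=None)["segment_id"])  # 12
```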
- Unit matching engine 230 can search the metadata in audio storage 135 ( FIG. 1 ) for matches. If a match is found, results are produced to output block 240. If a match is not found, unit matching engine 230 submits the unmatched unit back to parsing engine 220 for further parsing/processing (e.g., processing at different levels including processing smaller units).
- When a text string portion cannot be divided any further, an uncorrelated or raw phoneme, other sub-word units, or other units lower in the hierarchy can be produced to output block 240. Further details of one implementation of unit matching engine 230 are described below in association with FIG. 7 .
- Modeling block 235 produces ideal models that can be used to analyze segments to select a best segment for synthesis.
- Modeling block 235 can create predictive models that reflect ideal pitch, duration, etc. based on an analysis of the text, prior history of the texts spoken previously in the user interaction, the history of prior user interactions, the communicative purpose of the speech, and prior or learned information about the particular user. Based on the models, a selection of a best matching segment can be made.
- Output block 240 in one implementation, combines audio segments including non-speech segments.
- Output block 240 can receive a copy of a text string received from input capture routine 210 and track matching results from the unit hierarchy to the text string. More specifically, phrase units, non-speech units, word units, sub-word units, and phonetic segments (units), etc., can be associated with different portions of a received text string.
- the output block 240 produces a combined output for the text string.
- Output block 240 can produce combined audio segments in batch or on-the-fly.
- FIG. 3A is a flow diagram illustrating a method 300 for synthesizing text to speech.
- a precursor to the synthesizing process 300 includes the processing and evaluation of training audio samples and storage of such along with attending property information. The precursor process is discussed in greater detail in association with FIG. 4 .
- a text string is identified 302 for processing (e.g., by input capture routine 210 ).
- input text strings from one or more sources can be monitored and identified.
- the input strings can be, for example, generated by a user, sent to a user, or displayed from a file.
- Units from the text string are matched 304 to audio segments, and in one implementation to audio segments at a highest possible unit level.
- when units are matched at a high level, more articulation relationships will be contained within an audio segment. Higher level articulation relationships can produce more natural sounding speech.
- non-speech units from the text are matched with non-speech audio segments (i.e. non-speech units).
- Matching non-speech units can also, in one implementation, be made at a highest unit level.
- Matching non-speech units can include evaluating the preceding and following speech units.
- a non-speech unit followed by a ‘cat’ word unit is a better match than a non-speech unit followed by a ‘kit’ word unit; and both are a better match than a non-speech unit followed by a ‘street’ word unit.
- the system also evaluates preceding and following units of non-speech units to determine a non-speech unit that is a best match (e.g., evaluating a series of non speech units preceding a given selection to determine if a breath needs to be inserted or other non-speech unit).
- Both speech units and non-speech units are identified in accordance with a parsing process.
- an initial unit level is identified and the text string is parsed to find matching audio segments for each unit.
- Each unmatched unit can then be further processed. Further processing can include further parsing of the unmatched unit, a different parsing of the unmatched unit, or parsing of the entire text string or a portion of it.
- unmatched units are parsed to a next lower unit level in a hierarchy of unit levels. The process repeats until the lowest unit level is reached or a match is identified.
- the text string is initially parsed to determine initial units. Unmatched units can be re-parsed. Alternatively, the entire text string can be re-parsed using a different rule(s) and results evaluated.
- modeling can be performed to determine a best matching unit. Modeling is discussed in greater detail below.
- Units from the input string are synthesized 306 including combining the audio segments associated with all units or unit levels.
- Non-speech units from the input string are synthesized including combining the audio segments associated with speech units with the non-speech sounds associated with each matched non-speech unit 307 .
- Combining non-speech sounds can include prefixing particular non-speech sounds with silence so that the duration of the combined sound is sufficient in length.
- Speech is output 308 at a (e.g., amplified) volume.
- the combination of audio segments can be post-processed to generate better quality speech.
- the audio segments can be supplied from recordings under varying conditions or from different audio storage facilities, leading to variations.
- One example of post-processing is volume normalization. Other post-processing can smooth irregularities between the separate audio segments.
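- As an illustrative sketch only, the following Python shows a simple volume-normalization post-processing step of the kind mentioned above; the target RMS level and the NumPy-based implementation are assumptions.

```python
import numpy as np

def normalize_and_join(segments, target_rms=0.1):
    """Scale each audio segment (float samples in [-1, 1]) to a common RMS
    level before concatenation, so segments recorded under differing
    conditions do not jump in loudness, then join them into one signal."""
    scaled = []
    for samples in segments:
        rms = float(np.sqrt(np.mean(np.square(samples)))) or 1e-9  # avoid /0
        scaled.append(samples * (target_rms / rms))
    return np.concatenate(scaled)

# Example with two synthetic tones of very different levels.
t = np.linspace(0, 0.1, 2205, endpoint=False)
quiet = 0.01 * np.sin(2 * np.pi * 220 * t)
loud = 0.8 * np.sin(2 * np.pi * 220 * t)
out = normalize_and_join([quiet, loud])
```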
- received text is parsed at a first level ( 352 ), identifying speech units and non-speech units.
- the parsing of the text into speech units can be, for example, at the phrase unit level, word unit level, sub-word unit level or other level.
- a match is attempted to be located for each unit ( 354 ). If no match is located for a given unit ( 356 ), the unmatched unit is parsed again at a second unit level ( 358 ).
- the second unit level can be smaller in size than the first unit level and can be at the word unit level, sub-word unit level, diphone level, phoneme level or other level.
- the adjacent speech units of unmatched non-speech units are parsed into their second unit level. After parsing, a match is made to a best unit. The matched units are thereafter synthesized to form speech for output ( 360 ). Details of a particular matching process at multiple levels are discussed below.
- FIG. 4 is a flow diagram illustrating one implementation of a method 400 for providing audio segments and attending metadata.
- Voice samples of speech are provided 402 including associated text.
- a human can speak into a recording device through a microphone or prerecorded voice samples are provided for training. Optimally one human source is used but output is provided under varied conditions. Different samples can be used to achieve a desired human sounding result. Text corresponding to the voice samples can be provided for accuracy or for more directed training.
- audio segments can be computer-generated and a voice recognition system or other automatic or supervised pattern-matching system can determine associated text, and pauses and other speech separators from the voice samples.
- the voice samples are divided 404 into units.
- the voice sample can first be divided into a first unit level, for example into phrase units and non-speech units.
- Phrase units correspond to speech-sounds in the voice sample while non-speech units denote non-speech sounds in the voice sample.
- Each non-speech unit can be associated with punctuation from the text associated with the sample (e.g., a brief breath sound may be associated with a comma and a particular longer breath sound may be associated with a period, em dash or paragraph delimiter).
- the first unit level can be divided into subsequent unit levels in a hierarchy of units. For example, phrase units can be divided into other units (words, subwords, diphones, etc.) as discussed below.
- the unit levels are not hierarchical, and the division of the voice samples can include division into a plurality of units at a same level (e.g., dividing a voice sample into similar sized units but parsing at different locations in the sample).
- the voice sample can be parsed a first time to produce a first set of units. Thereafter, the same voice sample can be parsed a second time using a different parsing methodology to produce a second set of units. Both sets of units can be stored including any attending property or relationship data. Other parsing and unit structures are possible.
- the voice samples can be processed creating units at one or more levels. In one implementation, units are produced at each level. In other implementations, only units at selected levels are produced.
- the units are analyzed for associations and properties 406 and the units and attending data (if available) stored 408 .
- Analysis can include determining associations, such as adjacency, with other units in the same level or other levels.
- Non-speech units can also have associations, such as adjacency.
- separate non-speech units exist at each hierarchy level and can be associated with adjacent units at the same level.
- non-speech units can be associated with each adjacent unit at more than one (e.g., at all) hierarchical level simultaneously.
- non-speech units can be linked to units at a same or different level (e.g., at a level above, a level below, two levels below, etc.)
- non-speech units can also have associated properties indicating the aural quality of the unit (e.g., whether it is an intake breath, a sigh, or a breath with a tongue flick, etc.) and the non-speech unit's duration. Examples of associations that can be stored are shown in FIGS. 5 and 6 .
- Other analysis can include analysis associated with pitch, duration, accentuation, spectral characteristics, and other features of individual units or groups of units.
- non-speech units can be analyzed and characterized with respect to type (e.g., breath, sigh, tongue flick, etc.). Analysis is discussed in greater detail below.
- each unit including representative text for speech units, associated segment, and metadata (if available) is stored for potential matching.
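- As an illustrative sketch only, the following Python shows one possible record layout for a stored unit: its representative text, a reference to its audio segment, and metadata such as duration, non-speech type and adjacency links per unit level; all field names, file names and identifiers are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class StoredUnit:
    unit_id: int
    level: str                     # "phrase", "word", "phoneme", or "non-speech"
    text: str                      # representative text ("" for non-speech units)
    audio_path: str                # reference to the stored audio segment
    duration_ms: int
    non_speech_type: str = ""      # e.g. "breath", "sigh", "tongue flick"
    pitch_contour: str = ""        # e.g. "rising", "falling"
    accented: bool = False
    prev_unit_ids: dict = field(default_factory=dict)  # unit level -> preceding adjacent unit id
    next_unit_ids: dict = field(default_factory=dict)  # unit level -> following adjacent unit id

# An 80 ms breath from a training sample, linked to its adjacent units at the
# phrase, word and phoneme levels (ids and path are placeholders).
breath = StoredUnit(unit_id=7, level="non-speech", text="", audio_path="breath_80ms.aiff",
                    duration_ms=80, non_speech_type="breath",
                    prev_unit_ids={"phrase": 3, "word": 12, "phoneme": 41},
                    next_unit_ids={"phrase": 4, "word": 13, "phoneme": 42})
```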
- FIG. 5 is a schematic diagram illustrating a voice sample that is divided into units on different levels.
- a voice sample 510 includes the phrase 512 “the cat sat on the mattress, happily. The dog came in.”
- the voice sample 510 is divided into phrase units 521 including the text “the cat,” “sat,” “on the mattress” and “happily” and into non-speech units 526 .
- Each non-speech unit includes a duration that indicates the length of the unit, representing the duration of non-speech captured by the particular non-speech unit.
- Non-speech units from the same voice sample can be analyzed and characterized by type of non-speech sound, as discussed above.
- phrase units 521 are further divided into word units 531 including the text “the”, “mattress” and others.
- the last unit level of this example is a phonetic segment unit level 540 that includes units 541 which represent word enunciations on an atomic level.
- the sample word “the” consists of the phonemes “D” and “AX” (to rhyme with “thuh”, as in the first syllable of “about”).
- another instance of the sample word “the” can consist of the phonemes “D” and “IY” (to rhyme with “thee”).
- the difference stems from a stronger emphasis of the word “the” in speech when beginning a sentence or after a pause.
- These differences can be captured in metadata (e.g., location or articulation relationship data) associated with the different voice samples (and be used to determine which segment to select from plural available similar segments).
- associations between units can be captured in metadata and saved with the individual audio segments.
- the associations can include adjacency data between and across unit levels.
- three levels of unit associations are shown (phrase unit level 520 , word unit level 530 and phonetic segment unit level 540 ).
- associations 561 link preceding phrase units 521 with non-speech units 526.
- associations 563 link preceding word units 531 with non-speech units and preceding phonetic segment units 541 with non-speech units, respectively.
- Each non-speech unit in this example is also linked to all adjacent units at each level. Also, associations 571 , 573 , 575 link non-speech units with following phrase, word and phoneme units, respectively. In FIG. 6 , the non-speech unit 526 is associated with preceding and following adjacent units at each hierarchy level.
- associations can be stored as metadata corresponding to units.
- each phrase unit, word unit, sub-word unit, phonetic segment unit, etc. can be saved as a separate audio segment.
- links between units can be saved as metadata.
- the metadata can further indicate whether a link is forward or backward and whether a link is between peer units or between unit levels.
- matching can include matching portions of text defined by units with segments of stored audio.
- the text being analyzed can be divided into units and matching routines performed.
- One specific matching routine includes matching to a highest level in a hierarchy of unit levels.
- FIG. 7 is a flow diagram illustrating a method 700 for matching non-speech units from a text string to non-speech audio segments at a highest possible unit level.
- a text stream can be divided using grammatical delimiters (e.g., periods, and semi-colons) and other document delimiters (e.g., page breaks, paragraph symbols, numbers, outline headers, and bullet points) so as to divide a continuous or long text stream into portions for processing.
- the portions for processing represent sentences of the received text.
- the portions of text for processing can represent the entire text including multiple pages, paragraphs and sentences.
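- As an illustrative sketch only, the following Python shows one way a continuous text stream could be divided into text strings for processing using sentence punctuation and paragraph breaks; the exact delimiter set is an assumption.

```python
import re

def split_text_stream(stream: str) -> list[str]:
    """Divide a continuous text stream into text strings for processing, using
    sentence-final punctuation and blank-line paragraph breaks as delimiters."""
    parts = re.split(r"(?<=[.;!?])\s+|\n{2,}", stream)
    return [p.strip() for p in parts if p.strip()]

print(split_text_stream("The cats sat on the mats, happily. The snake hissed."))
# -> ['The cats sat on the mats, happily.', 'The snake hissed.']
```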
- Each text string is parsed 704 into phrase units and non-speech units (e.g., by parsing engine 220 ).
- a text string itself can comprise a phrase unit and one or more non-speech units.
- the text string can be divided, for example, into a predetermined number of words, into recognizable phrases, non-speech units (e.g., pauses), word pairs, and the like.
- the non-speech units are matched 706 to audio segments from a plurality of audio segments (e.g., by unit matching engine 230 ). To do so, an index of audio segments (e.g., stored in audio storage 135 ) can be accessed. In one implementation, metadata describing the audio segments is searched.
- the metadata can provide information about articulation relationships, properties or other data of a non-speech unit or phrase unit as described above.
- the metadata can describe links between audio segments as peer level associations or inter-level associations (e.g., separated by one level, two levels, or more). For the most natural sounding speech, a highest level match (e.g., phrase unit level in this example) is preferable.
- when a non-speech unit in the text string is identified, an attempt is made to match it with a stored non-speech unit of equal or lesser duration and, ideally, with matching adjacent high-level units (e.g., units at the phrase unit level).
- a non-speech sound that is longer than the non-speech unit is allowed if the duration of the sound does not exceed some criterion value and if the unit is particularly desirable (e.g., if, in the original recording, the non-speech sound unit was preceded and followed by the same words as are needed in the text string).
- the units adjacent to the unmatched non-speech unit can be further parsed to create other lower-level units.
- using the lower-level adjacent units, another search for a match is attempted. The process continues until a match occurs or no further parsing of adjacent units is possible (i.e., parsing to the lowest possible level has occurred or no other parsing definitions have been provided).
- the located non-speech unit can be appended (e.g., prefixed) with silence to make up the difference in duration between the located non-speech unit and the desired non-speech unit. If no matching non-speech unit is found, then the unmatched non-speech unit can, in one implementation, be replaced by silence in the final synthesis for the duration specified by the non-speech unit. Subsequent non-speech units in the text string are processed at the first unit level (e.g., phrase unit level) in a similar manner.
- Matching can include the evaluation of a plurality of similar (i.e., same text or same non-speech) units having different audio segments (e.g., different accentuation, different duration, different pitch, etc.). Matching can include evaluating data associated with a candidate unit (e.g., metadata) and evaluation of preceding and following units that have been matched (e.g., evaluating the previous matched unit to determine what if any relationships or properties are associated with this unit). Matching is discussed in more detail below.
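- As an illustrative sketch only, the following Python reduces the matching of a non-speech unit to dictionaries: each stored non-speech sound records its recorded neighbours at the phrase, word and phoneme levels, candidates longer than the pause are skipped, and matches at higher levels are preferred. This is a simplified stand-in for unit matching engine 230 and the metadata search in audio storage 135, not the patent's algorithm; the scoring rule and dictionary keys are assumptions.

```python
LEVELS = ["phrase", "word", "phoneme"]

def side_match_level(cand, pause, side):
    """Highest unit level ("phrase" beats "word" beats "phoneme") at which the
    candidate's recorded neighbour on the given side ("before"/"after") matches
    the pause's neighbour in the text; None if no level matches."""
    for level in LEVELS:
        if cand[side].get(level) and cand[side][level] == pause[side][level]:
            return level
    return None

def match_non_speech(pause, stored_units):
    """Choose a stored non-speech sound for a pause in the text, preferring
    candidates whose recorded neighbours match at the highest possible unit
    level and that are no longer than the pause they replace."""
    best, best_score = None, 0
    for cand in stored_units:
        if cand["duration_ms"] > pause["duration_ms"]:
            continue
        levels = [side_match_level(cand, pause, side) for side in ("before", "after")]
        # Higher levels score more; both sides contribute.
        score = sum(len(LEVELS) - LEVELS.index(l) for l in levels if l is not None)
        if score > best_score:
            best, best_score = cand, score
    return best   # None -> caller synthesizes plain silence for the pause

# A reduction of the FIG. 8 situation: the 80 ms breath matches the 100 ms pause
# via the following "happily" phrase and the preceding "S" phoneme.
stored = [{"duration_ms": 80,
           "before": {"phrase": "on the mattress", "word": "mattress", "phoneme": "S"},
           "after":  {"phrase": "happily", "word": "happily", "phoneme": "HH"}}]
pause = {"duration_ms": 100,
         "before": {"phrase": "on the mats", "word": "mats", "phoneme": "S"},
         "after":  {"phrase": "happily", "word": "happily", "phoneme": "HH"}}
print(match_non_speech(pause, stored) is stored[0])  # True
```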
- phrase units adjacent to unmatched non-speech units are parsed 710 into, for example, word units.
- phrase units that are word pairs can be separated into separate words.
- the matching is attempted among non-speech units and adjacent word units 712 .
- the word units adjacent to the unmatched non-speech units are parsed 716 into, for example, sub-word units.
- word units can be parsed into words, having suffixes or prefixes.
- the matching process ends and synthesis of the text samples can be initiated ( 726 ). Otherwise the process can continue at a next unit level 722 .
- a check is made to determine if a match has been located 724 . If no match is found, the process continues including parsing adjacent units of unmatched non-speech units to a new lower level in the hierarchy until a final unit level is reached 720 . If unmatched units remain after all other levels have been checked, then silence can be output for the duration of the unmatched non-speech unit.
- a check is added in the process after matches have been determined (not shown).
- the check can allow for further refinement in accordance with separate rules. For example, even though a match is located at one unit level, it may be desirable to check at a next or lower unit level for a match.
- the additional check can include user input to allow for selection from among possible match levels. Other check options are possible.
- heuristic rules can also govern the matching of non-speech units. These rules are particularly useful for simulating realistic breathing patterns. For example, a rule can specify that a non-speech unit should not be matched if a non-speech unit was replaced by a similar non-speech unit within five words of the current non-speech unit in the text string. Another rule can specify that a non-speech unit that precedes a sentence should only be matched if the sentence is longer than eight words, unless no non-speech units have been matched in the last eight words (i.e., after a succession of short sentences).
- a rule can specify that non-speech units only be matched if the non-speech unit precedes a phrase unit, but never if the non-speech unit precedes an utterance (e.g., a one or two word phrase unit). Yet another rule can specify that non-speech units in the middle of a sentence only be matched if the phrase unit following the non-speech unit is more than six words.
- the particular threshold of words can be tuned to the desired speaking style. For example, when synthesizing speech with faster speaking rates (e.g., for use in screen readers for users with limited vision) the numbers might be larger. Other rules are possible.
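- As an illustrative sketch only, the following Python encodes the example heuristics above (the five-word spacing, eight-word sentence threshold and six-word following-phrase threshold); the way the individual rules are combined into a single decision is an assumption.

```python
def allow_breath(words_since_last_breath, sentence_len_words,
                 is_sentence_initial, following_phrase_len_words,
                 min_gap=5, min_sentence=8, min_following_phrase=6):
    """Decide whether a candidate non-speech (breath) unit should be matched,
    combining the example heuristics: minimum word spacing between breaths,
    a sentence-length threshold for sentence-initial breaths, and a
    following-phrase-length threshold for mid-sentence breaths."""
    if words_since_last_breath < min_gap:
        return False   # a similar non-speech unit was matched too recently
    if is_sentence_initial:
        # Breathe before a sentence only if it is long enough, unless no breath
        # has been matched in the last min_sentence words (run of short sentences).
        return (sentence_len_words > min_sentence
                or words_since_last_breath >= min_sentence)
    # Mid-sentence: only if the phrase following the pause is long enough.
    return following_phrase_len_words > min_following_phrase

# A short sentence right after a breath is skipped; a long phrase mid-sentence,
# far from the last breath, gets one.
print(allow_breath(3, 12, True, 0))    # False
print(allow_breath(9, 15, False, 7))   # True
```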
- FIG. 8 is a schematic diagram illustrating an example of a process matching non-speech units.
- the text string 810 that is to be processed is “The cats sat on the mats, happily. The snake hissed.”
- the only searchable/matchable units that are available are those associated with the single training sample “the cat sat on the mattress, happily. The dog came in” described previously, and in particular the two non-speech units of substantially 80 and 200 milliseconds provided therein.
- the focus of this example is to illustrate how non-speech sounds are synthesized, while ignoring the synthesis of the remaining speech sounds.
- the identified non-speech units for text string 810 include non-speech units 805 and 809 , which are substantially 100 milliseconds and 300 milliseconds in length, respectively.
- non-speech units are matched by considering their adjacent phrase units.
- the selected non-speech sounds are shown at level 817 .
- the first non-speech unit 805 is a match for the 80 ms non-speech unit 850 based on the following “happily” phrase unit, but the preceding “the mats” phrase unit does not match any known non-speech unit.
- a search for non-speech units with matching adjacent word units is made at, for example a next unit level, the word unit level 830 . Again, the preceding “mats” word unit does not match the preceding word units of any known non-speech unit.
- a search for non-speech units with matching adjacent phonemes is made at the phoneme level 840 . At the phoneme level, the “S” phoneme, derived from “mats”, provides a match.
- the matching non-speech unit 850 is selected for synthesis of the non-speech unit 805 .
- the links 872 and 874 denote the association between the non-speech unit and its adjacent units as determined during voice sample processing, as described in reference to FIG. 5 .
- the second non-speech unit 809 is a match with the 200 ms non-speech unit 860 based on the preceding “happily” phrase unit; however, the following phrase unit “the snake” does not match the following phrase unit of any known non-speech unit at the initial level (e.g., the phrase level).
- a search for non-speech units with adjacent word units is made at the next level, for example the word unit level 830 .
- the “the” word derived from “the snake” matches “the” following the non-speech unit 860 .
- the matching non-speech unit 860 is selected for synthesis of the matched non-speech unit 809 .
- the links 876 and 878 denote the association between the non-speech unit and its adjacent units as determined during voice sample processing, as described in reference to FIG. 5 .
- Both non-speech units in the above examples are shorter than the duration specified by the non-speech units that they synthesize.
- a matching non-speech unit should be as close to, but not longer than, the non-speech unit it replaces (i.e., synthesizes).
- the duration of the matching non-speech unit must be greater than a minimum proportion of the desired duration (e.g., a matching non-speech unit must have a duration greater than 75% of the desired duration).
- a short non-speech unit is preferable to none at all.
- silence 852 is added to, for example, the beginning of each matching non-speech unit that is shorter than desired.
- twenty milliseconds of silence 852 a is prefixed to the 80 ms non-speech unit 850 , and 100 ms of silence 852 b is prefixed to the 200 ms non-speech unit 860 .
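- As an illustrative sketch only, the following Python shows the duration handling from this example: a helper computing how much leading silence to prefix so a matched non-speech sound fills the desired pause, and a separate acceptance test using the 75% figure mentioned earlier; the 22.05 kHz sample rate is an assumption.

```python
def silence_prefix_samples(sound_ms, desired_ms, sample_rate=22050):
    """Samples of leading silence needed so a matched non-speech sound fills
    the desired pause duration (0 if the sound already fills it)."""
    return int(sample_rate * max(desired_ms - sound_ms, 0) / 1000)

def acceptable_duration(sound_ms, desired_ms, min_fraction=0.75):
    """One possible acceptance test from the text: the sound should not exceed
    the pause and should cover at least some minimum fraction of it."""
    return sound_ms <= desired_ms and sound_ms >= min_fraction * desired_ms

# FIG. 8 numbers: 20 ms of silence before the 80 ms breath fills the 100 ms
# pause; 100 ms of silence before the 200 ms breath fills the 300 ms pause.
print(silence_prefix_samples(80, 100))   # 441 samples (20 ms at 22.05 kHz)
print(silence_prefix_samples(200, 300))  # 2205 samples (100 ms)
```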
- properties of units can be stored for matching purposes. Examples of properties include adjacency, pitch contour, accentuation, spectral characteristics, span (e.g., whether the instance spans a silence, a glottal stop, or a word boundary), grammatical context, position (e.g., of word in a sentence), isolation properties (e.g., whether a word can be used in isolation or always needs to be preceded or followed by another word), duration, compound property (e.g., whether the word is part of a compound), or other individual unit properties.
- in some implementations, additional data (e.g., metadata) is stored along with the units.
- the additional data can allow for better matches and produce better end results.
- in other implementations, only units (e.g., text, non-speech units and audio segments alone) are stored; in still other implementations, additional data can be stored.
- a non-speech unit can also be marked with properties such as whether the unit contains a breath intake, lip or tongue click, nasal squeak, snort, cough, throat-clearing, creaky voice, or a sigh. Such properties can be used during selection depending on a text analysis, or by explicit annotation (e.g., in the input text) by the user.
- three unit levels are created including phrases, words and diphones.
- one or more of the following additional data is stored for matching purposes:
- the pitch contour of the instance (i.e., whether pitch rises, falls, has bumps, etc.)
- adjacency data can be stored for matching purposes.
- the adjacency data can be at a same or different unit level.
- the invention and all of the functional operations described herein can be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in combinations of them.
- the invention can be implemented as a computer program product, i.e., a computer program tangibly embodied in an information carrier, e.g., in a machine-readable storage device or in a propagated signal, for execution by, or to control the operation of, data processing apparatus, e.g., a programmable processor, a computer, or multiple computers.
- a computer program can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
- a computer program can be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network.
- Method steps of the invention can be performed by one or more programmable processors executing a computer program to perform functions of the invention by operating on input data and generating output. Method steps can also be performed by, and apparatus of the invention can be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
- processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer.
- a processor will receive instructions and data from a read-only memory or a random access memory or both.
- the essential elements of a computer are a processor for executing instructions and one or more memory devices for storing instructions and data.
- a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks.
- Information carriers suitable for embodying computer program instructions and data include all forms of non-volatile memory, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.
- the processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
- the invention can be implemented on a device having a display, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and an input device, e.g., a keyboard, a mouse, a trackball, and the like by which the user can provide input to the computer.
- Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback provided by speakers associated with a device, externally attached speakers, headphones, and the like, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
- the invention can be implemented in, e.g., a computing system, a handheld device, a telephone, a consumer appliance, a multimedia player, an in-vehicle navigation and information system or any other processor-based device.
- a computing system implementation can include a back-end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the invention, or any combination of such back-end, middleware, or front-end components.
- the components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.
- LAN local area network
- WAN wide area network
- the computing system can include clients and servers.
- a client and server are generally remote from each other and typically interact through a communication network.
- the relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
- non-speech sounds can alternatively be used to augment an existing synthesized speech segment.
- Such an implementation can fill silent pauses in an existing synthesized speech segment with non-speech sounds.
- the augmentation of silences in an existing speech segment can be based on text associated with the segment or can be based on aural characteristics of the segment itself (e.g., duration since last pause, or the pitch, volume, quality or pattern of sound immediately preceding or following the pause).
- non-speech sounds can include emotive utterances that are not usually associated with formal speech patterns, such as laughing, crying, contemplation (e.g. ‘hmmm’), taunting (e.g. a raspberry), etc. Accordingly, other implementations are within the scope of the following claims.
- non-speech sounds is described within the framework of concatenative synthesis based on a corpus of audio recordings.
- Alternative parametric forms of speech synthesis such as formation synthesis or articulatory synthesis, could equally well support this invention by synthesizing the acoustic or articulatory correlates of the non-speech sounds rather than by inserting fragments of audio recordings.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Quality & Reliability (AREA)
- Signal Processing (AREA)
- Machine Translation (AREA)
Abstract
Description
- The following disclosure generally relates to information systems.
- In general, conventional text-to-speech application programs produce audible speech from written text. The text can be displayed, for example, in an application program executing on a personal computer or other device. For example, a blind or sight-impaired user of a personal computer can have text from a web page read aloud by the personal computer. Other text-to-speech applications include those that read from a textual database and provide corresponding audio to a user by way of a communication device, such as a telephone, cellular telephone, portable music player, in-vehicle navigation system or the like.
- Speech from conventional text-to-speech applications typically sounds artificial or machine-like when compared to human speech. One reason for this result is that current text-to-speech applications often synthesize momentary pauses in speech with silence. The location and length of pauses are typically determined by parsing the written text and the punctuation in the text, such as commas, periods, and paragraph delimiters. However, using empty silence to synthesize pauses, as conventional synthesis applications do, can lead listeners to feel a sense of breathlessness, particularly after lengthy exposure to the results of such synthesis. In human-produced speech, pauses can actually consist of breath intakes, mouth clicks and other non-speech sounds. These non-speech sounds provide subtle clues about the sounds and words that are about to follow. These clues are missing when pauses are synthesized as silence, thus requiring more listener effort to comprehend the synthesized speech.
- Some text-to-speech applications produce speech that can include emotive vocal gestures such as laughing, sobbing, crying, scoffing and grunting. However, in general such gestures do not improve comprehension of the resultant speech. Moreover, these techniques rely on explicitly annotated input text to determine where to include the vocal gestures in the speech. Such annotated text may, for example, appear as follows: “What? <laugh1> You mean to tell me this is an improvement? <laugh4>.” The text ‘<laugh1>’ is an example of a specific textual command that directs the synthesis to produce a specific associated sound (e.g., a mocking laugh).
- Systems, apparatus, methods and computer program products are described below for producing text-to-speech synthesis with non-speech sounds. In general, some of the pauses or silences that would otherwise be generated in synthesized speech are instead synthesized as non-speech sounds such as breaths. Non-speech sounds can be identified from pre-recorded speech that can include meta-data such as the grammatical and phrasal structure of words and sounds that precede and succeed non-speech sounds. A non-speech sound can be selected for use in synthesized speech based on the words, punctuation, grammatical and phrasal structure of text from which the speech is being synthesized, or other characteristics.
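- For illustration, the selection just described might be organized along the lines of the following Python sketch. It is a sketch only: the function names, the sound categories, and the word-count thresholds are assumptions made for the example and are not taken from the implementation described below.

```python
# Illustrative sketch only: map pause-producing punctuation and the length of
# the following clause to a category of non-speech sound. Categories and
# thresholds are hypothetical, not values taken from this description.
import re

def choose_non_speech_sound(punct: str, following_clause: str) -> str:
    """Return a non-speech sound label for a pause introduced by `punct`."""
    n_words = len(following_clause.split())
    if punct in {".", "!", "?"}:
        # Sentence boundary: a longer breath intake before a long sentence.
        return "breath_long" if n_words >= 8 else "breath_short"
    if punct in {",", ";", ":"}:
        # Clause boundary: a short intake or a mouth click.
        return "breath_short" if n_words >= 6 else "mouth_click"
    return "silence"

def plan_pauses(text: str):
    """Yield (punctuation, following clause, chosen sound) for each pause."""
    clauses = re.split(r"([,;:.!?])", text)
    for i in range(1, len(clauses) - 1, 2):
        punct, following = clauses[i], clauses[i + 1].strip()
        yield punct, following, choose_non_speech_sound(punct, following)

if __name__ == "__main__":
    for punct, clause, sound in plan_pauses(
            "The cats sat on the mats, happily. The snake hissed."):
        print(f"{punct!r} -> {sound} (before: {clause!r})")
```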
- In one aspect a method is provided that includes augmenting a synthesized speech with a non-speech sound other than silence, the augmentation based on characteristics of the synthesized speech.
- One or more implementations can optionally include one or more of the following features. The method can include replacing pauses in the synthesized speech with a non-speech sound. Augmenting can include identifying the non-speech sound based on punctuation, grammatical or phrasal structure of text associated with the synthesized speech. The non-speech sound can include the sound of one or more of: inhalation; exhalation; mouth clicks; lip smacks; tongue flicks; and salivation.
- In another aspect a method is provided that includes identifying a non-speech unit in a received input string where the non-speech unit is not associated with a specific textual reference in the input string. The non-speech unit is matched to an audio segment, which is a voice sample of a non-speech sound. The input string is synthesized, which includes combining the audio segments matched with the non-speech unit.
- One or more implementations can optionally include one or more of the following features. The method can include identifying the non-speech unit based on punctuation, grammatical and phrasal structure of the input string. The method can include identifying the non-speech unit based on non-speech codes in the input string. The method can include determining the duration of the non-speech unit. The method can include matching the non-speech unit with non-speech sounds based on duration of the non speech unit. The method can include generating metadata associated with the plurality of audio segments. Generating the metadata can include receiving a voice sample; determining two or more portions of the voice sample having properties; generating a portion of the metadata associated with a first portion of the voice sample to associate a second portion of the voice sample with the first portion of the voice sample; and generating a portion of the metadata associated with the second portion of the voice sample to associate the first portion of the voice sample with the second portion of the voice sample. Generating the metadata can include receiving a voice sample; delimiting a portion of the voice sample in which articulation relationships are substantially self-contained; and generating a portion of the metadata to describe the portion of the voice sample. The method can include identifying a speech unit in a received input string, the speech unit preceding or following the non-speech unit; and matching the non-speech unit with the non-speech sound based on the speech unit. The method can include parsing the speech unit into sub units, at least one sub unit preceding or following the non-speech unit; and matching the non-speech unit with non-speech sounds based on the at least one sub unit. Speech units can be phrases, words, or sub-words in the input string. The method can include limiting synthesizing non-speech units based on a proximity to preceding synthesized pauses. The input string can include ASCII or Unicode characters. The method can include outputting amplified speech comprising the combined audio segments.
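- The duration- and context-based matching features listed above can be illustrated with a short sketch. The candidate record layout and the scoring weights below are assumptions chosen for the example, not elements of the described method; the grounded requirements are only that duration and neighboring speech units inform the choice.

```python
# Sketch of duration- and context-based matching for a non-speech unit.
# Record layout and weights are hypothetical.
from dataclasses import dataclass
from typing import Optional

@dataclass
class RecordedNonSpeech:
    sound_type: str           # e.g. "breath", "lip_smack"
    duration_ms: int
    prev_word: Optional[str]  # word preceding the sound in the source recording
    next_word: Optional[str]  # word following the sound in the source recording

def score(candidate: RecordedNonSpeech, wanted_ms: int,
          prev_word: str, next_word: str) -> float:
    """Lower is better; reject candidates longer than the desired pause."""
    if candidate.duration_ms > wanted_ms:
        return float("inf")
    cost = (wanted_ms - candidate.duration_ms) / wanted_ms  # prefer a near-full fill
    cost += 0.0 if candidate.prev_word == prev_word else 1.0
    cost += 0.0 if candidate.next_word == next_word else 1.0
    return cost

def best_match(candidates, wanted_ms, prev_word, next_word):
    scored = [(score(c, wanted_ms, prev_word, next_word), i, c)
              for i, c in enumerate(candidates)]
    best = min(scored, default=None)
    return None if best is None or best[0] == float("inf") else best[2]
```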
- In another aspect a method is provided that includes receiving audio segments. The audio segments are parsed into speech units and non-speech units. Properties are defined of or between speech units and non-speech units. The units and the properties are stored.
- One or more implementations can optionally include one or more of the following features. The method can include parsing the speech units into sub units; defining properties of or between the sub units; and storing the sub units and properties. The method can include parsing a received input string into speech units and non-speech units; determining properties of or between the speech units and non-speech units if any; matching units to stored units using the properties; and synthesizing the input string including combining the audio segments matched with the speech-units and non-speech units. The method can include defining properties between speech-units and non-speech units.
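- A minimal sketch of this corpus-preparation aspect follows, assuming the voice sample has already been aligned to its transcript as (word, start, end) triples in milliseconds; the alignment format and the gap threshold are assumptions made for the illustration.

```python
# Sketch of corpus preparation: turn an aligned transcript into speech units
# and non-speech units with simple adjacency metadata. The (word, start_ms,
# end_ms) format and the 50 ms gap threshold are assumptions.
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Unit:
    kind: str                    # "speech" or "non_speech"
    text: Optional[str]          # word text for speech units, None otherwise
    start_ms: int
    end_ms: int
    prev_text: Optional[str] = None
    next_text: Optional[str] = None

def build_units(aligned: List[Tuple[str, int, int]],
                gap_threshold_ms: int = 50) -> List[Unit]:
    units: List[Unit] = []
    for word, start, end in aligned:
        if units and start - units[-1].end_ms > gap_threshold_ms:
            # The gap between two words becomes a non-speech unit.
            units.append(Unit("non_speech", None, units[-1].end_ms, start))
        units.append(Unit("speech", word, start, end))
    # Record adjacency so each unit "knows" its neighbors for later matching.
    for i, u in enumerate(units):
        u.prev_text = units[i - 1].text if i > 0 else None
        u.next_text = units[i + 1].text if i + 1 < len(units) else None
    return units
```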
- Particular embodiments of the invention can be implemented to realize one or more of the following advantages. A system that synthesizes speech with non-speech sounds can more accurately mimic patterns of human spoken communication. The resultant synthesized speech sounds more human and less artificial than speech produced by methods that do not use non-speech sounds. Non-speech sounds add audible information and context to speech. Synthesized speech with non-speech sounds requires less cognitive effort to comprehend and is more likely to be understood when listening conditions are less than ideal. In addition, proper inclusion of non-speech sounds adds pleasantness to the experience of listening to the resultant speech, making the task more enjoyable and engaging for the listener. Speech that includes non-speech sounds can lend a sense of personality and approachability to the device that is speaking.
- The details of one or more embodiments of the invention are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the invention will be apparent from the description and drawings, and from the claims.
-
FIG. 1 is a block diagram illustrating a proposed system for text-to-speech synthesis. -
FIG. 2 is a block diagram illustrating a synthesis block of the proposed system of FIG. 1. -
FIG. 3A is a flow diagram illustrating one method for synthesizing text into speech. -
FIG. 3B is a flow diagram illustrating a second method for synthesizing text into speech. -
FIG. 4 is a flow diagram illustrating a method for providing a plurality of audio segments having defined properties that can be used in the method shown in FIG. 3. -
FIG. 5 is a schematic diagram illustrating linked segments. -
FIG. 6 is a schematic diagram illustrating another example of linked segments. -
FIG. 7 is a flow diagram illustrating a method for matching units from a stream of text to audio segments at a highest possible unit level. -
FIG. 8 is a schematic diagram illustrating linked segments. - Like reference symbols in the various drawings indicate like elements.
- Systems, methods, computer program products, and means for including non-speech sounds in text-to-speech synthesis are described. Non-speech sounds are sounds that are not normally captured by the phonetic or any other linguistic description of a spoken language, including breathing sounds, lip-smacks, tongue flicks, mouth clicks, salivation sounds, sighs and the like. An exemplary system and method are described for mapping text input to speech. Part of the mapping includes the consideration of non-speech sounds that may be appropriate to provide in the audio output.
- By way of example a system is described that maps an input stream of text to audio segments that take into account properties of and relationships (including articulation relationships) among units from the text stream. Articulation relationships refer to dependencies between sounds, including non-speech sounds, when spoken by a human. The dependencies can be caused by physical limitations of humans (e.g., limitations of lip movement, vocal cords, human lung capacity, speed of air intake or outtake, etc.) when, for example, speaking without adequate pause, speaking at a fast rate, slurring, and the like. Properties can include those related to pitch, duration, accentuation, spectral characteristics and the like. Properties of a given unit can be used to identify follow on units that are a best match for combination in producing synthesized speech. Hereinafter, properties and relationships that are used to determine units that can be selected from to produce the synthesized speech are referred to in the collective as merely properties.
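- As a rough illustration of how such properties might be combined when choosing a follow-on unit, the following sketch scores candidates with a simple target cost plus join cost; the specific property names and weights are assumptions, since the description identifies only the kinds of properties involved.

```python
# Sketch of property-based selection of a follow-on unit. Property names
# ("end_pitch_hz", "duration_ms", ...) and the weights are assumptions.
def target_cost(candidate: dict, wanted: dict) -> float:
    """How far the candidate is from the predicted ideal for this position."""
    cost = abs(candidate["duration_ms"] - wanted["duration_ms"]) / 100.0
    cost += 0.0 if candidate["accented"] == wanted["accented"] else 1.0
    return cost

def join_cost(prev_unit: dict, candidate: dict) -> float:
    """How audible the seam between the previous unit and the candidate is."""
    cost = abs(prev_unit["end_pitch_hz"] - candidate["start_pitch_hz"]) / 50.0
    if candidate.get("follows_in_recording") == prev_unit.get("segment_id"):
        cost -= 1.0   # adjacent in the original recording: no audible seam
    return cost

def pick_follow_on(prev_unit: dict, candidates: list, wanted: dict) -> dict:
    """Choose the candidate with the lowest combined cost."""
    return min(candidates,
               key=lambda c: target_cost(c, wanted) + join_cost(prev_unit, c))
```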
-
FIG. 1 is a block diagram illustrating asystem 100 for text-to-speech synthesis that includes non-speech sounds.System 100 includes one or more applications such asapplication 110, anoperating system 120, asynthesis block 130, anaudio storage 135, a digital to analog converter (D/A) 140, and one ormore speakers 145. Thesystem 100 is merely exemplary. The proposed system can be distributed, in that the input, output and processing of the various streams and data can be performed in several or one location. The input and capture, processing and storage of samples can be separate from the processing of a textual entry. Further, the textual processing can be distributed, where for example the text that is identified or received can be at a device that is separate from the processing device that performs the text to speech processing. Further, the output device that provides the audio can be separate or integrated with the textual processing device. For example, a client server architecture can be provided where the client provides or identifies the textual input, and the server provides the textual processing, returning a processed signal to the client device. The client device can in turn take the processed signal and provide an audio output. Other configurations are possible. - Returning to the exemplary system,
application 110 can output a stream of text, having individual text strings, to synthesis block 130 either directly or indirectly throughoperating system 120.Application 110 can be, for example, a software program such as a word processing application, an Internet browser, a spreadsheet application, a video game, a messaging application (e.g., an e-mail application, an SMS application, an instant messenger, etc.), a multimedia application (e.g., MP3 software), a cellular telephone application, and the like. In one implementation,application 110 displays text strings from various sources (e.g., received as user input, received from a remote user, received from a data file, etc.). A text string can be separated from a continuous text stream through various delimiting techniques described below. Text strings can be included in, for example, a document, a spread sheet, or a message (e.g., e-mail, SMS, instant message, etc.) as a paragraph, a sentence, a phrase, a word, a partial word (i.e., sub-word), phonetic segment and the like. Text strings can include, for example, ASCII or Unicode characters or other representations of words. Text strings can also contain detailed explicit representations of desired phonemes or sub-phoneme articulatory gestures, possibly associated with pitch and/or duration specifications. In one implementation,application 110 includes a portion of synthesis block 130 (e.g., a daemon or capture routine) to identify and initially process text strings for output. In another implementation,application 110 provides a designation for speech output of associated text strings (e.g., enable/disable button). -
Operating system 120 can output text strings tosynthesis block 130. The text strings can be generated withinoperating system 120 or be passed fromapplication 110.Operating system 120 can be, for example, a MAC OS X operating system by Apple Computer, Inc. of Cupertino, Calif., a Microsoft Windows operating system, a mobile operating system (e.g., Windows CE or Palm OS), control software embedded within a portable device such as a music player or an in-vehicle navigation system, a cellular telephone control software, and the like.Operating system 120 may generate text strings related to user interactions (e.g., responsive to a user selecting an icon), states of user hardware (e.g., responsive to low battery power or a system shutting down), and the like. In some implementations, a portion or all ofsynthesis block 130 is integrated withinoperating system 120. In other implementations,synthesis block 130 interrogatesoperating system 120 to identify and provide text strings tosynthesis block 130. - More generally, a kernel layer (not shown) in
operating system 120 can be responsible for general management of system resources and processing time. A core layer can provide a set of interfaces, programs and services for use by the kernel layer. For example, a core layer can manage interactions withapplication 110. A user interface layer can include APIs (Application Program Interfaces), services and programs to support user applications. For example, a user interface can display a UI (user interface) associated withapplication 110 and associated text strings in a window or panel. One or more of the layers can provide text streams or text strings tosynthesis block 130. -
Synthesis block 130 receives text strings or text string information as described.Synthesis block 130 is also in communication withaudio segments 135 and D/A converter 140.Synthesis block 130 can be, for example, a software program, a plug-in, a daemon, or a process and include one or more engines for parsing and correlation functions as discussed below in association withFIG. 2 . In one implementation,synthesis block 130 can be executed on a dedicated software thread or hardware thread.Synthesis block 130 can be initiated at boot-up, by application, explicitly by a user or by other means. In general,synthesis block 130 provides a combination of audio samples that, when combined together, correspond to text strings. At least some of the audio samples can be selected to include properties or have relationships with other audio samples in order to provide a natural sounding (i.e., less machine-like) combination of audio samples and can include non-speech sounds. Further details in association withsynthesis block 130 are given below. -
Audio storage 135 can be, for example, a database or other file structure stored in a memory device (e.g., hard drive, flash drive, CD, DVD, RAM, ROM, network storage, audio tape, and the like).Audio storage 135 includes a collection of audio segments and associated metadata (e.g., properties). Individual audio segments can be sound files of various formats such as AIFF (Apple Audio Interchange File Format Audio) by Apple Computer, Inc., MP3, MIDI, WAV, and the like. Sound files can be analog or digital and recorded at frequencies such as 22 khz, 44 khz, or 96 khz and, if digital, at various bit rates. These segments can also be more abstract representations of the acoustic speech signal such as spectral energy peaks, resonances, or even representations of the movements of the mouth, tongue, lips, and other articulators. They can also be indices into codebooks of any of the representations. - The synthesis block can also perform other manipulations of the audio units, such as spectral smoothing, adjustments of pitch or duration, volume normalization, power compression, filtering, and addition of audio effects such as reverberation or echo.
- D/
A converter 140 receives a combination of audio samples fromsynthesis block 130. D/A converter 140 produces analog or digital audio information tospeaker 145. In one implementation, D/Aconverter 140 can provide post-processing to a combination of audio samples to improve sound quality. For example, D/Aconverter 140 can normalize volume levels or pitch rates, perform sound decoding or formatting, and other signal processing. -
Speakers 145 can receive audio information from D/A converter 140. The audio information can be pre-amplified (e.g., by a sound card) or amplified internally byspeakers 145. In one implementation,speakers 145 produce speech synthesized bysynthesis block 130 and cognizable by a human. The speech can include individual units of sound, non-speech sounds or other properties that produce more human like speech. -
FIG. 2 is a more detailed block diagram illustratingsynthesis block 130.Synthesis block 130 includes aninput capture routine 210, aparsing engine 220, aunit matching engine 230, anoptional modeling block 235 and anoutput block 240. -
Input capture routine 210 can be, for example, an application program, a module of an application program, a plug-in, a daemon, a script, or a process. In some implementations,input capture routine 210 is integrated withinoperating system 120. In some implementations,input capture routine 210 operates as a separate application program or part of a separate application program. In general,input capture routine 210 monitors, captures, identifies and/or receives text strings or other information for generating speech. - Parsing
engine 220, in one implementation, delimits a text stream or text string into units. For example, parsingengine 220 can separate a text string into phrase units and non-speech units. Non-speech units specify that a non-speech sound should be synthesized. A non-speech unit can fill, partially fill, or specify a momentary pause having a specific duration. Non-speech units can be identified from punctuation found in the text, including commas, semi-colons, colons, hyphens, periods, ellipses, brackets, paragraph delimiters (e.g., a carriage return followed immediately by a tab) and other punctuation. Non-speech units can also be identified by a grammatical or other analysis of the text even when there is no accompanying punctuation, such as at the boundary between the main topic in a sentence and the subsequent predicate. Alternatively, non-speech codes in the text can denote non-speech units. For example, non-speech codes can be a specific series of characters that indicate that a non-speech unit should be produced (e.g.., <breath, 40>, to denote a breath of 40 milliseconds). In one implementation, phrase units can be further separated into word units, word units into sub-word units, and/or sub-word units into phonetic segment units (e.g., a phoneme, a diphone (phoneme-to-phoneme transition), a triphone (phoneme in context), a syllable or a demrisyllable (half of a syllable) or other similar structure). For the purposes of this disclosure a particular architecture and structure for processing phrase units and other word or sub-word units of text is described. The particular structure should not be viewed as limiting. Other systems for processing phrase, word and sub-word units are possible. - The parsing can separate text into a hierarchy of units where each unit can be relative to and depend on surrounding units. For example, the text “the cat sat on the mattress, happily. The dog came in” can be divided into
phrase units 521 and non-speech units 526 (seeFIG. 5 ).Phrase units 521 can be further divided intoword units 531 for each word (e.g., phrases divided as necessary into a single word). In addition,word units 531 can be divided into aphonetic segment units 541 or sub-word units (e.g., a single word divided into phonetic segments). Various forms of text string units such as division by tetragrams, trigrams, bigrams, unigrams, phonemes, diphones, and the like, can be implemented to provide a specific hierarchy of units, with the fundamental unit level being a phonetic segment or other sub-word unit. Examples of unit hierarchies are discussed in further detail below. Parsingengine 220 analyzes units to determine properties and relationships and generates information describing the same. The analysis is described in greater detail below. -
Unit matching engine 230, in one implementation, matches units from a text string to audio segments at a highest possible level in a unit hierarchy. Other text matching schemes are possible. Matching can be based on the properties of one or more units. - Properties of the preceding or following synthesized audio segment, and the proposed matches can be analyzed to determine a best match. Properties can include those associated with the unit and concatenation costs. Unit costs can include considerations of one or more of pitch, duration, accentuation, and spectral characteristics. Unit cost can also reflect whether the non-speech unit is of an appropriate length. Unit costs measure the similarity or difference from an ideal model. Predictive models can be used to create ideal pitch, duration etc. predictors that can be used to evaluate which unit from a group of similar units (e.g., similar text unit but different audio sample) should be selected. Models are discussed more below in association with
modeling block 235. - Concatenation costs can include those associated with articulation relationships such as adjacency between units in samples. Concatenation costs measure how well a unit fits with a neighbor unit. In some implementations, segments can be analyzed grammatically, semantically, phonetically or otherwise to determine a best matching segment from a group of audio segments. Metadata can be stored and used to evaluate best matches.
Unit matching engine 230 can search the metadata in audio storage 135 (FIG. 1 ) for matches. If a match is found, results are produced tooutput block 240. If match is not found,unit matching engine 230 submits the unmatched unit back to parsingengine 220 for further parsing/processing (e.g., processing at different levels including processing smaller units). When a text string portion cannot be divided any further, an uncorrelated or raw phoneme, other sub-word units or other units lower in the hierarchy, such as phonemes can be produced tooutput block 240. Further details of one implementation ofunit matching engine 230 are described below in association withFIG. 7 . -
Modeling block 235 produces ideal models that can be used to analyze segments to select a best segment for synthesis.Modeling block 235 can create predictive models that reflect ideal pitch, duration, etc. based on an analysis of the text, prior history of the texts spoken previously in the user interaction, the history of prior user interactions, the communicative purpose of the speech, and prior or learned information about the particular user. Based on the models, a selection of a best matching segment can be made. -
Output block 240, in one implementation, combines audio segments including non-speech segments.Output block 240 can receive a copy of a text string received frominput capture routine 210 and track matching results from the unit hierarchy to the text string. More specifically, phrase units, non-speech units, word units, sub-word units, and phonetic segments (units), etc., can be associated with different portions of a received text string. Theoutput block 240 produces a combined output for the text string.Output block 240 can produce combined audio segments in batch or on-the-fly. -
FIG. 3A is a flow diagram illustrating amethod 300 for synthesizing text to speech. A precursor to thesynthesizing process 300 includes the processing and evaluation of training audio samples and storage of such along with attending property information. The precursor process is discussed in greater detail in association withFIG. 4 . - A text string is identified 302 for processing (e.g., by input capture routine 210). In response to boot-up of the operating system or launching of an associated application, for example, input text strings from one or more sources can be monitored and identified. The input strings can be, for example, generated by a user, sent to a user, or displayed from a file.
- Units from the text string are matched 304 to audio segments, and in one implementation to audio segments at a highest possible unit level. In general, when units are matched at a high level, more articulation relationships will be contained within an audio segment. Higher level articulation relationships can produce more natural sounding speech. In particular, non-speech units from the text are matched with non-speech audio segments (i.e. non-speech units). Matching non-speech units can also, in one implementation, be made at a highest unit level. Matching non-speech units can include evaluating the preceding and following speech units. For example, to synthesize a breath sound that is followed by the word ‘cat’, a non-speech unit followed by a ‘cat’ word unit is a better match than a non-speech unit followed by a ‘kit’ word unit; and both are a better match than a non-speech unit followed by a ‘street’ word unit. In one implementation, the system also evaluates preceding and following units of non-speech units to determine a non-speech unit that is a best match (e.g., evaluating a series of non speech units preceding a given selection to determine if a breath needs to be inserted or other non-speech unit).
- When lower level matches are needed, an attempt is made to parse units and match appropriate articulation relationships at a lower level. More details about one implementation for the parsing and matching processes are discussed below in association with
FIG. 7 . - Both speech units and non-speech units are identified in accordance with a parsing process. In one implementation, an initial unit level is identified and the text string is parsed to find matching audio segments for each unit. Each unmatched unit then can be further processed. Further processing can include further parsing of the unmatched unit, or a different parsing of the unmatched unit, the entire or a portion of the text string. For example, in one implementation, unmatched units are parsed to a next lower unit level in a hierarchy of unit levels. The process repeats until the lowest unit level is reached or a match is identified. In another implementation, the text string is initially parsed to determine initial units. Unmatched units can be re-parsed. Alternatively, the entire text string can be re-parsed using a different rule(s) and results evaluated. Optionally, modeling can be performed to determine a best matching unit. Modeling is discussed in greater detail below.
- Units from the input string are synthesized 306 including combining the audio segments associated with all units or unit levels. Non-speech units from the input string are synthesized including combining the audio segments associated with speech units with the non-speech sounds associated with each matched
non-speech unit 307. Combining non-speech sounds can include prefixing particular non-speech sounds with silence so that the duration of the combined sound is sufficient in length. Speech isoutput 308 at a (e.g., amplified) volume. The combination of audio segments can be post-processed to generate better quality speech. In one implementation, the audio segments can be supplied from recordings under varying conditions or from different audio storage facilities, leading to variations. One example of post-processing is volume normalization. Other post-processing can smooth irregularities between the separate audio segments. - Referring to
FIG. 3B another implementation for processing speech is shown. In thismethod 350, received text is parsed at a first level (352), identifying speech units and non-speech units. The parsing of the text into speech units can be for example at the phrase unit level, word unit, level, sub-word unit level or other level. A match is attempted to be located for each unit (354). If no match is located for a given unit (356), the unmatched unit is parsed again at a second unit level (358). The second unit level can be smaller in size than the first unit level and can be at the word unit level, sub-word unit level, diphone level, phoneme level or other level. In one implementation, the adjacent speech units of unmatched non-speech units are parsed into their second unit level. After parsing, a match is made to a best unit. The matched units are thereafter synthesized to form speech for output (360). Details of a particular matching process at multiple levels are discussed below. - Prior to matching and synthesis, a corpus of audio samples must be received, evaluated, and stored to facilitate the matching process. The audio samples are required to be divided into unit levels creating audio segments of varying unit sizes. Optional analysis and linking operations can be performed to create additional data (metadata) that can be stored along with the audio segments.
FIG. 4 is a flow diagram illustrating one implementation of amethod 400 for providing audio segments and attending metadata. Voice samples of speech are provided 402 including associated text. A human can speak into a recording device through a microphone or prerecorded voice samples are provided for training. Optimally one human source is used but output is provided under varied conditions. Different samples can be used to achieve a desired human sounding result. Text corresponding to the voice samples can be provided for accuracy or for more directed training. In another implementation, audio segments can be computer-generated and a voice recognition system or other automatic or supervised pattern-matching system can determine associated text, and pauses and other speech separators from the voice samples. - The voice samples are divided 404 into units. The voice sample can first be divided into a first unit level, for example into phrase units and non-speech units. Phrase units correspond to speech-sounds in the voice sample while non-speech units denote non-speech sounds in the voice sample. Each non-speech unit can be associated with punctuation from the text associated with the sample (e.g., a brief breath sound can may be associated with a comma and a particular longer breath sound may be associated with a period, em dash or paragraph delimiter). The first unit level can be divided into subsequent unit levels in a hierarchy of units. For example, phrase units can be divided into other units (words, subwords, diphones, etc.) as discussed below. In one implementation, the unit levels are not hierarchical, and the division of the voice samples can include division into a plurality of units at a same level (e.g., dividing a voice sample into similar sized units but parsing at a different locations in the sample). In this type of implementation, the voice sample can be parsed a first time to produce a first set of units. Thereafter, the same voice sample can be parsed a second time using a different parsing methodology to produce a second set of units. Both sets of units can be stored including any attending property or relationship data. Other parsing and unit structures are possible. For example, the voice samples can be processed creating units at one or more levels. In one implementation, units are produced at each level. In other implementations, only units at selected levels are produced.
- In some implementations, the units are analyzed for associations and
properties 406 and the units and attending data (if available) stored 408. Analysis can include determining associations, such as adjacency, with other units in the same level or other levels. Non-speech units can as well, have associations, such as adjacency. In one implementation separate non-speech units exist at each hierarchy level and can be associated with adjacent units at the same level. In another implementation, non-speech units can be associated with each adjacent unit at more than one (e.g., at all) hierarchical level simultaneously. That is, as is discussed further below, non-speech units can be linked to units at a same or different level (e.g., at a level above, a level below, two levels below, etc.) In one implementation, non-speech units can also have associated properties indicating the aural quality of the unit (e.g., whether it is a intake breath, a sigh, or a breath with a tongue flick, etc) and the non-speech unit's duration. Examples of associations that can be stored are shown inFIGS. 5 and 6 . Other analysis can include analysis associated with pitch, duration, accentuation, spectral characteristics, and other features of individual units or groups of units. For example, non-speech units can be analyzed and characterized with respect to type (e.g., breath, sigh, tongue flick, etc.) Analysis is discussed in greater details below. In the end of the sample processing, each unit, including representative text for speech units, associated segment, and metadata (if available) is stored for potential matching. - For example,
FIG. 5 is a schematic diagram illustrating a voice sample that is divided into units on different levels. A voice sample 510 includes thephrase 512 “the cat sat on the mattress, happily. The dog came in. The voice sample 510 is divided intophrase units 521 including the text “the cat,” “sat,” “on the mattress” and “happily” and intonon-speech units 526. Each non-speech unit includes a duration that indicates the length of the unit, representing the duration of non-speech captured by the particular non-speech unit. Non-speech units from the same voice sample can be analyzed and characterized by type of non-speech sound, as discussed above.Phrase units 521 are further divided intoword units 531 including the text “the”, “mattress” and others. The last unit level of this example is a phoneticsegment unit level 540 that includesunits 541 which represent word enunciations on an atomic level. For example, the sample word “the” consists of the phonemes “D” and “AX” (to rhyme with “thuh”, as in the first syllable of “about”). However, in the same voice sample another instance of the sample word “the” can consist of the phonemes “D” and “IY” (to rhyme with “thee”). The difference stems from a stronger emphasis of the word “the” in speech when beginning a sentence or after a pause. These differences can be captured in metadata (e.g., location or articulation relationship data) associated with the different voice samples (and be used to determine which segment to select from plural available similar segments). - As discussed in
FIG. 4 , associations between units can be captured in metadata and saved with the individual audio segments. The associations can include adjacency data between and across unit levels. InFIG. 5 , three levels of unit associations are shown (phrase unit level 520,word unit level 530 and phonetic segment unit level 540). InFIG. 5 , on aphrase unit level 520,association 561 link precedingphrase units 521 withnon-speech units 526. Similarly, onword unit level 530 and a phoneticsegment unit level 540,associations 563 link precedingword units 531 with non-speech units and precedingphonetic segment units 541 with non-speech units, respectively. Other levels are also possible, such as morphemes or syllables between words and phonemes, and units lower than phonemes such as articulatory gestures or pitch periods. Each non-speech unit in this example is also linked to all adjacent units at each level. Also,associations FIG. 6 , thenon-speech unit 526 is associated with preceding and following adjacent units at each hierarchy level. - As described above, associations can be stored as metadata corresponding to units. In one implementation, each phrase unit, word unit, sub-word unit, phonetic segment unit, etc., can be saved as a separate audio segment. Additionally, links between units can be saved as metadata. The metadata can further indicate whether a link is forward or backward and whether a link is between peer units or between unit levels.
- As described above, matching can include matching portions of text defined by units with segments of stored audio. The text being analyzed can be divided into units and matching routines performed. One specific matching routine includes matching to a highest level in a hierarchy of unit levels.
FIG. 7 is a flow diagram illustrating amethod 700 for matching non-speech units from a text string to non-speech audio segments at a highest possible unit level. A text stream (e.g., continuous text stream) is parsed 702 into a sequence of text strings for processing. In one implementation, a text stream can be divided using grammatical delimiters (e.g., periods, and semi-colons) and other document delimiters (e.g., page breaks, paragraph symbols, numbers, outline headers, and bullet points) so as to divide a continuous or long text stream into portions for processing. In one implementation, the portions for processing represent sentences of the received text. Alternatively, the portions of text for processing can represent the entire text including multiple pages, paragraphs and sentences. - Each text string is parsed 704 into phrase units and non-speech units (e.g., by parsing engine 220). In one implementation, a text string itself can comprise a phrase unit and one or more non-speech units. In other implementations, the text string can be divided, for example, into a predetermined number of words, into recognizable phrases, non-speech units (e.g., pauses), word pairs, and the like. The non-speech units are matched 706 to audio segments from a plurality of audio segments (e.g., by unit matching engine 230). To do so, an index of audio segments (e.g., stored in audio storage 135) can be accessed. In one implementation, metadata describing the audio segments is searched. The metadata can provide information about articulation relationships, properties or other data of a non-speech unit or phrase unit as described above. For example, the metadata can describe links between audio segments as peer level associations or inter-level associations (e.g., separated by one level, two levels, or more). For the most natural sounding speech, a highest level match (e.g., phrase unit level in this example) is preferable.
- More particularly, when a non-speech unit in the text string is identified an attempt is made to match it with a stored non-speech unit of equal or lesser duration and, ideally, with matching adjacent high-level units (e.g., units at the phrase unit level). In another implementation a non-speech sound that is longer than the non-speech unit is allowed if the duration of the sound does not exceed some criterion value and if the unit is particularly desirable (e.g., if, in the original recording, the non-speech sound unit was preceded and followed by the same words as are needed in the text string). If no match is determined because no non-speech unit is available with matching high-level units, then the units adjacent to the unmatched non-speech unit can be further parsed to create other lower-level units. Using the lower-level adjacent units another search for a match is attempted. The process continues until a match occurs or no further parsing of adjacent units is possible (i.e., parsing to the lowest possible level has occurred or no other parsing definitions have been provided). If a match is found, but the located non-speech unit has lesser duration than the ideal non-speech unit being matched, the located non-speech unit can be appended (e.g., prefixed) with silence to make up the difference in duration between the located non-speech unit and the desired non-speech unit. If no matching non-speech unit is found, then the unmatched non-speech unit can, in one implementation, be replaced by silence in the final syntheses for the duration specified by the non-speech unit. Subsequent non-speech units in the text string are processed at the first unit level (e.g., phrase unit level) in a similar manner.
- Matching can include the evaluation of a plurality of similar (i.e., same text or same non-speech) units having different audio segments (e.g., different accentuation, different duration, different pitch, etc.). Matching can include evaluating data associated with a candidate unit (e.g., metadata) and evaluation of preceding and following units that have been matched (e.g., evaluating the previous matched unit to determine what if any relationships or properties are associated with this unit). Matching is discussed in more detail below.
- Returning to the particular implementation shown in
FIG. 7 , if there are unmatchednon-speech units 708, the phrase units adjacent to unmatched non-speech units are parsed 710 into, for example, word units. For example, phrase units that are word pairs can be separated into separate words. The matching is attempted among non-speech units andadjacent word units 712. - If there are unmatched
non-speech units 714, the word units adjacent to the unmatched non-speech units are parsed 716 into, for example, sub-word units. For example, word units can be parsed into words, having suffixes or prefixes. If no unmatched non-speech units remain 720 (at this or any level), the matching process ends and synthesis of the text samples can be initiated (726). Otherwise the process can continue at anext unit level 722. At each unit level, a check is made to determine if a match has been located 724. If no match is found, the process continues including parsing adjacent units of unmatched non-speech units to a new lower level in the hierarchy until a final unit level is reached 720. If unmatched units remain after all other levels have been checked, then silence can be output for the duration of the unmatched non-speech unit. - In one implementation, a check is added in the process after matches have been determined (not shown). The check can allow for further refinement in accordance with separate rules. For example, even though a match is located at one unit level, it may be desirable to check to at a next or lower unit level for a match. The additional check can include user input to allow for selection from among possible match levels. Other check options are possible.
- Optionally, heuristic rules can also govern the matching of non-speech. These rules are particularly useful for simulating realistic breathing patterns. For example, a rule can specify that a non-speech unit should not be matched if a non-speech unit was replaced by a similar non-speech unit within five words of the current non-speech unit in the text string. Another rule can specify that a non-speech unit that precedes a sentence should only be matched if the sentence is longer than eight words, unless no non-speech units have been matched in the last eight words (i.e., after successive of short sentences). A rule can specify that non-speech units only be matched if the non-speech unit precedes a phrase unit, but never if the non-speech unit precedes an utterance (e.g., a one or two word phrase unit). Yet another rule can specify that non-speech units in the middle of a sentence only be matched if the phrase unit following the non-speech unit is more than six words. The particular threshold of words can be tuned to the desired speaking style. For example, when synthesizing speech with faster speaking rates (e.g., for use in screen readers for users with limited vision) the numbers might be larger. Other rules are possible.
-
FIG. 8 is a schematic diagram illustrating an example of a process matching non-speech units. The text string 810 that is to be processed is “The cats sat on the mats, happily. The snake hissed.” For the purposes of this example, the only searchable/matchable units that are available are those associated with the single training sample “the cat sat on the mattress, happily. The dog came in” described previously, and in particular the two non-speech units of substantially 80 and 200 milliseconds provided therein. Furthermore, the focus of this example is to illustrate how non-speech sounds are synthesized, while ignoring the synthesis of the remaining speech sounds. This example assumes that the text string 810 has been parsed using grammatical delimiters, such as periods and commas, to determine the location and duration of non-speech elements. The identified non-speech units for text string 810 includenon-speech units phrase unit level 820, non-speech units are matched by considering their adjacent phrase units. The selected non-speech sounds are shown atlevel 817. - The first
non-speech unit 805 is a match for the 80 msnon-speech unit 850 based on the following “happily” phrase unit, but the preceding “the mats” phrase unit does not match any known non-speech unit. A search for non-speech units with matching adjacent word units is made at, for example a next unit level, theword unit level 830. Again, the preceding “mats” word unit does not match the preceding word units of any known non-speech unit. A search for non-speech units with matching adjacent phonemes is made at thephoneme level 840. At the phoneme level, the “S” phoneme, derived from “mats”, provides a match. The matchingnon-speech unit 850 is selected for synthesis of thenon-speech unit 805. Thelinks FIG. 5 . - The second
non-speech unit 809 is a match with the 200 msnon-speech unit 860 based on the preceding “happily” phrase unit, however the following phrase unit “the snake” does not match the following phrase unit of any known non-speech unit at the initial level (e.g., the phrase level). A search for non-speech units with adjacent word units is made at the next level, for example theword unit level 830. At the word level, the “the” word derived from “the snake” matches “the” following thenon-speech unit 860. Accordingly, the matchingnon-speech unit 860 is selected for synthesis of the matchednon-speech unit 809. Thelinks FIG. 5 . - Both non-speech units in the above examples are shorter than the duration specified by the non-speech units that they synthesize. In one implementation, a matching non-speech unit should be as close to, but not longer than the non-speech unit it replaces (i.e., synthesizes). In another implementation, the duration of matching non-speech unit must be greater than a minimum proportion of the desired duration (e.g., a matching speech-unit must have a duration greater than 75% of desired duration). In the example illustrated in
FIG. 8 , although a longer non-speech unit with similar adjacent units would be preferable, a short non-speech unit is preferable to none at all. To compensate for the difference, silence is synthesized 852 to, for example, the beginning of each matching non-speech unit that is shorter than desired. In the example above, twenty milliseconds ofsilence 852 a is prefixed to the 80 msnon-speech unit 850, while 100 ms ofsilence 852 b is prefixed to the 200 msnon-speech unit 860. - Matching and Properties
- As described above, properties of units can be stored for matching purposes. Examples of properties include adjacency, pitch contour, accentuation, spectral characteristics, span (e.g., whether the instance spans a silence, a glottal stop, or a word boundary), grammatical context, position (e.g., of word in a sentence), isolation properties (e.g., whether a word can be used in isolation or needs always to be preceded or followed by another word), duration, compound property (e.g., whether the word is part of a compound, other individual unit properties or other properties . After parsing, evaluation of the unit, and adjoining units in the text string, can be performed to develop additional data (e.g., metadata). As described above, the additional data can allow for better matches and produce better end results. Alternatively, only units (e.g., text, non-speech units and audio segments alone) without additional data can be stored.
- A non-speech unit can also be marked with properties such as whether the unit contains a breath intake, lip or tongue click, nasal squeak, snort, cough, throat-clearing, creaky voice, or a sigh. Such properties can be used during selection depending on a text analysis, or by explicit annotation (e.g., in the input text) by the user.
- In one implementation, three unit levels are created including phrases, words and diphones. In this implementation, for each diphone unit one or more of the following additional data is stored for matching purposes:
- The pitch contour of the instance, i.e., whether pitch rises, falls, has bumps, etc.
- The accentuation of the phoneme that the instance overlaps, whether it is accentuated or not.
- The spectral characteristics of the border of the instance, i.e. what acoustic contexts it is most likely to fit in.
- Whether the instance spans a silence, a glottal stop, or a word boundary.
- The adjacent instances, which allows the system to know what we want to know about the phonetic context of the instance.
- In this implementation, for each word unit, one or more of the following additional data is stored for matching purposes:
- The grammatical (console the child vs. console window) and semantic (bass fishing vs. bass playing) properties of the word.
- The pitch contour of the instance, i.e., whether pitch rises, falls, has bumps, etc.
- The accentuation of the instance, whether it is accentuated or not, and further details of the type and prominence of any associated pitch accent.
- The position of the word in the phrase it was originally articulated (beginning, middle, end, before a comma, etc.).
- Whether the word can be used in an arbitrary context (or needs to always precede or follow its immediate neighbor).
- Whether the word was part of a compound, i.e. the “fire” in “firefighter”.
- In this implementation, for each phrase unit, adjacency data can be stored for matching purposes. The adjacency data can be at a same or different unit level.
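- For illustration only, the per-level data listed above might be organized as records along the following lines; the field names are paraphrases chosen for the sketch rather than identifiers from any actual implementation.

```python
# Sketch of per-level metadata records mirroring the lists above. Field names
# are paraphrases chosen for the sketch.
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class DiphoneMeta:
    pitch_contour: str            # "rising", "falling", "bumpy", ...
    accented: bool                # accentuation of the overlapped phoneme
    border_spectrum: List[float]  # spectral characteristics at the unit border
    spans: Optional[str]          # "silence", "glottal_stop", "word_boundary" or None
    adjacent_ids: Tuple[Optional[int], Optional[int]]  # preceding / following instance

@dataclass
class WordMeta:
    grammatical_sense: str        # distinguishes e.g. "console the child" / "console window"
    pitch_contour: str
    accented: bool
    pitch_accent: Optional[str]   # type and prominence of any associated pitch accent
    phrase_position: str          # "beginning", "middle", "end", "before_comma", ...
    context_free: bool            # usable in an arbitrary context?
    compound_part: bool           # e.g. the "fire" in "firefighter"

@dataclass
class PhraseMeta:
    adjacent_ids: Tuple[Optional[int], Optional[int]]  # adjacency, possibly across levels
```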
- The invention and all of the functional operations described herein can be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in combinations of them. The invention can be implemented as a computer program product, i.e., a computer program tangibly embodied in an information carrier, e.g., in a machine-readable storage device or in a propagated signal, for execution by, or to control the operation of, data processing apparatus, e.g., a programmable processor, a computer, or multiple computers. A computer program can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program can be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network.
- Method steps of the invention can be performed by one or more programmable processors executing a computer program to perform functions of the invention by operating on input data and generating output. Method steps can also be performed by, and apparatus of the invention can be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
- Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. Information carriers suitable for embodying computer program instructions and data include all forms of non-volatile memory, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in special purpose logic circuitry.
- To provide for interaction with a user, the invention can be implemented on a device having a display, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and an input device, e.g., a keyboard, a mouse, a trackball, and the like by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback provided by speakers associated with a device, externally attached speakers, headphones, and the like, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
- The invention can be implemented in, e.g., a computing system, a handheld device, a telephone, a consumer appliance, a multimedia player, an in-vehicle navigation and information system, or any other processor-based device. A computing system implementation can include a back-end component, e.g., a data server; a middleware component, e.g., an application server; or a front-end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the invention; or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.
- The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
- A number of implementations have been described. Nevertheless, it will be understood that various modifications may be made. For example, though three or four specific unit levels were described above in the context of the synthesis process, other numbers and kinds of levels can be used. Furthermore, though the process describes adding non-speech sounds during speech synthesis, non-speech sounds can alternatively be used to augment an existing synthesized speech segment. Such an implementation can fill silent pauses in an existing synthesized speech segment with non-speech sounds. The augmentation of silences in an existing speech segment can be based on text associated with the segment or on aural characteristics of the segment itself (e.g., duration since the last pause, or the pitch, volume, quality, or pattern of sound immediately preceding or following the pause). Finally, the synthesis of non-speech sounds can include emotive utterances that are not usually associated with formal speech patterns, such as laughing, crying, contemplation (e.g., ‘hmmm’), taunting (e.g., a raspberry), etc. Accordingly, other implementations are within the scope of the following claims.
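- As a rough sketch of the augmentation alternative just described, the following illustrative Python code fills sufficiently long silent stretches of an already synthesized segment with a non-speech sound chosen by pause duration alone; audio is assumed to be plain lists of float samples, and the function and label names (find_silent_spans, augment_pauses, "breath", "hesitation") are hypothetical. A fuller implementation could instead base the choice on the text associated with the segment or on pitch, volume, or other characteristics around the pause.

```python
import random


def find_silent_spans(samples, threshold=0.01, min_len=4000):
    """Return (start, end) index pairs where |sample| stays below `threshold`
    for at least `min_len` samples -- a crude stand-in for pause detection."""
    spans, start = [], None
    for i, s in enumerate(samples):
        if abs(s) < threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                spans.append((start, i))
            start = None
    if start is not None and len(samples) - start >= min_len:
        spans.append((start, len(samples)))
    return spans


def augment_pauses(samples, non_speech_bank, sample_rate=16000):
    """Fill sufficiently long pauses in an already synthesized segment with a
    non-speech sound, selected here by pause duration alone (illustrative)."""
    out = list(samples)
    for start, end in find_silent_spans(samples):
        duration = (end - start) / sample_rate
        label = "breath" if duration > 0.5 else "hesitation"
        recordings = non_speech_bank.get(label)
        if not recordings:
            continue
        filler = random.choice(recordings)
        out[start:start + min(len(filler), end - start)] = filler[: end - start]
    return out


# Toy usage: half a second of silence inside a "segment" gets a breath inserted.
segment = [0.3] * 8000 + [0.0] * 9000 + [0.3] * 8000
bank = {"breath": [[0.05] * 3000], "hesitation": [[0.02] * 1000]}
augmented = augment_pauses(segment, bank)
```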
- The above treatment of non-speech sounds is described within the framework of concatenative synthesis based on a corpus of audio recordings. Alternative parametric forms of speech synthesis, such as formant synthesis or articulatory synthesis, could equally well support this invention by synthesizing the acoustic or articulatory correlates of the non-speech sounds rather than by inserting fragments of audio recordings.
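- To make the contrast with the concatenative treatment concrete, here is a purely illustrative Python sketch of producing a non-speech sound parametrically rather than splicing in a recording; the noise-plus-envelope model and the name synthesize_breath are assumptions for illustration only and do not represent an actual formant or articulatory synthesizer.

```python
import math
import random


def synthesize_breath(duration_s=0.3, sample_rate=16000, amplitude=0.2):
    """Very crude parametric stand-in for a breath sound: low-pass filtered
    noise shaped by a raised-cosine envelope, generated rather than retrieved
    from a corpus of recordings."""
    n = int(duration_s * sample_rate)
    samples = []
    smoothed = 0.0
    for i in range(n):
        envelope = 0.5 * (1.0 - math.cos(2.0 * math.pi * i / max(n - 1, 1)))  # rise/fall
        noise = random.uniform(-1.0, 1.0)
        smoothed = 0.7 * smoothed + 0.3 * noise   # one-pole low-pass softens the hiss
        samples.append(amplitude * envelope * smoothed)
    return samples


print(len(synthesize_breath()))  # 4800 samples at 16 kHz
```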
Claims (27)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/532,470 US8027837B2 (en) | 2006-09-15 | 2006-09-15 | Using non-speech sounds during text-to-speech synthesis |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/532,470 US8027837B2 (en) | 2006-09-15 | 2006-09-15 | Using non-speech sounds during text-to-speech synthesis |
Publications (2)
Publication Number | Publication Date |
---|---|
US20080071529A1 true US20080071529A1 (en) | 2008-03-20 |
US8027837B2 US8027837B2 (en) | 2011-09-27 |
Family
ID=39189739
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US11/532,470 Expired - Fee Related US8027837B2 (en) | 2006-09-15 | 2006-09-15 | Using non-speech sounds during text-to-speech synthesis |
Country Status (1)
Country | Link |
---|---|
US (1) | US8027837B2 (en) |
Cited By (159)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070192105A1 (en) * | 2006-02-16 | 2007-08-16 | Matthias Neeracher | Multi-unit approach to text-to-speech synthesis |
US20100114556A1 (en) * | 2008-10-31 | 2010-05-06 | International Business Machines Corporation | Speech translation method and apparatus |
US8892446B2 (en) | 2010-01-18 | 2014-11-18 | Apple Inc. | Service orchestration for intelligent automated assistant |
US9262612B2 (en) | 2011-03-21 | 2016-02-16 | Apple Inc. | Device access using voice authentication |
US9300784B2 (en) | 2013-06-13 | 2016-03-29 | Apple Inc. | System and method for emergency calls initiated by voice command |
US9330720B2 (en) | 2008-01-03 | 2016-05-03 | Apple Inc. | Methods and apparatus for altering audio output signals |
US9338493B2 (en) | 2014-06-30 | 2016-05-10 | Apple Inc. | Intelligent automated assistant for TV user interactions |
US9368114B2 (en) | 2013-03-14 | 2016-06-14 | Apple Inc. | Context-sensitive handling of interruptions |
US9430463B2 (en) | 2014-05-30 | 2016-08-30 | Apple Inc. | Exemplar-based natural language processing |
US9483461B2 (en) | 2012-03-06 | 2016-11-01 | Apple Inc. | Handling speech synthesis of content for multiple languages |
US9495129B2 (en) | 2012-06-29 | 2016-11-15 | Apple Inc. | Device, method, and user interface for voice-activated navigation and browsing of a document |
US9502031B2 (en) | 2014-05-27 | 2016-11-22 | Apple Inc. | Method for supporting dynamic grammars in WFST-based ASR |
US9508338B1 (en) * | 2013-11-15 | 2016-11-29 | Amazon Technologies, Inc. | Inserting breath sounds into text-to-speech output |
US9535906B2 (en) | 2008-07-31 | 2017-01-03 | Apple Inc. | Mobile device having human language translation capability with positional feedback |
US9576574B2 (en) | 2012-09-10 | 2017-02-21 | Apple Inc. | Context-sensitive handling of interruptions by intelligent digital assistant |
US9582608B2 (en) | 2013-06-07 | 2017-02-28 | Apple Inc. | Unified ranking with entropy-weighted information for phrase-based semantic auto-completion |
US9606986B2 (en) | 2014-09-29 | 2017-03-28 | Apple Inc. | Integrated word N-gram and class M-gram language models |
US9620105B2 (en) | 2014-05-15 | 2017-04-11 | Apple Inc. | Analyzing audio input for efficient speech and music recognition |
US9620104B2 (en) | 2013-06-07 | 2017-04-11 | Apple Inc. | System and method for user-specified pronunciation of words for speech synthesis and recognition |
US9626955B2 (en) | 2008-04-05 | 2017-04-18 | Apple Inc. | Intelligent text-to-speech conversion |
US9633004B2 (en) | 2014-05-30 | 2017-04-25 | Apple Inc. | Better resolution when referencing to concepts |
US9633674B2 (en) | 2013-06-07 | 2017-04-25 | Apple Inc. | System and method for detecting errors in interactions with a voice-based digital assistant |
US9633660B2 (en) | 2010-02-25 | 2017-04-25 | Apple Inc. | User profiling for voice input processing |
US9646614B2 (en) | 2000-03-16 | 2017-05-09 | Apple Inc. | Fast, language-independent method for user authentication by voice |
US9646609B2 (en) | 2014-09-30 | 2017-05-09 | Apple Inc. | Caching apparatus for serving phonetic pronunciations |
US9668121B2 (en) | 2014-09-30 | 2017-05-30 | Apple Inc. | Social reminders |
US9697820B2 (en) | 2015-09-24 | 2017-07-04 | Apple Inc. | Unit-selection text-to-speech synthesis using concatenation-sensitive neural networks |
US9697822B1 (en) | 2013-03-15 | 2017-07-04 | Apple Inc. | System and method for updating an adaptive speech recognition model |
US9711141B2 (en) | 2014-12-09 | 2017-07-18 | Apple Inc. | Disambiguating heteronyms in speech synthesis |
US9715875B2 (en) | 2014-05-30 | 2017-07-25 | Apple Inc. | Reducing the need for manual start/end-pointing and trigger phrases |
US9721566B2 (en) | 2015-03-08 | 2017-08-01 | Apple Inc. | Competing devices responding to voice triggers |
US9734193B2 (en) | 2014-05-30 | 2017-08-15 | Apple Inc. | Determining domain salience ranking from ambiguous words in natural speech |
US9760559B2 (en) | 2014-05-30 | 2017-09-12 | Apple Inc. | Predictive text input |
US9785630B2 (en) | 2014-05-30 | 2017-10-10 | Apple Inc. | Text prediction using combined word N-gram and unigram language models |
US9798393B2 (en) | 2011-08-29 | 2017-10-24 | Apple Inc. | Text correction processing |
US9818400B2 (en) | 2014-09-11 | 2017-11-14 | Apple Inc. | Method and apparatus for discovering trending terms in speech requests |
US9842105B2 (en) | 2015-04-16 | 2017-12-12 | Apple Inc. | Parsimonious continuous-space phrase representations for natural language processing |
US9842101B2 (en) | 2014-05-30 | 2017-12-12 | Apple Inc. | Predictive conversion of language input |
US9858925B2 (en) | 2009-06-05 | 2018-01-02 | Apple Inc. | Using context information to facilitate processing of commands in a virtual assistant |
US9865280B2 (en) | 2015-03-06 | 2018-01-09 | Apple Inc. | Structured dictation using intelligent automated assistants |
US9886953B2 (en) | 2015-03-08 | 2018-02-06 | Apple Inc. | Virtual assistant activation |
US9886432B2 (en) | 2014-09-30 | 2018-02-06 | Apple Inc. | Parsimonious handling of word inflection via categorical stem + suffix N-gram language models |
US9899019B2 (en) | 2015-03-18 | 2018-02-20 | Apple Inc. | Systems and methods for structured stem and suffix language models |
US9922642B2 (en) | 2013-03-15 | 2018-03-20 | Apple Inc. | Training an at least partial voice command system |
US9934775B2 (en) | 2016-05-26 | 2018-04-03 | Apple Inc. | Unit-selection text-to-speech synthesis based on predicted concatenation parameters |
US9953088B2 (en) | 2012-05-14 | 2018-04-24 | Apple Inc. | Crowd sourcing information to fulfill user requests |
US9959870B2 (en) | 2008-12-11 | 2018-05-01 | Apple Inc. | Speech recognition involving a mobile device |
US9966065B2 (en) | 2014-05-30 | 2018-05-08 | Apple Inc. | Multi-command single utterance input method |
US9966068B2 (en) | 2013-06-08 | 2018-05-08 | Apple Inc. | Interpreting and acting upon commands that involve sharing information with remote devices |
US20180130462A1 (en) * | 2015-07-09 | 2018-05-10 | Yamaha Corporation | Voice interaction method and voice interaction device |
US9972304B2 (en) | 2016-06-03 | 2018-05-15 | Apple Inc. | Privacy preserving distributed evaluation framework for embedded personalized systems |
US9971774B2 (en) | 2012-09-19 | 2018-05-15 | Apple Inc. | Voice-based media searching |
US10043516B2 (en) | 2016-09-23 | 2018-08-07 | Apple Inc. | Intelligent automated assistant |
US10049668B2 (en) | 2015-12-02 | 2018-08-14 | Apple Inc. | Applying neural network language models to weighted finite state transducers for automatic speech recognition |
US10049663B2 (en) | 2016-06-08 | 2018-08-14 | Apple, Inc. | Intelligent automated assistant for media exploration |
US10057736B2 (en) | 2011-06-03 | 2018-08-21 | Apple Inc. | Active transport based notifications |
US10067938B2 (en) | 2016-06-10 | 2018-09-04 | Apple Inc. | Multilingual word prediction |
US10074360B2 (en) | 2014-09-30 | 2018-09-11 | Apple Inc. | Providing an indication of the suitability of speech recognition |
US10078631B2 (en) | 2014-05-30 | 2018-09-18 | Apple Inc. | Entropy-guided text prediction using combined word and character n-gram language models |
US10079014B2 (en) | 2012-06-08 | 2018-09-18 | Apple Inc. | Name recognition system |
US10083688B2 (en) | 2015-05-27 | 2018-09-25 | Apple Inc. | Device voice control for selecting a displayed affordance |
US10089072B2 (en) | 2016-06-11 | 2018-10-02 | Apple Inc. | Intelligent device arbitration and control |
US10101822B2 (en) | 2015-06-05 | 2018-10-16 | Apple Inc. | Language input correction |
US10127220B2 (en) | 2015-06-04 | 2018-11-13 | Apple Inc. | Language identification from short strings |
US10127911B2 (en) | 2014-09-30 | 2018-11-13 | Apple Inc. | Speaker identification and unsupervised speaker adaptation techniques |
US10134385B2 (en) | 2012-03-02 | 2018-11-20 | Apple Inc. | Systems and methods for name pronunciation |
US10170123B2 (en) | 2014-05-30 | 2019-01-01 | Apple Inc. | Intelligent assistant for home automation |
US10176167B2 (en) | 2013-06-09 | 2019-01-08 | Apple Inc. | System and method for inferring user intent from speech inputs |
US10185542B2 (en) | 2013-06-09 | 2019-01-22 | Apple Inc. | Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant |
US10186254B2 (en) | 2015-06-07 | 2019-01-22 | Apple Inc. | Context-based endpoint detection |
US10192552B2 (en) | 2016-06-10 | 2019-01-29 | Apple Inc. | Digital assistant providing whispered speech |
US10199051B2 (en) | 2013-02-07 | 2019-02-05 | Apple Inc. | Voice trigger for a digital assistant |
US10223066B2 (en) | 2015-12-23 | 2019-03-05 | Apple Inc. | Proactive assistance based on dialog communication between devices |
US10241752B2 (en) | 2011-09-30 | 2019-03-26 | Apple Inc. | Interface for a virtual digital assistant |
US10241644B2 (en) | 2011-06-03 | 2019-03-26 | Apple Inc. | Actionable reminder entries |
US10249300B2 (en) | 2016-06-06 | 2019-04-02 | Apple Inc. | Intelligent list reading |
US10255907B2 (en) | 2015-06-07 | 2019-04-09 | Apple Inc. | Automatic accent detection using acoustic models |
US10269345B2 (en) | 2016-06-11 | 2019-04-23 | Apple Inc. | Intelligent task discovery |
US10276170B2 (en) | 2010-01-18 | 2019-04-30 | Apple Inc. | Intelligent automated assistant |
US10283110B2 (en) | 2009-07-02 | 2019-05-07 | Apple Inc. | Methods and apparatuses for automatic speech recognition |
US10289433B2 (en) | 2014-05-30 | 2019-05-14 | Apple Inc. | Domain specific language for encoding assistant dialog |
US10297253B2 (en) | 2016-06-11 | 2019-05-21 | Apple Inc. | Application integration with a digital assistant |
US10303715B2 (en) | 2017-05-16 | 2019-05-28 | Apple Inc. | Intelligent automated assistant for media exploration |
US10311144B2 (en) | 2017-05-16 | 2019-06-04 | Apple Inc. | Emoji word sense disambiguation |
US10318871B2 (en) | 2005-09-08 | 2019-06-11 | Apple Inc. | Method and apparatus for building an intelligent automated assistant |
US10332518B2 (en) | 2017-05-09 | 2019-06-25 | Apple Inc. | User interface for correcting recognition errors |
US10356243B2 (en) | 2015-06-05 | 2019-07-16 | Apple Inc. | Virtual assistant aided communication with 3rd party service in a communication session |
US10354011B2 (en) | 2016-06-09 | 2019-07-16 | Apple Inc. | Intelligent automated assistant in a home environment |
US10366158B2 (en) | 2015-09-29 | 2019-07-30 | Apple Inc. | Efficient word encoding for recurrent neural network language models |
US10395654B2 (en) | 2017-05-11 | 2019-08-27 | Apple Inc. | Text normalization based on a data-driven learning network |
US10403278B2 (en) | 2017-05-16 | 2019-09-03 | Apple Inc. | Methods and systems for phonetic matching in digital assistant services |
US10403283B1 (en) | 2018-06-01 | 2019-09-03 | Apple Inc. | Voice interaction at a primary device to access call functionality of a companion device |
US10410637B2 (en) | 2017-05-12 | 2019-09-10 | Apple Inc. | User-specific acoustic models |
US10417266B2 (en) | 2017-05-09 | 2019-09-17 | Apple Inc. | Context-aware ranking of intelligent response suggestions |
US10446141B2 (en) | 2014-08-28 | 2019-10-15 | Apple Inc. | Automatic speech recognition based on user feedback |
US10445429B2 (en) | 2017-09-21 | 2019-10-15 | Apple Inc. | Natural language understanding using vocabularies with compressed serialized tries |
US10446143B2 (en) | 2016-03-14 | 2019-10-15 | Apple Inc. | Identification of voice inputs providing credentials |
US10474753B2 (en) | 2016-09-07 | 2019-11-12 | Apple Inc. | Language identification using recurrent neural networks |
US10482874B2 (en) | 2017-05-15 | 2019-11-19 | Apple Inc. | Hierarchical belief states for digital assistants |
CN110505204A (en) * | 2019-07-17 | 2019-11-26 | 视联动力信息技术股份有限公司 | Instant voice communication method and apparatus, electronic device, and readable storage medium |
US10490187B2 (en) | 2016-06-10 | 2019-11-26 | Apple Inc. | Digital assistant providing automated status report |
US10496705B1 (en) | 2018-06-03 | 2019-12-03 | Apple Inc. | Accelerated task performance |
US10496753B2 (en) | 2010-01-18 | 2019-12-03 | Apple Inc. | Automatically adapting user interfaces for hands-free interaction |
US10509862B2 (en) | 2016-06-10 | 2019-12-17 | Apple Inc. | Dynamic phrase expansion of language input |
US10521466B2 (en) | 2016-06-11 | 2019-12-31 | Apple Inc. | Data driven natural language event detection and classification |
US10552013B2 (en) | 2014-12-02 | 2020-02-04 | Apple Inc. | Data detection |
US10553209B2 (en) | 2010-01-18 | 2020-02-04 | Apple Inc. | Systems and methods for hands-free notification summaries |
US10568032B2 (en) | 2007-04-03 | 2020-02-18 | Apple Inc. | Method and system for operating a multi-function portable electronic device using voice-activation |
US10567477B2 (en) | 2015-03-08 | 2020-02-18 | Apple Inc. | Virtual assistant continuity |
US10592095B2 (en) | 2014-05-23 | 2020-03-17 | Apple Inc. | Instantaneous speaking of content on touch devices |
US10592604B2 (en) | 2018-03-12 | 2020-03-17 | Apple Inc. | Inverse text normalization for automatic speech recognition |
US10593346B2 (en) | 2016-12-22 | 2020-03-17 | Apple Inc. | Rank-reduced token representation for automatic speech recognition |
CN110970013A (en) * | 2019-12-23 | 2020-04-07 | 出门问问信息科技有限公司 | Speech synthesis method, device and computer readable storage medium |
US10636424B2 (en) | 2017-11-30 | 2020-04-28 | Apple Inc. | Multi-turn canned dialog |
US10643611B2 (en) | 2008-10-02 | 2020-05-05 | Apple Inc. | Electronic devices with voice command and contextual data processing capabilities |
US10659851B2 (en) | 2014-06-30 | 2020-05-19 | Apple Inc. | Real-time digital assistant knowledge updates |
US10657328B2 (en) | 2017-06-02 | 2020-05-19 | Apple Inc. | Multi-task recurrent neural network architecture for efficient morphology handling in neural language modeling |
US10671428B2 (en) | 2015-09-08 | 2020-06-02 | Apple Inc. | Distributed personal assistant |
US10679605B2 (en) | 2010-01-18 | 2020-06-09 | Apple Inc. | Hands-free list-reading by intelligent automated assistant |
US10684703B2 (en) | 2018-06-01 | 2020-06-16 | Apple Inc. | Attention aware virtual assistant dismissal |
US10691473B2 (en) | 2015-11-06 | 2020-06-23 | Apple Inc. | Intelligent automated assistant in a messaging environment |
US20200211531A1 (en) * | 2018-12-28 | 2020-07-02 | Rohit Kumar | Text-to-speech from media content item snippets |
US10706373B2 (en) | 2011-06-03 | 2020-07-07 | Apple Inc. | Performing actions associated with task items that represent tasks to perform |
US10705794B2 (en) | 2010-01-18 | 2020-07-07 | Apple Inc. | Automatically adapting user interfaces for hands-free interaction |
US10726832B2 (en) | 2017-05-11 | 2020-07-28 | Apple Inc. | Maintaining privacy of personal information |
US10733375B2 (en) | 2018-01-31 | 2020-08-04 | Apple Inc. | Knowledge-based framework for improving natural language understanding |
US10733993B2 (en) | 2016-06-10 | 2020-08-04 | Apple Inc. | Intelligent digital assistant in a multi-tasking environment |
US10733982B2 (en) | 2018-01-08 | 2020-08-04 | Apple Inc. | Multi-directional dialog |
US10747498B2 (en) | 2015-09-08 | 2020-08-18 | Apple Inc. | Zero latency digital assistant |
US10755051B2 (en) | 2017-09-29 | 2020-08-25 | Apple Inc. | Rule-based natural language processing |
US10755703B2 (en) | 2017-05-11 | 2020-08-25 | Apple Inc. | Offline personal assistant |
US10762293B2 (en) | 2010-12-22 | 2020-09-01 | Apple Inc. | Using parts-of-speech tagging and named entity recognition for spelling correction |
US10789959B2 (en) | 2018-03-02 | 2020-09-29 | Apple Inc. | Training speaker recognition models for digital assistants |
US10791216B2 (en) | 2013-08-06 | 2020-09-29 | Apple Inc. | Auto-activating smart responses based on activities from remote devices |
US10789945B2 (en) | 2017-05-12 | 2020-09-29 | Apple Inc. | Low-latency intelligent automated assistant |
US10789041B2 (en) | 2014-09-12 | 2020-09-29 | Apple Inc. | Dynamic thresholds for always listening speech trigger |
US10791176B2 (en) | 2017-05-12 | 2020-09-29 | Apple Inc. | Synchronization and task delegation of a digital assistant |
US10810274B2 (en) | 2017-05-15 | 2020-10-20 | Apple Inc. | Optimizing dialogue policy decisions for digital assistants using implicit feedback |
US10818288B2 (en) | 2018-03-26 | 2020-10-27 | Apple Inc. | Natural assistant interaction |
US10892996B2 (en) | 2018-06-01 | 2021-01-12 | Apple Inc. | Variable latency device coordination |
US10909331B2 (en) | 2018-03-30 | 2021-02-02 | Apple Inc. | Implicit identification of translation payload with neural machine translation |
US10928918B2 (en) | 2018-05-07 | 2021-02-23 | Apple Inc. | Raise to speak |
US10984780B2 (en) | 2018-05-21 | 2021-04-20 | Apple Inc. | Global semantic word embeddings using bi-directional recurrent neural networks |
US11010550B2 (en) | 2015-09-29 | 2021-05-18 | Apple Inc. | Unified language modeling framework for word prediction, auto-completion and auto-correction |
US11025565B2 (en) | 2015-06-07 | 2021-06-01 | Apple Inc. | Personalized prediction of responses for instant messaging |
US11023513B2 (en) | 2007-12-20 | 2021-06-01 | Apple Inc. | Method and apparatus for searching using an active ontology |
US11145288B2 (en) | 2018-07-24 | 2021-10-12 | Google Llc | Systems and methods for a text-to-speech interface |
US11145294B2 (en) | 2018-05-07 | 2021-10-12 | Apple Inc. | Intelligent automated assistant for delivering content from user experiences |
US11204787B2 (en) | 2017-01-09 | 2021-12-21 | Apple Inc. | Application integration with a digital assistant |
US11217255B2 (en) | 2017-05-16 | 2022-01-04 | Apple Inc. | Far-field extension for digital assistant services |
US11231904B2 (en) | 2015-03-06 | 2022-01-25 | Apple Inc. | Reducing response latency of intelligent automated assistants |
US11281993B2 (en) | 2016-12-05 | 2022-03-22 | Apple Inc. | Model and ensemble compression for metric learning |
US11289083B2 (en) * | 2018-11-14 | 2022-03-29 | Samsung Electronics Co., Ltd. | Electronic apparatus and method for controlling thereof |
US11302300B2 (en) * | 2019-11-19 | 2022-04-12 | Applications Technology (Apptek), Llc | Method and apparatus for forced duration in neural speech synthesis |
US11301477B2 (en) | 2017-05-12 | 2022-04-12 | Apple Inc. | Feedback analysis of a digital assistant |
US11314370B2 (en) | 2013-12-06 | 2022-04-26 | Apple Inc. | Method for extracting salient dialog usage from live data |
US11386266B2 (en) | 2018-06-01 | 2022-07-12 | Apple Inc. | Text correction |
US11495218B2 (en) | 2018-06-01 | 2022-11-08 | Apple Inc. | Virtual assistant operation in multi-device environments |
US11587559B2 (en) | 2015-09-30 | 2023-02-21 | Apple Inc. | Intelligent device identification |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2141696A1 (en) * | 2008-07-03 | 2010-01-06 | Deutsche Thomson OHG | Method for time scaling of a sequence of input signal values |
US20110046957A1 (en) * | 2009-08-24 | 2011-02-24 | NovaSpeech, LLC | System and method for speech synthesis using frequency splicing |
US9368104B2 (en) | 2012-04-30 | 2016-06-14 | Src, Inc. | System and method for synthesizing human speech using multiple speakers and context |
CN104142778B (en) * | 2013-09-25 | 2017-06-13 | 腾讯科技(深圳)有限公司 | Text processing method, apparatus, and mobile terminal |
US10388270B2 (en) | 2014-11-05 | 2019-08-20 | At&T Intellectual Property I, L.P. | System and method for text normalization using atomic tokens |
JP7119939B2 (en) * | 2018-11-19 | 2022-08-17 | トヨタ自動車株式会社 | Information processing device, information processing method and program |
Citations (30)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4278838A (en) * | 1976-09-08 | 1981-07-14 | Edinen Centar Po Physika | Method of and device for synthesis of speech from printed text |
US5732395A (en) * | 1993-03-19 | 1998-03-24 | Nynex Science & Technology | Methods for controlling the generation of speech from text representing names and addresses |
US5771276A (en) * | 1995-10-10 | 1998-06-23 | Ast Research, Inc. | Voice templates for interactive voice mail and voice response system |
US5850629A (en) * | 1996-09-09 | 1998-12-15 | Matsushita Electric Industrial Co., Ltd. | User interface controller for text-to-speech synthesizer |
US6047255A (en) * | 1997-12-04 | 2000-04-04 | Nortel Networks Corporation | Method and system for producing speech signals |
US6125346A (en) * | 1996-12-10 | 2000-09-26 | Matsushita Electric Industrial Co., Ltd | Speech synthesizing system and redundancy-reduced waveform database therefor |
US6173263B1 (en) * | 1998-08-31 | 2001-01-09 | At&T Corp. | Method and system for performing concatenative speech synthesis using half-phonemes |
US6185533B1 (en) * | 1999-03-15 | 2001-02-06 | Matsushita Electric Industrial Co., Ltd. | Generation and synthesis of prosody templates |
US20020052730A1 (en) * | 2000-09-25 | 2002-05-02 | Yoshio Nakao | Apparatus for reading a plurality of documents and a method thereof |
US20020072908A1 (en) * | 2000-10-19 | 2002-06-13 | Case Eliot M. | System and method for converting text-to-voice |
US20020133348A1 (en) * | 2001-03-15 | 2002-09-19 | Steve Pearson | Method and tool for customization of speech synthesizer databases using hierarchical generalized speech templates |
US20020173961A1 (en) * | 2001-03-09 | 2002-11-21 | Guerra Lisa M. | System, method and computer program product for dynamic, robust and fault tolerant audio output in a speech recognition framework |
US20030050781A1 (en) * | 2001-09-13 | 2003-03-13 | Yamaha Corporation | Apparatus and method for synthesizing a plurality of waveforms in synchronized manner |
US6535852B2 (en) * | 2001-03-29 | 2003-03-18 | International Business Machines Corporation | Training of text-to-speech systems |
US20040111266A1 (en) * | 1998-11-13 | 2004-06-10 | Geert Coorman | Speech synthesis using concatenation of speech waveforms |
US6757653B2 (en) * | 2000-06-30 | 2004-06-29 | Nokia Mobile Phones, Ltd. | Reassembling speech sentence fragments using associated phonetic property |
US20040254792A1 (en) * | 2003-06-10 | 2004-12-16 | BellSouth Intellectual Property Corporation | Methods and system for creating voice files using a VoiceXML application |
US6862568B2 (en) * | 2000-10-19 | 2005-03-01 | Qwest Communications International, Inc. | System and method for converting text-to-voice |
US20050119890A1 (en) * | 2003-11-28 | 2005-06-02 | Yoshifumi Hirose | Speech synthesis apparatus and speech synthesis method |
US6910007B2 (en) * | 2000-05-31 | 2005-06-21 | At&T Corp | Stochastic modeling of spectral adjustment for high quality pitch modification |
US6978239B2 (en) * | 2000-12-04 | 2005-12-20 | Microsoft Corporation | Method and apparatus for speech synthesis without prosody modification |
US20060074674A1 (en) * | 2004-09-30 | 2006-04-06 | International Business Machines Corporation | Method and system for statistic-based distance definition in text-to-speech conversion |
US7035794B2 (en) * | 2001-03-30 | 2006-04-25 | Intel Corporation | Compressing and using a concatenative speech database in text-to-speech systems |
US7191131B1 (en) * | 1999-06-30 | 2007-03-13 | Sony Corporation | Electronic document processing apparatus |
US20070106513A1 (en) * | 2005-11-10 | 2007-05-10 | Boillot Marc A | Method for facilitating text to speech synthesis using a differential vocoder |
US20070192105A1 (en) * | 2006-02-16 | 2007-08-16 | Matthias Neeracher | Multi-unit approach to text-to-speech synthesis |
US20070244702A1 (en) * | 2006-04-12 | 2007-10-18 | Jonathan Kahn | Session File Modification with Annotation Using Speech Recognition or Text to Speech |
US7292979B2 (en) * | 2001-11-03 | 2007-11-06 | Autonomy Systems, Limited | Time ordered indexing of audio data |
US7472065B2 (en) * | 2004-06-04 | 2008-12-30 | International Business Machines Corporation | Generating paralinguistic phenomena via markup in text-to-speech synthesis |
US20090076819A1 (en) * | 2006-03-17 | 2009-03-19 | Johan Wouters | Text to speech synthesis |
- 2006-09-15: US US11/532,470 patent/US8027837B2/en not_active Expired - Fee Related
Patent Citations (33)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4278838A (en) * | 1976-09-08 | 1981-07-14 | Edinen Centar Po Physika | Method of and device for synthesis of speech from printed text |
US5732395A (en) * | 1993-03-19 | 1998-03-24 | Nynex Science & Technology | Methods for controlling the generation of speech from text representing names and addresses |
US5771276A (en) * | 1995-10-10 | 1998-06-23 | Ast Research, Inc. | Voice templates for interactive voice mail and voice response system |
US6014428A (en) * | 1995-10-10 | 2000-01-11 | Ast Research, Inc. | Voice templates for interactive voice mail and voice response system |
US5850629A (en) * | 1996-09-09 | 1998-12-15 | Matsushita Electric Industrial Co., Ltd. | User interface controller for text-to-speech synthesizer |
US6125346A (en) * | 1996-12-10 | 2000-09-26 | Matsushita Electric Industrial Co., Ltd | Speech synthesizing system and redundancy-reduced waveform database therefor |
US6047255A (en) * | 1997-12-04 | 2000-04-04 | Nortel Networks Corporation | Method and system for producing speech signals |
US6173263B1 (en) * | 1998-08-31 | 2001-01-09 | At&T Corp. | Method and system for performing concatenative speech synthesis using half-phonemes |
US20040111266A1 (en) * | 1998-11-13 | 2004-06-10 | Geert Coorman | Speech synthesis using concatenation of speech waveforms |
US6185533B1 (en) * | 1999-03-15 | 2001-02-06 | Matsushita Electric Industrial Co., Ltd. | Generation and synthesis of prosody templates |
US7191131B1 (en) * | 1999-06-30 | 2007-03-13 | Sony Corporation | Electronic document processing apparatus |
US6910007B2 (en) * | 2000-05-31 | 2005-06-21 | At&T Corp | Stochastic modeling of spectral adjustment for high quality pitch modification |
US6757653B2 (en) * | 2000-06-30 | 2004-06-29 | Nokia Mobile Phones, Ltd. | Reassembling speech sentence fragments using associated phonetic property |
US20020052730A1 (en) * | 2000-09-25 | 2002-05-02 | Yoshio Nakao | Apparatus for reading a plurality of documents and a method thereof |
US20020072908A1 (en) * | 2000-10-19 | 2002-06-13 | Case Eliot M. | System and method for converting text-to-voice |
US6862568B2 (en) * | 2000-10-19 | 2005-03-01 | Qwest Communications International, Inc. | System and method for converting text-to-voice |
US6990450B2 (en) * | 2000-10-19 | 2006-01-24 | Qwest Communications International Inc. | System and method for converting text-to-voice |
US6978239B2 (en) * | 2000-12-04 | 2005-12-20 | Microsoft Corporation | Method and apparatus for speech synthesis without prosody modification |
US20020173961A1 (en) * | 2001-03-09 | 2002-11-21 | Guerra Lisa M. | System, method and computer program product for dynamic, robust and fault tolerant audio output in a speech recognition framework |
US20020133348A1 (en) * | 2001-03-15 | 2002-09-19 | Steve Pearson | Method and tool for customization of speech synthesizer databases using hierarchical generalized speech templates |
US6513008B2 (en) * | 2001-03-15 | 2003-01-28 | Matsushita Electric Industrial Co., Ltd. | Method and tool for customization of speech synthesizer databases using hierarchical generalized speech templates |
US6535852B2 (en) * | 2001-03-29 | 2003-03-18 | International Business Machines Corporation | Training of text-to-speech systems |
US7035794B2 (en) * | 2001-03-30 | 2006-04-25 | Intel Corporation | Compressing and using a concatenative speech database in text-to-speech systems |
US20030050781A1 (en) * | 2001-09-13 | 2003-03-13 | Yamaha Corporation | Apparatus and method for synthesizing a plurality of waveforms in synchronized manner |
US7292979B2 (en) * | 2001-11-03 | 2007-11-06 | Autonomy Systems, Limited | Time ordered indexing of audio data |
US20040254792A1 (en) * | 2003-06-10 | 2004-12-16 | BellSouth Intellectual Property Corporation | Methods and system for creating voice files using a VoiceXML application |
US20050119890A1 (en) * | 2003-11-28 | 2005-06-02 | Yoshifumi Hirose | Speech synthesis apparatus and speech synthesis method |
US7472065B2 (en) * | 2004-06-04 | 2008-12-30 | International Business Machines Corporation | Generating paralinguistic phenomena via markup in text-to-speech synthesis |
US20060074674A1 (en) * | 2004-09-30 | 2006-04-06 | International Business Machines Corporation | Method and system for statistic-based distance definition in text-to-speech conversion |
US20070106513A1 (en) * | 2005-11-10 | 2007-05-10 | Boillot Marc A | Method for facilitating text to speech synthesis using a differential vocoder |
US20070192105A1 (en) * | 2006-02-16 | 2007-08-16 | Matthias Neeracher | Multi-unit approach to text-to-speech synthesis |
US20090076819A1 (en) * | 2006-03-17 | 2009-03-19 | Johan Wouters | Text to speech synthesis |
US20070244702A1 (en) * | 2006-04-12 | 2007-10-18 | Jonathan Kahn | Session File Modification with Annotation Using Speech Recognition or Text to Speech |
Cited By (230)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9646614B2 (en) | 2000-03-16 | 2017-05-09 | Apple Inc. | Fast, language-independent method for user authentication by voice |
US10318871B2 (en) | 2005-09-08 | 2019-06-11 | Apple Inc. | Method and apparatus for building an intelligent automated assistant |
US20070192105A1 (en) * | 2006-02-16 | 2007-08-16 | Matthias Neeracher | Multi-unit approach to text-to-speech synthesis |
US8036894B2 (en) | 2006-02-16 | 2011-10-11 | Apple Inc. | Multi-unit approach to text-to-speech synthesis |
US8930191B2 (en) | 2006-09-08 | 2015-01-06 | Apple Inc. | Paraphrasing of user requests and results by automated digital assistant |
US8942986B2 (en) | 2006-09-08 | 2015-01-27 | Apple Inc. | Determining user intent based on ontologies of domains |
US9117447B2 (en) | 2006-09-08 | 2015-08-25 | Apple Inc. | Using event alert text as input to an automated assistant |
US10568032B2 (en) | 2007-04-03 | 2020-02-18 | Apple Inc. | Method and system for operating a multi-function portable electronic device using voice-activation |
US11023513B2 (en) | 2007-12-20 | 2021-06-01 | Apple Inc. | Method and apparatus for searching using an active ontology |
US10381016B2 (en) | 2008-01-03 | 2019-08-13 | Apple Inc. | Methods and apparatus for altering audio output signals |
US9330720B2 (en) | 2008-01-03 | 2016-05-03 | Apple Inc. | Methods and apparatus for altering audio output signals |
US9865248B2 (en) | 2008-04-05 | 2018-01-09 | Apple Inc. | Intelligent text-to-speech conversion |
US9626955B2 (en) | 2008-04-05 | 2017-04-18 | Apple Inc. | Intelligent text-to-speech conversion |
US10108612B2 (en) | 2008-07-31 | 2018-10-23 | Apple Inc. | Mobile device having human language translation capability with positional feedback |
US9535906B2 (en) | 2008-07-31 | 2017-01-03 | Apple Inc. | Mobile device having human language translation capability with positional feedback |
US10643611B2 (en) | 2008-10-02 | 2020-05-05 | Apple Inc. | Electronic devices with voice command and contextual data processing capabilities |
US11348582B2 (en) | 2008-10-02 | 2022-05-31 | Apple Inc. | Electronic devices with voice command and contextual data processing capabilities |
US20100114556A1 (en) * | 2008-10-31 | 2010-05-06 | International Business Machines Corporation | Speech translation method and apparatus |
US9342509B2 (en) * | 2008-10-31 | 2016-05-17 | Nuance Communications, Inc. | Speech translation method and apparatus utilizing prosodic information |
US9959870B2 (en) | 2008-12-11 | 2018-05-01 | Apple Inc. | Speech recognition involving a mobile device |
US10795541B2 (en) | 2009-06-05 | 2020-10-06 | Apple Inc. | Intelligent organization of tasks items |
US11080012B2 (en) | 2009-06-05 | 2021-08-03 | Apple Inc. | Interface for a virtual digital assistant |
US10475446B2 (en) | 2009-06-05 | 2019-11-12 | Apple Inc. | Using context information to facilitate processing of commands in a virtual assistant |
US9858925B2 (en) | 2009-06-05 | 2018-01-02 | Apple Inc. | Using context information to facilitate processing of commands in a virtual assistant |
US10283110B2 (en) | 2009-07-02 | 2019-05-07 | Apple Inc. | Methods and apparatuses for automatic speech recognition |
US12087308B2 (en) | 2010-01-18 | 2024-09-10 | Apple Inc. | Intelligent automated assistant |
US10679605B2 (en) | 2010-01-18 | 2020-06-09 | Apple Inc. | Hands-free list-reading by intelligent automated assistant |
US9548050B2 (en) | 2010-01-18 | 2017-01-17 | Apple Inc. | Intelligent automated assistant |
US10706841B2 (en) | 2010-01-18 | 2020-07-07 | Apple Inc. | Task flow identification based on user intent |
US10496753B2 (en) | 2010-01-18 | 2019-12-03 | Apple Inc. | Automatically adapting user interfaces for hands-free interaction |
US9318108B2 (en) | 2010-01-18 | 2016-04-19 | Apple Inc. | Intelligent automated assistant |
US8903716B2 (en) | 2010-01-18 | 2014-12-02 | Apple Inc. | Personalized vocabulary for digital assistant |
US10553209B2 (en) | 2010-01-18 | 2020-02-04 | Apple Inc. | Systems and methods for hands-free notification summaries |
US10705794B2 (en) | 2010-01-18 | 2020-07-07 | Apple Inc. | Automatically adapting user interfaces for hands-free interaction |
US8892446B2 (en) | 2010-01-18 | 2014-11-18 | Apple Inc. | Service orchestration for intelligent automated assistant |
US10276170B2 (en) | 2010-01-18 | 2019-04-30 | Apple Inc. | Intelligent automated assistant |
US11423886B2 (en) | 2010-01-18 | 2022-08-23 | Apple Inc. | Task flow identification based on user intent |
US10692504B2 (en) | 2010-02-25 | 2020-06-23 | Apple Inc. | User profiling for voice input processing |
US10049675B2 (en) | 2010-02-25 | 2018-08-14 | Apple Inc. | User profiling for voice input processing |
US9633660B2 (en) | 2010-02-25 | 2017-04-25 | Apple Inc. | User profiling for voice input processing |
US10762293B2 (en) | 2010-12-22 | 2020-09-01 | Apple Inc. | Using parts-of-speech tagging and named entity recognition for spelling correction |
US10417405B2 (en) | 2011-03-21 | 2019-09-17 | Apple Inc. | Device access using voice authentication |
US10102359B2 (en) | 2011-03-21 | 2018-10-16 | Apple Inc. | Device access using voice authentication |
US9262612B2 (en) | 2011-03-21 | 2016-02-16 | Apple Inc. | Device access using voice authentication |
US11350253B2 (en) | 2011-06-03 | 2022-05-31 | Apple Inc. | Active transport based notifications |
US10706373B2 (en) | 2011-06-03 | 2020-07-07 | Apple Inc. | Performing actions associated with task items that represent tasks to perform |
US10241644B2 (en) | 2011-06-03 | 2019-03-26 | Apple Inc. | Actionable reminder entries |
US11120372B2 (en) | 2011-06-03 | 2021-09-14 | Apple Inc. | Performing actions associated with task items that represent tasks to perform |
US10057736B2 (en) | 2011-06-03 | 2018-08-21 | Apple Inc. | Active transport based notifications |
US9798393B2 (en) | 2011-08-29 | 2017-10-24 | Apple Inc. | Text correction processing |
US10241752B2 (en) | 2011-09-30 | 2019-03-26 | Apple Inc. | Interface for a virtual digital assistant |
US11069336B2 (en) | 2012-03-02 | 2021-07-20 | Apple Inc. | Systems and methods for name pronunciation |
US10134385B2 (en) | 2012-03-02 | 2018-11-20 | Apple Inc. | Systems and methods for name pronunciation |
US9483461B2 (en) | 2012-03-06 | 2016-11-01 | Apple Inc. | Handling speech synthesis of content for multiple languages |
US9953088B2 (en) | 2012-05-14 | 2018-04-24 | Apple Inc. | Crowd sourcing information to fulfill user requests |
US10079014B2 (en) | 2012-06-08 | 2018-09-18 | Apple Inc. | Name recognition system |
US9495129B2 (en) | 2012-06-29 | 2016-11-15 | Apple Inc. | Device, method, and user interface for voice-activated navigation and browsing of a document |
US9576574B2 (en) | 2012-09-10 | 2017-02-21 | Apple Inc. | Context-sensitive handling of interruptions by intelligent digital assistant |
US9971774B2 (en) | 2012-09-19 | 2018-05-15 | Apple Inc. | Voice-based media searching |
US10199051B2 (en) | 2013-02-07 | 2019-02-05 | Apple Inc. | Voice trigger for a digital assistant |
US10978090B2 (en) | 2013-02-07 | 2021-04-13 | Apple Inc. | Voice trigger for a digital assistant |
US9368114B2 (en) | 2013-03-14 | 2016-06-14 | Apple Inc. | Context-sensitive handling of interruptions |
US9697822B1 (en) | 2013-03-15 | 2017-07-04 | Apple Inc. | System and method for updating an adaptive speech recognition model |
US9922642B2 (en) | 2013-03-15 | 2018-03-20 | Apple Inc. | Training an at least partial voice command system |
US9633674B2 (en) | 2013-06-07 | 2017-04-25 | Apple Inc. | System and method for detecting errors in interactions with a voice-based digital assistant |
US9582608B2 (en) | 2013-06-07 | 2017-02-28 | Apple Inc. | Unified ranking with entropy-weighted information for phrase-based semantic auto-completion |
US9966060B2 (en) | 2013-06-07 | 2018-05-08 | Apple Inc. | System and method for user-specified pronunciation of words for speech synthesis and recognition |
US9620104B2 (en) | 2013-06-07 | 2017-04-11 | Apple Inc. | System and method for user-specified pronunciation of words for speech synthesis and recognition |
US9966068B2 (en) | 2013-06-08 | 2018-05-08 | Apple Inc. | Interpreting and acting upon commands that involve sharing information with remote devices |
US10657961B2 (en) | 2013-06-08 | 2020-05-19 | Apple Inc. | Interpreting and acting upon commands that involve sharing information with remote devices |
US11048473B2 (en) | 2013-06-09 | 2021-06-29 | Apple Inc. | Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant |
US10185542B2 (en) | 2013-06-09 | 2019-01-22 | Apple Inc. | Device, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant |
US10176167B2 (en) | 2013-06-09 | 2019-01-08 | Apple Inc. | System and method for inferring user intent from speech inputs |
US10769385B2 (en) | 2013-06-09 | 2020-09-08 | Apple Inc. | System and method for inferring user intent from speech inputs |
US9300784B2 (en) | 2013-06-13 | 2016-03-29 | Apple Inc. | System and method for emergency calls initiated by voice command |
US10791216B2 (en) | 2013-08-06 | 2020-09-29 | Apple Inc. | Auto-activating smart responses based on activities from remote devices |
US9508338B1 (en) * | 2013-11-15 | 2016-11-29 | Amazon Technologies, Inc. | Inserting breath sounds into text-to-speech output |
US11314370B2 (en) | 2013-12-06 | 2022-04-26 | Apple Inc. | Method for extracting salient dialog usage from live data |
US9620105B2 (en) | 2014-05-15 | 2017-04-11 | Apple Inc. | Analyzing audio input for efficient speech and music recognition |
US10592095B2 (en) | 2014-05-23 | 2020-03-17 | Apple Inc. | Instantaneous speaking of content on touch devices |
US9502031B2 (en) | 2014-05-27 | 2016-11-22 | Apple Inc. | Method for supporting dynamic grammars in WFST-based ASR |
US9734193B2 (en) | 2014-05-30 | 2017-08-15 | Apple Inc. | Determining domain salience ranking from ambiguous words in natural speech |
US9760559B2 (en) | 2014-05-30 | 2017-09-12 | Apple Inc. | Predictive text input |
US10169329B2 (en) | 2014-05-30 | 2019-01-01 | Apple Inc. | Exemplar-based natural language processing |
US10078631B2 (en) | 2014-05-30 | 2018-09-18 | Apple Inc. | Entropy-guided text prediction using combined word and character n-gram language models |
US9842101B2 (en) | 2014-05-30 | 2017-12-12 | Apple Inc. | Predictive conversion of language input |
US10657966B2 (en) | 2014-05-30 | 2020-05-19 | Apple Inc. | Better resolution when referencing to concepts |
US10699717B2 (en) | 2014-05-30 | 2020-06-30 | Apple Inc. | Intelligent assistant for home automation |
US9785630B2 (en) | 2014-05-30 | 2017-10-10 | Apple Inc. | Text prediction using combined word N-gram and unigram language models |
US10497365B2 (en) | 2014-05-30 | 2019-12-03 | Apple Inc. | Multi-command single utterance input method |
US9430463B2 (en) | 2014-05-30 | 2016-08-30 | Apple Inc. | Exemplar-based natural language processing |
US10714095B2 (en) | 2014-05-30 | 2020-07-14 | Apple Inc. | Intelligent assistant for home automation |
US9715875B2 (en) | 2014-05-30 | 2017-07-25 | Apple Inc. | Reducing the need for manual start/end-pointing and trigger phrases |
US9966065B2 (en) | 2014-05-30 | 2018-05-08 | Apple Inc. | Multi-command single utterance input method |
US10417344B2 (en) | 2014-05-30 | 2019-09-17 | Apple Inc. | Exemplar-based natural language processing |
US11257504B2 (en) | 2014-05-30 | 2022-02-22 | Apple Inc. | Intelligent assistant for home automation |
US10170123B2 (en) | 2014-05-30 | 2019-01-01 | Apple Inc. | Intelligent assistant for home automation |
US10289433B2 (en) | 2014-05-30 | 2019-05-14 | Apple Inc. | Domain specific language for encoding assistant dialog |
US9633004B2 (en) | 2014-05-30 | 2017-04-25 | Apple Inc. | Better resolution when referencing to concepts |
US11133008B2 (en) | 2014-05-30 | 2021-09-28 | Apple Inc. | Reducing the need for manual start/end-pointing and trigger phrases |
US10083690B2 (en) | 2014-05-30 | 2018-09-25 | Apple Inc. | Better resolution when referencing to concepts |
US9668024B2 (en) | 2014-06-30 | 2017-05-30 | Apple Inc. | Intelligent automated assistant for TV user interactions |
US9338493B2 (en) | 2014-06-30 | 2016-05-10 | Apple Inc. | Intelligent automated assistant for TV user interactions |
US10904611B2 (en) | 2014-06-30 | 2021-01-26 | Apple Inc. | Intelligent automated assistant for TV user interactions |
US10659851B2 (en) | 2014-06-30 | 2020-05-19 | Apple Inc. | Real-time digital assistant knowledge updates |
US10446141B2 (en) | 2014-08-28 | 2019-10-15 | Apple Inc. | Automatic speech recognition based on user feedback |
US9818400B2 (en) | 2014-09-11 | 2017-11-14 | Apple Inc. | Method and apparatus for discovering trending terms in speech requests |
US10431204B2 (en) | 2014-09-11 | 2019-10-01 | Apple Inc. | Method and apparatus for discovering trending terms in speech requests |
US10789041B2 (en) | 2014-09-12 | 2020-09-29 | Apple Inc. | Dynamic thresholds for always listening speech trigger |
US9606986B2 (en) | 2014-09-29 | 2017-03-28 | Apple Inc. | Integrated word N-gram and class M-gram language models |
US10390213B2 (en) | 2014-09-30 | 2019-08-20 | Apple Inc. | Social reminders |
US9646609B2 (en) | 2014-09-30 | 2017-05-09 | Apple Inc. | Caching apparatus for serving phonetic pronunciations |
US10453443B2 (en) | 2014-09-30 | 2019-10-22 | Apple Inc. | Providing an indication of the suitability of speech recognition |
US10074360B2 (en) | 2014-09-30 | 2018-09-11 | Apple Inc. | Providing an indication of the suitability of speech recognition |
US9668121B2 (en) | 2014-09-30 | 2017-05-30 | Apple Inc. | Social reminders |
US10127911B2 (en) | 2014-09-30 | 2018-11-13 | Apple Inc. | Speaker identification and unsupervised speaker adaptation techniques |
US9886432B2 (en) | 2014-09-30 | 2018-02-06 | Apple Inc. | Parsimonious handling of word inflection via categorical stem + suffix N-gram language models |
US9986419B2 (en) | 2014-09-30 | 2018-05-29 | Apple Inc. | Social reminders |
US10438595B2 (en) | 2014-09-30 | 2019-10-08 | Apple Inc. | Speaker identification and unsupervised speaker adaptation techniques |
US10552013B2 (en) | 2014-12-02 | 2020-02-04 | Apple Inc. | Data detection |
US11556230B2 (en) | 2014-12-02 | 2023-01-17 | Apple Inc. | Data detection |
US9711141B2 (en) | 2014-12-09 | 2017-07-18 | Apple Inc. | Disambiguating heteronyms in speech synthesis |
US11231904B2 (en) | 2015-03-06 | 2022-01-25 | Apple Inc. | Reducing response latency of intelligent automated assistants |
US9865280B2 (en) | 2015-03-06 | 2018-01-09 | Apple Inc. | Structured dictation using intelligent automated assistants |
US10311871B2 (en) | 2015-03-08 | 2019-06-04 | Apple Inc. | Competing devices responding to voice triggers |
US9886953B2 (en) | 2015-03-08 | 2018-02-06 | Apple Inc. | Virtual assistant activation |
US10529332B2 (en) | 2015-03-08 | 2020-01-07 | Apple Inc. | Virtual assistant activation |
US10567477B2 (en) | 2015-03-08 | 2020-02-18 | Apple Inc. | Virtual assistant continuity |
US9721566B2 (en) | 2015-03-08 | 2017-08-01 | Apple Inc. | Competing devices responding to voice triggers |
US11087759B2 (en) | 2015-03-08 | 2021-08-10 | Apple Inc. | Virtual assistant activation |
US9899019B2 (en) | 2015-03-18 | 2018-02-20 | Apple Inc. | Systems and methods for structured stem and suffix language models |
US9842105B2 (en) | 2015-04-16 | 2017-12-12 | Apple Inc. | Parsimonious continuous-space phrase representations for natural language processing |
US10083688B2 (en) | 2015-05-27 | 2018-09-25 | Apple Inc. | Device voice control for selecting a displayed affordance |
US11127397B2 (en) | 2015-05-27 | 2021-09-21 | Apple Inc. | Device voice control |
US10127220B2 (en) | 2015-06-04 | 2018-11-13 | Apple Inc. | Language identification from short strings |
US10101822B2 (en) | 2015-06-05 | 2018-10-16 | Apple Inc. | Language input correction |
US10356243B2 (en) | 2015-06-05 | 2019-07-16 | Apple Inc. | Virtual assistant aided communication with 3rd party service in a communication session |
US11025565B2 (en) | 2015-06-07 | 2021-06-01 | Apple Inc. | Personalized prediction of responses for instant messaging |
US10255907B2 (en) | 2015-06-07 | 2019-04-09 | Apple Inc. | Automatic accent detection using acoustic models |
US10186254B2 (en) | 2015-06-07 | 2019-01-22 | Apple Inc. | Context-based endpoint detection |
US20180130462A1 (en) * | 2015-07-09 | 2018-05-10 | Yamaha Corporation | Voice interaction method and voice interaction device |
US11500672B2 (en) | 2015-09-08 | 2022-11-15 | Apple Inc. | Distributed personal assistant |
US10747498B2 (en) | 2015-09-08 | 2020-08-18 | Apple Inc. | Zero latency digital assistant |
US10671428B2 (en) | 2015-09-08 | 2020-06-02 | Apple Inc. | Distributed personal assistant |
US9697820B2 (en) | 2015-09-24 | 2017-07-04 | Apple Inc. | Unit-selection text-to-speech synthesis using concatenation-sensitive neural networks |
US11010550B2 (en) | 2015-09-29 | 2021-05-18 | Apple Inc. | Unified language modeling framework for word prediction, auto-completion and auto-correction |
US10366158B2 (en) | 2015-09-29 | 2019-07-30 | Apple Inc. | Efficient word encoding for recurrent neural network language models |
US11587559B2 (en) | 2015-09-30 | 2023-02-21 | Apple Inc. | Intelligent device identification |
US10691473B2 (en) | 2015-11-06 | 2020-06-23 | Apple Inc. | Intelligent automated assistant in a messaging environment |
US11526368B2 (en) | 2015-11-06 | 2022-12-13 | Apple Inc. | Intelligent automated assistant in a messaging environment |
US10354652B2 (en) | 2015-12-02 | 2019-07-16 | Apple Inc. | Applying neural network language models to weighted finite state transducers for automatic speech recognition |
US10049668B2 (en) | 2015-12-02 | 2018-08-14 | Apple Inc. | Applying neural network language models to weighted finite state transducers for automatic speech recognition |
US10223066B2 (en) | 2015-12-23 | 2019-03-05 | Apple Inc. | Proactive assistance based on dialog communication between devices |
US10446143B2 (en) | 2016-03-14 | 2019-10-15 | Apple Inc. | Identification of voice inputs providing credentials |
US9934775B2 (en) | 2016-05-26 | 2018-04-03 | Apple Inc. | Unit-selection text-to-speech synthesis based on predicted concatenation parameters |
US9972304B2 (en) | 2016-06-03 | 2018-05-15 | Apple Inc. | Privacy preserving distributed evaluation framework for embedded personalized systems |
US10249300B2 (en) | 2016-06-06 | 2019-04-02 | Apple Inc. | Intelligent list reading |
US11069347B2 (en) | 2016-06-08 | 2021-07-20 | Apple Inc. | Intelligent automated assistant for media exploration |
US10049663B2 (en) | 2016-06-08 | 2018-08-14 | Apple, Inc. | Intelligent automated assistant for media exploration |
US10354011B2 (en) | 2016-06-09 | 2019-07-16 | Apple Inc. | Intelligent automated assistant in a home environment |
US10733993B2 (en) | 2016-06-10 | 2020-08-04 | Apple Inc. | Intelligent digital assistant in a multi-tasking environment |
US11037565B2 (en) | 2016-06-10 | 2021-06-15 | Apple Inc. | Intelligent digital assistant in a multi-tasking environment |
US10490187B2 (en) | 2016-06-10 | 2019-11-26 | Apple Inc. | Digital assistant providing automated status report |
US10509862B2 (en) | 2016-06-10 | 2019-12-17 | Apple Inc. | Dynamic phrase expansion of language input |
US10192552B2 (en) | 2016-06-10 | 2019-01-29 | Apple Inc. | Digital assistant providing whispered speech |
US10067938B2 (en) | 2016-06-10 | 2018-09-04 | Apple Inc. | Multilingual word prediction |
US10521466B2 (en) | 2016-06-11 | 2019-12-31 | Apple Inc. | Data driven natural language event detection and classification |
US10580409B2 (en) | 2016-06-11 | 2020-03-03 | Apple Inc. | Application integration with a digital assistant |
US10089072B2 (en) | 2016-06-11 | 2018-10-02 | Apple Inc. | Intelligent device arbitration and control |
US10269345B2 (en) | 2016-06-11 | 2019-04-23 | Apple Inc. | Intelligent task discovery |
US10297253B2 (en) | 2016-06-11 | 2019-05-21 | Apple Inc. | Application integration with a digital assistant |
US10942702B2 (en) | 2016-06-11 | 2021-03-09 | Apple Inc. | Intelligent device arbitration and control |
US11152002B2 (en) | 2016-06-11 | 2021-10-19 | Apple Inc. | Application integration with a digital assistant |
US10474753B2 (en) | 2016-09-07 | 2019-11-12 | Apple Inc. | Language identification using recurrent neural networks |
US10043516B2 (en) | 2016-09-23 | 2018-08-07 | Apple Inc. | Intelligent automated assistant |
US10553215B2 (en) | 2016-09-23 | 2020-02-04 | Apple Inc. | Intelligent automated assistant |
US11281993B2 (en) | 2016-12-05 | 2022-03-22 | Apple Inc. | Model and ensemble compression for metric learning |
US10593346B2 (en) | 2016-12-22 | 2020-03-17 | Apple Inc. | Rank-reduced token representation for automatic speech recognition |
US11204787B2 (en) | 2017-01-09 | 2021-12-21 | Apple Inc. | Application integration with a digital assistant |
US10332518B2 (en) | 2017-05-09 | 2019-06-25 | Apple Inc. | User interface for correcting recognition errors |
US10417266B2 (en) | 2017-05-09 | 2019-09-17 | Apple Inc. | Context-aware ranking of intelligent response suggestions |
US10755703B2 (en) | 2017-05-11 | 2020-08-25 | Apple Inc. | Offline personal assistant |
US10847142B2 (en) | 2017-05-11 | 2020-11-24 | Apple Inc. | Maintaining privacy of personal information |
US10726832B2 (en) | 2017-05-11 | 2020-07-28 | Apple Inc. | Maintaining privacy of personal information |
US10395654B2 (en) | 2017-05-11 | 2019-08-27 | Apple Inc. | Text normalization based on a data-driven learning network |
US10410637B2 (en) | 2017-05-12 | 2019-09-10 | Apple Inc. | User-specific acoustic models |
US10791176B2 (en) | 2017-05-12 | 2020-09-29 | Apple Inc. | Synchronization and task delegation of a digital assistant |
US10789945B2 (en) | 2017-05-12 | 2020-09-29 | Apple Inc. | Low-latency intelligent automated assistant |
US11301477B2 (en) | 2017-05-12 | 2022-04-12 | Apple Inc. | Feedback analysis of a digital assistant |
US11405466B2 (en) | 2017-05-12 | 2022-08-02 | Apple Inc. | Synchronization and task delegation of a digital assistant |
US10482874B2 (en) | 2017-05-15 | 2019-11-19 | Apple Inc. | Hierarchical belief states for digital assistants |
US10810274B2 (en) | 2017-05-15 | 2020-10-20 | Apple Inc. | Optimizing dialogue policy decisions for digital assistants using implicit feedback |
US10403278B2 (en) | 2017-05-16 | 2019-09-03 | Apple Inc. | Methods and systems for phonetic matching in digital assistant services |
US11217255B2 (en) | 2017-05-16 | 2022-01-04 | Apple Inc. | Far-field extension for digital assistant services |
US10303715B2 (en) | 2017-05-16 | 2019-05-28 | Apple Inc. | Intelligent automated assistant for media exploration |
US10311144B2 (en) | 2017-05-16 | 2019-06-04 | Apple Inc. | Emoji word sense disambiguation |
US10657328B2 (en) | 2017-06-02 | 2020-05-19 | Apple Inc. | Multi-task recurrent neural network architecture for efficient morphology handling in neural language modeling |
US10445429B2 (en) | 2017-09-21 | 2019-10-15 | Apple Inc. | Natural language understanding using vocabularies with compressed serialized tries |
US10755051B2 (en) | 2017-09-29 | 2020-08-25 | Apple Inc. | Rule-based natural language processing |
US10636424B2 (en) | 2017-11-30 | 2020-04-28 | Apple Inc. | Multi-turn canned dialog |
US10733982B2 (en) | 2018-01-08 | 2020-08-04 | Apple Inc. | Multi-directional dialog |
US10733375B2 (en) | 2018-01-31 | 2020-08-04 | Apple Inc. | Knowledge-based framework for improving natural language understanding |
US10789959B2 (en) | 2018-03-02 | 2020-09-29 | Apple Inc. | Training speaker recognition models for digital assistants |
US10592604B2 (en) | 2018-03-12 | 2020-03-17 | Apple Inc. | Inverse text normalization for automatic speech recognition |
US10818288B2 (en) | 2018-03-26 | 2020-10-27 | Apple Inc. | Natural assistant interaction |
US10909331B2 (en) | 2018-03-30 | 2021-02-02 | Apple Inc. | Implicit identification of translation payload with neural machine translation |
US10928918B2 (en) | 2018-05-07 | 2021-02-23 | Apple Inc. | Raise to speak |
US11145294B2 (en) | 2018-05-07 | 2021-10-12 | Apple Inc. | Intelligent automated assistant for delivering content from user experiences |
US10984780B2 (en) | 2018-05-21 | 2021-04-20 | Apple Inc. | Global semantic word embeddings using bi-directional recurrent neural networks |
US10892996B2 (en) | 2018-06-01 | 2021-01-12 | Apple Inc. | Variable latency device coordination |
US11009970B2 (en) | 2018-06-01 | 2021-05-18 | Apple Inc. | Attention aware virtual assistant dismissal |
US10984798B2 (en) | 2018-06-01 | 2021-04-20 | Apple Inc. | Voice interaction at a primary device to access call functionality of a companion device |
US10684703B2 (en) | 2018-06-01 | 2020-06-16 | Apple Inc. | Attention aware virtual assistant dismissal |
US11495218B2 (en) | 2018-06-01 | 2022-11-08 | Apple Inc. | Virtual assistant operation in multi-device environments |
US11386266B2 (en) | 2018-06-01 | 2022-07-12 | Apple Inc. | Text correction |
US10403283B1 (en) | 2018-06-01 | 2019-09-03 | Apple Inc. | Voice interaction at a primary device to access call functionality of a companion device |
US10496705B1 (en) | 2018-06-03 | 2019-12-03 | Apple Inc. | Accelerated task performance |
US10504518B1 (en) | 2018-06-03 | 2019-12-10 | Apple Inc. | Accelerated task performance |
US10944859B2 (en) | 2018-06-03 | 2021-03-09 | Apple Inc. | Accelerated task performance |
US11145288B2 (en) | 2018-07-24 | 2021-10-12 | Google Llc | Systems and methods for a text-to-speech interface |
US12020681B2 (en) | 2018-07-24 | 2024-06-25 | Google Llc | Systems and methods for a text-to-speech interface |
US20220180872A1 (en) * | 2018-11-14 | 2022-06-09 | Samsung Electronics Co., Ltd. | Electronic apparatus and method for controlling thereof |
US11289083B2 (en) * | 2018-11-14 | 2022-03-29 | Samsung Electronics Co., Ltd. | Electronic apparatus and method for controlling thereof |
US12154563B2 (en) * | 2018-11-14 | 2024-11-26 | Samsung Electronics Co., Ltd. | Electronic apparatus and method for controlling thereof |
US11114085B2 (en) * | 2018-12-28 | 2021-09-07 | Spotify Ab | Text-to-speech from media content item snippets |
US20200211531A1 (en) * | 2018-12-28 | 2020-07-02 | Rohit Kumar | Text-to-speech from media content item snippets |
US11710474B2 (en) | 2018-12-28 | 2023-07-25 | Spotify Ab | Text-to-speech from media content item snippets |
CN110505204A (en) * | 2019-07-17 | 2019-11-26 | 视联动力信息技术股份有限公司 | Instant voice communication method and apparatus, electronic device, and readable storage medium |
US11302300B2 (en) * | 2019-11-19 | 2022-04-12 | Applications Technology (Apptek), Llc | Method and apparatus for forced duration in neural speech synthesis |
CN110970013A (en) * | 2019-12-23 | 2020-04-07 | 出门问问信息科技有限公司 | Speech synthesis method, device and computer readable storage medium |
Also Published As
Publication number | Publication date |
---|---|
US8027837B2 (en) | 2011-09-27 |
Similar Documents
Publication | Title |
---|---|
US8027837B2 (en) | Using non-speech sounds during text-to-speech synthesis |
US8036894B2 (en) | Multi-unit approach to text-to-speech synthesis |
Moberg | Contributions to Multilingual Low-Footprint TTS System for Hand-Held Devices |
EP3387646B1 (en) | Text-to-speech processing system and method |
US9424833B2 (en) | Method and apparatus for providing speech output for speech-enabled applications |
Athanaselis et al. | ASR for emotional speech: clarifying the issues and enhancing performance |
US20220392430A1 (en) | System Providing Expressive and Emotive Text-to-Speech |
EP3791382A1 (en) | Generating audio for a plain text document |
US7010489B1 (en) | Method for guiding text-to-speech output timing using speech recognition markers |
US20080177543A1 (en) | Stochastic Syllable Accent Recognition |
Campbell | Conversational speech synthesis and the need for some laughter |
Furui et al. | Analysis and recognition of spontaneous speech using Corpus of Spontaneous Japanese |
Proença et al. | Automatic evaluation of reading aloud performance in children |
Mitsui et al. | Towards human-like spoken dialogue generation between ai agents from written dialogue |
Lin et al. | Hierarchical prosody modeling for Mandarin spontaneous speech |
Dall | Statistical parametric speech synthesis using conversational data and phenomena |
Iriondo et al. | Objective and subjective evaluation of an expressive speech corpus |
Kolář | Automatic segmentation of speech into sentence-like units |
Atterer et al. | Integrating linguistic and performance-based constraints for assigning phrase breaks |
Trouvain et al. | Speech synthesis: text-to-speech conversion and artificial voices |
Nemoto | Large-scale acoustic and prosodic investigations of French |
Verkhodanova et al. | Experiments on detection of voiced hesitations in Russian spontaneous speech |
JP2005181998A (en) | Speech synthesizer and speech synthesizing method |
Kayte | Text-To-Speech Synthesis System for Marathi Language Using Concatenation Technique |
Sloan | Using Linguistic Features to Improve Prosody for Text-to-Speech |
Legal Events
Code | Title | Description |
---|---|---|
AS | Assignment | Owner name: APPLE COMPUTER, INC., CALIFORNIA. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: SILVERMAN, KIM E.A.; NEERACHER, MATTHIAS; REEL/FRAME: 018292/0854. Effective date: 20060913 |
AS | Assignment | Owner name: APPLE INC., CALIFORNIA. Free format text: CHANGE OF NAME; ASSIGNOR: APPLE COMPUTER, INC.; REEL/FRAME: 019142/0969. Effective date: 20070109 |
FEPP | Fee payment procedure | Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
STCF | Information on status: patent grant | Free format text: PATENTED CASE |
FPAY | Fee payment | Year of fee payment: 4 |
FEPP | Fee payment procedure | Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
LAPS | Lapse for failure to pay maintenance fees | Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
STCH | Information on status: patent discontinuation | Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362 |
FP | Lapsed due to failure to pay maintenance fee | Effective date: 20190927 |