JP2017513047A

JP2017513047A - Pronunciation prediction in speech recognition.

Info

Publication number: JP2017513047A
Application number: JP2016555771A
Authority: JP
Inventors: Jeffrey Penrod Adams; Alok Ulhas Parlikar; Jeffrey Paul Lilly; Ariya Rastrow
Original assignee: Amazon Technologies Inc
Current assignee: Amazon Technologies Inc
Priority date: 2014-03-04
Filing date: 2015-02-27
Publication date: 2017-05-25
Anticipated expiration: 2035-02-27
Also published as: US10339920B2; EP3114679A4; CN106463113B; CN106463113A; EP3114679B1; WO2015134309A1; JP6550068B2; US20150255069A1; EP3114679A1

Abstract

An automatic speech recognition (ASR) device may be configured to predict pronunciations of a text identifier (for example, a song title) based on a prediction of one or more languages of origin of the text identifier. The one or more languages of origin may be determined based on the text identifier. The pronunciations may include a pronunciation in one language, a pronunciation in a second language, and a hybrid pronunciation that combines multiple languages. The pronunciations may be added to a lexicon and matched to a content item (for example, a song) and/or to the text identifier. The ASR device may receive a spoken utterance from a user requesting that the ASR device access the content item. The ASR device determines whether the spoken utterance matches one of the pronunciations of the content item in the lexicon. The ASR device then accesses the content when the spoken utterance matches one of the potential pronunciations of the text identifier.

Description

Cross-Reference to Related Application Data
This application claims priority to U.S. Patent Application No. 14/196,055, filed March 4, 2014, which is incorporated herein by reference in its entirety.

Human-computer interaction has progressed to the point where humans can control computing devices, and provide input to those devices, by speaking. Computing devices employ techniques to identify the words spoken by a human user based on the various qualities of a received audio input. Such techniques are called speech recognition or automatic speech recognition (ASR). Speech recognition combined with language processing techniques may allow a user to control a computing device and perform tasks based on the user's spoken commands. Speech recognition may also convert a user's speech into text data, which may then be provided to various text-based programs and applications.

Speech recognition may be used by computers, handheld devices, telephone computer systems, kiosks, and a wide variety of other devices to improve human-computer interaction.

For a more complete understanding of the present disclosure, reference is now made to the following description taken in conjunction with the accompanying drawings.

FIG. 1 illustrates a speech recognition technique for predicting an expected pronunciation of a foreign word based on the word's language of origin, according to one aspect of the present disclosure.
FIG. 2 illustrates a computer network for use with distributed speech recognition according to one aspect of the present disclosure.
FIG. 3 is a block diagram conceptually illustrating a device for speech recognition according to one aspect of the present disclosure.
FIG. 4 illustrates an audio waveform processed according to one aspect of the present disclosure.
FIG. 5 illustrates a speech recognition lattice according to one aspect of the present disclosure.
FIG. 6 illustrates a speech recognition method for predicting an expected pronunciation of a foreign word based on the word's language of origin, according to one aspect of the present disclosure.
FIG. 7 illustrates a speech recognition method for processing a spoken utterance that includes a text identifier, according to one aspect of the present disclosure.

In interacting with a device capable of performing automatic speech recognition (ASR), a user may speak a command to access content items. Those content items may be stored locally on the device or stored remotely, but are accessible by the device. For example, a user may speak a command to a computing device to "play" a certain item of music. The spoken command may be referred to as an utterance. The item of music may be identified by a text identifier. The text identifier may be text that identifies an item of content such as a song or a video. Example text identifiers include an artist name, a band name, an album title, a song title, or some other label that identifies the music to be played.

The ASR system may have a lexicon of stored text identifiers (that is, artist names, band names, album titles, and song titles) that are matched to their corresponding expected pronunciations, where each expected pronunciation is based on the text identifier. The lexicon may be stored locally or remotely. When the ASR system receives an utterance, it may match the sound of the utterance to the stored expected pronunciations in order to match the utterance with one or more content items for retrieval. For example, if the user says "play some songs by AC/DC," the system may match the audio corresponding to "AC/DC" to the corresponding expected pronunciation and then to the band name. Once the band is identified, the device may then play songs associated with the band.
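The lexicon described above can be pictured as an index from expected pronunciations to text identifiers and their content items. The following is a minimal sketch under stated assumptions: the `LexiconEntry` class, the function names, the ARPAbet-style phoneme symbols, and the song IDs are all hypothetical, and exact matching on a phoneme tuple stands in for the acoustic matching a real recognizer would perform.

```python
from dataclasses import dataclass, field


@dataclass
class LexiconEntry:
    text_identifier: str   # e.g. a band name or song title
    pronunciations: list   # expected pronunciations, each a tuple of phoneme symbols
    content_items: list = field(default_factory=list)  # items linked to this identifier


def build_index(entries):
    """Map every expected pronunciation to the lexicon entry it belongs to."""
    index = {}
    for entry in entries:
        for pronunciation in entry.pronunciations:
            index[tuple(pronunciation)] = entry
    return index


def lookup(index, phonemes):
    """Return the entry whose expected pronunciation equals the phonemes, or None."""
    return index.get(tuple(phonemes))


# Hypothetical catalog data, for illustration only.
entries = [
    LexiconEntry(
        text_identifier="AC/DC",
        pronunciations=[("EY", "S", "IY", "D", "IY", "S", "IY")],
        content_items=["song_001", "song_002"],
    )
]
index = build_index(entries)
match = lookup(index, ["EY", "S", "IY", "D", "IY", "S", "IY"])
if match is not None:
    print(match.text_identifier, match.content_items)
```

Keying the index by pronunciation rather than by identifier lets several expected pronunciations point at the same entry.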

Typical ASR systems are each associated with a particular language. For example, an English ASR system may be configured to understand English, a German ASR system may be configured to understand German, and so on. Some text identifiers may originate from a foreign language that is not the main language of the ASR system. This may lead to confusion if a user attempts to pronounce the text identifier using the linguistic traits of the identifier's foreign language. For example, a user who speaks an utterance requesting music using the German pronunciation of a German song title or German band name may confuse an English-based ASR system. Similarly, a user who uses an English pronunciation of a German song title may also confuse the ASR system, because the ASR system may be expecting a different pronunciation based on the text of the song title.

Offered is a method for determining an expected pronunciation of a text identifier based on predicting the language of origin of the text identifier. The language of origin may be determined based on the text identifier. In some aspects of the disclosure, the expected pronunciation of the text identifier may also be based on the pronunciation history of a particular user or category of users. The expected pronunciations may include a combination of expected pronunciations based on language of origin, for example an expected pronunciation in which certain phonemes of the text identifier are expected as if they had one language of origin and other phonemes of the text identifier are expected as if they had a different language of origin. Further, multiple expected pronunciations may be determined for each text identifier, where each expected pronunciation may be associated with a likelihood of occurrence. The likelihood may be based on the text identifier, the behavior of the user, the behavior of other users, or other factors.
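One way to picture multiple expected pronunciations with associated likelihoods, including hybrid ones, is to enumerate a language choice for each part of the text identifier. The sketch below is illustrative only: the source does not say how likelihoods are computed, so the product-of-language-probabilities weighting, the function name, and the sample phoneme strings are all assumptions.

```python
from itertools import product


def expected_pronunciations(segments, language_probs):
    """Enumerate single-language and hybrid pronunciations with likelihoods.

    segments: one dict per part of the identifier, mapping language -> phoneme tuple
    language_probs: language -> assumed probability that the identifier originates from it
    Returns (phonemes, likelihood) pairs sorted by decreasing likelihood.
    """
    languages = list(language_probs)
    candidates = {}
    # Each segment may be pronounced as in any candidate language (assumed model).
    for choice in product(languages, repeat=len(segments)):
        phonemes = tuple(p for seg, lang in zip(segments, choice) for p in seg[lang])
        weight = 1.0
        for lang in choice:
            weight *= language_probs[lang]
        candidates[phonemes] = candidates.get(phonemes, 0.0) + weight
    total = sum(candidates.values())
    ranked = [(p, w / total) for p, w in candidates.items()]
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Hypothetical two-part identifier with made-up phoneme strings per language.
segments = [
    {"en": ("D", "AE", "S"), "de": ("D", "AA", "S")},
    {"en": ("B", "UW", "T"), "de": ("B", "OW", "T")},
]
ranked = expected_pronunciations(segments, {"en": 0.3, "de": 0.7})
for phonemes, likelihood in ranked:
    print(" ".join(phonemes), round(likelihood, 2))
```

In this toy model the all-German reading ranks first, the two hybrid readings tie, and the all-English reading ranks last. User history or other factors named in the text could be folded in by adjusting `language_probs`.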

Different expected pronunciations of a text identifier may be added to the lexicon to accommodate different pronunciations from different users. The expected pronunciations may be linked to content items, such as songs stored in a music catalog. When the computing device receives a spoken utterance that includes a text identifier, it determines whether the spoken utterance includes a text identifier by matching the utterance against the modified lexicon of expected pronunciations. When the spoken utterance matches an expected pronunciation, the computing device acts on the content as indicated in the command portion of the utterance, for example by playing the requested song.

FIG. 1 illustrates a speech recognition technique for predicting an expected pronunciation of a text identifier based on the language of origin of the text identifier, according to one aspect of the present disclosure. FIG. 1 includes an ASR device 100 with an expected pronunciation predicting module 128 and an ASR module 314, and a user 120 positioned near the ASR device 100. The expected pronunciation predicting module 128 may be configured to access a text identifier (such as a song title), as shown in block 102, and to determine the language of origin of the text identifier, as shown in block 104. The predicting module 128 may then determine one or more expected pronunciations of the text identifier based on the language of origin, as shown in block 106. The expected pronunciations may be matched to the content items (for example, songs) for retrieval by the system. The predicting module 128 may perform these actions ahead of time, when configuring or training the operation of the ASR system, before an utterance is received.

When the device receives a spoken utterance, as shown in block 108, the utterance is passed to the ASR module 314. The ASR module may then match the utterance to an expected pronunciation, as shown in block 110. That expected pronunciation may then be matched to a content item, such as a song, referred to in the utterance, as shown in block 112. The device may then access the content item (for example, play the song), as shown in block 114.

Although FIG. 1 illustrates certain tasks being performed by certain modules, the tasks may be performed by various modules, as configured by the particular ASR system.

Further, the techniques described here may be performed on a local device, such as the ASR device 100, on a networked device, or on some combination of different devices. For example, a local device and a remote device may exchange the text identifiers of the local device so that the remote device actually performs the determination of the language(s) of origin and the expected pronunciation(s). Further, while the local device may receive audio data that includes the spoken utterance, the local device may send the audio data to a remote device for processing. The remote device may then perform ASR processing on the audio. The ASR results may then be sent to the local device for matching the utterance to a content item and accessing the content item, or those tasks may be performed by the remote device, with the results (for example, a streaming song) sent to the local device for playback to the user. Alternatively, the local device and the remote device may work together in other ways.

Multiple ASR devices may be connected over a network. As shown in FIG. 2, multiple devices may be connected over network 202. Network 202 may include a local or private network, or may include a wide area network such as the Internet. Devices may be connected to the network 202 through either wired or wireless connections. For example, a wireless device 204 may be connected to the network 202 through a wireless service provider. Other devices, such as computer 212, may connect to the network 202 through a wired connection. Other devices, such as a refrigerator 218 located in a home or commercial establishment, may connect to the network 202 through a wired or wireless connection. Other devices, such as laptop 208 or tablet computer 210, may be capable of connecting to the network 202 using various connection methods, including through a wireless service provider, over a WiFi connection, or the like. Networked devices may input spoken audio through a number of audio input devices, including through headsets 206 or 214. Audio input devices may be connected to networked devices through either a wired or wireless connection. Networked devices may also include embedded audio input devices, such as an internal microphone (not shown) in laptop 208, wireless device 204, or tablet computer 210.

In certain ASR system configurations, one device may capture an audio signal and another device may perform the ASR processing. For example, audio input to the headset 214 may be captured by computer 212 and sent over the network 202 to computer 220 or server 216 for processing. Alternatively, computer 212 may partially process the audio signal before sending it over the network 202. Because ASR processing may require significant computational resources, in terms of both storage and processing power, such split configurations may be used where the device capturing the audio has lower processing capabilities than a remote device and higher-quality ASR results are desired. The audio capture may occur near the user, with the captured audio signal sent to another device for processing. For example, one or more microphone arrays may be located in different locations from an ASR device, and the captured audio may be sent from the arrays to the ASR device (or devices) for processing.

FIG. 3 shows an automatic speech recognition (ASR) device 302 for performing speech recognition. Aspects of the present disclosure include computer-readable and computer-executable instructions that may reside on the ASR device 302. FIG. 3 illustrates a number of components that may be included in the ASR device 302; however, other components that are not illustrated may also be included. Also, some of the illustrated components may not be present in every device capable of employing aspects of the present disclosure. Further, some components that are illustrated in the ASR device 302 as a single component may also appear multiple times in a single device. For example, the ASR device 302 may include multiple input devices 306, multiple output devices 307, or multiple controllers/processors 308.

Multiple ASR devices may be employed in a single speech recognition system. In such a multi-device system, the ASR devices may include different components for performing different aspects of the speech recognition process. The multiple devices may include overlapping components. The ASR device shown in FIG. 3 is exemplary; it may be a stand-alone device, or it may be included, in whole or in part, as a component of a larger device or system.

The teachings of the present disclosure may be applied within a number of different devices and computer systems, including, for example, general-purpose computing systems, server-client computing systems, mainframe computing systems, telephone computing systems, laptop computers, cellular phones, personal digital assistants (PDAs), tablet computers, and other mobile devices. The ASR device 302 may also be a component of other devices or systems that may provide speech recognition functionality, such as automated teller machines (ATMs), kiosks, home appliances (refrigerators, ovens, and so on), vehicles (cars, buses, motorcycles, and so on), and/or exercise equipment.

As illustrated in FIG. 3, the ASR device 302 may include an audio capture device 304 for capturing spoken utterances for processing. The audio capture device 304 may include a microphone or other suitable component for capturing sound. The audio capture device 304 may be integrated into the ASR device 302 or may be separate from the ASR device 302. The ASR device 302 may also include an address/data bus 324 for conveying data among the components of the ASR device 302. Each component within the ASR device 302 may also be directly connected to other components in addition to (or instead of) being connected to other components across the bus 324. Although certain components are illustrated in FIG. 3 as directly connected, these connections are illustrative only, and other components may be directly connected to each other (for example, the ASR module 314 to the controller/processor 308).

The ASR device 302 may include a controller/processor 308, which may be a central processing unit (CPU) for processing data and computer-readable instructions, and a memory 310 for storing data and instructions. The memory 310 may include volatile random access memory (RAM), non-volatile read-only memory (ROM), and/or other types of memory. The ASR device 302 may also include a data storage component 312 for storing data and instructions. The data storage component 312 may include one or more storage types, such as magnetic storage, optical storage, and solid-state storage. The ASR device 302 may also be connected to removable or external memory and/or storage (such as a removable memory card, memory key drive, or networked storage) through the input device 306 or the output device 307. Computer instructions for operating the ASR device 302 and its various components may be executed by the controller/processor 308 and stored in the memory 310, the storage 312, an external device, or in memory/storage included in the ASR module 314 discussed below. Alternatively, some or all of the executable instructions may be embedded in hardware or firmware in addition to, or instead of, software. The teachings of the present disclosure may be implemented in various combinations of software, firmware, and/or hardware, for example.

The ASR device 302 includes input device(s) 306 and output device(s) 307. A variety of input/output devices may be included in the device. Example input devices 306 include an audio capture device 304 such as a microphone (illustrated as a separate component), a touch input device, a keyboard, a mouse, a stylus, or other input device. Example output devices 307 include a visual display, a tactile display, audio speakers, headphones, a printer, or other output device. The input device 306 and/or output device 307 may also include an interface for an external peripheral device connection, such as Universal Serial Bus (USB), FireWire (registered trademark), Thunderbolt (registered trademark), or another connection protocol. The input device 306 and/or output device 307 may also include a network connection such as an Ethernet (registered trademark) port or a modem. The input device 306 and/or output device 307 may also include a wireless communication device, such as radio frequency (RF), infrared, Bluetooth (registered trademark), or wireless local area network (WLAN) (such as WiFi), or a wireless network device, such as a radio capable of communicating with a wireless communication network such as a Long Term Evolution (LTE) network, a WiMAX network, or a 3G network. Through the input device 306 and/or output device 307, the ASR device 302 may connect to a network, such as the Internet or a private network, which may include a distributed computing environment.

The device may also include an ASR module 314 for processing spoken audio data into text. The ASR module 314 transcribes audio data into text data representing the words of the speech contained in the audio data. The text data may then be used by other components for various purposes, such as executing system commands or inputting data. Audio data that includes spoken utterances may be processed in real time or may be saved and processed later. A spoken utterance in the audio data is input to the ASR module 314, which then interprets the utterance based on the similarity between the utterance and models known to the ASR module 314. For example, the ASR module 314 may compare the input audio data with models for sounds (for example, speech units or phonemes) and sequences of sounds to identify words that match the sequence of sounds spoken in the utterance. The different ways in which a spoken utterance may be interpreted may each be assigned a probability or recognition score representing the likelihood that a particular set of words matches those spoken in the utterance. The recognition score may be based on a number of factors, including, for example, the similarity of the sound in the utterance to models for language sounds (for example, an acoustic model), and the likelihood that a particular word that matches the sounds would be included in the sentence at the specific location (for example, using a language model or grammar). Based on the considered factors and the assigned recognition scores, the ASR module 314 may output the most likely words recognized in the audio data. The ASR module 314 may also output multiple alternative recognized words in the form of a lattice or an N-best list (described in more detail below).
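The combination of acoustic and language-model evidence into a recognition score, and the output of an N-best list, can be sketched as follows. This is a minimal illustration, not the disclosed scoring method: the log-linear combination, the `lm_weight` parameter, the function name, and the sample probabilities are assumptions.

```python
import math


def n_best(hypotheses, n=3, lm_weight=1.0):
    """Rank hypotheses by a combined log score and keep the top n.

    hypotheses: list of (words, acoustic_prob, language_prob) tuples, where
    acoustic_prob is how well the sounds fit the acoustic model and
    language_prob is how plausible the word sequence is under the language model.
    """
    scored = []
    for words, acoustic_prob, language_prob in hypotheses:
        # Assumed combination: sum of log probabilities, with a tunable LM weight.
        score = math.log(acoustic_prob) + lm_weight * math.log(language_prob)
        scored.append((score, words))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(words, score) for score, words in scored[:n]]


# Hypothetical hypotheses for one utterance.
hypotheses = [
    ("play a seedy sea", 0.7, 0.01),
    ("play ac dc", 0.6, 0.5),
    ("plate ac dc", 0.5, 0.05),
]
best = n_best(hypotheses, n=2)
for words, score in best:
    print(f"{score:7.3f}  {words}")
```

Here the hypothesis with the best acoustic fit ("play a seedy sea") is ranked last overall because the language model considers it unlikely, which is the interplay the paragraph above describes.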

While a recognition score may represent the probability that a portion of audio data corresponds to a particular phoneme or word, the recognition score may also incorporate other information that indicates the ASR processing quality of the scored audio data relative to the ASR processing of other audio data. A recognition score may be expressed as a number from 1 to 100, as a probability from 0 to 1, as a log probability, or as another indicator. A recognition score may indicate a relative confidence that a section of audio data corresponds to a particular phoneme, word, or the like.
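The score representations named above can be related to one another with simple conversions. The sketch below assumes a linear mapping from probability to the 1 to 100 scale; the source does not define that mapping, and both function names are hypothetical.

```python
import math


def as_log_probability(p):
    """Convert a probability in (0, 1] to a natural-log probability."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return math.log(p)


def as_1_to_100(p):
    """Map a probability in [0, 1] onto an integer scale from 1 to 100 (assumed linear)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    return 1 + round(99 * p)


for p in (0.25, 0.9, 1.0):
    print(p, as_log_probability(p), as_1_to_100(p))
```

Log probabilities are convenient when many small probabilities are multiplied, since the products become sums and avoid numeric underflow.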

The ASR module 314 may be connected to the bus 324, the input device(s) 306 and/or output device(s) 307, the audio capture device 304, the encoder/decoder 322, the controller/processor 308, and/or other components of the ASR device 302. Audio data sent to the ASR module 314 may come from the audio capture device 304 or may be received by the input device 306, such as audio data captured by a remote entity and sent to the ASR device 302 over a network. Audio data may be in the form of a digitized representation of the audio waveform of a spoken utterance. The sampling rate, filtering, and other aspects of the analog-to-digital conversion process may affect the overall quality of the audio data. Various settings of the audio capture device 304 and the input device 306 may be configured to adjust the audio data based on the usual tradeoffs between quality and data size, or on other considerations.

The ASR module 314 includes an acoustic front end (AFE) 316, a speech recognition engine 318, and speech storage 320. The AFE 316 transforms audio data into data for processing by the speech recognition engine 318. The speech recognition engine 318 compares the speech recognition data with the acoustic, language, and other data models and information stored in the speech storage 320 in order to recognize the speech contained in the original audio data. The AFE 316 and the speech recognition engine 318 may include their own controller(s)/processor(s) and memory, or they may use, for example, the controller/processor 308 and memory 310 of the ASR device 302. Similarly, the instructions for operating the AFE 316 and the speech recognition engine 318 may be located within the ASR module 314, within the memory 310 and/or storage 312 of the ASR device 302, or within an external device.

Received audio data may be sent to the AFE 316 for processing. The AFE 316 may reduce noise in the audio data, identify the portions of the audio data that contain speech, and segment and process the identified speech components. The AFE 316 may divide the digitized audio data into frames, or audio segments, with each frame representing a time interval of, for example, 10 milliseconds (ms). For each frame, the AFE 316 determines a set of values, called a feature vector, that represents the features/qualities of the utterance portion within the frame. A feature vector may contain a varying number of values, for example 40. The feature vector may represent different qualities of the audio data within the frame. FIG. 4 shows a digitized audio data waveform 402 with multiple points 406 of the first word 404 as the first word 404 is being processed. The audio qualities of those points may be stored in feature vectors. Feature vectors may be streamed or combined into a matrix that represents a time period of the spoken utterance. These feature vector matrices may then be passed to the speech recognition engine 318 for processing. The AFE 316 may use a number of approaches to process the audio data. Such approaches may include mel-frequency cepstral coefficients (MFCCs), perceptual linear predictive (PLP) techniques, neural network feature vector techniques, linear discriminant analysis, semi-tied covariance matrices, or other approaches known to those skilled in the art.
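The framing step described above can be illustrated with a minimal Python sketch. It assumes non-overlapping frames and a hypothetical two-value feature vector (mean energy and zero-crossing count) chosen only for brevity; the function names are illustrative, and a real front end would more likely compute MFCC or PLP features of around 40 values per frame.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms=10):
    """Split samples into non-overlapping frames of frame_ms milliseconds each."""
    frame_len = sample_rate_hz * frame_ms // 1000
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    # Any trailing partial frame is dropped in this sketch.
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def toy_feature_vector(frame):
    """Hypothetical stand-in for MFCC/PLP: [mean energy, zero-crossing count]."""
    energy = sum(s * s for s in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return [energy, float(crossings)]


def feature_matrix(samples, sample_rate_hz, frame_ms=10):
    """One feature vector per frame, stacked in time order."""
    frames = split_into_frames(samples, sample_rate_hz, frame_ms)
    return [toy_feature_vector(f) for f in frames]
```

At a 16 kHz sampling rate, each 10 ms frame holds 160 samples, so 320 samples yield a matrix with two rows.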

The processed feature vectors may then be output from the ASR module 314 and sent to the output device 307 for transmission to another device for further processing. The feature vectors may be encoded and/or compressed by the encoder/decoder 322 prior to transmission. The encoder/decoder 322 may be customized for encoding and decoding ASR data, such as digitized audio data, feature vectors, and the like. The encoder/decoder 322 may also encode non-ASR data of the ASR device 302, for example using a general encoding scheme such as .zip. The functionality of the encoder/decoder 322 may be located in a separate component, as shown in FIG. 3, or may be performed by, for example, the controller/processor 308, the ASR module 314, or another component.

The speech recognition engine 318 may process the output from the AFE 316 with reference to information stored in the speech storage 320. Alternatively, data that has already undergone front-end processing (such as feature vectors) may be received by the ASR module 314 from a source other than the internal AFE 316. For example, another entity may process audio data into feature vectors and transmit that information to the ASR device 302 through the input device(s) 306. Feature vectors may arrive at the ASR device 302 encoded, in which case they may be decoded (for example, by the encoder/decoder 322) prior to processing by the speech recognition engine 318.

The speech storage 320 includes a variety of information for speech recognition, such as data matching pronunciations of phonemes to particular words. This data may be referred to as an acoustic model. The speech storage may also include a dictionary of words, or lexicon. The speech storage may also include a lexicon matching text identifiers to the expected pronunciations of those identifiers. The text identifiers may identify digital content such as music in a catalog, contents of an address book, and/or other content stored on the ASR device (or elsewhere). The text identifiers may also identify non-digital items such as food (e.g., ingredients, dishes), restaurants, events, or other items whose names may originate in a language or languages that differ from the default language of the ASR system and/or the user. The speech storage may also include data describing words that are likely to be used together in particular contexts. This data may be referred to as a language or grammar model. The speech storage 320 may also include a training corpus, which may include recorded speech and/or corresponding transcriptions and which may be used to train and improve the models used by the ASR module 314 in speech recognition. The training corpus may be used to train the speech recognition models, including the acoustic models and language models, in advance. The models may then be used during ASR processing.

The training corpus may include a number of sample utterances with associated feature vectors and associated correct text, which may be used to create, for example, acoustic models and language models. The sample utterances may be used to create mathematical models corresponding to the expected audio for particular speech units. Those speech units may include phonemes, syllables, parts of syllables, words, and so on. The speech units may also include phonemes in context, such as triphones and quinphones. Phonemes in context that are used regularly in speech may be associated with their own models. Phonemes in context that are less common may be clustered together to share a group model. Clustering phoneme groups in this manner may reduce the number of models included in the training corpus, thus easing ASR processing. The training corpus may include multiple versions of the same utterance from different speakers, providing the ASR module 314 with different utterances to compare. The training corpus may also include correctly recognized utterances as well as incorrectly recognized utterances. These incorrectly recognized utterances may include grammar errors, false recognition errors, noise, or other errors that provide the ASR module 314 with examples of error types and corresponding corrections. The training corpus may include foreign words in order to train the ASR system to recognize such words. The training corpus may also be adapted to incorporate the tendencies of particular users in order to improve system performance, as described below.

Other information may also be stored in the speech storage 320 for use in speech recognition. The contents of the speech storage 320 may be prepared for general ASR use or may be customized to include sounds and words that are likely to be used in a particular application. For example, for ASR processing at an ATM (automated teller machine), the speech storage 320 may include customized data specific to banking transactions. In certain instances, the speech storage 320 may be customized for an individual user based on that user's individualized speech input. To improve performance, the ASR module 314 may revise/update the contents of the speech storage 320 based on feedback from the results of ASR processing, enabling the ASR module 314 to improve speech recognition beyond the capabilities provided in the training corpus.

The speech recognition engine 318 attempts to match received feature vectors to words or subword units known in the speech storage 320. A subword unit may be a phoneme, a phoneme in context, a syllable, part of a syllable, a syllable in context, or any other such portion of a word. The speech recognition engine 318 computes recognition scores for the feature vectors based on acoustic information and language information. The acoustic information is used to calculate an acoustic score representing the likelihood that the intended sound represented by a group of feature vectors matches a subword unit. The language information is used to adjust the acoustic score by considering which sounds and/or words are used in context with each other, thereby improving the likelihood that the ASR module outputs speech results that make sense grammatically.

The speech recognition engine 318 may use a number of techniques to match feature vectors to phonemes or other phonetic units, such as diphones and triphones. One common technique uses hidden Markov models (HMMs). HMMs are used to determine the probability that feature vectors match phonemes. With an HMM, a number of states are presented, in which the states together represent a potential phoneme (or other speech unit, such as a triphone) and each state is associated with a model, such as a Gaussian mixture model. Transitions between states may also have an associated probability, representing the likelihood that the current state can be reached from a previous state. Received sounds may be represented as paths between states of the HMM, and multiple paths may represent multiple possible text matches for the same sound. Each phoneme may be represented by multiple potential states corresponding to different known pronunciations of the phoneme and its parts (such as the beginning, middle, and end of a spoken language sound). An initial determination of the probability of a potential phoneme may be associated with one state. As new feature vectors are processed by the speech recognition engine 318, the state may change or stay the same, based on the processing of the new feature vectors. A Viterbi algorithm may be used to find the most likely sequence of states based on the processed feature vectors.
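The search for the most likely state sequence can be sketched as follows. This is a minimal illustration under stated assumptions, not the decoder described in this disclosure: observations are discrete symbols rather than feature vectors scored by Gaussian mixture models, all probabilities are strictly positive, and the names `to_log` and `viterbi` are hypothetical.

```python
import math


def to_log(table):
    """Convert a (possibly nested) dict of positive probabilities to log probabilities."""
    return {k: to_log(v) if isinstance(v, dict) else math.log(v)
            for k, v in table.items()}


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return (log score, state sequence) of the most likely path."""
    if not observations:
        raise ValueError("need at least one observation")
    # best[s] holds the best log score of any path ending in s, and that path.
    best = {s: (log_start[s] + log_emit[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        updated = {}
        for s in states:
            # Pick the predecessor that maximizes the score of reaching s.
            prev = max(states, key=lambda p: best[p][0] + log_trans[p][s])
            score = best[prev][0] + log_trans[prev][s] + log_emit[s][obs]
            updated[s] = (score, best[prev][1] + [s])
        best = updated
    final = max(states, key=lambda s: best[s][0])
    return best[final]
```

Working in log probabilities keeps long paths from underflowing, which is why the tables are converted with `to_log` before decoding.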

The probabilities and states may be calculated using a number of techniques. For example, the probabilities for each state may be calculated using a Gaussian model, a Gaussian mixture model, or another technique based on the feature vectors and the contents of the speech storage 320. Techniques such as maximum likelihood estimation (MLE) may be used to estimate the probability of phoneme states.

In addition to calculating potential states for one phoneme as a potential match to a feature vector, the speech recognition engine 318 may also calculate potential states for other phonemes as potential matches for the feature vector. In this manner, multiple states and state transition probabilities may be calculated.

The probable states and probable state transitions calculated by the speech recognition engine 318 may be formed into paths. Each path represents a progression of phonemes that potentially match the audio data represented by the feature vectors. One path may overlap with one or more other paths, depending on the recognition scores calculated for each phoneme. A certain probability is associated with each transition from state to state. A cumulative path score may also be calculated for each path. When combining scores as part of ASR processing, the scores may be multiplied together (or combined in other ways) to reach a desired combined score, or the probabilities may be converted to the log domain and added to assist processing.
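The difference between multiplying probabilities and adding their logarithms can be seen in a short sketch. The function names are illustrative, and the probabilities are assumed to be strictly positive so that the logarithm is defined.

```python
import math


def path_score_product(probabilities):
    """Combine transition probabilities by direct multiplication."""
    score = 1.0
    for p in probabilities:
        score *= p
    return score


def path_score_log(probabilities):
    """Combine the same probabilities by summing their natural logarithms."""
    return sum(math.log(p) for p in probabilities)
```

For a long path of small probabilities, such as one hundred transitions of probability 1e-5 each, the direct product underflows to zero in double-precision arithmetic, while the log-domain sum remains an ordinary finite number that can still be compared against other paths.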

The speech recognition engine 318 may combine potential paths into a lattice representing speech recognition results. A sample lattice is shown in FIG. 5. The lattice 502 shows multiple potential paths of speech recognition results. Paths between large nodes represent potential words (for example "hello", "yellow", etc.), and paths between smaller nodes represent potential phonemes (for example "H", "E", "L", "O" and "Y", "E", "L", "O"). For purposes of illustration, individual phonemes are shown only for the first two words of the lattice. The two paths between node 504 and node 506 represent two potential word choices, "hello how" or "yellow now". Each path point between nodes (such as a potential word) is associated with a recognition score. Each path across the lattice may also be assigned a recognition score. The path with the highest recognition score, where the recognition score is a combination of the acoustic model score, the language model score, and/or other factors, may be returned by the speech recognition engine 318 as the ASR result for the associated feature vectors.

Following ASR processing, the ASR results may be sent by the ASR module 314 to another component of the ASR device 302, such as the controller/processor 308, for further processing (such as execution of a command included in the interpreted text), or to the output device 307 for sending to an external device.

The speech recognition engine 318 may also compute scores for branches of the paths based on language models or grammars. Language modeling involves determining scores for which words are likely to be used together to form coherent words and sentences. Applying a language model may improve the likelihood that the ASR module 314 correctly interprets the speech contained in the audio data. For example, acoustic model processing that returns the potential phoneme paths "H E L O", "H A L O", and "Y E L O" may be adjusted by a language model, which adjusts the recognition scores of "H E L O" (interpreted as the word "hello"), "H A L O" (interpreted as the word "halo"), and "Y E L O" (interpreted as the word "yellow") based on the language context of each word within the spoken utterance. The language model may be determined from a training corpus stored in the speech storage 320 and may be customized for particular applications. Language models may be implemented using techniques such as an N-gram model, in which the probability of seeing a particular next word depends on the context history of the preceding n-1 words. N-gram models may also be structured as bigram (n = 2) and trigram (n = 3) models, in which the probability of seeing the next word depends on the previous word (in the case of a bigram model) or on the previous two words (in the case of a trigram model). Acoustic models may also apply N-gram techniques.
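The bigram case (n = 2) can be sketched in a few lines. This is an unsmoothed maximum likelihood estimate over a toy corpus, assuming whitespace-separated words and a hypothetical sentence-start marker `<s>`; a practical language model would add smoothing or back-off for unseen word pairs.

```python
from collections import Counter


def train_bigram(sentences):
    """Count how often each word follows each preceding word."""
    context_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()
        for prev, cur in zip(words, words[1:]):
            context_counts[prev] += 1
            pair_counts[(prev, cur)] += 1
    return context_counts, pair_counts


def bigram_prob(context_counts, pair_counts, prev, cur):
    """Estimate P(cur | prev); returns 0.0 for an unseen context or pair."""
    if context_counts[prev] == 0:
        return 0.0
    return pair_counts[(prev, cur)] / context_counts[prev]
```

With such a model, a hypothesis in which "how" follows "hello" can be scored higher than one in which "now" follows "hello" whenever the training text supports the first pairing more often.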

As part of language modeling (or in other phases of ASR processing), the speech recognition engine 318 may, in order to conserve computational resources, prune and discard states or paths with low recognition scores that have little likelihood of corresponding to the spoken utterance, whether because of a low recognition score under the language model or for other reasons. Further, during ASR processing the speech recognition engine 318 may iteratively perform additional processing passes on previously processed utterance portions. Later passes may incorporate the results of earlier passes to refine and improve the results. As the speech recognition engine 318 determines potential words from the input audio, the lattice may become very large, because many potential sounds and words are considered as potential matches for the input audio. The potential matches may be represented as a word result network. A speech recognition result network is a connected network of arcs and nodes representing possible sequences of speech units that may be recognized and the likelihood of each sequence. A word result network is a speech recognition result network at the word level. Speech recognition networks at other levels are also possible. A result network may be generated by any type of speech recognition decoder (or engine). For example, a result network may be generated by a decoder based on a finite state transducer (FST). A result network may be used to create a final set of speech recognition results, such as a lattice of the highest scoring results or an N-best list. Neural networks may also be used to perform ASR processing.

The speech recognition engine 318 may return an N-best list of paths, along with their respective recognition scores, corresponding to the top N paths as determined by the speech recognition engine 318. An application that receives the N-best list (such as a program or component either internal or external to the ASR device 302) may then perform further operations or analysis on the list, taking the list and the associated recognition scores into account. For example, the N-best list may be used in correcting errors and in training various options and processing conditions of the ASR module 314. The speech recognition engine 318 may compare the actual correct utterance with the best result and with the other results on the N-best list to determine why incorrect recognitions received certain recognition scores. The speech recognition engine 318 may then correct its approach (and may update information in the speech storage 320) to reduce the recognition scores of incorrect approaches in future processing attempts.
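Selecting an N-best list from scored paths can be sketched with the standard library. The mapping from hypothesis text to recognition score is an assumed input format for illustration, and higher scores are assumed to be better.

```python
import heapq


def n_best(path_scores, n):
    """Return the n highest-scoring (hypothesis, score) pairs, best first."""
    return heapq.nlargest(n, path_scores.items(), key=lambda item: item[1])
```

A receiving application could then iterate over the returned pairs, for example to present alternatives to the user or to compare the top hypothesis against a known correct transcription.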

An ASR device may be used to process voice commands related to content items. The content items themselves may be stored locally on the ASR device (such as a music collection on a mobile phone) or stored remotely (such as movies that may be streamed from a remote server). Those content items may include, for example, music, electronic books (e-books), movies, contact information, documents, short message service communications, emails, and/or other audio, video, or textual information. Users of the ASR device may request access to such content items for various purposes, including playback, editing, forwarding, and so on. For example, a user may ask a mobile phone to play music in response to a spoken request from the user. To carry out the request from the user, a catalog of the content items may be linked to a dictionary of words, or lexicon. The lexicon may include text identifiers, which may be text identifiers linked to the individual content items. For example, a text identifier may include the name of an artist, an album title, a song/movie/e-book title, or the like. Each text identifier may correspond to one or more items of content in the catalog (such as a band name linked to multiple songs), and each content item may be linked to one or more text identifiers (such as a song linked to a song title, band name, album name, etc.). The text identifiers may also refer to items other than digital content.
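The many-to-many link between text identifiers and content items can be sketched as a pair of indexes. The class name `CatalogLexicon`, its methods, and the item identifiers used with it are hypothetical and serve only to illustrate the relationship described above.

```python
from collections import defaultdict


class CatalogLexicon:
    """Hypothetical two-way index between text identifiers and content items."""

    def __init__(self):
        self._items_by_identifier = defaultdict(set)
        self._identifiers_by_item = defaultdict(set)

    def link(self, text_identifier, item_id):
        """Record that a text identifier refers to a content item."""
        self._items_by_identifier[text_identifier].add(item_id)
        self._identifiers_by_item[item_id].add(text_identifier)

    def items_for(self, text_identifier):
        """All content items reachable from one text identifier."""
        return set(self._items_by_identifier.get(text_identifier, ()))

    def identifiers_for(self, item_id):
        """All text identifiers linked to one content item."""
        return set(self._identifiers_by_item.get(item_id, ()))
```

Linking a band name to several songs, and each song to its own title as well, lets a spoken band name resolve to all of the band's songs while a spoken title resolves to a single item.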

As noted above, the lexicon may also include one or more expected pronunciations of each text identifier, which allows the user to access the associated content items through voice commands. For example, a user may attempt to play a song stored in a music catalog by saying the name of the artist, album, or song title. The expected pronunciation may be determined based on the spelling of the word. The process of determining the expected pronunciation of a word based on its spelling is defined as grapheme-to-phoneme (G2P) conversion or pronunciation guessing (commonly referred to as pronguessing). In some instances, text identifiers may include foreign words. For purposes of illustration, a foreign word (or foreign language word) referred to in this application is one considered to be of foreign origin relative to the default language of the ASR system. Although the techniques described herein may be applied to ASR systems based on different languages, the default language of the ASR system is taken to be English for present purposes.

To assist with ASR processing of text identifiers that incorporate words or linguistic features of different languages, the present disclosure provides a system in which the ASR system is configured to predict one or more pronunciations of a text identifier based on the language of origin of the text identifier. In one aspect of the present disclosure, the ASR system determines the language of origin of a text identifier based on the text identifier itself. The ASR system then determines an expected pronunciation of the text identifier based on the text and the identified language of origin. The ASR system may determine multiple expected pronunciations for a particular text identifier, each with an associated likelihood. The expected pronunciations (and/or their associated likelihoods) may also be adjusted based on the pronunciation tendencies of a user or group of users. The expected pronunciations may be added to the lexicon and linked to their respective content items for eventual retrieval by the ASR system.

To determine the language of origin, the ASR system may use a classifier that predicts language origin based on the spelling of the text identifier. The classifier may be a statistical model, such as a character-based statistical model. Because text identifiers (for example, band names) may be short compared with long-form text such as documents or paragraphs, a classifier for predicting the language of origin may focus on basic linguistic units of short text, rather than on detection based on multiple strings of text forming a paragraph, as other language prediction systems may do. For example, the classifier may be trained to identify the likelihood of a sequence of letters in one or more languages (for example, language A, B, or C). In some aspects, the likelihood for each language may be learned separately. The classifier may also implement an n-gram based character model for words from different languages. The n-grams may be based on a sequence of items such as syllables, letters, words, or base pairs, according to different configurations of the ASR system.
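A character n-gram classifier of the kind described above can be sketched as follows. This is a deliberately simplified illustration and not the classifier of this disclosure: it scores each character trigram by its add-one-smoothed relative frequency in a per-language word list (rather than by a conditional probability), and the padding symbols `^` and `$`, the function names, and any training word lists supplied to it are assumptions.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Character n-grams of a lowercased word, padded at both ends."""
    padded = "^" * (n - 1) + text.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(words, n=3):
    """Count character n-grams over a list of words from one language."""
    counts = Counter()
    for word in words:
        counts.update(char_ngrams(word, n))
    return counts


def log_likelihood(text, counts, vocab_size, n=3):
    """Sum of add-one-smoothed log relative frequencies of the text's n-grams."""
    total = sum(counts.values())
    return sum(math.log((counts[g] + 1) / (total + vocab_size))
               for g in char_ngrams(text, n))


def origin_scores(text, models, n=3):
    """Normalize per-language likelihoods into weights that sum to 1."""
    vocab = set()
    for counts in models.values():
        vocab.update(counts)
    vocab_size = len(vocab) + 1  # one extra slot for unseen n-grams
    logs = {lang: log_likelihood(text, c, vocab_size, n) for lang, c in models.items()}
    peak = max(logs.values())
    weights = {lang: math.exp(v - peak) for lang, v in logs.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}
```

Because each language's counts are trained separately, a new language can be added by training one more model without retraining the others.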

A score may be assigned that represents the likelihood that the spelling of a word matches a particular language. For example, scores may be assigned to two or more languages from which the text identifier (or portions thereof) likely originated. In some aspects, the scores may be probabilistic weights assigned to each of the different languages to improve identification of the language of origin. The one or more languages with the highest scores for the foreign word may be identified as the language of origin. If the text is "Gotye", for example, a probabilistic weight of 70% may be assigned to French and 30% to German. Based on this determination, the expected pronunciations of the word for both French and German, along with the corresponding probabilistic weights, may be added to the lexicon. This implementation allows the most likely language of origin of the text to be selected. In one aspect, portions of the text identifier may have different language of origin scores. For example, the first word of the name "Ludwig van Beethoven" may have a high German score, while the middle word may have a high Dutch score, and so on. Portions of words may also have language scores that differ from each other. Such different scores may be used to create the different expected pronunciations described below.
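Choosing which languages of origin to keep, given probabilistic weights such as the 70% and 30% in the "Gotye" example, might look like the sketch below. The function name and the `min_weight` and `top_k` parameters are hypothetical tuning knobs, not values taken from this disclosure.

```python
def select_origin_languages(weights, min_weight=0.25, top_k=2):
    """Keep up to top_k languages whose weight is at least min_weight, best first."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return [(lang, w) for lang, w in ranked[:top_k] if w >= min_weight]
```

Each retained language and weight could then be paired with that language's expected pronunciation when the entry is added to the lexicon.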

In some aspects, the classifier may be implemented as a machine learning classifier for which features of a language are developed. The features may include certain letter combinations at the beginning, middle, or end of a word string of the text identifier. Based on these features, scores may be assigned to the different languages that are likely to incorporate the features. For example, the classifier may identify a feature such as the presence of V-A-N in the middle of a word string, which suggests Dutch as the language of origin. The classifier assigns points or weights to each of the potential languages of origin based on the likelihood that the text identifier originates in each of those languages. Other classifier models include support vector machines/models, maximum entropy models, character-level language models, and conditional random field models. These models may combine the features and scores for the different languages to score the most likely languages of origin.

In some aspects of the disclosure, the language of origin of a foreign word may be determined based on the language of origin of other text identifiers associated with the content item. For example, if one or more song titles or song lyrics of a particular artist are in German, the likelihood that the artist's name is of German origin may be increased. In this case, the song titles may be used as evidence for determining the language of origin of the artist's name. In addition, the other text identifiers may include metadata associated with the content being identified. For example, an item of digital content may be associated with metadata that identifies, or that may be used to identify, the language of origin of a text identifier. Other relationships between text identifiers may be explored to adjust the determination of the language of origin.

Once one or more languages of origin are associated with a text identifier (or portions thereof), the system may determine the expected pronunciation(s) of the text identifier based on the text identifier's language(s) of origin and its text.

In some aspects of the disclosure, a conversion model, such as a grapheme-to-phoneme (G2P) conversion or pronguessing model, may be developed for each potential language of origin. The conversion model derives the pronunciation of foreign text from the spelling of that text. Each language includes different language units, such as phonemes. Cross-lingual mapping techniques may be used to determine the expected pronunciation of a foreign word. Phonemes of a first language (for example, German) may be mapped to the phonemes of a second language (for example, English) that are most similar to them. However, some German pronunciations/phonemes may not be similar to, or correspond to, any standard English phoneme. For example, the German pronunciation of the first letter "r" in Kraftwerk does not correspond to an English phoneme. The German pronunciation of the letter "r" is actually a uvular /r/, which falls between the pronunciation of the letter "h" and the pronunciation of the letter "r". In such cases, the German phoneme may be mapped to the closest English phoneme.

In one aspect of the present disclosure, linguistic techniques are used to determine the closest pronunciation of a foreign word. For example, linguistic articulatory features such as "backness," "roundedness," and place of articulation may be used to determine the closest pronunciation of the foreign word. The place of articulation may be the location in the oral cavity at which the articulators (e.g., tongue, teeth, soft palate) restrict, shape, or close off the airflow during vocalization. Examples include bilabial (between the lips), labiodental (between the lips and teeth), alveolar (immediately behind the teeth), and uvular (near the uvula). "Backness" may be defined as the degree to which a sound (usually a vowel) is articulated toward the throat. Back vowels may include the "au" of "caught," the "o" of "rote," and the "u" of "lute." "Roundedness" or "rounding" may be defined as the degree to which a sound (often a vowel, but not always) is articulated with rounded lips. Rounded vowels include the "o" of "rote" and the "u" of "lute." The linguistic techniques may be applied by using a first-language recognizer, such as an English phoneme recognizer, to recognize several examples of foreign words containing a target phoneme. The recognizer then determines potential pronunciations of the foreign word.
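One way to picture the use of articulatory features is as a nearest-neighbor search in a small feature space. The sketch below is only an illustration under stated assumptions: the numeric feature values, the added "height" dimension, and the names `ENGLISH_VOWELS` and `nearest_phoneme` are hypothetical and do not come from the disclosure.

```python
# Each English vowel is described by (backness, roundedness, height) on a
# 0-to-1 scale. The values are illustrative assumptions, not measured data.
ENGLISH_VOWELS = {
    "IY": (0.0, 0.0, 1.0),  # front, unrounded, high
    "UW": (1.0, 1.0, 1.0),  # back, rounded, high
    "OW": (1.0, 1.0, 0.5),  # back, rounded, mid
    "AA": (1.0, 0.0, 0.0),  # back, unrounded, low
    "EH": (0.0, 0.0, 0.5),  # front, unrounded, mid
}


def nearest_phoneme(features, inventory):
    """Return the inventory phoneme whose feature vector is closest."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(inventory, key=lambda name: squared_distance(features, inventory[name]))


# Hypothetical features for a foreign rounded, high vowel that is fronter
# than English "UW" and has no exact English counterpart.
foreign_vowel = (0.3, 0.9, 1.0)
print(nearest_phoneme(foreign_vowel, ENGLISH_VOWELS))
```

With these assumed values the foreign vowel maps to "UW", because rounding and height match closely and only backness differs.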

Several linguistic techniques (e.g., expectation-maximization algorithms, statistical models, hidden Markov models (HMMs)) may be used to analyze the associations between multiple words and their corresponding pronunciations in order to determine the expected pronunciation of a new word. For example, a lexicon of German words and their corresponding German pronunciations may be analyzed to determine the associations between letter sequences, phoneme sequences, and the sounds of each word. For example, an expectation-maximization algorithm may learn that the letters P-H in English may be pronounced as F, with some exceptions. The expectation-maximization algorithm may also learn when E is pronounced "eh" as opposed to "ee," and so on. A model may be developed based on the analysis of the expectation-maximization algorithm and used to predict new phoneme sequences and, subsequently, the expected pronunciation of a new word. The linguistic techniques may be used together with other techniques to determine the expected pronunciation of a foreign word.

Linguistic techniques also allow multiple alternative pronunciations to be predicted for a text identifier based on its source language(s). For example, the multiple pronunciations of each text identifier may be represented by a graph. Different portions of the graph may represent possible pronunciations for different portions of the text identifier. Portions of the graph, such as its edges, may be assigned scores or weights indicating the likelihood of a path through the graph. Different graphs may be developed to represent different languages (e.g., English and German). For example, separate graphs may be developed for English and German pronunciations. In some aspects, however, the separate graphs may be combined to predict hybrid pronunciations of a foreign word. The combined graph allows switching between the two languages as the pronunciation of the text identifier progresses, which is desirable in situations where a user may pronounce some portions of a text identifier favoring one language and other portions favoring another language.
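A combined graph of this kind can be sketched as a sequence of segments, each offering one weighted alternative per language, with paths allowed to switch language between segments. The segment split, the scores, the `switch_penalty` parameter, and the function `enumerate_paths` are all assumptions introduced for illustration; the disclosure does not specify how edge weights are set or combined.

```python
from itertools import product

# One entry per portion of the text identifier "Kraftwerk". Each portion has
# an English and a German alternative with a hypothetical edge score.
SEGMENTS = [
    {"en": ("K R AE F T", 0.6), "de": ("K HH AA F T", 0.4)},
    {"en": ("W UR K", 0.5), "de": ("V EH R K", 0.5)},
]


def enumerate_paths(segments, switch_penalty=0.5):
    """List every path through the combined graph, best score first.

    A path score is the product of its edge scores, multiplied by
    switch_penalty each time the language changes between segments.
    """
    paths = []
    for langs in product(*[segment.keys() for segment in segments]):
        score = 1.0
        phones = []
        for i, (segment, lang) in enumerate(zip(segments, langs)):
            pronunciation, edge_score = segment[lang]
            score *= edge_score
            if i > 0 and langs[i - 1] != lang:
                score *= switch_penalty
            phones.append(pronunciation)
        paths.append((" ".join(phones), langs, score))
    return sorted(paths, key=lambda path: path[2], reverse=True)


for pronunciation, langs, score in enumerate_paths(SEGMENTS):
    print(f"{score:.2f}  {'+'.join(langs)}  {pronunciation}")
```

Under these assumed scores the two single-language paths rank above the two hybrid paths, but the hybrid pronunciation K R AE F T V EH R K remains available for matching.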

For example, the name of the German band "Kraftwerk" may be pronounced in German (e.g., K HH AA F T V EH R K). However, some users may be unfamiliar with the German pronunciation and may pronounce the band name "Kraftwerk" as if it were an English word (e.g., K R AE F T W UR K). Further, some users may be inconsistent in their choice of pronunciation of the band name. As a result, a text identifier (such as the band name "Kraftwerk") may be matched to multiple expected pronunciations, where each expected pronunciation may itself be based on multiple different languages, including the source language(s) of the text identifier.

Some users may have a first source language but reside in a country where the user communicates (or operates an ASR device) in a different language. These users may pronounce a foreign word using a combination of pronunciations from multiple languages, including the user's source language. The user may pronounce one portion of the foreign word in a first language and other portions in one or more different languages. For example, the user may pronounce the first portion of the band name Kraftwerk in English (e.g., K R AE F T) and the second portion in German (e.g., V EH R K).

Each of the English pronunciation, K R AE F T W UR K, the German pronunciation, K HH AA F T V EH R K, and the combined pronunciation, K R AE F T V EH R K, may be matched to the band name when added to the lexicon. The multiple expected pronunciations and the band name may be linked to songs by the band stored on the ASR device or elsewhere.

The expected pronunciation of a foreign word may also be based on the pronunciation history of a particular user. For example, the ASR system may be trained to recognize the pronunciation patterns or idiosyncrasies of a particular user. If a word is weighted 80% French and 20% English based on its spelling, a classifier or speech recognition model may adjust the weights assigned to the languages based on the idiosyncrasies of the particular user. The pronunciation pattern may also be based on a ranking of the languages preferred by the particular user. For example, the weights assigned to the languages may be adjusted based on the language(s) the user prefers. For example, the name Ludwig van Beethoven may have different versions of pronunciation because of its German and Dutch origins. In this case, weights may be assigned to German (e.g., 60%) and Dutch (e.g., 40%). The assigned weights may be adjusted based on whether the particular user prefers English, German, or Dutch when pronouncing foreign words such as the name Ludwig van Beethoven. The resulting pronunciation may be a hybrid or combination of German, Dutch, and English.
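The weight adjustment described above can be sketched as a blend of the spelling-based language weights with a per-user preference distribution. Linear interpolation, the `strength` parameter, and the function name `adjust_weights` are assumptions made for this illustration; the disclosure does not state how the adjustment is computed.

```python
def adjust_weights(spelling_weights, user_preference, strength=0.5):
    """Blend spelling-based language weights with a user's language preference.

    strength=0 keeps the spelling-based weights unchanged; strength=1 uses
    only the user's preference. The result is renormalized to sum to 1.
    """
    languages = set(spelling_weights) | set(user_preference)
    blended = {
        lang: (1 - strength) * spelling_weights.get(lang, 0.0)
        + strength * user_preference.get(lang, 0.0)
        for lang in languages
    }
    total = sum(blended.values())
    if total == 0:
        raise ValueError("all weights are zero")
    return {lang: weight / total for lang, weight in blended.items()}


# Spelling suggests German 60% / Dutch 40%; this user strongly prefers English.
weights = adjust_weights({"de": 0.6, "nl": 0.4}, {"en": 1.0}, strength=0.5)
for lang in sorted(weights):
    print(lang, round(weights[lang], 2))
```

In this example English receives half of the total weight while German and Dutch keep their 3:2 ratio, which is consistent with the hybrid German, Dutch, and English outcome described in the text.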

The user's pronunciation pattern may be determined based on the user's history of pronouncing the same or different words. Based on the pronunciation pattern or history, the ASR device may anticipate the user's future pronunciations of the same or different words. The ASR device may also learn, based on the user's pronunciation pattern, whether the user is familiar with the pronunciation of one or more languages. For example, based on the user's history of pronouncing the band name Kraftwerk, the ASR device may anticipate the user's pronunciation of other German words, such as "Gustav Mahler." The ASR device may also assign weights to various languages for a particular user based on the user's pronunciation pattern. For example, the ASR device may assign a greater weight to the pronunciation (e.g., in one language or a combination of languages) that the user favors when pronouncing foreign words. Similarly, the representation on a graph of a language or path preferred by a particular user may be assigned a higher score or weight. Assigning higher scores makes these paths of the graph more likely to represent the user's expected pronunciation of a foreign word. Thus, the expected pronunciations may be associated with a graph of expected pronunciations, an N-best list of expected pronunciations, or some other organization of expected pronunciations.
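As a rough illustration of deriving per-user language weights from pronunciation history, the sketch below counts which language each of the user's past pronunciations was closest to and applies additive smoothing. The history format, the smoothing scheme, and the name `language_weights_from_history` are assumptions, not details of the disclosed system.

```python
from collections import Counter


def language_weights_from_history(history, languages, smoothing=1.0):
    """Estimate how strongly a user favors each language.

    history is a list of (text_identifier, language) pairs recording which
    language's pronunciation the user's past utterance matched best.
    Additive smoothing keeps unseen languages at a nonzero weight.
    """
    counts = Counter(lang for _, lang in history)
    total = sum(counts[lang] for lang in languages) + smoothing * len(languages)
    return {lang: (counts[lang] + smoothing) / total for lang in languages}


# Hypothetical history: the user said "Kraftwerk" the English way twice and
# "Gustav Mahler" the German way once.
history = [("Kraftwerk", "en"), ("Kraftwerk", "en"), ("Gustav Mahler", "de")]
user_weights = language_weights_from_history(history, ["en", "de"])
print(user_weights)
```

Weights estimated this way could then be used to raise the scores of the corresponding paths in a pronunciation graph or to reorder an N-best list.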

Further, multiple users with similar behavior may be clustered together for purposes of weighting or determining expected pronunciations. The features of the automatic speech recognition techniques for the clustered users are selected based on the behavior of the clustered users. For example, a cluster of users may have similar musical tastes (e.g., music of Indian origin) and may therefore have music catalogs dominated by Indian music. As a result, pronunciations from a new user included in the cluster may be processed similarly to those of other users in the cluster, or may follow a similar path along the graph (representing the possible pronunciations of a foreign word). Weights may be assigned to the corresponding features of the speech recognition techniques (e.g., pronunciation, preferred language) associated with the cluster of users. Thus, the graph (representing the possible pronunciations of a foreign word) may be trimmed based on the behavior pattern of a user or of a cluster of users with similar behavior patterns.

FIG. 6 shows a flow diagram of a method for predicting an expected pronunciation of foreign text based on its source language in speech recognition, according to one aspect of the present disclosure. The method may be implemented in the expected pronunciation prediction module 128, the ASR device 100, and/or a remote speech processing device (e.g., the ASR device 302). At block 602, content to be made available to a user may be incorporated into a catalog available to the ASR device 100. At block 604, one or more text identifiers may be linked to a content item. At block 606, the ASR system may determine one or more source languages based on the text identifier(s). Each source language may be associated with a score and/or a specific portion of the text identifier(s). At block 608, the ASR system may determine one or more expected pronunciations of the text identifier based at least in part on the determined source language(s). Each expected pronunciation based on the source language(s) may be associated with a score and/or a specific portion of the text identifier(s). At block 610, the ASR system may determine expected pronunciation(s) of the text identifier based at least in part on user information and/or user history. The user history may include the user's native language or languages the user uses frequently. The user history may also include how the user has previously pronounced similar words. The user information may also include the determined language(s) of the environment of the device or the user. The language of the environment may include the language used at the location of the device, which may be determined by correlating location data with the known language(s) of a geographic region, by determining the language(s) identified in other speech detected by the device, or through other means. The language of the environment may also include the default language of the ASR system. Each expected pronunciation based on the user's language(s) may be associated with a score and/or a specific portion of the text identifier(s).

At block 612, the ASR system may combine the expected pronunciations and determine one or more expected pronunciations of the text identifier based at least in part on a combination of the source language(s) of the text identifier and the determined language(s) of the user. Each expected pronunciation based on the combination of languages may be associated with a score and/or a specific portion of the text identifier(s). At block 614, each of the expected pronunciation(s) and/or their weights or priorities may be adjusted based on user history, such as the typical pronunciations of the user or of a category of users. At block 616, the expected pronunciation(s) may be associated with the text identifier(s) and/or the content item in the lexicon.

The above determination of expected pronunciations may occur during training or configuration of the ASR system, or may be performed as new content becomes available to the ASR device, either by being added to local storage or by becoming accessible to the ASR device while stored remotely. The determination of expected pronunciations may be performed by a local ASR device, a remote ASR device, or a combination thereof.

As shown in FIG. 7, upon receiving a spoken utterance, the ASR system may process the utterance. At block 702, an utterance including a spoken text identifier is received. At block 704, the ASR system may match the spoken text identifier against the expected pronunciation(s) for text identifiers. The matching may include returning an N-best list of potential matches or simply returning the highest-scoring match. At block 706, the content item associated with the highest-scoring matching text identifier is determined. At block 708, the content item is accessed, and any command associated with the utterance (such as playing music) may be executed by the ASR system or by another device.
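The matching step of blocks 704 and 706 can be sketched by comparing a recognized phoneme sequence against every pronunciation stored in the lexicon and returning an N-best list. This is only an illustration: a real ASR system scores acoustic and language model hypotheses rather than computing edit distance on phoneme strings, and the `LEXICON` contents (in particular the phoneme string for "Gustav Mahler") and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


# Lexicon entries pair an expected pronunciation with a text identifier.
LEXICON = [
    ("K R AE F T W UR K", "Kraftwerk"),           # English pronunciation
    ("K HH AA F T V EH R K", "Kraftwerk"),        # German pronunciation
    ("K R AE F T V EH R K", "Kraftwerk"),         # combined pronunciation
    ("G UH S T AA F M AA L ER", "Gustav Mahler"), # hypothetical entry
]


def n_best(heard, lexicon, n=3):
    """Return the n closest (distance, pronunciation, text_identifier) entries."""
    heard_seq = heard.split()
    scored = [
        (edit_distance(heard_seq, pronunciation.split()), pronunciation, text_id)
        for pronunciation, text_id in lexicon
    ]
    scored.sort(key=lambda entry: entry[0])
    return scored[:n]


best = n_best("K R AE F T V EH R K", LEXICON, n=2)
print(best[0][2])  # text identifier of the closest match
```

The text identifier of the top entry would then be used to look up the associated content item, as in blocks 706 and 708.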

The above aspects of the present disclosure are intended to be illustrative. They were chosen to explain the principles and application of the disclosure and are not intended to be exhaustive or to limit the disclosure. Many modifications and variations of the disclosed aspects will be apparent to those skilled in the art. For example, the ASR techniques described herein may be applied to many different languages, based on the language information stored in the speech storage.

Aspects of the present disclosure may be implemented as a computer-implemented method, as a system, or as an article of manufacture such as a memory device or a non-transitory computer-readable storage medium. The computer-readable storage medium may be readable by a computer and may comprise instructions for causing a computer or other device to perform the processes described in the present disclosure. The computer-readable storage medium may be implemented by volatile computer memory, non-volatile computer memory, a hard drive, solid-state memory, a flash drive, a removable disk, and/or other media.

Aspects of the present disclosure may be implemented in different forms of software, firmware, and/or hardware. Further, the teachings of the disclosure may be carried out by, for example, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other components.

Aspects of the present disclosure may be performed on a single device or on multiple devices. For example, program modules including one or more components described herein may be located in different devices, each of which may perform one or more aspects of the present disclosure. As used in this disclosure, the term "a" or "one" may include one or more items unless specifically stated otherwise. Further, the phrase "based on" is intended to mean "based at least in part on" unless specifically stated otherwise.

Clauses

Clause 1
A computer-implemented method for processing a spoken utterance, the method comprising:
determining at least one source language of a song title based at least in part on a spelling of the song title;
determining a plurality of potential pronunciations of the song title based at least in part on the at least one source language and a language spoken by a user, wherein each of the plurality of potential pronunciations is associated with a score;
storing an association between each of the plurality of potential pronunciations and the song title;
receiving a spoken utterance comprising a request to play a song;
matching a portion of the spoken utterance with one of the plurality of potential pronunciations based at least in part on the score of the one of the plurality of potential pronunciations;
identifying the song based at least in part on the one of the plurality of potential pronunciations; and
causing the song to be played on a computing device.

Clause 2
The method of clause 1, wherein determining the plurality of potential pronunciations is further based at least in part on the user's pronunciation history of words having at least one source language in common with the song title.

Clause 3
The method of clause 1, further comprising determining at least one potential pronunciation by associating one portion of the song title with a first source language and a second portion of the song title with a second source language.

Clause 4
The method of clause 1, wherein determining the at least one source language of the song title is based at least in part on the source languages of other songs playable by the computing device.

Clause 5
A computing system, comprising:
at least one processor; and
a memory device including instructions operable to be executed by the at least one processor to perform a set of actions, the instructions configuring the at least one processor to:
determine a potential source language of a text identifier, wherein the potential source language is based at least in part on the text identifier;
determine a potential pronunciation of the text identifier, wherein the potential pronunciation is based at least in part on the potential source language and a potential spoken language; and
store an association between the potential pronunciation and the text identifier.

Clause 6
The computing system of clause 5, wherein the instructions further configure the at least one processor to:
determine a second potential source language for the text identifier, wherein the second potential source language is based at least in part on the text identifier;
determine a second potential pronunciation of the text identifier, wherein the second potential pronunciation is based at least in part on the second potential source language; and
store an association between the second potential pronunciation and the text identifier.

Clause 7
The computing system of clause 6, wherein the potential source language, the second potential source language, the potential pronunciation, and the second potential pronunciation are each associated with a respective score.

Clause 8
The computing system of clause 5, wherein the at least one processor is further configured to determine a second potential source language of the text identifier, and wherein:
the potential source language is associated with a first portion of the text identifier;
the second potential source language is associated with a second portion of the text identifier; and
the potential pronunciation is further based at least in part on the second potential source language.

Clause 9
The computing system of clause 5, wherein the at least one processor is further configured to determine the potential pronunciation further based at least in part on a pronunciation history of a user.

Clause 10
The computing system of clause 9, wherein the pronunciation history of the user includes a language spoken by the user.

Clause 11
The computing system of clause 5, wherein the at least one processor is further configured to determine the potential source language further based at least in part on a source language of a second text identifier associated with the text identifier.

Clause 12
The computing system of clause 5, wherein the instructions further configure the at least one processor to:
receive audio data comprising an utterance;
identify the potential pronunciation in the utterance;
identify the text identifier based on the stored association; and
retrieve at least a portion of a content item associated with the text identifier.

Clause 13
The computing system of clause 5, wherein the text identifier comprises a name of an artist, album, band, movie, book, song, and/or food item to be accessed by the computing device.

Clause 14
The computing system of clause 5, wherein the potential spoken language comprises a language associated with a location of a device of the system.

Clause 15
The computing system of clause 5, wherein the at least one processor is further configured to determine the potential pronunciation of the text identifier using at least one of a finite state transducer (FST) model, a maximum entropy model, a character-level language model, and/or a conditional random field model.

Clause 16
A non-transitory computer-readable storage medium storing processor-executable instructions for controlling a computing device, comprising:
program code to determine a potential source language for a text identifier, wherein the potential source language is based at least in part on the text identifier;
program code to determine a potential pronunciation of the text identifier, wherein the potential pronunciation is based at least in part on the potential source language and a potential spoken language; and
program code to store an association between the potential pronunciation and the text identifier.

Clause 17
The non-transitory computer-readable storage medium of clause 16, further comprising:
program code to determine a second potential source language for the text identifier, wherein the second potential source language is based at least in part on the text identifier;
program code to determine a second potential pronunciation of the text identifier, wherein the second potential pronunciation is based at least in part on the second potential source language; and
program code to store an association between the second potential pronunciation and the text identifier.

Clause 18
The non-transitory computer-readable storage medium of clause 17, wherein the potential source language, the second potential source language, the potential pronunciation, and the second potential pronunciation are each associated with a respective score.

Clause 19
The non-transitory computer-readable storage medium of clause 16, further comprising program code to determine a second potential source language of the text identifier, wherein:
the potential source language is associated with a first portion of the text identifier;
the second potential source language is associated with a second portion of the text identifier; and
the potential pronunciation is further based at least in part on the second potential source language.

Clause 20
The non-transitory computer-readable storage medium of clause 16, further comprising program code to determine the potential pronunciation further based at least in part on a pronunciation history of a user.

Clause 21
The non-transitory computer-readable storage medium of clause 20, wherein the pronunciation history of the user includes a language spoken by the user.

Clause 22
The non-transitory computer-readable storage medium of clause 16, further comprising program code to determine the potential source language further based at least in part on a source language of a second text identifier associated with the text identifier.

Clause 23
The non-transitory computer readable storage medium of clause 16, further comprising:
program code for receiving audio data including an utterance;
program code for identifying the potential pronunciation in the utterance;
program code for identifying the text identifier based on the stored association; and
program code for retrieving at least a portion of a content item associated with the text identifier.
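
By way of non-limiting illustration only, the lookup described in this clause might be sketched as follows. The sketch assumes that speech recognition has already reduced the utterance to a phoneme string. The tables `PRONUNCIATION_TO_IDENTIFIER` and `CONTENT_BY_IDENTIFIER`, the content address and the function `retrieve_content` are hypothetical.

```python
from typing import Dict, Optional

# Stored associations: several potential pronunciations may map to
# the same text identifier (hypothetical phoneme strings).
PRONUNCIATION_TO_IDENTIFIER: Dict[str, str] = {
    "K R AA F T V EH R K": "Kraftwerk",  # German-style pronunciation
    "K R AE F T W ER K": "Kraftwerk",    # English-style pronunciation
}

# Hypothetical mapping from text identifier to a content item.
CONTENT_BY_IDENTIFIER: Dict[str, str] = {
    "Kraftwerk": "content://artists/kraftwerk",
}


def retrieve_content(recognized_phones: str) -> Optional[str]:
    """Return the content item for a recognized pronunciation, if any."""
    identifier = PRONUNCIATION_TO_IDENTIFIER.get(recognized_phones)
    if identifier is None:
        return None  # the utterance matched no stored pronunciation
    return CONTENT_BY_IDENTIFIER.get(identifier)
```

Either stored pronunciation resolves to the same text identifier, so a user may reach the content item regardless of which pronunciation is spoken.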

Clause 24
The non-transitory computer readable storage medium of clause 16, wherein the text identifier accessed by the computing device comprises the name of an artist, album, band, movie, book, song and/or food.

Clause 25
The non-transitory computer readable storage medium of clause 16, wherein the potential verbal language is associated with a location of a device of the system.

Clause 26
The non-transitory computer readable storage medium of clause 16, wherein the program code for determining the potential pronunciation of the text identifier is based at least in part on a finite state transducer (FST) model, a maximum entropy model, a character level language model, and/or a conditional random field model.
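
By way of non-limiting illustration only, one way a character level language model might contribute is by scoring the spelling of a text identifier under each candidate source language, with the result then informing the pronunciation. The sketch below uses character bigrams with add-one smoothing. The training words, the constant `ALPHABET_SIZE` and the names `CharBigramModel` and `guess_source_language` are assumptions made for illustration and are not taken from the disclosure.

```python
import math
from collections import Counter
from typing import Dict, Iterable

ALPHABET_SIZE = 40  # assumed character inventory size, used for smoothing


class CharBigramModel:
    """Character bigram model with add-one smoothing."""

    def __init__(self, words: Iterable[str]) -> None:
        self.bigrams: Counter = Counter()
        self.contexts: Counter = Counter()
        for word in words:
            padded = "^" + word.lower() + "$"  # mark word start and end
            for a, b in zip(padded, padded[1:]):
                self.bigrams[(a, b)] += 1
                self.contexts[a] += 1

    def log_prob(self, word: str) -> float:
        """Smoothed log-probability of the spelling under this model."""
        padded = "^" + word.lower() + "$"
        total = 0.0
        for a, b in zip(padded, padded[1:]):
            numerator = self.bigrams[(a, b)] + 1
            denominator = self.contexts[a] + ALPHABET_SIZE
            total += math.log(numerator / denominator)
        return total


# Toy training data (illustrative only; a real system would use far more).
MODELS: Dict[str, CharBigramModel] = {
    "en": CharBigramModel(
        ["the", "thing", "night", "light", "water", "shout", "with", "other"]
    ),
    "de": CharBigramModel(
        ["nicht", "schon", "unter", "strasse", "zwischen", "schule", "ich", "wasser"]
    ),
}


def guess_source_language(word: str) -> str:
    """Return the language whose model scores the spelling highest."""
    return max(MODELS, key=lambda lang: MODELS[lang].log_prob(word))
```

The per-language log-probabilities computed here could also serve as the scores associated with each potential source language, rather than keeping only the highest-scoring one.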

Claims

A computer-implemented method for processing spoken utterances, the method comprising:
determining at least one source language of a song title based at least in part on the spelling of the song title;
determining a plurality of potential pronunciations of the song title based at least in part on the at least one source language and a language spoken by a user, each of the plurality of potential pronunciations being associated with a score;
storing an association between each of the plurality of potential pronunciations and the song title;
receiving a spoken utterance including a request to play a song;
matching a portion of the spoken utterance with one of the plurality of potential pronunciations based at least in part on the scores of the plurality of potential pronunciations;
identifying the song based at least in part on the one of the plurality of potential pronunciations; and
playing the song on a computing device.

The method of claim 1, wherein determining the plurality of potential pronunciations is further based at least in part on a user's pronunciation history of words that share at least one source language with the song title.

The method of claim 1, further comprising determining at least one potential pronunciation by associating a first portion of the song title with a first source language and a second portion of the song title with a second source language.

The method of claim 1, wherein determining the at least one source language of the song title is based at least in part on the source language of another song playable by the computing device.

A computing system comprising:
at least one processor; and
a memory device including instructions operable to be executed by the at least one processor to perform a set of actions, the instructions configuring the at least one processor to:
determine a potential source language of a text identifier, wherein the potential source language is based at least in part on the text identifier;
determine a potential pronunciation of the text identifier, wherein the potential pronunciation is based at least in part on the potential source language and a potential verbal language; and
store an association between the potential pronunciation and the text identifier.

The computing system of claim 5, wherein the instructions further configure the at least one processor to:
determine a second potential source language of the text identifier, wherein the second potential source language is based at least in part on the text identifier;
determine a second potential pronunciation of the text identifier, wherein the second potential pronunciation is based at least in part on the second potential source language; and
store an association between the second potential pronunciation and the text identifier.

The computing system of claim 6, wherein the potential source language, the second potential source language, the potential pronunciation and the second potential pronunciation are each associated with a respective score.

The computing system of claim 5, wherein the at least one processor is further configured to determine a second potential source language of the text identifier, and wherein:
the potential source language is associated with a first portion of the text identifier;
the second potential source language is associated with a second portion of the text identifier; and
the potential pronunciation is further based at least in part on the second potential source language.

The computing system of claim 5, wherein the at least one processor is further configured to determine the potential pronunciation based at least in part on a user's pronunciation history.

The computing system of claim 9, wherein the pronunciation history of the user includes a language spoken by the user.

The computing system of claim 5, wherein the at least one processor is further configured to determine the potential source language based at least in part on a source language of a second text identifier associated with the text identifier.

The computing system of claim 5, wherein the instructions further configure the at least one processor to:
receive audio data including an utterance;
identify the potential pronunciation in the utterance;
identify the text identifier based on the stored association; and
retrieve at least a portion of a content item associated with the text identifier.

The computing system of claim 5, wherein the text identifier accessed by the computing system includes the name of an artist, album, band, movie, book, song and/or food.

The computing system of claim 5, wherein the potential verbal language comprises a language associated with a location of a device of the system.

The computing system of claim 5, wherein the at least one processor is further configured to determine the potential pronunciation of the text identifier using at least one of a finite state transducer (FST) model, a maximum entropy model, a character level language model, and/or a conditional random field model.