JP2022542415A

JP2022542415A - Systems and methods for managing spoken queries using pronunciation information

Info

Publication number: JP2022542415A
Application number: JP2022506260A
Authority: JP
Inventors: アンクールアヘル，; インドラニルクーマードス，; アーシシュゴヤル，; アマンプニヤニ，; カンダラレディ，; ミトゥンウメシュ，
Original assignee: ロヴィガイズ，インコーポレイテッド
Priority date: 2019-07-31
Filing date: 2020-07-22
Publication date: 2022-10-03
Also published as: WO2021021529A1; EP4004913A1; CA3143967A1

Abstract

The system receives the spoken query at the audio interface and converts the spoken query to text. During conversion, the system may determine phonetic information, generate metadata indicating the pronunciation of one or more words of the query, include phonemic information within the text query, or both. A query includes one or more entities that can be more accurately identified based on pronunciation. The system searches for information, content, or both in one or more databases based on generated text queries, phonetic information, user profile information, search histories or trends, and optionally other information. The system identifies one or more entities or content items that match the text query, retrieves the identified information, and provides it to the user.

Description

This disclosure relates to systems for managing voice queries and, more particularly, to systems for managing voice queries based on pronunciation information.

In conversational systems, when a user issues a voice query to the system, the speech is converted to text using an automatic speech recognition (ASR) module. This text then forms the input to the conversation system, which determines a response to the text. For example, when a user says, "Show me Tom Cruise movies," the ASR module converts the user's speech to text and passes it to the conversation system. The conversation system acts only on the text it receives from the ASR module. In this process, the conversation system sometimes loses details of how words were pronounced or of the sounds contained in the user's query. Pronunciation details can provide information useful for searching, especially when the same word has two or more pronunciations and the pronunciations correspond to different meanings.

This disclosure describes systems and methods that, when a user speaks query words, perform a search based on multiple contextual inputs and predict the user's intended search query. The search may be based on multiple contextual inputs including, for example, user search history, user likes and dislikes, general trends, pronunciation details of the query words, and any other suitable information. An application receives a voice query and generates a text query representing the voice query. The application uses pronunciation information, which may be included in the text query, in metadata associated with the text query, or in metadata of entities in a database, to retrieve search results more accurately. In some embodiments, the application generates metadata based on text-to-speech and speech-to-text conversions to improve the reachability of entities from search queries.

The above and other objects and advantages of the present disclosure will become apparent upon consideration of the following detailed description, taken in conjunction with the accompanying drawings, in which like reference characters refer to like parts throughout.

FIG. 1 shows a block diagram of an illustrative system for generating a text query, in accordance with some embodiments of the present disclosure.

FIG. 2 shows a block diagram of an illustrative system for retrieving content in response to a voice query, in accordance with some embodiments of the present disclosure.

FIG. 3 shows a block diagram of an illustrative system for generating pronunciation information, in accordance with some embodiments of the present disclosure.

FIG. 4 is a block diagram of illustrative user equipment, in accordance with some embodiments of the present disclosure.

FIG. 5 shows a block diagram of an illustrative system for responding to a voice query, in accordance with some embodiments of the present disclosure.

FIG. 6 shows a flowchart of an illustrative process for responding to a voice query based on pronunciation information, in accordance with some embodiments of the present disclosure.

FIG. 7 shows a flowchart of an illustrative process for responding to a voice query based on alternative representations, in accordance with some embodiments of the present disclosure.

FIG. 8 shows a flowchart of an illustrative process for generating metadata for an entity based on pronunciation, in accordance with some embodiments of the present disclosure.

FIG. 9 shows a flowchart of an illustrative process for retrieving content associated with an entity of a voice query, in accordance with some embodiments of the present disclosure.

In some embodiments, the present disclosure is directed to a system configured to receive a voice query from a user, analyze the voice query, and generate a text query (e.g., a translation) for retrieving content or information. The system responds to the voice query based in part on the pronunciation of one or more keywords. For example, in the English language, there are multiple words that have the same spelling but different pronunciations. This can be especially true for people's names. Some examples include:

To illustrate, a user may say "Show me the interview with Louis" to the system's audio interface. The system may generate illustrative text queries such as the following.
Option 1) "Show me Louis Freeh's interview with Fraud Magazine"
Option 2) "Show me the Lewis Black interview that aired on CBS"
The resulting text query depends on how the user pronounced the word "Louis." If the user pronounced it "LOO-ee," the system selects option 1 or applies a heavier weight to option 1. If the user pronounced it "LOO-his," the system selects option 2 or applies a heavier weight to option 2. If pronunciation is not taken into account, the system is unlikely to be able to respond accurately to the voice query.
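The weighting described above can be illustrated with a minimal Python sketch. It is not the method of the disclosure: the `Candidate` class, the `rank_candidates` function, the `boost` factor, and the use of plain string comparison for pronunciations are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A candidate text query plus the pronunciation that would select it."""
    text_query: str
    pronunciation: str      # hypothetical phonetic key, e.g. "LOO-ee"
    base_score: float = 1.0


def rank_candidates(candidates, heard, boost=2.0):
    """Sort candidates so that those matching the heard pronunciation rank first.

    A matching candidate has its base score multiplied by `boost` (an
    assumed value); non-matching candidates keep their base score.
    """
    def score(candidate):
        matches = candidate.pronunciation.lower() == heard.lower()
        return candidate.base_score * (boost if matches else 1.0)

    return sorted(candidates, key=score, reverse=True)


options = [
    Candidate("Show me Louis Freeh's interview with Fraud Magazine", "LOO-ee"),
    Candidate("Show me the Lewis Black interview that aired on CBS", "LOO-his"),
]

best = rank_candidates(options, "LOO-his")[0]
print(best.text_query)
```

Applying a weight rather than a hard filter keeps the non-matching option in the ranking, so other signals could still promote it.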

In some situations, a voice query that includes a partial name of a person can cause ambiguity in correctly detecting that person (e.g., referred to as a "non-definitive personality search query"). For example, if the user says "Show me movies starring Tom" or "Show me the interview with Louis," the system would need to determine which Tom, or which Louis/Louie/Lewis, the user is asking about. In addition to pronunciation information, the system may analyze one or more contextual inputs such as, for example, user search history (e.g., previous queries and search results), user likes/dislikes/preferences (e.g., from user profile information), general trends (e.g., of multiple users), popularity (e.g., among multiple users), any other suitable information, or any combination thereof. The system retains the pronunciation information in a suitable form (e.g., in the text query itself, or in metadata associated with the text query) so that it is not lost after the automatic speech recognition (ASR) process.

In some embodiments, for pronunciation information to be used by the system, the information field in which the system searches must include pronunciation information for comparison with the query. For example, the information field may include information about entities that includes pronunciation metadata. The system may perform a phonetic translation process, which takes the user's voice query as input and translates it to text that, when read back, sounds phonetically correct. The system may be configured to use the output of the phonetic translation process and the pronunciation metadata to determine search results. In an illustrative example, pronunciation metadata stored for an entity may include:

In some embodiments, the present disclosure is directed to a system configured to receive a voice query from a user, analyze the voice query, and generate a text query (e.g., a translation) for retrieving content or information. The information field in which the system searches includes pronunciation metadata, alternative text representations of entities, or both. For example, when a user issues a voice query to the system, the system first converts the voice to text using an ASR module. The resulting text then forms the input to a conversation system (e.g., one that performs an action in response to the query). To illustrate, if a user says "Show me Tom Cruise movies," the ASR module converts the user's speech to text and issues a text query to the conversation system. If an entity corresponding to "Tom Cruise" exists in the data, the system matches it with the text "Tom Cruise" and returns appropriate results (e.g., information about Tom Cruise, content featuring Tom Cruise, or content identifiers thereof). An entity may be referred to as "reachable" when it exists in the data (e.g., of the information field) and can be accessed directly using the entity title. Reachability is of prime importance for a system performing search operations. For example, if some data (e.g., a movie, an artist, a television series, or another entity) exists in the system and the associated data is stored, but the user cannot access that information, the entity may be referred to as "unreachable." Unreachable entities in a data system represent a failure of the search system.

The system may identify one or more entities or content items among a plurality of stored information. In some embodiments, the system generates an audio file based on a first text string that represents the entity or content item. Based on the first text string and at least one speech criterion, the system may use a speech-to-text module to generate a second text string based on the audio file. The system compares the text strings and stores the second text string if it is not identical to the first text string. In some embodiments, the system generates metadata that includes the results of text-to-speech-to-text conversions in order to anticipate possible misidentifications when responding to voice queries during search operations. The metadata may include alternative representations of the entity to improve reachability.

FIG. 1 shows a block diagram of an illustrative system 100 for generating a text query, in accordance with some embodiments of the present disclosure. System 100 includes ASR module 110, conversation system 120, pronunciation metadata 150, user profile information 160, and one or more databases 170. For example, ASR module 110 and conversation system 120, which together may be included in system 199, may be used to implement a query application.

A user may voice query 101, which includes the speech "Show me that interview with Louis from last week," to an audio interface of system 199. ASR module 110 is configured to sample, condition, and digitize the received audio input, analyze the resulting audio file, and generate a text query. In some embodiments, ASR module 110 retrieves information from user profile information 160 to help generate the text query. For example, voice recognition information for the user may be stored in user profile information 160, and ASR module 110 may use the voice recognition information to identify the speaking user. In a further example, system 199 may include user profile information 160 stored in suitable memory. ASR module 110 may determine pronunciation information for the voiced word "Louis." Because there is more than one pronunciation for the text word "Louis," system 199 generates the text query based on the pronunciation information. Further, the sound "Loo-his" can be converted to text as "Louis" or "Lewis," and thus contextual information may help in identifying the correct entity of the voice query (e.g., Lewis as in Lewis Black, as opposed to Louis as in Louis Farrakhan). In some embodiments, conversation system 120 is configured to generate the text query, respond to the text query, or both, based on the recognized words from ASR module 110, contextual information, user profile information 160, pronunciation metadata 150, one or more databases 170, any other information, or any combination thereof. For example, conversation system 120 may generate a text query and then compare the text query with pronunciation metadata 150 for a plurality of entities to determine a match. In a further example, conversation system 120 may compare one or more recognized words with pronunciation metadata 150 for a plurality of entities to determine a match and then generate the text query based on the identified entity. In some embodiments, conversation system 120 generates a text query with accompanying pronunciation information. In some embodiments, conversation system 120 generates a text query with embedded pronunciation information. For example, the text query may include a phonetic representation of a word, such as "loo-ee," rather than the correct grammatical representation "Louis." In a further example, pronunciation metadata 150 may include one or more reference phonetic representations with which the text query may be compared.
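The two ways of carrying pronunciation information, accompanying the text query or embedded in it, can be sketched as a small data structure. This is an illustrative sketch only: the `TextQuery` class, the `REFERENCE` table, and the phonetic spellings assigned to each entity are assumptions, not data from the disclosure.

```python
from dataclasses import dataclass, field


@dataclass
class TextQuery:
    """A text query with accompanying pronunciation information."""
    text: str
    # Maps a word of the query to how the user pronounced it.
    pronunciation: dict = field(default_factory=dict)

    def embedded(self):
        """Return the query with phonetic forms embedded in place of words."""
        words = self.text.split()
        return " ".join(self.pronunciation.get(word, word) for word in words)


# Hypothetical pronunciation metadata: entity name -> reference phonetic forms.
REFERENCE = {
    "Louis Freeh": {"loo-ee"},
    "Louis Farrakhan": {"loo-his"},
    "Lewis Black": {"loo-his"},
}


def match_entities(query, reference=REFERENCE):
    """Return the names of entities whose reference forms match the query."""
    heard = {form.lower() for form in query.pronunciation.values()}
    return sorted(name for name, forms in reference.items() if heard & forms)


q = TextQuery("Show me that Louis interview from last week", {"Louis": "loo-ee"})
print(q.embedded())
print(match_entities(q))
```

When the pronunciation alone matches more than one entity, as "loo-his" does in this assumed table, contextual inputs would still be needed to choose between them.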

User profile information 160 may include user identification information (e.g., name, identifier, address, contact information), user search history (e.g., previous voice queries, previous text queries, previous search results, feedback on previous search results or queries), user preferences (e.g., search settings, favorite entities, keywords included in more than one query), user likes/dislikes (e.g., entities followed by the user in a social media application, user-entered information), other users connected to the user (e.g., friends, family members, contacts in a social networking application, contacts stored on a user device), user voice data (e.g., audio samples, signatures, speech patterns, or files for identifying the user's voice), any other suitable information about the user, or any combination thereof.

One or more databases 170 include any suitable information for generating a text query, responding to a text query, or both. In some embodiments, pronunciation metadata 150, user profile information 160, or both may be included in one or more databases 170. In some embodiments, one or more databases 170 include statistical information for a plurality of users (e.g., search histories, content consumption histories, consumption patterns). In some embodiments, one or more databases 170 include information about a plurality of entities including persons, places, objects, events, content items, media content associated with one or more entities, or a combination thereof.

FIG. 2 shows a block diagram of an illustrative system 200 for retrieving content in response to a voice query, in accordance with some embodiments of the present disclosure. System 200 includes speech processing system 210, search engine 220, entity database 250, and user profile information 240. Speech processing system 210 may identify an audio file and may analyze the audio file for phonemes, patterns, words, or other elements from which keywords may be identified. In some embodiments, speech processing system 210 may analyze the audio input in the time domain, the spectral domain, or both to identify words. For example, speech processing system 210 may analyze the audio input in the time domain to determine periods of time during which speech occurs (e.g., to eliminate pauses or periods of silence). Speech processing system 210 may then analyze each period of time in the spectral domain to identify phonemes, patterns, words, or other elements from which keywords may be identified. Speech processing system 210 may output a generated text query, one or more words, pronunciation information, or a combination thereof. In some embodiments, speech processing system 210 may retrieve data from user profile information 240 for voice recognition, speech recognition, or both.
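The time-domain step, finding the periods in which speech occurs so that pauses and silence can be discarded, could look roughly like the following Python sketch. It assumes a simple mean-absolute-amplitude threshold per frame; the function name `speech_periods`, the frame length, and the threshold are illustrative choices, and the subsequent spectral analysis is not shown.

```python
def speech_periods(samples, frame_len=4, threshold=0.1):
    """Return (start, end) sample-index pairs for runs of non-silent frames.

    A frame counts as speech when its mean absolute amplitude exceeds
    `threshold`. The end index is exclusive. Both parameters are assumed
    values for illustration, not values from the disclosure.
    """
    periods = []
    start = None
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(abs(s) for s in frame) / len(frame)
        if energy > threshold and start is None:
            start = i                      # speech begins in this frame
        elif energy <= threshold and start is not None:
            periods.append((start, i))     # speech ended before this frame
            start = None
    if start is not None:
        periods.append((start, len(samples)))
    return periods


# Toy signal: silence, two loud frames, silence, one quieter frame.
audio = [0.0] * 4 + [0.5, -0.4, 0.6, -0.5] * 2 + [0.0] * 4 + [0.3, -0.3, 0.2, -0.2]
print(speech_periods(audio))  # [(4, 12), (16, 20)]
```

Each returned period could then be passed to a spectral analysis stage to look for phonemes or words.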

Search engine 220 receives the output from speech processing system 210 and, in combination with search settings 221 and context information 222, generates a response to the text query. Search engine 220 may use user profile information 240 to generate, modify, or respond to the text query. Search engine 220 uses the text query to search among the data of database of entities 250. Database of entities 250 may include metadata associated with a plurality of entities, content associated with a plurality of entities, or both. For example, the data may include an identifier for an entity, details describing the entity, a title referring to the entity (e.g., which may include a phonetic representation or alternative representation), phrases associated with the entity (e.g., which may include a phonetic representation or alternative representation), links associated with the entity (e.g., IP addresses, URLs, hardware addresses), keywords associated with the entity (e.g., which may include a phonetic representation or alternative representation), any other suitable information associated with the entity, or any combination thereof. When search engine 220 identifies one or more entities that match keywords of the text query, identifies one or more content items that match keywords of the text query, or both, search engine 220 may then provide information, content, or both to the user as response 270 to the text query. In some embodiments, search settings 221 include databases, entities, types of entities, types of content, other search criteria, or any combination thereof that affect generation of the text query, retrieval of search results, or both. In some embodiments, context information 222 includes genre information (e.g., to further narrow the search field), keywords, database identification (e.g., databases likely to include the target information or content), types of content (e.g., by date, genre, title, format), any other suitable information, or any combination thereof. Response 270 may include, for example, content (e.g., a displayed video), information, a list of search results, links to content, any other suitable search results, or any combination thereof.
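A minimal Python sketch of searching entity records that carry alternative representations alongside their titles is given here for illustration. The record layout, the identifiers `E1` and `E2`, the alternate spellings, and the substring-matching rule are all assumptions; a real search engine would use more robust matching and ranking.

```python
# Hypothetical entity records; the field names are illustrative only.
ENTITIES = [
    {"id": "E1", "title": "Tom Cruise", "alternates": ["tom crews"]},
    {"id": "E2", "title": "Lewis Black", "alternates": ["louis black"]},
]


def normalize(text):
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def search(query_text, entities=ENTITIES):
    """Return ids of entities whose title or any alternate appears in the query."""
    query = normalize(query_text)
    hits = []
    for entity in entities:
        names = [entity["title"], *entity.get("alternates", [])]
        if any(normalize(name) in query for name in names):
            hits.append(entity["id"])
    return hits


print(search("Show me Tom Cruise movies"))          # ['E1']
print(search("show me the louis black interview"))  # ['E2']
```

In this sketch the second query reaches entity `E2` only because the alternate spelling is stored with it, which is the sense in which alternative representations can improve reachability.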

FIG. 3 shows a block diagram of an illustrative system 300 for generating pronunciation information, in accordance with some embodiments of the present disclosure. System 300 includes text-to-speech engine 310 and speech-to-text engine 320. In some embodiments, system 300 determines pronunciation information independently of a text or voice query. For example, system 300 may generate metadata for one or more entities (e.g., such as pronunciation metadata 150 of system 100, or metadata stored in database of entities 250 of system 200). Text-to-speech engine 310 may identify a first text string 302, which may include an entity name or other identifier that is likely to be included in a voice query. For example, text-to-speech engine 310 may identify a "name" field of entity metadata rather than an "ID" field, because a user is more likely to speak a voice query that includes a name than one that includes a numeric or alphanumeric identifier (e.g., the user says "Louis" rather than "WIKI04556"). Text-to-speech engine 310 generates audio output 312, at a speaker or other audio device, based on the first text string. For example, text-to-speech engine 310 may use one or more settings to specify voice details (e.g., male/female voice, accent, or other details), playback speed, or any other suitable settings that may affect the generated audio output. Speech-to-text engine 320 receives audio input 313 from audio output 312 at a microphone or other suitable device (e.g., in addition to or in place of an audio file, which may be stored) and generates a text conversion of audio input 313 (e.g., in addition to or in place of storing an audio file of the recorded audio). Speech-to-text engine 320 may use processing settings to generate new text string 322. New text string 322 is compared with first text string 302. If new text string 322 is identical to text string 302, no metadata need be generated, because a voice query is likely to be converted to an accurate text query. If new text string 322 is not identical to text string 302, this indicates that a voice query might be converted incorrectly to a text query. Accordingly, if new text string 322 is not identical to text string 302, speech-to-text engine 320 includes new text string 322 in metadata associated with the entity with which text string 302 is associated. System 300 may identify a plurality of entities and, for each entity, generate metadata that includes the resulting text string (e.g., such as new text string 322) from text-to-speech engine 310 and speech-to-text engine 320. In some embodiments, for a given entity, text-to-speech engine 310, speech-to-text engine 320, or both may use two or more settings to generate two or more new text strings. Accordingly, where the two or more new text strings differ from text string 302, each new text string may be stored in the metadata. For example, different pronunciations, or interpretations of pronunciations, arising from different settings may generate different new text strings, which may be stored in preparation for voice queries from different users. By generating and storing alternative representations (e.g., text string 302 and new text string 322), system 300 may update metadata to allow more accurate searching (e.g., improve the reachability of entities and the accuracy of searching).

In an illustrative example, for an entity, the system 300 identifies the title and related phrases, passes each phrase through the text-to-speech engine 310 and saves the respective audio file, and then passes each audio file through the speech-to-text engine 320 to obtain an ASR transcript (e.g., new text string 322). If the ASR transcript differs from the original phrase (e.g., text string 302), the system 300 adds the ASR transcript to the entity's related phrases (e.g., as stored in the metadata). In some embodiments, the system 300 requires no manual work and can be fully automated (e.g., no user input is required). In some embodiments, the system 300 is alerted when a user issues a query and does not get the desired result. In response, a person manually identifies what the correct entity for the query should have been. The incorrect result is stored and provides information for future queries. The system 300 addresses potential inaccuracies at the metadata level rather than at the system level. The analysis of text strings 302 for many entities can be exhaustive and automatic, so that all erroneous cases are identified and resolved in advance (e.g., before any user's voice query). The system 300 does not require a user to provide a voice query in order to generate an erroneous case (e.g., an alternate representation). The system 300 can be used to emulate a user's interaction with the query system and to anticipate potential sources of error in performing searches.
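
The round trip described above (phrase to audio, audio back to text, store the transcript if it differs) can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the `Entity` class, the function name, the `voices` parameter (standing in for the "two or more settings"), and the stub engines `fake_tts` and `fake_asr` are all hypothetical, and the case-insensitive comparison is an assumption about what "differs" means.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class Entity:
    """Hypothetical entity record; related_phrases stands in for the metadata."""
    title: str
    related_phrases: list[str] = field(default_factory=list)


def add_alternate_representations(
    entity: Entity,
    text_to_speech: Callable[[str, str], bytes],
    speech_to_text: Callable[[bytes], str],
    voices: Iterable[str] = ("default",),
) -> list[str]:
    """Round-trip each phrase through TTS and ASR; keep transcripts that differ."""
    added: list[str] = []
    voices = tuple(voices)
    # Snapshot the phrases so that newly added transcripts are not re-processed.
    for phrase in [entity.title, *entity.related_phrases]:
        for voice in voices:
            audio = text_to_speech(phrase, voice)
            transcript = speech_to_text(audio)
            known = {p.lower() for p in [entity.title, *entity.related_phrases]}
            # Assumption: compare case-insensitively against all known phrases.
            if transcript.lower() not in known:
                entity.related_phrases.append(transcript)
                added.append(transcript)
    return added


def fake_tts(text: str, voice: str) -> bytes:
    """Stub engine (hypothetical): encodes the voice and text instead of audio."""
    return f"{voice}|{text}".encode("utf-8")


def fake_asr(audio: bytes) -> str:
    """Stub engine (hypothetical): 'mishears' one phrase for one voice setting."""
    voice, text = audio.decode("utf-8").split("|", 1)
    mishearings = {("Louis", "fr"): "Louie"}
    return mishearings.get((text, voice), text)
```

With the stub engines, calling `add_alternate_representations(Entity("Louis"), fake_tts, fake_asr, voices=("en", "fr"))` stores "Louie" as a related phrase, and a second call adds nothing because the transcript is already known. Real engines would be passed in through the same two callables.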

A user may access content, an application (e.g., for interpreting voice queries), and other features from one or more of, for example, their device (i.e., user equipment or audio equipment), one or more network-connected devices, one or more electronic devices having a display, or a combination thereof. Any of the illustrative techniques of the present disclosure may be implemented by a user device, a device that provides a display to a user, or any other suitable control circuitry configured to respond to voice queries and generate display content for a user.

FIG. 4 shows a generalized embodiment of an illustrative user device. User equipment system 401 may include set-top box 416, which includes, or is communicatively coupled to, display 412, audio equipment 414, and user input interface 410. In some embodiments, display 412 may include a television display or a computer display. In some embodiments, user input interface 410 is a remote-control device. Set-top box 416 may include one or more circuit boards. In some embodiments, the one or more circuit boards include processing circuitry, control circuitry, and storage (e.g., RAM, ROM, hard disk, removable disk, etc.). In some embodiments, the circuit boards include an input/output path. Each of user equipment device 400 and user equipment system 401 may receive content and data via input/output (hereinafter "I/O") path 402. I/O path 402 may provide content and data to control circuitry 404, which includes processing circuitry 406 and storage device 408. Control circuitry 404 may be used to send and receive commands, requests, and other suitable data using I/O path 402. I/O path 402 may connect control circuitry 404 (and specifically processing circuitry 406) to one or more communication paths (described below). I/O functions may be provided by one or more of these communication paths, but are shown as a single path in FIG. 4 to avoid overcomplicating the drawing. Although set-top box 416 is shown in FIG. 4 for illustration, any suitable computing device having processing circuitry, control circuitry, and storage may be used in accordance with the present disclosure. For example, set-top box 416 may be replaced or supplemented by a personal computer (e.g., a notebook, laptop, or desktop), a network-based server hosting a user-accessible client device, a non-user-owned device, any other suitable device, or any combination thereof.

Control circuitry 404 may be based on any suitable processing circuitry, such as processing circuitry 406. As referred to herein, processing circuitry should be understood to mean circuitry based on one or more microprocessors, microcontrollers, digital signal processors, programmable logic devices, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), etc., and may include a multi-core processor (e.g., dual-core, quad-core, hexa-core, or any suitable number of cores) or a supercomputer. In some embodiments, processing circuitry is distributed across multiple separate processors or processing units, for example, multiple processing units of the same type (e.g., two Intel Core i7 processors) or multiple different processors (e.g., an Intel Core i5 processor and an Intel Core i7 processor). In some embodiments, control circuitry 404 executes instructions for an application stored in memory (e.g., storage device 408). Specifically, control circuitry 404 may be instructed by the application to perform the functions discussed above and below. For example, the application may provide instructions to control circuitry 404 to generate a media guidance display. In some implementations, any action performed by control circuitry 404 may be based on instructions received from the application.

In some client/server-based embodiments, control circuitry 404 includes communication circuitry suitable for communicating with an application server or other networks or servers. The instructions for carrying out the functionality described above may be stored on the application server. Communication circuitry may include a cable modem, an integrated services digital network (ISDN) modem, a digital subscriber line (DSL) modem, a telephone modem, an Ethernet (registered trademark) card, or a wireless modem for communicating with other equipment, or any other suitable communication circuitry. Such communication may involve the Internet or any other suitable communication network or path. In addition, communication circuitry may include circuitry that enables peer-to-peer communication of user equipment devices, or communication of user equipment devices in locations remote from each other (described in more detail below).

Memory may be an electronic storage device, such as storage device 408, that is part of control circuitry 404. As referred to herein, the phrase "electronic storage device" or "storage device" should be understood to mean any device for storing electronic data, computer software, or firmware, such as random-access memory, read-only memory, hard drives, optical drives, solid-state devices, quantum storage devices, gaming consoles, gaming media, any other suitable fixed or removable storage devices, or any combination thereof. Storage device 408 may be used to store the various types of content described herein as well as the media guidance data described above. Non-volatile memory may also be used (e.g., to launch a boot-up routine and other instructions). Cloud-based storage may be used, for example, to supplement storage device 408 or in place of storage device 408.

A user may send instructions to control circuitry 404 using user input interface 410. User input interface 410, display 412, or both may include a touchscreen configured to provide a display and receive haptic input. For example, the touchscreen may be configured to receive haptic input from a finger, a stylus, or both. In some embodiments, user equipment device 400 may include a front-facing screen and a rear-facing screen, multiple front screens, or multiple angled screens. In some embodiments, user input interface 410 includes a remote-control device having one or more microphones, buttons, keypads, any other components configured to receive user input, or combinations thereof. For example, user input interface 410 may include a handheld remote-control device having an alphanumeric keypad and option buttons. In a further example, user input interface 410 may include a handheld remote-control device having a microphone and control circuitry configured to receive and identify voice commands and transmit information to set-top box 416.

Audio equipment 414 may be provided as integrated with other elements of each of user device 400 and user equipment system 401, or may be a stand-alone unit. The audio component of videos and other content displayed on display 412 may be played through speakers of audio equipment 414. In some embodiments, the audio may be distributed to a receiver (not shown), which processes the audio and outputs it via the speakers of audio equipment 414. In some embodiments, for example, control circuitry 404 is configured to provide audio cues or other audio feedback to a user using the speakers of audio equipment 414. Audio equipment 414 may include a microphone configured to receive audio input such as voice commands and speech (e.g., including voice queries). For example, a user may speak letters or words that are received by the microphone and converted to text by control circuitry 404. In a further example, a user may voice commands that are received by the microphone and recognized by control circuitry 404.

An application (e.g., for managing voice queries) may be implemented using any suitable architecture. For example, a stand-alone application may be wholly implemented on each of user device 400 and user equipment system 401. In some such embodiments, instructions for the application are stored locally (e.g., in storage device 408), and data for use by the application is downloaded on a periodic basis (e.g., from an out-of-band feed, from an Internet resource, or using another suitable approach). Control circuitry 404 may retrieve instructions for the application from storage device 408 and process the instructions to generate any of the displays discussed herein. Based on the processed instructions, control circuitry 404 may determine what action to perform when input is received from user input interface 410. For example, movement of a cursor on a display up or down may be indicated by the processed instructions when input interface 410 indicates that an up/down button was selected. An application and/or any instructions for performing any of the embodiments discussed herein may be encoded on computer-readable media. Computer-readable media include any media capable of storing data. Computer-readable media may be transitory, including but not limited to propagating electrical or electromagnetic signals, or may be non-transitory, including but not limited to volatile and non-volatile computer memory or storage devices such as a hard disk, floppy (registered trademark) disk, USB drive, DVD, CD, media card, register memory, processor cache, and random-access memory (RAM).

In some embodiments, the application is a client/server-based application. Data for use by a thick or thin client implemented on each of user device 400 and user equipment system 401 is retrieved on demand by issuing requests to a server remote from each of user equipment device 400 and user equipment system 401. For example, the remote server may store the instructions for the application in a storage device. The remote server may process the stored instructions using circuitry (e.g., control circuitry 404) and generate the displays discussed above and below. The client device may receive the displays generated by the remote server and may display the content of the displays locally on user device 400. In this way, the processing of the instructions is performed remotely by the server, while the resulting displays, which may include text, a keyboard, or other visuals, are provided locally on user device 400. User device 400 may receive inputs from the user via input interface 410 and transmit those inputs to the remote server for processing and for generating the corresponding displays. For example, user device 400 may transmit a communication to the remote server indicating that an up/down button was selected via input interface 410. The remote server may process instructions in accordance with that input and generate a display of the application corresponding to the input (e.g., a display that moves a cursor up or down). The generated display is then transmitted to user device 400 for presentation to the user.
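
The thin-client flow described above (the device forwards a button event, the server updates state and returns a rendered display) can be sketched as follows. This is a minimal illustration, not the disclosed implementation: `RemoteServer`, `ThinClient`, the text-based rendering, and the direct method call standing in for a network request are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class RemoteServer:
    """Hypothetical stand-in for the remote server that holds application state."""
    items: list[str]
    cursor: int = 0

    def handle_input(self, button: str) -> str:
        """Process a button event and return the rendered display as text."""
        if button == "up":
            self.cursor = max(0, self.cursor - 1)
        elif button == "down":
            self.cursor = min(len(self.items) - 1, self.cursor + 1)
        return self.render()

    def render(self) -> str:
        """Render one line per item, marking the cursor position with '>'."""
        return "\n".join(
            f"{'>' if i == self.cursor else ' '} {item}"
            for i, item in enumerate(self.items)
        )


class ThinClient:
    """Forwards inputs to the server and shows whatever display comes back."""

    def __init__(self, server: RemoteServer) -> None:
        self.server = server
        self.screen = server.render()

    def press(self, button: str) -> None:
        # A real deployment would send this over a network; here it is a direct call.
        self.screen = self.server.handle_input(button)
```

The client keeps no application state of its own beyond the last display it received, which is the division of labor the paragraph describes: instruction processing happens on the server, presentation happens locally.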

In some embodiments, the application is downloaded and interpreted or otherwise run by an interpreter or virtual machine (e.g., run by control circuitry 404). In some embodiments, the application may be encoded in the ETV Binary Interchange Format (EBIF), received by control circuitry 404 as part of a suitable feed, and interpreted by a user agent running on control circuitry 404. For example, the application may be an EBIF application. In some embodiments, the application may be defined by a series of JAVA (registered trademark)-based files that are received and run by a local virtual machine or other suitable middleware executed by control circuitry 404.

FIG. 5 shows a block diagram of an illustrative network arrangement 500 for responding to voice queries, in accordance with some embodiments of the present disclosure. Illustrative system 500 may represent circumstances in which a user provides a voice query at user device 550, views content on a display of user device 550, or both. In system 500, there may be more than one type of user device, but only one is shown in FIG. 5 to avoid overcomplicating the drawing. In addition, each user may utilize more than one type of user device and also more than one of each type of user device. User device 550 may be the same as user device 400 of FIG. 4, user equipment system 401, any other suitable device, or any combination thereof.

User device 550, illustrated as a wireless-enabled device, may be coupled to communication network 510 (e.g., connected to the Internet). For example, user device 550 is coupled to communication network 510 via a communication path (e.g., which may include an access point). In some embodiments, user device 550 may be a computing device coupled to communication network 510 via a wired connection. For example, user device 550 may also include a wired connection to a LAN, or any other suitable communication link to network 510. Communication network 510 may be one or more networks including the Internet, a mobile phone network, a mobile voice or data network (e.g., a 4G or LTE network), a cable network, a public switched telephone network, or other types of communication networks or combinations of communication networks. Communication paths may include one or more communication paths, such as a satellite path, a fiber-optic path, a cable path, a path that supports Internet communications, a free-space connection (e.g., for broadcast or other wireless signals), or any other suitable wired or wireless communication path or combination of such paths. Although communication paths are not drawn between user device 550 and network device 520, these devices may communicate directly with each other via communication paths such as those described above, as well as other short-range point-to-point communication paths, such as USB cables, IEEE 1394 cables, and wireless paths (e.g., Bluetooth (registered trademark), infrared, IEEE 802-11x, etc.), or other short-range communication via wired or wireless paths. BLUETOOTH (registered trademark) is a certification mark owned by Bluetooth (registered trademark) SIG, Inc. Devices may also communicate with each other directly through an indirect path via communication network 510.

System 500, as illustrated, includes network device 520 (e.g., a server or other suitable computing device) coupled to communication network 510 via a suitable communication path. Communications between network device 520 and user device 550 may be exchanged over one or more communication paths but are shown as a single path in FIG. 5 to avoid overcomplicating the drawing. Network device 520 may include a database and one or more applications (e.g., as an application server or host server). A plurality of network entities may exist and be in communication with network 510, but only one is shown in FIG. 5 to avoid overcomplicating the drawing. In some embodiments, network device 520 may include one source device. In some embodiments, network device 520 implements an application that communicates with instances of the application at many user devices (e.g., user device 550). For example, an instance of a social media application may be implemented on user device 550, with application information being communicated to and from network device 520, which may store profile information for the user (e.g., so that a current social media feed is available on devices other than user device 550). In a further example, an instance of a search application may be implemented on user device 550, with application information being communicated to and from network device 520, which may store profile information for the user, search histories from a plurality of users, entity information (e.g., content and metadata), any other suitable information, or any combination thereof.

In some embodiments, network device 520 includes one or more types of stored information, including, for example, entity information, metadata, content, historical communications and search records, user preferences, user profile information, any other suitable information, or any combination thereof. Network device 520 may include an application-hosting database or server, plug-ins, a software development kit (SDK), an application programming interface (API), or other software tools configured to provide software (e.g., as downloaded to a user device), run software remotely (e.g., hosting applications accessed by user devices), or otherwise provide application support to applications of user device 550. In some embodiments, information from network device 520 is provided to user device 550 using a client/server approach. For example, user device 550 may pull information from a server, or a server may push information to user device 550. In some embodiments, an application client residing on user device 550 may initiate sessions with network device 520 to obtain information when needed (e.g., when data is out of date or when the user device receives a request from the user to receive data). In some embodiments, the information may include user information (e.g., user profile information, user-created content). For example, user information may include current and/or historical user activity information, such as what content transactions the user engages in, searches the user has performed, content the user has consumed, whether the user interacts with a social network, any other suitable information, or any combination thereof. In some embodiments, the user information may identify patterns of a given user over a period of time. As illustrated, network device 520 includes entity information for a plurality of entities. Entity information 521, 522, and 523 include metadata for the respective entities. Entities for which metadata is stored in network device 520 may be linked to each other, may reference each other, may be described by one or more tags in the metadata, or a combination thereof.
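
The pull and push behavior described above, including refreshing data once it is out of date or when the user asks for it, can be sketched as a small client-side cache. This is only an illustration under stated assumptions: the class name `CachedEntityInfo`, the `max_age_s` staleness threshold, and the injectable `fetch` and `clock` callables are hypothetical and do not come from the disclosure.

```python
import time
from typing import Callable, Optional


class CachedEntityInfo:
    """Client-side cache that pulls from the server when its copy is stale."""

    def __init__(
        self,
        fetch: Callable[[], dict],
        max_age_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch          # hypothetical call that pulls from the server
        self._max_age_s = max_age_s  # assumed staleness threshold, in seconds
        self._clock = clock
        self._data: Optional[dict] = None
        self._fetched_at = 0.0

    def is_stale(self) -> bool:
        """True if nothing is cached yet or the cached copy is too old."""
        return self._data is None or self._clock() - self._fetched_at > self._max_age_s

    def get(self, force: bool = False) -> dict:
        """Return cached data, pulling a fresh copy if stale or if the user asks."""
        if force or self.is_stale():
            self._data = self._fetch()
            self._fetched_at = self._clock()
        return self._data

    def push(self, data: dict) -> None:
        """Accept data pushed by the server without waiting for a pull."""
        self._data = data
        self._fetched_at = self._clock()
```

Passing `force=True` models the case where the user device receives an explicit request from the user to refresh, while `push` models the server-initiated direction.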

In some embodiments, an application may be implemented on user device 550, network device 520, or both. For example, the application may be implemented as software or a set of executable instructions, which may be stored in storage of user device 550, network device 520, or both, and executed by control circuitry of the respective device. In some embodiments, an application may include an audio recording application, a speech-to-text application, a text-to-speech application, a voice-recognition application, or a combination thereof, implemented as a client/server-based application in which only a client application resides on user device 550 and a server application resides on a remote server (e.g., network device 520). For example, an application may be implemented partially as a client application on user device 550 (e.g., by control circuitry of user device 550) and partially on a remote server as a server application running on control circuitry of the remote server (e.g., control circuitry of network device 520). When executed by control circuitry of the remote server, the application may instruct the control circuitry to generate a display and transmit the generated display to user device 550. The server application may instruct the control circuitry of the remote device to transmit data for storage on user device 550. The client application may instruct control circuitry of the receiving user device to generate the application displays.

In some embodiments, the arrangement of system 500 is a cloud-based arrangement. The cloud provides access to services such as information storage, searching, messaging, or social networking services, among other examples, as well as access to any content described above for user devices. Services can be provided in the cloud through cloud-computing service providers or through other providers of online services. For example, the cloud-based services can include a storage service, a sharing site, a social networking site, a search engine, or other services via which user-sourced content is distributed for viewing by others on connected devices. These cloud-based services may allow a user device to store information to the cloud and to receive information from the cloud rather than storing information locally and accessing locally stored information. Cloud resources may be accessed by a user device using, for example, a web browser, a messaging application, a social media application, a desktop application, or a mobile application, and may include an audio recording application, a speech-to-text application, a text-to-speech application, a voice-recognition application, and/or any combination of such access applications. User device 550 may be a cloud client that relies on cloud computing for application delivery, or user device 550 may have some functionality without access to cloud resources. For example, some applications running on user device 550 may be cloud applications (e.g., applications delivered as a service over the Internet), while other applications may be stored and run on user device 550. In some embodiments, user device 550 may receive information from multiple cloud resources simultaneously.

In an illustrative example, a user may speak a voice query to user device 550. The voice query is recorded by an audio interface of user device 550, sampled and digitized by application 560, and converted to a text query by application 560. Application 560 may also include pronunciation information along with the text query. For example, one or more words of the text query may be represented by phonemic symbols rather than the proper spelling. In a further example, pronunciation metadata, including a phonemic representation of one or more words of the text query, may be stored with the text query. In some embodiments, application 560 transmits the text query and any suitable pronunciation information to network device 520 for searching among a database of entities, content, metadata, or a combination thereof. Network device 520 may identify an entity associated with the text query, content associated with the text query, or both, and provide that information to user device 550.
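
The two ways of carrying pronunciation information described above (phonemic symbols inline in the query text, or separate metadata stored with the query) can be sketched with a small data structure. This is a hedged illustration only: the `TextQuery` class, its field names, and the JSON payload layout are hypothetical, and the "loo-ee" respelling is used simply as sample data.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TextQuery:
    """Hypothetical container for a text query plus pronunciation information."""
    text: str
    # Maps a word in `text` to a phonemic respelling, e.g. {"Louis": "loo-ee"}.
    pronunciations: dict[str, str] = field(default_factory=dict)

    def inline(self) -> str:
        """Variant 1: replace each annotated word with its phonemic form."""
        return " ".join(self.pronunciations.get(w, w) for w in self.text.split())

    def to_payload(self) -> str:
        """Variant 2: keep the spelling and send pronunciation as metadata."""
        return json.dumps(asdict(self), sort_keys=True)
```

For instance, `TextQuery("interview with Louis", {"Louis": "loo-ee"}).inline()` yields `"interview with loo-ee"`, while `to_payload()` keeps the original spelling and carries the respelling alongside it. A query with no ambiguous words would simply leave `pronunciations` empty.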

For example, a user may speak "Show me Tom Cruise movies" into a microphone of user device 550. Application 560 may generate the text query "Tom Cruise movies" and transmit the text query to network device 520. Network device 520 may identify the entity "Tom Cruise" and then identify movies linked to that entity. Network device 520 may then transmit content (e.g., video files, trailers, or clips), content identifiers (e.g., movie titles and images), content addresses (e.g., URLs, websites, or IP addresses), any other suitable information, or any combination thereof to user device 550. Because the pronunciations of "Tom" and "Cruise" are generally unambiguous, application 560 need not generate pronunciation information in this circumstance.

In a further example, a user may speak "Show me the interview with Louis" into a microphone of user device 550, pronouncing the name Louis as "loo-ee" rather than "loo-ihs." In some embodiments, application 560 may generate the text query "interview with Louis" and transmit it to network device 520 along with metadata that includes the phonemic representation "loo-ee." In some embodiments, application 560 may generate the text query "interview with Loo-ee" and transmit it to network device 520, so that the text query itself includes the pronunciation information (in this example, a phonemic representation). Because the name Louis is common, there may be many entities that include this identifier. In some embodiments, network device 520 may identify entities whose metadata includes a pronunciation tag having "loo-ee" as a phonemic representation. In some embodiments, network device 520 may retrieve trending searches, the user's search history, or other contextual information to identify the entity to which the user most likely refers. For example, the user may have previously searched for "FBI," and the entity Louis Freeh (e.g., a former director of the FBI) may have metadata that includes a tag for "FBI." Once the entity is identified, network device 520 may then transmit content (e.g., video files or clips of an interview), content identifiers (e.g., file titles and still images from an interview), content addresses (e.g., a URL, website, or IP address for streaming one or more video files of an interview), any other suitable information related to Louis Freeh, or any combination thereof to user device 550. Because the pronunciation of "Louis" may be ambiguous, application 560 may generate pronunciation information in such circumstances.
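The Louis example can be sketched in Python as a two-stage lookup: first filter candidates by pronunciation tag, then break ties with the user's search history. This is a minimal sketch only. The dictionary layout, the function name `identify_entity`, the two "Placeholder" entities, and the specific tag values are illustrative assumptions and are not taken from the disclosure.

```python
def identify_entity(name, pronunciation, entities, search_history=()):
    """Return the best-matching entity dict, or None if nothing matches."""
    # Stage 1: keep entities whose name contains the keyword and whose
    # pronunciation tags include the pronunciation heard in the query.
    candidates = [
        e for e in entities
        if name.lower() in e["name"].lower()
        and pronunciation in e["pronunciation_tags"]
    ]
    if not candidates:
        return None
    # Stage 2: prefer the candidate whose tags overlap most with the
    # user's earlier searches (contextual information).
    history = {term.lower() for term in search_history}
    return max(
        candidates,
        key=lambda e: len(history & {t.lower() for t in e["tags"]}),
    )


# Hypothetical entity records; only "Louis Freeh" and the "FBI" tag
# come from the text, the rest are placeholders.
ENTITIES = [
    {"name": "Louis Freeh", "pronunciation_tags": ["loo-ee"], "tags": ["FBI"]},
    {"name": "Louis Placeholder-One", "pronunciation_tags": ["loo-ee"], "tags": ["music"]},
    {"name": "Louis Placeholder-Two", "pronunciation_tags": ["loo-ihs"], "tags": ["FBI"]},
]

match = identify_entity("Louis", "loo-ee", ENTITIES, search_history=["FBI"])
```

Filtering on pronunciation before consulting context keeps the two signals separate, so either one could be replaced (for example, by trending searches) without changing the other.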

In an illustrative example, a user may speak "William Djoko" into a microphone of user device 550. Application 560 may generate a text query that does not correspond to the correct spelling of the entity. For example, the voice query "William Djoko" may be converted to text as "William gjoka." This incorrect text conversion may make it difficult to identify the correct entity. In some embodiments, metadata associated with the entity William Djoko includes alternative representations based on pronunciation. The metadata for the entity "William Djoko" may include pronunciation tags (e.g., "related phrases"), as shown in Table 1.

Although the text query may include an incorrect spelling, the correct entity may still be identified because the metadata associated with the correct entity includes the variations. Accordingly, network device 520 may hold entity information that includes the alternative representations and may therefore identify the correct entity in response to a text query that includes the phrase "William gjoka." Once the entity is identified, network device 520 may then transmit content (e.g., audio or video file clips), content identifiers (e.g., song or album titles and still images from a concert), content addresses (e.g., a URL, website, or IP address for streaming one or more audio files of music), any other suitable information related to William Djoko, or any combination thereof to user device 550. Because the name "Djoko" may be converted incorrectly from speech, application 560 may generate pronunciation information for storage in the metadata so that the correct entity can be identified in such circumstances.
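One way to picture the alternative-representation lookup is an index that maps every stored variant back to its entity. The following Python sketch assumes a `related_phrases` metadata field (named after the "related phrases" tag mentioned above). The variant "William joko" and all function names are hypothetical, and the contents of Table 1 are not reproduced here.

```python
def normalize(text):
    """Lowercase and collapse whitespace so lookups ignore casing and spacing."""
    return " ".join(text.lower().split())


def build_alias_index(entity_metadata):
    """Map each entity name and each stored variant to the entity name."""
    index = {}
    for name, meta in entity_metadata.items():
        for phrase in [name, *meta.get("related_phrases", [])]:
            index[normalize(phrase)] = name
    return index


# Hypothetical metadata; "William gjoka" is the misrecognition from the
# text, "William joko" is an invented extra variant for illustration.
ENTITY_METADATA = {
    "William Djoko": {"related_phrases": ["William gjoka", "William joko"]},
}
ALIAS_INDEX = build_alias_index(ENTITY_METADATA)


def resolve_entity(text_query):
    """Return the entity name for a (possibly misspelled) query, else None."""
    return ALIAS_INDEX.get(normalize(text_query))
```

Because the variants are indexed ahead of time, resolving a misrecognized query is a single dictionary lookup and needs no pronunciation analysis at query time.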

In the illustrative example above, the reachability of the entity William Djoko is improved by storing alternative representations, particularly because the ASR process may produce a grammatically incorrect text conversion of the entity name.

In an illustrative example, metadata may be generated based on pronunciation for later reference (e.g., by a text query or another search and retrieval process) rather than in response to a user's voice query. In some embodiments, network device 520, user device 550, or both may generate metadata based on pronunciation information. For example, user device 550 may receive user input of an alternative representation of an entity (e.g., based on previous search results or speech-to-text conversions). In some embodiments, network device 520, user device 550, or both may automatically generate metadata for an entity using a text-to-speech module and a speech-to-text module. For example, application 560 may identify a textual representation of an entity (e.g., a text string of the entity's name), input the textual representation to the text-to-speech module, and generate an audio file. In some embodiments, the text-to-speech module includes one or more settings or criteria with which the audio file is generated. For example, the settings or criteria may include language (e.g., English, Spanish, Mandarin), accent (e.g., regional or language-based), voice (e.g., a particular person's voice, a male voice, a female voice), speed (e.g., playback time of the relevant portion of the audio file), pronunciation (e.g., for multiple phonemic variations), any other suitable setting or criterion, or any combination thereof. Application 560 then inputs the audio file to the speech-to-text module to generate a resulting textual representation. If the resulting textual representation is not identical to the original textual representation, application 560 may store the resulting textual representation in metadata associated with the entity. In some embodiments, application 560 may repeat this process for various settings or criteria, thereby generating various textual representations that may be stored in the metadata. The resulting metadata includes the original textual representation along with variations generated by text-to-speech-to-text conversion in order to anticipate likely variations. Accordingly, when application 560 receives a voice query from a user and the conversion to text does not exactly match an entity identifier, application 560 may still identify the correct entity. Further, application 560 need not analyze the text query for pronunciation information because the metadata already includes the variations (e.g., the analysis is performed in advance rather than in real time).
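The text-to-speech-to-text round trip described above can be sketched as a loop over setting combinations. In this Python sketch the speech modules are passed in as callables, since the disclosure does not name a particular engine. `fake_synthesize` and `fake_transcribe` are stand-ins that simulate a misrecognition for one accent, and every name and setting value here is an assumption made for illustration.

```python
from itertools import product


def generate_variants(name, synthesize, transcribe, settings_grid):
    """Run text -> speech -> text for each settings combination and
    collect transcriptions that differ from the original name."""
    variants = []
    keys = sorted(settings_grid)
    for values in product(*(settings_grid[k] for k in keys)):
        settings = dict(zip(keys, values))
        audio = synthesize(name, **settings)   # text-to-speech step
        text = transcribe(audio)               # speech-to-text step
        # Store only results that differ from the original, once each.
        if text != name and text not in variants:
            variants.append(text)
    return {"name": name, "related_phrases": variants}


def fake_synthesize(text, accent, speed):
    """Stand-in for a real text-to-speech module (hypothetical)."""
    return (text, accent, speed)


def fake_transcribe(audio):
    """Stand-in for a real speech-to-text module: mishears one accent."""
    text, accent, _speed = audio
    return text.replace("Djoko", "gjoka") if accent == "accent-b" else text


metadata = generate_variants(
    "William Djoko",
    fake_synthesize,
    fake_transcribe,
    {"accent": ["accent-a", "accent-b"], "speed": [1.0, 1.5]},
)
```

With real modules substituted for the stand-ins, the returned `related_phrases` list would be what gets stored in the entity's metadata.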

Application 560 may include any suitable functionality such as, for example, audio recording, speech recognition, speech-to-text conversion, text-to-speech conversion, query generation, search engine functionality, content retrieval, display generation, content presentation, metadata generation, database functionality, or a combination thereof. In some embodiments, aspects of application 560 are implemented across two or more devices. In some embodiments, application 560 is implemented on a single device. For example, entity information 521, 522, and 523 may be stored in memory storage of user device 550 and accessed by application 560.

FIG. 6 shows a flowchart of illustrative process 600 for responding to a voice query based on pronunciation information, in accordance with some embodiments of the present disclosure. For example, a query application may perform process 600, implemented on any suitable hardware such as user device 400 of FIG. 4, user equipment system 401 of FIG. 4, user device 550 of FIG. 5, network device 520 of FIG. 5, any other suitable device, or any combination thereof. In a further example, the query application may be an instance of application 560 of FIG. 5.

At step 602, the query application receives a voice query. In some embodiments, an audio interface (e.g., audio device 414, user input interface 410, or a combination thereof) may include a microphone or other sensor that receives audio input and generates an electronic signal. In some embodiments, the audio input is received at an analog sensor, which provides an analog signal that is conditioned, sampled, and digitized to generate an audio file. The audio file may then be analyzed by the query application at steps 604 and 606. In some embodiments, the audio file is stored in memory (e.g., storage device 408). In some embodiments, the query application includes a user interface (e.g., user input interface 410) that allows a user to record, play back, alter, crop, visualize, or otherwise manage audio recordings. For example, in some embodiments, the audio interface is configured to receive audio input at all times. In a further example, in some embodiments, the audio interface is configured to receive audio input when a user provides an indication to the user input interface (e.g., by selecting a soft button on a touchscreen to begin audio recording). In a further example, in some embodiments, the audio interface is configured to receive audio input and to begin recording when speech or another suitable audio signal is detected. The query application may include any suitable conditioning software or hardware for converting audio input to a stored audio file. For example, the query application may apply one or more filters (e.g., low-pass, high-pass, notch, or band-pass filters), amplifiers, decimators, or other conditioning to generate the audio file. In a further example, the query application may apply any suitable processing to the conditioned signal to generate the audio file, such as compression, transformation (e.g., spectral transformation or wavelet transformation), normalization, equalization, truncation (e.g., in the time or spectral domain), any other suitable processing, or any combination thereof. In some embodiments, at step 602, the control circuit receives an audio file from a separate application, from a separate module of the query application, based on a user input, or any combination thereof. For example, at step 602, the control circuit may receive the voice query as an audio file stored in storage (e.g., storage device 408) for further processing (e.g., steps 604-612 of process 600).
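Two of the conditioning ideas mentioned for step 602, starting a recording when speech is detected and normalizing the result, can be sketched in plain Python. The frame size, the RMS energy threshold, and the function names below are assumptions chosen for illustration. A practical system would likely use a more robust speech detector than a fixed energy threshold.

```python
import math


def rms(frame):
    """Root-mean-square energy of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def record_when_speech(samples, frame_size=4, threshold=0.1):
    """Truncate in the time domain: keep everything from the first frame
    whose energy reaches the threshold through the last such frame."""
    frames = [samples[i:i + frame_size] for i in range(0, len(samples), frame_size)]
    active = [i for i, frame in enumerate(frames) if rms(frame) >= threshold]
    if not active:
        return []
    kept = frames[active[0]:active[-1] + 1]
    return [s for frame in kept for s in frame]


def normalize_peak(samples, peak=1.0):
    """Scale samples so the largest absolute value equals `peak`."""
    top = max((abs(s) for s in samples), default=0.0)
    if top == 0.0:
        return list(samples)
    return [s * peak / top for s in samples]


# Silence, a short burst of "speech", then silence again.
signal = [0.0] * 4 + [0.5, -0.25, 0.5, -0.5] + [0.0] * 4
clip = record_when_speech(signal)
conditioned = normalize_peak(clip)
```

Here `clip` contains only the four non-silent samples and `conditioned` rescales them so that the peak magnitude is 1.0.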

At step 604, the query application extracts one or more keywords from the voice query of step 602. In some embodiments, the one or more keywords may represent the full voice query. In some embodiments, the one or more keywords include only important words or parts of speech. For example, in some embodiments, the query application may identify words in the speech and select some of those words as keywords. For example, the query application may identify words and, among those words, select words that are not prepositions. In a further example, the query application may identify as keywords only words that are at least three characters long. In a further example, the query application may identify a keyword as a phrase of two or more words (e.g., to be more descriptive and provide more context), which may help narrow the potential search field of relevant content. In some embodiments, the query application identifies keywords such as, for example, words, phrases, names, places, channels, media asset titles, or other keywords, using any suitable criteria for identifying keywords from an audio input. The query application may process words using any suitable word detection technique, speech detection technique, pattern recognition technique, signal processing technique, or any combination thereof. For example, the query application may compare a series of signal templates to a portion of an audio signal to determine whether a match exists (e.g., whether a particular word is included in the audio signal). In a further example, the query application may apply a learning technique to better recognize words in voice queries. For example, the query application may gather feedback from a user on a plurality of requested content items in the context of a plurality of queries, and accordingly use past data as a training set for making recommendations and retrieving content. In some embodiments, the query application may store snippets (i.e., clips of short duration) of recorded audio during detected speech and process the snippets. In some embodiments, the query application stores relatively large segments of speech (e.g., more than 10 seconds) as an audio file and processes the file. In some embodiments, the query application may process speech to detect words by using a continuous computation. For example, a wavelet transform may be performed on speech in real time, providing a continuous computation of speech patterns, even if slightly delayed (e.g., which may be compared to a reference to identify words). In some embodiments, the query application may detect words, as well as the user who spoke the words (e.g., voice recognition), in accordance with the present disclosure.
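Two of the keyword-selection rules from step 604, dropping prepositions and keeping only words of at least three characters, can be sketched in Python as follows. The sketch assumes the speech has already been recognized into a transcript string. The preposition list is a small illustrative subset, and the function name is hypothetical.

```python
# A small, non-exhaustive preposition list used only for illustration.
PREPOSITIONS = {
    "about", "at", "by", "for", "from", "in", "into",
    "of", "on", "onto", "over", "to", "under", "with",
}


def extract_keywords(transcript, min_length=3):
    """Select words that are not prepositions and are long enough."""
    words = [w.strip(".,!?\"'") for w in transcript.split()]
    return [
        w for w in words
        if len(w) >= min_length and w.lower() not in PREPOSITIONS
    ]


keywords = extract_keywords("Show me an interview with Louis")
# keywords == ["Show", "interview", "Louis"]
```

These two rules alone keep "Show", which the text query in the earlier Louis example omits, so a fuller implementation would presumably apply further criteria for deciding which words are important.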

In some embodiments, at step 604, the query application adds detected words to a list of words detected in the query. In some embodiments, the query application may store these detected words in memory. For example, the query application may store words in memory as a collection of ASCII characters (i.e., 8-bit code), a pattern (e.g., indicating a speech signal reference used to match the word), an identifier (e.g., a code for the word), a string, any other data type, or any combination thereof. In some embodiments, the media guidance application may add words to memory as they are detected. For example, the media guidance application may append a newly detected word to a string of previously detected words, add a newly detected word to a cell array of previously detected words (e.g., increasing the cell array size by one), create a new variable corresponding to the newly detected word, create a new file corresponding to the newly detected word, or otherwise store one or more words detected at step 604.

At step 606, the query application determines pronunciation information for the one or more keywords of step 604. In some embodiments, the pronunciation information includes a phonemic representation of the one or more keywords (e.g., using the International Phonetic Alphabet). In some embodiments, the pronunciation information includes one or more alternative spellings of the one or more keywords that incorporate the pronunciation. In some embodiments, at step 606, the control circuit generates metadata, associated with the text query, that includes the phonemic representation.

At step 608, the query application generates a text query based on the one or more keywords of step 604 and the pronunciation information of step 606. The query application may generate the text query by arranging the one or more keywords in a suitable order (e.g., the order in which they were spoken). In some embodiments, the query application may omit one or more words of the voice query (e.g., short words, prepositions, or any other words determined to be relatively less important). The text query may be generated as a file (e.g., a text file) and stored in suitable storage (e.g., storage device 408).
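Steps 606 and 608 describe two ways of carrying pronunciation information: as metadata stored alongside the text query, or embedded in the query text itself. The Python sketch below shows both under stated assumptions. The dictionary shape, the `inline` flag, and the function name are hypothetical and are not part of the disclosure.

```python
def build_text_query(keywords, pronunciations, inline=False):
    """Assemble a text query from keywords in spoken order.

    pronunciations maps a keyword to its phonemic representation.
    With inline=False the pronunciation travels as metadata; with
    inline=True it replaces the keyword's spelling in the query text.
    """
    if inline:
        words = [pronunciations.get(word, word) for word in keywords]
        return {"text": " ".join(words), "metadata": {}}
    known = {w: p for w, p in pronunciations.items() if w in keywords}
    return {"text": " ".join(keywords), "metadata": {"pronunciation": known}}


keywords = ["interview", "with", "Louis"]
with_metadata = build_text_query(keywords, {"Louis": "loo-ee"})
with_inline = build_text_query(keywords, {"Louis": "loo-ee"}, inline=True)
```

Keeping the spelling and the pronunciation in separate fields lets a search backend that ignores pronunciation still use the plain text, while the inline form needs no metadata channel at all.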

At step 610, the query application identifies an entity among a plurality of entities of a database based on the text query and on metadata stored for the entity. In some embodiments, the metadata includes a pronunciation tag. In some embodiments, the query application may identify the entity by identifying a metadata tag of a content item that corresponds to the entity. For example, a content item may include a movie having a tag for an actor in the movie. If the text query includes the actor, the query application may determine a match and, based on the match, identify the entity as being associated with the content item. To illustrate, the query application may first identify the entity (e.g., search among entities) and then retrieve content associated with the entity, or the query application may first identify content (e.g., search among content) and determine whether the entity associated with the content matches the text query. A database arranged by entity, by content, or both may be searched by the query application.

In some embodiments, the query application identifies the entity based on user profile information. For example, the query application may identify the entity based on entities already identified from previous voice queries. In a further example, the query application may identify the entity based on popularity information associated with the entity (e.g., based on searches by a plurality of users). In some embodiments, the query application identifies the entity based on a user's preferences. For example, if one or more keywords match a preferred entity name or identifier in the user profile information, the query application may identify that entity or weight that entity more heavily.

In some embodiments, the query application identifies the entity by identifying the plurality of entities (e.g., with metadata stored for each entity), determining a respective score for each respective entity of the plurality of entities based on comparing the respective pronunciation tag with the text query, and selecting the entity by determining a maximum score. The score may be based on how many matches are identified between keywords of the text query and metadata associated with the entity or content item.
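The score-and-select approach can be sketched in Python as a match count per entity followed by a maximum. The optional bonus for preferred entities reflects the user-preference weighting described for user profile information. The entity records, field names, and weight value are placeholders invented for this sketch.

```python
def score_entity(query_terms, entity, preferred=frozenset(), preference_weight=1.0):
    """Count query terms that match the entity's tags or pronunciation
    tags, plus an optional bonus if the user prefers this entity."""
    tags = {t.lower() for t in entity["tags"]}
    tags |= {t.lower() for t in entity["pronunciation_tags"]}
    matches = sum(1 for term in query_terms if term.lower() in tags)
    bonus = preference_weight if entity["name"] in preferred else 0.0
    return matches + bonus


def select_entity(query_terms, entities, **scoring_options):
    """Return the entity with the maximum score."""
    return max(
        entities,
        key=lambda e: score_entity(query_terms, e, **scoring_options),
    )


# Hypothetical entity records for illustration only.
ENTITY_A = {"name": "Entity A", "tags": ["interview", "fbi"], "pronunciation_tags": ["loo-ee"]}
ENTITY_B = {"name": "Entity B", "tags": ["interview", "music"], "pronunciation_tags": ["loo-ihs"]}

best = select_entity(["interview", "loo-ee"], [ENTITY_A, ENTITY_B])
```

In this example Entity A scores 2 (one tag match and one pronunciation match) against 1 for Entity B, so Entity A is selected. Passing `preferred={"Entity B"}` with `preference_weight=2.0` would raise Entity B to 3 and reverse the choice.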

In some embodiments, the query application identifies more than one entity (e.g., with associated metadata) among the plurality of entities based on the text query. The query application may identify a content item that is associated with some or all of the entities of the query. In some embodiments, the query application identifies the entity by comparing at least a portion of the text query to tags of the metadata stored for each entity to identify a match.

At step 612, the query application retrieves a content item associated with the entity. In some embodiments, the query application identifies the content item, downloads the content item, streams the content item, generates the content item for display, or a combination thereof. For example, a voice query may include "Show me recent Tom Cruise movies," and the query application may provide a link to the movie "Mission Impossible: Fallout" that the user can select to view the video content. In some embodiments, the query application may retrieve a plurality of content items associated with entities that match the text query. For example, the query application may retrieve a plurality of links, video files, audio files, or other content, or a list of identified content items, in accordance with the present disclosure.

FIG. 7 shows a flowchart of illustrative process 700 for responding to a voice query based on alternative representations, in accordance with some embodiments of the present disclosure. For example, a query application may perform process 700, implemented on any suitable hardware such as user device 400 of FIG. 4, user equipment system 401 of FIG. 4, user device 550 of FIG. 5, network device 520 of FIG. 5, any other suitable device, or any combination thereof. In a further example, the query application may be an instance of application 560 of FIG. 5.

At step 702, the query application receives a voice query. In some embodiments, an audio interface (e.g., audio device 414, user input interface 410, or a combination thereof) may include a microphone or other sensor that receives audio input and generates an electronic signal. In some embodiments, the audio input is received at an analog sensor, which provides an analog signal that is conditioned, sampled, and digitized to generate an audio file. The audio file may then be analyzed by the query application at step 704. In some embodiments, the audio file is stored in memory (e.g., storage device 408). In some embodiments, the query application includes a user interface (e.g., user input interface 410) that allows a user to record, play back, alter, crop, visualize, or otherwise manage audio recordings. For example, in some embodiments, the audio interface is configured to receive audio input at all times. In a further example, in some embodiments, the audio interface is configured to receive audio input when a user provides an indication to the user interface (e.g., by selecting a soft button on a touchscreen to begin audio recording). In a further example, in some embodiments, the audio interface is configured to receive audio input and to begin recording when speech or another suitable audio signal is detected. The query application may include any suitable conditioning software or hardware for converting audio input to a stored audio file. For example, the query application may apply one or more filters (e.g., low-pass, high-pass, notch, or band-pass filters), amplifiers, decimators, or other conditioning to generate the audio file. In a further example, the query application may apply any suitable processing to the conditioned signal to generate the audio file, such as compression, transformation (e.g., spectral transformation or wavelet transformation), normalization, equalization, truncation (e.g., in the time or spectral domain), any other suitable processing, or any combination thereof. In some embodiments, at step 702, the control circuit receives an audio file from a separate application, from a separate module of the query application, based on a user input, or any combination thereof. For example, step 702 may include receiving the voice query as an audio file stored in storage (e.g., storage device 408) for further processing (e.g., steps 704-710 of process 700).

At step 704, the query application extracts one or more keywords from the voice query of step 702. In some embodiments, the one or more keywords may represent the full voice query. In some embodiments, the one or more keywords include only important words or parts of the speech. For example, in some embodiments, the query application may identify words in speech and select some of those words as keywords. For example, the query application may identify words and, among those words, select words that are not prepositions. In a further example, the query application may identify as keywords only words that are at least three characters long. In a further example, the query application may identify keywords as phrases including two or more words (e.g., to be more descriptive and provide more context), which may help narrow the potential search field of relevant content. In some embodiments, the query application identifies keywords such as words, phrases, names, places, channels, media asset titles, or other keywords, using any suitable criteria for identifying keywords from the audio input. The query application may process words using any suitable word detection technique, speech detection technique, pattern recognition technique, signal processing technique, or any combination thereof. For example, the query application may compare a series of signal templates to a portion of an audio signal to find whether a match exists (e.g., whether a particular word is included in the audio signal). In a further example, the query application may apply a learning technique to better recognize words in voice queries. For example, the query application may gather feedback from a user on a plurality of requested content items in the context of a plurality of queries, and accordingly use past data as a training set for making recommendations and retrieving content. In some embodiments, the query application may store snippets (i.e., clips of short duration) of recorded audio during detected speech and process the snippets. In some embodiments, the query application stores relatively large segments of speech (e.g., more than 10 seconds) as an audio file and processes the file. In some embodiments, the query application may process speech to detect words by using a continuous computation. For example, a wavelet transform may be performed on speech in real time, providing a continuous, if slightly time-lagged, computation of speech patterns (e.g., which could be compared to a reference to identify words). In some embodiments, the query application may detect words, as well as which user uttered the words (e.g., voice recognition), in accordance with the present disclosure.
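The two simplest selection rules mentioned for step 704 (dropping prepositions and keeping only words of at least three characters) can be sketched as follows. This is a minimal illustration that assumes the words have already been recognized as text; the `PREPOSITIONS` set, the `extract_keywords` name, and the sample query are hypothetical and are not taken from the disclosure.

```python
# Hypothetical stop list; the disclosure only states that prepositions may be excluded.
PREPOSITIONS = {"of", "in", "on", "at", "by", "for", "to", "from", "with", "about"}

def extract_keywords(words, min_length=3):
    """Keep words that are not prepositions and have at least min_length characters."""
    keywords = []
    for word in words:
        normalized = word.lower()
        if normalized in PREPOSITIONS:
            continue  # rule 1: skip prepositions
        if len(normalized) < min_length:
            continue  # rule 2: skip very short words
        keywords.append(word)  # spoken order is preserved
    return keywords

recognized = ["Show", "me", "movies", "by", "Tom", "Cruise"]
print(extract_keywords(recognized))
```

Because the function preserves spoken order, its output could feed directly into a text query arranged in the order spoken.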

In some embodiments, at step 704, the query application adds detected words to a list of words detected in the query. In some embodiments, the query application may store these detected words in memory. For example, the query application may store words in memory as a collection of ASCII characters (i.e., 8-bit code), a pattern (e.g., indicating the speech signal reference used to match the word), an identifier (e.g., a code for a word), a string, any other data type, or any combination thereof. In some embodiments, the media guidance application may add words to memory as they are detected. For example, the media guidance application may append a newly detected word to a string of previously detected words, add a newly detected word to a cell array of previously detected words (e.g., increasing the cell array size by one), create a new variant corresponding to the newly detected word, create a new file corresponding to the newly created word, or otherwise store one or more words detected at step 704.

At step 706, the query application generates a text query based on the one or more keywords of step 704. The query application may generate the text query by arranging the one or more keywords in a suitable order (e.g., in the order spoken). In some embodiments, the query application may omit one or more words of the voice query (e.g., short words, prepositions, or any other words determined to be relatively less important). The text query may be generated as a file (e.g., a text file) and stored in suitable storage (e.g., storage device 408).

At step 708, the query application identifies an entity based on the text query and metadata for the entity. The metadata includes alternate text representations of the entity based on pronunciation. In some embodiments, the query application may identify the entity by identifying a metadata tag of a content item that corresponds to an alternate representation of the entity. For example, a content item may include a movie having a tag for an actor in the movie, wherein the tag includes an alternate spelling (e.g., derived from a system such as system 300, or otherwise included in the metadata). If the text query includes the actor, the query application may determine a match and, based on the match, identify the entity as being associated with the content item. To illustrate, the query application may identify the entity first (e.g., search among entities) and then retrieve content associated with the entity, or the query application may identify content first (e.g., search among content) and determine whether the entity associated with the content matches the text query. Databases that are arranged by entity, by content, or both may be searched by the query application. The query application may determine a match when one or more words of the text query match an alternate representation of an entity (e.g., as stored in metadata associated with the entity).
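As a rough sketch of the matching described for step 708, the following compares a text query against each entity's name and its stored alternate representations. The record layout, the `find_matching_entities` name, and the alternate spellings are invented for illustration; the disclosure does not specify how the metadata is structured.

```python
# Hypothetical entity records; the alternate spellings are invented for illustration.
ENTITIES = [
    {"name": "Tom Cruise", "alternate_representations": ["Tom Cruz", "Tom Crews"]},
    {"name": "Tom Hanks", "alternate_representations": ["Tom Hanx"]},
]

def find_matching_entities(text_query, entities):
    """Return names of entities whose name or alternate representation appears in the query."""
    query = text_query.lower()
    matches = []
    for entity in entities:
        candidates = [entity["name"]] + entity["alternate_representations"]
        # A match is declared when any representation occurs in the query text.
        if any(candidate.lower() in query for candidate in candidates):
            matches.append(entity["name"])
    return matches

print(find_matching_entities("show me tom cruz movies", ENTITIES))
```

Simple substring matching is used here only to keep the sketch short; any suitable search technique could take its place.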

In some embodiments, the query application identifies the entity based on user profile information. For example, the query application may identify the entity based on a previously identified entity from a previous voice query. In a further example, the query application may identify the entity based on popularity information associated with the entity (e.g., based on searches by a plurality of users). In some embodiments, the query application identifies the entity based on a user's preferences. For example, if one or more keywords match an alternate representation of a preferred entity name or identifier in the user profile information, the query application may identify that entity or weight that entity more heavily.

In some embodiments, the query application identifies the entity by identifying a plurality of entities (e.g., with metadata stored for each entity), determining a respective score for each respective entity of the plurality of entities based on comparing the respective metadata with the text query, and selecting the entity by determining a maximum score. The score may be based on the number of matches identified between keywords of the text query and metadata associated with the entity or content item.
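The score-and-select approach can be sketched as below, counting keyword matches against each entity's metadata tags and returning the entity with the maximum score. The optional `preferred` and `boost` parameters illustrate one possible way to weight a user-preferred entity more heavily; they, the function names, and the sample tags are assumptions, not details given in the disclosure.

```python
def score_entity(keywords, metadata_tags):
    """Count how many query keywords match a metadata tag (case-insensitive)."""
    tags = {tag.lower() for tag in metadata_tags}
    return sum(1 for keyword in keywords if keyword.lower() in tags)

def select_entity(keywords, entities, preferred=frozenset(), boost=1):
    """Return the name of the entity with the maximum score, or None if nothing matches."""
    best_name, best_score = None, 0
    for name, tags in entities.items():
        score = score_entity(keywords, tags)
        if score > 0 and name in preferred:
            score += boost  # assumed weighting for entities the user prefers
        if score > best_score:
            best_name, best_score = name, score
    return best_name

# Hypothetical metadata; "cruz" stands in for a pronunciation-based alternate spelling.
entities = {"Tom Cruise": ["tom", "cruise", "cruz"], "Tom Hanks": ["tom", "hanks"]}
print(select_entity(["tom", "cruz", "movies"], entities))
print(select_entity(["tom"], entities, preferred={"Tom Hanks"}))
```

In the second call both entities match one keyword, so the preference boost decides the result; without it, the first entity encountered with the top score would be returned.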

At step 710, the query application retrieves a content item associated with the entity. In some embodiments, the query application identifies the content item, downloads the content item, streams the content item, generates the content item for display, or a combination thereof. For example, a voice query may include "Show me recent Tom Cruise movies," and the query application may provide a link to the movie "Mission Impossible: Fallout" that the user can select to view the video content. In some embodiments, the query application may retrieve a plurality of content items associated with entities matching the text query. For example, the query application may retrieve a plurality of links, video files, audio files, or other content, or a list of identified content items, in accordance with the present disclosure.

FIG. 8 shows a flowchart of illustrative process 800 for generating metadata for an entity based on pronunciation, in accordance with some embodiments of the present disclosure. For example, an application may perform process 800, implemented on any suitable hardware such as user device 400 of FIG. 4, user equipment system 401 of FIG. 4, user device 550 of FIG. 5, network device 520 of FIG. 5, any other suitable device, or any combination thereof. In a further example, the application may be an instance of application 580 of FIG. 5. In a further example, system 300 of FIG. 3 may perform illustrative process 800.

At step 802, the application identifies an entity, of a plurality of entities, for which information is stored. In some embodiments, the application selects the entity based on a predetermined order. For example, the application may select entities in alphabetical order to undergo portions of process 800. In some embodiments, the application identifies the entity when metadata for the entity is created. For example, the application may identify the entity when the entity is added to a database (e.g., a database of entities). In some embodiments, the application identifies the entity when a search operation has misidentified the entity, and accordingly an alternate representation may be desired to prevent further misidentification. In some embodiments, the application identifies the entity based on user input. For example, a user may indicate the entity to the application (e.g., at a suitable user interface) based on incorrect search results, an unreachable entity, or errors observed in search results. In some embodiments, the application need not identify the entity in response to an error in search results or a predetermined order. For example, the application may randomly select an entity of an entity database and proceed to step 804. In some embodiments, the application may identify the entity based on the entity's popularity in search queries. For example, greater search efficacy may be achieved by determining alternate representations for more popular entities, so that more search queries are answered correctly. In a further example, the application may identify less popular, or even obscure, entities to prevent those entities from becoming unreachable, because very few search queries may specify them. The application may apply any suitable criteria to determine which entity to identify. In some embodiments, the application may identify more than one entity at step 802, and accordingly steps 804-810 may be performed for each identified entity. In some embodiments, the application may identify a content item rather than, or in addition to, an entity. For example, the application may identify an entity such as a movie, and then identify all other significant entities associated with that entity to undergo steps 804-810.

At step 804, the application generates an audio file based on a first text string and at least one speech criterion. The first text string describes the entity identified at step 802. For example, as illustrated in FIG. 3, the application may include text-to-speech engine 310, which may be configured to generate the audio file. The application may generate audio that is output from a speaker or other suitable sound-generating device and that may be detected by a microphone or other suitable detection device. The application may apply one or more settings or speech criteria in generating and outputting the audio file. For example, aspects of the generated "voice" may be tuned or otherwise selected based on any suitable criteria. In some embodiments, the at least one speech criterion includes a pronunciation setting (e.g., how one or more syllables, letter groups, or words are pronounced, or which phonemes are to be used). In some embodiments, the at least one speech criterion includes a language setting (e.g., specifying a language, an accent, a regional accent, or other language information).

In an illustrative example including a plurality of speech criteria, the application may generate a respective audio file based on the first text string and each respective speech criterion, generate a respective second text string based on each respective audio file, compare each respective second text string to the first text string, and store each respective second text string that is not identical to the first text string (e.g., in metadata associated with the entity).

In an illustrative example, the application may convert the first text string to a first audio signal, generate speech at a speaker based on the first audio signal, detect the speech using a microphone to generate a second audio signal, and process the second audio signal to generate the audio file. In some embodiments, the application generates the speech at the speaker based on at least one speech setting of a text-to-speech module.

At step 806, the application generates a second text string based on the audio file. The second text string should match the first text string and describe the entity identified at step 802, aside from differences that may arise from the text-to-speech conversion or the speech-to-text conversion. For example, as illustrated in FIG. 3, the application may include speech-to-text engine 320, which may be configured to receive audio input, or a file generated from it, and convert the audio into a transcript (e.g., a text string). The application may receive the audio input at a microphone or other suitable sound-detecting device. The application may apply one or more settings in receiving the audio file, conditioning it, and converting it to text. For example, aspects of conditioning and converting the detected "voice" may be tuned or otherwise selected based on any suitable criteria.

In an illustrative example, the application generates playback of the audio file at a speaker, detects the playback using a microphone to generate an audio signal, and converts the audio signal to the second text string by identifying one or more words. In some embodiments, the application converts the audio signal to the second text string based on at least one text setting of a speech-to-text module.

At step 808, the application compares the second text string to the first text string. In some embodiments, the application compares each character of the first and second text strings to determine a match. In some embodiments, the application determines the extent to which the first text string and the second text string match (e.g., the fraction of the text strings that match, the number of discrepancies present, or the number of keywords that match or do not match). The application may use any suitable technique to determine whether the first and second text strings are identical, similar, or different, and the extent to which they are similar or different.
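One way the comparison of step 808 might be realized is sketched below using Python's standard `difflib.SequenceMatcher` for a similarity ratio, together with simple keyword counts. The choice of `difflib`, the `compare_text_strings` name, and the sample strings are assumptions made for illustration; the disclosure leaves the comparison technique open.

```python
import difflib

def compare_text_strings(first, second):
    """Report whether two strings are identical and the extent to which they match."""
    # Similarity is computed on lowercased text; 1.0 means those forms match exactly.
    ratio = difflib.SequenceMatcher(None, first.lower(), second.lower()).ratio()
    first_words = set(first.lower().split())
    second_words = set(second.lower().split())
    return {
        "identical": first == second,  # exact, character-by-character comparison
        "similarity": ratio,
        "matching_keywords": len(first_words & second_words),
        "mismatched_keywords": len(first_words ^ second_words),
    }

result = compare_text_strings("Tom Cruise", "Tom Cruz")
print(result["identical"], result["matching_keywords"], result["mismatched_keywords"])
```

The `identical` flag alone is enough to decide whether a second text string adds anything new; the other fields only quantify how far apart the two strings are.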

At step 810, the application stores the second text string if it is not identical to the first text string. In some embodiments, the application stores the second text string in metadata associated with the entity. In some embodiments, step 810 includes the application updating existing metadata based on one or more text queries. For example, as queries are responded to and search results are evaluated, the application may update the metadata to reflect what has been newly learned. If the second text string is determined to be identical to the first text string, no new information is gained by storing the second text string. However, an indication of the comparison of step 808 may be stored in the metadata to increase confidence in the reachability of the entity via voice query. For example, if the second text string is identical to the first text string, this may serve to validate the existing metadata for voice-based querying.
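Steps 804 through 810 amount to a round trip through text-to-speech and speech-to-text, repeated for each speech criterion. The sketch below shows that loop with the two engines passed in as callables. The stub engines, the criterion labels, and the sample transcription are hypothetical placeholders so that the example runs without audio hardware; they do not represent engines 310 and 320 themselves.

```python
def generate_alternate_representations(first_text, speech_criteria, text_to_speech, speech_to_text):
    """Round-trip first_text through TTS and STT once per speech criterion."""
    alternates = []
    for criterion in speech_criteria:
        audio = text_to_speech(first_text, criterion)  # step 804: generate audio file
        second_text = speech_to_text(audio)            # step 806: generate second text string
        differs = second_text != first_text            # step 808: compare the strings
        # Step 810: store only non-identical strings (skipping duplicates is an assumption).
        if differs and second_text not in alternates:
            alternates.append(second_text)
    return alternates

# Stub engines so the sketch runs without audio hardware; real engines would replace them.
def fake_text_to_speech(text, criterion):
    return (text, criterion)  # stands in for an audio file

def fake_speech_to_text(audio):
    text, criterion = audio
    # Invented transcription keyed by (text, criterion), for illustration only.
    transcripts = {("Tom Cruise", "accent_b"): "Tom Cruz"}
    return transcripts.get((text, criterion), text)

print(generate_alternate_representations(
    "Tom Cruise", ["accent_a", "accent_b"], fake_text_to_speech, fake_speech_to_text))
```

The returned list would then be written into the metadata associated with the entity, where it can later be matched against text queries.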

FIG. 9 shows a flowchart of illustrative process 900 for retrieving content associated with an entity of a voice query, in accordance with some embodiments of the present disclosure. For example, a query application may perform process 900, implemented on any suitable hardware such as user device 400 of FIG. 4, user equipment system 401 of FIG. 4, user device 550 of FIG. 5, network device 520 of FIG. 5, any other suitable device, or any combination thereof. In a further example, the query application may be an instance of application 560 of FIG. 5.

At step 902, the query application receives an audio signal at an audio interface. The system may include a microphone or other audio-detecting device and may record an audio file based on audio input to the device.

At step 904, the query application parses the audio signal of step 902 to identify speech. The query application may apply any suitable decimation, conditioning (e.g., amplifying, filtering), processing (e.g., in the time or spectral domain), pattern recognition, algorithms, transforms, any other suitable actions, or any combination thereof. In some embodiments, the query application identifies words, sounds, phrases, or a combination thereof using any suitable technique.

At step 906, the query application determines whether a voice query has been received. In some embodiments, the query application determines that a voice query has been received based on parameters of the audio signal. For example, periods without speech before and after a query may delimit the extent of the voice query in a recording. In some embodiments, the query application identifies keywords in the order spoken and applies a sentence or query template to the keywords to extract a text query. For example, the arrangement of nouns, proper nouns, verbs, adjectives, adverbs, and other parts of speech may provide an indication of the beginning and end of the voice query. The query application may apply any suitable criteria in parsing the audio signal to extract text. At step 908, the query application generates a text query based on the results of steps 904 and 906. In some embodiments, at step 908, the query application may store the text query in suitable storage (e.g., storage device 408). If, at step 906, the query application determines that a voice query has not been received, or that a text query otherwise cannot be generated based on the parsed audio of step 904, the query application may return to step 902 and continue detecting audio until a voice query is received.
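The idea that silent periods delimit a voice query can be illustrated with a simple energy threshold over audio frames. The sketch below assumes the signal has already been reduced to one energy value per frame; the `find_query_span` name, the threshold, and the minimum silence length are hypothetical parameters, and the check for leading silence is omitted for brevity.

```python
def find_query_span(frame_energies, threshold=0.1, min_silence_frames=3):
    """Return (start, end) frame indices of the first speech span, end exclusive.

    Returns None when no speech is found or when the speech is not yet followed
    by enough silence, in which case the caller would keep listening (step 902).
    """
    start = None
    silent_run = 0
    for index, energy in enumerate(frame_energies):
        if energy > threshold:
            if start is None:
                start = index  # first frame containing speech
            silent_run = 0     # a short pause inside the query is tolerated
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silence_frames:
                # The query ended where the trailing silence began.
                return start, index - silent_run + 1
    return None

energies = [0.0, 0.0, 0.5, 0.6, 0.05, 0.7, 0.0, 0.0, 0.0, 0.0]
print(find_query_span(energies))
```

A grammatical cue, such as the arrangement of parts of speech, could be combined with this kind of acoustic cue to decide where the query begins and ends.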

At step 910, the query application accesses a database for entity information. The query application uses the text query of step 908 to search among the information of the database. The query application may apply any suitable search algorithm to identify information, entities, or content of the database.

At step 912, the query application determines whether an entity of the database of step 910 matches the text query of step 908. The query application may identify and evaluate a plurality of entities to find a match. In some embodiments, the text query includes more than one entity, and the query application searches among content to determine a content item that has the entities associated in its metadata (e.g., by comparing the text query to metadata tags of the content item). In some circumstances, the query application may be unable to identify a match and, in response, may continue searching, search among another database, modify the text query (e.g., return to step 908, not shown), return to step 904 to modify a setting used at step 904 (not shown), return an indication that no search results were found, make any other suitable response, or any combination thereof. In some embodiments, the query application may identify a plurality of entities, content, or both that match the text query. Step 914 includes the query application identifying content associated with the text query of step 908. In some embodiments, steps 914 and 910 may be reversed, and the query application may search among content based on the text query. In some embodiments, an entity may include a content identifier, and accordingly steps 910 and 914 may be combined.
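Two of the responses listed for an unmatched query, searching among another database and returning an indication that no results were found, can be sketched as follows. The `search_with_fallback` and `keyword_search` names, the dictionary-based databases, and the sample entries are illustrative assumptions only.

```python
def search_with_fallback(text_query, databases, search):
    """Try each database in order; report when none of them yields a match."""
    for name, database in databases:
        results = search(database, text_query)
        if results:
            return {"status": "found", "database": name, "results": results}
    # No database produced a match: return an indication instead of content.
    return {"status": "no_results", "database": None, "results": []}

def keyword_search(database, text_query):
    """Toy search: return entries whose tags share a word with the query."""
    words = set(text_query.lower().split())
    return [title for title, tags in database.items() if words & set(tags)]

# Illustrative data only.
primary = {"Top Gun": ["tom", "cruise"]}
secondary = {"Cast Away": ["tom", "hanks"]}
print(search_with_fallback(
    "hanks movies", [("primary", primary), ("secondary", secondary)], keyword_search))
```

Other responses, such as modifying the text query or the parsing settings and retrying, could be added as further branches before the final no-results indication.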

At step 916, the query application retrieves the content associated with the text query of step 908. At step 916, for example, the query application may identify a content item, download a content item, stream a content item, generate for display a content item or a list of content items (e.g., or a list of links to content items), or a combination thereof.

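The above-described embodiments of the present disclosure are presented for purposes of illustration and not of limitation, and the present disclosure is limited only by the claims that follow. Furthermore, it should be noted that the features and limitations described in any one embodiment may be applied to any other embodiment herein, and that flowcharts or examples relating to one embodiment may be combined with any other embodiment in a suitable manner, done in different orders, or done in parallel. In addition, the systems and methods described herein may be performed in real time. It should also be noted that the systems and/or methods described above may be applied to, or used in accordance with, other systems and/or methods.
This specification discloses embodiments including, but not limited to, the following: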
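(Item 1) A method for responding to a voice query, the method comprising:
receiving a voice query at an audio interface;
extracting, using control circuitry, one or more keywords from the voice query;
determining, using the control circuitry, pronunciation information for the one or more keywords;
generating, using the control circuitry, a text query based on the one or more keywords and the pronunciation information;
identifying an entity among a plurality of entities of a database based on the text query and stored metadata for the entity, wherein the metadata comprises a pronunciation tag; and
retrieving a content item associated with the entity.
(Item 2) The method of item 1, wherein the pronunciation information comprises a phoneme of one of the one or more keywords.
(Item 3) The method of item 1, wherein identifying the entity is further based on user profile information.
(Item 4) The method of item 3, wherein identifying the entity is based on a previously identified entity from a previous voice query.
(Item 5) The method of item 1, wherein identifying the entity is further based on popularity information associated with the entity.
(Item 6) The method of item 1, wherein identifying the entity comprises:
identifying the plurality of entities, wherein respective metadata is stored for each entity of the plurality of entities;
determining a respective score for each respective entity of the plurality of entities based on comparing a respective pronunciation tag with the text query; and
selecting the entity by determining a maximum score.
(Item 7) The method of item 1, wherein the entity is a first entity, the method further comprising identifying a second entity among the plurality of entities based on the text query and second metadata for the second entity, wherein the content item is associated with the first entity and the second entity.
(Item 8) The method of item 1, wherein identifying the entity among the plurality of entities of the database comprises comparing at least a portion of the text query to tags of the stored metadata to identify a match.
(Item 9) The method of item 1, wherein a first keyword of the one or more keywords is associated with two or more pronunciations of the first keyword.
(Item 10) The method of item 1, wherein the pronunciation information comprises a phonemic representation of a first keyword of the one or more keywords.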
（項目１１）音声クエリに応答するためのシステムであって、システムは、
音声クエリを受信するためのオーディオインターフェースと、
オーディオインターフェースに結合された制御回路と
を備え、
制御回路は、
１つ以上のキーワードを音声クエリから抽出することと、
１つ以上のキーワードに関する発音情報を決定抽出することと、
１つ以上のキーワードおよび発音情報に基づいて、テキストクエリを生成抽出することと、
テキストクエリおよびエンティティに関する記憶されたメタデータに基づいて、データベースの複数のエンティティの中のエンティティを識別抽出することであって、メタデータは、発音タグを備えている、ことと、
エンティティに関連付けられたコンテンツ項目を読み出すことと
を行うように構成されている、システム。
（項目１２）発音情報は、１つ以上のキーワードのうちの１つの音素を備えている、項目１１に記載のシステム。
（項目１３）制御回路は、ユーザプロファイル情報に基づいて、エンティティを識別するようにさらに構成されている、項目１１に記載のシステム。
（項目１４）制御回路は、前の音声クエリから以前に識別されたエンティティに基づいて、エンティティを識別するようにさらに構成されている、項目１３に記載のシステム。
（項目１５）制御回路は、エンティティに関連付けられた人気情報に基づいて、エンティティを識別するようにさらに構成されている、項目１１に記載のシステム。
（項目１６）制御回路は、
複数のエンティティを識別することであって、それぞれのメタデータが、複数のエンティティのうちの各エンティティに関して記憶されている、ことと、
それぞれの発音タグをテキストクエリと比較することに基づいて、複数のエンティティのうちの各それぞれのエンティティに関して、それぞれのスコアを決定することと、
最大スコアを決定することによって、エンティティを選択することと
によって、エンティティを識別するようにさらに構成されている、項目１１に記載のシステム。
（項目１７）エンティティは、第１のエンティティであり、制御回路は、テキストクエリおよび第２のエンティティに関する第２のメタデータに基づいて、複数のエンティティの中の第２のエンティティを識別するようにさらに構成され、コンテンツ項目は、第１のエンティティおよび第２のエンティティに関連付けられている、項目１１に記載のシステム。
（項目１８）制御回路は、テキストクエリの少なくとも一部を記憶されたメタデータのタグと比較し、合致を識別することによって、データベースの複数のエンティティの中のエンティティを識別するようにさらに構成されている、項目１１に記載のシステム。
（項目１９）１つ以上のキーワードのうちの第１のキーワードは、第１のキーワードの２つ以上の発音に関連付けられている、項目１１に記載のシステム。
（項目２０）発音情報は、１つ以上のキーワードのうちの第１のキーワードの音素表現を備えている、項目１１に記載のシステム。
（項目２１）エンコーディングされた命令を有する非一過性コンピュータ読み取り可能な媒体であって、命令は、制御回路によって実行されると、
音声クエリをオーディオインターフェースにおいて受信することと、
１つ以上のキーワードを音声クエリから抽出することと、
１つ以上のキーワードに関する発音情報を決定することと、
１つ以上のキーワードおよび発音情報に基づいて、テキストクエリを生成することと、
テキストクエリおよびエンティティに関する記憶されたメタデータに基づいて、データベースの複数のエンティティの中のエンティティを識別することであって、メタデータは、発音タグを備えている、ことと、
エンティティに関連付けられたコンテンツ項目を読み出すことと
を制御回路に行わせる、非一過性コンピュータ読み取り可能な媒体。
（項目２２）発音情報は、１つ以上のキーワードのうちの１つの音素を備えている、項目２１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目２３）エンコーディングされた命令をさらに備え、命令は、制御回路によって実行されると、制御回路にユーザプロファイル情報に基づいてエンティティを識別させる、項目２１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目２４）エンコーディングされた命令をさらに備え、命令は、制御回路によって実行されると、制御回路に、前の音声クエリからの以前に識別されたエンティティに基づいて、エンティティを識別させる、項目２３に記載の非一過性コンピュータ読み取り可能な媒体。
（項目２５）エンコーディングされた命令をさらに備え、命令は、制御回路によって実行されると、制御回路に、エンティティに関連付けられた人気情報に基づいて、エンティティを識別させる、項目２１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目２６）エンコーディングされた命令をさらに備え、命令は、制御回路によって実行されると、
複数のエンティティを識別することであって、それぞれのメタデータが、複数のエンティティのうちの各エンティティに関して記憶されている、ことと、
それぞれの発音タグをテキストクエリと比較することに基づいて、複数のエンティティのうちの各それぞれのエンティティに関して、それぞれのスコアを決定することと、
最大スコアを決定することによって、エンティティを選択することと
によって、制御回路にエンティティを識別させる、項目２１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目２７）エンティティは、第１のエンティティであり、エンコーディングされた命令をさらに備え、命令は、制御回路によって実行されると、制御回路に、テキストクエリおよび第２のエンティティに関する第２のメタデータに基づいて、複数のエンティティの中の第２のエンティティを識別させ、コンテンツ項目は、第１のエンティティおよび第２のエンティティに関連付けられている、項目２１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目２８）エンコーディングされた命令をさらに備え、命令は、制御回路によって実行されると、制御回路に、テキストクエリの少なくとも一部を記憶されたメタデータのタグと比較し、合致を識別することによって、データベースの複数のエンティティの中のエンティティを識別させる、項目２１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目２９）１つ以上のキーワードのうちの第１のキーワードは、第１のキーワードの２つ以上の発音に関連付けられている、項目２１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目３０）発音情報は、１つ以上のキーワードのうちの第１のキーワードの音素表現を備えている、項目２１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目３１）音声クエリに応答するためのシステムであって、システムは、
音声クエリを受信する手段と、
１つ以上のキーワードを音声クエリから抽出する手段と、
１つ以上のキーワードに関する発音情報を決定する手段と、
１つ以上のキーワードおよび発音情報に基づいて、テキストクエリを生成する手段と、
テキストクエリおよびエンティティに関する記憶されたメタデータに基づいて、データベースの複数のエンティティの中のエンティティを識別する手段であって、メタデータは、発音タグを備えている、手段と、
エンティティに関連付けられたコンテンツ項目を読み出すための手段と
を備えている、システム。
（項目３２）発音情報は、１つ以上のキーワードのうちの１つの音素を備えている、項目３１に記載のシステム。
（項目３３）エンティティを識別する手段は、ユーザプロファイル情報に基づいて、エンティティを識別する手段を備えている、項目３１に記載のシステム。
（項目３４）エンティティを識別する手段は、前の音声クエリからの以前に識別されたエンティティに基づいて、エンティティを識別する手段を備えている、項目３３に記載のシステム。
（項目３５）エンティティを識別する手段は、エンティティに関連付けられた人気情報に基づいて、エンティティを識別する手段を備えている、項目３１に記載のシステム。
（項目３６）エンティティを識別する手段は、
複数のエンティティを識別する手段であって、それぞれのメタデータが、複数のエンティティのうちの各エンティティに関して記憶されている、手段と、
それぞれの発音タグをテキストクエリと比較することに基づいて、複数のエンティティのうちの各それぞれのエンティティに関して、それぞれのスコアを決定する手段と、
最大スコアを決定することによって、エンティティを選択する手段と
を備えている、項目３１に記載のシステム。
（項目３７）エンティティは、第１のエンティティであり、テキストクエリおよび第２のエンティティに関する第２のメタデータに基づいて、複数のエンティティの中の第２のエンティティを識別する手段をさらに備え、コンテンツ項目は、第１のエンティティおよび第２のエンティティに関連付けられている、項目３１に記載のシステム。
（項目３８）データベースの複数のエンティティの中のエンティティを識別する手段は、テキストクエリの少なくとも一部を記憶されたメタデータのタグと比較し、合致を識別する手段を備えている、項目３１に記載のシステム。
（項目３９）１つ以上のキーワードのうちの第１のキーワードは、第１のキーワードの２つ以上の発音に関連付けられている、項目３１に記載のシステム。
（項目４０）発音情報は、１つ以上のキーワードのうちの第１のキーワードの音素表現を備えている、項目３１に記載のシステム。
（項目４１）音声クエリに応答する方法であって、方法は、
音声クエリをオーディオインターフェースにおいて受信することと、
制御回路を使用して、１つ以上のキーワードを音声クエリから抽出することと、
制御回路を使用して、１つ以上のキーワードに関する発音情報を決定することと、
制御回路を使用して、１つ以上のキーワードおよび発音情報に基づいて、テキストクエリを生成することと、
テキストクエリおよびエンティティに関する記憶されたメタデータに基づいて、データベースの複数のエンティティの中のエンティティを識別することであって、メタデータは、発音タグを備えている、ことと、
エンティティに関連付けられたコンテンツ項目を読み出すことと
を含む、方法。
（項目４２）発音情報は、１つ以上のキーワードのうちの１つの音素を備えている、項目４１に記載の方法。
（項目４３）エンティティを識別することは、ユーザプロファイル情報にさらに基づく、項目４１－４２のいずれかに記載の方法。
（項目４４）エンティティを識別することは、前の音声クエリからの以前に識別されたエンティティに基づく、項目４１－４３のいずれかに記載の方法。
（項目４５）エンティティを識別することは、エンティティに関連付けられた人気情報にさらに基づく、項目４１－４４のいずれかに記載の方法。
（項目４６）エンティティを識別することは、
複数のエンティティを識別することであって、それぞれのメタデータが、複数のエンティティのうちの各エンティティに関して記憶されている、ことと、
それぞれの発音タグをテキストクエリと比較することに基づいて、複数のエンティティのうちの各それぞれのエンティティに関して、それぞれのスコアを決定することと、
最大スコアを決定することによって、エンティティを選択することと
を含む、項目４１－４５のいずれかに記載の方法。
（項目４７）エンティティは、第１のエンティティであり、テキストクエリおよび第２のエンティティに関する第２のメタデータに基づいて、複数のエンティティの中の第２のエンティティを識別することをさらに含み、コンテンツ項目は、第１のエンティティおよび第２のエンティティに関連付けられている、項目４１－４６のいずれかに記載の方法。
（項目４８）データベースの複数のエンティティの中のエンティティを識別することは、テキストクエリの少なくとも一部を記憶されたメタデータのタグと比較し、合致を識別することを含む、項目４１－４７のいずれかに記載の方法。
（項目４９）１つ以上のキーワードのうちの第１のキーワードは、第１のキーワードの２つ以上の発音に関連付けられている、項目４１－４８のいずれかに記載の方法。
（項目５０）発音情報は、１つ以上のキーワードのうちの第１のキーワードの音素表現を備えている、項目４１－４９のいずれかに記載の方法。
（項目５１）音声クエリに応答する方法であって、方法は、
音声クエリをオーディオインターフェースにおいて受信することと、
制御回路を使用して、１つ以上のキーワードを音声クエリから抽出することと、
制御回路を使用して、１つ以上のキーワードに基づいて、テキストクエリを生成することと、
テキストクエリおよびエンティティに関するメタデータに基づいて、エンティティを識別することであって、メタデータは、エンティティに関連付けられた識別子の発音に基づくエンティティの１つ以上の代替テキスト表現を備えている、ことと、
エンティティに関連付けられたコンテンツ項目を読み出すことと
を含む、方法。
（項目５２）１つ以上の代替テキスト表現は、エンティティの音素表現を備えている、項目５１に記載の方法。
（項目５３）１つ以上の代替テキスト表現は、発音に基づくエンティティの代替スペルを備えている、項目５１に記載の方法。
（項目５４）エンティティの１つ以上の代替テキスト表現は、前の発話→テキスト変換に基づいて生成されたテキスト文字列を備えている、項目５１に記載の方法。
（項目５５）１つ以上の代替テキスト表現は、複数の代替テキスト表現を備え、複数の代替テキスト表現のうちの各代替テキスト表現は、
第１のテキスト表現をオーディオファイルに変換することと、
オーディオファイルを第２のテキスト表現に変換することであって、第２のテキスト表現は、第１のテキスト表現と同一ではない、ことと
によって生成される、項目５１に記載の方法。
（項目５６）エンティティを識別することは、ユーザプロファイル情報にさらに基づく、項目５１に記載の方法。
（項目５７）エンティティを識別することは、エンティティに関連付けられた人気情報にさらに基づく、項目５１に記載の方法。
（項目５８）エンティティを識別することは、
複数のエンティティを識別することであって、それぞれのメタデータが、複数のエンティティのうちの各エンティティに関して記憶されている、ことと、
それぞれの１つ以上の代替テキスト表現をテキストクエリと比較することに基づいて、複数のエンティティのうちの各それぞれのエンティティに関して、それぞれのスコアを決定することと、
最大スコアを決定することによって、エンティティを選択することと
を含む、項目５１に記載の方法。
（項目５９）複数のテキストクエリを生成することをさらに含み、複数のテキストクエリは、テキストクエリを備え、複数のテキストクエリのうちの各テキストクエリは、制御回路の発話→テキストモジュールのそれぞれの設定に基づいて生成される、項目５１に記載の方法。
（項目６０）
複数のテキストクエリのうちのそれぞれのテキストクエリに基づいて、それぞれのエンティティを識別することと、
それぞれのテキストクエリのそれぞれのエンティティに関連付けられたメタデータとの比較に基づいて、それぞれのエンティティに関するそれぞれのスコアを決定することと、
それぞれのスコアの最大スコアを選択することによって、エンティティを識別することと
をさらに含む、項目５９に記載の方法。
（項目６１）音声クエリに応答するためのシステムであって、システムは、
音声クエリを受信するためのオーディオインターフェースと、
制御回路と
を備え、
制御回路は、
１つ以上のキーワードを音声クエリから抽出することと、
１つ以上のキーワードに基づいて、テキストクエリを生成することと、
テキストクエリおよびエンティティに関するメタデータに基づいて、エンティティを識別することであって、メタデータは、エンティティに関連付けられた識別子の発音に基づくエンティティの１つ以上の代替テキスト表現を備えている、ことと、
エンティティに関連付けられたコンテンツ項目を読み出すことと
を行うように構成されている、システム。
（項目６２）１つ以上の代替テキスト表現は、エンティティの音素表現を備えている、項目６１に記載のシステム。
（項目６３）１つ以上の代替テキスト表現は、発音に基づくエンティティの代替スペルを備えている、項目６１に記載のシステム。
（項目６４）エンティティの１つ以上の代替テキスト表現は、前の発話→テキスト変換に基づいて生成されたテキスト文字列を備えている、項目６１に記載のシステム。
（項目６５）１つ以上の代替テキスト表現は、複数の代替テキスト表現を備え、制御回路は、
第１のテキスト表現をオーディオファイルに変換することと、
オーディオファイルを第２のテキスト表現に変換することであって、第２のテキスト表現は、第１のテキスト表現と同一ではない、ことと
によって、複数の代替テキスト表現のうちの各代替テキスト表現を生成するように構成されている、項目６１に記載のシステム。
（項目６６）制御回路は、ユーザプロファイル情報に基づいて、エンティティを識別するようにさらに構成されている、項目６１に記載のシステム。
（項目６７）制御回路は、エンティティに関連付けられた人気情報に基づいて、エンティティを識別するようにさらに構成されている、項目６１に記載のシステム。
（項目６８）制御回路は、
複数のエンティティを識別することであって、それぞれのメタデータが、複数のエンティティのうちの各エンティティに関して記憶されている、ことと、
それぞれの１つ以上の代替テキスト表現をテキストクエリと比較することに基づいて、複数のエンティティのうちの各それぞれのエンティティに関して、それぞれのスコアを決定することと、
最大スコアを決定することによって、エンティティを選択することと
によって、エンティティを識別するようにさらに構成されている、項目６１に記載のシステム。
（項目６９）制御回路は、複数のテキストクエリを生成するようにさらに構成され、複数のテキストクエリは、テキストクエリを備え、制御回路は、発話→テキストモジュールを備え、複数のテキストクエリのうちの各テキストクエリは、発話→テキストモジュールのそれぞれの設定に基づいて生成される、項目６１に記載のシステム。
（項目７０）制御回路は、
複数のテキストクエリのうちのそれぞれのテキストクエリに基づいて、それぞれのエンティティを識別することと、
それぞれのテキストクエリのそれぞれのエンティティに関連付けられたメタデータとの比較に基づいて、それぞれのエンティティに関するそれぞれのスコアを決定することと、
それぞれのスコアの最大スコアを選択することによって、エンティティを識別することと
を行うようにさらに構成されている、項目６９に記載のシステム。
（項目７１）エンコーディングされた命令を有する非一過性コンピュータ読み取り可能な媒体であって、命令は、制御回路によって実行されると、
音声クエリをオーディオインターフェースにおいて受信することと、
１つ以上のキーワードを音声クエリから抽出することと、
１つ以上のキーワードに基づいて、テキストクエリを生成することと、
テキストクエリおよびエンティティに関するメタデータに基づいて、エンティティを識別することであって、メタデータは、エンティティに関連付けられた識別子の発音に基づくエンティティの１つ以上の代替テキスト表現を備えている、ことと、
エンティティに関連付けられたコンテンツ項目を読み出すことと
を制御回路に行わせる、非一過性コンピュータ読み取り可能な媒体。
（項目７２）１つ以上の代替テキスト表現は、エンティティの音素表現を備えている、項目７１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目７３）１つ以上の代替テキスト表現は、発音に基づくエンティティの代替スペルを備えている、項目７１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目７４）エンティティの１つ以上の代替テキスト表現は、前の発話→テキスト変換に基づいて生成されたテキスト文字列を備えている、項目７１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目７５）１つ以上の代替テキスト表現は、複数の代替テキスト表現を備え、エンコーディングされた命令をさらに備え、命令は、制御回路によって実行されると、制御回路に、
第１のテキスト表現をオーディオファイルに変換することと、
オーディオファイルを第２のテキスト表現に変換することであって、第２のテキスト表現は、第１のテキスト表現と同一ではない、ことと
によって、複数の代替テキスト表現のうちの各代替テキスト表現を生成させる、項目７１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目７６）エンコーディングされた命令をさらに備え、命令は、制御回路によって実行されると、制御回路に、ユーザプロファイル情報に基づいて、エンティティを識別させる、項目７１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目７７）エンコーディングされた命令をさらに備え、命令は、制御回路によって実行されると、制御回路に、エンティティに関連付けられた人気情報に基づいて、エンティティを識別させる、項目７１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目７８）エンコーディングされた命令をさらに備え、命令は、制御回路によって実行されると、
複数のエンティティを識別することであって、それぞれのメタデータが、複数のエンティティのうちの各エンティティに関して記憶されている、ことと、
それぞれの１つ以上の代替テキスト表現をテキストクエリと比較することに基づいて、複数のエンティティのうちの各それぞれのエンティティに関して、それぞれのスコアを決定することと、
最大スコアを決定することによって、エンティティを選択することと、
によって、制御回路にエンティティを識別させる、項目７１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目７９）エンコーディングされた命令をさらに備え、命令は、制御回路によって実行されると、制御回路に、複数のテキストクエリを生成させ、複数のテキストクエリは、テキストクエリを備え、複数のテキストクエリのうちの各テキストクエリは、制御回路の発話→テキストモジュールのそれぞれの設定に基づいて生成される、項目７１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目８０）エンコーディングされた命令をさらに備え、命令は、制御回路によって実行されると、
複数のテキストクエリのうちのそれぞれのテキストクエリに基づいて、それぞれのエンティティを識別することと、
それぞれのテキストクエリのそれぞれのエンティティに関連付けられたメタデータとの比較に基づいて、それぞれのエンティティに関するそれぞれのスコアを決定することと、
それぞれのスコアの最大スコアを選択することによって、エンティティを識別することと
を制御回路に行わせる、項目７９に記載の非一過性コンピュータ読み取り可能な媒体。
（項目８１）音声クエリに応答するためのシステムであって、システムは、
音声クエリをオーディオインターフェースにおいて受信する手段と、
１つ以上のキーワードを音声クエリから抽出する手段と、
１つ以上のキーワードに基づいて、テキストクエリを生成する手段と、
テキストクエリおよびエンティティに関するメタデータに基づいて、エンティティを識別する手段であって、メタデータは、エンティティに関連付けられた識別子の発音に基づくエンティティの１つ以上の代替テキスト表現を備えている、手段と、
エンティティに関連付けられたコンテンツ項目を読み出すための手段と
を備えている、システム。
（項目８２）１つ以上の代替テキスト表現は、エンティティの音素表現を備えている、項目８１に記載のシステム。
（項目８３）１つ以上の代替テキスト表現は、発音に基づくエンティティの代替スペルを備えている、項目８１に記載のシステム。
（項目８４）エンティティの１つ以上の代替テキスト表現は、前の発話→テキスト変換に基づいて生成されたテキスト文字列を備えている、項目８１に記載のシステム。
（項目８５）１つ以上の代替テキスト表現は、複数の代替テキスト表現を備え、複数の代替テキスト表現のうちの各代替テキスト表現は、
第１のテキスト表現をオーディオファイルに変換する手段と、
オーディオファイルを第２のテキスト表現に変換する手段であって、第２のテキスト表現は、第１のテキスト表現と同一ではない、手段と
によって生成される、項目８１に記載のシステム。
（項目８６）エンティティを識別する手段は、ユーザプロファイル情報に基づいて、エンティティを識別する手段をさらに備えている、項目８１に記載のシステム。
（項目８７）エンティティを識別する手段は、エンティティに関連付けられた人気情報に基づいて、エンティティを識別する手段をさらに備えている、項目８１に記載のシステム。
（項目８８）エンティティを識別する手段は、
複数のエンティティを識別する手段であって、それぞれのメタデータが、複数のエンティティのうちの各エンティティに関して記憶されている、手段と、
それぞれの１つ以上の代替テキスト表現をテキストクエリと比較することに基づいて、複数のエンティティのうちの各それぞれのエンティティに関して、それぞれのスコアを決定する手段と、
最大スコアを決定することによって、エンティティを選択する手段と
を備えている、項目８１に記載のシステム。
（項目８９）複数のテキストクエリを生成する手段をさらに備え、複数のテキストクエリは、テキストクエリを備え、複数のテキストクエリのうちの各テキストクエリは、制御回路の発話→テキストモジュールのそれぞれの設定に基づいて生成される、項目８１に記載のシステム。
（項目９０）
複数のテキストクエリのうちのそれぞれのテキストクエリに基づいて、それぞれのエンティティを識別する手段と、
それぞれのテキストクエリのそれぞれのエンティティに関連付けられたメタデータとの比較に基づいて、それぞれのエンティティに関するそれぞれのスコアを決定する手段と、
それぞれのスコアの最大スコアを選択することによって、エンティティを識別する手段と
をさらに備えている、項目８９に記載のシステム。
（項目９１）音声クエリに応答する方法であって、方法は、
音声クエリをオーディオインターフェースにおいて受信することと、
制御回路を使用して、１つ以上のキーワードを音声クエリから抽出することと、
制御回路を使用して、１つ以上のキーワードに基づいて、テキストクエリを生成することと、
テキストクエリおよびエンティティに関するメタデータに基づいて、エンティティを識別することであって、メタデータは、エンティティに関連付けられた識別子の発音に基づくエンティティの１つ以上の代替テキスト表現を備えている、ことと、
エンティティに関連付けられたコンテンツ項目を読み出すことと
を含む、方法。
（項目９２）１つ以上の代替テキスト表現は、エンティティの音素表現を備えている、項目９１に記載の方法。
（項目９３）１つ以上の代替テキスト表現は、発音に基づくエンティティの代替スペルを備えている、項目９１－９２のいずれかに記載の方法。
（項目９４）エンティティの１つ以上の代替テキスト表現は、前の発話→テキスト変換に基づいて生成されたテキスト文字列を備えている、項目９１－９３のいずれかに記載の方法。
（項目９５）１つ以上の代替テキスト表現は、複数の代替テキスト表現を備え、複数の代替テキスト表現のうちの各代替テキスト表現は、
第１のテキスト表現をオーディオファイルに変換することと、
オーディオファイルを第２のテキスト表現に変換することであって、第２のテキスト表現は、第１のテキスト表現と同一ではない、ことと
によって生成される、項目９１－９４のいずれかに記載の方法。
（項目９６）エンティティを識別することは、ユーザプロファイル情報にさらに基づく、項目９１－９５のいずれかに記載の方法。
（項目９７）エンティティを識別することは、エンティティに関連付けられた人気情報にさらに基づく、項目９１－９６のいずれかに記載の方法。
（項目９８）エンティティを識別することは、
複数のエンティティを識別することであって、それぞれのメタデータが、複数のエンティティのうちの各エンティティに関して記憶されている、ことと、
それぞれの１つ以上の代替テキスト表現をテキストクエリと比較することに基づいて、複数のエンティティのうちの各それぞれのエンティティに関して、それぞれのスコアを決定することと、
最大スコアを決定することによって、エンティティを選択することと
を含む、項目９１－９７のいずれかに記載の方法。
（項目９９）複数のテキストクエリを生成することをさらに含み、複数のテキストクエリは、テキストクエリを備え、複数のテキストクエリのうちの各テキストクエリは、制御回路の発話→テキストモジュールのそれぞれの設定に基づいて生成される、項目９１－９８のいずれかに記載の方法。
（項目１００）
複数のテキストクエリのうちのそれぞれのテキストクエリに基づいて、それぞれのエンティティを識別することと、
それぞれのテキストクエリのそれぞれのエンティティに関連付けられたメタデータとの比較に基づいて、それぞれのエンティティに関するそれぞれのスコアを決定することと、
それぞれのスコアの最大スコアを選択することによって、エンティティを識別することと
をさらに含む、項目９９に記載の方法。
（項目１０１）音声クエリに関するエンティティメタデータを生成する方法であって、方法は、
複数のエンティティのうちの情報が記憶されているエンティティを識別することと、
テキスト→発話モジュールを使用して、第１のテキスト文字列および少なくとも１つの発話基準に基づいて、オーディオファイルを生成することであって、第１のテキスト文字列は、エンティティを記述する、ことと、
発話→テキストモジュールを使用して、オーディオファイルに基づいて、第２のテキスト文字列を生成することと、
第２のテキスト文字列を第１のテキスト文字列と比較することと、
第１のテキスト文字列と同一でない場合、エンティティに関連付けられたメタデータに第２のテキスト文字列を記憶することと
を含む、方法。
（項目１０２）少なくとも１つの発話基準は、発音設定を備えている、項目１０１に記載の方法。
（項目１０３）少なくとも１つの発話基準は、言語設定を備えている、項目１０１に記載の方法。
（項目１０４）少なくとも１つの発話基準は、複数の発話基準を備え、方法は、
テキスト→発話モジュールを使用して、第１のテキスト文字列およびそれぞれの発話基準に基づいて、それぞれのオーディオファイルを生成することと、
発話→テキストモジュールを使用して、それぞれのオーディオファイルに基づいて、それぞれの第２のテキスト文字列を生成することと、
それぞれの第２のテキスト文字列を第１のテキスト文字列と比較することと、
第１のテキスト文字列と同一でない場合、エンティティに関連付けられたメタデータにそれぞれの第２のテキスト文字列を記憶することと
をさらに含む、項目１０１に記載の方法。
（項目１０５）１つ以上のテキストクエリに基づいて、メタデータを更新することをさらに含む、項目１０１に記載の方法。
（項目１０６）エンティティに関連付けられたメタデータに第１のテキスト文字列の音素表現を記憶することをさらに含む、項目１０１に記載の方法。
（項目１０７）第１のテキスト文字列に基づいて、オーディオファイルを生成することは、
第１のテキスト文字列を第１のオーディオ信号に変換することと、
オーディオ信号に基づいて、発話をスピーカにおいて生成することと、
マイクロホンを使用して、発話を検出し、第２のオーディオ信号を生成することと、
オーディオ信号を処理し、オーディオファイルを生成することと
を含む、項目１０１に記載の方法。
（項目１０８）発話をスピーカにおいて生成することは、テキスト→発話モジュールの少なくとも１つの発話設定にさらに基づく、項目１０７に記載の方法。
（項目１０９）オーディオファイルに基づいて、第２のテキスト文字列を生成することは、
オーディオファイルの再生をスピーカにおいて生成することと、
マイクロホンを使用して、再生を検出し、オーディオ信号を生成することと、
１つ以上の単語を識別することによって、オーディオ信号を第２のテキスト文字列に変換することと
を含む、項目１０１に記載の方法。
（項目１１０）オーディオ信号を第２のテキスト文字列に変換することは、発話→テキストモジュールの少なくとも１つのテキスト設定に基づく、項目１０９に記載の方法。
（項目１１１）音声クエリに関するエンティティメタデータを生成するためのシステムであって、システムは、制御回路を備え、
制御回路は、
複数のエンティティのうちの情報が記憶されているエンティティを識別することと、
制御回路に結合されたオーディオインターフェースを使用して、第１のテキスト文字列および少なくとも１つの発話基準に基づいて、オーディオファイルを生成することであって、第１のテキスト文字列は、エンティティを記述する、ことと、
オーディオインターフェースを使用して、オーディオファイルに基づいて、第２のテキスト文字列を生成することと、
第２のテキスト文字列を第１のテキスト文字列と比較することと、
第１のテキスト文字列と同一でない場合、エンティティに関連付けられたメタデータに第２のテキスト文字列を記憶することと
を行うように構成されている、システム。
（項目１１２）少なくとも１つの発話基準は、発音設定を備えている、項目１１１に記載のシステム。
（項目１１３）少なくとも１つの発話基準は、言語設定を備えている、項目１１１に記載のシステム。
（項目１１４）少なくとも１つの発話基準は、複数の発話基準を備え、制御回路は、
オーディオ機器を使用して、第１のテキスト文字列およびそれぞれの発話基準に基づいて、それぞれのオーディオファイルを生成することと、
オーディオ機器を使用して、それぞれのオーディオファイルに基づいて、それぞれの第２のテキスト文字列を生成することと、
それぞれの第２のテキスト文字列を第１のテキスト文字列と比較することと、
第１のテキスト文字列と同一でない場合、エンティティに関連付けられたメタデータにそれぞれの第２のテキスト文字列を記憶することと
を行うようにさらに構成されている、項目１１１に記載のシステム。
（項目１１５）制御回路は、１つ以上のテキストクエリに基づいて、メタデータを更新するようにさらに構成されている、項目１１１に記載のシステム。
（項目１１６）制御回路は、エンティティに関連付けられたメタデータに第１のテキスト文字列の音素表現を記憶するようにさらに構成されている、項目１１１に記載のシステム。
（項目１１７）オーディオ機器は、スピーカとマイクロホンとを備え、制御回路は、
第１のテキスト文字列を第１のオーディオ信号に変換することと、
オーディオ信号に基づいて、発話をスピーカにおいて生成することと、
マイクロホンを使用して、発話を検出し、第２のオーディオ信号を生成することと、
オーディオ信号を処理し、オーディオファイルを生成することと
によって、第１のテキスト文字列に基づいて、オーディオファイルを生成するようにさらに構成されている、項目１１１に記載のシステム。
（項目１１８）制御回路は、少なくとも１つの発話設定に基づいて、発話をスピーカにおいて生成するようにさらに構成されている、項目１１７に記載のシステム。
（項目１１９）オーディオ機器は、スピーカとマイクロホンとを備え、制御回路は、
オーディオファイルの再生をスピーカにおいて生成することと、
再生をマイクロホンにおいて検出し、オーディオ信号を生成することと、
１つ以上の単語を識別することによって、オーディオ信号を第２のテキスト文字列に変換することと
によって、オーディオファイルに基づいて、第２のテキスト文字列を生成するようにさらに構成されている、項目１１１に記載のシステム。
（項目１２０）制御回路は、発話→テキストモジュールの少なくとも１つのテキスト設定に基づいて、オーディオ信号を第２のテキスト文字列に変換するようにさらに構成されている、項目１１９に記載のシステム。
（項目１２１）エンコーディングされた命令を有する非一過性コンピュータ読み取り可能な媒体であって、命令は、制御回路によって実行されると、
複数のエンティティのうちの情報が記憶されているエンティティを識別することと、
第１のテキスト文字列および少なくとも１つの発話基準に基づいて、オーディオファイルを生成することであって、第１のテキスト文字列は、エンティティを記述する、ことと、
オーディオファイルに基づいて、第２のテキスト文字列を生成することと、
第２のテキスト文字列を第１のテキスト文字列と比較することと、
第１のテキスト文字列と同一でない場合、エンティティに関連付けられたメタデータに第２のテキスト文字列を記憶することと
を制御回路に行わせる、非一過性コンピュータ読み取り可能な媒体。
（項目１２２）少なくとも１つの発話基準は、発音設定を備えている、項目１２１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目１２３）少なくとも１つの発話基準は、言語設定を備えている、項目１２１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目１２４）少なくとも１つの発話基準は、複数の発話基準を備え、エンコーディングされた命令をさらに備え、命令は、制御回路によって実行されると、
第１のテキスト文字列およびそれぞれの発話基準に基づいて、それぞれのオーディオファイルを生成することと、
それぞれのオーディオファイルに基づいて、それぞれの第２のテキスト文字列を生成することと、
それぞれの第２のテキスト文字列を第１のテキスト文字列と比較することと、
第１のテキスト文字列と同一でない場合、エンティティに関連付けられたメタデータにそれぞれの第２のテキスト文字列を記憶することと
を制御回路に行わせる、項目１２１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目１２５）エンコーディングされた命令をさらに備え、命令は、制御回路によって実行されると、制御回路に、１つ以上のテキストクエリに基づいて、メタデータを更新させる、項目１２１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目１２６）エンコーディングされた命令をさらに備え、命令は、制御回路によって実行されると、制御回路に、エンティティに関連付けられたメタデータに第１のテキスト文字列の音素表現を記憶させる、項目１２１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目１２７）エンコーディングされた命令をさらに備え、命令は、制御回路によって実行されると、
第１のテキスト文字列を第１のオーディオ信号に変換することと、
オーディオ信号に基づいて、発話をスピーカにおいて生成することと、
マイクロホンを使用して、発話を検出し、第２のオーディオ信号を生成することと、
オーディオ信号を処理し、オーディオファイルを生成することと
を制御回路に行わせる、項目１２１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目１２８）エンコーディングされた命令をさらに備え、命令は、制御回路によって実行されると、制御回路に、テキスト→発話モジュールの少なくとも１つの発話設定に基づいて、発話をスピーカにおいて生成させる、項目１２７に記載の非一過性コンピュータ読み取り可能な媒体。
（項目１２９）エンコーディングされた命令をさらに備え、命令は、制御回路によって実行されると、
オーディオファイルの再生をスピーカにおいて生成することと、
マイクロホンを使用して、再生を検出し、オーディオ信号を生成することと、
１つ以上の単語を識別することによって、オーディオ信号を第２のテキスト文字列に変換することと
を制御回路に行わせる、項目１２１に記載の非一過性コンピュータ読み取り可能な媒体。
（項目１３０）エンコーディングされた命令をさらに備え、命令は、制御回路によって実行されると、制御回路に、発話→テキストモジュールの少なくとも１つのテキスト設定に基づいて、オーディオ信号を第２のテキスト文字列に変換させる、項目１２９に記載の非一過性コンピュータ読み取り可能な媒体。
（項目１３１）音声クエリに関するエンティティメタデータを生成するためのシステムであって、システムは、
複数のエンティティのうちの情報が記憶されているエンティティを識別する手段と、
第１のテキスト文字列および少なくとも１つの発話基準に基づいて、オーディオファイルを生成する手段であって、第１のテキスト文字列は、エンティティを記述する、手段と、
オーディオファイルに基づいて、第２のテキスト文字列を生成する手段と、
第２のテキスト文字列を第１のテキスト文字列と比較する手段と、
第１のテキスト文字列と同一でない場合、エンティティに関連付けられたメタデータに第２のテキスト文字列を記憶する手段と
を備えている、システム。
（項目１３２）少なくとも１つの発話基準は、発音設定を備えている、項目１３１に記載のシステム。
（項目１３３）少なくとも１つの発話基準は、言語設定を備えている、項目１３１に記載のシステム。
（項目１３４）少なくとも１つの発話基準は、複数の発話基準を備え、システムは、
第１のテキスト文字列およびそれぞれの発話基準に基づいて、それぞれのオーディオファイルを生成する手段と、
それぞれのオーディオファイルに基づいて、それぞれの第２のテキスト文字列を生成する手段と、
それぞれの第２のテキスト文字列を第１のテキスト文字列と比較する手段と、
第１のテキスト文字列と同一でない場合、エンティティに関連付けられたメタデータにそれぞれの第２のテキスト文字列を記憶する手段と
をさらに備えている、項目１３１に記載のシステム。
（項目１３５）１つ以上のテキストクエリに基づいて、メタデータを更新する手段をさらに備えている、項目１３１に記載のシステム。
（項目１３６）エンティティに関連付けられたメタデータに第１のテキスト文字列の音素表現を記憶する手段をさらに備えている、項目１３１に記載のシステム。
（項目１３７）第１のテキスト文字列に基づいて、オーディオファイルを生成する手段は、
第１のテキスト文字列を第１のオーディオ信号に変換する手段と、
オーディオ信号に基づいて、発話をスピーカにおいて生成する手段と、
マイクロホンを使用して、発話を検出し、第２のオーディオ信号を生成する手段と、
オーディオ信号を処理し、オーディオファイルを生成する手段と
を備えている、項目１３１に記載のシステム。
（項目１３８）発話をスピーカにおいて生成する手段は、テキスト→発話モジュールの少なくとも１つの発話設定に基づいて、発話をスピーカにおいて生成する手段をさらに備えている、項目１３７に記載のシステム。
（項目１３９）オーディオファイルに基づいて、第２のテキスト文字列を生成する手段は、
オーディオファイルの再生をスピーカにおいて生成する手段と、
マイクロホンを使用して、再生を検出し、オーディオ信号を生成する手段と、
１つ以上の単語を識別することによって、オーディオ信号を第２のテキスト文字列に変換する手段と
を含む、項目１３１に記載のシステム。
（項目１４０）オーディオ信号を第２のテキスト文字列に変換する手段は、発話→テキストモジュールの少なくとも１つのテキスト設定に基づいて、オーディオ信号を第２のテキスト文字列に変換する手段を備えている、項目１３９に記載のシステム。
（項目１４１）音声クエリのためのエンティティメタデータを生成する方法であって、方法は、
複数のエンティティのうちの情報が記憶されているエンティティを識別することと、
テキスト→発話モジュールを使用して、第１のテキスト文字列および少なくとも１つの発話基準に基づいて、オーディオファイルを生成することであって、第１のテキスト文字列は、エンティティを記述する、ことと、
発話→テキストモジュールを使用して、オーディオファイルに基づいて、第２のテキスト文字列を生成することと、
第２のテキスト文字列を第１のテキスト文字列と比較することと、
第１のテキスト文字列と同一でない場合、エンティティに関連付けられたメタデータに第２のテキスト文字列を記憶することと
を含む、方法。
（項目１４２）少なくとも１つの発話基準は、発音設定を備えている、項目１４１に記載の方法。
（項目１４３）少なくとも１つの発話基準は、言語設定を備えている、項目１４１－１４２のいずれかに記載の方法。
（項目１４４）少なくとも１つの発話基準は、複数の発話基準を備え、方法は、
テキスト→発話モジュールを使用して、第１のテキスト文字列およびそれぞれの発話基準に基づいて、それぞれのオーディオファイルを生成することと、
発話→テキストモジュールを使用して、それぞれのオーディオファイルに基づいて、それぞれの第２のテキスト文字列を生成することと、
それぞれの第２のテキスト文字列を第１のテキスト文字列と比較することと、
第１のテキスト文字列と同一でない場合、エンティティに関連付けられたメタデータにそれぞれの第２のテキスト文字列を記憶することと
をさらに含む、項目１４１－１４３のいずれかに記載の方法。
（項目１４５）１つ以上のテキストクエリに基づいて、メタデータを更新することをさらに含む、項目１４１－１４４のいずれかに記載の方法。
（項目１４６）エンティティに関連付けられたメタデータに第１のテキスト文字列の音素表現を記憶することをさらに含む、項目１４１－１４５のいずれかに記載の方法。
（項目１４７）第１のテキスト文字列に基づいて、オーディオファイルを生成することは、
第１のテキスト文字列を第１のオーディオ信号に変換することと、
オーディオ信号に基づいて、発話をスピーカにおいて生成することと、
マイクロホンを使用して、発話を検出し、第２のオーディオ信号を生成することと、
オーディオ信号を処理し、オーディオファイルを生成することと
を含む、項目１４１－１４６のいずれかに記載の方法。
（項目１４８）発話をスピーカにおいて生成することは、テキスト→発話モジュールの少なくとも１つの発話設定にさらに基づく、項目１４７に記載の方法。
（項目１４９）オーディオファイルに基づいて、第２のテキスト文字列を生成することは、
オーディオファイルの再生をスピーカにおいて生成することと、
マイクロホンを使用して、再生を検出し、オーディオ信号を生成することと、
１つ以上の単語を識別することによって、オーディオ信号を第２のテキスト文字列に変換することと
を含む、項目１４１－１４８のいずれかに記載の方法。
（項目１５０）オーディオ信号を第２のテキスト文字列に変換することは、発話→テキストモジュールの少なくとも１つのテキスト設定に基づく、項目１４９に記載の方法。
The above-described embodiments of the present disclosure are presented for purposes of illustration, not limitation, and the present disclosure is limited only by the claims that follow. Furthermore, it should be noted that the features and limitations described in any one embodiment may be applied to any other embodiment herein, and that flowcharts or examples relating to one embodiment may be combined with any other embodiment in a suitable manner, performed in a different order, or performed in parallel. In addition, the systems and methods described herein may be performed in real time. It should also be noted that the systems and/or methods described above may be applied to, or used in accordance with, other systems and/or methods.
This specification discloses embodiments including, but not limited to:
(Item 1) A method of responding to a voice query, the method comprising:
receiving a voice query at an audio interface;
extracting one or more keywords from the voice query using control circuitry;
determining pronunciation information for the one or more keywords using the control circuitry;
generating a text query based on the one or more keywords and the pronunciation information using the control circuitry;
identifying an entity among a plurality of entities of a database based on the text query and stored metadata for the entity, wherein the metadata comprises a pronunciation tag; and
retrieving a content item associated with the entity.
(Item 2) The method of item 1, wherein the pronunciation information comprises a phoneme of one of the one or more keywords.
(Item 3) The method of item 1, wherein identifying the entity is further based on user profile information.
(Item 4) The method of item 3, wherein identifying the entity is based on a previously identified entity from a previous voice query.
(Item 5) The method of item 1, wherein identifying the entity is further based on popularity information associated with the entity.
(Item 6) The method of item 1, wherein identifying the entity comprises:
identifying the plurality of entities, wherein respective metadata is stored for each entity of the plurality of entities;
determining a respective score for each respective entity of the plurality of entities based on comparing the respective pronunciation tag to the text query; and
selecting the entity by determining a maximum score.
(Item 7) The method of item 1, wherein the entity is a first entity, the method further comprising identifying a second entity among the plurality of entities based on the text query and second metadata for the second entity, wherein the content item is associated with the first entity and the second entity.
(Item 8) The method of item 1, wherein identifying the entity among the plurality of entities of the database comprises comparing at least a portion of the text query to tags of the stored metadata to identify a match.
(Item 9) The method of item 1, wherein a first keyword of the one or more keywords is associated with two or more pronunciations of the first keyword.
(Item 10) The method of item 1, wherein the pronunciation information comprises a phonemic representation of a first keyword of the one or more keywords.
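The identification steps of items 1 and 6 can be sketched roughly as follows. This is only an illustrative sketch, not the implementation described in the disclosure: the `Entity` and `TextQuery` types, the ARPAbet-style phoneme strings, the use of `difflib.SequenceMatcher` as the similarity measure, and the sample entities are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class Entity:
    """Hypothetical database entity with stored pronunciation tags."""
    name: str
    pronunciation_tags: list[str] = field(default_factory=list)
    content_items: list[str] = field(default_factory=list)


@dataclass
class TextQuery:
    """Hypothetical text query: keywords plus pronunciation info per keyword."""
    keywords: list[str]
    pronunciations: dict[str, str]  # keyword -> phoneme string (assumed format)


def tag_score(query: TextQuery, entity: Entity) -> float:
    """Score an entity by comparing its pronunciation tags to the query.

    Returns the best similarity (0.0 to 1.0) between any keyword's
    pronunciation and any of the entity's pronunciation tags. The
    similarity measure is an assumption, not taken from the source.
    """
    best = 0.0
    for keyword, phonemes in query.pronunciations.items():
        if keyword.lower() not in entity.name.lower():
            continue  # keyword text does not match this entity at all
        for tag in entity.pronunciation_tags:
            best = max(best, SequenceMatcher(None, phonemes, tag).ratio())
    return best


def identify_entity(query: TextQuery, entities: list[Entity]) -> Entity:
    """Select the entity with the maximum score (item 6)."""
    return max(entities, key=lambda e: tag_score(query, e))


# Two hypothetical entities that share a spelling but differ in pronunciation.
entities = [
    Entity("Louis (hypothetical artist A)", ["L UW IY"], ["Concert film A"]),
    Entity("Louis (hypothetical artist B)", ["L UW IH S"], ["Documentary B"]),
]
query = TextQuery(keywords=["louis"], pronunciations={"louis": "L UW IY"})

best = identify_entity(query, entities)
print(best.name, best.content_items)
```

With text alone the two entities would be indistinguishable; the pronunciation tag is what separates them in this sketch. If no entity matches, `max` simply returns the first candidate, so a real system would presumably also apply a minimum-score threshold.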
(Item 11) A system for responding to a voice query, the system comprising:
an audio interface for receiving a voice query; and
control circuitry coupled to the audio interface,
wherein the control circuitry is configured to:
extract one or more keywords from the voice query;
determine pronunciation information for the one or more keywords;
generate a text query based on the one or more keywords and the pronunciation information;
identify an entity among a plurality of entities of a database based on the text query and stored metadata for the entity, wherein the metadata comprises a pronunciation tag; and
retrieve a content item associated with the entity.
(Item 12) The system of item 11, wherein the pronunciation information comprises a phoneme of one of the one or more keywords.
(Item 13) The system of item 11, wherein the control circuitry is further configured to identify the entity based on user profile information.
(Item 14) The system of item 13, wherein the control circuitry is further configured to identify the entity based on a previously identified entity from a previous voice query.
(Item 15) The system of item 11, wherein the control circuitry is further configured to identify the entity based on popularity information associated with the entity.
(Item 16) The system of item 11, wherein the control circuitry is further configured to identify the entity by:
identifying the plurality of entities, wherein respective metadata is stored for each entity of the plurality of entities;
determining a respective score for each respective entity of the plurality of entities based on comparing the respective pronunciation tag to the text query; and
selecting the entity by determining a maximum score.
(Item 17) The system of item 11, wherein the entity is a first entity, wherein the control circuitry is further configured to identify a second entity among the plurality of entities based on the text query and second metadata for the second entity, and wherein the content item is associated with the first entity and the second entity.
(Item 18) The system of item 11, wherein the control circuitry is further configured to identify the entity among the plurality of entities of the database by comparing at least a portion of the text query to tags of the stored metadata and identifying a match.
(Item 19) The system of item 11, wherein a first keyword of the one or more keywords is associated with two or more pronunciations of the first keyword.
(Item 20) The system of item 11, wherein the pronunciation information comprises a phonemic representation of a first keyword of the one or more keywords.
(Item 21) A non-transitory computer-readable medium having instructions encoded thereon that, when executed by control circuitry, cause the control circuitry to:
receive a voice query at an audio interface;
extract one or more keywords from the voice query;
determine pronunciation information for the one or more keywords;
generate a text query based on the one or more keywords and the pronunciation information;
identify an entity among a plurality of entities of a database based on the text query and stored metadata for the entity, wherein the metadata comprises a pronunciation tag; and
retrieve a content item associated with the entity.
(Item 22) The non-transitory computer-readable medium of item 21, wherein the pronunciation information comprises a phoneme of one of the one or more keywords.
(Item 23) The non-transitory computer-readable medium of item 21, further comprising instructions encoded thereon that, when executed by the control circuitry, cause the control circuitry to identify the entity based on user profile information.
(Item 24) The non-transitory computer-readable medium of item 23, further comprising instructions encoded thereon that, when executed by the control circuitry, cause the control circuitry to identify the entity based on a previously identified entity from a previous voice query.
(Item 25) The non-transitory computer-readable medium of item 21, further comprising instructions encoded thereon that, when executed by the control circuitry, cause the control circuitry to identify the entity based on popularity information associated with the entity.
(Item 26) The non-transitory computer-readable medium of item 21, further comprising instructions encoded thereon that, when executed by the control circuitry, cause the control circuitry to identify the entity by:
identifying the plurality of entities, wherein respective metadata is stored for each entity of the plurality of entities;
determining a respective score for each respective entity of the plurality of entities based on comparing the respective pronunciation tag to the text query; and
selecting the entity by determining a maximum score.
(Item 27) The non-transitory computer-readable medium of item 21, wherein the entity is a first entity, further comprising instructions encoded thereon that, when executed by the control circuitry, cause the control circuitry to identify a second entity among the plurality of entities based on the text query and second metadata for the second entity, wherein the content item is associated with the first entity and the second entity.
(Item 28) The non-transitory computer-readable medium of item 21, further comprising instructions encoded thereon that, when executed by the control circuitry, cause the control circuitry to identify the entity among the plurality of entities of the database by comparing at least a portion of the text query to tags of the stored metadata and identifying a match.
(Item 29) The non-transitory computer-readable medium of item 21, wherein a first keyword of the one or more keywords is associated with two or more pronunciations of the first keyword.
(Item 30) The non-transitory computer-readable medium of item 21, wherein the pronunciation information comprises a phonemic representation of a first keyword of the one or more keywords.
(Item 31) A system for responding to a voice query, the system comprising:
means for receiving a voice query;
means for extracting one or more keywords from the voice query;
means for determining pronunciation information for the one or more keywords;
means for generating a text query based on the one or more keywords and the pronunciation information;
means for identifying an entity among a plurality of entities of a database based on the text query and stored metadata for the entity, wherein the metadata comprises a pronunciation tag; and
means for retrieving a content item associated with the entity.
(Item 32) The system of item 31, wherein the pronunciation information comprises a phoneme of one of the one or more keywords.
(Item 33) The system of item 31, wherein the means for identifying the entity comprises means for identifying the entity based on user profile information.
(Item 34) The system of item 33, wherein the means for identifying the entity comprises means for identifying the entity based on a previously identified entity from a previous voice query.
(Item 35) The system of item 31, wherein the means for identifying the entity comprises means for identifying the entity based on popularity information associated with the entity.
(Item 36) The system of item 31, wherein the means for identifying the entity comprises:
means for identifying the plurality of entities, wherein respective metadata is stored for each entity of the plurality of entities;
means for determining a respective score for each respective entity of the plurality of entities based on comparing the respective pronunciation tag to the text query; and
means for selecting the entity by determining a maximum score.
(Item 37) The system of item 31, wherein the entity is a first entity, the system further comprising means for identifying a second entity among the plurality of entities based on the text query and second metadata for the second entity, wherein the content item is associated with the first entity and the second entity.
(Item 38) The system of item 31, wherein the means for identifying the entity among the plurality of entities of the database comprises means for comparing at least a portion of the text query to tags of the stored metadata to identify a match.
(Item 39) The system of item 31, wherein a first keyword of the one or more keywords is associated with two or more pronunciations of the first keyword.
(Item 40) The system of item 31, wherein the pronunciation information comprises a phonemic representation of a first keyword of the one or more keywords.
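Items 33 to 35 (like items 3 to 5) state that identification may also draw on user profile information, previously identified entities, and popularity information, without saying how these signals are combined. One possible combination is a weighted sum, sketched below. The `Candidate` fields, the weights, and the sample values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical per-entity signals, each assumed to be precomputed."""
    name: str
    tag_match: float             # pronunciation-tag similarity, 0.0 to 1.0
    popularity: float            # popularity information, 0.0 to 1.0
    in_user_profile: bool        # entity appears in the user's profile
    previously_identified: bool  # entity was identified in a previous query


def combined_score(c: Candidate,
                   w_tag: float = 0.6,
                   w_pop: float = 0.2,
                   w_profile: float = 0.1,
                   w_prev: float = 0.1) -> float:
    """Weighted sum of the signals; the weights are assumptions."""
    return (w_tag * c.tag_match
            + w_pop * c.popularity
            + w_profile * float(c.in_user_profile)
            + w_prev * float(c.previously_identified))


# Two candidates with the same pronunciation match but different context.
candidates = [
    Candidate("Entity A", tag_match=0.9, popularity=0.2,
              in_user_profile=False, previously_identified=False),
    Candidate("Entity B", tag_match=0.9, popularity=0.5,
              in_user_profile=True, previously_identified=True),
]

selected = max(candidates, key=combined_score)
print(selected.name)
```

When pronunciation alone ties, the contextual signals break the tie in favor of the entity the user is more likely to mean.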
(Item 41) A method of responding to a voice query, the method comprising:
receiving a voice query at an audio interface;
extracting one or more keywords from the spoken query using control circuitry;
determining phonetic information for one or more keywords using a control circuit;
generating a text query based on one or more keywords and phonetic information using control circuitry;
identifying an entity among a plurality of entities in a database based on a text query and stored metadata about the entity, the metadata comprising a pronunciation tag;
and retrieving a content item associated with the entity.
42. The method of claim 41, wherein the phonetic information comprises one phoneme of the one or more keywords.
(Item 43) The method of any of items 41-42, wherein identifying the entity is further based on user profile information.
(Item 44) The method of any of items 41-43, wherein identifying the entity is based on previously identified entities from previous speech queries.
(Item 45) The method of any of items 41-44, wherein identifying the entity is further based on popularity information associated with the entity.
(Item 46) Identifying an entity is
identifying a plurality of entities, wherein respective metadata is stored for each entity of the plurality of entities;
determining a respective score for each respective entity of the plurality of entities based on comparing the respective pronunciation tag to the text query;
46. A method according to any of items 41-45, comprising selecting an entity by determining a maximum score.
(Item 47) The method of any of items 41-46, wherein the entity is a first entity, the method further comprising identifying a second entity of the plurality of entities based on the text query and second metadata about the second entity, wherein the content item is associated with the first entity and the second entity.
(Item 48) Any of the methods described above, wherein identifying the entity among the plurality of entities of the database includes comparing at least a portion of the text query to tags of the stored metadata to identify a match.
(Item 49) The method of any of items 41-48, wherein a first of the one or more keywords is associated with two or more pronunciations of the first keyword.
(Item 50) The method of any of items 41-49, wherein the phonetic information comprises a phoneme representation of a first one of the one or more keywords.
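The entity identification described in items 41 to 50 can be illustrated with a short sketch. This is not the claimed implementation: the `Entity` class, the `identify_entity` function, the ARPAbet-style phoneme strings, the sample data, and the additive scoring with a popularity weight are all assumptions introduced here for illustration. String similarity is computed with the standard-library `difflib.SequenceMatcher` as a stand-in for whatever matching the system actually uses.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List


@dataclass
class Entity:
    entity_id: str
    name: str
    pronunciation_tag: str   # hypothetical phoneme string stored in metadata
    popularity: float = 0.0  # hypothetical popularity value in [0, 1]


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity between two strings (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def identify_entity(keyword: str, phonemes: str, entities: List[Entity],
                    popularity_weight: float = 0.1) -> Entity:
    """Score every entity against the text query and keep the maximum.

    The text query is modelled as a keyword plus its phoneme representation.
    The weighting is an assumption, not something the source specifies.
    """
    def score(entity: Entity) -> float:
        return (similarity(keyword, entity.name)
                + similarity(phonemes, entity.pronunciation_tag)
                + popularity_weight * entity.popularity)

    return max(entities, key=score)


# Two hypothetical entities that share a spelling but differ in pronunciation.
catalog = [
    Entity("A", "Louis", "L UW IY", popularity=0.9),
    Entity("B", "Louis", "L UW IH S", popularity=0.2),
]
match = identify_entity("Louis", "L UW IH S", catalog)
print(match.entity_id)
```

In this sketch the pronunciation tag separates two entities whose names are spelled identically, which a text-only comparison could not do; popularity acts only as a small tie-breaking term.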
(Item 51) A method of responding to a voice query, the method comprising:
receiving a voice query at an audio interface;
extracting one or more keywords from the spoken query using control circuitry;
generating a text query based on one or more keywords using control circuitry;
identifying an entity based on the text query and metadata about the entity, the metadata comprising one or more alternative textual representations of the entity based on the pronunciation of an identifier associated with the entity;
and retrieving a content item associated with the entity.
(Item 52) The method of item 51, wherein the one or more alternative textual representations comprise phonemic representations of the entity.
(Item 53) The method of item 51, wherein the one or more alternative textual representations comprise phonetic-based alternative spellings of the entity.
(Item 54) The method of item 51, wherein the one or more alternative textual representations of the entity comprise text strings generated based on previous speech-to-text conversions.
(Item 55) The method of item 51, wherein the one or more alternative text representations comprise a plurality of alternative text representations, each alternative text representation of the plurality of alternative text representations being generated by:
converting a first textual representation into an audio file; and
converting the audio file into a second textual representation, the second textual representation being not identical to the first textual representation.
(Item 56) The method of item 51, wherein identifying the entity is further based on user profile information.
(Item 57) The method of item 51, wherein identifying the entity is further based on popularity information associated with the entity.
(Item 58) The method of item 51, wherein identifying the entity comprises:
identifying a plurality of entities, wherein respective metadata is stored for each entity of the plurality of entities;
determining a respective score for each respective entity of the plurality of entities based on comparing the respective one or more alternative textual representations to the text query; and
selecting the entity by determining a maximum score.
(Item 59) The method of item 51, further comprising generating a plurality of text queries, the plurality of text queries comprising the text query, wherein each text query of the plurality of text queries is generated based on a respective setting of a speech-to-text module of the control circuitry.
(Item 60) The method of item 59, further comprising:
identifying a respective entity based on each respective text query of the plurality of text queries;
determining a respective score for each respective entity based on a comparison of the respective text query to metadata associated with the respective entity; and
identifying the entity by selecting the maximum score of the respective scores.
(Item 61) A system for responding to voice queries, the system comprising:
an audio interface for receiving voice queries;
and control circuitry configured to:
extract one or more keywords from the voice query;
generate a text query based on the one or more keywords;
identify an entity based on the text query and metadata about the entity, the metadata comprising one or more alternative textual representations of the entity based on the pronunciation of an identifier associated with the entity; and
retrieve a content item associated with the entity.
(Item 62) The system of item 61, wherein the one or more alternative textual representations comprise phonemic representations of the entity.
(Item 63) The system of item 61, wherein the one or more alternative textual representations comprise phonetic-based alternative spellings of the entity.
(Item 64) The system of Item 61, wherein the one or more alternative textual representations of the entity comprise text strings generated based on previous speech-to-text conversions.
(Item 65) The system of item 61, wherein the one or more alternative text representations comprise a plurality of alternative text representations, and wherein the control circuitry is further configured to generate each alternative text representation of the plurality of alternative text representations by:
converting a first textual representation into an audio file; and
converting the audio file into a second textual representation, the second textual representation being not identical to the first textual representation.
(Item 66) The system of item 61, wherein the control circuitry is further configured to identify the entity based on user profile information.
(Item 67) The system of item 61, wherein the control circuitry is further configured to identify the entity based on popularity information associated with the entity.
(Item 68) The system of item 61, wherein the control circuitry is further configured to identify the entity by:
identifying a plurality of entities, wherein respective metadata is stored for each entity of the plurality of entities;
determining a respective score for each respective entity of the plurality of entities based on comparing the respective one or more alternative textual representations to the text query; and
selecting the entity by determining a maximum score.
(Item 69) The system of item 61, wherein the control circuitry comprises a speech-to-text module and is further configured to generate a plurality of text queries, the plurality of text queries comprising the text query, wherein each text query of the plurality of text queries is generated based on a respective setting of the speech-to-text module.
(Item 70) The system of item 69, wherein the control circuitry is further configured to:
identify a respective entity based on each respective text query of the plurality of text queries;
determine a respective score for each respective entity based on a comparison of the respective text query to metadata associated with the respective entity; and
identify the entity by selecting the maximum score of the respective scores.
(Item 71) A non-transitory computer-readable medium having instructions encoded thereon that, when executed by control circuitry, cause the control circuitry to:
receive a voice query at an audio interface;
extract one or more keywords from the voice query;
generate a text query based on the one or more keywords;
identify an entity based on the text query and metadata about the entity, the metadata comprising one or more alternative textual representations of the entity based on the pronunciation of an identifier associated with the entity; and
retrieve a content item associated with the entity.
(Item 72) The non-transitory computer-readable medium of Item 71, wherein the one or more alternative textual representations comprise phonemic representations of the entity.
(Item 73) The non-transitory computer-readable medium of Item 71, wherein the one or more alternative textual representations comprise phonetic-based alternative spellings of the entity.
(Item 74) The non-transitory computer-readable medium of Item 71, wherein the one or more alternative textual representations of the entity comprise text strings generated based on previous speech-to-text conversions.
(Item 75) The non-transitory computer-readable medium of item 71, wherein the one or more alternative textual representations comprise a plurality of alternative textual representations, the medium further comprising encoded instructions that, when executed by the control circuitry, cause the control circuitry to generate each alternative textual representation of the plurality of alternative textual representations by:
converting a first textual representation into an audio file; and
converting the audio file into a second textual representation, the second textual representation being not identical to the first textual representation.
(Item 76) The non-transitory computer-readable medium of item 71, further comprising encoded instructions that, when executed by the control circuitry, cause the control circuitry to identify the entity based on user profile information.
(Item 77) The non-transitory computer-readable medium of item 71, further comprising encoded instructions that, when executed by the control circuitry, cause the control circuitry to identify the entity based on popularity information associated with the entity.
(Item 78) The non-transitory computer-readable medium of item 71, further comprising encoded instructions that, when executed by the control circuitry, cause the control circuitry to identify the entity by:
identifying a plurality of entities, wherein respective metadata is stored for each entity of the plurality of entities;
determining a respective score for each respective entity of the plurality of entities based on comparing the respective one or more alternative textual representations to the text query; and
selecting the entity by determining a maximum score.
(Item 79) The non-transitory computer-readable medium of item 71, further comprising encoded instructions that, when executed by the control circuitry, cause the control circuitry to generate a plurality of text queries, the plurality of text queries comprising the text query, wherein each text query of the plurality of text queries is generated based on a respective setting of a speech-to-text module of the control circuitry.
(Item 80) The non-transitory computer-readable medium of item 79, further comprising encoded instructions that, when executed by the control circuitry, cause the control circuitry to:
identify a respective entity based on each respective text query of the plurality of text queries;
determine a respective score for each respective entity based on a comparison of the respective text query to metadata associated with the respective entity; and
identify the entity by selecting the maximum score of the respective scores.
(Item 81) A system for responding to voice queries, the system comprising:
means for receiving voice queries at an audio interface;
means for extracting one or more keywords from a spoken query;
means for generating a text query based on one or more keywords;
means for identifying an entity based on the text query and metadata about the entity, the metadata comprising one or more alternative textual representations of the entity based on pronunciation of an identifier associated with the entity;
and means for retrieving content items associated with the entity.
(Item 82) The system of item 81, wherein the one or more alternative textual representations comprise phonemic representations of the entity.
(Item 83) The system of item 81, wherein the one or more alternative textual representations comprise phonetic-based alternative spellings of the entity.
(Item 84) The system of Item 81, wherein the one or more alternative textual representations of the entity comprise text strings generated based on previous speech-to-text conversions.
(Item 85) The system of item 81, wherein the one or more alternative text representations comprise a plurality of alternative text representations, each alternative text representation of the plurality of alternative text representations being generated by:
means for converting a first textual representation into an audio file; and
means for converting the audio file into a second textual representation, wherein the second textual representation is not identical to the first textual representation.
(Item 86) The system of item 81, wherein the means for identifying the entity further comprises means for identifying the entity based on user profile information.
(Item 87) The system of item 81, wherein the means for identifying the entity further comprises means for identifying the entity based on popularity information associated with the entity.
(Item 88) The system of item 81, wherein the means for identifying the entity comprises:
means for identifying a plurality of entities, wherein respective metadata is stored for each entity of the plurality of entities;
means for determining a respective score for each respective entity of the plurality of entities based on comparing the respective one or more alternative textual representations to the text query; and
means for selecting the entity by determining a maximum score.
(Item 89) The system of item 81, further comprising means for generating a plurality of text queries, the plurality of text queries comprising the text query, wherein each text query of the plurality of text queries is generated based on a respective setting of a speech-to-text module of the control circuitry.
(Item 90)
means for identifying respective entities based on respective text queries of the plurality of text queries;
means for determining a respective score for each entity based on comparison of each text query to metadata associated with each entity;
and means for identifying entities by selecting the maximum score of each score.
(Item 91) A method of responding to a voice query, the method comprising:
receiving a voice query at an audio interface;
extracting one or more keywords from the spoken query using control circuitry;
generating a text query based on one or more keywords using control circuitry;
identifying an entity based on the text query and metadata about the entity, the metadata comprising one or more alternative textual representations of the entity based on the pronunciation of an identifier associated with the entity;
and retrieving a content item associated with the entity.
(Item 92) The method of item 91, wherein the one or more alternative textual representations comprise phonemic representations of the entity.
(Item 93) The method of any of items 91-92, wherein the one or more alternative textual representations comprise phonetic-based alternative spellings of entities.
(Item 94) The method of any of items 91-93, wherein the one or more alternative textual representations of the entity comprise text strings generated based on previous speech-to-text conversions.
(Item 95) The method of any of items 91-94, wherein the one or more alternative text representations comprise a plurality of alternative text representations, each alternative text representation of the plurality of alternative text representations being generated by:
converting a first textual representation into an audio file; and
converting the audio file into a second textual representation, wherein the second textual representation is not identical to the first textual representation.
(Item 96) The method of any of items 91-95, wherein identifying the entity is further based on user profile information.
(Item 97) The method of any of items 91-96, wherein identifying the entity is further based on popularity information associated with the entity.
(Item 98) Identifying the entity comprises:
identifying a plurality of entities, wherein respective metadata is stored for each entity of the plurality of entities;
determining a respective score for each respective entity of the plurality of entities based on comparing the respective one or more alternative textual representations to the text query; and
selecting the entity by determining a maximum score.
(Item 99) The method of any of items 91-98, further comprising generating a plurality of text queries, the plurality of text queries comprising the text query, wherein each text query of the plurality of text queries is generated based on a respective setting of a speech-to-text module of the control circuitry.
(Item 100) The method of item 99, further comprising:
identifying a respective entity based on each respective text query of the plurality of text queries;
determining a respective score for each respective entity based on a comparison of the respective text query to metadata associated with the respective entity; and
identifying the entity by selecting the maximum score of the respective scores.
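The multi-query selection of items 99 and 100 can be sketched as below. This is a minimal illustration under stated assumptions: the text queries are taken as already produced by a speech-to-text module run under different settings, the metadata is a plain dictionary mapping a hypothetical entity identifier to a non-empty list of alternative text representations, and the similarity measure (`difflib.SequenceMatcher`) is a stand-in, not the comparison the source describes. All names and sample strings are invented for the example.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity between two strings (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_entity_for_query(text_query: str,
                          metadata: Dict[str, List[str]]) -> Tuple[str, float]:
    """Return (entity_id, score) for the entity whose alternative text
    representations best match a single text query."""
    scored = [
        (entity_id, max(similarity(text_query, alt) for alt in alternatives))
        for entity_id, alternatives in metadata.items()
    ]
    return max(scored, key=lambda pair: pair[1])


def identify_entity(text_queries: List[str],
                    metadata: Dict[str, List[str]]) -> str:
    """Identify one candidate entity per text query, then select the
    candidate with the maximum score across all queries."""
    candidates = [best_entity_for_query(q, metadata) for q in text_queries]
    return max(candidates, key=lambda pair: pair[1])[0]


# Hypothetical metadata: alternative spellings stored per entity.
metadata = {
    "entity_a": ["louis", "lewis", "looey"],
    "entity_b": ["lois", "loyce"],
}
# Hypothetical outputs of one utterance under two speech-to-text settings.
print(identify_entity(["lewis", "louise"], metadata))
```

Because each setting may transcribe the same utterance differently, keeping the best-scoring candidate across all transcriptions makes the lookup less sensitive to any single transcription error.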
(Item 101) A method of generating entity metadata for a voice query, the method comprising:
identifying an entity for which information is stored among a plurality of entities;
generating an audio file using a text-to-speech module based on a first text string and at least one speech criterion, wherein the first text string describes the entity;
generating a second text string based on the audio file using a speech-to-text module;
comparing the second text string to the first text string; and
storing the second text string in metadata associated with the entity if the second text string is not identical to the first text string.
(Item 102) The method of item 101, wherein the at least one speech criterion comprises a pronunciation setting.
(Item 103) The method of item 101, wherein the at least one speech criterion comprises a language setting.
(Item 104) The method of item 101, wherein the at least one speech criterion comprises a plurality of speech criteria, the method further comprising:
generating respective audio files based on the first text string and respective speech criteria using the text-to-speech module;
generating respective second text strings based on the respective audio files using the speech-to-text module;
comparing each respective second text string to the first text string; and
storing each respective second text string in metadata associated with the entity if it is not identical to the first text string.
(Item 105) The method of item 101, further comprising updating metadata based on one or more text queries.
(Item 106) The method of item 101, further comprising storing the phoneme representation of the first text string in metadata associated with the entity.
(Item 107) The method of item 101, wherein generating the audio file based on the first text string comprises:
converting the first text string into a first audio signal;
generating speech at a speaker based on the first audio signal;
detecting the speech using a microphone to generate a second audio signal; and
processing the second audio signal to generate the audio file.
(Item 108) The method of item 107, wherein generating the speech at the speaker is further based on at least one speech setting of the text-to-speech module.
(Item 109) The method of item 101, wherein generating the second text string based on the audio file comprises:
producing playback of the audio file on a speaker;
detecting the playback and generating an audio signal using a microphone; and
converting the audio signal into the second text string by identifying one or more words.
(Item 110) The method of item 109, wherein converting the audio signal to the second text string is based on at least one text setting of the speech-to-text module.
(Item 111) A system for generating entity metadata for a voice query, the system comprising control circuitry configured to:
identify an entity for which information is stored among a plurality of entities;
generate an audio file, using an audio interface coupled to the control circuitry, based on a first text string and at least one speech criterion, the first text string describing the entity;
generate a second text string based on the audio file using the audio interface;
compare the second text string to the first text string; and
store the second text string in metadata associated with the entity if the second text string is not identical to the first text string.
(Item 112) The system of item 111, wherein the at least one speech criterion comprises a pronunciation setting.
(Item 113) The system of item 111, wherein the at least one speech criterion comprises a language setting.
(Item 114) The at least one speech criterion comprises a plurality of speech criteria, and the control circuitry is further configured to:
generate respective audio files, using the audio interface, based on the first text string and respective speech criteria;
generate respective second text strings based on the respective audio files using the audio interface;
compare each respective second text string to the first text string; and
store each respective second text string in metadata associated with the entity if it is not identical to the first text string.
(Item 115) The system of item 111, wherein the control circuitry is further configured to update the metadata based on one or more text queries.
(Item 116) The system of item 111, wherein the control circuitry is further configured to store the phoneme representation of the first text string in metadata associated with the entity.
(Item 117) The system of item 111, wherein the audio interface includes a speaker and a microphone, and wherein the control circuitry is further configured to generate the audio file based on the first text string by:
converting the first text string into a first audio signal;
generating speech at the speaker based on the first audio signal;
detecting the speech using the microphone to generate a second audio signal; and
processing the second audio signal to generate the audio file.
(Item 118) The system of item 117, wherein the control circuitry is further configured to generate the speech at the speaker based on at least one speech setting.
(Item 119) The system of item 111, wherein the audio interface includes a speaker and a microphone, and wherein the control circuitry is further configured to generate the second text string based on the audio file by:
producing playback of the audio file on the speaker;
detecting the playback at the microphone and generating an audio signal; and
converting the audio signal into the second text string by identifying one or more words.
(Item 120) The system of item 119, wherein the control circuit is further configured to convert the audio signal to the second text string based on at least one text setting of the speech-to-text module.
(Item 121) A non-transitory computer-readable medium having encoded instructions that, when executed by a control circuit,
identifying an entity for which information is stored among a plurality of entities;
generating an audio file based on a first text string and at least one speech criterion, wherein the first text string describes an entity;
generating a second text string based on the audio file;
comparing the second text string to the first text string;
and storing the second text string in metadata associated with the entity if it is not identical to the first text string.
(Item 122) The non-transitory computer-readable medium of Item 121, wherein the at least one speech criterion comprises pronunciation settings.
(Item 123) The non-transitory computer-readable medium of item 121, wherein the at least one speech criterion comprises a language setting.
(Item 124) The non-transitory computer-readable medium of item 121, wherein the at least one speech criterion comprises a plurality of speech criteria, the medium further comprising encoded instructions that, when executed by the control circuitry, cause the control circuitry to:
generate respective audio files based on the first text string and respective speech criteria;
generate respective second text strings based on the respective audio files;
compare each respective second text string to the first text string; and
store each respective second text string in metadata associated with the entity if it is not identical to the first text string.
(Item 125) The non-transitory computer-readable medium of item 121, further comprising encoded instructions that, when executed by the control circuitry, cause the control circuitry to update the metadata based on one or more text queries.
(Item 126) The non-transitory computer-readable medium of item 121, further comprising encoded instructions that, when executed by the control circuitry, cause the control circuitry to store a phoneme representation of the first text string in metadata associated with the entity.
(Item 127) The non-transitory computer-readable medium of item 121, further comprising encoded instructions that, when executed by the control circuitry, cause the control circuitry to:
convert the first text string into a first audio signal;
generate speech at a speaker based on the first audio signal;
detect the speech using a microphone to generate a second audio signal; and
process the second audio signal to generate the audio file.
(Item 128) Further comprising encoded instructions which, when executed by the control circuit, cause the control circuit to generate speech at the speaker based on at least one speech setting of the text-to-speech module. 3. The non-transitory computer-readable medium of .
(Item 129) The non-transitory computer-readable medium of item 121, further comprising encoded instructions that, when executed by the control circuitry, cause the control circuitry to:
produce playback of the audio file on a speaker;
detect the playback and generate an audio signal using a microphone; and
convert the audio signal into the second text string by identifying one or more words.
(Item 130) The non-transitory computer-readable medium of item 129, further comprising encoded instructions that, when executed by the control circuitry, cause the control circuitry to convert the audio signal into the second text string based on at least one text setting of a speech-to-text module.
(Item 131) A system for generating entity metadata for a voice query, the system comprising:
means for identifying an entity of a plurality of entities for which information is stored;
means for generating an audio file based on a first text string and at least one speech criterion, wherein the first text string describes an entity;
means for generating a second text string based on the audio file;
means for comparing the second text string to the first text string;
means for storing the second text string in metadata associated with the entity if it is not identical to the first text string.
(Item 132) The system of item 131, wherein the at least one speech criterion comprises a pronunciation setting.
(Item 133) The system of item 131, wherein the at least one speech criterion comprises a language setting.
(Item 134) The system of item 131, wherein the at least one speech criterion comprises a plurality of speech criteria, the system further comprising:
means for generating respective audio files based on the first text string and respective speech criteria;
means for generating respective second text strings based on the respective audio files;
means for comparing each respective second text string to the first text string; and
means for storing each respective second text string in metadata associated with the entity if it is not identical to the first text string.
(Item 135) The system of item 131, further comprising means for updating metadata based on one or more text queries.
(Item 136) The system of item 131, further comprising means for storing the phoneme representation of the first text string in metadata associated with the entity.
(Item 137) The means for generating an audio file based on the first text string comprises:
means for converting the first text string into a first audio signal;
means for generating speech at a speaker based on the audio signal;
means for detecting speech and generating a second audio signal using a microphone;
and means for processing audio signals and generating audio files.
(Item 138) The system of item 137, wherein the means for generating speech at the speaker further comprises means for generating speech at the speaker based on at least one speech setting of the text-to-speech module.
(Item 139) The means for generating a second text string based on the audio file comprises:
means for generating playback of an audio file on a speaker;
means for detecting playback and generating an audio signal using a microphone;
means for converting the audio signal into a second text string by identifying one or more words.
(Item 140) The system of item 139, wherein the means for converting the audio signal to the second text string comprises means for converting the audio signal to the second text string based on at least one text setting of the speech-to-text module.
(Item 141) A method of generating entity metadata for a voice query, the method comprising:
identifying an entity for which information is stored among a plurality of entities;
generating an audio file using a text-to-speech module based on a first text string and at least one speech criterion, wherein the first text string describes the entity;
generating a second text string based on the audio file using a speech-to-text module;
comparing the second text string to the first text string; and
storing the second text string in metadata associated with the entity if the second text string is not identical to the first text string.
(Item 142) The method of item 141, wherein the at least one speech criterion comprises a pronunciation setting.
(Item 143) The method of any of items 141-142, wherein the at least one speech criterion comprises a language setting.
(Item 144) The method of any of items 141-143, wherein the at least one speech criterion comprises a plurality of speech criteria, the method further comprising:
generating respective audio files based on the first text string and respective speech criteria using the text-to-speech module;
generating respective second text strings based on the respective audio files using the speech-to-text module;
comparing each respective second text string to the first text string; and
storing each respective second text string in metadata associated with the entity if it is not identical to the first text string.
(Item 145) The method of any of items 141-144, further comprising updating metadata based on one or more text queries.
(Item 146) The method of any of items 141-145, further comprising storing the phoneme representation of the first text string in metadata associated with the entity.
(Item 147) The method of any of items 141-146, wherein generating the audio file based on the first text string comprises:
converting the first text string into a first audio signal;
generating speech at a speaker based on the first audio signal;
detecting the speech using a microphone to generate a second audio signal; and
processing the second audio signal to generate the audio file.
(Item 148) The method of item 147, wherein generating the speech at the speaker is further based on at least one speech setting of the text-to-speech module.
(Item 149) Generating a second text string based on the audio file includes:
producing playback of the audio file on the speaker;
detecting playback and generating an audio signal using a microphone;
converting the audio signal into a second text string by identifying one or more words.
(Item 150) The method of item 149, wherein converting the audio signal to the second text string is based on at least one text setting of the Speech to Text module.
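The metadata-generation loop of items 141 to 144 (text to speech, back to text, then store the result only when it differs from the original) can be sketched as follows. This is an illustration under stated assumptions, not the described implementation: the text-to-speech and speech-to-text modules are passed in as plain callables, and `fake_tts` and `fake_stt` are stand-ins invented here so the example runs without any audio hardware or speech library. The entity identifier, the speech criteria names, and the skipping of duplicate strings are likewise assumptions.

```python
from typing import Callable, Dict, Iterable, List


def generate_entity_metadata(
    entity_id: str,
    first_text: str,
    speech_criteria: Iterable[str],
    text_to_speech: Callable[[str, str], bytes],
    speech_to_text: Callable[[bytes], str],
    metadata: Dict[str, List[str]],
) -> None:
    """Round-trip first_text through TTS and STT once per speech criterion
    and store every second text string that differs from the first."""
    stored = metadata.setdefault(entity_id, [])
    for criterion in speech_criteria:
        audio_file = text_to_speech(first_text, criterion)
        second_text = speech_to_text(audio_file)
        # Store only when not identical to the first text string.
        # Skipping duplicates is a convenience added in this sketch.
        if second_text != first_text and second_text not in stored:
            stored.append(second_text)


def fake_tts(text: str, criterion: str) -> bytes:
    """Stand-in TTS: encodes the criterion and text instead of real audio."""
    return f"{criterion}|{text}".encode("utf-8")


def fake_stt(audio_file: bytes) -> str:
    """Stand-in STT: pretends one accent setting is transcribed differently."""
    criterion, text = audio_file.decode("utf-8").split("|", 1)
    return "Lewis" if criterion == "accent_b" else text


metadata: Dict[str, List[str]] = {}
generate_entity_metadata("entity_1", "Louis", ["accent_a", "accent_b"],
                         fake_tts, fake_stt, metadata)
print(metadata)
```

Injecting the two modules as callables keeps the loop independent of any particular speech engine, so the same function could be driven by real modules or, as here, by test doubles.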

Claims

A method of responding to voice queries, the method comprising:
receiving a voice query at an audio interface;
extracting one or more keywords from the spoken query using control circuitry;
generating a text query based on the one or more keywords using the control circuit;
identifying an entity, wherein identifying the entity is based on the text query and metadata about the entity, the metadata comprising one or more alternative textual representations of the entity; said alternative text representation is based on the pronunciation of an identifier associated with said entity;
and retrieving a content item associated with said entity.

2. The method of claim 1, wherein the one or more alternative textual representations comprise phonemic representations of the entity.

3. The method of any of claims 1 and 2, wherein the one or more alternative textual representations comprise phonetic-based alternative spellings of the entity.

The method of any of claims 1-3, wherein said one or more alternative textual representations of said entity comprise text strings generated based on previous speech-to-text conversions.

5. The method of any of claims 1-4, wherein the one or more alternative textual representations comprise a plurality of alternative textual representations, each alternative textual representation of the plurality of alternative textual representations being generated by:
converting a first textual representation into an audio file; and
converting the audio file into a second textual representation,
wherein the second textual representation is not identical to the first textual representation.

6. The method of any of claims 1-5, wherein identifying the entity is further based on user profile information.

7. The method of any of claims 1-6, wherein identifying the entity is further based on popularity information associated with the entity.

8. The method of any of claims 1-7, wherein identifying the entity comprises:
identifying a plurality of entities, wherein respective metadata is stored for each entity of the plurality of entities;
determining a respective score for each respective entity of the plurality of entities based on comparing the respective one or more alternative textual representations to the text query; and
selecting the entity by determining a maximum score.
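
The scoring and selection steps above can be illustrated with a short sketch. The source does not specify how the comparison is computed; here the similarity ratio from Python's standard `difflib.SequenceMatcher` is swapped in as an assumption, and the function names, entity names, and metadata are hypothetical.

```python
from difflib import SequenceMatcher


def score_entity(text_query: str, alternative_texts: list[str]) -> float:
    """Best similarity between the query and any alternative representation."""
    query = text_query.lower()
    return max(
        (SequenceMatcher(None, query, alt.lower()).ratio() for alt in alternative_texts),
        default=0.0,
    )


def select_entity(text_query: str, metadata: dict[str, list[str]]) -> tuple[str, float]:
    """Score every entity and return the one with the maximum score."""
    if not metadata:
        raise ValueError("no entities to score")
    scores = {name: score_entity(text_query, alts) for name, alts in metadata.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical metadata: two entities whose identifiers sound different.
metadata = {
    "Entity A": ["louis", "louie", "loo-ee"],
    "Entity B": ["lewis", "loo-iss"],
}
```

A query transcribed as "louie" selects Entity A, while "lewis" selects Entity B, even though both entities might share a single written identifier. Ties resolve to the first entity scored, which a production system would likely break using profile or popularity information.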

9. The method of any of claims 1-8, further comprising generating a plurality of text queries, the plurality of text queries comprising the text query, wherein each text query of the plurality of text queries is generated based on a respective setting of a speech-to-text module of the control circuitry.

10. The method of claim 9, further comprising:
identifying respective entities based on respective text queries of the plurality of text queries;
determining a respective score for each entity based on comparing the respective text query to metadata associated with the respective entity; and
identifying the entity by selecting the maximum score of the respective scores.

11. A system for responding to voice queries, the system comprising:
memory; and
means for implementing the method steps of any of claims 1-10.

12. A non-transitory computer-readable medium having instructions encoded thereon that, when executed by control circuitry, cause the control circuitry to perform the method steps of any of claims 1-10.

13. A system for responding to voice queries, the system comprising means for implementing the method steps of any of claims 1-10.