JP4192703B2 - Content processing apparatus, content processing method, and program

Info

Publication number: JP4192703B2
Application number: JP2003188908A
Authority: JP
Inventors: 尚志斯波; 真琴岩田; 直博竹田
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2003-06-30
Filing date: 2003-06-30
Publication date: 2008-12-10
Anticipated expiration: 2023-06-30
Also published as: JP2005025413A

Description

【０００１】
【発明の属する技術分野】
本発明は、コンテンツ処理装置、コンテンツ処理方法及びプログラムに関する。
【０００２】
【従来の技術】
近年、情報のマルチメディア化が進み、映像情報、音声情報およびテキスト情報等とを含むマルチメディアコンテンツの情報量は、急激に増大している。これらの情報を記憶しておき、後に必要に応じて再び呼び出すことにより、より有効に利用することができる。
【０００３】
このため、このようなマルチメディアコンテンツを有効に利用できるようにしたコンテンツ処理装置がある。
このコンテンツ処理装置は、映像情報に付加されている音声情報に基づいて音声認識を行い、音声認識結果であるテキスト情報と映像情報とを構造化する。
【０００４】
複数種類の情報を対応付けるシステムの一例として、原稿上の文字を文字認識して電子化した第１テキストと、音声情報を音声認識して電子化した第２テキストとから最適な第３テキストを生成して、デジタルデータのテキスト情報を得る電子化テキスト作成システムがある（例えば、特許文献１参照）。
【０００５】
また、映像情報に付加されている音声情報にもとづいて音声認識を行い、音声認識結果であるテキスト情報と映像情報とを構造化するシステムがある（例えば、特許文献２参照）。
【０００６】
さらに、映像情報、音声情報、テキスト情報等をデジタル化し、それらの情報と、それらの情報の時間的関連を示す時間情報とを保存するマルチメディア情報処理装置がある（例えば、特許文献３及び特許文献４参照）。
【０００７】
【特許文献１】
特開２００１−２８２７７９号公報（第３−８頁、図１）
【特許文献２】
特開２００２−１８９７２８号公報（第３頁、図１）
【特許文献３】
特開平８−２５３２０９号公報（第３−８頁、図５）
【特許文献４】
特開２００２−２７８９７４号公報（第４頁−第６頁、図１）
【０００８】
【発明が解決しようとする課題】
しかし、特許文献１に記載されている従来のコンテンツ処理装置では、原稿と音声認識処理結果のテキストとのマッチング処理を行うので、より正確なテキスト情報を自動的に作成することができる。しかし、特許文献１には、映像情報等の他の情報との対応付けに関する開示がないので、特許文献１に記載された技術を映像情報、音声情報およびテキスト情報の構造化処理に適用する場合には、人為的に構造化処理を行わなくてはならない。
【０００９】
また、特許文献２に記載されている従来のコンテンツ処理装置では、映像情報と映像情報に付加されている音声情報とを構造化することができるが、特許文献２には、さらにテキスト情報を構造化することに関して何ら開示されていない。
【００１０】
さらに、特許文献３に記載されている従来のコンテンツ処理装置では、映像情報、音声情報およびテキスト情報等の各情報間の構造化処理を、入力された時間における時間情報を用いて行っているが、時間情報は、映像情報、音声情報およびテキスト情報等を作成するときに付加されている必要がある。従って、時間情報が付加されていない場合には、テキスト情報、映像情報および音声情報を自動的に対応付けることはできず、対応付けのために人手を要することになる。
【００１１】
特許文献４に記載されている従来のコンテンツ処理装置では、テキスト情報である映画やドラマ、演劇の台本に時間情報が記載されていることを想定し、映像情報の経過時間と比較することにより、テキスト情報と映像情報の対応付けを行っている。しかし、時間情報は台本に記載されていないことが多い。時間情報が記載されている場合でも、実際の撮影や編集によって台本と映像に時間的ズレがある場合が多い。特にテレビドラマのようにシーンが１秒以下で切り替わる場合には、これらのずれが大きく影響し、対応付けができなくなる可能性が高い。
【００１２】
さらに何れの従来のコンテンツ処理装置でも、音声認識に大きく依存しているが、実際のテレビ番組では、人物が登場していても発言の無いシーンや、音声環境が悪く、音声認識が困難である場合が多い。また、出演者のアドリブにより発言内容が台本と大きく異なってしまい、音声情報とテキスト情報を対応付けられない場合がある。
【００１３】
また、音声情報とテキスト情報との対応付けが困難であれば、利用可能なコンテンツを生成するには、時間を要することになり、コンテンツを利用するには、限界がある。
【００１４】
本発明は、このような従来の問題点に鑑みてなされたもので、コンテンツを容易に利用することが可能なコンテンツ処理装置、コンテンツ処理方法及びプログラムを提供することを目的とする。
【００１５】
【課題を解決するための手段】
この目的を達成するため、本発明の第１の観点に係るコンテンツ処理装置は、
映像情報及び音声情報を含むコンテンツと前記コンテンツの筋書きを文字データで表現した台本情報との対応付けを行うコンテンツ処理装置において、
前記コンテンツの認識処理を行って、前記コンテンツが含む各場面の特徴部分を文字データで表現した画像認識文字情報及び音声認識文字情報を生成する認識部と、
前記認識部が生成した画像認識文字情報及び音声認識文字情報の前記特徴部分をそれぞれ抽出して区切り、区切った各部分の開始時刻と終了時刻とを示すタイムコードを生成するタイムコード生成部と、
前記台本情報の特徴部分を取得して、取得した特徴部分に基づいて前記台本情報を各場面毎に区切り、前記認識部が生成した前記画像認識文字情報と前記音声認識文字情報とを、前記タイムコード生成部が生成したタイムコードが示す位置で分割し、分割した前記画像認識文字情報と前記音声認識文字情報とが一致しない場合でもそれぞれを正しいものと判断して、前記台本情報の各場面と分割した前記画像認識文字情報と前記音声認識文字情報との対応付けを行い、前記台本情報の各場面と分割した前記画像認識文字情報と前記音声認識文字情報との対応関係を示す対応情報を生成するマッピング部と、
前記マッピング部が生成した対応情報と前記タイムコード生成部が生成したタイムコードとに基づいて、前記台本情報の各場面と各場面のタイムコードとの関係を示す構造化情報を、必要なコンテンツを検索するための情報として生成する構造化処理部と、を備えたものである。
【００１８】
前記台本情報の画面構成を予測して、予測した画面構成を前記台本情報に付加して前記マッピング部に出力するシーン予測部を備えてもよい。
【００１９】
前記台本情報と、前記コンテンツと、前記タイムコード生成部が生成したタイムコードと、前記構造化処理部が生成した構造化情報と、を格納するためのデータ格納部を備えてもよい。
【００２０】
前記データ格納部は、
前記コンテンツを格納するコンテンツ格納部と、
前記台本情報と前記構造化情報とを格納するテキストファイル格納部と、
前記タイムコードを格納するタイムコード格納部と、を備えたものであってもよい。
【００２１】
前記構造化処理部は、前記テキストファイル格納部における前記台本情報を収納した台本情報ファイルと前記タイムコードを収納したタイムコードファイルとを生成し、前記台本情報ファイルの各区切りの開始アドレスと終了アドレスと、前記タイムコードファイルにおける各区切りのタイムコードの開始アドレスと終了アドレスとを示す管理情報を生成し、
前記データ格納部は、前記管理情報を格納する管理情報格納部を備えたものであってもよい。
【００２２】
前記データ格納部は、
前記台本情報をマークアップランゲージファイルとして記憶するマークアップランゲージファイル格納部を備えたものであってもよい。
【００２３】
前記データ格納部に格納された前記台本情報と前記コンテンツとを、入力された検索条件に基づいて同期させて出力する同期データ出力部を備えたものであってもよい。
【００２４】
前記同期データ出力部は、
前記台本情報とコンテンツとから必要な場面を抽出するための検索条件を入力し、前記検索条件に対応する場面の台本情報とコンテンツとを出力する入出力部と、
前記入出力部に入力された検索条件に対応する台本情報における場面を特定し、特定した前記場面に対応するタイムコードを抽出する検索制御部と、
抽出された前記タイムコードに対応するコンテンツの場面を特定し、特定した場面のコンテンツと検索条件に対応する台本情報とを同期させる同期処理部と、
前記同期処理部が同期した当該場面に対応するコンテンツと台本情報とを前記入出力部に出力する同期処理部と、を備えたものであってもよい。
【００２５】
本発明の第２の観点に係るコンテンツ処理方法は、
映像情報及び音声情報を含むコンテンツと前記コンテンツの筋書きを文字データで表現した台本情報との対応付けを行うコンテンツ処理方法であって、
前記コンテンツの認識処理を行って、前記コンテンツが含む各場面の特徴部分を文字データで表現した画像認識文字情報及び音声認識文字情報を生成するステップと、
前記生成された画像認識文字情報及び音声認識文字情報の前記特徴部分をそれぞれ抽出して区切り、区切った各部分の開始時刻と終了時刻とを示すタイムコードを生成するステップと、
前記台本情報の特徴部分を取得して、取得した特徴部分に基づいて前記台本情報を各場面毎に区切り、前記生成した前記画像認識文字情報と音声認識文字情報とを、生成した前記タイムコードが示す位置で分割し、分割した前記画像認識文字情報と前記音声認識文字情報とが一致しない場合でもそれぞれを正しいものと判断して、前記台本情報の各場面と分割した前記画像認識文字情報と前記音声認識文字情報との対応付けを行い、前記台本情報の各場面と分割した前記画像認識文字情報と前記音声認識文字情報との対応関係を示す対応情報を生成するステップと、
前記生成した対応情報と前記生成したタイムコードとに基づいて、前記台本情報の各場面と各場面のタイムコードとの関係を示す構造化情報を、必要なコンテンツを検索するための情報として生成するステップと、
前記台本情報、コンテンツ、タイムコード、構造化情報を記憶するステップと、を備えたものである。
【００２６】
本発明の第３の観点に係るプログラムは、
コンピュータに、
前記コンテンツに含まれている映像情報及び音声情報の認識処理を行って、前記コンテンツが含む各場面の特徴部分を文字データで表現した画像認識文字情報及び音声認識文字情報を生成する手順、
前記生成された画像認識文字情報及び音声認識文字情報の前記特徴部分をそれぞれ抽出して区切り、区切った各部分の開始時刻と終了時刻とを示すタイムコードを生成する手順、
前記コンテンツの筋書きを文字データで表現した台本情報の特徴部分を取得して、取得した特徴部分に基づいて前記台本情報を各場面毎に区切り、前記生成した前記画像認識文字情報と前記音声認識文字情報とを、生成した前記タイムコードが示す位置で分割し、分割した前記画像認識文字情報と前記音声認識文字情報とが一致しない場合でもそれぞれを正しいものと判断して、前記台本情報の各場面と分割した前記画像認識文字情報と前記音声認識文字情報との対応付けを行い、前記台本情報の各場面と分割した前記画像認識文字情報と前記音声認識文字情報との対応関係を示す対応情報を生成する手順、
前記生成した対応情報と前記生成したタイムコードとに基づいて、前記台本情報の各場面と各場面のタイムコードとの関係を示す構造化情報を、必要なコンテンツを検索するための情報として生成する手順、
前記台本情報、コンテンツ、タイムコード、構造化情報を記憶する手順、
を実行させるためのものである。
【００２７】
【発明の実施の形態】
以下、本発明の実施の形態に係るコンテンツ処理装置を図面を参照して説明する。
本実施の形態に係るコンテンツ処理装置の構成を図１に示す。
コンテンツ処理装置は、同期データ生成部１と、データ格納部２と、同期データ出力部３と、からなる。
【００２８】
尚、コンテンツホルダ４は、映像情報と音声情報とが記録された映像メディア５と、映像情報に関連する文書が記載されているテキストメディア６とを保有しているホルダである。
【００２９】
映像メディア５、テキストメディア６は、外部記録媒体であって、磁気記録テープ、ＤＶＤ（Digital Versatile Disk）が用いられる。また、記録容量は少ないものの、映像メディア５、テキストメディア６に、フレキシブルディスク、ＭＤ（Mini Disc；登録商標）、ＭＯ（Magneto-Optic）、ＣＤ−ＲＯＭ（Compact Disk Read-Only Memory）等を用いることもできる。
【００３０】
テレビ放送におけるドラマ番組の場合、テキストメディア６には、ドラマ番組の台本のデータがテキスト情報として記録される。このテキスト情報は、コンテンツの筋書きを文字データで表現したものである。また、映像メディア５には、ドラマ番組の映像情報と、台詞に従って発声した俳優の発声音等の音声情報と、が記録される。
【００３１】
コンテンツホルダ４から取り出された映像メディア５、テキストメディア６に記録されたデータは、同期データ生成部１に入力される。
【００３２】
同期データ生成部１は、映像メディア５とテキストメディア６とに記録されているデータを取り出してデータ処理を行い、同期したデータを生成するものである。
【００３３】
同期データ生成部１は、テキスト入力部１１と、シーン予測部１２と、映像音声入力部１３と、画像認識部１４と、音声認識部１５と、タイムコード生成部１６と、マッピング部１７と、構造化処理部１８と、を備えて構成される。
【００３４】
テキスト入力部１１は、テキストメディア６からテキスト情報を読み出すものである。尚、テキストメディア６がアナログデータを記録した記録媒体である場合、テキスト入力部１１は、このテキストメディア６からアナログデータを読み出すため、例えば、ＯＣＲ（光学式文字読み取り）装置と、アナログ−デジタル変換器と、を備える（図示せず）。テキスト入力部１１は、テキストメディア６から読み出したテキスト情報をシーン予測部１２に出力する。
【００３５】
シーン予測部１２は、テキスト入力部１１から出力されたテキスト情報に基づいて、各シーンの画面構成および音声構成を予測するものである。画面構成とは、画面内に映っている人物や物の名称や動きのことをいう。また、音声構成とは、シーン内の人物の声や物音、音楽や効果音のことをいう。
【００３６】
シーン予測部１２は、画面構成および音声構成を予測するため、ルール、例えば、台本内にカメラワークに対する指示が無い限り、台本記入の人物の顔が映っているといったルール、主語が省略されている場合は、前のシーンのト書きの主語を使うといったルールを予め蓄積する。
【００３７】
そして、シーン予測部１２は、蓄積したルールとテキスト入力部１１から出力されたテキスト情報とに基づいて、台本に記述されている各シーンの画面構成および音声構成を予測する。各シーンの画面構成および音声構成の予測については、台本に記述してある内容をシーン毎に羅列するだけで十分である。
【００３８】
記述があいまいな場合、例えば、テレビドラマの台本に、“人物Ａが部屋から出て行く”という記述がある場合、画面内に人物Ａがいなくなったと明記されていなくても、シーン予測部１２は、現在の場面設定と人物の行動とから、この人物Ａが画面から消えたと推測する。
シーン予測部１２は、予測した各シーンの画面構成及び音声構成を、テキスト情報に付加してマッピング部１７に出力する。
【００３９】
映像音声入力部１３は、映像メディア５から映像情報、音声トラックに記録されている音声情報を取り出すものである。尚、映像メディア５がアナログデータを記録した記録媒体の場合に映像メディア５からアナログデータを読み出すため、映像音声入力部１３は、ビデオキャプチャ等を備える。そして、映像音声入力部１３は、映像メディア５からアナログ映像情報、アナログ音声情報を読み出し、ビデオキャプチャ等を用いてＡＶＩ形式又はＭＰＥＧ形式のデジタル映像情報、デジタル音声情報に変換する。
映像音声入力部１３は、取り出した映像情報、音声情報を、それぞれ、画像認識部１４、音声認識部１５に出力する。
【００４０】
画像認識部１４は、映像音声入力部１３から出力された映像情報について画像認識処理を行い、画像認識結果として、画像認識文字情報を生成するものである。この画像認識文字情報は、映像情報が含む各場面の特徴部分を文字データで表現したものである。例えば、映像に登場人物として刑事Ａの顔が映っている場合、画像認識部１４は、画像認識結果として、「刑事Ａの顔」という画像認識文字情報を生成する。
【００４１】
画像認識部１４は、このような画像認識処理を行うため、映像情報から、カット、カメラワーク、照明環境、背景、登場人物、表情、顔の向き、年代、性別、髪型、化粧、人物の動作、画面内の物体、物体の動作について、全てのまたは一部の特徴部分を取り出す。そして、画像認識部１４は、取り出した特徴部分に基づいて区切って分割し、画像認識処理を行う。
【００４２】
また、画像認識部１４は、映像情報をタイムコード生成部１６に出力するとともに、映像情報の分割位置を示す情報をタイムコード生成部１６に出力する。
【００４３】
音声認識部１５は、映像音声入力部１３から出力された音声情報について音声認識処理を行い、音声認識結果として、音声認識文字情報を生成するものである。音声認識文字情報は、音声情報が含む各場面の特徴部分を文字データで表現したものである。例えば、音声に登場人物としての刑事Ａの声が含まれている場合、音声認識部１５は、音声認識結果として、「刑事Ａの声」という音声認識文字情報を生成する。
【００４４】
音声認識部１５は、このような音声認識処理を行うため、発言の内容、話者、音楽、効果音、その他の非言語音声、無音区間について、全てのまたは一部の特徴部分を取り出す。
【００４５】
また、音声認識部１５は、生成した音声認識文字情報について、単語切り出し処理を行い、音声情報を各単語に分割する。そして、音声認識部１５は、音声情報とともに、音声情報を各単語に分割したときの分割位置を示す情報をタイムコード生成部１６に出力する。
【００４６】
タイムコード生成部１６は、画像認識部１４、音声認識部１５から出力された情報に基づいて、画像、音声に関するタイムコードを生成するものである。タイムコードは、映像情報、音声情報の分割位置の開始と終了の時刻情報を示すものである。
【００４７】
タイムコード生成部１６は、画像に含まれている特徴部分に基づいて、以下のようなタイムコードを生成する。
（１）カットタイムコード：画像情報におけるカットの開始時刻と終了時刻とを示す。
（２）カメラワークタイムコード：カメラワークの開始時刻と終了時刻とを示す。
（３）照明環境タイムコード：照明環境の開始時刻と終了時刻とを示す。
（４）背景タイムコード：背景の開始時刻と終了時刻とを示す。
（５）登場人物タイムコード：登場人物の登場開始時刻と終了時刻とを示す。
（６）顔特徴タイムコード：人物の顔の向きや表情、化粧といった顔特徴の開始時刻と終了時刻とを示す。
（７）人物特徴タイムコード：年代、性別、髪型、化粧、衣服、背格好といった人物特徴の開始時刻と終了時刻とを示す。
（８）人物動作タイムコード：人物の動作の開始時刻と終了時刻とを示す。
（９）物体タイムコード：物体の登場開始時刻と終了時刻とを示す。
（１０）物体動作タイムコード：物体の動作の開始時刻と終了時刻とを示す。
【００４８】
また、タイムコード生成部１６は、音声に含まれている特徴部分に基づいて、以下のようなタイムコードを生成する。
（１１）単語タイムコード：音声情報における各単語の開始時刻と終了時刻とを示す。
（１２）話者タイムコード：音声情報における各話者の発言開始時刻と終了時刻とを示す。
（１３）非言語音声タイムコード：叫び声や笑い声、ため息などの音声情報における、非言語音声の開始時刻と終了時刻とを示す。
（１４）音楽タイムコード：音声情報における音楽の開始時刻と終了時刻とを示す。
（１５）効果音タイムコード：効果音の開始時刻と終了時刻とを示す。
（１６）無音区間タイムコード：無音区間の開始時刻と終了時刻とを示す。
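The sixteen kinds of time code listed above all share the same shape: a kind, a start time, and an end time. The following is a minimal sketch of one way to model that, assuming the `HH:MM:SS:FF` layout seen in values such as `00:00:02:13` elsewhere in this description. The names `TimeCode` and `parse_tc` and the kind labels are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass


def parse_tc(text: str) -> tuple:
    """Parse 'HH:MM:SS:FF' into an (h, m, s, f) tuple that sorts chronologically."""
    parts = text.split(":")
    if len(parts) != 4:
        raise ValueError(f"expected HH:MM:SS:FF, got {text!r}")
    return tuple(int(p) for p in parts)


@dataclass(frozen=True)
class TimeCode:
    kind: str     # assumed labels, e.g. "cut", "character", "word", "speaker", "silence"
    start: tuple  # (h, m, s, f)
    end: tuple    # (h, m, s, f)
    label: str = ""  # recognised feature, e.g. a character name or a word

    def __post_init__(self):
        # A time code whose end precedes its start cannot describe a real interval.
        if self.end < self.start:
            raise ValueError("end precedes start")


# Example: the face of detective A is visible between the two times.
face = TimeCode("character", parse_tc("00:00:02:13"), parse_tc("00:00:02:26"), "刑事Ａの顔")
```

Keeping the times as integer tuples means ordinary comparison gives chronological order without having to assume a frame rate.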
【００４９】
タイムコード生成部１６は、生成した画像、音声に関するタイムコードのうち、これら全てのまたは一部のタイムコードを、それぞれ、画像認識部１４、音声認識部１５に出力する。
【００５０】
前述の画像認識部１４は、タイムコード生成部１６が出力したタイムコードのうち、全部又は一部のタイムコードを画像認識文字情報に付加して、タイムコードを付加した画像認識文字情報を、映像情報とともにマッピング部１７に出力する。
【００５１】
また、音声認識部１５は、タイムコード生成部１６が出力した各タイムコードの割り付けを行う。即ち、音声認識部１５は、単語タイムコードを各単語に割り付ける。音声認識部１５は、話者タイムコードを各発話に、非言語音声タイムコードを非言語音声に、音楽タイムコードを音楽に、効果音タイムコードを効果音に、無音区間タイムコードを無音区間に割り付ける。
【００５２】
そして、音声認識部１５は、タイムコード生成部１６が出力したタイムコードのうち、全部又は一部のタイムコードを音声認識文字情報に付加して、タイムコードを付加した音声認識文字情報を、音声情報とともにマッピング部１７に出力する。
【００５３】
マッピング部１７は、シーン予測部１２が出力したテキスト情報を予測したシーン（場面）毎に分割し、分割したテキスト情報と、画像認識部１４、音声認識部１５からそれぞれ出力された分割された画像認識文字情報、音声認識文字情報と、の対応付けを行うものである。
【００５４】
具体的には、マッピング部１７は、シーン予測部１２が出力したテキスト情報から、例えば、改行、インデント、人物名、固有名、地名等によって、話者の切り替わり、空白行、改行箇所等を、台本の区切り位置として検出する。マッピング部１７は、区切りを検出すると、検索しやすいように、その区切り位置でテキスト情報を区切る。
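The paragraph above says the script is split at cues such as blank lines and speaker changes. A rough sketch of that idea follows, assuming a plain-text script in which a line that begins with a known character name starts a new segment. The function name `split_script` and that input convention are illustrative assumptions, not the patent's actual format.

```python
def split_script(text, speaker_names):
    """Split script text into segments at blank lines and at lines that start with a speaker name."""
    segments, current = [], []
    for line in text.splitlines():
        stripped = line.strip()
        starts_speaker = any(stripped.startswith(name) for name in speaker_names)
        # A blank line or a new speaker closes the segment collected so far.
        if (not stripped or starts_speaker) and current:
            segments.append("\n".join(current))
            current = []
        if stripped:
            current.append(stripped)
    if current:
        segments.append("\n".join(current))
    return segments
```

A real script would also need the indentation and proper-noun cues mentioned in the text; those are left out here to keep the sketch short.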
【００５５】
次に、マッピング部１７は、テキスト情報と、それぞれ、分割された画像認識文字情報、音声認識文字情報と、を比較する。
【００５６】
尚、比較には、例えば、ＤＰマッチングを用いる。ＤＰマッチングは、ＤＴＷ(Dynamic Time Warping)とも呼ばれるものであり、単語中の同じ音素同士が対応するように動的計画(Dynamic Programming)を用いて時間正規化を行い、単語と単語との類似距離を求める手法である。
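To make the DP matching idea concrete, here is a minimal sketch of the textbook dynamic time warping recurrence. It assumes a 0/1 local cost between sequence elements, which may be phonemes, characters, or words. The patent does not specify the cost function or the path constraints, so both are assumptions here.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences with a 0/1 local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # d[i][j] = cost of the best alignment of a[:i] with b[:j]
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else 1.0
            # An element may be stretched (repeated) on either side or matched one-to-one.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]
```

Because warping lets one element align with several, a drawn-out sequence such as `"aabc"` is at distance 0 from `"abc"`, which is the time normalisation the paragraph describes.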
【００５７】
次に、マッピング部１７は、これらの情報の対応付けを行い、対応情報を生成する。この対応情報は、図２に示すように、分割した画像認識文字情報及び音声認識文字情報と、テキスト情報と、の１対１の関係を示す情報である。
【００５８】
尚、テキスト情報と、画像認識文字情報と、音声認識文字情報と、が一致しなくても、マッピング部１７は、各々の認識結果は正しいと判断して対応付けを行う。
【００５９】
そして、マッピング部１７は、分割した画像認識文字情報、音声認識文字情報と、テキスト情報と、対応情報と、を構造化処理部１８に出力する。
【００６０】
構造化処理部１８は、タイムコードと対応情報とに基づいて、テキスト情報の区分けされた各シーン（場面）と各タイムコードとの対応付けを行い、構造化情報を生成するものである。構造化情報は、図２に示すように、対応付けされたテキスト情報の各シーンと各タイムコードとの１対１の関係を示す情報である。
【００６１】
構造化処理部１８は、分割された画像認識文字情報及び分割音声認識文字情報に付加されている各タイムコードから、テキスト情報の各区切りの開始時刻と終了時刻である各タイムコードを算出する。
【００６２】
具体的には、構造化処理部１８は、分割された画像認識文字情報に付加された前述のタイムコード（１）〜（１０）のうちの全て又は一部から、テキスト情報の各区切りに対応するタイムコードを算出する。構造化処理部１８は、各区切りの最初の単語の開始時刻と最後の単語の終了時刻とから、テキスト情報におけるト書きや台詞と対応付けを行う。
【００６３】
構造化処理部１８は、いずれのタイムコードも、状況説明や撮影条件を指示した、テキスト情報のト書き部分との対応付けに用いる。話者毎の台詞との対応付けについても、構造化処理部１８は、話者を示す人物タイムコードだけでなく、台詞の内容にある状況説明や固有名詞を利用することにより他のタイムコードとの対応付けを行う。
【００６４】
また、構造化処理部１８は、分割された音声認識文字情報に付加された前述のタイムコード（１１）〜（１６）の全て又は一部から、テキスト情報の各区切りに対応するタイムコードを算出する。
【００６５】
即ち、構造化処理部１８は、分割された音声認識文字情報に付加された話者タイムコードから、分割音声認識テキスト情報における話者の登場開始時刻と終了時刻であるタイムコードを算出し、台詞との対応付けを行う。また、構造化処理部１８は、音楽タイムコードや効果音タイムコードから、ト書きにおける音楽や効果音の開始時刻及び終了時刻と対応付けを行う。また、構造化処理部１８は、無音区間タイムコードから、テキスト情報内の話者の切り替わりやシーンの切り替わりとの対応付けを行う。
【００６６】
また、構造化処理部１８は、構造化情報に基づいてテキストメディアファイルおよびタイムコードファイルを、ト書きや話者毎の台詞に対応付けて生成する。テキストメディアファイルは、テキスト情報を保存するファイルであり、タイムコードファイルは、タイムコードを保存するファイルである。
【００６７】
構造化処理部１８は、各区切りに対応するタイムコードを、テキスト情報における登場順に、タイムコードファイルに格納する。尚、構造化処理部１８は、それぞれが各区切りに対応した複数のテキストメディアファイルと、タイムコードファイルとを生成することもできる。
【００６８】
データ格納部２は、映像メディア格納部２１と、タイムコード格納部２２と、テキストメディア格納部２３と、からなる。
【００６９】
映像メディア格納部２１、タイムコード格納部２２、テキストメディア格納部２３は、それぞれ、映像メディアファイル、タイムコードファイル、テキストメディアファイルを格納するためのものである。尚、映像メディア格納部２１、テキストメディア格納部２３は、それぞれ、映像情報格納部、テキストファイル格納部に相当する。
【００７０】
構造化処理部１８は、映像メディア格納部２１、タイムコード格納部２２、テキストメディア格納部２３に、それぞれ、映像メディアファイル、生成したタイムコードファイル、テキストメディアファイルを格納する。また、構造化処理部１８は、図２に示す構造化情報をテキストメディア格納部２３に格納する。
【００７１】
同期データ出力部３は、データ格納部２に格納されたデータの中から検索対象のデータを検索し、該当するデータを出力するものであり、入出力部３１と、検索制御部３２と、同期処理部３３と、を備えて構成される。
【００７２】
入出力部３１は、ユーザが要求する検索対象の入力を受け付けて、受け付けた検索対象の入力を検索情報として検索制御部３２に供給するとともに、検索の結果、得られた検索結果情報を出力するものである。検索情報としては、例えば、日時、番組タイトル、発言者、俳優名等のキーワード等がある。
【００７３】
検索制御部３２は、入出力部３１から供給された検索情報に従ってテキストメディア格納部２３に格納されているデータを検索するものであり、検索情報に該当するテキスト情報のシーン部分を入出力部３１に出力する。
【００７４】
具体的には、検索制御部３２は、検索情報に基づいて、ユーザによって選択された語句を含むテキスト情報をテキストメディア格納部２３からシーン毎に取り出す。また、検索制御部３２は、テキストメディア格納部２３に格納されている構造化情報に基づいて、ユーザによって選択されたテキスト情報の区切りのタイムコードを、タイムコード格納部２２から取り出し、取り出したタイムコードを同期処理部３３に出力する。
【００７５】
同期処理部３３は、検索制御部３２が出力したタイムコードが示す開始時刻、終了時刻を、それぞれ、映像情報出力の先頭時刻、最終時刻として、入出力部３１に、先頭時刻から最終時刻までの映像情報を出力する。また、同期処理部３３は、タイムコードに基づいてテキスト情報を加工し、加工したテキスト情報を入出力部３１に出力する。このときの加工方法としては、例えば、テキスト情報をスクロールさせるなどの方法がある。
【００７６】
次に、このようなコンテンツ処理装置を実現するハードウェア構成について説明する。
図１に示すコンテンツ処理装置は、図３に示すようなコンピュータシステムによって実現される。
【００７７】
即ち、コンピュータシステムは、端末４１、４３と、記憶装置４２と、を備える。端末４１、４３と、記憶装置４２とは、通信線４４を介して接続される。
【００７８】
尚、このコンテンツ処理装置は、図３に示すようなコンピュータシステム上に構築されてもよいし、あるいは、同一の端末上に構築されてもよい。図３に示すようなコンピュータシステムの場合、通信線４４には、ＬＡＮ（Local Area Network）、インターネット等を用い、端末４１，４３、記憶装置４２は、ネットワークで接続される。
【００７９】
端末４１，４３はコンピュータであり、それぞれ、同期データ生成部１、同期データ出力部３の機能を備える。また、記憶装置４２は、データ格納部２に対応し、磁気ディスク装置等によって構成される。
【００８０】
端末４１，４３は、図４に示すように、ＣＰＵ５１と、ＲＯＭ５２と、ＲＡＭ５３と、表示装置５４と、入力装置５５と、ＨＤＤ５６と、ドライブ装置５７と、を備える。
【００８１】
ＲＯＭ（Read Only Memory）５２は、ＣＰＵ５１を同期データ生成部１、同期データ出力部３として機能させるためのプログラム（データ）を記憶するためのメモリである。
ＣＰＵ（Central Processing Unit）５１は、ＲＯＭ５２に記憶されたプログラムを実行するものである。
【００８２】
ＲＡＭ（Random Access Memory）５３は、ＣＰＵ５１がプログラムを実行するのに必要なデータを記憶するためのメモリである。尚、端末４１のＲＡＭ５３は、シーン予測部１２が出力するテキスト情報を記憶するためにも用いられる。
【００８３】
表示装置５４は、データを表示する液晶ディスプレイ等からなるものである。入力装置５５は、データを入力するためのものであり、キーボード、マウス、マイク、イメージスキャナ、カメラ、ビデオキャプチャインターフェース等によって構成される。尚、端末４３の表示装置５４、入力装置５５が、同期データ出力部３の入出力部３１として機能する。
【００８４】
ＨＤＤ（Hard Disk Drive）５６は、データを記憶するための記憶装置である。
ドライブ装置５７は、映像メディア５，テキストメディア６のような外部記録媒体を装着し、外部記録媒体から、記録されているデータを読み出すためのものである。端末４１のドライブ装置５７は、同期データ生成部１のテキスト入力部１１、映像音声入力部１３として機能する。
【００８５】
次に、本実施の形態に係るコンテンツ処理装置の動作を図５に示すフローチャートに基づいて説明する。
同期データ生成部１の映像音声入力部１３、テキスト入力部１１は、映像メディア５，テキストメディア６から、それぞれ、映像音声情報、テキスト情報を入力する（ステップＳ１０１）。
テキスト入力部１１は、入力したテキスト情報がデジタルデータからなるものか否かを判定する（ステップＳ１０２）。
【００８６】
テキスト情報がデジタルデータからなると判定した場合（ステップＳ１０２においてＹｅｓ）、テキスト情報を、シーン予測部１２に出力する。
【００８７】
一方、テキスト情報がデジタルデータからなるものではない、即ち、アナログデータからなると判定した場合（ステップＳ１０２においてＮｏ）、テキスト入力部１１は、ＯＣＲ等を用いてテキストメディア６に記録されているテキスト情報をデジタル化する（ステップＳ１０３）。そして、テキスト入力部１１は、デジタル化したテキスト情報をシーン予測部１２に出力する。
【００８８】
シーン予測部１２は、テキスト入力部１１から出力されたテキスト情報に基づいて、各シーンの画面構成および音声構成を予測し、予測した各シーンの画面構成および音声構成をテキスト情報に付加してマッピング部１７に出力する（ステップＳ１０４）。
【００８９】
映像音声入力部１３は、映像情報、音声情報を、映像メディア５から入力し、入力した映像情報、音声情報がデジタルデータからなるか否かを判定する（ステップＳ１０５）。
【００９０】
映像情報、音声情報がデジタルデータからなると判定した場合（ステップＳ１０５においてＹｅｓ）、映像音声入力部１３は、映像情報、音声情報を、それぞれ、画像認識部１４、音声認識部１５に出力する。
【００９１】
一方、映像情報、音声情報がデジタルデータからなるものではない、即ち、アナログデータからなるものと判定した場合（ステップＳ１０５においてＮｏ）、映像音声入力部１３は、アナログ映像情報、アナログ音声情報を、ビデオキャプチャ等を用いて、ＡＶＩ形式又はＭＰＥＧ形式のデジタルデータからなる情報に変換する（ステップＳ１０６）。そして、映像音声入力部１３は、デジタル化した映像情報、音声情報を、それぞれ、画像認識部１４、音声認識部１５に出力する。
【００９２】
画像認識部１４は、映像音声入力部１３から供給された映像情報について画像認識処理を行い、画像認識文字情報を生成する（ステップＳ１０７）。画像認識部１４は、映像情報とともに映像情報の分割位置を示す情報をタイムコード生成部１６に出力する。
【００９３】
音声認識部１５は、映像音声入力部１３から出力された音声情報について音声認識処理を行い、音声認識処理の結果として、音声認識文字情報を生成する（ステップＳ１０８）。音声認識部１５は、音声情報とともに音声情報の分割位置を示す情報をタイムコード生成部１６に出力する。
【００９４】
タイムコード生成部１６は、画像認識部１４、音声認識部１５から出力された情報に基づいて、映像、音声に関する前述のタイムコード（１）〜（１６）の全部又は一部を生成する（ステップＳ１０９）。タイムコード生成部１６は、生成した映像、音声に関するタイムコードを、それぞれ、画像認識部１４、音声認識部１５に出力する。
【００９５】
画像認識部１４は、タイムコード生成部１６から出力された映像に関するタイムコードを画像認識文字情報に付加する（ステップＳ１１０）。音声認識部１５は、タイムコード生成部１６から出力された音声に関するタイムコードを音声認識文字情報に付加する（ステップＳ１１０）。画像認識部１４、音声認識部１５は、タイムコードを付加して分割した画像認識文字情報、音声認識文字情報を、それぞれ、映像情報、音声情報とともにマッピング部１７に出力する。
【００９６】
マッピング部１７は、シーン予測部１２が出力するテキスト情報を一時格納し、テキスト情報をト書きや話者毎の台詞に従って区切る。
また、マッピング部１７は、テキスト情報と分割した画像認識文字情報及び音声認識文字情報とを比較して、テキスト情報の段落区切り位置に基づいて、テキスト情報と分割した画像認識文字情報及び音声認識文字情報との対応付けを行う。
【００９７】
さらに、マッピング部１７は、テキスト情報の各段落と図２に示す対応情報とを生成し（ステップＳ１１１）、テキスト情報、分割した画像認識文字情報及び音声認識文字情報とともに、対応情報を構造化処理部１８に出力する。
【００９８】
構造化処理部１８は、分割された画像認識文字情報、音声認識文字情報に付加されたタイムコードに基づいて、図２に示す構造化情報を生成する（ステップＳ１１２）。構造化処理部１８は、構造化情報に基づいて、テキストメディアファイルとタイムコードファイルとを、ト書きや話者毎の台詞に対応付けて生成する。
【００９９】
構造化処理部１８は、生成したテキストメディアファイルをデータ格納部２のテキストメディア格納部２３に格納し、タイムコードファイルをタイムコード格納部２２に格納する。また、構造化処理部１８は、映像メディアファイルを映像メディア格納部２１に格納する（ステップＳ１１３）。
そして、同期データ生成部１は、この処理を終了させる。
【０１００】
次に、ユーザが検索条件を入力すると、同期データ出力部３は、データの検索を行い、検索したデータを出力する。この同期データ出力部３の同期データ出力処理を図６に示すフローチャートに基づいて説明する。
【０１０１】
ユーザが検索情報を入力すると、入出力部３１は、この入力操作に応答して、検索情報を検索制御部３２に出力する（ステップＳ２０１）。
【０１０２】
検索制御部３２は、入出力部３１から出力された検索条件に基づいてデータ格納部２に格納されているデータを検索する（ステップＳ２０２）。
【０１０３】
検索制御部３２は、検索条件に合致する該当データがあるか否かを判定する（ステップＳ２０３）。
該当データがないと判定した場合（ステップＳ２０３においてＮｏ）、検索制御部３２は、該当データがなかった旨を表示し、その旨の音声を出力する（ステップＳ２０６）。
【０１０４】
一方、該当データがあると判定した場合（ステップＳ２０３においてＹｅｓ）、検索制御部３２は、テキスト情報の段落を特定し、該当データをすべて取り出す（ステップＳ２０４）。
【０１０５】
検索制御部３２は、テキストメディア格納部２３に格納されている構造化情報に基づいて、ユーザによって選択されたテキスト情報の区切りのタイムコードを、タイムコード格納部２２から取り出し、取り出したタイムコードを同期処理部３３に出力する（ステップＳ２０４）。
【０１０６】
同期処理部３３は、検索制御部３２が出力したタイムコードの開始時刻、終了時刻を、それぞれ、テキスト情報の段落に対応する映像情報として、再生する先頭時刻、最終時刻とすることを入出力部３１に通知し、選択された段落のテキスト情報と、映像情報、音声情報を入出力部３１に供給する（ステップＳ２０５）。
【０１０７】
入出力部３１は、供給された情報に基づいて台本と映像とを表示し、音声を出力する（ステップＳ２０６）。
このようにして、同期データ出力部３は、ユーザにデータを提供する。
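The retrieval flow described above (steps S201 to S206) comes down to finding the script segments that match a query and looking up their time codes in the structured information. The sketch below assumes both are held in plain dictionaries keyed by a segment id. The function name `search_scenes` and that layout are hypothetical.

```python
def search_scenes(scenes, structure, keyword):
    """Return matching script segments together with their start and end time codes.

    scenes:    {segment_id: script text}
    structure: {segment_id: (start, end)}  # structured information
    """
    hits = []
    for segment_id, text in scenes.items():
        # Only segments that have an associated time code can be played back.
        if keyword in text and segment_id in structure:
            start, end = structure[segment_id]
            hits.append({"segment": segment_id, "text": text, "start": start, "end": end})
    return hits
```

An empty result corresponds to the "no matching data" branch of step S203.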
【０１０８】
次に、具体的な動作をさらに詳しく説明する。
映像メディア５，テキストメディア６は、例えば、ドラマ番組の制作を担当する制作会社のコンテンツホルダ４に格納され、このコンテンツホルダ４から取り出される。
【０１０９】
同期データ生成部１のテキスト入力部１１は、テキストメディア６から、コンテンツの筋書きを文字データで表現したテキスト情報を取り出す。テキスト入力部１１は、台詞に関するテキスト情報をシーン予測部１２に出力する。
【０１１０】
シーン予測部１２は、予め蓄積されたルールに従って、図７に示すように、シーン１からシーン７まで、順に、刑事Ａの顔、刑事Ａの顔、刑事課全員の顔、刑事Ｂの顔、刑事Ａと刑事Ｂの顔、顔なし、刑事Ａの顔（後で課長Ｃの顔が加わる）といった画面構成を予測する（図５のステップＳ１０４の処理）。シーン予測部１２は、この予測内容をテキスト情報に付加してマッピング部１７に出力する。
【０１１１】
画像認識部１４は、画像を認識し、音声認識部１５は、音声を認識して、それぞれ、図７に示すような画像認識結果、音声認識結果を生成する（ステップＳ１０７，１０８の処理）。
【０１１２】
タイムコード生成部１６は、例えば、画像認識部１４が生成した画像認識結果「刑事Ａの顔」のタイムコードとして、「00:00:02:13,00:00:02:26」を生成する（ステップＳ１０９の処理）。「00:00:02:13」、「00:00:02:26」は、それぞれ、画像認識結果「刑事Ａの顔」の開始時刻、終了時刻を示す。
【０１１３】
また、タイムコード生成部１６は、音声認識部１５が生成した音声認識結果として、刑事Ａの声「強盗」、「グレー」に、タイムコードとして、「00:00:02:15,00:00:02:26」を生成する（ステップＳ１０９の処理）。「00:00:02:15」、「00:00:02:26」は、それぞれ、音声認識結果、刑事Ａの声「強盗」、「グレー」の開始時刻、終了時刻を示す。
【０１１４】
マッピング部１７は、テキスト情報と、画像認識文字情報と、音声認識文字情報とを、ＤＰマッチングの手法を用いて順次比較する。そして、マッピング部１７は、図７に示すような対応付けを行う（ステップＳ１１１の処理）。
【０１１５】
例えば、図７に示すように、台本には、シーン１に、「刑事Ａ電話を受けて〜」と記載され、画像認識結果のタイムコード「00:00:02:13〜00:00:02:26」の画像認識文字情報には、「刑事Ａの顔」がある。また、音声認識結果のタイムコード「00:00:02:15〜00:00:02:26」の音声認識文字情報には、「刑事Ａの声〜」がある。台本のシーン１とこの画像認識結果と音声認識結果とでは、「刑事Ａ」が一致している。従って、マッピング部１７は、ＤＰマッチングを行うことにより、台本のシーン１とこの画像認識結果と音声認識結果との類似距離が近いと判定し、台本のシーン１とこの画像認識結果と音声認識結果との対応付けを行う。
【０１１６】
同様にして、マッピング部１７は、台本のシーン２〜７と、画像認識結果と音声認識結果との対応付けを、順次、行う。
【０１１７】
この場合、同じシーンでありながら、画像認識結果と音声認識結果として、刑事Ａの顔の出現時間と、刑事Ａの音声の出現時間が異なる場合、マッピング部１７は、シーンの時間が長くなるようにする。
【０１１８】
例えば、シーン１のように、出現開始時間が異なっている場合、マッピング部１７は、出現が早い時刻「00:00:02:13」の方を採用する。また、シーン５のように、出現終了時間が異なっている場合、マッピング部１７は、退出や消失が遅い時刻「00:00:02:38」の方を採用する。あるいは、マッピング部１７は、図７に示すように、画像認識結果については、ト書き部分、音声認識結果については台詞部分のタイムコードとみなすという方法をとることもできる。また、マッピング部１７は、同じシーンでも、シーン内の顔や物や声や物音といったシーン要素各々に対してタイムコードが存在するとみなすこともできる。
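The "make the scene as long as possible" rule described above takes the earliest start and the latest end across the image and speech recognition results. A minimal sketch, assuming time codes are `HH:MM:SS:FF` strings; the function name `merge_scene_span` is hypothetical.

```python
def merge_scene_span(spans):
    """Given (start, end) time-code pairs for one scene, return (earliest start, latest end)."""
    def key(tc):
        # Compare numerically, field by field, rather than as raw strings.
        return tuple(int(p) for p in tc.split(":"))

    starts, ends = zip(*spans)
    return min(starts, key=key), max(ends, key=key)
```

For scene 1, the image result starts at `00:00:02:13` and the speech result at `00:00:02:15`, so the merged scene starts at `00:00:02:13`.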
【０１１９】
登場人物については、マッピング部１７は、音声認識処理による音声認識結果に基づいて話者の同定を行うこともできる。マッピング部１７は、画像と音声とを同時に使うことによって、より頑強な人物同定を実現する。
【０１２０】
例えば、登場人物が無言であったり、音楽や周囲の雑音で音声認識が困難な状況でも、画像認識により、人物を同定することができる。反対に、逆光など照明条件が劣悪な場合や、登場人物が後ろ向きであったり下向きであったりして顔が見えない場合でも、音声認識により人物を同定することができる。
【０１２１】
また、画像情報と音声情報とテキスト情報とを比較してお互いに矛盾がある場合、マッピング部１７は、各々の認識結果が正しいと判断して対応付けを行う。例えば、あるシーンにおいて、画像認識では人物ＡとＢのみが検出され、音声認識では人物ＡとＣが検出された場合、マッピング部１７は、Ａ、Ｂ、Ｃ３名とも存在しているものとして対応付けを行う。
【０１２２】
例えば、シーン５のように、刑事Ａの顔が検出できなかった場合、マッピング部１７は、顔の出現順番の前後関係から対応づけすることができるし、ほぼ同じ時刻に音声認識で、刑事Ｂの声で「怪我」「無茶」という語を検出すると、この語で対応付けを行う。このようにして、より高い信頼度で対応付けが行われる。
【０１２３】
一方、このシーンで音声認識ができなくても、刑事Ａと刑事Ｂの顔が検出できていれば、マッピング部１７は、台本と画像認識結果と音声認識結果との対応付けを行うこともできる。更にこのシーンで画像認識も音声認識もできなかった場合でも、マッピング部１７は、前後シーンで顔や声の出現順番から対応付けできれば、認識できなかったシーンでも対応付けを行うことができる。
【０１２４】
但し、マッピング部１７は、全ての組み合わせについて対応付けを行うこともできる。例えば、この場合、マッピング部１７は、Ａ、Ｂ、Ｃ３名とも存在すると仮定した対応づけと、Ａ、Ｂの２名のみが存在すると仮定した対応付けと、Ａ、Ｃの２名のみが存在すると仮定した対応付けと、Ａのみが存在すると仮定した対応付けと、を行うこともできる。
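The two strategies described for conflicting results can be sketched as set operations: treating every recognition result as correct is a union, and trying all combinations keeps the characters both recognisers agree on and adds every subset of the disputed ones. The function names are hypothetical, and reading "all combinations" this way is an interpretation of the A, B, C example in the text.

```python
from itertools import combinations


def union_hypothesis(image_people, voice_people):
    """Treat every recognition result as correct: everyone detected by either recogniser is present."""
    return set(image_people) | set(voice_people)


def all_hypotheses(image_people, voice_people):
    """Enumerate candidate casts: the agreed people plus each subset of the disputed people."""
    agreed = set(image_people) & set(voice_people)
    disputed = sorted(set(image_people) ^ set(voice_people))
    hypotheses = []
    for size in range(len(disputed), -1, -1):  # largest candidate cast first
        for extra in combinations(disputed, size):
            hypotheses.append(agreed | set(extra))
    return hypotheses
```

With image recognition reporting A and B and speech recognition reporting A and C, the enumeration gives the four cases in the text: A, B, C; A, B; A, C; and A alone.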
【０１２５】
また、マッピング部１７は、カットやカメラワーク、背景、人物の動作、フレーム内の物体、物体の動作、台詞内の単語、音楽、効果音についても同様に対応関係をとり、区切り位置を検出することができる。これらについてもお互いに矛盾がある場合は、各々の認識結果が全て正しいとして対応付けしてもよいし、全ての組み合わせについて対応付けを行ってもよい。
【０１２６】
対応付けを行う場合、マッピング部１７は、単に順番だけでなく、台本の台詞の長さやト書きなどからシーンの長さを推定し、検出された顔の出現時間があらかじめ定めた範囲内（例えば推定したシーンの長さの０．５倍から１．５倍の範囲）にあるかどうかで、対応しているかどうかを判断することもできる。
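The plausibility check described above can be written as a simple range test. The 0.5 and 1.5 factors are the example values given in the text; the function name `duration_plausible` and the use of seconds as the unit are assumptions. How the scene length is estimated from the script is not specified, so it is taken as an input.

```python
def duration_plausible(observed_sec, estimated_sec, low=0.5, high=1.5):
    """True if the observed appearance time lies within [low, high] times the estimated scene length."""
    return low * estimated_sec <= observed_sec <= high * estimated_sec
```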
【０１２７】
構造化処理部１８は、テキスト情報とタイムコードとを、以下のように対応付けして、テキストメディアファイルと、タイムコードファイルとを生成する（ステップＳ１１２の処理）。
【０１２８】
即ち、構造化処理部１８は、例えば、第１の台詞のテキストメディアファイルのファイル名を「台詞１．ｔｘｔ」、第１の台詞のタイムコードファイルのファイル名「時間１．ｔｘｔ」とする。
構造化処理部１８は、第１の台詞のファイルは、拡張子を除いたファイル名の末尾が「１」という対応付けを行う。
【０１２９】
同様に、構造化処理部１８は、第２の台詞同士をそれぞれ「台詞２．ｔｘｔ」と「時間２．ｔｘｔ」とすることで、第２段落のファイルの対応付けを行う。同様に、構造化処理部１８は、第３、第４段落以降も対応付けを行う。尚、ト書きがあれば、構造化処理部１８は、ト書きについても同様に対応付けを行う。
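The naming convention described above pairs a text file and a time-code file through the number at the end of the file name. A small sketch of the lookup, assuming ASCII digits and the `台詞` / `時間` prefixes from the example; the function name `paired_timecode_path` is hypothetical.

```python
from pathlib import Path


def paired_timecode_path(text_path):
    """Map a dialogue file such as '台詞1.txt' to its time-code file '時間1.txt'."""
    p = Path(text_path)
    prefix = "台詞"
    if not p.stem.startswith(prefix):
        raise ValueError(f"not a dialogue file: {p.name}")
    index = p.stem[len(prefix):]  # the shared trailing number
    return p.with_name(f"時間{index}{p.suffix}")
```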
【０１３０】
そして、構造化処理部１８は、テキストメディアファイル、タイムコードファイルおよび映像メディアファイルをデータ格納部２に格納する（ステップＳ１１３）。
【０１３１】
次に、ユーザが、端末４３を操作して、格納したデータを検索する場合、端末４３は、この操作に応答して図８に示すような検索画面を表示装置５４に表示する。
【０１３２】
端末４３の表示装置５４は、この検索画面に、映像検索システムの検索条件入力画面として、日時と、番組タイトルと、発言者と、キーワードとの入力欄を表示する。また、端末４３の表示装置５４は、この検索画面に、検索実行を指定するための検索実行ボタンを表示する。
【０１３３】
ユーザが、表示された検索条件の入力画面に従って、例えば、キーワードとして「俳優Ｄ」を入力すると、端末４３の表示装置５４は、この操作に応答してキーワードの入力欄に「俳優Ｄ」を表示する。
【０１３４】
尚、検索する際に、全文一致検索や、各検索条件によるアンド検索といった一般的な検索処理を用いることができる。
【０１３５】
ユーザが、この検索実行ボタンをクリックすると、同期データ出力部３の入出力部３１は、この操作に応答して、検索情報を検索制御部３２に供給し（ステップＳ２０１の処理）、検索制御部３２は、検索を開始する（ステップＳ２０２の処理）。
【０１３６】
検索制御部３２は、入力された検索条件に基づいて、データ格納部２に格納されたデータの中から検索条件に合致するデータを有するテキスト情報の段落を特定する。
【０１３７】
検索制御部３２は、キーワードの語句として入力された「俳優Ｄ」を、台本内の配役表から、「刑事Ｂ」に変換する。そして、検索制御部３２は、「刑事Ｂ」に基づいてデータ格納部２に格納されたデータを検索する。
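The conversion from an actor's name to a role name can be sketched as a dictionary lookup against the cast table in the script. The function name `resolve_keyword` is hypothetical, and passing through keywords that are not actor names is an assumption.

```python
def resolve_keyword(keyword, cast_table):
    """Translate an actor name to the role name used in the script; leave other keywords unchanged."""
    return cast_table.get(keyword, keyword)
```

With a cast table of `{"俳優Ｄ": "刑事Ｂ"}`, a search for 俳優Ｄ becomes a search for 刑事Ｂ, as in the example in the text.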
【０１３８】
「刑事Ｂ」に該当するデータが１２件あると、検索制御部３２は、該当する１２件のデータを取り出して、検索結果としてテキスト情報のシーンを入出力部３１に供給する（ステップＳ２０３，Ｓ２０４の処理）。
【０１３９】
また、同期処理部３３は、映像情報を再生する先頭時刻、最終時刻を入出力部３１に通知し、映像情報、音声情報を、選択された段落のテキスト情報と同期させて入出力部３１に供給する（ステップＳ２０５の処理）。
【０１４０】
端末４３の表示装置５４は、入出力部３１に供給された図９に示すような台本と映像とを表示し、音声を出力する（ステップＳ２０６の処理）。
【０１４１】
尚、映像情報を抽出する場合、端末４３の表示装置５４は、例えば、図８に示すように、テキスト情報とともに、代表的な画像をサムネイル画像として表示することもできる。また、表示装置５４は、静止画によらず、段落の開始時刻から終了時刻までの動画を表示することもできる。
【０１４２】
同期処理部３３は、例えば、図９に示すように、表示装置５４に表示された巻戻し、再生、停止、一時停止、早送り等の各スイッチ部が操作されると、この操作情報に従って動作する。
【０１４３】
例えば、ユーザが、先頭から２分３１秒の位置から映像メディア５を再生するように、端末４３の入力装置５５を操作すると、同期処理部３３は、映像情報の再生のタイミングに合わせて、テキスト情報をスクロールする処理を実行する。
【０１４４】
端末４３の表示装置５４は、映像情報の再生のタイミングに合わせて、選択された段落のテキスト情報として、例えば、「怪我治って」を含むテキスト情報の部分のスクロール表示を行う。
【０１４５】
また、端末４３の表示装置５４は、テキスト情報の再生箇所を、再生箇所であることを示すために、図９に示すように、アンダーラインを施し、斜体文字で表示する。
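Scrolling and highlighting the script in step with playback requires knowing which script segment is active at the current playback position. One way to do that, sketched here under the assumption that the segment start times are kept as a sorted list of seconds, is a binary search. The function name `segment_at` is hypothetical.

```python
import bisect


def segment_at(starts_sec, position_sec):
    """Return the index of the segment playing at position_sec, or None before the first segment.

    starts_sec must be sorted in ascending order.
    """
    i = bisect.bisect_right(starts_sec, position_sec) - 1
    return i if i >= 0 else None
```

For a playback position of 2 minutes 31 seconds (151 s) and segments starting at 0, 60, 150, and 200 s, the active segment is the third one (index 2).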
【０１４６】
以上説明したように、本実施の形態によれば、テキスト情報と、分割された画像認識文字情報及び音声認識文字情報と、を対応付けるようにしたので、タイムコードを介してテキスト情報と映像情報、音声情報とを、容易に構造化することができる。
【０１４７】
また、ユーザが入力した検索条件に合致するテキスト情報における分割部分を特定し、特定された分割部分に対応するタイムコードを抽出し、抽出されたタイムコードに対応する映像情報を特定して特定した映像情報をユーザに提供するようにした。このため、ユーザに、所望の映像情報を提供することができる。
【０１４８】
さらに画像情報と音声情報とを併用し、映像情報、音声情報のみで対応付けられなくても音声情報にて対応付けたり、映像情報にて対応付けたりすることで、テキスト情報と映像情報との構造化をより正確に行うことができる。
【０１４９】
尚、本発明を実施するにあたっては、種々の形態が考えられ、上記実施の形態に限られるものではない。
例えば、上記実施の形態では、映像情報を、映像音声入力部１３から、画像認識部１４、音声認識部１５、マッピング部１７および構造化処理部１８を介してデータ格納部２に供給するように構成された。しかし、映像情報を映像音声入力部１３から、直接、データ格納部２に供給するように構成されてもよい。
【０１５０】
また、構造化処理部１８は、テキストメディアファイルにおけるテキスト情報の各区切りの開始アドレスと終了アドレスと、タイムコードファイルにおける各区切りのタイムコードの開始アドレスと終了アドレスとを、管理情報として生成するように構成されてもよい。
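The management information described above records where each segment and its time code sit inside the text media file and the time-code file. The sketch below assumes UTF-8 encoding, byte offsets as the "addresses", one `start,end` line per time code, and exclusive end offsets. None of these details are stated in the patent, and the name `build_management_info` is hypothetical.

```python
def build_management_info(segments, timecodes):
    """Compute byte ranges of each segment in the text file and of its entry in the time-code file."""
    info, text_pos, tc_pos = [], 0, 0
    for segment, (start, end) in zip(segments, timecodes):
        text_len = len(segment.encode("utf-8"))
        tc_len = len(f"{start},{end}\n".encode("utf-8"))
        info.append({
            "text_start": text_pos, "text_end": text_pos + text_len,  # end is exclusive
            "tc_start": tc_pos, "tc_end": tc_pos + tc_len,            # end is exclusive
        })
        text_pos += text_len
        tc_pos += tc_len
    return info
```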
【０１５１】
このようにするには、コンテンツ処理装置は、図１０に示すように、データ格納部２に、管理情報を格納する管理ファイル格納部２４を備える。この管理ファイル格納部２４は、管理情報格納部に相当するものである。そして、構造化処理部１８は、管理情報を管理ファイル格納部２４に格納する。
【０１５２】
また、コンテンツ処理装置は、図１１に示すように、テキスト情報とタイムコードとを結合して、構造化結果を生成するように構成されることもできる。この場合、データ格納部２は、図１２に示すように、映像メディア格納部２１とテキストメディア格納部２３と、を備える。構造化処理部１８は、テキスト情報とタイムコードとを結合して、タイムコードを含む構造化されたテキストメディアファイルを生成し、テキストメディア格納部２３に、生成したテキストメディアファイルを格納する。尚、データ格納部２は、タイムコード格納部２２を備える必要はない。
【０１５３】
さらに、コンテンツ処理装置は、図１３に示すようなＸＭＬ（エクステンシブルマークアップランゲージ）ファイルを生成して、格納しておくように構成されることもできる。このＸＭＬファイルは、拡張マークアップ言語であるＸＭＬによって、ＭＰＥＧ７（ムービングピクチャーエキスパートグループ７）形式の構造的記述を行ったファイルである。
【０１５４】
データ格納部２は、図１４に示すように、ＸＭＬファイル格納部２５を備える。この場合、データ格納部２は、タイムコード格納部２２と、テキストメディア格納部２３と、を備える必要はない。構造化処理部１８は、ＸＭＬファイルを生成する。
【０１５５】
構造化処理部１８は、図１５に示すフローチャートに従って、ＸＭＬファイルを生成する。
【０１５６】
即ち、構造化処理部１８は、タグを用いて、ＸＭＬテンプレートに、映像情報のコンピュータシステム内での格納位置（フォルダ位置）情報を挿入する（ステップＳ３０１）。
【０１５７】
構造化処理部１８は、タグを用いて、ＸＭＬテンプレートに、テキストメディア６に記載されている番組タイトル情報を挿入する（ステップＳ３０２）。
【０１５８】
構造化処理部１８は、タグを用いて、ＸＭＬテンプレートに、各段落の開始時間、終了時間情報を挿入する（ステップＳ３０３）。
【０１５９】
構造化処理部１８は、タグを用いて、ＸＭＬテンプレートに、登場人物情報を挿入する（ステップＳ３０４）。
【０１６０】
構造化処理部１８は、タグを用いて、ＸＭＬテンプレートに、テキスト情報を挿入する（ステップＳ３０５）。
【０１６１】
構造化処理部１８は、挿入した台詞の終了時間が、タイムコードの最終時間に達したか否かを判定する（ステップＳ３０６）。
最終時間に達していないと判定した場合（ステップＳ３０６においてＮｏ）、構造化処理部１８は、ＸＭＬテンプレートに、再度、各段落の開始時間、終了時間情報、登場人物情報、テキスト情報を挿入する（ステップＳ３０３〜Ｓ３０５）。
【０１６２】
最終時間に達したと判定した場合（ステップＳ３０６においてＹｅｓ）、構造化処理部１８は、データ格納部２のＸＭＬファイル格納部２５に、ＸＭＬファイルを格納する（ステップＳ３０７）。
【０１６３】
図１３に示すＸＭＬファイルにおいて、＜？ｘｍｌ＞、＜Ｍｐｅｇ７＞は、予め、ＭＰＥＧ７規格として定められているＸＭＬテンプレートである。
【０１６４】
＜ＭｅｄｉａＬｏｃａｔｏｒ＞タグ、および＜ＭｅｄｉａＵｒｉ＞タグは、映像情報の格納位置（フォルダ位置）を示すタグである。「Ｃ：￥メタ情報￥映像データ￥ドラマ映像０２０９１３．ｍｐｇ」は、構造化処理部１８が挿入した映像情報を格納しようとする映像メディア格納部２１における格納位置を示す（ステップＳ３０１の処理結果）。
【０１６５】
＜ＣｒｅａｔｉｏｎＩｎｆｏｒｍａｔｉｏｎ＞タグ内の＜Ｔｉｔｌｅ＞タグは、番組タイトル情報挿入用のタグである。「ドラマ番組０２０９１３」は、構造化処理部１８が挿入した番組タイトル情報を示す（ステップＳ３０２の処理結果）。
【０１６６】
＜ＭｅｄｉａＴｉｍｅ＞タグ、および＜ＭｅｄｉａＲｅｌＴｉｍｅＰｏｉｎｔ＞タグ、および＜ＭｅｄｉａＤｕｒａｔｉｏｎ＞タグは、各段落の開始時間と終了時間の情報挿入用のタグである。構造化処理部１８は、このタグを用いて各段落の開始時間と終了時間の情報を挿入する（ステップＳ３０３の処理結果）。
【０１６７】
＜Ｎａｍｅ＞タグ、および＜ＧｉｖｅｎＮａｍｅ＞タグ、および＜ＦａｍｉｌｙＮａｍｅ＞タグは、登場人物情報挿入用のタグである。「刑事Ａ」は、構造化処理部１８が挿入した登場人物情報である（ステップＳ３０４の処理結果）。
【０１６８】
＜ＴｅｘｔＡｎｎｏｔａｔｉｏｎ＞タグ、および＜ＦｒｅｅＴｅｘｔＡｎｎｏｔａｔｉｏｎ＞タグは、テキスト情報挿入用のタグである。「×××１丁目１番地で強盗事件」を含む台詞は、構造化処理部１８がこれらのタグを用いて挿入したテキスト情報である（ステップＳ３０５の処理結果）。
【０１６９】
挿入した台詞の終了時間が、タイムコードの最終時間に到達していなければ、構造化処理部１８は、次の台詞の各情報の挿入を行う（ステップＳ３０６）。台詞の終了時間がタイムコードの最終時間に達したときは、構造化処理部１８は、データ格納部２のＸＭＬファイル格納部２５にＸＭＬファイルを格納する（ステップＳ３０７の処理）。
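The XML generation loop (steps S301 to S307) can be sketched with Python's standard `xml.etree.ElementTree`. The tag names are the ones mentioned in the text, but FIG. 13 is not reproduced here, so the nesting, the `Segment` wrapper element, and the input dictionary keys are assumptions. The output is not claimed to be schema-valid MPEG-7.

```python
import xml.etree.ElementTree as ET


def build_xml(media_uri, title, segments):
    """Build an MPEG-7-like XML string from a media location, a title, and per-segment data."""
    root = ET.Element("Mpeg7")
    locator = ET.SubElement(root, "MediaLocator")
    ET.SubElement(locator, "MediaUri").text = media_uri            # S301: storage location
    creation = ET.SubElement(root, "CreationInformation")
    ET.SubElement(creation, "Title").text = title                  # S302: programme title
    for seg in segments:                                           # S303-S306: one pass per segment
        node = ET.SubElement(root, "Segment")                      # hypothetical wrapper element
        media_time = ET.SubElement(node, "MediaTime")
        ET.SubElement(media_time, "MediaRelTimePoint").text = seg["start"]
        ET.SubElement(media_time, "MediaDuration").text = seg["duration"]
        name = ET.SubElement(node, "Name")
        ET.SubElement(name, "GivenName").text = seg["speaker"]     # S304: character
        annotation = ET.SubElement(node, "TextAnnotation")
        ET.SubElement(annotation, "FreeTextAnnotation").text = seg["text"]  # S305: script text
    return ET.tostring(root, encoding="unicode")                   # S307 would store this string
```

Iterating over the list of segments replaces the explicit "has the final time been reached" test of step S306, since the loop ends after the last segment.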
【０１７０】
構造化処理部１８が、このようなＸＭＬファイルをＸＭＬファイル格納部２５に格納した後、ユーザが、同期データ出力部３にキーワードとなる語句を入力して所望の映像情報およびテキスト情報を要求するものとする。
【０１７１】
入出力部３１は、ユーザによって入力された語句を検索制御部３２に出力する。検索制御部３２は、その語句を含むテキスト情報の段落をＸＭＬファイル格納部２５に格納されているデータの中から検索し、該当するテキスト情報があれば、その段落部分を入出力部３１に出力する。
【０１７２】
ユーザが、あるテキスト情報の特定の範囲を選択すれば、入出力部３１は、ユーザが選択したテキスト情報と同期する映像情報を出力するように検索制御部３２に要求する。
【０１７３】
以上が図１３に示すようなＸＭＬ（エクステンシブルマークアップランゲージ）ファイルを生成して格納するように構成されたコンテンツ処理装置の応用例である。
【０１７４】
同期データ出力部３の入出力部３１は、ユーザが発話した音声を入力するように構成されることもできる。この場合、端末４３の入力装置５５に、ユーザの発話内容をテキストデータに変換する音声認識部を備える。音声認識部に対応する入出力部３１は、ユーザの発話内容から変換されたテキスト情報を検索制御部３２に出力する。
【０１７５】
また、同期データ出力部３は、検索対象となる人物の音声を入力することにより、話者を同定して、その人物名を検索制御部３２に出力するように構成されることもできる。
【０１７６】
さらに、同期データ出力部３は、所望のシーンと同じようなカット割の動画像や、同じようなカメラワークの動画像、背景の画像、人物の画像、人物の動作の動画像、物体の画像、物体の動作の動画像、音楽、効果音、同じような無音区間配置の音声の何れかが入力されて、各々を認識し、認識結果の全部または一部を検索制御部３２に出力するように構成されることもできる。
【０１７７】
本実施の形態では、タイムコード生成部１６が、生成したタイムコードを画像認識部１４及び音声認識部１５にそれぞれ出力し、画像認識部１４及び音声認識部１５が、それぞれ生成した画像認識文字情報、音声認識文字情報にタイムコードを付加するようにした。しかし、これに限られるものではなく、図１６に示すコンテンツ処理装置のように、タイムコード生成部１６が、構造化処理部１８に、生成したタイムコードを供給するようにしてもよい。
【０１７８】
このように構成された場合、画像認識部１４及び音声認識部１５が、それぞれ生成した画像認識文字情報及び音声認識文字情報にタイムコードを付加するのではなく、構造化処理部１８が、画像認識文字情報及び音声認識文字情報と各タイムコードとの対応関係に基づいて、各タイムコードとテキスト情報との対応関係を確定させる。
【０１７９】
また、コンピュータを、再生装置の全部又は一部として動作させ、あるいは、上述の処理を実行させるためのプログラムを、フレキシブルディスク、ＭＤ、ＣＤ−ＲＯＭ、ＤＶＤなどのコンピュータ読み取り可能な記録媒体に格納して配布し、これをコンピュータにインストールし、上述の手段として動作させ、あるいは、上述の工程を実行させてもよい。
【０１８０】
さらに、インターネット上のサーバ装置が有するディスク装置等にプログラムを格納しておき、例えば、搬送波に重畳させて、コンピュータにダウンロード等するものとしてもよい。
【０１８１】
【発明の効果】
以上説明したように、本発明によれば、コンテンツを容易に利用することができる。
【図面の簡単な説明】
【図１】本発明の第１の実施の形態に係るコンテンツ処理装置の構成を示すブロック図である。
【図２】図１のコンテンツ処理装置が処理する各情報の関係を示す説明図である。
【図３】図１に示すコンテンツ処理装置のハードウェア構成を示すブロック図である。
【図４】図３に示す端末の構成を示すブロック図である。
【図５】図１の同期データ生成部の動作を示すフローチャートである。
【図６】図１の同期データ出力部の動作を示すフローチャートである。
【図７】図１のマッピング部の処理内容を示す説明図である。
【図８】図４に示す表示装置に表示された検索画面を示す説明図である。
【図９】図４に示す表示装置に表示された検索結果画面を示す説明図である。
【図１０】図１に示すデータ格納部の応用例（１）として、管理ファイル格納部を備えたデータ格納部の構成を示すブロック図である。
【図１１】図１の構造化処理部が構造化したデータの応用例を示す説明図である。
【図１２】図１に示すデータ格納部の応用例（２）として、映像メディア格納部とテキストメディア格納部とのみを備えた構成を示すブロック図である。
【図１３】図１に示すコンテンツ処理装置の応用例として、同期データ生成部が処理するＸＭＬファイルの記述例を示す説明図である。
【図１４】図１に示すデータ格納部の応用例（３）として、ＸＭＬファイル格納部を備えた構成を示すブロック図である。
【図１５】図１に示す同期データ生成部がＸＭＬファイルを生成する動作を示すフローチャートである。
【図１６】図１に示すコンテンツ処理装置を応用した構成を示すブロック図である。
【符号の説明】
１同期データ生成部
２データ格納部
３同期データ出力部
１６タイムコード生成部
１７マッピング部
１８構造化処理部
３１入出力部
３２検索制御部
３３同期処理部

[0001]
[Technical field of the invention]
The present invention relates to a content processing apparatus, a content processing method, and a program.
[0002]
[Prior art]
In recent years, information has increasingly taken multimedia form, and the volume of multimedia content, which includes video information, audio information, text information, and the like, has been growing rapidly. Storing this information and recalling it later as needed allows it to be used more effectively.
[0003]
For this reason, there are content processing apparatuses designed to make effective use of such multimedia content.
Such a content processing apparatus performs speech recognition on the audio information attached to video information, and structures the video information together with the text information obtained as the speech recognition result.
[0004]
As an example of a system that associates multiple types of information, there is an electronic text creation system that generates an optimal third text from a first text, obtained by character recognition of the characters on a manuscript, and a second text, obtained by speech recognition of audio information, and thereby obtains text information as digital data (see, for example, Patent Document 1).
[0005]
There is also a system that performs speech recognition on the audio information attached to video information and structures the video information together with the text information obtained as the speech recognition result (see, for example, Patent Document 2).
[0006]
Furthermore, there are multimedia information processing apparatuses that digitize video information, audio information, text information, and the like, and store that information together with time information indicating how the pieces of information relate to one another in time (see, for example, Patent Documents 3 and 4).
[0007]
[Patent Document 1]
JP 2001-282779 A (pages 3-8, FIG. 1)
[Patent Document 2]
JP 2002-189728 A (page 3, FIG. 1)
[Patent Document 3]
JP H08-253209 A (pages 3-8, FIG. 5)
[Patent Document 4]
JP 2002-278974 A (pages 4-6, FIG. 1)
[0008]
[Problems to be solved by the invention]
The conventional content processing apparatus described in Patent Document 1 matches the manuscript against the text produced by speech recognition, so it can automatically create more accurate text information. However, Patent Document 1 discloses nothing about association with other information such as video information. Consequently, when the technique of Patent Document 1 is applied to structuring video information, audio information, and text information, the structuring has to be done manually.
[0009]
The conventional content processing apparatus described in Patent Document 2 can structure video information and the audio information attached to it, but Patent Document 2 discloses nothing about additionally structuring text information.
[0010]
Furthermore, the conventional content processing apparatus described in Patent Document 3 structures video information, audio information, text information, and the like by using time information recorded at the time of input, but this time information must be added when the video, audio, and text information is created. Therefore, when no time information has been added, the text, video, and audio information cannot be associated automatically, and manual work is needed to associate them.
[0011]
The conventional content processing apparatus described in Patent Document 4 assumes that time information is written in the text information, namely the script of a movie, drama, or play, and associates the text information with the video information by comparing that time information with the elapsed time of the video. However, scripts often contain no time information. Even when they do, the script and the video frequently drift apart in time because of the actual shooting and editing. In particular, when scenes change in one second or less, as in a television drama, this drift has a large effect and the association is likely to fail.
[0012]
Moreover, all of these conventional content processing apparatuses depend heavily on speech recognition, yet actual television programs often contain scenes in which a person appears but says nothing, or in which the audio environment is poor and speech recognition is difficult. In addition, performers' ad-libs can make the spoken content differ greatly from the script, so that the audio information and the text information cannot be associated.
[0013]
Furthermore, if it is difficult to associate audio information with text information, generating usable content takes time, which limits how the content can be used.
[0014]
The present invention has been made in view of such conventional problems, and an object thereof is to provide a content processing apparatus, a content processing method, and a program that can easily use content.
[0015]
[Means for Solving the Problems]
In order to achieve this object, a content processing apparatus according to the first aspect of the present invention is
a content processing apparatus that associates content including video information and audio information with script information expressing the scenario of the content as character data, comprising:
a recognition unit that recognizes the content and generates image recognition character information and voice recognition character information, each representing the characteristic part of each scene included in the content as character data;
a time code generation unit that extracts and delimits the characteristic parts of the image recognition character information and the voice recognition character information generated by the recognition unit, and generates a time code indicating the start time and end time of each delimited part;
a mapping unit that acquires the characteristic part of the script information, divides the script information for each scene based on the acquired characteristic part, divides the image recognition character information and the voice recognition character information generated by the recognition unit at the positions indicated by the time code generated by the time code generation unit, determines that each is correct even if the divided image recognition character information and voice recognition character information do not match, associates each scene of the script information with the divided image recognition character information and voice recognition character information, and generates correspondence information indicating the correspondence between each scene of the script information and the divided image recognition character information and voice recognition character information; and
a structuring processing unit that, based on the correspondence information generated by the mapping unit and the time code generated by the time code generation unit, generates structured information indicating the relationship between each scene of the script information and the time code of each scene, as information for searching for necessary content.
[0018]
A scene prediction unit that predicts the screen configuration of the script information, adds the predicted screen configuration to the script information, and outputs it to the mapping unit may be provided.
[0019]
A data storage unit may be provided for storing the script information, the content, the time code generated by the time code generation unit, and the structured information generated by the structuring processing unit.
[0020]
The data storage unit may include:
a content storage unit for storing the content;
a text file storage unit for storing the script information and the structured information; and
a time code storage unit for storing the time code.
[0021]
The structuring processing unit may generate, in the text file storage unit, a script information file storing the script information and a time code file storing the time code, and may generate management information indicating the start address and end address of each segment in the script information file and the start address and end address of each time code in the time code file.
The data storage unit may include a management information storage unit that stores the management information.
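The management information described above can be pictured as a table of byte offsets into the script information file. The following is a minimal sketch under stated assumptions: the names `build_script_file` and `read_segment`, the newline-terminated UTF-8 layout, and the use of an exclusive end address are illustrative choices and are not specified in this description.

```python
def build_script_file(segments):
    """Concatenate script segments and record each segment's address range.

    Assumption: each segment is stored as UTF-8 followed by a newline, and
    the end address is exclusive (one past the last byte of the segment).
    """
    data = bytearray()
    management = []
    for text in segments:
        start = len(data)
        data.extend(text.encode("utf-8") + b"\n")
        management.append((start, len(data)))
    return bytes(data), management


def read_segment(data, management, index):
    """Fetch one segment using only its recorded start and end addresses."""
    start, end = management[index]
    return data[start:end].decode("utf-8").rstrip("\n")


if __name__ == "__main__":
    data, management = build_script_file(["Night.", "A: Hi"])
    print(management)
    print(read_segment(data, management, 1))
```

The same layout could be applied to the time code file, so that a segment index leads directly to both the script text and its time code.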
[0022]
The data storage unit may include a markup language file storage unit that stores the script information as a markup language file.
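As an illustration of storing script information as a markup language file, the sketch below serializes scenes to XML with the Python standard library. The element name `scene`, the attribute names `id`, `start`, and `end`, and the idea of carrying the time code as attributes are assumptions made for this example; the description does not specify a schema.

```python
import xml.etree.ElementTree as ET


def script_to_xml(scenes):
    """Serialize (text, start, end) tuples to an XML string.

    Assumption: times are seconds from the start of the content.
    """
    root = ET.Element("script")
    for number, (text, start, end) in enumerate(scenes, start=1):
        element = ET.SubElement(
            root,
            "scene",
            {"id": str(number), "start": f"{start:.2f}", "end": f"{end:.2f}"},
        )
        element.text = text
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(script_to_xml([("Criminal A enters the room.", 0.0, 4.2)]))
```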
[0023]
A synchronization data output unit may be provided that outputs the script information and the content stored in the data storage unit in synchronization based on an input search condition.
[0024]
The synchronization data output unit may include:
an input/output unit that receives search conditions for extracting necessary scenes from the script information and the content, and outputs the script information and content of the scenes corresponding to the search conditions;
a search control unit that identifies a scene in the script information corresponding to the search conditions input to the input/output unit, and extracts the time code corresponding to the identified scene; and
a synchronization processing unit that identifies the scene of the content corresponding to the extracted time code, synchronizes the content of the identified scene with the script information corresponding to the search conditions, and outputs the synchronized content and script information to the input/output unit.
[0025]
A content processing method according to a second aspect of the present invention is
a content processing method for associating content including video information and audio information with script information expressing the scenario of the content as character data, comprising the steps of:
recognizing the content and generating image recognition character information and voice recognition character information, each representing the characteristic part of each scene included in the content as character data;
extracting and delimiting the characteristic parts of the generated image recognition character information and voice recognition character information, and generating a time code indicating the start time and end time of each delimited part;
acquiring the characteristic part of the script information, dividing the script information for each scene based on the acquired characteristic part, dividing the generated image recognition character information and voice recognition character information at the positions indicated by the generated time code, determining that each is correct even if the divided image recognition character information and voice recognition character information do not match, associating each scene of the script information with the divided image recognition character information and voice recognition character information, and generating correspondence information indicating the correspondence between each scene of the script information and the divided image recognition character information and voice recognition character information;
generating, based on the generated correspondence information and the generated time code, structured information indicating the relationship between each scene of the script information and the time code of each scene, as information for searching for necessary content; and
storing the script information, the content, the time code, and the structured information.
[0026]
A program according to a third aspect of the present invention causes a computer to execute:
a procedure for recognizing the video information and audio information included in content and generating image recognition character information and voice recognition character information, each representing the characteristic part of each scene included in the content as character data;
a procedure for extracting and delimiting the characteristic parts of the generated image recognition character information and voice recognition character information, and generating a time code indicating the start time and end time of each delimited part;
a procedure for acquiring the characteristic part of script information expressing the scenario of the content as character data, dividing the script information for each scene based on the acquired characteristic part, dividing the generated image recognition character information and voice recognition character information at the positions indicated by the generated time code, determining that each is correct even if the divided image recognition character information and voice recognition character information do not match, associating each scene of the script information with the divided image recognition character information and voice recognition character information, and generating correspondence information indicating the correspondence between each scene of the script information and the divided image recognition character information and voice recognition character information;
a procedure for generating, based on the generated correspondence information and the generated time code, structured information indicating the relationship between each scene of the script information and the time code of each scene, as information for searching for necessary content; and
a procedure for storing the script information, the content, the time code, and the structured information.
[0027]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, a content processing apparatus according to an embodiment of the present invention will be described with reference to the drawings.
FIG. 1 shows the configuration of the content processing apparatus according to the present embodiment.
The content processing apparatus includes a synchronization data generation unit 1, a data storage unit 2, and a synchronization data output unit 3.
[0028]
Note that the content holder 4 is a holder that holds a video medium 5 in which video information and audio information are recorded, and a text medium 6 in which a document related to the video information is described.
[0029]
The video media 5 and the text media 6 are external recording media, such as magnetic recording tapes and DVDs (Digital Versatile Discs). Although their recording capacity is small, a flexible disk, MD (Mini Disc; registered trademark), MO (Magneto-Optical disk), CD-ROM (Compact Disc Read-Only Memory), or the like can also be used as the video media 5 and the text media 6.
[0030]
In the case of a drama program in television broadcasting, the text media 6 records drama program script data as text information. This text information expresses a scenario of content with character data. The video media 5 records video information of the drama program and audio information such as the voice of the actor uttered according to the dialogue.
[0031]
Data recorded on the video media 5 and the text media 6 taken out from the content holder 4 is input to the synchronous data generation unit 1.
[0032]
The synchronous data generation unit 1 extracts data recorded in the video media 5 and the text media 6, performs data processing, and generates synchronized data.
[0033]
The synchronization data generation unit 1 includes a text input unit 11, a scene prediction unit 12, a video / audio input unit 13, an image recognition unit 14, a voice recognition unit 15, a time code generation unit 16, a mapping unit 17, and a structuring processing unit 18.
[0034]
The text input unit 11 reads text information from the text media 6. When the text media 6 is a recording medium on which analog data is recorded, the text input unit 11 includes, for example, an OCR (optical character reader) device and a digital-analog conversion device (not shown) in order to read the analog data from the text media 6. The text input unit 11 outputs the text information read from the text media 6 to the scene prediction unit 12.
[0035]
The scene prediction unit 12 predicts the screen configuration and audio configuration of each scene based on the text information output from the text input unit 11. The screen configuration refers to the names and movements of persons and objects shown on the screen. In addition, the voice configuration refers to a person's voice, sound, music, and sound effect in the scene.
[0036]
In order to predict the screen configuration and the audio configuration, the scene prediction unit 12 stores rules in advance. Examples are a rule that the face of a person who appears in the script is shown on the screen unless there is a camera work instruction, and a rule that, when the subject is omitted, the subject of the previous scene is used.
[0037]
Then, the scene prediction unit 12 predicts the screen configuration and audio configuration of each scene described in the script based on the accumulated rules and the text information output from the text input unit 11. For the prediction of the screen configuration and audio configuration of each scene, it is sufficient to list the contents described in the script for each scene.
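The two example rules (show the face of the person in the script unless a camera work instruction is given, and reuse the previous scene's subject when the subject is omitted) can be sketched as follows. This is only an illustration: the function name `predict_screen` and the dictionary keys `subject`, `camera`, and `screen` are hypothetical, and an actual scene prediction unit would hold many more rules.

```python
def predict_screen(scenes):
    """Apply two stored rules to a list of scene dictionaries.

    Rule 1: without a camera work instruction, the subject's face is shown.
    Rule 2: when the subject is omitted, the previous scene's subject is used.
    """
    predicted = []
    previous_subject = None
    for scene in scenes:
        subject = scene.get("subject") or previous_subject
        if scene.get("camera"):
            # An explicit camera work instruction overrides the default.
            screen = [scene["camera"]]
        elif subject is not None:
            screen = [f"face of {subject}"]
        else:
            screen = []
        predicted.append({**scene, "subject": subject, "screen": screen})
        previous_subject = subject
    return predicted


if __name__ == "__main__":
    for row in predict_screen([{"subject": "A"}, {}, {"subject": "B", "camera": "close-up of hands"}]):
        print(row)
```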
[0038]
When a description is ambiguous, the scene prediction unit 12 estimates the screen configuration from the current scene setting and the person's action. For example, when the script of a TV drama contains the description “person A goes out of the room”, the script does not explicitly state that person A disappears from the screen, but the scene prediction unit 12 estimates from the current scene setting and the person's action that person A has disappeared from the screen.
The scene prediction unit 12 adds the predicted screen configuration and audio configuration of each scene to the text information and outputs the text information to the mapping unit 17.
[0039]
The video / audio input unit 13 extracts video information and audio information recorded on an audio track from the video medium 5. Note that the video / audio input unit 13 includes a video capture or the like in order to read analog data from the video media 5 when the video media 5 is a recording medium on which analog data is recorded. The video / audio input unit 13 reads analog video information and analog audio information from the video media 5 and converts them into digital video information and digital audio information in AVI format or MPEG format using video capture or the like.
The video / audio input unit 13 outputs the extracted video information and audio information to the image recognition unit 14 and the audio recognition unit 15, respectively.
[0040]
The image recognition unit 14 performs image recognition processing on the video information output from the video / audio input unit 13 and generates image recognition character information as the image recognition result. The image recognition character information is character data representing a characteristic part of each scene included in the video information. For example, when the face of criminal A is shown as a character in the video, the image recognition unit 14 generates the image recognition character information “face of criminal A” as the image recognition result.
[0041]
In order to perform such image recognition processing, the image recognition unit 14 extracts from the video information all or some of the following characteristic parts: cuts, camera work, lighting environment, background, characters, facial expressions, face orientation, age, gender, hairstyle, makeup, human actions, objects in the screen, and object motion. The image recognition unit 14 then divides the video based on the extracted characteristic parts and performs image recognition processing.
[0042]
Further, the image recognition unit 14 outputs the video information to the time code generation unit 16 and outputs information indicating the division position of the video information to the time code generation unit 16.
[0043]
The voice recognition unit 15 performs voice recognition processing on the audio information output from the video / audio input unit 13 and generates voice recognition character information as the voice recognition result. The voice recognition character information is character data representing a characteristic part of each scene included in the audio information. For example, when the audio includes the voice of criminal A as a character, the voice recognition unit 15 generates the voice recognition character information “voice of criminal A” as the voice recognition result.
[0044]
In order to perform such voice recognition processing, the voice recognition unit 15 extracts all or some of the following characteristic parts: utterance content, speakers, music, sound effects, other non-verbal voices, and silent sections.
[0045]
Further, the voice recognition unit 15 performs a word segmentation process on the generated voice recognition character information, and divides the voice information into each word. Then, the voice recognition unit 15 outputs information indicating the division position when the voice information is divided into words together with the voice information to the time code generation unit 16.
[0046]
The time code generation unit 16 generates a time code related to images and sounds based on information output from the image recognition unit 14 and the voice recognition unit 15. The time code indicates time information of start and end of the division position of the video information and audio information.
[0047]
The time code generation unit 16 generates the following time code based on the feature portion included in the image.
(1) Cut time code: Indicates the cut start time and end time in the image information.
(2) Camera work time code: Indicates the start time and end time of camera work.
(3) Lighting environment time code: Indicates the start time and end time of the lighting environment.
(4) Background time code: Indicates the start time and end time of the background.
(5) Character time code: Indicates the appearance start time and end time of the character.
(6) Facial feature time code: indicates the start time and end time of facial features such as the face direction, facial expression, and makeup of a person.
(7) Person feature time code: Indicates the start time and end time of person features such as age, gender, hairstyle, makeup, clothes, and appearance.
(8) Person action time code: Indicates the start time and end time of a person action.
(9) Object time code: Indicates the appearance start time and end time of an object.
(10) Object motion time code: Indicates the start time and end time of the motion of the object.
[0048]
Further, the time code generation unit 16 generates the following time code based on the feature portion included in the voice.
(11) Word time code: indicates the start time and end time of each word in the audio information.
(12) Speaker time code: Indicates the start time and end time of each speaker in the voice information.
(13) Non-verbal voice time code: Indicates the start time and end time of non-language voice in voice information such as screams, laughter, and sighs.
(14) Music time code: Indicates the start time and end time of music in the audio information.
(15) Sound effect time code: Indicates the start time and end time of the sound effect.
(16) Silent section time code: Indicates the start time and end time of the silent section.
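The sixteen kinds of time code listed above share one shape: a kind, a start time, and an end time. The sketch below models that shape. The names `Kind`, `TimeCode`, and `time_codes_from_boundaries`, the use of seconds as the unit, and the `label` field are assumptions for illustration and do not appear in this description.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    # Time codes (1) to (10): derived from the image.
    CUT = 1
    CAMERA_WORK = 2
    LIGHTING = 3
    BACKGROUND = 4
    CHARACTER = 5
    FACIAL_FEATURE = 6
    PERSON_FEATURE = 7
    PERSON_ACTION = 8
    OBJECT = 9
    OBJECT_MOTION = 10
    # Time codes (11) to (16): derived from the audio.
    WORD = 11
    SPEAKER = 12
    NON_VERBAL = 13
    MUSIC = 14
    SOUND_EFFECT = 15
    SILENCE = 16


@dataclass(frozen=True)
class TimeCode:
    kind: Kind
    start: float  # assumed unit: seconds from the start of the content
    end: float
    label: str = ""

    def __post_init__(self):
        if self.end < self.start:
            raise ValueError("end time must not precede start time")

    @property
    def from_audio(self) -> bool:
        return self.kind.value >= Kind.WORD.value


def time_codes_from_boundaries(kind, boundaries, labels):
    """Turn N+1 division positions into N time codes of the given kind."""
    return [
        TimeCode(kind, start, end, label)
        for start, end, label in zip(boundaries, boundaries[1:], labels)
    ]


if __name__ == "__main__":
    for code in time_codes_from_boundaries(Kind.WORD, [0.0, 0.4, 0.9], ["hello", "there"]):
        print(code)
```

Here the division positions reported by a recognition unit are treated as shared boundaries, so the end of one part is the start of the next; parts with gaps between them would need explicit start and end pairs instead.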
[0049]
The time code generation unit 16 outputs all or part of the generated time codes related to the image and sound to the image recognition unit 14 and the sound recognition unit 15, respectively.
[0050]
The image recognition unit 14 described above adds all or part of the time codes output from the time code generation unit 16 to the image recognition character information, and outputs the image recognition character information with the time codes to the mapping unit 17 together with the video information.
[0051]
The voice recognition unit 15 assigns each time code output from the time code generation unit 16. That is, the voice recognition unit 15 assigns a word time code to each word, a speaker time code to each utterance, a non-verbal voice time code to each non-verbal voice, a music time code to music, a sound effect time code to each sound effect, and a silent section time code to each silent section.
[0052]
The voice recognition unit 15 adds all or part of the time codes output from the time code generation unit 16 to the voice recognition character information, and outputs the voice recognition character information with the time codes to the mapping unit 17 together with the audio information.
[0053]
The mapping unit 17 divides the text information output from the scene prediction unit 12 into the predicted scenes, and associates the divided text information with the divided image recognition character information and voice recognition character information output from the image recognition unit 14 and the voice recognition unit 15, respectively.
[0054]
Specifically, the mapping unit 17 detects, in the text information output from the scene prediction unit 12, breakpoints of the script such as speaker changes and blank lines, using cues such as line breaks, indents, person names, proper nouns, and place names. When the mapping unit 17 detects a breakpoint, it divides the text information at that position so that the text can be searched easily.
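A minimal sketch of this breakpoint detection is given below. It handles only two of the cues mentioned, blank lines and speaker changes, and it assumes a hypothetical script layout in which dialogue lines start with a speaker name followed by a colon. Both the layout and the name `split_script` are assumptions for illustration.

```python
import re

# Assumed layout: "NAME: dialogue". Lines without this prefix have no speaker.
SPEAKER_LINE = re.compile(r"^(?P<name>[A-Z][A-Za-z ]*):\s*(?P<text>.+)$")


def split_script(script):
    """Divide script text at blank lines and at speaker changes."""
    segments = []
    current = None
    for raw in script.splitlines():
        line = raw.strip()
        if not line:
            current = None  # a blank line closes the open segment
            continue
        match = SPEAKER_LINE.match(line)
        speaker = match.group("name") if match else None
        text = match.group("text") if match else line
        if current is None or current["speaker"] != speaker:
            current = {"speaker": speaker, "text": text}
            segments.append(current)
        else:
            current["text"] += " " + text  # same speaker continues
    return segments


if __name__ == "__main__":
    sample = (
        "Night. A small room.\n"
        "\n"
        "CRIMINAL A: Nobody saw me.\n"
        "CRIMINAL A: I am sure of it.\n"
        "DETECTIVE: Are you?\n"
        "Detective stands up.\n"
    )
    for segment in split_script(sample):
        print(segment)
```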
[0055]
Next, the mapping unit 17 compares the text information with the divided image recognition character information and speech recognition character information, respectively.
[0056]
For the comparison, DP matching is used, for example. DP matching, also called DTW (Dynamic Time Warping), is a method that performs time normalization using dynamic programming so that the same phonemes in the words correspond to each other, and then obtains the distance between the two patterns.
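The following sketch shows DP matching in its textbook form, applied to word sequences rather than phonemes for brevity. The 0/1 local cost, the three allowed steps, and the function name `dp_match` are assumptions chosen for illustration; the description does not state which cost function or step pattern is used.

```python
def dp_match(script_words, recognized_words):
    """Align two word sequences by dynamic time warping.

    Returns (distance, path), where path is a list of index pairs
    (script index, recognized index). For an empty input the path is
    empty, and the distance is infinite unless both inputs are empty.
    """
    n, m = len(script_words), len(recognized_words)
    inf = float("inf")
    # cost[i][j]: minimum accumulated cost of aligning the first i script
    # words with the first j recognized words.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = 0.0 if script_words[i - 1] == recognized_words[j - 1] else 1.0
            cost[i][j] = local + min(
                cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1]
            )
    # Trace the cheapest path back from the end to recover the alignment.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates, key=lambda c: c[0])
    path.reverse()
    return cost[n][m], path


if __name__ == "__main__":
    distance, path = dp_match(
        ["nobody", "saw", "me"], ["nobody", "nobody", "saw", "me"]
    )
    print(distance, path)
```

Because one script word may be matched to several recognized words, a repeated or stretched word in the recognition result does not increase the distance, which is the time normalization effect.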
[0057]
Next, the mapping unit 17 associates these pieces of information to generate correspondence information. As shown in FIG. 2, the correspondence information is information indicating a one-to-one relationship between the divided image recognition character information and voice recognition character information and text information.
[0058]
Even if the text information, the image recognition character information, and the voice recognition character information do not match, the mapping unit 17 determines that each recognition result is correct and performs the association.
[0059]
Then, the mapping unit 17 outputs the divided image recognition character information, voice recognition character information, text information, and correspondence information to the structuring processing unit 18.
[0060]
The structuring processing unit 18 associates each scene into which the text information is divided with each time code, based on the time codes and the correspondence information, and generates structured information. As shown in FIG. 2, the structured information is information indicating a one-to-one relationship between each scene of the associated text information and each time code.
[0061]
The structuring processing unit 18 calculates each time code that is a start time and an end time of each segment of the text information from each time code added to the divided image recognition character information and the divided speech recognition character information.
[0062]
Specifically, the structuring processing unit 18 calculates the time code corresponding to each segment of the text information from all or part of the time codes (1) to (10) added to the divided image recognition character information. The structuring processing unit 18 associates each stage direction or line of dialogue in the text information using the start time of the first word and the end time of the last word of each segment.
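The use of the start time of the first word and the end time of the last word of each segment can be sketched as below. The sketch assumes that the word time codes are already in script order and that the number of words in each segment is known; the function name `segment_time_codes` and the tuple representation are hypothetical.

```python
def segment_time_codes(segment_word_counts, word_time_codes):
    """Derive one (start, end) pair per segment from word time codes.

    word_time_codes is a list of (start, end) tuples in script order.
    A segment with no words (for example, a silent stage direction)
    yields None, since no word time code applies to it.
    """
    result = []
    position = 0
    for count in segment_word_counts:
        words = word_time_codes[position:position + count]
        if words:
            result.append((words[0][0], words[-1][1]))
        else:
            result.append(None)
        position += count
    return result


if __name__ == "__main__":
    words = [(0.0, 0.4), (0.5, 0.9), (2.0, 2.3)]
    print(segment_time_codes([2, 0, 1], words))
```

Segments that yield None here would have to be placed using the other time codes, such as those derived from the image.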
[0063]
For stage directions in the text information, which describe the situation or give shooting instructions, the structuring processing unit 18 uses any of the time codes for the association. For the dialogue of each speaker, the structuring processing unit 18 makes the association using not only the character time code indicating the speaker but also other time codes, by drawing on situation descriptions and proper nouns contained in the dialogue.
[0064]
Further, the structuring processing unit 18 calculates the time code corresponding to each segment of the text information from all or part of the time codes (11) to (16) added to the divided voice recognition character information.
[0065]
That is, from the speaker time code added to the divided voice recognition character information, the structuring processing unit 18 calculates the time code giving the start time and end time of each speaker's appearance and associates it with the text information. The structuring processing unit 18 also associates the start time and end time of music or sound effects mentioned in the stage directions with the music time code or the sound effect time code. Further, the structuring processing unit 18 uses the silent section time code to associate speaker changes and scene changes in the text information.
[0066]
Further, based on the structured information, the structuring processing unit 18 generates a text media file and a time code file, associated for each stage direction or each speaker's dialogue. The text media file is a file that stores the text information, and the time code file is a file that stores the time codes.
[0067]
The structuring processing unit 18 stores the time code corresponding to each segment in the time code file in the order of appearance in the text information. The structuring processing unit 18 can also generate a plurality of text media files and time code files each corresponding to each delimiter.
[0068]
The data storage unit 2 includes a video media storage unit 21, a time code storage unit 22, and a text media storage unit 23.
[0069]
The video media storage unit 21, the time code storage unit 22, and the text media storage unit 23 store video media files, time code files, and text media files, respectively. The video media storage unit 21 and the text media storage unit 23 correspond to a video information storage unit and a text file storage unit, respectively.
[0070]
The structuring processing unit 18 stores the video media file, the generated time code file, and the text media file in the video media storage unit 21, the time code storage unit 22, and the text media storage unit 23, respectively. Further, the structuring processing unit 18 stores the structured information shown in FIG. 2 in the text media storage unit 23.
[0071]
The synchronous data output unit 3 searches the data stored in the data storage unit 2 for the requested data and outputs the corresponding data. It includes an input / output unit 31, a search control unit 32, and a synchronization processing unit 33.
[0072]
The input / output unit 31 receives a search target requested by the user, supplies it to the search control unit 32 as search information, and outputs the search result information obtained by the search. The search information includes, for example, keywords such as date and time, program title, speaker, and actor name.
[0073]
The search control unit 32 searches the data stored in the text media storage unit 23 according to the search information supplied from the input / output unit 31, and outputs the scene portion of the text information corresponding to the search information to the input / output unit 31.
[0074]
Specifically, based on the search information, the search control unit 32 retrieves from the text media storage unit 23, scene by scene, the text information that includes the phrase selected by the user. In addition, based on the structured information stored in the text media storage unit 23, the search control unit 32 retrieves from the time code storage unit 22 the time code of the segment of text information selected by the user, and outputs the retrieved time code to the synchronization processing unit 33.
[0075]
The synchronization processing unit 33 treats the start time and end time indicated by the time code output from the search control unit 32 as the start time and end time of video output, and outputs the video information from the start time to the end time to the input / output unit 31. The synchronization processing unit 33 also processes the text information based on the time code and outputs the processed text information to the input / output unit 31. One processing method, for example, is to scroll the text information.
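The search and synchronization flow described above can be sketched as a lookup from a phrase to scenes and then to playback ranges. The `Scene` record, the case-insensitive substring match, and the function names are assumptions for illustration; the description does not specify how the search control unit matches phrases.

```python
from dataclasses import dataclass


@dataclass
class Scene:
    scene_id: int
    text: str     # the divided text information for this scene
    start: float  # time code start, assumed to be in seconds
    end: float    # time code end


def search_scenes(scenes, phrase):
    """Return the scenes whose text contains the phrase (case-insensitive)."""
    needle = phrase.casefold()
    return [scene for scene in scenes if needle in scene.text.casefold()]


def playback_ranges(hits):
    """Map matching scenes to the (start, end) ranges of video to output."""
    return [(scene.start, scene.end) for scene in hits]


if __name__ == "__main__":
    scenes = [
        Scene(1, "Criminal A enters the room.", 0.0, 4.2),
        Scene(2, "DETECTIVE: Where were you?", 4.2, 7.5),
        Scene(3, "CRIMINAL A: At home.", 7.5, 9.0),
    ]
    hits = search_scenes(scenes, "criminal a")
    print(playback_ranges(hits))
```

A player built on this would seek to each start time, play until the end time, and display or scroll the matching text alongside the video.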
[0076]
Next, a hardware configuration for realizing such a content processing apparatus will be described.
The content processing apparatus shown in FIG. 1 is realized by a computer system as shown in FIG.
[0077]
That is, the computer system includes terminals 41 and 43 and a storage device 42. The terminals 41 and 43 and the storage device 42 are connected via a communication line 44.
[0078]
The content processing apparatus may be constructed on a computer system as shown in FIG. 3 or on a single terminal. In the case of a computer system as shown in FIG. 3, a LAN (Local Area Network), the Internet, or the like is used for the communication line 44, and the terminals 41 and 43 and the storage device 42 are connected by a network.
[0079]
Terminals 41 and 43 are computers, each having functions of a synchronous data generation unit 1 and a synchronous data output unit 3. The storage device 42 corresponds to the data storage unit 2 and is configured by a magnetic disk device or the like.
[0080]
As shown in FIG. 4, the terminals 41 and 43 include a CPU 51, a ROM 52, a RAM 53, a display device 54, an input device 55, an HDD 56, and a drive device 57.
[0081]
A ROM (Read Only Memory) 52 is a memory for storing a program (data) for causing the CPU 51 to function as the synchronization data generation unit 1 and the synchronization data output unit 3.
A CPU (Central Processing Unit) 51 executes a program stored in the ROM 52.
[0082]
A RAM (Random Access Memory) 53 is a memory for storing data necessary for the CPU 51 to execute a program. Note that the RAM 53 of the terminal 41 is also used to store text information output by the scene prediction unit 12.
[0083]
The display device 54 includes a liquid crystal display that displays data. The input device 55 is for inputting data, and includes a keyboard, a mouse, a microphone, an image scanner, a camera, a video capture interface, and the like. The display device 54 and the input device 55 of the terminal 43 function as the input / output unit 31 of the synchronous data output unit 3.
[0084]
An HDD (Hard Disk Drive) 56 is a storage device for storing data.
The drive device 57 is for loading external recording media such as video media 5 and text media 6 and reading recorded data from the external recording media. The drive device 57 of the terminal 41 functions as the text input unit 11 and the video / audio input unit 13 of the synchronous data generation unit 1.
[0085]
Next, the operation of the content processing apparatus according to the present embodiment will be described based on the flowchart shown in FIG.
The video / audio input unit 13 and the text input unit 11 of the synchronous data generation unit 1 input the video / audio information and text information from the video media 5 and the text media 6, respectively (step S101).
The text input unit 11 determines whether or not the input text information is digital data (step S102).
[0086]
If it is determined that the text information is composed of digital data (Yes in step S102), the text information is output to the scene prediction unit 12.
[0087]
On the other hand, when it is determined that the text information is not digital data, that is, it is analog data (No in step S102), the text input unit 11 digitizes the text information recorded on the text media 6 using OCR or the like (step S103). Then, the text input unit 11 outputs the digitized text information to the scene prediction unit 12.
[0088]
The scene prediction unit 12 predicts the screen configuration and audio configuration of each scene based on the text information output from the text input unit 11, adds the predicted screen configuration and audio configuration of each scene to the text information, and outputs the text information to the mapping unit 17 (step S104).
[0089]
The video / audio input unit 13 inputs video information and audio information from the video medium 5, and determines whether or not the input video information and audio information are digital data (step S105).
[0090]
When it is determined that the video information and the audio information are composed of digital data (Yes in step S105), the video / audio input unit 13 outputs the video information and the audio information to the image recognition unit 14 and the audio recognition unit 15, respectively.
[0091]
On the other hand, when it is determined that the video information and audio information are not digital data, that is, they are analog data (No in step S105), the video / audio input unit 13 converts the analog video information and analog audio information into digital data in AVI format or MPEG format using video capture or the like (step S106). Then, the video / audio input unit 13 outputs the digitized video information and audio information to the image recognition unit 14 and the voice recognition unit 15, respectively.
[0092]
The image recognition unit 14 performs image recognition processing on the video information supplied from the video / audio input unit 13 to generate image recognition character information (step S107). The image recognition unit 14 outputs information indicating the division position of the video information together with the video information to the time code generation unit 16.
[0093]
The voice recognition unit 15 performs voice recognition processing on the voice information output from the video / audio input unit 13, and generates voice recognition character information as a result of the voice recognition processing (step S108). The voice recognition unit 15 outputs information indicating the division position of the voice information together with the voice information to the time code generation unit 16.
[0094]
The time code generation unit 16 generates all or part of the time codes (1) to (16) related to video and audio based on the information output from the image recognition unit 14 and the voice recognition unit 15 (step S109). The time code generation unit 16 outputs the generated video and audio time codes to the image recognition unit 14 and the voice recognition unit 15, respectively.
[0095]
The image recognition unit 14 adds a time code related to the video output from the time code generation unit 16 to the image recognition character information (step S110). The voice recognition unit 15 adds the time code related to the voice output from the time code generation unit 16 to the voice recognition character information (step S110). The image recognition unit 14 and the voice recognition unit 15 output the image recognition character information and the voice recognition character information divided by adding the time code to the mapping unit 17 together with the video information and the voice information, respectively.
[0096]
The mapping unit 17 temporarily stores the text information output from the scene prediction unit 12 and divides the text information according to the text or the dialogue for each speaker.
The mapping unit 17 compares the text information with the divided image recognition character information and voice recognition character information and, based on the paragraph break positions of the text information, associates the text information with the divided image recognition character information and voice recognition character information.
[0097]
Further, the mapping unit 17 generates the correspondence information shown in FIG. 2 for each paragraph of the text information (step S111), and outputs the correspondence information, together with the text information and the divided image recognition character information and voice recognition character information, to the structured processing unit 18.
[0098]
The structured processing unit 18 generates structured information shown in FIG. 2 based on the time code added to the divided image recognition character information and speech recognition character information (step S112). Based on the structured information, the structured processing unit 18 generates a text media file and a time code file by associating the text media file and the dialogue for each speaker.
[0099]
The structured processing unit 18 stores the generated text media file in the text media storage unit 23 of the data storage unit 2 and stores the time code file in the time code storage unit 22. The structured processing unit 18 also stores the video media file in the video media storage unit 21 (step S113).
The synchronous data generation unit 1 then ends this process.
[0100]
Next, when the user inputs search conditions, the synchronous data output unit 3 searches for data and outputs the searched data. The synchronous data output process of the synchronous data output unit 3 will be described based on the flowchart shown in FIG.
[0101]
When the user inputs search information, the input / output unit 31 outputs the search information to the search control unit 32 in response to the input operation (step S201).
[0102]
The search control unit 32 searches the data stored in the data storage unit 2 based on the search condition output from the input / output unit 31 (step S202).
[0103]
The search control unit 32 determines whether there is corresponding data that matches the search condition (step S203).
If it is determined that there is no corresponding data (No in step S203), the search control unit 32 displays that there is no corresponding data and outputs a sound to that effect (step S206).
[0104]
On the other hand, when it is determined that the corresponding data is present (Yes in step S203), the search control unit 32 specifies the paragraph of the text information and extracts all the corresponding data (step S204).
[0105]
Based on the structured information stored in the text media storage unit 23, the search control unit 32 retrieves the time code of the text information selected by the user from the time code storage unit 22 and outputs the retrieved time code to the synchronization processing unit 33 (step S204).
[0106]
The synchronization processing unit 33 notifies the input / output unit 31 of the start time and end time of the time code output by the search control unit 32 as the start time and final time, respectively, for reproducing the video information corresponding to the paragraph of the text information, and supplies the text information, video information, and audio information of the selected paragraph to the input / output unit 31 (step S205).
[0107]
The input / output unit 31 displays a script and video based on the supplied information, and outputs audio (step S206).
In this way, the synchronous data output unit 3 provides data to the user.
[0108]
Next, specific operations will be described in more detail.
For example, the video media 5 and the text media 6 are stored in the content holder 4 of the production company in charge of the production of the drama program, and are taken out from the content holder 4.
[0109]
The text input unit 11 of the synchronous data generation unit 1 takes out text information from the text media 6 in which a content scenario is expressed by character data. The text input unit 11 outputs text information related to the dialogue to the scene prediction unit 12.
[0110]
As shown in FIG. 7, the scene prediction unit 12 predicts, in order from scene 1 to scene 7, a screen configuration consisting of the face of criminal A, the face of criminal A, the faces of all the criminal sections, the face of criminal B, the faces of criminal A and criminal B, no face, and the face of criminal A (the face of section C will be added later) (processing in step S104 in FIG. 5). The scene prediction unit 12 adds this prediction content to the text information and outputs it to the mapping unit 17.
[0111]
The image recognition unit 14 recognizes the images and the voice recognition unit 15 recognizes the voice, generating the image recognition result and the voice recognition result shown in FIG. 7, respectively (processing in steps S107 and S108).
[0112]
For example, the time code generation unit 16 generates “00:00:02:13, 00:00:02:26” as the time code of the image recognition result “the face of criminal A” generated by the image recognition unit 14 (processing of step S109). “00:00:02:13” and “00:00:02:26” indicate the start time and end time of the image recognition result “face of criminal A”, respectively.
[0113]
In addition, the time code generation unit 16 generates “00:00:02:15, 00:00:02:26” as the time code of the voice recognition result generated by the voice recognition unit 15, namely the words “robbery” and “gray” in the voice of criminal A (processing in step S109). “00:00:02:15” and “00:00:02:26” indicate the start time and end time, respectively, of this voice recognition result.
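The time code notation above can be handled mechanically. The following is a minimal sketch, assuming only that a time code consists of four colon-separated numeric fields ordered from most to least significant. The description does not state what each field means (for example, whether the last field counts seconds or frames), so the fields are kept as plain integers. All function names are hypothetical.

```python
from typing import Tuple

# Four numeric fields, most significant first (an assumption; the field
# meanings are not defined in the description).
TimeCode = Tuple[int, int, int, int]


def parse_time_code(text: str) -> TimeCode:
    """Parse one time code such as "00:00:02:13" into a tuple of integers."""
    fields = [int(part) for part in text.replace(" ", "").split(":")]
    if len(fields) != 4:
        raise ValueError(f"expected four colon-separated fields, got {text!r}")
    return (fields[0], fields[1], fields[2], fields[3])


def parse_time_code_pair(text: str) -> Tuple[TimeCode, TimeCode]:
    """Parse a "start,end" pair as generated for one recognition result."""
    start_text, end_text = text.split(",")
    return parse_time_code(start_text), parse_time_code(end_text)


def format_time_code(code: TimeCode) -> str:
    """Format a time code back into two-digit, colon-separated fields."""
    return ":".join(f"{field:02d}" for field in code)


# Values quoted in the description for the face and voice of criminal A.
face_start, face_end = parse_time_code_pair("00:00:02:13,00:00:02:26")
voice_start, voice_end = parse_time_code_pair("00:00:02:15,00:00:02:26")
```

Because tuples compare field by field, the parsed values can be ordered directly, which is sufficient for deciding which of two time codes comes first.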
[0114]
The mapping unit 17 sequentially compares the text information, the image recognition character information, and the voice recognition character information using a DP matching method, and performs matching as shown in FIG. 7 (process of step S111).
[0115]
For example, as shown in FIG. 7, scene 1 of the script contains the description “Receiving a criminal A call”, and the image recognition character information for the time code “00:00:02:13 to 00:00:02:26” of the image recognition result includes “the face of criminal A”. The voice recognition character information for the time code “00:00:02:15 to 00:00:02:26” of the voice recognition result includes “voice of criminal A”. Thus “criminal A” matches across script scene 1, the image recognition result, and the speech recognition result. Therefore, by performing DP matching, the mapping unit 17 determines that the similarity distance between script scene 1 and the image recognition result and speech recognition result is short, and associates script scene 1 with the image recognition result and the speech recognition result.
[0116]
Similarly, the mapping unit 17 sequentially associates the script scenes 2 to 7 with the image recognition result and the voice recognition result.
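The DP matching step can be illustrated as follows. This is a minimal sketch, not the matching procedure of the embodiment: it assumes that each script scene and each recognition segment has already been reduced to a set of labels, uses the Jaccard distance between label sets as the similarity distance, and charges a fixed cost for leaving an item unmatched. The function names, the distance measure, and the cost value are all assumptions.

```python
from typing import List, Optional, Set, Tuple


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """Return 0.0 for identical label sets and 1.0 for disjoint ones."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def dp_match(scenes: List[Set[str]], segments: List[Set[str]],
             skip_cost: float = 1.0) -> List[Tuple[int, int]]:
    """Align scenes with segments in order, minimizing the total distance."""
    n, m = len(scenes), len(segments)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back: List[List[Optional[str]]] = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:  # pair scene i-1 with segment j-1
                c = cost[i - 1][j - 1] + jaccard_distance(scenes[i - 1], segments[j - 1])
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, "match"
            if i > 0:  # leave scene i-1 unmatched
                c = cost[i - 1][j] + skip_cost
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, "skip_scene"
            if j > 0:  # leave segment j-1 unmatched
                c = cost[i][j - 1] + skip_cost
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, "skip_segment"
    # Trace the cheapest path back from the end to recover the pairs.
    pairs: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 or j > 0:
        step = back[i][j]
        if step == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif step == "skip_scene":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


# Hypothetical label sets: three script scenes and four recognized segments,
# one of which ("noise") has no counterpart in the script.
script_scenes = [{"criminal A"}, {"criminal A"}, {"criminal B"}]
recognized = [{"criminal A"}, {"criminal A"}, {"noise"}, {"criminal B"}]
alignment = dp_match(script_scenes, recognized)
```

In this example the unmatched "noise" segment is skipped, and the third scene is paired with the fourth segment because their label sets coincide.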
[0117]
In this case, if the appearance time of criminal A's face in the image recognition result differs from the appearance time of criminal A's voice in the voice recognition result even though the scene is the same, the mapping unit 17 may lengthen the scene time.
[0118]
For example, when the appearance start times differ as in scene 1, the mapping unit 17 adopts the earlier appearance time “00:00:02:13”. Further, when the appearance end times differ as in scene 5, the mapping unit 17 adopts the later exit or disappearance time “00:00:02:38”. Alternatively, as shown in FIG. 7, the mapping unit 17 can treat the image recognition result as the time code of the writing portion and the speech recognition result as the time code of the dialogue portion. The mapping unit 17 can also treat each scene element within a single scene, such as a face, an object, a voice, or a sound, as having its own time code.
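The rule of adopting the earlier start and the later end can be sketched as below. The sketch assumes that time codes are represented as tuples of four integers ordered from most to least significant, so that ordinary tuple comparison orders them correctly; the type and function names are hypothetical.

```python
from typing import Tuple

TimeCode = Tuple[int, int, int, int]
Span = Tuple[TimeCode, TimeCode]  # (start, end)


def merge_spans(image_span: Span, voice_span: Span) -> Span:
    """Adopt the earlier start and the later end of the two spans."""
    start = min(image_span[0], voice_span[0])
    end = max(image_span[1], voice_span[1])
    return (start, end)


# Scene 1 values quoted in the description.
image_scene1: Span = ((0, 0, 2, 13), (0, 0, 2, 26))
voice_scene1: Span = ((0, 0, 2, 15), (0, 0, 2, 26))
scene1 = merge_spans(image_scene1, voice_scene1)
```

For scene 1 the merged span starts at the face's start time, since the face appears before the voice, and ends at the shared end time.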
[0119]
For the characters, the mapping unit 17 can also identify the speaker based on the speech recognition result obtained by the speech recognition process. The mapping unit 17 realizes more robust person identification by using the image and the sound at the same time.
[0120]
For example, a person can be identified by image recognition even when the character is silent or when voice recognition is difficult due to music or ambient noise. On the other hand, a person can be identified by voice recognition even when lighting conditions such as backlighting are poor, or even when a character is facing backward or downward and the face cannot be seen.
[0121]
If the image information, the sound information, and the text information are inconsistent with each other, the mapping unit 17 determines that each recognition result is correct and performs the association. For example, if only persons A and B are detected by image recognition in a scene while persons A and C are detected by voice recognition, the mapping unit 17 performs the association on the assumption that all three persons A, B, and C are present.
[0122]
For example, when the face of detective A cannot be detected as in scene 5, the mapping unit 17 can perform the association based on the order in which the faces appear. If, at almost the same time, detective B is detected by voice recognition and the words “injury” and “unreasonable” are detected in the voice, the association is performed using these words as well. In this way, the association is performed with higher reliability.
[0123]
On the other hand, even if voice recognition cannot be performed in this scene, if the faces of detective A and detective B can be detected, the mapping unit 17 can also associate the script with the image recognition result and the voice recognition result. Further, even when image recognition and voice recognition cannot be performed in this scene, the mapping unit 17 can perform association even in a scene that could not be recognized as long as it can be associated from the appearance order of faces and voices in the preceding and following scenes.
[0124]
However, the mapping unit 17 can also perform the association for all combinations. For example, in this case, the mapping unit 17 can perform an association assuming that the three persons A, B, and C are present, an association assuming that only the two persons A and B are present, an association assuming that only the two persons A and C are present, and an association assuming that only A is present.
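The combinations listed above can be enumerated mechanically. The following is a minimal sketch under one assumption about how the combinations arise: persons detected by both recognizers are always kept, and every subset of the persons detected by only one recognizer is added in turn. The function name is hypothetical.

```python
from itertools import combinations
from typing import List, Set


def person_hypotheses(image_people: Set[str], voice_people: Set[str]) -> List[Set[str]]:
    """List candidate sets of persons present, from largest to smallest."""
    agreed = image_people & voice_people            # detected by both recognizers
    disputed = sorted(image_people ^ voice_people)  # detected by only one of them
    hypotheses: List[Set[str]] = []
    for size in range(len(disputed), -1, -1):
        for extra in combinations(disputed, size):
            hypotheses.append(agreed | set(extra))
    return hypotheses


# The case from the description: A and B seen in the image, A and C heard.
candidates = person_hypotheses({"A", "B"}, {"A", "C"})
```

The first candidate is the union of both results, which corresponds to treating each recognition result as correct; the remaining candidates cover the other combinations named in the text.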
[0125]
In addition, the mapping unit 17 can similarly establish correspondences for cuts, camera work, backgrounds, human motions, objects in frames, motions of objects, words in speech, music, and sound effects, and can detect break positions. If these contradict one another, each recognition result may be treated as correct and associated, or all combinations may be associated.
[0126]
When performing the association, the mapping unit 17 can also estimate the length of each scene not only from the order but also from the length of the dialogue and writing in the script, and determine whether a correspondence holds according to whether the appearance time of the detected face is within a predetermined range (for example, 0.5 to 1.5 times the estimated length of the scene).
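The range check described here can be sketched as follows. The 0.5 and 1.5 factors come from the description; estimating scene length as a fixed number of seconds per script character is purely an assumption made for illustration, as are the function names and the rate of 0.2 seconds per character.

```python
def estimate_scene_seconds(script_text: str, seconds_per_character: float = 0.2) -> float:
    """Estimate scene length from script length (hypothetical linear model)."""
    return len(script_text) * seconds_per_character


def is_plausible(observed_seconds: float, estimated_seconds: float,
                 low: float = 0.5, high: float = 1.5) -> bool:
    """Check whether the observed appearance time lies within the allowed range."""
    return low * estimated_seconds <= observed_seconds <= high * estimated_seconds


# With an estimated length of 10 seconds, 13 seconds is accepted and 4 is not.
accepted = is_plausible(13.0, 10.0)
rejected = is_plausible(4.0, 10.0)
```

A candidate correspondence that fails the check can then be discarded even when the order of appearance alone would have allowed it.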
[0127]
The structured processing unit 18 associates the text information and the time code as follows, and generates a text media file and a time code file (processing in step S112).
[0128]
That is, for example, the structuring processing unit 18 sets the file name of the text media file for the first line of dialogue to “line 1.txt” and the file name of the corresponding time code file to “time 1.txt”.
The structuring processing unit 18 associates the files for the first line of dialogue through the trailing “1” of the file name, excluding the extension.
[0129]
Similarly, the structuring processing unit 18 associates the files in the second paragraph by setting the second lines to “line 2.txt” and “time 2.txt”, respectively. Similarly, the structuring processing unit 18 performs association in the third and fourth paragraphs. If there is a writing, the structuring processing unit 18 associates the writing as well.
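The file naming convention above lets the two kinds of files be paired by the number at the end of the file name. The following is a minimal sketch of such pairing; the function names are hypothetical, and the file names follow the examples in the description.

```python
import re
from pathlib import PurePath
from typing import Dict, List, Optional


def trailing_index(file_name: str) -> Optional[str]:
    """Return the digits at the end of the file name, ignoring the extension."""
    stem = PurePath(file_name).stem
    match = re.search(r"(\d+)$", stem)
    return match.group(1) if match else None


def pair_files(text_files: List[str], time_files: List[str]) -> Dict[str, str]:
    """Map each text media file to the time code file with the same index."""
    time_by_index = {trailing_index(name): name for name in time_files}
    pairs: Dict[str, str] = {}
    for name in text_files:
        index = trailing_index(name)
        if index is not None and index in time_by_index:
            pairs[name] = time_by_index[index]
    return pairs


file_pairs = pair_files(["line 1.txt", "line 2.txt"], ["time 2.txt", "time 1.txt"])
```

The pairing does not depend on the order in which the files are listed, only on the shared trailing number.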
[0130]
Then, the structuring processing unit 18 stores the text media file, the time code file, and the video media file in the data storage unit 2 (Step S113).
[0131]
Next, when the user operates the terminal 43 to search the stored data, the terminal 43 displays a search screen as shown in FIG. 8 on the display device 54 in response to this operation.
[0132]
The display device 54 of the terminal 43 displays on this search screen input fields for date and time, program title, speaker, and keyword as a search condition input screen of the video search system. Further, the display device 54 of the terminal 43 displays a search execution button for designating search execution on this search screen.
[0133]
When the user inputs “actor D” as a keyword, for example, according to the displayed search condition input screen, the display device 54 of the terminal 43 displays “actor D” in the keyword input field in response to this operation.
[0134]
When searching, general search processing such as full-text matching search and AND search based on each search condition can be used.
[0135]
When the user clicks the search execution button, the input / output unit 31 of the synchronous data output unit 3 supplies the search information to the search control unit 32 in response to this operation (processing in step S201), and the search control unit 32 starts the search (processing of step S202).
[0136]
The search control unit 32 specifies a paragraph of text information having data matching the search condition from the data stored in the data storage unit 2 based on the input search condition.
[0137]
The search control unit 32 converts “actor D” input as a keyword phrase into “criminal B” from the cast table in the script. Then, the search control unit 32 searches the data stored in the data storage unit 2 based on “criminal B”.
[0138]
If there are 12 data items corresponding to “criminal B”, the search control unit 32 retrieves the 12 corresponding data items and supplies the scenes of the text information to the input / output unit 31 as a search result (processing of steps S203 and S204).
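The keyword conversion and search described above can be sketched as follows. The cast table entry mapping “actor D” to “criminal B” comes from the description; the paragraph records, their field names, and the matching rule (speaker equality or substring match in the text) are assumptions made for illustration.

```python
from typing import Dict, List

# Cast table entry taken from the description's example.
CAST_TABLE: Dict[str, str] = {"actor D": "criminal B"}

# Hypothetical stored paragraphs; the real data layout is not specified.
PARAGRAPHS: List[Dict[str, object]] = [
    {"scene": 1, "speaker": "criminal A", "text": "(hypothetical line 1)"},
    {"scene": 4, "speaker": "criminal B", "text": "(hypothetical line 2)"},
    {"scene": 5, "speaker": "criminal A", "text": "(hypothetical line addressed to criminal B)"},
]


def search_paragraphs(paragraphs: List[Dict[str, object]], keyword: str,
                      cast_table: Dict[str, str]) -> List[Dict[str, object]]:
    """Convert the keyword through the cast table, then match paragraphs."""
    term = cast_table.get(keyword, keyword)  # e.g. "actor D" -> "criminal B"
    return [p for p in paragraphs
            if p["speaker"] == term or term in str(p["text"])]


hits = search_paragraphs(PARAGRAPHS, "actor D", CAST_TABLE)
```

A keyword that has no entry in the cast table is searched for as entered, so role names and ordinary words can be used in the same input field.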
[0139]
Further, the synchronization processing unit 33 notifies the input / output unit 31 of the start time and the final time for reproducing the video information, and supplies the video information and audio information in synchronization with the text information of the selected paragraph (processing of step S205).
[0140]
The display device 54 of the terminal 43 displays the script and video as shown in FIG. 9 supplied to the input / output unit 31 and outputs audio (processing in step S206).
[0141]
When extracting video information, the display device 54 of the terminal 43 can also display a representative image as a thumbnail image together with the text information, as shown in FIG. 8, for example. Further, the display device 54 can also display a moving image from the start time to the end time of a paragraph instead of a still image.
[0142]
For example, as illustrated in FIG. 9, the synchronization processing unit 33 operates according to the operation information when each switch unit such as rewind, playback, stop, pause, and fast-forward displayed on the display device 54 is operated.
[0143]
For example, when the user operates the input device 55 of the terminal 43 so as to reproduce the video media 5 from the position 2 minutes 31 seconds from the beginning, the synchronization processing unit 33 executes a process of scrolling the text information in synchronization with the reproduction timing of the video information.
[0144]
The display device 54 of the terminal 43 performs scroll display of the text information portion including, for example, “Healed by injury” as the text information of the selected paragraph in accordance with the reproduction timing of the video information.
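Scrolling the text in step with playback amounts to finding the paragraph whose time span contains the current playback position. The following is a minimal sketch, assuming positions are expressed in seconds and the paragraph spans are sorted and do not overlap. The span values are hypothetical; only the position of 2 minutes 31 seconds comes from the description.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def paragraph_at(position: float, spans: List[Tuple[float, float]]) -> Optional[int]:
    """Return the index of the paragraph playing at `position`, if any."""
    starts = [start for start, _ in spans]
    index = bisect_right(starts, position) - 1  # last span starting at or before position
    if index >= 0 and position <= spans[index][1]:
        return index
    return None


# Hypothetical paragraph spans in seconds from the beginning of the video.
PARAGRAPH_SPANS = [(133.0, 146.0), (147.0, 158.0)]
current = paragraph_at(2 * 60 + 31, PARAGRAPH_SPANS)
```

The returned index identifies which paragraph of the text information to scroll into view and highlight; a result of `None` means no paragraph is associated with that position.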
[0145]
Further, the display device 54 of the terminal 43 displays the text information being reproduced in underlined italic characters, as shown in FIG. 9, to indicate that it is being reproduced.
[0146]
As described above, according to the present embodiment, the text information is associated with the divided image recognition character information and voice recognition character information, so the text information, the video information, and the audio information can be easily structured through the time code.
[0147]
In addition, the division part in the text information that matches the search condition input by the user is specified, the time code corresponding to the specified division part is extracted, the video information corresponding to the extracted time code is specified, and the specified video information is provided to the user. Therefore, desired video information can be provided to the user.
[0148]
Furthermore, because image information and audio information are used together, text information that cannot be associated using the video information alone can be associated through the audio information, and text information that cannot be associated using the audio information alone can be associated through the video information, so the text information and the video information can be structured more accurately.
[0149]
In carrying out the present invention, various forms are conceivable and the present invention is not limited to the above embodiment.
For example, in the above embodiment, the video information is supplied from the video / audio input unit 13 to the data storage unit 2 via the image recognition unit 14, the voice recognition unit 15, the mapping unit 17, and the structured processing unit 18. However, the video information may be supplied directly from the video / audio input unit 13 to the data storage unit 2.
[0150]
Further, the structuring processing unit 18 may be configured to generate, as management information, the start address and end address of each segment of the text information in the text media file and the start address and end address of each segment of the time code in the time code file.
[0151]
To do so, the content processing apparatus includes a management file storage unit 24 for storing management information in the data storage unit 2 as shown in FIG. The management file storage unit 24 corresponds to a management information storage unit. The structured processing unit 18 stores the management information in the management file storage unit 24.
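Management information of this kind can be sketched as a list of address pairs. The following is a minimal sketch, assuming the segments are stored back to back in a single file, addresses are byte offsets, and the end address is exclusive. None of these details is specified in the description, and the function name is hypothetical.

```python
from typing import List, Tuple


def build_management_info(segments: List[str], encoding: str = "utf-8") -> List[Tuple[int, int]]:
    """Return (start address, end address) byte offsets for each segment."""
    info: List[Tuple[int, int]] = []
    offset = 0
    for segment in segments:
        size = len(segment.encode(encoding))
        info.append((offset, offset + size))  # end address is exclusive here
        offset += size
    return info


# Two hypothetical text segments stored one after the other.
management_info = build_management_info(["abc", "de"])
```

With such a table, a segment can be read by seeking to its start address rather than scanning the whole file.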
[0152]
Further, as shown in FIG. 11, the content processing apparatus can be configured to combine text information and a time code to generate a structured result. In this case, the data storage unit 2 includes a video media storage unit 21 and a text media storage unit 23 as shown in FIG. The structured processing unit 18 combines the text information and the time code to generate a structured text media file including the time code, and stores the generated text media file in the text media storage unit 23. The data storage unit 2 need not include the time code storage unit 22.
[0153]
Further, the content processing apparatus may be configured to generate and store an XML (Extensible Markup Language) file as shown in FIG. The XML file carries a structural description in the MPEG7 (Moving Picture Experts Group 7) format.
[0154]
As shown in FIG. 14, the data storage unit 2 includes an XML file storage unit 25. In this case, the data storage unit 2 does not need to include the time code storage unit 22 and the text media storage unit 23. The structuring processing unit 18 generates an XML file.
[0155]
The structuring processing unit 18 generates an XML file according to the flowchart shown in FIG.
[0156]
That is, the structuring processing unit 18 uses the tag to insert the storage position (folder position) information of the video information in the computer system into the XML template (step S301).
[0157]
The structured processing unit 18 uses the tag to insert program title information described in the text media 6 into the XML template (step S302).
[0158]
The structuring processing unit 18 inserts the start time and end time information of each paragraph into the XML template using the tag (step S303).
[0159]
The structuring processing unit 18 inserts the character information into the XML template using the tag (step S304).
[0160]
The structured processing unit 18 inserts text information into the XML template using the tag (step S305).
[0161]
The structuring processing unit 18 determines whether or not the end time of the inserted dialogue has reached the final time of the time code (step S306).
When it is determined that the final time has not been reached (No in step S306), the structuring processing unit 18 inserts again the start time, end time information, character information, and text information of each paragraph in the XML template ( Steps S303 to S305).
[0162]
If it is determined that the final time has been reached (Yes in step S306), the structuring processing unit 18 stores the XML file in the XML file storage unit 25 of the data storage unit 2 (step S307).
[0163]
In the XML file shown in FIG. 13, <?xml> and <Mpeg7> are XML templates defined in advance by the MPEG7 standard.
[0164]
The <MediaLocator> tag and the <MediaUri> tag are tags indicating the storage location (folder location) of the video information. “C: ¥ meta information ¥ video data ¥ drama video 020913.mpg” indicates the storage position in the video media storage unit 21 at which the video information is stored, as inserted by the structured processing unit 18 (processing result of step S301).
[0165]
The <Title> tag in the <CreationInformation> tag is a tag for inserting program title information. “Drama program 020913” indicates program title information inserted by the structuring processing unit 18 (processing result of step S302).
[0166]
A <MediaTime> tag, a <MediaRelTimePoint> tag, and a <MediaDuration> tag are tags for inserting information about the start time and end time of each paragraph. The structuring processing unit 18 uses the tag to insert information about the start time and end time of each paragraph (processing result in step S303).
[0167]
The <Name> tag, <GivenName> tag, and <FamilyName> tag are tags for inserting character information. “Criminal A” is the character information inserted by the structuring unit 18 (processing result of step S304).
[0168]
The <TextAnnotation> tag and the <FreeTextAnnotation> tag are text information insertion tags. The dialogue including “XXX 1 robbery case at 1 chome” is text information inserted by the structuring processing unit 18 using these tags (processing result of step S305).
[0169]
If the end time of the inserted dialogue has not reached the final time of the time code, the structuring processing unit 18 inserts each piece of information of the next dialogue (step S306). When the end time of the dialogue reaches the final time of the time code, the structuring processing unit 18 stores the XML file in the XML file storage unit 25 of the data storage unit 2 (processing in step S307).
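The generation steps S301 to S307 can be sketched with Python's standard `xml.etree.ElementTree` module. This is only an illustration: the tag names are those listed in the description, but how they nest, the `Segment` wrapper element, the record field names, and the values are assumptions, and the output is not claimed to be valid against the MPEG7 schema. A loop over paragraphs ordered by time stands in for the check in step S306.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def build_structured_xml(video_uri: str, title: str,
                         paragraphs: List[Dict[str, str]]) -> str:
    """Build an XML description from per-paragraph records (assumed layout)."""
    root = ET.Element("Mpeg7")
    locator = ET.SubElement(root, "MediaLocator")
    ET.SubElement(locator, "MediaUri").text = video_uri  # step S301
    creation = ET.SubElement(root, "CreationInformation")
    ET.SubElement(creation, "Title").text = title  # step S302
    for paragraph in paragraphs:  # repeats S303 to S305 until the last paragraph
        segment = ET.SubElement(root, "Segment")  # hypothetical wrapper element
        media_time = ET.SubElement(segment, "MediaTime")  # step S303
        ET.SubElement(media_time, "MediaRelTimePoint").text = paragraph["start"]
        ET.SubElement(media_time, "MediaDuration").text = paragraph["duration"]
        name = ET.SubElement(segment, "Name")  # step S304
        ET.SubElement(name, "GivenName").text = paragraph["character"]
        annotation = ET.SubElement(segment, "TextAnnotation")  # step S305
        ET.SubElement(annotation, "FreeTextAnnotation").text = paragraph["text"]
    return ET.tostring(root, encoding="unicode")  # stored in step S307


# Hypothetical records; time values are passed through as opaque strings.
sample_xml = build_structured_xml(
    "drama.mpg",
    "Drama program 020913",
    [
        {"start": "00:00:02:13", "duration": "00:00:00:13",
         "character": "Criminal A", "text": "(hypothetical dialogue 1)"},
        {"start": "00:00:02:27", "duration": "00:00:00:05",
         "character": "Criminal B", "text": "(hypothetical dialogue 2)"},
    ],
)
```

Because the result is ordinary XML, it can be searched afterwards by parsing it and scanning the `FreeTextAnnotation` elements for a word or phrase.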
[0170]
Suppose that, after the structured processing unit 18 stores such an XML file in the XML file storage unit 25, the user inputs a desired word or phrase to the synchronous data output unit 3 to request desired video information and text information.
[0171]
The input / output unit 31 outputs the word or phrase input by the user to the search control unit 32. The search control unit 32 searches the data stored in the XML file storage unit 25 for paragraphs of the text information that include the word or phrase and, if there is corresponding text information, outputs those paragraphs to the input / output unit 31.
[0172]
If the user selects a specific range of certain text information, the input / output unit 31 requests the search control unit 32 to output video information synchronized with the text information selected by the user.
[0173]
The above is an application example of a content processing apparatus configured to generate and store an XML (Extensible Markup Language) file as shown in FIG.
[0174]
The input / output unit 31 of the synchronous data output unit 3 can also be configured to input voice spoken by the user. In this case, the input device 55 of the terminal 43 is provided with a voice recognition unit that converts the user's utterance content into text data. The input / output unit 31 corresponding to the voice recognition unit outputs text information converted from the user's utterance content to the search control unit 32.
[0175]
The synchronization data output unit 3 can also be configured to identify a speaker by inputting the voice of a person to be searched and output the person name to the search control unit 32.
[0176]
Furthermore, the synchronous data output unit 3 can also be configured to accept as input any one of a cut image similar to a desired scene, a moving image with similar camerawork, a background image, a person image, a moving image of a person's motion, an object image, a moving image of an object's motion, music, a sound effect, or sound with the same arrangement of silent sections, to recognize each input, and to output all or part of the recognition result to the search control unit 32.
[0177]
In the present embodiment, the time code generation unit 16 outputs the generated time code to the image recognition unit 14 and the voice recognition unit 15, respectively, and the image recognition unit 14 and the voice recognition unit 15 respectively generate the generated image recognition character information. The time code is added to the voice recognition character information. However, the present invention is not limited to this, and the time code generator 16 may supply the generated time code to the structured processor 18 as in the content processing apparatus shown in FIG.
[0178]
In such a configuration, the image recognition unit 14 and the voice recognition unit 15 do not add time codes to the generated image recognition character information and voice recognition character information, respectively. Instead, the structured processing unit 18 determines the correspondence between each time code and the text information based on the correspondence between the image recognition character information and voice recognition character information and each time code.
[0179]
In addition, a program for causing a computer to operate as all or part of the playback apparatus, or to execute the above-described processing, may be stored in and distributed on a computer-readable recording medium such as a flexible disk, MD, CD-ROM, or DVD, and installed in a computer so that the computer operates as the above-mentioned means or executes the above-described steps.
[0180]
Furthermore, the program may be stored in a disk device or the like included in a server device on the Internet, and may be downloaded onto a computer by being superimposed on a carrier wave, for example.
[0181]
[Effect of the invention]
As described above, according to the present invention, content can be easily used.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a content processing apparatus according to a first embodiment of the present invention.
FIG. 2 is an explanatory diagram showing a relationship between pieces of information processed by the content processing apparatus of FIG. 1.
FIG. 3 is a block diagram showing a hardware configuration of the content processing apparatus shown in FIG. 1.
FIG. 4 is a block diagram showing a configuration of a terminal shown in FIG. 3.
FIG. 5 is a flowchart showing an operation of the synchronous data generation unit of FIG. 1.
FIG. 6 is a flowchart showing the operation of the synchronous data output unit.
FIG. 7 is an explanatory diagram illustrating processing contents of a mapping unit in FIG. 1.
FIG. 8 is an explanatory diagram showing a search screen displayed on the display device shown in FIG. 4.
FIG. 9 is an explanatory diagram showing a search result screen displayed on the display device shown in FIG. 4.
FIG. 10 is a block diagram illustrating a configuration of a data storage unit including a management file storage unit as an application example (1) of the data storage unit illustrated in FIG. 1.
FIG. 11 is an explanatory diagram showing an application example of data structured by the structured processing unit of FIG. 1.
FIG. 12 is a block diagram showing a configuration including only a video media storage unit and a text media storage unit as an application example (2) of the data storage unit shown in FIG. 1.
FIG. 13 is an explanatory diagram showing a description example of an XML file processed by a synchronous data generation unit as an application example of the content processing apparatus shown in FIG. 1.
FIG. 14 is a block diagram showing a configuration including an XML file storage unit as an application example (3) of the data storage unit shown in FIG. 1.
FIG. 15 is a flowchart showing an operation of generating an XML file by the synchronous data generation unit shown in FIG. 1.
FIG. 16 is a block diagram showing a configuration to which the content processing apparatus shown in FIG. 1 is applied.
[Explanation of symbols]
1 Synchronous data generation unit
2 Data storage unit
3 Synchronous data output unit
16 Time code generation unit
17 Mapping unit
18 Structured processing unit
31 Input / output unit
32 Search control unit
33 Synchronization processing unit

Claims

A content processing apparatus for associating content including video information and audio information with script information in which a scenario of the content is expressed in character data, the apparatus comprising:
a recognition unit that performs recognition processing on the content and generates image recognition character information and voice recognition character information in which a characteristic portion of each scene included in the content is expressed in character data;
a time code generation unit that extracts the characteristic portions of the image recognition character information and the voice recognition character information generated by the recognition unit, divides the information at those portions, and generates a time code indicating the start time and end time of each divided portion;
a mapping unit that acquires a characteristic portion of the script information, divides the script information into scenes based on the acquired characteristic portion, divides the image recognition character information and the voice recognition character information generated by the recognition unit at the positions indicated by the time code generated by the time code generation unit, determines that each is correct even if the divided image recognition character information and the divided voice recognition character information do not match, associates each scene of the script information with the divided image recognition character information and voice recognition character information, and generates correspondence information indicating the correspondence between each scene of the script information and the divided image recognition character information and voice recognition character information; and
a structured processing unit that, based on the correspondence information generated by the mapping unit and the time code generated by the time code generation unit, generates structured information indicating the relationship between each scene of the script information and the time code of each scene, as information for searching for necessary content.
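
As a purely illustrative aid, the mapping and structuring steps recited above can be sketched in Python. This is a minimal sketch under stated assumptions, not the claimed implementation: the names `Segment`, `map_scenes`, and `structure` are hypothetical, time codes are taken to be seconds, and scenes are matched to segments by simple word overlap, which the claim does not prescribe. The image and voice recognition text of a segment are pooled so that both count as evidence even when they disagree.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One divided portion of the content (hypothetical structure)."""
    start: float      # time code: start time, assumed to be in seconds
    end: float        # time code: end time, assumed to be in seconds
    image_text: str   # image recognition character information
    voice_text: str   # voice recognition character information


def tokens(text):
    """Lower-cased set of whitespace-separated words."""
    return set(text.lower().split())


def map_scenes(script_scenes, segments):
    """Return correspondence info: scene index -> best segment index (or None).

    Assumption: matching is done by counting shared words. Image and voice
    text are pooled, so both are treated as correct even if they differ.
    """
    correspondence = {}
    for scene_idx, scene_text in enumerate(script_scenes):
        scene_words = tokens(scene_text)
        best_idx, best_score = None, 0
        for seg_idx, seg in enumerate(segments):
            evidence = tokens(seg.image_text) | tokens(seg.voice_text)
            score = len(scene_words & evidence)
            if score > best_score:
                best_idx, best_score = seg_idx, score
        correspondence[scene_idx] = best_idx
    return correspondence


def structure(script_scenes, segments, correspondence):
    """Build structured info relating each mapped scene to its time code."""
    structured = []
    for scene_idx, seg_idx in correspondence.items():
        if seg_idx is None:
            continue  # no segment shared any word with this scene
        seg = segments[seg_idx]
        structured.append({
            "scene": scene_idx,
            "text": script_scenes[scene_idx],
            "start": seg.start,
            "end": seg.end,
        })
    return structured
```

In this sketch the structured information is a plain list of records, one per mapped scene, which is enough to look up a time code from a scene of the script.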

The content processing apparatus according to claim 1, further comprising a screen prediction unit that predicts the screen configuration of the script information, adds the predicted screen configuration to the script information, and outputs the result to the mapping unit.

The content processing apparatus according to claim 1, further comprising a data storage unit that stores the script information, the content, the time code generated by the time code generation unit, and the structured information generated by the structured processing unit.

The content processing apparatus according to claim 3, wherein the data storage unit comprises:
a content storage unit that stores the content;
a text file storage unit that stores the script information and the structured information; and
a time code storage unit that stores the time code.

The content processing apparatus according to claim 4, wherein
the structured processing unit generates a script information file storing the script information and a time code file storing the time code in the text file storage unit, and generates management information indicating the start address and end address of each segment of the script information file and the start address and end address of each time code in the time code file, and
the data storage unit includes a management information storage unit that stores the management information.

The content processing apparatus according to claim 5, wherein the data storage unit includes a markup language file storage unit that stores the script information as a markup language file.

The content processing apparatus according to any one of claims 3 to 6, further comprising a synchronous data output unit that outputs the script information and the content stored in the data storage unit in synchronization with each other based on an input search condition.

The content processing apparatus according to claim 7, wherein the synchronous data output unit comprises:
an input/output unit that receives search conditions for extracting necessary scenes from the script information and the content, and outputs the script information and content of the scenes corresponding to the search conditions;
a search control unit that identifies a scene in the script information corresponding to the search condition input to the input/output unit, and extracts the time code corresponding to the identified scene; and
a synchronization processing unit that identifies the scene of the content corresponding to the extracted time code, synchronizes the content of the identified scene with the script information corresponding to the search condition, and outputs the content and script information of the synchronized scene to the input/output unit.
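
For illustration only, the search and synchronized output described above might look like the following Python sketch. It is not the claimed implementation: the record layout (`scene`, `text`, `start`, `end`), the function names `search_scenes` and `synchronized_output`, and the use of a case-insensitive keyword as the search condition are all assumptions made for the example.

```python
def search_scenes(structured, keyword):
    """Search control (sketch): find script scenes containing the keyword.

    `structured` is assumed to be a list of dicts with the hypothetical keys
    "scene", "text", "start", and "end" (time codes in seconds).
    """
    needle = keyword.lower()
    return [entry for entry in structured if needle in entry["text"].lower()]


def synchronized_output(structured, keyword):
    """Synchronization (sketch): pair each hit's script text with the
    playback range of the content given by its time code."""
    results = []
    for entry in search_scenes(structured, keyword):
        results.append({
            "scene": entry["scene"],
            "script": entry["text"],
            "play_from": entry["start"],
            "play_to": entry["end"],
        })
    return results


# Example structured information (invented data for illustration).
example_structured = [
    {"scene": 0, "text": "Harbor, morning. Ken greets Yumi.",
     "start": 0.0, "end": 12.5},
    {"scene": 1, "text": "Office, night. Ken reads the letter.",
     "start": 12.5, "end": 30.0},
]

hits = synchronized_output(example_structured, "letter")
```

Here `hits` holds one record per matching scene, each carrying both the script text and the time range a player would seek to, so the two can be presented together.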

A content processing method for associating content including video information and audio information with script information in which a scenario of the content is expressed in character data, the method comprising:
performing recognition processing on the content, and generating image recognition character information and voice recognition character information in which a characteristic portion of each scene included in the content is expressed in character data;
extracting the characteristic portions of the generated image recognition character information and voice recognition character information, dividing the information at those portions, and generating a time code indicating the start time and end time of each divided portion;
acquiring a characteristic portion of the script information, dividing the script information into scenes based on the acquired characteristic portion, dividing the generated image recognition character information and voice recognition character information at the positions indicated by the generated time code, determining that each is correct even if the divided image recognition character information and the divided voice recognition character information do not match, associating each scene of the script information with the divided image recognition character information and voice recognition character information, and generating correspondence information indicating the correspondence between each scene of the script information and the divided image recognition character information and voice recognition character information;
generating, based on the generated correspondence information and the generated time code, structured information indicating the relationship between each scene of the script information and the time code of each scene, as information for searching for necessary content; and
storing the script information, the content, the time code, and the structured information.

A program for causing a computer to execute:
a procedure for performing recognition processing on video information and audio information included in content, and generating image recognition character information and voice recognition character information in which a characteristic portion of each scene included in the content is expressed in character data;
a procedure for extracting the characteristic portions of the generated image recognition character information and voice recognition character information, dividing the information at those portions, and generating a time code indicating the start time and end time of each divided portion;
a procedure for acquiring a characteristic portion of script information in which a scenario of the content is expressed in character data, dividing the script information into scenes based on the acquired characteristic portion, dividing the generated image recognition character information and voice recognition character information at the positions indicated by the generated time code, determining that each is correct even if the divided image recognition character information and the divided voice recognition character information do not match, associating each scene of the script information with the divided image recognition character information and voice recognition character information, and generating correspondence information indicating the correspondence between each scene of the script information and the divided image recognition character information and voice recognition character information;
a procedure for generating, based on the generated correspondence information and the generated time code, structured information indicating the relationship between each scene of the script information and the time code of each scene, as information for searching for necessary content; and
a procedure for storing the script information, the content, the time code, and the structured information.