JP3488626B2

JP3488626B2 - Video division method, apparatus and recording medium recording video division program

Info

Publication number: JP3488626B2
Application number: JP06816098A
Authority: JP
Inventors: 憲一南; 明人阿久津; 佳伸外村
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1998-03-18
Filing date: 1998-03-18
Publication date: 2004-01-19
Anticipated expiration: 2018-03-18
Also published as: JPH11266428A

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a video division method and apparatus, and to a recording medium on which a video division program is recorded, in which the background sound in the sound information contained in a video is analyzed and the video is divided based on the similarity of the resulting feature quantities.

[0002]

[Description of the Related Art] Methods for dividing a video mainly rely on image information. One example detects cut points, where the camera switches, and divides the video into shots.

[0003]

[Problems to Be Solved by the Invention] One application of dividing a video by cut-point detection is a video representation method in which the first image of each shot is taken as a representative still image of that shot, and the representative images are arranged spatially so that the content of the video can be viewed at a glance. However, because cut points occur frequently, the number of representative images becomes too large when a long video is processed. To reduce the number of representative images, the video must be divided more coarsely.

[0004] From the viewpoint of video production, a set of shots forms a scene, so the video could instead be divided scene by scene. However, a scene is normally a continuation of the same setting, and it has been difficult to divide a video into scenes automatically.

[0005] An object of the present invention is to divide a video coarsely by exploiting the fact that the background sound is likely to be similar within the same scene.

[0006]

[Means for Solving the Problems] To achieve the above object, the present invention inputs a video, stores the input video, and detects music and voice from the sound information of the video. For sections of the sound information that contain neither music nor voice, it extracts a feature quantity by frequency-analyzing the sound information and obtaining a time-averaged spectrum. It then calculates a similarity as the mutual correlation of the extracted feature quantities, regards the video comprising sections of high similarity and the sound information sandwiched between those sections as one coherent section, and divides it off as a segment. In this way the video is divided coarsely.

[0007]

[0008]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings. FIG. 1 is a block diagram showing the schematic configuration of a video division apparatus according to one embodiment of the present invention. The video division apparatus of this embodiment comprises a video input unit 101 that inputs a video, a video storage unit 102 that stores the video, a music detection unit 103 that detects music, a voice detection unit 104 that detects voice, a feature extraction unit 105 that extracts a feature quantity from sections containing neither music nor voice, and a video division unit 106 that calculates the similarity of the extracted feature quantities and divides off, as one segment, the video comprising sections of high similarity and the sound information sandwiched between them.

[0009] FIG. 2 is a flowchart showing the processing flow of the video division apparatus according to one embodiment of the present invention. The processing flow is the same when the present invention is implemented in software. Each pass through the loop processes a video segment of about one second.
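As a minimal sketch of the roughly one-second processing unit, assuming the audio track has already been decoded into a sequence of samples at a known sample rate, the input could be cut into blocks as follows. The function name `one_second_blocks` and its parameters are hypothetical and do not appear in the patent.

```python
from typing import Iterator, List, Sequence


def one_second_blocks(samples: Sequence[float], sample_rate: int) -> Iterator[List[float]]:
    """Yield consecutive blocks holding about one second of audio each.

    The final block may be shorter than one second if the input length
    is not a multiple of the sample rate.
    """
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    for start in range(0, len(samples), sample_rate):
        yield list(samples[start:start + sample_rate])


# Example: 20 samples at a (toy) rate of 8 samples per second.
block_lengths = [len(b) for b in one_second_blocks([0.0] * 20, sample_rate=8)]
```

Each yielded block would then go through one pass of the loop in FIG. 2.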

[0010] First, the video is stored in video storage processing 201, and music detection processing 202 is applied to the sound information of the video. Decision 203 determines whether the sound is music; if it is, processing jumps to decision 208. If it is not music, voice detection processing 204 is applied. Decision 205 determines whether the sound is voice; if it is, processing jumps to decision 208. Music can be detected using the property that the peaks of the frequency spectrum of the sound information remain stable in frequency over time. For voice detection, methods such as one using a comb filter (Minami et al., "Video Indexing by Sound Analysis," IEICE General Conference, D-12-64, 1997) are effective.
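The order of decisions 203 and 205 can be sketched as below. This shows only the control flow: the detectors are passed in as callables, and the toy detectors at the end are hypothetical stand-ins, not the spectral-peak or comb-filter methods themselves. All names are assumptions made for illustration.

```python
from typing import Callable, List, Sequence

Block = Sequence[float]
Detector = Callable[[Block], bool]


def classify_block(block: Block, is_music: Detector, is_voice: Detector) -> str:
    """Mirror decisions 203 and 205: test for music first, then voice."""
    if is_music(block):
        return "music"       # decision 203: jump to the end check (208)
    if is_voice(block):
        return "voice"       # decision 205: jump to the end check (208)
    return "background"      # neither: goes on to feature extraction (206)


def classify_all(blocks: Sequence[Block], is_music: Detector, is_voice: Detector) -> List[str]:
    """Classify every block in order."""
    return [classify_block(b, is_music, is_voice) for b in blocks]


# Toy stand-ins used only to exercise the control flow.
def toy_is_music(block: Block) -> bool:
    return block[0] > 0.5


def toy_is_voice(block: Block) -> bool:
    return block[0] < -0.5


kinds = classify_all([[0.9], [-0.9], [0.0]], toy_is_music, toy_is_voice)
```

Because music is tested first, a block that both detectors would accept is treated as music, which matches the order of the flowchart.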

[0011]

[0012] If the sound is not voice, the period is treated as background sound containing neither music nor voice, that is, as a segment corresponding to background sound, and feature extraction processing 206 is applied. Feature extraction processing 206 frequency-analyzes the sound information and obtains the long-term average spectrum, which is the time average of the spectral power at each frequency.
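One way to compute such a long-term average spectrum is sketched below using only the Python standard library. The patent does not specify the frame length, windowing, or overlap, so this sketch assumes non-overlapping, unwindowed frames and a direct (slow) DFT. The names `power_spectrum`, `long_term_average_spectrum`, and `frame_size` are hypothetical.

```python
import cmath
import math
from typing import List, Sequence


def power_spectrum(frame: Sequence[float]) -> List[float]:
    """Power at each non-negative frequency bin, via a direct DFT."""
    n = len(frame)
    powers = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        powers.append(abs(coeff) ** 2)
    return powers


def long_term_average_spectrum(samples: Sequence[float], frame_size: int) -> List[float]:
    """Average the per-frame power spectra over all complete frames."""
    frames = [samples[i:i + frame_size]
              for i in range(0, len(samples) - frame_size + 1, frame_size)]
    if not frames:
        raise ValueError("need at least one complete frame")
    spectra = [power_spectrum(f) for f in frames]
    return [sum(s[k] for s in spectra) / len(spectra)
            for k in range(frame_size // 2 + 1)]


# Example: a sinusoid with exactly 2 cycles per 16-sample frame, 4 frames long.
tone = [math.sin(2 * math.pi * 2 * t / 16) for t in range(64)]
average = long_term_average_spectrum(tone, frame_size=16)
```

In practice an FFT routine would replace the direct DFT, which is quadratic in the frame length. The averaging step is the same either way.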

[0013] Next, video division processing 207 computes the correlation between the long-term average spectrum calculated one loop earlier and the current long-term average spectrum. If the correlation is high, the two segments are regarded as belonging to the same scene and are labeled accordingly. Any music or voice segments lying between the two segments whose correlation was computed are also labeled as belonging to the same scene. The label information is saved in the video storage unit 102 together with the time information of the segments.
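The labeling step can be sketched as follows, under several assumptions the patent leaves open. The correlation is taken to be the Pearson correlation coefficient, "high" is taken to mean at least a threshold of 0.9, and music or voice blocks that fall between two uncorrelated background blocks stay with the earlier scene. The names `pearson`, `label_scenes`, and `threshold` are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation of two equal-length spectra (0.0 if either is flat)."""
    if len(a) != len(b) or not a:
        raise ValueError("spectra must be non-empty and of equal length")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def label_scenes(spectra: Sequence[Optional[Sequence[float]]],
                 threshold: float = 0.9) -> List[int]:
    """Assign a scene label to every block.

    spectra[i] is the long-term average spectrum of block i, or None when
    block i was classified as music or voice (no feature was extracted).
    """
    labels: List[int] = []
    scene = 0
    previous: Optional[Sequence[float]] = None  # last background spectrum seen
    for spectrum in spectra:
        if spectrum is not None:
            if previous is not None and pearson(previous, spectrum) < threshold:
                scene += 1  # low correlation: a new scene starts at this block
            previous = spectrum
        # Music/voice blocks inherit the current scene, so blocks sandwiched
        # between two correlated background blocks share their label.
        labels.append(scene)
    return labels


labels = label_scenes([
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.0, 3.2, 3.9],
    None,                    # music or voice block inside the first scene
    [1.0, 2.0, 3.0, 4.1],
    [4.0, 3.0, 2.0, 1.0],    # background sound changes: new scene
    None,
    [4.1, 3.0, 2.0, 1.2],
])
```

Runs of equal labels correspond to the coarse segments. Storing each label with the start and end time of its block would give the label and time information that paragraph [0013] saves in the video storage unit 102.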

[0014] Although the above describes the division of a video, the division procedure can be held in the form of a program executable by a data processing apparatus, and the present invention also covers a recording medium on which that program is recorded.

[0015]

[Effects of the Invention] (1) The inventions of claims 1, 2, and 3 input a video, store the input video, and detect music and voice from the sound information of the video. For sections of the sound information that contain neither music nor voice, they extract a feature quantity by frequency-analyzing the sound information and obtaining a time-averaged spectrum, and calculate a similarity as the mutual correlation of the extracted feature quantities. The video comprising sections of high similarity and the sound information sandwiched between those sections can then be regarded as one coherent section and divided off as a segment, which makes it possible to divide the video coarsely.

[0016]

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing the schematic configuration of a video division apparatus according to one embodiment of the present invention.

FIG. 2 is a flowchart showing the processing flow of the video division apparatus according to one embodiment of the present invention, which is also the processing flow when the present invention is implemented in software.

[Explanation of Symbols]

101 Video input unit
102 Video storage unit
103 Music detection unit
104 Voice detection unit
105 Feature extraction unit
106 Video division unit
201 Video storage processing
202 Music detection processing
203 Music determination processing
204 Voice detection processing
205 Voice determination processing
206 Feature extraction processing
207 Video division processing
208 Video end determination processing

Continuation of the front page

(56) References cited: JP-A-9-214879 (JP, A); JP-A-8-95596 (JP, A)

(58) Fields searched (Int.Cl.⁷, DB name): H04N 5/76 - 5/956; G10L 3/00

Claims

(57) [Claims]

1. A video division method for dividing a given video in correspondence with scenes, comprising: a video input step of inputting the video; a video storage step of storing the video; a music detection step of detecting music from the sound information in the video; a voice detection step of detecting voice from the sound information; a feature extraction step of extracting, for sections of the sound information that contain neither music nor voice, a feature quantity obtained by frequency-analyzing the sound information to obtain a time-averaged spectrum; and a video division step of calculating a similarity as the mutual correlation of the extracted feature quantities, regarding the video comprising sections of high similarity and the sound information sandwiched between those sections as one coherent section, and grouping it as a segment, thereby dividing the video into a plurality of segments.

2. A video division apparatus for dividing a given video in correspondence with scenes, comprising: a video input unit that inputs the video; a video storage unit that stores the video; a music detection unit that detects music from the sound information in the video; a voice detection unit that detects voice from the sound information; a feature extraction unit that extracts, for sections of the sound information that contain neither music nor voice, a feature quantity obtained by frequency-analyzing the sound information to obtain a time-averaged spectrum; and a video division unit that calculates a similarity as the mutual correlation of the extracted feature quantities, regards the video comprising sections of high similarity and the sound information sandwiched between those sections as one coherent section, and groups it as a segment, thereby dividing the video into a plurality of segments.

3. A recording medium on which is recorded a video division program for dividing a given video in correspondence with scenes, the program causing a computer to execute: video input processing of inputting the video; video storage processing of storing the video; music detection processing of detecting music from the sound information in the video; voice detection processing of detecting voice from the sound information; feature extraction processing of extracting, for sections of the sound information that contain neither music nor voice, a feature quantity obtained by frequency-analyzing the sound information to obtain a time-averaged spectrum; and video division processing of calculating a similarity as the mutual correlation of the extracted feature quantities, regarding the video comprising sections of high similarity and the sound information sandwiched between those sections as one coherent section, and grouping it as a segment, thereby dividing the video into a plurality of segments.