JP6448212B2 - Recognition device and recognition method

Info

Publication number: JP6448212B2
Application number: JP2014083781A
Authority: JP
Inventors: 俊太舘; 克彦森; 優和真継
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2014-04-15
Filing date: 2014-04-15
Publication date: 2019-01-09
Anticipated expiration: 2034-04-15
Also published as: JP2015204030A

Description

本発明は認識装置及び認識方法に関し、特に、静止画、動画または距離画像などの画像情報を入力画像とし、画像中の物体を手掛かりとしてシーン、イベント、構図、もしくは主被写体といった画像の属性情報を推定するために用いて好適な技術に関する。 The present invention relates to a recognition apparatus and a recognition method, and in particular to a technique suitable for taking image information such as a still image, a moving image, or a distance image as an input image and estimating attribute information of the image, such as its scene, event, composition, or main subject, using objects in the image as clues.

画像中の物体を手掛かりとして、画像のシーンやイベントを推定する従来の方法として、例えば非特許文献１がある。非特許文献１は、複数の特定のクラスの物体が画像中に存在するか否かを調べ、その有無の結果の分布を特徴量として画像のシーン判別を行う。 As a conventional method for estimating the scene or event of an image using objects in the image as clues, there is, for example, Non-Patent Document 1. Non-Patent Document 1 checks whether objects of a plurality of specific classes are present in an image, and discriminates the scene of the image using the distribution of those presence/absence results as a feature amount.

Non-Patent Document 1: Li-Jia Li, Hao Su, Yongwhan Lim, Li Fei-Fei, "Objects as Attributes for Scene Classification", Proc. of the European Conf. on Computer Vision (ECCV 2010).
Non-Patent Document 2: P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, "Object Detection with Discriminatively Trained Part Based Models", IEEE Trans. on Pattern Analysis and Machine Intelligence, 2010.
Non-Patent Document 3: Koen E. A. van de Sande, Jasper R. R. Uijlings, Theo Gevers, Arnold W. M. Smeulders, "Segmentation As Selective Search for Object Recognition", IEEE International Conference on Computer Vision, 2011.
Non-Patent Document 4: Jianbo Shi and Jitendra Malik, "Normalized Cuts and Image Segmentation", IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 22, No. 8, 2000.
Non-Patent Document 5: Joao Carreira and Cristian Sminchisescu, "Constrained Parametric Min-Cuts for Automatic Object Segmentation", IEEE Conference on Computer Vision and Pattern Recognition, 2010.
Non-Patent Document 6: Svetlana Lazebnik, Cordelia Schmid, Jean Ponce, "Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories", IEEE Conference on Computer Vision and Pattern Recognition, 2006.
Non-Patent Document 7: P. Felzenszwalb, D. McAllester, D. Ramanan, "A Discriminatively Trained, Multiscale, Deformable Part Model", IEEE Conference on Computer Vision and Pattern Recognition, 2008.
Non-Patent Document 8: T. Kobayashi and N. Otsu, "Action and Simultaneous Multiple-Person Identification Using Cubic Higher-Order Local Auto-Correlation", In Proc. International Conference on Pattern Recognition, pp. 741-744, 2004.

このような方法では、シーン判別の手掛かりとなる被写体を認識する検出器（例えば、非特許文献２のような手法を用いる）を複数用意し、犬や車といった特定の被写体ごとに検出処理を行う必要がある。この時、以下のような課題が存在する。 In such a method, it is necessary to prepare a plurality of detectors that recognize the subjects serving as clues for scene discrimination (for example, using a technique such as that of Non-Patent Document 2), and to perform detection processing for each specific subject such as a dog or a car. This raises the following problems.

第１に、多数のシーンの種類を正確に判別するには、各シーンに関連する多数の被写体の検出器を用意する必要がある。非特許文献２のような検出器の処理は、検出器ごとにスライディング窓と呼ばれる画像走査を行うために計算量が多く、シーン数が増えるのに応じてシーン判別にかかる処理時間が著しく増大する可能性がある。 First, in order to accurately discriminate many types of scenes, it is necessary to prepare detectors for the many subjects related to each scene. A detector such as that of Non-Patent Document 2 is computationally expensive because each detector performs an image scan called a sliding window, so the processing time required for scene discrimination may increase significantly as the number of scenes increases.

第２に、シーンを判別する際に、いずれの被写体が重要かは一般に未知のため、微妙なシーンを見分ける場合、どのような被写体の検出器を用意すればよいかを事前に決めづらい問題がある。 Second, since it is generally unknown which subjects are important for discriminating a scene, it is difficult to decide in advance which subject detectors should be prepared when subtly different scenes must be told apart.

第３に、例えば誕生パーティと結婚式の披露宴のシーンを見分ける際に、人物の服装が普段着かドレスかといった差異が手掛かりになるなど、被写体の有無ではなくバリエーションの違いが重要な場合がある。非特許文献１のような従来手法では、そのままでは被写体のバリエーションの違いを判別の手掛かりにすることができない問題点があった。 Third, there are cases where what matters is not the presence or absence of a subject but a difference in its variation; for example, when distinguishing a birthday party from a wedding reception, whether people are wearing everyday clothes or formal dress can be a clue. Conventional methods such as that of Non-Patent Document 1 cannot, as they stand, use such differences in subject variation as a clue for discrimination.

本発明は前述の問題点に鑑み、様々な被写体を手掛かりにして処理負荷の低い方法で画像のシーンの判別を行うことができるようにすることを目的とする。 In view of the above problems, an object of the present invention is to make it possible to discriminate the scene of an image by a method with a low processing load, using various subjects as clues.

本発明の認識装置は、入力画像から、被写体の複数の候補領域を抽出する候補領域抽出手段と、前記複数の候補領域のそれぞれから該候補領域の特徴量を抽出する特徴量抽出手段と、前記複数の候補領域のそれぞれについて、該候補領域から抽出された前記特徴量に基づいて、学習画像の物体領域から抽出された特徴量と当該学習画像の属性とを対応付けた学習結果を参照して、当該候補領域が抽出された前記入力画像の属性を判定する属性判定手段と、前記複数の候補領域に対する前記属性判定手段の判定結果を集計することにより、前記入力画像の属性を同定する同定手段とを有することを特徴とする。 A recognition apparatus of the present invention comprises: candidate region extraction means for extracting a plurality of candidate regions of subjects from an input image; feature amount extraction means for extracting, from each of the plurality of candidate regions, a feature amount of that candidate region; attribute determination means for determining, for each of the plurality of candidate regions, an attribute of the input image from which the candidate region was extracted, based on the feature amount extracted from the candidate region and with reference to a learning result that associates feature amounts extracted from object regions of learning images with the attributes of those learning images; and identification means for identifying the attribute of the input image by aggregating the determination results of the attribute determination means for the plurality of candidate regions.

本発明によれば、様々な被写体を手掛かりにして処理負荷の低い方法で画像のシーンの判別を行うことが可能となる。また、各シーンにおいてどのような被写体を識別するべきか予め教示する必要がない。 According to the present invention, it is possible to discriminate an image scene by a method with a low processing load using various subjects as clues. Further, it is not necessary to teach in advance what kind of subject should be identified in each scene.

第１の実施形態の認識装置の基本構成を示すブロック図である。 A block diagram showing the basic configuration of the recognition apparatus of the first embodiment.
第１の実施形態の認識装置の処理の流れを説明するフローチャートである。 A flowchart explaining the processing flow of the recognition apparatus of the first embodiment.
候補領域を抽出する処理の手順を説明するフローチャートである。 A flowchart explaining the procedure for extracting candidate regions.
候補領域の特徴を抽出する処理の手順を説明するフローチャートである。 A flowchart explaining the procedure for extracting the features of candidate regions.
属性判定処理部の模式図である。 A schematic diagram of the attribute determination processing unit.
属性の判定処理の手順を説明するフローチャートである。 A flowchart explaining the procedure of the attribute determination processing.
属性の判定処理の結果例を示す図である。 A diagram showing an example result of the attribute determination processing.
第１の実施形態の学習フェーズの基本構成を示すブロック図である。 A block diagram showing the basic configuration of the learning phase of the first embodiment.
第１の実施形態の学習フェーズの処理の手順を説明するフローチャートである。 A flowchart explaining the processing procedure of the learning phase of the first embodiment.
学習フェーズにおける学習用物体領域の抽出の結果例を示す図である。 A diagram showing an example result of extracting object regions for learning in the learning phase.
学習フェーズにおける分類木の学習処理の手順を説明するフローチャートである。 A flowchart explaining the procedure for learning the classification trees in the learning phase.
属性判定処理部の学習結果の模式図である。 A schematic diagram of the learning result of the attribute determination processing unit.
第１の実施形態の派生形態の構成例を示すブロック図である。 A block diagram showing a configuration example of a derived form of the first embodiment.
第２の実施形態の認識装置の基本構成を示すブロック図である。 A block diagram showing the basic configuration of the recognition apparatus of the second embodiment.
第３の実施形態の認識装置の基本構成を示すブロック図である。 A block diagram showing the basic configuration of the recognition apparatus of the third embodiment.
第３の実施形態の被写体の候補領域の抽出結果例を示す図である。 A diagram showing an example result of extracting subject candidate regions in the third embodiment.
第４の実施形態の認識装置の基本構成を示すブロック図である。 A block diagram showing the basic configuration of the recognition apparatus of the fourth embodiment.
構図クラスおよび推定結果の例を示す図である。 A diagram showing examples of composition classes and estimation results.
主被写体領域の推定結果の例を示す図である。 A diagram showing an example of the estimation result of the main subject region.

［第１の実施形態］ [First Embodiment]
以下、図面を参照して本発明の認識装置を実施するための形態の例を説明する。本認識装置は、入力画像を受け取り、その画像があらかじめ定められた複数のシーンのクラスのいずれに属するかを正しく判別することを目的とする。 Hereinafter, an example of a mode for carrying out the recognition apparatus of the present invention will be described with reference to the drawings. The purpose of this recognition apparatus is to receive an input image and correctly determine to which of a plurality of predetermined scene classes the image belongs.

図１に、本実施形態による認識装置の基本的な構成図を示す。画像入力部１０１は、画像データを入力する。候補領域抽出部１０２は、画像の属性情報を判別するための画像の属性に関連する被写体の領域を抽出する。特徴量抽出部１０３は、候補領域抽出部１０２において抽出された被写体の候補領域から画像特徴量を抽出する。属性判定部１０４は、特徴量抽出部１０３で抽出された特徴量に基づいて候補領域がいずれのシーンクラスの画像に含まれる領域であるかを判定する。判定結果統合部１０５は、属性判定部１０４の結果をまとめて最終的に画像のシーンクラスを判定する。 FIG. 1 shows a basic configuration diagram of the recognition apparatus according to the present embodiment. The image input unit 101 inputs image data. The candidate region extraction unit 102 extracts regions of subjects related to the attributes of the image, which are used to determine the attribute information of the image. The feature amount extraction unit 103 extracts image feature amounts from the subject candidate regions extracted by the candidate region extraction unit 102. The attribute determination unit 104 determines, based on the feature amounts extracted by the feature amount extraction unit 103, which scene class of image each candidate region belongs to. The determination result integration unit 105 combines the results of the attribute determination unit 104 and finally determines the scene class of the image.

なおここで、画像のシーンクラスには誕生パーティ、クリスマスパーティ、結婚式、キャンプ、運動会、学芸会、といった様々なシーンやイベントの種別が含まれる。本実施形態では、前述したような数十クラスのシーンがユーザーによって与えられているものとし、入力画像からこれらのシーンクラスを判別する。 Here, the scene class of the image includes various scenes and event types such as a birthday party, a Christmas party, a wedding, a camp, an athletic meet, and a school performance. In this embodiment, it is assumed that several tens of classes of scenes as described above are given by the user, and these scene classes are discriminated from the input image.

さらに、本発明は、前述の例のような日常の出来事に関わるシーンのクラス以外にも適用可能である。例えば、夜景・光源の方向（順光、逆光、右からの斜光、左からの斜光）、花のクロースアップ、といった、カメラが撮影を行う際に撮影のパラメータを調整する目的で設定される撮影モードと呼ばれる種別をシーンクラスとして定義して用いることもできる。このように、本発明は様々な対象の画像の属性の判別に適用可能である。 Furthermore, the present invention can be applied to a class other than a scene class related to a daily event such as the above-described example. For example, shooting set for the purpose of adjusting shooting parameters when the camera performs shooting, such as night view / light source direction (forward light, backlight, oblique light from the right, oblique light from the left), flower close-up, etc. A type called a mode can also be defined and used as a scene class. As described above, the present invention is applicable to discrimination of attributes of various target images.

次に、本実施形態の認識装置の認識処理と学習処理の流れについて説明する。 Next, the flow of the recognition processing and the learning processing of the recognition apparatus of the present embodiment will be described.
＜認識フェーズ＞ <Recognition phase>
まず、認識処理の流れを図２のフローチャートを用いて説明する。 First, the flow of the recognition processing will be described using the flowchart of FIG. 2.
Ｓ２０１では、画像入力部１０１が画像データを受け取る。ここで、本発明の実施形態における画像データとは、カラー画像、動画、または距離画像といった様々な形式の映像情報、およびそれらの組み合わせを指す。本実施形態においては、画像入力部１０１には静止画のカラー画像が入力されるものとする。さらに、本ステップでは画像に応じてサイズの拡大縮小や輝度値の正規化等、以降の認識処理に必要な前処理を必要に応じて行う。 In S201, the image input unit 101 receives image data. Here, image data in the embodiments of the present invention refers to video information in various formats, such as a color image, a moving image, or a distance image, and combinations thereof. In the present embodiment, a still color image is assumed to be input to the image input unit 101. In this step, preprocessing required for the subsequent recognition processing, such as resizing and luminance normalization, is also performed as needed for the image.

次に、Ｓ２０２では、候補領域抽出部１０２が画像のシーンクラスを見分けるための手掛かりとなる複数の被写体の候補領域を抽出する処理を行う。被写体には主に人体、犬、およびコップといった、ある程度決まった形状やサイズを持つ物体の領域と、空、芝生、山といった比較的サイズが大きく形状の不定な背景に関する領域とが存在する。 Next, in S202, the candidate area extraction unit 102 performs a process of extracting candidate areas of a plurality of subjects that are clues for identifying the scene class of the image. The subject mainly includes a region of an object having a certain shape and size such as a human body, a dog, and a cup, and a region related to a background having a relatively large size and an indefinite shape such as the sky, lawn, and a mountain.

本実施形態では、このうちの物体を分析することで、画像シーンを類別する形態について述べる。それは、例えば画像中に料理と思われる物体領域が存在していればパーティのシーンの可能性が高く、ランプのような形状の物体があればキャンプのシーンである可能性が高いと考える、といった具合である。 The present embodiment describes a form in which image scenes are classified by analyzing the objects among these. For example, if the image contains an object region that appears to be food, the scene is likely a party, and if it contains an object shaped like a lamp, the scene is likely a camping scene.

候補領域抽出部１０２では、画像中に存在する物体の領域全体をなるべく正確に抽出することが望ましい。それは、物体に基づいてシーンを判別する際に、抽出された領域が物体の一部しか含んでいなかったり、あるいは異なる物体が混じっていたりすると誤判別の可能性が高くなるためである。 It is desirable for the candidate area extraction unit 102 to extract the entire area of the object existing in the image as accurately as possible. This is because, when a scene is determined based on an object, if the extracted region includes only a part of the object or different objects are mixed, the possibility of erroneous determination increases.

ただし、画像中に存在している様々な物体について、その物体が何であるかを知らずして、一つ一つの物体の境界を正確に切り出すことは極めて困難である。そのため、ここでは完璧な物体領域の切り出しは期待せずに、物体の領域らしいと考えられる候補領域を複数個抽出する。そして、そのうちのいくつかはある程度の正確さで物体領域を含んでいると仮定して、各候補領域で画像のシーンの判別を行う。その結果を多数決することで、誤りが多少含まれていても最終的に正しくシーンのクラスを判定できることを期待する。 However, for the various objects present in an image, it is extremely difficult to cut out the boundary of each object accurately without knowing what the object is. Therefore, perfect segmentation of object regions is not expected here; instead, a plurality of candidate regions that appear likely to be object regions are extracted. Assuming that some of them contain an object region with reasonable accuracy, the scene of the image is discriminated for each candidate region. By taking a majority vote over these results, the scene class is expected to be determined correctly in the end even if some errors are included.

このような条件において、候補領域を抽出するために用いることのできる公知の手法は種々あるが、本実施形態では公知の技術である非特許文献３の方法を参考に用いる。処理の流れを図３のフローチャートを用いて簡単に説明する。 There are various known methods that can be used to extract candidate regions under such conditions, but in this embodiment, the method of Non-Patent Document 3, which is a known technique, is used as a reference. The flow of processing will be briefly described with reference to the flowchart of FIG.

まず、Ｓ３０１では、画像をＳｕｐｅｒ−ｐｉｘｅｌ（以下、ＳＰと表記する）と呼ばれる色が近い画素をまとめた小領域に分割する。 First, in S301, the image is divided into small regions called super-pixels (hereinafter, SP), each of which groups pixels of similar color.
次に、Ｓ３０２では、隣接する全てのＳＰのペアについて、（１）テクスチャ特徴の類似度、および（２）サイズの類似度を算出する。これを数式１のように所定の係数α（０≦α≦１）で重み付け和した値をＳＰペアの類似度とする。 Next, in S302, (1) the similarity of texture features and (2) the similarity of size are calculated for every pair of adjacent SPs. The weighted sum of these two values with a predetermined coefficient α (0 ≦ α ≦ 1), as in Equation 1, is used as the similarity of the SP pair.

ここで、テクスチャ特徴としては、特徴量として広く一般的な色ＳＩＦＴ特徴の頻度ヒストグラムを用いる（詳細は、非特許文献３を参照）。テクスチャ特徴の類似度としては、広く一般に用いられる距離尺度であるヒストグラム交差を用いる。ＳＰサイズの類似度としては、ＳＰペアのうち小さい方のＳＰの面積を大きいＳＰの面積で割った値とする。テクスチャの類似度もサイズの類似度も、類似度が最少のときに０、最大のときに１となる。 Here, as a texture feature, a frequency histogram of a general color SIFT feature is used as a feature amount (refer to Non-Patent Document 3 for details). As the similarity of the texture features, histogram intersection, which is a widely used distance measure, is used. The SP size similarity is a value obtained by dividing the area of the smaller SP of the SP pairs by the area of the larger SP. Both the texture similarity and the size similarity are 0 when the similarity is minimum and 1 when the similarity is maximum.
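Equation 1 itself is not reproduced in this text. The following is a minimal sketch of the SP pair similarity, assuming that the weighted sum is the convex combination α × (texture similarity) + (1 − α) × (size similarity). The function names, the L1 normalization of the histograms, and the default `alpha` are illustrative assumptions, not taken from the source.

```python
def histogram_intersection(h1, h2):
    """Histogram intersection of two histograms after L1 normalization.

    Returns 0.0 for disjoint histograms and 1.0 for identical distributions.
    """
    s1, s2 = sum(h1), sum(h2)
    if s1 == 0 or s2 == 0:
        return 0.0
    return sum(min(a / s1, b / s2) for a, b in zip(h1, h2))


def size_similarity(area1, area2):
    """Area of the smaller SP divided by the area of the larger SP."""
    return min(area1, area2) / max(area1, area2)


def sp_pair_similarity(h1, area1, h2, area2, alpha=0.5):
    # ASSUMPTION: Equation 1 is taken to be a convex combination of the
    # texture similarity and the size similarity, weighted by alpha.
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must satisfy 0 <= alpha <= 1")
    texture = histogram_intersection(h1, h2)
    size = size_similarity(area1, area2)
    return alpha * texture + (1.0 - alpha) * size
```

Both component similarities lie in [0, 1], so under this assumption the combined similarity also lies in [0, 1].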

次に、Ｓ３０３では、最も類似性の高かった２つの隣接ＳＰのペアを連結し、被写体の候補領域とする。連結したＳＰを新たなＳＰとし、特徴量と隣接ＳＰとの類似度を算出しなおす。すべてのＳＰが連結されるまでこれを繰り返す（Ｓ３０２〜Ｓ３０６）。このようにすることで、ＳＰの数から１を引いた数だけ、大小様々な候補領域が抽出される。ただし、小さすぎる被写体領域は誤判別の原因になりやすいので、所定値より大きな面積を持つＳＰのみを候補領域とする（Ｓ３０４〜Ｓ３０５）。以上の候補領域の生成処理の詳細は非特許文献３を参照されたい。 Next, in S303, the pair of adjacent SPs with the highest similarity is merged to form a subject candidate region. The merged SP is treated as a new SP, and its feature amount and its similarity to adjacent SPs are recalculated. This is repeated until all SPs have been merged (S302 to S306). In this way, candidate regions of various sizes are extracted, equal in number to the number of SPs minus one. However, since subject regions that are too small tend to cause misclassification, only SPs with an area larger than a predetermined value are used as candidate regions (S304 to S305). Refer to Non-Patent Document 3 for details of this candidate region generation processing.
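The merging loop of S302 to S306 can be sketched as follows. This is an illustration under stated assumptions, not the procedure of Non-Patent Document 3: the data layout (a dict of integer SP ids to `area` and `hist`), the name `greedy_merge`, and the choice to sum histograms when two SPs merge are all hypothetical. The similarity function is passed in by the caller.

```python
def greedy_merge(superpixels, adjacency, similarity, min_area):
    """Greedily merge the most similar adjacent pair until none remain.

    superpixels: dict int id -> {"area": float, "hist": list of counts}
    adjacency:   iterable of (id_a, id_b) pairs of adjacent SPs
    similarity:  function(region_a, region_b) -> float
    Returns the merged regions whose area exceeds min_area, in merge order.
    """
    if not superpixels:
        return []
    regions = {
        i: {"area": sp["area"], "hist": list(sp["hist"]), "members": {i}}
        for i, sp in superpixels.items()
    }
    neighbors = {i: set() for i in regions}
    for a, b in adjacency:
        neighbors[a].add(b)
        neighbors[b].add(a)

    candidates = []
    next_id = max(regions) + 1
    while True:
        pairs = sorted({(min(a, b), max(a, b))
                        for a in neighbors for b in neighbors[a]})
        if not pairs:
            break  # no adjacent pairs left: everything reachable is merged
        a, b = max(pairs, key=lambda p: similarity(regions[p[0]], regions[p[1]]))
        merged = {
            "area": regions[a]["area"] + regions[b]["area"],
            # ASSUMPTION: count histograms are summed to "recalculate" features.
            "hist": [x + y for x, y in zip(regions[a]["hist"], regions[b]["hist"])],
            "members": regions[a]["members"] | regions[b]["members"],
        }
        merged_neighbors = (neighbors[a] | neighbors[b]) - {a, b}
        for n in merged_neighbors:
            neighbors[n] -= {a, b}
            neighbors[n].add(next_id)
        for old in (a, b):
            del neighbors[old]
            del regions[old]
        neighbors[next_id] = merged_neighbors
        regions[next_id] = merged
        if merged["area"] > min_area:  # drop regions that are too small
            candidates.append(merged)
        next_id += 1
    return candidates
```

For a connected set of N SPs the loop performs N − 1 merges, which matches the count of candidate regions stated above before the area filter is applied.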

なお、ここでは、非特許文献３の方法を参考にして説明したが、物体らしい領域を抽出する方法はこれに限らず、様々な方法が考えられる。例えば、グラフカットと呼ばれる前景と背景を分離する手法、またはテクスチャ分析を行って画像をテクスチャごとに分割する手法（詳細は、非特許文献４を参照のこと）などが広く公知である。 Although the description here follows the method of Non-Patent Document 3, the method of extracting object-like regions is not limited to this, and various methods are conceivable. For example, a technique called graph cut that separates the foreground from the background, and a technique that performs texture analysis to divide an image by texture (refer to Non-Patent Document 4 for details), are widely known.

図２のフローチャートの説明に戻る。 Returning to the description of the flowchart of FIG. 2.
Ｓ２０３では、特徴量抽出部１０３が各候補領域から特徴量を抽出する。抽出する特徴量は複数の種類を含む。ここでは、図４のフローチャートのように、６種類の特徴量の抽出を行う。 In S203, the feature amount extraction unit 103 extracts feature amounts from each candidate region. The extracted feature amounts include a plurality of types. Here, six types of feature amounts are extracted, as shown in the flowchart of FIG. 4.

（１）Ｓ４０１では、候補領域内の色ＳＩＦＴ特徴の頻度ヒストグラムを抽出する。 (1) In S401, a frequency histogram of the color SIFT features within the candidate region is extracted.
（２）Ｓ４０２では、候補領域の面積と位置を出力する。ただし、ここで面積の値は画像全体を１として正規化した値とし、位置は画像の縦横の長さを１として正規化した領域の重心位置とする。 (2) In S402, the area and position of the candidate region are output. Here, the area is normalized so that the area of the whole image is 1, and the position is the centroid of the region, normalized so that the height and width of the image are each 1.

（３）Ｓ４０３では、候補領域の周囲の領域から色ＳＩＦＴ特徴を抽出し頻度ヒストグラムを生成する。これは、物体を認識する際に背景の特徴も手掛かりになるため、周囲の領域の特徴を別に与えるものである。なお、候補領域の周囲の領域とは、候補領域の所定の幅だけ膨張させた領域とする。 (3) In S403, color SIFT features are extracted from the region surrounding the candidate region and a frequency histogram is generated. Because background features are also a clue when recognizing an object, the features of the surrounding region are provided separately. The region surrounding the candidate region is the region obtained by dilating the candidate region by a predetermined width.

次に、候補領域が真に物体の領域か否かを判別する手掛かりとなる特徴量として、以下の第４から第６の特徴量を、Ｓ４０４〜Ｓ４０６で算出する。 Next, the following fourth to sixth feature amounts are calculated in S404 to S406 as clues for determining whether the candidate region is truly an object region.
（４）Ｓ４０４では、候補領域内の色ＳＩＦＴ特徴と、候補領域の周囲の色ＳＩＦＴ特徴の頻度ヒストグラムの類似度を、特徴量として算出する。なお、ここで類似度としてはヒストグラム交差を用いる。 (4) In S404, the similarity between the frequency histogram of the color SIFT features within the candidate region and that of the color SIFT features around the candidate region is calculated as a feature amount. Histogram intersection is used as the similarity here.

（５）Ｓ４０５では、候補領域輪郭のエッジの強度の平均値を、特徴量として算出する。ここでは、画像の輝度勾配の絶対値（ｄｘ²＋ｄｙ²）^(1/2)を、領域の輪郭上で計算して平均値を求める。ただし、ｄｘとｄｙはそれぞれ画像のｘ方向とｙ方向の輝度勾配の値である。 (5) In S405, the average edge strength along the contour of the candidate region is calculated as a feature amount. Here, the magnitude of the luminance gradient of the image, (dx² + dy²)^(1/2), is computed along the contour of the region and averaged, where dx and dy are the luminance gradients of the image in the x and y directions, respectively.

（６）Ｓ４０６では、候補領域の凸形状度を特徴量として算出する。ただし、ここで凸形状度とは、候補領域の面積と候補領域の凸包の面積との比である。この値は、候補領域が凸形状であれば１になり、凹形状であれば０に近づく。 (6) In S406, the convexity of the candidate region is calculated as a feature amount. Here, convexity is the ratio of the area of the candidate region to the area of its convex hull. This value is 1 if the candidate region is convex and approaches 0 as the region becomes more concave.
以上のような特徴を用いることで、候補領域がその周囲から独立した物体領域か否かをある程度判別することが可能となる。 By using the features described above, it becomes possible to determine, to some extent, whether a candidate region is an object region distinct from its surroundings.
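The convexity of S406 can be illustrated with a small sketch. It assumes, for simplicity, that the candidate region is given as a simple polygon (a list of `(x, y)` vertex tuples in order); a pixel-mask representation would need a different area computation. The function names are illustrative.

```python
def polygon_area(points):
    """Absolute area of a simple polygon via the shoelace formula."""
    n = len(points)
    twice_area = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def convex_hull(points):
    """Convex hull by Andrew's monotone chain, counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def convexity(points):
    """Region area divided by convex hull area (1.0 for a convex region)."""
    hull_area = polygon_area(convex_hull(points))
    return polygon_area(points) / hull_area if hull_area > 0 else 0.0
```

For example, a square yields 1.0, while an L-shaped region yields a value below 1 because its convex hull covers the notch as well.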

以上が、Ｓ２０３で行われる特徴抽出の処理となる。なお、ここで用いた特徴量の他にも、候補領域のアスペクト比やモーメント長など、領域に関する特徴量は様々なものが利用可能である。本発明の実施形態としては、前述した特徴量に限るものではないことを留意されたい。 The above is the feature extraction process performed in S203. In addition to the feature values used here, various feature values related to the region such as the aspect ratio and moment length of the candidate region can be used. It should be noted that the embodiment of the present invention is not limited to the above-described feature amount.

次に、Ｓ２０４では属性判定部１０４が、各候補領域の特徴を見てどのシーンクラスに関連のある領域かを判別する。なお、ここで判別されるのは候補領域自体の種別ではなく、候補領域が関連するシーンクラスの種別であることが本発明の重要な特徴である。 Next, in S204, the attribute determination unit 104 examines the features of each candidate region and determines which scene class the region is related to. An important feature of the present invention is that what is determined here is not the type of the candidate region itself but the type of scene class to which the candidate region is related.

本実施形態では、属性判定部１０４は、図５（ａ）に模式図で表すように、複数の分類木５０２ａ〜５０２ｃの集合（アンサンブル）によって構成されるアンサンブル分類木５０２からなる。各分類木は、候補領域の特徴量から正解のシーンクラスを推定し、推定結果を図５（ｂ）のシーンクラスの投票空間へ投票する。 In the present embodiment, the attribute determination unit 104 consists of an ensemble classification tree 502 formed from a set (ensemble) of a plurality of classification trees 502a to 502c, as shown schematically in FIG. 5(a). Each classification tree estimates the correct scene class from the feature amounts of a candidate region and casts the estimation result as a vote into the scene class voting space of FIG. 5(b).

分類木の各判別ノード（一部に記号５０３ａ〜５０３ｃを付して示す）は、線形判別器で構成されている。各葉ノード（一部に記号５０３ｄ〜５０３ｆを付して示す）には、学習フェーズにおいてその葉ノードに割り振られた学習データの情報が格納されている。学習データの一部に記号５０４ａ〜５０４ｃ、対応するシーンクラスのラベルに記号５０５ａ〜５０５ｃを付して示す。後に学習フェーズで詳細を説明するように、各判別ノードは、学習データの候補領域がなるべくシーンクラスごとに分かれるように学習されている。 Each discrimination node of a classification tree (some are indicated by symbols 503a to 503c) consists of a linear discriminator. Each leaf node (some are indicated by symbols 503d to 503f) stores information on the learning data assigned to that leaf node in the learning phase. Some of the learning data are indicated by symbols 504a to 504c, and the corresponding scene class labels by symbols 505a to 505c. As described in detail later for the learning phase, each discrimination node is trained so that the candidate regions of the learning data are separated by scene class as far as possible.

図６のフローチャートを参照しながらアンサンブル分類木による判定処理の手順を説明する。 The procedure of the determination processing by the ensemble classification tree will be described with reference to the flowchart of FIG. 6.
Ｓ６０１で、候補領域をアンサンブル分類木に入力すると、判定処理をＳ６０３〜Ｓ６０７において、各々の分類木ごとに行う。 When a candidate region is input to the ensemble classification tree in S601, the determination processing of S603 to S607 is performed for each classification tree.

まずＳ６０３で、分類木の根ノード５０３ａからシーンクラス判別処理を開始する。各ノードでは、候補領域の特徴に基づいて判別器による判別を行う。 First, in S603, the scene class discrimination processing starts from the root node 503a of the classification tree. At each node, the discriminator makes a decision based on the features of the candidate region.
Ｓ６０４では、判別結果によって次の枝を決定して進む。 In S604, the next branch is selected according to the discrimination result, and processing moves along it.
Ｓ６０５では、葉ノードに到達したか否かを判断し、到達していない場合はＳ６０４に戻り、各ノードでの判別と移動を葉ノードにたどり着くまで繰り返す。葉ノードに到達したら、葉ノードに格納されている学習データのシーンクラスの比率を参照する。これが入力候補領域のシーンクラスの尤度スコアになる。 In S605, it is determined whether a leaf node has been reached. If not, processing returns to S604, and the discrimination and movement at each node are repeated until a leaf node is reached. When a leaf node is reached, the proportions of the scene classes in the learning data stored at that leaf node are looked up. These become the scene class likelihood scores of the input candidate region.

ただしここで、記号φが付された学習データ（物体でない領域のデータなのでここではこれを非物体領域クラスと呼ぶ）の比率の多い葉ノードにたどり着いた場合は、入力した候補領域も物体領域でない可能性が高い。そのため、所定の比率以上に非物体領域クラスのデータがある葉ノードからは投票を行わない（Ｓ６０６）。そうでない場合は、各クラスに対して尤度スコアの値を投票して加算する（Ｓ６０７）。図５（ａ）の例では、到達した葉ノードでは誕生パーティクラスの割合が２／３のため、０．６６７の値を加算している。 However, if the candidate region arrives at a leaf node with a high proportion of learning data marked with the symbol φ (data from regions that are not objects, here called the non-object region class), the input candidate region is also likely not to be an object region. Therefore, no vote is cast from a leaf node in which non-object region class data account for a predetermined ratio or more (S606). Otherwise, the likelihood score is voted for and added to each class (S607). In the example of FIG. 5(a), the proportion of the birthday party class at the reached leaf node is 2/3, so a value of 0.667 is added.
なお、投票の方法はこれ以外にもいくつかバリエーションがある。例えば葉ノードの最大比率のシーンクラスを一つ選んで１票のみを投票するような形態でもよい。 There are several other variations of the voting method. For example, the single scene class with the largest proportion at the leaf node may be selected and given exactly one vote.
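The leaf-node voting of S606 and S607 can be sketched as follows. The leaf is represented here as a dict of class label to the number of training samples stored at that leaf; this layout, the `NON_OBJECT` label standing for the class marked φ, and the default cutoff of 0.5 are illustrative assumptions. The sketch also assumes that no score is added for the non-object class itself, which the text does not state explicitly.

```python
NON_OBJECT = "non-object"  # stands for the class marked with the symbol φ


def vote_from_leaf(leaf_counts, votes, max_non_object_ratio=0.5):
    """Add the leaf's scene-class proportions to the running vote totals.

    leaf_counts: dict class label -> number of training samples in the leaf
    votes:       dict class label -> accumulated score (updated in place)
    Returns True if a vote was cast, False if the leaf was skipped.
    """
    total = sum(leaf_counts.values())
    if total == 0:
        return False
    # S606: skip leaves dominated by non-object training regions.
    if leaf_counts.get(NON_OBJECT, 0) / total >= max_non_object_ratio:
        return False
    # S607: add each scene class proportion as a likelihood score.
    for label, count in leaf_counts.items():
        if label == NON_OBJECT:
            continue  # ASSUMPTION: the non-object class receives no votes
        votes[label] = votes.get(label, 0.0) + count / total
    return True
```

With a leaf holding two birthday party samples and one wedding sample, the birthday party class receives 2/3, which corresponds to the 0.667 of the FIG. 5(a) example.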

以上のようにして全ての分類木が全ての候補領域について判別して投票を行ったら、投票処理Ｓ２０５を終了する。 When all classification trees have made their determinations and cast their votes for all candidate regions in this way, the voting processing S205 ends.
次に、Ｓ２０６では、こうして得られた投票の合計数の最大のシーンクラスを入力画像のシーンクラスとして出力する。ただし、ここで別の形態として、いずれのシーンクラスの投票数も所定の閾値に満たない場合には、シーンクラスは不明であると出力してもよい。また逆に、複数のシーンクラスの投票数が同時に所定の閾値を超えている場合は、該当するシーンクラス全てを出力してもよい。 Next, in S206, the scene class with the largest total number of votes obtained in this way is output as the scene class of the input image. As an alternative form, if the vote count of no scene class reaches a predetermined threshold, the scene class may be output as unknown. Conversely, if the vote counts of several scene classes exceed the predetermined threshold at the same time, all of those scene classes may be output.
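The decision of S206 and its threshold-based variants might be sketched as below. The function name and the convention that an empty list means "unknown" are assumptions for illustration; treating a vote count equal to the threshold as passing is also an assumption, since the text leaves the boundary case open.

```python
def decide_scene_classes(votes, threshold=None):
    """Pick the output scene class or classes from accumulated votes.

    votes:     dict scene class -> total vote score
    threshold: None for the basic form (single best class); otherwise every
               class whose score is at least the threshold is returned, and
               an empty list stands for "scene class unknown".
    """
    if not votes:
        return []
    if threshold is None:
        return [max(votes, key=votes.get)]
    passed = [label for label, score in votes.items() if score >= threshold]
    return sorted(passed, key=lambda label: -votes[label])
```

For instance, with votes of 5.0 for "party" and 1.0 for "camp", the basic form returns only "party", a threshold of 0.5 returns both classes, and a threshold of 10 returns an empty list.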

図７に、候補領域の抽出からシーンクラスの判定までの結果の例を示す。図７（ａ）は、入力画像の例である。図７（ｂ）は、候補領域抽出部１０２によって抽出された物体の候補領域を矩形枠で示したものである（ただし、煩雑さを避けるためすべての候補領域を図示していない）。 FIG. 7 shows an example of results from extraction of candidate areas to scene class determination. FIG. 7A shows an example of an input image. FIG. 7B shows a candidate area of an object extracted by the candidate area extraction unit 102 with a rectangular frame (however, all candidate areas are not shown in order to avoid complexity).

図７（ｃ）は全候補領域を属性判定部１０４で判定させた結果、非物体領域クラスと判断された候補領域を除いた後の図である。このようにして、残った候補領域からシーンクラスの投票を行った結果例を図７（ｄ）に示す。図中の記号θはシーンクラスを判定するための閾値である。この例ではパーティのシーンが閾値θを上回っているため、パーティシーンとして出力して認識処理を終了する。 FIG. 7(c) shows the candidate regions that remain after all candidate regions have been evaluated by the attribute determination unit 104 and those judged to belong to the non-object region class have been removed. FIG. 7(d) shows an example of the result of scene class voting from the remaining candidate regions. The symbol θ in the figure is the threshold used to determine the scene class. In this example, the party scene exceeds the threshold θ, so the image is output as a party scene and the recognition processing ends.

＜学習フェーズ＞ <Learning phase>
次に、学習フェーズと呼ばれる、アンサンブル分類木の学習の処理について説明する。 Next, the processing for learning the ensemble classification tree, called the learning phase, will be described.
本処理の目的は、ユーザーから提供される（１）学習画像セット、およびそれに対応する（２）シーンクラスの教師値、および（３）物体の位置の教師値、の３つを属性判定部１０４に与えて訓練する。そして、入力画像に対して正しくシーンクラスの種別が出力できるようなアンサンブル分類木を作成することである。 The purpose of this processing is to train the attribute determination unit 104 by giving it three items provided by the user, namely (1) a learning image set, (2) the corresponding teacher values of the scene class, and (3) the teacher values of object positions, and thereby to create an ensemble classification tree that can correctly output the scene class for an input image.

図８に、学習フェーズにおける本認識装置の構成図を示す。これは、認識フェーズでの基本構成図の図１に準ずるものである。図１との相違点は、物体位置データ入力部１０６および画像属性データ入力部１０７が存在して、画像のシーンクラスおよび物体の位置の教師値を入力する点である。 FIG. 8 shows a configuration diagram of the recognition apparatus in the learning phase. This is similar to FIG. 1 of the basic configuration diagram in the recognition phase. The difference from FIG. 1 is that an object position data input unit 106 and an image attribute data input unit 107 exist and input a teacher value of an image scene class and an object position.

図９（ａ）のフローチャートを参照しながら学習フェーズの動作処理の説明を行う。 The operation of the learning phase will be described with reference to the flowchart of FIG. 9(a).
まず、Ｓ９０１では、画像入力部１０１が学習画像を入力する。また同時に、画像属性データ入力部１０７が各学習画像に対応するシーンクラスの教師値を入力する。 First, in S901, the image input unit 101 inputs the learning images. At the same time, the image attribute data input unit 107 inputs the teacher value of the scene class corresponding to each learning image.

次に、Ｓ９０２では、各学習画像について、物体位置データ入力部１０６が物体の領域の位置の教師値を入力する。 Next, in S902, the object position data input unit 106 inputs the teacher values of the positions of the object regions for each learning image.
ここでの物体位置データとは、学習画像１枚１枚について、図１０（ａ）に示すように、画像に含まれる物体領域の位置をユーザーが教師値として用意しておいたものである。図１０（ａ）では、物体領域の位置の教師値の形態の一例として物体の領域に外接する矩形で表している（教師値の矩形の一部に記号１００２ａ〜１００２ｄを付して示している）。 The object position data here are the positions of the object regions contained in each learning image, prepared in advance by the user as teacher values, as shown in FIG. 10(a). In FIG. 10(a), as one example of the form of these teacher values, each object region position is represented by a rectangle circumscribing the object region (some of the teacher value rectangles are indicated by symbols 1002a to 1002d).

なお、ユーザーにとってどの物体がそのシーンを判別する際に大きく寄与するかという判断は、判別対象シーン数が多い場合は困難である。そのため、ユーザーは各物体についての余計な推定はせず、物体であるか否かのみを判断してなるべく多くの物体の位置を教示するものとする。ただし、小さすぎる物体、遮蔽されていてよく見えていないような物体については学習を困難にするため教示を省いてもよい。 Note that when the number of scenes to be discriminated is large, it is difficult for the user to judge which objects contribute strongly to discriminating a scene. For this reason, the user makes no further guesses about each object, judges only whether something is an object, and teaches the positions of as many objects as possible. However, objects that are too small, or that are occluded and not clearly visible, make learning difficult, so teaching may be omitted for them.

次に、Ｓ９０３では、認識フェーズで行ったのと同じ方法によって候補領域抽出部１０２が候補領域を抽出する。図１０（ｂ）は、その結果例である。ここでは、候補領域の矩形のいくつかに記号１００３ａ〜１００３ｃを付して示す。 Next, in S903, the candidate area extraction unit 102 extracts candidate areas by the same method as that performed in the recognition phase. FIG. 10B shows an example of the result. Here, symbols 1003a to 1003c are attached to some of the rectangles of the candidate areas.

次に、Ｓ９０４では、抽出した候補領域について、最も近くにある教師値の物体領域の重なりの度合い（オーバーラップ値）を調べる。領域ｘと領域ｙのオーバーラップ値は以下の数式２で算出する。 Next, in S904, the degree of overlap (overlap value) of the closest teacher-valued object region is examined for the extracted candidate region. The overlap value of the region x and the region y is calculated by the following formula 2.
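Equation 2 is an image in the original publication and is not reproduced in this text. The following is a minimal sketch, assuming that the overlap value is the commonly used intersection-over-union ratio and that regions are axis-aligned rectangles given as `(x1, y1, x2, y2)`. Both points are assumptions, and the function name is hypothetical.

```python
def overlap_value(x, y):
    """Intersection-over-union of two rectangles (x1, y1, x2, y2).

    Assumed form of the overlap value; the source formula is not shown.
    """
    # Corners of the intersection rectangle (may be empty).
    ix1, iy1 = max(x[0], y[0]), max(x[1], y[1])
    ix2, iy2 = min(x[2], y[2]), min(x[3], y[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_x = (x[2] - x[0]) * (x[3] - x[1])
    area_y = (y[2] - y[0]) * (y[3] - y[1])
    union = area_x + area_y - inter
    return inter / union if union > 0 else 0.0
```

Under this definition the value is 1 for identical rectangles and 0 for disjoint ones, which matches the thresholds of 0.5 and 0.2 used in the text.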

次に、Ｓ９０５では、物体領域とのオーバーラップ値が所定の値以上（ここでは、０．５以上とする）の候補領域を学習に用いる物体領域として採用する。学習用に採用された物体領域を、図１０（ｃ）において記号１００４ａ〜１００４ｅを付して示す。
さらに、オーバーラップ値が所定の値未満（ここでは、０．２未満とする）だった候補領域を、非物体クラスの領域として以降の学習に用いる。但し非物体クラスの領域数は物体の領域の数よりも多数のため、適当にサンプリングして数を減らして用いる。 Next, in S905, a candidate area whose overlap value with the object area is a predetermined value or more (here, 0.5 or more) is adopted as the object area used for learning. The object regions adopted for learning are shown with symbols 1004a to 1004e in FIG.
Further, candidate regions whose overlap value is less than a predetermined value (here, less than 0.2) are used in the subsequent learning as regions of the non-object class. However, because non-object regions far outnumber object regions, they are subsampled to reduce their number before use.
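The selection in S905 can be sketched as follows. This is an illustrative sketch only: the input is assumed to be a list holding, for each candidate region, its overlap value with the nearest teacher region, and the cap on negatives (`max_neg_ratio`) and the random seed are hypothetical parameters that the text does not specify. Candidates between the two thresholds are simply left unused here.

```python
import random

def select_training_regions(best_overlaps, pos_thresh=0.5, neg_thresh=0.2,
                            max_neg_ratio=1.0, seed=0):
    """Split candidate indices into object / non-object training sets.

    best_overlaps[i] is the overlap of candidate i with its nearest
    teacher region. max_neg_ratio (hypothetical) caps the number of
    non-object regions relative to the number of object regions.
    """
    positives = [i for i, o in enumerate(best_overlaps) if o >= pos_thresh]
    negatives = [i for i, o in enumerate(best_overlaps) if o < neg_thresh]
    limit = max(1, int(len(positives) * max_neg_ratio))
    if len(negatives) > limit:
        # Random subsampling of the non-object class.
        negatives = sorted(random.Random(seed).sample(negatives, limit))
    return positives, negatives
```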

次に、Ｓ９０６では、ここまでの処理で得られた物体領域の学習用データと非物体領域の学習用データを用いて属性判定部１０４の学習を行う。以降は、重要な処理であるので特に詳細なフローを図１１にて示して説明を行う。 Next, in S906, the attribute determination unit 104 is trained using the object region learning data and the non-object region learning data obtained by the processing so far. Because this is an important process, its detailed flow is shown in FIG. 11 and described below.

まず、図１１のＳ１１０１で特徴量抽出部１０３が認識フェーズと同様の方法で、全ての物体領域と非物体領域より特徴量を抽出する。以下、Ｓ１１０２〜Ｓ１１０６において、各々の領域ごとに特徴量の抽出を行う。
Ｓ１１０３では、根ノードから学習を開始する。まず学習データが存在するシーンクラスを２グループにランダムに分ける。更に、非物体領域も独立した一つのシーンクラスとして見なし、これも２グループのどちらかをランダムに選んでまとめて割り振る。 First, in S1101 of FIG. 11, the feature amount extraction unit 103 extracts feature amounts from all object regions and non-object regions by the same method as in the recognition phase. Hereinafter, in S1102 to S1106, feature amounts are extracted for each region.
In S1103, learning starts from the root node. First, the scene classes for which learning data exist are randomly divided into two groups. The non-object regions are also regarded as one independent scene class, which is assigned as a whole to one of the two groups chosen at random.
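The random grouping in S1103 could look like the sketch below. The label `NON_OBJECT`, the function name, and the requirement that both groups be non-empty are assumptions made for illustration; the text only states that the classes, including the non-object class, are split at random into two groups. At least one scene class is assumed to be present.

```python
import random

NON_OBJECT = "non_object"  # hypothetical label for the non-object class

def random_two_groups(scene_classes, rng):
    """Randomly assign each class (plus the non-object class) to group 0 or 1."""
    classes = list(scene_classes) + [NON_OBJECT]
    rng.shuffle(classes)
    # Choose the cut so that neither group is empty (an assumption).
    cut = rng.randint(1, len(classes) - 1)
    return {c: (0 if i < cut else 1) for i, c in enumerate(classes)}
```

Each learning sample then takes the group of its scene class as the binary target for the node's discriminator.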

次にＳ１１０４では、各学習データの特徴量を入力として、先に定義した２グループに判別できるように判別器の学習を行う。ここでは、機械学習の方法として一般的な線形のサポートベクトルマシン（以降ＳＶＭと呼ぶ）の手法を用いる。なおこの時、一部の学習データをランダムにサンプルして評価用のデータとして学習に使わずに別にとっておく。 Next, in S1104, a discriminator is trained, taking the feature amount of each learning sample as input, so that it can separate the two groups defined above. Here, a linear support vector machine (hereinafter SVM), a common machine learning method, is used. At this time, part of the learning data is randomly sampled and set aside as evaluation data, without being used for training.

次のＳ１１０５では、この評価用のデータを用いてＳＶＭの判別能力の有効性を評価する。まず評価用データをＳＶＭで判別して２分割する。そして、分類木の学習に一般的に使われる下記の（数式３）の情報量を計算する。 In the next S1105, the effectiveness of the discrimination capability of the SVM is evaluated using this evaluation data. First, the evaluation data is discriminated by SVM and divided into two. Then, the amount of information of the following (formula 3) that is generally used for learning the classification tree is calculated.

ただし、ここで、ｃはシーンクラスの変数、ｐ_kcは２つに分割されたデータのうち、ｋ側（ここでは、分割後のデータをＬとＲの記号で示している）のデータでシーンクラスｃが占める割合を意味する。Ｎは分割前のデータ数、ｎ_kはｋ側に分割されたデータの数、を意味する。判別器によって２分割された時に、分割後の各群のデータにクラスの分布の偏りが生じていれば情報量の値は大きくなり、偏りがなければ情報量の値は小さい。
Ｓ１１０２〜Ｓ１１０６では、このようにしてランダムな２グループの定義と、ＳＶＭの学習と、ＳＶＭの判別結果の評価を所定の回数繰り返す。 Here, c is the scene class variable, and p_kc is the proportion of scene class c in the data on side k of the two-way split (the two sides are denoted by the symbols L and R). N is the number of data before the split, and n_k is the number of data assigned to side k. When the discriminator splits the data in two, the information amount is large if the class distribution in each resulting group is biased, and small if it is not.
In S1102 to S1106, the definition of two random groups, the learning of SVM, and the evaluation of the discrimination result of SVM are repeated a predetermined number of times.
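Equation 3 is an image in the original publication and is not reproduced in this text. As a hedged illustration, the sketch below assumes the form sum_k (n_k / N) * sum_c p_kc * log(p_kc), which uses only the symbols defined above and behaves as described: it is larger (closer to 0) when each side is dominated by a few classes. The actual equation may differ in sign or constant terms, and the function name is hypothetical.

```python
import math
from collections import Counter

def split_score(left_labels, right_labels):
    """Assumed criterion: sum_k (n_k / N) * sum_c p_kc * log(p_kc).

    The value is <= 0 and approaches 0 as each side becomes dominated
    by a single scene class, so a larger value means a better split.
    """
    total = len(left_labels) + len(right_labels)
    score = 0.0
    for side in (left_labels, right_labels):
        n_k = len(side)
        if n_k == 0:
            continue
        for count in Counter(side).values():
            p_kc = count / n_k
            score += (n_k / total) * p_kc * math.log(p_kc)
    return score
```

In the loop of S1102 to S1106, a score of this kind would be computed on the held-out evaluation data for each trial, and the trial with the largest score kept.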

所定の回数が終われば、次にＳ１１０７で、それまでに得られた中で最も情報量の大きな結果が得られたＳＶＭのパラメータをこのノードの判別器として採用する。
Ｓ１１０８では、採用したＳＶＭの学習パラメータを用いて学習データを改めて判別してＳＶＭスコアが正か負かで２つに分割し、それぞれを右と左の枝のノードに割り当てる。そして、左右の各枝においてＳ１１０２からの処理を再帰的に繰り返す。 When the predetermined number of times is over, in S1107, the SVM parameter that has obtained the largest amount of information obtained so far is adopted as the discriminator of this node.
In S1108, the learning data is determined again using the adopted SVM learning parameters, and is divided into two depending on whether the SVM score is positive or negative, and each is assigned to the nodes on the right and left branches. Then, the processing from S1102 is recursively repeated on each of the left and right branches.

ただし、分割した結果、シーンクラスの種類が一種類のみになった場合（Ｓ１１０９）、もしくは分割後のデータ数が所定数を下回った場合（Ｓ１１１０）、その枝の学習を打ち切って葉ノードとする。そして、その時点で残っている学習データの情報を葉ノードに記憶して、再帰処理を終了する。 However, if only one scene class remains after the split (S1109), or if the number of data after the split falls below a predetermined number (S1110), learning of that branch is terminated and the node becomes a leaf node. The information of the learning data remaining at that point is stored in the leaf node, and the recursion ends.

このように、各ノードでの分割ごとにシーンクラスの出現頻度に偏りを生じさせる判別を行っていくことで、シーンクラスの分類が行われる。このような学習結果の模式的な例を図１２に示す。 In this manner, scene classes are classified by performing discrimination that causes deviation in appearance frequency of scene classes for each division at each node. A schematic example of such a learning result is shown in FIG.

図１２では、分類木５０２ａの各葉ノード（一部に、５０３ｄ〜５０３ｆの記号を付して示す）に、形態の特徴量が近いか、もしくはシーンクラスの同じ領域のデータが固まっている。必ずしも完全な分類はなされていないものの、このような分類木を多数学習してアンサンブル分類木として投票によって統合することで、精度の高いシーンクラスの判別結果が得られる。 In FIG. 12, region data that are close in appearance features, or that belong to the same scene class, are gathered at each leaf node of the classification tree 502a (some leaf nodes are labeled 503d to 503f). Although the classification is not necessarily perfect, a highly accurate scene class determination can be obtained by learning many such classification trees and integrating them by voting as an ensemble classification tree.

またここで、誕生日ケーキ５０４ｂとクリスマスケーキ５０４ｃが別の葉ノード５０３ｅと５０３ｄにあることに留意されたい。本手法では被写体をケーキや帽子といった種別ごとに分類するのではなく、被写体の属する画像のシーンクラスごとに分類を行う。そのため、ケーキという同じ種別の物体であっても、誕生日とクリスマスという異なるシーンクラスに属し、且つ見えのバリエーションが異なるものであれば、この図１２のように、異なる枝への分類が自動的に行われることも可能なことを示すものである。これは、本発明の形態の重要な効果であるので特にここで強調しておく。 Note also that the birthday cake 504b and the Christmas cake 504c are at different leaf nodes, 503e and 503d. This method does not classify subjects by type, such as cake or hat, but by the scene class of the image to which the subject belongs. Therefore, even objects of the same type, such as cakes, can be automatically classified into different branches, as in FIG. 12, if they belong to different scene classes (birthday and Christmas) and differ in appearance. This is emphasized here because it is an important effect of this embodiment of the present invention.

なお、ここで判別器の学習にＳＶＭを用い、評価基準として情報量基準を用いたが、学習方法や評価基準は、本発明において特にこれに限定するものではなく、公知の判別器の手法が広く適用可能である。例えば、ＳＶＭの代わりに線形判別分析でもよいし、評価基準には一般的なジニ係数などを用いてもよい。 Here, an SVM is used to train the discriminator and an information amount criterion is used as the evaluation criterion, but the present invention is not limited to this learning method or evaluation criterion, and known discriminator techniques are widely applicable. For example, linear discriminant analysis may be used instead of the SVM, and a common measure such as the Gini coefficient may be used as the evaluation criterion.

以上が分類木の学習の方法である。ここでは、一本の分類木の学習についてのみ述べた。アンサンブル分類木は複数の分類木で構成されており、他の分類木を学習する際には、分類木ごとにバリエーションを持たせる必要がある。その方法としては、複数の公知の方法があるが、ここでは最も一般的な方法を用いる。即ち、分類木ごとに学習データをサブサンプリングし、それぞれの木で異なる学習データセットに基づいて学習を行えばよい。 The above is the method of learning the classification tree. Only the learning of a single classification tree has been described here. The ensemble classification tree is composed of a plurality of classification trees, and when learning other classification trees, it is necessary to provide variations for each classification tree. There are a plurality of known methods as the method, but the most general method is used here. That is, learning data may be subsampled for each classification tree, and learning may be performed based on different learning data sets for each tree.
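The per-tree subsampling described above can be sketched as follows. This is only an illustration: the sampling fraction, the use of sampling without replacement, and the function name are assumptions, since the text does not specify them.

```python
import random

def subsample_per_tree(num_samples, num_trees, fraction=0.7, seed=0):
    """Return one list of training-sample indices per classification tree.

    fraction and seed are hypothetical parameters; sampling here is
    without replacement within each tree.
    """
    rng = random.Random(seed)
    size = max(1, int(num_samples * fraction))
    return [sorted(rng.sample(range(num_samples), size))
            for _ in range(num_trees)]
```

Each tree is then trained on its own index list, so that the trees differ from one another and their votes are not fully correlated.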

＜被写体の位置の教示方法についての派生の形態＞
ここまでの説明では、被写体の存在位置を教師値として与える学習の方法について述べたが、学習時の被写体の位置の教示は本発明において必須の要件ではない。本発明の適用の範囲がこの形態に限定されないことを示すために、以下に被写体の教示方法の他の派生形態の説明を行う。 <Derived Form for Teaching Method of Subject Position>
In the above description, the learning method for giving the subject position as a teacher value has been described. However, teaching of the subject position at the time of learning is not an essential requirement in the present invention. In order to show that the scope of application of the present invention is not limited to this form, another derivative form of the subject teaching method will be described below.

被写体の教師値の方法の例として、以下の４つの方法が考えられる。
（１）ユーザーが被写体の位置を全ての画像について多数教示する。
（２）ユーザーが被写体の位置を一部のみ教示する。
（３）ユーザーが被写体の位置を教示しない。
（４）ユーザーがシーンクラスに特に関連が強いと考える被写体の位置のみ教示する。 The following four methods can be considered as examples of the method of teaching value of a subject.
(1) The user teaches a large number of object positions for all images.
(2) The user teaches only a part of the position of the subject.
(3) The user does not teach the position of the subject.
(4) The user teaches only the positions of subjects considered particularly relevant to the scene class.

ここで、（１）はすでに説明を行った方法である。（４）はクリスマスパーティにおけるクリスマスツリーなど、そのシーンとの関連の顕著な物体のみを指示するものである。以降、（２）〜（４）の派生の形態について順に簡単な説明を行う。 Here, (1) is the method already described. (4) indicates only objects that are significantly related to the scene, such as a Christmas tree at a Christmas party. Hereinafter, the derivation forms (2) to (4) will be briefly described in order.

まず、被写体の位置を一部のみ教示する形態例（２）について説明する。ここでは、形態例（１）との相違点のみに絞って説明する。
形態例（２）は、半教師学習と呼ばれる公知の学習手法に類する。具体的には、まず物体位置が教示されている学習画像から物体領域の特徴量を抽出する。次に、教師値の与えられてない学習画像からも候補領域を抽出し、それらの特徴量を求める。次に、候補領域のうち、いずれかの教示された領域と所定値以上に特徴量が類似している領域があれば、それらも学習用の物体領域として優先的に採用する。所定値以下の領域は、非物体クラスの学習データとする。以上が形態例（２）の方法の説明である。 First, an example (2) of teaching only a part of the position of the subject will be described. Here, only the differences from the embodiment (1) will be described.
Embodiment example (2) resembles the known learning technique called semi-supervised learning. Specifically, feature amounts of the object regions are first extracted from the learning images for which object positions have been taught. Next, candidate regions are also extracted from the learning images for which no teacher values are given, and their feature amounts are obtained. Then, any candidate region whose feature amount is similar to that of one of the taught regions by at least a predetermined value is also preferentially adopted as an object region for learning. Regions at or below the predetermined value are used as learning data of the non-object class. This concludes the description of embodiment example (2).
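The labeling step of embodiment example (2) might be sketched as below. The text does not name a similarity measure or a threshold, so cosine similarity, the value 0.8, and the function names are assumptions chosen purely for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two feature vectors (assumed measure)."""
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return dot / norm if norm > 0 else 0.0

def pseudo_label(taught_features, candidate_features, threshold=0.8):
    """Label untaught candidates by similarity to any taught object region."""
    labels = []
    for cand in candidate_features:
        best = max((cosine_similarity(cand, t) for t in taught_features),
                   default=0.0)
        labels.append("object" if best >= threshold else "non_object")
    return labels
```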

次に、被写体の情報を全く教示しない形態例（３）について説明する。本形態例（３）では、学習画像から抽出した全ての候補領域を物体領域データか非物体領域かの区別なく使って分類木の学習を行う。学習フェーズのフローは図９（ｂ）のフローチャートのようになる。 Next, embodiment example (3), in which no subject information is taught at all, will be described. In embodiment example (3), the classification trees are learned using all candidate regions extracted from the learning images, without distinguishing object regions from non-object regions. The flow of the learning phase is as shown in the flowchart of FIG. 9B.

図９（ｂ）のフローチャートにおいて、Ｓ９１１はＳ９０１に対応し、Ｓ９１２はＳ９０３に対応し、Ｓ９１３はＳ９０６に対応する。この形態においては、先に用いた「非物体領域のクラス」は設定しない。そのため不完全に切り出された領域や物体に無関係な領域も多数シーンクラスに関係する被写体領域として学習に用いられることになる。 In the flowchart of FIG. 9B, S911 corresponds to S901, S912 corresponds to S903, and S913 corresponds to S906. In this embodiment, the previously used “non-object region class” is not set. Therefore, incompletely cut out regions and regions unrelated to the object are also used for learning as subject regions related to the scene class.

形態例（３）の方法は、ユーザーの教示に費やす負担は軽減されるが、その分判別精度が低い。これについての対策の工夫は複数考えられる。
まず、第１の工夫として、データの数および分類木の数を増やし、認識時の投票の数を増やすことである。アンサンブル学習の一般的性質として、個々の弱識別器（分類木）の判別精度が低くても、バリエーションが大きければ弱識別器の数の増加につれて判別精度が漸近的に上がることは広く公知である。 In the method of the embodiment (3), although the burden on the user's teaching is reduced, the discrimination accuracy is low accordingly. There are a number of ideas for countermeasures.
The first measure is to increase the amount of data and the number of classification trees, and thereby the number of votes at recognition time. As a general property of ensemble learning, it is widely known that even if the discrimination accuracy of each weak classifier (classification tree) is low, the overall accuracy rises asymptotically as the number of weak classifiers increases, provided their variation is large.

さらに、被写体の教示方法の形態例（３）の別の付加的な工夫としては、候補領域抽出部１０２において候補領域を抽出した後に、どの程度物体の領域が正しく抽出されているか（以降、これを物体度と呼ぶ）を推定する。物体度が低いと判定された領域は学習にも認識にも使わないことである。このような物体度の推定方法は種々の方法が公知になっており、例えば非特許文献５のような方法が挙げられる。 A further measure for embodiment example (3) is to estimate, after the candidate region extraction unit 102 has extracted the candidate regions, how correctly each region captures an object (hereinafter called the object degree). Regions judged to have a low object degree are used neither for learning nor for recognition. Various methods for estimating the object degree are known, for example the method of Non-Patent Document 5.

非特許文献５では、候補領域の特徴量を入力とし、候補領域と真の物体領域の重なりの量であるオーバーラップ値を推定する回帰の問題を解いている（詳細は非特許文献５を参照のこと）。 Non-Patent Document 5 solves the problem of regression that estimates the overlap value, which is the amount of overlap between the candidate area and the true object area, using the feature quantity of the candidate area as input (see Non-Patent Document 5 for details). )

これを踏まえて処理フローを図９（ｃ）のフローチャートのように改変する。
即ち、Ｓ９２２で候補領域を抽出した後に、Ｓ９２３でオーバーラップ値を推定して物体度とし、Ｓ９２４で所定の値よりも物体度の低い領域を除いた上で、Ｓ９２５で学習および認識を行う。 Based on this, the processing flow is modified as shown in the flowchart of FIG. 9C.
That is, after extracting a candidate area in S922, an overlap value is estimated to be an object degree in S923, and an area having an object degree lower than a predetermined value is removed in S924, and then learning and recognition are performed in S925.
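Steps S923 and S924 can be sketched as a simple filter. The regressor `estimate_overlap` stands in for a method such as that of Non-Patent Document 5 and is a hypothetical callable; the threshold value and the dictionary layout of a region are likewise illustrative assumptions.

```python
def filter_by_objectness(candidates, estimate_overlap, min_objectness=0.3):
    """Keep candidates whose estimated object degree reaches min_objectness.

    estimate_overlap is a hypothetical regressor: features -> estimated
    overlap with a true object region. min_objectness is illustrative.
    """
    kept = []
    for region in candidates:
        objectness = estimate_overlap(region["features"])
        if objectness >= min_objectness:
            # Record the estimate alongside the region for later use.
            kept.append(dict(region, objectness=objectness))
    return kept
```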

次に、重要な被写体のみを教示する形態例（４）の方法について説明する。（４）の方法の実現形態の一つの例は以下のようになる。即ち、ユーザーが指定した重要物体と重なる物体候補領域は、重なりの量に応じた分だけ分類木の学習の際に重視して重み付けして学習を行う。そして重なりのないその他の物体候補領域は、通常のまま重みを付けずに学習する。 Next, a method of the embodiment (4) for teaching only an important subject will be described. One example of implementation of the method (4) is as follows. In other words, the object candidate region that overlaps the important object specified by the user is weighted and learned by weighting the amount corresponding to the overlap amount when learning the classification tree. Then, other object candidate areas that do not overlap are learned without weighting as usual.

以降、形態例（４）の方法を詳細に説明する。まず、物体位置を教示しない形態例（３）の方法と同様、全ての物体候補領域を抽出する。さらに、物体度を推定して物体度の低い領域を除く。次に、残った全ての領域について、ユーザーの指示した重要物体とのオーバーラップ値を算出する。次に、オーバーラップ値に基づいて物体候補領域ｘに以下の数式４のｗ（ｘ）で重みを付けて分類木の学習を行う。 The method of embodiment example (4) is described in detail below. First, as in embodiment example (3), where object positions are not taught, all object candidate regions are extracted. The object degree is then estimated and regions with a low object degree are removed. Next, for all remaining regions, the overlap value with the important objects designated by the user is calculated. Then, based on the overlap value, each object candidate region x is weighted by w(x) of Equation 4 below, and the classification trees are learned.

ただし、ここでＯ（ｘ）は、領域ｘと重要物体とのオーバーラップ度、βは０以上の係数であり、この値が大きければ重要物体をより強く重視して学習を行う。βは交差確認法等で適切な値を設定する。
次に、数式４の情報量を拡張し、学習データの重要性の重みを考慮した情報量基準として、数式３を拡張して以下のように定義する。これを用いて判別器の学習を行えばよい。 Here, O(x) is the degree of overlap between region x and the important object, and β is a coefficient equal to or greater than 0. The larger this value, the more strongly the important objects are emphasized during learning. An appropriate value of β is set by cross-validation or a similar method.
Next, the information amount of Equation 3 is extended into an information amount criterion that takes the importance weights of the learning data into account, defined as follows. The discriminators are then trained using this criterion.

ただし、ｐ_kc（ｗ）はシーンクラスｃの学習データ数の重み付きの割合で、 Here, p_kc(w) is the weighted proportion of the learning data belonging to scene class c,

である。これにより、重要な物体に近い領域をより重視して区別するよう学習することが可能である。なお、上式はβが０のとき、もしくは重要な物体が学習事例の中に存在しない時は、これまでの学習の式と同じ値になる。
以上、ここまで学習フェーズにおける被写体領域の教示方法の派生の形態例（１）〜（４）について述べた。 This makes it possible to learn to discriminate with greater emphasis on regions close to important objects. Note that when β is 0, or when no important object exists among the learning cases, the above expression takes the same value as the earlier learning expression.
As above, the derivation examples (1) to (4) of the teaching method of the subject area in the learning phase have been described.
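The equations for w(x) and p_kc(w) in embodiment example (4) are images in the original publication and are not reproduced in this text. The sketch below therefore rests on two assumptions: that the weight has the form w(x) = 1 + β·O(x), which equals 1 when β is 0 or when there is no overlap, as the text requires, and that p_kc(w) is the weight-normalized share of class c on one side of the split. The function names are hypothetical.

```python
from collections import defaultdict

def weight(overlap, beta):
    """Assumed weight: 1 + beta * O(x); equals 1 when beta or O(x) is 0."""
    return 1.0 + beta * overlap

def weighted_class_proportions(samples, beta):
    """Weighted share of each scene class on one side of a split.

    samples: list of (scene_class, overlap_with_important_object).
    """
    totals = defaultdict(float)
    for scene_class, overlap in samples:
        totals[scene_class] += weight(overlap, beta)
    norm = sum(totals.values())
    return {c: v / norm for c, v in totals.items()}
```

With β = 0 every sample has weight 1, so the result reduces to the unweighted class proportions used in the earlier criterion.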

＜大分類と小分類を用いる派生の形態＞
以降では、本発明に係る認識装置の全体構成の派生の形態について述べる。
ここでは、本実施形態の派生の形態として、既存のシーン分類の方法で一度大まかにシーンを分けておいて、その後に本発明を適用して詳細なシーンを判別する形態について述べる。 <Derived form using major and minor categories>
Hereinafter, a derivation form of the overall configuration of the recognition apparatus according to the present invention will be described.
Here, as a form of derivation of this embodiment, a form in which scenes are roughly divided once by an existing scene classification method, and then a detailed scene is discriminated by applying the present invention will be described.

以降では、大まかなシーンクラスの分類を大分類シーン、その後の詳細なシーンクラスの分類を小分類シーンと呼ぶ。たとえば、大分類シーンはパーティや野外スポーツといった分類であり、小分類シーンとしてはクリスマスパーティや誕生日パーティ、スポーツシーンであればサッカーや野球、などのクラスを考える。 Hereinafter, the rough scene class classification is referred to as a large classification scene, and the detailed scene class classification thereafter is referred to as a small classification scene. For example, a major classification scene is a classification such as a party or outdoor sport, and a class such as a Christmas party or a birthday party is considered as a minor classification scene, and soccer or baseball is considered as a sports scene.

大分類シーンを判別する方法としては、例えばＢａｇｏｆＷｏｒｄｓ手法と呼ばれる手法が有効であることが非特許文献６などにおいて公知である。
図１３に認識装置の構成図を示す。
画像入力部１１１には、カメラの画像が入力される。大分類判定部１１２は、非特許文献６のＢａｇｏｆＷｏｒｄｓ手法により、パーティシーン、スポーツシーン、風景画像シーン、その他さまざまなシーンを分類する。 As a method for discriminating a large classification scene, for example, Non-Patent Document 6 discloses that a method called a Bag of Words method is effective.
FIG. 13 shows a configuration diagram of the recognition apparatus.
The image input unit 111 receives a camera image. The large classification determination unit 112 classifies party scenes, sports scenes, landscape image scenes, and other various scenes using the Bag of Words method of Non-Patent Document 6.

小分類判定部１１３ａ〜１１３ｃは大分類判定部１１２での判別に基づいていずれか一つかが選ばれて以降の処理が行われる。
小分類判定部１１３ａ〜１１３ｃがそれぞれ備える属性判定部は大分類ごとに異なった学習データで学習した異なったパラメータ（判定辞書）を備える。 One of the small classification determination units 113a to 113c is selected based on the determination in the large classification determination unit 112, and the subsequent processing is performed.
The attribute determination unit provided in each of the small classification determination units 113a to 113c includes different parameters (determination dictionaries) learned with different learning data for each large classification.

これは例えば、パーティのシーンを分析する小分類判定部１１３ａであれば、様々な種類のパーティの画像のみを学習データとして与えて学習することを意味する。また、候補領域抽出部と特徴量抽出部についても、小分類のシーンを見分けるのに適した候補領域の抽出の基準や特徴量の種別を定めてよい。 For example, in the case of the small classification determination unit 113a that analyzes a party scene, it means that learning is performed by providing only various types of party images as learning data. The candidate region extraction unit and the feature amount extraction unit may also determine the criteria for extracting candidate regions and the types of feature amounts that are suitable for distinguishing small classification scenes.

これらの適切な基準や特徴量はユーザーが手で調整してもよいが、複数のパラメータの組み合わせを与えて最も精度の高くなるパラメータを交差確認法等で探索してもよい。これにより、大分類ごとに特化した、より正確なシーンクラス判別が行われる。以上が本実施形態の派生の形態の説明である。このようにして、本発明は既存手法と組み合わせるような形態の適応も可能であることが示された。 These criteria and feature amounts may be tuned manually by the user, or several combinations of parameters may be supplied and the combination giving the highest accuracy found by cross-validation or a similar method. This yields more accurate scene class discrimination specialized for each major classification. This concludes the description of this derivative form of the embodiment. It shows that the present invention can also be applied in combination with existing methods.
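The two-stage arrangement of the major classification determination unit 112 and the minor classification determination units 113a to 113c can be sketched as a dispatch. The classifiers here are plain callables supplied by the caller; their names and the fallback behaviour when no minor classifier is registered are assumptions for illustration.

```python
def classify_scene(image, coarse_classifier, fine_classifiers):
    """Pick the minor-class classifier that matches the major scene class.

    coarse_classifier: image -> major class label (e.g. "party").
    fine_classifiers: dict mapping a major class to a callable
    image -> minor class label, each trained on its own data.
    """
    coarse = coarse_classifier(image)
    fine = fine_classifiers.get(coarse)
    # Return None as the minor class when no classifier is registered.
    return (coarse, fine(image) if fine else None)
```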

以上で第１の実施形態の説明を終える。
前述したように、第１の実施形態の認識装置によれば、画像中の物体を分析して詳細に画像のシーンを判別することができる。また、各シーンにおいてどのような被写体を識別するべきか予め教示することなくシーン判別の学習を行うことができる。
また、判別対象のシーン数が増えても、認識処理時間が著しく増加しない。また、被写体の有無のみならず被写体のバリエーションの差異に基づいて類似のシーンを判別することも可能である。 This is the end of the description of the first embodiment.
As described above, according to the recognition apparatus of the first embodiment, it is possible to analyze an object in an image and discriminate a scene of the image in detail. In addition, scene discrimination learning can be performed without previously teaching what kind of subject should be identified in each scene.
Even if the number of scenes to be discriminated increases, the recognition processing time does not increase significantly. It is also possible to discriminate similar scenes based not only on the presence / absence of a subject but also on the difference in subject variations.

［第２の実施形態］
第２の実施形態は、第１の実施の形態を拡張した形態である。拡張であるので、ここでは相違点のみに絞って簡便に説明する。
第１の実施形態では、物体の領域を手掛かりとして、これを分類することでシーンクラスを推定した。この場合、物体が存在しないような画像シーン、例えば海や山といった風景のみのシーンでは判別の手掛かりが存在しない。 [Second Embodiment]
The second embodiment is an expanded form of the first embodiment. Since this is an extension, only a difference will be briefly described here.
In the first embodiment, the scene class is estimated by classifying an object region as a clue. In this case, there is no clue for discrimination in an image scene in which no object exists, for example, a scene with only a landscape such as the sea or a mountain.

そこで、本第２の実施形態では、物体以外のタイプの被写体領域も分析の対象として採用し、それぞれのタイプの候補領域を抽出し、それぞれ異なった属性判別器でどのシーンクラスに関連するかを判定する。このように、複数のタイプの被写体を用いて多面的にシーンクラスの解析を行うことにより、物体領域のみを手掛かりにした場合よりもシーンクラスの判別が頑健になることが期待される。 Therefore, in the second embodiment, a subject area of a type other than an object is also used as an analysis target, candidate areas of each type are extracted, and which scene class is associated with different attribute classifiers. judge. As described above, it is expected that the scene class determination is more robust than the case where only the object region is a clue by analyzing the scene class in a multifaceted manner using a plurality of types of subjects.

図１４に、本実施形態の基本構成図を示す。第１の実施形態の構成図と本構成図の異なる点は、タイプの異なる３種類の被写体の候補領域の抽出部、特徴量抽出部および属性判定部を備えることである。具体的には、本認識装置は物体の候補領域を抽出する物体候補領域抽出部１２２ａ、人体の領域を抽出する人体候補領域抽出部１２２ｂ、空や山や床といった背景の領域を抽出する背景候補領域抽出部１２２ｃ、を備える。 FIG. 14 shows a basic configuration diagram of the present embodiment. The difference between the configuration diagram of the first embodiment and this configuration diagram is that it includes an extraction unit, a feature amount extraction unit, and an attribute determination unit for candidate regions of three different types of subjects. Specifically, the recognition apparatus includes an object candidate region extraction unit 122a that extracts a candidate region of an object, a human body candidate region extraction unit 122b that extracts a human body region, and a background candidate that extracts a background region such as a sky, a mountain, or a floor. A region extraction unit 122c.

物体候補領域抽出部１２２ａ〜１２２ｃでは、第１の実施形態と同様に物体の抽出を行う。人体候補領域抽出部では、非特許文献７のような公知の方法を用いて人体の候補領域を抽出する。背景候補領域抽出部では、色もしくはテクスチャの均一性が高く、面積の広い領域を抽出して背景領域とする。具体的には、非特許文献４のようなテクスチャ分析の方法で画像を複数領域に分割してそれぞれを背景候補領域として抽出する。 The object candidate area extraction units 122a to 122c extract objects in the same manner as in the first embodiment. The human body candidate region extraction unit extracts a human body candidate region using a known method such as Non-Patent Document 7. The background candidate region extraction unit extracts a region having a high color or texture uniformity and a large area as a background region. Specifically, the image is divided into a plurality of regions by a texture analysis method as in Non-Patent Document 4, and each is extracted as a background candidate region.

特徴量抽出部１２３ａ〜１２３ｃでは、被写体のタイプにあった特徴量を抽出する。ここでは、物体の特徴量抽出部１２３ａおよび背景領域の特徴量抽出部１２３ｃは第１の実施形態で述べたものと同じ特徴量を抽出する。人体領域の特徴量抽出部１２３ｂでは服装や頭部の被り物の有無等を重点的に見るために、検出された人体領域を頭部、上半身、下半身、の３か所のパーツに分けて、それぞれから色ＳＩＦＴ特徴量を抽出して連結し、特徴ベクトルとする。 The feature amount extraction units 123a to 123c extract feature amounts that match the type of subject. Here, the feature amount extraction unit 123a of the object and the feature amount extraction unit 123c of the background region extract the same feature amounts as described in the first embodiment. The human body region feature amount extraction unit 123b divides the detected human body region into three parts, the head, upper body, and lower body, in order to focus on the presence or absence of clothes and head coverings. The color SIFT feature values are extracted from these and connected to form a feature vector.

それぞれの属性判定部１２４ａ〜１２４ｃは、このように、タイプの異なる被写体に特化して学習を行う。学習の仕方は第１の実施形態の方法と同様であるが、ここで、多少バリエーションを加える。 As described above, each of the attribute determination units 124a to 124c performs learning specialized for subjects of different types. The learning method is the same as the method of the first embodiment, but here, some variation is added.

それは、背景領域に関しては、例えば一様な青空の領域からは屋外であることは分かっても、それがキャンプのシーンなのか野球のシーンなのかは知り得ない、というようなことが起こりやすいことである。そのため、第１の実施形態の方法で背景属性判定部１２４ｃを学習すると過学習を起こし易くなる可能性がある。 As for the background area, it is easy to happen that, for example, it can be known that it is outdoor from a uniform blue sky area, but it is not possible to know whether it is a camp scene or a baseball scene. It is. Therefore, when the background attribute determination unit 124c is learned by the method of the first embodiment, there is a possibility that overlearning is likely to occur.

そこで、ここでは、以下のような工夫を行う。例えば一つの工夫としては、分類木の深さが所定数以上になったら学習を早期に打ち切って過学習を防ぐことである。また別の工夫としては、各ノードの判別器を訓練する際に情報量基準を用いずに、データを単にランダムに分割するだけの判別器を採用することである。 Therefore, the following measures are taken here. For example, as one idea, when the depth of the classification tree exceeds a predetermined number, learning is stopped early to prevent over-learning. Another idea is to employ a discriminator that simply divides data randomly without using an information criterion when training the discriminator of each node.

これは、各分類木の判別器の判別関数の係数をランダムな値にすることで実現される。この判別器はハッシュ法、ランダム射影、類似事例データの探索、近似最近傍探索といわれる公知の機械学習の方法と数理的に同種の操作である。即ち、本発明の属性判定部の判別器には、アンサンブル分類木に限らず、上に挙げたような様々な機械学習の方法が適用可能である。 This is realized by setting the coefficient of the discriminant function of each classifier to a random value. This discriminator is a mathematically similar operation to a known machine learning method called a hash method, random projection, search for similar case data, and approximate nearest neighbor search. That is, the classifier of the attribute determination unit of the present invention is not limited to the ensemble classification tree, and various machine learning methods as described above can be applied.
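A node discriminator with random coefficients, as described above, might look like the following sketch. Drawing the coefficients and a bias term from a standard normal distribution is an assumption; the text only says the coefficients of the discriminant function are set to random values.

```python
import random

def make_random_split(dim, rng):
    """Return a split function whose linear coefficients are random."""
    coeffs = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    bias = rng.gauss(0.0, 1.0)

    def go_right(features):
        # Sign of a random linear projection decides the branch.
        return sum(c * f for c, f in zip(coeffs, features)) + bias >= 0.0

    return go_right
```

Because no labels are used to fit the split, a tree built this way acts like a random-projection hash of the feature space rather than a trained classifier.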

次に、判定結果統合部１２５では、属性判定部１２４ａ〜１２４ｃの投票結果を統合する。このとき、被写体のタイプによって得られる判定の信頼性に差があるのでそのまま加算せずに重みＷ＝［ｗ₁，ｗ₂，ｗ₃］^Tを定義して重み付和を求めて最終的に画像の属性を同定して最終の判定結果とする。ただし、Ｗは複数の候補値から交差確認法で決定するか、平均的に判別精度が最良となる値を最小二乗法で解いて決定するとよい。また単純に重み付和するのではなく、上記の投票結果を入力とし、画像の属性が出力されるようにサポートベクトルマシンなどの判別器を用いて学習するような形態でもよい。 Next, the determination result integration unit 125 integrates the voting results of the attribute determination units 124a to 124c. Because the reliability of the determination differs by subject type, the votes are not simply added. Instead, a weight vector W = [w₁, w₂, w₃]^T is defined, a weighted sum is computed, and the image attribute identified from it becomes the final determination result. W may be chosen from several candidate values by cross-validation, or determined by solving, with the least squares method, for the value that gives the best average discrimination accuracy. Alternatively, instead of a simple weighted sum, a discriminator such as a support vector machine may be trained to take the voting results as input and output the image attribute.
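The weighted-sum integration can be illustrated as follows. The representation of each unit's voting result as a dictionary from scene class to score, and the function name, are assumptions made for this sketch.

```python
def integrate_votes(votes_per_type, weights):
    """Weighted sum of per-type voting results.

    votes_per_type: list of {scene_class: vote_score}, one per subject
    type (object, human body, background); weights: W = [w1, w2, w3].
    """
    combined = {}
    for votes, w in zip(votes_per_type, weights):
        for scene_class, score in votes.items():
            combined[scene_class] = combined.get(scene_class, 0.0) + w * score
    best = max(combined, key=combined.get)
    return best, combined
```

For example, with weights `[1.0, 0.5, 0.1]` a confident object vote outweighs an ambiguous background vote, which reflects the differing reliability of the subject types described above.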

以上が複数のタイプの被写体を用いてシーンクラスの判別を行う第２の実施形態の説明になる。 The above is the description of the second embodiment in which the scene class is determined using a plurality of types of subjects.

［第３の実施形態］
第３の実施形態の認識装置は、動画を用いて動画のシーンクラスを判別することを目的とするものである。本実施形態では特に監視カメラ画像の正常状態・異常状態について判別を行う形態について述べる。監視カメラ画像の異常状態としては複数のタイプが考えられるが、ここでは群衆の異常行動を判別するような認識装置について述べる。 [Third Embodiment]
The recognition apparatus according to the third embodiment is intended to determine a moving image scene class using a moving image. In the present embodiment, a mode for determining whether the monitoring camera image is normal or abnormal will be described. Although there are a plurality of types of abnormal states of surveillance camera images, here, a recognition device that discriminates abnormal behavior of a crowd will be described.

図１５に、本実施形態の認識装置の基本構成図を示す。
画像入力部１３１には、監視カメラの動画像が入力される。人体候補領域抽出部１３２ａは、第２の実施形態と同様にして非特許文献７等の人体検出器の手法によって人体を検出する部位である。ここで抽出された人体は、人体特徴量抽出部１３３ａにて人体候補領域中の見えの特徴量を算出する。 FIG. 15 shows a basic configuration diagram of the recognition apparatus of the present embodiment.
A moving image from a surveillance camera is input to the image input unit 131. The human body candidate region extraction unit 132a detects human bodies using a human body detector technique such as that of Non-Patent Document 7, as in the second embodiment. For each human body extracted here, the human body feature amount extraction unit 133a calculates the appearance feature amount within the human body candidate region.

また、前後の動画フレームを参照して人体の移動方向と移動速度、および動き特徴ベクトルも抽出して特徴量に加える。ここで見えの特徴とは第１の実施形態と同様に色ＳＩＦＴ特徴のヒストグラム、動き特徴ベクトルとしては例えばＣＨＬＡＣと呼ばれる非特許文献８に開示されているような公知の方法の特徴量を抽出する。 In addition, the moving direction and moving speed of the human body, as well as a motion feature vector, are extracted by referring to the preceding and following video frames and added to the feature amount. Here, the appearance feature is a histogram of color SIFT features, as in the first embodiment, and the motion feature vector is extracted by a known method such as CHLAC, disclosed in Non-Patent Document 8.

Next, the human body attribute determination unit 134a evaluates the feature amount of each human body region with a discriminator composed of an ensemble of classification trees and, as in the first embodiment, obtains a likelihood score indicating whether each human body candidate region belongs to an abnormal scene or a normal scene. As in the first embodiment, the human body attribute determination unit 134a is trained in advance: human bodies are extracted by the human body candidate region extraction unit 132a from training data consisting of moving images of abnormal behavior and moving images of normal scenes, and the feature amounts of both groups are learned by the classification trees. As a result, human body regions exhibiting abnormal behavior, such as a person swinging a weapon, a person fleeing, or people fighting, can be judged to have a high degree of abnormality.

Note that a human body detection method such as that of Non-Patent Document 7 detects unoccluded standing bodies with high accuracy, but has difficulty detecting a person whose head is only partly visible in a crowd, a person who appears small in the image, or a person in a posture other than standing. FIG. 16(a) shows an example scene from a moving image of a crowd. As illustrated by the black rectangular frames in FIG. 16(b), under such conditions often only a few people are detected as human bodies.

To address this problem, the recognition apparatus extracts not only individual human bodies but also, by means of the crowd candidate region extraction unit 132b, candidate regions judged likely to contain a crowd. It then determines whether each such region is a crowd in a normal scene or a crowd in an abnormal scene, using the feature amount of the region as a clue.

The crowd candidate regions are extracted using the CHLAC feature introduced above as the motion feature vector (see Non-Patent Document 8 for details). The processing proceeds as follows.
First, training data is prepared in advance: videos of various crowds, and videos that contain no crowd but do contain various kinds of motion. These videos are divided into space-time blocks of a fixed size, such as 16 pixels × 16 pixels × 16 frames, and a CHLAC feature vector is extracted from each block.

Because the CHLAC feature is 251-dimensional, this yields two groups of samples in a 251-dimensional space, with as many samples as there are blocks. Linear discriminant analysis, a standard discrimination technique, is then applied to the two groups to obtain the projection vector onto the one-dimensional basis that best separates them.

Next, the operation of the crowd candidate region extraction unit 132b at recognition time is described. When an input video is received, it is divided into blocks of the same size as during training (the blocks may overlap one another). The CHLAC feature is extracted from each block and projected onto the one-dimensional basis obtained earlier by linear discriminant analysis, giving a value on that basis.

This value serves as the crowd likelihood of each block, so the blocks are thresholded at a predetermined value to extract the crowd candidate regions. FIG. 16(c) shows an example result with thick black frames. The subsequent processing follows the earlier embodiments: the crowd feature amount extraction unit 133b extracts appearance and motion features, and the crowd attribute determination unit 134b discriminates abnormal from normal to obtain a discrimination score.
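The training and recognition steps of the crowd candidate region extraction described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: it assumes the CHLAC vectors have already been extracted, uses a plain two-class Fisher discriminant with a small regularization term, and the function names, the `reg` parameter, and the choice of threshold are all hypothetical.

```python
import numpy as np

def fit_lda_direction(crowd_feats, other_feats, reg=1e-6):
    """Two-class Fisher LDA: return a unit projection vector w.

    crowd_feats, other_feats: arrays of shape (n_blocks, n_dims), e.g. one
    251-dimensional CHLAC vector per space-time block (extraction not shown).
    reg is a hypothetical regularizer that keeps the scatter matrix invertible.
    """
    mu_c = crowd_feats.mean(axis=0)
    mu_o = other_feats.mean(axis=0)
    # Within-class scatter = sum of the two per-class scatter matrices.
    sw = np.cov(crowd_feats, rowvar=False) * (len(crowd_feats) - 1)
    sw += np.cov(other_feats, rowvar=False) * (len(other_feats) - 1)
    sw += reg * np.eye(sw.shape[0])
    w = np.linalg.solve(sw, mu_c - mu_o)
    return w / np.linalg.norm(w)

def crowd_candidate_blocks(block_feats, w, threshold):
    """Project each block onto w and keep blocks whose score exceeds the threshold."""
    scores = block_feats @ w
    return scores, scores > threshold
```

With this sign convention (w points from the non-crowd mean toward the crowd mean), larger scores mean "more crowd-like". The threshold itself would have to be tuned on validation data; the text only states that it is predetermined.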

Next, the regions targeted by the object candidate region extraction unit 132c are regions of unspecified subjects that may be observed alongside abnormal crowd behavior. Many such subjects are conceivable, for example smoke from a smoke flare, flames from objects set on fire in the street, or debris scattered by acts of destruction.

One feature of the recognition apparatus of this embodiment, obtained by applying the present invention, is that it learns without assuming in advance which specific subjects are relevant to discriminating the scene class. The object candidate region extraction unit 132c is therefore configured so that a wide variety of object candidate regions can be extracted. Specifically, coherent regions with similar motion and appearance features are extracted as object candidate regions.

The object candidate regions are extracted by an extension of the Super-pixel based method (Super-pixel is abbreviated SP below) described with reference to FIG. 3 in the first embodiment. It differs from the method of FIG. 3 in two respects. First, when SPs are merged, the similarity uses motion features as well as the appearance features of the pixels. Second, SP regions without motion are excluded from the candidates at the outset. The details follow.

First, for one frame of the moving image, the optical flow, a standard technique in video analysis, is computed for every pixel. SPs are then generated, and any SP whose average optical flow within its region is at or below a fixed amount is discarded.

Next, similar adjacent SPs are merged as in steps S302 to S306 of FIG. 3. The similarity used here is computed on a vector that concatenates the RGB color distribution of the SP and the distribution of its optical flow orientations. In this way, coherent regions with similar appearance and motion can be extracted as object candidate regions.
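The two modifications to the SP method, discarding static SPs and comparing SPs on a concatenated color and flow-orientation descriptor, can be sketched as follows. The sketch assumes that a superpixel label map and a per-pixel optical flow field are already available from other tools. The bin counts and the use of histogram intersection as the similarity are assumptions; the text does not specify them.

```python
import numpy as np

def moving_superpixels(labels, flow, min_mean_flow):
    """Return ids of superpixels whose mean flow magnitude exceeds the threshold.

    labels: (H, W) integer superpixel id per pixel.
    flow:   (H, W, 2) per-pixel optical flow (dx, dy).
    """
    magnitude = np.linalg.norm(flow, axis=2)
    ids = np.unique(labels)
    return [int(i) for i in ids if magnitude[labels == i].mean() > min_mean_flow]

def sp_descriptor(image, flow, mask, color_bins=4, angle_bins=8):
    """Concatenate an RGB histogram and a flow-orientation histogram for one SP.

    image: (H, W, 3) with values in 0..255; mask: (H, W) boolean SP mask.
    Bin counts are hypothetical defaults.
    """
    pixels = image[mask].astype(float)  # (n, 3)
    color_hist, _ = np.histogramdd(
        pixels, bins=(color_bins,) * 3, range=((0, 256),) * 3
    )
    vectors = flow[mask]  # (n, 2)
    angles = np.arctan2(vectors[:, 1], vectors[:, 0])
    angle_hist, _ = np.histogram(angles, bins=angle_bins, range=(-np.pi, np.pi))
    n = mask.sum()
    # Each half is normalized to sum to 1 so that SP size does not dominate.
    return np.concatenate([color_hist.ravel() / n, angle_hist / n])

def similarity(desc_a, desc_b):
    """Histogram intersection (an assumed measure); identical descriptors give 2.0."""
    return float(np.minimum(desc_a, desc_b).sum())
```

A merging loop like that of FIG. 3 would call `similarity` on every pair of adjacent SPs that survived `moving_superpixels` and merge the most similar pair repeatedly. Zero-flow pixels fall into the orientation bin containing angle 0, which is one reason to remove static SPs first.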

FIG. 16(d) shows an example of the output of the object candidate region extraction unit 132c with thick black frames. Here, regions of flame and smoke and part of the crowd are extracted as object candidate regions.
Note that the extraction of object candidate regions is not limited to the method described here; any method that extracts coherent regions of similar appearance or motion is applicable.
The subsequent object feature amount extraction unit 133c and object attribute determination unit 134c operate in the same way as their counterparts for the crowd regions, so a detailed description is omitted.

In this way, using multiple types of candidate regions as clues, each candidate region yields a score indicating whether or not the input video shows a normal crowd scene.
Next, as in the second embodiment, the determination result integration unit 135 votes the discrimination scores and aggregates them for each subject type. The resulting scores are then combined by a weighted sum. Furthermore, because per-frame results can be unstable, the results are averaged with a moving average over several preceding and following frames to give the final result.

Besides the CHLAC feature, there are various other motion features, such as methods using histograms of spatio-temporal gradients and methods using hidden Markov models; one suited to the subject type may be chosen.

As a variation of the embodiment above, if each attribute determination unit is trained on, for example, video data of fires and video data of normal scenes so as to learn two-class scene discrimination, a recognition apparatus with a fire detection function can be realized. Alternatively, training can use teacher labels for three scene classes (fire, abnormal behavior, and normal) so that the classes are discriminated. The present invention is thus applicable to a variety of real-world problems.

Although this embodiment uses moving images, a variation that additionally uses distance images to improve accuracy as a surveillance camera is also conceivable. In that case, feature amounts are extracted from both the luminance image and the distance image, concatenated, and passed to the discriminator.
This concludes the description of the third embodiment, which discriminates abnormal states of crowds and scenes from moving images.

[Fourth Embodiment]
The fourth embodiment takes a still image as input and determines the class of its composition. In addition, this embodiment estimates the region of the main subject of the image at the same time as it determines the composition class. This embodiment shows that the present invention can not only discriminate single-variable information such as a scene class but, with suitable adaptation, can also be used to estimate a complex image attribute such as the main subject region.

Many kinds of image composition class have been proposed in general; known types include the "Hinomaru (centered) composition", the "rule-of-thirds composition" (an approximation of the golden ratio), the "diagonal composition", and the "triangular composition".

If the photographic composition can be estimated automatically at shooting time, camera parameters such as focus position and exposure can be set to values suited to that composition. Composition can also be made easier to correct, for example by showing the user guide lines for a composition that suits the subject.

Likewise, if the region of the main subject is known, focus and exposure can be controlled appropriately for the main subject. Main subject information is also extremely important in secondary uses of images, such as organizing images and creating highlights.

Conventional main subject recognition methods, however, have several problems. For example, a method that decides on the basis of saliency, such as color contrast, may wrongly recognize a semantically unimportant region as the main subject if its color or luminance contrast with its surroundings is strong.

This can happen, for example, in an indoor photograph where bright light falls only on a corner of the room, or in a street photograph where a sliver of white sky is visible through the gap between two houses.

Other main subject recognition methods determine the main subject using object detection such as human body detection or face detection, but such methods cannot handle the wide variety of unspecified objects.

This embodiment shows that these image attributes can be determined automatically by extracting the subjects that make up the image, classifying them in the attribute determination units to estimate a composition class and a main subject region for each subject, and integrating the estimates.

FIG. 18(a) shows the seven image composition classes that this embodiment discriminates. Each training image is assigned in advance a teacher label indicating which of the seven classes it belongs to. The main subject region of each image is also given as a teacher label in the form of a binary image.

FIG. 17 shows the basic configuration of this embodiment, which follows that of the second embodiment. One difference from the second embodiment is that the determination result integration unit 145 comprises two functional parts: an image composition determination integration unit 145a and a main subject region determination integration unit 145b.

The description below focuses on the differences from the second embodiment.
In this embodiment, three types of subject are extracted as candidate regions: objects, line segments, and human bodies. The major difference is the new use, as a subject, of line segment information, which is expected to bear directly on composition estimation.

To extract line segment candidate regions, line segments are extracted from the input image by the Hough transform, and only those whose edge strength and length are at or above predetermined values are kept as candidate line segments.
FIG. 18(b) shows an example of extracted line segment candidate regions, labeled 1402b. The line segment feature amount extraction unit 143b then extracts color SIFT features from a predetermined neighborhood of each candidate line segment and uses them as the appearance feature of that segment. The centroid position, length, inclination, and so on of the candidate line segment are also calculated, and everything is concatenated into the feature amount.

For the other candidate regions (object and human body), feature amounts are extracted in the same way as in the second embodiment. In addition to the features used there, however, the centroid position of each region and the second-order moments of the region shape, which are considered important for determining composition, are also extracted and added. Furthermore, because high-contrast regions and in-focus regions strongly influence composition, features related to contrast and focus are extracted as well, such as the difference in color contrast between the inside and outside of the region and the ratio of the amount of edges inside and outside the region.
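As an illustration of the geometric part of these features, the centroid and second-order moments of a candidate region can be computed from its binary mask as below. Normalizing the coordinates by image width and height is an assumption made here so that the values are comparable across image sizes; the contrast and focus features are not shown.

```python
import numpy as np

def region_shape_features(mask):
    """Centroid and central second moments of a binary region mask.

    mask: (H, W) boolean or 0/1 array with at least one nonzero pixel.
    Coordinates are divided by image width and height (an assumption).
    Returns [cx, cy, mu_xx, mu_yy, mu_xy].
    """
    h, w = mask.shape
    ys, xs = np.nonzero(mask)
    x = xs / w
    y = ys / h
    cx, cy = x.mean(), y.mean()
    dx, dy = x - cx, y - cy
    return np.array(
        [cx, cy, (dx * dx).mean(), (dy * dy).mean(), (dx * dy).mean()]
    )
```

The vector returned here would simply be concatenated with the appearance features of the region before it is passed to the classification trees.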

In the attribute determination units 144a to 144c for the respective subjects, ensembles of classification trees have been trained in advance by associating the feature amount of each subject with the composition class of the image it belongs to. This parallels the second embodiment, where the feature amount of each subject candidate region was associated with the scene class for training and discrimination. If the features of each subject, such as position, size, and boundary strength, each carry even a little information that helps discriminate the composition class, then integrating them allows the composition class to be determined correctly, just as for the scene class.

FIG. 18(b) shows a schematic example of composition class estimation with such ensembles of classification trees. A composition class is estimated for each of the extracted subject regions 1402a to 1402c, and the results of voting them are shown in the voting spaces 1403a to 1403c. These are combined by a weighted sum to obtain the final result 1404. In the illustrated example, the rule-of-thirds composition is output as the final answer.
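The voting and weighted integration of FIG. 18(b) can be sketched as follows, assuming each candidate region has already been passed through the trees and has returned the composition class ratios stored at its leaf. Normalizing each voting space before weighting, and the weight values themselves, are assumptions for illustration.

```python
import numpy as np

def integrate_class_votes(votes_by_type, weights, n_classes=7):
    """Combine per-candidate class ratios into one composition class decision.

    votes_by_type: dict mapping a subject type to a list of class-ratio
        vectors (one per candidate region, each of length n_classes).
    weights: dict with one (hypothetical) weight per subject type.
    Returns (final_scores, index_of_best_class).
    """
    final = np.zeros(n_classes)
    for subject_type, leaf_ratios in votes_by_type.items():
        if len(leaf_ratios) == 0:
            continue  # no candidate of this type in the image
        space = np.sum(leaf_ratios, axis=0)  # voting space for this type
        final += weights[subject_type] * space / space.sum()
    return final, int(np.argmax(final))
```

Normalizing each voting space keeps a subject type with many candidate regions from outvoting the others purely by count; whether that is desirable depends on the data.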

Next, the method of estimating the main subject region is described. To estimate the main subject region together with the composition class, this embodiment adds the following device when training the ensembles of classification trees on composition classes: each leaf node of each classifier stores not only the proportions of the composition classes of the training data but also information on the main subject regions of the training data. Data of this kind, other than the target variable of learning, is called metadata.

For the main subject region information held as metadata, each region image is first normalized to an aspect ratio of 1:1. Next, the main subject region images within a leaf node are averaged, and the result is taken as the prior distribution of the main subject at that leaf node. This is further approximated by a Gaussian distribution, and only the Gaussian parameters are stored. The prior distribution is approximated rather than used directly because, as the classification trees grow, the storage required and the time taken by the voting computation become enormous. If that is not a concern, the distribution itself may be stored and used without approximation. A Gaussian mixture or similar may also be used for the approximation instead of a single Gaussian.
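The construction of the leaf-node metadata can be sketched as below: the binary main subject masks of the training samples that reach one leaf are resampled to a square grid, averaged, and summarized by the mean and covariance of a single two-dimensional Gaussian. The grid size and the nearest-neighbour resampling are assumptions chosen to keep the sketch dependency-free.

```python
import numpy as np

def resize_to_square(mask, size):
    """Nearest-neighbour resampling of a 2-D mask to size x size (1:1 aspect)."""
    h, w = mask.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return mask[np.ix_(rows, cols)].astype(float)

def leaf_gaussian_prior(masks, size=32):
    """Average the masks that reached one leaf and fit a single 2-D Gaussian.

    masks: list of binary main-subject masks (any sizes), not all empty.
    Returns (mean, cov) in normalized [0, 1] x [0, 1] coordinates, ordered (x, y).
    """
    prior = np.mean([resize_to_square(m, size) for m in masks], axis=0)
    ys, xs = np.mgrid[0:size, 0:size]
    # Pixel centers in normalized coordinates.
    coords = np.stack([(xs + 0.5) / size, (ys + 0.5) / size], axis=-1).reshape(-1, 2)
    weights = prior.ravel() / prior.sum()
    mean = weights @ coords
    centered = coords - mean
    cov = (centered * weights[:, None]).T @ centered
    return mean, cov
```

Storing only `mean` and `cov` costs five numbers per leaf (the covariance is symmetric), compared with `size * size` values for the averaged map, which is the storage saving the text refers to.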

FIG. 19(a) shows an example of the main subject region metadata obtained in this way. Label 1412 marks the prior distribution of the main subject region (approximated by a Gaussian distribution) stored in the leaf node that is reached when the sky region 1411, extracted as an object candidate region, is classified by a classification tree.

In this example, when an image has a uniform region in its upper part, like the sky region 1411, the main subject region is usually not in the upper part; accordingly, FIG. 19(a) indicates that the main subject region is most likely located around the center of the image.

FIG. 19(b) shows a schematic example of the result of looking up the prior distribution metadata of the main subject region for each candidate subject and voting it. The voting results from the respective subjects are labeled 1413a to 1413c, and the final estimate after weighted integration is labeled 1414. The figure shows that the neighborhood of the main subject (a person) is estimated reasonably accurately.
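The voting of FIG. 19(b) can be illustrated by rendering each retrieved Gaussian prior onto a common grid and accumulating the maps with per-type weights. In this sketch the Gaussians are left unnormalized (peak value 1), the result is rescaled to a maximum of 1, and the weights are arbitrary; these are assumptions, not details given in the text.

```python
import numpy as np

def gaussian_map(mean, cov, size):
    """Evaluate an unnormalized 2-D Gaussian on a size x size grid over [0, 1]^2.

    mean is (x, y) and cov is a 2 x 2 matrix, both in normalized coordinates.
    """
    ys, xs = np.mgrid[0:size, 0:size]
    coords = np.stack([(xs + 0.5) / size, (ys + 0.5) / size], axis=-1)
    diff = coords - np.asarray(mean, dtype=float)
    inv = np.linalg.inv(np.asarray(cov, dtype=float))
    mahal = np.einsum("...i,ij,...j->...", diff, inv, diff)
    return np.exp(-0.5 * mahal)

def vote_main_subject(votes_by_type, weights, size=32):
    """Sum the leaf priors voted by each candidate, weighted per subject type.

    votes_by_type: dict mapping a subject type to a list of (mean, cov) pairs,
        one per candidate region of that type.
    weights: dict with one (hypothetical) weight per subject type.
    Returns a (size, size) map scaled so that its maximum is 1.
    """
    total = np.zeros((size, size))
    for subject_type, priors in votes_by_type.items():
        for mean, cov in priors:
            total += weights[subject_type] * gaussian_map(mean, cov, size)
    peak = total.max()
    return total / peak if peak > 0 else total
```

Thresholding the returned map, or taking its peak, gives an estimated main subject region in normalized coordinates, which would then be mapped back to the original aspect ratio of the input image.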

This concludes the description of the fourth embodiment, which estimates the composition class and the main subject region of an image. This example of applying the present invention shows that composition can be estimated using the patterns of the subjects in an image as clues. It also shows that, by attaching metadata to the subject classification results, the main subject region can be estimated through learning. This differs substantially from conventional methods that decide by a mechanical criterion such as saliency.

According to the embodiments of the present invention, various image attributes, such as the composition of an image, whether the behavior of a crowd in an image is normal or abnormal, and the main subject, can be determined using the subjects in the image as clues.

(Other Embodiments)
The present invention can also be realized by the following processing: software (a computer program) that implements the functions of the embodiments described above is supplied to a system or apparatus via a network or any of various computer-readable storage media, and a computer (or a CPU, MPU, or the like) of that system or apparatus reads and executes the program.

101 Image input unit
102 Candidate region extraction unit
103 Feature amount extraction unit
104 Attribute determination unit
105 Determination result integration unit

Claims

A recognition apparatus comprising:
a candidate area extraction unit that extracts a plurality of candidate areas of subjects from an input image;
a feature amount extraction unit that extracts a feature amount of the candidate area from each of the plurality of candidate areas;
an attribute determination unit that, for each of the plurality of candidate areas, determines an attribute of the input image from which the candidate area was extracted, based on the feature amount extracted from the candidate area and by referring to a learning result that associates a feature amount extracted from an object region of a learning image with an attribute of the learning image; and
an identification unit that identifies the attribute of the input image by aggregating the determination results of the attribute determination unit for the plurality of candidate areas.

The recognition apparatus according to claim 1, wherein the candidate area extraction unit extracts candidate areas of a plurality of subjects based on a predetermined criterion for candidate areas.

The recognition apparatus according to claim 2, wherein the candidate area extraction unit comprises a plurality of candidate area extraction units based on a plurality of mutually different criteria.

The recognition apparatus according to claim 3, wherein the attribute determination unit comprises a plurality of attribute determination units corresponding to the plurality of different candidate area extraction units.

The recognition apparatus according to claim 2, wherein the candidate area extraction unit extracts the plurality of candidate areas by repeating a process of combining the two adjacent superpixels having the highest similarity among a plurality of superpixels generated by dividing the image.

The recognition apparatus according to claim 5, wherein the candidate area extraction unit extracts, as the plurality of candidate areas, those superpixels generated by repeating the process whose area is larger than a predetermined value.

The recognition apparatus according to claim 1, wherein the candidate area extraction unit includes a human body candidate area extraction unit that extracts a human body candidate area from the image.

The recognition apparatus according to any one of claims 1 to 7, wherein, when learning the attribute of the object region of the learning image, the attribute determination unit performs the learning by classifying object regions so that they are separated according to the attribute of the image.

The recognition apparatus according to claim 8, wherein, when classifying the object regions, the attribute determination unit performs learning so as to preferentially classify regions taught as important object regions.

The recognition apparatus according to claim 1, wherein the attribute determination unit performs learning based on a candidate area of a subject that is not related to the attribute of the image.

The recognition apparatus according to claim 1, wherein the attribute determination unit is configured by a method based on a classification tree.

The recognition apparatus according to claim 1, wherein the attribute determination unit is configured by a method based on a search for similar case data.

The recognition apparatus according to claim 1, wherein the attribute determination unit is configured by a technique based on a hash method.

The recognition apparatus according to any one of claims 1 to 13, wherein the attribute determined by the attribute determination unit is any one of an image scene, crowd behavior in the image, a composition type of the image, information on a main subject of the image, and information on a light source direction of the image.

The recognition apparatus according to claim 1, wherein the image for which the attribute is determined is a moving image.

The recognition apparatus according to claim 1, wherein the image for which the attribute is determined is a distance image.

The recognition apparatus according to any one of claims 1 to 16, wherein, when aggregating the determination results of the attribute determination unit for the plurality of candidate areas, the identification unit excludes the determination results for candidate areas having a high ratio of being determined not to be an object in the learning image.

A recognition method comprising:
a candidate area extraction step of extracting a plurality of candidate areas of subjects from an input image;
a feature amount extraction step of extracting a feature amount of the candidate area from each of the plurality of candidate areas;
an attribute determination step of determining, for each of the plurality of candidate areas, an attribute of the input image from which the candidate area was extracted, based on the feature amount extracted from the candidate area and by referring to a learning result that associates a feature amount extracted from an object region of a learning image with an attribute of the learning image; and
an identification step of identifying the attribute of the input image by aggregating the determination results of the attribute determination step for the plurality of candidate areas.

A program for causing a computer to execute:
a candidate area extraction step of extracting a plurality of candidate areas of subjects from an input image;
a feature amount extraction step of extracting a feature amount of the candidate area from each of the plurality of candidate areas;
an attribute determination step of determining, for each of the plurality of candidate areas, an attribute of the input image from which the candidate area was extracted, based on the feature amount extracted from the candidate area and by referring to a learning result that associates a feature amount extracted from an object region of a learning image with an attribute of the learning image; and
an identification step of identifying the attribute of the input image by aggregating the determination results of the attribute determination step for the plurality of candidate areas.