JP2011081445A

JP2011081445A - Facial expression recognition device, inter-personal feeling estimation device, facial expression recognizing method, inter-personal feeling estimating method, and program

Info

Publication number: JP2011081445A
Application number: JP2009230808A
Authority: JP
Inventors: Shiro Kumano; 史朗熊野; Kazuhiro Otsuka; 和弘大塚; Dan Mikami; 弾三上; Junji Yamato; 淳司大和
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2009-10-02
Filing date: 2009-10-02
Publication date: 2011-04-21
Anticipated expiration: 2029-10-02
Also published as: JP5349238B2

Abstract

PROBLEM TO BE SOLVED: To recognize a person's head posture and facial expression with high accuracy even when the person changes head posture.

SOLUTION: An input unit 1 inputs an input moving image capturing a plurality of persons. Facial expression estimation units 3-1 to 3-N estimate the head posture of each person based on the likelihood of that person's head posture for the input moving image, and estimate the facial expression category of each person based on the likelihood of that person's facial expression category for the input image.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to techniques for recognizing the facial expressions of persons in conversation based on image processing and for estimating emotions between persons, and in particular to a facial expression recognition device, an interpersonal emotion estimation device, a facial expression recognition method, an interpersonal emotion estimation method, and a program that are based on time-series filtering and statistical models.

Face-to-face conversation is a basic form of everyday communication for purposes such as information sharing and decision making. In face-to-face conversation, not only verbal information such as utterance content but also nonverbal information such as gestures, gaze, and prosody is expressed, and psychological research has shown that both are important (Reference 1: M. Argyle: "Bodily Communication -- 2nd ed.", Routledge, London and New York, 1988).

Among emotional expressions in conversation, the smile is considered particularly important. This is because smiles play many important roles, such as smoothing the flow of conversation, expressing positive emotion, and forming and maintaining intimacy (Reference 2: J. N. Cappella: Behavioral and judged coordination in adult informal social interactions: vocal and kinesic indicators, J. Personality and Social Psychology, 72(1):119--131, 1997).

Laughter is generally regarded as visually similar to a smile. A smile is also considered important in that, compared with laughter, it is not accompanied by vocalization and involves a subtler change in facial expression, so it can readily co-occur with a speaker's utterance or with other people's smiles. In conversation, however, there is a more important difference between the two: a smile has a clear directionality, that is, it is basically directed at a specific recipient, whereas laughter has almost no such function.

Several methods for automatically recognizing smiles have been proposed. Some of them use head posture information to improve smile detection accuracy (see, for example, Non-Patent Document 1). There are also studies on recognizing laughter (see, for example, Non-Patent Document 2) and on recognizing whether a person is interested (see, for example, Non-Patent Document 3).

For facial expression recognition, it is desirable to use a camera, which is a non-contact and non-invasive device, that is, to be able to recognize expressions from images. However, recognizing smiles from images poses various technical challenges. First, in a smile the facial parts move only slightly, so the change in appearance in the image is small.

Furthermore, in a conversation among three or more people, the other participants are located in various directions relative to the target person. Unless a camera is attached to the target person's head, the orientation of the face therefore changes whenever the person being looked at changes. Consequently, the assumption made by many existing facial expression recognition methods (see, for example, Non-Patent Document 4), namely that the face in the input image is frontal and upright with respect to the camera, basically does not hold. To recognize facial expressions correctly, the head posture must therefore be estimated at the same time.

In addition, facial expressions differ greatly between individuals. When trying to recognize smiles, unless these individual differences are handled, a smile is likely to be confused with visually similar expressions such as laughter. Many existing facial expression recognition methods prepare a single model from training data of many people's expressions and use it as the model for an unspecified number of persons (see, for example, Non-Patent Document 5).

Patent Document 1 describes a method that prepares a simple model for each individual with the aim of achieving a high recognition rate. In this method, a small number of interest points are fixed around the facial parts on the face, and expressions are recognized by exploiting the fact that the image luminance at those points changes as the expression changes.

Patent Document 1: JP 2009-110426 A

Non-Patent Document 1: J. Cohn, L. Reed, T. Moriyama, J. Xiao, K. Schmidt, and Z. Ambadar: Multimodal coordination of facial action, head rotation, and eye motion during spontaneous smiles, In Proc. IEEE Int'l Conf. Automatic Face and Gesture Recognition, pages 129--135, 2004.
Non-Patent Document 2: S. Petridis and M. Pantic: Audiovisual laughter detection based on temporal features, In Proc. ICMI, pages 37--44, 2008.
Non-Patent Document 3: B. Schuller, R. Muller, B. Hornler, A. Hothker, H. Konosu, and G. Rigoll: Audiovisual recognition of spontaneous interest within conversations, In Proc. ICMI, pages 30--37, 2007.
Non-Patent Document 4: Y. L. Tian, T. Kanade, and J. Cohn: Facial expression analysis, Springer, 2005.
Non-Patent Document 5: M. S. Bartlett, G. Littlewort, M. G. Frank, C. Lainscsek, I. R. Fasel, and J. R. Movellan: Automatic recognition of facial actions in spontaneous expressions, J. Multimedia, 1(6):22--35, 2006.

However, the techniques of Non-Patent Documents 1 to 3 described above cannot automatically recognize at whom a smile is directed.

Moreover, most of the image-based facial expression recognition methods surveyed in Non-Patent Document 4 assume that the face in the input image is frontal, that is, frontal and upright with respect to the camera. In situations where the target person changes head posture, including conversation, a camera would have to be attached to the head of each target person to obtain such images, which is not practical.

The method of Non-Patent Document 5 would be satisfactory if it could produce a model that achieves a high recognition rate for persons not included in the training data, but with current technology it has been reported to suffer from overfitting.

In the method of Patent Document 1, the positions of the interest points are determined without considering changes in facial expression, so points suited to recognizing the target expressions are not necessarily selected. As a result, visually similar expressions such as smiles and laughter are again likely to be confused. Moreover, every one of the facial expression recognition methods of Patent Document 1 and Non-Patent Documents 1 to 5 basically recognizes only what the expression of a single person is; no method has so far existed that recognizes at whom each expression shown during a conversation is directed.

The present invention has been made in view of these circumstances. Its object is to provide a facial expression recognition device, an interpersonal emotion estimation device, a facial expression recognition method, an interpersonal emotion estimation method, and a program that can recognize a person's facial expression together with the person's head posture with high accuracy, even when the person changes head posture.

To solve the above problem, the present invention is a facial expression recognition device comprising: input means for capturing a person and inputting the result as an input moving image; and facial expression estimation means for estimating the head posture of the person based on the likelihood of the head posture for the input moving image input by the input means and outputting it as a head posture estimate, and for estimating the facial expression category of the person based on the likelihood of the facial expression category for the input image input by the input means and outputting it as a facial expression category estimate.

In the above invention, the facial expression estimation means calculates a first likelihood based on an interest point set that is predetermined for recognizing the facial expression category and is placed in a predetermined region of the input moving image, together with its luminance distribution model; calculates a second likelihood based on an interest point set that is predetermined for estimating the head posture and is placed in a predetermined region of the input moving image, together with its luminance distribution model; calculates the product of the first likelihood and the second likelihood as the joint likelihood of the facial expression category and the head posture; and estimates and outputs the facial expression category estimate and the head posture estimate based on the calculated joint likelihood.
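The product rule in the preceding paragraph can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the two likelihoods have already been evaluated on a small discrete set of (expression category, head posture) hypotheses; the function name `joint_estimate`, the hypothesis labels, and the numeric values are hypothetical and do not come from the patent, which elsewhere describes time-series filtering rather than an exhaustive search.

```python
def joint_estimate(first_lik, second_lik):
    """Pick the (expression, head_posture) hypothesis with the largest joint likelihood.

    first_lik:  dict mapping each hypothesis to the likelihood from the
                expression-recognition interest point set (assumed precomputed).
    second_lik: dict mapping each hypothesis to the likelihood from the
                head-posture interest point set (assumed precomputed).
    """
    # Joint likelihood = product of the first and second likelihoods.
    joint = {hyp: first_lik[hyp] * second_lik[hyp] for hyp in first_lik}
    best = max(joint, key=joint.get)
    return best, joint


# Hypothetical likelihood values for four hypotheses (illustration only).
first = {
    ("smile", "front"): 0.6, ("smile", "left"): 0.2,
    ("laughter", "front"): 0.3, ("laughter", "left"): 0.5,
}
second = {
    ("smile", "front"): 0.2, ("smile", "left"): 0.7,
    ("laughter", "front"): 0.2, ("laughter", "left"): 0.7,
}
best, joint = joint_estimate(first, second)
```

With these made-up numbers, the hypothesis that maximizes the product is selected even though it does not maximize the first likelihood alone, which is the practical effect of estimating both quantities jointly.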

In the above invention, the facial expression estimation means places the interest point set predetermined for recognizing the facial expression category in a region of the input moving image that is predetermined for the person and in which changes in the person's facial expression are pronounced.

To solve the above problem, the present invention is also an interpersonal emotion estimation device comprising: input means for inputting an input moving image capturing a plurality of persons; facial expression category and head posture estimation means for estimating the head posture of each person based on the likelihood of each person's head posture for the input moving image input by the input means and outputting it as each person's head posture estimate, and for estimating the facial expression category of each person based on the likelihood of each person's facial expression category for the input image input by the input means and outputting it as each person's facial expression category estimate; and expression target estimation means for estimating, based on each person's head posture estimate, the target at which each person's facial expression is directed and outputting it as an expression target estimate.

In the above invention, the device further comprises interpersonal emotion network creation means for outputting, in diagram form, the emotional relationships among the plurality of persons based on the facial expression category estimates and the expression target estimates.

In the above invention, the device comprises an estimation unit that estimates, based on the emotional relationships among the plurality of persons output in diagram form by the interpersonal emotion network creation means, at least one of the type of conversation, the degree of excitement of the conversation, human relationships, the composition of subgroups in a friendly relationship, and individual personality.

In the above invention, the present invention provides a facial expression recognition method comprising: a step in which input means captures a person and inputs the result as an input moving image; and a step in which facial expression estimation means estimates the head posture of the person based on the likelihood of the head posture for the input moving image input by the input means and outputs it as a head posture estimate, and estimates the facial expression category of the person based on the likelihood of the facial expression category for the input image input by the input means and outputs it as a facial expression category estimate.

In the above invention, the present invention provides an interpersonal emotion estimation method comprising: a step in which input means inputs an input moving image capturing a plurality of persons; a step in which facial expression estimation means estimates the head posture of each person based on the likelihood of each person's head posture for the input moving image input by the input means and outputs it as each person's head posture estimate, and estimates the facial expression category of each person based on the likelihood of each person's facial expression category for the input image input by the input means and outputs it as each person's facial expression category estimate; and a step in which expression target estimation means estimates, based on each person's head posture estimate, the target at which each person's facial expression is directed and outputs it as an expression target estimate.

In the above invention, the present invention provides a program that causes a computer of a facial expression recognition device, which recognizes a person's facial expression based on image processing, to execute: an input function of capturing a person and inputting the result as an input moving image; and a facial expression estimation function of estimating the head posture of the person based on the likelihood of the head posture for the input moving image and outputting it as a head posture estimate, and of estimating the facial expression category of the person based on the likelihood of the facial expression category for the input image and outputting it as a facial expression category estimate.

In the above invention, the present invention provides a program that causes a computer of an interpersonal emotion estimation device, which recognizes the facial expressions of a plurality of persons based on image processing and estimates the emotions among them, to execute: an input function of inputting an input moving image capturing a plurality of persons; a facial expression category and head posture estimation function of estimating the head posture of each person based on the likelihood of each person's head posture for the input moving image and outputting it as each person's head posture estimate, and of estimating the facial expression category of each person based on the likelihood of each person's facial expression category for the input image and outputting it as each person's facial expression category estimate; and an expression target estimation function of estimating, based on each person's head posture estimate, the target at which each person's facial expression is directed and outputting it as an expression target estimate.

According to the present invention, a person's facial expression can be recognized with high accuracy together with the person's head posture, even when the person changes head posture.

A block diagram showing the overall configuration of the facial expression recognition device in this embodiment.
A diagram showing an example of a shooting scene and an example of an input image of that scene captured with an omnidirectional camera.
A block diagram showing the specific configuration of the facial expression estimation unit 3.
A block diagram showing the specific configuration of the facial expression category and expression target estimation unit 32.
A block diagram showing the specific configuration of the facial expression category and head posture estimation unit 321.
A conceptual diagram showing the relationship among the facial expression category e_t, the head posture h_t, and the image z_t.
A block diagram showing the processing of the shape model creation unit 5.
A schematic diagram showing an example of the interest point set registered for each facial part.
A block diagram showing the flow of processing for the luminance mean map, the luminance variance map, interest point selection, and creation of the luminance distribution model.
A schematic diagram showing an example of luminance mean maps for smile, laughter, and other.
A schematic diagram showing an example of luminance variance maps for smile, laughter, and other.
A schematic diagram showing facial expression saliency maps for smile, laughter, and other.
A conceptual diagram showing an example of the interpersonal emotion network represented as a graph structure.

The present invention targets multi-party conversation. From a moving image of the conversation, it correctly identifies facial expressions exchanged among the participants, such as smiles, laughter, wry smiles, and anger, and estimates the emotions between persons based on them. In other words, it recognizes from the moving image who showed what expression to whom and when, and from this estimates emotions between participants, such as friendly or hostile feelings. In the facial expression recognition part, it correctly distinguishes expressions that are visually similar but differ in what they imply about interpersonal emotion.

By estimating the target person's expression (smile, laughter, wry smile, surprise, anger, sadness, thinking, and so on) simultaneously with that person's head posture at that moment, the facial expression category is recognized robustly with respect to the target person's head posture. In this embodiment, "estimating the facial expression simultaneously with the head posture" means "estimating the facial expression together with the head posture" or "estimating the facial expression in parallel with the head posture".
Furthermore, the person who is the target of visual gaze is recognized from the recognized head posture, and the person at whom the expression is directed is identified on the assumption that the gaze target and the expression target are the same person. Finally, the emotions between persons are estimated from the facial expression category and its expression target in each frame of the input image.

First, for the method of simultaneously estimating the facial expression and head posture in each frame of each subject's moving image, the present invention newly introduces a framework that automatically selects, from learning images, interest points suited to recognizing the target expressions of the target person (the remaining parts follow Patent Document 1).

Next, the expression target, that is, the person at whom the expression is directed, is estimated. Using the existing finding that visual gaze target and head posture are highly correlated during conversation (Reference 3: R. Stiefelhagen: Tracking focus of attention in meetings, In Proc. ICMI, pages 273--280, 2002), the person at whom the expression was directed is estimated from the target person's estimated head posture within a maximum likelihood framework.
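As a rough illustration of the maximum likelihood step described above, the following Python sketch picks the expression target from a single head azimuth angle. Everything specific here is an assumption made for the example and is not taken from the patent: the head posture is reduced to one azimuth angle in degrees, the likelihood of looking at each candidate is modeled as a Gaussian around that candidate's direction with spread `sigma`, and the "directed at no one" case (g = i) is given a flat likelihood `nobody_lik`.

```python
import math


def target_likelihood(head_azimuth, target_azimuth, sigma):
    """Assumed Gaussian likelihood of a head azimuth given a gaze target direction."""
    # Wrap the angular difference into [-180, 180) degrees.
    d = (head_azimuth - target_azimuth + 180.0) % 360.0 - 180.0
    return math.exp(-0.5 * (d / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def estimate_target(i, head_azimuth, directions, sigma=15.0, nobody_lik=1.0 / 360.0):
    """Return the maximum likelihood expression target for participant i.

    directions: dict mapping each other participant j to the head azimuth
                (degrees) at which participant i would face j (assumed known).
    Returning i itself means the expression is directed at no one.
    """
    scores = {i: nobody_lik}  # g = i: not directed at anyone (flat likelihood)
    for j, azimuth in directions.items():
        if j != i:
            scores[j] = target_likelihood(head_azimuth, azimuth, sigma)
    return max(scores, key=scores.get)


# Hypothetical layout: participant 1 sees participants 2, 3, 4 at these azimuths.
directions_from_1 = {2: -40.0, 3: 0.0, 4: 45.0}
g = estimate_target(1, 38.0, directions_from_1)
```

The flat "no one" likelihood acts as a threshold: when the head points far from every candidate, all Gaussian terms fall below it and the estimate falls back to g = i.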

Finally, the time-series data of all subjects' facial expressions and expression targets are integrated, and various indicators that suggest the emotions between the persons in the conversation are calculated. Specifically, interpersonal emotions are estimated from, for example, how often friendly or hostile expressions were shown and how often expressions co-occurred. The estimated interpersonal emotions are then displayed as a graph structure that is easy to understand visually.
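The kind of indicators mentioned above can be sketched in Python as simple counts over per-frame estimates. This is only an illustrative sketch under stated assumptions: the per-frame data layout (a dict from participant to an (expression, target) pair), the function names, and the definition of co-occurrence as "both participants show the same expression in the same frame" are hypothetical choices, not definitions from the patent.

```python
from collections import Counter
from itertools import combinations


def directed_counts(frames):
    """Count how often participant i directed expression e at participant g.

    frames: list of dicts, one per time step, mapping participant id to
            an (expression, target) pair; target == participant id means
            the expression is directed at no one.
    """
    counts = Counter()
    for frame in frames:
        for i, (e, g) in frame.items():
            if g != i:
                counts[(i, g, e)] += 1
    return counts


def cooccurrence(frames, expression):
    """Count frames in which two participants show the given expression together."""
    co = Counter()
    for frame in frames:
        active = sorted(i for i, (e, _) in frame.items() if e == expression)
        for i, j in combinations(active, 2):
            co[(i, j)] += 1
    return co


# Hypothetical three-frame sequence for three participants.
frames = [
    {1: ("smile", 2), 2: ("smile", 1), 3: ("other", 3)},
    {1: ("smile", 2), 2: ("other", 2), 3: ("smile", 1)},
    {1: ("laughter", 1), 2: ("smile", 1), 3: ("smile", 1)},
]
edges = directed_counts(frames)         # directed edge weights of the graph
pairs = cooccurrence(frames, "smile")   # undirected co-occurrence weights
```

The directed counts can serve as edge weights of a graph whose nodes are the participants, which is one straightforward way to obtain the graph structure described in the text.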

Embodiments of the present invention are described below with reference to the drawings.
FIG. 1 is a block diagram showing the overall configuration of the facial expression recognition device in this embodiment. In the figure, the facial expression recognition device comprises an input unit 1, learning image creation units 2-1 to 2-N (hereinafter collectively referred to as learning image creation unit 2), facial expression estimation units 3-1 to 3-N (hereinafter collectively referred to as facial expression estimation unit 3), and an interpersonal emotion network creation unit 4. The input unit 1 captures a moving image of the face of each conversation participant using a CCD camera or the like as imaging means. The learning image creation units 2-1 to 2-N respectively output the learning images and estimation-target moving images 10-1 to 10-N of conversation participants P_1 to P_N (hereinafter collectively referred to as learning images 10a and input moving images 10b).

The facial expression estimation units 3-1 to 3-N each take as input the learning images and estimation-target moving images 10-1 to 10-N of conversation participants P_1 to P_N and the moving image from the input unit 1, and output the facial expression category and expression target estimates 11-1 to 11-N of participants P_1 to P_N, estimated in real time. The interpersonal emotion network creation unit 4 takes as input the facial expression category and expression target estimates 11-1 to 11-N of all participants P_1 to P_N (hereinafter collectively referred to as facial expression category and expression target estimates 11), and estimates and outputs the network structure of emotions among the participants, such as "who smiled at whom, and how often" and "whose expressions frequently co-occurred". The functions of the learning image creation units 2-1 to 2-N, the facial expression estimation units 3-1 to 3-N, and the interpersonal emotion network creation unit 4 are implemented by software processing.

The number N of conversation participants is assumed to be known, and the intrinsic parameters of the camera of the input unit 1 are assumed to have been obtained by prior calibration. The number N of participants may, however, be computed automatically with an existing face detector, for example a cascaded AdaBoost detector based on Haar-like features (Reference 4: P. Viola and M. Jones: Rapid object detection using a boosted cascade of simple features, In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 511--518, 2001).

The main symbols used below are summarized here. Although the number of conversation participants was written as N in the description of FIG. 1 above, it is written as N_i below. The facial expression category is denoted by e ∈ {1, ..., N_e}, and its expression target by g_i. Here, i denotes one of the N_i conversation participants, that is, i ∈ {1, ..., N_i}. An expression is assumed to be directed either at one of the other participants or at no one, so the expression target is represented discretely as g ∈ {1, ..., N_i}. The expression target g_{i,t} = j (≠ i) indicates that the expression of participant P_i at time t is directed at participant P_j, whereas g_{i,t} = i indicates that the gaze target is none of the other participants, that is, the expression is not directed at any specific person.

In this embodiment, the number N of conversation participants is 4 (i = 1 to 4), and the facial expressions to be recognized are e ∈ {smile, laughter, wry smile, other}. That is, the number N_e of facial expressions to be recognized is 4. In summary, the state e_{i,t} = 'smile' with g_{i,t} = j (≠ i) means that participant P_i is smiling at participant P_j at time t. The framework of the present invention can, however, handle any number N of participants of two or more, as well as arbitrary facial expression categories and any number of categories.

次に、入力部１について説明する。入力部１は、前述したように撮影手段としてのＣＣＤカメラなどを用いて、対話参与者、それぞれの顔画像を撮影する。主なカメラの配置方法として、１台の全方位カメラを対話参与者間の中心付近に配置する方法と、各対話参与者の正面方向に１台ずつのカメラを配置する方法とが考えられる。 Next, the input unit 1 will be described. As described above, the input unit 1 uses the CCD camera or the like as an image capturing unit to capture the face images of each participant. As a main camera arrangement method, there are a method in which one omnidirectional camera is arranged in the vicinity of the center between dialogue participants, and a method in which one camera is arranged in the front direction of each dialogue participant.

図２（ａ）、（ｂ）は、撮影場面の一例、及び、それを全方位カメラで撮影した入力画像の一例を示す図である。本発明の枠組みでは、全ての対話参与者の顔面を撮影できる配置であれば、どのような配置方法でも構わないが、本実施形態では、図２（ａ）に示すように、対話参与者Ｐ_１〜Ｐ_４の中央部に全方位カメラを配置することとする。これは、全方位カメラを用いた場合には、カメラの外部パラメータ（世界座標系に対する並進及び回転）を求めることなく、本発明で必要となる表情の表出対象人物を特定するための対話参与者間の位置関係を、対話参与者の数に関わらず、対象人物の画像中の位置関係から容易に把握することができるので有利なためである。対話参与者Ｐ_１〜Ｐ_４の中央部に全方位カメラを配置した場合、図２（ｂ）に示すように、対話参与者Ｐ_１〜Ｐ_４の各々が撮影される。 FIGS. 2(a) and 2(b) show an example of a shooting scene and an example of an input image of that scene captured with an omnidirectional camera. In the framework of the present invention, any camera arrangement may be used as long as the faces of all conversation participants can be captured. In this embodiment, as shown in FIG. 2(a), an omnidirectional camera is placed at the center of conversation participants P_1 to P_4. This is advantageous because, with an omnidirectional camera, the positional relationship among the conversation participants, which the present invention needs in order to identify the person at whom a facial expression is directed, can be obtained easily from the participants' positions in the image, regardless of the number of participants and without determining the camera's extrinsic parameters (translation and rotation relative to the world coordinate system). When the omnidirectional camera is placed at the center of conversation participants P_1 to P_4, each of them is captured as shown in FIG. 2(b).

他方、複数台のカメラを用いる方法では、各対話参与者の顔の解像度を上げやすい反面、ハードウェア構成が複雑になることに加え、対話毎に対話参与者の数や、座席配置が異なる場合には、その都度、各カメラの外部パラメータを得るためのキャリブレーションを行う必要がある。 On the other hand, with the method using multiple cameras, it is easy to increase the resolution of each participant's face, but in addition to the complexity of the hardware configuration, the number of participants and the seat arrangement differ for each dialogue. Therefore, it is necessary to perform calibration for obtaining external parameters of each camera each time.

次に、表情推定部３について説明する。
図３は、表情推定部３の具体的な構成を示すブロック図である。図において、表情推定部３は、テンプレート作成部３１、及び表情カテゴリ及び表出対象推定部３２からなる。テンプレート作成部３１は、対象人物の認識対象の表情カテゴリの正面正立の顔画像と予め付与されたその表情カテゴリのラベルとを対とするデータセット（学習画像）１０ａから、対象人物に特化した顔のテンプレート１２を作成する。表情カテゴリ及び表出対象推定部３２は，対象人物のテンプレート１２に基づいて、対象人物の入力動画像１０ｂ中の各フレームにおける表情カテゴリ及びその表出対象を推定し、表情カテゴリ及びその表出対象推定値１１として出力する。 Next, the facial expression estimation unit 3 will be described.
FIG. 3 is a block diagram showing a specific configuration of the facial expression estimation unit 3. As shown, the facial expression estimation unit 3 comprises a template creation unit 31 and a facial expression category and expression target estimation unit 32. The template creation unit 31 creates a face template 12 specific to the target person from a data set (learning images) 10a that pairs frontal, upright face images of the target person in each facial expression category to be recognized with pre-assigned labels of those categories. Based on the target person's template 12, the facial expression category and expression target estimation unit 32 estimates the facial expression category and its target in each frame of the target person's input moving image 10b, and outputs them as the facial expression category and expression target estimate 11.

次に、表情カテゴリ及び表出対象推定部３２について説明する。
図４は、表情カテゴリ及び表出対象推定部３２の具体的な構成を示すブロック図である。図において、表情カテゴリ及び表出対象推定部３２は、表情カテゴリ及び頭部姿勢推定部３２１、及び表情表出対象推定部３２２からなる。表情カテゴリ及び頭部姿勢推定部３２１は、１名の対話参与者についての入力動画像中の各フレームにおける表情カテゴリを推定し、表情カテゴリ推定値１１ａとして出力するとともに、頭部姿勢を推定し、頭部姿勢推定値１３として出力する。表情表出対象推定部３２２は、表情カテゴリ及び頭部姿勢推定部３２１により推定された頭部姿勢推定値１３から表情の表出対象を推定し、表情表出対象推定値１１ｂとして出力する。 Next, the expression category and expression target estimation unit 32 will be described.
FIG. 4 is a block diagram showing a specific configuration of the facial expression category and expression target estimation unit 32. As shown, the unit 32 comprises a facial expression category and head pose estimation unit 321 and an expression target estimation unit 322. The facial expression category and head pose estimation unit 321 estimates the facial expression category in each frame of the input moving image of one conversation participant and outputs it as the facial expression category estimate 11a; it also estimates the head pose and outputs it as the head pose estimate 13. The expression target estimation unit 322 estimates the target of the facial expression from the head pose estimate 13 produced by unit 321 and outputs it as the expression target estimate 11b.

次に、表情カテゴリ及び頭部姿勢推定部３２１について説明する。
図５は、表情カテゴリ及び頭部姿勢推定部３２１の具体的な構成を示すブロック図である。図において、表情カテゴリ及び頭部姿勢推定部３２１は、入力動画像１０ｂの各フレームにおける対象人物の表情カテゴリ及び頭部姿勢を、それらの尤度に基づいて推定する。 Next, the facial expression category and head posture estimation unit 321 will be described.
FIG. 5 is a block diagram illustrating a specific configuration of the facial expression category and head posture estimation unit 321. In the figure, a facial expression category and head posture estimation unit 321 estimates the facial expression category and head posture of the target person in each frame of the input moving image 10b based on their likelihoods.

ここで、図６は、表情カテゴリｅ_ｔ、頭部姿勢ｈ_ｔ、及び画像ｚ_ｔの関係を示す概念図である。各時刻の表情カテゴリと頭部姿勢とは、そのときの画像が与えられた下で条件付独立であること、さらに、表情カテゴリと頭部姿勢とは、それぞれ独立なマルコフ過程に従うことを仮定している。頭部姿勢ｈは、６自由度である。本実施形態では、顔中心の位置（３次元）、及び３次元回転角（ロール、ピッチ、ヨー）から構成されている。 Here, FIG. 6 is a conceptual diagram showing the relationship among the facial expression category e_t, the head pose h_t, and the image z_t. It is assumed that the facial expression category and the head pose at each time are conditionally independent given the image at that time, and that the facial expression category and the head pose each follow an independent Markov process. The head pose h has six degrees of freedom. In this embodiment, it consists of the position of the face center (three-dimensional) and three rotation angles (roll, pitch, yaw).

時刻ｔにおける頭部姿勢推定値＾ｈ_ｔ、及び表情カテゴリ推定値＾ｅ_ｔは、同時刻における同時事後確率密度分布ｐ（ｈ_ｔ，ｅ_ｔ｜ｚ_１：ｔ）に基づいて計算される。頭部姿勢推定値は、その同時事後確率密度分布の頭部姿勢についての周辺確率として、表情カテゴリ推定値は、同時事後確率密度分布の各表情カテゴリについての周辺確率を最大化するカテゴリとしてそれぞれ計算される。これら頭部姿勢推定値＾ｈ_ｔ、及び表情カテゴリ推定値＾ｅ_ｔは、数式（１）、（２）にて表される。 The head pose estimate ^h_t and the facial expression category estimate ^e_t at time t are calculated from the joint posterior probability density p(h_t, e_t | z_{1:t}) at that time. The head pose estimate is calculated from the marginal of the joint posterior with respect to head pose, and the facial expression category estimate is calculated as the category that maximizes the marginal probability of the joint posterior over facial expression categories. The estimates ^h_t and ^e_t are given by equations (1) and (2).
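Equations (1) and (2) are not reproduced in this text. One reading consistent with the description above is sketched here; treating the head pose estimate as the posterior mean is an assumption made for illustration, not the published form.

```latex
\begin{align}
\hat{h}_t &= \sum_{e_t} \int h_t \, p(h_t, e_t \mid z_{1:t}) \, dh_t \tag{1} \\
\hat{e}_t &= \arg\max_{e_t} \int p(h_t, e_t \mid z_{1:t}) \, dh_t \tag{2}
\end{align}
```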

また、同時事後確率密度分布は、ベイズ則を用いて次式（３）のように分解可能である。 Further, the joint posterior probability density can be decomposed using Bayes' rule as shown in equation (3).

ここで、ｐ（ｚ｜ｈ，ｅ）は、表情カテゴリｅ、及び頭部姿勢ｈの観測画像ｚに対する同時尤度である。また、確率密度分布ｐ（ｈ_ｔ，ｅ_ｔ｜ｚ_{１：ｔ−１}）は、時刻ｔにおける予測分布を表す。さらに、同時尤度は、頭部姿勢の観測画像に対する尤度、及び，頭部姿勢が与えられた下での表情カテゴリの観測画像に対する尤度の積、すなわち、次式（４）として表現される。 Here, p(z | h, e) is the joint likelihood of the facial expression category e and the head pose h for the observed image z. The probability density p(h_t, e_t | z_{1:t-1}) is the predictive distribution at time t. The joint likelihood is expressed as the product of the likelihood of the head pose for the observed image and the likelihood of the facial expression category for the observed image given the head pose, that is, as equation (4).

これらの尤度は、テンプレートに基づいて計算される。この尤度の詳細については、後述するテンプレートにて詳細に説明する。なお、事後確率密度分布は、次式（５）を満たすように正規化される。 These likelihoods are calculated from the template. Their details are given in the description of the template below. The posterior probability density is normalized so as to satisfy equation (5).

予測分布を直前の表情及び頭部姿勢の状態で条件付けすると、次式（６）のような逐次式に変換される。 Conditioning the predictive distribution on the immediately preceding facial expression and head pose states turns it into the recursive form of equation (6).
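Equations (3), (5), and (6) are also missing from this text. Under the assumptions stated above (Bayes' rule, and independent Markov processes for head pose and facial expression), they would take roughly the following form. This is a reconstruction for the reader's convenience and may differ in detail from the published equations; equation (4) is left out because its exact factorization cannot be recovered unambiguously from the prose.

```latex
\begin{align}
p(h_t, e_t \mid z_{1:t}) &\propto p(z_t \mid h_t, e_t)\, p(h_t, e_t \mid z_{1:t-1}) \tag{3} \\
\sum_{e_t} \int p(h_t, e_t \mid z_{1:t})\, dh_t &= 1 \tag{5} \\
p(h_t, e_t \mid z_{1:t-1}) &= \sum_{e_{t-1}} \int p(h_t \mid h_{t-1})\, P(e_t \mid e_{t-1})\,
  p(h_{t-1}, e_{t-1} \mid z_{1:t-1})\, dh_{t-1} \tag{6}
\end{align}
```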

ここで、ｐ（ｈ_ｔ｜ｚ_１：ｔ）は、現時刻ｔまでの全ての観測が与えられたもとでの頭部姿勢の事後確率密度分布を表す。確率行列ｐ（ｈ_ｔ｜ｈ_ｔ−１）、及びＰ（ｅ_ｔ｜ｅ_ｔ−１）は、表情及び頭部姿勢のそれぞれについての遷移確率、すなわち、連続する二時刻間に表情及び頭部姿勢がそれぞれどのように変化するのかを確率にて表したものである。 Here, p(h_t | z_{1:t}) is the posterior probability density of the head pose given all observations up to the current time t. The terms p(h_t | h_{t-1}) and P(e_t | e_{t-1}) are the transition probabilities for head pose and facial expression, respectively; they express, as probabilities, how the head pose and the facial expression change between two consecutive time steps.

本実施形態では、表情及び頭部姿勢共に、前に時刻間遷移の確率が得られていないものとし、表情カテゴリについては、全ての二表情間の遷移の確率が等しいとし、頭部姿勢については、ランダムウォークモデルに従うこととする。但し、本発明の枠組みでは、任意の遷移確率を使用可能である。例えば、頭部姿勢については、等速運動や、等加速度運動、あるいは、磁気センサを用いるなどして、事前に獲得した人間の頭部姿勢運動のモデルなどを使用可能である。表情カテゴリについては、同じ表情が持続する確率を他のα（α＞１）とするか、あるいは、人手で付けたラベルから得た実際の対話中での表情カテゴリの遷移の確率などとすることが考えられる。 In this embodiment, no transition probabilities between time steps are assumed to be available in advance for either facial expression or head pose. For the facial expression category, the transitions between any two facial expressions are taken to be equally probable, and the head pose is assumed to follow a random walk model. However, any transition probabilities can be used in the framework of the present invention. For head pose, for example, a constant velocity model, a constant acceleration model, or a model of human head motion acquired in advance with a magnetic sensor or the like can be used. For the facial expression category, the probability that the same expression persists could be set to α times (α > 1) that of the other transitions, or the transition probabilities could be taken from facial expression transitions in actual conversations obtained from manually assigned labels.
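The transition models described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the per-dimension noise scales `sigmas`, and the reading of α as a ratio between the self-transition and each other transition are assumptions.

```python
import random


def uniform_expression_transition(n_categories):
    """Transition matrix in which every expression-to-expression move is equally likely."""
    p = 1.0 / n_categories
    return [[p] * n_categories for _ in range(n_categories)]


def sticky_expression_transition(n_categories, alpha):
    """Variant in which staying in the same expression is alpha times (alpha > 1)
    as likely as moving to any one other expression (assumed reading of alpha)."""
    off = 1.0 / (alpha + n_categories - 1)
    return [[alpha * off if i == j else off for j in range(n_categories)]
            for i in range(n_categories)]


def random_walk_pose(pose, sigmas, rng):
    """Random-walk prediction: add zero-mean Gaussian noise to each of the
    six pose components (x, y, z, roll, pitch, yaw)."""
    return [x + rng.gauss(0.0, s) for x, s in zip(pose, sigmas)]
```

With four expression categories and α = 3, each row of the sticky matrix is [0.5, 1/6, 1/6, 1/6] up to permutation.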

数式（６）の予測分布は、閉形式として厳密に計算することができない。これは、頭部姿勢が画像に対して与える影響が、カメラ射影（ここでは、中心投影）という非線形変換によるものであり、さらに、顔を大きく横に向けるなどした際に生じる顔の一部領域の遮蔽といった効果も加わるためである。よって、近似的な解法を用いる必要がある。 The predictive distribution of equation (6) cannot be computed exactly in closed form. This is because the effect of head pose on the image arises through camera projection (here, central projection), which is a nonlinear transformation, and because further effects are added, such as occlusion of part of the face when the face is turned far to the side. An approximate solution method is therefore required.

本実施形態では、パーティクルフィルタ、あるいは逐次モンテカルロ法などと呼ばれる、時間方向に逐次的なサンプリングを行う手法を用いることとする。より具体的には、図５に示すように、パーティクルフィルタ３２１１により事後確率密度分布を推定し、推定値算出部３２１２により、上記事後確率密度分布に基づいて、表情カテゴリ推定値、及び頭部姿勢推定値１１を算出する。このとき、各サンプル（パーティクル）が保持する状態は、６次元の頭部姿勢ｈ、及び表情カテゴリｅの計７変数である。パーティクルフィルタの詳細については、参考文献５（M. Isard, and A. Blake: Condensation -- conditional density propagation for visual tracking, International Journal of Computer Vision, 29(1):5--28, 1998.）に記されている。 In this embodiment, a method that samples sequentially in the time direction, known as a particle filter or sequential Monte Carlo method, is used. More specifically, as shown in FIG. 5, the particle filter 3211 estimates the posterior probability density, and the estimate calculation unit 3212 calculates the facial expression category estimate and the head pose estimate 11 from that posterior probability density. The state held by each sample (particle) consists of seven variables in total: the six-dimensional head pose h and the facial expression category e. Details of the particle filter are given in Reference 5 (M. Isard and A. Blake: Condensation -- conditional density propagation for visual tracking, International Journal of Computer Vision, 29(1):5--28, 1998).
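One predict/update/resample cycle of such a filter might look like the sketch below. It assumes the random-walk and uniform-transition models of this embodiment; the `likelihood` callback stands in for the template-based likelihood, and taking a plain weighted mean of the pose (including the angles) is a simplification made here for brevity.

```python
import random


def particle_filter_step(particles, likelihood, n_expr, sigmas, rng):
    """One cycle over particles of the form (pose, expr), where pose is a list of
    six floats and expr an integer category. Returns (new_particles, pose_estimate,
    expression_estimate)."""
    # Predict: random walk for head pose; a uniform expression transition means the
    # new category is drawn uniformly, independent of the previous one.
    predicted = [
        ([x + rng.gauss(0.0, s) for x, s in zip(pose, sigmas)], rng.randrange(n_expr))
        for pose, _ in particles
    ]
    # Update: weight each particle by the likelihood of the current observation.
    weights = [likelihood(pose, expr) for pose, expr in predicted]
    total = sum(weights)
    if total > 0.0:
        weights = [w / total for w in weights]
    else:  # degenerate case: fall back to uniform weights
        weights = [1.0 / len(predicted)] * len(predicted)
    # Estimate: weighted mean pose (naive for angles) and the expression category
    # with the largest marginal weight.
    pose_est = [sum(w * pose[d] for w, (pose, _) in zip(weights, predicted))
                for d in range(len(sigmas))]
    mass = [0.0] * n_expr
    for w, (_, expr) in zip(weights, predicted):
        mass[expr] += w
    expr_est = max(range(n_expr), key=mass.__getitem__)
    # Resample: draw particles in proportion to their weights.
    resampled = rng.choices(predicted, weights=weights, k=len(predicted))
    return resampled, pose_est, expr_est
```

In use, `likelihood(pose, expr)` would evaluate the observed frame against the template for the given pose and expression category.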

次に、テンプレート作成部３１について説明する。
テンプレートＭは、数式（３）中の尤度ｐ（ｚ｜ｈ，ｅ）を定義するためのものである。テンプレートＭは、Ｍ＝｛Ｓ，Ｑ^Ｅ，Ｑ^Ｈ，Ｉ^Ｅ，Ｉ^Ｈ｝という３種５つの要素から構成される。ここで、Ｓ、Ｑ、及びＩは、各々、剛体の顔形状モデル、注目点集合、及び注目点輝度分布モデルである。添え字Ｅ及びＨは、表情及び頭部姿勢をそれぞれ表す。以下では、表情あるいは頭部姿勢のどちら用のモデルであるのかを対象状態と呼び、＊∈｛Ｅ，Ｈ｝にて表す。 Next, the template creation unit 31 will be described.
The template M defines the likelihood p(z | h, e) in equation (3). It consists of five elements of three kinds: M = {S, Q^E, Q^H, I^E, I^H}. Here, S, Q, and I are a rigid face shape model, a set of attention points, and an attention point luminance distribution model, respectively. The superscripts E and H denote facial expression and head pose, respectively. Below, whether a model is for facial expression or for head pose is called its target state and is written * ∈ {E, H}.

まず、顔形状モデルＳは、学習画像上にて定義される注目点の奥行き情報を与えるためのものである。これにより、注目点の３次元座標（画像座標＋奥行き座標）が揃い、様々な３次元の頭部姿勢における各注目点の３次元位置を計算することが可能となる。 First, the face shape model S is for giving depth information of the attention point defined on the learning image. Thereby, the three-dimensional coordinates (image coordinates + depth coordinates) of the attention point are aligned, and the three-dimensional position of each attention point in various three-dimensional head postures can be calculated.

次いで、注目点集合Ｑ^＊は、尤度の計算の際に考慮する顔面上の注目位置を示す。これらの注目点は、学習画像上にて疎に配置される。特に、表情についての注目点Ｑ^Ｅは、特定の表情カテゴリの検出に特化した注目点の集合により構成される。注目点集合は、次式（７）で表される。 Next, the attention point set Q^* gives the positions on the face that are considered when calculating the likelihood. These attention points are placed sparsely on the learning image. In particular, the attention points Q^E for facial expression are made up of sets of attention points, each specialized for detecting a specific facial expression category. The attention point set is given by equation (7).

ここで、ｑ_ｋは、ｋ番目の注目点の学習画像上での位置である。これについては後述する。Ｎ^＊ _ｋは、対象状態＊についての注目点の数である。 Here, q _k is the position on the learning image of the kth point of interest. This will be described later. N ^* _k is the number of points of interest for the target state *.

最後に、輝度分布モデルＩ^＊は，各注目点ｋの輝度が各表情において、どのように変化するのかをモデル化したものであり、次式（８）で表される。 Finally, the luminance distribution model I ^* models how the luminance of each attention point k changes in each facial expression, and is expressed by the following equation (8).

本実施形態では、注目点集合Ｑ^＊、及び輝度分布モデルＩ^＊は、個人特化とする。つまり、これらのモデルを対象人物自身についての学習画像から作成する。よって、対象人物が頭部姿勢を大きく変化させ、顔全体の見えが大きく変化する場合であっても、微笑と哄笑とのような見え方の違いが小さな表情を精度よく識別可能である。一方、顔形状モデルＳについては汎用モデルとする。 In this embodiment, the attention point set Q^* and the luminance distribution model I^* are person-specific. That is, these models are created from learning images of the target person. As a result, even when the target person changes head pose substantially and the appearance of the whole face changes greatly, facial expressions that differ only slightly in appearance, such as a smile and loud laughter, can be distinguished accurately. The face shape model S, on the other hand, is a general-purpose model.

学習画像集合Ｙは、表情のカテゴリが付与された各対象表情カテゴリについて複数の画像からなる。ここでは、それをＹ＝｛Ｙ_０，Ｙ_ｅ＝１，…，Ｙ_ｅ＝Ｎｅ｝と表す。ここで、Ｙ_０は、１枚の無表情の顔画像であり、Ｙ_ｅは、表情カテゴリｅ（＞０）のラベルが付与された複数の画像である。無表情の学習画像Ｙ_０は、頭部姿勢推定のためのモデル作成用であり、その他の学習画像Ｙ_ｅは、表情認識用である。なお、各学習画像中には、対象人物の正面正立の顔が含まれており、かつ、それらの顔の位置は、画像間で等しいものとする。そのような学習画像の作成方法については、後述する学習画像作成部２にて説明する。 The learning image set Y consists of multiple images, each labeled with a facial expression category, for each target facial expression category. It is written here as Y = {Y_0, Y_{e=1}, ..., Y_{e=N_e}}, where Y_0 is a single expressionless (neutral) face image and Y_e is a set of images labeled with facial expression category e (> 0). The neutral learning image Y_0 is used to build the model for head pose estimation, and the other learning images Y_e are used for facial expression recognition. Each learning image contains a frontal, upright face of the target person, and the face position is assumed to be the same across images. How such learning images are created is explained later in the description of the learning image creation unit 2.

なお、本発明の枠組みでは、個人毎の学習データを事前に用意することが困難な場合には、個人特化モデルを作成する代わりに、汎用的に使用可能な注目点集合Ｑ^＊、及び輝度分布モデルＩ^＊を用意しても構わない。このときは、事前に多人数の学習データを用意しておく必要がある。そして、そこから各表情についての平均顔画像、及びその分散画像を作成し、それを学習画像集合Ｙと見立てて、個人特化のモデルと同様の方法でモデルを作成すればよい。但し、そのような汎用モデルを用いた場合には、個人特化モデルの場合と比べて、認識精度が低下する可能性が高い。また、逆に、顔形状モデルＳについては、事前に対象人物の顔形状を、レンジスキャナ等で計測したデータがあれば、それを用いても構わない。 Note that in the framework of the present invention, when it is difficult to prepare learning data for each individual in advance, instead of creating a personalized model, a general-purpose attention point set Q ^* and luminance A distribution model I ^* may be prepared. At this time, it is necessary to prepare learning data for a large number of people in advance. Then, an average face image for each facial expression and a dispersion image thereof are created therefrom, and this is regarded as a learning image set Y, and a model may be created by a method similar to a model for personalization. However, when such a general-purpose model is used, there is a high possibility that the recognition accuracy will be lower than in the case of a personalized model. Conversely, as for the face shape model S, if there is data obtained by measuring the face shape of the target person in advance with a range scanner or the like, it may be used.

＜顔形状モデルＳの作成＞
図７は、形状モデル作成部５の構成を示すブロック図である。
本実施形態では、複数人の顔形状の平均形状を、対象人物の１枚の無表情の学習画像Ｙ_０に対してフィッティングしたものを、この顔形状モデルＳとして使用する。具体的には、図７に示す形状モデル作成部５により、学習画像（形状モデル作成用）１０ｃの顔の大きさにフィットするよう平均形状を縦横方向にスケーリングし、さらに、それぞれのスケーリング係数の二乗和の平方根を係数とした奥行き方向にスケーリングを行い、形状モデル１４を作成する。但し、本発明の枠組みでは、任意の剛体形状を形状モデルとして使用可能である。例えば、事前に対象人物の顔形状をレンジスキャナ等で計測したデータがあれば、それを使用すればよい。 <Creation of face shape model S>
FIG. 7 is a block diagram showing the configuration of the shape model creation unit 5.
In this embodiment, the face shape model S is the average face shape of multiple people, fitted to the single neutral learning image Y_0 of the target person. Specifically, the shape model creation unit 5 shown in FIG. 7 scales the average shape horizontally and vertically to fit the size of the face in the learning image (for shape model creation) 10c, then scales it in the depth direction by a coefficient equal to the square root of the sum of squares of the horizontal and vertical scaling coefficients, producing the shape model 14. However, any rigid shape can be used as the shape model in the framework of the present invention. For example, if data measuring the target person's face shape with a range scanner or the like is available in advance, it may be used instead.
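The scaling step can be illustrated with a short sketch. The vertex representation and the parameter names are assumptions; the depth coefficient follows the wording of the text literally (square root of the sum of the squared horizontal and vertical coefficients).

```python
import math


def fit_mean_shape(vertices, mean_width, mean_height, face_width, face_height):
    """Scale an average face shape, given as (x, y, z) vertices, to the face size
    measured in the neutral learning image."""
    sx = face_width / mean_width      # horizontal scaling coefficient
    sy = face_height / mean_height    # vertical scaling coefficient
    sz = math.sqrt(sx * sx + sy * sy) # depth coefficient as worded in the text
    return [(x * sx, y * sy, z * sz) for x, y, z in vertices]
```

For instance, a horizontal coefficient of 3 and a vertical coefficient of 4 give a depth coefficient of 5.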

＜表情認識用モデル（Ｑ^Ｅ，Ｉ^Ｅ）の作成＞
表情認識用の注目点集合Ｑ^Ｅは、特定の表情カテゴリにおいて、輝度が特徴的である領域に定義される。注目点は、無表情の学習画像Ｙ_０の各顔部品について、対象表情のカテゴリ毎に１点ずつ登録される。本実施形態では、顔部品を眉（左／右）、目（左／右）、鼻、及び口の計４種６領域とする。表情毎に選択する注目点の数は、任意であるが、処理速度、及び認識精度に影響を及ぼす。本実施形態では、計６８点（眉：１２×２、目：６×２、鼻：８、及び、口：２４）とする。このときの表情認識用の注目点の総数は、Ｎ^Ｅ _ｋ＝６８×４＝２７２である。 <Creation of facial expression recognition models (Q ^E , I ^E )>
The attention point set Q^E for facial expression recognition is defined in regions whose luminance is characteristic of a specific facial expression category. Attention points are registered one at a time, for each target facial expression category, within each facial part of the neutral learning image Y_0. In this embodiment, the facial parts are the eyebrows (left/right), eyes (left/right), nose, and mouth: four kinds and six regions in total. The number of attention points selected per facial expression is arbitrary, but it affects processing speed and recognition accuracy. In this embodiment, 68 points are used (eyebrows: 12 × 2, eyes: 6 × 2, nose: 8, mouth: 24), so the total number of attention points for facial expression recognition is N^E_k = 68 × 4 = 272.

図８は、各顔部品について登録される注目点集合の一例を示す模式図である。ここでの注目点集合は、個人毎に異なる。それは、顔部品配置や、その見え方の変化、及び表情変化による移動や、変形が個人毎に異なるためである。なお、顔部品の検出方法も任意であるが、本実施形態では、Ｈａａｒ−ｌｉｋｅ特徴量に基づくカスケード型ＡｄａＢｏｏｓｔ検出器（P. Viola and M. Jones: Rapid object detection using a boosted cascade of simple features, In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 511--518, 2001）などを用いて、それぞれ矩形領域として検出するものとする。 FIG. 8 is a schematic diagram showing an example of the attention point sets registered for each facial part. These attention point sets differ from person to person, because the arrangement of facial parts, the way their appearance changes, and their movement and deformation under changes of facial expression all differ between individuals. The method of detecting facial parts is also arbitrary; in this embodiment, each part is detected as a rectangular region using, for example, a cascaded AdaBoost detector based on Haar-like features (P. Viola and M. Jones: Rapid object detection using a boosted cascade of simple features, In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 511--518, 2001).

表情認識用の注目点輝度分布モデルＩ^Ｅは、各注目点の輝度が各表情に対して、どのように変化するのかをモデル化したものである。本発明は、この特性を利用して、観測画像中の注目点の輝度から、そのときの表情カテゴリを認識する。だが、顔形状モデルの誤差のため、正確な注目点の位置の算出が困難であることや、撮影環境中の照明が時間的に一定であっても、対象人物の顔向きの変化に伴い、顔に対する照明の方向が変化することにより、観測される注目点の輝度には、ばらつきが生じる。これらのばらつきをどのようにモデル化してもよいが、本実施形態では、正規分布にてモデル化する。ｋ番目の注目点についての輝度分布モデルＩ^Ｅ _ｋは、次式（９）のように表される． The attention point luminance distribution model I^E for facial expression recognition models how the luminance of each attention point changes with each facial expression. The present invention uses this property to recognize the facial expression category from the luminance of the attention points in the observed image. However, the observed luminance of an attention point varies, because errors in the face shape model make it difficult to compute the attention point position exactly, and because the direction of illumination relative to the face changes as the target person's face orientation changes, even when the lighting of the shooting environment is constant over time. This variation may be modeled in any way; in this embodiment, it is modeled with a normal distribution. The luminance distribution model I^E_k for the k-th attention point is given by equation (9).

ここで、μ_ｋ（ｅ）及びσ^２ _ｋ（ｅ）は、ｋ番目の注目点の表情カテゴリｅにおける平均、及び分散をそれぞれ表す。この平均、及び分散は、学習画像Ｙ_ｅから算出した輝度平均マップ及び輝度分散マップの、位置ｑ^Ｅ _ｋにおける値とする。 Here, μ _k (e) and σ ² _k (e) represent an average and a variance in the facial expression category e of the k-th attention point, respectively. The average and variance are values at the position q ^E _k of the luminance average map and luminance variance map calculated from the learning image Y _e .

図９は、輝度平均マップ、輝度分散マップ、注目点選択、及び輝度分布モデルの作成の一連の処理の流れを示すブロック図である。輝度平均マップ及び輝度分散マップ作成部６は、学習画像１０ｄ（Ｙ_ｅ）から輝度分散マップ１５及び輝度平均マップ１６を作成する。顕著性算出部７は、輝度分散マップ１５から顕著性マップ１７を算出する。注目点選択部８は、顕著性マップ１７に従って、顔部品領域内から対象表情カテゴリの識別に有用な注目点集合１８を選択する。また、輝度分布モデル作成部９は、輝度平均マップ１６及び注目点集合１８から、表情及び頭部姿勢に対する注目点集合１８の輝度変化を示す輝度分布モデル１９を作成する。 FIG. 9 is a block diagram showing the flow of processing for creating the luminance mean map and the luminance variance map, selecting attention points, and creating the luminance distribution model. The luminance mean map and luminance variance map creation unit 6 creates the luminance variance map 15 and the luminance mean map 16 from the learning images 10d (Y_e). The saliency calculation unit 7 calculates the saliency map 17 from the luminance variance map 15. The attention point selection unit 8 selects, according to the saliency map 17, an attention point set 18 within the facial part regions that is useful for identifying the target facial expression category. The luminance distribution model creation unit 9 creates, from the luminance mean map 16 and the attention point set 18, the luminance distribution model 19 describing how the luminance of the attention point set 18 changes with facial expression and head pose.

また、図１０（ａ）〜（ｃ）は、各々、微笑、哄笑、その他についての輝度平均マップ１６の一例を示す模式図である。また、図１１（ａ）〜（ｃ）は、各々、微笑、哄笑、その他についての輝度分散マップ１５の一例を示す模式図である。輝度平均マップ１６、及び輝度分散マップ１５とは、各表情カテゴリの学習画像の平均を取った画像、及びその分散の画像のことである。以下、詳細に説明する。 FIGS. 10(a) to 10(c) are schematic diagrams showing examples of the luminance mean map 16 for smile, loud laughter, and other, respectively. FIGS. 11(a) to 11(c) are schematic diagrams showing examples of the luminance variance map 15 for smile, loud laughter, and other, respectively. The luminance mean map 16 and the luminance variance map 15 are the image obtained by averaging the learning images of each facial expression category and the image of their variance. Details follow.
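The per-pixel mean and variance maps can be computed as in the sketch below. Images are represented here as nested lists of grayscale values of identical size, and the population variance (dividing by the number of images) is used; both choices are assumptions for illustration.

```python
def mean_and_variance_maps(images):
    """Return (mean_map, variance_map) over a list of equally sized grayscale
    images, each given as a list of rows of luminance values."""
    n = len(images)
    rows, cols = len(images[0]), len(images[0][0])
    mean = [[sum(img[r][c] for img in images) / n for c in range(cols)]
            for r in range(rows)]
    var = [[sum((img[r][c] - mean[r][c]) ** 2 for img in images) / n
            for c in range(cols)]
           for r in range(rows)]
    return mean, var
```

The values μ_k(e) and σ²_k(e) of an attention point would then be read from these maps at the point's position q^E_k.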

＜注目点の選択＞
画像座標ｘの輝度の対象表情ｅについての顕著性は、次式（１０）に示すように、輝度の表情カテゴリｅでの分散（カテゴリ内分散）とその輝度のカテゴリ間分散の比とする。 <Selecting the point of interest>
As shown in equation (10), the saliency of the luminance at image coordinate x for target facial expression e is defined as the ratio between the variance of that luminance within facial expression category e (within-category variance) and its between-category variance.

ここで、σ^２ _{ｘ，Ｗ（ｅ）}は、表情カテゴリｅの位置ｘにおける輝度の分散を表し、σ^２ _{ｘ，Ｂ（ｅ）}は、表情カテゴリｅとそれ以外の表情カテゴリとの間の輝度の分散を表す。表情顕著性の一例を図１２に示す。図１２（ａ）〜（ｃ）には、各々、微笑、哄笑、その他についての表情顕著性マップを示している。 Here, σ²_{x,W(e)} is the variance of the luminance at position x within facial expression category e, and σ²_{x,B(e)} is the variance of the luminance between facial expression category e and the other facial expression categories. An example of facial expression saliency is shown in FIG. 12. FIGS. 12(a) to 12(c) show the facial expression saliency maps for smile, loud laughter, and other, respectively.
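Equation (10) is not reproduced in this text, so the sketch below only illustrates one plausible form: a Fisher-style ratio of between-category to within-category variance. The direction of the ratio, the definition of the between-category term as the squared difference between the mean of category e and the average mean of the other categories, and the small constant `eps` are all assumptions.

```python
def saliency_map(mean_maps, var_maps, e, eps=1e-6):
    """Saliency of each pixel for category e, given per-category mean and
    variance maps (lists of rows). Higher values mark pixels whose luminance
    separates category e from the others while varying little within e."""
    others = [m for i, m in enumerate(mean_maps) if i != e]
    rows, cols = len(mean_maps[e]), len(mean_maps[e][0])
    sal = []
    for r in range(rows):
        row = []
        for c in range(cols):
            other_mean = sum(m[r][c] for m in others) / len(others)
            between = (mean_maps[e][r][c] - other_mean) ** 2  # assumed definition
            within = var_maps[e][r][c]
            row.append(between / (within + eps))
        sal.append(row)
    return sal
```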

各対象表情ｅについての注目点は、次のようにして選択される。
（１）まず、顔面上の全ての位置ｘについて表情カテゴリｅについての顕著性Ｓ_ｘ，ｅを、表情カテゴリｅについての学習画像集合Ｙ_ｅから計算する。
（２）次いで、最も顕著性Ｓ_ｘ，ｅの高い位置（ｘ´とする）を注目点として選択する。
（３）そして、その選択された位置ｘ´の顕著性を０とするとともに、その周囲の点の顕著性についても、位置ｘ´からの距離ｄに基づいて減少させる。ここでは、Ｓ_ｘ，ｅ←Ｓ_ｘ，ｅ−α・ｅｘｐ（−ｄ^２）とする。
（４）もし、選択された注目点の数が規定数に達していなければ、上記（２）に戻り、さらに注目点を選択する。なお、上記（３）により、各表情で注目点が顕著性を考慮しつつなるべく広く配置されるようにしている。 The attention point for each target facial expression e is selected as follows.
(1) First, the saliency S _{x, e} for the expression category e for all the positions x on the face is calculated from the learning image set Y _e for the expression category e.
(2) Next, a position having the highest saliency S _{x, e} (referred to as x ′) is selected as a point of interest.
(3) The saliency of the selected position x ′ is set to 0, and the saliency of the surrounding points is also reduced based on the distance d from the position x ′. Here, S _{x, e} ← S _{x, e} −α · exp (−d ² ).
(4) If the number of selected attention points has not reached the specified number, the process returns to the above (2), and further attention points are selected. Note that, due to the above (3), attention points are arranged as widely as possible in consideration of saliency in each facial expression.
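Steps (1) to (4) above amount to a greedy selection with local suppression, which can be sketched as follows. The saliency map is taken as given; measuring the distance d in pixels and skipping positions that were already selected are assumptions added here (the latter guards against re-selecting a zeroed point once its neighbours have gone negative).

```python
import math


def select_attention_points(saliency, n_points, alpha):
    """Greedily pick n_points positions (row, col) from a saliency map,
    suppressing saliency around each pick by alpha * exp(-d^2)."""
    s = [row[:] for row in saliency]  # work on a copy
    rows, cols = len(s), len(s[0])
    selected = []
    for _ in range(n_points):
        candidates = [(r, c) for r in range(rows) for c in range(cols)
                      if (r, c) not in selected]
        # Step (2): take the most salient remaining position.
        r0, c0 = max(candidates, key=lambda rc: s[rc[0]][rc[1]])
        selected.append((r0, c0))
        # Step (3): reduce saliency nearby, then zero the selected position.
        for r in range(rows):
            for c in range(cols):
                d2 = (r - r0) ** 2 + (c - c0) ** 2
                s[r][c] -= alpha * math.exp(-d2)
        s[r0][c0] = 0.0
    return selected
```

With the row `[9, 8, 1, 7]` and α = 5, the second pick moves to the far end of the row rather than to the immediate neighbour of the first pick, which is the spreading effect described in step (3).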

＜表情カテゴリの尤度＞
表情カテゴリ及び頭部姿勢推定部３２１において、入力動画像の各フレームにおける対象人物の表情カテゴリを推定する際の尤度について説明する。頭部姿勢ｈが与えられた下での観測画像ｚに対する表情カテゴリｅの尤度ｐ（ｈ，ｚ｜ｅ_ｔ）は、テンプレート作成部３１により作成されたテンプレートを用いて算出される。 <Likelihood of facial expression category>
The likelihood when the facial expression category and head posture estimation unit 321 estimates the facial expression category of the target person in each frame of the input moving image will be described. The likelihood p (h, z | e _t ) of the facial expression category e for the observed image z given the head posture h is calculated using the template created by the template creation unit 31.

疎に配置された各注目点の輝度が互いに独立であることを仮定すると、表情カテゴリについての尤度は、次式（１１）で示すように、各注目点についての尤度の積に分解される。 Assuming that the luminances of the sparsely placed attention points are mutually independent, the likelihood for the facial expression category factorizes into a product of likelihoods for the individual attention points, as shown in equation (11).

ここで、ｚ_ｋ(ｈ)は、次式（１２）、（１３）で示すように、頭部姿勢ｈにおけるｋ番目の特徴点の画像座標での輝度を表す。 Here, z _k (h) represents the luminance at the image coordinates of the kth feature point in the head posture h, as shown by the following equations (12) and (13).

ここで、関数ρ（・）は、ロバスト関数を表し、少数の注目点中に含まれる観測ノイズ（画像ノイズや、形状モデルの誤差に伴う注目点の算出位置のずれなど）により、尤度ｐ(ｚ_ｋ|ｅ）全体が小さくなることを緩和するためのものである。本実施形態では、ロバスト関数ρとしてＧｅｍａｎ−ＭｃＣｌｕｒｅ関数ρ（ｘ）＝ｃ・ｘ^２／（１＋ｘ^２）を用い、そのスケーリング係数をｃ（＝９）とする。 Here, the function ρ(·) is a robust function. It mitigates the drop in the overall likelihood p(z_k | e) caused by observation noise in a small number of attention points (image noise, displacement of the computed attention point positions due to shape model error, and so on). In this embodiment, the Geman-McClure function ρ(x) = c·x²/(1 + x²) is used as the robust function ρ, with scaling coefficient c = 9.
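Because equations (11) to (13) are missing here, the sketch below assumes a common robustified Gaussian form: the normalized residual d_k = (z_k − μ_k(e)) / σ_k(e) is passed through ρ, and each point contributes exp(−ρ(d_k)/2) / (√(2π)·σ_k(e)). Only the Geman-McClure function itself and c = 9 come from the text; the rest is an assumption, and the log-likelihood is used to avoid underflow over many points.

```python
import math


def geman_mcclure(x, c=9.0):
    """Geman-McClure robust function rho(x) = c * x^2 / (1 + x^2); bounded by c."""
    return c * x * x / (1.0 + x * x)


def expression_log_likelihood(z, mu, sigma, c=9.0):
    """Sum of per-point log-likelihoods for observed luminances z under the
    per-point means mu and standard deviations sigma of one expression category."""
    total = 0.0
    for zk, mk, sk in zip(z, mu, sigma):
        d = (zk - mk) / sk  # normalized residual (assumed form)
        total += -0.5 * geman_mcclure(d, c) - math.log(math.sqrt(2.0 * math.pi) * sk)
    return total
```

Since ρ saturates at c, a single badly mismatched attention point lowers the log-likelihood by at most c/2 relative to a perfect match, which is the mitigation the text describes.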

輝度ｚ_ｉ，ｔは、ｋ番目の注目点の頭部姿勢ｈ_ｔにおける画像座標における画像ｚ_ｔでの輝度である。この画像座標は、以下のように算出される。 The luminance z_{i,t} is the luminance in image z_t at the image coordinates that the k-th attention point occupies under head pose h_t. These image coordinates are calculated as follows.

（１）まず、形状モデルＳの正面正立方向を画像での正面正立方向に合わせた状態で、学習画像上に定義されている注目点を平行投影により形状モデルＳ上に投影する（貼り付ける）。
（２）次いで、形状モデルＳの顔面の高さ、及び幅の積が、ベースとした元の顔形状（ここでは、平均顔形状）のそれに等しくなるよう、形状モデルを等方的にスケーリングする。
（３）続いて、形状モデルＳを頭部姿勢ｈ_ｔに従い並進、及び回転させる。
（４）最後に、形状モデルＳ上のｉ番目の注目点を中心投影法により観測画像へと投影する。なお、上記（１）及び（２）の処理は、テンプレートを作成した時点などに１度行っておけば、後は、その結果を必要の度に利用すればよい。 (1) First, with the frontal upright direction of the shape model S aligned with the frontal upright direction in the image, the attention points defined on the learning image are projected (pasted) onto the shape model S by parallel projection.
(2) Next, the shape model is scaled isotropically so that the product of the face height and width of the shape model S equals that of the original face shape on which it is based (here, the average face shape).
(3) Subsequently, the shape model S is translated and rotated according to the head pose h_t.
(4) Finally, the i-th target point on the shape model S is projected onto the observation image by the central projection method. Note that if the processes (1) and (2) are performed once at the time when the template is created, the results may be used as needed.
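Steps (3) and (4) can be sketched as a rigid transform followed by a pinhole (central) projection. The rotation order R = Rz(roll)·Ry(yaw)·Rx(pitch), the pose layout (tx, ty, tz, roll, pitch, yaw), and the `focal_length` parameter are assumptions chosen for illustration; the text does not specify these conventions.

```python
import math


def _matmul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rotation_matrix(roll, pitch, yaw):
    """Rotation R = Rz(roll) * Ry(yaw) * Rx(pitch), angles in radians."""
    cr, sr = math.cos(roll), math.sin(roll)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    rz = [[cr, -sr, 0.0], [sr, cr, 0.0], [0.0, 0.0, 1.0]]
    ry = [[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]]
    rx = [[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]]
    return _matmul(_matmul(rz, ry), rx)


def project_point(point, pose, focal_length):
    """Rotate and translate a 3-D attention point by the head pose, then apply
    central projection. Returns image coordinates (u, v)."""
    tx, ty, tz, roll, pitch, yaw = pose
    rot = rotation_matrix(roll, pitch, yaw)
    x, y, z = (sum(rot[i][j] * point[j] for j in range(3)) + t
               for i, t in enumerate((tx, ty, tz)))
    return (focal_length * x / z, focal_length * y / z)
```

The luminance z_k(h) would then be read from the observed image at the returned coordinates.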

＜頭部姿勢推定用のモデル及び尤度＞
次に、表情カテゴリ及び頭部姿勢推定部３２１において、入力動画像の各フレームにおける対象人物の頭部姿勢を推定する際の尤度について説明する。 <Model and likelihood for head posture estimation>
Next, the likelihood when the facial expression category and head posture estimation unit 321 estimates the head posture of the target person in each frame of the input moving image will be described.

頭部姿勢推定用の輝度分布モデル１９は、注目点の輝度が頭部姿勢に応じてどのように変化するのかをモデル化したものである。それを事前に学習する方法には、２つの方法が考えられる。１つは、様々な頭部姿勢の学習画像があり、さらに、各画像中の注目点の正確な位置が与えられていれば作成可能である。 The luminance distribution model 19 for head pose estimation models how the luminance of the attention points changes with head pose. Two ways of learning it in advance are conceivable. The first applies when learning images covering various head poses are available and the exact positions of the attention points in each image are given; the model can then be created from them.

例えば、レンジセンサなどにより計測した顔形状と、画像と合わせて磁気センサ等を用いて頭部姿勢を計測すればよい。あるいは、照明環境、対象人物の顔形状モデル、及びその反射特性が既知であれば、コンピュータグラフィックス技術を用いてレンダリングを行うことで、任意の頭部姿勢での注目点の輝度を獲得し、それを元にその注目点での輝度分布モデルを計算可能である。形状、及び反射特性については、人物の平均的な形状や、反射特性を用いても近似的なモデルを作成できる。これらのいずれかの方法を選択できる場合には、表情認識用の輝度分布モデルＩ^Ｅ _ｋと同様に、正規分布を用いるなどして、頭部姿勢推定用の輝度分布モデルＩ^Ｈ _ｋを表現すればよい。 For example, the head posture may be measured using a magnetic sensor or the like in combination with the face shape measured by a range sensor or the like and the image. Alternatively, if the lighting environment, the face shape model of the target person, and the reflection characteristics thereof are known, by performing rendering using computer graphics technology, the brightness of the attention point in an arbitrary head posture is obtained, Based on this, it is possible to calculate a luminance distribution model at the point of interest. As for the shape and the reflection characteristic, an approximate model can be created using the average shape of the person and the reflection characteristic. If any of these methods can be selected, the luminance distribution model I ^H _k for head posture estimation can be expressed by using a normal distribution or the like in the same manner as the luminance distribution model I ^E _k for facial expression recognition. That's fine.

However, in many cases neither of the above methods is available. In that case, the likelihood is calculated from the distance between the luminance of the attention points Q^H in the expressionless learning image Y_0 and the observed luminance. This approach differs little from the framework of general head tracking methods based on template matching. Therefore, in this embodiment, the template for head posture estimation is defined according to the method of Reference 6 (D. Mikami, K. Otsuka, and J. Yamato: Memory-based particle filter for face pose tracking robust under complex dynamics, In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 999--1006, 2009.).

As the attention points Q^H for head posture estimation, dipoles (pairs of two points) that straddle luminance edges in the face image are placed sparsely on the face. They are selected one at a time, in descending order of the luminance difference between the two points of the dipole, accepting a candidate only if its center is at least a threshold distance away from every dipole center already selected. Dipoles selected in this way are spread over a wide area of the face. In this embodiment, the number of attention points N^H_k for head posture estimation (twice the number of dipoles) is empirically set to 256.
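The selection rule above can be sketched as a greedy loop. This is a minimal illustration, not the implementation of Reference 6: the candidate format (a dipole center plus its luminance difference), the function name `select_dipoles`, and the parameters `min_dist` and `max_dipoles` are assumptions introduced here.

```python
import math

def select_dipoles(candidates, min_dist, max_dipoles):
    """Greedily pick dipoles: largest luminance difference first,
    keeping only candidates whose center is at least min_dist away
    from every center already chosen.

    candidates: list of (center_xy, luminance_difference) tuples
                (hypothetical format, for illustration only).
    """
    ranked = sorted(candidates, key=lambda c: abs(c[1]), reverse=True)
    chosen = []
    for center, diff in ranked:
        if all(math.dist(center, c) >= min_dist for c, _ in chosen):
            chosen.append((center, diff))
            if len(chosen) == max_dipoles:
                break
    return chosen
```

With 256 attention points, `max_dipoles` would be 128, since each dipole contributes two points.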

For the luminance distribution I^H of these attention points, the mean μ_{k,0} is the luminance at coordinate q_k of the expressionless learning image Y_0, and the standard deviation is assumed to be proportional to the mean, that is, σ_{k,0} ∝ μ_{k,0}. The proportionality constant need not be defined explicitly, because it can be absorbed into the normalization constant in Equation (3).

Following Reference 6 above, the likelihood of the head posture is given by the following Equation (14).

Here, the scaling coefficient c in the robust function ρ is set to 1.
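Equation (14) itself is not reproduced in this text, so the following is only a sketch of a likelihood of this general kind, under loudly stated assumptions: the robust function is taken to be the Geman-McClure-style form ρ(d) = d²/(d² + c), the residual is normalized by the template luminance (reflecting σ_{k,0} ∝ μ_{k,0}), and the likelihood is the exponential of the negative mean robust error. The function names are hypothetical, and the actual form in Reference 6 may differ.

```python
import math

def robust(d, c=1.0):
    # Assumed robust function: grows like d^2 near zero, saturates toward 1.
    return d * d / (d * d + c)

def head_pose_likelihood(observed, template, c=1.0):
    """Unnormalized likelihood of a head-pose hypothesis.

    observed: luminances sampled at the projected attention points.
    template: luminances mu_k from the expressionless image Y_0
              (assumed non-zero, since sigma_k is proportional to mu_k).
    """
    total = sum(robust((y - mu) / mu, c) for y, mu in zip(observed, template))
    return math.exp(-total / len(template))
```

Because ρ saturates, a few badly mismatched points (for example under occlusion) cannot drive the likelihood to zero, which is the usual motivation for a robust function in template matching.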

<Handling changes in facial luminance due to external factors>
In general, the luminance on the face can also change due to factors other than changes in facial expression. Two typical factors are non-uniformity of the lighting environment with respect to the face direction and changes in the brightness of the lighting itself. As described above, this embodiment does not explicitly handle changes in facial luminance due to these external factors, but the framework of the present invention allows them to be handled explicitly.

Two example methods are as follows. First, the change in luminance on the face is assumed to be uniform over the entire face (for example, Reference 6) or uniform within each small region of the face (for example, Patent Document 1). The luminance change coefficient is then either estimated probabilistically, by treating it as a random variable like the facial expression and the head posture in their simultaneous estimation and adding it to the state held by each sample of the particle filter (for example, Reference 6, although it estimates only the head posture and the luminance change coefficient), or estimated deterministically in a maximum likelihood framework without including it in the particle filter samples (for example, Patent Document 1).

Next, the learning image creation unit 2 is described.
The learning images consist of one expressionless, frontal upright face image Y_0 of the target person and, for each target facial expression category e, a plurality of images Y_e in which the face position and orientation are the same as in Y_0. The learning image creation unit 2 prepares the learning images as follows.

First, the expressionless face image Y_0 is any one frame, chosen from the moving images of the facial expression categories to be recognized, in which the target person is expressionless and frontal upright. Even if the input moving image contains no such frame, an image of the target person captured around the same time in the same lighting environment as the target moving image is sufficient. Next, the shape model and the models for head posture estimation, that is, {S, Q^H, I^H}, are created from the expressionless learning image Y_0.

Subsequently, the head posture is estimated for the target moving image using only those models (without the models for facial expression recognition). This is basically the same as the existing head posture estimation method of Reference 6. Then, from the input moving image, a plurality of frames in which the facial expression of the target person belongs to a category to be recognized are selected for each category. Finally, for those frames, the face of the target person in each frame is back-projected onto a frontal upright face image using the estimated head posture and the shape model S, and the result is used as the learning image Y_e for the facial expression category. The size of the frontal upright face image created by back-projection can be chosen freely; about 150 × 200 pixels is sufficient.

Next, the facial expression target estimation unit 322 is described.
The target g at which a facial expression is directed is assumed to be the same as the gaze target. There are broadly two methods for estimating the gaze target. One is to estimate the gaze direction directly (for example, Reference 7: Hirotake Yamazoe, Akira Utsumi, Shinji Abe, Monocular gaze estimation by tracking facial feature points, Journal of the Institute of Image Information and Television Engineers, Vol.61, No.12, pp.1750-1755, 2007.). The other is not to estimate the gaze direction directly but to estimate it from the head posture, which is considered to be highly correlated with the gaze direction. Either method is applicable within the framework of the present invention, but in this embodiment the gaze target is estimated from the estimated head posture, because the facial expression category and head posture estimation unit 321 obtains the head posture of each dialogue participant at the same time as the facial expression category. Previous work on the relationship between gaze direction and head posture during conversation includes Reference 8 (R. Stiefelhagen: Tracking focus of attention in meetings, In Proc. ICMI, pages 273--280, 2002.).

The following describes a method for identifying the person at whom a facial expression is directed from the head posture. In this embodiment, the distribution of the target person's head posture relative to the dialogue participant being gazed at is assumed to follow a normal distribution, and the person at whom the expression is directed is estimated in a maximum likelihood framework, with the head posture output from the facial expression category and head posture estimation unit 321 as input. This is expressed by Equation (15).

Here, ĝ_i = j means that the facial expression of dialogue participant P_i is estimated to be directed at dialogue participant P_j. The angle h^HOR_i is the horizontal angle component of the head posture of P_i output from the facial expression category and head posture estimation unit 321, and φ_{i,j} is the relative azimuth of P_j as seen from P_i. This azimuth can also be calculated from the positions of P_i and P_j output from the facial expression category and head posture estimation unit 321. σ² and κ denote the variance and the scaling coefficient, respectively.

The head posture in the case where a person's facial expression is directed at none of the other dialogue participants, that is, where the gaze is turned away from every other participant, is represented by a uniform distribution. In other words, if the probability of ĝ_i estimated by Equation (15) is at or below a threshold, the gaze target at that time is taken to be none of the other dialogue participants.
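The estimation and thresholding described above can be sketched as follows. Equation (15) is not reproduced in this text, so the exact role of κ is an assumption: here it scales the azimuth to give the expected head angle, and the score is an unnormalized Gaussian. The function name, argument layout, and default threshold are hypothetical.

```python
import math

def estimate_expression_target(h_hor, azimuths, sigma, kappa=1.0, threshold=0.1):
    """Return the participant id j that maximizes the Gaussian score,
    or None when even the best score is at or below the threshold.

    h_hor:    horizontal head angle of participant i (radians).
    azimuths: {j: phi_ij}, relative azimuth of each other participant.
    kappa:    assumed to scale the azimuth into an expected head angle.
    """
    def score(phi):
        diff = h_hor - kappa * phi
        return math.exp(-diff * diff / (2.0 * sigma * sigma))

    best = max(azimuths, key=lambda j: score(azimuths[j]))
    return best if score(azimuths[best]) > threshold else None
```

Returning `None` corresponds to the case where the expression is directed at none of the other participants.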

Next, the inter-person emotion network creation unit 4 is described.
Emotions between persons (inter-person emotions) are estimated from which facial expressions each dialogue participant displayed toward the other participants, and how much.

The inter-person emotion network represents the emotions among the dialogue participants. Various scales are conceivable for this emotion, such as favorable-hostile or the degree of cooperation. Possible criteria for estimating a favorable-hostile emotion network include the amount of favorable or hostile facial expressions displayed between participants and the degree of co-occurrence of facial expressions. First, a method is described that uses favorable-hostile as the scale of inter-person emotion and defines the emotion from the amount of favorable and hostile expressions displayed. If the amount of expression from dialogue participant P_i to dialogue participant P_j is defined as the time during which P_i displays expressions toward P_j, normalized by the length of the dialogue, then the emotion r_{i→j} of P_i toward P_j is expressed by the following Equation (16).

Here, w(e) is a signed weight function that returns 1 when the facial expression category e is a favorable category, −1 when it is a hostile category, and 0 when it is a neutral category. 1(·) is a function that returns 1 if the condition in parentheses is satisfied and 0 otherwise. T is the total number of frames in the target moving image. The structure that collects the estimated inter-person emotions r_{i→j} over all dialogue participants is called the inter-person emotion network.
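From the definitions of w(e), 1(·), and T, the quantity in Equation (16) can be read as a weighted count of frames in which P_i directs an expression at P_j, divided by T. The sketch below follows that reading; since the equation itself is not reproduced in this text, the exact form is an assumption, and the function name and data layout are hypothetical.

```python
def interpersonal_emotion(categories, targets, j, w):
    """Emotion of participant i toward participant j.

    categories: estimated expression category of i in each frame.
    targets:    estimated expression target of i in each frame (id or None).
    w:          dict mapping category -> signed weight (+1, -1, or 0).
    """
    total_frames = len(categories)
    weighted = sum(w.get(e, 0) for e, g in zip(categories, targets) if g == j)
    return weighted / total_frames

# Example: four frames for participant i.
w = {"smile": 1, "anger": -1, "neutral": 0}
cats = ["smile", "neutral", "anger", "smile"]
tgts = [2, 2, 2, 3]
r_i2 = interpersonal_emotion(cats, tgts, 2, w)  # smile and anger cancel out
r_i3 = interpersonal_emotion(cats, tgts, 3, w)  # one favorable frame out of four
```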

Note that here the magnitude (absolute value) of the output of w(·) is 0 or 1; that is, all facial expressions that contribute to inter-person emotion are given equal weight. However, the magnitude may be varied by category. For example, if an expression such as anger is considered to strongly affect inter-person emotion even when displayed only briefly, the weight of that category may be increased. Also, even within the same favorable class, for categories whose directivity of expression differs greatly, such as smiling and loud laughter, the weight of smiling may be made larger and that of loud laughter smaller.

Scales other than the amount of expression displayed can also be used for inter-person emotion. For example, when using the co-occurrence of facial expressions, which is regarded as a good indicator of the degree of intimacy, the following Equation (17) or a similar expression is used.

Here, E^+ denotes the set of facial expression categories that contribute positively to the degree of intimacy, such as smiling. The output of the function w(e) in this case is positive for expressions belonging to E^+, negative for expression categories that contribute negatively to intimacy, such as a wry smile or anger, and 0 for other expressions. Then, while dialogue participant P_i is displaying an expression belonging to E^+ toward dialogue participant P_j, the longer P_j also displays an expression belonging to E^+ toward P_i, the larger the value of r_{i→j}. Conversely, the longer P_j displays an expression not belonging to E^+ toward P_i, and the more P_j's gaze is directed away from P_i, the smaller the value of r_{i→j}.

<Graph representation of the inter-person emotion network>
FIG. 13 is a conceptual diagram showing an example in which the inter-person emotion network is expressed as a graph structure. In this graph representation, each dialogue participant is a circular node, and a directed link (arrow) between two participants represents the emotion r_{i→j} between them. Here, the inter-person emotion r is treated on a single neutral/favorable/hostile dimension and is expressed by the shade (or color) of the link: gray when neutral (r ≈ ±0), closer to white the more favorable (r >> 0), and closer to black the more hostile (r << 0). The thickness (line width) of a link is proportional to the amount of expression displayed in the direction of the target person. If the emotion r has three or fewer dimensions, the emotional state can be expressed in the same way by assigning each component of r to a component of RGB.
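The link shading rule can be sketched as a mapping from r to an 8-bit gray level. This is only an illustration: the assumption that r lies in [−1, 1], the linear mapping, and the function name `link_gray` are not taken from the source.

```python
def link_gray(r):
    """Map emotion r (assumed to lie in [-1, 1]) to a gray level:
    0 is black (hostile), mid-gray is neutral, 255 is white (favorable).
    Values outside the assumed range are clamped."""
    r = max(-1.0, min(1.0, r))
    return round(127.5 * (1.0 + r))
```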

The size of the node for dialogue participant P_i is proportional to the amount (number of frames) of favorable and hostile facial expression categories displayed, and the shade (or color) of the node follows the sum of the emotions r from the other dialogue participants, that is, r_{i→j}, expressed in the same way as the shade (or color) of the links. However, various other indices can serve as the basis for the node size and shade (or color), such as the variance across participants of the amount of expression displayed or of the amount of expression received, described later.

Although the inter-person emotion network is a compact representation, it suggests various states of the dialogue and relationships among the dialogue participants. Examples are given below.

(1) Type of dialogue
It can be estimated to some extent whether the dialogue is a discussion of an agenda on which the participants hold differing opinions, or a dialogue without any major clash of opinions, such as everyday conversation. In the former case, favorable and hostile emotions are mixed. That is, with the emotion r measured on the favorable-hostile scale, the variance of r_{i→j} over i and j is large. In the graph display of FIG. 13, links of various colors appear, from near black to near white. In the latter case, mainly favorable emotions appear. That is, the variance of r_{i→j} over i and j is small and the mean is large. In the graph display of FIG. 13, the links are mainly colored between gray and white.
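The mean and variance over all ordered pairs mentioned above can be computed directly from the network. In this sketch the network is assumed to be stored as a dictionary keyed by (i, j); that layout and the function name are hypothetical, and the example values are made up purely for illustration.

```python
from statistics import mean, pvariance

def network_stats(r):
    """Mean and population variance of r[(i, j)] over all ordered pairs."""
    values = list(r.values())
    return mean(values), pvariance(values)

# Illustrative values only: a heated discussion versus a friendly chat.
debate = {(1, 2): 0.8, (2, 1): -0.7, (1, 3): 0.6, (3, 1): -0.5}
chat = {(1, 2): 0.6, (2, 1): 0.5, (1, 3): 0.7, (3, 1): 0.6}
debate_mean, debate_var = network_stats(debate)
chat_mean, chat_var = network_stats(chat)
```

How large a variance must be to count as a "discussion" is not specified in the source; any cutoff would have to be chosen empirically.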

(2) Degree of excitement of the dialogue
When the emotion r is defined as the degree of co-occurrence of facial expressions, a small variance and a large mean of r_{i→j} over many combinations of i and j suggest that the dialogue was lively. Also, if both r_{i→j} and r_{j→i} are large for dialogue participants P_i and P_j, this suggests that the dialogue was lively at least between those two.

(3) Human relationships
The network shows who is on favorable terms with whom and, conversely, who is in conflict with whom. For example, the relationship between dialogue participants P_i and P_j can be regarded as favorable if r_{i→j} > τ ∩ r_{j→i} > τ (τ > 0), hostile if r_{i→j} < −τ ∩ r_{j→i} < −τ, neutral if |r_{i→j}| < τ ∩ |r_{j→i}| < τ, and otherwise as a one-sided feeling held by one of the two.
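The four-way rule above translates directly into code. The function name and the label strings are illustrative choices, not part of the source.

```python
def classify_relationship(r_ij, r_ji, tau):
    """Classify a pair of participants from the two directed emotions,
    using the threshold tau > 0."""
    if r_ij > tau and r_ji > tau:
        return "favorable"
    if r_ij < -tau and r_ji < -tau:
        return "hostile"
    if abs(r_ij) < tau and abs(r_ji) < tau:
        return "neutral"
    return "one-sided"
```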

(4) Grouping
The dialogue participants can be divided into groups of people who are on favorable terms with each other. For example, a set of participants connected by links for which the emotion between the two satisfies min{r_{i→j}, r_{j→i}} > τ′ can be regarded as one group. In the example shown in FIG. 13, there are, for example, a group consisting of dialogue participants P_1 and P_2 and a group consisting of dialogue participants P_3 and P_4.
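One way to realize this grouping is to treat pairs with min{r_{i→j}, r_{j→i}} > τ′ as undirected edges and take connected components. The sketch below does this with a simple depth-first search; the dictionary layout of `r`, the treatment of missing entries as 0, and the function name are assumptions made for illustration.

```python
def friendly_groups(people, r, tau):
    """Connected components of the graph whose edges are the pairs
    with min(r[(a, b)], r[(b, a)]) > tau. Missing entries count as 0."""
    adj = {p: set() for p in people}
    for a in people:
        for b in people:
            if a < b and min(r.get((a, b), 0.0), r.get((b, a), 0.0)) > tau:
                adj[a].add(b)
                adj[b].add(a)
    groups, seen = [], set()
    for p in people:
        if p in seen:
            continue
        seen.add(p)
        stack, component = [p], []
        while stack:
            q = stack.pop()
            component.append(q)
            for n in adj[q] - seen:  # set difference builds a new set
                seen.add(n)
                stack.append(n)
        groups.append(sorted(component))
    return groups
```

A participant with no qualifying link ends up in a group of one.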

(5) Individual personality
Let r_IN(i) = Σ_{j(≠i)} r_{j→i} denote the sum of the emotions that the dialogue participants other than P_i hold toward P_i, and let r_OUT(i) = Σ_{j(≠i)} r_{i→j} denote the sum of the emotions that P_i holds toward all the other participants. When r_OUT(i) is large (small) and the variance of r_{i→j} over j is small, this implies that P_i is trying to keep the dialogue as a whole flowing well (finds the topic boring), or regards the other participants favorably (with hostility) without distinction, or, conversely, wants to be liked by everyone (does not mind being liked by no one). When the variance of r_{i→j} over j is large, this implies that P_i regards some of the participants especially favorably (with hostility). When r_OUT(i) is near 0 and the variance of r_{i→j} over j is small, this implies that P_i has little interest in any of the other participants or in the topic itself.
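The quantities r_IN(i), r_OUT(i), and the variance of r_{i→j} over j can be computed as below. The dictionary layout of `r` (one entry for every ordered pair) and the function name `personal_profile` are assumptions; the source does not say whether population or sample variance is meant, so population variance is used here.

```python
from statistics import pvariance

def personal_profile(i, people, r):
    """Incoming sum, outgoing sum, and variance of outgoing emotions
    for participant i. r must contain every ordered pair (a, b)."""
    others = [j for j in people if j != i]
    outgoing = [r[(i, j)] for j in others]
    incoming = [r[(j, i)] for j in others]
    return {
        "r_in": sum(incoming),
        "r_out": sum(outgoing),
        "out_variance": pvariance(outgoing),
    }
```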

As described in (1) to (5) above, an estimation unit may be provided that estimates at least one of the type of dialogue, the degree of excitement of the dialogue, the human relationships, the composition of subgroups on favorable terms, and individual personalities, based on the inter-person emotion network, that is, the emotional relationships among a plurality of persons that are output in graphical form, for example as the graph structure shown in FIG. 13.

According to the embodiment described above, the following remarkable effects, not available with conventional techniques, can be expected.
(1) Visually similar facial expression categories can be recognized in real time and with high accuracy for a person in a dialogue who changes head posture over a wide range.
(2) A person-specific template effective for distinguishing visually similar facial expression categories can be created from a small number of learning images.
(3) It is possible to recognize which dialogue participant displayed which category of facial expression, to whom, and when.
(4) It is possible to estimate the emotion of one dialogue participant toward another, such as favorable-hostile or cooperative.
(5) The structure of the inter-person emotion network can be displayed as a graph that can be grasped immediately and intuitively.
(6) Based on the inter-person emotion network, various states of the dialogue and relationships between persons can be estimated, such as the type of dialogue, the degree of excitement of the dialogue, human relationships, subgroups on favorable terms, and individual personalities.

1 Input unit
2-1 to 2-N Learning image creation unit
3-1 to 3-N Facial expression estimation unit
4 Inter-person emotion network creation unit
31 Template creation unit
32 Facial expression category and expression target estimation unit
321 Facial expression category and head posture estimation unit
322 Facial expression target estimation unit
3211 Particle filter
3212 Estimated value calculation unit
5 Shape model creation unit
6 Luminance average map and luminance variance map creation unit
7 Saliency calculation unit
8 Attention point selection unit
9 Luminance distribution model creation unit

Claims

1. A facial expression recognition device comprising:
an input means for photographing a person and inputting the result as an input moving image; and
a facial expression estimation means for estimating the head posture of the person based on the likelihood of the head posture with respect to the input moving image input by the input means and outputting it as a head posture estimated value, and for estimating the facial expression category of the person based on the likelihood of the facial expression category with respect to the input image input by the input means and outputting it as a facial expression category estimated value.

2. The facial expression recognition device according to claim 1, wherein the facial expression estimation means:
calculates a first likelihood based on a set of attention points, which is predetermined for recognizing the facial expression category and is placed in a predetermined region of the input moving image, and on its luminance distribution model;
calculates a second likelihood based on a set of attention points, which is predetermined for estimating the head posture and is placed in a predetermined region of the input moving image, and on its luminance distribution model; and
calculates the product of the first likelihood and the second likelihood as a joint likelihood of the facial expression category and the head posture, and estimates and outputs the facial expression category estimated value and the head posture estimated value based on the calculated joint likelihood.

3. The facial expression recognition device according to claim 2, wherein the facial expression estimation means places the set of attention points predetermined for recognizing the facial expression category in a region of the input moving image, determined in advance for the person, in which the change due to the person's facial expression is significant.

4. An inter-person emotion estimation device comprising:
an input means for inputting an input moving image obtained by photographing a plurality of persons;
a facial expression category and head posture estimation means for estimating the head posture of each person based on the likelihood of the head posture of each person with respect to the input moving image input by the input means and outputting it as a head posture estimated value of each person, and for estimating the facial expression category of each person based on the likelihood of the facial expression category of each person with respect to the input image and outputting it as a facial expression category estimated value of each person; and
an expression target estimation means for estimating the target of each person's facial expression based on the head posture estimated value of each person and outputting it as an expression target estimated value.

5. The inter-person emotion estimation device described above, further comprising an inter-person emotion network creation means for outputting, in graphical form, the emotional relationships among the plurality of persons based on the facial expression category estimated value and the expression target estimated value.

6. The inter-person emotion estimation device according to claim 5, further comprising an estimation unit that estimates at least one of the type of dialogue, the degree of excitement of the dialogue, the human relationships, the subgroups on favorable terms, and individual personalities, based on the emotional relationships among the plurality of persons output in graphical form by the inter-person emotion network creation means.

7. A facial expression recognition method comprising:
a step in which an input means photographs a person and inputs the result as an input moving image; and
a step in which a facial expression estimation means estimates the head posture of the person based on the likelihood of the head posture with respect to the input moving image input by the input means and outputs it as a head posture estimated value, and estimates the facial expression category of the person based on the likelihood of the facial expression category with respect to the input image and outputs it as a facial expression category estimated value.

8. An inter-person emotion estimation method comprising:
a step in which an input means inputs an input moving image obtained by photographing a plurality of persons;
a step in which a facial expression estimation means estimates the head posture of each person based on the likelihood of the head posture of each person with respect to the input moving image input by the input means and outputs it as a head posture estimated value of each person, and estimates the facial expression category of each person based on the likelihood of the facial expression category of each person with respect to the input image input by the input means and outputs it as a facial expression category estimated value of each person; and
a step of estimating the target of each person's facial expression based on the head posture estimated value of each person and outputting it as an expression target estimated value.

9. A program that causes a computer of a facial expression recognition device, which recognizes the facial expression of a person based on image processing, to execute:
an input function of photographing a person and inputting the result as an input moving image; and
a facial expression estimation function of estimating the head posture of the person based on the likelihood of the head posture with respect to the input moving image and outputting it as a head posture estimated value, and of estimating the facial expression category of the person based on the likelihood of the facial expression category with respect to the input image and outputting it as a facial expression category estimated value.

10. A program that causes a computer of an inter-person emotion estimation device, which recognizes the facial expressions of a plurality of persons based on image processing and estimates the emotions among the plurality of persons, to execute:
an input function of inputting an input moving image obtained by photographing the plurality of persons;
a facial expression category and head posture estimation function of estimating the head posture of each person based on the likelihood of the head posture of each person with respect to the input moving image and outputting it as a head posture estimated value of each person, and of estimating the facial expression category of each person based on the likelihood of the facial expression category of each person with respect to the input image and outputting it as a facial expression category estimated value of each person; and
an expression target estimation function of estimating the target of each person's facial expression based on the head posture estimated value of each person and outputting it as an expression target estimated value.