JP7346528B2

JP7346528B2 - Image processing device, image processing method and program

Info

Publication number: JP7346528B2
Application number: JP2021192448A
Authority: JP
Inventors: 俊太舘; 泰弘奥野; 日出来空門
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2021-05-26
Filing date: 2021-11-26
Publication date: 2023-09-19
Anticipated expiration: 2041-11-26
Also published as: JP2022182960A

Description

The present invention relates to face recognition technology using images.

Face recognition technology determines whether the face of a person in one image belongs to the same person as a face in another image. Matching is generally difficult when the state of the subject or the shooting conditions differ, for example the viewing angle of the object at the time of shooting, the lighting, or the presence or absence of worn items such as masks and glasses. Patent Document 1 therefore determines, when extracting a person's features from an image, whether a mask or glasses are worn, and dynamically changes the image region from which feature amounts are extracted according to the result.

Patent Document 1: Japanese Patent No. 4957056

However, Patent Document 1 requires that, when a person is registered, features for multiple patterns be stored according to the state of worn items and the like.

The present invention has been made in view of the above problem, and its object is to reduce the amount of information that must be registered when matching objects that are in different states.

An image processing device according to the present invention that solves the above problem comprises: first acquisition means for acquiring a first feature amount of a first object that meets a predetermined condition in a first image, the first feature amount being obtained using a first trained model; second acquisition means for acquiring a second feature amount of a second object that does not meet the predetermined condition in a second image, the second feature amount being obtained using a second trained model; and matching means for determining, based on the first feature amount and the second feature amount, whether the first object in the first image and the second object in the second image are the same object. The first and second trained models are trained so that, when the first object and the second object are the same object, the first feature amount obtained using the first trained model for the first object that meets the predetermined condition and the second feature amount obtained using the second trained model for the second object that does not meet the predetermined condition are similar to each other.

According to the present invention, the amount of information to be registered can be reduced when matching objects that are in different states.

Block diagram showing an example of the functional configuration of an image processing device
Block diagram showing an example of the hardware configuration of an image processing device
Schematic diagram showing an example of the operation of the matching process
Flowchart showing processing performed by the image processing device
Flowchart showing processing performed by the image processing device
Schematic diagram showing an example of the operation of the learning process
Schematic diagram showing an example of the operation of the learning process
Flowchart showing processing performed by the image processing device
Schematic diagram showing an example of the learning process
Block diagram showing an example of the functional configuration of an image processing device
Flowchart showing processing performed by the image processing device
Flowchart showing processing performed by the image processing device
Schematic diagram showing an example of the operation of the learning process
Block diagram showing an example of the functional configuration of an image processing device
Block diagram showing an example of the functional configuration of an image processing device
Schematic diagram showing an example of the operation of the matching process
Flowchart showing processing performed by the image processing device
Flowchart showing processing performed by the image processing device
Block diagram showing an example of the functional configuration of an image processing device
Flowchart showing processing performed by the image processing device
Flowchart showing processing performed by the image processing device

<Embodiment 1>
An image processing device according to an embodiment of the present invention will be described with reference to the drawings. Elements with the same reference numerals across drawings operate in the same way, and redundant description is omitted. The components described in this embodiment are merely examples and are not intended to limit the scope of the invention.

Conventional face recognition techniques have two main problems. Either (1) features for multiple patterns must be stored at registration time according to the state of worn items and the like, or (2) the feature amounts of the registered image must be converted after the state of the person (mask and so on) has been determined. When many registered persons are to be matched, method (1) requires a large storage area and method (2) suffers from slow matching. The image processing device according to this embodiment converts each image into a feature amount using a different feature amount conversion means depending on the state of the object at the time of shooting, and then performs matching. This gives better matching accuracy than conventional methods that do not change the feature amount conversion means according to the state. In addition, although different conversion means are used, training is adjusted so that the output feature amounts are similar to each other for the same object. The features can therefore be used in matching without distinguishing which conversion produced them. As a result, less memory is needed to store feature amounts than in the conventional method that extracts feature amounts for multiple registered image patterns, or the computational cost and speed of matching are improved.

FIG. 1 is a diagram showing an example of the functional configuration of the image processing device. The image processing device 1 includes a first image acquisition unit 101, a second image acquisition unit 102, an object parameter determination unit 103, a storage unit 104, a first feature amount conversion unit 105, a second feature amount conversion unit 106, and a feature amount matching unit 107. Details are described later.

FIG. 2 is a hardware configuration diagram of the image processing device 1 in this embodiment. The CPU H101 controls the entire device by executing a control program stored in the ROM H102. The RAM H103 temporarily stores various data from each component, and programs are loaded into it so that the CPU H101 can execute them. The storage unit H104 stores the conversion parameters used for the image conversion of this embodiment. An HDD, flash memory, various optical media, and the like can be used as the medium of the storage unit H104. The acquisition unit H105 consists of a keyboard, touch panel, dial, or the like, accepts input from the user, and is used for purposes such as setting an arbitrary viewpoint when reconstructing an image of the subject. The display unit H106 consists of a liquid crystal display or the like and displays the result of reconstructing the image of the subject. The device can also communicate with an imaging device and other devices via the communication unit H107.

<Image matching processing phase>
FIG. 3 is a schematic diagram of the matching process of this embodiment and shows the difference between the method of the present invention and the conventional method. FIG. 3(A) shows the conventional method, in which feature amount conversion is applied with the same parameters to an input image containing the person to be authenticated and to a registered image containing a registered person. If there is a large change in appearance, such as whether a mask or sunglasses are worn, accuracy tends to degrade. Trying to handle every change in appearance, on the other hand, makes the feature amount conversion unit large. FIG. 3(B) is a schematic example of the present invention. When an input image is supplied, the object parameter determination unit 103 determines the state of the subject, such as whether a mask is worn. According to the determination result, the feature amount conversion unit 106 reads the appropriate conversion parameters from the storage unit 104 and performs feature amount conversion. Multiple kinds of conversion parameters have been learned according to the state of the person and the shooting environment. Because each set of conversion parameters is trained individually for a specific state of the subject, robust matching is possible even under large changes in appearance such as the presence or absence of a mask or sunglasses.

In the method of this embodiment, training is performed so that feature amounts of the same object have high similarity to each other regardless of which conversion parameters produced them (the training method is described later). The feature amount matching unit 107 therefore only needs to compute the similarity with a basic method such as the inner product of, or the angle between, feature amounts, and no special processing is required. In this way a single kind of similarity serves as a uniform matching measure regardless of the state of the object. For example, the method of Patent Document 1 must store as many feature amounts per registered person as there are feature extraction methods, whereas the method of this embodiment applies a single set of conversion parameters to each registered person, so the feature amounts to be registered can be reduced.

Next, the matching procedure is described with reference to FIG. 4. The purpose of this embodiment is to determine, given two images of persons, whether they show the same person or different persons, based on image feature amounts. The processing shown in the flowchart of FIG. 4 is executed by the CPU H101 of FIG. 2, which is a computer, according to a computer program stored in the storage unit H104. In the following description, each step is denoted by a number prefixed with S, and the word "step" is omitted.

First, in S101, the first image acquisition unit 101 acquires the first image, which contains the object to be authenticated (here, a person). In S102, the determination unit 103 determines whether the first image satisfies a predetermined condition. If the condition is satisfied, the state of the object and the shooting environment are judged to be normal (close to the environment used in training). Otherwise, for example when a mask is worn or the illuminance of the environment has changed, the state is judged not to be normal. Specifically, the determination here is whether the person in the first image is wearing a mask. The mask is detected with a technique such as template matching. If the predetermined condition is satisfied (no mask is worn), the process proceeds to S103. If the predetermined condition is not satisfied (a mask is worn), the process proceeds to S104.

In S103, the first feature amount conversion unit (first feature acquisition unit) 105 reads the feature amount conversion parameters for a normal person (the first parameter set) and sets them in the trained model. The trained model is a neural network for acquiring the feature amount of an object from an image. The trained model to which the first parameter set has been applied is called the first trained model. In S104, the first feature amount conversion unit 105 reads the feature amount conversion parameters for a person wearing a mask (the second parameter set) and sets them in the trained model. The trained model to which the second parameter set has been applied is called the second trained model. The feature amount conversion unit 105 is composed of, for example, the convolutional neural network known from Non-Patent Document 1, or of a deep neural network (hereinafter, DNN) called a Transformer network, known from Patent Document 2. In other words, the feature amount conversion unit 105 is a trained model for acquiring the features of a person in an image, and it acquires the feature amount using a parameter set trained according to the state of that person. (Non-Patent Document 1: Deng et al., ArcFace: Additive Angular Margin Loss for Deep Face Recognition. In CVPR, 2019.) (Patent Document 2: U.S. Patent No. 10956819.) Here, the feature amount conversion parameters are the various parameters such as the number of layers, the number of neurons, and the connection weights. Next, in S105, the first feature amount conversion unit 105 converts the first image received from the first image acquisition unit 101 into a feature amount using the first trained model or the second trained model.

Next, in S106 to S110, the same processing as in S101 to S105 is performed on the second image. That is, if the person in the second image is not wearing a mask, the feature amount is acquired from the first trained model, to which the first parameter set has been applied. If the person in the second image is wearing a mask, the feature amount is acquired from the second trained model, to which the second parameter set has been applied. This processing is, however, performed by the second image acquisition unit 102 and the second feature amount conversion unit (second feature acquisition unit) 106. The first image and the second image are thereby each converted into a feature amount, denoted f_1 and f_2. As in Non-Patent Document 1, f_1 and f_2 are one-dimensional vectors (obtained by passing through the fully connected layer of the DNN). The DNN parameters received by the first feature amount conversion unit 105 and the second feature amount conversion unit 106 need not have the same configuration, but the number of output channels of the final layer is the same, so that f_1 and f_2 have the same number of dimensions.
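The state-dependent selection in S102 to S105 (and the corresponding steps for the second image) can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the names `extract_feature`, `extract_pair`, `wears_mask`, and the `"normal"` / `"masked"` keys are hypothetical, and the two trained models are represented only as callables that return a feature vector.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Model = Callable[[object], Vector]


def extract_feature(image: object,
                    wears_mask: Callable[[object], bool],
                    models: Dict[str, Model]) -> Vector:
    """Pick the trained model that matches the subject's state and apply it."""
    # S102: check the predetermined condition (here: is a mask worn?).
    state = "masked" if wears_mask(image) else "normal"
    # S103 / S104: select the parameter set (model) for that state.
    model = models[state]
    # S105: convert the image into a feature vector.
    return model(image)


def extract_pair(first_image: object, second_image: object,
                 wears_mask: Callable[[object], bool],
                 models: Dict[str, Model]) -> Tuple[Vector, Vector]:
    """Extract f_1 and f_2 and check that they are comparable."""
    f1 = extract_feature(first_image, wears_mask, models)
    f2 = extract_feature(second_image, wears_mask, models)
    # The models may differ internally, but their output lengths must agree.
    if len(f1) != len(f2):
        raise ValueError("feature vectors have different dimensions")
    return f1, f2
```

Because both models write into the same feature space, the caller does not need to record which model produced which vector.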

Next, in S111, the feature amount matching unit 107 calculates a similarity score between the two feature amounts. That is, based on the first feature amount and the second feature amount, it determines whether the object in the first image and the object in the second image are the same. If the similarity score between the first and second feature amounts is greater than or equal to a predetermined threshold, the two images contain the same object. If the similarity score is smaller than the threshold, the two images contain different objects. Several measures of similarity between feature amounts are known. Here, as in Non-Patent Document 1, the angle between the feature vectors is used, and the similarity score is calculated as follows.
(Formula 1)
$$\mathrm{score}(f_1, f_2) := \cos\theta_{12} = \frac{\langle f_1, f_2 \rangle}{|f_1| \cdot |f_2|}$$
Here θ_12 is the angle between the feature vectors f_1 and f_2, <x, y> is the inner product of x and y, and |x| is the length of x. The feature amount matching unit 107 determines that the two persons are the same if the similarity score is greater than or equal to the predetermined threshold, and that they are different persons otherwise. This completes the matching process. The feature amounts of the first image and the second image may also be acquired by a shared image acquisition unit and a shared feature amount conversion unit.
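Formula 1 and the threshold decision of S111 can be written out as a short sketch. The function names and the default threshold of 0.5 are assumptions made for illustration; the source only states that a predetermined threshold is used.

```python
import math
from typing import Sequence


def similarity_score(f1: Sequence[float], f2: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (Formula 1)."""
    if len(f1) != len(f2):
        raise ValueError("feature vectors must have the same length")
    dot = sum(a * b for a, b in zip(f1, f2))
    norm1 = math.sqrt(sum(a * a for a in f1))
    norm2 = math.sqrt(sum(b * b for b in f2))
    if norm1 == 0.0 or norm2 == 0.0:
        raise ValueError("zero-length feature vector")
    return dot / (norm1 * norm2)


def is_same_person(f1: Sequence[float], f2: Sequence[float],
                   threshold: float = 0.5) -> bool:
    """Same person if the score reaches the (assumed) threshold."""
    return similarity_score(f1, f2) >= threshold
```

The score depends only on the direction of the vectors, so scaling either feature vector by a positive factor does not change the decision.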

<Learning processing phase>
The learning phase of this embodiment is described next. Training uses the <representative vector method> known from Non-Patent Document 1. The representative vector method is a face recognition training technique that sets a feature vector representing each person and uses these vectors during training to improve learning efficiency. See Non-Patent Document 1 for details. The image processing device 2 used in the learning processing phase is shown in FIG. 14. The image conversion unit 200 converts a first image group, which is a set of reference images of the target (for example, face images of persons with nothing worn), into a second image group, which is a set of images showing a predetermined state of the target (for example, face images of persons wearing masks). Specifically, it composites an image of a worn item such as a mask onto the face image, or converts the image to a certain brightness. The image acquisition unit 201 acquires the image groups used for training. Because two or more parameter sets are trained, two or more image groups are acquired. The feature amount conversion unit 202 obtains a feature amount from each image using a parameter set corresponding to the state of the image and a learning model that extracts feature amounts from images. The learning unit 203 trains the parameter sets and the learning model that extracts feature amounts from images. This embodiment describes an example in which the first learning model and the second learning model are trained alternately.
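As an illustration of how the image conversion unit 200 might build the second image group, the sketch below composites a worn-item image onto a face image by alpha blending. The source does not specify the compositing method, so the blending rule, the function name `composite_item`, and the representation of images as nested lists of grayscale values are all assumptions.

```python
from typing import List

Image = List[List[float]]  # grayscale image stored as rows of pixel values


def composite_item(face: Image, item: Image, alpha: Image,
                   top: int, left: int) -> Image:
    """Blend `item` onto a copy of `face` with its corner at (top, left).

    `alpha` has the same shape as `item`; 1.0 means the item fully covers
    the face pixel and 0.0 means the face pixel is kept unchanged.
    """
    out = [row[:] for row in face]  # leave the original face image untouched
    for r, (item_row, alpha_row) in enumerate(zip(item, alpha)):
        for c, (pixel, a) in enumerate(zip(item_row, alpha_row)):
            y, x = top + r, left + c
            # Ignore any part of the item that falls outside the face image.
            if 0 <= y < len(out) and 0 <= x < len(out[y]):
                out[y][x] = a * pixel + (1.0 - a) * out[y][x]
    return out
```

A synthesized image keeps the person ID of the face image it was derived from, which is what lets both image groups be trained against the same identities.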

The processing flow of this embodiment consists of FIGS. 5(A) and 5(B). The processing shown in FIG. 5(A) is called the <first learning process>, and the processing shown in FIG. 5(B) is called the <second learning process>. The <first learning process> trains normal feature amount conversion using a group of images of persons not wearing masks (the first image group). The <second learning process> performs training specialized for masked persons using a group of images of persons wearing masks (the second image group). In FIG. 14, the solid-line parts are the components used in the <first learning process>, and the broken-line parts are the components used in the <second learning process>.

The <first learning process> basically follows the method of Non-Patent Document 1. FIG. 5(A) shows the processing that the image processing device executes in the learning phase. First, in S201, the feature amount conversion unit 202 initializes the parameter set of the first learning model and the representative vectors v_1 to v_n with random numbers. Here 1 to n are the IDs of all persons contained in the training images. Each representative vector v is a d-dimensional vector (d is a predetermined value).

Next, in S202, the image acquisition unit 201 acquires images I_1 to I_m selected at random from the first image group. The first image group is the reference image group: it consists of images of multiple persons not wearing masks and contains one or more images per person. Each image is annotated with the ID of the person.

Next, in S203, the feature amount conversion unit 202 obtains a first learning feature amount f_i by inputting each image I_i of the first image group into the first learning model. The learning feature amount f_i is a d-dimensional vector. Next, in S204, the feature amount conversion unit 202 calculates a loss value based on the similarity between the feature amount of each person image and that person's representative vector (intra-class similarity) and the similarity between the feature amount of each person image and the representative vectors of other persons (inter-class similarity).

(Formula 2)
$$\mathrm{intra}(f_i) = \mathrm{score}(f_i, v_{y(i)}), \qquad \mathrm{inter}(f_i) = \sum_{j \neq y(i)} \mathrm{score}(f_i, v_j)$$
Here intra(f_i) is the intra-class similarity score, inter(f_i) is the inter-class similarity score, and y(i) is the ID number of the person in image I_i. Summing these over the images as shown below gives the loss value used for training.

(Formula 3)
$$\mathrm{loss} = \sum_i \left( \mathrm{inter}(f_i) - \lambda \, \mathrm{intra}(f_i) \right)$$
λ is a weight parameter that balances the two terms during training. The above is only one example of a loss value. Various known alternatives exist, such as similarity scores with a margin or cross entropy. See Non-Patent Document 1 and similar references for details.
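The loss of Formulas 2 and 3 can be computed directly from a batch of feature vectors and the representative vectors. The sketch below assumes that person IDs are 0-based indices into the list of representative vectors and that the sum in Formula 3 covers both terms. The names `cosine` and `batch_loss` are illustrative.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Similarity score of Formula 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def batch_loss(features: List[Vector], labels: List[int],
               reps: List[Vector], lam: float = 1.0) -> float:
    """Sum over images of inter-class minus lam times intra-class similarity."""
    total = 0.0
    for f, y in zip(features, labels):
        # Intra-class: similarity to the person's own representative vector.
        intra = cosine(f, reps[y])
        # Inter-class: summed similarity to every other representative vector.
        inter = sum(cosine(f, v) for j, v in enumerate(reps) if j != y)
        total += inter - lam * intra
    return total
```

A feature that points along its own representative vector and away from the others lowers the loss, which is the behaviour the update steps S205 and S206 rely on.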

Next, in S205 and S206, the learning unit 203 updates the representative vectors and the first parameter set of the feature amount conversion unit (first learning model) so as to reduce the above loss value. In S205 the learning unit 203 updates the values of the representative vectors, and in S206 it updates the first parameter set. Using error backpropagation, which is standard for DNNs, the values are updated in small steps in the direction that reduces the loss. As a result, the representative vectors come to function better as values representing the features of each person, and the first trained model is improved so that feature amounts of the same person become similar to each other.

The above learning process is repeated until training converges or a predetermined number of iterations is reached (S207). Next, in S208 and S209, the storage unit 104 stores the first parameter set and the values of the representative vectors v_1 to v_n.

FIG. 6 schematically shows an example of the result at the end of the <first learning process>. In the feature space 600, representative vectors 601, 602, and 603 have been obtained as feature vectors representing the persons with IDs 1 to 3. In addition, the first parameter set has been trained so that the features of each person, such as features a and b or features p and q, lie near the corresponding representative vector (the image features of each person are shown as black dots in the figure).

Next, the <second learning process> is performed. In this process, a feature amount conversion DNN for persons wearing masks (the second learning model) is trained using a group of training images of persons wearing masks (the second image group).

The <second learning process> is described with reference to FIG. 5(B). As preparation, in S300, the image conversion unit 200 converts the first image group into a second image group that satisfies the predetermined condition. Specifically, images with worn items such as masks or sunglasses composited onto them, or images with different illuminance, are generated using an existing conversion method. If the second image group has been prepared in advance, S300 may be skipped. In S301, the feature amount conversion unit 202 acquires the first parameter set and uses it as the initial value of the parameters of the second learning model. Next, in S302 to S306, the second parameter set of the second learning model is trained in the same way as in the processing flow of FIG. 5(A). The processing and the loss calculation are the same as in S202 to S207, except that the update of the representative vectors v_1 to v_n performed in S205 is not carried out. Instead, the values saved in S208 of the previous stage are used as fixed values. Training therefore moves the feature amounts of a person wearing a mask toward the representative vector obtained for that person without a mask. When training has converged, in S307 the storage unit 104 saves the second parameter set and training ends. The representative vectors are used only during training and are not used in the matching operation.
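The idea of the <second learning process>, starting from the first parameter set and updating only the model while the representative vectors stay fixed, can be illustrated with a deliberately small toy. This sketch is not the training procedure of the embodiment: it replaces the DNN with a small linear map `W` and replaces backpropagation with central finite-difference gradients, and all names (`apply_model`, `stage_loss`, `train_second_stage`) and hyperparameters are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]
Sample = Tuple[Sequence[float], int]  # (input vector, 0-based person ID)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def apply_model(W: Matrix, x: Sequence[float]) -> List[float]:
    """Toy stand-in for the feature extractor: f = W x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def stage_loss(W: Matrix, samples: List[Sample],
               reps: List[Sequence[float]], lam: float = 1.0) -> float:
    """Loss of Formula 3 evaluated with fixed representative vectors."""
    total = 0.0
    for x, y in samples:
        f = apply_model(W, x)
        inter = sum(cosine(f, v) for j, v in enumerate(reps) if j != y)
        total += inter - lam * cosine(f, reps[y])
    return total


def train_second_stage(W1: Matrix, samples: List[Sample],
                       reps: List[Sequence[float]], lr: float = 0.05,
                       steps: int = 100, eps: float = 1e-5) -> Matrix:
    W2 = [row[:] for row in W1]  # S301: initialise from the first parameter set
    for _ in range(steps):
        grad = [[0.0] * len(row) for row in W2]
        for i, row in enumerate(W2):
            for j, old in enumerate(row):
                # Central finite difference for one weight (toy gradient).
                row[j] = old + eps
                up = stage_loss(W2, samples, reps)
                row[j] = old - eps
                down = stage_loss(W2, samples, reps)
                row[j] = old
                grad[i][j] = (up - down) / (2.0 * eps)
        for i, row in enumerate(W2):
            for j in range(len(row)):
                row[j] -= lr * grad[i][j]  # only W2 changes; reps stay fixed
    return W2
```

Because `reps` is never written to, the feature space learned in the first stage is preserved, and only the second model adapts to the masked inputs.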

図７は＜二回目の学習処理＞の開始時点を模式的に示した図である。代表ベクトル６０１，６０２，６０３の位置は固定され、以降学習による更新はされない。マスクを装着した人物の画像ｃ，画像ｄは、その人物の代表ベクトル６０１から遠いところに位置している。＜二回目の学習処理＞の学習調整を行うことで、特徴ｃ（付番７０２）に矢印を付して示すように、各人物の特徴はそれぞれの代表ベクトルの方向に近づくように、第二のパラメータセットが学習される。これにより、学習の収束時には、マスク非装着人物の画像（図６のａ，ｂ）に対して第一のパラメータセットを用いた特徴量と、マスクを装着した人物の画像（図７のｃ，ｄ）に対して第二のパラメータセットを用いた特徴量とが、特徴空間上で近接するようになる。 FIG. 7 is a diagram schematically showing the start of the <second learning process>. The positions of representative vectors 601, 602, and 603 are fixed and are not updated by subsequent learning. The features of images c and d of a person wearing a mask are located far from that person's representative vector 601. Through the learning adjustment in the <second learning process>, the second parameter set is learned so that each person's features move toward that person's representative vector, as indicated by the arrow attached to feature c (numbered 702). As a result, when learning converges, the feature amounts obtained with the first parameter set from images of a person not wearing a mask (a and b in FIG. 6) and the feature amounts obtained with the second parameter set from images of the person wearing a mask (c and d in FIG. 7) lie close to each other in the feature space.
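The second learning process described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names, the use of cosine similarity, the weight `lam`, and the exact form of the loss are assumptions made here. It only shows the key point of the text, namely that the representative vectors stay fixed while the masked-image features are scored against them.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def frozen_representative_loss(masked_features, person_ids, representatives, lam=1.0):
    """Loss for features of masked images against FIXED representative vectors.

    masked_features: feature vectors produced by the second learning model.
    person_ids:      person label for each feature vector.
    representatives: dict person_id -> representative vector saved after the
                     first learning process (never updated here).
    The loss form (other-person similarity minus lam * own-person similarity)
    is an assumed stand-in for the loss of the first learning process.
    """
    loss = 0.0
    for feature, pid in zip(masked_features, person_ids):
        own = cosine_similarity(feature, representatives[pid])
        others = sum(
            cosine_similarity(feature, vec)
            for other_pid, vec in representatives.items()
            if other_pid != pid
        )
        loss += others - lam * own
    return loss


# Hypothetical two-person example: the loss is lower when the masked feature
# already points toward the person's own representative vector.
representatives = {0: [1.0, 0.0], 1: [0.0, 1.0]}
near = frozen_representative_loss([[1.0, 0.0]], [0], representatives)
far = frozen_representative_loss([[0.0, 1.0]], [0], representatives)
```

In an actual training loop, only the second parameter set would be updated to reduce this value; the dictionary of representative vectors would be read-only.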

＜学習方法の派生形態＞
ここで学習の形態のその他の派生的な形態を挙げる。例えば、＜代表ベクトル＞を用いない学習形態も考えられる。この学習の動作処理のフロー例を図８、模式図として図９を用いて説明する。本形態例では通常の人物の画像のセットと、同画像にマスク画像を重畳合成した画像群を用いる。図９（Ａ）に通常の人物の画像ａ，ｂ，ｐ、およびマスクを重畳した画像ａ’，ｂ’，ｐ’の例を示す。本派生の例では画像ａ’，ｂ’，ｐ’の特徴量が画像ａ，ｂ，ｐの特徴量へとそれぞれ近づくように第二のパラメータセットを学習する。 <Derivative forms of learning methods>
Here are some other derivative forms of learning. For example, a learning form that does not use <representative vector> is also conceivable. A flow example of this learning operation process will be explained using FIG. 8 and a schematic diagram using FIG. 9. In this embodiment, a set of normal human images and a group of images obtained by superimposing and combining the same images with a mask image are used. FIG. 9A shows examples of images a, b, and p of normal people and images a', b', and p' on which masks are superimposed. In the example of this derivation, the second parameter set is learned so that the feature quantities of images a', b', and p' approach the feature quantities of images a, b, and p, respectively.

まず＜一回目の学習処理＞は通常の人物の画像群を用いて、先述の方法に準じた学習処理をＳ４０１～Ｓ４０７で行う。なお先述の方法と異なり代表ベクトルを用いずに下式でクラス内類似度とクラス間類似度から損失値を算出し、第一の学習モデルの第一のパラメータセットを更新する。 First, in the <first learning process>, a learning process based on the method described above is performed in S401 to S407 using a group of images of normal people. Note that, unlike the above-mentioned method, a loss value is calculated from the intra-class similarity and the inter-class similarity using the following formula without using the representative vector, and the first parameter set of the first learning model is updated.

（数式４）
クラス内類似度スコア（ｆ_ｉ）＝ Σ_{ｙ（ｋ）＝ｙ（ｉ）}類似度スコア（ｆ_ｉ，ｆ_ｋ），
クラス間類似度スコア（ｆ_ｉ）＝ Σ_{ｙ（ｊ）≠ｙ（ｉ）}類似度スコア（ｆ_ｉ，ｆ_ｊ），
損失値＝ Σ_ｉクラス間類似度スコア（ｆ_ｉ）－ λクラス内類似度スコア（ｆ_ｉ）
ここでｆ_ｉ，ｆ_ｋは同一人物の特徴量のペア、ｆ_ｉ，ｆ_ｊは他人同士の特徴量のペアである。＜一回目の学習処理＞の結果を図９（Ｂ）に示す。 (Formula 4)
$$\text{intra-class similarity score}(f_i) = \sum_{y(k) = y(i)} \text{similarity score}(f_i, f_k),$$
$$\text{inter-class similarity score}(f_i) = \sum_{y(j) \neq y(i)} \text{similarity score}(f_i, f_j),$$
$$\text{loss value} = \sum_i \left[ \text{inter-class similarity score}(f_i) - \lambda \, \text{intra-class similarity score}(f_i) \right]$$
Here, $f_i$ and $f_k$ are a pair of feature amounts of the same person, and $f_i$ and $f_j$ are a pair of feature amounts of different persons. The results of the <first learning process> are shown in FIG. 9(B).
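As an illustration of Formula 4, the loss can be computed directly from feature vectors and person labels. The sketch below makes several assumptions that are not stated in the text: cosine similarity is used as the similarity score, the self-pair (k = i) is excluded from the intra-class sum, and the names `pairwise_class_loss` and `lam` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pairwise_class_loss(features, labels, lam=1.0):
    """Sum over i of (inter-class similarity - lam * intra-class similarity).

    No representative vectors are used: every feature is compared directly
    with every other feature in the batch.
    """
    loss = 0.0
    for i, f_i in enumerate(features):
        intra = 0.0  # similarity to other images of the same person
        inter = 0.0  # similarity to images of different persons
        for k, f_k in enumerate(features):
            if k == i:
                continue  # assumption: a feature is not paired with itself
            score = cosine_similarity(f_i, f_k)
            if labels[k] == labels[i]:
                intra += score
            else:
                inter += score
        loss += inter - lam * intra
    return loss


# Hypothetical batch: two images of person 0 and one image of person 1.
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
well_separated = pairwise_class_loss(features, [0, 0, 1])
badly_labelled = pairwise_class_loss(features, [0, 1, 0])
```

The loss is lower when same-person features are similar and different-person features are dissimilar, which is the direction the first parameter set is updated in.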

次に、＜二回目の学習処理＞で第二の学習モデルの第二のパラメータセットを学習する。Ｓ５０１では、特徴量変換部２０２が、ＤＮＮのパラメータを初期化し、Ｓ５０２では、画像取得部２０１が、学習画像としてマスクを重畳する前の元画像（第一の学習画像）と合成重畳した画像（第二の学習画像）のペアを取得する。つまり、第一の学習画像と第二の学習画像とは、同一の物体が撮像された画像であって、物体の状態や撮影環境が異なるような画像のペアである。Ｓ５０３とＳ５０４では、特徴量変換部２０２が、第一の学習モデルと元画像（第一の画像）から第一の学習特徴量を、第二の特徴モデルと合成画像（第二の画像）からそれぞれ学習特徴量を取得する。Ｓ５０５では、学習部２０３が、人物のクラス内とクラス間の損失値を算出する。この時、これまでに用いた人物のクラス内とクラス間の類似度スコアの項に加えて下式のように画像ペアの類似度の項を新たに追加する。 Next, in the <second learning process>, the second parameter set of the second learning model is learned. In S501, the feature amount conversion unit 202 initializes the parameters of the DNN. In S502, the image acquisition unit 201 acquires, as learning images, a pair consisting of the original image before the mask is superimposed (first learning image) and the image with the mask superimposed (second learning image). In other words, the first learning image and the second learning image are a pair of images that capture the same object but differ in the state of the object or the shooting environment. In S503 and S504, the feature amount conversion unit 202 acquires a first learning feature amount from the original image (first image) using the first learning model, and a second learning feature amount from the composite image (second image) using the second learning model. In S505, the learning unit 203 calculates a loss value from the intra-class and inter-class similarities of persons. At this time, in addition to the intra-class and inter-class similarity score terms used so far, an image-pair similarity term is newly added, as shown in the following equations.

（数式５）
画像ペア類似度スコア（ｆ_ｘ）＝類似度スコア（ｆ_ｘ，ｆ_ｘ’）
（数式６）
損失値＝ Σ_ｉクラス間類似度スコア（ｆ_ｉ）－ λ_１クラス内類似度スコア（ｆ_ｉ）
－ λ_２画像ペア類似度スコア（ｆ_ｉ）
なお上式でｆ_ｘは画像ｘの特徴量、ｆ_ｘ’は画像ｘにマスクを重畳合成した画像ｘ’の特徴量である。λ_１，λ_２は各項のバランスをとるパラメータである。 (Formula 5)
$$\text{image pair similarity score}(f_x) = \text{similarity score}(f_x, f_{x'})$$
(Formula 6)
$$\text{loss value} = \sum_i \left[ \text{inter-class similarity score}(f_i) - \lambda_1 \, \text{intra-class similarity score}(f_i) - \lambda_2 \, \text{image pair similarity score}(f_i) \right]$$
In the above equations, $f_x$ is the feature amount of image x, and $f_{x'}$ is the feature amount of image x', obtained by superimposing a mask on image x. $\lambda_1$ and $\lambda_2$ are parameters that balance the terms.
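Formulas 5 and 6 can be illustrated with a small sketch. The text does not say which features enter the class terms in the second learning process, so this example assumes they are computed over the masked-image features from the second learning model, while the image-pair term compares each original-image feature (from the fixed first model) with its masked counterpart. Cosine similarity and all function and parameter names are likewise assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def second_stage_loss(orig_feats, masked_feats, labels, lam1=1.0, lam2=1.0):
    """Inter-class term - lam1 * intra-class term - lam2 * image-pair term.

    orig_feats[i]:   feature of original image x_i (first model, fixed).
    masked_feats[i]: feature of the mask-composited image x_i' (second model).
    """
    loss = 0.0
    for i, f_i in enumerate(masked_feats):
        intra = 0.0
        inter = 0.0
        for k, f_k in enumerate(masked_feats):
            if k == i:
                continue  # assumption: no self-pairs in the class terms
            score = cosine_similarity(f_i, f_k)
            if labels[k] == labels[i]:
                intra += score
            else:
                inter += score
        # Formula 5: similarity between an image and its masked version.
        pair = cosine_similarity(orig_feats[i], f_i)
        loss += inter - lam1 * intra - lam2 * pair
    return loss


# Hypothetical data: two persons, one original/masked pair each.
orig = [[1.0, 0.0], [0.0, 1.0]]
aligned = second_stage_loss(orig, [[1.0, 0.0], [0.0, 1.0]], [0, 1])
misaligned = second_stage_loss(orig, [[0.0, 1.0], [1.0, 0.0]], [0, 1])
```

The aligned case yields the lower loss, reflecting that the masked features are pulled toward the features of the corresponding unmasked images.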

画像ペアの類似度の項はマスク重畳前の元画像（第一の学習画像）と重畳後の合成画像（第二の学習画像）のそれぞれの学習特徴量同士との距離が所定の値より小さくなるように学習する。特徴量ペアの類似度の項の模式図を図９（Ｃ）に付番９００，９０１，９０２を矢印に付して併せて示す。同図で矢印９０３は従来のクラス内類似度，矢印９０４はクラス間類似度を示している。このように複数の類似度を組み合わせて損失値を定義することで、照合の精度を向上させることが期待できる。Ｓ５０６では上記の損失値を減ずるように第二の学習モデルの第二パラメータセットの学習を行う。ここでは第一の学習モデルの学習を行わないため、この＜二回目の学習処理＞では、マスク非装着の元画像の特徴量は「固定」されて動かず、マスクを装着合成した画像の特徴量が、マスク非装着の特徴量に近づく方向に変化するような学習が行われる。Ｓ５０７で、学習部２０３が、学習が収束したと判断した場合、Ｓ５０８で第二の学習モデルの第二のパラメータセットを保存して学習を終了する。以上が学習方法の派生形態の例になる。 With the image-pair similarity term, learning proceeds so that the distance between the learning feature amounts of the original image before mask superimposition (first learning image) and the composite image after superimposition (second learning image) becomes smaller than a predetermined value. The feature-pair similarity terms are also shown schematically in FIG. 9(C) as the arrows numbered 900, 901, and 902. In the same figure, arrow 903 indicates the conventional intra-class similarity, and arrow 904 indicates the inter-class similarity. Defining the loss value by combining multiple similarities in this way can be expected to improve matching accuracy. In S506, the second parameter set of the second learning model is trained so as to reduce the above loss value. Because the first learning model is not trained here, in this <second learning process> the feature amounts of the original images without a mask are "fixed" and do not move, while the feature amounts of the images with a synthesized mask change in a direction that brings them closer to the feature amounts without a mask. If the learning unit 203 determines in S507 that learning has converged, the second parameter set of the second learning model is saved in S508 and learning ends. The above is an example of a derivative form of the learning method.

またさらに別の学習方法の形態例も考えられる。一つの例として、＜一回目の学習処理＞で通常人物用の特徴量変換部を学習する際に、若干数のマスク人物画像含めて学習を行っておくことが考えられる。このようにすると照合時に物体パラメータ決定１０３が判定に失敗して、誤った特徴量変換パラメータが適用されても、大幅な性能劣化を抑止することが期待できる。同様に、マスク装着人物用の特徴量変換部の学習を行う際に、通常人物の画像も混ぜて学習することも考えられる。 Still other forms of the learning method are conceivable. As one example, when training the feature amount conversion unit for normal persons in the <first learning process>, a small number of images of persons wearing masks may be included in the training. In this way, even if the object parameter determination unit 103 makes a wrong determination during matching and an incorrect feature amount conversion parameter is applied, significant performance degradation can be expected to be avoided. Similarly, when training the feature amount conversion unit for persons wearing masks, images of normal persons may also be mixed in.

このように学習処理については様々な形態の学習処理が考えられる。ここで説明した複数の学習処理方法を、学習の進度に応じて段階的に適用することも考えられる。このように本発明の画像処理装置を学習するための処理は一つの例に限定されない。 As described above, various types of learning processing are possible. It is also conceivable to apply the plurality of learning processing methods described here in stages according to the progress of learning. In this way, the processing for learning the image processing apparatus of the present invention is not limited to one example.

＜特徴量変換部の構成の派生形態＞
次にＤＮＮの構成について派生の形態例を挙げる。例えば、通常人物用の特徴量変換のＤＮＮと、マスク装着人物用のＤＮＮで、層数やニューロン数を変更することが考えられる。一般に、マスクをつけた人物や横顔の人物などの照合困難な対象や、見えのバリエーションが豊富な対象は、規模の大きいＤＮＮを用いることで性能が向上しやすい。このため、扱う対象に応じて各ＤＮＮの規模を調整すれば計算コストと照合精度の費用対効果を向上させることができる。 <Derivative form of the configuration of the feature amount conversion unit>
Next, an example of a derived form of the DNN configuration will be given. For example, it is conceivable to change the number of layers and the number of neurons between a DNN for feature value conversion for a normal person and a DNN for a person wearing a mask. In general, for objects that are difficult to match, such as a person wearing a mask or a person in profile, or for objects with a wide variety of appearances, performance is likely to be improved by using a large-scale DNN. Therefore, by adjusting the scale of each DNN depending on the target to be handled, it is possible to improve the cost-effectiveness of calculation cost and matching accuracy.

また別の形態として、通常人物用の特徴量変換のＤＮＮと、マスク装着人物用のＤＮＮで、前段の層は共有し、後段の層のみを人物の状態に応じて部分的に変更するといった形態が考えられる。 As another form, the DNN for feature amount conversion for normal persons and the DNN for persons wearing masks may share the front-stage layers, with only the rear-stage layers partially changed according to the state of the person.
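The shared-front-layer arrangement can be pictured with a toy example. The following sketch is only an assumption about how such a structure might look: a single shared layer ("trunk") followed by a state-specific output layer ("head"), written with plain Python lists instead of a real DNN framework. The class name, the state keys, and the weights are all hypothetical.

```python
def linear(weights, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(x):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in x]


class SharedTrunkExtractor:
    """Feature converter whose front layer is shared across states.

    trunk_weights:         weights of the shared front-stage layer.
    head_weights_by_state: dict state -> weights of the rear-stage layer
                           used only for that state (e.g. "normal", "mask").
    """

    def __init__(self, trunk_weights, head_weights_by_state):
        self.trunk_weights = trunk_weights
        self.heads = head_weights_by_state

    def extract(self, image_vector, state):
        hidden = relu(linear(self.trunk_weights, image_vector))  # shared part
        return linear(self.heads[state], hidden)  # state-specific part


# Hypothetical weights: identity trunk, two different heads.
extractor = SharedTrunkExtractor(
    trunk_weights=[[1.0, 0.0], [0.0, 1.0]],
    head_weights_by_state={
        "normal": [[1.0, 0.0], [0.0, 1.0]],
        "mask": [[2.0, 0.0], [0.0, 2.0]],
    },
)
normal_feature = extractor.extract([1.0, -1.0], "normal")
mask_feature = extractor.extract([1.0, -1.0], "mask")
```

Only the head weights need to be stored and switched per state, which keeps the amount of state-specific parameters small.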

さらに別の形態として、通常人物用の特徴量変換部とマスク装着人物用の特徴量変換部で構成の全く異なる特徴量変換の手段を用いることが考えられる。例えば通常人物用の特徴量変換部に畳み込みニューラルネットワークを用いて、マスク装着人物用に特許文献２で公知なＴｒａｎｓｆｏｒｍｅｒネットワークを用いることが考えられる。また再帰的ニューラルネットワーク等を用いてもよい。損失値に基づいてパラメータを調整することが可能な手段であれば、特徴量変換部にはＤＮＮに限らず広く様々な特徴量変換の手段が適用可能である。 As yet another form, the feature amount conversion unit for normal persons and the feature amount conversion unit for persons wearing masks may use feature amount conversion means with completely different configurations. For example, a convolutional neural network may be used in the feature amount conversion unit for normal persons, and a Transformer network known from Patent Document 2 may be used for persons wearing masks. A recurrent neural network or the like may also be used. As long as the means can adjust its parameters based on the loss value, a wide variety of feature amount conversion means, not limited to DNNs, can be applied to the feature amount conversion unit.

さらに別の派生の形態として、入力画像を変換して得られる特徴量ｆ_１，ｆ_２は、１次元ベクトルでなくＮ次元行列の形態でもよい。また本実施形態では第一の学習済みモデルと第二の学習済みモデルから得られる特徴ベクトルの長さを同一としたが、長さが異なっていてもよい。異なる長さの特徴量を用いる場合は、ＥａｒｔｈＭｏｖｅｒ‘ｓＤｉｓｔａｎｃｅなどの不等長のベクトル間の類似度を算出する公知の方法を用いればよい。 As yet another form of derivation, the feature quantities f ₁ and f ₂ obtained by transforming the input image may be in the form of an N-dimensional matrix instead of a one-dimensional vector. Further, in this embodiment, the lengths of the feature vectors obtained from the first trained model and the second trained model are the same, but the lengths may be different. When using feature quantities of different lengths, a known method for calculating the similarity between vectors of unequal length, such as Earth Mover's Distance, may be used.

以上で実施形態１の説明を終える。 This concludes the description of the first embodiment.

＜実施形態２＞
本実施形態はマスクやサングラスの装着の有無による切り替え以外の形態に本発明を適用する。実施形態１では１枚対１枚の画像を入力とし、同一物体の被写体かを判定した。本実施形態では、顔認証によって開閉する自動ドアのゲートのようなユースケースを想定した形態例を説明する。本実施形態の画像処理装置には予めＮ人の人物の特徴量を登録しておく。照合時にはゲートの前のカメラで撮影した１枚の画像を入力画像として入力し、入力された人物が登録されたＮ人のうちいずれかの人物と同一であるか、いずれにも該当しないかを判定する。 <Embodiment 2>
In this embodiment, the present invention is applied to a form other than switching according to whether a mask or sunglasses are worn. In the first embodiment, a pair of images was input one-to-one, and it was determined whether they show the same object. This embodiment describes an example that assumes a use case such as an automatic door gate that opens and closes by face authentication. The feature amounts of N persons are registered in advance in the image processing apparatus of this embodiment. At matching time, a single image captured by a camera in front of the gate is input, and it is determined whether the person in the input image is the same as one of the N registered persons or matches none of them.

実施形態１ではマスクの有無を判定して特徴量変換部の切り替えを行った。本実施形態では、登録用の顔画像（照明条件が良好な正面顔）と、問い合わせ用の顔画像（カメラの設置状況により照明条件が悪い、顔向きの角度が大きい、等がある）で、撮影条件が大きく異なる。そこで、それぞれに対応する特徴量変換部を学習して用いることとする。 In the first embodiment, the presence or absence of a mask is determined and the feature quantity conversion unit is switched. In this embodiment, the facial image for registration (frontal face with good lighting conditions) and the facial image for inquiry (poor lighting conditions, large face angle, etc. depending on the camera installation situation) are used. The shooting conditions are very different. Therefore, we will learn and use feature quantity conversion units corresponding to each.

図１０に画像処理装置３の機能構成例を示す。基本的な構成は図１に準じている。差異としては、新たに特徴登録部１０８および処理モード設定部１０９を備える。照合処理のフローは図１１である。人物の登録動作を図１１（Ａ）に、入力画像と登録人物との照合動作を図１１（Ｂ）に示している。 FIG. 10 shows an example of the functional configuration of the image processing device 3. The basic configuration follows that of FIG. 1. The difference is that a feature registration unit 108 and a processing mode setting unit 109 are newly provided. The flow of the matching process is shown in FIG. 11. FIG. 11(A) shows the operation of registering persons, and FIG. 11(B) shows the operation of matching an input image against the registered persons.

画像処理装置３が登録動作を開始すると、処理モード設定部１０９が、現在の動作モードを登録動作モードに設定する（Ｓ６０１）。Ｓ６０２では、第一の特徴量変換部１０５が、登録動作モード用の変換パラメータセット（第一のパラメータセット）を取得する。取得したパラメータセットを学習済みモデルに適用する。次にＳ６０４では、第一の画像取得部１０１が、一人ずつ全Ｎ人の登録用人物画像を入力し（Ｓ６０４）、特徴量変換部１０５が特徴量に変換し（Ｓ６０５）、特徴登録部１０８に各人物の特徴量として登録する。登録画像としては良好な条件で撮影した人物の正面顔が想定される。そのため第一の特徴量変換部は正面顔を主に用いて予め学習してある。 When the image processing device 3 starts the registration operation, the processing mode setting unit 109 sets the current operation mode to the registration operation mode (S601). In S602, the first feature amount conversion unit 105 acquires the conversion parameter set for the registration operation mode (first parameter set) and applies it to the trained model. Next, the first image acquisition unit 101 inputs the registration images of all N persons one at a time (S604), the first feature amount conversion unit 105 converts each image into a feature amount (S605), and the result is registered in the feature registration unit 108 as the feature amount of that person. The registered images are assumed to be frontal faces photographed under good conditions. For this reason, the first feature amount conversion unit is trained in advance mainly on frontal faces.

次に画像処理装置が照合動作を開始すると、処理モード設定部１０９が、動作モードを照合動作モードに設定する（Ｓ７０１）。まずＳ７０２は、第二の特徴量変換部１０６が、複数の学習済みのパラメータセットのうち、状況に応じて選択されたパラメータセット（第二のパラメータセット）を取得する。第二のパラメータセットは、様々な角度の人物を学習データとして用いて予め学習してある。 Next, when the image processing apparatus starts a matching operation, the processing mode setting unit 109 sets the operation mode to the matching operation mode (S701). First, in S702, the second feature converter 106 acquires a parameter set (second parameter set) selected according to the situation from among a plurality of learned parameter sets. The second parameter set is learned in advance using people from various angles as learning data.

Ｓ７０３では、第二の画像取得部１０２が、撮影した一枚の入力画像を取得する。なおカメラとゲートドアの位置関係の状況によっては画像中のどこに人物が写っているかは事前に決定されない。そのため第二の画像取得部１０２の内部に顔検出器を用意しておき、顔を検出させて顔周辺の画像だけを切り出してもよい。（顔検出器は広く公知のものを使用すればよい。）次に第二の特徴量変換部１０６が入力画像から第二の特徴量を取得する（Ｓ７０４）。Ｓ７０５～Ｓ７０７で特徴量照合部１０７が入力画像の特徴量と各登録済の特徴量との類似度を一つ一つ算出し（Ｓ７０６）、所定値以上に類似度の高い候補人物がいればその結果を出力する（Ｓ７０８）。処理フロー中には図示しないが、実際のユースケースでは以上の結果に基づきゲートドアの開閉動作を行う。具体的には、第二の画像に含まれる人物が登録人物のいずれかと一致する場合は、ゲートを開ける制御を行い、いずれの登録人物とも一致しない場合は、ゲートを開けず、必要に応じて管理者に通知を出力する。認証結果を入室ゲートの近くの表示装置に出力しても良い。 In S703, the second image acquisition unit 102 acquires one captured input image. Note that depending on the positional relationship between the camera and the gate door, where the person appears in the image cannot be determined in advance. Therefore, a face detector may be provided inside the second image acquisition unit 102 to detect the face and cut out only the image around the face. (A widely known face detector may be used.) Next, the second feature converter 106 acquires a second feature from the input image (S704). In S705 to S707, the feature matching unit 107 calculates the similarity between the input image feature and each registered feature one by one (S706), and if there is a candidate person whose similarity is higher than a predetermined value, The result is output (S708). Although not shown in the processing flow, in an actual use case, the opening/closing operation of the gate door is performed based on the above results. Specifically, if the person included in the second image matches one of the registered persons, the gate is controlled to be opened; if the person included in the second image does not match any of the registered persons, the gate is not opened and the gate is opened as necessary. Output a notification to the administrator. The authentication result may be output to a display device near the entrance gate.
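The one-to-N matching loop of S705 to S708 and the resulting gate decision can be sketched as below. This is a simplified illustration under stated assumptions: cosine similarity is used as the similarity score, the single best-scoring registered person is returned (the text only says that a candidate above the predetermined value is output), and the function name, registry layout, and threshold value are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def match_against_registry(query_feature, registry, threshold):
    """Compare one query feature with every registered feature.

    registry: dict person_id -> feature registered with the first model.
    Returns (person_id, score) for the best match at or above the threshold,
    or (None, best_score) when no registered person is similar enough.
    """
    best_id, best_score = None, -1.0
    for person_id, registered_feature in registry.items():
        score = cosine_similarity(query_feature, registered_feature)
        if score > best_score:
            best_id, best_score = person_id, score
    if best_score >= threshold:
        return best_id, best_score
    return None, best_score


# Hypothetical registry of two persons and a query taken at the gate.
registry = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
person, score = match_against_registry([0.9, 0.1], registry, threshold=0.8)
open_gate = person is not None  # open only when a registered person matches
```

When `person` is `None`, the gate would stay closed and, as the text notes, an administrator could be notified.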

図１２は本実施形態２の学習処理のフローである。図１３に模式図を併せて示す。ここでは実施形態１の形態と異なり、第一の学習モデルと第二の学習モデルとを同時に学習する点がこれまでとの差異である。本実施形態の学習の方法がこのような方法にも適用可能であることを説明する。なお、ハードウェア構成例は図２、画像処理装置の機能構成例は図１４と同様である。 FIG. 12 is a flowchart of learning processing according to the second embodiment. A schematic diagram is also shown in FIG. The difference here from the first embodiment is that the first learning model and the second learning model are learned at the same time. It will be explained that the learning method of this embodiment is also applicable to such a method. Note that an example of the hardware configuration is the same as that shown in FIG. 2, and an example of the functional configuration of the image processing apparatus is the same as that shown in FIG. 14.

図１２のＳ８０１では、画像取得部２０１が、登録画像の撮影条件を模した正面画像だけを集めた第一の学習画像群を取得する。Ｓ８０２では、特徴量変換部２０２が、第一のパラメータセットを用いた第一の学習モデルに基づいて、第一の学習画像群から第一の学習特徴量を取得する。Ｓ８０３では、画像取得部２０１が、第二の学習画像群を取得する。第二の画像群は入力画像を想定した見下ろしなどを含む角度の異なる様々な人物画像を含む。Ｓ８０４では、特徴量変換部２０２が、第二のパラメータセットを用いた第二の学習モデルに基づいて、第二の学習画像群から第二の学習特徴量を取得する。 In S801 of FIG. 12, the image acquisition unit 201 acquires a first learning image group that is a collection of only frontal images that simulate the photographing conditions of the registered images. In S802, the feature amount conversion unit 202 acquires a first learning feature amount from a first learning image group based on a first learning model using a first parameter set. In S803, the image acquisition unit 201 acquires a second learning image group. The second image group includes various human images from different angles, including a top-down view and the like assuming the input image. In S804, the feature amount conversion unit 202 acquires a second learning feature amount from the second learning image group based on the second learning model using the second parameter set.

Ｓ８０５では、学習部２０３が、それぞれの画像群から画像をランダムに選んで本人ペア（クラス内ペア）と他人ペア（クラス間ペア）を作り、それらの特徴量間の類似度に基づいて損失値を求める。損失には下記のように非特許文献２等で公知なトリプレット損失を用いる。（非特許文献２：ＦｌｏｒｉａｎＳｃｈｒｏｆｆ，ＤｍｉｔｒｙＫａｌｅｎｉｃｈｅｎｋｏ，ａｎｄＪａｍｅｓＰｈｉｌｂｉｎ．Ｆａｃｅｎｅｔ：Ａｕｎｉｆｉｅｄｅｍｂｅｄｄｉｎｇｆｏｒｆａｃｅｒｅｃｏｇｎｉｔｉｏｎａｎｄｃｌｕｓｔｅｒｉｎｇ．ＩｎＣＶＰＲ，２０１５）。 In S805, the learning unit 203 randomly selects images from each image group to form same-person pairs (intra-class pairs) and different-person pairs (inter-class pairs), and obtains a loss value based on the similarity between their feature amounts. For the loss, the triplet loss known from Non-Patent Document 2 and elsewhere is used, as shown below. (Non-Patent Document 2: Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In CVPR, 2015.)

（数式７）
損失値＝ Σ_ｉ［クラス間ペア類似度スコア（ｆ_ｉ，ｆ_ｊ）
－クラス内ペア類似度スコア（ｆ_ｉ，ｆ_ｋ）＋ｍ］^＋，
ただしｍは学習を頑健にするための損失のマージン値の定数、［・］^＋は
（数式８）
［ｘ］^＋＝ｘＩｆｘ＞０
［ｘ］^＋＝０Ｏｔｈｅｒｗｉｓｅ
で定義される関数である。
ここでｆ_ｉは人物画像Ｉ_ｉの特徴量、ｆ_ｊは画像Ｉ_ｉと異なる人物の特徴量、ｆ_ｋはＩ_ｉと同一人物の別の画像Ｉ_ｋの特徴量である。 (Formula 7)
$$\text{loss value} = \sum_i \left[ \text{inter-class pair similarity score}(f_i, f_j) - \text{intra-class pair similarity score}(f_i, f_k) + m \right]^{+},$$
where $m$ is a constant loss margin that makes learning robust, and $[\cdot]^{+}$ is the function defined by
(Formula 8)
$$[x]^{+} = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{otherwise.} \end{cases}$$
Here, $f_i$ is the feature amount of person image $I_i$, $f_j$ is the feature amount of an image of a person different from the one in $I_i$, and $f_k$ is the feature amount of another image $I_k$ of the same person as $I_i$.
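Formulas 7 and 8 translate almost directly into code. The sketch below assumes cosine similarity as the pair similarity score and a margin of 0.2; both choices, and the function names, are illustrative assumptions rather than values taken from the text.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def positive_part(x):
    """Formula 8: [x]^+ is x when x > 0 and 0 otherwise."""
    return x if x > 0 else 0.0


def triplet_similarity_loss(triplets, margin=0.2):
    """Formula 7 over a list of (f_i, f_k, f_j) triplets.

    f_i: anchor feature, f_k: feature of the same person (intra-class pair),
    f_j: feature of a different person (inter-class pair).
    """
    return sum(
        positive_part(
            cosine_similarity(f_i, f_j) - cosine_similarity(f_i, f_k) + margin
        )
        for f_i, f_k, f_j in triplets
    )


# Hypothetical triplets: a well-separated one contributes no loss, while one
# whose different-person feature is closer than the same-person feature does.
easy = triplet_similarity_loss([([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])])
hard = triplet_similarity_loss([([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])])
```

Because of the positive-part function, triplets that already satisfy the margin contribute nothing, so the update concentrates on the pairs that are still confused.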

なお人物画像Ｉ_ｉは第一の学習セットあるいは第二の学習セットからランダムに選択し、それに応じて人物画像Ｉ_ｊとＩ_ｋをサンプリングしてクラス間ペアとクラス内ペアを作る。この時、人物画像Ｉ_ｉを第一の学習セットから選んだ場合は人物画像Ｉ_ｊとＩ_ｋは第二の学習セットから選び、人物画像Ｉ_ｉを第二の学習セットから選んだ場合は人物画像Ｉ_ｊとＩ_ｋは第一の学習セットから選ぶ。これにより第一の学習モデルと第二の学習モデルを連動させて学習させることができる。 Note that the person image I_i is randomly selected from either the first learning set or the second learning set, and the person images I_j and I_k are sampled accordingly to form the inter-class pair and the intra-class pair. When I_i is selected from the first learning set, I_j and I_k are selected from the second learning set; when I_i is selected from the second learning set, I_j and I_k are selected from the first learning set. This allows the first learning model and the second learning model to be trained in conjunction with each other.
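The cross-set sampling rule described above can be sketched as follows. The data layout (a dict from person ID to a list of images per learning set), the function name, and the 50/50 choice of the anchor set are assumptions for illustration; the sketch also assumes that each set contains at least two persons and that the chosen person appears in both sets.

```python
import random


def sample_cross_set_triplet(first_set, second_set, rng):
    """Sample (I_i, I_k, I_j) so that I_i and the other two come from
    different learning sets.

    first_set, second_set: dict person_id -> list of images.
    Returns (anchor_from_first, anchor, positive, negative).
    """
    anchor_from_first = rng.random() < 0.5
    if anchor_from_first:
        anchor_set, other_set = first_set, second_set
    else:
        anchor_set, other_set = second_set, first_set

    # The anchor person must also exist in the other set to form I_k.
    shared_persons = [pid for pid in anchor_set if pid in other_set]
    person = rng.choice(shared_persons)
    other_persons = [pid for pid in other_set if pid != person]
    impostor = rng.choice(other_persons)

    anchor = rng.choice(anchor_set[person])      # I_i
    positive = rng.choice(other_set[person])     # I_k, same person
    negative = rng.choice(other_set[impostor])   # I_j, different person
    return anchor_from_first, anchor, positive, negative


# Hypothetical image identifiers standing in for actual images.
first = {"A": ["A_front"], "B": ["B_front"]}
second = {"A": ["A_tilt"], "B": ["B_tilt"]}
triplet = sample_cross_set_triplet(first, second, random.Random(0))
```

Since every triplet spans both sets, each loss term depends on the outputs of both learning models, which is what couples their updates.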

Ｓ８０６では、学習部２０３が、第一の学習モデルと第二の学習モデルのそれぞれが損失値を減ずる方向に誤差逆伝搬の方法を用いてパラメータの学習更新を行う。この結果、図１３に模式図を示すように、二つの学習モデルのそれぞれの出力に対して類似度に基づく損失値を算出し、それを誤差として再び各特徴変換部に逆伝搬させて学習更新が行われる。 In S806, the learning unit 203 updates the parameters of the first learning model and the second learning model by error backpropagation, each in the direction that reduces the loss value. As a result, as shown schematically in FIG. 13, a similarity-based loss value is calculated from the outputs of the two learning models and is propagated back to each feature conversion unit as an error to update the learning.

以上のように第一の学習モデルと第二の学習モデルとで異なる特性の画像を処理させながら、双方で同時に学習を行う例について説明した。なお派生的方法として、初期段階は二つの学習モデルを同時に学習し、後半では第一の特徴量を固定して第二の特徴量のみ学習するといった組み合わせも考えられる。 As described above, an example has been described in which the first learning model and the second learning model process images with different characteristics and simultaneously perform learning on both models. As a derivative method, a combination of learning two learning models at the same time in the initial stage, fixing the first feature amount and learning only the second feature amount in the latter half can also be considered.

＜実施形態３＞
上述の実施形態では、状態判定と特徴量変換の双方が画像から状態や特徴量を各々求めていた。本実施形態では、画像から中間特徴量を生成し、中間特徴量をもとに状態判定と特徴量変換を行う形態について説明する。ここで、状態とは、例えば、性別、人種や年齢といった人物の属性を含む。本実施形態では、画像に含まれる人物について個人を特定するための特徴量を得る際に、人物の属性に応じて学習モデルの一部のパラメータを異ならせる。一方で、人物の属性（状態）判定及び特徴量変換の処理を実行する学習モデルのレイヤについては共通のものを用いる。これにより、状態判定と特徴量変換の処理が共通化され、速度・メモリの効率が高められる。 <Embodiment 3>
In the above-described embodiment, both the state determination and the feature amount conversion obtain the state and feature amount from the image. In this embodiment, an embodiment will be described in which an intermediate feature amount is generated from an image, and state determination and feature amount conversion are performed based on the intermediate feature amount. Here, the state includes, for example, attributes of a person such as gender, race, and age. In this embodiment, when obtaining feature amounts for identifying a person included in an image, some parameters of the learning model are changed depending on the attributes of the person. On the other hand, a common learning model layer is used for determining attributes (states) of a person and converting features. As a result, the processing of state determination and feature quantity conversion is shared, and speed and memory efficiency are improved.

本実施形態では、図１５～図１８を用いて、実施形態１と同様の１枚対１枚の画像を入力とし、同一物体の被写体かを判定する「１対１の画像照合処理」の場合について説明する。次に、図１９～図２０を使用して、予め登録したＮ人の人物から、入力画像に映る人物がいずれかの登録人物と同一であるかを判定する「１対Ｎの画像照合処理」の場合について説明する。なお、ハードウェア構成は実施形態１，２における図２の情報処理装置と同様である。 In this embodiment, the case of "one-to-one image matching processing", in which a pair of images is input as in the first embodiment and it is determined whether they show the same object, is described first with reference to FIGS. 15 to 18. Next, the case of "one-to-N image matching processing", in which it is determined whether the person in the input image is the same as any of N persons registered in advance, is described with reference to FIGS. 19 and 20. Note that the hardware configuration is the same as that of the information processing apparatus of FIG. 2 in the first and second embodiments.

＜１対１の画像照合処理＞
図１５に画像処理装置１５の機能構成例を示す。基本的な構成は図１に準じている。差異としては、第一の特徴量変換部１５０１が中間特徴量を生成することである。これに伴い、中間特徴量をもとにパラメータ決定部１５０２と第二の特徴量変換部１５０４と第三の特徴量変換部１５０５（第三の特徴取得部）が動作するようになっている。パラメータ決定部１５０２は、画像に含まれる物体の状態（人物の場合、属性）に応じて、学習済みモデルのパラメータを決定する。パラメータ決定部１５０２は、画像の中間特徴量に基づいて、画像に含まれる物体の状態を推定する。推定方法は、注目属性の代表的な特徴量との一致度が所定の閾値以上であれば注目属性であると推定する。または、画像から物体の状態に関する特徴量を出力する第三の学習済みモデルに基づいて画像に含まれる物体の状態を推定する。さらに、パラメータ決定部１５０２は、推定された状態（人物の属性）に応じて予め対応付けられた変換パラメータを決定する。つまり、第一の画像に含まれる物体の属性と、第二の画像に含まれる物体の属性が同じ場合は、同一の学習済みモデル（または特徴変換パラメータ）が決定される。第一の画像に含まれる物体の属性と、第二の画像に含まれる物体の属性が異なる場合は、異なる学習済みモデル（またはモデルのパラメータ）が決定される。また、記憶部１５０３は第二の特徴量変換部１５０４と第三の特徴量変換部１５０５に供給する変換パラメータを記憶する。 <One-to-one image matching process>
FIG. 15 shows an example of the functional configuration of the image processing device 15. The basic configuration follows that of FIG. 1. The difference is that the first feature amount conversion unit 1501 generates an intermediate feature amount. Accordingly, the parameter determination unit 1502, the second feature amount conversion unit 1504, and the third feature amount conversion unit 1505 (third feature acquisition unit) operate on the intermediate feature amount. The parameter determination unit 1502 determines the parameters of the trained model according to the state of the object included in the image (for a person, an attribute). The parameter determination unit 1502 estimates the state of the object included in the image based on the intermediate feature amount of the image. As one estimation method, the object is estimated to have the attribute of interest if the degree of matching with a representative feature amount of that attribute is greater than or equal to a predetermined threshold. Alternatively, the state of the object is estimated using a third trained model that outputs a feature amount related to the state of the object from the image. The parameter determination unit 1502 then determines the conversion parameters associated in advance with the estimated state (attribute of the person). That is, if the attribute of the object in the first image and the attribute of the object in the second image are the same, the same trained model (or feature conversion parameters) is selected. If the attributes differ, different trained models (or model parameters) are selected.
Furthermore, the storage unit 1503 stores conversion parameters to be supplied to the second feature amount conversion unit 1504 and the third feature amount conversion unit 1505.

図１６は本実施形態の照合処理の模式図である。入力画像は、第一の特徴量変換部１５０１により、物体の状態に関する中間特徴量に変換される。変換された中間特徴量を用いて、パラメータ決定部１５０２によって状態に応じた変換パラメータが求められる。物体の状態は、性別・人種などがある。あるいは、年齢・顔向き・マスク有無等であってもよく、これらに限定されるものではない。記憶部１５０３には、状態Ｙに特化した変換パラメータ１６０２と、全状態に対応する一般用の所定の変換パラメータ１６０１が保存されている。例えば、入力画像に対する状態判定が「状態Ｙ」であれば、状態Ｙ用の変換パラメータ１６０２を第三の特徴量変換部１５０５に設定する。なお、対象の物体が学習済みの特定の状態には当てはまらない場合は、ダミーとして所定のパラメータを与えるようにしてもよい。そして、第三の特徴量変換部１５０５は、パラメータ決定部１５０２によって決定されたパラメータに基づいて、中間特徴量を顔特徴量に変換する。なお、前記実施形態では、特徴量と呼称していたが、中間特徴量と区別しやすくするため顔特徴量と呼称している。次に、登録画像も顔特徴量へと変換を行い、特徴量照合部１０７により入力画像と登録画像の顔特徴量の照合を行う。 FIG. 16 is a schematic diagram of the matching process of this embodiment. The input image is converted by a first feature converting unit 1501 into an intermediate feature related to the state of the object. Using the converted intermediate feature amount, the parameter determination unit 1502 determines conversion parameters according to the state. The state of an object includes gender, race, etc. Alternatively, it may be age, facial orientation, presence or absence of a mask, etc., but is not limited to these. The storage unit 1503 stores conversion parameters 1602 specific to state Y and predetermined general conversion parameters 1601 corresponding to all states. For example, if the state determination for the input image is "state Y", the conversion parameter 1602 for state Y is set in the third feature quantity conversion unit 1505. Note that if the target object does not fit into a specific learned state, a predetermined parameter may be given as a dummy. Then, the third feature converter 1505 converts the intermediate feature into a facial feature based on the parameters determined by the parameter determiner 1502. Note that in the embodiment described above, it was called a feature amount, but in order to make it easier to distinguish it from an intermediate feature amount, it is called a facial feature amount. 
Next, the registered image is also converted into facial features, and the feature matching unit 107 matches the facial features of the input image and the registered image.
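The parameter selection in FIG. 16 can be sketched as a simple lookup with a fallback. The sketch assumes that the state determination yields a score per known state and that a threshold decides whether a state "applies"; these details, the dictionary keys, and the placeholder parameter values are hypothetical. Only the idea of a state-specific parameter set (1602) plus a general one (1601) comes from the text.

```python
GENERAL_STATE = "general"


def select_conversion_parameters(state_scores, parameter_store, threshold=0.5):
    """Pick the conversion parameters for the estimated state.

    state_scores:    dict state -> score estimated from the intermediate
                     feature amount (assumed output of state determination).
    parameter_store: dict state -> conversion parameters; must contain the
                     general-purpose entry under GENERAL_STATE.
    Falls back to the general parameters when no learned state fits.
    """
    if state_scores:
        state, score = max(state_scores.items(), key=lambda item: item[1])
        if score >= threshold and state in parameter_store:
            return state, parameter_store[state]
    return GENERAL_STATE, parameter_store[GENERAL_STATE]


# Hypothetical store mirroring the general (1601) and state-Y (1602) entries.
store = {GENERAL_STATE: "params_1601", "state_Y": "params_1602"}
chosen_state, chosen_params = select_conversion_parameters({"state_Y": 0.9}, store)
```

The selected parameters would then be set in the feature amount conversion unit that turns the shared intermediate feature amount into a facial feature amount.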

これによって、中間特徴量に変換する部分が共通化されるため、処理スピードを高められる。加えて、パラメータ決定部や第二と第三の特徴変換部のモデルのサイズを小さくできる。また、モデルサイズが小さくなることにより、記憶部１５０３で管理する変換パラメータのサイズも小さくできる上に、変換パラメータの読み出し速度も高速にできる。なお、実施形態１ではパラメータ決定部１５０２は、物体の状態（マスクの装着の有無）をテンプレートマッチング等の方法により求めていた。しかし、パラメータ決定部１５０２も第二と第三の特徴変換部等と同様にディープニューラルネットワークにより構成してもよい。同様に、第一の特徴量変換部もディープニューラルネットワークとして構成してもよい。具体的な状態判定方法は図２１を用いて後述する。 As a result, the processing speed can be increased because the parts to be converted into intermediate feature quantities are shared. In addition, the size of the models of the parameter determination section and the second and third feature conversion sections can be reduced. Further, by reducing the model size, the size of the conversion parameters managed in the storage unit 1503 can also be reduced, and the reading speed of the conversion parameters can also be increased. Note that in the first embodiment, the parameter determining unit 1502 determines the state of the object (whether or not a mask is worn) by a method such as template matching. However, the parameter determination unit 1502 may also be configured by a deep neural network, similar to the second and third feature conversion units. Similarly, the first feature converter may also be configured as a deep neural network. A specific state determination method will be described later using FIG. 21.

これによって、特定の状態に特化した変換パラメータを保持することにより、状態の変化に対して頑健な照合が実現できる。加えて、状態判定に失敗したとしても、いずれの変換パラメータも特徴空間を共有しているため、大きく失敗した特徴量変換をしない。そのため、状態判定の性能に対しても頑健な照合が実現できる。また、この性質を高めるために、各変換パラメータは対応する状態以外の画像に対する特徴量変換もある程度はできるように学習しておいても良い。例えば、学習データとして対応する状態の画像に加えて、少量の他状態の画像を含めて学習するなどしても良い。あるいは、他状態のときは損失値を小さくする等の損失関数を変更した学習をしても良い。 Holding conversion parameters specialized for particular states in this way makes matching robust to changes in state. In addition, even if the state determination fails, the feature amount conversion does not fail badly, because all the conversion parameters share the same feature space. Matching is therefore also robust to the performance of the state determination. To strengthen this property, each set of conversion parameters may be trained so that it can also, to some extent, convert features for images in states other than its own. For example, a small number of images in other states may be included in the training data in addition to images in the corresponding state. Alternatively, training may use a modified loss function, for example one that reduces the loss value for images in other states.

Next, the matching procedure will be described with reference to FIG. 17. This process takes a pair of images, one against one, as input and determines whether they show the same object. In this example, the state obtained by the parameter determination unit 1502 is described as "gender."

In S1701, the first image acquisition unit 101 acquires the first image, which includes a person.

In S1702, the first feature amount conversion unit 1501 converts the first image into an intermediate feature amount (first intermediate feature amount).

In S1703, the parameter determination unit 1502 determines the state of the first image (first state) from the first intermediate feature amount. Specifically, it determines whether the person in the first image is male (that is, not female).

In S1704, based on the determination result, the parameter determination unit 1502 reads the conversion parameters corresponding to the first state from the storage unit 1503 and sets them in the second feature amount conversion unit 1504.

In S1705, the second feature amount conversion unit 1504 converts the first intermediate feature amount to obtain a facial feature amount (first facial feature amount). Here, if the determination result of S1703 indicates that the first state is male, the second feature amount conversion unit 1504 obtains the feature amount with a trained model whose parameters are well suited to identifying men.

In S1706, the second image acquisition unit 102 acquires the second image, which includes a person.

In S1707, the first feature amount conversion unit 1501 converts the second image into an intermediate feature amount (second intermediate feature amount).

In S1708, the parameter determination unit 1502 determines the state of the second image (second state) from the second intermediate feature amount. Specifically, it determines whether the person in the second image is male (that is, not female).

In S1709, the conversion parameters corresponding to the second state are read from the storage unit 1503 and set in the third feature amount conversion unit 1505.

In S1710, the third feature amount conversion unit 1505 converts the second intermediate feature amount to obtain a facial feature amount (second facial feature amount). If the first and second images are both images of men, the same trained-model parameters are set in the second feature amount conversion unit 1504 and the third feature amount conversion unit 1505. If, for example, the first image shows a man and the second image shows a woman, the parameters set in the two units differ.

In S1711, the feature amount matching unit 107 calculates a similarity score between the two feature amounts obtained in S1705 and S1710. Thresholding the similarity score determines whether the persons in the two images are the same.
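The flow of S1701 to S1711 can be sketched as follows. This is a toy illustration under stated assumptions, not the embodiment itself: the identity "backbone", the rule used to decide the state, the 2x2 weight matrices in `HEAD_WEIGHTS`, the use of cosine similarity, and the threshold 0.8 are all hypothetical stand-ins for the trained models and settings.

```python
import math

# Hypothetical per-state conversion parameters (stand-in for storage unit 1503).
HEAD_WEIGHTS = {
    "male": [[1.0, 0.0], [0.0, 1.0]],
    "female": [[0.9, 0.1], [0.1, 0.9]],
}

def backbone(image):
    """Shared first conversion: image -> intermediate feature (toy: a copy)."""
    return list(image)

def determine_state(intermediate):
    """Toy state decision made from the intermediate feature."""
    return "male" if intermediate[0] >= intermediate[1] else "female"

def head(intermediate, weights):
    """State-specific conversion: intermediate feature -> facial feature."""
    return [sum(w * x for w, x in zip(row, intermediate)) for row in weights]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def verify(image1, image2, threshold=0.8):
    features = []
    for image in (image1, image2):
        mid = backbone(image)                            # S1702 / S1707
        state = determine_state(mid)                     # S1703 / S1708
        features.append(head(mid, HEAD_WEIGHTS[state]))  # S1704-S1705 / S1709-S1710
    score = cosine(features[0], features[1])             # S1711
    return score, score >= threshold
```

Because both images pass through the same `backbone`, the intermediate feature is computed once per image and reused for both the state decision and the state-specific conversion.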

Next, a matching procedure different from that of FIG. 17 will be described with reference to FIG. 18. When the state determined by the parameter determination unit 1502 is race, gender, or the like, two persons in different states can be judged to be different persons. In this procedure, the states of the two images are obtained first, and if the state determinations for the objects in the images have high confidence and the two states are judged to differ, the conversion into facial feature amounts is skipped. This reduces the processing load. When both images are determined to be in the same state, the conversion parameters are read out only once, which further reduces the processing.

S1801 to S1803 in FIG. 18 are the same as S1701 to S1703 in FIG. 17: the first feature amount conversion unit 1501 converts the first image into an intermediate feature amount, and the state of the first image (first state) is obtained. Likewise, in S1804 to S1806, as in S1706 to S1708, the first feature amount conversion unit 1501 converts the second image into an intermediate feature amount, and the state of the second image (second state) is obtained.

In S1807, the parameter determination unit 1502 determines whether the first state and the second state obtained in S1803 and S1806 are the same. If they are the same, the process proceeds to S1808; otherwise, it proceeds to S1812.

In S1808, the parameter determination unit 1502 reads the conversion parameters corresponding to the first state from the storage unit 1503 and sets them in both the second feature amount conversion unit 1504 and the third feature amount conversion unit 1505.

In S1809, the second feature amount conversion unit 1504 converts the first intermediate feature amount into a facial feature amount (first facial feature amount).

In S1810, the third feature amount conversion unit 1505 converts the second intermediate feature amount into a facial feature amount (second facial feature amount).

In S1811, the feature amount matching unit 107 calculates a similarity score between the first facial feature amount and the second facial feature amount.

In S1812, it is determined whether the score of the state output by the parameter determination unit 1502 (state score) is high. For this purpose, the parameter determination unit 1502 is configured to output a score together with the state. For example, the parameter determination unit 1502 is configured as a deep neural network that produces one output per state and is trained so that the output corresponding to the state of the image becomes the largest. The state is then determined as the one whose output is largest, and that output value is used as the state score. A specific method for obtaining the state score will be described later with reference to FIG. 21. If the state score is greater than a predetermined threshold, the process proceeds to S1813; otherwise, it proceeds to S1814.

In S1813, the feature amount matching unit 107 outputs zero as the similarity between the first image and the second image. That is, when the confidence of the state determination is at or above a predetermined value and the states of the objects (attributes of the persons) differ, the objects can be judged unlikely to be the same.

In S1814, the parameter determination unit 1502 reads the conversion parameters corresponding to the first state from the storage unit 1503 and sets them in the second feature amount conversion unit 1504.

In S1815, the second feature amount conversion unit 1504 converts the first intermediate feature amount to obtain a facial feature amount (first facial feature amount).

In S1816, the conversion parameters corresponding to the second state are read from the storage unit 1503 and set in the third feature amount conversion unit 1505.

In S1817, the third feature amount conversion unit 1505 converts the second intermediate feature amount to obtain a facial feature amount (second facial feature amount).

In S1818, the feature amount matching unit 107 calculates a similarity score between the two feature amounts obtained in S1815 and S1817. As in the embodiments described above, the two objects are determined to be the same if the similarity score is at or above a predetermined threshold, and to be different objects if it is below the threshold.
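The branching of FIG. 18 reduces to a gate placed in front of the feature amount conversion. The sketch below is illustrative only. The text does not say which of the two state scores is compared with the threshold, so requiring both to exceed it is an assumption, as are the function name, the `compare` callback (standing in for the conversion and similarity steps), and the threshold value.

```python
def match_with_state_gate(state1, score1, state2, score2, compare, score_threshold=0.9):
    """Return a similarity, skipping feature conversion when states clearly differ.

    compare: zero-argument callable that performs the feature amount conversion
             and similarity calculation (S1808-S1811 or S1814-S1818).
    """
    states_differ = state1 != state2                      # S1807
    confident = min(score1, score2) > score_threshold     # S1812 (assumed: both scores)
    if states_differ and confident:
        return 0.0                                        # S1813: conversion is skipped
    return compare()
```

The expensive work sits entirely inside `compare`, so the early return is where the processing saving comes from.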

<1-to-N image matching process>
FIG. 19 shows an example of the functional configuration of the image processing device 19. The basic configuration follows FIG. 15. The differences are the addition of a processing mode setting unit 1901 and a feature amount registration unit 1902. The flow of the matching process is shown in FIG. 20: FIG. 20(A) shows the operation of registering persons, and FIG. 20(B) shows the operation of matching an input image against the registered persons.

In the registration operation, the parameter determination unit 1502 determines the conversion parameters according to the race state of each registered person, which is acquired in advance. At registration time the race of the registered person can be known accurately, so it need not be estimated from the image. The specific flow is described with reference to FIG. 20(A).

In S2001a, the processing mode setting unit 109 sets the current operation mode to the registration operation mode.

In S2002a, the processing mode setting unit 109 acquires the race state of each registered person. For example, a list of the race state of each registered person is stored in advance in the storage unit H104, such as an HDD, and is read from there. Alternatively, the race state of the person to be registered is acquired through the acquisition unit H105, such as a keyboard.

S2003a is the start of a loop that processes the registered persons in order. The registered persons are assumed to be numbered sequentially from 1. Because the registered persons are referenced with a variable i, i is first initialized to 1. When i is less than or equal to the number of registered persons, the process proceeds to S2005a; otherwise, the process exits the loop and ends.

In S2004a, the parameter determination unit 1502 reads the conversion parameters corresponding to the state of person i, as acquired by the processing mode setting unit 109, from the storage unit 1503 and sets them in the second feature amount conversion unit 1504.

In S2005a, the first image acquisition unit 101 acquires the registration image of person i.

In S2006a, the first feature amount conversion unit 1501 converts the registration image into an intermediate feature amount.

In S2007a, the second feature amount conversion unit 1504 converts the intermediate feature amount to obtain a facial feature amount.

In S2008a, the facial feature amount is registered in the feature amount registration unit 1902 as the facial feature amount of person i. The race state of person i is registered as well.

S2009a is the end of the loop over registered persons; i is incremented by 1 and the process returns to S2003a.
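The registration loop of FIG. 20(A) can be sketched as below. This is a minimal sketch under assumptions: the function and argument names, the tuple layout of `people`, and the dictionary used as the gallery are hypothetical, and `backbone` and `head` stand in for the first and second feature amount conversion units.

```python
def register_people(people, params_by_state, backbone, head):
    """Build a gallery of facial features using each person's known state.

    people:          list of (person_id, known_state, image) tuples
    params_by_state: mapping state -> conversion parameters (storage unit 1503)
    backbone:        callable image -> intermediate feature
    head:            callable (intermediate feature, parameters) -> facial feature
    """
    gallery = {}
    for person_id, state, image in people:        # S2003a-S2009a loop
        params = params_by_state[state]           # S2004a: state is given, not estimated
        intermediate = backbone(image)            # S2005a-S2006a
        feature = head(intermediate, params)      # S2007a
        gallery[person_id] = {"feature": feature, "state": state}  # S2008a
    return gallery
```

Storing the state next to the feature is what later allows the candidates to be narrowed down during matching.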

Next, the operation of matching an input image against the registered persons is described with reference to FIG. 20(B). During matching, the state of the input image, such as race, is unknown, so processing is based on the state estimated from the image. Also, when the state is race, gender, or the like, persons in different states can be judged to be different persons. Therefore, when the state of the input image can be estimated with high confidence, the registered persons to be matched are narrowed down, which improves processing speed. In this example, the state obtained by the parameter determination unit 1502 is "race."

In S2001b, the processing mode setting unit 109 sets the operation mode to the matching operation mode. As a result, the state is no longer acquired from the processing mode setting unit 109.

In S2002b, the second image acquisition unit 102 acquires the query image (second image).

In S2003b, the first feature amount conversion unit 1501 converts the second image into an intermediate feature amount (second intermediate feature amount).

In S2004b, the parameter determination unit 1502 determines the state of the second image (second state) from the second intermediate feature amount. Specifically, it determines the race of the person in the second image.

In S2005b, the parameter determination unit 1502 selects, from the storage unit 1503, the conversion parameters corresponding to the second state. The selected conversion parameters are set in the (third) trained model of the third feature amount conversion unit 1505.

In S2006b, the third feature amount conversion unit 1505 converts the second intermediate feature amount to obtain a facial feature amount (second facial feature amount).

In S2007b, it is determined whether the score of the state output by the parameter determination unit 1502 (state score) is high. If the state score is greater than a predetermined threshold, the process proceeds to S2008b; otherwise, it proceeds to S2009b.

In S2008b, the feature amount matching unit 107 narrows the candidates down to the registered persons whose state is the same as the second state. In this embodiment, that means narrowing down to registered persons of the same race.

S2009b is the start of a loop that processes the registered persons in order. If the registered persons were narrowed down in S2008b, the feature amount matching unit 107 matches only the narrowed-down persons, in order. To reference the registered persons in order with a variable i, the persons to be processed are first numbered sequentially from 1, and i is initialized to 1. When i is less than or equal to the number of registered persons to be processed, the process proceeds to S2010b; otherwise, the process exits the loop and proceeds to S2012b.

In S2010b, the feature amount matching unit 107 obtains the facial feature amount of person i stored in the feature amount registration unit 1902. It then calculates a similarity score between the second facial feature amount obtained in S2006b and the facial feature amount of person i.

S2011b is the end of the loop over registered persons; i is incremented by 1 and the process returns to S2009b.

In S2012b, if any person has a similarity score, obtained in S2010b, at or above a predetermined value, the output unit 1900 outputs that result. The output unit 1900 outputs the matching result of the feature amount matching unit 107, that is, the face authentication result, to a display device or the like.
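The narrowing and matching of S2007b to S2012b can be sketched as follows. The gallery layout (a dictionary of `{"feature", "state"}` entries), the dot-product similarity, and both threshold values are assumptions made for illustration; the embodiment does not prescribe them here.

```python
def dot(a, b):
    """Toy similarity: inner product of two feature vectors."""
    return sum(x * y for x, y in zip(a, b))

def identify(query_feature, query_state, state_score, gallery,
             similarity=dot, state_threshold=0.9, match_threshold=0.8):
    """Return {person_id: score} for registered persons that match the query."""
    candidates = gallery
    if state_score > state_threshold:                       # S2007b
        # S2008b: keep only registered persons in the same state as the query.
        candidates = {pid: entry for pid, entry in gallery.items()
                      if entry["state"] == query_state}
    scores = {pid: similarity(query_feature, entry["feature"])   # S2009b-S2011b
              for pid, entry in candidates.items()}
    # S2012b: report persons whose score is at or above the threshold.
    return {pid: s for pid, s in scores.items() if s >= match_threshold}
```

When the state score is low, the full gallery is searched, so a doubtful state estimate cannot exclude the correct person.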

<Example of a state determination method>
A method by which the first feature amount conversion unit 1501 and the parameter determination unit 1502 obtain the state from an image is described here. Both units are built with the DNN described above. The parameter determination unit 1502 is configured so that the number of outputs of its neural network equals the number of states and the outputs are passed through a Softmax function.

Next, the units are trained so that the state can be obtained from an image. In this embodiment, a state label is associated with each dimension of the Softmax output of the parameter determination unit 1502, and training makes the dimension for the image's state take the value 1 and all other dimensions take 0. The training flow is described with reference to FIG. 21.

In S2101, the parameter set used by the first feature amount conversion unit 1501 is initialized, for example with random numbers. Alternatively, it may be initialized with a parameter set obtained by training for face authentication with the method shown in FIG. 5(A) and elsewhere above.

In S2102, the parameter set used by the parameter determination unit 1502 is initialized, for example with random numbers.

In S2103, a group of face images with state labels is acquired. For example, if the state is race, a group of face images labeled with race is acquired.

In S2104, the parameter determination unit 1502 estimates the state label: with an image as input, the DNN is run forward to obtain the Softmax values.

In S2105, the loss is calculated based on Formula 9, known as cross entropy.

(Formula 9)
$$\text{loss} = -\sum_{i} p(i)\,\log q(i)$$
Here, p(i) is the ground-truth label information, which takes 1 when the i-th state is the correct one and 0 otherwise. q(i) is the value of the Softmax function for the i-th state.
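As a small worked example of Formula 9: with a one-hot label, only the term of the correct state survives, so the loss is the negative logarithm of the Softmax value assigned to that state. The function name and the sample values below are illustrative assumptions.

```python
import math

def cross_entropy(p, q):
    """Formula 9: -sum_i p(i) * log(q(i)).

    p: one-hot ground-truth label (1 for the correct state, 0 otherwise)
    q: Softmax outputs, one per state
    Terms with p(i) == 0 are skipped, which also avoids evaluating log(0).
    """
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# Three states, the second one is correct and receives probability 0.7,
# so the loss equals -log(0.7).
example_loss = cross_entropy([0, 1, 0], [0.2, 0.7, 0.1])
```

The loss approaches 0 as the Softmax value of the correct state approaches 1, which is the behavior the training in FIG. 21 drives toward.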

In S2106, the parameter sets of the first feature amount conversion unit 1501 and the parameter determination unit 1502 are updated so that the loss value decreases. Using the error backpropagation method common for DNNs, the parameters are updated in small steps in the direction that reduces the loss value.

In S2107, it is determined whether training has finished. For example, training is judged to have finished when the decrease in the loss value becomes small. Alternatively, training may be judged to have finished once it has been repeated a predetermined number of times. If training has finished, the process proceeds to S2108; otherwise, it returns to S2103.

In S2108, the parameter set of the first feature amount conversion unit 1501 is stored.

In S2109, the parameter set of the parameter determination unit 1502 is stored.

Using the parameter sets of the first feature amount conversion unit 1501 and the parameter determination unit 1502 obtained in this way, the state of an image can be determined. Specifically, the Softmax values for the image are computed, and the image is judged to be in the state corresponding to the dimension with the largest value. Because the Softmax value becomes larger as the confidence becomes higher, it can also be used as the state score.
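The training flow of FIG. 21 and the subsequent state determination can be illustrated with a deliberately reduced model. The sketch below replaces the DNN with a single linear layer followed by Softmax, trained by plain gradient descent, and initializes the weights to zero rather than with random numbers so that the result is reproducible; these substitutions, the learning rate, and the stopping tolerance are all assumptions, not the embodiment's configuration.

```python
import math

def softmax(z):
    m = max(z)                                   # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def logits(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def train(samples, n_states, lr=0.5, max_iters=200, tol=1e-6):
    """samples: list of (feature_vector, state_index) pairs with state labels."""
    dim = len(samples[0][0])
    w = [[0.0] * dim for _ in range(n_states)]   # S2101-S2102 (zeros, for determinism)
    prev_loss = float("inf")
    for _ in range(max_iters):
        loss = 0.0
        grad = [[0.0] * dim for _ in range(n_states)]
        for x, y in samples:                     # S2103
            q = softmax(logits(w, x))            # S2104: forward pass
            loss -= math.log(q[y])               # S2105: cross entropy (Formula 9)
            for k in range(n_states):
                err = q[k] - (1.0 if k == y else 0.0)   # gradient w.r.t. logit k
                for d in range(dim):
                    grad[k][d] += err * x[d]
        for k in range(n_states):                # S2106: step against the gradient
            for d in range(dim):
                w[k][d] -= lr * grad[k][d] / len(samples)
        if prev_loss - loss < tol:               # S2107: loss has stopped decreasing
            break
        prev_loss = loss
    return w                                     # S2108-S2109: parameters to store

def predict(w, x):
    """Return (state index with the largest Softmax value, that value as state score)."""
    q = softmax(logits(w, x))
    k = max(range(len(q)), key=q.__getitem__)
    return k, q[k]
```

`predict` returns both the state and the Softmax value used as the state score, which is the pair that the gating steps S1812 and S2007b consume.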

As described above, sharing the computation up to the intermediate feature amount between state determination and feature amount conversion increases the processing speed. In addition, the models for state determination and feature amount conversion can be made smaller, which reduces memory usage. Because the conversion parameters managed in the storage unit 1503 also become smaller, they can be read out faster.

In addition, in cases where a difference in state such as race or age implies a difference in person, when the states are judged with high confidence to differ, the feature amount conversion is skipped and the similarity is estimated to be low. This speeds up processing. Estimating the similarity to be low on the basis of a difference in state is also applicable when the computation up to the intermediate feature amount is not shared between state determination and feature amount conversion. That is, it is applicable even when both state determination and feature amount conversion take the image as input, as in Embodiments 1 and 2. As the state, an attribute that rarely changes over a person's lifetime should be chosen. Alternatively, if the period of operation is short, appearance attributes such as age, the presence or absence of a beard, or hairstyle may be used. A substitute attribute such as skin color may be used in place of race. The states used are therefore not limited to race and gender.

<Other derivative forms>
Although this specification has focused on matching persons, the present invention is applicable to various tasks involving identity matching and similarity calculation. Examples include detecting objects of a specific category, image query tasks that extract designs of a specific shape from video, and similar-image search.

The states determined by the condition determination unit 103 and the processing mode setting unit 109 include the image quality of the input image, the apparent angle of the object, the size of the object, the clarity of the object's appearance, the brightness of the illumination, occlusion of the object, the presence or absence of attachments or worn items on the object, the subtype of the object, or a combination of these.

Although two sets of parameters were used here according to the state of the object, a form that switches among three or more sets is also conceivable.

Although the embodiments illustrated here center on image recognition, matching and similarity search for information other than images, such as audio signals and music, are also conceivable. By using a technique that converts text into feature amounts, as in Patent Document 2, the invention could also be applied to tasks such as matching and retrieving documents with similar meaning among text information such as books, SNS logs, and forms. Because books, SNS posts, and the like each have vocabulary and formats specific to their category, using a different feature amount conversion means for each document category leaves room to improve performance.

The embodiments mainly described matching to decide whether objects are the same, but the similarity value between objects can also be estimated by regression. To do so, the true similarity between each pair of object i and object j is given as a teacher value, and the loss value is defined as the squared error against the estimated similarity score, for example as in the following formula.

(Formula 10)
$$\text{loss} = \sum_{i}\sum_{j}\bigl(\text{true pair similarity score}(f_i, f_j) - \text{pair similarity score}(f_i, f_j)\bigr)^{2}$$
The parameters of the feature amount conversion unit 105 and the feature amount conversion unit 106 are each trained so as to reduce this loss value. Here, $f_i$ and $f_j$ are a pair of image feature amounts converted by the first trained model and the second trained model, respectively. This shows that the present invention is applicable to a variety of tasks.
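Formula 10 can be evaluated directly once the true and estimated similarity scores are arranged as matrices indexed by i and j. The sketch below assumes that layout; the function name and the matrix representation are illustrative choices, not part of the embodiment.

```python
def pair_regression_loss(true_scores, estimated_scores):
    """Formula 10: sum over all pairs (i, j) of the squared score error.

    true_scores[i][j]:      teacher similarity for the pair (object i, object j)
    estimated_scores[i][j]: similarity computed from the converted features f_i, f_j
    """
    return sum(
        (t - e) ** 2
        for true_row, est_row in zip(true_scores, estimated_scores)
        for t, e in zip(true_row, est_row)
    )
```

For two objects with true scores `[[1.0, 0.0], [0.0, 1.0]]` and estimates `[[1.0, 0.5], [0.5, 1.0]]`, the two off-diagonal errors of 0.5 each contribute 0.25, giving a loss of 0.5.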

The present invention can also be realized by the following processing: software (a program) that implements the functions of the embodiments described above is supplied to a system or device via a data communication network or various storage media, and a computer (or a CPU, MPU, or the like) of that system or device reads out and executes the program. The program may also be provided recorded on a computer-readable recording medium.

1 Image processing device
101 First image acquisition unit
102 Second image acquisition unit
103 Object parameter determination
104 Storage unit
105 First feature amount conversion unit
106 Second feature amount conversion unit
107 Feature amount matching unit

Claims

An image processing device comprising:
a first acquisition means for acquiring a first feature amount of a first object in a first image that meets a predetermined condition, the first feature amount being obtained using a first trained model;
a second acquisition means for acquiring a second feature amount of a second object in a second image that does not meet the predetermined condition, the second feature amount being obtained using a second trained model; and
a matching means for determining, based on the first feature amount and the second feature amount, whether the first object in the first image and the second object in the second image are the same object,
wherein the first and second trained models are trained so that, when the first object and the second object are the same object, the first feature amount obtained using the first trained model as the feature amount of the first object that meets the predetermined condition and the second feature amount obtained using the second trained model as the feature amount of the second object that does not meet the predetermined condition become similar feature amounts.

The image processing device according to claim 1, further comprising a determination means for determining whether an object in an image meets the predetermined condition,
wherein, when the determination means determines that a third object in a third image meets the predetermined condition, the second acquisition means acquires a third feature amount of the third object, the third feature amount being obtained using the first trained model, and
the matching means determines, based on the first feature amount and the third feature amount, whether the first object in the first image and the third object in the third image are the same object.

The image processing apparatus according to claim 2, wherein the determination means determines whether an object in an image meets the predetermined condition based on a result of detecting at least one of the following states: the image quality of the input image, the apparent angle of the object, the size of the object, the clarity of the object's appearance, the brightness of the illumination, occlusion of the object, the presence or absence of attachments or worn items on the object, and the subtype of the object.

The image processing apparatus according to any one of claims 1 to 3, wherein the first object is a person wearing a mask.

The image processing apparatus according to any one of claims 1 to 4, further comprising learning means for training the first trained model and the second trained model so that, when the first object and the second object are the same object, the first feature amount, obtained using the first trained model as the feature amount of the first object that meets the predetermined condition, and the second feature amount, obtained using the second trained model as the feature amount of the second object that does not meet the predetermined condition, become similar.

The image processing apparatus according to claim 5 , wherein the learning means learns each of the first trained model and the second trained model based on a plurality of image groups.

The image processing apparatus according to claim 6, wherein the plurality of image groups include a first image group serving as a reference and a second image group obtained by converting the reference image group, and
the learning means performs training so that, when an object included in the first image group and an object included in the second image group are the same object, the degree of similarity between the feature amount of the object included in the first image group and the feature amount of the object included in the second image group becomes larger than a predetermined value.
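
One way to express the training objective above is a loss that is zero once the similarity reaches the predetermined value and grows as the similarity falls short of it. The sketch below is an assumption about how such an objective could be written, not the loss used in the embodiments; cosine similarity, the function name, and the target value 0.9 are all illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def similarity_shortfall_loss(reference_feature, converted_feature, target_similarity=0.9):
    """Hinge-style loss for a same-object pair.

    reference_feature: feature of an object in the first (reference) image group.
    converted_feature: feature of the same object in the second (converted) group.
    Returns 0 when the similarity is at or above target_similarity (assumed value).
    """
    similarity = cosine_similarity(reference_feature, converted_feature)
    return max(0.0, target_similarity - similarity)
```

Minimizing this quantity over same-object pairs pushes the two feature amounts toward each other until the similarity requirement is met.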

The image processing apparatus according to claim 7, wherein the second image group is an image group in which a worn item is composited onto a specific object in the first image group.
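
The conversion from the first image group to the second can be pictured as a simple compositing step. The sketch below is a deliberately crude stand-in: it treats an image as a 2D list of grayscale values and overwrites the lower rows with a constant, as if a mask covered the lower part of a face. The function name, the fill value, and the covered fraction are assumptions; a practical system would align the worn item to detected facial landmarks.

```python
def composite_mask(image, mask_value=255, start_fraction=0.5):
    """Return a copy of a 2D grayscale image with its lower rows overwritten.

    image: list of rows, each a list of pixel values.
    mask_value: pixel value used for the synthetic worn item (assumed).
    start_fraction: fraction of the height at which the covered region starts.
    """
    height = len(image)
    start_row = int(height * start_fraction)
    return [
        [mask_value] * len(row) if r >= start_row else list(row)
        for r, row in enumerate(image)
    ]


def build_second_image_group(first_image_group):
    """Apply the same conversion to every image of the reference group."""
    return [composite_mask(image) for image in first_image_group]
```

Each converted image keeps the identity of its source image, so the pair can be used directly as a same-object training pair.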

The image processing device according to any one of claims 1 to 8, wherein at least one of the first trained model and the second trained model is a neural network including a plurality of layers.

The image processing apparatus according to any one of claims 1 to 8, wherein the first trained model and the second trained model are neural networks that share the parameters of some layers.
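
Parameter sharing between the two models can be illustrated with two small networks that each own one layer and hold a reference to a common layer. Which layers are shared is not specified by the claim; sharing the final layer here, and the class names `Linear` and `Extractor`, are assumptions for this sketch.

```python
class Linear:
    """Bias-free linear layer whose weights are stored as a list of rows."""

    def __init__(self, weights):
        self.weights = weights

    def __call__(self, x):
        return [sum(w * v for w, v in zip(row, x)) for row in self.weights]


def relu(x):
    return [max(0.0, v) for v in x]


class Extractor:
    """Feature extractor with one model-specific layer and one shared layer."""

    def __init__(self, own_layer, shared_layer):
        self.own = own_layer
        self.shared = shared_layer  # same object in both extractors

    def __call__(self, x):
        return self.shared(relu(self.own(x)))


shared = Linear([[1.0, -1.0]])
first_model = Extractor(Linear([[1.0, 0.0], [0.0, 1.0]]), shared)
second_model = Extractor(Linear([[0.0, 1.0], [1.0, 0.0]]), shared)
```

Because both extractors reference the same `shared` object, any update to its weights affects the output of both models, while the model-specific layers remain free to differ.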

The image processing apparatus according to any one of claims 1 to 9, wherein at least one of the first trained model and the second trained model is a transformer network.

The image processing device according to any one of claims 1 to 11, wherein the second trained model is trained based on feature amounts extracted using the first trained model.
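
One possible reading of this claim is feature-level distillation: the first trained model is held fixed, and the second model is updated so that its output approaches the feature amount the first model produced for the same object. The sketch below assumes a single linear layer for the second model, a squared-error objective, and plain gradient descent; none of these choices is stated in the claim.

```python
def distill_step(student_weights, student_input, teacher_feature, learning_rate=0.1):
    """One gradient step pulling the student's output toward the teacher's feature.

    student_weights: list of rows of a linear layer (the second model, assumed linear).
    student_input:   input vector for the second model.
    teacher_feature: feature already extracted by the fixed first model.
    Returns (updated_weights, squared_error_before_the_update).
    """
    output = [sum(w * x for w, x in zip(row, student_input)) for row in student_weights]
    errors = [o - t for o, t in zip(output, teacher_feature)]
    loss = sum(e * e for e in errors)
    # Gradient of the squared error with respect to weight w[i][j] is 2 * e[i] * x[j].
    updated = [
        [w - learning_rate * 2.0 * e * x for w, x in zip(row, student_input)]
        for row, e in zip(student_weights, errors)
    ]
    return updated, loss


def distill(student_weights, student_input, teacher_feature, steps=50):
    """Repeat distill_step and return the final weights and the loss history."""
    history = []
    for _ in range(steps):
        student_weights, loss = distill_step(student_weights, student_input, teacher_feature)
        history.append(loss)
    return student_weights, history
```

Training the second model against features from the first keeps both models in the same feature space, which is what allows their outputs to be compared directly at matching time.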

12. The image processing apparatus according to claim 1 , wherein the first trained model and the second trained model are trained simultaneously or alternately.

The image processing device according to claim 1, further comprising:
a third acquisition means that acquires an intermediate feature amount of the first image based on a third trained model that outputs, from an image, feature amounts related to the state of an object; and
a parameter determining means that determines parameters of the first trained model based on the acquired intermediate feature amount of the first image.

The image processing apparatus according to claim 14, wherein the third acquisition means further acquires an intermediate feature amount of the second image,
the parameter determining means determines parameters of the second trained model based on the acquired intermediate feature amount of the second image, and
when the attribute of the object indicated by the intermediate feature amount of the first image differs from the attribute of the object indicated by the intermediate feature amount of the second image, the parameters of the second trained model are determined to be different from the parameters of the first trained model.

15. The image processing device, wherein the first acquisition means acquires the first feature amount using the intermediate feature amount of the first image acquired by the third acquisition means.

16. The image processing device, wherein the second acquisition means acquires the second feature amount using the intermediate feature amount of the second image acquired by the third acquisition means.

An image processing device comprising:
a first acquisition means that acquires a first feature amount from a first image based on a first trained model that extracts features from an image;
a second acquisition means that acquires a second feature amount from a second image based on a second trained model that extracts features from an image, the second trained model being determined according to the state of the second image; and
a matching means that determines, based on the first feature amount and the second feature amount, whether an object included in the first image and an object included in the second image are the same,
wherein the second trained model is trained based on feature amounts extracted using the first trained model.

An image processing apparatus comprising:
a first acquisition means that acquires a first feature amount from a first image based on a first trained model that extracts features from an image;
a second acquisition means that acquires a second feature amount from a second image based on a second trained model that extracts features from an image, the second trained model being determined according to the state of the second image;
a matching means that determines, based on the first feature amount and the second feature amount, whether an object included in the first image and an object included in the second image are the same;
a third acquisition means that acquires an intermediate feature amount of the first image based on a third trained model that outputs, from an image, feature amounts related to the state of an object; and
a parameter determining means that determines parameters of the first trained model based on the intermediate feature amount of the first image.

20. The image processing device, wherein the second trained model is a model trained so that the second feature amount lies in the same feature space as that of the first trained model.

An image processing apparatus comprising:
a registration means for registering, for each image of a registered person who meets a predetermined condition, a first feature amount obtained using a first trained model;
a selection means for selecting the first trained model or a second trained model different from the first trained model depending on whether a person in a second image meets the predetermined condition;
an acquisition means for acquiring a second feature amount as a feature amount of the person in the second image based on the trained model selected by the selection means; and
a matching means for determining, based on the first feature amount and the second feature amount, whether the person in the second image is the same as any of the registered persons,
wherein the first and second trained models are trained so that, when a registered person who meets the predetermined condition and a person in the second image who does not meet the predetermined condition are the same person, the first feature amount, obtained using the first trained model as the feature amount of the registered person who meets the predetermined condition, and the second feature amount, obtained using the second trained model as the feature amount of the person who does not meet the predetermined condition, become similar feature amounts.
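
The registration, selection, acquisition, and matching means of this claim can be sketched end to end as follows. Only one feature amount per registered person is stored, which reflects the stated aim of reducing the information to be registered. The gallery dictionary, the function names, the use of cosine similarity, and the threshold of 0.8 are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def register(gallery, person_id, image, first_model):
    """Store a single feature per registered person, from the first model only."""
    gallery[person_id] = first_model(image)


def identify(query_image, gallery, meets_condition, first_model, second_model, threshold=0.8):
    """Return the best-matching registered person, or None if nobody reaches the threshold."""
    # Selection: choose the model according to the state of the query image.
    model = first_model if meets_condition(query_image) else second_model
    query_feature = model(query_image)
    best_id, best_similarity = None, threshold
    for person_id, registered_feature in gallery.items():
        similarity = cosine_similarity(query_feature, registered_feature)
        if similarity >= best_similarity:
            best_id, best_similarity = person_id, similarity
    return best_id
```

The same gallery serves queries in either state, because the second model is trained to land in the feature space of the first.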

An image processing method comprising:
a first acquisition step of acquiring a first feature amount of a first object that meets a predetermined condition in a first image, the first feature amount being obtained using a first trained model;
a second acquisition step of acquiring a second feature amount of a second object that does not meet the predetermined condition in a second image, the second feature amount being obtained using a second trained model; and
a matching step of determining, based on the first feature amount and the second feature amount, whether the first object in the first image and the second object in the second image are the same object,
wherein the first and second trained models are trained so that, when the first object and the second object are the same object, the first feature amount, obtained using the first trained model as the feature amount of the first object that meets the predetermined condition, and the second feature amount, obtained using the second trained model as the feature amount of the second object that does not meet the predetermined condition, become similar feature amounts.

An image processing method comprising:
a registration step of registering, for each image of a registered person who meets a predetermined condition, a first feature amount obtained using a first trained model;
a selection step of selecting the first trained model or a second trained model different from the first trained model depending on whether a person in a second image meets the predetermined condition;
an acquisition step of acquiring a second feature amount as a feature amount of the person in the second image based on the trained model selected in the selection step; and
a matching step of determining, based on the first feature amount and the second feature amount, whether the person included in the second image is the same as any of the registered persons,
wherein the first and second trained models are trained so that, when a registered person who meets the predetermined condition and a person in the second image who does not meet the predetermined condition are the same person, the first feature amount, obtained using the first trained model as the feature amount of the registered person who meets the predetermined condition, and the second feature amount, obtained using the second trained model as the feature amount of the person who does not meet the predetermined condition, become similar feature amounts.

A program for causing a computer to function as each means included in the image processing apparatus according to any one of claims 1 to 21.