JP7070157B2 - Image processing program, image processing device and image processing method

Info

Publication number: JP7070157B2
Application number: JP2018124232A
Authority: JP
Inventors: 琢麿山本
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2018-06-29
Filing date: 2018-06-29
Publication date: 2022-05-18
Anticipated expiration: 2038-06-29
Also published as: JP2020004179A

Description

The present invention relates to an image processing program, an image processing device, and an image processing method.

Background subtraction is a known technique for extracting a pixel region from a captured image, with a predetermined subject as the extraction target. It extracts the pixel region of the extraction target by taking the difference between an input image, which is a captured image that includes the extraction target, and a background image, which is a captured image that does not. Background subtraction does extract the pixel region of the extraction target, but it also extracts pixel regions that merely accompany the extraction target (for example, the shadow of the subject).
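The behavior described above can be illustrated with a minimal sketch. It assumes grayscale images stored as nested lists and a fixed threshold; the function name, pixel values, and threshold are illustrative assumptions, not part of the patent.

```python
def background_subtract(input_img, background_img, threshold=30):
    """Return a binary mask: 1 where |input - background| exceeds the threshold."""
    return [
        [1 if abs(p - q) > threshold else 0 for p, q in zip(row_in, row_bg)]
        for row_in, row_bg in zip(input_img, background_img)
    ]

# Hypothetical 2x4 grayscale images.
background = [[100, 100, 100, 100],
              [100, 100, 100, 100]]
# A bright subject (200) in column 1 and its darker shadow (60) in column 2.
frame = [[100, 200, 60, 100],
         [100, 200, 60, 100]]

mask = background_subtract(frame, background)
print(mask)  # [[0, 1, 1, 0], [0, 1, 1, 0]]
```

The shadow column is marked along with the subject column, which is the over-extraction problem noted above.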

In response, the patent document below, for example, proposes an extraction method that applies background subtraction while focusing on the difference in texture component (AC component) between the subject to be extracted and the subject's shadow, thereby extracting a pixel region that excludes the shadow.

Japanese Patent Application Laid-Open No. 2003-141546

With the above extraction method, however, a region of the subject that has a small texture component is excluded in the same way as the subject's shadow, for example. The method therefore cannot extract the pixel region of the subject appropriately (without excess or deficiency).

One aspect aims to improve the accuracy with which the pixel region of an extraction target is extracted from an image.

According to one aspect, an image processing program causes a computer to execute a process of:
acquiring a first image that includes an extraction target and a second image that does not include the extraction target;
inputting the first image and the second image, respectively, into a neural network trained to identify a plurality of identification targets including the extraction target, and acquiring, from a predetermined layer among the plurality of layers of the neural network, a first intermediate image corresponding to the first image and a second intermediate image corresponding to the second image; and
extracting the pixel region of the extraction target based on the difference between the first intermediate image and the second intermediate image.
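The three steps of this aspect can be sketched as follows. This is a toy illustration under stated assumptions: `toy_layer` is a hypothetical stand-in for the predetermined layer of a trained neural network (a single ReLU-like feature map that responds only to bright pixels), and thresholding the absolute feature difference is one assumed way to turn the difference into a pixel region.

```python
def toy_layer(image):
    """Hypothetical stand-in for a trained layer: returns a list of feature maps."""
    return [[[max(0, p - 150) for p in row] for row in image]]

def feature_difference_mask(first_image, second_image, layer, threshold=10):
    """Mark pixels where any feature map differs between the two images."""
    first_maps = layer(first_image)    # first intermediate image
    second_maps = layer(second_image)  # second intermediate image
    height, width = len(first_image), len(first_image[0])
    mask = [[0] * width for _ in range(height)]
    for fmap1, fmap2 in zip(first_maps, second_maps):
        for y in range(height):
            for x in range(width):
                if abs(fmap1[y][x] - fmap2[y][x]) > threshold:
                    mask[y][x] = 1
    return mask

# Hypothetical images: subject (200) in column 1, its shadow (60) in column 2.
second_image = [[100, 100, 100, 100],
                [100, 100, 100, 100]]
first_image = [[100, 200, 60, 100],
               [100, 200, 60, 100]]

region = feature_difference_mask(first_image, second_image, toy_layer)
print(region)  # [[0, 1, 0, 0], [0, 1, 0, 0]]
```

Because the stand-in layer carries no response for the shadow, the shadow column drops out of the difference while the subject column remains.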

The accuracy of extracting the pixel region of an extraction target from an image can be improved.

FIG. 1 is a diagram showing an example of the system configuration of the image processing system and the functional configuration of the image processing device.
FIG. 2 is a diagram showing an example of the hardware configuration of the image processing device.
FIG. 3 is a diagram showing a specific example of learning image information.
FIG. 4 is a diagram showing a specific example of the first image acquired by the first image acquisition unit.
FIG. 5 is a diagram showing a specific example of the second image acquired by the second image acquisition unit.
FIG. 6 is a first diagram showing a specific example of processing by the object identification unit.
FIG. 7 is a diagram showing the relationship between identification targets and non-identification targets, and extraction targets and non-extraction targets.
FIG. 8 is a second diagram showing a specific example of processing by the object identification unit.
FIG. 9 is a third diagram showing a specific example of processing by the object identification unit.
FIG. 10 is a diagram showing a specific example of processing by the region extraction unit.
FIG. 11 is a flowchart showing the flow of image processing by the image processing unit.
FIG. 12 is a first diagram showing an application example of the image processing system.
FIG. 13 is a second diagram showing an application example of the image processing system.

Each embodiment is described below with reference to the accompanying drawings. In this specification and the drawings, components having substantially the same functional configuration are given the same reference numerals, and duplicate description is omitted.

[First Embodiment]
<System configuration of the image processing system and functional configuration of the image processing device>
First, the system configuration of the image processing system and the functional configuration of the image processing device are described. FIG. 1 is a diagram showing an example of the system configuration of the image processing system and the functional configuration of the image processing device.

As shown in FIG. 1, the image processing system 100 includes an image pickup device 110, an image processing device 120, and a learning image information storage unit 130. The image pickup device 110 and the image processing device 120 are communicably connected. Likewise, the image processing device 120 and the learning image information storage unit 130 are communicably connected.

The image pickup device 110 is installed at a predetermined position, shoots in a fixed shooting direction, and transmits the captured images to the image processing device 120. The captured images transmitted by the image pickup device 110 include:
- a "first image" (the so-called input image), for which the image processing device 120 determines whether an extraction target is included; and
- a "second image" (the so-called background image), which does not include the extraction target and was captured at a different time under the same shooting conditions as the first image.

The extraction target is a subject, among those photographed by the image pickup device 110, that is designated by the administrator of the image processing system 100 and that changes during shooting (generally, a moving object).

An image processing program is installed in the image processing device 120. By executing the program, the image processing device 120 functions as a first image acquisition unit 121 and a second image acquisition unit 122, which are examples of an acquisition unit; as an object identification unit 123 and an intermediate image acquisition unit 124, which are examples of an identification unit; and as a region extraction unit 125, which is an example of an extraction unit.

The first image acquisition unit 121 acquires the first image transmitted from the image pickup device 110 and notifies the object identification unit 123 and the region extraction unit 125 of the acquired first image.

The second image acquisition unit 122 acquires the second image transmitted from the image pickup device 110 and notifies the object identification unit 123 of the acquired second image.

The object identification unit 123 has two functions: reading the learning image information stored in the learning image information storage unit 130 and training a neural network, and executing the trained neural network with the first image and the second image as input.

When the object identification unit 123 executes the trained neural network with the first image and the second image as input, the intermediate image acquisition unit 124 acquires the intermediate images output from a predetermined layer.

When the object identification unit 123 executes the trained neural network with the first image as input, the intermediate image acquisition unit 124 acquires the first intermediate image output from the predetermined layer.

Similarly, when the object identification unit 123 executes the trained neural network with the second image as input, the intermediate image acquisition unit 124 acquires the second intermediate image output from the predetermined layer.

The intermediate image acquisition unit 124 notifies the region extraction unit 125 of the acquired first and second intermediate images.

Using the first and second intermediate images notified by the intermediate image acquisition unit 124, the region extraction unit 125 extracts the pixel region of the extraction target in the first image notified by the first image acquisition unit 121. In the first embodiment, "extracting the pixel region of the extraction target" covers both cutting the pixel region out of the first image and specifying, in the first image, the coordinates of each pixel in the pixel region.
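The two senses of "extracting" mentioned here can be sketched as follows, assuming the pixel region is already available as a binary mask aligned with the first image. The function names, the mask representation, and the fill value are illustrative assumptions.

```python
def pixel_coordinates(mask):
    """Specify the region: list the (x, y) coordinates of every marked pixel."""
    return [(x, y) for y, row in enumerate(mask) for x, m in enumerate(row) if m]

def cut_out(image, mask, fill=0):
    """Cut the region out: keep pixel values inside the mask, fill the rest."""
    return [
        [p if m else fill for p, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]

# Hypothetical first image and region mask.
first_image = [[100, 200, 60, 100],
               [100, 200, 60, 100]]
mask = [[0, 1, 0, 0],
        [0, 1, 0, 0]]

coords = pixel_coordinates(mask)
cropped = cut_out(first_image, mask)
print(coords)   # [(1, 0), (1, 1)]
print(cropped)  # [[0, 200, 0, 0], [0, 200, 0, 0]]
```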

The learning image information storage unit 130 stores the learning image information that the object identification unit 123 uses when training the neural network.

In the above description, the object identification unit 123 has both the function of training the neural network and the function of executing the trained neural network. However, it is sufficient for the object identification unit 123 to have at least the function of executing the trained neural network; the training function may reside in another device. In that case, the image processing device 120 need not be communicably connected to the learning image information storage unit 130, and only needs a function for acquiring the trained neural network, or its weights, that the other device has trained using the learning image information.

<Hardware configuration of the image processing device>
Next, the hardware configuration of the image processing device 120 is described. FIG. 2 is a diagram showing an example of the hardware configuration of the image processing device. As shown in FIG. 2, the image processing device 120 includes a CPU (Central Processing Unit) 201, a ROM (Read Only Memory) 202, and a RAM (Random Access Memory) 203. The CPU 201, the ROM 202, and the RAM 203 form a so-called computer.

The image processing device 120 also includes an auxiliary storage device 204, a display device 205, an operation device 206, an I/F (Interface) device 207, and a drive device 208. The hardware components of the image processing device 120 are connected to one another via a bus 209.

The CPU 201 is an arithmetic device that executes the various programs (for example, the image processing program) installed in the auxiliary storage device 204.

The ROM 202 is a non-volatile memory. It functions as a main storage device that stores the programs, data, and the like that the CPU 201 needs in order to execute the various programs installed in the auxiliary storage device 204. Specifically, the ROM 202 stores boot programs such as BIOS (Basic Input/Output System) and EFI (Extensible Firmware Interface).

The RAM 203 is a volatile memory such as DRAM (Dynamic Random Access Memory) or SRAM (Static Random Access Memory). It functions as a main storage device that provides the work area into which the various programs installed in the auxiliary storage device 204 are loaded when executed by the CPU 201.

The auxiliary storage device 204 is an auxiliary storage device that stores the various programs and the information used when those programs are executed.

The display device 205 is a display device that displays the internal state of the image processing device 120. The operation device 206 is an input device with which the administrator of the image processing device 120 inputs various instructions to the image processing device 120.

The I/F device 207 is connected to the image pickup device 110 and transfers captured images between the image pickup device 110 and the image processing device 120. The I/F device 207 is also connected to the learning image information storage unit 130 and transfers learning image information between the image processing device 120 and the learning image information storage unit 130.

The drive device 208 is a device into which a recording medium 210 is set. The recording medium 210 here includes media that record information optically, electrically, or magnetically, such as a CD-ROM, a flexible disk, or a magneto-optical disk. The recording medium 210 may also include semiconductor memory that records information electrically, such as a ROM or a flash memory.

The various programs installed in the auxiliary storage device 204 are installed, for example, by setting a distributed recording medium 210 in the drive device 208 and having the drive device 208 read the programs recorded on that recording medium 210. Alternatively, the programs may be installed by downloading them from a network.

<Specific example of learning image information>
Next, a specific example of the learning image information stored in the learning image information storage unit 130 is described. FIG. 3 is a diagram showing a specific example of learning image information. As shown in FIG. 3, the learning image information 300 includes "image data ID", "file name", "identification target", and "storage destination" as information items.

The "image data ID" item stores an identifier for identifying the image data. The example of FIG. 3 shows that N pieces of image data are stored in the learning image information storage unit 130.

The "file name" item stores the file name of the image data. The "identification target" item stores information indicating the general object (hereinafter, "general object") contained in each piece of image data that serves as the correct answer when the object identification unit 123 trains the neural network. For example, the object identification unit 123 trains the neural network so that the image data with image data ID = "ID001" is identified as "target A" when input to the neural network. In the first embodiment, to simplify the explanation, the object identification unit 123 trains the neural network to identify four types of general objects ("target A", "target B", "target C", "target D").

The "storage destination" item stores the name of the folder in the learning image information storage unit 130 in which the image data is stored.
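The four information items can be pictured as one record per image. The sketch below is a minimal illustration: the class and field names, the file and folder names, and the pairing of "ID002" with "target B" are hypothetical placeholders; only the four items, "ID001" mapping to "target A", and the four target names come from the description.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LearningImageRecord:
    image_data_id: str          # "image data ID"
    file_name: str              # "file name"
    identification_target: str  # correct-answer general object used in training
    storage_destination: str    # folder name in the storage unit

# Hypothetical rows; file names and folder names are placeholders.
records = [
    LearningImageRecord("ID001", "image001.jpg", "target A", "folder1"),
    LearningImageRecord("ID002", "image002.jpg", "target B", "folder1"),
]

# The four identification targets of the first embodiment.
CLASSES = ["target A", "target B", "target C", "target D"]

def label_index(record):
    """Map a record's identification target to a class index for training."""
    return CLASSES.index(record.identification_target)

labels = [label_index(r) for r in records]
print(labels)  # [0, 1]
```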

In the learning image information 300 shown in FIG. 3, every piece of image data is shown as containing one of the identification targets, but the learning image information 300 may also include image data that contains no identification target. It may likewise include image data that contains a non-identification target (a general object other than the identification targets).

The image data of the learning image information 300 shown in FIG. 3 must include image data containing the extraction target, but may also include image data that does not contain it. Furthermore, it may include image data containing a non-extraction target (a subject photographed by the image pickup device 110 other than the extraction target).

<Specific examples of the first image and the second image>
Next, specific examples of the first image acquired by the first image acquisition unit 121 and the second image acquired by the second image acquisition unit 122 are described.

(1) Specific example of the first image
FIG. 4 is a diagram showing a specific example of the first image acquired by the first image acquisition unit. As shown in FIG. 4, the first image 400, a specific example of an input image, contains a plurality of general objects. In the example of FIG. 4, the first image 400 contains "target A", "target F", "target G", and "target H" as general objects.

As described above, among the general objects in the first image 400, "target A" is an identification target when the object identification unit 123 executes the trained neural network. In the first embodiment, "target A" is also the extraction target that the region extraction unit 125 extracts. By contrast, "target F", "target G", and "target H" are non-identification targets when the object identification unit 123 executes the trained neural network, and are also non-extraction targets that the region extraction unit 125 does not extract.

(2) Specific example of the second image
FIG. 5 is a diagram showing a specific example of the second image acquired by the second image acquisition unit. As shown in FIG. 5, the second image 500, a specific example of a background image, contains a plurality of general objects. In the example of FIG. 5, the second image 500 contains "target F" and "target I" as general objects.

The general objects in the second image 500 ("target F", "target I") are all non-identification targets when the object identification unit 123 executes the trained neural network, and are also non-extraction targets that the region extraction unit 125 does not extract. The second image 500 thus contains no general object that is an extraction target.

If the first image 400 contains general objects that are identification targets but non-extraction targets ("target B", "target C", "target D"), the second image 500 must also contain those general objects.
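The constraints on the second image can be written as a small set check. This is a sketch under assumptions: each image is represented simply by the set of general-object names it contains, and the function name is hypothetical. The likely rationale is that identified-but-unwanted objects must appear in both images so that their features cancel in the difference.

```python
def second_image_is_valid(first_objects, second_objects,
                          identification_targets, extraction_targets):
    """Check that the second image can serve as background for the first image."""
    # The second image must not contain any extraction target.
    if second_objects & extraction_targets:
        return False
    # Identified but non-extracted objects in the first image must also
    # appear in the second image.
    must_also_appear = (first_objects & identification_targets) - extraction_targets
    return must_also_appear <= second_objects

identification_targets = {"target A", "target B", "target C", "target D"}
extraction_targets = {"target A"}

# Contents of the first image 400 and the second image 500 as described.
first_400 = {"target A", "target F", "target G", "target H"}
second_500 = {"target F", "target I"}
ok = second_image_is_valid(first_400, second_500,
                           identification_targets, extraction_targets)

# Hypothetical variant: the first image also contains "target B".
not_ok = second_image_is_valid(first_400 | {"target B"}, second_500,
                               identification_targets, extraction_targets)
print(ok, not_ok)  # True False
```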

<Specific example of processing by the object identification unit>
Next, two processes are described as specific examples of processing by the object identification unit 123: reading the learning image information and training the neural network to identify general objects, and executing the trained neural network with the first image and the second image as input.

(1) Specific example of processing for training the neural network
First, a specific example of reading the learning image information and training the neural network to identify general objects is described. FIG. 6 is a first diagram showing a specific example of processing by the object identification unit.

As shown in FIG. 6(a), the object identification unit 123 has a convolutional neural network (CNN) with layers from a first layer to a third layer, into which each piece of image data of the learning image information 300 is input.

For reference, FIG. 6(b) shows the processing of a typical convolutional neural network. The first embodiment illustrates four types of identification targets to simplify the description, but in practice about 1000 types of identification targets are handled, as shown in FIG. 6(b).

Returning to FIG. 6(a): as described above, each piece of image data of the learning image information 300 is known image data containing one of the identification targets. The example of FIG. 6(a) shows N pieces of image data, with image data ID = "ID001" to "ID00N", being input as the image data of the learning image information 300.

The object identification unit 123 trains the filter coefficients of the convolutional neural network so that the identification targets contained in the N input pieces of image data are identified appropriately. The example of FIG. 6(a) shows the filter coefficients being trained so that the four types of general objects, "target A", "target B", "target C", and "target D", are identified.

In a convolutional neural network, the local features extracted by convolution are converted into higher-order features at each successive layer, and the identification result is finally obtained. Each layer of the trained convolutional neural network therefore outputs images (feature maps) containing features that contribute to identifying the identification targets.
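How a single convolution filter produces a feature map can be shown with a minimal sketch. It assumes a valid-mode cross-correlation (the operation CNN frameworks usually call convolution) followed by ReLU; the hand-picked edge kernel stands in for trained filter coefficients and is purely illustrative.

```python
def conv2d_relu(image, kernel):
    """Valid-mode 2D cross-correlation followed by ReLU; returns one feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for y in range(out_h):
        row = []
        for x in range(out_w):
            acc = sum(image[y + i][x + j] * kernel[i][j]
                      for i in range(kh) for j in range(kw))
            row.append(max(0, acc))  # ReLU keeps only positive responses
        feature_map.append(row)
    return feature_map

# Hypothetical 3x3 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 9],
         [0, 0, 9],
         [0, 0, 9]]
# Hand-picked kernel that responds to dark-to-bright transitions.
edge_kernel = [[-1, 1],
               [-1, 1]]

feature_map = conv2d_relu(image, edge_kernel)
print(feature_map)  # [[0, 18], [0, 18]]
```

The feature map is non-zero only where the local pattern matches the filter. Stacking such layers is what converts local features into higher-order ones.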

Specifically, each feature map output from each layer of the convolutional neural network contains features that contribute to identifying the identification targets ("target A", "target B", "target C", "target D"). Conversely, features that contribute to identifying non-identification targets ("target F", "target G", "target H", "target I") are excluded from each feature map. In the first embodiment, the set of feature maps output from a layer of the convolutional neural network is referred to as an intermediate image.

In this way, each layer of the convolutional neural network outputs an intermediate image, that is, an image containing the features needed to identify the identification targets. Consequently, when the region extraction unit 125 uses the intermediate image to extract the pixel region of the extraction target from the first image, the convolutional neural network trained by the object identification unit 123 must satisfy the following conditions:
- the identification targets include the extraction target (the extraction target is not a non-identification target); and
- the identification targets do not include any non-extraction target that changes during shooting, such as a non-extraction target that accompanies the extraction target.
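The two conditions amount to simple set relations, sketched below. The function name is hypothetical, and the object names are illustrative, following the "person" and "shadow" example used for FIG. 7.

```python
def network_is_suitable(identification_targets, extraction_targets,
                        changing_non_extraction_targets):
    """Check both conditions on the trained network's identification targets."""
    # Condition 1: every extraction target is an identification target.
    covers_extraction = extraction_targets <= identification_targets
    # Condition 2: no changing non-extraction target is an identification target.
    ignores_changing = not (identification_targets & changing_non_extraction_targets)
    return covers_extraction and ignores_changing

suitable = network_is_suitable({"person", "automobile", "dog"}, {"person"}, {"shadow"})
identifies_shadow = network_is_suitable({"person", "shadow"}, {"person"}, {"shadow"})
misses_person = network_is_suitable({"automobile", "dog"}, {"person"}, {"shadow"})
print(suitable, identifies_shadow, misses_person)  # True False False
```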

A non-extraction target that accompanies the extraction target is a subject that changes as the photographed subject changes, such as the shadow of the subject to be extracted.

FIG. 7 is a diagram showing the relationship between identification targets and non-identification targets, and extraction targets and non-extraction targets. FIG. 7(a) shows the set of identification targets of the convolutional neural network. For a convolutional neural network whose identification targets are the general objects shown in FIG. 7(a), the intermediate image contains features that contribute to identifying "person", "automobile", "dog", and so on.

For the same convolutional neural network, the intermediate image does not contain features that contribute to identifying anything other than "person", "automobile", "dog", and so on (that is, non-identification targets). For example, the intermediate image does not contain features that contribute to identifying "shadow".

Therefore, as shown in FIG. 7(b), when the first image contains a "person", which is the extraction target, and a "shadow", which is a non-extraction target accompanying the extraction target, the features that contribute to identifying the "person" remain in the intermediate image, whereas the features that contribute to identifying the "shadow" are excluded from it.

（２）ニューラルネットワークを実行させる処理の具体例１ (2) Specific example 1 of processing for executing the neural network
次に、学習済みのニューラルネットワークを実行させる処理の具体例１について説明する。図８は、物体識別部による処理の具体例を示す第２の図である。図８において、物体識別部１２３は、学習済みの畳み込みニューラルネットワークを有し、第１画像取得部１２１により取得された第１画像が入力される。図８の例は、“対象Ａ”、“対象Ｆ”、“対象Ｇ”、“対象Ｈ”を含む第１画像４００が入力される様子を示している。 Next, specific example 1 of the processing for executing the trained neural network will be described. FIG. 8 is a second diagram showing a specific example of processing by the object identification unit. In FIG. 8, the object identification unit 123 has a trained convolutional neural network and receives the first image acquired by the first image acquisition unit 121. The example of FIG. 8 shows the first image 400, which includes "target A", "target F", "target G", and "target H", being input.

ここで、物体識別部１２３が有する畳み込みニューラルネットワークは、識別対象として“対象Ａ”、“対象Ｂ”、“対象Ｃ”、“対象Ｄ”が適切に識別されるように学習されている。このため、第１画像４００が入力されることで、第１層または第２層において出力される各特徴マップには、“対象Ａ”、“対象Ｂ”、“対象Ｃ”、“対象Ｄ”のいずれかの識別対象の特徴量が含まれる。換言すると、第１層または第２層において出力される各特徴マップにおいて、“対象Ａ”、“対象Ｂ”、“対象Ｃ”、“対象Ｄ”のいずれにも該当しない非識別対象の特徴量は除外される。 Here, the convolutional neural network of the object identification unit 123 has been trained so that "target A", "target B", "target C", and "target D" are appropriately identified as identification targets. Therefore, when the first image 400 is input, each feature map output from the first or second layer contains the feature amounts of whichever of the identification targets "target A", "target B", "target C", and "target D" appear in the image. In other words, in each feature map output from the first or second layer, the feature amounts of non-identification targets, which correspond to none of "target A", "target B", "target C", and "target D", are excluded.

第１画像４００には、学習済みの畳み込みニューラルネットワークの識別対象である、“対象Ａ”、“対象Ｂ”、“対象Ｃ”、“対象Ｄ”のうち、“対象Ａ”が含まれる。このため、第１層または第２層において出力される各特徴マップには、“対象Ａ”の特徴量が含まれ、“対象Ａ”以外の非識別対象である“対象Ｆ”、“対象Ｇ”、“対象Ｈ”の特徴量は除外される。 Of the identification targets of the trained convolutional neural network ("target A", "target B", "target C", and "target D"), the first image 400 includes "target A". Therefore, each feature map output from the first or second layer contains the feature amount of "target A", while the feature amounts of "target F", "target G", and "target H", which are non-identification targets, are excluded.

図８の例は、物体識別部１２３が、学習済み畳み込みニューラルネットワークを実行させることで、第１層または第２層より、特徴マップの集合である第１中間画像８００が出力された様子を示している。なお、第１層または第２層より出力された第１中間画像８００は、中間画像取得部１２４によって取得され、領域抽出部１２５に通知される。 The example of FIG. 8 shows the first intermediate image 800, which is a set of feature maps, being output from the first or second layer as a result of the object identification unit 123 executing the trained convolutional neural network. The first intermediate image 800 output from the first or second layer is acquired by the intermediate image acquisition unit 124 and passed to the region extraction unit 125.

（３）ニューラルネットワークを実行させる処理の具体例２ (3) Specific example 2 of processing for executing the neural network
次に、学習済みのニューラルネットワークを実行させる処理の具体例２について説明する。図９は、物体識別部による処理の具体例を示す第３の図である。図８との違いは、図９の場合、物体識別部１２３が、第２画像取得部１２２により取得された第２画像を入力する点である。なお、図９の例は、物体識別部１２３が、“対象Ｆ”、“対象Ｉ”を含む第２画像５００を入力する様子を示している。 Next, specific example 2 of the processing for executing the trained neural network will be described. FIG. 9 is a third diagram showing a specific example of processing by the object identification unit. The difference from FIG. 8 is that, in FIG. 9, the object identification unit 123 receives the second image acquired by the second image acquisition unit 122. The example of FIG. 9 shows the object identification unit 123 receiving the second image 500, which includes "target F" and "target I".

上述したとおり、物体識別部１２３が有する畳み込みニューラルネットワークは、識別対象として“対象Ａ”、“対象Ｂ”、“対象Ｃ”、“対象Ｄ”が適切に識別されるように学習されている。このため、第２画像５００が入力されることで、第１層または第２層において出力される各特徴マップには、“対象Ａ”、“対象Ｂ”、“対象Ｃ”、“対象Ｄ”のいずれかの識別対象の特徴量が含まれる。換言すると、第１層または第２層において出力される各特徴マップにおいて、“対象Ａ”、“対象Ｂ”、“対象Ｃ”、“対象Ｄ”のいずれにも該当しない非識別対象の特徴量は除外される。 As described above, the convolutional neural network of the object identification unit 123 has been trained so that "target A", "target B", "target C", and "target D" are appropriately identified as identification targets. Therefore, when the second image 500 is input, each feature map output from the first or second layer contains the feature amounts of whichever of the identification targets "target A", "target B", "target C", and "target D" appear in the image. In other words, in each feature map output from the first or second layer, the feature amounts of non-identification targets, which correspond to none of "target A", "target B", "target C", and "target D", are excluded.

第２画像５００には、学習済みの畳み込みニューラルネットワークの識別対象である“対象Ａ”、“対象Ｂ”、“対象Ｃ”、“対象Ｄ”のいずれも含まれていない。このため、第１層または第２層において出力される各特徴マップにおいて、“対象Ａ”、“対象Ｂ”、“対象Ｃ”、“対象Ｄ”のいずれかの特徴量が含まれることはない。また、非識別対象である“対象Ｆ”、“対象Ｉ”の特徴量は除外される。 The second image 500 includes none of "target A", "target B", "target C", and "target D", the identification targets of the trained convolutional neural network. Therefore, no feature map output from the first or second layer contains the feature amount of any of "target A", "target B", "target C", or "target D". The feature amounts of "target F" and "target I", which are non-identification targets, are also excluded.

図９の例は、物体識別部１２３が、学習済み畳み込みニューラルネットワークを実行させることで、第１層または第２層より、特徴マップの集合である第２中間画像９００が出力された様子を示している。なお、第１層または第２層より出力された第２中間画像９００は、中間画像取得部１２４によって取得され、領域抽出部１２５に通知される。このように、物体識別部１２３では、第ｉ層から中間画像を取り出した場合、第ｉ＋１層以降の処理を実行しない。 The example of FIG. 9 shows the second intermediate image 900, which is a set of feature maps, being output from the first or second layer as a result of the object identification unit 123 executing the trained convolutional neural network. The second intermediate image 900 output from the first or second layer is acquired by the intermediate image acquisition unit 124 and passed to the region extraction unit 125. Thus, when the intermediate image is taken from the i-th layer, the object identification unit 123 does not execute the processing of the (i+1)-th and subsequent layers.
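
The early exit described above, in which nothing past the i-th layer is executed once the intermediate image has been taken from it, can be sketched as follows. This is a minimal illustration in plain Python, not the implementation of the object identification unit 123: `run_until_layer`, `make_layer`, and the three element-wise toy layers are hypothetical stand-ins for the layers of the trained convolutional neural network.

```python
from typing import Callable, List, Sequence

FeatureMaps = List[List[List[float]]]  # indexed as [channel][y][x]
Layer = Callable[[FeatureMaps], FeatureMaps]


def run_until_layer(layers: Sequence[Layer], image: FeatureMaps, i: int) -> FeatureMaps:
    """Run layers 1..i (1-indexed) and return the output of layer i.

    Layers i+1 and later are never called.
    """
    if not 1 <= i <= len(layers):
        raise ValueError("i must be between 1 and the number of layers")
    x = image
    for layer in layers[:i]:
        x = layer(x)
    return x


calls: List[str] = []  # records which toy layers were actually executed


def make_layer(name: str, fn: Callable[[float], float]) -> Layer:
    """Build a toy element-wise layer (hypothetical stand-in for a real layer)."""
    def layer(x: FeatureMaps) -> FeatureMaps:
        calls.append(name)
        return [[[fn(v) for v in row] for row in channel] for channel in x]
    return layer


layers = [
    make_layer("layer1", lambda v: max(0.0, v)),  # stand-in for convolution + ReLU
    make_layer("layer2", lambda v: 2.0 * v),
    make_layer("layer3", lambda v: v + 1.0),      # must not run when i = 2
]

first_image = [[[-1.0, 2.0], [3.0, -4.0]]]
intermediate = run_until_layer(layers, first_image, 2)
print(intermediate)  # [[[0.0, 4.0], [6.0, 0.0]]]
print(calls)         # ['layer1', 'layer2']
```

With a real framework the same effect would typically be obtained by truncating the model or hooking the chosen layer; which mechanism the source uses is not stated.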

＜領域抽出部による処理の具体例＞ <Specific example of processing by the region extraction unit>
次に、領域抽出部１２５による処理の具体例について説明する。図１０は、領域抽出部による処理の具体例を示す図である。 Next, a specific example of the processing by the region extraction unit 125 will be described. FIG. 10 is a diagram showing a specific example of processing by the region extraction unit.

図１０（ａ）に示すように、領域抽出部１２５は、中間画像取得部１２４から通知された第１中間画像８００及び第２中間画像９００を用いて差分画像１０１０を生成する。第１中間画像８００と第２中間画像９００の相違点は、第１中間画像８００には、識別対象の特徴量が含まれている点である。 As shown in FIG. 10A, the region extraction unit 125 generates the difference image 1010 using the first intermediate image 800 and the second intermediate image 900 received from the intermediate image acquisition unit 124. The first intermediate image 800 differs from the second intermediate image 900 in that it contains the feature amount of an identification target.

図１０（ａ）の例の場合、第１中間画像８００には、識別対象の“対象Ａ”の特徴量が含まれているため、第１中間画像８００と第２中間画像９００との差の絶対値を算出することで得られる差分画像１０１０には、“対象Ａ”の画素領域が含まれる。 In the example of FIG. 10A, the first intermediate image 800 contains the feature amount of the identification target "target A", so the difference image 1010 obtained by calculating the absolute value of the difference between the first intermediate image 800 and the second intermediate image 900 contains the pixel region of "target A".

なお、図１０（ａ）の例では、明示していないが、第１中間画像８００及び第２中間画像９００にそれぞれ含まれる識別対象の特徴量のうち、撮影中に変化しない非抽出対象の特徴量は、差の絶対値を算出することで除外される。 Although not shown explicitly in the example of FIG. 10A, among the feature amounts of identification targets contained in both the first intermediate image 800 and the second intermediate image 900, those of non-extraction targets that do not change during shooting are removed by calculating the absolute value of the difference.

第１中間画像８００の各画素の画素値をＦ_１（ｃ，ｘ，ｙ）、第２中間画像９００の各画素の画素値をＦ_２（ｃ，ｘ，ｙ）とすると、差分画像１０１０は、｜Ｆ_１（ｃ，ｘ，ｙ）－Ｆ_２（ｃ，ｘ，ｙ）｜と表すことができる。ｃは、複数の特徴マップのいずれかを示し、ｘ、ｙは各画素の座標を示している。 Letting $F_1(c, x, y)$ be the pixel value of each pixel of the first intermediate image 800 and $F_2(c, x, y)$ be that of the second intermediate image 900, the difference image 1010 can be expressed as $|F_1(c, x, y) - F_2(c, x, y)|$. Here, $c$ indicates one of the plurality of feature maps, and $x$ and $y$ are the coordinates of each pixel.

なお、第１中間画像８００及び第２中間画像９００が、例えば、ｎ個の特徴マップが含まれるとすると、ｎ個の差分画像が生成される。このため、領域抽出部１２５は、それらの差分画像を加算することで、差分画像１０１０が生成される。つまり、差分画像１０１０の各画素の画素値は、 If the first intermediate image 800 and the second intermediate image 900 each contain, for example, n feature maps, n difference images are generated. The region extraction unit 125 therefore generates the difference image 1010 by adding those difference images together. That is, the pixel value of each pixel of the difference image 1010 can be expressed as

$$\sum_{c=1}^{n} \left| F_1(c, x, y) - F_2(c, x, y) \right|$$

と表すことができる。
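
As a minimal sketch of the channel-summed absolute difference, assuming both intermediate images are held as nested lists indexed `[c][y][x]` with identical shapes (the function name `difference_image` and the sample values are illustrative and do not come from the source):

```python
from typing import List

FeatureMaps = List[List[List[float]]]  # indexed as [c][y][x]


def difference_image(f1: FeatureMaps, f2: FeatureMaps) -> List[List[float]]:
    """Return D[y][x] = sum over c of |F1(c, x, y) - F2(c, x, y)|."""
    if len(f1) != len(f2):
        raise ValueError("both intermediate images need the same number of feature maps")
    height, width = len(f1[0]), len(f1[0][0])
    diff = [[0.0] * width for _ in range(height)]
    for map1, map2 in zip(f1, f2):
        for y in range(height):
            for x in range(width):
                diff[y][x] += abs(map1[y][x] - map2[y][x])
    return diff


# Two feature maps (n = 2) of size 2 x 2 for each intermediate image.
f1 = [[[0.0, 5.0], [1.0, 1.0]], [[2.0, 0.0], [1.0, 3.0]]]
f2 = [[[0.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [1.0, 0.0]]]
d = difference_image(f1, f2)
print(d)  # [[0.0, 6.0], [0.0, 3.0]]
```

Pixels whose feature values are the same in both intermediate images contribute zero, which is how unchanged background content drops out of the sum.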

領域抽出部１２５は、差分画像１０１０において、“対象Ａ”の画素領域の各画素の画素値を、“１”とおき、差分画像１０１０において、“対象Ａ”の画素領域以外の各画素の画素値を“０”とおく（つまり、抽出対象の画素領域の各画素の座標を特定する）。 In the difference image 1010, the region extraction unit 125 sets the pixel value of each pixel in the pixel region of "target A" to "1" and the pixel value of every other pixel to "0" (that is, it identifies the coordinates of each pixel in the pixel region of the extraction target).

また、図１０（ｂ）に示すように、領域抽出部１２５は、第１画像取得部１２１より通知された第１画像４００に、差分画像１０１０をかけ合わせることで、抽出画像１０２０を生成する。上述したとおり、差分画像１０１０において、“対象Ａ”の画素領域の各画素の画素値は、“１”であり、“対象Ａ”の画素領域以外の各画素の画素値は、“０”である。このため、第１画像４００に差分画像１０１０をかけあわせることで、領域抽出部１２５は、第１画像４００の“対象Ａ”の画素領域が含まれる抽出画像１０２０を生成することができる（つまり、抽出対象の画素領域を切り出すことができる）。 Further, as shown in FIG. 10B, the region extraction unit 125 generates the extracted image 1020 by multiplying the first image 400, received from the first image acquisition unit 121, by the difference image 1010. As described above, in the difference image 1010 the pixel value of each pixel in the pixel region of "target A" is "1", and the pixel value of every other pixel is "0". Multiplying the first image 400 by the difference image 1010 therefore lets the region extraction unit 125 generate the extracted image 1020, which contains the pixel region of "target A" from the first image 400 (that is, the pixel region of the extraction target can be cut out).
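
The binarization and multiplication steps can be illustrated with a short sketch. The source does not state how the pixel region of "target A" is decided within the difference image, so the fixed threshold used here is an assumption, as are the function names `binarize` and `apply_mask`. The sketch also assumes a grayscale first image whose height and width already match the difference image.

```python
from typing import List


def binarize(diff: List[List[float]], threshold: float) -> List[List[int]]:
    """Set pixels whose difference exceeds the threshold to 1, all others to 0.

    The thresholding rule is an assumption; the source only says that the
    target region becomes 1 and everything else becomes 0.
    """
    return [[1 if value > threshold else 0 for value in row] for row in diff]


def apply_mask(image: List[List[int]], mask: List[List[int]]) -> List[List[int]]:
    """Multiply the image by the 0/1 mask pixel by pixel."""
    return [
        [pixel * m for pixel, m in zip(image_row, mask_row)]
        for image_row, mask_row in zip(image, mask)
    ]


diff = [[0.0, 6.0], [0.2, 3.0]]      # illustrative difference image
first_image = [[10, 20], [30, 40]]   # illustrative grayscale first image

mask = binarize(diff, threshold=1.0)
extracted = apply_mask(first_image, mask)
print(mask)       # [[0, 1], [0, 1]]
print(extracted)  # [[0, 20], [0, 40]]
```

Pixels outside the mask become 0 in the extracted image, while pixels inside it keep their original values from the first image.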

＜画像処理の流れ＞ <Flow of image processing>
次に、画像処理装置１２０による画像処理（ニューラルネットワークを実行させる処理）の流れについて説明する。図１１は、画像処理装置による画像処理の流れを示すフローチャートである。 Next, the flow of image processing (processing for executing a neural network) by the image processing device 120 will be described. FIG. 11 is a flowchart showing the flow of image processing by the image processing apparatus.

ステップＳ１１０１において、第１画像取得部１２１は、撮像装置１１０より送信される第１画像を取得する。 In step S1101, the first image acquisition unit 121 acquires the first image transmitted from the image pickup apparatus 110.

ステップＳ１１０２において、第１画像取得部１２１は、取得した第１画像を、物体識別部１２３に入力する。 In step S1102, the first image acquisition unit 121 inputs the acquired first image to the object identification unit 123.

ステップＳ１１０３において、物体識別部１２３は、学習済みの畳み込みニューラルネットワークを実行させることで、第１層または第２層より第１中間画像を出力し、中間画像取得部１２４は、出力された第１中間画像を取得する。 In step S1103, the object identification unit 123 executes the trained convolutional neural network to output the first intermediate image from the first or second layer, and the intermediate image acquisition unit 124 acquires the output first intermediate image.

ステップＳ１１０４において、第２画像取得部１２２は、撮像装置１１０より送信される第２画像を取得する。 In step S1104, the second image acquisition unit 122 acquires the second image transmitted from the image pickup apparatus 110.

ステップＳ１１０５において、第２画像取得部１２２は、取得した第２画像を、物体識別部１２３に入力する。 In step S1105, the second image acquisition unit 122 inputs the acquired second image to the object identification unit 123.

ステップＳ１１０６において、物体識別部１２３は、学習済みの畳み込みニューラルネットワークを実行させることで、第１層または第２層より第２中間画像を出力し、中間画像取得部１２４は、出力された第２中間画像を取得する。 In step S1106, the object identification unit 123 executes the trained convolutional neural network to output the second intermediate image from the first or second layer, and the intermediate image acquisition unit 124 acquires the output second intermediate image.

ステップＳ１１０７において、領域抽出部１２５は、第１中間画像と第２中間画像との差分を算出することで、差分画像を生成する。 In step S1107, the region extraction unit 125 generates a difference image by calculating the difference between the first intermediate image and the second intermediate image.

ステップＳ１１０８において、領域抽出部１２５は、第１画像に差分画像をかけ合わせることで、抽出画像を生成する。 In step S1108, the region extraction unit 125 generates an extracted image by multiplying the first image by the difference image.

＜一般的な画像処理との違い＞ <Differences from general image processing>
次に、画像処理装置１２０により実行される上記画像処理（図１１）と、一般的な画像処理との違いについて説明する。 Next, the difference between the image processing (FIG. 11) executed by the image processing apparatus 120 and general image processing will be described.

（１）一般的な画像処理その１（画像のテクスチャ成分について背景差分技術を適用する画像処理）との違い (1) Differences from general image processing 1 (image processing that applies the background subtraction technique to the texture components of an image)
画像のテクスチャ成分について背景差分技術を適用する画像処理の場合、抽出対象となる被写体のうちテクスチャ成分が小さい領域は除外される。この結果、抽出対象の輪郭線のみが抽出され、輪郭線の内側の画素領域を抽出することができない。 In image processing that applies the background subtraction technique to the texture components of an image, regions of the subject to be extracted that have small texture components are excluded. As a result, only the contour line of the extraction target is extracted, and the pixel region inside the contour line cannot be extracted.

これに対して、画像処理装置１２０による画像処理によれば、抽出対象となる被写体のうち、テクスチャ成分が小さい領域についても画素領域を抽出することができる。また、抽出対象に付随する被写体の影の画素領域が抽出されることもない。 On the other hand, according to the image processing by the image processing apparatus 120, the pixel region can be extracted even in the region where the texture component is small among the subjects to be extracted. In addition, the pixel region of the shadow of the subject accompanying the extraction target is not extracted.

（２）一般的な画像処理その２（ニューラルネットワークを用いた識別処理）との違い (2) Differences from general image processing 2 (identification processing using a neural network)
ニューラルネットワークを用いた識別処理の場合、第１画像を入力して学習済みのニューラルネットワークを実行させることで、第１画像が、抽出対象を含むか否かを判定することはできる。しかしながら、ニューラルネットワークを用いた識別処理の場合、第１画像から抽出対象となる被写体の画素領域を抽出することまではできない。 In identification processing using a neural network, inputting the first image and executing the trained neural network makes it possible to determine whether the first image includes the extraction target. However, such identification processing cannot go as far as extracting the pixel region of the subject to be extracted from the first image.

これに対して、画像処理装置１２０による画像処理によれば、第１画像から抽出対象となる被写体の画素領域を過不足なく抽出することができる。 On the other hand, according to the image processing by the image processing apparatus 120, the pixel region of the subject to be extracted can be extracted from the first image without excess or deficiency.

＜画像処理システムの適用例＞ <Application examples of the image processing system>
次に、画像処理システム１００の適用例について説明する。 Next, an application example of the image processing system 100 will be described.

（１）適用例１ (1) Application example 1
図１２は、画像処理システムの適用例を示す第１の図である。図１２に示すように、画像処理システム１００を、自由視点映像生成装置１２１０に接続することで、自由視点映像生成システム１２００を形成することができる。 FIG. 12 is a first diagram showing an application example of the image processing system. As shown in FIG. 12, by connecting the image processing system 100 to the free viewpoint image generation device 1210, the free viewpoint image generation system 1200 can be formed.

なお、自由視点映像生成システム１２００を形成するにあたり、画像処理装置１２０には、複数の撮像装置（撮像装置１１０に加えて、撮像装置１２２０＿１～１２２０＿ｍ）が接続され、識別対象の１つである被写体１２３０を異なる方向から撮影するものとする。また、複数の撮像装置それぞれから出力される画像データに対しては、同様の画像処理が施され、領域抽出部１２５では、被写体１２３０の画素領域（撮像装置の数に応じた数の画素領域であって、いずれも影を含まない画素領域）が抽出されるものとする。 In forming the free viewpoint image generation system 1200, a plurality of image pickup devices (the image pickup devices 1220_1 to 1220_m in addition to the image pickup device 110) are connected to the image processing device 120 and photograph the subject 1230, which is one of the identification targets, from different directions. The same image processing is applied to the image data output from each of the image pickup devices, and the region extraction unit 125 extracts the pixel regions of the subject 1230 (as many pixel regions as there are image pickup devices, none of which includes the shadow).

図１２に示すように、自由視点映像生成装置１２１０は、ＶｉｓｕａｌＨｕｌｌ部１２１１、レンダリング部１２１２、出力部１２１３を有する。 As shown in FIG. 12, the free viewpoint image generation device 1210 has a Visual Hull unit 1211, a rendering unit 1212, and an output unit 1213.

ＶｉｓｕａｌＨｕｌｌ部１２１１は、領域抽出部１２５にて抽出された、被写体１２３０の画素領域（撮像装置の数に応じた数の画素領域）を用いて、被写体１２３０の３次元構造を復元する。 The Visual Hull unit 1211 restores the three-dimensional structure of the subject 1230 by using the pixel regions of the subject 1230 (the number of pixel regions corresponding to the number of image pickup devices) extracted by the region extraction unit 125.

レンダリング部１２１２は、ＶｉｓｕａｌＨｕｌｌ部１２１１において復元された３次元構造を用いて、任意の視点からの映像をレンダリングする。 The rendering unit 1212 renders an image from an arbitrary viewpoint by using the three-dimensional structure restored in the Visual Hull unit 1211.

出力部１２１３は、レンダリング部１２１２によりレンダリングされた任意の視点からの映像のうち、指示された視点からの映像を出力する。 The output unit 1213 outputs the image from the designated viewpoint among the images from an arbitrary viewpoint rendered by the rendering unit 1212.

このように、画像処理システム１００を、自由視点映像生成装置１２１０に接続して自由視点映像生成システム１２００を形成することで、被写体１２３０の任意視点の映像を適切に（過不足なく）出力することが可能となる。 In this way, connecting the image processing system 100 to the free viewpoint image generation device 1210 to form the free viewpoint image generation system 1200 makes it possible to output an image of the subject 1230 from an arbitrary viewpoint appropriately (with neither excess nor deficiency).

（２）適用例２ (2) Application example 2
図１３は、画像処理システムの適用例を示す第２の図である。図１３に示すように、画像処理システム１００を、映像監視装置１３１０に接続することで、映像監視システム１３００を形成することができる。 FIG. 13 is a second diagram showing an application example of the image processing system. As shown in FIG. 13, by connecting the image processing system 100 to the video monitoring device 1310, the video monitoring system 1300 can be formed.

また、図１３に示すように、映像監視装置１３１０は、判定部１３１１を有する。判定部１３１１は、領域抽出部１２５にて抽出された抽出対象の画素領域に含まれる画素数をカウントし、画素数が所定の閾値以上であった場合に、不審者または不審物であると判定し、メッセージを出力する。 Further, as shown in FIG. 13, the video monitoring device 1310 has a determination unit 1311. The determination unit 1311 counts the number of pixels in the pixel region of the extraction target extracted by the region extraction unit 125 and, if the number of pixels is equal to or greater than a predetermined threshold, determines that a suspicious person or suspicious object is present and outputs a message.
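
The counting rule of the determination unit 1311 can be sketched as follows, assuming the extracted pixel region is available as a 0/1 mask. The names `count_extracted_pixels` and `judge`, the message text, and the threshold values are hypothetical; the source specifies only the comparison of the pixel count against a predetermined threshold.

```python
from typing import List, Optional


def count_extracted_pixels(mask: List[List[int]]) -> int:
    """Count the pixels that belong to the extracted pixel region (mask value 1)."""
    return sum(1 for row in mask for value in row if value)


def judge(mask: List[List[int]], min_pixels: int) -> Optional[str]:
    """Return a message when the pixel count is at or above the threshold, else None."""
    if count_extracted_pixels(mask) >= min_pixels:
        return "suspicious person or suspicious object detected"  # hypothetical wording
    return None


mask = [[0, 1, 1], [0, 1, 0]]
pixel_count = count_extracted_pixels(mask)
alert = judge(mask, min_pixels=3)
no_alert = judge(mask, min_pixels=4)
print(pixel_count)  # 3
print(alert)        # suspicious person or suspicious object detected
print(no_alert)     # None
```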

このように、画像処理システム１００を、映像監視装置１３１０に接続して映像監視システム１３００を形成することで、不審者または不審物の画素領域を適切に（過不足なく）判定することが可能となる。 In this way, connecting the image processing system 100 to the video monitoring device 1310 to form the video monitoring system 1300 makes it possible to determine the pixel region of a suspicious person or suspicious object appropriately (with neither excess nor deficiency).

以上の説明から明らかなように、第１の実施形態に係る画像処理装置１２０は、抽出対象を含む第１画像と、抽出対象を含まない第２画像とを取得し、抽出対象を含む複数の識別対象を識別するように学習されたニューラルネットワークに入力する。 As is clear from the above description, the image processing apparatus 120 according to the first embodiment acquires a first image that includes the extraction target and a second image that does not, and inputs them to a neural network trained to identify a plurality of identification targets including the extraction target.

また、第１の実施形態に係る画像処理装置１２０は、ニューラルネットワークの所定の層から、第１画像に対応する第１中間画像と、第２画像に対応する第２中間画像とを取得する。更に、第１の実施形態に係る画像処理装置１２０は、第１中間画像と第２中間画像との差分から得られる差分画像を、第１画像にかけ合わせることで、第１画像から、抽出対象の画素領域を抽出する。 The image processing apparatus 120 according to the first embodiment also acquires, from a predetermined layer of the neural network, a first intermediate image corresponding to the first image and a second intermediate image corresponding to the second image. Furthermore, it extracts the pixel region of the extraction target from the first image by multiplying the first image by the difference image obtained from the difference between the first intermediate image and the second intermediate image.

これにより、第１の実施形態に係る画像処理装置１２０によれば、画像から抽出対象の画素領域を適切に（過不足なく）抽出することが可能となる。つまり、抽出対象の画素領域を抽出する抽出精度を向上させることができる。 As a result, the image processing apparatus 120 according to the first embodiment can extract the pixel region of the extraction target from an image appropriately (with neither excess nor deficiency). That is, the accuracy of extracting the pixel region of the extraction target can be improved.

［第２の実施形態］ [Second Embodiment]
上記第１の実施形態では、第１画像及び第２画像を取得するタイミングについて特に言及しなかったが、例えば、第２画像は、第１画像を取得する直前または直後に取得することが望ましい。天候や時間帯等、周囲環境が同じ条件のもとで、差分画像を生成することで、抽出対象の画素領域を抽出する抽出精度をより向上させることができるからである。つまり、上記第１の実施形態において記載した“撮影条件”には、撮像装置１１０側の条件に加え、撮像装置１１０の周囲環境の条件も含まれるものとする。 In the first embodiment, the timing of acquiring the first image and the second image was not specifically mentioned, but it is desirable, for example, that the second image be acquired immediately before or immediately after the first image. This is because generating the difference image under the same ambient conditions, such as weather and time of day, further improves the accuracy of extracting the pixel region of the extraction target. That is, the "shooting conditions" described in the first embodiment include the conditions of the surrounding environment of the image pickup device 110 in addition to the conditions on the image pickup device 110 side.

また、上記第１の実施形態では、第１画像取得部１２１及び第２画像取得部１２２が、撮像装置１１０から、直接、第１画像及び第２画像を取得するものとして説明した。しかしながら、第１画像取得部１２１及び第２画像取得部１２２は、撮像装置１１０から送信された画像データが格納される格納先から、第１画像及び第２画像を取得してもよい。また、第１画像取得部１２１及び第２画像取得部１２２は、第１画像及び第２画像を取得する際、ノイズ除去処理や色補正処理等の各種前処理を行ってもよい。 Further, in the first embodiment, the first image acquisition unit 121 and the second image acquisition unit 122 have been described as acquiring the first image and the second image directly from the image pickup apparatus 110. However, the first image acquisition unit 121 and the second image acquisition unit 122 may acquire the first image and the second image from the storage destination in which the image data transmitted from the image pickup apparatus 110 is stored. Further, the first image acquisition unit 121 and the second image acquisition unit 122 may perform various preprocessing such as noise removal processing and color correction processing when acquiring the first image and the second image.

また、上記第１の実施形態では、第１画像に対応する第１中間画像と、第２画像に対応する第２中間画像とを、同じ層から取得するものとして説明したが、異なる層から取得してもよい。ただし、異なる層から取得する場合、各種補間処理を行うことで、中間画像を拡大する処理が行われるものとする。 Further, in the first embodiment, the first intermediate image corresponding to the first image and the second intermediate image corresponding to the second image have been described as being acquired from the same layer, but they are acquired from different layers. You may. However, when acquiring from different layers, it is assumed that the intermediate image is enlarged by performing various interpolation processes.
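
When the two intermediate images come from different layers, their feature maps may differ in size, and the smaller one has to be enlarged before the difference is taken. The source mentions only "various interpolation processes"; the sketch below picks nearest-neighbor interpolation as one simple possibility, and the function name `enlarge_nearest` is hypothetical. Differences in the number of feature maps between layers are not addressed here.

```python
from typing import List


def enlarge_nearest(feature_map: List[List[float]], out_h: int, out_w: int) -> List[List[float]]:
    """Enlarge one feature map to out_h x out_w by nearest-neighbor interpolation."""
    in_h, in_w = len(feature_map), len(feature_map[0])
    return [
        [feature_map[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


small = [[1.0, 2.0], [3.0, 4.0]]
large = enlarge_nearest(small, 4, 4)
for row in large:
    print(row)
# [1.0, 1.0, 2.0, 2.0]
# [1.0, 1.0, 2.0, 2.0]
# [3.0, 3.0, 4.0, 4.0]
# [3.0, 3.0, 4.0, 4.0]
```

Bilinear or bicubic interpolation would give smoother results at the cost of more computation; the choice depends on how sharp the extracted region boundary needs to be.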

また、上記第１の実施形態では、差分画像を生成するにあたり、第１中間画像と第２中間画像の差の絶対値を加算するものとして説明したが、第１中間画像と第２中間画像の差の２乗を加算してもよい。 Further, in the first embodiment, the difference image was described as being generated by adding the absolute values of the differences between the first intermediate image and the second intermediate image, but the squares of those differences may be added instead.

また、上記第１の実施形態では、差分画像を生成するにあたり、第１中間画像に含まれる複数の特徴マップと第２中間画像に含まれる複数の特徴マップを全て用いるものとして説明したが、複数の特徴マップの一部を用いて差分画像を生成するようにしてもよい。 Further, in the first embodiment, the difference image was described as being generated using all of the feature maps included in the first intermediate image and all of those included in the second intermediate image, but the difference image may instead be generated using only some of the feature maps.
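
The two variations just described (summing squared differences, and using only some of the feature maps) can be sketched together. This is an illustrative sketch under the assumption that the intermediate images are nested lists indexed `[c][y][x]`; the function name `difference_image_variant` and its `channels` and `squared` parameters are hypothetical.

```python
from typing import Iterable, List, Optional

FeatureMaps = List[List[List[float]]]  # indexed as [c][y][x]


def difference_image_variant(
    f1: FeatureMaps,
    f2: FeatureMaps,
    channels: Optional[Iterable[int]] = None,
    squared: bool = False,
) -> List[List[float]]:
    """Sum per-pixel differences over the selected feature maps.

    channels: indices of the feature maps to use (None means all of them).
    squared:  add squared differences instead of absolute differences.
    """
    selected = list(range(len(f1))) if channels is None else list(channels)
    height, width = len(f1[0]), len(f1[0][0])
    diff = [[0.0] * width for _ in range(height)]
    for c in selected:
        for y in range(height):
            for x in range(width):
                delta = f1[c][y][x] - f2[c][y][x]
                diff[y][x] += delta * delta if squared else abs(delta)
    return diff


f1 = [[[3.0, 0.0]], [[1.0, 4.0]]]  # two feature maps, each 1 x 2
f2 = [[[1.0, 0.0]], [[1.0, 1.0]]]

squared_all = difference_image_variant(f1, f2, squared=True)
abs_second_only = difference_image_variant(f1, f2, channels=[1])
print(squared_all)      # [[4.0, 9.0]]
print(abs_second_only)  # [[0.0, 3.0]]
```

Squaring weights large differences more heavily than small ones, and restricting the channels reduces computation; which feature maps to keep is left open by the source.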

なお、開示の技術では、以下に記載する付記のような形態が考えられる。
（付記１）
抽出対象を含む第１の画像と、抽出対象を含まない第２の画像とを取得し、
抽出対象を含む複数の識別対象を識別するように学習されたニューラルネットワークに、前記第１の画像と前記第２の画像とをそれぞれ入力し、該ニューラルネットワークの複数の層のうちの所定の層から、前記第１の画像に対応する第１の中間画像と、前記第２の画像に対応する第２の中間画像とを取得し、
前記第１の中間画像と、前記第２の中間画像との差分に基づき、抽出対象の画素領域を抽出する、
処理をコンピュータに実行させるための画像処理プログラム。
（付記２）
前記第１の中間画像と、前記第２の中間画像との差分を算出することで、差分画像を生成し、前記第１の画像に前記差分画像をかけ合わせることで、前記第１の画像において前記抽出対象の画素領域を抽出することを特徴とする付記１に記載の画像処理プログラム。
（付記３）
前記第１の中間画像に含まれる複数の特徴マップと、前記第２の中間画像に含まれる複数の特徴マップそれぞれの差の絶対値を加算することで、前記差分画像を生成することを特徴とする付記２に記載の画像処理プログラム。
（付記４）
前記第１の画像と前記第２の画像は、同じ位置に設置された撮像装置が、同じ撮影条件のもとで異なるタイミングで撮影した画像であることを特徴とする付記１乃至３のいずれかの付記に記載の画像処理プログラム。
（付記５）
前記複数の識別対象は、前記抽出対象に付随する非抽出対象を含まないことを特徴とする付記１乃至４のいずれかの付記に記載の画像処理プログラム。
（付記６）
前記第１の画像が、前記複数の識別対象のうちのいずれかの識別対象であって、かつ、非抽出対象である物体を含む場合、該物体を含む前記第２の画像を取得することを特徴とする付記５に記載の画像処理プログラム。
（付記７）
抽出対象を含む第１の画像と、抽出対象を含まない第２の画像とを取得する取得部と、
抽出対象を含む複数の識別対象を識別するように学習されたニューラルネットワークに、前記第１の画像と前記第２の画像とをそれぞれ入力し、該ニューラルネットワークの複数の層のうちの所定の層から、前記第１の画像に対応する第１の中間画像と、前記第２の画像に対応する第２の中間画像とを取得する識別部と、
前記第１の中間画像と、前記第２の中間画像との差分に基づき、抽出対象の画素領域を抽出する抽出部と
を有することを特徴とする画像処理装置。
（付記８）
抽出対象を含む第１の画像と、抽出対象を含まない第２の画像とを取得し、
抽出対象を含む複数の識別対象を識別するように学習されたニューラルネットワークに、前記第１の画像と前記第２の画像とをそれぞれ入力し、該ニューラルネットワークの複数の層のうちの所定の層から、前記第１の画像に対応する第１の中間画像と、前記第２の画像に対応する第２の中間画像とを取得し、
前記第１の中間画像と、前記第２の中間画像との差分に基づき、抽出対象の画素領域を抽出する、
処理をコンピュータが実行する画像処理方法。
The disclosed technology may take forms such as the supplementary notes described below.
(Supplementary Note 1)
An image processing program that causes a computer to execute processing comprising:
acquiring a first image that includes an extraction target and a second image that does not include the extraction target;
inputting the first image and the second image, respectively, to a neural network trained to identify a plurality of identification targets including the extraction target, and acquiring, from a predetermined layer among the plurality of layers of the neural network, a first intermediate image corresponding to the first image and a second intermediate image corresponding to the second image; and
extracting a pixel region of the extraction target based on a difference between the first intermediate image and the second intermediate image.
(Supplementary Note 2)
The image processing program according to Supplementary Note 1, wherein a difference image is generated by calculating the difference between the first intermediate image and the second intermediate image, and the pixel region of the extraction target in the first image is extracted by multiplying the first image by the difference image.
(Supplementary Note 3)
The image processing program according to Supplementary Note 2, wherein the difference image is generated by adding together the absolute values of the respective differences between the plurality of feature maps included in the first intermediate image and the plurality of feature maps included in the second intermediate image.
(Supplementary Note 4)
The image processing program according to any one of Supplementary Notes 1 to 3, wherein the first image and the second image are images captured at different timings under the same shooting conditions by an image pickup device installed at the same position.
(Supplementary Note 5)
The image processing program according to any one of Supplementary Notes 1 to 4, wherein the plurality of identification targets do not include a non-extraction target accompanying the extraction target.
(Supplementary Note 6)
The image processing program according to Supplementary Note 5, wherein, when the first image includes an object that is one of the plurality of identification targets and is a non-extraction target, the second image including the object is acquired.
(Supplementary Note 7)
An image processing apparatus comprising:
an acquisition unit that acquires a first image that includes an extraction target and a second image that does not include the extraction target;
an identification unit that inputs the first image and the second image, respectively, to a neural network trained to identify a plurality of identification targets including the extraction target, and acquires, from a predetermined layer among the plurality of layers of the neural network, a first intermediate image corresponding to the first image and a second intermediate image corresponding to the second image; and
an extraction unit that extracts a pixel region of the extraction target based on a difference between the first intermediate image and the second intermediate image.
(Supplementary Note 8)
An image processing method in which a computer executes processing comprising:
acquiring a first image that includes an extraction target and a second image that does not include the extraction target;
inputting the first image and the second image, respectively, to a neural network trained to identify a plurality of identification targets including the extraction target, and acquiring, from a predetermined layer among the plurality of layers of the neural network, a first intermediate image corresponding to the first image and a second intermediate image corresponding to the second image; and
extracting a pixel region of the extraction target based on a difference between the first intermediate image and the second intermediate image.

なお、上記実施形態に挙げた構成等に、その他の要素との組み合わせ等、ここで示した構成に本発明が限定されるものではない。これらの点に関しては、本発明の趣旨を逸脱しない範囲で変更することが可能であり、その応用形態に応じて適切に定めることができる。 The present invention is not limited to the configurations shown here, such as combinations with other elements in the configurations and the like described in the above embodiments. These points can be changed without departing from the spirit of the present invention, and can be appropriately determined according to the application form thereof.

１００：画像処理システム 100: Image processing system
１１０：撮像装置 110: Image pickup device
１２０：画像処理装置 120: Image processing device
１２１：第１画像取得部 121: First image acquisition unit
１２２：第２画像取得部 122: Second image acquisition unit
１２３：物体識別部 123: Object identification unit
１２４：中間画像取得部 124: Intermediate image acquisition unit
１２５：領域抽出部 125: Region extraction unit
３００：学習用画像情報 300: Learning image information
４００：第１画像 400: First image
５００：第２画像 500: Second image
８００：第１中間画像 800: First intermediate image
９００：第２中間画像 900: Second intermediate image
１０１０：差分画像 1010: Difference image
１０２０：抽出画像 1020: Extracted image
１２００：自由視点映像生成システム 1200: Free viewpoint image generation system
１２１０：自由視点映像生成装置 1210: Free viewpoint image generation device
１２２０＿１～１２２０＿ｍ：撮像装置 1220_1 to 1220_m: Image pickup devices
１２３０：被写体 1230: Subject
１３００：映像監視システム 1300: Video monitoring system
１３１０：映像監視装置 1310: Video monitoring device
１３１１：判定部 1311: Determination unit

Claims

1. An image processing program that causes a computer to execute processing comprising:
acquiring a first image that includes an extraction target and a second image that does not include the extraction target;
inputting the first image and the second image, respectively, to a neural network trained to identify a plurality of identification targets including the extraction target, and acquiring, from a predetermined layer among the plurality of layers of the neural network, a first intermediate image corresponding to the first image and a second intermediate image corresponding to the second image; and
extracting a pixel region of the extraction target based on a difference between the first intermediate image and the second intermediate image.

2. The image processing program according to claim 1, wherein a difference image is generated by calculating the difference between the first intermediate image and the second intermediate image, and the pixel region of the extraction target is extracted from the first image using the first image and the difference image.

3. The image processing program according to claim 2, wherein the difference image is generated by adding together the absolute values of the respective differences between the plurality of feature maps included in the first intermediate image and the plurality of feature maps included in the second intermediate image.

4. An image processing apparatus comprising:
an acquisition unit that acquires a first image that includes an extraction target and a second image that does not include the extraction target;
an identification unit that inputs the first image and the second image, respectively, to a neural network trained to identify a plurality of identification targets including the extraction target, and acquires, from a predetermined layer among the plurality of layers of the neural network, a first intermediate image corresponding to the first image and a second intermediate image corresponding to the second image; and
an extraction unit that extracts a pixel region of the extraction target based on a difference between the first intermediate image and the second intermediate image.

5. An image processing method in which a computer executes processing comprising:
acquiring a first image that includes an extraction target and a second image that does not include the extraction target;
inputting the first image and the second image, respectively, to a neural network trained to identify a plurality of identification targets including the extraction target, and acquiring, from a predetermined layer among the plurality of layers of the neural network, a first intermediate image corresponding to the first image and a second intermediate image corresponding to the second image; and
extracting a pixel region of the extraction target based on a difference between the first intermediate image and the second intermediate image.