JP6964857B2

JP6964857B2 - Image recognition device, image recognition method, computer program, and product monitoring system

Info

Publication number: JP6964857B2
Application number: JP2017063675A
Authority: JP
Inventors: 金輝陳; 貴志上東; 宗彦伊藤; 泰郎高槻
Original assignee: Kobe University NUC
Current assignee: Kobe University NUC
Priority date: 2017-03-28
Filing date: 2017-03-28
Publication date: 2021-11-10
Anticipated expiration: 2037-03-28
Also published as: JP2018165948A

Description

The present invention relates to an image recognition device, an image recognition method, a computer program, and a product monitoring system. Specifically, it relates to an image processing technique that improves the accuracy of image recognition using a hierarchical convolutional neural network.

In recent years, the performance of image recognition by deep learning has improved dramatically. Deep learning is a general term for machine learning that uses multi-layered hierarchical neural networks. A convolutional neural network (hereinafter also referred to as "CNN") is one example of such a network.

A CNN has a multi-layered stacked structure in which convolution layers over local regions and pooling layers are repeated, and this stacked structure is said to improve image recognition performance.
As shown in Non-Patent Documents 1 and 2, deep learning using convolutional neural networks has already been used to recognize the class of an object.

Non-Patent Document 1: A. Krizhevsky et al., "ImageNet Classification with Deep Convolutional Neural Networks," in Proc. Adv. Neural Inf. Proc. Syst. (NIPS), 2012, pp. 1097-1105
Non-Patent Document 2: K. Simonyan et al., "Very Deep Convolutional Networks for Large-Scale Image Recognition," arXiv:1409.1556v6 [cs.CV], 10 Apr. 2015

In image recognition using a convolutional neural network, either the pixel values (RGB values) of the original image are input to the network as they are, without preprocessing, or principal component analysis (PCA) is applied to the pixel values.
Thus, the conventional approach either uses the pixel values (raw data) of the original image as they are or only performs preprocessing that extracts a single feature factor from the original image. Improving the recognition accuracy therefore requires collecting a large number of sample images, including sample images of the same class in many different forms.

In particular, sample images containing a rotated or inverted object are rare, so even if CNN training is repeated with sample images in the normal orientation, the recognition accuracy for rotated or inverted objects cannot be improved very much.
In view of these conventional problems, an object of the present invention is to improve the accuracy of image recognition by a hierarchical neural network.

(1) The image recognition device of the present invention comprises a data generation unit that generates input data by applying predetermined data processing to an original image, and an image processing unit having a hierarchical neural network that recognizes the type of an object included in the generated input data. When the original image is a sample image, the image processing unit performs a process of learning the parameters of the network based on the recognition result of the network; when the original image is a target image for recognition, the image processing unit performs a process of outputting the recognition result of the network. The data processing performed by the data generation unit is a process that imparts invariance to at least one of rotation and inversion to the original image.

According to the image recognition device of the present invention, the data processing performed by the data generation unit consists of a process that imparts invariance to at least one of rotation and inversion to the original image.
Therefore, when training with the same number of sample images, the accuracy of image recognition by the hierarchical neural network can be improved compared with the case where the original image is used directly as input data without the above data processing (see FIG. 10). In addition, even rotated or inverted objects can be recognized accurately.

(2) Specifically, in the image recognition device of the present invention, the data processing performed by the data generation unit includes the first process and the second process defined below.
First process: a process of generating an image filter having invariance to at least one of rotation and inversion with respect to the original image
Second process: a process of convolving the image filter generated in the first process with the original image

(3) More specifically, the image filter consists of elements contained in a plurality of color vectors obtained by dividing the color vector of an arbitrary pixel point, defined in polar coordinates with a predetermined point of the original image as the origin, into arbitrary directions that open at a predetermined angle from that pixel point.
This is because the plurality of color vectors obtained by dividing the color vector of a pixel point expressed in polar coordinates in this way remain equivalent when rotated by an arbitrary angle around the origin, and thus have invariance to at least one of rotation and inversion with respect to the original image.

(4) Still more specifically, the image filter preferably consists of elements contained in two color vectors obtained by dividing the color vector of an arbitrary pixel point, defined in polar coordinates with a predetermined point of the original image as the origin, into the radial direction and the tangential direction from that pixel point.
This is because decomposing the color vector into just the two directions, radial and tangential, minimizes the number of calculation parameters and reduces the processing load on the data generation unit.
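The rotation argument behind the radial and tangential decomposition can be illustrated with a small numerical sketch. This is only an illustration under stated assumptions: it uses a generic 2-D vector attached to a pixel point as a stand-in (this section does not define how the color vector is constructed), and the names `radial_tangential` and `rotation` are hypothetical. It checks that the radial and tangential components are unchanged when the point and the vector are rotated together about the origin.

```python
import numpy as np

def radial_tangential(point, vec):
    """Split a 2-D vector attached at `point` into (radial, tangential) components.

    `point` is measured from the origin of the polar coordinate system.
    """
    r_hat = point / np.linalg.norm(point)      # unit vector pointing away from the origin
    t_hat = np.array([-r_hat[1], r_hat[0]])    # unit vector 90 degrees counter-clockwise from r_hat
    return float(vec @ r_hat), float(vec @ t_hat)

def rotation(theta):
    """2-D rotation matrix for an angle theta in radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

point = np.array([3.0, 1.0])   # hypothetical pixel position relative to the origin
vec = np.array([0.5, 2.0])     # hypothetical 2-D vector attached to that pixel
R = rotation(np.deg2rad(40.0))

before = radial_tangential(point, vec)
after = radial_tangential(R @ point, R @ vec)   # rotate the point and the vector together
print(np.allclose(before, after))
```

Because only two numbers per pixel point are produced, the sketch also reflects why the two-direction decomposition keeps the parameter count small.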

(5) Specifically, in the image recognition device of the present invention, the hierarchical neural network is a convolutional neural network.
This is because, among hierarchical neural networks, the convolutional neural network can achieve particularly high performance in image recognition.

(6) In the image recognition device of the present invention, the object whose type is recognized may be at least one of a handwritten character, a human, an animal, a plant, and a product.
This is because the data processing that imparts invariance to at least one of rotation and inversion to the original image, which is the feature of the present invention, is considered applicable to various objects regardless of the attributes of the object contained in the original image. Therefore, the scope of application of the image recognition device of the present invention is not limited to the recognition of a specific object.

(7) The computer program of the present invention is a computer program for causing a computer to function as the image recognition device according to any one of (1) to (6) above.
Therefore, the computer program of the present invention provides the same effects as the image recognition device according to any one of (1) to (6) above.

(8) The image recognition method of the present invention is an image recognition method executed by the image recognition device according to any one of (1) to (6) above.
Therefore, the image recognition method of the present invention provides the same effects as the image recognition device according to any one of (1) to (6) above.

(9) The product monitoring system of the present invention comprises a photographing device that photographs a plurality of products, a robot device that takes any of the photographed products out to the outside, and a control device that instructs the robot device which product to take out. The control device includes the image recognition device according to any one of (1) to (6) above, and the image recognition device instructs the robot device to take out a product that it has recognized as defective.

According to the product monitoring system of the present invention, the image recognition device instructs the robot device to take out a product that it has recognized as defective. Therefore, by training the image recognition device with an appropriate number of sample images, defective products can be taken out automatically and accurately.

The present invention can be realized not only as a system and a device having the characteristic configuration described above, but also as a computer program for causing a computer to execute that characteristic configuration.
The present invention can also be realized as one or more semiconductor integrated circuits that implement part or all of the system and the device.

According to the present invention, the accuracy of image recognition by a hierarchical neural network can be improved.

FIG. 1 is a block diagram of an image recognition device according to an embodiment of the present invention.
FIG. 2 is a schematic configuration diagram of the CNN included in the CNN processing unit.
FIG. 3 is a conceptual diagram of the processing performed by a convolution layer.
FIG. 4 is a conceptual diagram of the structure of a receptive field.
FIG. 5 is an explanatory diagram of the first process performed by the data generation unit.
FIG. 6 is an explanatory diagram of the second process performed by the data generation unit.
FIG. 7 is an explanatory diagram of, among other things, a rotation point obtained by rotating an arbitrary pixel point by a predetermined angle.
FIG. 8 is a structural diagram of the deep CNN constructed in the CNN processing unit.
FIG. 9 is a diagram showing an example of the handwritten characters used in the simulation experiment.
FIG. 10 is a graph showing the test results of recognition accuracy for each character class.
FIG. 11 is an overall configuration diagram of a product monitoring system according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. At least some of the embodiments described below may be combined arbitrarily.

[Overall configuration of the image processing device]
FIG. 1 is a block diagram of an image recognition device 10 according to an embodiment of the present invention.
As shown in FIG. 1, the image recognition device 10 of the present embodiment includes an arithmetic processing unit 1 and an image processing unit 2 mounted on, for example, a PC (Personal Computer), not shown.

The arithmetic processing unit 1 includes a CPU (Central Processing Unit). The arithmetic processing unit 1 may have one CPU or several, and may include integrated circuits such as an FPGA (Field-Programmable Gate Array) or an ASIC (Application Specific Integrated Circuit).
The arithmetic processing unit 1 includes a RAM (Random Access Memory). The RAM is composed of memory elements such as SRAM (Static RAM) or DRAM (Dynamic RAM), and temporarily stores the computer programs executed by the CPU and the like and the data required to execute them.

The image processing unit 2 includes a GPU (Graphics Processing Unit). The image processing unit 2 may have one GPU or several, and may include integrated circuits such as an FPGA or an ASIC.
The image processing unit 2 includes a RAM. The RAM is composed of memory elements such as SRAM or DRAM, and temporarily stores the computer programs executed by the GPU and the like and the data required to execute them.

The arithmetic processing unit 1 includes a data generation unit 3 that generates the input image for the CNN processing unit 4. The data generation unit 3 is a functional unit realized by the CPU executing an arithmetic processing computer program recorded in the memory.
The data generation unit 3 generates the input image for the CNN processing unit 4 (hereinafter also referred to as "input data") by applying the following first and second processes to a labeled sample image 7 or to a photographed image to be identified 8 (hereinafter also referred to as the "target image").

First process: a process of generating, from the sample image 7 or the target image 8, an image filter 9 having rotation and inversion invariance with respect to that image 7, 8 (see FIG. 5)
Second process: a process of convolving the image filter 9 generated in the first process with the sample image 7 or the target image 8 (see FIG. 6)

In the following, the "sample image" and the "target image" input to the data generation unit 3 are collectively referred to as the "original image". The original image refers only to the object region image to be processed (for training or recognition).
The data generation unit 3 inputs the input data obtained by applying the first and second processes to the original images 7 and 8 to the CNN processing unit 4 in the image processing unit 2 at the subsequent stage.

The image processing unit 2 includes a CNN processing unit 4, a learning unit 5, and a recognition unit 6. These are functional units realized by the GPU executing an image processing computer program recorded in the memory.
The CNN processing unit 4 recognizes the type of object included in the input data (for example, the type of character included in the input image), and inputs the recognition result (specifically, the probability for each classification class and the like) to the learning unit 5 or the recognition unit 6.

Specifically, when the CNN is trained using a labeled sample image 7, the CNN processing unit 4 identifies the classification class of the sample image 7 and inputs the identified classification class to the learning unit 5.
On the other hand, when the trained CNN processing unit 4 is made to identify the classification class of a target image 8, that is, when the image processing unit 2 operates as a classifier, the CNN processing unit 4 inputs the identified classification class to the recognition unit 6.

The learning unit 5 updates the parameters (weights and biases) held by the CNN processing unit 4 based on the input classification class, and stores the updated parameters in the CNN processing unit 4.
The recognition unit 6 outputs the recognition result based on the input classification class. Specifically, it outputs the classification class with the highest probability input from the CNN processing unit 4 as the classification class of the target image 8. The recognition result output by the recognition unit 6 is displayed on, for example, the display of the PC, thereby notifying the PC operator.

[Processing content of the CNN processing unit]
(CNN configuration example)
FIG. 2 is a schematic configuration diagram of the CNN included in the CNN processing unit 4.
As shown in FIG. 2, the CNN constructed in the CNN processing unit 4 includes three kinds of arithmetic processing layers, namely convolution layers (also referred to as "downsampling layers") C1 and C2, pooling layers P1 and P2, and a fully connected layer F, as well as a final layer E that is the output layer of the CNN.

The pooling layers P1 and P2 are placed after the convolution layers C1 and C2, respectively, and the fully connected layer F is placed after the last pooling layer P2. The final layer E of the CNN contains as many final nodes as the preset number of classification classes (10 in FIG. 2).
FIG. 2 illustrates a case with two convolution layers C1 and C2 and two corresponding pooling layers P1 and P2. However, there may be three or more convolution layers and pooling layers. At least one fully connected layer F is provided.

The $j$-th node in a given layer C1, P1, C2, P2 receives inputs $x_i$ ($i = 1, 2, \ldots, m$) from the $m$ nodes of the immediately preceding layer, and computes an intermediate variable $u_j$ by adding a bias to the weighted sum of these inputs. That is, the intermediate variable $u_j$ is calculated by the following equation, where $w_{ij}$ is a weight and $b_j$ is a bias.

[Equation 1]
$$u_j = \sum_{i=1}^{m} w_{ij} x_i + b_j$$
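As a minimal sketch of the weighted sum plus bias described above, the following NumPy snippet computes the intermediate variable for every node of a layer at once. The function name `node_preactivation` and the example values are hypothetical and chosen only for illustration.

```python
import numpy as np

def node_preactivation(x, w, b):
    """Return u_j = sum_i w_ij * x_i + b_j for every node j of a layer.

    x: inputs from the m nodes of the previous layer, shape (m,)
    w: weights w_ij, shape (m, n)
    b: biases b_j, shape (n,)
    """
    return x @ w + b

# Hypothetical example: m = 3 inputs feeding n = 2 nodes
x = np.array([1.0, 2.0, 3.0])
w = np.array([[0.1, -0.2],
              [0.0,  0.5],
              [0.3,  0.1]])
b = np.array([0.05, -0.1])

u = node_preactivation(x, w, b)
print(u)  # approximately [1.05, 1.0]
```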

The response $y_j$ obtained by applying the activation function $a(\cdot)$, a nonlinear function, to the intermediate variable $u_j$, that is, $y_j = a(u_j)$, is the output of the node in this layer, and this output is input to the next layer.
The activation function $a$ is, for example, the sigmoid function or $a(x_j) = \max(x_j, 0)$. The latter is called the "ReLU (Rectified Linear Unit)". ReLU has been widely used in recent years because it contributes to good convergence and faster learning.
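The two activation functions named above can be written in a few lines. This is a sketch only; the function names `sigmoid` and `relu` are chosen here for illustration.

```python
import numpy as np

def sigmoid(u):
    """Sigmoid activation: squashes each value into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-u))

def relu(u):
    """ReLU activation: a(u) = max(u, 0), applied element by element."""
    return np.maximum(u, 0.0)

u = np.array([-2.0, 0.0, 3.0])   # hypothetical intermediate variables u_j
y_sigmoid = sigmoid(u)
y_relu = relu(u)
print(y_relu)  # [0. 0. 3.]
```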

Near the output layer of the CNN, one or more fully connected layers F are placed, in which all nodes of adjacent layers are connected. The final layer E, which gives the output of the CNN, is designed in the same way as in an ordinary neural network.
When the purpose is to classify the input image, as many nodes as there are classification classes are placed in the final layer E, and the softmax function is used as the activation function $a$ of the final layer E.

Specifically, the following is calculated from the inputs $u_j$ ($j = 1, 2, \ldots, n$) to the $n$ nodes.

[Equation 2]
$$p_j = \frac{\exp(u_j)}{\sum_{k=1}^{n} \exp(u_k)}$$

At recognition time, the index of the node for which $p_j$ takes its maximum value, $\arg\max_j p_j$, is selected as the estimated class.
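The softmax normalization and the selection of the estimated class can be sketched as follows. Subtracting the maximum before exponentiating is a common numerical-stability step added here as an assumption; it does not change the resulting probabilities. The input values are hypothetical.

```python
import numpy as np

def softmax(u):
    """Return p_j = exp(u_j) / sum_k exp(u_k) for the inputs of the final layer."""
    z = u - np.max(u)    # shift for numerical stability; the ratios are unchanged
    e = np.exp(z)
    return e / e.sum()

u = np.array([1.0, 3.0, 0.5])     # hypothetical inputs to n = 3 final nodes
p = softmax(u)
predicted = int(np.argmax(p))     # index of the node with the largest p_j
print(predicted)  # 1
```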

(Processing content of the convolution layer)
FIG. 3 is a conceptual diagram of the processing performed by the convolution layers C1 and C2.
As shown in FIG. 3, the input to the convolution layers C1 and C2 takes the form of $N$ images ($N$ channels), each $S \times S$ pixels in height and width.
Hereinafter, an image of this form is written as $S \times S \times N$. The $S \times S \times N$ input is written as $x_{ijk}$, where $(i, j, k) \in [0, S-1] \times [0, S-1] \times [1, N]$.

In the CNN, the number of channels of the first input layer (convolution layer C1) is $N = 1$ if the input image is grayscale and $N = 3$ (the three RGB channels) if it is color.
In the convolution layers C1 and C2, a filter (also referred to as a "kernel") is convolved with the input $x_{ijk}$.

This calculation is basically the same as filter convolution in general image processing, for example two-dimensionally convolving a small image with the input image to blur it (Gaussian filter) or to emphasize edges (sharpening filter).
Specifically, for each channel $k$ ($k = 1$ to $N$), a two-dimensional filter of size $L \times L$ is convolved with the $S \times S$ pixels of the input $x_{ijk}$, and the results are summed over all channels $k = 1$ to $N$. The result takes the form of a one-channel image $u_{ij}$.

If the filter is defined as $w_{ijk}$, where $(i, j, k) \in [0, L-1] \times [0, L-1] \times [1, N]$, then $u_{ij}$ is calculated by the following equation.

$$u_{ij} = \sum_{k=1}^{N} \sum_{(p,q) \in P_{ij}} x_{pqk}\, w_{p-i,\,q-j,\,k} + b_k$$

Here, $P_{ij}$ is the square region of $L \times L$ pixels in the image that has pixel $(i, j)$ as a vertex. That is, $P_{ij}$ is defined by the following equation.

$$P_{ij} = \{(i + i',\, j + j') \mid i' = 0, \ldots, L-1,\ j' = 0, \ldots, L-1\}$$
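The convolution of one filter with a multi-channel input can be sketched directly from the definition of the region $P_{ij}$. The following is a minimal, unoptimized sketch with two assumptions that the text does not state: the input is zero-padded on the bottom and right so that every $P_{ij}$ fits and the result is $S \times S$, and the bias is a single scalar for the filter. The function name `conv_single_filter` is hypothetical.

```python
import numpy as np

def conv_single_filter(x, w, b):
    """Convolve one L x L x N filter with an S x S x N input.

    x: input, shape (S, S, N)
    w: filter, shape (L, L, N)
    b: scalar bias shared by all output nodes
    Returns u with shape (S, S).
    """
    S = x.shape[0]
    L = w.shape[0]
    # Assumption: zero-pad bottom and right so the region P_ij exists for every (i, j).
    xp = np.pad(x, ((0, L - 1), (0, L - 1), (0, 0)))
    u = np.empty((S, S))
    for i in range(S):
        for j in range(S):
            patch = xp[i:i + L, j:j + L, :]   # region P_ij taken across all N channels
            u[i, j] = np.sum(patch * w) + b   # sum over (p, q) and k, then add the bias
    return u

# Hypothetical example: 4 x 4 input with 2 channels, 3 x 3 filter
x = np.ones((4, 4, 2))
w = np.ones((3, 3, 2))
u = conv_single_filter(x, w, b=0.5)
print(u.shape)  # (4, 4)
```

Applying an activation function element by element to `u` then gives the output of the convolution layer for this filter.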

$b_k$ is the bias. In this embodiment, the bias is shared by all output nodes of each channel, that is, $b_{ijk} = b_k$.
The filter may also be applied at intervals of several pixels instead of at every pixel. That is, for a predetermined number of pixels $s$, $P_{ij}$ may be defined as in the following equation, and $u_{ij}$ may be computed with $w_{p-i,\,q-j,\,k}$ replaced by $w_{p-si,\,q-sj,\,k}$. This pixel interval $s$ is called the "stride".

$$P_{ij} = \{(si + i',\, sj + j') \mid i' = 0, \ldots, L-1,\ j' = 0, \ldots, L-1\}$$
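Applying the filter every $s$ pixels changes only where each window starts. The sketch below assumes, for simplicity, that the filter is applied only at positions where it fits entirely inside the input (no padding); that border handling and the function name `conv_single_filter_strided` are illustrative choices, not something the text specifies.

```python
import numpy as np

def conv_single_filter_strided(x, w, b, s):
    """Convolve one L x L x N filter with an S x S x N input at a pixel interval s.

    Assumption: only windows lying fully inside the input are evaluated.
    """
    S = x.shape[0]
    L = w.shape[0]
    out = (S - L) // s + 1                  # number of window positions per axis
    u = np.empty((out, out))
    for i in range(out):
        for j in range(out):
            # Window starts at (s*i, s*j), so the filter index is (p - s*i, q - s*j).
            patch = x[s * i:s * i + L, s * j:s * j + L, :]
            u[i, j] = np.sum(patch * w) + b
    return u

# Hypothetical example: 5 x 5 single-channel input, 3 x 3 filter, interval s = 2
x = np.ones((5, 5, 1))
w = np.ones((3, 3, 1))
u = conv_single_filter_strided(x, w, b=0.0, s=2)
print(u.shape)  # (2, 2)
```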

The $u_{ij}$ calculated as described above then passes through the activation function $a(\cdot)$ to become the output $y_{ij}$ of the convolution layers C1 and C2. That is, $y_{ij} = a(u_{ij})$.
As a result, each filter $w_{ijk}$ yields a one-channel output $y_{ij}$ of size $S \times S$, the same height and width as the input $x_{ijk}$.

If $N'$ such filters are prepared and the above calculation is performed independently for each of them, an $S \times S$ output for $N'$ channels is obtained, that is, an output $y_{ijk}$ of size $S \times S \times N'$, where $(i, j, k) \in [0, S-1] \times [0, S-1] \times [1, N']$.
This $N'$-channel output $y_{ijk}$ becomes the input $x_{ijk}$ to the next layer. FIG. 3 shows the calculation for one of the $N'$ filters.

The above calculation can be expressed as a single-layer network in which the nodes of adjacent layers are connected in a special way, as shown for example in FIG. 4. FIG. 4 is a conceptual diagram of the structure of the receptive field. In the left-hand diagram the receptive field is represented as a rectangle, and in the right-hand diagram it is represented by nodes.
Specifically, each node in the upper layer is connected to only some of the nodes in the lower layer (this is called a "local receptive field"). In addition, the connection weights are the same for every node (this is called "weight sharing").
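The view of convolution as a single-layer network with local receptive fields and shared weights can be checked with a small one-dimensional sketch. The 1-D setting, the variable names, and the example values are assumptions made only to keep the illustration short. Each row of the weight matrix `W` connects one output node to just `L` neighboring inputs, and every row reuses the same filter values.

```python
import numpy as np

x = np.array([1.0, 2.0, 0.0, -1.0, 3.0])   # hypothetical 1-D input with S = 5 nodes
w = np.array([0.5, -1.0, 2.0])             # hypothetical filter with L = 3 weights
S, L = len(x), len(w)
out = S - L + 1                            # output nodes whose receptive field fits

# Build the layer's weight matrix: mostly zeros (local receptive field),
# with the same filter values repeated in every row (weight sharing).
W = np.zeros((out, S))
for i in range(out):
    W[i, i:i + L] = w

u_matrix = W @ x                                                 # single-layer network view
u_sliding = np.array([x[i:i + L] @ w for i in range(out)])       # sliding-filter view
print(np.allclose(u_matrix, u_sliding))
```

Only `L` distinct weights are learned regardless of the input size, which is what distinguishes this layer from a fully connected one.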

(Processing content of the pooling layer)
As shown in FIG. 2, the pooling layers P1 and P2 are paired with the convolution layers C1 and C2. The outputs of the convolution layers C1 and C2 are therefore the inputs to the pooling layers P1 and P2, and the inputs to the pooling layers P1 and P2 take the form $S \times S \times N$.
The purpose of the pooling layers P1 and P2 is to discard part of the information about where in the image the filter response was strong, thereby making the response invariant to small changes in the features.

Like the convolution layers C1 and C2, a node $(i, j)$ of the pooling layers P1 and P2 has a local receptive field $P_{i,j}$ in the layer on its input side. The node $(i, j)$ aggregates the outputs $y_{p,q}$ of the nodes $(p, q) \in P_{i,j}$ inside the local receptive field into a single value.
The size of the local receptive field $P_{i,j}$ of the pooling layers P1 and P2 is set independently of that of the convolution layers C1 and C2 (the filter size).

When the input has multiple channels, the above processing is performed for each channel. That is, the convolution layers C1 and C2 and the pooling layers P1 and P2 have the same number of output channels.
Pooling is performed with thinning in the vertical and horizontal directions $(i, j)$ of the image; that is, a stride $s$ of 2 or more is set. For example, when $s = 2$, the height and width of the output are half those of the input, and the number of output nodes of the pooling layer is $1/s^2$ times the number of input nodes.

Methods for aggregating the inputs from the nodes inside the receptive field $P_{i,j}$ into a single value include "average pooling" and "max pooling".
Average pooling outputs the average of the inputs $x_{pqk}$ from the nodes belonging to $P_{i,j}$, as in the following equation.

$$y_{ijk} = \frac{1}{|P_{i,j}|} \sum_{(p,q) \in P_{i,j}} x_{pqk}$$

Max pooling outputs the maximum of the inputs $x_{pqk}$ from the nodes belonging to $P_{i,j}$, as in the following equation. Average pooling was the mainstream in early CNN research, but max pooling is now generally adopted.

$$y_{ijk} = \max_{(p,q) \in P_{i,j}} x_{pqk}$$
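Both pooling methods can be sketched with one helper that slides a window over each channel independently. The function name `pool`, the window size, and the example input are hypothetical; only windows that fit entirely inside the input are evaluated, which is an assumption about border handling.

```python
import numpy as np

def pool(x, size, stride, mode="max"):
    """Pool an S x S x N input channel by channel.

    size:   height and width of the receptive field P_ij
    stride: step s between neighboring receptive fields
    mode:   "max" for max pooling, anything else for average pooling
    """
    S, _, N = x.shape
    out = (S - size) // stride + 1
    y = np.empty((out, out, N))
    reduce_fn = np.max if mode == "max" else np.mean
    for i in range(out):
        for j in range(out):
            window = x[stride * i:stride * i + size,
                       stride * j:stride * j + size, :]
            y[i, j, :] = reduce_fn(window, axis=(0, 1))   # one value per channel
    return y

# Hypothetical 4 x 4 single-channel input holding the values 0..15 row by row
x = np.arange(16, dtype=float).reshape(4, 4, 1)
y_max = pool(x, size=2, stride=2, mode="max")
y_avg = pool(x, size=2, stride=2, mode="mean")
print(y_max[:, :, 0])  # [[ 5.  7.] [13. 15.]]
print(y_avg[:, :, 0])  # [[ 2.5  4.5] [10.5 12.5]]
```

With a stride of 2 the 16 input nodes are reduced to 4 output nodes, which matches the $1/s^2$ reduction described for pooling, and the number of channels is unchanged.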

Unlike the convolution layers C1 and C2, the pooling layers P1 and P2 have no weights that change through learning, and no activation function is applied.
The CNN of the present embodiment may use either average pooling or max pooling, but the CNN implementation example shown in FIG. 7 uses max pooling.

〔学習部の処理内容〕
ＣＮＮの学習（training）では、「教師あり学習」が基本である。本実施形態においても、学習部５は教師あり学習を実行する。
具体的には、学習部５は、学習データとなる多数のラベル付きのサンプル画像を含む集合を対象として、各サンプル画像の分類誤差を最小化することにより実行される。以下、この処理について説明する。 [Processing performed by the learning unit]
Supervised learning is the basis of CNN training. In this embodiment as well, the learning unit 5 performs supervised learning.
Specifically, the learning unit 5 carries out training on a set of many labeled sample images serving as training data by minimizing the classification error of each sample image. This process is described below.

ＣＮＮ処理部４の最終層Ｅの各ノードは、ソフトマックス関数による正規化（前述の〔数２〕）により、対応するクラスに対する確率ｐ_ｊ（ｊ＝１，２，……ｎ）を出力する。この確率ｐ_ｊは、学習部５に入力される。
学習部５は、入力された確率ｐ_ｊから算出される分類誤差を最小化するように、ＣＮＮ処理部４に設定された重みなどのパラメータを更新する。 Each node of the final layer E of the CNN processing unit 4 outputs the probability p_j (j = 1, 2, ..., n) for the corresponding class through normalization by the softmax function ([Equation 2] described above). This probability p_j is input to the learning unit 5.
The learning unit 5 updates parameters such as the weights set in the CNN processing unit 4 so as to minimize the classification error calculated from the input probabilities p_j.

具体的には、学習部５は、入力サンプルに対する理想的な出力ｄ１，ｄ２，……ｄｎ（ラベル）と、出力ｐ１．ｐ２．……ｐｎの乖離を、次式の交差エントロピーＣによって算出する。この交差エントロピーＣが分類誤差である。

Specifically, the learning unit 5 calculates the discrepancy between the ideal outputs d1, d2, ..., dn (the labels) for an input sample and the actual outputs p1, p2, ..., pn using the cross entropy C of the following equation. This cross entropy C is the classification error.

目標出力ｄ１，ｄ２，……ｄｎは、正解クラスｊのみでｄ_ｊ＝１となり、それ以外のすべてのｋ（≠ｊ）ではｄ_ｋ＝０となるように設定される。
学習部５は、上記の交差エントロピーＣが小さくなるように、各畳み込み層Ｃ１，Ｃ２のフィルタの係数ｗ_ｉｊｋと各ノードのバイアスｂ_ｋ、及び、ＣＮＮの出力層側に配置された全結合層Ｆの重みとバイアスを調整する。 The target outputs d1, d2, ..., dn are set so that d_j = 1 only for the correct class j and d_k = 0 for all other k (≠ j).
The learning unit 5 adjusts the filter coefficients w_ijk and the node biases b_k of the convolution layers C1 and C2, as well as the weights and biases of the fully connected layer F arranged on the output layer side of the CNN, so that the above cross entropy C decreases.
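The classification error described above can be sketched as follows. Because the patent's equation is not reproduced in this text, the sketch assumes the standard form of the cross entropy, C = -Σ_j d_j log p_j, together with the usual softmax normalization; the function names and the small constant `eps` (added only to avoid log(0)) are hypothetical.

```python
import math


def softmax(z):
    """Normalize raw scores z into class probabilities p_j that sum to 1."""
    m = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot(correct, n):
    """Target outputs: d_j = 1 for the correct class, d_k = 0 for all other k."""
    return [1.0 if k == correct else 0.0 for k in range(n)]


def cross_entropy(d, p, eps=1e-12):
    """Assumed standard form C = -sum_j d_j * log(p_j)."""
    return -sum(dj * math.log(pj + eps) for dj, pj in zip(d, p))


p = softmax([2.0, 1.0, 0.1])  # probabilities for three classes
d = one_hot(0, 3)             # the correct class is class 0
C = cross_entropy(d, p)       # equals -log(p[0]) because d is one-hot
```

With a one-hot target, only the probability assigned to the correct class contributes to C, so C decreases toward zero as that probability approaches 1.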

分類誤差Ｃの最小化には、確率的勾配降下法が用いられる。学習部５は、重みやバイアスに関する誤差勾配（∂Ｃ／∂ｗ_ｉｊ）を、誤差逆伝播法（ＢＰ法）により計算する。ＢＰ法による計算方法は、通常のニューラルネットワークの場合と同様である。
もっとも、ＣＮＮ処理部４が最大プーリングを採用する場合の逆伝播では、学習サンプルに対する順伝播の際に、プーリング領域のどのノードの値を選んだかを記憶し、逆伝播時にそのノードのみと結合（重み１で結合）させる。 Stochastic gradient descent is used to minimize the classification error C. The learning unit 5 calculates the error gradients with respect to the weights and biases (∂C/∂w_ij) by the error backpropagation method (BP method). The BP calculation is the same as for an ordinary neural network.
However, when the CNN processing unit 4 adopts maximum pooling, the forward pass for each training sample records which node in each pooling region was selected, and during backpropagation the output is connected only to that node (with a weight of 1).
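The handling of maximum pooling during backpropagation can be sketched as below. This is an illustrative single-channel version under assumed names (`maxpool_forward`, `maxpool_backward`), not the patent's implementation: the forward pass stores the position of the selected node, and the backward pass passes each output gradient to that node only, with weight 1.

```python
def maxpool_forward(x, size=2, stride=2):
    """Return pooled outputs and, for each output, the (row, col) of the chosen input."""
    out, argmax = [], []
    out_h = (len(x) - size) // stride + 1
    out_w = (len(x[0]) - size) // stride + 1
    for i in range(out_h):
        out_row, arg_row = [], []
        for j in range(out_w):
            cells = [(i * stride + p, j * stride + q)
                     for p in range(size) for q in range(size)]
            best = max(cells, key=lambda rc: x[rc[0]][rc[1]])
            out_row.append(x[best[0]][best[1]])
            arg_row.append(best)  # remember which node was selected
        out.append(out_row)
        argmax.append(arg_row)
    return out, argmax


def maxpool_backward(grad_out, argmax, in_shape):
    """Route each output gradient to the remembered input node only (weight 1)."""
    h, w = in_shape
    grad_in = [[0.0] * w for _ in range(h)]
    for i, row in enumerate(argmax):
        for j, (r, c) in enumerate(row):
            grad_in[r][c] += grad_out[i][j]
    return grad_in


x = [[1, 3, 2, 0],
     [4, 2, 1, 5],
     [0, 1, 7, 2],
     [6, 2, 3, 1]]
y, idx = maxpool_forward(x)
dx = maxpool_backward([[1.0, 2.0], [3.0, 4.0]], idx, (4, 4))
```

All entries of `dx` are zero except at the four positions that won the maximum in the forward pass.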

学習部５による分類誤差Ｃの評価とこれに基づくパラメータ（重みなど）の更新は、全学習サンプルについて実行してもよい。しかし、収束性及び計算速度の観点から、数個から数百個程度のサンプルの集合（ミニバッチ）ごとに実行することが好ましい。この場合の重みｗ_ｉｊの更新量Δｗ_ｉｊは、次式で決定される。

The evaluation of the classification error C by the learning unit 5 and the corresponding update of the parameters (weights, etc.) may be performed over all training samples. However, from the viewpoint of convergence and computation speed, it is preferable to perform them for each set of several to several hundred samples (mini-batch). The update amount Δw_ij of the weight w_ij in this case is determined by the following equation.

上式において、Δｗ_ｉｊ ^（ｔ）は今回の重み更新量であり、Δｗ_ｉｊ ^{（ｔ−１）}は前回の重み更新量である。上式の第１項は、勾配降下法により誤差を削減するためのｗ_ｉｊの修正量を表す項であり、εは学習率である。
上式の第２項は、モメンタム（momentum）である。モメンタムは、前回更新量のα（〜０．９）倍を加算することでミニパッチの選択による重みの偏りを抑える。第３項は、重み減衰（weight decay）である。重み減衰は、重みが過大にならないようにするパラメータである。なお、バイアスｂ_ｋの更新についても同様である。 In the above equation, Δw_ij^(t) is the current weight update amount, and Δw_ij^(t-1) is the previous weight update amount. The first term of the above equation represents the correction of w_ij that reduces the error by gradient descent, and ε is the learning rate.
The second term of the above equation is the momentum. The momentum adds α (~0.9) times the previous update amount, which suppresses the weight bias caused by the choice of mini-batch. The third term is the weight decay, a parameter that keeps the weights from becoming excessively large. The biases b_k are updated in the same way.
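The three terms described above can be sketched as a single update step. The patent's equation is not reproduced in this text, so the sketch assumes the commonly used form Δw^(t) = -ε ∂C/∂w + α Δw^(t-1) - ε λ w; in particular, the scaling of the weight decay term by ε and the symbol λ (`decay`) are assumptions, as are the function and parameter names.

```python
def sgd_momentum_step(w, grad, prev_delta, lr=0.01, alpha=0.9, decay=0.0005):
    """One mini-batch update of a list of weights.

    delta = -lr * grad            (first term: gradient descent, lr is epsilon)
            + alpha * prev_delta  (second term: momentum)
            - lr * decay * w      (third term: weight decay, assumed form)
    """
    delta = [-lr * g + alpha * pd - lr * decay * wi
             for wi, g, pd in zip(w, grad, prev_delta)]
    new_w = [wi + d for wi, d in zip(w, delta)]
    return new_w, delta


# Toy usage: minimize C = sum(w_i ** 2), whose gradient is 2 * w_i.
w = [1.0, -2.0]
delta = [0.0, 0.0]
for _ in range(3):
    grad = [2.0 * wi for wi in w]
    w, delta = sgd_momentum_step(w, grad, delta, lr=0.1)
```

The returned `delta` is fed back in as `prev_delta` on the next mini-batch, which is what lets the momentum term smooth out fluctuations between mini-batches.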

〔画像生成部の処理内容〕
図５は、データ生成部３による第１処理の説明図である。
前述の通り、「第１処理」は、サンプル画像７又は対象画像８から、当該原画像７，８に対して回転及び反転の不変性を有する画像フィルタ９を生成する処理である。
図５において、学習段階での原画像は、サンプル画像７であり、画像処理装置２を識別器とする場合の原画像は、対象画像８である。 [Processing content of image generator]
FIG. 5 is an explanatory diagram of the first process by the data generation unit 3.
As described above, the "first process" is a process of generating an image filter 9 having invariance of rotation and inversion with respect to the original images 7 and 8 from the sample image 7 or the target image 8.
In FIG. 5, the original image at the learning stage is the sample image 7, and the original image when the image processing device 2 is used as the classifier is the target image 8.

原画像７，８には、直交座標の各画素点ｐ（ｘ，ｙ）におけるＲＧＢ値（０〜２５５）が含まれる。ここでは、図５（ａ）に示すように、画素点ｐでのＲＧＢ値を要素とするデータ列（Ｒ，Ｇ，Ｂ）を色ベクトル「ｇ」という。
データ生成部３は、まず、原画像７，８の中心点ｃを抽出し、抽出した中心点ｃを座標の原点とする。次に、データ生成部３は、図５（ｂ）に示すように、直交座標（ｘ，ｙ）の画素点ｐを、中心点ｃを原点とする極座標に変換する（極化処理）。 The original images 7 and 8 include RGB values (0 to 255) at each pixel point p (x, y) in Cartesian coordinates. Here, as shown in FIG. 5A, the data sequence (R, G, B) having the RGB value at the pixel point p as an element is referred to as a color vector “g”.
First, the data generation unit 3 extracts the center points c of the original images 7 and 8, and sets the extracted center points c as the origin of the coordinates. Next, as shown in FIG. 5B, the data generation unit 3 converts the pixel point p of the orthogonal coordinates (x, y) into polar coordinates with the center point c as the origin (polarization process).

なお、極座標の原点は、必ずしも原画像７，８の中心点ｃでなくてもよく、中心点ｃから多少ずれた位置にある所定のポイントであってもよい。 The origin of the polar coordinates does not necessarily have to be the center point c of the original images 7 and 8, but may be a predetermined point located at a position slightly deviated from the center point c.

次に、データ生成部３は、中心点ｃを原点とする極座標に含まれる、任意の画素点ｐの色ベクトルｇ＝（Ｒ，Ｇ，Ｂ）を、画素点ｐにおける半径方向の色ベクトルｇｒと接線方向の色ベクトルｇｔに分解する。この色ベクトルｇの分解は、次式により実行される。

Next, the data generation unit 3 decomposes the color vector g = (R, G, B) of an arbitrary pixel point p in the polar coordinates with the center point c as the origin into a color vector gr in the radial direction and a color vector gt in the tangential direction at the pixel point p. This decomposition of the color vector g is performed by the following equation.

ここで、「ｇ^Ｔ」は、色ベクトルｇ＝（Ｒ，Ｇ，Ｂ）の転置ベクトルである。「ｒ」は、次式により定義される画素点ｐにおける半径方向の単位ベクトルである。「ｔ」は、次式により定義される画素点ｐにおける接線方向の単位ベクトルである。 Here, "g^T" is the transpose of the color vector g = (R, G, B). "r" is the unit vector in the radial direction at the pixel point p, defined by the following equation. "t" is the unit vector in the tangential direction at the pixel point p, defined by the following equation.

上式において、「Ｒ_θ」は、単位ベクトルｒを角度θだけ回転させる回転ベクトルである。本実施形態では、単位ベクトルｔの方向は接線方向（単位ベクトルｒからの角度が９０度）であるから、回転行列Ｒ_θの角度θの値は、θ＝π／２となる。
原画像７，８の極座標に含まれるすべての画素点ｐに上記の計算を行うことにより、各画素点ｐについて、合計６種類の要素（Ｒｒ、Ｒｔ、Ｇｒ、Ｇｔ、Ｂｒ、Ｂｔ）を含むシングルチャンネルの画像フィルタ９が生成される。 In the above equation, "R_θ" is the rotation matrix that rotates the unit vector r by the angle θ. In the present embodiment, the unit vector t points in the tangential direction (at 90 degrees from the unit vector r), so the angle of the rotation matrix R_θ is θ = π/2.
By performing the above calculation on all the pixel points p included in the polar coordinates of the original images 7 and 8, a single-channel image filter 9 containing a total of six elements (Rr, Rt, Gr, Gt, Br, Bt) for each pixel point p is generated.
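The two unit vectors used in the decomposition can be sketched as follows, assuming that r points from the origin c toward the pixel point p and that t is obtained by rotating r by θ = π/2, as the text states. The defining equations are not reproduced in this text, so this is one reading rather than the patent's formula, and the function names are hypothetical. How the color values are combined with r and t to give the six elements is also not recoverable here, so the sketch stops at the unit vectors.

```python
import math


def unit_vectors(x, y, cx, cy):
    """Return the radial unit vector r and tangential unit vector t at pixel (x, y)."""
    dx, dy = x - cx, y - cy
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        raise ValueError("r is undefined at the origin c")
    r = (dx / norm, dy / norm)
    # t = R_theta r with theta = pi/2, which maps (a, b) to (-b, a).
    t = (-r[1], r[0])
    return r, t


def rotate(v, theta):
    """Apply the 2x2 rotation matrix R_theta to the vector v."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])


r, t = unit_vectors(5.0, 3.0, 2.0, -1.0)  # pixel p = (5, 3), origin c = (2, -1)
t_check = rotate(r, math.pi / 2)          # agrees with t up to rounding error
```

By construction r and t are orthogonal and of unit length at every pixel except the origin itself, where the radial direction is undefined.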

図６は、データ生成部３による第２処理の説明図である。
前述の通り、「第２処理」は、原画像７，８に対して、第１処理で生成した画像フィルタ９を畳み込む処理である。
図６において、学習段階での原画像は、サンプル画像７であり、画像処理装置２を識別器とする場合の原画像は、対象画像８である。 FIG. 6 is an explanatory diagram of the second process by the data generation unit 3.
As described above, the "second process" is a process of convolving the image filter 9 generated in the first process with respect to the original images 7 and 8.
In FIG. 6, the original image at the learning stage is the sample image 7, and the original image when the image processing device 2 is used as the classifier is the target image 8.

〔画像フィルタの回転及び反転の不変性〕
図７は、画素点ｐを所定角度θだけ回転させた変換点ｐ_θ、反転させた反転点ｐ'と回転反転（若しくは反転回転）させた複数変更点ｐ'_θの説明図である。
図７に示すように、任意の画素点ｐに対して、同じ半径で左回りに所定角度θだけ進んだ点である回転点を「ｐ_θ」とする。
図７に示すように、任意の画素点ｐに対して、反転させた反転点を「ｐ'」とする。
図７に示すように、任意の画素点ｐの反転点ｐ'に対して、同じ半径で左回りに所定角度θだけ進んだ点である回転点を「ｐ'_θ」とする。但し、複数の変更があった場合に、変換される手順は反転回転でも回転反転でもよい。
また、回転点ｐ_θ、反転点ｐ'、回転反転（若しくは反転回転）点ｐ'_θにおける色ベクトルをそれぞれ「ｇ_θ」、「ｇ'」、「ｇ'_θ」とし、回転点ｐ'_θにおける半径方向及び接線方向の単位ベクトルをそれぞれ「ｒ_θ」、「ｒ'」、「ｒ'_θ」及び「ｔ_θ」、「ｔ'」、「ｔ'_θ」とする。 [Invariance of the image filter under rotation and inversion]
FIG. 7 is an explanatory diagram of the rotation point p_θ obtained by rotating a pixel point p by a predetermined angle θ, the inversion point p' obtained by inverting it, and the multiply transformed point p'_θ obtained by rotation and inversion (or inversion and rotation).
As shown in FIG. 7, for an arbitrary pixel point p, the rotation point obtained by advancing counterclockwise by a predetermined angle θ at the same radius is denoted "p_θ".
As shown in FIG. 7, for an arbitrary pixel point p, the point obtained by inverting it is denoted "p'".
As shown in FIG. 7, for the inversion point p' of an arbitrary pixel point p, the rotation point obtained by advancing counterclockwise by a predetermined angle θ at the same radius is denoted "p'_θ". When multiple transformations are applied, they may be performed in either order: inversion then rotation, or rotation then inversion.
The color vectors at the rotation point p_θ, the inversion point p', and the rotation-inversion (or inversion-rotation) point p'_θ are denoted "g_θ", "g'", and "g'_θ", respectively, and the radial and tangential unit vectors at these points are denoted "r_θ", "r'", "r'_θ" and "t_θ", "t'", "t'_θ", respectively.

この場合、次式に示すように、画素点ｐ、回転点ｐ_θ、反転点ｐ'、複数変換点ｐ'_θにおける半径方向及び接線方向の色ベクトルは、（ｇ^Ｔｒ，ｇ^Ｔｔ）、（(Ｒ_θｇ)^Ｔｒ，(Ｒ_θｇ)^Ｔｔ）、（(Mｇ)^Ｔｒ，(Mｇ)^Ｔｔ）、（(MＲ_θｇ)^Ｔｒ，(MＲ_θｇ)^Ｔｔ）と一致する。
ただし、次式において、Ｍは反転行列である。Ｍは、対角成分が１又は−１の対角行列であるため、Ｍ^ＴＭ＝Ｉとなる。 In this case, as shown in the following equations, the radial and tangential color vectors at the pixel point p, the rotation point p_θ, the inversion point p', and the multiply transformed point p'_θ coincide with (g^T r, g^T t), ((R_θ g)^T r, (R_θ g)^T t), ((Mg)^T r, (Mg)^T t), and ((MR_θ g)^T r, (MR_θ g)^T t).
Here, M in the following equations is an inversion matrix. Because M is a diagonal matrix whose diagonal elements are 1 or -1, M^T M = I.

（回転の場合）

(In case of rotation)

（反転の場合）

(In the case of inversion)

（回転反転（若しくは反転回転）の場合）

(In the case of rotation reversal (or reversal rotation))

上記の等式は、中心点ｃ回りの角度θの値に関係なく成立する。すなわち、画素点ｐでの半径方向の色ベクトルｇｒと接線方向の色ベクトルｇｔは、中心点ｃ回りにどのような角度θで回転しても不変である。同様、反転及び複数変更の場合も不変性を有する。なお、反転には、左右の反転（ｙ軸対称）と上下の反転（ｘ軸対称）の双方が含まれる。
従って、離散情報に構成された原画像７，８の色ベクトルｇに対して、各方向の色ベクトルｇｒ，ｇｔを要素とする画像フィルタ９を畳み込む処理を実行すれば、処理後の入力データは、中心点ｃ回りの回転及び反転に対して不変性を有する入力データとなる。 The above equation holds regardless of the value of the angle θ around the center point c. That is, the radial color vector gr and the tangential color vector gt at the pixel point p are invariant regardless of the angle θ around the center point c. Similarly, inversion and multiple changes also have invariance. The inversion includes both left-right inversion (y-axis symmetry) and up-down inversion (x-axis symmetry).
Therefore, by convolving the image filter 9, whose elements are the color vectors gr and gt in each direction, with the color vectors g of the original images 7 and 8, which are represented as discrete data, the resulting input data is invariant to rotation and inversion around the center point c.
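The equations for the three cases are not reproduced in this text. One way to see why the invariance holds is sketched below, under the assumption that a rotation or inversion of the image transforms the color vector and the unit vectors at a point by the same matrix (for example g_θ = R_θ g and r_θ = R_θ r), and using the orthogonality of the rotation matrix, R_θ^T R_θ = I, together with M^T M = I from the text. This is an illustrative derivation and not necessarily the form given in the original equations.

```latex
\begin{aligned}
g_\theta^{T} r_\theta &= (R_\theta g)^{T}(R_\theta r)
  = g^{T} R_\theta^{T} R_\theta\, r = g^{T} r \\
g'^{T} r' &= (M g)^{T}(M r)
  = g^{T} M^{T} M\, r = g^{T} r \\
g_\theta'^{T} r_\theta' &= (M R_\theta g)^{T}(M R_\theta r)
  = g^{T} R_\theta^{T} M^{T} M R_\theta\, r = g^{T} r
\end{aligned}
```

The same three lines hold with t in place of r, so both the radial and the tangential components are unchanged for any angle θ.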

〔推奨されるＣＮＮの構造例〕
図８は、ＣＮＮ処理部４に構築される深層ＣＮＮの構造図である。
図８に示すように、本願発明者らが推奨する、画像認識のためのＣＮＮのアーキテクチャは、入力ボリュームを出力ボリュームに変換する畳み込み層Ｃ１〜Ｃ４と、全結合層Ａ１〜Ａ３の積層体により構成されている。 [Recommended CNN structure example]
FIG. 8 is a structural diagram of the deep CNN constructed in the CNN processing unit 4.
As shown in FIG. 8, the CNN architecture for image recognition recommended by the inventors of the present application consists of a stack of convolution layers C1 to C4, which convert an input volume into an output volume, and fully connected layers A1 to A3.

ＣＮＮの各層Ｃ１〜Ｃ４，Ａ１〜Ａ３は、幅、高さ及び奥行きの３次元的に配列されたニューロンを有する。
最初の入力層Ｃ１の幅、高さ及び奥行きのサイズは５６×５６×３が好ましい。畳み込み層Ｃ２〜Ｃ４及び全結合層Ａ１の内部のニューロンは、１つ前の層の受容野と呼ばれる小領域のノードのみに接続されている。 Each layer C1-C4, A1-A3 of the CNN has neurons arranged three-dimensionally in width, height and depth.
The width, height and depth of the first input layer C1 are preferably 56 × 56 × 3. The neurons inside the convolutional layers C2-C4 and the fully connected layer A1 are connected only to the nodes in a small area called the receptive field of the previous layer.

出力ボリュームの空間的な大きさは、次式で計算することができる。
Ｗ２＝１＋（Ｗ１−Ｋ＋２Ｐ）／Ｓ
上式において、Ｗ１は、入力ボリュームのサイズである。Ｋは、畳み込み層のニューロンの核（ノード）のフィールドサイズである。Ｓはストライド、すなわち、カーネルマップにおける隣接するニューロンの受容野の中心間距離を意味する。Ｐは、ボーダー上で使用されるゼロパディングの量を意味する。 The spatial size of the output volume can be calculated by the following equation.
W2 = 1 + (W1-K + 2P) / S
In the above equation, W1 is the size of the input volume. K is the field size of the kernel (node) of the neurons in the convolution layer. S is the stride, that is, the distance between the centers of the receptive fields of adjacent neurons in the kernel map. P is the amount of zero padding used on the border.

図８のＣＮＮでは、第１畳み込み層Ｃ１において、Ｗ１＝５６、Ｋ＝５、Ｓ＝２、Ｐ＝２である。従って、第２畳み込み層Ｃ２の出力ボリュームの空間的な大きさは、Ｗ２＝１＋（５６−５＋２×２）／２＝２８．５→２８となる。
図８のネットワークでは、重みを持つ７つの層を含む。最初の４つは畳み込み層Ｃ１〜Ｃ４であり、残りの３つは完全に接続された全結合層Ａ１〜Ａ３である。全結合層Ａ１〜Ａ３には、ドロップアウトが含まれる。 In the CNN of FIG. 8, in the first convolution layer C1, W1 = 56, K = 5, S = 2, P = 2. Therefore, the spatial size of the output volume of the second convolution layer C2 is W2 = 1 + (56-5 + 2 × 2) / 2 = 28.5 → 28.
The network of FIG. 8 includes seven layers with weights. The first four are the convolution layers C1 to C4, and the remaining three are the fully connected layers A1 to A3. The fully connected layers A1 to A3 include dropout.
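The output-size formula W2 = 1 + (W1 - K + 2P) / S and the worked example above (28.5 rounded down to 28) can be checked with a short sketch. The function name is hypothetical, and rounding down is implemented with integer division.

```python
def conv_output_size(w1, k, s, p):
    """W2 = 1 + (W1 - K + 2P) / S, rounded down to a whole number of positions."""
    return 1 + (w1 - k + 2 * p) // s


# Values given for the first convolution layer C1: W1 = 56, K = 5, S = 2, P = 2.
w2 = conv_output_size(56, 5, 2, 2)  # 1 + 55 // 2 = 28
```

Rounding down reflects that a kernel position which would extend past the padded border is not used.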

最後の全結合層Ａ３の出力は、この層Ａ３と完全に接続された最終層である、７クラスラベルの分布を生成する7-way SOFTMAXに供給される。
畳み込み層Ｃ２〜Ｃ４と全結合層Ａ１のニューロンは前の層の受容野に接続され、全結合層Ａ２〜Ａ３のニューロンは、前の層の全てのニューロンに接続されている。 The output of the last fully connected layer A3 is fed to 7-way SOFTMAX, which produces a distribution of 7 class labels, which is the final layer fully connected to this layer A3.
The neurons of the convolutional layers C2 to C4 and the fully connected layer A1 are connected to the receptive field of the previous layer, and the neurons of the fully connected layers A2 to A3 are connected to all the neurons of the previous layer.

畳み込み層Ｃ１，Ｃ２の後にはバッチ正規化層が続く。各バッチ正規化層の後には、それぞれ前述の最大プーリングを実行するプーリング層が続く。
畳み込み層Ｃ１〜Ｃ４と全結合層Ａ１〜Ａ３のための非線形マッピング関数は、整流リニアユニット（ＲｅＬＵ）よりなる。 The convolution layers C1 and C2 are followed by a batch normalization layer. Each batch normalization layer is followed by a pooling layer that performs the maximal pooling described above.
The nonlinear mapping function for the convolution layers C1 to C4 and the fully connected layers A1 to A3 is the rectified linear unit (ReLU).

第１畳み込み層Ｃ１は、サイズが５×５×３の６４個のカーネルにより、２画素のストライドで５６×５６×３の入力画像（ＡＧＥ画像）をフィルタリングする。
ストライド（歩幅）は、カーネルマップ内で隣接するニューロンの受容野の中心間の距離である。ストライドは、すべての畳み込み層において１ピクセルに設定されている。 The first convolution layer C1 filters a 56 × 56 × 3 input image (AGE image) with a 2-pixel stride by 64 kernels having a size of 5 × 5 × 3.
Stride is the distance between the centers of the receptive fields of adjacent neurons in the kernel map. The stride is set to 1 pixel in all convolution layers.

第２畳み込み層Ｃ２の入力は、バッチ正規化及び最大プールされた第１畳み込み層Ｃ１の出力である。第２畳込み層Ｃ２は、サイズが３×３×６４である１２８のカーネルで入力をフィルタリングする。
第３畳み込み層Ｃ３は、サイズが３×３×６４である１２８のカーネルを有し、これらは第２層Ｃ２（バッチ正規化とＭＡＸプーリング）の出力に接続されている。 The input of the second convolution layer C2 is the output of the first convolution layer C1 that has been batch normalized and maximally pooled. The second convolution layer C2 filters the inputs with 128 kernels of size 3x3x64.
The third convolution layer C3 has 128 kernels of size 3x3x64, which are connected to the output of the second layer C2 (batch normalization and MAX pooling).

第４畳み込み層Ｃ４は、サイズが３×３×１２８である１２８のカーネルを備えている。完全に接続された全結合層Ａ１〜Ａ３は、それぞれ１０２４のニューロンを備えている。 The fourth convolution layer C4 comprises 128 kernels having a size of 3x3x128. Fully connected fully connected layers A1 to A3 each include 1024 neurons.

〔推奨される学習例〕
本願発明者らは、図８の構造の深層ＣＮＮを実際に訓練（学習）させた。訓練に際しては、NVIDIA GTX745 4GBのＧＰＵを実装するＰＣに対して、オープンソースの数値解析ソフトウェアである「ＭＡＴＬＡＢ」を用いて行った。
ＣＮＮの学習ステップにおいては、重み減衰、モメンタム、バッチサイズ、学習率や学習サイクルを含むパラメータなどの重要な設定がある。以下、この点について説明する。 [Recommended learning example]
The inventors of the present application actually trained (learned) the deep CNN of the structure of FIG. The training was conducted using "MATLAB", an open source numerical analysis software, for a PC equipped with an NVIDIA GTX745 4GB GPU.
The CNN learning step involves important settings such as weight decay, momentum, batch size, learning rate, and parameters including the number of learning cycles. These are described below.

本願発明者らによる訓練では、モメンタムが０．９であり、重み減衰が０．０００５である非同期の確率的勾配降下法を採用した。次式は、今回採用した重みｗの更新ルールである。

In the training by the inventors of the present application, asynchronous stochastic gradient descent with a momentum of 0.9 and a weight decay of 0.0005 was adopted. The following equation is the update rule for the weight w adopted here.

上式において、ｉは反復回数であり、ｍはモメンタム変数である。εは学習率を意味する。右辺の第３項は、ｗｉにおいて誤差Ｌを削減するための重みｗの修正量のｉ番目のバッチＤｉに関する平均値である。
バッチサイズの増加は、より信頼性の高い勾配推定値をもたらし、学習時間を短縮できるが、それでは最大の安定した学習率εの増加が得られない。そこで、ＣＮＮのモデルに適したバッチサイズを選択する必要がある。 In the above equation, i is the number of iterations and m is the momentum variable. ε means the learning rate. The third term on the right side is the average value of the correction amount of the weight w for reducing the error L in wi with respect to the i-th batch Di.
Increasing the batch size yields more reliable gradient estimates and can shorten the training time, but it does not increase the maximum stable learning rate ε. It is therefore necessary to select a batch size suited to the CNN model.
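The update rule itself is not reproduced in this text. Given the stated momentum of 0.9, the weight decay of 0.0005, and the description of the symbols i, m, ε, L and D_i, it plausibly has the same form as the rule in Non-Patent Document 1, written here with the momentum variable m. This is an assumption, not a quotation of the original equation.

```latex
\begin{aligned}
m_{i+1} &= 0.9\, m_i \;-\; 0.0005\, \varepsilon\, w_i
  \;-\; \varepsilon \left\langle \left. \frac{\partial L}{\partial w} \right|_{w_i} \right\rangle_{D_i} \\
w_{i+1} &= w_i + m_{i+1}
\end{aligned}
```

In this form the third term on the right side of the first line is the gradient of the error L with respect to w, evaluated at w_i and averaged over the i-th batch D_i, which matches the description given above.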

ここでは、畳み込み層Ｃ１〜Ｃ４について、それぞれ、６４、１２８、２５６及び５１２のバッチサイズを採用した訓練（学習）の結果を比較した。その結果、図８のＣＮＮでは、２５６のバッチサイズが最適であることが判明した。
また、すべての層に同等の学習率を使用し、訓練を通して手動で調整した。学習率は０．１に初期化し、エラーレートが現時点の学習率で改善を停止したときに、学習率を１０で分割した。また、訓練に際しては、約２０サイクルでネットワークを訓練した。 Here, the results of training (learning) using batch sizes of 64, 128, 256, and 512 for the convolution layers C1 to C4 were compared. As a result, it was found that the batch size of 256 was optimal for the CNN of FIG.
Equal learning rates were used for all layers and manually adjusted throughout the training. The learning rate was initialized to 0.1, and when the error rate stopped improving at the current learning rate, the learning rate was divided by 10. In the training, the network was trained in about 20 cycles.
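The learning-rate policy described above (start at 0.1 and divide by 10 when the error rate stops improving) was applied manually by the inventors. A rough automated equivalent is sketched below; the stopping criterion based on a `patience` of two cycles, and all names, are assumptions introduced for illustration.

```python
def step_learning_rate(lr, error_history, patience=2, factor=10.0):
    """Divide lr by `factor` if the last `patience` cycles brought no improvement."""
    if len(error_history) <= patience:
        return lr
    best_before = min(error_history[:-patience])
    recent_best = min(error_history[-patience:])
    if recent_best >= best_before:
        return lr / factor  # the error rate has stopped improving
    return lr


lr = 0.1      # initial learning rate stated in the text
errors = []
for cycle_error in [0.50, 0.40, 0.35, 0.35, 0.36]:  # made-up error rates
    errors.append(cycle_error)
    lr = step_learning_rate(lr, errors)
```

In this toy run the error stops improving over the last two cycles, so the learning rate drops from 0.1 to 0.01 after the fifth cycle.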

〔実験例：手書き文字を識別する場合の効果〕
本願発明者らは、「神戸大学経済経営研究所附属企業資料総合センター」に所蔵された、「鐘紡資料データベース」の「支配人回章１」に含まれる手書き文字の画像を用いて、本願発明の有意性を試す比較実験を行った。 [Experimental example: Effect of identifying handwritten characters]
The inventors of the present application conducted a comparative experiment to test the significance of the present invention, using images of handwritten characters included in "Manager's Circular 1" of the "Kanebo Material Database" held at the Integrated Center for Corporate Archives attached to the Research Institute for Economics and Business Administration, Kobe University.

識別するオブジェクト（手書き文字）の種類は、支配人回章１に含まれる「支」、「配」、「人」、「工」、「場」、「長」、「会（會）」、「社」、「明」及び「治」とした。学習に用いる各文字のサンプル数は、各々４００枚（５６×５６ピクセル）とした。図９は、比較実験に用いた手書き文字（原画像）の一例を示す図である。 The types of objects (handwritten characters) to be identified were the ten characters 「支」, 「配」, 「人」, 「工」, 「場」, 「長」, 「会（會）」, 「社」, 「明」, and 「治」 contained in Manager's Circular 1. For each character, 400 sample images (56 × 56 pixels) were used for learning. FIG. 9 is a diagram showing an example of the handwritten characters (original images) used in the comparative experiment.

図１０は、文字クラスごとの認識精度の試験結果を表すグラフである。図１０において、○のグラフ（ours）は、本願発明の認識精度を示す。□のグラフ（alexnet）は、非特許文献１の場合（ただし、入力データはＲＧＢ値）の認識精度を示す。
▽のグラフ（vgg-vd-16）は、非特許文献２の場合（入力データはＲＧＢ値であり、レイヤ数は１６）の認識精度を示す。＊のグラフ（vgg-vd-19）は、非特許文献２の場合（入力データはＲＧＢ値であり、レイヤ数は１９）の認識精度を示す。 FIG. 10 is a graph showing the test results of recognition accuracy for each character class. In FIG. 10, the graphs (ours) of ◯ show the recognition accuracy of the present invention. The graph (alexnet) of □ shows the recognition accuracy in the case of Non-Patent Document 1 (however, the input data is an RGB value).
The graph (vgg-vd-16) of ▽ shows the recognition accuracy in the case of Non-Patent Document 2 (the input data is an RGB value and the number of layers is 16). The graph (vgg-vd-19) of * shows the recognition accuracy in the case of Non-Patent Document 2 (the input data is an RGB value and the number of layers is 19).

図１０に示すように、入力画像として従来通りのＲＧＢ値を用いる非特許文献１及び２の場合には、１０種類のすべての文字クラスについて、９０％を超える認識精度を達成できない。
これに対して、手書き文字の原画像に回転及び反転の不変性を与える本願発明の場合には、１０種類のすべての文字クラスについて、９０％以上の認識精度を獲得した。 As shown in FIG. 10, in the case of Non-Patent Documents 1 and 2 using the conventional RGB values as the input image, the recognition accuracy exceeding 90% cannot be achieved for all 10 types of character classes.
On the other hand, in the case of the present invention, which imparts invariance of rotation and inversion to the original image of handwritten characters, recognition accuracy of 90% or more was obtained for all 10 types of character classes.

図１０のグラフから明らかな通り、深層ＣＮＮを用いた画像認識（図１０の例では文字認識）において、原画像に回転及び反転の不変性を与える処理を行ってＣＮＮの入力データとすれば、従来の生データ（ＲＧＢ画像）を入力データとする場合に比べて、認識精度が有意に改善される。 As is clear from the graph of FIG. 10, in image recognition using a deep CNN (character recognition in the example of FIG. 10), if the original image is processed to give invariance of rotation and inversion to obtain CNN input data, The recognition accuracy is significantly improved as compared with the case where the conventional raw data (RGB image) is used as the input data.

〔画像認識装置の効果〕
以上の通り、本実施形態の画像認識装置１０によれば、原画像７，８に対して回転及び反転の不変性を有する画像フィルタ９を畳み込むことにより、ＣＮＮ処理部４への入力データを生成する。このため、サンプル画像７をそれほど多く収集しなくても、他の深層学習技術に比べて高い認識精度を発揮できる。
例えば、図１０に示す通り、サンプル画像７の数が「４００」の場合には、従来の深層学習技術よりも高い認識精度が得られる。 [Effect of image recognition device]
As described above, according to the image recognition device 10 of the present embodiment, the input data to the CNN processing unit 4 is generated by convolving the image filter 9, which is invariant to rotation and inversion, with the original images 7 and 8. Therefore, higher recognition accuracy than other deep learning techniques can be achieved without collecting a large number of sample images 7.
For example, as shown in FIG. 10, when the number of sample images 7 is "400", higher recognition accuracy than the conventional deep learning technique can be obtained.

本実施形態では、６種類の要素（Ｒｒ、Ｒｔ、Ｇｒ、Ｇｔ、Ｂｒ、Ｂｔ）を含むシングルチャンネルの画像フィルタ９をサンプル画像７に畳み込むので、１つのサンプル画像７に含まれるデータ量を少なくとも１６倍に拡張することになる。
このため、従来の深層学習では１００枚のサンプル画像７を必要とする場合には、ほぼ６枚（１００／１６＝６．２５）のサンプル画像７を収集すれば、従来の深層学習と概ね同程度の識別精度が得られる。 In the present embodiment, the single-channel image filter 9 containing six elements (Rr, Rt, Gr, Gt, Br, Bt) is convolved with the sample image 7, so the amount of data contained in one sample image 7 is expanded by a factor of at least 16.
Therefore, where conventional deep learning requires 100 sample images 7, collecting about 6 sample images 7 (100/16 = 6.25) yields roughly the same identification accuracy as conventional deep learning.

本実施形態の画像認識装置１０によれば、画像フィルタ９の畳み込みにより、回転又は反転したオブジェクトでも正確に認識できる。
従って、例えば、古文書などの撮影画像に含まれる文字を正確に認識して、テキスト又はワープロ文書データに変換する装置として利用できる。また、帳票などの文書に記載の文字を読み込んで、テキストやワープロ文書データに変換する装置や、タッチパネルに手書き入力された文字をリアルタイムに認識する装置として利用することもできる。 According to the image recognition device 10 of the present embodiment, even a rotated or inverted object can be accurately recognized by convolution of the image filter 9.
Therefore, for example, it can be used as a device that accurately recognizes characters included in a photographed image such as an old document and converts them into text or word processor document data. It can also be used as a device that reads characters described in a document such as a form and converts them into text or word processor document data, or a device that recognizes characters handwritten on a touch panel in real time.

その他、本実施形態の画像認識装置１０は、人間の顔の表情認識、顔画像からの年齢認識、及び、動物、植物、製品などのあらゆる物体の種類の認識などに利用可能である。
このように、本実施形態の画像認識装置１０において、ＣＮＮ処理部４が種類を認識可能なオブジェクトは、手書き文字、人間、動物、植物、及び製品のうちの少なくとも１つの物体であればよく、あらゆるオブジェクトの種類の認識に利用することができる。 In addition, the image recognition device 10 of the present embodiment can be used for facial expression recognition of a human face, age recognition from a face image, recognition of all types of objects such as animals, plants, and products.
As described above, in the image recognition device 10 of the present embodiment, the object whose type can be recognized by the CNN processing unit 4 may be at least one of handwritten characters, humans, animals, plants, and products. It can be used to recognize any type of object.

〔画像認識装置のその他の応用例〕
図１１は、本実施形態の製品監視システム２０の全体構成図である。
製品監視システム２０は、撮影画像に含まれる製品２５の種類を認識する画像認識装置１０（図１参照）を、不良品の選別及び取り出しに利用するシステムである。
図１１に示すように、製品監視システム２０は、撮影装置２１、駆動制御装置２２、及び可動アーム式のロボット装置２３を備える。 [Other application examples of image recognition device]
FIG. 11 is an overall configuration diagram of the product monitoring system 20 of the present embodiment.
The product monitoring system 20 is a system that uses an image recognition device 10 (see FIG. 1) that recognizes the types of products 25 included in the captured image for sorting and taking out defective products.
As shown in FIG. 11, the product monitoring system 20 includes a photographing device 21, a drive control device 22, and a movable arm type robot device 23.

撮影装置２１は、例えば、ＣＣＤ（電荷結合素子）を利用してデジタル画像を生成するデジタルカメラよりなる。撮影装置２１は、ベルトコンベア２４の上方に吊り下げられており、下流側（図１１の右側）に進行するベルトコンベア２４上の複数の製品２５を上から撮影する。
撮影装置２１は、複数の製品２５が含まれるデジタル画像よりなる撮影画像を、駆動制御装置２２に送信する。撮影画像は、静止画及び動画像のいずれでもよい。 The photographing device 21 includes, for example, a digital camera that generates a digital image using a CCD (charge-coupled device). The photographing device 21 is suspended above the belt conveyor 24, and photographs a plurality of products 25 on the belt conveyor 24 traveling downstream (on the right side in FIG. 11) from above.
The photographing device 21 transmits a photographed image including a digital image including a plurality of products 25 to the drive control device 22. The captured image may be either a still image or a moving image.

駆動制御装置２２は、ロボット装置２３の動作を制御するコンピュータ装置よりなる。駆動制御装置２２は、第１通信部２６、第２通信部２７、制御部２８、及び記憶部２９を備える。
第１通信部２６は、所定のＩ／Ｏインタフェース規格により、撮影装置２１と通信する通信装置よりなる。第１通信部２６と撮影装置２１との通信は、有線通信及び無線通信のいずれであってもよい。 The drive control device 22 includes a computer device that controls the operation of the robot device 23. The drive control device 22 includes a first communication unit 26, a second communication unit 27, a control unit 28, and a storage unit 29.
The first communication unit 26 includes a communication device that communicates with the photographing device 21 according to a predetermined I / O interface standard. The communication between the first communication unit 26 and the photographing device 21 may be either wired communication or wireless communication.

第２通信部２７は、所定のＩ／Ｏインタフェース規格により、ロボット装置２３と通信する通信装置よりなる。第２通信部２７とロボット装置２３との通信は、有線通信及び無線通信のいずれであってもよい。 The second communication unit 27 includes a communication device that communicates with the robot device 23 according to a predetermined I / O interface standard. The communication between the second communication unit 27 and the robot device 23 may be either wired communication or wireless communication.

制御部２８は、１又は複数のＣＰＵを有する情報処理装置であり、上述の本実施形態の画像識別装置（図１参照）を含む制御装置よりなる。
記憶部２９は、１又は複数のＲＡＭ及びＲＯＭなどのメモリを含む記憶装置よりなる。記憶部２９は、制御部２８に実行させる各種のコンピュータプログラムや、撮影装置２１などから受信した画像データなどの、一時的又は非一時的な記録媒体として機能する。 The control unit 28 is an information processing device having one or more CPUs, and includes a control device including the image identification device (see FIG. 1) of the present embodiment described above.
The storage unit 29 consists of a storage device including one or more memories such as RAM and ROM. The storage unit 29 functions as a transitory or non-transitory recording medium for the various computer programs executed by the control unit 28, image data received from the photographing device 21, and the like.

このように、駆動制御装置２２は、コンピュータを備えて構成される。従って、駆動制御装置２２の各機能は、当該コンピュータの記憶装置に記憶されたコンピュータプログラムが当該コンピュータのＣＰＵ及びＧＰＵによって実行されることで発揮される。
かかるコンピュータプログラムは、ＣＤ−ＲＯＭやＵＳＢメモリなどの一時的又は非一時的な記録媒体に記憶させることができる。 In this way, the drive control device 22 is configured to include a computer. Therefore, each function of the drive control device 22 is exhibited by executing the computer program stored in the storage device of the computer by the CPU and GPU of the computer.
Such a computer program can be stored in a transitory or non-transitory recording medium such as a CD-ROM or a USB memory.

制御部２８は、記憶部２９に格納されたコンピュータプログラムを読み出して実行することにより、第１及び第２通信部２６，２７に対する通信制御や、ロボット装置２３の動作制御を実現できる。
例えば、制御部２８は、第１通信部２６が受信した撮影画像から製品２５の画像部分（以下、「製品画像」という。）を抽出し、抽出した製品画像の分類クラス（ここでは、「正常」又は「不良」とする。）を識別する。 By reading and executing the computer program stored in the storage unit 29, the control unit 28 can realize communication control for the first and second communication units 26 and 27 and operation control of the robot device 23.
For example, the control unit 28 extracts the image portion of a product 25 (hereinafter referred to as the "product image") from the captured image received by the first communication unit 26, and identifies the classification class of the extracted product image (here, "normal" or "defective").

制御部２８は、分類クラスが不良である製品画像を検出すると、撮影時刻とベルトコンベア２４の進行速度などから、製品２５がロボット装置２３の下を通過する時刻及び位置を算出し、算出した通過時刻及び通過位置を第２通信部２７に送信させる。
ロボット装置２３は、多関節のロボットアーム３０と、ロボットアーム３０を駆動するアクチュエータ３１とを備える。ロボットアーム３０は、コンベア２４上の製品２５を把持することができるハンド部（図示せず）を有する。 When the control unit 28 detects a product image whose classification class is defective, it calculates the time and position at which the product 25 will pass under the robot device 23 from the shooting time, the traveling speed of the belt conveyor 24, and the like, and causes the second communication unit 27 to transmit the calculated passing time and passing position.
The robot device 23 includes an articulated robot arm 30 and an actuator 31 that drives the robot arm 30. The robot arm 30 has a hand portion (not shown) capable of gripping the product 25 on the conveyor 24.
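The passing-time calculation performed by the control unit 28 can be sketched as simple kinematics. The sketch assumes a belt moving at constant speed along one axis, a known distance along the belt between the product's position at shooting time and the robot device 23, and an unchanged lateral position; none of these details, nor the function and parameter names, are specified in the patent.

```python
def passing_time_and_position(shot_time_s, x_at_shot_m, y_m, robot_x_m, belt_speed_m_s):
    """Predict when and where a product seen at (x_at_shot_m, y_m) passes under the robot.

    x runs along the belt travel direction; y is the lateral position on the belt.
    """
    if belt_speed_m_s <= 0:
        raise ValueError("belt speed must be positive")
    travel_m = robot_x_m - x_at_shot_m  # remaining distance to the robot
    return shot_time_s + travel_m / belt_speed_m_s, (robot_x_m, y_m)


# Hypothetical numbers: product seen at t = 10 s, 1.2 m upstream of the robot,
# on a belt moving at 0.3 m/s.
t_pass, pos = passing_time_and_position(10.0, 0.2, 0.05, 1.4, 0.3)
```

In practice the actuator 31 would also need lead time to move the hand portion into position, which this sketch does not model.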

制御部２８が算出した不良品の通過時刻及び通過位置は、アクチュエータ３１に送信される。アクチュエータ３１は、不良品の通過時刻にハンド部が通過位置に移動し、製品２５を掴んで外部に取り出すように、ロボットアーム３０の各関節を駆動する。
記憶部２９は、製品２５の良否判定を実行可能な所定構造のＣＮＮ（例えば図８）や、当該ＣＮＮに対する学習済みの重み及びバイアスなどを記憶している。制御部２８は、学習済みのＣＮＮにより、撮影画像から抽出した製品画像の良否を判定する。 The passing time and passing position of the defective product calculated by the control unit 28 are transmitted to the actuator 31. The actuator 31 drives each joint of the robot arm 30 so that the hand portion moves to the passing position at the passing time of the defective product, grasps the product 25, and takes it out to the outside.
The storage unit 29 stores a CNN having a predetermined structure (for example, FIG. 8) capable of performing a quality determination of the product 25, a learned weight and a bias with respect to the CNN, and the like. The control unit 28 determines the quality of the product image extracted from the captured image based on the learned CNN.

The work processes that the factory manager must carry out to realize the product monitoring system 20 described above are as follows.
Step 1) A worker 32 on the downstream side of the robot arm 30 identifies a defective product among the products 25 flowing on the conveyor 24 and photographs the defective product with a digital camera (not shown).
Step 2) The captured image data is input to the storage unit 29 of the drive control device 22 as a sample image 7 of a defective product.
Step 3) Steps 1 and 2 above are repeated until the desired identification accuracy (for example, 99% or more) is obtained.
Note that when the typical shape of a defective product is known in advance, the sample images of defective products may be image data obtained by photographing products other than the products 25 flowing on the conveyor 24.
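Steps 1 to 3 amount to a collect, retrain, and check loop. The following is only a sketch under assumptions: `collect_batch` stands in for steps 1 and 2 (photographing and registering defective samples), `train_and_evaluate` stands in for whatever training and accuracy measurement the drive control device 22 performs, and `max_rounds` is an added safety limit. None of these names come from the specification.

```python
from typing import Callable, List, Tuple


def collect_until_accurate(
    collect_batch: Callable[[], List[str]],
    train_and_evaluate: Callable[[List[str]], float],
    target_accuracy: float = 0.99,
    max_rounds: int = 1000,
) -> Tuple[List[str], float, int]:
    """Repeat sample collection until the measured accuracy reaches the target.

    Returns the accumulated samples, the final accuracy, and the number of
    rounds that were needed.
    """
    samples: List[str] = []
    for round_no in range(1, max_rounds + 1):
        samples.extend(collect_batch())          # steps 1 and 2: add new sample images
        accuracy = train_and_evaluate(samples)   # retrain and measure accuracy
        if accuracy >= target_accuracy:          # step 3: stop once the target is met
            return samples, accuracy, round_no
    raise RuntimeError("target accuracy not reached within max_rounds")
```

How accuracy is measured (for example, on held-out images) is left open here because the specification does not state it.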

In the learning stage in which steps 1 and 2 are repeated, the recognition accuracy of defective products by the drive control device 22 is low at the beginning of operation because there are few sample images 7 of defective products.
However, as the number of sample images 7 of defective products increases, the recognition accuracy of the drive control device 22 improves, and defective products can be removed at a rate close to 100%. Consequently, once the drive control device 22 has learned from a predetermined number of sample images 7, visual monitoring by the worker 32 becomes unnecessary.

In particular, the present embodiment uses a classifier that has high recognition accuracy even for rotated or inverted objects. Therefore, as shown in FIG. 11 for example, defective products can be identified with an accuracy close to 100% at an early stage even when the products 25 are placed on the conveyor 24 in various orientations.
Examples of products 25 placed on the conveyor 24 in various orientations include foods such as sweets and paste products, and molded articles produced by a molding machine. Furthermore, where products 25 are currently conveyed on the conveyor 24 and workers 32 visually inspect them and remove defective products from the conveyor 24, adopting the product monitoring system 20 of the present embodiment can be expected to significantly reduce the number of workers 32 performing the visual inspection.

[First modification]
In the above-described embodiment, the decomposition directions of the color vector g are not limited to the “radial direction” (r direction in FIG. 5) and the “tangential direction” (t direction in FIG. 5). That is, the decomposition directions of the color vector g may be any two directions that open at a predetermined angle with the pixel point p as the base point.
However, if the color vector is decomposed in two directions other than the radial and tangential directions, the calculation formula for the color vector in each direction becomes complicated, and the processing load of the data generation unit 3 increases. Therefore, as in the above-described embodiment, the decomposition directions of the color vector g are preferably the radial direction and the tangential direction.
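The difference in computational effort between the two choices can be seen in a small sketch. It treats the color vector g as a plain two-dimensional vector at pixel point p and takes the center point c as the polar origin; the function names and the tuple representation are assumptions made for illustration and are not the implementation of the data generation unit 3.

```python
import math


def decompose_radial_tangential(p, c, g):
    """Split 2D vector g at pixel p into (g_r, g_t) relative to center c.

    The radial and tangential unit vectors are orthogonal, so each
    component is a single dot product.
    """
    dx, dy = p[0] - c[0], p[1] - c[1]
    r = math.hypot(dx, dy)
    if r == 0:
        raise ValueError("directions are undefined at the center point")
    e_r = (dx / r, dy / r)       # radial unit vector
    e_t = (-e_r[1], e_r[0])      # tangential unit vector (e_r rotated by +90 degrees)
    g_r = g[0] * e_r[0] + g[1] * e_r[1]
    g_t = g[0] * e_t[0] + g[1] * e_t[1]
    return g_r, g_t


def decompose_two_directions(g, u, v):
    """Solve g = a*u + b*v for two arbitrary non-parallel directions u and v.

    Non-orthogonal directions require solving a 2x2 linear system
    (Cramer's rule here) instead of two independent dot products.
    """
    det = u[0] * v[1] - u[1] * v[0]
    if abs(det) < 1e-12:
        raise ValueError("directions must not be parallel")
    a = (g[0] * v[1] - g[1] * v[0]) / det
    b = (u[0] * g[1] - u[1] * g[0]) / det
    return a, b
```

With orthogonal directions the coefficients decouple, which is one way to read the preference for the radial and tangential directions stated above.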

[Second modification]
In the above-described embodiment, three-dimensional polar coordinates with the center point c as the origin may be defined, and the original images 7 and 8 may be convolved with a three-dimensional image filter whose elements are obtained by decomposing the color vector g into a radial direction and two tangential directions (three directions in total in the three-dimensional case).
In this way, image recognition accuracy can be improved not only for two-dimensional rotation or inversion about the center point c but also for a target image 8 that is tilted in the depth direction about the center point c.
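As an illustration of a decomposition into one radial and two tangential directions, the sketch below uses the standard spherical-coordinate unit vectors (radial, polar-angle, and azimuthal). Which two tangential directions the specification intends is not stated in this section, so this choice, along with the function name and the tuple representation, is an assumption.

```python
import math


def decompose_spherical(p, c, g):
    """Split 3D vector g at point p into (g_r, g_theta, g_phi) relative to center c.

    g_r is the radial component; g_theta and g_phi are the two tangential
    components along increasing polar angle and increasing azimuth.
    """
    dx, dy, dz = p[0] - c[0], p[1] - c[1], p[2] - c[2]
    rho = math.hypot(dx, dy)                      # distance from the polar (z) axis
    r = math.sqrt(dx * dx + dy * dy + dz * dz)    # distance from the center point
    if rho == 0:
        raise ValueError("tangential directions are undefined on the polar axis")
    e_r = (dx / r, dy / r, dz / r)
    e_theta = (dx * dz / (r * rho), dy * dz / (r * rho), -rho / r)
    e_phi = (-dy / rho, dx / rho, 0.0)

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    return dot(g, e_r), dot(g, e_theta), dot(g, e_phi)
```

Because the three unit vectors are mutually orthogonal, the squared components sum to the squared length of g, which gives a quick sanity check on any implementation.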

[Third modification]
In the above-described embodiment, an image filter 9 having rotation and inversion invariance with respect to the original images 7 and 8 is adopted. However, the image filter 9 may be an image filter having only rotation invariance with respect to the original images 7 and 8, or an image filter having only inversion invariance with respect to the original images 7 and 8. That is, the image filter 9 of the present invention may be any image filter having invariance to at least one of rotation and inversion with respect to the original images 7 and 8.
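Rotation and inversion can affect a polar decomposition differently, which a short numerical check makes concrete. The sketch below does not reproduce the image filter 9; it only treats the color vector g as a two-dimensional vector, places the center point c at the origin, and compares the radial and tangential components before and after transforming both the pixel position and the vector. All function names are hypothetical.

```python
import math


def radial_tangential(p, c, g):
    """Return (g_r, g_t) of 2D vector g at pixel p, relative to center c."""
    dx, dy = p[0] - c[0], p[1] - c[1]
    r = math.hypot(dx, dy)
    if r == 0:
        raise ValueError("directions are undefined at the center point")
    e_r = (dx / r, dy / r)
    e_t = (-e_r[1], e_r[0])
    return (g[0] * e_r[0] + g[1] * e_r[1], g[0] * e_t[0] + g[1] * e_t[1])


def rotate(v, angle_rad):
    """Rotate a 2D vector counterclockwise about the origin."""
    cos_a, sin_a = math.cos(angle_rad), math.sin(angle_rad)
    return (cos_a * v[0] - sin_a * v[1], sin_a * v[0] + cos_a * v[1])


def mirror(v):
    """Reflect a 2D vector across the vertical axis through the origin."""
    return (-v[0], v[1])


def check_invariance(p, g, angle_rad=math.radians(37.0)):
    """Return the components for the original, rotated, and mirrored cases."""
    c = (0.0, 0.0)
    base = radial_tangential(p, c, g)
    rotated = radial_tangential(rotate(p, angle_rad), c, rotate(g, angle_rad))
    mirrored = radial_tangential(mirror(p), c, mirror(g))
    return base, rotated, mirrored
```

In this simplified setting both components are unchanged by rotation, while a reflection leaves the radial component unchanged and flips the sign of the tangential component, so the two kinds of invariance can be obtained separately.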

[Other modifications]
The embodiments disclosed herein (including the modifications) are illustrative in all respects and are not restrictive. The scope of rights of the present invention is not limited to the above-described embodiments, and includes all changes within a scope equivalent to the configurations described in the claims.
For example, in the above-described embodiment, the neural network is a convolutional neural network (CNN), but it may be a hierarchical neural network of another structure that does not have a convolutional layer.

1 Arithmetic processing unit
2 Image processing unit
3 Data generation unit
4 CNN processing unit
5 Learning unit
6 Recognition unit
7 Sample image (original image)
8 Target image (original image)
9 Image filter
10 Image recognition device
20 Product monitoring system
21 Imaging device
22 Drive control device (control device)
23 Robot device
24 Belt conveyor
25 Product
26 First communication unit
27 Second communication unit
28 Control unit
29 Storage unit
30 Robot arm
31 Actuator
32 Worker

Claims

An image recognition device comprising:
a data generation unit that generates input data by performing predetermined data processing on an original image; and
an image processing unit having a hierarchical neural network that recognizes the type of an object included in the generated input data,
wherein, when the original image is a sample image, the image processing unit performs a process of learning the parameters of the network based on the recognition result of the network, and when the original image is a recognition target image, the image processing unit performs a process of outputting the recognition result of the network, and
the data processing performed by the data generation unit is a process of giving the original image invariance to rotation and inversion.

The image recognition device according to claim 1, wherein the data processing performed by the data generation unit includes a first process and a second process defined below.
First process: a process of generating an image filter having rotation and inversion invariance with respect to the original image.
Second process: a process of convolving the image filter generated in the first process with the original image.

An image recognition device comprising:
a data generation unit that generates input data by performing predetermined data processing on an original image; and
an image processing unit having a hierarchical neural network that recognizes the type of an object included in the generated input data,
wherein, when the original image is a sample image, the image processing unit performs a process of learning the parameters of the network based on the recognition result of the network, and when the original image is a recognition target image, the image processing unit performs a process of outputting the recognition result of the network,
the data processing performed by the data generation unit is a process of giving the original image invariance to at least one of rotation and inversion,
the data processing performed by the data generation unit includes a first process and a second process defined below, and
the image filter consists of elements included in a plurality of color vectors obtained by decomposing the color vector of an arbitrary pixel point, defined in polar coordinates having a predetermined point of the original image as the origin, into arbitrary directions that open at a predetermined angle with the pixel point as the base point.
First process: a process of generating an image filter having invariance to at least one of rotation and inversion with respect to the original image.
Second process: a process of convolving the image filter generated in the first process with the original image.

An image recognition device comprising:
a data generation unit that generates input data by performing predetermined data processing on an original image; and
an image processing unit having a hierarchical neural network that recognizes the type of an object included in the generated input data,
wherein, when the original image is a sample image, the image processing unit performs a process of learning the parameters of the network based on the recognition result of the network, and when the original image is a recognition target image, the image processing unit performs a process of outputting the recognition result of the network,
the data processing performed by the data generation unit is a process of giving the original image invariance to at least one of rotation and inversion,
the data processing performed by the data generation unit includes a first process and a second process defined below, and
the image filter consists of elements included in two color vectors obtained by decomposing the color vector of an arbitrary pixel point, defined in polar coordinates having a predetermined point of the original image as the origin, into a radial direction and a tangential direction with the pixel point as the base point.
First process: a process of generating an image filter having invariance to at least one of rotation and inversion with respect to the original image.
Second process: a process of convolving the image filter generated in the first process with the original image.

The image recognition device according to any one of claims 1 to 4, wherein the hierarchical neural network is a convolutional neural network.

The image recognition device according to any one of claims 1 to 5, wherein the object whose type is recognized is at least one of handwritten characters, humans, animals, plants, and products.

A computer program for causing a computer to function as an image recognition device comprising a data generation unit that generates input data by performing predetermined data processing on an original image, and an image processing unit having a hierarchical neural network that recognizes an object included in the generated input data, the computer program comprising:
a step in which the data generation unit performs, as the data processing, a process of giving the original image invariance to rotation and inversion;
a step in which the image processing unit performs a process of learning the parameters of the network based on the recognition result of the network when the original image is a sample image; and
a step in which the image processing unit outputs the recognition result of the network when the original image is a recognition target image.

An image recognition method executed by an image recognition device comprising a data generation unit that generates input data by performing predetermined data processing on an original image, and an image processing unit having a hierarchical neural network that recognizes an object included in the generated input data, the image recognition method comprising:
a step in which the data generation unit performs, as the data processing, a process of giving the original image invariance to rotation and inversion;
a step in which the image processing unit performs a process of learning the parameters of the network based on the recognition result of the network when the original image is a sample image; and
a step in which the image processing unit outputs the recognition result of the network when the original image is a recognition target image.

A product monitoring system comprising:
an imaging device that photographs a plurality of products;
a robot device that takes out any of the photographed plurality of products; and
a control device that includes an image recognition device and instructs the robot device which product to take out,
wherein the image recognition device comprises:
a data generation unit that generates input data by performing predetermined data processing on an original image; and
an image processing unit having a hierarchical neural network that recognizes the type of an object included in the generated input data,
when the original image is a sample image, the image processing unit performs a process of learning the parameters of the network based on the recognition result of the network, and when the original image is a recognition target image, the image processing unit performs a process of outputting the recognition result of the network,
the data processing performed by the data generation unit is a process of giving the original image invariance to at least one of rotation and inversion, and
the control device instructs the robot device to take out the product that the image recognition device has recognized as a defective product.

A computer program for causing a computer to function as an image recognition device comprising a data generation unit that generates input data by performing predetermined data processing on an original image, and an image processing unit having a hierarchical neural network that recognizes an object included in the generated input data, the computer program comprising:
a step in which the data generation unit performs, as the data processing, a process of giving the original image invariance to at least one of rotation and inversion;
a step in which the image processing unit performs a process of learning the parameters of the network based on the recognition result of the network when the original image is a sample image; and
a step in which the image processing unit outputs the recognition result of the network when the original image is a recognition target image,
wherein the data processing performed by the data generation unit includes a first process and a second process defined below, and
the image filter consists of elements included in a plurality of color vectors obtained by decomposing the color vector of an arbitrary pixel point, defined in polar coordinates having a predetermined point of the original image as the origin, into arbitrary directions that open at a predetermined angle with the pixel point as the base point.
First process: a process of generating an image filter having invariance to at least one of rotation and inversion with respect to the original image.
Second process: a process of convolving the image filter generated in the first process with the original image.

A computer program for causing a computer to function as an image recognition device comprising a data generation unit that generates input data by performing predetermined data processing on an original image, and an image processing unit having a hierarchical neural network that recognizes an object included in the generated input data, the computer program comprising:
a step in which the data generation unit performs, as the data processing, a process of giving the original image invariance to at least one of rotation and inversion;
a step in which the image processing unit performs a process of learning the parameters of the network based on the recognition result of the network when the original image is a sample image; and
a step in which the image processing unit outputs the recognition result of the network when the original image is a recognition target image,
wherein the data processing performed by the data generation unit includes a first process and a second process defined below, and
the image filter consists of elements included in two color vectors obtained by decomposing the color vector of an arbitrary pixel point, defined in polar coordinates having a predetermined point of the original image as the origin, into a radial direction and a tangential direction with the pixel point as the base point.
First process: a process of generating an image filter having invariance to at least one of rotation and inversion with respect to the original image.
Second process: a process of convolving the image filter generated in the first process with the original image.

An image recognition method executed by an image recognition device comprising a data generation unit that generates input data by performing predetermined data processing on an original image, and an image processing unit having a hierarchical neural network that recognizes an object included in the generated input data, the image recognition method comprising:
a step in which the data generation unit performs, as the data processing, a process of giving the original image invariance to at least one of rotation and inversion;
a step in which the image processing unit performs a process of learning the parameters of the network based on the recognition result of the network when the original image is a sample image; and
a step in which the image processing unit outputs the recognition result of the network when the original image is a recognition target image,
wherein the data processing performed by the data generation unit includes a first process and a second process defined below, and
the image filter consists of elements included in a plurality of color vectors obtained by decomposing the color vector of an arbitrary pixel point, defined in polar coordinates having a predetermined point of the original image as the origin, into arbitrary directions that open at a predetermined angle with the pixel point as the base point.
First process: a process of generating an image filter having invariance to at least one of rotation and inversion with respect to the original image.
Second process: a process of convolving the image filter generated in the first process with the original image.

An image recognition method executed by an image recognition device comprising a data generation unit that generates input data by performing predetermined data processing on an original image, and an image processing unit having a hierarchical neural network that recognizes an object included in the generated input data, the image recognition method comprising:
a step in which the data generation unit performs, as the data processing, a process of giving the original image invariance to at least one of rotation and inversion;
a step in which the image processing unit performs a process of learning the parameters of the network based on the recognition result of the network when the original image is a sample image; and
a step in which the image processing unit outputs the recognition result of the network when the original image is a recognition target image,
wherein the data processing performed by the data generation unit includes a first process and a second process defined below, and
the image filter consists of elements included in two color vectors obtained by decomposing the color vector of an arbitrary pixel point, defined in polar coordinates having a predetermined point of the original image as the origin, into a radial direction and a tangential direction with the pixel point as the base point.
First process: a process of generating an image filter having invariance to at least one of rotation and inversion with respect to the original image.
Second process: a process of convolving the image filter generated in the first process with the original image.