JP6701057B2

JP6701057B2 - Recognizer, program

Info

Publication number: JP6701057B2
Application number: JP2016215759A
Authority: JP
Inventors: 和寿石丸; ホセインテヘラニニキネジャド; 白井　孝昌; 孝昌白井; ジョンヴィジャイ; 誠一三田
Original assignee: Toyota School Foundation; Denso Corp
Current assignee: Toyota School Foundation; Denso Corp
Priority date: 2016-11-04
Filing date: 2016-11-04
Publication date: 2020-05-27
Anticipated expiration: 2036-11-04
Also published as: JP2018073308A

Description

The present disclosure relates to object recognition.

Patent Literature 1 discloses the following technique for detecting road terrain information. High-level spatial features are generated for selected locations called base points. This spatial feature generation is based on value-continuous confidence representations that capture visual and physical properties of the environment.

Patent Literature 1: JP 2013-073620 A

In practice, the technique of Patent Literature 1 is not sufficiently accurate. Misrecognition is especially likely at boundaries between objects. This problem is not limited to the detection of road terrain. In view of the above, the problem addressed by the present disclosure is to recognize road terrain and other objects with high accuracy.

One aspect of the present disclosure is a recognition device comprising: an initial recognition unit (S200) that labels each pixel and obtains initial recognition values by inputting per-pixel information acquired from an image to be recognized, as input data, into a deep neural network (13) including a plurality of intermediate layers; a scene identification unit (S300) that identifies the scene to which the image to be recognized belongs, based on scene-identification intermediate data that is at least part of the intermediate data output from at least one of the plurality of intermediate layers; and a correction unit (S500) that corrects the initial recognition values based on the intermediate data, the correction unit (S500) including a correction execution unit (S740) that executes the correction based on the identified scene.

According to this aspect, the labels assigned by the initial recognition unit are corrected using intermediate data from the initial recognition, so labeling can be performed with high accuracy, which in turn improves object recognition accuracy.

FIG. 1 is a functional block diagram of an automobile equipped with the recognition device.
FIG. 2 is a flowchart showing the recognition process.
FIG. 3 is a diagram showing labeling by the deep neural network.
FIG. 4 is a diagram in which the initial recognition values are superimposed on the image.
FIG. 5 is a diagram showing how the scene classifier is trained.
FIG. 6 is a diagram showing an example of a scene.
FIG. 7 is a diagram showing the relationship between error regions and patches.
FIG. 8 is a flowchart showing label narrowing using patches.
FIG. 9 is a diagram showing how the positional relationship feature and the appearance-distance feature are learned.
FIG. 10 is a diagram showing the spatial relationship between patches and receptors.
FIG. 11 is a flowchart showing label narrowing using pixels.
FIG. 12 is a diagram showing how the appearance-distance feature is generated.
FIG. 13 is a diagram showing an example of a recognition result based on the initial recognition values.
FIG. 14 is a diagram showing an example of a recognition result based on the final recognition values.

As shown in FIG. 1, the recognition device 10 is mounted on an automobile 1. The recognition device 10 acquires the RGB value of each pixel as imaging data from each of a camera 21 and a camera 22. The recognition device 10 recognizes the image represented by the imaging data through the recognition process shown in FIG. 2 and inputs the recognition result to a control unit 25. The control unit 25 controls the operation of the automobile 1 based on the input recognition result.

The cameras 21 and 22 are mounted so that the area ahead of the automobile 1 falls within their imaging range. Together, the cameras 21 and 22 form a stereo camera.

The recognition device 10 is configured as a well-known computer including a CPU 11 and a memory 12 such as a ROM and a RAM. The recognition device 10 performs the recognition process by using the CPU 11 and the memory 12 to execute a program stored in the memory.

The recognition device 10 repeatedly executes the recognition process while the automobile is able to travel. The recognition process implements semantic segmentation, in which a label is assigned to each pixel.

On starting the recognition process, the recognition device 10 generates input data in S100. The input data has the hue, saturation, and distance of each pixel as its parameters. The recognition device 10 generates the input data from the imaging data input from the cameras 21 and 22, using a known method.
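As a rough illustration of S100, the sketch below builds one input record per pixel from an RGB value and a stereo disparity. The text only states that a known method is used, so the HSV conversion through Python's `colorsys` module and the pinhole stereo relation (distance = focal length × baseline / disparity) are assumptions made here, as are the function name and the camera parameters.

```python
import colorsys

def make_input_pixel(r, g, b, disparity_px, focal_px=1000.0, baseline_m=0.3):
    """Return (hue, saturation, distance) for one pixel.

    r, g, b are 0-255 RGB values; disparity_px is the stereo disparity in
    pixels. focal_px and baseline_m are hypothetical camera parameters.
    """
    hue, saturation, _value = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    # Standard stereo relation; zero disparity is treated as infinitely far.
    if disparity_px <= 0:
        distance = float("inf")
    else:
        distance = focal_px * baseline_m / disparity_px
    return hue, saturation, distance
```

The brightness component is discarded because the input data is described as holding only hue, saturation, and distance.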

Next, in S200, the recognition device 10 performs initial recognition to obtain initial recognition values. The initial recognition values are obtained by inputting the input data into the deep neural network (hereinafter, DN) 13 shown in FIG. 3. The DN 13 is stored in the memory 12 in advance.

The DN 13 includes a plurality of intermediate layers. Specifically, the DN 13 includes a convolutional neural network and a deconvolutional neural network. The input data input to the DN 13 is converted by a convolutional layer into intermediate data C1 of 224×224 pixels.

The intermediate data C1 is converted by a pooling layer into intermediate data P2 of 112×112 pixels. This embodiment uses max pooling. Outputs from convolutional layers and pooling layers alternate in this way until the fully connected layer FC of 1×1 pixels is reached. Two or more convolutional layers may lie between one pooling layer and the next.
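The halving from 224×224 to 112×112 is consistent with max pooling over a 2×2 window with stride 2, although the window size is not stated and is assumed here. The minimal sketch below shows that operation on a small grid, plus a helper (a hypothetical name) that lists the side lengths after repeated halving.

```python
def max_pool_2x2(grid):
    """2x2 max pooling with stride 2 on a 2-D list of numbers."""
    rows, cols = len(grid), len(grid[0])
    return [
        [
            max(grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1])
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]

def pooled_sizes(size, times):
    """Side lengths obtained by halving `size` the given number of times."""
    sizes = [size]
    for _ in range(times):
        size //= 2
        sizes.append(size)
    return sizes
```

Under this assumption, five halvings take a side of 224 down to 7, which agrees with the 7×7 size given for the intermediate data P5.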

The fully connected layer FC is converted by a deconvolution layer into intermediate data DC-FC of 7×7 pixels. The intermediate data DC-FC is converted by an unpooling layer into intermediate data UP1 of 14×14 pixels. Outputs from deconvolution layers and unpooling layers alternate in this way, and the initial recognition values are obtained when intermediate data DC5 of 224×224 pixels is output. FIG. 4 shows the initial recognition values superimposed on the captured image.

The initial recognition values are data in which a label is assigned to each of the 224×224 pixels. In this embodiment, label K1 means road, label K2 means obstacle, label K3 means sky, and label K4 means ceiling. The ceiling is, for example, the ceiling of a tunnel.

The DN 13 has been trained by supervised learning on many pairs of training input data and ground-truth labels, and each intermediate layer has been fine-tuned. When input data is input to the DN 13, a confidence value for each label is output for each pixel. Initial recognition is performed by adopting, for each pixel, the label with the highest confidence value.
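The final step of the initial recognition, choosing the highest-confidence label per pixel, can be sketched as follows. The label names and the list-of-lists layout of the confidence map are illustrative assumptions; the network itself is not modelled.

```python
# Hypothetical names for labels K1 to K4, in that order.
LABELS = ("K1_road", "K2_obstacle", "K3_sky", "K4_ceiling")

def initial_label(confidences):
    """Return the label whose confidence value is highest for one pixel."""
    best_index = max(range(len(confidences)), key=lambda i: confidences[i])
    return LABELS[best_index]

def initial_recognition(confidence_map):
    """Apply initial_label to every pixel of a 2-D grid of confidence lists."""
    return [[initial_label(pixel) for pixel in row] for row in confidence_map]
```

When two labels tie, this sketch keeps the first one; the text does not say how ties are resolved.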

After the initial recognition, the recognition device 10 identifies the scene in S300 and identifies error regions in S400.

The scene is identified by inputting scene-identification intermediate data into a scene classifier 310. In this embodiment, the scene-identification intermediate data is the intermediate data P5.

The scene classifier 310 uses a random forest. It has been trained by the supervised learning shown in FIG. 5.

For example, intermediate data P5a generated from the image shown in FIG. 6 is used as input. The ground truth for the scene shown in FIG. 6, namely "two lanes, with the outer boundaries defined by two solid white lines and the lanes separated by one broken line", is learned as scene a. A large amount of data is prepared for scene a and for each of various other scenes (for example, a highway or a tunnel), and supervised learning is performed on it.

Next, error regions are described. An error region is a region in which the initial recognition values have low reliability. Error regions are identified using the confidence values output in the initial recognition. Specifically, a pixel for which no label has a markedly high confidence value is identified as belonging to an error region.

In this embodiment, when the confidence values for each pixel are normalized so that they total 100%, a region in which every confidence value is below a threshold (for example, 80%) is identified as an error region.

For example, if the confidence value for road at a certain pixel is 99%, that pixel is judged not to belong to an error region. In contrast, if the confidence values are 40% for road, 40% for obstacle, 10% for sky, and 10% for ceiling, the pixel is judged to belong to an error region. In the example shown in FIG. 7, error regions E1, E2, E3, and E4 have been identified. Each of the error regions E1, E2, E3, and E4 is identified so as to form a closed region.
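The per-pixel test described above can be sketched as below: normalize the confidence values so they sum to 1, then flag the pixel when no label reaches the threshold. The function name is hypothetical, and grouping flagged pixels into closed regions is left out.

```python
def is_error_pixel(confidences, threshold=0.80):
    """True if no label reaches the threshold after normalisation."""
    total = sum(confidences)
    if total <= 0:
        # Assumption: a pixel with no usable confidence is treated as unreliable.
        return True
    normalised = [c / total for c in confidences]
    return max(normalised) < threshold
```

With the two examples from the text, a pixel with 99% confidence for road is not flagged, whereas the 40/40/10/10 split is.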

Finally, the final recognition values are obtained by executing a correction process that uses the initial recognition values, the error region information, the identified scene, and narrowing intermediate data. In this embodiment, the narrowing intermediate data is the intermediate data C1, as shown in FIG. 3.

The correction process performs patch-level narrowing in S600 and pixel-level narrowing in S700.

In this embodiment, a patch is one of the regions obtained by dividing the image in advance into a 4×4 grid, as shown in FIG. 7. Each patch therefore has 56×56 pixels. Each patch is assigned a serial number in advance. For example, the top-left patch is patch p1, the patch to the right of p1 is patch p2, and the patch directly below p1 is patch p5.

As shown in FIGS. 2 and 8, the patch-level narrowing uses the initial recognition values, the error region information, and the identified scene.

Specifically, in S610 the recognition device 10 first identifies the patches that contain error regions. In the example shown in FIG. 7, error region E1 lies in patch p10, error region E2 in patch p6, error region E3 in patch p11, and error region E4 in patch p12. Patches p6, p10, p11, and p12 are therefore identified.
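The patch numbering and step S610 can be sketched as follows, assuming row-major numbering (p1 at the top left, p2 to its right, p5 directly below), which matches the three examples given. The function names and the (row, col) pixel convention are illustrative.

```python
def patch_number(row, col, image_size=224, patches_per_side=4):
    """Serial number (1-16) of the patch containing pixel (row, col)."""
    patch_size = image_size // patches_per_side  # 56 for a 224x224 image
    return (row // patch_size) * patches_per_side + (col // patch_size) + 1

def patches_with_errors(error_pixels):
    """Sorted patch numbers that contain at least one error pixel (S610)."""
    return sorted({patch_number(r, c) for r, c in error_pixels})
```

For instance, error pixels at (60, 60), (120, 70), (130, 130), and (125, 200) fall in patches 6, 10, 11, and 12.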

Next, in S620, the recognition device 10 reads out the learned positional relationship feature for each identified patch, based on the identified scene.

The positional relationship feature and its learning are now described. In this embodiment, the positional relationship feature is a scene-dependent feature that describes the relative positions of labels. As shown in FIG. 9, it is learned for each scene by supervised learning.

For learning, many ground-truth sets are provided for supervised learning. Each set consists of the ground-truth scene and the ground-truth labels assigned to each pixel (224×224) of input data belonging to that scene.

Learning is explained below using one such set as an example, where the ground-truth scene in the set is scene s and the spatial arrangement of the ground-truth labels in the set is label arrangement G. The positional relationship feature is learned by associating the positional relationship feature of label arrangement G with scene s.

To calculate the positional relationship feature of label arrangement G, a label arrangement P divided into the patches described above and a label arrangement Q divided into four receptors are first generated, as shown in FIG. 10. Each receptor therefore has 112×112 pixels. The receptors in label arrangement Q are numbered q1 to q4. Label arrangement Q contains fewer receptors than label arrangement P contains patches.

The positional relationship feature g is calculated for each of patches p1 to p16. Specifically, it is given by the following equation.

$$g_{pn}^{s} = \left[\omega_{pn}^{K1},\ \omega_{pn}^{K2},\ \omega_{pn}^{K3},\ \omega_{pn}^{K4}\right] \qquad (1)$$

The superscript s indicates that g is the positional relationship feature of scene s. The subscript pn denotes any one of p1 to p16. K1 to K4 denote the labels. Below, Km denotes any one of K1 to K4.

$\omega_{pn}^{Km}$ represents the geometric relationship between label Km in patch pn and all labels in receptors q1 to q4. The arrows in FIG. 10 indicate the vectors that make up $\omega_{p16}^{K1}$.

These vectors are written as follows.

$$\omega_{pn}^{Km} = \left[v(pn,q1)_{Km}^{K1}, \ldots, v(pn,q4)_{Km}^{K1}, \ldots, v(pn,q1)_{Km}^{K4}, \ldots, v(pn,q4)_{Km}^{K4}\right] \qquad (2)$$

$v(pn,q1)_{Km}^{K1}$ is the two-dimensional average space vector between label Km in patch pn and label K1 in receptor q1. $v(pn,q4)_{Km}^{K1}$ is the two-dimensional average space vector between label Km in patch pn and label K1 in receptor q4. $v(pn,q1)_{Km}^{K4}$ is the two-dimensional average space vector between label Km in patch pn and label K4 in receptor q1. $v(pn,q4)_{Km}^{K4}$ is the two-dimensional average space vector between label Km in patch pn and label K4 in receptor q4.

For example, the average space vector $v(pn,q4)_{Km}^{K1}$ is calculated from the space vectors between the centroid of label Km in patch pn and all the pixels that make up receptor q4 containing label K1.

The average space vector is the mean of all the calculated space vectors and is expressed as a two-dimensional quantity consisting of a magnitude and an angle. The angle is measured from the horizontal axis.

$v(pn,qi)_{Km}^{K1}$ is the zero vector [0,0] when label K1 is not contained in receptor qi, where i is any of 1 to 4. For example, as shown in FIG. 10, receptor q1 does not contain label K1, so $v(pn,q1)_{Km}^{K1}$ is the zero vector. For this reason, FIG. 10 shows no vector for $v(p16,q1)_{K1}^{K1}$.

Similarly, $v(pn,qi)_{Km}^{K1}$ is the zero vector [0,0] when label Km is not contained in patch pn. It follows from equation (2) that $\omega_{pn}^{Km}$ contains only zeros when label Km is not contained in patch pn. In that case, equation (2) becomes the following.

$$\omega_{pn}^{Km} = [0, \ldots, 0, \ldots, 0, \ldots, 0] \qquad (3)$$
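One entry of equation (2) can be sketched as below. The sketch reads the description as "average the vectors from the centroid of the Km pixels in the patch to every pixel labelled Kj in the receptor", which is one plausible interpretation rather than a confirmed one. The label map layout, the (top, left, size) region tuples, the image-coordinate angle convention (rows increasing downward), and the function names are all assumptions.

```python
import math

def region_pixels(label_map, top, left, size, label):
    """(row, col) of every pixel carrying `label` inside a square region."""
    return [
        (r, c)
        for r in range(top, top + size)
        for c in range(left, left + size)
        if label_map[r][c] == label
    ]

def average_space_vector(label_map, patch, receptor, label_km, label_kj):
    """Mean vector from the Km centroid in `patch` to the Kj pixels in `receptor`.

    patch and receptor are (top, left, size) tuples. Returns (magnitude, angle)
    with the angle in radians from the horizontal axis, or (0.0, 0.0) when
    either label is absent, mirroring the zero-vector rule.
    """
    sources = region_pixels(label_map, *patch, label_km)
    targets = region_pixels(label_map, *receptor, label_kj)
    if not sources or not targets:
        return (0.0, 0.0)
    centre_r = sum(r for r, _ in sources) / len(sources)
    centre_c = sum(c for _, c in sources) / len(sources)
    mean_dr = sum(r - centre_r for r, _ in targets) / len(targets)
    mean_dc = sum(c - centre_c for _, c in targets) / len(targets)
    return (math.hypot(mean_dc, mean_dr), math.atan2(mean_dr, mean_dc))
```

Under equation (2), each $\omega_{pn}^{Km}$ would collect 16 such vectors, one per receptor and target label.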

The recognition device 10 has completed learning on many ground-truth sets by the above method. As described earlier, in S620 the recognition device 10 reads out the positional relationship feature g, based on the identified scene, for each patch identified as containing an error region.

Meanwhile, in S630, the recognition device 10 calculates the positional relationship feature $\rho_{pn}^{s}(Km)$ for each patch identified as containing an error region, using the same calculation as in learning. In S630, $\rho_{pn}^{s}$ is calculated with Km set to each of labels K1 to K4 in turn. When calculating for label K1, label K1 is treated as the correction candidate; that is, the dominant label in the patch being calculated is assumed to be label K1. The calculations for labels K2 to K4 are done in the same way.

The dominant label in a patch is the label that is assigned to markedly more pixels in that patch than any other label.

Next, as S630, patch-based pruning is executed and the narrowed-down labels are output. Patch-based pruning specifically means the following.

For each patch, the Euclidean distance between the positional relationship feature g read out in S620 and the positional relationship feature ρ calculated in S630 is computed for each of labels K1 to K4. Labels for which this distance is greater than or equal to a threshold are excluded, and the correction candidates are narrowed down to the remaining labels. In some cases only one dominant label remains, and in others several labels remain.

Patch-based pruning is thus a process that excludes, for each patch, the labels that cannot be correction candidates.
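The pruning rule for a single patch can be sketched as follows. The dictionaries mapping each label to a flat feature list, the function name, and the threshold value are illustrative assumptions; only the "exclude when the distance is at or above the threshold" rule comes from the text.

```python
import math

def prune_labels(learned, observed, threshold):
    """Keep labels whose learned and observed features are close.

    learned and observed map a label name to a flat list of numbers for that
    label. A label is dropped when the Euclidean distance between the two
    lists is greater than or equal to the threshold.
    """
    kept = []
    for label, g_vec in learned.items():
        distance = math.dist(g_vec, observed[label])
        if distance < threshold:
            kept.append(label)
    return kept
```

For example, with a threshold of 5, a label at distance 1 survives while a label at distance exactly 5 is excluded.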

As shown in FIGS. 2 and 11, the pixel-level narrowing outputs the final recognition values using the narrowing intermediate data, the error region information, the identified scene, and the narrowed-down label information.

First, in S710, the recognition device 10 identifies the pixel at the centroid of each error region (hereinafter, the centroid pixel) as the representative of that error region. If no pixel coincides exactly with the centroid, the pixel closest to the centroid is identified as the centroid pixel.
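A minimal sketch of S710 follows. It assumes an error region is given as a list of (row, col) pixels and that the representative is chosen from the region's own pixels; the text does not say whether a pixel outside the region could be chosen.

```python
def centroid_pixel(region):
    """Return the pixel of `region` closest to the region's centre of gravity.

    region is a non-empty list of (row, col) tuples.
    """
    mean_r = sum(r for r, _ in region) / len(region)
    mean_c = sum(c for _, c in region) / len(region)
    # Squared distance is enough for picking the nearest pixel.
    return min(region, key=lambda p: (p[0] - mean_r) ** 2 + (p[1] - mean_c) ** 2)
```

If a pixel sits exactly on the centroid its distance is zero, so it is selected, which covers both cases in the text.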

Next, in S720, the recognition device 10 reads out the learned appearance-distance feature for the centroid pixel of each error region. The appearance-distance feature is a scene-dependent feature. Because it relates to hue, saturation, and distance, it also corresponds to the parameters contained in the input data.

As shown in FIG. 9, the appearance-distance feature is learned for each scene by supervised learning. For learning, many ground-truth sets are provided. In addition to the ground truth used for learning the positional relationship feature, each set includes data obtained by applying convolution to the input data belonging to the scene. The convolution here is the one that produces the intermediate data C1 in the DN 13. In other words, this data is the intermediate data C1 output by the DN 13.

Although omitted from the description of the DN 13, the intermediate data C1 consists of D sets of 224×224-pixel data, where D is an integer of 2 or more. A D-dimensional feature vector obtained by the convolution can therefore be associated with, for example, the pixel n shown in FIG. 12. FIG. 12 shows the feature vector λ(h,w) corresponding to pixel n.

As shown in FIG. 12, this embodiment uses a 3×3 filter in the convolutional layer that produces the intermediate data C1. The feature vector λ is therefore a feature that reflects the parameters of the eight pixels surrounding pixel n.

The appearance-distance feature corresponding to pixel n is determined by averaging such feature vectors and is used as learning data. The appearance-distance feature is generated from information on appearance and distance, where appearance means saturation and hue. Features relating to hue, saturation, and distance are obtained as the appearance-distance feature because the input data consists of these parameters.

The average of the feature vectors is the average of the feature vectors obtained for pixels that are at the same position in input data belonging to the same scene and that carry the same ground-truth label. Learning data for the appearance-distance feature is therefore obtained for each label, per scene and per pixel. In S720, the appearance-distance feature of each label corresponding to the identified scene and the centroid pixel is read out.
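The averaging described above can be sketched as a table keyed by scene, pixel position, and ground-truth label. The sample tuple layout and the function name are illustrative assumptions; the feature vectors stand in for the D-dimensional vectors taken from the intermediate data C1.

```python
from collections import defaultdict

def learn_appearance_distance(samples):
    """Average feature vectors per (scene, pixel position, label).

    samples is an iterable of (scene, (h, w), label, feature_vector), where
    feature_vector is the D-dimensional vector taken from C1 at that pixel.
    """
    sums = {}
    counts = defaultdict(int)
    for scene, position, label, vector in samples:
        key = (scene, position, label)
        if key not in sums:
            sums[key] = [0.0] * len(vector)
        sums[key] = [s + v for s, v in zip(sums[key], vector)]
        counts[key] += 1
    return {key: [s / counts[key] for s in total] for key, total in sums.items()}
```

Reading out the learned feature in S720 would then amount to a lookup with the identified scene, the centroid pixel position, and each label in turn.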

Meanwhile, in S730, the appearance-distance feature for the centroid pixel of each error region is obtained from the intermediate data C1, which serves as the narrowing intermediate data. In S730, the feature vector obtained by the same method as in the learning described above is taken as the appearance-distance feature.

Finally, in S740, the final recognition values are output. As shown in FIG. 11, S740 is executed based on the outputs of S720 and S730 and on the labels narrowed down in S630. In S740, labels are corrected based on Euclidean distance, specifically as follows.

For each centroid pixel, the Euclidean distance between the feature vector read out in S720 and the feature vector obtained in S730 is calculated for each label. Labels for which this distance is less than or equal to a threshold become correction candidates for the entire error region containing the centroid pixel. If any of these correction candidates matches any of the labels narrowed down in S600, the label given as the initial recognition value is corrected to that label.

If more than one label matches, the label is corrected to the one with the smallest sum of the Euclidean distances for the information feature and the positional relationship feature. In this case, the information feature and the positional relationship feature may be weighted as appropriate. The output containing the labels corrected in this way is the final recognition values.
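The decision in S740 for one error region can be sketched as below. The argument names, the weights, and the choice to keep the initial label when no candidate matches are assumptions; the threshold test, the match against the patch-level labels, and the smallest-sum tie-break follow the text.

```python
import math

def correct_label(initial, learned, observed, patch_labels, patch_distances,
                  threshold, w_info=1.0, w_pos=1.0):
    """Return the corrected label for one error region.

    learned maps label -> learned feature vector at the centroid pixel,
    observed is the feature vector read from C1 at that pixel,
    patch_labels is the set left by the patch-level narrowing, and
    patch_distances maps label -> its positional-feature distance.
    """
    info_distances = {k: math.dist(v, observed) for k, v in learned.items()}
    candidates = [k for k, d in info_distances.items() if d <= threshold]
    matches = [k for k in candidates if k in patch_labels]
    if not matches:
        return initial  # assumption: keep the initial label when nothing matches
    # Several matches: pick the smallest weighted sum of the two distances.
    return min(matches,
               key=lambda k: w_info * info_distances[k] + w_pos * patch_distances[k])
```

The corrected label would then be applied to every pixel of the error region that the centroid pixel represents.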

Compared with the initial recognition values shown in FIG. 13, the final recognition values shown in FIG. 14 show improved recognition accuracy, particularly in the region enclosed by the broken line. That is, with the final recognition values, the boundary between the road and the obstacle is recognized more accurately in that region.

(1) Recognition accuracy improves as described above because intermediate data is used.
(2) The scene-identification intermediate data, which is one kind of intermediate data, is used to identify the scene. Correction using the identified scene can therefore be executed.
(3) The intermediate data P5 used as the scene-identification intermediate data has been compressed to 7×7 by multiple pooling operations, which makes it suitable for scene identification.
(4) Scene identification is highly accurate because it is based on comparison with data learned by supervised learning.

(5) Using the intermediate data C1 as the narrowing intermediate data yields pixel-level features that reflect the influence of the surroundings. Comparing this feature vector with the learning result improves recognition accuracy.
(6) The intermediate data C1 is obtained by a single convolution of the input data, so it is suitable as a feature indicating the relationship with the input data.
(7) The intermediate data C1 has not undergone pooling and is not compressed, so it is suitable as a feature indicating the relationship with the input data.

(8) The correction process uses the positional relationship feature, which improves correction accuracy.
(9) The positional relationship feature can be calculated appropriately by using patches and receptors.

(10) Since the correction process executes the correction by combining the information feature amount and the positional relationship feature amount, the accuracy of the correction is improved.

(11) Since the correction process is executed on the error regions, labels with a high confidence value in the initial estimation can be excluded from correction. This reduces the processing load and suppresses the correction of labels that do not need to be corrected.
(12) Since patches that do not include an error region are excluded from the calculation of the positional relationship feature amount, the processing load is reduced.
(13) Since each error region is corrected by applying the correction result of the pixel that represents the error region, the processing load is reduced.
(14) Since the pixel that represents an error region is the centroid pixel, the accuracy of the correction is improved compared with the case where a pixel located at the edge of the error region is the representative.
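Effects (11), (13), and (14) can be sketched as follows. The sketch assumes 4-connected grouping of low-confidence pixels, and it snaps the centroid to the nearest pixel inside the region so that the representative always belongs to the region; both choices, the threshold value, and all function names are assumptions. The `correct_pixel` callback stands in for the pixel-level correction, which is not reproduced here.

```python
from collections import deque


def find_error_regions(confidence, threshold):
    """Group 4-connected pixels whose confidence is below the threshold."""
    h, w = len(confidence), len(confidence[0])
    seen = [[False] * w for _ in range(h)]
    regions = []
    for r in range(h):
        for c in range(w):
            if seen[r][c] or confidence[r][c] >= threshold:
                continue
            region, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w and not seen[ny][nx]
                            and confidence[ny][nx] < threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(region)
    return regions


def representative_pixel(region):
    """Region pixel nearest to the centroid (assumption: snap into the region)."""
    cy = sum(y for y, _ in region) / len(region)
    cx = sum(x for _, x in region) / len(region)
    return min(region, key=lambda p: (p[0] - cy) ** 2 + (p[1] - cx) ** 2)


def correct_regions(labels, confidence, threshold, correct_pixel):
    """Correct one representative per region and reuse its label region-wide."""
    corrected = [row[:] for row in labels]
    for region in find_error_regions(confidence, threshold):
        new_label = correct_pixel(representative_pixel(region))
        for y, x in region:
            corrected[y][x] = new_label
    return corrected


confidence = [
    [0.9, 0.9, 0.9, 0.9],
    [0.9, 0.2, 0.3, 0.9],
    [0.9, 0.1, 0.2, 0.9],
    [0.9, 0.9, 0.9, 0.4],
]
labels = [["road"] * 4 for _ in range(4)]
calls = []


def dummy_correct(pixel):
    """Placeholder for the pixel-level correction; always answers 'car'."""
    calls.append(pixel)
    return "car"


regions = find_error_regions(confidence, 0.5)
corrected = correct_regions(labels, confidence, 0.5, dummy_correct)
```

Here five low-confidence pixels form two error regions, so the pixel-level correction runs only twice, and the eleven high-confidence labels are never touched.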

The appearance-distance feature amount corresponds to the information feature amount. In addition, S200 corresponds to the initial recognition unit, S300 to the scene identification unit, S400 to the specifying unit, S500 to the correction unit, S630 to the calculation unit, S710 to the centroid pixel acquisition unit, S730 to the information feature amount acquisition unit, and S740 to the correction execution unit.

The present disclosure is not limited to the embodiments, examples, and modifications described in this specification, and can be realized in various configurations without departing from its gist. For example, the technical features in the embodiments, examples, and modifications that correspond to the technical features in each aspect described in the Summary of the Invention can be replaced or combined as appropriate in order to solve some or all of the problems described above, or to achieve some or all of the effects described above. Any technical feature that is not described as essential in this specification can be deleted as appropriate. For example, the following modifications are possible.

Use of the intermediate data need not depend on the scene. For example, when recognition is performed on image data obtained from a camera fixed at a location, the scene is fixed. In such a case, the intermediate data may be used to recognize objects that can change within that fixed scene.

The present disclosure may be used for purposes other than travel control of an automobile. For example, as described above, it may be applied to a fixed camera, to other transportation equipment (for example, a motorcycle), or to a robot.

Only one of the patch-level narrowing in S600 and the pixel-level narrowing in S700 may be executed. Even in this case, the improvement in recognition accuracy obtained by using the intermediate data is still achieved.

Specification of the error regions need not be executed. In this case, the patch-level narrowing may target all patches, and the pixel-level narrowing may target all pixels.

The input data need not be composed of hue, saturation, and distance. For example, it may be composed of RGB values and distance, or of a luminance value and distance. When it is composed of a luminance value and distance, the captured image may be monochrome.
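The alternative input compositions above can be sketched per pixel as follows. The function name `make_input`, the mode names, the 0 to 255 color range, and the Rec. 601 luma weights used for the luminance value are assumptions made for illustration; the specification does not state how luminance is derived.

```python
import colorsys


def make_input(rgb, distance, mode="hs"):
    """Build the per-pixel input tuple for one of three compositions.

    rgb:      (R, G, B) with each value in 0..255 (assumed range).
    distance: distance for the pixel, in whatever unit the sensor provides.
    mode:     "hs"   -> (hue, saturation, distance)
              "rgb"  -> (R, G, B, distance) with colors scaled to 0..1
              "luma" -> (luminance, distance)
    """
    r, g, b = (value / 255.0 for value in rgb)
    if mode == "hs":
        hue, saturation, _value = colorsys.rgb_to_hsv(r, g, b)
        return (hue, saturation, distance)
    if mode == "rgb":
        return (r, g, b, distance)
    if mode == "luma":
        # Rec. 601 luma weights, chosen here only as a common convention.
        luminance = 0.299 * r + 0.587 * g + 0.114 * b
        return (luminance, distance)
    raise ValueError(f"unknown mode: {mode}")


red_hs = make_input((255, 0, 0), 12.5, mode="hs")
red_rgb = make_input((255, 0, 0), 12.5, mode="rgb")
white_luma = make_input((255, 255, 255), 12.5, mode="luma")
```

Whichever composition is chosen, the number of input channels of the network changes with it (three, four, or two in this sketch), so the first layer would have to be sized accordingly.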

The distance information included in the input data may be acquired from a source other than a stereo camera. For example, a depth sensor may be used, or radar waves or the like may be used.

In the above embodiment, some or all of the functions and processes realized by software may be realized by hardware. Likewise, some or all of the functions and processes realized by hardware may be realized by software. As the hardware, various circuits may be used, for example, an integrated circuit, a discrete circuit, or a circuit module combining such circuits.

13: deep neural network; 20: recognition device

Claims

A recognition device comprising:
an initial recognition unit (S200) that labels each pixel and acquires an initial recognition value by inputting information for each pixel acquired from an image to be recognized, as input data, to a deep neural network (13) including a plurality of intermediate layers;
a scene identification unit (S300) that identifies a scene to which the image to be recognized belongs, based on scene identification intermediate data that is at least a part of intermediate data output from at least one of the plurality of intermediate layers; and
a correction unit (S500) that executes, on the initial recognition value, a correction based on the intermediate data, the correction unit (S500) including a correction execution unit (S740) that executes the correction based on the identified scene.

The recognition device according to claim 1, wherein the scene identification intermediate data is data that has been subjected to at least one pooling process.

The recognition device according to claim 1 or claim 2, wherein the scene identification unit identifies the scene based on a comparison between the scene identification intermediate data and data learned by supervised learning.

The recognition device according to any one of claims 1 to 3, wherein the correction unit further includes an information feature amount acquisition unit (S730) that acquires, from narrowing intermediate data that is at least a part of the intermediate data, an information feature amount that is a feature amount corresponding to a parameter included in the input data, and
the correction execution unit executes the correction based on a comparison between the acquired information feature amount and an information feature amount learned by supervised learning with the identified scene as a true value.

The recognition device according to claim 4, wherein the narrowing intermediate data is data obtained by performing convolution processing on the input data at least once.

The recognition device according to claim 4 or claim 5, wherein the narrowing intermediate data is data on which no pooling process has been performed with respect to the input data.

The recognition device according to any one of claims 1 to 6, wherein the correction unit further includes a calculation unit (S630) that calculates a positional relationship feature amount regarding a relative positional relationship between the labels in the initial recognition value, and
the correction execution unit executes the correction based on a comparison between the calculated positional relationship feature amount and a positional relationship feature amount learned by supervised learning with the identified scene as a true value.

The recognition device according to claim 7, wherein the calculation unit calculates the positional relationship feature amount by preparing a plurality of patches, each formed by grouping labeled pixels included in the initial recognition value at a first size, and a plurality of receptors, each formed by grouping labeled pixels included in the initial recognition value at a second size larger than the first size, and by deriving a relationship between the label included in each of the plurality of patches and the labels included in each of the plurality of receptors.

The recognition device according to claim 7 or claim 8 as dependent on claim 4, claim 5, or claim 6, wherein the correction execution unit executes the correction by narrowing down labels as correction candidates using the information feature amount and the positional relationship feature amount.

The recognition device according to any one of claims 1 to 9, further comprising a specifying unit (S400) that specifies an error region, which is a closed region formed by pixels whose reliability as the initial recognition value is less than a threshold value,
wherein the correction unit executes the correction on at least some of the labels assigned, as the initial recognition value, to the pixels included in the specified error region.

The recognition device according to claim 10 as dependent on claim 8, wherein the calculation unit calculates the positional relationship feature amount for the patches that include the specified error region.

The recognition device according to claim 10 as dependent on any one of claims 4 to 6, wherein the correction execution unit applies, to the correction of each error region, the correction result for a pixel that represents that error region.

The recognition device according to claim 12, wherein the correction execution unit further includes a centroid pixel acquisition unit (S710) that acquires the pixel at the centroid of each error region as the representative pixel.

A program for causing a recognition device to:
label each pixel and acquire an initial recognition value by inputting information for each pixel acquired from an image to be recognized, as input data, to a deep neural network (13) including a plurality of intermediate layers;
identify a scene to which the image to be recognized belongs, based on scene identification intermediate data that is at least a part of intermediate data output from at least one of the plurality of intermediate layers; and
correct the initial recognition value based on the identified scene.