JP7563241B2

JP7563241B2 - Neural network and training method thereof

Info

Publication number: JP7563241B2
Application number: JP2021034929A
Authority: JP
Inventors: ジャン・ホォイガン; 留安汪; 俊孫
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2020-04-08
Filing date: 2021-03-05
Publication date: 2024-10-08
Anticipated expiration: 2041-03-05
Also published as: CN113554042B; CN113554042A; JP2021168114A

Description

The present invention relates to a neural network and a training method thereof, and in particular to a neural network model for object detection and a corresponding training method.

With the spread of smartphones, people increasingly enjoy the convenience brought by science and technology, which has in turn stimulated the continued development of artificial intelligence technology. Artificial intelligence often requires substantial computing power to achieve a given result. However, because the processing power of mobile-platform hardware is usually limited, many mature algorithms cannot be deployed on smartphones.

Researchers have therefore begun to explore small models. In recent years, many efficient architectures have been proposed, such as Pelee described in Non-Patent Document 1, ShuffleNetV2 described in Non-Patent Document 2, and MobileNetV3 described in Non-Patent Document 3. Although these models can meet the real-time requirements of mobile platforms, there is still room to improve their accuracy. In particular, the accuracy loss of small models is much larger for detection tasks than for classification tasks.

Detection is a fundamental task in computer vision that has been widely studied and applied in practice. Most conventional object detection models use a network designed for image classification as their backbone, and developers have devised various feature representations for detectors.

Generally speaking, conventional detection models have the following shortcomings:

1) The detection model relies on a large amount of manual effort and prior knowledge; good detection accuracy can be obtained, but the model is not suitable for real-time tasks;
2) Hand-designed small models or pruned models can address the real-time problem, but their accuracy is often low because their backbones come from networks designed for classification tasks.

Non-Patent Document 1: Wang, R. J., Li, X., Ao, S., Ling, C. X.: Pelee: A real-time object detection system on mobile devices. arXiv preprint arXiv:1804.06882 (2018)
Non-Patent Document 2: Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In ECCV, 2018
Non-Patent Document 3: Howard, A., Sandler, M., Chu, G., Chen, L.-C., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan, V., et al. Searching for MobileNetV3. arXiv preprint arXiv:1905.02244, 2019

The inventors of the present invention have found that an end-to-end detection model is both necessary and efficient for detection tasks. On this basis, an object of the present invention is to provide an end-to-end detection model suitable for resource-limited platforms, together with a corresponding training method.

According to one aspect of the present invention, there is provided a method for training a neural network, wherein the neural network is used for detecting an object in an image and includes a backbone network, a feature network, and a prediction module, the feature network including a first module and a second module, the method comprising:
the backbone network processing a sample image and outputting N first features of different sizes;
the first module of the feature network performing N-1 deconvolution operations based on the smallest first feature output from the backbone network, and outputting N second features of different sizes;
the second module of the feature network merging the N first features output from the backbone network, and outputting N third features of different sizes;
generating N fourth features of different sizes by combining each of the N second features with the corresponding one of the N third features that has the same size, and performing a different number of convolution operations on each of the N fourth features;
the prediction module making a prediction based on the N fourth features and calculating a first loss;
the prediction module making a prediction based on the features obtained after the convolution operations and calculating a second loss; and
training the neural network by optimizing settings of the backbone network, the feature network, and the prediction module based on a combination of the first loss and the second loss.

According to another aspect of the present invention, there is provided a neural network for detecting an object in an image, the neural network including a backbone network, a feature network, and a prediction module. The feature network includes a first module and a second module. The backbone network processes a sample image and outputs N first features of different sizes. The first module of the feature network performs N-1 deconvolution operations based on the smallest first feature output from the backbone network, and outputs N second features of different sizes. The second module of the feature network merges the N first features output from the backbone network, and outputs N third features of different sizes. Each of the N second features is combined with the corresponding one of the N third features that has the same size to generate N fourth features of different sizes, and a different number of convolution operations is performed on each of the N fourth features. The prediction module makes a prediction based on the N fourth features and calculates a first loss, and makes a prediction based on the features obtained after the convolution operations and calculates a second loss. The neural network is trained based on a combination of the first loss and the second loss.

According to another aspect of the present invention, there is provided an apparatus for training a neural network, wherein the neural network is used for detecting an object in an image and includes a backbone network, a feature network, and a prediction module, the feature network including a first module and a second module. The backbone network processes a sample image and outputs N first features of different sizes. The first module of the feature network performs N-1 deconvolution operations based on the smallest first feature output from the backbone network, and outputs N second features of different sizes. The second module of the feature network merges the N first features output from the backbone network, and outputs N third features of different sizes. Each of the N second features is combined with the corresponding one of the N third features that has the same size to generate N fourth features of different sizes, and a different number of convolution operations is performed on each of the N fourth features. The prediction module makes a prediction based on the N fourth features and calculates a first loss, and makes a prediction based on the features obtained after the convolution operations and calculates a second loss. The apparatus includes one or more processors configured to train the neural network by optimizing settings of the backbone network, the feature network, and the prediction module based on a combination of the first loss and the second loss.

According to another aspect of the present invention, there is provided a storage medium storing a program which, when executed, causes a computer to execute a method for training a neural network. The neural network is used for detecting an object in an image and includes a backbone network, a feature network, and a prediction module, the feature network including a first module and a second module. The backbone network processes a sample image and outputs N first features of different sizes. The first module of the feature network performs N-1 deconvolution operations based on the smallest first feature output from the backbone network, and outputs N second features of different sizes. The second module of the feature network merges the N first features output from the backbone network, and outputs N third features of different sizes. Each of the N second features is combined with the corresponding one of the N third features that has the same size to generate N fourth features of different sizes, and a different number of convolution operations is performed on each of the N fourth features. The prediction module makes a prediction based on the N fourth features and calculates a first loss, and makes a prediction based on the features obtained after the convolution operations and calculates a second loss. The method includes training the neural network by optimizing settings of the backbone network, the feature network, and the prediction module based on a combination of the first loss and the second loss.

FIG. 1 is a diagram showing the architecture of an object detection model according to the present invention.
FIG. 2 is a diagram showing the architecture of the backbone network.
FIG. 3 is a diagram showing the convolution unit.
FIG. 4 is a diagram showing the architecture of the feature network and the prediction module.
FIG. 5 is a diagram showing the merging process performed by the second module of the feature network.
FIG. 6 is a flowchart of a method for training an object detection model according to an embodiment of the present invention.
FIG. 7 is an exemplary block diagram of computer hardware on which the present invention can be implemented.

Preferred embodiments for carrying out the present invention are described in detail below with reference to the accompanying drawings. The following embodiments are merely illustrative and do not limit the present invention.

FIG. 1 is a diagram showing the architecture of an object detection model according to the present invention. The object detection model can be implemented as a neural network. As shown in FIG. 1, the object detection model includes a backbone network 110, a feature network 120, and a prediction module 130. The backbone network 110 constitutes the base network of the detection model, the feature network 120 is used to extract feature representations, and the prediction module 130 performs object detection using the extracted feature representations.

Images input to the detection model may be resized to a uniform size, for example 320×320. The prediction module 130 may, for example, consist of 3×3 convolution operations, and finally outputs an output image annotated with bounding boxes and class labels, where a bounding box indicates the position of a detected object and a class label indicates the class to which the object belongs.

FIG. 2 is a diagram showing an example architecture of the backbone network. As shown in FIG. 2, the backbone network according to the present invention includes, for example, 17 layers, although the present invention is not limited to this number. The first layer is the backbone layer, which performs a convolution operation with a predetermined stride, for example a 3×3 convolution with a stride of 2. Each of the second to seventeenth layers consists of the convolution unit shown in FIG. 3, but the configuration of the convolution unit may differ among the second to seventeenth layers.

As shown in FIG. 3, the current layer receives the output of the previous layer and randomly shuffles its channels, and then divides the channels evenly into two parts of equal size to form two branches. The same processing is performed in the two branches, and their computations can be executed in parallel, which saves processing time compared with a single-branch convolution layer. Each branch performs the following operations in sequence: (1) a 1×1 convolution, which increases the number of channels by a predetermined factor (channel expansion); (2) a K×K depthwise convolution; and (3) a 1×1 convolution, which decreases the number of channels by the same factor (channel reduction), so that the number of channels returns to the number at the input of operation (1). Finally, the outputs of the two branches are concatenated and output to the next layer.
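The channel bookkeeping of the unit described above can be sketched in a few lines of plain Python. This is only an illustration of the shuffle, split, expand, reduce, and concatenate sequence; the function names and the seeded random shuffle are assumptions made for the example, and no actual convolution arithmetic is performed.

```python
import random


def shuffle_and_split(channels, seed=0):
    """Randomly mix the channel order, then split it into two equal halves."""
    mixed = list(channels)
    if len(mixed) % 2 != 0:
        raise ValueError("channel count must be even to form two equal branches")
    random.Random(seed).shuffle(mixed)  # seed is a hypothetical parameter
    half = len(mixed) // 2
    return mixed[:half], mixed[half:]


def branch_channel_counts(branch_channels, expansion):
    """Channel count at the branch input and after operations (1), (2), (3)."""
    expanded = branch_channels * expansion  # (1) 1x1 convolution, expansion
    # (2) the KxK convolution keeps the expanded channel count
    # (3) the second 1x1 convolution restores the original count
    return [branch_channels, expanded, expanded, branch_channels]


def unit_output_channels(in_channels, expansion):
    """Channels leaving the unit once both branch outputs are concatenated."""
    left, right = shuffle_and_split(range(in_channels))
    return (branch_channel_counts(len(left), expansion)[-1]
            + branch_channel_counts(len(right), expansion)[-1])
```

For instance, `branch_channel_counts(16, 6)` traces one branch of a 32-channel input with an expansion factor of 6, and `unit_output_channels(32, 6)` confirms that the concatenated output has the same channel count as the input.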

The convolution unit shown in FIG. 3 may be configured differently in different layers. Specifically, at least one of the following may be configured: the channel expansion factor, the kernel size of the depthwise convolution, whether to add the processed channels to the unprocessed channels, and whether to add an SE block (Squeeze-and-Excitation block). As one example, an optimal combination can be obtained based on prior knowledge.

For example, the channel expansion factor may be 1, 3, or 6, and the kernel size of the depthwise convolution may be 3×3 or 5×5. Although not shown in FIG. 3, an SE block may be added to the convolution unit of one or more layers to obtain higher accuracy. In that case, the SE block may be inserted between the depthwise convolution and the 1×1 convolution (channel reduction).

As also shown in FIG. 3, each branch generates its output by adding the channels that have passed through the convolutions to the unprocessed channels, as indicated in the figure by the arrow labeled "shortcut" and the addition sign (+). This setting can be changed, however. For example, the addition is not performed when the number of channels or the feature size changes between the current layer and the previous layer. In other words, in such a case the "shortcut" arrow and the addition sign (+) in FIG. 3 are removed, and only the convolved channels are used as the output of the corresponding branch.
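The per-layer options discussed above can be collected in a small configuration object. The sketch below is a hypothetical illustration: the class name, field names, and operation labels are invented for the example, and the allowed values are simply the example values mentioned in the text, not an exhaustive specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitConfig:
    """Hypothetical per-layer settings of the convolution unit."""
    expansion: int = 3         # channel expansion factor
    kernel_size: int = 3       # K of the KxK convolution in operation (2)
    use_se: bool = False       # whether an SE block is added
    use_shortcut: bool = True  # whether processed and unprocessed channels are added

    def __post_init__(self):
        # The allowed values are the examples given in the text.
        if self.expansion not in (1, 3, 6):
            raise ValueError("expansion must be 1, 3, or 6 in this sketch")
        if self.kernel_size not in (3, 5):
            raise ValueError("kernel_size must be 3 or 5 in this sketch")


def shortcut_allowed(in_channels, out_channels, in_size, out_size):
    """The addition is skipped when channels or feature size change."""
    return in_channels == out_channels and in_size == out_size


def branch_ops(cfg):
    """Ordered operation labels for one branch under the given settings."""
    ops = ["conv1x1_expand", f"conv{cfg.kernel_size}x{cfg.kernel_size}"]
    if cfg.use_se:
        ops.append("se_block")  # placed before the channel-reducing 1x1 convolution
    ops.append("conv1x1_reduce")
    if cfg.use_shortcut:
        ops.append("add_shortcut")
    return ops
```

A caller would typically set `use_shortcut` from `shortcut_allowed(...)` for each layer, so that the addition is dropped wherever the channel count or feature size changes.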

Table 1 below shows specific parameters of the backbone network, in which "unit" denotes the convolution unit shown in FIG. 3 and "Y" indicates that an SE block is added.

As shown in FIG. 1, the output of the backbone network 110 is the input of the feature network 120. Because a feature pyramid network (FPN) is well suited to handling objects of different sizes in an image, the present invention uses an FPN as the basic feature structure. Accordingly, the present invention generates a first feature based on the outputs of the layers (the first to seventeenth layers) of the backbone network 110, and this first feature is input to the feature network 120 as the output of the backbone network 110.

The generation of the first feature is explained below using one example. In this example, the features output by the 5th, 12th, and 17th layers of the backbone network 110 are selected and defined as {f1, f2, f3}. The features f1, f2, and f3 have different sizes, for example 40×40, 20×20, and 10×10, respectively, with f3 being the smallest. Max pooling operations with strides of 2 and 4 are then applied to the feature f3 to obtain two further, smaller features f4 and f5. The five resulting features of different sizes {f1, f2, f3, f4, f5} constitute the first feature, which is input to the feature network 120. The present invention is not limited to this example, and the first feature may be generated in other ways. For example, features output from layers other than the 5th, 12th, and 17th layers may be selected, or the number of features included in the first feature may be changed.
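The sizes in this example can be traced with a short calculation. The text does not state the pooling kernel size, padding, or rounding mode, so the sketch below assumes a kernel equal to the stride, no padding, and rounding up; under those assumptions f4 and f5 would be 5×5 and 3×3. The function names are hypothetical.

```python
import math


def pooled_size(size, stride):
    """Output width of max pooling, assuming kernel == stride, no padding,
    and rounding up (all assumptions; the source does not specify them)."""
    return math.ceil((size - stride) / stride) + 1


def first_feature_sizes(f1, f2, f3):
    """Sizes of {f1, ..., f5}: f4 and f5 are pooled from f3 with strides 2 and 4."""
    return [f1, f2, f3, pooled_size(f3, 2), pooled_size(f3, 4)]


sizes = first_feature_sizes(40, 20, 10)
```

With a different rounding mode or kernel size the two smallest sizes would change, so these values should be read as one consistent possibility rather than as the sizes used in the embodiment.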

FIG. 4 is a diagram showing an example architecture of the feature network and the prediction module. As shown in FIG. 4, the feature network includes a first module 410 and a second module 420, and receives the first feature F1 from the backbone network. The first feature F1 includes N features of different sizes (five are shown in the figure as an example).

The first module 410 performs N-1 (here, four) deconvolution operations starting from the smallest feature f5 in the first feature F1, and generates a new set of features (denoted as the second feature F2). Specifically, a first deconvolution operation is performed on the feature f5, a second deconvolution operation is then performed on the feature obtained from the first deconvolution operation, and so on, for a total of N-1 deconvolution operations. For example, the generated second feature F2 includes the feature f5 and the four features generated by the four deconvolution operations.
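The chained deconvolutions can be illustrated by tracking spatial sizes only. The sketch below uses the standard output-size relation for a transposed convolution, out = (in - 1)·stride - 2·padding + kernel + output_padding; the kernel, stride, and padding values, the 3×3 starting size, and the target sizes are all assumptions chosen so that the results line up with a 40/20/10/5/3 pyramid.

```python
def deconv_out(size, kernel=3, stride=2, padding=1, output_padding=0):
    """Output width of a transposed convolution (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def second_feature_sizes(smallest, targets):
    """Apply one deconvolution per target size, each on the previous result."""
    sizes = [smallest]  # the smallest feature itself is kept in F2
    for target in targets:
        extra = target - deconv_out(sizes[-1])
        if extra not in (0, 1):
            raise ValueError("target size not reachable with these settings")
        sizes.append(deconv_out(sizes[-1], output_padding=extra))
    return sizes


# N = 5: four deconvolutions produce five sizes in total.
f2_sizes = second_feature_sizes(3, [5, 10, 20, 40])
```

The returned list has N entries for N-1 deconvolutions, which matches the description that F2 contains the smallest feature plus one feature per deconvolution.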

The second module 420 merges the features included in the first feature F1 to generate a new set of features (denoted as the third feature F3). FIG. 5 is a diagram showing how the merging process is performed. As shown in FIG. 5, the (i+1)-th feature is merged with the i-th feature, where i = 1, 2, ..., N-1 and the (i+1)-th feature is smaller than the i-th feature. First, processing that includes bilinear interpolation and a convolution operation is applied to the (i+1)-th feature. The bilinear interpolation upsamples the (i+1)-th feature so that its size becomes the same as that of the i-th feature. The convolution operation is used to normalize the channels of the two features. The i-th feature is then merged with the processed (i+1)-th feature to obtain a new i-th feature. After that, the merging process shown in FIG. 5 is applied to the new i-th feature and the original (i-1)-th feature to obtain a new (i-1)-th feature.

In this way, N-1 new features can be generated. The second module 420 then applies a max pooling operation to the (N-1)-th new feature, which is the smallest of these new features, to obtain an N-th new feature that is smaller than the (N-1)-th new feature. At this point N new features have been generated, and they constitute the output of the second module 420, that is, the third feature F3.
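The order of operations in the merging process can be sketched symbolically. In the example below each feature is just a label, and the result is a string that records which operations were applied; `upsample`, `conv`, `merge`, and `maxpool` are placeholder names, and the merge operator itself is left abstract because the text does not say how the two features are combined.

```python
def merge_top_down(first_features):
    """Return symbolic expressions for the N features of the third feature F3.

    first_features lists labels from largest (index 0) to smallest (index N-1).
    """
    n = len(first_features)
    if n < 2:
        raise ValueError("at least two features are needed for merging")
    new = [None] * (n - 1)
    upper = first_features[-1]  # start from the smallest original feature
    for i in range(n - 2, -1, -1):
        # bilinear upsampling followed by a convolution, then the merge
        new[i] = f"merge({first_features[i]}, conv(upsample({upper})))"
        upper = new[i]  # the new feature feeds the next, larger level
    # the N-th new feature is pooled from the smallest new feature
    return new + [f"maxpool({new[-1]})"]


f3 = merge_top_down(["f1", "f2", "f3"])
```

For three inputs, `f3[1]` is `merge(f2, conv(upsample(f3)))`, `f3[0]` nests that expression inside the merge with `f1`, and `f3[2]` is the max-pooled version of `f3[1]`.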

The first module 410 and the second module 420 can execute their processing in parallel, which improves efficiency. The second feature F2 output by the first module 410 and the third feature F3 output by the second module 420 are then concatenated to obtain an enhanced multi-scale feature, that is, the fourth feature F4. More specifically, each feature included in the second feature F2 is concatenated with the feature of the same size in the third feature F3, thereby generating the fourth feature F4 containing N features.
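The size-wise pairing can be expressed compactly. The sketch below represents each feature set as a mapping from spatial size to channel count and assumes that joining two features along the channel axis adds their channel counts; the channel numbers used are hypothetical.

```python
def combine_by_size(second, third):
    """Pair features of equal size and add their channel counts.

    second, third: dicts mapping spatial size -> number of channels.
    """
    if set(second) != set(third):
        raise ValueError("F2 and F3 must contain the same set of sizes")
    return {size: second[size] + third[size]
            for size in sorted(second, reverse=True)}


# Hypothetical channel counts for a five-level pyramid.
f2 = {40: 64, 20: 64, 10: 64, 5: 64, 3: 64}
f3 = {40: 32, 20: 32, 10: 32, 5: 32, 3: 32}
f4 = combine_by_size(f2, f3)
```

The result again has N entries, one per size, which is why the two modules must produce matching sets of sizes.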

FIG. 4 shows the fourth feature F4 containing five features {p1, p2, p3, p4, p5}. The prediction module 130 makes a prediction based on the fourth feature F4 and calculates the first loss. Since this is known to those skilled in the art, the detailed processing is omitted here.

Meanwhile, a different number of convolution operations (for example, 3×3 convolutions) is performed on each of the features p1, p2, p3, p4, and p5 included in the fourth feature F4, and the prediction module 130 makes a prediction based on the features obtained after the convolution operations and calculates the second loss. More specifically, the larger a feature is, the fewer convolution operations are performed on it. In this respect, FIG. 4 shows an example in which no convolution is performed on the largest feature p1, and one to four convolutions are performed on the progressively smaller features p2 to p5, respectively. The present invention is not limited to the example shown in FIG. 4.

Performing convolution operations on features of relatively large scale requires a relatively large amount of computation and can therefore reduce efficiency. For this reason, the present invention performs relatively few convolutions on relatively large features and relatively many convolutions on relatively small features, which achieves a good balance between accuracy and efficiency.
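The effect of assigning fewer convolutions to larger features can be seen with a rough cost count. The sketch below assigns 0 to N-1 convolutions in order of decreasing size and estimates cost as the number of spatial positions times the number of convolutions; the equal channel counts implied by that estimate and the example sizes are assumptions.

```python
def conv_counts(sizes):
    """Assign 0, 1, ..., N-1 convolutions from the largest to the smallest feature."""
    order = sorted(range(len(sizes)), key=lambda i: sizes[i], reverse=True)
    counts = [0] * len(sizes)
    for rank, index in enumerate(order):
        counts[index] = rank
    return counts


def relative_cost(sizes, counts):
    """Rough cost: spatial positions times convolution count, summed over features."""
    return sum(size * size * count for size, count in zip(sizes, counts))


sizes = [40, 20, 10, 5, 3]
counts = conv_counts(sizes)                      # [0, 1, 2, 3, 4]
cost = relative_cost(sizes, counts)              # 711
reversed_cost = relative_cost(sizes, [4, 3, 2, 1, 0])  # 7825
```

Under this crude estimate, reversing the assignment so that the largest feature receives the most convolutions would cost roughly eleven times as much.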

The obtained first loss and second loss are used to train the object detection model shown in FIG. 1. Specifically, the prediction module 130 computes a regression representation and a classification representation, for example each using a 3×3 convolution. For model evaluation, a regression loss and a focal loss can be used, for example. The regression loss (RLOSS) represents the loss relating to bounding-box detection, and the focal loss (FLOSS) represents the loss relating to class labels. The focal loss is one type of classification loss; although this description focuses on the focal loss, other classification losses may be used in the present invention.
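For readers unfamiliar with the focal loss, the following is a minimal sketch of its commonly used binary form, FL(p_t) = -α_t (1 - p_t)^γ log(p_t). The text does not state which variant or which hyperparameters are used here, so the default values α = 0.25 and γ = 2 below are assumptions taken from common practice.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one prediction.

    p: predicted probability of the positive class (0 < p < 1)
    y: ground-truth label, 1 for positive and 0 for negative
    alpha, gamma: assumed hyperparameters, not taken from the source
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma down-weights examples that are already well classified
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

A confident correct prediction such as `focal_loss(0.9, 1)` contributes far less than an uncertain one such as `focal_loss(0.5, 1)`, which is the property that makes this loss useful when easy background examples dominate.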

The first loss includes a regression loss RLOSS_1 and a focal loss FLOSS_1, and the second loss includes a regression loss RLOSS_2 and a focal loss FLOSS_2. Therefore, when training the detection model m, the loss function expressed by the following equation (1) can be adopted.

By minimizing the loss function LOSS(m), the configuration parameters of the backbone network 110, the feature network 120, and the prediction module 130 of the detection model can be optimized, so that good detection accuracy is obtained.
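Equation (1) is not reproduced in this text, so its exact form is not known here. As a purely illustrative sketch, one way to combine the four terms is a weighted sum of the first and second losses; the weights `w1` and `w2` and the function name are hypothetical and are not taken from the source.

```python
def total_loss(rloss_1, floss_1, rloss_2, floss_2, w1=1.0, w2=1.0):
    """Combine the first and second losses into a single training objective.

    Assumed form: w1 * (RLOSS_1 + FLOSS_1) + w2 * (RLOSS_2 + FLOSS_2).
    With the default weights this reduces to the plain sum of the four terms.
    """
    first_loss = rloss_1 + floss_1    # predictions on F4 before the extra convolutions
    second_loss = rloss_2 + floss_2   # predictions after the extra convolutions
    return w1 * first_loss + w2 * second_loss
```

Whatever the actual form of equation (1), the point illustrated is that both prediction paths contribute to a single scalar that is minimized over the parameters of all three components.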

FIG. 6 is a flowchart of a method for training the object detection model according to the present invention. As shown in FIG. 6, in step S610, the backbone network 110 processes an input image and generates and outputs the first feature F1. The first feature F1 includes N features of different sizes, for example {f1, f2, f3, f4, f5} described above.

In step S620, the first module 410 of the feature network 120 performs N-1 deconvolution operations based on the smallest feature (for example, the feature f5) in the first feature F1 output by the backbone network 110, and generates the second feature F2. The second feature F2 includes that smallest feature and the feature obtained after each deconvolution operation.

In step S630, the second module 420 of the feature network 120 merges the N features in the first feature F1 output by the backbone network 110, and generates the third feature F3. For the merging method, refer to the description of FIG. 5 above.

In step S640, the second feature F2 and the third feature F3 are combined size by size to generate the fourth feature F4. The fourth feature F4 includes N features of different sizes, for example {p1, p2, p3, p4, p5} described above.

ステップＳ６５０において、第四特徴Ｆ４の中の各特徴に対してそれぞれ異なる回数の畳み込み操作を行う。特に、特徴のサイズの増大に伴い、それに対して行われる畳み込み操作の回数が減少する。 In step S650, a different number of convolution operations is performed on each feature in the fourth feature F4. Specifically, the larger a feature is, the fewer convolution operations are performed on it.
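
A minimal sketch of step S650 follows. The patent text here only requires that the counts differ and decrease as the size grows; the concrete rule (one extra convolution per smaller size grade), the 3x3 zero-padded convolution, and the names `conv_schedule`, `conv3x3_same` and `refine` are assumptions made for illustration. Features are assumed to be square, single-channel maps.

```python
def conv_schedule(sizes, min_convs=1):
    """Map each feature size to a convolution count: larger size, fewer convolutions."""
    ordered = sorted(set(sizes), reverse=True)
    return {size: min_convs + rank for rank, size in enumerate(ordered)}


def conv3x3_same(feature, kernel):
    """3x3 convolution with zero padding, so the output keeps the input size."""
    h, w = len(feature), len(feature[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for ky in range(3):
                for kx in range(3):
                    yy, xx = y + ky - 1, x + kx - 1
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += feature[yy][xx] * kernel[ky][kx]
            out[y][x] = acc
    return out


def refine(feature, kernel, schedule):
    """Apply the scheduled number of convolutions to one fourth feature."""
    count = schedule[len(feature)]
    for _ in range(count):
        feature = conv3x3_same(feature, kernel)
    return feature, count


sizes = [64, 32, 16, 8, 4]                  # hypothetical sizes of p1..p5
schedule = conv_schedule(sizes)
identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
p5 = [[float(x + y) for x in range(4)] for y in range(4)]
refined, used = refine(p5, identity, schedule)
```

In this example the smallest feature (size 4) receives five convolutions and the largest (size 64) receives one.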

ステップＳ６６０において、予測モジュール１３０が、畳み込み操作を経ていない第四特徴Ｆ４に基づいて予測を行い、且つ第一損失を計算する。また、予測モジュール１３０は更に、畳み込み操作後に得られた特徴に基づいて予測を行い、且つ第二損失を計算する。 In step S660, the prediction module 130 makes a prediction based on the fourth feature F4 that has not undergone the convolution operation and calculates a first loss. The prediction module 130 also makes a prediction based on the features obtained after the convolution operation and calculates a second loss.

ステップＳ６７０において、上述の式（１）に示す損失関数に基づいて対象検出モデルを訓練し、バックボーンネットワーク１１０、特徴ネットワーク１２０及び予測モジュール１３０の設定を最適化する。 In step S670, the object detection model is trained based on the loss function shown in the above equation (1), and the settings of the backbone network 110, the feature network 120, and the prediction module 130 are optimized.
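
Steps S660 and S670 can be sketched as below. Equation (1) is not reproduced in this section, so the combination is assumed to be a weighted sum of the first and second losses. Each loss contains a classification term and a regression term (see Appendix 6); cross-entropy and smooth L1 are stand-ins chosen for illustration, not terms named by the patent. All function and parameter names are hypothetical.

```python
import math


def cross_entropy(probs, label):
    """Classification loss for one object: negative log-probability of the true class."""
    return -math.log(probs[label])


def smooth_l1(pred, target):
    """Regression loss over box coordinates (smooth L1 with threshold 1)."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total


def branch_loss(pred, target):
    """Loss of one prediction branch: classification plus regression."""
    return (cross_entropy(pred["probs"], target["label"])
            + smooth_l1(pred["box"], target["box"]))


def total_loss(first_pred, second_pred, target, w_first=1.0, w_second=1.0):
    """Assumed combination: weighted sum of the first and second losses."""
    return (w_first * branch_loss(first_pred, target)
            + w_second * branch_loss(second_pred, target))


target = {"label": 1, "box": [0.0, 0.0, 1.0, 1.0]}
# Prediction from the fourth features before the extra convolutions (first loss).
first_pred = {"probs": [0.5, 0.5], "box": [0.0, 0.0, 1.0, 1.0]}
# Prediction from the features after the extra convolutions (second loss).
second_pred = {"probs": [0.25, 0.75], "box": [0.0, 0.0, 1.0, 3.0]}
loss = total_loss(first_pred, second_pred, target)
```

Because both branches contribute to the same scalar, minimizing it updates the backbone network, the feature network and the prediction module together.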

以上、具体的な実施例をもとに本発明で提案されるエンドツーエンドの対象検出モデル及びその訓練方法について説明した。従来のモデルに比較して、本発明による検出モデルは少なくとも以下のような利点を有する。 The end-to-end object detection model and its training method proposed in the present invention have been described above based on specific examples. Compared to conventional models, the detection model according to the present invention has at least the following advantages:

・この検出モデルは、小型プラットフォーム上の検出タスクのために特別に設計されるものであり； This detection model is specifically designed for detection tasks on small platforms;
・モデルのアーキテクチャの設計では、機器の並列処理能力を十分に考慮しており、データ流を同時に処理することができ； The model architecture design takes full account of the parallel processing capability of the device and can process data streams simultaneously;
・バックボーンネットワークの各層（バックボーン層以外の各層）では、畳み込みカーネルのサイズ及びチャネルの数を柔軟に設定している。各層においてすべて固定した畳み込みカーネルサイズ及び固定したチャネル数を用いるモデルに比べて、より高い精度を得ることができ； In each layer of the backbone network (each layer other than the backbone layer), the convolution kernel size and the number of channels are set flexibly. This yields higher accuracy than models that use a fixed convolution kernel size and a fixed number of channels in every layer;
・精度と効率との間の良好なバランスを実現することができる。 A good balance between accuracy and efficiency can be achieved.

上述の実施例に記載の方法は、ソフトウェア、ハードウェア、あるいは、ソフトウェア及びハードウェアの組み合わせにより実現され得る。ソフトウェアに含まれるプログラムは、予め、機器の内部又は外部に設置される記憶媒体に格納することができる。１つの例として、実行時に、これらのプログラムは、ＲＯＭに書き込まれ、且つ処理器（例えば、ＣＰＵ）により実行されることにより、本文に記載の各種の方法及び処理を実現することができる。 The methods described in the above-described embodiments can be realized by software, hardware, or a combination of software and hardware. The programs included in the software can be stored in advance on a storage medium installed inside or outside the device. As one example, at run time, these programs can be written into a ROM and executed by a processor (e.g., a CPU) to realize the various methods and processes described herein.

図７は、プログラムに基づいて、本発明による方法を実行するコンピュータハードウェア（汎用ＰＣ７００）の構成ブロック図である。該コンピュータハードウェアは、本発明における検出モデル訓練用の装置の１つの例である。また、本発明の検出モデルにおけるバックボーンネットワーク、特徴ネットワーク及び予測モジュールは、該コンピュータハードウェアを用いて実現することもできる。 Figure 7 is a block diagram of computer hardware (general-purpose PC 700) that executes the method of the present invention based on a program. The computer hardware is an example of an apparatus for training a detection model in the present invention. In addition, the backbone network, feature network, and prediction module in the detection model of the present invention can also be realized using the computer hardware.

汎用ＰＣ７００は、例えば、コンピュータシステムであっても良い。なお、汎用ＰＣ７００は、例示に過ぎず、本発明による方法及び装置の適用範囲又は機能は、これに限定されない。また、汎用ＰＣ７００は、上述の方法及び装置における任意のモジュールやアセンブリなどあるいはその組み合わせにも依存しない。 The general-purpose PC 700 may be, for example, a computer system. Note that the general-purpose PC 700 is merely an example, and the scope or functionality of the method and device according to the present invention is not limited thereto. Furthermore, the general-purpose PC 700 does not depend on any module, assembly, etc., or combination thereof, in the above-mentioned method and device.

図７では、中央処理装置（ＣＰＵ）７０１は、ＲＯＭ７０２に記憶されているプログラム又は記憶部７０８からＲＡＭ７０３にロードされているプログラムに基づいて各種の処理を行うことができる。ＲＡＭ７０３では、ニーズに応じて、ＣＰＵ７０１が各種の処理を行うときに必要なデータなどを記憶することもできる。ＣＰＵ７０１、ＲＯＭ７０２及びＲＡＭ７０３は、バス７０４を経由して互いに接続され得る。入力／出力インターフェース７０５もバス７０４に接続され得る。 In Figure 7, a central processing unit (CPU) 701 can perform various processes based on a program stored in a ROM 702 or a program loaded from a storage unit 708 into a RAM 703. The RAM 703 can also store, as needed, data that the CPU 701 requires when performing the various processes. The CPU 701, the ROM 702, and the RAM 703 can be connected to one another via a bus 704. An input/output interface 705 can also be connected to the bus 704.

また、入力／出力インターフェース７０５には、さらに、次のような部品が接続され、即ち、キーボードなどを含む入力部７０６、液晶表示器（ＬＣＤ）などのような表示器及びスピーカーなどを含む出力部７０７、ハードディスクなどを含む記憶部７０８、ネットワーク・インターフェース・カード、例えば、ＬＡＮカード、モデムなどを含む通信部７０９である。通信部７０９は、例えば、インターネット、ＬＡＮなどのネットワークを経由して通信処理を行う。ドライブ７１０は、ニーズに応じて、入力／出力インターフェース７０５に接続されても良い。取り外し可能な媒体７１１、例えば、半導体メモリなどは、必要に応じて、ドライブ７１０にセットされることにより、その中から読み取られたコンピュータプログラムを記憶部７０８にインストールすることができる。 The input/output interface 705 is further connected to the following components: an input section 706 including a keyboard, an output section 707 including a display such as a liquid crystal display (LCD) and a speaker, a storage section 708 including a hard disk, and a communication section 709 including a network interface card such as a LAN card and a modem. The communication section 709 performs communication processing via a network such as the Internet or a LAN. A drive 710 may be connected to the input/output interface 705 according to needs. A removable medium 711, such as a semiconductor memory, can be set in the drive 710 as needed, and a computer program read from the medium can be installed in the storage section 708.

また、本発明は、さらに、マシン可読指令コードを含むプログラムプロダクトを提供する。このような指令コードは、マシンにより読み取られて実行されるときに、上述の本発明の実施例における方法を実行することができる。それ相応に、このようなプログラムプロダクトをキャリー（Ｃａｒｒｙ）する、例えば、磁気ディスク（フロッピーディスク（登録商標）を含む）、光ディスク（ＣＤ－ＲＯＭ及びＤＶＤを含む）、光磁気ディスク（ＭＤ（登録商標）を含む）、及び半導体記憶器などの各種記憶媒体も本発明に含まれる。 The present invention further provides a program product including machine-readable instruction code. Such instruction code, when read and executed by a machine, can execute the method in the above-described embodiment of the present invention. Accordingly, various storage media that carry such a program product, such as magnetic disks (including floppy disks (registered trademark)), optical disks (including CD-ROMs and DVDs), magneto-optical disks (including MDs (registered trademark)), and semiconductor memory devices, are also included in the present invention.

上述の記憶媒体は、例えば、磁気ディスク、光ディスク、光磁気ディスク、半導体記憶器などを含んでも良いが、これらに限定されない。 The above-mentioned storage media may include, for example, but are not limited to, magnetic disks, optical disks, magneto-optical disks, semiconductor memory devices, etc.

また、上述の方法における各操作（処理）は、各種のマシン可読記憶媒体に記憶されているコンピュータ実行可能なプログラムの方式で実現することもできる。 Furthermore, each operation (process) in the above-described method can be realized in the form of a computer-executable program stored on various machine-readable storage media.

また、以上の実施例などに関しては、さらに以下のように付記として開示する。 Furthermore, with respect to the above embodiments, the following appendices are further disclosed.

（付記１）
ニューラルネットワークを訓練する方法であって、
前記ニューラルネットワークは画像中の対象の検出のために用いられ、且つバックボーンネットワーク、特徴ネットワーク及び予測モジュールを含み、前記特徴ネットワークは第一モジュール及び第二モジュールを含み、前記方法は、
前記バックボーンネットワークが、サンプル画像に対して処理を行い、且つ異なるサイズのＮ個の第一特徴を出力し；
前記特徴ネットワークの第一モジュールが、前記バックボーンネットワークから出力される、サイズが最も小さい第一特徴に基づいてＮ－１回の逆畳み込み操作を実行し、且つ異なるサイズのＮ個の第二特徴を出力し；
前記特徴ネットワークの第二モジュールが、前記バックボーンネットワークから出力されるＮ個の第一特徴に対して合併を行い、且つ異なるサイズのＮ個の第三特徴を出力し；
前記Ｎ個の第二特徴のうちの各々と、前記Ｎ個の第三特徴のうちの同じサイズを有する対応する１つとを組み合わせることで、異なるサイズのＮ個の第四特徴を生成し、且つ前記Ｎ個の第四特徴に対してそれぞれ異なる回数の畳み込み操作を実行し；
前記予測モジュールが前記Ｎ個の第四特徴に基づいて予測を行い、且つ第一損失を計算し；
前記予測モジュールが畳み込み操作後に得られた特徴に基づいて予測を行い、且つ第二損失を計算し；及び
前記第一損失と前記第二損失との組み合わせに基づいて、前記バックボーンネットワーク、前記特徴ネットワーク及び前記予測モジュールの設定を最適化することにより、前記ニューラルネットワークを訓練することを含む、方法。
(Appendix 1)
A method for training a neural network,
wherein the neural network is used for detecting an object in an image and includes a backbone network, a feature network, and a prediction module, the feature network including a first module and a second module, the method comprising:
the backbone network processing a sample image and outputting N first features of different sizes;
the first module of the feature network performing N-1 deconvolution operations based on the first feature with the smallest size output from the backbone network, and outputting N second features of different sizes;
the second module of the feature network merging the N first features output from the backbone network, and outputting N third features of different sizes;
generating N fourth features of different sizes by combining each of the N second features with the corresponding one of the N third features having the same size, and performing a different number of convolution operations on each of the N fourth features;
the prediction module making a prediction based on the N fourth features and calculating a first loss;
the prediction module making a prediction based on the features obtained after the convolution operations and calculating a second loss; and
training the neural network by optimizing settings of the backbone network, the feature network, and the prediction module based on a combination of the first loss and the second loss.

（付記２）
付記１に記載の方法であって、
前記特徴ネットワークの第一モジュールにより出力されるＮ個の第二特徴は、サイズが最も小さい前記第一特徴、及び毎回逆畳み込み操作を実行した後に得られた特徴を含む、方法。
(Appendix 2)
The method according to Appendix 1,
wherein the N second features output by the first module of the feature network include the first feature with the smallest size and the features obtained after each deconvolution operation.

（付記３）
付記１に記載の方法であって、更に、
前記特徴ネットワークの第二モジュールが、ｉ＋１番目の第一特徴に対して、双線形補間及び畳み込み操作を含む処理を行い、且つｉ番目の第一特徴と処理済みのｉ＋１番目の第一特徴との合併を行うことを含み、
そのうち、ｉ＝１、２、…、Ｎ－１である、方法。
(Appendix 3)
The method according to Appendix 1, further comprising:
the second module of the feature network performing processing that includes bilinear interpolation and a convolution operation on the (i+1)-th first feature, and merging the i-th first feature with the processed (i+1)-th first feature,
where i = 1, 2, ..., N-1.

（付記４）
付記１に記載の方法であって、更に、
前記特徴ネットワークの第二モジュールが、ｉ＋１番目の第一特徴に対して、双線形補間及び畳み込み操作を含む処理を行い、且つｉ番目の第一特徴と処理済みのｉ＋１番目の第一特徴との合併を行い、ｉ番目の第三特徴を取得し、そのうち、ｉ＝１、２、…、Ｎ－１であり；及び
前記第二モジュールが、Ｎ－１番目の第三特徴に対して最大プーリング操作を行い、Ｎ番目の第三特徴を取得することを含む、方法。
(Appendix 4)
The method according to Appendix 1, further comprising:
the second module of the feature network performing processing that includes bilinear interpolation and a convolution operation on the (i+1)-th first feature, and merging the i-th first feature with the processed (i+1)-th first feature to obtain an i-th third feature, where i = 1, 2, ..., N-1; and
the second module performing a max pooling operation on the (N-1)-th third feature to obtain an N-th third feature.

（付記５）
付記１に記載の方法であって、
前記第四特徴のサイズの増大に伴い、それに対して行われる畳み込み操作の回数が減少する、方法。
(Appendix 5)
The method according to Appendix 1,
wherein the number of convolution operations performed on a fourth feature decreases as the size of that fourth feature increases.

（付記６）
付記１に記載の方法であって、
前記第一損失及び前記第二損失のうちの各々が回帰損失及び分類損失を含む、方法。
(Appendix 6)
The method according to Appendix 1,
wherein each of the first loss and the second loss includes a regression loss and a classification loss.

（付記７）
付記１に記載の方法であって、
前記バックボーンネットワークが複数の層を含み、且つ各層のチャネルが、数が等しい２つの部分に分けられ、
前記方法は、更に
前記２つの部分のうちの各部分のチャネルに対して、畳み込み、チャネル拡張及びチャネル減縮を含む処理を行い、且つ前記２つの部分の処理済みのチャネルを組み合わせて次の１つの層に入力することを含む、方法。
(Appendix 7)
The method according to Appendix 1,
wherein the backbone network includes a plurality of layers, and the channels of each layer are divided into two parts with equal numbers of channels,
the method further comprising:
performing processing that includes convolution, channel expansion, and channel reduction on the channels of each of the two parts, and combining the processed channels of the two parts for input to the next layer.

（付記８）
付記７に記載の方法であって、更に、
各層の各部分のチャネルに対して、処理済みのチャネルと未処理のチャネルとの和を求め、該部分の出力を取得し；及び
前記２つの部分の出力を組み合わせて前記次の１つの層に入力することを含む、方法。
(Appendix 8)
The method according to Appendix 7, further comprising:
for the channels of each part of each layer, summing the processed channels and the unprocessed channels to obtain an output of that part; and
combining the outputs of the two parts for input to the next layer.

（付記９）
付記８に記載の方法であって、
前記２つの部分のうちの各部分のチャネルに対して行われる処理は更にＳＥブロックの追加を含む、方法。
(Appendix 9)
The method according to Appendix 8,
wherein the processing performed on the channels of each of the two parts further includes adding an SE block.

（付記１０）
付記９に記載の方法であって、
前記バックボーンネットワークの各層について、チャネル拡張倍率、畳み込みカーネルサイズ、処理済みのチャネルと未処理のチャネルとの和を求めるか、及びＳＥブロックを追加するかのうちの少なくとも１つを設定する、方法。
(Appendix 10)
The method according to Appendix 9,
wherein, for each layer of the backbone network, at least one of the following is set: a channel expansion factor, a convolution kernel size, whether to sum the processed and unprocessed channels, and whether to add an SE block.
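
The per-layer settings of Appendices 7 to 10 can be pictured as a small configuration record plus channel bookkeeping for one backbone layer. The sketch below only tracks channel counts; it assumes the order expand, convolve, SE block, reduce, which this section does not specify, and the names `LayerConfig` and `plan_layer` are hypothetical.

```python
from collections import namedtuple

# Per-layer settings listed in Appendix 10; the default values are illustrative only.
LayerConfig = namedtuple(
    "LayerConfig",
    ["expansion", "kernel_size", "use_residual", "use_se"],
    defaults=[2, 3, True, False],
)


def plan_layer(in_channels, cfg):
    """List the channel count after each step for one half, plus the layer output."""
    if in_channels % 2:
        raise ValueError("channels must split into two equal parts")
    half = in_channels // 2
    expanded = half * cfg.expansion
    steps = [
        ("split", half),                                            # two equal parts
        ("expand", expanded),                                       # channel expansion
        ("conv%dx%d" % (cfg.kernel_size, cfg.kernel_size), expanded),
    ]
    if cfg.use_se:
        steps.append(("se", expanded))                              # optional SE block
    steps.append(("reduce", half))                                  # channel reduction
    if cfg.use_residual:
        steps.append(("add_input", half))                           # processed + unprocessed
    return steps, 2 * half                                          # halves are recombined


# Two layers with different settings, as the flexible per-layer setup allows.
configs = [LayerConfig(), LayerConfig(expansion=4, kernel_size=5, use_se=True)]
plans = [plan_layer(32, cfg) for cfg in configs]
```

Each layer can carry its own record, so the kernel size and expansion factor need not be the same across the backbone.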

（付記１１）
付記７に記載の方法であって、更に、
前記バックボーンネットワークの複数の層により異なるサイズの複数の特徴を生成し；
前記複数の特徴の中のサブセットを選択し、且つ前記サブセットの中のサイズが最も小さい特定特徴を確定し；
前記特定特徴に対して処理を行い、サイズが前記特定特徴よりも小さい１つ又は複数の特徴を生成し；及び
前記サブセットの中の特徴、及びサイズが前記特定特徴よりも小さい前記１つ又は複数の特徴により、前記バックボーンネットワークから出力されるＮ個の第一特徴を構成することを含む、方法。
(Appendix 11)
The method according to Appendix 7, further comprising:
generating a plurality of features of different sizes by the plurality of layers of the backbone network;
selecting a subset of the plurality of features, and determining the specific feature with the smallest size in the subset;
processing the specific feature to generate one or more features smaller in size than the specific feature; and
forming the N first features output from the backbone network from the features in the subset and the one or more features smaller in size than the specific feature.
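
The construction of the N first features in Appendix 11 can be sketched at the level of feature sizes. The sketch assumes that the extra, smaller features are produced by stride-2 downsampling with rounding up; the actual processing is not specified in this section. The layer sizes, the selected indices and the name `build_first_features` are hypothetical.

```python
def build_first_features(layer_sizes, keep, extra):
    """Return the sizes of the N first features, ordered from largest to smallest.

    layer_sizes: sizes produced by the backbone layers
    keep:        indices of the selected subset
    extra:       how many smaller features to derive from the smallest selected one
    """
    subset = sorted((layer_sizes[i] for i in keep), reverse=True)
    smallest = subset[-1]                  # the specific feature of the subset
    generated = []
    for _ in range(extra):
        smallest = (smallest + 1) // 2     # assumed stride-2 downsampling, rounded up
        generated.append(smallest)
    return subset + generated


layer_sizes = [160, 80, 40, 20, 10]        # hypothetical backbone outputs
first_sizes = build_first_features(layer_sizes, keep=[2, 3, 4], extra=2)
# first_sizes == [40, 20, 10, 5, 3], i.e. N = 5
```

Here three backbone features are kept and two further features are derived from the smallest of them, giving five first features in total.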

（付記１２）
画像中の対象の検出のために用いられるニューラルネットワークであって、
前記ニューラルネットワークは、
バックボーンネットワークであって、前記バックボーンネットワークがサンプル画像に対して処理を行い、且つ異なるサイズのＮ個の第一特徴を出力する、もの；
特徴ネットワークであって、前記特徴ネットワークが第一モジュール及び第二モジュールを含み、前記第一モジュールが、前記バックボーンネットワークから出力される、サイズが最も小さい第一特徴に基づいてＮ－１回の逆畳み込み操作を実行し、且つ異なるサイズのＮ個の第二特徴を出力し、前記第二モジュールが、前記バックボーンネットワークから出力されるＮ個の第一特徴に対して合併を行い、且つ異なるサイズのＮ個の第三特徴を出力し、そのうち、前記Ｎ個の第二特徴のうちの各々が前記Ｎ個の第三特徴のうちの同じサイズを有する対応する１つと組み合わせられることで、異なるサイズのＮ個の第四特徴を生成し、前記Ｎ個の第四特徴に対してそれぞれ異なる回数の畳み込み操作が行われる、もの；及び
予測モジュールであって、前記予測モジュールが前記Ｎ個の第四特徴に基づいて予測を行い、且つ第一損失を計算し、及び、畳み込み操作後に得られた特徴に基づいて予測を行い、且つ第二損失を計算する、ものを含み、
そのうち、前記ニューラルネットワークは前記第一損失と前記第二損失との組み合わせに基づいて訓練される、ニューラルネットワーク。
(Appendix 12)
A neural network for detecting an object in an image, the neural network comprising:
a backbone network that processes a sample image and outputs N first features of different sizes;
a feature network including a first module and a second module, wherein the first module performs N-1 deconvolution operations based on the first feature with the smallest size output from the backbone network and outputs N second features of different sizes, the second module merges the N first features output from the backbone network and outputs N third features of different sizes, each of the N second features is combined with the corresponding one of the N third features having the same size to generate N fourth features of different sizes, and a different number of convolution operations is performed on each of the N fourth features; and
a prediction module that makes a prediction based on the N fourth features and calculates a first loss, and makes a prediction based on the features obtained after the convolution operations and calculates a second loss,
wherein the neural network is trained based on a combination of the first loss and the second loss.

（付記１３）
ニューラルネットワークを訓練する装置であって、
前記ニューラルネットワークは画像中の対象の検出のために用いられ、且つバックボーンネットワーク、特徴ネットワーク及び予測モジュールを含み、前記特徴ネットワークは第一モジュール及び第二モジュールを含み、
前記バックボーンネットワークは、サンプル画像に対して処理を行い、且つ異なるサイズのＮ個の第一特徴を出力し、
前記特徴ネットワークの第一モジュールは、前記バックボーンネットワークから出力される、サイズが最も小さい第一特徴に基づいてＮ－１回の逆畳み込み操作を実行し、且つ異なるサイズのＮ個の第二特徴を出力し、
前記特徴ネットワークの第二モジュールは、前記バックボーンネットワークから出力されるＮ個の第一特徴に対して合併を行い、且つ異なるサイズのＮ個の第三特徴を出力し、
前記Ｎ個の第二特徴のうちの各々と、前記Ｎ個の第三特徴のうちの同じサイズを有する対応する１つとの組み合わせを行い、異なるサイズのＮ個の第四特徴を生成し、前記Ｎ個の第四特徴に対してそれぞれ異なる回数の畳み込み操作を行い、
前記予測モジュールは前記Ｎ個の第四特徴に基づいて予測を行い、且つ第一損失を計算し、及び、畳み込み操作後に得られた特徴に基づいて予測を行い、且つ第二損失を計算し、
前記装置は、１つ又は複数の処理器を含み、前記処理器は、前記第一損失と前記第二損失との組み合わせに基づいて、前記バックボーンネットワーク、前記特徴ネットワーク及び前記予測モジュールの設定を最適化することにより、前記ニューラルネットワークを訓練するように構成される、装置。
(Appendix 13)
An apparatus for training a neural network,
wherein the neural network is used for detecting an object in an image and includes a backbone network, a feature network, and a prediction module, the feature network including a first module and a second module,
the backbone network processes a sample image and outputs N first features of different sizes,
the first module of the feature network performs N-1 deconvolution operations based on the first feature with the smallest size output from the backbone network, and outputs N second features of different sizes,
the second module of the feature network merges the N first features output from the backbone network, and outputs N third features of different sizes,
each of the N second features is combined with the corresponding one of the N third features having the same size to generate N fourth features of different sizes, and a different number of convolution operations is performed on each of the N fourth features,
the prediction module makes a prediction based on the N fourth features and calculates a first loss, and makes a prediction based on the features obtained after the convolution operations and calculates a second loss, and
the apparatus includes one or more processors configured to train the neural network by optimizing settings of the backbone network, the feature network, and the prediction module based on a combination of the first loss and the second loss.

（付記１４）
プログラムを記憶した記憶媒体であって、
前記プログラムは、実行されるときに、コンピュータに、ニューラルネットワークを訓練するための方法を実行させ、
そのうち、前記ニューラルネットワークは、画像中の対象の検出のために用いられ、且つバックボーンネットワーク、特徴ネットワーク及び予測モジュールを含み、
前記特徴ネットワークは第一モジュール及び第二モジュールを含み、
前記バックボーンネットワークはサンプル画像に対して処理を行い、且つ異なるサイズのＮ個の第一特徴を出力し、
前記特徴ネットワークの第一モジュールは前記バックボーンネットワークから出力される、サイズが最も小さい第一特徴に基づいてＮ－１回の逆畳み込み操作を実行し、且つ異なるサイズのＮ個の第二特徴を出力し、
前記特徴ネットワークの第二モジュールは前記バックボーンネットワークから出力されるＮ個の第一特徴に対して合併を行い、且つ異なるサイズのＮ個の第三特徴を出力し、
前記Ｎ個の第二特徴のうちの各々が、前記Ｎ個の第三特徴のうちの同じサイズを有する対応する１つと組み合わせられ、異なるサイズのＮ個の第四特徴を生成し、前記Ｎ個の第四特徴に対してそれぞれ異なる回数の畳み込み操作が行われ、
前記予測モジュールは、前記Ｎ個の第四特徴に基づいて予測を行い、且つ第一損失を計算し、及び、畳み込み操作後に得られた特徴に基づいて予測を行い、且つ第二損失を計算し、
そのうち、前記方法は、前記第一損失と前記第二損失との組み合わせに基づいて、前記バックボーンネットワーク、前記特徴ネットワーク及び前記予測モジュールの設定を最適化することにより、前記ニューラルネットワークを訓練することを含む、記憶媒体。
(Appendix 14)
A storage medium storing a program,
wherein the program, when executed, causes a computer to perform a method for training a neural network,
the neural network is used for detecting an object in an image and includes a backbone network, a feature network, and a prediction module,
the feature network includes a first module and a second module,
the backbone network processes a sample image and outputs N first features of different sizes,
the first module of the feature network performs N-1 deconvolution operations based on the first feature with the smallest size output from the backbone network, and outputs N second features of different sizes,
the second module of the feature network merges the N first features output from the backbone network, and outputs N third features of different sizes,
each of the N second features is combined with the corresponding one of the N third features having the same size to generate N fourth features of different sizes, and a different number of convolution operations is performed on each of the N fourth features,
the prediction module makes a prediction based on the N fourth features and calculates a first loss, and makes a prediction based on the features obtained after the convolution operations and calculates a second loss, and
the method includes training the neural network by optimizing settings of the backbone network, the feature network, and the prediction module based on a combination of the first loss and the second loss.

以上、本発明の好ましい実施形態を説明したが、本発明はこの実施形態に限定されず、本発明の趣旨を離脱しない限り、本発明に対するあらゆる変更は、本発明の技術的範囲に属する。 Although a preferred embodiment of the present invention has been described above, the present invention is not limited to this embodiment, and any modification to the present invention falls within the technical scope of the present invention as long as it does not deviate from the spirit of the present invention.

Claims

1. A method for training a neural network,
wherein the neural network is used for detecting an object in an image and includes a backbone network, a feature network, and a prediction module,
the feature network includes a first module and a second module, and
the method comprises:
the backbone network processing a sample image and outputting N first features of different sizes;
the first module of the feature network performing N-1 deconvolution operations based on the first feature with the smallest size output from the backbone network, and outputting N second features of different sizes;
the second module of the feature network merging the N first features output from the backbone network, and outputting N third features of different sizes;
generating N fourth features of different sizes by combining each of the N second features with the corresponding one of the N third features having the same size, and performing a different number of convolution operations on each of the N fourth features;
the prediction module making a prediction based on the N fourth features and calculating a first loss;
the prediction module making a prediction based on the features obtained after the convolution operations and calculating a second loss; and
training the neural network by optimizing settings of the backbone network, the feature network, and the prediction module based on a combination of the first loss and the second loss.

2. The method of claim 1,
wherein the N second features output by the first module of the feature network include the first feature with the smallest size and the features obtained after each deconvolution operation.

3. The method of claim 1, further comprising:
the second module of the feature network performing processing that includes bilinear interpolation and a convolution operation on the (i+1)-th first feature, and merging the i-th first feature with the processed (i+1)-th first feature,
where i = 1, 2, ..., N-1.

4. The method of claim 1,
wherein the number of convolution operations performed on a fourth feature decreases as the size of that fourth feature increases.

5. The method of claim 1,
wherein each of the first loss and the second loss comprises a regression loss and a classification loss.

6. The method of claim 1,
wherein the backbone network includes a plurality of layers, and the channels of each layer are divided into two parts with equal numbers of channels, and
the method further comprises performing processing that includes a convolution operation, channel expansion, and channel reduction on the channels of each of the two parts, and combining the processed channels of the two parts for input to the next layer.

7. The method of claim 6, further comprising:
for the channels of each part of each layer, summing the processed channels and the unprocessed channels to obtain an output of that part; and
combining the outputs of the two parts for input to the next layer.

8. The method of claim 7,
wherein the processing performed on the channels of each of the two parts further comprises adding a Squeeze-and-Excitation block (SE block).

9. The method of claim 8,
wherein, for each layer of the backbone network, at least one of the following is set: a channel expansion factor, a convolution kernel size, whether to sum the processed and unprocessed channels, and whether to add an SE block.

10. The method of claim 6, further comprising:
generating a plurality of features of different sizes by the plurality of layers of the backbone network;
selecting a subset of the plurality of features, and determining the specific feature with the smallest size in the subset;
processing the specific feature to generate one or more features smaller in size than the specific feature; and
forming the N first features output from the backbone network from the features in the subset and the one or more features smaller in size than the specific feature.