JP7579972B2 - Scaling Neural Architectures for Hardware Accelerators

Info

Publication number: JP7579972B2
Application number: JP2023524743A
Authority: JP
Inventors: リー，アンドリュー; リー，ション; タン，ミンシン; パン，ルオミン; チェン，リチュン; リー，コック・ブイ; ジョピー，ノーマン・ポール
Original assignee: Google LLC
Current assignee: Google LLC
Priority date: 2021-01-15
Filing date: 2021-07-29
Publication date: 2024-11-08
Anticipated expiration: 2041-07-29
Also published as: WO2022154829A1; JP2023552048A; TW202230221A; CN116261734A; EP4217928A1

Description

Cross-Reference to Related Applications
This application is a continuation of U.S. Patent Application No. 17/175,029, filed February 12, 2021, which claims the benefit under 35 U.S.C. § 119(e) of U.S. Patent Application No. 63/137,926, filed January 15, 2021, the disclosures of which are incorporated herein by reference.

Background
A neural network is a machine learning model that includes one or more layers of nonlinear operations to predict an output for a received input. In addition to an input layer and an output layer, some neural networks include one or more hidden layers. The output of each hidden layer can be input to another hidden layer or to the output layer of the neural network. Each layer of a neural network can generate a respective output from a received input according to the values of one or more model parameters of that layer. The model parameters can be weights or biases that are determined by a training algorithm so that the neural network generates accurate output.

Summary
Systems implemented according to aspects of the present invention can reduce the latency of a neural network architecture by searching candidate neural network architectures according to each candidate's computational requirements (e.g., FLOPS), operational intensity, and execution efficiency. As described herein, computational requirements, operational intensity, and execution efficiency together have been found to be the root cause of a neural network's latency on a target computing resource, including latency at inference, as opposed to computational requirements alone. Based on this observed relationship between latency and computation, operational intensity, and execution efficiency, aspects of the present disclosure enable techniques for performing neural architecture search and scaling through latency-aware compound scaling and by expanding the space in which candidate neural networks are searched.

Furthermore, the system can perform compound scaling to uniformly scale multiple parameters of a neural network according to multiple objectives. This can improve the performance of the scaled neural network over approaches in which a single objective is considered or in which the scaling parameters of the neural network are searched separately. Using latency-aware compound scaling, the system can quickly build a family of neural network architectures that are scaled from an initial scaled neural network architecture according to different values and that may suit different use cases.

According to an aspect of the present disclosure, a computer-implemented method includes a method for determining an architecture of a neural network. The method includes: receiving, by one or more processors, training data corresponding to a neural network task and information specifying a target computing resource; performing, by the one or more processors and according to a plurality of first objectives, a neural architecture search of a search space using the training data to identify an architecture of a base neural network; and identifying, by the one or more processors, a plurality of scaling parameter values for scaling the base neural network according to the information specifying the target computing resource and a plurality of scaling parameters of the base neural network. The identifying includes repeatedly selecting a plurality of candidate scaling parameter values and determining a performance metric of the base neural network scaled according to the plurality of candidate scaling parameter values, the performance metric being determined according to a plurality of second objectives that include a latency objective. The method may further include generating, by the one or more processors, an architecture of a scaled neural network using the architecture of the base neural network scaled according to the plurality of scaling parameter values.

These and other implementations may each optionally include one or more of the following features, alone or in combination.

The plurality of first objectives for performing the neural architecture search may be the same as the plurality of second objectives for identifying the plurality of scaling parameter values.

The plurality of first objectives and the plurality of second objectives may include an accuracy objective corresponding to the accuracy of the output of the base neural network.

The performance metric may correspond at least in part to a measure of the latency between the base neural network receiving an input and generating an output when the base neural network is scaled according to the plurality of candidate scaling parameter values and deployed on the target computing resource.

The latency objective may correspond to a minimum latency between the base neural network receiving an input and generating an output when the base neural network is deployed on the target computing resource.

The search space may include candidate neural network layers, each candidate neural network layer being configured to perform one or more operations. The search space may include candidate neural network layers that include different activation functions.

The architecture of the base neural network may include a plurality of candidate components, each component having a plurality of neural network layers. The search space may include a plurality of candidate components of candidate neural network layers, including a first component of candidate network layers that includes a first activation function and a second component of candidate network layers that includes a second activation function different from the first activation function.

The information specifying the target computing resource may specify one or more hardware accelerators, and the method further includes executing the scaled neural network on the one or more hardware accelerators to perform the neural network task.

The target computing resource may include a first target computing resource, and the plurality of scaling parameter values may be a plurality of first scaling parameter values. The method may further include receiving, by the one or more processors, information specifying a second target computing resource different from the first target computing resource, and identifying a plurality of second scaling parameter values for scaling the base neural network according to the information specifying the second target computing resource, the plurality of second scaling parameter values being different from the plurality of first scaling parameter values.

The plurality of scaling parameter values may be a plurality of first scaling parameter values, and the method further includes generating a scaled neural network architecture from the base neural network architecture scaled using a plurality of second scaling parameter values. The second scaling parameter values are generated according to the plurality of first scaling parameter values and one or more compound coefficients that uniformly modify the value of each of the first scaling parameter values.

The base neural network may be a convolutional neural network, and the plurality of scaling parameters may include one or more of a depth of the base neural network, a width of the base neural network, and a resolution of the input to the base neural network.

According to another aspect, a method for determining an architecture of a neural network includes: receiving, by one or more processors, information specifying a target computing resource; receiving, by the one or more processors, data specifying an architecture of a base neural network; and identifying, by the one or more processors, a plurality of scaling parameter values for scaling the base neural network according to the information specifying the target computing resource and a plurality of scaling parameters of the base neural network. The identifying includes repeatedly selecting a plurality of candidate scaling parameter values and determining a performance metric of the base neural network scaled according to the plurality of candidate scaling parameter values, the performance metric being determined according to a plurality of objectives that include a latency objective. The method further includes generating, by the one or more processors, an architecture of a scaled neural network using the architecture of the base neural network scaled according to the plurality of scaling parameter values.

The plurality of objectives may be a plurality of second objectives, and receiving the data specifying the architecture of the base neural network may include: receiving, by the one or more processors, training data corresponding to a neural network task; and performing, by the one or more processors and according to a plurality of first objectives, a neural architecture search of a search space using the training data to identify the architecture of the base neural network.

Other implementations include computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the operations of the methods.

FIG. 1 is a block diagram of a family of scaled neural network architectures for deployment in a data center housing hardware accelerators on which the deployed neural networks run.
FIG. 2 is a flow diagram of an example process for generating a scaled neural network architecture for execution on a target computing resource.
FIG. 3 is an example process of latency-aware compound scaling of a base neural network architecture.
FIG. 4 is a block diagram of a NAS-LACS (Neural Architecture Search, Latency-Aware Compound Scaling) system according to aspects of the present disclosure.
FIG. 5 is a block diagram of an example environment for implementing the NAS-LACS system.

Detailed Description
Overview
The technology described herein generally relates to scaling neural networks for execution on different target computing resources, such as different hardware accelerators. A neural network can be scaled according to multiple different performance objectives. These objectives can include the separate objectives of minimizing processing time (referred to herein as latency) and maximizing the accuracy of the neural network when it is scaled for execution on the target computing resource.

In general, a NAS (neural architecture search) system can be deployed to select a neural network architecture from a given search space of candidate architectures according to one or more objectives. One common objective is the accuracy of the neural network. A system implementing NAS techniques generally favors networks that reach higher accuracy after training over networks with lower accuracy. After a base neural network has been selected by NAS, the base neural network can be scaled according to one or more scaling parameters. Scaling can include searching for one or more scaling parameter values with which to scale the base neural network, for example by searching a coefficient search space made up of numbers. Scaling can include increasing or decreasing the number of layers in the neural network, or the size of each layer, to make effective use of the compute and/or memory resources available for deploying the neural network.

A commonly held belief about neural architecture search and scaling is that the computational requirements of a network for processing an input, measured for example in FLOPS (floating point operations per second), are proportional to the latency between sending an input to the network and receiving its output. That is, a neural network with low computational requirements (low FLOPS) is believed to generate output faster than a network with high computational requirements (high FLOPS), because fewer operations are performed overall. Many NAS systems therefore select neural networks with low computational requirements. However, it has become clear that the relationship between computational requirements and latency is not proportional, because other characteristics of a neural network, such as operational intensity, parallelism, and execution efficiency, can affect its overall latency.
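The gap between computational requirements and latency can be illustrated with a roofline-style estimate. This is a minimal sketch under stated assumptions, not a model taken from this disclosure: the function `estimate_latency_s`, the accelerator figures, and the two example networks are all hypothetical, and the estimate ignores parallelism and execution efficiency.

```python
def estimate_latency_s(flops, bytes_moved, peak_flops_per_s, mem_bandwidth_bytes_per_s):
    """Roofline-style lower bound: the slower of compute time and memory time."""
    compute_time = flops / peak_flops_per_s
    memory_time = bytes_moved / mem_bandwidth_bytes_per_s
    return max(compute_time, memory_time)


# Hypothetical accelerator: 100 TFLOP/s peak compute, 1 TB/s memory bandwidth.
PEAK = 100e12
BANDWIDTH = 1e12

# Network A (hypothetical): fewer operations, low operational intensity
# (2e9 / 2e8 = 10 operations per byte moved).
net_a = {"flops": 2e9, "bytes": 2e8}
# Network B (hypothetical): twice the operations, high operational intensity
# (4e9 / 1e7 = 400 operations per byte moved).
net_b = {"flops": 4e9, "bytes": 1e7}

latency_a = estimate_latency_s(net_a["flops"], net_a["bytes"], PEAK, BANDWIDTH)
latency_b = estimate_latency_s(net_b["flops"], net_b["bytes"], PEAK, BANDWIDTH)

# Network A is memory-bound and Network B is compute-bound, so B finishes
# sooner under this estimate despite needing more operations.
print(f"A: {latency_a * 1e6:.0f} us, B: {latency_b * 1e6:.0f} us")
```

Under these assumed numbers the network with more operations has the lower estimated latency, which is the kind of non-proportional behavior the paragraph above describes.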

The technology described herein enables LACS (latency-aware compound scaling) and an expansion of the search space of candidate neural networks from which a neural network is selected. In expanding the search space, operations and architectures can be added that are "hardware accelerator friendly," in that adding them yields higher operational intensity, execution efficiency, and parallelism suited to deployment on various types of hardware accelerators. Such operations can include space-to-depth operations, space-to-batch operations, fused convolution structures, and activation functions that are searched per component.
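As an illustration of one operation named above, the following is a minimal pure-Python sketch of a space-to-depth rearrangement on a nested-list image of shape height x width x channels. The function name `space_to_depth`, the channel ordering within each block, and the example input are assumptions made for illustration; frameworks that provide this operation may order the output channels differently.

```python
def space_to_depth(image, block):
    """Move each block x block spatial patch into the channel dimension.

    An H x W x C input becomes (H / block) x (W / block) x (C * block * block).
    """
    height, width = len(image), len(image[0])
    if height % block or width % block:
        raise ValueError("height and width must be divisible by block")
    out = []
    for i in range(0, height, block):
        row = []
        for j in range(0, width, block):
            cell = []
            # Concatenate the channels of every pixel in the patch, row-major.
            for di in range(block):
                for dj in range(block):
                    cell.extend(image[i + di][j + dj])
            row.append(cell)
        out.append(row)
    return out


# 4 x 4 single-channel input holding the values 0..15 in row-major order.
image = [[[r * 4 + c] for c in range(4)] for r in range(4)]
result = space_to_depth(image, 2)  # 2 x 2 output with 4 channels per position
```

The rearrangement keeps every input value but trades spatial extent for channel depth, which tends to give later layers wider and denser tensors to operate on.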

Latency-aware compound scaling of a neural network can improve on conventional scaling approaches that do not optimize for latency. LACS can instead be used to identify scaling parameter values for a scaled neural network that is accurate and runs with low latency on the target computing resource.

The technology further enables multi-objective scaling that shares its objectives with the search for a neural architecture using NAS or a similar technique. The scaled neural network can be identified according to the same objectives that were used when searching for the base neural network. As a result, rather than treating the base architecture search stage and the scaling stage as tasks with separate objectives, the scaled neural network can be optimized for performance across both stages.

LACS can be integrated with existing NAS systems, at least because the same objectives can be used for both searching and scaling to build an end-to-end system for determining scaled neural network architectures. Furthermore, a family of scaled neural network architectures can be identified faster than with searching-without-scaling approaches, which search for neural network architectures but do not scale them for deployment on a target computing resource.

The technology described herein may yield neural networks that improve on those identified by conventional search-and-scaling approaches that do not use LACS. In addition, a family of neural networks with different trade-offs between objectives, such as model accuracy and inference latency, can be generated quickly for application to a variety of use cases. The technology may also identify a neural network for performing a particular task faster, while the identified neural network may perform with better accuracy than neural networks identified using other approaches. This is at least because the neural networks identified by the search and scaling described herein account for characteristics that can affect latency, such as operational intensity and execution efficiency, rather than only the computational requirements of the network. In this way, the identified neural network can run faster at inference without sacrificing accuracy.

The technology may further provide a generally applicable framework for quickly migrating existing neural networks to improved computing resource environments. For example, the LACS and NAS described herein can be applied when migrating the execution of an existing neural network, selected for a data center with particular hardware, to a data center that uses different hardware. In this regard, a family of neural networks for performing the task of the existing neural network can be quickly identified and deployed on the new data center hardware. This application can be particularly useful in fields that call for rapid deployment and need state-of-the-art hardware to run efficiently, such as networks that perform computer vision or other image processing tasks.

FIG. 1 is a block diagram of a family 103 of scaled neural network architectures 104A-104N for deployment in a data center 115 housing a hardware accelerator 116. The deployed neural networks run on the hardware accelerator 116. The hardware accelerator 116 can be any type of processor, such as a CPU, a GPU, or an FPGA, or an ASIC such as a TPU. According to aspects of the present disclosure, the family 103 of scaled neural network architectures can be generated from a base neural network architecture 101.

The architecture of a neural network refers to the characteristics that define the neural network. For example, the architecture can include the characteristics of the different neural network layers that make up a network, how those layers process input, and how those layers interact with one another. For example, the architecture of a ConvNet (convolutional neural network) can define a discrete convolutional layer that receives input image data, followed by a pooling layer, followed by a fully connected layer that generates output according to the neural network task, such as classifying the content of the input image data. The architecture of a neural network can also define the types of operations performed within each layer. For example, the architecture of a ConvNet may specify that a ReLU activation function is used in the fully connected layer of the network.

Using NAS, the base neural network architecture 101 can be identified, according to a set of objectives, from a search space of candidate neural network architectures. As described in more detail herein, the search space of candidate neural network architectures can be expanded to include different network components, different operations, and different layers from which a base network satisfying the objectives can be identified.

The set of objectives used to identify the base neural network architecture 101 can also be applied to identify scaling parameter values for each of the neural networks 104A-104N in the family 103. The base neural network architecture 101 and the scaled neural network architectures 104A-104N can be characterized by a number of parameters, which are scaled to different degrees in the scaled neural network architectures 104A-104N. In FIG. 1, the neural networks 101 and 104A are shown with three scaling parameters: D, the number of layers in the neural network; W, the width of a neural network layer, or the number of neurons in the layer; and R, the size of the input processed by the neural network at a given layer.

As described in more detail herein, a system configured to perform LACS can search a coefficient search space 108 to identify multiple sets of scaling parameter values. Each scaling parameter value is a coefficient in the coefficient search space, which can be, for example, a set of positive real numbers. Each candidate network 107A-107N is scaled from the base neural network 101 according to candidate coefficient values identified as part of the search of the coefficient search space 108. The system can apply any of a variety of search techniques for identifying candidate coefficients, such as a Pareto frontier search or a grid search. For each candidate network 107A-107N, the system can evaluate a performance metric of that candidate network in performing the neural network task. The performance metric can be based on multiple objectives, including a latency objective that measures the latency of the candidate network between receiving an input and generating the corresponding output as part of performing the neural network task.

After the scaling parameter value search has been performed over the coefficient search space 108, the system can receive a scaled neural network architecture 109. The scaled neural network architecture 109 is scaled from the base neural network 101 using the scaling parameter values that yield the highest performance metric among the candidate networks 107A-107N identified during the search of the coefficient search space.
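The search described above can be sketched as a small grid search over candidate (depth, width, resolution) coefficients. This is a minimal sketch under stated assumptions: the proxy functions `scaled_accuracy` and `scaled_latency_ms`, the reward form in `performance_metric`, and the constants `target_ms` and `weight` are hypothetical stand-ins. A real system would train or evaluate each scaled candidate and measure its latency on the target computing resource.

```python
import itertools


def scaled_accuracy(depth, width, resolution):
    # Toy proxy (assumption): accuracy improves with scale, with diminishing returns.
    return 1.0 - 0.3 / (depth * width * resolution) ** 0.5


def scaled_latency_ms(depth, width, resolution):
    # Toy proxy (assumption): latency grows linearly with depth and
    # quadratically with width and resolution.
    return 5.0 * depth * width ** 2 * resolution ** 2


def performance_metric(accuracy, latency_ms, target_ms=20.0, weight=-0.07):
    # Assumed multi-objective form: accuracy discounted by latency relative
    # to a target. A negative weight penalizes slower candidates.
    return accuracy * (latency_ms / target_ms) ** weight


def search_scaling_values(grid):
    """Return the candidate coefficients with the highest performance metric."""
    best_values, best_score = None, float("-inf")
    for depth, width, resolution in itertools.product(grid, repeat=3):
        score = performance_metric(
            scaled_accuracy(depth, width, resolution),
            scaled_latency_ms(depth, width, resolution),
        )
        if score > best_score:
            best_values, best_score = (depth, width, resolution), score
    return best_values, best_score


grid = [1.0, 1.2, 1.4]
best_values, best_score = search_scaling_values(grid)
print(best_values, round(best_score, 4))
```

Because both accuracy and latency enter the metric, the selected coefficients reflect a trade-off rather than the most accurate or the fastest candidate alone.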

From the scaled neural network architecture 109, the system can generate the family 103 of scaled neural network architectures 104A-104N. The family 103 can be generated by scaling the scaled neural network architecture 109 according to different values. Each scaling parameter value of the scaled neural network architecture 109 can be scaled uniformly to generate the other scaled neural network architectures in the family 103. For example, each scaling parameter value of the scaled neural network architecture 109 can be scaled by increasing it by a factor of two. The scaled neural network architecture 109 can be scaled for different values, or "compound coefficients," that are applied uniformly to each of its scaling parameter values. In some implementations, the scaled neural network architecture 109 is scaled in another way, for example by scaling each scaling parameter value separately, to generate the scaled neural network architectures in the family 103.
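Uniform scaling by a compound coefficient can be sketched as follows. The starting values in `first_values`, the chosen coefficients, and the helper name `scale_uniformly` are hypothetical; the sketch only assumes, as in the factor-of-two example above, that a compound coefficient multiplies every scaling parameter value by the same amount.

```python
def scale_uniformly(scaling_values, compound_coefficient):
    """Apply one compound coefficient to every scaling parameter value."""
    return {
        name: value * compound_coefficient
        for name, value in scaling_values.items()
    }


# Hypothetical scaling parameter values found by the coefficient search.
first_values = {"depth": 1.2, "width": 1.1, "resolution": 1.15}

# One family member per compound coefficient; 1.0 reproduces the starting point.
family = {c: scale_uniformly(first_values, c) for c in (0.5, 1.0, 2.0)}

for coefficient, values in family.items():
    print(coefficient, values)
```

Each family member keeps the ratio between depth, width, and resolution found by the search, so only a single number has to be varied to produce a larger or smaller architecture.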

By scaling the scaled neural network architecture 109 according to different values, different neural network architectures for performing a task under a variety of use cases can be generated quickly. The different use cases can be specified as different trade-offs among the objectives used to identify the scaled neural network architecture 109. For example, one scaled neural network architecture may be identified as meeting a higher accuracy threshold at the cost of higher latency during execution. Another scaled neural network architecture may be identified as meeting a lower accuracy threshold while running with low latency on the hardware accelerator 116. Yet another scaled neural network architecture may be identified as balancing the trade-off between accuracy and latency on the hardware accelerator 116.
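Choosing a family member for a given use case can be sketched as a constrained selection. The family entries, their accuracy and latency figures, and the helper names below are hypothetical values chosen for illustration; in practice they would come from evaluating each scaled architecture on the target hardware.

```python
# Hypothetical family members with measured accuracy and latency.
family = [
    {"name": "small", "accuracy": 0.76, "latency_ms": 4.0},
    {"name": "medium", "accuracy": 0.81, "latency_ms": 9.0},
    {"name": "large", "accuracy": 0.84, "latency_ms": 21.0},
]


def pick_for_latency_budget(members, budget_ms):
    """Most accurate member that meets the latency budget, or None."""
    feasible = [m for m in members if m["latency_ms"] <= budget_ms]
    return max(feasible, key=lambda m: m["accuracy"]) if feasible else None


def pick_for_accuracy_floor(members, min_accuracy):
    """Fastest member that meets the accuracy threshold, or None."""
    feasible = [m for m in members if m["accuracy"] >= min_accuracy]
    return min(feasible, key=lambda m: m["latency_ms"]) if feasible else None


realtime_choice = pick_for_latency_budget(family, budget_ms=10.0)
offline_choice = pick_for_accuracy_floor(family, min_accuracy=0.83)
print(realtime_choice["name"], offline_choice["name"])
```

A latency-bound use case and an accuracy-bound use case select different members of the same family, which mirrors the trade-offs described in the paragraph above.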

例として、物体認識などのコンピュータビジョンタスクを実行するために、アプリケーションが連続して映像データまたは画像データならびに受信データにある特定のクラスの物体を識別するタスクを受け付けることの一部として、ニューラルネットワークアーキテクチャは、出力をリアルタイムまたはほぼリアルタイムで生成する必要があるであろう。この例示的なタスクでは、正解率の許容値は低い可能性があるので、低レイテンシおよび低正解率についての適切なトレードオフでスケーリングされたニューラルネットワークアーキテクチャがデプロイされてタスクを実行し得る。 As an example, to perform a computer vision task such as object recognition, as part of an application continuously accepting video or image data and the task of identifying a particular class of objects in the received data, a neural network architecture would need to generate output in real-time or near real-time. For this exemplary task, the tolerance for accuracy may be low, so a neural network architecture scaled with an appropriate tradeoff between low latency and low accuracy may be deployed to perform the task.

As another example, a neural network architecture may be tasked with classifying every object in a scene received as image or video data. Where latency in performing this task is not considered as important as performing the task accurately, a scaled neural network architecture that provides high accuracy at the expense of latency may be deployed. In other examples, where no particular tradeoff is identified or desired for the neural network task, a scaled neural network architecture that balances the tradeoffs among accuracy, latency, and other objectives may be deployed.

スケーリングされたニューラルネットワークアーキテクチャ１０４Ｎは、スケーリングされたニューラルネットワーク１０４Ａを取得するためにスケーリングされたニューラルネットワークアーキテクチャ１０９をスケーリングするのに用いられたスケーリングパラメータ値とは異なるスケーリングパラメータ値を用いてスケーリングされ、異なるユースケース、たとえば、推論時にレイテンシよりも正解率が所望されるユースケースを表し得る。 The scaled neural network architecture 104N may be scaled using different scaling parameter values than those used to scale the scaled neural network architecture 109 to obtain the scaled neural network 104A, and may represent a different use case, e.g., a use case in which accuracy is more desirable than latency during inference.
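Choosing a family member for a given use case can be sketched as a filter over measured accuracy and latency. The member names, the metric values, and the rule of preferring the lowest-latency eligible member are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class FamilyMember:
    name: str
    accuracy: float    # validation accuracy in [0, 1]
    latency_ms: float  # inference latency on the target accelerator


def pick_member(family: Sequence[FamilyMember],
                min_accuracy: float,
                max_latency_ms: float) -> Optional[FamilyMember]:
    """Return the fastest member meeting both constraints, or None."""
    eligible = [
        m for m in family
        if m.accuracy >= min_accuracy and m.latency_ms <= max_latency_ms
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda m: m.latency_ms)


# Hypothetical family; the numbers are placeholders, not measured results.
family = [
    FamilyMember("small", accuracy=0.78, latency_ms=4.0),
    FamilyMember("medium", accuracy=0.83, latency_ms=9.0),
    FamilyMember("large", accuracy=0.88, latency_ms=21.0),
]

realtime_choice = pick_member(family, min_accuracy=0.75, max_latency_ms=5.0)
offline_choice = pick_member(family, min_accuracy=0.85, max_latency_ms=50.0)
```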

The LACS and NAS techniques described herein may generate the family 103 for the hardware accelerator 116, and may also receive additional training data and information specifying multiple different computing resources, such as different types of hardware accelerators. In addition to generating the family 103 for the hardware accelerator 116, the system may search for a base neural network architecture and generate a family of scaled neural network architectures for other hardware accelerators. For example, for GPUs and TPUs, the system may generate separate model families, each optimized for the accuracy-latency tradeoff on GPUs and on TPUs, respectively. In some implementations, the system may generate multiple scaled families from the same base neural network architecture.

Example Methods
FIG. 2 is a flow diagram of an exemplary process 200 for generating a scaled neural network architecture for execution on a target computing resource. The exemplary process 200 may be performed on a system of one or more processors in one or more locations. For example, the NAS-LACS (Neural Architecture Search-Latency-Aware Composite Scaling) system described herein may perform the process 200.

ブロック２１０に示すように、システムは、ニューラルネットワークタスクに対応する訓練データを受信する。ニューラルネットワークタスクは、ニューラルネットワークによって実行され得る機械学習タスクである。スケーリングされたニューラルネットワークは、任意の種類のデータ入力を受け付けて、ニューラルネットワークタスクを実行するための出力を生成するように構成され得る。例として、出力は、入力に基づいて出力される任意の種類のスコア、クラス分類、または回帰であり得る。これに対応して、ニューラルネットワークタスクは、与えられた入力に対する出力を予測するためのスコアリングタスク、クラス分類タスク、および／または回帰タスクであり得る。これらのタスクは、画像、映像、テキスト、音声、またはその他の種類のデータを処理する際の様々なアプリケーションに対応し得る。 As shown in block 210, the system receives training data corresponding to a neural network task. A neural network task is a machine learning task that may be performed by a neural network. The scaled neural network may be configured to accept any type of data input and generate an output for performing the neural network task. By way of example, the output may be any type of score, classification, or regression output based on the input. Correspondingly, the neural network task may be a scoring task, a classification task, and/or a regression task to predict an output for a given input. These tasks may correspond to a variety of applications in processing images, video, text, audio, or other types of data.

受け付けた訓練データは、様々な学習技術のうち１つの学習技術に応じて、ニューラルネットワークを訓練するのに適した任意の形式であり得る。ニューラルネットワークを訓練するための学習技術は、教師あり学習技術、教師なし学習技術、および半教師あり学習技術を含み得る。たとえば、訓練データは、ニューラルネットワークが入力として受け付け得る複数の訓練例を含み得る。訓練例は、特定のニューラルネットワークタスクを実行するように適切に訓練されたニューラルネットワークによって生成されることになっている出力に対応する既知の出力でラベル付けされ得る。たとえば、ニューラルネットワークタスクがクラス分類タスクである場合、訓練例は、画像に描かれている被写体を分類分けする１つ以上のクラスでラベル付けされた画像であり得る。 The received training data may be in any format suitable for training a neural network according to one of a variety of learning techniques. Learning techniques for training a neural network may include supervised, unsupervised, and semi-supervised learning techniques. For example, the training data may include a number of training examples that the neural network may accept as inputs. The training examples may be labeled with known outputs that correspond to outputs that are to be generated by a neural network properly trained to perform a particular neural network task. For example, if the neural network task is a classification task, the training examples may be images labeled with one or more classes that categorize the objects depicted in the images.

As shown in block 220, the system receives information specifying a target computing resource. The target computing resource data may specify characteristics of the computing resources on which at least a portion of the neural network may be deployed. The computing resources may be housed in one or more data centers or other physical locations hosting various types of hardware devices. Example hardware types include central processing units (CPUs), graphics processing units (GPUs), edge or mobile computing devices, field-programmable gate arrays (FPGAs), and various types of application-specific integrated circuits (ASICs).

ハードウェアアクセラレーションのために構成され得るデバイスもあり、特定の種類の演算を効率よく実行するために構成されたデバイスを含み得る。たとえばＧＰＵとＴＰＵ（ＴｅｎｓｏｒＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）とを含むこれらのハードウェアアクセラレータは、ハードウェアアクセラレーションの特殊機能を実施し得る。ハードウェアアクセラレーションの機能として、行列乗算など、機械学習モデルの実行に共通して関連する演算を実行するための構成などを挙げることができる。また、例として、これらの特殊機能は、異なる種類のＧＰＵにおいて利用可能な行列積和ユニット、およびＴＰＵにおいて利用可能な行列積ユニットを含み得る。 Some devices may be configured for hardware acceleration, including devices configured to efficiently perform certain types of operations. These hardware accelerators, including GPUs and Tensor Processing Units (TPUs), may perform specialized functions of hardware acceleration, such as configurations for performing operations commonly associated with running machine learning models, such as matrix multiplication. Also, by way of example, these specialized functions may include matrix multiply-accumulate units available in different types of GPUs and matrix multiplication units available in TPUs.

ターゲットコンピューティングリソースのデータは、１つ以上のターゲットコンピューティングリソースセットについてのデータを含み得る。ターゲットコンピューティングリソースセットは、ニューラルネットワークをデプロイしたいコンピューティングデバイスの集まりを指す。ターゲットコンピューティングリソースセットを指定する情報は、ターゲットセットに含まれるハードウェアアクセラレータまたはその他のコンピューティングデバイスの種類および量を指し得る。ターゲットセットは、同じ種類または異なる種類のデバイスを含み得る。たとえば、ターゲットコンピューティングリソースセットは、処理能力、スループット、およびメモリ容量を含む、特定の種類のハードウェアアクセラレータのハードウェア特性および量を規定し得る。本明細書において説明したように、システムは、ターゲットコンピューティングリソースセットにおいて指定されたデバイスごとに、スケーリングされたニューラルネットワークアーキテクチャのファミリーを生成し得る。 The target computing resource data may include data about one or more target computing resource sets. The target computing resource set refers to a collection of computing devices on which one wishes to deploy the neural network. The information specifying the target computing resource set may refer to the type and quantity of hardware accelerators or other computing devices included in the target set. The target set may include devices of the same or different types. For example, the target computing resource set may specify the hardware characteristics and quantity of a particular type of hardware accelerator, including processing power, throughput, and memory capacity. As described herein, the system may generate a family of scaled neural network architectures for each device specified in the target computing resource set.
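The information specifying a target computing resource set can be pictured as a small record per device type. The sketch below is one possible representation; the field names and the numeric values are hypothetical placeholders rather than specifications of any real device.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class AcceleratorSpec:
    kind: str                # type of device, e.g. "GPU" or "TPU"
    count: int               # quantity of this device in the set
    peak_flops: float        # processing power, in FLOP/s
    memory_bandwidth: float  # throughput, in bytes/s
    memory_capacity: float   # memory capacity, in bytes


@dataclass(frozen=True)
class TargetResourceSet:
    name: str
    devices: Tuple[AcceleratorSpec, ...]

    def device_kinds(self) -> List[str]:
        """Distinct device types for which a family could be generated."""
        return sorted({d.kind for d in self.devices})


# Placeholder numbers chosen only to make the example concrete.
target_set = TargetResourceSet(
    name="datacenter-config-a",
    devices=(
        AcceleratorSpec("GPU", count=8, peak_flops=1.0e14,
                        memory_bandwidth=9.0e11, memory_capacity=1.6e10),
        AcceleratorSpec("TPU", count=4, peak_flops=1.2e14,
                        memory_bandwidth=9.0e11, memory_capacity=3.2e10),
    ),
)
kinds = target_set.device_kinds()
```

A set may hold devices of one type or of several types, matching the description above.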

In addition, the target computing resource data may specify different target computing resource sets, for example reflecting the different possible configurations of the computing resources housed in a data center. From the training data and the target computing resource data, the system may generate a family of neural network architectures. Each architecture may be generated from a base neural network identified by the system.

ブロック２３０に示すように、システムは、訓練データを用いて、探索空間に対してニューラルアーキテクチャ探索を実行し、基本ニューラルネットワークのアーキテクチャを識別し得る。システムは、強化学習、進化的探索、または微分可能探索に基づく技術など、様々なＮＡＳ技術のいずれも使用し得る。いくつかの実施態様では、システムは、たとえば、本明細書において説明するように訓練データを受け付けてＮＡＳを実行することなく、基本ニューラルネットワークのアーキテクチャを指定するデータを直接受け付けてもよい。たとえば、特定のニューラルネットワークタスクを実行するのに適したニューラルネットワークアーキテクチャの種類の解析に基づいて、最初のニューラルネットワークアーキテクチャ候補を予め定めることができる。別の例として、ニューラルネットワーク候補から構成される探索空間から、最初のニューラルネットワークアーキテクチャ候補をランダムに選択できる。別の例として、最初のニューラルネットワーク候補の少なくとも一部の特性を、たとえば、同様または同じ数の層、重み値など、以前訓練されたニューラルネットワークの同様の特性に基づいて選択できる。 As shown in block 230, the system may use the training data to perform a neural architecture search on the search space to identify a base neural network architecture. The system may use any of a variety of NAS techniques, such as techniques based on reinforcement learning, evolutionary search, or differentiable search. In some implementations, the system may directly accept data specifying the base neural network architecture, for example, without accepting training data and performing NAS as described herein. For example, the initial neural network architecture candidates may be predefined based on an analysis of types of neural network architectures suitable for performing a particular neural network task. As another example, the initial neural network architecture candidates may be randomly selected from a search space of neural network candidates. As another example, at least some characteristics of the initial neural network candidates may be selected based on similar characteristics of previously trained neural networks, such as, for example, a similar or the same number of layers, weight values, etc.

The search space refers to the neural network candidates, or portions of neural network candidates, that may be selected as part of the base neural network architecture. A portion of a neural network architecture candidate may refer to a component of the neural network. The architecture of a neural network may be defined in terms of multiple components of the neural network, where each component includes one or more neural network layers. The characteristics of the neural network layers can be defined in the architecture at the component level, meaning that the architecture can define the specific operations of a component so that each neural network layer in the component performs the same operations defined for that component. A component may also be defined in the architecture by the number of layers it contains.

As part of performing NAS, the system may repeatedly identify a neural network candidate, obtain performance metrics corresponding to the multiple objectives, and evaluate the candidate according to each of those metrics. As part of obtaining performance metrics such as accuracy and latency for a neural network candidate, the system may train the candidate using the received training data. Once training is complete, the system may evaluate the neural network architecture candidate, determine its performance metrics, and compare them against those of the current best candidate.

システムは、ニューラルネットワーク候補を選択し、ネットワークを訓練し、そのパフォーマンスメトリックを比較することによって、停止メトリックに達するまでこの探索プロセスを繰り返し実行し得る。停止メトリックは、現在のネットワーク候補が満たすパフォーマンスの所定の最小しきい値であり得る。これに加えてまたはこれに代えて、停止メトリックは、最大数の探索イテレーション、または探索を実行するために割り当てられる最大期間であり得る。停止メトリックは、ニューラルネットワークのパフォーマンスが収束する条件、たとえば、後続のイテレーションのパフォーマンスが前回のイテレーションのパフォーマンスとは異なるしきい値未満である条件であり得る。 The system may perform this search process iteratively by selecting candidate neural networks, training the networks, and comparing their performance metrics until a stopping metric is reached. The stopping metric may be a predetermined minimum threshold of performance that the current candidate network meets. Additionally or alternatively, the stopping metric may be a maximum number of search iterations, or a maximum period of time allotted to perform the search. The stopping metric may be a condition under which the performance of the neural network converges, e.g., the performance of a subsequent iteration differs from the performance of a previous iteration by less than a threshold.
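The stopping conditions listed above can be sketched as a single search loop. In this sketch the `evaluate` callable stands in for training a candidate and measuring its performance; the function name, the return labels, and the scalar score are all assumptions for illustration.

```python
import time


def run_search(candidates, evaluate, min_score, max_iterations,
               max_seconds, convergence_eps):
    """Iterate over candidates until one of the stopping criteria is met."""
    start = time.monotonic()
    best, best_score, previous_score = None, float("-inf"), None
    for iteration, candidate in enumerate(candidates, start=1):
        score = evaluate(candidate)  # stands in for train + measure
        if score > best_score:
            best, best_score = candidate, score
        # Predetermined minimum threshold of performance.
        if best_score >= min_score:
            return best, "performance threshold met"
        # Maximum number of search iterations.
        if iteration >= max_iterations:
            return best, "iteration budget exhausted"
        # Maximum period of time allotted to the search.
        if time.monotonic() - start >= max_seconds:
            return best, "time budget exhausted"
        # Convergence: this iteration differs from the last by < eps.
        if (previous_score is not None
                and abs(score - previous_score) < convergence_eps):
            return best, "converged"
        previous_score = score
    return best, "candidates exhausted"


best, reason = run_search(
    candidates=[1, 2, 3, 4],
    evaluate=lambda c: c / 10,
    min_score=0.3,
    max_iterations=10,
    max_seconds=60.0,
    convergence_eps=0.0,
)
```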

正解率およびレイテンシなど、ニューラルネットワークの様々なパフォーマンスメトリックを最適化させるという状況では、停止メトリックは、「最適である」と予め定められたしきい値範囲を指定し得る。たとえば、最適なレイテンシのしきい値範囲は、ターゲットコンピューティングリソースが実現する理論上の最小レイテンシまたは測定された最小レイテンシからのしきい値範囲であり得る。理論上の最小レイテンシまたは測定された最小レイテンシは、コンピューティングリソースのコンポーネントが物理的に受信データを読み込みして処理できるために最低限必要な時間など、コンピューティングリソースの物理的特性に基づき得る。いくつかの実施態様では、レイテンシは、たとえば、物理的に可能な限りゼロ遅延に近い最小値として保持され、ターゲットコンピューティングリソースから測定または算出されたターゲットレイテンシに基づいてはいない。 In the context of optimizing various performance metrics of a neural network, such as accuracy and latency, the stopping metric may specify a predetermined threshold range that is "optimal." For example, the optimal latency threshold range may be a threshold range from a theoretical or measured minimum latency that the target computing resource achieves. The theoretical or measured minimum latency may be based on physical characteristics of the computing resource, such as the minimum time required for a component of the computing resource to physically read and process incoming data. In some implementations, the latency is held to a minimum, e.g., as close to zero delay as physically possible, and is not based on a measured or calculated target latency from the target computing resource.

システムは、次のニューラルネットワークアーキテクチャ候補を選択するために機械学習モデルまたはその他の技術を使用するように構成され得る。ここで、選択は、特定のニューラルネットワークタスクの目的を受けてうまく機能する可能性の高いそれぞれ異なるニューラルネットワーク候補の学習済み特性に少なくとも一部基づき得る。 The system may be configured to use machine learning models or other techniques to select the next candidate neural network architecture, where the selection may be based at least in part on learned characteristics of different candidate neural networks that are likely to perform well given the objectives of a particular neural network task.

In some examples, the system may use a multi-objective reward mechanism to identify the base neural network architecture as follows:

ニューラルネットワーク候補の正解率を測定するために、システムは、訓練セットを用いて、ニューラルネットワークタスクを実行するようにニューラルネットワーク候補を訓練し得る。システムは、たとえば、８０／２０分割によって訓練データを訓練セットと検証セットとに分割し得る。たとえば、システムは、教師あり学習技術を適用して、ニューラルネットワーク候補が生成する出力と、ネットワークが処理する訓練例の正解ラベルとの誤差を算出し得る。システムは、ニューラルネットワークが訓練されているタスクの種類に適した任意の様々な損失関数または誤差関数を利用でき、クラス分類タスクには交差エントロピー誤差、回帰タスクには平均二乗誤差などがある。ニューラルネットワーク候補の重みを変化させた場合の誤差の勾配を、たとえば逆伝播アルゴリズムを用いて算出し得、ニューラルネットワークの重みを更新し得る。システムは、訓練のためのイテレーション回数、最大期間、収束、または正解率の最小しきい値を満たした場合など、停止メトリックが満たされるまでニューラルネットワーク候補を訓練するように訓練され得る。 To measure the accuracy of the candidate neural network, the system may train the candidate neural network to perform the neural network task using the training set. The system may split the training data into a training set and a validation set, for example, by an 80/20 split. For example, the system may apply supervised learning techniques to calculate the error between the output generated by the candidate neural network and the ground truth labels of the training examples processed by the network. The system may utilize any of a variety of loss or error functions appropriate to the type of task for which the neural network is being trained, such as cross entropy error for classification tasks and mean squared error for regression tasks. The gradient of the error as the weights of the candidate neural network are changed may be calculated, for example, using a backpropagation algorithm, and the weights of the neural network may be updated. The system may train the candidate neural network until a stopping metric is met, such as the number of iterations for training, a maximum period, convergence, or a minimum threshold for accuracy.
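The 80/20 split and the two error functions named above can be sketched with the standard library alone. The helper names and the fixed shuffle seed are illustrative assumptions; a practical system would normally rely on a machine learning framework for these steps.

```python
import math
import random


def split_train_validation(examples, train_fraction=0.8, seed=0):
    """Shuffle the examples and split them into training and validation sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def cross_entropy(probabilities, true_class):
    """Cross-entropy error for one classification example."""
    return -math.log(probabilities[true_class])


def mean_squared_error(predictions, targets):
    """Mean squared error over a batch of regression outputs."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


train_set, validation_set = split_train_validation(range(10))
classification_error = cross_entropy([0.5, 0.5], true_class=0)
regression_error = mean_squared_error([1.0, 2.0], [1.0, 4.0])
```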

In addition to other performance metrics, including the computational intensity of a base neural network candidate when deployed on the target computing resource and/or the execution efficiency of the candidate on the target computing resource, the system may generate accuracy and latency performance metrics for the neural network architecture candidate on the target computing resource. In some implementations, the performance measure of a base neural network candidate is based at least in part on computational intensity and/or execution efficiency, in addition to accuracy and latency.
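One way to fold accuracy and latency into a single reward is a soft latency-target form that is common in multi-objective architecture search. The form used below, the exponent value, and the function name are assumptions made for illustration and are not taken from this text.

```python
def latency_aware_reward(accuracy, latency_ms, target_latency_ms, w=-0.07):
    """Assumed reward: accuracy * (latency / target) ** w, with w < 0.

    A candidate slower than the target is penalized and a faster one is
    rewarded; at exactly the target latency the reward equals the accuracy.
    """
    return accuracy * (latency_ms / target_latency_ms) ** w


on_target = latency_aware_reward(0.80, latency_ms=10.0, target_latency_ms=10.0)
faster = latency_aware_reward(0.80, latency_ms=5.0, target_latency_ms=10.0)
slower = latency_aware_reward(0.80, latency_ms=20.0, target_latency_ms=10.0)
```

The magnitude of the negative exponent controls how strongly latency is traded against accuracy.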

レイテンシ、演算強度、および実行効率は、次のように定義され得る。 Latency, computational intensity, and execution efficiency can be defined as follows:

システムは、計算が少ないネットワークのみを探して探索するのではなく、演算強度、実行効率、および計算要件が改善された複数のニューラルネットワークアーキテクチャ候補を同時に探索することによって、基本ニューラルネットワークアーキテクチャを探索し、最終ニューラルネットワークのレイテンシを改善させることができる。システムをこのように動作するように構成して、最終基本ニューラルネットワークアーキテクチャの全体的なレイテンシを軽減させることができる。 The system can explore base neural network architectures and improve the latency of the final neural network by simultaneously exploring multiple candidate neural network architectures with improved computational intensity, execution efficiency, and computational requirements, rather than only searching for and exploring networks with low computational complexity. The system can be configured to operate in this manner to reduce the overall latency of the final base neural network architecture.
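The interplay of computational requirements, computational intensity, and execution efficiency can be illustrated with a roofline-style estimate. The sketch assumes that latency equals the FLOP count divided by the achieved compute rate, where the ideal rate is the smaller of the peak rate and intensity times memory bandwidth. This model, the names, and the numbers are assumptions for illustration, not definitions reproduced from this text.

```python
def estimate_latency(flops, bytes_moved, peak_flops_per_s,
                     bandwidth_bytes_per_s, execution_efficiency):
    """Roofline-style latency estimate in seconds (assumed model)."""
    intensity = flops / bytes_moved  # FLOPs per byte of memory traffic
    ideal_rate = min(peak_flops_per_s, intensity * bandwidth_bytes_per_s)
    achieved_rate = ideal_rate * execution_efficiency
    return flops / achieved_rate


# Placeholder hardware numbers: 1e12 FLOP/s peak, 1e10 bytes/s bandwidth.
baseline = estimate_latency(1e9, 1e8, 1e12, 1e10, execution_efficiency=0.5)

# Halving the FLOPs while moving the same bytes also halves the intensity,
# so in this memory-bound regime the estimated latency does not improve.
fewer_flops = estimate_latency(5e8, 1e8, 1e12, 1e10, execution_efficiency=0.5)

# Raising execution efficiency at the same FLOP count does reduce latency.
more_efficient = estimate_latency(1e9, 1e8, 1e12, 1e10, execution_efficiency=1.0)
```

Under these assumptions, searching only for networks with fewer FLOPs can leave latency unchanged, which is consistent with searching on all three factors at once.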

これに加えて、特に、ターゲットコンピューティングリソースがデータセンターハードウェアアクセラレータである場合、システムが基本ニューラルネットワークアーキテクチャを選択するアーキテクチャ候補探索空間を拡張して、ターゲットコンピューティングリソース上で少ない推論レイテンシで精度高く動作する可能性の高い利用可能なニューラルネットワーク候補の種類を広げることができる。 In addition, particularly when the target computing resource is a data center hardware accelerator, the system can expand the architecture candidate search space from which to select a base neural network architecture, broadening the variety of available neural network candidates that are likely to operate accurately with low inference latency on the target computing resource.

記載のように探索空間を拡張することで、データセンターアクセラレータのデプロイにより適したニューラルネットワークアーキテクチャ候補の数を増やすことができ、その結果、本開示の態様に応じて拡張されなかった探索空間では候補にならなかったであろう基本ニューラルネットワークアーキテクチャを識別できるようになる。ターゲットコンピューティングリソースがＧＰＵおよびＴＰＵのようなハードウェアアクセラレータを指定する実施例では、アーキテクチャ候補、または、演算強度、並列性、および／もしくは実行効率を向上させるコンポーネントまたは演算などのアーキテクチャの一部を用いて探索空間を拡張できる。 Expanding the search space as described can increase the number of candidate neural network architectures that are more suitable for data center accelerator deployment, thereby enabling identification of basic neural network architectures that would not have been candidates in a search space that was not expanded in accordance with aspects of the present disclosure. In examples where the target computing resources specify hardware accelerators such as GPUs and TPUs, the search space can be expanded with candidate architectures or portions of architectures, such as components or operations that improve computational intensity, parallelism, and/or execution efficiency.

１つの例示的な拡張方法では、様々な種類の活性化関数のうち１つを実装する層を有するニューラルネットワークアーキテクチャコンポーネントを含むように探索空間を拡張できる。ＴＰＵおよびＧＰＵの場合、ＲｅＬＵまたはｓｗｉｓｈなどの活性化関数は、通常、演算強度が低く、これらの種類のハードウェアアクセラレータ上のメモリによる制約を受けることが通常であることが分かった。ニューラルネットワークにおいて活性化関数を実行することは、概して、ターゲットコンピューティングリソース上で利用可能なメモリの総量による制約を受けるので、これらの関数の実行は、エンドツーエンドネットワークの推論速度のパフォーマンスに非常にマイナスの影響を与え得る。 In one exemplary extension method, the search space can be extended to include neural network architecture components having layers that implement one of many different types of activation functions. For TPUs and GPUs, it has been found that activation functions such as ReLU or swish are typically low in computational intensity and typically constrained by memory on these types of hardware accelerators. Because executing activation functions in neural networks is generally constrained by the amount of memory available on the target computing resource, executing these functions can have a significant negative impact on the performance of the inference speed of the end-to-end network.

One exemplary expansion of the search space with respect to activation functions is to introduce activation functions fused with their associated discrete convolutions. Activation functions are generally element-wise operations that run on the vector units of a hardware accelerator, so they can be executed in parallel with discrete convolutions, which are typically matrix-based operations that run on the accelerator's matrix units. These fused activation-convolution operations can be selected by the system as candidate neural network components as part of the base neural network architecture search described herein. Any of a variety of activation functions can be used, including Swish, ReLU (Rectified Linear Unit), Sigmoid, Tanh, and Softmax.

融合活性化関数-畳み込み演算の層を含む様々なコンポーネントを探索空間に付加でき、当該コンポーネントは、使われる活性化関数の種類のよって異なり得る。たとえば、活性化関数－畳み込みの層の１つのコンポーネントがＲｅＬＵ活性化関数を含み得る一方で、別のコンポーネントは、Ｓｗｉｓｈ活性化関数を含み得る。異なる活性化関数を用いて異なるハードウェアアクセラレータがより効率よく動作し得るため、複数種類の活性化関数の融合活性化関数－畳み込みを含むように探索空間を拡張させることで、対象のニューラルネットワークタスクを実行するのに最も適した基本ニューラルネットワークアーキテクチャを識別することをさらに向上できることが分かった。 Various components including layers of fused activation function-convolution operations can be added to the search space, which may vary depending on the type of activation function used. For example, one component of an activation function-convolution layer may include a ReLU activation function, while another component may include a Swish activation function. Because different hardware accelerators may operate more efficiently with different activation functions, it has been found that expanding the search space to include fused activation function-convolutions of multiple types of activation functions can further improve identifying the underlying neural network architecture that is most suitable for performing the neural network task of interest.
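The data flow of a fused activation-convolution operation can be sketched in plain Python. The unfused version writes out a full intermediate result and reads it back for the activation, while the fused version applies the activation as each output is produced. The one-dimensional convolution, the helper names, and the component registry are illustrative assumptions; the sketch shows the data flow only and does not model accelerator vector or matrix units.

```python
import math


def relu(x):
    return x if x > 0.0 else 0.0


def swish(x):
    return x / (1.0 + math.exp(-x))  # x * sigmoid(x)


def conv1d(signal, kernel):
    """Valid-mode 1-D cross-correlation, as used in convolution layers."""
    width = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(width))
        for i in range(len(signal) - width + 1)
    ]


def conv_then_activation(signal, kernel, activation):
    intermediate = conv1d(signal, kernel)         # stored in full
    return [activation(v) for v in intermediate]  # second pass over memory


def fused_conv_activation(signal, kernel, activation):
    width = len(kernel)
    return [  # activation applied as each output element is produced
        activation(sum(signal[i + j] * kernel[j] for j in range(width)))
        for i in range(len(signal) - width + 1)
    ]


# Hypothetical search-space entries differing only in the activation used.
FUSED_COMPONENTS = {
    "fused_conv_relu": lambda s, k: fused_conv_activation(s, k, relu),
    "fused_conv_swish": lambda s, k: fused_conv_activation(s, k, swish),
}

signal = [1.0, -2.0, 3.0, -4.0, 5.0]
kernel = [1.0, 1.0]
fused = FUSED_COMPONENTS["fused_conv_relu"](signal, kernel)
```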

In addition to the components with various activation functions described herein, other fused convolution structures can be used to expand the search space, further enriching it with convolutions of different shapes, types, and sizes. The different convolution structures can be components added as part of a neural network architecture candidate, and can include an expansion layer of 1x1 convolutions, a depthwise convolution, a projection layer of 1x1 convolutions, and other operations such as activation functions, batch normalization functions, and/or skip connections.

The NAS search space can therefore also be expanded in other ways to include operations that exploit the parallelism available on a hardware accelerator. The search space can include one or more operations for fusing a depthwise convolution with an adjacent 1x1 convolution, and operations for reshaping the input to the neural network. For example, the input to a neural network candidate can be a tensor. A tensor is a data structure that can represent multiple values according to different ranks. For example, a first-order tensor can be a vector, a second-order tensor can be a matrix, a third-order tensor can be a three-dimensional array, and so on. Fusing depthwise convolutions can be beneficial because depthwise operations generally have low computational intensity, and fusing them with adjacent convolutions can raise the computational intensity closer to the maximum capability of the hardware accelerator.

また、探索空間は、ターゲットコンピューティングリソース上のメモリ内の様々な場所にテンソルの要素を移動させることによって入力テンソルを整形する演算を含み得る。これに加えてまたはこれに代えて、演算は、メモリ内の様々な場所に要素を複製し得る。 The search space may also include operations that reshape an input tensor by moving elements of the tensor to various locations in memory on the target computing resource. Additionally or alternatively, the operations may replicate elements to various locations in memory.

いくつかの実施態様では、システムは、直接スケーリングするための基本ニューラルネットワークを受信し、基本ニューラルネットワークを識別するためのＮＡＳまたはその他の探索を実行しないように構成される。いくつかの実施態様では、１つのデバイス上で基本ニューラルネットワークを識別し、本明細書において説明したように基本ニューラルネットワークを別のデバイス上でスケーリングすることによって、複数のデバイスが個々にプロセス２００の少なくとも一部を実行する。 In some implementations, the system is configured to receive the base neural network for scaling directly and not perform a NAS or other search to identify the base neural network. In some implementations, multiple devices individually perform at least a portion of process 200 by identifying the base neural network on one device and scaling the base neural network on another device as described herein.

図２のブロック２４０に示すように、システムは、ターゲットコンピューティングリソースを指定する情報と、複数のスケーリングパラメータとに応じて基本ニューラルネットワークをスケーリングするための複数のスケーリングパラメータ値を識別し得る。システムは、基本ニューラルネットワークのスケーリングパラメータ値を探索する目的としてスケーリングされたニューラルネットワーク候補の正解率とレイテンシとを使うために、今回説明したようなレイテンシを意識した複合スケーリングを利用できる。たとえば、システムは、図２のステップ２４０において、レイテンシを意識した複合スケーリングのプロセス３００を適用し、複数のスケーリングパラメータ値を識別し得る。 As shown in block 240 of FIG. 2, the system may identify multiple scaling parameter values for scaling the base neural network in response to the information specifying the target computing resource and multiple scaling parameters. The system may use latency-aware composite scaling as described herein to use the accuracy rate and latency of the scaled neural network candidates to search for scaling parameter values for the base neural network. For example, the system may apply the latency-aware composite scaling process 300 in step 240 of FIG. 2 to identify multiple scaling parameter values.

一般に、スケーリング技術をＮＡＳと併せて適用し、ターゲットコンピューティングリソース上でデプロイするためにスケーリングされるニューラルネットワークを識別する。モデルスケーリングをＮＡＳと併せて使用し、様々なユースケースをサポートするニューラルネットワークのファミリーをさらに効率よく探索できる。スケーリング手法の下では、様々な技術を用いて、ニューラルネットワークの深さ、幅、および分解能などのスケーリングパラメータについて、様々な値を探索できる。スケーリングパラメータごとに値を別個に探索することによって、または、複数のスケーリングパラメータをまとめて調整するための均一の値セットを探索することによって、スケーリングを行うことができる。前者は、単純スケーリングと称される場合があり、後者は、複合スケーリングと称される場合がある。 In general, scaling techniques are applied in conjunction with NAS to identify neural networks that are scaled for deployment on target computing resources. Model scaling can be used in conjunction with NAS to more efficiently explore families of neural networks that support different use cases. Under the scaling approach, different techniques can be used to explore different values for scaling parameters such as the depth, width, and resolution of the neural network. Scaling can be done by exploring values for each scaling parameter separately or by exploring a uniform set of values for tuning multiple scaling parameters together. The former may be referred to as simple scaling, and the latter may be referred to as composite scaling.
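The difference between the two scaling approaches can be sketched by enumerating the candidates each one would consider. The parameter names follow the depth, width, and resolution examples above; the function names and the candidate values are hypothetical.

```python
from itertools import product

PARAMETERS = ("depth", "width", "resolution")


def simple_scaling_candidates(values):
    """Vary one scaling parameter at a time, leaving the others at 1.0."""
    candidates = []
    for index in range(len(PARAMETERS)):
        for value in values:
            coefficients = [1.0] * len(PARAMETERS)
            coefficients[index] = value
            candidates.append(tuple(coefficients))
    return candidates


def composite_scaling_candidates(values):
    """Search coefficient tuples jointly across all scaling parameters."""
    return list(product(values, repeat=len(PARAMETERS)))


simple = simple_scaling_candidates([1.2, 1.4])
composite = composite_scaling_candidates([1.2, 1.4])
```

With n values and three parameters, the separate search considers 3n candidates while the joint search considers n cubed, which is why the joint search is usually paired with a search strategy rather than full enumeration.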

唯一の目的として正解率を用いるスケーリング技術では、データセンターアクセラレータなど専用ハードウェア上にデプロイされたときのパフォーマンス／速度の影響を適切に考慮してスケーリングされたニューラルネットワークを得ることができない。図３においてさらに詳細を説明するが、ＬＡＣＳは、基本ニューラルネットワークアーキテクチャを識別するために使われる目的と同じ目的として共有され得る正解率目的およびレイテンシ目的の両方を使用し得る。 Scaling techniques that use accuracy as the only objective do not result in a scaled neural network that adequately accounts for the performance/speed impacts when deployed on dedicated hardware such as data center accelerators. As explained in further detail in Figure 3, LACS may use both accuracy and latency objectives, which may be shared as the same objective as the objective used to identify the base neural network architecture.

図３は、基本ニューラルネットワークアーキテクチャのレイテンシを意識した複合スケーリングの例示的なプロセス３００である。例示的なプロセス３００は、１つ以上の場所にある１つ以上のプロセッサから構成されるシステムまたはデバイス上で実行され得る。たとえば、プロセス３００は、本明細書に記載のＮＡＳ－ＬＡＣＳシステム上で実行され得る。たとえば、システムは、ステップ２４０の一部としてプロセス３００を実行し、基本ニューラルネットワークをスケーリングするための複数のスケーリングパラメータ値を識別し得る。 Figure 3 is an example process 300 for latency-aware composite scaling of a base neural network architecture. The example process 300 may be performed on a system or device comprised of one or more processors in one or more locations. For example, the process 300 may be performed on a NAS-LACS system as described herein. For example, the system may perform the process 300 as part of step 240 to identify multiple scaling parameter values for scaling the base neural network.

複合スケーリング手法では、スケーリングパラメータごとのスケーリング係数をまとめて探索する。システムは、パレートフロンティア探索またはグリッドサーチなど、係数のタプルを識別するための任意の様々な探索技術を適用できる。システムは、係数のタプルを探索するが、図２を参照して本明細書において説明した基本ニューラルネットワークアーキテクチャを識別するために用いられる目的と同じ目的に応じて当該タプルを探索し得る。これに加えて、複数の目的は、正解率とレイテンシとの両方を含み得、次のように表すことができる。 In a composite scaling approach, scaling coefficients for each scaling parameter are searched together. The system may apply any of a variety of search techniques to identify tuples of coefficients, such as Pareto frontier search or grid search. The system may search the tuples of coefficients according to the same objectives used to identify the basic neural network architecture described herein with reference to FIG. 2. In addition, the multiple objectives may include both accuracy and latency, which may be expressed as follows:

性能評価指標を決定することの一部として、システムは、受信した訓練データを用いて、スケーリングされたニューラルネットワークアーキテクチャ候補をさらに訓練および調整し得る。システムは、訓練済みのスケーリングされたニューラルネットワークの性能評価指標を決定し、ブロック３３０に従って、性能評価指標がパフォーマンスしきい値を満たすかどうかを判断し得る。性能評価指標およびパフォーマンスしきい値は、それぞれ、複数のパフォーマンスメトリックの複合物および複数のパフォーマンスしきい値であり得る。たとえば、システムは、スケーリングされたニューラルネットワークの正解率および推論レイテンシの両方のメトリックから１つの性能評価指標を決定し得、または、異なる目的について別個のパフォーマンスメトリックを判定し、各メトリックを対応するパフォーマンスしきい値と比較し得る。 As part of determining the performance metrics, the system may further train and tune the candidate scaled neural network architecture using the received training data. The system may determine a performance metric for the trained scaled neural network and determine whether the performance metric meets a performance threshold according to block 330. The performance metric and the performance threshold may be a composite of multiple performance metrics and multiple performance thresholds, respectively. For example, the system may determine one performance metric from both the accuracy rate and inference latency metrics of the scaled neural network, or may determine separate performance metrics for different objectives and compare each metric to a corresponding performance threshold.

性能評価指標がパフォーマンスしきい値を満たした場合、プロセス３００は終了する。そうでない場合、プロセスは継続し、システムは、ブロック３１０に従って新しい複数のスケーリングパラメータ値候補を選択する。たとえば、システムは、以前選択したタプル候補およびその対応する性能評価指標に少なくとも一部基づいて、係数探索空間からスケーリング係数から構成される新しいタプルを選択し得る。いくつかの実施態様では、システムは、係数から構成される複数のタプルを探索し、タプル候補の各々の近傍にある複数の目的に応じてより細かい探索を行う。 If the performance metric meets the performance threshold, process 300 ends. Otherwise, the process continues and the system selects new scaling parameter value candidates according to block 310. For example, the system may select a new tuple of scaling coefficients from the coefficient search space based at least in part on the previously selected tuple candidates and their corresponding performance metric. In some implementations, the system searches multiple tuples of coefficients and performs a finer search according to multiple objectives in the neighborhood of each of the tuple candidates.

システムは、たとえば、グリッドサーチ、強化学習、進化的探索を用いるなど、係数候補空間を繰り返し探索するために任意の様々な技術を実施し得る。前述したように、システムは、収束またはイテレーションの数などの停止メトリックに達するまで、スケーリングパラメータ値を探索し続け得る。 The system may implement any of a variety of techniques to iteratively explore the coefficient candidate space, such as using grid search, reinforcement learning, evolutionary search, etc. As previously described, the system may continue to explore scaling parameter values until a stopping metric, such as convergence or number of iterations, is reached.
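The iterative search described above can be sketched as a plain grid search that stops at the first tuple meeting both thresholds or after a fixed number of iterations. This is a minimal illustration only: the three coefficients (depth, width, resolution), the function names, the threshold values, and the toy evaluator are all assumptions, and a real evaluator would train and measure each scaled candidate.

```python
import itertools

def search_scaling_tuple(evaluate, depth_vals, width_vals, res_vals,
                         min_accuracy, max_latency_ms, max_iters=100):
    """Return the first (depth, width, resolution) tuple meeting both thresholds, else None."""
    candidates = itertools.product(depth_vals, width_vals, res_vals)
    for step, coeffs in enumerate(candidates):
        if step >= max_iters:  # stopping criterion: iteration budget exhausted
            break
        accuracy, latency_ms = evaluate(coeffs)
        if accuracy >= min_accuracy and latency_ms <= max_latency_ms:
            return coeffs
    return None

def toy_evaluate(coeffs):
    """Hypothetical stand-in for training a scaled candidate and measuring it."""
    d, w, r = coeffs
    accuracy = 0.70 + 0.04 * (d - 1) + 0.03 * (w - 1) + 0.05 * (r - 1)
    latency_ms = 5.0 * d * w * w * r * r
    return accuracy, latency_ms

best = search_scaling_tuple(toy_evaluate, [1.0, 1.2, 1.4], [1.0, 1.1], [1.0, 1.15],
                            min_accuracy=0.72, max_latency_ms=12.0)
print(best)
```

Reinforcement learning or evolutionary search would replace only the way candidates are proposed; the evaluate-and-compare loop stays the same.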

いくつかの実施態様では、システムは、１つ以上のコントローラパラメータ値に応じて調整され得る。コントローラパラメータ値は、手作業で調整され得、機械学習技術によって学習され得、またはこれらの組合せでもあり得る。コントローラパラメータは、タプル候補の全体的な性能評価指標に対する各目的の相対的な効果に影響を与え得る。いくつかの例では、タプル候補に含まれる特定の値または値同士の特定の関係は、コントローラパラメータ値に少なくとも一部が反映された理想的なスケーリング係数の学習済み特性に基づいて、好まれたり好まれなかったりし得る。 In some implementations, the system may be tuned in response to one or more controller parameter values. The controller parameter values may be manually tuned, learned by machine learning techniques, or a combination of these. The controller parameters may affect the relative effect of each objective on the overall performance metric of the candidate tuple. In some examples, particular values or relationships between values in the candidate tuple may be favored or disfavored based on learned characteristics of ideal scaling coefficients reflected at least in part in the controller parameter values.

ブロック３４０に従って、システムは、１つ以上の目的トレードオフに応じて、選択したスケーリングパラメータ値候補から、１つ以上のスケーリングパラメータ値グループを生成する。目的トレードオフは、正解率およびレイテンシなど、目的ごとのそれぞれ異なるしきい値を表し得、様々なスケーリングされたニューラルネットワークによって満たされ得る。たとえば、１つの目的トレードオフでは、ネットワーク正解率のしきい値は高いが、推論レイテンシのしきい値は低い（すなわち、精度が高いネットワークでレイテンシが高い）。別の例として、目的トレードオフでは、ネットワーク正解率のしきい値は低いが、推論レイテンシのしきい値は高い（すなわち、精度が低いネットワークでレイテンシが低い）。別の例として、目的トレードオフでは、正解率とレイテンシパフォーマンスとの間でバランスが取られている。 Pursuant to block 340, the system generates one or more scaling parameter value groups from the selected scaling parameter value candidates in response to one or more objective trade-offs. The objective trade-offs may represent different thresholds for each objective, such as accuracy and latency, that may be satisfied by different scaled neural networks. For example, one objective trade-off has a high threshold for network accuracy but a low threshold for inference latency (i.e., high latency for a network with high accuracy). As another example, the objective trade-off has a low threshold for network accuracy but a high threshold for inference latency (i.e., low latency for a network with low accuracy). As another example, the objective trade-off is a balance between accuracy and latency performance.

目的トレードオフごとに、システムは、目的トレードオフを満たすように基本ニューラルネットワークアーキテクチャをスケーリングするためにシステムが使用可能なスケーリングパラメータ値グループを識別し得る。すなわち、システムは、ブロック３１０に示す選択と、ブロック３２０に示す性能評価指標の決定と、ブロック３３０に示す性能評価指標がパフォーマンスしきい値を満たすかどうかの判断とを繰り返し得る（パフォーマンスしきい値が目的トレードオフによって定められるという点は異なる）。いくつかの実施態様では、基本ニューラルネットワークアーキテクチャのタプルを探索するのではなく、システムは、ブロック３３０に従って複数の目的の性能評価指標を最初に満たした選択したスケーリングパラメータ値の候補に応じてスケーリングされた基本ニューラルネットワークアーキテクチャのスケーリング係数タプル候補を探索し得る。 For each objective trade-off, the system may identify a group of scaling parameter values that the system can use to scale the base neural network architecture to satisfy the objective trade-off. That is, the system may iterate between the selection shown in block 310, the determination of the performance evaluation index shown in block 320, and the determination of whether the performance evaluation index satisfies the performance threshold shown in block 330 (except that the performance threshold is determined by the objective trade-off). In some implementations, rather than searching for tuples of the base neural network architecture, the system may search for candidate scaling coefficient tuples of the base neural network architecture scaled according to the selected candidate scaling parameter values that first satisfy the performance evaluation indexes of the multiple objectives according to block 330.
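As a rough sketch of how objective trade-offs could partition evaluated candidates into scaling parameter value groups, the snippet below treats each trade-off as a pair of thresholds. The `Tradeoff` class, the threshold values, and the accuracy and latency figures are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tradeoff:
    name: str
    min_accuracy: float    # lower bound on accuracy
    max_latency_ms: float  # upper bound on inference latency

def group_by_tradeoff(evaluated, tradeoffs):
    """Map each trade-off name to the coefficient tuples that satisfy it."""
    groups = {}
    for t in tradeoffs:
        groups[t.name] = [
            coeffs for coeffs, (acc, lat) in evaluated.items()
            if acc >= t.min_accuracy and lat <= t.max_latency_ms
        ]
    return groups

# Hypothetical (accuracy, latency in ms) results for three coefficient tuples.
evaluated = {
    (1.0, 1.0, 1.0): (0.76, 4.0),
    (1.2, 1.1, 1.15): (0.80, 9.0),
    (1.4, 1.2, 1.3): (0.83, 18.0),
}
tradeoffs = [
    Tradeoff("accuracy_first", min_accuracy=0.82, max_latency_ms=20.0),
    Tradeoff("latency_first", min_accuracy=0.75, max_latency_ms=5.0),
    Tradeoff("balanced", min_accuracy=0.79, max_latency_ms=10.0),
]
groups = group_by_tradeoff(evaluated, tradeoffs)
print(groups)
```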

図２に戻ると、ブロック２５０に示すように、ＮＡＳ－ＬＡＣＳシステムは、複数のスケーリングパラメータ値に応じてスケーリングされた基本ニューラルネットワークのアーキテクチャを用いて、スケーリングされたニューラルネットワークの１つ以上のアーキテクチャを生成し得る。スケーリングされたニューラルネットワークアーキテクチャは、基本ニューラルネットワークアーキテクチャおよび様々なスケーリングパラメータ値から生成されたニューラルネットワークのファミリーであり得る。 Returning to FIG. 2, as shown in block 250, the NAS-LACS system may generate one or more architectures of scaled neural networks using the architecture of the base neural network scaled according to multiple scaling parameter values. The scaled neural network architecture may be a family of neural networks generated from the base neural network architecture and various scaling parameter values.
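To illustrate generating a family from one base architecture, the sketch below applies several coefficient tuples to a small architecture description. The `ArchSpec` fields, the choice of depth, width, and resolution as the scaled dimensions, and the rounding rules (channels rounded to a multiple of 8) are assumptions made for this example.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class ArchSpec:
    num_layers: int     # depth
    base_channels: int  # width
    resolution: int     # input image side length

def scale(base, coeffs):
    """Scale a base spec by (depth, width, resolution) coefficients."""
    d, w, r = coeffs
    return ArchSpec(
        num_layers=math.ceil(base.num_layers * d),
        base_channels=int(round(base.base_channels * w / 8)) * 8,
        resolution=int(round(base.resolution * r)),
    )

base = ArchSpec(num_layers=16, base_channels=32, resolution=224)
family = [scale(base, c) for c in [(1.0, 1.0, 1.0), (1.2, 1.1, 1.15), (1.4, 1.2, 1.3)]]
for member in family:
    print(member)
```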

ターゲットコンピューティングリソースを指定する情報が複数のターゲットコンピューティングリソース、たとえば、複数の様々な種類のハードウェアアクセラレータから構成される１つ以上のセットを含む場合、システムは、ハードウェアアクセラレータごとに、プロセス２００およびプロセス３００を繰り返し、ターゲットセットに各々が対応するスケーリングされたニューラルネットワークのアーキテクチャを生成し得る。ハードウェアアクセラレータごとに、システムは、レイテンシと正解率との間、またはレイテンシと、正解率と、その他の目的（特に、（３）を参照して説明した演算強度および実行効率を含む）との間の様々な目的トレードオフに応じて、スケーリングされたニューラルネットワークアーキテクチャのファミリーを生成し得る。 If the information specifying the target computing resources includes multiple target computing resources, e.g., one or more sets of multiple hardware accelerators of different types, the system may repeat process 200 and process 300 for each hardware accelerator to generate a family of scaled neural network architectures, each corresponding to a target set. For each hardware accelerator, the system may generate a family of scaled neural network architectures according to various objective trade-offs between latency and accuracy, or between latency, accuracy, and other objectives (including, in particular, computational intensity and execution efficiency as described with reference to (3)).

いくつかの実施態様では、システムは、同じ基本ニューラルネットワークアーキテクチャから複数のスケーリングされたニューラルネットワークアーキテクチャのファミリーを生成し得る。この手法は、様々な対象デバイスが同様のハードウェア特性を共有している状況において有用であり得、デバイスごとの対応するスケーリングされたファミリーをより高速に識別できるようになる。なぜならば、少なくとも、たとえば図２のプロセスに示すような基本ニューラルネットワークアーキテクチャの探索が、１回しか実行されないためである。 In some implementations, the system may generate multiple families of scaled neural network architectures from the same base neural network architecture. This approach may be useful where various target devices share similar hardware characteristics, and it allows the corresponding scaled family for each device to be identified more quickly, at least because the search for the base neural network architecture, e.g., as shown in the process of FIG. 2, is performed only once.

例示的なシステム Example System
図４は、本開示の態様に係る、ＮＡＳ－ＬＡＣＳ（ニューラルアーキテクチャ探索-レイテンシを意識した複合スケーリング）システム４００のブロック図である。システム４００は、ニューラルネットワークタスクを実行するための訓練データ４０１と、ターゲットコンピューティングリソースを指定するターゲットコンピューティングリソースのデータ４０２とを受信するように構成される。図１～図３を参照して本明細書において説明したように、システム４００は、スケーリングされたニューラルネットワークアーキテクチャのファミリーを生成するための技術を実施するように構成され得る。 FIG. 4 is a block diagram of a NAS-LACS (Neural Architecture Search-Latency Aware Composite Scaling) system 400 according to aspects of the disclosure. System 400 is configured to receive training data 401 for performing a neural network task and target computing resource data 402 specifying a target computing resource. As described herein with reference to FIGS. 1-3, system 400 may be configured to implement techniques for generating a family of scaled neural network architectures.

システム４００は、ユーザーインタフェースに応じて入力データを受信するように構成され得る。たとえば、システム４００は、システム４００を公開しているＡＰＩ（アプリケーションプログラムインタフェース）に対する呼び出しの一部としてデータを受信し得る。図５を参照して本明細書に説明するが、システム４００は、１つ以上のコンピューティングデバイス上に実装できる。たとえば、ネットワークで１つ以上のコンピューティングデバイスに接続されたリモートストレージを含む記憶媒体を通してシステム４００への入力が行われ得、または、システム４００に連結されたクライアントコンピューティングデバイス上のユーザーインタフェースを通して入力が行われ得る。 System 400 may be configured to receive input data through a user interface. For example, system 400 may receive data as part of a call to an API (application program interface) that exposes system 400. As described herein with reference to FIG. 5, system 400 may be implemented on one or more computing devices. For example, input to system 400 may be provided through a storage medium, including remote storage connected to one or more computing devices over a network, or through a user interface on a client computing device coupled to system 400.

システム４００は、スケーリングされたニューラルネットワークアーキテクチャのファミリーなど、スケーリングされたニューラルネットワークアーキテクチャ４０９を出力するように構成され得る。スケーリングされたニューラルネットワークアーキテクチャ４０９は、たとえばユーザディスプレイ上に表示するための出力として送信され、必要に応じて、アーキテクチャにおいて規定されている各ニューラルネットワーク層の形状およびサイズに従って可視化され得る。いくつかの実施態様では、システム４００は、スケーリングされたニューラルネットワークアーキテクチャ４０９を１つ以上のコンピュータプログラムなど、コンピュータ読み取り可能な命令セットとして提供するように構成され得る。コンピュータ読み取り可能な命令セットは、スケーリングされたニューラルネットワークアーキテクチャ４０９を実装するために、ターゲットコンピューティングリソースによって実行され得る。 The system 400 may be configured to output a scaled neural network architecture 409, such as a family of scaled neural network architectures. The scaled neural network architecture 409 may be sent as an output for display, for example on a user display, and visualized, if desired, according to the shape and size of each neural network layer defined in the architecture. In some implementations, the system 400 may be configured to provide the scaled neural network architecture 409 as a set of computer-readable instructions, such as one or more computer programs. The set of computer-readable instructions may be executed by a target computing resource to implement the scaled neural network architecture 409.

コンピュータプログラムは、たとえば、宣言型、手続き型、アセンブリ、オブジェクト指向、データ指向、関数型、または命令型など、任意のプログラミングパラダイムに従って任意の種類のプログラミング言語で書かれ得る。コンピュータプログラムは、１つ以上の異なる関数を実行し、コンピューティング環境内、たとえば物理デバイス上、仮想マシン上、または複数のデバイス間で動作するように書かれ得る。また、コンピュータプログラムは、本明細書に記載の機能、たとえば、システム、エンジン、モジュール、またはモデルによって実行される機能を実施する。 A computer program may be written in any type of programming language according to any programming paradigm, such as, for example, declarative, procedural, assembly, object-oriented, data-oriented, functional, or imperative. A computer program may be written to perform one or more different functions and to operate within a computing environment, such as on a physical device, on a virtual machine, or across multiple devices. A computer program also implements the functionality described herein, such as the functionality performed by a system, engine, module, or model.

いくつかの実施態様では、システム４００は、アーキテクチャを（必要に応じて、機械学習モデルを生成するためのフレームワークの一部として）コンピュータプログラミング言語で書かれた実行可能なプログラムに変換するために構成された１つ以上のその他のデバイスに、スケーリングされたニューラルネットワークアーキテクチャ４０９用のデータを転送するように構成される。また、システム４００は、スケーリングされたニューラルネットワークアーキテクチャ４０９に対応するデータを、格納して後の検索に用いるため記憶装置に送るように構成され得る。 In some implementations, system 400 is configured to transfer data for scaled neural network architecture 409 to one or more other devices configured to convert the architecture into an executable program written in a computer programming language (optionally as part of a framework for generating a machine learning model). System 400 may also be configured to send data corresponding to scaled neural network architecture 409 to a storage device for storage and subsequent retrieval.

システム４００は、ＮＡＳエンジン４０５を備え得る。ＮＡＳエンジン４０５およびシステム４００のその他の構成要素は、１つ以上のコンピュータプログラム、特別に構成された電子回路、またはこれらの任意の組合せとして実装され得る。ＮＡＳエンジン４０５は、訓練データ４０１とターゲットコンピューティングリソースのデータ４０２とを受信し、基本ニューラルネットワークアーキテクチャ４０７を生成するように構成され得る。基本ニューラルネットワークアーキテクチャ４０７は、ＬＡＣＳエンジン４１５に送信され得る。ＮＡＳエンジン４０５は、図１～図３を参照して本明細書において説明したニューラルアーキテクチャ探索の任意の様々な技術を実装できる。システムは、ターゲットコンピューティングリソース上で実行されたときのニューラルネットワーク候補の推論レイテンシと正解率とを含む複数の目的を用いてＮＡＳを実行するよう、本開示の態様に従って構成され得る。ＮＡＳエンジン４０５が基本ニューラルネットワークアーキテクチャを探索するために利用できるパフォーマンスメトリックを判定することとの一部として、システム４００は、パフォーマンス測定エンジン４１０を備え得る。 The system 400 may include a NAS engine 405. The NAS engine 405 and other components of the system 400 may be implemented as one or more computer programs, specially configured electronic circuits, or any combination thereof. The NAS engine 405 may be configured to receive the training data 401 and the target computing resource data 402 and generate a base neural network architecture 407. The base neural network architecture 407 may be transmitted to the LACS engine 415. The NAS engine 405 may implement any of the various techniques of neural architecture search described herein with reference to FIGS. 1-3. The system may be configured according to aspects of the present disclosure to perform NAS using multiple objectives, including inference latency and accuracy rate of candidate neural networks when executed on the target computing resource. As part of determining performance metrics available to the NAS engine 405 for searching base neural network architectures, the system 400 may include a performance measurement engine 410.

パフォーマンス測定エンジン４１０は、基本ニューラルネットワーク候補のアーキテクチャを受信する、およびＮＡＳエンジン４０５によってＮＡＳを実行するために使われる目的に応じてパフォーマンスメトリックを生成するように構成され得る。パフォーマンスメトリックは、複数の目的に応じてニューラルネットワーク候補の全体的な性能評価指標を提供し得る。基本ニューラルネットワーク候補の正解率を判断するために、パフォーマンス測定エンジン４１０は、たとえば、訓練データ４０１の一部を残しておくことによって検証用の訓練例セットを取得することによって、当該検証セット上で基本ニューラルネットワーク候補を実行し得る。 The performance measurement engine 410 may be configured to receive the architecture of a candidate base neural network and to generate performance metrics according to the objectives that the NAS engine 405 uses to perform NAS. The performance metrics may provide an overall performance evaluation of the candidate neural network according to multiple objectives. To determine the accuracy of the candidate base neural network, the performance measurement engine 410 may obtain a validation set of training examples, for example by holding out a portion of the training data 401, and run the candidate base neural network on that validation set.
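A minimal sketch of the hold-out evaluation described above, assuming labeled examples are stored as `(input, label)` pairs. The function names, the 20% hold-out fraction, and the fixed seed are illustrative choices, not details from the disclosure.

```python
import random

def holdout_split(examples, holdout_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and reserve a fraction for validation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_val:], shuffled[:n_val]  # (training set, validation set)

def accuracy(predict, validation_set):
    """Fraction of validation examples whose predicted label matches the true label."""
    correct = sum(1 for x, y in validation_set if predict(x) == y)
    return correct / len(validation_set)

# Toy data: the label is the parity of the input.
data = [(x, x % 2) for x in range(10)]
train, val = holdout_split(data)
print(len(train), len(val), accuracy(lambda x: x % 2, val))
```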

レイテンシを測定するために、パフォーマンス測定エンジン４１０は、データ４０２によって指定されているターゲットコンピューティングリソースに対応するコンピューティングリソースと通信し得る。たとえば、ターゲットコンピューティングリソースのデータ４０２がＴＰＵを対象リソースと指定している場合、パフォーマンス測定エンジン４１０は、基本ニューラルネットワーク候補を対応するＴＰＵ上で実行するために送信し得る。ＴＰＵは、システム４００を実装する１つ以上のプロセッサと（たとえば、図５を参照してさらに詳細を説明するネットワークで）通信しているデータセンターに収容され得る。 To measure latency, the performance measurement engine 410 may communicate with a computing resource that corresponds to the target computing resource specified by the data 402. For example, if the data 402 for the target computing resource specifies a TPU as the target resource, the performance measurement engine 410 may send the base neural network candidates for execution on the corresponding TPU. The TPU may be housed in a data center in communication (e.g., over a network, which will be described in further detail with reference to FIG. 5) with one or more processors implementing the system 400.

パフォーマンス測定エンジン４１０は、ターゲットコンピューティングリソースが入力を受け付けることと、出力を生成することとの間のレイテンシを示すレイテンシ情報を受信し得る。レイテンシ情報は、現地でターゲットコンピューティングリソースに対して直接測定され、パフォーマンス測定エンジン４１０に送られ得る、または、パフォーマンス測定エンジン４１０自体によって測定され得る。パフォーマンス測定エンジン４１０がレイテンシを測定した場合、エンジン４１０は、基本ニューラルネットワーク候補の処理が原因ではないレイテンシ、たとえば、ターゲットコンピューティングリソースと通信するためのネットワークレイテンシを補償するように構成され得る。別の例として、パフォーマンス測定エンジン４１０は、ターゲットコンピューティングリソースの以前の測定値、およびターゲットコンピューティングリソースのハードウェア特性に基づいて、基本ニューラルネットワーク候補を通った入力の処理のレイテンシを推定し得る。 The performance measurement engine 410 may receive latency information indicating the latency between the target computing resource accepting the input and generating the output. The latency information may be measured directly on-site on the target computing resource and sent to the performance measurement engine 410, or may be measured by the performance measurement engine 410 itself. If the performance measurement engine 410 measures latency, the engine 410 may be configured to compensate for latency not caused by processing of the base neural network candidate, e.g., network latency to communicate with the target computing resource. As another example, the performance measurement engine 410 may estimate the latency of processing the input through the base neural network candidate based on previous measurements of the target computing resource and the hardware characteristics of the target computing resource.
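One simple way to compensate for latency that is not caused by the candidate network is to subtract a separately measured network round-trip time from the end-to-end measurement. The sketch below does this with medians to limit the effect of outliers; the function name and the sample values are hypothetical.

```python
import statistics

def compensated_latency_ms(end_to_end_ms, network_round_trip_ms):
    """Median end-to-end latency minus median network round trip, floored at zero."""
    raw = statistics.median(end_to_end_ms)
    overhead = statistics.median(network_round_trip_ms)
    return max(0.0, raw - overhead)

# Hypothetical measurements in milliseconds; the 30.0 sample is an outlier.
samples = [12.0, 11.5, 12.5, 30.0, 12.0]
pings = [2.0, 2.5, 2.0]
print(compensated_latency_ms(samples, pings))
```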

パフォーマンス測定エンジン４１０は、演算強度および実行効率など、ニューラルネットワークアーキテクチャ候補のその他の特性のパフォーマンスメトリックを生成し得る。図１～図３を参照して本明細書において説明したように、推論レイテンシは、ＦＬＯＰＳ（計算要件）、実行効率、および演算強度に応じて判定され得、いくつかの実施態様では、システム４００は、これらの追加特性に基づいてニューラルネットワークを探索し、直接または間接的にスケーリングする。 The performance measurement engine 410 may generate performance metrics of other characteristics of candidate neural network architectures, such as computational intensity and execution efficiency. As described herein with reference to Figures 1-3, inference latency may be determined as a function of FLOPS (computational requirements), execution efficiency, and computational intensity, and in some implementations, the system 400 explores and directly or indirectly scales neural networks based on these additional characteristics.
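The dependence of latency on computational requirements, computational intensity, and execution efficiency can be illustrated with a roofline-style estimate. This is an assumed model chosen for illustration, since the numbered expression the text refers to is not reproduced in this section; the parameter names and hardware figures below are likewise hypothetical.

```python
def estimate_latency_s(flops, bytes_moved, peak_flops_per_s,
                       mem_bandwidth_bytes_per_s, efficiency):
    """Roofline-style latency estimate (an assumed model, not the patent's formula)."""
    intensity = flops / bytes_moved  # FLOPs per byte of memory traffic
    # The ideal rate is capped by peak compute or by memory bandwidth, whichever is lower.
    ideal_rate = min(peak_flops_per_s, intensity * mem_bandwidth_bytes_per_s)
    return flops / (ideal_rate * efficiency)

# Same FLOP count, different memory traffic, on a hypothetical accelerator.
compute_bound = estimate_latency_s(4e9, 4e7, 1e14, 1e12, efficiency=0.5)
memory_bound = estimate_latency_s(4e9, 4e8, 1e14, 1e12, efficiency=0.5)
print(compute_bound, memory_bound)
```

Under this model two networks with identical FLOP counts can differ widely in latency, which is why FLOPS alone is a poor proxy for the latency objective.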

パフォーマンスメトリックが生成されると、パフォーマンス測定エンジン４１０は、メトリックをＮＡＳエンジン４０５に送り得る。そして、ＮＡＳエンジン４０５は、図２を参照して本明細書において説明したように、停止メトリックに達するまで新しい基本ニューラルネットワークアーキテクチャ候補の新しい探索を繰り返し得る。 Once the performance metrics are generated, the performance measurement engine 410 may send the metrics to the NAS engine 405, which may then repeat a new search for new base neural network architecture candidates, as described herein with reference to FIG. 2, until a stopping metric is reached.

いくつかの例では、ＮＡＳエンジン４０５が次の基本ニューラルネットワークアーキテクチャ候補をどのように選択するかについて調整するための１つ以上のコントローラパラメータに応じて、ＮＡＳエンジン４０５を調整する。コントローラパラメータは、特定のニューラルネットワークタスクのためのニューラルネットワークの所望の特性に応じて、手作業で調整できる。いくつかの例では、コントローラパラメータを、様々な機械学習技術によって学習でき、ＮＡＳエンジン４０５は、レイテンシおよび正解率など、複数の目的に応じて基本ニューラルネットワークアーキテクチャを選択するために訓練された１つ以上の機械学習モデルを実装し得る。たとえば、ＮＡＳエンジン４０５は、以前の基本ニューラルネットワーク候補の特徴量および複数の目的を使用するように訓練された再帰型ニューラルネットワークを実装し、これらの目的を満たす可能性がより高い基本ネットワーク候補を予測し得る。ニューラルネットワークタスクに関連する訓練データセット、およびターゲットコンピューティングリソースのデータが与えられたときに選択された最終基本ニューラルアーキテクチャを示すようラベル付けされた訓練データとパフォーマンスメトリックとを用いて、ニューラルネットワークを訓練できる。 In some examples, the NAS engine 405 is tuned according to one or more controller parameters to adjust how the NAS engine 405 selects the next base neural network architecture candidate. The controller parameters can be manually tuned according to the desired characteristics of the neural network for a particular neural network task. In some examples, the controller parameters can be learned by various machine learning techniques, and the NAS engine 405 can implement one or more machine learning models trained to select the base neural network architecture according to multiple objectives, such as latency and accuracy. For example, the NAS engine 405 can implement a recurrent neural network trained to use features of previous base neural network candidates and multiple objectives to predict base network candidates that are more likely to meet these objectives. The neural network can be trained using a training dataset related to the neural network task, and the training data and performance metrics labeled to indicate the final base neural architecture selected given the data of the target computing resource.

ＬＡＣＳエンジン４１５は、本開示の態様に従って説明したように、レイテンシを意識した複合スケーリングを実行するように構成され得る。ＬＡＣＳエンジン４１５は、基本ニューラルネットワークアーキテクチャを指定するデータ４０７をＮＡＳエンジン４０５から受信するように構成される。ＮＡＳエンジン４０５と同様に、ＬＡＣＳエンジン４１５は、パフォーマンス測定エンジン４１０と通信して、スケーリングされたニューラルネットワークアーキテクチャ候補のパフォーマンスメトリックを取得し得る。図１～図３を参照して本明細書において説明したように、ＬＡＣＳエンジン４１５は、スケーリング係数から構成される異なるタプルのメモリに探索空間を保持し得、また、最終のスケーリングされたアーキテクチャをスケーリングして、スケーリングされたニューラルネットワークアーキテクチャのファミリーを素早く取得するように構成され得る。いくつかの実施態様では、ＬＡＣＳエンジン４１５は、その他の形式のスケーリング、たとえば、単純スケーリングを実行するように構成されるが、ＮＡＳエンジン４０５が用いるレイテンシを含む複数の目的を使用する。 The LACS engine 415 may be configured to perform latency-aware composite scaling as described according to aspects of the present disclosure. The LACS engine 415 is configured to receive data 407 specifying the base neural network architecture from the NAS engine 405. Like the NAS engine 405, the LACS engine 415 may communicate with the performance measurement engine 410 to obtain performance metrics for scaled neural network architecture candidates. As described herein with reference to FIGS. 1-3, the LACS engine 415 may hold in memory a search space of different tuples of scaling coefficients, and may be configured to scale the final scaled architecture to quickly obtain a family of scaled neural network architectures. In some implementations, the LACS engine 415 is configured to perform other forms of scaling, e.g., simple scaling, while still using the multiple objectives, including latency, that the NAS engine 405 uses.

図５は、ＮＡＳ－ＬＡＣＳシステム４００を実装するための例示的な環境５００のブロック図である。システム４００は、サーバコンピューティングデバイス５１５内など１つ以上の場所に１つ以上のプロセッサを有する１つ以上のデバイス上に実装され得る。クライアントコンピューティングデバイス５１２およびサーバコンピューティングデバイス５１５は、ネットワーク５６０で１つ以上の記憶装置５３０に通信可能に連結され得る。記憶装置（複数可）５３０は、揮発性メモリと不揮発性メモリとの組合せであり得、コンピューティングデバイス５１２、５１５と物理的に同じ位置にあってもよく、物理的に異なる位置にあってもよい。たとえば、記憶装置（複数可）５３０は、ハードドライブ、ソリッドステートドライブ、テープドライブ、光記憶装置、メモリカード、ＲＯＭ、ＲＡＭ、ＤＶＤ、ＣＤ－ＲＯＭ、書き込み可能メモリ、および読取り専用メモリなど、情報を格納可能な任意の種類の非一時的なコンピュータ読み取り可能な媒体を含み得る。 FIG. 5 is a block diagram of an example environment 500 for implementing the NAS-LACS system 400. The system 400 may be implemented on one or more devices having one or more processors in one or more locations, such as in a server computing device 515. The client computing device 512 and the server computing device 515 may be communicatively coupled to one or more storage devices 530 over a network 560. The storage device(s) 530 may be a combination of volatile and non-volatile memory and may be in the same physical location as the computing devices 512, 515 or in a different physical location. For example, the storage device(s) 530 may include any type of non-transitory computer-readable medium capable of storing information, such as hard drives, solid-state drives, tape drives, optical storage devices, memory cards, ROM, RAM, DVDs, CD-ROMs, writable memory, and read-only memory.

サーバコンピューティングデバイス５１５は、１つ以上のプロセッサ５１３と、メモリ５１４とを備え得る。メモリ５１４は、プロセッサ（複数可）５１３がアクセスできる情報を格納し得、この情報は、プロセッサ（複数可）５１３によって実行され得る命令５２１を含む。また、メモリ５１４は、プロセッサ（複数可）５１３が検索したり、操作したり、格納したりできるデータ５２３を含み得る。メモリ５１４は、揮発性メモリおよび不揮発性メモリなど、プロセッサ（複数可）５１３がアクセスできる情報を格納可能な種類の非一時的なコンピュータ読み取り可能な媒体であり得る。プロセッサ（複数可）５１３は、１つ以上のＣＰＵ（ＣｅｎｔｒａｌＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）、１つ以上のＧＰＵ（ＧｒａｐｈｉｃＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）、１つ以上のＦＰＧＡ（ＦｉｅｌｄＰｒｏｇｒａｍｍａｂｌｅＧａｔｅＡｒｒａｙ）、および／または、ＴＰＵ（ＴｅｎｓｏｒＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）などの１つ以上のＡＳＩＣ（特定用途向け集積回路）を含み得る。 The server computing device 515 may include one or more processors 513 and a memory 514. The memory 514 may store information accessible to the processor(s) 513, including instructions 521 that may be executed by the processor(s) 513. The memory 514 may also include data 523 that the processor(s) 513 may retrieve, manipulate, or store. The memory 514 may be any type of non-transitory computer-readable medium capable of storing information accessible to the processor(s) 513, such as volatile and non-volatile memory. The processor(s) 513 may include one or more central processing units (CPUs), one or more graphics processing units (GPUs), one or more field-programmable gate arrays (FPGAs), and/or one or more application-specific integrated circuits (ASICs), such as a tensor processing unit (TPU).

命令５２１は、１つ以上の命令を含み得る。当該１つ以上の命令は、プロセッサ（複数可）５１３によって実行されると、命令が定める動作を１つ以上のプロセッサに実行させる。命令５２１は、プロセッサ（複数可）５１３によって直接処理されるオブジェクトコード形式、または、要求に基づいて解釈されたり予めコンパイルされたりする独立したソースコードモジュールから構成される解釈可能なスクリプトまたはコレクションを含む、その他の形式で格納できる。命令５２１は、本開示の態様と一致したシステム４００を実装するための命令を含み得る。システム４００は、プロセッサ（複数可）５１３を用いて実行でき、および／またはサーバコンピューティングデバイス５１５から遠隔の場所に置かれているその他のプロセッサを用いて実行できる。 Instructions 521 may include one or more instructions that, when executed by processor(s) 513, cause the one or more processors to perform the operations defined by the instructions. Instructions 521 may be stored in object code format for direct processing by processor(s) 513, or in other formats, including interpretable scripts or collections of separate source code modules that are interpreted or pre-compiled upon request. Instructions 521 may include instructions for implementing system 400 consistent with aspects of the present disclosure. System 400 may be executed using processor(s) 513 and/or other processors located remotely from server computing device 515.

データ５２３は、命令５２１に従って、プロセッサ（複数可）５１３によって検索されたり、格納されたり、または修正されたりし得る。データ５２３は、コンピュータレジスタに格納でき、複数の異なるフィールドおよびレコードを有するテーブルとしてリレーショナルデータベースもしくは非リレーショナルデータベースに格納でき、またはＪＳＯＮ、ＹＡＭＬ、ｐｒｏｔｏ、もしくはＸＭＬ文書として格納できる。また、データ５２３は、バイナリ値、ＡＳＣＩＩ、またはＵｎｉｃｏｄｅなど、コンピュータが読み取り可能な形式にフォーマットされ得るが、これらに限定されない。また、データ５２３は、数字、説明文、プロプライエタリコード、ポインタ、その他のネットワークの場所などその他のメモリに格納されているデータへのリファレンスなど、関連性のある情報を特定するのに十分な情報、または、関連性のあるデータを計算するために関数が用いる情報を含み得る。 Data 523 may be retrieved, stored, or modified by processor(s) 513 according to instructions 521. Data 523 may be stored in computer registers, in a relational or non-relational database as a table with multiple different fields and records, or as a JSON, YAML, proto, or XML document. Data 523 may also be formatted in a computer-readable format, such as, but not limited to, binary values, ASCII, or Unicode. Data 523 may also include information sufficient to identify the relevant information, such as numbers, descriptive text, proprietary codes, pointers, and references to data stored in other memories, including other network locations, or information used by a function to calculate the relevant data.

また、クライアントコンピューティングデバイス５１２は、サーバコンピューティングデバイス５１５と同様に、１つ以上のプロセッサ５１６、メモリ５１７、命令５１８、およびデータ５１９で構成できる。クライアントコンピューティングデバイス５１２も、ユーザ出力部５２６と、ユーザ入力部５２４とを含み得る。ユーザ入力部５２４は、キーボード、マウス、機械式アクチュエータ、ソフトアクチュエータ、タッチスクリーン、マイクロフォン、およびセンサーなど、ユーザから入力を受け付けるための任意の適切なメカニズムまたは技術を含み得る。 The client computing device 512, like the server computing device 515, may also include one or more processors 516, memory 517, instructions 518, and data 519. The client computing device 512 may also include a user output 526 and a user input 524. The user input 524 may include any suitable mechanism or technology for accepting input from a user, such as a keyboard, a mouse, a mechanical actuator, a soft actuator, a touch screen, a microphone, and a sensor.

サーバコンピューティングデバイス５１５は、クライアントコンピューティングデバイス５１２にデータを送信するように構成され得、クライアントコンピューティングデバイス５１２は、受信データの少なくとも一部を、ユーザ出力部５２６の一部として実装されたディスプレイに表示するように構成され得る。また、クライアントコンピューティングデバイス５１２とサーバコンピューティングデバイス５１５との間のインタフェースを表示するためにユーザ出力部５２６を用いることができる。これに代えてまたはこれに加えて、ユーザ出力部５２６は、１つ以上のスピーカー、変換器またはその他の音声出力部、クライアントコンピューティングデバイス５１２のプラットフォームユーザに非視覚的かつ非可聴式情報を提供する触覚インタフェースまたはその他の触覚フィードバックを含み得る。 The server computing device 515 may be configured to transmit data to the client computing device 512, which may be configured to display at least a portion of the received data on a display implemented as part of the user output 526. The user output 526 may also be used to display an interface between the client computing device 512 and the server computing device 515. Alternatively or additionally, the user output 526 may include one or more speakers, transducers or other audio outputs, a haptic interface or other haptic feedback that provides non-visual and non-audible information to a platform user of the client computing device 512.

図５では、プロセッサ５１３、５１６およびメモリ５１４、５１７がコンピューティングデバイス５１５、５１２内にあると図示されているが、プロセッサ５１３、５１６およびメモリ５１４、５１７を含む本明細書に記載の構成要素は、物理的に異なる位置でそれぞれ動作でき、かつ、同じコンピューティングデバイス内に存在しない複数のプロセッサおよび複数のメモリを含み得る。たとえば、命令５２１、５１８およびデータ５２３、５１９の一部をリムーバブルＳＤカード上に格納し、残りを読取り専用コンピュータチップ内に格納できる。命令およびデータの一部またはすべては、プロセッサ５１３、５１６から物理的に離れた場所ではあるがプロセッサ５１３、５１６がアクセスできる場所に格納できる。同様に、プロセッサ５１３、５１６は、同時に動作できるおよび／または逐次動作できるプロセッサの集合を含み得る。コンピューティングデバイス５１５、５１２は、各々、タイミング情報を提供する１つ以上の内部クロックを備え得る。タイミング情報は、コンピューティングデバイス５１５、５１２によって実行される演算およびプログラムの時間を測定するために用いられ得る。 Although FIG. 5 illustrates processors 513, 516 and memories 514, 517 as being within computing devices 515, 512, the components described herein, including processors 513, 516 and memories 514, 517, may include multiple processors and multiple memories that operate in different physical locations and are not within the same computing device. For example, some of the instructions 521, 518 and data 523, 519 may be stored on a removable SD card and the rest in a read-only computer chip. Some or all of the instructions and data may be stored in a location physically remote from, but still accessible to, the processors 513, 516. Similarly, the processors 513, 516 may include a collection of processors that may operate concurrently and/or sequentially. The computing devices 515, 512 may each include one or more internal clocks that provide timing information. The timing information may be used to time operations and programs executed by the computing devices 515, 512.

サーバコンピューティングデバイス５１５は、ハードウェアアクセラレータ５５１Ａ～５５１Ｎが収容されているデータセンター５５０にネットワーク５６０で接続され得る。データセンター５５０は、複数のデータセンターのうち１つであり得、またはハードウェアアクセラレータなど様々な種類のコンピューティングデバイスが置かれているその他の設備のうち１つであり得る。本明細書において説明したように、データセンター５５０に収容されているコンピューティングリソースは、スケーリングされたニューラルネットワークアーキテクチャをデプロイするためのターゲットコンピューティングリソースの一部として指定され得る。 The server computing device 515 may be connected by a network 560 to a data center 550 in which the hardware accelerators 551A-551N are housed. The data center 550 may be one of multiple data centers or other facilities in which various types of computing devices, such as hardware accelerators, are located. As described herein, the computing resources housed in the data center 550 may be designated as part of the target computing resources for deploying the scaled neural network architecture.

サーバコンピューティングデバイス５１５は、データセンター５５０にあるコンピューティングリソース上のクライアントコンピューティングデバイス５１２から、データを処理する要求を受け付けるように構成され得る。たとえば、環境５００は、様々なユーザーインタフェースおよび／またはプラットフォームサービスを公開しているＡＰＩを通して様々なサービスをユーザに提供するように構成されたコンピューティングプラットフォームの一部であり得る。１つ以上のサービスは、機械学習フレームワークであり得、または、指定のタスクおよび訓練データに応じてニューラルネットワークもしくはその他の機械学習モデルを生成するためのツールセットであり得る。クライアントコンピューティングデバイス５１２は、特定のニューラルネットワークタスクを実行するように訓練されたニューラルネットワークを実行するために割り当てられるターゲットコンピューティングリソースを指定するデータを受送信し得る。図１～図４を参照して本明細書において説明した本開示の態様によると、ＮＡＳ－ＬＡＣＳシステム４００は、ターゲットコンピューティングリソースを指定するデータと訓練データとを受信し、それに応答して、ターゲットコンピューティングリソース上にデプロイするためのスケーリングされたニューラルネットワークアーキテクチャのファミリーを生成し得る。 The server computing device 515 may be configured to accept requests from the client computing device 512 to process data on computing resources in the data center 550. For example, the environment 500 may be part of a computing platform configured to provide various services to users through various user interfaces and/or APIs exposing platform services. One or more of the services may be a machine learning framework or a set of tools for generating neural networks or other machine learning models according to a specified task and training data. The client computing device 512 may send and receive data specifying target computing resources to be allocated for executing a neural network trained to perform a particular neural network task. According to aspects of the disclosure described herein with reference to FIGS. 1-4, the NAS-LACS system 400 may receive the data specifying the target computing resources and the training data and, in response, generate a family of scaled neural network architectures for deployment on the target computing resources.

As another example of a service that a platform implementing the environment 500 may provide, the server computing device 515 can maintain a variety of families of scaled neural network architectures, according to the different target computing resources that may be available in the data center 550. For example, the server computing device 515 can maintain different families for deploying neural networks on the various types of TPUs and/or GPUs housed in the data center 550 or otherwise available for processing.
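As a minimal sketch of how such families might be maintained per target computing resource, the following keeps one family per accelerator type and looks it up on request. The class `ScaledArchitecture`, the registry `FAMILIES`, the function `family_for`, the accelerator labels, and all numeric values are hypothetical and serve only as illustration; they are not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ScaledArchitecture:
    """One member of a family (hypothetical fields, for illustration only)."""
    name: str
    depth_multiplier: float
    width_multiplier: float
    input_resolution: int


# Hypothetical registry: one family of scaled architectures per accelerator type.
FAMILIES: Dict[str, List[ScaledArchitecture]] = {
    "tpu": [
        ScaledArchitecture("tpu-small", 1.0, 1.0, 224),
        ScaledArchitecture("tpu-large", 1.4, 1.2, 300),
    ],
    "gpu": [
        ScaledArchitecture("gpu-small", 1.0, 1.0, 224),
        ScaledArchitecture("gpu-large", 1.2, 1.4, 288),
    ],
}


def family_for(accelerator_type: str) -> List[ScaledArchitecture]:
    """Return the family maintained for the given accelerator type."""
    try:
        return FAMILIES[accelerator_type]
    except KeyError:
        raise LookupError(
            f"no family maintained for accelerator type {accelerator_type!r}"
        ) from None
```

A request naming a target accelerator type would then resolve to the corresponding family, e.g. `family_for("tpu")`.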

The devices 512, 515 and the data center 550 can communicate directly or indirectly over the network 560. For example, using a network socket, the client computing device 512 can connect to a service operating in the data center 550 through an Internet protocol. The devices 515, 512 can set up listening sockets that can accept an initiating connection for sending and receiving information. The network 560 itself can include various configurations and protocols, including the Internet, the World Wide Web, intranets, virtual private networks, wide area networks, local networks, and private networks using communication protocols proprietary to one or more companies. The network 560 can support a variety of short-range and long-range connections. The short-range and long-range connections can be made over different frequency bands, such as 2.402 GHz to 2.480 GHz (commonly associated with the Bluetooth® standard) and 2.4 GHz and 5 GHz (commonly associated with the Wi-Fi® communication protocol), or using a variety of communication standards, such as the LTE® standard for wireless broadband communication. In addition or alternatively, the network 560 can support wired connections between the devices 512, 515 and the data center 550, including over various types of Ethernet® connections.
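The listening-socket arrangement described above can be sketched with Python's standard `socket` module. This is only an illustration under stated assumptions: both endpoints run on the loopback address of one machine, the "service" simply returns the received bytes in upper case, and the function names `serve_once` and `round_trip` are hypothetical.

```python
import socket
import threading


def serve_once(listener: socket.socket) -> None:
    """Accept one initiating connection and reply with the data upper-cased."""
    conn, _addr = listener.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(data.upper())


def round_trip(message: bytes) -> bytes:
    """Set up a listening socket, connect to it as a client, and return the reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        # Port 0 asks the operating system to pick any free port.
        listener.bind(("127.0.0.1", 0))
        listener.listen(1)
        port = listener.getsockname()[1]

        worker = threading.Thread(target=serve_once, args=(listener,))
        worker.start()

        # Client side: open a TCP connection to the listening socket.
        with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
            client.sendall(message)
            reply = client.recv(1024)

        worker.join()
        return reply
```

In a deployment such as the one described, the listener would run on the server computing device or in the data center and the client on a separate device; the loopback address is used here only to keep the sketch self-contained.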

Although a single server computing device 515, a single client computing device 512, and a single data center 550 are shown in FIG. 5, it should be understood that aspects of the disclosure can be implemented according to a variety of configurations and quantities of computing devices, including in paradigms for sequential or parallel processing, or over a distributed network of multiple devices. In some implementations, aspects of the disclosure can be performed on a single device connected to multiple hardware accelerators configured for processing neural networks, or any combination thereof.

Example Use Cases
As described herein, aspects of the disclosure provide for generating an architecture of a neural network scaled from a base neural network according to a multi-objective approach. Examples of neural network tasks follow.

As an example, the input to the neural network can be in the form of images or video. A neural network can be configured to extract, identify, and generate features as part of processing a given input, for example as part of a computer vision task. A neural network trained to perform this type of neural network task can be trained to generate one output classification from a set of different possible classifications. In addition or alternatively, the neural network can be trained to output a score corresponding to an estimate that a subject identified in the image or video belongs to a particular class.
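To illustrate the two kinds of output described above, the following sketch converts raw network outputs into per-class scores and a single output classification. The use of a softmax function is an assumption made for illustration (the disclosure does not name a scoring function), and the function names and class labels are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Map raw outputs to scores that are positive and sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits: List[float], class_names: List[str]) -> Tuple[str, Dict[str, float]]:
    """Return one output classification and the score estimated for each class."""
    scores = dict(zip(class_names, softmax(logits)))
    best = max(scores, key=scores.get)
    return best, scores
```

For instance, `classify([2.0, 0.5, -1.0], ["person", "place", "thing"])` returns the class with the largest raw output together with a score for every class.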

As another example, the input to the neural network can be a data file corresponding to a particular format, e.g., an HTML file, a word processing document, or formatted metadata obtained from other types of data, such as metadata for an image file. A neural network task in this context can be to classify, score, or otherwise predict some characteristic of the received input. For example, a neural network can be trained to predict the probability that a received input includes text relating to a particular subject. Also as part of performing a particular task, the neural network can be trained to generate text predictions, for example as part of a tool for auto-completing text in a document as the document is being composed. A neural network can also be trained to predict a translation of text in an input document into a target language, for example as a message is being composed.

Other types of input documents can be data relating to characteristics of a network of interconnected devices. These input documents can include activity logs, as well as records concerning the access privileges of different computing devices to different sources of potentially sensitive data. A neural network can be trained to process these and other types of documents to predict ongoing or future security breaches of the network. For example, the neural network can be trained to predict intrusion into the network by a malicious actor.

As another example, the input to the neural network can be audio input, including streamed audio, pre-recorded audio, and audio that is part of a video or other source or media. A neural network task in the audio context can include speech recognition, including isolating speech from other identified audio sources and/or enhancing characteristics of the identified speech so that it is easier to hear. A neural network can be trained to predict an accurate translation of input speech into a target language, for example in real time as part of a translation tool.

In addition to data input, including the various types of data described herein, a neural network can also be trained to process features corresponding to a given input. Features are values, e.g., numerical or categorical values, that relate to some characteristic of the input. For example, in the context of an image, a feature of the image can relate to the RGB value of each pixel in the image. A neural network task in the image/video context can be to classify the contents of an image or video, for example for the presence of different people, places, or things. Neural networks can be trained to extract and select relevant features to be processed to generate an output for a given input, and can also be trained to generate new features based on learned relationships between various characteristics of the input data.
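The following sketch shows one way per-pixel RGB values could be turned into numeric features. The representation of an image as nested lists of `(R, G, B)` tuples and the function names are assumptions made for illustration. The per-channel mean is a hand-written derived feature; it stands in for, and is not, a feature that a network would learn.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # rows of (R, G, B) pixels, each channel in 0..255


def pixel_features(image: Image) -> List[float]:
    """Flatten an image into one feature vector, scaling each channel to [0, 1]."""
    return [channel / 255.0 for row in image for pixel in row for channel in pixel]


def channel_means(image: Image) -> Tuple[float, float, float]:
    """Derive three summary features: the mean R, G, and B value over all pixels."""
    pixels = [pixel for row in image for pixel in row]
    count = len(pixels)
    return (
        sum(p[0] for p in pixels) / count,
        sum(p[1] for p in pixels) / count,
        sum(p[2] for p in pixels) / count,
    )
```

A 2 x 2 image thus yields a 12-element feature vector from `pixel_features` and three summary features from `channel_means`.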

Aspects of this disclosure can be implemented in digital circuits, computer-readable storage media, as one or more computer programs, or as a combination of one or more of the foregoing. The computer-readable storage media can be non-transitory, e.g., as one or more instructions executable by one or more processors and stored on a tangible storage device.

In this specification, the phrase "configured to" is used in different contexts relating to computer systems, hardware, or part of a computer program. When a system is said to be configured to perform one or more operations, this means that the system has appropriate software, firmware, and/or hardware installed on it that, when in operation, causes the system to perform the one or more operations. When hardware is said to be configured to perform one or more operations, this means that the hardware includes one or more circuits that, when in operation, receive input and generate output according to the input and corresponding to the one or more operations. When a computer program is said to be configured to perform one or more operations, this means that the computer program includes one or more program instructions that, when executed by one or more computers, cause the one or more computers to perform the one or more operations.

Although the operations shown in the drawings and recited in the claims are shown in a particular order, it should be understood that the operations can be performed in orders different from those shown, and that some operations can be omitted, performed more than once, and/or performed in parallel with other operations. Further, the separation of different system components configured for performing different operations should not be understood as requiring that the components be separated. The components, modules, programs, and engines described can be integrated together as a single system, or be part of multiple systems.

Unless otherwise stated, most of the foregoing alternative examples are not mutually exclusive, but may be implemented in various combinations to achieve unique advantages. As these and other variations and combinations of the features discussed above can be utilized without departing from the subject matter defined by the appended claims, the foregoing description of the embodiments should be taken by way of illustration rather than by way of limitation of the subject matter defined by the claims. In addition, the provision of the examples described herein, as well as clauses phrased as "such as," "including," and the like, should not be interpreted as limiting the subject matter of the claims to the specific examples; rather, the examples are intended to illustrate only one of many possible embodiments. Further, the same reference numbers in different drawings can identify the same or similar elements.

Claims

1. A computer-implemented method for determining an architecture of a neural network, the method comprising:
receiving, by one or more processors, information specifying a target computing resource;
receiving, by the one or more processors, data specifying an architecture of a base neural network;
identifying, by the one or more processors, a plurality of scaling parameter values for scaling the base neural network according to the information specifying the target computing resource and a plurality of scaling parameters of the base neural network, the identifying comprising:
selecting a plurality of candidate scaling parameter values; and
determining a performance evaluation metric of the base neural network scaled according to the plurality of candidate scaling parameter values, wherein the performance evaluation metric is determined according to a plurality of objectives including a latency objective; and
generating, by the one or more processors, an architecture of a scaled neural network using the architecture of the base neural network scaled according to the plurality of scaling parameter values.

2. The method of claim 1, wherein the plurality of objectives is a plurality of second objectives, and
wherein receiving the data specifying the architecture of the base neural network comprises:
receiving, by the one or more processors, training data corresponding to a neural network task; and
performing, by the one or more processors, a neural architecture search of a search space using the training data to identify the architecture of the base neural network according to a plurality of first objectives.

3. The method of claim 2, wherein the search space includes candidate neural network layers, each candidate neural network layer configured to perform one or more operations, and
wherein the search space includes candidate neural network layers that include different activation functions.

4. The method of claim 3, wherein the architecture of the base neural network includes a plurality of candidate components, each component having a plurality of neural network layers, and
wherein the search space includes a plurality of candidate components of candidate neural network layers, including a first candidate component that includes a first activation function and a second candidate component that includes a second activation function different from the first activation function.

5. The method of claim 2, wherein the plurality of first objectives for performing the neural architecture search is the same as the plurality of second objectives for identifying the plurality of scaling parameter values.

6. The method of claim 2, wherein the plurality of first objectives and the plurality of second objectives include an accuracy objective corresponding to the accuracy of output of the base neural network when trained using the training data.

7. The method of any one of claims 1 to 6, wherein the performance evaluation metric corresponds at least in part to a measure of latency between the base neural network receiving an input and generating an output when the base neural network is scaled according to the plurality of candidate scaling parameter values and deployed on the target computing resource.

8. The method of any one of claims 1 to 7, wherein the latency objective corresponds to a minimum latency between the base neural network receiving an input and generating an output when the base neural network is deployed on the target computing resource.

9. The method of claim 2, wherein the information specifying the target computing resource specifies one or more hardware accelerators, and
wherein the method further comprises executing the scaled neural network on the one or more hardware accelerators to perform the neural network task.

10. The method of claim 9, wherein the target computing resource is a first target computing resource, and the plurality of scaling parameter values is a plurality of first scaling parameter values, and
wherein the method further comprises:
receiving, by the one or more processors, information specifying a second target computing resource different from the first target computing resource; and
identifying a plurality of second scaling parameter values for scaling the base neural network according to the information specifying the second target computing resource, the plurality of second scaling parameter values being different from the plurality of first scaling parameter values.

11. The method of claim 1, wherein the plurality of scaling parameter values is a plurality of first scaling parameter values, and
wherein the method further comprises generating an architecture of a scaled neural network from the architecture of the base neural network scaled using a plurality of second scaling parameter values, the second scaling parameter values being generated according to the plurality of first scaling parameter values and one or more compound coefficients that uniformly modify the value of each of the first scaling parameter values.

12. The method of any one of claims 1 to 9, wherein the base neural network is a convolutional neural network, and the plurality of scaling parameters includes one or more of a depth of the base neural network, a width of the base neural network, and a resolution of input to the base neural network.

13. A system comprising:
one or more processors; and
one or more storage devices coupled to the one or more processors and storing instructions that, when executed by the one or more processors, cause the one or more processors to perform operations for determining an architecture of a neural network, the operations comprising:
receiving information specifying a target computing resource;
receiving data specifying an architecture of a base neural network;
identifying a plurality of scaling parameter values for scaling the base neural network according to the information specifying the target computing resource and a plurality of scaling parameters of the base neural network, the identifying comprising:
selecting a plurality of candidate scaling parameter values; and
determining a performance evaluation metric of the base neural network scaled according to the plurality of candidate scaling parameter values, wherein the performance evaluation metric is determined according to a plurality of objectives including a latency objective; and
generating an architecture of a scaled neural network using the architecture of the base neural network scaled according to the plurality of scaling parameter values.

14. The system of claim 13, wherein the plurality of objectives is a plurality of second objectives, and
wherein receiving the data specifying the architecture of the base neural network comprises:
receiving training data corresponding to a neural network task; and
performing a neural architecture search of a search space using the training data to identify the architecture of the base neural network according to a plurality of first objectives.

15. The system of claim 14, wherein the search space includes candidate neural network layers, each candidate neural network layer configured to perform one or more operations, and
wherein the search space includes candidate neural network layers that include different activation functions.

16. The system of claim 14, wherein the plurality of first objectives for performing the neural architecture search is the same as the plurality of second objectives for identifying the plurality of scaling parameter values.

17. The system of claim 14, wherein the plurality of first objectives and the plurality of second objectives include an accuracy objective corresponding to the accuracy of output of the base neural network when trained using the training data.

18. The system of any one of claims 13 to 17, wherein the performance evaluation metric corresponds at least in part to a measure of latency between the base neural network receiving an input and generating an output when the base neural network is scaled according to the plurality of candidate scaling parameter values and deployed on the target computing resource.

19. The system of any one of claims 13 to 18, wherein the latency objective corresponds to a minimum latency between the base neural network receiving an input and generating an output when the base neural network is deployed on the target computing resource.

20. A program that causes one or more processors to perform the method of any one of claims 1 to 12.
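By way of illustration only, the selection of candidate scaling parameter values, their evaluation against accuracy and latency objectives, and the subsequent application of a compound coefficient recited in the claims above might be sketched as follows. Everything in this sketch is an assumption rather than the claimed implementation: `estimate_accuracy` and `estimate_latency_ms` are toy stand-ins for training a candidate and measuring it on the target hardware, the weighted-product form of `performance` and its default weight are illustrative choices, and raising each value to the power of the coefficient in `compound_scale` is one possible way to modify every scaling parameter value uniformly. The toy latency model is proportional to a FLOPs-like quantity purely to keep the example self-contained; as the overview of this disclosure notes, computational requirements alone do not determine latency.

```python
import itertools
import math
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Scaling:
    """Scaling parameter values relative to the base network (1.0 = unscaled)."""
    depth: float
    width: float
    resolution: float


def relative_cost(s: Scaling) -> float:
    # Toy proxy: cost grows linearly in depth, quadratically in width and resolution.
    return s.depth * s.width ** 2 * s.resolution ** 2


def estimate_accuracy(s: Scaling) -> float:
    # Hypothetical stand-in for training and evaluating the scaled network.
    return min(0.99, 0.75 + 0.05 * math.log2(relative_cost(s)))


def estimate_latency_ms(s: Scaling) -> float:
    # Hypothetical stand-in for a measurement on the target computing resource.
    return 10.0 * relative_cost(s)


def performance(s: Scaling, target_latency_ms: float, weight: float = -0.07) -> float:
    """Multi-objective metric: accuracy discounted by latency relative to a target."""
    return estimate_accuracy(s) * (estimate_latency_ms(s) / target_latency_ms) ** weight


def search(candidates: Iterable[Scaling], target_latency_ms: float) -> Scaling:
    """Select the candidate scaling parameter values with the best metric."""
    return max(candidates, key=lambda s: performance(s, target_latency_ms))


def compound_scale(first: Scaling, phi: float) -> Scaling:
    """Derive second scaling values by applying one coefficient to every first value."""
    return Scaling(first.depth ** phi, first.width ** phi, first.resolution ** phi)


candidates = [
    Scaling(d, w, r)
    for d, w, r in itertools.product([1.0, 1.2, 1.4], [1.0, 1.1, 1.2], [1.0, 1.15, 1.3])
]
best = search(candidates, target_latency_ms=20.0)
family = [compound_scale(best, phi) for phi in (1, 2, 3)]
```

Here `family` plays the role of a family of scaled architectures for one target computing resource; repeating the search with a different latency estimator would correspond to identifying different scaling parameter values for a second target computing resource.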