
TW202230221A - Neural architecture scaling for hardware accelerators - Google Patents

Neural architecture scaling for hardware accelerators

Info

Publication number
TW202230221A
TW202230221A
Authority
TW
Taiwan
Prior art keywords
neural network
architecture
parameter values
scaling
candidate
Prior art date
Application number
TW110124428A
Other languages
Chinese (zh)
Inventor
予寧 李
盛 李
譚明星
若鳴 龐
立群 程
國 V 樂
諾曼 保羅 約皮
Original Assignee
Google LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US 17/175,029 (published as US 2022/0230048 A1)
Application filed by Google LLC
Publication of TW202230221A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Complex Calculations (AREA)
  • Image Analysis (AREA)
  • Advance Control (AREA)

Abstract

Methods, systems, and apparatus, including computer-readable media, are described for scaling neural network architectures on hardware accelerators. A method includes receiving training data and information specifying target computing resources, and performing, using the training data, a neural architecture search over a search space to identify an architecture for a base neural network. A plurality of scaling parameter values for scaling the base neural network can be identified, which can include repeatedly selecting a plurality of candidate scaling parameter values and determining a measure of performance for the base neural network scaled according to the plurality of candidate scaling parameter values, in accordance with a plurality of second objectives including a latency objective. An architecture for a scaled neural network can be determined using the architecture of the base neural network scaled according to the plurality of scaling parameter values.

Description

Neural Architecture Scaling for Hardware Accelerators

A neural network is a machine learning model that includes one or more layers of nonlinear operations to predict an output for a received input. In addition to an input layer and an output layer, some neural networks also include one or more hidden layers. The output of each hidden layer can be input to another hidden layer or to the output layer of the neural network. Each layer of the neural network can generate a respective output from a received input according to the values of one or more model parameters of the layer. The model parameters can be weights or biases determined through a training algorithm to cause the neural network to produce accurate outputs.

A system implemented in accordance with aspects of the disclosure can reduce the latency of a neural network architecture by searching candidate neural network architectures jointly according to each candidate's computation requirement (e.g., FLOPS), operational intensity, and execution efficiency. Computation requirement, operational intensity, and execution efficiency were found together to be the root causes of a neural network's latency on target computing resources, rather than the computation requirement alone determining latency (including inference latency), as described herein. Aspects of the disclosure provide techniques for performing neural architecture search and scaling, such as latency-aware compound scaling, and for augmenting the space of candidate networks searched, based on this observed relationship between latency and computation, operational intensity, and execution efficiency.

Furthermore, the system can perform compound scaling to scale multiple parameters of a neural network uniformly and according to multiple objectives, which can result in improved performance of the scaled neural network over approaches in which only a single objective is considered or in which the scaling parameters of a neural network are searched separately. Latency-aware compound scaling can be used to quickly build a family of neural network architectures, scaled according to different values from an initial scaled neural network architecture and suited to different use cases.

According to aspects of the disclosure, a computer-implemented method for determining an architecture of a neural network includes: receiving, by one or more processors, training data corresponding to a neural network task and information specifying target computing resources; performing, by the one or more processors and using the training data, a neural architecture search over a search space according to a plurality of first objectives to identify an architecture for a base neural network; and identifying, by the one or more processors, a plurality of scaling parameter values for scaling the base neural network, according to the information specifying the target computing resources and a plurality of scaling parameters of the base neural network. The identifying can include repeatedly performing the steps of: selecting a plurality of candidate scaling parameter values; and determining a measure of performance for the base neural network scaled according to the plurality of candidate scaling parameter values, wherein the measure of performance is determined in accordance with a plurality of second objectives including a latency objective. The method can include generating, by the one or more processors, an architecture for a scaled neural network using the architecture of the base neural network scaled according to the plurality of scaling parameter values.

The foregoing and other implementations can each optionally include one or more of the following features, alone or in combination.

The plurality of first objectives used for performing the neural architecture search can be the same as the plurality of second objectives used for identifying the plurality of scaling parameter values.

The plurality of first objectives and the plurality of second objectives can include an accuracy objective corresponding to the accuracy of the output of the base neural network.

The measure of performance can correspond at least in part to a measure of the latency between the base neural network receiving an input and generating an output, when the base neural network is scaled according to the plurality of candidate scaling parameter values and deployed on the target computing resources.

The latency objective can correspond to a minimum latency between the base neural network receiving an input and generating an output when the base neural network is deployed on the target computing resources.

The search space can include candidate neural network layers, each candidate neural network layer configured to perform one or more respective operations. The search space can include candidate neural network layers that include different respective activation functions.

The architecture of the base neural network can include a plurality of components, each component having a respective plurality of neural network layers. The search space can include a plurality of candidate components of candidate neural network layers, including: a first component of candidate network layers that includes a first activation function; and a second component of candidate network layers that includes a second activation function different from the first activation function.

The information specifying the target computing resources can specify one or more hardware accelerators, and the method can further include executing the scaled neural network on the one or more hardware accelerators to perform the neural network task.

The target computing resources can include first target computing resources, the plurality of scaling parameter values can be a plurality of first scaling parameter values, and the method can further include: receiving, by the one or more processors, information specifying second target computing resources different from the first target computing resources; and identifying, according to the information specifying the second target computing resources, a plurality of second scaling parameter values for scaling the base neural network, wherein the plurality of second scaling parameter values is different from the plurality of first scaling parameter values.

The plurality of scaling parameter values can be a plurality of first scaling parameter values, and the method can further include generating a scaled neural network architecture from the base neural network architecture scaled using a plurality of second scaling parameter values, wherein the second scaling parameter values are generated from the plurality of first scaling parameter values and one or more compound coefficients that uniformly modify the value of each of the first scaling parameter values.

The base neural network can be a convolutional neural network, and the plurality of scaling parameters can include one or more of a depth of the base neural network, a width of the base neural network, and an input resolution of the base neural network.

According to another aspect, a method for determining an architecture of a neural network includes: receiving, by one or more processors, information specifying target computing resources; receiving, by the one or more processors, data specifying an architecture of a base neural network; and identifying, by the one or more processors, a plurality of scaling parameter values for scaling the base neural network, according to the information specifying the target computing resources and a plurality of scaling parameters of the base neural network. The identifying can include repeatedly performing the steps of: selecting a plurality of candidate scaling parameter values; and determining a measure of performance for the base neural network scaled according to the plurality of candidate scaling parameter values, wherein the measure of performance is determined in accordance with a plurality of objectives including a latency objective. The method can include generating, by the one or more processors, an architecture for a scaled neural network using the architecture of the base neural network scaled according to the plurality of scaling parameter values.

The plurality of objectives can be a plurality of second objectives, and receiving the data specifying the architecture of the base neural network can include: receiving, by the one or more processors, training data corresponding to a neural network task; and performing, by the one or more processors and using the training data, a neural architecture search over a search space according to a plurality of first objectives to identify the architecture of the base neural network.

Other implementations include computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.

Cross-Reference to Related Applications

This application claims the benefit under 35 U.S.C. § 119(e) of U.S. Patent Application No. 63/137,926, filed January 15, 2021, the disclosure of which is incorporated herein by reference.

Overview

The techniques described in this specification generally relate to scaling neural networks for execution on different target computing resources, such as different hardware accelerators. A neural network can be scaled according to multiple different performance objectives, which can include separate objectives for minimizing processing time (referred to herein as latency) and maximizing the accuracy of the neural network when it is scaled for execution on the target computing resources.

In general, a neural architecture search (NAS) system can be deployed to select a neural network architecture from a given search space of candidate architectures according to one or more objectives. One common objective is the accuracy of the neural network: typically, a system implementing a NAS technique will favor networks that yield higher accuracy after training over networks with lower accuracy. After a NAS selects a base neural network, the base neural network can be scaled according to one or more scaling parameters. Scaling can include searching for one or more scaling parameter values for scaling the base neural network, for example by searching for the scaling parameters in a search space of numeric coefficients. Scaling can include increasing or decreasing the number of layers a neural network has, or the size of each layer, to make efficient use of the compute and/or memory resources available for deploying the neural network.

A common belief associated with neural architecture search and scaling is that the computation a network requires to process an input through the neural network (e.g., measured in floating-point operations per second (FLOPS)) is proportional to the latency between sending an input to the network and receiving an output. In other words, a neural network with a low computation (low-FLOPS) requirement is believed to produce outputs faster than a network with a higher computation (high-FLOPS) requirement, because fewer operations are performed overall. As a result, many NAS systems select neural networks with low computation requirements. However, because other characteristics of a neural network, such as its operational intensity, parallelism, and execution efficiency, can affect the network's total latency, it has been determined that the relationship between computation requirement and latency is not proportional.

The techniques described herein provide latency-aware compound scaling (LACS) and augmentation of the search space of candidate neural networks from which a neural network is selected. In the context of augmenting a search space, "hardware accelerator friendly" operations and architectures can be included in the search space, where these additions can lead to higher operational intensity, higher execution efficiency, and parallelism suited to deployment on various types of hardware accelerators. Such operations can include space-to-depth operations, space-to-batch operations, fused convolution structures, and component-wise searched activation functions.

Latency-aware compound scaling of neural networks can improve the scaling of neural networks relative to conventional approaches that do not optimize for latency. Instead, LACS can be used to identify scaling parameter values for scaled neural networks that are accurate and that operate with low latency on the target computing resources.

The techniques further provide multi-objective scaling that shares the objectives used for searching for a neural architecture with NAS or similar techniques. The scaled neural network can be identified according to the same objectives used when searching for the base neural network. As a result, the scaled neural network can be optimized for performance in both phases (base architecture search and scaling), rather than treating each phase as a task with separate objectives.

LACS can be integrated with existing NAS systems, at least because the same objectives for both searching and scaling can be used to produce an end-to-end system for determining scaled neural network architectures. In addition, a family of scaled neural network architectures can be identified faster than with scaling-free search approaches, in which neural network architectures are searched for but not scaled for deployment on the target computing resources.

The techniques described herein can provide improved neural networks compared with neural networks identified through conventional search and scaling approaches that do not use LACS. In addition, a family of neural networks with different trade-offs between objectives such as model accuracy and inference latency can be generated quickly for application to a variety of use cases. Further, the techniques can provide faster identification of a neural network for performing a particular task, while the identified neural network can perform with improved accuracy over neural networks identified using other approaches. This is at least because a neural network identified through searching and scaling as described herein can account for characteristics that affect latency, such as operational intensity and execution efficiency, rather than only a network's computation requirement. In this way, an identified neural network can perform inference faster without sacrificing network accuracy.

The techniques can further provide a generally applicable framework for quickly migrating existing neural networks to improved computing resource environments. For example, LACS and NAS as described herein can be applied when execution of an existing neural network selected for a data center with particular hardware is migrated to a data center using different hardware. In this regard, a family of neural networks can be quickly identified to perform the tasks of the existing neural network and deployed on the hardware of the new data center. This application can be especially useful in rapidly evolving fields that require state-of-the-art hardware for efficient execution, such as networks performing tasks in computer vision or other image processing tasks.

FIG. 1 is a block diagram of a family 103 of scaled neural network architectures 104A-N deployed in a data center 115 housing hardware accelerators 116 on which the deployed neural networks will execute. The hardware accelerators 116 can be any type of processor, such as a CPU, GPU, FPGA, or ASIC, such as a TPU. According to aspects of the disclosure, the family 103 of scaled neural network architectures can be generated from a base neural network architecture 101.

The architecture of a neural network refers to characteristics that define the neural network. For example, the architecture can include the characteristics of the network's different neural network layers, how the layers process input, how the layers interact with one another, and so on. For example, the architecture of a convolutional neural network (ConvNet) can define a discrete convolutional layer that receives input image data, followed by a pooling layer, followed by a fully-connected layer that generates an output according to a neural network task (e.g., classifying the contents of the input image data). The architecture of a neural network can also define the types of operations performed within each layer. For example, the architecture of a ConvNet can define the use of ReLU activation functions in the network's fully-connected layers.

The base neural network architecture 101 can be identified using NAS, from a search space of candidate neural network architectures and according to a set of objectives. As described in more detail herein, the search space of candidate neural network architectures can be augmented to include different network components, operations, and layers, from which a base network satisfying the objectives can be identified.

The set of objectives used to identify the base neural network architecture 101 can also be applied to identify the scaling parameter values for each of the neural networks 104A-N in the family 103. The base neural network architecture 101 and the scaled neural network architectures 104A-N can be characterized by a number of parameters, where these parameters are scaled to different degrees in the scaled neural network architectures 104A-N. In FIG. 1, the neural networks 101, 104A are shown with three scaling parameters: D indicates the number of layers in the neural network; W indicates the width, or number of neurons, of a neural network layer; and R indicates the size of the input processed by the neural network at a given layer.
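
The following is a minimal sketch, in plain Python with hypothetical names, of how the three scaling parameters (D, W, R) and a per-parameter scaling step might be represented; it is illustrative only, not the claimed implementation.

```python
from dataclasses import dataclass

@dataclass
class ArchitectureConfig:
    depth: int        # D: number of layers in the network
    width: int        # W: width (number of neurons) of a layer
    resolution: int   # R: size of the input processed at a given layer

def scale(base: ArchitectureConfig, d: float, w: float, r: float) -> ArchitectureConfig:
    # Scale each parameter by its own coefficient, rounding to whole units.
    return ArchitectureConfig(
        depth=max(1, round(base.depth * d)),
        width=max(1, round(base.width * w)),
        resolution=max(1, round(base.resolution * r)),
    )

base = ArchitectureConfig(depth=18, width=64, resolution=224)
candidate = scale(base, d=1.2, w=1.1, r=1.15)  # one candidate scaling
```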

As described in more detail herein, a system configured to perform LACS can search a coefficient search space 108 to identify sets of scaling parameter values. Each scaling parameter value is a coefficient in the coefficient search space, which can be, for example, a set of positive real numbers. Each candidate network 107A-N is scaled from the base neural network 101 according to candidate coefficient values identified as part of a search of the coefficient search space 108. The system can apply any of a variety of search techniques to identify the candidate coefficients, such as a Pareto frontier search or a grid search. For each candidate network 107A-N, the system can evaluate a measure of the candidate network's performance in performing a neural network task. The measure of performance can be based on multiple objectives, including a latency objective measuring the delay between the candidate network receiving an input and generating a corresponding output as part of performing the neural network task. A grid-search variant of this loop is sketched below.
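
A hedged sketch of the coefficient search described above: a simple grid search over candidate (d, w, r) coefficients, where measure_performance stands in for training and evaluating each scaled candidate against the multi-objective measure, and ArchitectureConfig/scale reuse the sketch above.

```python
import itertools

def search_scaling_coefficients(base, candidate_values, measure_performance):
    # Try every (d, w, r) triple from the coefficient search space 108,
    # score the scaled candidate network, and keep the best-scoring triple.
    best_score, best_coeffs = float("-inf"), None
    for d, w, r in itertools.product(candidate_values, repeat=3):
        candidate = scale(base, d, w, r)        # a candidate network 107A-N
        score = measure_performance(candidate)  # multi-objective measure
        if score > best_score:
            best_score, best_coeffs = score, (d, w, r)
    return best_coeffs, best_score

# e.g., coefficients drawn from a small set of positive reals:
# coeffs, score = search_scaling_coefficients(base, [1.0, 1.1, 1.2, 1.4], f)
```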

After performing the search for scaling parameter values over the coefficient search space 108, the system can receive a scaled neural network architecture 109. The scaled neural network architecture 109 is scaled from the base neural network 101 with the scaling parameter values that produced the highest measure of performance among the candidate networks 107A-N identified during the search of the coefficient search space.

From the scaled neural network architecture 109, the system can generate the family 103 of scaled neural network architectures 104A-N. The family 103 can be generated by scaling the scaled neural network architecture 109 according to different values. Each scaling parameter value of the scaled neural network architecture 109 can be scaled uniformly to generate the different scaled neural network architectures in the family 103. For example, each scaling parameter value of the scaled neural network architecture 109 can be scaled up by a factor of 2. The scaled neural network architecture 109 can be scaled for different values, or "compound coefficients," applied uniformly to each of its scaling parameter values. In some implementations, the scaled neural network architecture 109 is scaled in other ways, for example by scaling each scaling parameter value separately, to generate a scaled neural network architecture in the family 103.

By scaling the scaled neural network architecture 109 according to different values, different neural network architectures can be generated quickly to perform a task according to a variety of use cases. The different use cases can be specified as different trade-offs between the multiple objectives used to identify the scaled neural network architecture 109. For example, one scaled neural network architecture can be identified that meets a higher accuracy threshold at the cost of higher latency during execution. Another scaled neural network architecture can be identified that meets a lower accuracy threshold but can execute on the hardware accelerators 116 with lower latency. Yet another scaled neural network architecture can be identified that balances the trade-off between accuracy and latency on the hardware accelerators 116.
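
A sketch of generating the family 103 by uniformly applying a compound coefficient to the scaled base, again reusing the hypothetical scale helper above; the simple multiply-by-phi rule is an assumption for illustration.

```python
def generate_family(scaled_base, compound_coefficients):
    # Apply each compound coefficient phi uniformly to all scaling
    # parameters of the scaled base, producing one family member each.
    return [scale(scaled_base, d=phi, w=phi, r=phi)
            for phi in compound_coefficients]

# e.g., doubling every scaling parameter value, as in the text:
family = generate_family(candidate, compound_coefficients=[1.0, 2.0, 4.0])
```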

As an example, to perform a computer vision task such as object recognition, a neural network architecture may need to generate outputs in real time or near-real time, as part of an application that continuously receives video or image data and is tasked with identifying objects of a particular class in the received data. For this example task, the accuracy requirement may be less strict, so a scaled neural network architecture scaled with an appropriate trade-off of lower accuracy for lower latency can be deployed to perform the task.

As another example, a neural network architecture can be tasked with classifying each object in a scene received from image or video data. In this example, if latency in performing the task is not considered as important as performing the task accurately, a scaled neural network with higher accuracy at the expense of latency can be deployed. In other examples, where no particular trade-off is identified or desired for performing the neural network task, a scaled neural network architecture that balances the trade-offs between accuracy, latency, and other objectives can be deployed.

The scaled neural network architecture 104N is scaled using scaling parameter values different from those used to scale the scaled neural network architecture 109 to obtain the scaled neural network 104A, and can represent a different use case, for example one in which accuracy is preferred over inference latency.

The LACS and NAS techniques described herein can generate the family 103 for the hardware accelerators 116, and the system can receive additional training data and information specifying multiple different computing resources, such as different types of hardware accelerators. In addition to generating the family 103 for the hardware accelerators 116, the system can also search for a base neural network architecture and generate a family of scaled neural network architectures for the different hardware accelerators. For example, given a GPU and a TPU, the system can generate separate model families optimized for the accuracy-latency trade-off on the GPU and the TPU, respectively. In some implementations, the system can generate multiple scaled families from the same base neural network architecture.

Exemplary Method

FIG. 2 is a flow diagram of an example process 200 for generating a scaled neural network architecture for execution on target computing resources. The example process 200 can be executed on a system of one or more processors in one or more locations. For example, a neural architecture search with latency-aware compound scaling (NAS-LACS) system as described herein can perform the process 200.

As shown in block 210, the system receives training data corresponding to a neural network task. A neural network task is a machine learning task that can be performed by a neural network. The scaled neural network can be configured to receive any type of data input to generate output for performing a neural network task. As examples, the output can be any kind of score, classification, or regression output based on the input. Correspondingly, the neural network task can be a scoring, classification, and/or regression task for predicting some output given some input. These tasks can correspond to a variety of different applications processing images, video, text, speech, or other types of data.

The received training data can be in any form suitable for training a neural network according to any of a variety of learning techniques. Learning techniques for training a neural network can include supervised, unsupervised, and semi-supervised learning techniques. For example, the training data can include multiple training examples that can be received as input by a neural network. The training examples can be labeled with a known output corresponding to the output intended to be produced by a neural network properly trained to perform a particular neural network task. For example, if the neural network task is a classification task, a training example can be an image labeled with one or more classes classifying the objects depicted in the image.

As shown in block 220, the system receives information specifying target computing resources. The target computing resource data can specify characteristics of computing resources on which a neural network can be at least partially deployed. The computing resources can be housed in one or more data centers or other physical locations hosting any of a variety of different types of hardware devices. Example types of hardware include central processing units (CPUs), graphics processing units (GPUs), edge or mobile computing devices, field-programmable gate arrays (FPGAs), and various types of application-specific integrated circuits (ASICs).

Some devices can be configured for hardware acceleration, which can include devices configured to perform particular types of operations efficiently. These hardware accelerators, which can include, for example, GPUs and tensor processing units (TPUs), can implement special features for hardware acceleration. Example features for hardware acceleration can include configurations for performing operations commonly associated with machine learning model execution, such as matrix multiplication. As examples, these special features can include the matrix multiply-and-accumulate units available in different types of GPUs and the matrix multiply units available in TPUs.

The target computing resource data can include data for one or more target sets of computing resources. A target set of computing resources can refer to a collection of computing devices on which a neural network is intended to be deployed. The information specifying a target set of computing resources can specify the types and quantities of hardware accelerators or other computing devices in the target set. The target set can include devices of the same or different types. For example, a target set of computing resources can define the hardware characteristics and quantity of a particular type of hardware accelerator, including its processing capability, throughput, and memory capacity. As described herein, a system can generate a family of scaled neural network architectures for each device specified in a target set of computing resources.
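
One way such a target set might be described in code; this is a hypothetical schema, and the device numbers are illustrative rather than real specifications.

```python
from dataclasses import dataclass

@dataclass
class TargetComputingResources:
    accelerator_type: str    # e.g., "TPU" or "GPU"
    device_count: int        # number of devices in the target set
    peak_compute: float      # C_max: peak computation rate (FLOPS/sec)
    memory_bandwidth: float  # b: memory bandwidth (bytes/sec)
    memory_capacity: float   # on-device memory (bytes)

# Illustrative values only:
tpu_target = TargetComputingResources("TPU", 8, 1.0e14, 9.0e11, 1.6e10)
```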

In addition, the target computing resource data can specify different target sets of computing resources, for example reflecting different potential configurations of the computing resources housed in a data center. From the training data and the target computing resource data, the system can generate a family of neural network architectures. Each architecture can be generated from a base neural network identified by the system.

As shown in block 230, the system can perform a neural architecture search over a search space using the training data to identify an architecture for a base neural network. The system can use any of a variety of NAS techniques, such as techniques based on reinforcement learning, evolutionary search, or differentiable search. In some implementations, the system can directly receive data specifying the architecture of a base neural network, for example without receiving training data and performing NAS as described herein.

A search space refers to candidate neural networks, or portions of candidate neural networks, that can potentially be selected as part of a base neural network architecture. A portion of a candidate neural network architecture can refer to a component of the neural network. The architecture of a neural network can be defined in terms of a plurality of components of the neural network, where each component includes one or more neural network layers. Characteristics of the neural network layers can be defined in an architecture at the component level, meaning that the architecture can define particular operations performed in a component such that each neural network layer in the component implements the same operations defined for the component. A component can also be defined in the architecture by the number of layers in the component.
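
A component-level architecture definition might be represented as data like the following; this is a hypothetical schema, and the operation names are placeholders. Each entry fixes the operation, activation, and layer count shared by all layers in that component.

```python
# The full architecture is the ordered list of components; every layer in
# a component implements the operations defined for that component.
BASE_ARCHITECTURE = [
    {"component": "stem", "op": "conv_3x3", "activation": "relu", "layers": 1},
    {"component": "stage_1", "op": "fused_conv_1x1", "activation": "swish", "layers": 4},
    {"component": "stage_2", "op": "depthwise_conv_3x3", "activation": "relu", "layers": 6},
    {"component": "head", "op": "fully_connected", "activation": "softmax", "layers": 1},
]
```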

As part of performing NAS, the system can repeatedly identify candidate neural networks, obtain performance metrics corresponding to the multiple objectives, and evaluate the candidate neural networks according to their respective performance metrics. As part of obtaining the performance metrics, such as measures of a candidate neural network's accuracy and latency, the system can train the candidate neural network using the received training data. Once trained, the system can evaluate the candidate neural network architecture to determine its performance metrics, and compare the performance metrics against a current best candidate.

The system can repeatedly perform this search procedure by selecting candidate neural networks, training the networks, and comparing their performance metrics until a stopping criterion is reached. The stopping criterion can be a minimum predetermined performance threshold met by a current candidate network. Additionally or alternatively, the stopping criterion can be a maximum number of search iterations, or a maximum amount of time allotted for performing the search. The stopping criterion can be a condition in which the performance of the neural networks converges, for example when the performance of a subsequent iteration differs from the performance of the previous iteration by less than a threshold amount. Such a loop is sketched below.
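
A sketch of a search loop with the stopping criteria named above (iteration budget, time budget, performance threshold, and convergence); propose_candidate and evaluate are hypothetical stand-ins for the NAS controller and the train-and-measure step.

```python
import time

def run_search(propose_candidate, evaluate, max_iters=500,
               max_seconds=3600.0, target_score=None, convergence_eps=1e-4):
    best, best_score = None, float("-inf")
    start = time.monotonic()
    for _ in range(max_iters):                 # maximum number of iterations
        candidate = propose_candidate()
        score = evaluate(candidate)
        improvement = score - best_score
        if score > best_score:
            best, best_score = candidate, score
        if target_score is not None and best_score >= target_score:
            break                              # performance threshold met
        if 0 <= improvement < convergence_eps:
            break                              # performance has converged
        if time.monotonic() - start > max_seconds:
            break                              # allotted search time used up
    return best, best_score
```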

In the context of optimizing a neural network for different performance metrics, such as accuracy and latency, the stopping criterion can specify threshold ranges predetermined to be "optimal." For example, a threshold range for optimal latency can be a range derived from a theoretical or measured minimum latency achievable by the target computing resources. The theoretical or measured minimum latency can be based on physical characteristics of the computing resources, such as the minimum amount of time required for components of the computing resources to physically read and process incoming data. In some implementations, latency is held to an absolute minimum, for example physically as close to zero delay as possible, rather than being based on a target latency measured or computed from the target computing resources.

The system can be configured to use a machine learning model or other techniques to select the next candidate neural network architecture, where the selection can be based at least in part on learned characteristics of the different candidate neural networks that are more likely to perform well under the objectives for a particular neural network task.

In some examples, the system can use a multi-objective reward for identifying the base neural network architecture, as follows:

$$\text{ACCURACY}(m) \times \left( \frac{\text{LATENCY}(m)}{\text{Target}} \right)^{w}$$

ACCURACY(m) is the performance metric for the accuracy of a candidate neural network m, and LATENCY(m) is the latency for the neural network to produce an output on the target computing resources. As described in more detail herein, accuracy and latency can also be objectives used by the system as part of scaling the base neural network architecture according to the characteristics of the target computing resources. Target is a target latency value, for example measured in milliseconds, and can be predetermined. The value w is a tunable parameter for weighting the effect of network latency on the overall performance of a candidate neural network. Different values of w can be learned or tuned for different target sets of computing resources. As an example, w can be set to -0.09, reflecting an overall larger factor suited to computing resources such as TPUs and GPUs, which are less sensitive to latency variation than, for example, mobile platforms (for which w can be set to a value of smaller magnitude, such as -0.07).

To measure the accuracy of a candidate neural network, the system can train the candidate neural network to perform a neural network task using a training set. The system can divide the training data into a training set and a validation set, for example according to an 80/20 split. For example, the system can apply a supervised learning technique to compute the error between the output generated by the candidate neural network and the ground-truth label of a training example processed by the network. The system can use any of a variety of loss or error functions suitable for the type of task the neural network is being trained for, such as cross-entropy loss for classification tasks or mean squared error for regression tasks. For example, the gradients of the error with respect to the different weights of the candidate neural network can be computed using the backpropagation algorithm, and the weights of the neural network can be updated. The system can be configured to train the candidate neural network until a stopping criterion is met, such as a number of training iterations, a maximum time period, convergence, or when a minimum accuracy threshold is met.
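
Tying these pieces together, a minimal sketch of the reward from the formula above plus the 80/20 split and a validation-accuracy measure (plain Python; helper names are hypothetical):

```python
import random

def multi_objective_reward(accuracy, latency_ms, target_ms, w=-0.09):
    # ACCURACY(m) * (LATENCY(m) / Target) ** w, per the formula above;
    # with w < 0, exceeding the target latency shrinks the reward.
    return accuracy * (latency_ms / target_ms) ** w

def split_train_validation(examples, train_fraction=0.8, seed=0):
    # The 80/20 split of training data mentioned above.
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    cut = int(len(examples) * train_fraction)
    return examples[:cut], examples[cut:]

def validation_accuracy(predict, validation_set):
    # Fraction of held-out examples whose prediction matches the label.
    correct = sum(1 for x, label in validation_set if predict(x) == label)
    return correct / max(1, len(validation_set))

# e.g., a candidate at 78% accuracy and 12 ms against a 10 ms target:
reward = multi_objective_reward(accuracy=0.78, latency_ms=12.0, target_ms=10.0)
```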

In addition to other performance metrics, the system can generate performance metrics for a candidate neural network architecture's accuracy and latency on the target computing resources, including (i) the operational intensity of the candidate base neural network when deployed on the target computing resources, and/or (ii) the execution efficiency of the candidate base neural network on the target computing resources. In some implementations, the performance metric for a candidate base neural network is based at least in part on its operational intensity and/or execution efficiency, in addition to accuracy and latency.

Latency, operational intensity, and execution efficiency can be defined as follows:

$$\text{LATENCY} = \frac{W}{C} = \frac{W}{E \cdot C_{ideal}}, \qquad C_{ideal} = \begin{cases} b \times I & \text{if } I < R \\ C_{max} & \text{otherwise} \end{cases}, \qquad I = \frac{W}{Q} \tag{3}$$

In (3), LATENCY is defined as $W / C$, where $W$ is the amount of computation (e.g., in FLOPS) required to execute the measured neural network architecture, and $C$ is the computation rate (e.g., in FLOPS/sec) achieved by the target computing resources when processing inputs on the neural network architecture. $E$ is the execution efficiency with which the candidate neural network executes on the target computing resources, and $E$ can be equal to $C / C_{ideal}$ (and therefore, by algebra, $\text{LATENCY} = W / (E \cdot C_{ideal})$). $C_{ideal}$ is the ideal computation rate achievable on the target computing resources used to execute the measured neural network architecture. $C_{ideal}$ is defined in terms of the operational intensity $I$, the memory bandwidth $b$ of the target computing resources, and $C_{max}$, which denotes the peak computation rate of the target computing resources (e.g., the peak computation rate of a GPU).

$C_{max}$ and $b$ are constants corresponding to hardware characteristics of the target computing resources, namely their peak computation rate and memory bandwidth, respectively. The operational intensity $I$ is a measure of the amount of data processed by the computing resources deploying a neural network: the amount of computation $W$ required to execute the neural network, divided by the memory traffic $Q$ incurred by the computing resources during execution of the neural network ($I = W / Q$, as shown in (3)).

The ideal computation rate for executing the neural network on the target computing resources is $I \times b$ when $I < R$, and $C_{max}$ otherwise. $R$ is the ridge point, or the minimum operational intensity required for the neural network architecture to achieve the peak computation rate on the target computing resources (i.e., $R = C_{max} / b$).
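
A direct transcription of these definitions into code, useful as a sanity check of the relationships in (3); the numbers in the usage line are illustrative only.

```python
def roofline_latency(W, Q, E, c_max, b):
    # I = W / Q; C_ideal = b * I if I < R else C_max, with ridge point
    # R = C_max / b; LATENCY = W / (E * C_ideal), per equation (3).
    I = W / Q                   # operational intensity (FLOPS per byte)
    ridge = c_max / b           # minimum intensity that reaches peak rate
    c_ideal = b * I if I < ridge else c_max
    return W / (E * c_ideal)

# 2 GFLOPs of work, 50 MB of memory traffic, 60% execution efficiency on an
# accelerator with a 100 TFLOPS/sec peak and 1 TB/s memory bandwidth:
latency_seconds = roofline_latency(W=2e9, Q=5e7, E=0.6, c_max=1e14, b=1e12)
```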

Collectively, (3) and the corresponding description show that the measured inference latency of a neural network depends on the operational intensity $I$, the computation $W$, and the execution efficiency $E$, rather than on the computation $W$ alone. In the context of hardware accelerators, in particular hardware accelerators such as the TPUs and GPUs commonly deployed in data centers, this relationship can be applied to augment a candidate neural network search space with "accelerator-friendly" operations during NAS, and to improve how a base neural network is selected and subsequently scaled by the system.

Rather than searching only for networks with reduced computation, the system can search for a base neural network architecture that improves the latency of the final neural network by simultaneously searching for candidate neural network architectures with improved operational intensity, execution efficiency, and computation requirements. The system can be configured to operate in this manner to reduce the total latency of the final base neural network architecture.

In addition, the search space of candidate architectures from which the system selects the base neural network architecture can be augmented to broaden the range of available candidate neural networks that are more likely to execute accurately and with reduced inference latency on the target computing resources, particularly where the target computing resources are data-center hardware accelerators.

Augmenting the search space as described can increase the number of candidate neural network architectures better suited for deployment on data-center accelerators, which can lead to base neural network architectures that would not have been identified as candidates in a search space not augmented according to aspects of the disclosure. In examples where the target computing resources specify hardware accelerators such as GPUs and TPUs, the search space can be augmented with candidate architectures, or portions of architectures, such as components or operations that promote operational intensity, parallelism, and/or execution efficiency.

In one example augmentation, the search space can be augmented to include neural network architecture components having layers that implement one of a variety of different types of activation functions. In the case of TPUs and GPUs, it has been found that activation functions such as ReLU or swish typically have low operational intensity and instead tend to be memory-bound on these types of hardware accelerators. Because the execution of activation functions in a neural network is often limited by the total amount of memory available on the target computing resources, executing these functions can have a large negative performance impact on end-to-end network inference speed.

One example augmentation of the search space with respect to activation functions is to introduce into the search space activation functions fused with their associated discrete convolutions. Because activation functions generally operate element-wise and run on accelerator units configured for vector operations, their execution can proceed in parallel with the discrete convolutions, which are matrix-based operations that generally run on a hardware accelerator's matrix unit. These fused activation-convolution operations can be selected by the system as candidate neural network components as part of the search for a base neural network architecture as described herein. Any of a variety of activation functions can be used, including swish, ReLU (rectified linear unit), sigmoid, tanh, and softmax. A functional sketch of one such fused operation follows.
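
A sketch of a fused operation, assuming numpy: a 1x1 ("pointwise") convolution, which is a per-pixel matrix multiply suited to a matrix unit, combined with an element-wise ReLU that a vector unit can overlap with it. On a real accelerator the fusion happens in the compiler or kernel; this only shows the combined computation.

```python
import numpy as np

def fused_pointwise_conv_relu(x, kernel):
    # x: H x W x C_in feature map; kernel: C_in x C_out weights.
    y = x @ kernel             # 1x1 convolution == matrix multiply per pixel
    return np.maximum(y, 0.0)  # ReLU applied element-wise, fused with the conv

x = np.random.rand(16, 16, 32).astype(np.float32)
k = np.random.rand(32, 64).astype(np.float32)
out = fused_pointwise_conv_relu(x, k)  # shape (16, 16, 64)
```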

Different components containing fused activation-convolution layers can be added to the search space, varying according to the type of activation function employed. For example, one component of activation-convolution layers can include a ReLU activation function, while another component can include a swish activation function. It has been found that different hardware accelerators can execute different activation functions more efficiently, so augmenting the search space with fused activation-convolutions for multiple types of activation functions can further improve the identification of a base neural network architecture best suited to performing the neural network task at hand.

In addition to the components with various activation functions described herein, the search space can also be augmented with other fused convolution structures to further enrich the search space with convolutions of different shapes, types, and sizes. The different convolution structures can be components added as part of a candidate neural network architecture, and can include expansion layers of 1x1 convolutions, depthwise convolutions, projection layers of 1x1 convolutions, and other operations such as activation functions, batch normalization functions, and/or skip connections.

Identifying the root causes of the non-proportional relationship between computation requirements and latency, as described herein, also demonstrates the impact of parallelism on hardware accelerators. Parallelism can be critical for neural networks executed on hardware accelerators such as GPUs and TPUs, because these hardware accelerators can require substantial parallelism to achieve high performance. For example, a convolution operation in a neural network layer needs depth, batch, and spatial dimensions of sufficient size to provide enough parallelism to achieve high execution efficiency E on a hardware accelerator's matrix units. Execution efficiency as described with reference to (3) forms part of the full picture of the factors affecting network latency at inference.

Accordingly, the NAS search space can be augmented in other ways to include operations that can exploit the parallelism available on a hardware accelerator. The search space can include one or more operations for fusing depthwise convolutions with adjacent 1x1 convolutions, as well as operations for reshaping the input to a neural network. For example, the input to a candidate neural network can be a tensor. A tensor is a data structure that can represent values at different orders. For example, a first-order tensor can be a vector, a second-order tensor can be a matrix, a third-order tensor can be a three-dimensional array, and so on. Fusing depthwise convolutions can be beneficial because depthwise operations typically have a relatively low operational intensity, and fusing them with adjacent convolutions can raise the operational intensity closer to a hardware accelerator's maximum capability.
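To make the operational-intensity argument concrete, the sketch below estimates FLOPs, bytes moved, and the resulting operational intensity for a standalone depthwise convolution versus a fused depthwise-plus-1x1 block. The counting conventions (two FLOPs per multiply-accumulate, float32 operands, padding and strides ignored) are simplifying assumptions.

```python
DTYPE_BYTES = 4  # float32 operands assumed

def depthwise_stats(H, W, C, k):
    """FLOPs and bytes moved for a k x k depthwise conv on H x W x C."""
    flops = 2 * H * W * C * k * k                      # 2 FLOPs per MAC
    moved = DTYPE_BYTES * (2 * H * W * C + k * k * C)  # in + out + weights
    return flops, moved

def pointwise_stats(H, W, Cin, Cout):
    """FLOPs and bytes moved for a 1x1 conv mapping Cin -> Cout channels."""
    flops = 2 * H * W * Cin * Cout
    moved = DTYPE_BYTES * (H * W * (Cin + Cout) + Cin * Cout)
    return flops, moved

H, W, C = 56, 56, 128
dw_f, dw_b = depthwise_stats(H, W, C, k=3)
pw_f, pw_b = pointwise_stats(H, W, C, C)
print("depthwise intensity:", dw_f / dw_b)  # ~2 FLOPs/byte: memory-bound
# Fusing avoids writing and re-reading the intermediate H x W x C tensor.
fused_bytes = dw_b + pw_b - 2 * DTYPE_BYTES * H * W * C
print("fused intensity:", (dw_f + pw_f) / fused_bytes)  # roughly 15x higher
```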

The search space can also include operations that reshape an input tensor by changing its different dimensions. For example, if a tensor has depth, width, and height dimensions, one or more operations in the search space that can form part of a candidate neural network architecture can be configured to change one or more of the tensor's depth, width, and resolution dimensions. In one example, the search space can be augmented with one or more space-to-depth convolutions (such as 2x2 convolutions) that reshape an input tensor by increasing its depth while decreasing its other dimensions. In some implementations, one or more operations using stride-$n$ $n \times n$ convolutions are included, where $n$ represents a positive integer the system can use to reshape a tensor input. For example, if a tensor input to a candidate neural network has dimensions

$$H \times W \times C,$$

then the one or more added operations can reshape the tensor to dimensions

$$\frac{H}{n} \times \frac{W}{n} \times (C \cdot n^{2}).$$
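As an illustration of this reshaping, the following minimal sketch (using NumPy; the function name and the choice of n = 2 are illustrative) performs a space-to-depth rearrangement that trades spatial extent for depth:

```python
import numpy as np

def space_to_depth(x: np.ndarray, n: int = 2) -> np.ndarray:
    """Reshape an (H, W, C) tensor to (H/n, W/n, C * n**2)."""
    H, W, C = x.shape
    assert H % n == 0 and W % n == 0, "spatial dims must be divisible by n"
    x = x.reshape(H // n, n, W // n, n, C)       # split each spatial axis
    x = x.transpose(0, 2, 1, 3, 4)               # gather the n x n sub-blocks
    return x.reshape(H // n, W // n, C * n * n)  # fold sub-blocks into depth

x = np.random.rand(224, 224, 3)
print(space_to_depth(x).shape)  # (112, 112, 12)
```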

Implementing space-to-depth convolutions as described above can have at least two advantages. First, convolutions are associated with relatively high operational intensity and execution efficiency, so having more convolution options enriches the overall search space when searching for a candidate neural network. High operational intensity and execution efficiency are also advantageous for implementation on hardware accelerators such as TPUs and GPUs. Second, the stride-$n$ $n \times n$ convolutions can themselves be trained as part of the candidate neural network, contributing to the network's capacity.

The search space can also include operations that reshape an input tensor by moving its elements to different locations in memory on the target computing resources. Additionally or alternatively, these operations can copy the elements to different locations in memory.

In some implementations, the system is configured to receive a base neural network for scaling directly, and does not perform NAS or any other search to identify the base neural network. In some implementations, multiple devices individually perform at least part of process 200, for example by identifying the base neural network on one device and scaling the base neural network as described herein on another device.

The system can identify a plurality of scaling parameter values for scaling the base neural network according to the information specifying the target computing resources and the plurality of scaling parameters, as shown in block 240 of FIG. 2. The system can use latency-aware compound scaling (described presently), using the accuracy and latency of a candidate scaled neural network as the objectives for searching the scaling parameter values of the base neural network.

In general, scaling techniques are applied in combination with NAS to identify a neural network scaled for deployment on target computing resources. Model scaling can be used together with NAS to search more efficiently for a family of neural networks supporting a variety of use cases. Under a scaling approach, any of various techniques can be used to search for different values of the scaling parameters, such as a neural network's depth, width, and resolution. Scaling can be performed by searching for the value of each scaling parameter separately, or by searching for a consistent set of values that adjust multiple scaling parameters together. The former is sometimes referred to as simple scaling, and the latter as compound scaling; the sketch after this paragraph contrasts the two.
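The difference between the two modes can be sketched as follows; the dictionary schema and the multiplier values are illustrative assumptions.

```python
def apply_scaling(base, d, w, r):
    """Scale a base architecture's depth, width, and input resolution."""
    return {"depth":      round(base["depth"] * d),
            "width":      round(base["width"] * w),
            "resolution": round(base["resolution"] * r)}

BASE = {"depth": 20, "width": 64, "resolution": 224}

# Simple scaling: each multiplier is searched and set independently.
deeper = apply_scaling(BASE, d=1.5, w=1.0, r=1.0)

# Compound scaling: one jointly searched tuple adjusts all three together.
alpha, beta, gamma = 1.2, 1.1, 1.15
compound = apply_scaling(BASE, d=alpha, w=beta, r=gamma)
print(deeper, compound)
```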

Scaling techniques that use accuracy as the sole objective can yield neural networks that, when deployed on specialized hardware such as datacenter accelerators, are not scaled in a way that properly accounts for the performance/speed impact of the scaled networks. As described in more detail with reference to FIG. 3, LACS can use both accuracy and latency objectives, which can be shared as the same objectives used to identify the base neural network architecture.

FIG. 3 shows an example process 300 for latency-aware compound scaling of a base neural network architecture. The example process 300 can be performed on a system or device of one or more processors in one or more locations. For example, process 300 can be performed on a NAS-LACS system as described herein.

As shown in block 310, the system selects a plurality of candidate scaling parameter values. Scaling parameter values are the values of a neural network's different scaling parameters. As described herein, depth, width, and input resolution can be scaling parameters, at least because those parameters of a neural network can be made larger or smaller. As part of selecting the scaling parameter values, the system can select a tuple of scaling coefficients from a coefficient search space. The coefficient search space contains candidate scaling-coefficient tuples from which the scaling parameter values of a neural network can be determined. The number of scaling coefficients in each tuple can depend on the number of scaling parameters. For example, if the scaling parameters are depth, width, and resolution, the coefficient search space will contain candidate tuples of the form

$$(\alpha, \beta, \gamma),$$

each with three scaling coefficients. The coefficients in each tuple can be numeric values, for example integers or real values.

In a compound scaling approach, the scaling coefficients for all of the scaling parameters are searched together. The system can apply any of a variety of search techniques to identify a coefficient tuple, such as a Pareto frontier search or a grid search. However the system searches for a coefficient tuple, it can search according to the same objectives used to identify the base neural network architecture described herein with reference to FIG. 2. In addition, the multiple objectives can include both accuracy and latency, which can be expressed as:

$$\max_{m_{scaled}} \; ACCURACY(m_{scaled}) \times \left( \frac{LATENCY(m_{scaled})}{Target} \right)^{\omega} \qquad (2)$$

Although the objective for the NAS performed to identify the base neural architecture (shown herein at (1)) is the same as the objective shown at (2), one key difference is that the performance measures under (2) are computed by evaluating the candidate scaled neural network $m_{scaled}$ rather than the base neural network $m$. The overall goal of the system can be to identify scaling parameter values at a Pareto equilibrium of each of the multiple objectives, for example accuracy and latency.
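A minimal sketch of the multi-objective reward in (2) follows, assuming the weighted-product form shown above; the exponent value omega = -0.07 is an illustrative assumption rather than a value taken from the disclosure.

```python
def scaling_reward(accuracy: float, latency_ms: float,
                   target_ms: float, omega: float = -0.07) -> float:
    """ACCURACY(m_scaled) * (LATENCY(m_scaled) / Target) ** omega."""
    return accuracy * (latency_ms / target_ms) ** omega

# A faster-than-target candidate is rewarded; a slower one is penalized.
print(scaling_reward(0.80, latency_ms=8.0, target_ms=10.0))   # > 0.80
print(scaling_reward(0.80, latency_ms=14.0, target_ms=10.0))  # < 0.80
```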

In other words, the system determines a performance measure of the base neural network scaled according to the plurality of candidate scaling parameter values, as shown in block 320. The system determines the performance measure from different performance metrics reflecting the performance of the candidate scaled neural network that results from the candidate scaling parameter values.

The base neural network architecture is scaled in its scaling parameters according to the scaling-coefficient tuple searched in the coefficient search space. ACCURACY(m_scaled) can be a measure of the candidate scaled neural network's accuracy on a set of validation examples obtained from the received training data. LATENCY(m_scaled) can be a measure of the time between the candidate scaled neural network receiving an input and producing a corresponding output when deployed on the target computing resources. In general, the system seeks to maximize the accuracy of a candidate neural network architecture while minimizing its latency.

Following the description of (3), the system can directly or indirectly obtain performance metrics for the candidate scaled neural network's operational intensity, computational requirements, and execution efficiency. This is because LATENCY is a function of these three other underlying performance metrics. Accordingly, in addition to the accuracy and latency of the scaled neural network architecture when deployed on the target computing resources, the system can also be configured to search for candidate scaling parameter values that optimize one or more of operational intensity, computational requirements, and execution efficiency.
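The dependence of latency on computational requirements W, operational intensity I, and execution efficiency E can be sketched with a roofline-style estimate. The functional form below is an assumption for illustration and is not necessarily the exact formula (3) of the disclosure.

```python
def estimate_latency_s(flops: float, bytes_moved: float,
                       peak_flops_per_s: float, peak_bytes_per_s: float,
                       efficiency: float) -> float:
    """Roofline-style estimate: latency = W / (E * attainable rate)."""
    intensity = flops / bytes_moved                  # I = W / Q, FLOPs/byte
    attainable = min(peak_flops_per_s, intensity * peak_bytes_per_s)
    return flops / (efficiency * attainable)

# Example: 2 GFLOPs at I = 50 FLOPs/byte on a 100 TFLOP/s, 1 TB/s device.
print(estimate_latency_s(2e9, 4e7, 1e14, 1e12, efficiency=0.6))
```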

As part of determining the performance measure, the system can further train or tune the candidate scaled neural network architecture using the received training data. Per block 330, the system can determine the performance measure of the trained, scaled neural network and determine whether the performance measure satisfies a performance threshold. The performance measure and the performance threshold can each be a composite of multiple performance metrics and multiple performance thresholds, respectively. For example, the system can determine a single performance measure from metrics of both the scaled neural network's accuracy and its inference latency, or the system can determine separate performance metrics for the different objectives and compare each metric against a corresponding performance threshold.

If the performance measure satisfies the performance threshold, process 300 ends. Otherwise, the process continues and, per block 310, the system selects a new plurality of candidate scaling parameter values. For example, the system can select a new scaling-coefficient tuple from the coefficient search space based at least in part on previously selected candidate tuples and their corresponding performance measures. In some implementations, the system searches multiple coefficient tuples and performs a finer-grained search near each of the candidate tuples according to the multiple objectives.

The system can implement any of a variety of techniques for iteratively searching the coefficient candidate space, for example using grid search, reinforcement learning, evolutionary search, and the like. The system can continue searching for scaling parameter values until a stopping criterion is reached, such as convergence or a number of iterations, as previously described.
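For instance, a grid search over candidate coefficient tuples could look like the following sketch; the evaluate callback, the grid values, and the reuse of the reward form from (2) are illustrative assumptions.

```python
import itertools

def grid_search_coefficients(evaluate, grid=(1.0, 1.1, 1.2, 1.3),
                             target_ms=10.0, omega=-0.07):
    """Return the (alpha, beta, gamma) tuple with the best reward.

    evaluate((alpha, beta, gamma)) -> (accuracy, latency_ms) is assumed
    to train and measure the base network scaled by the candidate tuple.
    """
    best_tuple, best_reward = None, float("-inf")
    for candidate in itertools.product(grid, repeat=3):
        accuracy, latency_ms = evaluate(candidate)
        reward = accuracy * (latency_ms / target_ms) ** omega
        if reward > best_reward:
            best_tuple, best_reward = candidate, reward
    return best_tuple
```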

In some implementations, the system can be tuned according to one or more controller parameter values, which can be tuned manually, learned by a machine learning technique, or a combination of both. The controller parameters can affect the relative influence of each objective on the overall performance measure of a candidate tuple. In some examples, particular values in a candidate tuple, or relationships between those values, can be favored or disfavored based on learned characteristics of desirable scaling coefficients that are reflected at least in part in the controller parameter values.

Per block 340, the system generates one or more groups of scaling parameter values from the selected candidate scaling parameter values according to one or more objective trade-offs. An objective trade-off represents a different threshold for each objective, for example accuracy and latency, and can be satisfied by different scaled neural networks. For example, one objective trade-off can set a higher threshold for network accuracy but a less strict threshold for inference latency (i.e., a more accurate network with higher latency). As another example, an objective trade-off can set a lower threshold for network accuracy but a stricter threshold for inference latency (i.e., a less accurate network with lower latency). As another example, an objective trade-off can balance accuracy and latency performance; the sketch following this paragraph illustrates such per-objective thresholds.
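One way to express such trade-offs is as per-objective thresholds. The sketch below is a hypothetical schema; the trade-off names and threshold values are illustrative assumptions.

```python
TRADEOFFS = {
    # Illustrative thresholds only: (minimum accuracy, maximum latency ms).
    "accuracy_first": (0.85, 20.0),
    "latency_first":  (0.75, 5.0),
    "balanced":       (0.80, 10.0),
}

def satisfies(accuracy: float, latency_ms: float, tradeoff: str) -> bool:
    min_acc, max_ms = TRADEOFFS[tradeoff]
    return accuracy >= min_acc and latency_ms <= max_ms

print(satisfies(0.86, 18.0, "accuracy_first"))  # True
print(satisfies(0.86, 18.0, "latency_first"))   # False
```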

For each objective trade-off, the system can identify a respective group of scaling parameter values that the system can use to scale the base neural network architecture so that it satisfies the trade-off. In other words, the system can repeat the selecting as shown in block 310, the determining of the performance measure as shown in block 320, and the determining of whether the performance measure satisfies the performance threshold as shown in block 330, where the differences between the performance thresholds are defined by the objective trade-offs. In some implementations, per block 330, rather than searching coefficient tuples for the base neural network architecture, the system can search candidate scaling-coefficient tuples for the base neural network architecture as scaled by selected candidate scaling parameter values that initially satisfied the performance measure for the multiple objectives.

In some implementations, after the plurality of scaling parameter values has been selected and the performance measure satisfied, the system can be configured to generate a family of neural networks by scaling the initially scaled neural network again. In other words, from a scaling-coefficient tuple

$$(\alpha, \beta, \gamma),$$

the system can scale the tuple, for example by a common factor or "compound coefficient", to obtain tuples for scaling the base neural network by other factors, such as

$$(\alpha^{2}, \beta^{2}, \gamma^{2}) \quad \text{and} \quad (\alpha^{3}, \beta^{3}, \gamma^{3}).$$

Under this approach, a family of models can be generated quickly, and the family can include different neural networks suited to a variety of use cases.

For example, the different neural networks in a family can be scaled according to one or more compound coefficients. Given a compound coefficient $\phi$ and a scaling-coefficient tuple $(\alpha, \beta, \gamma)$, the scaling parameters of a scaled neural network in the family can be defined by:

$$d = \alpha^{\phi}, \qquad w = \beta^{\phi}, \qquad r = \gamma^{\phi},$$

where $d$, $w$, and $r$ are the scaling parameter values for the neural network's depth, width, and resolution, respectively. In the case of the first scaled neural network architecture, $\phi$ can be 1. In general, the compound coefficient $\phi$ can represent a latency budget for network scaling, while $\alpha$, $\beta$, and $\gamma$ control how that latency budget is allocated among the different scaling parameter values.
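A minimal sketch of generating a family from one searched tuple, following the definition above; the particular tuple and compound coefficients chosen here are illustrative.

```python
def family_from_tuple(alpha: float, beta: float, gamma: float,
                      phis=(1, 2, 3)):
    """Yield the (d, w, r) scaling values for each compound coefficient."""
    return [{"phi": phi,
             "d": alpha ** phi,   # depth multiplier
             "w": beta ** phi,    # width multiplier
             "r": gamma ** phi}   # resolution multiplier
            for phi in phis]

for params in family_from_tuple(1.2, 1.1, 1.15):
    print(params)
```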

Returning to FIG. 2, the NAS-LACS system can use the architecture of the base neural network scaled according to the plurality of scaling parameter values to generate one or more architectures of scaled neural networks, as shown in block 250. The scaled neural network architectures can be part of a family of neural networks generated from the base neural network architecture and different scaling parameter values.

If the information specifying the target computing resources includes one or more sets of multiple target computing resources, for example multiple different types of hardware accelerators, the system can repeat process 200 and process 300 for each hardware accelerator to generate, for each target set, a respective scaled neural network architecture. For each hardware accelerator, the system can generate a family of scaled neural network architectures according to different objective trade-offs between latency and accuracy, or among latency, accuracy, and other objectives (including, in particular, operational intensity and execution efficiency, described herein with reference to (3)).

In some implementations, the system can generate multiple families of scaled neural network architectures from the same base neural network architecture. This approach can help in situations where different target devices share similar hardware characteristics, and can lead to faster identification of a corresponding scaled family for each device, at least because the search for a base neural network architecture is performed only once.
Example Systems:

FIG. 4 is a block diagram of a neural architecture search latency-aware compound scaling (NAS-LACS) system 400 according to aspects of the disclosure. The system 400 is configured to receive training data 401 for executing a neural network and target computing resource data 402 specifying target computing resources. The system 400 can be configured to implement the techniques for generating a family of scaled neural network architectures described herein with reference to FIGS. 1-3.

The system 400 can be configured to receive input data through a user interface. For example, the system 400 can receive data as part of a call to an application programming interface (API) exposing the system 400. The system 400 can be implemented on one or more computing devices, as described herein with reference to FIG. 5. For example, input to the system 400 can be provided through a storage medium, including remote storage connected to the one or more computing devices over a network, or as input through a user interface on a client computing device coupled to the system 400.

The system 400 can be configured to output scaled neural network architectures 409, such as a family of scaled neural network architectures. The scaled neural network architectures 409 can be sent as output, for example for display on a user display, optionally visualized according to the shapes and sizes of the neural network layers defined in each architecture. In some implementations, the system 400 can be configured to provide the scaled neural network architectures 409 as a set of computer-readable instructions, such as one or more computer programs, that can be executed by the target computing resources to implement the scaled neural network architectures 409.

A computer program can be written in any type of programming language and according to any programming paradigm, for example declarative, procedural, assembly, object-oriented, data-oriented, functional, or imperative. A computer program can be written to perform one or more different functions and to operate within a computing environment, for example on a physical device, on a virtual machine, or across multiple devices. A computer program can also implement functionality described in this specification, for example as performed by a system, engine, module, or model.

In some implementations, the system 400 is configured to forward data for the scaled neural network architectures 409 to one or more other devices configured to convert an architecture into an executable program written in a computer programming language, optionally as part of a framework for generating machine learning models. The system 400 can also be configured to send data corresponding to the scaled neural network architectures 409 to a storage device for storage and later retrieval.

The system 400 can include a NAS engine 405. The NAS engine 405 and the other components of the system 400 can be implemented as one or more computer programs, as specially configured electronic circuitry, or as any combination of the foregoing. The NAS engine 405 can be configured to receive the training data 401 and the target computing resource data 402 and to generate a base neural network architecture 407 that can be sent to a LACS engine 415. The NAS engine 405 can implement any of the various techniques for neural architecture search described herein with reference to FIGS. 1-3. The system can be configured, according to aspects of the disclosure, to perform NAS using multiple objectives, including the inference latency and the accuracy of a candidate neural network when executing on the target computing resources. As part of determining the performance metrics the NAS engine 405 can use to search for a base neural network architecture, the system 400 can include a performance measurement engine 410.

The performance measurement engine 410 can be configured to receive an architecture for a candidate base neural network and to generate performance metrics according to the objectives used by the NAS engine 405 in performing the NAS. The performance metrics can provide an overall performance measure of a candidate neural network according to the multiple objectives. To determine the accuracy of a candidate base neural network, the performance measurement engine 410 can execute the candidate base neural network on a validation set of training examples, obtained for example by holding out some of the training data 401.

To measure latency, the performance measurement engine 410 can communicate with computing resources corresponding to the target computing resources specified by the data 402. For example, if the target computing resource data 402 specifies a TPU as a target resource, the performance measurement engine 410 can send the candidate base neural network to execute on a corresponding TPU. The TPU can be housed, for example, in a data center that communicates with the one or more processors implementing the system 400 over a network, as described in more detail with reference to FIG. 5.

The performance measurement engine 410 can receive latency information indicating the latency between the target computing resources receiving an input and producing an output. The latency information can be measured directly in the field on the target computing resources and sent to the performance measurement engine 410, or measured by the performance measurement engine 410 itself. If the performance measurement engine 410 measures latency, the engine 410 can be configured to compensate for latency not attributable to processing the candidate base neural network, such as network latency for communication to and from the target computing resources. As another example, the performance measurement engine 410 can estimate the latency of processing input through the candidate base neural network based on previous measurements of the target computing resources and on the hardware characteristics of the target computing resources.

The performance measurement engine 410 can generate performance metrics for other characteristics of a candidate neural network architecture, such as its operational intensity and its execution efficiency. As described herein with reference to FIGS. 1-3, inference latency can be determined jointly as a function of computational requirements (FLOPs), execution efficiency, and operational intensity, and in some implementations the system 400 directly or indirectly searches for and scales a neural network based on these additional characteristics.

Once the performance metrics are generated, the performance measurement engine 410 can send the metrics to the NAS engine 405, which in turn can iterate with a new search for a new candidate base neural network architecture until a stopping criterion is reached, as described herein with reference to FIG. 2.

In some examples, the NAS engine 405 is tuned according to one or more controller parameters that adjust how the NAS engine 405 selects the next candidate base neural network architecture. The controller parameters can be tuned manually according to the desired characteristics of a neural network for a particular neural network task. In some examples, the controller parameters can be learned through any of a variety of machine learning techniques, and the NAS engine 405 can implement one or more machine learning models trained to select a base neural network architecture according to multiple objectives, such as latency and accuracy. For example, the NAS engine 405 can implement a recurrent neural network trained to use the features of a previous candidate base neural network, together with the multiple objectives, to predict a candidate base network more likely to satisfy the objectives. The network can be trained using performance metrics and training data labeled to indicate the final base neural architecture selected in view of a set of training data and target computing resource data associated with a neural network task.

The LACS engine 415 can be configured to perform the latency-aware compound scaling described in accordance with aspects of the disclosure. The LACS engine 415 is configured to receive, from the NAS engine 405, data 407 specifying a base neural network architecture. Like the NAS engine 405, the LACS engine 415 can communicate with the performance measurement engine 410 to obtain performance metrics for a candidate scaled neural network architecture. The LACS engine 415 can maintain a search space of different scaling-coefficient tuples in memory, and can also be configured to re-scale a final scaled architecture to quickly obtain a family of scaled neural network architectures, as described herein with reference to FIGS. 1-3. In some implementations, the LACS engine 415 is configured to perform other forms of scaling (for example, simple scaling), but using the multiple objectives (including latency) used by the NAS engine 405.

FIG. 5 is a block diagram of an example environment 500 for implementing the NAS-LACS system 400. The system 400 can be implemented on one or more devices having one or more processors in one or more locations, such as on server computing device 515. Client computing device 512 and the server computing device 515 can be communicatively coupled to one or more storage devices 530 over a network 560. The storage device(s) 530 can be a combination of volatile and non-volatile memory and can be at the same or different physical locations as the computing devices 512, 515. For example, the storage device(s) 530 can include any type of non-transitory computer-readable medium capable of storing information, such as a hard drive, solid-state drive, tape drive, optical storage, memory card, ROM, RAM, DVD, CD-ROM, and write-capable and read-only memories.

The server computing device 515 can include one or more processors 513 and memory 514. The memory 514 can store information accessible by the processor(s) 513, including instructions 521 that can be executed by the processor(s) 513. The memory 514 can also include data 523 that can be retrieved, manipulated, or stored by the processor(s) 513. The memory 514 can be a type of non-transitory computer-readable medium capable of storing information accessible by the processor(s) 513, such as volatile and non-volatile memory. The processor(s) 513 can include one or more central processing units (CPUs), graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and/or application-specific integrated circuits (ASICs), such as tensor processing units (TPUs).

The instructions 521 can include one or more instructions that, when executed by the processor(s) 513, cause the one or more processors to perform the actions defined by the instructions. The instructions 521 can be stored in object code format for direct processing by the processor(s) 513, or in other formats, including interpretable scripts that are interpreted on demand or compiled in advance, or collections of independent source code modules. The instructions 521 can include instructions for implementing the system 400 consistent with aspects of this disclosure. The system 400 can be executed using the processor(s) 513 and/or using other processors located remotely from the server computing device 515.

The data 523 can be retrieved, stored, or modified by the processor(s) 513 in accordance with the instructions 521. The data 523 can be stored in computer registers, or in a relational or non-relational database as a table having a plurality of different fields and records, or as JSON, YAML, proto, or XML documents. The data 523 can also be formatted in a computer-readable format such as, but not limited to, binary values, ASCII, or Unicode. Moreover, the data 523 can include information sufficient to identify relevant information, such as numbers, descriptive text, proprietary codes, pointers, references to data stored in other memories (including other network locations), or information used by a function to compute the relevant data.

The client computing device 512 can also be configured similarly to the server computing device 515, with one or more processors 516, memory 517, instructions 518, and data 519. The client computing device 512 can also include a user output 526 and a user input 524. The user input 524 can include any appropriate mechanism or technique for receiving input from a user, such as a keyboard, a mouse, mechanical actuators, soft actuators, a touchscreen, a microphone, and sensors.

The server computing device 515 can be configured to transmit data to the client computing device 512, and the client computing device 512 can be configured to display at least a portion of the received data on a display implemented as part of the user output 526. The user output 526 can also be used to display an interface between the client computing device 512 and the server computing device 515. Alternatively or additionally, the user output 526 can include one or more speakers, transducers, or other audio outputs, a haptic interface, or other haptic feedback that provides non-visual and non-audible information to a platform user of the client computing device 512.

Although FIG. 5 illustrates the processors 513, 516 and the memories 514, 517 as being within the computing devices 515, 512, the components described in this specification, including the processors 513, 516 and the memories 514, 517, can comprise multiple processors and memories that operate in different physical locations and not within the same computing device. For example, some of the instructions 521, 518 and data 523, 519 can be stored on a removable SD card while others are stored within a read-only computer chip. Some or all of the instructions and data can be stored in a location physically remote from, yet still accessible by, the processors 513, 516. Similarly, the processors 513, 516 can each include a collection of processors that can perform concurrent and/or sequential operations. The computing devices 515, 512 can each include one or more internal clocks providing timing information, which can be used for time measurement of the operations and programs run by the computing devices 515, 512.

The server computing device 515 can be connected over the network 560 to a data center 550 housing hardware accelerators 551A-N. The data center 550 can be one of multiple data centers or other facilities in which various types of computing devices, such as hardware accelerators, are located. The computing resources housed in the data center 550 can be specified as part of the target computing resources for deploying scaled neural network architectures, as described herein.

The server computing device 515 can be configured to receive requests from the client computing device 512 to process data on computing resources in the data center 550. For example, the environment 500 can be part of a computing platform configured to provide a variety of services to users through various user interfaces and/or APIs exposing the platform services. One or more of the services can be a machine learning framework or a set of tools for generating neural networks or other machine learning models according to a specified task and training data. The client computing device 512 can receive and transmit data specifying the target computing resources to be allocated for executing a neural network trained to perform a particular neural network task. In accordance with the aspects of the disclosure described herein with reference to FIGS. 1-4, the NAS-LACS system 400 can receive the data specifying the target computing resources and the training data and, in response, generate a family of scaled neural network architectures for deployment on the target computing resources.

As other examples of potential services provided by a platform implementing the environment 500, the server computing device 515 can maintain a variety of families of scaled neural network architectures according to the different potential target computing resources available at the data center 550. For example, the server computing device 515 can maintain different families for deploying neural networks on the various types of TPUs and/or GPUs housed in the data center 550 or otherwise available for processing.

The devices 512, 515 and the data center 550 can be capable of direct and indirect communication over the network 560. For example, using a network socket, the client computing device 512 can connect to a service operating in the data center 550 through an Internet protocol. The devices 515, 512 can set up listening sockets that accept an initiating connection for sending and receiving information. The network 560 itself can include various configurations and protocols, including the Internet, the World Wide Web, intranets, virtual private networks, wide area networks, local networks, and private networks using communication protocols proprietary to one or more companies. The network 560 can support a variety of short-range and long-range connections. Short-range and long-range connections can be made over different bandwidths, such as 2.402 GHz to 2.480 GHz (commonly associated with the Bluetooth® standard) and 2.4 GHz and 5 GHz (commonly associated with the Wi-Fi® communication protocol), or using various communication standards, such as the LTE® standard for wireless broadband communication. Additionally or alternatively, the network 560 can also support wired connections between the devices 512, 515 and the data center 550, including over various types of Ethernet connections.

Although a single server computing device 515, client computing device 512, and data center 550 are shown in FIG. 5, it should be understood that aspects of the disclosure can be implemented according to a variety of different configurations and quantities of computing devices, including in paradigms for sequential or parallel processing, or over a distributed network of multiple devices. In some implementations, aspects of the disclosure can be performed on a single device connected to hardware accelerators configured for processing neural networks, and on any combination thereof.
Example Use Cases:

As described herein, aspects of the disclosure provide for generating, according to a multi-objective approach, an architecture for a neural network scaled from a base neural network. The following are examples of neural network tasks.

As one example, the input to the neural network can be in the form of images or video. As part of processing a given input, for example as part of a computer vision task, a neural network can be configured to extract, identify, and generate features. A neural network trained to perform this type of neural network task can be trained to generate an output classification from a set of different potential classifications. Additionally or alternatively, the neural network can be trained to output a score corresponding to an estimated probability that an object identified in the image or video belongs to a particular class.

As another example, the input to the neural network can be data files corresponding to a particular format, for example HTML files, word processing documents, or formatted metadata obtained from other types of data, such as metadata for image files. In this context, a neural network task can be to classify, score, or otherwise predict some characteristic of the received input. For example, a neural network can be trained to predict the probability that the received input includes text related to a particular subject. Also as part of performing a particular task, a neural network can be trained to generate text predictions, for example as part of a tool for auto-completing text in a document as the document is being written. A neural network can also be trained to predict a translation of the text in an input document into a target language, for example while a message is being composed.

Other types of input documents can be data related to the characteristics of a network of interconnected devices. These input documents can include activity logs, as well as records of the access privileges different computing devices have for accessing different sources of potentially sensitive data. A neural network can be trained to process these and other types of documents to predict ongoing and future network security breaches. For example, the neural network can be trained to predict an intrusion into the network by a malicious actor.

As another example, the input to a neural network can be audio input, including streaming audio, pre-recorded audio, and audio that is part of a video or another source or medium. A neural network task in the audio context can include speech recognition, including isolating speech from other identified audio sources and/or enhancing the characteristics of the identified speech so that it is easier to hear. A neural network can be trained to predict an accurate translation of input speech into a target language, for example in real time as part of a translation tool.

In addition to data inputs of the various types described herein, a neural network can also be trained to process features corresponding to a given input. A feature is a value, for example a numerical value or a category, related to some characteristic of the input. For example, in the context of an image, a feature of the image can relate to the RGB value of each pixel in the image. A neural network task in the image/video context can be to classify the contents of an image or video, for example for the presence of different people, places, or things. The neural network can be trained to extract and select relevant features for processing to generate an output for a given input, and can also be trained to generate new features based on learned relationships between various characteristics of the input data.

Aspects of the disclosure can be implemented in digital circuits, in computer-readable storage media, as one or more computer programs, or as a combination of one or more of the foregoing. The computer-readable storage media can be non-transitory, for example as one or more instructions executable by the processor(s) and stored on a tangible storage device.

In this specification, the phrase "configured to" is used in different contexts related to computer systems, hardware, or a part of a computer program. When a system is said to be configured to perform one or more operations, this means the system has appropriate software, firmware, and/or hardware installed on it that, in operation, cause the system to perform the one or more operations. When some hardware is said to be configured to perform one or more operations, this means the hardware includes one or more circuits that, in operation, receive input and, in accordance with the input, generate output corresponding to the one or more operations. When a computer program is said to be configured to perform one or more operations, this means the computer program includes one or more program instructions that, when executed by one or more computers, cause the one or more computers to perform the one or more operations.

Although the operations shown in the drawings and recited in the claims are presented in a particular order, it should be understood that the operations can be performed in an order different from the one shown, and that some operations can be omitted, performed more than once, and/or performed in parallel with other operations. Moreover, the separation of different system components configured to perform different operations should not be understood as requiring the components to be separate. The described components, modules, programs, and engines can be integrated together as a single system or can be parts of multiple systems.

Unless otherwise stated, the foregoing alternative examples are not mutually exclusive and can be implemented in various combinations to achieve unique advantages. As these and other variations and combinations of the features discussed above can be utilized without departing from the subject matter defined by the claims, the foregoing description of the embodiments should be taken by way of illustration rather than by way of limitation of the subject matter defined by the claims. In addition, the provision of the examples described herein, as well as clauses phrased as "such as", "including", and the like, should not be interpreted as limiting the subject matter of the claims to the specific examples; rather, the examples are intended to illustrate only one of many possible aspects. Moreover, the same reference numbers in different drawings can identify the same or similar elements.

101: base neural network architecture
103: family
104A-N: scaled neural network architectures
107A-N: candidate networks
108: coefficient search space
109: scaled neural network architecture
115: data center
116: hardware accelerators
200: process
210: block
220: block
230: block
240: block
250: block
300: process
310: block
320: block
330: block
340: block
400: neural architecture search latency-aware compound scaling (NAS-LACS) system
401: training data
402: target computing resource data
405: neural architecture search (NAS) engine
407: base neural network architecture/data
409: scaled neural network architectures
410: performance measurement engine
415: latency-aware compound scaling (LACS) engine
500: environment
512: client computing device
513: processor(s)
514: memory
515: server computing device
516: processor(s)
517: memory
518: instructions
519: data
521: instructions
523: data
524: user input
526: user output
530: storage device(s)
550: data center
551A-N: hardware accelerators
560: network

FIG. 1 is a block diagram illustrating a family of scaled neural network architectures for deployment in a data center housing the hardware accelerators on which the deployed neural networks will execute.

FIG. 2 is a flowchart of an example process for generating scaled neural network architectures for execution on target computing resources.

FIG. 3 is an example process for latency-aware compound scaling of a base neural network architecture.

FIG. 4 is a block diagram of a neural architecture search latency-aware compound scaling (NAS-LACS) system according to aspects of the disclosure.

FIG. 5 is a block diagram of an example environment for implementing the NAS-LACS system.

101: base neural network architecture
103: family
104A-N: scaled neural network architectures
107A-N: candidate networks
108: coefficient search space
109: scaled neural network architecture
115: data center
116: hardware accelerators

Claims (20)

1. A computer-implemented method for determining an architecture for a neural network, comprising:
receiving, by one or more processors, information specifying target computing resources;
receiving, by the one or more processors, data specifying an architecture of a base neural network;
identifying, by the one or more processors, a plurality of scaling parameter values for scaling the base neural network according to the information specifying the target computing resources and a plurality of scaling parameters of the base neural network, wherein the identifying comprises repeatedly performing the following steps:
selecting a plurality of candidate scaling parameter values, and
determining a performance measure of the base neural network scaled according to the plurality of candidate scaling parameter values, wherein the performance measure is determined according to a plurality of objectives comprising a latency objective; and
generating, by the one or more processors, an architecture of a scaled neural network using the architecture of the base neural network scaled according to the plurality of scaling parameter values.

2. The method of claim 1,
wherein the plurality of objectives is a plurality of second objectives; and
wherein receiving the data specifying the architecture of the base neural network comprises:
receiving, by the one or more processors, training data corresponding to a neural network task, and
performing, by the one or more processors and using the training data, a neural architecture search over a search space according to a plurality of first objectives to identify the architecture of the base neural network.

3. The method of claim 2,
wherein the search space comprises candidate neural network layers, each candidate neural network layer configured to perform one or more respective operations, and
wherein the search space includes candidate neural network layers comprising different respective activation functions.

4. The method of claim 3,
wherein the architecture of the base neural network comprises a plurality of components, each component having a respective plurality of neural network layers, and
wherein the search space comprises a plurality of candidate components of candidate neural network layers, including a first component of candidate network layers comprising a first activation function, and a second component of candidate network layers comprising a second activation function different from the first activation function.

5. The method of claim 2, wherein the plurality of first objectives used for performing the neural architecture search is the same as the plurality of second objectives used for identifying the plurality of scaling parameter values.
6. The method of claim 2, wherein the plurality of first objectives and the plurality of second objectives include an accuracy objective corresponding to the accuracy of output of the base neural network when trained using the training data.

7. The method of claim 1, wherein the measure of performance corresponds, at least in part, to a measure of latency between the base neural network receiving an input and generating an output when the base neural network is scaled according to the plurality of candidate scaling parameter values and deployed on the target computing resources.

8. The method of claim 1, wherein the latency objective corresponds to a minimum latency between the base neural network receiving an input and generating an output when the base neural network is deployed on the target computing resources.

9. The method of claim 1,
wherein the information specifying the target computing resources specifies one or more hardware accelerators; and
wherein the method further comprises executing the scaled neural network on the one or more hardware accelerators to perform the neural network task.

10. The method of claim 9,
wherein the target computing resources are first target computing resources and the plurality of scaling parameter values are a plurality of first scaling parameter values, and
wherein the method further comprises:
receiving, by the one or more processors, information specifying second target computing resources different from the first target computing resources, and
identifying, according to the information specifying the second target computing resources, a plurality of second scaling parameter values for scaling the base neural network, wherein the plurality of second scaling parameter values are different from the plurality of first scaling parameter values.

11. The method of claim 1, wherein the plurality of scaling parameter values are a plurality of first scaling parameter values, and
wherein the method further comprises generating a scaled neural network architecture from the base neural network architecture scaled using a plurality of second scaling parameter values, wherein the plurality of second scaling parameter values are generated from the plurality of first scaling parameter values and one or more compound coefficients that uniformly modify the value of each of the first scaling parameter values.

12. The method of claim 1, wherein the base neural network is a convolutional neural network, and wherein the plurality of scaling parameters comprise one or more of: a depth of the base neural network, a width of the base neural network, and an input resolution of the base neural network.
13. A system comprising:
one or more processors, and
one or more storage devices coupled to the one or more processors and storing instructions that, when executed by the one or more processors, cause the one or more processors to perform operations for determining an architecture of a neural network, the operations comprising:
receiving information specifying target computing resources;
receiving data specifying an architecture of a base neural network;
identifying a plurality of scaling parameter values for scaling the base neural network according to the information specifying the target computing resources and a plurality of scaling parameters of the base neural network, wherein the identifying comprises repeatedly performing the following steps:
selecting a plurality of candidate scaling parameter values, and
determining a measure of performance for the base neural network scaled according to the plurality of candidate scaling parameter values, wherein the measure of performance is determined in accordance with a plurality of objectives including a latency objective; and
generating an architecture for a scaled neural network using the architecture of the base neural network scaled according to the plurality of scaling parameter values.

14. The system of claim 13,
wherein the plurality of objectives are a plurality of second objectives; and
wherein receiving the data specifying the architecture of the base neural network comprises:
receiving training data corresponding to a neural network task, and
performing, using the training data and in accordance with a plurality of first objectives, a neural architecture search over a search space to identify the architecture of the base neural network.

15. The system of claim 14,
wherein the search space comprises candidate neural network layers, each candidate neural network layer configured to perform one or more respective operations, and
wherein the search space includes candidate neural network layers comprising different respective activation functions.

16. The system of claim 14, wherein the plurality of first objectives used to perform the neural architecture search are the same as the plurality of second objectives used to identify the plurality of scaling parameter values.

17. The system of claim 14, wherein the plurality of first objectives and the plurality of second objectives include an accuracy objective corresponding to the accuracy of output of the base neural network when trained using the training data.

18. The system of claim 13, wherein the measure of performance corresponds, at least in part, to a measure of latency between the base neural network receiving an input and generating an output when the base neural network is scaled according to the plurality of candidate scaling parameter values and deployed on the target computing resources.
19. The system of claim 13, wherein the latency objective corresponds to a minimum latency between the base neural network receiving an input and generating an output when the base neural network is deployed on the target computing resources.

20. One or more non-transitory computer-readable storage media storing instructions that, when executed by one or more processors, cause the one or more processors to perform operations for determining an architecture of a neural network, the operations comprising:
receiving information specifying target computing resources;
receiving, by the one or more processors, data specifying an architecture of a base neural network;
identifying a plurality of scaling parameter values for scaling the base neural network according to the information specifying the target computing resources and a plurality of scaling parameters of the base neural network, wherein the identifying comprises repeatedly performing the following steps:
selecting a plurality of candidate scaling parameter values, and
determining a measure of performance for the base neural network scaled according to the plurality of candidate scaling parameter values, wherein the measure of performance is determined in accordance with a plurality of objectives including a latency objective; and
generating an architecture for a scaled neural network using the architecture of the base neural network scaled according to the plurality of scaling parameter values.
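To make the claimed scaling search concrete, the sketch below illustrates the repeated select-and-evaluate loop of claims 1, 13, and 20 and the compound-coefficient scaling of claim 11. It is a minimal illustration under assumed stand-ins, not the specification's implementation: the proxy accuracy and latency functions, the sampling ranges, the weighted-product performance measure, and all identifiers (ScalingParams, identify_scaling_values, compound_scale, and so on) are invented for this example; in practice the evaluation would come from training the scaled network and profiling it on the target hardware accelerators.

```python
# scaling_search_sketch.py -- a minimal sketch of the claimed scaling search.
# The cost models below are toy proxies, not the patent's implementation.

import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingParams:
    depth: float       # multiplier on layer count
    width: float       # multiplier on channel count
    resolution: float  # multiplier on input resolution


def proxy_accuracy(p: ScalingParams) -> float:
    # Toy stand-in for "train the scaled network and measure accuracy":
    # accuracy rises with model capacity but saturates.
    capacity = p.depth * p.width**2 * p.resolution**2
    return 1.0 - math.exp(-0.3 * capacity)


def proxy_latency_ms(p: ScalingParams) -> float:
    # Toy stand-in for profiling on the target accelerator: compute cost
    # grows roughly with depth * width^2 * resolution^2.
    return 5.0 * p.depth * p.width**2 * p.resolution**2


def performance_measure(p: ScalingParams, latency_objective_ms: float) -> float:
    # Folds the accuracy and latency objectives into one score; the
    # negative exponent penalizes candidates slower than the objective.
    acc = proxy_accuracy(p)
    lat = proxy_latency_ms(p)
    return acc * (lat / latency_objective_ms) ** -0.07


def identify_scaling_values(latency_objective_ms: float,
                            trials: int = 1000,
                            seed: int = 0) -> ScalingParams:
    """Repeatedly select candidate scaling parameter values and keep the
    candidate whose measure of performance is best -- the repeated
    select/evaluate loop of claims 1, 13, and 20."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(trials):
        candidate = ScalingParams(
            depth=rng.uniform(1.0, 4.0),
            width=rng.uniform(1.0, 3.0),
            resolution=rng.uniform(1.0, 2.0),
        )
        score = performance_measure(candidate, latency_objective_ms)
        if score > best_score:
            best, best_score = candidate, score
    return best


def compound_scale(first: ScalingParams, phi: float) -> ScalingParams:
    """Claim 11's compound coefficient: uniformly modify each first scaling
    parameter value to obtain the second scaling parameter values."""
    return ScalingParams(depth=first.depth**phi,
                         width=first.width**phi,
                         resolution=first.resolution**phi)


if __name__ == "__main__":
    first = identify_scaling_values(latency_objective_ms=30.0)
    second = compound_scale(first, phi=1.5)  # a larger model in the same family
    print("first:", first)
    print("second:", second)
```

The weighted-product score mirrors the accuracy × (latency / target)^w form commonly used in latency-aware architecture search; the claims themselves leave the exact form of the measure of performance open.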
TW110124428A 2021-01-15 2021-07-02 Neural architecture scaling for hardware accelerators TW202230221A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202163137926P 2021-01-15 2021-01-15
US63/137,926 2021-01-15
US17/175,029 US20220230048A1 (en) 2021-01-15 2021-02-12 Neural Architecture Scaling For Hardware Accelerators
US17/175,029 2021-02-12

Publications (1)

Publication Number Publication Date
TW202230221A (en) 2022-08-01

Family

ID=77448062

Family Applications (1)

Application Number Title Priority Date Filing Date
TW110124428A TW202230221A (en) 2021-01-15 2021-07-02 Neural architecture scaling for hardware accelerators

Country Status (5)

Country Link
EP (1) EP4217928A1 (en)
JP (1) JP7579972B2 (en)
CN (1) CN116261734A (en)
TW (1) TW202230221A (en)
WO (1) WO2022154829A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI819880B (en) * 2022-11-03 2023-10-21 財團法人工業技術研究院 Hardware-aware zero-cost neural network architecture search system and network potential evaluation method thereof

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11836595B1 (en) * 2022-07-29 2023-12-05 Lemon Inc. Neural architecture search system using training based on a weight-related metric

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11531861B2 (en) 2018-11-06 2022-12-20 Google Llc Neural architecture search with factorized hierarchical search space
JP7286774B2 (en) 2019-01-23 2023-06-05 グーグル エルエルシー Composite model scaling for neural networks

Also Published As

Publication number Publication date
WO2022154829A1 (en) 2022-07-21
JP7579972B2 (en) 2024-11-08
JP2023552048A (en) 2023-12-14
CN116261734A (en) 2023-06-13
EP4217928A1 (en) 2023-08-02

Similar Documents

Publication Publication Date Title
US20220230048A1 (en) Neural Architecture Scaling For Hardware Accelerators
JP7157154B2 (en) Neural Architecture Search Using Performance Prediction Neural Networks
KR102141324B1 (en) Fast computation of convolutional neural networks
US20200265301A1 (en) Incremental training of machine learning tools
JP2023139057A (en) Resource constrained neural network architecture search
JP7225395B2 (en) Dynamic Reconfiguration Training Computer Architecture
WO2019114147A1 (en) Image aesthetic quality processing method and electronic device
EP3788557A1 (en) Design flow for quantized neural networks
JP2021528796A (en) Neural network acceleration / embedded compression system and method using active sparsification
WO2023160290A1 (en) Neural network inference acceleration method, target detection method, device, and storage medium
US20210019152A1 (en) Data parallelism in distributed training of artificial intelligence models
CN113570029A (en) Method for obtaining neural network model, image processing method and device
US20210397963A1 (en) Method and apparatus for neural network model compression with micro-structured weight pruning and weight unification
US20240370693A1 (en) Full-stack hardware accelerator search
CN113424199A (en) Composite model scaling for neural networks
TW202230221A (en) Neural architecture scaling for hardware accelerators
WO2023150912A1 (en) Operator scheduling operation time comparison method and device, and storage medium
US20230385636A1 (en) System and method of improving compression of predictive models
JP2024504179A (en) Method and system for lightweighting artificial intelligence inference models
JP7150651B2 (en) Neural network model reducer
US20220076121A1 (en) Method and apparatus with neural architecture search based on hardware performance
KR102561799B1 (en) Method and system for predicting latency of deep learning model in device
US20230108177A1 (en) Hardware-Aware Progressive Training Of Machine Learning Models
US20230297580A1 (en) Hybrid and Hierarchical Multi-Trial and OneShot Neural Architecture Search on Datacenter Machine Learning Accelerators
US20240037373A1 (en) OneShot Neural Architecture and Hardware Architecture Search