US20200226473A1 - Systems, apparatus, methods, and architectures for heterogeneous precision acceleration of quantized neural networks - Google Patents
- Publication number
- US20200226473A1 (application Ser. No. 16/744,039)
- Authority
- US
- United States
- Prior art keywords
- tag
- precision
- data
- bit
- logic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F5/00—Methods or arrangements for data conversion without changing the order or content of the data handled
- G06F5/06—Methods or arrangements for data conversion without changing the order or content of the data handled for changing the speed of data flow, i.e. speed regularising or timing, e.g. delay lines, FIFO buffers; over- or underrun control therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/78—Architectures of general purpose stored program computers comprising a single central processing unit
- G06F15/7867—Architectures of general purpose stored program computers comprising a single central processing unit with reconfigurable architecture
- G06F15/7885—Runtime interface, e.g. data exchange, runtime control
- G06F15/7889—Reconfigurable logic implemented as a co-processor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/10—Interfaces, programming languages or software development kits, e.g. for simulating neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
- G06N5/046—Forward inferencing; Production systems
Definitions
- Embodiments described herein generally relate to the fields of data processing and machine learning, and more particularly relate to a hardware accelerator having a heterogeneous architecture for training quantized neural networks.
- DNNs Deep Neural Networks
- quantization reduces the bit widths for data and operations in a deep learning model to yield increased performance and/or energy efficiency.
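- As a point of reference only (not part of the claimed design), the sketch below illustrates one common uniform fixed-point quantization scheme of the kind the preceding paragraph refers to; the scale-based rounding and 8-bit grid are assumptions for illustration.

```python
import numpy as np

def quantize_uniform(x, num_bits=8):
    """Uniformly quantize a float32 array onto a signed num_bits integer grid.

    Returns the integer codes plus the scale needed to map them back to floats.
    """
    qmax = 2 ** (num_bits - 1) - 1                 # e.g., 127 for 8 bits
    scale = float(np.max(np.abs(x))) / qmax
    if scale == 0.0:                               # all-zero tensor: any scale works
        scale = 1.0
    codes = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return codes, scale

def dequantize(codes, scale):
    return codes.astype(np.float32) * scale

# Example: storing weights in 8 bits instead of 32 bits shrinks the model 4x and
# lets the MAC hardware operate on narrow integer operands.
w = np.random.randn(64, 64).astype(np.float32)
w_q, s = quantize_uniform(w, num_bits=8)
print("max quantization error:", float(np.max(np.abs(dequantize(w_q, s) - w))))
```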
- the quantized version of a DNN needs to be retrained, which can take weeks on GPUs depending on the size of the DNN model.
- a hardware accelerator with a heterogeneous architecture for training quantized neural networks comprises software-controllable multilevel memory to store data and a mixed precision array coupled to the memory.
- the mixed precision array includes an input buffer, detect logic to detect zero value operands, and a plurality of heterogeneous precision compute units to perform computations of mixed precision data types for a backward propagation phase of training a quantized neural network.
- FIG. 1 shows an embodiment of a block diagram of a big data system 100 for providing big data applications for a plurality of devices in accordance with one embodiment.
- FIGS. 2A and 2B illustrate methods for training quantized DNNs with a hardware accelerator architecture (e.g., homogeneous architecture in FIG. 2A , heterogeneous architecture in FIG. 2B ) in accordance with one embodiment.
- FIG. 3A illustrates pooling layers for the inference phase in accordance with one embodiment.
- FIG. 3B illustrates pooling layers 370 for the back-propagation phase in accordance with one embodiment.
- FIG. 4 illustrates an architecture 400 that includes three distinct types of computational blocks 410 , 420 , and 430 that are specialized for the different types of operations for training quantized DNNs in accordance with one embodiment.
- FIG. 5 illustrates a homogeneous accelerator architecture 500 in accordance with one embodiment.
- FIG. 6 illustrates the design of a compute unit in accordance with one embodiment.
- FIG. 7 illustrates adder logic 700 that utilizes a novel low-overhead desynchronized encoding for zero-skipping in accordance with one embodiment.
- FIG. 8 illustrates non-zero detection logic 800 that includes zero-detector logic 810 and non-zero selector 820 in accordance with one embodiment.
- FIG. 9 illustrates scheduling operations across multiple MPZS-arrays in accordance with one embodiment.
- FIG. 10 illustrates an overview of a DNN workflow 1000 in accordance with one embodiment.
- FIGS. 11 and 12 illustrate performance of the GPU platform in comparison to different variations of the present design as implemented in an FPGA platform in accordance with one embodiment.
- FIG. 13 illustrates the schematic diagram of a data processing system according to an embodiment of the present invention.
- FIG. 14 illustrates the schematic diagram of a multi-layer accelerator according to an embodiment of the invention.
- FIG. 15 is a diagram of a computer system including a data processing system according to an embodiment of the invention.
- FIG. 16 shows the details of the specialized circuit 1700 for accelerating neural networks in the prior art.
- FIG. 17 shows the details of the CU 1800 in the systolic array circuit.
- FIG. 18 shows the operations in the forward propagation 1940 and backward propagation 1950 phases for a single convolution layer 1900 for neural networks.
- FIG. 19 illustrates a novel heterogeneous-precision circuit 2000 , which is a specialized circuit for accelerating neural network training and inference.
- FIG. 20 illustrates a design of a single CU in Q-array 2010 .
- FIG. 21 shows the operations for a single layer of a quantized neural network using the circuit 2000 described in the specification.
- FIG. 22 describes the zero-skipping logic 2300 , which reads an 8-wide vector of data 2301 from the IBUF and selects one non-zero value 2302 from the 8-wide vector of data in each cycle for each row of CUs 2600 .
- FIG. 23 shows the circuit for CU 2600 that can skip zero-valued gradients.
- FIG. 24 shows the circuit 2612 for accumulating the multiplication results across different CUs 2600 .
- FIG. 25 shows the details of the accumulation logic per lane 2800 of the accumulation logic 2612 .
- both the Q-array 2010 and the MP-array 2020 blocks of the architecture proposed in this specification may use zero-skipping logic, as shown in FIG. 26 .
- multiple MP-array 2020 blocks without any Q-array 2010 blocks may be employed to accelerate Neural Network training and inference, as shown in FIG. 27 .
- the present design leverages two algorithmic properties: quantization and sparsity for quantized training.
- the present design provides a unified architecture that leverages both properties and shows that FPGAs not only provide higher energy efficiency than GPUs but can also, on average, outperform GPUs across a range of quantization techniques and DNN topologies.
- I/O Input/Output.
- DMA Direct Memory Access
- CPU Central Processing Unit.
- FPGA Field Programmable Gate Arrays.
- CGRA Coarse-Grain Reconfigurable Accelerators.
- GPGPU General-Purpose Graphical Processing Units.
- MLWC Many Light-weight Cores.
- ASIC Application Specific Integrated Circuit.
- PCIe Peripheral Component Interconnect express.
- CDFG Control and Data-Flow Graph.
- NIC Network Interface Card
- Dataflow analysis An analysis performed by a compiler on the CDFG of the program to determine dependencies between a write operation on a variable and the subsequent operations that might depend on the written value.
- Accelerator a specialized HW/SW component that is customized to run an application or a class of applications efficiently.
- In-line accelerator An accelerator for I/O-intensive applications that can send and receive data without CPU involvement. If an in-line accelerator cannot finish the processing of input data, it passes the data to the CPU for further processing.
- Bailout The process of transitioning the computation associated with an input from an in-line accelerator to a general purpose instruction-based processor (i.e. general purpose core).
- Rollback A kind of bailout that causes the CPU to restart the execution of input data on an accelerator from the beginning or some other known location with related recovery data such as a checkpoint.
- Gorilla++ A programming model and language with both dataflow and shared-memory constructs as well as a toolset that generates HW/SW from a Gorilla++ description.
- GDF Gorilla dataflow (the execution model of Gorilla++).
- GDF node A building block of a GDF design that receives an input, may apply a computation kernel on the input, and generates corresponding outputs.
- a GDF design consists of multiple GDF nodes.
- a GDF node may be realized as a hardware module or a software thread or a hybrid component. Multiple nodes may be realized on the same virtualized hardware module or on a same virtualized software thread.
- Engine A special kind of component, such as a GDF node, that contains computation.
- Computation kernel The computation that is applied to all input data elements in an engine.
- Data state A set of memory elements that contains the current state of computation in a Gorilla program.
- Control State A pointer to the current state in a state machine, stage in a pipeline, or instruction in a program associated to an engine.
- Dataflow token A component's input/output data elements.
- Kernel operation An atomic unit of computation in a kernel. There might not be a one-to-one mapping between kernel operations and the corresponding realizations as states in a state machine, stages in a pipeline, or instructions running on a general purpose instruction-based processor.
- the highly parallel multiply-add operations for convolutions/fully-connected layers are interleaved with quantization transformations and require expensive transcendental functions such as tanh or sigmoid that operate on floating-point data.
- the present design targets FPGAs for their flexibility and develops a heterogeneous architecture, which is an accelerator for training quantized DNNs.
- This heterogeneous architecture is designed to challenge the reign of GPUs as the de facto platform for DNN training.
- the heterogeneous architecture leverages three algorithmic properties of quantized DNN training algorithms.
- compute intensive operations for the convolution and fully-connected layers in quantized training need mixed precision; that is, one of the operands is a high-precision gradient while the other is a quantized weight/activation.
- mixed-precision allows the heterogeneous architecture to reduce the high resource cost of the compute units, increasing the parallelism that the FPGA can offer using its limited pool of resources.
- training operations for quantized DNNs possess a dual characteristic—the high-precision gradients in the backward phase are highly sparse (>99% zeros), while the quantized activations in the forward phase have between 45% and 60% zeros.
- the heterogeneous architecture leverages the dual characteristics of high-precision, high-sparsity in the backward phase and low-precision, low-sparsity in the forward phase.
- both the data-representations (fixed-point, power of 2, etc.) and precision (number of bits) for activations, weights, and gradients vary between different DNN models.
- the heterogeneous architecture utilizes a template architecture that exploits the reconfigurability of the FPGA to generate a specialized implementation for each quantized DNN.
- the heterogeneous architecture acting as an accelerator utilizes the properties of quantization in the bit-heterogeneous architecture to deliver significant improvement in performance and energy efficiency over GPUs.
- the quantization transformation and the quantized data representation both differ for different training algorithms.
- the structure of the compute intensive convolution/activation layers remain the same.
- the heterogeneous architecture uses (1) systolic arrays (e.g., a sparse-dense heterogeneous architecture array) for the highly parallel mixed-precision Multiply-Accumulate (MAC) operations in convolution/fully-connected layers in a DNN, and (2) programmable data Transformation Arrays (TX-arrays) to support the resource-intensive quantization transformations as well as the activation/pooling layers in DNNs.
- FIG. 1 shows an embodiment of a block diagram of a machine learning system 100 for providing machine learning applications for a plurality of devices in accordance with one embodiment.
- the machine learning system 100 includes machine learning modules 130 (e.g., DNN modules), ingestion layer 132 , enrichment layer 134 , microservices 136 (e.g., microservice architecture), reactive services 138 , and business intelligence layer 150 .
- a microservice architecture is a method of developing software applications as a suite of independently deployable, small, modular services. Each service has a unique process and communicates through a lightweight mechanism.
- the system 100 provides big data services by collecting data from messaging systems 182 and edge devices, messaging systems 184 , web servers 195 , communication modules 102 , internet of things (IoT) devices 186 , and devices 104 and 106 (e.g., source device, client device, mobile phone, tablet device, laptop, computer, connected or hybrid television (TV), IPTV, Internet TV, Web TV, smart TV, satellite device, satellite TV, automobile, airplane, etc.).
- Each device may include a respective big data application 105 , 107 (e.g., a data collecting software layer) for collecting any type of data that is associated with the device (e.g., user data, device type, network connection, display orientation, volume setting, language preference, location, web browsing data, transaction type, purchase data, etc.).
- the devices communicate with the system 100 over a network 180 (e.g., Internet, wide area network, cellular, WiFi, WiMax, satellite, etc.).
- FIGS. 2A and 2B illustrate methods for training quantized DNNs with a hardware accelerator architecture (e.g., homogeneous architecture in FIG. 2A , heterogeneous architecture in FIG. 2B ) in accordance with one embodiment.
- the operations of the methods in FIGS. 2A and 2B may be executed by a compiler component, a data processing system, a machine, a server, a web appliance, a centralized system, a distributed node, or any system, which includes an accelerator (e.g., CPU, GPU, FPGA).
- the accelerator may include hardware (circuitry, dedicated logic, etc.), software (such as is run on a general purpose computer system or a dedicated machine or a device), or a combination of both.
- the compute-intensive convolution and fully-connected layers, which require a large number of simple MAC operations, are interleaved with resource-intensive quantization transformations, which perform fewer operations but need more FPGA resources for implementing the complex operations.
- FIG. 2A illustrates the various operations of method 200 to train a single quantized convolution layer when using an architecture with homogeneous precision for all computations.
- FIG. 2B illustrates the various operations of method 250 to train a single quantized convolution layer when using an architecture with heterogeneous precision for the computations.
- Subscripts f, b, and w refer to the forward propagation, backward propagation of loss, and weight gradient calculations, respectively.
- the conv f , conv b , and conv w are highly-parallel convolution operations that require a large number of Multiply-Accumulate (MAC) operations.
- Inference phase 201 includes operations 202 , 204 , 206 , 208 , 210 , 212 , and 214 .
- Data is quantized in operations 202 , 208 , and 214 .
- the method includes receiving input data for an input layer with the input data being quantized (e.g., quantized from a first precision datatype for input data into a second precision datatype).
- the method includes receiving the second precision datatype (e.g., high precision, 32-bit floating-point) for the input data.
- the method includes receiving a first precision datatype for the initial weights with the weights being quantized from a first precision datatype into a second precision datatype.
- the method includes receiving the second precision datatype (e.g., high precision, 32-bit floating-point) for the weights.
- the method includes performing a convolution operation(s) (conv f ) of a convolution layer on the input data and weights including a large number of Multiply-Accumulate (MAC) operations. Weights from operation 206 can be applied to the input data during the convolution operations.
- output from operation 210 is generated as the second precision datatype and quantized into a first precision datatype at operation 214 . The output of an output layer is available for further processing at operation 214 .
- the backward propagation phase 220 updates original weights to reduce a loss function to improve classification of the input data.
- the backward propagation phase 220 includes operations 222 , 224 , 226 , 228 , 230 , 240 , 242 , 244 , 246 , and 248 .
- an output loss function is generated.
- weights are quantized from a first precision datatype into a second precision datatype (e.g., high precision datatype) to form high precision datatype at operation 242 .
- a convolution (conv b ) is performed on output from operation 240 and the second precision datatype weights from operation 242 to generate an input loss at operation 248 .
- an output loss function is generated.
- inputs are quantized from a first precision datatype into a second precision datatype (e.g., high precision datatype) to form high precision datatype at operation 224 .
- a convolution (conv b ) is performed on output from operation 222 and the second precision datatype inputs from operation 224 to generate a weight loss function at operation 230 .
- conv f uses low-bitwidth fixed-point data for activations and weights.
- conv b and conv w may require mixed precision data types (e.g., high bit width fixed-point/floating-point) for gradients, depending on the quantization algorithm.
- the gradients for the Conv b and Conv w operations may require either high bit width fixed-point or floating-point datatypes, depending on the quantization algorithm.
- the activations for the Conv w operation and weights for the Conv b operation may require low bit width fixed-point representation.
- the precision requirements are a static property of the quantized DNN, designed by the programmer/machine learning expert.
- FIG. 2A shows an example in which the Input and Weights are first converted into high precision datatypes to match the high precision datatypes of gradients before performing the Conv f , Conv b , or Conv w operations.
- this present design introduces the use of heterogeneous precision in the accelerator design for quantized DNN training.
- the proposed architecture uses specialized compute units that dynamically match the varying precision requirements of quantized DNN training.
- As FIG. 2B shows, using heterogeneous precision enables the proposed architecture to avoid conversion to high precision datatypes and perform computations on either quantized or mixed-precision datatypes.
- An advantage of this design is that compute units for quantized and mixed-precision datatypes use significantly fewer hardware resources and less energy compared to high-precision compute units. Note that the Output tensor for Conv f in FIG. 2B may still require a high precision datatype to avoid overflow of the intermediate data.
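- To make the contrast concrete, the illustrative software model below (not the patented circuit) compares a homogeneous-precision backward convolution, which first widens quantized weights to float32, against a mixed-precision formulation that keeps the weight operand in its low bit width and applies a single scale to the accumulated result; a dense matrix product stands in for the conv b computation, and the numpy casts merely model datapath widths.

```python
import numpy as np

# Homogeneous precision (FIG. 2A style): widen quantized weights to float32 first.
def conv_b_homogeneous(grad_out, w_codes, scale):
    w_fp32 = w_codes.astype(np.float32) * scale        # explicit high-precision conversion
    return grad_out @ w_fp32

# Heterogeneous/mixed precision (FIG. 2B style): the MAC array consumes the
# low-bit integer codes directly; one scale multiply is applied to the result.
def conv_b_mixed(grad_out, w_codes, scale):
    acc = grad_out @ w_codes.astype(np.float32)        # cast only models a mixed-precision MAC
    return acc * scale

grad_out = np.random.randn(4, 128).astype(np.float32)  # high-precision gradients
w_codes = np.random.randint(-2, 3, size=(128, 256)).astype(np.int8)  # ~2-bit weight codes
scale = 0.05
print(np.allclose(conv_b_homogeneous(grad_out, w_codes, scale),
                  conv_b_mixed(grad_out, w_codes, scale), atol=1e-4))
```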
- Inference phase 251 includes operations 254 , 256 , and 258 - 260 .
- Data is quantized or kept as a mixed precision datatype in operations 254 , 256 , and 259 .
- the method includes receiving input data for an input layer with the input data being quantized or a mixed precision datatype. Any low bit width precision datatypes are not converted into high bit width precision datatypes.
- the method includes receiving initial weights with the weights being quantized or a mixed precision datatype. Any low bit width precision datatypes are not converted into high bit width precision datatypes.
- the method includes performing a convolution operation(s) (conv f ) of a convolution layer on the input data and weights including Multiply-Accumulate (MAC) operations. Weights from operation 256 can be applied to the input data during the convolution operations.
- output from operation 260 is generated as a second precision datatype and quantized into a first precision datatype at operation 259 . The output of an output layer is available for further processing at operation 259 .
- the backward propagation phase 290 updates original weights to reduce a loss function to improve classification of the input data.
- the backward propagation phase 290 includes operations 270 , 272 , 274 , 276 , 280 , 282 , 284 , and 286 .
- an output loss function is generated.
- weights are quantized or a mixed precision datatype. Any low bit width precision datatypes are not converted into high bit width precision datatypes.
- a convolution (conv b ) is performed on output from operation 270 and the weights from operation 272 to generate an input loss function at operation 276 .
- an output loss function is generated.
- inputs are quantized or a mixed precision datatype. Any low bit width precision datatypes do not need to be converted into high bit width precision datatypes.
- a convolution (conv b ) is performed on output from operation 280 and the inputs from operation 282 to generate a weight loss function at operation 286 .
- the method 200 or 250 selects output data (an output neuron of the output layer) having the highest activation value as being the most likely value for the input.
- Initially, the highest activation value may predict a dog when the input image actually shows a cat instead of a dog. Additional training allows the method to correctly predict a cat for the input image.
- the present design utilizes a static property of quantized DNNs, varying precision requirements, in the design of accelerators for DNN training. Additionally, the present design also exploits a run-time property of quantized DNN training that many zero-valued multiplications can be skipped in both forward and backward computations.
- Prior approaches have explored zero-skipping techniques for inference phase and reported that skipping zero-valued 16-bit activation values can provide significant performance benefits.
- the present design determines that zero-skipping for training phase opens significantly more opportunities than the inference phase, since the training phase contains a larger fraction of zero-valued multiplications among the total operations. However, seizing the opportunities via zero-skipping imposes additional hardware cost to identify and skip ineffectual multiplications.
- FIG. 3A illustrates pooling layers for the inference phase in accordance with one embodiment.
- Pooling layers 320 for the inference phase select maximum values out of a 2-D grid of inputs 310 to generate maximum inputs 330 , as shown in FIG. 3A in accordance with one embodiment.
- FIG. 3B illustrates pooling layers 370 for the back-propagation phase in accordance with one embodiment.
- the gradients corresponding to the maximum values selected in the inference phase are non-zero while the rest are zero for grid of inputs 350 .
- the grid 370 includes the non-zero values from the grid 350 .
- the gradients corresponding to the negative inputs for ReLU activation (rectified linear unit) are zero, which can yield as much as 50% sparsity.
- the heterogeneous architecture specializes the computational resources to account for these runtime properties.
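- The short sketch below (illustrative only; dimensions and distributions are assumptions) reproduces the runtime effect described above: routing gradients back through a 2×2 max-pooling layer zeroes roughly three quarters of them, and the ReLU backward pass additionally zeroes gradients for negative inputs.

```python
import numpy as np

def maxpool2x2_backward(x, grad_pooled):
    """Route each pooled gradient to the argmax of its 2x2 window; all other
    positions receive zero, which is the source of backward-phase sparsity."""
    grad_in = np.zeros_like(x)
    for i in range(0, x.shape[0], 2):
        for j in range(0, x.shape[1], 2):
            win = x[i:i + 2, j:j + 2]
            r, c = np.unravel_index(np.argmax(win), win.shape)
            grad_in[i + r, j + c] = grad_pooled[i // 2, j // 2]
    return grad_in

x = np.random.randn(64, 64)
grad_in = maxpool2x2_backward(x, np.random.randn(32, 32))
print("zeros after max-pool backward:", np.mean(grad_in == 0))   # ~0.75

relu_grad = grad_in * (x > 0)        # ReLU backward: zero gradient for negative inputs
print("zeros after ReLU backward:   ", np.mean(relu_grad == 0))  # higher still
```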
- the present design utilizes the interplay between quantization and sparsity and defines sparsity minimum as the minimum number of zero-valued activations or gradients required to break even on the overhead of zero-skipping, with sparsity minimum defined as follows.
- sparsity minimum in the above formulation assumes an ideal architecture that can skip all zero-valued computations and serves as a reference to evaluate the potential benefits from zero-skipping.
- the compute-intensive convolution and fully-connected layers, which require a large number of simple MAC operations, are interleaved with resource-intensive quantization transformations, which perform fewer operations but need more FPGA resources for implementing the complex operations.
- the quantized training requires additional operations that transform the activations, weights, and gradients to different data representations.
- the type of quantization transformation varies according to the quantization algorithm. Offloading these operations to the host CPU would lead to high latencies.
- a homogeneous accelerator architecture 500 of FIG. 5 (1) would overprovision resources for the different types of operations using a homogeneous set or array 510 of Processing Engines (PEs) and, more importantly, (2) would be unable to exploit the algorithmic characteristics of reduced precision from quantization and high sparsity in back-propagated gradients. Therefore, heterogeneity is important to maximize the potential performance benefits using the limited pool of resources on an FPGA die. Motivated by the above insight, a heterogeneous architecture for accelerating quantized training has been designed.
- the present design utilizes a template architecture that is both scalable—to maximally utilize the FPGA's on-chip resources, and customizable—to adapt to the precision requirements of the quantized DNN being trained.
- This heterogeneous architecture 400 includes three distinct types of computational blocks 410 , 420 , and 430 that are specialized for the different types of operations for training quantized DNNs.
- a Dense Quantized Array 410 , 412 (DQ-array), which is a systolic array (e.g., a 16×16 systolic array) of low bit width multiply-accumulate computation units that are labeled as processing engines (PEs) in one example, includes an input buffer, an output buffer, and the PEs.
- a mixed precision zero skipping array 420 , 422 (MPZS-array), which is a systolic array (e.g., a 16×16 systolic array) of mixed-precision multiply-accumulate computation units that are labeled as processing engines (PEs), includes an input buffer, zero skip logic, an output buffer, and PEs.
- a data Transformation Array (TX-array) 430 , 432 , which is an array (e.g., a 4×4 array) of floating-point processing engines (PEs), includes a buffer and the PEs.
- the arrays 410 and 420 are specialized for the highly parallel multiply-add operations in the forward and backward phases of DNNs, while the array 430 is a more general purpose array that can be programmed to compute either the element-wise data transformations necessary for quantized training or the activation/pooling layers of the DNN.
- the present design uses a three level memory hierarchy: global, cluster, and local memory (e.g., global-uram, cluster-bram, and local-bram).
- the memory at each level of the hierarchy is controlled by software, making it a scratchpad memory. Using the on-chip memory as scratchpads takes away the burden of prefetching and evicting from the hardware and places it on the software. This enables layer-specific compiler optimizations that take advantage of data-reuse within layers of the quantized DNN.
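- As a sketch of what software control of the scratchpads enables (the callback names and tile objects here are hypothetical, not the patented compiler), the loop below double-buffers tile transfers so that the next tile is staged while the current one is being consumed; in hardware the two steps would overlap, which plain Python only models as a schedule.

```python
def run_layer(tiles, load_tile, compute_tile):
    """Software-managed double buffering over a two-entry scratchpad.

    load_tile() stands in for a DMA transfer from off-chip memory into on-chip
    scratchpad; compute_tile() stands in for the systolic-array computation.
    """
    scratchpad = [None, None]
    scratchpad[0] = load_tile(tiles[0])                        # prime buffer 0
    for i in range(len(tiles)):
        if i + 1 < len(tiles):
            scratchpad[(i + 1) % 2] = load_tile(tiles[i + 1])  # prefetch the next tile
        compute_tile(scratchpad[i % 2])                        # consume the current tile
        # buffer (i % 2) is now free; the next iteration's prefetch overwrites it

# Minimal usage with stand-in callbacks:
run_layer(tiles=list(range(8)),
          load_tile=lambda t: f"tile-{t}",
          compute_tile=lambda buf: print("computing on", buf))
```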
- the present application will now describe the microarchitecture of the heterogeneous architecture 400 , and an algorithm for optimizing the sizes of the three types of arrays to maximize performance.
- the present heterogeneous architecture uses a MPZS-array that exploits the dual characteristics of high sparsity for the high precision gradients for zero-skipping in the backward phase, and uses a dense quantized execution for the forward phase.
- the basic building block for the MPZS-array is the CU, which is a bit-flexible compute unit, described below.
- FIG. 6 illustrates the design of a compute unit in accordance with one embodiment.
- the CU 600 includes n quantized mixed precision multipliers (e.g., 610 - 613 ), each of which can multiply up to m-bit operands. While m depends on the minimum precision required by the MAC operations in convolution/fully-connected layers, n depends on the ratio of precision_max/precision_min. The outputs of the n quantized multipliers are added to produce an output 690 .
- the CU supports a flexible range of precision for the floating point or fixed point inputs 601 - 608 (e.g., floating point 32 bit 601 , fixed point 2 bit 602 , floating point 32 bit 603 , fixed point 2 bit 604 , floating point 32 bit 605 , fixed point 2 bit 606 , floating point 32 bit 607 , fixed point 2 bit 608 )—activations in the forward phase and the gradients in the backward phase.
- the n quantized multipliers in a CU perform n independent multiplications.
- the n quantized multipliers together multiply a single (n×m)-bit operand with an m-bit operand.
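- The bit-level idea behind this fusion can be sketched in software as follows (an illustrative model with assumed n=4, m=8 and unsigned operands; the actual CU also handles the quantized and floating-point formats listed above, and the shift-add recombination shown here is only one way to realize the fusion): the wide operand is sliced into n m-bit chunks, each chunk is handled by one m-bit multiplier, and the partial products are shift-added back together.

```python
def fused_multiply(wide_operand, narrow_operand, n=4, m=8):
    """Multiply an (n*m)-bit unsigned operand by an m-bit unsigned operand using
    n independent m-bit x m-bit multipliers, one per slice of the wide operand."""
    mask = (1 << m) - 1
    result = 0
    for i in range(n):                                # one iteration per m-bit multiplier
        slice_i = (wide_operand >> (i * m)) & mask    # i-th m-bit slice of the wide operand
        partial = slice_i * narrow_operand            # m-bit x m-bit partial product
        result += partial << (i * m)                  # weight the partial by its slice position
    return result

a = 0xDEADBEEF                                        # 32-bit operand (n=4 slices of m=8 bits)
b = 0x5B                                              # 8-bit operand
assert fused_multiply(a, b) == a * b                  # fused result matches a full-width multiply
print(hex(fused_multiply(a, b)))
```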
- a MPZS array uses a 2D systolic array of 16×16 CUs.
- each compute unit in the MPZS array performs multiple multiply-add operations for quantized activations and weights in the forward phase of training.
- the partial results generated by different quantized multipliers are added together to produce a single output.
- the gradients in the backward phase for DNNs have high sparsity (e.g., up to 99%).
- a naive first approach for obtaining performance for such a high degree of sparsity is to serialize the MAC operations using a single row of the systolic array.
- Such an approach has two drawbacks: (1) each row would require its own single-precision floating point accumulator which would increase the resource cost (FPGA LUT/DSP) per row; and (2) limited parallelism due to a single row.
- a second approach is to use multiple rows in the systolic array, which increases parallelism. Further, outputs within each column of the systolic array can be accumulated in a single floating-point accumulator.
- the drawback of the second approach is that it enforces synchronization between different rows of the systolic array. That is, each row waits for all the other rows to finish computing the current output before moving on to the next output.
- Prior work uses the second approach to improve inference performance when the sparsity for activations is between 45-60%.
- the present design on the other hand aims to exploit the considerably higher sparsity present in the gradients of the backward phase of quantized DNN training. Due to the high sparsity in the gradients for the backward phase, synchronization between different rows of the systolic array would significantly limit the performance benefits from zero-skipping.
- the present design identifies two limitations of the above technique when applied to highly-sparse gradients.
- the fundamental assumption here is that the compute units in each column synchronize and operate on a single sparse-vector. Therefore, for the first limitation, each row stalls for all the other rows to finish operating on their own sub-vectors before proceeding to the next sparse-vector; which will limit the potential benefits from zero-skipping due to the high-sparsity in gradients.
- for the second limitation, when reading one sparse sub-vector at a time from the memory (e.g., BRAM), the non-zero detect logic will stall when there are no non-zero values in the sub-vector. Assuming a 95% sparsity in gradients, the probability of all zeros in a sub-vector (assuming an independent and identical distribution) is 44%.
- the present design utilizes a novel low-overhead desynchronized encoding for zero-skipping as illustrated in a multi-lane adder logic 700 of FIG. 7 .
- This encoding uses a desynchronization-tag or d-tag 706 to remove synchronization between rows of a MPZS-array.
- the MPZS-array encodes the non-zero values as a (value 702 , offset 704 , d-tag 706 ) triple.
- the d-tag 706 specifies the identification (ID) of the sparse-vector that each row operates on.
- ID identification
- the present design uses two tag-lanes 712 and 714 within each column.
- the compute units in each column share tag-lanes. Within each column, compute units forward their results to one of the tag-lanes using the LSB of the d-tag.
- when the select logic 730 determines that the tag for the current row matches the previous row's tag for either the odd or even tag-lanes, the values are added together and forwarded to the next row.
- when the tags do not match, the results are stored locally.
- the present design decomposes the non-zero detection logic 800 of FIG. 8 into two different modules: (1) zero-detector logic 810 , and (2) non-zero selector 820 .
- the zero-detector logic includes a series of comparators that generate a bit-vector using a single bit for each value of the sub-vector (e.g., a 16-wide sub-vector). Each bit in the bit-vector specifies whether the corresponding value in the sub-vector is zero (low) or non-zero (high). When all bits in the bit-vector are low, the sub-vector is skipped entirely.
- Otherwise, the sub-vector is pushed to a FIFO queue 830 , along with its bit-vector and a d-tag for identifying the input ID.
- the non-zero selector then pops the FIFO queue to read only those sub-vectors that have some non-zero values.
- the non-zero selector selects the first non-zero value and the corresponding offset to produce a (value, offset, tag) pair.
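- A software model of this detection path is sketched below (illustrative only; the 16-wide sub-vector width and the tuple layout follow the description above, everything else is an assumption): each sparse vector carries a d-tag, all-zero sub-vectors are skipped outright, and the selector emits one (value, offset, tag) tuple per remaining non-zero element.

```python
import numpy as np

def encode_nonzeros(sparse_vectors, subvec_width=16):
    """Model of the zero-detector plus non-zero selector.

    For each sparse vector (identified by its d-tag), split it into sub-vectors,
    build a bit-vector marking non-zero positions, skip sub-vectors whose
    bit-vector is all zero, and emit (value, offset, d_tag) tuples otherwise.
    """
    stream = []
    for d_tag, vec in enumerate(sparse_vectors):
        for base in range(0, len(vec), subvec_width):
            sub = vec[base:base + subvec_width]
            bitvec = sub != 0                      # one bit per value in the sub-vector
            if not bitvec.any():                   # all-zero sub-vector: skipped entirely
                continue
            for offset in np.flatnonzero(bitvec):
                stream.append((float(sub[offset]), int(base + offset), d_tag))
    return stream

# Gradients with ~95% sparsity spread over three sparse vectors (three d-tags).
grads = [np.where(np.random.rand(256) > 0.95, np.random.randn(256), 0.0) for _ in range(3)]
work = encode_nonzeros(grads)
print(f"{len(work)} multiplications issued instead of {3 * 256}")
```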
- the present design improves the performance of MPZS-array when sparsity is high.
- the MPZS-array utilizes a dense execution for the forward phase of quantized DNN training, as described below.
- the present design uses a template architecture to implement the MPZS-array on FPGA.
- the precision for the multiply-add operations can be modified according to the needs of the quantized DNN.
- the following section discusses the scheduling of operations for quantized training across multiple MPZS-arrays.
- the present design splits the computations in each operation into tiles.
- the total amount of data is often much larger than the limited on-chip memory available on the FPGA. Therefore, splitting the computations into tiles is necessary to fit the data into on-chip memory.
- FIG. 9 illustrates scheduling operations across multiple MPZS-arrays in accordance with one embodiment.
- the present design uses three types of tiling and expresses the task of determining the tile sizes as a constrained optimization problem.
- the three types of tilings correspond to three levels of memory hierarchy in the MPZS architecture and the sizes of each level of memory hierarchy serves the constraints for optimizing the tile sizes.
- the present design uses a simple fully-connected layer in FIG. 9 as an example to explain the scheduling of operations 910 , 920 , 930 , and 940 .
- the fully-connected layer from the FIG. 9 can be expressed as a matrix multiplication as follows.
- FIG. 9 shows how the operations in a fully-connected layer are split into tiles for each level of memory hierarchy including global memory tile (e.g., URAM tile) at operation 910 and cluster memory tile (e.g., BRAM tile) at operation 930 .
- Using a larger tile size for each level of hierarchy increases the data reuse at that level of hierarchy at operation 940 .
- the tile sizes are constrained by the capacity of memory at that level of hierarchy.
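- A minimal software analogue of this tiling (the tile sizes and matrix shapes here are arbitrary placeholders, not the optimized values the scheduler would pick) splits the fully-connected layer's matrix multiplication so that each input, weight, and output tile fits within a fixed capacity at one level of the memory hierarchy:

```python
import numpy as np

def tiled_matmul(x, w, tile_m=64, tile_n=64, tile_k=64):
    """Compute y = x @ w one tile at a time so that the working set
    (one x tile, one w tile, one y tile) fits in limited on-chip memory."""
    m, k = x.shape
    _, n = w.shape
    y = np.zeros((m, n), dtype=np.float32)
    for i in range(0, m, tile_m):
        for j in range(0, n, tile_n):
            for p in range(0, k, tile_k):        # the y tile stays resident and is reused
                y[i:i + tile_m, j:j + tile_n] += (
                    x[i:i + tile_m, p:p + tile_k] @ w[p:p + tile_k, j:j + tile_n]
                )
    return y

x = np.random.randn(256, 512).astype(np.float32)
w = np.random.randn(512, 384).astype(np.float32)
print(np.allclose(tiled_matmul(x, w), x @ w, atol=1e-2))
```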
- the DNN workflow begins with a programmer defining a Dataflow Graph (DFG) of the DNN using a high-level API.
- This API allows the programmer to specify the precision for each operation in the DNN.
- this workflow includes four operations: ( 1010 ) a dataflow analysis operation to analyze the resource requirements for the dataflow graph, ( 1020 ) a resource partitioning operation to analytically split the FPGA's resources, ( 1030 ) a cycle-accurate scheduling operation to obtain cycle counts, and ( 1040 ) a builder operation to generate a synthesizable accelerator using the optimal resource breakdown from operation 1030 .
- the first operation 1010 of the workflow includes analyzing the type of computational resources required by the DNN model.
- This operation 1010 includes the dataflow analyzer component iterating over the nodes of the dataflow graph of the DNN and generating a list of tuples (operation type, precision, operation count) for the forward and backward passes of training.
- the operation type is a type of scalar operation (e.g., multiply, add, etc.)
- the precision field is a tuple of the data-types required by the operands (e.g., fixed-point, floating-point, or power-of-2)
- the operation count field describes the number of scalar operations.
- the dataflow analyzer generates the highest and lowest precision required for the forward pass and repeats the same for the backward pass. Determining the range of precision requirements is essential for estimating the resources required for compute units in the FPGA (e.g., LUTs, DSPs, and Flip-Flops).
- the dataflow analyzer performs runtime analysis by sampling the data propagated in the forward and backward passes of the dataflow graph for numerous iterations using a user-specified batch-size of inputs. Next, the dataflow analysis calculates the proportion of zero-valued data in sampled data. Using the information generated by the static and runtime analysis in the dataflow analysis operation, the resource partitioning component divides the FPGA's resources as follows at operation 1020 .
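- The two analysis passes can be pictured with the toy sketch below (the node records, field names, and counts are hypothetical placeholders, not the patent's API): the static pass tallies (operation type, precision, operation count) per training phase, and the runtime pass samples how much of the propagated data is zero-valued.

```python
import numpy as np
from collections import Counter

# Hypothetical, already-flattened dataflow-graph nodes for a single layer.
dfg_nodes = [
    {"op": "multiply", "precision": ("fixed4", "fixed4"),   "count": 3_600_000, "phase": "fwd"},
    {"op": "add",      "precision": ("fixed4", "fixed4"),   "count": 3_600_000, "phase": "fwd"},
    {"op": "multiply", "precision": ("float32", "fixed4"),  "count": 7_200_000, "phase": "bwd"},
    {"op": "add",      "precision": ("float32", "float32"), "count": 7_200_000, "phase": "bwd"},
]

def static_analysis(nodes):
    """Tally (operation type, precision) -> operation count for each pass."""
    tallies = {"fwd": Counter(), "bwd": Counter()}
    for node in nodes:
        tallies[node["phase"]][(node["op"], node["precision"])] += node["count"]
    return tallies

def sampled_zero_fraction(tensors):
    """Runtime analysis: fraction of zero-valued entries in sampled tensors."""
    total = sum(t.size for t in tensors)
    zeros = sum(int(np.count_nonzero(t == 0)) for t in tensors)
    return zeros / total

print(static_analysis(dfg_nodes))
sampled_grads = [np.where(np.random.rand(10_000) > 0.97, 1.0, 0.0) for _ in range(4)]
print("sampled backward-pass sparsity:", sampled_zero_fraction(sampled_grads))
```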
- the resource partitioner component of the workflow uses an analytical model to obtain the optimal breakdown of the FPGA's resources for forward and backward passes. Since most operations in a DNN are Multiply-Accumulate (MAC) operations, the resource partitioner only considers the MAC operations for the analytical model. For a given pair of (precision_fwd, ops_fwd) and (precision_bwd, ops_bwd) for the forward and backward passes of training, the resource partitioner generates the optimal breakdown (p, 1−p) of the FPGA's resources for executing the forward and backward passes, respectively.
- alu_fwd = (p × resource_total)/resource_fwd (3a)
- the resource partitioning component optimizes the ideal number of cycles required by the forward and backward operations, given by the following equation:
- Cycles_total = (ops_fwd × nz_fwd)/(alu_fwd + alu_bwd) + (ops_bwd × nz_bwd)/alu_bwd (4)
- equation [4] is solved quadratically to get the optimal partitioning p as follows.
- the c term is the ratio of non-zero computations in the backward pass to the non-zero computations in the forward pass
- the r term is one minus the ratio of resources required for the backward pass to the resources required by the forward pass. While computing the value of r requires static information, computing c requires both static and dynamic information.
- the value of p obtained from equation [6] is the optimal breakdown of the FPGA's resources assuming no under-utilization of resources due to memory accesses. In reality, however, even quantized DNNs have a large memory footprint and hence performance of the generated FPGA accelerator depends both on the breakdown of the FPGA's resources and the organization of on-chip memory. Nevertheless, the value of p obtained from equation [6] serves as a good initial solution for optimizing the breakdown of the FPGA's resources.
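- Because the closed-form solution referenced as equation [6] is not reproduced in this text, the sketch below instead sweeps the partition p over a cycle model written in the spirit of equations [3a] and [4]; all numeric inputs are illustrative assumptions rather than measured values.

```python
def total_cycles(p, ops_fwd, ops_bwd, nz_fwd, nz_bwd,
                 resource_total, resource_fwd, resource_bwd):
    """Ideal cycle model: forward work can run on every compute unit, while the
    zero-skipped backward work runs only on the backward partition."""
    alu_fwd = p * resource_total / resource_fwd           # units built from the forward share
    alu_bwd = (1 - p) * resource_total / resource_bwd     # units built from the backward share
    if alu_fwd <= 0 or alu_bwd <= 0:
        return float("inf")
    return (ops_fwd * nz_fwd) / (alu_fwd + alu_bwd) + (ops_bwd * nz_bwd) / alu_bwd

def best_partition(**model):
    """Sweep p in 1% steps and keep the split that minimizes total cycles."""
    return min((i / 100 for i in range(1, 100)),
               key=lambda p: total_cycles(p, **model))

p = best_partition(ops_fwd=1e9, ops_bwd=2e9, nz_fwd=0.55, nz_bwd=0.05,
                   resource_total=1.0e6, resource_fwd=50.0, resource_bwd=400.0)
print("forward-array resource fraction p ~", p)
```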
- the scheduler component evaluates the solution provided by the resource partitioner.
- the scheduler is the third component of the workflow which evaluates the quality of the solution generated by the resource partitioner.
- the present design uses a cycle-accurate architectural simulation model for determining the quality of the partitioning solution.
- the simulator component divides the FPGA's LUT and DSP resources into 16×16 systolic arrays for the forward and backward passes using the p obtained from the resource partitioner.
- the simulator evenly divides the FPGA's memory (e.g., URAM and BRAM) resources for each systolic array.
- the architecture of the present design uses a 2 level hierarchy for organizing the on-chip memory, as discussed above.
- using the number of forward and backward systolic arrays along with the memory organization, the simulator component performs cycle-accurate simulation.
- the simulation model accounts for limited bandwidth and latency for communication over both PCIe and the off-chip DRAMs.
- the scheduler generates the cycle counts for DQ-array and MPZS-array. Using the cycle-counts, the scheduler updates the compute ratio c defined in Equation [7] as follows.
- Algorithm 1 summarizes the tasks of the Dataflow Analyzer, Resource Partitioner, and Scheduler. Since the present design aims to flexibly support a wide range of quantized training algorithms, it uses a template architecture to accelerate a wide range of quantized DNNs. The first three components generate an optimized set of parameters for the template architecture along with an optimized execution schedule. The last component, the builder, generates a synthesizable accelerator using both the optimized set of architectural parameters and the execution schedule.
- Algorithm 1: heterogeneous resource partitioning
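- Since the body of Algorithm 1 is not reproduced in this text, the following is only a schematic of the loop just described: partition from the current compute ratio, evaluate the partition with a scheduler, update the ratio from the resulting cycle counts, and iterate until it settles. The partition rule and the trivially simplified scheduler inside are stand-ins, not the patent's analytical model or cycle-accurate simulator.

```python
def heterogeneous_resource_partitioning(ops_fwd, ops_bwd, nz_fwd, nz_bwd,
                                        resource_total, resource_fwd, resource_bwd,
                                        max_iters=10, tol=1e-3):
    """Schematic of the analyzer -> partitioner -> scheduler iteration."""
    c = (ops_bwd * nz_bwd) / (ops_fwd * nz_fwd)        # initial compute ratio (bwd / fwd work)
    p = 0.5
    for _ in range(max_iters):
        p = 1.0 / (1.0 + c)                            # stand-in partition rule for the ratio c
        alu_fwd = p * resource_total / resource_fwd    # stand-in "scheduler" follows
        alu_bwd = (1 - p) * resource_total / resource_bwd
        cycles_fwd = (ops_fwd * nz_fwd) / (alu_fwd + alu_bwd)
        cycles_bwd = (ops_bwd * nz_bwd) / alu_bwd
        c_new = cycles_bwd / cycles_fwd                # ratio fed back to the partitioner
        if abs(c_new - c) < tol:                       # converged: hand p to the builder
            break
        c = c_new
    return p

print(heterogeneous_resource_partitioning(ops_fwd=1e9, ops_bwd=2e9, nz_fwd=0.55, nz_bwd=0.05,
                                          resource_total=1.0e6, resource_fwd=50.0,
                                          resource_bwd=400.0))
```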
- Table I shows the evaluated benchmarks, their datasets, number of operations, model size, and final accuracy.
- the postfix -W, -Q, -D refer to quantization techniques proposed by different prior approaches that use uniform quantization using fixed-point representation for activations and weights but use different quantization strategies for gradients.
- DoReFa-Net uses fixed-point quantization with added gaussian noise
- QNN uses logarithmic quantization using a power-of-2 data representation
- WRPN uses floating-point.
- Benchmarks ResNet-34-W, GoogleNet-Q, AlexNet-Q, AlexNet-W, AlexNet-D are image classification models trained on the Imagenet 2012 dataset.
- Benchmarks SVHN-W and SVHN-Q are optical character recognition models based on the SVHN dataset. Unlike inference, the quality of the trained model depends significantly on the batch size. Therefore, the same batch sizes reported in these prior approaches are used for both GPUs and the heterogeneous architecture of the present design. Furthermore, the three benchmarks use stochastic noise to speed up convergence. Across all the benchmarks, both performance and power consumption are measured for an FPGA platform and a GPU platform for 10,000 training iterations, and the average is presented. For both GPU and FPGA implementations, the host CPU is used as the parameter server.
- an FPGA platform includes 6840 DSPs, 1182K LUTs, 33.7 MB URAM, 8.4 MB BRAMs, 42 W TDP, 200 MHz frequency, and 16 nm technology node.
- a GPU platform has 3584 cores, 12 GB memory, 250 W TDP, 1531 MHz frequency, and 16 nm technology node.
- FIGS. 11 and 12 illustrate performance of the GPU platform in comparison to different variations of the present design as implemented in the FPGA platform.
- the present design provides an alternative solution for GPUs, by leveraging the inherent characteristic of quantized deep learning and introducing heterogeneous accelerator architecture for FPGAs. As such, this design exists at the intersection of (a) quantization for deep learning, (b) acceleration for quantized deep learning, (c) acceleration for ML training, (d) heterogeneous architecture, and (e) exploitation of sparsity in deep learning.
- FIG. 13 illustrates the schematic diagram of data processing system 1300 according to an embodiment of the present invention.
- Data processing system 1300 includes I/O processing unit 1310 and general purpose instruction-based processor 1320 .
- general purpose instruction-based processor 1320 may include a general purpose core or multiple general purpose cores. A general purpose core is not tied to or integrated with any particular algorithm.
- general purpose instruction-based processor 1320 may be a specialized core.
- I/O processing unit 1310 may include an accelerator 1311 (e.g., in-line accelerator, offload accelerator for offloading processing from another computing resource, accelerator with a heterogeneous architecture for DNN training, etc.) for implementing embodiments as described herein.
- In-line accelerators are a special class of accelerators that may be used for I/O intensive applications. Accelerator 1311 and the general purpose instruction-based processor may or may not be on the same chip. Accelerator 1311 is coupled to I/O interface 1312 . Considering the type of input interface or input data, in one embodiment, the accelerator 1311 may receive any type of network packets from a network 1330 and an input network interface card (NIC). In another embodiment, the accelerator may receive raw images or videos from the input cameras. In an embodiment, accelerator 1311 may also receive voice data from an input voice sensor device.
- accelerator 1311 partially performs the computation associated with the input data elements and transfers the control to other accelerators or the main general purpose instruction-based processor in the system to complete the processing.
- computation may refer to any computer task processing including, but not limited to, any of arithmetic/logic operations, memory operations, I/O operations, and offloading part of the computation to other elements of the system such as general purpose instruction-based processors and accelerators. Accelerator 1311 may transfer the control to general purpose instruction-based processor 1320 to complete the computation.
- accelerator 1311 may be implemented using any device known to be used as accelerator, including but not limited to field-programmable gate array (FPGA), Coarse-Grained Reconfigurable Architecture (CGRA), general-purpose computing on graphics processing unit (GPGPU), many light-weight cores (MLWC), network general purpose instruction-based processor, I/O general purpose instruction-based processor, and application-specific integrated circuit (ASIC).
- I/O interface 1312 may provide connectivity to other interfaces that may be used in networks, storages, cameras, or other user interface devices. I/O interface 1312 may include receive first in first out (FIFO) storage 1313 and transmit FIFO storage 1314 .
- FIFO storages 1313 and 1314 may be implemented using SRAM, flip-flops, latches or any other suitable form of storage.
- the input packets are fed to the accelerator through receive FIFO storage 1313 and the generated packets are sent over the network by the accelerator and/or general purpose instruction-based processor through transmit FIFO storage 1314 .
- I/O processing unit 1310 may be Network Interface Card (NIC).
- accelerator 1311 is part of the NIC.
- the NIC is on the same chip as general purpose instruction-based processor 1320 .
- the NIC 1310 is on a separate chip coupled to general purpose instruction-based processor 1320 .
- the NIC-based accelerator receives an incoming packet, as input data elements through I/O interface 1312 , processes the packet, and generates the response packet(s) without involving general purpose instruction-based processor 1320 . Only when accelerator 1311 cannot handle the input packet by itself is the packet transferred to general purpose instruction-based processor 1320 .
- accelerator 1311 communicates with other I/O interfaces, for example, storage elements through direct memory access (DMA) to retrieve data without involving general purpose instruction-based processor 1320 .
- Accelerator 1311 and the general purpose instruction-based processor 1320 are coupled to shared memory 1343 through private cache memories 1341 and 1342 respectively.
- shared memory 1343 is a coherent memory system.
- the coherent memory system may be implemented as shared cache.
- the coherent memory system is implemented using multiples caches with coherency protocol in front of a higher capacity memory such as a DRAM.
- the transfer of data between different layers of accelerations may be done through dedicated channels directly between accelerator 1311 and processor 1320 .
- the control will be transferred to the general-purpose core 1320 .
- Processing data by forming two paths of computations on accelerators and general purpose instruction-based processors has many other applications apart from low-level network applications.
- most emerging big-data applications in data centers have been moving toward scale-out architectures, a technology for scaling the processing power, memory capacity and bandwidth, as well as persistent storage capacity and bandwidth.
- These scale-out architectures are highly network-intensive. Therefore, they can benefit from acceleration.
- These applications however, have a dynamic nature requiring frequent changes and modifications. Therefore, it is highly beneficial to automate the process of splitting an application into a fast-path that can be executed by an accelerator with subgraph templates and a slow-path that can be executed by a general purpose instruction-based processor as disclosed herein.
- an FPGA accelerator can be backed by many-core hardware.
- the many-core hardware can be backed by a general purpose instruction-based processor.
- a multi-layer system 1000 is formed by a first accelerator 1011 1 (e.g., in-line accelerator, offload accelerator for offloading processing from another computing resource, accelerator with a heterogeneous architecture for DNN training, or both) and several other accelerators 1011 n (e.g., in-line accelerator, offload accelerator for offloading processing from another computing resource, accelerator with a heterogeneous architecture for DNN training, or both).
- the multi-layer system 1050 includes several accelerators, each performing a particular level of acceleration. In such a system, execution may begin at a first layer by the first accelerator 1011 1 . Then, each subsequent layer of acceleration is invoked when the execution exits the layer before it.
- if the accelerator 1011 1 cannot finish the processing of the input data, the input data and the execution will be transferred to the next acceleration layer, accelerator 1011 2 .
- the transfer of data between different layers of accelerations may be done through dedicated channels between layers (e.g., 1071 1 to 1071 n ).
- the control will be transferred to the general-purpose core 1090 .
- FIG. 15 is a diagram of a computer system including a data processing system that utilizes an accelerator according to an embodiment of the invention.
- Within the computer system 1200 is a set of instructions for causing the machine to perform any one or more of the methodologies discussed herein, including accelerating machine learning operations.
- the machine may be connected (e.g., networked) to other machines in a LAN, an intranet, an extranet, or the Internet.
- the machine can operate in the capacity of a server or a client in a client-server network environment, or as a peer machine in a peer-to-peer (or distributed) network environment. The machine can also operate in the capacity of a web appliance, a server, a network router, switch or bridge, event producer, distributed node, centralized system, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine.
- the term “machine” shall also be taken to include any collection of machines (e.g., computers) that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.
- Data processing system 1202 includes a general purpose instruction-based processor 1227 and an accelerator 1226 (e.g., in-line accelerator, offload accelerator for offloading processing from another computing resource, accelerator with heterogenous architecture for DNN training, etc.).
- the general purpose instruction-based processor may be one or more general purpose instruction-based processors or processing devices (e.g., microprocessor, central processing unit, or the like). More particularly, data processing system 1202 may be a complex instruction set computing (CISC) microprocessor, reduced instruction set computing (RISC) microprocessor, very long instruction word (VLIW) microprocessor, general purpose instruction-based processor implementing other instruction sets, or general purpose instruction-based processors implementing a combination of instruction sets.
- the accelerator may be one or more special-purpose processing devices such as an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal general purpose instruction-based processor (DSP), network general purpose instruction-based processor, many light-weight cores (MLWC) or the like.
- the exemplary computer system 1200 includes a data processing system 1202 , a main memory 1204 (e.g., read-only memory (ROM), flash memory, dynamic random access memory (DRAM) such as synchronous DRAM (SDRAM) or DRAM (RDRAM), etc.), a static memory 1206 (e.g., flash memory, static random access memory (SRAM), etc.), and a data storage device 1216 (e.g., a secondary memory unit in the form of a drive unit, which may include fixed or removable computer-readable storage medium), which communicate with each other via a bus 1208 .
- the storage units disclosed in computer system 1200 may be configured to implement the data storing mechanisms for performing the operations and steps discussed herein.
- Memory 1206 can store code and/or data for use by processor 1227 or accelerator 1226 .
- Memory 1206 includes a memory hierarchy that can be implemented using any combination of RAM (e.g., SRAM, DRAM, DDRAM), ROM, FLASH, magnetic and/or optical storage devices.
- Memory may also include a transmission medium for carrying information-bearing signals indicative of computer instructions or data (with or without a carrier wave upon which the signals are modulated).
- Processor 1227 and accelerator 1226 execute various software components stored in memory 1204 to perform various functions for system 1200 .
- memory 1206 may store additional modules and data structures not described above.
- Operating system 1205 a includes various procedures, sets of instructions, software components and/or drivers for controlling and managing general system tasks and facilitates communication between various hardware and software components.
- a compiler is a computer program (or set of programs) that transforms source code written in a programming language into another computer language (e.g., target language, object code).
- a communication module 1205 c provides communication with other devices utilizing the network interface device 1222 or RF transceiver 1224 .
- the computer system 1200 may further include a network interface device 1222 .
- the data processing system disclosed herein is integrated into the network interface device 1222.
- the computer system 1200 also may include a video display unit 1210 (e.g., a liquid crystal display (LCD), LED, or a cathode ray tube (CRT)) connected to the computer system through a graphics port and graphics chipset, an input device 1212 (e.g., a keyboard, a mouse), a camera 1214 , and a Graphic User Interface (GUI) device 1220 (e.g., a touch-screen with input & output functionality).
- the computer system 1200 may further include an RF transceiver 1224 that provides frequency shifting, converting received RF signals to baseband and converting baseband transmit signals to RF.
- a radio transceiver or RF transceiver may be understood to include other signal processing functionality such as modulation/demodulation, coding/decoding, interleaving/de-interleaving, spreading/despreading, inverse fast Fourier transforming (IFFT)/fast Fourier transforming (FFT), cyclic prefix appending/removal, and other signal processing functions.
- the Data Storage Device 1216 may include a machine-readable storage medium (or more specifically a computer-readable storage medium) on which is stored one or more sets of instructions embodying any one or more of the methodologies or functions described herein. The disclosed data storing mechanism may be implemented, completely or at least partially, within the main memory 1204 and/or within the data processing system 1202 by the computer system 1200, the main memory 1204 and the data processing system 1202 also constituting machine-readable storage media.
- the computer system 1200 is an autonomous vehicle that may be connected (e.g., networked) to other machines or other autonomous vehicles in a LAN, WAN, or any network.
- the autonomous vehicle can be a distributed system that includes many computers networked within the vehicle.
- the autonomous vehicle can transmit communications (e.g., across the Internet, any wireless communication) to indicate current conditions (e.g., an alarm collision condition indicates close proximity to another vehicle or object, a collision condition indicates that a collision has occurred with another vehicle or object, etc.).
- the autonomous vehicle can operate in the capacity of a server or a client in a client-server network environment, or as a peer machine in a peer-to-peer (or distributed) network environment.
- the storage units disclosed in computer system 1200 may be configured to implement data storing mechanisms for performing the operations of autonomous vehicles.
- the computer system 1200 also includes sensor system 1214 and mechanical control systems 1207 (e.g., motors, driving wheel control, brake control, throttle control, etc.).
- the processing system 1202 executes software instructions to perform different features and functionality (e.g., driving decisions) and provide a graphical user interface 1220 for an occupant of the vehicle.
- the processing system 1202 performs the different features and functionality for autonomous operation of the vehicle based at least partially on receiving input from the sensor system 1214 that includes laser sensors, cameras, radar, GPS, and additional sensors.
- the processing system 1202 may be an electronic control unit for the vehicle.
- FIG. 16 shows the details of the specialized circuit 1700 for accelerating neural networks in prior art.
- the specialized circuit in FIG. 16 includes one or more circuits 1701 that are specialized for one or more computations in neural networks.
- the systolic array circuit 1701 shown in FIG. 16 is specialized for the convolution and matrix-multiplication operations in neural networks.
- the systolic array circuit 1701 further includes (1) a plurality of CUs 1800 that perform the operations in the plurality of layers in neural network training and inference, (2) a buffer IBUF 1704 to store inputs and (3) a buffer OBUF 1705 to store intermediate results when executing the operations for the multi-dimensional arrays of data in neural network.
- the CUs 1800 in FIG. 16 are organized as a 2-dimensional grid, with a plurality of rows 1702 and a plurality of columns 1703 .
- the buffer IBUF 1704 feeds data to the CU 1800 on the first column (the left most column), as shown in FIG. 16 .
- the results from the CUs 1800 in each column of the systolic array are accumulated in an accumulator circuit and then stored in the OBUF 1705.
- Each CU 1800 can perform multiply-add operations, with one operand from the CU 1800 in the previous column (the CU 1800 on the left), with one operand from the CU 1800 's private buffer called WBUF 1803 to generate a product.
- the product is then added with the result from the CU 1800 in the previous row (the CU 1800 on the top) and sent to the CU 1800 in the next row (the CU 1800 on the bottom).
- FIG. 17 shows the details of the CU 1800 in the systolic array circuit.
- the CU 1800 multiplies one value from either the previous CU 1800 (the CU 1800 on the left) or from the IBUF 1704 , with one value from the CU 1800 's private buffer WBUF 1803 .
- the resulting product is added with the results from the previous CU 1800 .
- the resulting sum is then forwarded to the CU 1800 in the next row (CU 1800 on the bottom).
- the accumulator may perform additional operations (max, min, multiplication, etc.) required for different layers of the neural network (like pooling, activation, etc.).
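As a software analogue of the dataflow just described, the sketch below models a small systolic array in which each CU multiplies an input arriving along its row by a weight held in its private WBUF, adds the partial result flowing down the column, and the column totals are accumulated as OBUF entries. It is a behavioral illustration only, with no cycle-level timing, and not the prior-art circuit itself.

```python
# Behavioral model of a tiny systolic array: inputs stream in from IBUF, each
# CU holds one weight (its WBUF), partial sums flow down the columns, and the
# column sums are accumulated into OBUF.

def systolic_matmul(inputs, weights):
    rows = len(weights)              # one CU row per weight row
    cols = len(weights[0])           # one CU column per output element
    obuf = [0] * cols
    for x in inputs:                 # each input vector streamed from IBUF
        col_partial = [0] * cols
        for r in range(rows):        # partial sums accumulate down the rows
            for c in range(cols):
                col_partial[c] += x[r] * weights[r][c]   # CU multiply-add
        for c in range(cols):
            obuf[c] += col_partial[c]                    # accumulator -> OBUF
    return obuf

inputs = [[1, 2, 3], [4, 5, 6]]                  # two input vectors
weights = [[1, 0], [0, 1], [1, 1]]               # 3 rows of CUs, 2 columns
print(systolic_matmul(inputs, weights))          # [14, 16]
```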
- the data is divided into portions such that the size of each portion does not exceed the capacity of on-chip buffers.
- the precision for multidimensional array inputs for the layers of neural networks is the same. Consequently, the widths of the IBUF 1704 and WBUF 1803 buffers are sized according to the precision of the operands supported by the circuit, and the width of the OBUF 1705 is sized according to the precision for the intermediate data. Similarly, the CUs 1800 in the systolic array are designed for the precision supported by the circuit.
- FIG. 18 shows the operations in the forward propagation 1940 and backward propagation 1950 phases for a single convolution layer 1900 for neural networks.
- the CONV F 1910 operation in the forward propagation phase consumes two multidimensional arrays, one for the inputs 1905 , and another for the weights 1905 .
- the CONV B 1931 and CONV W 1930 operations in the backward propagation phase of neural network training accept two inputs—weights 1955 and gradients 1952 for CONV B 1931 ; and inputs 1954 and gradients 1952 for CONV W 1930 .
- the inputs 1901 and weights 1902 use low-precision data representation, while the gradients 1952, 1951, and 1953 require high-precision data representation.
- the specialized circuits for neural network described in prior art use the same precision for all data types. Therefore, the arrays of inputs 1901 and weights 1902 are first converted to a high-precision representation that is supported by the circuit using operations 1920, 1922, 1921, and 1923, to produce high-precision multidimensional arrays 1905, 1954, 1905, and 1955, respectively.
- the specialized circuits described in prior art use the same precision for the different multidimensional arrays of inputs, weights, and gradients for neural network training.
- the circuits in prior art either support just high precision (e.g. half-precision, single-precision, and double-precision floating-point, etc.) for all data types and introduce additional data type conversion operations, like operations 1920 , 1922 , 1921 , and 1923 , or lose accuracy by using low-precision for all data types.
- the circuit 2000 can operate on heterogeneous precision data types for the inputs, weights, and gradients in neural networks.
- the specialized circuit 2000 includes one or more instances of two types of sub-circuits: (1) a quantized circuit called Q-array 2010 responsible for the operations in the forward propagation for neural network training, and (2) a mixed-precision circuit called MP-array 2020 that uses asymmetric precision for the backward propagation operations in neural network training: floating-point representation for the gradients and quantized representation for the inputs and weights.
- the Q-array 2010 includes a plurality of CUs 2100
- MP-array 2020 includes a plurality of CUs 2600, with the CUs 2100 and CUs 2600 organized as a 2-dimensional grid to form systolic arrays.
- the circuit 2000 described in the specification does not require additional data type conversions and can directly operate on both low-precision inputs and weights, and high-precision gradients.
- FIG. 21 shows the operations for a single layer of quantized neural network using the circuit 2000 described in the specification.
- the CONV F 2210 operation in the forward propagation phase can be directly executed with low-precision inputs 1901 , and weights 1902 .
- the CONV B 2231 and CONV W 2230 operations in the backward propagation phase of quantized neural network training can be directly executed with high-precision gradients and low-precision inputs and weights.
- the operations 1920, 1922, 1921, and 1923 from Neural Network 1900 are no longer required for Neural Network 2200 in FIG. 21 to convert the inputs and weights to a high-precision representation.
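To make the contrast concrete, the fragment below sketches the two data paths side by side: a homogeneous-precision path that first converts the quantized inputs and weights to a high-precision type (the role of the conversion operations 1920-1923), and a heterogeneous-precision path that multiplies the low-precision operands directly and only widens the accumulator. The NumPy types chosen here are illustrative stand-ins for the hardware datatypes.

```python
import numpy as np

x_q = np.array([3, -2, 5], dtype=np.int8)      # quantized inputs (e.g., 8-bit)
w_q = np.array([1, -1, 2], dtype=np.int8)      # quantized weights (values fit in 4 bits)

# Homogeneous-precision path: explicit up-conversion before the MAC operations,
# analogous to the conversion operations required by the prior-art circuit.
x_f = x_q.astype(np.float32)
w_f = w_q.astype(np.float32)
y_homogeneous = np.dot(x_f, w_f)

# Heterogeneous-precision path: multiply the low-precision operands directly;
# only the accumulator is widened (here to 32-bit) to avoid overflow.
y_heterogeneous = np.dot(x_q.astype(np.int32), w_q.astype(np.int32))

print(y_homogeneous, y_heterogeneous)          # 15.0 and 15: same result, no float conversion of operands
```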
- Q-array 2010 contains a 2-dimensional grid of quantized CUs 2100 , that support quantized inputs and weights for the forward propagation operations for quantized neural network training.
- the IBUF 2011 buffer in Q-array 2010 stores the multidimensional arrays of inputs, while the multidimensional arrays of weights are stored in the WBUF 2103 that is private to each CU 2100.
- the proposed circuit for Q-array 2010 stores both the inputs and the weights in low-precision.
- Q-array 2010 does not require additional data type conversion for both inputs and weights.
- the precision for inputs and weights in the Q-array 2010 is fixed for all forward propagation operations, and is the same for both inputs and weights. In another implementation, the precision for inputs and weights in the Q-array 2010 is fixed for all forward propagation operations, but can be different for the inputs and weights. In one implementation, the precision for operands in the Q-array 2010 can be varied at run-time to support different precisions for the inputs and weights across different forward propagation operations.
- the width of the IBUF 2011 and the WBUF 2103 buffers are sized according to the precision or set of different precisions supported by the Q-array 2010 .
- the circuit in FIG. 19 shows one embodiment of this specification, in which the inputs stored in the IBUF 2011 use 8-bit fixed-point data representation and the weights stored in the WBUF 2103 use 4-bit fixed-point data representation.
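For reference only, the following sketch shows one simple way such operands could be produced in software: symmetric quantization of floating-point values to 8-bit inputs and 4-bit weights, followed by an integer multiply-accumulate. The scale-selection rule is an assumption made for illustration and is not prescribed by the specification.

```python
import numpy as np

def quantize_symmetric(x, num_bits):
    """Map float values to signed fixed-point integers with num_bits bits."""
    qmax = 2 ** (num_bits - 1) - 1                 # e.g., 127 for 8-bit, 7 for 4-bit
    scale = np.max(np.abs(x)) / qmax if np.max(np.abs(x)) > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale

inputs = np.array([0.7, -0.3, 1.2])
weights = np.array([0.05, -0.11, 0.09])

x_q, x_scale = quantize_symmetric(inputs, 8)       # 8-bit fixed-point inputs (as in IBUF 2011)
w_q, w_scale = quantize_symmetric(weights, 4)      # 4-bit fixed-point weights (as in WBUF 2103)

acc = int(np.dot(x_q, w_q))                        # integer multiply-accumulate
print(acc * x_scale * w_scale)                     # rescale to approximate the float dot product
```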
- a central control logic 2023 for the Q-array 2010 generates the address for the WBUF 2103 that is private for each CU 2100.
- the plurality of CUs 2100 in Q-array 2010 can perform multiply-add operations, wherein each CU 2100 performs a multiplication 2105 between a single 8-bit precision input that is supplied by the IBUF 2011 and shared by all CUs 2100 in a row, and a single 4-bit precision weight 2104 supplied by the WBUF private to that CU 2100 according to the address generated by 2023.
- the results from the multiplication in a CU 2100 are added and accumulated across CUs 2100 in a column of Q-array 2010 through an adder.
- the precision of adder 2107 is set according to the highest precision supported by the IBUF 2011 and WBUF 2103 to avoid overflows/underflows.
- the accumulated results at the bottom of each column of Q-array 2010 require higher precision (e.g., half-, single-, or double-precision floating-point, or fixed-point precision with a greater number of bits than the inputs and weights).
- Q-array 2010 can either write back higher precision accumulated results from OBUF 2012 to next level of memory or can quantize the results to a lower precision fixed-point representation for use by the next forward propagation or backward propagation operation.
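A minimal sketch of the two write-back options just described, assuming a 32-bit accumulator for 8-bit-by-4-bit products and an 8-bit quantized output; the right-shift rescaling is an illustrative choice, not a requirement of the circuit.

```python
import numpy as np

acc = np.int32(0)
for x, w in [(120, 7), (-95, -6), (45, 3)]:        # 8-bit inputs x 4-bit weights
    acc += np.int32(x) * np.int32(w)               # adder precision exceeds operand precision

# Option 1: write the wide accumulated result to the next level of memory as-is.
obuf_high_precision = acc

# Option 2: quantize to a low-precision fixed-point value before write-back,
# trading a small rounding error for a smaller OBUF.
shift = 4                                          # illustrative rescaling choice
obuf_low_precision = np.int8(np.clip(acc >> shift, -128, 127))

print(obuf_high_precision, obuf_low_precision)
```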
- the OBUF 2012 can store only low-precision data, which can reduce the size of the OBUF 2012 but may introduce some error in the results.
- the MP-array 2020 contains a 2-dimensional grid of CUs 2600 that are responsible for the backward propagation operations for quantized neural network training, and can operate on data with mixed precision: high-precision gradients, and low-precision inputs and weights.
- the input gradients are stored in the IBUF 2021 for both CONV W 2230 and CONV B 2231 operations.
- the IBUF 2021 is coupled with an additional zero-skipping logic 2300, which enables the MP-array 2020 circuit to skip over zero-valued gradients in the IBUF 2021.
- FIG. 22 describes the zero-skipping logic 2300, which reads an 8-wide vector of data 2301 from the IBUF and selects one non-zero value 2302 from the 8-wide vector of data in each cycle for each row of CUs 2600.
- a log2(8)-bit, or 3-bit, non-zero index 2303 marks the position of the non-zero value 2302 selected from the 8-wide vector of data 2301.
- the width of buffer IBUF 2021 is set to 8× the precision of the gradients for each row of CUs 2600 in MP-array 2020 in order to supply data to the 2300 logic.
- the zero-skipping logic 2300 can be extended to read any N-wide vector of data to produce a log2(N)-bit index.
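The behavior of the zero-skipping logic can be modeled in a few lines, shown below as a generator that scans an N-wide vector read from the IBUF and emits each non-zero value together with the position that a log2(N)-bit index would encode. This is a functional illustration, not the circuit 2300.

```python
import math

def zero_skip(vector):
    """Yield (non_zero_value, index, index_bits) from an N-wide vector, skipping zeros.
    The index needs ceil(log2(N)) bits; for N = 8 that is a 3-bit index."""
    index_bits = math.ceil(math.log2(len(vector)))
    for position, value in enumerate(vector):
        if value != 0:
            yield value, position, index_bits

gradients = [0.0, 0.0, -1.5, 0.0, 0.25, 0.0, 0.0, 0.0]   # 8-wide vector from the IBUF
for value, index, bits in zero_skip(gradients):
    print(f"value={value} index={index} ({bits}-bit index)")
# Only two of the eight entries are forwarded to the row of CUs.
```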
- the circuit 2300 in FIG. 22 is replicated for each row of 2-dimensional array of CUs 2600 in the MP-array 2020 .
- the non-zero value 2302 and the associated 3-bit non-zero index 2303 are then sent to all the CUs 2600 in a row of the 2-dimensional grid of CUs 2600 in MP-array 2020. Once all the non-zero entries from the 8-wide vector have been selected, the next 8-wide vector of data is read from the IBUF.
- the results from all the CUs 2600 in a column of the MP-array 2020 need to be accumulated to produce an output. This introduces a dependency between different rows of the MP-array 2020, where all the rows have to finish processing all the non-zero values 2302 corresponding to one output value before proceeding with the next output value.
- the inefficiency introduced by this dependency can be large in the case where a large majority of gradients are zero.
- each non-zero value 2302 and non-zero index 2303 in the proposed circuit 2020 is appended with a desynchronization tag (non-zero d-tag 2304), which specifies the output address and is generated by the MP-array control logic 2023.
- the non-zero d-tag 2304 allows the CUs 2600 across different rows to operate on the non-zero value 2302 and non-zero index 2303 for different outputs.
- the non-zero value 2302 , 3-bit non-zero index 2303 , and the non-zero d-tag 2304 are then shared across all CUs 2600 in a row of MP-array 2020 .
- FIG. 23 shows the circuit for CU 2600 that can skip zero-valued gradients.
- a base address generated by control logic 2023 is combined with non-zero index 2303 , which is a part of the incoming data to generate the read address for WBUF 2620 .
- the multiplier 2607 then generates a product 2608 using the non-zero data 2302 and the data 2602 from WBUF 2620 .
- the floating-point non-zero data 2302 is first converted to a 2's complement form by combining the sign and mantissa bits.
- a 2's complement multiplier 2607 can then be used to perform the multiplication.
- a shifter 2615 is used to left-shift the results of the multiplier. When operating on weights or the least significant 4-bits of the inputs, the shift amount is zero. When operating on the most significant 4-bits of the inputs stored in WBUF 2620 , the shift amount is 4 bits to the left.
- This approach to support multiple different precisions for operands in Neural Network training and inference can be generalized to support any precision by choosing the appropriate number of bits for the intermediate data at the output of the shifter 2608 and the appropriate shift-amounts.
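The shift-based handling of multiple precisions can be checked with a small arithmetic sketch: an 8-bit operand stored as two 4-bit halves is multiplied by splitting it into a low half (shift amount 0) and a high half (shift amount 4), mirroring the role of shifter 2615. Unsigned values are used for simplicity; the circuit described above operates on signed 2's complement data.

```python
def mul_with_nibble_split(gradient, operand_8bit):
    """Multiply a gradient by an 8-bit operand held as two 4-bit halves,
    combining the partial products with left shifts of 0 and 4."""
    low_nibble = operand_8bit & 0xF           # least significant 4 bits, shift amount 0
    high_nibble = (operand_8bit >> 4) & 0xF   # most significant 4 bits, shift amount 4
    partial_low = gradient * low_nibble
    partial_high = (gradient * high_nibble) << 4
    return partial_low + partial_high

assert mul_with_nibble_split(3, 0xA7) == 3 * 0xA7    # 3 * 167 = 501
print(mul_with_nibble_split(3, 0xA7))
```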
- the results from multiplication and shifting 2608 are then added with the results from the previous CU 2600 using an accumulation logic 2612.
- FIG. 24 shows the circuit 2612 for accumulating the multiplication results across different CUs 2600 .
- the non-zero d-tag is used to associate the multiplication results with the output to which they correspond.
- the accumulator logic 2612 includes multiple lanes 2700 , where the different lanes allow the CUs 2600 in different rows to work on different outputs.
- FIG. 25 shows the details of the accumulation logic per lane 2800 of the accumulation logic 2612 .
- muxes 2802 and 2803 select among three cases: (1) when the non-zero d-tag 2811 for the multiplication result in the current row matches the non-zero d-tag 2810 for the incoming data (using comparator 2801) and the incoming data is valid 2830 for a lane, the results 2820 and 2821 are added together and sent to the next row 2804; (2) otherwise, when the incoming data is valid, the data 2820 and d-tag 2810 from the previous row are sent to the output 2804; or finally (3) the multiplication result (2831, 2811, 2821) for the CU 2600 in the current row is sent directly to the next row 2804.
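The per-lane selection described above can be summarized with the following functional sketch, which is an interpretation for illustration only: each lane compares the d-tag of the locally produced product against the d-tag arriving from the previous row and either merges them, forwards the incoming data, or forwards the local product.

```python
def lane_step(local_value, local_dtag, incoming_value, incoming_dtag, incoming_valid):
    """Return the (value, d-tag) pair forwarded to the next row for one lane."""
    if incoming_valid and incoming_dtag == local_dtag:
        return local_value + incoming_value, local_dtag    # case (1): same output, accumulate
    if incoming_valid:
        # case (2): pass the incoming data through; in the circuit the local
        # product would be held in the lane for a later cycle.
        return incoming_value, incoming_dtag
    return local_value, local_dtag                         # case (3): forward the local product

print(lane_step(2.0, 7, 5.0, 7, True))    # (7.0, 7): tags match, results are added
print(lane_step(2.0, 7, 5.0, 9, True))    # (5.0, 9): mismatched tag, incoming data forwarded
print(lane_step(2.0, 7, 0.0, 0, False))   # (2.0, 7): no valid incoming data
```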
- both the Q-array 2010 and the MP-array 2020 blocks of the architecture proposed in this specification may use zero-skipping logic, as shown in FIG. 26 .
- multiple MP-array 2020 blocks without any Q-array 2010 blocks may be employed to accelerate Neural Network training and inference, as shown in FIG. 27 .
- both the Q-array 2010 and the MP-array 2020 blocks of the architecture proposed in this specification may use fixed-point representation, with a greater number of bits for the gradients in the MP-array 2020 .
- the gradients for the MP-array 2020 may use other data types including logarithmic or power-of-2 data representations.
Abstract
For one embodiment, a hardware accelerator with a heterogeneous-precision architecture for training quantized neural networks is described. In one example, a hardware accelerator for training quantized neural networks comprises a multilevel memory to store data and a software controllable mixed precision array coupled to the memory. The mixed precision array includes an input buffer, detect logic to detect zero value operands, and a plurality of heterogenous precision compute units to perform computations of mixed precision data types for the forward and backward propagation phase of training quantized neural networks.
Description
- This application claims the benefit of U.S. Provisional Application No. 62/792,785, filed on Jan. 15, 2019, the entire contents of this Provisional application is hereby incorporated by reference.
- Embodiments described herein generally relate to the fields of data processing and machine learning, and more particularly relates to a hardware accelerator having a heterogenous architecture for training quantized neural networks.
- While interest in Deep Neural Networks (DNNs) continues to grow for big data applications, the focus of recent literature has shifted towards exploring efficient ways of training and executing deep learning models. One prominent approach for improving efficiency is quantization, which reduces the bit widths for data and operations in a deep learning model to yield increased performance and/or energy efficiency. From the architecture community, several prior approaches have exploited quantization to improve the efficiency of the inference phase of deep learning. In order to maximize the benefits from quantization and retain classification accuracy, the quantized version of the DNNs needs to be retrained which can take weeks on GPUs, depending on the size of the DNN model.
- For one embodiment of the present invention, a hardware accelerator with a heterogenous architecture for training quantized neural networks is described. In one example, a hardware accelerator for training quantized data comprises software controllable multilevel memory to store data and a mixed precision array coupled to the memory. The mixed precision array includes an input buffer, detect logic to detect zero value operands, and a plurality of heterogenous precision compute units to perform computations of mixed precision data types for a backward propagation phase of training quantized data of a neural network.
- Other features and advantages of embodiments of the present invention will be apparent from the accompanying drawings and from the detailed description that follows below.
- FIG. 1 shows an embodiment of a block diagram of a big data system 100 for providing big data applications for a plurality of devices in accordance with one embodiment.
- FIGS. 2A and 2B illustrate methods for training quantized DNNs with a hardware accelerator architecture (e.g., homogenous architecture in FIG. 2A, heterogenous architecture in FIG. 2B) in accordance with one embodiment.
- FIG. 3A illustrates pooling layers for the inference phase in accordance with one embodiment.
- FIG. 3B illustrates pooling layers 370 for the back-propagation phase in accordance with one embodiment.
- FIG. 4 illustrates an architecture 400 that includes three distinct types of computational blocks.
- FIG. 5 illustrates a homogeneous accelerator architecture 500 in accordance with one embodiment.
- FIG. 6 illustrates the design of a compute unit in accordance with one embodiment.
- FIG. 7 illustrates adder logic 700 that utilizes a novel low-overhead desynchronized encoding for zero-skipping in accordance with one embodiment.
- FIG. 8 illustrates non-zero detection logic 800 that includes zero-detector logic 810 and non-zero selector 820 in accordance with one embodiment.
- FIG. 9 illustrates scheduling operations across multiple MPZS-arrays in accordance with one embodiment.
- FIG. 10 illustrates an overview of a DNN workflow 1000 in accordance with one embodiment.
- FIGS. 11 and 12 illustrate performance of the GPU platform in comparison to different variations of the present design as implemented in an FPGA platform in accordance with one embodiment.
- FIG. 13 illustrates the schematic diagram of a data processing system according to an embodiment of the present invention.
- FIG. 14 illustrates the schematic diagram of a multi-layer accelerator according to an embodiment of the invention.
- FIG. 15 is a diagram of a computer system including a data processing system according to an embodiment of the invention.
- FIG. 16 shows the details of the specialized circuit 1700 for accelerating neural networks in prior art.
- FIG. 17 shows the details of the CU 1800 in the systolic array circuit.
- FIG. 18 shows the operations in the forward propagation 1940 and backward propagation 1950 phases for a single convolution layer 1900 for neural networks.
- FIG. 19 illustrates a novel heterogeneous-precision circuit 2000, which is a specialized circuit for accelerating neural network training and inference.
- FIG. 20 illustrates a design of a single CU in Q-array 2010.
- FIG. 21 shows the operations for a single layer of quantized neural network using the circuit 2000 described in the specification.
- FIG. 22 describes the zero-skipping logic 2300, which reads an 8-wide vector of data 2301 from the IBUF and selects one non-zero value 2302 from the 8-wide vector of data in each cycle for each row of CUs 2600.
- FIG. 23 shows the circuit for CU 2600 that can skip zero-valued gradients.
- FIG. 24 shows the circuit 2612 for accumulating the multiplication results across different CUs 2600.
- FIG. 25 shows the details of the accumulation logic per lane 2800 of the accumulation logic 2612.
- In another embodiment of this work, both the Q-array 2010 and the MP-array 2020 blocks of the architecture proposed in this specification may use zero-skipping logic, as shown in FIG. 26.
- In another embodiment of this work, multiple MP-array 2020 blocks without any Q-array 2010 blocks may be employed to accelerate Neural Network training and inference, as shown in FIG. 27.
- Methods and systems having a heterogenous architecture for training quantized neural networks are described. The present design leverages two algorithmic properties: quantization and sparsity for quantized training. Training operations for quantized DNNs possess dual characteristics: (1) due to high sparsity in the high precision gradients, the backward phase favors sparse execution, and (2) the quantized activations/weights in the forward phase favor dense execution due to the large overhead of zero-skipping for quantized activations. The present design provides a unified architecture that leverages both properties and shows that FPGAs not only provide higher energy efficiency than GPUs but can also, on average, outperform GPUs across a range of quantization techniques and DNN topologies.
- In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the present invention can be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid obscuring the present invention.
- Reference in the specification to “one embodiment” or “an embodiment” means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrase “in one embodiment” appearing in various places throughout the specification are not necessarily all referring to the same embodiment. Likewise, the appearances of the phrase “in another embodiment,” or “in an alternate embodiment” appearing in various places throughout the specification are not necessarily all referring to the same embodiment.
- The following glossary of terminology and acronyms serves to assist the reader by providing a simplified quick-reference definition. A person of ordinary skill in the art may understand the terms as used herein according to general usage and definitions that appear in widely available standards and reference books.
- HW: Hardware.
- SW: Software.
- I/O: Input/Output.
- DMA: Direct Memory Access.
- CPU: Central Processing Unit.
- FPGA: Field Programmable Gate Arrays.
- CGRA: Coarse-Grain Reconfigurable Accelerators.
- GPGPU: General-Purpose Graphical Processing Units.
- MLWC: Many Light-weight Cores.
- ASIC: Application Specific Integrated Circuit.
- PCIe: Peripheral Component Interconnect express.
- CDFG: Control and Data-Flow Graph.
- FIFO: First In, First Out
- NIC: Network Interface Card
- HLS: High-Level Synthesis
- Dataflow analysis: An analysis performed by a compiler on the CDFG of the program to determine dependencies between a write operation on a variable and the consequent operations which might be dependent on the written operation.
- Accelerator: a specialized HW/SW component that is customized to run an application or a class of applications efficiently.
- In-line accelerator: An accelerator for I/O-intensive applications that can send and receive data without CPU involvement. If an in-line accelerator cannot finish the processing of an input data, it passes the data to the CPU for further processing.
- Bailout: The process of transitioning the computation associated with an input from an in-line accelerator to a general purpose instruction-based processor (i.e. general purpose core).
- Continuation: A kind of bailout that causes the CPU to continue the execution of an input data on an accelerator right after the bailout point.
- Rollback: A kind of bailout that causes the CPU to restart the execution of an input data on an accelerator from the beginning or some other known location with related recovery data like a checkpoint.
- Gorilla++: A programming model and language with both dataflow and shared-memory constructs as well as a toolset that generates HW/SW from a Gorilla++ description.
- GDF: Gorilla dataflow (the execution model of Gorilla++).
- GDF node: A building block of a GDF design that receives an input, may apply a computation kernel on the input, and generates corresponding outputs. A GDF design consists of multiple GDF nodes. A GDF node may be realized as a hardware module or a software thread or a hybrid component. Multiple nodes may be realized on the same virtualized hardware module or on a same virtualized software thread.
- Engine: A special kind of component such as GDF that contains computation.
- Infrastructure component: Memory, synchronization, and communication components.
- Computation kernel: The computation that is applied to all input data elements in an engine.
- Data state: A set of memory elements that contains the current state of computation in a Gorilla program.
- Control State: A pointer to the current state in a state machine, stage in a pipeline, or instruction in a program associated to an engine.
- Dataflow token: Components input/output data elements.
- Kernel operation: An atomic unit of computation in a kernel. There might not be a one to one mapping between kernel operations and the corresponding realizations as states in a state machine, stages in a pipeline, or instructions running on a general purpose instruction-based processor.
- Two challenges for accelerating training for quantized DNNs have been identified: high precision for gradients and variation in computations. Gradients in the backward phase of training, which include both the backward propagation of loss and the calculation of weight gradients, require higher precision compared to activations and weights for forward propagation. From a hardware perspective, the higher precision requirements for gradients mean that an accelerator for training quantized DNNs would limit the benefits from quantizing the DNNs.
- In regards to variation in computations, the highly parallel multiply add operations for convolutions/fully-connected layers are interleaved with quantization transformations and require expensive transcendental functions such as tanh or sigmoid that operate on floating-point data.
- While it can be argued that the transcendental functions and the data movement operations can be offloaded to the host CPU, the latency for data-transfer for every convolution in the DNN can limit the benefits from acceleration. Furthermore, the quantization transformation and even the data-representations (e.g., fixed-point, power-of-2, floating-point) vary significantly across the different techniques proposed in recent literature, making ASIC acceleration approach less appealing. To overcome the challenges mentioned above, the present design targets FPGAs for their flexibility and develops a heterogenous architecture, which is an accelerator for training quantized DNNs. This heterogenous architecture is designed to challenge the reign of GPUs as the de facto platform for DNN training. The heterogenous architecture leverages three algorithmic properties of quantized DNN training algorithms.
- In one example, compute intensive operations for the convolution and fully-connected layers in quantized training need mixed precision; that is, one of the operands is a high-precision gradient while the other is a quantized weight/activation. Using mixed-precision allows the heterogenous architecture to reduce the high resource cost of the compute units, increasing the parallelism that the FPGA can offer using its limited pool of resources.
- In another example, training operations for quantized DNNs possess a dual characteristic—the high-precision gradients in the backward phase are highly sparse (>99% zeros); while the quantized activations in the forward phase have between 45-60% zeroes. The heterogenous architecture leverages the dual characteristics of high-precision, high-sparsity in the backward phase and low-precision, low-sparsity in the forward phase.
- In another example, both the data-representations (fixed-point, power of 2, etc.) and precision (number of bits) for activations, weights, and gradients vary between different DNN models. The heterogenous architecture utilizes a template architecture that exploits the reconfigurability of the FPGA to generate a specialized implementation for each quantized DNN.
- The heterogenous architecture acting as an accelerator utilizes the properties of quantization in the bit-heterogeneous architecture to deliver significant improvement in performance and energy efficiency over GPUs. The quantization transformation and the quantized data representation both differ for different training algorithms. However, the structure of the compute intensive convolution/activation layers remain the same. To support a wide range of quantization transformations, and yet, provide high performance for a wide range of DNNs, the heterogenous architecture uses (1) systolic arrays (e.g., sparse dense heterogenous architecture array) for the highly parallel mixed-precision Multiply-Accumulate (MAC) operations in convolution/fully-connected layers in a DNN, and (2) programmable data Transformation Arrays (TX-array) to support the resource intensive quantization transformations as well as the activation/pooling layers in DNNs.
-
FIG. 1 shows an embodiment of a block diagram of a machine learning system 100 for providing machine learning applications for a plurality of devices in accordance with one embodiment. The machine learning system 100 includes machine learning modules 130 (e.g., DNN modules), ingestion layer 132, enrichment layer 134, microservices 136 (e.g., microservice architecture), reactive services 138, and business intelligence layer 150. In one example, a microservice architecture is a method of developing software applications as a suite of independently deployable, small, modular services. Each service has a unique process and communicates through a lightweight mechanism. The system 100 provides big data services by collecting data from messaging systems 182 and edge devices, messaging systems 184, web servers 195, communication modules 102, internet of things (IoT) devices 186, and devices 104 and 106 (e.g., source device, client device, mobile phone, tablet device, laptop, computer, connected or hybrid television (TV), IPTV, Internet TV, Web TV, smart TV, satellite device, satellite TV, automobile, airplane, etc.). Each device may include a respective big data application 105, 107 (e.g., a data collecting software layer) for collecting any type of data that is associated with the device (e.g., user data, device type, network connection, display orientation, volume setting, language preference, location, web browsing data, transaction type, purchase data, etc.). The system 100, messaging systems and edge devices 182, messaging systems 184, web servers 195, communication modules 102, internet of things (IoT) devices 186, and devices
FIGS. 2A and 2B illustrate methods for training quantized DNNs with a hardware accelerator architecture (e.g., homogenous architecture in FIG. 2A, heterogenous architecture in FIG. 2B) in accordance with one embodiment. Although the operations in the methods are shown in a particular order, the order of the actions can be modified. Thus, the illustrated embodiments can be performed in a different order, and some operations may be performed in parallel. Some of the operations listed in FIGS. 2A and 2B are optional in accordance with certain embodiments. The numbering of the operations presented is for the sake of clarity and is not intended to prescribe an order of operations in which the various operations must occur. Additionally, operations from the various flows may be utilized in a variety of combinations.
FIGS. 2A and 2B may be executed by a compiler component, a data processing system, a machine, a server, a web appliance, a centralized system, a distributed node, or any system, which includes an accelerator (e.g., CPU, GPU, FPGA). The accelerator may include hardware (circuitry, dedicated logic, etc.), software (such as is run on a general purpose computer system or a dedicated machine or a device), or a combination of both. - The compute intensive convolution and fully-connected layers, which require a large number of simple MAC operations, are interleaved with resource intensive quantization transformations, which perform fewer operations but need more FPGA resources for implementing the complex operations.
-
FIG. 2A illustrates the various operations of method 200 to train a single quantized convolution layer when using an architecture with homogenous precision for all computations. FIG. 2B illustrates the various operations of method 250 to train a single quantized convolution layer when using an architecture with heterogenous precision for all computations.
-
Inference phase 201 includes operations 202, 204, 206, 208, 210, 212, and 214. At operation 202, the method includes receiving input data for an input layer with the input data being quantized (e.g., quantized from a first precision datatype for input data into a second precision datatype). At operation 204, the method includes receiving the second precision datatype (e.g., high precision, 32-bit floating-point) for the input data. At operation 208, the method includes receiving a first precision datatype for the initial weights with the weights being quantized from a first precision datatype into a second precision datatype. At operation 206, the method includes receiving the second precision datatype (e.g., high precision, 32-bit floating-point) for the weights.
operation 210, the method includes performing a convolution operation(s) (convf) of a convolution layer on the input data and weights including a large number of Multiply-Accumulate (MAC) operations. Weights fromoperation 206 can be applied to the input data during the convolution operations. Atoperation 212, output fromoperation 210 is generated as the second precision datatype and quantized into a first precision datatype atoperation 214. The output of an output layer is available for further processing atoperation 214. - The
backward propagation phase 220 updates original weights to reduce a loss function to improve classification of the input data. Thebackward propagation phase 220 includesoperations operation 240, an output loss function is generated. Atoperation 244, weights are quantized from a first precision datatype into a second precision datatype (e.g., high precision datatype) to form high precision datatype atoperation 242. Atoperation 246, a convolution (convb) is performed on output fromoperation 240 and the second precision datatype weights fromoperation 242 to generate an input loss atoperation 248. - At
operation 222, an output loss function is generated. Atoperation 226, inputs are quantized from a first precision datatype into a second precision datatype (e.g., high precision datatype) to form high precision datatype atoperation 224. Atoperation 228, a convolution (convb) is performed on output fromoperation 222 and the second precision datatype inputs fromoperation 224 to generate a weight loss function atoperation 230. - In one embodiment, convf uses low-bitwidth fixed-point data for activations and weights. In contrast, convb and convw may require mixed precision data types (e.g., high bit width fixed-point/floating-point) for gradients, depending on the quantization algorithm. The gradients for the Convb and Convw operations may require either high bit width fixed-point or floating-point datatypes, depending on the quantization algorithm. At the same time, the activations for the Convw operation and weights for the Convb operation may require low bit width fixed-point representation. The precision requirements are a static property of the quantized DNN, designed by the programmer/machine learning expert.
- The varying precision requirements of quantized DNN training potentially provide ample opportunities to improve performance and energy-efficiency. However, exploiting this algorithmic property on hardware accelerators is challenging, since the homogeneous-precision hardware accelerators, such as GPUs, need to account for the highest precision requirements. The high precision requirement of gradients in DNN training often force the accelerators to run all the operations on the high precision such as 32-bit single-precision floating-point. Thus, even when the operations in quantize DNN training can use low-bit width datatypes (e.g., binary, ternary, etc) on a homogeneous precision architecture, the data needs to be first converted into higher precision datatypes before executing the operations on hardware.
FIG. 2A shows an example in which the Input and Weights are first converted into high precision datatypes to match the high precision datatypes of gradients before performing the Convf, Convb, or Convw operations. - In contrast, this present design introduces the use of heterogeneous precision in the accelerator design for quantized DNN training. The proposed architecture uses specialized compute units that dynamically match the varying precision requirements of quantized DNN training. As
FIG. 2B shows, using heterogeneous precision enables the proposed architecture to avoid conversion to high precision datatypes and perform computations on either quantized or mixed-precision datatypes. An advantage of this design is that compute units for quantized and mixed-precision datatypes use significantly less amount of hardware resources and less energy compared to high-precision compute units. Note that the Output tensor for Convf inFIG. 2B may still require high precision datatype to avoid overflow of the intermediate data. -
Inference phase 251 includes operations 254, 256, 258, 259, and 260. At operation 254, the method includes receiving input data for an input layer with the input data being quantized or a mixed precision datatype. Any low bit width precision datatypes are not converted into high bit width precision datatypes. At operation 256, the method includes receiving initial weights with the weights being quantized or a mixed precision datatype. Any low bit width precision datatypes are not converted into high bit width precision datatypes.
operation 260, the method includes performing a convolution operation(s) (convf) of a convolution layer on the input data and weights including Multiply-Accumulate (MAC) operations. Weights fromoperation 256 can be applied to the input data during the convolution operations. Atoperation 258, output fromoperation 260 is generated as a second precision datatype and quantized into a first precision datatype atoperation 259. The output of an output layer is available for further processing atoperation 259. - The
backward propagation phase 290 updates original weights to reduce a loss function to improve classification of the input data. Thebackward propagation phase 290 includesoperations operation 270, an output loss function is generated. Atoperation 272, weights are quantized or a mixed precision datatype. Any low bit width precision datatypes are not converted into high bit width precision datatypes. - At
operation 274, a convolution (convb) is performed on output fromoperation 270 and the weights fromoperation 272 to generate an input loss function atoperation 276. - At
operation 280, an output loss function is generated. Atoperation 282, inputs are quantized or a mixed precision datatype. Any low bit width precision datatypes do not need to be converted into high bit width precision datatypes. Atoperation 284, a convolution (convb) is performed on output fromoperation 280 and the inputs fromoperation 282 to generate a weight loss function atoperation 286. - In one example, the
method - Thus far, the present design utilizes a static property of quantized DNNs, varying precision requirements, in the design of accelerators for DNN training. Additionally, the present design also exploits a run-time property of quantized DNN training that many zero-valued multiplications can be skipped in both forward and backward computations. Prior approaches have explored zero-skipping techniques for inference phase and reported that skipping zero-valued 16-bit activation values can provide significant performance benefits. The present design determines that zero-skipping for training phase opens significantly more opportunities than the inference phase, since the training phase contains a larger fraction of zero-valued multiplications among the total operations. However, seizing the opportunities via zero-skipping imposes additional hardware cost to identify and skip ineffectual multiplications. Therefore, the benefits from zero-skipping are dependent on two factors: (1) the overhead of additional logic required for skipping the computation, and (2) the number of ineffectual computations that can be skipped. In this design, the overhead for zero-skipping logic is lower on mixed-precision arrays than on quantized computations. Moreover, the backward phase of DNN training contains significantly higher zero values (e.g., up to 90%) in comparison with the zero activations of the forward phase compute (e.g., up to 45-60%). This larger number of zero valued gradients for the backward phase compared to zero-valued activations for the forward phase leads to the following analysis for
FIGS. 3A and 3B .FIG. 3A illustrates pooling layers for the inference phase in accordance with one embodiment. Pooling layers 320 for the inference phase select maximum values out of a 2-D grid ofinputs 310 to generatemaximum inputs 330, as shown inFIG. 3A in accordance with one embodiment.FIG. 3B illustrates poolinglayers 370 for the back-propagation phase in accordance with one embodiment. For the back-propagation phase, the gradients corresponding to the maximum values selected in the inference phase are non-zero while the rest are zero for grid ofinputs 350. Thegrid 370 includes the non-zero values from thegrid 350. The gradients corresponding to the negative inputs for ReLU activation (rectifier linear function) are zero, which can be as high as 50% sparsity. The heterogenous architecture specializes the computational resources to account for these runtime properties. - In regards to variation in runtime characteristics, while quantization is a static property of a DNN, sparsity—the % of zero-valued activations—in the forward or backward computations is a run-time property. Quantization reduces the size of multipliers required and exploiting sparsity requires an area overhead for zero-skipping. Prior art references have shown performance improvements when skipping zero-valued activations for the inference phase when using 16-bit data representations.
- The present design utilizes the interplay between quantization and sparsity and defines sparsityminimum as the minimum number of zero-valued activations or gradients required to break-even from the overhead of zero-skipping with sparsityminimum being defined as follows.
-
sparsityminimum=1−(1/overheadzero skipping). (1) - Note that sparsityminimum in the above formulation assumes an ideal architecture that can skip all zero-valued computations and serves as a reference to evaluate the potential benefits from zero-skipping.
- A compute intensive convolution and fully-connected layers, which require a large number of simple MAC operations, are interleaved with resource intensive quantization transformations, which perform fewer operations but need more FPGA resources for implementing the complex operations. The quantized training requires additional operations that transform the activations, weights, and gradients to different data representations. The type of quantization transformation varies according to the quantization algorithm. Offloading these operations to the host CPU would lead to high latencies.
- Thus, a
homogeneous accelerator architecture 500 ofFIG. 5 would overprovision resources for the different types of operations using a homogeneous set orarray 510 of Processing Engines (PEs), and more importantly, (2) would be unable to exploit the algorithmic characteristics of reduced precision from quantization and high sparsity in back-propagated gradients. Therefore, heterogeneity is important to maximize the potential performance benefits using the limited pool of resources on a FPGA die. Motivated by the above insight, a heterogeneous architecture for accelerating quantized training has been designed. - The present design utilizes a template architecture that is both scalable—to maximally utilize the FPGA's on-chip resources, and customizable—to adapt to the precision requirements of the quantized DNN being trained.
- This
heterogenous architecture 400, as shown inFIG. 4 , includes three distinct types ofcomputational blocks Dense Quantized Array 410, 412 (DQ-array), which is a systolic array (e.g., 16×16 systolic array) of low bit width multiply-accumulate computation units that are labeled as processing engines (PEs) in one example, includes an input buffer, an output buffer, and the PEs. - A mixed precision zero skipping
array 420, 422 (MPZS-array), which is a systolic array (e.g., 16×16 systolic array) of mixed-precision multiply-accumulate computation units that are labeled as processing engines (PEs), includes an input buffer, zero skip logic, an output buffer, and PEs. -
Array 430 (TX-array) is a more general purpose array that can be programmed to compute element-wise data transformations necessary for quantized training.
- The present application will now describe the microarchitecture of the
heterogenous architecture 400, and an algorithm for optimizing the sizes of the three types of arrays to maximize performance. - As previously described herein, the runtime characteristics for the forward and backward phases of quantized training differ significantly. To this end, the present heterogenous architecture uses a MPZS-array that exploits the dual characteristics of high sparsity for the high precision gradients for zero-skipping in the backward phase, and uses a dense quantized execution for the forward phase. The basic building block for the MPZS-array is the CU, which is a bit-flexible compute unit, described below.
-
FIG. 6 illustrates the design of a compute unit in accordance with one embodiment. The CU 600 includes n quantized mixed precision multipliers (e.g., 610-613), each of which can multiply up to m-bit operands. While m depends on the minimum precision required by the MAC operations in convolution/fully-connected layers, n depends on the ratio of precision_max/precision_min. The outputs of the n quantized multipliers are added to produce an output 690. The CU supports a flexible range of precision for the floating point or fixed point inputs 601-608 (e.g., floating point 32-bit 601, fixed point 2-bit 602, floating point 32-bit 603, fixed point 2-bit 604, floating point 32-bit 605, fixed point 2-bit 606, floating point 32-bit 607, fixed point 2-bit 608), which carry activations in the forward phase and gradients in the backward phase. At the lowest precision mode, the n quantized multipliers in a CU perform n independent multiplications. At the highest precision mode, the n quantized multipliers together multiply a single n×m-bit operand with an m-bit operand. In this example, a MPZS array uses a 2D systolic array of 16×16 CUs. - In the dense forward execution mode, each compute unit in the MPZS array performs multiple multiply-add operations for quantized activations and weights in the forward phase of training. The partial results generated by the different quantized multipliers are added together to produce a single output.
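- The arithmetic idea behind this bit-flexible compute unit can be sketched as follows (Python, for illustration only; the function names are invented for this example, and signed and floating-point operand handling is omitted): the n small multipliers either produce n independent low-precision products that are summed, or are fused, by weighting their partial products with shifts, to multiply one n×m-bit operand by an m-bit operand.

```python
# Illustrative sketch (not the patented RTL): n m-bit multipliers either run
# independently or are fused to multiply one (n*m)-bit operand by an m-bit operand.

def cu_low_precision(a_list, b_list, m=2):
    """Lowest-precision mode: n independent m-bit multiplies, summed into one output."""
    mask = (1 << m) - 1
    return sum((a & mask) * (b & mask) for a, b in zip(a_list, b_list))

def cu_high_precision(a_wide, b, n=4, m=2):
    """Highest-precision mode: the same n multipliers jointly compute a_wide * b,
    where a_wide is an (n*m)-bit operand and b is an m-bit operand."""
    mask = (1 << m) - 1
    result = 0
    for i in range(n):
        chunk = (a_wide >> (i * m)) & mask            # i-th m-bit slice of the wide operand
        result += (chunk * (b & mask)) << (i * m)     # weight the partial product by its position
    return result

# Sanity check of the fused mode against ordinary multiplication.
assert cu_high_precision(0b10110110, 0b11, n=4, m=2) == 0b10110110 * 0b11
```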
- As discussed, the gradients in the backward phase for DNNs have high sparsity (e.g., up to 99%). A naive first approach to exploiting such a high degree of sparsity is to serialize the MAC operations using a single row of the systolic array. Such an approach has two drawbacks: (1) each row would require its own single-precision floating point accumulator, which would increase the resource cost (FPGA LUT/DSP) per row; and (2) parallelism would be limited to a single row.
- A second approach is to use multiple rows in the systolic array, which increases parallelism. Further, outputs within each column of the systolic array can be accumulated in a single floating-point accumulator. The drawback of the second approach is that it enforces synchronization between different rows of the systolic array. That is, each row waits for all the other rows to finish computing the current output before moving on to the next output. Prior work uses the second approach to improve inference performance when the sparsity of activations is between 45% and 60%. The present design, on the other hand, aims to exploit the considerably higher sparsity present in the gradients of the backward phase of quantized DNN training. Due to the high sparsity in the gradients for the backward phase, synchronization between different rows of the systolic array would significantly limit the performance benefits from zero-skipping.
- The present design identifies two limitations of the above technique when applied to highly sparse gradients. The fundamental assumption there is that the compute units in each column synchronize and operate on a single sparse-vector. The first limitation is therefore that each row stalls until all the other rows finish operating on their own sub-vectors before proceeding to the next sparse-vector, which limits the potential benefits from zero-skipping given the high sparsity in gradients.
- For the second limitation, when reading one sparse sub-vector at a time from the memory (e.g., BRAM), the non-zero detect logic will stall when there are no non-zero values in the sub-vector. Assuming 95% sparsity in the gradients, the probability that a 16-wide sub-vector contains only zeros (assuming independent and identically distributed values) is approximately 44%.
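- The quoted figure follows directly from the sparsity assumption, as the short check below shows (Python, illustrative only):

```python
# Probability that every value in a 16-wide sub-vector is zero at 95% sparsity.
p_all_zero = 0.95 ** 16
print(round(p_all_zero, 2))  # ~0.44, so a naive non-zero detector stalls on roughly 44% of sub-vectors
```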
- To overcome the above second limitation, the present design utilizes a novel low-overhead desynchronized encoding for zero-skipping as illustrated in a multi-lane adder logic 700 of FIG. 7. This encoding uses a desynchronization-tag or d-tag 706 to remove synchronization between rows of a MPZS-array. The MPZS-array encodes the non-zero values as (value 702, offset 704, d-tag 706) tuples. The d-tag 706 specifies the identification (ID) of the sparse-vector that each row operates on. To take advantage of the proposed desynchronized encoding in the MPZS-array, the present design uses two tag-lanes within each column; the compute units in a column share the tag-lanes and forward their results to one of the lanes using the least significant bit (LSB) of the d-tag. When the select logic 730 determines that the tag for the current row matches the previous row's tag for either the odd or even tag-lane, the values are added together and forwarded to the next row. When the tags do not match, the results are stored locally.
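- A behavioral sketch of this desynchronized, two-lane accumulation is given below (Python, for illustration only; it is not the circuit itself, and the pass-through behavior on a tag mismatch is an assumption of this sketch rather than a requirement of the design):

```python
# Each row produces a (partial_sum, d_tag) pair; the LSB of the d-tag selects the odd or even
# tag-lane. If the incoming lane value carries the same d-tag, the partial sums are added and
# forwarded to the next row; otherwise the row's result is held locally for later accumulation.

def forward_one_row(incoming_lanes, row_result, row_tag, local_store):
    """incoming_lanes: {0: (sum, tag) or None, 1: (sum, tag) or None} from the previous row."""
    lane = row_tag & 1                       # LSB of the desynchronization tag selects the lane
    outgoing = dict(incoming_lanes)
    prev = incoming_lanes[lane]
    if prev is not None and prev[1] == row_tag:
        outgoing[lane] = (prev[0] + row_result, row_tag)   # tags match: add and forward
    else:
        local_store.append((row_result, row_tag))          # tags differ: keep the result locally
        outgoing[lane] = prev                              # pass the incoming value through (assumption)
    return outgoing, local_store
```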
- To overcome the first limitation, the present design decomposes the non-zero detection logic 800 of FIG. 8 into two different modules: (1) zero-detector logic 810, and (2) non-zero selector 820. The zero-detector logic includes a series of comparators that generate a bit-vector with a single bit for each value of the sub-vector (e.g., a 16-wide sub-vector). Each bit in the bit-vector specifies whether the corresponding value in the sub-vector is zero (low) or non-zero (high). When all bits in the bit-vector are low, the sub-vector is skipped entirely. Otherwise, the sub-vector is pushed to a FIFO queue 830, along with its bit-vector and a d-tag identifying the input ID. The non-zero selector then pops the FIFO queue to read only those sub-vectors that have at least one non-zero value. The non-zero selector then selects the first non-zero value and the corresponding offset to produce a (value, offset, tag) tuple. Using desynchronization and sub-vector skipping, the present design improves the performance of the MPZS-array when sparsity is high. - While these two techniques improve performance, they also increase the consumption of the FPGA's LUT resources. As discussed herein, the resource overhead of exploiting sparsity outweighs the benefits from zero-skipping in the forward phase, which uses low bit width activations for quantized DNNs. Therefore, the MPZS-array utilizes a dense execution for the forward phase of quantized DNN training, as described below.
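- The split between the zero-detector and the non-zero selector described above can be modeled in software as follows (Python, illustrative only; cycle-level behavior is collapsed, and emitting every non-zero entry at once is a simplification of the one-value-per-cycle selection):

```python
from collections import deque

def zero_detector(sub_vectors, fifo):
    """Push only sub-vectors that contain at least one non-zero value, together with a
    bit-vector of non-zero positions and a d-tag identifying the input."""
    for d_tag, vec in enumerate(sub_vectors):
        bit_vector = [int(v != 0) for v in vec]
        if any(bit_vector):                      # all-zero sub-vectors are skipped entirely
            fifo.append((vec, bit_vector, d_tag))

def non_zero_selector(fifo):
    """Pop queued sub-vectors and emit (value, offset, tag) tuples for each non-zero entry."""
    while fifo:
        vec, bit_vector, d_tag = fifo.popleft()
        for offset, present in enumerate(bit_vector):
            if present:
                yield (vec[offset], offset, d_tag)

fifo = deque()
zero_detector([[0, 0, 0, 0], [0, 3, 0, 7]], fifo)
print(list(non_zero_selector(fifo)))   # [(3, 1, 1), (7, 3, 1)]
```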
- The present design uses a template architecture to implement the MPZS-array on FPGA. The precision for the multiply-add operations can be modified according to the needs of the quantized DNN.
- The following section discusses the scheduling of operations for quantized training across multiple MPZS-arrays. In order to parallelize the operations for training a quantized DNN across multiple MPZS-arrays, the present design splits the computations in each operation into tiles. For most operations required for training quantized DNNs, the total amount of data is often much larger than the limited on-chip memory available on the FPGA. Therefore, splitting the computations into tiles is necessary to fit the data into on-chip memory.
-
FIG. 9 illustrates scheduling operations across multiple MPZS-arrays in accordance with one embodiment. To maximize the performance of the MPZS architecture, the present design uses three types of tiling and expresses the task of determining the tile sizes as a constrained optimization problem. The three types of tiling correspond to the three levels of the memory hierarchy in the MPZS architecture, and the sizes of each level of the memory hierarchy serve as the constraints for optimizing the tile sizes. The present design uses a simple fully-connected layer in FIG. 9 as an example to explain the scheduling of operations. The fully-connected layer in FIG. 9 can be expressed as a matrix multiplication as follows.
output(B×Cout)=input(B×Cin)×weights(Cin×Cout) (2) -
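- As a concrete illustration of equation (2) and of the tiling described next, the sketch below (Python with NumPy, illustrative only; the tile sizes and loop order are assumptions for this example) evaluates the fully-connected layer tile by tile, with one loop level per level of the memory hierarchy:

```python
import numpy as np

def tiled_fc_layer(inputs, weights, tile_b=32, tile_cin=64, tile_cout=64):
    """Tiled version of equation (2): output(B x Cout) = input(B x Cin) x weights(Cin x Cout).
    Each block is sized to fit one level of on-chip memory; reuse grows with tile size,
    bounded by that level's capacity."""
    B, Cin = inputs.shape
    Cout = weights.shape[1]
    out = np.zeros((B, Cout), dtype=np.result_type(inputs, weights))
    for b0 in range(0, B, tile_b):                 # tiles held in global (e.g., URAM) memory
        for c0 in range(0, Cout, tile_cout):       # tiles held in cluster (e.g., BRAM) memory
            for k0 in range(0, Cin, tile_cin):     # innermost tiles feed the systolic arrays
                out[b0:b0+tile_b, c0:c0+tile_cout] += (
                    inputs[b0:b0+tile_b, k0:k0+tile_cin]
                    @ weights[k0:k0+tile_cin, c0:c0+tile_cout]
                )
    return out
```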
FIG. 9 shows how the operations in a fully-connected layer are split into tiles for each level of the memory hierarchy, including a global memory tile (e.g., URAM tile) at operation 910 and a cluster memory tile (e.g., BRAM tile) at operation 930. Using a larger tile size for each level of hierarchy increases the data reuse at that level of hierarchy at operation 940. The tile sizes are constrained by the capacity of memory at that level of hierarchy. - Next, an overview of a DNN workflow 1000 is illustrated in
FIG. 10 in accordance with one embodiment. The DNN workflow begins with a programmer defining a Dataflow Graph (DFG) of the DNN using a high-level API. This API allows the programmer to specify the precision for each operation in the DNN. As shown in FIG. 10, this workflow includes four operations: (1010) a dataflow analysis operation to analyze the resource requirements for the dataflow graph, (1020) a resource partitioning operation to analytically split the FPGA's resources, (1030) a cycle-accurate scheduling operation to obtain cycle counts, and (1040) a builder operation to generate a synthesizable accelerator using the optimal resource breakdown from operation 1030. Below, we describe the four operations in detail. - For static analysis, the
first operation 1010 of the workflow includes analyzing the type of computational resources required by the DNN model. This operation 1010 a includes the dataflow analyzer component iterating over the nodes of the dataflow graph of the DNN and generating a list of (operation type, precision, operation count) tuples for the forward and backward passes of training. In one example, the operation type is a type of scalar operation (e.g., multiply, add, etc.), the precision field is a tuple of the data-types required by the operands (e.g., fixed-point, floating-point, or power-of-2), and the operation count field describes the number of scalar operations. Next, the dataflow analyzer generates the highest and lowest precision required for the forward pass and repeats the same for the backward pass. Determining the range of precision requirements is essential for estimating the resources required for compute units in the FPGA (e.g., LUTs, DSPs, and Flip-Flops). - While static analysis determines the static utilization of the FPGA's resources, runtime analysis is essential to estimate the dynamic utilization, considering that a large number of multiply-add operations in a DNN are ineffectual due to one of the operands being zero. At
operation 1010 b, the dataflow analyzer performs runtime analysis by sampling the data propagated in the forward and backward passes of the dataflow graph for numerous iterations using a user-specified batch size of inputs. Next, the dataflow analyzer calculates the proportion of zero-valued data in the sampled data. Using the information generated by the static and runtime analysis in the dataflow analysis operation, the resource partitioning component divides the FPGA's resources as follows at operation 1020. - The resource partitioner component of the workflow uses an analytical model to obtain the optimal breakdown of the FPGA's resources for the forward and backward passes. Since most operations in a DNN are Multiply-Accumulate (MAC) operations, the resource partitioner only considers the MAC operations for the analytical model. For a given pair of (precision_fwd, ops_fwd) and (precision_bwd, ops_bwd) for the forward and backward passes of training, the resource partitioner generates the optimal breakdown (p, 1−p) of the FPGA's resources for executing the forward and backward passes, respectively.
-
alu_fwd = p × resource_total / resource_fwd (3a)
alu_bwd = (1 − p) × resource_total / resource_bwd (3b)
- Where resource_fwd and resource_bwd are obtained from synthesizing compute units with precision_fwd and precision_bwd, respectively. Next, the resource partitioning component optimizes the ideal number of cycles required by the forward and backward operations, given by the following equation:
cycles_total = (ops_fwd × nz_fwd)/(alu_fwd + alu_bwd) + (ops_bwd × nz_bwd)/alu_bwd (4)
- Using equations [3a] and [3b], equation [4] is solved quadratically to get the optimal partitioning p as follows.
minimize over p: cycles_total(p), p ∈ [0, 1) (5)
p = −(c + 1)/(c × r − 1) + sqrt((c + 1)^2/(c × r − 1)^2 − (c − r)/(r × (c × r − 1))) (6)
where c = (ops_bwd × nz_bwd)/(ops_fwd × nz_fwd) (7)
r = resource_bwd/resource_fwd − 1 (8)
- Here, the c term is the ratio of non-zero computations in the backward pass to the non-zero computations in the forward pass, and the r term is the ratio of resources required for the backward pass to the resources required by the forward pass, minus one. While computing the value of r requires static information, computing c requires both static and dynamic information.
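- The closed form of equation (6) can be checked numerically with the short sketch below (Python, illustrative only; consistent with equation (4), it assumes that forward operations can execute on both array types while backward operations execute only on the mixed-precision arrays):

```python
import math

def optimal_partition(c, r):
    """Closed-form p from equation (6); c is the backward/forward non-zero work ratio
    (equation (7)) and r = resource_bwd / resource_fwd - 1 (equation (8))."""
    a = c * r - 1.0
    return -(c + 1.0) / a + math.sqrt((c + 1.0) ** 2 / a ** 2 - (c - r) / (r * a))

def relative_cycles(p, c, r):
    """Cycle model of equation (4) up to a constant factor."""
    return 1.0 / (1.0 + p * r) + c / (1.0 - p)

# Example: the backward pass has half the non-zero work of the forward pass, but its
# compute units cost 4x the resources (so r = 3). The closed form should sit at the
# minimum of the cycle model.
c, r = 0.5, 3.0
p = optimal_partition(c, r)                      # ~0.27 in this example
grid_min = min(relative_cycles(x / 1000, c, r) for x in range(1, 1000))
assert grid_min >= relative_cycles(p, c, r) - 1e-6
```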
- The value of p obtained from equation [6] is the optimal breakdown of the FPGA's resources assuming no under-utilization of resources due to memory accesses. In reality, however, even quantized DNNs have a large memory footprint and hence performance of the generated FPGA accelerator depends both on the breakdown of the FPGA's resources and the organization of on-chip memory. Nevertheless, the value of p obtained from equation [6] serves as a good initial solution for optimizing the breakdown of the FPGA's resources.
- Next, the scheduler, the third component of the workflow, evaluates the quality of the solution generated by the resource partitioner. The present design uses a cycle-accurate architectural simulation model for determining the quality of the partitioning solution. First, the simulator component divides the FPGA's LUT and DSP resources into 16×16 systolic arrays for the forward and backward passes using the p obtained from the resource partitioner.
- Next, the simulator evenly divides the FPGA's memory (e.g., URAM and BRAM) resources for each systolic array. The architecture of the present design uses a 2 level hierarchy for organizing the on-chip memory, as discussed above. Finally, using the number of forward and backward systolic arrays along with the memory organization, the simulator component performs cycle-accurate simulation. The simulation model accounts for limited bandwidth and latency for communication over both PCIe and the off-chip DRAMs. The scheduler generates the cycle counts for DQ-array and MPZS-array. Using the cycle-counts, the scheduler updates the compute ratio c defined in Equation [7] as follows.
-
cnext=cyclesDQ-array/cyclesMPZS-array (9) - The scheduler then feeds back the updated compute ratio to the resource partitioner.
Algorithm 1 summarizes the tasks of the Dataflow Analyzer, Resource Partitioner, and Scheduler. Since the present design aims to flexibly support a wide range of quantized training algorithms, it uses a template architecture to accelerate a wide range of quantized DNNs. The first three components generate an optimized set of parameters for the template architecture along with an optimized execution schedule. The last component, the builder, generates a synthesizable accelerator using both the optimized set of architectural parameters and the execution schedule.
Algorithm 1: Heterogenous resource partitioning
Inputs:  D: DFG of the quantized DNN
         resource_total: FPGA's total resources
Output:  p: optimal breakdown of resources for DQ-array
         schedule: schedule of operations for the optimized p
arg min: cycles_total: the total execution cycles for one training iteration

Function AnalyzeDFG(D)
    // Static analysis
    // Number of operations in forward/backward
    ops_fwd, ops_bwd <- D
    // Tuple of precision per layer
    precision_fwd <- ops_fwd
    precision_bwd <- ops_bwd
    // LUT/DSP resources for forward/backward operations
    resource_fwd <- precision_fwd
    resource_bwd <- precision_bwd
    // Runtime analysis
    nz_fwd, nz_bwd <- execute(D)
    // Obtain c and r
    c <- (ops_bwd × nz_bwd) / (ops_fwd × nz_fwd)
    r <- resource_bwd / resource_fwd − 1
    return c, r
end

Function Partition(c, r)
    return p = −(c + 1)/(c × r − 1) + sqrt((c + 1)^2/(c × r − 1)^2 − (c − r)/(r × (c × r − 1)))
end

Function Schedule(D, p)
    // Schedule and estimate cycles for forward/backward phase
    cycles_fwd, cycles_bwd <- Model(D, p)
    return cycles_fwd, cycles_bwd
end

// Initialize
c, r <- AnalyzeDFG(D)
cycles_fwd, cycles_bwd <- Schedule(D, p)
c_next <- cycles_bwd / cycles_fwd
// Refine the partitioning
Initialize c_next <- −infinity
do
    p <- Partition(c, r)
    cycles_fwd, cycles_bwd <- Schedule(D, p)
    c_next <- cycles_bwd / cycles_fwd
while |c − c_next| > epsilon
Benchmark | Quantization | Dataset | Batch Size | # of Ops (per-batch) | # of parameters
AlexNet-D | DoReFa-Net | ImageNet | 128 | 8,256 Mops | 62M
SVHN-D | DoReFa-Net | SVHN | 128 | 342 Mops | 62M
AlexNet-Q | QNN | ImageNet | 512 | 2,067 Mops | 50M
Cifar-10-Q | QNN | Cifar-10 | 50 | 1,844 Mops | 12M
SVHN-Q | QNN | SVHN | 200 | 469 Mops | 5M
GoogleNet-Q | QNN | ImageNet | 64 | 4,777 Mops | 56M
AlexNet-W | WRPN | ImageNet | 54 | 31,503 Mops | 108M
ResNet-W | WRPN | ImageNet | 64 | 12,025 Mops | 23M
- Table I shows the evaluated benchmarks, their datasets, number of operations, model size, and final accuracy. The postfixes -W, -Q, and -D refer to quantization techniques proposed by different prior approaches that use uniform quantization with a fixed-point representation for activations and weights but use different quantization strategies for gradients. For gradients, DoReFa-Net uses fixed-point quantization with added gaussian noise, QNN uses logarithmic quantization using a power-of-2 data representation, and WRPN uses floating-point. Benchmarks ResNet-34-W, GoogleNet-Q, AlexNet-Q, AlexNet-W, and AlexNet-D are image classification models trained on the
Imagenet 2012 dataset. Benchmarks SVHN-D and SVHN-Q are optical character recognition models based on the SVHN dataset. Unlike inference, the quality of the trained model depends significantly on the batch size. Therefore, the same batch sizes reported in these prior approaches are used for both the GPU and the heterogenous architecture of the present design. Furthermore, the three benchmarks use stochastic noise to speed up convergence. Across all the benchmarks, both performance and power consumption are measured on a FPGA platform and a GPU platform for 10,000 training iterations, and the average is presented. For both the GPU and FPGA implementations, the host CPU is used as the parameter server.
-
FIGS. 11 and 12 illustrate performance of the GPU platform in comparison to different variations of the present design as implemented in the FPGA platform. - The present design provides an alternative solution for GPUs, by leveraging the inherent characteristic of quantized deep learning and introducing heterogeneous accelerator architecture for FPGAs. As such, this design exists at the intersection of (a) quantization for deep learning, (b) acceleration for quantized deep learning, (c) acceleration for ML training, (d) heterogeneous architecture, and (e) exploitation of sparsity in deep learning.
-
FIG. 13 illustrates the schematic diagram of data processing system 1300 according to an embodiment of the present invention. Data processing system 1300 includes I/O processing unit 1310 and general purpose instruction-based processor 1320. In an embodiment, general purpose instruction-based processor 1320 may include a general purpose core or multiple general purpose cores. A general purpose core is not tied to or integrated with any particular algorithm. In an alternative embodiment, general purpose instruction-based processor 1320 may be a specialized core. I/O processing unit 1310 may include an accelerator 1311 (e.g., in-line accelerator, offload accelerator for offloading processing from another computing resource, accelerator with heterogenous architecture for DNN training, etc.) for implementing embodiments as described herein. In-line accelerators are a special class of accelerators that may be used for I/O intensive applications. Accelerator 1311 and the general purpose instruction-based processor may or may not be on the same chip. Accelerator 1311 is coupled to I/O interface 1312. Considering the type of input interface or input data, in one embodiment, the accelerator 1311 may receive any type of network packets from a network 1330 and an input network interface card (NIC). In another embodiment, the accelerator may receive raw images or videos from input cameras. In an embodiment, accelerator 1311 may also receive voice data from an input voice sensor device. - In an embodiment,
accelerator 1311 partially performs the computation associated with the input data elements and transfers the control to other accelerators or the main general purpose instruction-based processor in the system to complete the processing. The term “computation” as used herein may refer to any computer task processing including, but not limited to, any of arithmetic/logic operations, memory operations, I/O operations, and offloading part of the computation to other elements of the system such as general purpose instruction-based processors and accelerators. Accelerator 1311 may transfer the control to general purpose instruction-based processor 1320 to complete the computation. - In an embodiment,
accelerator 1311 may be implemented using any device known to be used as an accelerator, including but not limited to a field-programmable gate array (FPGA), Coarse-Grained Reconfigurable Architecture (CGRA), general-purpose computing on graphics processing unit (GPGPU), many light-weight cores (MLWC), network general purpose instruction-based processor, I/O general purpose instruction-based processor, and application-specific integrated circuit (ASIC). In an embodiment, I/O interface 1312 may provide connectivity to other interfaces that may be used in networks, storages, cameras, or other user interface devices. I/O interface 1312 may include receive first in first out (FIFO) storage 1313 and transmit FIFO storage 1314. In an embodiment, incoming packets are received through receive FIFO storage 1313, and the generated packets are sent over the network by the accelerator and/or general purpose instruction-based processor through transmit FIFO storage 1314. - In an embodiment, I/
O processing unit 1310 may be a Network Interface Card (NIC). In an embodiment of the invention, accelerator 1311 is part of the NIC. In an embodiment, the NIC is on the same chip as general purpose instruction-based processor 1320. In an alternative embodiment, the NIC 1310 is on a separate chip coupled to general purpose instruction-based processor 1320. In an embodiment, the NIC-based accelerator receives an incoming packet, as input data elements, through I/O interface 1312, processes the packet, and generates the response packet(s) without involving general purpose instruction-based processor 1320. Only when accelerator 1311 cannot handle the input packet by itself is the packet transferred to general purpose instruction-based processor 1320. In an embodiment, accelerator 1311 communicates with other I/O interfaces, for example, storage elements, through direct memory access (DMA) to retrieve data without involving general purpose instruction-based processor 1320.
Accelerator 1311 and the general purpose instruction-based processor 1320 are coupled to shared memory 1343 through private cache memories. In an embodiment, shared memory 1343 is a coherent memory system. The coherent memory system may be implemented as a shared cache. In an embodiment, the coherent memory system is implemented using multiple caches with a coherency protocol in front of a higher capacity memory such as a DRAM. - In an embodiment, the transfer of data between different layers of accelerations may be done through dedicated channels directly between
accelerator 1311 andprocessor 1320. In an embodiment, when the execution exits the last acceleration layer byaccelerator 1311, the control will be transferred to the general-purpose core 1320. - Processing data by forming two paths of computations on accelerators and general purpose instruction-based processors (or multiple paths of computation when there are multiple acceleration layers) have many other applications apart from low-level network applications. For example, most emerging big-data applications in data centers have been moving toward scale-out architectures, a technology for scaling the processing power, memory capacity and bandwidth, as well as persistent storage capacity and bandwidth. These scale-out architectures are highly network-intensive. Therefore, they can benefit from acceleration. These applications, however, have a dynamic nature requiring frequent changes and modifications. Therefore, it is highly beneficial to automate the process of splitting an application into a fast-path that can be executed by an accelerator with subgraph templates and a slow-path that can be executed by a general purpose instruction-based processor as disclosed herein.
- While embodiments of the invention are shown as two accelerated and general-purpose layers throughout this document, it is appreciated by one skilled in the art that the invention can be implemented to include multiple layers of computation with different levels of acceleration and generality. For example, a FPGA accelerator can backed by a many-core hardware. In an embodiment, the many-core hardware can be backed by a general purpose instruction-based processor.
- Referring to
FIG. 14, in an embodiment of the invention, a multi-layer system 1050 is formed by a first accelerator 1011 1 (e.g., in-line accelerator, offload accelerator for offloading processing from another computing resource, accelerator with heterogenous architecture for DNN training, or both) and several other accelerators 1011 n (e.g., in-line accelerator, offload accelerator for offloading processing from another computing resource, accelerator with heterogenous architecture for DNN training, or both). The multi-layer system 1050 includes several accelerators, each performing a particular level of acceleration. In such a system, execution may begin at a first layer by the first accelerator 1011 1. Then, each subsequent layer of acceleration is invoked when the execution exits the layer before it. For example, if the accelerator 1011 1 cannot finish the processing of the input data, the input data and the execution will be transferred to the next acceleration layer, accelerator 1011 2. In an embodiment, the transfer of data between different layers of accelerations may be done through dedicated channels between layers (e.g., 1071 1 to 1071 n). In an embodiment, when the execution exits the last acceleration layer by accelerator 1011 n, the control will be transferred to the general-purpose core 1090.
FIG. 15 is a diagram of a computer system including a data processing system that utilizes an accelerator according to an embodiment of the invention. Within thecomputer system 1200 is a set of instructions for causing the machine to perform any one or more of the methodologies discussed herein including accelerating machine learning operations. In alternative embodiments, the machine may be connected (e.g., networked) to other machines in a LAN, an intranet, an extranet, or the Internet. The machine can operate in the capacity of a server or a client in a client-server network environment, or as a peer machine in a peer-to-peer (or distributed) network environment, the machine can also operate in the capacity of a web appliance, a server, a network router, switch or bridge, event producer, distributed node, centralized system, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine. Further, while only a single machine is illustrated, the term “machine” shall also be taken to include any collection of machines (e.g., computers) that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein. -
Data processing system 1202, as disclosed above, includes a general purpose instruction-basedprocessor 1227 and an accelerator 1226 (e.g., in-line accelerator, offload accelerator for offloading processing from another computing resource, accelerator with heterogenous architecture for DNN training, etc.). The general purpose instruction-based processor may be one or more general purpose instruction-based processors or processing devices (e.g., microprocessor, central processing unit, or the like). More particularly,data processing system 1202 may be a complex instruction set computing (CISC) microprocessor, reduced instruction set computing (RISC) microprocessor, very long instruction word (VLIW) microprocessor, general purpose instruction-based processor implementing other instruction sets, or general purpose instruction-based processors implementing a combination of instruction sets. The accelerator may be one or more special-purpose processing devices such as an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal general purpose instruction-based processor (DSP), network general purpose instruction-based processor, many light-weight cores (MLWC) or the like.Data processing system 1202 is configured to implement the data processing system for performing the operations and steps discussed herein. - The
exemplary computer system 1200 includes adata processing system 1202, a main memory 1204 (e.g., read-only memory (ROM), flash memory, dynamic random access memory (DRAM) such as synchronous DRAM (SDRAM) or DRAM (RDRAM), etc.), a static memory 1206 (e.g., flash memory, static random access memory (SRAM), etc.), and a data storage device 1216 (e.g., a secondary memory unit in the form of a drive unit, which may include fixed or removable computer-readable storage medium), which communicate with each other via abus 1208. The storage units disclosed incomputer system 1200 may be configured to implement the data storing mechanisms for performing the operations and steps discussed herein.Memory 1206 can store code and/or data for use byprocessor 1227 oraccelerator 1226.Memory 1206 include a memory hierarchy that can be implemented using any combination of RAM (e.g., SRAM, DRAM, DDRAM), ROM, FLASH, magnetic and/or optical storage devices. Memory may also include a transmission medium for carrying information-bearing signals indicative of computer instructions or data (with or without a carrier wave upon which the signals are modulated). -
Processor 1227 andaccelerator 1226 execute various software components stored inmemory 1204 to perform various functions forsystem 1200. Furthermore,memory 1206 may store additional modules and data structures not described above. -
Operating system 1205 a includes various procedures, sets of instructions, software components and/or drivers for controlling and managing general system tasks and facilitates communication between various hardware and software components. A compiler is a computer program (or set of programs) that transform source code written in a programming language into another computer language (e.g., target language, object code). Acommunication module 1205 c provides communication with other devices utilizing thenetwork interface device 1222 orRF transceiver 1224. - The
computer system 1200 may further include anetwork interface device 1222. In an alternative embodiment, the data processing system disclose is integrated into thenetwork interface device 1222 as disclosed herein. Thecomputer system 1200 also may include a video display unit 1210 (e.g., a liquid crystal display (LCD), LED, or a cathode ray tube (CRT)) connected to the computer system through a graphics port and graphics chipset, an input device 1212 (e.g., a keyboard, a mouse), acamera 1214, and a Graphic User Interface (GUI) device 1220 (e.g., a touch-screen with input & output functionality). - The
computer system 1200 may further include an RF transceiver 1224 that provides frequency shifting, converting received RF signals to baseband and converting baseband transmit signals to RF. In some descriptions, a radio transceiver or RF transceiver may be understood to include other signal processing functionality such as modulation/demodulation, coding/decoding, interleaving/de-interleaving, spreading/despreading, inverse fast Fourier transforming (IFFT)/fast Fourier transforming (FFT), cyclic prefix appending/removal, and other signal processing functions. - The
Data Storage Device 1216 may include a machine-readable storage medium (or more specifically a computer-readable storage medium) on which is stored one or more sets of instructions embodying any one or more of the methodologies or functions described herein. Disclosed data storing mechanism may be implemented, completely or at least partially, within themain memory 1204 and/or within thedata processing system 1202 by thecomputer system 1200, themain memory 1204 and thedata processing system 1202 also constituting machine-readable storage media. - In one example, the
computer system 1200 is an autonomous vehicle that may be connected (e.g., networked) to other machines or other autonomous vehicles in a LAN, WAN, or any network. The autonomous vehicle can be a distributed system that includes many computers networked within the vehicle. The autonomous vehicle can transmit communications (e.g., across the Internet, any wireless communication) to indicate current conditions (e.g., an alarm collision condition indicates close proximity to another vehicle or object, a collision condition indicates that a collision has occurred with another vehicle or object, etc.). The autonomous vehicle can operate in the capacity of a server or a client in a client-server network environment, or as a peer machine in a peer-to-peer (or distributed) network environment. The storage units disclosed incomputer system 1200 may be configured to implement data storing mechanisms for performing the operations of autonomous vehicles. - The
computer system 1200 also includessensor system 1214 and mechanical control systems 1207 (e.g., motors, driving wheel control, brake control, throttle control, etc.). Theprocessing system 1202 executes software instructions to perform different features and functionality (e.g., driving decisions) and provide agraphical user interface 1220 for an occupant of the vehicle. Theprocessing system 1202 performs the different features and functionality for autonomous operation of the vehicle based at least partially on receiving input from thesensor system 1214 that includes laser sensors, cameras, radar, GPS, and additional sensors. Theprocessing system 1202 may be an electronic control unit for the vehicle. -
FIG. 16 shows the details of the specialized circuit 1700 for accelerating neural networks in prior art. The specialized circuit in FIG. 16 includes one or more circuits 1701 that are specialized for one or more computations in neural networks. For example, the systolic array circuit 1701 shown in FIG. 16 is specialized for the convolution and matrix-multiplication operations in neural networks. - The
systolic array circuit 1701 further includes (1) a plurality of CUs 1800 that perform the operations in the plurality of layers in neural network training and inference, (2) a buffer IBUF 1704 to store inputs, and (3) a buffer OBUF 1705 to store intermediate results when executing the operations for the multi-dimensional arrays of data in the neural network. - The
CUs 1800 in FIG. 16 are organized as a 2-dimensional grid, with a plurality of rows 1702 and a plurality of columns 1703. - The
buffer IBUF 1704 feeds data to the CUs 1800 on the first column (the left-most column), as shown in FIG. 16. The results from the CUs 1800 in each column of the systolic array are accumulated in an accumulator circuit and then stored in the OBUF 1705. - Each
CU 1800 can perform multiply-add operations, with one operand from the CU 1800 in the previous column (the CU 1800 on the left) and one operand from the CU 1800's private buffer called WBUF 1803, to generate a product. The product is then added with the result from the CU 1800 in the previous row (the CU 1800 on the top) and sent to the CU 1800 in the next row (the CU 1800 on the bottom). Thus, data is shared between CUs 1800 in a row, and the outputs from CUs 1800 in a column are accumulated and sent downwards.
FIG. 17 shows the details of the CU 1800 in the systolic array circuit. The CU 1800 multiplies one value from either the previous CU 1800 (the CU 1800 on the left) or from the IBUF 1704, with one value from the CU 1800's private buffer WBUF 1803. The resulting product is added with the results from the previous CU 1800. The resulting sum is then forwarded to the CU 1800 in the next row (the CU 1800 on the bottom). In some implementations, the accumulator may perform additional operations (max, min, multiplication, etc.) required for different layers of the neural network (like pooling, activation, etc.). - When the size of the data required for the inputs and the outputs for layers of the neural network is larger than the capacity of the on-chip buffers (
IBUF 1704, WBUF 1803, and OBUF 1705), then the data is divided into portions such that the size of each portion does not exceed the capacity of the on-chip buffers. - One notable feature of hardware acceleration circuits in prior art is that the precision for multidimensional array inputs for the layers of neural networks is the same. Consequently, the width of the
IBUF 1704 and WBUF 1803 buffers are sized according to the precision of the operands supported by the circuit, and the width of the OBUF 1705 is sized according to the precision for the intermediate data. Similarly, the CUs 1800 in the systolic array are designed for the precision supported by the circuit. - To show the execution of quantized neural network training using the
specialized circuit 1700, we use a simple single-layer neural network shown in FIG. 18. FIG. 18 shows the operations in the forward propagation 1940 and backward propagation 1950 phases for a single convolution layer 1900 for neural networks. - The
CONV F 1910 operation in the forward propagation phase consumes two multidimensional arrays, one for the inputs 1905 and another for the weights 1905. Similarly, the CONV B 1931 and CONV W 1930 operations in the backward propagation phase of neural network training accept two inputs: weights 1955 and gradients 1952 for CONV B 1931; and inputs 1954 and gradients 1952 for CONV W 1930. - For quantized neural network training, the
inputs 1901 and weights 1902 use a low-precision data representation, while the gradients use a high-precision data representation. The multidimensional arrays of inputs 1901 and weights 1902 are therefore first converted to a high precision that is supported by the circuit, using operations 1920, 1922, 1921, and 1921, to produce high-precision multidimensional arrays. - The specialized circuits described in prior art use the same precision for the different multidimensional arrays of inputs, weights, and gradients for neural network training. Thus, the circuits in prior art either support just high precision (e.g., half-precision, single-precision, and double-precision floating-point, etc.) for all data types and introduce additional data type conversion operations, like
operations - To overcome these problems, this specification describes a novel heterogeneous-
precision circuit 2000 in FIG. 19, which is a specialized circuit for accelerating neural network training and inference. The circuit 2000 can operate on heterogeneous precision data types for the inputs, weights, and gradients in neural networks. The specialized circuit 2000 includes one or more instances of two types of sub-circuits: (1) a quantized circuit called Q-array 2010, responsible for the operations in the forward propagation for neural network training, and (2) a mixed-precision circuit called MP-array 2020 that uses asymmetric precision for the backward propagation operations in neural network training: a floating-point representation for the gradients and a quantized representation for the inputs and weights.
circuit 2000, the Q-array 2010 includes a plurality of CUs 2100, and the MP-array 2020 includes a plurality of CUs 2600, with the CUs 2100 and CUs 2600 organized as a 2-dimensional grid to form systolic arrays. - The
circuit 2000 described in the specification does not require additional data type conversions and can directly operate on both low-precision inputs and weights, and high-precision gradients. -
FIG. 21 shows the operations for a single layer of a quantized neural network using the circuit 2000 described in this specification. The CONV F 2210 operation in the forward propagation phase can be directly executed with low-precision inputs 1901 and weights 1902. Similarly, the CONV B 2231 and CONV W 2230 operations in the backward propagation phase of quantized neural network training can be directly executed with high-precision gradients and low-precision inputs and weights. - Specifically, the
operations Neural Network 1900 are no longer required forNeural Network 2200 inFIG. 21 to convert the inputs and weights to a high-precision representation. - Q-
array 2010 contains a 2-dimensional grid of quantizedCUs 2100, that support quantized inputs and weights for the forward propagation operations for quantized neural network training. - Similar to circuit described in
prior art 1700, theIBUF 2011 buffer in Q-array 2010 stores the multidimensional arrays of inputs and stores the multidimensional arrays of weights in theWBUF 2103 that is private for eachCU 2100. - Unlike the circuit in
prior art 1700, the proposed circuit for Q-array 2010 stores both the inputs and the weights in low-precision. Thus, Q-array 2010 does not require additional data type conversion for both inputs and weights. - In one implementation, the precision for inputs and weights in the Q-
array 2010 is fixed for all forward propagation operations, and is the same for both inputs and weights. In another implementation, the precision for inputs and weights in the Q-array 2010 is fixed for all forward propagation operations, but can be different for the inputs and weights. In one implementation, the precision for operands in the Q-array 2010 can be varied at run-time to support different precisions for the inputs and weights across different forward propagation operations. - The width of the
IBUF 2011 and theWBUF 2103 buffers are sized according to the precision or set of different precisions supported by the Q-array 2010. - The circuit in
FIG. 19 shows one embodiment of this specification, in which the inputs stored in the IBUF 2011 use an 8-bit fixed-point data representation and the weights stored in the WBUF 2103 use a 4-bit fixed-point data representation. - A
central control logic 2023 for the Q-array 2010 generates the address for the WBUF 2103 that is private to each CU 2100. The plurality of CUs 2100 in the Q-array 2010 can perform multiply-add operations, wherein a CU 2100 performs a multiplication 2105 between a single 8-bit precision input, which is supplied by the IBUF 2011 and shared by all CUs 2100 in a row, and a single 4-bit precision weight 2104 supplied by the WBUF private to that CU 2100 according to the address generated by 2023. The results from the multiplications in the CUs 2100 are added and accumulated across the CUs 2100 in a column of the Q-array 2010 through an adder. The precision of adder 2107 is set according to the highest precision supported by the IBUF 2011 and WBUF 2103 to avoid overflows/underflows.
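- The arithmetic performed by one Q-array column can be illustrated with the short sketch below (Python, for illustration only; unsigned fixed-point operands are assumed purely to keep the example small, and the function name is invented for this example):

```python
# Behavioral model of one Q-array column in the 8-bit-input x 4-bit-weight embodiment:
# each CU multiplies the input shared along its row by a weight from its private WBUF,
# and the partial products are accumulated down the column in a wider accumulator.
def q_array_column(inputs_8bit, weights_4bit):
    acc = 0
    for a, w in zip(inputs_8bit, weights_4bit):
        assert 0 <= a < 256 and 0 <= w < 16   # 8-bit and 4-bit unsigned ranges
        acc += a * w                          # 12-bit partial products, wider running sum
    return acc                                # higher-precision result destined for the OBUF

print(q_array_column([200, 17, 3], [15, 4, 9]))   # 200*15 + 17*4 + 3*9 = 3095
```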
- The accumulated results at the bottom of each column of Q-array 2010 require higher precision (e.g., half, single, or double floating-point precision, or fixed-point precision with a greater number of bits compared to the inputs and weights).
OBUF 2012, according to the address generated by control logic 2013. The Q-array 2010 can either write back the higher precision accumulated results from OBUF 2012 to the next level of memory or quantize the results to a lower precision fixed-point representation for use by the next forward propagation or backward propagation operation. - Alternatively, the
OBUF 2012 can store only low-precision data, which can reduce the size of the OBUF 2012 but may introduce some error in the results.
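- A possible write-back path is sketched below (Python, illustrative only; the scale factor, rounding, and saturation policy are assumptions for this example rather than requirements of the circuit): the wide accumulated result is rescaled and saturated to a low-precision fixed-point value before being reused by the next operation.

```python
# Quantize a wide accumulator value to an n-bit fixed-point result with saturation.
def quantize_to_fixed_point(acc, scale, bits=8, signed=False):
    q = round(acc / scale)
    lo, hi = (-(1 << (bits - 1)), (1 << (bits - 1)) - 1) if signed else (0, (1 << bits) - 1)
    return max(lo, min(hi, q))   # saturate instead of wrapping on overflow

print(quantize_to_fixed_point(3095, scale=16))   # 193 when targeting 8-bit unsigned
```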
- The MP-array 2020 contains a 2-dimensional grid of CUs 2600 that are responsible for the backward propagation operations for quantized neural network training and can operate on data with mixed precision: high-precision gradients, and low-precision inputs and weights. - The input gradients are stored in the
IBUF 2021 for both CONV W 2230 and CONV B 2231 operations. The IBUF 2021 is coupled with an additional zero-skipping logic 2300, which enables the MP-array 2020 circuit to skip over zero-valued gradients in the IBUF 2021.
FIG. 22 describes the zero-skipping logic 2300, which reads an 8-wide vector of data 2301 from the IBUF and selects one non-zero value 2302 from the 8-wide vector of data in each cycle for each row of CUs 2600. A log2(8)-bit, or 3-bit, non-zero index 2303 marks the position of the non-zero value 2302 selected from the N-wide vector of data 2301. The width of buffer IBUF 2021 is set to 8× the gradient precision for each row of CUs 2600 in the MP-array 2020 in order to supply data to the logic 2300. In general, the zero-skipping logic 2300 can be extended to read any N-wide vector of data to produce a log2(N)-bit index. The circuit 2300 in FIG. 22 is replicated for each row of the 2-dimensional array of CUs 2600 in the MP-array 2020.
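- The behavior of the zero-skipping logic 2300 for one row of CUs 2600 can be modeled in software as shown below (Python, illustrative only; the handling of all-zero vectors and of an exhausted buffer are simplifications made for this sketch):

```python
# Cycle-by-cycle software model: each call consumes at most one non-zero entry from the
# current 8-wide vector and returns (non_zero_value, log2(8)-bit index); a fresh vector is
# fetched once the current vector's non-zeros are exhausted.
class ZeroSkip:
    def __init__(self, ibuf_vectors, n=8):
        self.vectors = iter(ibuf_vectors)
        self.n = n
        self.pending = []            # (value, index) pairs left in the current vector

    def next_nonzero(self):
        while not self.pending:
            vec = next(self.vectors, None)
            if vec is None:
                return None          # IBUF drained
            self.pending = [(v, i) for i, v in enumerate(vec) if v != 0]
        return self.pending.pop(0)   # value plus its 3-bit offset

zs = ZeroSkip([[0, 0, 5, 0, 0, 0, 0, 2], [0] * 8, [7, 0, 0, 0, 0, 0, 0, 0]])
print(zs.next_nonzero(), zs.next_nonzero(), zs.next_nonzero())   # (5, 2) (2, 7) (7, 0)
```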
- The non-zero value 2302 and the associated 3-bit non-zero index 2303 are then sent to all the CUs 2600 in a row of the 2-dimensional grid of CUs 2600 in the MP-array 2020. Once all the non-zero entries from the 8-wide vector have been selected, the next 8-wide vector of data is read from the IBUF. - The results from all the CUs 2600 in a column of the MP-
array 2020 need to be accumulated to produce an output. This introduces a dependency between different rows of the MP-array 2020, where all the rows have to finish processing all the non-zero values 2302 corresponding to one output value before proceeding with the next output value. The inefficiency introduced by this dependency can be large in the case where a large majority of the gradients are zero. - To overcome this limitation, the
non-zero value 2302 and non-zero index 2303 in the proposed circuit 2020 are appended with a desynchronization-tag (non-zero d-tag 2304), which specifies the output address and is generated by the MP-array control logic 2023. The non-zero d-tag 2304 allows the CUs 2600 across different rows to operate on the non-zero value 2302 and non-zero index 2303 for different outputs. - The
non-zero value 2302, the 3-bit non-zero index 2303, and the non-zero d-tag 2304 are then shared across all CUs 2600 in a row of the MP-array 2020.
FIG. 23 shows the circuit for CU 2600 that can skip zero-valued gradients. To access the WBUF 2620, a base address generated by control logic 2023 is combined with the non-zero index 2303, which is a part of the incoming data, to generate the read address for WBUF 2620. - The
multiplier 2607 then generates a product 2608 using the non-zero data 2302 and the data 2602 from WBUF 2620. - In one implementation, the floating-point data
non-zero data 2302 is first converted to a 2's complement form by combining the sign and mantissa bits. A 2's complement multiplier 2607 can then be used to perform the multiplication. - To support both 8-bit precision for inputs and 4-bit precision for weights stored in the
private WBUF 2620, ashifter 2615 is used to left-shift the results of the multiplier. When operating on weights or the least significant 4-bits of the inputs, the shift amount is zero. When operating on the most significant 4-bits of the inputs stored inWBUF 2620, the shift amount is 4 bits to the left. - This approach to support multiple different precisions for operands in Neural Network training and inference can be generalized to support any precision by choosing the appropriate number of bits for the intermediate data at the output of the
shifter 2608 and the appropriate shift-amounts. - The results from multiplication and shifting 2608 is then added with the results from the
previous CU 2600 using anaccumulation logic 2612. -
FIG. 24 shows the circuit 2612 for accumulating the multiplication results across different CUs 2600. The non-zero d-tag is used to associate the multiplication results with the output that they correspond to. The accumulator logic 2612 includes multiple lanes 2700, where the different lanes allow the CUs 2600 in different rows to work on different outputs.
FIG. 25 shows the details of the accumulation logic per lane 2800 of the accumulation logic 2612. Using muxes 2802 and 2803: (1) when the non-zero d-tag 2811 for the multiplication result in the current row matches the non-zero d-tag 2810 for the incoming data, as determined by comparator 2801, and the incoming data is valid 2830 for a lane, the results are added and sent to the next row 2804; (2) otherwise, when the incoming data is valid, the data 2820 and d-tag 2810 from the previous row are sent to the output 2804; or finally (3) the multiplication result (2831, 2811, 2821) for the CU 2600 in the current row is sent directly to the next row 2804. - In another embodiment of this work, both the Q-array 2010 and the MP-
array 2010 and the MP-array 2020 blocks of the architecture proposed in this specification may use zero-skipping logic, as shown inFIG. 26 . - In another embodiment of this work, multiple MP-
array 2020 blocks without any Q-array 2010 blocks may be employed to accelerate Neural Network training and inference, as shown inFIG. 27 . - In another embodiment of this work, both the Q-
array 2010 and the MP-array 2020 blocks of the architecture proposed in this specification may use fixed-point representation, with a greater number of bits for the gradients in the MP-array 2020. - In another embodiment of this work, the gradients for the MP-
array 2020 may use other data types including logarithmic or power-of-2 data representations. - The above description of illustrated implementations of the invention, including what is described in the Abstract, is not intended to be exhaustive or to limit the invention to the precise forms disclosed. While specific implementations of, and examples for, the invention are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the invention, as those skilled in the relevant art will recognize.
- These modifications may be made to the invention in light of the above detailed description. The terms used in the following claims should not be construed to limit the invention to the specific implementations disclosed in the specification and the claims. Rather, the scope of the invention is to be determined entirely by the following claims, which are to be construed in accordance with established doctrines of claim interpretation.
Claims (26)
1. A hardware accelerator for training quantized data, comprising:
software controllable multilevel memory to store data; and
a mixed precision array coupled to the memory, the mixed precision array includes an input buffer, detect logic to detect zero value operands, and a plurality of heterogenous precision compute units to perform computations of mixed precision data types for a backward propagation phase of training quantized data of a neural network.
2. The hardware accelerator of claim 1 , wherein the mixed precision array utilizes low-overhead desynchronized encoding for skipping zero value operands.
3. The hardware accelerator of claim 1 , wherein the detect logic comprises a multi-lane adder logic with desynchronized encoding and uses a desynchronization tag to remove synchronization between rows of the mixed precision array.
4. The hardware accelerator of claim 3 , wherein the detect logic is configured to encode non-zero value operands as value, offset, and desynchronization tag to specify an identification (ID) of a sparse-vector that operates on each row.
5. The hardware accelerator of claim 4 , wherein the multi-lane adder logic uses two tag-lanes within each column, the compute units in each column share tag-lanes, and within each column, compute units forward their results to one of the tag-lanes using the least significant bit (LSB) of the desynchronization tag.
6. The hardware accelerator of claim 5 , wherein each lane includes select logic to determine whether the tag for a current row matches a previous row's tag for either odd or even tag-lanes, the values are added together and forwarded to the next row when the tags match, and results are stored locally when the tags do not match.
7. The hardware accelerator of claim 6 , wherein the detect logic includes zero value operand detector logic and non-zero selector.
8. The hardware accelerator of claim 7 , wherein the zero value operand detector logic includes comparators to generate a bit-vector that corresponds to using a single bit for each bit of a sub-vector.
9. The hardware accelerator of claim 8 , wherein each bit in the bit-vector specifies if a corresponding value in the sub-vector is a zero value or a non-zero value.
10. The hardware accelerator of claim 9 , wherein when all bits in the bit-vector are zero values then the sub-vector is skipped entirely, wherein if at least one bit in the bit-vector is non-zero value then the sub-vector is pushed to a FIFO queue, along with its bit-vector and a desynchronization tag for identifying an input ID.
11. The hardware accelerator of claim 10 , wherein the non-zero selector is configured to cause the FIFO queue to dequeue at least one sub-vector and to read only those sub-vectors that have at least one non-zero value.
12. A data processing system comprising:
a hardware processor;
memory; and
a hardware accelerator coupled to the memory, the hardware accelerator includes a mixed precision array having an input buffer, detect logic to detect zero value operands, and a plurality of heterogenous precision compute units to perform computations of mixed precision data types for a backward propagation phase of training quantized data of a neural network.
13. The data processing system of claim 12 , wherein the mixed precision array utilizes low-overhead desynchronized encoding for skipping zero value operands.
14. The data processing system of claim 13 , wherein the detect logic comprises a multi-lane adder logic with desynchronized encoding and uses a desynchronization tag to remove synchronization between rows of the mixed precision array.
15. The data processing system of claim 14 , wherein the detect logic is configured to encode non-zero value operands as value, offset, and desynchronization tag to specify an identification (ID) of a sparse-vector that operates on each row.
16. The data processing system of claim 15 , wherein the multi-lane adder logic uses two tag-lanes within each column, the compute units in each column share tag-lanes, and within each column, compute units forward their results to one of the tag-lanes using the least significant bit (LSB) of the desynchronization tag.
17. The data processing system of claim 16 , wherein each lane includes select logic to determine whether the tag for a current row matches a previous row's tag for either odd or even tag-lanes, the values are added together and forwarded to the next row when the tags match, and results are stored locally when the tags do not match.
18. The data processing system of claim 17 , wherein the detect logic includes zero value operand detector logic and non-zero selector.
19. A computer implemented method for quantized neural network training comprising:
storing data in a software controllable multilevel memory;
receiving data for training with a mixed precision array;
detecting zero value operands with detect logic of the mixed precision array; and
performing, with a plurality of heterogenous precision compute units, computations of mixed precision data types for a backward propagation phase of training quantized data of a neural network.
20. The computer implemented method of claim 19 , wherein the mixed precision array utilizes low-overhead desynchronized encoding for skipping zero value operands.
21. The computer implemented method of claim 19 , wherein the detect logic comprises a multi-lane adder logic with desynchronized encoding and uses a desynchronization tag to remove synchronization between rows of the mixed precision array.
22. The computer implemented method of claim 21 , further comprising:
encoding, with the detect logic, non-zero value operands as value, offset, and desynchronization tag to specify an identification (ID) of a sparse-vector that operates on each row.
23. The computer implemented method of claim 22 , further comprising:
within each column, forwarding, with the compute units, their results to one of the tag-lanes using a least significant bit (LSB) of the desynchronization tag.
24. The computer implemented method of claim 23 , further comprising:
determining, with select logic of each lane, whether the tag for a current row matches a previous row's tag for either odd or even tag-lanes;
adding values and forwarding to the next row when the tags match; and
results are stored locally when the tags do not match.
25. The computer implemented method of claim 24 , wherein the detect logic includes zero value operand detector logic and non-zero selector.
26. The computer implemented method of claim 25 , further comprising:
generating, with the zero value operand detector logic, a bit-vector that uses a single bit for each value of a sub-vector, wherein each bit in the bit-vector specifies whether a corresponding value in the sub-vector is a zero value or a non-zero value.
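To make the bit-vector of claim 26 concrete, here is a small, hedged Python sketch (the function name, the list-based representation, and the choice of 1 for non-zero are assumptions): it emits one bit per value of the sub-vector, flagging which operands the non-zero selector should keep and which can be skipped.

```python
def zero_detect_bitvector(sub_vector):
    """Generate a bit-vector with a single bit per value of the sub-vector.

    A 0 bit flags a zero-valued operand (skippable); a 1 bit flags a
    non-zero operand that should be passed on for encoding and computation.
    """
    return [0 if value == 0 else 1 for value in sub_vector]
```

For instance, a sub-vector [0, 3, 0, -1] would yield the bit-vector [0, 1, 0, 1], which a non-zero selector such as the one recited in claims 18 and 25 could then use to pick out the surviving operands.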
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/744,039 US20200226473A1 (en) | 2019-01-15 | 2020-01-15 | Systems, apparatus, methods, and architectures for heterogeneous precision acceleration of quantized neural networks |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201962792785P | 2019-01-15 | 2019-01-15 | |
US16/744,039 US20200226473A1 (en) | 2019-01-15 | 2020-01-15 | Systems, apparatus, methods, and architectures for heterogeneous precision acceleration of quantized neural networks |
Publications (1)
Publication Number | Publication Date |
---|---|
US20200226473A1 true US20200226473A1 (en) | 2020-07-16 |
Family
ID=71516690
Family Applications (3)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/744,037 Abandoned US20200226444A1 (en) | 2019-01-15 | 2020-01-15 | Systems, apparatus, methods, and architecture for precision heterogeneity in accelerating neural networks for inference and training |
US16/744,040 Active 2040-10-24 US11321606B2 (en) | 2019-01-15 | 2020-01-15 | Systems, apparatus, methods, and architectures for a neural network workflow to generate a hardware accelerator |
US16/744,039 Abandoned US20200226473A1 (en) | 2019-01-15 | 2020-01-15 | Systems, apparatus, methods, and architectures for heterogeneous precision acceleration of quantized neural networks |
Family Applications Before (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/744,037 Abandoned US20200226444A1 (en) | 2019-01-15 | 2020-01-15 | Systems, apparatus, methods, and architecture for precision heterogeneity in accelerating neural networks for inference and training |
US16/744,040 Active 2040-10-24 US11321606B2 (en) | 2019-01-15 | 2020-01-15 | Systems, apparatus, methods, and architectures for a neural network workflow to generate a hardware accelerator |
Country Status (1)
Country | Link |
---|---|
US (3) | US20200226444A1 (en) |
Families Citing this family (32)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11562248B2 (en) * | 2019-04-29 | 2023-01-24 | Advanced Micro Devices, Inc. | Data sparsity monitoring during neural network training |
US11645524B2 (en) * | 2019-05-10 | 2023-05-09 | Royal Bank Of Canada | System and method for machine learning architecture with privacy-preserving node embeddings |
CN111523642B (en) * | 2020-04-10 | 2023-03-28 | 星宸科技股份有限公司 | Data reuse method, operation method and device and chip for convolution operation |
US11240340B2 (en) * | 2020-05-12 | 2022-02-01 | International Business Machines Corporation | Optimized deployment of analytic models in an edge topology |
US11809908B2 (en) * | 2020-07-07 | 2023-11-07 | SambaNova Systems, Inc. | Runtime virtualization of reconfigurable data flow resources |
CN114004968A (en) * | 2020-07-28 | 2022-02-01 | 富泰华工业(深圳)有限公司 | Image processing method, image processing device, electronic equipment and storage medium |
CN112130812B (en) * | 2020-08-04 | 2022-04-15 | 中科天玑数据科技股份有限公司 | Analysis model construction method and system based on data stream mixed arrangement |
CN112214198A (en) * | 2020-10-22 | 2021-01-12 | 南京博芯电子技术有限公司 | Precision dynamic self-adaptive accumulation module for bit width incremental addition tree |
CN112214326B (en) * | 2020-10-22 | 2022-10-21 | 南京博芯电子技术有限公司 | Equalization operation acceleration method and system for sparse recurrent neural network |
US11861328B2 (en) | 2020-11-11 | 2024-01-02 | Samsung Electronics Co., Ltd. | Processor for fine-grain sparse integer and floating-point operations |
US11861327B2 (en) | 2020-11-11 | 2024-01-02 | Samsung Electronics Co., Ltd. | Processor for fine-grain sparse integer and floating-point operations |
TW202223629A (en) * | 2020-11-30 | 2022-06-16 | 財團法人工業技術研究院 | Verification system and verification method for neural network accelerator hardware |
US11182221B1 (en) | 2020-12-18 | 2021-11-23 | SambaNova Systems, Inc. | Inter-node buffer-based streaming for reconfigurable processor-as-a-service (RPaaS) |
US11237880B1 (en) | 2020-12-18 | 2022-02-01 | SambaNova Systems, Inc. | Dataflow all-reduce for reconfigurable processor systems |
US11392740B2 (en) * | 2020-12-18 | 2022-07-19 | SambaNova Systems, Inc. | Dataflow function offload to reconfigurable processors |
CN112712174B (en) * | 2020-12-31 | 2022-04-08 | 湖南师范大学 | Hardware accelerator, acceleration method and image classification method of full-frequency-domain convolutional neural network |
CN112749799B (en) * | 2020-12-31 | 2022-04-12 | 湖南师范大学 | Hardware accelerator, acceleration method and image classification method of full-frequency-domain convolutional neural network based on self-adaptive ReLU |
CN112906883B (en) * | 2021-02-04 | 2024-09-03 | 云从科技集团股份有限公司 | Hybrid precision quantization strategy determination method and system for deep neural network |
WO2022175494A1 (en) * | 2021-02-18 | 2022-08-25 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Apparatus, method and computer program for analyzing a sensor signal |
CN112906863B (en) * | 2021-02-19 | 2023-04-07 | 山东英信计算机技术有限公司 | Neuron acceleration processing method, device, equipment and readable storage medium |
US11782760B2 (en) | 2021-02-25 | 2023-10-10 | SambaNova Systems, Inc. | Time-multiplexed use of reconfigurable hardware |
US11200096B1 (en) | 2021-03-26 | 2021-12-14 | SambaNova Systems, Inc. | Resource allocation for reconfigurable processors |
US20220321403A1 (en) * | 2021-04-02 | 2022-10-06 | Nokia Solutions And Networks Oy | Programmable network segmentation for multi-tenant fpgas in cloud infrastructures |
US11921784B2 (en) * | 2021-05-13 | 2024-03-05 | Advanced Micro Devices, Inc. | Flexible, scalable graph-processing accelerator |
CN113392959B (en) * | 2021-06-03 | 2024-10-29 | 沐曦集成电路(上海)有限公司 | Method for reconstructing architecture in computing system and computing system |
US20230010897A1 (en) * | 2021-07-06 | 2023-01-12 | Google Llc | In situ sparse matrix expansion |
US11709611B2 (en) | 2021-10-26 | 2023-07-25 | SambaNova Systems, Inc. | Determining and using memory unit partitioning solutions for reconfigurable dataflow computing systems |
CN114237551B (en) * | 2021-11-26 | 2022-11-11 | 南方科技大学 | Multi-precision accelerator based on pulse array and data processing method thereof |
US20230305111A1 (en) * | 2022-03-23 | 2023-09-28 | Nxp B.V. | Direction of arrival (doa) estimation using circular convolutional network |
CN116301920B (en) * | 2023-03-23 | 2023-11-07 | 东北大学 | Compiling system for deploying CNN model to high-performance accelerator based on FPGA |
CN116107726B (en) * | 2023-04-13 | 2023-07-18 | 上海思尔芯技术股份有限公司 | FPGA resource scheduling method, device, equipment and storage medium |
CN117573607B (en) * | 2023-11-28 | 2024-08-13 | 北京智芯微电子科技有限公司 | Reconfigurable coprocessor, chip, multi-core signal processing system and computing method |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8706790B1 (en) | 2009-03-03 | 2014-04-22 | Altera Corporation | Implementing mixed-precision floating-point operations in a programmable integrated circuit device |
US20180143860A1 (en) * | 2016-11-22 | 2018-05-24 | Intel Corporation | Methods and apparatus for programmable integrated circuit coprocessor sector management |
US11144820B2 (en) * | 2017-02-28 | 2021-10-12 | Microsoft Technology Licensing, Llc | Hardware node with position-dependent memories for neural network processing |
CN109937416B (en) | 2017-05-17 | 2023-04-04 | 谷歌有限责任公司 | Low delay matrix multiplication component |
US10678509B1 (en) | 2018-08-21 | 2020-06-09 | Xilinx, Inc. | Software-driven design optimization for mapping between floating-point and fixed-point multiply accumulators |
US10831702B2 (en) | 2018-09-20 | 2020-11-10 | Ceva D.S.P. Ltd. | Efficient utilization of systolic arrays in computational processing |
WO2020110113A1 (en) * | 2018-11-27 | 2020-06-04 | Deep Ai Technologies Ltd. | Reconfigurable device based deep neural network system and method |
US11586883B2 (en) * | 2018-12-14 | 2023-02-21 | Microsoft Technology Licensing, Llc | Residual quantization for neural networks |
US11676003B2 (en) * | 2018-12-18 | 2023-06-13 | Microsoft Technology Licensing, Llc | Training neural network accelerators using mixed precision data formats |
2020
- 2020-01-15 US US16/744,037 patent/US20200226444A1/en not_active Abandoned
- 2020-01-15 US US16/744,040 patent/US11321606B2/en active Active
- 2020-01-15 US US16/744,039 patent/US20200226473A1/en not_active Abandoned
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180322607A1 (en) * | 2017-05-05 | 2018-11-08 | Intel Corporation | Dynamic precision management for integer deep learning primitives |
US20190057303A1 (en) * | 2017-08-18 | 2019-02-21 | Microsoft Technology Licensing, Llc | Hardware node having a mixed-signal matrix vector unit |
US20190171420A1 (en) * | 2017-12-06 | 2019-06-06 | Advanced Micro Devices, Inc. | Dynamic, variable bit-width numerical precision on fpgas for machine learning tasks |
US20190196788A1 (en) * | 2017-12-22 | 2019-06-27 | Alibaba Group Holding Limited | Programmable multiply-add array hardware |
US20190205746A1 (en) * | 2017-12-29 | 2019-07-04 | Intel Corporation | Machine learning sparse computation mechanism for arbitrary neural networks, arithmetic compute microarchitecture, and sparsity for training mechanism |
US11170289B1 (en) * | 2018-04-20 | 2021-11-09 | Perceive Corporation | Computation of neural network node by neural network inference circuit |
US20190325303A1 (en) * | 2018-04-24 | 2019-10-24 | Intel Corporation | Machine learning accelerator architecture |
US20220164666A1 (en) * | 2020-11-20 | 2022-05-26 | Adobe Inc. | Efficient mixed-precision search for quantizers in artificial neural networks |
Non-Patent Citations (8)
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11144282B2 (en) * | 2019-01-16 | 2021-10-12 | Mediatek Inc. | Mathematical accelerator for artificial intelligence applications |
US11842169B1 (en) * | 2019-09-25 | 2023-12-12 | Amazon Technologies, Inc. | Systolic multiply delayed accumulate processor architecture |
US11467806B2 (en) | 2019-11-27 | 2022-10-11 | Amazon Technologies, Inc. | Systolic array including fused multiply accumulate with efficient prenormalization and extended dynamic range |
US12067375B2 (en) | 2019-11-27 | 2024-08-20 | Amazon Technologies, Inc. | Systolic array including fused multiply accumulate with efficient prenormalization and extended dynamic range |
US11816446B2 (en) | 2019-11-27 | 2023-11-14 | Amazon Technologies, Inc. | Systolic array component combining multiple integer and floating-point data types |
US20230273749A1 (en) * | 2020-01-27 | 2023-08-31 | Samsung Electronics Co., Ltd. | Latency and throughput centric reconfigurable storage device |
US12001929B2 (en) * | 2020-04-01 | 2024-06-04 | Samsung Electronics Co., Ltd. | Mixed-precision neural processing unit (NPU) using spatial fusion with load balancing |
US20210320967A1 (en) * | 2020-04-09 | 2021-10-14 | Micron Technology, Inc. | Edge Server with Deep Learning Accelerator and Random Access Memory |
US20210357748A1 (en) * | 2020-05-14 | 2021-11-18 | Samsung Electronics Co., Ltd. | Hierarchical weight preprocessing for neural network accelerator |
US11762803B2 (en) | 2020-06-29 | 2023-09-19 | Amazon Technologies, Inc. | Multiple accumulate busses in a systolic array |
CN112257844A (en) * | 2020-09-29 | 2021-01-22 | 浙江大学 | Convolutional neural network accelerator based on mixed precision configuration and implementation method thereof |
CN113469327A (en) * | 2021-06-24 | 2021-10-01 | 上海寒武纪信息科技有限公司 | Integrated circuit device for executing advance of revolution |
US11880682B2 (en) | 2021-06-30 | 2024-01-23 | Amazon Technologies, Inc. | Systolic array with efficient input reduction and extended array performance |
US20230376534A1 (en) * | 2022-05-18 | 2023-11-23 | Sap Se | Scalable bandwidth efficient graph processing on field programmable gate arrays |
US11921786B2 (en) * | 2022-05-18 | 2024-03-05 | Sap Se | Scalable bandwidth efficient graph processing on field programmable gate arrays |
CN115470901A (en) * | 2022-09-06 | 2022-12-13 | 北京大学 | Hybrid precision training method and device supporting load sharing of heterogeneous processor at mobile terminal |
CN116167461A (en) * | 2023-04-21 | 2023-05-26 | 之江实验室 | Model training method and device, storage medium and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
US11321606B2 (en) | 2022-05-03 |
US20200225996A1 (en) | 2020-07-16 |
US20200226444A1 (en) | 2020-07-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11321606B2 (en) | Systems, apparatus, methods, and architectures for a neural network workflow to generate a hardware accelerator | |
Mittal | A survey of FPGA-based accelerators for convolutional neural networks | |
US20240118892A1 (en) | Apparatuses, methods, and systems for neural networks | |
Nurvitadhi et al. | Accelerating binarized neural networks: Comparison of FPGA, CPU, GPU, and ASIC | |
US11816446B2 (en) | Systolic array component combining multiple integer and floating-point data types | |
Luo et al. | Towards efficient deep neural network training by FPGA-based batch-level parallelism | |
Shen et al. | Escher: A CNN accelerator with flexible buffering to minimize off-chip transfer | |
CN108805262A (en) | System and method for carrying out systolic arrays design according to advanced procedures | |
US20180046894A1 (en) | Method for optimizing an artificial neural network (ann) | |
Daghero et al. | Energy-efficient deep learning inference on edge devices | |
Tomov et al. | Accelerating the reduction to upper Hessenberg, tridiagonal, and bidiagonal forms through hybrid GPU-based computing | |
Xu et al. | A dedicated hardware accelerator for real-time acceleration of YOLOv2 | |
Jeon et al. | Deep learning with GPUs | |
US20210334636A1 (en) | Systolic-cnn: an opencl-defined scalable runtime-flexible programmable accelerator architecture for accelerating convolutional neural network inference in cloud/edge computing | |
Noh et al. | FlexBlock: A flexible DNN training accelerator with multi-mode block floating point support | |
Lou et al. | Octcnn: A high throughput fpga accelerator for cnns using octave convolution algorithm | |
Mousouliotis et al. | Squeezejet: High-level synthesis accelerator design for deep convolutional neural networks | |
Vestias | Efficient design of pruned convolutional neural networks on fpga | |
Lou et al. | RV-CNN: Flexible and efficient instruction set for CNNs based on RISC-V processors | |
Ghodhbani et al. | Deploying deep learning networks based advanced techniques for image processing on FPGA platform | |
Yu et al. | Bshift: a low cost deep neural networks accelerator | |
Gan et al. | High performance reconfigurable computing for numerical simulation and deep learning | |
WO2022047802A1 (en) | Processing-in-memory device and data processing method thereof | |
Nie et al. | Laius: an energy-efficient FPGA CNN accelerator with the support of a fixed-point training framework | |
Zhang et al. | End-to-end acceleration of the YOLO object detection framework on FPGA-only devices |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
| AS | Assignment | Owner name: BIGSTREAM SOLUTIONS, INC., CALIFORNIA; Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:SHARMA, HARDIK;PARK, JONGSE;SIGNING DATES FROM 20200121 TO 20200123;REEL/FRAME:051952/0975 |
| STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
| STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |