CN113469328A - Device, board card, method and readable storage medium for performing turn-number traversal in a neural network model
Device, board card, method and readable storage medium for performing turn-number traversal in a neural network model
- Publication number
- CN113469328A (application CN202110704820.9A)
- Authority
- CN
- China
- Prior art keywords
- layer
- data
- neural network
- processing
- tree
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/78—Architectures of general purpose stored program computers comprising a single central processing unit
- G06F15/7807—System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
- G06F15/7817—Specially adapted for signal processing, e.g. Harvard architectures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Abstract
The invention relates to a device, a board card, a method and a readable storage medium for performing turn-number traversal in a neural network model. The computing device of the invention is included in an integrated circuit device that also comprises a universal interconnection interface and other processing devices. The computing device interacts with the other processing devices to jointly complete computing operations specified by the user. The integrated circuit device may further include a storage device, connected to the computing device and the other processing devices, respectively, for storing data of the computing device and the other processing devices.
Description
Technical Field
The present invention relates generally to the field of neural networks. More particularly, the present invention relates to an apparatus, a board card, a method and a readable storage medium for performing turn-number traversal in a neural network model.
Background
In recent years, neural network algorithms, as a branch of artificial intelligence algorithms, have exhibited good adaptability and superior performance in a growing number of fields, such as image recognition, object detection and natural language processing, and have become a research hotspot in both academia and industry.
However, neural network algorithms are computationally intensive (on the order of tens of billions of operations), and model training requires a back-propagation process that consumes a large amount of hardware resources. A conventional general-purpose processor, which must also preserve its generality, cannot meet the demands of intelligent application scenarios, so high-performance, low-power neural network accelerators have become one of the research hotspots in the architecture field in recent years.
Since different accelerators have different architectures and impose different constraints on data placement, tiling, movement and computation, the corresponding programming system must take the low-level hardware implementation details into account when generating instructions. In particular, the convolution and fully-connected operators in a neural network model occupy most of the computing resources, and insufficient hardware computing power reduces operating efficiency.
Therefore, a compiling and optimizing scheme for the neural network model is urgently needed.
Disclosure of Invention
To at least partially solve the technical problems mentioned in the background, aspects of the present invention provide an apparatus, a board card, a method and a readable storage medium for performing turn-number traversal in a neural network model.
In one aspect, the present invention discloses an integrated circuit device for performing turn-number traversal in a neural network model, which includes a processing device and a computing device. The processing device is configured to: identify a fixed-point operation layer from the neural network model, the fixed-point operation layer being a layer that operates on fixed-point numbers; judge whether the previous layer is an operator that only rearranges the positions of data elements; if so, repeat the judging step for the layer before it; if not, set that previous layer as the turn-number scheduling destination layer and schedule the data type conversion operation performed in the fixed-point operation layer to the turn-number scheduling destination layer. The computing device is configured to run the neural network model based on the scheduled fixed-point operation layer and the turn-number scheduling destination layer.
In another aspect, the present invention discloses a board card including the integrated circuit device.
In another aspect, the present invention discloses a method of performing turn-number traversal in a neural network model, comprising: identifying a fixed-point operation layer from the neural network model, the fixed-point operation layer being a layer that operates on fixed-point numbers; judging whether the previous layer is an operator that only rearranges the positions of data elements; if so, repeating the judging step for the layer before it; if not, setting that previous layer as the turn-number scheduling destination layer and scheduling the data type conversion operation performed in the fixed-point operation layer to the turn-number scheduling destination layer; and running the neural network model based on the scheduled fixed-point operation layer and the turn-number scheduling destination layer.
In another aspect, the present invention discloses a computer-readable storage medium having stored thereon computer program code for performing turn-number traversal in a neural network model, which, when executed by a processing device, performs the aforementioned method.
The present invention provides a turn-number traversal scheme that splits the computation logic inside a neural network operator into loops according to the algorithm, so that the quantization parameters required by the turn-number scheduling destination layer can be computed in advance, thereby saving inter-layer data transfer, reducing hardware bandwidth occupation and improving performance.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. In the accompanying drawings, several embodiments of the present invention are illustrated by way of example and not by way of limitation, and like reference numerals designate like or corresponding parts throughout the several views, in which:
FIG. 1 is a structural diagram showing a board card according to an embodiment of the present invention;
FIG. 2 is a block diagram illustrating an integrated circuit device of an embodiment of the invention;
FIG. 3 is a schematic diagram showing the internal structure of a computing device of an embodiment of the invention;
FIG. 4 is a schematic diagram showing the internal structure of a processor core of an embodiment of the invention;
FIG. 5 is a diagram illustrating an execution tree of an embodiment of the present invention;
FIG. 6 is a diagram illustrating parsing a traversal execution tree according to an embodiment of the invention;
FIG. 7 is a flow chart illustrating data type conversion scheduling advancement according to an embodiment of the present invention;
FIG. 8 is a diagram illustrating an exemplary neural network segment; and
FIG. 9 is a diagram illustrating online quantized scheduling optimization according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be understood that the terms "first", "second", "third" and "fourth", etc. in the claims, the description and the drawings of the present invention are used for distinguishing different objects and are not used for describing a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification and claims of this application, the singular form of "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this specification refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection".
The following detailed description of embodiments of the invention refers to the accompanying drawings.
Fig. 1 shows a schematic structural diagram of a board card 10 according to an embodiment of the present invention. As shown in fig. 1, the board card 10 includes a chip 101, which is a system-on-chip (SoC), also called a system on a chip, integrating one or more combined processing devices. The combined processing device is an artificial intelligence computing unit that supports various deep learning and machine learning algorithms and meets the intelligent processing requirements of fields such as computer vision, speech, natural language processing and data mining in complex scenarios. Deep learning technology in particular is widely applied in the field of cloud intelligence; one notable characteristic of cloud intelligence applications is the large input data size, which places high demands on the storage capacity and computing capacity of the platform.
The chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, a wifi interface, or the like. The data to be processed may be transferred by the external device 103 to the chip 101 through the external interface device 102. The calculation result of the chip 101 may be transmitted back to the external device 103 via the external interface device 102. The external interface device 102 may have different interface forms, such as a PCIe interface, according to different application scenarios.
The board card 10 also includes a storage device 104 for storing data, which includes one or more storage units 105. The storage device 104 is connected to, and exchanges data with, the control device 106 and the chip 101 through a bus. The control device 106 on the board card 10 is configured to regulate the state of the chip 101. For this purpose, in one application scenario, the control device 106 may include a single-chip microcomputer (MCU).
Fig. 2 is a structural diagram showing a combined processing device in the chip 101 of this embodiment. As shown in fig. 2, the combination processing device 20 includes a computing device 201, an interface device 202, a processing device 203, and a DRAM 204.
The computing device 201 is configured to perform user-specified operations, mainly implemented as a single-core smart processor or a multi-core smart processor, to perform deep learning or machine learning computations, which may interact with the processing device 203 through the interface device 202 to collectively perform the user-specified operations.
The interface device 202 is used for transmitting data and control instructions between the computing device 201 and the processing device 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202, and write to a storage device on the computing device 201. Further, the computing device 201 may obtain the control instruction from the processing device 203 via the interface device 202, and write the control instruction into a control cache on the computing device 201. Alternatively or optionally, the interface device 202 may also read data from a storage device of the computing device 201 and transmit the data to the processing device 203.
The processing device 203, as a general-purpose processing device, performs basic control including, but not limited to, data transfer and starting and/or stopping the computing device 201. Depending on the implementation, the processing device 203 may be one or more types of processor, such as a central processing unit (CPU), a graphics processing unit (GPU) or another general-purpose and/or special-purpose processor, including but not limited to a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components, and their number may be determined according to actual needs. As previously mentioned, the computing device 201 of the present invention, considered alone, may be regarded as having a single-core structure or a homogeneous multi-core structure. However, when considered together, the computing device 201 and the processing device 203 form a heterogeneous multi-core structure.
The DRAM 204 is used to store the data to be processed. It is a DDR memory, typically 16 GB or larger, and stores data of the computing device 201 and/or the processing device 203.
Fig. 3 shows the internal structure of the computing device 201. The computing device 201 is used to process input data from fields such as computer vision, speech, natural language and data mining. The computing device 201 in the figure adopts a multi-core hierarchical design: as a system on chip, it includes a plurality of clusters, and each cluster in turn includes a plurality of processor cores. In other words, the computing device 201 is organized as a system-on-chip / cluster / processor-core hierarchy.
Looking at the system-on-chip hierarchy, as shown in FIG. 3, the computing device 201 includes an external storage controller 301, a peripheral communication module 302, an on-chip interconnect module 303, a synchronization module 304, and a plurality of clusters 305.
There may be multiple external memory controllers 301 (two are shown in the figure by way of example) for accessing an external storage device such as the DRAM 204 in fig. 2, so as to read data from or write data to off-chip memory in response to an access request issued by a processor core. The peripheral communication module 302 is used to receive control signals from the processing device 203 through the interface device 202 and to start the computing device 201 to execute a task. The on-chip interconnect module 303 connects the external memory controllers 301, the peripheral communication module 302 and the plurality of clusters 305, and transmits data and control signals between the modules. The synchronization module 304 is a global synchronization barrier controller (GBC) that coordinates the operation progress of the clusters and ensures synchronization of information. The plurality of clusters 305 are the computing cores of the computing device 201; four are shown in the figure by way of example, and as hardware evolves, the computing device 201 of the present invention may include 8, 16, 64 or even more clusters 305. The clusters 305 are used to efficiently execute deep learning algorithms.
Viewed at the cluster level, as shown in FIG. 3, each cluster 305 includes a plurality of processor cores (IPU core)306 and a memory core (MEM core) 307.
The number of the processor cores 306 is exemplarily shown as 4 in the figure, and the present invention does not limit the number of the processor cores 306. The internal architecture is shown in fig. 4. Each processor core 306 includes three major modules: a control module 41, an arithmetic module 42 and a storage module 43.
The control module 41 is used for coordinating and controlling the operations of the operation module 42 and the storage module 43 to complete the deep learning task, and includes an Instruction Fetch Unit (IFU) 411 and an Instruction Decode Unit (IDU) 412. The instruction fetch unit 411 is used to obtain an instruction from the processing device 203, and the instruction decode unit 412 decodes the obtained instruction and sends the decoded result to the operation module 42 and the storage module 43 as control information.
The operation module 42 includes a vector operation unit 421 and a matrix operation unit 422. The vector operation unit 421 is used for performing vector operations, and can support complex operations such as vector multiplication, addition, and nonlinear transformation; the matrix operation unit 422 is responsible for the core calculation of the deep learning algorithm, namely matrix multiplication and convolution.
The storage module 43 is used to store or transfer related data, and includes a neuron storage unit (neuron RAM, NRAM) 431, a weight storage unit (weight RAM, WRAM) 432, an input/output direct memory access module (input/output DMA) 433, and a transport direct memory access module (MVDMA) 434. The NRAM 431 stores the feature maps to be computed by the processor core 306 and the intermediate results after computation; the WRAM 432 stores the weights of the deep learning network; the input/output DMA 433 controls accesses between the NRAM 431/WRAM 432 and the DRAM 204 through the broadcast bus 309; and the MVDMA 434 controls accesses between the NRAM 431/WRAM 432 and the SRAM 308.
Returning to FIG. 3, the storage core 307 is primarily used to store and communicate, i.e., store shared data or intermediate results among the processor cores 306, as well as perform communications between the clusters 305 and the DRAMs 204, communications among the clusters 305, communications among the processor cores 306, and the like. In other embodiments, storage core 307 has the capability of scalar operations to perform scalar operations.
The memory core 307 includes a shared memory unit (SRAM) 308, a broadcast bus 309, a cluster direct memory access module (CDMA) 310, and a global direct memory access module (GDMA) 311. The SRAM 308 acts as a high-performance data transfer station: data multiplexed among different processor cores 306 in the same cluster 305 does not need to be fetched from the DRAM 204 by each processor core 306 separately, but is relayed among the processor cores 306 through the SRAM 308. The memory core 307 only needs to rapidly distribute the multiplexed data from the SRAM 308 to the plurality of processor cores 306, which improves inter-core communication efficiency and greatly reduces on-chip/off-chip input/output accesses.
The broadcast bus 309, CDMA 310, and GDMA 311 are used to perform communication among the processor cores 306, communication among the cluster 305, and data transfer between the cluster 305 and DRAM204, respectively. As will be described separately below.
The broadcast bus 309 is used to accomplish high-speed communication among the processor cores 306 in the cluster 305, and the broadcast bus 309 of this embodiment supports inter-core communication modes including unicast, multicast and broadcast. Unicast refers to point-to-point (i.e., from a single processor core to a single processor core) data transfer, multicast is a communication for transferring a copy of data from SRAM308 to a specific number of processor cores 306, and broadcast is a communication for transferring a copy of data from SRAM308 to all processor cores 306, and is a special case of multicast.
The GDMA 311 cooperates with the external memory controller 301 to control accesses from the SRAM 308 of the cluster 305 to the DRAM 204, or to read data from the DRAM 204 into the SRAM 308. As can be seen from the foregoing, communication between the DRAM 204 and the NRAM 431 or WRAM 432 can be achieved via two channels. The first channel directly connects the DRAM 204 with the NRAM 431 or WRAM 432 through the input/output DMA 433. The second channel transfers data between the DRAM 204 and the SRAM 308 via the GDMA 311, and then between the SRAM 308 and the NRAM 431 or WRAM 432 via the MVDMA 434. Although the second channel seemingly requires more components and a longer data path, in some embodiments its bandwidth is substantially greater than that of the first channel, so communication between the DRAM 204 and the NRAM 431 or WRAM 432 may be more efficient over the second channel. Embodiments of the present invention may select the data transfer channel according to the hardware conditions.
In other embodiments, the functionality of the GDMA 311 and the functionality of the input/output DMA 433 may be integrated in the same component. For convenience of description, the GDMA 311 and the input/output DMA 433 are treated as different components; as long as the functions realized and the technical effects achieved are similar to those of the present invention, other implementations fall within the protection scope of the present invention. Further, the functions of the GDMA 311, the input/output DMA 433, the CDMA 310 and the MVDMA 434 may also be implemented by the same component.
The neural network framework to which this embodiment applies predefines a series of neural network layer or operator interfaces. A developer sets the layer parameters of each layer by calling the application programming interface (API) of the neural network framework, and links the dependencies between data and layers to build the neural network model structure. After the network model is trained, the model parameters and weight data are saved in a structured model file stored in the DRAM 204. During deployment, the processing device 203 calls the API of the framework, loads the trained network model, and executes the forward inference process of the network model on the computing device 201 with the actual input data to obtain the final output of the network. Since the model structure and parameters are known during forward inference, this embodiment uses this information to accelerate execution.
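The build / train / deploy workflow just described can be sketched roughly as follows. This is an illustrative assumption only: the API names (Layer, Model, load_model, run_forward) are hypothetical and do not correspond to any framework named in the patent.

```python
# Hypothetical sketch of the build / deploy workflow described above.
# All API names are illustrative assumptions, not an actual framework interface.

def build_model(api):
    conv = api.Layer("Convolution", kernel=3, channels=64)   # set layer parameters
    bn   = api.Layer("BatchNorm")
    fc   = api.Layer("InnerProduct", outputs=10)
    return api.Model(layers=[conv, bn, fc],
                     edges=[(conv, bn), (bn, fc)])           # link data dependencies

def deploy(api, model_file, input_data):
    model = api.load_model(model_file)       # structured model file with weights
    # forward inference on the computing device; the model structure and
    # parameters are known in advance, which the embodiment exploits
    return api.run_forward(model, input_data, device="accelerator")
```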
This embodiment proposes a tree-structured programming method for neural network operators, called an execution tree. Fig. 5 shows a schematic diagram of the execution tree of this embodiment. The execution tree is a nested data structure formed by a root node 501 connected to a subtree; the subtree may contain any number of levels and any number of child nodes, which are divided into non-leaf nodes and leaf nodes. The non-leaf nodes are located in the middle levels of the subtree; two non-leaf nodes 502 and 503 are shown in fig. 5 by way of example. The leaf nodes are located at the last level of the subtree; two leaf nodes 504 and 505 are shown in fig. 5 by way of example. The number of levels and the number of child nodes of the subtree depend on the needs of the operator, and this embodiment is not limited in this respect.
The root node and the child nodes share the same execution logic, which comprises the following operations: an initial operation, a pre-processing operation, a main body operation, a post-processing operation and an ending operation. The root node and the child nodes also include a loop operation (not shown) that records the number of times the node needs to be executed repeatedly.
The initial operation is the first part executed within an execution tree of the same level; it is executed only once, is not repeated with the loop, and consists of one-time initialization instructions such as register initialization instructions and activation-operation configuration instructions. The pre-processing operation is executed after the initial operation and is repeated at least once according to the loop operation; it is responsible for the pre-processing before the main body operation, for example, in the Scale operator, fetching the loop-segment data corresponding to the short-vector right operand. The main body operation is executed after the pre-processing operation and is also repeated at least once according to the loop operation; it is responsible for the computation part of the operator's main loop. If the node is the root node or a non-leaf node, its main body operation only partitions data and distributes tasks to the child nodes of the next level; if it is a leaf node, its main body operation executes the computational core of the tree, for example an accumulation operation. The post-processing operation is repeated at least once after the main body operation according to the loop operation and is responsible for post-processing after the computation, such as shifting multiplexed data and offsetting registers. The ending operation is executed only once to output the computation result.
The execution counts and timing of the above operations are determined by the processing device 203 based on loop analysis of the operation instructions of the neural network operator on the computing device 201, and are not functional limitations of the execution tree. When looping is required, the looped part consists of the pre-processing operation, the main body operation and the post-processing operation.
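As a concrete illustration of the node structure just described, the following minimal sketch models a node with its five operations, a loop count and its children; the class and field names are assumptions made for illustration and do not appear in the patent.

```python
# Minimal sketch of an execution-tree node, assuming each operation is stored
# as a list of instructions. Names (TreeNode, loop_count, ...) are illustrative.
from dataclasses import dataclass, field
from typing import List

@dataclass
class TreeNode:
    initial: List[str] = field(default_factory=list)         # executed once
    preprocess: List[str] = field(default_factory=list)      # repeated per loop
    body: List[str] = field(default_factory=list)            # repeated per loop
    postprocess: List[str] = field(default_factory=list)     # repeated per loop
    finish: List[str] = field(default_factory=list)          # executed once
    loop_count: int = 1                                       # loop operation
    children: List["TreeNode"] = field(default_factory=list)  # subtree
```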
In this embodiment, the execution of the neural network operator can be roughly divided into 3 stages: the loading stage, the calculating stage and the storing stage, so the processing device 203 divides the execution tree of the neural network operator into three types of trees of loading, calculating and storing, and the execution tree of each operator is composed of a root node and a subtree of the loading, calculating and storing tree, that is, all the execution trees of one operator belong to one of the 3 trees, and each tree has the structure of fig. 5.
In running the neural network model, 3 execution trees of one operator can implement all the instructions required for the neural network operator to run on the computing device 201. First, the computing device 201 executes all instructions of the operations in the corresponding execution sequence of one leaf node of the load tree, then executes one leaf node of the compute tree, and finally executes the leaf node of the store tree, and the process is repeated until all the nodes are executed.
In more detail, in the compilation stage, when the processing device 203 parses and traverses an execution tree, it follows pre-order traversal: the initial and pre-processing operations of the root node are executed first, then all the nodes of the subtree are traversed within the main body operation, and finally the post-processing and ending operations of the root node are executed. The pre-processing, main body and post-processing operations are repeated in a loop.
To implement the loop operation, when repeated execution is required, a synchronization instruction is inserted after the post-processing operation of the node that needs to be repeated. When the computing device 201 is running and receives a synchronization instruction, it returns to the pre-processing operation of that node and executes the pre-processing, main body and post-processing operations again until the loop count of the loop operation is satisfied, and then executes the ending operation of the node.
Fig. 6 shows a schematic diagram of parsing and traversing the execution tree in this embodiment. The simplified execution tree includes a root node 601, a first leaf node 602 and a second leaf node 603. Assume that the loop operation of the root node 601 records a loop count of 3, the loop operation of the first leaf node 602 records a loop count of 5, and the loop operation of the second leaf node 603 records a loop count of 1. When traversing the execution tree, the processing device 203 first executes the initial and pre-processing operations of the root node 601 and then its main body operation; following the linking order of the subtree, it then executes the initial, pre-processing, main body and post-processing operations of the first leaf node 602, after which the synchronization instruction 604 is encountered, whose loop information records that 5 repetitions are required. Since the first leaf node 602 has so far been executed only once, the pre-processing, main body and post-processing operations of the first leaf node 602 are repeated until 5 loops have been performed, and finally the ending operation of the first leaf node 602 is executed. At this point all operations of the subtree of the first leaf node 602 have been traversed.
The processing device 203 then traverses the subtree of the second leaf node 603. Since the second leaf node 603 only needs to loop once, no synchronization instruction is inserted; its initial, pre-processing, main body, post-processing and ending operations are executed directly, and traversal returns to the root node 601.
Traversal then continues at the root node 601, i.e., the post-processing operation of the root node 601 is executed. Since the root node 601 needs to be executed 3 times, its post-processing operation is followed by the synchronization instruction 605, whose loop information records that 3 repetitions are required. At this point the processing device 203 returns to the pre-processing operation of the root node 601, again runs the entire operation flow of all of its subtrees as described above, and executes the post-processing operation of the root node 601 again, until 3 loops have been performed; finally the ending operation of the root node 601 is executed, completing all operations of the tree rooted at the root node 601.
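The traversal order described for fig. 6 can be sketched as the function below, reusing the illustrative TreeNode structure sketched earlier; the synchronization instruction is modeled simply as repeating the pre-processing / main body / post-processing part loop_count times, and the function names are assumptions.

```python
# Sketch of pre-order traversal with looping, reusing the TreeNode sketch above.
def run(node: "TreeNode", execute):
    for ins in node.initial:
        execute(ins)
    # the synchronization instruction after the post-processing operation is
    # modeled as repeating pre-processing / main body / post-processing
    for _ in range(node.loop_count):
        for ins in node.preprocess:
            execute(ins)
        for ins in node.body:
            execute(ins)
        for child in node.children:      # main body of non-leaf nodes dispatches
            run(child, execute)          # work to the next level
        for ins in node.postprocess:
            execute(ins)
    for ins in node.finish:
        execute(ins)

# FIG. 6 example: root loops 3 times, first leaf 5 times, second leaf once.
root = TreeNode(loop_count=3,
                children=[TreeNode(loop_count=5, body=["leaf-1 body"]),
                          TreeNode(loop_count=1, body=["leaf-2 body"])])
run(root, execute=print)   # "leaf-1 body" prints 15 times, "leaf-2 body" 3 times
```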
The example of fig. 6 illustrates the traversal order within a single execution tree. As described above, when computing an operator, the computing device 201 repeatedly traverses the nodes of its execution trees following the chained loop load → compute → store → load → compute → store.
When compiling an execution tree, the processing device 203 first analyzes the specific algorithm of the neural network operator to derive the computation loop levels, constructs the corresponding execution tree levels, and links the subtree relationships. Next, the maximum input (or output) data volume of each computation loop is determined from the proportion or actual size of the on-chip resources (mainly NRAM 431 memory space) occupied by the input, output and constant data blocks in each iteration; the loop level for data tiling is then obtained by dividing the input data volume of a given computation loop level by the maximum input data volume of a single loop, and the subtree relationships are linked accordingly. Within each subtree, memory is allocated and released in the appropriate operations according to the data volume actually circulated. Finally, the corresponding instructions for loading off-chip data, moving multiplexed data, computing, storing output data and the like are filled into the appropriate operations of each subtree to complete the compilation of the operator.
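The tiling computation described in this paragraph can be sketched roughly as below; the NRAM budget split and the ceiling division are assumptions about one reasonable realization, not the patent's exact formula.

```python
import math

# Rough sketch: derive the loop count of the data-tiling level from the NRAM
# budget. The 'input_occupancy' fraction for the input block is an
# illustrative assumption.
def tiling_loop_count(total_input_bytes: int,
                      nram_bytes: int,
                      input_occupancy: float = 0.5) -> int:
    max_input_per_loop = int(nram_bytes * input_occupancy)   # max data per loop
    # loop level of the data slice = total input / max single-loop input
    return math.ceil(total_input_bytes / max_input_per_loop)

# e.g. a 4 MB feature map with a 512 KB NRAM, half of it budgeted for input
print(tiling_loop_count(4 * 1024 * 1024, 512 * 1024))   # -> 16
```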
Since convolutional layers and fully-connected layers account for most of the computation in the full network, these computations need to be optimized to improve full-network performance. This embodiment observes that the large amount of weight data in the convolutional and fully-connected layers contains a certain redundancy, and therefore adopts a low-precision calculation method on the condition that precision is not lost at all or is lost only within an allowable range. In other words, to save hardware resources, this embodiment uses quantization to convert high-precision floating-point numbers into low-precision fixed-point numbers to accelerate the neural network operation. For example, the matrix operation unit 422 only supports multiply-accumulate operations on 8-bit fixed-point numbers (INT8); before performing matrix operations, both the input data and the weight data are converted into fixed-point numbers of the INT8 data type and then fed into the matrix operation unit 422 for calculation.
For these two layers, the weight data can be converted in advance using an offline pre-processing quantization method. Since the weight data are stored offline in the model, they can be pre-processed at compile time: they are converted according to the corresponding data type and saved into a new network model file, the corresponding network model structure description file is modified, the operation data type of the corresponding neural network layer is marked, and the corresponding parameters required for quantization are added. At compile time, an instruction sequence for the computing device 201 is generated according to the quantized network model parameters. At run time, through the instruction sequence generated by the processing device 203, the computing device 201 loads the required weight data into the WRAM 432 with the bit width corresponding to the computation data type, and performs the convolution and fully-connected layer operations, thereby accelerating the network.
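A hedged sketch of this offline weight quantization step follows. Symmetric maximum-absolute-value INT8 quantization is assumed here because the embodiment later describes a maximum-absolute-value algorithm for online quantization; the patent does not fix the exact formula, so the scale computation below is an assumption.

```python
import numpy as np

# Sketch of offline (compile-time) weight quantization to INT8.
# The symmetric absmax formula is an assumption, not the patent's exact rule.
def quantize_weights_int8(weights: np.ndarray):
    absmax = float(np.max(np.abs(weights)))
    scale = absmax / 127.0 if absmax > 0 else 1.0       # quantization parameter
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale                                      # saved in the new model file

w = np.random.randn(64, 64).astype(np.float32)
q_weights, w_scale = quantize_weights_int8(w)            # stored with the layer's
                                                         # marked data type
```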
However, the input data of the convolution and fully-connected operators may be the output of other neural network layers in the network, so its data type cannot be converted in advance at compile time; a corresponding instruction sequence must complete the data type conversion on chip. These instructions are all computation-type instructions and are executed in the computation stage of the operator.
This embodiment proposes a compilation optimization method that advances the scheduling of data type conversion. From the layer perspective, the data type conversion is scheduled ahead of time, and this can only be done when the data block satisfies the data dependency constraint of either of the following two cases. In the first case, the data block whose type is to be converted comes from the output of the previous layer and is used only by the current layer as input data; that is, the output of the previous layer serves only as the input of the current layer and is not provided to any other layer. In the second case, all receiving layers (including the current layer) of the output data block of the previous layer are neural network layers that compute on fixed-point numbers, and the quantization parameters used in their data type conversion operations are the same. In other words, this embodiment must ensure that rescheduling the data type conversion does not affect the operation of any other layer. Fig. 7 shows a flowchart of advancing the data type conversion scheduling in this embodiment.
In step 701, the processing device 203 identifies, from the neural network model, a layer that requires fixed-point computation, hereinafter referred to as the fixed-point operation layer.
In step 702, the processing device 203 determines whether the previous layer is an operator that only rearranges the positions of data elements. Such operators only move or transpose data, whose volume is determined by the data bit width, and perform no numerical computation; they can be divided into 3 classes. The first class consists of data migration operators, such as Concat, Split, Slice/Crop, StridedSlice, Shuffle, addrod, Space2Batch, Batch2Space, and the like; the second class consists of data shape conversion operators, such as Reshape, Flatten, Permute, Squeeze, Unsqueeze, and the like; the third class consists of transposition operators, such as Transpose.
If the processing device 203 determines that the previous layer is an operator that only rearranges the positions of data elements, and the output data blocks of that operator satisfy the aforementioned data dependency constraints, the process returns to step 702, where the processing device 203 again determines whether the layer before that one is an operator that only rearranges the positions of data elements. This continues until the previous layer is no longer such an operator; that layer, which does more than rearrange data element positions, is the turn-number scheduling destination layer. Then step 703 is executed, in which the processing device 203 schedules the data type conversion operation performed in the fixed-point operation layer to the turn-number scheduling destination layer. The computing device 201 then runs the neural network based on the scheduled fixed-point operation layer and the turn-number scheduling destination layer.
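Steps 701-703 can be sketched as the following walk over the layer graph. The operator name sets follow the three classes listed above, but the graph helpers (previous_layer, move_conversion_op, satisfies_dependency) are illustrative assumptions, and the dependency-constraint check is reduced to a placeholder.

```python
# Sketch of the data type conversion scheduling of steps 701-703.
# Layer/graph helpers are assumed for illustration; the check of the two
# dependency-constraint cases is passed in as a callable.
REARRANGE_ONLY_OPS = {
    # class 1: data migration
    "Concat", "Split", "Slice", "Crop", "StridedSlice", "Shuffle",
    "Space2Batch", "Batch2Space",
    # class 2: data shape conversion
    "Reshape", "Flatten", "Permute", "Squeeze", "Unsqueeze",
    # class 3: transposition
    "Transpose",
}

def schedule_conversion(fixed_point_layer, previous_layer, satisfies_dependency,
                        move_conversion_op):
    layer = previous_layer(fixed_point_layer)           # step 702 starts one layer up
    while layer.op_type in REARRANGE_ONLY_OPS and satisfies_dependency(layer):
        layer = previous_layer(layer)                   # keep walking upward (step 702)
    destination = layer                                 # turn-number scheduling destination
    move_conversion_op(fixed_point_layer, destination)  # step 703
    return destination
```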
Fig. 8 shows an exemplary neural network segment to illustrate the type conversion scheduling of this embodiment. The segment includes a first layer 801, a second layer 802, a third layer 803, a fourth layer 804 and a fifth layer 805, where the first layer 801 is a pooling layer, the second layer 802 is a BatchNorm layer, the third layer 803 is a Squeeze layer, the fourth layer 804 is a Transpose layer and the fifth layer 805 is a convolutional layer. As described above, the convolution operation is mainly performed by the matrix operation unit 422, which only supports INT8 multiply-accumulate operations, so both the input data and the weight data must be converted into fixed-point numbers of the INT8 data type before the matrix operation is performed. Therefore, the pre-processing operation of the computation tree of the fifth layer 805 is to perform the data type conversion operation 806.
First, in step 701, the processing device 203 identifies a fixed-point operation layer from the neural network model segment of fig. 8. Of the first to fifth layers, only the fifth layer 805 performs convolution operation, and thus the fifth layer 805 is a fixed-point operation layer.
In step 702, the processing device 203 determines whether the layer above the fifth layer 805 is an operator that only rearranges the positions of data elements. The layer above the fifth layer 805 is the fourth layer 804, a Transpose layer; the Transpose operator belongs to the third class and is indeed an operator that only rearranges the positions of data elements, so the processing device 203 executes step 702 again, now starting from the fourth layer 804, to determine whether the layer above it only rearranges the positions of data elements. The layer above the fourth layer 804 is the third layer 803, a Squeeze layer; this data shape conversion operator of the second class also only rearranges the positions of data elements, so the processing device 203 executes step 702 once more, now starting from the third layer 803. The layer above the third layer 803 is the second layer 802, a BatchNorm layer, which does not belong to any of the three classes; therefore the second layer 802 is not an operator that only rearranges the positions of data elements, and the second layer 802 becomes the turn-number scheduling destination layer.
Finally, in step 703, the processing device 203 schedules the data type conversion operation 806 of the fifth layer 805 to be inserted after the last calculation operation of the second layer 802, i.e., into the post-processing operation of the leaf node in the computation tree of the second layer 802.
Placing the data type conversion operation 806 after the last calculation operation of the second layer 802 does not affect the original calculation of the second layer 802, and executing the data type conversion operation 806 in advance reduces part of the data movement of the second layer 802, all of the data movement of the third layer 803 and the fourth layer 804, and part of the data movement of the fifth layer 805. More precisely, the total data volume saved is: the output data block of the turn-number scheduling destination layer (the second layer 802) is halved in the storage stage, the input and output data volumes of all data blocks of all intermediate data transfer layers (the third layer 803 and the fourth layer 804) are halved, and the input data volume of the fixed-point operation layer (the fifth layer 805) in the loading stage is halved.
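As a small numeric illustration of the halving described above, assume the conversion turns 16-bit data into INT8 so that every element shrinks from 2 bytes to 1 byte; the tensor shape is made up for the example.

```python
# Illustrative only: assumes a 1x64x56x56 block converted from a 16-bit type to INT8.
elements = 1 * 64 * 56 * 56
before = elements * 2          # bytes moved per pass at 16-bit width
after = elements * 1           # bytes moved per pass once converted to INT8
# halved passes: store of layer 802, load+store of layers 803 and 804, load of layer 805
passes_saved = 1 + 2 + 2 + 1
print((before - after) * passes_saved, "bytes of inter-layer traffic saved")
```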
This optimization of scheduling the data type conversion operation into an earlier layer is referred to as "turn-number traversal". From the perspective of the execution tree, the data type conversion operation is moved from the pre-processing operation of the computation tree of the fixed-point operation layer into the post-processing operation of the leaf node that completes the last computation in the computation tree of the turn-number scheduling destination layer.
When the input data uses an offline quantization method, the quantization parameters of a specific layer in the neural network model come from the parameters of the corresponding layer fixed in the model file; they are passed to the back end of the software stack through the software interface and used directly in the corresponding data type conversion instruction. Therefore, the scheduling optimization method can be applied directly to the offline quantization method, scheduling the data type conversion operation across operators.
When the input data is quantized online using the maximum-absolute-value undistorted quantization algorithm, the maximum absolute value of all the data must be computed first, and the quantization parameters are then computed from it; a repeated pass over the input data is therefore necessary. The processing device 203 generates instructions that, in a first loop over the input data, perform the maximum-absolute-value operation and the data type conversion; it then generates instructions that separate out a second loop for the computation process of the fixed-point operation layer; finally, it generates instructions that apply the scheduling optimization to the maximum-absolute-value operation. Since the maximum-absolute-value operation is performed by the vector operation unit 421 regardless of the shape of the data block, the scheduling optimization can be applied without modifying the calculation logic sequence of the turn-number scheduling destination layer, thereby reducing the data volume of the cyclic input/output operations that transfer the data block from the DRAM 204 to the SRAM 308.
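A sketch of the two-pass online quantization just described: a first loop computes the maximum absolute value (the part that the embodiment reschedules into the destination layer), and a second loop performs the data type conversion and the fixed-point computation. The scale formula and function names are assumptions; fixed_point_compute stands in for the INT8 convolution or fully-connected computation.

```python
import numpy as np

# Two-pass online quantization sketch (maximum-absolute-value algorithm).
def online_quantized_layer(x_tiles, q_weights, w_scale, fixed_point_compute):
    # first loop: maximum absolute value over all input data
    absmax = max(float(np.max(np.abs(t))) for t in x_tiles)
    x_scale = absmax / 127.0 if absmax > 0 else 1.0      # quantization parameter

    # second loop: data type conversion + fixed-point layer computation
    outputs = []
    for t in x_tiles:
        q = np.clip(np.round(t / x_scale), -127, 127).astype(np.int8)
        outputs.append(fixed_point_compute(q, q_weights, x_scale * w_scale))
    return outputs
```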
Fig. 9 shows a schematic diagram of the online quantization scheduling optimization of this embodiment from the perspective of the execution tree. The fixed-point operation layer 901 mainly includes 2 groups of execution trees: an execution tree 902 for computing the quantization parameters and an execution tree 903 for the input data type conversion and the fixed-point computation. Each group includes a load tree, a compute tree and a store tree, and each node likewise includes an initial operation, a pre-processing operation, a main body operation, a post-processing operation and an ending operation. For simplicity of illustration, operations not directly related to the online quantization scheduling optimization are not shown.
Before the scheduling optimization, the main body operation 904 of the load tree of the execution tree 902 for computing the quantization parameters loads the input data; the main body operation 905 of its compute tree computes the maximum absolute value of the input data; and the ending operation 906 of its compute tree temporarily stores the maximum absolute value in the SRAM 308. Since the maximum absolute value is used immediately, there is no need to store it in the DRAM 204 and then load it back into the SRAM 308; the store tree of the execution tree 902 therefore has no operation, and the maximum absolute value is kept directly in the SRAM 308.
In the execution tree 903 for the input data type conversion and the fixed-point computation, the main body operation 907 of the load tree loads the input data; the pre-processing operation 908 of the compute tree performs the type conversion of the input data based on the maximum absolute value stored in the SRAM 308; the main body operation 909 of the compute tree performs the INT8 computation with the weight data in fixed-point data format; and the main body operation 910 of the store tree stores the fixed-point data in the DRAM 204.
When performing the scheduling optimization, the processing device 203 moves the quantization parameter calculation of the main body operation 905 into the post-processing operation of the compute tree of the execution tree 911 of the turn-number scheduling destination layer. Further, the processing device 203 moves the instruction of the ending operation 906 that stores the maximum absolute value into the ending operation of the store tree of the execution tree 911 of the scheduling destination layer, so that the computed quantization parameter is stored in the DRAM 204. Finally, the processing device 203 places an operation that loads the maximum absolute value from the DRAM 204 into the SRAM 308 as a parameter into the pre-processing operation 912 of the load tree of the execution tree 903 for the input data type conversion and the fixed-point computation, so that the corresponding quantization parameter can be retrieved and the data type conversion operation completed.
Another embodiment of the present invention is a computer-readable storage medium having stored thereon computer program code for performing turn-number traversal in a neural network model; when the code is executed by a processing device, the method shown in fig. 7 is performed.
The present invention provides a turn-number traversal scheme that splits the computation logic inside a neural network operator into loops according to the algorithm, so that the quantization parameters required by the turn-number scheduling destination layer can be computed in advance, thereby saving inter-layer data transfer, reducing hardware bandwidth occupation and improving performance.
According to different application scenarios, the electronic device or apparatus of the present invention may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a PC device, an internet of things terminal, a mobile phone, a car recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph. The electronic device or apparatus of the present invention can also be applied to the fields of the internet, the internet of things, data centers, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction sites, medical care, and the like. Furthermore, the electronic equipment or the device can be used in application scenes such as a cloud end, an edge end and a terminal which are related to artificial intelligence, big data and/or cloud computing. In one or more embodiments, the electronic device or apparatus with high computational power according to the present disclosure may be applied to a cloud device (e.g., a cloud server), and the electronic device or apparatus with low power consumption may be applied to a terminal device and/or an edge device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device according to the hardware information of the terminal device and/or the edge device, and uniform management, scheduling and cooperative work of end-cloud integration or cloud-edge-end integration can be completed.
It is noted that, for the sake of simplicity, the present invention describes some methods and embodiments thereof as a series of acts and combinations thereof, but those skilled in the art will appreciate that the solution of the invention is not limited by the order of the acts described. Accordingly, based on the disclosure or teachings of the invention, those skilled in the art will appreciate that certain steps therein may be performed in other orders or simultaneously. Further, those skilled in the art will appreciate that the described embodiments may be practiced in alternative embodiments involving fewer acts or modules than are recited, these not all being necessary to implement one or more aspects of the invention. In addition, the descriptions of different embodiments of the present invention have different emphases. In view of this, those skilled in the art will understand that, for portions not described in detail in one embodiment, reference may also be made to the related descriptions of other embodiments.
In specific implementations, based on the disclosure and teachings of the present invention, those skilled in the art will appreciate that the several embodiments disclosed herein may also be implemented in ways not described in these embodiments. For example, with regard to the units in the foregoing embodiments of the electronic device or apparatus, the units are split herein on the basis of logical functions, and other splitting manners are possible in actual implementation. As another example, multiple units or components may be combined or integrated into another system, or some features or functions in a unit or component may be selectively disabled. With regard to the connections between different units or components, the connections discussed above in connection with the figures may be direct or indirect couplings between those units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present invention, units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, according to actual needs, some or all of the units may be selected to achieve the purpose of the solution of the embodiments of the invention. Moreover, in some scenarios, multiple units in an embodiment of the present invention may be integrated into one unit, or each unit may exist physically separately.
In other implementation scenarios, the integrated unit may also be implemented in hardware, that is, as a specific hardware circuit, which may include digital circuits and/or analog circuits, etc. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, which in turn may include, but are not limited to, devices such as transistors or memristors. In view of this, the various devices described herein (e.g., the computing device or other processing devices) may be implemented by appropriate hardware processors, such as central processing units, GPUs, FPGAs, DSPs, ASICs, and the like. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including a magnetic storage medium or a magneto-optical storage medium, etc.), and may be, for example, a Resistive Random Access Memory (RRAM), a Dynamic Random Access Memory (DRAM), a Static Random Access Memory (SRAM), an Enhanced Dynamic Random Access Memory (EDRAM), a High Bandwidth Memory (HBM), a Hybrid Memory Cube (HMC), a ROM, a RAM, or the like.
The embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the invention, and the above description of the embodiments is only intended to help understand the method of the invention and its core idea. Meanwhile, persons skilled in the art may, according to the idea of the invention, make changes to the specific implementations and the application scope. In summary, the content of this specification should not be construed as limiting the present invention.
Claims (9)
1. An integrated circuit device for performing revolution passing in a neural network model, comprising:
a processing device configured to:
identify a fixed point operation layer from the neural network model, the fixed point operation layer being a layer that performs fixed point number operations;
judge whether the previous layer is an operator that only rearranges the positions of data elements;
if so, perform the judging step again; and
if not, set the previous layer as a rotation number scheduling destination layer, and schedule the data type conversion operation performed at the fixed point operation layer to the rotation number scheduling destination layer; and
a computing device configured to run the neural network model based on the scheduled fixed point operation layer and the rotation number scheduling destination layer.
2. The integrated circuit device according to claim 1, wherein the operator that only rearranges the positions of data elements is one of a data migration class operator, a data shape conversion class operator, and a transposition class operator.
3. The integrated circuit device according to claim 1, wherein, when scheduling the data type conversion operation performed at the fixed point operation layer to the rotation number scheduling destination layer, the processing device moves the data type conversion operation from a pre-processing operation of the computation tree of the fixed point operation layer to a post-processing operation of the leaf node, whose computation is completed last, of the computation tree of the rotation number scheduling destination layer.
4. The integrated circuit device according to claim 3, wherein the fixed point operation layer comprises a calculate quantization parameter operation, and the processing device moves the quantization parameter calculation of the subject operation of the calculate quantization parameter operation into a post-processing operation of the computation tree of the rotation number scheduling destination layer.
5. The integrated circuit device according to claim 4, wherein the processing device adjusts the instruction, related to storing the maximum absolute value, of the ending operation of the calculate quantization parameter operation into an ending operation in the storage tree of the rotation number scheduling destination layer.
6. The integrated circuit device according to claim 5, wherein the fixed point operation layer comprises a calculate input data type conversion and fixed point number operation, and the processing device loads the maximum absolute value in a pre-processing operation of the load tree of the calculate input data type conversion and fixed point number operation.
7. A board card comprising the integrated circuit device according to any one of claims 1 to 6.
8. A method of performing revolution passing in a neural network model, comprising:
identifying a fixed point operation layer from the neural network model, the fixed point operation layer being a layer that performs fixed point number operations;
judging whether the previous layer is an operator that only rearranges the positions of data elements;
if so, performing the judging step again;
if not, setting the previous layer as a rotation number scheduling destination layer, and scheduling the data type conversion operation performed at the fixed point operation layer to the rotation number scheduling destination layer; and
running the neural network model based on the scheduled fixed point operation layer and the rotation number scheduling destination layer.
9. A computer-readable storage medium having stored thereon computer program code for performing revolution passing in a neural network model, wherein the computer program code, when executed by a processing device, performs the method of claim 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110704820.9A CN113469328B (en) | 2021-06-24 | 2021-06-24 | Device, board, method and readable storage medium for executing revolution passing |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113469328A (en) | 2021-10-01 |
CN113469328B CN113469328B (en) | 2024-03-19 |
Family
ID=77872813
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110704820.9A Active CN113469328B (en) | 2021-06-24 | 2021-06-24 | Device, board, method and readable storage medium for executing revolution passing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113469328B (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019114517A1 (en) * | 2017-12-13 | 2019-06-20 | 腾讯科技(深圳)有限公司 | Neural network model deployment method, prediction method, and apparatus |
US20190279072A1 (en) * | 2018-03-09 | 2019-09-12 | Canon Kabushiki Kaisha | Method and apparatus for optimizing and applying multilayer neural network model, and storage medium |
CN110334436A (en) * | 2019-07-03 | 2019-10-15 | 腾讯科技(深圳)有限公司 | A kind of data processing method and equipment |
CN110458294A (en) * | 2019-08-19 | 2019-11-15 | Oppo广东移动通信有限公司 | Model running method, apparatus, terminal and storage medium |
WO2021052460A1 (en) * | 2019-09-18 | 2021-03-25 | 华为技术有限公司 | Data processing method, model optimization device, and model execution device |
CN111723935A (en) * | 2020-06-24 | 2020-09-29 | 湖北亿咖通科技有限公司 | Neural network computation graph processing method, computer storage medium and electronic device |
CN112465122A (en) * | 2020-12-09 | 2021-03-09 | 安徽寒武纪信息科技有限公司 | Device and method for optimizing original dimension operator in neural network model |
Also Published As
Publication number | Publication date |
---|---|
CN113469328B (en) | 2024-03-19 |
Similar Documents
Publication | Title |
---|---|
CN114035916A (en) | Method for compiling and scheduling calculation graph and related product | |
CN116185942A (en) | Data processing method, device, storage medium and electronic equipment | |
CN113469336A (en) | Compiling method and execution method for optimizing neural network model and related products | |
CN113469326B (en) | Integrated circuit device and board for executing pruning optimization in neural network model | |
CN114595813B (en) | Heterogeneous acceleration processor and data computing method | |
CN113469337B (en) | Compiling method for optimizing neural network model and related products thereof | |
CN116402091A (en) | Hybrid engine intelligent computing method and device for artificial intelligent chip | |
CN115952848A (en) | Convolution operation circuit, compiling method and related product | |
CN113469328B (en) | Device, board, method and readable storage medium for executing revolution passing | |
CN113469327B (en) | Integrated circuit device for performing rotation number advance | |
CN112948001A (en) | Method for setting tensor hardware configuration, readable storage medium and device | |
CN115840894A (en) | Method for processing multidimensional tensor data and related product thereof | |
CN115081600A (en) | Conversion unit for executing Winograd convolution, integrated circuit device and board card | |
CN115081603A (en) | Computing device, integrated circuit device and board card for executing Winograd convolution | |
CN113469365B (en) | Reasoning and compiling method based on neural network model and related products thereof | |
CN113792867B (en) | Arithmetic circuit, chip and board card | |
WO2022063183A1 (en) | Device and method for neural network computing, and board and readable storage medium | |
CN113742266B (en) | Integrated circuit device, electronic apparatus, board and computing method | |
CN114692846A (en) | Data processing device, data processing method and related product | |
CN114692845A (en) | Data processing device, data processing method and related product | |
CN115081602A (en) | Computing device, integrated circuit device and board card for executing Winograd convolution | |
CN115599738A (en) | Method for optimizing neural network model and related product | |
CN114692811A (en) | Device and board card for executing Winograd convolution | |
CN114692848A (en) | Device and board card for obtaining convolution result | |
CN115438777A (en) | Device for performing Winograd convolution forward transform on neuron data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||