
WO2021057720A1 - Neural network model processing method and apparatus, computer device, and storage medium - Google Patents

Neural network model processing method and apparatus, computer device, and storage medium

Info

Publication number
WO2021057720A1 (application PCT/CN2020/116816)
Authority
WO
WIPO (PCT)
Prior art keywords
split, operator, state, target, tensor data
Application number
PCT/CN2020/116816
Other languages
English (en)
French (fr)
Inventor
张潇
周玉松
孟小甫
Original Assignee
安徽寒武纪信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by 安徽寒武纪信息科技有限公司
Priority to US17/622,709 (published as US20220391678A1)
Priority to EP20868455.5A (published as EP4036803A4)
Publication of WO2021057720A1


Classifications

    • G06N3/02 Neural networks (under G Physics; G06 Computing; Calculating or counting; G06N Computing arrangements based on specific computational models; G06N3/00 Computing arrangements based on biological models)
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/045 Combinations of networks
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/084 Backpropagation, e.g. using gradient descent (under G06N3/08 Learning methods)
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management (under Y02D Climate change mitigation technologies in information and communication technologies [ICT])

Definitions

  • This application relates to the field of computer technology, and in particular to a neural network model processing method, device, computer equipment and storage medium.
  • multi-core processors based on the memory sharing model have become the mainstream architecture of current processors.
  • This multi-core architecture and the vector processing capabilities in each core can also be applied to neural network calculations.
  • data parallelism can usually be used to make full use of the additional hardware resources brought about by the multi-core processor architecture, that is, each processor core can execute calculations on the same neural network model with different data at the same time.
  • the multi-core processor structure cannot use this parallel method to process small-batch, low-latency neural network computing tasks in inference scenarios. How to unify data parallelism and neural network model parallelism so as to make full use of the hardware resources of the multi-core processor is therefore a technical problem that urgently needs to be solved.
  • the embodiment of the present invention provides a neural network model processing method, device, computer equipment and storage medium.
  • the multi-core processor can directly call the calculation library under the single-core architecture, making full use of the hardware resources of the multi-core processor and avoiding the extra workload of re-implementation.
  • an embodiment of the present application provides a neural network model processing method, which is applied to a multi-core artificial intelligence processor, and the method includes:
  • the target operator is split according to the target split path, so as to be allocated to corresponding cores of the multi-core artificial intelligence processor for processing.
  • an embodiment of the present application provides a neural network model processing device, which includes a unit for executing the method of the first aspect. Specifically, the device is applied to a multi-core artificial intelligence processor.
  • the above devices include:
  • a determining unit configured to determine a split state set of tensor data associated with the target operator according to the target operator in the calculation graph corresponding to the neural network model;
  • a split path determining unit configured to traverse the split state set and determine a split path of the tensor data of the target operator between adjacent split state sets;
  • a target split path determining unit configured to determine a target split path of the tensor data of the target operator according to the weight of the split path;
  • the processing unit is configured to split the target operator according to the target split path, so as to be allocated to the corresponding core of the multi-core artificial intelligence processor for processing.
  • an embodiment of the present application provides a chip, and the chip includes the neural network model processing device provided in the second aspect.
  • an embodiment of the present application provides a computer device that includes the chip provided in the third aspect or the neural network model processing device provided in the second aspect.
  • an embodiment of the present application provides a computer device, including a processor and a memory connected to each other, wherein the processor includes a general-purpose processor and an artificial intelligence processor, the memory is used to store a computer program that supports the computer device in executing the above method, the computer program includes program instructions, and the processor is configured to invoke the program instructions to execute the method of the above first aspect.
  • an embodiment of the present application provides a computer-readable storage medium that stores a computer program, and the computer program includes program instructions that, when executed by a processor, cause the processor to execute the method of the first aspect described above.
  • an embodiment of the present application provides a computer program product, wherein the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute part or all of the steps described in the method of the first aspect of the embodiments of the present application.
  • the computer program product may be a software installation package.
  • the computer device splits the neural network computing task into several smaller sub-computing tasks, so that the multi-core processor can directly call the computing library under the single-core architecture, making full use of the hardware resources of the multi-core processor and avoiding the extra workload of re-implementation.
  • the computer device can adjust the split states in the split state set of the tensor data associated with the operator through the glue operator, and determine the target optimization path based on the updated split state set. The extra cost brought by the glue operator and the parallel efficiency of the different splitting methods of the operator are weighed together in the decision, yielding an optimal splitting scheme based on the entire neural network, which can improve the execution efficiency of the computer device.
  • FIG. 1A is a schematic structural diagram of a multi-core processor provided by an embodiment of the present application.
  • FIG. 1B is a schematic structural diagram of a software stack of an artificial intelligence processor provided by an embodiment of the present application.
  • FIG. 2 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a neural network processing method provided by an embodiment of the present application.
  • FIG. 4 is a calculation graph of a neural network convolution operator provided by an embodiment of the present application.
  • FIG. 5A is a schematic diagram obtained by splitting according to the N dimension of the input data.
  • FIG. 5B is a schematic diagram obtained by splitting according to the C dimension of the output data.
  • FIG. 5C is a schematic diagram obtained by splitting according to the C dimension of the input data.
  • FIG. 5D is a schematic diagram obtained by splitting according to the H dimension of the input data.
  • FIG. 5E is a schematic diagram obtained by splitting according to the W dimension of the input data.
  • FIG. 5F is a schematic structural diagram of a face recognition neural network model provided by an embodiment of the present application.
  • FIG. 5G is a schematic structural diagram of a neural network model for license plate character recognition provided by an embodiment of the present application.
  • FIG. 5H is an abstract schematic diagram of a neural network model provided by an embodiment of the present application.
  • FIG. 6A is an abstract schematic diagram of a serial neural network model provided by an embodiment of the present application.
  • FIG. 6B is a schematic diagram of a split method for adjusting tensor data through a glue operator according to an embodiment of the present application.
  • FIG. 6C is a schematic diagram of the semantics of a concat operator provided by an embodiment of the present application.
  • FIG. 6D is a schematic diagram of the semantics of a split operator provided by an embodiment of the present application.
  • FIG. 6E is an abstract schematic diagram of a neural network model after inserting a glue operator according to an embodiment of the present application.
  • FIG. 6F is an abstract schematic diagram of another neural network model after inserting a glue operator provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a neural network processing device provided by an embodiment of the present application.
  • the term "if" can be interpreted as "when", "once", "in response to determining", or "in response to detecting", depending on the context.
  • similarly, the phrase "if it is determined" or "if [the described condition or event] is detected" can be interpreted, depending on the context, as "once it is determined", "in response to determining", "once [the described condition or event] is detected", or "in response to detecting [the described condition or event]".
  • the so-called data parallelism refers to dividing data into several blocks and mapping them to different processors, and each processor runs the same processing program to process the allocated data.
  • most of the parallel processing uses this processing method, especially for problems with high computational complexity, such as fluid mechanics calculations, image processing, and so on.
  • data parallelism can be applied to large-scale neural network parallel training.
  • the core of data parallelism is to use multiple processors to simultaneously train the same neural network model.
  • each processor obtains the data used in this iteration from the data set, completes a round of inference and training calculation for the entire network, and returns the gradient data calculated in this round to update the model.
  • after the server maintaining the weights receives the gradients from all processors, it uses these gradients to update the model data.
  • the key to data parallelism lies in the batch size of the data to be processed in each iteration: the larger the batch, the more processors the data can be divided across for parallel processing.
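As a concrete illustration of this scheme (not taken from the patent; `TinyModel` and its methods are hypothetical stand-ins), the following minimal Python sketch shows a batch divided across four "cores" that all run the same model, with the gradients gathered for a single weight update:

```python
import numpy as np

class TinyModel:
    """Hypothetical stand-in model: y = x @ w with a 0.5 * ||y||^2 loss."""
    def __init__(self, dim):
        self.w = np.random.randn(dim)

    def grad(self, x):
        # gradient of the loss with respect to w, for one data shard
        return x.T @ (x @ self.w) / len(x)

    def update(self, g, lr=0.01):
        self.w -= lr * g

model = TinyModel(dim=4)
batch = np.random.randn(64, 4)

# Data parallelism: each core processes its own shard with the same model,
# then the per-core gradients are aggregated for one update of the weights.
shards = np.array_split(batch, 4)        # divide the batch into blocks
grads = [model.grad(s) for s in shards]  # per-core computation (in parallel)
model.update(sum(grads) / len(grads))    # gradient aggregation and update
```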
  • model parallelism is another neural network parallel calculation method besides data parallelism.
  • model parallelism is to distribute the computational load to different processors by dividing the parameters of the neural network model.
  • the most common structure currently adopted by multi-core processors is a multi-core structure based on storage sharing.
  • the processor contains multiple computing cores, each with an independent cache, register file, computing unit, and instruction control unit, and all computing cores share the same global storage.
  • a single core is sufficient to complete any complex logic calculation task, but its performance is limited by Moore's Law and chip technology.
  • multiple computing cores are introduced into the processor, and they can be used to process computing tasks with a high degree of parallelism.
  • the shared storage multi-core structure is a classic multi-core structure, and it is very suitable for data-parallel neural network training methods.
  • Each core can be used as a processor in data parallel, read different data respectively, and then complete the forward and reverse calculations of the network model in parallel. In the calculation phase, each core can still maintain its good performance-to-power ratio under the previous single-core architecture. At the same time, the throughput of the entire system can also increase with the expansion of the number of cores.
  • the original operator before the split and several sub-operators after the split are all operators supported by the artificial intelligence processor.
  • the original tensor data is also split, along with the operator, into several new pieces of sub-tensor data. Reflected in the calculation graph, the original calculation graph containing a single operator is refined into a calculation graph containing more operators that can be executed in parallel.
  • operator splitting is not entirely limited to splitting model parameters, and data parallelism is also used to split data.
  • This method actually blurs the boundary between model parallelism and data parallelism.
  • take the convolution operator as an example: if the input data and the weights of the convolution operator are both treated as tensor data of equal standing in the calculation graph, then data parallelism divides the calculation based on dividing the input data, while model parallelism divides the calculation based on dividing the weights; both achieve the division of the computational load by dividing the tensor data associated with the convolution operator. From this perspective, data parallelism and model parallelism are unified.
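The following numpy sketch (illustrative only; the shapes are made up) shows the two divisions side by side: data parallelism splits the convolution input along N, model parallelism splits the weights along the output-channel dimension, and either way the operator's load is divided by splitting one of its associated tensors:

```python
import numpy as np

x = np.random.rand(8, 16, 32, 32)  # input tensor (N, C, H, W)
w = np.random.rand(32, 16, 3, 3)   # weight tensor (OC, IC, KH, KW)

# Data parallelism: divide the *input* along N; every core keeps full weights.
x_shards = np.array_split(x, 2, axis=0)  # two sub-tensors of shape (4, 16, 32, 32)

# Model parallelism: divide the *weights* along OC; every core keeps the input.
w_shards = np.array_split(w, 2, axis=0)  # two sub-tensors of shape (16, 16, 3, 3)

# In both cases the convolution's computational load has been divided by
# splitting one of the tensors associated with the operator.
```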
  • the artificial intelligence processor is also called a dedicated processor.
  • the artificial intelligence processor refers to a processor for a specific application or field.
  • typical examples include the GPU (Graphics Processing Unit), also known as the display core, visual processor, or display chip, and the NPU (Neural Processing Unit, also called a Neural Network Processor).
  • Caffe (Convolutional Architecture for Fast Feature Embedding) supports multiple types of deep learning architectures oriented toward image classification and image segmentation, and can also support Convolutional Neural Networks (CNN), region-based convolutional neural networks for target detection (Region-CNN, RCNN), and Long Short-Term Memory (LSTM) networks.
  • the Caffe framework may support multiple types of basic operators.
  • the multiple types of basic operators involved here may include common neural network operators.
  • common neural network operators include: convolution/deconvolution operators, pooling operators, activation operators, softmax (classifier) operators, and fully connected operators.
  • activation operators can include but are not limited to ReLU, Sigmoid, Tanh, and other operators that can be implemented by interpolation.
  • in a broad sense, performing any specific operation on a function can be regarded as an operator.
  • the functions under the Caffe framework may include: Caffe Blob function, Caffe Layer function, and Caffe Net function.
  • Blob is used to store, exchange, and process data and derivative information of forward and reverse iterations in the network
  • Layer is used to perform calculations, which can include operations such as convolution (convolve), pooling (pool), and inner product (inner product), as well as non-linear operations such as rectified-linear and sigmoid; it may also include element-level data transformations, normalization (normalize), data loading (load data), classification (softmax), and loss calculations (losses) such as hinge loss.
  • each Layer defines three important operations: setup, forward, and backward. Among them, setup is used to reset the layer and the connections between layers when the model is initialized; forward receives input data from the bottom layer and, after calculation, sends the output to the top layer; backward receives the gradient with respect to the top-layer output, computes the gradient with respect to the input, and passes it down to the bottom layer.
  • Layer may include DataLayer, ConvolutionLayer, PoolingLayer, InnerProductLayer, ReLULayer, SigmoidLayer, LRNLayer, DropoutLayer, SoftmaxWithLossLayer, SoftmaxLayer, AccuracyLayer, etc.
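To make the setup/forward/backward contract concrete, here is a minimal Python sketch of a Caffe-style layer; it illustrates the interface described above, not Caffe's actual C++ API, and the class, shapes, and method signatures are hypothetical:

```python
import numpy as np

class InnerProductLayer:
    """Sketch of a Caffe-style layer: setup wires the layer at model
    initialization, forward maps bottom (input) to top (output), and
    backward turns the top gradient into the bottom gradient."""

    def setup(self, in_dim, num_output):
        self.W = np.zeros((num_output, in_dim))  # weights (re)set at init

    def forward(self, bottom):
        self.bottom = bottom                     # cache input from the bottom
        return bottom @ self.W.T                 # output sent to the top

    def backward(self, top_grad):
        self.dW = top_grad.T @ self.bottom       # gradient w.r.t. the weights
        return top_grad @ self.W                 # input gradient to the bottom

layer = InnerProductLayer()
layer.setup(in_dim=8, num_output=3)
out = layer.forward(np.random.randn(5, 8))      # shape (5, 3)
grad_in = layer.backward(np.ones_like(out))     # shape (5, 8)
```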
  • Net starts with the data layer, that is, loads data from the disk, and ends with the loss layer, that is, calculates the objective function of tasks such as classification and reconstruction.
  • Net is a directed acyclic calculation graph composed of a series of layers. Caffe retains all the intermediate values in the calculation graph to ensure the accuracy of forward and reverse iterations.
  • the software stack structure 10 includes an artificial intelligence application 100, an artificial intelligence framework 102, an artificial intelligence learning library 104, an artificial intelligence runtime library 106, and a driver 108. Let's elaborate on it in detail:
  • the artificial intelligence application 100 corresponds to different application scenarios and provides corresponding artificial intelligence algorithm models.
  • the algorithm model can be directly analyzed by the programming interface of the artificial intelligence framework 102.
  • the artificial intelligence algorithm model is converted into binary instructions through the artificial intelligence learning library 104; the artificial intelligence runtime library 106 is called to convert the binary instructions into artificial intelligence learning tasks; the artificial intelligence learning tasks are placed in a task queue; and the artificial intelligence learning tasks in the task queue are scheduled by the driver 108 to be executed by the underlying artificial intelligence processor.
  • the artificial intelligence runtime library 106 can also be directly called to run the offline running files that have been solidified and generated previously, reducing the intermediate overhead of the software architecture and improving the operating efficiency.
  • the artificial intelligence framework is the first layer in the entire deep learning ecosystem.
  • in early frameworks such as Caffe, the Layer was regarded as the basic element for building neural networks.
  • later artificial intelligence frameworks such as TensorFlow and MXNet, although they use different names such as Operator, remain similar to Caffe's Layer in core idea: they all divide neural network calculation into various operators oriented toward common tensor data.
  • the artificial intelligence framework needs to map the deep learning tasks expressed by the neural network's calculation graph structure into instructions and data that a CPU or an artificial intelligence processor can execute.
  • the artificial intelligence framework uses operators as specific elements to implement computing tasks, and provides each operator with a kernel function (Kernel) executed on the CPU or artificial intelligence processor.
  • the artificial intelligence framework schedules and executes the kernel function corresponding to each operator in the calculation graph to complete the calculation of the entire neural network.
  • the problem of data parallelism is that its scalability depends on the size of the processed data batch. Although this is not usually a problem in the training phase, it is difficult to guarantee this premise in the inference phase.
  • for neural network models used in real-time service fields, including video surveillance and autonomous driving, the data to be processed usually arrives serially in a stream, resulting in a small scale of data processed each time, often even a single picture.
  • in this case, data parallelism cannot provide any degree of parallelism, and all work tasks are concentrated on a single core, which prevents the computing resources brought by multiple cores from being converted into speed in processing tasks.
  • after completing the training of the neural network model offline using the data set, the model is deployed to a server in the cloud to process data sent from the outside world.
  • the application scenario will change from offline training to online reasoning.
  • at this time, a very important indicator is latency, that is, the time from when the server receives the data to be processed until the processed result is returned, or more specifically, the time taken to process the data using the neural network model.
  • low latency ensures that the cloud server can respond to the data sent by the client within the shortest time, and in some more sensitive scenarios it directly determines whether a solution is viable. Therefore, the requirement on artificial intelligence processors in the online inference stage changes from processing large batches of data with high throughput to processing small batches of data with low latency.
  • the deep learning artificial intelligence processor adapts its own hardware design to adapt to the data parallel characteristics of the deep learning algorithm itself and improves the computational throughput.
  • the artificial intelligence processor often needs a sufficient data scale to achieve high computational efficiency, and further splitting within an operator reduces the calculation scale on each core. When the split reaches a certain granularity, the loss of computational efficiency on each core exceeds the benefit of the increased degree of parallelism. Therefore, between split parallelism and computational efficiency, sufficient parallelism must be provided while ensuring sufficient computational efficiency on each core.
  • the neural network model can be seen as a complex calculation graph composed of hundreds or even thousands of operators.
  • the algorithm logic in different types of operators is different, which leads to different methods of splitting these operators.
  • the splitting of each operator must, in addition to balancing its own calculation efficiency and degree of parallelism, also consider its combination with the preceding and following operators, and even its impact on the network as a whole.
  • the rapid development of deep learning has brought more and more large-scale and complex networks, and it is unrealistic to find a good parallel method manually. Therefore, an automated method is needed to ensure that a good splitting and parallelization strategy can be given for different networks.
  • one operator is split into multiple smaller-scale sub-operators, so that the computing library under the single-core architecture can be directly called, avoiding the extra workload of re-implementation.
  • for example, after an activation operator is split, many smaller activation operators are obtained, which means that each sub-task can be completed simply by calling the original single-core activation function on multiple cores, without the need to modify or re-implement a multi-core version of the activation function.
  • it is necessary to take into account the calculation efficiency and degree of parallelism of each operator itself after the split, and also to consider the mutual cooperation of preceding and following operators in the split. The ultimate goal is to obtain a splitting and parallelization scheme that effectively reduces the end-to-end inference latency of the entire neural network model.
  • the neural network processing method can avoid modifying the single-core processor calculation library as much as possible, and at the same time can realize the parallel execution of the neural network model on the multi-core processor.
  • the upper framework divides the operator in the neural network model into several sub-operators that can be executed in parallel.
  • the deep learning framework calls the computing library to generate the sub-operator that is executed on a single core.
  • by loading the machine instructions of the sub-operators onto different cores, the parallel calculation of the operator on the multi-core processor is realized.
  • the deep learning framework can use a single-core processor calculation library to generate calculation instructions for sub-operators
  • the input and output tensor data of the operators in the neural network model are likewise split into corresponding sub-tensor data as the operators are split into sub-operators.
  • FIG. 2 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • the computer device 20 may include a general-purpose processor 201, a memory 202, a communication bus 203, a communication interface 204, and at least one artificial intelligence processor 205.
  • the general-purpose processor 201 and the artificial intelligence processor 205 are connected to the memory 202 and the communication interface 204 through the communication bus 203.
  • the general-purpose processor 201 may be a central processing unit (Central Processing Unit, CPU); it may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor 201 may be a microprocessor or the general-purpose processor 201 may also be any conventional processor or the like.
  • the general-purpose processor 201 may also be an integrated circuit chip with signal processing capability. In the implementation process, each step of the neural network processing method of the present application can be completed by the integrated logic circuit of hardware in the general-purpose processor 201 or instructions in the form of software.
  • the memory 202 may be a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or other memories.
  • the memory 202 is used to store data and various software programs, such as a program for splitting the neural network model according to the determined target splitting path in the embodiment of the present application.
  • the memory may include a physical device for storing information, and the information is usually digitized and then stored in a medium using electrical, magnetic, or optical methods.
  • the memory described in this embodiment may also include: devices that use electrical energy to store information, such as RAM and ROM; devices that use magnetic energy to store information, such as hard disks, floppy disks, magnetic tapes, magnetic core memories, bubble memories, and USB flash drives; devices that use optical means to store information, such as CDs or DVDs; as well as quantum memories, graphene memories, and so on.
  • the communication interface 204 uses a transceiving device such as but not limited to a transceiver to implement communication between the computer device 20 and other devices or a communication network. For example, a model file sent by another device can be received through the communication interface 204.
  • the artificial intelligence processor 205 can be mounted on a host CPU (Host CPU) as a coprocessor, and the host CPU allocates tasks to it. In practical applications, the artificial intelligence processor 205 can implement one or more operations. For example, taking a neural network processor (Network Processing Unit, NPU) as an example, the core part of the NPU is an arithmetic circuit, and a controller controls the arithmetic circuit to extract matrix data from the memory 202 and perform multiplication and addition operations.
  • the artificial intelligence processor 205 may include 8 clusters, and each cluster includes 4 artificial intelligence processor cores.
  • the artificial intelligence processor 205 may be an artificial intelligence processor with a reconfigurable architecture.
  • a reconfigurable architecture means that, if an artificial intelligence processor can use reusable hardware resources to flexibly change its own architecture according to different application requirements, so as to provide an architecture matched to each specific application requirement, then this artificial intelligence processor is called a reconfigurable computing system, and its architecture is called a reconfigurable architecture.
  • the computer device 20 is only an example provided by the embodiment of the present application, and the computer device 20 may have more or fewer components than the components shown, may combine two or more components, or may have Different configurations of components are realized.
  • based on the schematic structural diagram of the computer device shown in FIG. 2, the following, with reference to the schematic flowchart of a neural network processing method provided by an embodiment of the present application shown in FIG. 3, specifically explains how the target operator is split in the embodiment of the present application, so as to optimize the computing process on the cores of the artificial intelligence processor.
  • the following takes Caffe as an example for a detailed description, which may include but is not limited to the following steps:
  • Step S300: Determine the split state set of the tensor data associated with the target operator according to the target operator in the neural network model.
  • the target operator may be a corresponding target layer in the neural network model, the target layer is at least one layer in the neural network model, and the tensor data includes input tensor data and Output tensor data.
  • the neural network model may receive input data, and generate a prediction output according to the received input data and current model parameters.
  • the neural network model can be a regression model, a deep neural network (Deep Neural Network, DNN), a convolutional neural network (Convolutional Neural Network, CNN), a recurrent neural network (Recurrent Neural Network, RNN), etc.
  • the embodiments of this application do not make specific limitations.
  • the input neurons and output neurons of the multi-layer operation do not refer to the neurons in the input layer and the neurons in the output layer of the entire neural network model.
  • instead, for any two adjacent layers in the network, the neurons in the lower layer of the forward operation are the input neurons, and the neurons in the upper layer of the forward operation are the output neurons.
  • taking the K-th layer and the K+1-th layer as an example, the K-th layer is called the input layer, and the neurons in it are the input neurons; the K+1-th layer is called the output layer, and the neurons in it are the output neurons. That is, except for the top layer, each layer can serve as an input layer, and the next layer is the corresponding output layer.
  • an operator refers to a function that implements a certain specific function.
  • for example, the reshape operator is used to reinterpret the shape of tensor data.
  • as another example, the transpose operator is used to adjust the dimensional order of tensor data.
  • the directed acyclic graph refers to adding the restriction of acyclic on the basis of the directed graph.
  • the directed edge can be used to characterize the connection relationship between the operator and the operator, and can also be used to characterize the execution sequence of the artificial intelligence processor when the neural network model is executed.
  • the split states in the split state set of the input tensor data of the target operator are determined according to the operation logic of the target operator and the split states in the split state set of the corresponding output tensor data; likewise, the split states in the split state set of the output tensor data of the target operator are determined according to the operation logic of the operator and the split states in the split state set of the corresponding input tensor data.
  • the neural network model can usually be regarded as a directed acyclic graph (DAG, Directed acyclic graph) composed of operators and multi-dimensional tensor data.
  • operators and tensor data are connected to each other through directed edges, and the direction of a directed edge indicates whether the data is the input or the output of the operator.
  • the deep learning framework uniformly chooses to use the splitting methods of tensor data associated with the operators to illustrate the splitting methods of different operators.
  • different operators support different calculation logic and accordingly have different splitting strategies.
  • in a specific implementation, the computer device can determine the splitting mode of the operator according to the operator type, so that the split states in the split state set can be obtained. Specifically, please refer to Table 1:
  • the splitting methods supported by different types of operators are different.
  • in this way, the operator can be split in a targeted manner based on the characteristics of the operator, thereby avoiding the negative impact caused by unreasonable splitting methods, for example, increased resource consumption of the computer device, or time-consuming problems caused by an unbalanced scale of the sub-operators after splitting.
  • the different splitting methods of the convolution operator can be described as the following five types; these conditions can intersect and exist at the same time, ensuring a sufficient degree of splitting:
  • the neural network model has a hierarchical structure, as shown in FIG. 4, which is a schematic diagram of an original calculation diagram of a convolution operator provided in an embodiment of the present application.
  • the convolution operator conv contains input data (input) in 4 dimensions, and under the action of the weight matrix, the output data (output) can be obtained.
  • the convolution operator on the calculation graph provided in this embodiment of the present application has multiple splitting methods under the condition of a parallelism of 2.
  • FIG. 5A is a schematic diagram obtained by splitting according to the N dimension of the input data;
  • FIG. 5B is a schematic diagram obtained by splitting according to the C dimension of the output data;
  • FIG. 5C is a schematic diagram obtained by splitting according to the C dimension of the input data;
  • FIG. 5D is a schematic diagram obtained by splitting according to the H dimension of the input data;
  • FIG. 5E is a schematic diagram obtained by splitting according to the W dimension of the input data.
  • each tensor data in the figure gives the starting point and end point of each dimension, which is used to clarify the relationship between the split sub-tensor data and the original tensor data.
  • n represents the input data batch size
  • ic represents the number of input data feature images
  • ih represents the length of the input data feature image
  • iw represents the width of the input data feature image
  • oc represents the number of output data feature images
  • oh represents the length of the output data feature image
  • ow represents the width of the output data feature image
  • kh represents the length of the convolution kernel window
  • kw represents the width of the convolution kernel window.
  • these splitting methods operate on different dimensions and can be combined with each other to form more splitting methods, which can provide sufficient parallelism to utilize the resources of multi-core processors and, at the same time, avoid to a certain extent the excessive splitting of a single dimension that would affect the calculation efficiency of the computer device.
  • the computer device can split the softmax operator in any one or several dimensions other than the dimension normalized by the probability of the softmax operator, and the result will be Several softmax operators that can be executed in parallel.
  • a computer device can allow its input data and output data to be split in any dimension.
  • when the input data of an activation operator is divided into several sub-blocks (for consistency, the output data is divided in the same way), which may be expressed as input0, input1, input2, ..., inputm-1 and output0, output1, output2, ..., outputm-1, then in the calculation stage the entire activation operator is actually split into m smaller activation operators that have no dependencies on each other and can run on multiple cores.
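A minimal sketch of this idea, assuming ReLU as the activation and numpy chunks as the sub-blocks (the function names are hypothetical), shows that each sub-operator only re-invokes the original single-core kernel:

```python
import numpy as np

def relu(x):                       # stand-in for the single-core activation kernel
    return np.maximum(x, 0.0)

def split_activation(x, m):
    """Split one activation operator into m independent sub-operators,
    each of which simply calls the original single-core kernel."""
    inputs = np.array_split(x, m)              # input0 ... input(m-1)
    outputs = [relu(part) for part in inputs]  # would run on m cores in parallel
    return np.concatenate(outputs)             # stitch output0 ... output(m-1)

x = np.random.randn(1024)
assert np.allclose(split_activation(x, 4), relu(x))  # same result as the unsplit op
```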
  • when determining the split state set of the tensor data associated with the target operator, the split state set may take the following forms:
  • the neural network model contains a variety of different types of operators, and these operators can be split in any dimension.
  • the computer device can determine the split states in the split state set according to the split mode corresponding to each of the different operators.
  • the neural network model has a hierarchical structure.
  • the face recognition neural network model contains a variety of different types of operators (convolution operator, pooling operator, fully connected operator), where the connection relationship between the operators is: convolutional layer 1 - pooling layer 1 - convolutional layer 2 - pooling layer 2 - fully connected layer 1 - fully connected layer 2. Since these operators allow splitting in any dimension, the computer device can determine the split states in the split state set according to the splitting method corresponding to each operator.
  • alternatively, the neural network model contains many different types of operators, some of which allow splitting in any dimension while others only support splitting in limited dimensions. In this case, the computer device can determine the splitting methods corresponding to these multiple different operators, and then determine those splitting methods as the split states in the split state set.
  • alternatively, the neural network model contains many different types of operators, some of which allow splitting in any dimension while others only support splitting in limited dimensions. In this case, the computer device can determine the splitting methods corresponding to these multiple different operators, and then determine the intersection of the splitting methods supported by each of the multiple operators as the split states in the split state set.
  • the neural network model has a hierarchical structure.
  • the license plate character recognition neural network model contains a variety of different types of operators (convolution operator, pooling operator, activation operator, softmax operator, etc.), where the connection relationship between the operators is: convolutional layer 1 - activation function Relu - maximum pooling layer 1 - convolutional layer 2 - activation function Relu - maximum pooling layer 2 - convolutional layer 3 - activation function Relu - maximum pooling layer 3 - convolutional layer 4 - activation function - maximum pooling layer 4 - convolutional layer 5 - activation function - maximum pooling layer 5 - fully connected layer 1 - softmax layer - output layer.
  • the neural network model contains many different types of operators, some of which do not support any form of splitting at all, while the other operators in the neural network model must keep the split format of their data consistent with them; in this case, the neural network model is not split, and we consider this state to be the non-split state.
  • in this way, the negative effects of unreasonable splitting methods can be avoided, for example, increased resource consumption of the computer device, or time-consuming problems caused by an unbalanced scale of the sub-operators after splitting.
  • we regard any split of a piece of tensor data as a split state s of that tensor data; after the split, a set of sub-tensor data is obtained, and the split state s is characterized by this corresponding sub-tensor data set. All possible split states {s0, s1, s2, ...} constitute the split state set S of the tensor data.
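One possible in-memory representation of these notions (purely illustrative; the patent does not prescribe data structures) records each sub-tensor by the start and end points of every dimension it covers, as in FIGS. 5A-5E:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class SubTensor:
    """A sub-tensor, recorded by (start, end) per dimension of the original."""
    ranges: Tuple[Tuple[int, int], ...]  # ((n0, n1), (c0, c1), (h0, h1), (w0, w1))

@dataclass(frozen=True)
class SplitState:
    """One split state s: the set of sub-tensors that tile the tensor."""
    parts: Tuple[SubTensor, ...]

# Example: an NCHW tensor of shape (8, 16, 32, 32) split in two along N.
s = SplitState(parts=(
    SubTensor(ranges=((0, 4), (0, 16), (0, 32), (0, 32))),
    SubTensor(ranges=((4, 8), (0, 16), (0, 32), (0, 32))),
))
```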
  • this is generally a very large state space, which means that the space of possible operator splitting modes represented by the split states of the tensor data is also very large.
  • the computer device can prune the state space of the tensor data under at least one pruning condition that has been set to reduce the state space.
  • the pruning conditions may include but are not limited to: (1) ensuring that the scale of the sub-operators after splitting is balanced; through this condition, the unbalanced split states can be removed from the state space S of the tensor.
  • the reason for ensuring that the scale of the sub-operators after splitting is balanced is as follows: the latency for the multi-core processor to complete the calculation of an operator depends on the core that takes the longest time to execute its sub-task, and since the cores of the multi-core architecture are equal to one another in hardware structure, the time consumed on each core depends on the task load assigned to it. Under the condition that the scale of the split sub-operators is balanced, it can be ensured that the time consumption of each core in the multi-core structure is roughly equal, so that the execution efficiency of the computer device can be improved.
  • (2) ensuring that the number of sub-operators after splitting is an integer power of 2; through this condition, the split states whose sub-operator counts do not match the core count can be removed from the state space S of the tensor.
  • the reason for ensuring that the number of sub-operators after splitting is an integer power of 2 is that the number of cores in a multi-core processor architecture is usually an integer power of 2, such as 1, 2, 4, 8, 16, and so on.
  • in actual applications, the number of sub-operators after splitting should therefore be kept to an integer power of 2. It is understandable that when the computer device applies at least one of the above pruning conditions, it can adjust the split states in the state space S to remove some unreasonable split states, narrowing the search space of splitting strategies while avoiding the negative effects of unreasonable splitting methods, for example, increased resource consumption of the computer device or time-consuming problems caused by an unbalanced scale of the sub-operators after splitting.
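A sketch of such a pruning pass, reusing the hypothetical SplitState/SubTensor representation from the earlier sketch, might look like this:

```python
def part_size(p):
    """Number of elements covered by a SubTensor (see the sketch above)."""
    size = 1
    for lo, hi in p.ranges:
        size *= hi - lo
    return size

def is_power_of_two(n: int) -> bool:
    return n > 0 and (n & (n - 1)) == 0

def prune_states(states, max_imbalance=0):
    """Keep only the split states whose sub-tensors are balanced in size and
    whose sub-tensor count is an integer power of 2."""
    kept = []
    for state in states:
        sizes = [part_size(p) for p in state.parts]
        if max(sizes) - min(sizes) <= max_imbalance and is_power_of_two(len(sizes)):
            kept.append(state)
    return kept
```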
  • not every selection of split states of the tensor data associated with an operator represents an effective splitting method of the operator. First, the dimension along which the tensor data is split should be supported by the operator; for example, the softmax (normalized exponential) operator cannot be split along the dimension over which its probability normalization is performed.
  • the splitting of the input tensor and output tensor of the operator should satisfy the calculation logic of the operator.
  • for example, the start and end points of each sub-block obtained by splitting the output data of the convolution operator in the H/W dimension should be such that the corresponding sub-blocks of the input data in the H/W dimension are calculated according to the convolution kernel and stride of the convolution operator; the split of the convolution operator's input data in the C dimension should be exactly the same as the split of the weight data in the C dimension, and the split of the output data in the C dimension should be exactly the same as the split of the weight data in the N dimension.
  • in a specific implementation, the output state is used to infer the input state of the operator backward according to the specific logic of each operator, or the input state is used to infer the output state of the operator forward. This ensures that the split states of the related data always represent an effective operator splitting method.
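For the H/W constraint on the convolution operator described above, a small helper (hypothetical; padding handling simplified) can compute the input range that an output sub-block depends on, given the kernel size and stride:

```python
def conv_input_range(out_lo, out_hi, kernel, stride, pad=0):
    """Given an output sub-block [out_lo, out_hi) along H or W, return the
    input range it depends on, per the kernel/stride relation of the
    convolution operator."""
    in_lo = out_lo * stride - pad
    in_hi = (out_hi - 1) * stride - pad + kernel
    return in_lo, in_hi

# Splitting a height-32 output in half for kernel=3, stride=1: the two input
# sub-blocks overlap by kernel - stride = 2 rows.
print(conv_input_range(0, 16, kernel=3, stride=1))   # (0, 18)
print(conv_input_range(16, 32, kernel=3, stride=1))  # (16, 34)
```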
  • Step S302: Traverse the split state sets, and determine split paths of the tensor data of the target operator between adjacent split state sets.
  • the splitting scheme P of the entire neural network model can be regarded as a series of jumps, for each operator, from a split state in the split state set of its input tensor data to a split state in the split state set of its output tensor data.
  • the split state of the output tensor of the previous operator is the split state of the input tensor of the next operator.
  • each possible jump through an operator corresponds to an effective way of splitting the operator; therefore, a split path can represent the split mode of the operator.
  • the calculation logic of the operator is split through the split manner corresponding to the split path to obtain the corresponding set of sub-operators.
  • the state of the input tensor data and the state of the corresponding output tensor data are connected by a split path, and the sub-tensor data set representing a split state of the input tensor data is processed by the sub-operator in the sub-operator set to obtain The sub-tensor data set corresponding to the split state of the output tensor data.
  • the path is used to represent the intermediate process from the input to the output of the operator.
  • the time used when the operator is executed in parallel on the multi-core processor in a certain split state can be characterized as a weight.
  • the calculation time for a multi-core processor to complete an operator depends on the time of the core that takes the longest time to execute the split sub-calculation task.
  • the weight value of each split path can be determined through the following steps A1-A4:
  • A1. Determine the calculation loads c1, c2, ..., cn of the n sub-operators after splitting, where ci is calculated according to the type and scale of the i-th sub-operator after splitting;
  • A2. Determine the memory access data amounts d1, d2, ..., dn of the n sub-operators, where di is calculated according to the type and scale of the i-th sub-operator after splitting;
  • A3. Determine the computing throughput rate α of each artificial intelligence processor core, where α is determined by the performance parameters of the artificial intelligence processor itself;
  • A4. Determine the memory access bandwidth β of each artificial intelligence processor core. Since the multiple cores usually share limited memory access bandwidth, β can be taken as the total bandwidth B of the multi-core artificial intelligence processor divided by the number of cores.
  • in a specific implementation, the computer device can calculate the weight value corresponding to each splitting strategy according to the following calculation formula (1):
  • the inner maximum operation in the calculation formula is based on the fact that the calculation part and the memory access part of the operator implementation can hide each other, that is, calculation and memory access can be performed concurrently as much as possible.
  • when multiple cores access memory at the same time, the effective computing throughput of each core decreases, and α can be further modified to make the estimation more accurate.
  • the outer maximum operation in the calculation formula reflects that the time for the multi-core artificial intelligence processor to complete the calculation of an operator depends on the core that takes the longest time to execute its sub-calculation task.
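The body of formula (1) did not survive extraction; based on the surrounding description (an inner maximum between each sub-operator's compute time c_i/α and memory access time d_i/β, and an outer maximum over the cores), it is presumably of the form

$$t = \max_{1 \le i \le n} \max\!\left(\frac{c_i}{\alpha}, \frac{d_i}{\beta}\right) \qquad (1)$$

which the following sketch evaluates (the numbers are made up):

```python
def split_path_weight(c, d, alpha, beta):
    """Weight (time) of a split path per the reconstructed formula (1): each
    core's time is the max of its compute time c_i/alpha and its memory
    access time d_i/beta (the two can overlap), and the operator's time is
    set by the slowest core."""
    return max(max(ci / alpha, di / beta) for ci, di in zip(c, d))

# Four sub-operators; compute loads in FLOPs, memory amounts in bytes:
t = split_path_weight(c=[1e9] * 4, d=[2e8] * 4, alpha=1e12, beta=1e11)
print(t)  # 0.002 s: memory-bound in this example
```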
  • it should be noted that the above method of obtaining the weight of a split path is only a partial list of examples rather than an exhaustive one; those skilled in the art, upon understanding the essence of the technical solution of the present application, may produce other variations or transformations on its basis. For example, the weight of a split path can be not only the time spent executing the sub-tasks but also the throughput of executing the sub-tasks.
  • the weight of the split path can also be determined by actually measuring the time for executing all the subtasks in the operator split mode corresponding to the split path on the multi-core processor.
  • Step S304: Determine the target split path of the tensor data of the target operator according to the weight of the split path.
  • the target split path can be determined by a forward traversal method.
  • correspondingly, the target optimization path can also be determined by a reverse traversal method. Both are elaborated below in detail:
  • determining the target optimization path by forward traversal may include:
  • for each directed edge, the split path from the current split state to the split state of the input tensor data of the target operator is determined according to the weight of the directed edge and the weight of the split path from the initial split state corresponding to the directed edge to the split state of the input tensor data of the target operator, where the weight of a split path is determined according to the weights of all directed edges corresponding to that split path;
  • after traversing all the split state sets of the target operator, the target split path between the split state set of the input tensor data of the target operator and the split state set of the output tensor data of the target operator is obtained.
  • determining the target optimization path by means of reverse traversal may include:
  • for each directed edge, the split path from the current split state to the split state of the output tensor data of the target operator is determined according to the weight of the directed edge and the weight of the split path from the split state corresponding to the end point of the directed edge to the split state of the output tensor data of the target operator, where the weight of a split path is determined according to the weights of all directed edges corresponding to that split path;
  • after traversing all the split state sets of the target operator, the target split path between the split state set of the input tensor data of the target operator and the split state set of the output tensor data of the target operator is obtained.
  • in a specific implementation, the computer device may determine the splitting scheme with the smallest weight value as the target split path of the neural network model.
  • the number of target split paths obtained by the computer device through forward traversal may be one or multiple, which is not specifically limited in the embodiment of the present application.
  • the number of target split paths often needs to be determined in conjunction with a specific neural network model (or target operator).
  • in practical applications, the embodiment of the present application may use any one of the multiple target optimization paths to split the neural network model, or select an optimal target optimization path among the multiple target optimization paths to split the neural network model, so that the multi-core processor runs the split neural network model on the corresponding cores.
  • the computer device may combine the Viterbi algorithm to obtain the target optimization path from FIG. 5H.
  • here, the target optimization path is the path with the smallest weight, that is, the globally shortest path.
  • the Viterbi algorithm is a dynamic programming algorithm used to find the hidden state sequence that is most likely to produce the sequence of observed events.
  • in the embodiment of the present application, the states in the split state sets of the tensor data are regarded as the hidden states in the Viterbi algorithm, the directed edges between split state sets are regarded as the transitions between hidden states, and the weight of each directed edge corresponds to the logarithmic value of the transition probability between the hidden states.
  • the computer device can traverse all operators in the network calculation graph from front to back.
• when accessing the i-th operator, the computer device determines, according to all the directed edges corresponding to the current operator and their weights (specifically, as in Formula 5), the shortest path from each split state in the split state set of the input tensor data of the neural network model to each split state in the split state set of the output tensor data of the current operator.
• when the computer device completes the traversal of all operators, it obtains the shortest paths from the split states in the split state set of the input tensor data of the neural network to each split state in the split state set of the output tensor data; after that, the computer device determines the shortest path in the global scope among these shortest paths, that is, the target optimization path.
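• the sketch below illustrates this Viterbi-style forward traversal in Python for a serial network; the names (state_sets, edge_weights) and the dictionary-based edge encoding are illustrative assumptions, not the data structures of this application:

```python
import math
from collections import defaultdict

def shortest_split_paths(state_sets, edge_weights):
    """state_sets[i]: list of split states of the i-th tensor data;
    edge_weights: maps (i, src, dst) to the parallel execution time of the
    split mode that turns state src of tensor i into state dst of tensor
    i+1 through the i-th operator (the weight of that directed edge)."""
    dist = defaultdict(lambda: math.inf)   # shortest path weight per state
    prev = {}                              # back-pointers for path recovery
    for s in state_sets[0]:
        dist[(0, s)] = 0.0                 # e.g. the non-split input state Sroot
    for i in range(len(state_sets) - 1):   # visit operators front to back
        for src in state_sets[i]:
            for dst in state_sets[i + 1]:
                w = edge_weights.get((i, src, dst))
                if w is None:
                    continue               # operator cannot realize this jump
                if dist[(i, src)] + w < dist[(i + 1, dst)]:
                    dist[(i + 1, dst)] = dist[(i, src)] + w
                    prev[(i + 1, dst)] = src
    last = len(state_sets) - 1
    best = min(state_sets[last], key=lambda s: dist[(last, s)])
    path = [best]                          # globally shortest terminal state
    for i in range(last, 0, -1):           # backtrack to recover the path
        path.append(prev[(i, path[-1])])
    path.reverse()
    return path, dist[(last, best)]
```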
  • the neural network model has a serial structure, and the input tensor data and output tensor data of the entire neural network model are not split.
  • that the input tensor data of the entire neural network model is in a non-split state means that there is one and only one input state in the current set of split states.
  • that the output tensor data of the entire neural network model is in a non-split state means that there is one and only one output state in the current set of split states.
• the goal of the search strategy is to find, for each tensor, a mapping relationship Tensor_i → S_i between the tensor itself and a certain state in its split state set. By giving each tensor data in the neural network model a specific split state, the split mode of all operators can be determined. Therefore, a mapping relationship from all tensor data in a neural network model to their split states is called a splitting scheme P of the network model.
• the i-th operator OP_i uses the input tensor data in the split state s to calculate the output tensor data in the split state r.
  • the specific parallel calculation method is determined by the state of the input tensor data and the output tensor data.
• the calculation time of the operator is recorded as t_{s→r}, and its value depends on the corresponding splitting method and the hardware characteristics of the underlying accelerator.
  • the calculation formula for the delay T of the entire network is:
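• written out under the definitions above, the delay plausibly takes the following form (a hedged reconstruction of the abstract formulas referred to below as Equation 3 and Equation 4, not the exact notation of the original):

```latex
T(P) \;=\; \sum_{i=1}^{n} t_{s_i \rightarrow r_i},
\qquad
P^{*} \;=\; \operatorname*{arg\,min}_{P}\, T(P)
```

where n is the number of operators and (s_i, r_i) are the split states that the scheme P selects for the input and output tensor data of the i-th operator OP_i.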
• the splitting scheme P of the entire network can be regarded as a series of jumps, one per operator, from a state in the state set of the operator's input tensor to a state in the state set of its output tensor.
• each possible jump through an operator corresponds to an effective split method on that operator, and also corresponds to the time t_i of executing the operator's calculation in parallel on a multi-core processor using this split method; therefore t_i can be regarded as the weight of the directed edge from the state of the operator's input tensor to the state of its output tensor.
  • Equation 3 and Equation 4 give this abstract formula.
• the computer device sets the non-split state of the input tensor data of the entire neural network model as the initial state Sroot.
• the weight of the split path of the initial state Sroot is 0, and the weight of the corresponding split path for every state of all the remaining tensor data is +∞.
• any state s of any tensor data in the neural network model has a corresponding split path weight l_s from Sroot to s. Each split state set is visited from front to back, and in each split state set each state s is traversed in turn. Each state s has directed edges e_1, ..., e_ks pointing to several split states in the next split state set; for each such edge, the path weight of the state it points to is updated to the smaller of its current value and l_s plus the weight of the edge.
• after the traversal is completed, the target split path from the non-split state Sroot of the input tensor data of the entire neural network model to the non-split state Send of the output tensor data of the neural network model can be obtained.
• the above describes a path from the non-split state Sroot to the non-split state Send passing through one state in each split state set; such a path is a split path of the neural network model.
• the computer device can select the split path with the smallest weight from the split paths of the neural network model as the target split path of the neural network model.
• the neural network model shown in FIG. 6A is a serial neural network model, and to facilitate the description of the technical solution, the split state sets corresponding to the input tensor data and the output tensor data of the neural network model each contain only the non-split state.
• when the split state set of the output tensor data of the neural network model is not the single non-split state Send but a set composed of multiple split states, the split path with the minimum weight among the split paths of the split states in that set is selected as the target split path between the split state set of the input tensor data of the entire neural network model and the split state set of the output tensor data of the neural network model.
• alternatively, the computer device can search for the split path from the non-split state Send back to the non-split state Sroot; the two directions are equivalent.
• similarly, when the split state set of the input tensor data of the neural network model is not the single non-split state Sroot but a set composed of multiple split states, the split path with the minimum weight among the split paths of the split states in that set is selected as the target split path between the split state set of the input tensor data of the entire neural network model and the split state set of the output tensor data of the neural network model.
  • Step S306 Split the target operator according to the target split path to allocate to the corresponding core of the multi-core artificial intelligence processor for processing.
  • the number of cores of the multi-core artificial intelligence processor may be 8 or 16, which is not specifically limited in the embodiment of the present application.
  • the computer device may split the target operator according to the determined target optimization path.
• the neural network model is used to perform a specific neural network computing task, such as face recognition, edge detection, or semantic analysis.
  • the computer device splits the neural network according to the target split path, that is, the neural network computing task is split into several sub-computing tasks.
• the computer device can call the multi-core artificial intelligence processor to run the several sub-computing tasks obtained by splitting, so as to obtain the running result.
  • the running result refers to the result when the computer device executes a specific neural network computing task, which may include, but is not limited to: the accuracy of the neural network model, the running time of the neural network model, and so on.
  • the computer device can output the running result, for example, the computer device can display the running result on the display screen.
• the computer device splits the neural network computing task into several smaller sub-computing tasks, so that the multi-core processor can directly call the computing library under the single-core architecture, making full use of the hardware resources of the multi-core processor and avoiding the extra workload of re-implementation.
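• the sketch below illustrates this dispatch pattern under simple assumptions: single_core_kernel stands in for an existing single-core library call (a Relu here) and a thread pool stands in for the cores; neither name comes from the original text:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def single_core_kernel(chunk):
    """Stand-in for a kernel from the existing single-core computing
    library; here simply an element-wise Relu."""
    return np.maximum(chunk, 0)

def run_split_operator(x, num_cores=8):
    # one sub-computing task per core: split the input along the N dimension
    chunks = np.array_split(x, num_cores, axis=0)
    with ThreadPoolExecutor(max_workers=num_cores) as pool:
        parts = list(pool.map(single_core_kernel, chunks))
    return np.concatenate(parts, axis=0)   # reassemble the full output

out = run_split_operator(np.random.randn(16, 3, 32, 32))
```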
  • a glue operator can be inserted between the target operator and the associated split state to adjust the split state in the set of split states.
• the method, which determines the target optimization path based on the updated split state set, may include but is not limited to the following steps:
  • Step S400 According to the target operator in the neural network model, determine the split state set of the tensor data associated with the operator of the target operator.
• for the specific implementation of step S400, please refer to the foregoing step S300, which will not be repeated here.
• Step S402 Insert a glue operator between the target operator and the associated set of split states, and adjust the split states in the set of split states to obtain an adjusted set of split states; wherein the glue operator is used to convert the split state obtained according to one split mode of the tensor data into a split state obtained according to any split mode.
• in order to distinguish the set of split states before the introduction of the glue operator from the set adjusted after its introduction, the former is defined as the first split state set and the latter as the second split state set.
• when a single operator is split, the tensor data associated with the operator is also split into several sub-tensor data in different ways according to the selected splitting method. Since in actual networks a tensor data is often associated with multiple operators, how each operator in the calculation graph chooses its splitting method is not an isolated problem; it affects adjacent operators and even all operators in the network. For example, in the simplest case, a certain tensor data Tensor1 is both the output data of the operator OP0 and the input data of the operator OP1.
• when OP0 has determined to split in a certain way, Tensor1, as the output of OP0, is thereby also determined to be split into a series of sub-tensor data in a certain way; OP1 must then ensure that the splitting method it selects is compatible with the determined splitting method of its input data Tensor1, which restricts OP1's selection range. It can be understood that the split mode selected by OP1 under this constraint will in turn restrict the split selection of other neighboring operators through the tensor data associated with it.
  • splitting methods that different operators can support depend on the type of the operator itself and the size of the data.
• some operators, such as the activation operator Relu and the convolution operator Conv, support splitting methods that allow their input data to be split in any dimension of NCHW; some operators, such as the softmax operator, only allow their input data to be split in specific dimensions; and the last kind of operators are often operators that are very complex in implementation, such as the NMS (Non-Maximum Suppression) operator, for which it is difficult to distribute the calculation load to multiple cores in parallel by splitting the operator.
• this last type of operator can therefore only be executed on a single core, and its input data must remain intact and unsplit. If such an operator exists in a neural network model, the input data of this operator must be kept complete without splitting, otherwise the network cannot continue to execute at this operator. If this constraint spreads with the network structure, it becomes difficult to mine a sufficient degree of parallelism in the neural network calculation by means of operator splitting.
• for this reason, a glue operator is inserted between the target operator and the associated first split state set. This glue operator enables each operator in the calculation graph corresponding to the neural network model to flexibly and without restriction choose the splitting method that acts on itself.
• the glue operator (Transform) is used to adjust the state of the tensor data from several sub-tensor data obtained according to one splitting method to several sub-tensor data obtained according to another splitting method.
• when the splitting method of the current tensor data is not supported by any splitting method of the subsequent operator, or when the schemes in which the subsequent operator is compatible with the splitting method of the current tensor data bring very poor performance, the computer device can adjust the current data into another, better splitting method by inserting a glue operator into the calculation graph.
• the semantics of the glue operator can be obtained through the concat operator and/or the split operator in the neural network model, as elaborated below:
• the concat operator, that is, the concatenation operator, is used to concatenate multiple tensor data into one tensor along a specified dimension; except for the specified dimension, the other dimensions of the input tensors should be consistent.
  • the neural network splices multiple tensors representing features from different upstream locations into one, so that these features can be processed together in downstream calculations. Specifically, refer to the schematic diagram of the semantics of the concat operator shown in FIG. 6C.
• the split operator is used to split one tensor into multiple tensors along a specified dimension; except for the specified dimension, the multiple tensors after splitting remain consistent in the other dimensions.
• through the split operator, features belonging to the same tensor data can be split into multiple copies, so that they can be processed separately in subsequent calculations. Specifically, refer to the schematic diagram of the split operator semantics shown in FIG. 6D.
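• a small numpy illustration of these two semantics (shapes chosen arbitrarily):

```python
import numpy as np

a = np.ones((2, 3, 4, 4))   # NCHW sub-tensors that agree on every
b = np.ones((2, 5, 4, 4))   # dimension except the one being joined

whole = np.concatenate([a, b], axis=1)   # concat along C -> (2, 8, 4, 4)

parts = np.split(whole, [3], axis=1)     # split back at C = 3
assert parts[0].shape == a.shape and parts[1].shape == b.shape
```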
• the glue operator adopts one of four methods: split-splice, splice-split, splice, and split.
• in the splicing stage, adjacent sub-data blocks can be spliced along any dimension into a new sub-tensor data.
• in the splitting stage, any sub-tensor data can be split into several smaller sub-tensor data.
• in this way, the sub-tensor data obtained by splitting the tensor data in any one way can be converted into the sub-tensor data obtained by splitting the tensor data in any other way.
• for simplicity of explanation, assume that the data itself is one-dimensional.
• the split form before adjustment is expressed as {(0,p1),(p1,p2),...,(pn-1,end)}, each segment representing one sub-segment of the one-dimensional data after splitting; the split form adjusted by the glue operator is {(0,q1),(q1,q2),...,(qm-1,end)}.
• if every split point before adjustment still exists in the adjusted form, the splicing stage can be skipped and only the corresponding split is performed in the splitting stage; otherwise, in the worst case, all the data is first combined into the complete one-dimensional data in the splicing stage, and the corresponding split is then performed in the splitting stage.
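• the sketch below computes, for each adjusted segment, which original segments (clipped to the overlap) must be spliced to build it; the (begin, end) offset representation is an assumption for illustration:

```python
def adjust_segments(old, new):
    """old = [(0,p1),(p1,p2),...]; new = [(0,q1),(q1,q2),...].
    Returns, per new segment, the clipped old segments to splice."""
    plan = []
    for q0, q1 in new:
        pieces = [(max(b, q0), min(e, q1))     # overlap with each old piece
                  for b, e in old if b < q1 and e > q0]
        plan.append(((q0, q1), pieces))
    return plan

# Each new segment is built from clipped old segments; if every new segment
# maps to exactly one clipped piece, the splicing stage can be skipped.
print(adjust_segments([(0, 4), (4, 8)], [(0, 2), (2, 8)]))
# [((0, 2), [(0, 2)]), ((2, 8), [(2, 4), (4, 8)])]
```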
• in a possible implementation, inserting the glue operator between the target operator and the associated first split state set to adjust the split states in the split state set of the input tensor data of the operator and obtain the second split state set includes:
• inserting a glue operator between the target operator and the associated first split state set, and updating the split states in the first split state set to the second split state set through the glue operator.
  • the entire neural network can be abstracted as shown in FIG. 5H.
  • the dashed box represents the set of split states of each tensor data.
  • the set contains several split states, and these split states come from the split state space of the tensor data.
• a directed edge between a state in the split state set of an operator's input tensor data and a state in the split state set of its output tensor represents a splitting method of the operator itself, and the parallel execution time under this splitting method is used as the weight of the directed edge.
• Tensor0 is the input tensor data of the entire neural network, and Tensor3 is the output tensor data of the entire neural network. Any path that starts from any state in the split state set of Tensor0 and ends at any state in the split state set of Tensor3 corresponds to an effective splitting scheme of the neural network, which can be denoted as P.
• a glue operator is inserted into the split state set associated with OP0, and the states in the split state set are adjusted through the glue operator.
• the split states in the updated split state set include: state m'1, state m'2, ..., state m'k, which are the new states generated after the states in the first split state set pass through the glue operator.
• in a possible implementation, inserting the glue operator between the target operator and the associated first split state set to adjust the split states in the split state set of the input tensor data of the operator and obtain the second split state set includes: inserting a glue operator between the target operator and the associated first split state set, updating the split states in the first split state set to a third split state set through the glue operator, and generating the second split state set according to the first split state set and the third split state set.
• taking the split state set of Tensor1, that is, the first split state set, as an example, a glue operator is inserted into the split state set associated with OP0, and the glue operator adjusts the split states in the split state set: the split states in the first split state set are updated to the third split state set, and then the second split state set is generated according to the first split state set and the third split state set.
• the split states in the second split state set include: state 1, state 2, ..., state m, and state m'; here state m is a split state in the first split state set, and state m' is a new split state generated from the states in the first split state set after passing through the glue operator.
  • the behavior of adjusting the split state of tensor data is represented by a glue operator.
• this is because the calculation scale of each layer of the neural network model constantly changes as the network extends, and as the splitting trend of the neural network model changes, the splitting method of the operators needs to be adjusted correspondingly, that is, the split states of the intermediate results need to be adjusted.
  • a glue operator is added between Op0 and its input Tensor1, which can convert any split state of tensor data into another split state.
  • its input tensor data and output tensor data have the same shape and the same state space.
• as shown in FIG. 6E or FIG. 6F, a glue operator can be inserted between an operator and its corresponding input tensor data, or between an operator and its corresponding output tensor data.
• it is also possible to insert glue operators between an operator and both its corresponding input tensor data and its output tensor data.
  • Step S404 Traverse the adjusted split state set, and determine a split path of the tensor data of the target operator between adjacent split state sets.
  • the adjusted split state set is also the second split state set.
  • Step S406 Determine the target split path of the tensor data of the target operator according to the weight of the split path.
  • Step S408 Split the target operator according to the target split path, so as to be allocated to the corresponding core of the multi-core artificial intelligence processor for processing.
• for the specific implementation of step S404 to step S408, please refer to the aforementioned step S302 to step S306; details are not repeated here.
  • a glue operator is inserted between the target operator and the associated split state set.
• the glue operator enables each operator in the calculation graph corresponding to the neural network model to flexibly and without restriction choose the splitting method that acts on itself, so that the problem of mutual influence between operator splits can be solved.
  • the glue operator is introduced so that each operator can choose an appropriate splitting method according to the actual situation.
• however, the glue operator itself brings additional overhead, which undoubtedly increases the resource consumption of the computer device.
• taking a glue operator of the split-splice or splice-split type as an example, assume that the total size of the tensor data to be adjusted is M, that neither of the two stages can be skipped, and that each stage must splice or split in all 4 dimensions.
  • splicing and splitting are usually implemented using the concatenation operator (Concat) and the split operator (Split) that come with the neural network algorithm.
  • step S4010 may be further included, which will be described in detail below:
• Step S4010 When the state of the input tensor data and the state of the output tensor data of the same glue operator included in the target split path are the same, the inserted corresponding glue operator is deleted to obtain an optimized target split path.
• specifically, the computer device determines whether the state of the input tensor data and the state of the output tensor data of the same glue operator included in the target split path are the same; when they are the same, the glue operator is removed.
  • the state of the input tensor data and the state of the output tensor data of the same glue operator are the same, it means that using the glue operator at this position does not make any adjustments to the split state of the tensor data.
  • the glue operator itself will bring extra overhead, which undoubtedly increases the resource consumption of the computer device.
• therefore, when the state of the input tensor data and the state of the output tensor data of the same glue operator are the same, removing the glue operator can reduce the resource consumption of the computer device.
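• a minimal sketch of the pruning in step S4010, assuming each glue operator is recorded as an (op, in_tensor, out_tensor) triple and path maps each tensor to the split state selected on the target split path (both encodings are illustrative assumptions):

```python
def prune_noop_glue(glue_ops, path):
    kept = []
    for op, t_in, t_out in glue_ops:
        if path[t_in] == path[t_out]:
            continue        # same state on both sides: the glue op is a no-op
        kept.append(op)     # states differ: the adjustment is genuinely needed
    return kept
```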
• this implementation can take the additional overhead caused by the introduced glue operators and the parallel efficiency of the different splitting methods of the operators themselves into account together for decision-making, so that an optimal splitting scheme P based on the entire neural network can be obtained.
• in a possible implementation, the computer device determines whether the state of the input tensor data and the state of the output tensor data of the same glue operator included in the target split path are the same; if they are not the same, the glue operator is retained.
• the glue operator retained here can make the splitting method of each operator compatible with the splitting method of the tensor data directly related to it. Through this implementation, the additional overhead brought by the introduced glue operators and the parallel efficiency of the different splitting methods of the operators themselves are considered together for decision-making, so that an optimal splitting scheme P based on the entire neural network can be obtained.
• then, when the computer device executes the above step S306, the target operator is split according to the optimized target split path.
• an operator at a branch junction has more than one input tensor data, for example the element-wise addition operator (Add), the element-wise multiplication operator (Mult), and the concatenation operator (Concat).
• when the computer device accesses such an operator A, that is, when the split state set of the output tensor data is determined according to the split state sets of the input tensor data, the two input tensor data tensorleft and tensorright already have corresponding split state sets Sleft and Sright respectively.
• in one case, the two branches extend directly to the end of the traversal, which means that the entire network has more than one input data; this is usually not common in inference tasks.
• in the other case, the two branches merge together at a certain operator. In either case, when the splitting scheme P is determined, split states that do not match each other may be selected on the two input tensor data tensorleft and tensorright of operator A.
• for example, during the backtracking process, the state selected in the split state set of tensorleft may be one that is split only in the C dimension, while the state selected in the split state set of tensorright may be one that is split only in the H dimension; the splitting methods of the addition operator implied by these two split states are inconsistent, which will cause the entire splitting scheme P to be invalid.
• here, backtracking refers to the reverse of the previous traversal process, that is, traversing the neural network model in the reverse direction.
• the purpose of the backtracking process is to avoid misjudgment by the computer device when determining the target optimization path, which would otherwise lead to negative effects such as increased time consumption when running the split neural network model.
• to this end, the split state sets corresponding to tensorleft and tensorright are each made to contain only one split state, which ensures the certainty of the state selected in the two state sets during the backtracking process.
• specifically, one split state is retained in the split state set of the output tensor data of the current operator, and the retained split state is determined via the same directed edge of the current operator.
  • the following specifically describes how to ensure the certainty of the state selected in the two-state set in the backtracking process in the embodiment of the present application.
  • the method may include but is not limited to the following steps:
  • Step 700 Determine a split state set of tensor data associated with the target operator according to the target operator in the calculation graph corresponding to the neural network model;
  • Step 702 Traverse the split state set, and determine a split path of the tensor data of the operator between adjacent split state sets.
  • Step 704 Determine a target split path of the tensor data of the target operator according to the weight of the split path.
• determining the target split path of the tensor data of the target operator includes:
• for each split state in the current split state set, a split path from the current split state to the split state of the input tensor data of the target operator is determined according to the weight of the directed edge and the weight of the split path from the split state corresponding to the starting point of the directed edge to the split state of the input tensor data of the target operator; wherein the weight of the split path is determined according to the weights of all directed edges corresponding to the split path;
• after traversing all the split state sets of the target operator, the target split path between the split state set of the input tensor data of the target operator and the split state set of the output tensor data of the target operator is obtained.
  • this implementation method is to obtain the target optimization path through forward traversal.
  • Step 706 Split the target operator according to the target split path, so as to be allocated to the corresponding core of the multi-core artificial intelligence processor for processing.
• in the embodiment of the present application, in the forward traversal phase, when the output tensor data of the current operator is used as input tensor data by at least two operators, or when the current operator has at least two output tensor data, one split state is retained in the split state set of the output tensor data of the current operator, and the retained split state is determined via the same directed edge of the current operator.
• in this way, before the traversal reaches the branch operator, the state with the smallest accumulated weight is selected from the split state set and retained, and the other split states in the split state set are removed.
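• a minimal sketch of this branch-point pruning, assuming dist holds the accumulated path weight computed so far for each state (as in the earlier traversal sketch):

```python
import math

def prune_branch_states(state_set, dist):
    """Keep only the split state with the smallest accumulated path weight,
    so that backtracking through the branch stays unambiguous."""
    best = min(state_set, key=lambda s: dist.get(s, math.inf))
    return {best}
```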
• for the specific implementation of step S700 to step S706, please refer to the related description of step S300 to step S306, which will not be repeated here.
• on this basis, a glue operator may further be introduced to adjust the split states in the split state set, and the glue operators whose input tensor data state is the same as their output tensor data state on the target optimization path may be deleted. This modification of the method described in step S700 to step S706 may include but is not limited to the following steps:
  • Step 700' Determine the split state set of tensor data associated with the target operator according to the target operator in the calculation graph corresponding to the neural network model;
• Step 702' Insert a glue operator between the target operator and the associated first split state set, and adjust the split states in the split state set of the input tensor data of the target operator to obtain a second split state set; wherein the glue operator is used to convert the split state obtained according to one splitting method of the tensor data into a split state obtained according to any splitting method;
  • Step 704' Traverse the second split state set, and determine a split path of the tensor data of the target operator between adjacent split state sets;
  • Step 706' Determine the target split path of the tensor data of the target operator according to the weight of the split path.
• determining the target split path of the tensor data of the target operator includes:
• for each split state in the current split state set, a split path from the current split state to the split state of the input tensor data of the target operator is determined according to the weight of the directed edge and the weight of the split path from the split state corresponding to the starting point of the directed edge to the split state of the input tensor data of the target operator; wherein the weight of the split path is determined according to the weights of all directed edges corresponding to the split path;
• after traversing all the split state sets of the target operator, the target split path between the split state set of the input tensor data of the target operator and the split state set of the output tensor data of the target operator is obtained.
  • this implementation method is to obtain the target optimization path through forward traversal.
• in the forward traversal phase, when the output tensor data of the current operator is used as input tensor data by at least two operators, or when the current operator has at least two output tensor data, one split state is retained in the split state set of the output tensor data of the current operator, and the retained split state is determined via the same directed edge of the current operator.
• in this way, before the traversal reaches the branch operator, the state with the smallest accumulated weight is selected from the split state set and retained, and the other split states in the split state set are removed.
• Step S708' In the case that the state of the input tensor data and the state of the output tensor data of the same glue operator included in the target split path are the same, delete the glue operator to obtain an optimized target split path;
  • Step S7010' split the target operator according to the optimized target split path, so as to be allocated to the corresponding core of the multi-core artificial intelligence processor for processing.
• for the specific implementation of steps S700' to S7010', please refer to the relevant description in the foregoing embodiment; details are not repeated here.
• in other words, in the forward traversal stage, for an operator or output tensor located at a branch point, the computer device only retains the state whose corresponding path is the shortest so far, and deletes all other states.
• that is, one split state is retained in the relevant split state set, and the retained split state is determined via the same directed edge of the operator.
  • the following specifically describes how to ensure the certainty of the state selected in the two-state set in the backtracking process in the embodiment of the present application.
  • the method may include but is not limited to the following steps:
  • Step 800 Determine a split state set of tensor data associated with the target operator according to the target operator in the calculation graph corresponding to the neural network model;
  • Step 802 Traverse the set of split states, and determine a split path of the tensor data of the operator between adjacent sets of split states;
  • Step 804 Determine a target split path of the tensor data of the target operator according to the weight of the split path.
• determining the target split path of the tensor data of the target operator includes:
• for each split state in the current split state set, a split path from the current split state to the split state of the output tensor data of the target operator is determined according to the weight of the directed edge and the weight of the split path from the split state corresponding to the end point of the directed edge to the split state of the output tensor data of the target operator; wherein the weight of the split path is determined according to the weights of all directed edges corresponding to the split path;
• after traversing all the split state sets of the target operator, the target split path between the split state set of the input tensor data of the target operator and the split state set of the output tensor data of the target operator is obtained.
  • this implementation method is to obtain the target optimization path through reverse traversal.
  • Step 806 Split the target operator according to the target split path, so as to be allocated to the corresponding core of the multi-core artificial intelligence processor for processing.
• in the embodiment of the present application, in order to ensure the certainty of the state selected in the two state sets during the backtracking process, in the reverse traversal phase, when the current operator has at least two input tensor data, one split state is retained in the split state set of the input tensor data of the current operator, and the retained split state is determined via the same directed edge of the operator. In this way, before the traversal reaches the branch operator, the state with the smallest accumulated weight is selected from the split state sets of the multiple input data and retained, and the other split states in those split state sets are removed.
• on this basis, a glue operator may further be introduced to adjust the split states in the split state set, and the glue operators whose input tensor data state is the same as their output tensor data state on the target optimization path may be deleted. This modification of the method described in step S800 to step S806 may include but is not limited to the following steps:
  • Step 800' Determine the split state set of the tensor data associated with the target operator according to the target operator in the calculation graph corresponding to the neural network model;
• Step 802' Insert a glue operator between the target operator and the associated first split state set, and adjust the split states in the split state set of the input tensor data of the target operator to obtain a second split state set; wherein the glue operator is used to convert the split state obtained according to one splitting method of the tensor data into a split state obtained according to any splitting method;
  • Step 804' Traverse the second set of split states, and determine a split path of the tensor data of the target operator between adjacent sets of split states.
  • Step 806' Determine the target split path of the tensor data of the target operator according to the weight of the split path.
• the manner of determining the target split path of the tensor data of the target operator is the same as that described in step 804 above; this implementation obtains the target optimization path through reverse traversal.
• in the embodiment of the present application, in order to ensure the certainty of the state selected in the two state sets during the backtracking process, in the reverse traversal phase, when the current operator has at least two input tensor data, one split state is retained in the split state set of the input tensor data of the current operator, and the retained split state is determined via the same directed edge of the operator. In this way, before the traversal reaches the branch operator, the state with the smallest accumulated weight is selected from the split state sets of the multiple input data and retained, and the other split states in those split state sets are removed.
• Step S808' In the case that the state of the input tensor data and the state of the output tensor data of the same glue operator included in the target split path are the same, delete the glue operator to obtain an optimized target split path;
  • Step S8010' split the target operator according to the optimized target split path, so as to allocate it to the corresponding core of the multi-core artificial intelligence processor for processing.
• for the specific implementation of steps S800' to S8010', please refer to the relevant description in the foregoing embodiment; details are not repeated here.
• in other words, in the reverse traversal phase, for an operator or tensor located at a branch point, the computer device only retains the state whose corresponding path is the shortest so far, and deletes all other states.
• the vehicle needs to analyze and process external information such as images, videos, and voice collected by on-board sensors during the automatic driving process.
  • the vehicle In order to ensure safety, the vehicle must obtain the analysis results of the above-mentioned various external information in the shortest time, so as to make scientific and effective decisions.
• when the vehicle's hardware system is equipped with a processing chip of a multi-core processor structure, it can, through the technical solution described in this application, split the calculation task of the neural network model that processes small batches of external information into multiple sub-computing tasks, and evenly distribute these sub-computing tasks to multiple processor cores, so that they can be executed in parallel on the multiple processor cores.
  • This implementation can efficiently complete the processing of external information and return the processing result.
• the intelligent driving system of the vehicle can then assist the vehicle in automatic driving according to the returned result. It is understandable that this technical solution can split an operator into multiple smaller-scale sub-operators, so that the computing library under the single-core architecture can be called directly, the hardware resources of the multi-core processor can be fully utilized, and the extra workload of re-implementation can be avoided.
  • the multi-core processor structure chip is set on the vehicle.
• the chip with the multi-core processor structure can also be set on a cloud server, and the vehicle can transmit external information such as images, videos, and voice collected by the on-board sensors to the cloud server through 3G/4G, WiFi, and other networks.
  • the cloud server uses this solution to evenly distribute the computational load of the neural network model for processing small batches of external information to multiple processing cores. Within the specified response time of the vehicle, the cloud server will feed back the processing result to the vehicle via 3G/4G, WIFI and other networks.
  • the scale of external information collected by on-board sensors is different. Before application, according to external information of different scales, the on-board processor uses this scheme to determine the corresponding operator split path.
• when the chip with the multi-core processor structure obtains the external information, the corresponding operator split path is called to split the operators in the neural network model, and the calculation load of processing the external information is evenly distributed to the multiple processor cores.
• it should be understood that although the steps in the flowchart of FIG. 3 are displayed in sequence as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they may be executed in other orders. Moreover, at least part of the steps in FIG. 3 may include multiple sub-steps or stages; these sub-steps or stages are not necessarily executed at the same moment but may be executed at different moments, and their execution order is not necessarily sequential; they may be executed in turn or alternately with at least part of the other steps, or of the sub-steps or stages of other steps.
  • the device 70 may at least include:
  • the determining unit 700 is configured to determine the split state set of tensor data associated with the target operator according to the target operator in the calculation graph corresponding to the neural network model;
  • the split path determination unit 702 is configured to traverse the split state set and determine the split path of the tensor data of the target operator between adjacent split state sets;
  • the target split path determining unit 704 is configured to determine the target split path of the tensor data of the target operator according to the weight of the split path;
  • the processing unit 706 is configured to split the target operator according to the target split path, so as to be allocated to the corresponding core of the multi-core artificial intelligence processor for processing.
  • the target split path determining unit 704 is specifically configured to:
• for each split state in the current split state set, determine a split path from the current split state to the split state of the input tensor data of the target operator according to the weight of the directed edge and the weight of the split path from the split state corresponding to the starting point of the directed edge to the split state of the input tensor data of the target operator; wherein the weight of the split path is determined according to the weights of all directed edges corresponding to the split path;
• after traversing all the split state sets of the target operator, obtain the target split path between the split state set of the input tensor data of the target operator and the split state set of the output tensor data of the target operator.
  • the target split path determination unit 704 is further specifically configured to:
• for each split state in the current split state set, determine a split path from the current split state to the split state of the output tensor data of the target operator according to the weight of the directed edge and the weight of the split path from the split state corresponding to the end point of the directed edge to the split state of the output tensor data of the target operator; wherein the weight of the split path is determined according to the weights of all directed edges corresponding to the split path;
• after traversing all the split state sets of the target operator, obtain the target split path between the split state set of the input tensor data of the target operator and the split state set of the output tensor data of the target operator.
• in a possible implementation, the device 70 may further include a glue operator inserting unit 708; the glue operator inserting unit 708 is configured to insert a glue operator between the target operator and the associated split state set to adjust the split states in the split state set; wherein the glue operator is used to convert the split state obtained according to one split method of the tensor data into a split state obtained according to any split method.
  • the glue operator insertion unit 708 is specifically configured to:
• the target split path of the target operator in the calculation graph into which the glue operators have been inserted is used to examine each inserted glue operator; when the split state of the input tensor data and the split state of the output tensor data of the same glue operator included in the target split path are the same, the inserted corresponding glue operator is deleted.
  • the glue operator is used to splice the split states in the set of split states.
  • the glue operator is used to split the split state in the set of split states.
  • the glue operator is used to splice the split states in the split state set, and then split the split states in the split state set after the splicing process .
• or the glue operator is used to split the split states in the split state set, and then splice the split states in the split state set after the splitting process.
• in a possible implementation, the device 70 may further include a forward branch processing unit 7010; the forward branch processing unit 7010 is configured to, in the forward traversal phase, when the output tensor data of the current operator is used as input tensor data by at least two operators or the current operator has at least two output tensor data, retain one split state in the split state set of the output tensor data of the current operator, the retained split state being determined via the same directed edge of the current operator.
• in a possible implementation, the device 70 may further include a reverse branch processing unit 7012; the reverse branch processing unit 7012 is configured to, in the reverse traversal phase, when the current operator has at least two input tensor data, retain one split state in the split state set of the input tensor data of the current operator, the retained split state being determined via the same directed edge of the operator.
• in a possible implementation, the weight of the directed edge is determined according to the operation type of the target operator corresponding to the split path, the data size of the sub-data corresponding to the tensor data of the target operator, and the throughput rate and memory access bandwidth of each processor core.
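• the roofline-style max() below is an assumption about how those quantities might be combined into an edge weight, not a formula from the original text:

```python
def edge_weight(op_flops, sub_data_bytes, core_throughput, core_bandwidth):
    """Estimated parallel execution time of one split mode on one core:
    the slower of the compute time and the memory-access time."""
    compute_time = op_flops / core_throughput       # ops / (ops per second)
    memory_time = sub_data_bytes / core_bandwidth   # bytes / (bytes per second)
    return max(compute_time, memory_time)
```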
• in a possible implementation, the split states in the split state set of the input tensor data of the target operator of the neural network model are determined according to the operation logic of the operator and the split states in the split state set of the corresponding output tensor data.
• in a possible implementation, the split states in the split state set of the output tensor data of the target operator of the neural network model are determined according to the operation logic of the operator and the split states in the split state set of the corresponding input tensor data.
  • the above device embodiments are only illustrative, and the device of the present disclosure may also be implemented in other ways.
  • the division of units/modules in the above-mentioned embodiments is only a logical function division, and there may be other division methods in actual implementation.
  • multiple units, modules or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • the units or modules described as separate components may or may not be physically separate.
  • a component described as a unit or a module may be a physical unit or not a physical unit, that is, it may be located in one device, or may also be distributed on multiple devices.
  • the solutions of the embodiments of the present disclosure can be implemented by selecting some or all of the units according to actual needs.
  • the embodiment of the present application also provides a chip.
• the neural network chip may be a multi-core chip, including a central processing unit (CPU) and N single-core neural network processors (NNP), where N is an integer greater than 1.
  • the CPU is used for overall control and scheduling of the chip, and is the main body of execution of the neural network model processing method in the embodiment of the application.
  • the embodiment of the present application also provides another computer device, which includes the above-mentioned chip or the above-mentioned neural network model processing device 70.
  • the embodiment of the present application also provides a computer storage medium for storing computer software instructions used by the computer device shown in FIG. 2 above, which includes a program for executing the above method embodiment.
• by executing the stored program, the tensor data associated with the target operator in the calculation graph corresponding to the neural network model is split to obtain the split state sets corresponding to the tensor data; the split paths of the tensor data between adjacent split state sets and the weights of the split paths are then determined, and the target split path of the tensor data of the target operator is determined; finally, the target operator of the calculation graph is split according to the target split path, so as to be allocated to the corresponding cores of the multi-core processor for processing.
• the target operator is split to achieve the purpose of reducing the size of the operator's operation data, and the splitting method of the target operator is then further optimized based on the split path selection between the split states corresponding to the target operator.
• finally, the sub-operators obtained by splitting the target operator are allocated to the multi-core processor, so that the hardware resources of each core in the multi-core processor can be effectively used. This solution can effectively reduce the end-to-end delay of various neural network models on the multi-core processor.
• this application can be provided as a method, a system, or a computer program product. Therefore, this application may take the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, optical storage, etc.) containing computer-usable program code.
  • These computer program instructions can also be stored in a computer-readable memory that can guide a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including the instruction device.
• the instruction device implements the functions specified in one or more processes in the flowchart and/or one or more blocks in the block diagram.
• these computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing; the instructions executed on the computer or other programmable equipment thus provide steps for implementing the functions specified in one or more processes in the flowchart and/or one or more blocks in the block diagram.
• clause A1, a neural network model processing method, characterized in that the method is applied to a multi-core artificial intelligence processor, and the method includes: determining a split state set of tensor data associated with a target operator according to the target operator in a calculation graph corresponding to a neural network model; traversing the split state sets, and determining split paths of the tensor data of the target operator between adjacent split state sets; determining a target split path of the tensor data of the target operator according to the weights of the split paths; and splitting the target operator according to the target split path, so as to allocate it to corresponding cores of the multi-core artificial intelligence processor for processing.
  • the determining the target splitting path of the tensor data of the target operator includes:
• for each split state in the current split state set, a split path from the current split state to the split state of the input tensor data of the target operator is determined according to the weight of the directed edge and the weight of the split path from the split state corresponding to the starting point of the directed edge to the split state of the input tensor data of the target operator; wherein the weight of the split path is determined according to the weights of all directed edges corresponding to the split path;
• after traversing all the split state sets of the target operator, the target split path between the split state set of the input tensor data of the target operator and the split state set of the output tensor data of the target operator is obtained.
  • A3. The method of A1, wherein determining the target split path of the tensor data of the target operator comprises: traversing all split state sets of the target operator; for the current split state set, traversing each split state to obtain all directed edges starting from the current split state, together with the split paths from the split states at the end points of those directed edges to the split states of the output tensor data of the target operator; determining the split path from the current split state to the split state of the output tensor data of the target operator according to the weight of the directed edge and the weight of the split path from the split state at the end point of the directed edge to the split state of the output tensor data, wherein the weight of a split path is determined from the weights of all directed edges included in it; and, after all split state sets of the target operator have been traversed, obtaining the target split path between the split state set of the input tensor data of the target operator and the split state set of the output tensor data of the target operator.
  • A4. The method according to any one of A1-A3, further comprising: inserting a glue operator between the target operator and the associated split state set to adjust the split states in the split state set, wherein the glue operator is used to convert a split state of the tensor data obtained by one splitting method into a split state obtained by any other splitting method.
  • A5. The method of A4, wherein inserting the glue operator between the target operator and the associated split state set comprises: using the target split path of the target operator in the computation graph that includes the glue operators to select each inserted glue operator, and deleting an inserted glue operator when the split state of its input tensor data and the split state of its output tensor data on the target split path are the same.
  • A6. The method of A4, wherein the glue operator concatenates the split states in the split state set.
  • A7. The method of A4, wherein the glue operator splits the split states in the split state set.
  • A8. The method of A4, wherein the glue operator first concatenates the split states in the split state set and then splits the split states in the concatenated split state set.
  • A9. The method of A4, wherein the glue operator first splits the split states in the split state set and then concatenates the split states in the split split state set.
  • A10. The method according to any one of A1-A9, further comprising: in the forward traversal phase, when the output tensor data of the current operator is used as input tensor data by at least two operators, or the current operator has at least two output tensor data, retaining one split state in the split state set of the current operator's output tensor data, the retained split state being determined via the same directed edge of the current operator.
  • A11. The method according to any one of A1-A9, further comprising: in the backward traversal phase, when the current operator has at least two input tensor data, retaining one split state in the split state set of the current operator's input tensor data, the retained split state being determined via the same directed edge of the target operator.
  • A12. The method according to A2 or A3, wherein the weight of a directed edge is determined according to the type of operation of the target operator corresponding to the split path, the data size of the corresponding sub-data obtained from the target operator's tensor data via the split path, and the throughput and memory-access bandwidth of each processor core.
  • A13. The method of A1, wherein the split states in the split state set of the input tensor data of the target operator are determined according to the computational logic of the target operator and the split states in the split state set of the corresponding output tensor data.
  • A14. The method of A1, wherein the split states in the split state set of the output tensor data of the target operator are determined according to the computational logic of the target operator and the split states in the split state set of the corresponding input tensor data.
  • B1. A neural network processing apparatus, applied to a multi-core artificial intelligence processor, the apparatus comprising: a determining unit, configured to determine, according to a target operator in the computation graph corresponding to the neural network model, a set of split states of the tensor data associated with the target operator; a split path determining unit, configured to traverse the split state sets and determine the split paths of the target operator's tensor data between adjacent split state sets; a target split path determining unit, configured to determine the target split path of the tensor data of the target operator according to the weights of the split paths; and a processing unit, configured to split the target operator according to the target split path, so as to allocate it to the corresponding cores of the multi-core artificial intelligence processor for processing.
  • C1. A computer device comprising a plurality of heterogeneous processors and a memory connected to each other, wherein the plurality of heterogeneous processors include a general-purpose processor and an artificial intelligence processor, the memory is used to store a computer program, the computer program includes program instructions, and the processors are configured to invoke the program instructions to execute the method according to any one of A1-A14.
  • D1. A computer-readable storage medium storing a computer program, the computer program comprising program instructions that, when executed by a processor, cause the processor to execute the method according to any one of A1-A14.

Abstract

Embodiments of this application disclose a neural network processing method and apparatus, a computer device, and a storage medium. An operator is split into several smaller sub-operators, so that the compute library of the single-core architecture can be invoked directly, making full use of the hardware resources of the multi-core processor and avoiding the extra workload of reimplementation.

Description

Neural Network Model Processing Method and Apparatus, Computer Device, and Storage Medium

Technical Field
This application relates to the field of computer technology, and in particular to a neural network model processing method and apparatus, a computer device, and a storage medium.

Background
With the rapid development of artificial intelligence, multi-core processors based on a shared-memory model have become the mainstream processor architecture; this multi-core architecture and the vector processing capability inside each core can equally be applied to neural network computation. In practice, data parallelism is usually adopted to exploit the extra hardware resources that a multi-core architecture provides: each processor core simultaneously executes the computation of the same neural network model on different data. However, a multi-core processor cannot use this kind of parallelism for inference workloads, which process small batches and require low latency. How to unify data parallelism with model parallelism so as to fully utilize the hardware resources of a multi-core processor is therefore a pressing technical problem.

Summary of the Invention
Embodiments of the present invention provide a neural network model processing method and apparatus, a computer device, and a storage medium. A neural network computation task is split into several smaller sub-tasks, so that a multi-core processor can directly invoke the compute library of the single-core architecture, making full use of the hardware resources of the multi-core processor and avoiding the extra work of reimplementation.
To this end, in a first aspect, an embodiment of this application provides a neural network model processing method applied to a multi-core artificial intelligence processor, the method comprising:
determining, according to a target operator in the computation graph corresponding to the neural network model, a set of split states of the tensor data associated with the target operator;
traversing the split state sets and determining the split paths of the target operator's tensor data between adjacent split state sets;
determining, according to the weights of the split paths, a target split path for the tensor data of the target operator;
splitting the target operator according to the target split path, so as to allocate it to the corresponding cores of the multi-core artificial intelligence processor for processing.
In a second aspect, an embodiment of this application provides a neural network model processing apparatus comprising units for executing the method of the first aspect. Specifically, the apparatus is applied to a multi-core artificial intelligence processor and includes:
a determining unit, configured to determine, according to a target operator in the computation graph corresponding to the neural network model, a set of split states of the tensor data associated with the target operator;
a split path determining unit, configured to traverse the split state sets and determine the split paths of the target operator's tensor data between adjacent split state sets;
a target split path determining unit, configured to determine, according to the weights of the split paths, the target split path of the tensor data of the target operator;
a processing unit, configured to split the target operator according to the target split path, so as to allocate it to the corresponding cores of the multi-core artificial intelligence processor for processing.
In a third aspect, an embodiment of this application provides a chip including the neural network model processing apparatus of the second aspect.
In a fourth aspect, an embodiment of this application provides a computer device including the chip of the third aspect or the neural network model processing apparatus of the second aspect.
In a fifth aspect, an embodiment of this application provides a computer device comprising a processor and a memory connected to each other, wherein the processor includes a general-purpose processor and an artificial intelligence processor, the memory is used to store a computer program supporting the computer device in executing the above method, the computer program includes program instructions, and the processor is configured to invoke the program instructions to execute the method of the first aspect.
In a sixth aspect, an embodiment of this application provides a computer-readable storage medium storing a computer program, the computer program comprising program instructions that, when executed by a processor, cause the processor to execute the method of the first aspect.
In a seventh aspect, an embodiment of this application provides a computer program product comprising a non-transitory computer-readable storage medium storing a computer program operable to cause a computer to execute some or all of the steps described in the method of the first aspect. The computer program product may be a software installation package.
By implementing the embodiments of this application, the computer device splits a neural network computation task into several smaller sub-tasks, so that the multi-core processor can directly invoke the compute library of the single-core architecture, making full use of the hardware resources of the multi-core processor and avoiding the extra workload of reimplementation. Further, the computer device can adjust the split states in the split state set of the tensor data associated with an operator through glue operators and determine the target optimization path based on the updated split state set; the extra overhead introduced by the glue operators and the parallel efficiency of the different splitting methods of each operator are thus weighed together, yielding an optimal splitting scheme based on the whole network and improving the execution efficiency of the computer device.
Brief Description of the Drawings
To explain the technical solutions in the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show some embodiments of this application; a person of ordinary skill in the art may obtain other drawings from them without creative effort.
Fig. 1A is a schematic structural diagram of a multi-core processor provided by an embodiment of this application;
Fig. 1B is a schematic structural diagram of a software stack of an artificial intelligence processor provided by an embodiment of this application;
Fig. 2 is a schematic structural diagram of a computer device provided by an embodiment of this application;
Fig. 3 is a schematic flowchart of a neural network processing method provided by an embodiment of this application;
Fig. 4 is a computation graph of a neural network convolution operator provided by an embodiment of this application;
Fig. 5A is a schematic diagram of splitting along the N dimension of the input data;
Fig. 5B is a schematic diagram of splitting along the C dimension of the output data;
Fig. 5C is a schematic diagram of splitting along the C dimension of the input data;
Fig. 5D is a schematic diagram of splitting along the H dimension of the input data;
Fig. 5E is a schematic diagram of splitting along the W dimension of the input data;
Fig. 5F is a schematic structural diagram of a face recognition neural network model provided by an embodiment of this application;
Fig. 5G is a schematic structural diagram of a neural network model for license plate character recognition provided by an embodiment of this application;
Fig. 5H is an abstract diagram of a neural network model provided by an embodiment of this application;
Fig. 6A is an abstract diagram of a serial neural network model provided by an embodiment of this application;
Fig. 6B is a schematic diagram of adjusting the splitting of tensor data via a glue operator, provided by an embodiment of this application;
Fig. 6C is a schematic diagram of the semantics of the concat operator provided by an embodiment of this application;
Fig. 6D is a schematic diagram of the semantics of the split operator provided by an embodiment of this application;
Fig. 6E is an abstract diagram of a neural network model after glue operators are inserted, provided by an embodiment of this application;
Fig. 6F is another abstract diagram of a neural network model after glue operators are inserted, provided by an embodiment of this application;
Fig. 7 is a schematic structural diagram of a neural network processing apparatus provided by an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application are described below with reference to the accompanying drawings.
It should be understood that the terms "include" and "comprise" used in the specification and claims of this disclosure indicate the presence of the described features, wholes, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components and/or collections thereof.
It should also be understood that the terms used in this specification are for the purpose of describing particular embodiments only and are not intended to limit this disclosure. As used in the specification and claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should further be understood that the term "and/or" used in the specification and claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the claims, the term "if" may be interpreted, depending on the context, as "when", "once", "in response to determining" or "in response to detecting". Similarly, the phrases "if it is determined" or "if [the described condition or event] is detected" may be interpreted, depending on the context, as "once it is determined", "in response to determining", "once [the described condition or event] is detected" or "in response to detecting [the described condition or event]".
To facilitate a better understanding of the technical solutions described in this application, the technical terms involved in the embodiments are explained first:
(1) Data parallelism.
Specifically, data parallelism means dividing the data into several blocks that are mapped to different processors, with each processor running the same program to process its assigned data. Most existing parallel processing takes this approach, especially for problems of high computational complexity such as fluid dynamics and image processing.
In the embodiments of this application, data parallelism can be applied to large-scale parallel training of neural networks. The core of data parallelism is to train the same neural network model on multiple processors simultaneously. In each training iteration, each processor obtains the data for the current iteration from the dataset, completes one round of inference and training over the whole network, and returns the gradients computed in that round for updating the model. After receiving the gradients from all processors, the server maintaining the weights uses them to update the model data. Since multiple processors execute the training task in parallel, a larger batch of data can be processed in each iteration, which shortens the time the system needs to finish the training task. The key to data parallelism is therefore the batch size processed in each iteration: the larger the batch, the more processors it can be divided across for parallel processing.
(2) Model parallelism.
In the embodiments of this application, model parallelism is another way, besides data parallelism, of parallelizing neural network computation. In short, model parallelism distributes the computational load onto different processors by partitioning the parameters of the neural network model.
(3) Multi-core processor.
The most common structure currently adopted by multi-core processors is the shared-memory multi-core structure. As shown in Fig. 1A, the processor contains multiple compute cores, each with an independent cache, register file, compute unit and instruction control unit, and all cores share the same global memory.
A single core is already sufficient for any computation task of complex logic, but its performance is limited by Moore's law and chip technology. To further improve processor performance, multiple compute cores are introduced into the processor; they can be used to process computation tasks with a high degree of parallelism.
In practice, the shared-memory multi-core structure is a classical multi-core structure and is very well suited to data-parallel neural network training. Each core can serve as one processor in data parallelism, reading different data and completing the forward and backward computation of the network model in parallel. In the computation phase each core retains the good performance-to-power ratio it had under the earlier single-core architecture, while the throughput of the whole system grows as the number of cores scales up.
(4) Operator splitting.
In the embodiments of this application, operator splitting is adopted to split the computation task: a single operator is split into several sub-operators that can execute in parallel. Note that both the original operator before splitting and the sub-operators after splitting are operators supported by the artificial intelligence processor, and the original tensor data is split into several new sub-tensor data along with the operator. Reflected in the computation graph, the original graph containing a single operator is refined into a graph containing more operators that can execute in parallel. In this way, an intra-operator task split similar to model parallelism is achieved, while it is guaranteed that every sub-operator can reuse the single-core instruction implementation of the operator for its computation, avoiding reconstruction of the original operator's instruction implementation.
In the embodiments of this application, operator splitting is not limited to splitting model parameters; data may also be split in a data-parallel fashion, which in effect blurs the boundary between model parallelism and data parallelism. Taking a convolution operator as an example, if the input data and the weights of the convolution are treated as equally ranked tensor data in the computation graph, then data parallelism divides the computation based on partitioning the input data while model parallelism divides it based on partitioning the weights; both partition the computational load by partitioning the tensor data associated with the convolution operator. From this point of view, data parallelism and model parallelism are unified.
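As a concrete illustration of the idea (this sketch is not part of the original disclosure; the function names and the use of NumPy are illustrative assumptions), the following Python fragment splits an element-wise operator into sub-operators by slicing its input tensor along the N dimension, so that each sub-task reuses the unchanged single-core kernel:

```python
import numpy as np

def split_elementwise_op(op, input_tensor, num_parts):
    """Split an element-wise operator into `num_parts` sub-operators.

    Each sub-operator applies the same single-core kernel `op` to a
    contiguous slice of the input along the N (batch) dimension, so the
    existing single-core implementation is reused unchanged.
    """
    n = input_tensor.shape[0]
    bounds = [n * i // num_parts for i in range(num_parts + 1)]
    sub_tasks = [
        (op, input_tensor[bounds[i]:bounds[i + 1]])  # one sub-operator per slice
        for i in range(num_parts)
    ]
    return sub_tasks

# Usage: four parallel ReLU sub-operators over one batch of activations.
relu = lambda x: np.maximum(x, 0)
data = np.random.randn(8, 64, 32, 32)  # NCHW layout
tasks = split_elementwise_op(relu, data, num_parts=4)
outputs = np.concatenate([op(x) for op, x in tasks], axis=0)  # each task may run on its own core
```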
(5) Artificial intelligence processor.
An artificial intelligence processor, also called a dedicated processor, refers in the embodiments of this application to a processor targeted at a specific application or domain. For example, the graphics processing unit (GPU), also called a display core, visual processor or display chip, is a dedicated processor for image computation on personal computers, workstations, game consoles and some mobile devices (such as tablets and smartphones). Another example is the neural network processor (NPU), a dedicated processor for matrix multiplication in artificial intelligence applications; it adopts a "data-driven parallel computing" architecture and is particularly good at processing massive multimedia data such as video and images.
(6) Deep learning framework.
Taking the convolutional neural network framework Caffe (Convolutional Architecture for Fast Feature Embedding) as an example: in practice, Caffe supports many types of deep learning architectures, is oriented to image classification and image segmentation, and also supports convolutional neural networks (CNN), region-based convolutional neural networks for object detection (RCNN), long short-term memory networks (LSTM) and fully connected network designs.
In the embodiments of this application, the Caffe framework supports many types of basic operators; specifically, the basic operators involved here may include common neural network operators such as convolution/deconvolution operators, pooling operators, activation operators, softmax (classifier) operators and fully connected operators, where the activation operators include but are not limited to ReLU, Sigmoid, Tanh and other operators that can be implemented by interpolation.
In the embodiments of this application, performing any operation on any function can be regarded as an operator.
In the embodiments of this application, the functions under the Caffe framework include the Caffe Blob function, the Caffe Layer function and the Caffe Net function. A Blob is used to store, exchange and process the data and derivative information of the forward and backward iterations in the network; a Layer is used to perform computation, which may include nonlinear operations such as convolve, pool, inner product, rectified-linear and sigmoid, as well as element-wise data transformations, normalize, load data, softmax and loss computations (losses) such as hinge.
In a specific implementation, every Layer defines three important operations: setup, forward and backward. Setup resets the layers and the connections between them when the model is initialized; forward accepts input data from the bottom layer and sends the computed output to the top layer; backward, given the output gradient of the top layer, computes the gradient of its input and passes it to the bottom layer. For example, layers include Data Layer, Convolution Layers, Pooling Layer, InnerProduct Layer, ReLU Layer, Sigmoid Layer, LRN Layer, Dropout Layer, SoftmaxWithLoss Layer, Softmax Layer, Accuracy Layers, etc. A Net starts with a data layer, which loads data from disk, and ends with a loss layer, which computes the objective function of tasks such as classification and reconstruction. Specifically, a Net is a directed acyclic computation graph composed of a series of layers; Caffe preserves all intermediate values in the computation graph to ensure the accuracy of forward and backward iterations.
(7) Software stack of an artificial intelligence processor.
Referring to Fig. 1B, the software stack 10 includes an artificial intelligence application 100, an artificial intelligence framework 102, an artificial intelligence learning library 104, an artificial intelligence runtime library 106 and a driver 108. These are explained below:
The artificial intelligence application 100 provides artificial intelligence algorithm models for different application scenarios. The algorithm model can be parsed directly by the programming interface of the artificial intelligence framework 102. In one possible implementation, the artificial intelligence algorithm model is converted into binary instructions by the artificial intelligence learning library 104, the artificial intelligence runtime library 106 is invoked to convert the binary instructions into artificial intelligence learning tasks, the learning tasks are placed in a task queue, and the driver 108 schedules the learning tasks in the queue for execution by the underlying artificial intelligence processor. In another possible implementation, the artificial intelligence runtime library 106 may be invoked directly to run a previously solidified offline file, reducing intermediate software overhead and improving execution efficiency.
The artificial intelligence framework is the first layer of the whole deep learning ecosystem. In early Caffe, the Layer was regarded as the basic element for building a neural network; in later frameworks such as TensorFlow and MXNet, although a different name such as Operator is used, the core idea is still similar to Caffe's layer: neural network computation is further decomposed into various common tensor-oriented operators, and the framework must concretize the deep learning task expressed by the computation graph into instructions and data that can be executed on a CPU or an artificial intelligence processor. In this process, the framework uses operators as the concrete elements for carrying out the computation task and provides for each operator a kernel function (Kernel) that executes on a CPU or an artificial intelligence processor; according to the computation graph, the framework schedules the execution of the kernel function corresponding to each operator and completes the computation of the whole network.
To facilitate a better understanding of this application, the reasoning behind the technical solution described here is set out below:
In the prior art, the problem of data parallelism is that its scalability depends on the batch size of the data being processed. Although this is usually not a problem in the training phase, it is hard to guarantee in the inference phase. Generally, for neural network models used in real-time services (including video surveillance, autonomous driving, etc.), the data to process arrives serially as a stream, so each processing step handles a small amount of data, often a single image. In this case, data parallelism provides no parallelism at all: all work concentrates on a single core, and the compute resources brought by multiple cores cannot be converted into processing speed.
After a neural network model has been trained offline with a dataset, the model is deployed on cloud servers to process data sent from the outside world; the application scenario then changes from offline training to online inference. In the online inference phase a very important metric is latency, i.e. the time from the server receiving the data to returning the processed result — more precisely, the time spent processing the data with the neural network model. Low latency guarantees that the cloud server can respond to the client's data within the shortest time, and in more sensitive scenarios it directly determines whether a solution is usable. The requirement on the artificial intelligence processor in the online inference phase therefore changes from processing large batches with high throughput to processing small batches with low latency.
In this situation, traditional data parallelism or model parallelism cannot effectively reduce the latency of inference tasks. For data parallelism, a large batch is the prerequisite, which contradicts the small-batch characteristic of online inference. For model parallelism, it is usually a method for dealing with a large model that exceeds the memory limit of a single device; distributing operators to different cores does not reduce the latency of the network. To really reduce inference latency on a multi-core artificial intelligence processor, a method must be found that reasonably distributes the inference computation of small batches, or even a single datum, onto the cores of the multi-core architecture, so that as many cores as possible participate in the computation at every moment and the resources of the multi-core architecture are fully used. One such method is to split the computation task of every operator in the network onto multiple cores; this guarantees multiple cores participate at every moment even when inferring a single image, thereby using multi-core resources to reduce latency.
However, there are still many problems to solve for multi-core artificial intelligence processors. First, deep learning processors adapt their hardware designs to the data-parallel characteristics of deep learning algorithms in order to raise throughput; such processors often need a sufficient data scale to reach high compute efficiency, and further splitting within an operator reduces the computation scale on each core. When the split reaches a certain granularity, the loss of compute efficiency on each core outweighs the gain in parallelism. A balance must therefore be struck between split parallelism and compute efficiency, providing sufficient parallelism while keeping sufficient compute efficiency.
On the other hand, a neural network model can be regarded as a complex computation graph usually composed of hundreds or even thousands of operators. Different kinds of operators embody different algorithmic logic, which leads to different ways of splitting them. Splitting each operator must not only balance its own compute efficiency and parallelism, but also consider its combination with the preceding and succeeding operators, and even its influence on the whole network. The rapid development of deep learning brings more and more large-scale complex networks, so finding a good parallel method manually is unrealistic; an automated method is needed that can give a good split-parallel strategy for different networks.
In addition, portability to the underlying artificial intelligence processor must be considered. For an artificial intelligence processor without sufficiently good programmability, the workload of modifying the software stack to extend from single core to multiple cores and to implement intra-operator split parallelism is very large. Traditional data-parallel and model-parallel implementations are still based on one processing core completing the computation task of one operator, so they do not bring much extra work; cross-core parallelism of a single operator, however, requires modifying the implementation of the operator itself, and the difficulty of this modification depends on the programmability of the artificial intelligence processor and the complexity of the original operator implementation logic. How to reduce the extra overhead of implementing low-latency inference on a multi-core architecture, and to alleviate the dependence of the implementation workload on the programmability of the processor itself, so that the method remains general for different multi-core artificial intelligence processors in the future, is also an issue to be considered.
Based on the above analysis, in the embodiments of this application an operator is split into multiple smaller sub-operators, so that the compute library of the single-core architecture can be invoked directly, avoiding the extra workload of reimplementation. For instance, after splitting, an activation operator yields many smaller activation operators, which means it is only necessary to call the original single-core activation function on multiple cores to complete each sub-task, without modifying or reimplementing a multi-core version of the activation function. In this process, both the compute efficiency and parallelism of each operator after splitting, and the mutual cooperation in splitting between the operators in context, must be considered. The ultimate goal is a split-parallel scheme that effectively reduces the end-to-end inference latency of the whole neural network model.
In addition, it should be noted that the neural network processing method provided by the embodiments of this application can avoid modifying the single-core processor's compute library as far as possible while achieving parallel execution of the neural network model on a multi-core processor. Specifically, the upper-level framework splits an operator in the neural network model into several sub-operators that can execute in parallel; for each sub-operator, the deep learning framework calls the compute library to generate the machine instructions that the sub-operator executes on a single core, and by loading the machine instructions of the sub-operators onto different cores, the parallel computation of the operator on the multi-core processor is achieved. Specifically, because the deep learning framework can use the single-core processor compute library to generate the compute instructions of the sub-operators, the input and output tensor data of the operator in the neural network model are split into the corresponding sub-tensor data as the operator is split into sub-operators.
Based on the above analysis, a schematic structural diagram of the hardware device to which the method described in this application can apply is introduced first. Referring to Fig. 2, a schematic structural diagram of a computer device provided by an embodiment of this application, the computer device 20 may include a general-purpose processor 201, a memory 202, a communication bus 203, a communication interface 204 and at least one artificial intelligence processor 205, with the general-purpose processor 201 and the artificial intelligence processor 205 connected to the memory 202 and the communication interface 204 through the communication bus 203.
The general-purpose processor 201 may be a central processing unit (CPU); it may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic device, or discrete hardware component. The general-purpose processor 201 may be a microprocessor or any conventional processor.
The general-purpose processor 201 may also be an integrated circuit chip with signal processing capability. In implementation, each step of the neural network processing method of this application may be completed by integrated logic circuits of hardware in the general-purpose processor 201 or by instructions in the form of software.
The memory 202 may be a read-only memory (ROM), a random access memory (RAM) or another memory. In the embodiments of this application, the memory 202 is used to store data and various software programs, for example the program that splits the neural network model according to the determined target split path.
Optionally, in the embodiments of this application, the memory may comprise a physical device for storing information, typically digitizing the information and then storing it in media that use electrical, magnetic or optical methods. The memory described in this implementation may include: devices storing information electrically, such as RAM and ROM; devices storing information magnetically, such as hard disks, floppy disks, magnetic tapes, magnetic core memories, magnetic bubble memories and USB drives; and devices storing information optically, such as CDs or DVDs. Of course, there are also memories of other kinds, such as quantum memories and graphene memories.
The communication interface 204 uses a transceiver device, such as but not limited to a transceiver, to implement communication between the computer device 20 and other devices or communication networks. For example, a model file sent by another device may be received through the communication interface 204.
The artificial intelligence processor 205 may be mounted on a host CPU as a coprocessor, with the host CPU allocating tasks to it. In practice, the artificial intelligence processor 205 may implement one or more kinds of operations. Taking a neural network processing unit (NPU) as an example, the core part of the NPU is an arithmetic circuit, and a controller controls the arithmetic circuit to extract matrix data from the memory 202 and perform multiply-accumulate operations.
Optionally, the artificial intelligence processor 205 may include 8 clusters, each containing 4 artificial intelligence processor cores.
Optionally, the artificial intelligence processor 205 may be an artificial intelligence processor of a reconfigurable architecture. Here, a reconfigurable architecture means that if an artificial intelligence processor can use reusable hardware resources to flexibly change its own architecture according to different application requirements, so as to provide an architecture matched to each specific application requirement, it is called a reconfigurable computing system and its architecture is called a reconfigurable architecture.
It should be understood that the computer device 20 is only one example provided by the embodiments of this application; the computer device 20 may have more or fewer components than shown, may combine two or more components, or may be implemented with a different configuration of components.
Based on the structural diagram of the computer device shown in Fig. 2, and with reference to the flowchart of a neural network processing method provided by an embodiment of this application shown in Fig. 3, the following specifically describes how the target operator is split in the embodiments of this application so as to achieve the purpose of optimizing the computation of the artificial intelligence processor cores, taking Caffe as an example. The method may include, but is not limited to, the following steps:
Step S300: determining, according to a target operator in the neural network model, a set of split states of the tensor data associated with the target operator.
Under the Caffe framework, the target operator may correspond to a target layer in the neural network model; the target layer is at least one layer of the neural network model, and the tensor data includes input tensor data and output tensor data.
In the embodiments of this application, the neural network model can receive input data and generate a prediction output according to the received input data and the current model parameters. In practice, the neural network model may be a regression model, a deep neural network (DNN), a convolutional neural network (CNN), a recurrent neural network (RNN), etc., which is not specifically limited in the embodiments of this application.
When the computer device executes a neural network computation task with multiple layers of operations, the input neurons and output neurons of the multi-layer operations do not refer to the neurons in the input layer and output layer of the whole neural network model: for any two adjacent layers in the network, the neurons in the lower layer of the network's forward operation are the input neurons and the neurons in the upper layer are the output neurons. Taking a convolutional neural network with L layers as an example, for layer K and layer K+1, K = 1, 2, ..., L-1, layer K is called the input layer and its neurons are the input neurons, and layer K+1 is called the output layer and its neurons are the output neurons; that is, except for the top layer, each layer can serve as an input layer whose next layer is the corresponding output layer.
In the embodiments of this application, an operator is a function that implements a specific functionality. For example, the reshape operator is used to reinterpret the shape of tensor data; the transpose operator is used to adjust the order of dimensions of tensor data.
In the embodiments of this application, a directed acyclic graph is a directed graph with the additional constraint of containing no cycles.
In the embodiments of this application, a directed edge may represent the connection relationship between operators, and may also represent the execution order when the artificial intelligence processor executes the neural network model.
In the embodiments of this application, the split states in the split state set of the input tensor data of the target operator are determined according to the computational logic of the target operator and the split states in the split state set of the corresponding output tensor data.
In the embodiments of this application, the split states in the split state set of the output tensor data of the target operator are determined according to the computational logic of the operator and the split states in the split state set of the corresponding input tensor data.
Specifically, a neural network model can usually be regarded as a directed acyclic graph (DAG) composed of operators and multi-dimensional tensor data, in which operators and tensor data are connected by directed edges whose direction indicates whether the data is an input or an output of the operator. For ease of exposition, in the embodiments of this application op denotes an operator and tensor denotes tensor data. To unify the expression of the splitting of different operators, the deep learning framework uniformly uses the splitting of the tensor data associated with an operator to describe the splitting of the operator itself. In the embodiments of this application, all tensor data in the neural network is considered 4-dimensional; the input or output data of the final fully connected layer and softmax regression layer of an image classification network, whose actual dimensionality is less than 4, is still expressed as a 4-dimensional tensor. The symbols N, C, H, W denote these four dimensions, where N is the batch size, C is the number of feature maps, H is the height of a feature map and W is its width. This assumption is made only for convenience of exposition; the framework itself can handle neural network models containing tensor data with any number of dimensions, and 4 dimensions are nevertheless sufficient for a sizable majority of network structures.
In the embodiments of this application, when the computer device splits an operator in the neural network model, it takes into account that different kinds of operators support different computational logic and hence different splitting strategies. To express the splitting strategies of different operators uniformly, the split states of the operator's input and output tensor data are used to express the splitting of the operator's own computational logic.
In the embodiments of this application, considering that different operators have different characteristics, the computer device can determine the splitting method of an operator according to its type in order to avoid the negative effects of an unreasonable split, thereby obtaining the split states in the split state set. Specifically, see Table 1.
[Table 1: the splitting methods supported by each operator type — reproduced in the original as image PCTCN2020116816-appb-000001]
As shown in Table 1, different types of operators support different splitting methods. In this way, operators can be split in a targeted manner according to their characteristics, avoiding the negative effects of unreasonable splitting, such as increased resource consumption on the computer device or the time cost caused by unbalanced sizes of the split sub-operators.
Specifically, taking the convolution operator as an example, its different splitting methods in the embodiments of this application can be described as the following five kinds; these can intersect and coexist, guaranteeing a sufficient degree of splitting:
(1) splitting along the N dimension of the input data, when the N dimension of the convolution input exceeds 1;
(2) splitting along the C dimension of the input data of the convolution operator;
(3) splitting along the C dimension of the output data of the convolution operator;
(4) splitting along the H dimension of the input data of the convolution operator;
(5) splitting along the W dimension of the input data of the convolution operator.
As can be seen, all five methods split the original convolution operator into smaller convolutions.
For ease of understanding, a concrete example follows. Under the Caffe framework the neural network model has a hierarchical structure. As shown in Fig. 4, the original computation graph of a convolution operator provided by an embodiment of this application, the convolution operator conv has input data (input) in four dimensions and, under the action of a weight matrix, produces output data (output). Figs. 5A-5E show various splittings of the convolution operator in the computation graph with a parallelism of 2: Fig. 5A splits along the N dimension of the input data; Fig. 5B splits along the C dimension of the output data; Fig. 5C splits along the C dimension of the input data; Fig. 5D splits along the H dimension of the input data; Fig. 5E splits along the W dimension of the input data. In the figures, each tensor gives the start and end point of every dimension, making explicit the relation between the split sub-tensors and the original tensor: n is the batch size of the input data, ic the number of input feature maps, ih/iw the height/width of the input feature maps, oc the number of output feature maps, oh/ow the height/width of the output feature maps, and kh/kw the height/width of the convolution kernel window. In practice, these splitting methods act on different dimensions and can be combined with one another to form more splitting methods, providing sufficient parallelism to utilize the resources of the multi-core processor while, to a certain extent, avoiding the excessive splitting of a single dimension that would hurt the compute efficiency of the computer device.
As another example, taking the classifier (softmax) operator: the computer device can split the softmax operator along any one or several dimensions other than the dimension along which softmax normalizes the probabilities, and the split yields several softmax operators that can execute in parallel.
As another example, taking an activation operator: the computer device can allow its input data and output data to be split along any dimension. In practice, when the input data of an activation operator is divided into several sub-blocks (for consistency, the output data is divided in the same way), denoted input0, input1, input2, ..., inputm-1 and output0, output1, output2, ..., outputm-1, the whole activation operator is in fact split into m smaller activation operators in the computation phase; these activation operators have no dependencies on one another and can run on multiple cores.
It should be noted here that choosing along which dimensions to split an operator is very meaningful for operators that are particularly sensitive to the splitting method, for example the classifier (softmax) operator described above.
In the embodiments of this application, when the split state set of the tensor data associated with the target operator is determined, the split state set may take the following forms:
(1) In one possible implementation, the neural network model contains many different types of operators, all of which allow splitting along arbitrary dimensions. In this case, the computer device can determine the split states in the split state set according to the splitting method corresponding to each of the different operators.
For ease of understanding, an example follows. Under the Caffe framework the neural network model has a hierarchical structure. For instance, as shown in Fig. 5F, the face recognition neural network model contains several different types of operators (convolution, pooling and fully connected operators), whose connection relationship is: convolution layer 1 - pooling layer 1 - convolution layer 2 - pooling layer 2 - fully connected layer 1 - fully connected layer 2. Since these operators allow splitting along any dimension, the computer device can in this case determine the split states in the split state set according to the splitting method of each operator.
(2) In one possible implementation, the neural network model contains many different types of operators, some of which allow splitting along any dimension while others only support splitting along limited dimensions. In this case, the computer device can determine the splitting methods of each of the different operators separately, and then determine those splitting methods as the split states in the split state set.
(3) In one possible implementation, the neural network model contains many different types of operators, some of which allow splitting along any dimension while others only support splitting along limited dimensions. In this case, the computer device can determine the splitting methods of each of the different operators separately, and then determine the intersection of the splitting methods supported by every one of the operators as the split states in the split state set.
For ease of understanding, an example follows. As shown in Fig. 5G, the license plate character recognition neural network model contains several different types of operators (convolution, pooling, activation, softmax, etc.), connected as: convolution layer 1 - activation function ReLU - max pooling layer 1 - convolution layer 2 - activation function ReLU - max pooling layer 2 - convolution layer 3 - activation function ReLU - max pooling layer 3 - convolution layer 4 - activation function - max pooling layer 4 - convolution layer 5 - activation function - max pooling layer 5 - fully connected layer 1 - softmax layer - output layer. Since the convolution, pooling and activation operators allow splitting along any dimension while the softmax operator only supports splitting along limited dimensions, the computer device in this case determines the intersection of the splitting methods supported by every one of these operators as the split states in the split state set.
(4) In one possible implementation, the neural network model contains operators that do not support splitting in any form; in order for the other operators in the neural network model to stay consistent in the split format of the data, the neural network model is not split in this case. We regard this state as the unsplit state. This implementation avoids the negative effects of an unreasonable split, such as increased resource consumption on the computer device or the time cost caused by unbalanced sizes of the split sub-operators.
Here, when determining the states in the split state set, the technical solution described in this application may split all operators of the whole neural network model or only some of them, which is not specifically limited in the embodiments of this application. Moreover, considering that the network structures and algorithms emerging in the deep learning field have gradually blurred the physical meaning of the individual data dimensions and the boundaries between them, this technical solution can be extended to operator splitting along more dimensions.
In the embodiments of this application, any splitting of tensor data is called a split state s of that tensor data; after the computer device splits the tensor data, a set of sub-tensor data is obtained, and the split state s is characterized by the corresponding set of sub-tensor data. All possible splits {s0, s1, s2, ...} constitute the split state set S of the tensor data. In general this is a very large state space, which means that the space of possible splittings of the operator, as represented by the split states of its tensor data, is also enormous.
In the embodiments of this application, the computer device may prune the state space of the tensor data, subject to at least one preset pruning condition, in order to shrink the state space. For example, the pruning conditions may include but are not limited to: (1) when splitting the neural network model, it should be ensured that the sizes of the split sub-operators are balanced. This implementation removes split states with unbalanced splits from the state space S of the tensor. The reason for keeping the sizes of the split sub-operators balanced is: first, the latency of a multi-core processor completing the computation of one operator depends on the core that takes the longest to execute its sub-task, and since the cores of a multi-core architecture are mutually equivalent in hardware structure, the time consumed by each core depends on the task load allocated to it. When the sizes of the split sub-operators are balanced, the time consumption of every core in the multi-core structure is equal, which improves the execution efficiency of the computer device. (2) when splitting the neural network model, it should be ensured that the number of split sub-operators is an integer power of 2. This implementation removes split states with unbalanced part counts from the state space S of the tensor. The reason is that the number of cores in a multi-core processor architecture is usually an integer power of 2, e.g. 1, 2, 4, 8, 16, and in practice a task whose degree of parallelism is not an integer power of 2 tends to produce "fragments" in core scheduling, so the number of split sub-operators should be an integer power of 2. It can be understood that when at least one of the above pruning conditions is satisfied, the computer can adjust the split states in the state space S to remove some unreasonable ones; this shrinks the search space of operator splitting strategies while avoiding the negative effects of unreasonable splitting, such as increased resource consumption on the computer device or the time cost caused by unbalanced sizes of the split sub-operators.
In the embodiments of this application, not every split state of the tensor data associated with an operator can represent a valid splitting of that operator. The split dimensions of the tensor data must be supported by the operator; for example, the input data of a softmax regression operator should not be split along the dimension to be normalized. Moreover, the splits of an operator's input and output tensors must satisfy the operator's computational logic. For example, the start and end point of every sub-block of a convolution operator's output data split along the H/W dimensions must indeed be computed from the corresponding sub-blocks of its input data split along H/W according to the convolution kernel and stride of the operator; the split of the convolution input along the C dimension must exactly match the split of the weight data along C, and the split of the output data along C must exactly match the split of the weight data along N. In the deep learning framework, the output states are used to derive the operator's input states backwards according to the specific logic of each operator, or the input states are used to derive its output states forwards; this guarantees that the split states of the related data always represent a valid way of splitting the operator.
Step S302: traversing the split state sets and determining the split paths of the target operator's tensor data between adjacent split state sets.
In the embodiments of this application, as shown in Fig. 5H, the splitting scheme P of the whole neural network model can be regarded as a jump from one split state in the split state set of each operator's input tensor data to one split state in its output tensor; the split state of the previous operator's output tensor is the split state of the next operator's input tensor. Every possible jump through an operator corresponds to a valid splitting method of that operator, so a split path can represent the splitting method of the operator.
In the embodiments of this application, the computational logic of the operator is split via the splitting method corresponding to the split path, obtaining the corresponding set of sub-operators. The state of the input tensor data and the state of the corresponding output tensor data are connected by a split path, meaning that the set of sub-tensor data of one split state of the input tensor data is processed by the sub-operators of the sub-operator set to yield the set of sub-tensor data of the corresponding split state of the output tensor data. Here, the path represents the intermediate process from the operator's input to its output.
In the embodiments of this application, the time an operator takes to execute in parallel on the multi-core processor under a certain split state can be characterized as a weight. It should be noted that the time for the multi-core processor to complete the computation of one operator depends on the core that takes the longest to execute its split sub-task.
In the embodiments of this application, the weight value of each split path can be determined through the following steps A1-A4:
A1. Determine the computational loads c1, c2, ..., cn of the n sub-operators after splitting, where ci is computed from the type and scale of the i-th sub-operator after splitting;
A2. Determine the memory-access data amounts d1, d2, ..., dn of the n sub-operators, where di is computed from the type and scale of the i-th sub-operator after splitting;
A3. Determine the computational throughput α of each artificial intelligence processor core; α is decided by the performance parameters of the artificial intelligence processor itself;
A4. Determine the memory-access bandwidth β of each artificial intelligence processor core. Generally, the multiple cores of the artificial intelligence processor share a limited memory bandwidth, so β = B/n, where B is the total bandwidth of the multi-core artificial intelligence processor.
Based on the parameters determined above, the computer device can compute the weight value corresponding to each splitting strategy according to the following formula (1):

t = max_{i=1,...,n} ( max( c_i / α, d_i / β ) )    (1)

The inner max operation in the formula is based on the fact that the compute part and the memory-access part of an operator implementation can hide each other, i.e. computation and memory access can execute as concurrently as possible. For some artificial intelligence processors, the compute throughput of each core drops when the sub-operator is too small; α can be further corrected to make the estimate more accurate. The outer max operation expresses that the time for the multi-core artificial intelligence processor to complete the computation of one operator depends on the core that takes the longest to execute its sub-task.
It should be noted that the above way of obtaining the weight of a split path is only a partial enumeration rather than exhaustive; those skilled in the art, having understood the essence of this technical solution, may produce other variants on its basis. For example, the weight of a split path may be measured not only by the time spent executing the sub-tasks but also by the throughput of executing them, or the weight may be determined by actually measuring on the multi-core processor the execution time of all sub-tasks under the operator splitting corresponding to the split path. As long as the function realized and the technical effect achieved are similar to those of this application, they fall within the scope of protection of this application.
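A minimal sketch of evaluating formula (1) follows (not from the original text; names and the example numbers are illustrative). It computes the weight of one split path from the per-sub-operator loads, the per-core throughput α and the shared bandwidth B:

```python
def split_path_weight(loads, mem_traffic, alpha, total_bandwidth):
    """Estimate the weight t of one split path, formula (1) in the text:
    t = max_i max(c_i / alpha, d_i / beta), with beta = B / n because the
    n cores share the total memory bandwidth B.
    """
    n = len(loads)
    beta = total_bandwidth / n
    return max(max(c / alpha, d / beta) for c, d in zip(loads, mem_traffic))

# Two sub-operators: per-core time is the slower of compute and memory access.
t = split_path_weight(loads=[4.0e9, 4.2e9],        # operations per sub-operator
                      mem_traffic=[1.0e8, 1.1e8],  # bytes moved per sub-operator
                      alpha=2.0e12,                # ops/s per core
                      total_bandwidth=1.0e11)      # bytes/s shared by all cores
print(t)
```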
Step S304: determining, according to the weights of the split paths, the target split path of the tensor data of the target operator.
In the embodiments of this application, determining the target split path of the target operator's tensor data can be implemented in two different ways. In one possible implementation, the target split path is determined by forward traversal; in another possible implementation, the target optimization path is determined by backward traversal. These are detailed below:
In the embodiments of this application, determining the target optimization path by forward traversal may include:
traversing all split state sets of the target operator's tensor data; for the current split state set, traversing each split state to obtain all directed edges pointing to the current split state and the split paths from the split states at the starting points of the directed edges to the split states of the input tensor data of the target operator;
determining the split path from the current split state to the split state of the input tensor data of the target operator according to the weight of the directed edge and the weight of the split path from the starting split state of the directed edge to the split state of the input tensor data of the target operator, wherein the weight of a split path is determined from the weights of all directed edges corresponding to the split path;
after traversing all split state sets of the target operator, obtaining the target split path between the split state set of the input tensor data of the target operator and the split state set of the output tensor data of the target operator.
In the embodiments of this application, determining the target optimization path by backward traversal may include:
traversing all split state sets of the target operator; for the current split state set, traversing each split state to obtain all directed edges starting from the current split state and the split paths from the split states at the end points of the directed edges to the split states of the output tensor data of the target operator;
determining the split path from the current split state to the split state of the output tensor data of the target operator according to the weight of the directed edge and the weight of the split path from the split state at the end point of the directed edge to the split state of the output tensor data of the target operator, wherein the weight of a split path is determined from the weights of all directed edges corresponding to the split path;
after traversing all split state sets of the target operator, obtaining the target split path between the split state set of the input tensor data of the target operator and the split state set of the output tensor data of the target operator.
In the embodiments of this application, once the computer device has determined the weight values corresponding to the multiple different splitting schemes of the neural network model, it can determine the splitting scheme with the smallest weight value as the target split path of the neural network model.
In the embodiments of this application, the number of target split paths obtained by forward traversal (or backward traversal) may be one or several, which is not specifically limited here. Those skilled in the art should understand that the number of target split paths often depends on the specific neural network model (or target operator). It should further be noted that when there are multiple target optimization paths, the embodiments of this application may use any one of them to split the neural network model, or select an optimal one among them, so that the multi-core processor runs the split neural network model on the corresponding cores.
In the embodiments of this application, the computer device can obtain the target optimization path from Fig. 5H in combination with the Viterbi algorithm; here, the target optimization path is the path with the smallest sum of weights. Specifically, the Viterbi algorithm is a dynamic programming algorithm for finding the sequence of hidden states most likely to produce the observed time series. In the embodiments of this application, the states in the split state sets of the tensor data are regarded as the hidden states of the Viterbi algorithm, the directed edges between split state sets are regarded as the transitions between hidden states, and the weights of the directed edges correspond to the logarithms of the transition probabilities between hidden states.
In a specific implementation, the computer device traverses all operators in the network's computation graph from front to back. When visiting the i-th operator, the computer device determines, from all directed edges corresponding to the current operator and their weights t (specifically, as in formula (5) below), the shortest path l from the split states in the split state set of the network's input tensor data to every split state in the split state set of the current operator's output tensor data. [The concrete expressions appear in the original as images PCTCN2020116816-appb-000002 to -000005.] When the computer device has finished traversing all operators, the shortest paths from the split states in the split state set of the network's input tensor data to every split state in the split state set of its output tensor data are obtained; the computer device then determines the globally shortest one among these shortest paths, i.e. the target optimization path.
It should be noted here that the above implementation, which uses a Viterbi-like algorithm to obtain the target optimization path, is only an example rather than exhaustive; those skilled in the art, having understood the essence of this technical solution, may produce other variants on its basis. For example, the weight of each split path between the split state set of the neural network model's input tensor data and the split state set of its output tensor data may be determined by the sum of the weights of the corresponding state paths; setting a threshold empirically, any split path whose weight is smaller than the set threshold can serve as a target split path for splitting the neural network model. As long as the function realized and the technical effect achieved are similar to those of this application, they fall within the scope of protection of this application.
For ease of understanding, a concrete example of how the target split path is obtained after traversing all split state sets of the target operator in the embodiments of this application follows.
As shown in Fig. 6A, the neural network model has a serial structure, and the input tensor data and output tensor data of the whole model are both in the unsplit state. Here, the input tensor data of the whole neural network model being in the unsplit state means that the current split state set contains one and only one input state; correspondingly, the output tensor data of the whole model being in the unsplit state means that the current split state set contains one and only one output state.
A serial neural network model containing n operators is described as an operator sequence (OP0, OP1, OP2, ..., OPn). Assume each operator has one input and one output, and the input of an operator is the output of the previous one; then all tensor data, including the input and output tensor data of the whole network and all the intermediate result tensors between operators, form the set (Tensor0, Tensor1, ..., Tensorn), where the input of OPi is Tensori-1 and its output is Tensori. For each data tensor Tensori there is a corresponding state set Si; the goal of the search strategy is to find a mapping Tensor_i → S_i between each tensor and one state of its state set. By fixing a concrete split state for every tensor in the network model, the splitting method of every operator is determined; a mapping from all tensor data of a neural network model to their split states is therefore called a splitting scheme P of that network model. In the computation phase, the i-th operator OPi computes output tensor data in split state r from input data in split state s; the concrete parallel computation is decided by the states of the input and output tensor data, and the computation time of the operator is denoted t_{s→r}, whose value depends on the corresponding splitting and the hardware characteristics of the underlying accelerator. The latency T of the whole network is then given by formula (2):

T = Σ_{i=1}^{n} t_{s_{i-1} → s_i}    (2)

where s_{i-1} ∈ S_{i-1} and s_i ∈ S_i.
Since the splitting scheme P of the whole network can be regarded as a jump from one state in the state set of every operator's input tensor to one state of its output tensor, each possible jump through an operator corresponds to a valid splitting of that operator, and likewise to the time ti needed to execute the operator in parallel on the multi-core processor with that splitting; ti can therefore be regarded as the weight of the directed edge from a state of the operator's input tensor to a state of its output tensor. Meanwhile, as the input tensor and output tensor of the whole network, their corresponding state spaces contain only one unsplit state that keeps the whole data block contiguous and complete, so that the splitting scheme P of the neural network model starts from complete input data and ends with complete output data, and the external user always sees one complete input and output. Searching for a good splitting scheme P for a given neural network model then amounts to finding a shortest path from the unsplit state of the input tensor data to the unsplit state of the output tensor data, where the path must pass through one chosen state in the valid state space of every intermediate result tensor.
Formulas (3) and (4) give this abstract formulation:

P = {s_0, s_1, ..., s_n} = argmin( T(s_0, s_1, ..., s_n) )    (3)

T(s_0, s_1, ..., s_n) = Σ_{i=1}^{n} t_{s_{i-1} → s_i}    (4)
Specifically, the computer device takes the unsplit state of the whole neural network model's input tensor data as the initial state Sroot. In the initial phase, the weight of the split path corresponding to the initial state Sroot is 0, and the weights of the split paths corresponding to all states of all other tensor data are ∞. Any state s of any tensor data in the neural network model has a corresponding split-path weight ls from Sroot to s. Each split state set is visited from front to back; within each split state set, every state s is traversed in turn. For each state s, there are directed edges e1, ..., eks pointing to states in the next split state set. Taking a split state v in the next split state set as an example, formula (1) gives the weight tsv between state s and state v, and formula (5) is used to update the weight lv of the split path from Sroot to the state v in the next split state set that the state path points to:

l_v = min( l_v, l_s + t_{sv} )    (5)

After the computer device completes the visit of all split state sets by forward traversal along the directed edges of the neural network model, the target split path from the unsplit state sroot of the model's input tensor data to the unsplit state send of its output tensor data is obtained.
The above describes a path from the unsplit state sroot to the unsplit state send that passes through one state in every split state set; this path is a split path of the neural network model. The computer device can select, among the split paths of the neural network model, the one with the smallest weight as the target split path of the model.
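A minimal Viterbi-style sketch of this forward dynamic programming follows (not from the original disclosure; it assumes a serial network with one state set per tensor and at least one valid path, and all names are illustrative):

```python
import math

def shortest_split_path(state_sets, edge_weight):
    """Forward dynamic programming over split state sets (Viterbi-style).

    state_sets: list of lists; state_sets[i] holds the split states of
        Tensor_i, with state_sets[0] == [s_root] and state_sets[-1] == [s_end].
    edge_weight(i, s, v): weight t_{s->v} of operator OP_{i+1} mapping state
        s of Tensor_i to state v of Tensor_{i+1} (math.inf if invalid).
    Returns the minimal total weight and one chosen state per tensor.
    """
    dist = {(0, 0): 0.0}          # (tensor index, state index) -> l, formula l_s
    back = {}
    for i in range(len(state_sets) - 1):
        for si, s in enumerate(state_sets[i]):
            ls = dist.get((i, si), math.inf)
            for vi, v in enumerate(state_sets[i + 1]):
                t = edge_weight(i, s, v)
                if ls + t < dist.get((i + 1, vi), math.inf):
                    dist[(i + 1, vi)] = ls + t   # relaxation, formula (5)
                    back[(i + 1, vi)] = si
    # Backtrack from s_end to recover the chosen split state of each tensor
    # (assumes at least one valid path reached s_end).
    path, vi = [], 0
    for i in range(len(state_sets) - 1, 0, -1):
        path.append(state_sets[i][vi])
        vi = back[(i, vi)]
    path.append(state_sets[0][vi])
    return dist[(len(state_sets) - 1, 0)], list(reversed(path))
```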
It should be noted that the neural network model of Fig. 6A is a serial model and that, for ease of explaining this technical solution, the split state sets corresponding to its input and output tensor data are both unsplit states. When the split state set of the model's output tensor data is not the single unsplit state Send but a set of multiple split states, the minimum among the weights of the split paths of every split state in that set is selected as the target split path between the split state set of the model's input tensor data and the split state set of its output tensor data.
Additionally, it should be noted that the computer device may equivalently search for the split path from the unsplit state Send to the unsplit state Sroot; the two are equivalent. Likewise, when the split state set of the model's input tensor data is not a single unsplit state but a set of multiple split states, the minimum among the weights of the split paths of every split state in that set is selected as the target split path between the split state set of the model's input tensor data and the split state set of its output tensor data.
Step S306: splitting the target operator according to the target split path, so as to allocate it to the corresponding cores of the multi-core artificial intelligence processor for processing.
In the embodiments of this application, the number of cores of the multi-core artificial intelligence processor may be 8 or 16, which is not specifically limited in the embodiments of this application.
In the embodiments of this application, after the target optimization path is determined, the computer device can split the target operator according to the determined target optimization path. Considering that the neural network model is used to execute some specific neural network computation task, e.g. face recognition, edge detection or semantic analysis: when the computer device splits the neural network according to the target split path, i.e. splits the neural network computation task into several sub-tasks, the computer device can invoke the multi-core artificial intelligence processor to run the split sub-tasks and obtain a running result. Here, the running result refers to the result obtained when the computer device executes the specific neural network computation task, and may include but is not limited to the accuracy of the neural network model and its running time. In practice, the computer device may output the running result, for example display it on a screen.
By implementing the embodiments of this application, the computer device splits the neural network computation task into several smaller sub-tasks, so that the multi-core processor can directly invoke the compute library of the single-core architecture, making full use of the hardware resources of the multi-core processor and avoiding the extra workload of reimplementation.
In the embodiments of this application, glue operators may be inserted between the target operator and the associated split states to adjust the split states in the split state set. The following specifically introduces how the glue operators are introduced in the embodiments of this application and how the target optimization path is determined based on the updated split state sets; the procedure may include but is not limited to the following steps:
Step S400: determining, according to the target operator in the neural network model, the set of split states of the tensor data associated with the operator of the target operator.
For the specific implementation of step S400, refer to step S300 above; details are not repeated here.
Step S402: inserting a glue operator between the target operator and the associated split state set, and adjusting the split states in the split state set to obtain an adjusted split state set, where the glue operator is used to convert a split state of the tensor data obtained by one splitting method into a split state obtained by any other splitting method.
In the embodiments of this application, in order to distinguish the split state set before a glue operator is introduced from the split state set obtained by adjustment after introducing it, we define the split state set before the glue operator is introduced as the first split state set, and the adjusted split state set after introducing the glue operator as the second split state set.
In the embodiments of this application, when a single operator is split, the tensor data associated with the operator is also split into several sub-tensor data in different ways depending on the chosen splitting method. Since in a real network a tensor is often associated with multiple operators, which splitting method each operator in the computation graph chooses is not an isolated question: it affects the neighboring operators and even all operators in the network. For example, in the simplest case, some tensor Tensor1 is both the output data of operator OP0 and the input data of operator OP1. Once OP0 has decided to split in some way, Tensor1, as the output of OP0, has also been determined to be split into a series of sub-tensor data in some way; when OP1 chooses its splitting method, it must then ensure that the chosen method is compatible with the already-determined splitting of its input data Tensor1, which constrains OP1's range of choices. It can be understood that the splitting method chosen by OP1 under this constraint in turn constrains, through the tensor data associated with it, the split choices of other neighboring operators.
This mutual influence between operators on the choice of splitting method brings many problems. First, it causes performance problems. In practice, when the computer device invokes on a multi-core processor the sub-computation tasks corresponding to different splitting methods, performance differs between them. It can be understood that if the best operator splitting schemes of two adjacent operators are inconsistent in the splitting of their commonly associated tensor data, one side must inevitably yield to the other's choice to avoid a conflict.
Second, the mutual influence of splitting methods between operators affects the executability of the whole network. As stated above, the splitting methods an operator can support depend on the operator's own type and its data scale. Some operators, such as the activation operator ReLU and the convolution operator Conv, support splitting methods that allow their input data to be split along any dimension of NCHW; some operators, such as the softmax operator, only support splitting their input data along certain particular dimensions; finally, some operators — often ones with very complex implementations, such as the NMS (non-maximum suppression) operator — can hardly distribute their computational load onto multiple cores through operator splitting, so such operators can ultimately only execute on a single core and their corresponding input data must stay in a complete, unsplit state. It can be understood that if an operator of this last kind exists in a neural network model, the input data of that operator must be guaranteed to be in the complete, unsplit state, otherwise the network cannot continue execution at that operator. If this constraint spreads along the network structure, it becomes difficult to mine a sufficient degree of parallelism in the neural network computation through operator splitting.
In the embodiments of this application, in order to solve the problem of operator splits influencing one another, a glue operator is inserted between the target operator and the associated first split state set; the glue operator enables each operator in the computation graph corresponding to the neural network model to choose the splitting method acting on itself flexibly and without restriction.
Specifically, the glue operator (Transform) is used to adjust tensor data from the state of several sub-tensor data obtained by one splitting method to the several sub-tensor data obtained by another splitting method. As shown in Fig. 6B, when the current splitting of a tensor is not allowed by any splitting method of the following operator, or when the schemes the following operator can choose while staying compatible with the current splitting of the tensor data bring poor performance gains, the computer device can insert a glue operator into the computation graph to adjust the current data into another, better splitting.
In the embodiments of this application, the semantics of the glue operator can be obtained from the concat operator and/or the split operator of the neural network model, detailed below:
In the embodiments of this application, the concat operator, i.e. the concatenation operator, is used to concatenate several tensors into one tensor along a specified dimension; apart from the specified dimension, the other dimensions of the input tensors must agree. Through the concat operator, the neural network concatenates several tensors representing features from different upstream locations into one, so that these features can be processed jointly in downstream computation. See the schematic diagram of the concat operator semantics in Fig. 6C.
In the embodiments of this application, the split operator, i.e. the splitting operator, is used to split one tensor into several tensors along a specified dimension; the resulting tensors agree in all dimensions other than the specified one. Through the split operator, features belonging to the same tensor data can be split into several parts for targeted processing in subsequent computation. See the schematic diagram of the split operator semantics in Fig. 6D.
In the embodiments of this application, the glue operator is internally implemented in one of four ways: split-then-concat, concat-then-split, concat only, or split only. In the concat phase, sub-data blocks adjacent along any dimension can be concatenated into a new sub-tensor; in the split phase, any sub-tensor can be split into several smaller sub-tensors. In this way, the sub-tensor data obtained by any one splitting of a tensor can be converted into the sub-tensor data obtained by any other splitting. To illustrate this, assume the data itself is one-dimensional and its splitting before adjustment is expressed as {(0, p1), (p1, p2), ..., (pn-1, end)}, each segment representing one sub-segment of the split 1-D data, and the splitting after adjustment by the glue operator is {(0, q1), (q1, q2), ..., (qm-1, end)}. If two adjacent segments (pi-1, pi), (pi, pi+1) before adjustment form one segment (qj, qj+1) after adjustment, i.e. pi-1 = qj and pi+1 = qj+1, then when adjusting this part it suffices to concatenate (pi-1, pi) and (pi, pi+1) in the concat phase and skip the split phase. Likewise, in the other case, if one sub-segment before adjustment is the union of several sub-segments after adjustment, the concat phase is skipped and the corresponding split is performed in the split phase. In the worst case, all data can be combined into one complete 1-D block in the concat phase, and the corresponding split is then performed in the split phase.
In the embodiments of this application, inserting the glue operator between the target operator and the associated first split state set and adjusting the split states in the split state set of the operator's input tensor data to obtain the second split state set includes: inserting a glue operator between the target operator and the associated first split state set, and updating the split states of the first split state set into the second split state set through the glue operator.
As stated above, all sub-tensor data obtained by splitting data in any one way is called a split state S of that tensor data, and all possible states of the tensor data constitute its state space S. Assume an operator OP in the network is split in some way; its input data Tensor0 and output data Tensor1 then have states s and t respectively, belonging to the state spaces S and T of Tensor0 and Tensor1. On this basis, the splitting method of OP itself can be regarded as a directed edge from s to t.
In the embodiments of this application, based on the above abstract description of tensor data states, the whole neural network can be abstracted as shown in Fig. 5H. In the figure, the dashed boxes represent the split state set of each tensor; each set contains several split states that come from that tensor's split state space. A directed edge between a state in the split state set of an operator's input tensor data and a state in the split state set of its output tensor represents a splitting method of the operator itself, and the parallel execution time under that splitting method serves as the weight of the directed edge. Tensor0 is the input tensor data of the whole neural network and Tensor3 is its output tensor data; any path that starts from any state in Tensor0's split state set and ends at any state in Tensor3's split state set corresponds to a valid splitting scheme of the neural network, which may be denoted P.
In the embodiments of this application, taking the split state set of Tensor1 shown in Fig. 5H as an example, a glue operator is inserted into the split state set associated with OP0, and by adjusting the states in that split state set through the glue operator, an updated split state set can be obtained. Specifically, as shown in Fig. 6E, the split states in the updated split state set include state m'1, state m'2, ..., state m'k. Here, state m'1, state m'2, ..., state m'k are new states produced by passing the states of the first split state set through the glue operator.
In the embodiments of this application, inserting the glue operator between the target operator and the associated first split state set and adjusting the split states in the split state set of the operator's input tensor data to obtain the second split state set may also include: inserting a glue operator between the target operator and the associated first split state set, updating the split states of the first split state set into a third split state set through the glue operator, and generating the second split state set from the first split state set and the third split state set.
In the embodiments of this application, taking the split state set of Tensor1 shown in Fig. 5H (i.e. the first split state set) as an example, a glue operator is inserted into the split state set associated with OP0, the split states in that split state set are adjusted through the glue operator, the split states of the first split state set are updated into the third split state set, and the second split state set is then generated from the first and third split state sets. Specifically, as shown in Fig. 6F, the split states in the second split state set include state 1, state 2, ..., state m'. Here, state 1, state 2, ..., state m are the split states of the first split state set, and state m' is a new split state produced by passing the states of the first split state set through the glue operator. This implementation ensures that the second split state set contains as many different split states as possible, which benefits the subsequent search for the target optimization path of the whole neural network model.
In the embodiments of this application, the glue operator expresses the behavior of adjusting the split state of tensor data. The computation scale of every layer of the neural network model keeps changing as the network extends; as the splitting trend of the model changes, the splitting methods of the operators must be adjusted accordingly, i.e. the states of the intermediate results must be adjusted. As shown in Fig. 6E, a glue operator is added between Op0 and its input Tensor1, which can convert any split state of the tensor data into another split state. For the glue operator, its input and output tensor data have the same shape and the same state space, and from any split state of the input tensor data there exist directed edges pointing to all split states of the output tensor data, forming a fully connected mesh between the split state set of the input tensor data and that of the output tensor data. This allows any split state of the input tensor data to be converted into another split state before operator Op0, introducing into the search space of splitting schemes the possibility of adjusting the split state of each operator's input tensor data before its computation starts — that is, of adjusting the splitting method of the operator itself.
It should be noted that Fig. 6E and Fig. 6F show a glue operator inserted between an operator and its corresponding input tensor data; a glue operator may also be inserted between an operator and its corresponding output tensor data, or between an operator and both its input and output tensor data. These are only partial enumerations rather than exhaustive; those skilled in the art, having understood the essence of this technical solution, may produce other variants on its basis, and as long as the function realized and the technical effect achieved are similar to those of this application, they fall within the scope of protection of this application.
Step S404: traversing the adjusted split state sets and determining the split paths of the target operator's tensor data between adjacent split state sets.
As stated above, the adjusted split state set is the second split state set.
Step S406: determining, according to the weights of the split paths, the target split path of the tensor data of the target operator.
Step S408: splitting the target operator according to the target split path, so as to allocate it to the corresponding cores of the multi-core artificial intelligence processor for processing.
For the specific implementation of steps S404-S408, refer to steps S302-S306 above; details are not repeated here.
By implementing the embodiments of this application, a glue operator is inserted between the target operator and the associated split state set; the glue operator lets each operator in the computation graph corresponding to the neural network model flexibly and without restriction choose the splitting method acting on itself, thereby solving the problem of operator splits influencing one another.
In the embodiments of this application, introducing glue operators lets each operator choose an appropriate splitting method according to the actual situation; however, when the computer device runs a neural network model containing glue operators, the glue operators themselves bring extra overhead, which undoubtedly increases the resource consumption of the computer device. Take a glue operator using split-then-concat or concat-then-split as an example: assume the total size of the tensor data to adjust is M, that neither phase can be skipped, and that each phase must concatenate or split along 4 dimensions. For ease of porting, the concatenation and splitting are usually implemented with the concatenation operator (Concat) and splitting operator (Split) native to neural network algorithms; since these two operators can each handle only one dimension at a time, the whole glue brings 8M of memory read/write overhead in the worst case. A best balance point must therefore be found between adjusting the split states and the extra overhead introduced: the splitting of operators should be adjusted at places that both fit the regularities of the network structure and are reasonable, while introducing as few glue operators as possible. This is the technical problem the solution described in this application aims to solve.
On this basis, in the embodiments of this application, after step S406 and before step S408, the method may further include step S4010, detailed below:
Step S4010: when the state of the input tensor data and the state of the output tensor data of the same glue operator contained in the target split path are the same, deleting the corresponding inserted glue operator to obtain the optimized target split path.
In the embodiments of this application, after the computer device determines the target split path according to the weights of the split paths, the computer device judges whether the state of the input tensor data and the state of the output tensor data of each glue operator contained in the target split path are the same, and removes the glue operator when they are the same. Here, the input and output tensor states of a glue operator being the same means that using a glue operator at that position did not adjust the split state of the tensor data at all. As stated above, when the computer device runs a neural network model containing glue operators, the glue operators themselves bring extra overhead, which increases the resource consumption of the computer device; removing a glue operator whose input and output tensor states are the same therefore reduces the resource consumption of the computer device. Furthermore, this implementation lets the extra overhead introduced by glue operators and the parallel efficiency of the different splitting methods of the operators themselves be decided together, yielding an optimal splitting scheme P based on the whole neural network.
Correspondingly, after the computer device executes the above step S304 and determines the target split path according to the weights of the split paths, the computer device judges whether the state of the input tensor data and the state of the output tensor data of each glue operator contained in the target split path are the same, and retains the glue operator when they are not the same. In this case, the introduced glue operator makes the splitting method of each operator compatible with the splitting of the tensor data directly associated with it; through this implementation, the extra overhead introduced by glue operators and the parallel efficiency of the different splitting methods of the operators themselves are decided together, yielding an optimal splitting scheme P based on the whole neural network.
Correspondingly, in the embodiments of this application, when the computer device executes the above step S306, it splits the target operator according to the optimized target split path. For the specific implementation of splitting the target operator, refer to the description above; details are not repeated here.
By implementing the embodiments of this application, deleting the glue operators on the target optimization path whose input tensor state and output tensor state are the same finds a best balance point between adjusting the split states and the extra overhead introduced. When the computer device executes the neural network model split according to the optimized target optimization path, the resource consumption of the computer device is reduced.
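A minimal sketch of this pruning step follows (not from the original disclosure; the representation of the path and glue positions is our assumption): it drops exactly those glue operators whose input and output split states on the chosen target path coincide.

```python
def prune_noop_glue(path_states, glue_positions):
    """Drop glue operators whose input and output split states on the chosen
    target path are identical (they would perform no adjustment).

    path_states: the chosen split state per tensor along the target path.
    glue_positions: indices i such that a glue operator was inserted between
        tensor i (its input state) and tensor i+1 (its output state).
    Returns the glue positions worth keeping.
    """
    return [i for i in glue_positions if path_states[i] != path_states[i + 1]]

# The glue op between tensors 2 and 3, whose states match, is removed.
states = ["N-split", "C-split", "H-split", "H-split"]
print(prune_noop_glue(states, glue_positions=[0, 2]))  # -> [0]
```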
In the embodiments of this application, the neural network model may have a multi-branch structure; in this case, the problem of the consistency of the splitting methods of the different branches in a multi-branch neural network model must be solved. An operator at a branch junction has more than one input tensor data, for example the element-wise addition operator (Add), the element-wise multiplication operator (Mult) and the concatenation operator (Concat). For an operator A with two inputs, after the computer device visits the operator — i.e. after the split state set of the output tensor data has been determined from the split state sets of the input tensor data — the two input tensors tensorleft and tensorright have corresponding split state sets Sleft and Sright respectively. Traversal then continues forward along the two branches starting from tensorleft and tensorright. In one case, the two branches extend until the traversal ends, meaning the whole network has more than one input; this is usually uncommon in inference tasks. In the other case, the two branches merge at some operator. In either case, when the splitting scheme P is determined, mutually incompatible split states may be selected on the two input tensor data tensorleft and tensorright of operator A. Specifically, suppose operator A is a binary element-wise addition operator: the backtracking process may select in tensorleft's split state set a state split only along the C dimension, while selecting in tensorright's split state set a state split only along the H dimension; the splittings of the addition operator itself represented by these two split states are inconsistent, which renders the whole splitting scheme P invalid.
In the embodiments of this application, backtracking refers to the reverse of the preceding procedure: for example, when the neural network model is traversed forward, backtracking traverses it backward. The purpose of the backtracking process is to prevent the computer device from misjudging when determining the target optimization path, which would bring negative effects such as increased time consumption when the computer device invokes the split neural network model.
To solve this problem, before the traversal of operator A ends, it is guaranteed that the split state sets corresponding to tensorleft and tensorright each contain only one split state; this ensures the determinacy of the states selected from the two state sets during the backtracking process.
In one case, in the forward traversal phase, when the output tensor data of the current operator is used as input tensor data by at least two operators, or the current operator has at least two output tensor data, one split state is retained in the split state set of the current operator's output tensor data, and the retained split state is determined via the same directed edge of the current operator.
The following specifically describes how, in the embodiments of this application, the determinacy of the states selected from the two state sets during backtracking is ensured. The method may include but is not limited to the following steps:
Step 700: determining, according to the target operator in the computation graph corresponding to the neural network model, the set of split states of the tensor data associated with the target operator;
Step 702: traversing the split state sets and determining the split paths of the operator's tensor data between adjacent split state sets;
Step 704: determining, according to the weights of the split paths, the target split path of the tensor data of the target operator.
In a specific implementation, determining the target split path of the target operator's tensor data includes:
traversing all split state sets of the target operator's tensor data; for the current split state set, traversing each split state to obtain all directed edges pointing to the current split state and the split paths from the split states at the starting points of the directed edges to the split states of the input tensor data of the target operator;
determining the split path from the current split state to the split state of the input tensor data of the target operator according to the weight of the directed edge and the weight of the split path from the starting split state of the directed edge to the split state of the input tensor data of the target operator, wherein the weight of a split path is determined from the weights of all directed edges corresponding to the split path;
after traversing all split state sets of the target operator, obtaining the target split path between the split state set of the input tensor data of the target operator and the split state set of the output tensor data of the target operator.
Here, this implementation obtains the target optimization path by forward traversal.
Step 706: splitting the target operator according to the target split path, so as to allocate it to the corresponding cores of the multi-core artificial intelligence processor for processing.
In the embodiments of this application, in order to ensure the determinacy of the states selected from the two state sets during backtracking, in the forward traversal phase, when the output tensor data of the current operator is used as input tensor data by at least two operators, or the current operator has at least two output tensor data, one split state is retained in the split state set of the current operator's output tensor data, and the retained split state is determined via the same directed edge of the current operator. In this way, before the traversal of a branch operator ends, the state with the smallest corresponding accumulated weight is selected and retained from the split state sets of the multiple input data, and the other split states in the split state set are removed.
For the specific implementation of steps S700-S706, refer to the related description of steps S300-S306 above; details are not repeated here.
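A minimal sketch of the branch-point collapse follows (not from the original text; the function name and the weight mapping are illustrative assumptions): at a branch point, only the split state with the smallest accumulated path weight so far is retained, so that backtracking cannot later pick inconsistent states on the two branches.

```python
def collapse_branch_state_set(state_set, path_weight):
    """At a branch point, keep only the split state with the smallest
    accumulated path weight so far and drop the rest, ensuring that the
    backtracking phase selects a unique, consistent state.
    """
    best = min(state_set, key=path_weight)
    return [best]

# Keep the state whose shortest path from the network input is cheapest.
weights = {"N-split": 3.0, "C-split": 2.2, "H-split": 2.9}
print(collapse_branch_state_set(list(weights), weights.get))  # -> ['C-split']
```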
In one possible implementation, combining the introduction of glue operators to adjust the split states in the split state sets with the deletion of glue operators whose input tensor state and output tensor state on the target optimization path are the same, a variant of the method described in steps S700-S706 above can be obtained, which may include but is not limited to the following steps:
Step 700': determining, according to the target operator in the computation graph corresponding to the neural network model, the set of split states of the tensor data associated with the target operator;
Step 702': inserting a glue operator between the target operator and the associated first split state set, and adjusting the split states in the split state set of the target operator's input tensor data to obtain the second split state set, where the glue operator is used to convert a split state of the tensor data obtained by one splitting method into a split state obtained by any other splitting method;
Step 704': traversing the second split state set and determining the split paths of the target operator's tensor data between adjacent split state sets;
Step 706': determining, according to the weights of the split paths, the target split path of the tensor data of the target operator.
In a specific implementation, determining the target split path of the target operator's tensor data includes: traversing all split state sets of the target operator's tensor data; for the current split state set, traversing each split state to obtain all directed edges pointing to the current split state and the split paths from the split states at the starting points of the directed edges to the split states of the input tensor data of the target operator; determining the split path from the current split state to the split state of the input tensor data of the target operator according to the weight of the directed edge and the weight of the split path from the starting split state of the directed edge to the split state of the input tensor data of the target operator, wherein the weight of a split path is determined from the weights of all directed edges corresponding to the split path; and, after traversing all split state sets of the target operator, obtaining the target split path between the split state set of the input tensor data of the target operator and the split state set of the output tensor data of the target operator.
Here, this implementation obtains the target optimization path by forward traversal.
In the embodiments of this application, in order to ensure the determinacy of the states selected from the two state sets during backtracking, in the forward traversal phase, when the output tensor data of the current operator is used as input tensor data by at least two operators, or the current operator has at least two output tensor data, one split state is retained in the split state set of the current operator's output tensor data, and the retained split state is determined via the same directed edge of the current operator. In this way, before the traversal of a branch operator ends, the state with the smallest corresponding accumulated weight is selected and retained from the split state sets of the multiple input data, and the other split states in the split state set are removed.
Step S708': when the state of the input tensor data and the state of the output tensor data of the same glue operator contained in the target split path are the same, deleting the glue operator to obtain the optimized target split path;
Step S7010': splitting the target operator according to the optimized target split path, so as to allocate it to the corresponding cores of the multi-core artificial intelligence processor for processing.
For the specific implementation of steps S700'-S7010', refer to the related descriptions in the embodiments above; details are not repeated here.
By implementing the embodiments of this application, in the forward traversal phase, for operators or output tensors at branch points the computer device retains only the single state with the shortest corresponding path so far and deletes all other states. This implementation avoids inconsistencies that could arise in the backtracking phase and improves the efficiency and accuracy with which the computer device determines the target optimization path.
In the other case, in the backward traversal phase, when the operator has at least two input tensor data, one split state is retained in the split state set of the operator's input tensor data, and that split state is determined via the same state path of the operator.
The following specifically describes how, in the embodiments of this application, the determinacy of the states selected from the two state sets during backtracking is ensured. The method may include but is not limited to the following steps:
Step 800: determining, according to the target operator in the computation graph corresponding to the neural network model, the set of split states of the tensor data associated with the target operator;
Step 802: traversing the split state sets and determining the split paths of the operator's tensor data between adjacent split state sets;
Step 804: determining, according to the weights of the split paths, the target split path of the tensor data of the target operator.
In a specific implementation, determining the target split path of the target operator's tensor data includes:
traversing all split state sets of the target operator; for the current split state set, traversing each split state to obtain all directed edges starting from the current split state and the split paths from the split states at the end points of the directed edges to the split states of the output tensor data of the target operator;
determining the split path from the current split state to the split state of the output tensor data of the target operator according to the weight of the directed edge and the weight of the split path from the split state at the end point of the directed edge to the split state of the output tensor data of the target operator, wherein the weight of a split path is determined from the weights of all directed edges corresponding to the split path;
after traversing all split state sets of the target operator, obtaining the target split path between the split state set of the input tensor data of the target operator and the split state set of the output tensor data of the target operator.
Here, this implementation obtains the target optimization path by backward traversal.
Step 806: splitting the target operator according to the target split path, so as to allocate it to the corresponding cores of the multi-core artificial intelligence processor for processing.
In the embodiments of this application, in order to ensure the determinacy of the states selected from the two state sets during backtracking, in the backward traversal phase, when the current operator has at least two input tensor data, one split state is retained in the split state set of the current operator's input tensor data, and that split state is determined via the same directed edge of the operator. In this way, before the traversal of a branch operator ends, the state with the smallest corresponding accumulated weight is selected and retained from the split state sets of the multiple input data, and the other split states in the split state set are removed.
In one possible implementation, combining the introduction of glue operators to adjust the split states in the split state sets with the deletion of glue operators whose input tensor state and output tensor state on the target optimization path are the same, a variant of the method described in steps S800-S806 above can be obtained, which may include but is not limited to the following steps:
Step 800': determining, according to the target operator in the computation graph corresponding to the neural network model, the set of split states of the tensor data associated with the target operator;
Step 802': inserting a glue operator between the target operator and the associated first split state set, and adjusting the split states in the split state set of the target operator's input tensor data to obtain the second split state set, where the glue operator converts a split state of the tensor data obtained by one splitting method into a split state obtained by any other splitting method;
Step 804': traversing the second split state set and determining the split paths of the target operator's tensor data between adjacent split state sets;
Step 806': determining, according to the weights of the split paths, the target split path of the tensor data of the target operator.
In a specific implementation, determining the target split path of the target operator's tensor data includes: traversing all split state sets of the target operator; for the current split state set, traversing each split state to obtain all directed edges starting from the current split state and the split paths from the split states at the end points of the directed edges to the split states of the output tensor data of the target operator; determining the split path from the current split state to the split state of the output tensor data of the target operator according to the weight of the directed edge and the weight of the split path from the split state at the end point of the directed edge to the split state of the output tensor data of the target operator, wherein the weight of a split path is determined from the weights of all directed edges corresponding to the split path; and, after traversing all split state sets of the target operator, obtaining the target split path between the split state set of the input tensor data of the target operator and the split state set of the output tensor data of the target operator.
Here, this implementation obtains the target optimization path by backward traversal.
In the embodiments of this application, in order to ensure the determinacy of the states selected from the two state sets during backtracking, in the backward traversal phase, when the current operator has at least two input tensor data, one split state is retained in the split state set of the current operator's input tensor data, and that split state is determined via the same directed edge of the operator. In this way, before the traversal of a branch operator ends, the state with the smallest corresponding accumulated weight is selected and retained from the split state sets of the multiple input data, and the other split states in the split state set are removed.
Step S808': when the state of the input tensor data and the state of the output tensor data of the same glue operator contained in the target split path are the same, deleting the glue operator to obtain the optimized target split path;
Step S8010': splitting the target operator according to the optimized target split path, so as to allocate it to the corresponding cores of the multi-core artificial intelligence processor for processing.
For the specific implementation of steps S800'-S8010', refer to the related descriptions in the embodiments above; details are not repeated here.
By implementing the embodiments of this application, in the backward traversal phase, for operators or output tensors at branch points the computer device retains only the single state with the shortest corresponding path so far and deletes all other states. This implementation avoids inconsistencies that could arise in the backtracking phase and improves the efficiency and accuracy with which the computer device determines the target optimization path.
For ease of understanding, an application scenario to which this application can apply is described below by way of example.
Taking the autonomous driving application as an example: during autonomous driving, the vehicle must analyze and process external information such as images, video and speech collected by on-board sensors. To guarantee safety, the vehicle must obtain the analysis results of this external information within the shortest time so as to make scientific and effective decisions. Since the vehicle's hardware system is equipped with a processing chip of multi-core processor structure, the hardware system of the vehicle can use the technical solution described in this application to split the computation task of the neural network model processing a small batch of external information, obtain multiple split sub-computation tasks and distribute them evenly across multiple processor cores, so that the multiple sub-computation tasks execute in parallel on multiple processor cores. This implementation completes the processing of the external information efficiently and returns the processing results, and the intelligent driving system of the vehicle can assist the autonomous driving of the vehicle according to the returned results. It can be understood that this technical solution splits one operator into multiple smaller sub-operators, so that the compute library of the single-core architecture can be invoked directly, making full use of the hardware resources of the multi-core processor and avoiding the extra workload of reimplementation.
In the above application scenario the multi-core processor structure chip is installed in the vehicle. In practice, the multi-core processor structure chip may also be installed on a cloud server, and the vehicle can send external information such as images, video and speech coming from the on-board sensors to the cloud server via 3G/4G, WIFI or other networks. The cloud server uses this solution to distribute the computational load of the neural network model processing the small batch of external information evenly onto multiple processing cores. Within the response time required for vehicle driving, the cloud server feeds the processing results back to the vehicle via 3G/4G, WIFI or other networks. In practice, the scale of the external information collected by the on-board sensors differs. Before application, the on-board processor uses this solution to determine the corresponding operator split paths for external information of different scales. The operator splitting schemes corresponding to external information of different scales are stored in corresponding areas; after obtaining the external information, the multi-core processor structure chip retrieves the corresponding operator split path to split the operators in the neural network model, and distributes the computational load of the external information evenly onto multiple processor cores.
It should be noted that, for brevity of description, the foregoing method embodiments are all expressed as combinations of sequences of actions, but those skilled in the art should know that this disclosure is not limited by the described order of actions, because according to this disclosure some steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also know that the embodiments described in the specification are all optional embodiments, and the actions and modules involved are not necessarily required by this disclosure.
It should further be noted that although the steps in the flowchart of Fig. 3 are shown in sequence following the arrows, they are not necessarily executed in the order indicated. Unless explicitly stated herein, there is no strict ordering restriction on the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in Fig. 3 may include several sub-steps or phases, which are not necessarily completed at the same moment but may be executed at different moments, and whose execution order is not necessarily sequential; they may be executed in turn or alternately with other steps or with at least part of the sub-steps or phases of other steps.
The method of the embodiments of this application has been set out in detail above. To facilitate better implementation of the above solution, a related apparatus for implementing it in a complementary manner is correspondingly provided below.
Referring to Fig. 7, a schematic structural diagram of a neural network processing apparatus provided by an embodiment of this application, the apparatus 70 may include at least:
a determining unit 700, configured to determine, according to a target operator in the computation graph corresponding to the neural network model, the set of split states of the tensor data associated with the target operator;
a split path determining unit 702, configured to traverse the split state sets and determine the split paths of the target operator's tensor data between adjacent split state sets;
a target split path determining unit 704, configured to determine, according to the weights of the split paths, the target split path of the tensor data of the target operator;
a processing unit 706, configured to split the target operator according to the target split path, so as to allocate it to the corresponding cores of the multi-core artificial intelligence processor for processing.
In one possible implementation, the target split path determining unit 704 is specifically configured to:
traverse all split state sets of the target operator's tensor data; for the current split state set, traverse each split state to obtain all directed edges pointing to the current split state and the split paths from the split states at the starting points of the directed edges to the split states of the input tensor data of the target operator;
determine the split path from the current split state to the split state of the input tensor data of the target operator according to the weight of the directed edge and the weight of the split path from the starting split state of the directed edge to the split state of the input tensor data of the target operator, wherein the weight of a split path is determined from the weights of all directed edges corresponding to the split path;
after traversing all split state sets of the target operator, obtain the target split path between the split state set of the input tensor data of the target operator and the split state set of the output tensor data of the target operator.
In one possible implementation, the target split path determining unit 704 is further specifically configured to:
traverse all split state sets of the target operator; for the current split state set, traverse each split state to obtain all directed edges starting from the current split state and the split paths from the split states at the end points of the directed edges to the split states of the output tensor data of the target operator;
determine the split path from the current split state to the split state of the output tensor data of the target operator according to the weight of the directed edge and the weight of the split path from the split state at the end point of the directed edge to the split state of the output tensor data of the target operator, wherein the weight of a split path is determined from the weights of all directed edges corresponding to the split path;
after traversing all split state sets of the target operator, obtain the target split path between the split state set of the input tensor data of the target operator and the split state set of the output tensor data of the target operator.
In one possible implementation, the apparatus 70 may further include a glue operator insertion unit 708, configured to insert a glue operator between the target operator and the associated split state set and adjust the split states in the split state set, where the glue operator is used to convert a split state of the tensor data obtained by one splitting method into a split state obtained by any other splitting method.
In one possible implementation, the glue operator insertion unit 708 is specifically configured to: use the target split path of the target operator in the computation graph including the glue operators to select each inserted glue operator, and delete an inserted glue operator when the split state of its input tensor data and the split state of its output tensor data on the target split path are the same.
In one possible implementation, the glue operator concatenates the split states in the split state set.
In one possible implementation, the glue operator splits the split states in the split state set.
In one possible implementation, the glue operator first concatenates the split states in the split state set and then splits the split states in the concatenated split state set.
In one possible implementation, the glue operator first splits the split states in the split state set and then concatenates the split states in the split split state set.
In one possible implementation, the apparatus 70 may further include a forward branch processing unit 7010, configured such that, in the forward traversal phase, when the output tensor data of the current operator is used as input tensor data by at least two operators, or the current operator has at least two output tensor data, one split state is retained in the split state set of the current operator's output tensor data, the retained split state being determined via the same directed edge of the current operator.
In one possible implementation, the apparatus 70 may further include a backward branch processing unit 7012, configured such that, in the backward traversal phase, when the current operator has at least two input tensor data, one split state is retained in the split state set of the current operator's input tensor data, the split state being determined via the same directed edge of the operator.
In one possible implementation, the weight of a directed edge is determined according to the type of operation of the target operator corresponding to the split path, the data size of the corresponding sub-data obtained from the target operator's tensor data via the split path, and the throughput and memory-access bandwidth of each processor core.
In one possible implementation, the split states in the split state set of the input tensor data of the target operator of the neural network model are determined according to the computational logic of the operator and the split states in the split state set of the corresponding output tensor data.
In one possible implementation, the split states in the split state set of the output tensor data of the target operator of the neural network model are determined according to the computational logic of the operator and the split states in the split state set of the corresponding input tensor data.
It should be understood that the above apparatus embodiments are only illustrative, and the apparatus of this disclosure may also be implemented in other ways. For example, the division of the units/modules in the above embodiments is only a logical functional division; there may be other divisions in actual implementation. For example, multiple units, modules or components may be combined or integrated into another system, or some features may be ignored or not executed.
The units or modules described as separate components may or may not be physically separate. A component described as a unit or module may or may not be a physical unit, i.e. it may be located in one apparatus or distributed over multiple apparatuses. The solutions of the embodiments of this disclosure may be implemented by selecting some or all of the units according to actual needs.
An embodiment of this application further provides a chip; the neural network chip may be a multi-core chip including a central processing unit (CPU) and N single-core neural network processors (NNP), N being an integer greater than 1. The CPU performs overall control and scheduling of the chip and is the execution body of the neural network model processing method in the embodiments of this application.
An embodiment of this application further provides another computer device, which includes the above chip or the above neural network model processing apparatus 70.
An embodiment of this application further provides a computer storage medium for storing the computer software instructions used by the computer device shown in Fig. 2 above, which contains a program for executing the above method embodiments. By executing the stored program, the tensor data associated with the target operator in the computation graph corresponding to the neural network model is split to obtain the split state sets corresponding to the tensor data; the split paths of the tensor data between adjacent split state sets and the weights of the split paths are then determined; the target split path of the target operator's tensor data is determined; and finally the target operator of the computation graph is split according to the target split path, so as to be allocated to the corresponding cores of the multi-core processor for processing. In this process, splitting the target operator achieves the purpose of reducing the scale of the operator's computation data, and the selection among the split paths between the split states corresponding to the target operator further optimizes the splitting of the target operator. Finally, the sub-operators obtained by splitting are allocated to the multi-core processor, so that the hardware resources of every core in the multi-core processor can be used effectively; this solution can effectively reduce the end-to-end latency of various neural network models on the multi-core processor.
Those skilled in the art should understand that the embodiments of this application may be provided as a method, a system, or a computer program product. Therefore, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage and optical storage) containing computer-usable program code.
This application is described with reference to the flowcharts and/or block diagrams of the method, device (system) and computer program product according to the embodiments of this application. It should be understood that every flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, causing a series of operational steps to be executed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
Further, the foregoing may be better understood according to the following clauses:
For example, Clause A1, a neural network processing method, characterized in that the method is applied to a multi-core artificial intelligence processor and comprises:
determining, according to a target operator in the computation graph corresponding to the neural network model, a set of split states of the tensor data associated with the target operator;
traversing the split state sets and determining the split paths of the target operator's tensor data between adjacent split state sets;
determining, according to the weights of the split paths, a target split path for the tensor data of the target operator;
splitting the target operator according to the target split path, so as to allocate it to the corresponding cores of the multi-core artificial intelligence processor for processing.
A2. The method according to A1, wherein determining the target split path of the tensor data of the target operator comprises:
traversing all split state sets of the target operator's tensor data; for the current split state set, traversing each split state to obtain all directed edges pointing to the current split state and the split paths from the split states at the starting points of the directed edges to the split states of the input tensor data of the target operator;
determining the split path from the current split state to the split state of the input tensor data of the target operator according to the weight of the directed edge and the weight of the split path from the starting split state of the directed edge to the split state of the input tensor data of the target operator, wherein the weight of a split path is determined from the weights of all directed edges corresponding to the split path;
after traversing all split state sets of the target operator, obtaining the target split path between the split state set of the input tensor data of the target operator and the split state set of the output tensor data of the target operator.
A3. The method according to A1, wherein determining the target split path of the tensor data of the target operator comprises:
traversing all split state sets of the target operator; for the current split state set, traversing each split state to obtain all directed edges starting from the current split state and the split paths from the split states at the end points of the directed edges to the split states of the output tensor data of the target operator;
determining the split path from the current split state to the split state of the output tensor data of the target operator according to the weight of the directed edge and the weight of the split path from the split state at the end point of the directed edge to the split state of the output tensor data of the target operator, wherein the weight of a split path is determined from the weights of all directed edges corresponding to the split path;
after traversing all split state sets of the target operator, obtaining the target split path between the split state set of the input tensor data of the target operator and the split state set of the output tensor data of the target operator.
A4. The method according to any one of A1-A3, further comprising:
inserting a glue operator between the target operator and the associated split state set to adjust the split states in the split state set, wherein the glue operator is used to convert a split state of the tensor data obtained by one splitting method into a split state obtained by any other splitting method.
A5. The method according to A4, wherein inserting the glue operator between the target operator and the associated split state set comprises:
using the target split path of the target operator in the computation graph including the glue operators to select each inserted glue operator, and deleting an inserted glue operator when the split state of its input tensor data and the split state of its output tensor data on the target split path are the same.
A6. The method according to A4, wherein the glue operator concatenates the split states in the split state set.
A7. The method according to A4, wherein the glue operator splits the split states in the split state set.
A8. The method according to A4, wherein the glue operator first concatenates the split states in the split state set and then splits the split states in the concatenated split state set.
A9. The method according to A4, wherein the glue operator first splits the split states in the split state set and then concatenates the split states in the split split state set.
A10. The method according to any one of A1-A9, further comprising:
in the forward traversal phase, when the output tensor data of the current operator is used as input tensor data by at least two operators, or the current operator has at least two output tensor data, retaining one split state in the split state set of the current operator's output tensor data, the retained split state being determined via the same directed edge of the current operator.
A11. The method according to any one of A1-A9, further comprising:
in the backward traversal phase, when the current operator has at least two input tensor data, retaining one split state in the split state set of the current operator's input tensor data, the split state being determined via the same directed edge of the target operator.
A12. The method according to A2 or A3, wherein the weight of a directed edge is determined according to the type of operation of the target operator corresponding to the split path, the data size of the corresponding sub-data obtained from the target operator's tensor data via the split path, and the throughput and memory-access bandwidth of each processor core.
A13. The method according to A1, wherein the split states in the split state set of the input tensor data of the target operator are determined according to the computational logic of the target operator and the split states in the split state set of the corresponding output tensor data.
A14. The method according to A1, wherein the split states in the split state set of the output tensor data of the target operator are determined according to the computational logic of the target operator and the split states in the split state set of the corresponding input tensor data.
B1. A neural network processing apparatus, characterized in that the apparatus is applied to a multi-core artificial intelligence processor and comprises:
a determining unit, configured to determine, according to a target operator in the computation graph corresponding to the neural network model, a set of split states of the tensor data associated with the target operator;
a split path determining unit, configured to traverse the split state sets and determine the split paths of the target operator's tensor data between adjacent split state sets;
a target split path determining unit, configured to determine, according to the weights of the split paths, the target split path of the tensor data of the target operator;
a processing unit, configured to split the target operator according to the target split path, so as to allocate it to the corresponding cores of the multi-core artificial intelligence processor for processing.
C1. A computer device, comprising a plurality of heterogeneous processors and a memory connected to each other, wherein the plurality of heterogeneous processors include a general-purpose processor and an artificial intelligence processor, the memory is used to store a computer program, the computer program includes program instructions, and the processor is configured to invoke the program instructions to execute the method according to any one of claims A1-A14.
D1. A computer-readable storage medium storing a computer program, the computer program comprising program instructions that, when executed by a processor, cause the processor to execute the method according to any one of claims A1-A14.
The embodiments of this disclosure have been introduced in detail above; specific examples are used herein to explain the principles and implementations of this disclosure, and the descriptions of the above embodiments are only meant to help understand the method of this disclosure and its core idea. At the same time, changes or variations made by those skilled in the art according to the idea of this disclosure, in its specific implementations and scope of application, all belong to the scope of protection of this disclosure. In summary, the content of this specification should not be construed as limiting this disclosure.

Claims (20)

  • 1. A neural network model processing method, characterized in that the method is applied to a multi-core artificial intelligence processor and comprises:
    determining, according to a target operator in the computation graph corresponding to the neural network model, a set of split states of the tensor data associated with the target operator;
    traversing the split state sets and determining the split paths of the target operator's tensor data between adjacent split state sets;
    determining, according to the weights of the split paths, a target split path for the tensor data of the target operator;
    splitting the target operator according to the target split path, so as to allocate it to the corresponding cores of the multi-core artificial intelligence processor for processing.
  • 2. The method according to claim 1, characterized in that determining the target split path of the tensor data of the target operator comprises:
    traversing all split state sets of the target operator's tensor data; for the current split state set, traversing each split state to obtain all directed edges pointing to the current split state and the split paths from the split states at the starting points of the directed edges to the split states of the input tensor data of the target operator;
    determining the split path from the current split state to the split state of the input tensor data of the target operator according to the weight of the directed edge and the weight of the split path from the starting split state of the directed edge to the split state of the input tensor data of the target operator, wherein the weight of a split path is determined from the weights of all directed edges corresponding to the split path;
    after traversing all split state sets of the target operator, obtaining the target split path between the split state set of the input tensor data of the target operator and the split state set of the output tensor data of the target operator.
  • 3. The method according to claim 1, characterized in that determining the target split path of the tensor data of the target operator comprises:
    traversing all split state sets of the target operator; for the current split state set, traversing each split state to obtain all directed edges starting from the current split state and the split paths from the split states at the end points of the directed edges to the split states of the output tensor data of the target operator;
    determining the split path from the current split state to the split state of the output tensor data of the target operator according to the weight of the directed edge and the weight of the split path from the split state at the end point of the directed edge to the split state of the output tensor data of the target operator, wherein the weight of a split path is determined from the weights of all directed edges corresponding to the split path;
    after traversing all split state sets of the target operator, obtaining the target split path between the split state set of the input tensor data of the target operator and the split state set of the output tensor data of the target operator.
  • 4. The method according to any one of claims 1-3, characterized in that the method further comprises:
    inserting a glue operator between the target operator and the associated split state set to adjust the split states in the split state set, wherein the glue operator is used to convert a split state of the tensor data obtained by one splitting method into a split state obtained by any other splitting method.
  • 5. The method according to claim 4, characterized in that inserting the glue operator between the target operator and the associated split state set comprises:
    using the target split path of the target operator in the computation graph including the glue operators to select each inserted glue operator, and deleting an inserted glue operator when the split state of its input tensor data and the split state of its output tensor data on the target split path are the same.
  • 6. The method according to claim 4, characterized in that the glue operator concatenates the split states in the split state set.
  • 7. The method according to claim 4, characterized in that the glue operator splits the split states in the split state set.
  • 8. The method according to claim 4, characterized in that the glue operator first concatenates the split states in the split state set and then splits the split states in the concatenated split state set.
  • 9. The method according to claim 4, characterized in that the glue operator first splits the split states in the split state set and then concatenates the split states in the split split state set.
  • 10. The method according to any one of claims 1-9, characterized in that the method further comprises:
    in the forward traversal phase, when the output tensor data of the current operator is used as input tensor data by at least two operators, or the current operator has at least two output tensor data, retaining one split state in the split state set of the current operator's output tensor data, the retained split state being determined via the same directed edge of the current operator.
  • 11. The method according to any one of claims 1-9, characterized in that the method further comprises:
    in the backward traversal phase, when the current operator has at least two input tensor data, retaining one split state in the split state set of the current operator's input tensor data, the split state being determined via the same directed edge of the target operator.
  • 12. The method according to claim 2 or 3, characterized in that the weight of a directed edge is determined according to the type of operation of the target operator corresponding to the split path, the data size of the corresponding sub-data obtained from the target operator's tensor data via the split path, and the throughput and memory-access bandwidth of each processor core.
  • 13. The method according to claim 1, characterized in that the split states in the split state set of the input tensor data of the target operator are determined according to the computational logic of the target operator and the split states in the split state set of the corresponding output tensor data.
  • 14. The method according to claim 1, characterized in that the split states in the split state set of the output tensor data of the target operator are determined according to the computational logic of the target operator and the split states in the split state set of the corresponding input tensor data.
  • 15. A neural network model processing apparatus, characterized in that the apparatus is applied to a multi-core artificial intelligence processor and comprises:
    a determining unit, configured to determine, according to a target operator in the computation graph corresponding to the neural network model, a set of split states of the tensor data associated with the target operator;
    a split path determining unit, configured to traverse the split state sets and determine the split paths of the target operator's tensor data between adjacent split state sets;
    a target split path determining unit, configured to determine, according to the weights of the split paths, the target split path of the tensor data of the target operator;
    a processing unit, configured to split the target operator according to the target split path, so as to allocate it to the corresponding cores of the multi-core artificial intelligence processor for processing.
  • 16. A chip, characterized in that the chip integrates the neural network model processing apparatus according to claim 15.
  • 17. A computer device, characterized in that the computer device comprises the chip according to claim 16 or the neural network model processing apparatus according to claim 15.
  • 18. A computer device, characterized by comprising a processor and a memory connected to each other, wherein the processor includes a general-purpose processor and an artificial intelligence processor, the memory is used to store a computer program, the computer program includes program instructions, and the processor is configured to invoke the program instructions to execute the method according to any one of claims 1-14.
  • 19. A computer-readable storage medium, characterized in that it stores a computer program comprising program instructions that, when executed by a processor, cause the processor to execute the method according to any one of claims 1-14.
  • 20. A computer program product, characterized in that the computer program product comprises a non-transitory computer-readable storage medium storing a computer program operable to cause a computer to execute the method according to any one of claims 1-14.