CN113449859A - Data processing method and device
- Publication number: CN113449859A (application CN202010232274.9A)
- Authority: CN (China)
- Prior art keywords: tensor, operator, sub-tensor, neural network
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS; G06—COMPUTING; G06N—Computing arrangements based on specific computational models; G06N3/00—Computing arrangements based on biological models; G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/04—Architecture, e.g. interconnection topology; G06N3/045—Combinations of networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons; G06N3/063—Physical realisation using electronic means
Abstract
The application discloses a data processing method in the field of artificial intelligence. The method determines, according to a dataflow graph, a processing method corresponding to a neural network model, where the processing method includes, besides segmentation methods for the input tensors of the operators, a rearrangement method for rearranging tensors between operators.
Description
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a data processing method and apparatus.
Background
Deep neural networks have in recent years achieved high accuracy in tasks such as vision and translation. However, highly accurate neural networks carry a significant training overhead, which comes from the large volume of training data and the large number of model parameters.
One solution to the challenge of large training sets is data parallelism: each computing device holds a full copy of the model to be trained and is responsible only for a slice of the training tensor. During training, the devices must synchronize their local model parameters at the end of each iteration. Data parallelism performs well when the model to be trained has few parameters and is composed of computationally expensive operators (such as convolutions).
In the prior art, however, to guarantee that a subsequent operator can continue the computation after a computing device has executed the operation corresponding to a given operator, the tensors in all layers of the entire network can only be segmented in the same way, which greatly limits prior-art segmentation strategies.
Disclosure of Invention
In a first aspect, the present application provides a data processing method in which the tensor obtained by executing the operation corresponding to an operator undergoes tensor rearrangement, so that more segmentation methods become available for the input tensor of the subsequent operator and, accordingly, the processing efficiency of the dataflow graph can be improved. The data processing method includes:
acquiring a dataflow graph of a neural network model, where the dataflow graph includes a first operator and a second operator of the neural network model;
determining a processing method corresponding to the neural network model according to the dataflow graph, wherein the processing method comprises a first splitting method for splitting an input tensor of the first operator, a second splitting method for splitting an input tensor of the second operator, and a rearranging method for rearranging tensors between the first operator and the second operator;
the first splitting method is used for splitting an input tensor of the first operator to obtain M first sub-tensors, the rearranging method is used for rearranging M second sub-tensors to obtain a first tensor, the M second sub-tensors are outputs obtained by processing the M first sub-tensors through the first operator, and the first tensor is an input tensor of the second operator, wherein M is a positive integer greater than 1.
Because the number of parameters in each dimension of a sub-tensor produced by running an operator on a split tensor may be far smaller than that of the tensor produced by running the operator on the unsplit tensor, the choice of segmentation methods for the input tensor of the subsequent operator is very limited. In this embodiment, the sub-tensors produced by the operator's operation are rearranged, which is equivalent to restoring the tensor that running the operator on the unsplit tensor would have produced. Compared with segmenting the sub-tensors directly, the rearranged tensor offers the subsequent operator far more choices of segmentation method; that is, by defining the rearrangement, more segmentation methods for the operator's input tensor become available as processing candidates. Different operators can therefore segment their input tensors with different segmentation methods, each using the segmentation method best suited to it. This helps improve the processing efficiency of the dataflow graph and reduces the overhead the devices incur in computing it: a dataflow graph often includes different operators, the segmentation methods suited to the input data of different operators may differ, and only when an operator segments its input data with the method best suited to it can its computational efficiency be guaranteed.
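As an illustration of the scheme just described, the following NumPy sketch splits an input tensor into M first sub-tensors, applies a matrix-multiplication operator to each, rearranges the M second sub-tensors into the first tensor, and re-segments that tensor with a different second segmentation method. This is an expository assumption: the patent does not prescribe any particular framework, and all names here are hypothetical.

```python
import numpy as np

M = 2                                  # number of first sub-tensors / devices
x = np.random.rand(8, 4)               # input tensor of the first operator
w = np.random.rand(4, 6)               # parameter of the first operator

# First segmentation method: split the input tensor along the batch (row)
# dimension into M first sub-tensors.
first_subs = np.split(x, M, axis=0)

# Each operation device performs the operation corresponding to the first
# operator on its first sub-tensor, yielding M second sub-tensors.
second_subs = [sub @ w for sub in first_subs]

# Rearrangement method: transmit the M second sub-tensors to one device and
# rearrange (here, concatenate) them into the first tensor.
first_tensor = np.concatenate(second_subs, axis=0)

# The first tensor equals the result of running the unsplit input through
# the operator on a single device ...
assert np.allclose(first_tensor, x @ w)

# ... so the second operator may use a different second segmentation method,
# e.g. splitting along columns rather than rows.
second_inputs = np.split(first_tensor, M, axis=1)
```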
In a possible implementation of the first aspect, each of the M second sub-tensors is obtained by performing, on a corresponding operation device, an operation corresponding to the first operator on a corresponding first sub-tensor of the M first sub-tensors.
In one possible implementation of the first aspect, the first and second segmentation methods are different segmentation methods.
In a possible implementation of the first aspect, the tensor rearrangement method is configured to obtain the first tensor by transmitting the M second sub-tensors to a same operation device for rearrangement.
In a possible implementation of the first aspect, the result of performing the operation corresponding to the first operator on the unsplit input tensor of the first operator on a single operation device is the same as the first tensor.
It should be understood that "the same" here means either that this result is identical to the first tensor, or that only the arrangement of the data along each dimension differs while the data content of the tensor as a whole is the same.
In a possible implementation of the first aspect, the second sub-tensor is a P-dimension tensor, a sub-tensor obtained by segmenting the first tensor by the second segmenting method is a Q-dimension tensor, and P is not an integral multiple of Q.
In a possible implementation of the first aspect, the neural network model includes N operators, where the N operators include the first operator and the second operator, and the processing method for determining the neural network model according to the dataflow graph includes:
determining a plurality of candidate processing methods corresponding to the neural network model according to the data flow graph; each of the plurality of candidate processing methods includes a tensor splitting method for splitting an input tensor of each of the N operators and a rearrangement method for rearranging tensors between at least two of the N operators, where the tensor splitting method is used to split the input tensor of the corresponding operator, and the tensor rearrangement method is used to transmit an operation result obtained after an operation corresponding to the operator is performed to an operation device and then rearrange the operation result;
determining the processing method from the plurality of candidate processing methods according to an overhead value corresponding to each candidate processing method in the plurality of candidate processing methods, where the overhead value includes a memory overhead value corresponding to a tensor splitting method of each operator in the N operators and a first communication overhead value corresponding to a tensor rearrangement method included in each candidate processing method, the memory overhead value represents a memory overhead generated when an operation corresponding to the operator is performed, the first communication overhead value represents a communication overhead generated in a process of transmitting the operation result to one operation device, and the processing method is the candidate processing method with the smallest overhead value in the at least one candidate processing method.
In the embodiment of the present application, a constraint condition may further be added to the process of determining the processing method: the memory overhead of the determined processing method must be less than the upper limit of the memory overhead that each computing device can bear.
By defining the rearrangement method, the operators before and after a rearrangement may use different segmentation methods for their input tensors. This makes it possible to convert from model parallelism to data parallelism, or from data parallelism to model parallelism, i.e., hybrid parallel approaches, which enlarges the strategy search space. In addition, the method implements the computation of memory overhead and communication overhead: given a processing method, the memory and communication overheads consumed by the operators and by the rearrangements can be calculated, and the definition of the overhead value covers both, so the memory overhead and communication overhead to be optimized can be modeled.
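A minimal sketch of this selection step, assuming each candidate carries precomputed memory and communication overhead values; the field names and the memory cap are illustrative assumptions, not the patent's interface:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    memory_cost: float   # overhead of executing the operators on split tensors
    comm_cost: float     # overhead of transmitting results for rearrangement

def pick_processing_method(candidates, memory_cap=None):
    # Optional constraint: drop candidates whose memory overhead exceeds the
    # upper limit each computing device can bear.
    feasible = [c for c in candidates
                if memory_cap is None or c.memory_cost <= memory_cap]
    # The processing method is the feasible candidate with the smallest
    # overhead value (here modelled as memory plus communication overhead).
    return min(feasible, key=lambda c: c.memory_cost + c.comm_cost)
```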
In a possible implementation of the first aspect, the N operators include a third operator, each candidate processing method includes a target splitting method for splitting an input tensor of the third operator, the third operator is configured to perform a first operation on a sub-tensor obtained by splitting the input tensor of the third operator by the target splitting method, and transmit an operation result obtained by the operation to an operation device for a second operation, and the overhead value further includes a second communication overhead, where the second communication overhead represents a communication overhead generated in a process of transmitting the operation result to the operation device when the third operator is operated.
In a possible implementation of the first aspect, the memory overhead value is related to a number of parameters included in the M first sub-tensors and a type of the parameters.
In one possible implementation of the first aspect, the first communication overhead value is related to a number of parameters included in the M second sub-tensors and a type of parameter.
In one possible implementation of the first aspect, the second communication overhead value is related to a number of parameters included in the M second sub-tensors and a type of parameter.
In a possible implementation of the first aspect, the overhead value is an average of the memory overhead value, the first communication overhead value, and the second communication overhead value.
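Under this implementation, one hedged reading of the overhead value is a plain arithmetic mean of the three components; the patent does not fix a weighting elsewhere:

```latex
\text{overhead} = \tfrac{1}{3}\left(C_{\mathrm{mem}} + C_{\mathrm{comm},1} + C_{\mathrm{comm},2}\right)
```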
In one possible implementation of the first aspect, the method further comprises:
determining a plurality of processing methods corresponding to the neural network model according to the data flow graph;
determining the plurality of candidate processing methods from the plurality of processing methods according to the memory overhead value or the first communication overhead value corresponding to each of the plurality of processing methods, wherein the plurality of candidate processing methods are part of the plurality of processing methods.
In the embodiment of the application, after the processing methods are generated, they are arranged in non-decreasing order of memory occupancy, and strategies are then selected at a fixed step length. This selection method effectively reduces the search time for the processing method while largely preserving accuracy, as sketched below.
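A sketch of this pruning step, reusing the hypothetical Candidate records above; the step length is an illustrative parameter:

```python
def preselect_candidates(methods, step=4):
    # Arrange the generated processing methods in non-decreasing order of
    # memory occupancy ...
    ordered = sorted(methods, key=lambda m: m.memory_cost)
    # ... then keep every `step`-th one as a candidate, trading a little
    # accuracy for a much smaller search.
    return ordered[::step]
```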
In one possible implementation of the first aspect, the dataflow graph further includes a third operator of the neural network model, an output tensor of the third operator being an input tensor of the second operator, the method further including:
and merging the first operator into the second operator, so that the second segmentation method corresponding to the second operator establishes an association with the first segmentation method and the tensor rearrangement method.
In the embodiment of the present application, the first operator is merged into the second operator; at the same time, the first segmentation method corresponding to the input tensor of the first operator and the rearrangement method between the first operator and the second operator are also merged into the second operator, forming a piece of data that includes the first segmentation method, the tensor rearrangement method, and the second segmentation method (i.e., establishing an association among the first segmentation method, the tensor rearrangement method, and the second segmentation method).
In one possible implementation of the first aspect, the dataflow graph further includes a fourth operator of the neural network model, an output tensor of the first operator is an input tensor of the fourth operator, and the method further includes:
and merging the second operator into the first operator, so that the first segmentation method corresponding to the first operator establishes an association with the tensor rearrangement method and the second segmentation method.
In the embodiment of the present application, the second operator is merged into the first operator; at the same time, the second segmentation method corresponding to the input tensor of the second operator and the rearrangement method between the first operator and the second operator are also merged into the first operator, forming a piece of data that includes the first segmentation method, the tensor rearrangement method, and the second segmentation method (i.e., establishing an association among the first segmentation method, the tensor rearrangement method, and the second segmentation method).
In one possible implementation of the first aspect, the dataflow graph further includes a fifth operator and a sixth operator of the neural network model, wherein the first tensor is a first input tensor of the second operator and a first input tensor of the fifth operator, and an output tensor of the sixth operator is a second input tensor of the second operator and a second input tensor of the fifth operator, the method further includes:
obtaining a third splitting method for splitting the input tensor of the sixth operator, a fourth splitting method for splitting the second input tensor of the second operator, and a first tensor rearrangement method for rearranging tensors between the fourth operator and the second operator; the third splitting method is configured to split the second input tensor of the sixth operator to obtain a plurality of third sub-tensors, the first tensor rearrangement method is configured to rearrange a plurality of fourth sub-tensors obtained by performing operation corresponding to the sixth operator on the plurality of third sub-tensors to obtain a third tensor, and the fourth splitting method is configured to split the third tensor;
and merging the sixth operator into the first operator, so that the first segmentation method corresponding to the first operator establishes a first association with the third segmentation method, the fourth segmentation method, and the first tensor rearrangement method.
In one possible implementation of the first aspect, the method further comprises:
and merging the sixth operator into the first operator, so that the first segmentation method corresponding to the first operator establishes a second association with the third segmentation method and the second segmentation method.
Three graph contraction operations are added to support strategy search for more complex neural networks: the Merge Elimination operation, the Contract Elimination operation, and the Star Elimination operation, together with a method for computing the overhead value after contraction.
The Merge Elimination operation may represent merging the first operator into the second operator, so that the second segmentation method corresponding to the second operator is associated with the first segmentation method and the tensor rearrangement method.
The Contract Elimination operation may represent merging the second operator into the first operator, so that the first segmentation method corresponding to the first operator establishes an association with the tensor rearrangement method and the second segmentation method.
The Star Elimination operation may represent merging the sixth operator into the first operator, so that the first segmentation method corresponding to the first operator establishes a first association with the third segmentation method, the fourth segmentation method, and the first tensor rearrangement method.
In one possible implementation of the first aspect, the neural network model is for processing image data, audio data, video data, or text data;
the neural network model comprises operators with input tensors of the image data, the audio data, the video data or the text data; or,
the input tensors of the operators included in the neural network model are obtained by processing the image data, the audio data, the video data or the text data by using at least one operator included in the neural network model.
In a second aspect, the present application provides a data processing method, including:
obtaining a neural network model and a segmentation method corresponding to the neural network model, wherein the neural network model comprises a first operator and a second operator; the splitting method includes a first splitting method for splitting an input tensor of the first operator, a second splitting method for splitting an input tensor of the second operator, and a rearranging method for rearranging tensors between the first operator and the second operator;
segmenting the input tensor of the first operator by the first segmentation method to obtain M first sub-tensors, wherein the M first sub-tensors comprise a first target sub-tensor;
processing the first target sub-tensor using the first operator to obtain a second target sub-tensor;
in this embodiment of the application, the operation device may process the first target sub-tensor by calling a code corresponding to the first operator and based on the called code pair.
Receiving at least one second sub tensor sent by at least one arithmetic device; each second sub-tensor is obtained by performing operation corresponding to the first operator on one first sub-tensor except the first target sub-tensor in the M first sub-tensors through corresponding operation equipment;
rearranging the second target sub-tensor and the at least one second sub-tensor by the rearranging method to obtain a first tensor;
and segmenting the first tensor by the second segmentation method.
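A single-process mock of this flow for one operation device, with in-process computation standing in for the sub-tensors received from peers; this is an illustrative assumption, as a real deployment would receive them over a communication link:

```python
import numpy as np

def run_on_device(x, w, my_rank, M=2):
    # Segment the input tensor of the first operator into M first sub-tensors.
    first_subs = np.split(x, M, axis=0)
    # Process the first target sub-tensor with the first operator.
    second_target = first_subs[my_rank] @ w
    # Receive the second sub-tensors computed by the other devices (mocked).
    received = [first_subs[r] @ w for r in range(M) if r != my_rank]
    # Rearrange the second target sub-tensor and the received sub-tensors
    # into the first tensor, restoring the original batch order.
    parts = received[:my_rank] + [second_target] + received[my_rank:]
    first_tensor = np.concatenate(parts, axis=0)
    # Segment the first tensor with the second segmentation method.
    return np.split(first_tensor, M, axis=1)
```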
In a possible implementation of the second aspect, each of the M second sub-tensors is obtained by transmitting a corresponding first sub-tensor of the M first sub-tensors to a corresponding operation device, and performing an operation corresponding to the first operator on the corresponding operation device.
In a possible implementation of the second aspect, the tensor rearrangement method is configured to obtain the first tensor by transmitting the M second sub-tensors to a same operation device for rearrangement.
In a possible implementation manner of the second aspect, the input tensor of the first operator is the same as the first tensor after an operation corresponding to the first operator is performed on an operation device.
In a possible implementation of the second aspect, the second sub-tensor is a P-dimension tensor, a sub-tensor obtained by segmenting the first tensor by the second segmenting method is a Q-dimension tensor, and P is not an integral multiple of Q.
In one possible implementation of the second aspect, the neural network model is used to process image data, audio data, video data, or text data;
the neural network model comprises operators with input tensors of the image data, the audio data, the video data or the text data; or,
the input tensors of the operators included in the neural network model are obtained by processing the image data, the audio data, the video data or the text data by using at least one operator included in the neural network model.
In a third aspect, the present application provides a data processing method, where the method includes:
acquiring a neural network model and a processing method corresponding to the neural network model, wherein the neural network model comprises a first operator and a second operator; the processing method comprises a first slicing method for slicing the input tensor of the first operator and a second slicing method for slicing the input tensor of the second operator;
segmenting the input tensor of the first operator by the first segmentation method to obtain M first sub-tensors, wherein the M first sub-tensors comprise a first target sub-tensor;
processing the first target sub-tensor using the first operator to obtain a second target sub-tensor;
sending the second target sub-tensor to an operation device;
receiving a first tensor sent by the operation device, where the first tensor is obtained by the operation device rearranging a plurality of second sub-tensors by a rearrangement method, and the plurality of second sub-tensors include the second target sub-tensor;
and segmenting the first tensor by the second segmentation method.
In a possible implementation of the third aspect, the input tensor of the first operator is the same as the first tensor after an operation corresponding to the first operator is performed on an operation device.
In a possible implementation of the third aspect, the second sub-tensor is a P-dimension tensor, a sub-tensor obtained by segmenting the first tensor by the second segmenting method is a Q-dimension tensor, and P is not an integral multiple of Q.
In one possible implementation of the third aspect, the neural network model is used to process image data, audio data, video data, or text data;
the neural network model comprises operators with input tensors of the image data, the audio data, the video data or the text data; or,
the input tensors of the operators included in the neural network model are obtained by processing the image data, the audio data, the video data or the text data by using at least one operator included in the neural network model.
In a fourth aspect, the present application provides a data processing apparatus, the apparatus comprising:
an obtaining module, configured to obtain a data flow graph of a neural network model, where the data flow graph includes a first operator and a second operator of the neural network model;
a determining module, configured to determine, according to the dataflow graph, a processing method corresponding to the neural network model, where the processing method includes a first splitting method for splitting an input tensor of the first operator, a second splitting method for splitting an input tensor of the second operator, and a rearranging method for rearranging tensors between the first operator and the second operator;
the first splitting method is used for splitting an input tensor of the first operator to obtain M first sub-tensors, the rearranging method is used for rearranging M second sub-tensors to obtain a first tensor, the M second sub-tensors are outputs obtained by processing the M first sub-tensors through the first operator, and the first tensor is an input tensor of the second operator, wherein M is a positive integer greater than 1.
In a possible implementation of the fourth aspect, each of the M second sub-tensors is obtained by transmitting a corresponding first sub-tensor of the M first sub-tensors to a corresponding operation device, and performing an operation corresponding to the first operator on the corresponding operation device.
In a possible implementation of the fourth aspect, the tensor rearrangement method is used to obtain the first tensor by transmitting the M second sub-tensors to the same operation device for splicing.
In a possible implementation of the fourth aspect, the input tensor of the first operator is the same as the first tensor after an operation corresponding to the first operator is performed on an operation device.
In a possible implementation of the fourth aspect, the second sub-tensor is a P-dimension tensor, a sub-tensor obtained by segmenting the first tensor by the second segmenting method is a Q-dimension tensor, and P is not an integral multiple of Q.
In a possible implementation of the fourth aspect, the neural network model includes N operators, where the N operators include the first operator and the second operator, and the determining module is specifically configured to:
determining a plurality of candidate processing methods corresponding to the neural network model according to the data flow graph; each of the plurality of candidate processing methods includes a tensor splitting method for splitting an input tensor of each of the N operators and a rearrangement method for rearranging tensors between at least two of the N operators, where the tensor splitting method is used to split the input tensor of the corresponding operator, and the tensor rearrangement method is used to transmit an operation result obtained after an operation corresponding to the operator is performed to an operation device and then rearrange the operation result;
determining the processing method from the plurality of candidate processing methods according to an overhead value corresponding to each candidate processing method in the plurality of candidate processing methods, where the overhead value includes a memory overhead value corresponding to a tensor splitting method of each operator in the N operators and a first communication overhead value corresponding to a tensor rearrangement method included in each candidate processing method, the memory overhead value represents a memory overhead generated when an operation corresponding to the operator is performed, the first communication overhead value represents a communication overhead generated in a process of transmitting the operation result to one operation device, and the processing method is the candidate processing method with the smallest overhead value in the at least one candidate processing method.
In a possible implementation of the fourth aspect, the N operators include a third operator, each candidate processing method includes a target splitting method for splitting an input tensor of the third operator, the third operator is configured to perform a first operation on a sub-tensor obtained by splitting the input tensor of the third operator by the target splitting method, and transmit an operation result obtained by the operation to an operation device for a second operation, and the overhead value further includes a second communication overhead, where the second communication overhead represents a communication overhead generated in a process of transmitting the operation result to the operation device when the third operator is operated.
In a possible implementation of the fourth aspect, the memory overhead value is related to the number of parameters included in the M first sub-tensors and the type of the parameters.
In one possible implementation of the fourth aspect, the first communication overhead value is related to the number of parameters included in the M second sub-tensors and the type of the parameters.
In one possible implementation of the fourth aspect, the second communication overhead value is related to the number of parameters included in the M second sub-tensors and the type of the parameter.
In a possible implementation of the fourth aspect, the overhead value is an average of the memory overhead value, the first communication overhead value, and the second communication overhead value.
In a possible implementation of the fourth aspect, the determining module is further configured to:
determining a plurality of processing methods corresponding to the neural network model according to the data flow graph;
determining the plurality of candidate processing methods from the plurality of processing methods according to the memory overhead value or the first communication overhead value corresponding to each of the plurality of processing methods, wherein the plurality of candidate processing methods are part of the plurality of processing methods.
In one possible implementation of the fourth aspect, the data flow graph further includes a third operator of the neural network model, an output tensor of the third operator is an input tensor of the second operator, and the apparatus further includes:
and the merging module is used for merging the first operator to the second operator so as to enable the second segmentation method corresponding to the second operator to establish an association relationship with the first segmentation method and the tensor rearrangement method.
In one possible implementation of the fourth aspect, the data flow graph further includes a fourth operator of the neural network model, an output tensor of the first operator is an input tensor of the fourth operator, and the apparatus further includes:
a merging module, configured to merge the second operator to the first operator, so that the first splitting method corresponding to the first operator establishes an association relationship with the tensor redistribution method and the second splitting method.
In one possible implementation of the fourth aspect, the dataflow graph further includes a fifth operator and a sixth operator of the neural network model, wherein the first tensor is a first input tensor of the second operator and a first input tensor of the fifth operator, and an output tensor of the sixth operator is a second input tensor of the second operator and a second input tensor of the fifth operator, the obtaining module is configured to:
obtaining a third segmentation method for segmenting the input tensor of the sixth operator, a fourth segmentation method for segmenting the second input tensor of the second operator, and a first tensor rearrangement method for rearranging tensors between the fourth operator and the second operator; the third segmentation method is configured to segment the second input tensor of the sixth operator to obtain a plurality of third sub-tensors, the first tensor rearrangement method is configured to rearrange a plurality of fourth sub-tensors, obtained by performing the operation corresponding to the sixth operator on the plurality of third sub-tensors, to obtain a third tensor, and the fourth segmentation method is configured to segment the third tensor;
the device further comprises:
a merging module, configured to merge the sixth operator into the first operator, so that a first association is established between the first segmentation method corresponding to the first operator and the third segmentation method, the fourth segmentation method, and the first tensor rearrangement method.
In a possible implementation of the fourth aspect, the apparatus further includes:
a merging module, configured to merge the sixth operator to the first operator, so that a second association relationship is established between the first splitting method corresponding to the first operator and the third splitting method as well as the second splitting method.
In one possible implementation of the fourth aspect, the neural network model is used to process image data, audio data, video data, or text data;
the neural network model comprises operators with input tensors of the image data, the audio data, the video data or the text data; or,
the input tensors of the operators included in the neural network model are obtained by processing the image data, the audio data, the video data or the text data by using at least one operator included in the neural network model.
In a fifth aspect, the present application provides a data processing apparatus, the apparatus comprising:
the device comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a neural network model and a processing method corresponding to the neural network model, and the neural network model comprises a first operator and a second operator; the processing method includes a first splitting method for splitting an input tensor of the first operator, a second splitting method for splitting an input tensor of the second operator, and a rearranging method for rearranging tensors between the first operator and the second operator;
a first segmentation module, configured to segment an input tensor of the first operator by using the first segmentation method to obtain M first sub-tensors, where the M first sub-tensors include a first target sub-tensor;
the operation module is used for processing the first target sub-tensor by using the first operator to obtain a second target sub-tensor;
the receiving module is used for receiving at least one second sub tensor sent by at least one computing device; each second sub-tensor is obtained by performing operation corresponding to the first operator on one first sub-tensor except the first target sub-tensor in the M first sub-tensors through corresponding operation equipment;
a rearrangement module, configured to rearrange the second target sub-tensor and the at least one second sub-tensor by the rearrangement method to obtain a first tensor;
and the second segmentation module is used for segmenting the first tensor by the second segmentation method.
In a possible implementation of the fifth aspect, each of the M second sub-tensors is obtained by transmitting a corresponding first sub-tensor of the M first sub-tensors to a corresponding operation device, and performing an operation corresponding to the first operator on the corresponding operation device.
In a possible implementation of the fifth aspect, the tensor rearrangement method is configured to obtain the first tensor by transmitting the M second sub-tensors to a same operation device for rearrangement.
In a possible implementation manner of the fifth aspect, the input tensor of the first operator is the same as the first tensor after an operation corresponding to the first operator is performed on an operation device.
In a possible implementation of the fifth aspect, the second sub-tensor is a P-dimension tensor, a sub-tensor obtained by segmenting the first tensor by the second segmenting method is a Q-dimension tensor, and P is not an integral multiple of Q.
In one possible implementation of the fifth aspect, the neural network model is used to process image data, audio data, video data, or text data;
the neural network model comprises operators with input tensors of the image data, the audio data, the video data or the text data; or,
the input tensors of the operators included in the neural network model are obtained by processing the image data, the audio data, the video data or the text data by using at least one operator included in the neural network model.
In a sixth aspect, an embodiment of the present application provides a data processing apparatus, where the apparatus includes:
the device comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a neural network model and a processing method corresponding to the neural network model, and the neural network model comprises a first operator and a second operator; the processing method comprises a first slicing method for slicing the input tensor of the first operator and a second slicing method for slicing the input tensor of the second operator;
a first segmentation module, configured to segment an input tensor of the first operator by using the first segmentation method to obtain M first sub-tensors, where the M first sub-tensors include a first target sub-tensor;
the operation module is used for processing the first target sub-tensor by using the first operator to obtain a second target sub-tensor;
the sending module is used for sending the second target sub-tensor to the operation equipment;
a receiving module, configured to receive a first tensor sent by the operation device, where the first tensor is obtained by rearranging, by the operation device, a plurality of second sub-tensors by using a rearrangement method, where the plurality of second sub-tensors include the second target sub-tensor;
and the second segmentation module is used for segmenting the first tensor by the second segmentation method.
In a possible implementation of the sixth aspect, the result obtained by performing the operation corresponding to the first operator on an operation device by using the input tensor of the first operator is the same as the first tensor.
In a possible implementation of the sixth aspect, the second sub-tensor is a P-dimension tensor, a sub-tensor obtained by segmenting the first tensor by the second segmentation method is a Q-dimension tensor, and P is not an integral multiple of Q.
In one possible implementation of the sixth aspect, the neural network model is used to process image data, audio data, video data, or text data;
the neural network model comprises operators with input tensors of the image data, the audio data, the video data or the text data; or,
the input tensors of the operators included in the neural network model are obtained by processing the image data, the audio data, the video data or the text data by using at least one operator included in the neural network model.
In a seventh aspect, an embodiment of the present application provides a data processing apparatus, which may include a memory, a processor, and a bus system, where the memory is used to store a program and the processor is used to execute the program in the memory to perform the first aspect and any optional method thereof.
In an eighth aspect, embodiments of the present application provide a data processing apparatus, which may include a memory, a processor, and a bus system, where the memory is used to store a program and the processor is used to execute the program in the memory to perform the second aspect and any optional method thereof.
In a ninth aspect, embodiments of the present application provide a data processing apparatus, which may include a memory, a processor, and a bus system, where the memory is used to store a program and the processor is used to execute the program in the memory to perform the third aspect and any optional method thereof.
In a tenth aspect, embodiments of the present application provide a computer-readable storage medium, in which a computer program is stored, and when the computer program runs on a computer, the computer is caused to execute the above first aspect or any optional method thereof.
In an eleventh aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program that, when run on a computer, causes the computer to execute the second aspect and any optional method thereof.
In a twelfth aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program that, when run on a computer, causes the computer to execute the third aspect and any optional method thereof.
In a thirteenth aspect, embodiments of the present application provide a computer program, which when run on a computer, causes the computer to perform the first aspect and any optional method thereof.
In a fourteenth aspect, embodiments of the present application provide a computer program, which when run on a computer, causes the computer to perform the second aspect and any optional method thereof.
In a fifteenth aspect, embodiments of the present application provide a computer program which, when run on a computer, causes the computer to perform the third aspect and any optional method thereof.
In a sixteenth aspect, the present application provides a chip system, which includes a processor configured to enable an executing device to implement the functions recited in the above aspects, for example, to transmit or process the data or information recited in the above methods. In one possible design, the chip system further includes a memory for storing program instructions and data necessary for the execution device or the training device. The chip system may consist of a chip, or may include a chip and other discrete devices.
The application provides a data processing method, including: acquiring a dataflow graph of a neural network model, where the dataflow graph includes a first operator and a second operator of the neural network model; and determining, according to the dataflow graph, a processing method corresponding to the neural network model, where the processing method includes a rearrangement method for rearranging tensors between the first operator and the second operator. Because the number of parameters in each dimension of a sub-tensor produced by running an operator on a split tensor may be far smaller than that of the tensor produced by running the operator on the unsplit tensor, the choice of segmentation methods for the input tensor of the subsequent operator is very limited. In this embodiment, the sub-tensors produced by the operator's operation are rearranged, which is equivalent to restoring the tensor that running the operator on the unsplit tensor would have produced; the rearranged tensor then offers the subsequent operator far more choices of segmentation method. Different operators can therefore segment their input tensors with different segmentation methods, each using the one best suited to it. This helps improve the processing efficiency of the dataflow graph and reduces the overhead the devices incur in computing it: a dataflow graph often includes different operators, the segmentation methods suited to their input data may differ, and only when an operator segments its input data with the method best suited to it can its computational efficiency be guaranteed.
Drawings
FIG. 1 is a schematic structural diagram of an artificial intelligence body framework;
FIG. 2 is a schematic diagram of a tensor segmentation method;
FIG. 3 is a representation of a deep learning framework software stack;
fig. 4a is a schematic diagram of an application architecture provided in the embodiment of the present application;
fig. 4b is a schematic diagram of an application architecture provided in the embodiment of the present application;
FIG. 5 is a schematic diagram of a system architecture provided in an embodiment of the present application;
FIG. 6 is a schematic of a convolutional/pooling layer;
FIG. 7 is a schematic of a convolutional neural network;
fig. 8 is a hardware structure of a chip according to an embodiment of the present disclosure;
fig. 9 is a flowchart illustrating a data processing method according to an embodiment of the present application;
FIG. 10 is a flow chart illustrating a data processing method according to an embodiment of the present application;
FIG. 11 is a flow chart illustrating a data processing method according to an embodiment of the present application;
FIG. 12 is a schematic diagram of a rearrangement method in an embodiment of the present application;
FIG. 13 is a schematic diagram of a graph contraction in an embodiment of the present application;
FIG. 14 is a schematic diagram of a graph contraction in an embodiment of the present application;
FIG. 15 is a schematic diagram of a graph contraction in an embodiment of the present application;
FIG. 16 is a schematic diagram of a graph contraction in an embodiment of the present application;
FIG. 17 is a schematic diagram of a graph contraction in an embodiment of the present application;
fig. 18 is a flowchart illustrating a data processing method according to an embodiment of the present application;
fig. 19 is a flowchart illustrating a data processing method according to an embodiment of the present application;
fig. 20 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
fig. 21 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
fig. 22 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
fig. 23 is a schematic structural diagram of an execution device according to an embodiment of the present application;
FIG. 24 is a schematic structural diagram of a training apparatus provided in an embodiment of the present application;
fig. 25 is a schematic structural diagram of a chip according to an embodiment of the present disclosure.
Detailed Description
The embodiments of the present invention will be described below with reference to the drawings. The terminology used in the description of the embodiments of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
Embodiments of the present application are described below with reference to the accompanying drawings. As can be known to those skilled in the art, with the development of technology and the emergence of new scenarios, the technical solution provided in the embodiments of the present application is also applicable to similar technical problems.
The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and are merely descriptive of the various embodiments of the application and how objects of the same nature can be distinguished. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The general workflow of an artificial intelligence system is described first. FIG. 1 shows a schematic structural diagram of an artificial intelligence framework, which is explained below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects the series of processes from data acquisition onward, for example the general flow of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output; in this flow, the data undergoes a refinement process of "data - information - knowledge - wisdom". The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure and information (technologies for providing and processing information) up to the industrial ecology of the system.
(1) Infrastructure
The infrastructure provides computing-power support for the artificial intelligence system, enables communication with the outside world, and provides support through a base platform. It communicates with the outside through sensors; computing power is provided by intelligent chips (hardware acceleration chips such as CPUs, NPUs, GPUs, ASICs, and FPGAs); the base platform includes related platform guarantees and support such as distributed computing frameworks and networks, and may include cloud storage and computing, interconnection networks, and the like. For example, sensors communicate with the outside to acquire data, and the data is provided to the intelligent chips in the distributed computing system offered by the base platform for computation.
(2) Data
Data at the level above the infrastructure represents the data sources of the artificial intelligence field. The data involves graphs, images, speech, and text, as well as Internet-of-Things data from traditional devices, including service data of existing systems and sensed data such as force, displacement, liquid level, temperature, and humidity.
(3) Data processing
Data processing typically includes data training, machine learning, deep learning, searching, reasoning, decision making, and the like.
Machine learning and deep learning can perform symbolic and formalized modeling, extraction, preprocessing, training, and the like of intelligent information on the data.
Inference refers to the process of simulating human intelligent inference in a computer or intelligent system: according to an inference control strategy, the machine uses formalized information to reason about and solve problems. Typical functions are searching and matching.
Decision making refers to the process of making decisions after intelligent information has been reasoned about, and generally provides functions such as classification, ranking, and prediction.
(4) General capabilities
After the above-mentioned data processing, further based on the result of the data processing, some general capabilities may be formed, such as algorithms or a general system, e.g. translation, analysis of text, computer vision processing, speech recognition, recognition of images, etc.
(5) Intelligent product and industrial application
Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They encapsulate the overall artificial intelligence solution, productize intelligent information decision making, and realize practical applications. The application fields mainly include intelligent terminals, intelligent transportation, intelligent healthcare, autonomous driving, safe cities, and the like.
In data processing, data parallelism can be viewed as slicing an operator's input tensor along the batch dimension (batch), with the sliced tensors operated on by the corresponding computing devices. The devices are organized into a high-dimensional mesh structure with each dimension named, and the tensor dimensions are named correspondingly so as to express the mapping from tensors to devices. In the example of fig. 2, MatMul(x, w) = h and MatMul(h, v) = y, where MatMul denotes matrix multiplication and x, w, h, v, y are matrices (two-dimensional tensors); the dimensions of x can be represented as (b, d_io), and the dimensions of w as (d_io, d_h); the 4 computing devices can likewise be represented as a matrix (r, c). The tensor-to-device slicing rule can then be expressed as [(b, r), (d_h, c)]. If the total number of tensor dimensions to be split is N and the device mesh has M dimensions, there are M^N possible mappings in total. Mesh-TensorFlow models this as an integer programming problem and then uses an existing solver to find the optimal mapping.
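For intuition, the following short Python sketch (variable names such as mesh_dims are illustrative, not Mesh-TensorFlow's API) enumerates all ways of mapping N tensor dimensions onto an M-dimensional device mesh, confirming the M^N count above.

```python
from itertools import product

mesh_dims = ("r", "c")                  # M = 2 device-mesh dimensions
tensor_dims = ("b", "d_io", "d_h")      # N = 3 tensor dimensions to map

# Each mapping assigns one mesh dimension to every tensor dimension.
mappings = list(product(mesh_dims, repeat=len(tensor_dims)))
print(len(mappings))                    # 2 ** 3 = 8, i.e., M ** N
```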
This approach can unify data parallelism and model parallelism, slice the training data and model parameters simultaneously, and actually train neural networks on concrete hardware. However, the tensors in all layers of the whole network can only be sliced in the same way, so mixed parallelism cannot be realized. In fig. 2, because of the mapping (d_h, c), the second dimension of tensor w, the second dimension of tensor h, and the first dimension of tensor v are all mapped onto the second dimension of the device matrix. Such a mapping is limited: corresponding dimensions of tensors appearing in different operators cannot be mapped to different device dimensions. Mixed slicing from data parallelism to model parallelism, or from model parallelism to data parallelism, cannot be realized; that is, tensor rearrangement is not supported. Moreover, when integer programming is used to solve for the mapping, minimum memory usage is taken as the objective, which may lead to poor training performance. One factor that slows training is a large volume of network communication, and optimizing memory usage in isolation may increase the communication volume; for a matrix multiplication operator, for example, splitting the last dimension of the first tensor introduces an AllReduce into the forward computation, thereby incurring communication overhead.
The application can be applied to the field of deep learning. As shown in fig. 3, in a deep learning framework software stack, a user submits a written script program to the framework, and after front-end parsing the framework obtains a dataflow graph formed by operators. After a parallel strategy search over the dataflow graph, each operator is marked with its parallel strategy. The graph partitioning process then slices the whole dataflow graph according to the parallel strategies and assigns each device one computation subgraph; finally the subgraphs are dispatched to the computing devices for execution.
Next, an application architecture of the embodiment of the present application is described, referring to fig. 4a, fig. 4a is an application architecture schematic provided in the embodiment of the present application, as shown in fig. 4a, a computer system may be a multi-core system including a CPU and multiple GPUs, in which the CPU or the GPU may call and execute code in a corresponding memory, and the CPU and the multiple GPUs may be interconnected through a bus.
In the embodiment of the application, each GPU may acquire a data flow graph corresponding to the neural network model from a corresponding memory, the data flow graph includes each operator in the neural network model and a segmentation method corresponding to the neural network model, and each GPU may perform data processing based on the data flow graph. How each GPU performs data processing based on the dataflow graph will be described in the following embodiments, and details are not repeated here.
Hereinafter, processing modules such as GPUs and/or CPUs may be described as computing devices.
Next, another application architecture of the embodiment of the present application is described, referring to fig. 4b, where fig. 4b is a schematic application architecture provided in the embodiment of the present application, as shown in fig. 4b, a computer system may include a plurality of computing devices, where the computing devices may be independent computing devices, such as independent computers or terminal devices, and in the system, the plurality of computing devices may be interconnected through a bus.
In the embodiment of the application, each operation device may obtain a data flow graph corresponding to the neural network model, the data flow graph includes each operator in the neural network model and a segmentation method corresponding to the neural network model, and each operation device may perform data processing based on the data flow graph. How each computing device performs data processing based on the dataflow graph will be described in the following embodiments, and details are not repeated here.
Since the embodiments of the present application relate to the application of a large number of neural networks, for the convenience of understanding, the related terms and related concepts such as neural networks related to the embodiments of the present application will be described below.
(1) Neural network
The neural network may be composed of neural units, where a neural unit may refer to an operation unit that takes x_s and an intercept of 1 as inputs.
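For reference, the output of such a unit is commonly written as follows (a standard formulation, not quoted from this application; W_s are weights, b is the bias, and f is the activation function):

```latex
h_{W,b}(x) = f\left(W^{\mathsf{T}} x\right) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)
```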
(2) Deep neural network
Deep neural networks (DNN) can be understood as neural networks with many hidden layers; "many" has no special threshold here, and what we often call a multilayer neural network and a deep neural network are essentially the same thing. Dividing a DNN by the position of its layers, the layers can be divided into three categories: the input layer, the hidden layers, and the output layer. Generally, the first layer is the input layer, the last layer is the output layer, and the layers in between are all hidden layers. The layers are fully connected; that is, any neuron of the i-th layer is necessarily connected with any neuron of the (i+1)-th layer.
(3) Convolutional Neural Networks (CNN) are a type of deep neural Network with convolutional structures.
(4) Back propagation algorithm
The convolutional neural network can adopt a back propagation (BP) algorithm to correct the values of the parameters in the initial super-resolution model during training, so that the reconstruction error loss of the super-resolution model becomes smaller and smaller. Specifically, forward propagation of the input signal through to the output produces an error loss, and the parameters in the initial super-resolution model are updated by propagating the error loss information backwards, so that the error loss converges. The back propagation algorithm is an error-loss-dominated backward movement aiming at obtaining optimal parameters of the super-resolution model, such as a weight matrix.
(5) Recurrent Neural Networks (RNNs) are used to process sequence data.
RNNs aim at giving machines the ability to remember in the way humans do. Therefore, the output of an RNN depends on both the current input information and the historical memory information.
(6) Loss function
When training a deep neural network, because we want the output of the network to be as close as possible to the value actually desired, we can update the weight vector of each layer according to the difference between the network's current predicted value and the truly desired target value (of course, there is usually an initialization before the first update, i.e., parameters are preconfigured for each layer of the deep neural network). For example, if the network's predicted value is too high, the weight vectors are adjusted so that it predicts lower, and the adjustment continues until the deep neural network can predict the truly desired target value or a value very close to it. It is therefore necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the role of loss functions (loss functions) or objective functions (objective functions), which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network becomes the process of reducing this loss as much as possible.
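As one illustrative example (not prescribed by this application), the widely used mean squared error loss over n samples, with predictions \hat{y}_i and target values y_i, is:

```latex
L(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```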
(7) Back propagation algorithm
The neural network can adopt a back propagation (BP) algorithm to correct the values of the parameters in the initial neural network model during training, so that the reconstruction error loss of the neural network model becomes smaller and smaller. Specifically, forward propagation of the input signal through to the output produces an error loss, and the parameters in the initial neural network model are updated by propagating the error loss information backwards, so that the error loss converges. The back propagation algorithm is an error-loss-dominated backward movement aiming at obtaining optimal parameters of the neural network model, such as a weight matrix.
The following describes a system architecture provided by the embodiments of the present application.
Referring to fig. 5, the present embodiment provides a system architecture 100. As shown in the system architecture 100, the data collecting device 160 is configured to collect training data, which in this embodiment of the present application includes an image or image block of an object and the category of the object, and to store the training data into the database 130. The training device 120 trains a CNN feature extraction model based on the training data maintained in the database 130 (the feature extraction model here is the model obtained in the training stage described above and may be a neural network for feature extraction, etc.). How the training device 120 obtains the CNN feature extraction model based on training data will be described in more detail in a later embodiment. The CNN feature extraction model can be used to implement the neural network provided in the embodiment of the present application: after relevant preprocessing, an image or image block to be recognized is input into the CNN feature extraction model, and information such as 2D, 3D, Mask, and key points of the object of interest in the image or image block can be obtained. The CNN feature extraction model in the embodiment of the present application may specifically be a CNN convolutional neural network. It should be noted that, in practical applications, the training data maintained in the database 130 does not necessarily all come from the data acquisition device 160 and may also be received from other devices. It should also be noted that the training device 120 does not necessarily train the CNN feature extraction model entirely based on the training data maintained in the database 130; it may also obtain training data from the cloud or elsewhere for model training.
The target model/rule obtained by the training device 120 may be applied to different systems or devices, for example the execution device 110 shown in fig. 5. The execution device 110 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an augmented reality (AR)/virtual reality (VR) device, or a vehicle-mounted terminal, or may be a server or a cloud. In fig. 5, the execution device 110 is configured with an input/output (I/O) interface 112 for data interaction with external devices, and a user may input data to the I/O interface 112 through the client device 140, where the input data may include an image or image block to be recognized.
When the execution device 110 preprocesses the input data, or when the computation module 111 of the execution device 110 performs computation-related processing (such as realizing the functions of the neural network in the present application), the execution device 110 may call data, code, and the like in the data storage system 150 for the corresponding processing, and may also store the data, instructions, and the like obtained by that processing into the data storage system 150.
Finally, the I/O interface 112 returns the processing results, such as the images or image blocks obtained as described above or the 2D, 3D, Mask, keypoints, etc., of the object of interest in the images to the client device 140, and thereby provides them to the user.
Alternatively, the client device 140 may be a planning control unit in an automatic driving system, or a beauty algorithm module in a mobile phone terminal.
It should be noted that the training device 120 may generate corresponding target models/rules based on different training data for different targets or different tasks, and the corresponding target models/rules may be used to achieve the targets or complete the tasks, so as to provide the user with the required results.
In the case shown in fig. 5, the user may manually give the input data, which may be operated through an interface provided by the I/O interface 112. Alternatively, the client device 140 may automatically send the input data to the I/O interface 112, and if the client device 140 is required to automatically send the input data to obtain authorization from the user, the user may set the corresponding permissions in the client device 140. The user can view the result output by the execution device 110 at the client device 140, and the specific presentation form can be display, sound, action, and the like. The client device 140 may also serve as a data collection terminal, collecting input data of the input I/O interface 112 and output results of the output I/O interface 112 as new sample data, and storing the new sample data in the database 130. Of course, the input data inputted to the I/O interface 112 and the output result outputted from the I/O interface 112 as shown in the figure may be directly stored in the database 130 as new sample data by the I/O interface 112 without being collected by the client device 140.
It should be noted that fig. 5 is only a schematic diagram of a system architecture provided in an embodiment of the present application, and the position relationship between the devices, modules, and the like shown in the diagram does not constitute any limitation, for example, in fig. 5, the data storage system 150 is an external memory with respect to the execution device 110, and in other cases, the data storage system 150 may also be disposed in the execution device 110.
As shown in fig. 5, a CNN feature extraction model is obtained by training according to the training device 120, and the CNN feature extraction model may be a CNN convolutional neural network in the embodiment of the present application or a neural network to be described in the following embodiments.
Since CNN is a very common neural network, the structure of CNN will be described in detail below with reference to fig. 5. As described in the introduction of the basic concept above, the convolutional neural network is a deep neural network with a convolutional structure, and is a deep learning (deep learning) architecture, where the deep learning architecture refers to performing multiple levels of learning at different abstraction levels through a machine learning algorithm. As a deep learning architecture, CNN is a feed-forward artificial neural network in which individual neurons can respond to images input thereto.
The structure of the convolutional neural network employed in the embodiment of the present application can be as shown in fig. 6. In fig. 6, Convolutional Neural Network (CNN)200 may include an input layer 210, a convolutional/pooling layer 220 (where pooling is optional), and a neural network layer 230. The input layer 210 may obtain an image to be processed, and deliver the obtained image to be processed to the convolutional layer/pooling layer 220 and the following neural network layer 230 for processing, so as to obtain a processing result of the image. The following describes the internal layer structure in CNN 200 in fig. 6 in detail.
An operator in the present application may represent an entire layer structure, or a partial operation in a layer structure.
Convolutional layer/pooling layer 220:
Convolutional layers:
The convolutional layer/pooling layer 220 shown in fig. 6 may include layers 221 to 226. For example, in one implementation, layer 221 is a convolutional layer, layer 222 is a pooling layer, layer 223 is a convolutional layer, layer 224 is a pooling layer, layer 225 is a convolutional layer, and layer 226 is a pooling layer; in another implementation, layers 221 and 222 are convolutional layers, layer 223 is a pooling layer, layers 224 and 225 are convolutional layers, and layer 226 is a pooling layer. That is, the output of a convolutional layer may be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
A pooling layer:
Since it is often desirable to reduce the number of training parameters, a pooling layer often needs to be periodically introduced after a convolutional layer. In the layers 221 to 226 illustrated by 220 in fig. 6, this may be one convolutional layer followed by one pooling layer, or multiple convolutional layers followed by one or more pooling layers.
The neural network layer 230:
After processing by the convolutional layer/pooling layer 220, the convolutional neural network 200 is not yet able to output the required output information, because, as described above, the convolutional layer/pooling layer 220 only extracts features and reduces the parameters brought by the input image. To generate the final output information (the required class information or other relevant information), the convolutional neural network 200 needs the neural network layer 230 to generate one output, or a set of outputs whose number equals the number of required classes. Accordingly, the neural network layer 230 may include a plurality of hidden layers (231, 232 to 23n shown in fig. 6) and an output layer 240, where the parameters contained in the hidden layers may be pre-trained according to the related training data of a specific task type; for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and the like.
It should be noted that the convolutional neural network 200 shown in fig. 6 is only an example of a convolutional neural network, and in specific applications the convolutional neural network may also exist in the form of other network models.
The structure of the convolutional neural network of the embodiment of the present application can be as shown in fig. 7. In fig. 7, the convolutional neural network (CNN) 200 may include an input layer 110, a convolutional layer/pooling layer 120 (where the pooling layer is optional), and a neural network layer 130. Compared with fig. 6, multiple convolutional layers/pooling layers in the convolutional layer/pooling layer 120 of fig. 7 are parallel, and the features they extract separately are all input to the neural network layer 130 for processing.
Fig. 8 is a hardware structure of a chip provided in an embodiment of the present application, where the chip includes a neural network processor NPU 50. The chip may be provided in the execution device 110 as shown in fig. 5 to complete the calculation work of the calculation module 111. The chip may also be disposed in the training apparatus 120 as shown in fig. 5 to complete the training work of the training apparatus 120 and output the target model/rule. The algorithms for the various layers in the convolutional neural networks shown in fig. 6 and 7 can be implemented in a chip as shown in fig. 8.
The neural network processor NPU 50 is mounted as a coprocessor on a host CPU, which distributes tasks. The core portion of the NPU is the arithmetic circuit 503; the controller 504 controls the arithmetic circuit 503 to extract data from memory (the weight memory or the input memory) and perform operations.
The vector calculation unit 507 may further process the output of the arithmetic circuit.
The unified memory 506 is used to store input data as well as output data.
A direct memory access controller (DMAC) 505 is used to transfer input data in the external memory to the input memory 501 and/or the unified memory 506, to store weight data from the external memory into the weight memory 502, and to store data from the unified memory 506 into the external memory.
A Bus Interface Unit (BIU) 510, configured to implement interaction between the main CPU, the DMAC, and the instruction fetch memory 509 through a bus.
An instruction fetch buffer 509 connected to the controller 504 for storing instructions used by the controller 504;
The controller 504 is configured to call the instructions cached in the instruction fetch buffer 509 to control the working process of the operation accelerator.
Referring to fig. 9, fig. 9 is a flowchart illustrating a data processing method according to an embodiment of the present application, and as shown in fig. 9, the data processing method according to the embodiment of the present application includes:
901. obtaining a data flow graph of a neural network model, wherein the data flow graph comprises a first operator and a second operator of the neural network model.
In the embodiment of the application, a data flow graph of a neural network model can be obtained, wherein the data flow graph comprises a first operator and a second operator of the neural network model.
In an embodiment of the present application, a dataflow graph may include a plurality of operators including a first operator and a second operator.
In this embodiment, the output tensor of the first operator is the input tensor of the second operator.
902. Determining a processing method corresponding to the neural network model according to the dataflow graph, wherein the processing method comprises a first splitting method for splitting an input tensor of the first operator, a second splitting method for splitting an input tensor of the second operator, and a rearranging method for rearranging tensors between the first operator and the second operator;
the first splitting method is used for splitting an input tensor of the first operator to obtain M first sub-tensors, the rearranging method is used for rearranging M second sub-tensors to obtain a first tensor, the M second sub-tensors are outputs obtained by processing the M first sub-tensors through the first operator, and the first tensor is an input tensor of the second operator, wherein M is a positive integer greater than 1.
In the embodiment of the application, after the data flow graph is obtained, a processing method corresponding to the neural network model can be determined according to the data flow graph.
In an embodiment of the present application, a dataflow graph includes a plurality of operators, and the possible segmentation methods of the input tensor of each of the plurality of operators included in the dataflow graph may be determined. A segmentation method includes the number of cuts for each dimension of the tensor, which is related to the number of computing devices, the size of each dimension in the operator, and the type of the operator. The number of computing devices determines the upper limit on the number of cuts in each dimension; that is, the number of cuts does not exceed the number of computing devices. The size of each dimension in the operator constrains the number of cuts of that dimension; for example, the size of the corresponding dimension of the tensor may need to be an integral multiple of the number of cuts. The type of the operator determines the slicing rules relating the cut counts of the dimensions; for example, the MatMul operator has two input tensors, and after slicing, the size of the second dimension of each sub-tensor of the B tensor must be the same as the size of the first dimension of each sub-tensor of the C tensor.
It should be noted that in this embodiment, the size of the tensor in one dimension can be understood as the specification of the tensor in the dimension, which can indicate how many elements the tensor includes in the dimension.
For example, taking the operator type MatMul as an example, with A = MatMul(B, C), where the shape of the B tensor is [8,32], the shape of the C tensor is [32,16], and the number of computing devices is 4, the following slicing methods can be generated (an illustrative enumeration sketch follows the list):
[[1,1],[1,1]], which means that neither dimension of the B tensor nor of the C tensor is cut;
[[1,1],[1,2]], which means that only the second dimension of the C tensor is cut into 2 parts; the C tensor yields two sub-tensors, each of size [32,8];
[[1,1],[1,4]], which means that only the second dimension of the C tensor is cut into 4 parts; the C tensor yields four sub-tensors, each of size [32,4];
[[1,2],[2,1]], which means that the second dimension of the B tensor and the first dimension of the C tensor are each cut into 2 parts; the B tensor yields two sub-tensors of size [8,16], and the C tensor yields two sub-tensors of size [16,16];
[[1,2],[2,2]], which means that the second dimension of the B tensor is cut into 2 parts and both dimensions of the C tensor are cut into 2 parts; the B tensor yields two sub-tensors of size [8,16], and the C tensor yields four sub-tensors of size [16,8];
[[1,4],[4,1]], which means that the second dimension of the B tensor and the first dimension of the C tensor are each cut into 4 parts; the B tensor yields four sub-tensors of size [8,8], and the C tensor yields four sub-tensors of size [8,16];
[[2,1],[1,1]], which means that only the first dimension of the B tensor is cut into 2 parts; the B tensor yields two sub-tensors of size [4,32];
[[2,1],[1,2]], which means that the first dimension of the B tensor and the second dimension of the C tensor are each cut into 2 parts; the B tensor yields two sub-tensors of size [4,32], and the C tensor yields two sub-tensors of size [32,8];
[[2,2],[2,1]], which means that both dimensions of the B tensor and the first dimension of the C tensor are cut into 2 parts; the B tensor yields four sub-tensors of size [4,16], and the C tensor yields two sub-tensors of size [16,16];
[[4,1],[1,1]], which means that only the first dimension of the B tensor is cut into 4 parts; the B tensor yields four sub-tensors of size [2,32].
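The following Python sketch (an illustration under stated assumptions, not the patent's algorithm; the function name matmul_strategies is hypothetical) reproduces exactly this list by applying the constraints above: each cut count must divide the corresponding dimension size, the product of the cut counts must not exceed the device count, and B's second dimension must be cut the same way as C's first dimension.

```python
from itertools import product

def matmul_strategies(b_shape, c_shape, num_devices):
    """Enumerate candidate slicing strategies for A = MatMul(B, C)."""
    # Candidate cut counts: divisors of the device count (an assumption
    # of this sketch).
    cuts = [d for d in range(1, num_devices + 1) if num_devices % d == 0]
    strategies = []
    for b0, k, c1 in product(cuts, repeat=3):
        if b_shape[0] % b0 or b_shape[1] % k or c_shape[1] % c1:
            continue                        # cuts must divide dimension sizes
        if b0 * k * c1 > num_devices:
            continue                        # bounded by the device count
        # B is cut [b0, k] and C is cut [k, c1], so the contraction
        # dimension is cut identically on both sides.
        strategies.append(([b0, k], [k, c1]))
    return strategies

# Prints the ten strategies enumerated in the text above.
for s in matmul_strategies((8, 32), (32, 16), 4):
    print(s)
```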
In this embodiment of the application, the dataflow graph may include a plurality of operators and the connection relationships between the operators, where a connection relationship may represent the flow direction of a tensor. The dataflow graph includes a first operator and a second operator that are connected, and the output tensor of the first operator is the input tensor of the second operator; that is, after the input tensor is operated on by the first operator, the resulting output tensor may be used as the input tensor of the second operator.
In this embodiment of the present application, each operator included in the dataflow graph may be traversed, and the processing method corresponding to the neural network model may be determined. The processing method may include a first segmentation method for segmenting the input tensor of the first operator, i.e., the segmentation method of the input tensor of the first operator. The processing method may also include a second segmentation method for segmenting the input tensor of the second operator, i.e., the segmentation method of the input tensor of the second operator.
In this embodiment, the processing method may further include a rearranging method for rearranging tensors between the first operator and the second operator; the first splitting method is used for splitting an input tensor of the first operator to obtain M first sub-tensors, the rearranging method is used for rearranging M second sub-tensors to obtain a first tensor, the M second sub-tensors are outputs obtained by processing the M first sub-tensors through the first operator, and the first tensor is an input tensor of the second operator, wherein M is a positive integer greater than 1.
In this embodiment of the application, each of the M second sub-tensors is obtained by performing, on the corresponding computing device, the operation corresponding to the first operator on the corresponding first sub-tensor among the M first sub-tensors. The input tensor of the first operator can be sliced by the first segmentation method to obtain M first sub-tensors, and each of the M first sub-tensors can be processed by the first operator to obtain one second sub-tensor; that is, processing the M first sub-tensors with the first operator yields M second sub-tensors. The rearrangement method can then be used to rearrange the M second sub-tensors, where rearrangement can be understood as splicing the M second sub-tensors to obtain the first tensor; the first tensor is the same as the result that would be obtained by performing the operation corresponding to the first operator on the whole input tensor on a single computing device. Specifically, the tensor redistribution method may transmit the M second sub-tensors to the same computing device for splicing to obtain the first tensor, and the second segmentation method is then used to slice the first tensor.
In this embodiment of the application, in the framework shown in fig. 4a, M GPUs may obtain the input tensor of the first operator from memory. Each GPU may slice the input tensor of the first operator using the first segmentation method to obtain M first sub-tensors, and each GPU may select one of the first sub-tensors and perform the operation corresponding to the first operator on it to obtain a second sub-tensor; at this point each of the M GPUs holds one second sub-tensor. Then, using the rearrangement method, M-1 GPUs may send their second sub-tensors to one GPU, and that GPU may rearrange the received M-1 second sub-tensors together with the second sub-tensor it computed itself, where rearrangement can be understood as splicing the M second sub-tensors to obtain the first tensor; the first tensor is the same as the result that would be obtained by performing the operation corresponding to the first operator on the input tensor of the first operator on a single computing device. That GPU may then send the rearranged first tensor to the remaining M-1 GPUs, and each GPU may slice the first tensor using the second segmentation method.
In this embodiment of the application, in the framework shown in fig. 4b, M computing devices may obtain the input tensor of the first operator. Each computing device may slice the input tensor of the first operator using the first segmentation method to obtain M first sub-tensors, and each computing device may select one of the first sub-tensors and perform the operation corresponding to the first operator on it to obtain a second sub-tensor; at this point each of the M computing devices holds one second sub-tensor. Then, using the rearrangement method, M-1 computing devices may send their second sub-tensors to one computing device, and that device may rearrange the received M-1 second sub-tensors together with the second sub-tensor it computed itself, where rearrangement can be understood as splicing the M second sub-tensors to obtain the first tensor; the first tensor is the same as the result that would be obtained by performing the operation corresponding to the first operator on the input tensor of the first operator on a single computing device. That computing device may then send the rearranged first tensor to the remaining M-1 computing devices, and each computing device may slice the first tensor using the second segmentation method.
In this embodiment of the application, the second sub-tensor is a P-dimension tensor, the sub-tensor obtained by slicing the first tensor with the second segmentation method is a Q-dimension tensor, and P is not an integral multiple of Q. That is, although each of the M first sub-tensors can be processed by the first operator to obtain a second sub-tensor, on a single GPU a second sub-tensor cannot be sliced into the sub-tensors that the second segmentation method produces from the first tensor. In this case, if the plurality of second sub-tensors are first rearranged to obtain the first tensor, the first tensor can then be sliced on one GPU according to the second segmentation method.
Referring to fig. 10, fig. 10 is a schematic flow chart of a data processing method in the embodiment of the present application. As shown in fig. 10, the computer system includes GPU1, GPU2, GPU3, and GPU4, and GPU1 may acquire two tensors (tensor B and tensor C) and perform a matrix multiplication operation on them: A = MatMul(B, C), where the shape of the B tensor is [8,32], the shape of the C tensor is [32,16], and the number of computing devices is 4.
In this embodiment, the shape of a tensor can be understood as the specification of the tensor, expressing how many elements each dimension includes.
Specifically, each of GPU1 to GPU4 may divide the first dimension of the B tensor into 2 parts, leave the second dimension of the B tensor uncut, leave the first dimension of the C tensor uncut, and divide the second dimension of the C tensor into 2 parts. Slicing the B tensor yields two sub-tensors, each of shape [4,32], and slicing the C tensor yields two sub-tensors, each of shape [32,8]. Each GPU may then perform matrix multiplication on one sub-tensor obtained from the B tensor and one sub-tensor obtained from the C tensor (a different pair on each GPU), obtaining a result tensor of shape [4,8].
Then, GPU2, GPU3, and GPU4 may send their result tensors to GPU1. GPU1 may receive the result tensors obtained by GPU2, GPU3, and GPU4 and rearrange them together with the result tensor it computed itself, that is, splice the four result tensors to obtain the first tensor, whose shape is [8,16]; the first tensor is the same as the result obtained by directly performing matrix multiplication on the B tensor and the C tensor.
GPU1 may then send the first tensor to GPU2, GPU3, and GPU4; accordingly, after receiving the first tensor, GPU2, GPU3, and GPU4 may perform a slicing operation on it.
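A minimal NumPy sketch of this flow (variable names are illustrative; slicing and device placement are simulated in a single process) shows that splicing the four [4,8] result blocks reproduces the unsliced MatMul:

```python
import numpy as np

B = np.arange(8 * 32, dtype=np.float32).reshape(8, 32)
C = np.arange(32 * 16, dtype=np.float32).reshape(32, 16)

row_parts = np.split(B, 2, axis=0)   # two [4,32] sub-tensors of B
col_parts = np.split(C, 2, axis=1)   # two [32,8] sub-tensors of C

# Each (i, j) pair plays the role of one GPU computing its [4,8] block.
blocks = [[row_parts[i] @ col_parts[j] for j in range(2)] for i in range(2)]

# Rearrangement: splice the four result tensors into the first tensor [8,16].
first_tensor = np.block(blocks)
assert np.allclose(first_tensor, B @ C)   # same as the unsliced MatMul
```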
It should be noted that no limitation is imposed here; the rearrangement may be performed on any one of GPU1, GPU2, GPU3, and GPU4, or on another GPU.
In one embodiment, the computer system may include computing device 1, computing device 2, computing device 3, and computing device 4, and computing device 1 may acquire two tensors (tensor B and tensor C) and perform a matrix multiplication operation on them: A = MatMul(B, C), where the shape of the B tensor is [8,32], the shape of the C tensor is [32,16], and the number of computing devices is 4.
Specifically, each of computing devices 1 to 4 may divide the first dimension of the B tensor into 2 parts, leave the second dimension of the B tensor uncut, leave the first dimension of the C tensor uncut, and divide the second dimension of the C tensor into 2 parts. Slicing the B tensor yields two sub-tensors, each of shape [4,32], and slicing the C tensor yields two sub-tensors, each of shape [32,8]. Each computing device may then perform matrix multiplication on one sub-tensor obtained from the B tensor and one sub-tensor obtained from the C tensor (a different pair on each device), obtaining a result tensor of shape [4,8].
Then, computing devices 2, 3, and 4 may send their result tensors to computing device 1. Computing device 1 may receive the result tensors obtained by computing devices 2, 3, and 4 and rearrange them together with the result tensor it computed itself, that is, splice the four result tensors to obtain the first tensor, whose shape is [8,16]; the first tensor is the same as the result obtained by directly performing matrix multiplication on the B tensor and the C tensor.
After that, computing device 1 may send the first tensor to computing devices 2, 3, and 4; accordingly, after receiving the first tensor, computing devices 2, 3, and 4 may slice it.
Note that no limitation is imposed here; the rearrangement may be performed on any one of computing devices 1 to 4, or on another computing device.
Next, how to determine the processing method corresponding to the neural network model according to the data flow graph is described:
in the embodiment of the application, a plurality of candidate processing methods corresponding to the neural network model can be determined according to the dataflow graph. Each candidate processing method includes a tensor slicing method for slicing the input tensor of each of the N operators and a rearrangement method for rearranging tensors between at least two of the N operators; the tensor slicing method slices the input tensor of the corresponding operator, and the tensor rearrangement method rearranges the operation results obtained after the operator's operation is performed and transmitted to one computing device. The processing method is determined from the candidate processing methods according to an overhead value corresponding to each candidate processing method. The overhead value includes a memory overhead value corresponding to the tensor slicing method of each of the N operators and a first communication overhead value corresponding to the tensor rearrangement method included in the candidate processing method; the memory overhead value represents the memory overhead generated when the operation corresponding to an operator is performed, and the first communication overhead value represents the communication overhead generated in the process of transmitting the operation results to one computing device. The selected processing method is the candidate processing method with the smallest overhead value among the at least one candidate processing method.
In this embodiment of the present application, the input tensor of each operator may correspond to a plurality of segmentation methods, and when the above-mentioned situation that the segmentation cannot be performed occurs, a tensor rearrangement method may also be included between the operators.
In this embodiment, a dataflow graph may be traversed to determine multiple candidate processing methods corresponding to the neural network model, where each candidate processing method includes a tensor splitting method for splitting an input tensor of each of the N operators and a rearranging method for rearranging tensors between at least two of the N operators.
In order to select a better processing method from the multiple candidate processing methods, the overhead value of each of the multiple candidate processing methods needs to be determined. The overhead value includes a memory overhead value corresponding to the tensor slicing method of each of the N operators and a first communication overhead value corresponding to the tensor rearrangement method included in each candidate processing method; the memory overhead value represents the memory overhead generated when the operation corresponding to an operator is performed, and the first communication overhead value represents the communication overhead generated in the process of transmitting the operation results to one computing device.
In this embodiment of the application, the N operators include a third operator, each candidate processing method includes a target splitting method for splitting an input tensor of the third operator, the third operator is configured to perform a first operation on a sub-tensor obtained by splitting the input tensor of the third operator by the target splitting method, and transmit an operation result obtained by the operation to an operation device for a second operation, the overhead value further includes a second communication overhead, and the second communication overhead represents a communication overhead generated in a process of transmitting the operation result to the operation device when the third operator is operated.
Referring to fig. 11, fig. 11 is a schematic flow chart of a data processing method in this embodiment of the present application. As shown in fig. 11, the computer system includes GPU1, GPU2, GPU3, and GPU4, and GPU1 may acquire two tensors (tensor B and tensor C) and perform a matrix multiplication operation on them: A = MatMul(B, C), where the shape of the B tensor is [8,32], the shape of the C tensor is [32,16], and the number of computing devices is 4.
Specifically, each of GPU1 to GPU4 may divide the second dimension of the B tensor into 2 parts and the first dimension of the C tensor into 2 parts. Slicing the B tensor yields two sub-tensors, each of size [8,16], and slicing the C tensor yields two sub-tensors, each of size [16,16]. Each GPU may then perform matrix multiplication on one sub-tensor obtained from the B tensor and one sub-tensor obtained from the C tensor, obtaining a result tensor of size [8,16].
Then, GPU2, GPU3, and GPU4 may send their result tensors to GPU1. GPU1 may receive the result tensors obtained by GPU2, GPU3, and GPU4 and add them element by element with the result tensor it computed itself, obtaining the first tensor, whose size is [8,16]; the first tensor is the same as the result obtained by directly performing matrix multiplication on the B tensor and the C tensor.
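A hedged NumPy sketch of this variant (the text above describes four GPUs; mathematically the reduction is over the two contraction slices, so this single-process sketch uses two partial products) shows why splitting the contraction dimension requires an element-wise, AllReduce-style summation:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((8, 32)).astype(np.float32)
C = rng.standard_normal((32, 16)).astype(np.float32)

B_parts = np.split(B, 2, axis=1)   # two [8,16] sub-tensors of B
C_parts = np.split(C, 2, axis=0)   # two [16,16] sub-tensors of C

# Each device computes a partial [8,16] product over its contraction slice.
partials = [B_parts[k] @ C_parts[k] for k in range(2)]

# Element-wise addition of the partial results yields the first tensor.
first_tensor = partials[0] + partials[1]
assert np.allclose(first_tensor, B @ C, atol=1e-4)
```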
GPU1 may then send the first tensor to GPU2, GPU3, and GPU4; accordingly, after receiving the first tensor, GPU2, GPU3, and GPU4 may perform a slicing operation on it.
It should be noted that no limitation is imposed here; the element-wise addition may be performed on any one of GPU1 to GPU4, or on another GPU.
In this embodiment, the overhead value further includes a second communication overhead, which represents the communication overhead generated in the process of transmitting the operation results to one computing device when the third operator is executed. In the above example, the second communication overhead may represent the communication overhead generated when GPU2, GPU3, and GPU4 transmit their result tensors to GPU1.
In this embodiment, the second communication overhead value is related to the number and the type of the parameters included in the M second sub-tensors. Taking the above example, the second communication overhead value is related to the number of parameters (the number of elements included in a tensor) and the type of the parameters (for example, integer or floating point) in the result tensors obtained by the GPU2, the GPU3, and the GPU4.
In this embodiment of the application, the memory overhead value is related to the number and the type of the parameters included in the M first sub-tensors, and the first communication overhead value is related to the number and the type of the parameters included in the M second sub-tensors. The overhead value is an average of the memory overhead value, the first communication overhead value, and the second communication overhead value; the average may be an arithmetic average, a geometric average, a quadratic average (root mean square), a harmonic average, a weighted average, or the like, which is not limited in this application.
In the embodiment of the present application, a specific calculation formula of the overhead value may be:

memory overhead = (0 or 1) × (number of parameters included in the M first sub-tensors) × parameter-type weight; communication overhead = (0 or 1) × (number of parameters included in the M second sub-tensors) × parameter-type weight.

Taking A = MatMul(B, C) as an example, where the B matrix shape is [8,32], the C matrix shape is [32,16], and the number of devices is 4: if one of B or C is a parameter requiring gradient update, then when the policy is [[4,1],[1,1]], the memory overhead is (2×32 + 32×16) × parameter-type weight = 576 × parameter-type weight and the communication overhead is 0; when the policy is [[1,4],[4,1]], the memory overhead is (8×8 + 8×16) × parameter-type weight = 192 × parameter-type weight and the communication overhead is (8×16) × parameter-type weight = 128 × parameter-type weight.
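The following sketch works through the overhead formula above; the helper names and the parameter-type weight value are illustrative assumptions, not part of the application.

```python
def num_params(shape):
    n = 1
    for dim in shape:
        n *= dim
    return n

def memory_overhead(sub_tensor_shapes, needs_gradient, type_weight):
    flag = 1 if needs_gradient else 0  # no gradient update -> no memory cost
    return flag * sum(num_params(s) for s in sub_tensor_shapes) * type_weight

def communication_overhead(result_shapes, needs_transfer, type_weight):
    flag = 1 if needs_transfer else 0
    return flag * sum(num_params(s) for s in result_shapes) * type_weight

w = 1.0  # parameter-type weight, e.g. bytes per element (assumed)
# policy [[4,1],[1,1]]: B sub-tensor [2,32], C unsplit [32,16], no transfer
print(memory_overhead([(2, 32), (32, 16)], True, w))   # 576.0
print(communication_overhead([], False, w))            # 0.0
# policy [[1,4],[4,1]]: B sub [8,8], C sub [8,16], result [8,16] transferred
print(memory_overhead([(8, 8), (8, 16)], True, w))     # 192.0
print(communication_overhead([(8, 16)], True, w))      # 128.0
```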
It should be noted that, in the present application, if an input tensor of an operator does not require gradient update during training of the neural network model, no memory overhead is generated regardless of the corresponding slicing method.
After the operator strategies are generated and sparsified, strategies also need to be generated for rearrangement. A rearrangement strategy is defined as an edge strategy, that is, a 2-tuple formed by the strategy of the preceding operator and the strategy of the current operator, as shown in fig. 12. If node 1 has N available strategies and node 2 has M available strategies, the edge has N × M strategies. Each rearrangement strategy has a cost, and its calculation is similar to the cost calculation of an operator; a sketch is given below.
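A minimal sketch of edge-strategy construction, assuming strategies are stored as plain lists; the policies and costs shown are placeholders.

```python
from itertools import product

# Each edge strategy is a 2-tuple (strategy of the preceding operator,
# strategy of the current operator), so an edge between a node with N
# strategies and a node with M strategies carries N * M strategies.
node1_strategies = [[[4, 1], [1, 1]], [[1, 4], [4, 1]]]   # N = 2
node2_strategies = [[[4, 1]], [[1, 4]], [[2, 2]]]         # M = 3

edge_strategies = [
    {"pair": (s1, s2), "cost": 0.0}   # cost computed like an operator cost
    for s1, s2 in product(node1_strategies, node2_strategies)
]
assert len(edge_strategies) == 6      # N * M
```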
In the embodiment of the application, a plurality of processing methods corresponding to the neural network model can be determined according to the dataflow graph; the plurality of candidate processing methods are then determined from the plurality of processing methods according to the memory overhead value or the first communication overhead value corresponding to each processing method, the candidate processing methods being a subset of the processing methods.
In the embodiment of the application, sparsification can be performed after the processing methods are generated: for example, the processing methods can be sorted in non-decreasing order of memory occupancy, a step length is then calculated so as to retain 1/ε of them, and 1/ε processing methods are selected at that fixed stride. The advantage of doing so is that after sparsification, the number of remaining processing methods is independent of the tensor shapes and the number of devices, which is a great simplification when the shapes or the number of devices are large. On the other hand, ε is a user-configurable option: a user who wants to quickly find a strategy with acceptable performance should increase ε, while a user who wants the best-performing processing method should decrease ε.
In the embodiment of the application, after the processing methods are generated, they are sorted in non-decreasing order of memory occupancy, and strategies are selected at a fixed step length. This selection method can effectively reduce the search time for a processing method while largely preserving accuracy.
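A hedged sketch of this fixed-stride selection, assuming each candidate processing method carries a memory-occupancy figure:

```python
def sparsify(methods, epsilon):
    # sort candidates in non-decreasing order of memory occupancy
    ordered = sorted(methods, key=lambda m: m["memory"])
    keep = max(1, int(1 / epsilon))        # target number to retain
    step = max(1, len(ordered) // keep)    # fixed stride
    return ordered[::step][:keep]

candidates = [{"memory": m} for m in (512, 64, 256, 128, 1024, 768, 96, 320)]
print(sparsify(candidates, epsilon=0.25))  # at most 4 methods retained
```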
In this embodiment, the dataflow graph further includes a third operator of the neural network model, an output tensor of the third operator is an input tensor of the second operator, and the first operator may be merged into the second operator, so that the second splitting method corresponding to the second operator establishes an association relationship with the first splitting method and the tensor rearrangement method.
In the embodiment of the application, a Merge Elimination operation is provided, with which complex graph structures can be processed.
As shown in fig. 13, when such a substructure is encountered in the graph, the first splitting method N2 is contracted into the second operator together with the tensor rearrangement method E2, so that the second splitting method corresponding to the second operator is associated with the first splitting method and the tensor rearrangement method. During contraction, all strategies in N2 and E2 are traversed and the corresponding cost values are added. A sketch of this contraction step, and of the cost merge it performs, is given below.
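A minimal sketch of the Merge Elimination cost merge, under the assumption that strategies and edge strategies are stored as dictionaries with policy and cost entries:

```python
# For every strategy of the surviving (second) operator, all strategies of
# the contracted node N2 and of the edge E2 are traversed and the
# corresponding cost values are added, keeping the cheapest combination.
def merge_elimination_cost(n2_strategies, e2_strategies, survivor_strategies):
    for t in survivor_strategies:
        best = float("inf")
        for n in n2_strategies:
            for e in e2_strategies:
                # an edge strategy is a 2-tuple over the two endpoint policies
                if e["pair"] == (n["policy"], t["policy"]):
                    best = min(best, n["cost"] + e["cost"])
        t["cost"] += best  # assumes every survivor policy has a matching edge
    return survivor_strategies
```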
In this embodiment, the dataflow graph further includes a fourth operator of the neural network model, the output tensor of the first operator is an input tensor of the fourth operator, and the second operator may be merged into the first operator, so that the first splitting method corresponding to the first operator establishes an association relationship with the tensor rearrangement method and the second splitting method.
In the embodiment of the application, a Contract Elimination operation is provided, with which complex graph structures can be processed.
As shown in fig. 14, when such a substructure is encountered in the dataflow graph, the second splitting method N2 can be contracted into the first operator together with the tensor rearrangement method E1. During contraction, all strategies in the second splitting method N2 and the tensor rearrangement method E1 are traversed and the corresponding cost values are added; the procedure is analogous to the Merge Elimination sketch above.
In an embodiment of the present application, the dataflow graph further includes a fifth operator and a sixth operator of the neural network model, where the first tensor is a first input tensor of the second operator and a first input tensor of the fifth operator, and an output tensor of the sixth operator is a second input tensor of the second operator and a second input tensor of the fifth operator. A third splitting method for splitting the input tensor of the sixth operator, a fourth splitting method for splitting the second input tensor of the second operator, and a first tensor rearrangement method for rearranging the tensors between the fourth operator and the second operator are obtained; the third splitting method is configured to split the input tensor of the sixth operator to obtain a plurality of third sub-tensors, the first tensor rearrangement method is configured to rearrange a plurality of fourth sub-tensors, obtained by performing the operation corresponding to the sixth operator on the plurality of third sub-tensors, to obtain a third tensor, and the fourth splitting method is configured to split the third tensor. The sixth operator is then merged into the first operator, so that the first splitting method corresponding to the first operator establishes a first association relationship with the third splitting method, the fourth splitting method, and the first tensor rearrangement method.
As shown in fig. 15, when such a substructure is encountered in the dataflow graph, the third splitting method N4 may be contracted into the first operator N1, the first rearrangement method E3 is updated to E3' (combining the first rearrangement method and the first splitting method), and similarly E4 is updated to E4'. When the third splitting method N4 is contracted into the first operator N1, the splitting methods and costs contained in the first operator N1 need to be updated so that the updated node contains the strategies of the third splitting method N4, as sketched below.
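A hedged sketch of this Star Elimination update, assuming edge costs are kept in a dictionary keyed by policy pairs (policies hashable, e.g. tuples):

```python
# The contracted node N4 and the edge E3 are folded into N1: every rebuilt
# N1 entry fixes a choice for N4 and absorbs its node and edge costs, so the
# updated node contains the strategies of N4.
def star_elimination(n1_strategies, n4_strategies, e3_costs):
    updated = []
    for s1 in n1_strategies:
        for s4 in n4_strategies:
            updated.append({
                "policy": s1["policy"],
                "n4_policy": s4["policy"],  # recorded for strategy recovery
                "cost": s1["cost"] + s4["cost"]
                        + e3_costs[(s4["policy"], s1["policy"])],
            })
    return updated
```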
Illustratively, in the example of fig. 16, the graph after Star Elimination is subjected to edge contraction twice, namely a Contract Elimination and a Merge Elimination in sequence, so that the graph is contracted into a final graph with only one node.
In the embodiment of the present application, after the dataflow graph is contracted into a final graph with only one node, to select an optimal strategy from the candidate cost list an overhead value model is defined: f = alpha × memory overhead + beta × communication overhead, where alpha and beta are configurable parameters. This model is more general than prior arts one and three because both the memory cost and the communication cost are taken into account. alpha and beta are not set to fixed values because different hardware, processor speeds, and network card transmission speeds may make the appropriate settings of these two values different.
In the example of fig. 16, after Node Elimination, Edge Elimination, and Merge Elimination are performed in sequence, the graph is contracted into a final graph with only one node. After the optimal strategy is selected in the final graph, strategy setting is further performed on a graph with two nodes and one edge, because the stack top element stores the two-node, one-edge graph involved in Merge Elimination together with one node of the final graph; the process continues in this way until the original graph is restored, so that the strategy of every operator is set. A sketch follows.
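A hedged sketch of the final selection and the stack-based recovery; the subgraph layout and the compatibility filter are illustrative assumptions:

```python
# Overhead model: f = alpha * memory + beta * communication.
def pick_best(strategies, alpha, beta):
    return min(strategies, key=lambda s: alpha * s["memory"] + beta * s["comm"])

def recover_strategies(final_node, elimination_stack, alpha, beta):
    chosen = {final_node["name"]: pick_best(final_node["strategies"], alpha, beta)}
    while elimination_stack:                 # undo eliminations in reverse order
        subgraph = elimination_stack.pop()   # e.g. the two-node, one-edge graph
        for node in subgraph["nodes"]:       # saved by a Merge Elimination step
            if node["name"] not in chosen:
                # restrict to strategies consistent with already-set neighbours
                candidates = [s for s in node["strategies"]
                              if s.get("compatible", True)]
                chosen[node["name"]] = pick_best(candidates, alpha, beta)
    return chosen                            # a strategy for every operator
```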
As shown in fig. 17, a small network composed of a MatMul operator, two Mul operators, a OneHot operator, and an Add operator is taken as an example. The network is a substructure of a ReID neural network; ReID (person re-identification) is a technology for determining whether a specific pedestrian is present in an image or a video sequence.
The dynamic programming algorithm supporting rearrangement proposed in this application is explained on this small network; the specific flow is as follows:
First, the substructure on which Star Elimination can be performed is determined, so the OneHot operator is contracted into MatMul, and the processing-method overhead values of the MatMul operator are updated in the manner described for fig. 15. Next, two groups of edges on which Edge Elimination can be performed are determined: E1 and E2 are combined into a new edge, and E3 and E4 are combined into another new edge; the cost of each new edge is the sum of the corresponding edges under the same strategy. Then, two structures on which Node Elimination can be performed are determined, and the two Mul operators are contracted respectively; the candidate strategies of the new edge E5 include all candidate strategies of the left Mul operator, and similarly the candidate strategies of the new edge E6 include all candidate strategies of the right Mul operator. Next, a structure on which Edge Elimination can be performed is determined, so E5 and E6 can be combined into a new edge; again, the cost of the new edge is the sum of the corresponding edges under the same strategy. Finally, a structure on which Merge Elimination can be performed is determined, and the MatMul operator is contracted into the Add operator; during contraction, all strategies in the MatMul operator and in the edge are traversed, and the corresponding cost values are added, as in the Merge Elimination cost calculation of fig. 13.
In the embodiment of the present application, the neural network model may be used to process image data, audio data, video data, or text data; the neural network model comprises operators with input tensors of the image data, the audio data, the video data or the text data; or the input tensor of the operator included in the neural network model is obtained by processing the image data, the audio data, the video data or the text data by using at least one operator included in the neural network model.
In the embodiment of the application, several graph operations are newly added, which solves the strategy-search problem for complex graph structures; a rearrangement strategy is newly added, which solves the problem that the optimal strategy could not be searched; and the design of the objective function is added, which solves the problem that the obtained strategy could not shorten the end-to-end training time.
The embodiment of the application provides a data processing method, which comprises the following steps: acquiring a dataflow graph of a neural network model, wherein the dataflow graph comprises a first operator and a second operator of the neural network model; and determining a processing method corresponding to the neural network model according to the dataflow graph, wherein the processing method comprises a first splitting method for splitting an input tensor of the first operator, a second splitting method for splitting an input tensor of the second operator, and a rearrangement method for rearranging tensors between the first operator and the second operator. The first splitting method is used for splitting the input tensor of the first operator to obtain M first sub-tensors; the rearrangement method is used for rearranging M second sub-tensors to obtain a first tensor, where the M second sub-tensors are the outputs obtained by processing the M first sub-tensors with the first operator, the first tensor is an input tensor of the second operator, and M is a positive integer greater than 1. Because the rearrangement method is defined, the operators before and after it may use different splitting methods for their input tensors. This makes it possible to convert model parallelism into data parallelism, or to mix data parallelism with model parallelism, so that the strategy search space becomes larger. The number of parameters in each dimension of a sub-tensor produced by an operator acting on a split tensor may be far smaller than that of the tensor the operator would produce on the unsplit tensor, so when a subsequent operator consumes such a sub-tensor directly, the choice of splitting methods for its input tensor is very limited. In this embodiment, the sub-tensors produced by the operator are rearranged, which is equivalent to restoring the tensor the operator would have produced on the unsplit input; at this point, the subsequent operator has far more splitting methods to choose from for its input tensor than it would for the raw sub-tensors. The definition of rearrangement therefore gives the splitting method of an operator's input tensor more choices, so that different input tensors can be split with different splitting methods, that is, each with a splitting method suitable for itself. This helps to improve the processing efficiency of the dataflow graph and also to reduce the overhead of the devices computing it: a dataflow graph often includes different operators, the suitable splitting methods for the input data of different operators may differ, and only when an operator can split its input data with the method most suitable for it can its calculation efficiency be guaranteed.
In addition, three graph contraction operations are added to support strategy search for more complex neural networks: the Merge Elimination, Contract Elimination, and Star Elimination operations are added, and the calculation method of the contracted overhead values is given.
In addition, the method and the apparatus realize the calculation of memory cost and communication cost: given a processing method, the memory overhead and communication overhead consumed by operators and by rearrangement can be calculated.
In addition, after the processing methods are generated, the processing methods can be arranged according to the non-decreasing sequence of the memory occupation amount, and the processing methods are selected according to the fixed step length. The selection method can effectively reduce the strategy searching time on the premise of keeping the precision to a certain degree.
In addition, the definition of the overhead value of the application simultaneously comprises memory overhead and communication overhead. The optimized memory overhead and communication overhead can be modeled.
Referring to fig. 18, fig. 18 is a flowchart illustrating a data processing method according to an embodiment of the present application, and as shown in fig. 18, the data processing method according to the embodiment of the present application includes:
1801. the method comprises the steps of obtaining a neural network model and a processing method corresponding to the neural network model, wherein the neural network model comprises a first operator and a second operator, the processing method comprises a first splitting method for splitting an input tensor of the first operator, a second splitting method for splitting an input tensor of the second operator and a rearranging method for rearranging tensors between the first operator and the second operator.
In the embodiment of the application, a user submits a written script program to the framework, and after front-end parsing the framework obtains a dataflow graph composed of operators; after parallel-strategy search over the dataflow graph, each operator is marked with a parallel strategy; the graph partitioning process splits the whole dataflow graph according to the parallel strategies and assigns one computation subgraph to each device; the subgraphs are issued to the computing devices for execution, and each subgraph can include the processing method corresponding to the neural network model.
1802. Segmenting the input tensor of the first operator through the first segmentation method to obtain M first sub-tensors, wherein the M first sub-tensors comprise a first target sub-tensor.
The detailed description of step 1802 may refer to the description related to the first dividing method in the above embodiments, and is not repeated here.
1803. Processing the first target sub-tensor using the first operator to obtain a second target sub-tensor.
1804. Receiving at least one second sub tensor sent by at least one arithmetic device; each second sub-tensor is obtained by performing operation corresponding to the first operator on one first sub-tensor except the first target sub-tensor in the M first sub-tensors through corresponding operation equipment.
1805. Splicing the second target sub-tensor and the at least one second sub-tensor by the rearrangement method to obtain a first tensor.
for the detailed description of step 1805, reference may be made to the description related to the rearrangement method in the foregoing embodiment, and details are not described here.
1806. Slicing the first tensor by the second splitting method.
For the detailed description of step 1806, reference may be made to the description related to the second segmentation method in the foregoing embodiment, and details are not described here.
Optionally, each of the M second sub-tensors is obtained by transmitting a corresponding first sub-tensor of the M first sub-tensors to a corresponding operation device, and performing an operation corresponding to the first operator on the corresponding operation device.
Optionally, the tensor rearrangement method is configured to obtain the first tensor by transmitting the M second sub-tensors to the same operation device for rearrangement.
Optionally, the result obtained by performing the operation corresponding to the first operator on one operation device by using the input tensor of the first operator is the same as the first tensor.
Optionally, the second sub-tensor is a P-dimension tensor, a sub-tensor obtained by segmenting the first tensor by the second segmenting method is a Q-dimension tensor, and P is not an integral multiple of Q.
Optionally, the neural network model is used for processing image data, audio data, video data or text data; the neural network model comprises operators with input tensors of the image data, the audio data, the video data or the text data; or the input tensor of the operator included in the neural network model is obtained by processing the image data, the audio data, the video data or the text data by using at least one operator included in the neural network model.
The embodiment of the application provides a data processing method, which comprises the following steps: acquiring a neural network model and a processing method corresponding to the neural network model, wherein the neural network model comprises a first operator and a second operator, and the processing method comprises a first splitting method for splitting an input tensor of the first operator, a second splitting method for splitting an input tensor of the second operator, and a rearrangement method for rearranging tensors between the first operator and the second operator; splitting the input tensor of the first operator by the first splitting method to obtain M first sub-tensors; splicing, by the rearrangement method, M second sub-tensors obtained by performing the operation corresponding to the first operator on the M first sub-tensors, to obtain a first tensor, M being a positive integer greater than 1; and slicing the first tensor by the second splitting method. In this way, after the operation corresponding to the operator is performed on the split tensors, the tensors can be rearranged once, so that the splitting method for the input tensor of the subsequent operator has more choices.
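A minimal end-to-end sketch of steps 1801 to 1806 from the point of view of one device, with numpy stand-ins for the operators; receive_from_peers is a hypothetical communication primitive, and the split counts and axes are illustrative assumptions:

```python
import numpy as np

def data_processing(x, first_op, my_rank, receive_from_peers):
    first_subs = np.split(x, 4, axis=0)         # 1802: first splitting method
    my_result = first_op(first_subs[my_rank])   # 1803: local operation
    peers = receive_from_peers()                # 1804: peers' second sub-tensors
    ordered = peers[:my_rank] + [my_result] + peers[my_rank:]
    first_tensor = np.concatenate(ordered, 0)   # 1805: rearrangement (splicing)
    second_subs = np.split(first_tensor, 2, 1)  # 1806: second splitting method
    return second_subs                          # consumed by the second operator
```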
Referring to fig. 19, fig. 19 is a flowchart illustrating a data processing method according to an embodiment of the present application, and as shown in fig. 19, the data processing method according to the embodiment of the present application includes:
1901. acquiring a neural network model and a processing method corresponding to the neural network model, wherein the neural network model comprises a first operator and a second operator; the processing method comprises a first slicing method for slicing the input tensor of the first operator and a second slicing method for slicing the input tensor of the second operator;
In the embodiment of the application, a user submits a written script program to the framework, and after front-end parsing the framework obtains a dataflow graph composed of operators; after parallel-strategy search over the dataflow graph, each operator is marked with a parallel strategy; the graph partitioning process splits the whole dataflow graph according to the parallel strategies and assigns one computation subgraph to each device; the subgraphs are issued to the computing devices for execution, and each subgraph can include the processing method corresponding to the neural network model.
1902. Segmenting the input tensor of the first operator by the first segmentation method to obtain M first sub-tensors, wherein the M first sub-tensors comprise a first target sub-tensor;
the detailed description of step 1902 may refer to the description related to the first dividing method in the above embodiments, and will not be described herein.
1903. Performing operation corresponding to the first operator on the first target sub-tensor to obtain a second target sub-tensor;
1904. sending the second target sub-tensor to an arithmetic device;
1905. Receiving a first tensor sent by the operation device, wherein the first tensor is obtained by the operation device by splicing a plurality of second sub-tensors through a rearrangement method, and the plurality of second sub-tensors include the second target sub-tensor.
1906. Slicing the first tensor by the second splitting method.
For a detailed description of step 1906, reference may be made to the description related to the second segmentation method in the foregoing embodiment, and details are not repeated here.
Optionally, the result obtained by performing the operation corresponding to the first operator on one operation device by using the input tensor of the first operator is the same as the first tensor.
Optionally, the second sub-tensor is a P-dimension tensor, a sub-tensor obtained by segmenting the first tensor by the second segmenting method is a Q-dimension tensor, and P is not an integral multiple of Q.
Optionally, the neural network model is used for processing image data, audio data, video data or text data; the neural network model comprises operators with input tensors of the image data, the audio data, the video data or the text data; or the input tensor of the operator included in the neural network model is obtained by processing the image data, the audio data, the video data or the text data by using at least one operator included in the neural network model.
The embodiment of the application provides a data processing method, which comprises the following steps: acquiring a neural network model and a processing method corresponding to the neural network model, wherein the neural network model comprises a first operator and a second operator, and the processing method comprises a first splitting method for splitting an input tensor of the first operator and a second splitting method for splitting an input tensor of the second operator; splitting the input tensor of the first operator by the first splitting method to obtain M first sub-tensors, the M first sub-tensors including a first target sub-tensor; processing the first target sub-tensor using the first operator to obtain a second target sub-tensor; sending the second target sub-tensor to an operation device; receiving a first tensor sent by the operation device, wherein the first tensor is obtained by the operation device by rearranging a plurality of second sub-tensors through a rearrangement method, the plurality of second sub-tensors including the second target sub-tensor; and slicing the first tensor by the second splitting method. In this way, after the operation corresponding to the operator is performed on the split tensors, the tensors can be rearranged once, so that the splitting method for the input tensor of the subsequent operator has more choices.
Referring to fig. 20, fig. 20 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application, and as shown in fig. 20, a data processing apparatus 2000 according to an embodiment of the present application includes:
an obtaining module 2001, configured to obtain a dataflow graph of a neural network model, where the dataflow graph includes a first operator and a second operator of the neural network model;
a determining module 2002, configured to determine, according to the dataflow graph, a processing method corresponding to the neural network model, where the processing method includes a first splitting method for splitting an input tensor of the first operator, a second splitting method for splitting an input tensor of the second operator, and a rearranging method for rearranging tensors between the first operator and the second operator;
the first splitting method is used for splitting an input tensor of the first operator to obtain M first sub-tensors, the rearranging method is used for rearranging M second sub-tensors to obtain a first tensor, the M second sub-tensors are outputs obtained by processing the M first sub-tensors through the first operator, and the first tensor is an input tensor of the second operator, wherein M is a positive integer greater than 1.
Optionally, each of the M second sub-tensors is obtained by transmitting a corresponding first sub-tensor of the M first sub-tensors to a corresponding operation device, and performing an operation corresponding to the first operator on the corresponding operation device.
Optionally, the tensor redistribution method is configured to obtain the first tensor by transmitting the M second sub-tensors to the same operation device for splicing.
Optionally, a result obtained by performing the operation corresponding to the first operator on the input tensor of the first operator on one operation device is the same as the first tensor.
Optionally, the second sub-tensor is a P-dimension tensor, a sub-tensor obtained by segmenting the first tensor by the second segmenting method is a Q-dimension tensor, and P is not an integral multiple of Q.
Optionally, the neural network model includes N operators, where the N operators include the first operator and the second operator, and the determining module is specifically configured to:
determining a plurality of candidate processing methods corresponding to the neural network model according to the data flow graph; each of the plurality of candidate processing methods includes a tensor splitting method for splitting an input tensor of each of the N operators and a rearrangement method for rearranging tensors between at least two of the N operators, where the tensor splitting method is used to split the input tensor of the corresponding operator, and the tensor rearrangement method is used to transmit an operation result obtained after an operation corresponding to the operator is performed to an operation device and then rearrange the operation result;
determining the processing method from the plurality of candidate processing methods according to an overhead value corresponding to each candidate processing method in the plurality of candidate processing methods, where the overhead value includes a memory overhead value corresponding to a tensor splitting method of each operator in the N operators and a first communication overhead value corresponding to a tensor rearrangement method included in each candidate processing method, the memory overhead value represents a memory overhead generated when an operation corresponding to the operator is performed, the first communication overhead value represents a communication overhead generated in a process of transmitting the operation result to one operation device, and the processing method is the candidate processing method with the smallest overhead value in the at least one candidate processing method.
Optionally, the N operators include a third operator, each candidate processing method includes a target splitting method for splitting an input tensor of the third operator, the third operator is configured to perform a first operation on a sub-tensor obtained by splitting the input tensor of the third operator by the target splitting method, and transmit an operation result obtained by the operation to an operation device for a second operation, the overhead value further includes a second communication overhead, and the second communication overhead represents a communication overhead generated in a process of transmitting the operation result to the operation device when the third operator is operated.
Optionally, the memory overhead value is related to the number of parameters and the type of parameters included in the M first sub-tensors.
Optionally, the first communication overhead value is related to the number of parameters and the type of parameters included in the M second sub-tensors.
Optionally, the second communication overhead value is related to the number of parameters and the type of parameters included in the M second sub-tensors.
Optionally, the overhead value is an average value of the memory overhead value, the first communication overhead value, and the second communication overhead.
Optionally, the determining module is further configured to:
determining a plurality of processing methods corresponding to the neural network model according to the data flow graph;
determining the plurality of candidate processing methods from the plurality of processing methods according to the memory overhead value or the first communication overhead value corresponding to each of the plurality of processing methods, wherein the plurality of candidate processing methods are part of the plurality of processing methods.
Optionally, the data flow graph further includes a third operator of the neural network model, an output tensor of the third operator being an input tensor of the second operator, the apparatus further includes:
and the merging module is used for merging the first operator to the second operator so as to enable the second segmentation method corresponding to the second operator to establish an association relationship with the first segmentation method and the tensor rearrangement method.
Optionally, the data flow graph further includes a fourth operator of the neural network model, an output tensor of the first operator is an input tensor of the fourth operator, and the apparatus further includes:
a merging module, configured to merge the second operator to the first operator, so that the first splitting method corresponding to the first operator establishes an association relationship with the tensor redistribution method and the second splitting method.
Optionally, the data flow graph further includes a fifth operator and a sixth operator of the neural network model, wherein the first tensor is the first input tensor of the second operator and the first input tensor of the fifth operator, and the output tensor of the sixth operator is the second input tensor of the second operator and the second input tensor of the fifth operator, and the obtaining module is configured to:
obtaining a third splitting method for splitting the input tensor of the sixth operator, a fourth splitting method for splitting the second input tensor of the second operator, and a first tensor rearrangement method for rearranging tensors between the fourth operator and the second operator; the third splitting method is configured to split the input tensor of the sixth operator to obtain a plurality of third sub-tensors, the first tensor rearrangement method is configured to rearrange a plurality of fourth sub-tensors, obtained by performing the operation corresponding to the sixth operator on the plurality of third sub-tensors, to obtain a third tensor, and the fourth splitting method is configured to split the third tensor;
the device further comprises:
a merging module, configured to merge the sixth operator into the first operator, so that a first association relationship is established between the first splitting method corresponding to the first operator and the third splitting method, the fourth splitting method, and the first tensor rearrangement method.
Optionally, the apparatus further comprises:
a merging module, configured to merge the sixth operator to the first operator, so that a second association relationship is established between the first splitting method corresponding to the first operator and the third splitting method as well as the second splitting method.
Optionally, the neural network model is used for processing image data, audio data, video data or text data;
the neural network model comprises operators with input tensors of the image data, the audio data, the video data or the text data; or,
the input tensors of the operators included in the neural network model are obtained by processing the image data, the audio data, the video data or the text data by using at least one operator included in the neural network model.
Referring to fig. 21, fig. 21 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application, and as shown in fig. 21, a data processing apparatus 2100 according to an embodiment of the present application includes:
an obtaining module 2101, configured to obtain a neural network model and a processing method corresponding to the neural network model, where the neural network model includes a first operator and a second operator; the processing method includes a first splitting method for splitting an input tensor of the first operator, a second splitting method for splitting an input tensor of the second operator, and a rearranging method for rearranging tensors between the first operator and the second operator;
a first segmentation module 2102 configured to segment the input tensor of the first operator by the first segmentation method to obtain M first sub-tensors, where the M first sub-tensors include a first target sub-tensor;
an operation module 2103, configured to process the first target sub-tensor using the first operator to obtain a second target sub-tensor;
a receiving module 2104 for receiving at least one second sub-tensor sent by at least one computing device; each second sub-tensor is obtained by performing operation corresponding to the first operator on one first sub-tensor except the first target sub-tensor in the M first sub-tensors through corresponding operation equipment;
a rearranging module 2105, configured to rearrange the second target sub-tensor and the at least one second sub-tensor by the rearranging method to obtain a first tensor;
a second segmentation module 2106, configured to slice the first tensor by the second splitting method.
Optionally, each of the M second sub-tensors is obtained by transmitting a corresponding first sub-tensor of the M first sub-tensors to a corresponding operation device, and performing an operation corresponding to the first operator on the corresponding operation device.
Optionally, the tensor rearrangement method is configured to obtain the first tensor by transmitting the M second sub-tensors to the same operation device for rearrangement.
Optionally, the result obtained by performing the operation corresponding to the first operator on one operation device by using the input tensor of the first operator is the same as the first tensor.
Optionally, the second sub-tensor is a P-dimension tensor, a sub-tensor obtained by segmenting the first tensor by the second segmenting method is a Q-dimension tensor, and P is not an integral multiple of Q.
Optionally, the neural network model is used for processing image data, audio data, video data or text data;
the neural network model comprises operators with input tensors of the image data, the audio data, the video data or the text data; or,
the input tensors of the operators included in the neural network model are obtained by processing the image data, the audio data, the video data or the text data by using at least one operator included in the neural network model.
Referring to fig. 22, fig. 22 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application, and as shown in fig. 22, a data processing apparatus 2200 according to an embodiment of the present application includes:
an obtaining module 2201, configured to obtain a neural network model and a processing method corresponding to the neural network model, where the neural network model includes a first operator and a second operator; the processing method comprises a first slicing method for slicing the input tensor of the first operator and a second slicing method for slicing the input tensor of the second operator;
a first segmentation module 2202, configured to segment the input tensor of the first operator by the first segmentation method to obtain M first sub-tensors, where the M first sub-tensors include a first target sub-tensor;
an operation module 2203, configured to process the first target sub-tensor using the first operator to obtain a second target sub-tensor;
a sending module 2204, configured to send the second target sub-tensor to an arithmetic device;
a receiving module 2205, configured to receive a first tensor sent by the operation device, where the first tensor is obtained by rearranging, by the operation device, a plurality of second sub-tensors by using a rearrangement method, where the plurality of second sub-tensors include the second target sub-tensor;
a second segmentation module 2206, configured to slice the first tensor by the second splitting method.
Optionally, the result obtained by performing the operation corresponding to the first operator on one operation device by using the input tensor of the first operator is the same as the first tensor.
Optionally, the second sub-tensor is a P-dimension tensor, a sub-tensor obtained by segmenting the first tensor by the second segmenting method is a Q-dimension tensor, and P is not an integral multiple of Q.
Optionally, the neural network model is used for processing image data, audio data, video data or text data; the neural network model comprises operators with input tensors of the image data, the audio data, the video data or the text data; or the input tensor of the operator included in the neural network model is obtained by processing the image data, the audio data, the video data or the text data by using at least one operator included in the neural network model.
Referring to fig. 23, fig. 23 is a schematic structural diagram of an execution device provided in the embodiment of the present application, and the execution device 2300 may be embodied as a mobile phone, a tablet, a notebook computer, an intelligent wearable device, a server, and the like, which is not limited herein. Specifically, the execution apparatus 2300 includes: a receiver 2301, a transmitter 2302, a processor 2303 and a memory 2304 (wherein the number of processors 2303 in the execution device 2300 may be one or more, for example, one processor in fig. 23), wherein the processor 2303 may include an application processor 23031 and a communication processor 23032. In some embodiments of the application, the receiver 2301, the transmitter 2302, the processor 2303 and the memory 2304 may be connected by a bus or other means.
The memory 2304 may include read-only memory and random access memory, and provides instructions and data to the processor 2303. A portion of the memory 2304 may also include non-volatile random access memory (NVRAM). The memory 2304 stores operating instructions, executable modules or data structures, or a subset or an extended set thereof, where the operating instructions may include various operating instructions for implementing various operations.
The processor 2303 controls the operation of the execution apparatus. In a particular application, the various components of the execution device are coupled together by a bus system that may include a power bus, a control bus, a status signal bus, etc., in addition to a data bus. For clarity of illustration, the various buses are referred to in the figures as a bus system.
The methods disclosed in the embodiments of the present application may be implemented in the processor 2303 or implemented by the processor 2303. The processor 2303 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 2303. The processor 2303 may be a general-purpose processor, a Digital Signal Processor (DSP), a microprocessor or a microcontroller, and may further include an Application Specific Integrated Circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic device, or discrete hardware components. The processor 2303 may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in ram, flash memory, rom, prom, or eprom, registers, etc. storage media as is well known in the art. The storage medium is located in the memory 2304, and the processor 2303 reads information in the memory 2304 and completes the steps of the method in combination with hardware thereof.
The receiver 2301 may be used to receive input numeric or character information and generate signal inputs related to performing device related settings and function control. The transmitter 2302 may be used to output numeric or character information through a first interface; the transmitter 2302 may also be used to send instructions to the disk groups through the first interface to modify data in the disk groups; the transmitter 2302 may also include a display screen or the like.
In the embodiment of the present application, the processor 2303 is configured to execute the data processing method in the above embodiment.
Referring to fig. 24, fig. 24 is a schematic structural diagram of a training device provided in the embodiment of the present application. Specifically, the training device 2400 is implemented by one or more servers and may vary considerably depending on configuration or performance; it may include one or more central processing units (CPUs) 2424 (e.g., one or more processors), a memory 2432, and one or more storage media 2430 (e.g., one or more mass storage devices) storing an application 2442 or data 2444. The memory 2432 and the storage medium 2430 may provide transient or persistent storage. The program stored in the storage medium 2430 may include one or more modules (not shown), each of which may include a series of instruction operations on the training device. Still further, the central processing unit 2424 may be configured to communicate with the storage medium 2430 to perform, on the training device 2400, the series of instruction operations stored in the storage medium 2430.
In the embodiment of the present application, the central processing unit 2424 is configured to perform the steps related to the training method in the foregoing embodiment.
Embodiments of the present application also provide a computer program product, which when executed on a computer causes the computer to perform the steps performed by the aforementioned execution device, or causes the computer to perform the steps performed by the aforementioned training device.
Also provided in an embodiment of the present application is a computer-readable storage medium, in which a program for signal processing is stored, and when the program is run on a computer, the program causes the computer to execute the steps executed by the aforementioned execution device, or causes the computer to execute the steps executed by the aforementioned training device.
The execution device, the training device, or the terminal device provided in the embodiment of the present application may specifically be a chip, where the chip includes: a processing unit, which may be for example a processor, and a communication unit, which may be for example an input/output interface, a pin or a circuit, etc. The processing unit may execute the computer execution instructions stored by the storage unit to cause the chip in the execution device to execute the data processing method described in the above embodiment, or to cause the chip in the training device to execute the data processing method described in the above embodiment. Optionally, the storage unit is a storage unit in the chip, such as a register, a cache, and the like, and the storage unit may also be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a Random Access Memory (RAM), and the like.
Specifically, referring to fig. 25, fig. 25 is a schematic structural diagram of a chip provided in the embodiment of the present application, where the chip may be represented as a neural network processor NPU 2500, and the NPU 2500 is mounted on a main CPU (Host CPU) as a coprocessor, and the Host CPU allocates tasks. The core portion of the NPU is an arithmetic circuit 2503, and the controller 2504 controls the arithmetic circuit 2503 to extract matrix data in the memory and perform multiplication.
In some implementations, the arithmetic circuit 2503 internally includes a plurality of processing units (PEs). In some implementations, the operational circuit 2503 is a two-dimensional systolic array. The arithmetic circuit 2503 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuitry 2503 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to the matrix B from the weight memory 2502 and buffers it in each PE in the arithmetic circuit. The arithmetic circuit takes the matrix a data from the input memory 2501 and performs matrix operation with the matrix B, and partial or final results of the obtained matrix are stored in an accumulator (accumulator) 2508.
The unified memory 2506 is used for storing input data and output data. The weight data is transferred directly to the weight memory 2502 through a Direct Memory Access Controller (DMAC) 2505. The input data is also carried into the unified memory 2506 via the DMAC.
The bus interface unit (BIU) 2510 is used for interaction between the AXI bus, the DMAC, and the instruction fetch buffer (IFB) 2509; it is used by the instruction fetch memory 2509 to obtain instructions from an external memory, and by the memory access controller 2505 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly used to transfer input data in the external memory DDR to the unified memory 2506, or transfer weight data to the weight memory 2502, or transfer input data to the input memory 2501.
The vector calculation unit 2507 includes a plurality of operation processing units and, if necessary, performs further processing on the output of the operation circuit 2503, such as vector multiplication, vector addition, exponential operation, logarithmic operation, and magnitude comparison. It is mainly used for non-convolutional/fully connected layer computation in the neural network, such as batch normalization, pixel-level summation, and up-sampling of feature planes.
In some implementations, the vector calculation unit 2507 can store the processed output vector to the unified memory 2506. For example, the vector calculation unit 2507 may apply a linear function or a nonlinear function to the output of the operation circuit 2503, such as linearly interpolating the feature planes extracted by the convolutional layers, or accumulating vectors of values to generate activation values. In some implementations, the vector calculation unit 2507 generates normalized values, pixel-level summed values, or both. In some implementations, the processed output vector can be used as an activation input to the operation circuit 2503, for example for use in subsequent layers of the neural network.
An instruction fetch buffer 2509 connected to the controller 2504, configured to store instructions used by the controller 2504;
the unified memory 2506, the input memory 2501, the weight memory 2502, and the instruction fetch memory 2509 are all On-Chip memories. The external memory is private to the NPU hardware architecture.
The processor mentioned in any of the above may be a general purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the above programs.
It should be noted that the above-described embodiments of the apparatus are merely schematic, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. In addition, in the drawings of the embodiments of the apparatus provided in the present application, the connection relationship between the modules indicates that there is a communication connection therebetween, and may be implemented as one or more communication buses or signal lines.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus necessary general-purpose hardware, and certainly also by special-purpose hardware including application-specific integrated circuits, special-purpose CPUs, special-purpose memories, special-purpose components, and the like. Generally, functions performed by a computer program can easily be implemented by corresponding hardware, and the specific hardware structures implementing the same function may be various, such as analog circuits, digital circuits, or dedicated circuits. For the present application, however, a software implementation is preferable in most cases. Based on such understanding, the technical solutions of the present application may be substantially embodied in the form of a software product, which is stored in a readable storage medium, such as a floppy disk, a USB drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a training device, or a network device) to execute the methods according to the embodiments of the present application.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium accessible to a computer, or a data storage device, such as a training device or a data center, integrating one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)), among others.
Claims (36)
1. A method of data processing, the method comprising:
acquiring a data flow diagram of a neural network model, wherein the data flow diagram comprises a first operator and a second operator of the neural network model;
determining a processing method corresponding to the neural network model according to the dataflow graph, wherein the processing method comprises a first splitting method for splitting an input tensor of the first operator, a second splitting method for splitting an input tensor of the second operator, and a rearranging method for rearranging tensors between the first operator and the second operator;
the first splitting method is used for splitting an input tensor of the first operator to obtain M first sub-tensors, the rearranging method is used for rearranging M second sub-tensors to obtain a first tensor, the M second sub-tensors are outputs obtained by processing the M first sub-tensors through the first operator, the first tensor is an input tensor of the second operator, and M is a positive integer greater than 1.
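For illustration, the splitting and rearrangement scheme of claim 1 can be sketched in NumPy as follows. This is a minimal sketch under assumptions: the operator is element-wise (ReLU as a placeholder), the split is along axis 0, and the axis length divides evenly by M; none of these names or choices come from the patent itself.

```python
import numpy as np

def first_operator(sub_tensor: np.ndarray) -> np.ndarray:
    """Stand-in for the first operator; ReLU is used as a placeholder."""
    return np.maximum(sub_tensor, 0.0)

M = 4                                  # number of sub-tensors, M > 1
input_tensor = np.random.rand(8, 16)   # input tensor of the first operator

# First splitting method: split the input tensor into M first sub-tensors.
first_sub_tensors = np.split(input_tensor, M, axis=0)

# Processing the M first sub-tensors through the first operator
# (e.g., one per operation device) yields M second sub-tensors.
second_sub_tensors = [first_operator(t) for t in first_sub_tensors]

# Rearranging method: assemble the M second sub-tensors into the first
# tensor, which then serves as the input tensor of the second operator.
first_tensor = np.concatenate(second_sub_tensors, axis=0)

# For an element-wise operator this equals the unsplit computation, which
# is the situation described in claim 2 below.
assert np.array_equal(first_tensor, first_operator(input_tensor))
```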
2. The method of claim 1, wherein the input tensor of the first operator is the same as the first tensor when the operation corresponding to the first operator is performed on one operation device.
3. The method according to claim 1 or 2, wherein the second sub-tensor is a P-dimensional tensor, a sub-tensor obtained by splitting the first tensor by the second splitting method is a Q-dimensional tensor, and P is not an integral multiple of Q.
4. The method according to any one of claims 1 to 3, wherein the neural network model includes N operators, the N operators including the first operator and the second operator, and determining the processing method corresponding to the neural network model according to the dataflow graph includes:
determining a plurality of candidate processing methods corresponding to the neural network model according to the data flow graph; each of the plurality of candidate processing methods includes a tensor splitting method for splitting an input tensor of each of the N operators and a rearrangement method for rearranging tensors between at least two of the N operators, where the tensor splitting method is used to split the input tensor of the corresponding operator, and the tensor rearrangement method is used to transmit an operation result obtained after an operation corresponding to the operator is performed to an operation device and then rearrange the operation result;
determining the processing method from the plurality of candidate processing methods according to an overhead value corresponding to each candidate processing method in the plurality of candidate processing methods, where the overhead value includes a memory overhead value corresponding to the tensor splitting method of each operator in the N operators and a first communication overhead value corresponding to the tensor rearrangement method included in each candidate processing method, the memory overhead value represents a memory overhead generated when the operation corresponding to the operator is performed, the first communication overhead value represents a communication overhead generated in the process of transmitting the operation result to one operation device, and the processing method is the candidate processing method with the smallest overhead value among the plurality of candidate processing methods.
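For illustration, the selection step of claim 4 amounts to minimizing a combined overhead value over the candidate processing methods. A minimal sketch, assuming a simple additive cost model; the CandidateMethod structure and its cost fields are hypothetical stand-ins, not the patented cost formulas:

```python
from dataclasses import dataclass

@dataclass
class CandidateMethod:
    name: str
    memory_costs: list   # memory overhead value per operator's splitting method
    comm_costs: list     # first communication overhead per tensor rearrangement

def overhead_value(candidate: CandidateMethod) -> float:
    # Overhead value = the memory overheads of all operators' splitting
    # methods plus the communication overheads of all rearrangements.
    return sum(candidate.memory_costs) + sum(candidate.comm_costs)

def choose_processing_method(candidates: list) -> CandidateMethod:
    # The processing method is the candidate with the smallest overhead value.
    return min(candidates, key=overhead_value)

candidates = [
    CandidateMethod("split-by-batch", [4.0, 4.0], [1.5]),    # total 9.5
    CandidateMethod("split-by-channel", [3.0, 5.0], [0.5]),  # total 8.5
]
best = choose_processing_method(candidates)  # selects "split-by-channel"
```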
5. The method according to claim 4, wherein the N operators include a third operator, each candidate processing method includes a target splitting method for splitting an input tensor of the third operator, the third operator is configured to perform a first operation on a sub-tensor obtained by splitting the input tensor of the third operator by the target splitting method, and to transmit the operation result obtained by the first operation to one operation device for a second operation, and the overhead value further includes a second communication overhead value representing a communication overhead generated in the process of transmitting the operation result to the operation device when the third operator is operated.
6. The method of any one of claims 1 to 5, wherein the dataflow graph further includes a third operator of the neural network model, an output tensor of the third operator being an input tensor of the second operator, the method further comprising:
merging the first operator into the second operator, so that the second splitting method corresponding to the second operator establishes an association relationship with the first splitting method and the tensor rearrangement method.
7. The method of any one of claims 1 to 5, wherein the dataflow graph further includes a fourth operator of the neural network model, an output tensor of the first operator being an input tensor of the fourth operator, the method further comprising:
merging the second operator into the first operator, so that the first splitting method corresponding to the first operator establishes an association relationship with the tensor rearrangement method and the second splitting method.
8. The method of any of claims 1 to 5, wherein the dataflow graph further includes a fifth operator and a sixth operator of the neural network model, wherein the first tensor is a first input tensor for the second operator and a first input tensor for the fifth operator, and wherein an output tensor for the sixth operator is a second input tensor for the second operator and a second input tensor for the fifth operator, the method further comprising:
obtaining a third splitting method for splitting an input tensor of the sixth operator, a fourth splitting method for splitting the second input tensor of the second operator, and a first tensor rearrangement method for rearranging tensors between the sixth operator and the second operator; the third splitting method is used to split the input tensor of the sixth operator to obtain a plurality of third sub-tensors, the first tensor rearrangement method is used to rearrange a plurality of fourth sub-tensors, obtained by performing the operation corresponding to the sixth operator on the plurality of third sub-tensors, to obtain a third tensor, and the fourth splitting method is used to split the third tensor;
merging the sixth operator into the first operator, so that the first splitting method corresponding to the first operator establishes a first association relationship with the third splitting method, the fourth splitting method, and the first tensor rearrangement method.
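For illustration, the merging in claims 6 to 8 can be read as contracting one operator into a neighbour so that the neighbour's splitting method carries the eliminated operator's splitting and rearrangement methods with it, and a single choice during the search then fixes all of them together. A minimal sketch under assumed names; OperatorNode and absorb are hypothetical, not the patent's data structures:

```python
class OperatorNode:
    """One operator in the dataflow graph and the methods it owns."""
    def __init__(self, name: str):
        self.name = name
        self.methods = [f"{name}_splitting_method"]

    def absorb(self, other: "OperatorNode", rearrangement: str) -> None:
        # Merge `other` into this node: the tensor rearrangement method
        # between the two operators and all of `other`'s methods become
        # associated with this node's own splitting method.
        self.methods.append(rearrangement)
        self.methods.extend(other.methods)

# Claim 6: merge the first operator into the second operator.
second, first = OperatorNode("second"), OperatorNode("first")
second.absorb(first, "tensor_rearrangement_method")
# second.methods now couples the second splitting method with the
# tensor rearrangement method and the first splitting method.
```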
9. The method of any one of claims 1 to 8, wherein the neural network model is used to process image data, audio data, video data, or text data;
the neural network model comprises an operator whose input tensor is the image data, the audio data, the video data, or the text data; or,
the input tensors of the operators included in the neural network model are obtained by processing the image data, the audio data, the video data or the text data by using at least one operator included in the neural network model.
10. A method of data processing, the method comprising:
obtaining a neural network model and a splitting method corresponding to the neural network model, wherein the neural network model comprises a first operator and a second operator; the splitting method includes a first splitting method for splitting an input tensor of the first operator, a second splitting method for splitting an input tensor of the second operator, and a rearranging method for rearranging tensors between the first operator and the second operator;
splitting the input tensor of the first operator by the first splitting method to obtain M first sub-tensors, wherein the M first sub-tensors comprise a first target sub-tensor;
processing the first target sub-tensor using the first operator to obtain a second target sub-tensor;
receiving at least one second sub-tensor sent by at least one operation device; each second sub-tensor is obtained by a corresponding operation device performing the operation corresponding to the first operator on one of the M first sub-tensors other than the first target sub-tensor;
rearranging the second target sub-tensor and the at least one second sub-tensor by the rearranging method to obtain a first tensor;
and splitting the first tensor by the second splitting method.
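For illustration, the device-side flow of claim 10 (split, compute the local share, gather the peers' results, rearrange, re-split) can be sketched as follows. The communication helper recv_from and the rank-ordered reassembly are assumptions; the sketch also assumes the split axis divides evenly by M:

```python
import numpy as np

def first_stage(input_tensor, first_operator, second_splitting,
                recv_from, M: int, my_rank: int):
    # First splitting method: obtain M first sub-tensors; this device's
    # first target sub-tensor is the one at its own rank.
    first_subs = np.split(input_tensor, M, axis=0)
    second_target = first_operator(first_subs[my_rank])

    # Receive the remaining second sub-tensors from the other devices.
    received = [recv_from(rank) for rank in range(M) if rank != my_rank]

    # Rearranging method: reassemble in rank order to obtain the first tensor.
    pieces = received[:my_rank] + [second_target] + received[my_rank:]
    first_tensor = np.concatenate(pieces, axis=0)

    # Second splitting method: split the first tensor for the second operator.
    return second_splitting(first_tensor)
```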
11. The method of claim 10, wherein the input tensor of the first operator is the same as the first tensor when the operation corresponding to the first operator is performed on one operation device.
12. The method according to claim 10 or 11, wherein the second sub-tensor is a P-dimensional tensor, a sub-tensor obtained by splitting the first tensor by the second splitting method is a Q-dimensional tensor, and P is not an integral multiple of Q.
13. The method of any one of claims 10 to 12, wherein the neural network model is used to process image data, audio data, video data, or text data;
the neural network model comprises an operator whose input tensor is the image data, the audio data, the video data, or the text data; or,
the input tensors of the operators included in the neural network model are obtained by processing the image data, the audio data, the video data or the text data by using at least one operator included in the neural network model.
14. A method of data processing, the method comprising:
acquiring a neural network model and a processing method corresponding to the neural network model, wherein the neural network model comprises a first operator and a second operator; the processing method comprises a first splitting method for splitting the input tensor of the first operator and a second splitting method for splitting the input tensor of the second operator;
splitting the input tensor of the first operator by the first splitting method to obtain M first sub-tensors, wherein the M first sub-tensors comprise a first target sub-tensor;
processing the first target sub-tensor using the first operator to obtain a second target sub-tensor;
sending the second target sub-tensor to an operation device;
receiving a first tensor sent by the operation device, wherein the first tensor is obtained by the operation device rearranging a plurality of second sub-tensors through a rearranging method, and the plurality of second sub-tensors comprise the second target sub-tensor;
and splitting the first tensor by the second splitting method.
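For illustration, claim 14 differs from claim 10 in that the rearrangement is performed on another operation device: this device only sends its partial result and receives the assembled first tensor back. A minimal sketch with hypothetical send_to and recv_first_tensor placeholders (device 0 as the rearranging device is also an assumption):

```python
import numpy as np

def first_stage_remote_rearrange(input_tensor, first_operator, second_splitting,
                                 send_to, recv_first_tensor, M: int, my_rank: int):
    # First splitting method: obtain M first sub-tensors and process the
    # first target sub-tensor locally with the first operator.
    first_subs = np.split(input_tensor, M, axis=0)
    second_target = first_operator(first_subs[my_rank])

    # Send the second target sub-tensor to the device that rearranges.
    send_to(0, second_target)

    # Receive the first tensor, i.e. all second sub-tensors rearranged.
    first_tensor = recv_first_tensor()

    # Second splitting method: split the first tensor for the second operator.
    return second_splitting(first_tensor)
```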
15. The method of claim 14, wherein the input tensor of the first operator is the same as the first tensor when the operation corresponding to the first operator is performed on one operation device.
16. The method according to claim 14 or 15, wherein the second sub-tensor is a P-dimensional tensor, a sub-tensor obtained by splitting the first tensor by the second splitting method is a Q-dimensional tensor, and P is not an integral multiple of Q.
17. The method of any one of claims 14 to 16, wherein the neural network model is used to process image data, audio data, video data, or text data;
the neural network model comprises an operator whose input tensor is the image data, the audio data, the video data, or the text data; or,
the input tensors of the operators included in the neural network model are obtained by processing the image data, the audio data, the video data or the text data by using at least one operator included in the neural network model.
18. A data processing apparatus, characterized in that the apparatus comprises:
an obtaining module, configured to obtain a data flow graph of a neural network model, where the data flow graph includes a first operator and a second operator of the neural network model;
a determining module, configured to determine, according to the dataflow graph, a processing method corresponding to the neural network model, where the processing method includes a first splitting method for splitting an input tensor of the first operator, a second splitting method for splitting an input tensor of the second operator, and a rearranging method for rearranging tensors between the first operator and the second operator;
the first splitting method is used for splitting an input tensor of the first operator to obtain M first sub-tensors, the rearranging method is used for rearranging M second sub-tensors to obtain a first tensor, the M second sub-tensors are outputs obtained by processing the M first sub-tensors through the first operator, and the first tensor is an input tensor of the second operator, wherein M is a positive integer greater than 1.
19. The apparatus of claim 18, wherein the input tensor of the first operator is the same as the first tensor when the operation corresponding to the first operator is performed on one operation device.
20. The apparatus according to claim 18 or 19, wherein the second sub-tensor is a P-dimensional tensor, a sub-tensor obtained by splitting the first tensor by the second splitting method is a Q-dimensional tensor, and P is not an integral multiple of Q.
21. The apparatus according to any one of claims 18 to 20, wherein the neural network model comprises N operators, the N operators comprising the first operator and the second operator, and the determining module is specifically configured to:
determining a plurality of candidate processing methods corresponding to the neural network model according to the data flow graph; each of the plurality of candidate processing methods includes a tensor splitting method for splitting an input tensor of each of the N operators and a rearrangement method for rearranging tensors between at least two of the N operators, where the tensor splitting method is used to split the input tensor of the corresponding operator, and the tensor rearrangement method is used to transmit an operation result obtained after an operation corresponding to the operator is performed to an operation device and then rearrange the operation result;
determining the processing method from the plurality of candidate processing methods according to an overhead value corresponding to each candidate processing method in the plurality of candidate processing methods, where the overhead value includes a memory overhead value corresponding to the tensor splitting method of each operator in the N operators and a first communication overhead value corresponding to the tensor rearrangement method included in each candidate processing method, the memory overhead value represents a memory overhead generated when the operation corresponding to the operator is performed, the first communication overhead value represents a communication overhead generated in the process of transmitting the operation result to one operation device, and the processing method is the candidate processing method with the smallest overhead value among the plurality of candidate processing methods.
22. The apparatus according to claim 21, wherein the N operators include a third operator, each candidate processing method includes a target splitting method for splitting an input tensor of the third operator, the third operator is configured to perform a first operation on a sub-tensor obtained by splitting the input tensor of the third operator by the target splitting method, and to transmit the operation result obtained by the first operation to one operation device for a second operation, and the overhead value further includes a second communication overhead value representing a communication overhead generated in the process of transmitting the operation result to the operation device when the third operator is operated.
23. The apparatus of any one of claims 18 to 22, wherein the dataflow graph further includes a third operator of the neural network model, an output tensor of the third operator being an input tensor of the second operator, the apparatus further comprising:
a merging module, configured to merge the first operator into the second operator, so that the second splitting method corresponding to the second operator establishes an association relationship with the first splitting method and the tensor rearrangement method.
24. The apparatus of any of claims 18 to 22, wherein the dataflow graph further includes a fourth operator of the neural network model, an output tensor of the first operator is an input tensor of the fourth operator, the apparatus further comprising:
a merging module, configured to merge the second operator into the first operator, so that the first splitting method corresponding to the first operator establishes an association relationship with the tensor rearrangement method and the second splitting method.
25. The apparatus of any of claims 18 to 22, wherein the dataflow graph further includes a fifth operator and a sixth operator of the neural network model, wherein the first tensor is a first input tensor of the second operator and a first input tensor of the fifth operator, and an output tensor of the sixth operator is a second input tensor of the second operator and a second input tensor of the fifth operator, the obtaining module is configured to:
obtaining a third splitting method for splitting an input tensor of the sixth operator, a fourth splitting method for splitting the second input tensor of the second operator, and a first tensor rearrangement method for rearranging tensors between the sixth operator and the second operator; the third splitting method is used to split the input tensor of the sixth operator to obtain a plurality of third sub-tensors, the first tensor rearrangement method is used to rearrange a plurality of fourth sub-tensors, obtained by performing the operation corresponding to the sixth operator on the plurality of third sub-tensors, to obtain a third tensor, and the fourth splitting method is used to split the third tensor;
the apparatus further comprises:
a merging module, configured to merge the sixth operator into the first operator, so that the first splitting method corresponding to the first operator establishes a first association relationship with the third splitting method, the fourth splitting method, and the first tensor rearrangement method.
26. The apparatus of any one of claims 18 to 25, wherein the neural network model is configured to process image data, audio data, video data, or text data;
the neural network model comprises an operator whose input tensor is the image data, the audio data, the video data, or the text data; or,
the input tensors of the operators included in the neural network model are obtained by processing the image data, the audio data, the video data or the text data by using at least one operator included in the neural network model.
27. A data processing apparatus, characterized in that the apparatus comprises:
an obtaining module, configured to obtain a neural network model and a processing method corresponding to the neural network model, wherein the neural network model comprises a first operator and a second operator; the processing method includes a first splitting method for splitting an input tensor of the first operator, a second splitting method for splitting an input tensor of the second operator, and a rearrangement method for rearranging tensors between the first operator and the second operator;
a first splitting module, configured to split the input tensor of the first operator by the first splitting method to obtain M first sub-tensors, where the M first sub-tensors include a first target sub-tensor;
an operation module, configured to process the first target sub-tensor using the first operator to obtain a second target sub-tensor;
a receiving module, configured to receive at least one second sub-tensor sent by at least one operation device, where each second sub-tensor is obtained by a corresponding operation device performing the operation corresponding to the first operator on one of the M first sub-tensors other than the first target sub-tensor;
a rearrangement module, configured to rearrange the second target sub-tensor and the at least one second sub-tensor by the rearrangement method to obtain a first tensor;
and a second splitting module, configured to split the first tensor by the second splitting method.
28. The apparatus of claim 27, wherein the input tensor of the first operator is the same as the first tensor when the operation corresponding to the first operator is performed on one operation device.
29. The apparatus according to claim 27 or 28, wherein the second sub-tensor is a P-dimensional tensor, a sub-tensor obtained by splitting the first tensor by the second splitting method is a Q-dimensional tensor, and P is not an integral multiple of Q.
30. The apparatus of any one of claims 27 to 29, wherein the neural network model is configured to process image data, audio data, video data, or text data;
the neural network model comprises an operator whose input tensor is the image data, the audio data, the video data, or the text data; or,
the input tensors of the operators included in the neural network model are obtained by processing the image data, the audio data, the video data or the text data by using at least one operator included in the neural network model.
31. A data processing apparatus, characterized in that the apparatus comprises:
an obtaining module, configured to obtain a neural network model and a processing method corresponding to the neural network model, wherein the neural network model comprises a first operator and a second operator; the processing method comprises a first splitting method for splitting the input tensor of the first operator and a second splitting method for splitting the input tensor of the second operator;
a first splitting module, configured to split the input tensor of the first operator by the first splitting method to obtain M first sub-tensors, where the M first sub-tensors include a first target sub-tensor;
an operation module, configured to process the first target sub-tensor using the first operator to obtain a second target sub-tensor;
a sending module, configured to send the second target sub-tensor to an operation device;
a receiving module, configured to receive a first tensor sent by the operation device, where the first tensor is obtained by the operation device rearranging a plurality of second sub-tensors using a rearrangement method, and the plurality of second sub-tensors include the second target sub-tensor;
and a second splitting module, configured to split the first tensor by the second splitting method.
32. The apparatus of claim 31, wherein the input tensor of the first operator is the same as the first tensor when the operation corresponding to the first operator is performed on one operation device.
33. The apparatus according to claim 31 or 32, wherein the second sub-tensor is a P-dimensional tensor, a sub-tensor obtained by splitting the first tensor by the second splitting method is a Q-dimensional tensor, and P is not an integral multiple of Q.
34. The apparatus of any one of claims 31 to 33, wherein the neural network model is configured to process image data, audio data, video data, or text data;
the neural network model comprises an operator whose input tensor is the image data, the audio data, the video data, or the text data; or,
the input tensors of the operators included in the neural network model are obtained by processing the image data, the audio data, the video data or the text data by using at least one operator included in the neural network model.
35. A data processing apparatus, comprising a memory, a processor, and a bus system, wherein the memory is configured to store a program, the processor is configured to execute the program in the memory to perform the data processing method of any one of claims 1 to 17, and the bus system is configured to connect the memory and the processor.
36. A computer-readable storage medium, in which a computer program is stored which, when run on a computer, causes the computer to perform the data processing method of any one of claims 1 to 17.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010232274.9A CN113449859A (en) | 2020-03-27 | 2020-03-27 | Data processing method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113449859A true CN113449859A (en) | 2021-09-28 |
Family
ID=77808178
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010232274.9A Pending CN113449859A (en) | 2020-03-27 | 2020-03-27 | Data processing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113449859A (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180165574A1 (en) * | 2016-12-13 | 2018-06-14 | Google Inc. | Performing average pooling in hardware |
US20190065896A1 (en) * | 2017-08-23 | 2019-02-28 | Samsung Electronics Co., Ltd. | Neural network method and apparatus |
US20190332925A1 (en) * | 2018-04-30 | 2019-10-31 | International Business Machines Corporation | Neural hardware accelerator for parallel and distributed tensor computations |
CN110647973A (en) * | 2018-06-27 | 2020-01-03 | 北京中科寒武纪科技有限公司 | Operation method and related method and product |
CN110647356A (en) * | 2018-06-27 | 2020-01-03 | 北京中科寒武纪科技有限公司 | Arithmetic device and related product |
CN110490309A (en) * | 2019-08-14 | 2019-11-22 | 北京中科寒武纪科技有限公司 | A kind of Operator Fusion method and its Related product for neural network |
CN110633153A (en) * | 2019-09-24 | 2019-12-31 | 上海寒武纪信息科技有限公司 | Method for realizing neural network model splitting by using multi-core processor and related product |
Non-Patent Citations (2)
Title |
---|
LEI HE: "EnGN: A High-Throughput and Energy-Efficient Accelerator for Large Graph Neural Networks", arXiv:1909.00155v2, pages 1-13 *
LIANG LUO ET AL: "Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training", Proceedings of the ACM Symposium on Cloud Computing, pages 41-54 *
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023061465A1 (en) * | 2021-10-15 | 2023-04-20 | Huawei Technologies Co., Ltd. | Methods, systems, and media for computer vision using 2d convolution of 4d video data tensors |
WO2023122854A1 (en) * | 2021-12-27 | 2023-07-06 | 华为技术有限公司 | Data processing method and apparatus |
CN114610294A (en) * | 2022-05-09 | 2022-06-10 | 湖南星河云程信息科技有限公司 | Concurrent computation control method and device for performance indexes of simulation experiment and computer equipment |
WO2024066847A1 (en) * | 2022-09-29 | 2024-04-04 | 华为技术有限公司 | Multi-die-based computation method and related device |
WO2024140032A1 (en) * | 2022-12-31 | 2024-07-04 | 北京希姆计算科技有限公司 | Method for determining operator tiling policy, and apparatus |
WO2024140033A1 (en) * | 2022-12-31 | 2024-07-04 | 北京希姆计算科技有限公司 | Operator splitting method and apparatus, and operator compiling system |
CN117172289A (en) * | 2023-09-01 | 2023-12-05 | 苏州亿铸智能科技有限公司 | Tensor segmentation method and device and electronic equipment |
CN116880995A (en) * | 2023-09-08 | 2023-10-13 | 之江实验室 | Execution method and device of model task, storage medium and electronic equipment |
CN116880995B (en) * | 2023-09-08 | 2024-01-09 | 之江实验室 | Execution method and device of model task, storage medium and electronic equipment |
CN117634711A (en) * | 2024-01-25 | 2024-03-01 | 北京壁仞科技开发有限公司 | Tensor dimension segmentation method, system, device and medium |
CN117634711B (en) * | 2024-01-25 | 2024-05-14 | 北京壁仞科技开发有限公司 | Tensor dimension segmentation method, system, device and medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113449857B (en) | Data processing method and data processing equipment | |
CN113449859A (en) | Data processing method and device | |
CN112101083B (en) | Object detection method and system for weak supervision by using neural network | |
CN111368993B (en) | Data processing method and related equipment | |
WO2022068623A1 (en) | Model training method and related device | |
EP4145351A1 (en) | Neural network construction method and system | |
WO2022228425A1 (en) | Model training method and apparatus | |
US20240135174A1 (en) | Data processing method, and neural network model training method and apparatus | |
CN111368656A (en) | Video content description method and video content description device | |
CN114492723A (en) | Neural network model training method, image processing method and device | |
CN113505883A (en) | Neural network training method and device | |
WO2023020613A1 (en) | Model distillation method and related device | |
CN113065633A (en) | Model training method and associated equipment | |
CN113627422A (en) | Image classification method and related equipment thereof | |
CN111428854A (en) | Structure searching method and structure searching device | |
DE102022128165A1 (en) | DATA PATH CIRCUIT DESIGN USING REINFORCEMENT LEARNING | |
WO2024160186A1 (en) | Model training method and related device | |
CN112764893A (en) | Data processing method and data processing system | |
WO2022227024A1 (en) | Operational method and apparatus for neural network model and training method and apparatus for neural network model | |
WO2024175079A1 (en) | Model quantization method and related device | |
WO2024179485A1 (en) | Image processing method and related device thereof | |
WO2024175014A1 (en) | Image processing method and related device thereof | |
WO2024179503A1 (en) | Speech processing method and related device | |
CN114169393A (en) | Image classification method and related equipment thereof | |
WO2024061123A1 (en) | Image processing method and image processing related device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |