CN112711478B - Task processing method and device based on neural network, server and storage medium - Google Patents
- Publication number
- CN112711478B (application number CN201911016715.5A)
- Authority
- CN
- China
- Prior art keywords
- subtasks
- executable
- subtask
- type
- node
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The application relates to the technical field of neural networks, and in particular to a task processing method, device, server and storage medium based on a neural network, which are used to solve the technical problem in the prior art that data processing tasks of a neural network are executed with low efficiency. The method comprises the following steps: dividing an operation task to be processed into a plurality of first-type subtasks, and determining a plurality of executable subtasks based on them; determining the dependency value of each executable subtask according to the obtained dependency relationships among the plurality of executable subtasks; adding executable subtasks whose dependency value equals a preset value into an activation queue; and executing, based on a plurality of cores, a plurality of executable subtasks in the activation queue in parallel. In this way, based on the strategy of dividing first and then executing in parallel, the execution efficiency of the data processing tasks of the target layer of the neural network is improved.
Description
Technical Field
The present application relates to the field of neural networks, and in particular, to a task processing method, device, server and storage medium based on a neural network.
Background
Artificial neural networks are one of the main branches of intelligent control technology, and are applied and studied in many fields, mainly including pattern recognition, signal processing, knowledge engineering, expert systems, combinatorial optimization, robot control, and so on. With the continuous development of artificial neural network theory and related technologies, the application of neural networks is expected to go deeper.
The concept of deep learning is derived from the study of artificial neural networks; a multi-layer perceptron with multiple hidden layers is one deep learning structure. Deep learning combines low-level features to form more abstract high-level representations of attribute categories or features, so as to discover distributed feature representations of data.
With the rapid development of deep learning technology in recent years, various large-scale artificial neural networks have been introduced, which place very high demands on the computing capacity, flexibility and computing efficiency of the processor.
However, in the prior art, the operation speed, the cache space and the memory wall of the processor make it increasingly difficult for the processor to cope with the operation requirements of a large-scale neural network. For example, a convolutional neural network (Convolutional Neural Network, CNN) is a type of feedforward neural network that includes convolution calculations and has a deep structure; it is one of the representative algorithms of deep learning, and can perform shift-invariant classification of input information according to its hierarchical structure, so it is also referred to as a shift-invariant artificial neural network. When a convolutional neural network is used to process an image, the larger the pixel size of the image, the larger the convolution kernel of the corresponding convolution operation, and a convolution operation with a large convolution kernel occupies a large amount of memory or cache space. When the cache space of the processor is insufficient to load the related operation data, in particular because the cache of the computing unit of a simplified network processing unit (Network Processing Unit, NPU) is not large, the convolution operation, or all the data of a certain middle layer of another neural network, sometimes cannot be loaded, which reduces processing efficiency.
In view of this, a new method needs to be designed to overcome the above drawbacks.
Disclosure of Invention
The embodiment of the application provides a task processing method, device, server and storage medium based on a neural network, which are used to solve the technical problem in the prior art that neural network data processing tasks are executed with low efficiency.
The specific technical scheme provided by the embodiment of the application is as follows:
In the embodiment of the application, based on a multi-core processor architecture, the to-be-processed operation task in a target layer specified in the neural network is divided into a plurality of first-type subtasks, executable subtasks are determined based on the first-type subtasks, and the dependency value of each executable subtask is determined according to the dependency relationships among the executable subtasks; when the dependency value of an executable subtask meets a preset value, that executable subtask is added into an activation queue, and a plurality of executable subtasks in the activation queue are executed in parallel based on a plurality of cores of the multi-core processor architecture. In this way, the to-be-processed operation task of the target layer specified in the neural network can be cut by division into a plurality of first-type subtasks of smaller granularity; the operation resources occupied when each first-type subtask is executed are smaller than those occupied when the undivided operation task is executed, so there is no need to wait for sufficient operation resources before execution, the likelihood that the current task is executed promptly by a processor core is increased, and the requirement on processor performance is reduced. Moreover, based on the plurality of cores under the processor architecture, a plurality of executable subtasks are executed almost simultaneously in a parallel execution mode, which improves processing efficiency.
Drawings
FIG. 1 is a schematic diagram of a heterogeneous multi-core processor architecture in an embodiment of the present application;
FIG. 2 is a flow chart of an embodiment of a neural network based task processing method in an embodiment of the present application;
FIG. 3 is a schematic diagram of a partial region of a topology graph in an embodiment of the application;
FIG. 4 is a schematic diagram of an active queue operation mechanism according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a convolutional layer divided into a plurality of executable subtasks according to an embodiment of the present application;
FIG. 6 is a schematic diagram of operational parameters of an executable subtask according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a task processing device based on a neural network according to an embodiment of the present application;
Fig. 8 is a schematic diagram of a server structure in an embodiment of the application.
Detailed Description
In order to solve the technical problem of low efficiency in executing neural network operation tasks in the prior art, in the embodiment of the application, the operation task of a neural network middle layer is divided into a plurality of first-type subtasks, a plurality of executable subtasks are determined based on the first-type subtasks, the dependency value of each executable subtask is determined according to the dependency relationships among the subtasks, executable subtasks whose current dependency value equals a preset value are added into an activation queue, and a plurality of executable subtasks in the activation queue are invoked at once in a parallel execution mode, so that the invoked subtasks are executed almost simultaneously in time.
The preferred embodiments of the present application will be described in further detail below with reference to the accompanying drawings:
The task processing method based on the neural network provided by the embodiment of the application is, as one implementation, realized based on a multi-core processor architecture; it may also be implemented on a single-core processor if, through technical improvement, the single-core processor can simulate the processing performance of a multi-core processor. The multi-core processor architecture is preferably a heterogeneous multi-core processor, but may also be a homogeneous multi-core processor; that is, the processing method provided by the embodiment of the application is mainly applicable to heterogeneous multi-core processors, but is also applicable to homogeneous multi-core processors.
The heterogeneous multi-core processor architecture comprises two or more processors, for example a central processing unit (Central Processing Unit, CPU) and an NPU, wherein the CPU contains a plurality of CPU cores and the NPU contains a plurality of NPU cores; or two processors such as a CPU and a graphics processing unit (Graphics Processing Unit, GPU), and so on.
For example, referring to fig. 1, a preferred multi-core processor architecture may include three processors including a CPU, an NPU, and a GPU, where the CPU has three cores, i.e., CPU1, CPU2, and CPU3; the NPU is provided with three cores, namely NPU1, NPU2 and NPU3; the GPU is provided with three cores, namely GPU1, GPU2 and GPU3.
The task processing method based on the neural network provided by the embodiment of the application can be applied to data processing in many application scenarios in many fields. For example, in the field of image recognition, when a convolutional neural network is used for text recognition or face recognition, the operation task processing method provided by the embodiment of the application can be adopted: after the image data is input, the task of performing convolution operations on the image data can be divided into a plurality of first-type subtasks of smaller granularity, a plurality of corresponding executable subtasks are then determined based on these first-type subtasks, a dependency topology graph is constructed according to the data dependency relationships among the executable subtasks, and the dependency value of each executable subtask is updated according to the current execution situation; when a dependency value reaches the preset value, the corresponding executable subtask is added into an activation queue, and the plurality of executable subtasks in the activation queue can be scheduled and executed in parallel, so as to improve the processing efficiency of executing the corresponding data operation task based on the convolutional neural network.
Referring to fig. 2, in the embodiment of the present application, the detailed flow of the task processing method based on the neural network is as follows:
S201: Divide the operation task to be processed in the target layer specified in the neural network into a plurality of first-type subtasks.
In the embodiment of the present application, as an implementation manner, before S201, a processor that can be used to execute a task and at least one core corresponding to the processor are determined from the multi-core processor architecture.
In S201, the target layer specified in the neural network may be any one layer, or a combination of multiple layers, of the input layer, output layer and middle layers of the neural network; that is, the operation task processing method provided by the embodiment of the present application is applicable to each layer of the neural network. For a convolutional neural network, the target layer mainly comprises middle layers such as the convolutional layers, pooling layers and fully connected layers.
Specifically, for the division of the operation task to be processed, different division modes should be adopted for different neural networks.
If the neural network is a convolutional neural network, the to-be-processed operation task in the target layer is divided into a plurality of corresponding first-type subtasks based on the number of channels corresponding to the target layer, or based on the corresponding number of rows and columns. The channels are the input channels or output channels through which the target layer exchanges data with other layers, and the rows and columns are the rows and columns of the two-dimensional or three-dimensional array of each channel of the target layer.
The middle layer of a convolutional neural network is a three-dimensional array whose dimensions correspond to rows, columns and channels respectively, so for dividing the operation task of a convolutional neural network, both the channel division mode and the row-and-column division mode are feasible.
For other non-convolutional neural networks, such as a recurrent neural network (Recurrent Neural Network, RNN), the row-column division manner is only applicable to convolutional neural networks, and the to-be-processed operation tasks in the target layer need to be divided into a plurality of corresponding first-class subtasks based on the number of channels corresponding to the target layer.
For example, when the number of input channels is 4 according to the channel mode, the task to be processed of the target layer is divided into 4 first-class subtasks.
The sizes of the divided first-type subtasks need not be exactly the same; for example, when a three-dimensional array with 15 columns is divided into 4 first-type subtasks, 3 of the first-type subtasks cover 4 columns each, and the remaining first-type subtask covers only 3 columns.
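As a non-limiting illustration of the division described above, the following minimal C++ sketch splits a target layer's operation task into first-type subtasks along the channel dimension; the structure and function names, and the way uneven sizes are rounded, are assumptions made for this sketch only and are not part of the claimed method.

```cpp
#include <vector>

// Illustrative description of one first-type subtask produced by division.
// Field names are assumptions for this sketch only.
struct FirstTypeSubtask {
    int channelBegin, channelEnd;  // [begin, end) range of input channels
    int colBegin, colEnd;          // [begin, end) range of columns
};

// Split `totalChannels` input channels into `parts` first-type subtasks.
// When the total is not evenly divisible, the leading subtasks receive one
// extra unit, so sizes may differ, as in the 15-column example above.
std::vector<FirstTypeSubtask> divideByChannel(int totalChannels, int totalCols, int parts) {
    std::vector<FirstTypeSubtask> tasks;
    int base = totalChannels / parts, extra = totalChannels % parts, begin = 0;
    for (int i = 0; i < parts; ++i) {
        int size = base + (i < extra ? 1 : 0);
        tasks.push_back({begin, begin + size, 0, totalCols});
        begin += size;
    }
    return tasks;
}
```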
For a convolutional neural network, generally only the to-be-processed convolution operation task of the target layer is divided, and the convolution kernel is not split. In some special cases the convolution kernel may also be divided; for example, in the channel division mode, where the number of channels is 9, a convolution kernel of size 7x7 may be divided into 9 convolution kernels of size 3x3.
In the embodiment of the application, after the first type subtasks are divided, the operation parameters of each first type subtask are generated. The operation parameters of a first subtask at least comprise any one or any combination of the following parameters: input tensors, output tensors, weight parameters, convolution parameters, and additional parameters.
Wherein the weight parameter comprises any one or two of weight address and weight length; the convolution parameters comprise any one or any combination of parameters related to convolution operation such as convolution type, convolution kernel size, convolution step length and the like; the additional parameters include any one or a combination of normalization parameters and activation parameters.
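As a non-limiting sketch, the operation parameters listed above could be grouped into a record such as the following; all field names and types are assumptions made for illustration only.

```cpp
#include <cstddef>

// Hypothetical grouping of the operation parameters of one first-type subtask.
struct OperationParams {
    // Input / output tensors, described here only by their addresses.
    void* inputTensor;
    void* outputTensor;
    // Weight parameter: weight address and/or weight length.
    void*  weightAddr;
    size_t weightLen;
    // Convolution parameters: convolution type, kernel size, stride, etc.
    int convType;
    int kernelH, kernelW;
    int strideH, strideW;
    // Additional parameters: normalization and/or activation parameters.
    float normScale;
    int   activationType;
};
```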
S202: a plurality of executable sub-tasks is determined, the plurality of executable sub-tasks including at least one first type of sub-task.
Optionally, the determination of the executable subtasks may include, but is not limited to, the following:
mode one:
After the operation task to be processed is divided into a plurality of first-type subtasks, the plurality of first-type subtasks are compiled, and the compiled first-type subtasks are determined as the corresponding executable subtasks.
The compiling of the first type subtasks comprises compiling the plurality of first type subtasks into operation instructions executable by at least one processor in the multi-core processor architecture.
The purpose of compilation is to convert sub-tasks of a first type into the form of instructions that are executable by at least one processor.
Specifically, based on the operation types supported by different processors, at least the following compiling process should be performed:
Acquire the weight parameters corresponding to the plurality of first-type subtasks, and perform the following operations for each of the first-type subtasks:
if the weight parameter corresponding to one first type subtask is fixed-point integer, compiling the one first type subtask into an operation instruction of fixed-point integer; and if the weight parameter corresponding to one of the first type subtasks is a floating point number, compiling the one of the first type subtasks into an operation instruction in a floating point format.
For example, if the weight parameter of a first type subtask is int8 fixed point integer, compiling the first type subtask into an int8 type task capable of running in the NPU; if the weight parameter of one of the first type subtasks is a floating point number, the first type subtask is compiled into a floating point type task which can run on the GPU or the CPU.
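A minimal sketch of this weight-format-based compilation choice is given below; the enum values and the routine name are assumptions, and a real compilation step naturally produces processor-specific instructions rather than merely selecting a target tag.

```cpp
enum class WeightFormat { Int8Fixed, Float32 };
enum class TargetKind   { NPU, GPU, CPU };

// Choose a compilation target for one first-type subtask from its weight
// format: fixed-point integer weights map to fixed-point (e.g. NPU)
// instructions, floating-point weights to floating-point (GPU or CPU) ones.
TargetKind chooseTarget(WeightFormat fmt) {
    return (fmt == WeightFormat::Int8Fixed) ? TargetKind::NPU : TargetKind::GPU;
}
```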
Mode two:
After the operation task to be processed is divided into a plurality of first-type subtasks, the uncompiled first-type subtasks are directly determined as the corresponding executable subtasks, without compiling them.
Mode three:
Because the memory-wall problem generally exists in neural network computing, in the embodiment of the application, in order to ensure that importing and exporting data does not become a performance bottleneck, the operations of importing and exporting the data of the target layer of the neural network between the cache and the memory, as well as storage-type operations such as importing images from the memory into the cache, are also divided, as storage tasks to be processed. Therefore, unlike mode one and mode two described above, in mode three the division object includes not only the operation task to be processed but also the storage task to be processed, and the first-type subtasks obtained by dividing the operation task to be processed and the second-type subtasks obtained by dividing the storage task to be processed are together determined as executable subtasks.
Specifically, in a storage area of the multi-core processor architecture, a storage task to be processed for importing and/or exporting data of a target layer is divided into a plurality of second-class subtasks, the plurality of second-class subtasks are compiled, and the compiled plurality of second-class subtasks and the plurality of first-class subtasks are combined to obtain a plurality of corresponding executable subtasks.
That is, in the third mode, after the second type subtasks are divided, they are compiled. The compiling of the second type of subtasks comprises compiling the second type of subtasks into data pre-loading instructions and/or data pre-saving instructions executable by at least one processor in the multi-core processor architecture.
For example, many NPU cores support data pre-load and data pre-save instructions in addition to convolution instructions. If the multi-core processor architecture comprises an NPU processor, the NPU is determined to be a processor for executing the to-be-processed storage task, and the second type of subtasks are compiled into data preloading and data pre-storing instructions supported by the NPU core during compiling, so that the to-be-processed storage task and the to-be-processed operation task which are imported and exported by the data can be executed in parallel.
The data preloading instruction needs to be provided with storage parameters such as the data input address, output address and copy length; these storage parameters are compiled into the corresponding second-type subtasks, and at run time the NPU core completes data loading and saving according to the storage parameters.
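The storage parameters carried by a second-type subtask could be sketched as follows; the field names are illustrative assumptions rather than the format of any particular NPU instruction set.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical storage parameters compiled into one second-type subtask,
// consumed by a core that supports data pre-load / pre-save instructions.
struct StorageParams {
    bool      isLoad;      // true: pre-load into cache, false: pre-save to memory
    uintptr_t srcAddr;     // data input address
    uintptr_t dstAddr;     // data output address
    size_t    copyLength;  // number of bytes to copy
};
```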
Mode four:
In mode four, as in mode three, the storage task to be processed for importing and/or exporting the data of the target layer is divided into a plurality of second-type subtasks, and the plurality of second-type subtasks and the plurality of first-type subtasks are together determined as executable subtasks; however, in mode four the second-type subtasks are not compiled after division, and the second-type subtasks and the first-type subtasks are directly determined as executable subtasks.
The advantage of dividing one to-be-processed operation task into a plurality of first-type subtasks, or a to-be-processed storage task into a plurality of second-type subtasks, that is, of dividing a large-granularity to-be-processed task into subtasks of smaller granularity, is that the operation resources (including memory, cache space and the like) required of a processor core when processing each task are reduced, so that cores whose resources were originally insufficient for executing the large-granularity to-be-processed task can be fully utilized, which increases core utilization and further improves processing efficiency.
S203: Determine the dependency value of each executable subtask according to the dependency relationships among the plurality of executable subtasks.
Specifically, in the embodiment of the present application, the dependency relationship between multiple executable subtasks, that is, the data input/output relationship between different executable subtasks, for example, if the input of the executable subtask a is the output of the executable subtask B, there is a dependency relationship between a and B, and the dependency relationship is that a depends on B; if the output of the executable subtask C is the input of the executable subtask D, then there is a dependency relationship between C and D, and the dependency relationship is that D depends on C.
The dependency value characterizes the number of other executable subtasks on which the corresponding executable subtask depends; that is, the dependency value has a definite quantitative relationship with the number of other executable subtasks on which it depends, but is not necessarily equal to it.
Specifically, regarding the determination of the dependency value: if the outputs of m executable subtasks are all inputs of executable subtask A, then the dependency value of executable subtask A = m × α + n, where α is a step value and n is a preset value. When n is set to 0 and the step value is 1, the dependency value of executable subtask A is m. For example, with n set to 0 and the step value set to 1, if the outputs of 5 executable subtasks are inputs of executable subtask A, the dependency value of executable subtask A is 5; with n set to 2 and the step value set to 1, the dependency value of that executable subtask is 7.
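A trivial sketch of this formula, with the default values n = 0 and α = 1 assumed for illustration:

```cpp
// Dependency value of an executable subtask that depends on m others,
// with step value alpha and preset value n (alpha = 1, n = 0 by default).
int dependencyValue(int m, int alpha = 1, int n = 0) {
    return m * alpha + n;
}
// Example: m = 5, alpha = 1, n = 0 -> 5;  m = 5, alpha = 1, n = 2 -> 7.
```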
Specifically, this step may be implemented in, but is not limited to, the following two ways:
Mode 1: the dependency topology graph scheme.
Generating a corresponding one of the nodes based on one of the executable subtasks; and respectively defining each other node with a data transmission relation with the one node as an upstream node or a downstream node of the one node, and generating a corresponding dependency topology graph. The number of upstream nodes on which the one node depends is marked as the dependency value of the one node.
Any other node may be an upstream node or a downstream node of the node corresponding to the above executable subtask; this is determined by the data transmission direction (which may also be referred to as the dependency relationship). For example, node a in the topology graph corresponds to executable subtask A and node b corresponds to executable subtask B; if the input of executable subtask A is the output of executable subtask B, the data transmission direction is from node b to node a, and node a is a downstream node of node b.
Regarding the dependency values, specifically, the dependency relationship between each executable subtask is recorded, the input number of one executable subtask (i.e. the dependent executable subtask number) is counted, and the input number is used as the dependency value of the corresponding node of the executable subtask in the topological graph.
And in the dependency topology, for one node corresponding to one executable subtask, subtracting a step value from the dependency value of the one node every time the executable subtask corresponding to any node in the upstream nodes on which the one node depends is determined to be executed.
For example, referring to fig. 3, according to the structure of the neural network and the operational relationships between the executable subtasks, it may be determined that the outputs of the 4 executable subtasks A, E, F, G are all inputs of executable subtask B; then the nodes a, e, f, g corresponding to the executable subtasks A, E, F, G in the topology graph are all upstream nodes of node b, and the dependency value of node b is marked as 4.
During operation, each time the executable subtask corresponding to any one of the upstream nodes is determined to have been executed, the dependency value of node b is reduced by one step value. The step value is a constant whose value lies in the interval (0, 1]; preferably, the step value is set to 1, that is, each time an upstream node is executed, the dependency value of the node is reduced by 1.
In the dependency topology graph, the executable subtasks divided from the first layer of the neural network do not depend on other executable subtasks, so their dependency value is n; preferably, when n is set to 0, the initial dependency value of a first-layer node is 0.
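A minimal sketch of mode 1, under the assumption of a simple adjacency representation with step value 1: each node records its downstream nodes and a dependency counter equal to the number of upstream nodes it depends on, and when an upstream subtask finishes, the counter of each downstream node is decremented.

```cpp
#include <vector>

struct Node {
    std::vector<int> downstream;  // indices of downstream nodes
    int dependencyValue = 0;      // number of upstream nodes not yet executed
};

// Called when the executable subtask of node `finished` has been executed:
// every downstream node loses one step value of dependency (1 by default).
void onSubtaskFinished(std::vector<Node>& graph, int finished, int step = 1) {
    for (int d : graph[finished].downstream)
        graph[d].dependencyValue -= step;
}
```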
Mode 2: the dependency vector scheme.
In this mode, according to the dependency relationships among the executable subtasks, a corresponding dependency vector is generated for each executable subtask; the elements of the dependency vector are, in turn, the other executable subtasks on which this executable subtask depends, and the number of elements of the dependency vector is the dependency value of this executable subtask. For example, if the outputs of the 4 executable subtasks A, E, F, G are all inputs of executable subtask B, then the dependency vector of executable subtask B is [A, E, F, G], the dependency value of executable subtask B is 4, and the operation level of the executable subtask is 4, or another value having a fixed correspondence with the dependency value.
When the operation task to be processed is executed, if executable subtask A has been executed, the corresponding element A is deleted from the dependency vectors of the remaining executable subtasks; the dependency vector of executable subtask B is updated to [E, F, G], the dependency value of B is updated correspondingly, and the operation level of B is updated to 3.
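A minimal sketch of mode 2, assuming subtasks are identified by integer indices: each subtask keeps a vector of the subtasks it depends on, a finished subtask is erased from every remaining vector, and the size of a vector is that subtask's current dependency value.

```cpp
#include <algorithm>
#include <vector>

// Mode 2 sketch: dependency vectors indexed by subtask. When subtask `done`
// finishes, it is erased from every remaining dependency vector; the size of
// a vector is the current dependency value of the corresponding subtask.
void onSubtaskDone(std::vector<std::vector<int>>& depVectors, int done) {
    for (auto& vec : depVectors)
        vec.erase(std::remove(vec.begin(), vec.end(), done), vec.end());
}
```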
S204: Add the executable subtasks whose dependency values equal the preset value into an activation queue, and execute a plurality of executable subtasks in the activation queue in parallel through a plurality of cores.
In S204, the multiple cores are multiple cores currently available in the multi-core processor architecture, and may include multiple cores of only one processor, or may include multiple cores of different processors. For example, the plurality of cores may be four cores of a CPU; two cores of the CPU and two cores of the GPU can be included; or one core of the CPU, one core of the GPU and two cores of the NPU, which can be specifically determined according to the actual situation of the multi-core processor architecture, the embodiments of the present application are not listed.
Specifically, when step S204 is executed, executing the plurality of executable subtasks in the active queue in parallel includes selecting a plurality of executable subtasks corresponding to the number of currently available cores from the active queue, and executing the plurality of executable subtasks in parallel, for example, if it is determined that 5 cores are currently available, then selecting 5 executable subtasks from the active queue.
Moreover, executable subtasks that have been executed should be popped from the activation queue in time.
The dependency value of each executable subtask is not a fixed value; it is continuously updated as the executable subtasks it depends on are gradually executed, and the update may be performed in real time or periodically.
When the dependency value of an executable subtask is updated to n, this indicates that all the executable subtasks it depends on have been executed and that it can now be executed; at this moment the executable subtask is activated and added into the activation queue.
For example, in one embodiment, when the dependency value of the executable subtasks becomes 0, the executable subtasks are activated and added to the activation queue, and then, among the activated executable subtasks, the executable subtasks are selected for parallel execution. In another embodiment, a preset value n is set to be 1, when the dependency value is 1, the corresponding executable subtasks are activated, the corresponding executable subtasks are added into an activation queue, and a plurality of executable subtasks are selected from the activated executable subtasks to be executed in parallel.
Specifically, in the first embodiment, referring to fig. 4, each executable subtask that is not executed is traversed, so as to obtain an executable subtask with a dependency value of 0, and the executable subtask is pushed into the activation queue. Based on a plurality of cores of at least one processor in a multi-core processor architecture, such as the architecture shown in fig. 1, there are 9 cores in total, wherein it is determined that there are 6 cores available, 6 executable subtasks are selected from the activation queue, and the 6 cores are invoked to execute the 6 executable subtasks in parallel. When the operation of the 6 executable subtasks is completed, the 6 executable subtasks are popped out of the active queue and the dependency value of the executable subtasks downstream of the 6 executable subtasks is decremented by 1. That is, as an implementation manner, after the executable task of each batch is executed, the corresponding dependency value is updated correspondingly.
At this time, information about the operation resources currently remaining in the multi-core processor architecture should be obtained; during parallel scheduling, the operation resources occupied by executing one executable subtask should be no more than the operation resources currently remaining on at least one core, so as to ensure that each executable subtask to be executed can be executed by at least one core. That is, according to the type of the executable subtask and the currently available operation resources, executable subtasks are handed to the corresponding NPU cores, GPU cores and CPU cores to complete their computation, and a plurality of activated executable subtasks can run in different cores at the same time.
The remaining computing resources in the multi-core processor architecture include remaining computing resources corresponding to a plurality of cores corresponding to each processor. The computing resources are various resources necessary to perform the executable subtasks, including but not limited to any one or any combination of the following parameters: the operation speed of each core of the processor, the available memory space, the available cache space and the like corresponding to each core.
If an executable subtask being executed in a certain processor core runs too slowly or is blocked, to the point that the activation queue even contains no executable subtasks, computing resources are wasted; in this case another core can schedule the same executable subtask to run, and the original processor core is informed that its running instance of the executable subtask is cancelled.
Corresponding to the second and fourth modes in step 202, when the first type subtask or the second type subtask is not compiled in step 202, the activated executable subtask should be compiled before the executable subtask in the activation queue is executed in parallel.
Specifically, compiling is performed according to the hardware structure of the multi-core processor architecture and the condition of the residual operation resources.
For example, when the applicable multi-core processor architecture is the architecture shown in fig. 1, according to the weight parameters of each executable subtask, the executable subtask is compiled into a floating-point form supportable by the CPU and the GPU, or compiled into a fixed-point integer form supportable by the NPU; an executable subtask compiled into fixed-point integer form can also run in the CPU and the GPU. For executable subtasks scheduled for execution by the NPU, it is also necessary to compile them into the form of NPU-executable instructions. The second-type subtasks determined as executable subtasks are preferentially selected for NPU execution, so the second-type subtasks among the executable subtasks should be compiled into data preloading instructions and data pre-saving instructions executable by the NPU.
When the applicable multi-core processor architecture is a homogeneous multi-core processor, i.e. the CPU has a plurality of cores, no compilation is required.
In S204, executing the plurality of executable subtasks in the activation queue in parallel through the plurality of cores may include: determining the cores available under the current multi-core processor architecture, and allocating the executable subtasks in the activation queue to the available cores in turn, one executable subtask per core; the cores run in parallel, so the allocated executable subtasks, equal in number to the available cores, are executed in parallel. For example, if 4 cores are available, 4 executable subtasks are executed in parallel; after execution is completed, the executed subtasks are popped from the activation queue in turn, and subsequent executable subtasks continue to be allocated to the currently available cores for execution.
In the embodiment of the application, parallel execution is not to be understood as being completely synchronous in time; it may be only a partial overlap in execution time.
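Putting the above together, the following sketch outlines one possible scheduling round under the assumptions of mode 1 with step value 1 and preset value 0; the queue handling, the core-selection loop, and the executor interface are all illustrative simplifications (for instance, a real implementation would additionally check the remaining operation resources of each core, dispatch to NPU/GPU/CPU cores by subtask type, and may re-schedule blocked subtasks, and the "parallel" execution is shown sequentially here only for brevity).

```cpp
#include <deque>
#include <functional>
#include <vector>

struct Subtask {
    std::vector<int> downstream;   // downstream (dependent) subtasks
    int dependencyValue = 0;       // remaining upstream dependencies
    std::function<void()> run;     // compiled executable body
};

// One simplified scheduling round: activate subtasks whose dependency value
// has reached the preset value 0, hand at most `availableCores` of them to
// cores, then update the dependency values of their downstream subtasks.
void scheduleRound(std::vector<Subtask>& tasks, std::deque<int>& activationQueue,
                   std::vector<bool>& activated, int availableCores) {
    for (int i = 0; i < (int)tasks.size(); ++i)
        if (!activated[i] && tasks[i].dependencyValue == 0) {
            activated[i] = true;               // activate exactly once
            activationQueue.push_back(i);
        }
    for (int c = 0; c < availableCores && !activationQueue.empty(); ++c) {
        int id = activationQueue.front();      // pop the executed subtask in time
        activationQueue.pop_front();
        tasks[id].run();                       // would run on an NPU/GPU/CPU core
        for (int d : tasks[id].downstream)     // decrement by step value 1
            tasks[d].dependencyValue -= 1;
    }
}
```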
A complete embodiment of the neural network-based task processing method provided by the embodiment of the present application is listed below.
In this complete embodiment, the heterogeneous multi-core processor architecture shown in fig. 1 is taken as a hardware base, and a convolutional neural network is taken as a neural network to be processed as an example.
First, it is determined that the heterogeneous multi-core processor architecture comprises a CPU with three CPU cores, a GPU with three GPU cores, and an NPU with three NPU cores.
In this embodiment, the middle layer of the convolutional neural network is taken as a specified target layer. Referring to fig. 5, taking a convolution layer in the middle layer as an example, the convolution operation task of each convolution layer is correspondingly divided into a plurality of first type subtasks according to the number of input channels of the convolution neural network. The convolution operation task corresponding to the convolution layer 1 is divided into 3 first type subtasks 11, 12 and 13, the convolution operation task corresponding to the convolution layer 2 is divided into 3 first type subtasks 21, 22 and 23, and the convolution operation task corresponding to the convolution layer 3 is divided into 3 first type subtasks 31, 32 and 33.
Referring to fig. 6, after the first type subtasks are divided, operation parameters of each first type subtask are correspondingly generated. In this embodiment, the operation parameters of one first type subtask include an input tensor, an output tensor, a weight parameter, a convolution parameter, and an additional parameter. The dependency values and operator activation states shown in fig. 6 are added correspondingly after the dependency topology is constructed in the subsequent steps.
Based on the heterogeneous multi-core processor architecture, the operation of importing and exporting the convolution layer data in the cache and the memory and the operation of importing the image from the memory into the cache are divided into a plurality of second-class subtasks as the storage tasks to be processed.
The plurality of first type subtasks and the plurality of second type subtasks are together determined as executable subtasks. The first type of subtasks 11-13, 21-23, and 31-33 are all determined to be executable subtasks.
Then, a dependency topology graph is constructed from the dependency relationships between the executable subtasks. In this embodiment, a preset value n=0, and a step value α=1 are set. Correspondingly, referring to fig. 5, the outputs of the first type subtasks 11 and 12 are inputs of the first type subtask 21, and then the dependency value of the first type subtask 21 is 2, and the first type subtasks 11-13 are independent of other subtasks, so that the dependency value is set to 0, that is, the dependency value of the first layer node in the topology graph is 0.
And determining the dependency value of the node corresponding to one executable subtask as the number of the dependent upstream nodes according to the dependency topology graph. For example, after the first type of subtask 21 is determined to be an executable subtask, if there are 2 upstream nodes on which it depends, then its dependency value is 2.
When the dependency value is 0, the corresponding executable subtask is activated and added into the activation queue.
For example, assuming that there is no pending storage task, i.e. the current executable subtask does not include the second type subtask, the dependency value of the 3 executable subtasks 11-13 corresponding to the convolution layer 1 is 0, the task is activated first, and the task is added to the activation queue, where there are 3 executable subtasks in the activation queue.
If the 3 executable subtasks are all of floating-point type, parallel scheduling is performed according to the remaining operation resources of 3 cores of the CPU or the GPU, so that each executable subtask is executed by one CPU core or one GPU core; the 3 cores execute the 3 executable subtasks in parallel, that is, the 3 executable subtasks are executed simultaneously. When the 3 executable subtasks have been executed, they are popped out of the activation queue. Because executable subtasks 11 and 12 are executed simultaneously, the dependency values of the executable subtasks 21 and 22 corresponding to convolution layer 2 are reduced by 2 at the same time and become 0; after activation, they are added into the activation queue, and the above steps are repeated until all the convolution operation tasks of the convolutional neural network have been executed.
Based on the same inventive concept, referring to fig. 7, an embodiment of the present application further provides a task processing device based on a neural network, based on a multi-core processor architecture setting, including:
The dividing unit 701 is configured to divide an operation task to be processed in a target layer specified in the neural network into a plurality of first type subtasks;
a first determining unit 702, configured to determine a plurality of executable subtasks, where the plurality of executable subtasks includes at least one first type subtask;
a second determining unit 703, configured to determine the dependency value of each executable subtask according to the obtained dependency relationships among the plurality of executable subtasks; wherein the dependency value characterizes the number of other executable subtasks on which the corresponding executable subtask depends;
An execution unit 704, configured to add the dependent value as an executable subtask of a preset value to an activation queue; and executing a plurality of executable subtasks in the activation queue in parallel through a plurality of cores.
When dividing the to-be-processed operation task in the target layer specified in the neural network into a plurality of first-class subtasks, the dividing unit 701 is specifically configured to: if the neural network is a convolutional neural network, dividing an operation task to be processed in the target layer into a plurality of corresponding first subtasks based on the number of channels corresponding to the target layer or the number of corresponding rows and columns, wherein the channels are input channels or output channels of the interaction data of the target layer and other layers, and the rows and columns are two-dimensional arrays or three-dimensional arrays of each channel of the target layer;
or when dividing the to-be-processed operation task in the target layer designated in the neural network into a plurality of first-class subtasks, the dividing unit 701 is specifically configured to: if the neural network is a non-convolution neural network, dividing the to-be-processed operation task in the target layer into a plurality of corresponding first sub-tasks based on the number of channels corresponding to the target layer, wherein the channels are channels for the interaction data of the target layer and other layers.
After dividing the to-be-processed operation task in the target layer specified in the neural network into a plurality of first-class subtasks, before determining a plurality of executable subtasks, the dividing unit 701 is further configured to: generating operation parameters of the first type subtasks, wherein the operation parameters of one first type subtask at least comprise any one or any combination of the following parameters: input tensors, output tensors, weight parameters, convolution parameters, and additional parameters.
When determining a plurality of executable subtasks based on the plurality of first type subtasks, the first determining unit 702 is specifically configured to:
Compiling the plurality of first-class subtasks, and determining the compiled plurality of first-class subtasks as a plurality of corresponding executable subtasks;
Or alternatively
Dividing a storage task to be processed for importing and/or exporting data of the target layer into a plurality of second-class subtasks in a storage area of the multi-core processor architecture; compiling the plurality of second-class subtasks, and combining the compiled plurality of second-class subtasks with the plurality of first-class subtasks to obtain a plurality of corresponding executable subtasks.
When compiling the first type subtasks, the first determining unit 702 is specifically configured to:
compiling the plurality of first type subtasks into operational instructions executable by at least one processor in the multi-core processor architecture.
When compiling the plurality of first-class subtasks, the first determining unit is specifically configured to: acquiring weight parameters corresponding to the plurality of first-class subtasks respectively; the following operations are respectively executed for the plurality of first-type subtasks:
if the weight parameter corresponding to one first type subtask is fixed-point integer, compiling the one first type subtask into an operation instruction of fixed-point integer; and if the weight parameter corresponding to one of the first type subtasks is a floating point number, compiling the one of the first type subtasks into an operation instruction in a floating point format.
When compiling the second type subtasks, the first determining unit 702 is specifically configured to: and compiling the plurality of second-class subtasks into data pre-loading instructions and/or data pre-saving instructions executable by at least one processor in the multi-core processor architecture respectively.
Wherein, when sequentially determining the operation levels of the plurality of executable subtasks according to the obtained dependency relationship between the plurality of executable subtasks, the second determining unit 703 is specifically configured to: the following operations are performed for each executable subtask:
Generating a corresponding one of the nodes based on one of the executable subtasks;
Each other node having a data transmission relation with the one node is defined as an upstream node or a downstream node of the one node respectively, and a corresponding dependency topology graph is generated, wherein the dependency topology graph characterizes the dependency relation among the nodes; the number of upstream nodes on which the one node depends is marked as the dependency value of the one node.
Wherein after marking the number of upstream nodes on which the one node depends as the dependent value of the one node, the execution unit 704 is specifically configured to: the following operations are performed for each executable subtask, respectively:
In the dependency topology, for one node corresponding to one executable subtask, subtracting a step value (for example, 1) from a dependency value of the one node after determining that the executable subtask corresponding to any node in upstream nodes on which the one node depends is completed; and adding the executable subtasks with the dependency values being preset values (for example, 0) into an activation queue.
Based on the same inventive concept, referring to fig. 8, an embodiment of the present application provides a server, which at least includes: a memory 801, and a processor 802, wherein,
A memory 801 for storing executable instructions;
a processor 802 for reading and executing executable instructions stored in a memory to implement any of the methods referred to in the above embodiments.
Based on the same inventive concept, an embodiment of the present application provides a storage medium, which when executed by a processor, enables to perform any one of the methods referred to in the above embodiments.
In summary, in the embodiment of the application, the task to be processed of the target layer of the neural network is divided into a plurality of first-type subtasks, which refines the operation granularity, reduces the requirement on processor core performance, and makes the divided fine-grained first-type subtasks easier for a processor core to execute; and based on the plurality of cores of the multi-core processor, an efficient scheduling mechanism is provided by setting up an activation queue of executable subtasks, so that the executable subtasks are scheduled in an orderly manner and can be executed in different cores respectively, which improves task processing efficiency.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made to the embodiments of the present application without departing from the spirit or scope of the embodiments of the application. Thus, if such modifications and variations of the embodiments of the present application fall within the scope of the claims and the equivalents thereof, the present application is also intended to include such modifications and variations.
Claims (18)
1. A neural network-based task processing method, comprising:
dividing an operation task to be processed in a target layer appointed in a neural network into a plurality of first-class subtasks;
determining a plurality of executable subtasks, the plurality of executable subtasks including at least one first type of subtask;
The following operations are performed for each executable subtask: generating a corresponding one of the nodes based on one of the executable subtasks; each other node with a data transmission relation with the node is defined as an upstream node or a downstream node of the node respectively, and a corresponding dependency topological graph is generated; marking the number of upstream nodes on which the one node depends as a dependent value of the one node; or generating a corresponding dependency vector for each executable subtask according to the dependency relationship among the executable subtasks, wherein the elements in the dependency vector are other executable subtasks on which the executable subtask depends in turn, and the number of the elements of the dependency vector is the dependency value of the executable subtask; wherein the dependency value characterizes a number of other executable subtasks on which the corresponding executable subtask depends;
Adding the executable subtasks with the dependent values as preset values into an activation queue; and executing a plurality of executable subtasks in the activation queue in parallel through a plurality of cores.
2. The method of claim 1, wherein dividing the operation task to be processed in the target layer specified in the neural network into a plurality of first-type subtasks specifically comprises:
if the neural network is a convolutional neural network, dividing the operation task to be processed in the target layer into a plurality of corresponding first-type subtasks based on the number of channels corresponding to the target layer or the corresponding number of rows and columns, wherein the channels are input channels or output channels through which the target layer exchanges data with other layers, and the rows and columns belong to the two-dimensional or three-dimensional array of each channel of the target layer;
or,
if the neural network is a non-convolutional neural network, dividing the operation task to be processed in the target layer into a plurality of corresponding first-type subtasks based on the number of channels corresponding to the target layer, wherein the channels are channels through which the target layer exchanges data with other layers.
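A hedged illustration of the split described in claim 2, using NumPy arrays as stand-ins for the layer's data; the function names and the (channels, rows, columns) layout are assumptions made for the example, not defined by the patent.

```python
import numpy as np

def split_conv_layer_by_channel(feature_map: np.ndarray, num_parts: int):
    """Convolutional layer: split the work along the channel axis (C, H, W layout assumed)."""
    return np.array_split(feature_map, num_parts, axis=0)

def split_conv_layer_by_rows(feature_map: np.ndarray, num_parts: int):
    """Convolutional layer: alternatively split along the rows of each channel's 2-D array."""
    return np.array_split(feature_map, num_parts, axis=1)

def split_non_conv_layer(activations: np.ndarray, num_parts: int):
    """Non-convolutional layer: split only along the channel dimension."""
    return np.array_split(activations, num_parts, axis=-1)

# Example: an 8-channel 32x32 feature map split into 4 first-type subtasks by channel.
parts = split_conv_layer_by_channel(np.zeros((8, 32, 32)), 4)   # 4 arrays of shape (2, 32, 32)
```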
3. The method of claim 1, wherein after dividing the operation task to be processed in the target layer specified in the neural network into the plurality of first-type subtasks and before determining the plurality of executable subtasks, the method further comprises:
generating operation parameters of the first-type subtasks, wherein the operation parameters of one first-type subtask comprise at least any one or any combination of the following: an input tensor, an output tensor, a weight parameter, a convolution parameter, and an additional parameter.
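The operation parameters of claim 3 amount to a small per-subtask record; the dataclass below is an illustrative sketch with assumed field names, not the patent's actual data layout.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class OperationParams:
    """Parameters attached to one first-type subtask (illustrative only)."""
    input_tensor: np.ndarray
    output_tensor: np.ndarray
    weights: Optional[np.ndarray] = None   # weight parameters, if the operation has them
    conv_params: Optional[dict] = None     # e.g. {"stride": 1, "padding": 0}
    extra: Optional[dict] = None           # additional parameters such as bias or activation
```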
4. The method of claim 1, 2 or 3, wherein determining the plurality of executable subtasks specifically comprises:
compiling the plurality of first-type subtasks, and determining the compiled plurality of first-type subtasks as the plurality of corresponding executable subtasks;
or,
dividing a storage task to be processed, for importing and/or exporting data of the target layer within a storage area of a multi-core processor architecture, into a plurality of second-type subtasks;
compiling the plurality of second-type subtasks, and combining the compiled plurality of second-type subtasks with the plurality of first-type subtasks to obtain the plurality of corresponding executable subtasks.
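A minimal sketch of how claim 4's second branch could combine compute and storage subtasks; `compile_fn` and the "load"/"compute"/"store" task types are placeholders for whatever backend compiler and instruction set are actually used.

```python
def build_executable_set(compute_subtasks, layer_inputs, layer_outputs, compile_fn):
    """Compile first-type (compute) subtasks and pair them with second-type (storage) subtasks."""
    executable = []
    # Second-type subtasks: import layer data into the multi-core processor's storage area.
    executable += [{"type": "load", "buffer": buf} for buf in layer_inputs]
    # First-type subtasks: the compiled compute work itself.
    executable += [{"type": "compute", "code": compile_fn(t)} for t in compute_subtasks]
    # Second-type subtasks: export the layer's results back out of the storage area.
    executable += [{"type": "store", "buffer": buf} for buf in layer_outputs]
    return executable
```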
5. The method of claim 4, wherein compiling the plurality of first-type subtasks specifically comprises:
compiling the plurality of first-type subtasks into operation instructions executable by at least one processor in the multi-core processor architecture.
6. The method of claim 4, wherein compiling the plurality of first-type subtasks specifically comprises:
acquiring the weight parameters respectively corresponding to the plurality of first-type subtasks;
performing the following operations respectively for the plurality of first-type subtasks:
if the weight parameter corresponding to one first-type subtask is a fixed-point integer, compiling the first-type subtask into a fixed-point integer operation instruction;
and if the weight parameter corresponding to one first-type subtask is a floating-point number, compiling the first-type subtask into a floating-point operation instruction.
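A hedged sketch of the dtype-driven choice in claim 6, using NumPy dtypes as a stand-in for whatever weight representation the compiler actually inspects; the returned dict is illustrative, not a real instruction encoding.

```python
import numpy as np

def select_instruction_format(weights: np.ndarray) -> dict:
    """Choose fixed-point or floating-point compilation from the weight parameter's dtype."""
    bits = weights.dtype.itemsize * 8
    if np.issubdtype(weights.dtype, np.integer):
        return {"format": "fixed_point_integer", "bits": bits}
    if np.issubdtype(weights.dtype, np.floating):
        return {"format": "floating_point", "bits": bits}
    raise TypeError(f"unsupported weight dtype: {weights.dtype}")

# Example: int8 weights map to fixed-point instructions, float32 weights to floating-point ones.
select_instruction_format(np.zeros(4, dtype=np.int8))     # {'format': 'fixed_point_integer', 'bits': 8}
select_instruction_format(np.zeros(4, dtype=np.float32))  # {'format': 'floating_point', 'bits': 32}
```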
7. The method of claim 4, wherein compiling the plurality of second-type subtasks specifically comprises:
compiling the plurality of second-type subtasks respectively into data pre-loading instructions and/or data pre-saving instructions executable by at least one processor in the multi-core processor architecture.
8. The method of claim 1, wherein after marking the number of upstream nodes on which the node depends as the dependency value of the node, and before adding the executable subtasks whose dependency value equals the preset value to the activation queue, the method further comprises:
performing the following operations respectively for each executable subtask:
in the dependency topology graph, for the node corresponding to one executable subtask, subtracting a step value from the dependency value of the node after determining that the executable subtask corresponding to any one of the upstream nodes on which the node depends has been executed;
wherein adding the executable subtasks whose dependency value equals the preset value to the activation queue specifically comprises:
adding the executable subtasks whose current dependency value equals the preset value to the activation queue.
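Continuing the sketch given after claim 1, the loop below illustrates claim 8's runtime behavior: when a subtask finishes, the dependency value of each downstream subtask is decremented by a step value, and any subtask whose value reaches the preset value is moved into the activation queue and dispatched to an idle core. Mapping cores onto a thread pool is an assumption made for the example.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def run_activation_queue(tasks, dep, num_cores=4, preset_value=0, step_value=1):
    """Execute subtasks in parallel, activating downstream subtasks as their dependencies clear."""
    ready = [t for t in tasks.values() if dep[t.name] == preset_value]
    with ThreadPoolExecutor(max_workers=num_cores) as pool:
        pending = {pool.submit(t.work): t for t in ready}
        while pending:
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                finished = pending.pop(fut)
                fut.result()                       # surface any exception from the subtask
                for name in finished.downstream:
                    dep[name] -= step_value        # one upstream node has now been executed
                    if dep[name] == preset_value:  # all dependencies satisfied: activate it
                        pending[pool.submit(tasks[name].work)] = tasks[name]

# Usage with the `tasks` and `dep` built in the sketch after claim 1:
# run_activation_queue(tasks, dep, num_cores=2)
```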
9. A neural network-based task processing device, comprising:
a dividing unit, configured to divide an operation task to be processed in a target layer specified in a neural network into a plurality of first-type subtasks;
a first determining unit, configured to determine a plurality of executable subtasks, the plurality of executable subtasks including at least one first-type subtask;
a second determining unit, configured to perform the following operations for each executable subtask: generating a corresponding node based on the executable subtask; defining each other node having a data transmission relation with the node as an upstream node or a downstream node of the node, respectively, and generating a corresponding dependency topology graph; and marking the number of upstream nodes on which the node depends as the dependency value of the node; or, generating a corresponding dependency vector for each executable subtask according to the dependency relationships among the executable subtasks, wherein the elements of the dependency vector are, in turn, the other executable subtasks on which the executable subtask depends, and the number of elements of the dependency vector is the dependency value of the executable subtask; wherein the dependency value characterizes the number of other executable subtasks on which the corresponding executable subtask depends;
an execution unit, configured to add the executable subtasks whose dependency value equals a preset value to an activation queue, and to execute a plurality of executable subtasks in the activation queue in parallel through a plurality of cores.
10. The device of claim 9, wherein, when dividing the operation task to be processed in the target layer specified in the neural network into the plurality of first-type subtasks, the dividing unit is specifically configured to:
if the neural network is a convolutional neural network, divide the operation task to be processed in the target layer into a plurality of corresponding first-type subtasks based on the number of channels corresponding to the target layer or the corresponding number of rows and columns, wherein the channels are input channels or output channels through which the target layer exchanges data with other layers, and the rows and columns belong to the two-dimensional or three-dimensional array of each channel of the target layer;
or,
if the neural network is a non-convolutional neural network, divide the operation task to be processed in the target layer into a plurality of corresponding first-type subtasks based on the number of channels corresponding to the target layer, wherein the channels are input channels or output channels through which the target layer exchanges data with other layers.
11. The device of claim 9, wherein, after dividing the operation task to be processed in the target layer specified in the neural network into the plurality of first-type subtasks and before determining the plurality of executable subtasks, the dividing unit is further configured to:
generate operation parameters of the first-type subtasks, wherein the operation parameters of one first-type subtask comprise at least any one or any combination of the following: an input tensor, an output tensor, a weight parameter, a convolution parameter, and an additional parameter.
12. The device of claim 9, 10 or 11, wherein, when determining the plurality of executable subtasks, the first determining unit is specifically configured to:
compile the plurality of first-type subtasks, and determine the compiled plurality of first-type subtasks as the plurality of corresponding executable subtasks;
or,
divide a storage task to be processed, for importing and/or exporting data of the target layer within a storage area of a multi-core processor architecture, into a plurality of second-type subtasks;
compile the plurality of second-type subtasks, and combine the compiled plurality of second-type subtasks with the plurality of first-type subtasks to obtain the plurality of corresponding executable subtasks.
13. The device of claim 12, wherein, when compiling the plurality of first-type subtasks, the first determining unit is specifically configured to:
compile the plurality of first-type subtasks into operation instructions executable by at least one processor in the multi-core processor architecture.
14. The device of claim 12, wherein, when compiling the plurality of first-type subtasks, the first determining unit is specifically configured to:
acquire the weight parameters respectively corresponding to the plurality of first-type subtasks;
perform the following operations respectively for the plurality of first-type subtasks:
if the weight parameter corresponding to one first-type subtask is a fixed-point integer, compile the first-type subtask into a fixed-point integer operation instruction;
and if the weight parameter corresponding to one first-type subtask is a floating-point number, compile the first-type subtask into a floating-point operation instruction.
15. The device of claim 12, wherein, when compiling the plurality of second-type subtasks, the first determining unit is specifically configured to:
compile the plurality of second-type subtasks respectively into data pre-loading instructions and/or data pre-saving instructions executable by at least one processor in the multi-core processor architecture.
16. The device of claim 9, wherein, after marking the number of upstream nodes on which the node depends as the dependency value of the node and before adding the executable subtasks to the activation queue, the execution unit is further configured to:
perform the following operations respectively for each executable subtask:
in the dependency topology graph, for the node corresponding to one executable subtask, subtract a step value from the dependency value of the node after determining that the executable subtask corresponding to any one of the upstream nodes on which the node depends has been executed;
and add the executable subtasks whose current dependency value equals the preset value to the activation queue.
17. A server, comprising: a memory and a processor; wherein
the memory is configured to store executable instructions;
the processor is configured to read and execute the executable instructions stored in the memory to implement the method of any one of claims 1-8.
18. A storage medium, wherein instructions in the storage medium, when executed by a processor, enable the method of any one of claims 1-8 to be performed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911016715.5A CN112711478B (en) | 2019-10-24 | 2019-10-24 | Task processing method and device based on neural network, server and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112711478A CN112711478A (en) | 2021-04-27 |
CN112711478B true CN112711478B (en) | 2024-05-28 |
Family
ID=75540169
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911016715.5A Active CN112711478B (en) | 2019-10-24 | 2019-10-24 | Task processing method and device based on neural network, server and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112711478B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113419830B (en) * | 2021-06-23 | 2023-02-03 | 鹤壁国立光电科技股份有限公司 | Multi-dimensional scheduling method and system based on neural network |
CN113742089B (en) * | 2021-11-04 | 2022-02-18 | 苏州浪潮智能科技有限公司 | Method, device and equipment for distributing neural network computing tasks in heterogeneous resources |
WO2023123266A1 (en) * | 2021-12-30 | 2023-07-06 | 华为技术有限公司 | Subgraph compilation method, subgraph execution method and related device |
CN114035810B (en) * | 2022-01-10 | 2022-04-15 | 北京一流科技有限公司 | Synchronous deployment system and method for multi-stream parallelism |
CN116303108B (en) * | 2022-09-07 | 2024-05-14 | 芯砺智能科技(上海)有限公司 | Weight address arrangement method suitable for parallel computing architecture |
CN115237582B (en) * | 2022-09-22 | 2022-12-09 | 摩尔线程智能科技(北京)有限责任公司 | Method for processing multiple tasks, processing equipment and heterogeneous computing system |
CN115840571B (en) * | 2023-02-21 | 2023-06-23 | 北京灵汐科技有限公司 | Method for compiling task, compiler and computer readable medium |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10409560B1 (en) * | 2015-11-18 | 2019-09-10 | Amazon Technologies, Inc. | Acceleration techniques for graph analysis programs |
CN106358003A (en) * | 2016-08-31 | 2017-01-25 | 华中科技大学 | Video analysis and accelerating method based on thread level flow line |
CN108319458A (en) * | 2018-01-17 | 2018-07-24 | 南京航空航天大学 | It is a kind of based on graphically defend formula order calculation multitask Compilation Method |
CN109101339A (en) * | 2018-08-15 | 2018-12-28 | 北京邮电大学 | Video task parallel method, device and Heterogeneous Cluster Environment in isomeric group |
CN109240815A (en) * | 2018-08-24 | 2019-01-18 | 珠海格力电器股份有限公司 | Multi-task running method, device and equipment for shared stack |
CN109886407A (en) * | 2019-02-27 | 2019-06-14 | 上海商汤智能科技有限公司 | Data processing method, device, electronic equipment and computer readable storage medium |
CN109919315A (en) * | 2019-03-13 | 2019-06-21 | 科大讯飞股份有限公司 | A kind of forward inference method, apparatus, equipment and the storage medium of neural network |
Non-Patent Citations (2)
Title |
---|
"面向场景字符识别关键算法的多平台异构加速研究";贺江;《中国优秀硕士论文全文数据库》;全文 * |
"The Design and Implementation of Scalable Deep Neural Network Accelerator Cores";Ryuichi Sakamoto;《IEEE》;全文 * |
Also Published As
Publication number | Publication date |
---|---|
CN112711478A (en) | 2021-04-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112711478B (en) | Task processing method and device based on neural network, server and storage medium | |
KR102569086B1 (en) | Task parallel processing method, device, system, storage medium and computer device | |
US8990827B2 (en) | Optimizing data warehousing applications for GPUs using dynamic stream scheduling and dispatch of fused and split kernels | |
JP7078758B2 (en) | Improving machine learning models to improve locality | |
CN110633153A (en) | Method for realizing neural network model splitting by using multi-core processor and related product | |
US7937567B1 (en) | Methods for scalably exploiting parallelism in a parallel processing system | |
US11609792B2 (en) | Maximizing resource utilization of neural network computing system | |
US10095556B2 (en) | Parallel priority queue utilizing parallel heap on many-core processors for accelerating priority-queue-based applications | |
CN110826708B (en) | Method for realizing neural network model splitting by using multi-core processor and related product | |
CN110308982B (en) | Shared memory multiplexing method and device | |
CN110689121A (en) | Method for realizing neural network model splitting by using multi-core processor and related product | |
CN114580653A (en) | Machine learning calculation optimization method and compiler | |
US20210294646A1 (en) | Hardware assisted fine-grained data movement | |
US20200371835A1 (en) | Method And Apparatus For Scheduling Matrix Operations In Digital Processing Systems | |
US20210390405A1 (en) | Microservice-based training systems in heterogeneous graphic processor unit (gpu) cluster and operating method thereof | |
WO2023082575A1 (en) | Graph execution pipeline parallelism method and apparatus for neural network model computation | |
US10580106B2 (en) | Graphics processing method utilizing predefined render chunks | |
CN108108242B (en) | Storage layer intelligent distribution control method based on big data | |
CN118277490B (en) | Data processing system, data synchronization method, electronic device, and storage medium | |
Gajurel et al. | GPU acceleration of sparse neural networks | |
CN116483550A (en) | Computing resource allocation method and device for tensor computing graph and readable storage medium | |
CN110415162B (en) | Adaptive graph partitioning method facing heterogeneous fusion processor in big data | |
US20240220314A1 (en) | Data dependency-aware scheduling | |
US20240220315A1 (en) | Dynamic control of work scheduling | |
CN118296084B (en) | Data processing apparatus, instruction synchronization method, electronic apparatus, and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||