
CN117764190A - Data processing method and device - Google Patents

Data processing method and device

Info

Publication number
CN117764190A
Authority
CN
China
Prior art keywords
processing
input data
processing branches
branches
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211131266.0A
Other languages
Chinese (zh)
Inventor
许奕星
陈醒濠
王云鹤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202211131266.0A priority Critical patent/CN117764190A/en
Priority to PCT/CN2023/118187 priority patent/WO2024055952A1/en
Publication of CN117764190A publication Critical patent/CN117764190A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

A data processing method applied to the binarization of a neural network, the method comprising: processing input data through each first processing branch of a plurality of first processing branches in a block included in an MLP to obtain a plurality of first processing results; wherein each first processing branch comprises one or more fully-connected layers; the plurality of first processing branches comprise processing branches for interacting with the input data in a spatial dimension and processing branches for interacting with the input data in a channel dimension; and the input data and the parameters in the plurality of first processing branches are binarized data. In this method, a block in the binarized MLP is provided with a plurality of parallel processing branches, which can simultaneously perform information interaction on the input data in the spatial dimension and the channel dimension, thereby increasing the information-interaction complexity of the network and improving the network performance of the binarized MLP.

Description

Data processing method and device
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a data processing method and apparatus thereof.
Background
Deep Learning (DL) is a new research direction in the field of Machine Learning (ML); it was introduced into machine learning to bring machine learning closer to its original goal: artificial intelligence (AI).
With the development of deep learning technology, deep neural networks (deep neural networks, DNN) have been widely used in various fields. For example, convolutional neural networks (convolutional neural network, CNN) have been successfully applied to the fields of picture classification, object detection, and the like, as one of the deep neural networks. However, the convolutional neural network needs huge computing resources, so it is difficult to directly apply the convolutional neural network to devices with limited computing power such as mobile phones, cameras, robots, etc.
To solve this problem, many compression and acceleration algorithms for neural networks have been proposed; applying such algorithms to deep neural networks can yield very high compression and acceleration ratios with very little impact on the accuracy of the original network. One such method is to binarize the weights, which occupy a large amount of storage space, to obtain a binary neural network (BNN), thereby reducing the storage space required by the convolutional neural network; likewise, the activation values, which also occupy a large amount of space, are binarized to increase the operation speed of the neural network.
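For illustration only, the following sketch (assuming a PyTorch-style tensor library; the shapes are arbitrary) shows the basic binarization step that maps full-precision weights and activations to the two values -1 and +1:

    import torch

    def binarize(x: torch.Tensor) -> torch.Tensor:
        # Map every element to -1 or +1; zeros are mapped to +1 by convention.
        return torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))

    # Example: binarize a weight matrix and an activation tensor.
    weights = torch.randn(64, 64)      # full-precision (FP32) weights
    activations = torch.randn(8, 64)   # full-precision activations
    w_bin, a_bin = binarize(weights), binarize(activations)
    # Each binarized value needs only 1 bit of storage, and the matrix product of
    # binarized tensors can be realized with XNOR and popcount operations on hardware.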
However, current neural network binarization methods were all proposed for the CNN structure, and directly applying these methods to the multi-layer perceptron (MLP) network structure brings a very large performance degradation.
Disclosure of Invention
The application provides a data processing method which can improve the network performance of binarized MLP.
In a first aspect, the present application provides a data processing method applied to a multi-layer perceptron (MLP) in a machine learning model; the MLP comprises a block, and the block comprises a plurality of first processing branches connected in parallel; the method comprises: processing input data through each first processing branch of the plurality of first processing branches to obtain a plurality of first processing results; wherein each first processing branch comprises one or more fully-connected layers; at least one first processing branch of the plurality of first processing branches is configured to interact with the input data in a spatial dimension; at least one first processing branch of the plurality of first processing branches is configured to interact with the input data in a channel dimension; and the input data and the parameters in the plurality of first processing branches are binarized data. In this embodiment of the application, the block in the binarized MLP is provided with a plurality of parallel processing branches, which can simultaneously perform information interaction on the input data in the spatial dimension and the channel dimension, thereby increasing the information-interaction complexity of the network and improving the network performance of the binarized MLP.
Parallel connection is understood to mean that the plurality of first processing branches process identical input data and that data fusion is performed on their outputs.
That is, at least one first processing branch may allow interaction (or may be described as communication) between data of different channels of input data, and at least one first processing branch may allow interaction (or may be described as communication) between data of different spatial locations (e.g., tokens).
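As an illustrative sketch only (PyTorch-style; the token count, channel count, and the restriction to exactly one spatial branch and one channel branch are assumptions made here for brevity), a block with parallel binarized fully-connected branches might look as follows:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def binarize(t: torch.Tensor) -> torch.Tensor:
        # Map values to {-1, +1}; zeros are mapped to +1.
        return torch.where(t >= 0, torch.ones_like(t), -torch.ones_like(t))

    class BinaryLinear(nn.Linear):
        # Fully-connected layer whose input and weights are binarized in the forward pass.
        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return F.linear(binarize(x), binarize(self.weight), self.bias)

    class ParallelBinaryBlock(nn.Module):
        # Two parallel binarized branches: one interacts across tokens (spatial), one across channels.
        def __init__(self, num_tokens: int, channels: int):
            super().__init__()
            self.spatial_fc = BinaryLinear(num_tokens, num_tokens)  # interaction in the spatial dimension
            self.channel_fc = BinaryLinear(channels, channels)      # interaction in the channel dimension

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, num_tokens, channels)
            spatial_out = self.spatial_fc(x.transpose(1, 2)).transpose(1, 2)  # mix information across tokens
            channel_out = self.channel_fc(x)                                  # mix information across channels
            return spatial_out + channel_out                                  # fuse the first processing results

    block = ParallelBinaryBlock(num_tokens=196, channels=64)
    out = block(torch.randn(2, 196, 64))

Because the parallel branches see the same input and their outputs are added, information is exchanged across both tokens and channels within a single block.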
In one possible implementation, the spatial dimensions of the input data specifically include a width dimension and a height dimension; at least one of the plurality of first processing branches is to interact with input data in the width dimension and at least one of the plurality of first processing branches is to interact with input data in the height dimension.
It should be understood that branches other than the plurality of first processing branches may be included in the block, for example, branches in series with at least one first processing branch may be included, or processing branches connected in parallel with at least one first processing branch but having different functions from the first processing branch may be included, which is not limited herein.
In one possible implementation, the method further comprises: fusing the plurality of first processing results to obtain a fused result. The fusion may be an addition operation, weighting, concatenation (splicing), or other processing.
In one possible implementation, the block further includes a plurality of parallel second processing branches; the method further comprises the steps of: processing the fusion result of the plurality of first processing results through each second processing branch in the plurality of second processing branches to obtain a plurality of second processing results; wherein at least one of the plurality of second processing branches is configured to interact with the fusion result in a spatial dimension; at least one second processing branch of the plurality of second processing branches is configured to interact with the fusion result in a channel dimension; the parameters in the plurality of second processing branches are binarized data.
In one possible implementation, the processing results of the plurality of first processing branches may be fused, which may be, but is not limited to, a matrix addition operation. In addition, the processing results of the plurality of first processing branches may be fused with the input data of the batch normalization layer preceding the plurality of first processing branches.
In one possible implementation, the fusion result of the plurality of first processing results may be processed by each of the plurality of second processing branches, similarly to the processing performed by the first processing branches, to obtain a plurality of second processing results. Here, the fusion result may be a fusion of the processing results of the plurality of first processing branches described above with the input data of the batch normalization layer preceding the plurality of first processing branches, and the fusion result may further be normalized by a batch normalization layer before being processed by the second processing branches.
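A schematic forward pass for such a block, assuming the branch modules and batch-normalization layers have been constructed elsewhere (the names here are hypothetical), could be:

    def block_forward(x, norm1, first_branches, norm2, second_branches):
        # First stage: normalize, run the parallel first branches, fuse with the pre-norm input.
        h = norm1(x)
        fused1 = x + sum(branch(h) for branch in first_branches)
        # Second stage: normalize the fusion result, run the parallel second branches, fuse again.
        g = norm2(fused1)
        fused2 = fused1 + sum(branch(g) for branch in second_branches)
        return fused2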
In one possible implementation, the information expression capability of the MLP may be distributed equally across the width, height, and channel dimensions, i.e. the first number, the second number, and the third number of processing branches in the block are the same; the first number is the number of processing branches included in the block for interacting in the width dimension; the second number is the number of processing branches included in the block for interacting in the height dimension; the third number is the number of processing branches included in the block for interacting in the channel dimension.
In one possible implementation, to improve the information expression capability of the model, a universal shortcut (Uni-shortcut) may be introduced on a processing branch of the block (e.g. the first processing branch or the second processing branch). This can be understood as connecting the input of the branch to the output of an intermediate network layer and performing data fusion at that point. However, in a processing branch of a block in the MLP, the data size of the input data is not consistent with the data size of the output data of the intermediate network layer, whereas fusion is typically performed on data of the same size. In the embodiment of the application, data of different sizes can therefore be adjusted to the same size, after which data fusion can be performed.
In one possible implementation, the plurality of first processing branches includes a target processing branch; the target processing branch comprises a plurality of network layers which are connected in series and comprise a first network layer, a second network layer and a third network layer, the second network layer and the third network layer are connected adjacently, input data of the first network layer are first input data, and input data of the third network layer are fusion results of the second input data and the third input data; the third input data is the data output by the second network layer; wherein the size of the first input data and the size of the third input data are different; the second input data is obtained by adjusting the size of the first input data, and the second input data and the third input data have the same size.
In one possible implementation, the size is the number of channels; the number of channels of the first input data is larger than that of channels of the third input data, and the second input data is obtained by carrying out data fusion on the first input data in a channel dimension; or the number of channels of the first input data is smaller than that of channels of the third input data, and the second input data is obtained by copying the first input data in the channel dimension.
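The channel-size adjustment described above can be sketched as follows (PyTorch-style; averaging channel groups is only one possible way to "fuse in the channel dimension", and the channel counts are assumed to divide evenly):

    import torch

    def uni_shortcut(branch_input: torch.Tensor, layer_output: torch.Tensor) -> torch.Tensor:
        # branch_input: (..., c_in), layer_output: (..., c_out); fuse them despite differing channel counts.
        c_in, c_out = branch_input.shape[-1], layer_output.shape[-1]
        if c_in > c_out:
            # More input channels: fuse groups of c_in // c_out channels by averaging.
            adjusted = branch_input.reshape(*branch_input.shape[:-1], c_out, c_in // c_out).mean(dim=-1)
        elif c_in < c_out:
            # Fewer input channels: copy (repeat) the input along the channel dimension.
            adjusted = branch_input.repeat_interleave(c_out // c_in, dim=-1)
        else:
            adjusted = branch_input
        return adjusted + layer_output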
In one possible implementation, the MLP further comprises a downsampling network; the downsampling network comprises a convolution module and a plurality of downsampling modules; the method further comprises: performing a convolution operation in the channel dimension on the fusion result of the plurality of first processing results through the convolution module to obtain a convolution result; performing a pooling operation in the spatial dimension on the convolution result through each downsampling module of the plurality of downsampling modules to obtain a plurality of third processing results; and fusing the plurality of third processing results.
In one possible implementation, the convolution operation is a step-size 1 convolution; the pooling operation is pooling with a step size of N, wherein N is greater than 1.
Existing binarization methods do not binarize the downsampling layer and still retain the original FP32 downsampling layer. This makes the MLP network very computationally intensive, so that it cannot compete with a 1-bit CNN network. According to the embodiment of the application, the amount of computation of the network can be reduced while the processing precision of the downsampling network remains high.
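A possible downsampling stage along these lines is sketched below (the choice of average and max pooling as the two downsampling modules, and the stride of 2, are assumptions for illustration):

    import torch
    import torch.nn as nn

    class DownsampleStage(nn.Module):
        # Stride-1 convolution in the channel dimension, followed by several stride-N poolings whose outputs are fused.
        def __init__(self, in_channels: int, out_channels: int, stride: int = 2):
            super().__init__()
            self.channel_conv = nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=1)
            self.pools = nn.ModuleList([
                nn.AvgPool2d(kernel_size=stride, stride=stride),
                nn.MaxPool2d(kernel_size=stride, stride=stride),
            ])

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, in_channels, height, width)
            h = self.channel_conv(x)                       # convolution result (channel mixing, stride 1)
            pooled = [pool(h) for pool in self.pools]      # third processing results (spatial downsampling, stride N)
            return torch.stack(pooled, dim=0).sum(dim=0)   # fusion of the third processing results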
In a second aspect, the present application provides a data processing method, the method comprising:
receiving performance requirements sent by terminal equipment;
constructing a machine learning model meeting the performance requirements according to the performance requirements; the multi-layer perceptron MLP in the machine learning model comprises a block, wherein the block comprises a plurality of first processing branches connected in parallel; wherein each of said first processing branches comprises one or more fully connected layers; each first processing branch of the plurality of first processing branches is for processing input data; the performance requirement is used for determining the number of first processing branches and/or the type of the first processing branches included in the block; the type is one of the following: interacting with the input data in the spatial dimension or interacting with the input data in the channel dimension; and the input data and the parameters in the plurality of first processing branches are binarized data;
And sending the machine learning model to the terminal equipment.
In one possible implementation, the performance requirements include at least one of:
precision requirements, latency requirements or model parameter requirements.
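Purely as an illustration of how such requirements might map to a block configuration (the thresholds, field names, and the heuristic itself are hypothetical and not taken from this application):

    def build_block_config(performance_requirement: dict) -> dict:
        # Hypothetical heuristic: tighter latency budgets get fewer parallel branches,
        # but at least one spatial and one channel branch are always kept.
        latency_ms = performance_requirement.get("latency_ms", 50)
        num_first_branches = 2 if latency_ms < 10 else 4
        branch_types = (["spatial", "channel"] * num_first_branches)[:num_first_branches]
        return {"num_first_branches": num_first_branches, "branch_types": branch_types}

    # Example: a terminal device requests a low-latency model.
    print(build_block_config({"latency_ms": 8, "precision": 0.7}))
    # {'num_first_branches': 2, 'branch_types': ['spatial', 'channel']}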
In one possible implementation, the spatial dimensions of the input data specifically include a width dimension and a height dimension;
the interacting the input data in the spatial dimension includes:
the input data is interacted with in the width dimension or the input data is interacted with in the height dimension.
In one possible implementation, the block further includes a plurality of parallel second processing branches; the method further comprises the steps of:
processing the fusion result of the plurality of first processing results through each second processing branch in the plurality of second processing branches to obtain a plurality of second processing results; wherein at least one of the plurality of second processing branches is configured to interact with the fusion result in a spatial dimension; at least one second processing branch of the plurality of second processing branches is configured to interact with the fusion result in a channel dimension; the parameters in the plurality of second processing branches are binarized data; the performance requirement is used for determining the number of second processing branches and/or the type of the second processing branches included in the block; the type is one of the following: interacting with the input data in the spatial dimension or interacting with the input data in the channel dimension.
In one possible implementation, the first, second, and third numbers of processing branches in the block are the same;
the first number is the number of processing branches included in the block for interacting in a width dimension;
the second number is the number of processing branches included in the block for interacting in a height dimension;
the third number is the number of processing branches included in the block for interacting in the channel dimension.
In one possible implementation, the plurality of first processing branches includes a target processing branch; the target processing branch comprises a plurality of network layers which are connected in series and comprise a first network layer, a second network layer and a third network layer, the second network layer and the third network layer are connected adjacently, input data of the first network layer are first input data, and input data of the third network layer are fusion results of the second input data and the third input data; the third input data is the data output by the second network layer; wherein,
the size of the first input data and the size of the third input data are different; the second input data is obtained by adjusting the size of the first input data, and the second input data and the third input data have the same size.
In one possible implementation, the size is the number of channels;
the number of channels of the first input data is larger than that of channels of the third input data, and the second input data is obtained by carrying out data fusion on the first input data in a channel dimension; or,
the number of channels of the first input data is smaller than that of channels of the third input data, and the second input data is obtained by copying the first input data in the channel dimension.
In one possible implementation, the MLP further comprises a downsampling network; the downsampling network comprises a convolution module and a plurality of downsampling modules;
the method further comprises the steps of:
performing a convolution operation in the channel dimension on the fusion result of the plurality of first processing results through the convolution module to obtain a convolution result;
performing a pooling operation in the spatial dimension on the convolution result through each downsampling module in the plurality of downsampling modules to obtain a plurality of third processing results;
and fusing the plurality of third processing results.
In one possible implementation, the convolution operation is a step-size 1 convolution; the pooling operation is pooling with a step size of N, wherein N is greater than 1.
In a third aspect, the present application provides a data processing apparatus applied to a multi-layer perceptron MLP in a machine learning model; the MLP comprises a block, wherein the block comprises a plurality of first processing branches connected in parallel; the device comprises:
the processing module is used for processing the input data through each first processing branch in the plurality of first processing branches to obtain a plurality of first processing results; wherein,
each of the first processing branches includes one or more fully-connected layers; at least one first processing branch of the plurality of first processing branches is configured to interact with the input data in a spatial dimension; at least one first processing branch of the plurality of first processing branches is configured to interact with the input data in a channel dimension; the input data and parameters in the plurality of first processing branches are binarized data.
In one possible implementation, the spatial dimensions of the input data specifically include a width dimension and a height dimension;
at least one of the plurality of first processing branches is to interact with input data in the width dimension and at least one of the plurality of first processing branches is to interact with input data in the height dimension.
In one possible implementation, the block further includes a plurality of parallel second processing branches; the processing module is further configured to:
processing the fusion result of the plurality of first processing results through each second processing branch in the plurality of second processing branches to obtain a plurality of second processing results; wherein at least one of the plurality of second processing branches is configured to interact with the fusion result in a spatial dimension; at least one second processing branch of the plurality of second processing branches is configured to interact with the fusion result in a channel dimension; the parameters in the plurality of second processing branches are binarized data.
In one possible implementation, the first, second, and third numbers of processing branches in the block are the same;
the first number is the number of processing branches included in the block for interacting in a width dimension;
the second number is the number of processing branches included in the block for interacting in a height dimension;
the third number is the number of processing branches included in the block for interacting in the channel dimension.
In one possible implementation, the plurality of first processing branches includes a target processing branch; the target processing branch comprises a plurality of network layers which are connected in series and comprise a first network layer, a second network layer and a third network layer, the second network layer and the third network layer are connected adjacently, input data of the first network layer are first input data, and input data of the third network layer are fusion results of the second input data and the third input data; the third input data is the data output by the second network layer; wherein,
the size of the first input data and the size of the third input data are different; the second input data is obtained by adjusting the size of the first input data, and the second input data and the third input data have the same size.
In one possible implementation, the size is the number of channels;
the number of channels of the first input data is larger than that of channels of the third input data, and the second input data is obtained by carrying out data fusion on the first input data in a channel dimension; or,
the number of channels of the first input data is smaller than that of channels of the third input data, and the second input data is obtained by copying the first input data in the channel dimension.
In one possible implementation, the MLP further comprises a downsampling network; the downsampling network comprises a convolution module and a plurality of downsampling modules;
the processing module is further configured to:
performing a convolution operation in the channel dimension on the fusion result of the plurality of first processing results through the convolution module to obtain a convolution result;
performing a pooling operation in the spatial dimension on the convolution result through each downsampling module in the plurality of downsampling modules to obtain a plurality of third processing results;
and fusing the plurality of third processing results.
In one possible implementation, the convolution operation is a step-size 1 convolution; the pooling operation is pooling with a step size of N, wherein N is greater than 1.
In a fourth aspect, the present application provides a data processing apparatus, the apparatus comprising:
the acquisition module is used for receiving the performance requirements sent by the terminal equipment;
the processing module is used for constructing a machine learning model meeting the performance requirements according to the performance requirements; the multi-layer perceptron MLP in the machine learning model comprises a block, wherein the block comprises a plurality of first processing branches connected in parallel; wherein each of said first processing branches comprises one or more fully connected layers; each first processing branch of the plurality of first processing branches is for processing input data; the performance requirement is used for determining the number of first processing branches and/or the type of the first processing branches included in the block; the type is one of the following: interacting with the input data in the spatial dimension or interacting with the input data in the channel dimension; and the input data and the parameters in the plurality of first processing branches are binarized data;
And the sending module is used for sending the machine learning model to the terminal equipment.
In one possible implementation, the performance requirements include at least one of:
precision requirements, latency requirements or model parameter requirements.
In one possible implementation, the spatial dimensions of the input data specifically include a width dimension and a height dimension;
the interacting the input data in the spatial dimension includes:
the input data is interacted with in the width dimension or the input data is interacted with in the height dimension.
In a fifth aspect, embodiments of the present application provide a data processing apparatus, which may include a memory, a processor, and a bus system, where the memory is configured to store a program, and the processor is configured to execute the program in the memory, to perform the method as in the first aspect and any optional method thereof, or the second aspect and any optional method thereof.
In a sixth aspect, embodiments of the present application provide a computer readable storage medium having a computer program stored therein, which when run on a computer, causes the computer to perform the first aspect and any optional method thereof, or the second aspect and any optional method thereof, as described above.
In a seventh aspect, embodiments of the present application provide a computer program which, when run on a computer, causes the computer to perform the above first aspect and any optional method thereof, or the above second aspect and any optional method thereof.
In an eighth aspect, the present application provides a chip system comprising a processor configured to support a device in implementing the functions involved in the above aspects, for example, sending or processing the data and/or information involved in the above methods. In one possible design, the chip system further includes a memory for holding the program instructions and data necessary for the execution device or the training device. The chip system may consist of chips, or may comprise chips and other discrete devices.
Drawings
FIG. 1A is a schematic diagram of a structure of an artificial intelligence main body frame;
FIGS. 1B and 2 are illustrations of an application system framework of the present invention;
FIG. 3 is a schematic diagram of an alternative hardware architecture of the terminal;
FIG. 4 is a schematic diagram of a server;
FIG. 5 is a system architecture diagram of the present application;
FIG. 6 is a flow of a cloud service;
FIG. 7 is a flow of a cloud service;
FIG. 8 is a flowchart of a data processing method according to an embodiment of the present application;
FIG. 9 is a flowchart of a data processing method according to an embodiment of the present application;
FIG. 10 is a flowchart of a data processing method according to an embodiment of the present application;
FIG. 11 is a flowchart of a data processing method according to an embodiment of the present application;
FIG. 12A is a flowchart illustrating a data processing method according to an embodiment of the present disclosure;
FIG. 12B is a flowchart illustrating a data processing method according to an embodiment of the present disclosure;
FIG. 13 is a flowchart of a data processing method according to an embodiment of the present application;
FIG. 14 is a schematic diagram of a data processing apparatus according to an embodiment of the present disclosure;
FIG. 15 is a schematic diagram of a data processing apparatus according to an embodiment of the present disclosure;
fig. 16 is a schematic structural diagram of an execution device according to an embodiment of the present application;
FIG. 17 is a schematic structural diagram of a training device according to an embodiment of the present disclosure;
fig. 18 is a schematic structural diagram of a chip according to an embodiment of the present application.
Detailed Description
Embodiments of the present invention are described below with reference to the accompanying drawings. The terminology used in describing the embodiments of the invention is intended only to describe particular embodiments and is not intended to limit the invention.
Embodiments of the present application are described below with reference to the accompanying drawings. As those of ordinary skill in the art can appreciate, as technology develops and new scenarios emerge, the technical solutions provided in the embodiments of the present application remain applicable to similar technical problems.
The terms "first", "second" and the like in the description, the claims, and the above-described figures of the present application are used to distinguish between similar objects and are not necessarily used to describe a particular sequence or chronological order. It is to be understood that terms so used are interchangeable under appropriate circumstances and are merely a way of distinguishing objects of the same nature when describing the embodiments of the application. Furthermore, the terms "comprises", "comprising", and "having", and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such a process, method, article, or apparatus.
The terms "basic," "about," and the like are used herein as approximate terms, rather than as degree terms, and are intended to take into account inherent deviations in measured or calculated values that would be known to one of ordinary skill in the art. Furthermore, the use of "may" in describing embodiments of the present invention refers to "one or more embodiments that may be possible". The terms "use", "used", and "used" as used herein may be regarded as synonymous with the terms "utilized", "utilizing", and "utilized", respectively. In addition, the term "exemplary" is intended to refer to an instance or illustration.
Referring to fig. 1A, fig. 1A shows a schematic structural diagram of the artificial intelligence main body framework, which is described below from the two dimensions of the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects the series of processes from data acquisition to processing. For example, it may include the general procedures of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision making, and intelligent execution and output. In this process, the data undergoes a "data-information-knowledge-wisdom" condensation process. The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure and information (provision and processing by technical implementations) to the industrial ecology of the system.
(1) Infrastructure of
The infrastructure provides computing capability support for the artificial intelligence system, realizes communication with the outside world, and realizes support through the base platform. Communicating with the outside through the sensor; the computing power is provided by a smart chip (CPU, NPU, GPU, ASIC, FPGA and other hardware acceleration chips); the basic platform comprises a distributed computing framework, a network and other relevant platform guarantees and supports, and can comprise cloud storage, computing, interconnection and interworking networks and the like. For example, the sensor and external communication obtains data that is provided to a smart chip in a distributed computing system provided by the base platform for computation.
(2) Data
The data of the upper layer of the infrastructure is used to represent the data source in the field of artificial intelligence. The data relate to graphics, images, voice and text, and also relate to the internet of things data of the traditional equipment, including service data of the existing system and sensing data such as force, displacement, liquid level, temperature, humidity and the like.
(3) Data processing
Data processing typically includes data training, machine learning, deep learning, searching, reasoning, decision making, and the like.
Wherein machine learning and deep learning can perform symbolized and formalized intelligent information modeling, extraction, preprocessing, training and the like on data.
Reasoning refers to the process of simulating human intelligent reasoning modes in a computer or an intelligent system, and carrying out machine thinking and problem solving by using formal information according to a reasoning control strategy, and typical functions are searching and matching.
Decision making refers to the process of making decisions after intelligent information is inferred, and generally provides functions of classification, sequencing, prediction and the like.
(4) General capability
After the data has been processed, some general-purpose capabilities can be formed based on the result of the data processing, such as algorithms or a general-purpose system, for example, translation, text analysis, computer vision processing, speech recognition, image recognition, etc.
(5) Intelligent product and industry application
Intelligent products and industry applications refer to products and applications of the artificial intelligence system in various fields; they are the encapsulation of the overall artificial intelligence solution and achieve practical deployment by turning intelligent information into decisions. The application fields mainly include: intelligent terminals, intelligent transportation, intelligent medical treatment, autonomous driving, smart cities, and the like.
The embodiment of the application can be applied to the optimization design of the network structure of the neural network, and the neural network trained through the application can be particularly applied to various subdivision fields in the artificial intelligence field, such as the image processing field, the computer vision field, the natural language processing field and the like, and can be particularly used for image classification, image segmentation, target detection, image super-resolution reconstruction and the like.
Next, taking image processing as an example, application scenarios of the present application may include, but are not limited to, an application program having an image processing function and a cloud service provided by a cloud-side server. These are described separately below:
1. image processing class application program
The product form of the embodiment of the application can be an image processing application program. The image processing class application may run on a terminal device or a server on the cloud side.
In one possible implementation, the image processing class application may perform tasks such as picture classification, image segmentation, object detection, and image super-resolution reconstruction.
In one possible implementation, a user may open an application installed on a terminal device and having an image processing function, and input an image (may be an active input or may be a passive acquisition, for example, acquired by a camera on the terminal device), where the image processing application may obtain a processing result based on input data (image) by using a method provided by an embodiment of the present application, and present the processing result to the user (a presentation manner may be, but is not limited to, displaying, saving, uploading to a cloud side, etc.).
In one possible implementation, a user may open an image processing application installed on the terminal device and input an image (may be actively input or may be passively acquired, for example, acquired by a camera on the terminal device), where the image processing application may send the image to a cloud side server, and the cloud side server generates an image processing result based on the image by using a method provided by the embodiment of the present application and returns the image processing result to the terminal device, where the terminal device may present the image processing result to the user (a presentation manner may be, but not limited to, displaying, saving, uploading to the cloud side, etc.).
The image processing class application in the embodiments of the present application is next described separately from the functional architecture and the product architecture that implements the functions.
Referring to fig. 1B, fig. 1B is a schematic functional architecture of an image processing application in an embodiment of the present application:
in one possible implementation, as shown in FIG. 1B, an image processing class application 102 may receive input parameters 101 (e.g., images) and generate processing results 103. The image processing class application 102 is executable on at least one computer system, for example, and includes computer code which, when executed by one or more computers, causes the computers to perform the data processing methods described herein.
Referring to fig. 2, fig. 2 is a schematic entity architecture for running an image processing application in an embodiment of the present application:
referring to fig. 2, fig. 2 shows a schematic diagram of a system architecture. The system may include a terminal 100 and a server 200. Wherein the server 200 may comprise one or more servers (illustrated in fig. 2 as comprising one server as an example), the server 200 may provide virtual man-generating services for one or more terminals.
The terminal 100 may install an image processing application program, or open a web page related to image processing; the application program and the web page may provide an image processing interface. The terminal 100 may receive relevant parameters input by a user on this interface and send the parameters to the server 200, and the server 200 may obtain a processing result based on the received parameters and return the processing result to the terminal 100.
It should be understood that, in some alternative implementations, the terminal 100 may also obtain the data processing result based on the received parameters by itself, without requiring a server to cooperate; this is not limited by the embodiments of the present application.
Next, the product form of the terminal 100 of fig. 2 will be described;
the terminal 100 in the embodiment of the present application may be a mobile phone, a tablet computer, a wearable device, a vehicle-mounted device, an augmented reality (augmented reality, AR)/Virtual Reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (personal digitalassistant, PDA), or the like, which is not limited in any way.
Fig. 3 shows an alternative hardware configuration of the terminal 100.
Referring to fig. 3, the terminal 100 may include a radio frequency unit 110, a memory 120, an input unit 130, a display unit 140, a camera 150 (optional), an audio circuit 160 (optional), a speaker 161 (optional), a microphone 162 (optional), a processor 170, an external interface 180, a power supply 190, and the like. It will be appreciated by those skilled in the art that fig. 3 is merely an example of a terminal or multifunction device and is not limiting of the terminal or multifunction device and may include more or fewer components than shown, or may combine certain components, or different components.
The input unit 130 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the portable multifunction device. In particular, the input unit 130 may comprise a touch screen 131 (optional) and/or other input devices 132. The touch screen 131 may collect touch operations on or near the user (e.g., operations of the user on or near the touch screen using any suitable object such as a finger, a joint, a stylus, etc.), and drive the corresponding connection means according to a preset program. The touch screen can detect the touch action of a user on the touch screen, convert the touch action into a touch signal, send the touch signal to the processor 170, and receive and execute a command sent by the processor 170; the touch signal includes at least touch point coordinate information. The touch screen 131 may provide an input interface and an output interface between the terminal 100 and a user. In addition, the touch screen may be implemented in various types such as resistive, capacitive, infrared, and surface acoustic wave. The input unit 130 may include other input devices in addition to the touch screen 131. In particular, other input devices 132 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys 132, switch keys 133, etc.), a trackball, mouse, joystick, etc.
Wherein the input device 132 may receive an input image, etc.
The display unit 140 may be used to display information input by a user or information provided to the user, various menus of the terminal 100, an interactive interface, file display, and/or play of any of the multimedia files. In the embodiment of the present application, the display unit 140 may be used to display an interface of an image processing-type application program, or the like.
The memory 120 may be used to store instructions and data. The memory 120 may mainly include an instruction storage area and a data storage area, where the data storage area may store various data such as multimedia files and text, and the instruction storage area may store software elements such as an operating system, applications, instructions required for at least one function, or subsets and extension sets thereof. The memory 120 may also include a non-volatile random access memory and provide the processor 170 with management of the hardware, software, and data resources in the computing processing device, as well as support for control software and applications. The memory 120 is also used for storing multimedia files and for storing running programs and applications.
The processor 170 is a control center of the terminal 100, connects various parts of the entire terminal 100 using various interfaces and lines, and performs various functions of the terminal 100 and processes data by executing or executing instructions stored in the memory 120 and calling data stored in the memory 120, thereby controlling the terminal device as a whole. Optionally, the processor 170 may include one or more processing units; preferably, the processor 170 may integrate an application processor and a modem processor, wherein the application processor primarily handles operating systems, user interfaces, application programs, etc., and the modem processor primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 170. In some embodiments, the processor, memory, may be implemented on a single chip, or they may be implemented separately on separate chips in some embodiments. The processor 170 may be further configured to generate corresponding operation control signals to corresponding components of the computing processing device, and to read and process data in the software, and in particular, to read and process data and programs in the memory 120, so that each functional module therein performs a corresponding function, thereby controlling the corresponding components to act as required by the instructions.
The memory 120 may be used for storing software codes related to a data processing method, and the processor 170 may execute steps of the data processing method of the chip, or may schedule other units (such as the input unit 130 and the display unit 140) to implement corresponding functions.
The radio frequency unit 110 (optional) may be configured to receive and send information, or to receive and send signals during a call; for example, after receiving downlink information from a base station, it passes the downlink information to the processor 170 for processing, and it also sends uplink data to the base station. Typically, RF circuitry includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (Low Noise Amplifier, LNA), a duplexer, and the like. In addition, the radio frequency unit 110 may also communicate with network devices and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including, but not limited to, the global system for mobile communications (Global System of Mobile communication, GSM), general packet radio service (General Packet Radio Service, GPRS), code division multiple access (Code Division Multiple Access, CDMA), wideband code division multiple access (Wideband Code Division Multiple Access, WCDMA), long term evolution (Long Term Evolution, LTE), email, short message service (Short Messaging Service, SMS), and the like.
In this embodiment, the radio frequency unit 110 may send the image to the server 200 and receive the information of the processing result sent by the server 200.
It should be appreciated that the radio unit 110 is optional and may be replaced with other communication interfaces, such as a portal.
The terminal 100 also includes a power supply 190 (e.g., a battery) for powering the various components, which may be logically connected to the processor 170 via a power management system, such as a power management system that performs functions such as charge, discharge, and power consumption management.
The terminal 100 further includes an external interface 180, which may be a standard Micro USB interface, or a multi-pin connector, which may be used to connect the terminal 100 to communicate with other devices, or may be used to connect a charger to charge the terminal 100.
Although not shown, the terminal 100 may further include a flash, a wireless fidelity (wireless fidelity, wiFi) module, a bluetooth module, sensors of different functions, etc., which will not be described herein. Some or all of the methods described hereinafter may be applied to the terminal 100 as shown in fig. 3.
Next, the product form of the server 200 in fig. 2 will be described;
Fig. 4 provides a schematic structural diagram of a server 200, and as shown in fig. 4, the server 200 includes a bus 201, a processor 202, a communication interface 203, and a memory 204. Communication between processor 202, memory 204, and communication interface 203 is via bus 201.
Bus 201 may be a peripheral component interconnect standard (peripheral component interconnect, PCI) bus or an extended industry standard architecture (extended industry standard architecture, EISA) bus, among others. The buses may be divided into address buses, data buses, control buses, etc. For ease of illustration, only one thick line is shown in fig. 4, but not only one bus or one type of bus.
The processor 202 may be any one or more of a central processing unit (central processing unit, CPU), a graphics processing unit (graphics processing unit, GPU), a microprocessor (MP), or a digital signal processor (digital signal processor, DSP).
The memory 204 may include volatile memory (volatile memory), such as random access memory (random access memory, RAM). The memory 204 may also include a non-volatile memory (non-volatile memory), such as a read-only memory (read-only memory, ROM), a flash memory, a mechanical hard disk (hard disk drive, HDD), or a solid state disk (solid state drive, SSD).
The memory 204 may be used for storing software codes related to a data processing method, and the processor 202 may execute steps of the data processing method of the chip, or may schedule other units to implement corresponding functions.
It should be appreciated that the terminal 100 and the server 200 may be centralized or distributed devices, and the processors (e.g., the processor 170 and the processor 202) in the terminal 100 and the server 200 may be hardware circuits (such as an application specific integrated circuit (application specific integrated circuit, ASIC), a field-programmable gate array (field-programmable gate array, FPGA), a general purpose processor, a digital signal processor (digital signal processing, DSP), a microprocessor, or a microcontroller), or a combination of these hardware circuits. For example, the processor may be a hardware system with an instruction execution function, such as a CPU or DSP, or a hardware system without an instruction execution function, such as an ASIC or FPGA, or a combination of a hardware system without an instruction execution function and a hardware system with an instruction execution function.
It should be understood that the steps related to the model reasoning process in the embodiments of the present application relate to AI-related operations, and the instruction execution architecture of the terminal device and the server is not limited to the architecture of the processor combined with the memory described above when performing AI operations. The system architecture provided in the embodiment of the present application is described in detail below with reference to fig. 5.
Referring to fig. 5, fig. 5 is a system architecture diagram of a task processing system provided in an embodiment of the present application, in fig. 5, a task processing system 500 includes an execution device 510, a training device 520, a database 530, a client device 540, a data storage system 550, and a data acquisition device 560, where the execution device 510 includes a computing module 511. The data collection device 560 is configured to obtain a large-scale data set (i.e., a training set) of open sources required by a user, store the training set in the database 530, train the target model/rule 501 based on the training set maintained in the database 530 by the training device 520, and then use the trained neural network on the execution device 510. The execution device 510 may invoke data, code, etc. in the data storage system 550, or may store data, instructions, etc. in the data storage system 550. The data storage system 550 may be disposed in the execution device 510, or the data storage system 550 may be an external memory with respect to the execution device 510.
The trained neural network obtained after the training device 520 trains the target model/rule 501 may be applied to different systems or devices (i.e. the execution device 510), specifically an edge device or an end device, for example, a mobile phone, a tablet, a notebook, a monitoring system (such as a camera), a security system, and the like. In fig. 5, the execution device 510 is configured with an I/O interface 512 for data interaction with external devices, and a "user" may input data to the I/O interface 512 through the client device 540. For example, the client device 540 may be an image capturing device of a monitoring system; the target image captured by the image capturing device is input as input data to the computing module 511 of the execution device 510, the computing module 511 detects the input target image to obtain a detection result, and the detection result is then output to the image capturing device or displayed directly on the display interface (if any) of the execution device 510. In addition, in some embodiments of the present application, the client device 540 may also be integrated in the execution device 510. For example, when the execution device 510 is a mobile phone, the mobile phone may directly obtain the target task (for example, the target task may be obtained by shooting a target image through a camera of the mobile phone, or by recording a target voice through a recording module of the mobile phone, etc., which is not limited here), or receive the target task sent by another device (for example, another mobile phone); the computing module 511 in the mobile phone then detects the target task to obtain a detection result and presents the detection result directly on the display interface of the mobile phone. The product forms of the execution device 510 and the client device 540 are not limited herein.
It should be noted that fig. 5 is only a schematic diagram of a system architecture provided in the embodiments of the present application, and the positional relationship among devices, apparatuses, modules, etc. shown in the drawings is not limited in any way, for example, in fig. 5, the data storage system 550 is an external memory with respect to the execution device 510, and in other cases, the data storage system 550 may be disposed in the execution device 510. It should be appreciated that the execution device 510 described above may be deployed in a client device 540.
From the reasoning side of the model:
in this embodiment, the computing module 511 of the execution device 510 may obtain codes stored in the data storage system 550 to implement the steps related to the model reasoning process in this embodiment of the present application.
In this embodiment, the computing module 511 of the execution device 510 may include a hardware circuit (such as an application specific integrated circuit (application specific integrated circuit, ASIC), a field-programmable gate array (field-programmable gate array, FPGA), a general purpose processor, a digital signal processor (digital signal processing, DSP), a microprocessor, or a microcontroller, etc.), or a combination of these hardware circuits. For example, the computing module 511 may be a hardware system with an instruction execution function, such as a CPU, a DSP, etc., or a hardware system without an instruction execution function, such as an ASIC, an FPGA, etc., or a combination of a hardware system without an instruction execution function and a hardware system with an instruction execution function.
Specifically, the computing module 511 of the execution device 510 may be a hardware system with an instruction execution function; the steps related to the model reasoning process provided in the embodiments of the present application may be software codes stored in a memory, and the computing module 511 of the execution device 510 may obtain the software codes from the memory and execute them to implement the steps related to the model reasoning process provided in the embodiments of the present application.
It should be understood that the computing module 511 of the execution device 510 may be a combination of a hardware system that does not have an instruction execution function and a hardware system that has an instruction execution function, and some of the steps related to the model reasoning process provided in the embodiments of the present application may also be implemented by the hardware system that does not have an instruction execution function in the computing module 511 of the execution device 510, which is not limited herein.
From the training side of the model:
in this embodiment of the present application, the training device 520 may obtain codes stored in a memory (not shown in fig. 5, and may be integrated into the training device 520 or disposed separately from the training device 520) to implement the steps related to model training in this embodiment of the present application.
In this embodiment, the training device 520 may include hardware circuits (such as an application specific integrated circuit (application specific integrated circuit, ASIC), a field-programmable gate array (field-programmable gate array, FPGA), a general purpose processor, a digital signal processor (digital signal processing, DSP), a microprocessor, or a microcontroller, etc.), or a combination of these hardware circuits. For example, the training device 520 may be a hardware system with an instruction execution function, such as a CPU or a DSP, or a hardware system without an instruction execution function, such as an ASIC or an FPGA, or a combination of a hardware system without an instruction execution function and a hardware system with an instruction execution function.
It should be understood that, the training device 520 may be a combination of a hardware system without an instruction execution function and a hardware system with an instruction execution function, and some steps related to training a model provided in the embodiment of the present application may also be implemented by a hardware system without an instruction execution function in the training device 520, which is not limited herein.
2. Image processing cloud services provided by a server:
in one possible implementation, the server may provide a virtual person generation service to the end side through an application programming interface (application programming interface, API).
The terminal device may send relevant parameters (such as images) to the server through an API provided by the cloud, and the server may obtain a processing result based on the received parameters, and return the processing result to the terminal.
For descriptions of the terminal and the server, reference may be made to the above embodiments, and details are not repeated here.
Fig. 6 shows a flow of an image processing cloud service provided using a cloud platform.
1. Open and purchase the content auditing service.
2. The user can download a software development kit (software development kit, SDK) corresponding to the content auditing service, and generally the cloud platform provides a plurality of development versions of SDKs for the user to select according to requirements of a development environment, for example, a JAVA version of SDK, a python version of SDK, a PHP version of SDK, an Android version of SDK, and the like.
3. After downloading the SDK of the corresponding version to the local device as required, the user imports the SDK project into the local development environment, configures and debugs it there, and develops other functions in the local development environment, so as to form an application that integrates the image processing capability.
4. When a virtual person needs to be generated during use, the image processing application may trigger an API call for virtual person generation. When the application triggers the virtual person generation function, an API request is initiated to a running instance of the image processing service in the cloud environment, where the API request carries an image, and the running instance in the cloud environment processes the image to obtain a processing result.
5. The cloud environment returns the processing result to the application, and one virtual person generation service call is thus completed.
3. Model search cloud services provided by a server:
in one possible implementation, the server may provide a binary model that meets the performance requirements by model search based on the performance requirements of the model provided by the client.
In one possible implementation, the server may provide the end side with a service for model searching through an application programming interface (application programming interface, API).
The terminal device may send relevant parameters (e.g., performance requirements of the model) to the server through an API provided by the cloud, and the server may obtain a processing result based on the received parameters, and return the processing result (e.g., a binary model meeting the performance requirements) to the terminal.
Fig. 7 illustrates a flow of a model search cloud service provided using a cloud platform.
For descriptions of the terminal and the server, reference may be made to the above embodiments, and details are not repeated here.
Since the embodiments of the present application relate to a large number of applications of neural networks, for ease of understanding, related terms and related concepts of the neural networks related to the embodiments of the present application will be described below.
(1) Neural network
The neural network may be composed of neural units. A neural unit may refer to an arithmetic unit that takes x_s and an intercept 1 as inputs, and the output of the arithmetic unit may be:
h_{W,b}(x) = f(W^T x) = f( Σ_{s=1}^{n} W_s · x_s + b )
where s = 1, 2, …, n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit. f is an activation function (activation function) of the neural unit, which is used to introduce a nonlinear characteristic into the neural network so as to convert the input signal of the neural unit into an output signal. The output signal of the activation function may serve as the input of the next convolutional layer. The activation function may be a sigmoid function. A neural network is a network formed by connecting many of the above single neural units together, that is, the output of one neural unit may be the input of another. The input of each neural unit may be connected to a local receptive field of the previous layer to extract features of that local receptive field; a local receptive field may be an area composed of several neural units.
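Purely as an illustration of the neural-unit formula above (and not part of the embodiments), the computation can be sketched in Python; the use of NumPy and the example numbers are assumptions made for the sketch.

```python
import numpy as np

def sigmoid(z):
    # Activation function f; a sigmoid is used here, as mentioned above.
    return 1.0 / (1.0 + np.exp(-z))

def neural_unit(x, w, b):
    # Output of a single neural unit: f(sum_s W_s * x_s + b)
    return sigmoid(np.dot(w, x) + b)

# Hypothetical example with n = 3 inputs.
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.3])
b = 1.0  # bias / intercept
print(neural_unit(x, w, b))
```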
(2) Deep neural network
Deep neural networks (Deep Neural Network, DNN) can be understood as neural networks with many hidden layers; there is no particular metric for "many", and what we call multi-layer neural networks and deep neural networks are essentially the same thing. According to the positions of the different layers, the layers inside a DNN can be divided into three categories: the input layer, the hidden layers, and the output layer. Typically the first layer is the input layer, the last layer is the output layer, and all layers in between are hidden layers. The layers are fully connected, that is, any neuron in the i-th layer is connected to any neuron in the (i+1)-th layer. Although a DNN looks complex, the work of each layer is not complex; it is simply the following linear relational expression: y = α(W·x + b), where x is the input vector, y is the output vector, b is the offset vector, W is the weight matrix (also called the coefficients), and α() is the activation function. Each layer simply performs this operation on the input vector x to obtain the output vector y. Since a DNN has many layers, the number of coefficient matrices W and offset vectors b is also large. These parameters are defined in the DNN as follows, taking the coefficient W as an example. In a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as W_{24}^{3}; the superscript 3 represents the layer in which the coefficient W is located, and the subscripts correspond to the output index 2 of the third layer and the input index 4 of the second layer. In summary, the coefficient from the k-th neuron of the (L-1)-th layer to the j-th neuron of the L-th layer is defined as W_{jk}^{L}. Note that the input layer has no W parameters. In deep neural networks, more hidden layers make the network better able to characterize complex situations in the real world. Theoretically, a model with more parameters has higher complexity and a larger "capacity", which means it can accomplish more complex learning tasks.
(3) A convolutional neural network (Convolutional Neural Network, CNN) is a deep neural network with a convolutional structure. The convolutional neural network includes a feature extractor consisting of a convolutional layer and a sub-sampling layer. The feature extractor can be seen as a filter, and the convolution process can be seen as convolving an input image or a convolutional feature plane (feature map) with a trainable filter. The convolutional layer refers to a neuron layer in the convolutional neural network that performs convolution processing on an input signal. In a convolutional layer of a convolutional neural network, one neuron may be connected to only some of the neurons in adjacent layers. A convolutional layer typically contains several feature planes, and each feature plane may be composed of a number of neural units arranged in a rectangular pattern. Neural units of the same feature plane share weights, and the shared weights are the convolution kernel. Sharing weights can be understood as meaning that the way image information is extracted is independent of location. The underlying principle is that the statistics of one part of an image are the same as those of other parts, which means that image information learned in one part can also be used in another part, so the same learned image information can be used for all locations on the image. In the same convolutional layer, a plurality of convolution kernels may be used to extract different image information; in general, the greater the number of convolution kernels, the richer the image information reflected by the convolution operation.
The convolution kernel can be initialized in the form of a matrix with random size, and reasonable weight can be obtained through learning in the training process of the convolution neural network. In addition, the direct benefit of sharing weights is to reduce the connections between layers of the convolutional neural network, while reducing the risk of overfitting.
(4) Back propagation algorithm
The convolutional neural network can use a back propagation (back propagation, BP) algorithm to correct the parameters in the initial super-resolution model during training, so that the reconstruction error loss of the super-resolution model becomes smaller and smaller. Specifically, the input signal is passed forward until the output produces an error loss, and the parameters in the initial super-resolution model are updated by back-propagating the error loss information, so that the error loss converges. The back propagation algorithm is a back propagation motion dominated by the error loss, and aims to obtain the parameters of the optimal super-resolution model, such as the weight matrix.
(5) Loss function
In training a deep neural network, since we expect the output of the network to be as close as possible to the value we actually want to predict, the weight vector of each layer can be updated according to the difference between the predicted value of the current network and the actually desired target value (of course, there is usually an initialization process before the first update, that is, parameters are pre-configured for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted so that the prediction becomes lower, and the adjustment continues until the deep neural network can predict the actually desired target value or a value very close to it. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value", which leads to the loss function (loss function) or objective function (objective function); these are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, the higher its output value (loss), the larger the difference, so training the deep neural network becomes a process of reducing this loss as much as possible.
(6) Multi-layer perceptron (MLP)
A multi-layer perceptron is a type of deep neural network; it is a feed-forward neural network that consists of fully connected operations and has a deep structure.
(7) Binary neural networks (binary neural network BNN)
Traditional deep neural networks use FP32 precision to represent values and perform computation, whereas a BNN uses 1-bit precision to represent the weights of the neural network and to perform the forward computation of the network, which can bring large gains in power consumption, storage, latency, and the like. Binary convolutional neural networks (Binary CNNs) and binary multi-layer perceptrons (Binary MLPs) may be collectively referred to as binary neural networks (BNNs); unless otherwise specified below, BNN refers to a Binary MLP.
Before the embodiments of the present application are described, a simple description is first made of the technology of binarizing the present neural network and related background, so as to facilitate the subsequent understanding of the embodiments of the present application.
In the deep learning field, applications of neural networks are ubiquitous. The central processing unit (Central Processing Unit, CPU) has gradually become unable to meet the high-concurrency and high-computation requirements of various deep neural networks (such as convolutional neural networks (Convolutional Neural Networks, CNN)). The graphics processing unit (Graphics Processing Unit, GPU) can partially solve the problems of high concurrency and high computation, but its application on mobile terminals (including end-side devices and edge devices) is limited due to its large power consumption, high price and other reasons, so high-end GPUs for training, testing and applying neural networks are generally purchased at the enterprise level or by research institutes. Currently, some mobile phone chips already integrate a neural network processor (NPU), but how to achieve a balance between power consumption and performance is still a problem to be solved.
The main technical problems limiting the application of deep neural networks on mobile terminals are: 1) the amount of computation is too large; 2) the number of parameters of the neural network is too large. Taking CNN as an example, the amount of computation of the convolution operation is huge: one convolution kernel can contain hundreds of thousands of parameters, the number of floating point operations (FLOPs) of a convolution operation can reach tens of millions, and the total amount of computation of a common n-layer CNN can reach hundreds of billions of FLOPs. A CNN that runs in real time on a GPU becomes very slow on a mobile terminal, so the amount of computation of the convolution has to be reduced, given that the computing resources of a mobile terminal can hardly support real-time operation of existing CNNs. In addition, in the CNNs commonly used at present, the number of parameters of each convolutional layer can often reach tens of thousands, hundreds of thousands or even more, and the parameters of all n layers of the whole network add up to an even larger amount; since each parameter is represented by a 32-bit floating point number, hundreds of megabytes of memory or cache are needed to store the parameters, whereas memory and cache resources on a mobile terminal are very limited. How to reduce the number of parameters of the convolutional layers so that the CNN can adapt to mobile-terminal devices is therefore a key question, and it is in this context that the binary neural network (Binary Neural Network, BNN) emerged.
Currently, the commonly used BNN is to perform binarization processing on the weight and the activation value of the neural network on the basis of the existing neural network, namely, the value of each weight in the weight matrix of each layer of the original neural network and the activation value of each layer of the neural network are assigned to be one of +1 and-1 or one of +1 and 0. BNN does not change the network structure of the original neural network, and mainly performs some optimization treatments on gradient descent, weight updating and convolution operation. Obviously, the binarization processing is carried out on the weights of the neural network, so that the storage space occupied by the weights is reduced, and complex multiplication operation is changed into addition and subtraction operation, thereby reducing the operation amount and improving the operation speed; similarly, the operation amount can be reduced and the operation speed can be increased by performing binarization processing on the activation value of the neural network.
Currently, there are two main binarization methods. The first is a deterministic method based on the Sign function (also called the sign function), and the second is a stochastic method (also called a statistical method). Theoretically, the second method is more reasonable, but in practice it requires the hardware to generate random numbers, which is difficult. Therefore, the second method is not yet used in practical applications, and the first method is adopted, i.e., binarization is performed through the Sign function.
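As an informal sketch of the deterministic Sign-function binarization described above (assuming PyTorch tensors; the handling of zero values and the helper name are choices made for the sketch, not details from the embodiments):

```python
import torch

def binarize(t: torch.Tensor) -> torch.Tensor:
    # Deterministic binarization with the Sign function: every value is mapped to +1 or -1
    # (zeros are mapped to +1 here; this choice is an assumption of the sketch).
    return torch.where(t >= 0, torch.ones_like(t), -torch.ones_like(t))

# Hypothetical example: binarize a weight matrix and an activation vector.
w = torch.randn(4, 4)
a = torch.randn(1, 4)
w_b, a_b = binarize(w), binarize(a)
# With +/-1 operands, the matrix multiplication reduces to additions and subtractions
# (or XNOR + popcount on suitable hardware).
y = a_b @ w_b
```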
However, the current neural network binarization methods are all proposed for the CNN structure, and directly applying them to the MLP network structure brings a very large performance degradation. This is because convolution operations in CNN networks typically use a larger convolution kernel size (≥3×3), while the FC layer in an MLP network can be regarded as a convolution operation with a 1×1 convolution kernel. When binarized, the output of a large convolution kernel is far more expressive than that of a small convolution kernel. Table 1 shows, for different network structures, the accuracy drop of the binarized network compared with the full-precision network, together with the kernel size of the convolution kernels used in the network (note that the FC layer is regarded as a convolution with kernel_size = 1). As can be seen from Table 1, the smaller the convolution kernel, the greater the loss.
TABLE 1
Network Kernel size Performance drop
WaveMLP 1 22%
ResNet-18 3 17%
AlexNet 11&5 13%
Next, a detailed description will be given of why a serious degradation of accuracy occurs when the existing binarization method is directly applied to the MLP:
first two definitions are given:
1. computational complexity (computational complexity, CC):
given inputConvolution kernel->Wherein K is h And K w Being the height and width of the convolution kernel, H and W are the height and width of the input feature, c_in and c_out are the number of channels of the input feature and the output feature, the computational complexity can be defined as:
2. Expression ability (representation ability, RA):
given the input F and convolution kernel K described above, the output of the convolution can be defined as:
wherein F is b And K b For the binarized input and convolution kernel weights, d_x and d_y are offsets relative to the center point.
Thus, each element in Y may only beOne of n+1 different numbers, where n=c in ·K h ·K w
N is defined as representation ability of the binarized convolution.
Table 2 below compares the CC and RA of the FC layer in an MLP network (which can be seen as a 1×1 convolution) with those of a normal convolutional layer in a CNN network (a k×k convolution).
TABLE 2
in_channel out_channel Kernel size CC RA
C_in C_out 1×1 1 1
C_in C_out k×k k^2 k^2
k^2·C_in k^2·C_out 1×1 k^4 k^2
As shown in Table 2, when the numbers of input and output channels are the same, the CC and RA of a k×k binarized convolution are each k^2 times those of a 1×1 convolution.
In order for the RA of a 1×1 convolution to be the same as that of a k×k convolution, the most straightforward way is to expand both the input and output channel numbers by k^2 times. This brings the problem that the CC of the 1×1 convolution becomes k^2 times that of the k×k convolution, greatly increasing the amount of computation. Therefore, when the conventional binarization method is directly applied to the MLP, serious accuracy degradation occurs.
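A small numeric check of the relative CC and RA values in Table 2, using the definitions reconstructed above (the common H·W factor cancels in the ratios); the concrete channel count and kernel size are arbitrary example values.

```python
# Relative computational complexity and representation ability, following
# CC ~ K_h*K_w*C_in*C_out and N = C_in*K_h*K_w as defined above.
def cc(kh, kw, c_in, c_out):
    return kh * kw * c_in * c_out

def ra(kh, kw, c_in):
    return c_in * kh * kw

c_in, c_out, k = 64, 64, 3
base = cc(1, 1, c_in, c_out)                     # 1x1 convolution
print(cc(k, k, c_in, c_out) / base)              # k x k conv: k^2 = 9 times the CC
print(cc(1, 1, k*k*c_in, k*k*c_out) / base)      # widened 1x1 conv: k^4 = 81 times the CC
print(ra(k, k, c_in) / ra(1, 1, c_in))           # RA ratio of k x k conv: k^2 = 9
print(ra(1, 1, k*k*c_in) / ra(1, 1, c_in))       # RA ratio of widened 1x1 conv: k^2 = 9
```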
Therefore, the embodiments of the present application provide a data processing method, which relates to a new architecture for MLP binarization. It mainly solves the problem that serious accuracy degradation occurs when the existing binarization methods are directly applied to the MLP, and can improve the accuracy of the binary MLP without increasing (or while hardly increasing) the amount of computation.
In order to solve the above problems, embodiments of the present application provide a structure of a binarized MLP and a corresponding data processing method. The architecture of the binarized MLP according to the embodiments of the present application is described in detail below with reference to the accompanying drawings.
Referring to fig. 8, fig. 8 is a schematic architecture of a binarized MLP provided in an embodiment of the present application, where the MLP may include a plurality of blocks, and the blocks may be connected in series, where the MLP may generally include a plurality of stages (stages), each stage may include a plurality of blocks connected in series, sizes of data processed by the blocks in different stages may be different, and a downsampling module may be included between adjacent stages, so as to reduce the sizes of the data.
It should be understood that the blocks shown in fig. 8 may also be subdivided into blocks (as shown in fig. 9).
Taking one block of a plurality of blocks as an example, the block may include a plurality of first processing branches connected in parallel, where the parallel connection may be understood as that the plurality of first processing branches process the same input data and perform data fusion at an output end; when data processing is performed, the input data can be processed through each first processing branch in the plurality of first processing branches, so as to obtain a plurality of first processing results.
In one possible implementation, each of the first processing branches includes one or more fully-connected layers, and the input data and parameters in the plurality of first processing branches are binarized data. As shown in fig. 9, a batch normalization layer may be connected before the first processing branches, and the input data of the first processing branches may be data output by the batch normalization layer.
In one possible implementation, at least one of the plurality of first processing branches is configured to interact with the input data in a spatial dimension.
Illustratively, the first processing branch for interacting the input data in the spatial dimension may be referred to as a Spatial Binary FC, or a Spatial Binary MLP formed by stacking a plurality of Spatial Binary FCs; alternatively, an MLP module proposed in any existing MLP network, such as CycleMLP or WaveMLP, may be used. The definition of the Spatial Binary FC may be:
Y_SB_FC = RPReLU(BN(LFC(X_b)) + U(X_b));
where the definition of RPReLU may be exemplified as follows:
RPReLU(x) = x − γ + ζ, if x > γ; RPReLU(x) = β·(x − γ) + ζ, if x ≤ γ
A schematic diagram of the RPReLU function is shown in fig. 10, where γ and ζ are the coordinates of the shift away from the origin and β represents the slope. X_b is the input of the Spatial Binary FC, BN() is a standard batch normalization operation, and LFC() is a local FC (local fully connected layer), for which any local FC operation proposed in an existing MLP network can be used, such as the CycleFC operation proposed in CycleMLP or the WaveFC operation proposed in WaveMLP. U() is a Uni-shortcut operation, whose definition is given in the following embodiments (the Uni-shortcut is optional; it may be replaced by an ordinary shortcut or omitted, but this causes performance degradation; the same applies to U() in the Channel Binary FC below).
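As an informal illustration (not part of the embodiments), the RPReLU activation described above can be sketched in PyTorch-style Python as follows; the per-channel parameterization and the initial values are assumptions made for the sketch.

```python
import torch
import torch.nn as nn

class RPReLU(nn.Module):
    # Sketch of RPReLU: the input is shifted by gamma, a learnable slope beta is applied
    # to the non-positive part, and the result is shifted by zeta.
    def __init__(self, channels: int):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(channels))
        self.zeta = nn.Parameter(torch.zeros(channels))
        self.beta = nn.Parameter(torch.full((channels,), 0.25))  # illustrative initial slope

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., channels); parameters broadcast over the trailing channel dimension.
        shifted = x - self.gamma
        return torch.where(shifted > 0, shifted, self.beta * shifted) + self.zeta
```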
In one possible implementation, the spatial dimensions of the input data specifically include a width dimension and a height dimension; at least one of the plurality of first processing branches is to interact with input data in the width dimension and at least one of the plurality of first processing branches is to interact with input data in the height dimension.
In one possible implementation, at least one of the plurality of first processing branches is configured to interact with the input data in a channel dimension.
Illustratively, the first processing branch for interacting the input data in the channel dimension may be referred to as a Channel Binary FC, or a Channel Binary MLP formed by stacking a plurality of Channel Binary FCs; alternatively, the Channel Binary FC may be defined as:
Y_CB_FC = RPReLU(GFC(BN(X_b)) + U(X_b));
where GFC() is a global FC (global fully connected layer), i.e., the standard fully connected operation. The definitions of the RPReLU function and of U() are identical to those in the Spatial Binary FC.
Referring to fig. 9, fig. 9 shows an example in which the number of first processing branches is 3: one first processing branch is used to interact with the input data in the width dimension (Spatial Binary MLP), one is used to interact with the input data in the height dimension (Spatial Binary FC), and one is used to interact with the input data in the channel dimension (Channel Binary MLP).
Referring to fig. 11, fig. 11 is a schematic architecture of a binarized MLP in the prior art, where a block in the binarized MLP includes only one processing branch, which either interacts with the input data in the spatial dimension or interacts with the input data in the channel dimension. In this embodiment of the present application, by providing a plurality of parallel processing branches for the block in the binarized MLP, the parallel processing branches can perform information interaction on the input data in the spatial dimension and the channel dimension at the same time, which increases the information interaction complexity of the network and thereby improves the network performance of the MLP.
In one possible implementation, the processing results of the plurality of first processing branches may be fused, which may be, but is not limited to, a matrix addition operation. In addition, the processing results of the plurality of first processing branches may be fused with the input data of the batch normalization layer preceding the plurality of first processing branches.
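For illustration only, the parallel-branch block and the additive fusion described above could be organized roughly as follows (assuming PyTorch; the branch modules are placeholders and the class name is hypothetical).

```python
import torch
import torch.nn as nn

class ParallelBranchBlock(nn.Module):
    # The same batch-normalized input is fed to several parallel branches; the branch
    # outputs and the block input are fused by matrix addition.
    def __init__(self, channels: int, branches=None):
        super().__init__()
        self.bn = nn.BatchNorm1d(channels)
        # Placeholder branches; in the architecture above these would be the width,
        # height and channel interaction branches.
        self.branches = nn.ModuleList(branches or [nn.Identity(), nn.Identity(), nn.Identity()])

    def forward(self, x):                                  # x: (batch, tokens, channels)
        y = self.bn(x.transpose(1, 2)).transpose(1, 2)
        fused = x
        for branch in self.branches:
            fused = fused + branch(y)                      # additive fusion
        return fused
```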
In one possible implementation, the block further includes a plurality of parallel second processing branches. The plurality of parallel second processing branches may be connected after the plurality of parallel first processing branches; or, when a batch normalization layer is connected after the plurality of parallel first processing branches, the second processing branches may be connected after that batch normalization layer. The parameters in the plurality of second processing branches are binarized data.
In one possible implementation, similarly to the first processing branches, the fusion result of the plurality of first processing results may be processed by each of the plurality of second processing branches to obtain a plurality of second processing results (the fusion result may be the fusion of the processing results of the plurality of first processing branches with the input data of the batch normalization layer preceding the plurality of first processing branches; furthermore, the fusion result may be the result after normalization by the batch normalization layer).
In one possible implementation, at least one of the plurality of second processing branches is configured to interact with the fusion result in a spatial dimension; at least one of the plurality of second processing branches is configured to interact with the fusion result in a channel dimension.
Similarly, in one possible implementation, the spatial dimensions of the fusion result specifically include a width dimension and a height dimension; at least one of the plurality of second processing branches is to interact with the fusion result in the width dimension and at least one of the plurality of second processing branches is to interact with the fusion result in the height dimension.
Referring to fig. 9, fig. 9 shows an example in which the number of second processing branches is 3: one second processing branch is used to interact with the input data in the width dimension (Spatial Binary FC), one is used to interact with the input data in the height dimension (Spatial Binary MLP), and one is used to interact with the input data in the channel dimension (Channel Binary FC).
In one possible implementation, the information expression capability of the MLP may be equally distributed over the width, height and channel dimensions, i.e. the first number, the second number and the third number of processing branches in the block are the same; the first number is the number of processing branches included in the block for interacting in the width dimension; the second number is the number of processing branches included in the block for interacting in the height dimension; the third number is the number of processing branches included in the block for interacting in the channel dimension.
In one possible implementation, in order to improve the information expression capability of the model, a universal shortcut (Uni-shortcut) may be introduced on a processing branch of the block (e.g. the first processing branch or the second processing branch). The shortcut can be understood as connecting the input of the branch and the output of an intermediate network layer, where data fusion may be performed. However, in a processing branch of a block in the MLP, the data size of the input data and the data size of the output data of the intermediate network layer are not consistent, while fusion is typically performed on data of the same size. In the embodiment of the present application, data of different sizes can be adjusted to the same size, after which data fusion can be performed.
In one possible implementation, the plurality of first processing branches includes a target processing branch, that is, the target processing branch may be one of the plurality of first processing branches; the target processing branch comprises a plurality of network layers which are connected in series and comprise a first network layer, a second network layer and a third network layer, wherein the second network layer and the third network layer are adjacently connected, and the first network layer can be an FC layer, the second network layer can be a BN layer and the third network layer can be an activation layer.
In one possible implementation, the input data of the first network layer is first input data, and the input data of the third network layer is a fusion result of the second input data and the third input data; the third input data is the data output by the second network layer; wherein the size of the first input data and the size of the third input data are different; the second input data is obtained by adjusting the size of the first input data, and the second input data and the third input data have the same size.
In one possible implementation, the size is the number of channels; the number of channels of the first input data is greater than that of the third input data, and the second input data is obtained by performing data fusion (for example, average value can be taken) on the first input data in a channel dimension; or the number of channels of the first input data is smaller than that of channels of the third input data, and the second input data is obtained by copying the first input data in the channel dimension.
Illustratively, the definition of the Uni-shortcut may be as follows:
U(X_b) = per-group average of X_b in the channel dimension, if C_in = n·C_out; U(X_b) = concatenation of n copies of X_b in the channel dimension, if C_out = n·C_in
That is, when the number of channels of the input data is n times (n greater than 1) the number of channels of the output data, i.e. C_in = n·C_out, the input data X_b is averaged per group of n channels in the channel dimension; when the number of channels of the output data is n times (n greater than 1) the number of channels of the input data, i.e. C_out = n·C_in, the input data is repeated n times and the copies are concatenated together.
In one possible implementation, the MLP may also include a downsampling network. In a full-precision MLP, downsampling is performed using a 3×3, stride = 2 convolution, which is computationally expensive, and in the prior art the downsampling layer is not binarized. In this embodiment of the present application, in order to reduce the computation of the downsampling network, the downsampling network may be set as a network including a convolution module and a plurality of downsampling modules: a convolution operation may be performed on the fusion result of the plurality of first processing results in the channel dimension by the convolution module to obtain a first processing result; a pooling operation is performed on the first processing result in the spatial dimension by each of the plurality of downsampling modules to obtain a plurality of third processing results; and the plurality of third processing results are fused.
In one possible implementation, the convolution operation is a step-size 1 convolution; the pooling operation is pooling with a step size of N, wherein N is greater than 1.
Referring to fig. 12A, the downsampling network shown in fig. 12A may be divided into two parts: the first part is a convolution with a 1×1 kernel and stride 1, intended to transform the channel dimension; the second part consists of four max pooling layers (maxpooling layers) of different kernel sizes, all with a stride of 2, intended to downsample the features in the spatial dimension.
This structural design can also be implemented in various other ways: one part transforms the channel dimension (which may be a convolution operation of a different size with stride = 1), and the other part downsamples in the spatial dimension (if n-fold downsampling is needed, max pooling or average pooling operations of different sizes with stride = n may be used).
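A sketch of such a downsampling network, assuming PyTorch; the concrete max-pooling kernel sizes, their paddings and the additive fusion of the pooled outputs are illustrative assumptions, chosen here so that the pooled spatial sizes match.

```python
import torch
import torch.nn as nn

class DownsampleNetwork(nn.Module):
    # Part 1: a 1x1, stride-1 convolution transforms the channel dimension.
    # Part 2: several max-pooling layers with different kernel sizes, all with stride 2,
    # downsample in the spatial dimension; their outputs are fused by addition.
    def __init__(self, c_in: int, c_out: int, kernel_sizes=(2, 4, 6, 8)):
        super().__init__()
        self.channel_conv = nn.Conv2d(c_in, c_out, kernel_size=1, stride=1)
        # padding = (k - 2) // 2 keeps all pooled outputs at the same spatial size.
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(kernel_size=k, stride=2, padding=(k - 2) // 2) for k in kernel_sizes]
        )

    def forward(self, x):                      # x: (batch, c_in, H, W)
        y = self.channel_conv(x)
        outs = [pool(y) for pool in self.pools]
        return torch.stack(outs, dim=0).sum(dim=0)
```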
The existing binarization methods do not binarize the downsampling layer and still retain the original FP32 downsampling layer. This results in a very computationally intensive MLP network, which cannot compete with a 1-bit CNN network. The embodiment of the present application can reduce the amount of computation of the network while keeping the processing precision of the downsampling network high.
The present application provides a data processing method applied to a multi-layer perceptron MLP in a machine learning model. The MLP comprises a block, and the block comprises a plurality of first processing branches connected in parallel. The method comprises: processing input data through each first processing branch of the plurality of first processing branches to obtain a plurality of first processing results, wherein each first processing branch comprises one or more fully connected layers, at least one of the plurality of first processing branches is configured to interact with the input data in a spatial dimension, and at least one of the plurality of first processing branches is configured to interact with the input data in a channel dimension. In this embodiment of the present application, by providing a plurality of parallel processing branches in a block of the binarized MLP, the parallel processing branches can perform information interaction on the input data in the spatial dimension and the channel dimension at the same time, which increases the information interaction complexity of the network and thereby improves the network performance of the binarized MLP.
Next, the beneficial effects of the embodiments of the present application are described with reference to experiments performed on the ImageNet classification dataset; the experimental results are shown in Table 3 and fig. 12B:
TABLE 3
It can be seen that the embodiment of the present application can significantly improve the accuracy of the binary MLP network and surpasses the existing best methods. In the table, Bit-width (W/A) indicates the number of bits used for the weights and activation values, FLOPs indicates the amount of computation required by the FP32 layers in the network, BOPs indicates the amount of computation required by the 1-bit layers, and OPs indicates the total amount of computation.
Referring to fig. 13, fig. 13 is a schematic diagram of an embodiment of a data processing method provided in an embodiment of the present application, and as shown in fig. 13, the data processing method provided in the embodiment of the present application includes:
1301. Receive the performance requirement sent by the terminal device.
In this embodiment of the present application, the terminal device may send a performance requirement to the cloud side device, where the performance requirement information may at least include one of the following: precision requirements, latency requirements or model parameter requirements.
1302. Constructing a machine learning model meeting the performance requirements according to the performance requirements; the multi-layer perceptron MLP in the machine learning model comprises a block, wherein the block comprises a plurality of first processing branches connected in parallel; wherein each of said first processing branches comprises one or more fully connected layers; each first processing branch of the plurality of first processing branches is for processing input data; the performance requirement is used for determining the number of first processing branches and/or the type of the first processing branches included in the block; the type is one of the following: interacting input data in a space dimension and interacting the input data in a channel dimension; the input data and parameters in the plurality of first processing branches are binarized data.
In some scenarios, when the terminal device needs to acquire a model for reasoning from the cloud-side device, it may send a model acquisition request to the cloud-side device, where the model acquisition request may include a performance requirement. Accordingly, the cloud-side device may receive the performance requirement information sent by the terminal device and determine a machine learning model that meets the performance requirement according to the performance requirement information. When the accuracy requirement is higher, the number of parameters of the machine learning model may be larger (for example, the number of processing branches of a block in the MLP may be greater, or the number of FC layers included may be greater), so as to provide a model with very high performance; when the latency requirement is higher, the number of parameters of the machine learning model may be smaller, so as to provide a model that can perform inference quickly (a rough sketch of such a requirement-to-configuration mapping is given after the description of step 1303 below).
1303. Send the machine learning model to the terminal device.
In the embodiment of the application, after the cloud side device determines the machine learning model, the determined machine learning model can be sent to the terminal device, and the terminal device can perform reasoning according to the received machine learning model. The terminal device may further perform a model compression process on the received machine learning model, which is not limited herein.
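For illustration only, the mapping from a performance requirement to a model configuration mentioned in step 1302 could look roughly like the following; the requirement fields, thresholds and configuration values are all hypothetical and are not taken from the embodiments.

```python
# Hypothetical mapping from a performance requirement to a model configuration.
def build_model_config(precision_req: float, latency_req_ms: float) -> dict:
    if latency_req_ms < 10:
        # Tight latency budget: a smaller binarized MLP for fast inference.
        return {"branches_per_block": 3, "fc_layers_per_branch": 1, "num_blocks": 8}
    if precision_req >= 0.75:
        # High accuracy requirement: more processing branches / FC layers per block.
        return {"branches_per_block": 6, "fc_layers_per_branch": 2, "num_blocks": 16}
    return {"branches_per_block": 3, "fc_layers_per_branch": 2, "num_blocks": 12}

config = build_model_config(precision_req=0.8, latency_req_ms=50.0)
```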
In one possible implementation, the performance requirements include at least one of:
precision requirements, latency requirements or model parameter requirements.
In one possible implementation, the spatial dimensions of the input data specifically include a width dimension and a height dimension;
the interacting the input data in the spatial dimension includes:
the input data is interacted with in the width dimension or the input data is interacted with in the height dimension.
In one possible implementation, the block further includes a plurality of parallel second processing branches; the method further comprises the steps of:
processing the fusion result of the plurality of first processing results through each second processing branch in the plurality of second processing branches to obtain a plurality of second processing results; wherein at least one of the plurality of second processing branches is configured to interact with the fusion result in a spatial dimension; at least one second processing branch of the plurality of second processing branches is configured to interact with the fusion result in a channel dimension; the parameters in the plurality of second processing branches are binarized data; the performance requirement is used for determining the number of second processing branches and/or the type of the second processing branches included in the block; the type is one of the following: the method comprises the steps of interacting input data in a space dimension and interacting the input data in a channel dimension.
In one possible implementation, the first, second, and third numbers of processing branches in the block are the same;
the first number is the number of processing branches included in the block for interacting in a width dimension;
the second number is the number of processing branches included in the block for interacting in a height dimension;
the third number is the number of processing branches included in the block for interacting in the channel dimension.
In one possible implementation, the plurality of first processing branches includes a target processing branch; the target processing branch comprises a plurality of network layers which are connected in series and comprise a first network layer, a second network layer and a third network layer, the second network layer and the third network layer are connected adjacently, input data of the first network layer are first input data, and input data of the third network layer are fusion results of the second input data and the third input data; the third input data is the data output by the second network layer; wherein,
the size of the first input data and the size of the third input data are different; the second input data is obtained by adjusting the size of the first input data, and the second input data and the third input data have the same size.
In one possible implementation, the size is the number of channels;
the number of channels of the first input data is larger than that of channels of the third input data, and the second input data is obtained by carrying out data fusion on the first input data in a channel dimension; or,
the number of channels of the first input data is smaller than that of channels of the third input data, and the second input data is obtained by copying the first input data in the channel dimension.
In one possible implementation, the MLP further comprises a downsampling network; the downsampling network comprises a convolution module and a plurality of downsampling modules;
the method further comprises the steps of:
performing convolution operation on the fusion result of the plurality of first processing results in the channel dimension through the convolution module to obtain a first processing result;
executing pooling operation on the first processing result in the space dimension through each downsampling module in the plurality of downsampling modules to obtain a plurality of third processing results;
and fusing the plurality of third processing results.
In one possible implementation, the convolution operation is a step-size 1 convolution; the pooling operation is pooling with a step size of N, wherein N is greater than 1.
Referring to fig. 14, fig. 14 is a schematic structural diagram of a data processing apparatus provided in an embodiment of the present application, and as shown in fig. 14, a data processing apparatus 1400 provided in an embodiment of the present application is applied to a multi-layer perceptron MLP in a machine learning model; the MLP comprises a block, wherein the block comprises a plurality of first processing branches connected in parallel; the apparatus 1400 includes:
a processing module 1401, configured to process input data through each of the plurality of first processing branches, to obtain a plurality of first processing results; wherein each of said first processing branches comprises one or more fully connected layers; at least one first processing branch of the plurality of first processing branches is configured to interact with the input data in a spatial dimension; at least one first processing branch of the plurality of first processing branches is configured to interact with the input data in a channel dimension; the input data and parameters in the plurality of first processing branches are binarized data.
The specific description of the processing module 1401 may refer to the description related to the binarized MLP in the above embodiment, and will not be repeated here.
In one possible implementation, the spatial dimensions of the input data specifically include a width dimension and a height dimension;
at least one of the plurality of first processing branches is to interact with input data in the width dimension and at least one of the plurality of first processing branches is to interact with input data in the height dimension.
In one possible implementation, the block further includes a plurality of parallel second processing branches; the processing module is further configured to:
processing the fusion result of the plurality of first processing results through each second processing branch in the plurality of second processing branches to obtain a plurality of second processing results; wherein at least one of the plurality of second processing branches is configured to interact with the fusion result in a spatial dimension; at least one second processing branch of the plurality of second processing branches is configured to interact with the fusion result in a channel dimension; the parameters in the plurality of second processing branches are binarized data.
In one possible implementation, the first, second, and third numbers of processing branches in the block are the same;
The first number is the number of processing branches included in the block for interacting in a width dimension;
the second number is the number of processing branches included in the block for interacting in a height dimension;
the third number is the number of processing branches included in the block for interacting in the channel dimension.
In one possible implementation, the plurality of first processing branches includes a target processing branch; the target processing branch comprises a plurality of network layers which are connected in series and comprise a first network layer, a second network layer and a third network layer, the second network layer and the third network layer are connected adjacently, input data of the first network layer are first input data, and input data of the third network layer are fusion results of the second input data and the third input data; the third input data is the data output by the second network layer; wherein,
the size of the first input data and the size of the third input data are different; the second input data is obtained by adjusting the size of the first input data, and the second input data and the third input data have the same size.
In one possible implementation, the size is the number of channels;
the number of channels of the first input data is larger than that of channels of the third input data, and the second input data is obtained by carrying out data fusion on the first input data in a channel dimension; or,
the number of channels of the first input data is smaller than that of channels of the third input data, and the second input data is obtained by copying the first input data in the channel dimension.
In one possible implementation, the MLP further comprises a downsampling network; the downsampling network comprises a convolution module and a plurality of downsampling modules;
the processing module is further configured to:
performing convolution operation on the fusion result of the plurality of first processing results in the channel dimension through the convolution module to obtain a first processing result;
executing pooling operation on the first processing result in the space dimension through each downsampling module in the plurality of downsampling modules to obtain a plurality of third processing results;
and fusing the plurality of third processing results.
In one possible implementation, the convolution operation is a step-size 1 convolution; the pooling operation is pooling with a step size of N, wherein N is greater than 1.
Referring to fig. 15, fig. 15 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application, and as shown in fig. 15, a data processing apparatus 1500 according to an embodiment of the present application includes:
an obtaining module 1501, configured to receive a performance requirement sent by a terminal device;
the specific description of the obtaining module 1501 may refer to the description of step 1301 in the above embodiment, which is not repeated here.
A processing module 1502, configured to construct a machine learning model that meets the performance requirement according to the performance requirement; the multi-layer perceptron MLP in the machine learning model comprises a block, wherein the block comprises a plurality of first processing branches connected in parallel; wherein each of said first processing branches comprises one or more fully connected layers; each first processing branch of the plurality of first processing branches is for processing input data; the performance requirement is used for determining the number of first processing branches and/or the type of the first processing branches included in the block; the type is one of the following: interacting input data in a space dimension and interacting the input data in a channel dimension; the parameters in the input data and the plurality of first processing branches are binarized data;
The specific description of the processing module 1502 may refer to the description of step 1302 in the above embodiment, which is not repeated herein.
A sending module 1503, configured to send the machine learning model to the terminal device.
The specific description of the transmitting module 1503 may refer to the description of step 1303 in the foregoing embodiment, which is not repeated here.
In one possible implementation, the performance requirements include at least one of:
precision requirements, latency requirements or model parameter requirements.
In one possible implementation, the spatial dimensions of the input data specifically include a width dimension and a height dimension;
the interacting the input data in the spatial dimension includes:
the input data is interacted with in the width dimension or the input data is interacted with in the height dimension.
In one possible implementation, the performance requirements include at least one of:
precision requirements, latency requirements or model parameter requirements.
In one possible implementation, the spatial dimensions of the input data specifically include a width dimension and a height dimension;
the interacting the input data in the spatial dimension includes:
The input data is interacted with in the width dimension or the input data is interacted with in the height dimension.
In one possible implementation, the block further includes a plurality of parallel second processing branches; the method further comprises the steps of:
processing the fusion result of the plurality of first processing results through each second processing branch in the plurality of second processing branches to obtain a plurality of second processing results; wherein at least one of the plurality of second processing branches is configured to interact with the fusion result in a spatial dimension; at least one second processing branch of the plurality of second processing branches is configured to interact with the fusion result in a channel dimension; the parameters in the plurality of second processing branches are binarized data; the performance requirement is used for determining the number of second processing branches and/or the type of the second processing branches included in the block; the type is one of the following: the method comprises the steps of interacting input data in a space dimension and interacting the input data in a channel dimension.
In one possible implementation, the first, second, and third numbers of processing branches in the block are the same;
The first number is the number of processing branches included in the block for interacting in a width dimension;
the second number is the number of processing branches included in the block for interacting in a height dimension;
the third number is the number of processing branches included in the block for interacting in the channel dimension.
In one possible implementation, the plurality of first processing branches includes a target processing branch; the target processing branch comprises a plurality of network layers which are connected in series and comprise a first network layer, a second network layer and a third network layer, the second network layer and the third network layer are connected adjacently, input data of the first network layer are first input data, and input data of the third network layer are fusion results of the second input data and the third input data; the third input data is the data output by the second network layer; wherein,
the size of the first input data and the size of the third input data are different; the second input data is obtained by adjusting the size of the first input data, and the second input data and the third input data have the same size.
In one possible implementation, the size is the number of channels;
the number of channels of the first input data is larger than that of channels of the third input data, and the second input data is obtained by carrying out data fusion on the first input data in a channel dimension; or,
the number of channels of the first input data is smaller than that of channels of the third input data, and the second input data is obtained by copying the first input data in the channel dimension.
In one possible implementation, the MLP further comprises a downsampling network; the downsampling network comprises a convolution module and a plurality of downsampling modules;
the method further comprises the steps of:
performing convolution operation on the fusion result of the plurality of first processing results in the channel dimension through the convolution module to obtain a first processing result;
executing pooling operation on the first processing result in the space dimension through each downsampling module in the plurality of downsampling modules to obtain a plurality of third processing results;
and fusing the plurality of third processing results.
In one possible implementation, the convolution operation is a step-size 1 convolution; the pooling operation is pooling with a step size of N, wherein N is greater than 1.
Next, referring to fig. 16, fig. 16 is a schematic structural diagram of an execution device provided in the embodiment of the present application, where the execution device 1600 may specifically be represented by a virtual reality VR device, a mobile phone, a tablet, a notebook, an intelligent wearable device, a monitoring data processing device, or a server, which is not limited herein. Specifically, the execution device 1600 includes: a receiver 1601, a transmitter 1602, a processor 1603, and a memory 1604 (where the number of processors 1603 in the execution device 1600 may be one or more, one processor is illustrated in fig. 16), where the processor 1603 may include an application processor 16031 and a communication processor 16032. In some embodiments of the present application, the receiver 1601, transmitter 1602, processor 1603, and memory 1604 may be connected by a bus or other means.
The memory 1604 may include a read-only memory and a random access memory, and provides instructions and data to the processor 1603. A portion of the memory 1604 may also include a non-volatile random access memory (NVRAM). The memory 1604 stores operating instructions executable by the processor, executable modules or data structures, or a subset or an extended set thereof, where the operating instructions may include various operating instructions for implementing various operations.
The processor 1603 controls the operation of the execution device. In a specific application, the individual components of the execution device are coupled together by a bus system, which may include, in addition to a data bus, a power bus, a control bus, a status signal bus, etc. For clarity of illustration, however, the various buses are referred to in the figures as bus systems.
The methods disclosed in the embodiments of the present application may be applied to the processor 1603 or implemented by the processor 1603. The processor 1603 may be an integrated circuit chip with signal processing capabilities. In an implementation, the steps of the above method may be performed by integrated logic circuitry in hardware or by instructions in the form of software in the processor 1603. The processor 1603 may be a general-purpose processor, a digital signal processor (DSP), a microprocessor, or a microcontroller, and may further be an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor 1603 may implement or perform the methods, steps, and logical blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in connection with the embodiments of the present application may be directly embodied as being performed by a hardware decoding processor, or performed by a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 1604, and the processor 1603 reads the information in the memory 1604 and, in combination with its hardware, performs the steps of the above method that involve the model inference process.
The receiver 1601 may be configured to receive input digital or character information and to generate signal inputs related to settings and function control of the execution device. The transmitter 1602 may be configured to output digital or character information via a first interface; the transmitter 1602 may also be configured to send instructions to a disk group through the first interface to modify data in the disk group; the transmitter 1602 may also include a display device such as a display screen.
Referring to fig. 17, fig. 17 is a schematic structural diagram of the training device provided in the embodiment of the present application. Specifically, the training device 1700 is implemented by one or more servers. The training device 1700 may vary considerably with configuration or performance, and may include one or more central processing units (CPUs) 1717 (for example, one or more processors), a memory 1732, and one or more storage media 1730 (for example, one or more mass storage devices) storing application programs 1742 or data 1744. The memory 1732 and the storage medium 1730 may be transitory or persistent storage. The program stored in the storage medium 1730 may include one or more modules (not shown), each of which may include a series of instruction operations for the training device. Furthermore, the central processing unit 1717 may be configured to communicate with the storage medium 1730 and to execute, on the training device 1700, the series of instruction operations in the storage medium 1730.
The training device 1700 may also include one or more power supplies 1726, one or more wired or wireless network interfaces 1750, one or more input/output interfaces 1758, and/or one or more operating systems 1741, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
In this embodiment of the present application, the central processing unit 1717 is configured to perform the actions related to model training in the foregoing embodiments.
Embodiments of the present application also provide a computer program product which, when run on a computer, causes the computer to perform the steps performed by the aforementioned execution device, or causes the computer to perform the steps performed by the aforementioned training device.
An embodiment of the present application also provides a computer-readable storage medium storing a program for signal processing which, when run on a computer, causes the computer to perform the steps performed by the aforementioned execution device, or causes the computer to perform the steps performed by the aforementioned training device.
The execution device, training device, or terminal device provided in the embodiments of the present application may specifically be a chip. The chip includes a processing unit and a communication unit; the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin, or a circuit. The processing unit may execute the computer-executable instructions stored in the storage unit, so that the chip in the execution device performs the data processing method described in the above embodiments, or so that the chip in the training device performs the data processing method described in the above embodiments. Optionally, the storage unit is a storage unit in the chip, such as a register or a cache; the storage unit may also be a storage unit located outside the chip on the wireless access device side, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM).
Specifically, referring to fig. 18, fig. 18 is a schematic structural diagram of a chip provided in an embodiment of the present application. The chip may be implemented as a neural network processor (NPU) 1800. The NPU 1800 is mounted on a host CPU as a coprocessor, and the host CPU distributes tasks. The core part of the NPU is an arithmetic circuit 1803, and a controller 1804 controls the arithmetic circuit 1803 to extract matrix data from memory and perform multiplication.
In some implementations, the arithmetic circuit 1803 internally includes a plurality of processing engines (PEs). In some implementations, the arithmetic circuit 1803 is a two-dimensional systolic array. The arithmetic circuit 1803 may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 1803 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to matrix B from the weight memory 1802 and buffers it on each PE in the arithmetic circuit. The arithmetic circuit takes matrix A data from the input memory 1801 and performs a matrix operation with matrix B; the partial or final results of the matrix are stored in an accumulator 1808.
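As a toy illustration (plain Python, not NPU code), the following sketch shows the kind of partial-sum accumulation performed when computing C = A x B with weights processed tile by tile; the tiling scheme and function name tiled_matmul are assumptions made for the example.

```python
def tiled_matmul(A, B, tile=2):
    """A: n x k, B: k x m, given as lists of lists; returns C = A x B."""
    n, k = len(A), len(A[0])
    m = len(B[0])
    C = [[0.0] * m for _ in range(n)]          # plays the role of the accumulator
    for k0 in range(0, k, tile):               # one tile of weight data is "buffered" at a time
        for i in range(n):
            for j in range(m):
                # the partial result for this tile is accumulated into C[i][j]
                C[i][j] += sum(A[i][p] * B[p][j] for p in range(k0, min(k0 + tile, k)))
    return C
```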
The unified memory 1806 is used for storing input data and output data. The weight data is transferred directly to the weight memory 1802 through a direct memory access controller (DMAC) 1805. The input data is also carried into the unified memory 1806 through the DMAC.
The BIU is the bus interface unit 1810, which is used for the interaction between the AXI bus and the DMAC and the instruction fetch buffer (IFB) 1809.
The bus interface unit 1810 is configured for the instruction fetch buffer 1809 to obtain instructions from an external memory, and is further configured for the memory unit access controller 1805 to obtain the raw data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly used to transfer input data from the external memory DDR to the unified memory 1806, to transfer weight data to the weight memory 1802, or to transfer input data to the input memory 1801.
The vector calculation unit 1807 includes a plurality of operation processing units and, where necessary, performs further processing on the output of the arithmetic circuit 1803, such as vector multiplication, vector addition, exponential operation, logarithmic operation, and magnitude comparison. It is mainly used for the non-convolutional/non-fully-connected layer computations in the neural network, such as batch normalization, pixel-level summation, and upsampling of feature maps.
In some implementations, the vector calculation unit 1807 can store the vector of processed outputs to the unified memory 1806. For example, the vector calculation unit 1807 may apply a linear function or a nonlinear function to the output of the arithmetic circuit 1803, for example, performing linear interpolation on the feature planes extracted by the convolutional layers, or accumulating vectors of values to generate activation values. In some implementations, the vector calculation unit 1807 generates normalized values, pixel-level summed values, or both. In some implementations, the vector of processed outputs can be used as an activation input to the arithmetic circuit 1803, for example, for use in subsequent layers of the neural network.
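As a toy illustration of the kind of post-processing the vector calculation unit applies to the arithmetic circuit's output (normalization followed by a nonlinear activation), the following sketch uses arbitrarily chosen shapes and operations; it is an example only.

```python
import torch
import torch.nn.functional as F

matmul_out = torch.randn(4, 8)                        # stands in for the arithmetic circuit's output
normalized = F.batch_norm(matmul_out, None, None, training=True)   # batch normalization
activated = torch.relu(normalized)                    # nonlinear activation values
```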
An instruction fetch buffer 1809 is connected to the controller 1804 and is used for storing instructions used by the controller 1804.
The unified memory 1806, the input memory 1801, the weight memory 1802, and the instruction fetch buffer 1809 are all on-chip memories. The external memory is private to the NPU hardware architecture.
The processor mentioned in any of the above may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the above-mentioned programs.
It should be further noted that the above-described apparatus embodiments are merely illustrative; the units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided in this application, the connection relations between the modules indicate that they have communication connections with each other, which may specifically be implemented as one or more communication buses or signal lines.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present application may be implemented by means of software plus the necessary general-purpose hardware, or of course by dedicated hardware including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like. Generally, any function performed by a computer program can easily be implemented by corresponding hardware, and the specific hardware structures used to implement the same function can be diverse, such as analog circuits, digital circuits, or dedicated circuits. However, for the present application, a software program implementation is the preferred embodiment in most cases. Based on such an understanding, the technical solution of the present application, or the part contributing to the prior art, may be embodied essentially in the form of a software product stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk of a computer, including several instructions for causing a computer device (which may be a personal computer, a training device, or a network device, etc.) to perform the methods described in the embodiments of the present application.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used for implementation, they may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device, such as a training device or a data center, integrating one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)), etc.

Claims (25)

1. A data processing method, characterized in that the method is applied to a multi-layer perceptron (MLP) in a machine learning model; the MLP comprises a block, and the block comprises a plurality of first processing branches connected in parallel; the method comprises the following steps:
processing input data through each first processing branch in the plurality of first processing branches to obtain a plurality of first processing results; wherein,
each of the first processing branches includes one or more fully-connected layers; at least one first processing branch of the plurality of first processing branches is configured to interact with the input data in a spatial dimension; at least one first processing branch of the plurality of first processing branches is configured to interact with the input data in a channel dimension; the input data and parameters in the plurality of first processing branches are binarized data.
2. The method of claim 1, wherein the spatial dimensions of the input data specifically include a width dimension and a height dimension;
at least one of the plurality of first processing branches is to interact with input data in the width dimension and at least one of the plurality of first processing branches is to interact with input data in the height dimension.
3. The method of claim 1 or 2, wherein the block further comprises a plurality of parallel second processing branches; the method further comprises the steps of:
processing the fusion result of the plurality of first processing results through each second processing branch in the plurality of second processing branches to obtain a plurality of second processing results; wherein at least one of the plurality of second processing branches is configured to interact with the fusion result in a spatial dimension; at least one second processing branch of the plurality of second processing branches is configured to interact with the fusion result in a channel dimension; the parameters in the plurality of second processing branches are binarized data.
4. The method according to any one of claims 1 to 3, wherein a first number, a second number, and a third number of processing branches in the block are the same;
the first number is the number of processing branches included in the block that interact in the width dimension;
the second number is the number of processing branches included in the block that interact in the height dimension;
the third number is the number of processing branches included in the block that interact in the channel dimension.
5. The method of any one of claims 1 to 4, wherein the plurality of first processing branches includes a target processing branch; the target processing branch comprises a plurality of serially connected network layers, including a first network layer, a second network layer, and a third network layer, where the second network layer and the third network layer are adjacently connected; the input data of the first network layer is first input data, and the input data of the third network layer is a fusion result of second input data and third input data, the third input data being the data output by the second network layer; wherein
the size of the first input data differs from the size of the third input data, and the second input data is obtained by adjusting the size of the first input data so that the second input data and the third input data have the same size.
6. The method of claim 5, wherein the size is the number of channels;
the number of channels of the first input data is larger than the number of channels of the third input data, and the second input data is obtained by fusing the first input data in the channel dimension; or
the number of channels of the first input data is smaller than the number of channels of the third input data, and the second input data is obtained by copying the first input data in the channel dimension.
7. The method of any one of claims 1 to 6, wherein the MLP further comprises a downsampling network; the downsampling network comprises a convolution module and a plurality of downsampling modules;
the method further comprises:
performing, through the convolution module, a convolution operation in the channel dimension on the fusion result of the plurality of first processing results to obtain a convolution result;
performing, through each of the plurality of downsampling modules, a pooling operation in the spatial dimension on the convolution result to obtain a plurality of third processing results; and
fusing the plurality of third processing results.
8. The method of claim 7, wherein the convolution operation is a convolution with a stride of 1, and the pooling operation is pooling with a stride of N, where N is greater than 1.
9. The method according to any one of claims 1 to 8, further comprising:
fusing the plurality of first processing results to obtain a fusion result.
10. A method of data processing, the method comprising:
receiving performance requirements sent by terminal equipment;
constructing, according to the performance requirement, a machine learning model meeting the performance requirement; wherein a multi-layer perceptron (MLP) in the machine learning model comprises a block, and the block comprises a plurality of first processing branches connected in parallel; each of the first processing branches comprises one or more fully-connected layers; each of the plurality of first processing branches is configured to process input data; the performance requirement is used to determine the number of first processing branches and/or the type of the first processing branches included in the block, the type being one of the following: interacting with the input data in the spatial dimension, or interacting with the input data in the channel dimension; and the input data and the parameters in the plurality of first processing branches are binarized data;
and sending the machine learning model to the terminal equipment.
11. The method of claim 10, wherein the performance requirements include at least one of:
precision requirements, latency requirements or model parameter requirements.
12. A data processing apparatus, characterized in that the apparatus is applied to a multi-layer perceptron (MLP) in a machine learning model; the MLP comprises a block, and the block comprises a plurality of first processing branches connected in parallel; the apparatus comprises:
the processing module is used for processing the input data through each first processing branch in the plurality of first processing branches to obtain a plurality of first processing results; wherein,
each of the first processing branches includes one or more fully-connected layers; at least one first processing branch of the plurality of first processing branches is configured to interact with the input data in a spatial dimension; at least one first processing branch of the plurality of first processing branches is configured to interact with the input data in a channel dimension; the input data and parameters in the plurality of first processing branches are binarized data.
13. The apparatus of claim 12, wherein the spatial dimensions of the input data specifically include a width dimension and a height dimension;
at least one of the plurality of first processing branches is to interact with input data in the width dimension and at least one of the plurality of first processing branches is to interact with input data in the height dimension.
14. The apparatus of claim 12 or 13, wherein the block further comprises a plurality of parallel second processing branches; the processing module is further configured to:
processing the fusion result of the plurality of first processing results through each second processing branch in the plurality of second processing branches to obtain a plurality of second processing results; wherein at least one of the plurality of second processing branches is configured to interact with the fusion result in a spatial dimension; at least one second processing branch of the plurality of second processing branches is configured to interact with the fusion result in a channel dimension; the parameters in the plurality of second processing branches are binarized data.
15. The apparatus of any one of claims 12 to 14, wherein a first number, a second number, and a third number of processing branches in the block are the same;
the first number is the number of processing branches included in the block that interact in the width dimension;
the second number is the number of processing branches included in the block that interact in the height dimension;
the third number is the number of processing branches included in the block that interact in the channel dimension.
16. The apparatus of any one of claims 12 to 15, wherein the plurality of first processing branches includes a target processing branch; the target processing branch comprises a plurality of serially connected network layers, including a first network layer, a second network layer, and a third network layer, where the second network layer and the third network layer are adjacently connected; the input data of the first network layer is first input data, and the input data of the third network layer is a fusion result of second input data and third input data, the third input data being the data output by the second network layer; wherein
the size of the first input data differs from the size of the third input data, and the second input data is obtained by adjusting the size of the first input data so that the second input data and the third input data have the same size.
17. The apparatus of claim 16, wherein the size is the number of channels;
the number of channels of the first input data is larger than the number of channels of the third input data, and the second input data is obtained by fusing the first input data in the channel dimension; or
the number of channels of the first input data is smaller than the number of channels of the third input data, and the second input data is obtained by copying the first input data in the channel dimension.
18. The apparatus of any one of claims 12 to 17, wherein the MLP further comprises a downsampling network; the downsampling network comprises a convolution module and a plurality of downsampling modules;
the processing module is further configured to:
perform, through the convolution module, a convolution operation in the channel dimension on the fusion result of the plurality of first processing results to obtain a convolution result;
perform, through each of the plurality of downsampling modules, a pooling operation in the spatial dimension on the convolution result to obtain a plurality of third processing results; and
fuse the plurality of third processing results.
19. The apparatus of claim 18, wherein the convolution operation is a convolution with a stride of 1, and the pooling operation is pooling with a stride of N, where N is greater than 1.
20. The apparatus of any one of claims 12 to 19, wherein the processing module is further configured to:
fuse the plurality of first processing results to obtain a fusion result.
21. A data processing apparatus, the apparatus comprising:
the acquisition module is used for receiving the performance requirements sent by the terminal equipment;
the processing module is configured to construct, according to the performance requirement, a machine learning model meeting the performance requirement; wherein a multi-layer perceptron (MLP) in the machine learning model comprises a block, and the block comprises a plurality of first processing branches connected in parallel; each of the first processing branches comprises one or more fully-connected layers; each of the plurality of first processing branches is configured to process input data; the performance requirement is used to determine the number of first processing branches and/or the type of the first processing branches included in the block, the type being one of the following: interacting with the input data in the spatial dimension, or interacting with the input data in the channel dimension; and the input data and the parameters in the plurality of first processing branches are binarized data;
and the sending module is used for sending the machine learning model to the terminal equipment.
22. The apparatus of claim 21, wherein the performance requirements comprise at least one of:
Precision requirements, latency requirements or model parameter requirements.
23. A computer storage medium storing one or more instructions which, when executed by one or more computers, cause the one or more computers to perform the operations of the method of any one of claims 1 to 11.
24. A computer program product comprising computer readable instructions which, when run on a computer device, cause the computer device to perform the method of any of claims 1 to 11.
25. A system, comprising at least one processor and at least one memory, wherein the processor and the memory are connected and communicate with each other through a communication bus;
the at least one memory is configured to store code; and
the at least one processor is configured to execute the code to perform the method of any one of claims 1 to 11.
