Detailed Description
To make the objects, technical solutions, and advantages of the present application clearer, the present application is further described below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit it. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without any creative effort, shall fall within the protection scope of the present application.
It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
As shown in Fig. 1, one embodiment of the present application provides a method 001 of implementing a deep neural network on a field programmable gate array, comprising:
S10, analyzing the resource demand and the saturation throughput of each network layer of the deep neural network;
S20, enumerating all partitioning schemes for partitioning all the network layers among a plurality of field programmable gate arrays according to the resource demand and the saturation throughput;
S30, calculating the effect parameter data of all the partitioning schemes, and selecting an optimal scheme from among them according to that data;
and S40, implementing the optimal scheme on a board.
In some embodiments, analyzing the resource demand and the saturation throughput of each network layer of the deep neural network comprises:
reading the parameter data of the deep neural network and calculating the resource demand of each network layer;
and acquiring the saturation throughput of each network layer according to the resource demand, as illustrated in the sketch below.
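The following is a minimal, hypothetical sketch of this analysis step. The cost model (resources per layer type) and the field names are assumptions made for illustration, not quantities prescribed by the present application.

```python
from dataclasses import dataclass

@dataclass
class LayerProfile:
    layer_id: int                 # identifier distinguishing the layer
    resource_demand: int          # e.g., multipliers/storage needed (model-specific)
    saturation_throughput: float  # throughput once the resource demand is met

def analyze_layers(layers):
    """Estimate resource demand and saturation throughput per network layer.

    'layers' is a list of dicts read from the network's parameter data.
    The cost model below is a placeholder assumption for illustration.
    """
    profiles = []
    for i, layer in enumerate(layers, start=1):
        if layer["type"] == "conv":
            # crude model: kernels * kernel area = multipliers required
            demand = layer["num_kernels"] * layer["kernel_size"] ** 2
        elif layer["type"] == "fc":
            demand = layer["num_outputs"]
        else:  # input, pooling, excitation, output layers: nominal cost
            demand = 1
        # hypothetical estimate: throughput falls as the layer's work grows
        throughput = 1000.0 / max(demand, 1)
        profiles.append(LayerProfile(i, demand, throughput))
    return profiles
```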
In some embodiments, implementing the optimal scheme on a board comprises: writing the parameter data of the network layers into the correspondingly assigned field programmable gate arrays according to the optimal scheme.
In some embodiments, the network layer includes a convolutional layer, and the parameter data of the network layer includes the layer type of the convolutional layer, the number of convolution kernels, the size of the convolution kernels, and the stride of the convolution kernels.
In some embodiments, the network layer includes a convolutional layer and further includes one or more of an input layer, a pooling layer, an excitation layer, a fully-connected layer, and an output layer. The parameter data of the network layer comprises the parameter data of the convolutional layer and further comprises one or more of the following: the parameter data of the input layer, the pooling layer, the excitation layer, the fully-connected layer, and the output layer. The parameter data of the input layer comprises a layer type; the parameter data of the convolutional layer comprises a layer type, the number of convolution kernels, the size of the convolution kernels, and the stride of the convolution kernels; the parameter data of the pooling layer comprises a layer type, a pooling function, the pooling size, and the pooling stride; the parameter data of the excitation layer comprises a layer type and an excitation function; the parameter data of the fully-connected layer comprises a layer type and the number of outputs; the parameter data of the output layer comprises a layer type. A data-structure sketch follows.
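As a concrete illustration only, the parameter data enumerated above might be organized as in the following sketch; the field names are assumptions and not mandated by the present application.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LayerParams:
    layer_type: str                    # "input", "conv", "pool", "act", "fc", "output"
    num_kernels: Optional[int] = None  # convolutional layer: number of kernels
    kernel_size: Optional[int] = None  # convolutional layer: kernel size
    stride: Optional[int] = None       # convolutional/pooling layer: stride
    pool_fn: Optional[str] = None      # pooling layer: e.g., "max" or "avg"
    pool_size: Optional[int] = None    # pooling layer: pooling size
    activation: Optional[str] = None   # excitation layer: e.g., "relu"
    num_outputs: Optional[int] = None  # fully-connected layer: number of outputs

# example: a small network description as read from the parameter data
network = [
    LayerParams("input"),
    LayerParams("conv", num_kernels=16, kernel_size=3, stride=1),
    LayerParams("pool", pool_fn="max", pool_size=2, stride=2),
    LayerParams("act", activation="relu"),
    LayerParams("fc", num_outputs=10),
    LayerParams("output"),
]
```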
In some embodiments, the optimal scheme is a partitioning scheme that satisfies the following conditions: the total throughput of the deep neural network reaches a preset threshold, and the number of field programmable gate arrays used is the smallest.
In some embodiments, the optimal scheme is a partitioning scheme that satisfies the following conditions: the number of field programmable gate arrays used is within a preset threshold, and the total throughput of the deep neural network is the largest.
As shown in Fig. 2, the present embodiment further provides an apparatus for implementing a deep neural network on a field programmable gate array, comprising:
an analysis module 100, configured to analyze a resource demand and a saturation throughput of each network layer of the deep neural network;
an enumeration module 200, configured to enumerate all partitioning schemes for partitioning all the network layers among a plurality of field programmable gate arrays according to the resource demand and the saturation throughput;
a selection module 300, configured to calculate the effect parameter data of all the partitioning schemes and to select an optimal scheme from among them according to that data;
and a board implementation module 400, configured to implement the optimal scheme on a board.
The present embodiments also provide a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described method.
The present embodiment also provides a data processing method that processes data using a deep neural network implemented by the above method.
As shown in Fig. 3, one embodiment of the present application provides a method 002 of implementing a deep neural network on a field programmable gate array, comprising:
A1. Preprocessing step: analyzing the resource demand and the saturation throughput of each network layer of the deep neural network;
A2. Partitioning step: planning an optimal scheme for partitioning the deep neural network among a plurality of field programmable gate arrays (FPGAs) according to the results (resource demand and saturation throughput) obtained in step A1;
A3. Implementation step: implementing on a board the optimal scheme obtained in step A2.
In certain embodiments, step A1 comprises:
B1, reading the parameter data of the deep neural network;
and B2, calculating the resource demand and the saturation throughput of each network layer according to the parameter data of the deep neural network.
In certain embodiments, step A1 comprises:
C1, reading the parameter data of the deep neural network and calculating the resource demand of each network layer of the deep neural network;
and C2, acquiring the saturation throughput of each network layer according to the resource demand, where the saturation throughput is the throughput of the network layer when its resource demand is met.
In certain embodiments, step A2 comprises:
D1, selecting a plurality of FPGAs according to the resource demand and the saturation throughput, and enumerating all schemes for partitioning all the network layers among the FPGAs;
and D2, calculating the effect parameter data of all the partitioning schemes, and selecting the optimal scheme from among them according to that data. A sketch of the enumeration in D1 appears below.
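Under the constraint stated later in this description (only consecutive network layers may share an FPGA), enumerating the partitioning schemes amounts to enumerating the ways of cutting the layer sequence into contiguous groups. The following is a minimal illustrative sketch, not the implementation prescribed by the application.

```python
def enumerate_partitions(num_layers, max_fpgas):
    """Yield each scheme as a list of (start, end) layer ranges, one per FPGA.

    Only consecutive layers may share an FPGA, so a scheme is simply a
    choice of cut points in the layer sequence 1..num_layers.
    """
    def helper(start, groups):
        if start > num_layers:
            yield list(groups)
            return
        if len(groups) == max_fpgas:
            return  # no FPGA left for the remaining layers
        for end in range(start, num_layers + 1):
            groups.append((start, end))
            yield from helper(end + 1, groups)
            groups.pop()
    yield from helper(1, [])

# example: all ways to place 4 layers on at most 2 FPGAs
for scheme in enumerate_partitions(4, 2):
    print(scheme)  # e.g., [(1, 1), (2, 4)], [(1, 4)], ...
```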
In certain embodiments, step A2 comprises:
E1. Dynamic programming state definition step: the usage of each FPGA is recorded using state compression, and subscripts denote the number of network layers placed and the number of the FPGA used last, to facilitate the state transitions;
E2. Dynamic programming state transition step: enumerating all schemes for partitioning the network layers among the FPGAs, and calculating the feasible maximum throughput from the saturation throughputs obtained in the preprocessing step, thereby completing the state transitions;
E3. Optimal-solution backtracking step: obtaining the optimal partitioning scheme from the final optimal result and the FPGA numbers recorded at each step.
In certain embodiments, step A3 comprises: writing the parameter data of the network layers into the FPGAs according to the optimal scheme, thereby realizing the deep neural network.
As shown in Fig. 4, this embodiment further provides an apparatus for implementing a deep neural network on a field programmable gate array, including:
a preprocessing module for implementing step A1;
a partitioning module for implementing step A2;
and a board implementation module for implementing step A3.
As shown in Fig. 5, one embodiment of the present application provides a method 003 of implementing a deep neural network on a field programmable gate array, comprising:
S1, reading the parameter data of the deep neural network, and calculating the resource demand of each network layer of the deep neural network.
In some embodiments, each network layer of the deep neural network has an identifier that distinguishes it from the other layers.
In some embodiments, the network layer comprises an input layer, a convolutional layer, a pooling layer, an excitation layer, a fully-connected layer, and an output layer; there may be multiple convolutional, pooling, excitation, and fully-connected layers, with the specific number set, when the deep neural network is designed, according to the requirements of the practical application.
In some embodiments, the computing units of each network layer are extracted, and the amount of resources required to implement these computing units on an FPGA, such as the amount of storage space needed, is calculated.
In some embodiments, each network layer is provided with an identifier (which may be a number) that distinguishes it from the other layers.
S2, acquiring the saturation throughput of each network layer according to the resource demand; the saturation throughput is the throughput of the network layer when its resource demand is met.
In some embodiments, the saturation throughput of a network layer may be obtained by testing: the network layer is written into an FPGA that meets the layer's resource demand, and the throughput is measured and recorded as the saturation throughput.
In some embodiments, the saturation throughput of a network layer can instead be obtained by calculation; predicting it by calculation saves time, but carries a certain error relative to the saturation throughput measured in an actual test. A hypothetical estimation sketch follows.
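The application does not specify the calculation, so the following is a rough, assumed estimate for illustration only; the parameters (operations per input, parallel units, clock rate) are hypothetical.

```python
def estimate_saturation_throughput(ops_per_input, parallel_units, clock_hz):
    """Hypothetical saturation-throughput estimate, in inputs per second.

    Assumes the layer's compute units run fully pipelined once its
    resource demand is met; a real measurement will differ somewhat.
    """
    ops_per_second = parallel_units * clock_hz
    return ops_per_second / ops_per_input

# example: a layer needing 1.5M operations per input, 64 units at 200 MHz
print(estimate_saturation_throughput(1.5e6, 64, 200e6))  # ~8533 inputs/s
```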
S3, selecting a plurality of FPGAs according to the resource demand and the saturation throughput, and enumerating all schemes for partitioning all the network layers among the plurality of FPGAs.
When partitioning the network layers among the FPGAs, it is ensured that each network layer reaches its saturation throughput.
In some embodiments, each FPGA is provided with an identifier (which may be a number) that distinguishes it from the others, and the network layers and the FPGA they are assigned to are marked together by their respective identifiers so as to record the matching. For example, assuming there are 50 network layers in total, they are numbered sequentially as N1, N2, N3, ..., N50; assuming that 10 FPGAs are finally selected, the 10 FPGAs are numbered A1, A2, A3, ..., A10. If a partitioning scheme places the first to fourth network layers on the FPGA numbered A2, then N1, N2, N3, and N4 are marked together with A2 to facilitate recording the matching, as in the sketch below.
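As a small illustrative sketch, such a matching record could be kept as a mapping from FPGA identifiers to assigned layer identifiers; the structure is an assumption, not one prescribed by the application.

```python
# matching record for the example above: FPGA id -> assigned layer numbers
assignment = {
    "A2": ["N1", "N2", "N3", "N4"],
    # ... further FPGAs with their consecutive layer ranges
}

def fpga_of_layer(assignment, layer_id):
    """Look up which FPGA a given network layer was assigned to."""
    for fpga, layers in assignment.items():
        if layer_id in layers:
            return fpga
    return None

print(fpga_of_layer(assignment, "N3"))  # -> "A2"
```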
In some embodiments, E(i, j, x) denotes the throughput of the x-th FPGA when the i-th to j-th network layers are all placed on it, where 1 ≤ i ≤ j ≤ n and n is the total number of network layers, e.g., n = 50. Whether the i-th to j-th network layers can be placed on the x-th FPGA is judged from known quantities such as the resource amount of the x-th FPGA and the resource demand a(i) of each layer:
when the resource amount of the x-th FPGA is less than the sum of the resource demands of the i-th to j-th network layers, the layers cannot be placed on it;
when the resource amount of the x-th FPGA is greater than or equal to the sum of the resource demands of the i-th to j-th network layers, the layers can be placed provided E(i, j−1, x) is itself feasible, in which case E(i, j, x) = min{E(i, j−1, x), b(j)}, where b(j) is the saturation throughput of the j-th layer; otherwise they cannot be placed. A sketch of this table follows.
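A minimal sketch of building this table is below. It assumes dicts a (resource demand per layer), b (saturation throughput per layer), and C (resource capacity per FPGA), keyed from 1; the names mirror the quantities above, but the details are illustrative assumptions.

```python
def build_E(a, b, C):
    """Build E[(i, j, x)]: throughput when layers i..j share FPGA x.

    Only feasible entries are stored. a[i] is the resource demand of
    layer i, b[i] its saturation throughput, C[x] the capacity of FPGA x.
    """
    n, m = len(a), len(C)
    E = {}
    for x in range(1, m + 1):
        for i in range(1, n + 1):
            used = 0
            for j in range(i, n + 1):
                used += a[j]
                if used > C[x]:
                    break  # layers i..j no longer fit on FPGA x
                prev = b[i] if j == i else E[(i, j - 1, x)]
                E[(i, j, x)] = min(prev, b[j])  # FPGA throughput = min over its layers
    return E

# example: two layers, one FPGA with capacity 10
E = build_E({1: 5, 2: 3}, {1: 100.0, 2: 80.0}, {1: 10})
print(E[(1, 2, 1)])  # 80.0: the slower layer bounds the FPGA's throughput
```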
S4, calculating the effect parameter data of all the partitioning schemes, and selecting the optimal scheme from among them according to that data.
The effect parameter data of the partitioning scheme includes the total throughput of the deep neural network and the number of FPGAs used.
In some embodiments, the conditions satisfied by the optimal scheme include: the total throughput of the deep neural network reaches a preset threshold, and the number of FPGAs used is the smallest. For example, if the preset threshold is T1, then from all the partitioning schemes the scheme whose total throughput reaches T1 and that uses the fewest FPGAs is selected; if two schemes use the same number of FPGAs, the one with the larger total throughput is taken as the optimal scheme.
In some embodiments, the conditions satisfied by the optimal scheme include: the number of FPGAs used is within a preset threshold, and the total throughput of the deep neural network is the largest. For example, if the preset threshold is 10 (i.e., the number of FPGAs used should not exceed 10), then from all the partitioning schemes the scheme that uses at most 10 FPGAs and has the maximum total throughput is selected; if two schemes have the same total throughput, the one using fewer FPGAs is taken as the optimal scheme.
Each network layer reaches its saturation throughput when its resource demand is met; if several network layers are placed on the same FPGA, the throughput of that FPGA is the minimum of those layers' throughputs. Only consecutive network layers can be placed on the same FPGA.
In some embodiments, the total throughput of the deep neural network is denoted F(l, s, x), where l indicates that the placed network layers are the 1st to l-th layers, and s is a binary number (a bitmask) marking which FPGAs are in use.
For example, assuming the total number of FPGAs is 12, numbered 1 to 12, a 12-bit binary number s represents their usage, the FPGA numbers corresponding one-to-one to the digits of s from left to right; for example, s = 001101011001 indicates that the 3rd, 4th, 6th, 8th, 9th, and 12th FPGAs are in use.
x is the number of the FPGA on which the l-th layer is placed, where x and s satisfy the constraint (s & 2^x) > 0. The transition is F(l, s, x) = max{ min(E(k+1, l, x), B(y, x), F(k, s − 2^x, y)) }, taken over k < l and y with ((s − 2^x) & 2^y) > 0, and the calculation can be completed with loops. An auxiliary array is defined, and during the calculation the values of k and y that attain the maximum of F(l, s, x) are recorded in it for later backtracking.
The optimal scheme is then selected according to the obtained values of F(l, s, x) under the different partitioning schemes and the number of FPGAs used, as in the sketch below.
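The following sketch implements a bitmask dynamic program of this shape under stated assumptions: the bit for FPGA x is taken as bit x−1 counting from the least-significant end, B is supplied by the caller as a dict B[(y, x)] of inter-FPGA transfer throughputs, and E is a table such as build_E above produces. It illustrates the recurrence; it is not the application's definitive implementation.

```python
from itertools import combinations

def partition_dp(n, m, E, B):
    """Maximize F(l, s, x) for layers 1..n over FPGAs 1..m.

    E[(i, j, x)]: throughput of layers i..j on FPGA x (absent = infeasible).
    B[(y, x)]: assumed transfer throughput from FPGA y to FPGA x.
    Returns (best throughput, [(i, j, x), ...] placement plan).
    """
    F, back = {}, {}
    # base case: layers 1..l all on a single FPGA x (s has one bit set)
    for x in range(1, m + 1):
        for l in range(1, n + 1):
            if (1, l, x) in E:
                F[(l, 1 << (x - 1), x)] = E[(1, l, x)]
                back[(l, 1 << (x - 1), x)] = None
    # transitions, in order of increasing number of FPGAs used in s
    for bits in range(2, m + 1):
        for combo in combinations(range(1, m + 1), bits):
            s = sum(1 << (x - 1) for x in combo)
            for x in combo:
                for l in range(2, n + 1):
                    best = None
                    for k in range(1, l):            # k < l
                        if (k + 1, l, x) not in E:
                            continue
                        for y in combo:
                            if y == x:
                                continue             # y ranges over s minus FPGA x
                            prev = F.get((k, s - (1 << (x - 1)), y))
                            if prev is None:
                                continue
                            val = min(E[(k + 1, l, x)], B[(y, x)], prev)
                            if best is None or val > best[0]:
                                best = (val, k, y)
                    if best:
                        F[(l, s, x)] = best[0]
                        back[(l, s, x)] = (best[1], best[2])  # record k, y
    # among complete states (all n layers placed), take the maximum and backtrack
    done = [(v, l, s, x) for (l, s, x), v in F.items() if l == n]
    if not done:
        return None, []
    v, l, s, x = max(done)
    plan = []
    while True:
        prev = back[(l, s, x)]
        if prev is None:
            plan.append((1, l, x))
            break
        k, y = prev
        plan.append((k + 1, l, x))
        l, s, x = k, s - (1 << (x - 1)), y
    return v, list(reversed(plan))
```

In this sketch the number of FPGAs a scheme uses is the number of set bits in s, so selecting under either criterion described above amounts to filtering the complete states by the preset threshold before taking the maximum.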
S5, writing the parameter data of the network layers into the FPGAs according to the optimal scheme, thereby realizing the deep neural network.
The parameter data of the network layers comprises: the parameter data of the input layer, the convolutional layer, the pooling layer, the excitation layer, the fully-connected layer, and the output layer;
the parameter data of the input layer comprises a layer type;
the parameter data of the convolutional layer comprises a layer type, the number of convolution kernels, the size of the convolution kernels, and the stride of the convolution kernels;
the parameter data of the pooling layer comprises a layer type, a pooling function, the pooling size, and the pooling stride;
the parameter data of the excitation layer comprises a layer type and an excitation function;
the parameter data of the fully-connected layer comprises a layer type and the number of outputs;
the parameter data of the output layer comprises a layer type.
In some embodiments, a dynamic programming algorithm is applied to the performance parameters of each FPGA and the resource demand and saturation throughput of each network layer, and the optimal matching of network layers to FPGAs is obtained while meeting each layer's resource demand; this satisfies the throughput requirement of the network layers while avoiding waste of FPGA resources.
Fig. 6 is a flowchart illustrating a method for implementing a deep neural network on a field programmable gate array according to another embodiment of the present application.
The present embodiment also provides an electronic device including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the above method when executing the program.
The present embodiment also provides a non-transitory computer readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method described above.
The present embodiment also provides a data processing method that processes data, such as image data, using a deep neural network implemented by the above method of implementing a deep neural network on a field programmable gate array. The deep neural network implemented on the field programmable gate arrays is made to communicate with a computer (the FPGAs communicate with the host computer), and the data to be processed is input through the computer into the deep neural network for processing to obtain the processing result. In some embodiments, the communication between the FPGAs and the host is accomplished using a first-in-first-out (FIFO) queue. This data processing method offers high data processing speed and good processing results. A host-side sketch follows.
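The application does not fix a host API, so the following is a minimal host-side sketch that merely simulates the FIFO exchange pattern with Python's standard queue module; the board is replaced by a stand-in thread, and every name here is an assumption made for illustration.

```python
import queue
import threading

to_fpga = queue.Queue()    # host -> FPGA FIFO (simulated)
from_fpga = queue.Queue()  # FPGA -> host FIFO (simulated)

def fpga_stub():
    """Stand-in for the board: pops inputs and pushes 'results' in order."""
    while True:
        item = to_fpga.get()
        if item is None:
            break  # end-of-stream marker
        from_fpga.put(("result-for", item))  # placeholder for the inference

threading.Thread(target=fpga_stub, daemon=True).start()

samples = ["img0", "img1", "img2"]  # data to be processed
for sample in samples:
    to_fpga.put(sample)
to_fpga.put(None)

for _ in samples:
    print(from_fpga.get())  # results come back in first-in-first-out order
```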
According to the method of implementing a deep neural network on a field programmable gate array described above, an optimal scheme for partitioning the network layers of the deep neural network among a plurality of FPGAs is planned. This greatly improves the resource utilization of the FPGAs and the computing capacity of the deep neural network, saves FPGA resources while achieving a larger total throughput, and well meets the performance requirements for implementing the deep neural network.
It should be noted that:
the term "module" is not intended to be limited to a particular physical form. Depending on the particular application, a module may be implemented as hardware, firmware, software, and/or combinations thereof. Furthermore, different modules may share common components or even be implemented by the same component. There may or may not be clear boundaries between the various modules.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose devices may be used with the teachings herein. The required structure for constructing such a device will be apparent from the description above. In addition, this application is not directed to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the present application as described herein, and any descriptions of specific languages are provided above to disclose the best modes of the present application.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the application may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the application, various features of the application are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed application requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this application.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments and not other features, combinations of features of different embodiments are meant to be within the scope of the application and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the present application may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the apparatus according to embodiments of the present application. The present application may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present application may be stored on a computer readable medium or may take the form of one or more signals. Such a signal may be downloaded from an internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the application, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The application may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etcetera does not indicate any ordering; these words may be interpreted as names.
It should be understood that, although the steps in the flowcharts of the figures are shown in an order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least a portion of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different times, and their order of execution is not necessarily sequential; they may be performed in turns or alternately with other steps, or with at least a portion of the sub-steps or stages of other steps.
The above-mentioned embodiments merely express several implementations of the present application, and their description is specific and detailed, but they should not therefore be construed as limiting the scope of the present application. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.