CN108920413B - Convolutional neural network multi-core parallel computing method facing GPDSP - Google Patents
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/80—Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
- G06F15/8053—Vector processors
- G06F15/8061—Details on data memory access
- G06F15/8069—Details on data memory access using a cache
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
Abstract
The invention discloses a GPDSP-oriented convolutional neural network multi-core parallel computing method, comprising the following steps: S1, the CPU core constructs two data buffers and one weight data buffer in off-chip memory; S2, the CPU core merges a specified number of convolution kernels and stores the merged data in the weight data buffer; S3, the CPU core reads the specified frames of image data to be computed, merges them, and transfers them to an idle data buffer; S4, if the DSP cores are idle and the data in a data buffer is ready, the buffer addresses are transferred to the DSP cores; S5, the DSP cores perform the convolutional neural network computation in parallel; S6, the current computation result is output; S7, steps S3-S6 are repeated until all computations are complete. The invention fully exploits the performance and multi-level parallelism of the CPU core and DSP cores in the GPDSP and realizes efficient convolutional neural network computation.
Description
Technical field
The present invention relates to the field of deep learning, and more particularly to a convolutional neural network multi-core parallel computing method oriented toward the GPDSP (General-Purpose Digital Signal Processor).
Background art
Deep learning models based on convolutional neural networks (Convolutional Neural Networks, CNN) have achieved remarkable results in areas such as image recognition and classification, machine translation, automatic text processing, speech recognition, autonomous driving, and video analysis, and have become a research hotspot in each of these fields. A convolutional neural network is a deep feedforward neural network, usually composed of alternating convolutional layers, activation layers, and pooling layers, in which the convolutional layers perform feature extraction by convolving convolution kernels with the input features, so that the features of each class are learned. Convolutional-layer computation accounts for about 90% of the computation of the whole network structure, so optimizing and accelerating the convolutional layers is the key to improving convolutional neural network computing performance.
To improve the performance of convolutional neural networks, ever deeper and more complex network structures are continually being proposed, typified by LeNet, AlexNet, VGGNet, GoogLeNet, and so on. As network size keeps growing, the scale of the network parameters also increases, and large-scale convolutional neural network computation places ever higher demands on processor performance and data memory bandwidth. At present industry generally uses high-performance GPUs to meet these requirements, or even designs dedicated convolutional neural network processors to accelerate the computation. However, the computing performance of high-performance GPUs is limited, the efficiency of the resulting convolutional neural network computation still leaves room for improvement — in particular, GPUs cannot satisfy the performance requirements of very large convolutional neural networks — while dedicated convolutional neural network processors are costly to design and complex to implement.
The GPDSP is a system with powerful computing capability. It contains a CPU core unit and a DSP core unit: the CPU core unit is mainly responsible for general transaction management, including storage management, file control, process scheduling, and interrupt handling, and provides full support for a general-purpose operating system; the DSP core unit contains several powerful 64-bit vector processing arrays for handling compute-intensive workloads. The GPDSP's computing power makes it a very good platform for accelerating convolutional neural network computation. However, the GPDSP is a heterogeneous multi-core processor containing CPU cores and DSP cores, with a multi-level storage architecture comprising register files, scalar memories, on-chip vector array memories, on-chip shared storage arrays, off-chip DDR memory, and so on, so existing convolutional neural network computing methods cannot be applied to it directly. Realizing convolutional neural network computation on the GPDSP also raises the problems of how to map the computation onto the CPU core and the multiple DSP cores containing 64-bit vector processing arrays, and how to exploit the multi-level parallelism of the GPDSP. At present there is no effective solution for convolutional neural network computation based on the GPDSP, and a GPDSP-oriented convolutional neural network multi-core parallel computing method that uses the structural features and multi-level parallelism of the GPDSP to improve computing efficiency is urgently needed.
Summary of the invention
The technical problem to be solved by the present invention, in view of the technical problems of the prior art, is to provide a GPDSP-oriented convolutional neural network multi-core parallel computing method that is simple in principle and convenient to operate, gives full play to the performance and multi-level parallelism of the CPU core and DSP cores in the GPDSP, and achieves high computing efficiency and good performance.
To solve the above technical problems, the technical solution proposed by the present invention is as follows:
A GPDSP-oriented convolutional neural network multi-core parallel computing method, whose steps include:
S1. The CPU core in the GPDSP constructs, in off-chip DDR memory, two data buffers for storing input image data and one weight data buffer for storing convolution kernel data;
S2. According to the number of images that SIMD can process in parallel, the CPU core merges a specified number of convolution kernels, generates the convolution kernel data required for the computation, and stores it in the weight data buffer;
S3. The CPU core monitors the idle state of the two data buffers; if a data buffer is free, the CPU core reads the specified frames of image data to be computed, merges them, generates the image data required for the computation, and transfers it to the idle data buffer;
S4. The CPU core checks the idle state of each DSP core in the GPDSP and the data state of the two data buffers; if every DSP core is idle and the data of a target data buffer is ready, the address of the target data buffer and the address of the weight data buffer are transferred to each DSP core to start the DSP computation;
S5. Each DSP core, according to the received addresses, performs the convolutional neural network computation in parallel on its images in the target data buffer;
S6. The CPU core monitors the computation state of the two data buffers and of the DSP cores, and when it observes that the data in a data buffer has been fully processed and the DSP computation has ended, it outputs the current computation result;
S7. Steps S3-S6 are repeated until the computation of all image data is complete.
As a further improvement of the present invention: the size of the data buffers in step S1 is configured according to the number p of DSP cores in the GPDSP, the number c of input image channels, and the number d of images that SIMD can process in parallel.
As a further improvement of the present invention: the size of each data buffer is configured as the storage capacity of n input images, with n = p*c*d, where d = 64/w and w is the bit width of the image elements to be computed.
As a further improvement of the present invention: step S1 further includes setting two data buffer status flags that respectively indicate whether the data of the two data buffers is ready, and one DSP computation status flag that indicates whether the DSP cores are idle;
In step S3 the CPU core of the GPDSP judges whether the DSP cores are idle according to the DSP computation status flag, and judges whether the data of the two data buffers is ready according to the data buffer status flags;
In step S6, when the CPU core observes that the data in a data buffer has been fully processed, it resets the corresponding data buffer status flag, and when it observes that the DSP computation has ended, it resets the DSP computation status flag.
As a further improvement of the present invention: the merging in step S2 specifically merges d different convolution kernel values to generate the convolution kernel data required for the computation, where d is the number of images that SIMD in the GPDSP can process in parallel.
As a further improvement of the present invention: the merging in step S3 specifically merges d input images to generate the image data required for the computation, where d is the number of images that SIMD in the GPDSP can process in parallel.
As a further improvement of the present invention: the transfer to the idle data buffer in step S3 specifically comprises: dividing the data buffer in advance into p memory blocks, one per DSP core, where p is the number of DSP cores in the GPDSP; and, when the image data required for the computation has been generated, transmitting the generated image data to the memory blocks in sequence.
As a further improvement of the present invention, the parallel convolutional neural network computation in step S5 includes:
S51. Each DSP core processes, in parallel, the n images in the target data buffer, each core handling c*d images; each DSP core computes the first address of the image data it must process from the first address of the target data buffer and its own core ID;
S52. Each DSP core sets up an on-chip weight data buffer in its own vector array memory; DSP core 0 reads convolution kernel data from the weight data buffer in off-chip DDR memory and broadcasts it to the on-chip weight data buffer in the vector array memory of every DSP core; each DSP core reads, in parallel, the corresponding input image data into its own scalar memory buffer according to the first address computed in step S51;
S53. Each DSP core performs the convolutional neural network computation in parallel on the input image data in its own scalar memory buffer and the convolution kernel data in the on-chip weight data buffer of its own vector array memory;
S54. When the convolution kernel data in the on-chip weight data buffer of its vector array memory has been fully used, each DSP core waits at a synchronization point; DSP core 0 of the GPDSP judges whether all DSP cores have arrived, and if so the method returns to step S52 to continue the remaining convolutional neural network computation;
S55. Steps S52-S54 are repeated until this round of convolutional neural network computation is complete.
As a further improvement of the present invention, the specific steps of the parallel convolutional neural network computation in step S53 are:
S531. The W-bit convolution kernel word containing d kernel values, produced by the merging, is expanded into d W-bit convolution kernel words, where W is the bit width of the vector processing array in the GPDSP;
S532. The d W-bit convolution kernel words obtained in step S531 are each multiplied-and-accumulated by SIMD with the W-bit image word containing d images produced by the merging, completing the current convolutional neural network computation.
Compared with the prior art, the advantages of the present invention are as follows:
1) The GPDSP-oriented convolutional neural network multi-core parallel computing method of the invention partitions the convolutional neural network computing task in accordance with the architectural features of the GPDSP: the CPU core runs the operating system and is responsible for the input of image data, the formatting and merging of image and weight data, the scheduling of computing tasks, and the synchronization of status data, while the DSP cores run the parallel convolutional neural network compute kernels, continuously obtaining new computing tasks from the CPU core and reporting results back to it. The advantages of the CPU core's general-purpose computing and the DSP cores' powerful vectorized computing capability are thus fully exploited, and close cooperation between the CPU core and the DSP cores is achieved, so that convolutional neural network multi-core parallel computation is realized efficiently.
2) Based on the architectural features of the GPDSP, the method uses efficient CPU-DSP cooperative computing to map the convolutional neural network computation efficiently onto the CPU core and the multiple DSP cores of the GPDSP. It makes full use of the GPDSP's general-purpose CPU computing, the powerful parallel computing of the DSP vector processing arrays, and the high-bandwidth vector data load capability, fully exploits the multi-level parallelism of the GPDSP, and is applicable to efficient parallel computation of large-scale convolutional neural networks.
3) The method further sets two data buffer status flags that respectively indicate whether the data of the two data buffers is ready, and one DSP computation status flag that indicates whether the DSP cores are idle; the CPU core controls the execution of the computing tasks by monitoring these flags, which further improves the efficient cooperation between the CPU core and the DSP cores and raises the efficiency of the convolutional neural network multi-core parallel computation.
Brief description of the drawings
Fig. 1 is a schematic diagram of the simplified memory access structural model of the GPDSP used in this embodiment.
Fig. 2 is a schematic flow diagram of the implementation of the GPDSP-oriented convolutional neural network multi-core parallel computing method of this embodiment.
Fig. 3 is a schematic flow diagram of the parallel convolutional neural network computation in this embodiment.
Fig. 4 is a schematic diagram of the principle of the SIMD parallel convolutional neural network computation in a specific embodiment of the invention.
Fig. 5 is a schematic flow diagram of the implementation of the convolutional neural network multi-core parallel computing method in a specific embodiment of the invention.
Detailed description of embodiments
The invention is further described below in conjunction with the accompanying drawings and specific preferred embodiments, which do not limit the scope of protection of the invention.
The simplified memory access structural model of the GPDSP used in this embodiment is shown in Fig. 1. The system includes a CPU core unit and a DSP core unit, where the DSP core unit contains several 64-bit vector processing array compute units, dedicated internal scalar memories, and vector array memories; the CPU core unit and the DSP core unit share a large-capacity on-chip shared memory and off-chip DDR memory. That is, the GPDSP contains multiple DSP cores with 64-bit vector processing arrays and can perform parallel data processing simultaneously by SIMD.
As shown in Fig. 2, the steps of the GPDSP-oriented convolutional neural network multi-core parallel computing method of this embodiment include:
S1. The CPU core in the GPDSP constructs, in off-chip DDR memory, two data buffers (input1 and input2) for storing input image data and one weight data buffer for storing convolution kernel data.
The size of the data buffers is configured according to the number p of DSP cores with 64-bit vector processing arrays in the GPDSP, the number c of input image channels, and the number d of images that SIMD can process in parallel.
In a concrete embodiment, the size of each data buffer is configured as the storage capacity of n input images, with n = p*c*d, where d = 64/w and w is the bit width of the image elements to be computed; w = 64, 32, 16, 8, 4, or 2 indicates that the image elements are 64-, 32-, 16-, 8-, 4-, or 2-bit data respectively. From the bit width w of the image elements to be computed, the number of images d = 64/w that SIMD can process in parallel can be determined, and then n = p*c*d.
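The sizing relation above can be illustrated with a short Python sketch. This is purely illustrative and not part of the claimed method; the function name and return convention are editorial choices, not from the patent.

```python
def buffer_size_images(p, c, w, word_bits=64):
    """Return (d, n): d = images packed per SIMD word, n = images held
    by one data buffer, following n = p*c*d with d = word_bits/w as
    described above. All identifiers here are illustrative."""
    assert word_bits % w == 0, "element width must divide the SIMD word"
    d = word_bits // w   # number of images SIMD processes in parallel
    n = p * c * d        # images held by one data buffer
    return d, n
```

For example, with p = 8 DSP cores, c = 3 input channels, and 16-bit image elements, d = 4 and each data buffer holds n = 96 images.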
S2. According to the number d of images that SIMD can process in parallel, the CPU core merges a specified number of convolution kernels, generates the convolution kernel data required for the computation, and stores it in the weight data buffer.
The merging specifically combines d different convolution kernel values to generate the convolution kernel data required for the computation, where d is the number of images that SIMD in the GPDSP can process in parallel. That is, multiple different convolution kernel values are merged according to the bit width of the image data to be computed: if d = 2 and the image data is 32-bit, two different kernel values are merged, with the high 32 bits and the low 32 bits each holding one 32-bit kernel element; if d = 4 and the image data is 16-bit, one 64-bit word holds four 16-bit kernel elements; and so on.
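The merging of d values into one 64-bit word can be modelled in Python as bit-packing. This is an illustrative model only: the real GPDSP would achieve this through memory layout rather than integer arithmetic, and the function name is hypothetical.

```python
def pack_values(values, w, word_bits=64):
    """Pack d = word_bits//w w-bit values into one integer word, with
    values[0] in the low w bits. For d = 2 and w = 32 this matches the
    description above: one value in the low 32 bits, one in the high."""
    d = word_bits // w
    assert len(values) == d, "exactly d values must be supplied"
    word = 0
    for i, v in enumerate(values):
        assert 0 <= v < (1 << w), "each value must fit in w bits"
        word |= v << (i * w)   # place value i in lane i
    return word
```

The same packing applies unchanged to the merging of d images in step S3 below, with image elements in place of kernel values.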
S3. The CPU core monitors the idle state of the two data buffers; if a data buffer is free, the CPU core reads the specified frames of image data to be computed, merges them, generates the image data required for the computation, and transfers it to the idle data buffer. The image data may be input image data from an externally connected camera or image data from another data source.
Since the GPDSP contains multiple DSP cores with 64-bit vector processing arrays and can perform parallel data processing simultaneously by SIMD, the merging specifically combines d input images to generate the image data required for the computation; that is, multiple images are merged according to the bit width of the image data to be computed. For example, if the image data is 32-bit, two images are merged, with the data of one image stored in the high 32 bits and the data of the other image stored in the low 32 bits; if the image data is 16-bit, four images are merged; and so on.
The specific steps of transferring the image data to the idle data buffer are: dividing the data buffer in advance, according to the number and order of the DSP cores, into p adjacent memory blocks, one per DSP core, where p is the number of DSP cores in the GPDSP; and, when the image data required for the computation has been generated, transmitting the generated image data to the memory blocks in sequence. That is, over the p memory blocks, in channel order, the data of one channel is transmitted to each memory block in turn, and each transfer carries the d merged images.
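The division of a data buffer into p per-core memory blocks amounts to simple address arithmetic, sketched below. This is a hedged illustration: `group_bytes`, the size of one merged d-image group, is a hypothetical parameter not specified by the patent, and each core's block is assumed to hold c such groups (c*d images).

```python
def core_block_addrs(base_addr, p, c, group_bytes):
    """Start address of each DSP core's memory block inside a data
    buffer split into p contiguous blocks, one per core, each block
    holding c merged image groups of group_bytes bytes (illustrative)."""
    block_bytes = c * group_bytes          # bytes per core's block
    return [base_addr + i * block_bytes for i in range(p)]
```

Note that element i of the result is also the "first address" that DSP core i derives from the buffer base address and its core ID in step S51 below.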
S4. The CPU core checks the idle state of each DSP core in the GPDSP and the data state of the two data buffers; if it determines that every DSP core is idle and the data of a target data buffer is ready, the address of the target data buffer and the address of the weight data buffer are transferred to each DSP core, starting the DSP cores' convolutional neural network computation on the image data in that buffer.
S5. Each DSP core, according to the received addresses, performs the convolutional neural network computation in parallel on its images in the target data buffer.
As shown in Fig. 3, the steps of the parallel convolutional neural network computation in this embodiment include:
S51. The DSP cores process, in parallel, the n images in the target data buffer, each core handling c*d images; each DSP core computes the first address of the image data it must process from the first address of the target data buffer and its own core ID;
S52. Each DSP core sets up an on-chip weight data buffer in its own vector array memory; DSP core 0 reads convolution kernel data from the weight data buffer in off-chip DDR memory and broadcasts it to the on-chip weight data buffer in the vector array memory of every DSP core; each DSP core reads, in parallel, the corresponding input image data into its own scalar memory buffer according to the first address computed in step S51;
S53. Each DSP core performs the convolutional neural network computation in parallel on the input image data in its own scalar memory buffer and the convolution kernel data in the on-chip weight data buffer of its own vector array memory;
S54. When the convolution kernel data in the on-chip weight data buffer of its vector array memory has been fully used, each DSP core waits at a synchronization point; DSP core 0 of the GPDSP judges whether all DSP cores have arrived, and if so the method returns to step S52 to continue the remaining convolutional neural network computation;
S55. Steps S52-S54 are repeated until this round of convolutional neural network computation is complete.
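The tile loop S52-S54 can be summarized with a sequential Python simulation. This is illustrative only: on the real GPDSP the DSP cores run concurrently and synchronize through a barrier managed by core 0, whereas this sketch replays the cores one by one; `compute` stands in for the per-core convolution kernel program.

```python
def dsp_compute_loop(kernel_tiles, core_images, compute):
    """Sequential model of S52-S55: for each kernel tile (read from
    off-chip memory by core 0 and broadcast to all cores), every core
    applies `compute` to its own images; all cores then synchronize
    before the next tile is loaded. Cores are simulated in turn."""
    results = []
    for tile in kernel_tiles:                 # S52: load + broadcast tile
        per_core = [compute(tile, imgs)       # S53: per-core computation
                    for imgs in core_images]
        results.append(per_core)              # S54: barrier between tiles
    return results
```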
The specific steps of the parallel convolutional neural network computation in step S53 are:
S531. The 64-bit convolution kernel word containing d kernel values, produced by the merging, is expanded into d 64-bit convolution kernel words. For example, when d = 2, the low 32 bits are expanded into a 64-bit kernel word A and the high 32 bits into another 64-bit kernel word B, where the high 32 bits of A are identical to the low 32 bits before expansion and the low 32 bits of B are identical to the high 32 bits before expansion; other values of d follow the same principle.
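The expansion of step S531 can be written generically for any d, as sketched below in illustrative Python: each kernel value extracted from the packed word is replicated across all d lanes, which for d = 2 reproduces the words A and B described above.

```python
def expand_packed_kernel(word, w, word_bits=64):
    """Expand one packed word holding d = word_bits//w kernel values
    into d words, each filled with d copies of a single value. For
    d = 2, w = 32: out[0] repeats the low 32 bits (word A above) and
    out[1] repeats the high 32 bits (word B above)."""
    d = word_bits // w
    mask = (1 << w) - 1
    out = []
    for i in range(d):
        v = (word >> (i * w)) & mask   # extract kernel value in lane i
        rep = 0
        for j in range(d):
            rep |= v << (j * w)        # replicate it across all d lanes
        out.append(rep)
    return out
```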
S532. The d 64-bit convolution kernel words obtained in step S531 are each multiplied-and-accumulated by SIMD with the 64-bit image word containing d images produced by the merging, completing the current convolutional neural network computation.
As shown in Fig. 4, in a concrete embodiment the steps of the SIMD parallel convolutional neural network computation with 2 convolution kernel values and 2 images are:
Step 1: The 64-bit convolution kernel word R containing 2 kernel values is expanded into 2 64-bit kernel words A and B;
Step 2: The 2 64-bit kernel words A and B are each multiplied-and-accumulated by SIMD with the 64-bit image word D containing 2 images, i.e., A, D, and E undergo a SIMD multiply-add, and B, D, and F undergo a SIMD multiply-add.
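The lane-wise multiply-add of Step 2 (A·D added into accumulator E, B·D into accumulator F) can be modelled in Python as below. This is a simplification: lane results here wrap modulo 2^w, whereas a real DSP's multiply-add may widen or saturate; the function name and wrap-around convention are illustrative assumptions.

```python
def simd_mac(acc, a, b, w, word_bits=64):
    """Lane-wise multiply-accumulate on packed words: each w-bit lane
    of `a` is multiplied by the matching lane of `b` and added into
    the matching lane of `acc`, each lane wrapping at w bits."""
    mask = (1 << w) - 1
    out = 0
    for i in range(0, word_bits, w):
        la = (a >> i) & mask
        lb = (b >> i) & mask
        lc = (acc >> i) & mask
        out |= ((lc + la * lb) & mask) << i   # per-lane acc + a*b
    return out
```

With the packed words of the d = 2 example, Step 2 corresponds to `E = simd_mac(E, A, D, 32)` followed by `F = simd_mac(F, B, D, 32)`.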
S6. The CPU core monitors the computation state of the two data buffers and of the DSP cores; when it observes that the data in the two data buffers has been fully processed and the DSP computation has ended, it outputs the current computation result.
S7. Steps S3-S6 are repeated until the computation of all image data is complete.
The above method of this embodiment partitions the convolutional neural network computing task in accordance with the architectural features of the GPDSP: the CPU core runs the operating system and is responsible for the input of image data, the formatting and merging of image and weight data, the scheduling of computing tasks, and the synchronization of status data, while the DSP cores run the parallel convolutional neural network compute kernels, continuously obtaining new computing tasks from the CPU core and reporting results back to it. The advantages of the CPU core's general-purpose computing and the DSP cores' powerful vectorized computing capability are thus fully exploited, and close cooperation between the CPU core and the DSP cores is achieved, so that convolutional neural network multi-core parallel computation is realized efficiently.
In a concrete embodiment of the invention, step S1 further includes setting two data buffer status flags (flag1 and flag2) that respectively indicate whether the data of the two data buffers (input1 and input2) is ready, and one DSP computation status flag (flag3) that indicates whether the DSP cores are idle. In step S3 the CPU core of the GPDSP judges whether the DSP cores are idle according to the DSP computation status flag, and judges whether the data of the two data buffers is ready according to the data buffer status flags; in step S6, when the CPU core observes that the data in a data buffer has been fully processed, it resets the corresponding data buffer status flag, and when it observes that the DSP computation has ended, it resets the DSP computation status flag. This flag configuration further improves the efficiency of computation and data acquisition. As shown in Fig. 5, the detailed steps of the convolutional neural network multi-core parallel computation with the flag configuration are as follows, each step following the same principles as above:
The CPU core of step 1:GPDSP outside piece DDR memory construct two data buffer area input1 and input2 and
One weighted data buffer area, the size of data buffer area are the memory capacity of n auxiliary input image, while configuring two state marks
Will flag1 and flag2 indicate respectively two data buffer areas input1 and input2 whether data ready and a DSP
Whether idle assess calculation Status Flag flag3 mark DSP core.
The CPU core of step 2:GPDSP is capable of the picture number d of parallel processing to d different convolution Nuclear Datas according to SIMD
Format analysis processing is merged, generation meets convolution Nuclear Data required for this is calculated, is stored in weighted data buffer area.
The CPU core monitoring data of step 3:GPDSP cache distinctive emblem (flag1 and flag2), if available free data buffer storage
Area, then CPU core externally enters d width image data and merges format analysis processing, and generation meets image data required for this is calculated,
It is transferred to idle data buffer area in order, by respective flag position 1 when data buffer area is full of.
The CPU core of step 4:GPDSP calculates Status Flag (flag3) judges whether DSP core is idle according to DSP core, and
Data buffer area Status Flag (flag1 and flag2) judges whether data buffer area data are ready;If DSP core is idle, and has number
According to buffer area data ready, then data buffer storage regional address and weighted data buffer zone address are transferred to DSP core, start DSP core
Convolutional neural networks calculating is carried out to the data buffer area image data.
Multiple DSP cores of step 5:GPDSP carry out parallel-convolution neural network meter to the n auxiliary image data of data buffer area
It calculates, after DSP core completes the calculating of this convolutional neural networks to the calculating of input picture layer, setting data buffer area data processing is complete
Finish mark;DSP core continues subsequent convolutional neural networks and calculates, and after completing the whole calculating of this convolutional neural networks, sets DSP core
Calculating terminates label.
Step 6: After the CPU core of the GPDSP detects a data-buffer-processing-finished flag, it sets the corresponding status flag to 0; after it detects a DSP-core-computation-finished flag, it sets the corresponding DSP core computation status flag to 0. The CPU core then transfers the computation results out.
Step 7: Repeat steps 3 to 6 until all computations are complete.
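The producer-consumer protocol of steps 3 to 6 can be sketched as a CPU-side scheduling loop. This is a minimal illustration, not the patented implementation: all names (`schedule`, `fill`, `start_dsp`, `dsp_done`, `drain`) are hypothetical stand-ins for the DMA transfer, core start, flag polling, and result readout that the patent describes only at the level of the flag1/flag2/flag3 protocol.

```python
from collections import deque

def schedule(images, fill, start_dsp, dsp_done, drain):
    """CPU-side double-buffer loop (steps 3-6, sketched).

    images: iterable of merged image batches to process.
    fill/start_dsp/dsp_done/drain: callbacks standing in for the
    data transfer, DSP core start, completion-flag check, and
    result readout. Assumes dsp_done eventually returns True.
    """
    free = deque(range(2))   # the two data buffers; flag1/flag2 == 0
    ready = deque()          # buffers whose flag is 1 (data ready)
    pending = deque(images)
    results = []
    in_flight = None         # buffer the DSP cores are working on
    while pending or ready or in_flight is not None:
        # Step 3: fill free buffers, in order, and raise their flags.
        while free and pending:
            idx = free.popleft()
            fill(idx, pending.popleft())
            ready.append(idx)
        # Step 4: if the DSP cores are idle, start the oldest ready buffer.
        if in_flight is None and ready:
            in_flight = ready.popleft()
            start_dsp(in_flight)
        # Step 6: on completion, drain results and clear the flag to 0.
        if in_flight is not None and dsp_done(in_flight):
            results.append(drain(in_flight))
            free.append(in_flight)
            in_flight = None
    return results
```

Because two buffers alternate, the CPU can fill one buffer while the DSP cores consume the other, overlapping I/O with computation.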
The above are merely preferred embodiments of the present invention and are not intended to limit the invention in any form. Although the invention has been disclosed above by way of preferred embodiments, these are not intended to limit it. Any simple modification, equivalent substitution, or improvement made to the above embodiments in accordance with the technical substance of the invention, without departing from the content of the technical solution of the invention, shall fall within the scope of protection of the technical solution of the invention.
Claims (9)
1. A convolutional neural network multi-core parallel computing method for a GPDSP, characterized in that the steps include:
S1. The CPU core in the GPDSP constructs, in off-chip DDR memory, two data buffers for storing input image data and one weight data buffer for storing convolution kernel data;
S2. The CPU core merges a specified number of sets of convolution kernel data according to the number of images that SIMD can process in parallel, generates the convolution kernel data required for the computation, and stores them in the weight data buffer;
S3. The CPU core monitors the idle state of the two data buffers; if a data buffer is free, the CPU core reads in a specified amount of image data to be computed, merges it, generates the image data required for the computation, and transfers it to the free data buffer;
S4. The CPU core judges the idle state of each DSP core in the GPDSP and the data state of the two data buffers; if it determines that the DSP cores are idle and the data of a target data buffer are ready, it passes the address of the target data buffer and the address of the weight data buffer to each DSP core to start the DSP core computation;
S5. Each DSP core performs the convolutional neural network computation in parallel on the images in the target data buffer according to the received addresses;
S6. The CPU core monitors the states of the two data buffers and the computation state of the DSP cores; when it detects that data processing in the two data buffers is finished and the DSP core computation has ended, it outputs the current computation results;
S7. Repeat steps S3 to S6 until the computation of all image data is complete.
2. The convolutional neural network multi-core parallel computing method for a GPDSP according to claim 1, characterized in that the size of the data buffers in step S1 is configured according to the number of DSP cores p in the GPDSP, the number of input image channels c, and the number of images d that SIMD can process in parallel.
3. The convolutional neural network multi-core parallel computing method for a GPDSP according to claim 2, characterized in that the size of each data buffer is configured as the storage capacity of n input images, where n = p*c*d and d = 64/w, with w being the bit width of the image elements to be computed.
4. The convolutional neural network multi-core parallel computing method for a GPDSP according to claim 1, 2, or 3, characterized in that step S1 further includes setting two data buffer status flags to indicate whether the data in the corresponding data buffers are ready, and DSP core computation status flags to indicate whether the DSP cores are idle;
in step S4, the CPU core of the GPDSP judges whether the DSP cores are idle according to the DSP core computation status flags, and whether the data in the two data buffers are ready according to the data buffer status flags;
in step S6, when the CPU core detects that data processing in a data buffer is finished, it resets the corresponding data buffer status flag; when it detects that a DSP core computation has finished, it resets the corresponding DSP core computation status flag.
5. The convolutional neural network multi-core parallel computing method for a GPDSP according to claim 1, 2, or 3, characterized in that the merging in step S2 specifically consists of merging d different sets of convolution kernel data to generate the convolution kernel data required for the computation, where d is the number of images that SIMD can process in parallel in the GPDSP.
6. The convolutional neural network multi-core parallel computing method for a GPDSP according to claim 1, 2, or 3, characterized in that the merging in step S3 specifically consists of merging d input images to generate the image data required for the computation, where d is the number of images that SIMD can process in parallel in the GPDSP.
7. The convolutional neural network multi-core parallel computing method for a GPDSP according to claim 6, characterized in that the specific step of transferring data to the free data buffer in step S3 is: the data buffer is divided in advance into p storage blocks in one-to-one correspondence with the DSP cores, where p is the number of DSP cores in the GPDSP; when the image data required for the computation are generated, the generated image data are transferred to each storage block in sequence.
8. The convolutional neural network multi-core parallel computing method for a GPDSP according to claim 2 or 3, characterized in that the step of performing the parallel convolutional neural network computation in step S5 includes:
S51. Each DSP core processes the n images in the target data buffer in parallel, each DSP core handling c*d images; each DSP core computes the first address of the image data it must process from the first address of the target data buffer and its core ID;
S52. Each DSP core allocates an on-chip weight data buffer in its vector memory array; DSP core 0 reads the convolution kernel data from the weight data buffer in off-chip DDR memory and broadcasts it to the on-chip weight data buffers in the vector memory arrays of all DSP cores; each DSP core reads its corresponding input image data in parallel, according to the first address computed in step S51, into its scalar memory buffer;
S53. Each DSP core performs the convolutional neural network computation in parallel between the input image data in its scalar memory buffer and the convolution kernel data in the on-chip weight data buffer of its vector memory array;
S54. After the convolution kernel data in the on-chip weight data buffer of each DSP core's vector memory array have been fully processed, the cores synchronize and wait; DSP core 0 of the GPDSP judges whether all DSP cores have arrived; if so, go to step S52 and continue the convolutional neural network computation of the subsequent part;
S55. Repeat steps S52 to S54 until the current convolutional neural network computation is complete.
9. The convolutional neural network multi-core parallel computing method for a GPDSP according to claim 8, characterized in that the specific steps of performing the convolutional neural network computation in parallel in step S53 are:
S531. Each W-bit convolution kernel datum containing d kernel values, generated by the merging, is successively expanded into d W-bit convolution kernel data, where W is the bit width of the vector processing array in the GPDSP;
S532. The d W-bit convolution kernel data expanded in step S531 are successively subjected to SIMD multiply-add computation with the W-bit image data containing d images generated by the merging, completing the current convolutional neural network computation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810689646.3A CN108920413B (en) | 2018-06-28 | 2018-06-28 | Convolutional neural network multi-core parallel computing method facing GPDSP |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108920413A CN108920413A (en) | 2018-11-30 |
CN108920413B true CN108920413B (en) | 2019-08-09 |
Family
ID=64421783
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108920413B (en) |
Families Citing this family (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109858622B (en) * | 2019-01-31 | 2021-03-02 | 瑞芯微电子股份有限公司 | Data handling circuit and method for deep learning neural network |
CN109886395B (en) * | 2019-03-06 | 2020-11-24 | 上海熠知电子科技有限公司 | Data reading method for multi-core image processing convolutional neural network |
CN109976893A (en) * | 2019-03-29 | 2019-07-05 | 北京润科通用技术有限公司 | The sequential control method and device of real time operating system |
CN109858472B (en) * | 2019-04-09 | 2023-08-04 | 武汉领普科技有限公司 | Embedded real-time humanoid detection method and device |
CN110489356B (en) * | 2019-08-06 | 2022-02-22 | 上海商汤智能科技有限公司 | Information processing method, information processing device, electronic equipment and storage medium |
CN113095471B (en) * | 2020-01-09 | 2024-05-07 | 北京君正集成电路股份有限公司 | Method for improving efficiency of detection model |
CN113095503B (en) * | 2020-01-09 | 2024-05-03 | 北京君正集成电路股份有限公司 | System for realizing high efficiency of detection model |
CN113111995B (en) * | 2020-01-09 | 2024-08-02 | 北京君正集成电路股份有限公司 | Method for shortening model reasoning and model post-processing running time |
CN111897579B (en) * | 2020-08-18 | 2024-01-30 | 腾讯科技(深圳)有限公司 | Image data processing method, device, computer equipment and storage medium |
CN112068955B (en) * | 2020-08-21 | 2023-10-27 | 北京科技大学 | Communication optimization method in heterogeneous multi-core platform processor and electronic equipment |
CN112101284A (en) * | 2020-09-25 | 2020-12-18 | 北京百度网讯科技有限公司 | Image recognition method, training method, device and system of image recognition model |
CN113469350B (en) * | 2021-07-07 | 2023-03-24 | 武汉魅瞳科技有限公司 | Deep convolutional neural network acceleration method and system suitable for NPU |
CN113869446A (en) * | 2021-10-11 | 2021-12-31 | 沈阳航空航天大学 | CNN target identification system and method based on FPGA |
CN116303108B (en) * | 2022-09-07 | 2024-05-14 | 芯砺智能科技(上海)有限公司 | Weight address arrangement method suitable for parallel computing architecture |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106709462A (en) * | 2016-12-29 | 2017-05-24 | 天津中科智能识别产业技术研究院有限公司 | Indoor positioning method and device |
CN107657581A (en) * | 2017-09-28 | 2018-02-02 | 中国人民解放军国防科技大学 | Convolutional neural network CNN hardware accelerator and acceleration method |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6925641B1 (en) * | 2000-02-04 | 2005-08-02 | Xronix Communications, Inc. | Real time DSP load management system |
CN102591657B (en) * | 2011-12-29 | 2014-06-25 | 东南大学 | Graphical user interface (GUI) system achieving method based on collaboration mechanism of central processing unit (CPU) and digital signal processor (DSP) |
KR101834195B1 (en) * | 2012-03-15 | 2018-04-13 | 삼성전자주식회사 | System and Method for Balancing Load on Multi-core Architecture |
CN104615584B (en) * | 2015-02-06 | 2017-12-22 | 中国人民解放军国防科学技术大学 | The method for solving vectorization calculating towards GPDSP extensive triangular linear equation group |
CN106228238B (en) * | 2016-07-27 | 2019-03-22 | 中国科学技术大学苏州研究院 | Accelerate the method and system of deep learning algorithm on field programmable gate array platform |
CN106959937B (en) * | 2017-03-30 | 2019-03-29 | 中国人民解放军国防科学技术大学 | A kind of vectorization implementation method of the warp product matrix towards GPDSP |
CN107301456B (en) * | 2017-05-26 | 2020-05-12 | 中国人民解放军国防科学技术大学 | Deep neural network multi-core acceleration implementation method based on vector processor |
CN107862378B (en) * | 2017-12-06 | 2020-04-24 | 芯原微电子(上海)股份有限公司 | Multi-core-based convolutional neural network acceleration method and system, storage medium and terminal |
CN107885700B (en) * | 2017-12-29 | 2021-05-14 | 中国人民解放军国防科技大学 | Multi-core implementation method for large-scale matrix convolution |
CN108205702B (en) * | 2017-12-29 | 2020-12-01 | 中国人民解放军国防科技大学 | Parallel processing method for multi-input multi-output matrix convolution |
Non-Patent Citations (1)
Title |
---|
An efficient SIMD parallel memory structure for the radix-2 FFT algorithm; Chen Haiyan et al.; Acta Electronica Sinica (《电子学报》); 2016-02-29 (No. 2); pp. 241-246 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108920413B (en) | Convolutional neural network multi-core parallel computing method facing GPDSP | |
CN111897579B (en) | Image data processing method, device, computer equipment and storage medium | |
CN104899182B (en) | A kind of Matrix Multiplication accelerated method for supporting variable partitioned blocks | |
TWI749249B (en) | Chip device, chip, intelligent device and operation method of the neural network | |
CN107657581B (en) | Convolutional neural network CNN hardware accelerator and acceleration method | |
CN109543832B (en) | Computing device and board card | |
CN107301456B (en) | Deep neural network multi-core acceleration implementation method based on vector processor | |
CN107679620A (en) | Artificial neural network processing unit | |
CN108805797A (en) | Optimized computing hardware for machine learning operation | |
CN109558937A (en) | The operating method of nerve network system and nerve network system | |
CN101717817B (en) | Method for accelerating RNA secondary structure prediction based on stochastic context-free grammar | |
CN108388537A (en) | A kind of convolutional neural networks accelerator and method | |
CN101398753A (en) | System, method and computer program product for performing a scan operation | |
WO2022252568A1 (en) | Method based on gpgpu reconfigurable architecture, computing system, and apparatus for reconfiguring architecture | |
CN111105023B (en) | Data stream reconstruction method and reconfigurable data stream processor | |
WO2020253383A1 (en) | Streaming data processing method based on many-core processor, and computing device | |
WO2022226721A1 (en) | Matrix multiplier and method for controlling matrix multiplier | |
CN109711540B (en) | Computing device and board card | |
CN110796244B (en) | Core computing unit processor for artificial intelligence device and accelerated processing method | |
US11714649B2 (en) | RISC-V-based 3D interconnected multi-core processor architecture and working method thereof | |
CN108090865A (en) | The in-orbit real-time streaming processing method of optical satellite remote sensing image and system | |
US20230289580A1 (en) | Neural network circuit and neural network circuit control method | |
CN115482456A (en) | High-energy-efficiency FPGA (field programmable Gate array) acceleration framework of YOLO (YOLO) algorithm | |
CN118036776A (en) | Model training method and related device | |
CN111260070B (en) | Operation method, device and related product |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||