CN113837923B - Data processing device, data processing method and related products
- Publication number: CN113837923B (application CN202111131280.6A)
- Authority: CN (China)
- Legal status: Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T1/00—General purpose image data processing
- G06T1/20—Processor architectures; Processor configuration, e.g. pipelining
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Abstract
The present disclosure discloses a data processing apparatus, a data processing method for executing a block instruction using the data processing apparatus, and related products. The data processing apparatus may be included as a computing apparatus in a combined processing apparatus, which may further include an interface apparatus and other processing apparatus. The computing apparatus interacts with the other processing apparatus to jointly complete a computing operation specified by a user. The combined processing apparatus may further include a storage apparatus connected to the computing apparatus and the other processing apparatus, respectively, for storing data of the computing apparatus and the other processing apparatus. The scheme of the present disclosure realizes storage with data dimension conversion in small convolution operations and improves operation processing efficiency.
Description
Technical Field
The present disclosure relates generally to the field of data processing. More particularly, the present disclosure relates to a data processing apparatus, a data processing method, a chip, and a board for executing a block instruction on data using the data processing apparatus.
Background
Currently, deep learning has become an important branch of machine learning and has greatly advanced the development of artificial intelligence (AI). The core technology of deep learning, the deep neural network (DNN), has found wide application in many industries.
Neural networks are among the most critical technologies in artificial intelligence and deep learning, and the convolutional neural network (Convolutional Neural Network, CNN) is one of the most important network types. The most critical computation in a convolutional neural network is the convolution operation (Convolution Operation) of the convolution layer (Conv layer). The function of the convolution layer is to extract features from the input data; complex features can be extracted through multiple layers of convolution, ensuring that the network has sufficient expressive power and generalization capability. A neural network model contains a large number of convolution operations of various types, and the computational performance of these convolution operations greatly affects the computational performance of the whole model. When the neural network model is applied to different fields, such as speech recognition, machine translation and image processing, the dimensions of the corresponding input feature maps and weights may differ. To make full use of the hardware advantages of a deep learning processor, convolution operations of different types and different scales need to be optimized to improve the computational performance of executing the neural network model.
Disclosure of Invention
To address at least one or more of the technical problems mentioned above, the present disclosure proposes, in various aspects, a data processing apparatus that, by executing a block instruction on data, enables data of various dimension sizes to be adapted to the hardware performing the convolution operation, thereby improving the computational efficiency of the convolution operation. The convolution operations of embodiments of the present disclosure may be operations in various neural network models, which may be applied in various fields such as image processing, speech processing and text processing, including, for example but not limited to, recognition and classification.
In a first aspect, an embodiment of the present disclosure provides a data processing apparatus including a control circuit, a first storage circuit, and a second storage circuit, wherein: the first storage circuit is used for storing first data before processing; the second storage circuit is used for storing processed second data; and the control circuit is used for configuring and executing a block instruction to convert the first data, stored on the first storage circuit in a first dimension storage order, into the second data, stored on the second storage circuit in a second dimension storage order, wherein the first data is multi-dimensional data whose multi-dimensional shape is as follows:
[high-dimension ho] × [mid-dimension wo] × [co dimension] × [multi-dimensional mix]
wherein the multi-dimensional mix includes at least combinations of: co, mid-dimension ho, low-dimension ho, high-dimension wo, low-dimension wo;
the second data is three-dimensional data, and the three-dimensional shape is as follows:
[ho×wo×co]
wherein the transformed co is located in the lowest storage dimension of the second data, the transformed wo is located in the second lowest storage dimension of the second data, and the transformed ho is located in the highest storage dimension of the second data.
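Purely as an illustrative sketch of the kind of dimension-order conversion performed by the block instruction (and not as a definitive implementation of the claimed apparatus), the NumPy fragment below assumes one hypothetical instance of the multi-dimensional mix, namely the low sub-dimensions of ho and wo left over from a blocked output layout; the block sizes, array names and shapes are assumptions for illustration only. The conversion places co in the lowest storage dimension, wo in the second-lowest and ho in the highest, as required of the second data.

```python
import numpy as np

def block_to_hwc(first_data):
    """Hypothetical first data layout: (ho_high, wo_high, co, ho_low, wo_low).
    Returns second data laid out as (ho, wo, co), with co the lowest storage dimension."""
    ho_hi, wo_hi, co, ho_lo, wo_lo = first_data.shape
    t = first_data.transpose(0, 3, 1, 4, 2)              # (ho_high, ho_low, wo_high, wo_low, co)
    return t.reshape(ho_hi * ho_lo, wo_hi * wo_lo, co)   # contiguous [ho x wo x co]

# Example: 8x8 output points, 8 output channels, stored in 4x4 output blocks.
y_blocked = np.arange(2 * 2 * 8 * 4 * 4).reshape(2, 2, 8, 4, 4)
y_hwc = block_to_hwc(y_blocked)
print(y_hwc.shape)   # (8, 8, 8) = [ho x wo x co]
```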
In a second aspect, embodiments of the present disclosure provide a chip comprising the data processing apparatus of the first aspect.
In a third aspect, embodiments of the present disclosure provide a board card comprising the chip of the foregoing second aspect.
In a fourth aspect, embodiments of the present disclosure provide a data processing method of executing a block instruction on input data using the data processing apparatus of the first aspect.
With the data processing apparatus, chip, board card and data processing method for executing a block instruction provided above, the scheme of the embodiments of the present disclosure performs block processing on the output data of various convolution splitting schemes so as to match the processing capability of the hardware operation apparatus, thereby making full use of the parallel processing capability of multiple slave processing circuits and effectively improving the efficiency of the convolution operation.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. In the accompanying drawings, several embodiments of the present disclosure are illustrated by way of example and not by way of limitation, and like reference numerals denote like or corresponding parts, in which:
FIG. 1 illustrates a block diagram of a board of an embodiment of the present disclosure;
FIG. 2 illustrates a block diagram of a combination processing device according to an embodiment of the present disclosure;
FIG. 3a illustrates a schematic internal architecture of a processor core of a single core computing device of an embodiment of the present disclosure;
FIG. 3b shows a simplified schematic diagram of the internal structure of a multi-core computing device of an embodiment of the present disclosure;
FIG. 4 illustrates an example convolution operation principle example to which embodiments of the present disclosure may be applied;
FIG. 5 shows a schematic block diagram of a computing device according to an embodiment of the disclosure;
FIG. 6 illustrates an exemplary data storage sequence according to an embodiment of the present disclosure;
FIGS. 7a-7c illustrate several exemplary grouping modes according to embodiments of the present disclosure;
FIG. 8 illustrates an exemplary split schematic of an input feature map in accordance with an embodiment of the present disclosure;
FIG. 9 illustrates a split and store schematic of a Forward4 scheme in accordance with an embodiment of the present disclosure;
FIG. 10 shows a schematic diagram of the division of the output points of an arithmetic circuit in a Forward4 scheme in accordance with an embodiment of the present disclosure;
FIG. 11 shows a single operation schematic in a Forward4 scheme in accordance with an embodiment of the disclosure;
FIG. 12 shows a sliding convolution schematic in the Forward4 scheme according to an embodiment of the present disclosure;
FIG. 13 shows a schematic diagram of an output data format of a Forward4 scheme in accordance with an embodiment of the present disclosure;
FIG. 14 illustrates an overall data handling process according to an embodiment of the present disclosure;
FIG. 15 shows a schematic conceptual diagram of TRANS TILING according to an embodiment of the disclosure;
FIG. 16 shows a schematic diagram of a front-to-back configuration table; and
Fig. 17 shows a schematic diagram of executing a block instruction on output neuron data according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings. It is evident that the described embodiments are some, but not all, of the embodiments of the disclosure. Based on the embodiments in this disclosure, all other embodiments obtained by those skilled in the art without inventive effort fall within the scope of protection of the present disclosure.
It should be understood that the terms "first," "second," "third," and "fourth," and the like, as may appear in the claims, specification and drawings of the present disclosure, are used for distinguishing between different objects and not for describing a particular sequential order. The terms "comprises" and "comprising" when used in the specification and claims of the present disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present disclosure is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the present disclosure and claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the claims, the term "if" may be interpreted as "when", "once", "in response to determining" or "in response to detecting", depending on the context.
Exemplary hardware Environment
Fig. 1 shows a schematic structural diagram of a board card 10 according to an embodiment of the present disclosure. As shown in fig. 1, the board card 10 includes a chip 101, which is a system-on-chip (SoC) integrated with one or more combined processing devices. The combined processing device is an artificial intelligence computing unit used to support various deep learning and machine learning algorithms and to meet the intelligent processing requirements of complex fields such as computer vision, speech, natural language processing and data mining. In particular, deep learning technology is widely applied in the cloud intelligence field; a notable characteristic of cloud intelligence applications is the large amount of input data, which places high requirements on the storage and computing capabilities of the platform. The board card 10 of this embodiment is suitable for cloud intelligence applications and has huge off-chip storage, on-chip storage and strong computing capability.
The chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, a wifi interface, or the like. The data to be processed may be transferred by the external device 103 to the chip 101 through the external interface means 102. The calculation result of the chip 101 may be transmitted back to the external device 103 via the external interface means 102. The external interface device 102 may have different interface forms, such as PCIe interfaces, etc., according to different application scenarios.
The board 10 also includes a memory device 104 for storing data, which includes one or more memory cells 105. The memory device 104 is connected to the control device 106 and the chip 101 via a bus and transmits data. The control device 106 in the board 10 is configured to regulate the state of the chip 101. To this end, in one application scenario, the control device 106 may include a single chip microcomputer (Micro Controller Unit, MCU).
Fig. 2 is a block diagram showing a combination processing apparatus in the chip 101 of this embodiment. As shown in fig. 2, the combined processing means 20 comprises computing means 201, interface means 202, processing means 203 and storage means 204.
The computing device 201 is configured to perform user-specified operations, primarily implemented as a single-core smart processor or as a multi-core smart processor, to perform deep learning or machine learning computations, which may interact with the processing device 203 through the interface device 202 to collectively accomplish the user-specified operations.
The interface means 202 are used for transmitting data and control instructions between the computing means 201 and the processing means 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202, writing to a storage device on the chip of the computing device 201. Further, the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202, and write the control instructions into a control cache on the chip of the computing device 201. Alternatively or in addition, the interface device 202 may also read data in the memory device of the computing device 201 and transmit it to the processing device 203.
The processing device 203 is a general-purpose processing device that performs basic control, including but not limited to data handling and starting and/or stopping of the computing device 201. Depending on the implementation, the processing device 203 may be one or more types of processors, including but not limited to a central processing unit (CPU), a graphics processing unit (GPU) or another general-purpose and/or special-purpose processor such as a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and the number thereof may be determined according to actual needs. As previously mentioned, the computing device 201 of the present disclosure, considered on its own, may be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing device 201 and the processing device 203 are considered together, they are regarded as forming a heterogeneous multi-core structure.
The storage device 204 is used to store data to be processed. It may be a DRAM, such as DDR memory, typically 16 GB or larger in size, and is used to store data for the computing device 201 and/or the processing device 203.
Fig. 3a shows a schematic diagram of the internal structure of a processing core when the computing device 201 is a single-core device. The computing device 301 is configured to process input data in fields such as computer vision, speech, natural language and data mining, and includes three modules: a control module 31, an operation module 32 and a storage module 33.
The control module 31 is used to coordinate and control the work of the operation module 32 and the storage module 33 to complete deep learning tasks, and includes an instruction fetch unit (IFU) 311 and an instruction decode unit (IDU) 312. The instruction fetch unit 311 is configured to fetch instructions from the processing device 203, and the instruction decode unit 312 decodes the fetched instructions and sends the decoded results to the operation module 32 and the storage module 33 as control information.
The operation module 32 includes a vector operation unit 321 and a matrix operation unit 322. The vector operation unit 321 is used for performing vector operations and can support complex operations such as vector multiplication, addition, nonlinear transformation and the like; the matrix operation unit 322 is responsible for the core computation of the deep learning algorithm, i.e., matrix multiplication and convolution.
The storage module 33 is used to store or transfer related data and includes a neuron storage unit (neuron RAM, NRAM) 331, a weight storage unit (weight RAM, WRAM) 332, and a direct memory access module (DMA) 333. NRAM 331 is used to store input neurons, output neurons and intermediate calculation results; WRAM 332 is used to store the convolution kernels, i.e. the weights, of the deep learning network; DMA 333 is connected to DRAM 204 via bus 34 and is responsible for data transfer between the computing device 301 and DRAM 204.
Fig. 3b shows a simplified schematic diagram of the internal architecture of the computing device 201 when it is multi-core. The multi-core computing device may be abstracted using a hierarchical hardware model. As shown, the multi-core computing device can be abstracted into four levels, namely the card level (Card) 350, the chip level (Chip) 360, the processor cluster level (Cluster) 370 and the processor core level (Core) 380. The embodiments of the present disclosure mainly concern the data transfer of the storage units and the computation units, so the drawings and description only briefly show and introduce the relevant computation structures, and other portions are omitted.
At the board level, each board contains local DDR memory on it, and each processor chip acts as a compute and control unit.
At the chip level, each processor chip contains multiple processors as computing units.
At the compute cluster level, each multiprocessor includes multiple accelerator cores as control and compute units, and further has shared memory SRAM as a memory unit.
At the processor core level, each accelerator core includes an array of local memory and local processing units. NFU refers to a neural arithmetic unit (Neuron Function Unit) for performing convolution calculations.
In the multi-core computing device, the storage model includes the board card global memory, the SRAM (shared memory) on each Cluster, the NRAM, WRAM and registers on each Core, and so on. For better performance, the data movement between the storage levels below the Card level, and the balance between memory access and computation, may be explicitly controlled. The SRAM is included in a memory processing unit (Memory Process Unit Core, abbreviated MPU or Mem Core). Core refers to an intelligent processing core (Intelligent Process Unit Core, abbreviated IPU Core or Core) in the multi-core computing device. One IPU Core contains NRAM, WRAM, an NFU, and so on. Cluster refers to a processor cluster or computing cluster; a multi-core computing device typically comprises several Clusters, and one Cluster comprises 1 Mem Core + N IPU Cores.
Exemplary convolution operation types
The convolution layers in the neural network model may perform convolution operations to perform feature extraction by applying convolution kernels (also known as filters, weights, etc.) to the input feature map (also known as input data, neurons, or input neurons). The convolution layer may contain a plurality of convolution kernels, each element constituting the convolution kernels corresponding to a weight coefficient and a bias. The disclosed embodiments may be applied to data splitting for various convolution operations.
In the conventional 3D convolution operation, assuming that the tensor shape of the input Feature map (Feature map) in the convolution layer is denoted as X [ N Hi Wi Ci ], the tensor shape of the convolution kernel (kernel) is denoted as K [ Co Kh Kw Ci ], and the output result is Y [ N Ho Wo Co ], the mathematical calculation formula of the simplified convolution operation can be expressed as follows:
$$Y_{in,jc,jh,jw}=\sum_{0\le ic<ci,\;0\le ih<kh,\;0\le iw<kw} X_{in,\,ic,\,jh\times sh+ih,\,jw\times sw+iw}\times K_{jc,ic,ih,iw}\tag{1}$$
In the above formula, X is the input data, Y is the output data, K is the convolution kernel, Kh and Kw are the height and width of K, and sh and sw are the strides in the height and width directions. The formula ignores the bias, the padding and the dilation, and assumes that the input data X has already been padded and the convolution kernel has already been dilated. The formula also leaves the N dimension and the C dimension unexpanded: the forward computation of a neural network model is independent in the N dimension and fully connected in the C dimension. When the convolution kernel works, it sweeps across the input features with a certain stride, and within each convolution window performs an element-wise multiply-and-sum of the input features and superimposes the bias.
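As a minimal reference sketch of formula (1) (a plain NumPy loop nest rather than the hardware implementation described later; it likewise ignores bias, padding and dilation, and all names are illustrative):

```python
import numpy as np

def conv_reference(X, K, sh=1, sw=1):
    """Direct evaluation of formula (1). X: [N, Hi, Wi, Ci]; K: [Co, Kh, Kw, Ci]; returns Y: [N, Ho, Wo, Co]."""
    N, Hi, Wi, Ci = X.shape
    Co, Kh, Kw, _ = K.shape
    Ho, Wo = (Hi - Kh) // sh + 1, (Wi - Kw) // sw + 1
    Y = np.zeros((N, Ho, Wo, Co), dtype=np.float32)
    for n in range(N):
        for co in range(Co):
            for ho in range(Ho):
                for wo in range(Wo):
                    # convolution window of this output point, multiplied element-wise with K[co] and accumulated
                    window = X[n, ho * sh:ho * sh + Kh, wo * sw:wo * sw + Kw, :]
                    Y[n, ho, wo, co] = np.sum(window * K[co])
    return Y

# The 6x6x3 input and 3x3x3 kernel of the FIG. 4 example below yield a 4x4 output feature map:
X = np.random.rand(1, 6, 6, 3).astype(np.float32)
K = np.random.rand(1, 3, 3, 3).astype(np.float32)
print(conv_reference(X, K).shape)   # (1, 4, 4, 1)
```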
Fig. 4 illustrates an example of an exemplary conventional 3D convolution operation principle to which embodiments of the present disclosure may be applied.
The figure exemplarily shows four-dimensional input data X of size [N Hi Wi Ci], which can be represented as N three-dimensional cuboids 410 of size Hi×Wi×Ci. Also shown by way of example is a four-dimensional convolution kernel K of size [Co Kh Kw Ci], which can be represented as Co three-dimensional convolution kernels 420 of size Kh×Kw×Ci. The convolution of the input data X with the convolution kernel K yields output data Y, which is four-dimensional data of size [N Ho Wo Co] and can be represented as N three-dimensional cuboids 430 of size Ho×Wo×Co.
A specific example of the convolution operation is also shown, in which the input data is an input feature map 440 of size 6×6×3 (the N dimension is omitted); the convolution kernel is a 3×3×3 stereo convolution kernel 450 for a single Co; and the output data is a 4×4 output feature map 460. The specific operation process is as follows:
The convolution kernel 450 sweeps across the input feature map 440 with a certain stride, performing an element-wise multiply-and-sum of the input features within a convolution window 470 and superimposing the bias. That is, the value at each position in the output feature map 460 is obtained by performing a two-dimensional convolution on the corresponding block of each input feature map channel with the corresponding convolution kernel channel and then summing the results. For example, the value at position (0, 0) of the output feature map 460 (i.e., a convolution output point) is obtained by two-dimensional convolutions of the convolution window 470, outlined by the black cube in the input feature map, with the stereo convolution kernel 450, which yield 3 values that are then summed to the final value.
To obtain the outputs at other positions, the position of the convolution kernel 450, i.e. the convolution window of the convolution output point, is shifted on the input feature map 440. In the example in the figure, the convolution stride (Sx, Sy) is (1, 1); when the convolution operation is performed after shifting by one cell to the right in the lateral direction (width direction) or downward in the longitudinal direction (height direction), the value at position (0, 1) or (1, 0) of the output feature map 460 is obtained, respectively.
From the above description, in one convolution layer of the neural network there are N groups of input feature maps, each group containing Hi×Wi×Ci pieces of information, where Hi and Wi are the height and width of the input feature map, respectively, and Ci is the number of input feature maps, also referred to as the number of input channels. The convolution layer has Ci×Co convolution kernels of size Kh×Kw, where Ci is the number of input channels, Co is the number of output feature maps (or the number of output channels), and Kh and Kw are the height and width of the convolution kernel, respectively. The output feature map contains Ho×Wo×Co pieces of information, where Ho and Wo are the height and width of the output feature map, respectively, and Co is the number of output channels. In addition, the convolution operation also involves a convolution stride (Sx, Sy), whose size affects the size of the output feature map.
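For reference, the relationship alluded to above between the convolution stride and the output size can be written out explicitly; this is the standard relation (padding shown for completeness, even though formula (1) assumes the input has already been padded):

$$Ho=\left\lfloor\frac{Hi+2\,pad_h-Kh}{Sy}\right\rfloor+1,\qquad Wo=\left\lfloor\frac{Wi+2\,pad_w-Kw}{Sx}\right\rfloor+1$$

With the FIG. 4 example (Hi = Wi = 6, Kh = Kw = 3, no padding, stride (1, 1)), this gives Ho = Wo = 4, matching the 4×4 output feature map 460.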
Herein, input feature map (Feature map), input data, neurons, and input neurons are used interchangeably; convolution kernel, filter, and weights are used interchangeably. Further, the H (height) and Y dimensions are used interchangeably, as are the W (width) and X dimensions. Accordingly, the H dimension of the input feature map may be denoted Hi or Yi, and the H dimension of the output feature map may be denoted Ho or Yo, with the W dimension denoted similarly. In the embodiments of the present disclosure, each convolution output point has a corresponding convolution window whose shape equals the shape of the convolution kernel. The value of each convolution output point corresponds to the result of the element-wise multiply-accumulate of the input feature map and the weights within its convolution window.
Exemplary computing device/data processing device
In embodiments of the present disclosure, computing devices in a master-slave configuration may be employed to implement the convolution operations described above. Further, different data paths can be configured for the input feature map and the convolution kernel, so that memory efficiency is improved.
Fig. 5 shows a schematic block diagram of a computing device 500 according to an embodiment of the disclosure. It will be appreciated that this architecture may be regarded as a refinement of the internal architecture of the operation module of a single processing core in fig. 3, or as a functional block diagram combining the operation modules of the multiple processing cores shown in fig. 3. As shown in fig. 5, the computing device 500 of an embodiment of the disclosure may be configured to perform various types of convolution operations and may include a master processing circuit (MA) 510 and a plurality of slave processing circuits (SL) 520, 16 of which are shown as SL0 to SL15. Those skilled in the art will appreciate that the number of slave processing circuits may be greater or smaller depending on the particular hardware configuration, and embodiments of the present disclosure are not limited in this respect.
The master processing circuit and the slave processing circuits and the plurality of slave processing circuits may communicate with each other via various connections. In different application scenarios, the connection manner between the plurality of slave processing circuits may be a hard connection manner through hard wire arrangement, or may be a logic connection manner configured according to, for example, micro instructions, so as to form a topology structure of the plurality of slave processing circuit arrays. The disclosed embodiments are not limited in this respect. The master processing circuit and the slave processing circuit may cooperate with each other, thereby realizing parallel arithmetic processing.
In order to support the arithmetic function, the master processing circuit and the slave processing circuit may include various calculation circuits, and may include a vector arithmetic unit and a matrix arithmetic unit, for example. The vector operation unit is used for executing vector operation and can support complex operations such as vector multiplication, addition, nonlinear transformation and the like; the matrix operation unit is responsible for core computation of the deep learning algorithm, such as matrix multiplication and convolution.
The slave processing circuit may be configured to perform an intermediate operation on the corresponding data in parallel according to the operation instruction to obtain a plurality of intermediate results, and transmit the plurality of intermediate results back to the master processing circuit, for example.
By arranging the computing device 500 in a master-slave configuration (e.g., one master and multiple slaves, or multiple masters and multiple slaves; the disclosure is not limited in this respect), for a computing instruction of a forward operation, the data may be split according to the computing instruction, so that the portion with the larger amount of computation is computed in parallel by the multiple slave processing circuits. This increases the computing speed, saves computing time and in turn reduces power consumption.
In some embodiments of the present disclosure, by transmitting the input feature map and the weight using different data paths, multiple multiplexing modes of the input feature map and the weight may be supported, so as to reduce the data access amount during the operation and improve the processing efficiency.
Specifically, the computing device 500 may further include a first storage device 530 and a second storage device 540 for respectively storing data transmitted via different data channels.
The first memory circuit 530 may be used to store multicast data, i.e., the data in the first memory circuit is to be transmitted over a broadcast bus to a plurality of slave processing circuits that receive the same data. It will be appreciated that broadcast and multicast may be implemented over a broadcast bus. Multicasting refers to a communication scheme in which a piece of data is transmitted to a plurality of slave processing circuits; and broadcasting is a communication mode of transmitting a piece of data to all slave processing circuits, which is a special case of multicasting. Since both multicast and broadcast correspond to one-to-many transmission, which are not purposely distinguished herein, broadcast and multicast may be collectively referred to as multicast, whose meaning will be apparent to those skilled in the art from the context.
The second storage circuit 540 may be used to store distribution data, i.e. the data in the second storage circuit will be transferred to different slave processing circuits, each receiving different data.
By providing the first memory circuit and the second memory circuit separately, transmission of data to be operated on in different transmission modes can be supported, thereby reducing the data access amount by multiplexing multicast data among a plurality of slave processing circuits.
In some embodiments, the master processing circuit may determine one of the input signature and convolution kernel as multicast data and store it in the first storage circuit to broadcast the data to the scheduled plurality of slave processing circuits during operation. Correspondingly, the main processing circuit may determine the other of the input signature and the convolution kernel as distribution data and store it in the second storage circuit. These distribution data may be distributed to the corresponding slave processing circuits prior to the operation.
Fig. 5 also shows an internal structural schematic diagram of the slave processing circuit SL according to an embodiment of the present disclosure. As shown, each slave processing circuit 520 may include a plurality of arithmetic circuits CU 521, a first buffer circuit 522, and a second buffer circuit 523. The figure shows 4 arithmetic circuits CU0 to CU3. Those skilled in the art will appreciate that the number of operational circuits may be greater or lesser, depending on the particular hardware configuration, and embodiments of the present disclosure are not limited in this respect.
In some embodiments, the first buffer circuit 522 may be used to buffer weights or input profiles assigned to the slave processing circuit. Accordingly, the second buffer circuit 523 may then be used to buffer the input profile or weights assigned to the slave processing circuit. Both buffer circuits are used to select the data involved in the operation. The data of the first buffer circuit 522 may be a plurality of data lines from, for example, the first memory circuit 530 or the second memory circuit 540, and correspondingly, the data of the second buffer circuit 523 may be a plurality of data lines from, for example, the second memory circuit 540 or the first memory circuit 530. Depending on the particular multiplexing, these data lines may be distributed to the corresponding arithmetic circuits CU 521 or broadcast to all CUs 521 within the slave processing circuit 520 during the operation.
Each arithmetic circuit CU 521 is configured to perform, in each computation, an element-wise multiply-accumulate operation on a data line selected from the first buffer circuit and a data line selected from the second buffer circuit, respectively.
By providing the first buffer circuit and the second buffer circuit respectively, it is possible to support transmission of data to be operated on in different transmission modes, thereby reducing the data access amount by multiplexing data as much as possible between a plurality of operation circuits within a single slave processing circuit.
The slave processing circuit 520 may further include a third buffer circuit 524 for buffering the operation result of each operation circuit CU 521.
It will be appreciated that although the various processing and memory circuits are shown as separate modules in fig. 5, the memory and processing circuits may be combined into one module according to different configurations. For example, the first memory circuit 530 may be integrated with the master processing circuit 510, and the second memory circuit 540 may be shared by multiple slave processing circuits 520, and each slave processing circuit may be assigned a separate memory area to accelerate access. The disclosed embodiments are not limited in this respect. Furthermore, in the computing device, the master processing circuit and the slave processing circuit may belong to different modules of the same processor or chip, or may belong to different processors, as the disclosure is not limited in this respect.
Exemplary data splitting and storage
In the presently disclosed embodiments, the dimensions of the multidimensional data referred to are characterized as (N, H, W, C) or (Co, H, W, ci), which represent the order in which the data is stored in memory. It will be appreciated that although the multi-dimensional data has multiple dimensions, there is a correspondence between the multi-dimensional data and the order of storage on the memory because the layout of the memory is always one-dimensional. The multi-dimensional data is typically allocated in contiguous memory space, i.e., the multi-dimensional data can be one-dimensionally expanded and stored in sequence on the memory. For example, in embodiments of the present disclosure, the initial input feature maps may be stored sequentially in a low-dimensional (where C/Ci is the lowest dimension) priority manner; in order to optimize the convolution operation, the order of storage of the input feature maps may be adjusted during the operation, as will be described in more detail later. Adjacent dimensions refer to dimensions that are next to each other in the dimensional information representation of the multi-dimensional data, e.g., W and Ci are adjacent, which may also be referred to as consecutive dimensions.
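As a small sketch of the correspondence between multi-dimensional coordinates and the one-dimensional storage order just described (this is the standard row-major rule for an NHWC layout, not any particular hardware's addressing logic; names are illustrative):

```python
def nhwc_offset(n, h, w, c, H, W, C):
    """Linear element offset of coordinate (n, h, w, c) in a contiguously stored NHWC tensor,
    where C is the lowest (fastest-varying) storage dimension."""
    return ((n * H + h) * W + w) * C + c

# Neighbouring values along C are adjacent in memory; stepping along W jumps by C elements.
assert nhwc_offset(0, 0, 0, 1, H=4, W=4, C=8) - nhwc_offset(0, 0, 0, 0, H=4, W=4, C=8) == 1
assert nhwc_offset(0, 0, 1, 0, H=4, W=4, C=8) - nhwc_offset(0, 0, 0, 0, H=4, W=4, C=8) == 8
```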
In an intelligent processor, the main arithmetic unit of the hardware is a vector multiply-add operator for reasons of computational power requirements and area power consumption overhead. Support for various convolution algorithms is implemented in hardware design, essentially maximizing extraction of multiply-add operations in the algorithm, and efficient exchange of input and output data for the multiply-add operations between on-chip RAM (such as NRAM, WRAM, etc. in fig. 3) and the operators is implemented through data paths.
The hardware stores data in rows (cache lines), and read, write and compute operations are most efficient when aligned to whole rows. Therefore, in order to make full use of the bandwidth, the data needs to be vectorized and aligned to fit the requirements of the operator array memory and the like. Artificial intelligence chips are typically designed with the Ci dimension as the lowest dimension, i.e. the NHWC placement order described above, in which the data in the Ci dimension is contiguous. Vectorized alignment therefore requires the size of the Ci dimension to be aligned to a specified value, for example an alignment value M, so that accesses are made in units of the alignment value M; M may also be referred to as the hardware single maximum operand. M may take different values, e.g. 64 bits, 128 bits, 256 bits, 512 bits, etc., based on different hardware designs. In general, the size of the input port of the operator array is also related to M; for example, in the case of symmetric input data bit widths, the input port of the operator array is typically 2 times M, i.e. it processes input feature map data and weight data, each of scale M, at one time. When the Ci dimension of the input feature map is large, the above alignment requirement is easier to satisfy.
When the Ci dimension of the input feature map is small, for example smaller than the size of one cache line, the Ci dimension needs to be padded to one full line of data (for example, 512 bits), i.e. filled with invalid data 0. This padding causes a large amount of redundant computation, resulting in wasted resources and reduced operation efficiency.
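A rough sketch of the padding overhead described above, assuming an alignment value M of 64 bytes (one cache line) and a 1-byte data type; the numbers are illustrative only:

```python
import math

def padded_ci(ci_bytes, m_bytes=64):
    """Ci padded up to a whole number of cache lines of M bytes."""
    return math.ceil(ci_bytes / m_bytes) * m_bytes

for ci in (4, 28, 64, 100):
    pad = padded_ci(ci)
    waste = 100 * (pad - ci) / pad
    print(f"Ci={ci:3d}B -> padded to {pad:3d}B, {waste:.0f}% of the stored line(s) is zero padding")
```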
In an embodiment of the disclosure, a convolution operation scheme is proposed, which may determine a corresponding convolution split scheme according to a size of a lowest storage dimension (e.g., ci) of an input feature map, wherein the convolution split scheme indicates at least a shape of a split unit of data to be operated. A split unit contains data of an amount not exceeding a single maximum operation amount of hardware.
In some embodiments, the data amount contained in one split unit may be set to be the alignment value M of the hardware for one-time processing, so that the calculation processing is performed in units of split units, and the calculation force of the hardware may be fully exerted, so that invalid calculation is avoided or reduced.
In the exemplary description of the present disclosure, it is assumed without loss of generality that M = 512 bits = 64 bytes. The data type may be Int8, Int16, Float16 or Float32, and the data type of the input feature map is the same as that of the convolution kernel. Since each data type requires a width of at least 1 byte and the minimum unit of arithmetic processing is one data element, the various calculations in the following examples, such as M = 64B and Ci = 28B, are in units of bytes, with the unit sometimes omitted for brevity.
When the data amount of a split unit is equal to M, the data block shape of each split unit is blockC × blockY × blockX, which may take various forms; several of these are listed in Table 1:
TABLE 1 data block shape
As can be seen from table 1, some data block shapes have equal X and Y dimensions (shown as dark rows) and this shape can simplify subsequent operations. Thus, in embodiments of the present disclosure, it may be preferable to split the data to be operated on using such a data block shape.
For simplicity, the splitting scheme with the 64B×1×1 shape is referred to as Forward64, the scheme with the 16B×2×2 shape as Forward16, and the scheme with the 4B×4×4 shape as Forward4; the 4B×4×4 scheme applied to the depthwise convolution operation is referred to as Forward1, the 4B×4×4 scheme applied to the reverse depthwise convolution operation as Update1, and the 4B×4×4 scheme applied to the cross convolution operation as Update4. Except for Forward64, these splitting schemes are suitable for scenarios where the channel C in the convolution computation is small, and may therefore also be collectively referred to as small convolutions. In these small convolution splitting schemes, one split unit includes data from the lowest storage dimension and at least one other storage dimension, and the total data amount of one split unit does not exceed the hardware single maximum operand.
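For reference, the naming just introduced can be summarized as a small mapping (a sketch only; the shapes are blockC × blockY × blockX in bytes and each multiplies out to the assumed single maximum operand M = 64B):

```python
# Split-unit shape per scheme: (blockC, blockY, blockX) in bytes, totalling 64B = M.
SPLIT_SCHEMES = {
    "Forward64": (64, 1, 1),  # not a "small convolution"
    "Forward16": (16, 2, 2),
    "Forward4":  (4, 4, 4),
    "Forward1":  (4, 4, 4),   # depthwise convolution
    "Update1":   (4, 4, 4),   # reverse depthwise convolution
    "Update4":   (4, 4, 4),   # cross convolution
}
assert all(c * y * x == 64 for c, y, x in SPLIT_SCHEMES.values())
```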
Different convolution splitting schemes can be suitable for different operation scenes, so that performance optimization with different degrees is obtained.
After the splitting scheme is determined, the input feature map and the convolution kernel can be split into a plurality of corresponding splitting units according to the determined convolution splitting scheme, and the dimension storage sequence of the splitting units is converted, so that data in one splitting unit is continuously stored as one data row, and the subsequent reading processing is conveniently performed by taking the splitting unit (the data row) as a unit.
In some embodiments, three-dimensional or four-dimensional neuron or weight data is divided entirely into data blocks of size blockC × blockY × blockX (uc×uy×ux), each data block being stored contiguously in one row of, for example, M = 64B; thus, when one row of data is read, the data of one data block is actually fetched.
Specifically, from the data to be operated on, stored in the first dimension storage order, one or more split units are read in a first reading order, in units of split units, and the read split units are stored on the corresponding storage circuit, where the data within each split unit is stored in a second dimension storage order and the split units themselves are stored in a third dimension storage order.
FIG. 6 illustrates an exemplary data storage sequence according to an embodiment of the present disclosure.
As shown, 610 represents the storage manner of the four-dimensional tensor to be operated on, which includes N three-dimensional sub-tensors, with N stored in the highest dimension, i.e. the first dimension storage order of the four-dimensional tensor is NHWC. Note that H and Y, and W and X, are used interchangeably herein. Each sub-tensor is divided into smaller data blocks, or split units, and the numbers of data blocks in the three dimensions are C/Y/X respectively.
The middle graph 620 shows the manner in which each sub-tensor is stored, with each data block stored as a contiguous run of 64 bytes, i.e. one row. When the order in which the data blocks are read differs, the order between rows changes accordingly. In the example shown in the figure, the data blocks are read first in the C direction, then X, and finally Y, i.e. the first reading order is YXC; the rows are then stored in YXC order, i.e. the third dimension storage order is YXC (or HWC). In this example, the third dimension storage order is the same as the first dimension storage order. It will be appreciated that other reading orders may be used, resulting in a third dimension storage order different from the first dimension storage order; these are not listed here.
The right graph 630 shows the order within each row, i.e., the order of data within each data block, in the shape blockC x blockY x blockX, where the second dimension storage order is CYX or CHW.
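The layout change of FIG. 6 can be sketched with NumPy reshapes and transposes. This is a simplified sketch that assumes H, W and C are exact multiples of the block sizes and a 1-byte data type; real hardware would additionally handle alignment and boundary tails:

```python
import numpy as np

def tile_hwc(t, uc=4, uy=4, ux=4):
    """Split an HWC tensor into uc*uy*ux split units. Units are ordered YXC (the third
    dimension storage order); inside each unit data is ordered CYX/CHW (the second
    dimension storage order), so each unit occupies one contiguous row of uc*uy*ux bytes."""
    H, W, C = t.shape
    t = t.reshape(H // uy, uy, W // ux, ux, C // uc, uc)       # split every dimension
    t = t.transpose(0, 2, 4, 5, 1, 3)                          # (Y, X, C, uc, uy, ux)
    return t.reshape(H // uy, W // ux, C // uc, uc * uy * ux)  # last axis = one 64B row

x = np.arange(8 * 8 * 4).reshape(8, 8, 4).astype(np.uint8)     # H=8, W=8, C=4
rows = tile_hwc(x)
print(rows.shape)   # (2, 2, 1, 64): four data rows of 64 bytes each
```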
Exemplary grouping operations
The small convolutions adopt a block form; compared with conventional convolution, alignment in the Ci direction only needs to satisfy block alignment in the Ci direction. In this small-channel scenario, the weights (Co×Kh×Kw×Ci) are generally small: Kh and Kw are usually single-digit, and Co and Ci are similarly small. In the computing device/data processing apparatus described above in connection with fig. 5, the second storage circuit (e.g. WRAM 332 of fig. 3) typically has a larger storage space than the first storage circuit (e.g. NRAM 331 of fig. 3). Therefore, in order to make full use of the on-chip computation space, most small convolution schemes, such as Forward4 and Forward1, swap the storage locations of the neurons and the weights relative to normal convolution, i.e. the neurons are stored on the second storage circuit WRAM and the weights are stored on the first storage circuit NRAM.
In convolution computation, each input feature map needs to be multiplied and accumulated with the convolution kernel of every Co, producing Co output feature maps. However, the on-chip space cannot necessarily hold the convolution kernels and input feature maps of all scales at the same time, so the hardware involves a series of operations of repeatedly loading input feature data or weight data, and how to balance the repeated loading of input feature data versus weight data has a certain influence on computation efficiency. In practice, in order to reduce frequent off-chip accesses, a splitting policy for neurons and weights is needed. In some embodiments, different splitting manners may be employed depending on the size characteristics of the data involved in the operation.
According to the convolution operation principle described above, the operation results in the Co dimension (the C dimension for depthwise convolution) do not need to be accumulated with each other, so operations for different Co values can be distributed relatively independently to different operation circuits. In a small convolution scenario, the size of the output channel dimension Co of the convolution kernel in a single round of operation typically does not exceed the number of scheduled slave processing circuits, so the operation of a single Co is completed by one or more slave processing circuits. More generally, even when the Co dimension is large, this can be handled by splitting into multiple rounds of operation, where the Co size processed in each round does not exceed the number of scheduled slave processing circuits. Thus, in one example, the number of operation rounds required to complete the convolution operation, and the number of Co values processed in each round or the corresponding grouping mode, may first be determined according to the size of the output channel dimension Co of the convolution kernel and the number Ns of schedulable slave processing circuits.
Regardless of the allocation, in a single round of operation Co may be distributed in two ways: multiple slave processing circuits jointly process one Co value, or a single slave processing circuit processes one or more Co values. Specifically, in a single round of operation processing Nco output channels, every Rs SLs form a slave processing circuit group SLB that processes the convolution kernel corresponding to the same output Co value, where Rs = [Ns/Nco]; that is, the same convolution kernel is multiplexed over the Rs SLs within the same SLB, and Rs represents the multiplexing factor of the convolution kernel between slave processing circuits. Correspondingly, the input feature map may be multiplexed between the slave processing circuit groups SLB, where Rn = [Ns/Rs] indicates the multiplexing factor of the input feature map between slave processing circuits.
Alternatively or additionally, when each slave processing circuit processes the convolution kernels corresponding to rn Co values, with rn = [Nco/Ns], the input feature map processed by each slave processing circuit can be reused for rn convolution kernels, and rn represents the multiplexing factor of the input feature map within a single slave processing circuit. The maximum convolution kernel multiplexing factor rs and the maximum input feature map multiplexing factor rn applicable within a single slave processing circuit may be determined taking into account factors such as hardware buffer space limitations (e.g. the sizes of the first buffer circuit and the second buffer circuit in fig. 5).
In view of the cache size limitations and multiplexing benefits in the hardware circuit, some embodiments of the present disclosure do not, for now, consider the case where one slave processing circuit processes multiple Co values in a single round of operation, and only consider the case where one or more slave processing circuits process a single Co value in a single round of operation.
Depending on the number of slave processing circuits SL processing the same Co value in a single round of operation, different grouping modes may be employed. It will be appreciated that the schedulable slave processing circuits SL are preferably distributed evenly so that the computation load is balanced; for example, one group per 2 SLs, so that 16 SLs can process 8 Co values simultaneously, or one group per 4 SLs, so that 16 SLs can process 4 Co values simultaneously, and so on. In the computing device described above in connection with fig. 5, the second storage circuit WRAM has 16 storage areas allocated respectively to the 16 slave processing circuits SL. Further, every 4 storage areas may be combined into one storage block and allocated to the corresponding slave processing circuit group SLB. Thus, in some embodiments, for a computing device including Ns = 16 SLs as shown in fig. 5, several grouping modes may be selected: Group1 mode, Group4 mode and Group16 mode. Those skilled in the art will appreciate that there may be different grouping modes depending on the value of Ns, each of which may be handled analogously with reference to the three representative grouping modes described herein.
In some embodiments, the grouping mode may be denoted GroupN, which indicates that all the slave processing circuits SL scheduled in the current round of operation are divided into N groups, each slave processing circuit group SLB processes the same Co value, and different slave processing circuit groups SLB process different Co values. When a total of 16 SLs are schedulable, N may be 1, 4, or 16, corresponding to Group1, Group4, and Group16 above, respectively.
Fig. 7a-7c illustrate several exemplary grouping modes according to embodiments of the present disclosure. Fig. 7a shows a Group1 mode, fig. 7b shows a Group16 mode, and fig. 7c shows a Group4 mode.
As shown in fig. 7a, the Group1 mode means that all 16 schedulable SLs belong to one group and jointly process one Co value; for example, SL0 to SL15 all belong to group G0. Thus, the operation for this one output channel is distributed over the 16 SLs. In this mode, the convolution kernel 720 of the output channel may preferentially be transmitted to each SL in a broadcast manner, while the input feature map 710 is split and allocated to the SLs, so as to improve memory access efficiency.
In one embodiment, the convolution kernel may be stored on the first storage circuit 530 of fig. 5 for transmission over a broadcast channel. The input feature map may then be divided according to the XY directions of the output feature map and stored on the second storage circuit 540 for distribution to the different SLs. Thus, all SLs jointly compute an output feature map for one Co. The division and storage of the input feature map will be described in detail later with reference to the drawings.
As shown in fig. 7b, the Group16 mode refers to all 16 schedulable SLs being divided into 16 groups, i.e., one SL per group, with each SL handling a different Co value. For example, SL0 belongs to group G0, SL1 belongs to group G1, and so on, until SL15 belongs to group G15. In this mode, the same block of the input feature map 730 may be reused among the 16 SLs, so it may be preferable to broadcast the input feature map 730 to each SL while distributing the convolution kernels 740 for the different Co values to the corresponding SLs.
In one embodiment, the input feature map may be replicated into 16 copies and stored on the second storage circuit over the 16 storage areas allocated to the 16 slave processing circuits. The convolution kernels are divided according to Co, with one SL corresponding to one Co and 16 Co values processed at a time; they are stored in the first storage circuit and distributed to the different SLs in a unicast manner. Thus, all SLs compute output feature maps of different Co values for the same input feature map.
As shown in fig. 7c, the Group4 mode refers to all 16 schedulable SLs being divided into 4 groups, with each group handling one Co value. Each SL group (SLB for short) includes a number of SLs equal to Rs = Ns/4 = 4. For example, SL0 to SL3 belong to group G0, SL4 to SL7 belong to group G1, SL8 to SL11 belong to group G2, and SL12 to SL15 belong to group G3. This mode lies between Group1 and Group16, so either the convolution kernel or the input feature map may be determined to be the multicast data, while the other is determined to be the distribution data.
In one embodiment, the convolution kernels may be divided into 4 groups by Co and stored on the first storage circuit 530 of fig. 5 for transmission over the broadcast channel. The input feature map may then be divided into 4 shares according to the XY directions of the output feature map and stored on the second storage circuit 540 for distribution to the 4 SLBs. Each SLB obtains the same input feature map, divided into 4 shares and distributed to the 4 SLs within the SLB. Thus, all SLs in each SLB jointly compute an output feature map of one Co, and the 4 SLBs each process a different Co.
As shown in fig. 7c, the convolution kernels are divided into 4 groups, the Co values being assigned to the groups in a round-robin manner with an interval of 1. For example, when Co = 12, the Co values of the 4 groups are {0,4,8}, {1,5,9}, {2,6,10} and {3,7,11}, respectively. Each transmission sends one Co value of each group: for example, the first transmission sends Co = 0 to 3, with one Co value corresponding to one SLB and the 4 SLs within one SLB sharing the same weights; the second transmission sends Co = 4 to 7, and so on. Thus, the Co dimension of the operation results output by the SLBs is continuous after each round of operation is completed, as illustrated by the sketch below.
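The round-robin assignment of Co values to the 4 SLBs described above can be illustrated with a small Python sketch (the names are illustrative; this is not the hardware distribution logic):

```python
def group4_co_assignment(co_total: int, num_groups: int = 4):
    """Round-robin (interval 1) assignment of Co values to SL groups, as above."""
    groups = [list(range(g, co_total, num_groups)) for g in range(num_groups)]
    rounds = list(zip(*groups))   # transmission r sends rounds[r][g] to SLB g
    return groups, rounds

groups, rounds = group4_co_assignment(12)
print(groups)   # [[0, 4, 8], [1, 5, 9], [2, 6, 10], [3, 7, 11]]
print(rounds)   # [(0, 1, 2, 3), (4, 5, 6, 7), (8, 9, 10, 11)]
```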
When the small convolution split operation scheme of Forward4 is adopted, in order to support the three modes at the same time, neurons can be uniformly stored in the second storage circuit WRAM, and weights can be stored in the first storage circuit NRAM.
Exemplary splitting of the input feature map
As can be seen from the foregoing description, when multiple SLs jointly process one Co value, the input feature map needs to be split among the multiple SLs; for example, the Group1 grouping mode requires splitting the input feature map into 16 shares, while the Group4 grouping mode requires splitting it into 4 shares.
To ensure that the split portions of the input feature map can share a convolution kernel, the split may be performed according to the Ho/Wo directions of the output feature map and then mapped back onto the input feature map. In some embodiments, the input feature map may be divided among the Rs slave processing circuits SL included in each slave processing circuit group as follows: according to the size of the corresponding output feature map, the output feature map is divided evenly in the XY dimensions (i.e., the Ho/Wo dimensions) into Rs output feature blocks of the same shape; and according to the input feature map region required to calculate each output feature block, the input feature map is divided in the XY dimensions (i.e., the Hi/Wi dimensions) into Rs input feature blocks to be distributed to the Rs slave processing circuits. It will be appreciated that, depending on the convolution kernel size and the convolution stride, the input feature map regions corresponding to adjacent output points on the output feature map may overlap, as illustrated by the sketch below.
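A minimal sketch of this mapping from an output feature block back to the required input feature map region follows, assuming an ordinary convolution with the given kernel size and stride and ignoring padding; the function name is illustrative.

```python
def output_block_to_input_region(out_x0, out_y0, out_w, out_h,
                                 kx, ky, stride_x=1, stride_y=1):
    """Map one output feature block back to the input feature map region
    needed to compute it (padding not considered)."""
    in_x0 = out_x0 * stride_x
    in_y0 = out_y0 * stride_y
    in_w = (out_w - 1) * stride_x + kx
    in_h = (out_h - 1) * stride_y + ky
    return in_x0, in_y0, in_w, in_h

# Two horizontally adjacent 4x4 output blocks with a 3x3 kernel, stride 1:
print(output_block_to_input_region(0, 0, 4, 4, 3, 3))   # (0, 0, 6, 6)
print(output_block_to_input_region(4, 0, 4, 4, 3, 3))   # (4, 0, 6, 6) -> 2 input columns overlap
```

For stride 1, adjacent blocks share kx − 1 input columns (or ky − 1 rows), which is the overlap noted above.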
Fig. 8 illustrates an exemplary split schematic of an input feature map according to an embodiment of the present disclosure. In this example, the input feature map is divided into 16 shares and distributed over 16 SLs, corresponding to the Group1 mode.
Diagram 810 shows the output feature map of a single Co, which is divided in the XY directions in a 4×4 manner into 16 output feature blocks of the same shape, allocated to SL0 to SL15, respectively. The 16 output feature blocks can then be mapped onto the input feature map 820 to obtain the 16 input feature map regions required to calculate the 16 output feature blocks respectively; this likewise divides the input feature map in the XY directions. The 16 input feature map regions can be allocated to the 16 slave processing circuits SL accordingly.
According to the foregoing description, the input feature map is split in units of split units according to the determined convolution splitting scheme. In the above embodiment, therefore, the input feature map is partitioned so that each divided input feature block is a multiple of the split unit in the XY dimensions, that is, it is aligned to the split unit in the XY directions. For example, when a 4×4 convolution splitting scheme is selected, each input feature block is aligned to 4×4; when a 16×2×2 convolution splitting scheme is selected, each input feature block is aligned to 2×2.
For the case where the output feature map is not aligned to the split unit (e.g., 4×4 or 2×2), corresponding padding (e.g., with zeros) of the input feature map is required, so that the XY dimensions of the actually calculated output are aligned to the split unit (e.g., 4×4 or 2×2) and the XY dimensions of the input are likewise aligned to the split unit.
It will be appreciated by those skilled in the art that the output feature map may also be split in the XY directions according to other rules, for example, into 16 output feature blocks of the same shape in a 1×16 manner, assigned to SL0 to SL15, respectively. The disclosed embodiments are not limited in this respect. Furthermore, it will be appreciated that although the description above concerns splitting between slave processing circuits, this manner of splitting may also be applied in other scenarios, such as splitting between the arithmetic circuits CU within a single slave processing circuit SL; the embodiments of the disclosure are not limited in this respect either.
Exemplary convolution operation procedure within a Single Slave processing Circuit
After the data to be operated on have been split and correspondingly placed and stored, the plurality of slave processing circuits can be scheduled to execute the convolution operation on the corresponding data lines of the input feature map and the convolution kernel, and the operation results returned by the plurality of slave processing circuits can then be spliced according to the convolution splitting scheme to obtain the output feature map of the convolution of the input feature map with the convolution kernel. Specifically, the convolution operation itself may be performed using the plurality of arithmetic circuits CU and the respective buffer circuits within each slave processing circuit (see fig. 5). Depending on the buffer space available inside the slave processing circuit and the computing power limitations of the arithmetic circuits, multiple operation cycles are typically required in each round of operation to complete the desired computation.
As can be seen from the foregoing description, in the context of conventional 3D convolution operations, all the arithmetic circuits within a single slave processing circuit calculate one output feature map, or part of one output feature map, corresponding to the same output channel Co. Depending on the buffer space of the first and second buffer circuits within the slave processing circuit SL and on the processing capability of the arithmetic circuits CU (e.g., internal registers), the slave processing circuit may not be able to calculate the output feature map allocated to it in one pass. Thus, the output feature map may be divided into output feature blocks in units of the single-pass operation capability of the slave processing circuit, i.e., the single-pass capability of all N_CU schedulable arithmetic circuits within a single SL (N_CU × Nop output points, where each arithmetic circuit computes Nop output points or partial sums at a time). For example, continuing the example of fig. 5 in which each SL includes 4 CUs, and assuming that each CU can calculate Nop = 4 output points or partial sums at a time, a single SL can calculate 4×4 = 16 output points (or partial sums) at a time. Thus, the output feature map may be divided in the Xo/Yo dimensions into output feature blocks aligned to 16 output points, and the output feature blocks are calculated one by one. It will be appreciated that these 16 output points may be arranged in a 4×4 format or a 1×16 format; embodiments of the disclosure are not limited in this respect.
When calculating each divided output feature block, the output points of the block may be further divided among the N_CU arithmetic circuits to determine the processing object of each arithmetic circuit. Then, according to this division of output points and using the split unit as a sliding window, N_CU input feature data lines are selected from the first buffer circuit and distributed to the N_CU arithmetic circuits, and the corresponding weight data are selected from the second buffer circuit and broadcast to the N_CU arithmetic circuits, so that the output points corresponding to each sliding window are computed in parallel by multiplexing the weight data. Nk sliding selections are performed, where Nk is determined by the smaller of the convolution kernel size in the X and Y dimensions and the maximum convolution kernel size supported by a single operation of the slave processing circuit in the current convolution split mode.
In some embodiments, when performing a conventional three-dimensional convolution operation, the corresponding weight data may be selected as follows: 1/Nop of a weight line is selected from the second buffer circuit following the sliding pattern of the first buffer circuit, copied Nop−1 times to be expanded into one extended weight line, and the extended weight line is broadcast to the N_CU arithmetic circuits in the slave processing circuit.
At this time, in each sliding selection, each arithmetic circuit may perform element-wise multiply-accumulate in units of 1/Nop of a data line between one input feature line from the first buffer circuit and one extended weight line from the second buffer circuit, obtaining Nop partial sums; the Nk×Nop partial sums obtained over the Nk sliding selections are then accumulated according to their corresponding convolution output points to obtain and output Nop operation results.
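The per-circuit computation described in the preceding two paragraphs can be sketched as follows. This is a functional illustration under simplifying assumptions (a single CU, the Nop partial sums of every slide mapping to the same Nop output points, and a data-line length divisible by Nop), not the hardware data path.

```python
import numpy as np

def cu_accumulate(input_rows, weight_segments, nop=4):
    """One arithmetic circuit over Nk sliding selections.

    input_rows:      Nk input feature lines, each of length L
    weight_segments: Nk weight segments of length L // nop (the 1/Nop selections)
    Returns the Nop accumulated results for this circuit's Nop output points.
    """
    results = np.zeros(nop)
    for row, seg in zip(input_rows, weight_segments):
        row = np.asarray(row, dtype=np.int64)
        ext = np.tile(np.asarray(seg, dtype=np.int64), nop)  # extended weight line
        partial = (row * ext).reshape(nop, -1).sum(axis=1)   # Nop partial sums per slide
        results += partial                                   # accumulate per output point
    return results
```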
When the slave processing circuit outputs the output points computed by its internal arithmetic circuits, it may output them in a specific order that depends on the division of the output points, so that consecutively output points are contiguous in the X and/or Y dimensions, which facilitates subsequent processing. In some embodiments, the master processing circuit may further store the operation results returned by the respective slave processing circuits in the fourth dimension storage order. Depending on circumstances, the master processing circuit may also convert the operation results into a desired dimension storage order for storage.
The output points can be divided among the arithmetic circuits in several ways, and the sliding selection process and the output order of the output points differ accordingly.
The overall data splitting, storing, convolution sliding and computational output process is described in detail below in conjunction with the Forward4 scheme.
Shape description of input neurons, weights for Forward4 scheme
In Forward4, the shape of the split unit block is 4b×4×4. The block shape varies slightly depending on the data type. Table 2 shows the block shape of Forward4 for different data types.
Table 2, forward4 data block shape under different data types
Fig. 9 illustrates a split and store schematic of Forward4 scheme according to one embodiment of the present disclosure. For simplicity, the example in the figure assumes that the data type is Int8.
Diagram 910 shows the raw data to be computed (which may be neurons or weights), stored in HWC order. Also shown are 4 data blocks 911–914 obtained by splitting the original data by split units, each data block comprising 4 × 4 × 4 = 64 data elements.
The placement format of the split data is shown at 920. It can be seen that each original data block (e.g., 911–914) is laid out along the C dimension as one data row (e.g., 921–924). Within each row, data are stored in CHW order; for example, in data row 921, the 16 data elements of c=0 are stored first, then the 16 of c=1, then the 16 of c=2, and finally the 16 of c=3.
Specifically, for neurons, the data need to be rearranged from [1 × Hi × Wi × Ci] into the shape of the seven-dimensional tensor

[1 × Hi/4 × Wi/4 × Ci/4 × (4×4×4)].

For weights, the data need to be rearranged from [Co × Kh × Kw × Ci] into the shape of the seven-dimensional tensor

[Co × Kh/4 × Kw/4 × Ci/4 × (4×4×4)].
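Assuming Int8 data (so that the split unit holds 4×4×4 elements) and dimensions divisible by 4, the neuron placement can be expressed as the following NumPy sketch; the weight placement is analogous, with Co in place of the batch dimension.

```python
import numpy as np

def place_neurons_forward4(x):
    """Rearrange an HWC tensor [1, Hi, Wi, Ci] into the seven-dimensional layout
    [1, Hi/4, Wi/4, Ci/4, 4, 4, 4], each 4x4x4 block stored in CHW order."""
    n, hi, wi, ci = x.shape
    x = x.reshape(n, hi // 4, 4, wi // 4, 4, ci // 4, 4)
    #             n  Hb    h   Wb    w   Cb    c
    return x.transpose(0, 1, 3, 5, 6, 2, 4)   # -> [n, Hb, Wb, Cb, c, h, w]

x = np.arange(1 * 8 * 8 * 4).reshape(1, 8, 8, 4)
print(place_neurons_forward4(x).shape)   # (1, 2, 2, 1, 4, 4, 4)
```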
As can be seen from the foregoing description, the Forward4 scheme can support multiple grouping modes. For neurons, the seven-dimensional block-format shape that is finally split into each storage area of the second storage circuit differs slightly for different grouping modes and for different Ho/Wo splitting modes within a group.
Assume that the original input neuron size is [1 × Hi × Wi × Ci].
In Group1 grouping mode, the input neuron placement format varies with the Ho×Wo splitting mode:

Ho×Wo 4×4 split: 16 × [hi/(4×4), wi/(4×4), ci/4 × (4×4×4)]

Ho×Wo 1×16 split: 16 × [hi/(4), wi/(16×4), ci/4 × (4×4×4)]

In the 4×4 split above, the leading 16 denotes the 16 per-SL blocks distributed to the 16 SLs, and the trailing 4×4×4 denotes the CHW block split out of the three dimensions. Of the two 4's dividing hi (and likewise wi), the first indicates splitting hi×wi into 16 shares for distribution to the 16 SLs, and the second indicates folding hi and wi into the ci direction. The 1×16 split has the same meaning.
In Group4 grouping mode, the input neuron placement format varies with the Ho×Wo splitting mode:

Ho×Wo 1×4 split: 4 × [hi/(1×4), wi/(4×4), ci/4 × (4×4×4)]

For a single slave processing circuit SL: [hi/(1×4), wi/(4×4), ci/4 × (4×4×4)]

In the above representation, the leading 4 represents the 4 SLBs, onto which the neurons are replicated in 4 copies; the next 4 indicates that the neurons are split over the 4 SLs of one SLB; and the last 4×4×4 indicates the CHW block split out of the three dimensions. In Group16 grouping mode, the input neurons do not need to be split, and the placement format is as follows:
16 × [hi/4, wi/4, ci/4 × (4×4×4)]
The above 16 indicates that the neurons are replicated over the 16 SLs; the last 4×4×4 indicates the CHW block split out of the three dimensions; and hi and wi are each divided by 4, indicating the folding of hi and wi into the ci direction.
Output point splitting between arithmetic circuits in Forward4 scheme
When multiple arithmetic circuits CU within a single slave processing circuit SL jointly process one Co value, the output points need to be split among these CUs.
Fig. 10 illustrates a schematic diagram of assigning interleaved output points to the arithmetic circuits in the Forward4 scheme according to some embodiments of the present disclosure. In these embodiments, the output feature block may be divided evenly among the N_CU arithmetic circuits into Nop identically shaped output feature sub-blocks, each including N_CU output points that are assigned to the N_CU arithmetic circuits respectively. For example, in the figure each SL includes 4 CUs, each CU can calculate Nop = 4 output points or partial sums at a time, the output feature block 1010 includes 4×4 output points, and each of the equally divided output feature sub-blocks 1011 to 1014 includes 2×2 output points. Within each output feature sub-block, these 2×2 output points are allocated to the 4 arithmetic circuits. Thus, each arithmetic circuit calculates one output point in each of the 4 output feature sub-blocks. The output points assigned to the 4 different arithmetic circuits CU0 to CU3 are shown with different backgrounds in the figure. As can be seen, in each calculation each arithmetic circuit computes a plurality of output points that are spaced apart in the X and/or Y dimensions of the output feature map.
Based on the above division of output points, when performing the convolution operation by sliding selection, N_CU data lines may be selected from the first buffer circuit according to the data required to calculate each output feature sub-block, corresponding to the output point positions of that sub-block. For example, for the first selection of input feature data, 4 input data lines may be selected according to the 4 input feature regions required to calculate the 4 output points in the first output feature sub-block 1011, and distributed to the 4 arithmetic circuits. It will be appreciated that, since these 4 output points are adjacent in the X and/or Y direction, the spacing or step in the X and/or Y direction between the 4 simultaneously selected input data lines is 1.
When the weight data are selected, the corresponding weight data can be selected from the second buffer circuit and broadcast to the N_CU arithmetic circuits, so that the output points of the arithmetic circuits are computed in parallel by multiplexing the weight data. Further, in some embodiments, in order to fully exploit the computing power inside each arithmetic circuit CU (e.g., its multiply-add operators), for example computing Nop output points or partial sums at a time, the weights may also be multiplexed within a single input data line, so that Nop output points or partial sums are computed simultaneously.
For example, during weight selection only 1/Nop of a weight line may be selected and copied Nop−1 times to be extended into one weight line, the extended weight line containing Nop identical 1/Nop segments. The extended weight line is then broadcast to the N_CU arithmetic circuits, so that the weights are multiplexed at a finer granularity (e.g., 1/Nop of a line) across the computation of the Nop output points of a single arithmetic circuit, while also being multiplexed between the multiple arithmetic circuits.
Thus, by selecting N_CU input feature data lines and 1/Nop of a weight line (copied and extended into one weight line) each time, N_CU × Nop output points or partial sums can be calculated per selection. When the calculated results are partial sums, multiple sliding selections produce multiple sets of partial sums, which are accumulated according to their associated output points to obtain the final results.
The number of sliding selections and the sliding step of the convolution operation can be determined from the division of the output points. For the division of fig. 10, the number of sliding selections is Nk = ceil(Kx/2) × ceil(Ky/2), where Kx and Ky are respectively the smaller of the convolution kernel size in the X/Y dimension and the maximum convolution kernel size supported by a single operation of the slave processing circuit in the current convolution split mode, and the sliding step is 2. The maximum convolution kernel size supported by a single operation of the slave processing circuit is determined at least by the space sizes of the first buffer circuit and the second buffer circuit. It will be appreciated that when the convolution kernel exceeds this maximum size, it needs to be split in the Kx and Ky directions according to the maximum convolution kernel size. A small sketch of this computation follows.
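The sketch below computes Nk, treating the 8×8 single-pass limit mentioned in the next subsection as an assumed parameter.

```python
import math

def sliding_count(kx: int, ky: int, kmax: int = 8) -> int:
    """Nk = ceil(Kx/2) * ceil(Ky/2), with Kx/Ky clipped to the assumed maximum
    kernel size supported by a single operation of the slave processing circuit."""
    kx, ky = min(kx, kmax), min(ky, kmax)
    return math.ceil(kx / 2) * math.ceil(ky / 2)

print(sliding_count(3, 3))   # 4
print(sliding_count(5, 5))   # 9
```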
Convolution sliding process in Forward4 scheme
Fig. 11 shows a diagram of a single operation in the Forward4 scheme according to one embodiment of the present disclosure. In this example, the first buffer circuit 1110 has a size of 3×3×64B, i.e., it can buffer at most 9 data lines, and the second buffer circuit 1120 has a size of 2×2×64B, i.e., it can buffer at most 4 data lines. For consistency with the split units, storage within the buffer circuits is also depicted in units of split units.
The figure shows the operation of the first sliding selection. Following the pattern corresponding to the division of the output points and using the split unit as the sliding window, N_CU input feature lines are selected by sliding from the first buffer circuit and sent to the N_CU arithmetic circuits for calculation; meanwhile, 1/Nop of a weight line is selected from the second buffer circuit following the sliding pattern of the first buffer circuit, where Nop is the maximum number of convolution output points each arithmetic circuit can calculate at a time; it is copied Nop−1 times to form an extended weight line, which is broadcast to the N_CU arithmetic circuits in the slave processing circuit.
Specifically, in the computing device shown in fig. 5, N_CU = 4 and Nop = 4. With the output point division used here, each arithmetic circuit calculates, in each pass, 2×2 output points spaced at an interval of 1 in the X and Y dimensions.
As shown, one input feature data line is selected from the first buffer circuit 1110 at the starting position and at positions shifted by 1 in the X and/or Y direction, giving 4 input feature data lines in total, which are correspondingly sent to the 4 arithmetic circuits 1140 in the slave processing circuit SL. From the second buffer circuit 1120, 1/4 of a weight data line is selected at the starting position, i.e., 2×2 data elements; it is copied 3 times and expanded into an extended weight data line 1130, which is broadcast to the 4 arithmetic circuits 1140 in the SL.
In each calculation, each arithmetic circuit performs element-wise multiply-accumulate in units of 1/Nop of a data line between one input feature line from the first buffer circuit and one extended weight line from the second buffer circuit, obtaining Nop partial sums.
As shown, the 4 arithmetic circuits 1140 perform element-wise multiply-accumulate operations on the distributed input feature data lines and the broadcast extended weight data line to obtain the operation result 1150. The different background colors in 1150 represent results from the different arithmetic circuits 1140. It can be seen that in each operation one CU calculates partial sums for 4 output points, and the 4 CUs together obtain 4×4 partial sums. It can also be seen that the output points calculated by each CU are not adjacent in the Xo/Yo dimensions of the output feature map.
Then the selection windows on the first buffer circuit and the second buffer circuit slide synchronously and the next calculation is performed. Nk sliding selections are performed in total, where Nk = ceil(Kx/2) × ceil(Ky/2) and Kx, Ky are respectively the smaller of the convolution kernel size in the X/Y dimension and the maximum convolution kernel size supported by a single operation of the slave processing circuit in the current convolution split mode. Correspondingly, each arithmetic circuit accumulates the Nk×Nop partial sums computed over the Nk sliding selections according to their corresponding convolution output points to obtain Nop operation results.
In some embodiments, in the Forward4 mode, the maximum convolution kernel size supported by a single operation of the slave processing circuit is 8×8.
Fig. 12 shows a schematic diagram of the sliding convolution process in the Forward4 scheme according to one embodiment of the present disclosure. Take a 9×9 input feature map and a 5×5 convolution kernel as an example, with a convolution stride of 1, so that the output feature map size is 5×5. The input feature map needs to be aligned to 12×12 and divided into 9 blocks of size 4×4×4 (C×H×W), which are stored in the first buffer circuit, shown as 1210 with the C dimension omitted. The 5×5 convolution kernel is aligned to 8×8, the aligned portion is padded with zeros, and it is stored in the second buffer circuit, shown as 1220, again with the C dimension omitted. In each calculation, a 2×2 block is selected from the convolution kernel and copied 4 times, which exactly corresponds to a 4×4 block of the input feature map above; the copying can be implemented in hardware.
Fig. 12 shows the selection ranges of the input feature map in the first buffer circuit and of the convolution kernel in the second buffer circuit for each slide; there are 9 sub-figures in total, representing 9 slides. Block 1210 represents the input feature map in the first buffer circuit, with the four dashed boxes indicating the regions selected for the four CUs; block 1220 represents the convolution kernel in the second buffer circuit, with the dashed box indicating the selected 1/4 line, which is copied 3 times, extended into one line, and broadcast to the 4 CUs. The number of slides is Nk = ceil(Kx/2) × ceil(Ky/2) = 9.
In each calculation, each CU performs element-wise multiply-accumulate in units of 1/4 of a data line between one input feature data line from the first buffer circuit and one extended weight data line from the second buffer circuit, obtaining 4 partial sums; over the Nk calculations of the current round, the Nk partial sums corresponding to the same convolution output point are accumulated to obtain and output 4 operation results.
Specifically, for each sub-figure in fig. 12, the number of CUs is N_CU = 4 and each CU calculates Nop = 4 output points or partial sums, each being the result of an element-wise multiply-accumulate over 1/4 of a data line, i.e., each output point accumulates a 4×2×2 (Ci×Y×X) standard convolution at a time. After sliding Nk = ceil(Kx/2) × ceil(Ky/2) = 9 times, the accumulation in the Y×X direction is complete, and one SL finally obtains a complete 4×4 (Y×X) output. In this mode, a single calculation only supports convolution kernels no larger than 8×8; larger convolution kernels need to be split into 8×8 pieces in the Kx and Ky directions, and the split operation can be performed according to the same principle. The sketch below reproduces this accumulation for a single channel.
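The decomposition into 2×2 kernel sub-blocks and the per-output-point accumulation can be checked against a direct convolution with a short NumPy sketch (single channel, stride 1; purely illustrative, not the hardware data path).

```python
import numpy as np

def forward4_style_block(x, k, oy0=0, ox0=0):
    """Compute one 4x4 output block by consuming the (zero-padded) kernel in
    2x2 sub-blocks over Nk = ceil(Kh/2)*ceil(Kw/2) slides, accumulating the
    partial sums of each output point."""
    kh, kw = k.shape
    k8 = np.zeros((8, 8), dtype=k.dtype)       # kernel aligned to 8x8 with zeros
    k8[:kh, :kw] = k
    nky, nkx = -(-kh // 2), -(-kw // 2)        # ceil(Kh/2), ceil(Kw/2)
    out = np.zeros((4, 4))
    for sy in range(0, 2 * nky, 2):
        for sx in range(0, 2 * nkx, 2):
            sub = k8[sy:sy + 2, sx:sx + 2]     # the 2x2 (1/Nop) weight selection
            for oy in range(4):                # the 4x4 output points of one SL
                for ox in range(4):
                    patch = x[oy0 + oy + sy: oy0 + oy + sy + 2,
                              ox0 + ox + sx: ox0 + ox + sx + 2]
                    out[oy, ox] += np.sum(patch * sub)   # partial-sum accumulation
    return out

x = np.pad(np.random.rand(9, 9), ((0, 3), (0, 3)))       # 9x9 input aligned to 12x12
k = np.random.rand(5, 5)
ref = np.array([[np.sum(x[i:i + 5, j:j + 5] * k) for j in range(4)] for i in range(4)])
print(np.allclose(forward4_style_block(x, k), ref))       # True, after Nk = 9 slides
```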
It will be appreciated that when Ci >4, it is necessary to traverse in the Ci direction while switching inputs and weights until the complete output is calculated. When the Xo/Yo calculated by each CU is greater than 4, it is necessary to slide along the Xo/Yo direction, reading different input neurons and weights. Those skilled in the art can similarly derive the calculation process from the foregoing description, and the details are not repeated here.
Output shape description in Forward4 scheme
As can be seen from the foregoing output point division and sliding convolution process, the results produced in this sliding mode are not in the normal arrangement order of conventional convolution output data. Therefore, during output, each slave processing circuit SL can convert the operation results of its arithmetic circuits CU into a specified format, for example an Nco×Uy×Ux format. In some embodiments, each slave processing circuit may output at a time a partial operation result of some of its internal arithmetic circuits, the partial result being contiguous in the X and/or Y dimensions of the output feature map. The master processing circuit may further store the operation results returned by the respective slave processing circuits in the fourth dimension storage order. Depending on circumstances, the master processing circuit may also convert the operation results into a desired dimension storage order for storage.
When the grouping mode and/or the splitting mode of the input feature map in the single SLB (i.e., the splitting mode according to HoWo of the output feature map) are different, the output data formats are slightly different.
Fig. 13 shows a schematic diagram of an output data format of Forward4 scheme according to one embodiment of the present disclosure. In this embodiment, the grouping mode is Group1, and the splitting manner of the input feature map in a single SLB (including 16 SLs) is split according to ho×wo=1×16.
The original output of one SL is shown at 1310. As can be seen, each SL outputs a 1×1×4 (Co×Y×X) region at a time, i.e., a partial operation result of some of its internal arithmetic circuits, for example 2 operation results from each of 2 CUs (see fig. 10); these results are contiguous in the X and/or Y dimensions of the output feature map, for example within the same row (as shown in fig. 13) or the same column. The 4 operation results of each of the 4 CUs are returned over 4 consecutive outputs, forming a 1×4×4 (Co×Y×X) region. Different SLs output different regions of the output feature map of the same Co. After all 4×4 regions of this Co have been output, the output switches to different output points.
Diagram 1320 shows the data structure stored for the 16 SLs. As shown, after being written into a storage circuit (e.g., the first storage circuit), the final output data take the format Yo×Xo×Co×4×16×4, where Yo and Xo are the numbers of output feature map blocks divided for each SL and the 16 corresponds to the division over the 16 SLs. In some implementations, a further placement operation may be performed as needed to convert the data into other desired formats.
As mentioned above, when the grouping mode and/or the splitting manner of the input feature map between the plurality of SLs within the single SLB are different, there is a slight difference in the output data format. Assume that the original output size is:
1×ho×wo×co
then, the output data shape of Group1 when ho×wo is divided by 4×4 is:
ho/(4×4)×wo/(4×4)×co/group×(4×16×4)
In the above formula, (4×16×4) is the basic output block of Forward4, with the directions corresponding to H×C×W; the 16 represents the division of ho and wo of the same co over the 16 SLs. Specifically, the block can be decomposed as 4[mid-dimension ho] × 4[low-dimension ho] × 4[high-dimension wo] × 4[low-dimension wo]. Here ho and wo are each divided by 4 twice: the first 4 represents the 4×4 split when the data are stored across SLs, and the second 4 represents the folding of the data block in the h/w direction. In Group1 mode, group = 1 in the formula above.
The output data shape of Group1 when ho×wo is split according to 1×16 is:
ho/(4)×wo/(4×16)×co/group×(4×16×4)
In the above formula, (4×16×4) is the basic output block of Forward4, with the directions corresponding to H×C×W; the 16 represents the division of ho and wo of the same co over the 16 SLs, and the block can specifically be decomposed as 4[low-dimension ho] × 16[high-dimension wo] × 4[low-dimension wo]. In Group1 mode, group = 1 in the formula above. This is the shape illustrated in fig. 13.
It follows that, in the Group1 case, the 16 SLs evenly divide the Yo×Xo dimensions of one output feature map. At output time, the SL dimension within a data row corresponds one-to-one to the way the 16 SLs divide the output neurons in the Yo×Xo direction. This scenario is suitable for input neurons with large Y×X dimensions and a small Co.
The output data shape of Group4 when ho×wo is divided by 2×2 is:
ho/(2×4)×wo/(2×4)×co/group×(4×16×4)
In the above formula, (4×16×4) has the same meaning as above, except that the 16 represents the division of the wo outputs of 4 co values over the 4 SLs of each group; it can specifically be decomposed as 4[co] × 4[mid-dimension ho] × 2[low-dimension ho] × 2[high-dimension wo] × 4[low-dimension wo]. In Group4 mode, group = 4 in the formula above.
The output data shape of Group4 when ho×wo is split according to 1×4 is:
ho/(1×4)×wo/(4×4)×co/group×(4×16×4)
In the above formula, (4×16×4) has the same meaning as above, except that the 16 represents the division of the wo outputs of 4 co values over the 4 SLs of each group; it can specifically be decomposed as 4[co] × 4[low-dimension ho] × 4[high-dimension wo] × 4[low-dimension wo]. In Group4 mode, group = 4 in the formula above.
Group16 output data shape is:
ho/4×wo/4×co/group×(4×16×4)
In the above formula, (4×16×4) has the same meaning as above, except that the 16 represents the output division of 16 co values over the 16 SLs; it can specifically be decomposed as 4[low-dimension ho] × 16[co] × 4[low-dimension wo]. In Group16 mode, group = 16 in the formula above.
Since the groups have different split categories in the H×W direction, the 16 in 4×16×4 above also differs in its specific decomposition. Because Forward4 uses a 4b×4×4 block as its unit of computation, alignment restrictions are unavoidable during calculation. Depending on the Group mode, and on the different H×W splitting modes within the same Group mode, the alignment restrictions in the calculation also differ. When computing the alignment, the alignment restriction of ho×wo can first be determined according to the splitting mode of the output feature map and then pushed back to hi×wi; since the input neurons also need to be placed in split unit blocks, they must be aligned once more. The above alignment restrictions are summarized in table 3 below:
TABLE 3 alignment limits
In summary, at output time the hardware can automatically output neurons in the 4×16×4 (Y×SL×X) dimensions within a row, and in the Y×X×C dimensions between rows. The same holds for larger convolution kernels.
Bias shape description in Forward4 scheme
Bias is the offset added after the convolution calculation is finished; its original format is [1 × Co].

Since the data output by Forward4 are in the ho×wo×co/group×(4×16×4) format, if the bias is to be applied on-chip directly to the data output by Forward4, the basic shape of the bias needs to be changed. The placement format of the bias in the on-chip space is related to the Group grouping mode. Specifically, the bias placement format in each grouping mode is as follows (see also the sketch after this list):
Group1 grouping mode, bias placement format: [1 × Co × 64], where 64 denotes that a single bias value is replicated 64 times and placed consecutively.

Group4 grouping mode, bias placement format: [1 × Co × 16], where 16 denotes that a single bias value is replicated 16 times and placed consecutively.

Group16 grouping mode, bias placement format: [1 × Co × 4], where 4 denotes that a single bias value is replicated 4 times and placed consecutively.
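A sketch of this bias replication follows; the function name and array layout are illustrative assumptions, only the replication counts come from the list above.

```python
import numpy as np

def place_bias(bias, group_mode: int):
    """Replicate a [1, Co] bias for on-chip placement: 64 copies for Group1,
    16 for Group4, 4 for Group16, placed consecutively per Co value."""
    copies = {1: 64, 4: 16, 16: 4}[group_mode]
    bias = np.asarray(bias).reshape(1, -1, 1)
    return np.repeat(bias, copies, axis=2)     # shape [1, Co, copies]

print(place_bias(np.arange(8), 4).shape)   # (1, 8, 16)
```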
Data handling process
As can be seen from the foregoing description of the small convolution scheme, the input neurons and the weights need to be split and have their storage dimensions transformed, and the output neurons also need certain dimension transformations. When based on the hardware architecture of the multi-core computing device of fig. 3b, for example, the input data first need to be read from global memory and, for reasons of hardware IO efficiency, are stored in the shared memory SRAM after loading. As mentioned above, Forward4 needs to split the neurons and, taking alignment factors into account, this splitting characteristic gives Forward4 a greater computational advantage when processing larger input feature maps with a smaller number of channels. Therefore, in a hardware design involving Forward4, the larger neurons may be stored on the WRAM while the relatively smaller weights are stored in the NRAM. Meanwhile, since both the weights and the neuron data need to be placed in the block form described above, the neurons stored in the WRAM need to pass through the NRAM once to undergo the shape transformation of the tensor data.
Fig. 14 illustrates an overall data handling process according to an embodiment of the present disclosure.
As shown, the weights are read from off-chip storage, e.g., DDR, into the SRAM via the global direct memory access module (GDMA). HW-level alignment and padding operations are performed on the SRAM. A blocking instruction (Tiling) is used while moving the data from SRAM to NRAM, at which point the data transfer as well as the dimension conversion and alignment of the data can be completed.
The handling of neurons is similar to that of the weights, except that after being moved to the NRAM by the blocking instruction they also need to be moved to the WRAM. Since the neurons slide with the convolution kernel during computation, there is a large amount of data overlap, which would greatly reduce the efficiency of data transfer. To address this issue, some embodiments of the present disclosure employ img2col instructions for distributing the data, as described in more detail below.
The output data are first placed on the NRAM; their dimension change can be completed through a blocking instruction while they are moved onto the SRAM. They may then be written back to the off-chip memory DDR via the GDMA.
Exemplary principles of blocking instructions
Data dimension change and data placement refer to the process of arranging tensor data of a given shape into a required target shape. Data handling refers to the read and write operations performed on data across different storage spaces. As described above, the Forward4 convolution scheme requires the neurons and weights used in the convolution operation to be aligned and placed in a specific block pattern. In addition, the output data are produced in the Forward4-specific output format, which means that the tensor data must be placed in block form before calculation and placed back into the normal tensor shape as required after the calculation is completed.
In the disclosed embodiment, the handling operation is accomplished by a block instruction (TRANS TILING) during the handling of the input neurons, weights, bias data from the SRAM to the NRAM, and the handling of the output data from the NRAM to the SRAM. In the process of carrying, the basic carrying process of the data is required to be completed, and the dimensional change and the placing process of the data are also required to be completed, so that the calculation requirement is met.
The transform instruction series provides the IO data path with data shape transformation and data type conversion capabilities, mainly comprising functions such as TRANS (transpose), MOVE (carry), and ROTATE. The mode implementing the transpose function in this instruction set is named TRANS TILING, whose main purpose is to provide performance support for the various shape transformations of small convolutions. The instruction divides a 3-dimensional data block into an inner layer and an outer layer. The inner layer has three dimensions (corresponding to parameters n0–n2 in the instruction): the lowest dimension is in units of bytes, while the next-lowest and highest dimensions are unitless and represent counts of the next-lower layer. The outer layer also has three dimensions (corresponding to parameters n3–n5 in the instruction), all representing multiples of the corresponding inner-layer dimensions.
When implementing the small convolution splitting scheme, input data stored in a first dimension storage order (e.g., HWC) needs to be split, dimension converted and stored in units of split units, each split unit is stored in a second dimension storage order (e.g., CHW), and split units are stored in a third dimension storage order (e.g., HWC).
Fig. 15 shows a schematic conceptual diagram TRANS TILING according to an embodiment of the disclosure.
The left graph in the figure shows the input data before deformation. It can be seen that the three-dimensional input data is described using six dimensions, n0 and n3 corresponding to a first dimension (e.g., the lowest dimension) of the original three-dimensional data, n1 and n4 corresponding to a second dimension (e.g., the next lowest dimension) of the original three-dimensional data, and n2 and n5 corresponding to a third dimension (e.g., the highest dimension) of the data block. In the example in the figure, the inner layer of the input data corresponds to the splitting unit, taking the Forward4 scheme as an example, the inner layer data block of the input data is a 4b×4×4 data block, where n0=4b, n1=n2=4.
The right graph in the figure shows the transformed output data. The three-dimensional output data are likewise described using six dimensions. Here the inner layer of the output data corresponds to the transformed split unit; in the Forward4 scheme, the inner-layer data block of the output data is a 64B×1×1 data block, where n0 = 64B and n1 = n2 = 1.
In addition, the transpose tiling (TRANS TILING) also has an intra-line transform (Inline-buffer) function, including a Tiling forward intra-line transform based on a pre-allocation table (Pretable) and a Tiling backward intra-line transform based on a post-allocation table (Posttable). The pre-allocation table rearranges the n0 data input to Tiling, while the post-allocation table rearranges the n0 data output by Tiling. Apart from their flag bits, the pre-allocation table and post-allocation table are essentially arrays describing the positions of 64 bytes of data.
Fig. 16 shows a schematic diagram of the front-to-back configuration table.
As shown, the pre-/post-allocation table specifies the reordered positions of one row of data in the n0 dimension of the input or output, respectively, which comprises 64B. Each 8-bit byte of the table contains: a 6-bit Index field, recording which of bytes 0–63 of the original data is taken; a 1-bit zero_en field, indicating whether the position is forced to zero (if this bit is 1, the position is written as 0 and the [5:0] index bits are invalid); and a 1-bit mask field, indicating whether this entry is valid.
Using the pre-/post-allocation tables, the n0 data of the input and/or the n0 data of the output of the blocking instruction can be rearranged as needed, as the sketch below illustrates.
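The sketch below shows how such a table rearranges one 64-byte row; the exact bit positions of the index, zero_en and mask fields are assumptions for illustration, since only the field widths are specified above.

```python
import numpy as np

def apply_match_table(row64, table):
    """Rearrange one 64-byte n0 row according to a pre-/post-allocation table.

    Assumed entry layout: bits [5:0] = source byte index, bit 6 = zero_en,
    bit 7 = mask (entry valid)."""
    row64 = np.asarray(row64, dtype=np.uint8)
    out = np.zeros(64, dtype=np.uint8)
    for i, entry in enumerate(table):
        index, zero_en, mask = entry & 0x3F, (entry >> 6) & 1, (entry >> 7) & 1
        if mask:
            out[i] = 0 if zero_en else row64[index]
    return out

identity = [(1 << 7) | i for i in range(64)]       # every entry valid, no reordering
data = np.arange(64, dtype=np.uint8)
assert np.array_equal(apply_match_table(data, identity), data)
```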
Table 4 shows the meaning of the individual parameters of the blocking instruction. Let the bit width of the data to be blocked be dwidth, in units of B (bytes); the size of the data amount of one atomic operation of the blocking instruction is called the blocking bit width T, also in units of B (bytes). Among the parameters of the blocking instruction, the 11 parameters n0 to n5 and s1 to s5 are needed to describe the tensor shapes of the inner-layer and outer-layer data, where n0 to n2 and s1 to s2 describe the inner layer and n3 to n5 and s3 to s5 describe the outer layer.
TABLE 4 Parametric meaning of blocking instruction
To describe the tensors before and after execution of the blocking instruction, a set of parameters is needed for each of the input tensor and the output tensor, giving 22 parameters in total: in0 to in5, is1 to is5, on0 to on5, and os1 to os5. The blocking instruction can support a variety of blocking bit widths T, e.g., 1B, 2B, 4B, 6B, 8B, 16B, 32B, etc., and the corresponding value can be set according to the blocking task. Therefore, the blocking instruction also includes the blocking bit width T.
There are also some basic usage restrictions or constraints on the blocking instruction, including, for example: in0, in1, in2, on0, on1, on2 ≤ 64; n0 requires 64B alignment for performance; in0 = on1×on2×T and on0 = in1×in2×T; in3×in4×in5 = on3×on4×on5; T ≤ 32B; and the pre-/post-allocation table is 64B.
Furthermore, the blocking instruction cannot operate in place, i.e., two storage areas are required. Accordingly, an embodiment of the present disclosure provides a data processing apparatus including a control circuit, a first storage circuit, and a second storage circuit. The first storage circuit stores the first data before the blocking instruction is executed; the second storage circuit stores the second data after the blocking instruction is executed. The control circuit is configured to configure and execute the blocking instruction. In some embodiments, the data processing apparatus may be, for example, a processor cluster in a multi-core computing apparatus as shown in fig. 3b; the control circuit may be, for example, a processor core within the processor cluster; the first storage circuit may be, for example, the shared memory SRAM within the processor cluster or the NRAM within the processor core; and the second storage circuit may be, for example, the NRAM or the SRAM. When the blocking instruction is executed for different data (input neurons, weights, output neurons, etc.), the dimension changes and transfer processes to be realized differ, so different blocking instruction parameter configuration schemes need to be designed.
General scheme for blocking instruction of output neuron
As can be seen from the foregoing description of a small convolution operation scheme such as Forward4, when Forward4 uses different Group grouping modes and/or different ho×wo splitting modes within a Group, the format of the output data (i.e. the output neurons) is not completely the same, and a detailed description of the output data format can be found in the output shape description section in the foregoing Forward4 scheme. Since the output data format is not a conventional output format, it is necessary to perform shape conversion and dimension conversion on the output data using a block instruction.
For output neurons, the function of a blocking instruction is to convert first data stored in a first dimension storage order on a first storage circuit (e.g., NRAM) into second data stored in a second dimension storage order on a second storage circuit (e.g., SRAM) during the output neuron's transportation from, for example, NRAM to SRAM.
To better understand the blocking procedure for the output neurons, the following description details the blocking instruction processing of the output data for the case of the Group1 grouping mode in the Forward4 convolution splitting scheme with ho×wo split in the 1×16 manner. Those skilled in the art will appreciate that other output data formats may be processed similarly.
In the case of Group1, ho×Wo being 1×16 split, the shape of Forward4 output data is:
1×ho/(4)×wo/(4×16)×co×(4×16×4)
The specific meaning of each factor in this shape is as given in the output shape description above.
The execution of the blocking instruction aims to convert the first data in the six-dimensional data format (without considering the N dimension) into the second data in the standard output format (without considering the N dimension), which is a three-dimensional shape:
[ho×wo×co]
wherein the transformed co is located in the lowest storage dimension of the second data, the transformed wo is located in the second lowest storage dimension of the second data, and the transformed ho is located in the highest storage dimension of the second data.
Since the output data have too many dimensions, and 4×16×4 also contains data that span dimensions, some simplification is required in order to execute the blocking instruction. Specifically, the relevant dimensions of the output data may first be merged to form three-dimensional data, then part of the data may be blocked in order, and finally all the data are converted back to the standard format.
In some embodiments, before the blocking process, the control circuit in the data processing apparatus may regard the first data, from its six-dimensional shape, as having the three-dimensional equivalent shape [ho/4 · wo/(4×16)] × co × (4×16×4), wherein the highest dimension is the merged ho/4 × wo/(4×16), the second-highest dimension is co, and the lowest dimension is 4[low-dimension ho] × 16[high-dimension wo] × 4[low-dimension wo]. It can be seen that, in this three-dimensional equivalent shape, the highest and second-highest dimensions already conform to the dimension order of the final second data, and only the lowest dimension needs to be adjusted. Specifically, the 4[low-dimension ho] × 16[high-dimension wo] in the lowest dimension may be adjusted as a whole to sit between the high-dimension ho and the mid-dimension wo of the highest dimension, and the 4[low-dimension wo] in the lowest dimension may be adjusted to sit between the highest and the second-highest dimensions, thereby conforming to the HWC order.
Alternatively or additionally, in some embodiments, after the blocking process the control circuit in the data processing apparatus may likewise regard the above-mentioned second data, from its three-dimensional shape, as a corresponding three-dimensional equivalent shape.
FIG. 17 illustrates a schematic diagram of executing a block instruction on output neuron data, according to an embodiment of the present disclosure.
The left-hand graph shows the output neuron data before the blocking process (i.e., the input tensor of the blocking instruction). It can be seen that the six-dimensional output neuron data can be represented in the three-dimensional equivalent shape described above. To convert the ho and wo components in the low dimension (4×16×4) back to the high dimensions, a data block of size 1×co_full×(4×16×4) can be processed each time, where co_full denotes the integer segment aligned to the base alignment value M of the blocking instruction.
Thus, the input tensor can be divided into an inner layer and an outer layer, each described using three dimensions. Since the 4[low-dimension ho] × 16[high-dimension wo] in the lowest dimension (4×16×4) must have its order adjusted as a whole, and since it also satisfies the M = 64B alignment requirement, it can be taken as the 4×16 inner-layer data block 1701. At this point, the in0 dimension of the inner-layer data block 1701 is aligned to the first alignment value, e.g., M = 64B, according to the restriction of the blocking instruction; the in1 dimension may then be set to M/dwidth = 64B/dwidth according to the data bit width dwidth; and the in2 dimension is set to 64/in1, such that in2×in1×in0 = (64/in1)×in1×M = 64×64B. After the inner-layer data block is determined, the sizes of the three outer-layer dimensions in3, in4, and in5 can be determined accordingly; they are respectively equal to the number of inner-layer data blocks in the corresponding dimensions.
After the blocking process is completed, the 1×co_full×(4×16×4) data block will have been converted into a (4×16)×4×co_full data block; that is, the data of the low dimensions have been moved into the high dimensions.
The right diagram in fig. 17 shows the output neuron data after the blocking process (i.e., the output tensor of the blocking instruction). It can be seen that the shape of the output neuron data has changed accordingly; it is likewise divided into an inner layer and an outer layer, each described using three dimensions. Because the positions of the neuron data need to be adjusted substantially, the blocking bit width T can be set to dwidth in view of the constraints of the blocking instruction, i.e., the data amount of one atomic operation is a single data element, which makes it convenient to adjust the storage order element by element. The inner-layer data block 1702 then corresponds to the inner-layer data block 1701 of the input tensor, but with a changed shape. For example, for the co integer segment, the shape changes from a flat slab of 1×co_full×(4×16×4) to an upright slab of (4×16)×4×co_full. The on0 dimension of the inner-layer data block is set to in1×in2×T = 64×T according to the constraints of the blocking instruction; the on1 dimension is set to 4 and the on2 dimension to in0/T/on1 = 16B/T. After the inner-layer data block is determined, the sizes of the three outer-layer dimensions on3, on4, and on5 can be determined accordingly; they are respectively equal to the number of inner-layer data blocks of the output tensor in the corresponding dimensions.
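The net effect of this blocking on one co integer-segment block can be mimicked functionally in NumPy as below; this reproduces only the overall shape change from 1×co_full×(4×16×4) to (4×16)×4×co_full under one plausible element ordering, not the instruction's per-atomic-operation behaviour.

```python
import numpy as np

def block_transform_co_segment(block):
    """Functional equivalent of the blocking for one 1 x co_full x (4*16*4) block:
    bring 4[low ho] x 16[high wo] to the front, keep 4[low wo] ahead of co."""
    _, co_full, low = block.shape                 # low = 4 * 16 * 4
    x = block.reshape(co_full, 4 * 16, 4)         # [co_full, 4*16, 4(low wo)]
    return x.transpose(1, 2, 0)                   # [(4*16), 4(low wo), co_full]

blk = np.arange(1 * 64 * 256).reshape(1, 64, 256)
print(block_transform_co_segment(blk).shape)      # (64, 4, 64)
```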
The above describes the blocking procedure for processing one data block of size 1×co_full×(4×16×4) at a time. Multiple such passes can then be performed in a loop in a fixed order. Thus, in some embodiments, the control circuit may execute the blocking instruction in a loop over the three-dimensional equivalent shape of the output neurons (i.e., the first data), the loop comprising three levels: an inner co-dimension loop, a middle wo-dimension loop, and an outer ho-dimension loop.
Specifically, in some embodiments, the inner co-dimension loop includes: dividing the co dimension into an integer segment and/or a remainder segment, where the co size of the integer segment is aligned to the base alignment value M of the blocking instruction and the co size of the remainder segment is smaller than M; and repeating along the co dimension with every M values as one data block and the remainder segment as one data block, for repeat_co = (co + M − 1)/M = (co + 64 − 1)/64 iterations.
Alternatively or additionally, in some embodiments, the middle wo-dimension loop includes: repeating once per data block along the wo dimension, for repeat_wo = wo/(4×16) iterations.
Alternatively or additionally, in some embodiments, the outer ho-dimension loop includes: repeating once per data block along the ho dimension, for repeat_ho = ho/4 iterations.
In total, the three-level loop requires the blocking instruction to be executed repeat = repeat_co × repeat_wo × repeat_ho times.
In the above blocking instruction scheme, the 4×16×4 dimensions are handled appropriately by the block division itself, so the pre-/post-allocation tables are not needed. Because the processing is divided into repeat passes, an address offset must be applied to the data storage space before each blocking pass.
Specifically, in some embodiments, before each execution of the blocking instruction for a data block, the control circuit may set an input tensor offset and an output tensor offset for the blocking instruction executed on the current data block according to the size of the data already processed, where the input tensor offset represents the offset of the pre-processing data block relative to the starting storage address of the first data, and the output tensor offset represents the offset of the post-processing data block relative to the starting storage address of the second data.
In one example, the offsets may be advanced and the entire processing controlled according to the logic shown in the pseudocode of table 5 below.
Table 5, output neuron blocking instruction loop processing logic
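A hedged Python sketch of this three-level loop is given below; the function and parameter names are illustrative, and the offset arithmetic is a simplified stand-in for the address computations of table 5 rather than their exact form.

```python
def run_output_neuron_blocking(ho, wo, co, dwidth=1, M=64,
                               execute_block_instruction=lambda *args: None):
    """Three-level loop issuing one blocking instruction per data block
    (Group1 mode, Ho*Wo 1x16 split). Offset arithmetic is illustrative."""
    repeat_ho = ho // 4
    repeat_wo = wo // (4 * 16)
    co_segments = [M] * (co // M) + ([co % M] if co % M else [])

    in_off = out_off = 0
    for _ in range(repeat_ho):                     # outer ho-dimension loop
        for _ in range(repeat_wo):                 # middle wo-dimension loop
            for co_seg in co_segments:             # inner co-dimension loop
                execute_block_instruction(in_off, out_off, co_seg)
                step = co_seg * (4 * 16 * 4) * dwidth   # size of the processed block
                in_off += step
                out_off += step
    return in_off, out_off
```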
In actual operation, the shape of the output neurons changes dynamically, i.e., the size of co is arbitrary. As mentioned previously, in the inner co loop the blocking instruction may be executed by dividing the co dimension into an integer segment and/or a remainder segment: the co size of the integer segment is aligned to the base alignment value M of the blocking instruction and is processed one data block per M values, while the co size of the remainder segment is smaller than M and the remainder segment is processed as one data block.
It will be appreciated that, depending on the value of co, there may be only an integer segment, only a remainder segment, or both. Assume that the part of co aligned to 64B (the integer segment) has length co_full and the part not aligned to 64B (the remainder segment) has length co_rem. For example, for INT8 output neurons of shape 256×256×96, co=96, so co_full=64 and co_rem=32.
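For illustration only, the split of co against the reference alignment value M=64 can be expressed as follows (names are assumptions):

```python
# Split co into an integer segment (co_full) and a remainder segment (co_rem).
def split_co(co, M=64):
    co_full = (co // M) * M    # portion of co aligned to M
    co_rem = co % M            # remainder, 0 if co is a multiple of M
    return co_full, co_rem

print(split_co(96))   # -> (64, 32), matching the INT8 256x256x96 example
```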
Table 6 shows the shape change of the output neuron data before and after executing the blocking instruction.
TABLE 6 shape Change before and after output neuron data blocking processing
Note that the shapes assumed in Table 6 are the parameters obtained after alignment according to the alignment constraints of the Forward4 scheme under the different Group modes and different ho×wo splitting modes in the foregoing Table 3.
For the integer segment portion, the first block instruction may be configured with reference to what was described above in connection with fig. 17.
In one example, for the Forward4 scheme with the Group1 grouping mode and a 1×16 ho×wo split, when M=64B, the first blocking instruction for the integer segment may be configured as in Table 7 below.
Table 7, output neuron integer segment block instruction parameter configuration scheme
Here dwidth denotes the data bit width, B denotes a byte, and T denotes the block bit width, i.e., the data amount of one atomic operation of the blocking instruction. in0, in1 and in2 denote the inner-layer lowest dimension size, inner-layer low dimension size and inner-layer highest dimension size of the inner-layer data block of the input tensor of the first blocking instruction; in3, in4 and in5 denote the three outer-layer dimension sizes of the input tensor, whose values are respectively the number of inner-layer data blocks contained in the input tensor along the corresponding dimensions; and is1 to is5 denote the five dimension steps of the input tensor other than the inner-layer lowest dimension. Correspondingly, on0, on1 and on2 denote the inner-layer lowest dimension size, inner-layer low dimension size and inner-layer highest dimension size of the inner-layer data block of the output tensor of the first blocking instruction; on3, on4 and on5 denote the three outer-layer dimension sizes of the output tensor, whose values are respectively the number of inner-layer data blocks contained in the output tensor along the corresponding dimensions; and the dimension steps of the output tensor other than the inner-layer lowest dimension are defined analogously.
For the remainder segment portion, the second block instruction may be configured with a slight adjustment based on the integer segment portion.
In one example, for the Forward4 scheme with the Group1 grouping mode and a 1×16 ho×wo split, when M=64B, the second blocking instruction for the remainder segment may be configured as in Table 8 below.
Table 8, neuron remainder segment block instruction parameter configuration scheme
Here dwidth denotes the data bit width, co_rem denotes the co dimension size of the remainder segment, B denotes a byte, and T denotes the block bit width, i.e., the data amount of one atomic operation of the blocking instruction. in0, in1 and in2 denote the inner-layer lowest dimension size, inner-layer low dimension size and inner-layer highest dimension size of the inner-layer data block of the input tensor of the second blocking instruction; in3, in4 and in5 denote the three outer-layer dimension sizes of the input tensor, whose values are respectively the number of inner-layer data blocks contained in the input tensor along the corresponding dimensions; and is1 to is5 denote the five dimension steps of the input tensor other than the inner-layer lowest dimension. Correspondingly, on0, on1 and on2 denote the inner-layer lowest dimension size, inner-layer low dimension size and inner-layer highest dimension size of the inner-layer data block of the output tensor of the second blocking instruction; on3, on4 and on5 denote the three outer-layer dimension sizes of the output tensor, whose values are respectively the number of inner-layer data blocks contained in the output tensor along the corresponding dimensions; and the dimension steps of the output tensor other than the inner-layer lowest dimension are defined analogously.
Thus, the blocking scheme for the output neurons has been described above using the specific example of the Forward4 scheme with the Group1 grouping mode and the 1×16 ho×wo split. As can be seen from the output shape description of the Forward4 scheme, although different Group grouping modes and different intra-group ho×wo splitting modes result in different formats of the final output neurons, these formats can generally be summarized in the following form:
[ high dimension ho ] × [ medium dimension wo ] × [ co dimension ] × [ multidimensional mixing ]
More specifically:
ho/(hgs×4) [high dimension ho] × wo/(wgs×4) [medium dimension wo] × co/group [co dimension] × (4×16×4)
where hgs and wgs respectively denote the intra-group ho and wo splitting factors, with hgs×wgs=16/group; the low-dimensional (4×16×4) part is a mixed dimension whose meaning differs depending on the grouping mode and/or intra-group splitting, but it always comprises a mixture of dimensions drawn from various combinations of: co, medium-dimension ho, low-dimension ho, high-dimension wo, and low-dimension wo.
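For illustration only, the generalized equivalent view described above can be expressed as follows; hgs, wgs and group are the grouping/splitting parameters, and all function and variable names are assumptions introduced here.

```python
# Illustrative sketch of the generalized equivalent three-dimensional view.
def equivalent_view(ho, wo, co, group, hgs, wgs):
    assert hgs * wgs == 16 // group     # intra-group split constraint
    high_ho = ho // (hgs * 4)           # [high dimension ho]
    mid_wo = wo // (wgs * 4)            # [medium dimension wo]
    co_dim = co // group                # [co dimension]
    mix = 4 * 16 * 4                    # low-dimensional mixed block
    # equivalent three-dimensional shape used before blocking:
    return (high_ho * mid_wo, co_dim, mix)

print(equivalent_view(256, 256, 96, group=1, hgs=1, wgs=16))   # -> (256, 96, 256)
```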
In order to convert such dimension-mixed multidimensional output neurons into three-dimensional data [ho×wo×co], the ho and wo dimensions currently buried in the low-dimensional (4×16×4) mixture need to be moved forward. Thus, in the same manner as described above, the multidimensional output neurons can first be regarded, by dimension merging, as an equivalent three-dimensional shape:
([ high dimension ho ] × [ medium dimension wo ])× [ co dimension ] × [ multidimensional mixing ]
The multi-dimensional mixture can be split into at least two parts: the first part needs to be moved as a whole to between the high-dimension ho and the medium-dimension wo that make up the current highest dimension, and the second part needs to be moved to between the current highest dimension and the second highest dimension, i.e., between the medium-dimension wo and the co dimension.
Similarly, one data block of size 1×co_full×(4×16×4) is processed at a time, and the multiple processing passes are likewise organized as a three-layer loop: an inner co-dimension loop, a middle wo-dimension loop, and an outer ho-dimension loop.
The inner co-dimension loop is processed as before: the co dimension may be divided into an integer segment and/or a remainder segment, with every M elements along the co dimension forming one data block and the remainder segment forming one data block, repeated repeat_co times, where repeat_co=(co+M-1)/M.
The middle wo-dimension loop includes: executing once per data block along the wo dimension, repeated repeat_wo times, where repeat_wo equals the medium dimension wo above (i.e., wo/(wgs×4)).
The outer ho-dimension loop includes: executing once per data block along the ho dimension, repeated repeat_ho times, where repeat_ho equals the high dimension ho in the highest dimension (i.e., ho/(hgs×4)).
The assignment of specific block instructions may be similarly designed according to the principles described above and will not be expanded herein.
Thus, embodiments of the present disclosure provide a blocking scheme for output neuron data that can rearrange dimension-mixed multidimensional data into a specified dimension order. In some embodiments, the blocking processing can be simplified by suitable shape equivalence. Further, through the three-layer loop, data can be placed one by one at the desired positions. The division of co into integer and remainder segments makes the scheme applicable to output neurons of arbitrary shape.
The embodiment of the disclosure also provides a data processing method for executing the block instruction by using the data processing device. Those skilled in the art will appreciate that the method steps of executing the block instructions correspond to the various features of the computing device described above in connection with the figures, and thus the features described above are equally applicable to the method steps and are not repeated here.
The disclosed embodiments also provide a chip that may include the data processing apparatus of any of the embodiments described above in connection with the accompanying drawings. Further, the present disclosure also provides a board that may include the foregoing chip.
According to different application scenarios, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server computing cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a PC device, an internet of things terminal, a mobile terminal, a cell phone, a tachograph, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an aircraft, a ship and/or a vehicle; the household appliances comprise televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas cookers and range hoods; the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasonic apparatus, and/or an electrocardiograph apparatus. The electronic device or apparatus of the present disclosure may also be applied to the internet, the internet of things, data centers, energy sources, transportation, public management, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, medical, and the like. Further, the electronic device or apparatus of the present disclosure may also be used in cloud, edge, terminal, etc. application scenarios related to artificial intelligence, big data, and/or cloud computing. In one or more embodiments, a computationally intensive electronic device or apparatus according to aspects of the present disclosure may be applied to a cloud device (e.g., a cloud server), while a less power consuming electronic device or apparatus may be applied to a terminal device and/or an edge device (e.g., a smart phone or camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device according to the hardware information of the terminal device and/or the edge device to simulate the hardware resources of the terminal device and/or the edge device, so as to complete unified management, scheduling and collaborative work of an end cloud entity or an edge cloud entity.
It should be noted that, for the sake of brevity, the present disclosure describes some methods and embodiments thereof as a series of actions and combinations thereof, but those skilled in the art will understand that the aspects of the present disclosure are not limited by the order of actions described. Thus, one of ordinary skill in the art will appreciate in light of the present disclosure or teachings that certain steps thereof may be performed in other sequences or concurrently. Further, those skilled in the art will appreciate that the embodiments described in this disclosure may be considered alternative embodiments, i.e., wherein the acts or modules involved are not necessarily required for the implementation of some or some aspects of this disclosure. In addition, the description of some embodiments of the present disclosure is also focused on, depending on the scenario. In view of this, those skilled in the art will appreciate that portions of one embodiment of the disclosure that are not described in detail may be referred to in connection with other embodiments.
In particular implementations, based on the disclosure and teachings of the present disclosure, one of ordinary skill in the art will appreciate that several embodiments of the disclosure disclosed herein may also be implemented in other ways not disclosed herein. For example, in terms of the foregoing embodiments of the electronic device or apparatus, the units are split in consideration of the logic function, and there may be another splitting manner when actually implemented. For another example, multiple units or components may be combined or integrated into another system, or some features or functions in the units or components may be selectively disabled. In terms of the connection relationship between different units or components, the connections discussed above in connection with the figures may be direct or indirect couplings between the units or components. In some scenarios, the foregoing direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustical, magnetic, or other forms of signal transmission.
In the present disclosure, units illustrated as separate components may or may not be physically separate, and components shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, some or all of the units may be selected according to actual needs to achieve the objectives of the embodiments of the present disclosure. Furthermore, in some scenarios, multiple units in embodiments of the present disclosure may be integrated into one unit, or each unit may physically exist alone.
In other implementation scenarios, the integrated units may also be implemented in hardware, i.e., as specific hardware circuits, which may include digital circuits and/or analog circuits, etc. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, which may include, but are not limited to, devices such as transistors or memristors. In view of this, various types of devices described herein (e.g., computing devices or other processing devices) may be implemented by suitable hardware processors, such as CPUs, GPUs, FPGAs, DSPs, ASICs, and the like. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including a magnetic storage medium or magneto-optical storage medium, etc.), which may be, for example, resistive random access memory (RRAM), dynamic random access memory (DRAM), static random access memory (SRAM), enhanced dynamic random access memory (EDRAM), high bandwidth memory (HBM), hybrid memory cube (HMC), ROM, RAM, etc.
The foregoing has described the embodiments of the present disclosure in detail, and specific examples have been used herein to illustrate the principles and implementations of the present disclosure; the above description of the embodiments is intended only to help understand the methods of the present disclosure and their core ideas. Meanwhile, a person of ordinary skill in the art may, based on the ideas of the present disclosure, make changes to the specific implementations and the scope of application; in summary, the content of this specification should not be construed as limiting the present disclosure.
Claims (13)
1. A data processing apparatus comprising a control circuit, a first storage circuit and a second storage circuit, wherein:
the first storage circuit is used for storing first data before processing;
the second storage circuit is used for storing the processed second data; and
The control circuit is used for configuring and executing a blocking instruction to convert first data stored on the first storage circuit according to a first dimension storage sequence into second data stored on the second storage circuit according to a second dimension storage sequence, wherein the first data is multi-dimensional data, and the multi-dimensional shape is as follows:
[ high dimension ho ] × [ medium dimension wo ] × [ co dimension ] × [ multidimensional mixing ]
Wherein the multi-dimensional mixture comprises at least various combinations of: co, medium dimension ho, low dimension ho, high dimension wo, low dimension wo;
the second data is three-dimensional data, and the three-dimensional shape is as follows:
[ho×wo×co]
wherein the transformed co is located in the lowest storage dimension of the second data, the transformed wo is located in the second lowest storage dimension of the second data, and the transformed ho is located in the highest storage dimension of the second data.
2. The data processing apparatus of claim 1, wherein the control circuit is further to:
Treating the first data as a three-dimensional equivalent shape before the partitioning process from the multi-dimensional shape:
([ high dimension ho ] × [ medium dimension wo ])× [ co dimension ] × [ multidimensional mixing ]
Wherein the highest dimension is [ high dimension ho ] × [ medium dimension wo ], the next highest dimension is co, and the lowest dimension is multi-dimensional mixing; and
Regarding the second data as a three-dimensional equivalent shape after the block processing from the three-dimensional shape:
3. The data processing apparatus of claim 2, wherein the control circuit is further to:
executing a blocking instruction on the first data of the three-dimensional equivalent shape according to a loop, wherein the loop comprises an inner layer co-dimension loop, a middle layer wo-dimension loop and an outer layer ho-dimension loop.
4. A data processing apparatus according to claim 3, wherein the inner layer co-dimensional loop comprises:
dividing the dimension of the co into an integer segment and/or a remainder segment according to the dimension of the co, wherein the dimension of the co of the integer segment is aligned to a reference alignment value M of the block instruction, and the dimension of the co of the remainder segment is smaller than the M; and
Repeating repeat_co times, wherein repeat_co=(co+M-1)/M, with every M elements being one data block and the remainder segment being one data block in the co dimension.
5. The data processing apparatus of claim 4, wherein the middle level wo-dimensional loop comprises:
Repeat_wo is repeated once every 1 data in wo dimension as a data block, wherein repeat_wo=said middle dimension wo.
6. The data processing apparatus of claim 5, wherein the outer ho dimension loop comprises:
repeat_ho is repeated every 1 data block per ho dimension, where repeat_ho = high dimension ho in the highest dimension.
7. The data processing apparatus of claim 6, wherein the control circuit is further to:
Before each execution of a block instruction for a data block, setting an input tensor offset and an output tensor offset of the block instruction for a current data block according to the size of the processed data block, wherein the input tensor offset represents an offset of the data block before processing relative to a starting storage address of the first data, and the output tensor offset represents an offset of the data block after processing relative to a starting storage address of the second data.
8. The data processing apparatus of claim 7, wherein the first data is an output neuron in a Group1 mode, ho xwo 1 x 16 split condition using a Forward4 small convolution operation scheme, having a multidimensional shape of:
9. The data processing apparatus of claim 8, wherein when the integer segment is present, the control circuitry is further to configure a first blocking instruction for a data block of the integer segment as in Table 1 below:
TABLE 1
Wherein dwidth denotes a data bit width, B denotes a byte, and T denotes a block bit width, which represents a data amount of one atomic operation of the blocking instruction; in0, in1 and in2 denote an inner-layer lowest dimension size, an inner-layer low dimension size and an inner-layer highest dimension size of an inner-layer data block of an input tensor of the first blocking instruction; in3, in4 and in5 denote three outer-layer dimension sizes of the input tensor, the size values of the three outer-layer dimensions respectively representing the number of inner-layer data blocks contained in the input tensor in the corresponding dimensions; is1 to is5 denote five dimension steps of the input tensor other than the inner-layer lowest dimension; on0, on1 and on2 denote an inner-layer lowest dimension size, an inner-layer low dimension size and an inner-layer highest dimension size of an inner-layer data block of an output tensor of the first blocking instruction; on3, on4 and on5 denote three outer-layer dimension sizes of the output tensor, the size values of the three outer-layer dimensions respectively representing the number of inner-layer data blocks contained in the output tensor in the corresponding dimensions; and the dimension steps of the output tensor other than the inner-layer lowest dimension are defined analogously.
10. The data processing apparatus according to any one of claims 8-9, wherein when the remainder segment is present, the control circuitry is further to configure a second blocking instruction for a data block of the remainder segment as in Table 2 below:
TABLE 2
Wherein dwidth denotes a data bit width, co_rem denotes a co dimension size of the remainder segment, B denotes a byte, and T denotes a block bit width, which represents a data amount of one atomic operation of the blocking instruction; in0, in1 and in2 denote an inner-layer lowest dimension size, an inner-layer low dimension size and an inner-layer highest dimension size of an inner-layer data block of an input tensor of the second blocking instruction; in3, in4 and in5 denote three outer-layer dimension sizes of the input tensor, the size values of the three outer-layer dimensions respectively representing the number of inner-layer data blocks contained in the input tensor in the corresponding dimensions; is1 to is5 denote five dimension steps of the input tensor other than the inner-layer lowest dimension; on0, on1 and on2 denote an inner-layer lowest dimension size, an inner-layer low dimension size and an inner-layer highest dimension size of an inner-layer data block of an output tensor of the second blocking instruction; on3, on4 and on5 denote three outer-layer dimension sizes of the output tensor, the size values of the three outer-layer dimensions respectively representing the number of inner-layer data blocks contained in the output tensor in the corresponding dimensions; and the dimension steps of the output tensor other than the inner-layer lowest dimension are defined analogously.
11. A chip comprising a data processing device according to any of claims 1-10.
12. A board card comprising the chip of claim 11.
13. A data processing method of executing a block instruction on output neuron data using the data processing apparatus according to any one of claims 1 to 10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111131280.6A CN113837923B (en) | 2021-09-26 | 2021-09-26 | Data processing device, data processing method and related products |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111131280.6A CN113837923B (en) | 2021-09-26 | 2021-09-26 | Data processing device, data processing method and related products |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113837923A CN113837923A (en) | 2021-12-24 |
CN113837923B true CN113837923B (en) | 2024-08-06 |
Family
ID=78970236
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111131280.6A Active CN113837923B (en) | 2021-09-26 | 2021-09-26 | Data processing device, data processing method and related products |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113837923B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109657782A (en) * | 2018-12-14 | 2019-04-19 | 北京中科寒武纪科技有限公司 | Operation method, device and Related product |
CN111079917A (en) * | 2018-10-22 | 2020-04-28 | 北京地平线机器人技术研发有限公司 | Tensor data block access method and device |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR102415508B1 (en) * | 2017-03-28 | 2022-07-01 | 삼성전자주식회사 | Convolutional neural network processing method and apparatus |
KR102095335B1 (en) * | 2017-11-15 | 2020-03-31 | 에스케이텔레콤 주식회사 | Apparatus and method for generating and using neural network model applying accelerated computation |
JP7104546B2 (en) * | 2018-04-13 | 2022-07-21 | キヤノン株式会社 | Information processing equipment, information processing method |
CN111695686B (en) * | 2019-03-15 | 2022-11-01 | 上海寒武纪信息科技有限公司 | Address allocation method and device |
CN112416433B (en) * | 2020-11-24 | 2023-01-17 | 中科寒武纪科技股份有限公司 | Data processing device, data processing method and related product |
CN112926646B (en) * | 2021-02-22 | 2023-07-04 | 上海壁仞智能科技有限公司 | Data batch normalization method, computing device, and computer-readable storage medium |
CN113407904B (en) * | 2021-06-09 | 2023-04-07 | 中山大学 | Winograd processing method, system and medium compatible with multi-dimensional convolutional neural network |
Also Published As
Publication number | Publication date |
---|---|
CN113837923A (en) | 2021-12-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2023045445A1 (en) | Data processing device, data processing method, and related product | |
CN108170640B (en) | Neural network operation device and operation method using same | |
CN114154112A (en) | Data processing device, chip and board card | |
CN113837922B (en) | Computing device, data processing method and related product | |
CN113837923B (en) | Data processing device, data processing method and related products | |
CN113837921B (en) | Data processing device, data processing method and related products | |
CN113850379A (en) | Data processing device, data processing method and related product | |
CN113850377A (en) | Data processing device, data processing method and related product | |
WO2023045638A1 (en) | Computing device, method for implementing convolution operation by using computing device, and related product | |
CN115470176B (en) | Computing device, method for implementing convolution operation by utilizing computing device and related product | |
CN114692844A (en) | Data processing device, data processing method and related product | |
CN113850378A (en) | Data processing device, data processing method and related product | |
CN111291884B (en) | Neural network pruning method, device, electronic equipment and computer readable medium | |
CN113469333A (en) | Artificial intelligence processor, method and related product for executing neural network model | |
CN116980277B (en) | Data processing method, device, computer equipment and storage medium | |
CN116150556A (en) | Computing device, method and related product for performing convolution operation | |
WO2023087814A1 (en) | Computing apparatus, method for implementing convolution operation by using computing apparatus, and related product | |
CN115878543A (en) | Computing device, method for performing convolution operation by using computing device and related product | |
CN115878546A (en) | Computing device, method for performing convolution operation by using computing device and related product | |
CN117252241A (en) | Computing device, method and related product for performing convolution operation | |
CN115878541A (en) | Computing device, method for performing convolution operation by using computing device and related product | |
CN115878544A (en) | Processing circuit, method for performing convolution operation by using processing circuit and related product | |
WO2022134872A1 (en) | Data processing apparatus, data processing method and related product | |
CN115878542A (en) | Computing device, method for performing convolution operation by using computing device and related product | |
CN118898279A (en) | Computing device, chip, board card and method for performing convolution operation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||