Detailed Description
The present disclosure is described below based on examples, but it is not limited to these examples. In the following detailed description, some specific details are set forth. It will be apparent to those skilled in the art that the present disclosure may be practiced without these specific details. Well-known methods, procedures, and processes have not been described in detail so as not to obscure the present disclosure. The figures are not necessarily drawn to scale.
The following terms are used herein.
Deep learning model: deep learning is a research direction in the field of Machine Learning (ML), introduced to bring machine learning closer to its original goal, Artificial Intelligence (AI). Deep learning learns the internal rules and representation levels of sample data, and the information obtained in the learning process is of great help in interpreting data such as text, images, and sound. Its ultimate aim is to enable machines to have human-like analysis and learning capabilities and to recognize data such as text, images, and sound. A deep learning model is a model obtained through deep learning. Depending on the model framework it relies on, a deep learning model may have different formats, such as TensorFlow, PyTorch, MXNet, and the like.
A computing device: a device with computing or processing capability. It may be embodied in the form of a terminal, such as an internet of things device, a mobile terminal, a desktop computer, or a laptop computer, or may be embodied as a server or a cluster of servers. In the internet of things context of the present disclosure, the computing device is an internet of things terminal in the internet of things.
A scheduling unit: a unit in the computing device that, in addition to performing conventional processing (processing not used for complex operations such as image processing and the various operations of deep learning models), also performs a scheduling function for the acceleration unit. It allocates to the acceleration unit the tasks that the acceleration unit needs to undertake, such as tensor calculation tasks. The scheduling unit may take various forms such as a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), or a Field Programmable Gate Array (FPGA).
An acceleration unit: a processing unit in a computing device designed to increase the data processing speed in certain special-purpose fields, to cope with the fact that conventional processing units are not efficient in those fields (for example, processing images or the various operations of a deep learning model). The acceleration unit, also known as an Artificial Intelligence (AI) processing unit, includes a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a General Purpose Graphics Processing Unit (GPGPU), a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), and special-purpose intelligent acceleration hardware (e.g., a neural network processor (NPU) or a tensor processor (TPU)).
A processing unit: a device located in the acceleration unit (such as the tensor engine of an acceleration unit core) with the capability of processing convolution, matrix multiplication, and other related operations in a deep learning model. It can be embodied as a system on chip and can be inserted into or removed from a computing device.
Tensor operation: a tensor is a set of ordered numbers that satisfy certain coordinate transformation relations when the coordinate system changes. Colloquially, it is a generalization of vectors and matrices: a scalar is treated as a 0th-order tensor, a vector as a 1st-order tensor, and a matrix as a 2nd-order tensor; when two spatial dimensions are not sufficient to represent the input quantity, tensors of order 3 or higher are used. With tensors, input quantities in a space of any dimension can be represented. A characteristic of deep learning models is that they can accept input quantities of any dimension space: whatever its dimension, an input quantity can be expressed as the input tensor fed to the nodes of the first layer of the deep learning model. The nodes of the first layer have weight tensors of the same dimension space. Because the dimension space of the input tensor is the same as that of the weight tensor, operations between the input tensor and the weight tensor, such as dot multiplication and convolution, can be performed in that dimension space, and the generated output remains in the same dimension space. The output tensor of the nodes of the previous layer is fed as input to the nodes of the next layer, where tensor operations such as dot multiplication and convolution are again performed in the same dimension space, until the output tensor of the nodes of the last layer is obtained as the output tensor of the whole deep learning model.
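By way of illustration only, the following NumPy sketch (the array names and shapes are arbitrary assumptions, not part of the disclosure) shows tensors of increasing order and a first-layer operation between an input tensor and a weight tensor in the same dimension space:

```python
import numpy as np

# Illustrative tensors of increasing order
scalar = np.float32(3.0)                  # 0th-order tensor
vector = np.array([1.0, 2.0, 3.0])        # 1st-order tensor
matrix = np.eye(3)                        # 2nd-order tensor
image_batch = np.zeros((8, 3, 32, 32))    # 4th-order tensor (batch, channel, height, width)

# A first-layer node applies a weight tensor of the same dimension space,
# e.g. a dot product between an input vector and a weight vector.
weights = np.array([0.5, -1.0, 0.25])
layer_output = vector @ weights           # the result stays in the same dimension space
```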
Application environment of the present disclosure
The embodiments of the present disclosure provide a tensor operation scheme. The tensor operation scheme is relatively universal and can be used in various hardware devices that execute various deep learning models, such as a data center, an AI (artificial intelligence) acceleration unit, a GPU (graphics processing unit), an IoT (internet of things) device that executes deep learning models, an embedded device, and the like. The tensor operation method is independent of the hardware on which the processing unit that executes it is finally deployed. For exemplary description, however, the following description mainly uses the internet of things as the application scenario. Those skilled in the art will appreciate that the disclosed embodiments are also applicable to other application scenarios.
Overall architecture of the internet of things
Fig. 1 is a system architecture diagram of an internet of things (IoT) 100 to which an embodiment of the present disclosure is applied.
The cloud 110 may represent the internet, or may be a Local Area Network (LAN), or a Wide Area Network (WAN), such as a company's private network. IoT devices may include any number of different types of devices grouped in various combinations. For example, the traffic control group 206 may include IoT devices along streets in a city. These IoT devices may include traffic lights, traffic flow monitors, cameras, weather sensors, and the like. Each IoT device in the traffic control group 206 or other subgroup may communicate with the cloud 110 over a wireless link 208, such as an LPWA link or the like. Further, the wired or wireless subnetwork 212 can allow IoT devices to communicate with each other, such as over a local area network, wireless local area network, and so forth. The IoT device may use another device, such as the gateway 210, to communicate with the cloud 110.
Other groupings of IoT devices may include remote weather stations 214, local information terminals 216, alarm systems 218, automated teller machines 220, alarm panels 222, or mobile vehicles, such as emergency vehicles 224 or other vehicles 226, and the like. Each of these IoT devices may communicate with other IoT devices, with the server 140, or both.
As can be seen from fig. 1, a large number of IoT devices may communicate through the cloud 110. This may allow different IoT devices to autonomously request or provide information to other devices. For example, the traffic control group 206 may request a current weather forecast from a group of remote weather stations 214, which may provide the forecast without human intervention. Further, the emergency vehicle 224 may be alerted by the automated teller machine 220 that a theft is occurring. As the emergency vehicle 224 proceeds toward the automated teller machine 220, it may access the traffic control group 206 to request permission to reach the location, for example, by turning a light red to block cross traffic at the intersection for sufficient time to allow the emergency vehicle 224 to enter the intersection unimpeded.
Machine learning is often used in the IoT devices described above. For example, the automated teller machine 220 recognizes human faces using machine learning, and the traffic control group 206 analyzes traffic flow and determines control schemes using machine learning. Because the environmental conditions are uncertain, the external environment bandwidth changes with the network conditions, the weather conditions, and other environmental conditions; the computing power of the processing unit and the external environment bandwidth then cease to be adapted to each other, and the computing energy efficiency of the processor is reduced. This is why the processing unit of the embodiments of the present disclosure is needed.
Scheduling unit and acceleration unit
Fig. 2 is an internal structural diagram of a scheduling unit 420 and an acceleration unit 430 of an IoT device (computing device) according to an embodiment of the present disclosure. As shown in fig. 2, the IoT device includes a memory 410, a scheduling unit 420, and an acceleration unit 430. For convenience of description, only one scheduling unit 420 and one acceleration unit 430 are shown in fig. 2, but it should be understood that the embodiments of the present disclosure are not limited thereto. The IoT device of the present disclosure may include a scheduling unit cluster and an acceleration unit cluster connected to the memory 410 through a bus, the scheduling unit cluster including a plurality of scheduling units 420 and the acceleration unit cluster including a plurality of acceleration units 430. The acceleration unit 430 is a processing unit designed to increase the data processing speed in special-purpose fields. The acceleration unit, also known as an Artificial Intelligence (AI) processing unit, includes a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a General Purpose Graphics Processing Unit (GPGPU), a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), and special-purpose intelligent acceleration hardware (e.g., a neural network processor (NPU) or a tensor processor (TPU)). The embodiments of the present disclosure can be applied to NPU scenarios, but because a generic compilation custom interface is adopted, the underlying hardware may also be a CPU, GPU, GPGPU, TPU, or the like. The scheduling unit is a processing unit that schedules the acceleration units and allocates to each acceleration unit the instruction sequences to be executed; it may take various forms such as a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), or a Field Programmable Gate Array (FPGA). In some embodiments, the scheduling unit 420 allocates to each acceleration unit 430 a sequence of instructions to be executed for the tensor computation task to be performed.
In the traditional architecture design of a central processing unit, the control unit and the storage unit occupy a large part of the architecture, while the space occupied by the computing unit is insufficient; the traditional architecture is therefore very effective in logic control but not efficient in large-scale parallel computing. For this reason, various special-purpose acceleration units have been developed to perform more efficient processing, with increased operation speed, for calculations of different functions and different fields. The acceleration unit provided by the present disclosure is a processing unit dedicated to accelerating the operation processing speed of neural network models. It is a processing unit that adopts a data-driven parallel computing architecture and is used for processing the large number of operations (such as convolution and pooling) of each neural network node. Because the data and intermediate results of these operations are closely related and frequently used throughout the calculation, and because the in-core memory capacity of a central processing unit is small, a conventional central processing unit architecture has to access off-core storage frequently and in large volume, resulting in low processing efficiency. With an acceleration unit dedicated to accelerating the operation processing speed of neural network models, each core of the acceleration unit has an on-chip memory whose storage capacity is suited to neural network calculation, so frequent access to memory outside the core is avoided, processing efficiency can be greatly improved, and computing performance is improved.
The acceleration unit 430 operates under the scheduling of the scheduling unit 420. As shown in fig. 2, the memory 410 stores various deep learning models, including the nodes of the models, the weight tensors of the nodes, and the like. These deep learning models are deployed by the scheduling unit 420 to an acceleration unit 430 in fig. 2 when needed. That is, the scheduling unit 420 may send the addresses in the memory 410 of the parameters in the model (such as the weight tensors of the nodes) to the acceleration unit 430 in the form of instructions. When the acceleration unit 430 actually uses the deep learning model to perform a calculation, it addresses these parameters (such as the weight tensors) directly in the memory 410 according to their addresses and temporarily stores them in its on-chip memory. When the acceleration unit 430 actually uses the deep learning model to perform a calculation, the scheduling unit 420 also sends the input tensor of the model to the acceleration unit 430 in the form of an instruction, and the input tensor is temporarily stored in the on-chip memory of the acceleration unit 430. The acceleration unit 430 can then perform inference calculations based on these input tensors and the parameters (e.g., weight tensors) in the model.
How the scheduling unit 420 schedules the acceleration unit 430 to operate will be described in detail below with reference to the internal structures of the scheduling unit 420 and the acceleration unit 430 shown in fig. 2.
As shown in fig. 2, scheduling unit 420 includes a plurality of processor cores 422 and a cache 221 shared by the plurality of processor cores 422. Each processor core 422 includes an instruction fetch unit 423, an instruction decode unit 424, an instruction issue unit 425, and an instruction execution unit 426.
Instruction fetch unit 423 is used to move instructions to be executed from memory 410 into an instruction register (which may be a register of register file 429 shown in fig. 2 that stores instructions) and to receive or compute the next instruction fetch address according to an instruction fetch algorithm, which may include, for example, incrementing or decrementing the address by the instruction length.
After fetching the instruction, scheduling unit 420 enters an instruction decode stage, in which instruction decode unit 424 decodes the fetched instruction according to a predetermined instruction format to obtain the operand fetch information needed by the fetched instruction, in preparation for operation by instruction execution unit 426. The operand fetch information points, for example, to an immediate, a register, or other software/hardware capable of providing source operands.
The instruction issue unit 425 is located between the instruction decode unit 424 and the instruction execution unit 426 for scheduling and control of instructions to efficiently distribute individual instructions to different instruction execution units 426, enabling parallel operation of multiple instructions.
After instruction issue unit 425 issues the instruction to instruction execution unit 426, instruction execution unit 426 begins executing the instruction. But if the instruction execution unit 426 determines that the instruction should be executed by an acceleration unit, it is forwarded to the corresponding acceleration unit for execution. For example, if the instruction is a deep learning model inference (inference) instruction, the instruction execution unit 426 no longer executes the instruction, but rather sends the instruction over the bus to the acceleration unit 430 for execution by the acceleration unit 430.
Although the embodiment of the present disclosure is used in an NPU scenario, since a generic compiling custom interface is adopted, the acceleration unit 430 shown in fig. 2 is not limited to an NPU, and may also be a TPU. The TPU, i.e. tensor processor, is a processor dedicated to speeding up the computational power of deep neural networks. In addition, the acceleration unit 430 may also be a CPU, GPU, FPGA, ASIC, or the like.
The acceleration unit 430 includes a plurality of cores 436 (four cores are shown in FIG. 2, but those skilled in the art will understand that the acceleration unit 430 may include other numbers of cores 436), a command processor 437, a direct memory access mechanism 435, and a bus channel 431.
Bus channel 431 is the channel through which instructions pass between the bus and the acceleration unit 430.
Direct Memory Access (DMA) mechanism 435 is a function provided by some computer bus architectures that enables data to be written from an attached device directly into the memory of a computer motherboard. Compared with an approach in which all data transmission between devices must pass through the processing unit, this greatly improves the efficiency of data access. Owing to this mechanism, the cores of the acceleration unit 430 can directly access the memory 410 and read parameters in the deep learning model (such as the weight tensors of the nodes), which greatly improves data access efficiency.
The command processor 437 allocates the instructions sent by the scheduling unit 420 to the acceleration unit 430 to the cores 436 for execution. The instruction execution unit 426 sends to the acceleration unit 430 the sequence of instructions to be executed that requires execution by the acceleration unit 430. The sequence of instructions to be executed is buffered at the command processor 437 as it enters from the bus channel 431, and the command processor 437 selects the core 436 to which the instruction sequence is assigned for execution. In some embodiments, the sequence of instructions to be executed is a sequence of instructions to be executed for a tensor computation task, and the command processor 437 assigns the tensor computation task to a core 436. In addition, the command processor 437 is also responsible for synchronizing operations between the cores 436.
Acceleration unit core
FIG. 3 is an internal block diagram of an acceleration unit core according to one embodiment of the present disclosure.
In one embodiment, as shown in FIG. 3, the acceleration unit core 436 includes a tensor engine 510, a pooling engine 520, a memory copy engine 530, a sequencer 550, an instruction buffer 540, an on-chip memory 560, and a constant buffer 570.
The instruction sequence assigned to the accelerator unit core 436 by the command processor 437 is first buffered in the instruction cache 540. The sequencer 550 then fetches instructions from the instruction buffer 540 in a first-in-first-out order, and assigns them to the tensor engine 510 or pooling engine 520 for execution based on the nature of the instructions. Tensor engine 510 is responsible for handling the convolution and matrix multiplication related operations in the deep learning model. The pooling engine 520 is responsible for handling pooling operations in the deep learning model. The memory copy engine 530 is a unit dedicated to handling data copies, where a data copy includes copying some data from the on-chip memory 560 to memory shared by the cores 436, or the on-chip memory 560 of other cores 436, due to a potential overflow of the on-chip memory 560. The sequencer 550 determines whether to assign an instruction to the tensor engine 510, the pooling engine 520, or the memory copy engine 530, depending on the operation property such as convolution, matrix multiplication, pooling, or data copy of the fetched instruction.
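By way of illustration only, the following sketch shows a routing rule of the kind the sequencer 550 applies; the operation names and the function itself are assumptions for illustration, not the device's actual instruction set or interface:

```python
def dispatch(instruction_op: str) -> str:
    """Illustrative routing rule: pick an engine according to the operation
    property of the fetched instruction (names here are assumptions)."""
    if instruction_op in ("convolution", "matrix_multiplication"):
        return "tensor_engine"        # tensor engine 510
    if instruction_op == "pooling":
        return "pooling_engine"       # pooling engine 520
    if instruction_op == "data_copy":
        return "memory_copy_engine"   # memory copy engine 530
    raise ValueError(f"unknown operation: {instruction_op}")
```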
The on-chip memory 560 is an in-core memory that stores the weight tensors in the deep learning model, as well as the input tensor and various intermediate results when the deep learning model is actually used. The constant buffer 570 is a buffer that stores the constant parameters of the deep learning model other than the weight tensors (e.g., hyperparameters in the neural network model). As described above, in the process of the scheduling unit 420 pre-configuring the deep learning model in the acceleration unit 430, the scheduling unit 420 sends the addresses in the memory 410 of the parameters in the model to the acceleration unit 430 in the form of instructions. These parameters include the weight tensors of the nodes and other parameters (e.g., hyperparameters). For the weight tensors, when the deep learning model is actually run, the acceleration unit 430 fetches them from the corresponding locations in the memory 410 and places them in the on-chip memory 560. For the other parameters, when the deep learning model is actually run, the acceleration unit 430 fetches them from the corresponding locations in the memory 410 and places them in the constant buffer 570. In addition, when an instruction that actually starts inference is assigned to the core 436 by the command processor 437 and executed, the input tensor in the instruction (the input to the deep learning model) is also stored in the on-chip memory 560. Furthermore, after the tensor engine 510 and the pooling engine 520 perform convolution or pooling operations, the various intermediate results obtained are also stored in the on-chip memory 560.
Processing unit
Fig. 4 is an internal structural diagram of a processing unit (tensor engine 510) according to one embodiment of the present disclosure.
In one embodiment, as shown in FIG. 4, the processing unit includes a computational unit controller 610, a monitoring unit 620, and a computational matrix 630, wherein the monitoring unit 620 is optional. The calculation matrix 630 is composed of n × m calculation units 640 arranged in n rows and m columns, n and m being non-zero natural numbers.
When the monitoring unit 620 is not used, the instruction execution unit 426 may be caused to send the external environment bandwidth of the processing unit together with the sequence of instructions to be executed. The external environment bandwidth is the ability of the external environment of the processing unit (e.g., the memory 410 and the on-chip memory 650) to transfer data to the calculation matrix 630 of the processing unit, for example the total amount of data transferred from the external environment of the processing unit (e.g., the memory 410 and the on-chip memory 650) to the calculation matrix 630 of the processing unit in one clock cycle. In general, the computing device operates under environmental conditions in which the network conditions, weather conditions, and the like change, and the external environment bandwidth changes with those environmental conditions. The sequence of instructions to be executed enters from the bus channel 431, is buffered in the command processor 437, is allocated by the command processor 437 to the core 436, is buffered in the instruction buffer 540 of the core 436, and is allocated by the sequencer 550 to the tensor engine 510, i.e., the processing unit. The processing unit extracts the external environment bandwidth from the instruction sequence.
When the monitoring unit 620 is employed, the monitoring unit 620 monitors the bandwidth of the external environment in which the processing unit is located. In one embodiment, the monitoring unit 620 may monitor network conditions and the like and infer the external environment bandwidth in combination with predetermined rules. In another embodiment, the monitoring unit 620 may record the execution of instruction sequences that have historically entered the processing unit and, based on the instruction sequence execution record, determine the average external environment bandwidth over a predetermined period preceding the current point in time as the current external environment bandwidth. The monitoring may be performed in real time, which ensures that the data delivery manner determined by the calculation unit controller 610 for the calculation matrix better matches actual conditions, allows the computing power of the processing unit and the external environment bandwidth to be adapted to each other in real time, and improves the computing energy efficiency of the processing unit.
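By way of illustration only, the following sketch shows one of the options described above, in which the current external environment bandwidth is taken as the average of recently recorded deliveries; the class, method names, and window size are assumptions for illustration, not the monitoring unit's actual design:

```python
from collections import deque


class BandwidthMonitor:
    """Illustrative sketch: keep a record of recent per-cycle data deliveries
    and report their average as the current external environment bandwidth."""

    def __init__(self, window: int = 1024):
        # sliding window over the most recent `window` clock cycles (assumed size)
        self.samples = deque(maxlen=window)

    def record(self, data_delivered_this_cycle: int) -> None:
        self.samples.append(data_delivered_this_cycle)

    def current_bandwidth(self) -> float:
        if not self.samples:
            return 0.0
        return sum(self.samples) / len(self.samples)
```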
In some embodiments, the calculation unit controller 610 controls the calculation matrix 630 according to the external environment bandwidth in which the processing unit is located. When the external environment bandwidth meets a predetermined bandwidth requirement, the calculation matrix 630 is controlled to operate in a multicast data input mode, in which data is broadcast column by column to all of the calculation units 640 of the corresponding column and broadcast row by row to all of the calculation units 640 of the corresponding row. When the external environment bandwidth does not meet the predetermined bandwidth requirement, the calculation matrix 630 is controlled to operate in a systolic data input mode, in which a calculation unit 640 receives data from the calculation unit 640 in the same row of the previous column and from the calculation unit 640 in the previous row of the same column. In both modes the calculation matrix 630 is thereby supported in performing tensor operations on the input tensor and the weight tensor. In some cases, when the external environment bandwidth meets the predetermined bandwidth requirement, i.e., is greater than a predetermined environment bandwidth threshold, the calculation unit controller 610 controls the calculation matrix 630 to operate in the multicast data input mode. In other cases, when the external environment bandwidth does not meet the predetermined bandwidth requirement, i.e., is not greater than the predetermined environment bandwidth threshold, the calculation unit controller 610 controls the calculation matrix 630 to operate in the systolic data input mode. The predetermined environment bandwidth threshold is, for example, a preset reference bandwidth, which may be the maximum amount of data that needs to be input into the calculation matrix 630 in one clock cycle in the multicast data input mode.
It should be appreciated that, in some embodiments, in the multicast data input mode, in one clock cycle data is broadcast column by column to all of the calculation units 640 of each corresponding column of the calculation matrix 630 and row by row to all of the calculation units 640 of each corresponding row of the calculation matrix 630. That is, the number of rows and columns of calculation units 640 in the calculation matrix 630 determines the maximum amount of data that needs to be input into the calculation matrix 630 in one clock cycle: in the multicast data input mode, the maximum total amount of data that needs to be input into the calculation matrix 630 in one clock cycle is n × m × 2. In some embodiments, in the systolic data input mode, in one clock cycle only the calculation units 640 in each row of the first column and the calculation units 640 in each column of the first row of the calculation matrix 630 receive data from the external environment; every other calculation unit 640 receives data from the calculation unit 640 in the same row of the previous column and from the calculation unit 640 in the previous row of the same column, so the data input into the calculation matrix 630 is systolically multiplexed within the calculation matrix 630. That is, in the systolic data input mode, the maximum total amount of data that needs to be input into the calculation matrix 630 in one clock cycle is n + m. Since n × m × 2 is greater than n + m, for an n × m calculation matrix 630 the maximum total amount of data that needs to be input in one clock cycle in the multicast data input mode is greater than that in the systolic data input mode. Accordingly, when the external environment bandwidth is greater than n × m × 2, the multicast data input mode guarantees that the total amount of data input from the external environment of the processing unit to the calculation matrix 630 reaches n × m × 2 in one clock cycle, all n × m calculation units 640 in the calculation matrix 630 perform calculation, and no calculation unit 640 is in an idle waiting state. When the external environment bandwidth is equal to n × m × 2, however, the external environment bandwidth varies with the environmental conditions of the external environment of the processing unit, so the total amount of data actually input from the external environment to the calculation matrix 630 in one clock cycle is very likely to fall below n × m × 2, leaving some calculation units 640 in an idle waiting state. When the external environment bandwidth is less than n × m × 2, in the multicast data input mode the total amount of data that can be input from the external environment to the calculation matrix 630 in one clock cycle is less than n × m × 2, and the calculation matrix 630 again has calculation units 640 in an idle waiting state.
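By way of illustration only, the counting argument above can be summarized as follows; the function name and the string mode labels are assumptions for illustration:

```python
def peak_operands_per_cycle(n: int, m: int, mode: str) -> int:
    """Peak number of operands an n x m compute matrix must receive from the
    external environment in one clock cycle, per the counting in the text."""
    if mode == "multicast":
        return n * m * 2   # every cell needs its own operand from each input matrix
    if mode == "systolic":
        return n + m       # only the first column and the first row read from outside
    raise ValueError(mode)


# For a 4 x 4 matrix: 32 operands per cycle in multicast mode vs. 8 in systolic mode.
assert peak_operands_per_cycle(4, 4, "multicast") == 32
assert peak_operands_per_cycle(4, 4, "systolic") == 8
```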
Thus, when the external environment bandwidth is not greater than n × m × 2, the working mode of the calculation matrix 630 is switched from the multicast data input mode to the systolic data input mode. In the systolic data input mode, although the total amount of data that can be input into the calculation matrix 630 from the external environment in one clock cycle is less than n × m × 2, the input data is passed from the calculation unit 640 in the same row of the previous column to the calculation unit 640 in the same row of the next column and from the calculation unit 640 in the previous row of the same column to the calculation unit 640 in the next row of the same column; that is, the input data is systolically multiplexed within the calculation matrix 630, which reduces the number of calculation units 640 in an idle waiting state.
In the embodiments of the present disclosure, when the external environment bandwidth meets the predetermined bandwidth requirement, that is, is greater than the predetermined environment bandwidth threshold, the calculation matrix 630 is controlled to operate in the multicast data input mode: the input data is input into the calculation matrix 630 by rows and by columns, all calculation units 640 perform calculation, and no calculation unit 640 in the calculation matrix 630 is in an idle waiting state, so the computing capability of the calculation matrix 630 is fully exploited, the operation throughput of the calculation matrix 630 is improved, and the computing capability of the processing unit is improved. When the external environment bandwidth does not meet the predetermined bandwidth requirement, that is, is not greater than the predetermined environment bandwidth threshold, the working mode of the calculation matrix 630 is switched to the systolic data input mode: the input data is systolically multiplexed within the calculation matrix 630, the number of calculation units 640 in an idle waiting state is reduced, and the computing energy efficiency of the processing unit is improved.
In some embodiments, the calculation matrix 630 is used to perform a multiplication of a first matrix and a second matrix to obtain a product matrix. The first matrix is, for example, an input tensor, and the second matrix is, for example, a weight tensor. The first matrix, the second matrix, and the product matrix are stored, for example, in the on-chip memory 650. Assuming that the first matrix is A, the second matrix is B, and the product matrix of the first matrix A and the second matrix B is C, they are respectively expressed as follows:
$$A=\begin{pmatrix}A_{11}&A_{12}&\cdots&A_{1N}\\A_{21}&A_{22}&\cdots&A_{2N}\\\vdots&\vdots&\ddots&\vdots\\A_{N1}&A_{N2}&\cdots&A_{NN}\end{pmatrix}\qquad(1)$$

$$B=\begin{pmatrix}B_{11}&B_{12}&\cdots&B_{1N}\\B_{21}&B_{22}&\cdots&B_{2N}\\\vdots&\vdots&\ddots&\vdots\\B_{N1}&B_{N2}&\cdots&B_{NN}\end{pmatrix}\qquad(2)$$

$$C=\begin{pmatrix}C_{11}&C_{12}&\cdots&C_{1N}\\C_{21}&C_{22}&\cdots&C_{2N}\\\vdots&\vdots&\ddots&\vdots\\C_{N1}&C_{N2}&\cdots&C_{NN}\end{pmatrix}\qquad(3)$$

Then

$$C_{11}=A_{11}B_{11}+A_{12}B_{21}+A_{13}B_{31}+\cdots+A_{1N}B_{N1}\qquad(4)$$

$$C_{12}=A_{11}B_{12}+A_{12}B_{22}+A_{13}B_{32}+\cdots+A_{1N}B_{N2}\qquad(5)$$

and, by analogy,

$$C_{1N}=A_{11}B_{1N}+A_{12}B_{2N}+A_{13}B_{3N}+\cdots+A_{1N}B_{NN}\qquad(6)$$

and, by analogy,

$$C_{N1}=A_{N1}B_{11}+A_{N2}B_{21}+A_{N3}B_{31}+\cdots+A_{NN}B_{N1}\qquad(7)$$

$$C_{N2}=A_{N1}B_{12}+A_{N2}B_{22}+A_{N3}B_{32}+\cdots+A_{NN}B_{N2}\qquad(8)$$

and, by analogy,

$$C_{NN}=A_{N1}B_{1N}+A_{N2}B_{2N}+A_{N3}B_{3N}+\cdots+A_{NN}B_{NN}\qquad(9)$$

As can be seen from equations (1) to (9), computing the product of the first matrix A and the second matrix B is in effect the process of having each element $A_{lk}$ of the first matrix A meet and be multiplied by the corresponding element $B_{kj}$ of the second matrix B, and accumulating the products, i.e. $C_{lj}=\sum_{k=1}^{N}A_{lk}B_{kj}$, where l, j, k, and N are non-zero natural numbers with l ≤ N, j ≤ N, and k ≤ N.
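By way of illustration only, the following sketch reproduces equations (1) to (9) for an arbitrary N × N example; it is a plain reference computation with assumed names, not the device's implementation:

```python
import numpy as np


def matmul_reference(A, B):
    """Reference computation of equations (1)-(9): each element C[l][j] is the
    accumulated sum of the products A[l][k] * B[k][j] over k."""
    N = A.shape[0]
    C = np.zeros((N, N))
    for l in range(N):
        for j in range(N):
            for k in range(N):
                C[l, j] += A[l, k] * B[k, j]
    return C


# Usage example: check the reference loop against NumPy's own matrix product.
A = np.arange(9, dtype=float).reshape(3, 3)
B = np.ones((3, 3))
assert np.allclose(matmul_reference(A, B), A @ B)
```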
FIG. 5 is an internal block diagram of a computation matrix 630, according to one embodiment of the present disclosure.
In some embodiments, as shown in fig. 5, the calculation matrix 630 is made up of N × N calculation units 640 arranged in N rows and N columns, N being a non-zero natural number. In the multicast data input mode, the maximum total amount of data that needs to be input into the calculation matrix 630 in one clock cycle is N × N × 2. In some embodiments, the predetermined environment bandwidth threshold is N × N × 2: the calculation matrix 630 is controlled to operate in the multicast data input mode if the external environment bandwidth is greater than N × N × 2, and is controlled to operate in the systolic data input mode if the external environment bandwidth is not greater than N × N × 2.
In some embodiments, as shown in fig. 5, in the systolic data input mode the calculation units 640 in the calculation matrix 630 receive data from the calculation unit 640 in the same row of the previous column through a first input line 641 and from the calculation unit 640 in the previous row of the same column through a second input line 642. That is, in the systolic data input mode, the input data of each calculation unit 640 comes from the calculation unit 640 in the same row of the previous column and the calculation unit 640 in the previous row of the same column.
In some cases, in the systolic data input mode, as shown in FIG. 5, in the first clock cycle the element A11 in the first row and first column of the first matrix A enters the first column of the calculation matrix 630, and the element B11 in the first row and first column of the second matrix B enters the first row of the calculation matrix 630, so that the calculation unit T11 in the first row and first column of the calculation matrix 630 obtains A11B11. In the second clock cycle, the element A11 of the first matrix A moves from the calculation unit T11 in the first row and first column rightward into the calculation unit T12 in the first row and second column, and the element B11 of the second matrix B moves from the calculation unit T11 downward into the calculation unit T21 in the second row and first column. At the same time, the element A12 in the first row and second column and the element A21 in the second row and first column of the first matrix A enter the calculation units T11 and T21 in the first two rows of the first column of the calculation matrix 630, respectively, and the element B21 in the second row and first column and the element B12 in the first row and second column of the second matrix B enter the calculation units T11 and T12 in the first two columns of the first row of the calculation matrix 630, respectively. Thus, in the second clock cycle, A12 and B21 meet in the calculation unit T11 to obtain A12B21; A11 and B12 meet in the calculation unit T12 to obtain A11B12; and A21 and B11 meet in the calculation unit T21 to obtain A21B11. By analogy, in the Nth clock cycle, exactly N elements of the first matrix A (the N elements whose row number and column number sum to N+1) enter the N calculation units T11 to TN1 in the first column of the calculation matrix 630, and exactly N elements of the second matrix B (the N elements whose row number and column number sum to N+1) enter the N calculation units T11 to T1N in the first row of the calculation matrix 630. Continuing in the same way, in the (2N-1)th clock cycle only one element ANN of the first matrix A (row number plus column number equal to 2N) enters the calculation unit TN1 in the Nth row of the first column of the calculation matrix 630, and only one element BNN of the second matrix B (row number plus column number equal to 2N) enters the calculation unit T1N in the Nth column of the first row of the calculation matrix 630.
That is, for the first 2N-1 clock cycles, in the ith clock cycle the calculation unit controller 610 causes the elements of the second matrix B whose row number and column number sum to i+1 to enter the calculation units 640 of the corresponding columns of the calculation matrix 630, and causes the elements of the first matrix A whose row number and column number sum to i+1 to enter the calculation units 640 of the corresponding rows of the calculation matrix 630; each calculation unit 640 multiplies the element received from the first matrix A and the element received from the second matrix B whose column number and row number are the same, and accumulates the product into the previous accumulated result. In the 2Nth to (3N-1)th clock cycles, no new elements of the first matrix A or the second matrix B are input into the calculation matrix 630; the elements already in the calculation matrix 630 continue to pulse through it, and each calculation unit 640 multiplies the element received from the first matrix A and the element received from the second matrix B whose column number and row number are the same and accumulates the product into the previous accumulated result. Finally, after 3N-1 clock cycles, the calculation matrix 630 has obtained every element of the product matrix C of the first matrix A and the second matrix B.
Thus, in the systolic data input mode, in the Nth clock cycle the calculation unit controller 610 causes the N elements of the second matrix B whose row number and column number sum to N+1 to enter the calculation units 640 of the corresponding columns of the calculation matrix 630, and causes the N elements of the first matrix A whose row number and column number sum to N+1 to enter the calculation units 640 of the corresponding rows of the calculation matrix 630. That is, in the systolic data input mode, the maximum total amount of data that needs to be input into the calculation matrix 630 in one clock cycle is N + N; the input data is passed from the calculation unit 640 in the same row of the previous column to the calculation unit 640 in the same row of the next column and from the calculation unit 640 in the previous row of the same column to the calculation unit 640 in the next row of the same column, so the input data is systolically multiplexed within the calculation matrix 630.
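By way of illustration only, the following cycle-accurate sketch imitates the systolic data input mode described above for square matrices: operands of the first matrix enter the left edge and operands of the second matrix enter the top edge, each skewed by one clock cycle per row or column, and then pulse right and down through the array. It is a software model for checking the data flow, not the device's implementation, and the function and variable names are assumptions:

```python
import numpy as np


def systolic_matmul(A, B):
    """Cycle-accurate sketch of an N x N output-stationary systolic array."""
    N = A.shape[0]
    C = np.zeros((N, N))
    a_reg = np.full((N, N), np.nan)   # operands of A held in the cells, moving right
    b_reg = np.full((N, N), np.nan)   # operands of B held in the cells, moving down
    for cycle in range(3 * N - 2):    # enough cycles for every operand pair to meet
        a_new = np.full((N, N), np.nan)
        b_new = np.full((N, N), np.nan)
        # pulse: pass A operands to the right neighbour, B operands downward
        a_new[:, 1:] = a_reg[:, :-1]
        b_new[1:, :] = b_reg[:-1, :]
        # feed the left edge: row i receives A[i, k] in cycle i + k (0-indexed)
        for i in range(N):
            k = cycle - i
            if 0 <= k < N:
                a_new[i, 0] = A[i, k]
        # feed the top edge: column j receives B[k, j] in cycle k + j (0-indexed)
        for j in range(N):
            k = cycle - j
            if 0 <= k < N:
                b_new[0, j] = B[k, j]
        a_reg, b_reg = a_new, b_new
        # every cell holding a pair of operands multiplies and accumulates
        valid = ~np.isnan(a_reg) & ~np.isnan(b_reg)
        C[valid] += a_reg[valid] * b_reg[valid]
    return C


# Usage example: the data flow reproduces the ordinary matrix product.
A = np.arange(16, dtype=float).reshape(4, 4)
B = np.arange(16, dtype=float).reshape(4, 4) + 1
assert np.allclose(systolic_matmul(A, B), A @ B)
```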
In some embodiments, as shown in fig. 5, in the multicast data input mode, the computing units 640 in the same column in the computing matrix 630 are commonly connected to the first column bus 643, and the computing units 640 in the same column respectively receive data through the first column bus 643. The computing units 640 in the same row of the computing matrix 630 are commonly connected to a first row bus 644, and the computing units 640 in the same row respectively receive data through the first row bus 644. That is, in the multicast data input mode, data is input to each calculation unit 640 in rows and columns in a multicast transmission manner.
In some cases, in the multicast data input mode, as shown in fig. 5, in the first clock cycle the calculation unit controller 610 broadcasts the element A11 in the first row and first column of the first matrix A to the calculation units T11 to T1N in the first row of the calculation matrix 630, broadcasts the element A21 in the second row and first column to the calculation units T21 to T2N in the second row, and so on, and broadcasts the element AN1 in the Nth row and first column to the calculation units TN1 to TNN in the Nth row; it broadcasts the element B11 in the first row and first column of the second matrix B to the calculation units T11 to TN1 in the first column of the calculation matrix 630, broadcasts the element B12 in the first row and second column to the calculation units T12 to TN2 in the second column, and so on, and broadcasts the element B1N in the first row and Nth column to the calculation units T1N to TNN in the Nth column. In the second clock cycle, the calculation unit controller 610 broadcasts the element A12 in the first row and second column of the first matrix A to the calculation units T11 to T1N in the first row, broadcasts the element A22 in the second row and second column to the calculation units T21 to T2N in the second row, and so on, and broadcasts the element AN2 in the Nth row and second column to the calculation units TN1 to TNN in the Nth row; it broadcasts the element B21 in the second row and first column of the second matrix B to the calculation units T11 to TN1 in the first column, broadcasts the element B22 in the second row and second column to the calculation units T12 to TN2 in the second column, and so on, and broadcasts the element B2N in the second row and Nth column to the calculation units T1N to TNN in the Nth column. By analogy, in the Nth clock cycle, the calculation unit controller 610 broadcasts the element A1N in the first row and Nth column of the first matrix A to the calculation units T11 to T1N in the first row, broadcasts the element A2N in the second row and Nth column to the calculation units T21 to T2N in the second row, and so on, and broadcasts the element ANN in the Nth row and Nth column to the calculation units TN1 to TNN in the Nth row; it broadcasts the element BN1 in the Nth row and first column of the second matrix B to the calculation units T11 to TN1 in the first column, broadcasts the element BN2 in the Nth row and second column to the calculation units T12 to TN2 in the second column, and so on, and broadcasts the element BNN in the Nth row and Nth column to the calculation units T1N to TNN in the Nth column.
Thus, for the first N clock cycles, in the ith clock cycle the calculation unit T11 in the first row and first column of the calculation matrix 630 obtains A1iBi1, the calculation unit T12 in the first row and second column obtains A1iBi2, and so on up to the calculation unit T1N in the first row and Nth column, which obtains A1iBiN; the calculation unit T21 in the second row and first column obtains A2iBi1, the calculation unit T22 in the second row and second column obtains A2iBi2, and so on up to the calculation unit T2N in the second row and Nth column, which obtains A2iBiN; and so on, the calculation unit TN1 in the Nth row and first column obtains ANiBi1, the calculation unit TN2 in the Nth row and second column obtains ANiBi2, and so on up to the calculation unit TNN in the Nth row and Nth column, which obtains ANiBiN. That is, for the first N clock cycles, in the ith clock cycle the calculation unit controller 610 broadcasts the elements of the ith column of the first matrix A to the respective rows of the calculation matrix 630 and broadcasts the elements of the ith row of the second matrix B to the respective columns of the calculation matrix 630; each calculation unit 640 multiplies the element received from the first matrix A and the element received from the second matrix B whose column number and row number are the same, and accumulates the product into the previous accumulated result. In the (N+1)th clock cycle, each element of the product matrix C of the first matrix A and the second matrix B obtained by the calculation matrix 630 is output.
Thus, in the multicast data input mode, for the first N clock cycles, in the ith clock cycle the calculation unit controller 610 broadcasts the N elements in the ith column of the first matrix A to the N rows of the calculation matrix 630 and broadcasts the N elements in the ith row of the second matrix B to the N columns of the calculation matrix 630. That is, in the multicast data input mode, the maximum total amount of data that needs to be input into the calculation matrix 630 in one clock cycle is N × N × 2; all N × N calculation units 640 in the calculation matrix 630 perform calculation, and no calculation unit 640 is in an idle waiting state.
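By way of illustration only, the following sketch imitates the multicast data input mode described above: in cycle i, column i of the first matrix is broadcast across the rows and row i of the second matrix across the columns, so every cell multiply-accumulates in every cycle. Again it is a software model with assumed names, not the device's implementation:

```python
import numpy as np


def multicast_matmul(A, B):
    """Sketch of the multicast data input mode for an N x N compute matrix:
    N cycles, with 2 * N * N operands delivered to the cells per cycle."""
    N = A.shape[0]
    C = np.zeros((N, N))
    for i in range(N):
        a_bcast = A[:, i][:, None]   # element A[r, i] reaches every cell of row r
        b_bcast = B[i, :][None, :]   # element B[i, c] reaches every cell of column c
        C += a_bcast * b_bcast       # each cell (r, c) accumulates A[r, i] * B[i, c]
    return C


# Usage example: every cell is busy in every one of the N cycles.
A = np.random.rand(4, 4)
B = np.random.rand(4, 4)
assert np.allclose(multicast_matmul(A, B), A @ B)
```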
It should be noted that, in general, the calculation matrix 630 in the processing unit shown in fig. 5 is fixed, while the first matrix A and the second matrix B are not. When the numbers of rows and columns of the first matrix A and the second matrix B do not match the calculation matrix 630, the first matrix A and the second matrix B are usually split, and the calculation matrix 630 then performs the tensor operation on the split first matrix A and second matrix B.
Computing unit
Fig. 6 is an internal structural diagram of the calculation unit 640 according to one embodiment of the present disclosure.
In some embodiments, the internal structure of each computation unit 640 in the computation matrix 630 is the same. As shown in fig. 6, the calculation unit 640 includes: a first register 651, a second register 652, a third register 653, a multiplier 654, and an accumulator 655.
In some embodiments, the monitoring unit 620 may monitor the external environment bandwidth in real time, and the calculation unit controller 610 switches the calculation matrix 630 between the systolic data input mode and the multicast data input mode according to the external environment bandwidth monitored in real time. In some embodiments, the calculation unit 640 further comprises a first gate 656 and a second gate 657. The first gate 656 is configured to receive the control signal M provided by the calculation unit controller 610 and, according to the control signal M, gate either the second input line 642 or the first column bus 643 in the calculation matrix 630, so that the calculation matrix 630 is switched between the systolic data input mode and the multicast data input mode. The second gate 657 is configured to receive the control signal M provided by the calculation unit controller 610 and, according to the control signal M, gate either the first input line 641 or the first row bus 644 in the calculation matrix 630, so that the calculation matrix 630 is switched between the systolic data input mode and the multicast data input mode. In some cases, according to the control signal M, the first gate 656 gates the second input line 642, so that the calculation unit 640 receives data from the calculation unit 640 in the previous row of the same column through the second input line 642, and the second gate 657 gates the first input line 641, so that the calculation unit 640 receives data from the calculation unit 640 in the same row of the previous column through the first input line 641; the calculation matrix 630 then operates in the systolic data input mode. In other cases, according to the control signal M, the first gate 656 gates the first column bus 643, so that the calculation units 640 in the same column receive data through the first column bus 643, and the second gate 657 gates the first row bus 644, so that the calculation units 640 in the same row receive data through the first row bus 644; the calculation matrix 630 then operates in the multicast data input mode.
In some embodiments, in the systolic data input mode, in one clock cycle a calculation unit 640 in the calculation matrix 630 receives data from the calculation unit 640 in the same row of the previous column through the first input line 641 and from the calculation unit 640 in the previous row of the same column through the second input line 642. According to the control signal M, the first register 651 stores the data transferred from the calculation unit 640 in the same row of the previous column, and the second register 652 stores the data transferred from the calculation unit 640 in the previous row of the same column. The multiplier 654 multiplies the element received from the first input line 641 and the element received from the second input line 642, whose column number and row number are the same. The accumulator 655 accumulates the result of the multiplication by the multiplier 654 into the previous accumulated result, and the third register 653 stores the accumulated result. In the next clock cycle, the calculation unit 640 transfers the data stored in the first register 651 to the calculation unit 640 in the same row of the next column, and transfers the data stored in the second register 652 to the calculation unit 640 in the next row of the same column.
In some embodiments, in the multicast data input mode, in one clock cycle the calculation units 640 in the same column of the calculation matrix 630 receive data through the first column bus 643, and the calculation units 640 in the same row receive data through the first row bus 644. According to the control signal M, the first register 651 and the second register 652 suspend operation. The multiplier 654 multiplies the element received from the first row bus 644 and the element received from the first column bus 643, whose column number and row number are the same. The accumulator 655 accumulates the result of the multiplication by the multiplier 654 into the previous accumulated result, and the third register 653 stores the accumulated result.
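By way of illustration only, the following behavioral sketch summarizes how one calculation unit 640 acts in the two modes; the class, method names, and mode labels are assumptions for illustration and omit the gating hardware details:

```python
class ComputeUnitSketch:
    """Behavioral sketch of one calculation unit 640 (registers 651-653,
    multiplier 654, accumulator 655); operands are assumed valid each cycle."""

    def __init__(self):
        self.a_reg = None   # first register 651: operand arriving from the left
        self.b_reg = None   # second register 652: operand arriving from above
        self.acc = 0.0      # third register 653: the accumulated result

    def clock(self, mode: str, a_in: float, b_in: float) -> None:
        """One clock cycle: multiply (654) and accumulate (655) the two operands."""
        if mode == "systolic":
            self.a_reg, self.b_reg = a_in, b_in   # latch so they can be forwarded next cycle
            self.acc += a_in * b_in
        elif mode == "multicast":
            self.acc += a_in * b_in               # operands come straight off the row/column buses
        else:
            raise ValueError(mode)

    def forward(self):
        """Values passed on in systolic mode: A to the right neighbour, B downward."""
        return self.a_reg, self.b_reg
```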
Tensor operation method of the disclosed embodiment
Fig. 7 is a flowchart of a tensor operation method provided by an embodiment of the present disclosure. As shown in fig. 7, the method comprises the following steps.
In step S701, the external environment bandwidth in which a plurality of computing units are located is obtained, the computing units forming a computing matrix of n rows and m columns, where n and m are non-zero natural numbers.
In step S702, based on the external environment bandwidth, when the external environment bandwidth in which the computation matrix is located meets a predetermined bandwidth requirement, the computation matrix is controlled to operate in a multicast data input mode, in which data is broadcast column by column to all computation units of the corresponding column and row by row to all computation units of the corresponding row; when the external environment bandwidth does not meet the predetermined bandwidth requirement, the computation matrix is controlled to operate in a systolic data input mode, in which the computation units receive data from the computation units in the same row of the previous column and the computation units in the previous row of the same column, so as to support tensor operation.
The method of the embodiments of the present disclosure is executed in a computing device that includes a processing unit and a calculation unit controller. The method uses the calculation unit controller to control, according to the external environment bandwidth in which the processing unit is located, the calculation matrix to work in at least one of a systolic data input mode and a multicast data input mode so as to support tensor operation. The external environment bandwidth and the computing power of the processing unit are thus adapted to each other, so that the computing device achieves better computing efficiency.
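By way of illustration only, the two steps can be condensed into the following sketch; the function name and return labels are assumptions, and the threshold n*m*2 is the reference bandwidth of the multicast data input mode discussed earlier:

```python
def select_input_mode(external_bandwidth: float, n: int, m: int) -> str:
    """Minimal sketch of steps S701/S702: compare the obtained external
    environment bandwidth with the multicast mode's reference bandwidth
    n*m*2 and select the working mode of the n x m compute matrix."""
    threshold = n * m * 2
    return "multicast" if external_bandwidth > threshold else "systolic"


# Usage example: a 16 x 16 matrix needs more than 512 operands per cycle for multicast mode.
assert select_input_mode(600, 16, 16) == "multicast"
assert select_input_mode(300, 16, 16) == "systolic"
```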
Commercial value of the disclosed embodiments
The processing unit provided by the embodiments of the present disclosure flexibly selects the working mode of the calculation matrix according to the external environment bandwidth in which the processing unit is located. When the external environment bandwidth is greater than the predetermined environment bandwidth threshold, the calculation matrix is controlled to operate in the multicast data input mode, all calculation units perform calculation, and no calculation unit is in an idle waiting state, so the computing capability of the processing unit is improved. When the external environment bandwidth is not greater than the predetermined environment bandwidth threshold, the working mode of the calculation matrix is switched to the systolic data input mode, the input data is systolically multiplexed within the calculation matrix, the number of calculation units in an idle waiting state is reduced, and the computing energy efficiency of the processing unit is improved. In this scenario, reducing the power consumption of the processing unit in the systolic data input mode reduces the power consumption of the computing device and thus the running cost of the whole internet of things, while improving the computing performance of the processing unit in the multicast data input mode improves the computing capability of the computing device and thus of the whole internet of things. The embodiments of the present disclosure therefore reduce computation energy consumption and improve computing capability, and thus have good commercial and economic value.
As will be appreciated by one skilled in the art, the present disclosure may be embodied as systems, methods and computer program products. Accordingly, the present disclosure may be embodied in the form of entirely hardware, entirely software (including firmware, resident software, micro-code), or in the form of a combination of software and hardware. Furthermore, in some embodiments, the present disclosure may also be embodied in the form of a computer program product in one or more computer-readable media having computer-readable program code embodied therein.
Any combination of one or more computer-readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium is, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer-readable storage medium include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical memory, a magnetic memory, or any suitable combination of the foregoing. In this context, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with a processing unit, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., and any suitable combination of the foregoing.
Computer program code for carrying out embodiments of the present disclosure may be written in one or more programming languages or combinations thereof. The programming languages include object-oriented programming languages such as JAVA and C++, and may also include conventional procedural programming languages such as C. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the internet using an internet service provider).
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.