CN118550584A - Data processing method, device, processor, computer equipment and system - Google Patents
- Publication number
- CN118550584A (application number CN202310231360.1A)
- Authority
- CN
- China
- Prior art keywords
- matrix
- calculation
- instruction
- processor
- instructions
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30098—Register arrangements
- G06F9/3012—Organisation of register space, e.g. banked or distributed register file
Abstract
A data processing method, apparatus, processor, computer device, and system are disclosed, relating to the field of computers. The method merges the sets of matrices associated with a plurality of matrix calculation instructions and then performs the matrix calculation. Because a processor core can process only one instruction at a time, it must perform a separate matrix calculation for each of a plurality of matrix calculation instructions; if the matrices associated with those instructions are small, the hardware resources of the processor executing them cannot be fully utilized, and calculation throughput is low. The scheme provided by this application merges multiple matrix calculations into a single one: the matrices associated with a plurality of matrix calculation instructions are combined into one large matrix, and the matrix calculation is performed once on the combined matrix. Merging thus increases the matrix size, reduces the number of matrix calculations, and improves the utilization of the processor executing the matrix calculation instructions.
Description
Technical Field
The present application relates to the field of computers, and in particular, to a data processing method, apparatus, processor, computer device, and system.
Background
Currently, a computer compiles an application program into a language it can execute; that is, it translates a source program written in a high-level language into a target program and then runs the target program. During compilation, fragments of the application suited to the Scalable Matrix Extension (Scalable Matrix Extension, SME) are compiled into calls to SME instructions. SME defines an architecture capable of storing two-dimensional matrices, together with a series of instructions for performing matrix calculations. Besides matrices, computation also involves scalars and vectors: a scalar is a single number, and a vector is a sequence of numbers. When a user process performs scalar computation, the register utilization of a processor executing SME instructions is low compared with vector computation, because a scalar is only a single value. In addition, if the user process needs to compute on small matrices, the hardware resources of the processor executing the SME instructions cannot be fully utilized, resulting in low calculation throughput. How to improve the utilization of a processor executing SME instructions is therefore a technical problem to be solved.
Disclosure of Invention
The present application provides a data processing method, apparatus, processor, computer device, and system, so as to improve the utilization of a processor executing SME instructions.
In a first aspect, a data processing method is provided. The method includes: after a plurality of matrix calculation instructions are acquired, merging the matrix sets associated with those instructions and performing the matrix calculation. For example, a first matrix set associated with a first matrix calculation instruction and a second matrix set associated with a second matrix calculation instruction are obtained, where the first and second matrix calculation instructions are each one of the plurality of matrix calculation instructions, the first matrix set includes at least two first matrices, and the second matrix set includes at least two second matrices. The matrices of the first and second matrix sets are merged to obtain a merged matrix, and the calculation operation is performed on the merged matrix to obtain the calculation result of the first matrix calculation instruction and the calculation result of the second matrix calculation instruction, respectively.
Because a processor core can process only one instruction at a time, it must perform a separate matrix calculation for each of a plurality of matrix calculation instructions; if the matrices associated with those instructions are small, the hardware resources of the processor executing them cannot be fully utilized, and calculation throughput is low. The scheme provided by this application merges multiple matrix calculations into a single one: the matrices associated with a plurality of matrix calculation instructions are combined into one large matrix, and the matrix calculation is performed once on the combined matrix. Merging thus increases the matrix size, reduces the number of matrix calculations, and improves the utilization of the processor executing the matrix calculation instructions.
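The merging idea described above can be sketched in a few lines of NumPy. This is an illustrative model only, not the patented implementation: two independent products A1·B1 and A2·B2 are placed on the diagonal of larger block matrices, one multiplication is performed, and the two per-instruction results are then read back from disjoint blocks. The function name and shapes are assumptions for illustration.

```python
import numpy as np

def merged_matmul(a1, b1, a2, b2):
    """Compute a1 @ b1 and a2 @ b2 via a single, larger multiplication."""
    m1, n1 = a1.shape[0], b1.shape[1]
    # Build block-diagonal operands; off-diagonal blocks are zero, so the
    # two instructions' matrices never interact.
    a = np.block([[a1, np.zeros((a1.shape[0], a2.shape[1]))],
                  [np.zeros((a2.shape[0], a1.shape[1])), a2]])
    b = np.block([[b1, np.zeros((b1.shape[0], b2.shape[1]))],
                  [np.zeros((b2.shape[0], b1.shape[1])), b2]])
    c = a @ b  # one large matrix multiplication instead of two small ones
    # Split the merged result back into the two per-instruction results.
    return c[:m1, :n1], c[m1:, n1:]
```

On real hardware the gain comes from filling the SME tile registers; here the split simply recovers the same values the two separate multiplications would produce.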
In one possible implementation, when the matrices associated with the matrix calculation instructions satisfy a merge condition, they are merged to obtain the merged matrix. The merge condition indicates a size threshold and a merge period that must be satisfied before a merged matrix calculation is performed. The size threshold is determined by the size of the register used to store the matrix. The merge period indicates the time unit at which merged matrix calculation is performed.
For example, merging the matrices of the first matrix set and the second matrix set to obtain a merged matrix includes: when the matrices of the first and second matrix sets satisfy the size threshold, merging them to obtain the merged matrix. Understandably, merging multiple matrix sets may yield multiple merged matrices; making the size of each merged matrix as close as possible to the size threshold, that is, close to the register size, fully utilizes both the hardware resources of the processor executing the matrix calculation instructions and the storage resources of the register holding the matrix, thereby improving the utilization of the processor executing the matrix calculation instructions.
For another example, merging the matrices of the first matrix set and the second matrix set to obtain a merged matrix includes: merging them within the merge period. It should be appreciated that the shorter the merge period, the more frequently merged matrix calculations are performed; the longer the merge period, the less frequently they are performed. Thus, within the merge period, if the size of the merged matrix is close to the register size, the matrix sets are merged; if the merged matrix is still smaller than the register size but the merge period has ended, the matrix sets are merged anyway. By setting the merge period reasonably and performing instruction merging within it, the utilization of the processor executing the matrix calculation instructions is improved, while overly frequent merging, which would itself reduce that utilization, is avoided.
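The interaction of the two merge triggers, reaching the size threshold or exhausting the merge period, can be sketched as a small batching queue. All constants, class and field names here are assumptions for illustration; the patent does not specify this structure.

```python
import time

SIZE_THRESHOLD = 256   # e.g. derived from the SME register capacity (assumed)
MERGE_PERIOD_S = 1.0   # illustrative merge window; hardware would use far less

class MergeQueue:
    """Collects matrix calculation instructions until a merge is triggered."""

    def __init__(self):
        self.pending = []                       # (instruction, matrix_size)
        self.window_start = time.monotonic()

    def submit(self, instruction, size):
        """Queue an instruction; return the batch to merge, or None."""
        self.pending.append((instruction, size))
        total = sum(s for _, s in self.pending)
        expired = time.monotonic() - self.window_start >= MERGE_PERIOD_S
        # Merge when the combined size fills the register, or when the
        # merge period ends even though the register is not yet full.
        if total >= SIZE_THRESHOLD or expired:
            batch, self.pending = self.pending, []
            self.window_start = time.monotonic()
            return batch
        return None
```

In this sketch expiry is only checked on submission; a real implementation would also flush on a timer so that a lone instruction is not stranded past the merge period.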
The matrix calculation operation includes at least one of matrix multiplication, matrix addition, matrix subtraction, scalar multiplication, and fused matrix multiply-add.
In another possible implementation, the calculation operation is matrix multiplication, and the multiple matrices in the merged matrix are arranged along the diagonal, that is, the merged matrix is block-diagonal. This prevents matrices associated with different matrix calculation instructions from being multiplied with each other, which would make the calculation results incorrect.
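The correctness of the diagonal arrangement follows from the standard block-matrix product rule (a general identity, not specific to this patent):

```latex
\begin{pmatrix} A_1 & 0 \\ 0 & A_2 \end{pmatrix}
\begin{pmatrix} B_1 & 0 \\ 0 & B_2 \end{pmatrix}
=
\begin{pmatrix} A_1 B_1 & 0 \\ 0 & A_2 B_2 \end{pmatrix}
```

The per-instruction results $A_1 B_1$ and $A_2 B_2$ thus land in disjoint diagonal blocks of the merged result and never mix.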
In another possible implementation, obtaining the first matrix set associated with the first matrix calculation instruction and the second matrix set associated with the second matrix calculation instruction includes: when a remote device performs non-matrix computation, obtaining the first matrix set associated with the first matrix calculation instruction and the second matrix set associated with the second matrix calculation instruction from the remote device. The remote device includes any one of a processor core, a processor, and a computer device, and may or may not itself support matrix computation. By offloading matrix calculation instructions from remote devices, the instructions can be merged, improving the utilization of the processor executing them.
In another possible implementation, after the calculation operation on the merged matrix is performed, the calculation result of the first matrix calculation instruction and the calculation result of the second matrix calculation instruction are extracted from the calculation result of the merged matrix. The method further includes: feeding back the calculation result of the first matrix calculation instruction, according to the identifier associated with that instruction, to the remote device that sent it; and feeding back the calculation result of the second matrix calculation instruction, according to the identifier associated with that instruction, to the remote device that sent it.
In a second aspect, a data processing apparatus is provided, including modules for performing the data processing method of the first aspect or any possible design of the first aspect. For example, the data processing apparatus includes a communication module and a merging module. The communication module is configured to obtain a first matrix set associated with a first matrix calculation instruction and a second matrix set associated with a second matrix calculation instruction, where the first and second matrix calculation instructions are each one of a plurality of matrix calculation instructions, the first matrix set includes at least two first matrices, and the second matrix set includes at least two second matrices. The merging module is configured to merge the matrices of the first and second matrix sets to obtain a merged matrix, and is further configured to perform the calculation operation on the merged matrix to obtain the calculation result of the first matrix calculation instruction and the calculation result of the second matrix calculation instruction, respectively.
In one possible implementation, when merging the matrices of the first and second matrix sets to obtain the merged matrix, the merging module is specifically configured to: merge the matrices of the first and second matrix sets when they satisfy a size threshold, where the size threshold is determined by the size of the register used to store the matrix.

In another possible implementation, when merging the matrices of the first and second matrix sets to obtain the merged matrix, the merging module is specifically configured to: merge the matrices of the first and second matrix sets within a merge period, where the merge period indicates the time unit at which merged matrix calculation is performed.

In another possible implementation, the calculation operation includes at least one of matrix multiplication, matrix addition, matrix subtraction, scalar multiplication, and fused matrix multiply-add.

In another possible implementation, the calculation operation is matrix multiplication, and the at least two matrices being merged are arranged along the diagonal.
In another possible implementation, when obtaining the first matrix set associated with the first matrix calculation instruction and the second matrix set associated with the second matrix calculation instruction, the communication module is specifically configured to: obtain both matrix sets from a remote device when the remote device performs non-matrix computation.

The remote device includes any one of a processor core, a processor, and a server, and may or may not itself support matrix computation.

In another possible implementation, when obtaining the calculation result of the first matrix calculation instruction and the calculation result of the second matrix calculation instruction, the merging module is specifically configured to: extract both results from the calculation result of the merged matrix.

In another possible implementation, the communication module is further configured to feed back the calculation result of the first matrix calculation instruction, according to the identifier associated with that instruction, to the remote device that sent it, and to feed back the calculation result of the second matrix calculation instruction, according to the identifier associated with that instruction, to the remote device that sent it.
In a third aspect, a processor is provided, the processor comprising at least two processor cores and a memory, the memory for storing a set of computer instructions; when the processor core executes the set of computer instructions, the operational steps of the method of the first aspect or any one of the possible implementations of the first aspect are performed.
In a fourth aspect, a computer device is provided, the computer device comprising a processor for performing the operational steps of the method of the first aspect or any one of the possible implementations of the first aspect.
In a fifth aspect, a computer system is provided, the computer system comprising a plurality of computer devices for performing the operational steps of the method of the first aspect or any one of the possible implementations of the first aspect.
In a sixth aspect, a computer-readable storage medium is provided, including computer software instructions; when the computer software instructions run in a processor, they cause the processor to perform the operational steps of the method described in the first aspect or any one of its possible implementations.
In a seventh aspect, a computer program product is provided which, when run on a computer, causes the computer to perform the operational steps of the method as described in the first aspect or any one of the possible implementations of the first aspect.
For the technical effects of any one of the second to seventh aspects, reference may be made to the technical effects of the first aspect or of its different designs, which are not repeated here.
Based on the implementations provided in the above aspects, further combinations may be made in the present application to provide further implementations.
Drawings
FIG. 1 is a schematic diagram of a computer device according to the present application;
FIG. 2 is a schematic diagram of a scenario for matrix computation instruction offloading provided by the present application;
FIG. 3 is a schematic flow chart of a data processing method according to the present application;
FIG. 4 is a schematic diagram of a matrix computing instruction offload provided by the present application;
FIG. 5 is a schematic diagram of a combined execution matrix calculation according to the present application;
FIG. 6 is a schematic diagram of a matrix combination according to the present application;
fig. 7 is a schematic structural diagram of a data processing apparatus according to the present application.
Detailed Description
For convenience of description, the terms involved in the present application are first briefly introduced.
Single instruction multiple data (single instruction multiple data, SIMD) is a technique that performs the same operation on each element of a set of data (also called a "data vector"), achieving spatial parallelism. SIMD is an extension of the basic instruction set of a central processing unit (central processing unit, CPU).
The scalable vector extension (scalable vector extension, SVE) is a SIMD instruction set under the advanced RISC machine (advanced reduced instruction set computing machines, ARM) AArch64 architecture, aimed at accelerating high-performance computing. It supports variable vector lengths, per-lane predication, gather loads and scatter stores, and horizontal operations. SVE differs from traditional SIMD in that SIMD processes fixed-length data, whereas SVE processes variable-length data.
The scalable matrix extension (scalable matrix extension, SME) is an instruction set that supports matrix computation, built on the scalable vector extensions (e.g., SVE and SVE2) under the ARMv9-A architecture. It will be appreciated that the processor provides a new mode supplying matrix computation functionality, making it easier for the processor to use different vector sizes when performing matrix computation as well as SVE or SIMD computation. SME defines an architecture capable of storing a two-dimensional matrix and a series of instructions related to matrix computation, including: ① instructions that compute the outer product of two vectors to implement matrix multiplication; ② instructions that read, store, and move vectors from the matrix. SME supports vector lengths from 128 to 2048 bits.
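The outer-product instruction family mentioned in ① builds a matrix multiplication as a sum of rank-1 updates: an m×k by k×n product is k accumulated outer products of a column of A with a row of B. A NumPy sketch of that decomposition, purely to illustrate the arithmetic (the function name is an assumption, and real SME accumulates into tile registers in hardware):

```python
import numpy as np

def matmul_by_outer_products(a, b):
    """Compute a @ b as a sum of column-by-row outer products."""
    m, k = a.shape
    _, n = b.shape
    c = np.zeros((m, n))
    for i in range(k):
        # One outer-product accumulate step: C += a[:, i] (x) b[i, :],
        # analogous to a single SME outer-product instruction.
        c += np.outer(a[:, i], b[i, :])
    return c
```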
A matrix is a rectangular array of real or complex numbers.
In the hierarchy of a computer storage system, the closer a memory is to the processor, the faster its access speed and the smaller its capacity. In order from nearest the processor to farthest, the memories are: registers, cache, main memory, and disk.
Main memory, also known as internal memory, is referred to simply as memory (memory). Main memory is an important component of a computer system: it is the bridge through which external memory communicates with the processor. It temporarily stores operational data for the processor and data exchanged between the processor and external memory such as a hard disk. For example, when the computer starts running, the data to be operated on is loaded from main memory into the processor; after the operation completes, the processor stores the result back into main memory. Examples of main memory include dynamic random access memory (dynamic random access memory, DRAM) and double data rate synchronous dynamic random access memory (double data rate synchronous DRAM, DDR SDRAM).
External memory, also called secondary memory or auxiliary storage, has a large capacity and a low access speed compared with main memory. Examples include network storage, solid state drives (solid state disk or solid state drive, SSD), and hard disk drives (hard disk drive, HDD).
Cache memory (cache) is a high-speed, small-capacity memory interposed between the processor and main memory, referred to simply as the cache. Compared with main memory, the cache has a small capacity and a high access speed. Caches include the level-one cache (L1 cache), level-two cache (L2 cache), and level-three cache (L3 cache). The L1 cache is located within the processor core. The L2 cache may be located inside or outside the processor core. The L1 and L2 caches are typically private (exclusive) to the processor core in which they reside. The L3 cache is typically located outside the processor cores and is shared by multiple processor cores.
A register (register) is a small memory located inside the processor, used to temporarily store data involved in an operation and the operation's results. A register may be implemented as a conventional sequential logic circuit.
A socket is a mechanism by which application-layer processes exchange data using a network protocol: an abstraction of the endpoints of bidirectional communication between application processes on different hosts in a network. Positionally, a socket connects upward to the application process and downward to the network protocol stack; it is the interface through which an application program communicates over a network protocol and interacts with the protocol stack.
To improve the utilization of a processor executing SME instructions, the present application provides a data processing method: after a plurality of matrix calculation instructions are acquired, the matrix sets associated with those instructions are merged and the matrix calculation is performed. For example, a first matrix set associated with a first matrix calculation instruction and a second matrix set associated with a second matrix calculation instruction are obtained, where the first and second matrix calculation instructions are each one of the plurality of matrix calculation instructions, the first matrix set includes at least two first matrices, and the second matrix set includes at least two second matrices. The matrices of the first and second matrix sets are merged to obtain a merged matrix, and the calculation operation is performed on the merged matrix to obtain the calculation result of the first matrix calculation instruction and the calculation result of the second matrix calculation instruction, respectively.
Because a processor core can process only one instruction at a time, it must perform a separate matrix calculation for each of a plurality of matrix calculation instructions; if the matrices associated with those instructions are small, the hardware resources of the processor executing them cannot be fully utilized, and calculation throughput is low. The scheme provided by this application merges multiple matrix calculations into a single one: the matrices associated with a plurality of matrix calculation instructions are combined into one large matrix, and the matrix calculation is performed once on the combined matrix. Merging thus increases the matrix size, reduces the number of matrix calculations, and improves the utilization of the processor executing the matrix calculation instructions.
Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of a computer device according to the present application. As shown in fig. 1, computer device 100 includes a processor 110 and a main memory 120. Processor 110 includes a cache 111 and registers 112.
Processor 110 is the control center of computer device 100. The processor 110 may be a computing unit with computing capabilities, such as a central processing unit (central processing unit, CPU), a graphics processor (graphics processing unit, GPU), a data processing unit (data processing unit, DPU), a neural processing unit (neural processing unit, NPU), etc. Processor 110 includes a processor core or a plurality of processor cores. For example, the processor 110 shown in FIG. 1 includes N processor cores.
Cache 111 is used to store instructions or data that processor cores in processor 110 may access multiple times. This increases the speed at which the processor processes data and avoids frequent processor accesses to main memory 120.
In physical form, cache memory 111 may be random access memory (random access memory, RAM), static random access memory (static RAM, SRAM), dynamic random access memory (dynamic RAM, DRAM), or another type of storage device that can store information and instructions.
Logically, the cache memory 111 may be a level one cache (L1 cache), a level two cache (L2 cache), a level three cache (L3 cache), or any level of cache devices. For example, as shown in fig. 1, the cache memory 111 provided inside the processor core may be a first level cache (L1 cache) and a second level cache (L2 cache). The cache memory 111 provided outside the processor core may be a level three cache (L3 cache).
Registers 112 are used to store instructions or data that processor cores in processor 110 may access multiple times. Because registers are faster to access than the cache memory, instructions or data that the processor core accesses many times can be stored in a register first, further increasing the speed at which the processor processes data.
Among the instructions that a processor core may process are SIMD instructions, SVE instructions, and SME instructions. The type of data used by the processor cores in processing different instructions is different and the computations performed are different. For example, when a processor processes SIMD instructions, it processes fixed length data (e.g., integer data, floating point data), performs addition, subtraction, or other calculations on the data. As another example, when the processor processes SVE instructions, variable length data (e.g., integer data, floating point data) is processed, and addition, subtraction, or other calculations are performed on the data. As another example, the processor performs matrix calculations on the matrix as it processes SME instructions.
In some embodiments, the register type is divided by the type of instruction that the register stores. A register may be used to store data for an instruction. The processor core may include various types of registers. Illustratively, as shown in FIG. 1, the processor 110 includes M registers 112. Wherein the M registers 112 include three types of registers. For example, a first type of register is used to store data for the processor core to process SIMD instructions, a second type of register is used to store data for the processor core to process SVE instructions, and a third type of register is used to store data for the processor core to process SME instructions.
Alternatively, one register may be used to store data for a variety of instructions. For example, the register is used to store data of any one of the SIMD instruction, the SVE instruction, and the SME instruction.
For ease of description, the storage medium storing the data of the SIMD instruction may be referred to as a SIMD register. The storage medium storing the data of the SVE instructions may be referred to as SVE registers. The storage medium storing the data of the SME instruction may be referred to as a SME register.
It will be appreciated that at one point in time, the processor core executes one instruction and stores the data of that instruction in the register corresponding to that instruction type, while the registers corresponding to other instruction types are idle. For example, when the processor core processes a SIMD instruction, it stores the data of the SIMD instruction in the SIMD register, and the SME register is idle. When the processor core processes an SVE instruction, it stores the data of the SVE instruction in the SVE register, and the SME register is again idle. In addition, if the matrix associated with an SME instruction is small, performing the matrix computation on that small matrix uses only a small portion of the processor core's storage and computing resources, and the computation throughput is low.
In the present application, the processor core provides a matrix calculation instruction offload function and a matrix calculation instruction merge function. A matrix calculation instruction described in the present application may refer to an SME instruction.
When a processor core is performing non-matrix computation, or when the size of the matrix on which a processor core is to perform matrix computation is smaller than a threshold, the matrix calculation instruction is offloaded to a processor core that supports matrix computation and contains SME registers. After the processor core supporting matrix calculation acquires a plurality of matrix calculation instructions, it merges the matrices associated with those instructions into one large matrix and performs matrix calculation on the merged matrix once. The merged matrix increases the matrix size and reduces the number of matrix calculations, thereby improving the utilization of the processor executing the matrix calculation instructions and improving the computation throughput.
The processor core that offloads a matrix calculation instruction may or may not itself support matrix calculation, that is, it may or may not support the function of processing matrix calculation instructions.
For example, processor core 1 does not support matrix computation and processor core 2 does support matrix computation. Processor core 1 reads the SME instruction, which is offloaded to processor core 2 because processor core 1 does not support matrix computation. The processor core 1 processes other instructions. If the processor core 1 reads the SVE instruction, the processor core 1 processes the SVE instruction and performs non-matrix computation. Therefore, the processor cores which do not support matrix calculation indirectly realize matrix calculation, and the application range of matrix calculation instructions in a computer system is expanded. Moreover, the processor cores which do not support matrix computation can asynchronously execute non-matrix computation, thereby improving the computation throughput.
As another example, processor core 1, processor core 2, and processor core 3 each support matrix computation. The processor core 1 reads the first SME instruction, and the processor core 2 reads the second SME instruction, wherein the size of the matrix associated with the first SME instruction and the size of the matrix associated with the second SME instruction are smaller than a size threshold. Processor core 1 offloads the first SME instruction to processor core 3 and processor core 2 offloads the second SME instruction to processor core 3. Processor core 1 and processor core 2 may process other instructions (e.g., SVE instructions, SIMD instructions). The processor core 3 acquires a first SME instruction and a second SME instruction, merges the matrix associated with the first SME instruction and the matrix associated with the second SME instruction, executes the calculation operation of the merged matrix, and acquires the calculation result of the first SME instruction and the calculation result of the second SME instruction respectively. Thus, the utilization rate of the processor executing the matrix calculation instruction is improved, and the calculation throughput is improved.
In addition, the application does not limit the execution body for executing the matrix calculation instruction unloading function and the matrix calculation instruction merging function, and the matrix calculation instruction unloading function and the matrix calculation instruction merging function can be realized by the same computer equipment or different computer equipment.
Illustratively, as shown in (a) of fig. 2, the matrix calculation instruction offload function and the matrix calculation instruction merge function are implemented within the processor. Wherein the matrix computing instruction offload function is implemented between different processor cores in the same processor. Processor core 0, processor core 1, and processor core 2 are located in the same processor. Processor core 0 offloads matrix calculation instruction 1 to processor core 1 and processor core 2 offloads matrix calculation instruction 2 to processor core 1. Processor core 1 performs matrix computation on matrix computation instruction 1 and matrix computation instruction 2 in combination.
As shown in (b) of fig. 2, the matrix calculation instruction offload function and the matrix calculation instruction merge function are implemented in the same computer device. Wherein the matrix computing instruction offload function is implemented between different processors in the same computer device. Processor 0 includes processor core 0 and processor core 1. Processor 1 includes processor core 0 and processor core 1. Processor core 0 in processor 0 offloads matrix calculation instruction 1 to processor core 1 in processor 1, processor core 1 in processor 0 offloads matrix calculation instruction 2 to processor core 1 in processor 1. Processor core 1 in processor 1 performs matrix calculation for matrix calculation instruction 1 and matrix calculation instruction 2 in combination.
As shown in fig. 2 (c), the matrix calculation instruction offload function and the matrix calculation instruction merge function are implemented at different computer devices. Wherein the matrix computing instruction offload function is implemented between different processors in different computer devices. The computer device 1 comprises a processor 0 and a processor 1. The computer device 2 comprises a processor 0 and a processor 1. Processor 0 in computer device 1 offloads matrix calculation instructions 1 to processor 0 in computer device 2, processor 1 in computer device 1 offloads matrix calculation instructions 2 to processor 0 in computer device 2. The processor 0 in the computer device 2 performs matrix calculation on the combination of the matrix calculation instruction 1 and the matrix calculation instruction 2.
The processor core is coupled to other devices in the processor (e.g., other processor cores, the cache 111) via the bus 112, and accesses those devices via the bus 112. For example, matrix calculation instructions are transferred between processor cores via the bus 112. As another example, the processor core transmits a write instruction or a read instruction to the cache 111 through the bus 112, causing the cache 111 to write a cache line according to the write instruction or read a cache line according to the read instruction. The bus 112 may be an industry standard architecture (ISA) bus, a peripheral component interconnect express (PCIe) bus, an extended industry standard architecture (EISA) bus, a unified bus (UBus or UB), a compute express link (CXL), a cache coherent interconnect for accelerators (CCIX), etc., or a proprietary bus of a non-standard architecture. The bus may be classified as an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in fig. 1, but this does not mean there is only one bus or one type of bus.
The manner in which the processor cores and other devices in the processor are connected in fig. 1 is merely a schematic illustration. In one possible implementation, the processor cores are connected with other devices in the processor through a ring bus (ring bus), and access the other devices in the processor through the ring bus. In another possible implementation, the processor core and other devices in the processor are connected through a mesh bus (mesh bus), and the processor core accesses the other devices in the processor through the mesh bus.
The processor 110 is connected to a main memory 120 through a Memory Controller (MC) 113. The processor 110 may perform various functions of the computer device 100 by running or executing software programs stored in the main memory 120, and invoking data stored in the main memory 120. For example, the processor 110 writes the calculation result of the matrix calculation instruction back to the main memory 120 through the Memory Controller (MC) 113.
Main memory 120 may be a read-only memory (ROM) or other type of static storage device that can store static information and instructions, a random access memory (RAM) or other type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), or the like. The main memory 120 is also used to store programs related to the present embodiment.
The computer device 100 may also contain a peripheral 130, and the processor 110 may contain a peripheral management module 114. The peripheral 130 is connected to the peripheral management module 114 via the bus 112. The peripheral 130 may be an application-specific integrated circuit (ASIC), for example: a digital signal processor (DSP), one or more field programmable gate arrays (FPGA), a graphics processing unit (GPU), or a neural network processing unit (NPU). The peripheral 130 may also transmit instructions for performing matrix calculations to the processor core via the bus 112, causing the processor core to merge multiple SME instructions to perform matrix calculations.
The device structure shown in fig. 1 does not constitute a limitation of the computer device, which may include more or fewer components than illustrated, combine certain components, or arrange the components differently.
Next, a data processing method provided by the present application will be described in detail with reference to the accompanying drawings.
For ease of explanation, the offloading scenario between computer devices shown in fig. 2 (c) is taken as an example: the remote device offloads matrix calculation instructions to the local device, and the local device performs the matrix calculation instruction merge function. Referring to fig. 3, fig. 3 is a flow chart of a data processing method provided by the present application. As shown in fig. 3, the method includes the following steps.
Step 310, the remote device offloads the matrix calculation instruction.
Before the remote device runs an application program, the application program is compiled into a target program, i.e., a program in a language the computer can recognize and run. The target program includes matrix computing instructions and non-matrix computing instructions (e.g., SIMD instructions, SVE instructions). In one example, when the remote device does not support matrix computation, the remote device reads the matrix calculation instructions and offloads them, i.e., offloads the matrix calculation instructions to a device supporting matrix computation.
In another possible implementation, the remote device offloads the matrix calculation instructions when the size of the matrix associated with the matrix calculation instructions read by the remote device is less than or equal to a threshold value.
The device supporting matrix computation merges and processes multiple matrix calculation instructions, improving the utilization of its matrix-computation resources and of the storage resources of the SME registers in the computer device. Meanwhile, the remote device can process non-matrix computing instructions, improving the utilization of the computing resources in the computer device that do not support matrix computation.
The remote device may be any one of a processor core, a processor, and a server. It should be appreciated that matrix calculation instructions may be offloaded from processor core to processor core in the same computer device, from processor to processor, or between different computer devices.
In addition, the remote device may be a device that supports matrix computation or a device that does not support matrix computation.
Step 320, the local device obtains a matrix set associated with a plurality of matrix calculation instructions, where the matrix calculation instructions are used to instruct to execute matrix calculation.
A plurality of matrix calculation instructions are obtained from one or more remote devices. For example, a plurality of matrix calculation instructions are acquired from a remote device. As another example, a plurality of matrix calculation instructions are obtained from a plurality of remote devices. The source of the plurality of matrix calculation instructions is not limited in the application. Each matrix set associated with the matrix calculation instruction comprises at least two matrices, and a calculation result can be obtained by performing matrix calculation on the at least two matrices.
In some embodiments, the local device obtains a matrix calculation message sent by the remote device, where the matrix calculation message is used to indicate a matrix set associated with the matrix calculation instruction.
For example, when a matrix calculation message is sent between processors in the same computer device, or between processor cores in the same processor, the matrix calculation message includes a physical address for indicating a storage location of a matrix set associated with the matrix calculation instruction. Wherein the processor core or processor that sends the matrix calculation message may not support matrix calculation or may support matrix calculation. The local device may be a processor core or processor that supports matrix computation.
For another example, when a matrix calculation message is sent between a computer device and a computer device, the matrix calculation message includes a matrix set associated with matrix calculation instructions. Wherein the computer device sending the matrix calculation message may not support matrix calculation or may support matrix calculation. The local device may be a computer device supporting matrix computation.
In addition, the matrix calculation message also comprises an identifier indicating the remote device that offloaded the matrix calculation instruction, so that the local device can feed back the calculation result of the matrix calculation instruction to that remote device according to the identifier.
Alternatively, the matrix calculation message and the calculation result of the matrix calculation instruction may be transmitted between the remote device and the local device based on a socket. For example, as shown in fig. 4, the remote device may provide a .so library. A .so library is a program function library under Linux, i.e., compiled code and data that can be used by other programs. In the present application, the .so library may provide an interface for offloading matrix calculation instructions. The remote device runs a user process (application program) that calls the offload interface, and transmits the matrix calculation message to the local device through a socket according to the correspondence between the remote device and the local device. The local device runs a service process that merges the matrix sets associated with a plurality of matrix calculation instructions and performs matrix calculation on the merged matrix.
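The socket-based transfer described above can be sketched in Python. This is a minimal illustration only: the JSON-line message layout, the function names `offload` and `serve_once`, and the use of a plain-Python `matmul` in place of real SME hardware are all assumptions of this sketch, not the actual interface of the .so library.

```python
import json
import socket
import threading

def matmul(a, b):
    # Plain-Python matrix multiplication, standing in for the SME matrix calculation.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def serve_once(srv):
    # Local-device service process: receive one matrix calculation message,
    # perform the calculation, and feed the result back on the same socket.
    conn, _ = srv.accept()
    msg = json.loads(conn.makefile("r").readline())
    reply = {"id": msg["id"], "result": matmul(msg["a"], msg["b"])}
    conn.sendall((json.dumps(reply) + "\n").encode())
    conn.close()
    srv.close()

def offload(port, instr_id, a, b):
    # Remote-device side: send a matrix calculation message tagged with an
    # identifier, then wait for the fed-back calculation result.
    cli = socket.create_connection(("127.0.0.1", port))
    cli.sendall((json.dumps({"id": instr_id, "a": a, "b": b}) + "\n").encode())
    reply = json.loads(cli.makefile("r").readline())
    cli.close()
    return reply

srv = socket.socket()
srv.bind(("127.0.0.1", 0))           # let the OS pick a free port
srv.listen(1)
port = srv.getsockname()[1]
t = threading.Thread(target=serve_once, args=(srv,))
t.start()
reply = offload(port, "instr-1", [[1, 2], [3, 4]], [[5, 6], [7, 8]])
t.join()
```

The identifier carried in the message is what later lets the service process route each calculation result back to the remote device that offloaded the instruction.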
Step 330, the local device merges the matrix sets associated with the plurality of matrix calculation instructions, and performs matrix calculation on the merged matrix.
The local device merges the matrix sets associated with a plurality of matrix calculation instructions to obtain a merged matrix, and executes the calculation operation on the merged matrix, i.e., performs matrix calculation on the merged matrix. The calculation operation includes at least one of matrix multiplication, matrix addition, matrix subtraction, scalar (number) multiplication, and fused matrix multiply-add.
Matrix computation is understood to mean a computation operation performed on at least two matrices based on arithmetic operation symbols. For example, the plurality of matrix calculation instructions includes a first matrix calculation instruction and a second matrix calculation instruction. The first matrix calculation instructions associate a first set of matrices, the first set of matrices including at least two first matrices. The second matrix calculation instructions associate a second set of matrices, the second set of matrices including at least two second matrices. The first matrix calculation instructions are for instructing to perform matrix calculation on at least two first matrices. The second matrix calculation instructions are for instructing to perform matrix calculation on at least two second matrices.
When the local device merges the matrix sets associated with a plurality of matrix calculation instructions, it may merge the matrices located on the same side of the arithmetic operator in those matrix sets, obtaining a plurality of merged matrices. Each merged matrix contains the matrices that are located on the same side of the arithmetic operator in the matrix sets associated with the plurality of matrix calculation instructions. The at least two matrices in the matrix set associated with one matrix calculation instruction occupy the same position in the different merged matrices, and the matrix sets associated with the plurality of matrix calculation instructions perform the same matrix calculation.
It should be noted that when the matrix calculation is matrix addition or matrix subtraction, the matrices to be merged may be placed in the same row or the same column. When the matrix calculation is scalar multiplication, the matrices in the matrix sets associated with the plurality of matrix calculation instructions must be multiplied by the same number, and the matrices to be merged may likewise be placed in the same row or the same column. When the matrix calculation is matrix multiplication, the matrices to be merged cannot be placed in the same row or the same column, otherwise the calculation results would interfere with each other; instead, the matrices are arranged along the diagonal, with the remaining positions padded with 0, so that the matrices in each merged matrix lie on the diagonal. The matrices to be merged are those located on the same side of the arithmetic operator in the matrix sets associated with the plurality of matrix calculation instructions.
For example, as shown in fig. 5 (a), the first matrix calculation instruction associates matrix A and matrix B for matrix multiplication, and the second matrix calculation instruction associates matrix D and matrix E for matrix multiplication. Matrix A and matrix D, located on the left side of the multiplication operator, are merged to obtain a first merged matrix in which matrix A and matrix D are arranged diagonally. Matrix B and matrix E, located on the right side of the multiplication operator, are merged to obtain a second merged matrix in which matrix B and matrix E are arranged diagonally. Matrix A is located in the upper left corner of the first merged matrix and matrix B in the upper left corner of the second merged matrix; matrix D is located in the lower right corner of the first merged matrix and matrix E in the lower right corner of the second merged matrix. Matrix multiplication is performed on the first merged matrix and the second merged matrix, i.e., the rows of the first merged matrix are multiplied by the columns of the second merged matrix, to obtain the matrix multiplication result. Matrix A multiplied by matrix B gives matrix C, and matrix D multiplied by matrix E gives matrix F; the matrix multiplication result therefore contains matrix C and matrix F arranged diagonally. Splitting the result yields matrix C, the product of matrix A and matrix B, and matrix F, the product of matrix D and matrix E.
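The diagonal merging of fig. 5 (a) can be reproduced in a short, illustrative Python sketch (plain lists instead of SME registers; the helper names `block_diag` and `split_block` are invented for this example):

```python
def matmul(a, b):
    # Plain-Python matrix multiplication.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def block_diag(m1, m2):
    # Place m1 in the upper-left corner and m2 in the lower-right corner,
    # padding the remaining positions with 0.
    c1, c2 = len(m1[0]), len(m2[0])
    top = [row + [0] * c2 for row in m1]
    bottom = [[0] * c1 + row for row in m2]
    return top + bottom

def split_block(m, r0, c0, rows, cols):
    # Extract the sub-matrix whose top-left element is at (r0, c0).
    return [row[c0:c0 + cols] for row in m[r0:r0 + rows]]

A = [[1, 2], [3, 4]]; B = [[5, 6], [7, 8]]
D = [[1, 0], [0, 1]]; E = [[9, 9], [9, 9]]

merged_left = block_diag(A, D)               # first merged matrix
merged_right = block_diag(B, E)              # second merged matrix
product = matmul(merged_left, merged_right)  # one matrix calculation

C = split_block(product, 0, 0, 2, 2)  # equals A x B
F = split_block(product, 2, 2, 2, 2)  # equals D x E
```

Only the diagonal placement keeps the two products independent: the off-diagonal blocks of `product` are all zero, so splitting recovers exactly the two per-instruction results.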
As another example, as shown in (b) of fig. 5, the first matrix calculation instructs the correlation matrix a and the matrix b to perform matrix addition. The second matrix calculation instructs the correlation matrix d and the matrix e to perform matrix addition. And combining the matrix a and the matrix d positioned at the left side of the addition symbol to obtain a first combined matrix. The matrix a and the matrix d in the first combined matrix may be arranged in a diagonal line. And combining the matrix b and the matrix e positioned on the right side of the addition symbol to obtain a second combined matrix, wherein the matrix b and the matrix e in the second combined matrix are arranged in a diagonal manner. Wherein matrix a is located in the upper left corner of the first merged matrix and matrix b is located in the upper left corner of the second merged matrix. Matrix d is located in the lower right hand corner of the first merged matrix and matrix e is located in the lower right hand corner of the second merged matrix. Optionally, the matrix a and the matrix d in the first combined matrix are located in the same row, and the matrix b and the matrix e in the second combined matrix are located in the same row.
Matrix addition is performed on the first merged matrix and the second merged matrix, i.e., the elements at the same position in the two matrices are added, to obtain the matrix addition result. Matrix a added to matrix b gives matrix c, and matrix d added to matrix e gives matrix f. The matrix addition result contains matrix c and matrix f arranged diagonally, or matrix c and matrix f located in the same row. Splitting the result yields matrix c, the sum of matrix a and matrix b, and matrix f, the sum of matrix d and matrix e.
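For element-wise operations such as matrix addition, the same-row placement described above suffices, since positions in the two merged matrices do not interact. A hypothetical sketch (helper names invented for illustration):

```python
def concat_rows(m1, m2):
    # Place two matrices side by side, i.e., in the same row of the merged matrix.
    return [r1 + r2 for r1, r2 in zip(m1, m2)]

def mat_add(x, y):
    # Element-wise addition of two matrices of the same shape.
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]

a = [[1, 2], [3, 4]]; b = [[10, 10], [10, 10]]
d = [[5, 5], [5, 5]]; e = [[1, 1], [1, 1]]

merged_left = concat_rows(a, d)    # first merged matrix
merged_right = concat_rows(b, e)   # second merged matrix
total = mat_add(merged_left, merged_right)  # one element-wise addition

c = [row[:2] for row in total]  # a + b
f = [row[2:] for row in total]  # d + e
```

No zero padding is needed here, which is why addition and subtraction allow the cheaper same-row or same-column layouts.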
In some embodiments, a merging condition may be preconfigured, and when the plurality of matrices meet the merging condition, a plurality of matrix sets associated with the plurality of matrix calculation instructions are merged to obtain a merged matrix. Wherein the merge condition is used to indicate a size threshold and a merge period that need to be satisfied for the merge to perform matrix computation.
The size threshold is determined by the size of the register used to store the matrix. For example, the size threshold may be less than or equal to the size of the register. It should be appreciated that the size of the merged matrix is less than or equal to the size threshold. The closer the size of the merged matrix is to the size threshold, the more fully the SME register is utilized, improving the utilization of the processor executing the SME instructions and of the storage resources of the SME register. When the size of the merged matrix is well below the size threshold, the utilization of the processor executing the SME instructions and of the storage resources of the SME register is lower.
The merging period indicates the unit of time over which merged matrix computation is performed, i.e., the time threshold for periodically merging multiple matrix sets. It should be understood that the shorter the merging period, the more frequently merged matrix computation is performed, consuming more computing resources; moreover, a merge may be flushed before enough matrices accumulate, so the matrix on which the processor executing the SME instructions performs matrix calculation is smaller and its utilization is lower. The longer the merging period, the fewer times merged matrix computation is performed, but the greater the processing delay. Thus, a reasonable setting of the merging period helps improve the utilization of the processor executing the SME instructions. The merging period may be configured according to at least one of expert experience and service requirements. For example, if the service requires faster processing and lower processing delay, the merging period may be set shorter; if the service requires processing reliability, the merging period may be set longer.
Alternatively, the size threshold and the merging period are configured through start-up parameters, a configuration file, hard coding, etc. of the computer program that implements the matrix calculation merge function. A start-up parameter is a parameter read when the computer program starts, typically located at the beginning of the computer program; i.e., the size threshold and the merging period may be located at the beginning of the computer program. A configuration file is a computer file that configures parameters and initial settings for a computer program, i.e., a file configured differently for different objects; it is typically external to the computer program and may be read after the program runs. Hard coding is a software development practice that embeds data directly into the source code of a computer program or other executable object, as opposed to obtaining the data from outside or generating it at runtime.
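As a sketch of the start-up-parameter option, the merge function's size threshold and merging period could be read from command-line arguments; the flag names and default values below are purely illustrative assumptions, not parameters defined by the present application:

```python
import argparse

def parse_merge_config(argv):
    # Hypothetical start-up parameters for the merge function.
    # --size-threshold: maximum total matrix size per merge (elements).
    # --merge-period-ms: time threshold for periodically merging matrix sets.
    p = argparse.ArgumentParser()
    p.add_argument("--size-threshold", type=int, default=4096)
    p.add_argument("--merge-period-ms", type=int, default=50)
    args = p.parse_args(argv)
    return args.size_threshold, args.merge_period_ms
```

The same two values could equally come from a configuration file read after start-up, or be hard-coded, as described above.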
In a first possible implementation, when the size of the plurality of matrices meets the size threshold, the plurality of matrices are combined to obtain a combined matrix. For example, when the sum of the sizes of the matrices included in the plurality of matrix sets associated with the plurality of matrix calculation instructions is equal to the size threshold, the plurality of matrices are combined to obtain a combined matrix.
In a second possible implementation, if within the merging period the sizes of the matrices meet the size threshold first, the plurality of matrices are merged to obtain a merged matrix. For example, within the merging period, when the sum of the sizes of the matrices contained in the plurality of matrix sets reaches the size threshold, the matrices are merged to obtain a merged matrix. Illustratively, as shown in fig. 6 (a), before the merging period has ended, the size of the first five matrices reaches the size threshold, so the first five matrices are merged to obtain a merged matrix.
In a third possible implementation, when the plurality of matrices meet the merging period, they are merged to obtain a merged matrix. That is, if at the end of the merging period the sum of the sizes of the matrices contained in the matrix sets is still smaller than the size threshold, the time threshold takes precedence and the matrices are merged to obtain a merged matrix. Illustratively, as shown in fig. 6 (b), at the end of the merging period the first four matrices are merged to obtain a merged matrix, although their total size is smaller than the size threshold.
In a fourth possible implementation, at the end of the merging period, there is no matrix that can be merged in the merging period, and then matrix calculation is performed on the single matrix. Illustratively, as shown in (c) of fig. 6, at the end of the merging period, there is only one matrix in the merging period, matrix merging is not required, and matrix calculation is performed on a single matrix.
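The four implementations above amount to one flushing policy: flush early when the accumulated sizes reach the size threshold, and flush whatever is pending, even a single matrix, when the merging period ends. A simplified sketch, using explicit arrival timestamps instead of a real clock (the function name and the `(arrival_time, matrix_size)` event format are assumptions of this illustration):

```python
def merge_batches(events, size_threshold, period):
    # events: list of (arrival_time, matrix_size), sorted by arrival time.
    # Returns the batches of matrix sizes that would be merged and computed.
    batches, pending, pending_size, period_end = [], [], 0, period
    for t, size in events:
        while t >= period_end:                # merging period expired:
            if pending:                       # flush whatever is pending,
                batches.append(pending)       # even a single matrix
                pending, pending_size = [], 0
            period_end += period
        pending.append(size)
        pending_size += size
        if pending_size >= size_threshold:    # size threshold met: flush early
            batches.append(pending)
            pending, pending_size = [], 0
    if pending:                               # final flush at shutdown
        batches.append(pending)
    return batches
```

For example, with a size threshold of 4 and a period of 10, matrices of size 2 arriving at times 0, 1, and 2 produce one threshold-triggered batch of two matrices followed by a batch holding the leftover matrix.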
Step 340, the local device feeds back, to the remote device, the calculation result corresponding to each matrix calculation instruction in the matrix calculation result of the merged matrix.
The local device merges the matrix sets associated with the plurality of matrix calculation instructions and performs matrix calculation on the merged matrix to obtain the matrix calculation result of the merged matrix. This result contains the results of performing matrix calculation on each of the matrix sets associated with the plurality of matrix calculation instructions. The position of the calculation result of the matrix set associated with each matrix calculation instruction within the overall result is the same as the position of the matrices associated with that instruction within the merged matrices. For example, as shown in fig. 5 (a), matrix A is located in the upper left corner of the first merged matrix and matrix B in the upper left corner of the second merged matrix, and matrix C, the product of matrix A and matrix B, is located in the upper left corner of the matrix multiplication result. Matrix D is located in the lower right corner of the first merged matrix and matrix E in the lower right corner of the second merged matrix, and matrix F, the product of matrix D and matrix E, is located in the lower right corner of the matrix multiplication result.
The local device splits the matrix calculation result of the merged matrix to obtain the calculation result of the matrix set associated with each matrix calculation instruction before merging.
According to the identifier associated with each matrix calculation instruction included in the matrix calculation message, the local device feeds back the corresponding calculation result in the matrix calculation result of the merged matrix to the remote device.
For example, the local device obtains the calculation result of the first matrix calculation instruction and the calculation result of the second matrix calculation instruction from the matrix calculation results of the combined matrices. And feeding back the calculation result of the first matrix calculation instruction to the remote equipment which sends the first matrix calculation instruction according to the identification associated with the first matrix calculation instruction. And feeding back the calculation result of the second matrix calculation instruction to the remote equipment which sends the second matrix calculation instruction according to the identification associated with the second matrix calculation instruction.
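The split-and-feedback step can be sketched as a lookup from each instruction's identifier to the block position its result occupies in the merged result; the layout format below is an assumption for illustration, not a structure defined by the present application:

```python
def split_and_route(result, layout):
    # layout: instruction identifier -> (row, col, rows, cols), the block
    # position of that instruction's result inside the merged result.
    # Returns one calculation result per offloading remote device.
    replies = {}
    for instr_id, (r0, c0, rows, cols) in layout.items():
        replies[instr_id] = [row[c0:c0 + cols] for row in result[r0:r0 + rows]]
    return replies

# Merged multiplication result of fig. 5 (a): C in the upper left, F in the
# lower right, zeros elsewhere.
merged_result = [
    [19, 22, 0, 0],
    [43, 50, 0, 0],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
layout = {"instr-1": (0, 0, 2, 2), "instr-2": (2, 2, 2, 2)}
replies = split_and_route(merged_result, layout)
```

Each entry of `replies` is then sent back over the socket to the remote device identified by the corresponding identifier in the matrix calculation message.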
Because a processor core can process only one instruction at a time, matrix calculation must be executed multiple times when the processor core processes a plurality of matrix calculation instructions; if the matrices associated with those instructions are small, the hardware resources of the processor executing them cannot be fully utilized and the calculation throughput is low. The scheme provided by the application combines multiple matrix calculations into one: the matrices associated with a plurality of matrix calculation instructions are merged into one large matrix, and matrix calculation is executed once on the merged matrix. Merging increases the matrix size and reduces the number of matrix calculations, so the storage and calculation resources of the processor executing the matrix calculation instructions are fully utilized, its utilization rate is improved, and the calculation throughput is increased.
It should be noted that, for the offloading scenario of matrix computation between processor cores within one processor shown in fig. 2 (a), and the offloading scenario of matrix computation between processor cores of different processors shown in fig. 2 (b), the data processing procedure is similar to that of fig. 2 (c). For example, in the scenario shown in fig. 2 (a), processor core 0 and processor core 2 correspond to remote devices and processor core 1 corresponds to the local device; in the scenario shown in fig. 2 (b), processor 0 corresponds to the remote device and processor 1 corresponds to the local device. For brevity, only the scenario shown in fig. 2 (c) is described here as an example; the data processing procedure of the other scenarios is similar to that shown in fig. 3 and is not repeated here.
It will be appreciated that, to implement the functions of the above embodiments, the processor includes corresponding hardware structures and/or software modules for performing each function. Those skilled in the art will readily appreciate that the units and method steps described in connection with the embodiments disclosed herein may be implemented in hardware or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer-software-driven hardware depends on the particular application scenario and the design constraints of the technical solution.
The data processing method provided according to the present embodiment is described in detail above with reference to fig. 1 to 6, and the data processing apparatus provided according to the present embodiment will be described below with reference to fig. 7.
Fig. 7 is a schematic structural diagram of a possible data processing apparatus according to this embodiment. The data processing apparatus may be used to implement the functions of the local device in the above method embodiments, and can therefore also achieve the beneficial effects of those embodiments. In this embodiment, the data processing apparatus may be a processor core as shown in fig. 1, or may be a module (e.g., a chip) applied to a server.
As shown in fig. 7, the data processing apparatus 700 includes a communication module 710, a combining module 720, and a storage module 730. The data processing apparatus 700 is configured to implement the functions of the local device in the embodiment of the method shown in fig. 3.
The communication module 710 is configured to obtain a first matrix set associated with a first matrix calculation instruction and a second matrix set associated with a second matrix calculation instruction, where the first matrix calculation instruction and the second matrix calculation instruction are each one of a plurality of matrix calculation instructions, the first matrix set includes at least two first matrices, and the second matrix set includes at least two second matrices. For example, the communication module 710 is configured to perform step 320 in fig. 3.
The merging module 720 is configured to merge the matrices of the first matrix set and the second matrix set to obtain a merged matrix, and execute a calculation operation of the merged matrix to obtain a calculation result of the first matrix calculation instruction and a calculation result of the second matrix calculation instruction, respectively. For example, the merging module 720 is configured to perform step 330 in fig. 3.
The computing operation includes at least one of matrix multiplication, matrix addition, matrix subtraction, scalar multiplication, and fused matrix multiply-add. When the computing operation is matrix multiplication, the at least two matrices in the merged matrix are arranged along the diagonal.
The storage module 730 is configured to store the matrix calculation instructions and the matrices, so that the merging module 720 can merge the matrices associated with the plurality of matrix calculation instructions and perform matrix calculation on the merged matrix.
Optionally, the merging module 720 is specifically configured to merge the multiple matrices associated with the multiple matrix calculation instructions when the multiple matrices satisfy the merging condition, so as to obtain a merged matrix.
The merging condition indicates a size threshold and a merging period that merged matrix calculation is required to satisfy, wherein the size threshold is determined by the size of the register used for storing the matrices, and the merging period indicates the time unit within which matrix calculations are merged.
The merging module 720 is specifically configured to, when the matrices of the first matrix set and the second matrix set meet a size threshold, merge the matrices of the first matrix set and the second matrix set to obtain a merged matrix, where the size threshold is determined by a size of a register used for storing the matrices.
The merging module 720 is specifically configured to merge the matrices of the first matrix set and the second matrix set in a merging period, so as to obtain a merged matrix, where the merging period is used to indicate a time unit for merging and performing matrix calculation.
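A minimal sketch of the two merging conditions above — the size threshold and the merging period — might look as follows. The register capacity, period length, queue layout, and function name are all assumptions made for illustration, not details from the embodiment:

```python
import time

# Illustrative values: a size threshold derived from register capacity,
# and a merging period bounding how long instructions are buffered.
REGISTER_CAPACITY = 64   # rows the matrix register can hold (assumed)
MERGE_PERIOD = 0.001     # merging period in seconds (assumed)

def collect_batch(queue, capacity=REGISTER_CAPACITY, period=MERGE_PERIOD):
    """Take instructions from `queue` until the merged size would exceed
    the size threshold or the merging period elapses."""
    batch, used = [], 0
    deadline = time.monotonic() + period
    while queue and time.monotonic() < deadline:
        rows = queue[0]["rows"]   # diagonal merging: row counts add up
        if used + rows > capacity:
            break                 # size threshold reached; merge what we have
        batch.append(queue.pop(0))
        used += rows
    return batch

queue = [{"id": i, "rows": 20} for i in range(5)]
batch = collect_batch(queue)
# Three 20-row matrices fit within the 64-row threshold; two remain queued.
assert len(batch) == 3 and len(queue) == 2
```

Instructions left in the queue when the threshold is hit would simply be merged in a later period, so neither condition starves small workloads.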
It should be appreciated that the data processing apparatus 700 of the embodiment of the present application may be implemented by an application-specific integrated circuit (ASIC) or a programmable logic device (PLD), where the PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof. When the data processing method shown in fig. 3 is implemented by software, the data processing apparatus 700 and each module thereof may also be software modules.
The data processing apparatus 700 according to the embodiment of the present application may correspond to performing the method described in the embodiment of the present application, and the above and other operations and/or functions of each unit in the data processing apparatus 700 are respectively for implementing the corresponding flow of each method in fig. 3, and are not described herein for brevity.
The present application also provides a processor comprising a memory and at least two processor cores, the memory for storing a set of computer instructions; the operational steps of the methods described in the various embodiments above are performed when the processor core executes the set of computer instructions.
The application also provides a computer system comprising a plurality of computer devices for performing the operational steps of the methods described in the various embodiments above.
The method steps in this embodiment may be implemented by hardware, or by a processor executing software instructions. The software instructions may consist of corresponding software modules, which may be stored in random access memory (RAM), flash memory, read-only memory (ROM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Alternatively, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC, and the ASIC may reside in a computer device. The processor and the storage medium may also reside as discrete components in a computer device.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer programs or instructions. When the computer program or instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are performed in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, a network device, a user device, or another programmable apparatus. The computer program or instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired or wireless means. The computer-readable storage medium may be any usable medium accessible by a computer, or a data storage device such as a server or data center integrating one or more usable media. The usable medium may be a magnetic medium such as a floppy disk, hard disk, or magnetic tape; an optical medium such as a digital video disc (DVD); or a semiconductor medium such as a solid state drive (SSD). While the application has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and equivalent substitutions may be made without departing from the scope of the application. Therefore, the protection scope of the application is subject to the protection scope of the claims.
Claims (13)
1. A method of data processing, the method performed by a processor, comprising:
Acquiring a first matrix set associated with a first matrix calculation instruction and a second matrix set associated with a second matrix calculation instruction, wherein the first matrix calculation instruction and the second matrix calculation instruction are each one of a plurality of matrix calculation instructions, the first matrix set comprises at least two first matrices, and the second matrix set comprises at least two second matrices;
combining the matrixes of the first matrix set and the second matrix set to obtain a combined matrix;
And executing the calculation operation of the combined matrix to respectively obtain the calculation result of the first matrix calculation instruction and the calculation result of the second matrix calculation instruction.
2. The method of claim 1, wherein combining the matrices of the first set of matrices and the second set of matrices to obtain a combined matrix comprises:
and when the matrixes of the first matrix set and the second matrix set meet a size threshold, combining the matrixes of the first matrix set and the second matrix set to obtain the combined matrix, wherein the size threshold is determined by the size of a register for storing the matrixes.
3. The method according to claim 1 or 2, wherein merging the matrices of the first set of matrices and the second set of matrices to obtain the merged matrix comprises:
and merging the matrixes of the first matrix set and the second matrix set in a merging period to obtain the merged matrix, wherein the merging period is used for indicating a time unit for merging and executing matrix calculation.
4. A method according to any one of claims 1-3, wherein the computing operation comprises at least one of matrix multiplication, matrix addition, matrix subtraction, scalar multiplication, and fused matrix multiply-add.
5. The method of claim 4, wherein the computing operation is matrix multiplication, and at least two matrices in the merged matrix are arranged along the diagonal.
6. The method of any of claims 1-5, wherein obtaining a first set of matrices associated with first matrix computing instructions and a second set of matrices associated with second matrix computing instructions comprises:
When a remote device performs non-matrix computation, a first matrix set associated with the first matrix calculation instruction and a second matrix set associated with the second matrix calculation instruction are acquired from the remote device.
7. The method of claim 6, wherein the remote device comprises any one of a processor core, a processor, and a computer device, the remote device being a device that supports matrix computation or a device that does not support matrix computation.
8. The method of claim 7, wherein obtaining the computation result of the first matrix computation instruction and the computation result of the second matrix computation instruction comprises:
And obtaining the calculation result of the first matrix calculation instruction and the calculation result of the second matrix calculation instruction from the calculation results of the combined matrices.
9. The method according to any one of claims 1-8, further comprising:
feeding back a calculation result of the first matrix calculation instruction to a remote device which sends the first matrix calculation instruction according to the identification associated with the first matrix calculation instruction;
and feeding back a calculation result of the second matrix calculation instruction to a remote device which sends the second matrix calculation instruction according to the identification associated with the second matrix calculation instruction.
10. A data processing apparatus, the apparatus comprising:
The communication module is used for acquiring a first matrix set associated with a first matrix calculation instruction and a second matrix set associated with a second matrix calculation instruction, wherein the first matrix calculation instruction and the second matrix calculation instruction are one of a plurality of matrix calculation instructions respectively, the first matrix set comprises at least two first matrixes, and the second matrix set comprises at least two second matrixes;
the merging module is used for merging the matrixes of the first matrix set and the second matrix set to obtain a merged matrix;
the merging module is further configured to perform a calculation operation of the merged matrix, and obtain a calculation result of the first matrix calculation instruction and a calculation result of the second matrix calculation instruction respectively.
11. A processor comprising a memory and at least two processor cores, the memory being configured to store a set of computer instructions; when the processor core executes the set of computer instructions, the operational steps of the method of any one of claims 1-9 are performed.
12. A computer device comprising a memory and at least one processor, the memory being configured to store a set of computer instructions; when the processor executes the set of computer instructions, the operational steps of the method of any one of claims 1-9 are performed.
13. A computer system, characterized in that it comprises a plurality of computer devices for performing the operating steps of the method according to any of the preceding claims 1-9.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310231360.1A CN118550584A (en) | 2023-02-27 | 2023-02-27 | Data processing method, device, processor, computer equipment and system |
PCT/CN2024/077603 WO2024179326A1 (en) | 2023-02-27 | 2024-02-19 | Data processing method and apparatus, processor, computer device, and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118550584A true CN118550584A (en) | 2024-08-27 |
Family
ID=92444956
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310231360.1A Pending CN118550584A (en) | 2023-02-27 | 2023-02-27 | Data processing method, device, processor, computer equipment and system |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN118550584A (en) |
WO (1) | WO2024179326A1 (en) |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10031806B2 (en) * | 2016-11-01 | 2018-07-24 | Cisco Technology, Inc. | Efficient repair of erasure coded data based on coefficient matrix decomposition |
US11086967B2 (en) * | 2017-03-01 | 2021-08-10 | Texas Instruments Incorporated | Implementing fundamental computational primitives using a matrix multiplication accelerator (MMA) |
US11409838B2 (en) * | 2019-10-29 | 2022-08-09 | Meta Platforms, Inc. | High throughput matrix processor with support for concurrently processing multiple matrices |
CN114065122A (en) * | 2020-07-31 | 2022-02-18 | 深圳市中兴微电子技术有限公司 | Data processing method, device and storage medium |
CN115374396A (en) * | 2021-05-19 | 2022-11-22 | 辉达公司 | Techniques for accelerating matrix multiplication computations using hierarchical representations of sparse matrices |
- 2023-02-27: CN CN202310231360.1A patent/CN118550584A/en active Pending
- 2024-02-19: WO PCT/CN2024/077603 patent/WO2024179326A1/en unknown
Also Published As
Publication number | Publication date |
---|---|
WO2024179326A1 (en) | 2024-09-06 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||