
CN113641956B - High-performance implementation method of 1, 2-level BLAS function library facing SW26010-Pro processor - Google Patents


Info

Publication number
CN113641956B
CN113641956B CN202110896851.9A CN202110896851A
Authority
CN
China
Prior art keywords
thread
sub
matrix
vector
case
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110896851.9A
Other languages
Chinese (zh)
Other versions
CN113641956A (en)
Inventor
胡怡 (Hu Yi)
陈道琨 (Chen Daokun)
杨超 (Yang Chao)
刘芳芳 (Liu Fangfang)
马文静 (Ma Wenjing)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Software of CAS
Original Assignee
Institute of Software of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Software of CAS filed Critical Institute of Software of CAS
Priority to CN202110896851.9A
Publication of CN113641956A
Application granted
Publication of CN113641956B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Complex Calculations (AREA)
  • Advance Control (AREA)
  • Debugging And Monitoring (AREA)
  • Multi Processors (AREA)

Abstract

The invention discloses a high-performance implementation method of a level-1 and level-2 BLAS function library for the SW26010-Pro processor, comprising the following steps: dividing the problem into several sub-problems, where the structure of the problem is a vector, a general matrix, a symmetric matrix, or a triangular matrix; if the structure is a vector, a general matrix, or a symmetric matrix, assigning the operation of each sub-problem to the corresponding thread; if it is a triangular matrix, assigning the operation of the diagonal part of each sub-problem to thread 0 and the operations of the off-diagonal parts to the other corresponding threads; and splicing the operation results of all threads to obtain the solution of the problem. The invention parallelizes the level-1 and level-2 BLAS functions, resolves data dependences among threads, and further improves function performance through an adaptive tuning mechanism.

Description

High-performance implementation method of 1, 2-level BLAS function library facing SW26010-Pro processor
Technical Field
The invention relates to the field of implementing the basic linear algebra library BLAS (Basic Linear Algebra Subprograms), and in particular to a high-performance implementation method of a level-1 and level-2 BLAS function library for the SW26010-Pro processor.
Background
BLAS is the basic linear algebra subprogram library, which mainly provides basic vector and matrix operations. It is one of the most fundamental and important mathematical libraries and is widely used in scientific computing, weather forecasting, astrophysics, and other fields. The BLAS library is at the core of much specialized software; its level-1 and level-2 functions are called repeatedly by almost all applications involving matrix operations and by dense linear algebra packages (e.g., LAPACK, ScaLAPACK). Practice in numerical matrix analysis, deep learning, and other areas shows that level-1 and level-2 BLAS functions are important for increasing application speed and fully exploiting the performance of high-performance computers.
The level-1 and level-2 BLAS functions implement vector-vector and matrix-vector operations, comprising 30 functions in total, each provided in four types: single precision, double precision, complex single precision, and complex double precision. These functions are memory-access intensive, so their performance is limited by the memory bandwidth of the system; the functions are numerous, and the matrices involved have various data layouts in memory. Reasonably partitioning the data, making full use of efficient access patterns, and improving data reuse are therefore major challenges for a high-performance implementation of the level-1 and level-2 BLAS library.
There has been considerable research at home and abroad on high-performance implementations of level-1 and level-2 BLAS functions. Li Yi et al. implemented a level-2 BLAS library for the multi-core Loongson 3A (Li Yi, He Songsong, Li Kai. Optimization of the level-2 BLAS library on the multi-core Loongson 3A [J]. Computer Systems & Applications, 2011, 20(1): 163-167). With the rapid development of GPU accelerators, optimizing level-1 and level-2 BLAS functions on GPUs has also become a research hotspot in recent years. Jian Yin et al. implemented a parallel GEMV on Nvidia GPUs using a register-blocking method (Jian Y, Hui Y, Xu W, et al. Highly parallel GEMV with register blocking method on GPU architecture [J]. Journal of Visual Communication & Image Representation, 2014, 25(7): 1566-1573). Weizhi Xu et al. implemented a performance-tuning framework for GEMV on Nvidia GPUs that selects an optimal algorithm for the input size of GEMV (W. Xu et al., "Auto-Tuning GEMV on Many-Core GPUs," 2012 IEEE 18th International Conference on Parallel and Distributed Systems, 2012, pp. 30-36, doi: 10.1109/ICPADS.2012.15).
SW26010-Pro is a many-core processor with a heterogeneous architecture. On the new-generation Sunway supercomputer based on the SW26010-Pro many-core processor, no customized high-performance level-1 and level-2 BLAS library has yet been deployed, and the performance of existing open-source mathematical libraries on this platform is low, so they cannot provide effective performance support for applications. It is therefore urgent to design and implement a high-performance level-1 and level-2 BLAS library for this many-core platform, so as to make full use of the memory bandwidth of the Sunway many-core processor and meet the pressing demand of upper-layer applications for high-performance level-1 and level-2 BLAS functions on the Sunway platform.
Disclosure of Invention
The invention provides a high-performance implementation method of a level-1 and level-2 BLAS function library for the SW26010-Pro processor, in order to meet the demand for level-1 and level-2 BLAS functions on the SW26010-Pro many-core processor and to solve the problem of the low performance of existing open-source mathematical libraries.
A high-performance implementation method of a 1, 2-level BLAS function library facing SW26010-Pro processor comprises the following steps:
1) Dividing the problem into a plurality of sub-problems, where the structure of the problem is a vector, a general matrix, a symmetric matrix, or a triangular matrix;
2) If the structure is a vector, a general matrix, or a symmetric matrix, assigning the operation of each sub-problem to the corresponding thread; if it is a triangular matrix, assigning the operation of the diagonal part of each sub-problem to thread 0 and the operations of the off-diagonal parts to the other corresponding threads;
3) Splicing the operation results of all threads to obtain the solution of the problem.
Further, the sub-problems are created by the following strategies (a sketch of the even split used here is given after this list):
1) For a vector, each vector segment is treated as a sub-problem x_i′, where i′ is the vector segment number, 0 ≤ i′ ≤ k−1, and k is the number of sub-problems;
2) For a general matrix, each row block is treated as a sub-problem A_i, where i+1 is the row-block number of the matrix and 0 ≤ i ≤ k−1;
3) For a symmetric matrix, each column block is treated as a sub-problem A_j, where j+1 is the column-block number of the matrix and 0 ≤ j ≤ k−1;
4) For a triangular matrix, each row block is treated as a sub-problem A_i.
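As a concrete illustration of the division into nearly equal contiguous blocks, the following is a minimal C sketch; the helper name and the exact splitting rule are illustrative assumptions rather than details taken from the patent, but vector segments, row blocks, and column blocks can all be derived from such a range split.

```c
#include <stddef.h>

/* Illustrative helper: split a length-n index range into k nearly equal
 * contiguous blocks and return the half-open range [start, end) of block i. */
static void split_range(size_t n, size_t k, size_t i,
                        size_t *start, size_t *end)
{
    size_t base = n / k;      /* minimum block length             */
    size_t rem  = n % k;      /* the first 'rem' blocks get one extra element */
    *start = i * base + (i < rem ? i : rem);
    *end   = *start + base + (i < rem ? 1 : 0);
}
```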
Further, when the structure of the problem is a vector, the solution of the problem is obtained by the following steps:
1) Assign each sub-problem x_i′ to the corresponding thread T_i;
2) Thread T_0 computes the solution y_0 of sub-problem x_0;
3) Using the formula y_i ← α × x_i′ + y_i, each thread T_i computes its solution y_i, where α is a first weight value;
4) Splice the solutions y_i to obtain the solution y of the problem.
Further, when the structure of the problem is a general matrix, the solution of the problem is obtained by the following steps:
1) Assign each sub-problem A_i to thread T_i, where 0 ≤ i ≤ k−1 and k is the number of sub-problems;
2) Based on the vector x′ and sub-problem A_0, thread T_0 computes the solution y_0;
3) Using the formula y_i ← α × A_i × x′ + β × y_i, each thread T_i computes its solution y_i, where α is a first weight value and β is a second weight value;
4) Splice the solutions y_i to obtain the solution y of the problem.
Further, when the structure of the problem is a symmetric matrix, the solution of the problem is obtained by the following steps:
1) Divide each sub-problem A_j into a diagonal sub-matrix D_j and lower-triangular sub-matrices L_ij, and assign sub-problem A_j to thread T_j;
2) Divide the vector x′ into several sub-vectors x′_j;
3) Fill the diagonal sub-matrix D_j using the elements of its lower triangle (exploiting symmetry);
4) Each thread T_j computes partial solutions from the diagonal sub-matrix D_j and the sub-vector x′_j, from each lower-triangular sub-matrix L_ij and the sub-vector x′_j, and from the symmetric (upper-triangular) counterpart of L_ij and the sub-vector x′_i;
5) Each thread T_j iteratively applies the formulas y_j ← D_j × x′_j + y_j, y_i ← L_ij × x′_j + y_i, and y_j ← L_ij × x′_i + y_j for the diagonal sub-matrix, the lower-triangular sub-matrices, and their symmetric parts, respectively, and the corresponding partial solutions are spliced to obtain the solution y of the problem.
Further, when the structure of the problem is a triangular matrix, the solution of the problem is obtained by the following steps:
1) Divide each sub-problem A_i into a corresponding diagonal sub-matrix D_i and off-diagonal sub-matrices L_ij, and divide the right-hand-side vector b into sub right-hand-side vectors b_i;
2) Assign threads to each diagonal sub-matrix D_i and to the off-diagonal sub-matrices L_ij;
3) For the diagonal sub-matrix, thread T_i performs the solve based on D_i; for the off-diagonal sub-matrices, y_i is obtained from the recurrence D_i × y_i = b_i − Σ_{0≤j<i} L_ij × y_j;
4) Splice the corresponding sub-solutions to obtain the solution y of the problem.
Further, for the off-diagonal sub-matrices, the solution is obtained by the following steps:
1) The general matrix-vector products L_ij × y_j are computed in parallel using loop unrolling and SIMD vectorization instructions;
2) The computed results are reduced to thread T_0;
3) Thread 0 performs the back-substitution solve based on the reduction result, the diagonal sub-matrix D_i, and the right-hand-side vector segment b_i, obtaining the sub-solution y_i.
Further, before computing L_i(i−1) × y_(i−1), the corresponding thread synchronizes with thread T_0.
Further, the inter-thread communication method is RMA point-to-point communication.
A storage medium having a computer program stored therein, wherein the computer program is arranged to perform the above method when run.
An electronic device comprising a memory and a processor, wherein the memory stores a program for performing the above-described method.
The invention has the following technical effects:
The invention parallelizes the level-1 and level-2 BLAS functions. The invention designs a thread reduction mechanism and a thread communication mechanism, resolving data dependences among threads. The invention also optimizes the computation using loop transformations and vectorization techniques. In addition, the invention designs an adaptive tuning mechanism that sets an appropriate number of threads according to the size of the input problem, further improving function performance. Compared with the single-core open-source BLAS mathematical library GotoBLAS, the high-performance level-1 and level-2 BLAS library of the invention achieves an average speedup of 22.37 and a maximum speedup of 65.47.
Drawings
FIG. 1 is a schematic diagram of the overall flow of the high-performance implementation method of a 1, 2-level BLAS function library facing the SW26010-Pro processor of the present invention;
FIG. 2 is a diagram illustrating vector segmentation and inter-core data mapping;
FIG. 3 is a schematic diagram of a generic matrix partitioning and inter-core data mapping;
FIG. 4 is a diagram illustrating symmetric matrix partitioning and inter-core data mapping;
FIG. 5 is a schematic diagram of a triangular matrix partition and inter-core data mapping;
FIG. 6 is a schematic diagram of a thread reduction mechanism, wherein (a) is a thread row reduction schematic diagram and (b) is a thread column reduction schematic diagram;
FIG. 7 is a task block diagram of a TRSV;
FIG. 8 is a schematic diagram of a thread communication mechanism;
FIG. 9 is a task segment schematic diagram of AXPY;
FIG. 10 is a task block diagram of a GEMV;
FIG. 11 is a task block diagram of a SYMV;
FIG. 12 is a graph of the performance speedup of the present invention relative to the open-source GotoBLAS.
Detailed Description
The following describes the technical solution in the embodiments of the present invention clearly and completely with reference to the accompanying drawings. It is evident that the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without creative effort fall within the scope of the invention.
The high-performance implementation method of the invention has the following features:
First, the matrix or vector is divided into several subtasks according to the size of the input problem, and each subtask is assigned to a thread.
Second, a thread reduction mechanism based on RMA communication and a thread communication mechanism based on point-to-point synchronization are provided.
Third, the computation is optimized using loop transformation and vectorization techniques.
Fourth, an adaptive tuning mechanism is provided, which sets an appropriate number of threads for each matrix or vector size.
Further, feature one includes:
For a vector, as shown in FIG. 2, the vector is divided evenly into several vector segments and each vector segment is mapped to a thread in turn; in the figure, T_0, T_1, T_2, ..., T_63 denote thread 0, thread 1, thread 2, ..., thread 63.
For a general matrix, as shown in FIG. 3, the matrix is divided into several small matrices and each row block is mapped to a thread in turn; T_0, T_1, T_2, ..., T_63 denote thread 0 through thread 63.
For a symmetric matrix, as shown in FIG. 4, the matrix is divided into several small matrices and each column block is mapped to a thread in turn; T_0, T_1, T_2, ..., T_63 denote thread 0 through thread 63.
For a triangular matrix, as shown in FIG. 5, the matrix is divided into several small matrices, the diagonal blocks are mapped to thread 0, and the off-diagonal blocks are mapped in turn to the other threads; T_0, T_1, T_2, ..., T_63 denote thread 0 through thread 63.
Further, feature two includes:
Referring to FIG. 6, any number of consecutive threads starting from thread 0 are reduced via RMA point-to-point communication: first, the threads in each row are reduced toward the first-column thread of the core group, and then the first-column threads of the core group are reduced toward thread 0; in the figure, T_0, T_1, T_2, ..., T_63 denote thread 0 through thread 63.
After thread 0 completes its current operation, it initiates a point-to-point synchronization request to one of threads 1 to 63, and the corresponding thread responds to the synchronization request before performing its own operation.
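The two-stage reduction described above can be illustrated with a small sequential C sketch. The 8 × 8 thread grid follows FIG. 6; the buffer-based simulation, the fixed segment length, and the function name are illustrative assumptions, since the actual implementation performs these transfers with RMA point-to-point communication between threads.

```c
#define GRID 8                        /* 8 x 8 grid of compute threads        */
#define NTHREADS (GRID * GRID)
#define SEG 256                       /* assumed maximum partial-result length */

/* Sequential stand-in for the two-stage reduction:
 * stage 1 reduces each row into its first-column thread,
 * stage 2 reduces the first column into thread 0.
 * partial[t] holds thread t's length-n partial result (n <= SEG);
 * the reduced sum ends up in partial[0]. */
static void reduce_to_thread0(double partial[NTHREADS][SEG], int n)
{
    /* Stage 1: row reduction; the target of row 'row' is thread row*GRID. */
    for (int row = 0; row < GRID; ++row)
        for (int col = 1; col < GRID; ++col)
            for (int e = 0; e < n; ++e)
                partial[row * GRID][e] += partial[row * GRID + col][e];

    /* Stage 2: column reduction over the first column; the target is thread 0. */
    for (int row = 1; row < GRID; ++row)
        for (int e = 0; e < n; ++e)
            partial[0][e] += partial[row * GRID][e];
}
```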
Further, feature three includes:
the computing portion of the BLAS1, 2-level functions is optimized using loop unrolling and SIMD vectorization instructions.
Further, feature four includes:
The level-1 BLAS functions are classified into four types according to the vector size: small, medium, large, and ultra-large. For these four types, 8, 16, 32, and 64 threads are started, respectively. The small-scale vector range is preferably [1024, 4096], the medium-scale range is preferably (4096, 32768), the large-scale range is preferably (32768, 262144), and the ultra-large-scale range is preferably (262144, +∞).
The level-2 BLAS functions are classified into two types according to the matrix size: small and large. For these two types, 16 and 64 threads are started, respectively. The small-scale matrix range is preferably [128 × 128, 2048 × 2048], and the large-scale range is preferably (2048, +∞).
Taking the triangular system solve (TRSV) involving a lower triangular coefficient matrix as an example, it solves the equation A × x = b, where A is a lower triangular matrix, x is the unknown vector to be solved, and b is the right-hand-side vector. The specific implementation steps are as follows:
Step one: the number of threads the function needs to start is determined according to the size of the matrix A.
Step two: as shown in FIG. 7, the invention divides the matrix A, the vector x, and the vector b into tasks by rows and treats each row block as one sub-problem, giving k sub-problems in total. The invention further divides each sub-problem, producing a diagonal sub-matrix D_i (0 ≤ i ≤ k−1) and several off-diagonal sub-matrices L_ij (0 ≤ j < i), which correspond to the unknown vector segment x_i to be solved, its solution y_i, and the right-hand-side vector segment b_i. Each sub-problem performs the following operation: solve D_i × y_i = b_i − Σ_{0≤j<i} L_ij × y_j.
Step three: the invention traverses the sub-problems in turn; the operation of the diagonal part of each sub-problem is assigned to thread 0, and the operations of the off-diagonal parts are assigned in turn to the other threads by thread number. Suppose the current sub-problem is sub-problem i (0 < i ≤ k−1). The threads responsible for the off-diagonal parts perform the general matrix-vector products L_ij × y_j (0 ≤ j < i) in parallel using loop unrolling and SIMD vectorization instructions, and the results are reduced to thread 0; thread 0 then performs the back-substitution solve based on the reduction result, the diagonal sub-matrix D_i, and the right-hand-side vector segment b_i, obtains y_i, and writes y_i back to main memory. For example, if the current sub-problem is sub-problem 3, threads 1, 2, and 3, which are responsible for the off-diagonal parts, perform the general matrix-vector products in parallel: L_30 × y_0, L_31 × y_1, L_32 × y_2. The results are reduced to thread 0, and thread 0 performs the back-substitution solve on the reduction result to obtain y_3 and writes y_3 back to main memory.
As shown in FIG. 8, the thread responsible for the sub-matrix L_i(i−1) must synchronize with thread 0 before its computation and wait for thread 0 to write y_(i−1) back to main memory. The size of L_ij is 128 × 128; L_ij × y_j is implemented with a two-level loop, the outer loop is unrolled 8 times to increase the number of multiply-add operations in a single iteration, and the floating-point vector multiply-add instruction provided by the SW26010-Pro many-core processor hardware is used to accelerate the multiply-add computation.
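A plain-C sketch of the unrolled block matrix-vector product is given below; it unrolls the outer (row) loop by 8 as described, while the SW26010-Pro floating-point vector multiply-add instructions are not shown and the scalar multiply-adds stand in for them. The function name is illustrative.

```c
#define BS 128   /* block size of the off-diagonal sub-matrices */

/* y += L * x for one 128 x 128 block, outer (row) loop unrolled by 8.
 * Plain-C stand-in for the vectorized kernel described above. */
static void block_gemv_unroll8(const double L[BS][BS],
                               const double x[BS], double y[BS])
{
    for (int i = 0; i < BS; i += 8) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        double s4 = 0, s5 = 0, s6 = 0, s7 = 0;
        for (int j = 0; j < BS; ++j) {
            double xj = x[j];
            s0 += L[i + 0][j] * xj;  s1 += L[i + 1][j] * xj;
            s2 += L[i + 2][j] * xj;  s3 += L[i + 3][j] * xj;
            s4 += L[i + 4][j] * xj;  s5 += L[i + 5][j] * xj;
            s6 += L[i + 6][j] * xj;  s7 += L[i + 7][j] * xj;
        }
        y[i + 0] += s0;  y[i + 1] += s1;  y[i + 2] += s2;  y[i + 3] += s3;
        y[i + 4] += s4;  y[i + 5] += s5;  y[i + 6] += s6;  y[i + 7] += s7;
    }
}
```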
Step four: the solution y of the vector x is output.
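For reference, the following is a sequential C sketch of the blocked triangular solve described in steps two and three, with all blocks handled by a single thread; in the parallel version the off-diagonal products are distributed over the threads and reduced to thread 0. The row-major storage, the block-size argument, and the helper name are assumptions made for illustration.

```c
#include <stdlib.h>

/* Sequential sketch of the blocked lower-triangular solve A*y = b.
 * A is an n x n row-major lower triangular matrix, bs is the block size. */
static void trsv_blocked(int n, int bs, const double *A, const double *b,
                         double *y)
{
    for (int ib = 0; ib < n; ib += bs) {
        int ih = (ib + bs < n) ? bs : n - ib;   /* rows in this block */

        /* t = b_i - sum_{j<i} L_ij * y_j  (the part other threads compute) */
        double *t = malloc((size_t)ih * sizeof *t);
        for (int r = 0; r < ih; ++r) {
            t[r] = b[ib + r];
            for (int c = 0; c < ib; ++c)
                t[r] -= A[(size_t)(ib + r) * n + c] * y[c];
        }

        /* Solve D_i * y_i = t by substitution (thread 0's task in the text). */
        for (int r = 0; r < ih; ++r) {
            double s = t[r];
            for (int c = 0; c < r; ++c)
                s -= A[(size_t)(ib + r) * n + (ib + c)] * y[ib + c];
            y[ib + r] = s / A[(size_t)(ib + r) * n + (ib + r)];
        }
        free(t);
    }
}
```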
Taking scalar-vector multiplication (AXPY) as an example, its computation has the form y = α × x + y, where x and y are vectors and α is a scalar. The specific implementation steps are as follows:
Step one: the number of threads the function needs to start is determined according to the size of the vector.
Step two: as shown in FIG. 9, the invention divides the vector x and the unknown vector y to be solved into tasks and treats each vector segment as one sub-problem, giving k sub-problems in total. Each sub-problem performs the operation y_i ← α × x_i + y_i.
Step three: the invention traverses the sub-problems in turn and assigns sub-problem i to thread i. Suppose the current sub-problem is sub-problem i (0 ≤ i ≤ 63); thread i computes α × x_i + y_i to obtain y_i and writes y_i back to main memory.
Step four: the vector y is output.
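A per-thread C sketch of the AXPY kernel is shown below; the segment bounds follow the even split used for task division, and the function name and argument layout are illustrative assumptions.

```c
/* Per-thread AXPY sketch: thread 'tid' of 'nthreads' updates its own
 * contiguous segment, y_i <- alpha * x_i + y_i. */
static void axpy_segment(long n, double alpha, const double *x, double *y,
                         int tid, int nthreads)
{
    long base = n / nthreads, rem = n % nthreads;
    long lo = tid * base + (tid < rem ? tid : rem);   /* segment start */
    long hi = lo + base + (tid < rem ? 1 : 0);        /* segment end   */
    for (long i = lo; i < hi; ++i)
        y[i] = alpha * x[i] + y[i];
}
```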
Taking general matrix-vector multiplication (GEMV) as an example, its computation has the form y = α × A × x + β × y, where A is a general matrix, x and y are vectors, and α and β are scalars. The specific implementation steps are as follows:
Step one: the number of threads the function needs to start is determined according to the size of the matrix A.
Step two: as shown in FIG. 10, the invention divides the matrix A and the unknown vector y to be solved into tasks by rows and treats each row block as one sub-problem, giving k sub-problems in total. Each sub-problem performs the operation y_i ← α × A_i × x + β × y_i.
Step three: the invention traverses the sub-problems in turn and assigns sub-problem i to thread i. Suppose the current sub-problem is sub-problem i (0 ≤ i ≤ 63); thread i computes α × A_i × x + β × y_i to obtain y_i and writes y_i back to main memory.
Step four: the vector y is output.
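A per-thread C sketch of the GEMV kernel follows; the row-major layout, function name, and argument order are illustrative assumptions.

```c
/* Per-thread GEMV sketch: thread 'tid' computes y_i <- alpha*A_i*x + beta*y_i
 * on its own row block of the m x k row-major matrix A. */
static void gemv_rowblock(long m, long k, double alpha, const double *A,
                          const double *x, double beta, double *y,
                          int tid, int nthreads)
{
    long base = m / nthreads, rem = m % nthreads;
    long lo = tid * base + (tid < rem ? tid : rem);   /* first row of block */
    long hi = lo + base + (tid < rem ? 1 : 0);        /* one past last row  */
    for (long r = lo; r < hi; ++r) {
        double s = 0.0;
        for (long c = 0; c < k; ++c)
            s += A[r * k + c] * x[c];
        y[r] = alpha * s + beta * y[r];
    }
}
```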
Taking symmetric matrix-vector multiplication (SYMV) involving a lower triangular matrix as an example, its computation has the form y = α × A × x + β × y, where A is a symmetric matrix stored in its lower triangle, x and y are vectors, and α and β are scalars. The specific implementation steps are as follows:
Step one: the number of threads the function needs to start is determined according to the size of the matrix A.
Step two: as shown in FIG. 11, the invention divides the matrix A into tasks by columns and treats each column block as one sub-problem, giving k sub-problems in total. The invention further divides each sub-problem, producing a diagonal sub-matrix D_j (0 ≤ j ≤ k−1) and several lower-triangular sub-matrices L_ij (i ≥ j). Each sub-problem completes the following operations: for the diagonal sub-matrix, the upper triangle of D_j is filled with the elements of its lower triangle and y_j ← D_j × x_j + y_j is computed; for the lower-triangular sub-matrices, y_i ← L_ij × x_j + y_i is computed; for the symmetric part of the lower-triangular sub-matrices, y_j ← L_ij × x_i + y_j is computed.
Step three: the invention traverses the sub-problems in turn and assigns sub-problem j to thread j. Suppose the current sub-problem is sub-problem j (0 ≤ j ≤ k−1); thread j fills the upper triangle of D_j with the elements of its lower triangle and computes D_j × x_j + y_j to obtain y_j, writing y_j back to main memory; computes L_ij × x_j + y_i to obtain y_i, writing y_i back to main memory; and computes L_ij × x_i + y_j to obtain y_j, writing y_j back to main memory.
Step four: the vector y is output.
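A sequential C sketch of SYMV using only the stored lower triangle follows, mirroring the three per-block updates above (diagonal part, lower-triangular part, and its symmetric counterpart); the parallel version assigns one column block to each thread. Row-major storage and the function name are assumptions made for illustration.

```c
/* Sequential SYMV sketch: y <- alpha*A*x + beta*y for a symmetric n x n
 * matrix A of which only the lower triangle (including the diagonal) is read. */
static void symv_lower(long n, double alpha, const double *A,
                       const double *x, double beta, double *y)
{
    /* Diagonal contribution, applied together with the beta scaling. */
    for (long i = 0; i < n; ++i)
        y[i] = beta * y[i] + alpha * A[i * n + i] * x[i];

    /* Strictly lower-triangular entries and their symmetric counterparts. */
    for (long i = 0; i < n; ++i)
        for (long j = 0; j < i; ++j) {
            double a = A[i * n + j];   /* stored lower-triangle entry     */
            y[i] += alpha * a * x[j];  /* y_i += L_ij * x_j               */
            y[j] += alpha * a * x[i];  /* symmetric (upper-triangle) part */
        }
}
```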
In this embodiment, the GotoBLAS mathematical library is used to verify the performance acceleration achieved by the invention. The problem sizes selected in the embodiment ensure that the function performance of both versions reaches its respective optimum, and the selected precision is real double precision. FIG. 12 shows the performance speedup of the invention over the open-source GotoBLAS; as can be seen from the figure, the average speedup of the invention relative to GotoBLAS is 22.37 and the maximum speedup is 65.47.
Porting the content of the invention to other platforms with simple modifications, making no creative improvement to the task-division and thread-reduction mechanisms of the invention, or merely optimizing the computation stage on the basis of the invention does not in essence depart from the content covered by the invention and still falls within the scope of protection of the invention.
Parts of the invention not described in detail are known to those skilled in the art.
The above embodiments are merely illustrative of specific examples of the present invention and are not intended to limit the scope of the present invention, and various modifications and improvements made by those skilled in the art to the technical solution of the present invention should fall within the protection scope defined by the claims of the present invention without departing from the design spirit of the present invention.

Claims (3)

1. A high-performance implementation method of a 1, 2-level BLAS function library facing SW26010-Pro processor, comprising the following steps:
1) Dividing the problem into a plurality of sub-problems, wherein the structure of the problem comprises a vector, a general matrix, a symmetric matrix, or a triangular matrix; wherein,
in the case where the structure of the problem is a vector, each vector segment is treated as a sub-problem x_i, where 0 ≤ i ≤ k−1 and k is the number of sub-problems;
in the case where the structure of the problem is a general matrix, each row block is treated as a sub-problem A_i;
in the case where the structure of the problem is a symmetric matrix, each column block is treated as a sub-problem A_j, where j+1 is the column-block number of the matrix and 0 ≤ j ≤ k−1;
in the case where the structure of the problem is a triangular matrix, each row block is treated as a sub-problem A_i;
2) Assigning the operation of each sub-problem to a corresponding thread to obtain the operation result of that thread; wherein,
in the case where the structure of the problem is a vector, assigning the operation of each sub-problem to the corresponding thread to obtain the operation result of the thread comprises:
assigning sub-problem x_i to the corresponding thread T_i, so that thread T_i performs the computation and obtains the solution y_i = α × x_i + y_i corresponding to thread T_i, where α represents a scalar;
in the case where the structure of the problem is a general matrix, assigning the operation of each sub-problem to a corresponding thread to obtain the operation result of the thread comprises:
assigning sub-problem A_i to thread T_i, so that thread T_i performs the computation and obtains the solution y_i = α × A_i × x + β × y_i corresponding to thread T_i, where β represents a scalar and x represents a vector;
in the case where the structure of the problem is a triangular matrix, assigning the operation of each sub-problem to a corresponding thread to obtain the operation result of the thread comprises:
dividing each sub-problem A_i to obtain a diagonal sub-matrix D_i and off-diagonal sub-matrices L_ij, 0 ≤ j < i;
assigning the operation of the diagonal sub-matrix D_i to thread 0, and assigning the operations of the off-diagonal sub-matrices L_ij in turn to the other threads by thread number;
the other threads performing the general matrix-vector products L_ij × y_j in parallel using loop unrolling and SIMD vectorization instructions, and reducing the computed results to thread 0;
thread 0 performing the back-substitution solve based on the reduction result, the diagonal sub-matrix D_i, and the right-hand-side vector segment b_i, obtaining the solution y_i corresponding to thread 0;
in the case where the structure of the problem is a symmetric matrix, assigning the operation of each sub-problem to a corresponding thread to obtain the operation result of the thread comprises:
dividing each sub-problem A_j to obtain a diagonal sub-matrix D_j and off-diagonal sub-matrices L_ij, i ≥ j;
assigning sub-problem A_j to the corresponding thread T_j;
thread T_j filling the upper triangle of D_j with the elements of its lower triangle and computing y_j ← D_j × x_j + y_j; for the lower-triangular sub-matrices, computing y_i ← L_ij × x_j + y_i; and for the symmetric part of the lower-triangular sub-matrices, computing y_j ← L_ij × x_i + y_j;
3) Splicing the operation results of all threads to obtain the solution of the problem.
2. A storage medium having a computer program stored therein, wherein the computer program is arranged to perform the method of claim 1 when run.
3. An electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the method of claim 1.
CN202110896851.9A 2021-08-05 2021-08-05 High-performance implementation method of 1, 2-level BLAS function library facing SW26010-Pro processor Active CN113641956B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110896851.9A CN113641956B (en) 2021-08-05 2021-08-05 High-performance implementation method of 1, 2-level BLAS function library facing SW26010-Pro processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110896851.9A CN113641956B (en) 2021-08-05 2021-08-05 High-performance implementation method of 1, 2-level BLAS function library facing SW26010-Pro processor

Publications (2)

Publication Number Publication Date
CN113641956A CN113641956A (en) 2021-11-12
CN113641956B true CN113641956B (en) 2023-05-30

Family

ID=78419683

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110896851.9A Active CN113641956B (en) 2021-08-05 2021-08-05 High-performance implementation method of 1, 2-level BLAS function library facing SW26010-Pro processor

Country Status (1)

Country Link
CN (1) CN113641956B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102508704A (en) * 2011-11-10 2012-06-20 上海市共进通信技术有限公司 Method for implementing task decomposition and parallel processing in computer software system
CN103440121A (en) * 2013-08-20 2013-12-11 中国人民解放军国防科学技术大学 Triangular matrix multiplication vectorization method of vector processor
CN103514629A (en) * 2012-06-22 2014-01-15 密执安大学评议会 Method and apparatus for iterative reconstruction
CN103959233A (en) * 2011-09-15 2014-07-30 埃克森美孚上游研究公司 Optimized matrix and vector operations in instruction limited algorithms that perform eos calculations
CN104484234A (en) * 2014-11-21 2015-04-01 中国电力科学研究院 Multi-front load flow calculation method and system based on GPU (graphics processing unit)
CN105808309A (en) * 2016-03-08 2016-07-27 中国科学院软件研究所 High-performance realization method of BLAS (Basic Linear Algebra Subprograms) three-level function GEMM on the basis of SW platform
CN106650925A (en) * 2016-11-29 2017-05-10 郑州云海信息技术有限公司 Deep learning framework Caffe system and algorithm based on MIC cluster
CN107168683A (en) * 2017-05-05 2017-09-15 中国科学院软件研究所 GEMM dense matrix multiply high-performance implementation method on the domestic many-core CPU of Shen prestige 26010
CN107590106A (en) * 2017-08-08 2018-01-16 北京中科睿芯科技有限公司 A kind of computational methods for being applied to symmetrical matrix and vector multiplication
CN110968345A (en) * 2018-09-29 2020-04-07 英特尔公司 Architecture and method for data parallel Single Program Multiple Data (SPMD) execution
CN112380003A (en) * 2020-09-18 2021-02-19 北京大学 High-performance parallel implementation device for K-NN on GPU processor

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210248115A1 (en) * 2020-02-10 2021-08-12 Nvidia Corporation Compute graph optimization
US20210294673A1 (en) * 2020-03-19 2021-09-23 Nvidia Corporation Techniques for orchestrating stages of thread synchronization

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103959233A (en) * 2011-09-15 2014-07-30 埃克森美孚上游研究公司 Optimized matrix and vector operations in instruction limited algorithms that perform eos calculations
CN102508704A (en) * 2011-11-10 2012-06-20 上海市共进通信技术有限公司 Method for implementing task decomposition and parallel processing in computer software system
CN103514629A (en) * 2012-06-22 2014-01-15 密执安大学评议会 Method and apparatus for iterative reconstruction
CN103440121A (en) * 2013-08-20 2013-12-11 中国人民解放军国防科学技术大学 Triangular matrix multiplication vectorization method of vector processor
CN104484234A (en) * 2014-11-21 2015-04-01 中国电力科学研究院 Multi-front load flow calculation method and system based on GPU (graphics processing unit)
CN105808309A (en) * 2016-03-08 2016-07-27 中国科学院软件研究所 High-performance realization method of BLAS (Basic Linear Algebra Subprograms) three-level function GEMM on the basis of SW platform
CN106650925A (en) * 2016-11-29 2017-05-10 郑州云海信息技术有限公司 Deep learning framework Caffe system and algorithm based on MIC cluster
CN107168683A (en) * 2017-05-05 2017-09-15 中国科学院软件研究所 GEMM dense matrix multiply high-performance implementation method on the domestic many-core CPU of Shen prestige 26010
CN107590106A (en) * 2017-08-08 2018-01-16 北京中科睿芯科技有限公司 A kind of computational methods for being applied to symmetrical matrix and vector multiplication
CN110968345A (en) * 2018-09-29 2020-04-07 英特尔公司 Architecture and method for data parallel Single Program Multiple Data (SPMD) execution
CN112380003A (en) * 2020-09-18 2021-02-19 北京大学 High-performance parallel implementation device for K-NN on GPU processor

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Towards highly efficient DGEMM on the emerging SW26010 many-core processor; Lijuan Jiang et al.; ICPP 2017; 422-431 *
Research on optimization of level-1 and level-2 BLAS functions based on the Sunway many-core processor (基于申威众核处理器的1、2级BLAS函数优化研究); Sun Jiadong et al.; Computer Systems & Applications; 101-108 *
Design of BLAS Level-3 operations on a matrix-multiplication coprocessor (矩阵乘协处理器上BLAS Level-3运算的设计); Jia Xun et al.; Computer Engineering & Science; 1913-1921 *
Implementation and optimization of SpMV for the domestic Sunway 26010 many-core processor (面向国产申威26010众核处理器的SpMV实现与优化); Liu Fangfang et al.; Journal of Software; 3921-3932 *

Also Published As

Publication number Publication date
CN113641956A (en) 2021-11-12

Similar Documents

Publication Publication Date Title
KR20230153972A (en) Accessing data in multi-dimensional tensors
Martínez-del-Amor et al. Population Dynamics P systems on CUDA
Clarke et al. Fupermod: A framework for optimal data partitioning for parallel scientific applications on dedicated heterogeneous hpc platforms
Plazolles et al. SIMD monte-carlo numerical simulations accelerated on GPU and xeon phi
Maris et al. Accelerating an iterative eigensolver for nuclear structure configuration interaction calculations on GPUs using OpenACC
Hatcher et al. A feasibility study for the solution of transient stability problems by multiprocessor structures
Oiso et al. Implementing genetic algorithms to CUDA environment using data parallelization
Yamaguchi et al. GPU implementation of a sophisticated implicit low-order finite element solver with FP21-32-64 computation using OpenACC
CN109753682B (en) Finite element stiffness matrix simulation method based on GPU (graphics processing Unit) end
Firoz et al. On the feasibility of using reduced-precision tensor core operations for graph analytics
CN113641956B (en) High-performance implementation method of 1, 2-level BLAS function library facing SW26010-Pro processor
Dong et al. Accelerating the SVD bi-diagonalization of a batch of small matrices using GPUs
Marker et al. Code generation and optimization of distributed-memory dense linear algebra kernels
US9600446B2 (en) Parallel multicolor incomplete LU factorization preconditioning processor and method of use thereof
Iványi CUDA accelerated implementation of parallel dynamic relaxation
US20180349321A1 (en) Parallel processing apparatus, parallel operation method, and parallel operation program
Myllykoski et al. On solving separable block tridiagonal linear systems using a GPU implementation of radix-4 PSCR method
Küchlin PARSAC-2: Parallel computer algebra on the desk-top
Wang et al. Fine-grained heterogeneous parallel direct solver for finite element problems
Zhang et al. Accelerating lattice QCD on sunway many-core processor
Doroshenko et al. Large-Scale Loops Parallelization for GPU Accelerators.
Luo et al. Gpu port of a parallel incompressible navier-stokes solver based on openacc and mvapich2
Han et al. Towards efficient tile low-rank GEMM computation on sunway many-core processors
Siklósi et al. Bitwise Reproducible task execution on unstructured mesh applications
Fialko Time history analysis of buildings and structures design models in SCAD software on multicore computers

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant