CN113641956B - High-performance implementation method of a level-1 and level-2 BLAS function library for the SW26010-Pro processor - Google Patents
High-performance implementation method of a level-1 and level-2 BLAS function library for the SW26010-Pro processor
- Publication number
- CN113641956B (application CN202110896851.9A)
- Authority
- CN
- China
- Prior art keywords
- thread
- sub
- matrix
- vector
- case
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a high-performance implementation method of level-1 and level-2 BLAS function libraries for the SW26010-Pro processor, comprising the following steps: dividing the problem into several sub-problems, wherein the structure of the problem is a vector, an ordinary matrix, a symmetric matrix, or a triangular matrix; if the structure is a vector, an ordinary matrix, or a symmetric matrix, assigning the operation of each sub-problem to the corresponding thread; if it is a triangular matrix, assigning the operations on the diagonal part of each sub-problem to thread 0 and the operations on the off-diagonal parts to the other corresponding threads; and splicing the operation results of all threads to obtain the solution of the problem. The invention parallelizes the BLAS level-1 and level-2 functions, resolves the data dependences between threads, and further improves function performance through an adaptive tuning mechanism.
Description
Technical Field
The invention relates to the field of implementation of the Basic Linear Algebra Subprograms (BLAS) library, and in particular to a high-performance implementation method of level-1 and level-2 BLAS function libraries for the SW26010-Pro processor.
Background
BLAS (Basic Linear Algebra Subprograms) is a library of basic linear algebra routines covering fundamental vector and matrix operations. It is one of the most fundamental and important mathematical libraries and is widely used in scientific computing, weather forecasting, astrophysics, and other fields. The BLAS library is at the core of much specialized software; level-1 and level-2 BLAS functions are called repeatedly by almost all applications that involve matrix operations and by dense linear algebra packages (e.g., LAPACK, ScaLAPACK). Practice in numerical matrix analysis, deep learning, and other areas shows that level-1 and level-2 BLAS functions are important for raising application speed and fully exploiting the performance of high-performance computers.
The level-1 and level-2 BLAS functions implement vector-vector and matrix-vector operations, comprising 30 functions in total, each available in four types: single precision, double precision, complex single precision, and complex double precision. These functions are memory-access intensive, so their performance is bounded by the memory bandwidth of the system; the functions are numerous, and the matrices they involve are laid out in memory in a variety of ways. Partitioning the data sensibly, making full use of efficient access patterns, and improving data reuse are therefore major challenges for a high-performance implementation of the level-1 and level-2 BLAS library.
There has been considerable research, both in China and abroad, on high-performance implementations of level-1 and level-2 BLAS functions. Li Yi et al. implemented a level-2 BLAS library for the multi-core Loongson 3A (Li Yi, He Songsong, Li Kai. Optimization of the level-2 BLAS library on the multi-core Loongson 3A [J]. Computer Systems & Applications, 2011, 20(1): 163-167). With the rapid development of GPU accelerators, optimizing level-1 and level-2 BLAS functions on GPUs has also become a research hotspot in recent years: Jian Yin et al. implemented a parallel GEMV on Nvidia GPUs using a register-blocking method (Jian Y, Hui Y, Xu W, et al. Highly parallel GEMV with register blocking method on GPU architecture [J]. Journal of Visual Communication & Image Representation, 2014, 25(7): 1566-1573), and Weizhi Xu et al. built a performance-tuning framework for GEMV on Nvidia GPUs that selects the best algorithm for a given input size (W. Xu et al., "Auto-Tuning GEMV on Many-Core GPUs," 2012 IEEE 18th International Conference on Parallel and Distributed Systems, 2012, pp. 30-36, doi: 10.1109/ICPADS.2012.15).
SW26010-Pro is a many-core processor with a heterogeneous architecture. On the new-generation Sunway supercomputer built from SW26010-Pro many-core processors, no customized high-performance level-1 and level-2 BLAS library has been deployed so far, and the existing open-source mathematical libraries perform poorly on this platform, so they cannot provide effective performance support for applications. It is therefore urgent to design and implement a high-performance level-1 and level-2 BLAS library for this many-core platform, so as to make full use of the memory bandwidth of the Sunway many-core processor and to meet the pressing demand of upper-layer applications for high-performance level-1 and level-2 BLAS functions on the Sunway many-core platform.
Disclosure of Invention
The invention provides a high-performance implementation method of level-1 and level-2 BLAS function libraries for the SW26010-Pro processor, in order to meet the demand for level-1 and level-2 BLAS functions on the SW26010-Pro many-core processor and to overcome the low performance of existing open-source mathematical libraries.
A high-performance implementation method of a level-1 and level-2 BLAS function library for the SW26010-Pro processor comprises the following steps:
1) Dividing the problem into several sub-problems, wherein the structure of the problem is a vector, an ordinary matrix, a symmetric matrix, or a triangular matrix;
2) If the structure is a vector, an ordinary matrix, or a symmetric matrix, assigning the operation of each sub-problem to the corresponding thread; if it is a triangular matrix, assigning the operations on the diagonal part of each sub-problem to thread 0 and the operations on the off-diagonal parts to the other corresponding threads;
3) Splicing the operation results of all threads to obtain the solution of the problem.
Further, sub-problems are created by the following strategies (a sketch of this partitioning appears after the list):
1) For a vector, each vector segment is treated as a sub-problem x_i', where i' is the segment number, 0 ≤ i' ≤ k-1, and k is the number of sub-problems;
2) For an ordinary matrix, each row block is treated as a sub-problem A_i, where i+1 is the row-block number of the matrix and 0 ≤ i ≤ k-1;
3) For a symmetric matrix, each column block is treated as a sub-problem A_j, where j+1 is the column-block number of the matrix and 0 ≤ j ≤ k-1;
4) For a triangular matrix, each row block is treated as a sub-problem A_i.
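As a concrete illustration of the task-division strategies above, the following C sketch (not taken from the patent; the function and type names are illustrative) computes the block boundaries that would be handed to the threads:

```c
#include <stdio.h>

typedef struct { long start; long len; } block_t;

/* Divide n elements (or rows/columns) as evenly as possible into k blocks. */
static void divide_task(long n, int k, block_t *blocks)
{
    long base = n / k, rem = n % k, offset = 0;
    for (int i = 0; i < k; ++i) {
        blocks[i].start = offset;
        blocks[i].len   = base + (i < rem ? 1 : 0);  /* first 'rem' blocks get one extra element */
        offset += blocks[i].len;
    }
}

int main(void)
{
    block_t blocks[64];
    divide_task(1000000, 64, blocks);   /* e.g. one sub-problem per thread */
    printf("block 0: start=%ld len=%ld\n", blocks[0].start, blocks[0].len);
    return 0;
}
```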
Further, when the structure of the problem is a vector, the solution of the problem is obtained through the following steps:
1) Assign sub-problem x_i' to the corresponding thread T_i;
2) Thread T_0 computes the solution y_0 of sub-problem x_0;
3) Using the formula y_i ← α × x_i' + y_i, each thread T_i computes its solution y_i, where α is a first weight value;
4) Splice the solutions y_i to obtain the solution y of the problem.
Further, when the structure of the problem is an ordinary matrix, the solution of the problem is obtained through the following steps:
1) Assign sub-problem A_i to thread T_i, where 0 ≤ i ≤ k-1 and k is the number of sub-problems;
2) Based on the vector x' and sub-problem A_0, thread T_0 computes the solution y_0;
3) Using the formula y_i ← α × A_i × x' + β × y_i, each thread T_i computes its solution y_i, where α is a first weight value and β is a second weight value;
4) Splice the solutions y_i to obtain the solution y of the problem.
Further, when the structure of the problem is a symmetric matrix, the solution of the problem is obtained through the following steps:
1) Divide each sub-problem A_j into a diagonal sub-matrix D_j and lower-triangular sub-matrices L_ij, and assign sub-problem A_j to thread T_j;
2) Divide the vector x' into several sub-vectors x'_j;
3) Fill each diagonal sub-matrix D_j with the elements of its lower-triangular part;
4) Each thread T_j computes its partial solution from the diagonal sub-matrix D_j and the sub-vector x'_j, or from the corresponding lower-triangular sub-matrix L_ij of the upper-triangular part and the sub-vector x'_j; each thread T_j also computes the corresponding partial solution from the lower-triangular sub-matrix L_(j+1)j and the sub-vector x'_j;
5) For the diagonal sub-matrices, the lower-triangular sub-matrices, and the symmetric counterparts of the lower-triangular sub-matrices, each thread T_j iterates with the formulas y_j ← D_j × x'_j + y_j, y_i ← L_ij × x'_j + y_i, and y_j ← L_ij × x'_i + y_j respectively, and the corresponding sub-solutions are spliced to obtain the solution y of the problem.
Further, when the structure of the problem is a triangular matrix, the solution of the problem is obtained through the following steps:
1) Divide each sub-problem A_i into a diagonal sub-matrix D_i and off-diagonal sub-matrices L_ij, and divide the right-hand-side vector b into sub-vectors b_i;
2) Assign threads to each diagonal sub-matrix D_i and each off-diagonal sub-matrix L_ij;
3) For a diagonal sub-matrix, thread T_i solves based on D_i; for the off-diagonal sub-matrices, solve using the formula y_i ← D_i × (b_i − Σ_{0≤j<i} L_ij × y_j);
4) Splice the corresponding sub-solutions to obtain the solution y of the problem.
Further, for the off-diagonal sub-matrices, the solution proceeds as follows:
1) Perform the ordinary matrix-vector products L_ij × y_j in parallel, using loop unrolling and SIMD vectorization instructions;
2) Reduce the computation results to thread T_0;
3) Thread 0 performs back substitution based on the reduction result, the diagonal sub-matrix D_i, and the right-hand-side segment b_i to obtain the sub-solution y_i.
Further, before computing L_i(i-1) × y_(i-1), the corresponding thread synchronizes with thread T_0.
Further, the inter-thread communication method is RMA point-to-point communication.
A storage medium having a computer program stored therein, wherein the computer program is arranged to perform the above method when run.
An electronic device comprising a memory and a processor, wherein the memory stores a program for performing the above-described method.
The invention has the following technical effects:
The invention parallelizes the BLAS level-1 and level-2 functions. It designs a thread reduction mechanism and a thread communication mechanism that resolve the data dependences between threads, and it optimizes the computation using loop transformation and vectorization techniques. In addition, the invention designs an adaptive tuning mechanism that sets a suitable number of threads according to the size of the input problem, further improving function performance. Compared with the single-core open-source BLAS mathematical library GotoBLAS, the high-performance level-1 and level-2 BLAS library of the invention achieves an average speedup of 22.37 and a maximum speedup of 65.47.
Drawings
FIG. 1 is a schematic diagram of the overall flow of the high-performance implementation method of the level-1 and level-2 BLAS library for the SW26010-Pro processor;
FIG. 2 is a diagram illustrating vector segmentation and inter-core data mapping;
FIG. 3 is a schematic diagram of ordinary matrix partitioning and inter-core data mapping;
FIG. 4 is a diagram illustrating symmetric matrix partitioning and inter-core data mapping;
FIG. 5 is a schematic diagram of triangular matrix partitioning and inter-core data mapping;
FIG. 6 is a schematic diagram of the thread reduction mechanism, where (a) shows row reduction and (b) shows column reduction;
FIG. 7 is a task partitioning diagram of TRSV;
FIG. 8 is a schematic diagram of the thread communication mechanism;
FIG. 9 is a task partitioning diagram of AXPY;
FIG. 10 is a task partitioning diagram of GEMV;
FIG. 11 is a task partitioning diagram of SYMV;
FIG. 12 is a graph of the speedup of the present invention over the open-source GotoBLAS.
Detailed Description
The following describes the technical solution in the embodiments of the present invention clearly and completely with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without creative effort fall within the scope of protection of the invention.
The high-performance implementation method of the invention has the following features:
First, the matrix or vector is divided into several subtasks according to the size of the input problem, and each subtask is assigned to a thread.
Second, a thread reduction mechanism based on RMA communication and a thread communication mechanism based on point-to-point synchronization are provided.
Third, the computation is optimized using loop transformation and vectorization techniques.
Fourth, an adaptive tuning mechanism is provided that sets a suitable number of threads for each matrix or vector size.
Further, feature one includes:
For a vector, as shown in FIG. 2, the vector is divided evenly into several vector segments and each segment is mapped to a thread in turn; T_0, T_1, T_2, ..., T_63 in the figure denote thread 0, thread 1, thread 2, ..., thread 63.
For an ordinary matrix, as shown in FIG. 3, the matrix is divided into several small matrices and each row block is mapped to a thread in turn; T_0, T_1, T_2, ..., T_63 denote thread 0 through thread 63.
For a symmetric matrix, as shown in FIG. 4, the matrix is divided into several small matrices and each column block is mapped to a thread in turn; T_0, T_1, T_2, ..., T_63 denote thread 0 through thread 63.
For a triangular matrix, as shown in FIG. 5, the matrix is divided into several small matrices; the diagonal blocks are mapped to thread 0, and each column block (except the diagonal blocks) is mapped to one of the other threads; T_0, T_1, T_2, ..., T_63 denote thread 0 through thread 63.
Further, feature two includes:
As shown in FIG. 6, any number of consecutive threads starting from thread 0 can be reduced through RMA point-to-point communication: first, the threads in each row are reduced towards the first column of the core group, and then the first column of threads is reduced towards thread 0; T_0, T_1, T_2, ..., T_63 in the figure denote thread 0 through thread 63.
After thread 0 completes its current operation, it initiates a point-to-point synchronization request to one of threads 1 to 63; the corresponding thread responds to the synchronization request before performing its own operation. A sketch of the two-stage reduction follows.
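The following C sketch illustrates the two-stage reduction of FIG. 6 on an 8 × 8 thread grid. The RMA and point-to-point synchronization primitives (rma_put_partial, notify_peer, wait_peer) are hypothetical placeholders standing in for the actual runtime calls, which are not named here:

```c
/* Hypothetical placeholders for the runtime's RMA put and point-to-point
 * synchronization; these are NOT real API names. */
void rma_put_partial(const double *buf, int n, int dest_thread);
void notify_peer(int peer);
void wait_peer(int peer);

#define GRID 8   /* the 64 threads are viewed as an 8 x 8 grid */

/* Two-stage reduction of per-thread partial sums to thread 0 (cf. FIG. 6):
 * stage 1 reduces each row into its column-0 thread, stage 2 reduces the
 * column-0 threads into thread 0.  'recv' is the buffer that incoming RMA
 * puts land in, one slot per sender. */
void reduce_to_thread0(int my_id, double *partial, const double *recv, int n)
{
    int row = my_id / GRID, col = my_id % GRID;

    if (col != 0) {                               /* stage 1: send to column 0 of my row */
        rma_put_partial(partial, n, row * GRID);
        notify_peer(row * GRID);
        return;
    }
    for (int c = 1; c < GRID; ++c) {              /* stage 1: column-0 thread accumulates its row */
        wait_peer(row * GRID + c);
        for (int t = 0; t < n; ++t) partial[t] += recv[c * n + t];
    }
    if (row != 0) {                               /* stage 2: send the row sum to thread 0 */
        rma_put_partial(partial, n, 0);
        notify_peer(0);
    } else {
        for (int r = 1; r < GRID; ++r) {          /* stage 2: thread 0 accumulates the column */
            wait_peer(r * GRID);
            for (int t = 0; t < n; ++t) partial[t] += recv[r * n + t];
        }
    }
}
```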
Further, feature three includes:
The computation parts of the level-1 and level-2 BLAS functions are optimized using loop unrolling and SIMD vectorization instructions, as sketched below.
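A minimal sketch of this optimization, assuming an 8-way unrolled inner loop written in plain C; on the target hardware each group of multiply-adds would be mapped onto the processor's vector fused multiply-add instructions (this mapping is an assumption, and the exact intrinsics are not shown):

```c
/* One row of a block matrix-vector product, unrolled by 8 to expose
 * independent multiply-adds; the remainder loop handles n % 8 elements. */
static void matvec_row_unrolled(const double *restrict a_row,
                                const double *restrict x,
                                double *restrict y_elem, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    double s4 = 0.0, s5 = 0.0, s6 = 0.0, s7 = 0.0;
    int j = 0;
    for (; j + 8 <= n; j += 8) {
        s0 += a_row[j + 0] * x[j + 0];
        s1 += a_row[j + 1] * x[j + 1];
        s2 += a_row[j + 2] * x[j + 2];
        s3 += a_row[j + 3] * x[j + 3];
        s4 += a_row[j + 4] * x[j + 4];
        s5 += a_row[j + 5] * x[j + 5];
        s6 += a_row[j + 6] * x[j + 6];
        s7 += a_row[j + 7] * x[j + 7];
    }
    for (; j < n; ++j) s0 += a_row[j] * x[j];    /* remainder loop */
    *y_elem += ((s0 + s1) + (s2 + s3)) + ((s4 + s5) + (s6 + s7));
}
```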
Further, feature four includes:
The level-1 BLAS functions are divided into four classes by vector size: small, medium, large, and ultra-large. For these four classes, 8, 16, 32, and 64 threads are started respectively. The small-scale vector range is preferably [1024, 4096], the medium-scale range (4096, 32768], the large-scale range (32768, 262144], and the ultra-large-scale range (262144, +∞).
The level-2 BLAS functions are divided into two classes by matrix size: small and large. For these two classes, 16 and 64 threads are started respectively. The small-scale matrix range is preferably [128 × 128, 2048 × 2048] and the large-scale range (2048, +∞). A sketch of this selection rule follows.
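The selection rule can be summarized by a small dispatch function such as the following sketch (the function names are illustrative; the thresholds are the ones listed above):

```c
/* Adaptive tuning: choose the number of threads from the problem size. */
static int select_thread_count_blas1(long n)
{
    if (n <= 4096)   return 8;    /* small scale      */
    if (n <= 32768)  return 16;   /* medium scale     */
    if (n <= 262144) return 32;   /* large scale      */
    return 64;                    /* ultra-large scale */
}

static int select_thread_count_blas2(long dim)
{
    return (dim <= 2048) ? 16 : 64;   /* small vs. large matrices */
}
```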
Taking the solution of a system of equations with a triangular coefficient matrix (TRSV) as an example, the function solves the equation A × x = b, where A is a lower triangular matrix, x is the unknown vector to be solved, and b is the right-hand-side vector. The specific implementation steps are:
Step one: determine the number of threads the function needs to start according to the size of matrix A.
Step two: as shown in FIG. 7, matrix A, vector x, and vector b are divided into tasks by rows, and each row block is treated as one sub-problem, giving k sub-problems in total. Each sub-problem is divided further into a diagonal sub-matrix D_i (0 ≤ i ≤ k-1) and several off-diagonal sub-matrices L_ij (0 ≤ j < i), which correspond to the unknown vector segment x_i to be solved, its solution y_i, and the right-hand-side segment b_i. Each sub-problem performs the operation y_i ← D_i × (b_i − Σ_{0≤j<i} L_ij × y_j).
Step three: the sub-problems are traversed in turn; the operations on the diagonal part of each sub-problem are assigned to thread 0, and the operations on the off-diagonal parts are assigned to the other threads in turn by thread number. Suppose sub-problem i (0 < i ≤ k-1) is currently being processed: the threads responsible for the off-diagonal parts perform the ordinary matrix-vector products L_ij × y_j (0 ≤ j < i) in parallel using loop unrolling and SIMD vectorization instructions, and the results are reduced to thread 0; thread 0 then performs back substitution based on the reduction result, the diagonal sub-matrix D_i, and the right-hand-side segment b_i to obtain y_i, and writes y_i back to main memory. For example, when sub-problem 3 is being processed, thread 1, thread 2, and thread 3, which are responsible for the off-diagonal parts, perform the ordinary matrix-vector products L_30 × y_0, L_31 × y_1, and L_32 × y_2 in parallel; the results are reduced to thread 0, which performs back substitution on the reduction result to obtain y_3 and writes y_3 back to main memory.
As shown in FIG. 8, the thread responsible for sub-matrix L_i(i-1) must synchronize with thread 0 before its computation, waiting for thread 0 to write y_(i-1) back to main memory. L_ij has size 128 × 128, and L_ij × y_j is implemented with a two-level loop; the invention unrolls the outer loop 8 times to increase the number of multiply-add operations per iteration, and during the computation the multiply-adds are accelerated with the floating-point vector multiply-add instructions provided by the SW26010-Pro many-core processor hardware.
Step four: output the solution y of the vector x.
Taking scalar-vector multiplication (AXPY) as an example, its computation has the form y = α × x + y, where x and y are vectors and α is a scalar. The specific implementation steps are:
Step one: determine the number of threads the function needs to start according to the size of the vector.
Step two: as shown in FIG. 9, the vector x and the output vector y are divided into tasks, and each vector segment is treated as one sub-problem, giving k sub-problems in total. Each sub-problem performs the operation y_i ← α × x_i + y_i.
Step three: the sub-problems are traversed in turn, and sub-problem i is assigned to thread i. Suppose sub-problem i (0 ≤ i ≤ 63) is currently being processed: thread i performs the computation α × x_i + y_i to obtain y_i, and writes y_i back to main memory.
Step four: output the vector y.
Taking general matrix-vector multiplication (GEMV) as an example, its computation has the form y = α × A × x + β × y, where A is an ordinary matrix, x and y are vectors, and α and β are scalars. The specific implementation steps are:
Step one: determine the number of threads the function needs to start according to the size of matrix A.
Step two: as shown in FIG. 10, matrix A and the output vector y are divided into tasks by rows, and each row block is treated as one sub-problem, giving k sub-problems in total. Each sub-problem performs the operation y_i ← α × A_i × x + β × y_i.
Step three: the sub-problems are traversed in turn, and sub-problem i is assigned to thread i. Suppose sub-problem i (0 ≤ i ≤ 63) is currently being processed: thread i performs the computation α × A_i × x + β × y_i to obtain y_i, and writes y_i back to main memory.
Step four: output the vector y.
Taking symmetric matrix-vector multiplication (SYMV) with the matrix stored as its lower triangle as an example, its computation has the form y = α × A × x + β × y, where A is a symmetric matrix stored in lower-triangular form, x and y are vectors, and α and β are scalars. The specific implementation steps are:
Step one: determine the number of threads the function needs to start according to the size of matrix A.
Step two: as shown in FIG. 11, matrix A is divided into tasks by columns, and each column block is treated as one sub-problem, giving k sub-problems in total. Each sub-problem is divided further into a diagonal sub-matrix D_j (0 ≤ j ≤ k-1) and several lower-triangular sub-matrices L_ij (i ≥ j). Each sub-problem completes the following operations: for the diagonal sub-matrix, fill D_j with the elements of its lower triangle and compute y_j ← D_j × x_j + y_j; for each lower-triangular sub-matrix, compute y_i ← L_ij × x_j + y_i; for the symmetric counterpart of each lower-triangular sub-matrix, compute y_j ← L_ij × x_i + y_j.
Step three: the sub-problems are traversed in turn, and sub-problem j is assigned to thread j. Suppose sub-problem j (0 ≤ j ≤ k-1) is currently being processed: thread j fills D_j with the elements of its lower triangle and computes D_j × x_j + y_j to obtain y_j, which is written back to main memory; it computes L_ij × x_j + y_i to obtain y_i, which is written back to main memory; and it computes L_ij × x_i + y_j to obtain y_j, which is written back to main memory.
Step four: output the vector y.
In this embodiment, the GotoBLAS mathematical library is used to verify the performance acceleration achieved by the invention. The problem sizes chosen in this embodiment ensure that the functions of both versions reach their respective best performance, and the chosen precision is real double precision. FIG. 12 shows the speedup of the invention over the open-source GotoBLAS; as can be seen from the figure, the average speedup relative to GotoBLAS is 22.37 and the maximum speedup is 65.47.
Porting the content of the invention to other platforms after simple modification, using the task division and thread reduction mechanisms of the invention without creative improvement, or merely optimizing the computation stage on the basis of the invention does not substantially depart from what the invention covers and still falls within its scope of protection.
Parts of the invention not described in detail are known to those skilled in the art.
The above embodiments are merely specific examples of the present invention and are not intended to limit its scope; various modifications and improvements made by those skilled in the art to the technical solution of the present invention without departing from its design spirit shall fall within the protection scope defined by the claims of the present invention.
Claims (3)
1. A high-performance implementation method of a level-1 and level-2 BLAS function library for the SW26010-Pro processor, comprising the following steps:
1) Dividing the problem into several sub-problems, wherein the structure of the problem is a vector, an ordinary matrix, a symmetric matrix, or a triangular matrix; wherein
in the case where the structure of the problem is a vector, each vector segment is treated as a sub-problem x_i, where 0 ≤ i ≤ k-1 and k is the number of sub-problems;
in the case where the structure of the problem is an ordinary matrix, each row block is treated as a sub-problem A_i;
in the case where the structure of the problem is a symmetric matrix, each column block is treated as a sub-problem A_j, where j+1 is the column-block number of the matrix and 0 ≤ j ≤ k-1;
in the case where the structure of the problem is a triangular matrix, each row block is treated as a sub-problem A_i;
2) Assigning the operation of each sub-problem to the corresponding thread to obtain the operation result of that thread; wherein
in the case where the structure of the problem is a vector, assigning the operation of each sub-problem to the corresponding thread to obtain the operation result of the thread comprises:
assigning sub-problem x_i to the corresponding thread T_i, so that thread T_i performs the computation and obtains its solution y_i = α × x_i + y_i, where α represents a scalar;
in the case where the structure of the problem is an ordinary matrix, assigning the operation of each sub-problem to the corresponding thread to obtain the operation result of the thread comprises:
assigning sub-problem A_i to thread T_i, so that thread T_i performs the computation and obtains its solution y_i = α × A_i × x + β × y_i, where β represents a scalar and x represents a vector;
in the case where the structure of the problem is a triangular matrix, assigning the operation of each sub-problem to the corresponding thread to obtain the operation result of the thread comprises:
dividing each sub-problem A_i into a diagonal sub-matrix D_i and off-diagonal sub-matrices L_ij, 0 ≤ j < i;
assigning the operation on the diagonal sub-matrix D_i to thread 0, and assigning the operations on the off-diagonal sub-matrices L_ij to the other threads in turn;
the other threads performing the ordinary matrix-vector products L_ij × y_j in parallel using loop unrolling and SIMD vectorization instructions, and reducing the computation results to thread 0;
thread 0 performing back substitution based on the reduction result, the diagonal sub-matrix D_i, and the right-hand-side segment b_i to obtain the solution y_i corresponding to thread 0;
in the case where the structure of the problem is a symmetric matrix, assigning the operation of each sub-problem to the corresponding thread to obtain the operation result of the thread comprises:
dividing each sub-problem A_j into a diagonal sub-matrix D_j and off-diagonal sub-matrices L_ij, i ≥ j;
assigning sub-problem A_j to the corresponding thread T_j;
thread T_j filling D_j with the elements of its lower triangle and computing y_j ← D_j × x_j + y_j; for the lower-triangular sub-matrices computing y_i ← L_ij × x_j + y_i; and for the symmetric counterparts of the lower-triangular sub-matrices computing y_j ← L_ij × x_i + y_j;
3) Splicing the operation results of the threads to obtain the solution of the problem.
2. A storage medium having a computer program stored therein, wherein the computer program is arranged to perform the method of claim 1 when run.
3. An electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the method of claim 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110896851.9A CN113641956B (en) | 2021-08-05 | 2021-08-05 | High-performance implementation method of 1, 2-level BLAS function library facing SW26010-Pro processor |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110896851.9A CN113641956B (en) | 2021-08-05 | 2021-08-05 | High-performance implementation method of 1, 2-level BLAS function library facing SW26010-Pro processor |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113641956A CN113641956A (en) | 2021-11-12 |
CN113641956B (en) | 2023-05-30
Family
ID=78419683
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110896851.9A Active CN113641956B (en) | 2021-08-05 | 2021-08-05 | High-performance implementation method of 1, 2-level BLAS function library facing SW26010-Pro processor |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113641956B (en) |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102508704A (en) * | 2011-11-10 | 2012-06-20 | 上海市共进通信技术有限公司 | Method for implementing task decomposition and parallel processing in computer software system |
CN103440121A (en) * | 2013-08-20 | 2013-12-11 | 中国人民解放军国防科学技术大学 | Triangular matrix multiplication vectorization method of vector processor |
CN103514629A (en) * | 2012-06-22 | 2014-01-15 | 密执安大学评议会 | Method and apparatus for iterative reconstruction |
CN103959233A (en) * | 2011-09-15 | 2014-07-30 | 埃克森美孚上游研究公司 | Optimized matrix and vector operations in instruction limited algorithms that perform eos calculations |
CN104484234A (en) * | 2014-11-21 | 2015-04-01 | 中国电力科学研究院 | Multi-front load flow calculation method and system based on GPU (graphics processing unit) |
CN105808309A (en) * | 2016-03-08 | 2016-07-27 | 中国科学院软件研究所 | High-performance realization method of BLAS (Basic Linear Algebra Subprograms) three-level function GEMM on the basis of SW platform |
CN106650925A (en) * | 2016-11-29 | 2017-05-10 | 郑州云海信息技术有限公司 | Deep learning framework Caffe system and algorithm based on MIC cluster |
CN107168683A (en) * | 2017-05-05 | 2017-09-15 | 中国科学院软件研究所 | GEMM dense matrix multiply high-performance implementation method on the domestic many-core CPU of Shen prestige 26010 |
CN107590106A (en) * | 2017-08-08 | 2018-01-16 | 北京中科睿芯科技有限公司 | A kind of computational methods for being applied to symmetrical matrix and vector multiplication |
CN110968345A (en) * | 2018-09-29 | 2020-04-07 | 英特尔公司 | Architecture and method for data parallel Single Program Multiple Data (SPMD) execution |
CN112380003A (en) * | 2020-09-18 | 2021-02-19 | 北京大学 | High-performance parallel implementation device for K-NN on GPU processor |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210248115A1 (en) * | 2020-02-10 | 2021-08-12 | Nvidia Corporation | Compute graph optimization |
US20210294673A1 (en) * | 2020-03-19 | 2021-09-23 | Nvidia Corporation | Techniques for orchestrating stages of thread synchronization |
- 2021-08-05: application CN202110896851.9A filed in China; granted as CN113641956B (status: Active)
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103959233A (en) * | 2011-09-15 | 2014-07-30 | 埃克森美孚上游研究公司 | Optimized matrix and vector operations in instruction limited algorithms that perform eos calculations |
CN102508704A (en) * | 2011-11-10 | 2012-06-20 | 上海市共进通信技术有限公司 | Method for implementing task decomposition and parallel processing in computer software system |
CN103514629A (en) * | 2012-06-22 | 2014-01-15 | 密执安大学评议会 | Method and apparatus for iterative reconstruction |
CN103440121A (en) * | 2013-08-20 | 2013-12-11 | 中国人民解放军国防科学技术大学 | Triangular matrix multiplication vectorization method of vector processor |
CN104484234A (en) * | 2014-11-21 | 2015-04-01 | 中国电力科学研究院 | Multi-front load flow calculation method and system based on GPU (graphics processing unit) |
CN105808309A (en) * | 2016-03-08 | 2016-07-27 | 中国科学院软件研究所 | High-performance realization method of BLAS (Basic Linear Algebra Subprograms) three-level function GEMM on the basis of SW platform |
CN106650925A (en) * | 2016-11-29 | 2017-05-10 | 郑州云海信息技术有限公司 | Deep learning framework Caffe system and algorithm based on MIC cluster |
CN107168683A (en) * | 2017-05-05 | 2017-09-15 | 中国科学院软件研究所 | GEMM dense matrix multiply high-performance implementation method on the domestic many-core CPU of Shen prestige 26010 |
CN107590106A (en) * | 2017-08-08 | 2018-01-16 | 北京中科睿芯科技有限公司 | A kind of computational methods for being applied to symmetrical matrix and vector multiplication |
CN110968345A (en) * | 2018-09-29 | 2020-04-07 | 英特尔公司 | Architecture and method for data parallel Single Program Multiple Data (SPMD) execution |
CN112380003A (en) * | 2020-09-18 | 2021-02-19 | 北京大学 | High-performance parallel implementation device for K-NN on GPU processor |
Non-Patent Citations (4)
Title |
---|
Towards highly efficient DGEMM on the emerging SW26010 many-core processor; Lijuan Jiang et al.; ICPP 2017; 422-431 *
Research on optimization of level-1 and level-2 BLAS functions on the Sunway many-core processor; Sun Jiadong et al.; Computer Systems & Applications; 101-108 *
Design of BLAS Level-3 operations on a matrix-multiplication coprocessor; Jia Xun et al.; Computer Engineering & Science; 1913-1921 *
Implementation and optimization of SpMV for the domestic Sunway 26010 many-core processor; Liu Fangfang et al.; Journal of Software; 3921-3932 *
Also Published As
Publication number | Publication date |
---|---|
CN113641956A (en) | 2021-11-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR20230153972A (en) | Accessing data in multi-dimensional tensors | |
Martínez-del-Amor et al. | Population Dynamics P systems on CUDA | |
Clarke et al. | Fupermod: A framework for optimal data partitioning for parallel scientific applications on dedicated heterogeneous hpc platforms | |
Plazolles et al. | SIMD monte-carlo numerical simulations accelerated on GPU and xeon phi | |
Maris et al. | Accelerating an iterative eigensolver for nuclear structure configuration interaction calculations on GPUs using OpenACC | |
Hatcher et al. | A feasibility study for the solution of transient stability problems by multiprocessor structures | |
Oiso et al. | Implementing genetic algorithms to CUDA environment using data parallelization | |
Yamaguchi et al. | GPU implementation of a sophisticated implicit low-order finite element solver with FP21-32-64 computation using OpenACC | |
CN109753682B (en) | Finite element stiffness matrix simulation method based on GPU (graphics processing Unit) end | |
Firoz et al. | On the feasibility of using reduced-precision tensor core operations for graph analytics | |
CN113641956B (en) | High-performance implementation method of 1, 2-level BLAS function library facing SW26010-Pro processor | |
Dong et al. | Accelerating the SVD bi-diagonalization of a batch of small matrices using GPUs | |
Marker et al. | Code generation and optimization of distributed-memory dense linear algebra kernels | |
US9600446B2 (en) | Parallel multicolor incomplete LU factorization preconditioning processor and method of use thereof | |
Iványi | CUDA accelerated implementation of parallel dynamic relaxation | |
US20180349321A1 (en) | Parallel processing apparatus, parallel operation method, and parallel operation program | |
Myllykoski et al. | On solving separable block tridiagonal linear systems using a GPU implementation of radix-4 PSCR method | |
Küchlin | PARSAC-2: Parallel computer algebra on the desk-top | |
Wang et al. | Fine-grained heterogeneous parallel direct solver for finite element problems | |
Zhang et al. | Accelerating lattice QCD on sunway many-core processor | |
Doroshenko et al. | Large-Scale Loops Parallelization for GPU Accelerators. | |
Luo et al. | Gpu port of a parallel incompressible navier-stokes solver based on openacc and mvapich2 | |
Han et al. | Towards efficient tile low-rank GEMM computation on sunway many-core processors | |
Siklósi et al. | Bitwise Reproducible task execution on unstructured mesh applications | |
Fialko | Time history analysis of buildings and structures design models in SCAD software on multicore computers |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |