
CN118246438B - Fault-tolerant computing method, device, equipment, medium and computer program product - Google Patents

Fault-tolerant computing method, device, equipment, medium and computer program product Download PDF

Info

Publication number
CN118246438B
Authority
CN
China
Prior art keywords
matrix
data
check
calculation
check value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410683294.6A
Other languages
Chinese (zh)
Other versions
CN118246438A (en)
Inventor
李逍
赵雅倩
史宏志
张亚强
高飞
陈筱琳
许光远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Metabrain Intelligent Technology Co Ltd
Original Assignee
Suzhou Metabrain Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Metabrain Intelligent Technology Co Ltd filed Critical Suzhou Metabrain Intelligent Technology Co Ltd
Priority to CN202410683294.6A priority Critical patent/CN118246438B/en
Publication of CN118246438A publication Critical patent/CN118246438A/en
Application granted granted Critical
Publication of CN118246438B publication Critical patent/CN118246438B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/232 - Orthographic correction, e.g. spell checking or vowelisation
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G06F 40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/30 - Semantic analysis
    • G06F 40/35 - Discourse or dialogue representation
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 - Computing arrangements using knowledge-based models
    • G06N 5/02 - Knowledge representation; Symbolic representation
    • G06N 5/022 - Knowledge engineering; Knowledge acquisition
    • G06N 5/04 - Inference or reasoning models
    • G06N 5/041 - Abduction
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Hardware Redundancy (AREA)

Abstract

The invention relates to the technical field of artificial intelligence, and in particular to a fault-tolerant computing method, device, equipment, medium and computer program product. Prompt words in input problem information are converted into word vectors and input into a target language model for iterative computation. In the propagation computation, when the weight matrix is multiplied with the input word vector, the weight matrix is divided into a plurality of data blocks by rows, a plurality of redundancy vectors are obtained by computing column element sums over the data blocks to form a redundancy matrix, and error detection and error correction are performed on the data calculation result obtained by multiplying the weight matrix with the input word vector, using the first check values obtained by multiplying the redundancy matrix with the input word vector. After the propagation computation of each layer of the target language model is completed, the generation result corresponding to the problem information in the current iterative computation is obtained, thereby ensuring the accuracy of the language model's generation, or the accuracy of the answer information generated when an artificial intelligence question-answering task is executed.

Description

Fault-tolerant computing method, device, equipment, medium and computer program product
Technical Field
The present invention relates to the field of artificial intelligence technology, and in particular, to a fault tolerance computing method, apparatus, device, medium and computer program product.
Background
With the development of Artificial Intelligence (AI) technology, the artificial intelligence models adopted are growing ever larger in scale, so the model computation process becomes increasingly complex, the area of the required Processing Elements (PE) keeps increasing, and the failure occurrence rate of individual hardware is amplified, ultimately causing errors in the results generated by the model. For example, errors in the model computation of a Large Language Model (LLM) may result in generated response information that is incomplete or discontinuous, or even far from the answer required by the question information, producing irrelevant results.
How to improve the accuracy of artificial intelligence model computation is a technical problem that needs to be solved by those skilled in the art.
Disclosure of Invention
The invention aims to provide a fault-tolerant computing method, device, equipment, medium and computer program product for improving the accuracy of artificial intelligence model computation.
In order to solve the technical problems, the invention provides a fault-tolerant computing method applied to artificial intelligence question answering, which comprises the following steps:
Receiving input problem information, wherein the problem information comprises prompt words;
Converting the prompt word into a word vector, and inputting the word vector into a target language model for iterative computation;
In the propagation calculation of the current iteration calculation of the target language model, after a redundant matrix is generated according to a weight matrix of a current layer, carrying out general matrix multiplication calculation on the weight matrix, the redundant matrix and an input word vector of the current layer respectively to correspondingly obtain a data calculation result and a first check value of the propagation calculation of the current layer, generating a second check value from the data calculation result, checking the data calculation result of the current layer according to a consistency comparison result of the second check value and the corresponding first check value, entering the propagation calculation of the next layer if the data calculation result of the current layer passes the check, otherwise, carrying out error correction treatment on the data calculation result of the current layer;
After the propagation calculation of each layer of the target language model is completed, a generation result corresponding to the problem information of the current iterative calculation is obtained;
Wherein generating the redundancy matrix according to the weight matrix includes: dividing the weight matrix into a plurality of data blocks according to rows, generating a plurality of different redundant vectors according to the data blocks, and forming the redundant vectors into the redundant matrix; one redundancy vector corresponds to one or more data blocks, and redundancy values in the redundancy vector are column element sums of corresponding columns in the corresponding data blocks; the second check value corresponds to a method of calculating the redundancy value.
In one aspect, dividing the weight matrix into a plurality of data blocks according to rows, generating a plurality of different redundancy vectors according to the data blocks, and forming each redundancy vector into the redundancy matrix, including:
uniformly dividing the weight matrix by rows into $2^p$ data blocks;
calculating, for a single data block, the sums of the elements in each column as the redundancy values of the corresponding columns, so as to sequentially obtain the redundancy vectors corresponding to single data blocks; calculating, for groups of data blocks taken two at a time without repetition, the column sums and the redundancy values of the corresponding columns, so as to sequentially obtain the corresponding redundancy vectors; and so on, until the redundancy vectors corresponding to all the data blocks are obtained;
Where p is a blocking factor and p is a positive integer.
In another aspect, the redundancy matrix construction process includes constructing a redundancy matrix of $2^{p+1}-1$ rows and $d$ columns:
For $j = 1, 2, \ldots, d$:
when $q$ is odd, $C_{q,j} = \sum_{i=(k-1)h+1}^{\min(kh,\,n)} W_{i,j}$, wherein $q = 2k-1$ and $k = 1, 2, \ldots, 2^p$;
when $q$ is even, $C_{q,j} = C_{q-2^{l-1},\,j} + C_{q+2^{l-1},\,j}$, wherein $l$ satisfies $2^l \mid q$ and $2^{l+1} \nmid q$;
wherein the weight matrix comprises $n$ rows and $d$ columns; $k$ is the sequence number of the data block, $k = 1, 2, \ldots, 2^p$; the $k$-th data block corresponds to rows $(k-1)h+1$ to $kh$ of the weight matrix (if the row number $n$ of the last row of the last data block is less than $kh$, the block ends at the $n$-th row); $h$ is the number of rows of a data block; $C_{q,j}$, $C_{q-2^{l-1},\,j}$ and $C_{q+2^{l-1},\,j}$ are the elements of rows $q$, $q-2^{l-1}$ and $q+2^{l-1}$, column $j$, of the redundancy matrix; $W_{i,j}$ is the element of the $i$-th row and $j$-th column of the weight matrix; $l$ is an exponential parameter; $2^l \mid q$ denotes that $2^l$ exactly divides $q$; $p$ is the blocking factor; and $h = \lceil n/2^p \rceil$, where $\lceil\cdot\rceil$ is a round-up operation.
On the other hand, the general matrix multiplication calculation is performed on the weight matrix, the redundancy matrix and the input word vector of the current layer respectively, so as to correspondingly obtain a data calculation result of propagation calculation of the current layer and a first check value, which comprises the following steps:
splicing the redundant matrix below the weight matrix to obtain a check matrix;
performing general matrix multiplication calculation on the check matrix and the input word vector to obtain a check vector, wherein the first $n$ components of the check vector are the data calculation results corresponding to the rows of the weight matrix, and the last $2^{p+1}-1$ components of the check vector are the first check values corresponding to the rows of the redundancy matrix.
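As an illustration of the construction and splicing described in the preceding aspects, the following sketch (not part of the patent text; names such as build_redundancy_matrix, W, x and p are assumptions for illustration) builds the redundancy matrix from the row blocks of a weight matrix, splices it below the weight matrix, and obtains the data calculation results and first check values from a single general matrix multiplication with the input word vector:

```python
# Illustrative sketch (assumed names), not the patented implementation.
import numpy as np

def build_redundancy_matrix(W: np.ndarray, p: int) -> np.ndarray:
    n, d = W.shape
    h = -(-n // 2**p)                      # rows per data block (round up)
    C = np.zeros((2**(p + 1) - 1, d))
    for k in range(1, 2**p + 1):           # odd rows: column sums of single blocks
        C[2*k - 2] = W[(k - 1)*h : min(k*h, n)].sum(axis=0)
    for l in range(1, p + 1):              # even rows: sum of the two branch rows
        for q in range(2**l, 2**(p + 1) - 1, 2**(l + 1)):
            C[q - 1] = C[q - 2**(l - 1) - 1] + C[q + 2**(l - 1) - 1]
    return C

def checked_gemv(W: np.ndarray, x: np.ndarray, p: int):
    C = build_redundancy_matrix(W, p)
    W_check = np.vstack([W, C])            # splice the redundancy matrix below W
    y = W_check @ x                        # one general matrix multiplication
    n = W.shape[0]
    return y[:n], y[n:]                    # data calculation results, first check values
```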
In another aspect, the data calculation result fails verification, including:
The second check value obtained by calculation according to the data calculation result is inconsistent with the corresponding first check value;
The second check value being inconsistent with the corresponding first check value is represented by the following formula:
$$Y_{n+q} \neq \sum_{i=\left(\frac{q-2^{m}}{2}\right)h+1}^{\min\left(\left(\frac{q+2^{m}}{2}\right)h,\; n\right)} Y_i,$$
wherein $m$ satisfies $2^m \mid q$ and $2^{m+1} \nmid q$;
wherein $Y_{n+q}$ is the $(n+q)$-th component of the check vector, i.e., the first check value of row $q$ of the redundancy matrix; $Y_i$ is the $i$-th component of the check vector; $h$ is the number of rows of one data block; $n$ is the number of rows of the weight matrix; $m$ is an exponential parameter; $2^m \mid q$, $2^{m+1} \nmid q$ denote that $2^m$ exactly divides $q$; and $p$ is the blocking factor.
On the other hand, generating the second check value from the data calculation result includes recursively calculating the second check values corresponding to the data blocks using the following equations:
when $q$ is odd,
$$s_q = \sum_{i=(k-1)h+1}^{\min(kh,\,n)} Y_i, \quad q = 2k-1, \; k = 1, 2, \ldots, 2^p;$$
when $q$ is even,
$$s_q = s_{q-2^{l-1}} + s_{q+2^{l-1}},$$
wherein $l$ satisfies $2^l \mid q$ and $2^{l+1} \nmid q$;
wherein $s_q$ denotes the second check value calculated from the data calculation results corresponding to the first check value of row $q$ of the redundancy matrix, and $s_{q-2^{l-1}}$ and $s_{q+2^{l-1}}$ are the second check values corresponding to the first check values of rows $q-2^{l-1}$ and $q+2^{l-1}$, respectively; $Y_i$ is the $i$-th component of the check vector (a data calculation result); $h$ is the number of rows of one data block; $n$ is the number of rows of the weight matrix; $l$ is an exponential parameter; $2^l \mid q$ denotes that $2^l$ exactly divides $q$; and $p$ is the blocking factor.
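For illustration only (assumed names), the second check values can be generated from the data calculation results with the same recursion used for the redundancy values: odd-numbered values sum one block of results, even-numbered values add their two branches.

```python
# A minimal sketch with assumed names; not the patented implementation.
import numpy as np

def second_check_values(y_data: np.ndarray, n: int, p: int) -> np.ndarray:
    h = -(-n // 2**p)                            # rows per data block
    s = np.zeros(2**(p + 1) - 1)
    for k in range(1, 2**p + 1):                 # q = 2k - 1 (odd): sum one block
        s[2*k - 2] = y_data[(k - 1)*h : min(k*h, n)].sum()
    for l in range(1, p + 1):                    # q even, 2**l exactly divides q
        for q in range(2**l, 2**(p + 1) - 1, 2**(l + 1)):
            s[q - 1] = s[q - 2**(l - 1) - 1] + s[q + 2**(l - 1) - 1]
    return s                                     # compared against the first check values
```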
In another aspect, generating the second check value from the data calculation result includes:
and calculating the second check value corresponding to each first check value in parallel according to the calculation method of the redundancy value.
On the other hand, verifying the data calculation result of the current layer according to the consistency comparison result of the second verification value and the corresponding first verification value, including:
if the second check values corresponding to all the data blocks are consistent with the first check values corresponding to all the data blocks, determining that the first check values corresponding to all the data blocks pass the check, and further determining that the data calculation results of the current layer pass the check;
If the second check values corresponding to all the data blocks are inconsistent with the first check values corresponding to the data blocks, performing step-by-step check according to the first check values;
In the step-by-step verification process, if the current first verification value passes the verification, the branch index of the first verification value is not verified; if the current first check value does not pass the check and the first check value has the branch index, checking the branch index of the first check value; if the current first check value does not pass the check and the first check value does not have the branch index, determining that the data block corresponding to the first check value is an error block; if the current first check value does not pass the check but all branch indexes of the first check value pass the check, setting the current first check value as passing the check;
Determining that the data calculation result corresponding to the error block is an error result, and determining that the data calculation result corresponding to the first check value passing the check is a correct result;
wherein the branch indexes of the first check value with sequence number $q$ are the first check values with sequence numbers $q - 2^{l-1}$ and $q + 2^{l-1}$, where $q$ is the current sequence number of the first check value, $l$ is an exponential parameter, and $l$ satisfies $2^l \mid q$ and $2^{l+1} \nmid q$, i.e., $2^l$ exactly divides $q$ (a first check value with an odd sequence number therefore has no branch indexes).
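The step-by-step check can be illustrated with the following sketch (assumed names; exact comparison replaced by a tolerance), which descends along the branch indexes only where a mismatch is observed and reports the data blocks judged to be error blocks:

```python
# Illustrative sketch (assumed names), not the patented implementation.
import numpy as np

def locate_error_blocks(first_cv: np.ndarray, second_cv: np.ndarray,
                        p: int, tol: float = 1e-6) -> list[int]:
    """Return the 1-based sequence numbers k of data blocks judged to be error blocks."""
    error_blocks = []

    def check(q: int, l: int) -> bool:            # q: 1-based row number, 2**l divides q exactly
        ok = abs(first_cv[q - 1] - second_cv[q - 1]) <= tol
        if ok:
            return True                           # branches of a passing value are not checked
        if l == 0:                                # odd q: no branch indexes -> error block
            error_blocks.append((q + 1) // 2)
            return False
        left = check(q - 2**(l - 1), l - 1)       # branch indexes q -/+ 2**(l-1)
        right = check(q + 2**(l - 1), l - 1)
        return left and right                     # a failing value whose branches all pass
                                                  # is treated as passing
    check(2**p, p)                                # root row covers all data blocks
    return error_blocks
```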
On the other hand, the blocking factor is determined according to the computing resources of the computing device executing the propagation computation of the current layer and the number of rows of the weight matrix, so that the sum of the number of rows of the weight matrix and the number of rows of the redundancy matrix is smaller than or equal to the maximum computing resources provided by the computing device for the propagation computation of the current layer.
On the other hand, performing error correction processing on the data calculation result of the current layer includes:
determining that the corresponding data block is an error block when the second check value is inconsistent with the corresponding first check value;
generating a recalculation matrix from each of the error blocks;
Performing general matrix multiplication calculation on the recalculation matrix and the input word vector of the current layer to obtain a recalculation result;
And carrying out error correction processing on the data calculation result of the propagation calculation of the current layer according to the recalculation result.
In another aspect, generating the recalculation matrix from each of the error blocks includes:
Copying each error block into a plurality of copies and then splicing to obtain the recalculation matrix;
The error correction processing is carried out on the data calculation result of the propagation calculation of the current layer according to the recalculation result, and the error correction processing comprises the following steps:
taking the result that occurs most frequently among the recalculation results corresponding to an error block as the final calculation result corresponding to that error block, and replacing the data calculation result corresponding to the error block with the final calculation result.
On the other hand, copying each error block into a plurality of copies, and then splicing to obtain the recalculation matrix, including:
Obtaining a recalculated redundancy modulus;
and copying each error block multiple times according to the recalculation redundancy modulus and then splicing the copies into the recalculation matrix, so that the number of copies of each error block in the recalculation matrix equals the recalculation redundancy modulus.
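The copy-and-vote recalculation can be sketched as follows (assumed names; voting by exact equality of the recomputed values): each error block is copied t times, the copies are spliced into a recalculation matrix, one general matrix multiplication is run, and the most frequent result per row is kept.

```python
# A minimal sketch with assumed names; not the patented implementation.
import numpy as np
from collections import Counter

def recompute_by_voting(W: np.ndarray, x: np.ndarray, error_blocks: list[int],
                        h: int, t: int) -> dict[int, np.ndarray]:
    """Return {block sequence number k: corrected (voted) results for its rows}."""
    n = W.shape[0]
    blocks = [W[(k - 1) * h : min(k * h, n)] for k in error_blocks]
    recalc = np.vstack([b for b in blocks for _ in range(t)])  # t copies of each error block
    y = recalc @ x                                             # one (possibly faulty) GEMM
    corrected, offset = {}, 0
    for k, b in zip(error_blocks, blocks):
        rows = b.shape[0]
        copies = y[offset : offset + t * rows].reshape(t, rows)
        voted = [Counter(copies[:, i].tolist()).most_common(1)[0][0] for i in range(rows)]
        corrected[k] = np.array(voted)                         # majority vote per row
        offset += t * rows
    return corrected
```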
In another aspect, before the general matrix multiplication is performed on the recalculation matrix and the input word vector of the current layer to obtain a recalculation result, the method further includes:
judging whether the number of the lines of the recalculation matrix exceeds the sum of the number of the lines of the weight matrix and the number of the lines of the redundancy matrix;
If yes, returning to the step of generating a redundant matrix according to the weight matrix of the current layer;
if not, the step of performing general matrix multiplication calculation on the recalculation matrix and the input word vector of the current layer is entered.
In another aspect, the method further comprises:
and if the number of times of returning to the step of generating the redundant matrix according to the weight matrix of the current layer exceeds a recalculation threshold value, stopping iterative calculation of the target language model and outputting error reporting information of the computing equipment fault.
In another aspect, the recalculated redundancy modulus is obtained by:
$$t = \left\lfloor \frac{n + 2^{p+1} - 1}{S \cdot h} \right\rfloor$$
wherein $t$ is the recalculated redundancy modulus, $n$ is the number of rows of the weight matrix, $S$ is the number of error blocks, $h$ is the number of rows of a data block, $p$ is the blocking factor, and $\lfloor\cdot\rfloor$ is a round-down operation.
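Under this reading of the formula (an assumption, not an authoritative definition), the recalculated redundancy modulus is simply the largest t for which t copies of the S error blocks, each of h rows, fit within the n + 2^(p+1) - 1 rows already occupied by the check matrix:

```python
# A sketch of one plausible reading of the formula above; names and the exact
# expression are assumptions for illustration only.
def recalc_redundancy_modulus(n: int, S: int, h: int, p: int) -> int:
    return (n + 2**(p + 1) - 1) // (S * h)   # floor division = round-down operation
```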
On the other hand, the performing general matrix multiplication computation on the weight matrix, the redundancy matrix and the input word vector of the current layer respectively to obtain a data computation result and a first check value of propagation computation of the current layer, including:
and performing the general matrix multiplication calculation of the weight matrix and the input word vector of the current layer in parallel, and performing the general matrix multiplication calculation of the redundancy matrix and the input word vector of the current layer.
In order to solve the technical problem, the invention also provides a fault-tolerant computing method, which comprises the following steps:
After determining information of a storage device and information of a computing device according to a model parallel computing task, reading data to be processed corresponding to the model parallel computing task from the storage device;
Converting the data to be processed into an input vector, and inputting the input vector into the computing equipment deployed with the model parameters of the target model for iterative computation;
In the propagation calculation of the current iteration calculation of the target model, after a redundant matrix is generated according to a weight matrix of a current layer, carrying out general matrix multiplication calculation on the weight matrix, the redundant matrix and an input vector of the current layer respectively to correspondingly obtain a data calculation result and a first check value of the propagation calculation of the current layer, generating a second check value from the data calculation result, checking the data calculation result of the current layer according to a consistency comparison result of the second check value and the corresponding first check value, entering the propagation calculation of the next layer if the data calculation result of the current layer passes the check, otherwise, carrying out error correction treatment on the data calculation result of the current layer;
After the propagation calculation of each layer of the target model is completed, a generation result corresponding to the data to be processed in the current iterative calculation is obtained;
Wherein generating the redundancy matrix according to the weight matrix includes: dividing the weight matrix into a plurality of data blocks according to rows, generating a plurality of different redundant vectors according to the data blocks, and forming the redundant vectors into the redundant matrix; one redundancy vector corresponds to one or more data blocks, and redundancy values in the redundancy vector are column element sums of corresponding columns in the corresponding data blocks; the second check value corresponds to a method of calculating the redundancy value.
In order to solve the technical problems, the invention also provides a fault-tolerant computing device applied to artificial intelligence question answering, comprising:
the first receiving unit is used for receiving input problem information, wherein the problem information comprises prompt words;
The first conversion unit is used for converting the prompt word into a word vector, and inputting the word vector into a target language model for iterative computation;
The first calculation unit is used for generating a redundant matrix according to a weight matrix of a current layer in the propagation calculation of the current iterative calculation of the target language model, performing general matrix multiplication calculation on the weight matrix, the redundant matrix and an input word vector of the current layer respectively to correspondingly obtain a data calculation result of the propagation calculation of the current layer and a first check value, generating a second check value from the data calculation result, checking the data calculation result of the current layer according to a consistency comparison result of the second check value and the corresponding first check value, entering the propagation calculation of the next layer if the data calculation result of the current layer passes the check, and otherwise, performing error correction processing on the data calculation result of the current layer; after the propagation calculation of each layer of the target language model is completed, a generation result corresponding to the problem information of the current iterative calculation is obtained;
Wherein generating the redundancy matrix according to the weight matrix includes: dividing the weight matrix into a plurality of data blocks according to rows, generating a plurality of different redundant vectors according to the data blocks, and forming the redundant vectors into the redundant matrix; one redundancy vector corresponds to one or more data blocks, and redundancy values in the redundancy vector are column element sums of corresponding columns in the corresponding data blocks; the second check value corresponds to a method of calculating the redundancy value.
In order to solve the above technical problems, the present invention further provides a fault tolerant computing device, including:
The information determining unit is used for reading data to be processed corresponding to the model parallel computing task from the storage device after determining the information of the storage device and the information of the computing device according to the model parallel computing task;
the second receiving unit is used for converting the data to be processed into an input vector, and inputting the input vector into the computing equipment deployed with the model parameters of the target model for iterative computation;
The second calculation unit is used for generating a redundant matrix according to a weight matrix of a current layer in the propagation calculation of the current iteration calculation of the target model, performing general matrix multiplication calculation on the weight matrix, the redundant matrix and an input vector of the current layer respectively to correspondingly obtain a data calculation result of the propagation calculation of the current layer and a first check value, generating a second check value from the data calculation result, checking the data calculation result of the current layer according to a consistency comparison result of the second check value and the corresponding first check value, entering the propagation calculation of the next layer if the data calculation result of the current layer passes the check, and otherwise performing error correction processing on the data calculation result of the current layer; after the propagation calculation of each layer of the target model is completed, a generation result corresponding to the data to be processed in the current iterative calculation is obtained;
Wherein generating the redundancy matrix according to the weight matrix includes: dividing the weight matrix into a plurality of data blocks according to rows, generating a plurality of different redundant vectors according to the data blocks, and forming the redundant vectors into the redundant matrix; one redundancy vector corresponds to one or more data blocks, and redundancy values in the redundancy vector are column element sums of corresponding columns in the corresponding data blocks; the second check value corresponds to a method of calculating the redundancy value.
In order to solve the above technical problem, the present invention further provides a fault tolerant computing device, including:
a memory for storing a computer program;
a processor for executing the computer program, which when executed by the processor implements the steps of the fault tolerant computing method as described in any one of the above.
To solve the above technical problem, the present invention further provides a non-volatile storage medium, on which a computer program is stored, which when executed by a processor, implements the steps of the fault-tolerant computing method according to any one of the above.
To solve the above technical problem, the present invention also provides a computer program product, which includes a computer program/instruction, where the computer program/instruction implements the steps of the fault tolerance calculation method according to any one of the above steps when executed by a processor.
The fault-tolerant computing method provided by the invention has the beneficial effects that the fault-tolerant computing method is applied to artificial intelligence questions and answers, in the process of performing iterative computation by converting prompt words in input problem information into word vectors and inputting the word vectors into a target language model, when performing computation of a weight matrix and the input word vectors in propagation computation, the weight matrix is divided into a plurality of data blocks according to rows, a plurality of different redundant vectors are generated in a mode that column elements of each row in one or a plurality of data blocks and generated redundancy values form corresponding redundant vectors to form the redundant matrix, the weight matrix and the redundant matrix are respectively subjected to general matrix multiplication computation with the input word vectors, a data computation result and a first check value are correspondingly obtained, a second check value is generated according to the generation method of the redundancy values, the data computation result of the current layer is checked according to the consistency comparison result of the second check value and the first check value, if the data computation result of the current layer passes through the verification, the propagation computation of the next layer is entered, and otherwise error correction processing is performed on the data computation result; and then, after the propagation calculation of each layer of the target language model is completed, a generation result corresponding to the current iterative calculation and the problem information is obtained, so that the fault-tolerant calculation of the target language model is realized, further, the error detection and error correction in the training or reasoning process of the language model are realized, and the accuracy of generating the language model or the accuracy of generating the answer information when the artificial intelligent question-answering task is executed is ensured.
The invention also provides a fault-tolerant computing method, which comprises the steps of determining information of a storage device and information of a computing device according to a model parallel computing task, reading data to be processed corresponding to the model parallel computing task from the storage device, converting the data to be processed into an input vector, inputting the input vector into a computing device deployed with model parameters of a target model for iterative computation, dividing the weight matrix into a plurality of data blocks according to rows when computing the weight matrix and the input vector in propagation computation, generating a plurality of different redundant vectors in a mode of forming corresponding redundant vectors by column elements of each row in one or more data blocks and generating redundant values, and carrying out general matrix multiplication computation on the weight matrix and the redundant matrix respectively with the input word vector, correspondingly obtaining a data computing result and a first check value, generating a second check value according to the generation method of the redundant values, checking the data computing result of a current layer according to the consistency comparison result of the second check value and the first check value, and if the data computing result of the current layer is not, entering the propagation computation result of the next layer by checking, and carrying out error correction computation on the data; and then, after the propagation calculation of each layer of the target model is completed, a generation result corresponding to the data to be processed in the current iterative calculation is obtained, so that the fault-tolerant calculation of the target model is realized, error detection and error correction in the model training or model reasoning process are realized, and the accuracy of the model generated by training the model parallel calculation or the accuracy of the generation result of executing the model reasoning task by using the model parallel calculation is ensured.
The present invention also provides a fault tolerant computing device, a fault tolerant computing apparatus, a non-volatile storage medium and a computer program product, which have the above advantages and are not described herein.
Drawings
For a clearer description of embodiments of the invention or of the prior art, the drawings that are used in the description of the embodiments or of the prior art will be briefly described, it being apparent that the drawings in the description below are only some embodiments of the invention, and that other drawings can be obtained from them without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a first fault tolerant computing method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a first fault tolerant computing device according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a redundancy matrix construction process according to an embodiment of the present invention;
FIG. 4 is a flowchart of a second fault tolerant computing method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of another redundancy matrix construction process according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a 5-mode majority voting process according to an embodiment of the present invention;
FIG. 7 is a flowchart of a third fault tolerant computing method according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a fault tolerant computing device according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of another fault tolerant computing device according to an embodiment of the present invention;
Fig. 10 is a schematic structural diagram of a fault tolerant computing device according to an embodiment of the present invention.
Detailed Description
The core of the invention is to provide a fault-tolerant computing method, a device, equipment, a medium and a computer program product, which are used for improving the accuracy of artificial intelligent model computation.
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
With the rapid rise of large models and new developments in Artificial Intelligence (AI) technology, requirements are placed on the accuracy of model computation, both when training artificial intelligence models and when performing inference computation with them to solve practical problems. With the synchronous improvement of hardware devices, the current mainstream model computing device is a computing unit array that adopts a parallel computing architecture and consists of a plurality of Processing Elements (PE). A computing unit array, such as a Graphics Processing Unit (GPU), can perform computations on the computing units in parallel, enabling acceleration of model computation. However, in order to achieve efficient computation, a computing unit array employs a large number of computing units, which amplifies the failure occurrence rate of individual computing units, so that the accuracy of model calculation results increasingly fails to meet demands.
Model calculation is mainly applied to two occasions of model training and model reasoning. Model training is to train and adjust model parameters according to sample data to obtain an artificial intelligent model. Model reasoning is to perform actual tasks by using a trained artificial intelligence model, such as completing artificial intelligence question-answering tasks, such as translation, article generation, abstract generation, information search, image generation, image analysis, code generation and the like by using a trained large language model (Large Language Model, LLM).
Model computation mainly includes two main steps: forward propagation computation and backward propagation computation. In the forward propagation computation, the output of the upper layer of the model is taken as the input of the lower layer; the input vector $X$ of a layer is processed by that layer to give the output $Y$ that serves as the next layer's input:
$$Y = f\left(W \otimes X\right),$$
wherein $\otimes$ is a General Matrix Multiplication (GEMM) calculation, $W$ is called the weight matrix, and the function $f$ is a nonlinear function of the layer.
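A minimal sketch of such a forward-propagation layer (assumed names; ReLU stands in for the unspecified nonlinear function f):

```python
# Illustrative sketch only; W, X and the choice of nonlinearity are assumptions.
import numpy as np

def forward_layer(W: np.ndarray, X: np.ndarray) -> np.ndarray:
    Z = W @ X                     # general matrix multiplication (GEMM)
    return np.maximum(Z, 0.0)     # ReLU as a stand-in for the layer's nonlinear function f
```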
The process of back-propagation computation is more complex but still involves a general matrix multiplication computation.
The computational cell array designed to perform model calculation tasks is primarily used to implement general matrix multiplication computations. Therefore, in order to improve the accuracy of model calculation and further ensure the accuracy of model training process and model reasoning process, error detection and error correction are required to be performed on the calculation result of general matrix multiplication calculation.
The fault-tolerant computing method provided by the embodiment of the invention can be applied to a single computing device or a cluster consisting of a plurality of computing devices. If a cluster is employed, the computing devices may be of the same type or heterogeneous. Types of computing devices may include, but are not limited to, Graphics Processing Units (GPU), Field Programmable Gate Array devices (FPGA), Application Specific Integrated Circuits (ASIC), and Data Processing Unit devices (DPU).
Whether performing model training tasks or model reasoning tasks, a significant amount of memory is required to store model parameters and data to be processed, which may be provided by the computing device, which may be a host, or a host+accelerator architecture. The storage space may also be provided by another storage device or by a storage pool consisting of a plurality of storage devices in a cluster.
Based on the above architecture, the fault-tolerant computing method provided by the embodiment of the invention is described below with reference to the accompanying drawings.
FIG. 1 is a flowchart of a first fault tolerant computing method according to an embodiment of the present invention; fig. 2 is a schematic structural diagram of a first fault tolerant computing device according to an embodiment of the present invention.
As shown in fig. 1, the fault tolerance calculation method provided by the embodiment of the present invention includes:
S101: and after the information of the storage device and the information of the computing device are determined according to the model parallel computing task, reading the data to be processed corresponding to the model parallel computing task from the storage device.
S102: and converting the data to be processed into input vectors, and inputting the input vectors into a computing device provided with model parameters of the target model for iterative computation.
S103: in the propagation calculation of the current iteration calculation of the target model, after a redundant matrix is generated according to the weight matrix of the current layer, carrying out general matrix multiplication calculation on the weight matrix, the redundant matrix and an input vector of the current layer respectively to correspondingly obtain a data calculation result and a first check value of the propagation calculation of the current layer, generating a second check value from the data calculation result, checking the data calculation result of the current layer according to the consistency comparison result of the second check value and the corresponding first check value, entering the propagation calculation of the next layer if the data calculation result of the current layer passes the check, and otherwise, carrying out error correction treatment on the data calculation result of the current layer.
Wherein generating a redundancy matrix from the weight matrix comprises: dividing the weight matrix into a plurality of data blocks according to rows, generating a plurality of different redundant vectors according to the data blocks, and forming each redundant vector into a redundant matrix; one redundancy vector corresponds to one or more data blocks, and redundancy values in the redundancy vector are column element sums of corresponding columns in the corresponding data blocks; the second check value corresponds to a method of calculating the redundancy value.
S104: and after the propagation calculation of each layer of the target model is completed, obtaining a generation result of the current iterative calculation corresponding to the data to be processed.
In a specific implementation, for S101, the model parallel computing task may be in a model training process or a model reasoning process. In the model training process, the acquired data to be processed is sample data, iterative computation is carried out on the target model based on the sample data, forward propagation computation and backward propagation computation are carried out in the iterative computation to adjust model parameters of the target model, one-time iterative computation is completed, and the trained target model is obtained after the iteration ending condition is reached. In the model reasoning process, the acquired data to be processed is data to be calculated, for example, when the target model is a large language model, the data to be processed is problem information containing prompt words (prompt); and outputting a generated result by forward propagation calculation in iterative calculation, and outputting the generated result after reaching an iteration ending condition, for example, when the target model is a large language model, outputting the result as response information.
The architecture of the storage device and the computing device may be referred to the description of the above embodiments of the invention.
And S102, converting the input data to be processed into a vector form, obtaining an input vector meeting the model calculation requirement of the target model, and inputting the input vector into the target model for iterative calculation.
For S103, when the weight matrix and the input vector are calculated in the propagation calculation, dividing the weight matrix into a plurality of data blocks according to rows, generating a plurality of different redundant vectors in a mode that column elements of each row in one or a plurality of data blocks and generated redundant values form corresponding redundant vectors to form a redundant matrix, carrying out general matrix multiplication calculation on the weight matrix and the redundant matrix and the input word vector respectively to correspondingly obtain a data calculation result and a first check value, generating a second check value according to the data calculation result by a redundancy value generation method, checking the data calculation result of the current layer according to the consistency comparison result of the second check value and the first check value, if the data calculation result passes through the check, entering the propagation calculation of the next layer, and otherwise, carrying out error correction treatment on the data calculation result; and then, after the propagation calculation of each layer of the target model is completed, a generation result corresponding to the data to be processed in the current iterative calculation is obtained, so that the fault-tolerant calculation of the target model is realized, error detection and error correction in the model training or model reasoning process are realized, and the accuracy of the model generated by training the model parallel calculation or the accuracy of the generation result of executing the model reasoning task by using the model parallel calculation is ensured.
It can be seen that in embodiments of the present invention, by dividing the weight matrix into a plurality of data chunks, each data chunk corresponds to a plurality of rows of elements in the weight matrix. And generating a redundant vector according to the data blocks to obtain a redundant matrix, so that the redundant vector and the data blocks have a corresponding relation. By performing general matrix multiplication calculation on the weight matrix and the redundancy matrix and the input vector respectively to obtain a data calculation result and a first check value, each row of the weight matrix corresponds to one data calculation result, namely, each row of one data block corresponds to one data calculation result, one data block corresponds to a plurality of data calculation results, each row of the redundancy matrix corresponds to one first check value, according to the corresponding relation between the data block and the redundancy vector, the corresponding relation between the data calculation result and the first check value can be obtained, and further, according to the corresponding relation, a second check value corresponding to the first check value can be obtained by summing calculation on the corresponding data calculation result, namely, the second check value corresponds to the redundancy value calculation method in the embodiment of the invention.
Regardless of the effect of error compensation, if the calculation process of the first check value corresponding to one redundancy vector generated according to one or more data blocks is correct, and each data calculation result corresponding to one or more corresponding data blocks is calculated correctly, the consistency comparison result of the second check value and the first check value should be that the second check value is equal to the corresponding first check value. Therefore, in S103, verifying the data calculation result of the current layer according to the consistency comparison result of the second check value and the corresponding first check value may include: if the second check value is equal to the corresponding first check value, determining that the first check value passes the check, further determining that the data block corresponding to the first check value passes the check, and further determining that the data calculation result corresponding to the first check value passes the check; if the second check value is not equal to the corresponding first check value, determining that the first check value fails to pass the check, further determining that the data block corresponding to the first check value fails to pass the check, and further determining that the data calculation result corresponding to the first check value fails to pass the check.
By using the fault-tolerant computing method provided by the embodiment of the invention, the error blocks can be found from the data blocks, so that the verification of partial or all data computing results in the general matrix multiplication computation of the weight matrix can be realized, namely, it can be understood that if the verification of all data computing results in the general matrix multiplication computation of the weight matrix is to be realized, the generated redundant vector should cover all the data blocks, for example, a redundant vector can be generated for each data block or each few data blocks.
Because the minimum unit for checking is a data block, when an error block which fails to pass the checking is found, the error block and the input word vector are subjected to general matrix multiplication calculation again to obtain a correct data calculation result.
For S104, the propagation computation of each layer is sequentially performed after error detection and error correction of the general matrix multiplication computation in the propagation computation of the object model are performed through S103, and after the propagation computation of each layer is completed, a generation result corresponding to the problem information of the current iteration computation is obtained. If the model calculation is in the model training process, after the forward propagation calculation of each layer is completed, and then the backward propagation calculation of each layer is completed, the generation result of the current iterative calculation corresponding to the problem information is obtained. And if the model calculation is in the model reasoning process, obtaining a generation result of the current iterative calculation corresponding to the data to be processed after the forward propagation calculation of each layer is completed.
In the fault-tolerant computing method provided by the embodiment of the invention, the process of constructing the redundancy matrix and the process of generating the second check value from the data calculation result are both summation computations, the process of comparing the second check value with the first check value is a comparison computation, and the processes of performing general matrix multiplication calculation on the weight matrix, the redundancy matrix and the input vector of the current layer are general matrix multiplication computations. Referring to FIG. 2, the computing device adopted by the embodiment of the invention may therefore comprise an adder 201, a comparator 202 and a general matrix multiplication calculator 203, wherein the adder 201 is used for summing the column elements of the data blocks to obtain the redundancy values and for summing the data calculation results according to the generation method of the redundancy values to obtain the second check values; the comparator 202 is configured to compare whether the second check value is equal to the first check value; and the general matrix multiplication calculator 203 is configured to perform general matrix multiplication calculation on the weight matrix, the redundancy matrix and the input vector of the current layer, respectively. It should be noted that the fault-tolerant computing device shown in FIG. 2 only illustrates, on the hardware of the computing device, the computation types required by the fault-tolerant computing method of the embodiment of the present invention; it does not mean that the fault-tolerant computing device only includes these computing modules.
FIG. 3 is a schematic diagram of a redundancy matrix construction process according to an embodiment of the present invention; fig. 4 is a flowchart of a second fault tolerant computing method according to an embodiment of the present invention.
On the basis of the embodiment, the embodiment of the invention further describes a fault tolerance calculation method.
In the embodiment of the present invention, in S103, the weight matrix is divided into a plurality of data blocks according to rows, and a plurality of different redundancy vectors are generated according to the data blocks, and each redundancy vector forms a redundancy matrix, which may include:
uniformly dividing the weight matrix by rows into $2^p$ data blocks;
calculating, for a single data block, the sums of the elements in each column as the redundancy values of the corresponding columns, so as to sequentially obtain the redundancy vectors corresponding to single data blocks; calculating, for groups of data blocks taken two at a time without repetition, the column sums and the redundancy values of the corresponding columns, so as to sequentially obtain the corresponding redundancy vectors; and so on, until the redundancy vectors corresponding to all the data blocks are obtained;
Splicing the redundant vectors up and down to obtain a redundant matrix;
Where p is a blocking factor and p is a positive integer.
In the embodiment of the invention, the input vector of the current layer may be denoted as $X$, and the weight matrix of the current layer may be denoted as $W \in \mathbb{R}^{n \times d}$, i.e., the weight matrix is a matrix of $n$ rows and $d$ columns, where $W_{i,j}$ represents the element of row $i$ and column $j$ of the weight matrix $W$, $i = 1, 2, \ldots, n$; $j = 1, 2, \ldots, d$.
Obtain a blocking factor $p$, wherein $p$ is an integer and satisfies $1 \le p$ and $2^p \le n$. In the embodiment of the invention, the blocking factor $p$ may be dynamically input according to requirements, may be set by a worker, or may be calculated by a setting function. In some optional implementations of the embodiments of the present invention, the blocking factor may be determined according to the computing resources of the computing device that performs the propagation computation of the current layer and the number of rows of the weight matrix, so that the sum of the number of rows of the weight matrix and the number of rows of the redundancy matrix is less than or equal to the maximum computing resources provided by the computing device for the propagation computation of the current layer. In some alternative implementations of the embodiments of the invention, $p$ may be selected by a preset calculation involving a round-down operation $\lfloor\cdot\rfloor$.
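A sketch of one way to pick the blocking factor under the resource constraint described above (assumed names; R denotes the maximum number of matrix rows the computing device can process for the propagation computation of the current layer):

```python
# Illustrative sketch only; the selection rule and names are assumptions.
def choose_blocking_factor(n: int, R: int) -> int:
    # Largest p with 2**p <= n and n + 2**(p+1) - 1 <= R;
    # assumes p = 1 is feasible (n >= 2 and R >= n + 3).
    p = 1
    while 2**(p + 1) <= n and n + 2**(p + 2) - 1 <= R:
        p += 1
    return p
```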
According to the blocking factor $p$, the weight matrix $W$ is evenly divided by rows into $2^p$ data blocks, and the number of rows of a data block is recorded as the height of the data block, $h = \lceil n/2^p \rceil$. The $k$-th data block then corresponds to rows $(k-1)h+1$ to $kh$ of the weight matrix $W$, where $k$ is the sequence number of the data block, $k = 1, 2, \ldots, 2^p$.
It should be noted that, in the embodiment of the present invention, "uniformly partitioning" does not mean that the number of rows of all the data blocks is the same; for example, the number of rows of the last data block may be smaller than that of the other data blocks. If $kh > n$, the rows beyond the $n$-th row are simply not counted, i.e., the last data block ends at the $n$-th row.
Construct a redundancy matrix $C$ of $2^{p+1}-1$ rows and $d$ columns, with $C_{q,j}$ representing the element of row $q$ and column $j$ of the redundancy matrix $C$, $q = 1, 2, \ldots, 2^{p+1}-1$; $j = 1, 2, \ldots, d$.
In some alternative implementations of the embodiments of the invention, the redundancy matrix construction process includes constructing the redundancy matrix $C$ of $2^{p+1}-1$ rows and $d$ columns as follows:
For $j = 1, 2, \ldots, d$:
when $q$ is odd,
$$C_{q,j} = \sum_{i=(k-1)h+1}^{\min(kh,\,n)} W_{i,j}, \quad q = 2k-1, \; k = 1, 2, \ldots, 2^p; \qquad (1)$$
when $q$ is even,
$$C_{q,j} = C_{q-2^{l-1},\,j} + C_{q+2^{l-1},\,j}, \qquad (2)$$
wherein $l$ satisfies $2^l \mid q$ and $2^{l+1} \nmid q$;
wherein the weight matrix $W$ comprises $n$ rows and $d$ columns; $k$ is the sequence number of a data block; the $k$-th data block corresponds to rows $(k-1)h+1$ to $kh$ of the weight matrix $W$ (if the row number $n$ of the last row of the last data block is less than $kh$, the block ends at the $n$-th row); $h$ is the number of rows of a data block; $C_{q,j}$, $C_{q-2^{l-1},\,j}$ and $C_{q+2^{l-1},\,j}$ are the elements of rows $q$, $q-2^{l-1}$ and $q+2^{l-1}$, column $j$, of the redundancy matrix; $W_{i,j}$ is the element of the $i$-th row and $j$-th column of the weight matrix; $l$ is an exponential parameter; $p$ is the blocking factor; and $h = \lceil n/2^p \rceil$, where $\lceil\cdot\rceil$ is a round-up operation.
Here $2^l \mid q$ and $2^{l+1} \nmid q$ denote that $2^l$ exactly divides $q$, i.e., $q$ is divisible by $2^l$ but not by $2^{l+1}$.
The redundancy vectors corresponding to individual data blocks, i.e., the redundancy vectors whose row numbers in the redundancy matrix $C$ are odd, can then be calculated first according to the above recursion relation; next, the redundancy vectors corresponding to two data blocks each, i.e., the redundancy vectors whose row numbers in $C$ are even but not multiples of 4, are calculated; and so on, until the redundancy vector corresponding to all the data blocks, i.e., the redundancy vector whose row number in $C$ is $2^p$, is finally calculated. Through this recurrence relation, each row of redundancy vectors can be obtained in turn to form the redundancy matrix $C$.
Alternatively, the row redundancy vectors may be calculated in parallel based on the calculation unit array; equations (1) and (2) may then be combined as:
r_{i,j} = Σ_{q=((i-2^m)/2)·h+1}^{min(((i+2^m)/2)·h, n)} w_{q,j}, where m satisfies 2^m | i and 2^{m+1} ∤ i; (3)
wherein m is an exponential parameter, and 2^m | i denotes that 2^m exactly divides i, i.e. 2^m divides i but 2^{m+1} does not.
On the basis of the weight matrix W, each row number i = 1, 2, ……, 2^{p+1}-1 is substituted into equation (3) separately, so that the whole redundancy matrix R is obtained in parallel.
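A minimal NumPy sketch of the recursion of equations (1) and (2), for illustration only; the name build_redundancy_matrix is an assumption, and the fully parallel form of equation (3) is not shown.

```python
import numpy as np

def build_redundancy_matrix(W: np.ndarray, p: int) -> np.ndarray:
    """Build the (2**(p+1) - 1) x d redundancy matrix: odd rows are the column sums of
    single data blocks (eq. (1)); even rows are the sums of their two child rows (eq. (2))."""
    n, d = W.shape
    num_blocks = 2 ** p
    h = -(-n // num_blocks)              # block height, ceil(n / 2**p)
    rows = 2 ** (p + 1) - 1
    R = np.zeros((rows, d), dtype=W.dtype)
    for k in range(1, num_blocks + 1):   # eq. (1): row 2k-1 <- column sums of block k
        R[2 * k - 2, :] = W[(k - 1) * h: min(k * h, n), :].sum(axis=0)
    for l in range(1, p + 1):            # eq. (2): row i <- row (i - 2**(l-1)) + row (i + 2**(l-1))
        for i in range(2 ** l, rows + 1, 2 ** (l + 1)):
            R[i - 1, :] = R[i - 2 ** (l - 1) - 1, :] + R[i + 2 ** (l - 1) - 1, :]
    return R
```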
In some optional implementations of the embodiments of the present invention, in S103, performing general matrix multiplication computation on the weight matrix, the redundancy matrix, and the input word vector of the current layer, to obtain a data computation result and a first check value of propagation computation of the current layer, where the data computation result and the first check value may include:
Splicing the redundant matrix below the weight matrix to obtain a check matrix;
performing general matrix multiplication calculation on the check matrix and the input word vector to obtain a check vector, wherein the first n components of the check vector are the data calculation results corresponding to the rows of the weight matrix, and the last 2^{p+1}-1 components of the check vector are the first check values corresponding to the rows of the redundancy matrix.
The redundancy matrix R and the weight matrix W are spliced one above the other, for example with the redundancy matrix R spliced below the weight matrix W, to obtain the check matrix C. The check matrix C is a matrix with n+2^{p+1}-1 rows and d columns: its rows 1 to n are the row vectors of the weight matrix W, and its rows n+1 to n+2^{p+1}-1 are the row vectors of the redundancy matrix R. The number of rows of the check matrix C is thus (n+2^{p+1}-1)/n times that of the weight matrix W.
Fig. 3 shows an example with blocking factor p = 2. As shown in Fig. 3, the weight matrix W may be divided into 2^p = 4 data blocks, and it follows that the constructed redundancy matrix R includes 2^{p+1}-1 = 7 redundancy vectors.
If equations (1) and (2) are adopted and the redundancy vectors are calculated recursively, the redundancy vectors corresponding to single data blocks are first calculated according to equation (1): the redundancy vector corresponding to the first data block is V1, that of the second data block is V3, that of the third data block is V5, and that of the fourth data block is V7, where the sequence number indicates the row number of the redundancy vector in the redundancy matrix R.
The redundancy vectors V2, V4 and V6, each corresponding to two or more data blocks, may then be calculated from the redundancy vectors of single data blocks using equation (2). For example, V2 = V1 + V3 corresponds to the first and second data blocks and can be calculated directly from V1 and V3 obtained above; V6 = V5 + V7 corresponds to the third and fourth data blocks and can be calculated directly from V5 and V7; and V4 = V2 + V6 can then be calculated directly from V2 and V6.
It can be seen that after three rounds of calculation, the redundancy vectors corresponding to one data block, to two data blocks and to four data blocks are obtained respectively; these redundancy vectors form the 7-row redundancy matrix R, from which the check matrix C is further obtained.
Alternatively, the row redundancy vectors may be calculated in parallel according to equation (3) based on the calculation unit array, that is, each row number i = 1, 2, ……, 7 is substituted into equation (3) to obtain the redundancy vectors in parallel; these redundancy vectors form the 7-row redundancy matrix R, from which the check matrix C is further obtained.
After the check matrix C is constructed, the input vector X is acquired. The check matrix C is used in place of the weight matrix W: the input vector X and the check matrix C are fed into the calculation unit array, and the check vector Y is obtained as the result, then:
Y = (y_1, y_2, ……, y_{n+2^{p+1}-1})^T = C·X; (4)
Wherein the superscript T denotes the transpose of the vector.
It can be seen that the first n components of the check vector Y are the data calculation results corresponding to the rows of the weight matrix W, and the last 2^{p+1}-1 components of the check vector Y are the first check values corresponding to the rows of the redundancy matrix R.
The n data calculation results in the check vector Y also correspond to the 2^p data blocks: the kth data block (k = 1, 2, ……, 2^p) corresponds to components (k-1)h+1 to kh of the check vector Y, i.e. y_{(k-1)h+1}, ……, y_{kh}, and the data calculation result of the last (2^p-th) data block is components (2^p-1)h+1 to n of the check vector Y. The set of data calculation results from component (k-1)h+1 to component kh of the check vector Y is recorded as the data calculation result corresponding to the kth data block.
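For illustration only, the splicing of equation (4) and the split of the check vector into data results and first check values could be sketched as follows; check_vector is an assumed helper name, and the example reuses build_redundancy_matrix from the sketch above with random data rather than real model weights.

```python
import numpy as np

def check_vector(W: np.ndarray, R: np.ndarray, x: np.ndarray):
    """Stack the redundancy matrix R below the weight matrix W to form the check matrix C,
    multiply by the input vector x, and split the result into the n data calculation
    results and the 2**(p+1) - 1 first check values."""
    C = np.vstack([W, R])     # (n + 2**(p+1) - 1) x d check matrix
    Y = C @ x                 # one general matrix multiplication, as in equation (4)
    n = W.shape[0]
    return Y[:n], Y[n:]

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 32))
x = rng.normal(size=32)
R = build_redundancy_matrix(W, p=3)            # from the earlier sketch
data, first_checks = check_vector(W, R, x)
print(data.shape, first_checks.shape)          # (32,) (15,)
```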
It can be seen that the size of the blocking factor p affects the calculation efficiency of the check vector Y: the number of rows of the check matrix C is (n+2^{p+1}-1)/n times that of the weight matrix W, so completing a target computing task of the same scale requires the hardware scale to be enlarged by the same factor; conversely, with the same hardware scale, the efficiency of the current-layer general matrix multiplication is reduced to n/(n+2^{p+1}-1) of the original. An appropriate value of the blocking factor may be selected based on the hardware resources of the calculation unit array employed and the time requirements of the iterative computation.
According to the redundancy matrix R provided by the embodiment of the present invention, the following relation should be satisfied between the components of the check vector:
y_{n+i} = Σ_{q=((i-2^m)/2)·h+1}^{min(((i+2^m)/2)·h, n)} y_q, i = 1, 2, ……, 2^{p+1}-1, where m satisfies 2^m | i and 2^{m+1} ∤ i; (5)
wherein m is an exponential parameter, and 2^m | i denotes that 2^m exactly divides i, i.e. 2^m divides i but 2^{m+1} does not.
The sums of the data calculation results of the data blocks corresponding to each first check value, i.e. the right-hand side of equation (5), may be calculated first; the number of such sums is equal to the number of first check values.
In calculating the sums on the right-hand side of equation (5), the same calculation method as in equations (1) and (2) can be adopted, i.e. the cases where i is odd and where i is even are calculated separately. Generating the second check value from the data calculation result in S103 may include recursively calculating the second check values corresponding to the data blocks using:
when i is odd, c_i = Σ_{q=(k-1)h+1}^{min(kh,n)} y_q, where i = 2k-1, k = 1, 2, ……, 2^p; (6)
when i is even, c_i = c_{i-2^{l-1}} + c_{i+2^{l-1}}, where l satisfies 2^l | i and 2^{l+1} ∤ i; (7)
wherein 2^l | i denotes that 2^l exactly divides i, i.e. 2^l divides i but 2^{l+1} does not;
wherein c_i is the second check value calculated from the data calculation results corresponding to the ith first check value in the check vector (i.e. to component y_{n+i}), y_q is the qth component of the check vector, h is the number of rows of a data block, n is the number of rows of the weight matrix, l is an exponential parameter, and p is the blocking factor.
That is, according to the recursion, the second check values c_i with odd i can be calculated first, then those whose index i is even but not a multiple of 4, and so on, until finally c_{2^p}, the sum of the data calculation results corresponding to all data blocks, is calculated.
Alternatively, generating the second check value from the data calculation result in S103 may include: calculating the second check value corresponding to each first check value in parallel, in the same manner as the redundancy values are calculated. That is, each index i = 1, 2, ……, 2^{p+1}-1 may be substituted into the right-hand side of equation (5) in parallel on the calculation unit array; because the calculation instructions involved are the same as those of equation (3), this can be accomplished with the same calculation instructions on the same hardware structure.
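A possible sketch of the recursive form of equations (6) and (7), mirroring the redundancy-matrix recursion above and continuing the example from the previous sketch; second_check_values is an assumed name, and a small floating-point tolerance is used in the comparison because summation order can introduce rounding differences.

```python
import numpy as np

def second_check_values(data: np.ndarray, h: int, p: int) -> np.ndarray:
    """Recompute, from the n data calculation results alone, the binary-tree sums that
    the redundancy rows encode; these are the second check values."""
    n = data.shape[0]
    rows = 2 ** (p + 1) - 1
    c = np.zeros(rows, dtype=data.dtype)
    for k in range(1, 2 ** p + 1):            # odd indices: per-block sums, eq. (6)
        c[2 * k - 2] = data[(k - 1) * h: min(k * h, n)].sum()
    for l in range(1, p + 1):                 # even indices: sums of child values, eq. (7)
        for i in range(2 ** l, rows + 1, 2 ** (l + 1)):
            c[i - 1] = c[i - 2 ** (l - 1) - 1] + c[i + 2 ** (l - 1) - 1]
    return c

# Comparison with the first check values; np.isclose stands in for exact equality.
second = second_check_values(data, h=4, p=3)
suspect = ~np.isclose(second, first_checks, atol=1e-6)
print(suspect.nonzero())
```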
Then, after each first check value y_{n+i} (the left-hand side of equation (5)) is identified in the check vector Y, the corresponding second check value is calculated from the first n data calculation results of the check vector Y according to the summation on the right-hand side of equation (5). The corresponding first check value and second check value are compared: if they are consistent, that group of check values passes the check; if they are inconsistent, that group fails the check, and the subsequent checking and error correction are performed.
Except in the case of error compensation (i.e. two or more calculation errors cancel each other out so that the check relation still holds), if a soft error such as a bit flip occurs during the calculation of the data calculation results, some data calculation results will change, and the equality between the corresponding first check value and second check value will no longer hold.
In the embodiment of the present invention, the data calculation result in S103 does not pass the verification, which may include: and the second check value calculated according to the data calculation result is inconsistent with the corresponding first check value.
Wherein the second check value being inconsistent with the corresponding first check value can be represented by the following formula:
y_{n+i} ≠ Σ_{q=((i-2^m)/2)·h+1}^{min(((i+2^m)/2)·h, n)} y_q, where m satisfies 2^m | i and 2^{m+1} ∤ i; (6)
wherein y_{n+i} is the (n+i)th component of the check vector, y_q is the qth component of the check vector, h is the number of rows of a data block, n is the number of rows of the weight matrix, m is an exponential parameter, 2^m | i denotes that 2^m exactly divides i, i.e. 2^m divides i but 2^{m+1} does not, and p is the blocking factor.
Based on this, the embodiment of the present invention provides a verification process for a data calculation result of general matrix multiplication calculation, and in S103, verifying a data calculation result of a current layer according to a consistency comparison result of a second verification value and a corresponding first verification value may include:
If the second check values corresponding to all the data blocks are consistent with the corresponding first check values, determining that the first check values corresponding to all the data blocks pass the check, and further determining that the data calculation results of the current layer pass the check;
If the second check values corresponding to the data blocks are not all consistent with the corresponding first check values, performing step-by-step checking according to the first check values;
in the step-by-step verification process, if the current first verification value passes the verification, the branch index of the first verification value is not verified; if the current first check value does not pass the check and the first check value has a branch index, checking the branch index of the first check value; if the current first check value does not pass the check and the first check value does not have a branch index, determining that the data block corresponding to the first check value is an error block; if the current first check value does not pass the check but all branch indexes of the first check value pass the check, setting the current first check value as passing the check;
determining that the data calculation result corresponding to the error block is an error result, and determining that the data calculation result corresponding to the first check value passing the check is a correct result;
Wherein the branch indices of the first check value with sequence number i are the first check values with sequence numbers i-2^{l-1} and i+2^{l-1}, where i is the sequence number of the current first check value, l is an exponential parameter, and l satisfies 2^l | i and 2^{l+1} ∤ i, 2^l | i denoting that 2^l exactly divides i.
In the embodiment of the invention, in order to verify whether all data calculation results are correct, verification can be performed according to the following process:
1) Initialization: record the error block number set S = ∅, the exponential parameter l = p, and the initial index i = 2^p.
2) Verify whether the second check value c_i equals the corresponding first check value y_{n+i}. If so, no additional operation is performed for this index. If not: when the exponential parameter l = 0, the error block number set is updated to S = S ∪ {(i+1)/2}; when the exponential parameter l > 0, let l = l-1 and set the index respectively to i-2^l and i+2^l (these two indices are called the branch indices of i), and repeat step 2) for each of them.
3) Output the updated error block number set S. For any k, if k ∈ S, the data calculation results of the kth data block contain an erroneous result; for any k, if k ∉ S, the data calculation results of the kth data block are all correct.
When an error occurs in the calculation of a certain data calculation result, the check of every first check value covering the error block containing that erroneous result fails, so the above checking process always finds the number of the error block in which the erroneous result is located.
However, when a first check value corresponding to two or more data blocks fails its check while the data calculation results of all the data blocks it covers are correct (for example because the first check value itself was computed erroneously), its branch indices in step 2) (the first check values with sequence numbers i-2^{l-1} and i+2^{l-1}) all pass their checks. In that case, the failed check of a first check value with an even sequence number among the last 2^{p+1}-1 components of the check vector Y does not affect the checking of the data calculation results; this is what is meant above by "if the current first check value does not pass the check but all branch indexes of the first check value pass the check, the current first check value is set as passing the check".
However, a first check value corresponding to a single data block, i.e. one whose sequence number among the last 2^{p+1}-1 components of the check vector Y is odd, has no branch indices, so when such a first check value fails its check the corresponding data block is directly marked as an error block.
In the checking process, if the first check value checked first (the one with sequence number 2^p) passes its check, the checking flow ends directly, the other first check values are not checked, the number set S remains empty, and all data calculation results are correct. If the number set S is not empty, erroneous calculation results were found during the check; they can be located to specific data blocks, which are marked as error blocks, while the remaining data blocks are marked as correct blocks. The error blocks are then recalculated until correct calculation results are obtained, realizing error correction.
Through the above checking process provided by the embodiment of the present invention, it is not necessary to use all the first check values at once. Instead, starting from the first check value with sequence number 2^p among the last 2^{p+1}-1 components of the check vector Y, its branch indices, the branch indices of those branch indices, and so on, are checked in turn; whenever a first check value at some level passes its check, its branch indices need not be checked, which reduces the number of check comparisons.
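The step-by-step check can be sketched as a small tree descent over the check-value indices; locate_error_blocks is an assumed name, indices are 1-based as in the text, and a floating-point tolerance again stands in for exact equality.

```python
import numpy as np

def locate_error_blocks(first_checks, second_checks, p: int, atol: float = 1e-6) -> set:
    """Start from index 2**p; whenever a check fails, descend to the two branch indices
    i -/+ 2**(l-1); collect the block numbers of failing leaf-level (odd-index) checks."""
    S = set()
    stack = [(2 ** p, p)]                     # (index i, exponential parameter l)
    while stack:
        i, l = stack.pop()
        if np.isclose(first_checks[i - 1], second_checks[i - 1], atol=atol):
            continue                          # this check passes; its branches are skipped
        if l == 0:
            S.add((i + 1) // 2)               # leaf index i = 2k - 1 corresponds to block k
        else:
            stack.append((i - 2 ** (l - 1), l - 1))
            stack.append((i + 2 ** (l - 1), l - 1))
    return S
```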
The correct result can be obtained by voting after repeated calculation for the error blocks detected in the steps, and the process can comprise the following steps:
According to the error block set S, the number of error blocks is recorded as s, and g is used as the index of an error block; the error block indices are assigned according to the order of the error blocks among the data blocks, so that k_g denotes the data block sequence number corresponding to the g-th error block in the error block set, g = 1, 2, ……, s.
On the basis of the foregoing embodiment, in the embodiment of the present invention, performing error correction processing on the data calculation result of the current layer in S103 may include:
determining that the corresponding data block is an error block when the second check value is inconsistent with the corresponding first check value;
dividing each error block into recalculation matrixes;
carrying out general matrix multiplication calculation on the recalculation matrix and the input word vector of the current layer to obtain a recalculation result;
and carrying out error correction processing on the data calculation result of the propagation calculation of the current layer according to the recalculation result.
In some optional implementations of the embodiments of the invention, generating the recalculation matrix for each error chunk may include:
copying each error block into a plurality of copies and then splicing to obtain a recalculation matrix;
performing error correction processing on the data calculation result of the propagation calculation of the current layer according to the recalculation result, including:
and replacing the data calculation result corresponding to the error block by using the final calculation result according to the final calculation result corresponding to the error block, wherein the result with the largest occurrence number in the recalculation result corresponding to the error block is the final calculation result corresponding to the error block.
The method for obtaining the recalculation matrix by splicing after copying each error block into a plurality of copies can comprise the following steps:
Obtaining a recalculated redundancy modulus;
And copying the error blocks for multiple times according to the recalculation redundancy modulus, and then splicing the error blocks into a recalculation matrix, so that the number of each error block in the recalculation matrix is the recalculation redundancy modulus.
The recalculation redundancy modulus can be defined as t = ⌊(n+2^{p+1}-1)/(s·h)⌋, wherein t is the recalculation redundancy modulus, n is the number of rows of the weight matrix, s is the number of error blocks, h is the number of rows of a data block, p is the blocking factor, and ⌊·⌋ is a round-down operation.
In this way, the hardware resources used for the general matrix multiplication of the check matrix C are fully utilized.
The data calculation results corresponding to the error blocks, i.e. among the first n components of the check vector Y the components whose positions fall within the row ranges of the error blocks in the weight matrix W (components (k_g-1)h+1 to k_g·h for each error block with data block sequence number k_g), may include erroneous calculation results, while the remaining data calculation results are correct results.
The weight matrix W divided into 2^p data blocks may be represented as:
W = (W_1^T, W_2^T, ……, W_{2^p}^T)^T; (7)
wherein W_1 represents the first data block, W_2 represents the second data block, and W_{2^p} represents the 2^p-th data block.
Constructing the recalculation matrix F may be expressed as:
F = (W_{k_1}^T, ……, W_{k_1}^T, W_{k_2}^T, ……, W_{k_2}^T, ……, W_{k_s}^T, ……, W_{k_s}^T)^T; (8)
wherein W_{k_1} represents the first error block, W_{k_2} represents the second error block, and W_{k_s} represents the s-th error block, each error block appearing t times. It can be seen that in the recalculation matrix F each error block is replicated t times, and under the constraint t = ⌊(n+2^{p+1}-1)/(s·h)⌋ the size of the recalculation matrix F does not exceed that of the check matrix C.
After the recalculation matrix F is constructed, the input vector X and the recalculation matrix F are fed into the calculation unit array for general matrix multiplication, obtaining the recalculation vector:
Y_F = F·X; (9)
wherein Y_F denotes the recalculation vector.
In this way, the h data calculation results within each error block are each recalculated t times. It can be observed that, because the range of values a data calculation result can take is large, when each error block in the recalculation matrix F is recalculated t times, even if several erroneous calculation results occur again, the probability that two erroneous calculation results are equal is small, whereas the correct calculation results are always identical. Therefore, among the t recalculation results of the same error block, the result that occurs most often is taken as the final calculation result corresponding to that error block; this process is called t-mode voting. By selecting a suitably large blocking factor p so that t ≥ 3, the correct calculation result of each error block can be obtained from the recalculation vector Y_F by t-mode voting. The data calculation results corresponding to the error blocks are replaced by the final calculation results corresponding to those error blocks, while the correct results among the data calculation results obtained by the general matrix multiplication with the check matrix C are retained unchanged, so that the final, correct target vector of the current layer's general matrix multiplication is obtained.
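Purely as an illustration of this replicate-and-vote step, a sketch might look as follows; correct_error_blocks is an assumed name, error blocks are given as 1-based block numbers, and the component-wise majority vote assumes fault-free copies produce identical values.

```python
import numpy as np

def correct_error_blocks(W, x, data, error_blocks, h, p):
    """Replicate each error block t times, redo the multiplication, and overwrite the
    corresponding data calculation results with the per-component majority value."""
    n = W.shape[0]
    s = len(error_blocks)
    t = (n + 2 ** (p + 1) - 1) // (s * h)       # recalculation redundancy modulus
    blocks = sorted(error_blocks)
    F = np.vstack([W[(k - 1) * h: min(k * h, n), :]
                   for k in blocks for _ in range(t)])
    Z = F @ x                                   # recalculation vector
    corrected = data.copy()
    offset = 0
    for k in blocks:
        rows = min(k * h, n) - (k - 1) * h
        copies = Z[offset: offset + t * rows].reshape(t, rows)
        offset += t * rows
        for r in range(rows):                   # t-mode voting, one component at a time
            vals, counts = np.unique(copies[:, r], return_counts=True)
            corrected[(k - 1) * h + r] = vals[np.argmax(counts)]
    return corrected, t
```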
If the same hardware resources are used for the general matrix multiplication of the recalculation matrix F as for the check matrix C, the recalculation incurs no additional hardware overhead, but the time spent on the current layer's general matrix multiplication is up to twice the time originally required. In return, error detection and error correction of the target vector obtained by the general matrix multiplication are achieved, the correctness of the target vector is guaranteed, and fault-tolerant computation of general matrix multiplication on the calculation unit array is realized.
On the basis of the fault-tolerant computing device shown in fig. 2, as shown in fig. 4, the fault-tolerant computing method provided by the embodiment of the invention may include:
s401: constructing a check matrix;
s402: calculating a check vector;
s403: calculating a second check value;
S404: checking the correctness of data partitioning;
S405: the error partition is recalculated.
In fig. 4, a circular block diagram indicates an input or output value, a square block diagram indicates a flow step, a diamond block diagram indicates a judgment branch, and a dotted line block indicates that S402 and S405 are performed on the same hardware structure, that is, a common matrix multiplication operation is performed on the same common matrix multiplication calculator, and there is a possibility that the result of the calculation is erroneous.
The specific implementation of each of these steps has been described above.
In other optional implementations of the embodiment of the present invention, before performing general matrix multiplication on the recalculation matrix and the input word vector of the current layer to obtain the recalculation result, the fault-tolerant calculation method provided by the embodiment of the present invention may further include:
judging whether the number of the lines of the recalculation matrix exceeds the sum of the number of the lines of the weight matrix and the number of the lines of the redundancy matrix;
If yes, returning to the step of generating a redundant matrix according to the weight matrix of the current layer;
if not, the step of performing general matrix multiplication calculation on the recalculated matrix and the input word vector of the current layer is entered.
That is, in order not to exceed the original hardware resources, if the re-calculation matrix size exceeds the check matrix size, the propagation calculation of the current layer is re-performed instead of performing the error block re-calculation.
In other optional implementations of the embodiments of the present invention, the fault tolerant computing method may further include:
If the number of times of returning to the step of generating the redundant matrix according to the weight matrix of the current layer exceeds the recalculation threshold value, stopping iterative calculation of the target language model and outputting error reporting information of the computing equipment fault.
That is, if the propagation computation of the current layer is repeated a plurality of times, the iterative computation of the model is stopped and an error is reported, because the computing error at this time may be caused by a computing device failure.
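Tying the preceding steps together, the layer-level control flow described here could be sketched as follows; the helpers (build_redundancy_matrix, second_check_values, locate_error_blocks, correct_error_blocks) refer to the illustrative sketches above, and max_retries is an assumed parameter standing in for the recalculation threshold.

```python
import numpy as np

class ComputeDeviceFault(RuntimeError):
    """Raised when repeated recomputation of a layer still fails."""

def layer_forward_fault_tolerant(W, x, p, max_retries=3, atol=1e-6):
    n = W.shape[0]
    h = -(-n // 2 ** p)                               # block height
    for _ in range(max_retries):
        R = build_redundancy_matrix(W, p)             # redundancy rows
        Y = np.vstack([W, R]) @ x                     # check vector, as in equation (4)
        data, first = Y[:n], Y[n:]
        second = second_check_values(data, h, p)
        S = locate_error_blocks(first, second, p, atol)
        if not S:
            return data                               # all data blocks verified
        s = len(S)
        t = (n + 2 ** (p + 1) - 1) // (s * h)
        if t < 1 or s * t * h > n + 2 ** (p + 1) - 1:
            continue                                  # recalculation matrix would not fit; redo layer
        corrected, _ = correct_error_blocks(W, x, data, S, h, p)
        return corrected
    raise ComputeDeviceFault("layer recomputed too many times; possible device fault")
```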
In other optional implementations of the embodiments of the present invention, in S103, general matrix multiplication is performed on the weight matrix, the redundancy matrix, and the input word vector of the current layer, so as to obtain a data calculation result of propagation calculation of the current layer and a first check value, which may also include:
the general matrix multiplication of the weight matrix and the input word vector of the current layer and the general matrix multiplication of the redundancy matrix and the input word vector of the current layer are performed in parallel.
The general matrix multiplication calculation of the weight matrix and the input word vector of the current layer and the general matrix multiplication calculation of the redundancy matrix and the input word vector of the current layer can be performed in parallel under the condition of hardware calculation resources to improve the calculation efficiency. For example, a general matrix multiplication calculator for performing general matrix multiplication calculation of the weight matrix and the input word vector of the current layer and general matrix multiplication calculation of the redundant matrix and the input word vector of the current layer may be deployed locally in the computing device in advance, or one of the calculation tasks may be sent to another computing device to perform general matrix multiplication calculation.
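As a toy illustration of this task split only (not a statement about the patent's hardware), the two multiplications can be issued as independent tasks; with NumPy the speedup is limited, and the point is merely the decomposition.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def check_vector_parallel(W: np.ndarray, R: np.ndarray, x: np.ndarray):
    """Run the weight-matrix multiplication and the redundancy-matrix multiplication as
    two independent tasks, equivalent to multiplying by the stacked check matrix."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        data_future = pool.submit(np.matmul, W, x)
        check_future = pool.submit(np.matmul, R, x)
        return data_future.result(), check_future.result()
```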
FIG. 5 is a schematic diagram of another redundancy matrix construction process according to an embodiment of the present invention; FIG. 6 is a schematic diagram of a 5-mode majority voting process according to an embodiment of the present invention.
The following is a weight matrix of 32×32And an input vector of 32 x 1Multiplication is taken as an example, and a fault-tolerant computing method of general matrix multiplication provided by the embodiment of the invention is introduced.
As shown in Fig. 5, according to the scale of the weight matrix W, a blocking factor p = 3 may be chosen, so that the weight matrix W is divided into 2^p = 8 data blocks, denoted W_1, W_2, ……, W_8, the matrix size of each data block being 4 × 32. The relation between the data blocks and the weight matrix W is:
W = (W_1^T, W_2^T, ……, W_8^T)^T;
wherein the height h of each data block is 4.
According to 2^{p+1}-1 = 15, the number of rows of the redundancy matrix can be determined as 15. Then, according to equations (1) and (2), for any j = 1, 2, ……, 32, the 15-row, 32-column redundancy matrix R is constructed, with its elements r_{i,j} defined as follows:
r_{i,j} = Σ_{q=2(i-1)+1}^{2(i-1)+4} w_{q,j}, wherein i = 1, 3, 5, 7, 9, 11, 13, 15; (10)
r_{i,j} = r_{i-1,j} + r_{i+1,j}, wherein i = 2, 6, 10, 14; (11)
r_{i,j} = r_{i-2,j} + r_{i+2,j}, wherein i = 4, 12; (12)
r_{i,j} = r_{i-4,j} + r_{i+4,j}, wherein i = 8. (13)
For each column, all the redundancy values can be obtained by recursive calculation according to the above process, or the redundancy values corresponding to all rows can be calculated in parallel on the calculation unit array.
The correspondence between the data blocks W_1, ……, W_8 of the weight matrix W and the rows of the redundancy matrix R can be represented by Fig. 5.
The redundancy matrix R is spliced below the weight matrix W, and a check matrix C with a size of 47 × 32 is obtained.
The input vector X and the check matrix C are fed into the calculation unit array, and the check vector Y is calculated:
Y = (y_1, y_2, ……, y_47)^T = C·X; (14)
Wherein the superscript T denotes the transpose of the vector.
Thus, the first 32 components of the check vector Y are the 32 data calculation results corresponding to the weight matrix W, and the last 15 components of the check vector Y are the 15 first check values corresponding to the redundancy matrix R.
The 32 data calculation results also correspond to the 8 data blocks: the kth data block corresponds to components 4(k-1)+1 to 4k of the check vector Y, i.e. the set {y_{4k-3}, y_{4k-2}, y_{4k-1}, y_{4k}} of data calculation results is referred to as the data calculation result of the kth data block.
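As a quick, purely illustrative sanity check of this 32 × 32 example, using the build_redundancy_matrix sketch from above and random data in place of real model weights:

```python
import numpy as np

rng = np.random.default_rng(42)
W = rng.normal(size=(32, 32))
x = rng.normal(size=32)

R = build_redundancy_matrix(W, p=3)       # 15 x 32, from the earlier sketch
C = np.vstack([W, R])                     # 47 x 32 check matrix
Y = C @ x                                 # 47-component check vector
print(R.shape, C.shape, Y.shape)          # (15, 32) (47, 32) (47,)

# Row 8 of R is the column sum of the whole weight matrix, so y_40 should match
# the sum of the 32 data calculation results up to floating-point rounding.
print(np.allclose(Y[39], Y[:32].sum()))   # True
```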
In order to detect the correctness of the data calculation result, 15 second check values may be calculated according to equation (5), and whether the data calculation result is erroneous or not may be detected by a consistency comparison result of the second check values and the first check values.
Then, mirroring equations (10)-(13), the second check values c_1, c_2, ……, c_15 are calculated as follows:
c_i = Σ_{q=2(i-1)+1}^{2(i-1)+4} y_q, wherein i = 1, 3, 5, 7, 9, 11, 13, 15; (15)
c_i = c_{i-1} + c_{i+1}, wherein i = 2, 6, 10, 14; (16)
c_i = c_{i-2} + c_{i+2}, wherein i = 4, 12; (17)
c_i = c_{i-4} + c_{i+4}, wherein i = 8. (18)
According to this recursion for the second check values, all second check values with odd subscripts can be calculated first, then all those whose subscripts are even but not multiples of 4, and so on, and finally the second check value with subscript 8, c_8, is calculated.
Alternatively, each index i = 1, 2, ……, 15 can be substituted into equation (5) for calculation in parallel on the calculation unit array; because the calculation instructions involved are the same as those of equation (3), this can be accomplished with the same calculation instructions on the same hardware structure.
After the calculation of the second check values is completed, whether the corresponding data calculation results are correct is judged by checking whether each second check value and the corresponding first check value satisfy the equality relation. As an example, assume that in the process of calculating the check vector Y, errors occur in the calculation of a data calculation result in the first data block, a data calculation result in the eighth data block, and the 6th first check value y_38; the checking process is then as follows:
1) Initialization: record the error block number set S = ∅, the exponential parameter l = 3, and the initial index i = 8.
2) Verify whether each second check value equals the corresponding first check value, starting by checking whether c_8 = y_40 holds. Because one of the data calculation results it covers is erroneous, c_8 ≠ y_40 and the check fails; since l > 0, the branch indices are checked next.
Let l = 2, i = 4: check whether c_4 = y_36 holds. Because one of the data calculation results it covers is erroneous, c_4 ≠ y_36 and the check fails; since l > 0, the branch indices are checked.
Let l = 1, i = 2: check whether c_2 = y_34 holds. Because one of the data calculation results it covers is erroneous, c_2 ≠ y_34 and the check fails; since l > 0, the branch indices are checked.
Let l = 0, i = 1: check whether c_1 = y_33 holds. Because a data calculation result in the first data block is erroneous, c_1 ≠ y_33 and the check fails; since l = 0, the error block number set is updated to S = {1}, indicating that the first data block is an error block.
Let l = 0, i = 3: check whether c_3 = y_35 holds. Since there is no erroneous result among the data calculation results corresponding to the 3rd second check value, c_3 = y_35 and the check passes.
Let l = 1, i = 6: check whether c_6 = y_38 holds. Although there is no erroneous result among the data calculation results corresponding to the 6th second check value, the first check value y_38 itself was calculated erroneously, so c_6 ≠ y_38 and the check fails; since l > 0, the branch indices are checked.
Let l = 0, i = 5: check whether c_5 = y_37 holds. Since there is no erroneous result among the corresponding data calculation results, the check passes.
Let l = 0, i = 7: check whether c_7 = y_39 holds. Since there is no erroneous result among the corresponding data calculation results, the check passes.
Let l = 2, i = 12: check whether c_12 = y_44 holds. Because one of the data calculation results it covers is erroneous, the check fails; since l > 0, the branch indices are checked.
Let l = 1, i = 10: check whether c_10 = y_42 holds. Since there is no erroneous result among the corresponding data calculation results, the check passes.
Let l = 1, i = 14: check whether c_14 = y_46 holds. Because one of the data calculation results it covers is erroneous, the check fails; since l > 0, the branch indices are checked.
Let l = 0, i = 13: check whether c_13 = y_45 holds. Since there is no erroneous result among the corresponding data calculation results, the check passes.
Let l = 0, i = 15: check whether c_15 = y_47 holds. Because a data calculation result in the eighth data block is erroneous, the check fails; since l = 0, the error block number set is updated to S = {1, 8}, indicating that the first data block and the eighth data block are error blocks.
3) Output the updated error block number set S = {1, 8}.
It can be seen that, in the above application example, the sequence numbers of all the error blocks are found in the checking process, and the calculation error of the first check value y_38 does not affect the checking result. According to the error block number set S = {1, 8}, it is known that the data calculation results corresponding to the second to seventh data blocks are correct, and that there are errors in the data calculation results corresponding to the first and eighth data blocks.
The detected error blocks are then recalculated to realize error correction. According to the error block number set S = {1, 8}, the recalculation redundancy modulus is t = ⌊47/(2×4)⌋ = 5, so the corresponding data blocks W_1 and W_8 are each copied 5 times and spliced to define the recalculation matrix as follows:
F = (W_1^T, W_1^T, W_1^T, W_1^T, W_1^T, W_8^T, W_8^T, W_8^T, W_8^T, W_8^T)^T; (19)
The scale of the recalculation matrix F, 40 × 32, is slightly smaller than the 47 × 32 scale of the check matrix C. The general matrix multiplication of the recalculation matrix F with the input vector X is performed on the calculation unit array to obtain the recalculation vector:
Y_F = F·X; (20)
The recalculation vector Y_F is a 40 × 1 column vector in which every 4 components belong to one copy of an error block: the first 5 groups are recalculation results of the first data block W_1, and the last 5 groups are recalculation results of the eighth data block W_8. The values of the recalculation vector Y_F can therefore be regarded as 5-modulo redundancy of the data calculation results of W_1 and W_8. Like the calculation of the check vector Y, the calculation of the recalculation vector Y_F may itself contain calculation errors. The final calculation results of the first and eighth data blocks are then obtained by voting, following the flow shown in Fig. 6. The unshaded partitions in Fig. 6 represent multiple identical recalculation results, which are taken as the voted correct results, i.e. the final calculation results, while the differently shaded partitions represent differing recalculation results, which are erroneous results.
Fig. 6 shows a 5-modulo majority voting system: when 3 or more of the 5 inputs are identical, denoted 3/5(G) in Fig. 6, that identical value is taken as the output. It should be noted that each block of the recalculation vector contains 4 components, corresponding to the 4 data calculation results of that data block, and the 5-modulo redundancy votes on each component separately.
For example, when voting on the first component of one of the error blocks, the inputs are the first components of the 5 corresponding copies in the recalculation vector; in the flow of Fig. 6, the calculation result of the 5th copy is wrong and the other four are correct, so 4 of the 5 redundant inputs are identical and the correct result is output. Likewise, when voting on the 2nd component, the inputs are the 2nd components of the 5 copies; in Fig. 6 the 2nd copy is wrong and the other four are correct, so again 4 of the 5 inputs are identical and the correct result is output. Although several of the 5 copies corresponding to this error block contain calculation errors, at most one copy is erroneous at each component position. By contrast, for the other error block in the above example, two of its five recalculated copies have errors at the same component position, so in the vote for that component 2 of the 5 inputs are erroneous while the other 3 identical inputs are error-free, and that identical value is taken as the final value of the component.
The data calculation results of the 2nd to 7th data blocks in the check vector Y are kept unchanged, and the data calculation results of the 1st and 8th data blocks are replaced by the final calculation results obtained by voting on the recalculation vector, so that the correct target vector is obtained and fault-tolerant computation of the general matrix multiplication is realized.
The fault-tolerant computing method applied to the artificial intelligence question-answering provided by the embodiment of the invention is described below with reference to the accompanying drawings.
Fig. 7 is a flowchart of a third fault tolerant computing method according to an embodiment of the present invention.
As shown in fig. 7, the third fault-tolerant computing method provided by the embodiment of the invention is applied to artificial intelligence question answering, and includes:
S701: and receiving input question information, wherein the question information comprises prompt words.
S702: and converting the prompt word into a word vector, and inputting the word vector into a target language model for iterative computation.
S703: in the propagation calculation of the current iteration calculation of the target language model, after a redundant matrix is generated according to the weight matrix of the current layer, carrying out general matrix multiplication calculation on the weight matrix, the redundant matrix and the input word vector of the current layer respectively to correspondingly obtain a data calculation result and a first check value of the propagation calculation of the current layer, generating a second check value from the data calculation result, checking the data calculation result of the current layer according to the consistency comparison result of the second check value and the corresponding first check value, entering the propagation calculation of the next layer if the data calculation result of the current layer passes the check, and otherwise carrying out error correction processing on the data calculation result of the current layer.
Wherein generating a redundancy matrix from the weight matrix comprises: dividing the weight matrix into a plurality of data blocks according to rows, generating a plurality of different redundant vectors according to the data blocks, and forming each redundant vector into a redundant matrix; one redundancy vector corresponds to one or more data blocks, and redundancy values in the redundancy vector are column element sums of corresponding columns in the corresponding data blocks; the second check value corresponds to a method of calculating the redundancy value.
S704: and after the propagation calculation of each layer of the target language model is completed, obtaining a generation result of the current iterative calculation corresponding to the problem information.
A large language model (Large Language Model, LLM), which is a deep learning model trained on massive text data, has the core capabilities of generating natural language text and deeply understanding text meanings. Such models are capable of performing a variety of natural language processing tasks, such as text abstracts, questions and answers, translations, and the like.
The design of large language models is to simulate the language understanding and generating capabilities of humans, from which they typically train on large data sets, from which language structures, grammars and context information are learned.
In the embodiment of the present invention, for S701, a large language model is trained based on sample data, when iterative computation is performed, input problem information, that is, sample data, includes prompt words (prompt), and after forward propagation computation and backward propagation computation are performed, a current iterative computation generation result is generated, and after an iteration end condition is reached, a large language model with well-adjusted model parameters is obtained.
When executing the artificial intelligence question-answering task, the question information input by the user is usually received, the question information comprises a prompt word, and the prompt word is input into a large language model to perform reasoning calculation for preset times. In the process of carrying out reasoning calculation by utilizing the large language model, a new mark (token) is generated by inputting the prompt word into the large language model, the newly generated mark and the prompt word are spliced together and then are input into the model to regenerate a new mark, and the sequence generated by analogy is response information.
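A minimal sketch of this generate-append-repeat loop, under the assumption of a hypothetical model interface model.next_token that returns the next token id; it is not the patent's implementation.

```python
def answer(prompt_tokens, model, max_new_tokens=64, eos_id=None):
    """Greedy decoding loop: feed the prompt plus all tokens generated so far back into
    the model, append the newly generated token, and stop at eos or the length limit."""
    tokens = list(prompt_tokens)
    generated = []
    for _ in range(max_new_tokens):
        next_id = model.next_token(tokens)   # assumed interface, one propagation pass
        if eos_id is not None and next_id == eos_id:
            break
        tokens.append(next_id)
        generated.append(next_id)
    return generated
```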
And S702, converting the prompt words in the input problem information into a vector form to obtain word vectors meeting the model calculation requirements of the target language model, and inputting the word vectors into the target language model for iterative calculation.
The specific embodiment of S703 may refer to the description of the specific embodiment of S103.
For S704, after error detection and error correction of the general matrix multiplication in the propagation computation of the target language model are performed through S703, the propagation computation of each layer is carried out in turn, and after the propagation computation of every layer is completed, the generation result of the current iterative computation corresponding to the question information is obtained. If the model is in the training process, the generation result of the current iterative computation corresponding to the question information is obtained after the forward propagation computation of each layer and then the backward propagation computation of each layer are completed. If the model is in the inference process, the generation result of the current iterative computation corresponding to the question information is obtained after the forward propagation computation of each layer is completed.
On the basis of the above embodiment, in S703, dividing the weight matrix into a plurality of data blocks according to rows, generating a plurality of different redundancy vectors according to the data blocks, and forming each redundancy vector into a redundancy matrix may include:
The weight matrix is uniformly divided by rows into 2^p data blocks;
Calculating, for each single data block, the column sums of its elements as the redundancy values of the corresponding columns, so as to obtain in turn the redundancy vectors corresponding to single data blocks; calculating, for non-overlapping groups of data blocks, the column sums of their elements as the redundancy values of the corresponding columns, so as to obtain in turn the corresponding redundancy vectors; and so on, until the redundancy vector whose column sums cover all the data blocks is obtained;
Splicing the redundant vectors up and down to obtain a redundant matrix;
Where p is a blocking factor and p is a positive integer.
In some alternative implementations of the embodiments of the invention, the redundancy matrix construction process includes constructing the redundancy matrix R of 2^{p+1}-1 rows and d columns as follows:
For j = 1, 2, ……, d,
when i is odd, r_{i,j} = Σ_{q=(k-1)h+1}^{min(kh,n)} w_{q,j}, where i = 2k-1, k = 1, 2, ……, 2^p;
when i is even, r_{i,j} = r_{i-2^{l-1},j} + r_{i+2^{l-1},j}, where l satisfies 2^l | i and 2^{l+1} ∤ i;
wherein the weight matrix comprises n rows and d columns; k is the sequence number of a data block, k = 1, 2, ……, 2^p; the kth data block corresponds to rows (k-1)h+1 to kh of the weight matrix W, and if kh exceeds the number of rows n, the block ends at the nth row; h is the number of rows of a data block, h = ⌈n/2^p⌉, where ⌈·⌉ is a round-up operation; r_{i,j} is the element of row i and column j of the redundancy matrix; w_{q,j} is the element of row q and column j of the weight matrix; l is an exponential parameter; p is the blocking factor; and 2^l | i denotes that 2^l exactly divides i, i.e. 2^l divides i but 2^{l+1} does not.
In some optional implementations of the embodiments of the present invention, in S703, performing general matrix multiplication computation on the weight matrix, the redundancy matrix, and the input word vector of the current layer, to obtain a data computation result and a first check value of propagation computation of the current layer, where the data computation result and the first check value may include:
Splicing the redundant matrix below the weight matrix to obtain a check matrix;
performing general matrix multiplication calculation on the check matrix and the input word vector to obtain a check vector, wherein the first n components of the check vector are the data calculation results corresponding to the rows of the weight matrix, and the last 2^{p+1}-1 components of the check vector are the first check values corresponding to the rows of the redundancy matrix.
In some optional implementations of the embodiments of the present invention, the data calculation result in S703 does not pass the verification, which may include:
the second check value obtained through calculation according to the data calculation result is inconsistent with the corresponding first check value;
the second check value is inconsistent with the corresponding first check value and is expressed by the following formula:
y_{n+i} ≠ Σ_{q=((i-2^m)/2)·h+1}^{min(((i+2^m)/2)·h, n)} y_q, where m satisfies 2^m | i and 2^{m+1} ∤ i;
wherein y_{n+i} is the (n+i)th component of the check vector, y_q is the qth component of the check vector, h is the number of rows of a data block, n is the number of rows of the weight matrix, m is an exponential parameter, 2^m | i denotes that 2^m exactly divides i, i.e. 2^m divides i but 2^{m+1} does not, and p is the blocking factor.
In some optional implementations of the embodiments of the present invention, generating the second check value from the data calculation result in S703 includes recursively calculating the second check value corresponding to the data chunk using:
when i is odd, c_i = Σ_{q=(k-1)h+1}^{min(kh,n)} y_q, where i = 2k-1, k = 1, 2, ……, 2^p;
when i is even, c_i = c_{i-2^{l-1}} + c_{i+2^{l-1}}, where l satisfies 2^l | i and 2^{l+1} ∤ i;
wherein c_i is the second check value calculated from the data calculation results corresponding to the ith first check value in the check vector (i.e. to component y_{n+i}), y_q is the qth component of the check vector, h is the number of rows of a data block, n is the number of rows of the weight matrix, l is an exponential parameter, 2^l | i denotes that 2^l exactly divides i, and p is the blocking factor.
In other alternative implementations of the embodiments of the present invention, generating the second check value from the data calculation result in S703 includes:
And calculating in parallel according to the calculation method of the redundancy value to obtain a second check value corresponding to each first check value.
In some optional implementations of the embodiments of the present invention, verifying the data calculation result of the current layer according to the consistency comparison result of the second check value and the corresponding first check value in S703 may include:
If the second check values corresponding to all the data blocks are consistent with the corresponding first check values, determining that the first check values corresponding to all the data blocks pass the check, and further determining that the data calculation results of the current layer pass the check;
If the second check values corresponding to the data blocks are not all consistent with the corresponding first check values, performing step-by-step checking according to the first check values;
in the step-by-step verification process, if the current first verification value passes the verification, the branch index of the first verification value is not verified; if the current first check value does not pass the check and the first check value has a branch index, checking the branch index of the first check value; if the current first check value does not pass the check and the first check value does not have a branch index, determining that the data block corresponding to the first check value is an error block; if the current first check value does not pass the check but all branch indexes of the first check value pass the check, setting the current first check value as passing the check;
determining that the data calculation result corresponding to the error block is an error result, and determining that the data calculation result corresponding to the first check value passing the check is a correct result;
Wherein the branch indices of the first check value with sequence number i are the first check values with sequence numbers i-2^{l-1} and i+2^{l-1}, where i is the sequence number of the current first check value, l is an exponential parameter, and l satisfies 2^l | i and 2^{l+1} ∤ i, 2^l | i denoting that 2^l exactly divides i.
In some optional implementations of the embodiments of the present invention, the blocking factor is determined according to a computing resource of a computing device performing the propagation computation of the current layer and a number of rows of the weight matrix, so that a sum of the number of rows of the weight matrix and the number of rows of the redundancy matrix is less than or equal to a maximum computing resource provided by the computing device for the propagation computation of the current layer.
In some optional implementations of the embodiments of the present invention, performing error correction processing on the data calculation result of the current layer in S703 may include:
determining that the corresponding data block is an error block when the second check value is inconsistent with the corresponding first check value;
dividing each error block into recalculation matrixes;
carrying out general matrix multiplication calculation on the recalculation matrix and the input word vector of the current layer to obtain a recalculation result;
and carrying out error correction processing on the data calculation result of the propagation calculation of the current layer according to the recalculation result.
Wherein partitioning each error into blocks generates a recalculation matrix may comprise:
copying each error block into a plurality of copies and then splicing to obtain a recalculation matrix;
performing error correction processing on the data calculation result of the propagation calculation of the current layer according to the recalculation result, including:
and replacing the data calculation result corresponding to the error block by using the final calculation result according to the final calculation result corresponding to the error block, wherein the result with the largest occurrence number in the recalculation result corresponding to the error block is the final calculation result corresponding to the error block.
The method for obtaining the recalculation matrix by splicing after copying each error block into a plurality of copies can comprise the following steps:
Obtaining a recalculated redundancy modulus;
And copying the error blocks for multiple times according to the recalculation redundancy modulus, and then splicing the error blocks into a recalculation matrix, so that the number of each error block in the recalculation matrix is the recalculation redundancy modulus.
Wherein, the recalculation redundancy modulus can be obtained by: t = ⌊(n+2^{p+1}-1)/(s·h)⌋;
wherein t is the recalculation redundancy modulus, n is the number of rows of the weight matrix, s is the number of error blocks, h is the number of rows of a data block, h = ⌈n/2^p⌉, p is the blocking factor, ⌊·⌋ is a round-down operation, and ⌈·⌉ is a round-up operation.
In the embodiment of the present invention, before performing general matrix multiplication calculation on the recalculation matrix and the input word vector of the current layer to obtain the recalculation result, the fault-tolerant calculation method may further include:
judging whether the number of the lines of the recalculation matrix exceeds the sum of the number of the lines of the weight matrix and the number of the lines of the redundancy matrix;
If yes, returning to the step of generating a redundant matrix according to the weight matrix of the current layer;
if not, the step of performing general matrix multiplication calculation on the recalculated matrix and the input word vector of the current layer is entered.
In the embodiment of the present invention, the fault-tolerant computing method may further include:
If the number of times of returning to the step of generating the redundant matrix according to the weight matrix of the current layer exceeds the recalculation threshold value, stopping iterative calculation of the target language model and outputting error reporting information of the computing equipment fault.
In other optional implementations of the embodiments of the present invention, in S703, performing general matrix multiplication on the weight matrix, the redundancy matrix, and the input word vector of the current layer, to obtain a data calculation result of propagation calculation of the current layer and a first check value, which may include:
the general matrix multiplication of the weight matrix and the input word vector of the current layer and the general matrix multiplication of the redundancy matrix and the input word vector of the current layer are performed in parallel.
It should be noted that, in the embodiments of the fault tolerant computing methods of the present invention, some of the steps or features may be omitted or not performed. The divided hardware or software functional modules are not the only implementation form for realizing the fault-tolerant computing method provided by the embodiment of the invention.
Various embodiments of fault-tolerant computing methods are detailed above, and on the basis of the embodiments, the invention also discloses a fault-tolerant computing device, equipment, a nonvolatile storage medium and a computer program product corresponding to the methods.
Fig. 8 is a schematic structural diagram of a fault tolerant computing device according to an embodiment of the present invention.
As shown in fig. 8, the fault tolerant computing apparatus provided by the embodiment of the present invention includes:
An information determining unit 801, configured to determine information of a storage device and information of a computing device according to a model parallel computing task, and then read data to be processed corresponding to the model parallel computing task from the storage device;
a second receiving unit 802, configured to convert data to be processed into an input vector, and input the input vector into a computing device configured with model parameters of a target model for iterative computation;
The second calculating unit 803 is configured to, in the propagation calculation of the current iterative calculation of the target model, generate a redundancy matrix according to the weight matrix of the current layer, and then perform general matrix multiplication calculation on the weight matrix, the redundancy matrix and the input vector of the current layer, respectively, to obtain a data calculation result of the propagation calculation of the current layer and a first check value, generate a second check value from the data calculation result, and perform a check on the data calculation result of the current layer according to a consistency comparison result of the second check value and the corresponding first check value, and enter the propagation calculation of the next layer if the data calculation result of the current layer passes the check, otherwise perform an error correction process on the data calculation result of the current layer; after the propagation calculation of each layer of the target model is completed, a generation result corresponding to the data to be processed in the current iterative calculation is obtained;
Wherein generating a redundancy matrix from the weight matrix comprises: dividing the weight matrix into a plurality of data blocks according to rows, generating a plurality of different redundant vectors according to the data blocks, and forming each redundant vector into a redundant matrix; one redundancy vector corresponds to one or more data blocks, and redundancy values in the redundancy vector are column element sums of corresponding columns in the corresponding data blocks; the second check value corresponds to a method of calculating the redundancy value.
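To make the checksum idea behind the second calculating unit concrete, here is a minimal NumPy sketch that assumes a simplified layout with one redundancy vector per data block (the claimed scheme additionally builds redundancy vectors covering groups of blocks) and a floating-point tolerance for the consistency comparison; all names and tolerance values are illustrative assumptions.

```python
import numpy as np

def build_redundancy_matrix(W: np.ndarray, h: int) -> np.ndarray:
    """Split W into row blocks of at most h rows; each redundancy vector is the
    column-wise sum of one block (a simplification of the claimed hierarchy)."""
    return np.vstack([W[k:k + h].sum(axis=0) for k in range(0, W.shape[0], h)])

def check_layer(W: np.ndarray, R: np.ndarray, x: np.ndarray, h: int):
    """Splice R under W, multiply once, then recompute the block sums of the
    data result as second check values and compare them with the first ones."""
    n = W.shape[0]
    v = np.vstack([W, R]) @ x                    # check matrix times input vector
    y, first_check = v[:n], v[n:]                # data result, first check values
    second_check = np.array([y[k:k + h].sum() for k in range(0, n, h)])
    return y, np.allclose(first_check, second_check, rtol=1e-5, atol=1e-6)
```

Splicing the redundancy matrix under the weight matrix so that a single general matrix multiplication yields both the data results and the first check values mirrors the check-matrix construction described in the claims below.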
The fault-tolerant computing device provided by the embodiment of the invention can further comprise:
The second judging unit is configured to judge, before the recalculation matrix and the input word vector of the current layer are subjected to general matrix multiplication calculation to obtain a recalculation result, whether the number of rows of the recalculation matrix exceeds the sum of the number of rows of the weight matrix and the number of rows of the redundancy matrix; if yes, return to the step of generating a redundancy matrix according to the weight matrix of the current layer; if not, enter the step of performing general matrix multiplication calculation on the recalculation matrix and the input word vector of the current layer.
The fault-tolerant computing device provided by the embodiment of the invention can further comprise:
And the second error reporting unit is used for stopping iterative computation of the target language model and outputting error reporting information of the computing equipment fault if the number of times of returning to the step of generating the redundant matrix according to the weight matrix of the current layer exceeds the recalculation threshold value.
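The error correction processing referred to by the computing units above is detailed later in the claims: each erroneous data block is copied several times into a recalculation matrix, recomputed, and the most frequent result is kept. Below is a minimal sketch of that majority-vote step, with the function name, the vote granularity (whole-block results), and the parameter t treated as illustrative assumptions.

```python
from collections import Counter

import numpy as np

def correct_error_block(error_block: np.ndarray, x: np.ndarray, t: int) -> np.ndarray:
    """Splice t copies of an erroneous weight block into a recalculation matrix,
    multiply it by the layer input once, and keep the most frequent of the t
    candidate results as the corrected output for that block."""
    h = error_block.shape[0]
    recalc = np.vstack([error_block] * t)        # recalculation matrix
    candidates = (recalc @ x).reshape(t, h)      # t candidate results for the block
    best, _ = Counter(tuple(row) for row in candidates).most_common(1)[0]
    return np.asarray(best)
```

In the claimed method, the number of copies is bounded so that the recalculation matrix never has more rows than the weight matrix plus the redundancy matrix; if that bound would be exceeded, the redundancy matrix is regenerated and the layer is recomputed, which is exactly what the judging units above check.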
Fig. 9 is a schematic structural diagram of another fault tolerant computing device according to an embodiment of the present invention.
As shown in fig. 9, another fault-tolerant computing device provided by an embodiment of the present invention, applied to artificial intelligence question answering, includes:
A first receiving unit 901, configured to receive input question information, where the question information includes a prompt word;
a first conversion unit 902, configured to convert the hint word into a word vector, and input the word vector into a target language model for iterative computation;
The first computing unit 903 is configured to, in a propagation computation of a current iterative computation of the target language model, generate a redundancy matrix according to a weight matrix of a current layer, and then perform general matrix multiplication computation on the weight matrix, the redundancy matrix, and an input word vector of the current layer, respectively, to obtain a data computation result of the propagation computation of the current layer and a first check value, generate a second check value from the data computation result, and perform a check on the data computation result of the current layer according to a consistency comparison result of the second check value and the corresponding first check value, and enter a propagation computation of a next layer if the data computation result of the current layer passes the check, otherwise perform an error correction process on the data computation result of the current layer; after the propagation calculation of each layer of the target language model is completed, a generation result corresponding to the problem information of the current iterative calculation is obtained;
Wherein generating a redundancy matrix from the weight matrix comprises: dividing the weight matrix into a plurality of data blocks according to rows, generating a plurality of different redundant vectors according to the data blocks, and forming each redundant vector into a redundant matrix; one redundancy vector corresponds to one or more data blocks, and redundancy values in the redundancy vector are column element sums of corresponding columns in the corresponding data blocks; the second check value corresponds to a method of calculating the redundancy value.
The fault-tolerant computing device provided by the embodiment of the invention can further comprise:
The first judging unit is configured to judge, before the recalculation matrix and the input word vector of the current layer are subjected to general matrix multiplication calculation to obtain a recalculation result, whether the number of rows of the recalculation matrix exceeds the sum of the number of rows of the weight matrix and the number of rows of the redundancy matrix; if yes, return to the step of generating a redundancy matrix according to the weight matrix of the current layer; if not, enter the step of performing general matrix multiplication calculation on the recalculation matrix and the input word vector of the current layer.
The fault-tolerant computing device provided by the embodiment of the invention can further comprise:
And the first error reporting unit is used for stopping iterative computation of the target language model and outputting error reporting information of the computing equipment fault if the number of times of returning to the step of generating the redundant matrix according to the weight matrix of the current layer exceeds a recalculation threshold value.
It should be noted that, in each implementation of the fault-tolerant computing device provided by the embodiments of the present invention, the division into units is only one kind of logical functional division, and other division manners may be adopted. The connections between different units may be electrical, mechanical, or of another form. Separate units may be located in the same physical location or distributed across multiple network nodes. The units may be implemented in hardware or as software functional units. Some or all of the units provided by the embodiments of the present invention may be selected according to actual needs, and connected or integrated accordingly, to achieve the purpose of the solution of the embodiments of the present invention.
Since the embodiments of the apparatus portion correspond to the embodiments of the method portion, reference is made to the description of the method embodiments for details of the apparatus embodiments, which are not repeated here.
Fig. 10 is a schematic structural diagram of a fault tolerant computing device according to an embodiment of the present invention.
As shown in fig. 10, the fault tolerant computing device provided by the embodiment of the invention includes:
a memory 1010 for storing a computer program 1011;
A processor 1020 for executing a computer program 1011, which computer program 1011, when executed by the processor 1020, implements the steps of the fault tolerant computing method provided by any one of the embodiments described above.
Processor 1020 may include one or more processing cores, for example a 3-core processor or an 8-core processor. The processor 1020 may be implemented in hardware in at least one of a Digital Signal Processor (DSP), a Field-Programmable Gate Array (FPGA), and a Programmable Logic Array (PLA). The processor 1020 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in the awake state, also called a Central Processing Unit (CPU), and the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the processor 1020 may be integrated with a Graphics Processing Unit (GPU), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 1020 may also include an Artificial Intelligence (AI) processor for processing computing operations related to machine learning.
Memory 1010 may include one or more non-volatile storage media, which may be non-transitory. Memory 1010 may also include high-speed random access memory as well as non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In this embodiment, the memory 1010 is at least used for storing a computer program 1011, which, when loaded and executed by the processor 1020, is capable of implementing the relevant steps of the fault-tolerant computing method disclosed in any of the foregoing embodiments. In addition, the resources stored in the memory 1010 may also include an operating system 1012, data 1013, and the like, and the storage manner may be transient or permanent. The operating system 1012 may be Windows, Linux, or another type of operating system. Data 1013 may include, but is not limited to, data related to the above-described method.
In some embodiments, the fault tolerant computing device may also include a display 1030, a power supply 1040, a communication interface 1050, an input-output interface 1060, sensors 1070, and a communication bus 1080.
Those skilled in the art will appreciate that the architecture shown in FIG. 10 is not limiting of a fault tolerant computing device and may include more or fewer components than shown.
The fault-tolerant computing device provided by the embodiment of the invention comprises a memory and a processor, wherein the processor can realize the steps of the fault-tolerant computing method provided by the embodiment when executing the program stored in the memory, and the effects are the same as the above.
An embodiment of the present invention provides a non-volatile storage medium having stored thereon a computer program which, when executed by a processor, can implement the steps of the fault tolerant computing method provided in any of the embodiments described above.
The nonvolatile storage medium may include: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
For the introduction of the non-volatile storage medium provided by the embodiment of the present invention, please refer to the above method embodiment, and the effect of the method is the same as that of the fault tolerance calculation method provided by the embodiment of the present invention, and the disclosure is not repeated here.
Embodiments of the present invention provide a computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the fault tolerant computing method provided by any of the embodiments described above.
For the introduction of the computer program product provided by the embodiment of the present invention, please refer to the above method embodiment, and the effect thereof is the same as that of the fault tolerance calculation method provided by the embodiment of the present invention, and the disclosure is not repeated here.
The fault-tolerant computing method, device, equipment, medium and computer program product provided by the invention are described in detail above. In this specification, each embodiment is described in a progressive manner, each embodiment focuses on its differences from the other embodiments, and the same or similar parts among the embodiments may be referred to each other. Since the apparatus, device, non-volatile storage medium, and computer program product disclosed in the embodiments correspond to the methods disclosed in the embodiments, they are described relatively briefly; for relevant details, refer to the description of the method section. It should be noted that it will be apparent to those skilled in the art that the present invention may be modified and practiced without departing from the spirit of the present invention.
It should also be noted that in this specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

Claims (22)

1. A fault tolerant computing method, applied to artificial intelligence questions and answers, comprising:
Receiving input problem information, wherein the problem information comprises prompt words;
Converting the prompt word into a word vector, and inputting the word vector into a target language model for iterative computation;
In the propagation calculation of the current iteration calculation of the target language model, after a redundant matrix is generated according to a weight matrix of a current layer, carrying out general matrix multiplication calculation on the weight matrix, the redundant matrix and an input word vector of the current layer respectively to correspondingly obtain a data calculation result and a first check value of the propagation calculation of the current layer, generating a second check value from the data calculation result, checking the data calculation result of the current layer according to a consistency comparison result of the second check value and the corresponding first check value, entering the propagation calculation of the next layer if the data calculation result of the current layer passes the check, otherwise, carrying out error correction treatment on the data calculation result of the current layer;
After the propagation calculation of each layer of the target language model is completed, a generation result corresponding to the problem information of the current iterative calculation is obtained;
Wherein generating the redundancy matrix according to the weight matrix includes: dividing the weight matrix into a plurality of data blocks according to rows, generating a plurality of different redundant vectors according to the data blocks, and forming the redundant vectors into the redundant matrix; one redundancy vector corresponds to one or more data blocks, and redundancy values in the redundancy vector are column element sums of corresponding columns in the corresponding data blocks; the second check value corresponds to a method of calculating the redundancy value.
2. The fault tolerant computing method of claim 1, wherein dividing the weight matrix into a plurality of data blocks by rows, generating a plurality of different redundancy vectors from the data blocks, and forming the redundancy vectors into the redundancy matrix comprises:
uniformly dividing the weight matrix by rows into data blocks, the number of which is determined by the blocking factor p;
calculating, for a single data block, the column element sums of same-column elements as the redundancy values of the corresponding columns, thereby obtaining the redundancy vector corresponding to that single data block; calculating, for non-repeating groups of two data blocks, the column element sums of same-column elements as the redundancy values of the corresponding columns, thereby obtaining the corresponding redundancy vectors; and so on, until the redundancy vector corresponding to all the data blocks is obtained;
splicing the redundancy vectors one above another to obtain the redundancy matrix;
where p is a blocking factor and p is a positive integer.
3. The fault tolerant computing method of claim 2, wherein the redundancy matrix construction process comprises constructing a redundancy matrix with d columns, whose elements are generated as follows:
for j = 1, 2, ……, d:
when the row index of the redundancy matrix is an odd number 2k−1, the element in that row and column j is the column element sum of column j over the rows of the weight matrix belonging to the k-th data block;
when the row index of the redundancy matrix is an even number, the element in that row and column j is the sum of two previously constructed redundancy-matrix elements in column j, the row indexes of those two elements being determined by an exponent parameter l that satisfies an exact-division condition on the current row index;
wherein the weight matrix comprises n rows and d columns; k is the sequence number of the data block; the k-th data block corresponds to rows (k−1)h+1 to kh of the weight matrix, and if the row number n of the last row of the last data block is less than kh, the block extends only to the n-th row; h is the number of rows of a data block; l is an exponent parameter; p is the blocking factor; and ⌈·⌉ denotes a round-up operation.
4. The fault-tolerant computing method of claim 3, wherein performing general matrix multiplication on the weight matrix, the redundancy matrix, and the current layer input word vector, respectively, to obtain a data calculation result of propagation calculation of the current layer and a first check value, includes:
splicing the redundant matrix below the weight matrix to obtain a check matrix;
performing general matrix multiplication calculation on the check matrix and the input word vector to obtain a check vector, wherein the first n components of the check vector are the data calculation results corresponding to the rows of the weight matrix, and the remaining components of the check vector are the first check values corresponding to the rows of the redundancy matrix.
5. The fault tolerant computing method of claim 4, wherein the data computation result fails a check, comprising:
The second check value obtained by calculation according to the data calculation result is inconsistent with the corresponding first check value;
The second check value being inconsistent with the corresponding first check value is expressed by a formula over the components of the check vector, in which a first check value (a component of the check vector after the first n components) is compared with the corresponding second check value recomputed as a sum of components among the first n components; in that formula, h is the number of rows of one data block, n is the number of rows of the weight matrix, m is an exponent parameter satisfying an exact-division condition, and p is a blocking factor.
6. The fault tolerant computing method of claim 4, wherein generating the second check value from the data computation result comprises recursively computing the second check values for the data blocks as follows:
when the index of the check value is an odd number 2k−1, the second check value is the sum of the data calculation results (components among the first n components of the check vector) over the rows belonging to the k-th data block;
when the index of the check value is an even number, the second check value is the sum of two previously computed second check values, the indexes of those two values being determined by an exponent parameter l that satisfies an exact-division condition on the current index;
wherein h is the number of rows of one data block, n is the number of rows of the weight matrix, l is an exponent parameter, and p is a blocking factor.
7. The fault tolerant computing method of claim 1 wherein generating the second check value from the data calculation result comprises:
and calculating the second check value corresponding to each first check value in parallel according to the calculation method of the redundancy value.
8. The fault tolerant computing method of claim 4, wherein verifying the data computation of the current layer based on a comparison of the consistency of the second check value with the corresponding first check value comprises:
if the second check values corresponding to all the data blocks are consistent with the first check values corresponding to all the data blocks, determining that the first check values corresponding to all the data blocks pass the check, and further determining that the data calculation results of the current layer pass the check;
If the second check values corresponding to the data blocks are not all consistent with their corresponding first check values, performing a step-by-step check according to the first check values;
In the step-by-step verification process, if the current first verification value passes the verification, the branch index of the first verification value is not verified; if the current first check value does not pass the check and the first check value has the branch index, checking the branch index of the first check value; if the current first check value does not pass the check and the first check value does not have the branch index, determining that the data block corresponding to the first check value is an error block; if the current first check value does not pass the check but all branch indexes of the first check value pass the check, setting the current first check value as passing the check;
Determining that the data calculation result corresponding to the error block is an error result, and determining that the data calculation result corresponding to the first check value passing the check is a correct result;
wherein the branch indexes of a first check value are the two first check values whose sequence numbers are offset from the current sequence number of that first check value by an amount determined by an exponent parameter l, where l satisfies an exact-division condition on the current sequence number.
9. The fault tolerant computing method of claim 2 wherein the blocking factor is determined based on computing resources of a computing device performing the propagation computation of the current layer and the number of rows of the weight matrix such that a sum of the number of rows of the weight matrix and the number of rows of the redundancy matrix is less than or equal to a maximum computing resource provided by the computing device for the propagation computation of the current layer.
10. The fault-tolerant computing method of claim 1, wherein performing error correction processing on the data computation result of a current layer comprises:
determining that the corresponding data block is an error block when the second check value is inconsistent with the corresponding first check value;
generating a recalculation matrix from each of the error blocks;
Performing general matrix multiplication calculation on the recalculation matrix and the input word vector of the current layer to obtain a recalculation result;
And carrying out error correction processing on the data calculation result of the propagation calculation of the current layer according to the recalculation result.
11. The fault tolerant computing method of claim 10, wherein generating the recalculation matrix from each of the error blocks comprises:
Copying each error block into a plurality of copies and then splicing to obtain the recalculation matrix;
The error correction processing is carried out on the data calculation result of the propagation calculation of the current layer according to the recalculation result, and the error correction processing comprises the following steps:
And taking the result with the largest number of occurrences among the recalculation results corresponding to the error block as the final calculation result corresponding to the error block, and replacing the data calculation result corresponding to the error block with the final calculation result.
12. The fault tolerant computing method of claim 11, wherein copying each of the error blocks into a plurality of copies and then splicing to obtain the recalculation matrix comprises:
Obtaining a recalculated redundancy modulus;
And copying each of the error blocks multiple times according to the recalculated redundancy modulus, and then splicing the copies into the recalculation matrix, so that the number of copies of each error block in the recalculation matrix is the recalculated redundancy modulus.
13. The fault tolerant computing method of claim 12, further comprising, prior to said performing a common matrix multiplication of said recalculation matrix with said input word vector of the current layer to obtain a recalculation result:
judging whether the number of rows of the recalculation matrix exceeds the sum of the number of rows of the weight matrix and the number of rows of the redundancy matrix;
If yes, returning to the step of generating a redundant matrix according to the weight matrix of the current layer;
if not, the step of performing general matrix multiplication calculation on the recalculation matrix and the input word vector of the current layer is entered.
14. The fault tolerant computing method of claim 13, further comprising:
and if the number of times of returning to the step of generating the redundant matrix according to the weight matrix of the current layer exceeds a recalculation threshold value, stopping iterative calculation of the target language model and outputting error reporting information of the computing equipment fault.
15. The fault tolerant computing method of claim 12 wherein the recalculated redundancy modulus is obtained by:
wherein t is the recalculated redundancy modulus, obtained by a round-down operation over an expression in n, S, h and p; n is the number of rows of the weight matrix, S is the number of the error blocks, h is the number of rows of a data block, and p is a blocking factor.
16. The fault-tolerant computing method according to claim 1, wherein the performing general matrix multiplication on the weight matrix, the redundancy matrix, and the input word vector of the current layer respectively, to obtain a data calculation result of propagation calculation of the current layer and a first check value, includes:
performing, in parallel, the general matrix multiplication calculation of the weight matrix with the input word vector of the current layer and the general matrix multiplication calculation of the redundancy matrix with the input word vector of the current layer.
17. A fault tolerant computing method, comprising:
After determining information of a storage device and information of a computing device according to a model parallel computing task, reading data to be processed corresponding to the model parallel computing task from the storage device;
Converting the data to be processed into an input vector, and inputting the input vector into the computing equipment deployed with the model parameters of the target model for iterative computation;
In the propagation calculation of the current iteration calculation of the target model, after a redundant matrix is generated according to a weight matrix of a current layer, carrying out general matrix multiplication calculation on the weight matrix, the redundant matrix and an input vector of the current layer respectively to correspondingly obtain a data calculation result and a first check value of the propagation calculation of the current layer, generating a second check value from the data calculation result, checking the data calculation result of the current layer according to a consistency comparison result of the second check value and the corresponding first check value, entering the propagation calculation of the next layer if the data calculation result of the current layer passes the check, otherwise, carrying out error correction treatment on the data calculation result of the current layer;
After the propagation calculation of each layer of the target model is completed, a generation result corresponding to the data to be processed in the current iterative calculation is obtained;
Wherein generating the redundancy matrix according to the weight matrix includes: dividing the weight matrix into a plurality of data blocks according to rows, generating a plurality of different redundant vectors according to the data blocks, and forming the redundant vectors into the redundant matrix; one redundancy vector corresponds to one or more data blocks, and redundancy values in the redundancy vector are column element sums of corresponding columns in the corresponding data blocks; the second check value corresponds to a method of calculating the redundancy value.
18. A fault tolerant computing device for use with an artificial intelligence question and answer, comprising:
the first receiving unit is used for receiving input problem information, wherein the problem information comprises prompt words;
The first conversion unit is used for converting the prompt word into a word vector, and inputting the word vector into a target language model for iterative computation;
The first calculation unit is used for generating a redundant matrix according to a weight matrix of a current layer in the propagation calculation of the current iterative calculation of the target language model, performing general matrix multiplication calculation on the weight matrix, the redundant matrix and an input word vector of the current layer respectively to correspondingly obtain a data calculation result of the propagation calculation of the current layer and a first check value, generating a second check value from the data calculation result, checking the data calculation result of the current layer according to a consistency comparison result of the second check value and the corresponding first check value, entering the propagation calculation of the next layer if the data calculation result of the current layer passes the check, and otherwise, performing error correction processing on the data calculation result of the current layer; after the propagation calculation of each layer of the target language model is completed, a generation result corresponding to the problem information of the current iterative calculation is obtained;
Wherein generating the redundancy matrix according to the weight matrix includes: dividing the weight matrix into a plurality of data blocks according to rows, generating a plurality of different redundant vectors according to the data blocks, and forming the redundant vectors into the redundant matrix; one redundancy vector corresponds to one or more data blocks, and redundancy values in the redundancy vector are column element sums of corresponding columns in the corresponding data blocks; the second check value corresponds to a method of calculating the redundancy value.
19. A fault tolerant computing device, comprising:
The information determining unit is used for reading data to be processed corresponding to the model parallel computing task from the storage device after determining the information of the storage device and the information of the computing device according to the model parallel computing task;
the second receiving unit is used for converting the data to be processed into an input vector, and inputting the input vector into the computing equipment deployed with the model parameters of the target model for iterative computation;
The second calculation unit is used for generating a redundant matrix according to a weight matrix of a current layer in the propagation calculation of the current iteration calculation of the target model, performing general matrix multiplication calculation on the weight matrix, the redundant matrix and an input vector of the current layer respectively to correspondingly obtain a data calculation result of the propagation calculation of the current layer and a first check value, generating a second check value from the data calculation result, checking the data calculation result of the current layer according to a consistency comparison result of the second check value and the corresponding first check value, entering the propagation calculation of the next layer if the data calculation result of the current layer passes the check, and otherwise performing error correction processing on the data calculation result of the current layer; after the propagation calculation of each layer of the target model is completed, a generation result corresponding to the data to be processed in the current iterative calculation is obtained;
Wherein generating the redundancy matrix according to the weight matrix includes: dividing the weight matrix into a plurality of data blocks according to rows, generating a plurality of different redundant vectors according to the data blocks, and forming the redundant vectors into the redundant matrix; one redundancy vector corresponds to one or more data blocks, and redundancy values in the redundancy vector are column element sums of corresponding columns in the corresponding data blocks; the second check value corresponds to a method of calculating the redundancy value.
20. A fault tolerant computing device, comprising:
a memory for storing a computer program;
a processor for executing the computer program, which when executed by the processor implements the steps of the fault tolerant computing method according to any one of claims 1 to 17.
21. A non-volatile storage medium having stored thereon a computer program, which when executed by a processor performs the steps of the fault tolerant computing method of any of claims 1 to 17.
22. A computer program product comprising computer programs/instructions which when executed by a processor implement the steps of the fault tolerant computing method of any of claims 1 to 17.
CN202410683294.6A 2024-05-29 2024-05-29 Fault-tolerant computing method, device, equipment, medium and computer program product Active CN118246438B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410683294.6A CN118246438B (en) 2024-05-29 2024-05-29 Fault-tolerant computing method, device, equipment, medium and computer program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410683294.6A CN118246438B (en) 2024-05-29 2024-05-29 Fault-tolerant computing method, device, equipment, medium and computer program product

Publications (2)

Publication Number Publication Date
CN118246438A CN118246438A (en) 2024-06-25
CN118246438B true CN118246438B (en) 2024-09-20

Family

ID=91564703

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410683294.6A Active CN118246438B (en) 2024-05-29 2024-05-29 Fault-tolerant computing method, device, equipment, medium and computer program product

Country Status (1)

Country Link
CN (1) CN118246438B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111475620A (en) * 2020-04-03 2020-07-31 南京邮电大学 Natural language reasoning method oriented to intelligent question-answering system
CN115357758A (en) * 2022-08-18 2022-11-18 华中师范大学 Intelligent Q matrix generation method, system and terminal fusing topic semantic information

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107122346B (en) * 2016-12-28 2018-02-27 平安科技(深圳)有限公司 The error correction method and device of a kind of read statement
CN117635341A (en) * 2023-12-04 2024-03-01 中国建设银行股份有限公司 Method, apparatus, device and computer readable medium for parameter recommendation
CN118051587A (en) * 2024-02-02 2024-05-17 欧冶云商股份有限公司 Intelligent question-answering method, equipment and medium for steel industry based on large model

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111475620A (en) * 2020-04-03 2020-07-31 南京邮电大学 Natural language reasoning method oriented to intelligent question-answering system
CN115357758A (en) * 2022-08-18 2022-11-18 华中师范大学 Intelligent Q matrix generation method, system and terminal fusing topic semantic information

Also Published As

Publication number Publication date
CN118246438A (en) 2024-06-25


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant