CN118194862B - Text error correction method and device based on fault-tolerant suffix automaton - Google Patents
Text error correction method and device based on fault-tolerant suffix automaton Download PDFInfo
- Publication number
- CN118194862B CN118194862B CN202410410143.3A CN202410410143A CN118194862B CN 118194862 B CN118194862 B CN 118194862B CN 202410410143 A CN202410410143 A CN 202410410143A CN 118194862 B CN118194862 B CN 118194862B
- Authority
- CN
- China
- Prior art keywords
- node
- corrected
- state data
- text
- character
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000012937 correction Methods 0.000 title claims abstract description 97
- 238000000034 method Methods 0.000 title claims abstract description 45
- 230000004048 modification Effects 0.000 claims abstract description 29
- 238000012986 modification Methods 0.000 claims abstract description 29
- 230000007704 transition Effects 0.000 claims description 32
- 238000004891 communication Methods 0.000 claims description 17
- 230000015654 memory Effects 0.000 claims description 14
- 238000004590 computer program Methods 0.000 claims description 7
- 238000012163 sequencing technique Methods 0.000 claims description 4
- 238000004364 calculation method Methods 0.000 abstract description 4
- 238000010586 diagram Methods 0.000 description 7
- 238000010276 construction Methods 0.000 description 6
- 238000012545 processing Methods 0.000 description 5
- 230000009471 action Effects 0.000 description 4
- 230000008569 process Effects 0.000 description 3
- 230000008901 benefit Effects 0.000 description 2
- 238000004422 calculation algorithm Methods 0.000 description 2
- 238000012015 optical character recognition Methods 0.000 description 2
- 125000004122 cyclic group Chemical group 0.000 description 1
- 230000001351 cycling effect Effects 0.000 description 1
- 238000013136 deep learning model Methods 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 230000006870 function Effects 0.000 description 1
- 238000011022 operating instruction Methods 0.000 description 1
- 230000006403 short-term memory Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/901—Indexing; Data structures therefor; Storage structures
- G06F16/9027—Trees
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/166—Editing, e.g. inserting or deleting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a text error correction method and a text error correction device based on a fault-tolerant suffix automaton, wherein the method comprises the following steps: constructing a prefix tree according to a preset word list; the prefix tree comprises a tail node; the end-of-word node records the text and the text length; generating a corresponding fault-tolerant suffix automaton based on a preset editing distance of the text to be corrected; searching based on the prefix tree, and determining an intersection with the fault-tolerant suffix automaton; the intersection comprises a node movement track to be corrected of the fault-tolerant suffix automaton when the node movement track is matched with the tail node of the prefix tree and the error correction length; and determining the node to be corrected with the minimum correction length in the intersection set as a correction modification node, and performing correction modification on the text to be corrected according to the prefix tree. The intersection of the fault-tolerant suffix automaton and the prefix tree constructed by the preset editing distance can enable the preset word list to be quickly and fuzzy matched in the text to be corrected, find the words with the positioning errors, and is high in calculation speed and accurate in correction.
Description
Technical Field
The embodiment of the invention relates to the technical field of text processing, in particular to a text error correction method and device based on a fault-tolerant suffix automaton.
Background
When character editing or character processing is recognized by using OCR (Optical Character Recognition ), situations such as multiple characters, fewer characters, wrong characters and the like often occur, and inconvenience is brought to subsequent operations such as character processing and the like.
The existing text error correction algorithm generally adopts a deep learning model, such as an LSTM (Long Short-Term Memory network), a Bert model and the like, and performs error recognition, candidate recall, error correction sequencing and the like on places which are likely to be in error, but the existing processing mode has very high calculation cost and can not solve the problems of partial word missing, multiple words and the like.
Disclosure of Invention
In view of the foregoing, embodiments of the present invention are provided to provide a text error correction method and apparatus based on a fault-tolerant suffix automaton that overcomes or at least partially solves the foregoing problems.
According to an aspect of the embodiment of the present invention, there is provided a text error correction method based on a fault-tolerant suffix automaton, including:
constructing a prefix tree according to a preset word list; the prefix tree comprises a tail node; the end-of-word node records the text and the text length;
Generating a corresponding fault-tolerant suffix automaton based on a preset editing distance of the text to be corrected; the fault-tolerant suffix automaton comprises state transition of each character in a text to be corrected among a plurality of nodes to be corrected;
Searching based on the prefix tree, and determining an intersection with the fault-tolerant suffix automaton; the intersection comprises a node movement track to be corrected of the fault-tolerant suffix automaton when the node movement track is matched with the tail node of the prefix tree and the error correction length;
and determining the node to be corrected with the minimum correction length in the intersection set as a correction modification node, and performing correction modification on the text to be corrected according to the prefix tree.
According to another aspect of the embodiment of the present invention, there is provided a text error correction device based on a fault tolerant suffix automaton, the device including:
the prefix tree module is suitable for constructing a prefix tree according to a preset word list; the prefix tree comprises a tail node; the end-of-word node records the text and the text length;
The suffix automaton module is suitable for generating a corresponding fault-tolerant suffix automaton based on a preset editing distance for the text to be corrected; the fault-tolerant suffix automaton comprises state transition of each character in a text to be corrected among a plurality of nodes to be corrected;
The intersection module is suitable for searching based on the prefix tree and determining an intersection with the fault-tolerant suffix automaton; the intersection comprises a node movement track to be corrected of the fault-tolerant suffix automaton when the node movement track is matched with the tail node of the prefix tree and the error correction length;
The error correction module is suitable for determining the node to be corrected with the minimum error correction length in the intersection as an error correction modification node, and performing error correction modification on the text to be corrected according to the prefix tree.
According to yet another aspect of an embodiment of the present invention, there is provided a computing device including: the device comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete communication with each other through the communication bus;
The memory is used for storing at least one executable instruction, and the executable instruction enables the processor to execute the operation corresponding to the text error correction method based on the fault-tolerant suffix automaton.
According to still another aspect of the embodiments of the present invention, there is provided a computer storage medium having stored therein at least one executable instruction for causing a processor to perform operations corresponding to the above-described text error correction method based on a fault-tolerant suffix automaton.
According to yet another aspect of embodiments of the present invention, there is provided a computer program product comprising at least one executable instruction for causing a processor to perform operations corresponding to the above-described fault-tolerant suffix automaton based text error correction method.
According to the text error correction method and device based on the fault-tolerant suffix automaton, the intersection of the fault-tolerant suffix automaton and the prefix tree constructed through the preset editing distance can enable the preset word list to be quickly and fuzzy matched in the text to be corrected, search for words with positioning errors, and is high in calculation speed and accurate in error correction.
The foregoing description is only an overview of the technical solutions of the embodiments of the present invention, and may be implemented according to the content of the specification, so that the technical means of the embodiments of the present invention can be more clearly understood, and the following specific implementation of the embodiments of the present invention will be more apparent.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 illustrates a flow diagram of a fault-tolerant suffix automaton based text error correction method according to an embodiment of the invention;
FIG. 2 shows a prefix tree schematic;
FIG. 3 illustrates a suffix automaton diagram;
FIG. 4 illustrates a fault-tolerant suffix automaton diagram;
FIG. 5 illustrates a schematic diagram of a text error correction device based on a fault-tolerant suffix automaton, according to an embodiment of the invention;
FIG. 6 illustrates a schematic diagram of a computing device, according to one embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
FIG. 1 shows a flow chart of a fault-tolerant suffix automaton based text error correction method according to one embodiment of the invention, as shown in FIG. 1, the method comprising the steps of:
Step S101, constructing a prefix tree according to a preset word list.
The preset vocabulary is determined according to the vocabulary collected in advance, and can be added according to requirements. And constructing a prefix tree for the preset word list based on the preset word list, and performing query matching on the follow-up and postfix automata. The prefix tree records characters of each word in the word list in a 'path' mode, specifically, each word is recorded by utilizing nodes and edges, wherein the edges are positioned between the nodes and used for marking the next node pointed by a character after being matched with the current character, the next node is used for indicating whether the text has a subsequent character in the character, and the nodes are used for recording whether the text ends or whether the text does not end has a subsequent character, and the like. As shown in fig. 2, the prefix tree constructed by four texts of "three-heart-two-meaning", "thirty-six-count-up", "counter-roll", which includes a plurality of nodes, node 0 is the root node, 4, 7, 11, 15 are end-of-word nodes, which record each text and text length, etc., e.g., end-of-word node 7 records text as "thirty-six meters", text length 4, etc. If a state transition cannot be made at a certain node or no transition is made to a prefix node, it means that the prefix tree does not contain the text, such as "thirty-six" and "thirty-six", etc., are not in the prefix tree shown in fig. 2.
The prefix tree may be constructed by creating an initializing root node, traversing each character of the text for each text in a preset vocabulary, and checking whether the next node of the current node (the root node in the initial state) contains the character. If so, the state moves to the corresponding next node; if not, a new node is created and the character is added to the new node. After traversing each character contained in one text, marking the last node as a word tail node, and traversing the next text in the preset word list until traversing all the texts in the preset word list is completed, so as to obtain a prefix tree. The foregoing is illustrative, and is not limited thereto, depending on the implementation.
Step S102, generating a corresponding fault-tolerant suffix automaton based on the preset editing distance for the text to be corrected.
And obtaining a text to be corrected, and converting the text to be corrected into a fault-tolerant suffix automaton. When the method is used for generating, the method is based on the preset editing distance, so that the calculated amount for constructing the fault-tolerant suffix automaton is reduced, fuzzy matching with a preset word list in a prefix tree is facilitated, error correction and the like are performed on a text to be corrected, the preset editing distance is preferably 1 or 2, and the method is specifically set according to implementation conditions.
For the suffix automaton, all characters in a character string s are constructed into a prefix tree, and then nodes with the same function are combined to obtain a deterministic finite state automaton (DFA) with the least number of nodes in all suffixes of s. The construction of the suffix automaton is generally completed step by step according to the character of a character string s (with the length of n), and the number of nodes is not more than 2n-1. When the suffix automaton is constructed, the length of a character string s (i.e. text to be corrected) is n in the following manner:
1. Initializing a state array, wherein the initial value is 0, the length is n+1, setting a node table, storing the state of the initial value into the node table, and correspondingly constructing a node;
2. determining state transitions for existing character sets
1) Sequentially selecting one character from s as c, recording the position as k, executing the subsequent steps 2) -6) if the state [ k ] is 0, otherwise, acquiring the next character as c;
2) Initializing a new state, wherein the initial length is 1, and assigning a value of [ state [0] +1] to the new state;
3) for i from 0 to n-1, performing a cyclic operation to sequentially acquire each character in s;
3.1 If s [ i ] =c, then cost=0, otherwise cost=1;
3.2 The results of state [ i ] +cost are sequentially pushed into new_state
4) Changing all the numbers greater than 1 in new_state into 1, namely, only the numbers with the values of 0 and 1 are contained in new_state;
5) Setting a node table for storing new_state, if the new_state value is already stored in the node table, taking out the corresponding node, otherwise, storing new_state, creating a node, and recording the node corresponding to new_state;
6) The state transition from the state node to the new_state node is performed under the action of the character c;
3. Assigning the state as new_state, repeating the step2 until all states (states) are traversed, and generating no new state any more;
4. And (3) taking out the state transition process (transition from state to new_state under the action of character c) in the step 2.6, namely the suffix automaton.
Taking the string s as 7 characters of the thirty-six discipline as an example, the following modes are adopted:
Initialization state= [0,0,0,0,0,0,0,0] (8 0 s);
When executing state transition, if "three" is selected from s, initializing new_state= [1];
Cycling i from 0 to 7-1 for a total of 7 times;
s [0] is equal to "three", cost=0, new_state= [1,0]
S [1] is not equal to "three", cost=1, new_state= [1,0,1]
…
The new_state= [1,0,1,0,1,1,1,1] is obtained after n times of circulation
The current node table has only node 0, which is [0,0,0,0,0,0,0,0], new_state is the new state [1,0,1,0,1,1,1,1], denoted as node 1, and the state transition from node 0 to node 1 is saved under the action of the character "three".
And assigning new_state [1,0,1,0,1,1,1,1] to state, repeating the above operation, and transferring the obtained suffix automaton to a certain node after any substring which comprises 8 nodes, 12 edges and s can be matched from node 0 as shown in figure 3. But the suffix automaton of the character string s can only match "thirty-six", but cannot match "thirty-six".
On the basis, according to each character of the text to be corrected, determining state transition generated by each character at the node to be corrected based on a preset editing distance, and obtaining the fault-tolerant suffix automaton according to each state transition. Specifically, the method comprises the following steps:
Step S1, initializing first state data according to the character length of a text to be corrected, constructing a node to be corrected according to the first state data, and storing the first state data into a node table for constructing the node to be corrected according to the node table. The initial value of the first state data is a preset initial value. If the character string s is n in length, the editing distance d is preset. Initializing a first state data state, setting the initial value as 0 and the length as n+1, setting a node table, storing the state of the initial value into the node table, and correspondingly constructing a node.
Step S2, any character in the text to be corrected is sequentially obtained, and if the first state data corresponding to the character position is smaller than or equal to the preset editing distance, the second state data is initialized. If a character c is selected from s in turn, the position is marked as k, and if the state k is less than or equal to d, initializing the initial length of the second state data new_state as 1, and assigning the value of [ state 0] +1] to the new_state. And comparing the obtained character with other characters of the text to be corrected, obtaining a plurality of editing distances based on a plurality of single character preset operations according to the comparison result, and obtaining the minimum editing distance with other characters. If for i is from 0 to n-1, performing a loop operation, sequentially acquiring each character in s, if s [ i ] =c is judged, the comparison result cost=0, otherwise, the comparison result cost=1. And calculating the minimum value of the new_state [ i ] +1, the state [ i ] +cost and the state [ i+1] +1, namely the minimum editing distance min_value, wherein the new_state [ i ] +1 corresponds to adding a single character, the state [ i+1] +1 obtains the next character and corresponds to deleting the single character, and the state [ i ] +cost corresponds to replacing the single character. The larger the edit distance, the more the difference between the two words, and therefore, the present embodiment employs the minimum edit distance. And circularly acquiring the minimum editing distance between the character and other characters, and updating the second state data according to the minimum editing distance, namely sequentially pressing each minimum editing distance min_value into the second state data new_state. After the second state data is obtained, the second state data is updated based on the preset editing distance, for example, all numbers greater than d+1 in new_state are changed to d+1, i.e. numbers contained in new_state are not greater than the preset editing distance d+1. Judging according to the node table, if the second state data is not stored in the node table, newly constructing a node to be corrected according to the second state data, and storing the node to be corrected in the node table. And if the second state data is stored in the node table, acquiring the node to be corrected constructed by the second state data. The state transition from the first state data state to the second state data new_state based on the character from the node to be corrected corresponding to the first state data to the node to be corrected corresponding to the second state data is recorded, namely, the state transition from the first state data state to the second state data new_state is stored under the action of the character c.
Step S3, executing the operation in the step S2 by utilizing characters in the non-error correction text such as # and the like, specifically, determining a plurality of editing distances between the preset universal characters and each character in the error correction text based on a plurality of single-character preset operations, acquiring the minimum editing distance between the preset universal characters and each character in the error correction text, updating the second state data according to the minimum editing distance between the preset universal characters and each character in the error correction text, and updating the second state data based on the preset editing distance; if the second state data is not stored in the node table, constructing a node to be corrected according to the second state data and storing the node to be corrected in the node table; if the second state data is stored in the node table, acquiring the node to be corrected, which is constructed by the second state data; and recording state transition from the node to be corrected corresponding to the first state data to the node to be corrected corresponding to the second state data based on the preset universal character. Here, the preset universal character is used for state transition when the preset universal character cannot be matched with characters in the prefix tree.
And S4, updating the first state data by using the second state data stored in the node table, assigning the state to be new_state, and repeatedly executing the steps S2-S3 until the obtained first state data are all data stored in the node table, so that all states (states) are traversed, and no new state is generated.
According to the state transitions of the steps S2 and S3, a fault-tolerant suffix automaton is obtained, which includes state transitions of characters in a text to be corrected among a plurality of nodes to be corrected, taking the text to be corrected of thirty-six disciplines as an example, and under the condition that the preset editing distance is 1, the fault-tolerant suffix automaton is shown in fig. 4 and includes 46 nodes and 151 state transitions.
Step S103, searching is carried out based on the prefix tree, and an intersection with the fault-tolerant suffix automaton is determined.
The prefix tree is subjected to depth-first search, the nodes to be corrected of the fault-tolerant suffix automaton are transferred according to the nodes to be corrected, the text 'thirty-six meters' in a preset word list is taken as an example, the nodes are matched with the fault-tolerant suffix automaton, the nodes are transferred from the node 0 to the node 22 through the nodes 15, 19 and 21 as red lines shown in fig. 4, and an intersection is obtained. The intersection comprises the movement track of the node to be corrected of the fault-tolerant suffix automaton when the node to be corrected is matched with the end node of the prefix tree. Specifically, when the state is transferred to the end-of-word node of the prefix tree, the text of the end-of-word node, the moving track constructed by the position information of each node to be corrected in the fault-tolerant suffix automaton, and the error correction length determined according to the text length of the end-of-word node and the difference of the matching number of the characters corresponding to each node to be corrected in the fault-tolerant suffix automaton can be recorded. Such as "thirty-3", ten ", 2, thirty-2", six: 1, 2", wherein each digit represents the minimum number of words that need to be changed when the character is completely matched with" thirty-six ", i.e., the error correction length, as in fig. 2, the end-of-word node 7 records that the text is" thirty-six ", the text length is 4, the number of character matches of the 1 st digit with" thirty-six "in the preset vocabulary is only 1, 1 = error correction length 3 according to the text length 4-match number," thirty-3 "represents that the three characters of" thirty-six "are removed to be matched, the" six: 1 "of the 5 th digit represents that the last character of" thirty-six "is removed to be matched, and the" 1 "of the 6 th digit represents that the last digit of" thirty-six "is replaced with" century "to be matched.
Step S104, determining the node to be corrected with the minimum correction length in the intersection set as a correction modification node, and performing correction modification on the text to be corrected according to the prefix tree.
And sequencing the error correction lengths in the intersection, and determining the node to be corrected corresponding to the minimum error correction length as an error correction modification node. According to the text of the end node of the prefix tree, the error correction modification node or the node to be corrected of the error correction modification node can be subjected to error correction modification. Taking the text to be corrected "thirty-six disciplines" as an example, the "century" of the 6 th bit can be replaced by "meter" according to the "thirty-six meters", and the above is taken as an example, and the specific correction modification is set according to the implementation situation, and is not limited herein.
Further, when correcting the text to be corrected, the text composed of the accurate characters can be added into the preset word list, the preset word list is dynamically expanded, and the like, and the expansion description is omitted.
According to the text error correction method based on the fault-tolerant suffix automaton, the intersection of the fault-tolerant suffix automaton and the prefix tree constructed through the preset editing distance can enable the preset word list to be quickly and fuzzy matched in the text to be corrected, search for words with positioning errors, and is high in calculation speed and accurate in error correction.
Fig. 5 shows a schematic structural diagram of a text error correction device based on a fault-tolerant suffix automaton according to an embodiment of the present invention. As shown in fig. 5, the apparatus includes:
The prefix tree module 510 is adapted to construct a prefix tree according to a preset vocabulary; the prefix tree comprises a tail node; the end-of-word node records the text and the text length;
the suffix automaton module 520 is adapted to generate a corresponding fault-tolerant suffix automaton based on a preset editing distance for the text to be corrected; the fault-tolerant suffix automaton comprises state transition of each character in a text to be corrected among a plurality of nodes to be corrected;
an intersection module 530 adapted to search based on the prefix tree, determining an intersection with a fault-tolerant suffix automaton; the intersection comprises a node movement track to be corrected of the fault-tolerant suffix automaton when the node movement track is matched with the tail node of the prefix tree and the error correction length;
the error correction module 540 is adapted to determine the node to be corrected with the smallest error correction length in the intersection set as an error correction modification node, and perform error correction modification on the text to be corrected according to the prefix tree.
Optionally, the prefix tree module 510 is further adapted to:
aiming at each word of a preset word list, recording each word by utilizing nodes and edges to obtain a prefix tree; the edge is positioned between the nodes and used for marking the next node pointed after the current character is matched.
Optionally, the suffix automaton module 520 is further adapted to:
determining state transition of each character generated at the nodes to be corrected based on a preset editing distance according to each character of the text to be corrected;
And obtaining the fault-tolerant suffix automaton according to each state transition.
Optionally, the suffix automaton module 520 further includes:
the initialization unit 521 is adapted to initialize the first status data according to the character length of the text to be corrected, and construct the node to be corrected according to the first status data; storing the first state data to a node table for constructing a node to be corrected according to the node table; the initial value of the first state data is a preset initial value;
the character state construction unit 522 is adapted to sequentially acquire any character in the text to be corrected, and initialize the second state data if the first state data corresponding to the character position is less than or equal to the preset editing distance; comparing to obtain comparison results of the characters and other characters of the text to be corrected, obtaining a plurality of editing distances based on a plurality of single character preset operations according to the comparison results, and obtaining minimum editing distances between the characters and other characters; updating the second state data according to the minimum editing distance of other characters, and updating the second state data based on the preset editing distance; if the second state data is not stored in the node table, constructing a node to be corrected according to the second state data and storing the node to be corrected in the node table; if the second state data is stored in the node table, acquiring the node to be corrected, which is constructed by the second state data; recording state transition from a node to be corrected corresponding to the first state data to a node to be corrected corresponding to the second state data based on the characters; the single character presetting operation comprises deleting single characters, adding single characters and replacing single characters;
The universal character state construction unit 523 is adapted to determine a plurality of editing distances between the preset universal character and each character in the text to be corrected based on a plurality of single character preset operations by using the preset universal character, and obtain a minimum editing distance between the preset universal character and each character in the text to be corrected; updating the second state data according to the minimum editing distance between the second state data and each character in the text to be corrected, and updating the second state data based on the preset editing distance; if the second state data is not stored in the node table, constructing a node to be corrected according to the second state data and storing the node to be corrected in the node table; if the second state data is stored in the node table, acquiring the node to be corrected, which is constructed by the second state data; recording state transition from a node to be corrected corresponding to the first state data to a node to be corrected corresponding to the second state data based on a preset universal character;
The updating unit 524 is adapted to update the first state data with the second state data stored in the node table, and repeatedly execute the character state construction unit 522-the universal character state construction unit 523 until the obtained first state data are all data stored in the node table.
Optionally, the intersection module 530 is further adapted to:
performing depth-first search on the prefix tree, and transferring according to each node to be corrected of the fault-tolerant suffix automaton;
When transferring to the tail node of the prefix tree, recording the text of the tail node, the moving track constructed by the position information of each node to be corrected in the fault-tolerant suffix automaton and the correction length; the error correction length is determined according to the text length of the end-of-word node and the difference value of the number of the matched characters corresponding to each node to be corrected in the fault-tolerant suffix automaton.
Optionally, the error correction module 540 is further adapted to:
sequencing the error correction lengths in the intersection, and determining a node to be corrected corresponding to the minimum error correction length as an error correction modification node;
And carrying out error correction modification on the error correction modification node or a node to be corrected after the error correction modification node according to the text of the end node of the prefix tree.
The above descriptions of the modules refer to the corresponding descriptions in the method embodiments, and are not repeated herein.
The embodiment of the invention also provides a nonvolatile computer storage medium, and the computer storage medium stores at least one executable instruction, wherein the executable instruction can execute the operation corresponding to the text error correction method based on the fault-tolerant suffix automaton in any method embodiment.
Embodiments of the present application provide a computer program product comprising at least one executable instruction or a computer program, where the executable instruction or the computer program may cause a processor to perform operations corresponding to the text error correction method based on the fault-tolerant suffix automaton in any of the above-described method embodiments.
FIG. 6 illustrates a schematic diagram of a computing device, according to an embodiment of the invention, the particular embodiment of which is not limiting of the particular implementation of the computing device.
As shown in fig. 6, the computing device may include: a processor 602, a communication interface Communications Interface, a memory 606, and a communication bus 608.
Wherein:
Processor 602, communication interface 604, and memory 606 perform communication with each other via communication bus 608.
Communication interface 604 is used to communicate with network elements of other devices, such as clients or other servers.
The processor 602 is configured to execute the program 610, and may specifically perform relevant steps in the foregoing embodiment of the text error correction method based on the fault-tolerant suffix automaton.
In particular, program 610 may include program code including computer-operating instructions.
The processor 602 may be a central processing unit CPU, or an Application-specific integrated Circuit ASIC (Application SPECIFIC INTEGRATED Circuit), or one or more integrated circuits configured to implement embodiments of the present invention. The one or more processors included by the computing device may be the same type of processor, such as one or more CPUs; but may also be different types of processors such as one or more CPUs and one or more ASICs.
A memory 606 for storing a program 610. The memory 606 may comprise high-speed RAM memory or may further comprise non-volatile memory (non-volatile memory), such as at least one disk memory.
The program 610 may be specifically configured to cause the processor 602 to perform the text error correction method based on the fault-tolerant suffix automaton in any of the method embodiments described above. The specific implementation of each step in the program 610 may refer to the corresponding descriptions in the corresponding steps and units in the text error correction embodiment based on the fault-tolerant suffix automaton, which are not described herein. It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the apparatus and modules described above may refer to corresponding procedure descriptions in the foregoing method embodiments, which are not repeated herein.
The algorithms or displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the teachings herein. The required structure for a construction of such a system is apparent from the description above. In addition, embodiments of the present invention are not directed to any particular programming language. It should be appreciated that the teachings of embodiments of the present invention described herein may be implemented in a variety of programming languages, and the above description of specific languages is provided for disclosure of preferred embodiments of the present invention.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the above description of exemplary embodiments of the invention, various features of the embodiments of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be construed as reflecting the intention that: i.e., an embodiment of the invention that is claimed, requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the apparatus of the embodiments may be adaptively changed and disposed in one or more apparatuses different from the embodiments. The modules or units or components of the embodiments may be combined into one module or unit or component and, furthermore, they may be divided into a plurality of sub-modules or sub-units or sub-components. Any combination of all features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or units of any method or apparatus so disclosed, may be used in combination, except insofar as at least some of such features and/or processes or units are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings), may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments herein include some features but not others included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
Various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that some or all of the functionality of some or all of the components according to embodiments of the present invention may be implemented in practice using a microprocessor or Digital Signal Processor (DSP). Embodiments of the present invention may also be implemented as a device or apparatus program (e.g., a computer program and a computer program product) for performing a portion or all of the methods described herein. Such a program embodying the embodiments of the present invention may be stored on a computer readable medium, or may have the form of one or more signals. Such signals may be downloaded from an internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. Embodiments of the invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. do not denote any order. These words may be interpreted as names. The steps in the above embodiments should not be construed as limiting the order of execution unless specifically stated.
Claims (8)
1. A text error correction method based on a fault-tolerant suffix automaton is characterized by comprising the following steps:
constructing a prefix tree according to a preset word list; the prefix tree comprises a tail node; the end-of-word node records the text and the text length;
Generating a corresponding fault-tolerant suffix automaton based on a preset editing distance of the text to be corrected; the fault-tolerant suffix automaton comprises state transition of each character in the text to be corrected among a plurality of nodes to be corrected;
Searching based on the prefix tree, and determining an intersection with the fault-tolerant suffix automaton; the intersection comprises a node movement track to be corrected of the fault-tolerant suffix automaton when the intersection is matched with the tail node of the prefix tree, and the fault-tolerant suffix automaton comprises a fault correction length;
determining a node to be corrected with the minimum correction length in the intersection set as a correction modification node, and performing correction modification on the text to be corrected according to the prefix tree;
The generating the corresponding fault-tolerant suffix automaton based on the preset editing distance for the text to be corrected further comprises:
step S1, initializing first state data according to the character length of a text to be corrected, and constructing a node to be corrected according to the first state data; storing the first state data to a node table for constructing a node to be corrected according to the node table; the initial value of the first state data is a preset initial value;
Step S2, any character in the text to be corrected is sequentially obtained, and if the first state data corresponding to the character position is smaller than or equal to the preset editing distance, the second state data is initialized; comparing to obtain a comparison result of the character and other characters of the text to be corrected, obtaining a plurality of editing distances based on a plurality of single character preset operations according to the comparison result, and obtaining the minimum editing distance between the character and other characters; updating the second state data according to the minimum editing distance of other characters, and updating the second state data based on the preset editing distance; if the second state data is not stored in the node table, constructing a node to be corrected according to the second state data and storing the node to be corrected in the node table; if the second state data is stored in the node table, acquiring a node to be corrected constructed by the second state data; recording state transition from a node to be corrected corresponding to the first state data to a node to be corrected corresponding to the second state data based on the character; the single character presetting operation comprises deleting single characters, adding single characters and replacing single characters;
Step S3, utilizing preset universal characters, determining a plurality of editing distances between the preset universal characters and each character in the text to be corrected based on a plurality of single-character preset operations, and obtaining the minimum editing distance between the preset universal characters and each character in the text to be corrected; updating second state data according to the minimum editing distance between the second state data and each character in the text to be corrected, and updating the second state data based on the preset editing distance; if the second state data is not stored in the node table, constructing a node to be corrected according to the second state data and storing the node to be corrected in the node table; if the second state data is stored in the node table, acquiring a node to be corrected constructed by the second state data; recording state transition from a node to be corrected corresponding to the first state data to a node to be corrected corresponding to the second state data based on the preset universal character;
Step S4, updating the first state data by using the second state data stored in the node table, and repeatedly executing the steps S2-S3 until the obtained first state data are all data stored in the node table;
And obtaining the fault-tolerant suffix automaton according to each state transition.
2. The method of claim 1, wherein building a prefix tree from a pre-set vocabulary further comprises:
Aiming at each word of a preset word list, recording each word by utilizing nodes and edges to obtain a prefix tree; the edge is located between the nodes and used for marking the next node pointed after the current character is matched.
3. The method of claim 1, wherein the searching based on the prefix tree, determining an intersection with the fault-tolerant suffix automaton further comprises:
performing depth-first search on the prefix tree, and transferring according to each node to be corrected of the fault-tolerant suffix automaton;
When transferring to the tail node of the prefix tree, recording the text of the tail node, the moving track constructed by the position information of each node to be corrected in the fault-tolerant suffix automaton and the correction length; and the error correction length is determined according to the text length of the tail node and the difference value of the number of the matched characters corresponding to each node to be corrected in the fault-tolerant suffix automaton.
4. A method according to any of claims 1-3, wherein said determining as an error correction modification node the node to be error corrected having the smallest error correction length in the intersection, performing error correction modification on the text to be error corrected according to the prefix tree, further comprises:
sequencing the error correction lengths in the intersection, and determining a node to be corrected corresponding to the minimum error correction length as an error correction modification node;
and carrying out error correction modification on the error correction modification node or a node to be corrected of the error correction modification node according to the text of the tail node of the prefix tree.
5. A text error correction device based on a fault tolerant suffix automaton, the device comprising:
The prefix tree module is suitable for constructing a prefix tree according to a preset word list; the prefix tree comprises a tail node; the end-of-word node records the text and the text length;
The suffix automaton module is suitable for generating a corresponding fault-tolerant suffix automaton based on a preset editing distance for the text to be corrected; the fault-tolerant suffix automaton comprises state transition of each character in the text to be corrected among a plurality of nodes to be corrected;
An intersection module adapted to search based on the prefix tree, determining an intersection with the fault-tolerant suffix automaton; the intersection comprises a node movement track to be corrected of the fault-tolerant suffix automaton when the intersection is matched with the tail node of the prefix tree, and the fault-tolerant suffix automaton comprises a fault correction length;
the error correction module is suitable for determining the node to be corrected with the minimum error correction length in the intersection as an error correction modification node, and performing error correction modification on the text to be corrected according to the prefix tree;
The suffix automaton module is further adapted to:
step S1, initializing first state data according to the character length of a text to be corrected, and constructing a node to be corrected according to the first state data; storing the first state data to a node table for constructing a node to be corrected according to the node table; the initial value of the first state data is a preset initial value;
Step S2, any character in the text to be corrected is sequentially obtained, and if the first state data corresponding to the character position is smaller than or equal to the preset editing distance, the second state data is initialized; comparing to obtain a comparison result of the character and other characters of the text to be corrected, obtaining a plurality of editing distances based on a plurality of single character preset operations according to the comparison result, and obtaining the minimum editing distance between the character and other characters; updating the second state data according to the minimum editing distance of other characters, and updating the second state data based on the preset editing distance; if the second state data is not stored in the node table, constructing a node to be corrected according to the second state data and storing the node to be corrected in the node table; if the second state data is stored in the node table, acquiring a node to be corrected constructed by the second state data; recording state transition from a node to be corrected corresponding to the first state data to a node to be corrected corresponding to the second state data based on the character; the single character presetting operation comprises deleting single characters, adding single characters and replacing single characters;
Step S3, utilizing preset universal characters, determining a plurality of editing distances between the preset universal characters and each character in the text to be corrected based on a plurality of single-character preset operations, and obtaining the minimum editing distance between the preset universal characters and each character in the text to be corrected; updating second state data according to the minimum editing distance between the second state data and each character in the text to be corrected, and updating the second state data based on the preset editing distance; if the second state data is not stored in the node table, constructing a node to be corrected according to the second state data and storing the node to be corrected in the node table; if the second state data is stored in the node table, acquiring a node to be corrected constructed by the second state data; recording state transition from a node to be corrected corresponding to the first state data to a node to be corrected corresponding to the second state data based on the preset universal character;
Step S4, updating the first state data by using the second state data stored in the node table, and repeatedly executing the steps S2-S3 until the obtained first state data are all data stored in the node table;
And obtaining the fault-tolerant suffix automaton according to each state transition.
6. A computing device, comprising: the device comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete communication with each other through the communication bus;
the memory is configured to store at least one executable instruction, where the executable instruction causes the processor to perform operations corresponding to the fault-tolerant suffix automaton-based text error correction method according to any of claims 1-4.
7. A computer storage medium having stored therein at least one executable instruction for causing a processor to perform operations corresponding to the fault-tolerant suffix automaton based text error correction method of any of claims 1-4.
8. A computer program product comprising at least one executable instruction for causing a processor to perform operations corresponding to the fault-tolerant suffix automaton based text error correction method according to any of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410410143.3A CN118194862B (en) | 2024-04-07 | 2024-04-07 | Text error correction method and device based on fault-tolerant suffix automaton |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410410143.3A CN118194862B (en) | 2024-04-07 | 2024-04-07 | Text error correction method and device based on fault-tolerant suffix automaton |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118194862A CN118194862A (en) | 2024-06-14 |
CN118194862B true CN118194862B (en) | 2024-09-06 |
Family
ID=91414213
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410410143.3A Active CN118194862B (en) | 2024-04-07 | 2024-04-07 | Text error correction method and device based on fault-tolerant suffix automaton |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118194862B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105260354A (en) * | 2015-08-20 | 2016-01-20 | 及时标讯网络信息技术(北京)有限公司 | Chinese AC (Aho-Corasick) automaton working method based on keyword dictionary tree structure |
CN116644737A (en) * | 2023-02-09 | 2023-08-25 | 一贯智服(杭州)技术有限公司 | Proper noun error correction method based on automatic word stock updating and prefix tree structure |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB201902772D0 (en) * | 2019-03-01 | 2019-04-17 | Palantir Technologies Inc | Fuzzy searching 7 applications thereof |
-
2024
- 2024-04-07 CN CN202410410143.3A patent/CN118194862B/en active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105260354A (en) * | 2015-08-20 | 2016-01-20 | 及时标讯网络信息技术(北京)有限公司 | Chinese AC (Aho-Corasick) automaton working method based on keyword dictionary tree structure |
CN116644737A (en) * | 2023-02-09 | 2023-08-25 | 一贯智服(杭州)技术有限公司 | Proper noun error correction method based on automatic word stock updating and prefix tree structure |
Also Published As
Publication number | Publication date |
---|---|
CN118194862A (en) | 2024-06-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112329465B (en) | Named entity recognition method, named entity recognition device and computer readable storage medium | |
JP3672242B2 (en) | PATTERN SEARCH METHOD, PATTERN SEARCH DEVICE, COMPUTER PROGRAM, AND STORAGE MEDIUM | |
CN111444320A (en) | Text retrieval method and device, computer equipment and storage medium | |
CN111046671A (en) | Chinese named entity recognition method based on graph network and merged into dictionary | |
US5553284A (en) | Method for indexing and searching handwritten documents in a database | |
CN112800769B (en) | Named entity recognition method, named entity recognition device, named entity recognition computer equipment and named entity recognition storage medium | |
CN111159329A (en) | Sensitive word detection method and device, terminal equipment and computer-readable storage medium | |
CN112148819A (en) | Address recognition method and device combining RPA and AI | |
CN118194862B (en) | Text error correction method and device based on fault-tolerant suffix automaton | |
CN112395880B (en) | Error correction method and device for structured triples, computer equipment and storage medium | |
CN111061927B (en) | Data processing method and device and electronic equipment | |
CN114090722B (en) | Method and device for automatically completing query content | |
CN113177391B (en) | Method for redirecting operation cursor in streaming interface, computing equipment and storage medium | |
CN115244539B (en) | Inference method for tokenization of words or word segments | |
JP6261669B2 (en) | Query calibration system and method | |
CN112579713B (en) | Address recognition method, address recognition device, computing equipment and computer storage medium | |
CN113076733A (en) | Text matching method, terminal device and storage medium | |
CN112464101A (en) | Electronic book sorting recommendation method, electronic device and storage medium | |
CN112364630A (en) | License content error correction method, device and system | |
CN110543622A (en) | Text similarity detection method and device, electronic equipment and readable storage medium | |
CN117874088B (en) | Data fuzzy matching method, device, equipment and medium | |
US8666925B1 (en) | Method for parallel computation of a finite state machine | |
CN118193543B (en) | Method for searching node tree based on EDA, electronic equipment and storage medium | |
JPH03257693A (en) | Character recognized result correcting system | |
CN110852101B (en) | Path decoding method, device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |