CN114065710A

CN114065710A - Identification correction method and device, electronic equipment and readable storage medium

Info

Publication number: CN114065710A
Application number: CN202111223231.5A
Authority: CN
Inventors: 辛洋
Original assignee: Beijing Kingsoft Office Software Inc; Zhuhai Kingsoft Office Software Co Ltd; Wuhan Kingsoft Office Software Co Ltd
Current assignee: Beijing Kingsoft Office Software Inc; Zhuhai Kingsoft Office Software Co Ltd; Wuhan Kingsoft Office Software Co Ltd
Priority date: 2021-10-20
Filing date: 2021-10-20
Publication date: 2022-02-18

Abstract

The invention discloses an identification correction method, an identification correction device, electronic equipment and a readable storage medium, wherein the method comprises the following steps: acquiring first characteristics of merging row cells to be identified in a table to be identified; obtaining first identification data according to the first characteristics and the first classification model; obtaining the cells of the in-doubt merging row and the second characteristics of the cells according to the first identification data; obtaining second identification data according to the second characteristic and a second classification model; and correcting the identification result of the structure type of the cells of the in-doubt merging row according to the second identification data.

Description

Identification correction method and device, electronic equipment and readable storage medium

Technical Field

The present invention relates to the field of database technologies, and in particular, to a method and an apparatus for identifying and correcting a table structure, an electronic device, and a readable storage medium.

Background

Spreadsheets are made up of a plurality of rows, which can be classified into different categories according to the content of each row, such as headline, row title, table content, and other. The category of each row is taken as the structure category of that row, and the structure categories of all rows in the table can be taken as the structure category of the table, which facilitates data analysis of the table based on its structure categories.

However, for some more complex table structures, the structure category identification of some row cells is inaccurate. When a classification model is used to determine the structure category of a row, each structure category uses its own binary classification model in order to increase identification accuracy. For example, the classification result of the headline model is a decimal between 0 and 1: a value closer to 1 indicates that the structure category of the corresponding row is a headline, and a value closer to 0 indicates that it is not a headline. With this classification method, a row of cells may receive similar scores for headline, row title, table content, and other, which makes the identified structure category of that row inaccurate.

Disclosure of Invention

The invention provides a new technical scheme capable of improving the accuracy of the identification result of the structure type of the merging row cells in the table.

According to a first aspect of the present invention, there is provided a method for identifying and correcting a table structure, including:

acquiring first characteristics of merging row cells to be identified in a table to be identified;

obtaining first identification data according to the first characteristics and the first classification model;

obtaining the cells of the in-doubt merging row and the second characteristics of the cells according to the first identification data;

obtaining second identification data according to the second characteristic and a second classification model;

and correcting the identification result of the structure type of the cells of the in-doubt merging row according to the second identification data.

Optionally, the obtaining first identification data according to the first feature and the first classification model includes:

according to the first characteristic and a first classification model corresponding to at least one structure type, obtaining first identification data of the merging row unit cells to be identified corresponding to the structure type;

the first classification model is used for judging whether the merging row cells to be identified belong to the corresponding structure category.

Optionally, the obtaining an in-doubt merge-row cell according to the first identification data includes:

judging, according to the first identification data, a merging row cell to be identified whose identification result is ambiguous to be the in-doubt merging row cell.

traversing the merging row cells to be identified to obtain the sum of the first identification data of the traversed merging row cell;

calculating the ratio of each item of first identification data of the traversed merging row cell to the sum, and the standard deviation of the ratios;

and obtaining the in-doubt merging row cell according to the result of comparing the standard deviation of the traversed merging row cell with a preset threshold value.

Optionally, the obtaining the in-doubt merging row cell according to the comparison result includes:

taking, according to the comparison result, the merging row cell to be identified whose standard deviation is smaller than or equal to the threshold value as the in-doubt merging row cell.
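The screening rule above can be sketched as follows. This is only an illustration under stated assumptions: the first identification data of one merging row cell is taken to be a list of scores, one per structure category; the function name `is_in_doubt`, the use of the population standard deviation, the handling of an all-zero score list, and the threshold value 0.1 are hypothetical choices that the text does not specify.

```python
from statistics import pstdev

def is_in_doubt(scores, threshold):
    """Return True when one merging row cell should be treated as in doubt.

    scores: first identification data, one score per structure category.
    threshold: preset threshold on the standard deviation of the ratios.
    """
    total = sum(scores)
    if total == 0:
        # Assumption: no category received any score, so the cell is ambiguous.
        return True
    # Ratio of each score to the sum of all scores of this cell.
    ratios = [score / total for score in scores]
    # Population standard deviation (an assumption; the text does not say which).
    return pstdev(ratios) <= threshold

# Similar scores across categories give a small spread: in doubt.
print(is_in_doubt([0.52, 0.48, 0.50, 0.49], 0.1))  # True
# One dominant category gives a large spread: not in doubt.
print(is_in_doubt([0.97, 0.02, 0.03, 0.01], 0.1))  # False
```

A small standard deviation means the scores are spread almost evenly over the structure categories, which is the situation the second classification model is meant to resolve.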

Optionally, the obtaining second identification data according to the second feature and the second classification model includes:

obtaining second identification data representing the structure category of the in-doubt merging row cell according to the second characteristics and the second classification model;

and the second classification model is used for judging the structure type of the in-doubt merging row unit cell.

Optionally, before the obtaining the first feature of the merge row cell to be identified in the table to be identified, the method further includes:

obtaining the cell content of each cell in the table to be identified;

generating feature information of at least one cell in the table to be recognized based on the cell content of each cell in the table to be recognized, wherein the feature information of the cell represents the structure type corresponding to the cell content of the cell;

calculating the similarity of any two adjacent rows in the table to be recognized according to the feature information of the cells in the table to be recognized;

merging the two adjacent rows of the cells with the similarity reaching the similarity threshold, and obtaining the merged row cells to be identified according to the merging result; the merging row unit cells to be identified comprise at least one row of unit cells.

Optionally, the obtaining the second feature of the in-doubt merge-row cell includes:

according to the first identification data, determining a reference merging row cell corresponding to at least one structure category in the table to be identified, wherein the reference merging row cell is a merging row cell to be identified whose structure category identification result is not ambiguous;

and generating a first feature list of at least one structure category according to the first feature of the reference merging row cell of the at least one structure category, and obtaining the second feature of the in-doubt merging row cell according to the first feature list.

Optionally, the second feature comprises at least one of:

the first identification data of the in-doubt merging row cell corresponding to at least one structure category;

the number of rows of cells contained in the in-doubt merging row cell;

the number of first features in the first feature list of any structure category whose similarity to the first feature of the in-doubt merging row cell is greater than a preset similarity threshold, wherein the first feature list is a list formed by the first features of the merging row cells whose structure categories are not ambiguous in the table to be recognized;

the structure category of the merging row cell preceding the in-doubt merging row cell;

the structure category of the merging row cell following the in-doubt merging row cell;

the difference between the row number of the in-doubt merging row cell and the row number of the preceding blank merging row cell;

the difference between the row number of the in-doubt merging row cell and the row number of the next blank merging row cell.

Optionally, the first feature comprises at least one of:

the ratio of the number of merging cells in the merging row cells to be identified to the minimum cell number of the merging row cells to be identified;

a set of feature information of each cell in the merge row cells to be identified;

the ratio of the number of minimum cells whose feature information is Chinese in the merge row cells to be identified to the number of minimum cells having content in the merge row cells to be identified.

Optionally, the feature information in the merge row cell to be identified includes at least one of the following:

the ratio of the number of minimum cells whose content is a number in the merge row cells to be identified to the number of minimum cells having content in the merge row cells to be identified;

the number of colons contained in the content of the merge row cells to be identified;

the ratio of the number of minimum cells in the merge row cells to be identified whose feature information differs from that of the corresponding minimum cell in the nearest merge row cell, to the number of minimum cells having content in the merge row cells to be identified.

traversing each row of cells in the merged row cells to be identified, and obtaining third identification data of the currently traversed row according to the first characteristics of the currently traversed row and the first classification model corresponding to at least one structure category;

and under the condition of ending traversal, obtaining the first identification data of the merging row unit cells to be identified according to the third identification data of each row of unit cells in the merging row unit cells to be identified.

According to a second aspect of the present disclosure, there is provided an identification correction apparatus of a table structure, including:

the first characteristic acquisition module is used for acquiring first characteristics of merging row cells to be identified in a table to be identified;

the first data obtaining module is used for obtaining first identification data according to the first characteristics and the first classification model;

the second characteristic obtaining module is used for obtaining the doubt merging row unit cell and the second characteristic thereof according to the first identification data;

the second data obtaining module is used for obtaining second identification data according to the second characteristics and the second classification model;

and the identification result correction module is used for correcting the identification result of the structure type of the in-doubt merging row unit cell according to the second identification data.

According to a third aspect of the present disclosure, there is provided an electronic device comprising:

the apparatus of the second aspect of the disclosure; or,

a processor and a memory for storing instructions for controlling the processor to perform the method according to the first aspect of the present disclosure.

According to a fourth aspect of the present disclosure, there is provided a readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method according to the first aspect of the present disclosure.

According to the embodiment of the disclosure, first identification data is obtained according to the first feature of the merging row cell to be identified and the first classification model, the in-doubt merging row cell is obtained according to the first identification data, second identification data is obtained according to the second feature of the in-doubt merging row cell and the second classification model, and the identification result of the structure category of the in-doubt merging row cell is then corrected according to the second identification data. The second feature is more accurate than the first feature, and it introduces the contextual similarity relation of the in-doubt merging row cell. Correcting the identification result of the structure category of the in-doubt merging row cell according to the second identification data obtained from the second feature can therefore improve the accuracy of that identification result.

Other features of the present invention and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.

Fig. 1 is a block diagram of one example of a hardware configuration of an electronic device that can be used to implement an embodiment of the present invention.

Fig. 2 is a flowchart illustrating a table structure recognition and correction method according to an embodiment of the present invention.

Fig. 3 is a block diagram of an identification correction apparatus of a table structure according to an embodiment of the present invention.

FIG. 4 shows a block diagram of an electronic device of one embodiment of the invention.

Detailed Description

Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.

The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.

Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.

In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.

< hardware configuration >

FIG. 1 is a schematic structural diagram of an electronic device that can be used to implement embodiments of the present disclosure.

The electronic device 1000 may be a smart phone, a portable computer, a desktop computer, a tablet computer, a server, etc., and is not limited herein.

The electronic device 1000 may include, but is not limited to, a processor 1100, a memory 1200, an interface device 1300, a communication device 1400, a display device 1500, an input device 1600, a speaker 1700, a microphone 1800, and the like. The processor 1100 may be a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MCU), or the like, and is configured to execute a computer program, and the computer program may be written by using an instruction set of architectures such as x86, Arm, RISC, MIPS, and SSE. The memory 1200 includes, for example, a ROM (read only memory), a RAM (random access memory), a nonvolatile memory such as a hard disk, and the like. The interface device 1300 includes, for example, a USB interface, a serial interface, a parallel interface, and the like. The communication device 1400 is capable of wired communication using an optical fiber or a cable, or wireless communication, and specifically may include WiFi communication, Bluetooth communication, 2G/3G/4G/5G communication, and the like. The display device 1500 is, for example, a liquid crystal display panel, a touch panel, or the like. The input device 1600 may include, for example, a touch screen, a keyboard, a somatosensory input, and the like. The speaker 1700 is used to output an audio signal. The microphone 1800 is used to collect audio signals.

As applied to the disclosed embodiments, the memory 1200 of the electronic device 1000 is used to store a computer program for controlling the processor 1100 to operate to implement the methods according to the disclosed embodiments. The skilled person can design the computer program according to the solution disclosed in the present disclosure. How the computer program controls the processor to operate is well known in the art and will not be described in detail here. The electronic device 1000 may be installed with an intelligent operating system (e.g., Windows, Linux, Android, iOS, etc.) and application software.

It should be understood by those skilled in the art that although a plurality of devices of the electronic apparatus 1000 are illustrated in fig. 1, the electronic apparatus 1000 of the embodiments of the present disclosure may refer to only some of the devices therein, for example, only the processor 1100 and the memory 1200, etc.

Various embodiments and examples according to the present invention are described below with reference to the accompanying drawings.

< method examples >

In the present embodiment, a method for identifying and correcting a table structure is provided. The method is implemented by an electronic device. The electronic device may be an electronic product having a processor and a memory. For example, a desktop computer, a notebook computer, a mobile phone, a tablet computer, etc. In one example, the electronic device may be provided as electronic device 1000 shown in FIG. 1.

FIG. 2 is a schematic flow chart diagram of a table structure identification correction method of an embodiment. As shown in fig. 2, the method for identifying and correcting a table structure according to the present embodiment includes the following steps S2100 to S2500:

Step S2100, a first feature of a merge row cell to be identified in a table to be identified is obtained.

In this embodiment, the merge row cells to be identified may include at least one row of cells.

In one embodiment of the disclosure, the first feature may include at least one of:

Further, the characteristic information in the merge row cell to be identified includes at least one of:

the ratio of the number of minimum cells whose content is a number in the merge row cells to be identified to the number of minimum cells having content in the merge row cells to be identified;

the ratio of the number of minimum cells in the merge row cells to be identified whose feature information differs from that of the corresponding minimum cell in the nearest merge row cell, to the number of minimum cells having content in the merge row cells to be identified.

In this embodiment, the merged cell is a cell obtained by merging at least two minimum cells, and the minimum cell is a cell that cannot be split.

When calculating the ratio of the number of merged cells in the merge row cell to be identified to the number of minimum cells in the merge row cell to be identified, those skilled in the art can understand the following example. Suppose the merge row cell to be identified includes two rows of cells. The first row has 1 merged cell and 4 minimum cells, where the merged cell is the result of merging 2 minimum cells, so the first row has 6 minimum cells in total. The second row has 2 merged cells and 2 minimum cells, where each of the two merged cells is obtained by merging 2 minimum cells, so the second row also has 6 minimum cells in total. The merge row cell to be identified therefore has 3 merged cells and 12 minimum cells in total, and the ratio is 3/12.
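The arithmetic of this example can be checked with a short sketch. It assumes, purely for illustration, that each row is represented as a list of cell widths measured in minimum cells, with a width of 2 or more marking a merged cell; the function name `merged_cell_ratio` and this representation are hypothetical and only model horizontal merging.

```python
from fractions import Fraction

def merged_cell_ratio(rows):
    """Ratio of merged cells to minimum cells in one merge row cell.

    rows: list of rows; each row is a list of cell widths in minimum cells.
    A width of 2 or more marks a merged cell (illustrative representation).
    """
    merged = sum(1 for row in rows for width in row if width >= 2)
    minimum = sum(width for row in rows for width in row)
    return Fraction(merged, minimum)

first_row = [2, 1, 1, 1, 1]  # 1 merged cell (2 minimum cells) plus 4 minimum cells
second_row = [2, 2, 1, 1]    # 2 merged cells (2 minimum cells each) plus 2 minimum cells
ratio = merged_cell_ratio([first_row, second_row])
print(ratio)  # 1/4, which equals 3/12
```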

When calculating the ratio of the number of minimum cells in the merge row cell to be identified whose feature information differs from that of the corresponding minimum cell in the nearest merge row cell, to the number of minimum cells having content in the merge row cell to be identified, those skilled in the art can understand the following example. Assume the merge row cell to be identified has 5 minimum cells in total, whose feature information, from left to right, is number, Chinese, English, and date. The merge row cell closest to it contains two rows of the table to be identified; the feature information of the minimum cells in the first row, from left to right, is number, English, and date, and that of the second row is likewise number, English, and date. Only the feature information of the third cell in the merge row cell to be identified differs from that of the corresponding cell in the nearest merge row cell, so the ratio is 1/5.

Illustratively, if the ratio of the number of minimum cells in Chinese to the number of minimum cells having content in the merge row cell to be identified is a, then the ratio of the number of minimum cells not in Chinese to the number of minimum cells having content in the merge row cell to be identified is 1-a.

Further, the attribute feature of each merge row cell to be identified may further include: the font size of the content in the merge row cell to be identified and the font size of the content in the merge row cell closest to the merge row cell to be identified.

In one embodiment of the present disclosure, before performing step S2100, the method may further include steps S3100 to S3400 as shown below:

Step S3100, cell contents of each cell in the table to be recognized are obtained.

Step S3200, generating feature information of at least one cell in the table to be recognized based on the cell content of each cell in the table to be recognized.

The characteristic information of the cell represents the structure type corresponding to the cell content of the cell.

The characteristic information of one cell represents the type of the cell content of the cell; specifically, the cell contents of each cell in the table to be recognized may be classified into types such as Chinese, English, number, date, time, blank, and the like, and the type is used as the feature information of each cell in the table to be recognized.
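One way to assign such a type to a cell is sketched below. The text does not say how the types are detected, so everything here is an assumption made for illustration: the function name `cell_type`, the regular expressions, the accepted date and time formats, and the fallback value "other" are all hypothetical.

```python
import re

def cell_type(content):
    """Classify cell content into a coarse type (illustrative rules only)."""
    text = content.strip()
    if not text:
        return "blank"
    # Assumed date format: YYYY-MM-DD.
    if re.fullmatch(r"\d{4}-\d{1,2}-\d{1,2}", text):
        return "date"
    # Assumed time format: HH:MM or HH:MM:SS.
    if re.fullmatch(r"\d{1,2}:\d{2}(:\d{2})?", text):
        return "time"
    if re.fullmatch(r"[+-]?\d+(\.\d+)?", text):
        return "number"
    # Any CJK unified ideograph marks the content as Chinese.
    if re.search(r"[\u4e00-\u9fff]", text):
        return "Chinese"
    if re.fullmatch(r"[A-Za-z\s]+", text):
        return "English"
    return "other"

for sample in ["2021-10-20", "12:30", "3.14", "金额", "Total", ""]:
    print(repr(sample), cell_type(sample))
```

The order of the checks matters in this sketch: dates and times are tested before plain numbers so that digit-only fragments are not misread.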

Step S3300, calculating the similarity of any two adjacent rows in the table to be recognized according to the feature information of the cells in the table to be recognized.

In an embodiment of the present disclosure, calculating the similarity of any two adjacent rows in the table to be recognized according to the feature information of the cells in the table to be recognized may include the following. In the case where the table to be recognized contains merged cells, the feature information and cell content of each merged cell are taken as the feature information and cell content of each minimum cell forming that merged cell, and the similarity of every two adjacent rows is then calculated according to the feature information of the minimum cells in each row. In the case where the table to be recognized does not contain merged cells, the similarity of every two adjacent rows is calculated directly according to the feature information of the minimum cells in each row.

Under the condition that the table to be identified contains the merging cells, the feature information and the cell contents of the merging cells are determined to be the feature information and the cell contents of the minimum cells forming the merging cells, so that the number of the feature information of the minimum cells in each row can be equal, the calculation of the similarity of every two adjacent rows is facilitated, and the number of the cell contents of the minimum cells in each row can be equal.

As will be understood by those skilled in the art, if the table to be recognized does not include merged cells, that is, the table to be recognized only includes minimum cells, the number of feature information of the minimum cells in each row in the table to be recognized is equal, and the feature vector of the cells in each row of the table to be recognized can be obtained according to the feature information of the minimum cells in each row.

If the table to be recognized contains the merging cells, the feature vector of each row of cells of the table to be recognized can be obtained according to the feature information of each minimum cell in each row of the table to be recognized.

Specifically, the feature vector of each row of cells of the table to be identified can be obtained according to the feature information of each minimum cell of each row, then the distance between the feature vectors of every two adjacent rows is calculated to serve as the similarity between every two adjacent rows, if the distance between the feature vectors of every two adjacent rows is larger, the similarity between every two adjacent rows is smaller, and if the distance between the feature vectors of every two adjacent rows is smaller, the similarity between every two adjacent rows is larger; of course, the similarity between every two adjacent rows may also be calculated in other ways, and the application is not limited herein.

The distance may be an euclidean distance, or may be other distances, and the embodiment of the present application is not limited herein.

In one embodiment, the feature vector for the row of cells may be generated by:

and generating a feature vector of each row of the table to be recognized based on the corresponding relation between the feature information of each minimum cell of each row of the table to be recognized and a preset numerical value, wherein the feature vector of each row comprises the preset numerical value corresponding to the feature information of each minimum cell in the row.

For example, the correspondence between the characteristic information and the preset value may be as shown in table 1 below:

TABLE 1

Characteristic information	Preset numerical value
Chinese	1
English	2
Number	3
Date	4
Time	5
Blank	0

Suppose that the feature information of the minimum cells in the first row of the table to be recognized is, from left to right, number, date, Chinese, and blank; the feature vector of the first row of the table to be recognized is then (3, 4, 1, 0).

The similarity of two adjacent rows of the table to be recognized is then calculated based on the feature vector of each row of the table to be recognized.
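The feature vector and a distance-based similarity can be sketched as follows. The preset values follow Table 1 and the Euclidean distance is one of the distances the text allows; the mapping from distance to similarity, 1 / (1 + distance), is an assumption chosen only so that a smaller distance yields a larger similarity, and the names `PRESET_VALUES`, `feature_vector`, and `similarity` are hypothetical.

```python
import math

# Preset numerical values from Table 1.
PRESET_VALUES = {"Chinese": 1, "English": 2, "Number": 3,
                 "Date": 4, "Time": 5, "Blank": 0}

def feature_vector(row_types):
    """Map the feature information of each minimum cell to its preset value."""
    return tuple(PRESET_VALUES[cell] for cell in row_types)

def similarity(vec_a, vec_b):
    """Turn the Euclidean distance into a similarity in (0, 1].

    The 1 / (1 + distance) mapping is an assumption, not taken from the text.
    """
    return 1.0 / (1.0 + math.dist(vec_a, vec_b))

first = feature_vector(["Number", "Date", "Chinese", "Blank"])
second = feature_vector(["Number", "Date", "Chinese", "Chinese"])
print(first)                      # (3, 4, 1, 0)
print(similarity(first, second))  # 0.5, because the distance is 1
print(similarity(first, first))   # 1.0, identical rows
```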

Step S3400, merging two adjacent rows of cells with the similarity reaching the similarity threshold, and obtaining merged row cells to be identified according to a merging result.

In this step, the size of the similarity threshold may be set according to actual conditions.

Specifically, multiple rows of cells in the table to be recognized are merged to obtain merged rows of cells to be recognized. Then, the cell content of each cell in the merged row cell to be identified is the cell content of each cell in all rows merged by the merged row cell to be identified.

Illustratively, suppose the similarity threshold of the embodiment of the present invention is 0.9, the similarity between the first row and the second row in the table to be recognized is 0.95, the similarity between the second row and the third row is 0.93, and the similarity between the third row and the fourth row is 0.2. The first row, the second row and the third row are then merged into one merged row cell to be identified, and the cell content of each cell in the first row, the second row and the third row of the table to be recognized is used as the cell content of each cell in the merged row cell to be identified.
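The merging in this example can be reproduced with a small sketch. It assumes the similarities of adjacent rows are already available as a list, and it reads "reaching the threshold" as greater than or equal to; the function name `merge_rows` and the 0-based row indices are hypothetical.

```python
def merge_rows(similarities, threshold):
    """Group adjacent rows whose similarity reaches the threshold.

    similarities[i] is the similarity between row i and row i + 1 (0-based).
    Returns a list of groups, each a list of row indices forming one
    merged row cell to be identified.
    """
    groups = [[0]]
    for i, sim in enumerate(similarities):
        if sim >= threshold:
            # Row i + 1 joins the group that already contains row i.
            groups[-1].append(i + 1)
        else:
            # Similarity too low: row i + 1 starts a new group.
            groups.append([i + 1])
    return groups

groups = merge_rows([0.95, 0.93, 0.2], 0.9)
print(groups)  # [[0, 1, 2], [3]]
```

With the values of the example, the first three rows form one group and the fourth row stays on its own.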

In one embodiment, the cell content of each cell in the cells of the merged row to be identified refers to the cell content of each minimum cell in the merged row, and if the table to be identified contains merged cells, the execution result of step 204 may be directly obtained, and the cell content of each minimum cell in all rows of the table to be identified merged for the merged row is taken as the cell content of each cell in the merged row.

Step S2200 is to obtain first identification data according to the first feature and the first classification model.

In this embodiment, the first classification model may be a binary classification model corresponding to a preset structure type, and is used to determine whether the merge row cell to be identified corresponding to the first feature belongs to the corresponding structure type. The first identification data output by the first classification model is data indicating whether the merge row cells to be identified belong to the corresponding structure category. For example, the first identification data may be any number between 0 and 1.

In particular, the structure categories may include headline, row title, table content, and other. In one example, a first classification model corresponding to each structure category may be preset. That is, a first classification model corresponding to the headline, a first classification model corresponding to the row title, a first classification model corresponding to the table content, and a first classification model corresponding to other may be set in advance. The first classification model corresponding to the headline may be used to determine whether the structure category of the merge row cell to be identified is headline; the first classification model corresponding to the row title may be used to determine whether the structure category of the merge row cell to be identified is row title; the first classification model corresponding to the table content may be used to determine whether the structure category of the merge row cell to be identified is table content; the first classification model corresponding to other may be used to determine whether the structure category of the merge row cell to be identified is other.

In one embodiment of the disclosure, the first identification data output by each first classification model and indicating whether the merge row cell to be identified belongs to the corresponding structure category thereof may be obtained by inputting the first feature into each first classification model.

In another embodiment of the present disclosure, obtaining the first identification data according to the first feature and the first classification model may include steps S2210 to S2220 as shown below:

Step S2210, traverse each row of cells in the merge row cell to be identified, and obtain third identification data of the currently traversed row according to the first feature of the currently traversed row and the first classification model.

In this embodiment, the first feature of the currently traversed row may be input into each first classification model, and the third identification data, which is output by each classification model and indicates whether the currently traversed row belongs to the corresponding structural class, is obtained.

Step S2220, when the traversal is finished, obtain the first identification data of the merge row cell to be identified according to the third identification data of each row of cells in the merge row cell to be identified.

For any one first classification model, the third identification data that the model outputs for each row of cells in the merge row cell to be identified may be averaged, and the resulting average may be used as the first identification data of the merge row cell to be identified for that first classification model.
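Steps S2210 to S2220 amount to a per-category average over the rows of one merge row cell. A minimal sketch, assuming each row's third identification data is held in a dictionary keyed by structure category (the function name `average_row_scores` and the dictionary layout are illustrative assumptions):

```python
from statistics import fmean
from typing import Dict, List


def average_row_scores(row_scores: List[Dict[str, float]]) -> Dict[str, float]:
    """Average the third identification data of every row, per category.

    row_scores holds one dictionary per row of the merge row cell, mapping
    a structure category to the score output by that category's model.
    """
    if not row_scores:
        raise ValueError("a merge row cell contains at least one row")
    categories = list(row_scores[0].keys())
    return {
        category: fmean(row[category] for row in row_scores)
        for category in categories
    }
```

The returned dictionary corresponds to the first identification data of the whole merge row cell.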

Step S2300, obtain the in-doubt merge row cell and its second feature according to the first identification data.

In one embodiment of the present disclosure, a merge row cell to be identified whose identification result is ambiguous may be determined to be an in-doubt merge row cell according to the first identification data.

Specifically, the in-doubt merge row cell may be a merge row cell to be identified whose identification result is ambiguous, where the identification result is the structure category of the merge row cell to be identified as determined according to the first identification data.

In one embodiment of the present disclosure, obtaining the in-doubt merge row cell according to the first identification data may include steps S2310 to S2330 as follows:

Step S2310, traverse the merge row cells to be identified, and obtain the sum data of the first identification data of the traversed merge row cell.

The first identification data of the traversed merge row cell may be the first identification data output by each first classification model according to the first feature of the traversed merge row cell. For example, according to the first feature of the traversed merge row cell, the first identification data output by the first classification model corresponding to the headline may be a1, that output by the first classification model corresponding to the line title may be a2, that output by the first classification model corresponding to the table content may be a3, and that output by the first classification model corresponding to the other may be a4. The sum data sum of the first identification data of the traversed merge row cell may then be expressed as: sum = a1 + a2 + a3 + a4.

Step S2320, a ratio of the first identification data of the traversed merge row cell to the sum data and a standard deviation of the ratio are calculated.

In this embodiment, the ratios of each first identification data and the sum data of the traversed merge row cells may be calculated separately, and the standard deviations of the ratios may be calculated.

The ratio per1 of the first identification data a1, output by the first classification model corresponding to the headline, to the sum data sum can be expressed as: per1 = a1/sum; the ratio per2 of the first identification data a2, output by the first classification model corresponding to the line title, to the sum data sum can be expressed as: per2 = a2/sum; the ratio per3 of the first identification data a3, output by the first classification model corresponding to the table content, to the sum data sum can be expressed as: per3 = a3/sum; and the ratio per4 of the first identification data a4, output by the first classification model corresponding to the other, to the sum data sum can be expressed as: per4 = a4/sum.

The standard deviation std of these ratios can be obtained by the following formula:
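One standard form is written out here as an assumption: it takes std to be the population standard deviation of the four ratios, with divisor 4; a sample standard deviation with divisor 3 would be the other common choice. The symbol \overline{per} for the mean of the ratios is introduced only for this illustration.

```latex
\overline{per} = \frac{1}{4}\sum_{i=1}^{4} per_i ,
\qquad
std = \sqrt{\frac{1}{4}\sum_{i=1}^{4}\left(per_i - \overline{per}\right)^{2}}
```

Because the four ratios add up to 1 whenever sum is non-zero, the mean \overline{per} equals 1/4 in that case.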

Step S2330, obtain the in-doubt merge row cell according to the comparison result of the standard deviation of the traversed merge row cell and a preset threshold.

The preset threshold may be preset according to an application scenario or a specific requirement, and for example, the preset threshold may be 0.2.

When the standard deviation of the traversed merge row cell is greater than the preset threshold, it may be determined that the identification result of the traversed merge row cell is not ambiguous, and the traversed merge row cell may be used as a reference merge row cell. When the standard deviation of the traversed merge row cell is less than or equal to the preset threshold, it may be determined that the identification result of the traversed merge row cell is ambiguous, and the traversed merge row cell may be used as an in-doubt merge row cell. A small standard deviation means that the ratios are close to one another, so that no structure category clearly stands out.
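Steps S2310 to S2330 can be sketched as below. The sketch assumes the population standard deviation (`statistics.pstdev`), uses the example threshold of 0.2 from the text as a default, and treats an all-zero score set as ambiguous; the last point and the function name `is_in_doubt` are assumptions that the text does not address.

```python
from statistics import pstdev
from typing import Dict


def is_in_doubt(scores: Dict[str, float], threshold: float = 0.2) -> bool:
    """Return True when the identification result is ambiguous.

    scores maps each structure category to the first identification data
    (a1..a4 in the text) of one merge row cell.
    """
    total = sum(scores.values())  # sum = a1 + a2 + a3 + a4
    if total == 0:
        # Assumption: all-zero scores favour no category, so treat as ambiguous.
        return True
    ratios = [value / total for value in scores.values()]  # per_i = a_i / sum
    # Assumption: population standard deviation of the ratios.
    return pstdev(ratios) <= threshold
```

A cell for which this returns False would serve as a reference merge row cell, and a cell for which it returns True as an in-doubt merge row cell.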

In this embodiment, the second feature may include at least one of:

the first identification data of the in-doubt merge row cell for at least one structure category;

the number of rows of cells contained in the in-doubt merge row cell;

the number of first features, in the first feature list of any structure category, whose similarity to the first feature of the in-doubt merge row cell is greater than a preset similarity threshold; the first feature list is a list formed by the first features of the merge row cells in the table to be identified whose structure categories are not ambiguous;

the structure category of the previous merge row cell of the in-doubt merge row cell;

the structure category of the next merge row cell of the in-doubt merge row cell;

the difference between the row number of the in-doubt merge row cell and the row number of the previous blank merge row cell;

the difference between the row number of the in-doubt merge row cell and the row number of the next blank merge row cell.

In this embodiment, the first feature list may be obtained from the first features of the reference merge row cells.

Specifically, the first feature list of the headline may be constructed from the first features of the reference merge row cells whose identification result of the structure category is the headline; the first feature list of the line title may be constructed from the first features of the reference merge row cells whose identification result of the structure category is the line title; the first feature list of the table content may be constructed from the first features of the reference merge row cells whose identification result of the structure category is the table content; and the first feature list of the other may be constructed from the first features of the reference merge row cells whose identification result of the structure category is the other.

The first feature list of any one structure category may include the first features of all reference merge row cells whose identification result is that structure category.
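The construction of the first feature lists, and the similarity-count item of the second feature that depends on them, can be sketched as follows. The text does not say how similarity is measured, so cosine similarity and the default threshold of 0.9 are assumptions made purely for illustration, as are all function names.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def build_first_feature_lists(
    reference_cells: List[Tuple[str, Sequence[float]]],
) -> Dict[str, List[Sequence[float]]]:
    """Group the first features of reference merge row cells by category."""
    lists: Dict[str, List[Sequence[float]]] = defaultdict(list)
    for category, first_feature in reference_cells:
        lists[category].append(first_feature)
    return dict(lists)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed similarity measure; the text leaves the measure open."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def count_similar(
    first_feature: Sequence[float],
    feature_list: List[Sequence[float]],
    similarity_threshold: float = 0.9,  # hypothetical value
) -> int:
    """Count list entries whose similarity exceeds the threshold."""
    return sum(
        1
        for reference in feature_list
        if cosine_similarity(first_feature, reference) > similarity_threshold
    )
```

Calling `count_similar` once per structure category yields one count per first feature list for a given in-doubt merge row cell.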

Step S2400, obtain second identification data according to the second feature and the second classification model.

The second classification model may be a four-class classification model, which may be used to determine which of the headline, the line title, the table content, and the other is the structure category to which the in-doubt merge row cell belongs.

Specifically, the second feature may be input into a second classification model, and the second classification model may output second identification data representing a structure category to which the in-doubt merge-row cell belongs.
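A possible way to assemble the second feature and pass it to a four-class model is sketched below. The field names, the integer encoding of neighbouring categories, and the assumption that the model returns one score per category are all hypothetical; the text only states that the second feature is input into the second classification model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical keys for the four structure categories named in the text.
CATEGORIES = ("headline", "line_title", "table_content", "other")


@dataclass
class SecondFeature:
    first_scores: Sequence[float]   # first identification data per category
    row_count: int                  # rows contained in the in-doubt cell
    similar_counts: Sequence[int]   # similar first features per category list
    prev_category: int              # index into CATEGORIES, -1 if absent
    next_category: int              # index into CATEGORIES, -1 if absent
    gap_to_prev_blank: int          # row-number difference to previous blank
    gap_to_next_blank: int          # row-number difference to next blank

    def to_vector(self) -> List[float]:
        """Flatten all items into one numeric vector."""
        return [
            *map(float, self.first_scores),
            float(self.row_count),
            *map(float, self.similar_counts),
            float(self.prev_category),
            float(self.next_category),
            float(self.gap_to_prev_blank),
            float(self.gap_to_next_blank),
        ]


def classify_in_doubt(
    feature: SecondFeature,
    model: Callable[[List[float]], Sequence[float]],
) -> str:
    """Assumption: the model returns one score per entry of CATEGORIES."""
    scores = model(feature.to_vector())
    best = max(range(len(CATEGORIES)), key=lambda i: scores[i])
    return CATEGORIES[best]
```

The returned category name stands in for the second identification data of one in-doubt merge row cell.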

Step S2500, correct the identification result of the structure category of the in-doubt merge row cell according to the second identification data.

In this embodiment, the identification result of the structure category of each merge row cell to be identified may be determined in advance from the first identification data. Since the identification result of the in-doubt merge row cell determined on the basis of the first identification data is ambiguous, that result may be corrected on the basis of the second identification data. That is, in the table to be identified, the identification result of a merge row cell to be identified whose result determined from the first identification data is not ambiguous is determined based on the first identification data; the identification result of an in-doubt merge row cell, whose result determined from the first identification data is ambiguous, is determined based on the second identification data.
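The correction rule can be sketched as a simple override. The sketch assumes that the preliminary result taken from the first identification data is the category with the highest score, which the text does not state explicitly; the function name `correct_results` and the index-keyed dictionary are likewise illustrative.

```python
from typing import Dict, List


def correct_results(
    first_data: List[Dict[str, float]],
    second_labels: Dict[int, str],
) -> List[str]:
    """Return the final structure category of every merge row cell.

    first_data[i] holds the first identification data of cell i.
    second_labels maps the index of each in-doubt cell to the category
    given by the second identification data.
    """
    results: List[str] = []
    for index, scores in enumerate(first_data):
        if index in second_labels:
            # In-doubt cell: the second identification data decides.
            results.append(second_labels[index])
        else:
            # Assumption: the unambiguous result is the top-scoring category.
            results.append(max(scores, key=lambda category: scores[category]))
    return results
```

Only the entries listed in `second_labels` are changed; every other cell keeps the result derived from the first identification data.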

< apparatus embodiment >

In the present embodiment, an identification correction apparatus 4000 for a table structure is provided, as shown in fig. 3, including a first feature obtaining module 4100, a first data obtaining module 4200, a second feature obtaining module 4300, a second data obtaining module 4400, and an identification result correction module 4500. The first feature obtaining module 4100 is configured to obtain a first feature of a merge row cell to be identified in a table to be identified; the first data obtaining module 4200 is configured to obtain first identification data according to the first feature and a first classification model; the second feature obtaining module 4300 is configured to obtain the in-doubt merge row cell and its second feature according to the first identification data; the second data obtaining module 4400 is configured to obtain second identification data according to the second feature and a second classification model; the identification result correction module 4500 is configured to correct the identification result of the structure category of the in-doubt merge row cell according to the second identification data.

In one embodiment of the present disclosure, the first data obtaining module 4200 may be further configured to:

obtain, according to the first feature and a first classification model corresponding to at least one structure category, first identification data of the merge row cell to be identified for the corresponding structure category;

wherein the first classification model is used for determining whether the merge row cell to be identified belongs to the corresponding structure category.

In one embodiment of the present disclosure, the second feature acquisition module 4300 may be further configured to:

determine, according to the first identification data, a merge row cell to be identified whose identification result is ambiguous to be the in-doubt merge row cell.

In one embodiment of the present disclosure, the second feature acquisition module 4300 may further include:

the traversing unit is used for traversing the merge row cells to be identified and obtaining the sum data of the first identification data of the traversed merge row cell;

the calculation unit is used for calculating the ratio of the first identification data of the traversed merge row cell to the sum data, and the standard deviation of the ratios;

the obtaining unit is used for obtaining the in-doubt merge row cell according to the comparison result of the standard deviation of the traversed merge row cell and a preset threshold.

In an embodiment of the disclosure, the obtaining unit may be further configured to:

take, according to the comparison result, a merge row cell to be identified whose standard deviation is less than or equal to the threshold as the in-doubt merge row cell.

In one embodiment of the disclosure, the second data obtaining module 4400 may be further configured to:

and the second classification model is used for judging the structure type of the cells of the in-doubt merging row.

In an embodiment of the present disclosure, the identification correction apparatus 4000 may further include:

the content acquisition unit is used for acquiring the cell content of each cell in the table to be identified;

the information generation unit is used for generating the feature information of at least one cell in the table to be recognized based on the cell content of each cell in the table to be recognized, and the feature information of the cell represents the structure type corresponding to the cell content of the cell;

the similarity calculation unit is used for calculating the similarity of any two adjacent rows in the table to be identified according to the feature information of the cells in the table to be identified;

the merging unit is used for merging two adjacent rows of cells whose similarity reaches the similarity threshold, and obtaining the merge row cells to be identified according to the merging result; each merge row cell to be identified comprises at least one row of cells;

the reference row judging unit is used for determining, according to the first identification data, a reference merge row cell corresponding to at least one structure category in the table to be identified, wherein the reference merge row cell is a merge row cell to be identified whose identification result of the structure category is not ambiguous;

the list construction unit is used for generating a first feature list of at least one structure category according to the first features of the reference merge row cells of the at least one structure category, and obtaining the second feature of the in-doubt merge row cell according to the first feature list.
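The work of the similarity calculation unit and the merging unit can be sketched as a single pass over adjacent rows. The text does not define the similarity measure, so the fraction of column positions with identical feature information is assumed here, along with a hypothetical threshold of 0.8 and hypothetical function names.

```python
from typing import List, Sequence


def row_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Assumed measure: share of columns whose feature information matches."""
    if not a or len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def merge_adjacent_rows(
    rows: List[Sequence[str]],
    similarity_threshold: float = 0.8,  # hypothetical value
) -> List[List[int]]:
    """Group row indices into merge row cells to be identified.

    A row joins the current group when its similarity to the row directly
    above it reaches the threshold; otherwise it starts a new group, so
    every group contains at least one row.
    """
    groups: List[List[int]] = []
    for index, row in enumerate(rows):
        if groups and row_similarity(rows[index - 1], row) >= similarity_threshold:
            groups[-1].append(index)
        else:
            groups.append([index])
    return groups
```

Each inner list of row indices corresponds to one merge row cell to be identified.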

In one embodiment of the disclosure, the second feature includes at least one of:

the number of rows of cells contained in the in-doubt merge row cell;

the structure category of the previous merge row cell of the in-doubt merge row cell;

In one embodiment of the disclosure, the first feature includes at least one of:

In one embodiment of the present disclosure, the characteristic information in the merge row cell to be identified includes at least one of:

In one embodiment of the present disclosure, the first data obtaining module 4200 may further include:

the line traversing unit is used for traversing each line of cells in the merged line cells to be identified, and obtaining third identification data of the currently traversed line according to the first characteristics of the currently traversed line and the first classification model corresponding to at least one structure category;

and the data obtaining unit is used for obtaining first identification data of the merging row unit cells to be identified according to the third identification data of each row of unit cells in the merging row unit cells to be identified under the condition of ending traversal.

It will be understood by those skilled in the art that the identification correction apparatus 4000 of the table structure may be implemented in various ways. For example, the identification correction apparatus 4000 of the table structure may be implemented by configuring a processor with instructions. For example, the instructions may be stored in a ROM and read from the ROM into a programmable device when the device is started, so as to implement the identification correction apparatus 4000 of the table structure. For example, the identification correction apparatus 4000 of the table structure may be hardwired into a dedicated device (e.g., an ASIC). The identification correction apparatus 4000 of the table structure may be divided into mutually independent units, or the units may be implemented in combination. The identification correction apparatus 4000 of the table structure may be implemented by one of the various implementations described above, or by a combination of two or more of them.

In this embodiment, the identification and correction device 4000 for table structure may have various implementation forms, for example, the identification and correction device 4000 for table structure may be any functional module running in a software product or application providing identification service for table structure, or a peripheral insert, plug-in, patch, etc. of the software product or application, or the software product or application itself.

< electronic device embodiment >

The present disclosure also provides an electronic device 5000.

In one embodiment, the electronic device 5000 may include the aforementioned identification correction apparatus 4000 of a table structure.

In another embodiment, the electronic device 5000 may further include a processor 5100 and a memory 5200 as shown in fig. 4, the memory 5200 being configured to store executable instructions; the instructions are used to control the processor 5100 to perform the aforementioned identification correction method of the table structure.

In this embodiment, the electronic device 5000 may be any electronic product having a processor 5100 and a memory 5200, such as a mobile phone, a tablet computer, a palmtop computer, a desktop computer, a notebook computer, a workstation, a game machine, a server, and the like.

< readable storage Medium embodiment >

In this embodiment, there is also provided a readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the method for identifying and correcting a table structure according to any embodiment of the present disclosure.

The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.

The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.

The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.

The computer program instructions for carrying out operations of the present invention may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing an electronic circuit, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), with state information of computer-readable program instructions, which can execute the computer-readable program instructions.

Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, by software, and by a combination of software and hardware are equivalent.

Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.

Claims

1. An identification correction method, comprising:

2. The method of claim 1, wherein obtaining first identification data based on the first feature and the first classification model comprises:

3. The method of claim 1, wherein obtaining an in-doubt merge-row cell from the first identifying data comprises:

4. The method of claim 3, wherein obtaining the suspected merged parallel cell according to the comparison of the merged row cell to be identified comprises:

5. The method of claim 1, wherein obtaining second identification data from the second feature and a second classification model comprises:

6. The method of claim 1, wherein before obtaining the first feature of the merge row cell to be identified in the table to be identified, the method further comprises:

obtaining the cell content of each cell in the table to be identified;

7. The method of claim 1, wherein obtaining the second characteristic of the in-doubt merge-row cell comprises:

8. The method of claim 1, wherein obtaining first identification data based on the first feature and the first classification model comprises:

and under the condition that the traversal is finished, obtaining the first identification data of the merging row unit cells to be identified according to the third identification data of each row of unit cells in the merging row unit cells to be identified.

9. An apparatus for identifying and correcting a table structure, comprising:

10. An electronic device, comprising:

the apparatus of claim 9; or,

a processor and a memory for storing instructions for controlling the processor to perform the method of any of claims 1 to 8.

11. A readable storage medium, characterized in that a computer program is stored thereon, which computer program, when being executed by a processor, carries out the method according to any one of claims 1 to 8.