Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The term "include" and variations thereof as used herein is meant to be inclusive in an open-ended manner, i.e., "including but not limited to". Unless specifically stated otherwise, the term "or" means "and/or". The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like may refer to different or the same object. Other explicit and implicit definitions are also possible below.
As described above, conventional approaches for mining PDF files cannot accurately mine or identify PDF files containing complex frame structures. Even if data in a PDF file is mined, losing the frame structure of the file causes the mined data to lose its actual meaning. Because PDF files have no uniform format, it is difficult to mine a large number of PDF files directly by relying on a uniform format. This problem is particularly acute for professional PDF files, which have a large number of complex characters and a complex file frame structure. If the data is not mined according to the frame structure, the data loses its practical meaning. Such data then needs manual re-labeling at a later stage, and manual labeling is difficult to perform on large volumes of data.
To address at least in part one or more of the above issues and other potential issues, an example embodiment of the present disclosure proposes a scheme for mining PDF files, in which the mechanism (that is, the organization or institution) to which a PDF file belongs may be determined by a mechanism determination algorithm. Each mechanism defines one or more report templates. By employing the template matching algorithm proposed by the present disclosure, a PDF file may be matched to one of the defined report templates. Then, by adopting a data mining algorithm, data can be mined from the PDF file through the matched report template, so that the PDF file can be processed into data with a regular structure. For example, PDF files may be mined into Excel spreadsheets, XML files, YAML files, etc., in which the data remains associated with its actual meaning.
In addition, the disclosure also provides corresponding methods for further mining of the mined data (for example, year mining, deep data mining, and table segmentation), so that the granularity of the mined data is improved.
Fig. 1 shows a schematic diagram of a system 100 for mining PDF files according to an embodiment of the present disclosure. As shown in fig. 1, the system 100 includes a computing device 110, a PDF file management device 130, and a network 140. The computing device 110 and the PDF file management device 130 may exchange data through the network 140 (e.g., the internet).
The PDF file management device 130 may perform, for example, conventional management of PDF files, such as collection and storage of PDF files. The PDF file management device 130 may also send the managed PDF files to the computing device 110. The PDF file management device 130 may be, for example and without limitation, any of the following that can read and modify PDF files: desktop computers, laptop computers, netbook computers, tablet computers, web browsers, e-book readers, Personal Digital Assistants (PDAs), wearable computers (such as smart watches and activity tracker devices), and the like. The PDF file management device 130 may be configured to store PDF files, send PDF files to the computing device 110 via the network 140, and receive PDF files from the computing device 110 for processing.
The computing device 110 is used, for example, to receive PDF files from the PDF file management device 130 via the network 140. The computing device 110 may perform mechanism identification on a received PDF file. Based on the identified mechanism, a report template of that mechanism associated with the PDF file may be matched. Based on the matched template, relevant data can be accurately mined from the PDF file. The computing device 110 may also perform text block deduplication, data validation, and normalization on the mined data. The computing device 110 may have one or more processing units, including special purpose processing units such as GPUs, FPGAs, and ASICs, as well as general purpose processing units such as a CPU. Additionally, one or more virtual machines may also be running on each computing device 110. In some embodiments, the computing device 110 and the PDF file management device 130 may be integrated or may be provided separately from each other. In some embodiments, the computing device 110 includes, for example, an extraction unit 112, a mechanism determination unit 114, a template matching unit 116, a template determination unit 118, a data mining unit 120, and an additional processing unit 122.
The extraction unit 112 is configured to parse the text blocks of the PDF file so as to acquire coordinate information of the text blocks of the PDF file.
The mechanism determination unit 114 is configured to determine a target association mechanism associated with the PDF file using a mechanism determination algorithm based on the parsed text blocks of the PDF file.
The template matching unit 116 is configured to match one or more report templates of the target association mechanism with the coordinate information of the text blocks by using a matching algorithm, so as to determine matching degree data of the one or more report templates and the PDF file.
The template determination unit 118 is configured to determine the report template of the target association mechanism corresponding to the PDF file based on the acquired matching degree data.
The data mining unit 120 is configured to mine, based on the determined report template, the data in the PDF file corresponding to the determined report template.
The additional processing unit 122 may be configured to perform various operations such as data validation, data normalization, data deduplication, and so on.
Units 112-120 may extract text information from a PDF file. Based on the extracted text information, the association mechanism associated with the PDF file may be determined. After the association mechanism is determined, the report template associated with the PDF file can be determined by means of coordinate matching. Based on the determined report template, the data in the PDF file can be mined by matching, so that the PDF file can be accurately mined and the actual meaning of the data in the PDF file is preserved.
Based on the data mined by units 112-120, the additional processing unit 122 may also perform various operations such as data validation, data normalization, data deduplication, etc. on the mined data. After the above processing is completed for the PDF file, the data in the mined PDF file may be transmitted to the PDF file management apparatus 130 via the network 140.
Note that the scheme of the present disclosure for mining PDF files involves a coordinate system for locating characters in the PDF file. In the art, the coordinate system of a PDF file may take the upper left corner of the page as the origin, with the x axis extending horizontally to the right of the origin and the y axis extending vertically downward from the origin. In such a coordinate system, a text block can be located by its upper left and lower right coordinates. However, a different coordinate system may also be established in other ways. The choice of coordinate system does not affect the technical scheme for mining PDF files provided by the disclosure.
A method 200 for mining PDF files is described below in conjunction with fig. 1. Fig. 2 shows various paths and orders for the purpose of presenting the working principle of the scheme for mining PDF files as a whole, but some of the paths and orders are not necessary for implementing the following examples, and the various methods involved in the technical solution of the present disclosure may be performed in different orders and along different paths.
Fig. 2 shows a flow diagram of a method 200 for mining PDF files according to an embodiment of the present disclosure. The method 200 may be performed by the computing device 110 as shown in FIG. 1, or may be performed at the electronic device 1200 shown in FIG. 12. It should be understood that method 200 may also include additional blocks not shown and/or may omit blocks shown, as the scope of the present disclosure is not limited in this respect.
At step 202, the computing device 110 may parse the text blocks of the PDF file to obtain coordinate information of the text blocks of the PDF file.
In some embodiments, the computing device 110 may parse all or a portion of the text blocks in the PDF file into editable text blocks via processing tools commonly used in the PDF processing field, such as PDFminer, Camelot, and the like.
Note that parsing a text block using processing tools commonly used in the field of PDF processing involves parsing only the text content in a PDF file, i.e., identifying processable characters or character strings legally defined therein. Here, the processing tool does not recognize the structure of the PDF file. For example, no associations between multiple text blocks in a file are identified at this step.
It should also be noted that the processing tools commonly used in the field of PDF processing may include any code, software, or library files that can parse PDF text, such as software packages or software libraries that can be invoked from programming languages such as Python and Java, including but not limited to PDFminer, Camelot, etc.
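As an illustration of step 202, the following is a minimal sketch that assumes the pdfminer.six package is installed. The helper names `to_top_left` and `parse_text_blocks` are hypothetical and are not part of the disclosure. pdfminer.six reports bounding boxes with the origin at the lower left corner of the page, so the sketch converts them to the upper-left-origin coordinate system described above.

```python
from typing import List, Tuple

# (x_left, y_top, x_right, y_bottom) in an upper-left-origin coordinate system
BBox = Tuple[float, float, float, float]


def to_top_left(bbox: BBox, page_height: float) -> BBox:
    """Convert a lower-left-origin box (x0, y0, x1, y1) to upper-left-origin form."""
    x0, y0, x1, y1 = bbox
    return (x0, page_height - y1, x1, page_height - y0)


def parse_text_blocks(path: str) -> List[Tuple[int, str, BBox]]:
    """Return (page_number, text, bbox) for every non-empty text block in the PDF."""
    # Imported lazily so that to_top_left can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    blocks = []
    for page_number, page in enumerate(extract_pages(path), start=1):
        for element in page:
            # Only text containers are kept; no structure between blocks is inferred.
            if isinstance(element, LTTextContainer):
                text = element.get_text().strip()
                if text:
                    bbox = to_top_left(element.bbox, page.height)
                    blocks.append((page_number, text, bbox))
    return blocks
```

Consistent with the note above, this sketch only recovers text and coordinates. It does not identify any association between text blocks.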
At step 204, the computing device 110 may determine a target association mechanism associated with the PDF file using a mechanism determination algorithm based on the parsed text blocks of the PDF file.
In one embodiment, based on the text blocks of the PDF file parsed in step 202, computing device 110 may utilize a mechanism determination algorithm to determine a target association mechanism associated with the PDF file. In the context of the present disclosure, a target association mechanism may be any entity that has an association with a PDF file, such as a producer, or an issuer of the PDF file.
Since the target association mechanism typically publishes PDF files using one or more fixed report templates, such PDF files are strongly correlated over time. With this strong correlation, a report template associated with the PDF file may be determined, and the data of the PDF file may be mined based on that report template.
The principle of the mechanism determination algorithm is to identify the target association mechanism among multiple mechanisms based on strong characteristics (e.g., address, logo identification) found in the parsed text blocks. By defining a set of characteristics for each mechanism, it is possible to calculate how many text blocks are associated with the characteristics of that mechanism. Further, through weight calculation, scores of the PDF file with respect to the plurality of mechanisms can be calculated. Based on the best score, the mechanism associated with the PDF file can be determined.
The mechanism determination algorithm and the mechanism determination step will be described in detail hereinafter.
At step 206, the computing device 110 may determine matching data of one or more report templates to the PDF file by matching one or more report templates of the target affiliate with the coordinate information of the text block using a matching algorithm.
In one embodiment, based on the target affiliation determined in step 204, the computing device 110 may match one or more report templates defined under the target affiliation name to the coordinate information of the text block, respectively. The computing device 110 will match the text blocks according to the features in the report template to determine one or more match-degree data for one or more report templates.
The matching algorithm and the matching step will be described in detail below.
At step 208, the computing device 110 may determine a report template for the target affiliate corresponding to the PDF file based on the obtained match data.
In one embodiment, based on the one or more match data determined in step 206 for the one or more reporting templates for the target affiliate, the computing device 110 may determine the reporting template in which the best match data is the reporting template for the target affiliate corresponding to the PDF file. The determined report template may be used in a subsequent step to mine the data of the PDF file.
At step 210, the computing device 110 may mine the data in the PDF file corresponding to the determined reporting template based on the determined reporting template.
In one embodiment, the computing device 110 may mine the PDF file using a data mining algorithm based on the report template determined at step 208, thereby mining the data in the PDF file corresponding to the determined report template. In particular, the computing device 110 may match the mined features in the report template against the PDF file again according to the data mining algorithm. If the matching succeeds, the data matched with a mined feature is mined as data associated with that mined feature, a corresponding identifier is added, and the data is extracted from the PDF file or stored outside the PDF file. If the matching fails, the PDF file may be processed further by a method such as manual identification.
The data mining algorithm and the mining steps will be described in detail below.
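The overall flow of steps 204 to 210 can be sketched as follows. This is only an outline under stated assumptions: the function `mine_pdf` and its four callable parameters are hypothetical placeholders for the algorithms described in detail later, and the text blocks are assumed to have already been parsed in step 202.

```python
from typing import Any, Callable, Optional, Sequence


def mine_pdf(
    text_blocks: Sequence[Any],
    determine_mechanism: Callable[[Sequence[Any]], Optional[str]],
    templates_of: Callable[[str], Sequence[Any]],
    matching_degree: Callable[[Any, Sequence[Any]], float],
    mine_with_template: Callable[[Any, Sequence[Any]], Any],
) -> Optional[Any]:
    """Run steps 204-210 on already parsed text blocks; None means manual handling."""
    # Step 204: determine the target association mechanism.
    mechanism = determine_mechanism(text_blocks)
    if mechanism is None:
        return None
    # Step 206: score every report template defined for that mechanism.
    templates = templates_of(mechanism)
    if not templates:
        return None
    # Step 208: keep the template with the best matching degree data.
    best = max(templates, key=lambda t: matching_degree(t, text_blocks))
    # Step 210: mine the data using the selected template.
    return mine_with_template(best, text_blocks)
```

Passing the individual algorithms in as callables keeps the outline independent of how each step is implemented.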
Fig. 3 shows a flow diagram of a method 300 of determining a target association mechanism associated with the PDF file using a mechanism determination algorithm according to an embodiment of the present disclosure. Method 300 corresponds to step 204 of method 200. The method 300 may be performed by the computing device 110 as shown in FIG. 1, or may be performed at the electronic device 1200 shown in FIG. 12.
As described above, PDF files produced by mechanisms (e.g., legal institutions, financial institutions) have strong characteristics. The target association mechanism with which the PDF file is associated (e.g., the mechanism that produced the PDF file) can be determined based on the strong characteristics of the mechanism.
At step 302, the computing device 110 may build a mechanism key feature array for a plurality of mechanisms associated with a PDF file, the mechanism key feature array comprising: the number of key features associated with the organization, the key features, and the weights to which the key features correspond.
Specifically, the user may define in advance the number of key features associated with each mechanism, the key features themselves, and the weights corresponding to the key features. For example, for a certain securities company, the user may define one or more (e.g., 3) key features, such as the company name, the registered office address of the company, and the company identifier (logo), and assign a corresponding weight to each feature, such as a weight of 1 for the company name, a weight of 3 for the registered office address, and a weight of 5 for the company identifier (logo). A feature with a higher weight is considered more relevant to the mechanism.
At step 304, the computing device 110 may search the text blocks parsed from the PDF file based on the mechanism key feature array to determine the number of occurrences of the key features associated with each mechanism. Once the key features are set, the parsed text blocks of the PDF file can be searched; the text may be extracted in the manner described above. Through this text search, the number of occurrences of the key features associated with a mechanism may be determined. The number of times a key feature occurs may be combined with the weights defined in step 302 to calculate the likelihood that the PDF file is associated with the mechanism.
At step 306, the computing device 110 may calculate, based on the determined number of occurrences of the key features associated with each mechanism, a weight sequence for use in determining the target association mechanism of the PDF file. After the key features, the feature weights, and the number of occurrences of the features are obtained, the weight sequence may be generated. The weight sequence is then ranked, and the first-ranked mechanism is taken as the target association mechanism of the PDF file. For example, if a securities company ranks first in the weight sequence, the PDF file may be considered to be associated with that securities company, for example, the file was written by that securities company.
With this solution, the degree of correlation (e.g. a sequence of weights) of one or more institutions with a PDF file can be calculated by the strong features of the defined institutions. The target association mechanism associated with the PDF file may be determined by the weight sequence.
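Method 300 can be sketched as follows. The disclosure does not give the exact scoring formula, so this sketch assumes that the score of a mechanism is the sum of weight times number of occurrences over its key features. The mechanism names, key features, and the function `weight_sequence` are hypothetical.

```python
from typing import Dict, List

# Hypothetical mechanism key feature arrays (step 302): mechanism -> {key feature: weight}.
KEY_FEATURES: Dict[str, Dict[str, float]] = {
    "Securities Co. A": {"Securities Co. A": 1, "1 Example Road": 3},
    "Bank B": {"Bank B": 1, "9 Sample Street": 3},
}


def weight_sequence(
    text_blocks: List[str], key_features: Dict[str, Dict[str, float]]
) -> Dict[str, float]:
    """Score each mechanism as sum(weight * occurrences); the formula is an assumption."""
    scores: Dict[str, float] = {}
    for mechanism, features in key_features.items():
        total = 0.0
        for feature, weight in features.items():
            # Step 304: count occurrences of the key feature across all text blocks.
            occurrences = sum(block.count(feature) for block in text_blocks)
            total += weight * occurrences
        # Step 306: one entry of the weight sequence per mechanism.
        scores[mechanism] = total
    return scores
```

Sorting the returned dictionary by value in descending order yields the ranking described in step 306.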
FIG. 4 shows a flow diagram of a method 400 for determining a target association mechanism for a PDF file according to an embodiment of the present disclosure. Method 400 corresponds to step 204 in method 200. The method 400 may be performed by the computing device 110 as shown in fig. 1, or may be performed at the electronic device 1200 shown in fig. 12.
At step 402, the computing device 110 may determine the mechanism corresponding to the maximum value in the sequence of weights. By the method described in method 300, a weight sequence of PDF files can be obtained, and a mechanism corresponding to the maximum value in the sequence can be specified.
At step 404, the computing device 110 may determine whether the number of mechanisms corresponding to the maximum value is 1, i.e., whether more than one mechanism corresponds to the maximum value. For example, if the same maximum value occurs two or more times in the weight sequence, it corresponds to two or more different mechanisms.
At step 406, the computing device 110 may determine that the institution corresponding to the maximum value is the target affiliate of the PDF file in response to determining that the number of institutions corresponding to the maximum value is 1. If only 1 maximum value exists, the mechanism corresponding to the maximum value is the target association mechanism of the PDF file.
At step 408, the computing device 110 may determine that the target-associated organization is not identified in response to determining that the number of organizations corresponding to the maximum value is greater than 1. If a plurality of same maximum values exist and the mechanisms corresponding to the maximum values are different, the target association mechanism of the PDF file cannot be determined. Further methods such as manual identification are needed to determine the target association of the PDF files.
With this solution, when the weight sequence has a plurality of identical maximum values, the target association mechanism associated with the PDF file can be determined by a further method, such as manual identification.
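The decision logic of method 400 can be sketched as below. The function name `pick_target` is hypothetical, and returning `None` is used here as an assumed way of signalling that the file should be handed to a further method such as manual identification.

```python
from typing import Dict, Optional


def pick_target(scores: Dict[str, float]) -> Optional[str]:
    """Return the unique top-scoring mechanism, or None if there is a tie or no data."""
    if not scores:
        return None
    # Step 402: find the maximum value in the weight sequence.
    best = max(scores.values())
    # Step 404: count how many mechanisms share that maximum value.
    winners = [mechanism for mechanism, score in scores.items() if score == best]
    # Step 406 (exactly one winner) or step 408 (tie, not identified).
    return winners[0] if len(winners) == 1 else None
```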
FIG. 5 illustrates a flow diagram of a method 500 for matching one or more report templates of the target affiliate with the coordinate information of the text block using a matching algorithm in accordance with an embodiment of the present disclosure. Method 500 corresponds to step 206 of method 200. The method 500 may be performed by the computing device 110 as shown in FIG. 1, or may be performed at the electronic device 1200 shown in FIG. 12.
As indicated above, since the target affiliate may be provided with different report templates for different reports, matching of one or more report templates of the target affiliate is also required after the target affiliate is determined.
At step 502, the computing device 110 may define an identifying feature block for each of the one or more reporting templates, respectively.
In one embodiment, the computing device 110 may define an identifying feature block separately for each of one or more report templates of the target affiliate.
Fig. 6 shows a schematic diagram of defining identifying feature blocks according to an embodiment of the present disclosure. As shown in fig. 6, the computing device 110 may define three identifying feature blocks for the report, namely a stock code area, a title area, and a summary area. These three key areas essentially cover the main features of the report. Based on these three features, it can be determined whether the PDF file belongs to this report template. In other embodiments, identifying feature blocks containing other regions may also be defined. For example, the upper right corner may also be defined as the identifying feature block of the title area. The more identifying feature blocks are defined, the more accurate the identification match.
At step 504, the computing device 110 may obtain coordinate information of the identified feature blocks.
In one embodiment, the computing device 110 may obtain coordinate information for the identifying feature block defined at step 502. The coordinate information may include an upper left coordinate and a lower right coordinate of the identifying feature block. Alternatively or additionally, the upper right coordinate and the lower left coordinate of the identifying feature block, or other coordinates of the identifying feature block, may also be included. Note that the coordinate information may be transformed according to the definition of the coordinate system. Such transformed coordinates are all included in the technical solution of the present disclosure.
At step 506, the computing device 110 may calculate, for each of the one or more reporting templates, a matching value of the reporting template to the text block according to a matching function based on the coordinate information of the text block and the coordinate information of the identified feature block of the reporting template.
In one embodiment, for each of the one or more report templates of the target affiliate, the computing device 110 may calculate a match value of the report template to the text block according to a matching function based on the coordinate information of the text block of the PDF file and the coordinate information of the identifying feature block of the report template acquired in step 504.
Taking the upper left coordinate and the lower right coordinate of the text block and of the identifying feature block as an example, the matching function may be expressed as follows: if any one of the following conditions is satisfied, the matching value of the matching function is a first predetermined value:
the abscissa value of the upper left coordinate of the text block falls within the abscissa value interval of the upper left coordinate and the lower right coordinate of the identifying feature block and the ordinate value of the upper left coordinate of the text block falls within the ordinate value interval of the upper left coordinate and the lower right coordinate of the identifying feature block;
the abscissa value of the lower right coordinate of the text block falls within the abscissa value interval of the upper left coordinate and the lower right coordinate of the identifying feature block and the ordinate value of the lower right coordinate of the text block falls within the ordinate value interval of the upper left coordinate and the lower right coordinate of the identifying feature block;
the abscissa value of the upper left coordinate of the identifying feature block falls within the abscissa value interval of the upper left coordinate and the lower right coordinate of the text block and the ordinate value of the upper left coordinate of the identifying feature block falls within the ordinate value interval of the upper left coordinate and the lower right coordinate of the text block; or
the abscissa value of the lower right coordinate of the identifying feature block falls within the abscissa value interval of the upper left coordinate and the lower right coordinate of the text block and the ordinate value of the lower right coordinate of the identifying feature block falls within the ordinate value interval of the upper left coordinate and the lower right coordinate of the text block,
where none of the above conditions is satisfied, the matching value of the matching function is a second predetermined value. In an embodiment, the first predetermined value may be 1 and the second predetermined value may be 0. With the above matching function, the computing device 110 may calculate a matching value of the report template to one text block.
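In combination with equation (1), and taking the first predetermined value as 1 and the second predetermined value as 0, the matching function $f$ can be expressed as:

$$
f(T, F) =
\begin{cases}
1, & \text{if } (x_1^T, y_1^T) \in F \ \text{or}\ (x_2^T, y_2^T) \in F \ \text{or}\ (x_1^F, y_1^F) \in T \ \text{or}\ (x_2^F, y_2^F) \in T \\
0, & \text{otherwise}
\end{cases}
\tag{1}
$$

wherein $T$ represents a text block, $F$ represents an identifying feature block, $(x_1^T, y_1^T)$ represents the upper left coordinate of the text block $T$, $(x_2^T, y_2^T)$ represents the lower right coordinate of the text block $T$, $(x_2^F, y_2^F)$ represents the lower right coordinate of the identifying feature block $F$, and $(x_1^F, y_1^F)$ represents the upper left coordinate of the identifying feature block $F$. Here $(x, y) \in B$ means that $x_1^B \le x \le x_2^B$ and $y_1^B \le y \le y_2^B$ for a block $B$.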
Note that the matching function may be set to a different combination of conditions according to the area of the identifying feature block. For example, the matching value of the matching function may be the first predetermined value only if either of the following conditions is satisfied:
the abscissa value of the upper left coordinate of the text block falls within the abscissa value interval of the upper left coordinate and the lower right coordinate of the identifying feature block and the ordinate value of the upper left coordinate of the text block falls within the ordinate value interval of the upper left coordinate and the lower right coordinate of the identifying feature block; or
the abscissa value of the lower right coordinate of the text block falls within the abscissa value interval of the upper left coordinate and the lower right coordinate of the identifying feature block and the ordinate value of the lower right coordinate of the text block falls within the ordinate value interval of the upper left coordinate and the lower right coordinate of the identifying feature block;
where neither of the above conditions is satisfied, the matching value of the matching function is the second predetermined value. The user can flexibly set the matching function to meet different identification and matching requirements.
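The corner-containment test described above can be sketched as follows, using an upper-left-origin coordinate system and the values 1 and 0 for the first and second predetermined values. The names `Box`, `contains`, `match_value`, and `match_value_text_corners` are hypothetical, and treating interval boundaries as inclusive is an assumption.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x1: float  # abscissa of the upper left coordinate
    y1: float  # ordinate of the upper left coordinate
    x2: float  # abscissa of the lower right coordinate
    y2: float  # ordinate of the lower right coordinate


def contains(box: Box, x: float, y: float) -> bool:
    """True if the point (x, y) lies within the box, boundaries included (assumed)."""
    return box.x1 <= x <= box.x2 and box.y1 <= y <= box.y2


def match_value(text: Box, feature: Box) -> int:
    """Four-condition variant: a corner of either block lies within the other block."""
    hit = (
        contains(feature, text.x1, text.y1)
        or contains(feature, text.x2, text.y2)
        or contains(text, feature.x1, feature.y1)
        or contains(text, feature.x2, feature.y2)
    )
    return 1 if hit else 0


def match_value_text_corners(text: Box, feature: Box) -> int:
    """Two-condition variant: only the corners of the text block are tested."""
    hit = contains(feature, text.x1, text.y1) or contains(feature, text.x2, text.y2)
    return 1 if hit else 0
```

The two variants differ when a text block completely encloses the identifying feature block: the four-condition variant reports a match, while the two-condition variant does not.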
Since the PDF file includes one or more text blocks, the computing device 110 may calculate, for each text block in turn, its matching value with the report template, thereby obtaining the total matching degree data of the PDF file and the report template.
At step 508, the computing device 110 may operate on all of the calculated match values to determine match data for one or more report templates to the PDF file.
In one embodiment, the computing device 110 may operate on the matching values calculated in step 506 for one or more text blocks that respectively match a reporting template to determine the matching data of the reporting template to the PDF file. Here, the operation may be implemented by a direct summation.
In combination with equation (2), the matching degree data $S$ can be expressed as:

$$
S = \sum_{i} \sum_{j} f(T_i, F_j) \tag{2}
$$

wherein $f(T_i, F_j)$ represents the score obtained by matching the $i$-th text block $T_i$ with the $j$-th identifying feature block $F_j$ using the matching function $f$.
In another embodiment, the operation may be implemented by weighted summation. For example, different weighting factors can be set for one or more text blocks, i.e., a corresponding weighting factor is assigned to each match. For example, a more important region may be given a larger weighting factor. The weighted sum thereby helps ensure that the matching degree data of the report template and the PDF file is sufficiently accurate.
In this way, by calculating the matching degree data of each of the one or more report templates of the target association mechanism with the PDF file, the matching degree data of each report template with the PDF file can be acquired. Therefore, the report template with the highest score can be selected as the report template of the target association mechanism that matches the PDF file.
By using the technical scheme, the report template which is most matched with the PDF file in the target association mechanism can be calculated through the matching function based on the coordinate information. The template may be used in subsequent steps to accurately mine the PDF file.
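Steps 506 to 508 and the template selection of step 208 can be sketched together as follows. The template names and coordinates are invented for illustration, the function names are hypothetical, and attaching the optional weights to the identifying feature blocks (rather than to the text blocks) is an assumption about how the weighted summation is applied.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# (x1, y1, x2, y2): upper left and lower right coordinates, upper-left origin
Box = Tuple[float, float, float, float]


def _inside(box: Box, x: float, y: float) -> bool:
    return box[0] <= x <= box[2] and box[1] <= y <= box[3]


def match_value(text: Box, feature: Box) -> int:
    """Return 1 if a corner of either block lies within the other block, else 0."""
    hit = (
        _inside(feature, text[0], text[1])
        or _inside(feature, text[2], text[3])
        or _inside(text, feature[0], feature[1])
        or _inside(text, feature[2], feature[3])
    )
    return 1 if hit else 0


def matching_degree(
    text_blocks: Sequence[Box],
    feature_blocks: Sequence[Box],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Sum the match values over all text block / feature block pairs."""
    if weights is None:
        weights = [1.0] * len(feature_blocks)  # direct summation
    return sum(
        w * match_value(t, f)
        for t in text_blocks
        for f, w in zip(feature_blocks, weights)
    )


def best_template(
    text_blocks: Sequence[Box], templates: Dict[str, List[Box]]
) -> Tuple[str, Dict[str, float]]:
    """Return the name of the highest-scoring template and all scores."""
    scores = {name: matching_degree(text_blocks, fb) for name, fb in templates.items()}
    return max(scores, key=scores.get), scores
```

If two templates share the highest score, `max` returns the first one in dictionary order; a real implementation might instead flag such a file for manual review.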
Fig. 7 shows a flowchart of a method 700 of mining data in the PDF file corresponding to the determined report template according to an embodiment of the present disclosure. Method 700 corresponds to step 210 of method 200. Method 700 may be performed by computing device 110 as shown in FIG. 1, or may be performed at electronic device 1200 shown in FIG. 12.
At step 702, the computing device 110 may define a mined feature block for each of the one or more report templates, respectively.
In one embodiment, similar to identifying feature blocks as described above, the computing device 110 may define mined feature blocks separately for each of the one or more report templates. The mining feature block may cover the area where the data that needs to be mined is located. For example, if stock codes and summaries in a PDF file need to be mined, stock code areas and summary areas may be defined as mining feature blocks. In this step, one or more mined feature blocks may be defined.
At step 704, the computing device 110 may obtain coordinate information of the mined feature blocks.
In one embodiment, the computing device 110 may obtain coordinate information of the mined feature blocks. The coordinate information may include an upper left coordinate and a lower right coordinate of the mined feature block. Alternatively or additionally, the upper right coordinate and the lower left coordinate of the mined feature block, or other coordinates of the mined feature block, may also be included. Note that the coordinate information may be transformed according to the definition of the coordinate system. Such transformed coordinates are all included in the technical solution of the present disclosure.
At step 706, the computing device 110 may calculate a matching value of the determined report template to the text block according to a mining matching function based on the coordinate information of the text block and the coordinate information of the mining feature block of the determined report template.
In one embodiment, based on the report template of the target association mechanism associated with the PDF file determined in the previous step, the computing device 110 may mine the PDF file according to the coordinate information of the mined feature blocks of the report template, the coordinate information of the text blocks, and the mined matching function. The mining matching function in step 706 may be similar to the matching function in step 506. Taking a similar matching function as an example, the mining matching function may be expressed as satisfying any one of the following conditions, and then the matching value of the mining matching function is a first predetermined value:
the abscissa value of the upper left coordinate of the text block falls within the abscissa value interval of the upper left coordinate and the lower right coordinate of the mined feature block and the ordinate value of the upper left coordinate of the text block falls within the ordinate value interval of the upper left coordinate and the lower right coordinate of the mined feature block;
the abscissa value of the lower right coordinate of the text block falls within the abscissa value interval of the upper left coordinate and the lower right coordinate of the mined feature block and the ordinate value of the lower right coordinate of the text block falls within the ordinate value interval of the upper left coordinate and the lower right coordinate of the mined feature block;
the abscissa value of the upper left coordinate of the mined feature block falls within the abscissa value interval of the upper left coordinate and the lower right coordinate of the text block and the ordinate value of the upper left coordinate of the mined feature block falls within the ordinate value interval of the upper left coordinate and the lower right coordinate of the text block; and
the abscissa value of the lower right coordinate of the mined feature block falls within the abscissa value interval of the upper left coordinate and the lower right coordinate of the text block and the ordinate value of the lower right coordinate of the mined feature block falls within the ordinate value interval of the upper left coordinate and the lower right coordinate of the text block.
In a case where none of the above conditions is satisfied, the matching value of the mining matching function is a second predetermined value. In an embodiment, the first predetermined value may be 1 and the second predetermined value may be 0. With the above mining matching function, the computing device 110 may determine whether a text block of the PDF file matches a mining feature block of the determined report template.
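In combination with formula (3), the mining matching function may be expressed as:

$$
f_{\mathrm{mine}}(T, M) =
\begin{cases}
1, & P_T^{ul} \in M \ \text{or}\ P_T^{lr} \in M \ \text{or}\ P_M^{ul} \in T \ \text{or}\ P_M^{lr} \in T \\
0, & \text{otherwise}
\end{cases}
$$

wherein $T$ represents a text block, $M$ represents the mining feature block, $P_T^{ul}$ represents the upper left coordinate of the text block, $P_T^{lr}$ represents the lower right coordinate of the text block, $P_M^{lr}$ represents the lower right coordinate of the mining feature block, and $P_M^{ul}$ represents the upper left coordinate of the mining feature block. Here $P \in B$ means that the abscissa and the ordinate of $P$ both fall within the intervals spanned by the upper left and lower right coordinates of block $B$.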
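The four corner conditions of the mining matching function can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the disclosed implementation: the `Block` class, its field names, and the `mining_match` function are hypothetical names introduced here, and the first and second predetermined values are taken to be 1 and 0 as in the embodiment described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Rectangle described by its upper left and lower right corners (hypothetical type)."""
    x_ul: float
    y_ul: float
    x_lr: float
    y_lr: float

    def contains(self, x: float, y: float) -> bool:
        # min/max keeps the test valid whether the y axis grows upward or downward.
        in_x = min(self.x_ul, self.x_lr) <= x <= max(self.x_ul, self.x_lr)
        in_y = min(self.y_ul, self.y_lr) <= y <= max(self.y_ul, self.y_lr)
        return in_x and in_y


def mining_match(text: Block, feature: Block) -> int:
    """Return 1 if any of the four corner conditions holds, otherwise 0."""
    hit = (
        feature.contains(text.x_ul, text.y_ul)       # text upper left inside feature block
        or feature.contains(text.x_lr, text.y_lr)    # text lower right inside feature block
        or text.contains(feature.x_ul, feature.y_ul) # feature upper left inside text block
        or text.contains(feature.x_lr, feature.y_lr) # feature lower right inside text block
    )
    return 1 if hit else 0
```

For instance, with an assumed feature block `Block(100, 50, 300, 80)`, the text block `Block(120, 55, 200, 70)` yields 1, while `Block(400, 200, 500, 220)` yields 0.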
Note that the matching function used in the mining step may also differ from the matching function used in the matching step. Other combinations of conditions may be set according to the area of the mining feature block. For example, the matching value of the mining matching function may be the first predetermined value only if any one of the following conditions is satisfied:
the abscissa value of the upper left coordinate of the mined feature block falls within the abscissa value interval of the upper left coordinate and the lower right coordinate of the text block and the ordinate value of the upper left coordinate of the mined feature block falls within the ordinate value interval of the upper left coordinate and the lower right coordinate of the text block; and
the abscissa value of the lower right coordinate of the mined feature block falls within the abscissa value interval of the upper left coordinate and the lower right coordinate of the text block and the ordinate value of the lower right coordinate of the mined feature block falls within the ordinate value interval of the upper left coordinate and the lower right coordinate of the text block.
In a case where none of the above conditions is satisfied, the matching value of the mining matching function is the second predetermined value. The user can flexibly set the mining matching function to meet different mining requirements.
Since the PDF file includes one or more text blocks, the computing device 110 may calculate, in turn, the matching value between each of the one or more text blocks and the mining feature blocks of the report template. Subsequently, text blocks in the PDF file can be mined according to the matching values.
At step 708, the computing device 110 may mine text blocks for which the match value is not the second predetermined value as data corresponding to the determined report template.
In one embodiment, the computing device 110 may mine the text blocks whose matching value, as computed by the mining matching function in step 706, is not the second predetermined value (e.g., 0), i.e., is the first predetermined value (e.g., 1), as data corresponding to the determined report template. A matching value of 0 indicates that the text block does not match the mining feature blocks of the report template at all.
Mining includes extracting data from the PDF file and/or storing it in a corresponding database according to the names of the mining feature blocks. For example, "600315.SH" in FIG. 6 may be mined as the stock code of the PDF report. At the same time, "600315.SH" is stored in the location or database where the stock code for that PDF file should be stored. In this way, features of the PDF file such as the title, author, writing date, stock name, stock code, and abstract can each be extracted and/or stored in the corresponding database, thereby obtaining the original text information data of the desired PDF file.
In one embodiment, if the matching values between the mining feature blocks of the matched report template and the text blocks are all 0, the target association mechanism may be considered to have introduced a new report template. In this case, the PDF file may be handed over to other processing. For example, a new report template is added or defined, and recognition feature blocks and mining feature blocks are defined for the new report template.
By the technical scheme, the data corresponding to the PDF file and the matched report template can be mined by the mining matching function. Such data may be extracted and/or stored with corresponding defined meanings, thereby not only mining the data of the PDF file accurately at a high speed but also retaining the corresponding actual meanings of the data.
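The mining of steps 706 and 708 can be sketched as a loop that groups matched text under the name of each mining feature block. The sketch below is an illustration under assumptions: the tuple layout `(x_ul, y_ul, x_lr, y_lr)`, the function names `overlaps`, `mine`, and `needs_new_template`, the in-memory dictionary standing in for the database, and all coordinates are hypothetical; only the stock code "600315.SH" comes from the example above.

```python
def overlaps(a, b):
    """Corner test on (x_ul, y_ul, x_lr, y_lr) tuples; returns 1 on a match, else 0."""
    def inside(x, y, box):
        x0, y0, x1, y1 = box
        return min(x0, x1) <= x <= max(x0, x1) and min(y0, y1) <= y <= max(y0, y1)

    hit = (inside(a[0], a[1], b) or inside(a[2], a[3], b)
           or inside(b[0], b[1], a) or inside(b[2], b[3], a))
    return 1 if hit else 0


def mine(text_blocks, feature_blocks, match=overlaps):
    """Collect the text of every block whose matching value is not 0, keyed by feature name."""
    mined = {name: [] for name in feature_blocks}
    for text, box in text_blocks:
        for name, feature_box in feature_blocks.items():
            if match(box, feature_box) != 0:
                mined[name].append(text)
    return mined


def needs_new_template(mined):
    """True when no text block matched any mining feature block of the template."""
    return not any(mined.values())


# Hypothetical coordinates, for illustration only.
features = {"stock_code": (400, 40, 520, 60), "summary": (50, 100, 520, 300)}
blocks = [("600315.SH", (410, 45, 480, 58)), ("Page 1", (250, 780, 300, 795))]
result = mine(blocks, features)
```

Here `result` maps `"stock_code"` to `["600315.SH"]` and `"summary"` to an empty list, so `needs_new_template(result)` is false. Keying the output by feature block name is what preserves the meaning of each mined value.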
FIG. 8 illustrates a flow diagram of a method 800 of verifying the legitimacy of mined data in accordance with an embodiment of the present disclosure.
In step 802, the computing device 110 may verify the validity of the mined data based on the determined data structure of the mined feature blocks of the report template.
In one embodiment, the computing device 110 may define a legal data structure for a mining feature block based on the data structure of the mining feature blocks of the determined report template. For example, a stock code may be defined in the structural form "number.English characters". Note that because data may be represented in different ways, multiple legal data structures may be defined for one mining feature block. For example, a stock code may also be defined as "number", "number|English characters", "number English characters", and so forth. Data conforming to any such structure can be determined to be legal and is normalized in subsequent steps. Using the defined data structures, the computing device 110 may verify the legitimacy of the data mined by the above-described method.
In step 804, the computing device 110 may perform a normalization process on the mined data in the PDF file in response to the mined data being verified as legal.
In one embodiment, the computing device 110 may normalize legal mined data that appears in one or more different forms into a standard data format. For example, stock codes expressed in the formats "number", "number|English characters", and "number English characters" are all normalized to "number.English characters".
In step 806, in response to the mined data being verified as illegal, the computing device 110 may determine another report template based on the matching values and re-mine the PDF file.
In one embodiment, the computing device 110 may determine that the mining is erroneous in response to the mined data being illegal. For example, if the data mined as a stock code consists of Chinese characters, the report template is determined to be incorrect. The computing device 110 may then determine another report template using the matching values calculated in the above-described method and re-mine the PDF file with it.
With this technical solution, data represented in a plurality of different data structures can be normalized to standard data. At the same time, the method makes it possible to determine whether the mining is correct.
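The verification and normalization of steps 802 and 804 can be illustrated with a regular expression. This is a sketch under assumptions rather than the disclosed method: the six-digit and two-letter lengths, the pattern itself, and the names `_STOCK_CODE`, `normalize_stock_code`, and `default_exchange` are hypothetical, chosen to fit the "600315.SH" example.

```python
import re

# Assumed legal structures: six digits, optionally followed by ".", "|",
# a space, or nothing, and then a two-letter suffix.
_STOCK_CODE = re.compile(r"^(\d{6})(?:[.| ]?([A-Za-z]{2}))?$")


def normalize_stock_code(raw, default_exchange=None):
    """Return the code as "number.LETTERS", or None if the structure is illegal."""
    match = _STOCK_CODE.match(raw.strip())
    if match is None:
        return None  # illegal data: the caller may pick another template and re-mine
    digits, exchange = match.groups()
    exchange = exchange.upper() if exchange else default_exchange
    # Without a suffix and without a default, only the digits can be returned.
    return f"{digits}.{exchange}" if exchange else digits
```

A return value of `None` corresponds to the illegal case handled in step 806; for example, a string of Chinese characters mined in place of a stock code fails the pattern.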
FIG. 9 illustrates a flow diagram of a method 900 of performing a normalization process on mined data in the PDF file according to an embodiment of the disclosure. Method 900 corresponds to step 804 of method 800.
At step 902, the computing device 110 may define a standard expression of mined feature blocks of the report template.
In one embodiment, the computing device 110 may define a standard expression for a mining feature block of the report template, because the same data may be named differently in different reports. For example, "financing cash flow", "financing activity cash flow net amount", and the like may all represent the same data. The standard expression "financing cash flow" may thus be defined.
At step 904, computing device 110 may define a corresponding plurality of non-standard expressions based on the standard expressions.
In one embodiment, the computing device 110 may define a corresponding plurality of non-standard expressions based on the standard expression. For example, multiple non-standard expressions such as "financing activity cash flow", "financing activity cash flow volume", and "financing activity cash flow net amount" may be defined for the standard expression "financing cash flow". This step is equivalent to creating a look-up table that maps the different ways of writing to the standard expression.
At step 906, the computing device 110 may uniformly convert the non-standard expressions in the mined data to standard expressions.
In one embodiment, the computing device 110 may uniformly convert all defined non-standard expressions found in the mined data to the standard expression. For example, the mined "financing activity cash flow" and "financing activity cash flow net amount" can both be converted into "financing cash flow". That is, the plurality of non-standard expressions in the look-up table are normalized into one standard expression.
According to the same principle, it is also possible to convert numerical data into a correct floating point number, to convert numbers expressed in different units into a unified unit, or the like.
By using the technical scheme, the data which are obtained by mining and expressed as different expressions can be normalized into the standard data of the unified expression.
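The look-up table of steps 902 to 906 can be sketched as a dictionary that maps every known way of writing to its standard expression. The names `STANDARD_EXPRESSIONS`, `build_lookup`, and `normalize_expression` are hypothetical, and the case-insensitive comparison is an assumption made for this illustration; the expressions themselves are the ones given in the example above.

```python
# Standard expression -> non-standard expressions (from the example in the text).
STANDARD_EXPRESSIONS = {
    "financing cash flow": [
        "financing activity cash flow",
        "financing activity cash flow volume",
        "financing activity cash flow net amount",
    ],
}


def build_lookup(table):
    """Invert the table so each variant points at its standard expression."""
    lookup = {}
    for standard, variants in table.items():
        lookup[standard.lower()] = standard
        for variant in variants:
            lookup[variant.lower()] = standard
    return lookup


def normalize_expression(name, lookup):
    """Return the standard expression, or the input unchanged if it is unknown."""
    return lookup.get(name.strip().lower(), name)


lookup = build_lookup(STANDARD_EXPRESSIONS)
```

Leaving unknown names unchanged, rather than raising an error, keeps new expressions visible so that they can later be added to the table.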
FIG. 10 shows a flow diagram of a method 1000 of performing a normalization process on mined data in the PDF file according to an embodiment of the present disclosure. Method 1000 corresponds to step 804 of method 800.
At step 1002, the computing device 110 may determine, based on the mined data, the closest actual year associated with the stock data in the mined data.
In one embodiment, the computing device 110 may determine, based on the mined data, the closest actual year associated with the stock data in the mined data. For example, if the mined data includes data for the years 2020, 2021, 2022, and 2023 and the current year is 2021, the data from 2022 onward is forecast data, so 2021 may be defined as the closest actual year.
At step 1004, the computing device 110 may query the actual data of the stock data at the closest actual year.
In one embodiment, the computing device 110 may query, via a database or other means, the actual data of the stock data in the closest actual year. For example, the "financing cash flow" of the stock in 2021 may be queried, and the queried data may be defined as the real data.
In step 1006, the computing device 110 may compare the stock data to the real data to obtain units of the stock data.
In one embodiment, the computing device 110 may compare the stock data mined by the above-described method with the real data defined in step 1004 to determine whether the mined data is correct.
If the mined stock data and the real data differ too much, that is, if their difference after a certain operation exceeds a certain threshold, the mining may be considered erroneous, and the matching template needs to be determined again or the mining needs to be performed again.
If the mined stock data and the real data are similar, that is, if their difference after a certain operation does not exceed the threshold, the unit of the stock data in the PDF table is calculated from the conversion between the two values.
For example, if the queried real data is "1000000" and the mined data is "100", the unit of the data in the PDF table can be calculated as "ten thousand". Conversely, if the queried real data is "5000000" and the mined data is "200", the mining may be considered erroneous and additional processing may be required.
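Steps 1002 to 1006 can be sketched as follows. The text leaves the comparison operation and the threshold open, so this illustration assumes one possible choice: the ratio of real to mined data is compared against a small table of unit multipliers with a relative tolerance. The names `closest_actual_year`, `infer_unit`, `UNIT_MULTIPLIERS`, and `tolerance`, as well as the table entries, are hypothetical.

```python
# Assumed table of candidate units; a real system would define its own.
UNIT_MULTIPLIERS = {
    1: "one",
    1_000: "thousand",
    10_000: "ten thousand",
    1_000_000: "million",
    100_000_000: "hundred million",
}


def closest_actual_year(years, current_year):
    """Latest year in the data that is not a forecast year, or None."""
    actual = [year for year in years if year <= current_year]
    return max(actual) if actual else None


def infer_unit(real_value, mined_value, tolerance=0.05):
    """Return the unit name if real/mined is close to a known multiplier, else None."""
    if mined_value == 0:
        return None
    ratio = real_value / mined_value
    for multiplier, name in UNIT_MULTIPLIERS.items():
        if abs(ratio - multiplier) <= tolerance * multiplier:
            return name
    return None  # no plausible unit: treat the mining as erroneous
```

With the figures from the example, `infer_unit(1000000, 100)` gives `"ten thousand"`, whereas `infer_unit(5000000, 200)` gives `None` because the ratio 25000 is not close to any assumed multiplier.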
By using the technical scheme, the mined data with different expressions can be normalized into the data with standard expression.
FIG. 11 shows a flow diagram of a method 1100 for deduplicating acquired text blocks in accordance with an embodiment of the present disclosure.
In step 1102, the computing device 110 calculates the number of text characters of the acquired text block, i.e., the number of characters in the text string within the text block acquired by the method described above. For example, the number of characters in the "balance sheet" text block may be determined to be 5.
At step 1104, the computing device 110 determines whether the calculated number of text characters is greater than or equal to a predetermined character number threshold. The user may set the character number threshold for the deduplication algorithm, e.g., 10 characters. The threshold is used to judge whether the text is a long string or a short string, since different deduplication algorithms suit different string lengths.
At step 1106, the computing device 110 calculates a similarity of the acquired text blocks based on a first algorithm in response to determining that the calculated number of text characters is greater than or equal to the predetermined character number threshold. If the number of text characters is greater than or equal to the threshold, the text is determined to be a long string, and the similarity of the text blocks is calculated by applying the first algorithm. The first algorithm may be any deduplication algorithm that works well on long strings, such as a SimHash deduplication algorithm, a hashmap deduplication algorithm, or the like.
At step 1108, the computing device 110 calculates a similarity of the acquired text blocks based on a second algorithm in response to determining that the calculated number of text characters is less than the predetermined character number threshold. If the number of text characters is less than the threshold, the text is determined to be a short string, and the similarity of the text blocks is calculated by applying the second algorithm. The second algorithm may be any deduplication algorithm that performs well on short strings, such as a MinHash deduplication algorithm, a set-based deduplication algorithm, or the like.
At step 1110, the computing device 110 performs deduplication for the retrieved text block based on the similarity calculation result. The similarity between the text blocks is calculated by the corresponding algorithms determined in steps 1106 and 1108, so that the text block with high similarity can be regarded as a repeated text block, and the removal or combination is performed on the repeated text block, thereby completing the deduplication.
With this technical solution, it can be determined whether repeated data or repeated indexes exist in a table. If such duplicates exist, the repeated cells can be merged or eliminated according to the deduplication algorithm, so that the table processing efficiency is improved.
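The length-based dispatch of method 1100 can be sketched as follows. This is an illustration under assumptions, not the disclosed implementation: a simplified SimHash over character trigrams stands in for the first algorithm, a Jaccard similarity over character sets stands in for the second (set-based) algorithm, and the shorter of the two strings decides which one is used. The function names, the threshold of 10 characters, and the similarity cutoff of 0.9 are hypothetical.

```python
import hashlib


def simhash(text, bits=64, ngram=3):
    """Simplified SimHash fingerprint built from character n-grams."""
    weights = [0] * bits
    for i in range(max(1, len(text) - ngram + 1)):
        digest = hashlib.md5(text[i:i + ngram].encode("utf-8")).digest()
        value = int.from_bytes(digest[:bits // 8], "big")
        for bit in range(bits):
            weights[bit] += 1 if (value >> bit) & 1 else -1
    return sum(1 << bit for bit in range(bits) if weights[bit] > 0)


def long_similarity(a, b, bits=64):
    """1.0 for identical fingerprints, lower as the Hamming distance grows."""
    distance = bin(simhash(a, bits) ^ simhash(b, bits)).count("1")
    return 1.0 - distance / bits


def short_similarity(a, b):
    """Jaccard similarity of the two character sets."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def similarity(a, b, char_threshold=10):
    """Use the long-string algorithm only when both strings reach the threshold."""
    if min(len(a), len(b)) >= char_threshold:
        return long_similarity(a, b)
    return short_similarity(a, b)


def deduplicate(blocks, char_threshold=10, min_similarity=0.9):
    """Keep the first occurrence of each block and drop later near-duplicates."""
    kept = []
    for block in blocks:
        if all(similarity(block, k, char_threshold) < min_similarity for k in kept):
            kept.append(block)
    return kept
```

Because the character-set comparison ignores order, it would treat anagrams as duplicates; a production system would likely choose a stricter short-string measure.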
FIG. 12 shows a schematic block diagram of an example electronic device 1200 that can be used to implement embodiments of the present disclosure. For example, the computing device 110 as shown in fig. 1 may be implemented by the electronic device 1200. As shown, the electronic device 1200 includes a Central Processing Unit (CPU) 1201 that may perform various appropriate actions and processes according to computer program instructions stored in a Read Only Memory (ROM) 1202 or computer program instructions loaded from a storage unit 1208 into a Random Access Memory (RAM) 1203. In the random access memory 1203, various programs and data necessary for the operation of the electronic apparatus 1200 may also be stored. The central processing unit 1201, the read only memory 1202, and the random access memory 1203 are connected to each other by a bus 1204. An input/output (I/O) interface 1205 is also connected to bus 1204.
A number of components in the electronic device 1200 are connected to the input/output interface 1205, including: an input unit 1206 such as a keyboard, a mouse, a microphone, and the like; an output unit 1207 such as various types of displays, speakers, and the like; a storage unit 1208, such as a magnetic disk, optical disk, or the like; and a communication unit 1209 such as a network card, modem, wireless communication transceiver, etc. The communication unit 1209 allows the electronic device 1200 to exchange information/data with other devices via a computer network such as the Internet and/or various telecommunication networks.
The various procedures and processes described above, such as methods 200, 300, 400, 500, 700, 800, 900, and 1100, may be performed by the central processing unit 1201. For example, in some embodiments, methods 200, 300, 400, 500, 700, 800, 900, and 1100 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 1208. In some embodiments, some or all of the computer program may be loaded and/or installed on the electronic device 1200 via the read only memory 1202 and/or the communication unit 1209. When the computer program is loaded into the random access memory 1203 and executed by the central processing unit 1201, one or more of the actions of the methods 200, 300, 400, 500, 700, 800, 900, and 1100 described above may be performed.
The present disclosure relates to methods, apparatuses, systems, electronic devices, computer-readable storage media and/or computer program products. The computer program product may include computer-readable program instructions for performing various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical encoding device, such as punch cards or in-groove raised structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge computing devices. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), is personalized by utilizing state information of the computer-readable program instructions, and the electronic circuitry can execute the computer-readable program instructions to implement aspects of the present disclosure.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
These computer-readable program instructions may be provided to a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It will be appreciated by persons skilled in the art that the present invention is not limited to the embodiments described above, but that the invention may be embodied in many other forms without departing from the spirit or scope of the invention. Accordingly, the present examples and embodiments are to be considered as illustrative and not restrictive, and various modifications and substitutions may be made thereto without departing from the spirit and scope of the present invention as defined by the appended claims.