CN116956836A - Efficient and automatic document file annotation method based on hash tree algorithm - Google Patents
Efficient and automatic document file annotation method based on hash tree algorithm
- Publication number: CN116956836A (application number CN202310880346.4A)
- Authority: CN (China)
- Prior art keywords: document, file, hash tree, CRF, hash
- Legal status: Pending
Classifications
- G06F40/169 — Handling natural language data; Text processing; Editing; Annotation, e.g. comment data or footnotes
- G06F40/18 — Handling natural language data; Text processing; Editing of tables; using ruled lines; spreadsheets
- Y02D10/00 — Climate change mitigation technologies in information and communication technologies; Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention provides an efficient, automatic document file annotation method based on a hash tree algorithm, which comprises the following steps: obtaining a CRF document file to be processed, extracting the CRF document file into a document structure in the form of a super directory, converting the document structure into a hash tree structure, and storing the hash tree structure as a JSON file; manually editing the JSON file and inserting a value containing a character string at each place in the file that needs a marker, thereby adding annotation information; converting the edited JSON file (the aJSON file) back into a hash tree data structure in memory; rescanning the original CRF document file, using each captured multidimensional key combination to take values from the hash tree, thereby performing annotation or annotation migration, automatically adding all annotation information into the CRF document file, and generating an annotated CRF document file. The method automatically annotates CRF forms of any version and rapidly supports version migration as well as modular segmentation and merging; that is, it annotates CRF clinical data to generate an aCRF.
Description
Technical Field
The invention relates to the technical field of document processing, and in particular to an efficient, automatic document file annotation method based on a hash tree algorithm.
Background
SDTM (Study Data Tabulation Model), CRF (Case Report Form) and aCRF (Annotated Case Report Form) are all critical throughout a clinical study, ensuring the accuracy and consistency of experimental data. SDTM provides a standard model for clinical trials, helping researchers collect and manage data more effectively. The CRF is the core data collection tool, ensuring the quality and integrity of the data. The aCRF, through SDTM-based markers and annotations, improves the quality of the CRF, shortens the time needed for data analysis and reporting, improves research efficiency, and at the same time ensures the accuracy and integrity of the data. The roles of SDTM, CRF and aCRF in clinical studies are therefore irreplaceable: only through their effective use can high-quality, reliable research data be obtained and support be provided for the development of clinical medicine.
These three items are described below:
(1) SDTM is a standardized data model used to normalize data collection, data management and data reporting in clinical studies. It is a normalized data model proposed by CDISC (Clinical Data Interchange Standards Consortium); it can effectively reduce data errors and improve data quality, and it allows different systems to interoperate, thereby improving the efficiency of converting and analyzing clinical research data.
(2) The CRF is a standard data collection form or spreadsheet for collecting data from clinical trials. It is a structured data gathering method used to collect various data and information during the course of a trial. The CRF defines the data required for the trial, including treatment and follow-up activities, baseline and follow-up data, clinical outcomes, laboratory examinations, safety data, and so on.
(3) The aCRF (annotated CRF) is a variant of the CRF and is mainly used for data management and analysis in clinical trials. In contrast to a conventional CRF, the aCRF includes detailed SDTM-based markers and annotations in addition to basic data collection. These markers and annotations enhance the accuracy, consistency, understandability and auditability of the data. The aCRF helps testers, auditors and data analysts better understand the collected data, avoiding data quality problems caused by input errors and by non-standard, incomplete or unclear entries in the CRF. Through its markers and annotations, the aCRF ensures the integrity and accuracy of the data, shortens the time for data analysis and reporting, and improves research efficiency.
An aCRF can be understood as follows: a colored text box is drawn next to a specified piece of text on the PDF form, and comments conforming to specified criteria are written inside that box. The aCRF is a variant of the CRF used to collect trial data; unlike a conventional CRF, it includes more detailed data markers and notes. These markers and annotations enhance the accuracy, consistency, understandability and auditability of the data, ensuring the quality and integrity of data collection. The aCRF workflow is relatively simple: the researcher provides a basic data collection form (CRF), the system then enriches it with labels and comments added by automated tools, and the result can be refined further by manual review. This makes the data more accurate and reliable, improving the efficiency of data analysis and the reliability of the analysis results.
In clinical research, data collection and management are critical, and the aCRF can reduce data errors, improve data quality, shorten data analysis time, quickly detect data anomalies, and so on. The aCRF is an indispensable link in the whole data analysis process, since it improves the quality and accuracy of the data and shortens the time for data analysis and reporting.
The aCRF also constrains the speed of the whole data analysis chain. Without accurate, understandable and standardized data, analysts spend more time analyzing and interpreting the data; with tools such as automatic marking and annotation, the aCRF shortens the time analysts spend on the data, improves analysis efficiency, and provides reliable and accurate results.
Therefore, in clinical research the actual effect and application of the aCRF influence the progress and flow of the entire study: they improve data quality and accuracy, accelerate the analysis process, and correspondingly allow the clinical prospects to be evaluated from the analysis results more accurately and reliably.
However, producing a conventional aCRF is a very labor-intensive task that requires staff to write and collate the annotations by hand. To address this, software and analysis procedures have been designed in the prior art, as follows:
1 Annotation addition based on source code before tabulation
These approaches all analyze and convert the original form of the CRF before the PDF file of the CRF is generated, and output the relevant annotation results to the corresponding PDF-making tool so that they are written into the PDF file together with the original CRF information.
1.1 SAS-based Mock Shell addition
Mock Shells in SAS is an automated tool for building tables. Before the form is produced, all relevant titles or content strings are output to the SAS software; Mock Shells maps them to a SAS database in the form list → (title name) → content, queries and annotates them, and finally adds the results at the designated positions to form a new document.
1.2 LaTeX based addition
LaTeX is a powerful typesetting tool widely used in academia and the publishing industry. For some CRFs, the final PDF version is generated directly from the LaTeX typeset file. Code can be written by hand, or a related tool can be used to edit the LaTeX source file directly and output the annotation text box at a designated position. This requires the tabulating staff to know the structure of the actual form thoroughly: what specific text appears at which position, and whether there is empty space nearby in which to add comments.
1.3 direct database addition
PDF files of some CRFs are exported directly from mature database systems such as Oracle; the annotation information and coordinates can be generated directly by adding them to an underlying table of the original database. Similar tools include Oracle Clinical (Oracle, USA), products from Jubilant Organosys (USA) and Medidata Solutions, REDCap, and others.
2 adding after tabulating
In most cases, the annotation team and the tabulation team are not the same team. The annotators can only annotate blank, un-annotated PDFs after the fact; the following methods have been published:
2.1 Annotation addition based on the tabulation specification file, the Study Design Specification (SDS)
The tabulating team outputs an Excel table at the same time as the blank PDF; this table records the content, page number, data type and coordinates of every page of the PDF file and is called the SDS file. On this basis the annotators themselves add the specified columns according to the contents of this table, such as the color of the annotation, the content of the annotation, and the coordinate offset of the annotation text box. Relevant tools then annotate the blank PDF based on the information in this table. Fig. 1 is a schematic diagram of an SDS file and the resulting annotated PDF; columns D and E in the SDS are information added afterwards, not original SDS information.
2.2 adding based on form text matching
2.2.1 feature text matching method
In this method, a table file is prepared in advance that lists the words contained in each table header together with the corresponding annotation information. The PDF text is then extracted separately and broken into sentences and segments at line feeds or spaces. The method requires the characters representing a table name or header to have certain features, such as being enclosed in brackets or marked with special symbols; from these it determines which CRF table is being scanned at which position, and the remaining text is kept as keywords. It then checks whether a keyword appears in the file corresponding to the form and, if so, places the annotation information next to that word. For example, table names and headers are marked with asterisks, and the number of asterisks indicates the heading level. The program analyzes the text, obtains the title level and page number by feature matching, matches level by level on the title level and title name, and places the four notes (domestic male, domestic female, foreign male, foreign female) at the corresponding positions. Fig. 2 is a schematic diagram of the annotation method based on text matching.
2.2.2 text content numbering
Another approach is to capture the content and coordinates of each word by scanning the document page by page and line by line; the content of each line is numbered and recorded in an Excel file, the Excel file is then edited, and code restores the annotations from the numbers. Fig. 3 is a schematic illustration of annotation based on a pre-numbering system.
2.3 adding based on Book Marks
Staff look through the document in advance and place a bookmark on each page; the content of the bookmark is the table name corresponding to that page and the header names appearing on it. A program then adds the comment information of the corresponding header to a corner of the page in the form of text boxes according to the bookmarks; a PDF editor is opened, and the text boxes containing the comments are manually dragged next to the corresponding text.
2.4 editing based on XDF files
The XDF file is an XML file format for inserting data directly into a PDF. It can be used to populate the PDF file with user information in multiple forms and can be opened with a PDF reader such as Adobe Acrobat Reader. At the start of a task, an XDF file is created manually, the coordinates of the text boxes are entered into it by hand, and the annotation information is written into it. After saving, the XDF file and the blank PDF are loaded into an editor at the same time and saved as the final annotation result.
However, in practical application these prior-art approaches have the following defects:
(1) Files such as the SDS must be provided by the form producer. These form producers are typically foreign companies, such as Oracle, with which cross-border communication is difficult. Moreover, people other than senior management, such as the annotation team, essentially have no access to the SDS files.
(2) Some form contents may grow longer when the data is filled in, so that the final document no longer matches the SDS file page by page.
(3) The tables are distributed step by step and sorted only at the end, so the page numbers of the tables may be shuffled and no longer match the SDS file.
(4) If a pre-numbering system is used, the method fails as soon as the page order of the document changes, because the numbers and annotations are both tied to page numbers. In addition, inspection is difficult because the numbering obscures the organizational structure of the content. For example, in the example above, 1 (page number) → nationality sample questionnaire → China → gender → male should receive the note "domestic male", whereas 1 (page number) → nationality sample questionnaire → others → gender → male should receive the note "foreign male". Under the pre-numbering system these become 1 (page number) → 4 → male receives the note "domestic male", and 1 (page number) → 8 → male receives the note "foreign male". Such information is not human-readable and is difficult to check afterwards.
(5) The feature text matching method requires the header and title levels and contents of the table to be tagged with feature text such as #, *, or brackets. If the title levels are distinguished by other means, such as color, font size, font, character margin or special labels, then as soon as this method is applied all content becomes plain text, the features are lost, and matching is impossible. The method is implemented with the tm package of the R language; the package has a bug, and once a picture appears in the document, the text obtained by the subsequent analysis is garbled.
(6) With the feature text matching method, it is necessary to know in advance how many title levels there are at most. Most pre-compiled languages, such as C, Golang, Java and JavaScript, require the dimensions of the dictionary to be specified and the data storage structure to be built in advance. In Golang and Java, for example, annotating a primary title requires a one-dimensional hash table; a secondary title requires a two-dimensional structure, Hash[primary title][secondary title] = annotation result; a tertiary title requires Hash[primary title][secondary title][tertiary title] = annotation result; and so on for quaternary titles. Otherwise the code cannot be written at all. The common practice is therefore to read through the document, determine how many title levels there are at most, then declare several data structures in advance in the code and cross-compare them. This is time-consuming and labor-intensive, code maintenance is extremely difficult, massive memory is occupied, and one CRF-PDF file can only correspond to one program, so no generality is achieved.
(7) If the document is a PDF file, the rendered result and the underlying encoding may not be consistent; for example, spaces and tabs may be inserted inside some headlines for aesthetic presentation. With the SDS, current methods only support exact text matches; even one mismatched punctuation mark becomes an error in the project.
(8) The annotated PDF is not completed in a single pass; both the PDF document structure and the annotation content may go through several versions. Each iteration requires modifying part of the annotations, and how to migrate the annotations is also a major problem. The methods above are bound to coordinates and page numbers, so all the work has to be redone once the annotations or the coordinates of the PDF document structure change.
(9) Each CRF summarizes the data of at least three years of clinical study and has a huge data volume; the PDF document of one CRF is at least 200 pages and may well run to ten thousand pages. Having one person do this imposes excessive working pressure, and a working and coding mechanism that allows the work to be divided is urgently needed. All current annotation methods rely on page numbers and on a preset coordinate system, so only document segmentation can be used: the PDF file of one CRF is divided into several parts by table, the page number of each page is preserved, and the different parts are assigned to different people to build annotation tables or modify the SDS file. The project requires everyone to adhere strictly to the page numbers and coordinates of the document contents; if content overflows and re-pagination occurs, the pages and coordinates change and the work of the others is invalidated. Finally, an administrator merges all the SDS or preset form files and performs the annotation.
(10) For projects without SDS files, it is quite unreasonable for a whole company to rely on a single employee to first build the bookmarks or number each page of content before everyone else can annotate. If that employee is on leave or otherwise unavailable, working efficiency suffers severely; across companies with international teams this phenomenon is even more serious.
Disclosure of Invention
The object of the present invention is to solve at least one of the technical drawbacks described above.
To this end, the invention aims to provide an efficient, automatic document file annotation method based on a hash tree algorithm, which can automatically annotate CRF forms in any PDF version and supports version migration as well as modular segmentation and merging; that is, it annotates CRF clinical data to generate an aCRF (annotated CRF).
To achieve the above object, an embodiment of the present invention provides an efficient, automatic document file annotation method based on a hash tree algorithm, comprising the following steps:
Step S1: obtain the CRF document file to be processed and construct a super directory, comprising: extracting the CRF document file to be processed into a document structure in the form of a super directory and converting the document structure into a hash tree structure, where the hash tree structure is the super directory containing all contents of the CRF document file together with their upstream and downstream organizational relations; and storing the super directory as a JSON file;
Step S2: manually edit the JSON file corresponding to the super directory, inserting a value containing a character string at each place in the file that requires a marker so as to add annotation information, and generate the edited JSON file corresponding to the super directory, denoted the aJSON file;
Step S3: convert the super directory corresponding to the edited aJSON file of step S2 into a hash tree data structure in memory, then rescan the CRF document file to be processed and use each captured multidimensional key combination to take a value from the hash tree, thereby performing annotation or annotation migration, comprising: when the value taken from the corresponding hash tree structure with a multi-level key combination is detected to contain a character string, extracting the coordinate of the last keyword of that key combination in the CRF document file, generating a text box beside the coordinate, and placing the corresponding annotation information in the text box; that is, the value taken from the hash tree structure with the key combination serves as the annotation information. In this way the annotation information of step S2 is automatically added at the corresponding child-node positions of the CRF document file, and the annotated CRF document file is generated as the aCRF file.
In a specific embodiment of the present invention, in step S1 the CRF document file to be processed is scanned page by page and line by line, and the logical structure of the whole document is constructed and abstracted into a hash tree, including but not limited to using the titles of different levels of the CRF document file as keys in hash tables at different levels of the hash tree, converting each row and each column into keys in a hash table of some dimension of the hash tree according to the logical membership of the table, and realizing multi-level storage by nesting multidimensional hash tables repeatedly;
after the whole CRF document file has been traversed once, a hash tree object containing the whole document content and organizational structure is obtained; its macroscopic structure can be likened to a super directory recording all data organization forms and contents of the document, and the super directory is saved as a JSON-format file.
In a specific embodiment of the present invention, the logical organization structure of the CRF document file and the table and the corresponding text content are recorded in the super directory.
In a specific embodiment of the invention, the super directory contains all contents of the document and their organizational structure, and the hash tree data structure generated from the super directory supports keyword search and fuzzy search, forming a hash tree based on regular-expression search.
In a specific embodiment of the present invention, in step S2 a value in the form of a string is added at the places in the JSON file where annotation is required, thereby adding the annotation information.
In a specific embodiment of the present invention, in step S3, after the JSON file has been edited, the edited JSON file is read and converted into a hash tree, the blank CRF document file is traversed line by line again, and the document annotation is realized by taking values;
the blank CRF document file is parsed page by page and line by line, and the multi-level title combination obtained after parsing each line is used as a keyword combination to take a value from the hash tree; if a value exists and contains a character string S, the character string is taken out, the rightmost x-axis coordinate of the last title keyword of the multi-level keyword combination is extracted from the CRF document, an offset is added to this coordinate, a text box is placed at the offset position, and the width of the text box is adjusted to the width of the character string S; the character string S is then put into the corresponding text box as annotation information to realize the annotation.
In an embodiment of the invention, the hash tree generated after the CRF document file is parsed can be divided into different hash subtrees by chapter, and these subtrees are completely independent and do not affect each other.
Based on this property, the CRF document file is divided into several blocks by form or by chapter, each block is independently built into the hash subtree recorded by its corresponding JSON file, the blocks are annotated independently by different staff, and finally all annotation results are merged using an Update function without any data loss or ambiguity, as shown in the sketch below.
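As an illustration of this split-and-merge workflow, the following Python sketch (the file names and structure are hypothetical, not the invention's actual scripts) splits the super-directory JSON by its top-level chapter keys, lets each part be annotated independently, and merges the annotated parts back; because the subtrees are independent, an update at the top level is sufficient.

    import json

    def split_super_directory(json_path, out_prefix="part"):
        # Split the super directory into one JSON file per top-level chapter/table subtree.
        with open(json_path, encoding="utf-8") as f:
            tree = json.load(f)
        paths = []
        for i, (chapter, subtree) in enumerate(tree.items()):
            path = f"{out_prefix}_{i}.json"
            with open(path, "w", encoding="utf-8") as f:
                json.dump({chapter: subtree}, f, ensure_ascii=False, indent=2)
            paths.append(path)
        return paths

    def merge_annotated_parts(part_paths, merged_path="merged_ajson.json"):
        # Merge the independently annotated sub-JSON files back into one super directory.
        merged = {}
        for path in part_paths:
            with open(path, encoding="utf-8") as f:
                merged.update(json.load(f))  # subtrees are independent, so a top-level update suffices
        with open(merged_path, "w", encoding="utf-8") as f:
            json.dump(merged, f, ensure_ascii=False, indent=2)
        return merged_path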
In a specific embodiment of the present invention, the generated hash tree can be locally or globally overlaid an unlimited number of times by a hash or linked-list data structure to achieve splitting and version updating, including:
(1) The hash tree Ti generated by any sub-chapter or sub-table of the original CRF document file is a true subtree of the hash tree T generated by parsing the complete original document, and the hash subtrees formed by any independent sub-units of the document are completely independent of each other and do not affect each other;
(2) Changing the order of the hash subtrees has no effect on the overall tree;
(3) The contents of any subtree of the hash tree can be arbitrarily overwritten by a subtree with the same key structure, and the final value depends only on the last value written;
(4) The hash tree content can be expanded arbitrarily; if a new branch appears in a new subtree during the overwrite process, the related content is automatically expanded in the generated hash tree.
In a specific embodiment of the present invention, the CRF document file is any readable file.
In particular embodiments of the present invention, the method is applicable to any document format file that can be opened by a reading tool, including, but not limited to, documents and picture files.
With the efficient, automatic document file annotation method based on the hash tree algorithm according to the embodiments of the invention, CRF clinical data can be annotated to generate an aCRF (annotated CRF); that is, CRF tables in any PDF version are annotated automatically, and version migration and modular segmentation and merging are realized rapidly.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the invention will become apparent and may be better understood from the following description of embodiments taken in conjunction with the accompanying drawings in which:
FIG. 1 is a diagram of an SDS file and an annotation-result PDF in the prior art;
FIG. 2 is a schematic diagram of a text matching-based annotation method in the prior art;
FIG. 3 is a schematic illustration of prior art annotation based on a pre-numbering system;
FIG. 4 is a flow chart of a method for efficient automated annotation of document files based on a hash tree algorithm in accordance with an embodiment of the present invention;
FIG. 5 is a flowchart of an overall analysis of a method for efficient automated annotation of document files based on a hash tree algorithm in accordance with an embodiment of the present invention;
FIG. 6 is a diagram of a hash tree model versus document structure according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of an annotation migration model according to an embodiment of the present invention;
FIG. 8 is a flow chart of a document annotation migration operation according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of a document segmentation and integration model according to an embodiment of the invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.
The invention provides an efficient, automatic document file annotation method based on a hash tree algorithm, which can realize standardization, modularization and migration of PDF spreadsheet documents with a specified format.
Before explaining the efficient, automatic document file annotation method based on the hash tree algorithm, the technical terms used in the method are explained first.
(1) Hash table and infinite dimension hash table (hash tree):
A hash table (also called a hash map) is a data structure that is accessed directly according to key-value pairs. It accesses records by mapping key values to positions in the table, which speeds up lookup. The mapping function is called the hash function, and the array storing the records is called the hash table; a large address space is usually allocated for the hash table at runtime. The main advantage of the hash table is its high lookup efficiency: the required element can be found in O(1) time, which is clearly better than other data structures for search or sorting tasks over large data volumes.
A hash tree, also known as a Merkle tree, is a tree-like data structure and an efficient way to validate and manage data; colloquially, it is a hash table that can store multidimensional data. When data is written to the hash tree, each node represents the data of its entire subtree with its hash value. When the data in the tree changes, only the affected node and the hash values associated with it need to be modified, because the other nodes are not affected.
In the hash tree, each element has a unique key value, which is used to calculate its location in the hash table. The calculation typically uses a hash function that considers only one dimension of the key value and returns a specific hash value; each dimension has a corresponding hash function to compute the hash value in that dimension.
The main advantages of the hash tree are its flexibility and scalability. It can dynamically add or delete dimensions, making it well suited to storing multi-level nested data. The infinite-dimension hash tree is therefore widely used in many application scenarios, such as machine learning, recommendation systems and image processing.
Hash trees have the following advantages in terms of data query and organization:
Quick query: The infinite-dimension hash tree uses a hash function to calculate the position of each element in the hash table, which makes lookup very fast, with an average time complexity of O(1). For search or sorting tasks over large data volumes, the efficiency of an infinite-dimension hash tree is far higher than that of a traditional data structure. For example, if a student list is stored in a hash table, the information of each student can be looked up quickly by student ID, and a particular item, such as weight, height or sex, can be obtained through a second-level key. Theoretically, an infinite-dimension hash tree can store all the information and organization of any table regardless of the order in which the information appears.
Data organization: the hash tree can be used to store multidimensional data, which makes it very useful in terms of data organization. It may be used to store objects having multiple attributes, where each attribute may be considered a key value for one dimension. This makes data management and access more flexible and easy to organize.
Dynamically adding or deleting dimensions: the infinite dimension hash tree can dynamically add or delete dimensions as needed, which gives it flexibility over traditional hash tables. This means that it can process multidimensional data without having to fix the dimensions in advance, which is very useful in some application scenarios.
Low memory consumption: the infinite dimension hash tree can efficiently use memory during storage, and has lower memory consumption. This is particularly applicable to a scenario where a large amount of data is handled, and can effectively reduce memory usage and improve processing performance.
Easy to maintain: Since hash tables typically use simple key values, maintenance is very simple, especially when the amount of data is small. If some data of a given user needs to be found, the user information can easily be located using the unique ID alone. Compared with a conventional relational database, using a hash table is more natural and cheaper to maintain.
In summary, hash trees offer excellent solutions in terms of multidimensional data storage, fast data querying, and flexibility, and are suitable for use in processing large-scale data.
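The student-list example above can be written as a minimal Python sketch (the names and fields are illustrative only): a nested dictionary gives average O(1) access, first by student ID and then by a second-level key.

    # First-level key: student ID; second-level key: attribute name.
    students = {
        "S001": {"name": "Alice", "height": 162, "weight": 51, "sex": "F"},
        "S002": {"name": "Bob", "height": 178, "weight": 70, "sex": "M"},
    }

    # Both lookups run in average O(1) time, regardless of how many students are stored.
    print(students["S002"]["height"])   # 178
    print(students["S001"]["sex"])      # F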
(2) Hashability of arbitrary tables:
Any table can be converted into a hash tree of finite dimensions, as can be shown mathematically. The idea of the derivation for converting a data table into a hash tree is as follows:
A data table can be regarded as a two-dimensional matrix in which each row and each column can be treated as a dimension.
A hash table is a multidimensional data structure in which each dimension can be regarded as a key; a position in the hash table is uniquely determined by combining the key values of all dimensions.
Thus, a data table can be converted into a hash table of finite dimensions, i.e., a hash tree. The specific conversion is to select one or more dimensions, take their key values as the keys of the hash table, and take the data of the rows or columns as the values of the hash table.
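The derivation can be made concrete with a short Python sketch (the sample table is invented for illustration): the table name becomes the first-dimension key, the row header the second, and the column header the third, as is also done for sub-tables later in this description.

    def table_to_hash_tree(table_name, column_headers, rows):
        # Convert a two-dimensional table into a three-dimensional nested hash (a small hash tree).
        tree = {table_name: {}}
        for row_header, *cells in rows:
            tree[table_name][row_header] = dict(zip(column_headers, cells))
        return tree

    # Invented example data.
    tree = table_to_hash_tree(
        "Vital Signs",
        ["Visit 1", "Visit 2"],
        [["Height", 170, 171],
         ["Weight", 65, 66]],
    )
    print(tree["Vital Signs"]["Weight"]["Visit 2"])  # 66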
(3) Hash tree and JSON data format:
JSON is a lightweight data interchange format commonly used to transfer data between a server and a client. It is a plain-text format that is easy to read and write and can be used for data exchange between different programming languages. JSON consists of key-value pairs, where keys must be strings and values can be any valid JSON data type, such as numbers, strings, Boolean values, arrays or objects.
In practical applications, the hash tree and JSON can also be converted into each other; for example, data in the hash tree can be converted into JSON format for network transmission or storage, and data in JSON format can be converted into a hash tree for fast retrieval and processing. In addition, JSON is conveniently displayed to users and edited with many dedicated editors.
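In Python this round trip is a one-liner in each direction; the snippet below is a minimal sketch using the standard json module.

    import json

    hash_tree = {"Table 1": {"Title A": {"Title B": ["Comment"]}}}

    # Hash tree -> JSON text, for storage, transmission, or manual editing.
    text = json.dumps(hash_tree, ensure_ascii=False, indent=2)

    # JSON text -> hash tree, for fast retrieval and processing in memory.
    restored = json.loads(text)
    assert restored == hash_tree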
(4) Multiple hash tree recursive update
An Update function is a basic function that any mature computer language contains. It can be used to update existing elements in a hash table of any dimension. For an n-dimensional hash table (n > 1, which constitutes a hash tree), the Update function is typically implemented recursively. The basic flow is as follows:
1) Check whether the given key exists in the hash table.
2) If not, the element may either be inserted into the hash table or ignored.
3) If so, check the number of dimensions of the current hash table node (or sub hash table).
4) If the dimension of the current hash table node equals 1, the node corresponds to a single data element, and the value of the element is updated directly.
5) If the dimension of the current hash table node is greater than 1, the node corresponds to a sub hash table; the Update function is called recursively, and the key values of the next dimension are checked until all dimensions have been processed.
6) If the corresponding element is still not found, the element may either be inserted into the hash table or ignored.
7) The entire hash table, or only the values of specified elements, may be updated as needed.
8) It should be noted that, to prevent excessive hash collisions, the Update function needs to choose the hash function reasonably according to the design principles and expand the hash table in time. In addition, the Update function must consider concurrent updates from multiple threads; locks or atomic operations are usually used to ensure data consistency and thread safety.
Thanks to the recursive Update function, the values of one multidimensional hash can be updated quickly based on another.
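A minimal recursive Update sketch in Python following the flow above (single-threaded, so the locking mentioned in step 8 is omitted): keys whose values are sub hash tables are merged recursively, and leaf values are overwritten so that the last write wins.

    def update_hash_tree(target, source):
        # Recursively merge `source` into `target`; at each leaf the last written value wins.
        for key, value in source.items():
            if isinstance(value, dict) and isinstance(target.get(key), dict):
                update_hash_tree(target[key], value)   # sub hash table: recurse into the next dimension
            else:
                target[key] = value                    # single element or new branch: insert / overwrite
        return target

    old = {"Table 1": {"T1": {"T2": ["Comment v1"]}, "T3": ["Keep me"]}}
    new = {"Table 1": {"T1": {"T2": ["Comment v2"]}},
           "Table 2": {"T1": ["Brand-new branch"]}}
    update_hash_tree(old, new)
    print(old["Table 1"]["T1"]["T2"])  # ['Comment v2']
    print(old["Table 1"]["T3"])        # ['Keep me']
    print(old["Table 2"]["T1"])        # ['Brand-new branch']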
Referring to Fig. 4 and Fig. 5, the efficient, automatic document file annotation method based on the hash tree algorithm according to the embodiment of the invention comprises the following steps:
Step S1: obtain the CRF document file to be processed and construct a super directory. The CRF document file to be processed is extracted into a document structure in the form of a super directory and converted into a hash tree structure. The hash tree structure is a super directory containing all contents of the CRF document file and their upstream and downstream organizational relations; the super directory is saved as a JSON file.
In an embodiment of the present invention, the logical organization structure of CRF document files and tables and corresponding text contents are recorded in the super directory.
It should be noted that the CRF document file may be any readable file. For example, the CRF document file may be a PDF format file, a file converted into a PDF format by other formats, and so on, which will not be described herein. The method provided by the invention is applicable to any document format file which can be opened by a reading tool, including but not limited to documents and picture files.
The flow of the present invention will be described below taking a CRF document file in PDF format as an example.
In the present invention, the Python script generatejson.py is used to extract the document structure of the blank CRF-PDF file, transform it into a hash tree structure, and save the data as a JSON file.
The algorithm principle and implementation of the hash tree are described below; the hash algorithm is theoretically the fastest query algorithm and has the highest efficiency.
The method of recording the document organization structure based on a hash tree is a method of constructing a super directory containing all contents of the document and their organizational structure. Mature technologies for recording a document directory structure exist, including Markdown and CSS, but they have the following shortcomings: 1) The corresponding settings are made in the source code before the file is formed, so the generated document already carries its directory, and such algorithms generally support only document titles, usually up to level 4. The present invention, in contrast, must construct the super directory from an already formed PDF document without knowing in advance how many title levels there are at most; existing tools generally support only 4 title levels, whereas the invention may require 10. This is not possible with current mature tools, so the invention designs a data structure supporting an unlimited number of hash levels, which solves the above problems.
The aforementioned techniques are indexing technologies that support titles up to 4 levels; the algorithm of the present invention can be used to annotate not only titles but also body text.
For example, suppose the requirement is to add the note [product off shelf] next to every text entry containing the keyword "dishwasher" under price questionnaires → product catalogue → home appliances → small appliances → domestic brands. Only the hash tree algorithm of the present invention can fulfil this requirement: 1) the invention supports hash trees of unlimited dimension, without needing to know in advance under how many levels of headings this product appears; 2) a directory generated by indexing only the titles also does not meet the requirement: the relevant words must be put into the super directory, and keyword search and fuzzy search must also be supported, which the invention realizes with an infinite-dimension hash tree based on regular-expression search.
The super directory provided by the invention contains all contents of the document and their organizational structure, and the hash tree data structure generated from it supports keyword search and fuzzy search, forming a hash tree based on regular-expression search.
It should be noted that an existing hash table of limited dimension is a subset of the hash tree; compared with the infinite-dimension hash tree of the present invention, the difference is like that between a calculator supporting only the four basic operations and one supporting arbitrary calculations. Limited-dimension hash tables typically support at most 4 levels of headers, which is covered by the scope of the infinite-dimension hash tree of the invention. That is, the infinite-dimension hash tree provided by the invention covers at least level-1 titles, and hash tree applications from 1 title dimension up to an unlimited number of dimensions fall within the protection scope of the invention.
The invention uses Python to implement the infinite-dimension hash, i.e., the hash tree. The invention constructs an object from Python's hash-structured data type, which allows the object to nest itself without limit, and partially modifies the memory garbage-collection mechanism so that unused hashes are not reclaimed until the code ends. With this implementation the invention realizes infinite-dimension hashing.
At the same time, the lookup method inside the hash is reimplemented so that key values are searched by string matching; if string matching fails and the key list contains a regular-expression object (re.Pattern), that regular object is used for matching, and the hash table object corresponding to the first matching regular expression is returned. The following effects are obtained:
1. The data structure need not be declared in advance; a hash table of arbitrary dimension can be declared at any time.
2. If the value of a declared hash table of arbitrary dimension does not exist, its value is defined as a one-dimensional hash.
3. The hash table value in each dimension may be any object, including another hash table, but only data containing strings is retrieved with "=". For example, consider the following code (the statements below are Python code):
First, MultiHash["T1"]["T2"] = ["Comment"] assigns a value to the multidimensional hash;
if the code then accesses MultiHash["T1"]["T2"]["T3"], an empty hash {} is returned;
after that access, the value stored at MultiHash["T1"]["T2"] actually contains both the original string list and the newly created sub-hash;
any retrieval with the equals sign obtains only the first string-list part: after X = MultiHash["T1"]["T2"], the value of X is ["Comment"].
Finally, the invention constructs the whole data structure as an object (a class), which is convenient for subsequent calls.
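The effects listed above can be approximated with a short Python class. This is only a sketch of the idea, an auto-vivifying nested hash with regular-expression key matching, and not the invention's actual implementation.

    import re

    class MultiHash(dict):
        # Auto-vivifying nested hash: accessing a missing key creates a new MultiHash node.
        def __missing__(self, key):
            node = MultiHash()
            self[key] = node
            return node

        def __getitem__(self, key):
            # Exact string match first; otherwise fall back to any re.Pattern key that matches.
            if key in self.keys():
                return dict.__getitem__(self, key)
            if isinstance(key, str):
                for k in self.keys():
                    if isinstance(k, re.Pattern) and k.search(key):
                        return dict.__getitem__(self, k)
            return self.__missing__(key)

    h = MultiHash()
    h["T1"]["T2"]["note"] = "Comment"                      # no dimensions declared in advance
    h[re.compile("dishwasher")]["note"] = "product off shelf"
    print(h["T1"]["T2"]["note"])                           # Comment
    print(h["household dishwasher X100"]["note"])          # product off shelf (regex fallback)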
Fig. 6 is a diagram of the hash tree model versus the document structure according to an embodiment of the present invention. The hash tree does not record the coordinates or page numbers of the character elements, nor even their order; it only records the membership relations between character strings or between titles of different levels. It can be seen that this hash tree records only the membership of the titles of different levels in Table 1 and Table 2; the order of the titles is random and carries no meaning in the hash result. The contents of the two sub-tables, Table 1 and Table 2, are completely independent and do not affect each other. Thus, lossless segmentation, distribution and parallelization of tasks can be achieved.
The blank CRF-PDF file is scanned page by page and line by line, and the logical structure of the whole document can be reconstructed and imported into the hash tree according to the format specification information obtained beforehand.
A hash tree is a data structure that allows multiple values to be accessed and found quickly through multiple keys. In this scenario, the invention scans the CRF document file to be processed page by page and line by line, constructs the logical structure of the whole document and abstracts it into a hash tree, including but not limited to using the titles of different levels of the CRF document file as keys in hash tables at different levels of the hash tree, converting each row and each column into keys or values in a hash table of some dimension according to the logical membership of the table, and realizing multi-level storage by nesting multidimensional hash tables repeatedly. This not only makes it easy to access and process the table data, but also improves efficiency and reduces labor and time costs through the fast lookup capability of the hash table.
Generally, the human reading habit is to go through the page numbers from low to high and to read each page from left to right and from top to bottom. If the document is traversed according to this convention, the expected document structure can be obtained. Colloquially, if "Heading 3.1" appears, then "Heading 3" must already have been found on an earlier page, or above or to the left of it on the same page. Combined with some background knowledge, a super directory can be constructed accurately to record the organizational structure of the document and the corresponding text elements. For example, once the format standards of the titles of different levels and of the document contents are known, titles of arbitrary levels can be captured by progressive scanning of each whole page. For the page above, for instance, the invention knows in advance:
The primary headings are black, bold, size 5, in the Curel font, and enclosed in brackets;
the secondary headings are black, bold, size 6, TimesNewRoman, enclosed in brackets, and the block in which they appear is no more than 50 pixels from the border in the X coordinate;
the tertiary headings have the same font and font size as the secondary headings but appear at least 300 pixels from the left border;
the fourth-level headings appear in the form [A: ...] with a colon, and the content after the colon is consistent with the final annotation result;
headings above the fourth level have the same format as the fourth level but are spaced at least 100 pixels apart.
Based on the above knowledge, the invention can quickly scan out the titles of the different levels of the whole document by line-by-line traversal and assign values to the infinite-dimension hash tree row by row in the form Hash[T1][T2]...[Tn]. Of course, other logic is possible depending on the settings of each document, but multiple pieces of information based on fonts, font sizes, colors, context, character coordinates, special symbols and so on can all serve as parsing references.
Compound conditions such as the font, font size and coordinates of each document part may be used as features to label document titles or table names of different levels. For example, it may be specified that characters with a character margin greater than 2 pixels, an X-axis coordinate greater than 1000 pixels, bold, in the TimesNewRoman font, at font size 6 and colored red are primary titles. Because the method directly parses the PDF binary file page by page, line by line and text block by text block, all of this meta-information can be used as features to match titles of different levels, which clearly provides great convenience for the tabulating party.
Even if the meta-information is insufficient to define the document titles of all levels, the invention can scan the document parts one by one from left to right and from top to bottom according to human reading habits, and essentially correct results can be obtained just from the X-axis coordinates and the scanning order. Even if the parsing result conflicts with expectations, two identical lines of characters do not normally appear in one table; on this basis the method can traverse all potential annotation positions in the document and generate the JSON file for convenient editing.
After the whole CRF document file has been traversed once, a hash tree object containing the whole document content and organizational structure is obtained; its macroscopic structure can be likened to a super directory recording all data organization forms and contents of the document. The super directory is saved as a JSON-format file and given to the staff for annotation.
The hash tree is constructed using a directory-like method to record the document structure, which completely decouples the document structure from the absolute coordinate system and from page numbers. At the same time, the text blocks of the binary file are scanned by code, so regions such as pictures are avoided entirely and have no influence.
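The construction of the super directory can be sketched as follows, assuming PyMuPDF (the fitz module) as the PDF reader; the title classifier here is a deliberately simplified stand-in for the font, size, coordinate and symbol features described above, and the real generatejson.py may differ.

    import json
    import fitz  # PyMuPDF, an assumed choice of PDF reading library

    def classify_level(span):
        # Toy classifier: decide the title level of a text span from its size and boldness.
        # A real implementation would use the full feature set (font, color, coordinates, symbols).
        if span["size"] >= 14 and "Bold" in span["font"]:
            return 1
        if span["size"] >= 12 and "Bold" in span["font"]:
            return 2
        return None  # ordinary text

    def build_super_directory(pdf_path):
        tree, path = {}, []                   # `path` holds the current chain of titles, one per level
        for page in fitz.open(pdf_path):      # page by page ...
            for block in page.get_text("dict")["blocks"]:
                for line in block.get("lines", []):            # ... line by line (image blocks are skipped)
                    for span in line["spans"]:
                        text, level = span["text"].strip(), classify_level(span)
                        if not text:
                            continue
                        if level is not None:
                            path = path[:level - 1] + [text]    # a new title starts at this level
                        node = tree
                        for key in path:
                            node = node.setdefault(key, {})     # descend, creating branches as needed
                        if level is None:
                            node.setdefault("_text", []).append(text)
        return tree

    if __name__ == "__main__":
        with open("crf_super_directory.json", "w", encoding="utf-8") as f:
            json.dump(build_super_directory("blank_crf.pdf"), f, ensure_ascii=False, indent=2)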
Step S2: manually edit the JSON file corresponding to the super directory, insert a value containing a character string at each place that requires a marker so as to add annotation information, and generate the edited JSON file corresponding to the super directory, denoted the aJSON (annotated JSON) file.
A value in the form of a character string is added at the positions to be marked in the JSON file to add the annotation information.
Step S3: convert the super directory corresponding to the edited aJSON file of step S2 into a hash tree data structure in memory, then rescan the CRF document file to be processed and use each captured multidimensional key combination to take a value from the hash tree, thereby performing annotation or annotation migration. This comprises: when the value taken from the corresponding hash tree structure with a key combination is detected to contain a character string, extracting the coordinate of the last keyword of the keyword combination in the CRF document file, generating a text box beside that coordinate, and placing the corresponding annotation information in the text box; that is, the value taken from the hash tree structure with the keyword combination serves as the annotation information. In this way the annotation information of step S2 is automatically added at the corresponding child-node positions of the CRF document file, and the annotated CRF document file is generated as the aCRF (annotated CRF) file.
Specifically, after the JSON file has been edited, it is read back and converted into a hash tree, the blank CRF document file is traversed again line by line, and the document annotation is realized through value-taking.
The blank CRF document file is parsed page by page and line by line. The multi-level title combination obtained after parsing each line is used as a keyword combination to take a value from the hash tree. If a value exists and contains a character string S, the string is taken out, the rightmost x-axis coordinate of the last title keyword of the multi-level keyword combination is extracted from the CRF document, an offset is added to that coordinate, a text box is placed at the offset position, and the width of the text box is adjusted to the width of the character string S. The character string S is then placed in the corresponding text box as the annotation information, completing the annotation.
To sum up, in this step the annotated JSON file is first loaded into memory and restored into the hash tree data structure; the CRF document file is then traversed page by page and parsed line by line, and the keyword combination obtained from each line is used to take a value from the hash tree. If the value contains text, the text is extracted. At the same time, the coordinates of the last keyword of the keyword combination on the page are extracted, an offset is applied to draw a text box, and the text is written into the text box to finish the annotation.
The edited JSON file and the blank CRF-PDF file are passed to an AddComment.py script. The program reads the data structure of the JSON file, compares it with the hash tree of the blank CRF, and, whenever information containing a character string, i.e. annotation information, appears under a child node, draws a text box beside the coordinates matched to that node and places the annotation information in it to complete the annotation. Finally, the AddComment.py script adds the annotations to the CRF-PDF and generates the annotated CRF-PDF file.
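A minimal sketch of this value-taking and annotation step follows; it is not the AddComment.py source itself. `scan_key_combinations`, `find_rightmost_coord` and `draw_text_box` are hypothetical helpers standing in for the PDF-parsing and PDF-drawing calls, and the hash-tree lookup is the part being illustrated.

```python
import json

def take_value(tree, key_combination):
    """Walk the key combination through the nested dict; return None if the path is absent."""
    node = tree
    for key in key_combination:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node

def add_comments(pdf_path, ajson_path, out_path, offset=5):
    with open(ajson_path, encoding="utf-8") as fh:
        tree = json.load(fh)                                    # restore the annotated hash tree
    for page_no, keys in scan_key_combinations(pdf_path):       # hypothetical line-by-line scanner
        value = take_value(tree, keys)
        if isinstance(value, str):                              # a string value means "annotate here"
            x, y = find_rightmost_coord(pdf_path, page_no, keys[-1])   # hypothetical coordinate helper
            draw_text_box(out_path, page_no, x + offset, y, text=value)  # hypothetical drawing helper
```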
FIG. 8 is a flow chart of a document annotation migration operation according to an embodiment of the present invention.
If the table name of each sub-table is used as the first-dimension key of the hash table and the data in the sub-table is stored as the corresponding value, then the table name of each sub-table becomes the first-dimension key of the hash tree.
In this case, the table name of each sub-table may be used as the top-level key of the hash table, the headers of each row and each column may be used as the second-level and third-level keys, and the data in the cells may be stored as the values. The invention can then conveniently access the data of a sub-table by its name and look up cells by row title and column title, realizing efficient data storage and querying.
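A small sketch of this three-level layout follows; the table, row and column names and the cell values are invented for illustration.

```python
# table name → row header → column header → cell value, mirroring the description above
crf_tree = {
    "Vital Signs": {                  # first-level key: table (sub-table) name
        "Visit 1": {                  # second-level key: row header
            "Systolic BP": "120",     # third-level key: column header; the value is the cell content
            "Diastolic BP": "80",
        },
    },
}

# A cell is reached by its key combination:
cell = crf_tree["Vital Signs"]["Visit 1"]["Systolic BP"]   # "120"
```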
Hash tables are unordered, so in the parsed hash tree the order in which tables appear has no influence on the parsing result. For PDF data files in which changes have occurred, such as added section titles, a changed order of secondary titles, or adjusted pagination, the hash tree records the logical organization structure of the PDF document, so if the content of a given table has not changed, its corresponding multi-dimensional hash value has not changed either. Therefore, as long as the key-value pairs of the new document's structure can be obtained, the annotations to be migrated can easily be carried over without any of the problems caused by ordering, page numbers or coordinates. Meanwhile, if the hash table structures of the old and new documents are inconsistent, the inconsistent structure is simply added to the new hash tree: during the final traversal of the new document for annotation, the hash branches unique to the old document are never visited, since they do not correspond to the new document, and they have no impact on the overall result.
In summary, only the organization structure of the PDF document is retained in the hash tree. The program traverses the entire PDF file and, only when it encounters a specific text element in a specific document structure that needs to be annotated, retrieves the absolute coordinates of that element in the document and uses them to compute where to place the annotation text box beside the element. This means that even if the document structure changes, and with it the absolute coordinates of the relevant elements, the last step of the flow of the invention is to scan the document and obtain the coordinates, so the coordinates simply follow the changes, and the previous coordinates and page order have no influence on the invention.
FIG. 7 illustrates the annotation migration process. Sub-figure (A) is the record file obtained after annotating the old-version document, sub-figure (B) is the directory structure file of the new-version document, from which a table, "table 3", has been removed, and sub-figure (C) is the final annotation result obtained after migration. Note that the annotated "table 3" data, although present in the record file, has no effect, because it is the new-version PDF table file that is traversed during the final annotation restore process. Since "table 3" does not exist in the final PDF document, it cannot possibly be annotated. Conversely, if "table 3" did exist in the new-version PDF document, its comments would also be migrated successfully.
When updating, the organization structures of the two document versions are compared and the old-version annotations directly overlay the new, blank hash tree: where a part of the document has the same structure in the old and new versions, the old annotations directly overwrite the new blank key-value pairs, realizing document annotation migration; where the structures differ, the union of the old and new subtrees is taken. Since it is the new-version blank document that is searched and traversed in the end, document structures unique to the old version are never traversed, so the new document is not affected at all.
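The overlay can be sketched as a recursive merge of the old annotated tree onto the blank tree parsed from the new version; this is an illustration of the idea rather than the patented implementation.

```python
def overlay(new_tree, old_tree):
    """Copy old-version annotations onto the new blank tree; unknown branches are unioned in."""
    for key, old_val in old_tree.items():
        if isinstance(old_val, dict):
            branch = new_tree.get(key)
            if not isinstance(branch, dict):
                branch = {}
                new_tree[key] = branch
            overlay(branch, old_val)
        elif old_val is not None:          # a string, i.e. an annotation from the old version
            new_tree[key] = old_val
    return new_tree
```

Branches that exist only in the old tree end up in the merged tree but are never visited when the new blank document is traversed, so they have no effect on the result.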
FIG. 9 is a schematic diagram of a document segmentation and integration model according to an embodiment of the invention.
In the invention, the hash tree generated by parsing the CRF document file can be divided into different hash subtrees by chapter, and these subtrees are completely independent and do not affect one another.
In practice, a CRF document of at least 200 pages is currently annotated from beginning to end by a single member of staff, which is obviously very laborious. Using the hash tree, the different tables can be divided into different subtrees (i.e. the primary keys in the figure are completely independent of one another), and the subtrees do not affect each other. Thus the subtree corresponding to each table can be cut out and annotated individually without affecting the overall annotation result.
The generated hash tree may be locally or globally overlaid an unlimited number of times by a hash or linked-list data structure to enable splitting and version updating, based on the following properties:
(1) The hash tree Ti generated from any sub-section or sub-table of the original CRF file is necessarily a true subtree of the hash tree T generated by parsing the complete original file, and the hash subtrees formed by independent sub-units of the document are completely independent of one another and do not affect each other.
(2) A change of order among the hash subtrees has no effect on the overall tree.
(3) The contents of any subtree of the hash tree can be arbitrarily overwritten by a subtree with the same key structure, and the final value depends only on the last overwrite.
(4) The hash tree content can be arbitrarily extended: if a new branch appears in a new subtree during an overlay, the corresponding content is automatically added to the generated hash tree.
Based on these properties, the CRF document file is divided into several blocks by table or by chapter, each block is independently built into a hash subtree recorded in its own JSON file, the subtrees are annotated independently by different staff, and finally all annotation results are merged with an Update function without any data loss or ambiguity.
Specifically, a PDF document can be divided into blocks by table or by chapter, each block individually structured to describe its own document organization, as long as the division is appropriate. Different people then annotate the blocks independently, and the Update function merges all annotation results back together, enabling lossless segmentation and seamless integration of the task.
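A minimal sketch of this split-and-merge workflow follows, under the assumption that each top-level key of the super directory is one table or chapter; `dict.update` plays the role of the Update function mentioned above, and nothing is lost because the split parts have disjoint top-level keys.

```python
import json

def split_by_table(tree):
    """One single-key tree per table/chapter, each to be annotated by a different person."""
    return [{name: subtree} for name, subtree in tree.items()]

def merge(parts):
    merged = {}
    for part in parts:
        merged.update(part)        # disjoint top-level keys, so no data loss or ambiguity
    return merged

# parts = split_by_table(json.load(open("super_directory.json", encoding="utf-8")))
# ... each part is written out, annotated independently, and loaded back ...
# merged_tree = merge(annotated_parts)
```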
This relies on the unorderedness of the hash table and the complete independence of the hash subtrees to make CRF-PDF document annotation modular and divisible: one document is split by table and sent to several people for modification and annotation, and however any one part is edited or extended, the other parts of the document are unaffected.
The data structure of the recursively nested, unlimited-dimension hash tree algorithm of the invention is described below in connection with a specific embodiment.
Keys, values and key combinations are proper terms of the hash structure. When values are taken from the hash tree, the meaning of [level-1 key] → [level-2 key] → [level-3 key] is hierarchical. For example, a search may narrow from country to city, to district, to street, to building number, to floor, to house number; this is a level-by-level refinement, not a simple permutation and combination. The invention uses the hash tree to build an oversized queryable catalog (regular-expression fuzzy query plus exact string-matching query) that holds all the content and hierarchical relationships of the file, for example:
[book name] → [chapter name 1] → [sub-chapter name 1] → [paragraph name 1] → [first line] → hello → null
…… → [second line] → "Have you eaten" → null
The result is saved in a dedicated file format, namely a JSON-format file.
The invention then edits this JSON file, adding a text attribute after "Have you eaten": → "He is hungry". The quotation marks are used here to tell the program that this value is text.
Next, the invention loads the JSON file containing the added "He is hungry" text attribute back into memory and converts it back into a hash tree. The original PDF file is then parsed again, line by line, with the same method used to build the hash tree initially. The content of the article is unchanged, so the book name, chapter names and sub-chapter names produced by this pass are identical to those obtained when the hash tree was first built.
During this traversal the hash tree is not built anew; instead, values are taken from the hash tree already held in memory.
The invention scans the file line by line. When it scans [book name] → [chapter name 1] → [sub-chapter name 1] → [paragraph name 1], there is no value. It then scans the next line; the value of [book name] → [chapter name 1] → [sub-chapter name 1] → [paragraph name 1] → [first line] → hello is null. It then scans the next line.
The value of [book name] → [chapter name 1] → [sub-chapter name 1] → [paragraph name 1] → [second line] → "Have you eaten" is "He is hungry". At this moment the scan of the original article has reached the characters "Have you eaten", so their coordinates are obtained, a text box is placed to the right of those characters according to the coordinates, and "He is hungry" is written into the text box, completing the annotation. Scanning through the file line by line in this way adds every annotation that needs to be added.
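This running example can be reproduced with a few lines of Python; the tree literal below mirrors the edited JSON, and the two lookups show the value-taking results described above.

```python
tree = {
    "Book": {"Chapter 1": {"Sub-chapter 1": {"Paragraph 1": {
        "first line": {"hello": None},
        "second line": {"Have you eaten": "He is hungry"},   # annotation inserted by editing the JSON
    }}}},
}

def take_value(tree, keys):
    """Walk the key combination; return the stored value or None."""
    node = tree
    for key in keys:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node

print(take_value(tree, ["Book", "Chapter 1", "Sub-chapter 1", "Paragraph 1",
                        "first line", "hello"]))           # -> None (nothing to annotate)
print(take_value(tree, ["Book", "Chapter 1", "Sub-chapter 1", "Paragraph 1",
                        "second line", "Have you eaten"]))  # -> "He is hungry"
```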
In summary, the invention designs and implements a data structure, a recursively nested unlimited-dimension hash tree, for storing the parsing result of PDF table data. The principle is to build a super directory containing all contents of the tables and their upstream and downstream organization relations and to save it as a file. Staff edit this super directory file and insert character strings at the places to be annotated. The utility then performs automatic annotation against the original PDF file and this modified super directory file.
According to the efficient and automatic document file annotation method based on the hash tree algorithm of the embodiment of the invention, CRF clinical data can be annotated to generate an aCRF (annotated CRF); that is, a CRF table in any PDF version is annotated automatically, and version migration, modular segmentation and modular merging are realized rapidly.
The efficient and automatic document file annotation method based on the hash tree algorithm provided by the embodiment of the invention has the following beneficial effects:
(1) The complete meta-information of the document is retained, which facilitates tabulation and operation. Compound conditions such as the font, font size and coordinates of each document part may be used as features to label document titles or table names of different levels. For example, a rule may require that a line whose character margin is larger than 2 pixels, whose X-axis coordinate is larger than 1000 pixels, and which is rendered in bold, red, size-6 TimesNewRoman characters is a primary title. Because the method parses the PDF binary file page by page, line by line and text block by text block, all of this meta-information is available as features for matching titles of different levels, which greatly facilitates the work of the form-building party.
(2) High tolerance to PDF parsing errors. Even if the meta-information is insufficient to identify document titles of every level, the invention can scan document parts one by one from left to right and from top to bottom, following human reading habits, and an essentially correct result can be obtained from the X-axis coordinates and the scanning order alone. Even if the parsing result conflicts with expectations, two identical lines of characters rarely appear within the same table; on this basis, the method can traverse all potential annotation positions in the document and generate a JSON file that is convenient to edit.
(3) Insensitivity to the coordinate positions and page numbers of table contents and to the typesetting form. The order of the keys in a hash table carries no meaning, so in practice the invention only needs to ensure the integrity of each table structure; the order of the pages within a table and the order among the tables have no effect on the final parsing result.
(4) The processing speed approaches the theoretical limit. The lookup time of a hash table is nearly O(1), with an expected lookup cost of about half that worst case, making it essentially the fastest search method, so in theory the method approaches the theoretical speed limit. Compared with the parsing speed reported in a related study on the same 200-page document, whose flow comprises 6 major steps each taking close to 30 s even without manual operation, the invention needs only 6 s from start to finish, a processing speed far beyond that of the existing method.
(5) The method does not depend on any precondition or prior manual work. A program traverses the PDF file to build a JSON-structured data file, and annotations are then constructed automatically from the edited JSON file, without relying on an SDS file, manually built Book Marks, or any other precondition. Parsing and annotation are thus truly automatic, saving a great deal of manpower. Moreover, the annotation result is a text box constructed by adding an offset to the located text; it relies on relative coordinates and does not depend on page numbers or absolute text coordinates at all.
(6) A fuzzy search function based on regular expressions is provided. For common decorations such as spaces and tabs, regular expressions can be specified for fuzzy searching. In practice some annotation values also vary with the matched content, and these too can be expressed with regular expressions.
(7) High modularity. Relying on the unorderedness of the hash table and the complete independence of the hash subtrees, CRF-PDF document annotation becomes modular and divisible: a document can be split by table and sent to several people for modification and annotation, and however any one part is edited or extended, the other parts of the document are unaffected.
(8) The document is extremely convenient to update. All previous methods, whether based on SDS files, text matching, content numbering, Book Marks or XDF file creation, depend on page numbers and absolute coordinates or on specific text typesetting rules. Once these change, the whole upstream work has to be redone: for example, swapping the order of two chapters of a table, swapping the order among several tables, or enlarging or shrinking a margin of a table shifts the entire coordinate and page-number system, and all upstream work that depends on that system is destroyed. With text-matching-based methods, if a picture is encountered in the document, or a rare text encoding such as UTF-16 or GBK is used, all subsequent text turns into garbled characters and the whole method collapses.
In contrast, the method of the invention builds a hash tree in a catalog-like manner to record the document structure and is completely decoupled from the absolute coordinate system and page numbers. Because the text blocks of the binary file are scanned by code, regions such as pictures are skipped entirely and have no influence. When updating, the organization structures of the two document versions are compared and the old-version annotations directly overlay the new, blank hash tree: where a part of the document has the same structure in both versions, the old annotations overwrite the new blank key-value pairs, realizing annotation migration; where the structures differ, the union of the old and new subtrees is taken. Since it is the new-version blank document that is searched and traversed in the end, document structures unique to the old version are never visited, so the new document is not affected at all.
(9) Data storage is extremely convenient. The JSON format is itself a storage format used by database software and is widely accepted by NoSQL systems, and conversion between hash trees and JSON is covered by a set of very mature standards and tools. After the document structure and annotation information are saved through the hash tree and converted into JSON, they can conveniently be stored in any NoSQL database.
(10) Convenience of operation. The method does not depend on any upstream data: the PDF file itself is used to build a catalog-like network of document structure relations in the form of a hash tree, omitting the exchange of upstream data between different teams. Because many convenient JSON editors exist and display the data structure of the whole document well, an operator can easily read the document structure from the JSON file and annotate it, so the whole procedure is greatly simplified compared with other methods. Moreover, since the structure of the whole document is fully preserved, each annotation necessarily appears beside the corresponding text, which makes it very easy for the user to check the accuracy of the result: whether an annotation is correct can be seen simply by inspecting the text box beside the corresponding text.
(11) The method can be applied to annotating CRF clinical data to form aCRF files, supporting document format adjustment, modular task segmentation and annotation migration.
(12) The invention uses a relative coordinate system: it does not need to know on which page or line a given piece of text sits, as long as an annotation text box can be placed beside the text, to its left or right or above or below it. The invention can locate the coordinates of specified content at a specified title level of a specified table in a specified document and place an annotation text box beside it, and the width and height of the text box adjust automatically to the annotation content. Annotating the content of a single title in isolation is not meaningful, because a title may appear many times in a document, but the combination of upstream and downstream multi-level titles is absolutely unique: "sex" + "male" is not a unique title combination, whereas the four-level combination "nationality sample questionnaire" → "other" → "sex" → "male" identifies the position uniquely.
(13) In addition to strict string correspondence, fuzzy search is supported: even non-alphanumeric differences such as case errors and extra spaces are tolerated, guarding against placeholders such as spaces added for a prettier display or inconsistent capitalization. If these strings (titles or table names) cannot be matched perfectly, the invention also supports searching with regular-expression matching, as sketched below.
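A minimal sketch of the regex-based fuzzy key matching described in items (6) and (13): when an exact key is missing, a case-insensitive, whitespace-tolerant pattern is tried instead. The helper name and the example keys and values are illustrative assumptions.

```python
import re

def fuzzy_get(node, key):
    """Exact lookup first; otherwise a case- and whitespace-insensitive regex match."""
    if key in node:
        return node[key]
    pattern = re.compile(r"\s*".join(map(re.escape, key.split())), re.IGNORECASE)
    for candidate, value in node.items():
        if pattern.fullmatch(candidate.strip()):
            return value
    return None

# fuzzy_get({"Systolic BP": "VS.VSORRES"}, "systolic  bp") still finds the entry
# despite the case difference and the extra space.
```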
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and are not to be construed as limiting the invention, and that changes, modifications, substitutions and variants may be made to the above embodiments by those skilled in the art without departing from the spirit and principles of the invention. The scope of the invention is defined by the appended claims and their equivalents.
Claims (10)
1. An efficient and automatic document file annotation method based on a hash tree algorithm, characterized by comprising the following steps:
step S1, obtaining a CRF document file to be processed and constructing a super directory, comprising: extracting the CRF document file to be processed into a document structure in the form of a super directory, and converting the document structure into a hash tree structure, wherein the hash tree structure is the super directory containing all contents of the CRF document file and its upstream and downstream organization relations; storing the super directory as a JSON file;
step S2, manually editing the JSON file corresponding to the super directory, inserting a value containing a character string into the JSON file at a place to be marked so as to realize adding annotation information, and generating the edited JSON file corresponding to the super directory, and marking the JSON file as an aJSON file;
step S3, converting the edited super directory corresponding to the aJSON file of step S2 into a hash tree data structure, then rescanning the CRF document file to be processed, and using each captured multi-dimensional keyword combination as a key combination to perform a value-taking operation in the hash tree and realize annotation or annotation migration, comprising: when the value taken for a multi-dimensional key combination in the corresponding hash tree structure is detected to contain a character string, extracting the coordinates of the last keyword of the keyword combination in the CRF document file, generating a text box beside the coordinates, and placing the corresponding annotation information in the text box, namely the value taken in the hash tree structure with the keyword combination as the key combination; and so on, automatically adding all the annotation information of step S2 at the corresponding child-node positions of the CRF document file, and generating the annotated CRF document file as the aCRF file.
2. The efficient and automatic document file annotation method based on the hash tree algorithm as claimed in claim 1, wherein in the step S1, the CRF document file to be processed is scanned page by page and the logical structure of the whole document is constructed and abstracted into a hash tree, including but not limited to: titles of different levels of the CRF document file are used as keys of hash tables at different levels of the hash tree, each row and each column is converted into a key of a hash table of a certain dimension of the hash tree according to the logical membership of the table, and multi-level storage is realized by nesting multi-dimensional hash tables multiple times;
after the whole CRF document file has been traversed once, a hash tree object containing the entire document content and organization structure is obtained, whose macroscopic structure is analogous to a super directory recording all the data logical organization forms and contents of the document, and the super directory is saved as a JSON-format file.
3. The efficient and automatic annotation method for document files based on hash tree algorithm as claimed in claim 2, wherein the logical organization structure of CRF document files and tables and corresponding text contents are recorded in the super directory.
4. The efficient and automatic document file annotation method implemented based on a hash tree algorithm as claimed in claim 3, wherein the super directory includes all contents and content organization structures of the document, and the hash tree data structure generated by the super directory has keyword searching and fuzzy searching functions, forming a hash tree based on regular searching.
5. The efficient and automatic annotation method for document files based on hash tree algorithm as claimed in claim 1, wherein in said step S2, a value containing character string is added to the JSON file where annotation is needed to realize annotation information addition.
6. The efficient automatic document file annotation method based on the hash tree algorithm as claimed in claim 1, wherein in the step S3, after editing the JSON file, the edited JSON file is read and converted into a hash tree, and the blank CRF document file is traversed row by row again, and document annotation is realized by taking a value;
analyzing a blank CRF document file page by page and line by line, using the multi-level title combination obtained after parsing each line as a keyword combination to perform a value-taking operation on the hash tree, and if a value exists and contains a character string S, taking out the character string, extracting from the CRF document the rightmost x-axis coordinate of the last title keyword of the multi-level keyword combination, adding an offset to that coordinate, setting a text box at the offset coordinate position, the width of the text box varying with the width of the character string S; and putting the character string S as annotation information into the corresponding text box to realize the annotation.
7. The efficient and automatic document file annotation method based on hash tree algorithm as claimed in claim 1, wherein the hash tree generated after parsing the CRF document file can be divided into different hash tree subtrees according to different chapters, and the subtrees are completely independent and do not affect each other.
Based on these characteristics, the CRF document file is divided into a plurality of blocks by table or by chapter, each block is independently constructed into a hash subtree recorded in a corresponding JSON file, the hash subtrees are annotated independently by different staff, and finally all annotation results are merged using an Update function without any data loss or ambiguity.
8. The efficient and automated annotation method for document files based on hash tree algorithm as claimed in claim 7, wherein said generated hash tree is locally or globally overlaid an unlimited number of times by hash or linked list data structures to achieve split and version update, comprising:
(1) The hash tree Ti generated by any sub-chapter or sub-table in the original CRF document file is a true sub-tree of the hash tree T generated by analyzing the complete original document, and the hash sub-trees formed by any independent sub-units in the document are completely independent of each other and do not affect each other;
(2) The sequence change among the hash tree subtrees has no effect on the total tree;
(3) The contents in any subtree of the hash tree can be arbitrarily covered by the subtree with the same key structure, and the final value is only related to the value covered last time;
(4) The hash tree content can be arbitrarily expanded, and if a new branch appears in a new subtree in the coverage process, related content can be automatically expanded in the generated hash tree.
9. The efficient and automated annotation method for document files based on hash tree algorithm as claimed in claim 1, wherein said CRF document file is any readable file.
10. A hash tree algorithm based efficient and automated document file annotation method according to any of claims 1-9, wherein the method is applicable to any format file openable by a reading tool, including but not limited to documents and picture files.