
CN113449063B - Method and device for constructing document structure information retrieval library - Google Patents

Method and device for constructing document structure information retrieval library

Info

Publication number
CN113449063B
CN113449063B CN202110708173.9A CN202110708173A CN113449063B CN 113449063 B CN113449063 B CN 113449063B CN 202110708173 A CN202110708173 A CN 202110708173A CN 113449063 B CN113449063 B CN 113449063B
Authority
CN
China
Prior art keywords
document
domain
sample
vector
category
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110708173.9A
Other languages
Chinese (zh)
Other versions
CN113449063A (en)
Inventor
沈鹏
陈垚亮
王俞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Rootcloud Technology Co Ltd
Original Assignee
Rootcloud Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Rootcloud Technology Co Ltd
Priority to CN202110708173.9A
Publication of CN113449063A
Application granted
Publication of CN113449063B
Current legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3335 Syntactic pre-processing, e.g. stopword elimination, stemming
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and a device for constructing a document structure information retrieval library. The method comprises: determining the domain subdivision item of each collected sample document; for each determined domain subdivision item, extracting segmented words from its sample documents and constructing a vectorized domain subdivision item keyword library from the extracted words; dividing the sample documents of the domain subdivision item by document type, extracting document structured information from each sample document, and generating a document structured information vector for the sample document from the document structured information and the domain subdivision item keyword library; reducing the dimension of the document structured information vector to obtain a dimension-reduced document structured information vector for the sample document; and constructing the document structure information retrieval library from the domain subdivision item keyword library, preset domain subdivision item codes, and the dimension-reduced document structured information vectors of the sample documents. The method can improve the accuracy of document retrieval.

Description

Method and device for constructing document structure information retrieval library
Technical Field
The invention relates to the technical field of information retrieval, in particular to a method and a device for constructing a document structure information retrieval library.
Background
As digitization spreads across industrial enterprises, many of them have accumulated large numbers of documents such as descriptions, process documents, and specifications. For data security reasons, industrial enterprises generally choose to develop internal office and business systems for their own field, and to share and query documents within those systems.
However, current document retrieval libraries store only documents and their keywords, and retrieval works by matching keywords extracted from the short document content entered by a user against the stored keywords. As a result, document retrieval precision is low and fine-grained document retrieval requirements cannot be met.
Disclosure of Invention
In view of the above, the present invention aims to provide a method and an apparatus for constructing a document structure information retrieval library, so as to improve the accuracy of the constructed document structure information retrieval library for document retrieval.
In a first aspect, an embodiment of the present invention provides a method for constructing a document structure information retrieval library, including:
determining the domain subdivision item of each collected sample document;
for each determined domain subdivision item, extracting segmented words from the sample documents of that domain subdivision item, and constructing a vectorized domain subdivision item keyword library based on the extracted segmented words;
dividing the sample documents of the domain subdivision item according to preset document types, extracting, for each sample document, the document structured information of the sample document according to the document structured information extraction strategy corresponding to its document type, and generating a document structured information vector of the sample document according to the document structured information and the domain subdivision item keyword library;
reducing the dimension of the document structured information vector of the sample document to obtain a dimension-reduced document structured information vector of the sample document;
and constructing the document structure information retrieval library according to the domain subdivision item keyword library, the preset domain subdivision item codes, and the dimension-reduced document structured information vector of the sample document.
With reference to the first aspect, the embodiment of the present invention provides a first possible implementation manner of the first aspect, where the method further includes:
determining the domain subdivision item to which an input short document to be searched belongs, and the code of that domain subdivision item;
acquiring the domain subdivision item keyword library corresponding to the code of the domain subdivision item to be searched;
extracting the document structured information of the short document to be searched, and generating a document structured information vector to be searched according to that document structured information and the acquired keyword library;
reducing the dimension of the document structured information vector to be searched to obtain a dimension-reduced document structured information vector to be searched;
and searching the document structure information retrieval library according to the dimension-reduced document structured information vector to be searched to obtain a search result.
With reference to the first possible implementation manner of the first aspect, the embodiment of the present invention provides a second possible implementation manner of the first aspect, where the method further includes:
if the similarity of a hit document in the search result exceeds a preset similarity threshold, checking whether the short document to be searched is already stored in the storage area corresponding to the domain subdivision item to be searched, and if not, updating the information stored in that storage area according to the short document to be searched.
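The retrieval and threshold steps above can be illustrated with a minimal sketch. The source does not name a similarity measure, so cosine similarity is an assumption here, as are the in-memory list standing in for the retrieval library and the names `search`, `doc_id`, `domain_code`, and `vector`.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def search(library, domain_code, query_vector, threshold=0.8):
    """Return (doc_id, similarity) pairs from one domain subdivision item
    whose similarity exceeds the threshold, best match first."""
    hits = []
    for entry in library:
        if entry["domain_code"] != domain_code:
            continue  # only compare within the query's domain subdivision item
        score = cosine_similarity(entry["vector"], query_vector)
        if score > threshold:
            hits.append((entry["doc_id"], score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

# Hypothetical library entries; vectors stand in for dimension-reduced vectors.
library = [
    {"doc_id": "a.txt", "domain_code": "TP391.111", "vector": [1.0, 0.0, 1.0]},
    {"doc_id": "b.txt", "domain_code": "TP391.111", "vector": [0.0, 1.0, 0.0]},
    {"doc_id": "c.txt", "domain_code": "OTHER", "vector": [1.0, 0.0, 1.0]},
]
results = search(library, "TP391.111", [1.0, 0.0, 0.9])
```

Filtering by code before comparing vectors mirrors the order of the steps above: the domain subdivision item is determined first, and only documents in that item are scored.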
With reference to the first aspect, or the first or second possible implementation manner of the first aspect, the embodiment of the present invention provides a third possible implementation manner of the first aspect, wherein extracting the segmented words of the sample documents of the domain subdivision item and constructing the vectorized domain subdivision item keyword library based on the extracted segmented words includes:
performing Chinese word segmentation on each sample document of the target domain subdivision item to obtain segmented words;
calculating, for each segmented word, its term frequency-inverse document frequency (TF-IDF) value;
sorting the segmented words by TF-IDF value;
vectorizing the top N segmented words in the sorted order to construct the domain subdivision item keyword library for the target domain subdivision item, wherein N is a preset natural number.
With reference to the first aspect, or the first or second possible implementation manner of the first aspect, the embodiment of the present invention provides a fourth possible implementation manner of the first aspect, wherein generating the document structured information vector of the sample document according to the document structured information and the domain subdivision item keyword library includes:
for each category in the document structured information extraction strategy, vectorizing the document structured information of that category;
for each position in the domain subdivision item keyword library that has no corresponding vectorized document structured information, setting the vector value at that position to 0, to obtain the document structured information vector of the category;
and splicing the document structured information vectors of all categories to obtain the document structured information vector of the sample document.
With reference to the fourth possible implementation manner of the first aspect, the embodiment of the present invention provides a fifth possible implementation manner of the first aspect, wherein the document types include: txt documents, doc/docx documents, xml/html documents, and pdf documents.
With reference to the fifth possible implementation manner of the first aspect, the embodiment of the present invention provides a sixth possible implementation manner of the first aspect, wherein the categories include a first category, a second category, and a third category, the third category being the whole-document keywords;
when the document type is a txt document, the first category includes the number of paragraphs and the document length, and the second category includes the keywords in the first M lines and the last M lines of the document;
when the document type is a doc/docx document, the first category includes the title, the number of levels, and the length, and the second category includes the position of charts relative to the start of the document;
when the document type is an xml/html document, the first category includes the title tag, the number of levels, and the length, and the second category includes the position of key content tags relative to the start of the document;
when the document type is a pdf document, the first category includes the title, the number of levels, and the length, and the second category includes the position of charts relative to the start of the document.
In a second aspect, an embodiment of the present invention further provides an apparatus for constructing a document structure information retrieval library, including:
the domain judging module is used for judging domain subdivision items of the collected sample documents;
the keyword library construction module is used for extracting, for each determined domain subdivision item, segmented words from the sample documents of that domain subdivision item and constructing a vectorized domain subdivision item keyword library based on the extracted segmented words;
the structure vector generation module is used for dividing the sample document of the domain subdivision item according to the preset document type, extracting the document structural information of the sample document according to the document structural information extraction strategy corresponding to the document type of the sample document for each sample document, and generating the document structural information vector of the sample document according to the document structural information and the domain subdivision item keyword library;
the dimension reduction module is used for reducing the dimension of the document structured information vector of the sample document to obtain the document structured information dimension reduction vector of the sample document;
the search library construction module is used for constructing a document structure information search library according to the domain subdivision item keyword library, the preset domain subdivision item codes and the document structure information dimension reduction vector of the sample document.
In a third aspect, embodiments of the present application provide a computer device including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the above method when executing the computer program.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method described above.
In the method and device for constructing a document structure information retrieval library provided by the embodiments of the invention, the domain subdivision item of each collected sample document is determined; for each determined domain subdivision item, segmented words are extracted from its sample documents and a vectorized domain subdivision item keyword library is constructed from them; the sample documents of the domain subdivision item are divided by preset document type, the document structured information of each sample document is extracted according to the extraction strategy corresponding to its document type, and a document structured information vector is generated from that information and the domain subdivision item keyword library; the vector is reduced in dimension to obtain the dimension-reduced document structured information vector of the sample document; and the document structure information retrieval library is constructed from the domain subdivision item keyword library, the preset domain subdivision item codes, and the dimension-reduced vectors of the sample documents. Because the retrieval library fuses document structure information with semantic keyword information and converts both into vectors, the accuracy of document retrieval using the constructed library can be effectively improved.
In order to make the above objects, features and advantages of the present invention more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a method for constructing a document structure information retrieval library according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an apparatus for constructing a document structure information retrieval library according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a computer device 300 according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by a person skilled in the art without making any inventive effort, are intended to be within the scope of the present invention.
The embodiment of the invention provides a method and a device for constructing a document structure information retrieval library, and the method and the device are described below through the embodiment.
FIG. 1 is a schematic flow chart of a method for constructing a document structure information retrieval library according to an embodiment of the present invention. As shown in FIG. 1, the method includes:
Step 101, determining the domain subdivision item of each collected sample document;
In the embodiment of the invention, as an optional embodiment, the domain subdivision item of a sample document can be determined through user interaction.
In the embodiment of the invention, determining both the domain and the domain subdivision item of each sample document allows a more fine-grained retrieval library to be constructed. As an alternative embodiment, the domains include, but are not limited to: literature, information processing, art, bioscience, medicine, and the like. As another alternative embodiment, each domain may include one or more levels of domain subdivision items. For example, for the literary works domain, the first-level domain subdivision items include novels, songs, etc.; for the first-level domain subdivision item novels, the corresponding second-level domain subdivision items include emotion, martial arts, science fiction, etc.
In the embodiment of the present invention, as an alternative embodiment, the domains may be divided according to the Chinese Library Classification, and a corresponding code is set for each domain or domain subdivision item, where the coding format may follow the domain subdivision and coding sample in Table 1.
In the embodiment of the invention, for a batch of sample documents, as an alternative embodiment, the domain subdivision item to which each sample document belongs is set interactively, the domain subdivision item being the last level of the domain to which the document belongs. As another alternative embodiment, the domain and domain subdivision item of a sample document may be determined by extracting keywords from the sample document and matching them against a preset domain keyword library and a preset subdivision item keyword library. Table 1 shows an example of domains, domain subdivision items, and their codes in the embodiment of the invention.
TABLE 1
[Table 1: domain, domain subdivision item, and coding example; provided as an image in the original publication]
In Table 1, industrial technology is a domain, automation technology and computer technology are first-level domain subdivision items, information processing is a second-level domain subdivision item, and text information processing is a third-level domain subdivision item. In the embodiment of the invention, text information processing is the last level of the industrial technology domain.
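The keyword-matching alternative for determining a domain subdivision item can be sketched as follows. The source only says that extracted keywords are matched against preset keyword libraries; scoring by the size of the keyword overlap, the function name `determine_domain_item`, and the second code `DEMO.2` are assumptions made for illustration.

```python
def determine_domain_item(doc_keywords, domain_keyword_libraries):
    """Return the code of the domain subdivision item whose preset keywords
    overlap most with the document's keywords, or None if nothing matches."""
    doc_set = set(doc_keywords)
    best_code, best_overlap = None, 0
    for code, keywords in domain_keyword_libraries.items():
        overlap = len(doc_set & set(keywords))
        if overlap > best_overlap:
            best_code, best_overlap = code, overlap
    return best_code

# Hypothetical preset libraries keyed by domain subdivision item code.
preset_libraries = {
    "TP391.111": ["natural language", "processing", "algorithm"],
    "DEMO.2": ["novel", "martial arts", "science fiction"],
}
matched = determine_domain_item(["natural language", "algorithm", "corpus"], preset_libraries)
unmatched = determine_domain_item(["weather"], preset_libraries)
```

Returning `None` when no keyword matches leaves room for the interactive alternative, in which a user sets the domain subdivision item by hand.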
Step 102, for each determined domain subdivision item, extracting segmented words from the sample documents of that domain subdivision item, and constructing a vectorized domain subdivision item keyword library based on the extracted segmented words;
In the embodiment of the present invention, as an optional embodiment, extracting the segmented words of the sample documents of the domain subdivision item and constructing the vectorized domain subdivision item keyword library based on the extracted segmented words includes:
A11, performing Chinese word segmentation on each sample document of the target domain subdivision item to obtain segmented words;
A12, calculating, for each segmented word, its term frequency-inverse document frequency (TF-IDF) value;
A13, sorting the segmented words by TF-IDF value;
A14, vectorizing the top N segmented words in the sorted order to construct the domain subdivision item keyword library for the target domain subdivision item, wherein N is a preset natural number.
In the embodiment of the invention, for each domain subdivision item contained in a domain, all sample documents in the batch that belong to the domain subdivision item are obtained and segmented into words. The term frequency-inverse document frequency (TF-IDF) value of each segmented word is calculated, the segmented words are sorted by TF-IDF value, and the first N segmented words are taken as the full keywords. The full keywords are then vectorized to obtain the domain subdivision item keyword library for the batch of sample documents, which completes the construction of the domain subdivision item keyword library.
In the embodiment of the present invention, as an optional embodiment, the domain sub-term keyword library is represented by a full-scale keyword vector, where the vector dimension is the number of full-scale keywords. As an alternative embodiment, vector dimensions take values of 512, 1024, 2048, etc., with 2048 being used by default.
In the embodiment of the invention, taking text information processing as the domain subdivision item, whose code is TP391.111, assume that the full keywords obtained by TF-IDF value for the batch of sample documents of this domain subdivision item are:
[natural language, processing, algorithm, ...].
Vectorizing the full keywords gives:
[1, 1, 1, ...].
The constructed domain subdivision item keyword library corresponding to text information processing is then:
{TP391.111, [1, 1, 1, ..., 1]}.
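Steps A11 to A14 can be sketched in a few lines. This is an illustration under stated assumptions, not the patented implementation: the documents are assumed to be already segmented into words (for Chinese text a segmenter would run first), a word's score is taken as its highest TF-IDF value across the batch, and a smoothed IDF is used. The source specifies none of these details, and the function name `build_keyword_library` is hypothetical.

```python
import math
from collections import Counter

def build_keyword_library(domain_code, tokenized_docs, top_n):
    """Rank words by TF-IDF over a batch of tokenized documents and keep the top N."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each word once per document

    scores = {}
    for tokens in tokenized_docs:
        for word, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log((1 + n_docs) / (1 + doc_freq[word])) + 1  # smoothed IDF (assumption)
            scores[word] = max(scores.get(word, 0.0), tf * idf)

    # Sort by descending score; break ties alphabetically for a stable order.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    keywords = ranked[:top_n]
    # Every position of the library corresponds to one kept keyword, hence all ones.
    return {"code": domain_code, "keywords": keywords, "vector": [1] * len(keywords)}

docs = [
    ["natural language", "processing", "algorithm", "natural language"],
    ["natural language", "processing", "practice"],
    ["web", "processing"],
]
keyword_library = build_keyword_library("TP391.111", docs, top_n=3)
```

Keeping the keyword list next to the all-ones vector makes the position of each keyword explicit, which the later per-document vectors rely on.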
Step 103, dividing the sample documents of the domain subdivision item according to preset document types, extracting, for each sample document, the document structured information of the sample document according to the document structured information extraction strategy corresponding to its document type, and generating the document structured information vector of the sample document according to the document structured information and the domain subdivision item keyword library;
In the embodiment of the present invention, as an optional embodiment, the document types include: text (txt) documents, doc/docx documents, web page (xml/html) documents, and pdf documents. Each document type corresponds to one document structured information extraction strategy, which extracts three categories of document structured information for that document type, and each category of document structured information corresponds to one document structured information vector, as shown in Table 2.
Table 2 document type and document structured information vector reference Table
[Table 2: document types and their document structured information vectors; provided as an image in the original publication]
In Table 2, the vector dimension of each document structured information vector is the same as the dimension N of the full keyword vector in the domain subdivision item keyword library. During extraction, vectors with fewer than N entries are zero-padded, and for vectors with more than N entries the excess entries are discarded. The whole-document keywords are determined by ranking the segmented words of the current document by TF-IDF value.
In the embodiment of the present invention, as an optional embodiment, generating the document structured information vector of the sample document according to the document structured information and the domain subdivision item keyword library includes:
A21, for each category in the document structured information extraction strategy, vectorizing the document structured information of that category;
A22, for each position in the domain subdivision item keyword library that has no corresponding vectorized document structured information, setting the vector value at that position to 0, to obtain the document structured information vector of the category;
A23, splicing the document structured information vectors of all categories to obtain the document structured information vector of the sample document.
In the embodiment of the invention, take text information processing (code TP391.111) as the domain subdivision item, with the full keyword dimension N of the constructed domain subdivision item keyword library set to 2048. Assume that, for a sample document, the extracted document structured information of some categories is:
- number of paragraphs: 20;
- text length: 400;
- the content of the first M lines and the last M lines of the document.
Take the whole-document keywords and the keywords in the first and last M lines of the document as examples of document structured information. For the whole-document keywords, the sample document is segmented into words, the TF-IDF value of each segmented word is calculated with the TF-IDF algorithm, and the words are sorted by that value. The whole-document keyword vector is then generated from the sorted keywords and the domain subdivision item keyword library. For the keywords in the first and last M lines, the 6 lines of content made up of the first 3 and last 3 lines are segmented into words, the TF-IDF value of each segmented word is calculated and sorted, and a keyword vector is generated from the sorted words and the domain subdivision item keyword library.
The resulting document structured information vectors for each category are assumed to be as follows:
N-dimensional vector, category 1: [20, 400, ..., 0];
N-dimensional vector, category 2: [1, 1, ..., 0];
N-dimensional vector, category 3: [1, 1, ..., 0].
The document structured information vector of the sample document is:
[(20, 400, ..., 0), (1, 1, ..., 0), (1, 1, ..., 0)].
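Steps A21 to A23 can be sketched as below, assuming a txt document. How keywords are mapped onto library positions is an interpretation of the source: here a position is 1 when the library keyword at that position also appears among the document's keywords, and 0 otherwise. All function names and the four-keyword library are hypothetical; a real library would have N entries (2048 by default).

```python
def pad_to(values, n):
    """Zero-pad a list to n entries, or discard the entries beyond n."""
    return (list(values) + [0] * n)[:n]

def keyword_indicator(doc_keywords, library_keywords):
    """One entry per library position: 1 if that keyword occurs in the document, else 0."""
    present = set(doc_keywords)
    return [1 if keyword in present else 0 for keyword in library_keywords]

def build_structure_vector(numeric_features, edge_keywords, whole_keywords, library_keywords):
    """Build the three category vectors and splice them into one flat vector."""
    n = len(library_keywords)
    category1 = pad_to(numeric_features, n)                          # e.g. paragraph count, length
    category2 = keyword_indicator(edge_keywords, library_keywords)   # first/last M lines
    category3 = keyword_indicator(whole_keywords, library_keywords)  # whole document
    return category1 + category2 + category3

library_keywords = ["natural language", "processing", "algorithm", "Chinese"]
structure_vector = build_structure_vector(
    numeric_features=[20, 400],
    edge_keywords=["natural language", "processing", "Chinese"],
    whole_keywords=["natural language", "processing"],
    library_keywords=library_keywords,
)
# structure_vector == [20, 400, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0]
```

Because every category vector has the library's length, the spliced vector always has 3 × N entries regardless of how many features a given document yields.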
In the embodiment of the present invention, as an alternative embodiment, the parameter M is set to 3. As another alternative embodiment, in order to reduce the amount of computation, M is set to no more than 1/2 of the total number of lines of the sample document.
In the embodiment of the invention, the beginning and end of a sample document tend to contain more keywords, and an excessively large value of M does not effectively add further keywords.
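For a txt document, extracting the first-category features and the first and last M lines might look like the sketch below. It assumes paragraphs are separated by blank lines and that length is measured in characters; the source defines neither, and the function name and result keys are hypothetical.

```python
def extract_txt_structure(text, m=3):
    """Extract paragraph count, length, and the first/last M non-empty lines of a txt document."""
    lines = [line for line in text.splitlines() if line.strip()]
    paragraphs = [p for p in text.split("\n\n") if p.strip()]  # blank-line separated (assumption)
    m = min(m, len(lines) // 2)  # cap M at half of the document's lines
    head = lines[:m]
    tail = lines[-m:] if m > 0 else []  # guard: lines[-0:] would return every line
    return {
        "paragraph_count": len(paragraphs),
        "length": len(text),
        "head_lines": head,
        "tail_lines": tail,
    }

sample_text = "a\nb\n\nc\nd\n\ne"
structure = extract_txt_structure(sample_text, m=3)
```

With five non-empty lines and M requested as 3, the cap reduces M to 2, so the head and tail never overlap.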
Step 104, reducing the dimension of the document structured information vector of the sample document to obtain the dimension-reduced document structured information vector of the sample document;
In the embodiment of the invention, for each sample document of each document type, after the document structured information vectors of the 3 categories are obtained, they are spliced into an N × 3-dimensional vector (the document structured information vector). A dimension reduction algorithm, for example principal component analysis (PCA), is then applied to reduce it to an N × 1-dimensional vector (the dimension-reduced document structured information vector of the sample document), so that each sample document can be converted into corresponding fields and a vector in Elasticsearch.
As an alternative embodiment, the document structured information dimension-reduction vector of the sample document is:
[-297.24811933,148.62405966,148.62405966,...,0]
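One way to read the N × 3 to N × 1 reduction is to treat the three category vectors as three columns of an N-row matrix and project each row onto the first principal component. The sketch below does this in pure Python with power iteration on the 3 × 3 covariance matrix. The arrangement of the data, the use of power iteration in place of a library PCA routine, and the function name are all assumptions, and the sketch does not attempt to reproduce the numbers shown above.

```python
import math

def pca_reduce_to_one(columns, iterations=200):
    """Project the rows of an N x k matrix (given as k columns) onto its first principal component."""
    k = len(columns)
    n = len(columns[0])
    means = [sum(col) / n for col in columns]
    centered = [[value - means[j] for value in columns[j]] for j in range(k)]

    # k x k sample covariance matrix of the columns.
    cov = [[sum(a * b for a, b in zip(centered[i], centered[j])) / max(n - 1, 1)
            for j in range(k)] for i in range(k)]

    # Power iteration for the dominant eigenvector. It would not converge if the
    # start vector were exactly orthogonal to that eigenvector.
    vec = [1.0 / math.sqrt(k)] * k
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break  # all columns are constant; nothing to project onto
        vec = [x / norm for x in nxt]

    # One projected value per row: the N x 1 reduced vector.
    return [sum(centered[j][r] * vec[j] for j in range(k)) for r in range(n)]

category_vectors = [
    [20, 400, 0, 0],  # category 1
    [1, 1, 0, 1],     # category 2
    [1, 1, 0, 0],     # category 3
]
reduced = pca_reduce_to_one(category_vectors)
```

The sign of a principal component is arbitrary, so two correct implementations can return vectors that differ by a factor of -1.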
Step 105, constructing the document structure information retrieval library according to the domain subdivision item keyword library, the preset domain subdivision item codes, and the dimension-reduced document structured information vector of the sample document.
In the embodiment of the invention, information is integrated and associated based on the domain subdivision item keyword library, the domain subdivision item codes, and the dimension-reduced document structured information vectors of the sample documents, so as to construct the document structure information retrieval library. For example, the document structure information retrieval library is constructed step by step from the domain subdivision item keyword library, the domain subdivision item codes, and the dimension-reduced document structured information vectors of the sample documents.
In the embodiment of the present invention, as an alternative embodiment, the document structure information retrieval library is stored using Elasticsearch, and an example of the storage format is as follows:
Create the mapping for the document index in Elasticsearch (text after # is a comment):
"mappings": {"properties": {
  "title": {                        # document title
    "type": "text",
    "analyzer": "ik_max_word",      # word segmentation method used for indexing
    "search_analyzer": "ik_smart"   # word segmentation method used for querying
  },
  "document_vector": {              # document vector
    "type": "dense_vector",
    "dims": 512                     # vector dimension, 512 as an example
  },
  "field_coding": {                 # domain code of the document
    "type": "keyword"
  },
  ...
}
}
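As an illustration, the same mapping can be written as a Python dictionary and serialized to comment-free JSON, which is the form a real request body would need. This is only a sketch: the helper name `build_index_body` is hypothetical, the three field names are taken from the stored-document example in this description, and the `ik_max_word` and `ik_smart` analyzers are assumed to be available through the IK analysis plugin.

```python
import json


def build_index_body(dims: int = 512) -> dict:
    """Return an index-creation body with the three fields used in the example."""
    return {
        "mappings": {
            "properties": {
                # Document title, segmented differently at index and query time.
                "title": {
                    "type": "text",
                    "analyzer": "ik_max_word",
                    "search_analyzer": "ik_smart",
                },
                # Dimension-reduced document vector; dims must match the vector length.
                "document_vector": {"type": "dense_vector", "dims": dims},
                # Domain subdivision item code, stored as an exact-match keyword.
                "field_coding": {"type": "keyword"},
            }
        }
    }


body = json.dumps(build_index_body(), indent=2)
```

The resulting string could then be sent as the body of an index-creation request; the transport (HTTP client or official client library) is left out because it is not specified here.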
In the embodiment of the invention, taking the domain subdivision item "text information processing" as an example, the batch sample document collection it contains is as follows:
1. natural language in practice.doc;
2. html;
3. natural language processing and algorithms.txt
where N is 2048 and M is 3;
For the batch sample document collection, TF-IDF values are calculated after Chinese word segmentation, and the domain subdivision item keyword library is constructed:
{TP391.111, [natural language, processing, algorithm, …]}
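The keyword library construction can be sketched in Python as follows. The sketch assumes the documents have already been segmented into words (Chinese word segmentation itself is outside its scope), uses a smoothed IDF of the form log((1 + n) / (1 + df)) + 1, and aggregates each word's TF-IDF over all documents of the domain subdivision item; these scoring details and the function name `build_keyword_library` are assumptions, since the text only says that words are ranked by TF-IDF value and the top N are kept.

```python
import math
from collections import Counter
from typing import Dict, List


def build_keyword_library(docs: List[List[str]], top_n: int) -> List[str]:
    """Rank segmented words by summed TF-IDF and keep the top_n words."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter(term for doc in docs for term in set(doc))
    scores: Dict[str, float] = {}
    for doc in docs:
        if not doc:
            continue
        for term, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
            scores[term] = scores.get(term, 0.0) + tf * idf
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:top_n]


segmented_docs = [
    ["natural language", "processing", "algorithm"],
    ["natural language", "processing", "natural language"],
    ["natural language", "algorithm", "the"],
]
library = {"TP391.111": build_keyword_library(segmented_docs, top_n=3)}
```

Keying the result by the domain subdivision item code mirrors the `{TP391.111, [...]}` pairing shown above.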
For the sample document "natural language processing and algorithms.txt", the extracted document structured information is:
number of paragraphs: 20;
text length: 400;
the keywords generated by the TF-IDF algorithm from the first 3 and last 3 lines of the document (6 lines in total) are as follows:
[natural language, processing, Chinese, …]
The full-text keywords generated by the TF-IDF algorithm are as follows:
[natural language, processing, …]
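The extraction of structured information from a txt document can be sketched as below. The sketch assumes that paragraphs are separated by blank lines and that text length is the character count; neither definition is given in the text, and the function name `extract_txt_structure` is hypothetical.

```python
import re
from typing import Dict, List, Union


def extract_txt_structure(text: str, m: int = 3) -> Dict[str, Union[int, List[str]]]:
    """Collect paragraph count, text length and the first/last m non-empty lines."""
    # Assumption: a paragraph is a block of text delimited by blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return {
        "paragraph_count": len(paragraphs),
        "text_length": len(text),
        # For documents shorter than 2 * m lines, head and tail overlap.
        "head_lines": lines[:m],
        "tail_lines": lines[-m:],
    }


sample_text = "a\nb\n\nc\nd\n\ne\nf\ng"
info = extract_txt_structure(sample_text, m=3)
```

The head and tail lines would then be segmented and ranked by TF-IDF to obtain the keyword list for the first and last M lines.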
The document structured information vector corresponding to each category is as follows:
N-dimensional vector, category 1: [20,400,...,0];
N-dimensional vector, category 2: [1,...,0];
N-dimensional vector, category 3: [1,1,...,0].
After PCA dimension reduction, the document structured information dimension reduction vector generated for the sample document (natural language processing and algorithms.txt) is as follows:
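One possible reading of how the three category vectors are built and spliced is sketched below. It assumes that category 1 holds the numeric values padded with zeros to length N, and that categories 2 and 3 mark, position by position against the domain subdivision item keyword library, whether each library keyword was found (1) or not (0). This interpretation and all function names are assumptions; the text only states that positions without vectorized information are set to 0.

```python
from typing import List, Sequence


def pad(values: Sequence[float], n: int) -> List[float]:
    """Zero-pad a short list of values to exactly n positions."""
    if len(values) > n:
        raise ValueError("more values than vector positions")
    return list(values) + [0.0] * (n - len(values))


def keyword_indicator(keywords: Sequence[str], library: Sequence[str], n: int) -> List[float]:
    """1.0 where the library keyword at that position was found, else 0.0."""
    present = set(keywords)
    return pad([1.0 if term in present else 0.0 for term in library[:n]], n)


def build_structure_matrix(paragraphs: int, length: int,
                           head_tail_keywords: Sequence[str],
                           full_keywords: Sequence[str],
                           library: Sequence[str], n: int) -> List[List[float]]:
    """Splice the three N-dimensional category vectors into an N x 3 matrix."""
    cat1 = pad([float(paragraphs), float(length)], n)
    cat2 = keyword_indicator(head_tail_keywords, library, n)
    cat3 = keyword_indicator(full_keywords, library, n)
    return [list(row) for row in zip(cat1, cat2, cat3)]


keyword_library = ["natural language", "processing", "algorithm"]
structure_matrix = build_structure_matrix(
    20, 400,
    ["natural language", "processing", "Chinese"],
    ["natural language", "processing"],
    keyword_library, n=5,
)
```

With N = 5 instead of 2048, the columns of `structure_matrix` begin [20, 400, 0, ...], [1, 1, 0, ...] and [1, 1, 0, ...].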
[-297.24811933,148.62405966,148.62405966,...,0]
The information stored in the document structure information retrieval library for the sample document "natural language processing and algorithms.txt" is as follows:
{
"title": "natural language processing and algorithms.txt",
"document_vector": [-297.24811933,148.62405966,148.62405966,...,0],
"field_coding": "TP391.111"
}
In an embodiment of the present invention, as an optional embodiment, the method further includes:
B11, determining the domain subdivision item to be searched to which the input short document to be searched belongs, and the corresponding domain subdivision item code to be searched;
In the embodiment of the invention, after the input short document to be searched is obtained, the domain subdivision item to be searched is determined through interactive operation, and the corresponding domain subdivision item code to be searched is obtained from the determined domain subdivision item.
In the embodiment of the invention, when determining the domain subdivision item to be searched, the selection can be completed interactively by means of a drop-down list and table lookup.
In the embodiment of the invention, the short document to be searched can be a specific document. For example, the document "natural language.txt" is input, and through interactive operation the domain subdivision item to be searched to which the document belongs is determined to be: text information processing, with the corresponding domain subdivision item code to be searched: TP391.111.
B12, obtaining a keyword library of the domain to be searched corresponding to the domain subdivision item code to be searched;
In the embodiment of the invention, the obtained keyword library of the domain to be searched is as follows: [natural language, algorithm, processing, …].
B13, extracting the structural information of the to-be-searched document in the short to-be-searched document, and generating a to-be-searched document structural information vector according to the structural information of the to-be-searched document and a keyword library in the to-be-searched field;
In the embodiment of the invention, the short document to be searched is extracted with the structural information of the document to be searched, and the extracted structural information of the document to be searched is as follows:
number of paragraphs: 10;
text length: 100;
the first 3 lines and last 3 lines of the document, 6 lines of content in total.
In the embodiment of the invention, as an optional embodiment, for the full-text keyword category, the short document to be searched is segmented, the resulting words are ranked using the TF-IDF algorithm, and the top N full-text keywords are extracted. The full-text keyword list is as follows:
[natural language, algorithm, processing, …].
For the 6 lines made up of the first 3 and last 3 lines of the document, the keyword list generated by the TF-IDF algorithm is as follows:
[natural language, processing, Chinese, …]
The generated document structured information vectors to be searched are as follows:
N-dimensional vector of category 1: [10,100,...,0];
N-dimensional vector of category 2: [1,...,0];
N-dimensional vector of category 3: [1,0,...,0].
After PCA dimension reduction, the following document vector is generated:
[-297.2,148.6,148.6,...,0].
B14, performing dimension reduction processing on the document structured information vector to be searched to obtain the document structured information dimension reduction vector to be searched;
B15, searching in the document structure information retrieval library according to the document structured information dimension reduction vector to be searched, and obtaining a search result.
In the embodiment of the invention, Elasticsearch is used to perform vectorized search on the heterogeneous document structure information retrieval library, and the search is carried out according to the similarity between the document structured information dimension reduction vector to be searched and the document structured information dimension reduction vector of each document.
In the embodiment of the invention, Elasticsearch provides a cosine similarity function (cosineSimilarity) in its native script language, with which all documents in the document structure information retrieval library can be ranked by similarity to the document structured information dimension reduction vector to be searched, thereby realizing retrieval on the library. As an alternative embodiment, the program code segment for retrieval is as follows:
Document vector query_vector = [1,0,...,0]; Elasticsearch query sample:
{
  "script_score": {
    "query": {"match_all": {}},
    "script": {
      "source": "cosineSimilarity(params.query_vector, 'document_vector') + 1.0",
      "params": {"query_vector": query_vector}   # query vector
    }
  }
}
In the embodiment of the invention, for the example above, the program code segment for vectorized search using Elasticsearch is as follows:
{
"script_score":{
"query":{"match_all":{}},"script":{
"source":"cosineSimilarity(params.query_vector,'document_vector')+1.0","params":{"query_vector":[-297.2,148.6,148.6,...,0]}
}
}
}
Search results:
{
"title": "natural language processing and algorithms.txt",
"document_vector": [-297.24811933,148.62405966,148.62405966,...,0],
"field_coding": "TP391.111",
"score": 99
}
{
"title": "...",
"document_vector": [...],
"field_coding": "TP391.111",
"score": 97
}
In the embodiment of the invention, as an optional embodiment, the search result contains the title, the domain subdivision item and the like of each similar document.
In the embodiment of the invention, the returned search result is the X records with the highest similarity scores in Elasticsearch, each containing the title and the name of the document.
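The ranking performed by the query above can be imitated locally with a short Python sketch: score every stored vector by cosine similarity plus 1.0, as the script source does, and keep the X best hits. The function names and the two-document toy library are hypothetical, and returning 0.0 for a zero-length vector is a choice made for this sketch only.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # sketch-only convention for zero vectors
    return dot / (norm_a * norm_b)


def rank_documents(query_vector: Sequence[float], documents: List[Dict],
                   top_x: int) -> List[Tuple[str, float]]:
    """Return the top_x (title, score) pairs, score = cosine similarity + 1.0."""
    scored = [
        (doc["title"], cosine_similarity(query_vector, doc["document_vector"]) + 1.0)
        for doc in documents
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_x]


stored = [
    {"title": "other.txt", "document_vector": [1.0, 0.0, 0.0]},
    {"title": "natural language processing and algorithms.txt",
     "document_vector": [-297.24811933, 148.62405966, 148.62405966]},
]
hits = rank_documents([-297.2, 148.6, 148.6], stored, top_x=2)
```

Adding 1.0 shifts the cosine range from [-1, 1] to [0, 2], so the scores are never negative.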
In an embodiment of the present invention, as an optional embodiment, the method further includes:
if the similarity of hit documents in the search result exceeds a preset similarity threshold, inquiring whether the short documents to be searched are stored in a storage area corresponding to the subdivision item of the field to be searched, and if not, updating the information stored in the storage area according to the short documents to be searched.
In the embodiment of the invention, if the search hits and the similarity of a hit document exceeds the preset similarity threshold, the short document to be searched can be added to the documents of the domain subdivision item, supplementing the domain subdivision item keyword library and the document structure information retrieval library. For example, in the above example, if the score of the first item in the search result exceeds a preset similarity threshold, for example 98, this indicates that "natural language.txt" and "natural language processing and algorithms.txt" are similar and both belong to the domain subdivision item "text information processing", so the domain subdivision item keyword library and the document structure information retrieval library may be updated. Taking the keyword vector in the document structured information vector as an example, the first 3 and last 3 lines of the short document to be searched are segmented, the resulting words and the keywords already in the keyword vector are ranked together by TF-IDF value, and the keyword vector is updated according to the ranking result. In this way the content of the domain subdivision item is enriched and evolves over time.
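The threshold-based update can be sketched as follows. The sketch assumes the keyword library is kept as a word-to-TF-IDF-score mapping, that a word present in both the library and the new document keeps the larger of its two scores, and that only the top N words survive the re-ranking; the function name `update_domain_store` and these merge rules are assumptions, not details given in the text.

```python
from typing import Dict, Set, Tuple


def update_domain_store(score: float, threshold: float, title: str,
                        stored_titles: Set[str],
                        keyword_scores: Dict[str, float],
                        new_term_scores: Dict[str, float],
                        top_n: int) -> Tuple[Set[str], Dict[str, float], bool]:
    """Add the queried document and re-rank keywords when the hit is similar enough."""
    # No update if the hit is not similar enough or the document is already stored.
    if score <= threshold or title in stored_titles:
        return stored_titles, keyword_scores, False
    merged = dict(keyword_scores)
    for term, value in new_term_scores.items():
        merged[term] = max(value, merged.get(term, 0.0))
    # Re-rank old and new words together and keep the top_n.
    ranked = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
    return stored_titles | {title}, dict(ranked), True


titles, keywords, updated = update_domain_store(
    score=99, threshold=98, title="natural language.txt",
    stored_titles={"natural language processing and algorithms.txt"},
    keyword_scores={"natural language": 1.3, "processing": 0.9, "algorithm": 0.8},
    new_term_scores={"Chinese": 1.0, "processing": 0.5},
    top_n=3,
)
```

In this toy run the new word "Chinese" displaces "algorithm" because its score ranks in the top 3 after merging.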
In the embodiment of the invention, document structured information and semantic keyword information are fused and converted into vectors. Because the constructed document structure information retrieval library is based on document structured information, retrieval models for multiple document types can be built quickly for various industrial enterprises, and document retrieval can be performed under different domain subdivision items. An enterprise informatization system can thus retrieve heterogeneous documents efficiently through the search function, with high retrieval precision. Further, introducing the interactive technical scheme improves document retrieval accuracy, enriches the vocabulary of the domain subdivision item keyword library, and improves the document retrieval capability for each domain subdivision item. Moreover, based on the interactively built domain subdivision item keyword library and the heterogeneous document structured information, retrieval accuracy can be improved for documents of different domain subdivision items.
Fig. 2 shows a schematic diagram of an apparatus for constructing a document structure information retrieval library according to an embodiment of the present invention. As shown in fig. 2, the apparatus includes:
a domain determining module 201, configured to determine the domain subdivision items of the collected sample documents;
In the embodiment of the invention, as an optional embodiment, the domain subdivision item of a sample document can be determined based on user interactive operation. Domains can be divided according to the Chinese Library Classification, and a corresponding code is set for each domain or domain subdivision item; for the coding format, refer to the table of domain subdivision items and coding samples.
The word stock construction module 202 is configured to extract word segmentation words from the sample documents of each determined domain subdivision item, and construct a vectorized domain subdivision item keyword library based on the extracted word segmentation words;
In the embodiment of the invention, for each domain subdivision item contained in a domain, all batch sample documents of that domain subdivision item are obtained, word segmentation is performed on them, the TF-IDF value of each word segmentation word is calculated, the words are ranked by TF-IDF value, the top N words are taken as full-text keywords, and the full-text keywords are vectorized to obtain the domain subdivision item keyword library.
The structure vector generation module 203 is configured to divide the sample documents of a domain subdivision item according to preset document types, extract, for each sample document, the document structured information of the sample document according to the document structured information extraction strategy corresponding to its document type, and generate the document structured information vector of the sample document according to the document structured information and the domain subdivision item keyword library;
the dimension reduction module 204 is configured to reduce the dimension of the document structured information vector of the sample document to obtain the document structured information dimension reduction vector of the sample document;
The search library construction module 205 is configured to construct the document structure information retrieval library according to the domain subdivision item keyword library, the preset domain subdivision item codes, and the document structured information dimension reduction vector of the sample document.
In the embodiment of the invention, information integration and association are carried out based on a domain subdivision item keyword library, domain subdivision item codes and document structured information dimension reduction vectors of sample documents, so as to construct a document structured information retrieval library.
In an embodiment of the present invention, as an optional embodiment, the apparatus further includes:
the searching module (not shown in the figure) is used for determining the domain subdivision item to be searched to which the input short document to be searched belongs, and the corresponding domain subdivision item code to be searched;
acquiring a keyword library of the domain to be searched corresponding to the domain subdivision item code to be searched;
extracting the structural information of the to-be-searched document in the short to-be-searched document, and generating a to-be-searched document structural information vector according to the structural information of the to-be-searched document and a keyword library in the to-be-searched field;
performing dimension reduction processing on the document structured information vector to be searched to obtain a document structured information dimension reduction vector to be searched;
and searching in a document structure information search library according to the document structure information dimension-reducing vector to be searched to obtain a search result.
In an embodiment of the present invention, as an optional embodiment, the apparatus further includes:
the updating module (not shown in the figure) is used for, if the similarity of a hit document in the search result exceeds a preset similarity threshold, inquiring whether the short document to be searched is stored in the storage area corresponding to the domain subdivision item to be searched, and, if it is not stored, updating the information stored in the storage area according to the short document to be searched.
In the embodiment of the present invention, as an optional embodiment, the word stock construction module 202 is specifically configured to:
performing Chinese word segmentation on each sample document of the target domain subdivision item to obtain word segmentation words;
for each word segmentation word, calculating its term frequency-inverse document frequency (TF-IDF) value;
ranking the word segmentation words by TF-IDF value;
vectorizing the top N word segmentation words in the ranking to construct the domain subdivision item keyword library for the target domain subdivision item, wherein N is a preset natural number.
In this embodiment of the present invention, as an optional embodiment, the structure vector generation module 203 is specifically configured to:
aiming at each category in the document structured information extraction strategy, vectorizing the document structured information of the category;
In the domain subdivision item keyword library, if the corresponding position has no vectorized document structuring information, setting the vector of the position to 0 to obtain a document structuring information vector of the category;
and splicing the document structured information vectors of the categories to obtain the document structured information vector of the sample document.
In the embodiment of the present invention, as an optional embodiment, the document types include: txt documents, doc/docx documents, xml/html documents, and pdf documents.
In an embodiment of the present invention, as an optional embodiment, the categories include: a first category, a second category, and a third category, wherein the third category includes the full-text keywords;
when the document type is a txt document, the first category includes: the number of paragraphs and the document length, and the second category includes the keywords in the first M lines and last M lines of the document;
when the document type is a doc/docx document, the first category includes: title, number of levels and length, and the second category includes the position of charts relative to the start of the article;
when the document type is an xml/html document, the first category includes: title label, number of levels and length, and the second category includes the position of key content labels relative to the start of the document;
when the document type is a pdf document, the first category includes: title, number of levels and length, and the second category includes the position of charts relative to the start of the article.
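The per-type extraction strategies listed above lend themselves to a simple lookup table. The following Python sketch maps a file extension to a document type and then to the contents of the three categories; selecting the type by file extension, the dictionary layout and the name `strategy_for` are assumptions made for illustration.

```python
import os
from typing import Dict, List, Tuple

FULL_TEXT = ["full-text keywords"]  # the third category is the same for every type

STRATEGIES: Dict[str, Dict[str, List[str]]] = {
    "txt": {
        "first": ["number of paragraphs", "document length"],
        "second": ["keywords in the first and last M lines"],
        "third": FULL_TEXT,
    },
    "doc/docx": {
        "first": ["title", "number of levels", "length"],
        "second": ["chart position relative to article start"],
        "third": FULL_TEXT,
    },
    "xml/html": {
        "first": ["title label", "number of levels", "length"],
        "second": ["key content label position relative to document start"],
        "third": FULL_TEXT,
    },
    "pdf": {
        "first": ["title", "number of levels", "length"],
        "second": ["chart position relative to article start"],
        "third": FULL_TEXT,
    },
}

EXTENSION_TO_TYPE = {".txt": "txt", ".doc": "doc/docx", ".docx": "doc/docx",
                     ".xml": "xml/html", ".html": "xml/html", ".pdf": "pdf"}


def strategy_for(filename: str) -> Tuple[str, Dict[str, List[str]]]:
    """Pick the document type and extraction strategy from the file extension."""
    extension = os.path.splitext(filename)[1].lower()
    doc_type = EXTENSION_TO_TYPE.get(extension)
    if doc_type is None:
        raise ValueError(f"unsupported document type: {filename!r}")
    return doc_type, STRATEGIES[doc_type]


doc_type, strategy = strategy_for("natural language processing and algorithms.txt")
```

A table like this keeps the four strategies in one place, so supporting a further document type would only require a new entry.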
As shown in fig. 3, an embodiment of the present application provides a computer device 300 for executing the method for constructing a document structure information retrieval library in fig. 1, where the device includes a memory 301, a processor 302, and a computer program stored in the memory 301 and executable on the processor 302, where the steps of the method for constructing a document structure information retrieval library are implemented when the processor 302 executes the computer program.
Specifically, the above-mentioned memory 301 and processor 302 can be general-purpose memories and processors, and are not particularly limited herein, and the above-mentioned method of constructing a document structure information retrieval library can be performed when the processor 302 runs a computer program stored in the memory 301.
Corresponding to the method of constructing a document structure information retrieval library in fig. 1, the embodiments of the present application also provide a computer readable storage medium having a computer program stored thereon, which when executed by a processor performs the steps of the above method of constructing a document structure information retrieval library.
In particular, the storage medium can be a general-purpose storage medium, such as a removable disk or a hard disk, and when the computer program on the storage medium is executed, the above-described method of constructing a document structure information retrieval library can be performed.
In the embodiments provided herein, it should be understood that the disclosed systems and methods may be implemented in other ways. The system embodiments described above are merely illustrative, e.g., the division of the elements is merely a logical functional division, and there may be additional divisions in actual implementation, and e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some communication interface, system or unit indirect coupling or communication connection, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments provided in the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
It should be noted that: like reference numerals and letters in the following figures denote like items, and thus once an item is defined in one figure, no further definition or explanation of it is required in the following figures, and furthermore, the terms "first," "second," "third," etc. are used merely to distinguish one description from another and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that the foregoing examples are merely specific embodiments of the present application and are not intended to limit its scope, and the protection scope of the present application is not limited thereto. Although the present application has been described in detail with reference to the foregoing examples, those skilled in the art will appreciate that, within the technical scope disclosed in the present application, any person skilled in the art may modify or readily conceive of changes to the technical solutions described in the foregoing embodiments, or make equivalent substitutions for some of the technical features; such modifications, changes or substitutions do not depart from the spirit and scope of the corresponding technical solutions and are intended to be encompassed within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (7)

1. A method of constructing a document structure information retrieval library, comprising:
determining the domain subdivision items of the collected sample documents;
extracting word segmentation words from the sample documents of each determined domain subdivision item, and constructing a vectorized domain subdivision item keyword library based on the extracted word segmentation words;
dividing the sample documents of a domain subdivision item according to preset document types, the document types comprising: a text document, a doc/docx document, a webpage document and a pdf document; for each sample document, extracting the document structured information of the sample document according to the document structured information extraction strategy corresponding to the document type of the sample document, and generating the document structured information vector of the sample document according to the document structured information and the domain subdivision item keyword library;
performing dimension reduction on the document structured information vector of the sample document to obtain a document structured information dimension reduction vector of the sample document;
constructing a document structure information retrieval library according to a domain sub-term keyword library, a preset domain sub-term code and a document structure information dimension reduction vector of a sample document;
the generating the document structured information vector of the sample document according to the document structured information and the domain subdivision item keyword library comprises the following steps:
aiming at each category in the document structured information extraction strategy, vectorizing the document structured information of the category;
in the domain subdivision item keyword library, if the corresponding position has no vectorized document structuring information, setting the vector of the position to 0 to obtain a document structuring information vector of the category;
Splicing the document structured information vectors of all the categories to obtain the document structured information vector of the sample document;
the categories include: a first category, a second category, and a third category, wherein the third category includes the full-text keywords;
when the document type is a txt document, the first category includes: the number of paragraphs and the document length, and the second category includes the keywords in the first M lines and last M lines of the document;
when the document type is a doc/docx document, the first category includes: title, number of levels and length, and the second category includes the position of charts relative to the start of the article;
when the document type is an xml/html document, the first category includes: title label, number of levels and length, and the second category includes the position of key content labels relative to the start of the document;
when the document type is a pdf document, the first category includes: title, number of levels and length, and the second category includes the position of charts relative to the start of the article.
2. The method according to claim 1, wherein the method further comprises:
determining the domain subdivision item to be searched to which the input short document to be searched belongs, and the corresponding domain subdivision item code to be searched;
Acquiring a keyword library of the domain to be searched corresponding to the domain subdivision item code to be searched;
extracting the structural information of the to-be-searched document in the short to-be-searched document, and generating a to-be-searched document structural information vector according to the structural information of the to-be-searched document and a keyword library in the to-be-searched field;
performing dimension reduction processing on the document structured information vector to be searched to obtain a document structured information dimension reduction vector to be searched;
and searching in a document structure information search library according to the document structure information dimension-reducing vector to be searched to obtain a search result.
3. The method according to claim 2, wherein the method further comprises:
if the similarity of hit documents in the search result exceeds a preset similarity threshold, inquiring whether the short documents to be searched are stored in a storage area corresponding to the subdivision item of the field to be searched, and if not, updating the information stored in the storage area according to the short documents to be searched.
4. A method according to any one of claims 1 to 3, wherein extracting word segmentation words from the sample documents of the domain subdivision item and constructing a vectorized domain subdivision item keyword library based on the extracted word segmentation words comprises:
performing Chinese word segmentation on each sample document of the target domain subdivision item to obtain word segmentation words;
for each word segmentation word, calculating its term frequency-inverse document frequency (TF-IDF) value;
ranking the word segmentation words by TF-IDF value;
vectorizing the top N word segmentation words in the ranking to construct the domain subdivision item keyword library for the target domain subdivision item, wherein N is a preset natural number.
5. An apparatus for constructing a document structure information retrieval library, comprising:
the domain determining module is used for determining the domain subdivision items of the collected sample documents;
the word stock construction module is used for extracting word segmentation words from the sample documents of each determined domain subdivision item and constructing a vectorized domain subdivision item keyword library based on the extracted word segmentation words;
the structure vector generation module is used for dividing the sample documents of a domain subdivision item according to preset document types, the document types comprising: a text document, a doc/docx document, a webpage document and a pdf document; for each sample document, extracting the document structured information of the sample document according to the document structured information extraction strategy corresponding to the document type of the sample document, and generating the document structured information vector of the sample document according to the document structured information and the domain subdivision item keyword library;
The dimension reduction module is used for reducing the dimension of the document structured information vector of the sample document to obtain the document structured information dimension reduction vector of the sample document;
the search library construction module is used for constructing a document structure information search library according to the domain subdivision item keyword library, the preset domain subdivision item codes and the document structure information dimension reduction vector of the sample document;
the generating the document structured information vector of the sample document according to the document structured information and the domain subdivision item keyword library comprises the following steps:
aiming at each category in the document structured information extraction strategy, vectorizing the document structured information of the category;
in the domain subdivision item keyword library, if the corresponding position has no vectorized document structuring information, setting the vector of the position to 0 to obtain a document structuring information vector of the category;
splicing the document structured information vectors of all the categories to obtain the document structured information vector of the sample document;
the categories include: a first category, a second category, and a third category, wherein the third category includes the full-text keywords;
when the document type is a txt document, the first category includes: the number of paragraphs and the document length, and the second category includes the keywords in the first M lines and last M lines of the document;
when the document type is a doc/docx document, the first category includes: title, number of levels and length, and the second category includes the position of charts relative to the start of the article;
when the document type is an xml/html document, the first category includes: title label, number of levels and length, and the second category includes the position of key content labels relative to the start of the document;
when the document type is a pdf document, the first category includes: title, number of levels and length, and the second category includes the position of charts relative to the start of the article.
6. A computer device, comprising: a processor, a memory and a bus, said memory storing machine-readable instructions executable by said processor, said processor and said memory communicating via the bus when the computer device is running, said machine-readable instructions when executed by said processor performing the steps of the method of constructing a document structure information retrieval library according to any one of claims 1 to 4.
7. A computer-readable storage medium, characterized in that it has stored thereon a computer program which, when executed by a processor, performs the steps of the method of constructing a document structure information retrieval library according to any one of claims 1 to 4.
CN202110708173.9A 2021-06-25 2021-06-25 Method and device for constructing document structure information retrieval library Active CN113449063B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110708173.9A CN113449063B (en) 2021-06-25 2021-06-25 Method and device for constructing document structure information retrieval library

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110708173.9A CN113449063B (en) 2021-06-25 2021-06-25 Method and device for constructing document structure information retrieval library

Publications (2)

Publication Number Publication Date
CN113449063A CN113449063A (en) 2021-09-28
CN113449063B CN113449063B (en) 2023-06-16

Family

ID=77812699

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110708173.9A Active CN113449063B (en) 2021-06-25 2021-06-25 Method and device for constructing document structure information retrieval library

Country Status (1)

Country Link
CN (1) CN113449063B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114936269A (en) * 2022-06-07 2022-08-23 来也科技(北京)有限公司 Document searching platform, searching method, device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012119339A1 (en) * 2011-03-04 2012-09-13 中兴通讯股份有限公司 Retrieval method and apparatus
WO2019153551A1 (en) * 2018-02-12 2019-08-15 平安科技(深圳)有限公司 Article classification method and apparatus, computer device and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW548557B (en) * 2000-09-13 2003-08-21 Intumit Inc A method and system for electronic document to have fast-search category and mutual link
CN102890711B (en) * 2012-09-13 2015-08-12 中国人民解放军国防科学技术大学 A kind of retrieval ordering method and system
CN111460090A (en) * 2020-03-04 2020-07-28 深圳壹账通智能科技有限公司 Vector-based document retrieval method and device, computer equipment and storage medium
CN112883165B (en) * 2021-03-16 2022-12-02 山东亿云信息技术有限公司 Intelligent full-text retrieval method and system based on semantic understanding

Also Published As

Publication number Publication date
CN113449063A (en) 2021-09-28

Similar Documents

Publication Publication Date Title
CN111104794B (en) Text similarity matching method based on subject term
CN106649818B (en) Application search intention identification method and device, application search method and server
Pereira et al. Using web information for author name disambiguation
JP5424001B2 (en) LEARNING DATA GENERATION DEVICE, REQUESTED EXTRACTION EXTRACTION SYSTEM, LEARNING DATA GENERATION METHOD, AND PROGRAM
Schubotz et al. Semantification of identifiers in mathematics for better math information retrieval
TWI536181B (en) Language identification in multilingual text
KR101508260B1 (en) Summary generation apparatus and method reflecting document feature
US20100205198A1 (en) Search query disambiguation
US20150100308A1 (en) Automated Formation of Specialized Dictionaries
CN111090731A (en) Electric power public opinion abstract extraction optimization method and system based on topic clustering
CN107844493B (en) File association method and system
CN108875065B (en) Indonesia news webpage recommendation method based on content
CN106844482B (en) Search engine-based retrieval information matching method and device
JP2005301856A (en) Method and program for document retrieval, and document retrieving device executing the same
CN111133429A (en) Extracting expressions for natural language processing
CN113449063B (en) Method and device for constructing document structure information retrieval library
Tahmasebi et al. On the applicability of word sense discrimination on 201 years of modern english
CN112612867B (en) News manuscript propagation analysis method, computer readable storage medium and electronic device
CN114255067A (en) Data pricing method and device, electronic equipment and storage medium
JP5869948B2 (en) Passage dividing method, apparatus, and program
CN112926297A (en) Method, apparatus, device and storage medium for processing information
CN117149956A (en) Text retrieval method and device, electronic equipment and readable storage medium
Balaji et al. Finding related research papers using semantic and co-citation proximity analysis
JP6181890B2 (en) Literature analysis apparatus, literature analysis method and program
Jo Automatic text summarization using string vector based K nearest neighbor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant