CN112183035A

CN112183035A - Text labeling method, device and equipment and readable storage medium

Info

Publication number: CN112183035A
Application number: CN202011233453.0A
Authority: CN
Inventors: 左永忠; 刘余海
Original assignee: Shanghai Hengsheng Juyuan Data Service Co ltd
Current assignee: Shanghai Hengsheng Juyuan Data Service Co ltd
Priority date: 2020-11-06
Filing date: 2020-11-06
Publication date: 2021-01-05
Anticipated expiration: 2040-11-06
Also published as: CN112183035B

Abstract

Embodiments of the application provide a text labeling method, apparatus, device and readable storage medium. The title of a table is determined among the title items of a text page to be labeled, and the target title items are searched, in reverse order of their ranking, for title items meeting a preset condition. Among the title items meeting the preset condition, the earlier-ranked title item is taken as the upper title and the later-ranked title item as the lower title; the preset condition includes that no text exists between the title items. The upper title and the lower title in the text page are then identified according to the distinguishing features of the upper title and the lower title. The content indicated by each identified title is segmented to obtain a word segmentation result for each title, a target word segmentation unit is queried from a preset correspondence, and the label item corresponding to the target word segmentation unit is taken as the labeling result of the title. Because the rank of each title is determined, the titles in the text can be labeled automatically and the accuracy of the labeling result can be ensured.

Description

Text labeling method, device and equipment and readable storage medium

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a text labeling method, apparatus, device, and readable storage medium.

Background

Financial remark data for bonds is important financial data disclosed by a debt-issuing enterprise. Through the remark data, a financial analyst can clearly understand the current situation of the enterprise and guide capital investment accordingly.

Currently, bond financial remark data consists of remarks, each of which includes at least a labeling result obtained by manually labeling the disclosure document. The labeling result is a preset label item corresponding to a title in the disclosure document. For example, the remark data of a disclosure document includes the remark "accounts payable - page 7", where "accounts payable" is the preset label item corresponding to the heading "amount of accounts payable", which was identified manually from the details in the disclosure document.

In practice, the remark data of each disclosure document contains a large number of label items, so labeling disclosure documents manually is inefficient. In addition, manual labeling depends entirely on the subjective experience of the labeling personnel, so the accuracy of the labeling result is low and the resulting remark data can hardly meet the requirements of the data market.

Disclosure of Invention

In the course of research, the applicant found that, for texts containing titles at multiple levels, the rank of a title affects the accuracy of the labeling result. Identifying the rank of each title in such texts therefore helps improve the accuracy of the labeling result.

The application provides a text labeling method, apparatus, device and readable storage medium, which aim to label texts automatically and improve the accuracy of the labeling result, as follows:

a text labeling method comprises the following steps:

determining the title of the table in the title item of the text page to be marked; the text page to be marked comprises the table and the title item;

searching for the title items meeting a preset condition in a reverse order of the ordering from target title items, wherein the target title items comprise titles of the table and the title items ordered before the titles of the table, and the preset condition comprises: no text exists between the title items; the sequence is the typesetting sequence of the text pages;

according to the sorting, the front title item is used as a superior title, and the back title item is used as a subordinate title in the title items meeting the preset conditions;

identifying the upper title and the lower title in the text page according to the distinguishing characteristics of the upper title and the lower title;

performing word segmentation on the identified content indicated by each title to obtain word segmentation results of each title, wherein the title comprises the superior title and the inferior title;

inquiring a target word segmentation unit from a preset corresponding relation, wherein the target word segmentation unit comprises word segmentation units which have the same grading with the title and are similar to the word segmentation result, the corresponding relation comprises the corresponding relation between the word segmentation units and the label item, and the word segmentation units are the word segmentation results of the sample title;

and taking the labeling item corresponding to the target word segmentation unit as a labeling result of the title.

Optionally, the text page is any page in the text, and the method further includes:

if the upper title and the lower title are not identified in other text pages in the text, taking the title items in the text pages as the upper title under the condition that the title items meeting preset conditions do not exist in the target title items;

if the upper title and the lower title are recognized in other text pages in the text, in the case that the title item satisfying the preset condition does not exist in the target title item, recognizing the upper title and the lower title from the title items of the text pages according to the distinguishing characteristics of the upper title and the lower title recognized in other text pages.

Optionally, the method further includes:

determining an affiliation between the superior title and the inferior title according to the ranking;

and storing the dependency relationship in a preset data structure.

Optionally, the corresponding relationship includes a corresponding relationship between a higher-level word segmentation unit and a labeled item, and a corresponding relationship between a lower-level word segmentation unit and a labeled item; the superior word segmentation unit and the inferior word segmentation unit have a subordinate relationship;

querying, from the preset correspondence, a word segmentation unit that has the same rank as the title and is similar to the word segmentation result as the target word segmentation unit comprises:

searching a superior participle unit similar to the participle result of the superior title from the superior participle unit as a target superior participle unit;

and inquiring a lower-level word segmentation unit similar to the word segmentation result of a lower-level title belonging to the upper-level title from the lower-level word segmentation units belonging to the target upper-level word segmentation unit as a target lower-level word segmentation unit.

Optionally, the method further includes:

under the condition that the superior title does not have a subordinate title, searching word segmentation units similar to word segmentation results of the superior title in the subordinate word segmentation units;

the target word segmentation unit further comprises: the word segmentation units, among the lower-level word segmentation units, that are similar to the word segmentation result of the upper title.

Optionally, the obtaining process of the corresponding relationship includes:

identifying the upper title and the lower title from a sample text as sample titles;

performing word segmentation on the indicated content of the sample title to obtain a sample word segmentation result;

and determining the labeling item corresponding to the word segmentation result according to the similarity between the sample word segmentation result and the preset labeling item.

Optionally, the method further includes:

storing the segmentation results of the indicated contents of the sample titles to form a dictionary for segmentation.

Optionally, the content indicated by the title includes: the title, and non-title text between the title and an adjacent title ordered after the title;

the method further comprises the following steps:

inquiring a target text in the text page, wherein the target text is the non-title text without a title;

and taking the last-ranked title in the previous text page of the text page as the title of the target text.

A text annotation device comprising:

the table title acquisition module is used for determining the title of the table in the title item of the text page to be marked; the text page to be marked comprises the table and the title item;

a title item selecting module, configured to search, from target title items, title items that meet a preset condition in a reverse order of sorting, where the target title items include titles of the table and the title items sorted before the titles of the table, and the preset condition includes: no text exists between the title items; the sequence is the typesetting sequence of the text pages;

a first hierarchical determining module, configured to determine, according to the sorting, a preceding title item of the title items that meet a preset condition as a superior title, and a succeeding title item of the title items that meet the preset condition as a subordinate title;

a second hierarchical determining module, configured to identify the upper-level title and the lower-level title in the text page according to a distinguishing feature of the upper-level title and the lower-level title;

a word segmentation result acquisition module, configured to perform word segmentation on the identified content indicated by each title to obtain a word segmentation result of each title, where each title includes the higher-level title and the lower-level title;

the target word segmentation unit acquisition module is used for inquiring a target word segmentation unit from a preset corresponding relation, the target word segmentation unit comprises word segmentation units which are the same as the titles in grading and similar to the word segmentation results, the corresponding relation comprises the corresponding relation between the word segmentation units and the labeled items, and the word segmentation units are the word segmentation results of the sample titles;

and the labeling result acquisition unit is used for taking the labeling item corresponding to the target word segmentation unit as the labeling result of the title.

A text annotation apparatus comprising: a memory and a processor;

the memory is used for storing programs;

the processor is configured to execute the program to implement each step of the text labeling method.

A readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the text annotation method described above.

According to the above technical solution, the text labeling method, apparatus, device and readable storage medium provided by the embodiments of the application determine the title of the table among the title items of the text page to be labeled, and search the target title items, in reverse order of their ranking, for title items meeting the preset condition. The target title items include the title of the table and the title items ranked before it, and the preset condition includes that no text exists between the title items. Among the title items meeting the preset condition, the earlier-ranked title item is taken as the upper title and the later-ranked title item as the lower title. Because titles of the same rank within one text page share the same distinguishing features, identifying the upper title and the lower title in the text page according to those distinguishing features yields accurate upper and lower titles. The method then segments the content indicated by each identified title to obtain the word segmentation results of the upper title and the lower title, and queries the target word segmentation unit from the preset correspondence. The target word segmentation unit includes word segmentation units that have the same rank as the title and are similar to its word segmentation result; the correspondence maps word segmentation units to label items, and the word segmentation units are the word segmentation results of sample titles. The label item corresponding to the target word segmentation unit is taken as the labeling result of the title.
Because the method identifies the rank (upper title or lower title) of every title, the queried target word segmentation unit is highly accurate, and so is the labeling result of the title.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained by those skilled in the art without creative efforts.

Fig. 1 is a schematic flowchart of a specific implementation of a text annotation method according to an embodiment of the present application;

Fig. 2 is a schematic diagram of an upper title and a lower title provided in an embodiment of the present application;

Fig. 3 shows a method for generating sample annotation data according to an embodiment of the present application;

Fig. 4a is a schematic flowchart of another text annotation method according to an embodiment of the present application;

Fig. 4b is a schematic flowchart of another text annotation method according to an embodiment of the present application;

Fig. 5 is a diagram illustrating a B-tree structure according to an embodiment of the present application;

Fig. 6 is a schematic structural diagram of a text annotation device according to an embodiment of the present application;

Fig. 7 is a schematic structural diagram of a text annotation device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The text labeling method provided in the embodiments of the present application may be applied to, but is not limited to, automatically labeling a text. In practice, a disclosure document is a PDF (Portable Document Format) file containing multiple text pages. One optional application scenario is labeling any text page of a disclosure document in the field of financial bonds.

Fig. 1 is a schematic flow chart of a text labeling method provided in an embodiment of the present application, and is specifically applied to a process of labeling any one text page, as shown in fig. 1, the method specifically includes the following steps S101 to S108.

S101, determining the title of the table in the title item of the text page to be marked.

In this embodiment, the text page to be labeled is any page in the disclosure document to be labeled. The text page to be labeled comprises a table and a title item.

An alternative method of determining the table title includes A1-A2.

A1, in the text page to be marked, positioning the table.

It should be noted that, since the text pages of the disclosure document are in PDF format, this embodiment converts them into HTML format and locates the table by looking up the tag that identifies a table. An optional method for converting a text page from PDF format to HTML format includes A11 to A12, as follows:

A11, using OCR (Optical Character Recognition) parsing, recognizing the characters and numbers on each text page of the disclosure document and generating parsing information for each page. The parsing information includes a JSON-format character string describing the content of each page; the string includes the page number of the text page and the coordinates of the numbers or characters on the page.

It should be noted that the coordinates indicate the positions of the numbers or characters on the text page, and the specific implementation of A11 can be found in the prior art.

A12, converting the text page in PDF format into HTML format according to the analysis information of each text page.

The conversion is implemented as follows: the table is identified by a TABLE tag, each text line is identified by a P tag, and a title item is identified by a paragraph mark on its text line. Whether a text line is a title item is determined from the parsing information. Optionally, a text line is taken as a title item when a number in a preset format appears at a preset position of the line and the line is a paragraph.
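As an illustrative, non-limiting sketch, the optional title-item test can be expressed as a pattern match on the start of a text line. The two numbering formats below, the function name `is_title_item` and the `is_paragraph` flag are assumptions made for illustration; the embodiment does not enumerate the preset formats or positions.

```python
import re

# Hypothetical "preset formats" for a number at the start of the line (the
# assumed "preset position"). The source does not list the actual formats.
TITLE_MARK_PATTERNS = [
    re.compile(r"^\s*\(\d+\)\s*\S"),     # e.g. "(1) Prepaid accounts listed by age"
    re.compile(r"^\s*\d+[\.、]?\s+\S"),   # e.g. "7 Prepaid accounts" or "7. Prepaid accounts"
]


def is_title_item(line: str, is_paragraph: bool) -> bool:
    """Return True when the line is a paragraph and opens with a numbering mark."""
    if not is_paragraph:
        return False
    return any(pattern.match(line) for pattern in TITLE_MARK_PATTERNS)
```

A plain sentence that merely contains a number, such as "The balance was 7 million", is not matched, because the number does not open the line.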

Note that, in a text page in HTML (HyperText Markup Language) format, table content and non-table content are identified with different tags, as are text lines of title items and text lines of non-title items. The position of each text line or table is recorded as a start coordinate and an end coordinate, which are determined from the coordinates of the characters or numbers it contains.
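The derivation of a start coordinate and an end coordinate from character coordinates might look as follows. This is a minimal sketch: the JSON field names (`page`, `chars`, `x0`, `y0`, `x1`, `y1`) are hypothetical, since the embodiment only states that the parsing information carries a page number and character coordinates.

```python
import json
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def line_bounds(chars: List[Dict]) -> Tuple[Point, Point]:
    """Return (start, end): the smallest and largest corner over all characters."""
    start = (min(c["x0"] for c in chars), min(c["y0"] for c in chars))
    end = (max(c["x1"] for c in chars), max(c["y1"] for c in chars))
    return start, end


# Hypothetical parsing information for one text line on page 7.
parsing_info = json.loads("""
{"page": 7,
 "chars": [
   {"text": "7", "x0": 72, "y0": 100, "x1": 78, "y1": 112},
   {"text": "P", "x0": 84, "y0": 100, "x1": 90, "y1": 112},
   {"text": "r", "x0": 90, "y0": 101, "x1": 96, "y1": 112}
 ]}
""")

start, end = line_bounds(parsing_info["chars"])
```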

It should be further noted that, the specific implementation manner of converting each text page from the PDF format to the HTML format can be referred to in the prior art.

A2, determining the title of the table from the title item.

In this embodiment, a title item is a text line identified by a P tag that carries a paragraph mark. One specific implementation of determining the title of the table is as follows: following the typesetting order of the text page, look up the title items located before the TABLE tag; among the P tags of those title items, take the one adjacent to the TABLE tag as the target P tag; and take the text line identified by the target P tag as the title of the table.
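One way to realize this lookup, offered only as a sketch, is a single pass over the HTML with the Python standard-library `html.parser` module. It assumes that the paragraph mark of a title item is rendered as `class="title"` on the P tag; that attribute, the class name `TableTitleFinder` and the helper `find_table_titles` are hypothetical and not taken from the embodiment.

```python
from html.parser import HTMLParser
from typing import List, Optional


class TableTitleFinder(HTMLParser):
    """For each TABLE tag, record the nearest preceding title-item P tag."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._buffer: List[str] = []
        self._last_title: Optional[str] = None
        self.table_titles: List[Optional[str]] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lowercase.
        if tag == "p" and dict(attrs).get("class") == "title":
            self._in_title = True
            self._buffer = []
        elif tag == "table":
            self.table_titles.append(self._last_title)

    def handle_data(self, data):
        if self._in_title:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_title:
            self._last_title = "".join(self._buffer).strip()
            self._in_title = False


def find_table_titles(html: str) -> List[Optional[str]]:
    finder = TableTitleFinder()
    finder.feed(html)
    return finder.table_titles


page_html = (
    '<p class="title">7 Prepaid accounts</p>'
    "<p>Some body text.</p>"
    '<p class="title">(1) Prepaid accounts listed by age</p>'
    "<table><tr><td>1</td></tr></table>"
)
titles = find_table_titles(page_html)
```

A table with no preceding title item yields `None` in the result list, which a caller would need to handle separately.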

It should be noted that, the method for determining the table title from the title item may further include other implementation manners, which may specifically refer to the prior art, and this embodiment is not described in detail herein.

S102, querying a target text in the text page, and taking the last-ranked title in the previous text page as the title of the target text.

In this embodiment, the target text is non-title text that has no title, that is, the text content preceding the first-ranked title item in the text page. When target text exists in the text page, it is content that runs over from a title item on the previous page, and its title is not in the current text page. The last-ranked title in the previous text page is therefore taken as the title of the target text.

In this embodiment, the ranking of the title items follows the typesetting order of the text page; optionally, the ranking is determined from the coordinates of the title items. Optionally, the title of the target text is treated as the first-ranked title of the text page.

It should be noted that determining the title of the target text from the title of the previous text page avoids inaccurate title identification caused by non-title text spanning a page break.
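The handling of text that runs over a page break can be sketched as follows. The `Line` record and both function names are hypothetical; the sketch assumes each page has already been reduced to a list of lines in typesetting order with a flag marking title items.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Line:
    text: str
    is_title: bool


def target_text(page: List[Line]) -> List[Line]:
    """Non-title lines that precede the first title item of the page."""
    result = []
    for line in page:
        if line.is_title:
            break
        result.append(line)
    return result


def inherited_title(previous_page: List[Line]) -> Optional[Line]:
    """Last-ranked title item of the previous page, or None if it has none."""
    for line in reversed(previous_page):
        if line.is_title:
            return line
    return None


previous_page = [Line("7 Prepaid accounts", True), Line("Balances are ...", False)]
current_page = [Line("... continued from the previous page.", False),
                Line("8 Other receivables", True)]

carried = target_text(current_page)
title = inherited_title(previous_page)
```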

S103, searching title items meeting preset conditions from the target title items according to the reverse order of the sorting.

In the present embodiment, the target title items include the title of the table and the title items ordered before the title of the table.

Taking as an example target title items that consist of, in order, the text titles w_i (i is an integer from 1 to n, and n is the number of text titles in the target title items) followed by the table title T, the method for searching for title items meeting the preset condition in reverse order of the ranking is as follows:

For T, judge whether text exists between T and the title item adjacent to T in the target title items (i.e., w_n). If no text exists, T and w_n are determined to be the title items meeting the preset condition.

If text exists, then starting from w_n, judge in turn whether text exists between each pair of adjacent title items (i.e., w_{i-1} and w_i). If no text exists between them, w_{i-1} and w_i are determined to be the title items meeting the preset condition.

If the reverse-order search reaches w_2 and text exists between w_2 and w_1, no title item meeting the preset condition exists in the target title items. If the text page also includes other target title items, title items meeting the preset condition are searched for in those other target title items.

It should be noted that, when the text page to be labeled includes a plurality of tables, the text page includes a plurality of groups of target title items. In this embodiment, the search for title items meeting the preset condition, in reverse order of the ranking, starts from the first group of target title items.

S104, among the title items meeting the preset condition, taking the earlier-ranked title item as the upper title and the later-ranked title item as the lower title.

It should be noted that, when the target title items include a plurality of groups, since the distinguishing features of all upper titles in the same text page are the same, this embodiment may optionally identify the upper title and the lower title from only one group of target title items.
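Steps S103 and S104 can be sketched together as a backward scan over the items of a page. The representation of a page as `(kind, text)` pairs and the function name `find_adjacent_titles` are assumptions for illustration; the scan itself follows the preset condition that no text exists between the two title items.

```python
from typing import List, Optional, Tuple

Item = Tuple[str, str]  # (kind, text), where kind is "title" or "text"


def find_adjacent_titles(items: List[Item],
                         table_title_index: int) -> Optional[Tuple[str, str]]:
    """Scan backwards from the table title for two titles with no text between.

    Returns (upper_title, lower_title), or None when no such pair exists.
    """
    later = table_title_index      # index of the later title of the current pair
    text_between = False
    for i in range(table_title_index - 1, -1, -1):
        kind, text = items[i]
        if kind == "text":
            text_between = True
        elif kind == "title":
            if not text_between:
                # Earlier-ranked item is the upper title, later-ranked the lower.
                return text, items[later][1]
            later = i
            text_between = False
    return None


page_items = [
    ("title", "7 Prepaid accounts"),
    ("title", "(1) Prepaid accounts listed by age"),
    ("text", "Balances by age are as follows."),
    ("title", "(2) Top five prepaid balances"),   # assumed to be the table title
]
pair = find_adjacent_titles(page_items, table_title_index=3)
```

In this hypothetical page the table title is separated from the preceding title by text, so the scan moves one pair back and returns the first two title items.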

S105, identifying the upper title and the lower title in the text page according to the distinguishing features of the upper title and the lower title.

In this embodiment, the distinguishing feature is a feature recognized in advance from the text content of the title item. For example, the paragraph mark that identifies a title item is used as the distinguishing feature: title items with different paragraph marks have different ranks. Among the title items of the text page, a title item whose paragraph mark is the same as that of the upper title is identified as an upper title, and a title item whose paragraph mark is the same as that of the lower title is identified as a lower title. For example, the upper title is a first-level title and the lower title is a second-level title. Fig. 2 illustrates an upper title and a lower title: as shown in Fig. 2, the first-level title is "7 prepaid account" and the second-level title is "(1) prepaid account is listed by account". Here "7" and "(1)" are the distinguishing features of the first-level title and the second-level title, respectively; they can be recognized by OCR and are identified by the paragraph mark described in S101.

It should be noted that the title of the target text has the same rank as the last title item in the previous text page. For example, if the last title item in the previous text page is an upper title, the title of the target text is an upper title.
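A possible reading of "the same paragraph mark" is that two marks share the same format, for example a bare number versus a number in parentheses. The sketch below reduces each mark to such a shape and classifies the remaining title items against the pair found in S104. The shape encoding and the names `mark_shape` and `classify_titles` are assumptions, not part of the embodiment.

```python
import re
from typing import List, Optional, Tuple


def mark_shape(title: str) -> Optional[str]:
    """Reduce the leading numbering mark to a shape: "7" -> "N", "(1)" -> "(N)"."""
    match = re.match(r"\s*(\(?\d+\)?[\.、]?)", title)
    if match is None:
        return None
    return re.sub(r"\d+", "N", match.group(1))


def classify_titles(titles: List[str], upper_example: str,
                    lower_example: str) -> List[Tuple[str, Optional[str]]]:
    """Label each title "upper" or "lower" by comparing mark shapes; None if unknown."""
    ranks = {mark_shape(upper_example): "upper", mark_shape(lower_example): "lower"}
    return [(t, ranks.get(mark_shape(t))) for t in titles]


ranked = classify_titles(
    ["8 Other receivables", "(2) Top five prepaid balances", "Notes"],
    upper_example="7 Prepaid accounts",
    lower_example="(1) Prepaid accounts listed by age",
)
```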

S106, performing word segmentation on the content indicated by each identified title to obtain the word segmentation result of each title.

In the present embodiment, the titles include the upper title and the lower title. The content indicated by a title specifically includes the title itself and the non-title text between the title and the adjacent title ranked after it. This content may be extracted according to the coordinates of the title; for details, refer to the prior art.

Specifically, based on a preset word segmentation dictionary, the content indicated by each title is segmented using an artificial intelligence word segmentation technique to obtain a plurality of words that conform to the dictionary. The words are then cleaned to obtain the word segmentation result, which is a minimum word set. In general, the words in the minimum word set are verbs or nouns.

It should be noted that the word segmentation result represents the content indicated by the title with the most refined word set. For the specific implementation of word segmentation and the method of obtaining the dictionary, refer to the flow shown in Fig. 3.
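The embodiment does not name the segmentation technique. As a stand-in, the following sketch uses forward maximum matching, a simple dictionary-driven segmenter, and removes repeated words so that the result is a small word set. The dictionary entries and the sample string are illustrative assumptions, not data from the embodiment.

```python
from typing import List, Set


def segment(text: str, dictionary: Set[str]) -> List[str]:
    """Forward maximum matching: at each position take the longest dictionary word.

    Characters that start no dictionary word are skipped, and repeated words
    are kept only once, in order of first appearance.
    """
    max_len = max((len(word) for word in dictionary), default=0)
    words: List[str] = []
    pos = 0
    while pos < len(text):
        for size in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            if candidate in dictionary:
                if candidate not in words:
                    words.append(candidate)
                pos += size
                break
        else:
            pos += 1  # no dictionary word starts here
    return words


# Hypothetical dictionary and title content.
dictionary = {"预付", "款项", "预付款项", "账龄", "列示"}
result = segment("预付款项按账龄列示", dictionary)
```

Because the longest match is preferred, "预付款项" is taken as one word rather than as "预付" followed by "款项".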

S107, querying the target word segmentation unit from the preset correspondence.

In this embodiment, the target word segmentation unit includes a word segmentation unit that has the same classification as the title and is similar to the word segmentation result, that is, the target word segmentation unit at least includes a word segmentation unit that is similar to the word segmentation result of the upper-level title and a word segmentation unit that is similar to the word segmentation result of the lower-level title. The corresponding relation comprises a corresponding relation between a word segmentation unit and a labeling item, and the word segmentation unit is a word segmentation result of the sample title.

Optionally, a method for querying the target participle unit includes B1-B2, as follows:

B1, searching the upper-level word segmentation units for an upper-level word segmentation unit similar to the word segmentation result of the upper title, as the target upper-level word segmentation unit.

In this embodiment, an upper-level word segmentation unit is obtained in advance by segmenting the content indicated by an upper title among the sample titles, and a lower-level word segmentation unit is obtained in advance by segmenting the content indicated by a lower title among the sample titles. Whether the word segmentation result of the upper title is similar to an upper-level word segmentation unit is judged as follows: calculate the Hamming distance between the words in the word segmentation result of the upper title and the words in the upper-level word segmentation unit, and determine the similarity from the Hamming distance, where a smaller Hamming distance means a higher similarity. When the similarity is greater than a preset first threshold, the word segmentation result of the upper title is determined to be similar to the upper-level word segmentation unit.

B2, searching the lower-level word segmentation units for a lower-level word segmentation unit similar to the word segmentation result of the lower title, as the target lower-level word segmentation unit.

In this embodiment, for the method of judging whether the word segmentation result of the lower title is similar to a lower-level word segmentation unit, refer to B1.

It should be noted that, the method for obtaining the corresponding relationship may refer to the flow shown in fig. 3, which is not described in detail in this embodiment.

It should be noted that B1 to B2 are only one optional implementation of querying the target word segmentation unit disclosed in this embodiment; other implementations are also possible and are not described here.
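The embodiment does not specify how words are encoded before the Hamming distance is computed. The sketch below assumes each word set is encoded as a bit vector over a shared vocabulary, and it treats a smaller distance as a higher similarity, which is the usual reading of a distance measure. The vocabulary, the normalisation to the range 0 to 1, the threshold and all function names are illustrative assumptions.

```python
from typing import Iterable, List, Sequence


def to_bits(words: Iterable[str], vocabulary: Sequence[str]) -> List[int]:
    """Encode a word set as a 0/1 vector with one position per vocabulary term."""
    present = set(words)
    return [1 if term in present else 0 for term in vocabulary]


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two equal-length vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def similarity(words_a: Iterable[str], words_b: Iterable[str],
               vocabulary: Sequence[str]) -> float:
    """1.0 for identical encodings, decreasing as the Hamming distance grows."""
    if not vocabulary:
        return 0.0
    distance = hamming_distance(to_bits(words_a, vocabulary),
                                to_bits(words_b, vocabulary))
    return 1.0 - distance / len(vocabulary)


def is_similar(words_a, words_b, vocabulary, threshold: float) -> bool:
    return similarity(words_a, words_b, vocabulary) > threshold


vocabulary = ["prepaid", "account", "age", "payable"]
score = similarity(["prepaid", "account"], ["prepaid", "account", "age"], vocabulary)
```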

S108, taking the label item corresponding to the target word segmentation unit as the labeling result of the title.

Specifically, the labeling item corresponding to the target higher-level word segmentation unit is used as the labeling result of the higher-level title, and the labeling item corresponding to the target lower-level word segmentation unit is used as the labeling result of the lower-level title.

In this embodiment, the corresponding relationship between the upper-level word segmentation unit and the tagging item and the corresponding relationship between the lower-level word segmentation unit and the tagging item are pre-stored, and the method for obtaining the corresponding relationship between the word segmentation unit and the tagging item refers to the flow shown in fig. 3.
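The rank-aware lookup of S107 and S108 might be sketched as a two-stage query over a nested mapping: first the upper-level unit, then only the lower-level units that belong to it. For brevity this sketch substitutes Jaccard overlap for the Hamming-based similarity described in B1; the nested dictionary layout, the label names and the threshold value are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, Optional, Tuple

Unit = FrozenSet[str]


def jaccard(a: Unit, b: Unit) -> float:
    """Stand-in similarity: shared words divided by all words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match(query: Unit, candidates: Iterable[Unit],
               threshold: float) -> Optional[Unit]:
    """Most similar candidate whose similarity exceeds the threshold, else None."""
    best, best_score = None, threshold
    for unit in candidates:
        score = jaccard(query, unit)
        if score > best_score:
            best, best_score = unit, score
    return best


def label_titles(upper_words, lower_words, correspondence: Dict,
                 threshold: float = 0.5) -> Tuple[Optional[str], Optional[str]]:
    """Return (label of the upper title, label of the lower title)."""
    upper_unit = best_match(frozenset(upper_words), correspondence, threshold)
    if upper_unit is None:
        return None, None
    entry = correspondence[upper_unit]
    # Only lower-level units belonging to the matched upper-level unit are searched.
    lower_unit = best_match(frozenset(lower_words), entry["lower"], threshold)
    lower_label = entry["lower"][lower_unit] if lower_unit is not None else None
    return entry["label"], lower_label


# Hypothetical correspondence built from sample titles.
correspondence = {
    frozenset({"prepaid", "account"}): {
        "label": "Prepayments",
        "lower": {frozenset({"prepaid", "account", "age"}): "Prepayments by age"},
    },
}
labels = label_titles(["prepaid", "account"],
                      ["prepaid", "account", "listed", "age"], correspondence)
```

Restricting the second stage to the children of the matched upper-level unit is what makes the rank of a title matter for the labeling result.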

It should be noted that the text labeling method shown in Fig. 1 is not limited to labeling disclosure documents in the field of financial bonds and may also be applied to other texts. Fig. 1 is only one optional implementation of the text labeling method provided in the embodiments of the present application for labeling any one text page. For example, S102 is an optional step. As another example, when no table exists in the text page, any title item may be used in place of the title of the table, and S103 to S108 are then executed.

According to the technical scheme, the text labeling method determines the title of the table in the title items of the text page to be labeled, and searches the title items meeting the preset condition according to the reverse sequence of the sequence from the target title items, wherein the target title items comprise the title of the table and the title items sequenced before the title of the table, and the preset condition comprises the following steps: the method further takes the title item which meets the preset condition and is ranked in the front as the superior title and the title item which is ranked in the back as the inferior title, because the distinguishing characteristics of the titles which are ranked in the same text page and are the same, the superior title and the inferior title in the text page are identified according to the distinguishing characteristics of the superior title and the inferior title, and the accurate superior title and the accurate inferior title can be obtained. The method further carries out word segmentation on the content indicated by each identified title to obtain word segmentation results of the upper-level title and the lower-level title, and inquires the target word segmentation unit from the preset corresponding relation. The target word segmentation unit comprises word segmentation units which are the same as the titles in grading and similar to word segmentation results, the corresponding relation comprises the corresponding relation between the word segmentation units and the labeled items, the word segmentation units are the word segmentation results of the sample titles, and the labeled items corresponding to the target word segmentation units are used as the labeled results of the titles. 
Because the method identifies the grades (the superior titles or the subordinate titles) of all the titles, the accuracy of the inquired target word segmentation unit is high, and further, the accuracy of the labeling result of the titles is high, so that not only can the titles in the text be automatically labeled, but also the accuracy of the labeling result can be ensured.
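The reverse-order search summarized above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the embodiment: the page is assumed to be already parsed into a list of `(kind, content)` items in reading order, the index of the table title is assumed to be known, and the function name `find_title_pair` is hypothetical.

```python
from typing import List, Optional, Tuple

# Hypothetical encoding: each page item is (kind, content), kind is "title" or "text".
Item = Tuple[str, str]


def find_title_pair(items: List[Item], table_title_idx: int) -> Optional[Tuple[int, int]]:
    """Walk backwards from the table title and return (upper_idx, lower_idx)
    for the first two title items that have no text between them."""
    lower = table_title_idx
    for i in range(table_title_idx - 1, -1, -1):
        if items[i][0] != "title":
            continue  # skip non-title text while scanning upwards
        if i + 1 == lower:
            # Preset condition met: the two title items are directly adjacent.
            return i, lower
        lower = i  # text intervened, so keep searching further up the page
    return None  # no pair found: every title would be treated as a primary title


page = [
    ("text", "Notes to the financial statements"),
    ("title", "7 Prepayments"),
    ("title", "(1) Prepayments listed by account age"),
]
pair = find_title_pair(page, table_title_idx=2)
```

When a pair is returned, the item at the first index plays the role of the superior title and the item at the second index the role of the inferior title.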

Fig. 3 shows a method for generating sample annotation data according to an embodiment of the present application. The sample annotation data includes a correspondence relationship and a dictionary, where the correspondence relationship is the correspondence between word segmentation units and labeling items, and the dictionary is used for segmenting the content indicated by a title. Specifically, the method may include S301 to S304 as follows:

S301, identifying an upper title and a lower title from the sample text as sample titles.

It should be noted that there are multiple ways to obtain the sample titles. For example, the upper titles and lower titles in a large number of sample texts may be identified manually, or the sample titles may be identified with reference to S101 to S106. Identifying the sample titles in the sample text automatically through S101 to S106 improves the efficiency and accuracy of obtaining the sample titles.

S302, performing word segmentation on the content indicated by the sample title to obtain a sample word segmentation result.

Specifically, there are multiple ways to obtain the word segmentation result. One optional way is to obtain the sample word segmentation result according to a part-of-speech rule, including B1 to B4:

B1, performing word segmentation on the sample title to obtain at least one participle and the part of speech of each participle.

B2, acquiring a minimum part-of-speech set according to the participles and the parts of speech of the participles.

It should be noted that the minimum part-of-speech set includes the part of speech of at least one participle and is able to express the semantics of the sample title. For the method of obtaining the minimum part-of-speech set, reference may be made to the prior art.

B3, extracting the participles whose parts of speech belong to the minimum part-of-speech set as target participles.

B4, combining the target participles in sequence to obtain the word segmentation result.

It should be noted that the word segmentation result of any sample title can indicate the semantics of the sample title while containing the smallest number of words.
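Steps B3 and B4 can be illustrated with a short sketch. It assumes the title has already been segmented and tagged (B1) and that the minimum part-of-speech set (B2) is supplied as input, since the text defers both to existing techniques. The tag names `"n"`, `"v"`, `"p"` and the function name are illustrative assumptions only.

```python
from typing import List, Set, Tuple

Token = Tuple[str, str]  # (participle, part-of-speech tag); tag names are assumed


def segmentation_result(tagged: List[Token], minimal_pos: Set[str]) -> List[str]:
    """B3: keep participles whose part of speech is in the minimal set.
    B4: combine them in their original order."""
    return [word for word, pos in tagged if pos in minimal_pos]


# Hypothetical output of B1 for one sample title.
tagged_title = [
    ("prepayments", "n"),
    ("listed", "v"),
    ("by", "p"),
    ("account", "n"),
    ("age", "n"),
]
result = segmentation_result(tagged_title, minimal_pos={"n", "v"})
```

Here the preposition is dropped and the remaining participles keep their order, which matches the idea of a shortest result that still carries the meaning of the title.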

S303, determining the labeling items corresponding to the word segmentation results according to the similarity between the sample word segmentation results and the preset labeling items.

In this embodiment, the annotation item includes a standard title configured in advance, and the method for obtaining the annotation item corresponding to the sample word segmentation result includes:

The labeling item is segmented to obtain a labeling segmentation result, the Hamming distance between the labeling segmentation result and the sample segmentation result is calculated, and the similarity is calculated from the Hamming distance. When the similarity is greater than a second threshold, the sample segmentation result is determined to correspond to the labeling item.

And acquiring a labeling item corresponding to each sample word segmentation result, thereby generating a corresponding relation between the word segmentation result and the labeling item.

It should be noted that the numerical value of the second threshold is adjusted according to the manual audit result of the corresponding relationship between the segmentation result and the tagging item, so as to obtain the corresponding relationship with a higher accuracy, thereby improving the accuracy of the tagging item corresponding to the target segmentation unit, further ensuring the accuracy of the tagging result, and avoiding the need of manual audit.
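The Hamming-distance comparison in S303 might look like the sketch below. The text does not state how a segmentation result is encoded before the distance is taken, so this sketch assumes each result is turned into a presence vector over the combined vocabulary of the two results, and that similarity is one minus the normalized distance. Choosing the best match above the threshold is also an assumption of the sketch, and all names are hypothetical.

```python
from typing import Dict, List, Optional


def hamming_similarity(a: List[str], b: List[str]) -> float:
    """Similarity in [0, 1] from the Hamming distance of two presence vectors."""
    vocab = sorted(set(a) | set(b))
    if not vocab:
        return 1.0
    vec_a = [w in a for w in vocab]
    vec_b = [w in b for w in vocab]
    distance = sum(x != y for x, y in zip(vec_a, vec_b))
    return 1.0 - distance / len(vocab)


def match_label(sample: List[str], labels: Dict[str, List[str]], threshold: float) -> Optional[str]:
    """Return the labeling item whose segmentation is most similar to the sample,
    provided the similarity exceeds the (second) threshold."""
    best, best_sim = None, threshold
    for name, seg in labels.items():
        sim = hamming_similarity(sample, seg)
        if sim > best_sim:
            best, best_sim = name, sim
    return best


labels = {
    "Prepayments by aging": ["prepayments", "aging"],
    "Inventory": ["inventory"],
}
matched = match_label(["prepayments", "aging"], labels, threshold=0.5)
```

Raising or lowering `threshold` after a manual review of the matches corresponds to the adjustment of the second threshold described above.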

S304, storing the word segmentation result of the content indicated by the sample title, and forming a dictionary for word segmentation.

It should be noted that the word segmentation results of the content indicated by the sample titles are obtained from a large number of sample texts and, compared with an existing dictionary, are more relevant to the text to be labeled. The word segmentation accuracy is therefore high, which in turn makes the labeling result more accurate.

The text labeling method provided by the embodiment of the application can also be applied to labeling a text that comprises a plurality of text pages. For example, the text includes two text pages, which are respectively denoted as a first text page and a second text page, where the first text page is the page previous to the second text page. The embodiment of the present application provides a specific implementation manner of another optional text labeling method, which is specifically used for labeling the first text page and the second text page of the text.

In this embodiment, text pages are labeled according to the order of page numbers, first a first text page is labeled, and then a second text page is labeled, and this embodiment introduces the implementation process of the text labeling method with reference to fig. 4a and 4b, respectively.

The process shown in fig. 4a is used to automatically label the first text page and improve the accuracy of the labeling result. The process comprises S401 to S410, as follows:

S401, in the title items of the first text page, determining the title of the table.

It should be noted that the first text page is taken as the text page to be labeled and includes a table and title items. For a specific method for determining the title of the table, see S101 above, which is not described again in this embodiment.

S402, searching title items meeting preset conditions from the target title items according to the reverse sequence of the sequence.

In the present embodiment, the target title items include the title of the table and the title items ordered before the title of the table. For a specific method for searching for a title item meeting the preset condition in the reverse order of the ordering from the target title items, reference may be made to the above S103, which is not described herein again in this embodiment.

If there is a title item satisfying the preset condition, S403 to S404 are executed, and if there is no title item satisfying the preset condition, S405 is executed.

S403, in the case where title items meeting the preset condition exist, taking, among those title items, the title item ordered first as the upper title and the title item ordered after it as the lower title.

S404, identifying the upper title and the lower title in the first text page according to the distinguishing characteristics of the upper title and the lower title.

It should be noted that, for specific implementation methods of S403 to S404, reference may be made to the above S104 to S105, which is not described herein again in this embodiment.

S405, under the condition that no title item meeting the preset condition exists, all titles in the first text page are taken as primary titles.

S406, determining the subordination relation between the upper title and the lower title according to the ordering.

In the present embodiment, any lower title is subordinate to the upper title that is ordered before the lower title and is closest to it. For example, in the case where the upper title is a first-level title and the lower title is a second-level title, as shown in fig. 2, the second-level title "(1) prepaid items listed by account age" is subordinate to the first-level title "7 prepaid items".

The dependency relationship between the upper title and the lower title includes: each upper title and its corresponding page number, the lower titles included in the upper title, and the corresponding page number of each lower title. The page number corresponding to a title (an upper title or a lower title) is the page number where the content indicated by the title is located. The content indicated by the title includes the title and the non-title text between the title and the adjacent title that has the same level as the title and is ordered after it. The corresponding page number is obtained according to the coordinates of the text line or the table, for which reference may be made to the prior art.
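The nearest-preceding-upper-title rule of S406 can be sketched as below. The sketch assumes the titles of a page are already classified and listed in page order as `(level, text)` pairs. The function name and the choice to skip a lower title that appears before any upper title are assumptions, not part of the described method.

```python
from typing import Dict, List, Optional, Tuple


def assign_parents(titles: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each upper title to the lower titles that follow it, in page order.

    `titles` holds (level, text) pairs where level is "upper" or "lower".
    """
    tree: Dict[str, List[str]] = {}
    current_upper: Optional[str] = None
    for level, text in titles:
        if level == "upper":
            current_upper = text
            tree[text] = []
        elif current_upper is not None:
            # A lower title belongs to the closest upper title ordered before it.
            tree[current_upper].append(text)
        # Assumption: a lower title with no preceding upper title is skipped here.
    return tree


relations = assign_parents([
    ("upper", "7 Prepayments"),
    ("lower", "(1) Prepayments listed by account age"),
    ("upper", "8 Inventory"),
])
```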

S407, storing the dependency relationship in a preset data structure.

In this embodiment, the preset data structure is a B-tree structure, the dependency relationship is stored in the B-tree structure according to the identification result, taking the document as the bond financial note announcement as an example, an upper-level title in the bond financial note announcement is a first-level title, and a lower-level title is a second-level title, and fig. 5 illustrates a schematic diagram of the B-tree structure obtained according to the bond financial note announcement. Specific methods for generating B-tree structures can be found in the prior art.

It should be noted that, in this embodiment, the preset data structure is not limited to a B-tree, but also includes multiple selectable data structures, and according to the structural characteristics and the traversal rules of the B-tree, the embodiment uses the B-tree to store the dependency relationship, thereby improving the speed of data processing.
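As a rough illustration of storing the dependency relationship in a tree and traversing it, the sketch below uses a plain ordered tree rather than a true B-tree, because the text leaves the construction of the B-tree to existing techniques. The class name, field names, and page numbers are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class TitleNode:
    """One title in the hierarchy, with the page numbers of its content."""
    title: str
    pages: List[int] = field(default_factory=list)
    children: List["TitleNode"] = field(default_factory=list)


def preorder(node: TitleNode) -> Iterator[TitleNode]:
    """Visit a title before its subordinate titles, preserving document order."""
    yield node
    for child in node.children:
        yield from preorder(child)


# Hypothetical content; page numbers are made up for the example.
root = TitleNode("document")
upper = TitleNode("7 Prepayments", pages=[12])
upper.children.append(TitleNode("(1) Prepayments listed by account age", pages=[12]))
root.children.append(upper)

order = [node.title for node in preorder(root)]
```

A pre-order walk visits each upper title immediately before its lower titles, which is the order needed when the contents are later segmented one title at a time.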

S408, performing word segmentation on the content indicated by each identified title to obtain the word segmentation result of each title.

In this embodiment, according to the traversal rule of the B-tree, the content indicated by the titles is sequentially extracted, where the titles include at least a higher-level title, and when the higher-level title and a lower-level title are identified, the titles include the higher-level title and the lower-level title. The content indicated by the title specifically includes: a title, and non-title text between the title and an adjacent title ordered after the title.
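Collecting the content indicated by each title (the title plus the non-title text up to the next title) can be sketched as follows. The item encoding and the function name are assumptions, and joining with a space is only one possible way to combine the pieces.

```python
from typing import List, Tuple

Item = Tuple[str, str]  # (kind, content); kind is "title" or "text" (assumed encoding)


def contents_by_title(items: List[Item]) -> List[str]:
    """Group each title with the non-title text that follows it, up to the next title."""
    groups: List[List[str]] = []
    for kind, content in items:
        if kind == "title":
            groups.append([content])      # start a new group at every title
        elif groups:
            groups[-1].append(content)    # attach text to the most recent title
        # Text before the first title is ignored in this sketch.
    return [" ".join(parts) for parts in groups]


contents = contents_by_title([
    ("title", "7 Prepayments"),
    ("title", "(1) Prepayments listed by account age"),
    ("text", "Amounts are stated in RMB."),
])
```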

For a specific implementation manner of performing word segmentation on the content indicated by each identified title to obtain a word segmentation result of each title, reference may be made to S106 described above, which is not described in detail herein in this embodiment.

S409, querying the target word segmentation unit from the preset corresponding relation.

In this embodiment, the target word segmentation unit includes a word segmentation unit that has the same classification as the title and is similar to the word segmentation result, that is, the target word segmentation unit includes a word segmentation unit that is similar to the word segmentation result of the upper-level title and a word segmentation unit that is similar to the word segmentation result of the lower-level title. The corresponding relation comprises a corresponding relation between a word segmentation unit and a labeling item, and the word segmentation unit is a word segmentation result of the sample title.

The specific method for querying the target word segmentation unit comprises C1-C3 as follows:

C1, searching the upper-level word segmentation units for an upper-level word segmentation unit similar to the word segmentation result of the upper-level title, as the target upper-level word segmentation unit.

For the method of obtaining the target upper-level word segmentation unit, see S107 above, which is not described again in this embodiment.

C2, searching, among the lower-level word segmentation units subordinate to the target upper-level word segmentation unit, for the lower-level word segmentation units similar to the word segmentation results of the lower-level titles subordinate to the upper-level title, as target lower-level word segmentation units.

In this embodiment, the upper-level word segmentation unit and the lower-level word segmentation unit have a subordinate relationship, which is determined according to the subordinate relationship between the sample titles of the upper-level word segmentation unit and the lower-level word segmentation unit. For the method of determining whether the word segmentation result of a lower-level title is similar to a lower-level word segmentation unit, see S107.

C3, if no lower-level title subordinate to the upper-level title exists and/or no target upper-level word segmentation unit exists, querying, among the lower-level word segmentation units, the word segmentation unit similar to the word segmentation result of the upper-level title as the target upper-level word segmentation unit.

It should be noted that the target word segmentation unit may therefore also include a lower-level word segmentation unit similar to the word segmentation result of the upper-level title. For the method of determining whether the word segmentation result of the upper-level title is similar to a lower-level word segmentation unit, see S107.

In summary, after the hierarchical levels of the titles and the subordinate relations between the upper-level titles and the lower-level titles are identified, the search ranges of the query target segmentation units are different for the titles in different hierarchical levels, for example, after the target segmentation unit (that is, the target upper-level segmentation unit) of the upper-level title is determined, it is not necessary to search the segmentation units similar to the segmentation result of the lower-level title in all the segmentation units, so that the search efficiency is improved, and the labeling efficiency is further improved. If the subordinate titles belonging to the superior titles do not exist and/or the target superior word segmentation unit does not exist, the word segmentation unit similar to the word segmentation result of the superior titles is inquired in the subordinate word segmentation unit, and inaccurate labeling caused by misjudgment of the title classification is avoided. Therefore, the query efficiency is improved, and meanwhile, the accuracy of the query result can be ensured.
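The narrowing of the search range in C1 to C3 can be sketched as follows. The similarity test here is a simple token-overlap placeholder, because the actual criterion is the one referred to in S107. The data layout, the function names, and the choice to keep the C1 match when the C3 fallback finds nothing are all assumptions of the sketch.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Seg = Tuple[str, ...]
# Hypothetical layout: upper-level unit -> lower-level units subordinate to it.
UpperUnits = Dict[Seg, List[Seg]]


def similar(a: Seg, b: Seg, threshold: float = 0.5) -> bool:
    """Placeholder similarity: share of common tokens must exceed the threshold."""
    union = set(a) | set(b)
    return bool(union) and len(set(a) & set(b)) / len(union) > threshold


def find_first(query: Seg, candidates: Iterable[Seg]) -> Optional[Seg]:
    return next((c for c in candidates if similar(query, c)), None)


def query_units(upper: Seg, lowers: List[Seg], units: UpperUnits):
    target_upper = find_first(upper, units)  # C1: search upper-level units only
    if target_upper is not None and lowers:
        # C2: search only the lower-level units under the matched upper unit.
        return target_upper, [find_first(l, units[target_upper]) for l in lowers]
    # C3: no lower titles and/or no upper match, so search the lower-level units.
    all_lowers = [u for subs in units.values() for u in subs]
    fallback = find_first(upper, all_lowers)
    # Assumption: keep the C1 match if the fallback search finds nothing.
    return (fallback if fallback is not None else target_upper), []


units: UpperUnits = {
    ("prepayments",): [("prepayments", "account", "age")],
    ("inventory",): [("inventory", "classification")],
}
hit = query_units(("prepayments",), [("prepayments", "account", "age")], units)
fallback_hit = query_units(("inventory", "classification"), [], units)
```

In the second call the title was graded as upper-level but only matches a lower-level unit, which is the misjudged-level case that C3 is meant to absorb.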

S410, taking the labeling item corresponding to the target word segmentation unit as the labeling result of the title.

S401 to S410 are optional specific implementation manners of labeling the first text page provided in this embodiment, and are used to automatically label the first text page and improve the accuracy of the labeling result.

The process shown in fig. 4b is used to automatically label the second text page and improve the accuracy of the labeling result. The process comprises S411 to S422, as follows:

S411, in the title items of the second text page, determining the title of the table.

It should be noted that the second text page is taken as the text page to be labeled and includes a table and title items. For a specific method for determining the title of the table, see S101 above, which is not described again in this embodiment.

S412, querying the target text in the second text page, and taking the title ordered last in the first text page as the title of the target text.

For details, see S102, which is not described again in this embodiment.

S413, searching the target title items, in the reverse order of their ordering, for title items meeting the preset condition.

For details, see S103, which is not described again in this embodiment.

If there is a title item satisfying the preset condition, S414 to S415 are executed, and if there is no title item satisfying the preset condition, S416 or S417 is executed.

S414, among the title items meeting the preset condition, taking the title item ordered first as the upper title and the title item ordered after it as the lower title.

S415, identifying the upper title and the lower title in the second text page according to the distinguishing features of the upper title and the lower title.

It should be noted that, for specific implementation methods of S414 to S415, reference may be made to the above S104 to S105, which is not described herein again in this embodiment.

S416, if no title item meeting the preset condition exists among the target title items of the second text page, and no upper title or lower title has been recognized in the first text page, taking the title items in the second text page as upper titles.

S417, if there is no title item satisfying the preset condition in the target title items of the second text page and the upper title and the lower title are recognized in the first text page, recognizing the upper title and the lower title from the title items of the second text page according to the distinguishing features of the upper title and the lower title recognized in the first text page.

In this embodiment, among the title items of the second text page, a title item having the same distinguishing feature as the upper title recognized in the first text page is taken as an upper title, and a title item having the same distinguishing feature as the lower title recognized in the first text page is taken as a lower title.

It should be noted that, in general, in the same text page, all upper titles share the same distinguishing features and all lower titles share the same distinguishing features. Therefore, in this embodiment, for the second text page (or the first text page), whether another title item is an upper title or a lower title is identified according to the distinguishing features of the upper title and the lower title, which ensures the identification accuracy of the title level (upper title or lower title) of the title item and improves the identification efficiency.

Further, when the second text page has no title item meeting the preset condition, the second text page is labeled according to the labeling result of other text pages (namely, the first text page), so that the accuracy of the labeling result of the second text page is improved.
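The two fallback branches S416 and S417 can be illustrated with the sketch below. The passage does not say what the distinguishing features are, so the sketch assumes, purely for illustration, that the feature is the numbering style at the start of a title. The regular expressions, style names, and function names are all hypothetical.

```python
import re
from typing import List, Optional, Tuple


def numbering_style(title: str) -> str:
    """Assumed distinguishing feature: how the title is numbered."""
    if re.match(r"^\(\d+\)", title):
        return "paren-digit"   # e.g. "(1) ..."
    if re.match(r"^\d+", title):
        return "digit"         # e.g. "7 ..."
    return "other"


def classify_page(
    titles: List[str],
    upper_feature: Optional[str],
    lower_feature: Optional[str],
) -> List[Tuple[str, str]]:
    """Classify the title items of a page using features learned on the previous page."""
    if upper_feature is None or lower_feature is None:
        # S416: nothing was recognized on the previous page, so all items are upper titles.
        return [(t, "upper") for t in titles]
    out = []
    for t in titles:
        style = numbering_style(t)
        if style == upper_feature:        # S417: same feature as a known upper title
            out.append((t, "upper"))
        elif style == lower_feature:      # S417: same feature as a known lower title
            out.append((t, "lower"))
        else:
            out.append((t, "unknown"))    # assumption: left unclassified in this sketch
    return out


second_page = ["8 Inventory", "(1) Inventory by category"]
carried = classify_page(second_page, upper_feature="digit", lower_feature="paren-digit")
```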

S418, determining the affiliation between the upper title and the lower title according to the sorting.

S419, storing the dependency relationship in a preset data structure.

It should be noted that the dependency relationships between the upper titles and the lower titles in all text pages of the text are stored in the same B-tree. As shown in fig. 5, the B-tree stores both the dependency relationship between the upper titles and the lower titles in the first text page and that in the second text page.

S420, performing word segmentation on the content indicated by each identified title to obtain the word segmentation result of each title.

It should be noted that the content indicated by a title includes the title and the non-title text between the title and the adjacent title ordered after it, and the content indicated by the title of the target text includes that title and the target text. For the specific implementation of S418 to S420, see S406 to S408, which are not described in detail again in this embodiment.

S421, querying the target word segmentation unit from the preset corresponding relation.

S422, taking the labeling item corresponding to the target word segmentation unit as the labeling result of the title.

For the specific implementation of S421 to S422, see S409 to S410, which are not described in detail again in this embodiment.

It should be noted that S411 to S422 are an optional specific implementation manner for labeling the second text page provided in this embodiment, and are used to automatically label the second text page and improve the accuracy of the labeling result.

It should be further noted that the method is not limited to labeling texts that include only two text pages. For example, when the text includes more than two text pages and a text page other than the first page is being labeled, if no title item meeting the preset condition exists among the target title items of that text page and no upper title or lower title has been recognized in the other text pages, all title items in that text page are taken as upper titles. Here, the other text pages are the text pages other than the currently processed text page.

In conclusion, the method can automatically label the text comprising a plurality of pages and improve the accuracy of labeling.

In practical application, after the annotation result is obtained, the annotation data is further generated according to the annotation result.

Specifically, the labeling result of the upper title and the corresponding page number of the upper title are used as the annotation data of the upper title, and the labeling result of the lower title and the corresponding page number of the lower title are used as the annotation data of the lower title.

It should be noted that the title of the target text and the last title in the text page preceding the current text page are the same title. The labeling result of that last title is therefore retained in the preceding text page, and the page numbers of the content indicated by that title comprise a start page number and an end page number, where the start page number is the page number of the preceding text page in which the last title is located, and the end page number is the page number of the text page where the target text is located.
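The annotation data and the start and end page numbers described above could be represented as in the sketch below. The record layout, field names, and example values are assumptions made for illustration, not the format used by the embodiment.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Annotation:
    """Annotation data for one title: its labeling result and page range."""
    label: str        # labeling item assigned to the title
    level: str        # "upper" or "lower"
    start_page: int   # page where the title itself appears
    end_page: int     # last page holding content indicated by the title


def extend_to_target_text(last: Annotation, target_text_page: int) -> Annotation:
    """Extend the last title of the previous page to cover the target text
    found at the top of the following page."""
    return replace(last, end_page=target_text_page)


# Hypothetical values: the last title on page 12 continues onto page 13.
last_title = Annotation("Prepayments by aging", "lower", start_page=12, end_page=12)
spanning = extend_to_target_text(last_title, target_text_page=13)
```

Only the end page changes, so the labeling result recorded for the title on the earlier page is kept as is.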

According to the technical scheme, the text annotation method can improve the accuracy of the annotation result, further ensure the accuracy of the annotated data and meet the requirements of the data market.

Fig. 6 is a schematic structural diagram of a text annotation device according to an embodiment of the present application, and as shown in fig. 6, the text annotation device may include:

Specifically, for a specific implementation manner of the functions of the modules, reference may be made to the method embodiment.

The device described in this embodiment can not only automatically label the title in the text, but also ensure the accuracy of the labeling result.

FIG. 7 is a schematic structural diagram of the text annotation device, which can include: at least one processor 701, at least one communication interface 702, at least one memory 703 and at least one communication bus 704;

in the embodiment of the present application, the number of the processor 701, the communication interface 702, the memory 703 and the communication bus 704 is at least one, and the processor 701, the communication interface 702 and the memory 703 complete mutual communication through the communication bus 704;

the processor 701 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention, or the like;

the memory 703 may include a high-speed RAM memory, and may further include a non-volatile memory (non-volatile memory) or the like, such as at least one disk memory;

the memory stores a program, and the processor can execute the program stored in the memory to realize the steps of the text labeling method provided by the embodiment of the application, which are as follows:

a text labeling method comprises the following steps:

Optionally, the method further includes:

and storing the dependency relationship in a preset data structure.

Optionally, the method further includes:

Optionally, the obtaining process of the corresponding relationship includes:

Optionally, the method further includes:

the method further comprises the following steps:

The embodiment of the present application further provides a readable storage medium, where the readable storage medium may store a computer program suitable for being executed by a processor, and when the computer program is executed by the processor, the computer program implements the steps of the text annotation method provided in the embodiment of the present application, as follows:

a text labeling method comprises the following steps:

Optionally, the method further includes:

and storing the dependency relationship in a preset data structure.

Optionally, the method further includes:

Optionally, the obtaining process of the corresponding relationship includes:

Optionally, the method further includes:

the method further comprises the following steps:

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A text labeling method is characterized by comprising the following steps:

2. The method of claim 1, wherein the page of text is any page of text, the method further comprising:

3. The method of claim 1 or 2, further comprising:

and storing the dependency relationship in a preset data structure.

4. The method according to claim 1, wherein the correspondence includes a correspondence between a higher-level participle unit and a labeled item, and a correspondence between a lower-level participle unit and a labeled item; the superior word segmentation unit and the inferior word segmentation unit have a subordinate relationship;

5. The method of claim 4, further comprising:

6. The method according to claim 1, wherein the obtaining of the corresponding relationship comprises:

7. The method of claim 1, wherein the content indicated by the title comprises: the title, and non-title text between the title and an adjacent title ordered after the title;

the method further comprises the following steps:

8. A text labeling apparatus, comprising:

9. A text annotation apparatus, comprising: a memory and a processor;

the memory is used for storing programs;

the processor is used for executing the program and realizing the steps of the text labeling method according to any one of claims 1 to 7.

10. A readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the text annotation method according to any one of claims 1 to 7.