
CN114219012A - Sample data processing method, device, computer program product and storage medium - Google Patents

Sample data processing method, device, computer program product and storage medium

Info

Publication number
CN114219012A
CN114219012A
Authority
CN
China
Prior art keywords: text, sample, entity, determining, sample text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111417183.3A
Other languages
Chinese (zh)
Inventor
李东超
崔鸣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Fangjianghu Technology Co Ltd
Original Assignee
Beijing Fangjianghu Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-11-25
Filing date
2021-11-25
Publication date
2022-03-22
Application filed by Beijing Fangjianghu Technology Co Ltd filed Critical Beijing Fangjianghu Technology Co Ltd
Priority to CN202111417183.3A priority Critical patent/CN114219012A/en
Publication of CN114219012A publication Critical patent/CN114219012A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/243: Classification techniques relating to the number of classes
    • G06F 18/2431: Multiple classes
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/205: Parsing
    • G06F 40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295: Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the present disclosure disclose a sample data processing method, apparatus, computer program product, and storage medium. The method comprises the following steps: acquiring a text set to be processed; determining the syntactic structure of a sample text and the number ratio of that syntactic structure in the text set; inputting the sample text into a named entity recognition model and determining the boundary labels of the words in the sample text and the confidences of the boundary labels; determining the entities included in the sample text based on the boundary labels, and determining the F value and type label of each entity; determining the number ratio of each entity's type label in the text set; determining the support of the sample text based on the number ratio of the entity's type label in the text set, the F value of the entity, and the number ratio of the syntactic structure in the text set; determining the confusion degree of the sample text based on the confidences of the boundary labels; and acquiring a target sample text from the text set based on the support and confusion degree of the sample text together with a preset support threshold and a preset confusion threshold.

Description

Sample data processing method, device, computer program product and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence and the field of natural language processing, and in particular, to a method, an apparatus, a computer program product, and a storage medium for sample data processing.
Background
In the field of natural language processing, sequence labeling is a core sentence-level task: given a text sequence, it predicts the tag to be assigned to each element of the sequence.
To improve the performance of a sequence labeling model, high-quality samples are required for training, and samples with a high degree of confusion and a large amount of information contribute more to the optimization of the model. How to select the more valuable samples is therefore an important factor in improving the performance of a sequence labeling model.
In the related art, to select the "most confusing" samples or the samples with the "largest amount of information", the following methods are generally adopted: the least-confidence selection method, which selects the sample whose most probable prediction has the lowest confidence; and the margin sampling method, which selects the sample with the smallest difference between the two highest probability values predicted by the model. The least-confidence method ignores the predictions with lower probability, and margin sampling considers only the two predictions with the highest probability, so the efficiency of sample selection is low.
Disclosure of Invention
The disclosed embodiments provide a sample data processing method, device, computer program product and storage medium to select a sample text with higher value.
In one aspect of the embodiments of the present disclosure, a method for processing sample data is provided, including: acquiring a text set to be processed, wherein the text set comprises more than one sample text; determining a syntactic structure of a sample text, and determining the number ratio of the syntactic structure in a text set; inputting a sample text into a pre-trained named entity recognition model, and outputting the boundary labels and the confidence degrees of the boundary labels of the words in the sample text through the named entity recognition model; determining an entity included in the sample text based on the boundary label, and determining an F value and a type label of the entity; determining the number proportion of the type labels of the entities in the text set; determining the support degree of the sample text based on the number proportion of the type label of the entity in the text set, the F value of the entity and the number proportion of the syntactic structure in the text set; determining the confusion degree of the sample text based on the confidence degree of the boundary label; and selecting a target sample text from the text set based on the support degree, the confusion degree, the preset support degree threshold value and the confusion degree threshold value of the sample text.
In some embodiments, determining the degree of confusion for the sample text based on the confidence of the boundary labels comprises: determining the information entropy of the word based on the confidence of the boundary label; determining the average value of the information entropy of the words included in the entity as the confusion degree of the entity; an average of the degrees of confusion of the entities included in the sample text is determined as the degree of confusion of the sample text.
In some embodiments, the method further comprises determining a confidence level of the type label of the entity; determining the information entropy of the word based on the confidence of the boundary label, including: adjusting the confidence coefficient of the boundary label based on the confidence coefficient of the type label of the entity to which the word belongs to obtain the adjusted confidence coefficient of the boundary label; and determining the information entropy of the word based on the confidence degree of the adjusted boundary label.
In some embodiments, the support of the sample text is positively correlated with the number ratio of the syntactic structure in the text set, positively correlated with a first value, and negatively correlated with a second value, where the first value is the mean of the number ratios in the text set of the type labels of the entities included in the sample text, and the second value is the mean of the F values of the entities included in the sample text.
In some embodiments, the syntax structure is determined via the steps of: performing word segmentation on the sample text to obtain a word segmentation sequence; determining the part of speech of the word in the word segmentation sequence; and performing syntactic analysis on the word sequence based on the part of speech to obtain a syntactic structure.
In some embodiments, the type tag of the entity is obtained via the following steps: and utilizing Elastic Search to recall entity types to obtain the type labels of the entities.
In some embodiments, the method further comprises: constructing a sample set based on the target sample text; and training a pre-constructed initial sequence labeling model based on the sample set to obtain a trained sequence labeling model.
In another aspect of the embodiments of the present disclosure, an apparatus for processing sample data is provided, including: the data acquisition unit is configured to acquire a text set to be processed, and the text set comprises more than one sample text; a syntax determination unit configured to determine a syntax structure of the sample text and determine a number of the syntax structure in the text set as a ratio; a text recognition unit configured to input a sample text into a pre-trained named entity recognition model, and output a boundary tag of a word included in the sample text and a confidence of the boundary tag via the named entity recognition model; an entity determination unit configured to determine an entity included in the sample text based on the boundary label, and determine an F value and a type label of the entity; a type proportion unit configured to determine a number proportion of type tags of the entity in the text set; the support degree unit is configured to determine the support degree of the sample text based on the number proportion of the type labels of the entities in the text set, the F value of the entities and the number proportion of the syntactic structures in the text set; a confusion unit configured to determine a confusion of the sample text based on the confidence of the boundary label; and the sample selecting unit is configured to select a target sample text from the text set based on a preset support threshold and a preset confusion threshold.
In yet another aspect of the disclosed embodiments, there is provided a computer program product comprising computer programs/instructions which, when executed by a processor, implement the method of sample data processing in any of the above embodiments.
In yet another aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored, and when the computer program is executed by a processor, the computer program implements the method for sample data processing in any of the above embodiments.
The sample data processing method provided by the embodiments of the present disclosure first determines the syntactic structure of a sample text and the number ratio of that syntactic structure in the text set; then inputs the sample text into a pre-trained named entity recognition model and determines the boundary labels of the words in the sample text and the confidences of the boundary labels; then determines the entities included in the sample text based on the boundary labels, together with the F value and type label of each entity; then determines the number ratio of each entity's type label in the text set; then determines the support of the sample text based on the number ratio of the entity's type label in the text set, the F value of the entity, and the number ratio of the syntactic structure in the text set; determines the confusion degree of the sample text based on the confidences of the boundary labels; and finally selects a target sample text from the text set based on a preset support threshold and a preset confusion threshold. Limiting the number of sample texts with similar syntactic structures through the support avoids the sample imbalance caused by an excessive number of samples with the same syntactic structure, while constraining the amount of information in the sample texts through the confusion degree makes it possible to select the more valuable sample texts.
The technical solution of the present disclosure is further described in detail by the accompanying drawings and examples.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
The present disclosure may be more clearly understood from the following detailed description, taken with reference to the accompanying drawings, in which:
FIG. 1 is a flow diagram of one embodiment of a method of sample data processing of the present disclosure;
FIG. 2 is a flow diagram of determining a degree of confusion in one embodiment of sample data processing of the present disclosure;
FIG. 3 is a flow diagram of determining a syntax structure in one embodiment of a method of sample data processing of the present disclosure;
FIG. 4 is a flow diagram of yet another embodiment of a method of sample data processing of the present disclosure;
FIG. 5 is a schematic diagram illustrating an embodiment of an apparatus for sample data processing according to the present disclosure;
fig. 6 is a schematic structural diagram of an embodiment of an electronic device according to the present disclosure.
Detailed Description
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
It will be understood by those of skill in the art that the terms "first", "second", and the like in the embodiments of the present disclosure are used merely to distinguish one element from another, and imply neither any particular technical meaning nor any necessary logical order between them.
It is also understood that in embodiments of the present disclosure, "a plurality" may refer to two or more and "at least one" may refer to one, two or more.
It is also to be understood that any reference to any component, data, or structure in the embodiments of the disclosure, may be generally understood as one or more, unless explicitly defined otherwise or stated otherwise.
In addition, the term "and/or" in the present disclosure describes only an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" in the present disclosure generally indicates that the associated objects before and after it are in an "or" relationship.
It should also be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and the same or similar parts may be referred to each other, so that the descriptions thereof are omitted for brevity.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
The disclosed embodiments may be applied to electronic devices such as terminal devices, computer systems, servers, etc., which are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known terminal devices, computing systems, environments, and/or configurations that may be suitable for use with electronic devices, such as terminal devices, computer systems, servers, and the like, include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above, and the like.
Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
Summary of the disclosure
In the process of implementing the present disclosure, the inventors found that the least-confidence selection method ignores the predictions with lower probability, and the margin sampling method considers only the two predictions with the highest probability, resulting in low sample-selection efficiency. In addition, neither method considers the similarity of syntactic structures among samples; when the number of similar samples is large, the samples become unbalanced, which affects the performance of the sequence labeling model.
Exemplary method
Referring next to fig. 1, fig. 1 shows a flowchart of one embodiment of a method of sample data processing of the present disclosure, as shown in fig. 1, the flowchart includes the following steps.
And step 110, acquiring a text set to be processed.
Wherein the text set comprises more than one sample text.
As an example, the execution subject may be, for example, a terminal device or a server, and the text set to be processed may be acquired through a network.
And step 120, determining the syntactic structure of the sample text, and determining the number ratio of the syntactic structure in the text set.
The syntactic structure represents the manner in which words are combined; for example, depending on the combination, syntactic structures may include types such as the verb-object structure, the subject-predicate structure, and the modifier-head structure.
In this embodiment, the number ratio of a syntactic structure in the text set is the ratio of the number of sample texts sharing the same syntactic structure to the total number of sample texts.
As an example, the execution subject may perform syntactic analysis on the sample text via the HanLP tool, determining the syntactic structure of the sample text.
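As an illustration, the counting step can be sketched in a few lines of Python; representing each parsed sample by a hashable structure signature is an assumption made for the example, not something the method prescribes.

```python
from collections import Counter

def structure_ratios(structures):
    # `structures` holds one syntactic-structure signature per sample text,
    # e.g. the tuple of dependency relations produced by a parser.
    counts = Counter(structures)
    total = len(structures)
    return {s: n / total for s, n in counts.items()}

# Hypothetical toy set: three samples share one structure, one differs.
print(structure_ratios(["SVO", "SVO", "SVO", "SV"]))  # {'SVO': 0.75, 'SV': 0.25}
```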
Step 130, inputting the sample text into a pre-trained named entity recognition model, and outputting the boundary labels and the confidence degrees of the boundary labels of the words in the sample text through the named entity recognition model.
As an example, a BiLSTM+CRF model may be used to perform entity recognition on the sample text, determining the boundary label of each word in the sample text and the confidence of each boundary label. For example, the boundary labels may include B, I, and O, where B marks the start of an entity, I marks a word inside an entity, and O marks a non-entity word; the confidences of the three labels for the same word sum to 1.
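For concreteness, such per-word output can be pictured as below; the dictionary layout is a convention invented for the sketches in this description, not the model's actual interface.

```python
# One sample text; each word carries a confidence for each boundary label,
# and the B/I/O confidences of a word sum to 1.
sample_output = [
    {"word": "苹", "B": 0.90, "I": 0.05, "O": 0.05},
    {"word": "果", "B": 0.10, "I": 0.85, "O": 0.05},
    {"word": "好", "B": 0.05, "I": 0.05, "O": 0.90},
]
```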
Step 140, determining an entity included in the sample text based on the boundary label, and determining an F value and a type label of the entity.
Typically, the F value (i.e., the F-score) characterizes recognition performance on an entity and can be calculated from the precision and recall for that entity.
In this embodiment, the type tag of the entity is used to characterize the type of the entity, for example, the entity tag of "apple" may be a type of "fruit", "mobile phone", or the like.
In a specific example, the execution body may take a word whose boundary label is "B" as the start of an interval and the word immediately preceding the first subsequent word labelled "O" as the end of the interval, then extract the words in the interval in order to obtain the entity they compose. The execution body may then determine the F value of the entity from the precision and recall corresponding to the entity, and identify the entity with an entity recognition model (e.g., a baseline model) to determine the type tag of the entity.
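A minimal sketch of this interval rule, assuming the hypothetical per-word output format pictured above:

```python
def extract_entities(sample_output):
    """Extract entity spans from predicted boundary labels: a word labelled
    "B" opens an entity, and the entity ends just before the first
    subsequent word labelled "O"."""
    entities, current = [], []
    for item in sample_output:
        label = max("BIO", key=lambda tag: item[tag])  # most confident label
        if label == "B":
            if current:                 # a new "B" closes the previous entity
                entities.append(current)
            current = [item["word"]]
        elif label == "I" and current:  # extend the open entity
            current.append(item["word"])
        else:                           # "O" (or a stray "I") closes the entity
            if current:
                entities.append(current)
            current = []
    if current:
        entities.append(current)
    return ["".join(words) for words in entities]
```

Applied to the three-word output above, this yields the single entity "苹果" ("apple").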
In some optional implementations of this embodiment, the type tag of the entity is obtained through the following steps: and utilizing Elastic Search to recall entity types to obtain the type labels of the entities.
In this implementation manner, the execution main body may input an index value (for example, pinyin of the entity) corresponding to the entity into the Elastic Search, so as to obtain a type tag corresponding to the index value, thereby improving the efficiency of determining the entity type.
As an example, the executing agent may retrieve "pingguo" through Elastic Search; the recalled results, arranged by confidence, may include "apple, cell phone" and "apple, fruit", where "cell phone" and "fruit" are type tags of the entity "apple".
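A sketch of such a recall with the official Python client follows; the index name "entity_types" and the fields "pinyin", "entity", and "type_label" are invented for the example, since the disclosure does not specify a schema.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Recall candidate entities (and their type labels) whose pinyin matches.
resp = es.search(index="entity_types", query={"match": {"pinyin": "pingguo"}})
for hit in resp["hits"]["hits"]:  # hits are ordered by relevance score
    src = hit["_source"]
    print(src["entity"], src["type_label"], hit["_score"])
```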
Step 150, determining the number of the type labels of the entity in the text set.
In this embodiment, the number ratio of an entity's type tag in the text set is the ratio of the number of entities having the same type tag to the total number of entities in the text set.
And step 160, determining the support degree of the sample text based on the number proportion of the type labels of the entities in the text set, the F value of the entities and the number proportion of the syntactic structures in the text set.
In this embodiment, the support degree of the sample text may characterize the similarity degree of the syntactic structure of the sample text.
In some optional implementations of this embodiment, the support of the sample text is positively correlated with the number ratio of the syntactic structure in the text set, positively correlated with a first value, and negatively correlated with a second value, where the first value is the mean of the number ratios in the text set of the type labels of the entities included in the sample text, and the second value is the mean of the F values of the entities included in the sample text.
As an example, the support degree of the sample text may be determined by the following formulas (1), (2), (3).
s_entity = (1/m) * Σ_{j=a}^{b} s_entity_j    (1)

F = (1/m) * Σ_{j=a}^{b} F_j    (2)

s_sample = (s_content * s_entity) / F    (3)

In the formulas, s_sample represents the support of the sample text, s_content represents the number ratio of the syntactic structure in the text set, s_entity represents the first value, s_entity_j represents the number ratio in the text set of the type label of the j-th entity, a represents the first entity in the sample text, b represents the last entity in the sample text, m represents the number of entities in the sample text, F_j represents the F value of the j-th entity, and F represents the second value.
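A minimal sketch of this computation follows; formulas (1) and (2) are plain means, while the combination in formula (3) is an assumption consistent with the stated correlations (positive in s_content and s_entity, negative in F).

```python
def support(s_content, type_label_ratios, f_values):
    # type_label_ratios: number ratio in the set of each entity's type label;
    # f_values: F value of each entity in the sample text.
    m = len(f_values)
    s_entity = sum(type_label_ratios) / m   # formula (1)
    f_mean = sum(f_values) / m              # formula (2)
    return s_content * s_entity / f_mean    # formula (3), assumed form

# Hypothetical sample: structure ratio 0.3, two entities.
print(support(0.3, [0.2, 0.4], [0.8, 0.6]))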
And step 170, determining the confusion degree of the sample text based on the confidence degree of the boundary label.
As an example, the execution subject may first determine the information entropy of each word based on the confidence of the boundary label, and then determine the average of the information entropy of each word as the confusability of the sample text. For example, the information entropy of a word can be determined by the following equation (4):
H_i = -(P_B * log2(P_B) + P_I * log2(P_I) + P_O * log2(P_O))    (4)

In the formula, H_i represents the information entropy of the i-th word, P_B represents the confidence of the boundary label "B", P_I represents the confidence of the boundary label "I", and P_O represents the confidence of the boundary label "O".
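Formula (4) translates directly into code; the guard against zero confidences is added only for numerical safety and is not part of the formula.

```python
import math

def word_entropy(p_b, p_i, p_o):
    # Information entropy of one word over its B/I/O confidences, per (4).
    return -sum(p * math.log2(p) for p in (p_b, p_i, p_o) if p > 0)

print(word_entropy(0.90, 0.05, 0.05))  # confidently labelled word: low entropy
print(word_entropy(1/3, 1/3, 1/3))     # maximally confused word: log2(3)
```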
And step 180, acquiring a target sample text from the text set based on the support degree, the confusion degree, the preset support degree threshold and the confusion degree threshold of the sample text.
As an example, the support threshold may be empirically set to 0.8, the confusability threshold may be set to 0.7, the execution subject may traverse sample texts in the text set, and then determine sample texts with support greater than 0.8 and confusability greater than 0.7 as the target sample texts.
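The final filtering then reduces to a single pass over the text set; the dictionary keys below are a hypothetical representation of the per-sample scores.

```python
def select_targets(samples, support_th=0.8, confusion_th=0.7):
    # Keep samples whose support and confusion both exceed the thresholds.
    return [s for s in samples
            if s["support"] > support_th and s["confusion"] > confusion_th]
```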
The sample data processing method provided by the embodiments of the present disclosure first determines the syntactic structure of a sample text and the number ratio of that syntactic structure in the text set; then inputs the sample text into a pre-trained named entity recognition model and determines the boundary labels of the words in the sample text and the confidences of the boundary labels; then determines the entities included in the sample text based on the boundary labels, together with the F value and type label of each entity; then determines the number ratio of each entity's type label in the text set; then determines the support of the sample text based on the number ratio of the entity's type label in the text set, the F value of the entity, and the number ratio of the syntactic structure in the text set; determines the confusion degree of the sample text based on the confidences of the boundary labels; and finally selects a target sample text from the text set based on a preset support threshold and a preset confusion threshold. Limiting the number of sample texts with similar syntactic structures through the support avoids the sample imbalance caused by an excessive number of samples with the same syntactic structure, while constraining the amount of information in the sample texts through the confusion degree makes it possible to select the more valuable sample texts.
Referring next to fig. 2, fig. 2 shows a flowchart of determining the degree of confusion in an embodiment of sample data processing of the present disclosure, and in some alternative implementations of the embodiment shown in fig. 1, step 170 may also adopt the flowchart shown in fig. 2, which includes the following steps.
And step 210, determining the information entropy of the word based on the confidence degree of the boundary label.
Step 220, determining the average value of the information entropy of the words included in the entity as the confusability of the entity.
In this embodiment, the execution subject may determine an average value of information entropies of words included in the entity as a degree of confusion of the entity.
Step 230, determining the average value of the confusability of the entities included in the sample text as the confusability of the sample text.
In this embodiment, the execution subject may determine an average of the degrees of confusion of the entities included in the sample text as the degree of confusion of the sample text.
In one specific example, the execution subject may determine the confusability of the entity by equation (5) and the confusability of the sample text by equation (6).
H_entity_j = (1/n) * Σ_{i=k}^{l} H_i    (5)

H_sample = (1/m) * Σ_{j=a}^{b} H_entity_j    (6)

In the formulas, H_entity_j represents the confusion degree of the j-th entity, H_sample represents the confusion degree of the sample text, k and l represent the start and end positions of the entity, n represents the number of words in the entity, m represents the number of entities in the sample text, and a and b represent the first and last entities in the sample text.
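Since formulas (5) and (6) are plain means, a sketch building on the word_entropy() helper from the earlier example is short:

```python
def entity_confusion(word_entropies):
    # Formula (5): mean entropy of the words composing one entity.
    return sum(word_entropies) / len(word_entropies)

def sample_confusion(entity_confusions):
    # Formula (6): mean confusion of the entities in one sample text.
    return sum(entity_confusions) / len(entity_confusions)

# e.g. per-entity word entropies from the hypothetical model output:
# h = [word_entropy(w["B"], w["I"], w["O"]) for w in entity_words]
```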
The process for determining the confusability of the sample text shown in fig. 2 can determine the confusability of the entity based on the information entropy of the word, and then determine the confusability of the sample text based on the confusability of the entity, so that the calculation of the confusability of the multi-label classification task is realized, and the information amount of the sample can be more accurately reflected.
In some optional implementations of this embodiment, the method further includes determining a confidence level of the type label of the entity; the step 210 may further include: adjusting the confidence coefficient of the boundary label according to the confidence coefficient of the type label of the entity to which the word belongs to obtain the adjusted confidence coefficient of the boundary label; and determining the information entropy of the word based on the confidence degree of the adjusted boundary label.
As an example, the execution subject may determine the confidence of the entity's type label using Elastic Search, multiply that confidence by the confidence of the boundary label of each word included in the entity, take the resulting product as the adjusted boundary confidence of the word, and determine the information entropy of the word based on the adjusted boundary confidence.
For example, the execution subject may adjust the confidence of the boundary label by the following equations (7), (8), (9):
P_B = P_ES_j * p_B    (7)

P_I = P_ES_j * p_I    (8)

P_O = P_ES_j * p_O    (9)

In the formulas, p_B, p_I, and p_O represent the confidences of the boundary labels B, I, and O; P_B, P_I, and P_O represent the adjusted confidences of those labels; and P_ES_j represents the confidence of the type label of the j-th entity.
In the implementation manner, the confidence of the type label of the entity can be introduced into the calculation process of the information entropy of the word, so that the information entropy of the word can include the type information of the entity, the obtained confusion degree of the sample text can reflect the information amount of the entity type contained in the sample text, and the value of the target sample text to the entity recognition model can be improved.
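A sketch combining formulas (7) to (9) with formula (4) follows; note that the scaled confidences need no longer sum to 1, which matches the formulas as written.

```python
import math

def adjusted_word_entropy(p_b, p_i, p_o, p_es):
    # Formulas (7)-(9): scale each boundary-label confidence by the
    # confidence p_es of the type label of the entity the word belongs to.
    adjusted = (p_es * p_b, p_es * p_i, p_es * p_o)
    # Formula (4) applied to the adjusted confidences.
    return -sum(p * math.log2(p) for p in adjusted if p > 0)

print(adjusted_word_entropy(0.90, 0.05, 0.05, 0.8))
```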
Referring next to fig. 3, fig. 3 illustrates a flow diagram for determining a syntax structure in one embodiment of a method of sample data processing of the present disclosure, as illustrated in fig. 3, which includes the following steps.
And 310, performing word segmentation on the sample text to obtain a word segmentation sequence.
And step 320, determining the part of speech of the words in the word segmentation sequence.
For example, parts of speech may include adjectives, verbs, nouns, and the like.
And step 330, performing syntactic analysis on the word sequence based on the part of speech to obtain a syntactic structure.
In a specific example, the sample text is "I buy a house in Xierqi", and performing word segmentation on it yields the sequence: "I", "buy", "Xierqi", "house". The part of speech of each word is then determined; for example, the part of speech of "I" is a pronoun and the part of speech of "house" is a noun. The execution subject may then perform syntactic analysis on the word sequence using the HanLP tool, and the resulting syntactic structure may include: a subject-verb relation, a verb-object relation, an attribute (modifier-head) relation, a right-adjunct relation, and a head relation.
In this embodiment, the execution subject may perform word segmentation on the sample text, determine a part-of-speech of a word obtained by the word segmentation, and then perform syntactic analysis on the word segmentation sequence according to the part-of-speech, determine a syntactic structure of the sample text, so that accuracy and efficiency of the syntactic structure may be improved.
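The three steps can be sketched with the pyhanlp wrapper as below, assuming pyhanlp and its models are installed; the exact part-of-speech tags and relation labels depend on the loaded model.

```python
from pyhanlp import HanLP  # assumes the pyhanlp package and models are installed

text = "我买西二旗的房子"  # "I buy a house in Xierqi"

# Steps 310 and 320: word segmentation with part-of-speech tags.
for term in HanLP.segment(text):
    print(term.word, term.nature)

# Step 330: dependency parsing; each token line carries its head word and
# the dependency relation (subject-verb, verb-object, attribute, ...).
print(HanLP.parseDependency(text))
```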
Referring next to FIG. 4, FIG. 4 illustrates a flow diagram of yet another embodiment of a method of sample data processing of the present disclosure. As shown in fig. 4, the method may further include the flow shown in fig. 4 on the basis of the flow shown in fig. 1, and the flow includes the following steps.
And step 410, constructing a sample set based on the target sample text.
And step 420, training a pre-constructed initial sequence labeling model based on the sample set to obtain a trained sequence labeling model.
In this embodiment, the target sample texts selected based on the confusion degree and the support degree can have a larger information amount, and the distribution of the target sample texts in the sample set is balanced, so that the training quality of the sequence labeling model can be improved, and the performance of the sequence labeling model can be improved.
Exemplary devices
Referring next to fig. 5, fig. 5 is a schematic structural diagram illustrating an embodiment of an apparatus for processing sample data according to the present disclosure, and as shown in fig. 5, the apparatus includes: a data obtaining unit 510 configured to obtain a text set to be processed, the text set including more than one sample text; a syntax determination unit 520 configured to determine a syntax structure of the sample text and determine a number of the syntax structure in the text set as a ratio; a text recognition unit 530 configured to input the sample text into a pre-trained named entity recognition model, and output a boundary label of a word included in the sample text and a confidence of the boundary label via the named entity recognition model; an entity determining unit 540 configured to determine an entity included in the sample text based on the boundary label, and determine an F value and a type label of the entity; a type proportion unit 550 configured to determine a number proportion of type tags of the entity in the text set; a support degree unit 560 configured to determine a support degree of the sample text based on a number ratio of the type tag of the entity in the text set, an F value of the entity, and a number ratio of the syntax structure in the text set; a confusability unit 570 configured to determine a confusability of the sample text based on the confidence of the boundary label; the sample selecting unit 580 is configured to obtain the target sample text from the text set based on the support degree, the confusability degree, the preset support degree threshold and the confusability threshold of the sample text.
In this embodiment, the confusability unit 570 further includes: an information entropy module configured to determine an information entropy of the word based on the confidence of the boundary label; an entity confusion module configured to determine a mean of information entropies of words included in an entity as a confusion of the entity; a sample confusability module configured to determine an average of confusabilities of entities included in the sample text as the confusability of the sample text.
In this embodiment, the apparatus further comprises a type confidence unit configured to determine a confidence of the type label of the entity; the information entropy module further comprises: the adjusting submodule is configured to adjust the confidence coefficient of the boundary label based on the confidence coefficient of the type label of the entity to which the word belongs, so that the confidence coefficient of the adjusted boundary label is obtained; and the information entropy determining submodule is configured to determine the information entropy of the word based on the confidence degree of the adjusted boundary label.
In this embodiment, the support of the sample text is positively correlated with the number of the syntax structure in the text set, the support of the sample text is positively correlated with a first numerical value, and the support of the sample text is negatively correlated with a second numerical value, where the first numerical value is an average of the number of the type labels of the entities included in the sample in the text set, and the second numerical value is an average of the F values of the entities included in the sample text.
In this embodiment, the syntax determination unit 520 further includes: the word segmentation module is configured to segment words of the sample text to obtain a word segmentation sequence; a part-of-speech recognition module configured to determine parts-of-speech of words in the sequence of participles; and the structure analysis module is configured to perform syntactic analysis on the word sequence based on the part of speech to obtain a syntactic structure.
In this embodiment, the entity determining unit 540 is further configured to: and utilizing Elastic Search to recall entity types to obtain the type labels of the entities.
In this embodiment, the apparatus further comprises: a sample construction unit configured to construct a sample set based on the target sample text; and the training unit is configured to train a pre-constructed initial sequence labeling model based on the sample set to obtain a trained sequence labeling model.
In addition, an embodiment of the present disclosure also provides an electronic device, including:
a memory for storing a computer program;
a processor, configured to execute the computer program stored in the memory, and when the computer program is executed, implement the method for processing sample data according to any of the above embodiments of the present disclosure.
Fig. 6 is a schematic structural diagram of an embodiment of an electronic device according to the present disclosure. Next, an electronic apparatus according to an embodiment of the present disclosure is described with reference to fig. 6. The electronic device may be either or both of the first device and the second device, or a stand-alone device separate from them, which stand-alone device may communicate with the first device and the second device to receive the acquired input signals therefrom.
As shown in fig. 6, the electronic device includes one or more processors and memory.
The processor may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device to perform desired functions.
The memory may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer readable storage medium and executed by a processor to implement the methods of sample data processing of the various embodiments of the present disclosure described above and/or other desired functions.
In one example, the electronic device may further include: an input device and an output device, which are interconnected by a bus system and/or other form of connection mechanism (not shown).
The input device may also include, for example, a keyboard, a mouse, and the like.
The output device may output various information including the determined distance information, direction information, and the like to the outside. The output devices may include, for example, a display, speakers, a printer, and a communication network and remote output devices connected thereto, among others.
Of course, for simplicity, only some of the components of the electronic device relevant to the present disclosure are shown in fig. 6, omitting components such as buses, input/output interfaces, and the like. In addition, the electronic device may include any other suitable components, depending on the particular application.
In addition to the above methods and apparatus, embodiments of the present disclosure may also be a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform the steps in the method of sample data processing according to various embodiments of the present disclosure described in the above section of this specification.
The computer program product may write program code for carrying out operations of embodiments of the present disclosure in any combination of one or more programming languages, including object-oriented programming languages such as Java or C++, as well as conventional procedural programming languages such as the "C" programming language or similar languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform the steps in the method of sample data processing according to various embodiments of the present disclosure described in the above section of this specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
The foregoing describes the general principles of the present disclosure in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present disclosure are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present disclosure. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the disclosure is not intended to be limited to the specific details so described.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts in the embodiments are referred to each other. For the system embodiment, since it basically corresponds to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The block diagrams of devices, apparatuses, systems referred to in this disclosure are only given as illustrative examples and are not intended to require or imply that the connections, arrangements, configurations, etc. must be made in the manner shown in the block diagrams. These devices, apparatuses, devices, systems may be connected, arranged, configured in any manner, as will be appreciated by those skilled in the art. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The words "or" and "as used herein mean, and are used interchangeably with, the word" and/or, "unless the context clearly dictates otherwise. The word "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to".
The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
It is also noted that in the devices, apparatuses, and methods of the present disclosure, each component or step can be decomposed and/or recombined. These decompositions and/or recombinations are to be considered equivalents of the present disclosure.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit embodiments of the disclosure to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (10)

1. A sample data processing method is characterized by comprising the following steps:
acquiring a text set to be processed, wherein the text set comprises more than one sample text;
determining a syntactic structure of the sample text, and determining the number ratio of the syntactic structure in the text set;
inputting the sample text into a pre-trained named entity recognition model, and outputting the boundary labels of the words included in the sample text and the confidence degrees of the boundary labels through the named entity recognition model;
determining an entity included in the sample text based on the boundary label, and determining an F value and a type label of the entity;
determining the number of type tags of the entity in the text set as a ratio;
determining the support of the sample text based on the number proportion of the type label of the entity in the text set, the F value of the entity and the number proportion of the syntactic structure in the text set;
determining the confusion degree of the sample text based on the confidence degree of the boundary label;
and acquiring a target sample text from the text set based on the support degree, the confusion degree, a preset support degree threshold value and a preset confusion degree threshold value of the sample text.
2. The method of claim 1, wherein determining the degree of confusion of the sample text based on the confidence of the boundary labels comprises:
determining the information entropy of the word based on the confidence of the boundary label;
determining the average value of the information entropy of the words included in the entity as the confusability of the entity;
determining an average of the degrees of confusion of the entities included in the sample text as the degree of confusion of the sample text.
3. The method of claim 2, further comprising determining a confidence level of a type label of the entity;
determining the information entropy of the word based on the confidence of the boundary label, including:
adjusting the confidence coefficient of the boundary label based on the confidence coefficient of the type label of the entity to which the word belongs to obtain the adjusted confidence coefficient of the boundary label;
and determining the information entropy of the word based on the confidence degree of the adjusted boundary label.
4. The method according to one of claims 1 to 3, wherein the support of the sample text is positively correlated with the number ratio of the syntactic structure in the text set, positively correlated with a first value, and negatively correlated with a second value, wherein the first value is a mean of the number ratios in the text set of the type labels of the entities included in the sample text, and the second value is a mean of the F values of the entities included in the sample text.
5. The method according to one of claims 1 to 4, characterized in that the syntax structure is determined by:
performing word segmentation on the sample text to obtain a word segmentation sequence;
determining the part of speech of the words in the word segmentation sequence;
and carrying out syntactic analysis on the word segmentation sequence based on the part of speech to obtain the syntactic structure.
6. Method according to one of claims 1 to 5, characterized in that the type tag of the entity is obtained via the following steps:
and utilizing an Elastic Search to recall the entity type to obtain the type tag of the entity.
7. The method according to one of claims 1 to 6, characterized in that the method further comprises:
constructing a sample set based on the target sample text;
and training a pre-constructed initial sequence labeling model based on the sample set to obtain a trained sequence labeling model.
8. An apparatus for sample data processing, comprising:
a data acquisition unit configured to acquire a text set to be processed, the text set including one or more sample texts;
a syntax determination unit configured to determine a syntax structure of the sample text and determine a number of the syntax structure in the text set as a ratio;
a text recognition unit configured to input the sample text into a pre-trained named entity recognition model, and output a boundary label of a word included in the sample text and a confidence of the boundary label via the named entity recognition model;
an entity determination unit configured to determine an entity included in the sample text based on the boundary label, and determine an F value and a type label of the entity;
a type proportion unit configured to determine a number proportion of type tags of the entity in the text set;
a support degree unit configured to determine a support degree of the sample text based on a number proportion of the type tag of the entity in the text set, an F value of the entity, and a number proportion of the syntax structure in the text set;
a confusability unit configured to determine a confusability of the sample text based on the confidence of the boundary label;
and the sample selecting unit is configured to obtain a target sample text from the text set based on the support degree, the confusion degree, a preset support degree threshold value and a preset confusion degree threshold value of the sample text.
9. A computer program product comprising computer programs/instructions for implementing the method according to any one of claims 1 to 7 when executed by a processor.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of the preceding claims 1 to 7.
CN202111417183.3A 2021-11-25 2021-11-25 Sample data processing method, device, computer program product and storage medium Pending CN114219012A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111417183.3A CN114219012A (en) 2021-11-25 2021-11-25 Sample data processing method, device, computer program product and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111417183.3A CN114219012A (en) 2021-11-25 2021-11-25 Sample data processing method, device, computer program product and storage medium

Publications (1)

Publication Number Publication Date
CN114219012A true CN114219012A (en) 2022-03-22

Family

ID=80698467

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111417183.3A Pending CN114219012A (en) 2021-11-25 2021-11-25 Sample data processing method, device, computer program product and storage medium

Country Status (1)

Country Link
CN (1) CN114219012A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination