CN113761900A

CN113761900A - Unstructured transaction information identification method and system based on natural language processing

Info

Publication number: CN113761900A
Application number: CN202111051611.5A
Authority: CN
Inventors: 牛志遥; 杨骏逸
Original assignee: Southern Fund Management Co ltd
Current assignee: Southern Fund Management Co ltd
Priority date: 2021-09-08
Filing date: 2021-09-08
Publication date: 2021-12-07
Anticipated expiration: 2041-09-08
Also published as: CN113761900B

Abstract

The invention discloses a method and system for identifying unstructured transaction information based on natural language processing. The method includes: acquiring a transaction language text corresponding to transaction quotation information sent by a user, and preprocessing the transaction language text; performing transaction context word segmentation and part-of-speech recognition on the preprocessed transaction language text; performing sense group segmentation and paragraph completion processing on the transaction language text; extracting part-of-speech features from the transaction context word segmentation result and the part-of-speech recognition result, and performing vectorization modeling on the part-of-speech features through a Bayesian network so as to perform globally optimal recognition of the transaction language text; and performing matrix format verification on the globally optimal recognition result, and performing matrix structuring on the results of sense group segmentation and paragraph completion processing based on the verification result. The invention performs part-of-speech processing, paragraph processing, and related steps on the transaction quotation information, and parses various kinds of trading "jargon" into specific quotation and transaction information, thereby improving the processing efficiency of the transaction quotation information.

Description

Unstructured transaction information identification method and system based on natural language processing

Technical Field

The invention relates to the technical field of computer information processing, and in particular to an unstructured transaction information identification method and system based on natural language processing.

Background

Currently, trade matching for fixed-income products and non-standardized securities is negotiated by the market participants themselves. As the interbank trading market has prospered, a large number of trades require staff to manually collect market quotations, make price inquiries and quotations, and negotiate deals, which consumes a great deal of labor and communication cost. The market is still highly manual and poorly automated, so how to intelligently identify information such as market quotations is a problem to be solved by those skilled in the art.

Disclosure of Invention

The embodiment of the invention provides an unstructured transaction information identification method and system based on natural language processing, and aims to improve the processing efficiency of transaction quotation information.

In a first aspect, an embodiment of the present invention provides an unstructured transaction information identification method based on natural language processing, including:

acquiring a transaction language text corresponding to transaction quotation information sent by a user, and preprocessing the transaction language text;

performing transaction context word segmentation and part-of-speech identification on the preprocessed transaction language text based on the conditional random field;

performing sense group segmentation and paragraph completion processing on the transaction language text according to the transaction context word segmentation result and the part of speech identification result;

extracting part-of-speech characteristics from the transaction context word segmentation result and the part-of-speech recognition result, and performing vectorization modeling on the part-of-speech characteristics through a Bayesian network to perform global optimal recognition on the transaction language text;

and carrying out matrix format verification on the global optimal recognition result, carrying out matrix structuralization processing on the result of meaning group segmentation and paragraph completion processing based on the verification result, and outputting the result of the matrix structuralization processing as a transaction quotation information recognition result.

In a second aspect, an embodiment of the present invention provides an unstructured transaction information identification device based on natural language processing, including:

the preprocessing unit is used for acquiring a transaction language text corresponding to transaction quotation information sent by a user and preprocessing the transaction language text;

the part-of-speech processing unit is used for carrying out transaction context word segmentation and part-of-speech identification on the preprocessed transaction language text based on the conditional random field;

the paragraph processing unit is used for performing sense group segmentation and paragraph completion processing on the transaction language text according to the transaction context word segmentation result and the part of speech identification result;

the optimal recognition unit is used for extracting part-of-speech characteristics from the transaction context word segmentation result and the part-of-speech recognition result, and performing vectorization modeling on the part-of-speech characteristics through a Bayesian network so as to perform global optimal recognition on the transaction language text;

and the first structural processing unit is used for carrying out matrix format verification on the global optimal identification result, carrying out matrix structural processing on the result of meaning group segmentation and paragraph completion processing based on the verification result, and outputting the result of the matrix structural processing as the identification result of the transaction quotation information.

In a third aspect, an embodiment of the present invention provides a computer device for distributed single-node and cluster deployment, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the unstructured transaction information identification method based on natural language processing according to the first aspect (any one of claims 1 to 7), and supports both single-machine computing and multi-machine parallel computing.

In a fourth aspect, the embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the method for identifying unstructured transaction information based on natural language processing according to the first aspect is implemented.

The embodiment of the invention provides an unstructured transaction information identification method and system based on natural language processing, wherein the method comprises the following steps: acquiring a transaction language text corresponding to transaction quotation information sent by a user, and preprocessing the transaction language text; performing transaction context word segmentation and part-of-speech identification on the preprocessed transaction language text based on the conditional random field; performing sense group segmentation and paragraph completion processing on the transaction language text according to the transaction context word segmentation result and the part of speech identification result; extracting part-of-speech characteristics from the transaction context word segmentation result and the part-of-speech recognition result, and performing vectorization modeling on the part-of-speech characteristics through a Bayesian network to perform global optimal recognition on the transaction language text; and carrying out matrix format verification on the global optimal recognition result, carrying out matrix structuralization processing on the result of meaning group segmentation and paragraph completion processing based on the verification result, and outputting the result of the matrix structuralization processing as a transaction quotation information recognition result. The embodiment of the invention analyzes various transaction 'jargon' into specific quotation and transaction information by performing part-of-speech processing, paragraph processing and the like on the transaction quotation information of the user, thereby improving the processing efficiency of the transaction quotation information.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flowchart of an unstructured transaction information identification method based on natural language processing according to an embodiment of the present invention;

Fig. 2 is a schematic sub-flowchart of step S101 in an unstructured transaction information identification method based on natural language processing according to an embodiment of the present invention;

Fig. 3 is a schematic sub-flowchart of step S102 in an unstructured transaction information identification method based on natural language processing according to an embodiment of the present invention;

Fig. 4 is a schematic sub-flowchart of step S103 in an unstructured transaction information identification method based on natural language processing according to an embodiment of the present invention;

Fig. 5 is a schematic sub-flowchart of step S104 in an unstructured transaction information identification method based on natural language processing according to an embodiment of the present invention;

Fig. 6 is a schematic sub-flowchart of step S105 in an unstructured transaction information identification method based on natural language processing according to an embodiment of the present invention;

Fig. 7 is a schematic block diagram of an unstructured transaction information recognition device based on natural language processing according to an embodiment of the present invention;

Fig. 8 is a sub-schematic block diagram of a preprocessing unit in an unstructured transaction information recognition apparatus based on natural language processing according to an embodiment of the present invention;

Fig. 9 is a sub-schematic block diagram of a part-of-speech processing unit in an unstructured transaction information recognition apparatus based on natural language processing according to an embodiment of the present invention;

Fig. 10 is a sub-schematic block diagram of a paragraph processing unit in an unstructured transaction information recognition apparatus based on natural language processing according to an embodiment of the present invention;

Fig. 11 is a sub-schematic block diagram of a dispute correction unit in an unstructured transaction information identification apparatus based on natural language processing according to an embodiment of the present invention;

Fig. 12 is a sub-schematic block diagram of a first structured processing unit in an unstructured transaction information recognition apparatus based on natural language processing according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.

Referring to fig. 1, fig. 1 is a schematic flow chart of an unstructured transaction information identification method based on natural language processing according to an embodiment of the present invention, which specifically includes: steps S101 to S105.

S101, acquiring a transaction language text corresponding to transaction quotation information sent by a user, and preprocessing the transaction language text;

S102, performing transaction context word segmentation and part-of-speech identification on the preprocessed transaction language text based on the conditional random field;

S103, performing sense group segmentation and paragraph completion processing on the transaction language text according to the transaction context word segmentation result and the part-of-speech identification result;

S104, extracting part-of-speech characteristics from the transaction context word segmentation result and the part-of-speech recognition result, and performing vectorization modeling on the part-of-speech characteristics through a Bayesian network to perform global optimal recognition on the transaction language text;

S105, carrying out matrix format verification on the global optimal identification result, carrying out matrix structuralization processing on the result of sense group segmentation and paragraph completion processing based on the verification result, and outputting the result of matrix structuralization processing as the transaction quotation information identification result.

In this embodiment, the transaction quotation information sent by the user is preprocessed into a transaction language text so that sense groups can be perceived and aggregated. The transaction quotation information may also be other securities trading information, such as price inquiry information. Part-of-speech processing, namely transaction context word segmentation and part-of-speech recognition, is then performed on the transaction language text based on the conditional random field; for example, the transaction context to which each word in the transaction language text belongs is determined, and each word is judged to be a verb, a noun, or another part of speech. Paragraph processing, namely sense group segmentation and paragraph completion, is then performed on the transaction language text. A sense group is a component of a sentence in the transaction language text, divided according to meaning and structure; the words within one sense group are closely related and cannot be split arbitrarily without causing misunderstanding. Next, part-of-speech features are extracted from the transaction language text according to the results of part-of-speech processing and paragraph processing, and a vectorization model is built from these features to perform globally optimal recognition of the transaction language text. Finally, matrix format verification is performed on the transaction language text according to the globally optimal recognition result, and the verified result can be output as the recognition result for the transaction language text.

In this embodiment, part-of-speech processing, paragraph processing, and related steps are performed on the user's transaction quotation information, and various kinds of trading "jargon" are parsed into specific quotation and transaction information, thereby improving the processing efficiency of the transaction quotation information. The embodiment can automatically identify price inquiries, quotations, and transaction information in transaction language text generated during negotiation, and can also intelligently correct the transaction language text. In one specific application scenario, the recognition accuracy for transaction language text reaches 99.94%. In another specific application scenario, the embodiment takes on the order of 20 ms on average from capture to structured output for each group of market quotations. In summary, the embodiment can greatly reduce the labor cost of price inquiry and quotation in over-the-counter trading and improve the processing efficiency of transaction quotation information. In addition, the embodiment supports multi-machine distributed deployment, can connect to a large number of chat groups and contact users to process messages in real time, and can be packaged as a high-performance streaming computing unit on a message stream bus.

In one embodiment, as shown in fig. 2, the step S101 includes: steps S201 to S204.

S201, performing rough sense group perception on the transaction language text through a preset transaction word list, and clearing non-quotation information from the transaction language text according to the result of the rough sense group perception;

S202, hashing the transaction language text and constructing an index queue from it using an index message queue technique, so as to achieve streaming short-window de-duplication;

S203, tracing context according to the order of the transaction language texts in the index queue, and aggregating the transaction language texts, with line separators marked, according to the acquisition time of each transaction language text;

S204, combining a decision tree, text punctuation attributes, and the character distribution characteristics between punctuation marks to construct an adaptive sentence-breaking and word-segmentation algorithm, and pre-segmenting the transaction language text using this algorithm.

In this embodiment, the transaction language text is first collected and analyzed: rough sense group perception is performed using a preset transaction verb table in order to identify the market information in the transaction language text. The preset transaction verb table may be extracted with a regular expression matching algorithm and then finalized after traders manually review which verbs are meaningful in the scenario. Next, multi-source data in the transaction language text is de-duplicated. Specifically, each transaction language text is hashed and pushed into a de-duplication queue, and texts whose hash is already present are discarded; a message is popped from the queue once it has stayed there longer than a set timeout or has reached the end of the queue. When the transaction language text is aggregated, the de-duplicated messages from the previous step are aggregated according to each sender's sending interval, and line separators are marked in the aggregated information. The transaction language text is then pre-segmented using the adaptive sentence-breaking and word-segmentation algorithm. Specifically, a decision tree (for example, a C4.5 decision tree) is constructed from the text punctuation attributes and the character distribution characteristics between punctuation marks, and it sets the separation level of each punctuation mark; for example, periods and carriage returns rank higher than commas and pause marks.
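The hash-based de-duplication queue described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name DedupQueue, the use of SHA-256, the timeout, and the queue capacity are all hypothetical choices, not details given in this description.

```python
import hashlib
from collections import OrderedDict


class DedupQueue:
    """Short-window de-duplication keyed on a hash of the message text."""

    def __init__(self, ttl_seconds=60.0, max_size=10000):
        self.ttl_seconds = ttl_seconds   # assumed timeout, not from the source
        self.max_size = max_size         # assumed queue capacity
        self._seen = OrderedDict()       # hash -> arrival time, oldest first

    def _evict(self, now):
        # Pop entries that have timed out or that overflow the queue.
        while self._seen:
            _, arrived = next(iter(self._seen.items()))
            expired = now - arrived > self.ttl_seconds
            overflow = len(self._seen) > self.max_size
            if expired or overflow:
                self._seen.popitem(last=False)
            else:
                break

    def accept(self, text, now):
        """Return True if the text is new within the window, False if it is a duplicate."""
        self._evict(now)
        key = hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
        if key in self._seen:
            return False                 # hash duplicate: discard
        self._seen[key] = now
        self._evict(now)
        return True
```

A caller would pass each incoming message together with its arrival timestamp and keep only the messages for which accept returns True.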

In one embodiment, as shown in fig. 3, the step S102 includes: steps S301 to S306.

S301, constructing target part-of-speech vectors corresponding to the key elements of market quotations based on the transaction context of the transaction language text, and constructing part-of-speech judgment expert rules for the elements of the target part-of-speech vectors one by one using regular expressions, so as to construct a part-of-speech determiner;

S302, performing part-of-speech judgment on the semi-structured information in the transaction language text using the part-of-speech determiner, and correcting the part-of-speech judgment result using a header recognizer based on keyword matching;

S303, embedding the part-of-speech determiner into a conditional random field model, and using a pre-constructed continuous character cursor to feed the unstructured information in the transaction language text character by character so as to perform part-of-speech judgment;

S304, recording the start and end positions of each character in the continuous character cursor, and connecting the start and end positions through the greedy degree of the judgment rule that was hit to obtain a corresponding local condition cost distance;

S305, constructing a corresponding position transfer matrix according to the field word length of the characters, and sequentially replacing the coordinate distance values at the corresponding start and end positions in the position transfer matrix with the local condition cost distances;

S306, based on the coordinate distance values, calculating the shortest path through the position transfer matrix using a shortest path algorithm model, and taking the matched words on the shortest path as the optimal word segmentation scheme, so as to perform transaction context word segmentation and part-of-speech identification on the transaction language text.

In this embodiment, a target part-of-speech vector for market quotations (that is, the types of key quotation elements) is constructed based on the transaction context. Part-of-speech judgment expert rules are built one by one for the elements of the target part-of-speech vector using regular expressions, and are packaged together as a part-of-speech determiner. The determiner allows multiple expert rules to be attached to the same part of speech, and computes a greedy degree t for each rule from the frequency with which existing part-of-speech samples pass that rule: the higher the pass rate, the greedier the rule.
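A rule-based determiner of this kind might look like the sketch below. It assumes that the greedy degree t is simply the fraction of existing samples that pass a rule, which is one reading of the paragraph above; the names PosRule, calibrate, and judge and the example patterns are hypothetical.

```python
import re


class PosRule:
    """One expert rule: a regular expression that assigns a part of speech."""

    def __init__(self, pos, pattern):
        self.pos = pos
        self.regex = re.compile(pattern)
        self.greed = 0.0  # greedy degree t, filled in by calibrate()

    def matches(self, token):
        return self.regex.fullmatch(token) is not None


def calibrate(rules, samples):
    """Set each rule's greedy degree t to its pass rate over the samples (assumed definition)."""
    for rule in rules:
        hits = sum(1 for s in samples if rule.matches(s))
        rule.greed = hits / len(samples) if samples else 0.0


def judge(rules, token):
    """Return (part of speech, t) for every rule that the token passes."""
    return [(r.pos, r.greed) for r in rules if r.matches(token)]


# Hypothetical rules: a decimal "price" and a looser "number".
rules = [PosRule("price", r"\d+\.\d{1,4}"), PosRule("number", r"\d+(\.\d+)?")]
calibrate(rules, ["2.85", "3000", "2.9", "abc"])
print(judge(rules, "2.85"))
```

The looser "number" rule passes more samples, so it ends up with the larger greedy degree, which matches the statement that a higher pass rate means a greedier rule.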

For semi-structured messages whose word segments are well separated, the part-of-speech determiner can be applied directly, and its result can be corrected by a header recognizer based on keyword matching. For unstructured messages, for example when paragraph and punctuation separation is unclear or irregular, a stuck-word separation algorithm is used: a continuous character cursor is constructed, the part-of-speech determiner is embedded into a conditional random field (CRF) model, and the word segment to be recognized is fed in character by character. Each time the cursor moves, part-of-speech judgment is performed at the cursor, the start and end positions of each judged word are recorded, and the start and end positions are connected using the greedy degree t of the judgment rule that was hit, forming a local condition cost distance d; the greedier the match, the higher the distance cost. Meanwhile, assuming the word segment to be recognized has length n, a position transfer matrix of size (n+1) × (n+1) is constructed in which the transfer distance between adjacent positions is a large number, and the local condition cost distances d matched by the cursor then replace, one by one, the distance values at the corresponding start and end coordinates of the matrix. After the cursor has traversed the whole word segment, the shortest path through the position transfer matrix is computed using a shortest path algorithm model (for example, Dijkstra's algorithm), and the matched words on the shortest path are taken as the optimal word segmentation scheme.

While word segments are being recognized continuously, the words that fit the quotation part-of-speech vector are recorded after segmentation until an attribute repeats in the current vector, at which point a sentence break is made. This allows large-scale real-time message processing while retaining a degree of intelligence.
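The position transfer matrix and shortest-path step can be illustrated with the following sketch. It takes the matched spans and their cost distances d as given inputs (how d is derived from the greedy degree t is not reproduced here), and the constant BIG, the function names, and the example spans are assumptions made for illustration.

```python
import heapq

BIG = 1000.0  # assumed "large number" for stepping over one unmatched character


def build_matrix(n, matches):
    """matches: iterable of (start, end, cost, pos) with 0 <= start < end <= n."""
    inf = float("inf")
    dist = [[inf] * (n + 1) for _ in range(n + 1)]
    label = {}
    for i in range(n):
        dist[i][i + 1] = BIG              # adjacent positions start at a large distance
    for start, end, cost, pos in matches:
        if cost < dist[start][end]:
            dist[start][end] = cost       # replace with the local condition cost distance d
            label[(start, end)] = pos
    return dist, label


def shortest_segmentation(text, matches):
    """Dijkstra from position 0 to position n; returns [(word, pos), ...]."""
    n = len(text)
    dist, label = build_matrix(n, matches)
    best = [float("inf")] * (n + 1)
    prev = [None] * (n + 1)
    best[0] = 0.0
    heap = [(0.0, 0)]
    while heap:
        d, i = heapq.heappop(heap)
        if d > best[i]:
            continue                      # stale heap entry
        for j in range(i + 1, n + 1):
            w = dist[i][j]
            if d + w < best[j]:
                best[j] = d + w
                prev[j] = i
                heapq.heappush(heap, (best[j], j))
    # Walk back from the end to recover the words on the shortest path.
    segments, j = [], n
    while j > 0:
        i = prev[j]
        segments.append((text[i:j], label.get((i, j), "unknown")))
        j = i
    return segments[::-1]


# Hypothetical spans found by the cursor: (start, end, cost d, part of speech).
spans = [(0, 3, 1.0, "side"), (3, 7, 1.0, "price"), (3, 4, 2.0, "number")]
print(shortest_segmentation("bid2.85", spans))
```

Because unmatched single-character steps cost BIG, the path through the two full matches is far cheaper than any path that splits "2.85", so the stuck-together text is cut into "bid" and "2.85".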

In one embodiment, as shown in fig. 4, the step S103 includes: steps S401 to S403.

S401, performing context distance calculation on the target part-of-speech vector according to the transaction context word segmentation result and the part-of-speech identification result, and calculating a context difference degree using an edit distance algorithm;

S402, judging the context difference degree using a cumulative loss method, and separating the language segments of the transaction language text according to the judgment result;

S403, performing part-of-speech alignment on the transaction language text sentence by sentence, and completing the missing values in each sentence by filling in the previous value.

In this embodiment, based on the transaction context word segmentation result and the part-of-speech recognition result obtained in the preceding steps, a context distance is calculated over the constructed target part-of-speech vector: the context difference degree is computed with an edit distance algorithm (for example, the Levenshtein distance) and is then judged with a cumulative loss method. That is, if the context difference degree is zero, the accumulated loss is cleared; if it is not zero, the loss is accumulated, and whether the sense group is cut off is decided according to the size of the context difference degree, so as to separate the language segments. For the separated language segments, part-of-speech alignment is performed sentence by sentence, and missing values are completed by filling in the previous value.
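The three ideas in this step (edit distance between part-of-speech sequences, a cumulative loss that is cleared on zero difference, and previous-value filling) can be sketched as below. The cut-off threshold and all function and field names are assumptions; the description does not state how large the accumulated loss must be before a segment is cut.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (here, part-of-speech tag lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]


def split_segments(tag_rows, threshold=3):
    """Start a new segment once the accumulated difference reaches the (assumed) threshold."""
    segments, current, loss = [], [], 0
    for row in tag_rows:
        if current:
            diff = levenshtein(current[-1], row)
            loss = 0 if diff == 0 else loss + diff    # zero difference clears the loss
            if loss >= threshold:
                segments.append(current)
                current, loss = [], 0
        current.append(row)
    if current:
        segments.append(current)
    return segments


def forward_fill(rows, fields):
    """Complete a missing field with the most recent earlier value."""
    last, filled = {}, []
    for row in rows:
        out = {}
        for f in fields:
            value = row.get(f)
            if value is None:
                value = last.get(f)
            if value is not None:
                last[f] = value
            out[f] = value
        filled.append(out)
    return filled
```

For example, forward_fill([{"bond": "A", "price": "2.85"}, {"price": "2.90"}], ["bond", "price"]) carries the bond name "A" down into the second row.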

In one embodiment, the step S104 includes:

constructing a sense group dispute part-of-speech recognizer to correct the group dispute problems arising from the part-of-speech recognition result, and handling the correction stage by stage, in classes, based on priority rules.

In this embodiment, because group dispute problems often arise during part-of-speech recognition, a sense group dispute part-of-speech recognizer needs to be constructed to correct them. Specifically, a set of priority rules can be used to route group dispute problems into classes for handling. In one embodiment, the group dispute problems can be handled in three classes (or stages): header feature recognition, value feature recognition, and statistical distribution feature recognition. If a dispute cannot be resolved effectively at the current stage, it is passed to the next stage.
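The staged fall-through described above amounts to a priority cascade, which might be sketched as follows. The three resolver functions, their keywords, and their numeric ranges are hypothetical placeholders chosen only to make the example run; the statistical stage is a stand-in for whatever trained classifier is used.

```python
def by_header(candidate):
    """Stage 1: look for a key word in the header (hypothetical keywords)."""
    header = candidate.get("header", "").lower()
    for keyword, tag in (("yield", "yield"), ("amount", "amount")):
        if keyword in header:
            return tag
    return None


def by_value(candidate):
    """Stage 2: numeric boundary constraints (hypothetical ranges)."""
    try:
        v = float(candidate["value"])
    except (KeyError, TypeError, ValueError):
        return None
    if 0 < v < 20:        # assumed plausible range for a yield in percent
        return "yield"
    if v >= 100:          # assumed: large numbers read as amounts
        return "amount"
    return None


def by_statistics(candidate):
    """Stage 3: stand-in for the output of a trained statistical classifier."""
    return candidate.get("majority_tag")


RESOLVERS = [("header", by_header), ("value", by_value), ("statistics", by_statistics)]


def resolve_dispute(candidate, resolvers=RESOLVERS):
    """Try each stage in priority order; the first stage that answers wins."""
    for name, resolver in resolvers:
        tag = resolver(candidate)
        if tag is not None:
            return tag, name
    return None, None
```

Returning the stage name alongside the tag makes it easy to audit which level settled each dispute.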

Further, in an embodiment, as shown in fig. 5, constructing the sense group dispute part-of-speech recognizer to correct the group dispute problems arising from the part-of-speech recognition result, and handling the correction stage by stage based on priority rules, includes steps S501 to S504.

S501, performing header feature identification on the group dispute problem in the unstructured message through header key character matching;

S502, performing category correction on the group dispute problem by combining numerical value domain boundary constraint of the group dispute problem and text enumeration type assertion to achieve value characteristic identification;

S503, acquiring a text type and a numerical type corresponding to the group dispute problem in the structured information, performing distribution statistics on the text type and the numerical type, and labeling a part-of-speech tag to the group dispute problem according to a distribution statistical result;

S504, constructing a corresponding dispute feature matrix based on the part-of-speech labels, training the dispute feature matrix by using a supervised model to obtain a corresponding part-of-speech classification result, and taking the part-of-speech classification result as a global optimal recognition result.

In this embodiment, header feature recognition is implemented through header key character matching, and header features are usually mined from redundant information in an unstructured message. In value feature recognition, category correction is achieved through numerical domain boundary constraints and text enumeration type assertions for the specific part of speech involved in the group dispute problem. In distribution feature recognition, distribution statistics are computed for the text type and the numerical type of the group dispute problem: for the text type, features such as character frequency, field length, and proportion of digits are counted; for the numerical type, features such as median, variance, range, and coefficient of variation gain are counted. After part-of-speech tags are labeled according to the distribution statistics, a dispute feature matrix can be constructed. A fast supervised model such as naive Bayes and/or a CART tree is then trained on the dispute feature matrix, and the trained model is used to classify the part of speech of the group dispute problems and form the final decision.
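The distribution statistics listed above could be computed along the lines of the sketch below, whose output rows would then form the dispute feature matrix fed to a supervised model. The exact feature definitions are assumptions: in particular, only a plain coefficient of variation is computed, and "character frequency" is approximated by the relative frequency of the most common character. Both functions assume a non-empty list of values.

```python
import statistics
from collections import Counter


def text_features(values):
    """Features for a text-type column (assumes a non-empty list of strings)."""
    chars = "".join(values)
    counts = Counter(chars)
    digits = sum(c.isdigit() for c in chars)
    return {
        "mean_length": sum(len(v) for v in values) / len(values),
        "digit_ratio": digits / len(chars) if chars else 0.0,
        "top_char_freq": counts.most_common(1)[0][1] / len(chars) if chars else 0.0,
    }


def numeric_features(values):
    """Features for a numeric-type column (assumes a non-empty list of floats)."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    return {
        "median": statistics.median(values),
        "variance": statistics.pvariance(values),
        "range": max(values) - min(values),
        "coef_variation": stdev / mean if mean else 0.0,
    }


print(text_features(["AAA", "AA+"]))
print(numeric_features([2.0, 4.0, 6.0]))
```

Population variance and standard deviation are used here for simplicity; a sample-based variant would work equally well as long as training and prediction use the same definition.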

The embodiment can effectively improve the accuracy of market recognition, and in a specific application scene, the accuracy of market recognition can be improved to more than 99.9% from about 95%, so that the whole system can be put into production reliably.

In one embodiment, as shown in fig. 6, the step S105 includes: steps S601 to S604.

S601, constructing a part-of-speech projection matrix according to the result of sense group segmentation and paragraph completion processing, calculating an inner product between the part-of-speech projection matrix and a preset check vector, and judging from the inner product whether the basic market quotation elements of the transaction language text are complete;

S602, performing matrix structuring on the part-of-speech projection matrix by means of common attribute aggregation, non-common column data alignment, or column reordering;

S603, performing standardized conversion on the parts of speech in the transaction language text, translating the word values corresponding to each part of speech and normalizing their numerical values into a standardized data structure that meets the production specification;

S604, outputting the standardized data structure through a plurality of asynchronous threads.

In this embodiment, the matrix format check is mainly performed by constructing a check vector T and calculating the inner product p of T with the part-of-speech projection matrix formed from the completed sense-group paragraphs. The value of p indicates whether the basic market quotation elements are complete; if p is not zero, the matrix that has passed the format check can be output. The output format protocol structures the sense-group matrix mainly through common attribute aggregation, non-common column data alignment, column reordering, and similar operations. In addition, an automatic part-of-speech translator is constructed to perform standardized conversion of the word values of each part of speech: using the tags produced during recognition and the accumulated expert translation rules, the word values are translated and numerically normalized into a standardized data structure that conforms to the production specification. The translated results are pushed to a message queue in real time by a plurality of asynchronous threads, and are output quickly by writing them to a database and/or to an archive file.
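The check-vector test can be sketched as follows. The specification states only that the inner product p is tested against zero and does not say how T is built, so this sketch rests on assumptions: the element names are hypothetical, T is taken to be a 0/1 vector that flags the required parts of speech, and a row is treated as complete when its inner product with T equals the number of flagged elements, which is a stricter test than a nonzero value.

```python
# Hypothetical part-of-speech columns of the projection matrix.
COLUMNS = ["code", "direction", "price", "amount"]


def projection_row(tagged_sentence, columns=COLUMNS):
    """Return 1 where the sentence carries a value for the column, else 0."""
    return [1 if tagged_sentence.get(col) not in (None, "") else 0 for col in columns]


def inner_product(row, check_vector):
    """Plain dot product between one projection row and the check vector T."""
    return sum(r * t for r, t in zip(row, check_vector))


def is_complete(row, check_vector):
    # Assumed criterion: every element flagged in T must be present in the row.
    return inner_product(row, check_vector) == sum(check_vector)
```

With T = [1, 1, 1, 0], a sentence that carries a code, a direction, and a price passes the check even when the amount is absent.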

In this embodiment, automatic part-of-speech translation and numerical normalization are performed on the transaction language text, so that quotation and transaction information is generated in structured form and pushed in real time.
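Output through a plurality of asynchronous threads might look like the sketch below. It is an illustration under assumptions: an in-memory `queue.Queue` stands in for the real message queue, the function name `publish` and the JSON encoding are hypothetical, and writing to a database or archive file is omitted.

```python
import json
import queue
import threading


def publish(records, num_workers=3):
    """Serialize records in several worker threads and push them to a queue."""
    inbox = queue.Queue()
    outbox = queue.Queue()  # stand-in for a real message queue (assumption)
    for record in records:
        inbox.put(record)

    def worker():
        # Each worker drains the inbox until no records are left.
        while True:
            try:
                record = inbox.get_nowait()
            except queue.Empty:
                return
            outbox.put(json.dumps(record, sort_keys=True))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    # All workers have finished, so the queue size is stable here.
    return [outbox.get() for _ in range(outbox.qsize())]
```

The order of the returned messages depends on thread scheduling, so a consumer that needs ordering would have to carry a sequence number in each record.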

Fig. 7 is a schematic block diagram of an unstructured transaction information recognition apparatus 700 based on natural language processing according to an embodiment of the present invention, where the apparatus 700 includes:

a preprocessing unit 701, configured to acquire a transaction language text corresponding to transaction quotation information sent by a user, and to preprocess the transaction language text;

a part-of-speech processing unit 702, configured to perform transaction context word segmentation and part-of-speech recognition on the preprocessed transaction language text based on a conditional random field;

a paragraph processing unit 703, configured to perform sense-group segmentation and paragraph completion processing on the transaction language text according to the transaction context word segmentation result and the part-of-speech recognition result;

an optimal recognition unit 704, configured to extract part-of-speech features from the transaction context word segmentation result and the part-of-speech recognition result, and perform vectorization modeling on the part-of-speech features through a Bayesian network, so as to perform global optimal recognition on the transaction language text;

a first structuring processing unit 705, configured to perform matrix format verification on the global optimal recognition result, perform matrix structuring processing on the result of sense-group segmentation and paragraph completion processing based on the verification result, and output the result of the matrix structuring processing as the transaction quotation information recognition result.

In one embodiment, as shown in fig. 8, the preprocessing unit 701 includes:

a perception unit 801, configured to perform rough sense-group perception on the transaction language text through a preset transaction word list, and to clear non-quotation information from the transaction language text according to the result of the rough sense-group perception;

a deduplication unit 802, configured to construct an index queue from the transaction language text through a hash algorithm by using an index message queue technology, so as to implement streaming short-time deduplication;

an aggregation unit 803, configured to perform context tracing according to the ordering of the transaction language texts in the index queue, and to aggregate the transaction language texts by marking line separators according to the acquisition time of the transaction language texts;

and a pre-segmentation unit 804, configured to construct an adaptive sentence-segmentation and word-segmentation algorithm by combining a decision tree, text punctuation attributes, and the character distribution characteristics between punctuation marks, and to pre-segment the transaction language text by using the adaptive sentence-segmentation and word-segmentation algorithm.
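Streaming short-time deduplication through a hash index can be sketched as below. The class name, the 60-second window, and the use of SHA-256 are assumptions made for illustration; the specification does not name a hash function or a window length, and a production system would keep the index in a message queue rather than in process memory.

```python
import hashlib
from collections import OrderedDict


class ShortTimeDeduplicator:
    """Drop messages whose hash was already seen within a sliding time window."""

    def __init__(self, window_seconds=60.0):
        self.window = window_seconds
        # digest -> time first seen; insertion order is time order
        # as long as timestamps do not decrease.
        self._seen = OrderedDict()

    def is_new(self, text, now):
        # Expire entries that have fallen out of the window.
        while self._seen:
            oldest_digest = next(iter(self._seen))
            if now - self._seen[oldest_digest] <= self.window:
                break
            self._seen.popitem(last=False)
        digest = hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
        if digest in self._seen:
            return False
        self._seen[digest] = now
        return True
```

The caller supplies the timestamp, for example the acquisition time of the message, which keeps the behaviour deterministic and easy to test.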

In one embodiment, as shown in fig. 9, part-of-speech processing unit 702 includes:

a constructing unit 901, configured to construct a target part-of-speech vector corresponding to the key elements of the transaction quotation based on the transaction context of the transaction language text, and to construct part-of-speech determination expert rules for the elements in the target part-of-speech vector one by one based on regular expressions, so as to construct a part-of-speech determiner;

a correction unit 902, configured to perform part-of-speech determination on the semi-structured information in the transaction language text by using the part-of-speech determiner, and to correct the part-of-speech determination result by using a keyword-matching header recognizer;

a judging unit 903, configured to embed the part-of-speech determiner into a conditional random field model, and to use a pre-constructed continuous character cursor to consume the unstructured information in the transaction language text character by character so as to perform part-of-speech determination;

a connection unit 904, configured to record the start and end positions of each character in the continuous character cursor, and to connect the start and end positions of each character according to the greediness of the matched determination rule, so as to obtain a corresponding local conditional cost distance;

a replacing unit 905, configured to construct a corresponding position transfer matrix according to the field word length of the characters, and to sequentially replace the coordinate distance values of the corresponding start and end positions in the position transfer matrix with the local conditional cost distances;

and a path calculation unit 906, configured to calculate the shortest path of the position transfer matrix through a shortest path algorithm model based on the coordinate distance values, and to use the matched words on the shortest path as the optimal word segmentation scheme, so as to perform transaction context word segmentation and part-of-speech recognition on the transaction language text.
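The combination of regular-expression rules, start and end positions, and a shortest path can be sketched as follows. This is a simplified illustration, not the described conditional random field model: the rules, tags, and costs are hypothetical, every rule match is treated as an edge from its start position to its end position, and the shortest path is found by dynamic programming over positions, which is valid because all edges point forward.

```python
import re

# Hypothetical expert rules: (tag, pattern, cost of using a match as one token).
RULES = [
    ("DIRECTION", re.compile(r"bid|ofr"), 1.0),
    ("PRICE", re.compile(r"\d+\.\d+"), 1.0),
    ("AMOUNT", re.compile(r"\d+[kw]"), 1.0),
    ("CODE", re.compile(r"\d{6}"), 1.0),
]
UNKNOWN_COST = 5.0  # assumed penalty for a character that no rule explains


def segment(text):
    """Return (token, tag) pairs along the cheapest path from 0 to len(text)."""
    n = len(text)
    best = [float("inf")] * (n + 1)
    back = [None] * (n + 1)
    best[0] = 0.0
    for i in range(n):
        # Fallback edge: consume a single character as an unknown token.
        edges = [(i + 1, "UNK", UNKNOWN_COST)]
        for tag, pattern, cost in RULES:
            match = pattern.match(text, i)
            if match and match.end() > i:
                edges.append((match.end(), tag, cost))
        for j, tag, cost in edges:
            if best[i] + cost < best[j]:
                best[j] = best[i] + cost
                back[j] = (i, tag)
    # Walk the back-pointers from the end of the text to recover the tokens.
    tokens = []
    j = n
    while j > 0:
        i, tag = back[j]
        tokens.append((text[i:j], tag))
        j = i
    return tokens[::-1]
```

Because unknown characters are expensive, the cheapest path prefers segmentations in which whole fields are explained by a rule.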

In one embodiment, as shown in fig. 10, the paragraph processing unit 703 includes:

a context distance calculation unit 1001, configured to perform context distance calculation on the target part-of-speech vector according to the transaction context word segmentation result and the part-of-speech recognition result, and to calculate a context difference degree by using an edit distance algorithm;

a difference degree determination unit 1002, configured to evaluate the context difference degree by using a cumulative loss method, and to separate the transaction language text into segments according to the evaluation result;

a missing value filling unit 1003, configured to perform part-of-speech alignment on the transaction language text sentence by sentence, and to fill in the missing values in each sentence by previous-value filling.
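Two of the operations above, the edit distance between part-of-speech sequences and previous-value filling, can be sketched as follows. The function names and the dictionary representation of a sentence are assumptions; the cumulative loss method used to decide segment boundaries is not shown because the specification does not define it.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences, e.g. lists of part-of-speech tags."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (x != y),  # substitution, free when equal
            ))
        prev = curr
    return prev[-1]


def forward_fill(rows, columns):
    """Fill a missing value with the most recent earlier value in the same column."""
    last = {}
    filled = []
    for row in rows:
        out = {}
        for col in columns:
            value = row.get(col)
            if value is None:
                value = last.get(col)  # stays None if nothing was seen yet
            else:
                last[col] = value
            out[col] = value
        filled.append(out)
    return filled
```

In practice the fill would be restarted at each segment boundary so that values do not leak from one segment into the next.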

In one embodiment, the optimal identification unit 704 includes:

a dispute correction unit, configured to construct a sense-group dispute part-of-speech recognizer to correct the sense-group dispute problems arising from the part-of-speech recognition result, and to classify step by step during the correction process based on a priority rule.

In one embodiment, as shown in FIG. 11, the dispute correction unit comprises:

a header feature identification unit 1101, configured to perform header feature identification on the sense-group dispute problem in the unstructured message through header key-character matching;

a value feature identification unit 1102, configured to perform category correction on the sense-group dispute problem by combining the numerical value-domain boundary constraints of the sense-group dispute problem with text enumeration-type assertions, so as to achieve value feature identification;

a distribution feature identification unit 1103, configured to acquire the text type and the numerical type corresponding to the sense-group dispute problem in the structured information, perform distribution statistics on the text type and the numerical type, and label the sense-group dispute problem with a part-of-speech tag according to the distribution statistics;

and a matrix construction unit 1104, configured to construct a corresponding dispute feature matrix based on the part-of-speech tags, train a supervised model on the dispute feature matrix to obtain a corresponding part-of-speech classification result, and use the part-of-speech classification result as the global optimal recognition result.
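Category correction by value-domain boundary constraints and enumeration assertions can be illustrated with the sketch below. Every enumeration, category name, and numeric range in it is hypothetical and chosen only to make the example concrete; the specification does not list the actual rules.

```python
# Hypothetical enumerations; the specification does not list the real ones.
DIRECTIONS = {"bid", "ofr"}
RATINGS = {"AAA", "AA+", "AA", "AA-"}


def correct_category(token):
    """Assign a category to a disputed token using enumerations, then value ranges."""
    # Enumeration assertions are checked first because they are unambiguous.
    if token in DIRECTIONS:
        return "DIRECTION"
    if token in RATINGS:
        return "RATING"
    try:
        value = float(token)
    except ValueError:
        return "UNKNOWN"
    # Assumed value domains: yields are small numbers, net prices sit near par.
    if 0.0 < value < 20.0:
        return "YIELD"
    if 50.0 <= value <= 150.0:
        return "NET_PRICE"
    return "UNKNOWN"
```

Checking the rules in a fixed order is one simple way to realize a priority rule: an earlier rule that fires prevents later, weaker rules from being consulted.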

In one embodiment, as shown in fig. 12, the first structuring processing unit 705 includes:

an inner product calculation unit 1201, configured to construct a part-of-speech projection matrix according to the result of sense-group segmentation and paragraph completion processing, calculate an inner product between the part-of-speech projection matrix and a preset check vector, and then determine whether the basic market quotation elements of the transaction language text are complete according to the inner product;

a second structuring processing unit 1202, configured to perform matrix structuring processing on the part-of-speech projection matrix through common attribute aggregation, non-common column data alignment, or column reordering;

a standardization conversion unit 1203, configured to perform standardized conversion on the parts of speech in the transaction language text, and to translate the word values corresponding to the parts of speech and normalize their numerical values into a standardized data structure meeting the production specification;

an output unit 1204 for outputting the standardized data structure by a plurality of asynchronous threads.
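The three structuring operations named above (common attribute aggregation, non-common column data alignment, and column reordering) can be sketched on rows represented as dictionaries. The function name, the dictionary representation, and the rule that unlisted columns are appended in alphabetical order are assumptions made for illustration.

```python
def structure(rows, column_order):
    """Split rows into shared attributes plus an aligned, ordered table."""
    present = {col for row in rows for col in row}
    # Column reordering: preferred order first, then any extra columns.
    ordered = [col for col in column_order if col in present]
    ordered += sorted(present - set(column_order))

    # Common attribute aggregation: columns with one shared value in every row.
    common = {}
    if len(rows) > 1:
        for col in ordered:
            values = [row.get(col) for row in rows]
            if values[0] is not None and all(v == values[0] for v in values):
                common[col] = values[0]

    # Non-common column alignment: every row gets the same columns,
    # with None where a row has no value.
    table_columns = [col for col in ordered if col not in common]
    table = [[row.get(col) for col in table_columns] for row in rows]
    return common, table_columns, table
```

For two quotations that share the direction "bid" but differ in code, price, and amount, the direction is lifted into the common attributes and the remaining columns form a rectangular table.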

Since the embodiments of the apparatus portion and the method portion correspond to each other, please refer to the description of the embodiments of the method portion for the embodiments of the apparatus portion, which is not repeated here.

Embodiments of the present invention also provide a computer-readable storage medium on which a computer program is stored; when the computer program is executed, the steps provided by the above embodiments can be implemented. The storage medium may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

The embodiment of the invention also provides a distributed computer single-point and clustering deployment device, which can comprise a memory and a processor, wherein the memory stores a computer program, and the steps provided by the embodiment can be realized when the processor calls the computer program in the memory. Of course, the computer device may also include various network interfaces, power supplies, and other components, and support single-computer computing and multi-computer parallel computing to implement the steps provided by the above embodiments.

The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description. It should be noted that, for those skilled in the art, it is possible to make several improvements and modifications to the present application without departing from the principle of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.

It is further noted that, in the present specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims

1. An unstructured transaction information identification method based on natural language processing is characterized by comprising the following steps:

2. The unstructured transaction information identification method based on natural language processing according to claim 1, wherein the acquiring and preprocessing transaction language text corresponding to transaction offer information sent by a user comprises:

performing rough sense-group perception on the transaction language text through a preset transaction word list, and clearing non-quotation information from the transaction language text according to the result of the rough sense-group perception;

constructing an index queue from the transaction language text through a hash algorithm by using an index message queue technology, so as to implement streaming short-time deduplication;

performing context tracing according to the ordering of the transaction language texts in the index queue, and aggregating the transaction language texts by marking line separators according to the acquisition time of the transaction language texts;

and combining the decision tree, the text punctuation attributes and the character distribution characteristics among punctuations to construct an adaptive sentence-segmentation and word-segmentation algorithm, and pre-segmenting the transaction language text into word segments by using the adaptive sentence-segmentation and word-segmentation algorithm.

3. The unstructured transaction information recognition method based on natural language processing as claimed in claim 1, wherein the performing of transaction context word segmentation and part-of-speech recognition on the preprocessed transaction language text based on a conditional random field comprises:

constructing a target part-of-speech vector corresponding to the key elements of the transaction quotation based on the transaction context of the transaction language text, and constructing part-of-speech determination expert rules for the elements in the target part-of-speech vector one by one based on regular expressions, so as to construct a part-of-speech determiner;

performing part-of-speech determination on the semi-structured information in the transaction language text by using the part-of-speech determiner, and correcting the part-of-speech determination result by using a keyword-matching header recognizer;

embedding the part-of-speech determiner into a conditional random field model, and using a pre-constructed continuous character cursor to consume the unstructured information in the transaction language text character by character so as to perform part-of-speech determination;

recording the start and end positions of each character in the continuous character cursor, and connecting the start and end positions of each character according to the greediness of the matched determination rule, so as to obtain a corresponding local conditional cost distance;

constructing a corresponding position transfer matrix according to the field word length of the character, and sequentially replacing the coordinate distance values of the corresponding start and end positions in the position transfer matrix by using the local condition cost distance;

and calculating the shortest path of the position transfer matrix through a shortest path algorithm model based on the coordinate distance value, and taking the matching words on the shortest path as an optimal word segmentation scheme so as to perform transaction context word segmentation and part of speech identification on the transaction language text.

4. The unstructured transaction information recognition method based on natural language processing as claimed in claim 3, wherein said performing sense-group segmentation and paragraph completion processing on said transaction language text according to the transaction context word segmentation result and the part-of-speech recognition result comprises:

according to the transaction context word segmentation result and the part of speech identification result, performing context distance calculation on the target part of speech vector, and calculating the context difference degree by adopting an edit distance algorithm;

judging the context difference degree by adopting an accumulative loss method, and separating language segments of the transaction language text according to a judgment result;

and performing part-of-speech alignment on the transaction language text sentence by sentence, and filling up missing values in each sentence in a previous value filling mode.

5. The unstructured transaction information recognition method based on natural language processing as claimed in claim 1, wherein said extracting part-of-speech features from said transaction context segmentation result and part-of-speech recognition result, and vectorially modeling said part-of-speech features through bayesian network to perform global optimal recognition on said transaction language text, comprises:

6. The unstructured transaction information recognition method based on natural language processing as claimed in claim 5, wherein the constructing of a sense-group dispute part-of-speech recognizer to correct the sense-group dispute problems arising from the part-of-speech recognition result, and the step-by-step classification of the correction process based on a priority rule, comprises:

performing header feature identification on the sense-group dispute problem in the unstructured message through header key-character matching;

performing category correction on the sense-group dispute problem by combining the numerical value-domain boundary constraints of the sense-group dispute problem with text enumeration-type assertions, so as to achieve value feature identification;

acquiring the text type and the numerical type corresponding to the sense-group dispute problem in the structured information, performing distribution statistics on the text type and the numerical type, and labeling the sense-group dispute problem with a part-of-speech tag according to the distribution statistics;

and constructing a corresponding dispute feature matrix based on the part-of-speech tags, training a supervised model on the dispute feature matrix to obtain a corresponding part-of-speech classification result, and taking the part-of-speech classification result as the global optimal recognition result.

7. The unstructured transaction information recognition method based on natural language processing of claim 6, wherein the performing of matrix format verification on the global optimal recognition result and the performing of matrix structuring processing on the result of sense-group segmentation and paragraph completion processing based on the verification result comprises:

constructing a part-of-speech projection matrix according to the result of the sense-group segmentation and paragraph completion processing, calculating an inner product between the part-of-speech projection matrix and a preset check vector, and judging whether the basic market quotation elements of the transaction language text are complete according to the inner product;

performing matrix structuring processing on the part-of-speech projection matrix by common attribute aggregation, non-common column data alignment, or column reordering;

performing standardized conversion on the parts of speech in the transaction language text, and translating the word values corresponding to the parts of speech and normalizing their numerical values into a standardized data structure meeting the production specification;

outputting the standardized data structure through a plurality of asynchronous threads.

8. An unstructured transaction information recognition device based on natural language processing, comprising:

9. A distributed computer single-point and clustered deployment apparatus, comprising a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor implements the unstructured transaction information recognition method based on natural language processing according to any one of claims 1 to 7 when executing the computer program, and supports single-computer computing and multi-computer parallel computing to implement the unstructured transaction information recognition method based on natural language processing according to any one of claims 1 to 7.

10. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and when executed by a processor, the computer program implements the natural language processing based unstructured transaction information identification method according to any one of claims 1 to 7.