CN110765889B - Feature extraction method, related device and storage medium for legal document - Google Patents
Feature extraction method, related device and storage medium for legal document
- Publication number
- CN110765889B (application CN201910936787.5A)
- Authority
- CN
- China
- Prior art keywords
- document
- feature extraction
- legal
- paragraph
- paragraphs
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/18—Legal services
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
Abstract
The method comprises: pre-identifying a legal document, and determining a paragraph division model and a feature extraction model corresponding to the legal document, wherein the feature extraction model comprises a correspondence between document paragraphs and document elements; dividing the legal document into document paragraphs through the paragraph division model; and extracting, through the feature extraction model, the document elements corresponding to the document paragraphs from the divided legal document, and outputting the extraction result of the document elements.
Description
Technical Field
The present application relates to the field of electronic technologies, and in particular, to a method for extracting features of legal documents, a related device, and a storage medium.
Background
With the continuous improvement of China's legal system, public awareness of rights protection keeps growing. Legal services play an increasingly important role in daily life and are a necessary part of every industry, and "Internet Plus" legal platforms of all kinds have sprung up like bamboo shoots after rain. Legal service, however, is a highly personalized and specialized industry, and it places higher requirements on "Internet Plus" solutions.
Legal documents contain rich legal concepts and legal logic. Deconstructing a case document helps the user quickly grasp the elements of the case's claims.
In the prior art, the elements obtained by deconstructing legal documents are simple: only basic document type classification can be achieved, complete legal logic is lacking, and it is difficult to provide information that effectively organizes the case.
Disclosure of Invention
The embodiment of the application provides a feature extraction method, an electronic device and a computer readable storage medium for legal documents, which are used for deconstructing the content of specific document elements of the legal documents.
The feature extraction method of the legal document provided by the first aspect of the embodiment of the application comprises the following steps:
pre-identifying a legal document, and determining a paragraph division model and a feature extraction model corresponding to the legal document, wherein the feature extraction model comprises a correspondence between document paragraphs and document elements;
dividing the legal document into document paragraphs through the paragraph division model; and
extracting, through the feature extraction model, the document elements corresponding to the document paragraphs from the divided legal document, and outputting the extraction result of the document elements.
In one implementation of the embodiment of the present application, before the pre-identifying the legal document, the method further includes:
preprocessing the legal document, the preprocessing comprising at least one of:
abnormal line break processing, Chinese monetary amount processing, conversion of Chinese numerals into Arabic numerals, unification of punctuation formats, illegal character replacement, and wrongly written character processing.
In one implementation manner of the embodiment of the present application, the pre-identifying the legal document, determining a paragraph division model and a feature extraction model corresponding to the legal document, includes:
identifying a document title of the legal document;
determining a document type corresponding to the legal document according to the document title;
and determining a paragraph division model corresponding to the document type and a feature extraction model corresponding to the paragraph division model.
In one implementation of the embodiment of the present application, extracting document elements from the divided legal document through the feature extraction model includes:
acquiring a document paragraph of the divided legal document and taking it as the input of the feature extraction model, wherein the feature extraction model comprises a plurality of document element rules;
splitting the document paragraph into sentences at punctuation marks, the resulting sentences forming a sentence sequence;
screening, from the feature extraction model, the document element rules corresponding to the document paragraph;
reading the sentences one by one in sequence order, and performing feature matching on each sentence using the document element rules corresponding to the document paragraph; when a document element rule matches, outputting the corresponding document element and moving on to the next sentence, until all sentences in the sentence sequence have been matched.
In one implementation of the embodiment of the present application, the feature extraction model includes: a TextCNN network, a TextRNN network and a TextRCNN network;
the TextCNN network, the TextRNN network and the TextRCNN network are arranged in parallel, and the output ends of the three networks are connected;
the three networks receive the same input, namely symbolized document paragraphs; each network outputs a label identification result, and the three label identification results are added and averaged to obtain the output of the feature extraction model.
A feature extraction device of a legal document provided in a second aspect of the embodiment of the present application includes:
The pre-recognition unit is used for pre-recognizing the legal document and determining a paragraph division model and a feature extraction model corresponding to the legal document; the feature extraction model comprises a corresponding relation between a document paragraph and a document element;
the paragraph dividing unit is used for dividing the document paragraphs of the legal document through the paragraph dividing model;
And the feature extraction unit is used for extracting the document elements corresponding to the document paragraphs from the legal document divided by the document paragraphs through the feature extraction model and outputting the extraction result of the document elements.
In one implementation of the embodiment of the present application, the apparatus further includes: a preprocessing unit;
The preprocessing unit is used for preprocessing the legal document, and the preprocessing comprises at least one of the following:
abnormal line break processing, Chinese monetary amount processing, conversion of Chinese numerals into Arabic numerals, unification of punctuation formats, illegal character replacement, and wrongly written character processing.
In one implementation of the embodiment of the present application, the feature extraction unit is specifically configured to:
acquiring a document paragraph of the divided legal document and taking it as the input of the feature extraction model, wherein the feature extraction model comprises a plurality of document element rules;
splitting the document paragraph into sentences at punctuation marks, the resulting sentences forming a sentence sequence;
reading the sentences one by one in sequence order, and performing feature matching on each sentence according to the document element rules; when a document element rule matches, outputting the corresponding document element and moving on to the next sentence, until all sentences in the sentence sequence have been matched.
A third aspect of the embodiment of the present application provides an electronic device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the feature extraction method of the legal document provided in the first aspect of the embodiment of the present application.
A fourth aspect of the embodiment of the present application provides a computer readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the feature extraction method of the legal document provided in the first aspect of the embodiment of the present application.
As can be seen from the above, the scheme of the application pre-identifies the legal document and determines the corresponding paragraph division model and feature extraction model; the paragraph division model then divides the legal document into document paragraphs, and finally the feature extraction model extracts document elements from the divided legal document. The embodiment of the application exploits the strong correlation between document paragraphs and document elements (that is, certain document elements appear in specific document paragraphs with high probability): complex document element content is first located quickly through paragraph division, and the document elements are then extracted from the paragraphs in which they are most likely to appear, which improves the feature extraction efficiency for legal documents.
Drawings
Fig. 1 is a schematic implementation flow chart of a feature extraction method of a legal document according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a feature extraction device for legal documents according to an embodiment of the present application;
Fig. 3 is a schematic diagram of a hardware structure of an electronic device according to another embodiment of the application.
Detailed Description
In order to make the objects, features and advantages of the present application more comprehensible, the technical solutions in the embodiments of the present application will be clearly described in conjunction with the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Example 1
The embodiment of the application provides a feature extraction method of a legal document, which is applied to an electronic device. The electronic device may be any equipment on which application programs can be installed, such as a smartphone, a tablet computer or a computer, and its operating system may be iOS, Android, Windows or another operating system, which is not limited herein.
Referring to fig. 1, the feature extraction method of the legal document mainly includes the following steps:
101. Pre-identifying legal documents, and determining paragraph division models and feature extraction models corresponding to the legal documents;
Pre-identifying legal documents, and determining paragraph division models and feature extraction models corresponding to the legal documents; the feature extraction model comprises a corresponding relation between a document paragraph and a document element.
Illustratively, the pre-recognition may be: identifying a document title of the legal document; determining a document type corresponding to the legal document according to the document title; and determining a paragraph division model corresponding to the document type and a feature extraction model corresponding to the paragraph division model.
In practical applications, there are various types of legal documents, such as litigation documents (civil, criminal and administrative), arbitration documents, judgment documents, and the like. Different legal documents have different general and personalized deconstruction features. Therefore, before deconstructing legal documents, the embodiment of the application sorts out the logic of the different legal documents to determine the general deconstruction features of each, and sets up corresponding models according to these general deconstruction features.
In the embodiment of the application, different legal documents may use different paragraph division models and feature extraction models. Specifically, the embodiment of the application uses machine learning to learn, for each type of legal document, how its paragraphs are divided and which document elements are extracted from which document paragraphs. A specific legal document has specific document paragraph features (for example, it can be divided into 5 paragraphs, each with its own characteristic content); these are established as mapping relationships through machine learning, and the mapping relationships are stored in a local processing terminal.
Illustratively, when a legal document to be identified is received, a specific position in the legal document (such as the title position) is identified in order to determine the paragraph division model corresponding to the legal document and the feature extraction model (used for extracting document elements) corresponding to that paragraph division model. Two sets of mapping relationships are stored in the processing terminal; by recognizing the text of the document title and looking up the mapping relationships, the paragraph division model and the feature extraction model corresponding to the legal document to be identified are obtained.
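The title-based lookup described above can be pictured as a chain of table lookups. The following is a minimal sketch only: the title keywords, document type names and model names are hypothetical placeholders, and treating the first non-empty line as the title is an assumption rather than something the application specifies.

```python
from typing import NamedTuple, Optional


class ModelPair(NamedTuple):
    paragraph_model: str
    feature_model: str


# Hypothetical tables; every key and value below is an illustrative placeholder.
TITLE_KEYWORD_TO_TYPE = {
    "civil judgment": "first_instance_civil_judgment",
    "arbitration award": "arbitration_award",
}
TYPE_TO_PARAGRAPH_MODEL = {
    "first_instance_civil_judgment": "judgment_paragraph_rules",
    "arbitration_award": "arbitration_paragraph_rules",
}
PARAGRAPH_TO_FEATURE_MODEL = {
    "judgment_paragraph_rules": "judgment_element_rules",
    "arbitration_paragraph_rules": "arbitration_element_rules",
}


def pre_identify(document: str) -> Optional[ModelPair]:
    """Return the (paragraph model, feature model) pair selected by the title."""
    # Assumption: the title is the first non-empty line of the document.
    title = next((ln.strip() for ln in document.splitlines() if ln.strip()), "")
    lowered = title.lower()
    for keyword, doc_type in TITLE_KEYWORD_TO_TYPE.items():
        if keyword in lowered:
            paragraph_model = TYPE_TO_PARAGRAPH_MODEL[doc_type]
            feature_model = PARAGRAPH_TO_FEATURE_MODEL[paragraph_model]
            return ModelPair(paragraph_model, feature_model)
    return None  # unknown document type
```

A caller would pass the full document text and fall back to a default pipeline when `None` is returned.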
In practical applications, to improve the efficiency and accuracy of legal document processing, the legal document may be preprocessed before being pre-identified, where the preprocessing includes at least one of the following: abnormal line break processing, Chinese monetary amount processing, conversion of Chinese numerals into Arabic numerals, unification of punctuation formats, illegal character replacement, and wrongly written character processing.
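Three of the preprocessing operations listed above can be sketched as follows. This is an illustration under stated assumptions, not the application's implementation: the function names are hypothetical, the numeral converter only handles plain integers below one trillion written with 十/百/千/万/亿, and the punctuation table simply maps a few ASCII marks to their full-width forms (it would also alter commas inside numbers such as "1,000").

```python
DIGITS = {"零": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
UNITS = {"十": 10, "百": 100, "千": 1000}
BIG_UNITS = {"万": 10_000, "亿": 100_000_000}
SENTENCE_END = "。!?;:.!?;:"
PUNCT_TABLE = str.maketrans(",;:!?()", ",;:!?()")


def chinese_to_int(numeral: str) -> int:
    """Convert a Chinese numeral such as 一百九十六 into an int (196)."""
    total = section = digit = 0
    for ch in numeral:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in UNITS:
            # A bare 十 at the start means 10, hence "digit or 1".
            section += (digit or 1) * UNITS[ch]
            digit = 0
        elif ch in BIG_UNITS:
            total += (section + digit) * BIG_UNITS[ch]
            section = digit = 0
        else:
            raise ValueError(f"not a Chinese numeral character: {ch!r}")
    return total + section + digit


def merge_abnormal_line_breaks(text: str) -> str:
    """Join a line onto the previous one when that line did not end a sentence."""
    merged = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if merged and merged[-1][-1] not in SENTENCE_END:
            merged[-1] += line
        else:
            merged.append(line)
    return "\n".join(merged)


def normalize_punctuation(text: str) -> str:
    """Unify a few ASCII punctuation marks to their full-width forms."""
    return text.translate(PUNCT_TABLE)
```

Converting numerals before rule matching means that a later rule can compare article numbers or amounts as ordinary integers.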
102. Carrying out document paragraph division on the legal document through the paragraph division model;
The legal document, after preprocessing, is divided into document paragraphs through the paragraph division model. The application is described taking a judgment document (a first-instance judgment) as an application example.
Illustratively, the paragraph categories to be divided include: title, litigation subject, trial process, complaint, defense, evidence, facts found at trial, court's opinion, judgment result, and court personnel.
Illustratively, for such predefined categories, the paragraph division model may in practice be a rule model, which the application classes as a machine learning model of the unsupervised type. A CRF (Conditional Random Field) model may also be used; it belongs to the family of Markov probabilistic graphical models and is widely used for text sequence labeling problems.
Regarding the division of document paragraphs, in practical applications the target paragraph categories may be preset as N categories (N being an integer greater than one) with N corresponding paragraph extraction rules, the target paragraphs and the paragraph extraction rules being in one-to-one correspondence. For example, if the "judgment document" has 5 paragraph categories, it has 5 corresponding "paragraph extraction rules". An example of a "paragraph extraction rule": the paragraph start feature may be "this court holds", and the paragraph end feature may be "the judgment is as follows" or the start feature of another paragraph extraction rule.
In practical applications, the paragraph features of a judgment document are obvious and can be enumerated exhaustively, so the rule extraction model is used preferentially. Paragraph extraction rules, each comprising a paragraph start feature and a paragraph end feature, may be set in the paragraph division model. Taking the "this court holds" paragraph as an example, the paragraph start feature may be "this court holds", and the paragraph end feature may be "the judgment is as follows" or the start feature of another paragraph extraction rule.
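A rule of this kind (a start feature plus either an explicit end feature or the start of another rule) can be sketched as below. The rule set, category names and English feature strings are hypothetical; a real judgment document would use the original Chinese phrases, and plain substring search is an assumption made for brevity.

```python
from typing import Dict, List, NamedTuple, Optional


class ParagraphRule(NamedTuple):
    category: str
    start: str            # paragraph start feature
    end: Optional[str]    # explicit end feature, or None to rely on the next rule's start


def divide_paragraphs(text: str, rules: List[ParagraphRule]) -> Dict[str, str]:
    """Cut `text` into paragraphs keyed by category, using start/end features."""
    # Locate the start feature of every rule that occurs in the text.
    starts = []
    for rule in rules:
        pos = text.find(rule.start)
        if pos != -1:
            starts.append((pos, rule))
    starts.sort(key=lambda item: item[0])

    paragraphs = {}
    for i, (pos, rule) in enumerate(starts):
        # Default end: the start feature of the next rule (or the end of the text).
        next_start = starts[i + 1][0] if i + 1 < len(starts) else len(text)
        stop = next_start
        if rule.end is not None:
            end_pos = text.find(rule.end, pos)
            if end_pos != -1:
                # Whichever comes first closes the paragraph.
                stop = min(next_start, end_pos + len(rule.end))
        paragraphs[rule.category] = text[pos:stop].strip()
    return paragraphs


# Hypothetical rule set for a simplified judgment document.
RULES = [
    ParagraphRule("complaint", "The plaintiff claims", None),
    ParagraphRule("court_opinion", "This court holds", "The judgment is as follows"),
    ParagraphRule("judgment_result", "The judgment is as follows", None),
]
```

Because each paragraph closes at whichever boundary comes first, a missing end feature degrades gracefully to "up to the next paragraph".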
103. And extracting document elements corresponding to the document paragraphs from the legal document divided by the document paragraphs through the feature extraction model, and outputting the extraction result of the document elements.
Through the processing of step 102, N paragraphs divided according to the paragraph extraction rules are obtained, and M types of document elements are extracted from the N paragraphs, where M is an integer larger than N.
Specifically, a document element is a logical element related to the case in a legal document. Illustratively, M may be 8, and the document elements specifically include: principal, complaint item, defense item, evidence item, fact item, dispute focus, court view, and decision item.
For example, the principal document element (citizens, companies, law firm information, etc.) may be extracted from the litigation subject paragraph; the complaint item from the complaint paragraph; the defense item from the defense paragraph; the evidence item from the evidence paragraph; the fact item from the facts-found-at-trial paragraph; the dispute focus and the court view from the court's opinion paragraph; and the decision item from the judgment result paragraph.
In the embodiments of the present application, document element extraction is mainly implemented by a rule extraction model (see the second embodiment below). Rule extraction suits text features that are obvious and can be enumerated exhaustively, such as an explicit paragraph start text (e.g., "this court holds"). Text features that are not obvious and cannot be enumerated exhaustively require semantic recognition, which can be implemented with a supervised neural network model (see the third embodiment below).
As can be seen from the above, the scheme of the application pre-identifies the legal document and determines the corresponding paragraph division model and feature extraction model; the paragraph division model then divides the legal document into document paragraphs, and finally the feature extraction model extracts document elements from the divided legal document. The embodiment of the application exploits the strong correlation between document paragraphs and document elements (that is, certain document elements appear in specific document paragraphs with high probability): complex document element content is first located quickly through paragraph division, and the document elements are then extracted from the paragraphs in which they are most likely to appear, which improves the feature extraction efficiency for legal documents.
Example two
This embodiment mainly describes a scheme for extracting document elements through a rule extraction model, in which each document element corresponds to a document element rule, and the document element rule is used to read the features corresponding to that document element in a sentence.
Step 1, acquiring a document paragraph of the divided legal document and taking it as the input of the feature extraction model, wherein the feature extraction model comprises a plurality of document element rules;
Step 2, splitting the document paragraph into sentences at punctuation marks, the resulting sentences forming a sentence sequence;
Step 3, screening, from the feature extraction model, the document element rules corresponding to the document paragraph;
Step 4, reading the sentences one by one in sequence order, and performing feature matching on each sentence using the document element rules corresponding to the document paragraph; when a document element rule matches, outputting the corresponding document element and moving on to the next sentence, until all sentences in the sentence sequence have been matched.
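Steps 1 to 4 can be sketched as a short loop. The rule table, element names and regular expressions below are hypothetical, and expressing a "document element rule" as a regular expression is an assumption; the application does not state how rules are encoded.

```python
import re
from typing import List, Tuple

# Hypothetical rule table: paragraph category -> ordered (element name, pattern) rules.
ELEMENT_RULES = {
    "court_opinion": [
        ("cited_provision_sentence", re.compile(r"the judgment is as follows", re.I)),
        ("rejection", re.compile(r"\breject", re.I)),
    ],
}

# Sentence-ending punctuation, Chinese and ASCII (a decimal point would also split).
_SENTENCE = re.compile(r"[^。!?;.!?;]+[。!?;.!?;]?")


def split_sentences(paragraph: str) -> List[str]:
    """Step 2: split a paragraph into a sentence sequence at punctuation marks."""
    return [s.strip() for s in _SENTENCE.findall(paragraph) if s.strip()]


def extract_elements(category: str, paragraph: str) -> List[Tuple[str, str]]:
    """Steps 3 and 4: screen the rules for this paragraph, then match each sentence."""
    rules = ELEMENT_RULES.get(category, [])       # step 3
    extracted = []
    for sentence in split_sentences(paragraph):   # step 4
        for element, pattern in rules:
            if pattern.search(sentence):
                extracted.append((element, sentence))
                break  # first matching rule wins; continue with the next sentence
    return extracted
```

Screening the rules by paragraph category first keeps the inner loop short, which is the efficiency argument the application makes for dividing paragraphs before extracting elements.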
Illustratively, to extract the "court view" document element, the "court view" paragraph is split into sentences at sentence-ending punctuation marks (such as "。" and "!"), and the sentences are read one by one and matched against the document element rules corresponding to the court view. A sentence containing "the judgment is as follows" (one of the court view document element rules) is located, for example: "According to Article 196 of the Contract Law of the People's Republic of China, Article 6 of the Several Opinions of the Supreme People's Court on the Trial of Lending Cases by People's Courts, and Articles 142, 144 and 152 of the Litigation Law of the People's Republic of China, the judgment is as follows:". This extracts the sentence in the court view that cites legal provisions. Further, the individual provisions can be extracted from that sentence by combining two text features, yielding "Article 196 of the Contract Law of the People's Republic of China", "Article 6 of the Several Opinions of the Supreme People's Court on the Trial of Lending Cases by People's Courts", "Article 142 of the Litigation Law of the People's Republic of China" and "Article 144 of the Litigation Law of the People's Republic of China".
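As an illustration of pulling individual provisions out of such a sentence, the sketch below assumes that the two combined features are the Chinese book-title marks "《》" around the name of the law and the article marker "第…条". That is a guess about the original Chinese text, which this translation does not preserve, and the function name is hypothetical.

```python
import re
from typing import List, Tuple

# Assumed feature 1: a law title in book-title marks, followed by text up to the next title.
LAW = re.compile(r"《([^》]+)》([^《]*)")
# Assumed feature 2: an article marker such as 第一百九十六条 or 第6条.
ARTICLE = re.compile(r"第[零一二三四五六七八九十百千0-9]+条")


def extract_citations(sentence: str) -> List[Tuple[str, str]]:
    """Return (law title, article) pairs; several articles may follow one title."""
    citations = []
    for title, tail in LAW.findall(sentence):
        for article in ARTICLE.findall(tail):
            citations.append((title, article))
    return citations
```

Each article is attributed to the nearest preceding title, so a list such as "第一百四十二条、第一百四十四条" after a single title yields two pairs.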
Illustratively, consider the extraction of principal categories: plaintiff, defendant, plaintiff's agent, defendant's agent, legal representative, and so on. Take the following principal paragraph:
Plaintiff: Tang XX, female, Han ethnicity.
Authorized agent: Zhou XX, lawyer at a law firm in Hubei.
Defendant: a real estate development company, domiciled in a city in Hubei Province.
Legal representative: Liu XX, chairman of the company's board.
Authorized agent: (name anonymized), lawyer at a law firm in Beijing.
Defendant: (name anonymized), male, Han ethnicity.
Defendant: Zhang XX, male, Han ethnicity.
Defendant: Liu XX, male, Han ethnicity.
Authorized agent: (name anonymized), lawyer at a law firm in Beijing.
Here the entity category corresponding to "Tang XX" is "plaintiff". The extraction rule is "paragraph sentence division + locating the entity start position in the sentence + searching the entity start position in the sentence"; finally, after the middle punctuation mark is removed, the category can be confirmed as "plaintiff".
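The principal extraction rule can be sketched roughly as follows. The English role labels, the line-based segmentation and the choice of comma or semicolon as the end of the name are all illustrative assumptions; the original documents are in Chinese and the application does not give the rule in code form.

```python
import re
from typing import List, Tuple

# Hypothetical role labels expected at the start of each segment.
ROLE_LABELS = ("Plaintiff", "Defendant", "Authorized agent", "Legal representative")


def extract_principals(paragraph: str) -> List[Tuple[str, str]]:
    """Return (role, name) pairs from a principal paragraph, one segment per line."""
    principals = []
    for line in paragraph.splitlines():          # paragraph sentence division
        line = line.strip()
        for role in ROLE_LABELS:
            if not line.startswith(role):
                continue
            rest = line[len(role):]
            # Entity start position: the first word character, so that the
            # punctuation after the role label (":" or ":") is skipped.
            match = re.search(r"\w", rest)
            if match:
                name = rest[match.start():]
                # Assumed entity end: the next comma, semicolon or Chinese full stop.
                name = re.split(r"[,,;;。]", name, maxsplit=1)[0].rstrip(". ")
                principals.append((role, name))
            break
    return principals
```

Applied to a two-line input such as "Plaintiff: Zhou Xiaojie;" followed by "Defendant: Zhang XX", the sketch returns one pair per line with the colon and trailing punctuation removed.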
The processing target is as follows:
"Plaintiff: Zhou Xiaojie;
Defendant: Mr. Lin"
"Paragraph sentence division" refers to the paragraph division features: the introductory descriptions of the two party categories above are on separate segments, so paragraph division features such as segment separators must be identified. The "entity start position" is the position of the first text character, that is, punctuation marks are not taken as the start position. "Removing the middle punctuation" means, for example, that when the identified feature is ": Zhou Xiaojie", the symbol ":" is removed and the remaining text is the plaintiff's name.
Illustratively, court views fall into the categories: support the plaintiff, support the defendant, reject the plaintiff, reject the defendant. They are mainly determined by rules. For example: "Because Tang XX did not submit evidence to this court proving that the interest of 488,000 yuan consists of the interest accrued on the loan principal of 30 million yuan and of 12.4 million yuan respectively, Tang XX's litigation request concerning all interest accrued on 42.4 million yuan has no evidentiary basis, and this court rejects it according to law." The entity types appearing in the sentence ("Tang XX": plaintiff; "Liu XX": defendant) and the key sentence, "Tang XX ... has no factual or legal basis and is rejected by this court according to law", are identified first. Because the rule features "Tang XX" + "litigation request" + "reject" appear in the key sentence, it is classified as "reject the plaintiff".
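The combination of rule features in this example (a party name, "litigation request", and "reject" or "support") can be sketched as below. The English keywords and the choice of the first-mentioned party as the owner of the request are simplifying assumptions for illustration only.

```python
from typing import Dict, Optional


def classify_court_view(sentence: str, roles: Dict[str, str]) -> Optional[str]:
    """Classify a key sentence as e.g. "reject the plaintiff".

    `roles` maps party names found earlier (for example by principal
    extraction) to "plaintiff" or "defendant".
    """
    lowered = sentence.lower()
    if "reject" in lowered:
        action = "reject"
    elif "support" in lowered:
        action = "support"
    else:
        return None
    if "litigation request" not in lowered:
        return None
    # Assumption: the party mentioned first is the one whose request is decided.
    mentioned = [(sentence.find(name), role)
                 for name, role in roles.items() if name in sentence]
    if not mentioned:
        return None
    _, role = min(mentioned)
    return f"{action} the {role}"
```

The party roles would come from the principal extraction step, which is why the entity types have to be identified before the court view can be classified.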
Example III
This embodiment mainly describes a scheme for extracting document elements through a supervised neural network model, specifically:
First, the neural network model is trained;
The neural network model of the embodiment of the application fuses three single models. Model 1: based on the TextCNN network, with batch normalization added and two fully connected layers used for classification. Model 2: based on the TextRNN network, using a bidirectional Long Short-Term Memory (LSTM) network, with the hidden vectors classified after K-max pooling. Model 3: the TextRCNN network. The three networks are fused, that is, their outputs are added, and the fused model is then trained with corresponding legal document samples (labeled with document elements).
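The K-max pooling step mentioned for Model 2 can be illustrated on a single feature dimension. This sketch follows one common definition, in which the k largest activations over the time steps are kept in their original order; whether the application uses this order-preserving variant is not stated, so treat it as an assumption.

```python
from typing import List


def k_max_pooling(values: List[float], k: int) -> List[float]:
    """Keep the k largest values of a sequence, preserving their original order."""
    if k <= 0:
        return []
    # Indices of the k largest values (ties resolved in favour of earlier positions).
    top = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    return [values[i] for i in sorted(top)]
```

In a real bidirectional LSTM this would be applied to each hidden dimension across the time steps, and the pooled values concatenated before the classification layer.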
The TextCNN network, the TextRNN network and the TextRCNN network are arranged in parallel, and the output ends of the three networks are connected;
The three networks receive the same input, namely symbolized document paragraphs. For example, in practical applications, before text data is input into the neural network it undergoes sentence splitting, word segmentation and part-of-speech analysis, and the resulting words are then represented by numeric symbols: for example, "1" represents "you", "2" represents "at" and "3" represents "where", so that the text "where are you" (word by word, "you at where") is symbolized as "123".
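Symbolization of this kind amounts to a vocabulary lookup. The sketch below is illustrative: the helper names are hypothetical, the tokens are assumed to come from an upstream word segmenter, and reserving 0 for unknown words is an added convention rather than something the application describes.

```python
from typing import Dict, Iterable, List


def build_vocab(tokenized_texts: Iterable[List[str]]) -> Dict[str, int]:
    """Assign each distinct token an integer id in order of first appearance."""
    vocab: Dict[str, int] = {}
    for tokens in tokenized_texts:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab) + 1  # ids start at 1; 0 means "unknown"
    return vocab


def symbolize(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Replace each token by its id, using 0 for tokens outside the vocabulary."""
    return [vocab.get(token, 0) for token in tokens]
```

The same vocabulary must be used for all three networks so that they receive identical input.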
The output of each of the three networks is a label identification result, and the three label identification results are added and averaged to obtain the output of the feature extraction model. For example, assuming that the feature extraction model is preset with 6 types of document elements to be labeled, each label identification result may be a sequence of 6 numbers, one for each of the 6 types of document elements. Each number represents the probability that the currently identified content is the corresponding document element. Adding the three label identification results means adding the probabilities at corresponding positions, after which the sum at each position is averaged.
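The add-and-average fusion can be sketched in a few lines. The function names below are hypothetical, and the per-model probability lists stand in for the outputs of the TextCNN, TextRNN and TextRCNN networks, which are not implemented here.

```python
from typing import List, Sequence


def average_label_scores(*model_outputs: Sequence[float]) -> List[float]:
    """Add the per-position probabilities of all models and divide by their count."""
    if not model_outputs:
        raise ValueError("at least one model output is required")
    length = len(model_outputs[0])
    if any(len(output) != length for output in model_outputs):
        raise ValueError("all model outputs must have the same length")
    count = len(model_outputs)
    return [sum(values) / count for values in zip(*model_outputs)]


def best_label_index(scores: Sequence[float]) -> int:
    """Index of the document element type with the highest averaged probability."""
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up outputs for 6 document element types, one list per network.
cnn_out = [0.70, 0.10, 0.05, 0.05, 0.05, 0.05]
rnn_out = [0.50, 0.30, 0.05, 0.05, 0.05, 0.05]
rcnn_out = [0.60, 0.20, 0.05, 0.05, 0.05, 0.05]
fused = average_label_scores(cnn_out, rnn_out, rcnn_out)
predicted = best_label_index(fused)
```

Taking the position with the highest averaged probability as the prediction is an assumption; the source only specifies the averaging.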
Taking the document element "complaint" as an example, the document paragraphs in the legal document samples are first divided, the content of the document paragraphs in which a "complaint" may appear is labeled, and the labeled document paragraphs are fed into the neural network model for training.
Secondly, extracting document elements;
A document paragraph related to the "complaint" is acquired and input into the neural network model to extract the document elements.
For example, for the "request to repay principal" category, the complaints related to this category in the cases are represented by a small number of labeled samples, such as "Request: 1. order a certain company to immediately repay the borrowing principal of 42.4 million yuan and the interest of 9.652 million yuan accrued up to the date of filing the suit, and order a certain company to pay interest from the date of filing the suit to the date the debt is cleared;". 10% of the cases are labeled, the category features in the related complaint expressions are learned through a classification algorithm, and the trained model is then used to assign category labels to the complaints of the remaining unlabeled cases.
Example IV
Referring to FIG. 2, an embodiment of the present application provides a feature extraction device for a legal document. The device can be used to implement the feature extraction method for a legal document provided by the embodiment shown in FIG. 1 above. As shown in FIG. 2, the feature extraction device for a legal document mainly includes:
A pre-recognition unit 201, configured to pre-recognize a legal document, and determine a paragraph division model and a feature extraction model corresponding to the legal document; the feature extraction model comprises a corresponding relation between a document paragraph and a document element;
A paragraph dividing unit 202, configured to divide the document paragraph by using the paragraph dividing model;
and a feature extraction unit 203, configured to extract, from the legal document divided by the document paragraphs, document elements corresponding to the document paragraphs through the feature extraction model, and output an extraction result of the document elements.
In one implementation of the embodiment of the present application, the apparatus further includes: a preprocessing unit 204;
The preprocessing unit 204 is configured to perform preprocessing on the legal document, where the preprocessing includes at least one of the following:
Abnormal line break handling, Chinese monetary amount handling, conversion of Chinese numerals into Arabic numerals, punctuation format unification, illegal character replacement, and correction of wrongly written characters.
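Two of the listed preprocessing steps can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the numeral converter handles plain integers written with 十, 百, 千, 万 and 亿 but not decimals or financial (capital) numerals, and unifying punctuation toward half-width characters is an arbitrary choice because the source does not say which format is the target.

```python
DIGITS = {"零": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}

# Assumed direction: full-width punctuation is mapped to half-width.
PUNCTUATION_TABLE = str.maketrans({"，": ",", "；": ";", "：": ":",
                                   "（": "(", "）": ")", "！": "!", "？": "?"})


def chinese_to_int(numeral: str) -> int:
    """Convert a Chinese integer numeral such as 四千二百四十万 to an int."""
    total = 0      # value already closed off by 万 or 亿
    section = 0    # value accumulated inside the current section (below 10000)
    digit = 0      # last digit read, waiting for its unit
    for ch in numeral:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in SMALL_UNITS:
            # A bare unit such as 十 means 1 x 10.
            section += (digit or 1) * SMALL_UNITS[ch]
            digit = 0
        elif ch == "万":
            total += (section + digit) * 10_000
            section, digit = 0, 0
        elif ch == "亿":
            total = (total + section + digit) * 100_000_000
            section, digit = 0, 0
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return total + section + digit


def unify_punctuation(text: str) -> str:
    """Map full-width punctuation marks to their half-width counterparts."""
    return text.translate(PUNCTUATION_TABLE)


principal = chinese_to_int("四千二百四十万")
cleaned = unify_punctuation("请求：1，判令")
```

Running such steps before title recognition and paragraph division gives the later models a more uniform input.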
In one implementation of the embodiment of the present application, the pre-identifying unit 201 is specifically configured to:
identifying a document title of the legal document;
determining a document type corresponding to the legal document according to the document title;
and determining a paragraph division model corresponding to the document type and a feature extraction model corresponding to the paragraph division model.
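The pre-identification steps above amount to a lookup from title to document type to a pair of models. The sketch below assumes that the title is the first non-empty line and that document types are recognised by keyword; the keyword table, the type names and the model identifiers are all hypothetical placeholders, not names from the source.

```python
from typing import Optional, Tuple

# Hypothetical keyword table: title keyword -> document type.
TITLE_KEYWORDS = (
    ("civil judgment", "civil_judgment"),
    ("civil ruling", "civil_ruling"),
)

# Hypothetical registry: document type -> (paragraph division model, feature extraction model).
MODELS_BY_TYPE = {
    "civil_judgment": ("segmenter_civil_judgment", "extractor_civil_judgment"),
    "civil_ruling": ("segmenter_civil_ruling", "extractor_civil_ruling"),
}


def identify_title(document: str) -> str:
    """Assume the title is the first non-empty line of the document."""
    for line in document.splitlines():
        if line.strip():
            return line.strip()
    return ""


def determine_doc_type(title: str) -> Optional[str]:
    """Return the first document type whose keyword occurs in the title."""
    lowered = title.lower()
    for keyword, doc_type in TITLE_KEYWORDS:
        if keyword in lowered:
            return doc_type
    return None


def select_models(document: str) -> Tuple[str, str, str]:
    """Return (document type, paragraph division model, feature extraction model)."""
    doc_type = determine_doc_type(identify_title(document))
    if doc_type is None:
        raise LookupError("document type could not be determined from the title")
    segmenter, extractor = MODELS_BY_TYPE[doc_type]
    return doc_type, segmenter, extractor


sample = "\n  Civil Judgment of a Certain Court\nThe plaintiff Tang ..."
selection = select_models(sample)
```

Keeping the two models in one registry entry reflects the stated correspondence between a paragraph division model and its feature extraction model.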
In one implementation of the embodiment of the present application, the feature extraction unit 203 is specifically configured to:
Acquiring a document paragraph after the legal document is divided, and taking the document paragraph as an input object of the feature extraction model; the feature extraction model comprises a plurality of document element rules.
Breaking sentences of the document paragraphs according to punctuation marks, and cutting the obtained multiple sentences to form sentence sequences;
Screening document element rules corresponding to the document paragraphs in the feature extraction model according to the document paragraphs after the paragraph division;
sequentially reading sentences one by one according to the sentence sequence, and performing feature matching on the read sentences by using document element rules corresponding to the document paragraphs; after matching is successful on a document element rule, outputting a corresponding document element, and matching the next sentence until all sentences in the sentence sequence are matched.
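The four steps above can be sketched as a small rule-matching loop. The rule table, the regular expressions and the paragraph type name "complaint" are illustrative assumptions, and the English sample text stands in for a Chinese document paragraph.

```python
import re
from typing import List, Tuple

# Punctuation marks assumed to end a sentence (full-width and half-width).
SENTENCE_PATTERN = re.compile(r"[^。；！？;!?]+[。；！？;!?]?")

# Hypothetical document element rules, grouped by paragraph type.
RULES_BY_PARAGRAPH = {
    "complaint": [
        ("request to repay principal", re.compile(r"repay.*principal")),
        ("request to pay interest", re.compile(r"pay.*interest")),
    ],
}


def split_sentences(paragraph: str) -> List[str]:
    """Break a paragraph into an ordered sentence sequence at punctuation marks."""
    return [s.strip() for s in SENTENCE_PATTERN.findall(paragraph) if s.strip()]


def extract_elements(paragraph_type: str, paragraph: str) -> List[Tuple[str, str]]:
    """Match each sentence against the rules screened for this paragraph type."""
    rules = RULES_BY_PARAGRAPH.get(paragraph_type, [])
    results = []
    for sentence in split_sentences(paragraph):
        for element, pattern in rules:
            if pattern.search(sentence):
                results.append((element, sentence))
                break  # one rule matched: output it and move to the next sentence
    return results


paragraph = ("Order the company to repay the borrowing principal; "
             "order the company to pay interest until the debt is cleared; "
             "costs are borne by the defendant")
elements = extract_elements("complaint", paragraph)
```

Stopping at the first matching rule for each sentence follows the description that a successful match outputs the element and then proceeds to the next sentence.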
It should be noted that, in the embodiment of the device illustrated in FIG. 2, the division into functional modules is merely illustrative. In practical applications, the above functions may be allocated to different functional modules as required, for example according to the configuration requirements of the corresponding hardware or the convenience of software implementation; that is, the internal structure of the device may be divided into different functional modules to perform all or part of the functions described above. In addition, the functional modules in this embodiment may be implemented by corresponding hardware, or by corresponding hardware executing corresponding software. The embodiments provided in this specification can all apply the above principle, which is not repeated hereinafter.
For the specific process by which each functional module of the device provided in this embodiment implements its function, please refer to the content described in the embodiment shown in FIG. 1, which is not repeated here.
Example V
Referring to fig. 3, an embodiment of the present application provides an electronic device, which includes:
a memory 301, a processor 302, and a computer program stored in the memory 301 and executable on the processor 302; when the processor 302 executes the computer program, the feature extraction method for a legal document described in the embodiment shown in FIG. 1 is implemented.
Further, the electronic device further includes:
at least one input device 303 and at least one output device 304.
The memory 301, the processor 302, the input device 303, and the output device 304 are connected via a bus 305.
The input device 303 may be a camera, a touch panel, a physical key, a mouse, or the like. The output device 304 may be in particular a display screen.
The memory 301 may be a high-speed random access memory (RAM) or a non-volatile memory, such as a disk memory. The memory 301 is used to store a set of executable program code, and the processor 302 is coupled to the memory 301.
Further, an embodiment of the present application also provides a computer readable storage medium, which may be provided in the electronic device of each of the foregoing embodiments and may be the memory in the embodiment shown in FIG. 3. The computer readable storage medium stores a computer program which, when executed by a processor, implements the feature extraction method for a legal document described in the embodiment shown in FIG. 1. Further, the computer readable storage medium may be any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a RAM, a magnetic disk, or an optical disk.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division of the modules is merely a division by logical function, and other divisions are possible in actual implementation; for instance, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via some interfaces, devices or modules, and may be electrical, mechanical or in other forms.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional module in each embodiment of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated modules may be implemented in hardware or in software functional modules.
The integrated modules, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in whole or in part, may be embodied in the form of a software product stored in a readable storage medium, which includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the method according to the embodiments of the present application. The aforementioned readable storage medium includes: a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, an optical disk, or the like.
It should be noted that, for the sake of simplicity of description, the foregoing method embodiments are all expressed as a series of combinations of actions, but it should be understood by those skilled in the art that the present application is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present application. Further, those skilled in the art will appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily all required for the present application.
In the foregoing embodiments, each embodiment is described with its own emphasis; for parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
The foregoing describes the feature extraction method for legal documents, the electronic device and the computer readable storage medium provided by the present application. Those skilled in the art may, following the ideas of the embodiments of the application, make changes to the specific implementations and to the scope of application; in view of the foregoing, the content of this specification should not be construed as limiting the application.
Claims (9)
1. A method for extracting features of a legal document, comprising:
Identifying a document title of the legal document;
determining a document type corresponding to the legal document according to the document title;
Determining a paragraph division model corresponding to the document type and a feature extraction model corresponding to the paragraph division model; the feature extraction model comprises a corresponding relation between a document paragraph and a document element; the paragraph division model and the feature extraction model are obtained through training in a machine learning mode;
Carrying out document paragraph division on the legal document through the paragraph division model;
And extracting document elements corresponding to the document paragraphs from the legal document divided by the document paragraphs through the feature extraction model, and outputting the extraction result of the document elements.
2. The method for extracting features of a legal document according to claim 1, wherein,
Before the identifying of the document title of the legal document, the method further comprises the following steps:
Preprocessing the legal document, the preprocessing comprising at least one of:
Abnormal line break handling, Chinese monetary amount handling, conversion of Chinese numerals into Arabic numerals, punctuation format unification, illegal character replacement, and correction of wrongly written characters.
3. The method for extracting features of a legal document according to claim 1, wherein,
The extracting of the document elements from the legal document after the document paragraphs are divided by the feature extraction model comprises the following steps:
Acquiring a document paragraph after the legal document is divided, and taking the document paragraph as an input object of the feature extraction model; the feature extraction model comprises a plurality of document element rules;
breaking sentences of the document paragraphs according to punctuation marks, and cutting the obtained multiple sentences to form sentence sequences;
Screening document element rules corresponding to the document paragraphs in the feature extraction model according to the document paragraphs after the paragraph division;
sequentially reading sentences one by one according to the sentence sequence, and performing feature matching on the read sentences by using document element rules corresponding to the document paragraphs; after matching is successful on a document element rule, outputting a corresponding document element, and matching the next sentence until all sentences in the sentence sequence are matched.
4. The method for extracting features of a legal document according to claim 1, wherein,
The feature extraction model includes: a TextCNN network, a TextRNN network and a TextRCNN network;
The TextCNN network, the TextRNN network and the TextRCNN network are arranged in parallel, and the output ends of the three networks are connected;
The three networks receive the same input, namely symbolized document paragraphs; the output of each of the three networks is a label identification result, and the three label identification results are added and averaged to obtain the output of the feature extraction model.
5. A feature extraction device for a legal document, comprising:
A pre-recognition unit for recognizing a document title of a legal document; determining a document type corresponding to the legal document according to the document title; determining a paragraph division model corresponding to the document type and a feature extraction model corresponding to the paragraph division model; the feature extraction model comprises a corresponding relation between a document paragraph and a document element; the paragraph division model and the feature extraction model are obtained through training in a machine learning mode;
the paragraph dividing unit is used for dividing the document paragraphs of the legal document through the paragraph dividing model;
And the feature extraction unit is used for extracting the document elements corresponding to the document paragraphs from the legal document divided by the document paragraphs through the feature extraction model and outputting the extraction result of the document elements.
6. The feature extraction device of legal documents according to claim 5, wherein the feature extraction device comprises,
The apparatus further comprises: a preprocessing unit;
The preprocessing unit is used for preprocessing the legal document, and the preprocessing comprises at least one of the following:
Abnormal line break handling, Chinese monetary amount handling, conversion of Chinese numerals into Arabic numerals, punctuation format unification, illegal character replacement, and correction of wrongly written characters.
7. The feature extraction device of legal documents according to claim 5, wherein the feature extraction device comprises,
The feature extraction unit is specifically configured to:
Acquiring a document paragraph after the legal document is divided, and taking the document paragraph as an input object of the feature extraction model; the feature extraction model comprises a plurality of document element rules;
breaking sentences of the document paragraphs according to punctuation marks, and cutting the obtained multiple sentences to form sentence sequences;
Screening document element rules corresponding to the document paragraphs in the feature extraction model according to the document paragraphs after the paragraph division;
sequentially reading sentences one by one according to the sentence sequence, and performing feature matching on the read sentences by using document element rules corresponding to the document paragraphs; after matching is successful on a document element rule, outputting a corresponding document element, and matching the next sentence until all sentences in the sentence sequence are matched.
8. An electronic device, comprising: memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 4 when executing the computer program.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the method of any of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910936787.5A CN110765889B (en) | 2019-09-29 | 2019-09-29 | Feature extraction method, related device and storage medium for legal document |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110765889A (en) | 2020-02-07 |
CN110765889B (en) | 2024-06-25 |
Family
ID=69329135
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910936787.5A Active CN110765889B (en) | 2019-09-29 | 2019-09-29 | Feature extraction method, related device and storage medium for legal document |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110765889B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111476034B (en) * | 2020-04-07 | 2023-05-12 | 同方赛威讯信息技术有限公司 | Legal document information extraction method and system based on combination of rules and models |
CN111428484B (en) * | 2020-04-14 | 2022-02-18 | 广州云从鼎望科技有限公司 | Information management method, system, device and medium |
WO2021232293A1 (en) * | 2020-05-20 | 2021-11-25 | Accenture Global Solutions Limited | Contract recommendation platform |
CN112686012B (en) * | 2020-11-11 | 2023-03-31 | 福建亿榕信息技术有限公司 | Document feature extraction method, device, equipment and medium |
CN113673255B (en) * | 2021-08-25 | 2023-06-30 | 北京市律典通科技有限公司 | Text function area splitting method and device, computer equipment and storage medium |
CN114138928A (en) * | 2021-09-27 | 2022-03-04 | 平安国际智慧城市科技股份有限公司 | Method, system, device, electronic equipment and medium for extracting text content |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106815208A (en) * | 2015-12-01 | 2017-06-09 | 北京国双科技有限公司 | The analysis method and device of law judgement document |
Family Cites Families (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH10320409A (en) * | 1997-05-19 | 1998-12-04 | Seiko Epson Corp | Method and device for extracting document information and storage medium storing document extracting process program |
JP4140744B2 (en) * | 1999-05-07 | 2008-08-27 | 独立行政法人情報通信研究機構 | How to automatically split caption text |
US6772149B1 (en) * | 1999-09-23 | 2004-08-03 | Lexis-Nexis Group | System and method for identifying facts and legal discussion in court case law documents |
US7996223B2 (en) * | 2003-10-01 | 2011-08-09 | Dictaphone Corporation | System and method for post processing speech recognition output |
US7983468B2 (en) * | 2005-02-09 | 2011-07-19 | Jp Morgan Chase Bank | Method and system for extracting information from documents by document segregation |
US20160103823A1 (en) * | 2014-10-10 | 2016-04-14 | The Trustees Of Columbia University In The City Of New York | Machine Learning Extraction of Free-Form Textual Rules and Provisions From Legal Documents |
CN106815203B (en) * | 2015-12-01 | 2021-03-30 | 北京国双科技有限公司 | Method and device for analyzing amount of money in referee document |
CN107783946A (en) * | 2016-08-27 | 2018-03-09 | 上海卓易电子科技有限公司 | Text display method and text display |
CN108874814B (en) * | 2017-05-10 | 2022-05-27 | 北京国双科技有限公司 | Legal document processing method and device |
CN107590131A (en) * | 2017-10-16 | 2018-01-16 | 北京神州泰岳软件股份有限公司 | A kind of specification document processing method, apparatus and system |
CN107832360A (en) * | 2017-10-24 | 2018-03-23 | 广东欧珀移动通信有限公司 | Comment processing method and relevant device |
CN109753647B (en) * | 2017-11-07 | 2022-11-04 | 北京国双科技有限公司 | Paragraph dividing method and device |
CN109359288B (en) * | 2018-08-16 | 2023-05-19 | 上海法可法科技有限公司 | Method for quantitatively evaluating documents in legal field |
CN109213864A (en) * | 2018-08-30 | 2019-01-15 | 广州慧睿思通信息科技有限公司 | Criminal case anticipation system and its building and pre-judging method based on deep learning |
CN109446511B (en) * | 2018-09-10 | 2022-07-08 | 平安科技(深圳)有限公司 | Referee document processing method, referee document processing device, computer equipment and storage medium |
CN109376240A (en) * | 2018-10-11 | 2019-02-22 | 平安科技(深圳)有限公司 | A kind of text analyzing method and terminal |
CN109933768A (en) * | 2019-03-11 | 2019-06-25 | 徐鹏 | A kind of legal documents Intelligent treatment, write method and system |
CN110147445A (en) * | 2019-04-09 | 2019-08-20 | 平安科技(深圳)有限公司 | Intension recognizing method, device, equipment and storage medium based on text classification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||