CN107291688A

CN107291688A - Similarity analysis method for judgment documents based on a topic model

Info

Publication number: CN107291688A
Application number: CN201710376341.2A
Authority: CN
Inventors: 周业茂; 葛季栋; 王悦; 李传艺; 李忠金; 周筱羽; 骆斌
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2017-05-22
Filing date: 2017-05-22
Publication date: 2017-10-24

Abstract

The invention discloses a similarity analysis method for judgment documents based on a topic model. The method applies the LDA (Latent Dirichlet Allocation) topic model from machine learning to judgment documents and proposes a semantic-based, semi-automatic, general-purpose similarity analysis method. Its main steps are corpus selection, similarity labeling, text preprocessing, input selection, parameter setting, iterative training, model generation, and model application. Building on general similarity analysis methods, the method takes full account of the rich specialized vocabulary and complex semantics of judgment documents and exploits their semi-structured nature, thereby improving the accuracy and applicability of similarity analysis for judgment documents.

Description

Similarity Analysis Method of Judgment Documents Based on Topic Model

Technical field

The invention is a text similarity classification method aimed at the judgment documents held within the courts, and belongs to the technical fields of machine learning and text mining.

Background art

Construction of the China Judgment Documents Network (中国裁判文书网) began in 2013. As of May 14, 2017, it had stored more than 29 million documents and had gradually grown into the world's largest website for sharing judgment documents. A series of judicial big-data research and analysis projects have been built on these data. While they have achieved notable results, they still face many problems and challenges, some of which stem from the courts' limited data mining and analysis capabilities and from the shortage of related research.

Judgment documents are an important part of a court's work and record the process and outcome of a people's court trial. They are both the carrier of the results of litigation and the only proof by which a people's court determines and allocates the substantive rights and obligations of the parties. The judgment documents gathered during the informatization of Chinese courts have become a valuable data resource for the trial field, and big-data mining of these documents makes it possible to offer more intelligent information technology to assist judges in handling cases. For example, judgment documents of similar cases can be mined from an existing document repository and recommended to judges; a court can estimate a judge's workload over a period from the similarity of the judgment documents the judge has handled; and judges, litigation participants, and legal practitioners can enter the facts of a case to see which legal provisions it may involve. In response to these scenarios and needs, this patent proposes a text similarity analysis method for judgment documents.

The requirements of court trial work, the particular characteristics of judgment documents, and the limitations and difficulties of existing similarity analysis methods create an urgent need for a similarity analysis method tailored to judgment documents. The semi-structured nature of Chinese court judgment documents makes it possible to improve text similarity results and provides a basis for evaluating them. The relatively fixed classification and identification information in these documents, such as the cause of action and the cited legal provisions, suggests the use of a topic model. The emphasis on logic and reasoning in the text of judgment documents, in turn, places demands on the semantic understanding of the similarity method. For these reasons, this document proposes a semantic-based, semi-automatic, general-purpose similarity analysis method for judgment documents that uses the LDA (Latent Dirichlet Allocation) topic model.

Text similarity analysis is an important research direction in natural language processing. It measures the degree of similarity between target objects and has been applied in many areas, including information extraction, text classification, text clustering, topic discovery, and topic tracking. A similarity method generally has two key elements: the representation of features, and the computation of similarity between those features. Existing text similarity methods have developed over a long period, from purely character-based analysis to methods that mine text semantics from corpora and knowledge systems. Chinese similarity methods build on this work and further explore the characteristics of the Chinese language. Because similarity problems are diverse, different target objects (words, short texts, long texts) and different application scenarios often call for different similarity methods to achieve good results.

The topic model family of methods, in particular LSA (Latent Semantic Analysis) and LDA and their variants, is an important direction in current text similarity research. In short, a topic model assumes that each word belongs to several topics with certain probabilities and that each text expresses several topics. When a topic model is used for similarity analysis, the trained model infers the topics of a text, and the similarity between texts is then computed from their topics. Seen another way, a topic model maps a high-dimensional vocabulary-based vector into a semantic space, reducing its dimensionality. Topic models developed from Latent Semantic Indexing (LSI), and Probabilistic Latent Semantic Analysis (pLSA) was the first influential probabilistic topic model. Building on pLSA, Blei introduced the Dirichlet distribution and proposed LDA, which further generalized the topic model approach. In later applied research, a series of LDA-related improvements were published to address different problems and to make LDA more efficient (parallelization).

LDA stands for Latent Dirichlet Allocation and was proposed by Blei in 2003. It is an unsupervised topic model that can be used for semantic understanding and hidden topic identification on large document collections or corpora. Compared with earlier topic models, LDA introduces the Dirichlet distribution and adds a prior probability assumption. This makes the model easier to apply to texts outside the training corpus, reduces the risk of overfitting, and gives better results on small corpora. LDA has been widely tried and applied in text information extraction, text classification, automatic text summarization, image processing, and other fields.

The topic model approach rests on the assumption expressed by the following formula:

P(t_l | D_i) = \sum_{j} p(t_l | T_j) \, P(T_j | D_i)

Here P(t_l | D_i) is the probability that word t_l appears in document D_i, p(t_l | T_j) is the probability that word t_l appears in topic T_j, and P(T_j | D_i) is the probability that topic T_j appears in document D_i. The model assumes that the probability of a word appearing in a document equals the sum, over all topics the document may belong to, of the probability of that topic multiplied by the probability of the word within that topic.
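The mixture assumption above can be checked numerically. The following is a minimal sketch with made-up toy probabilities (two topics, three words); the function name `word_given_doc` and all values are illustrative assumptions, not part of the patent.

```python
def word_given_doc(word_given_topic, topic_given_doc):
    """Return P(t_l | D_i) for every word l as sum_j p(t_l | T_j) * P(T_j | D_i)."""
    n_words = len(word_given_topic[0])
    return [
        sum(topic_given_doc[j] * word_given_topic[j][l]
            for j in range(len(topic_given_doc)))
        for l in range(n_words)
    ]

# Toy values (illustrative only): 2 topics, 3 words.
word_given_topic = [
    [0.7, 0.2, 0.1],   # p(t_l | T_0)
    [0.1, 0.3, 0.6],   # p(t_l | T_1)
]
topic_given_doc = [0.25, 0.75]   # P(T_j | D_i)

probs = word_given_doc(word_given_topic, topic_given_doc)
print(probs)
```

Because each row of `word_given_topic` and the vector `topic_given_doc` sum to 1, the resulting word probabilities also sum to 1.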

An LDA model is mainly trained with one of two methods: Gibbs sampling or variational EM. Gibbs sampling simulates the sampling process on a Markov chain and seeks a stationary probability distribution as the chain makes its transitions. Variational EM is built on Bayesian computation and consists of two steps: finding the optimal variational parameters (E step), and estimating the original model parameters so as to maximize the lower bound of the model (M step). The model is illustrated in Figure 4.

The figure reads as follows. The topic distribution θ_i of a document is generated from the document-topic Dirichlet prior determined by the hyperparameter α. The topic z_{i,j} is generated from the multinomial topic distribution θ_i. The word distribution of topic z_{i,j} is generated from the topic-word Dirichlet prior determined by the hyperparameter β, and the word w_{i,j} is finally generated from that multinomial word distribution. The concrete form of the model differs somewhat between Gibbs sampling and variational EM. Variational EM trains faster than Gibbs sampling, but its result is a local optimum that is not necessarily the global optimum. Gibbs sampling has simpler program logic, but it cannot support distributed computation the way variational EM does.
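The generative process just described can be simulated directly. The sketch below uses only the Python standard library, draws Dirichlet samples by normalizing Gamma draws, and generates one toy document; the vocabulary, the hyperparameter values, and the function names are assumptions chosen for illustration.

```python
import random

def sample_dirichlet(rng, concentration, size):
    """Draw from a symmetric Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(rng, vocab, n_topics, n_words, alpha, beta):
    # One word distribution per topic, drawn from the prior with hyperparameter beta.
    topic_word = [sample_dirichlet(rng, beta, len(vocab)) for _ in range(n_topics)]
    # Topic distribution theta_i of this document, drawn with hyperparameter alpha.
    theta = sample_dirichlet(rng, alpha, n_topics)
    words = []
    for _ in range(n_words):
        z = rng.choices(range(n_topics), weights=theta)[0]   # topic z_{i,j}
        w = rng.choices(vocab, weights=topic_word[z])[0]      # word w_{i,j}
        words.append(w)
    return theta, words

rng = random.Random(0)
vocab = ["contract", "loan", "injury", "divorce"]
theta, words = generate_document(rng, vocab, n_topics=2, n_words=8,
                                 alpha=0.5, beta=0.5)
print(theta, words)
```

Training runs this process in reverse: given only the observed words, it estimates the topic-word distributions and each document's θ_i.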

Summary of the invention

The technical problem to be solved by the invention is how to provide a general-purpose, semi-automatic similarity analysis method for judgment documents that can be applied to similarity-based document classification, similar document recommendation, evaluation of judges' workload based on document similarity, and prediction of legal provisions from the facts of a case. The method uses the TF-IDF and LDA methods from text mining, builds a similarity analysis model through a series of processing steps and iterative training, and obtains the similarity relationships between documents from that model. It produces good similarity results and computes similarity quickly, providing a better foundation for similarity-related applications built on judgment documents.

The technical solution of the invention is as follows:

1. A similarity analysis method for judgment documents based on a topic model, characterized in that a topic-model-based text mining method is used to analyze text similarity for judgment documents in light of their characteristics. The brief flow of the method is shown in Figure 1. Text preprocessing and parameter setting each comprise several sub-steps, and iterative training can be expanded further; the detailed flow is shown in Figure 2 and is as follows:

(1) Select the target corpus according to the structured classification information of the judgment documents (including cause of action, case type, and so on);

(2) Divide the target corpus into a training corpus and a test corpus, and label the test corpus with similarity annotations;

(3) Preprocess the document text of the training corpus, including document segmentation, document screening, Chinese word segmentation, and word collection and filtering before and after word segmentation;

(4) Select the high-credibility part of the target corpus as the input content;

(5) Set the parameters, including stop words, LDA model training parameters, TF-IDF input, and evaluation criteria;

(6) Train the model with LDA on the training corpus;

(7) Evaluate the trained model on the test corpus (that is, how well it agrees with the similarity annotations of the test corpus);

(8) Adjust the parameters and repeat step (6) until all required parameter values have been traversed;

(9) Based on the accuracy obtained under the different parameters, select suitable parameters and generate the trained model;

(10) Apply the trained model.
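Steps (6) to (9) amount to a parameter sweep: train once per candidate setting, score each model against the labeled test corpus, and keep the best setting. The sketch below shows that loop in isolation. The `train` and `evaluate` callables are placeholders, and the stand-in functions and their score curve are invented so the example runs; a real run would train an LDA model and compare its output with the test annotations.

```python
def sweep(candidates, train, evaluate):
    """Train once per candidate setting; return (best_setting, best_score, all_scores)."""
    scores = {}
    for setting in candidates:
        model = train(setting)              # step (6): train with this setting
        scores[setting] = evaluate(model)   # step (7): score against the test labels
    best = max(scores, key=scores.get)      # step (9): keep the best-scoring setting
    return best, scores[best], scores

# Stand-in callables (hypothetical): a real pipeline would train LDA here.
def fake_train(n_topics):
    return {"n_topics": n_topics}

def fake_evaluate(model):
    # Made-up score curve peaking at 40 topics, purely for illustration.
    return 1.0 - abs(model["n_topics"] - 40) / 100.0

best, best_score, scores = sweep([10, 20, 40, 80], fake_train, fake_evaluate)
print(best, best_score)
```

Keeping every score, not only the best one, makes it possible to plot the score against the swept parameter afterwards.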

2. The details of step (2) are shown in Figure 3. First, the target corpus obtained in step (1) is divided into a training corpus and a test corpus. The test corpus is then labeled with similarity annotations.

Similarity labeling means annotating a certain number of target documents with the expected output. For example, each document may be annotated with its similarity to the other documents, or with a similarity-based classification, ranking, or comparable result. The process is determined by two dimensions: the labeling method, which describes how the labeling is carried out, and the labeling granularity, which describes how fine-grained the labels are.

There are two labeling methods. Automatic labeling requires a similarity judgment strategy to be defined and implemented. Manual labeling is carried out by relevant court experts.

There are two labeling granularities. Numeric labeling records, as a number, the similarity of each document to the target document. Non-numeric labeling is used when documents cannot be labeled one by one with numbers; in that case labels such as classifications or rankings are used.

3. Step (3) aims to simplify the input and remove interference. It comprises five preprocessing sub-steps:

(3.1) Split each judgment document into sections;

(3.2) Remove judgment documents that are not written to standard;

(3.3) Delete from the judgment documents the stop words that would harm word segmentation;

(3.4) Perform Chinese word segmentation on the judgment documents;

(3.5) Generate a stop-word list specific to judgment documents.

4. Because the parts of a judgment document differ in importance and credibility for similarity analysis, step (4) selects the high-credibility part of the target corpus as the input content.

5. Step (5) aims to set up the model training parameters and complete the preparation before training. It comprises four sub-steps:

(5.1) Set the stop words;

(5.2) Set the training parameters;

(5.3) Generate TF-IDF vectors for the training corpus;

(5.4) Set the evaluation criteria used to judge the actual performance of the trained model.
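Sub-step (5.3) can be illustrated with a small standard-library sketch. The patent does not state which TF-IDF variant it uses, so the weighting below (term frequency times log(N / document frequency)) is one common choice assumed for illustration; the function name and the toy token lists are also assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Return one {term: weight} dict per document, with tf * log(N / df) weights."""
    n_docs = len(tokenized_docs)
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))          # count each term once per document
    vectors = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

# Toy corpus of already-segmented documents (illustrative only).
docs = [
    ["借款", "合同", "利息"],
    ["借款", "担保", "担保"],
    ["离婚", "抚养", "借款"],
]
vectors = tfidf_vectors(docs)
print(vectors[1])
```

A term that occurs in every document, such as 借款 here, gets weight zero, which is why TF-IDF complements an explicit stop-word list.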

6. In step (10), the trained model is used to compute the topic-based similarity between any two documents, so that the similarity between any pair of documents can be obtained quickly. A range of similarity-based applications can then be developed, including similarity-based classification of judgment documents, recommendation of similar judgment documents, evaluation of judges' workload based on document similarity, and recommendation of legal provisions based on the facts of a case.
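Once every document has a topic distribution, comparing two documents reduces to comparing two short vectors. The patent text here does not name the similarity measure, so the sketch below assumes cosine similarity; the topic vectors, document ids, and function names are invented for illustration.

```python
import math

def cosine_similarity(p, q):
    """Cosine of the angle between two topic vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def most_similar(query, corpus_topics, top_n=2):
    """Rank stored documents by similarity to the query's topic distribution."""
    ranked = sorted(corpus_topics.items(),
                    key=lambda item: cosine_similarity(query, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:top_n]]

# Topic distributions inferred offline for stored documents (toy values).
corpus_topics = {
    "doc_a": [0.8, 0.1, 0.1],
    "doc_b": [0.1, 0.8, 0.1],
    "doc_c": [0.7, 0.2, 0.1],
}
query = [0.75, 0.15, 0.10]
top = most_similar(query, corpus_topics)
print(top)
```

Because the stored vectors can be computed ahead of time, only the query document needs topic inference at request time.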

Based on the invention, we have developed a similarity analysis tool for judgment documents in Python. The tool supports model training and can also be used directly to recommend similar judgment documents and to predict legal provisions from the facts of a case. Richer similarity analyses and applications can be built on top of it.

Building on general similarity analysis methods, the method takes full account of the rich specialized vocabulary and complex semantics of judgment documents and exploits their semi-structured nature, improving the accuracy and applicability of similarity analysis for judgment documents. In addition, because it uses a topic model, much of the processing can be done offline, which improves the real-time response speed of similarity analysis and the efficiency of the applications built on it.

Brief description of the drawings

Figure 1. Brief flow chart of the topic-model-based similarity analysis method for judgment documents

Figure 2. Detailed flow chart of the topic-model-based similarity analysis method for judgment documents

Figure 3. Division and labeling of the target judgment documents

Figure 4. Schematic diagram of the LDA model

Figure 5. Example of similarity labeling

Figure 6. Example of the basic case information in a judgment document

Figure 7. Core structure of a judgment document

Figure 8. Example steps for evaluating a trained model

Figure 9. Example line chart of the number of topics against model evaluation results

Figure 10. Flow chart of the similarity recommendation application

Detailed description

To make the purpose, technical solution, and advantages of the invention clearer, the invention is described in detail below with reference to the drawings and a specific example.

The invention aims to analyze the similarity of judgment documents. The results can be applied to similarity-based classification of judgment documents, recommendation of similar judgment documents, evaluation of judges' workload based on document similarity, prediction of the legal provisions relevant to a case, and similar scenarios. The method uses TF-IDF and LDA and applies special processing and measurement suited to the characteristics of judgment documents. Its specific steps are as follows:

(1) From the collection of judgment documents, extract a subset of target documents as the target corpus, using some attribute (such as cause of action or case type) as the filter;

(5) Set the parameters, including stop words, LDA topic model training parameters, TF-IDF input, and evaluation criteria;

(6) Train the model by applying the LDA topic model to the training corpus;

(9) Based on the accuracy obtained under the different parameters, select suitable parameters and generate the trained model for similarity analysis of judgment documents;

(10) Apply the trained model generated in step (9) to analyze the similarity of judgment documents.

The method is explained below using an example of similarity analysis on judgment documents from civil first-instance cases. The example application uses document similarity to predict, from the basic case information, the legal provisions that may be relevant. This function can support judges during adjudication and can also give the parties automated legal consultation:

(1) This step obtains the target document corpus, which is used for training and for test verification in the later steps. Because judgment documents are semi-structured, their case type and cause of action can be used to classify them further. Under different classifications, the case circumstances and the corresponding legal provisions also correspond to each other to different degrees. To improve the accuracy of the later model training and reduce complexity, the method therefore requires the documents to be classified further and each type to be processed separately. The classification dimensions are case type and cause of action. In terms of classification granularity, case type > cause of action; that is, case type is the coarser of the two.

In this example, only the case type is used for classification. Civil first-instance cases are selected, 53,000 documents in total. Because laws change over time, only documents with a filing year after 2014 are selected.

(2) This step divides the documents into a training set and a test set. The training set is used for model training. The test set is labeled with similarity annotations that express the expected output and is used to test and evaluate the model, so that a model suitable for similarity analysis is obtained through iteration.
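The split itself can be as simple as a reproducible shuffle. The patent does not say how documents are assigned to the two sets, so the random assignment below is an assumption; the 53,000 total matches the example corpus, and the 3,000-document hold-out matches the test set size used in the example.

```python
import random

def split_corpus(doc_ids, n_test, seed=0):
    """Shuffle document ids reproducibly and hold out n_test of them for testing."""
    shuffled = list(doc_ids)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]

# Integer ids stand in for real document identifiers.
train_ids, test_ids = split_corpus(range(53000), n_test=3000)
print(len(train_ids), len(test_ids))
```

Fixing the seed keeps the test set stable across training iterations, so scores from different parameter settings remain comparable.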

As described in the summary above, similarity labeling is determined by two dimensions: the labeling method, which describes how the labeling is carried out, and the labeling granularity, which describes how fine-grained the labels are.

There are two labeling methods. Automatic labeling requires a similarity judgment strategy to be defined and implemented. Manual labeling is carried out by relevant court experts.

There are two labeling granularities. Numeric labeling records, as a number, the similarity of each document to the other target documents. For example, document 1 might be labeled as 80% similar to document 2 and 60% similar to document 3. Non-numeric labeling is used when documents cannot be labeled one by one with numbers; in that case labels such as classifications or rankings are used. For example, if the method is to be used for recommending similar documents, court experts can manually classify the documents in the test corpus, and the classification serves as the label.

The relationship between the two dimensions, and their advantages and disadvantages, are shown in Table 1. Numeric labeling is more precise than non-numeric labeling and tends to give better results, so it should be preferred when conditions are otherwise equal. With manual labeling, numeric labels are often impractical, so non-numeric labels are used more often.

Table 1. Advantages and disadvantages of the labeling methods and their relationship to labeling granularity

In this example, 50,000 documents are used as the training set and the remaining 3,000 as the test set. The test set is labeled automatically with non-numeric labels. Because the goal of the example is to predict legal provisions from the facts of a case, each document in the test corpus is labeled with the main legal provisions it cites. Legal provisions and the way they are cited are relatively fixed, so this labeling can be largely automated. The resulting form is shown in Figure 5: each document is associated with several legal provisions, and the numbers in square brackets are the serial numbers of the documents and provisions. Because the way provisions are written in the documents is somewhat inconsistent, they need to be normalized and matched during labeling. To simplify the computation, this example records only the specific article and not the cited paragraph or item within it. Correspondingly, in step (7) the method takes the case facts as input, predicts legal provisions with the model, and compares them with the labeled provisions to complete the test evaluation.
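Automatic labeling of this kind can be approximated with a regular expression over the citation sentence. The sketch below is a rough heuristic under stated assumptions: it expects citations written as 《law name》 followed by one or more 第…条 references, keeps only the article and drops any 款 or 项, and does not normalize law names. The pattern and the function name `cited_articles` are illustrative, not the patent tool's actual rules.

```python
import re

NUM = r"[零〇一二三四五六七八九十百千\d]+"
# A law title in 《》 followed by a run of articles, each optionally with 款/项.
LAW = re.compile(
    r"《([^》]+)》((?:第" + NUM + r"条(?:第" + NUM + r"[款项])*[、,，及和]?)+)"
)
ARTICLE = re.compile(r"第(" + NUM + r")条")

def cited_articles(text):
    """Return (law, article) pairs in order of first appearance, without duplicates."""
    labels = []
    for law, tail in LAW.findall(text):
        for article in ARTICLE.findall(tail):
            label = (law, article)
            if label not in labels:
                labels.append(label)
    return labels

sample = ("依照《中华人民共和国合同法》第六十条第一款、第一百零七条，"
          "《中华人民共和国民事诉讼法》第一百四十四条之规定，判决如下")
labels = cited_articles(sample)
print(labels)
```

Real documents cite provisions in more ways than this pattern covers (abbreviated law names, judicial interpretations, ranges of articles), so a production rule set would need additional normalization.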

(3) This step preprocesses the training set documents. It has two main goals: first, to obtain the sections of each document that are needed for training; second, to remove noise. The sub-steps are described below:

(3.1) Judgment documents are semi-structured. Chinese courts have issued standards for the section structure of judgment documents. Using these standards together with the characteristic words commonly found in each section, the text of each section can be extracted, which helps the later training and analysis.

(3.2) Some judgment documents do not follow the section structure standards and are written too loosely. Such documents are removed from the training set to reduce interference.

(3.4) Word segmentation is usually the foundation of Chinese language processing. In this example, the jieba segmenter is used for word segmentation.

(3.3), (3.5) Judgment documents contain many place names (such as a certain city or county), role terms (such as plaintiff and defendant), and low-frequency words. These words contribute little to comparing document similarity and may even interfere with training; examples are "原告" (plaintiff), "被告" (defendant), and "本院" (this court). In most cases they need to be removed. Because some of these words can cause additional interference during word segmentation, a first group is removed beforehand in step (3.3). Other high-frequency court vocabulary can only be found by gathering statistics over the judgment documents, so in step (3.5) high-frequency words with no specific meaning are counted and collected as the stop-word list used later.
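The two stop-word passes can be sketched as follows. The first function deletes a fixed term list from raw text before segmentation, as in step (3.3); the second collects terms that occur in a large share of already-segmented documents, as in step (3.5). The 0.75 document-ratio threshold, the function names, and the toy data are assumptions; in the example pipeline the token lists would come from the jieba segmenter.

```python
from collections import Counter

def remove_terms(text, terms):
    """Delete each listed term from the raw text (longest terms first)."""
    for term in sorted(terms, key=len, reverse=True):
        text = text.replace(term, "")
    return text

def frequent_terms(tokenized_docs, min_doc_ratio=0.75):
    """Return terms that occur in at least min_doc_ratio of the documents."""
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))          # document frequency, not raw counts
    threshold = min_doc_ratio * len(tokenized_docs)
    return {term for term, count in df.items() if count >= threshold}

cleaned = remove_terms("原告向本院起诉被告", ["原告", "被告", "本院"])

# Toy segmented corpus (illustrative only).
tokenized = [
    ["原告", "被告", "借款"],
    ["原告", "被告", "离婚"],
    ["原告", "被告", "工伤"],
    ["原告", "租赁", "房屋"],
]
stop_words = frequent_terms(tokenized)
print(cleaned, stop_words)
```

Counting document frequency rather than raw occurrences keeps one long document from pushing a meaningful term into the stop-word list.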

In this example, the basic case information of each training document is extracted as the input section. The main part of this section can be broken down further into the plaintiff's claims, the defendant's defense, the facts found by the court, and the evidence. This structure follows the courts' drafting standards for the relevant judgment documents (for details, see the court document drafting standards at http://www.cibsn.com/Article/Detailed/43618) and has clear segmentation rules, so it can be split automatically. An example of a document's basic case information and its segmentation is shown in Figure 6.

If a training-corpus document does not contain the paragraph above, or the corresponding segments cannot be extracted from it, the document is discarded. In addition, step (3.3) removes some vocabulary specific to judgment documents, place names at all administrative levels, and personal names and their anonymized forms in the document, such as 王某, 王某某, and 王某甲. Step (3.5) counts the high-frequency judgment-document vocabulary in these documents that carries no special meaning and adds it to the stop words used later.
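A minimal sketch of the pre-segmentation filtering in step (3.3), using regular expressions. The surname list, the place-name pattern, and the term list are hypothetical placeholders; a real system would need a much fuller surname list and a place-name gazetteer.

```python
import re

# Hypothetical patterns for illustration only.
# Anonymized names such as 王某, 王某某, 王某甲.
ANONYMIZED_NAME = re.compile(r"[王李张刘陈杨赵黄周吴]某[某甲乙丙丁]?")
# Masked place names such as 某某市 or 某某县.
PLACE_NAME = re.compile(r"某某(?:省|市|县|区)")
# Judgment-document terms removed before word segmentation.
COURT_TERMS = ("原告", "被告", "本院")


def strip_before_segmentation(text: str) -> str:
    """Remove names, place names and court terms prior to segmentation."""
    text = ANONYMIZED_NAME.sub("", text)
    text = PLACE_NAME.sub("", text)
    for term in COURT_TERMS:
        text = text.replace(term, "")
    return text
```

Removing these strings before segmentation keeps the segmenter from splitting them into fragments that would later have to be filtered one by one.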

(4) For judgment documents, different paragraphs differ in importance during similarity analysis. Fundamentally, this stems from the structure of the judgment document itself. The core content of a judgment document consists of evidence, facts, legal provisions, and the judgment. Evidence corroborates evidence, facts are derived from evidence (or from other facts), facts are linked to legal provisions, and the judgment follows from these, as shown in Figure 7. Of these, the judgment is the outcome and the legal provisions are explicit clauses, whereas the evidence and facts are full of uncertainty. For example, some evidence is confirmed by the court while other evidence is not admitted, and the facts described in the plaintiff's statement are less credible than those in the fact-finding section. If the credibility of the listed facts and evidence could be obtained, the differing importance of different words could be reflected more effectively. In practice, however, because the natural language of the documents is written freely, it is difficult to obtain the credibility of each item of evidence and each fact individually, so we generally select the high-confidence parts of the corpus as the input for subsequent training.

In this example, the main part of the basic case information paragraph can be broken down into the plaintiff's statement, the defendant's defense, the fact-finding section, and the evidence section. Of these, the fact-finding section and the evidence section are treated as the high-confidence parts and serve as the input for subsequent training.
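The input selection in this example might look like the sketch below. It assumes each document has already been split into a dictionary of named segments; the keys `facts_found` and `evidence` and the function name are hypothetical.

```python
from typing import Dict, Optional

# Hypothetical keys for the segments treated as high-confidence.
HIGH_CONFIDENCE = ("facts_found", "evidence")


def training_input(segments: Dict[str, str]) -> Optional[str]:
    """Join the high-confidence segments; None if any is missing or empty."""
    parts = [segments.get(name, "") for name in HIGH_CONFIDENCE]
    if not all(parts):
        return None
    return "".join(parts)
```

The plaintiff's statement and the defendant's defense are simply left out of the returned text, so they never reach the training step.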

(5) In addition to setting the training parameters for the LDA model, this step completes the preparatory work needed before training.

(5.1) Set the stop words, which include the vocabulary from step (3.5) and general-purpose stop words. The specific contents can be adjusted to actual requirements.
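One way to assemble such a stop word list is sketched below. It assumes that "high-frequency" in step (3.5) is measured by document frequency and uses a hypothetical threshold `min_doc_ratio`; the text does not specify either choice.

```python
from collections import Counter

# General-purpose stop words named in the example; a real list is longer.
GENERIC_STOP_WORDS = {"的", "了"}


def corpus_stop_words(tokenized_docs, min_doc_ratio=0.8):
    """Words occurring in at least min_doc_ratio of the documents."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each word once per document
    cutoff = min_doc_ratio * len(tokenized_docs)
    return {word for word, n in doc_freq.items() if n >= cutoff}


def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop word set."""
    return [t for t in tokens if t not in stop_words]
```

The final stop word set would then be the union of `GENERIC_STOP_WORDS` and the result of `corpus_stop_words`, reviewed by hand before use.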

(5.2) Set the training parameters, which provide the boundaries for iterative training. These mainly consist of the range of topic numbers and the topic interval (the step between successive topic numbers); the topic numbers can be set with reference to the number of legal provisions applicable to the corresponding case type.

(5.3) Generate TF-IDF vectors for the corpus. Compared with using a bag-of-words or set-of-words model alone as input, TF-IDF vectors are more expressive.
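To make the weighting concrete, here is a small pure-Python sketch of one textbook TF-IDF variant (relative term frequency multiplied by the natural log of inverse document frequency). This is an assumption for illustration; library implementations may use a different formula and normalization.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return one {word: weight} dict per document."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # number of documents containing each word
    vectors = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        vectors.append({
            # tf = share of the document's tokens; idf = log(N / df)
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors
```

Under this variant a word that occurs in every document receives weight zero, which is why TF-IDF input plays down vocabulary that is shared by the whole corpus.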

(5.4) Set the evaluation criteria. This step determines the parameters used in step (7) when the test corpus is used to compute the model's accuracy.

In this example, the general-purpose stop words are common Chinese words that carry no special meaning, such as "的" and "了". The topic number range is 300 to 900, and the topic interval is 50.

(6) Following these settings, train on the preprocessed training set using the LDA algorithm. Because LDA models converge slowly, training at large scale requires considerable resources and time.

In this example, gensim is used as the underlying LDA library for the experiment. Training starts with 300 topics, and the number of topics is then increased by 50 each time until it reaches 800; each result is passed to the next step for evaluation of the trained model.

(7) Use the test corpus to compute the accuracy of the model from this round of training. This step is tied to the similarity annotation method chosen in step (2): different annotation schemes lead to different accuracy calculations. If numeric annotation is used, metrics such as precision and recall are recommended, for example evaluating by the number of predicted legal provisions that are hit within a given number of predictions. If non-numeric annotation is used, this part must be designed according to the actual design requirements and presentation needs. For example, if the annotation consists of selecting, for each document, the n other documents most similar to it, giving only a ranking without specific values, then each rank position can be assigned a different weight according to its importance before the score is computed.
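The rank-weighted idea for non-numeric annotation could be sketched as follows. The default weights (1, 1/2, 1/3, ...) and the function name are assumptions chosen for illustration; the text leaves the actual weighting to the designer.

```python
def rank_weighted_score(gold_ranking, predicted, weights=None):
    """Weighted share of annotated similar documents that the model found.

    gold_ranking: annotated similar documents, most similar first.
    predicted:    documents returned by the model.
    weights:      one weight per gold rank position (assumed: 1 / rank).
    """
    if weights is None:
        weights = [1.0 / (i + 1) for i in range(len(gold_ranking))]
    predicted_set = set(predicted)
    hit = sum(w for doc, w in zip(gold_ranking, weights) if doc in predicted_set)
    total = sum(weights[:len(gold_ranking)])
    return hit / total if total else 0.0
```

With these weights, missing the top-ranked annotated document costs more than missing a lower-ranked one, which matches the idea of weighting positions by importance.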

In this example, an automated, non-numeric annotation method is used. The evaluation is based on the accuracy of legal-provision prediction, with the specific steps shown in Figure 8: the model is used to predict legal provisions, and the prediction accuracy is taken as the model's evaluation result.
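As a sketch of how such a prediction accuracy might be measured, the function below compares the provisions predicted for one test document with the provisions that document actually cites. Using precision and recall here is an assumption; the exact metric is defined by Figure 8, which is not reproduced in the text.

```python
def precision_recall(predicted, actual):
    """Precision and recall of predicted provisions against cited ones."""
    predicted, actual = set(predicted), set(actual)
    hits = len(predicted & actual)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    return precision, recall
```

Averaging these values over all test documents would give a single score per trained model.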

(8) In this step, steps (6) and (7) are executed iteratively: they are repeated across the configured parameter range according to the configured iteration rule, and the accuracy achieved in evaluation by the model obtained under each parameter value is recorded. Intuitively, this step produces a line chart with the iteration parameter on the horizontal axis and the trained model's accuracy on the vertical axis, on which subsequent decisions can be based.

In this example, the number of topics of the LDA model is the iteration parameter. Increasing it from 300 to 800 produces 11 different trained models. An example line chart of topic number against model evaluation result is shown in Figure 9; the evaluation result has a local maximum.
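The iteration over topic numbers can be sketched as a simple sweep. The `train` and `evaluate` arguments are hypothetical callables standing in for steps (6) and (7); no particular library call is implied.

```python
from typing import Any, Callable, Dict, Iterable, List


def topic_grid(start: int, stop: int, step: int) -> List[int]:
    """Topic numbers from start to stop inclusive."""
    return list(range(start, stop + 1, step))


def sweep(grid: Iterable[int],
          train: Callable[[int], Any],
          evaluate: Callable[[Any], float]) -> Dict[int, float]:
    """Train and evaluate one model per topic number; record each score."""
    scores = {}
    for num_topics in grid:
        model = train(num_topics)              # step (6)
        scores[num_topics] = evaluate(model)   # step (7)
    return scores


# 300, 350, ..., 800 gives the 11 settings mentioned in the example.
grid = topic_grid(300, 800, 50)
```

The returned dictionary holds exactly the points of the line chart: topic number on the horizontal axis, evaluation result on the vertical axis.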

(9) Based on the line chart of topic number against model evaluation result obtained above, decide which number of topics to use for model training.

In this example, the trained model is most accurate when the number of topics is around 450.

However, the result in this example is a one-time presentation after many experiments. In practice it can be difficult to determine a suitable number of topics in a single pass, and even when the curve shows a clear peak, as in the figure, there is no guarantee that the corresponding value is a global rather than a local optimum. In this step, therefore, run as many iterations as conditions allow, and let the expected accuracy requirement inform the choice of topic number as well. Training can then be carried out with the chosen parameters, and the resulting model serves as the underlying model for future applications.
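The selection logic can be illustrated with a short sketch that lists every interior peak of the recorded curve and picks the best-scoring setting among those tried. The function names are hypothetical, and the scores in `toy_scores` are made-up values for illustration, not results from the experiment.

```python
def interior_peaks(scores):
    """Topic numbers whose score exceeds both neighbouring settings."""
    keys = sorted(scores)
    return [
        k for prev, k, nxt in zip(keys, keys[1:], keys[2:])
        if scores[k] > scores[prev] and scores[k] > scores[nxt]
    ]


def best_setting(scores):
    """Topic number with the highest recorded score."""
    return max(scores, key=scores.get)


# Made-up values, used only to show that a curve can have several peaks.
toy_scores = {300: 0.1, 350: 0.3, 400: 0.2, 450: 0.5, 500: 0.4}
```

A curve with more than one interior peak is a reminder that the best value found is only the best among the settings tried, so widening the range or narrowing the interval may still change the choice.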

(10) The previous steps yield a base model that can be used for similarity analysis. In practical use, however, an application layer must be added on top of it to make the results visible. As described earlier, the model can support applications such as similarity-based classification of judgment documents, recommendation of similar judgment documents, workload assessment based on judgment document similarity, and prediction of legal provisions based on the facts of a case.

In this example, a fact-based legal-provision prediction application is built on the underlying model. When the basic case information of a first-instance civil case, or its fact-finding and evidence sections, is entered, the system uses the model to predict the legal provisions likely to be relevant to the case. The workflow is shown in Figure 10: the similarity model obtained by this method is first used to find documents similar to the input case, and the predicted legal provisions are then obtained by tallying the legal provisions cited by those similar documents.
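The two-stage workflow (find similar documents, then tally their cited provisions) might be sketched as below. It assumes every document is already represented by a topic-distribution vector and that similarity is measured by cosine similarity; the similarity measure, the record layout (`topics`, `provisions`), and the parameter names are all assumptions for illustration.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def predict_provisions(query_topics, corpus, top_docs=2, top_provisions=2):
    """Tally provisions cited by the documents most similar to the query.

    corpus: list of {"topics": [...], "provisions": [...]} records
            (hypothetical layout).
    """
    ranked = sorted(
        corpus,
        key=lambda doc: cosine(query_topics, doc["topics"]),
        reverse=True,
    )
    tally = Counter()
    for doc in ranked[:top_docs]:
        tally.update(doc["provisions"])
    return [provision for provision, _ in tally.most_common(top_provisions)]
```

Provisions cited by several of the similar documents rise to the top of the tally, so the prediction reflects how often a provision accompanies comparable facts.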

Claims

1. A similarity analysis method for judgment documents based on a topic model, characterized in that, in view of judgment documents and their characteristics, a text mining method based on a topic model is used to perform text similarity analysis, with the following steps:

(1) In the set of judgment documents, a certain attribute (such as cause of action, case type, etc.) is used as a screening condition to extract a subset of target documents as the target corpus;

(2) Divide the target corpus into a training corpus and a test corpus, and annotate the test corpus with similarity labels;

(3) Perform preprocessing on the document text of the training corpus, including document segmentation, document screening, Chinese word segmentation, and word collection and filtering before and after word segmentation;

(4) Select the high-confidence part of the target corpus as the input content;

(5) Set various parameters, including setting stop words, LDA topic model training parameters, TF-IDF input and evaluation criteria;

(6) Using the training corpus, apply the LDA topic model to carry out model training;

(7) Use the test corpus to evaluate the trained model (that is, its degree of conformity with the similarity annotations of the test corpus);

(8) Adjust the parameters and iteratively execute step (6) until all required parameter values have been traversed;

(9) According to the accuracy under different parameters, select appropriate parameters to generate a trained model for similarity analysis of judgment documents;

(10) Apply the trained model generated in step (9) to analyze the similarity of judgment documents.

2. The method for similarity analysis of judgment documents based on a topic model according to claim 1, characterized in that step (3) aims to simplify the input and remove interference through preprocessing, and includes the following five sub-steps:

(3.1) Segment the judgment document;

(3.2) Remove irregularly written judgment documents;

(3.3) Delete stop words that are harmful to word segmentation in the judgment documents;

(3.4) Carry out Chinese word segmentation for the judgment document;

(3.5) Generate stop words specific to judgment documents.

3. The method for similarity analysis of judgment documents based on a topic model according to claim 1, characterized in that step (5) aims to set up the model training parameters and complete the preparatory work before training, and includes the following four sub-steps:

(5.1) Set stop words;

(5.2) Set training parameters;

(5.3) Generate TF-IDF vectors for the training corpus;

(5.4) Set the evaluation criteria used to determine the actual effectiveness of the trained model.

4. The method for similarity analysis of judgment documents based on a topic model according to claim 1, characterized in that in step (10) the trained model can be used to compute the topic-based similarity between any two documents, so that the similarity between any two documents can be obtained quickly and a series of similarity-based applications can be developed, including similarity-based classification of judgment documents, recommendation of similar judgment documents, judge workload assessment based on judgment document similarity, and recommendation of legal provisions based on the facts of a case.