JP2023025933A

JP2023025933A - Similarity degree determination device, similarity degree determination system, similarity degree determination method, and program

Info

Publication number: JP2023025933A
Application number: JP2021131400A
Authority: JP
Inventors: 佳典栗田; Yoshinori Kurita; 謙一柏木; Kenichi Kashiwagi; 裕志郎高橋; Yushiro Takahashi
Original assignee: Croco Corp
Current assignee: Croco Corp
Priority date: 2021-08-11
Filing date: 2021-08-11
Publication date: 2023-02-24
Anticipated expiration: 2041-08-11
Also published as: JP7138981B1

Abstract

PROBLEM TO BE SOLVED: To provide a similarity degree determination device, a similarity degree determination system, a similarity degree determination method, and a program capable of calculating a similarity degree of texts with higher accuracy.

SOLUTION: A similarity degree determination device includes: a sentence vector acquisition unit that acquires a sentence vector, which is a feature vector for each sentence obtained by decomposing a first text that is an analysis source text and a second text that is a text to be compared; a key phrase acquisition unit that acquires key phrases that are included in each of the first text and the second text and are important elements constituting the texts; and a similarity degree calculation unit that calculates a comprehensive similarity degree of the first text and the second text based on a similarity degree between the sentence vectors of the first text and the second text and a similarity degree of an appearance degree of the same key phrase.

SELECTED DRAWING: Figure 2

Description

Application of Article 30, Paragraph 2 of the Patent Act has been requested. 1. Disclosure on a public website through a telecommunication line. Website publication date: August 25, 2020. Website URL: https://kagemusya.biz-samurai.com/

Application of Article 30, Paragraph 2 of the Patent Act has been requested. 2. Disclosure on a public website through a telecommunication line. Website publication date: August 27, 2020. Website URL: https://cro-co.co.jp/information/news/service/418/

The present invention relates to a similarity determination device, a similarity determination system, a similarity determination method, and a program.

As the number of online documents accessible via the Internet has become enormous, techniques for searching for similar documents are being used in many fields, such as plagiarism detection.

In this connection, a document similarity derivation device has been proposed that derives the similarity between an input document and document groups classified into a plurality of categories (see Patent Document 1). Specifically, the document similarity derivation device calculates a feature vector of the input document, whose elements are weights, based on the result of morphological analysis of the sentences included in the input document. It then calculates an average feature vector for each document group from the feature vectors of the documents included in that group, and determines, from the feature vector of the input document and the average feature vector of each document group, which document group the input document is most similar to.

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2009-53743 (JP-A-2009-53743)

The technique described in Patent Document 1 calculates the similarity of texts almost exclusively from feature vectors and does not take elements other than feature vectors into account, so the accuracy of the calculated similarity is insufficient in some cases.

The present invention has been made in consideration of such circumstances, and an object thereof is to provide a similarity determination device, a similarity determination system, a similarity determination method, and a program capable of calculating the similarity of texts with higher accuracy.

In order to achieve the above object, the similarity determination device of the present invention includes: a sentence vector acquisition unit that acquires sentence vectors, each of which is a feature vector for a sentence obtained by decomposing a first text, which is the analysis source text, and a second text, which is the text to be compared; a key phrase acquisition unit that acquires key phrases that are included in each of the first text and the second text and are important elements constituting the texts; and a similarity calculation unit that calculates an overall similarity between the first text and the second text based on the similarity between the sentence vectors of the first text and those of the second text and on the similarity of the appearance degree of the same key phrase.

Further features and aspects of the present invention will become apparent from the detailed description of the embodiments set forth below, with reference to the accompanying drawings.

According to the present invention, the similarity of texts can be calculated with higher accuracy.

A diagram showing an example of the overall configuration of the similarity determination system 1 according to the embodiment.
A diagram showing the configuration of the similarity determination device 100 according to the embodiment.
A diagram showing an example of how the sentence vector comparison unit 132 calculates the similarity between sentence vectors.
A diagram showing an example of how the key phrase comparison unit 134 calculates the appearance degree, in each text, of key phrases that match between the first text and the second text.
A diagram schematically showing the process by which the similarity calculation unit 130 calculates the overall similarity between the first text and the second text.
A flowchart showing an example of the process by which the similarity determination device 100 calculates the overall similarity between the first text and the second text.
A diagram showing an example of a screen that the similarity determination device 100 causes the terminal device 200 to display after the overall similarity has been calculated.

Hereinafter, a similarity determination device, a similarity determination system, a similarity determination method, and a program according to an embodiment will be described with reference to the drawings. The similarity determination device determines the degree of similarity between an analysis source text and a text to be compared, based on the feature vectors of the sentences that make up each text and on key phrases. The similarity determination device, for example, calculates the similarity between the analysis source text and the text to be compared and transmits the result to a terminal device. The terminal device is, for example, a personal computer, a tablet computer, or a smartphone. The analysis source text and the text to be compared may be acquired from the terminal device, or may be acquired by another method, such as being collected automatically from an external environment such as the Internet. The similarity determination device may use functions of another server when acquiring the feature vector of each sentence.

FIG. 1 is a diagram showing an example of the overall configuration of the similarity determination system 1 according to the embodiment. The similarity determination system 1 includes a similarity determination device 100 and a terminal device 200. The similarity determination device 100, the terminal device 200, and an external server 300 communicate with one another via a network NW. The network NW includes, for example, some or all of a WAN (Wide Area Network), a LAN (Local Area Network), the Internet, a provider device, a wireless base station, a dedicated line, and the like. The role of the external server 300 will be described later.

FIG. 2 is a diagram showing the configuration of the similarity determination device 100 according to the embodiment. The similarity determination device 100 includes, for example, a communication unit 110, a text acquisition unit 120, a sentence vector acquisition unit 122, a key phrase acquisition unit 124, a similarity calculation unit 130, and a storage unit 150.

Each unit other than the communication unit 110 and the storage unit 150 is implemented, for example, by a hardware processor such as a CPU (Central Processing Unit) executing a program (software). Some or all of these components may be implemented by hardware (a circuit unit, including circuitry) such as an LSI (Large Scale Integration), an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array), or a GPU (Graphics Processing Unit), or by cooperation of software and hardware. The program may be stored in advance in a storage device (a storage device having a non-transitory storage medium) such as an HDD (Hard Disk Drive) or a flash memory, or it may be stored on a removable storage medium (a non-transitory storage medium) such as a DVD or CD-ROM and installed in the storage device when the storage medium is loaded into a drive device.

The communication unit 110 includes a network interface such as a NIC (Network Interface Card). Each unit of the similarity determination device 100 uses the communication unit 110 to communicate with the terminal device 200 and the external server 300 via the network NW.

The storage unit 150 is implemented by, for example, a RAM (Random Access Memory), a ROM (Read Only Memory), an HDD, a flash memory, or a hybrid storage device that combines two or more of these. The storage unit 150 stores data such as acquired texts 152, acquired sentence vectors 154, acquired key phrases 156, and the number of sentence vector pairs 158.

The text acquisition unit 120 acquires the analysis source text and the text to be compared from, for example, the terminal device 200. In the following, the analysis source text is referred to as the first text and the text to be compared is referred to as the second text. The text acquisition unit 120 stores the acquired first text and second text in the storage unit 150.

Before the sentence vector acquisition unit 122 is described, the role of the external server 300 is explained. The external server 300 has various analyzers for tasks such as splitting a text, identifying the positions at which words appear in a text, acquiring feature vectors (sentence vectors), and acquiring key phrases, and it provides the dictionaries, libraries, and the like that these analyzers load. For example, the external server 300 can use a BERT (Bidirectional Encoder Representations from Transformers) natural language processing model to create the sentence vectors corresponding to each text, splitting the text at delimiters such as full stops. The external server 300 can also acquire, for each text, the important elements that constitute the text (key phrases), using, for example, spaCy/GiNZA (a kind of Japanese morphological analyzer).
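The delimiter-based splitting mentioned above can be illustrated with a minimal sketch. It assumes that the only delimiters are the Japanese full stop and a few sentence-ending marks; the function name `split_into_sentences` and the delimiter set are hypothetical, and the analyzers actually running on the external server 300 (BERT, spaCy/GiNZA) are not reproduced here.

```python
import re

# Hypothetical delimiter set: the source only says "delimiters such as full stops".
DELIMITERS = "。！？!?"


def split_into_sentences(text: str) -> list[str]:
    """Split a text into sentences, keeping each delimiter with its sentence."""
    delimiter_class = re.escape(DELIMITERS)
    # One or more non-delimiter characters, optionally followed by one delimiter.
    pattern = f"[^{delimiter_class}]+[{delimiter_class}]?"
    return [s.strip() for s in re.findall(pattern, text) if s.strip()]


sentences = split_into_sentences("今日は晴れです。明日は雨ですか？")
```

Each resulting sentence would then be passed to whatever model produces the sentence vector; that step is outside this sketch.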

The external server 300, for example, clusters a corpus prepared in advance and calculates a score for each word in each resulting cluster. When extracting key phrases from a text, it obtains the score of each phrase candidate using the word scores of the cluster to which the text belongs. By collecting a large number of similar texts, the external server 300 can obtain more accurate word scores. The external server 300 may also extract key phrases using genre-specific corpora. For example, the external server 300 prepares a corpus in advance for each genre, such as sports or cooking, and calculates word scores using the corpus corresponding to the genre specified by the sentence vector acquisition unit 122. In this case, the sentence vector acquisition unit 122 may accept the genre designation from the terminal device 200.

The sentence vector acquisition unit 122 transmits the first text and the second text stored in the storage unit 150 to the external server 300, requests the creation of sentence vectors, and acquires the sentence vectors corresponding to each of the first text and the second text. In this embodiment, a sentence vector created from the first text is referred to as a first sentence vector, and a sentence vector created from the second text is referred to as a second sentence vector. The sentence vector acquisition unit 122 outputs the acquired first sentence vectors and second sentence vectors to the similarity calculation unit 130.

The key phrase acquisition unit 124 reads the first text and the second text stored in the storage unit 150, transmits them to the external server 300, requests the extraction of key phrases from each of them, and acquires the key phrases corresponding to each of the first text and the second text. In this embodiment, a key phrase extracted from the first text is referred to as a first key phrase, and a key phrase extracted from the second text is referred to as a second key phrase. The key phrase acquisition unit 124 outputs the acquired first key phrases and second key phrases to the similarity calculation unit 130.

The similarity calculation unit 130 calculates the overall similarity between the first text and the second text. The similarity calculation unit 130 includes, for example, a sentence vector comparison unit 132, a key phrase comparison unit 134, an overall similarity calculation unit 136, and a similarity display control unit 138.

The sentence vector comparison unit 132 calculates the similarity between a first sentence vector and a second sentence vector, created from the first text and the second text respectively, based on the inter-vector distance (Euclidean distance) or the cosine similarity between them. The sentence vector comparison unit 132 calculates these similarities, for example exhaustively over all combinations, and from them calculates a first index value that is used to calculate the overall similarity between the first text and the second text.
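The two measures named above can be written out as a short sketch. The function names are illustrative and plain Python lists stand in for sentence vectors; the source does not say how a Euclidean distance would be converted into a similarity, so that conversion is left out.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def euclidean_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Straight-line distance between u and v (smaller means more similar)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Note that the two measures run in opposite directions: cosine similarity grows as two vectors become more alike, whereas Euclidean distance shrinks.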

FIG. 3 is a diagram showing an example of how the sentence vector comparison unit 132 calculates the similarity between sentence vectors. As illustrated, each of the first text and the second text is decomposed into a plurality of sentences, such as sentence 1, sentence 2, and sentence 3. The similarity between the sentence vector of sentence 1 of the first text and that of sentence 2 of the second text is 97%; the similarity between the sentence vector of sentence 1 of the first text and that of sentence 6 of the second text is 35%; and the similarity between the sentence vector of sentence 3 of the first text and that of sentence 4 of the second text is 97%. In this way, the sentence vector comparison unit 132 calculates the similarity between the sentence vector of each sentence included in the first text and the sentence vector of each sentence included in the second text.

After calculating the similarities between the sentence vectors, the sentence vector comparison unit 132 determines, for example, whether each calculated similarity is equal to or greater than a threshold. It then records the similarities that are equal to or greater than the threshold (including the maximum similarity) and counts the number of sentence vector pairs whose similarity is equal to or greater than the threshold. In this embodiment, the similarity threshold between sentence vectors is, for example, 85%. In this case, the similarities of sentence vector pair (1), pair (4), and pair (5) in FIG. 3 are 97%, 88%, and 97%, respectively, all of which are at or above the 85% threshold, so the sentence vector comparison unit 132 obtains these similarities and the number of pairs, 3, and stores them in the storage unit 150.
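The thresholding step can be checked against the numbers quoted for FIG. 3. The sketch below is an illustration only: it uses just the four similarity values stated in the text (97%, 35%, 88%, 97%), omits any other pairs shown in the figure, and the function name `summarize_matches` is hypothetical.

```python
THRESHOLD = 0.85  # example threshold given for the embodiment

# Only the similarities quoted in the text; other pairs in FIG. 3 are not listed there.
QUOTED_SIMILARITIES = [0.97, 0.35, 0.88, 0.97]


def summarize_matches(similarities, threshold=THRESHOLD):
    """Return (n, m): the number of pairs at or above the threshold and the maximum similarity."""
    n = sum(1 for s in similarities if s >= threshold)
    m = max(similarities, default=0.0)
    return n, m


n, m = summarize_matches(QUOTED_SIMILARITIES)
```

With these inputs the count is 3 and the maximum is 0.97, matching the values the text says are stored in the storage unit 150.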

The sentence vector comparison unit 132 may calculate the first index value simply from the similarities that are equal to or greater than the threshold, or from both those similarities and the number of sentence vector pairs whose similarity is equal to or greater than the threshold. The sentence vector comparison unit 132 may also, for example, obtain a weighted sum over the pairs in which the weight increases with the amount by which the similarity exceeds the threshold.

Furthermore, the sentence vector comparison unit 132 may calculate the first index value based on the maximum similarity. In this embodiment, in order to calculate the overall similarity between the first text and the second text with high accuracy, the sentence vector comparison unit 132 calculates the first index value based on the maximum similarity between sentence vectors and the number of sentence vector pairs whose similarity is equal to or greater than the threshold. In the case of FIG. 3, for example, the sentence vector comparison unit 132 calculates the first index value using "97%" and "3".

Note that the sentence vector comparison unit 132 may calculate the first index value based only on the number of sentence vector pairs whose similarity is equal to or greater than the threshold, or based only on the maximum similarity. The sentence vector comparison unit 132 calculates the first index value so that, for example, it increases as the number of sentence vector pairs whose similarity is equal to or greater than the threshold increases, and also increases as the maximum similarity increases. As long as this tendency holds, the sentence vector comparison unit 132 may calculate the first index value by any method.
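The source leaves the function open and only requires that the first index value grow with both the pair count and the maximum similarity. One possible function with that property is sketched below; the formula, the `weight` parameter, and the name `first_index_value` are assumptions made for illustration and are not taken from the patent.

```python
def first_index_value(n: int, m: float, weight: float = 0.5) -> float:
    """Hypothetical f(n, m): strictly increasing in both n and m for 0 < weight < 1."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if not 0.0 < weight < 1.0:
        raise ValueError("weight must lie strictly between 0 and 1")
    pair_term = n / (n + 1)  # rises from 0 toward 1 as the pair count grows
    return weight * m + (1.0 - weight) * pair_term


# Values quoted for FIG. 3: three pairs at or above the threshold, maximum 97%.
example_f = first_index_value(3, 0.97)
```

The saturating term `n / (n + 1)` is one way to keep the result within the range 0 to 1 when m is itself between 0 and 1; any other monotonic combination would satisfy the stated condition equally well.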

The key phrase comparison unit 134 calculates a second index value, used to calculate the overall similarity between the first text and the second text, based on the appearance degree, in each of the first text and the second text, of key phrases that match between the key phrases extracted from the two texts. The appearance degree is a quantity that indicates how frequently, or with how much weight, a key phrase appears relative to the length of the text in question.

FIG. 4 is a diagram showing an example of how the key phrase comparison unit 134 calculates the appearance degree, in each text, of key phrases that match between the first text and the second text. First, the key phrase comparison unit 134 counts the number of appearances of each key phrase in the first text and in the second text. It then calculates the appearance degree of the key phrase as the number of appearances divided by a coefficient (a first coefficient or a second coefficient) that depends on the length of the first text or the second text. This coefficient takes a larger value the longer the corresponding text is.

For example, as illustrated, the key phrases appearing in the first text and the second text include "dribble", "youth", "ball", "play style", "2019", and "2015". The key phrase "dribble" appears five times in the first text and eight times in the second text. If the coefficient α_1 corresponding to the first text is calculated to be 50 and the coefficient α_2 corresponding to the second text is calculated to be 100, the appearance degree of "dribble" is 5 / 50 = 0.1 in the first text and 8 / 100 = 0.08 in the second text. A key phrase may also appear in only one of the first text and the second text. In that case, the number of appearances and the appearance degree for the text in which the key phrase does not appear are both zero.
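The arithmetic in this example can be reproduced with a small sketch. The length coefficients are taken as given inputs, because the source does not state how they are derived from the text length; the function names are illustrative.

```python
def appearance_degree(count: int, length_coefficient: float) -> float:
    """Number of appearances divided by the length-dependent coefficient."""
    if length_coefficient <= 0:
        raise ValueError("length_coefficient must be positive")
    return count / length_coefficient


def appearance_degrees(counts: dict[str, int], length_coefficient: float) -> dict[str, float]:
    """Appearance degree of every key phrase counted in one text."""
    return {phrase: appearance_degree(c, length_coefficient) for phrase, c in counts.items()}


# "dribble": 5 appearances with coefficient 50, 8 appearances with coefficient 100.
degree_in_first = appearance_degree(5, 50)
degree_in_second = appearance_degree(8, 100)
```

Dividing by a coefficient that grows with text length keeps a long text from looking more relevant merely because it repeats a phrase more often in absolute terms.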

Then, the key phrase comparison unit 134, for example, virtually lines up the key phrases that appear in at least one of the first text and the second text, and defines, for each of the first text and the second text, a vector (key phrase vector) whose elements are the appearance degrees of those key phrases. Letting the key phrase vector corresponding to the first text be the first key phrase vector V1 and the key phrase vector corresponding to the second text be the second key phrase vector V2, the example of FIG. 4 is expressed as in expression (1).

V1 = (0.1, 0.06, 0.2, 0.14, 0.2, 0.04, ...)
V2 = (0.08, 0, 0.1, 0.05, 0.02, 0.08, ...)   ... (1)

The key phrase comparison unit 134 calculates, for example, the inter-vector distance (Euclidean distance) or the cosine similarity between the first key phrase vector V1 and the second key phrase vector V2, and uses the result as the second index value. This value is the similarity between the first key phrase vector V1 and the second key phrase vector V2, and is an example of the similarity of the appearance degree of the key phrases in the first text and the second text.
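A minimal sketch of this step follows, under two stated assumptions: the key phrases are aligned by taking the sorted union of the phrases found in either text (the source only says they are "virtually lined up"), and cosine similarity is chosen as the measure. Only the six components printed in expression (1) are used; the elided remainder of each vector is not filled in, so the resulting number is a partial illustration rather than the value for FIG. 4.

```python
import math


def aligned_vectors(degrees_1: dict, degrees_2: dict):
    """Build two vectors over the union of key phrases; a missing phrase counts as 0."""
    phrases = sorted(set(degrees_1) | set(degrees_2))
    v1 = [degrees_1.get(p, 0.0) for p in phrases]
    v2 = [degrees_2.get(p, 0.0) for p in phrases]
    return phrases, v1, v2


def cosine_similarity(u, v) -> float:
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# The six components shown in expression (1); the "..." part is deliberately omitted.
V1_SHOWN = [0.1, 0.06, 0.2, 0.14, 0.2, 0.04]
V2_SHOWN = [0.08, 0.0, 0.1, 0.05, 0.02, 0.08]
partial_q = cosine_similarity(V1_SHOWN, V2_SHOWN)
```

Padding missing phrases with zero mirrors the statement that a key phrase absent from one text has an appearance degree of zero there.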

The overall similarity calculation unit 136 calculates the overall similarity between the first text and the second text based on the first index value and the second index value. For example, the overall similarity calculation unit 136 calculates the overall similarity so that it increases as the first index value increases and also increases as the second index value increases. As long as this tendency holds, the overall similarity calculation unit 136 may calculate the overall similarity by any method.

FIG. 5 is a diagram schematically showing the process by which the similarity calculation unit 130 calculates the overall similarity between the first text and the second text. In this embodiment, for example, the number of sentence vector pairs whose similarity is equal to or greater than the threshold is n, and the maximum similarity is m. Further, the number of appearances of a key phrase in the first text is P_1, the number of appearances of the key phrase in the second text is P_2, the normalization coefficient of the first text is α_1, and the normalization coefficient of the second text is α_2.

The overall similarity calculation unit 136 denotes by F the first index value that the sentence vector comparison unit 132 calculated from n and m, and by Q the second index value that the key phrase comparison unit 134 calculated from P_1, P_2, α_1, and α_2. The first index value is expressed, for example, as F = f(n, m). As described above, f(n, m) is a function that calculates the first index value F so that F increases as the number n of sentence vector pairs whose similarity is equal to or greater than the threshold increases, and also increases as the maximum similarity m increases. The second index value is expressed, for example, as Q = q(P_1, P_2, α_1, α_2). As described above, q(P_1, P_2, α_1, α_2) is a function that calculates, as the appearance degree of the key phrase in each text, the number of appearances P_1 in the first text divided by the coefficient α_1 corresponding to the length of the first text and the number of appearances P_2 in the second text divided by the coefficient α_2 corresponding to the length of the second text, and then calculates the second index value Q from those values.

Then, the overall similarity calculation unit 136 calculates the overall similarity S between the first text and the second text based on the first index value F and the second index value Q. The overall similarity is expressed, for example, as S = h(F, Q). As described above, h(F, Q) is a function that calculates the overall similarity S so that S increases as the first index value F increases and also increases as the second index value Q increases.
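Because the source fixes only the monotonic behavior of h, any concrete form is a design choice. The sketch below uses a weighted average purely as an illustration; the formula, the `weight_f` parameter, and the function name are assumptions. It also assumes Q is a similarity-type value (for example a cosine similarity), since a raw Euclidean distance would decrease as the texts become more alike.

```python
def overall_similarity(f: float, q: float, weight_f: float = 0.5) -> float:
    """Hypothetical h(F, Q): a weighted average, strictly increasing in both F and Q."""
    if not 0.0 < weight_f < 1.0:
        raise ValueError("weight_f must lie strictly between 0 and 1")
    return weight_f * f + (1.0 - weight_f) * q


example_s = overall_similarity(0.8, 0.6)
```

Raising `weight_f` would make the sentence vector comparison dominate the result, while lowering it would favor the key phrase comparison.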

The similarity display control unit 138 causes the display unit (not shown) of the terminal device 200 to display, for example, the total similarity, the first index value, the second index value, and a graph (radar chart) representing the similarity of key-phrase appearance degrees between the first text and the second text. Details are described later.

FIG. 6 is a flowchart showing an example of the process by which the similarity determination device 100 calculates the total similarity of the first text and the second text.

First, the text acquisition unit 120 acquires the full text of each of the first text and the second text, for example from the terminal device 200 (step S200). The sentence vector acquisition unit 122 sends the first text and the second text to the external server 300 and requests creation of sentence vectors. The sentence vector acquisition unit 122 then acquires the sentence vectors of the first text and of the second text from the external server 300, and the key phrase acquisition unit 124 acquires the key phrases of the first text and of the second text from the external server 300 (step S202).

Next, the sentence vector comparison unit 132 calculates the similarity between sentence vectors for each pair formed from the first text and the second text (step S204). The unit selects one sentence vector from the first sentence vectors and one from the second sentence vectors, calculates the similarity A_i,j between the selected vectors, and does this for all combinations. It then determines whether the calculated similarity A_i,j is at or above the threshold Th (step S208). Here A_i,j denotes the similarity between the i-th first sentence vector and the j-th second sentence vector. If A_i,j is below the threshold Th, the sentence vector comparison unit 132 proceeds to step S216. If A_i,j is at or above the threshold Th, the unit increments the count n of sentence-vector pairs at or above the threshold and stores it in the storage unit 150 (step S212).

The sentence vector comparison unit 132 then determines whether the similarity A_i,j has been calculated for all combinations of sentence vectors (step S216). If not, the unit returns to step S204, selects the next pair of sentence vectors, and calculates their similarity A_i,j. Once the similarity has been calculated for all combinations, the unit extracts the maximum value m of the calculated similarities (step S220). Next, the unit calculates the first index value F from the count n obtained in step S212 and the maximum similarity m obtained in step S220 (step S224).
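Steps S204 to S220 amount to an exhaustive pairwise comparison that yields the count n and the maximum m. The sketch below illustrates that loop. Cosine similarity is an assumption made here for concreteness (this passage does not name the similarity measure), and the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def count_and_max(first: Sequence[Vector], second: Sequence[Vector],
                  th: float) -> Tuple[int, float]:
    """Compute A_i,j for every pair, then return (n, m)."""
    sims = [cosine(u, v) for u in first for v in second]  # S204, all combinations
    n = sum(1 for a in sims if a >= th)                   # S208 and S212
    m = max(sims, default=0.0)                            # S220
    return n, m
```

The resulting n and m would then be passed to whatever function f is chosen for the first index value in step S224.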

In this embodiment, the similarity determination device 100 may, in parallel with the processing of step S204, extract the key phrases that match between the first text and the second text (step S206).

The key phrase comparison unit 134 calculates the numbers of key-phrase occurrences P_1 and P_2 in the first text and the second text, respectively (step S210). Next, the unit calculates the normalization coefficients α_1 and α_2 of the first text and the second text (step S214). From the calculated P_1, P_2, α_1, and α_2, the key phrase comparison unit 134 calculates the second index value Q (step S218). The key phrase comparison unit 134 then creates a radar chart of the similarity of key-phrase appearance degrees, and the similarity display control unit 138 causes the display unit of the terminal device 200 to display the radar chart (step S222).
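Steps S206 to S214 can be illustrated with plain string operations. This is a sketch under assumptions: the text is measured in characters, the normalization coefficient is the length divided by a hypothetical `unit`, and occurrences are counted with `str.count` (non-overlapping matches). None of these details is stated in the specification.

```python
from typing import Dict, Iterable, List


def matching_key_phrases(kp1: Iterable[str], kp2: Iterable[str]) -> List[str]:
    """Key phrases acquired from both texts (step S206)."""
    return sorted(set(kp1) & set(kp2))


def normalization_coefficient(text: str, unit: int = 100) -> float:
    """Hypothetical alpha: text length expressed in blocks of `unit` characters (step S214)."""
    return max(len(text), 1) / unit


def appearance_degrees(text: str, phrases: Iterable[str],
                       unit: int = 100) -> Dict[str, float]:
    """Occurrence count P of each phrase (step S210) divided by alpha."""
    alpha = normalization_coefficient(text, unit)
    return {p: text.count(p) / alpha for p in phrases}
```

Computing the degrees per key phrase, rather than as a single total, also gives the values that a radar chart of appearance degrees would plot.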

Finally, the total similarity calculation unit 136 calculates the total similarity S of the first text and the second text from the first index value F and the second index value Q (step S226).

Comparing key phrases makes it possible to determine the similarity of the texts as a whole, because key phrases are extracted from the entire text. However, when the comparison target (second) text reuses only some sentences of the analysis source (first) text, a key-phrase comparison alone may judge the similarity to be low. Combining the key-phrase comparison with the sentence-vector comparison, as in this embodiment, allows similarity to be determined with higher accuracy.

In other words, in this embodiment, calculating the similarity between sentence vectors lets the user judge the local similarity between the first text and the second text, while calculating the similarity of key-phrase appearance degrees lets the user judge the similarity between the two texts as a whole.

As described above, in this embodiment two kinds of analysis, sentence-vector creation and key-phrase extraction, are applied to each of the first text and the second text. The two share the purpose of summarizing a text but use entirely different techniques, and the total similarity is calculated from both. This makes it possible to detect reuse of text even when it is disguised by a simple bulk substitution, such as replacing "オリンピック" (Olympics) with its synonym "五輪".

FIG. 7 shows an example of a screen that the similarity determination device 100 causes the terminal device 200 to display after the total similarity has been calculated. The screen provides, for example, a display area A1 for the "total similarity", a display area A2 for the "first index value", a display area A3 for the "second index value", and a display area A4 for the "radar chart of the similarity of key-phrase appearance degrees". As illustrated, after the processing for calculating the total similarity has been performed, the similarity display control unit 138 causes the display screen of the terminal device 200 to display a "detailed analysis report" containing the items shown in each display area. The radar chart shows a line L1 representing the appearance degrees of the key phrases in the first text and a line L2 representing the appearance degrees of the key phrases in the second text.

The key phrase that appears most frequently in the first text is placed at the 12 o'clock position of the radar chart, that is, directly at the top, and the remaining key phrases are placed counterclockwise in descending order of their frequency in the first text. The line L1 for the first text is therefore drawn as a spiral whose radius gradually decreases counterclockwise from the top. In contrast, if the similarity is low, the line L2 for the second text is not a spiral but an irregular shape. If the second text is used in part of the first text, particular key phrases are drawn conspicuously high or low, but most key phrases have similar appearance frequencies, so the line is close to a spiral. The radar chart lets the user visually compare the appearance degrees of the key phrases in the first text with those in the second text.
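The axis ordering described for the radar chart can be sketched without any plotting library. The function below is illustrative only: its name, the angle convention (degrees, 0 at 3 o'clock, increasing counterclockwise, so 12 o'clock is 90), and the use of 0.0 for key phrases missing from the second text are assumptions.

```python
from typing import Dict, List, Tuple


def radar_layout(deg1: Dict[str, float],
                 deg2: Dict[str, float]) -> List[Tuple[str, float, float, float]]:
    """Return (key phrase, angle in degrees, radius on L1, radius on L2) per axis."""
    # Order axes by appearance degree in the first text, highest first.
    phrases = sorted(deg1, key=lambda p: deg1[p], reverse=True)
    if not phrases:
        return []
    step = 360.0 / len(phrases)
    return [
        (p, (90.0 + i * step) % 360.0, deg1[p], deg2.get(p, 0.0))
        for i, p in enumerate(phrases)
    ]
```

Because the axes are sorted by the first text's values, the L1 radii are non-increasing along the returned list, which is what produces the spiral; the L2 radii follow that order only when the second text has a similar key-phrase distribution.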

Some or all of the functions of the external server 300 described above may be included in the functions of the similarity determination device 100. For example, the sentence vector acquisition unit 122 of the similarity determination device 100 may have a function of generating sentence vectors, and the key phrase acquisition unit 124 of the similarity determination device 100 may have a function of extracting key phrases.

The first text and the second text acquired by the text acquisition unit 120 may be any kind of copyrighted work, such as a paper, an article, song lyrics, a haiku, a poem, or a novel. The similarity determination device 100 can therefore be used to detect plagiarism of such works.

Because the similarity determination device 100 determines the similarity between texts from both the sentence vectors and the key phrases of those texts, when the first text explicitly states that it quotes part or all of the second text (or vice versa, or when both the first and second texts quote a separate third text), the device may exclude the quoted portions of the first text and the second text before calculating the total similarity. For example, when the first text and the second text are input in HTML (Hyper Text Markup Language) format, the similarity determination device 100 may identify the quoted range from the quotation tags. When the goal is to detect plagiarism, quoted portions are naturally identical; if they are included in the similarity determination, the similarity is judged unnecessarily high, and the goal of determining whether text has been plagiarized cannot be achieved. Even when the quoting text is not input in HTML or a similar format, a proper quotation usually changes the font or identifies the source with an asterisk (*) or the like, so such quoted passages may be detected automatically using natural language processing.
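For HTML input, excluding quoted ranges can be sketched with Python's standard `html.parser` module. Which tags count as quotation tags is an assumption here (`blockquote` and `q`); the specification says only that the quoted range may be identified from quotation tags. The class and function names are hypothetical.

```python
from html.parser import HTMLParser
from typing import List


class QuoteStripper(HTMLParser):
    """Collects text that lies outside quotation elements."""

    QUOTE_TAGS = {"blockquote", "q"}  # assumed set of quotation tags

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0              # nesting depth inside quotation elements
        self.parts: List[str] = []  # text found outside any quotation

    def handle_starttag(self, tag, attrs):
        if tag in self.QUOTE_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.QUOTE_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)


def strip_quotes(html: str) -> str:
    """Return the visible text of `html` with quoted ranges removed."""
    parser = QuoteStripper()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())  # normalize whitespace
```

The stripped text of each document would then be split into sentences and passed to the sentence-vector and key-phrase analyses in place of the original text.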

When such a quotation is present, the similarity score becomes extremely high for specific sentences even though the similarity of the text as a whole is not high. Because natural language processing places a heavy load on the similarity determination device 100, the automatic detection may be run only when such a local discontinuity in similarity is found. Alternatively, when the chart shown in FIG. 7 is displayed, the terminal device 200 may prompt the user to mark the quoted passages manually and request a re-determination.

When articles are written about objective facts, such as news or the release of a popular product, many news outlets produce similar articles about the same facts. In such cases, running the similarity determination on the fact-reporting portions yields a high similarity even where there is no plagiarism, and the goal of determining whether plagiarism has occurred cannot be achieved. The portions that report objective facts may therefore be excluded from the similarity determination. Unlike quotations, whose extent is stated explicitly, factual portions can in principle be identified by natural language processing, but they must be judged from the meaning of the text, and descriptions of the same fact are not necessarily worded identically, so more advanced natural language processing is required. For such news articles, the extracted second-text candidates tend to be relatively recent, and several tend to be found with dates close to one another. Detection of portions to exclude by natural language processing may therefore be run only when several documents with such close dates are extracted as second-text candidates. Alternatively, the terminal device 200 may display a checkbox asking the user whether the article reports facts and prompt the user to exclude the fact-reporting portions manually. As with quotations, the user may be prompted to request a re-determination when the radar chart shown in FIG. 7 is displayed; however, because natural language processing is computationally heavy, it is preferable, in order to reduce the load on the similarity determination device 100, that the portions to exclude can be specified before natural language processing is performed.
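The trigger condition described above, several second-text candidates with dates close to one another, can be sketched as a sliding-window check over the candidate dates. The function name, the window length, and the minimum count are hypothetical parameters; the sketch does not check how recent the candidates are, and the specification gives no concrete values.

```python
from datetime import date, timedelta
from typing import Iterable


def should_run_fact_exclusion(candidate_dates: Iterable[date],
                              window_days: int = 3,
                              min_count: int = 2) -> bool:
    """True if at least `min_count` candidates fall within `window_days` of each other."""
    dates = sorted(candidate_dates)
    window = timedelta(days=window_days)
    left = 0
    for right in range(len(dates)):
        # Shrink the window until its oldest and newest dates are close enough.
        while dates[right] - dates[left] > window:
            left += 1
        if right - left + 1 >= min_count:
            return True
    return False
```

Gating the expensive detection on a cheap check like this is in line with the stated aim of reducing the load on the similarity determination device.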

According to the embodiment described above, the similarity of texts can be calculated with higher accuracy by providing: the sentence vector acquisition unit 122, which acquires sentence vectors, each being a feature vector of a sentence obtained by decomposing the first text (the analysis source text) or the second text (the comparison target text); the key phrase acquisition unit 124, which acquires key phrases that are included in each of the first text and the second text and are important elements of those texts; and the similarity calculation unit 130, which calculates the total similarity of the first text and the second text from the similarity between the sentence vectors of the two texts and the similarity of the appearance degrees of the same key phrases.

Further, according to the embodiment, calculating the first index value from the number of sentence-vector pairs whose similarity is at or above the threshold allows the similarity of texts to be calculated with still higher accuracy.

Furthermore, according to the embodiment, the similarity display control unit 138, which displays a graph representing the similarity of key-phrase appearance degrees between the first text and the second text, lets the user compare the appearance degrees of the key phrases in the first text with those in the second text.

Preferred embodiments of the present invention have been described above, but the present invention is not limited to them. Components may be added, omitted, replaced, or otherwise changed without departing from the spirit of the present invention. The present invention is not limited by the foregoing description and is limited only by the scope of the appended claims.

100 Similarity determination device
120 Text acquisition unit
122 Sentence vector acquisition unit
124 Key phrase acquisition unit
130 Similarity calculation unit
132 Sentence vector comparison unit
134 Key phrase comparison unit
136 Total similarity calculation unit
200 Terminal device
300 External server

To achieve the above object, a similarity determination device of the present invention comprises: a sentence vector acquisition unit that acquires sentence vectors, each being a feature vector of a sentence obtained by decomposing a first text, which is an analysis source text, or a second text, which is a comparison target text; a key phrase acquisition unit that acquires key phrases that are included in each of the first text and the second text and are important elements of those texts; and a similarity calculation unit that calculates a total similarity of the first text and the second text based on the similarity between the sentence vectors of the first text and the second text and the similarity of the appearance degrees of the same key phrases. The similarity calculation unit includes a key phrase comparison unit that obtains a first key phrase vector whose elements are the numbers of occurrences of the key phrases in the first text and a second key phrase vector whose elements are the numbers of occurrences of the key phrases in the second text, and that calculates, based on the first key phrase vector and the second key phrase vector, a second index value representing the similarity between the first text and the second text.

To achieve the above object, a similarity determination device of the present invention comprises: a sentence vector acquisition unit that acquires sentence vectors, each being a feature vector of a sentence obtained by decomposing a first text, which is an analysis source text, or a second text, which is a comparison target text; a key phrase acquisition unit that acquires key phrases that are included in each of the first text and the second text and are important elements of those texts; a similarity calculation unit that calculates a total similarity of the first text and the second text based on the similarity between the sentence vectors of the first text and the second text and the similarity of the appearance degrees of the same key phrases; and a display control unit that causes a terminal device having a display unit to display a graph representing the similarity, between the first text and the second text, of the appearance degrees of the key phrases acquired by the key phrase acquisition unit. The similarity calculation unit includes a key phrase comparison unit that obtains a first key phrase vector whose elements are the numbers of occurrences of the key phrases in the first text and a second key phrase vector whose elements are the numbers of occurrences of the key phrases in the second text, and that calculates, based on the first key phrase vector and the second key phrase vector, a second index value representing the similarity between the first text and the second text. The display control unit causes the terminal device to display, as the graph, a radar chart showing a first line connecting points that represent the appearance degree of each of the key phrases in the first text and a second line connecting points that represent the appearance degree of each of the key phrases in the second text.

Claims

A similarity determination device comprising:
a sentence vector acquisition unit that acquires sentence vectors, each being a feature vector of a sentence obtained by decomposing a first text, which is an analysis source text, or a second text, which is a comparison target text;
a key phrase acquisition unit that acquires key phrases that are included in each of the first text and the second text and are important elements of those texts; and
a similarity calculation unit that calculates a total similarity of the first text and the second text based on the similarity between the sentence vectors of the first text and the second text and the similarity of the appearance degrees of the same key phrases.

The similarity determination device according to claim 1, wherein the similarity calculation unit comprises:
a sentence vector comparison unit that exhaustively compares one or more first sentence vectors, which are the sentence vectors acquired from the first text, with one or more second sentence vectors, which are the sentence vectors acquired from the second text, calculates the similarity between the sentence vectors, and calculates, based on that similarity, a first index value representing the similarity between the first text and the second text;
a key phrase comparison unit that calculates a second index value representing the similarity between the first text and the second text based on the degree to which key phrases that match between those acquired from the first text and those acquired from the second text appear in each of the first text and the second text; and
a total similarity calculation unit that calculates the total similarity of the first text and the second text based on the first index value and the second index value.

The similarity determination device according to claim 2, wherein the sentence vector comparison unit calculates the first index value based on the number of sentence-vector pairs, among the pairs of sentence vectors, whose similarity is at or above a threshold.

The similarity determination device according to claim 2 or 3, wherein the sentence vector comparison unit calculates the first index value based on the maximum value of the similarities between the sentence vectors.

The similarity determination device according to any one of claims 2 to 4, wherein the key phrase comparison unit calculates the second index value based on a value obtained by normalizing the number of occurrences of the matching key phrase in the first text by a first coefficient based on the length of the first text, and a value obtained by normalizing the number of occurrences of the matching key phrase in the second text by a second coefficient based on the length of the second text.

The similarity determination device according to any one of claims 1 to 5, further comprising a display control unit that causes a terminal device having a display unit to display a graph representing the similarity, between the first text and the second text, of the appearance degrees of the key phrases acquired by the key phrase acquisition unit.

A similarity determination system comprising:
the similarity determination device according to any one of claims 1 to 6; and
a terminal device that displays the total similarity calculated by the similarity calculation unit of the similarity determination device.

A similarity determination method in which a computer:
acquires sentence vectors, each being a feature vector of a sentence obtained by decomposing a first text, which is an analysis source text, or a second text, which is a comparison target text;
acquires, from each of the first text and the second text, key phrases that are important elements of those texts; and
calculates a total similarity of the first text and the second text based on the similarity between the sentence vectors of the first text and the second text and the similarity of the appearance degrees of the same key phrases.

A program that causes a computer to:
acquire sentence vectors, each being a feature vector of a sentence obtained by decomposing a first text, which is an analysis source text, or a second text, which is a comparison target text;
acquire, from each of the first text and the second text, key phrases that are important elements of those texts; and
calculate a total similarity of the first text and the second text based on the similarity between the sentence vectors of the first text and the second text and the similarity of the appearance degrees of the same key phrases.