JP2010182238A

JP2010182238A - Citation detection device, device and method for creating original document database, program and recording medium

Info

Publication number: JP2010182238A
Application number: JP2009027288A
Authority: JP
Inventors: Toshiyuki Sakurai; 俊之櫻井; Yoshihiro Matsuo; 義博松尾; Genichiro Kikui; 玄一郎菊井
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2009-02-09
Filing date: 2009-02-09
Publication date: 2010-08-19
Anticipated expiration: 2029-02-09
Also published as: JP4831787B2

Abstract

PROBLEM TO BE SOLVED: To accurately detect, with less computational complexity, whether an input document includes a citation composed of two or more consecutive sentences taken from another document without alteration of the character string.

SOLUTION: An original document DB 4 is prepared by dividing each document in a set of original documents that are candidates for citation sources into partial character strings that can be units of citation, creating summaries of the partial character strings, arranging the summaries in the order of appearance of the partial character strings to form a digest of the document, and registering the digest with its document ID so that a longest prefix match can be performed for each partial character string. A digest creation means 5 converts an input document into a digest in the same way. A citation detection means 6 searches the original document DB 4 by longest prefix match using the digest of the input document as a key and, if there is a document for which the number of consecutively matching summaries is a predetermined threshold or more, outputs its document ID.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a technique for detecting whether an arbitrary document, such as a blog post, contains a citation from another document such as a news article, a press release, or Wikipedia. Here, a citation is defined as two or more consecutive sentences taken from another document with no modification of the character string.

Two conventional techniques of this type are known.

・ DP (Dynamic Programming) matching (Prior Art 1)
In DP matching, the two character strings to be compared are matched sequentially while the difference between them is computed, and the similarity between the two documents is estimated from that difference (see, for example, Non-Patent Document 1).
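As a rough illustration of why DP matching is costly, the sketch below computes an edit distance between two strings with a dynamic-programming table. Edit distance is used here only as a stand-in for the "difference" measure, which the text does not specify, and the function name `edit_distance` is hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row (illustrative stand-in)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]
```

Every pair of positions in the two strings is visited, so the work grows with the product of the two lengths, and it has to be repeated for every pair of documents being compared.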

・ Simhash (Prior Art 2)
In Simhash, each document is represented as a vector, and a special hash value (the simhash) is obtained by determining on which side of randomly defined hyperplanes the vector lies. The similarity between two documents is estimated by comparing their simhash values. The Hamming distance between the simhash values of two documents approximates the cosine distance between their document vectors (see, for example, Non-Patent Document 2).
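The following is a minimal sketch of a simhash-style fingerprint, not the implementation referred to in Non-Patent Document 2. It assumes the document has already been tokenized, and it uses the bits of an MD5 hash of each token in place of explicitly sampled random hyperplanes. The names `simhash` and `hamming` are hypothetical.

```python
import hashlib
from typing import Iterable

def simhash(tokens: Iterable[str], bits: int = 64) -> int:
    """Bitwise majority vote over per-token hash bits (assumed variant, bits <= 128)."""
    counts = [0] * bits
    for tok in tokens:
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest(), "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    value = 0
    for i, c in enumerate(counts):
        if c > 0:  # more tokens voted 1 than 0 for this bit
            value |= 1 << i
    return value

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Because the votes are summed over a bag of tokens, the same tokens in a different order produce the same value in this sketch.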

Non-Patent Document 1: Needleman, S. B., Wunsch, C. D.: "A general method applicable to the search for similarities in the amino acid sequence of two proteins." J. Mol. Biol., Vol. 48, pp. 443-453, 1970.
Non-Patent Document 2: M. S. Charikar. "Similarity estimation techniques from rounding algorithms." In STOC '02: Proceedings of the thirty-fourth annual ACM symposium on Theory of computing, pp. 380-388, 2002.

However, the conventional techniques described above have the following three problems.

First, Prior Art 1 requires brute-force matching, so its computational cost is high and it does not scale to large document collections (Problem 1).

Prior Art 2 scales well, but it cannot detect a citation when the cited portion is a small fraction of the whole document. For example, as shown in FIG. 1(a), when the portion shared by the citation source document and the citing document is small, the similarity drops and the citation goes undetected (Problem 2).

Furthermore, Prior Art 2 estimates the similarity between two documents without considering the order in which words appear, so it cannot detect citations accurately. For example, as shown in FIG. 1(b), even two documents with no citation relationship receive a high similarity when they share a large proportion of words, which can lead to a false detection (Problem 3).

In the present invention, each document in an original document set, that is, a set of other documents that are candidates for citation sources, is divided into partial character strings that can serve as units of citation; a summary (for example, a known fingerprint) of each partial character string is generated; and a digest of the document, formed by arranging the summaries in the order in which the partial character strings appear, is registered together with the document ID in a format that allows longest prefix match search starting from each partial character string. An original document database (DB) built in this way is prepared. The input document (target document) is then converted into a digest in the same manner, the original document DB is searched by longest prefix match using the digest of the input document as a key, and if there is another document for which the number of consecutively matching summaries is equal to or greater than a predetermined threshold, its document ID is output.

According to the present invention, an original document DB is used in which the digests of other documents are registered together with their document IDs in a format that allows longest prefix match search, the input document is converted into a digest, and whether a citation is included is determined by comparing digests with each other. This reduces the amount of computation as well as memory and disk usage, and therefore makes large-scale operation possible.

In addition, because whether a passage is a citation is determined from the number of matching summaries between digests, that is, the length of the matching portion of the digests, a citation can be detected regardless of what fraction of the whole document it makes up.

Furthermore, because the determination is based on the matching length of the digests while the order of appearance of the partial character strings is preserved, only citations are correctly detected, regardless of how high or low the overall similarity between the documents is.

FIG. 1: Explanatory diagram showing the problems of the prior art
FIG. 2: Block diagram showing an example of an embodiment of the original document database generation device of the present invention
FIG. 3: Block diagram showing details of the digest generation means
FIG. 4: Flowchart of processing in the digest generation means
FIG. 5: Flowchart of processing in the original document DB generation means
FIG. 6: Block diagram showing an example of an embodiment of the citation detection device of the present invention
FIG. 7: Explanatory diagram of the citation start position and citation end position
FIG. 8: Flowchart of processing in the citation detection means
FIG. 9: Explanatory diagram of a Suffix Array
FIG. 10 to FIG. 13: Explanatory diagrams showing an example of generating a digest from the original document set
FIG. 14: Explanatory diagram showing an example of generating the original document DB from digests
FIG. 15: Explanatory diagram showing an example of generating a digest from an input document
FIG. 16: Explanatory diagram showing an example of detecting a citation in an input document

In the present invention, the original document DB must be created before citation detection can be performed. Therefore, the original document database generation device, which generates the original document DB from the original document set, is described first, followed by the citation detection device, which determines from the original document DB whether an input document contains a citation.

<Original document database generation device>
FIG. 2 shows an example of an embodiment of the original document database generation device of the present invention. In the figure, 1 is the original document set, 2 is the digest generation means, 3 is the original document DB generation means, and 4 is the original document DB.

The original document set 1 is a set of documents that are candidates for citation sources, for example documents from news sites, press releases, and Wikipedia. Each document is assumed to have been assigned a unique document ID in advance.

For each document in the original document set 1, the digest generation means 2 divides the character string of the document into partial character strings that can serve as units of citation, generates a summary (here, a fingerprint) of each partial character string, and outputs the digest of the document, formed by arranging the fingerprints in the order of appearance of the partial character strings, to the original document DB generation means 3 together with the document ID.

FIG. 3 shows the detailed configuration of the digest generation means 2, which consists of a segmentation unit 21, a normalization unit 22, a cut-off unit 23, and a fingerprint generation unit 24. FIG. 4 shows the flow of processing in the digest generation means 2.

The segmentation unit 21 divides the character string of the input document (text) at arbitrary separators into partial character strings that can serve as units of citation (hereinafter, segments), and outputs each segment to the normalization unit 22 together with information on its order of appearance (here, the start position and end position of the segment) and the document ID of the document. Possible separators include any characters, symbols, combinations of these (patterns), control characters, tags, and the like that frequently appear at sentence boundaries. The start and end positions may be, for example, character positions counted from the beginning of the document.
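The segmentation step can be sketched as follows. This is a minimal illustration that assumes single-character separators ("。" and "？") and 1-based, inclusive character positions; the function name `segment` and the tuple layout are hypothetical.

```python
from typing import List, Tuple

# Assumed separators for illustration; the text allows arbitrary ones.
SEPARATORS = "。？"

def segment(text: str, separators: str = SEPARATORS) -> List[Tuple[str, int, int]]:
    """Split text after each separator and return (segment, start, end).

    Positions are 1-based and inclusive, counted in characters from the
    beginning of the document.
    """
    segments = []
    start = 0
    for i, ch in enumerate(text):
        if ch in separators:
            segments.append((text[start:i + 1], start + 1, i + 1))
            start = i + 1
    if start < len(text):  # trailing text with no closing separator
        segments.append((text[start:], start + 1, len(text)))
    return segments
```

Calling `segment("今日は、良い天気だなぁ。明日は晴れるかな？晴れるといいな。")` yields three segments with positions (1, 12), (13, 21), and (22, 29).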

The normalization unit 22 normalizes each segment received from the segmentation unit 21 and outputs each normalized segment to the cut-off unit 23 together with its start position, end position, and document ID. Specific examples of normalization include unifying upper-case and lower-case letters, unifying full-width and half-width characters, removing symbols, and removing tags such as HTML or XML tags, but the normalization can be configured freely.
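One possible normalization is sketched below. The choices are assumptions made for illustration: Unicode NFKC normalization for full-width and half-width unification, lower-casing, a simple regular expression for tags, and removal of characters in the Unicode punctuation and symbol categories. The name `normalize` is hypothetical.

```python
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")  # naive HTML/XML tag pattern (assumption)

def normalize(segment: str) -> str:
    """Normalize one segment before fingerprinting."""
    text = TAG_RE.sub("", segment)               # remove tags
    text = unicodedata.normalize("NFKC", text)   # unify full-width/half-width forms
    text = text.lower()                          # unify letter case
    # Drop punctuation (P*) and symbols (S*); other characters are kept.
    return "".join(ch for ch in text
                   if unicodedata.category(ch)[0] not in ("P", "S"))
```

With these choices, "今日は、良い天気だなぁ。" becomes "今日は良い天気だなぁ".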

The cut-off unit 23 deletes, from the normalized segments received from the normalization unit 22, every segment whose length is equal to or less than a fixed length, and outputs the remaining normalized segments to the fingerprint generation unit 24 together with their start positions, end positions, and document ID. The fixed length may be measured in characters, bytes, bits, or the like, and its value can be changed freely.

The fingerprint generation unit 24 feeds each remaining normalized segment received from the cut-off unit 23 into an arbitrary hash function to generate a fingerprint, and outputs the digest of the document (a fingerprint string), formed by arranging the fingerprints according to the start and end positions of the segments, together with the start and end positions of the segments and the document ID.
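The cut-off and fingerprint steps can be sketched together. The hash function is left open in the text, so the truncated SHA-1 used here is an assumption, as are the threshold of 5 characters and the names `cut_off`, `fingerprint`, and `make_digest`.

```python
import hashlib
from typing import List, Tuple

Segment = Tuple[str, int, int]       # (normalized text, start, end)
Fingerprint = Tuple[bytes, int, int]  # (hash bytes, start, end)

def cut_off(segments: List[Segment], max_dropped_len: int = 5) -> List[Segment]:
    """Drop segments whose length in characters is max_dropped_len or less."""
    return [s for s in segments if len(s[0]) > max_dropped_len]

def fingerprint(text: str, n_bytes: int = 4) -> bytes:
    """First n_bytes of SHA-1 (an assumed choice of hash function)."""
    return hashlib.sha1(text.encode("utf-8")).digest()[:n_bytes]

def make_digest(segments: List[Segment], max_dropped_len: int = 5) -> List[Fingerprint]:
    """Fingerprints of the surviving segments, kept in document order."""
    kept = cut_off(segments, max_dropped_len)
    return [(fingerprint(text), start, end) for text, start, end in kept]
```

Keeping the start and end positions next to each fingerprint lets a later stage map a run of matching fingerprints back to character positions in the document.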

The original document DB generation means 3 registers the digest of each document in the original document set 1, received from the digest generation means 2, together with its document ID in a format that allows longest prefix match search starting from each segment, thereby generating the original document DB 4. FIG. 5 shows the flow of processing in the original document DB generation means 3.

Any database format that supports longest prefix match search may be used; one example is the known Suffix Array (described later).

The original document DB 4 is the database generated from the original document set 1 by the digest generation means 2 and the original document DB generation means 3. That is, for each document in the original document set 1, the character string of the document is divided into segments that can serve as units of citation, a fingerprint of each segment is generated, and the digest of the document, formed by arranging the fingerprints in the order of appearance of the segments, is registered together with the document ID in a format that allows longest prefix match search starting from each segment.

FIG. 6 shows an example of an embodiment of the citation detection device of the present invention. In the figure, 4 is the original document DB, 5 is the digest generation means, and 6 is the citation detection means.

The digest generation means 5 is identical to the digest generation means 2 of the original document database generation device described above, except that it processes the input document (the document subject to citation detection). It divides the character string of the input document into segments that can serve as units of citation, generates a fingerprint of each segment, and outputs the digest of the input document, formed by arranging the fingerprints according to the start and end positions of the segments, to the citation detection means 6 together with the start and end positions of the segments and the document ID.

The citation detection means 6 searches the original document DB 4 by longest prefix match using the digest of the input document received from the digest generation means 5 as a key and, if there is another document for which the number of matching fingerprints is equal to or greater than a predetermined threshold, outputs its document ID (citation source document ID) together with the citation start position and citation end position.

Here, as shown in FIG. 7, the citation start position and citation end position are the start and end positions, within the target document (input document), of the portion cited from the citation source document.

FIG. 8 shows the flow of processing in the citation detection means 6, which is described in detail below.

Step 1: The value min, defined in advance as the minimum number of matching fingerprints, is assigned to the threshold Min, and the process proceeds to Step 2.

Step 2: If the digest of the input document received from the digest generation means 5 has been searched to its end, the process ends; otherwise, the process proceeds to Step 3.

Step 3: The original document DB 4 is searched for the longest prefix match entry using the digest of the input document as a key, and the process proceeds to Step 4.

Step 4: If an entry is found whose number of matching fingerprints is equal to or greater than the threshold Min, it is regarded as a citation and the process proceeds to Step 5; if the number is below the threshold, the process proceeds to Step 8.

Step 5: Among the matched fingerprints in the digest of the input document, the start position of the segment corresponding to the first fingerprint and the end position of the segment corresponding to the last fingerprint are output as the citation start position and citation end position, together with the document ID of the matched entry (citation source document ID), and the process proceeds to Step 6.

Step 6: The value of Min is changed to the length of the cited portion detected in Step 4 (the number of matched fingerprints), and the process proceeds to Step 7.

Step 7: The digest of the input document is shifted toward its end by one fingerprint, and the process returns to Step 2.

Step 8: The digest of the input document is shifted toward its end by one fingerprint, and the process proceeds to Step 9.

Step 9: The value of Min is decremented by 1 (subject to Min ≧ min), and the process returns to Step 2.
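Steps 1 to 9 can be sketched as a single loop. This is an illustration under stated assumptions: digests are lists of fingerprint strings, the original document DB is a plain dictionary from document ID to digest, and the longest prefix match is done by a naive scan that stands in for a suffix-array lookup. The names `longest_prefix_match` and `detect` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

def longest_prefix_match(key: List[str],
                         db: Dict[str, List[str]]) -> Tuple[int, Optional[str]]:
    """Naive stand-in for the DB lookup: longest run of fingerprints at the
    start of key that also appears consecutively in some registered digest."""
    best_len, best_doc = 0, None
    for doc_id, digest in db.items():
        for j in range(len(digest)):
            n = 0
            while n < len(key) and j + n < len(digest) and key[n] == digest[j + n]:
                n += 1
            if n > best_len:
                best_len, best_doc = n, doc_id
    return best_len, best_doc

def detect(digest: List[str], starts: List[int], ends: List[int],
           db: Dict[str, List[str]], min_match: int = 2) -> List[Tuple[str, int, int]]:
    results = []
    threshold = min_match                                     # Step 1
    pos = 0
    while pos < len(digest):                                  # Step 2
        n, doc_id = longest_prefix_match(digest[pos:], db)    # Step 3
        if doc_id is not None and n >= threshold:             # Step 4
            results.append((doc_id, starts[pos], ends[pos + n - 1]))  # Step 5
            threshold = n                                     # Step 6
        else:
            threshold = max(threshold - 1, min_match)         # Step 9
        pos += 1                                              # Step 7 / Step 8
    return results
```

Raising the threshold to the length of the run just reported (Step 6) and lowering it by one per shift (Step 9) keeps the shorter tails of an already reported run from being reported again as separate citations.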

<Suffix Array>
A Suffix Array is a data structure that enables fast string search. Any substring can be searched for, and because the mechanism is simple, it is easy to implement. However, the index (the suffix array) must be built in advance.

Taking the case where suffixes are formed character by character as an example, a Suffix Array is built as follows. First, each character in the string is assigned an index point, numbered sequentially from the beginning, and the string from each index point to the end (a suffix) is created. The suffixes are then sorted in dictionary order, and the resulting sequence of index points is the suffix array.

When the string is "ABABDAC", index points are assigned as shown in FIG. 9(a), and the suffixes "ABABDAC", "BABDAC", "ABDAC", and so on are obtained; these are summarized in FIG. 9(b). Sorting the suffixes in dictionary order gives FIG. 9(c), and the resulting Suffix Array is shown in FIG. 9(d).
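The construction described above can be sketched in a few lines. Index points are 0-based here, which is an assumption, since the numbering used in FIG. 9 is not reproduced in the text. The name `build_suffix_array` is hypothetical, and sorting whole suffixes is the simple method rather than an efficient one.

```python
from typing import List

def build_suffix_array(text: str) -> List[int]:
    """Index points ordered by the dictionary order of their suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])

text = "ABABDAC"
sa = build_suffix_array(text)
for i in sa:
    print(i, text[i:])
```

For "ABABDAC" the sorted suffixes are ABABDAC, ABDAC, AC, BABDAC, BDAC, C, DAC, so the 0-based suffix array is [0, 2, 5, 1, 3, 6, 4].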

FIGS. 10 to 13 show an example of generating a digest from the original document set with the digest generation means 2, here from the document with document ID: AAA, "今日は、良い天気だなぁ。明日は晴れるかな？晴れるといいな。" ("The weather is nice today. Will it be sunny tomorrow? I hope it will be sunny."). Note that FIGS. 10 to 13 partly overlap in content. The character positions below refer to the Japanese text.

First, the segmentation unit 21 divides the text into segments using "。" and "？" as separators, yielding segment #1 "今日は、良い天気だなぁ。", segment #2 "明日は晴れるかな？", and segment #3 "晴れるといいな。". The start positions of segments #1, #2, and #3 are 1, 13, and 22, and their end positions are 12, 21, and 29.

Next, the normalization unit 22 removes punctuation marks and symbols, giving segment #1 "今日は良い天気だなぁ", segment #2 "明日は晴れるかな", and segment #3 "晴れるといいな".

Next, the cut-off unit 23 deletes segments of 5 characters or fewer. Here every segment is longer than 5 characters, so no segment is deleted.

Finally, the fingerprint generation unit 24 generates a fingerprint for each of segments #1, #2, and #3 using an arbitrary hash function. Here the fingerprints are 4-byte hash values: fingerprint #1 "a31b", fingerprint #2 "e2cd", and fingerprint #3 "dfde". These are arranged according to the start and end positions of the segments to form the digest (fingerprint string) "a31be2cddfde", which is output together with the start and end positions of the segments and the document ID: AAA.

FIG. 14 shows an example of generating the original document DB from digests with the original document DB generation means 3. In this example, suffixes are generated in units of the fingerprint length (here, 4 bytes).

Suppose that the original documents supplied are the digest (fingerprint string) "a31be2cddfde" of document ID: AAA described above, the digest "3cdfae51bdac" of document ID: BBB, and the digest "12aab4ad3b42" of document ID: CCC. The suffixes of the digest of document ID: AAA are "a31be2cddfde", "e2cddfde", and "dfde"; those of document ID: BBB are "3cdfae51bdac", "ae51bdac", and "bdac"; and those of document ID: CCC are "12aab4ad3b42", "b4ad3b42", and "3b42". These are sorted in dictionary order to produce the original document DB shown on the right side of FIG. 14.
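The construction of this DB, and a longest prefix match against it, can be sketched as follows. The sketch assumes digests are strings of 4-character fingerprints and stores whole suffix strings in a sorted list rather than index points, which is simpler but less compact than a real suffix array. The names `build_db`, `common_fps`, and `longest_prefix_match` are hypothetical.

```python
import bisect
from typing import Dict, List, Optional, Tuple

FP_LEN = 4  # fingerprint length in characters, as in the example

def build_db(digests: Dict[str, str]) -> List[Tuple[str, str]]:
    """Sorted list of (suffix, doc_id), one suffix per fingerprint boundary."""
    entries = []
    for doc_id, digest in digests.items():
        for offset in range(0, len(digest), FP_LEN):
            entries.append((digest[offset:], doc_id))
    entries.sort()
    return entries

def common_fps(a: str, b: str) -> int:
    """Number of whole fingerprints shared at the start of a and b."""
    n = 0
    while ((n + 1) * FP_LEN <= min(len(a), len(b))
           and a[n * FP_LEN:(n + 1) * FP_LEN] == b[n * FP_LEN:(n + 1) * FP_LEN]):
        n += 1
    return n

def longest_prefix_match(db: List[Tuple[str, str]], key: str) -> Tuple[int, Optional[str]]:
    """The best match lies next to the key's insertion point in sorted order."""
    i = bisect.bisect_left(db, (key,))
    best: Tuple[int, Optional[str]] = (0, None)
    for j in (i - 1, i):
        if 0 <= j < len(db):
            n = common_fps(db[j][0], key)
            if n > best[0]:
                best = (n, db[j][1])
    return best

db = build_db({"AAA": "a31be2cddfde", "BBB": "3cdfae51bdac", "CCC": "12aab4ad3b42"})
print(longest_prefix_match(db, "a31be2cddfde34a2"))
```

Because the entries are sorted, only the two neighbours of the insertion point need to be compared with the key, which is what makes the lookup cheap compared with scanning every registered digest.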

FIG. 15 shows an example of generating a digest from an input document with the digest generation means 5, here from the input document with document ID: JJJ, "友達がこんなこと書いてた。今日は、良い天気だなぁ。明日は晴れるかな？晴れるといいな。でも、明日雨らしいよ。" ("A friend wrote this. The weather is nice today. Will it be sunny tomorrow? I hope it will be sunny. But apparently it will rain tomorrow."), which contains the document with document ID: AAA as a citation.

The details are the same as in Example 1 shown in FIGS. 10 to 13 and are omitted. The final output is the digest (fingerprint string) "b4a3a31be2cddfde34a2", together with the start positions 1, 14, 26, 35, 43 and end positions 13, 25, 34, 42, 53 of the segments and the document ID: JJJ.

FIG. 16 shows an example of detecting a citation with the citation detection means 6 by comparing the digest of an input document with the original document DB, here the digest of the input document with document ID: JJJ and the original document DB shown in FIG. 14.

First, the original document DB shown in FIG. 14 is searched for the longest prefix match entry using the digest "b4a3a31be2cddfde34a2" as a key, but no matching entry exists. The digest of the input document is therefore shifted by one fingerprint length (here, 4 bytes) to give "a31be2cddfde34a2". When the original document DB of FIG. 14 is searched in the same way with this as the key, the first three fingerprints match the suffix "a31be2cddfde" of the digest of document ID: AAA in the original document DB.

In this case, if Min is 2 or 3, the start position of the segment corresponding to the first matched fingerprint and the end position of the segment corresponding to the last matched fingerprint, here 14 and 42, are output as the citation start position and citation end position, together with the document ID: AAA.

1: original document set; 2, 5: digest generation means; 3: original document DB generation means; 4: original document DB; 6: citation detection means; 21: segmentation unit; 22: normalization unit; 23: cut-off unit; 24: fingerprint generation unit.

Claims

A device that detects whether an input document contains a citation consisting of two or more consecutive sentences taken from another document with no modification of the character string and, if so, outputs the document ID of the other document, the device comprising:
an original document database in which, for each document in an original document set that is a set of other documents, the character string of the document is divided into partial character strings that can serve as units of citation, a summary of each partial character string is generated, and a digest of the document formed by arranging the summaries in the order of appearance of the partial character strings is registered together with its document ID in a format that allows longest prefix match search for each partial character string;
digest generation means for dividing the character string of the input document into partial character strings that can serve as units of citation, generating a summary of each partial character string, and outputting a digest of the input document, formed by arranging the summaries in the order of appearance of the partial character strings, together with its document ID; and
citation detection means for searching the original document database by longest prefix match using the digest of the input document as a key and, if there is another document for which the number of matching summaries is equal to or greater than a predetermined threshold, outputting its document ID.

The citation detection device according to claim 1,
wherein the digest generation means comprises:
a segmentation unit that divides the character string of a document into segments that can serve as units of citation and outputs each segment together with information on its order of appearance and the document ID of the document;
a normalization unit that normalizes each segment;
a cut-off unit that deletes, from the normalized segments, segments of a fixed length or less; and
a fingerprint generation unit that feeds each remaining normalized segment into an arbitrary hash function to generate a fingerprint and outputs a digest of the document, formed by arranging the fingerprints according to the order of appearance of the segments, together with the document ID.

An apparatus for generating an original document database from an original document set that is a set of other documents, comprising:
Digest generation means for, for each document in the original document set, dividing the character string of the document into partial character strings that can serve as units of citation, generating a summary of each partial character string, and outputting, together with the document ID, a digest of the document formed by arranging the summaries in the order of appearance of the partial character strings; and
Original document database generation means for generating the original document database by registering the digest of each document in the original document set, together with its document ID, in a format that allows a longest prefix match search for each partial character string. An original document database generation device characterized by comprising the above means.
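
One way to register digests so that a longest prefix match search can start from any partial character string is to insert every suffix of each digest into a trie. The Python sketch below assumes that reading of the claim; the class names `TrieNode` and `SourceDocumentDB`, the method names, and the string fingerprints are hypothetical, and the claim does not prescribe a particular index structure.

```python
class TrieNode:
    """One node per fingerprint; doc_ids lists documents whose digest passes here."""
    __slots__ = ("children", "doc_ids")

    def __init__(self):
        self.children = {}
        self.doc_ids = set()


class SourceDocumentDB:
    """Assumed index: a trie over all suffixes of all registered digests."""

    def __init__(self):
        self.root = TrieNode()

    def register(self, doc_id, digest):
        # Register the digest once per starting position, so that a match
        # can begin at any partial character string of the document.
        for start in range(len(digest)):
            node = self.root
            for fp in digest[start:]:
                node = node.children.setdefault(fp, TrieNode())
                node.doc_ids.add(doc_id)

    def longest_prefix(self, key):
        """Return (matched length, IDs of documents sharing that prefix of key)."""
        node, length = self.root, 0
        for fp in key:
            child = node.children.get(fp)
            if child is None:
                break
            node, length = child, length + 1
        return length, (set(node.doc_ids) if length else set())


db = SourceDocumentDB()
db.register("doc-A", ["f1", "f2", "f3"])
db.register("doc-B", ["f2", "f3", "f4"])
shared = db.longest_prefix(["f2", "f3", "f9"])
```

Registering every suffix keeps the lookup simple at the cost of memory that grows quadratically with digest length; a suffix array or similar structure could serve the same purpose.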

The original document database generation device according to claim 3, wherein
The digest generation means comprises:
A segmentation unit that, for each document in the original document set, divides the character string of the document into segments that can serve as units of citation, and outputs each segment together with information on its order of appearance and the document ID of the document;
A normalization unit that normalizes each segment;
A cut-off unit that deletes, from the normalized segments, any segment of a certain length or less; and
A fingerprint generation unit that inputs each remaining normalized segment into an arbitrary hash function to generate a fingerprint, and outputs, together with the document ID, a digest of the document formed by arranging the fingerprints in the order of appearance of the segments.

A method for detecting whether an input document contains a citation, that is, two or more consecutive sentences taken from another document without modification of the character string, and, if so, outputting the document ID of that other document, the method
Using an original document database in which, for each document in an original document set that is a set of other documents, the character string of the document has been divided into partial character strings that can serve as units of citation, a summary of each partial character string has been generated, and the digest of the document, formed by arranging the summaries in the order of appearance of the partial character strings, has been registered together with its document ID in a format that allows a longest prefix match search for each partial character string, and comprising:
A step in which digest generation means divides the character string of the input document into partial character strings that can serve as units of citation, generates a summary of each partial character string, and outputs, together with the document ID, a digest of the input document formed by arranging the summaries in the order of appearance of the partial character strings; and
A step in which citation detection means searches the original document database by longest prefix match using the digest of the input document as a key and, if there is another document for which the number of consecutively matching summaries is equal to or greater than a predetermined threshold, outputs the document ID of that document. A citation detection method characterized by including the above steps.

The citation detection method according to claim 5, wherein
The digest generation step comprises:
A step in which a segmentation unit divides the character string of a document into segments that can serve as units of citation, and outputs each segment together with information on its order of appearance and the document ID of the document;
A step in which a normalization unit normalizes each segment;
A step in which a cut-off unit deletes, from the normalized segments, any segment of a certain length or less; and
A step in which a fingerprint generation unit inputs each remaining normalized segment into an arbitrary hash function to generate a fingerprint, and outputs, together with the document ID, a digest of the document formed by arranging the fingerprints in the order of appearance of the segments.

A method for generating an original document database from an original document set that is a set of other documents, comprising:
A step in which digest generation means, for each document in the original document set, divides the character string of the document into partial character strings that can serve as units of citation, generates a summary of each partial character string, and outputs, together with the document ID, a digest of the document formed by arranging the summaries in the order of appearance of the partial character strings; and
A step in which original document database generation means generates the original document database by registering the digest of each document in the original document set, together with its document ID, in a format that allows a longest prefix match search for each partial character string. An original document database generation method characterized by including the above steps.

The original document database generation method according to claim 7, wherein
The digest generation step comprises:
A step in which a segmentation unit, for each document in the original document set, divides the character string of the document into segments that can serve as units of citation, and outputs each segment together with information on its order of appearance and the document ID of the document;
A step in which a normalization unit normalizes each segment;
A step in which a cut-off unit deletes, from the normalized segments, any segment of a certain length or less; and
A step in which a fingerprint generation unit inputs each remaining normalized segment into an arbitrary hash function to generate a fingerprint, and outputs, together with the document ID, a digest of the document formed by arranging the fingerprints in the order of appearance of the segments.

A program for causing a computer to function as each means of the device according to any one of claims 1 to 4.

A computer-readable recording medium on which the program according to claim 9 is recorded.