JP2017211784A

JP2017211784A - Summarizing device, method and program

Info

Publication number: JP2017211784A
Application number: JP2016103759A
Authority: JP
Inventors: Jung Oh (ジュンオウ); Satoshi Suzuki (鈴木 敏)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2016-05-24
Filing date: 2016-05-24
Publication date: 2017-11-30

Abstract

PROBLEM TO BE SOLVED: To obtain a summary that is easy to read.

SOLUTION: An entity calculation unit 52 calculates entities and the state of each entity. A probability estimation unit 53 estimates, for each entity and each pair of states of the entity, a transition probability: the probability that the state of the entity in the preceding sentence of two adjacent sentences in a document transitions to the state of the entity in the subsequent sentence. A sentence selection unit 55 generates a plurality of summary candidates for the document. A summary selection unit 58 selects a summary from the summary candidates, based on the transition probabilities, the summary candidates, and predetermined entity weights, so as to optimize an evaluation function expressed using the entity weights and a coherence score, where the coherence score is expressed using the transition probabilities between the states of each entity in adjacent sentences of the summary.

SELECTED DRAWING: Figure 2

Description

The present invention relates to a summarization device, method, and program.

Automatic document summarization aims to represent an original document as a shorter text. The main approaches to summarization are extractive methods and abstractive methods. Extractive methods select sentences (or clauses, etc.) from the original document to generate a summary. Abstractive methods, the other approach, generate new sentences as the summary.

For example, Non-Patent Document 1 treats summarization as a maximum knapsack problem (MKMC). Summarization there means extracting a plurality of sentences that cover as many concepts as possible. A concept is a non-function word.

Non-Patent Document 2 describes a method for evaluating coherence using the entities contained in a document. Machine-generated text is evaluated according to the presence or absence of entities in the text. In that work, entities are nouns and pronouns.

Non-Patent Document 1: Takamura, Hiroya and Okumura, Manabu. “Text summarization model based on maximum coverage problem and its variant.” Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, 2009.
Non-Patent Document 2: Barzilay, Regina and Lapata, Mirella. “Modeling local coherence: An entity-based approach.” Computational Linguistics, Vol. 34, MIT Press, 2008.

However, conventional techniques, whether extractive or abstractive, can generate text that is not necessarily easy for people to read.

An object of the present invention is to provide a summarization device, method, and program capable of obtaining a summary that is easy for people to read.

To achieve the above object, a summarization device according to a first invention generates a summary by selecting sentences included in a document, and comprises: an entity calculation unit that calculates, for each of a plurality of sentences included in the document, the entities included in the sentence and the state of each entity; a probability estimation unit that, based on the entities and entity states calculated for each of the plurality of sentences by the entity calculation unit, estimates, for each entity and for each pair of states of the entity, a transition probability representing the probability that the state of the entity in the preceding sentence of two adjacent sentences in the document transitions to the state of the entity in the subsequent sentence; a sentence selection unit that generates a plurality of summary candidates for the document, each consisting of sentences selected from the plurality of sentences such that the total sentence length is no more than a predetermined length; and a summary selection unit that, based on the transition probabilities estimated by the probability estimation unit, the summary candidates generated by the sentence selection unit, and entity weights obtained in advance, selects a summary of the document from the summary candidates so as to optimize an evaluation function expressed using the entity weights and a coherence score, the coherence score being expressed, for each entity included in a summary, using the transition probability with which the state of the entity in the preceding sentence of two adjacent sentences in the summary transitions to the state of the entity in the subsequent sentence.

A summarization method according to a second invention generates a summary by selecting sentences included in a document, and comprises: a step in which an entity calculation unit calculates, for each of a plurality of sentences included in the document, the entities included in the sentence and the state of each entity; a step in which a probability estimation unit, based on the entities and entity states calculated for each of the plurality of sentences by the entity calculation unit, estimates, for each entity and for each pair of states of the entity, a transition probability representing the probability that the state of the entity in the preceding sentence of two adjacent sentences in the document transitions to the state of the entity in the subsequent sentence; a step in which a sentence selection unit generates a plurality of summary candidates for the document, each consisting of sentences selected from the plurality of sentences such that the total sentence length is no more than a predetermined length; and a step in which a summary selection unit, based on the transition probabilities estimated by the probability estimation unit, the summary candidates generated by the sentence selection unit, and entity weights obtained in advance, selects a summary of the document from the summary candidates so as to optimize an evaluation function expressed using the entity weights and a coherence score, the coherence score being expressed, for each entity included in a summary, using the transition probability with which the state of the entity in the preceding sentence of two adjacent sentences in the summary transitions to the state of the entity in the subsequent sentence.

The sentence selection unit may generate the plurality of summary candidates for the document such that each consists of sentences repeatedly selected at random from the plurality of sentences, with the total sentence length no more than the predetermined length.

In addition, the entity calculation unit may calculate, for each of a plurality of sentences included in a plurality of documents, the entities included in the sentence and the state of each entity; the probability estimation unit may estimate the transition probability for each entity and for each pair of states of the entity, based on the entities and entity states calculated by the entity calculation unit for each of the plurality of sentences included in the plurality of documents; the sentence selection unit may generate a plurality of summary candidates for the plurality of documents, each consisting of sentences selected from the plurality of sentences included in the plurality of documents such that the total sentence length is no more than a predetermined length; and the summary selection unit may select a summary of the plurality of documents from the summary candidates so as to optimize the evaluation function, based on the transition probabilities estimated by the probability estimation unit, the summary candidates generated by the sentence selection unit, and entity weights obtained in advance from a plurality of training documents and a summary of each of the training documents.

The entity weights may be learned in advance from a plurality of training documents and a summary of each of the training documents.

The summary selection unit may select the summary of the document from the summary candidates so as to optimize the evaluation function, and generate the summary such that the proportion of words shared by adjacent sentences in the summary is no more than a predetermined threshold.

A program according to a third invention is a program for causing a computer to function as each unit of the above summarization device.

As described above, the summarization device, method, and program of the present invention make it possible to obtain a summary that is easy for people to read, by: estimating, based on the entities and entity states calculated for each of a plurality of sentences, transition probabilities each representing the probability that the state of an entity in the preceding sentence of two adjacent sentences in a document transitions to the state of the entity in the subsequent sentence; generating a plurality of summary candidates for the document, each consisting of sentences selected from the plurality of sentences such that the total sentence length is no more than a predetermined length; and selecting a summary of the document from the summary candidates, based on the estimated transition probabilities, the summary candidates, and entity weights obtained in advance, so as to optimize an evaluation function expressed using the entity weights and a coherence score, the coherence score being expressed using the transition probabilities between the states of each entity in adjacent sentences of the summary.

A block diagram showing the functional configuration of the learning device according to the embodiment of the present invention.
A block diagram showing the functional configuration of the summarization device according to the embodiment of the present invention.
A flowchart showing the learning processing routine in the learning device according to the embodiment of the present invention.
A flowchart showing the summarization processing routine in the summarization device according to the embodiment of the present invention.
A flowchart showing the sentence selection processing routine in the summarization device.
A diagram showing an example of a summary obtained in an experiment.

Hereinafter, an embodiment of the present invention is described in detail with reference to the drawings.

<Overview of the Embodiment of the Present Invention>
This embodiment belongs to the field of natural language processing and relates to automatic document summarization. Its aim is to generate summaries that are easier to read than those produced by conventional techniques. The summarization approach considered in this embodiment is extractive.

This embodiment aims to improve the quality of automatic multi-document summarization by using entities to evaluate the quality of the generated text. Here, an entity is a word, and both the entity and its state are used. The embodiment designs an entity-based evaluation function for assessing the coherence and informativeness of the generated text. Generating a summary in this embodiment means selecting a plurality of sentences from the documents by maximizing the evaluation function.

Summarization is the generation of a sequence of sentences that covers the main information. This embodiment uses the weighted longest path problem to generate a summary, and designs an entity-based objective function that evaluates how suitable a text is as a summary. Document summarization can thus be treated as the problem of selecting the sequence of sentences that maximizes the evaluation function.

This embodiment consists of two main parts: a system that evaluates summaries, and a system that selects a plurality of sentences for the summary.

In this embodiment, a plurality of sentences is selected so as to cover the important information, and the selected sentences are ordered so that they form a coherent text. Each sentence is chosen so that it adds important information to the summary while maintaining the coherence of the text.

<Problem setting>
First, the problem setting of this embodiment and the sentence selection algorithm it uses are described.

[Weighted longest path problem]
Assume that there are K documents consisting of L sentences in total; that is, the K documents together contain L sentences. A summary of these K documents is to be generated by selecting a plurality of sentences from the L sentences. Suppose that a text T has been extracted from the L sentences. A function is designed to evaluate how suitable T is as a summary of the K documents.

As is well known, a good summary 1) has no redundancy, 2) covers the important information, and 3) is as coherent as possible. The evaluation function must reflect these three points.
The summary T consists of M entities (words) and N sentences, with N < L, as follows.

r_i^A denotes the state of entity e_i in sentence A. There are four states: "subj", "obj", "present", and "absent". "subj" indicates that entity e_i is the subject of sentence A. "obj" indicates that entity e_i is an object in sentence A. "present" is used when entity e_i occurs in sentence A but is neither "subj" nor "obj". "absent" indicates that entity e_i does not occur in sentence A. The weight a_i of entity e_i is the normalized token frequency in the source document set.
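The four states can be illustrated with a small Python sketch. It assumes that each sentence has already been analyzed into a mapping from entity to grammatical role (for example by a parser); the names `State`, `entity_state`, and `entity_grid`, and the role labels used as input, are hypothetical and not taken from the source.

```python
from enum import Enum


class State(Enum):
    """The four entity states named in the text."""
    SUBJ = "subj"
    OBJ = "obj"
    PRESENT = "present"
    ABSENT = "absent"


def entity_state(sentence_roles, entity):
    """Return the state of `entity` in one sentence.

    `sentence_roles` is an assumed input format: a dict mapping each entity
    that occurs in the sentence to its grammatical role.
    """
    if entity not in sentence_roles:
        return State.ABSENT
    role = sentence_roles[entity]
    if role == "subj":
        return State.SUBJ
    if role == "obj":
        return State.OBJ
    # The entity occurs, but neither as subject nor as object.
    return State.PRESENT


def entity_grid(sentences, entities):
    """Map each entity to its sequence of states, one per sentence."""
    return {e: [entity_state(s, e) for s in sentences] for e in entities}
```

For example, an entity that is the subject of the first sentence, missing from the second, and an oblique argument in the third gets the state sequence subj, absent, present.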

Score(T) is the overall evaluation of a generated summary T; it indicates how suitable T is as a summary. Extracting a summary amounts to finding the sequence of sentences Seq that maximizes Score(T), as shown in the following equation (2).

a_i is the weight of entity e_i, and F_i is the coherence score of the extracted sentence sequence. As stated above, the coherence score F_i is expressed, for each entity included in the summary, using the transition probability p_ei(r_j^i r_j^{i+1}) with which the state r_j^i of entity e_i in the preceding sentence of two adjacent sentences in the summary transitions to the state r_j^{i+1} of entity e_i in the subsequent sentence. The model used in this embodiment is a fixed-length weighted longest path problem. This formulation takes coherence and information coverage into account.
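The following Python sketch shows one way such a score could be computed. The text states only that Score(T) is expressed using the weights a_i and the coherence scores F_i, and that F_i is expressed using the transition probabilities. The concrete form used here, a weighted sum over entities with F_i taken as the sum of log transition probabilities over adjacent sentences, is an assumption made for illustration, as are the function names and the `floor` value that guards against log(0).

```python
import math


def coherence_score(states, trans_prob, floor=1e-6):
    """Assumed form of F_i: sum of log transition probabilities.

    `states` is the state sequence of one entity over the summary sentences.
    `trans_prob` maps (previous_state, next_state) to a probability.
    Unseen transitions fall back to `floor` (an assumption, not from the source).
    """
    return sum(
        math.log(max(trans_prob.get((prev, nxt), 0.0), floor))
        for prev, nxt in zip(states, states[1:])
    )


def score(summary_states, trans_probs, weights):
    """Assumed form of Score(T): sum over entities of a_i * F_i."""
    total = 0.0
    for entity, states in summary_states.items():
        f_i = coherence_score(states, trans_probs.get(entity, {}))
        total += weights.get(entity, 0.0) * f_i
    return total
```

Under this assumed form, a candidate whose entities follow frequently observed state transitions scores higher than one whose entities follow rare transitions.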

Redundancy must also be considered: sentences that share many words should not be adjacent in the summary. This constraint reduces both local and global redundancy.

Specifically, every pair of adjacent sentences in the summary is checked against the constraint of equation (4) below.

R = |A' ∩ B'| / |A' ∪ B'|   (3)
(A' and B' are the sets of words contained in sentences A and B, respectively.)
R ≤ Threshold1   (4)

In equation (3), the numerator is the number of distinct words common to sentence A and sentence B, and the denominator is the number of distinct words that appear in sentence A or sentence B.

When R > Threshold1, the sentence combinations A-B and B-A are removed from the graph representing the summary candidates. As a result, sentences that share many words are never placed next to each other in the summary.
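Equations (3) and (4) and the pruning of the candidate graph can be sketched in Python as follows. Lowercasing and whitespace tokenization are simplifying assumptions, and the function names are hypothetical; the source does not specify how words are extracted.

```python
def overlap_ratio(sentence_a, sentence_b):
    """R of equation (3): shared distinct words over all distinct words."""
    words_a = set(sentence_a.lower().split())
    words_b = set(sentence_b.lower().split())
    union = words_a | words_b
    if not union:
        return 0.0  # two empty sentences share nothing
    return len(words_a & words_b) / len(union)


def allowed_adjacent_pairs(sentences, threshold1):
    """Ordered index pairs (i, j) that satisfy equation (4).

    Pairs with R > threshold1 are left out in both directions,
    which corresponds to removing A-B and B-A from the graph.
    """
    allowed = set()
    for i, a in enumerate(sentences):
        for j, b in enumerate(sentences):
            if i != j and overlap_ratio(a, b) <= threshold1:
                allowed.add((i, j))
    return allowed
```

Because R is symmetric in A and B, a pair is either allowed in both orders or in neither.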

This formulation involves the three key elements of a good summary: elimination of redundancy, coverage of information, and coherence of the text. The threshold Threshold1 and the entity weights a_i are determined in advance by experiment.

[Sentence selection algorithm]
In this embodiment, a decoding algorithm is used as the sentence selection algorithm. The decoding problem is NP-hard, and no polynomial-time algorithm for it is known. A randomized algorithm, however, can obtain an approximate solution quickly. In this embodiment, a sequence of sentences satisfying all of the requirements is selected at random. This is repeated a number of times to generate a plurality of summary candidates, and the candidate with the highest score is taken as the summary. Details of the selection process are given in Algorithm 1.

Algorithm 1: Randomized algorithm
Initialization:
  U ← {all sentences in the document set}
  S ← {} (empty set; set of summary candidates)
  C ← {} (empty set; set of sentences)
REPEAT (a fixed number of times):
  Select a sentence s ∈ U at random.
  IF length(s) + Σ_i length(s_i) <= Threshold2, s_i ∈ C, THEN
    Remove s from U and append s to the end of C.
  ELSE
    Add C to S, and reinitialize C and U.
  END IF
O = argmax_T Score(T), T ∈ S
RETURN O
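Algorithm 1 might be sketched in Python as below. The function and parameter names are hypothetical, and `score` stands for any callable that plays the role of Score(T). Two details are not in the pseudocode and are assumptions of this sketch: a candidate is also closed when U runs out of sentences, and empty candidates are not stored.

```python
import random


def random_summary_search(sentences, threshold2, score,
                          iterations=1000, seed=None, length=len):
    """Randomized candidate generation followed by argmax over Score."""
    rng = random.Random(seed)
    pool = list(sentences)   # U: sentences still available
    candidates = []          # S: completed summary candidates
    current = []             # C: candidate under construction

    def close_candidate():
        # Store C (if non-empty) and reinitialize C and U.
        nonlocal pool, current
        if current:
            candidates.append(current)
        pool, current = list(sentences), []

    for _ in range(iterations):
        if not pool:
            close_candidate()  # assumption: U exhausted, start over
            continue
        s = rng.choice(pool)
        used = sum(length(t) for t in current)
        if length(s) + used <= threshold2:
            pool.remove(s)
            current.append(s)
        else:
            close_candidate()

    if not candidates:
        return None
    return max(candidates, key=score)
```

The redundancy constraint of equations (3) and (4) is not enforced here; it could be added as a further condition before a sentence is appended to the candidate.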

<Configuration of the Learning Device According to the Embodiment of the Present Invention>
Next, the configuration of the learning device according to the embodiment of the present invention is described. As shown in FIG. 1, the learning device 100 according to the embodiment can be configured as a computer including a CPU, a RAM, and a ROM that stores various data and a program for executing the learning processing routine described later. Functionally, as shown in FIG. 1, the learning device 100 includes an input unit 10, a calculation unit 20, and an output unit 30.

The input unit 10 receives a plurality of training documents. The input unit 10 also receives a summary prepared in advance for each of the training documents.

The calculation unit 20 includes a document storage unit 21, a summary storage unit 22, an entity calculation unit 23, a regression analysis unit 24, and a weight storage unit 25.

The document storage unit 21 stores the training documents received by the input unit 10.

The summary storage unit 22 stores the summaries received by the input unit 10.

The entity calculation unit 23 calculates, for each of the training documents stored in the document storage unit 21, the entities included in that document. The preprocessing needed to calculate entities includes stemming, and stop-word removal and simplification. The first step is to identify entities and resolve pronoun anaphora. This embodiment uses Stanford CoreNLP as an available entity identification and anaphora resolution tool. As entities, at least one of adjectives, adverbs, nouns, pronouns, and verbs is used, for example. Because the meaning of a pronoun can change when its context changes, each pronoun is replaced by the entity it refers to.

The regression analysis unit 24 learns the weight a_i of each entity by logistic regression analysis, from the entities calculated for each of the training documents by the entity calculation unit 23 and the summaries stored in the summary storage unit 22. The redundancy threshold is also determined by experiment. The entity weights a_i are thus learned in advance from the training documents and the summary of each training document. DUC2003 data, for example, can be used as the training documents.

The weight storage unit 25 stores the weight a_i of each entity obtained by the regression analysis unit 24.

The output unit 30 outputs, as the result, the weight a_i of each entity stored in the weight storage unit 25.

<Configuration of the Summarization Device According to the Embodiment of the Present Invention>
Next, the configuration of the summarization device according to the embodiment of the present invention is described. As shown in FIG. 2, the summarization device 200 according to the embodiment can be configured as a computer including a CPU, a RAM, and a ROM that stores various data and a program for executing the summarization processing routine described later. Functionally, as shown in FIG. 2, the summarization device 200 includes an input unit 40, a calculation unit 50, and an output unit 60. The summarization device 200 generates a summary by selecting sentences included in documents.

The input unit 40 receives a plurality of documents to be summarized.

The calculation unit 50 includes a document storage unit 51, an entity calculation unit 52, a probability estimation unit 53, a transition probability storage unit 54, a sentence selection unit 55, a summary candidate storage unit 56, a weight storage unit 57, and a summary selection unit 58.

The document storage unit 51 stores the documents received by the input unit 40.

The entity calculation unit 52 calculates, for each of the sentences included in the documents stored in the document storage unit 51, the entities included in the sentence and the state of each entity.

Based on the entities and entity states calculated for each of the sentences by the entity calculation unit 52, the probability estimation unit 53 estimates, for each entity e_k and for each state pair (m, n) of the entity e_k, a transition probability p_ek(mn) representing the probability that state m of the entity e_k in the preceding sentence of two adjacent sentences in a document transitions to state n of the entity e_k in the subsequent sentence.

Specifically, for each entity included in each document of the document set stored in the document storage unit 51, which consists of a plurality of documents, the probability estimation unit 53 estimates the transition probability as shown in the following equation (5).

#e_k(m)e_k(n) is the number of times, counted over all documents, that e_k has state m in the preceding sentence and state n in the subsequent sentence of a pair of adjacent sentences. "N−M" is the number of adjacent-sentence pairs in data containing M documents and N sentences.
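The counting described for equation (5) can be sketched in Python as follows. The sketch assumes that the estimate is the count #e_k(m)e_k(n) divided by the number of adjacent-sentence pairs N−M, which is how the two quantities defined above would naturally combine; the function name and the input format (one state sequence per document for a single entity) are hypothetical.

```python
from collections import Counter


def estimate_transition_probs(documents):
    """Estimate p_ek(mn) for one entity e_k.

    `documents` is a list of state sequences, one per document, each giving
    the state of e_k in every sentence of that document, in order.
    Assumed estimator: count of (m, n) over adjacent sentences / (N - M).
    """
    counts = Counter()
    num_pairs = 0  # equals N - M when every document has at least one sentence
    for states in documents:
        for m, n in zip(states, states[1:]):
            counts[(m, n)] += 1
        num_pairs += max(len(states) - 1, 0)
    if num_pairs == 0:
        return {}
    return {pair: c / num_pairs for pair, c in counts.items()}
```

Adjacent pairs are counted only within a document, never across a document boundary, which is why M documents with N sentences yield N−M pairs.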

The transition probability storage unit 54 stores each of the transition probabilities estimated by the probability estimation unit 53.

The sentence selection unit 55 generates a plurality of summary candidates, each consisting of sentences repeatedly selected at random from the sentences of the documents stored in the document storage unit 51, such that the total sentence length is no more than the predetermined length Threshold2.

Specifically, the sentence selection unit 55 generates a plurality of summary candidates, each consisting of sentences selected according to Algorithm 1 above from the sentences of the documents stored in the document storage unit 51.

The summary candidate storage unit 56 stores the summary candidates obtained by the sentence selection unit 55.

The weight storage unit 57 stores the weight a_i of each entity obtained in advance by the learning device 100.

The summary selection unit 58 selects a summary from the summary candidates stored in the summary candidate storage unit 56, based on each of the transition probabilities stored in the transition probability storage unit 54, each of the summary candidates stored in the summary candidate storage unit 56, and the weight a_i of each entity e_i stored in the weight storage unit 57, so as to optimize the evaluation function Score(T), which is expressed using the coherence score F_i and the entity weight a_i.

Specifically, the summary selection unit 58 selects a summary of the plurality of documents from the summary candidates stored in the summary candidate storage unit 56 according to the following equation (6) of Algorithm 1 above.

O = argmax_T Score(T), T ∈ S   (6)

In addition, the summary selection unit 58 generates the summary according to equations (3) and (4) above so that the ratio of words shared by adjacent sentences in the summary candidate is equal to or less than a predetermined threshold Threshold1.

The output unit 60 outputs the summary selected by the summary selection unit 58 as the result.

<Operation of the learning device according to the embodiment of the present invention>
Next, the operation of the learning device 100 according to the embodiment of the present invention will be described. When the input unit 10 receives a plurality of learning documents, they are stored in the document storage unit 21. When the input unit 10 receives a summary for each of the learning documents, the summaries are stored in the summary storage unit 22. The learning device 100 then executes the learning processing routine shown in FIG. 3.

First, in step S100, the entity calculation unit 23 calculates, for each of the plurality of learning documents stored in the document storage unit 21, the entities e_i contained in that learning document.

Next, in step S102, the regression analysis unit 24 learns the weight a_i of each entity e_i by logistic regression analysis from the entities e_i calculated for each of the learning documents in step S100 and the plurality of summaries stored in the summary storage unit 22.

Then, in step S104, the regression analysis unit 24 stores the weight a_i of each entity e_i obtained in step S102 in the weight storage unit 25.

Then, in step S106, the weight a_i of each entity e_i stored in the weight storage unit 25 in step S104 is output by the output unit 30, and the learning processing routine ends.
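As a rough illustration of the weight learning in step S102, the following Python sketch fits one weight per entity with a plain logistic regression trained by stochastic gradient ascent. The training setup is an assumption made only for this example: each training item is a sentence represented by the set of entities it contains, and its label is 1 if the sentence appears in the reference summary and 0 otherwise. The function name, the learning rate, and the number of epochs are likewise hypothetical.

```python
import math

def learn_entity_weights(examples, entities, epochs=200, lr=0.5):
    """Fit one logistic-regression weight a_i per entity e_i.

    examples: list of (entities_in_sentence, label) pairs, where label is 1
    if the sentence is in the reference summary and 0 otherwise (assumed).
    """
    weights = {e: 0.0 for e in entities}
    bias = 0.0
    for _ in range(epochs):
        for present, label in examples:
            z = bias + sum(weights[e] for e in present if e in weights)
            pred = 1.0 / (1.0 + math.exp(-z))   # sigmoid
            err = label - pred                  # gradient of the log-likelihood
            bias += lr * err
            for e in present:
                if e in weights:
                    weights[e] += lr * err
    return weights

# Toy data: "quake" occurs only in summary sentences, "weather" only outside.
examples = [
    ({"quake", "city"}, 1),
    ({"quake"}, 1),
    ({"weather"}, 0),
    ({"weather", "city"}, 0),
]
weights = learn_entity_weights(examples, {"quake", "city", "weather"})
```

With this toy data the entity that occurs only in summary sentences receives a positive weight and the entity that occurs only outside the summaries receives a negative one.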

<Operation of the summarization device according to the embodiment of the present invention>
Next, the operation of the summarization device 200 according to the embodiment of the present invention will be described. The weight a_i for each entity e_i learned by the learning device 100 is stored in the weight storage unit 57 of the summarization device 200. When the input unit 40 receives a plurality of documents to be summarized, they are stored in the document storage unit 51 of the summarization device 200. The summarization device 200 then executes the summarization processing routine shown in FIG. 4.

First, in step S202, each sentence of each of the plurality of documents stored in the document storage unit 51 is segmented into words. The entity calculation unit 52 then calculates, for each of the plurality of sentences included in the plurality of documents, the entities included in the sentence and the states of those entities.

In step S204, the probability estimation unit 53 estimates, for each entity e_k and for each state pair (m, n), the transition probability p_ek(mn) according to equation (5) above, based on the entities and entity states calculated for each of the plurality of sentences in step S202. The probability estimation unit 53 then stores each of the estimated transition probabilities p_ek(mn) in the transition probability storage unit 54.

In step S206, the sentence selection unit 55 generates a plurality of summary candidates, each consisting of sentences repeatedly selected at random from the plurality of sentences of the plurality of documents so that the total length of the sentences is equal to or less than the predetermined length Threshold2. Step S206 is realized by the sentence selection processing routine shown in FIG. 5.

<Sentence selection processing routine>
In step S300, the sentence selection unit 55 initializes the set U of all sentences in the plurality of documents, the summary candidate set S, and the sentence set C. The set U is initialized by storing in it all sentences in the plurality of documents. The summary candidate set S and the sentence set C are initialized as empty sets.

In step S302, the sentence selection unit 55 randomly selects a sentence s from the set U initialized in step S300 or in step S310 described later.

In step S304, the sentence selection unit 55 determines whether the sum of the total length Σ_i length(s_i) of the sentences s_i selected so far and the length length(s) of the sentence s selected in step S302 is equal to or less than the predetermined length Threshold2. If the sum is equal to or less than Threshold2, the process proceeds to step S306. If the sum is greater than Threshold2, the process proceeds to step S308.

In step S306, the sentence selection unit 55 appends the sentence s selected in step S302 to the end of the set C initialized in step S300 or updated in the previous execution of step S306, and the process returns to step S302.

In step S308, the sentence selection unit 55 adds the set C as updated so far to the set S initialized in step S300 or updated in the previous execution of step S308. The summary candidate set S is stored in the summary candidate storage unit 56.

In step S310, the sentence selection unit 55 initializes the set C and the set U.

In step S312, the sentence selection unit 55 determines whether the processing of steps S302 to S310 has been repeated a fixed number of times. If it has, the process proceeds to step S314. If it has not, the process returns to step S302.
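Steps S300 to S312 can be sketched in Python as follows. The sketch makes several assumptions that the text does not state: sentence length is measured in whitespace-separated words, a sentence drawn in step S302 is removed from U so that it cannot be drawn twice for the same candidate, and a candidate is also closed when U runs out of sentences. The function and parameter names are hypothetical.

```python
import random

def generate_candidates(sentences, max_length, num_candidates, seed=0):
    """Build summary candidates by repeated random sentence selection."""
    rng = random.Random(seed)
    candidates = []                                   # set S
    for _ in range(num_candidates):                   # loop checked in S312
        pool = list(sentences)                        # set U (S300 / S310)
        current = []                                  # set C (S300 / S310)
        total = 0
        while pool:
            s = pool.pop(rng.randrange(len(pool)))    # S302: random draw
            length = len(s.split())
            if total + length > max_length:           # S304: exceeds Threshold2
                break
            current.append(s)                         # S306: append to C
            total += length
        candidates.append(current)                    # S308: add C to S
    return candidates

sentences = ["a b c", "d e", "f g h i", "j"]
candidates = generate_candidates(sentences, max_length=5, num_candidates=10)
```

Every candidate returned this way respects the length limit by construction, because a sentence is appended only after the check corresponding to step S304 has passed.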

In step S314, the summary selection unit 58 selects a summary from the summary candidates stored in the summary candidate storage unit 56 according to equation (6), based on each of the transition probabilities stored in the transition probability storage unit 54 in step S204, each of the summary candidates stored in the summary candidate storage unit 56 in step S308, and the weight a_i of each entity e_i stored in the weight storage unit 57.

In step S316, the summary obtained in step S314 is output, and the sentence selection processing routine ends.
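The selection by equation (6) amounts to taking the candidate with the highest score. The Python sketch below illustrates this under a loudly stated assumption: the exact form of Score(T) is defined elsewhere in the specification and is not reproduced here, so the sketch substitutes a stand-in in which the coherence score of an entity is the sum of its transition probabilities over adjacent sentence pairs and Score(T) is the weighted sum of these coherence scores. The data layout and all names are hypothetical.

```python
def coherence(summary, entity, probs, absent="absent"):
    """Stand-in for F_i: sum of transition probabilities of one entity
    over all adjacent sentence pairs of the summary (assumed form)."""
    return sum(
        probs.get((entity, prev.get(entity, absent), nxt.get(entity, absent)), 0.0)
        for prev, nxt in zip(summary, summary[1:])
    )

def select_summary(candidates, probs, weights):
    """O = argmax over T in S of Score(T), with Score(T) assumed to be
    the sum over entities of a_i * F_i(T)."""
    def score(summary):
        return sum(a * coherence(summary, e, probs) for e, a in weights.items())
    return max(candidates, key=score)

# Each sentence is a dict from entity to state (assumed representation).
probs = {("quake", "S", "O"): 0.6, ("quake", "O", "S"): 0.1}
weights = {"quake": 1.0}
cand_a = [{"quake": "S"}, {"quake": "O"}]
cand_b = [{"quake": "O"}, {"quake": "S"}]
best = select_summary([cand_b, cand_a], probs, weights)
```

Here the candidate whose entity moves from "S" to "O" wins, because that transition has the higher estimated probability.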

Next, returning to the summarization processing routine, in step S208 the summary selection unit 58 generates, from the summary obtained in step S206, a summary in which the ratio of words shared by adjacent sentences is equal to or less than the predetermined threshold Threshold1, according to equations (3) and (4) above. The processing of step S208 eliminates redundancy from the summary.
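One way to picture the redundancy check of step S208 is the following Python sketch. Equations (3) and (4) are not reproduced in this part of the text, so the ratio used here is an assumption: the number of word types shared by two adjacent sentences divided by the number of word types in the shorter one. Dropping the later sentence of a redundant pair is also only one possible way of enforcing the threshold, and the function names are hypothetical.

```python
def overlap_ratio(a, b):
    """Share of word types common to two sentences, relative to the
    smaller vocabulary (assumed stand-in for equations (3) and (4))."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / min(len(words_a), len(words_b))

def remove_redundancy(summary, threshold):
    """Keep a sentence only if its overlap with the previously kept
    sentence does not exceed the threshold (Threshold1)."""
    kept = []
    for sentence in summary:
        if kept and overlap_ratio(kept[-1], sentence) > threshold:
            continue
        kept.append(sentence)
    return kept

summary = [
    "the quake hit the city",
    "the quake hit the city hard",
    "rescue teams arrived",
]
filtered = remove_redundancy(summary, threshold=0.5)
```

In this example the second sentence repeats every word type of the first and is dropped, while the third shares none and is kept.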

In step S210, the summary selection unit 58 outputs the summary obtained in step S208 through the output unit 60, and the summarization processing routine ends.

<Example>
Next, the results of an experiment using the summarization device of the present embodiment are shown.

[Preparation for experiment]
The experiment was performed on the data set of the DUC2004 summarization task, a multi-document summarization task. The data set consists of 50 document clusters, each containing 10 documents. One summary is generated for each cluster, with a target summary length of 100 words. The DUC2003 data set can also be used as training data to determine parameters such as the entity weights a_i. FIG. 6 shows an example of a summary obtained in the experiment.

Because the ROUGE-1 F-score has been shown to correlate highly with manual analysis, the discussion of the results in this experiment focuses on the ROUGE-1 F-score.
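For readers unfamiliar with the metric, the ROUGE-1 F-score is the harmonic mean of unigram precision and unigram recall between a generated summary and a reference summary. The Python sketch below is a simplified illustration that assumes lowercased whitespace tokenization and a single reference. It is not the official ROUGE toolkit, which offers further options such as stemming, and the function name is hypothetical.

```python
from collections import Counter

def rouge1_f(candidate, reference):
    """Unigram-overlap F-score between a candidate and one reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as in the reference.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f("the quake hit the city", "a quake hit the old city")
```

In this example four of the five candidate words match the six-word reference, giving a precision of 4/5, a recall of 4/6, and an F-score of 8/11.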

[Comparison with related methods]
The summarization device according to the present embodiment is compared with several widely used systems. The experimental results are shown in Table 1. MEAD is a toolkit that employs a ranking algorithm, and MMR (Maximal Marginal Relevance), which is included in MEAD, generates summaries of multiple documents. MKMC is the original maximum coverage method.

As shown in Table 1 above, the present embodiment treats coherence as an important factor in designing a summarization system and, by using entities, achieves results nearly equivalent to those of the latest state-of-the-art techniques.

As described above, the learning device according to the embodiment of the present invention learns entity weights from a plurality of learning documents and a summary for each of those documents, and can thereby obtain parameters for producing summaries that are easy for people to read.

Further, the summarization device according to the embodiment of the present invention estimates, based on the entities and entity states calculated for each of a plurality of sentences, a transition probability representing the probability that the state of an entity in the preceding sentence of adjacent sentences in a document transitions to the state of the entity in the subsequent sentence. It generates a plurality of candidates for the summary of the document, each consisting of sentences selected from the plurality of sentences so that the total length of the sentences is equal to or less than a predetermined length. It then selects the summary of the document from the summary candidates, based on each of the estimated transition probabilities, each of the summary candidates, and the entity weights obtained in advance, so as to optimize an evaluation function expressed using the entity weights and a coherence score, where the coherence score is expressed using the transition probability that the state of an entity in the preceding sentence of adjacent sentences in the summary transitions to the state of the entity in the subsequent sentence. A summary that is easy for people to read can thereby be obtained.

In addition, by selecting the summary so as to optimize the evaluation function expressed using the entity weights and the coherence score, which is expressed using transition probabilities representing transitions of entity states, a summary can be obtained that covers the important information while maintaining the coherence of the document.

Furthermore, by selecting the summary so as to optimize the evaluation function and generating the summary so that the ratio of words shared by adjacent sentences in the summary is equal to or less than a predetermined threshold, a summary from which redundancy has been eliminated can be obtained.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, although the above embodiment describes the case of generating a summary for a plurality of documents, the invention is not limited to this, and a summary may be generated from a single document.

Also, although the above embodiment describes the case where the summary selection unit 58 selects a summary so as to optimize the evaluation function and generates the summary so that the ratio of words shared by adjacent sentences in the summary is equal to or less than a predetermined threshold, that generation processing need not be performed. Alternatively, when sentences are selected to generate summary candidates, the candidates may be generated so that the ratio of words shared by adjacent sentences is equal to or less than the predetermined threshold, and any candidate in which the ratio exceeds the threshold may be discarded.

The learning device and the summarization device may also be implemented as a single device.

Further, although this specification describes an embodiment in which the program is installed in advance, the program may also be provided stored on a computer-readable recording medium or provided via a network.

DESCRIPTION OF SYMBOLS
10 Input unit
20 Arithmetic unit
21 Document storage unit
22 Summary storage unit
23 Entity calculation unit
24 Regression analysis unit
25 Weight storage unit
30 Output unit
40 Input unit
50 Arithmetic unit
51 Document storage unit
52 Entity calculation unit
53 Probability estimation unit
54 Transition probability storage unit
55 Sentence selection unit
56 Summary candidate storage unit
57 Weight storage unit
58 Summary selection unit
60 Output unit
100 Learning device
200 Summarization device

Claims

A summarization device that generates a summary by selecting sentences contained in a document, the device comprising:
an entity calculation unit that calculates, for each of a plurality of sentences included in the document, an entity included in the sentence and a state of the entity;
a probability estimation unit that estimates, for each of the entities and for each pair of states of the entity, based on the entity and the state of the entity calculated for each of the plurality of sentences by the entity calculation unit, a transition probability representing the probability that the state of the entity in the preceding sentence of adjacent sentences in the document transitions to the state of the entity in the subsequent sentence;
a sentence selection unit that generates a plurality of candidates for the summary of the document, each consisting of sentences selected from the plurality of sentences so that the total length of the sentences is equal to or less than a predetermined length; and
a summary selection unit that selects the summary of the document from the summary candidates, based on each of the transition probabilities estimated by the probability estimation unit, each of the summary candidates generated by the sentence selection unit, and a weight of the entity obtained in advance, so as to optimize an evaluation function expressed using the weight of the entity and a coherence score, the coherence score being expressed, for each of the entities included in the summary, using the transition probability that the state of the entity in the preceding sentence of adjacent sentences in the summary transitions to the state of the entity in the subsequent sentence.

The summarization device according to claim 1, wherein the sentence selection unit generates a plurality of candidates for the summary of the document, each consisting of sentences repeatedly selected at random from the plurality of sentences so that the total length of the sentences is equal to or less than the predetermined length.

The summarization device according to claim 1 or 2, wherein
the entity calculation unit calculates, for each of a plurality of sentences included in a plurality of documents, an entity included in the sentence and a state of the entity,
the probability estimation unit estimates the transition probability for each of the entities and for each pair of states of the entity, based on the entity and the state of the entity calculated by the entity calculation unit for each of the plurality of sentences included in the plurality of documents,
the sentence selection unit generates a plurality of candidates for a summary of the plurality of documents, each consisting of sentences selected from the plurality of sentences included in the plurality of documents so that the total length of the sentences is equal to or less than the predetermined length, and
the summary selection unit selects the summary of the plurality of documents from the summary candidates so as to optimize the evaluation function, based on each of the transition probabilities estimated by the probability estimation unit, each of the summary candidates generated by the sentence selection unit, and entity weights obtained in advance from a plurality of learning documents and a summary for each of the plurality of learning documents.

The summarization device according to any one of claims 1 to 3, wherein the weight of the entity is learned in advance from a plurality of learning documents and a summary for each of the plurality of learning documents.

The summarization device according to any one of claims 1 to 4, wherein the summary selection unit selects the summary of the document from the summary candidates so as to optimize the evaluation function, and generates the summary so that the ratio of words shared by adjacent sentences in the summary is equal to or less than a predetermined threshold.

A summarization method for generating a summary by selecting sentences contained in a document, the method comprising:
a step in which an entity calculation unit calculates, for each of a plurality of sentences included in the document, an entity included in the sentence and a state of the entity;
a step in which a probability estimation unit estimates, for each of the entities and for each pair of states of the entity, based on the entity and the state of the entity calculated for each of the plurality of sentences by the entity calculation unit, a transition probability representing the probability that the state of the entity in the preceding sentence of adjacent sentences in the document transitions to the state of the entity in the subsequent sentence;
a step in which a sentence selection unit generates a plurality of candidates for the summary of the document, each consisting of sentences selected from the plurality of sentences so that the total length of the sentences is equal to or less than a predetermined length; and
a step in which a summary selection unit selects the summary of the document from the summary candidates, based on each of the transition probabilities estimated by the probability estimation unit, each of the summary candidates generated by the sentence selection unit, and a weight of the entity obtained in advance, so as to optimize an evaluation function expressed using the weight of the entity and a coherence score, the coherence score being expressed, for each of the entities included in the summary, using the transition probability that the state of the entity in the preceding sentence of adjacent sentences in the summary transitions to the state of the entity in the subsequent sentence.

A program for causing a computer to function as each unit of the summarization device according to any one of claims 1 to 5.