JP2017091368A

JP2017091368A - Paraphrase device, method, and program

Info

Publication number: JP2017091368A
Application number: JP2015223131A
Authority: JP
Inventors: 千明宮崎; Chiaki Miyazaki; 徹平野; Toru Hirano; 竜一郎東中; Ryuichiro Higashinaka; 俊朗牧野; Toshiaki Makino; 義博松尾; Yoshihiro Matsuo
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2015-11-13
Filing date: 2015-11-13
Publication date: 2017-05-25

Abstract

PROBLEM TO BE SOLVED: To provide a paraphrase device capable of paraphrasing into an expression suited to the interlocutor's amount of knowledge. SOLUTION: A paraphrase candidate enumeration unit 32 refers to a predetermined thesaurus for an input word consisting of nouns and enumerates a plurality of paraphrase candidates. A paraphrase destination selection unit 34 selects a paraphrase expression from the plurality of paraphrase candidates based on a knowledge amount predetermined for the input word. A paraphrase sentence generation unit 36 generates a paraphrase sentence in which the input word of the input sentence is replaced with the paraphrase expression. SELECTED DRAWING: Figure 1

Description

The present invention relates to a paraphrase device, method, and program, and more particularly to a paraphrase device, method, and program for paraphrasing an input sentence.

Non-Patent Document 1 describes a technique for paraphrasing difficult words into plain words (basic learning vocabulary) in order to create documents that are easy to read even for children, who understand fewer words than adults.

Non-Patent Document 1: Tomoyuki Kajiwara, Kazuhide Yamamoto, Lexical simplification and evaluation using multiple kinds of paraphrase knowledge to support reading comprehension for elementary school students, Proceedings of the 19th Annual Meeting of the Association for Natural Language Processing, pp. 272-275, 2013.

Non-Patent Document 2: Kyoko Kanzaki, Francis Bond, Noriko Tomuro, Hitoshi Isahara, Extraction of Attribute Concepts from Japanese Adjectives, In LREC-2008, 2008.

However, the technique of Non-Patent Document 1 aims to improve readability for children and does not address differences in the amount of knowledge adults have about specific fields or topics.

In addition, no technology so far has deliberately paraphrased into difficult, uncommon expressions in order to give the impression that the speaker or writer is knowledgeable.

The present invention has been made to solve the above problems, and its object is to provide a paraphrase device, method, and program capable of paraphrasing into an expression suited to the interlocutor's amount of knowledge.

To achieve the above object, the paraphrase device according to the first invention includes: a paraphrase candidate enumeration unit that enumerates a plurality of paraphrase candidates, with reference to a predetermined thesaurus, for an input word consisting of nouns contained in an input sentence; a paraphrase destination selection unit that selects a paraphrase expression from the plurality of paraphrase candidates enumerated by the paraphrase candidate enumeration unit, based on a knowledge amount predetermined for the input word; and a paraphrase sentence generation unit that generates a paraphrase sentence in which the input word of the input sentence is paraphrased using the paraphrase expression selected by the paraphrase destination selection unit.

In the paraphrase device according to the first invention, the paraphrase destination selection unit may select a paraphrase expression with a lower appearance frequency as the knowledge amount corresponding to the input word is larger, and a paraphrase expression with a higher appearance frequency as that knowledge amount is smaller.

In the paraphrase device according to the first invention, the paraphrase destination selection unit may also select the paraphrase expression by calculating, for each paraphrase candidate, a paraphrase cost based on the knowledge amount, the document frequency (the number of documents in a predetermined corpus in which the paraphrase candidate appears), and the distance on the thesaurus between the paraphrase candidate and the input word.

The paraphrase method according to the second invention includes: a step in which a paraphrase candidate enumeration unit enumerates a plurality of paraphrase candidates, with reference to a predetermined thesaurus, for an input word consisting of nouns contained in an input sentence; a step in which a paraphrase destination selection unit selects a paraphrase expression from the plurality of paraphrase candidates enumerated by the paraphrase candidate enumeration unit, based on a knowledge amount predetermined for the input word; and a step in which a paraphrase sentence generation unit generates a paraphrase sentence in which the input word of the input sentence is paraphrased using the paraphrase expression selected by the paraphrase destination selection unit.

In the paraphrase method according to the second invention, the selecting step of the paraphrase destination selection unit selects a paraphrase expression with a lower appearance frequency as the knowledge amount corresponding to the input word is larger, and a paraphrase expression with a higher appearance frequency as that knowledge amount is smaller.

In the paraphrase method according to the second invention, the selecting step of the paraphrase destination selection unit selects the paraphrase expression by calculating, for each paraphrase candidate, a paraphrase cost based on the knowledge amount, the document frequency (the number of documents in a predetermined corpus in which the paraphrase candidate appears), and the distance on the thesaurus between the paraphrase candidate and the input word.

The program according to the third invention is a program for causing a computer to function as each unit of the paraphrase device according to the first invention.

According to the paraphrase device, method, and program of the present invention, a plurality of paraphrase candidates are enumerated for an input word consisting of nouns with reference to a predetermined thesaurus, a paraphrase expression is selected from the candidates based on a knowledge amount predetermined for the input word, and a paraphrase sentence is generated in which the input word of the input sentence is paraphrased using that expression. This yields the effect that the sentence can be paraphrased into an expression suited to the interlocutor's amount of knowledge.

FIG. 1 is a block diagram showing the configuration of the paraphrase device according to the embodiment of the present invention.
FIG. 2 is a diagram showing an example of the Japanese thesaurus in Example 1.
FIG. 3 is a diagram showing an example of the Japanese thesaurus in Example 2.
FIG. 4 is a diagram showing an example of the knowledge amount setting file.
FIG. 5 is a diagram showing an example of the calculation of the paraphrase cost.
FIG. 6 is a flowchart showing the paraphrase processing routine in the paraphrase device according to the embodiment of the present invention.

Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings.

<Overview of the Embodiment of the Present Invention>

First, an overview of the embodiment of the present invention will be given.

In the embodiment, hypernyms, synonyms, and instances (concrete examples) obtainable from a thesaurus prepared in advance (a systematically organized synonym dictionary) are enumerated as candidate paraphrase destination expressions. Next, a paraphrase cost is calculated for each enumerated candidate using (1) the set knowledge amount, (2) the generality of the expression (its document frequency in a Japanese corpus prepared in advance), and (3) its distance on the thesaurus from the paraphrase source expression. The candidate with the lowest calculated paraphrase cost is then selected as the single paraphrase expression, and a paraphrase sentence is created using it.

As an example, suppose the sentence ボジョレーが好きです ("I like Beaujolais") is input. Paraphrasing it with the proper name of a wine, as in ジョゼフ・ドルーアンが好きです ("I like Joseph Drouhin"), gives the impression of greater knowledge. Conversely, paraphrasing it with a hypernym, as in 赤ワインが好きです ("I like red wine"), produces a sentence that sounds natural when spoken (or written) by someone assumed to know little about wine, and that is also easy to understand for a listener (or reader) who knows little about wine.

As an example of a different nature, suppose the sentence サーチエンジンで検索してください ("Please search with a search engine") is input. Using a concrete proper name, as in "Please search with google (R)" or "Please search with a search engine such as google (R)", turns it, contrary to the previous example, into a sentence that is easier to understand and suited to people with little knowledge.

Which kind of expression (hypernym, synonym, or instance such as a proper name) is the more general and more easily understood one is judged from its appearance frequency in the corpus.

<Configuration of the Paraphrase Device According to the Embodiment of the Present Invention>

Next, the configuration of the paraphrase device according to the embodiment of the present invention will be described. As shown in FIG. 1, the paraphrase device 100 can be configured as a computer including a CPU, a RAM, and a ROM that stores various data and a program for executing the paraphrase processing routine described later. Functionally, as shown in FIG. 1, the paraphrase device 100 includes an input unit 10, a calculation unit 20, and an output unit 50.

The input unit 10 receives an input sentence to be paraphrased. The input sentence of Example 1 is ボジョレーが好きです ("I like Beaujolais"). The input sentence of Example 2 is サーチエンジンで検索してください ("Please search with a search engine"). For convenience, processing that assumes the input sentence of Example 1 is referred to below as Example 1, and processing that assumes the input sentence of Example 2 is referred to as Example 2.

The calculation unit 20 includes a noun string extraction unit 30, a paraphrase candidate enumeration unit 32, a paraphrase destination selection unit 34, a paraphrase sentence generation unit 36, a Japanese thesaurus 40, a knowledge amount setting file 42, and a Japanese corpus 44.

The noun string extraction unit 30 extracts, as input words, noun strings consisting of nouns contained in the input sentence received by the input unit 10. Specifically, it applies morphological analysis to the input sentence, splitting it into morphemes and obtaining the part of speech of each morpheme. From the analysis result, it then extracts as input words all noun strings consisting of a single noun or of consecutive nouns.

For the input sentence of Example 1, the sentence is split into the morphemes ボジョレー／が／好き／です, and the parts of speech noun / particle / adjective / auxiliary verb are obtained. Next, ボジョレー ("Beaujolais", noun) is extracted as the input word to be paraphrased.

For the input sentence of Example 2, the sentence is split into the morphemes サーチ／エンジン／で／検索／し／て／ください, and the parts of speech noun / noun / particle / noun (sahen) / verb / particle / verb are obtained. Next, サーチ ("search", noun) and サーチエンジン ("search engine", noun + noun) are extracted as input words to be paraphrased.

In the present embodiment, verbal nouns (sahen nouns) such as 検索（する） ("search") are excluded from the paraphrase targets.
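The extraction rule above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the morphological analysis has already been done and that morphemes arrive as hypothetical (surface, part-of-speech) pairs tagged 名詞 for plain nouns and 名詞-サ変 for verbal nouns. It also assumes every contiguous sub-sequence of a noun run is a candidate, which yields the two input words listed for Example 2 plus エンジン.

```python
def extract_noun_strings(morphemes):
    """Return noun strings built from runs of consecutive plain nouns.

    `morphemes` is a list of (surface, pos) pairs; the tag names are assumptions.
    """
    runs, current = [], []
    for surface, pos in morphemes:
        if pos == "名詞":  # plain noun; "名詞-サ変" (verbal noun) is excluded
            current.append(surface)
        elif current:
            runs.append(current)
            current = []
    if current:
        runs.append(current)

    noun_strings = []
    for run in runs:
        # Assumption: every contiguous sub-sequence of a run is an input word.
        for start in range(len(run)):
            for end in range(start + 1, len(run) + 1):
                text = "".join(run[start:end])
                if text not in noun_strings:
                    noun_strings.append(text)
    return noun_strings


# Morphemes of Example 2, tagged by hand for this sketch.
example2 = [
    ("サーチ", "名詞"), ("エンジン", "名詞"), ("で", "助詞"),
    ("検索", "名詞-サ変"), ("し", "動詞"), ("て", "助詞"), ("ください", "動詞"),
]
input_words = extract_noun_strings(example2)
```

Because the later thesaurus lookup keeps only the longest matching input word, extracting a superset of noun strings here does not change which word is paraphrased.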

The paraphrase candidate enumeration unit 32 enumerates a plurality of paraphrase candidates for the input words extracted by the noun string extraction unit 30, with reference to the Japanese thesaurus 40.

As one example, the Japanese thesaurus 40 is a Japanese thesaurus created by adding instances (proper names) to part of the thesaurus of Non-Patent Document 2, but any systematically organized Japanese synonym dictionary may be used. Partial examples of the Japanese thesaurus 40 corresponding to Example 1 and Example 2 are shown in FIG. 2 and FIG. 3, respectively.

Specifically, the paraphrase candidate enumeration unit 32 matches all extracted input words against the entries of the Japanese thesaurus 40 and, for the matching input word with the largest number of characters, acquires its synonyms, hypernyms (words 1 to N levels above it and their synonyms), and instances (concrete examples such as proper names) as paraphrase candidates. In Example 2, as shown in FIG. 3, サーチエンジン ("search engine"), which has more characters than サーチ ("search"), becomes the paraphrase target, and its paraphrase candidates are acquired from the entries of the Japanese thesaurus 40.

In the present embodiment, the input word itself, that is, the original noun string contained in the input sentence (ボジョレー in Example 1, サーチエンジン in Example 2), is also included in the paraphrase candidates, although it may be left out. When the input word is included in the candidates, the paraphrase sentence may end up being generated with the original noun string unchanged.

In Example 1, the following entries of the Japanese thesaurus 40 are enumerated as paraphrase candidates: アルコール ("alcohol"), 酒 ("liquor"), ワイン ("wine"), ぶどう酒 ("grape wine"), 赤ワイン ("red wine"), ボジョレー ("Beaujolais"), and ジョゼフ・ドルーアン ("Joseph Drouhin").

In Example 2, the following entries of the Japanese thesaurus 40 are enumerated as paraphrase candidates: コード ("code"), ソフトウェア ("software"), プログラム ("program"), サーチエンジン ("search engine"), 検索エンジン ("search engine", the native-word synonym), yahoo (R), and google (R).

In the present embodiment, N = 3, so hypernyms are acquired by going up to three levels, but any number of levels may be used.
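The enumeration for Example 1 can be sketched with a small parent-linked structure. The node layout below is an assumption inferred from the candidates listed for Example 1 (it is not a reproduction of FIG. 2), and counting an instance as one move away from its node is also an assumption. The function and field names are hypothetical.

```python
# Toy thesaurus: each node holds synonymous terms, a parent node, and instances.
THESAURUS = {
    "alcohol": {"terms": ["アルコール", "酒"], "parent": None, "instances": []},
    "wine": {"terms": ["ワイン", "ぶどう酒"], "parent": "alcohol", "instances": []},
    "red_wine": {"terms": ["赤ワイン"], "parent": "wine", "instances": []},
    "beaujolais": {"terms": ["ボジョレー"], "parent": "red_wine",
                   "instances": ["ジョゼフ・ドルーアン"]},
}


def find_node(word, thesaurus):
    """Return the id of the node whose terms include `word`, or None."""
    for node_id, node in thesaurus.items():
        if word in node["terms"]:
            return node_id
    return None


def enumerate_candidates(input_words, thesaurus, max_levels=3):
    """Return (source word, {candidate: distance}) for the longest matching word."""
    matched = [w for w in input_words if find_node(w, thesaurus) is not None]
    if not matched:
        return None, {}
    source = max(matched, key=len)  # longest input word found in the thesaurus
    node = thesaurus[find_node(source, thesaurus)]

    candidates = {term: 0 for term in node["terms"]}  # the word and its synonyms
    for instance in node["instances"]:
        candidates[instance] = 1  # assumption: an instance is one move away
    parent, level = node["parent"], 1
    while parent is not None and level <= max_levels:  # hypernyms up to N levels
        for term in thesaurus[parent]["terms"]:
            candidates[term] = level
        parent, level = thesaurus[parent]["parent"], level + 1
    return source, candidates


source, candidates = enumerate_candidates(["ボジョレー"], THESAURUS)
```

Returning the distance together with each candidate keeps the later cost calculation from having to walk the thesaurus a second time.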

Furthermore, when the device is applied to a dialogue system, the enumerated paraphrase candidates may be restricted using the topic of the dialogue (topic word). For example, if several utterances on the topic "wine" have already been exchanged, it would be unnatural for the system to say ワインが好きです ("I like wine") again, so words at or above the topic word ワイン in the thesaurus hierarchy (ワイン "wine", ぶどう酒 "grape wine", アルコール "alcohol", 酒 "liquor") may be excluded from the paraphrase candidates.

The paraphrase destination selection unit 34 selects a paraphrase expression from the plurality of paraphrase candidates enumerated by the paraphrase candidate enumeration unit 32, based on the knowledge amount predetermined for the input word: the larger the knowledge amount corresponding to the input word, the lower the appearance frequency of the selected expression, and the smaller the knowledge amount, the higher its appearance frequency. Specifically, as described below, the paraphrase destination selection unit 34 selects the paraphrase expression by calculating, for each paraphrase candidate, a paraphrase cost based on the knowledge amount set in the knowledge amount setting file 42, the document frequency (the number of documents in the Japanese corpus 44 in which the candidate appears), and the distance on the thesaurus between the candidate and the input word.

An example of the knowledge amount setting file 42 is shown in FIG. 4. For each topic category, either "high knowledge amount" or "low knowledge amount" is set per character. The knowledge amount is set according to the character acting as speaker or listener; in FIG. 4, for example, there are assumed to be two per-character settings, setting a and setting b. To set the knowledge amount, the entries of the Japanese thesaurus 40 are classified in advance into broad topics, and a knowledge amount is set for each broad topic category. In the present embodiment, for example, the entry 赤ワイン ("red wine") is classified in advance into the topic category グルメ ("gourmet") and the entry サーチエンジン ("search engine") into the topic category コンピュータ ("computer"), and a knowledge amount is set per character for the categories "gourmet" and "computer". Instead of defining the knowledge amount per topic category, it may be defined, for example, per entry of the Japanese thesaurus 40, in which case the knowledge amount of the entry corresponding to the input word is used.
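One possible in-memory form of such a setting file is sketched below. The values for settings a and b follow the way this document describes FIG. 4; the dictionary layout, the labels "high" and "low", and the function name are assumptions made for illustration.

```python
# Knowledge amount per topic category, one table per setting (character).
KNOWLEDGE_SETTINGS = {
    "a": {"グルメ": "high", "コンピュータ": "low"},
    "b": {"グルメ": "low", "コンピュータ": "high"},
}

# Thesaurus entries classified in advance into broad topic categories.
TOPIC_OF_WORD = {
    "ボジョレー": "グルメ",
    "赤ワイン": "グルメ",
    "サーチエンジン": "コンピュータ",
}


def knowledge_amount(input_word, setting):
    """Look up the knowledge amount that applies to `input_word` under `setting`."""
    topic = TOPIC_OF_WORD[input_word]
    return KNOWLEDGE_SETTINGS[setting][topic]
```

For example, `knowledge_amount("ボジョレー", "a")` gives `"high"`, while `knowledge_amount("サーチエンジン", "a")` gives `"low"`. Keying the table by topic rather than by word keeps the file small; the per-entry variant mentioned above would simply replace the topic lookup with a direct word lookup.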

The Japanese corpus 44 stores a plurality of documents. In the present embodiment, a "document" refers to a single SNS post, a single blog article, a single news article, or the like.

The paraphrase destination selection unit 34 selects the paraphrase expression by performing the following first to third processes.

As the first process, the paraphrase destination selection unit 34 counts, for each paraphrase candidate, its document frequency in the Japanese corpus 44 (the number of documents in which the candidate appears). An expression with a high document frequency appears in many documents and can be regarded as a more general expression. Conversely, an expression with a low document frequency can be regarded as an unusual expression that is not widely used.
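Counting the document frequency can be sketched as below. The sketch assumes plain substring matching over documents held as strings; a real system would more likely match on morphemes, and the toy corpus is invented for illustration.

```python
def document_frequency(term, corpus):
    """Number of documents in `corpus` that contain `term` at least once."""
    return sum(1 for document in corpus if term in document)


# Hypothetical four-document corpus (one string per post or article).
toy_corpus = [
    "赤ワインとチーズを買った",
    "ボジョレーが解禁された",
    "赤ワインより白ワインが好き",
    "今日は検索エンジンの話",
]
```

Each document counts at most once, so the third document contributes 1 to the count for ワイン even though the word occurs in it twice.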

Next, as the second process, the paraphrase destination selection unit 34 calculates a paraphrase cost for each paraphrase candidate using (1) the set knowledge amount, (2) the document frequency, and (3) the distance on the thesaurus from the input word, which is the paraphrase source expression. With the "high knowledge amount" setting, the cost is designed so that more general expressions have a higher paraphrase cost; in other words, unusual expressions become more likely to be selected. In the present embodiment, the paraphrase cost c(w) of an expression w, whose document frequency is df(w) and whose distance on the thesaurus from the input word is d(w), is calculated according to equation (1).

With the "low knowledge amount" setting, the cost is designed so that more unusual expressions have a higher paraphrase cost; in other words, general expressions become more likely to be selected. In the present embodiment, where D is the total number of documents in the Japanese corpus (for example, D = 1,200,000), the paraphrase cost of an expression w is calculated according to equation (2) using the inverse document frequency idf(w).

Then, as the third process, the paraphrase destination selection unit 34 selects the candidate with the lowest calculated cost as the paraphrase destination. FIG. 5 shows an example of the cost calculation. As shown in FIG. 5, the movement distance from the input word is incorporated into the paraphrase cost so that expressions positioned closer on the Japanese thesaurus 40 are more likely to be selected as the paraphrase destination and more distant expressions are less likely. In the present embodiment, the simple number of moves is used as the movement distance. For example, in FIG. 3, サーチエンジン and 検索エンジン (both "search engine") are synonyms located at the same node, so the movement distance (number of moves) from サーチエンジン to 検索エンジン in FIG. 5 is 0, whereas going from サーチエンジン to プログラム ("program") requires one move, so the movement distance is 1.

For the paraphrase candidates of Example 1, the knowledge amount defined for the topic category "gourmet" is used. For the paraphrase candidates of Example 2, the knowledge amount defined for the topic category "computer" is used.

The cases where setting a and setting b are applied to Example 1 and Example 2 are described below.

When setting a of FIG. 4 is applied, the knowledge amount defined for the topic category "gourmet" is "high knowledge amount", and the knowledge amount defined for the topic category "computer" is "low knowledge amount". In this case, the paraphrase expression selected for Example 1 (topic category "gourmet") is ジョゼフ・ドルーアン ("Joseph Drouhin"), and the paraphrase expression selected for Example 2 (topic category "computer") is google (R).

When setting b of FIG. 4 is applied, the knowledge amount defined for the topic category "gourmet" is "low knowledge amount", and the knowledge amount defined for the topic category "computer" is "high knowledge amount". In this case, the paraphrase expression selected for Example 1 (topic category "gourmet") is 赤ワイン ("red wine"), and the paraphrase expression selected for Example 2 (topic category "computer") is サーチエンジン ("search engine", that is, the expression is left unchanged).

The paraphrase cost formulas of equations (1) and (2) are only examples. Any formula may be used as long as the cost is designed so that unusual expressions are more likely to be selected when the knowledge amount is set high and general expressions are more likely to be selected when it is set low.
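Since any cost with the stated property is permitted, the selection can be illustrated with a stand-in formula. The sketch below does not reproduce equations (1) and (2), which are not shown in this text: it assumes cost = log(df + 1) + d for the high setting and cost = log(D / (df + 1)) + d for the low setting. The document frequencies are invented for illustration and are not the values of FIG. 5; the distances assume one move per thesaurus level.

```python
import math

TOTAL_DOCS = 1_200_000  # D, the example corpus size mentioned in the description


def paraphrase_cost(df, distance, knowledge, total_docs=TOTAL_DOCS):
    """Stand-in cost (an assumption, not equations (1)/(2) of the source)."""
    if knowledge == "high":
        # Common expressions cost more, so rare ones tend to win.
        return math.log(df + 1) + distance
    # Rare expressions cost more (smoothed idf), so common ones tend to win.
    return math.log(total_docs / (df + 1)) + distance


# Hypothetical document frequencies and thesaurus distances for Example 1.
CANDIDATES = [
    {"term": "ボジョレー", "df": 900, "distance": 0},
    {"term": "赤ワイン", "df": 30000, "distance": 1},
    {"term": "ワイン", "df": 40000, "distance": 2},
    {"term": "ぶどう酒", "df": 300, "distance": 2},
    {"term": "アルコール", "df": 40000, "distance": 3},
    {"term": "酒", "df": 90000, "distance": 3},
    {"term": "ジョゼフ・ドルーアン", "df": 15, "distance": 1},
]


def select_paraphrase(candidates, knowledge):
    """Return the candidate term with the lowest paraphrase cost."""
    best = min(
        candidates,
        key=lambda c: paraphrase_cost(c["df"], c["distance"], knowledge),
    )
    return best["term"]
```

With these invented numbers, `select_paraphrase(CANDIDATES, "high")` picks ジョゼフ・ドルーアン and `select_paraphrase(CANDIDATES, "low")` picks 赤ワイン, matching the outcomes described for settings a and b. Adding the distance as a plain additive term is only one way to penalize distant expressions.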

Likewise, although the present embodiment uses the simple number of moves as the movement distance, the distance is not limited to this. Any measure may be used as long as it assigns low values to expressions that are close in meaning or concept and high values to expressions that are distant.

The paraphrase sentence generation unit 36 generates a paraphrase sentence in which the input word of the input sentence is paraphrased using the paraphrase expression selected by the paraphrase destination selection unit 34. Specifically, as described below for Example 1 and Example 2, the noun string contained in the input sentence is replaced with the paraphrase destination expression.

In Example 1, if the paraphrase expression selected with "low knowledge amount" is 赤ワイン ("red wine"), ボジョレー in the input sentence ボジョレーが好きです is replaced with 赤ワイン, generating the paraphrase sentence 赤ワインが好きです ("I like red wine"). If the paraphrase expression selected with "high knowledge amount" is ジョゼフ・ドルーアン ("Joseph Drouhin"), ボジョレー is replaced with ジョゼフ・ドルーアン, generating the paraphrase sentence ジョゼフ・ドルーアンが好きです ("I like Joseph Drouhin").

In Example 2, if the paraphrase expression selected with "low knowledge amount" is google (R), サーチエンジン in the input sentence サーチエンジンで検索してください is replaced with google (R), generating the paraphrase sentence "Please search with google (R)". If the paraphrase expression selected with "high knowledge amount" is サーチエンジン ("search engine"), the expression is left unchanged and the paraphrase sentence サーチエンジンで検索してください ("Please search with a search engine") is generated.

In the present embodiment, the paraphrase source expression is simply replaced with the paraphrase destination expression. However, when an instance (a concrete example such as a proper name) becomes the paraphrase destination, the form "<paraphrase source expression> such as <instance>" (in Japanese, ＜インスタンス＞などの＜言い換え元表現＞) may be used instead, as in ジョゼフ・ドルーアンなどのボジョレー ("Beaujolais such as Joseph Drouhin") or google（Ｒ）などのサーチエンジン ("a search engine such as google (R)").
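Both replacement styles can be sketched with plain string substitution. The function name and the `keep_source` flag are hypothetical; the sketch assumes the input word occurs in the sentence and replaces only its first occurrence.

```python
def generate_paraphrase(sentence, source, target, keep_source=False):
    """Replace `source` in `sentence` with `target`.

    With keep_source=True the instance-style form "<target>などの<source>"
    is used, which is intended for targets that are instances (proper names).
    """
    replacement = f"{target}などの{source}" if keep_source else target
    return sentence.replace(source, replacement, 1)


plain = generate_paraphrase("ボジョレーが好きです", "ボジョレー", "赤ワイン")
with_source = generate_paraphrase(
    "サーチエンジンで検索してください", "サーチエンジン", "google（Ｒ）", keep_source=True
)
```

Here `plain` is 赤ワインが好きです and `with_source` is google（Ｒ）などのサーチエンジンで検索してください. When the selected expression equals the input word, the plain form returns the sentence unchanged.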

<Operation of the Paraphrase Device According to the Embodiment of the Present Invention>

Next, the operation of the paraphrase device 100 according to the embodiment of the present invention will be described. When the input unit 10 receives an input sentence, the paraphrase device 100 executes the paraphrase processing routine shown in FIG. 6.

First, in step S100, a noun string included in the input sentence received by the input unit 10 is extracted as an input word.

Next, in step S102, a plurality of paraphrase candidates are enumerated for the input word extracted in step S100, with reference to the Japanese thesaurus 40.

In step S104, a candidate to be processed is selected from the paraphrase candidates enumerated in step S102.

In step S106, the document frequency of the paraphrase candidate selected in step S104 is counted, that is, the number of documents in the Japanese corpus 44 in which the paraphrase candidate appears.
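As a rough illustration of the document-frequency count in step S106, the sketch below treats the corpus as a list of document strings and uses plain substring matching. Both choices are assumptions made for brevity; the text does not say how the Japanese corpus 44 is stored or matched, and the toy documents are invented.

```python
from typing import List


def document_frequency(candidate: str, corpus: List[str]) -> int:
    """Count the documents that contain `candidate` at least once.

    A document is counted once no matter how many times the candidate
    occurs in it (document frequency, not term frequency).
    """
    return sum(1 for document in corpus if candidate in document)


# Invented toy corpus: three short "documents".
toy_corpus = [
    "赤ワインを買った",                  # "I bought red wine"
    "赤ワインと白ワイン、赤ワインが好き",  # "red wine" occurs twice here
    "サーチエンジンで検索した",           # "I searched with a search engine"
]

df_red_wine = document_frequency("赤ワイン", toy_corpus)
df_beaujolais = document_frequency("ボジョレー", toy_corpus)
```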

In step S108, the paraphrase cost is calculated according to equation (1) or equation (2) above, based on the knowledge amount set in the knowledge amount setting file 42, the document frequency counted in step S106, and the distance on the thesaurus between the paraphrase candidate and the input word.
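The cost calculation of step S108 combines three inputs: knowledge amount, document frequency, and thesaurus distance. The sketch below is NOT equation (1) or (2), which are defined elsewhere in the description; it is a hypothetical stand-in chosen only to show the intended tendency, namely that a small knowledge amount favors frequent expressions and a large knowledge amount favors rare ones, with nearer candidates preferred in both cases. The function name, the "low"/"high" labels, and all numbers are assumptions.

```python
import math


def paraphrase_cost(knowledge: str, doc_freq: int, distance: int) -> float:
    """Hypothetical stand-in for the paraphrase cost (not equations (1)/(2)).

    knowledge: "low" or "high" (assumed labels for the knowledge amount)
    doc_freq:  document frequency of the candidate in the corpus
    distance:  distance between candidate and input word on the thesaurus
    """
    familiarity = math.log(1 + doc_freq)
    if knowledge == "low":
        # Frequent expressions become cheaper.
        return distance - familiarity
    if knowledge == "high":
        # Rare expressions become cheaper.
        return distance + familiarity
    raise ValueError(f"unknown knowledge amount: {knowledge!r}")


# Invented (document frequency, distance) pairs for the input word "Beaujolais".
candidates = {
    "Beaujolais": (300, 0),
    "red wine": (5000, 1),
    "Joseph Drouhin": (10, 1),
}


def cheapest(knowledge: str) -> str:
    """Return the candidate with the lowest stand-in cost."""
    return min(candidates,
               key=lambda c: paraphrase_cost(knowledge, *candidates[c]))
```

With these invented numbers, `cheapest("low")` picks "red wine" and `cheapest("high")` picks "Joseph Drouhin", matching the direction of Example 1; other formulas with the same monotonic behavior would serve equally well.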

In step S110, it is determined whether the processing of steps S106 to S108 has been completed for all paraphrase candidates. If it has, the routine proceeds to step S112; if not, the routine returns to step S104, selects the next paraphrase candidate, and repeats the processing.

In step S112, from among the plurality of paraphrase candidates whose costs were calculated in steps S106 to S108, the paraphrase candidate with the lowest calculated cost is selected as the paraphrase expression.

In step S114, a paraphrase sentence is generated in which the input word of the input sentence is paraphrased using the paraphrase expression selected in step S112, the paraphrase sentence is output to the output unit 50, and the processing ends.
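The control flow of steps S100 to S114 can be sketched as a single loop over candidates. Everything concrete here is an assumption for illustration: the input word is passed in already extracted (step S100 would need a morphological analyzer), the thesaurus is a dictionary from input word to (candidate, distance) pairs, the corpus is a list of strings, and `toy_cost` is a deliberately crude placeholder rather than equation (1) or (2).

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical thesaurus shape: input word -> [(candidate, thesaurus distance)].
Thesaurus = Dict[str, List[Tuple[str, int]]]
CostFn = Callable[[str, int, int], float]


def paraphrase_routine(sentence: str, input_word: str, thesaurus: Thesaurus,
                       corpus: List[str], knowledge: str,
                       cost_fn: CostFn) -> str:
    """Run steps S102 to S114 for one already-extracted input word (S100)."""
    candidates = thesaurus.get(input_word, [])            # S102: enumerate
    best, best_cost = input_word, None
    for candidate, distance in candidates:                # S104 / S110: loop
        doc_freq = sum(1 for doc in corpus if candidate in doc)  # S106
        cost = cost_fn(knowledge, doc_freq, distance)     # S108
        if best_cost is None or cost < best_cost:         # S112: keep minimum
            best, best_cost = candidate, cost
    return sentence.replace(input_word, best, 1)          # S114: generate


def toy_cost(knowledge: str, doc_freq: int, distance: int) -> float:
    """Placeholder cost: low knowledge rewards frequency, high penalizes it."""
    return distance - doc_freq if knowledge == "low" else distance + doc_freq


# Invented toy data. The input word itself is listed at distance 0 so that
# "no change" remains a possible outcome, as in Example 2.
THESAURUS: Thesaurus = {
    "ボジョレー": [("ボジョレー", 0), ("赤ワイン", 1), ("ジョゼフ・ドルーアン", 1)],
}
CORPUS = [
    "赤ワインを買った", "赤ワインが好き", "赤ワインとボジョレー",
    "赤ワインを飲んだ", "赤ワインは渋い",
    "ジョゼフ・ドルーアンのボジョレー", "ボジョレーを開けた",
]

low = paraphrase_routine("ボジョレーが好きです", "ボジョレー",
                         THESAURUS, CORPUS, "low", toy_cost)
high = paraphrase_routine("ボジョレーが好きです", "ボジョレー",
                          THESAURUS, CORPUS, "high", toy_cost)
```

Passing the cost function as a parameter keeps the loop independent of the exact formula, so the real equations could be substituted without touching the control flow.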

As described above, the paraphrasing device according to the embodiment of the present invention enumerates a plurality of paraphrase candidates for an input word consisting of a noun with reference to the Japanese thesaurus 40, selects a paraphrase expression from the plurality of paraphrase candidates based on the knowledge amount set in advance for the input word in the knowledge amount setting file 42, and generates a paraphrase sentence in which the input word of the input sentence is paraphrased using the paraphrase expression. The input can thereby be paraphrased into an expression suited to the knowledge amount of the interlocutor.

Further, the technique according to the embodiment of the present invention makes it possible to paraphrase expressions according to the amount of knowledge about a specific topic that an assumed speaker (writer) or listener (reader) has. When this is applied to a computer that converses with humans (a dialogue system), the expressions contained in the system's utterances can be changed according to the character setting of the dialogue system. For example, a dialogue system with a child character should have little knowledge about alcohol, so paraphrasing "Dad likes Beaujolais" as "Dad likes red wine" makes the utterance more natural. Conversely, by deliberately paraphrasing into uncommon expressions (specific product names, producer names, and the like), a dialogue system that talks as if it were knowledgeable about wine can also be built. In addition, paraphrasing into expressions that are easier to understand, according to the amount of knowledge the dialogue partner (user) has, contributes to realizing a dialogue system that is easy for the user to use.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

Description of symbols
10 Input unit
20 Calculation unit
30 Noun string extraction unit
32 Paraphrase candidate enumeration unit
34 Paraphrase destination selection unit
36 Paraphrase sentence generation unit
40 Japanese thesaurus
42 Knowledge amount setting file
44 Japanese corpus
50 Output unit
100 Paraphrasing device

Claims

A paraphrasing device comprising:
a paraphrase candidate enumeration unit that enumerates a plurality of paraphrase candidates for an input word consisting of a noun included in an input sentence, with reference to a predetermined thesaurus;
a paraphrase destination selection unit that selects a paraphrase expression from the plurality of paraphrase candidates enumerated by the paraphrase candidate enumeration unit, based on a knowledge amount predetermined for the input word; and
a paraphrase sentence generation unit that generates a paraphrase sentence in which the input word of the input sentence is paraphrased using the paraphrase expression selected by the paraphrase destination selection unit.

The paraphrasing device according to claim 1, wherein the paraphrase destination selection unit selects a paraphrase expression having a lower appearance frequency as the knowledge amount corresponding to the input word is larger, and selects a paraphrase expression having a higher appearance frequency as the knowledge amount corresponding to the input word is smaller.

The paraphrasing device according to claim 1, wherein the paraphrase destination selection unit selects the paraphrase expression by calculating, for each of the paraphrase candidates, a paraphrase cost based on the knowledge amount, a document frequency that is the number of documents in a predetermined corpus in which the paraphrase candidate appears, and a distance on the thesaurus between the paraphrase candidate and the input word.

A paraphrasing method comprising:
a step in which a paraphrase candidate enumeration unit enumerates a plurality of paraphrase candidates for an input word consisting of a noun included in an input sentence, with reference to a predetermined thesaurus;
a step in which a paraphrase destination selection unit selects a paraphrase expression from the plurality of paraphrase candidates enumerated by the paraphrase candidate enumeration unit, based on a knowledge amount predetermined for the input word; and
a step in which a paraphrase sentence generation unit generates a paraphrase sentence in which the input word of the input sentence is paraphrased using the paraphrase expression selected by the paraphrase destination selection unit.

The paraphrasing method according to claim 4, wherein, in the step of selecting by the paraphrase destination selection unit, a paraphrase expression having a lower appearance frequency is selected as the knowledge amount corresponding to the input word is larger, and a paraphrase expression having a higher appearance frequency is selected as the knowledge amount corresponding to the input word is smaller.

The paraphrasing method according to claim 4, wherein, in the step of selecting by the paraphrase destination selection unit, the paraphrase expression is selected by calculating, for each of the paraphrase candidates, a paraphrase cost based on the knowledge amount, a document frequency that is the number of documents in a predetermined corpus in which the paraphrase candidate appears, and a distance on the thesaurus between the paraphrase candidate and the input word.

A program for causing a computer to function as each unit of the paraphrasing device according to any one of claims 1 to 3.