JP2007072663A

JP2007072663A - Example translation device and example translation method

Info

Publication number: JP2007072663A
Application number: JP2005257671A
Authority: JP
Inventors: Naoto Kato; 直人加藤
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2005-09-06
Filing date: 2005-09-06
Publication date: 2007-03-22

Abstract

PROBLEM TO BE SOLVED: To provide an example translation device capable of performing translation from an original language to an objective language by use of a parallel translation example corpus.
SOLUTION: This example translation device has: a reception part 11 receiving document information; a parallel translation example corpus storage part 12 storing the parallel translation example corpus associating and having a plurality of original language examples and a plurality of objective language examples that are translations of the original language examples in the objective language; a selection part 13 selecting at least two original language examples similar to the document information from the plurality of original language examples of the parallel translation example corpus; a translation candidate information production part 14 producing translation candidate information showing a candidate of the translation of the document information by use of at least two objective language examples corresponding to at least the two selected original language examples; a translation document information production part 15 selecting the candidate of the translation satisfying a prescribed condition from the translation candidate information, and producing translation document information showing the selected candidate of the translation; and an output part 16 outputting the translation document information.
COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to an example translation apparatus and the like that perform machine translation.

With advances in computers and the growth of language resources, corpus-based natural language processing technology has developed. In machine translation, which is one field of natural language processing, approaches using examples have been studied ever since example translation using bilingual examples was proposed (see, for example, Non-Patent Documents 1 to 5).
Non-Patent Document 1: Makoto Nagao. A framework of a mechanical translation between Japanese and English by analogy principle. Proc. of the international NATO symposium on Artificial and human intelligence, pp. 173-180, 1984.
Non-Patent Document 2: Satoshi Sato, Makoto Nagao. Toward memory-based translation. Procs of the 13th conference on Computational linguistics, pp. 247-252, 1990.
Non-Patent Document 3: Satoshi Sato. MBT2: a method for combining fragments of examples in example-based translation. Artificial intelligence, Vol. 75, pp. 31-49, 1995.
Non-Patent Document 4: Eiichiro Sumita. Example-based machine translation using DP-matching between word sequences. DDMT workshop of 39th ACL, pp. 1-8, 2001.
Non-Patent Document 5: Eiji Aramaki et al. Translation selection using Japanese-English alignment confidence and Japanese similarity for example-based translation. Natural Language Processing, Vol. 11, No. 1, pp. 107-123, 2004.

The conventional methods presuppose that word alignment or phrase alignment has been established between the source language and the target language. In actual bilingual corpora, however, and in spoken-language corpora in particular, alignment is often difficult because of omissions and idiomatic expressions. In addition, examples may not be used effectively when, for instance, two sentences are identical on the target-language side but are expressed differently on the source-language side.

The present invention has been made to solve the above problems, and an object thereof is to provide an example translation apparatus and the like that enable translation from a source language to a target language using a bilingual example corpus, without necessarily using word alignment or phrase alignment.

To achieve the above object, an example translation apparatus according to the present invention includes: a reception unit that receives document information, which is information indicating a document in a source language; a bilingual example corpus storage unit that stores a bilingual example corpus, which is information associating a plurality of source language examples, which are examples in the source language, with a plurality of target language examples, which are translations of the source language examples in a target language different from the source language; a selection unit that selects, from the plurality of source language examples in the bilingual example corpus, two or more source language examples similar to the document indicated by the document information received by the reception unit; a translation candidate information creation unit that creates translation candidate information, which is information indicating candidates for the translation of the document information received by the reception unit, using two or more target language examples corresponding to the two or more source language examples selected by the selection unit; a translated document information creation unit that selects, from the translation candidate information created by the translation candidate information creation unit, a translation candidate satisfying a predetermined condition, and creates translated document information, which is information indicating the selected translation candidate; and an output unit that outputs the translated document information created by the translated document information creation unit.

With this configuration, machine translation from the source language to the target language can be performed using the bilingual example corpus. Machine translation can therefore be performed without necessarily using word alignment, phrase alignment, or the like.

In the example translation apparatus according to the present invention, the selection unit may calculate the similarity between each of the plurality of source language examples and the document indicated by the document information received by the reception unit, and may select source language examples having high similarity such that the words shared between the two or more selected source language examples and the document indicated by the document information cover the entire document.

With this configuration, source language examples closely related to the document indicated by the document information can be selected, and translating with the target language examples corresponding to those source language examples may yield a more appropriate translation.
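The coverage-driven selection described above can be sketched as a greedy loop. This is a minimal illustration under stated assumptions, not the apparatus's actual procedure: the function names `select_examples` and `shared_fraction`, the toy similarity measure, and the romanized sample sentences are all hypothetical, and the text does not prescribe a greedy strategy.

```python
def select_examples(input_words, source_examples, similarity):
    """Greedily pick similar examples until the input is covered.

    `similarity` is any function scoring two token lists (hypothetical hook).
    Returns the selected examples and the input words left uncovered.
    """
    ranked = sorted(source_examples,
                    key=lambda ex: similarity(input_words, ex),
                    reverse=True)
    uncovered = set(input_words)
    selected = []
    for example in ranked:
        newly_covered = uncovered & set(example)
        # Keep an example if it covers new words, or to reach the minimum of two.
        if newly_covered or len(selected) < 2:
            selected.append(example)
            uncovered -= newly_covered
        if not uncovered and len(selected) >= 2:
            break
    return selected, uncovered


def shared_fraction(a, b):
    """Toy similarity: fraction of the distinct words of `a` that occur in `b`."""
    return len(set(a) & set(b)) / len(set(a)) if a else 0.0


# Hypothetical, already-segmented Japanese sentences (romanized).
query = ["eki", "made", "ikura", "desu", "ka"]
corpus = [
    ["kuukou", "made", "ikura", "desu", "ka"],
    ["eki", "wa", "doko", "desu", "ka"],
    ["arigatou"],
]
selected, uncovered = select_examples(query, corpus, shared_fraction)
```

Here the first two corpus sentences together share every word of the query, so both are selected and no word is left uncovered. A non-empty `uncovered` set would signal that the corpus cannot cover the input.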

In the example translation apparatus according to the present invention, the selection unit may use, as the similarity between two documents, the ratio of words common to the two documents.
In the example translation apparatus according to the present invention, the selection unit may use, as the similarity between two documents, the DP (Dynamic Programming) matching value of the two documents.
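As an illustration of a DP-matching similarity, the sketch below computes a word-level edit distance by dynamic programming and normalizes it by the longer sentence length. The text does not specify which DP matching score is used, so the normalization and the names `edit_distance` and `dp_similarity` are assumptions made for this example.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance computed by dynamic programming."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1,          # delete word_a
                            curr[j - 1] + 1,      # insert word_b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def dp_similarity(a, b):
    """Map the distance to [0, 1]; 1.0 means identical word sequences."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


score = dp_similarity("i have a pen".split(), "i have a book".split())
```

Unlike a bag-of-words ratio, this score is sensitive to word order, which is one reason a DP-based measure might be preferred.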

In the example translation apparatus according to the present invention, when the source language is a language that is not delimited into words, the selection unit may perform processing that segments into words the document indicated by the document information received by the reception unit and the source language examples in the bilingual example corpus.

With this configuration, even when the source language is a language that is not delimited into words, such as Japanese or Chinese, segmenting the document indicated by the document information and the source language examples into words makes it possible to calculate the similarity between them.

In the example translation apparatus according to the present invention, the translation candidate information creation unit may acquire, from the bilingual example corpus, two or more target language examples corresponding to the two or more source language examples selected by the selection unit, and may create translation candidate information indicating the adjacency relationships among common nodes and single nodes, where each word shared among the acquired target language examples is treated as one common node and each of the other words in the acquired target language examples is treated as one single node.
With this configuration, translation candidates can be created from the possible combinations of words in the target language examples.

In the example translation apparatus according to the present invention, when two words are related by an adjacency relationship other than those between the nodes indicated by the translation candidate information, and the bigram probability of that two-word sequence is equal to or greater than a threshold, the translation candidate information creation unit may add an adjacency relationship indicating that two-word sequence to the translation candidate information.
With this configuration, the number of translation candidates can be increased, which may raise the likelihood of obtaining a correct translation result.
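The threshold rule above can be sketched as follows. The graph is represented here as a dictionary mapping each word to the set of words that may follow it, and `bigram_prob` is a hypothetical lookup table of P(w2 | w1) values. Both representations, the function name `add_bigram_edges`, and the sample numbers are assumptions made for illustration.

```python
def add_bigram_edges(edges, bigram_prob, threshold):
    """Add edge w1 -> w2 when it is absent and P(w2 | w1) >= threshold.

    `edges` maps a word to the set of words allowed to follow it and is
    updated in place. Returns the list of (w1, w2) pairs that were added.
    """
    nodes = set(edges) | {w for followers in edges.values() for w in followers}
    added = []
    for w1 in sorted(nodes):
        for w2 in sorted(nodes):
            if w1 == w2 or w2 in edges.get(w1, set()):
                continue  # skip self-loops and adjacencies already present
            if bigram_prob.get((w1, w2), 0.0) >= threshold:
                added.append((w1, w2))
    for w1, w2 in added:
        edges.setdefault(w1, set()).add(w2)
    return added


edges = {"the": {"station"}, "museum": {"open"}}
bigram_prob = {("the", "museum"): 0.2, ("station", "open"): 0.01}  # made-up values
added = add_bigram_edges(edges, bigram_prob, threshold=0.1)
```

With these made-up probabilities only the pair ("the", "museum") clears the threshold, so one new adjacency is added. A lower threshold produces more candidates at the cost of admitting less plausible word sequences.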

In the example translation apparatus according to the present invention, when the target language is a language that is not delimited into words, the translation candidate information creation unit may perform processing that segments into words the two or more target language examples corresponding to the two or more source language examples selected by the selection unit.

With this configuration, even when the target language is a language that is not delimited into words, such as Japanese or Chinese, segmenting the target language examples into words makes it possible to create translation candidate information from the adjacency relationships among common nodes and single nodes.

In the example translation apparatus according to the present invention, the translated document information creation unit may calculate the product of N-gram probabilities for each translation candidate indicated by the translation candidate information, and may select the translation candidate having the largest product of N-gram probabilities as the translation candidate satisfying the predetermined condition.
With this configuration, the most probable translation candidate becomes the translation result, which may raise the likelihood of obtaining a correct translation result.
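A minimal sketch of this selection with N = 2 follows. It sums log probabilities instead of multiplying raw probabilities, which ranks candidates identically while avoiding numerical underflow. The probability table, the `floor` value used for unseen bigrams, the sentence-boundary markers, and the function names are all assumptions made for this example.

```python
import math


def bigram_log_score(words, bigram_prob, floor=1e-6):
    """Log of the product of bigram probabilities over the sentence.

    Unseen bigrams receive the small `floor` probability (an assumed
    smoothing choice; the text does not specify one).
    """
    path = ["<s>", *words, "</s>"]
    return sum(math.log(bigram_prob.get(pair, floor))
               for pair in zip(path, path[1:]))


def best_candidate(candidates, bigram_prob):
    """Return the candidate whose bigram-probability product is largest."""
    return max(candidates, key=lambda c: bigram_log_score(c, bigram_prob))


bigram_prob = {  # made-up probabilities for illustration
    ("<s>", "where"): 0.5, ("where", "is"): 0.8, ("is", "the"): 0.6,
    ("the", "station"): 0.3, ("station", "</s>"): 0.9, ("<s>", "is"): 0.1,
}
candidates = [["where", "is", "the", "station"], ["is", "the", "station"]]
best = best_candidate(candidates, bigram_prob)
```

Note that a plain product favors shorter candidates because every extra word contributes another factor below one. Whether and how to compensate for length is not addressed in the text.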

In the example translation apparatus according to the present invention, the bilingual example corpus may also include component alignment, which is the alignment of components, a component being an element that constitutes a sentence, and the translated document information creation unit may select the translation candidate satisfying the predetermined condition on the basis of whether an alignment according to the component alignment exists between the document indicated by the document information received by the reception unit and the document of the translation candidate.

With this configuration, creating the translated document information so that the alignment results are reflected for aligned components may raise the likelihood of obtaining a correct translation result.

In the example translation apparatus according to the present invention, for any of the two or more source language examples selected by the selection unit, when the similarity between the document of a translation candidate and the target language example corresponding to that source language example is not within a predetermined range relative to the similarity between the document indicated by the document information received by the reception unit and that source language example, the translated document information creation unit may refrain from selecting that translation candidate.
With this configuration, inappropriate translation candidates can be kept from being selected as the translation result.

In the example translation apparatus according to the present invention, the translated document information creation unit may use, as the similarity between two documents, the ratio of words common to the two documents.
In the example translation apparatus according to the present invention, the translated document information creation unit may use, as the similarity between two documents, the DP (Dynamic Programming) matching value of the two documents.

According to the example translation apparatus and the like of the present invention, machine translation from a source language to a target language can be performed using a bilingual example corpus.

An example translation apparatus according to the present invention is described below by way of embodiments. In the following embodiments, components given the same reference numerals are identical or equivalent, and repeated description of them may be omitted.

(Embodiment 1)
An example translation apparatus according to Embodiment 1 of the present invention will be described with reference to the drawings.
FIG. 1 is a block diagram showing the configuration of an example translation apparatus 1 according to the present embodiment. As shown in FIG. 1, the example translation apparatus 1 according to the present embodiment includes a reception unit 11, a bilingual example corpus storage unit 12, a selection unit 13, a translation candidate information creation unit 14, a translated document information creation unit 15, and an output unit 16.

The reception unit 11 receives document information. Here, document information is information indicating a document in the source language. The document information is, for example, information indicating a document of one sentence or of two or more sentences. In the present embodiment, the document information is assumed to indicate a one-sentence document. The reception unit 11 may, for example, receive document information entered from an input device (for example, a keyboard, a mouse, or a touch panel), receive document information transmitted over a wired or wireless communication line, or receive document information read from a predetermined recording medium (for example, an optical disk, a magnetic disk, or a semiconductor memory). In the present embodiment, the reception unit 11 is assumed to read document information stored in a recording medium 21. The recording medium 21 can be realized by, for example, a semiconductor memory, a magnetic disk, or an optical disk. The recording medium 21 may be included in the example translation apparatus 1 or may be external to it. The reception unit 11 may or may not include a device for reception (for example, a modem or a network card). The reception unit 11 may be realized by hardware, or by software such as a driver that drives a predetermined device.

The bilingual example corpus storage unit 12 stores a bilingual example corpus. Here, a bilingual example corpus is information that associates a plurality of source language examples, which are examples in the source language, with a plurality of target language examples, which are translations of the source language examples in the target language. The example translation apparatus 1 translates from the source language into the target language. The target language is a language different from the source language. The source and target languages may be any languages, for example Japanese, Chinese, Korean, English, French, German, or Russian. The present embodiment describes the case where the source language is Japanese and the target language is English. The bilingual example corpus may be, for example, BTEC (Basic Travel Expression Corpus), a corpus of basic travel conversation expressions created by the applicant (ATR), or may be another corpus. The bilingual example corpus has 10,000 or more pairs of source language examples and target language examples, and usually several tens of thousands of pairs. Needless to say, a larger number of pairs is better for example translation.
The bilingual example corpus storage unit 12 can be realized by a predetermined recording medium (for example, a semiconductor memory, a magnetic disk, or an optical disk). The process by which the bilingual example corpus comes to be stored in the bilingual example corpus storage unit 12 does not matter. For example, the corpus may be stored via a recording medium, a corpus transmitted over a communication line or the like may be stored, or a corpus entered through an input device may be stored. Storage in the bilingual example corpus storage unit 12 may be temporary storage, in RAM or the like, of a bilingual example corpus read from an external storage device or the like, or may be long-term storage on a hard disk or the like.

The selection unit 13 selects, from the plurality of source language examples in the bilingual example corpus, two or more source language examples similar to the document indicated by the document information received by the reception unit 11. More specifically, the selection unit 13 may select source language examples as follows. The selection unit 13 first calculates the similarity between each of the plurality of source language examples and the document indicated by the document information received by the reception unit 11. The selection unit 13 may then select source language examples having high similarity such that the words shared between the two or more selected source language examples and the document indicated by the document information cover the entire document. The selection unit 13 may use, as the similarity between two documents, the ratio of words common to the two documents. The selection unit 13 may calculate the ratio of common words using, for example, the following Equation (1).

[Equation (1): not reproduced in this text]

The ratio of words common to two documents may also be calculated by an equation other than Equation (1). For example, an equation in which the coefficient "2" in the numerator of Equation (1) is replaced by "1" may be used. The selection unit 13 may also use, as the similarity between two documents, the DP (Dynamic Programming) matching value of the two documents. DP matching is well known, so its description is omitted. When the source language is a language that is not delimited into words, the selection unit 13 may perform processing that segments into words the document indicated by the document information received by the reception unit 11 and the source language examples. This processing is performed by, for example, morphological analysis. Languages that are not delimited into words include, for example, Japanese and Chinese.
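Equation (1) itself does not appear in this text. The only stated facts are that it gives the ratio of words common to two documents and that its numerator carries the coefficient 2. One formula consistent with that description is the Dice coefficient, 2 x (number of common words) / (number of words in the first document + number of words in the second document). The sketch below assumes that form; the denominator, the multiset treatment of repeated words, and the name `common_word_ratio` are assumptions, not details taken from the source.

```python
from collections import Counter


def common_word_ratio(words_a, words_b, numerator_coeff=2):
    """Ratio of shared words between two token lists (assumed Dice-style form).

    Repeated words are counted as a multiset. Setting `numerator_coeff=1`
    gives the variant with the numerator coefficient changed from 2 to 1.
    """
    common = sum((Counter(words_a) & Counter(words_b)).values())
    total = len(words_a) + len(words_b)
    return numerator_coeff * common / total if total else 0.0


# Hypothetical, already-segmented Japanese sentences (romanized).
a = ["eki", "wa", "doko", "desu", "ka"]
b = ["toire", "wa", "doko", "desu", "ka"]
ratio = common_word_ratio(a, b)
half_ratio = common_word_ratio(a, b, numerator_coeff=1)
```

With the coefficient 2 the ratio ranges from 0 to 1 and equals 1 for identical sentences. With the coefficient 1 the maximum is 0.5, but the ranking of examples is unchanged.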

The translation candidate information creation unit 14 creates translation candidate information using two or more target language examples corresponding to the two or more source language examples selected by the selection unit 13. Here, translation candidate information is information indicating candidates for the translation of the document information received by the reception unit 11. The translation candidate information is created by finding the possible combinations of word sequences in the target language examples. More specifically, the translation candidate information creation unit 14 may create the translation candidate information as follows. The translation candidate information creation unit 14 acquires, from the bilingual example corpus, two or more target language examples corresponding to the two or more source language examples selected by the selection unit 13. The translation candidate information creation unit 14 may then create translation candidate information indicating the adjacency relationships among common nodes and single nodes, where each word shared among the acquired target language examples is treated as one common node and each of the other words in the acquired target language examples is treated as one single node. A "common word" may be an identical word or a similar word. In the present embodiment, common words are assumed to be identical words. When the target language is a language that is not delimited into words, the translation candidate information creation unit 14 may perform processing that segments into words the two or more target language examples corresponding to the two or more source language examples selected by the selection unit 13. This processing is performed by, for example, morphological analysis.
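The node-and-adjacency structure described above can be sketched as a small word graph. In this illustration each distinct word is one node, so a word occurring in several examples is merged into a single common node automatically. The sentence-boundary markers, the rule that a path may not visit a word twice, the length cap, and the function names are assumptions added to keep the example small and terminating.

```python
from collections import defaultdict

START, END = "<s>", "</s>"


def build_word_graph(target_examples):
    """Return (edges, common_nodes) for tokenized target-language examples.

    `edges` maps each word to the set of words observed right after it.
    `common_nodes` holds words that occur in two or more examples.
    """
    edges = defaultdict(set)
    seen_in = defaultdict(set)  # word -> indices of the examples containing it
    for idx, words in enumerate(target_examples):
        path = [START, *words, END]
        for left, right in zip(path, path[1:]):
            edges[left].add(right)
        for word in words:
            seen_in[word].add(idx)
    common = {w for w, owners in seen_in.items() if len(owners) >= 2}
    return edges, common


def enumerate_candidates(edges, max_len=20):
    """List every START-to-END path that uses no word more than once."""
    results = []

    def walk(node, path):
        for nxt in sorted(edges.get(node, ())):
            if nxt == END:
                results.append(path)
            elif nxt not in path and len(path) < max_len:
                walk(nxt, path + [nxt])

    walk(START, [])
    return results


examples = [["where", "is", "the", "station"], ["is", "the", "museum", "open"]]
edges, common = build_word_graph(examples)
candidates = [" ".join(c) for c in enumerate_candidates(edges)]
```

For these two made-up examples the common nodes are "is" and "the", and the graph yields four candidates, including recombinations such as "is the station" that occur in neither example. A later scoring step would then have to choose among them.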

The translated document information creation unit 15 selects, from the translation candidate information created by the translation candidate information creation unit 14, a translation candidate satisfying a predetermined condition, and creates translated document information, which is information indicating the selected translation candidate. The translated document information indicates a document obtained by translating the document information received by the reception unit 11. More specifically, the translated document information creation unit 15 may create the translated document information as follows. The translated document information creation unit 15 may calculate the product of N-gram probabilities for each translation candidate indicated by the translation candidate information, and may select the translation candidate having the largest product of N-gram probabilities as the translation candidate satisfying the predetermined condition. The corpus used to calculate the N-gram probabilities may be the bilingual example corpus stored in the bilingual example corpus storage unit 12, or a corpus stored in a recording medium that is not shown. The "N" of the N-gram probability may be, for example, N = 2, N = 3, or a larger value. The translated document information creation unit 15 may calculate the N-gram probabilities itself, or may use N-gram probabilities calculated by another component, another apparatus, or the like. N-gram probabilities are well known, so their description is omitted.
For any of the two or more source language examples selected by the selection unit 13, when the similarity between the document of a translation candidate and the target language example corresponding to that source language example is not within a predetermined range relative to the similarity between the document indicated by the document information received by the reception unit 11 and that source language example, the translated document information creation unit 15 may refrain from selecting that translation candidate. Specifically, let J_0 be the document indicated by the document information received by the reception unit 11, let J_i be a source language example selected by the selection unit 13 (where i is an integer from 1 to M, and M is the number of source language examples selected by the selection unit 13, an integer of 2 or more), let E_i be the target language example corresponding to the source language example J_i (where i is an integer from 1 to M), and let E_0 be a translation candidate. The translated document information creation unit 15 may then select one of the translation candidates E_0 for which the following expression is satisfied for every i from 1 to M.

[Expression: not reproduced in this text]

The translated document information creation unit 15 may use, as the similarity between two documents, the ratio of words common to the two documents (for example, Equation (1) above), or may use the DP matching value of the two documents.
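The expression that defines the "predetermined range" does not appear in this text. The sketch below assumes one plausible reading: a candidate E_0 is kept only if, for every selected pair (J_i, E_i), the target-side similarity sim(E_0, E_i) differs from the source-side similarity sim(J_0, J_i) by at most a tolerance `delta`. The absolute-difference form, the value of `delta`, and the function names are assumptions, not details taken from the source.

```python
def within_range(candidate, input_words, selected_pairs, similarity, delta):
    """True if |sim(E_0, E_i) - sim(J_0, J_i)| <= delta for every pair i."""
    return all(
        abs(similarity(candidate, target) - similarity(input_words, source)) <= delta
        for source, target in selected_pairs
    )


def filter_candidates(candidates, input_words, selected_pairs, similarity, delta=0.2):
    """Drop candidates whose target-side similarity strays from the source side."""
    return [c for c in candidates
            if within_range(c, input_words, selected_pairs, similarity, delta)]


def dice(a, b):
    """Set-based Dice ratio, used here only as a stand-in similarity."""
    total = len(set(a)) + len(set(b))
    return 2 * len(set(a) & set(b)) / total if total else 0.0


# Abstract tokens: lowercase = source side, uppercase = target side.
j0 = ["a", "b", "c", "d"]
pairs = [(["a", "b", "x"], ["A", "B", "X"]),
         (["c", "d", "y"], ["C", "D", "Y"])]
kept = filter_candidates([["A", "B", "C", "D"], ["A", "B", "X"]], j0, pairs, dice)
```

The second candidate simply copies one target example, so it is far more similar to that example than the input is to the matching source example, and the filter rejects it. The first candidate mirrors the input's relation to both examples and is kept.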

The output unit 16 outputs the translated document information created by the translated document information creation unit 15. This output may be, for example, display on a display device (for example, a CRT or a liquid crystal display), transmission to a predetermined device over a communication line, printing by a printer, or storage in a recording medium. In the present embodiment, the output unit 16 is assumed to store the translated document information in a recording medium 22. The recording medium 22 can be realized by, for example, a semiconductor memory, a magnetic disk, or an optical disk. The recording medium 22 may be included in the example translation apparatus 1 or may be external to it. The output unit 16 may or may not include a device that performs output (for example, a display device or a printer). The output unit 16 may be realized by hardware, or by software such as a driver that drives such devices.

The parallel translation example corpus storage unit 12 and the recording media 21 and 22 may be realized by the same recording medium or by separate recording media. In the former case, the region of the recording medium in which the parallel translation example corpus is stored serves as the parallel translation example corpus storage unit 12.

Next, the operation of the example translation apparatus 1 according to the present embodiment will be described with reference to the flowchart of FIG. 2. The flowchart of FIG. 2 shows the processing performed after example translation is started.

(Step S101) The receiving unit 11 determines whether document information has been received. If document information has been received, the process proceeds to step S102. Otherwise, step S101 is repeated until document information is received.

(Step S102) The selection unit 13 selects, from the parallel translation example corpus, two or more source language examples similar to the document indicated by the document information received by the receiving unit 11. Details of this process will be described later.

(Step S103) The translation candidate information creation unit 14 uses the two or more target language examples corresponding to the two or more source language examples selected by the selection unit 13 to create translation candidate information indicating candidates for the translation of the document information received by the receiving unit 11.

(Step S104) The translation document information creation unit 15 selects a translation candidate that satisfies a predetermined condition from the translation candidates indicated by the translation candidate information created by the translation candidate information creation unit 14, and creates translation document information indicating that candidate. Details of this process will be described later.

(Step S105) The output unit 16 outputs the translation document information created by the translation document information creation unit 15. The operation of the example translation apparatus 1 then ends. When example translation is performed on a plurality of pieces of document information, the example translation apparatus 1 may repeat the processing from step S101 to step S105.

Next, the operation of the example translation apparatus 1 according to the present embodiment will be described using a specific example. In this specific example, the document indicated by the source language document information is "グラスゴーまで寝台の切符をお願いします" (a request for a sleeping car ticket to Glasgow). The parallel translation example corpus stored in the parallel translation example corpus storage unit 12 is the one shown in FIG. 3. In the parallel translation example corpus of FIG. 3, Japanese source language examples relating to travel conversation are associated with English target language examples that are their translations.

First, suppose that the user designates the document information "グラスゴーまで寝台の切符をお願いします" stored on the recording medium 21 and inputs to the example translation apparatus 1 an instruction to translate that document (this input may be made, for example, with a mouse). The receiving unit 11 then reads the document information "グラスゴーまで寝台の切符をお願いします" from the recording medium 21 and passes it to the selection unit 13 and the translation document information creation unit 15 (step S101).

The selection unit 13 selects source language examples (step S102 in the flowchart of FIG. 2) as shown in the flowchart of FIG. 4. The flowchart of FIG. 4 is described first.

(Step S201) The selection unit 13 calculates the similarity between the document indicated by the document information received by the receiving unit 11 and each source language example included in the parallel translation example corpus. As described above, when the source language is a language that is not delimited into words and morphological analysis has not been performed on the document indicated by the source language document information or on the source language examples, a process of segmenting the document indicated by the received document information into words is performed.

(Step S202) Based on the calculated similarities, the selection unit 13 sorts the source language examples in descending order of similarity.

(Step S203) The selection unit 13 sets the counter I to "1".

(Step S204) The selection unit 13 selects the I-th source language example. Here, "I-th" means I-th from the top in the order of similarity produced by the sort in step S202.

(Step S205) The selection unit 13 determines whether the document information has been covered by the source language examples selected so far. That is, the selection unit 13 determines whether the entire document indicated by the document information can be covered by the words common to that document and the source language examples selected so far. Here, the words common to a document A and a document B may be required to match in word order as well, or may be counted regardless of word order. In the former case, if, for example, word a and word b appear in that order in document A and in the order word b, word a in document B, then word a alone or word b alone is a word common to both documents, but word a and word b together are not judged to be common to both documents. In the latter case, under the same circumstances, both word a and word b are words common to the two documents. This specific example assumes the former case. A common word may be an identical word or a similar word. In this specific example, a common word is an identical word. Being able to cover the entire document with a plurality of words means that the document can be produced by arranging those words. If it is determined that the document has been covered, the process ends. If it is determined that the document has not been covered, the process proceeds to step S206.

(Step S206) The selection unit 13 increments the counter I by 1. The process then returns to step S204.

If the document information cannot be covered even after all source language examples have been selected, this may be treated as an error and the example translation process may be terminated, or the processing from step S103 onward may be continued.
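Steps S201 to S206 amount to a greedy loop: rank the examples by similarity, then keep taking the next one until the words they share with the input cover the whole input. The sketch below is a minimal illustration under simplifying assumptions: it uses the order-independent notion of common words (the "latter case" in step S205, not the order-sensitive one adopted in the specific example), exact word matches, and a toy similarity function. The names `select_examples` and `overlap` and the toy corpus are hypothetical.

```python
def select_examples(query, corpus, similarity):
    """Select examples in descending similarity until the query is covered.

    query: list of words; corpus: dict mapping example ID -> list of words.
    Returns (list of selected example IDs, True if the query was covered).
    """
    # S201-S202: score every example and sort in descending order.
    ranked = sorted(corpus, key=lambda ex_id: similarity(query, corpus[ex_id]),
                    reverse=True)
    uncovered = set(query)
    selected = []
    for ex_id in ranked:                 # S203, S204, S206: walk the ranking
        selected.append(ex_id)
        uncovered -= set(corpus[ex_id])  # words now covered by a shared word
        if not uncovered:                # S205: whole query covered
            return selected, True
    return selected, False               # corpus exhausted without coverage


def overlap(a, b):
    """Toy similarity (an assumption): number of distinct shared words."""
    return len(set(a) & set(b))


toy_corpus = {
    1: ["a", "b", "x"],
    2: ["a", "b", "c", "y"],
    3: ["d", "z"],
}
print(select_examples(["a", "b", "c", "d"], toy_corpus, overlap))
```

The second return value distinguishes normal termination from the case where every example has been selected without covering the input, which the text allows to be handled either as an error or by continuing with step S103.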

Returning to the specific processing, the selection unit 13 first reads the source language examples from the parallel translation example corpus shown in FIG. 3, performs morphological analysis on the read source language examples and on the document information "グラスゴーまで寝台の切符をお願いします" received from the receiving unit 11, and calculates the similarity between the document information and each source language example using Expression (1). Let J_1 denote the source language example with example ID "1" (and likewise for the examples with example ID "2" and later), and let J_0 denote the document indicated by the document information. Because the words common to both documents are the six words "まで", "の", "を", "お願い", "し", and "ます", the similarity between J_1 and J_0 is calculated as

[Equation not reproduced in the source text]

Similarly, the similarities for example IDs 2 to 4 are calculated as

[Equation not reproduced in the source text]

(step S201). The similarities for the examples with example ID "5" and later are obtained in the same way.

Suppose that when the selection unit 13 sorts the source language examples in descending order of similarity, the example IDs come out in the order "2", "4", "1", "3", and the examples with example ID "5" and later are ranked fifth or lower (step S202). The selection unit 13 then selects source language examples in order from the first example after sorting until the document information "グラスゴーまで寝台の切符をお願いします" can be covered by the words it has in common with the selected examples (steps S203 to S205). FIG. 5 shows the source language examples with example IDs 1 to 4. In each source language example, the words common to the document information are underlined. As FIG. 5 shows, the document information is covered by example IDs 1 to 4. Therefore, after selecting the example with example ID "3", the selection unit 13 determines that the document information has been covered by the selected source language examples (step S205) and ends the selection of source language examples (step S102). The selection unit 13 then passes the example IDs "1", "2", "3", and "4" of the selected source language examples to the translation candidate information creation unit 14 and the translation document information creation unit 15.

On receiving the example IDs "1", "2", "3", and "4" of the selected source language examples from the selection unit 13, the translation candidate information creation unit 14 reads the target language examples corresponding to those example IDs from the parallel translation example corpus. The translation candidate information creation unit 14 then sets each word that is common among the read target language examples as a single node, called a common node. Specifically, as shown in FIG. 6, words common among the target language examples are grouped together (in FIG. 6, the words enclosed in boxes are the common words). Here, identical words are treated as common words. In FIG. 6, the target language example whose example ID is "i" is denoted "E_i". The multiple alignment technique often used in DNA analysis can be used to find such common words. Because multiple alignment is already known, its description is omitted here. For details of multiple alignment, see Reference 1 below.

(Reference 1) Shigeki Mitaku (美宅成樹) and Minoru Kanehisa (金久實). Human Genome Project and Knowledge Information Processing. Baifukan, 1995.

Next, the translation candidate information creation unit 14 sets each word that is not common among the target language examples as a single node, called a single node. The translation candidate information creation unit 14 then assigns to each node (both common nodes and single nodes) a node ID that uniquely identifies it. For example, "Can" is assigned the node ID "N001", "I" the node ID "N002", "'d" the node ID "N003", and "would" the node ID "N004". The translation candidate information creation unit 14 then creates translation candidate information indicating the adjacency relationships among the common nodes and single nodes (step S103). This translation candidate information indicates, for a given word, the candidates for the preceding word and the candidates for the following word. FIG. 7 shows the translation candidate information. In the translation candidate information, each node ID is associated with the word of the node it identifies, the node IDs of the nodes preceding that node, and the node IDs of the nodes following that node. Here, the node ID "N000" indicates the start node, and the node ID "N999" (not shown in FIG. 7) indicates the end node. For example, for the word "Can" with node ID "N001", the preceding node ID is "N000" and the following node ID is "N002", so the node before "Can" is the start node and the node after it is "I".

FIG. 8 presents the translation candidate information of FIG. 7 as a diagram for ease of understanding. As FIG. 8 shows, the translation candidate information of FIG. 7 gives every possible sequence of words from the start node to the end node. Translation document information is created by selecting one such word sequence from the translation candidate information (step S104).
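The translation candidate information can be pictured as a small directed graph whose paths from the start node to the end node are the translation candidates. The sketch below stores only the following-node IDs for each node and enumerates the paths recursively. The node IDs N000 to N004 and N999 follow the text; node "N005" and most of the edges are hypothetical, since FIG. 7 is not reproduced here, and the graph is assumed to contain no cycles.

```python
START, END = "N000", "N999"

# node ID -> (word, IDs of the nodes that may follow it)
nodes = {
    "N000": (None, ["N001", "N002"]),   # start node
    "N001": ("Can", ["N002"]),
    "N002": ("I", ["N003", "N004"]),
    "N003": ("'d", ["N005"]),
    "N004": ("would", ["N005"]),
    "N005": ("like", ["N999"]),         # hypothetical node for illustration
    "N999": (None, []),                 # end node
}


def enumerate_candidates(nodes, node_id=START, prefix=()):
    """Yield the word sequence of every path from node_id to the end node.

    Assumes the graph is acyclic; a cycle would make the recursion endless.
    """
    word, followers = nodes[node_id]
    if word is not None:
        prefix = prefix + (word,)
    if node_id == END:
        yield list(prefix)
        return
    for nxt in followers:
        yield from enumerate_candidates(nodes, nxt, prefix)


for candidate in enumerate_candidates(nodes):
    print(" ".join(candidate))
```

Each yielded list corresponds to one "J-th translation candidate" examined when the translation document information is created.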

Next, the creation of the translation document information will be described. The translation document information creation unit 15 creates the translation document information (step S104 in the flowchart of FIG. 2) as shown in the flowchart of FIG. 9. The flowchart of FIG. 9 is described first.

(Step S301) The translation document information creation unit 15 sets the counter J to "1".

(Step S302) The translation document information creation unit 15 determines whether the J-th translation candidate indicated by the translation candidate information satisfies Expression (2) above for the example IDs "1", "2", "3", and "4" received from the selection unit 13 (i = 1 to 4 in Expression (2)). Here, Δ_lower and Δ_upper are each set to "0.2", and the similarity is calculated using Expression (1). When calculating the similarity, the translation document information creation unit 15 uses the document information received from the receiving unit 11 and reads the source language examples from the parallel translation example corpus using the example IDs of the selected source language examples received from the selection unit 13. Morphological analysis is performed on the source language examples and on the document indicated by the document information when the similarity is calculated. Alternatively, the results of the morphological analysis that the selection unit 13 performed on the source language examples and on the document indicated by the document information may be used. If Expression (2) is satisfied, the process proceeds to step S303. Otherwise, the process proceeds to step S306.

(Step S303) The translation document information creation unit 15 calculates the product of N-gram probabilities for the J-th translation candidate. Here, the product of 3-gram probabilities is calculated. For example, the translation document information creation unit 15 multiplies together the 3-gram probabilities of successive three-word units, shifting one word at a time from the beginning of the J-th candidate. Because the product of 3-gram probabilities is generally a very small value, the translation document information creation unit 15 may instead calculate "-log(product of 3-gram probabilities)". This log may be the natural logarithm or the common logarithm. In the following description as well, the product of N-gram probabilities may be replaced by "-log" of that product. The product of N-gram probabilities (including its "-log" form) may also be normalized by the number of words, for example by dividing the value by the number of words.
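A minimal sketch of the score in step S303 follows, in its "-log" form with optional normalization by word count. The language model is passed in as a hypothetical callable `trigram_prob(w1, w2, w3)` that returns a probability greater than zero, and the sentence-boundary padding with `<s>` and `</s>` is an assumption, since the text does not say how the first and last words are handled.

```python
import math


def neg_log_trigram_score(words, trigram_prob, normalize=True):
    """Return -log(product of 3-gram probabilities) for a candidate.

    words: list of words in the candidate.
    trigram_prob: hypothetical callable giving P(w3 | w1, w2) > 0.
    normalize: divide the score by the number of words when True.
    """
    padded = ["<s>", "<s>"] + list(words) + ["</s>"]  # assumed padding
    total = 0.0
    # Slide a three-word window one word at a time from the sentence head.
    for i in range(2, len(padded)):
        p = trigram_prob(padded[i - 2], padded[i - 1], padded[i])
        total += -math.log(p)  # summing -log equals -log of the product
    if normalize and words:
        return total / len(words)
    return total


def uniform_model(w1, w2, w3):
    """Toy model (an assumption): every 3-gram has probability 0.5."""
    return 0.5


print(neg_log_trigram_score(["I", "would", "like"], uniform_model))
```

Note that in the "-log" form a smaller value means a more probable candidate, so a comparison that looks for the largest product must look for the smallest score instead.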

(Step S304) The translation document information creation unit 15 compares the product of N-gram probabilities stored so far on a recording medium (not shown) of the translation document information creation unit 15 with the product of N-gram probabilities calculated in step S303. If the product calculated in step S303 is larger than the stored product, the process proceeds to step S305. Otherwise, the process proceeds to step S306.

(Step S305) Information indicating the J-th translation candidate and the product of N-gram probabilities calculated in step S303 are recorded on the above-mentioned recording medium (not shown), overwriting what is stored there. If nothing is stored on the recording medium, they are simply recorded. The recorded information indicating the J-th translation candidate may take any form as long as the document of the J-th translation candidate can be reproduced from it. For example, the information may be "J-th", may be the sequence of node IDs indicating the word sequence of the translation candidate, or may be information indicating the document of the translation candidate itself.

(Step S306) The translation document information creation unit 15 increments the counter J by 1.

(Step S307) The translation document information creation unit 15 refers to the translation candidate information and determines whether a J-th translation candidate exists. If a J-th translation candidate exists, the process returns to step S302. Otherwise, all translation candidates have been examined, and the process proceeds to step S308.

(Step S308) The translation document information creation unit 15 reads the information indicating the translation candidate stored on the above-mentioned recording medium (not shown), creates translation document information based on it, and passes the created translation document information to the output unit 16. If the document of the translation candidate itself is stored on the recording medium, reading that document also counts as creating the translation document information.

The flowchart of FIG. 9 describes the case where the product of N-gram probabilities is calculated only for translation candidates that satisfy Expression (2). Alternatively, the product of N-gram probabilities may be calculated for all translation candidates, and the translation candidate that satisfies Expression (2) and has the largest product of N-gram probabilities may be selected.
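The loop of steps S301 to S308 can be sketched as a filter followed by a running maximum. Expression (2) is not reproduced in this section, so the band check in `within_band` is only an assumed reading: each target-side similarity sim(E_0, E_i) is required to lie between sim(J_0, J_i) - Δ_lower and sim(J_0, J_i) + Δ_upper. The function names, the `similarity` callable, and the `score` callable (standing in for the product of N-gram probabilities, larger being better) are hypothetical.

```python
def within_band(src_sims, tgt_sims, d_lower=0.2, d_upper=0.2):
    """Assumed form of Expression (2), checked for every selected example i.

    src_sims[i]: similarity between the input J_0 and source example J_i.
    tgt_sims[i]: similarity between candidate E_0 and target example E_i.
    """
    return all(s - d_lower <= t <= s + d_upper
               for s, t in zip(src_sims, tgt_sims))


def choose_translation(candidates, src_sims, tgt_examples, similarity, score):
    """Return the best-scoring candidate among those passing the band check."""
    best, best_score = None, None                 # stands in for the stored record
    for cand in candidates:                       # S301, S306, S307
        tgt_sims = [similarity(cand, ex) for ex in tgt_examples]
        if not within_band(src_sims, tgt_sims):   # S302
            continue
        s = score(cand)                           # S303
        if best_score is None or s > best_score:  # S304
            best, best_score = cand, s            # S305
    return best                                   # S308 (None if nothing passed)
```

Dropping the `within_band` test gives the variant in which every candidate is scored, and computing the score before the test gives the variant in which all candidates are scored and then filtered.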

The reason the translation document information creation unit 15 does not use translation candidates that fail to satisfy Expression (2) when creating translation document information is briefly as follows. In general, when there is a document in the source language and a translation of that document in the target language, the similarity between the source language document and a source language example is expected to be relatively close to the similarity between the target language translation and the corresponding target language example. The final translation document information is therefore created from among the translation candidates that satisfy Expression (2). If the English translation of the document information "グラスゴーまで寝台の切符をお願いします" is taken to be "I'd like a sleeping car ticket to Glasgow, please." and this translation is denoted E_0, then

[Equation not reproduced in the source text]

which also suggests that this reasoning is sound.

The output unit 16 stores the translation document information received from the translation document information creation unit 15 on the recording medium 22 (step S105). This completes the series of example translation processes.

In this specific example, the flowchart of FIG. 4 was used to describe the case where the selection of source language examples ends once the document information has been covered by the selected source language examples. However, additional source language examples may be selected even after the document information has been covered. The flowchart of FIG. 4 also describes the case where source language examples are selected in order from the highest similarity, but they may instead be selected at intervals starting from the highest similarity (for example, every second or every third example). Furthermore, the selection unit 13 may read source language examples at random from the parallel translation example corpus, select a read source language example when its similarity to the document information is greater than a predetermined value (for example, 0.3, although the value is not limited to this), and repeat this selection process until the document information can be covered by the selected source language examples.

In this specific example, the common nodes are words that are identical among the target language examples. As described above, however, words that are similar among the target language examples may also be treated as common nodes. For example, "would" and "'d" may be judged to be similar and treated as a common node. Whether a word w_1 and a word w_2 are similar may be determined by the following expression (Expression (3)).

Here, simChar is a function for surface-form agreement (character match) and simPos is a function for part-of-speech agreement. Each takes the value "1.0" when there is a match and "0.0" when there is not. simSem is the semantic similarity and takes a value from "0.0" to "1.0", with a larger value indicating higher similarity. When the value of simWord is equal to or greater than a predetermined value (for example, 0.5, although the value is not limited to this), the translation candidate information creation unit 14 determines that the two words are similar. The specific example above corresponds to the case where λ_2 = λ_3 = 0 in Expression (3) and the predetermined value is "1.0". Because simChar, simPos, and simSem are already known, their description is omitted.
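The word-similarity test can be sketched as a weighted sum of the three component scores. Expression (3) is not reproduced in this section, so the linear form λ_1·simChar + λ_2·simPos + λ_3·simSem used below is an assumption inferred from the remark that setting λ_2 and λ_3 to 0 with a threshold of 1.0 reduces to exact matching. The part-of-speech table `pos` and the semantic similarity callable `sim_sem` are hypothetical inputs.

```python
def sim_word(w1, w2, pos, sim_sem, lambdas=(1.0, 0.0, 0.0)):
    """Assumed form of Expression (3): a weighted sum of three similarities.

    pos: dict mapping a word to its part-of-speech tag (hypothetical input).
    sim_sem: callable returning a semantic similarity in [0.0, 1.0].
    """
    l1, l2, l3 = lambdas
    sim_char = 1.0 if w1 == w2 else 0.0            # surface-form match
    sim_pos = 1.0 if pos[w1] == pos[w2] else 0.0   # part-of-speech match
    return l1 * sim_char + l2 * sim_pos + l3 * sim_sem(w1, w2)


def is_similar(w1, w2, pos, sim_sem, lambdas=(1.0, 0.0, 0.0), threshold=1.0):
    """True when simWord reaches the predetermined value."""
    return sim_word(w1, w2, pos, sim_sem, lambdas) >= threshold


# Toy inputs for illustration only.
toy_pos = {"would": "MD", "'d": "MD", "like": "VB"}


def toy_sem(a, b):
    """Toy semantic similarity: 1.0 for identical words or the pair would/'d."""
    return 1.0 if a == b or {a, b} == {"would", "'d"} else 0.0


print(is_similar("would", "'d", toy_pos, toy_sem))                         # exact matching only
print(is_similar("would", "'d", toy_pos, toy_sem, (0.2, 0.3, 0.5), 0.5))   # relaxed matching
```

With the default weights and threshold the function reproduces the exact-match behaviour of the specific example, while the relaxed setting lets "would" and "'d" share a common node.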

In this specific example, as shown in FIG. 7, the translation candidate information indicates the nodes both before and after each word. However, the translation candidate information may indicate only the preceding nodes or only the following nodes, because the word order in the translation candidates can still be determined in that case.

In this specific example, the translation document information is created using translation candidates that satisfy Expression (2). This is only an example, and the translation document information creation unit 15 need not determine whether a translation candidate satisfies Expression (2). In that case, for example, step S302 in the flowchart of FIG. 9 may be omitted and the process may proceed from step S301 directly to step S303.

In this specific example, the translation candidates were created with each comma, period, and question mark treated as one word. However, commas, periods, question marks, exclamation marks, and other punctuation marks may or may not be treated as single words.

In this specific example, the source language is Japanese and the target language is English. Needless to say, as long as the source language and the target language are different languages, both may be any language and are not limited to Japanese and English.

In this specific example, Δ_lower and Δ_upper have the same value, but needless to say, the two may have different values.

As described above, the example translation apparatus 1 according to the present embodiment can translate from the source language into the target language by using the parallel translation example corpus, without using word alignment or phrase alignment.

In creating the translation candidate information, the translation candidate information creation unit 14 may further expand the translation candidate information created as described above. For example, when the 2-gram probability of a two-word sequence linked by an adjacency relation other than those between the nodes already indicated by the translation candidate information is equal to or greater than a threshold, the translation candidate information creation unit 14 may add the adjacency relation representing that two-word sequence to the translation candidate information. Specifically, for the translation candidates shown in FIG. 8, consider an adjacency relation between arbitrary nodes that is not included in FIG. 8, for example the relation between the node "like" and the node "seat". If the 2-gram probability of the two-word sequence "like-seat" is equal to or greater than the threshold "0.1", the adjacency relation representing that sequence may be added to the translation candidate information. The threshold may be set to any appropriate value. Instead of evaluating adjacency relations between arbitrary nodes, a predetermined restriction may be imposed. For example, the 2-gram probability may be compared with the threshold as described above only for adjacency relations between a given node (referred to as node A) and a node that can be reached from node A by following the adjacency relations included in the translation candidate information, that comes after node A, and that lies within K nodes of node A (K being a predetermined integer of 2 or more); other adjacency relations need not be added to the translation candidate information. K may be set to, for example, K = 3. FIG. 10 shows an example of translation candidate information to which adjacency relations representing two-word sequences have been added in this way. The adjacency relations drawn with broken lines are the added ones.
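The expansion described above can be sketched roughly as follows. This is a minimal illustration, not the implementation of the translation candidate information creation unit 14: the function name `expand_adjacency`, the representation of adjacency relations as `(word, word)` pairs, the dictionary of 2-gram probabilities, and the reading of "within K nodes" as "reachable in at most K steps" are all assumptions made for the example.

```python
from collections import deque


def expand_adjacency(edges, bigram_prob, threshold=0.1, k=3):
    """Return the original edges plus added edges (a, b).

    An edge (a, b) is added when b is reachable from a in 2..k steps over
    the existing edges and the 2-gram probability of "a b" is >= threshold.
    Nodes are identified by their word (an assumption of this sketch).
    """
    edge_set = set(edges)
    successors = {}
    for a, b in edge_set:
        successors.setdefault(a, set()).add(b)

    added = set()
    for a in successors:
        # Breadth-first search from node A, limited to k steps.
        dist = {a: 0}
        queue = deque([a])
        while queue:
            node = queue.popleft()
            if dist[node] == k:
                continue  # do not look beyond K nodes from node A
            for nxt in successors.get(node, ()):
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        for b, d in dist.items():
            if d < 2 or (a, b) in edge_set:
                continue  # skip node A itself and already existing relations
            if bigram_prob.get((a, b), 0.0) >= threshold:
                added.add((a, b))
    return edge_set | added
```

With the threshold at 0.1 and K = 3, a pair such as ("like", "seat") is added only if its 2-gram probability is at least 0.1 and "seat" lies within three steps after "like" in the existing candidate graph.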

The parallel translation example corpus may also include component alignments, that is, alignments between the components that make up a sentence. In that case, the translation document information creation unit 15 may use the component alignments when creating the translation document information. Here, a component of a sentence is, for example, a word or a phrase. Accordingly, a component alignment may be, for example, a word alignment, a phrase alignment, or both.

When the parallel translation example corpus includes component alignments, the translation document information creation unit 15 may select a translation candidate satisfying the predetermined condition based on whether a component alignment exists between the document indicated by the document information received by the reception unit 11 and the translation candidate document. For example, the translation document information creation unit 15 may assign a score to a component included in a translation candidate depending on whether such an alignment exists, and select a translation candidate for which the product of N-gram probabilities and the score satisfy the predetermined condition. More specifically, suppose that a word alignment included in the parallel translation example corpus aligns the Japanese word 「グラスゴー」 with "Glasgow". The translation document information creation unit 15 may then assign a score of "2.0" to the component (in this case, a word) "Glasgow" included in the translation candidates and, when calculating the product of N-gram probabilities for a translation candidate containing the node "Glasgow" (for example, in step S303 of FIG. 9), also multiply by the score "2.0" assigned to "Glasgow" and compare the resulting product with the products of N-gram probabilities of the other candidates. This raises the probability that a translation candidate containing aligned components is selected by the translation document information creation unit 15 and ultimately becomes the translation document information. In the example above, a translation candidate containing "Glasgow" yields a larger product than translation candidates containing "London", "San Francisco", and so on, which makes it more likely that "Glasgow", the English translation of 「グラスゴー」, appears in the translated document. Alternatively, the translation document information creation unit 15 may calculate the product of N-gram probabilities for all translation candidates, add a score to that product for each aligned component contained in a candidate, and create the translation document information by selecting the candidate with the maximum resulting value. Needless to say, the score assigned to an aligned component may be set to any value. Furthermore, instead of assigning scores to aligned components, the candidate with the largest product of N-gram probabilities may be selected from among the translation candidates that contain aligned components.
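The multiplicative scoring variant described above can be illustrated with a short sketch. It is only an assumption-laden example: it uses N = 2, represents a translation candidate as a list of words, and uses hypothetical names (`candidate_score`, `select_best`) and a hypothetical floor probability for unseen 2-grams, none of which appear in the source.

```python
def candidate_score(words, bigram_prob, aligned_words, bonus=2.0, floor=1e-6):
    """Product of 2-gram probabilities, multiplied by `bonus` once for each
    word in the candidate that has a component alignment with the input."""
    score = 1.0
    for prev, cur in zip(words, words[1:]):
        # `floor` is a stand-in probability for unseen 2-grams (assumption).
        score *= bigram_prob.get((prev, cur), floor)
    for word in words:
        if word in aligned_words:
            score *= bonus
    return score


def select_best(candidates, bigram_prob, aligned_words, bonus=2.0):
    """Return the candidate with the largest boosted product."""
    return max(
        candidates,
        key=lambda c: candidate_score(c, bigram_prob, aligned_words, bonus),
    )
```

For instance, if the candidates ending in "London" and "Glasgow" have 2-gram products of 0.3 and 0.2 respectively, a bonus of 2.0 on the aligned word "Glasgow" lifts its score to 0.4, so the "Glasgow" candidate is selected. The additive variant mentioned in the text would replace the multiplication by `bonus` with an addition.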

In the present embodiment, the case where the similarity between documents is calculated using DP matching has been described. However, the similarity between documents may instead be calculated using an edit distance other than DP matching.
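As one possible illustration of an edit-distance-based similarity, the sketch below computes a word-level Levenshtein distance and normalizes it by the length of the longer document. The source does not specify which distance or normalization is intended, so both the choice of Levenshtein distance and the function names are assumptions.

```python
def word_edit_distance(a, b):
    """Word-level Levenshtein distance between two word lists."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, 1):
        cur = [i]
        for j, word_b in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                       # delete word_a
                cur[j - 1] + 1,                    # insert word_b
                prev[j - 1] + (word_a != word_b),  # substitute or match
            ))
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Similarity in [0, 1]; 1.0 means the two word lists are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - word_edit_distance(a, b) / longest
```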

In the above embodiment, each process or each function may be realized by centralized processing on a single apparatus or a single system, or by distributed processing across a plurality of apparatuses or a plurality of systems. The example translation apparatus 1 may be a stand-alone apparatus, or may be a server in a server-client system.

In the above embodiment, each component may be configured by dedicated hardware, or a component that can be realized by software may be realized by executing a program. For example, each component can be realized by a program execution unit such as a CPU reading and executing a software program recorded on a recording medium such as a hard disk or a semiconductor memory. The software that realizes the example translation apparatus in the above embodiment is the following program. That is, this program causes a computer to execute: a reception step of receiving document information, which is information indicating a document in the source language; a selection step of selecting two or more source language examples similar to the document indicated by the document information received in the reception step from the plurality of source language examples of a parallel translation example corpus stored in a parallel translation example corpus storage unit, the parallel translation example corpus being information that associates a plurality of source language examples, which are examples in the source language, with a plurality of target language examples, which are translations of the source language examples in a target language different from the source language; a translation candidate information creation step of creating translation candidate information, which is information indicating translation candidates for the document information received in the reception step, using two or more target language examples corresponding to the two or more source language examples selected in the selection step; a translation document information creation step of selecting a translation candidate satisfying a predetermined condition from the translation candidate information created in the translation candidate creation step, and creating translation document information, which is information indicating the selected translation candidate; and an output step of outputting the translation document information created in the translation document information creation step.

In the above program, the reception step of receiving information, the output step of outputting information, and the like do not include, at a minimum, processing that can be performed only by hardware, for example, processing performed by a modem or an interface card in the output step.

This program may be executed by being downloaded from a server or the like, or by reading out a program recorded on a predetermined recording medium (for example, an optical disc such as a CD-ROM, a magnetic disk, or a semiconductor memory).

The program may be executed by a single computer or by a plurality of computers. That is, either centralized processing or distributed processing may be performed.

FIG. 11 is a schematic diagram showing an example of the external appearance of a computer that executes the above program to realize the example translation apparatus 1 according to the above embodiment. The above embodiment can be realized by computer hardware and a computer program executed on it.

In FIG. 11, a computer system 100 includes a computer 101 containing a CD-ROM (Compact Disk Read Only Memory) drive 105 and an FD (Flexible Disk) drive 106, together with a keyboard 102, a mouse 103, and a monitor 104.

FIG. 12 is a diagram showing the computer system. In FIG. 12, in addition to the CD-ROM drive 105 and the FD drive 106, the computer 101 includes a CPU (Central Processing Unit) 111; a ROM (Read Only Memory) 112 for storing programs such as a boot-up program; a RAM (Random Access Memory) 113 that is connected to the CPU 111, temporarily stores the instructions of application programs, and provides a temporary storage space; a hard disk 114 that stores application programs, system programs, and data; and a bus 115 that interconnects the CPU 111, the ROM 112, and so on. The computer 101 may also include a network card (not shown) that provides a connection to a LAN.

A program that causes the computer system 100 to execute the functions of the example translation apparatus 1 according to the above embodiment may be stored on a CD-ROM 121 or an FD 122, which is inserted into the CD-ROM drive 105 or the FD drive 106, and then transferred to the hard disk 114. Alternatively, the program may be transmitted to the computer 101 via a network (not shown) and stored on the hard disk 114. The program is loaded into the RAM 113 at the time of execution. The program may also be loaded directly from the CD-ROM 121, the FD 122, or the network.

The program need not necessarily include an operating system (OS), a third-party program, or the like that causes the computer 101 to execute the functions of the example translation apparatus 1 according to the above embodiment. The program may include only those instructions that call appropriate functions (modules) in a controlled manner so that the desired results are obtained. How the computer system 100 operates is well known, and a detailed description is omitted.

It goes without saying that the present invention is not limited to the above embodiment, that various modifications are possible, and that such modifications are also included within the scope of the present invention.

As described above, the example translation apparatus and the like according to the present invention can perform translation using a parallel translation example corpus, and are useful as an apparatus or the like for performing machine translation.

FIG. 1 is a block diagram showing the configuration of the example translation apparatus according to Embodiment 1 of the present invention.
FIG. 2 is a flowchart showing the operation of the example translation apparatus according to the embodiment.
FIG. 3 is a diagram showing an example of the parallel translation example corpus in the embodiment.
FIG. 4 is a flowchart showing an example of the process of selecting source language examples in the embodiment.
FIG. 5 is a diagram showing an example of the selected source language examples in the embodiment.
FIG. 6 is a diagram showing an example of multiple alignment in the embodiment.
FIG. 7 is a diagram showing an example of the translation candidate information in the embodiment.
FIG. 8 is a diagram showing an example of the arrangement of word nodes indicated by the translation candidate information in the embodiment.
FIG. 9 is a flowchart showing an example of the process of creating the translation document information in the embodiment.
FIG. 10 is a diagram showing an example of the arrangement of word nodes indicated by the translation candidate information to which word adjacency relations have been added in the embodiment.
FIG. 11 is a schematic diagram showing an example of the external appearance of the computer system in the embodiment.
FIG. 12 is a diagram showing an example of the configuration of the computer system in the embodiment.

Explanation of symbols

1 Example translation apparatus
11 Reception unit
12 Parallel translation example corpus storage unit
13 Selection unit
14 Translation candidate information creation unit
15 Translation document information creation unit
16 Output unit
21, 22 Recording medium

Claims

An example translation apparatus comprising:
a reception unit that receives document information, which is information indicating a document in a source language;
a parallel translation example corpus storage unit that stores a parallel translation example corpus, which is information associating a plurality of source language examples, which are examples in the source language, with a plurality of target language examples, which are translations of the source language examples in a target language different from the source language;
a selection unit that selects, from the plurality of source language examples of the parallel translation example corpus, two or more source language examples similar to the document indicated by the document information received by the reception unit;
a translation candidate information creation unit that creates translation candidate information, which is information indicating translation candidates for the document information received by the reception unit, using two or more target language examples corresponding to the two or more source language examples selected by the selection unit;
a translation document information creation unit that selects a translation candidate satisfying a predetermined condition from the translation candidate information created by the translation candidate information creation unit, and creates translation document information, which is information indicating the selected translation candidate; and
an output unit that outputs the translation document information created by the translation document information creation unit.

The example translation apparatus according to claim 1, wherein the selection unit
calculates the degree of similarity between each of the plurality of source language examples and the document indicated by the document information received by the reception unit, and selects source language examples having a high degree of similarity such that the words common to the two or more selected source language examples and the document indicated by the document information cover the entire document indicated by the document information.

The example translation apparatus according to claim 2, wherein the selection unit
uses the ratio of words common to two documents as the degree of similarity between the two documents.

The example translation apparatus according to any one of claims 1 to 3, wherein the translation candidate information creation unit
acquires, from the parallel translation example corpus, the two or more target language examples corresponding to the two or more source language examples selected by the selection unit, treats each word common to the acquired two or more target language examples as one common node and each other word in the acquired two or more target language examples as one single node, and creates translation candidate information indicating the adjacency relations among the common nodes and the single nodes.

The example translation apparatus according to claim 4, wherein the translation candidate information creation unit,
when the 2-gram probability of a two-word sequence linked by an adjacency relation other than the adjacency relations between the nodes indicated by the translation candidate information is equal to or greater than a threshold, adds the adjacency relation representing that two-word sequence to the translation candidate information.

The example translation apparatus according to claim 5, wherein the translation document information creation unit
calculates the product of N-gram probabilities for each translation candidate indicated by the translation document information, and selects the translation candidate having the maximum product of N-gram probabilities as the translation candidate satisfying the predetermined condition.

The example translation apparatus, wherein the parallel translation example corpus also includes component alignments, which are alignments between the components that make up a sentence, and
the translation document information creation unit
selects the translation candidate satisfying the predetermined condition based on whether a component alignment exists between the document indicated by the document information received by the reception unit and the translation candidate document.

An example translation method performed using a reception unit, a selection unit, a translation candidate information creation unit, a translation document information creation unit, an output unit, and a parallel translation example corpus storage unit that stores a parallel translation example corpus, which is information associating a plurality of source language examples, which are examples in a source language, with a plurality of target language examples, which are translations of the source language examples in a target language different from the source language, the method comprising:
a reception step in which the reception unit receives document information, which is information indicating a document in the source language;
a selection step in which the selection unit selects, from the plurality of source language examples of the parallel translation example corpus, two or more source language examples similar to the document indicated by the document information received in the reception step;
a translation candidate information creation step in which the translation candidate information creation unit creates translation candidate information, which is information indicating translation candidates for the document information received in the reception step, using two or more target language examples corresponding to the two or more source language examples selected in the selection step;
a translation document information creation step in which the translation document information creation unit selects a translation candidate satisfying a predetermined condition from the translation candidate information created in the translation candidate creation step, and creates translation document information, which is information indicating the selected translation candidate; and
an output step in which the output unit outputs the translation document information created in the translation document information creation step.

A program for causing a computer to execute:
a reception step of receiving document information, which is information indicating a document in a source language;
a selection step of selecting two or more source language examples similar to the document indicated by the document information received in the reception step from the plurality of source language examples of a parallel translation example corpus stored in a parallel translation example corpus storage unit, the parallel translation example corpus being information that associates a plurality of source language examples, which are examples in the source language, with a plurality of target language examples, which are translations of the source language examples in a target language different from the source language;
a translation candidate information creation step of creating translation candidate information, which is information indicating translation candidates for the document information received in the reception step, using two or more target language examples corresponding to the two or more source language examples selected in the selection step;
a translation document information creation step of selecting a translation candidate satisfying a predetermined condition from the translation candidate information created in the translation candidate creation step, and creating translation document information, which is information indicating the selected translation candidate; and
an output step of outputting the translation document information created in the translation document information creation step.