WO2019021804A1 - Information processing device, information processing method, and program - Google Patents
Information processing device, information processing method, and program
- Publication number
- WO2019021804A1 (PCT/JP2018/025959)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- sentence
- corpus
- determination
- unit
- ind
- Prior art date
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/10—Speech classification or search using distance or distortion measures between unknown speech and reference templates
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
Definitions
- The present disclosure relates to an information processing apparatus, an information processing method, and a program, and in particular to an information processing apparatus, an information processing method, and a program that can reduce the development cost of a corpus that contributes to improving the accuracy of a semantic analyzer.
- A speech dialogue system converts utterance content into text data, analyzes the text data semantically, and recognizes the utterance content.
- In order to recognize the utterance content, a semantic analyzer is used that analyzes and recognizes the utterance content through machine learning using a corpus (a collection of example sentences).
- The semantic analyzer analyzes and recognizes the utterance content by machine learning using a corpus of the utterance content to be handled for each application program.
- Multi-domain speech dialogue systems are widely used so that a single system can handle multiple topics, tasks, and application programs, such as weather inquiry, schedule confirmation, and music playback.
- This architecture is a system composed of semantic analysis systems for a plurality of domains and a domain selector (frame estimator) that integrates them (Non-Patent Document 1).
- In addition, it is necessary to prevent the semantic analysis processing from transitioning to an incorrect domain when an unexpected utterance (Out of Domain utterance: hereinafter also referred to as OOD utterance) is received. For that purpose, it is ideal to prepare an OOD corpus and retrain the frame estimator, but since development of an OOD corpus requires many steps, various methods have been discussed, such as estimating using the dialogue history (see Non-Patent Document 2).
- However, since the corpora of Non-Patent Documents 1 and 2 are created manually, and more corpora are needed to recognize the utterance content of various application programs, the burden of corpus creation weighs heavily on the development cost of semantic analyzers.
- The present disclosure has been made in view of such circumstances, and in particular reduces the development cost of the semantic analyzer by enabling efficient development of the corpora required for learning.
- An information processing apparatus according to one aspect of the present disclosure includes: a structure analysis unit that analyzes a structure of an input sentence; a replacement location setting unit that sets a replacement location in the input sentence based on the analysis result of the structure analysis unit; and a corpus generation unit that generates a corpus by replacing the word at the replacement location in the input sentence.
- The input sentence may be an IND (In Domain) determination sentence, which is utterance content to be handled by a predetermined application program.
- The structure analysis unit may analyze a predicate term structure of the input sentence, and the replacement location setting unit may set a replacement location in the input sentence based on the predicate term structure that is the analysis result of the structure analysis unit.
- A dictionary inquiry unit may further be included that queries a dictionary to search for candidate words for replacing the word at the replacement location in the input sentence, and the corpus generation unit may replace the word at the replacement location in the input sentence with a word searched by the dictionary inquiry unit.
- The dictionary may be a case frame dictionary.
- The replacement location setting unit may set a replacement location in the input sentence and a replacement method for the replacement location based on the predicate term structure that is the analysis result of the structure analysis unit, and the corpus generation unit may generate a corpus by replacing the word at the replacement location in the input sentence according to the replacement method.
- The replacement method may include a first method in which the predicate of the input sentence is fixed and a noun that is a predicate term, including the target case, is replaced, and a second method in which a predicate term of the input sentence, including the target case, is fixed and the predicate is replaced.
- A classification unit may further be included that classifies the corpus generated by the corpus generation unit into either an IND (In Domain) determination sentence, which is utterance content to be handled by a predetermined application program, or an OOD (Out of Domain) determination sentence, which is unexpected utterance content that should not be handled by the predetermined application program.
- A COOD (Close OOD) determination sentence extraction unit may further be included that extracts from the corpus, as COOD determination sentences, corpora that are OOD determination sentences and exist near the boundary with the IND determination sentences in a feature space represented by respective features.
- In a domain containing the corpora classified as IND determination sentences, the COOD determination sentence extraction unit may extract from the domain, as COOD determination sentences, corpora containing more than a predetermined number of words that are not included in the other corpora.
- The COOD determination sentence extraction unit may extract from the domain, as COOD determination sentences, corpora whose non-appearance, represented by the ratio of the number of words not included in the other corpora to the number of words included in the corpora of the domain, is higher than a predetermined value.
- The COOD determination sentence extraction unit may extract from the domain, as COOD determination sentences, corpora in which the non-appearance of words, represented by TF-IDF values composed of TF (Term Frequency) and IDF (Inverse Document Frequency) values in the corpora of the domain, is higher than a predetermined value.
- The COOD determination sentence extraction unit may calculate a Perplexity value for the corpora of the domain and discard, as non-sentences, sentences whose Perplexity value is higher than a predetermined value.
- A CIND (Close IND) determination sentence extraction unit may further be included that extracts from the corpus, as CIND determination sentences, corpora that are IND determination sentences and exist near the boundary with the OOD determination sentences in a feature space represented by respective features.
- In a domain containing the corpora classified as OOD determination sentences, the CIND determination sentence extraction unit may extract, as CIND determination candidate sentences, from all the corpora classified as OOD determination sentences, corpora containing more than a predetermined number of words that are included in the IND corpora.
- The CIND determination sentence extraction unit may extract, as CIND determination candidate sentences, corpora whose non-appearance, represented by the ratio of the number of words not included in the IND corpora to the number of words included in the corpora of the domain, is lower than a predetermined value.
- The CIND determination sentence extraction unit may extract, as CIND determination sentences, corpora in which the non-appearance of words, represented by TF-IDF values composed of TF (Term Frequency) and IDF (Inverse Document Frequency) values in the corpora of the domain, is lower than a predetermined value.
- The CIND determination sentence extraction unit may calculate a Perplexity value for the corpora of the domain and discard, as non-sentences, sentences whose Perplexity value is higher than a predetermined value.
- An information processing method according to one aspect of the present disclosure includes the steps of analyzing a structure of an input sentence, setting a replacement location in the input sentence based on the analysis result of the structure, and generating a corpus by replacing the word at the replacement location in the input sentence.
- A program according to one aspect of the present disclosure causes a computer to function as: a structure analysis unit that analyzes a structure of an input sentence; a replacement location setting unit that sets a replacement location in the input sentence based on the analysis result of the structure analysis unit; and a corpus generation unit that generates a corpus by replacing the word at the replacement location in the input sentence.
- In one aspect of the present disclosure, the structure of an input sentence is analyzed, a replacement location in the input sentence is set based on the analysis result of the structure, and a corpus is generated by replacing the word at the replacement location in the input sentence.
- FIG. 17 is a diagram explaining an example of a word group for replacing the target case, using the predicate term structure analysis result of FIG. 16, when the replacement method in English is the Action-fixed Category replacement method.
- FIG. 18 is a diagram explaining an example of a word group for replacing the predicate, using the predicate term structure analysis result of FIG. 16, when the replacement method in English is the Category-fixed Action replacement method. It is a diagram explaining an example of the replaced OOD candidate sentences in English. It is a diagram explaining the analysis results of deep case analysis and surface case analysis. It is a flowchart explaining the filtering process. It is a flowchart explaining the COOD corpus extraction process.
- The semantic analysis system recognizes the user's speech and causes the corresponding application program to execute an action.
- The semantic analysis system includes, for example, as shown in FIG. 1, an input reception unit 11, a speech recognition unit 12, a frame estimation unit 13, semantic analysis units 14-1 to 14-3 for the respective domains, and application programs 18 to 20.
- The input reception unit 11 receives the user's utterance as an input of a speech signal and outputs the speech signal to the speech recognition unit 12.
- The speech recognition unit 12 recognizes the speech signal, converts it into a text character string, and outputs the text string to the frame estimation unit 13.
- The frame estimation unit 13 transitions the processing to the semantic analysis unit 14-1 to 14-3 of the optimum application (domain) in the subsequent stage based on the text string. The frame estimation unit 13 also rejects text strings that do not belong to any domain.
- The semantic analysis units 14-1 to 14-3 analyze attributes and their corresponding values based on the text string, and supply analysis results 15 to 17 to the weather guidance application program 18, the schedule confirmation application program 19, or the music playback application program 20, respectively, which are the targets of the action.
- In the following, the semantic analysis units 14-1 to 14-3 are simply referred to as the semantic analysis unit 14 unless it is necessary to distinguish them.
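- The following is a minimal sketch, under assumed names, of the multi-domain architecture just described: a frame estimator routes recognized text to the semantic analysis unit of the most likely domain and rejects text that belongs to no domain. It is illustrative only and not the patent's implementation.

```python
# Illustrative sketch of the Fig. 1 architecture (all names are assumptions).
from typing import Callable, Dict, Optional


class FrameEstimator:
    def __init__(self, domain_scorers: Dict[str, Callable[[str], float]],
                 reject_threshold: float = 0.5):
        # domain_scorers maps a domain name (e.g. "weather", "schedule", "music")
        # to a function returning a confidence score for the input text string.
        self.domain_scorers = domain_scorers
        self.reject_threshold = reject_threshold

    def estimate(self, text: str) -> Optional[str]:
        domain, score = max(((d, f(text)) for d, f in self.domain_scorers.items()),
                            key=lambda x: x[1])
        # Text strings that do not belong to any domain (OOD) are rejected.
        return domain if score >= self.reject_threshold else None


def route(text: str, estimator: FrameEstimator,
          analyzers: Dict[str, Callable[[str], dict]]) -> Optional[dict]:
    domain = estimator.estimate(text)
    if domain is None:
        return None                      # rejected as an unexpected (OOD) utterance
    # The domain-specific semantic analyzer extracts attribute/value pairs
    # (e.g. place="Tokyo", date="tomorrow") for the target application program.
    return analyzers[domain](text)
```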
- The frame estimation unit 13 transitions the processing to the semantic analysis unit 14-1 of the weather guidance domain.
- The semantic analysis unit 14-1 analyzes the weather-related utterance based on this text string and supplies, to the application program 18, an analysis result 15 including the information that the place (where) is "Tokyo" and the time (when) is "tomorrow".
- The weather guidance application program 18 displays tomorrow's weather guidance for Tokyo based on the analysis result 15.
- Similarly, the frame estimation unit 13 transitions the processing to the semantic analysis unit 14-2 of the schedule confirmation domain.
- The semantic analysis unit 14-2 analyzes the utterance about schedule confirmation based on the text string and supplies, to the application program 19, an analysis result 16 including, for example, that the utterance input is an utterance for the schedule confirmation application program 19, that the information indicating the date is "today", and that the information indicating the time is "15 o'clock".
- The schedule confirmation application program 19 displays the schedule confirmation for 15 o'clock today based on the analysis result 16.
- Further, the speech recognition unit 12 recognizes the utterance as the text string "Play a new song by Higashino" and supplies it to the frame estimation unit 13.
- The frame estimation unit 13 transitions the processing to the semantic analysis unit 14-3 of the music playback domain.
- The semantic analysis unit 14-3 analyzes the utterance related to music playback based on this text string and supplies, to the application program 20, an analysis result 17 including, for example, that the utterance input is an utterance for the music playback application program 20, that the information indicating the artist is "Higashino", and that the information indicating the music is "new song".
- The music playback application program 20 plays a new song by Higashino based on the analysis result 17.
- The semantic analysis unit 14 performs machine learning using a corpus, which is a collection of example sentences, in order to analyze information composed of text strings.
- Corpora are mainly divided into corpora (also referred to as IND corpora) consisting of utterance content to be handled by the application program (In Domain utterance: hereinafter IND) and corpora (also referred to as OOD corpora) consisting of utterance content that cannot be handled by the application program (Out of Domain utterance: hereinafter OOD).
- By learning the IND corpus, the semantic analysis unit 14 can analyze and recognize the utterance content to be handled by the application program.
- By learning the IND corpus and the OOD corpus, the frame estimation unit 13 can transition the processing to the semantic analysis unit 14 of the correct domain and can reject unexpected utterances. That is, the frame estimation unit 13 learns the utterances to be handled and the utterances to be rejected, and can appropriately recognize the utterance content.
- However, for the frame estimation unit 13, an OOD corpus must be prepared for each domain, so that, for example, roughly twice the total number of corpora needs to be prepared.
- Because the recognition accuracy of the frame estimation unit 13 and the semantic analysis unit 14 is limited, the corpora determined to be IND corpora may partly include corpora that should be regarded as OOD corpora.
- Since such an OOD sentence was determined to be an IND determination sentence due to an erroneous determination, it can be considered a corpus close to the IND determination sentences.
- In the feature space of FIG. 2, the black circles represent the distribution of corpora regarded as IND determination sentences, and the crosses represent the distribution of corpora regarded as OOD determination sentences.
- Among the corpora regarded as OOD determination sentences, a corpus existing in the vicinity of the corpora regarded as IND determination sentences can be considered a corpus similar to the IND determination sentences.
- Such a corpus existing in the vicinity of the distribution of corpora regarded as IND determination sentences is referred to as a COOD (Close Out of Domain) determination sentence.
- Although the COOD determination sentence is an OOD determination sentence, it is a corpus similar to the IND determination sentences; in other words, it can be considered a corpus that is not typical of OOD determination sentences. Furthermore, the COOD determination sentence is a highly misleading expression that is easily mistaken for an IND determination sentence, and is therefore a corpus likely to cause erroneous determination.
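- One way to picture the feature space of FIG. 2 is sketched below: OOD-judged sentences lying close to IND-judged sentences are treated as COOD candidates. This uses TF-IDF vectors and cosine similarity purely for illustration; the patent's actual criteria (word non-appearance and Perplexity) are described later.

```python
# Illustrative only: nearness in a simple feature space, not the patent's criteria.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

ind_sentences = ["set an alarm at 7 o'clock", "wake me up at seven"]
ood_sentences = ["set a bomb at 7 o'clock", "what is the capital of France"]

vectorizer = TfidfVectorizer().fit(ind_sentences + ood_sentences)
ind_vecs = vectorizer.transform(ind_sentences)
ood_vecs = vectorizer.transform(ood_sentences)

# For each OOD-judged sentence, similarity to its nearest IND-judged sentence;
# a high value means it sits near the IND/OOD boundary, i.e. a COOD candidate.
nearest = cosine_similarity(ood_vecs, ind_vecs).max(axis=1)
cood_candidates = [s for s, sim in zip(ood_sentences, nearest) if sim > 0.5]
print(cood_candidates)   # likely ["set a bomb at 7 o'clock"]
```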
- The corpus generation device of the present disclosure makes it possible to easily generate, in large quantities, a corpus including the COOD corpus that is particularly effective for improving recognition accuracy, and to reduce the load associated with corpus development.
- That is, the corpus generation device of the present disclosure efficiently generates the corpora required for learning by the configurations corresponding to the frame estimation unit 13 and the semantic analysis unit 14 in FIG. 1, and thereby reduces the development cost of the frame estimation unit 13 and the semantic analysis unit 14.
- FIG. 3 shows a configuration example of an embodiment of the corpus generation device of the present disclosure.
- The corpus generation device 51 receives, as input sentences, a corpus consisting of IND sentences generated manually or by some other method, generates a corpus consisting of substitution generated sentences by replacing words through language analysis and the like, and classifies the generated corpus into corpora consisting of COOD determination sentences, OOD determination sentences, IND determination sentences, and CIND determination sentences by filtering processing.
- A CIND (Close IND) determination sentence is, among the corpora classified as OOD determination sentences, a corpus similar to the IND determination sentences. By machine learning of the corpora of CIND determination sentences, the recognition accuracy of the frame estimation unit 13 and the semantic analysis unit 14 can be improved.
- The corpus generation device 51 includes an IND sentence reception unit 101, a language analysis unit 102, a replacement location setting unit 103, a dictionary inquiry unit 104, a replacement execution unit 105, a duplicate sentence exclusion unit 106, a case frame dictionary 107, a generation condition setting data storage unit 108, a substitution generated sentence storage unit 109, and a filtering processing unit 110.
- The IND sentence reception unit 101 receives an input of IND sentences generated manually or by some other method, and outputs them to the language analysis unit 102.
- The language analysis unit 102 analyzes the morphemes, phrases, and predicate term structure of each of the IND sentences output from the IND sentence reception unit 101, and outputs the analysis result to the replacement location setting unit 103.
- The replacement location setting unit 103 sets a replacement condition based on the analysis result of the predicate term structure of the IND sentence, and outputs the replacement condition to the dictionary inquiry unit 104.
- The dictionary inquiry unit 104 queries the case frame dictionary 107 to search, according to the set replacement method, for replacement candidates for the word at the replacement location set based on the replacement condition in the IND determination sentence, and outputs the search result to the replacement execution unit 105.
- The replacement execution unit 105 replaces the word at the replacement location set based on the replacement condition with a word found by the set replacement method, and generates a new corpus. At this time, the sentence ending and the like of the newly generated corpus are adjusted based on the generation condition setting data stored in the generation condition setting data storage unit 108. The sentence generated as a result of the above processing is output to the duplicate sentence exclusion unit 106.
- The duplicate sentence exclusion unit 106 determines whether or not the corpus output from the replacement execution unit 105 is a duplicate sentence; if it is a duplicate sentence, it is regarded as a discard determination sentence and discarded. If the corpus output from the replacement execution unit 105 is not a duplicate sentence, the duplicate sentence exclusion unit 106 stores the corpus as a substitution generated sentence in the substitution generated sentence storage unit 109.
- The case frame dictionary 107 is a dictionary in which predicates are classified according to their word senses (case frames) and words are grouped with specific cases; the dictionary inquiry unit 104 searches it for words that match the word sense of the case frame set at the replacement location.
- The generation condition setting data storage unit 108 stores generation condition setting data, which specifies the conditions under which a generated corpus is adjusted, and the replacement execution unit 105 adjusts the replaced corpus based on the generation condition setting data in the generation condition setting data storage unit 108.
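- The generation flow through these units can be summarized as the following sketch. It is schematic wiring only, under the assumption that each stage is available as a callable; it is not the actual implementation of units 101 to 109.

```python
# Schematic wiring of the generation pipeline in FIG. 3 (all names illustrative).
def generate_corpus(ind_sentences, analyze, set_replacement, query_dictionary,
                    execute_replacement, generation_conditions):
    generated, seen = [], set()
    for sentence in ind_sentences:                       # IND sentence reception unit 101
        analysis = analyze(sentence)                     # language analysis unit 102
        if analysis is None:                             # analysis error -> discard
            continue
        condition = set_replacement(analysis)            # replacement location setting unit 103
        candidates = query_dictionary(condition)         # dictionary inquiry unit 104 / case frame dictionary 107
        for word in candidates:
            new_sentence = execute_replacement(          # replacement execution unit 105
                sentence, condition, word, generation_conditions)
            if new_sentence in seen:                     # duplicate sentence exclusion unit 106
                continue
            seen.add(new_sentence)
            generated.append(new_sentence)               # substitution generated sentence storage unit 109
    return generated
```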
- the filtering processing unit 110 classifies the corpus stored in the substitution generated sentence storage unit 109 into a corpus of OOD determination sentence, COOD determination sentence, IND determination sentence, and CIND determination sentence.
- the configuration example of the filtering processing unit 110 will be described later in detail with reference to FIG.
- The filtering processing unit 110 includes a semantic analyzer 131, an IND determination sentence storage unit 132, a COOD corpus extraction unit 133, a COOD determination sentence storage unit 134, a confirmed IND determination sentence storage unit 135, an OOD determination sentence storage unit 136, a CIND corpus extraction unit 137, a CIND determination sentence storage unit 138, and a confirmed OOD determination sentence storage unit 139.
- The semantic analyzer 131 is a semantic analyzer (corresponding to the semantic analysis unit 14 in FIG. 1) trained with the corpus of the old version (the corpus generated so far). For each of the corpora consisting of substitution generated sentences stored in the substitution generated sentence storage unit 109, it determines whether the corpus is utterance content to be handled by a predetermined application program (hereinafter also referred to as an IND determination sentence) or utterance content that cannot be handled (is rejected) by the predetermined application program (hereinafter also referred to as an OOD determination sentence).
- The semantic analyzer 131 stores the IND determination sentences in the IND determination sentence storage unit 132 and the OOD determination sentences in the OOD determination sentence storage unit 136.
- The COOD (Close OOD) corpus extraction unit 133 extracts the corpora classified as COOD determination sentences from among the IND determination sentences stored in the IND determination sentence storage unit 132 and stores them in the COOD determination sentence storage unit 134, and stores the other IND determination sentences in the confirmed IND determination sentence storage unit 135 as confirmed IND determination sentences.
- The COOD corpus extraction unit 133 also discards, as discard determination sentences, corpora that are non-sentences and are classified as neither COOD determination sentences nor confirmed IND determination sentences among the corpora consisting of IND determination sentences.
- The COOD determination sentence is an OOD determination sentence existing near the boundary with the IND determination sentences in the determination. Details of the COOD determination sentence will be described later with reference to FIG. 2.
- The COOD corpus extraction unit 133 includes a non-sentence determination unit 133a and a non-appearance determination unit 133b. It controls the non-sentence determination unit 133a to determine whether each corpus is a non-sentence, and a corpus determined to be a non-sentence is regarded as a discard determination sentence and discarded.
- Further, the COOD corpus extraction unit 133 controls the non-appearance determination unit 133b to determine, based on the non-appearance of words, whether each corpus not regarded as a non-sentence is a COOD determination sentence or a confirmed IND determination sentence; the COOD determination sentences are extracted and stored in the COOD determination sentence storage unit 134, and the confirmed IND determination sentences are stored in the confirmed IND determination sentence storage unit 135.
- The non-sentence determination unit 133a calculates, for the corpora consisting of IND determination sentences, a Perplexity value, which is an index of how sentence-like the meaning is, determines whether each corpus is a non-sentence by comparing the Perplexity value with a predetermined threshold, and discards the corpora regarded as non-sentences as discard determination sentences.
- The non-appearance determination unit 133b calculates, for each corpus determined not to be a non-sentence, a parameter indicating non-appearance, that is, whether the corpus contains words that rarely appear in the corpus group of the domain determined as IND determination sentences. The non-appearance determination unit 133b then compares the parameter indicating non-appearance with a predetermined threshold, and regards and extracts, as COOD determination sentences, the corpora containing words that rarely appear in the corpus group of the domain determined as IND determination sentences.
- The non-appearance determination unit 133b stores the extracted COOD determination sentences in the COOD determination sentence storage unit 134, regards the corpora consisting of the other IND determination sentences as confirmed IND determination sentences, and stores them in the confirmed IND determination sentence storage unit 135.
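- The two checks performed by the COOD corpus extraction unit 133 can be sketched as follows. The unigram language model, thresholds, and the rarity criterion are all assumptions made for illustration; the patent only specifies that a Perplexity value and a non-appearance parameter are compared with predetermined thresholds.

```python
# Minimal sketch of non-sentence and non-appearance determination (assumed details).
import math
from collections import Counter

def perplexity(sentence, unigram_probs, floor=1e-6):
    # Unigram stand-in for the language model used by the non-sentence
    # determination unit 133a; a high value suggests the text is not sentence-like.
    words = sentence.split()
    log_p = sum(math.log(unigram_probs.get(w, floor)) for w in words)
    return math.exp(-log_p / max(len(words), 1))

def extract_cood(ind_judged, unigram_probs, ppl_threshold=1e4, rare_ratio=0.5):
    domain_counts = Counter(w for s in ind_judged for w in s.split())
    cood, confirmed_ind = [], []
    for s in ind_judged:
        if perplexity(s, unigram_probs) > ppl_threshold:
            continue                                   # discarded as a non-sentence
        words = s.split()
        # Non-appearance: ratio of words that rarely occur in the domain's corpus group.
        rare = sum(1 for w in words if domain_counts[w] <= 1)
        (cood if rare / len(words) >= rare_ratio else confirmed_ind).append(s)
    return cood, confirmed_ind
```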
- The CIND (Close IND) corpus extraction unit 137 extracts the corpora classified as CIND determination sentences from among the OOD determination sentences stored in the OOD determination sentence storage unit 136 and stores them in the CIND determination sentence storage unit 138, and stores the other OOD determination sentences in the confirmed OOD determination sentence storage unit 139 as confirmed OOD determination sentences.
- The CIND corpus extraction unit 137 also discards, as discard determination sentences, corpora that are classified as neither CIND determination sentences nor confirmed OOD determination sentences among the corpora consisting of OOD determination sentences.
- The CIND determination sentence is an IND determination sentence existing near the boundary with the OOD determination sentences in the determination. Which domain each CIND determination sentence belongs to is finally determined manually. The CIND determination sentence will be described later in detail with reference to FIG. 2.
- The CIND corpus extraction unit 137 includes a non-sentence determination unit 137a and a non-appearance determination unit 137b. It controls the non-sentence determination unit 137a to determine whether each corpus is a non-sentence, and a corpus determined to be a non-sentence is regarded as a discard determination sentence and discarded.
- Further, the CIND corpus extraction unit 137 controls the non-appearance determination unit 137b to determine whether each corpus not regarded as a non-sentence is a CIND determination sentence or a confirmed OOD determination sentence.
- The CIND determination sentences are extracted and stored in the CIND determination sentence storage unit 138, and the confirmed OOD determination sentences are stored in the confirmed OOD determination sentence storage unit 139.
- The non-sentence determination unit 137a calculates, for the corpora consisting of OOD determination sentences, a Perplexity value, which is an index of how sentence-like the meaning is, determines whether each corpus is a non-sentence by comparing the Perplexity value with a predetermined threshold, and discards the corpora regarded as non-sentences as discard determination sentences.
- The non-appearance determination unit 137b determines whether each of the corpora not regarded as non-sentences contains words that rarely appear in the corpus group of the domain determined as OOD determination sentences, and extracts, as confirmed OOD determination sentences, the corpora containing words that rarely appear in that corpus group. Furthermore, the non-appearance determination unit 137b stores the extracted confirmed OOD determination sentences in the confirmed OOD determination sentence storage unit 139, regards and extracts the other corpora as CIND determination sentences, and stores them in the CIND determination sentence storage unit 138.
- As described above, the semantic analyzer 131 in FIG. 4 classifies the corpus group consisting of substitution generated sentences generated from IND sentences and stored in the substitution generated sentence storage unit 109 into a corpus group of IND determination sentences and a corpus group of OOD determination sentences.
- From the corpus group consisting of IND determination sentences, the COOD corpus extraction unit 133 discards the corpora regarded as non-sentences as discard determination sentences by the determination based on the Perplexity value described later, and extracts the corpora judged to be COOD determination sentences by the determination based on the non-appearance of words.
- The COOD corpus extraction unit 133 regards the corpora consisting of the remaining IND determination sentences as confirmed IND determination sentences.
- The corpora determined to be IND determination sentences by the semantic analyzer 131 are limited by its recognition accuracy and, furthermore, because words have been replaced, they partly include corpora that should be considered OOD determination sentences.
- Such an OOD determination sentence is a corpus similar to the IND determination sentences, because what was an IND determination sentence became an OOD determination sentence through word substitution, and it can be considered the COOD determination sentence described with reference to FIG. 2.
- Although the COOD determination sentence is an OOD determination sentence, it is a corpus similar to the IND determination sentences; in other words, it can be considered a corpus that is not typical of OOD determination sentences. Furthermore, the COOD determination sentence is a highly misleading expression that is easily mistaken for an IND determination sentence, and is therefore a corpus likely to cause erroneous determination.
- Here, two or more corpora being "similar" to each other means, for example, that the two or more sentences have predicates that are similar to each other and that the structures of the predicates and the terms related to the predicates are similar.
- Two or more sentences can be said to be even more similar if the meanings and roles of the words in the terms related to the predicates are also similar.
- this number may be another index such as a weight, may be normalized according to the population, and may be used in combination by multiplying the index or the like.
- Conversely, two or more sentences being "dissimilar" to each other means, for example, sentences that have similar predicate term structures but whose predicates or noun phrases of the semantic classes have different notations.
- On the other hand, a corpus that is far from the corpora regarded as IND determination sentences in the feature space can be considered to have a low possibility of being erroneously recognized as an IND determination sentence.
- Therefore, by learning using corpora that are COOD determination sentences, the frame estimation unit 13 and the semantic analysis unit 14 can reliably reject corpora that are similar to the IND determination sentences but are OOD determination sentences, and as a result the recognition accuracy can be improved. For this reason, generating and learning more corpora that become COOD determination sentences makes it possible to improve the recognition accuracy.
- The CIND determination sentence is a corpus corresponding to the COOD determination sentence; in the feature space of FIG. 2, it is a corpus that, among the corpora regarded as IND determination sentences, exists near the boundary with the distribution of corpora regarded as OOD determination sentences.
- Although the CIND determination sentence is an IND determination sentence, it is a corpus similar to the OOD determination sentences; in other words, it can be considered a corpus that is not typical of IND determination sentences. Furthermore, the CIND determination sentence is a highly misleading expression that is easily mistaken for an OOD determination sentence, and can be considered a corpus likely to cause erroneous determination.
- Conversely, a corpus that is far from the corpora regarded as OOD determination sentences in the feature space can be considered unlikely to cause an erroneous determination as an OOD determination sentence.
- Therefore, by learning using corpora that are CIND determination sentences, the frame estimation unit 13 and the semantic analysis unit 14 can reliably recognize corpora that are similar to the OOD determination sentences but are IND determination sentences, and as a result the recognition accuracy can be improved. For this reason, generating and learning more corpora that become CIND determination sentences makes it possible to improve the recognition accuracy.
- In step S11, the IND sentence reception unit 101 selects an unprocessed IND sentence as the IND sentence to be processed from among the IND sentences created manually or by other means, receives its input, and outputs it to the language analysis unit 102.
- In step S12, the language analysis unit 102 analyzes the morphemes, phrases, and predicate term structure of the IND sentence to be processed.
- In step S13, the language analysis unit 102 stores the predicate term structure analysis result. More specifically, the language analysis unit 102 stores the analysis result as long as no error occurs in the predicate term structure analysis; when an error occurs, the language analysis unit 102 discards, for example, the IND sentence to be processed.
- In step S14, the language analysis unit 102 determines whether there is an unprocessed IND sentence, and if there is, the process returns to step S11. That is, the processes of steps S11 to S14 are repeated until there are no unprocessed IND sentences. When all the IND sentences have been processed and it is determined in step S14 that there is no unprocessed IND sentence, the process proceeds to step S15.
- For the predicate term structure analysis, it may be possible to switch between deep case analysis, surface case analysis, and the like, and one of them may be selected.
- In the analysis result, the position of the verb phrase and of the part that becomes the target case in a noun phrase (the object in the case of English, for example) is determined.
- For example, when the IND sentence to be processed is "Which restaurant is near and delicious?", the IND sentence may not have a predicate, and the predicate term structure analysis may not succeed.
- In such a case, the omitted predicate may be complemented so that the analysis result of the predicate term structure can be interpolated.
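- As a rough stand-in for this predicate term structure analysis, the sketch below uses a dependency parser for English to locate the predicate (root verb), its direct object (the target case, dobj in FIG. 16), and prepositional terms. The patent does not name a specific parser; spaCy and the model name here are assumptions.

```python
# Assumed-parser sketch of predicate term structure analysis for English.
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes the small English model is installed

def predicate_term_structure(sentence):
    doc = nlp(sentence)
    verbs = [t for t in doc if t.pos_ == "VERB"]
    if not verbs:
        return None        # no predicate: analysis fails and the sentence is discarded
    predicate = verbs[0]
    # Direct object corresponds to the target case (dobj); prepositional
    # phrases correspond to the other predicate terms.
    dobj = [t.text for t in predicate.children if t.dep_ == "dobj"]
    preps = [(t.text, [c.text for c in t.children if c.dep_ == "pobj"])
             for t in predicate.children if t.dep_ == "prep"]
    return {"predicate": predicate.lemma_, "target_case": dobj, "preps": preps}

print(predicate_term_structure("find Chinese food in Austin"))
# e.g. {'predicate': 'find', 'target_case': ['food'], 'preps': [...]}
# (the exact output depends on the parser and model used)
```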
- In step S15, the replacement location setting unit 103 reads the analysis results of the predicate term structure analysis stored in the process of step S13, sets the replacement condition designated in advance, and supplies the setting result to the dictionary inquiry unit 104.
- The replacement condition includes a replacement method and a replacement location.
- The first replacement method is the Action-fixed (predicate-fixed) Category replacement (target case replacement) method, and the second is the Category-fixed (target case-fixed) Action replacement (predicate replacement) method.
- The Action-fixed (predicate-fixed) Category replacement (target case replacement) method is, for example, as shown in 1) in the upper part of example Ex21 in FIG. 9, when the input sentence is "set an alarm at 7 o'clock", a method of fixing the predicate (Action) "set" and replacing the target case (Category) "alarm"; in 1) of example Ex21, "alarm" is replaced with another noun.
- The Category-fixed (target case-fixed) Action replacement (predicate replacement) method is, for example, as shown in 2) in the lower part of example Ex21 in FIG. 9, when the input sentence is "set an alarm at 7 o'clock", a method of fixing the target case (Category) "alarm" and replacing the predicate (Action) "set"; in 2) of example Ex21, "set" is replaced with "release".
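- The two replacement methods can be sketched as below for the example sentence "set an alarm at 7 o'clock". The replacement word lists here are invented for illustration; in the device they come from the case frame dictionary 107.

```python
# Sketch of the two replacement methods (replacement candidates are assumptions).
def action_fixed_category_replacement(template, category_words):
    # Method 1: the predicate (Action) is fixed; the target-case noun (Category) is replaced.
    return [template.format(category=w) for w in category_words]

def category_fixed_action_replacement(template, action_words):
    # Method 2: the target case (Category) is fixed; the predicate (Action) is replaced.
    return [template.format(action=w) for w in action_words]

print(action_fixed_category_replacement(
    "set {category} at 7 o'clock", ["a bomb", "a table", "a trap"]))
print(category_fixed_action_replacement(
    "{action} an alarm at 7 o'clock", ["release", "cancel", "silence"]))
# The outputs are candidate sentences that may later be judged OOD, COOD, IND, or CIND.
```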
- The setting of phrase boundaries and the like can be arbitrarily specified by the user, and may be switched according to the contents of the specification.
- For example, as shown in the top row of example Ex22, the replacement location setting unit 103 divides the sentence into the phrase-unit structure "this weekend", "station", "near", "de", "recommended", "no", "spot", and "tell me".
- The replacement location setting unit 103 may also change the setting so that specific words or phrase separation units are grouped into a single phrase as needed, using rules or a word dictionary. For example, as shown in the second row from the top of example Ex22, the structure is adjusted to "this weekend", "near the station", "de", "recommended", "no", "spot", and "tell me".
- From the adjusted phrase-unit structure, the replacement location setting unit 103, for example, replaces the word "recommended" among the replacement locations by word group, as shown in the third row from the top of example Ex22.
- Replacing a word at a single location in this way is effective for producing COOD sentences, but it is also possible to add variations of the setting, such as adding the de-case "near the station" to the replacement targets without replacing "spot", or replacing only the de-case "near the station".
- Alternatively, as shown in the top row of example Ex23, the replacement location setting unit 103 divides the sentence into a word-segmentation-unit structure.
- From the adjusted word-segmentation-unit structure, the replacement location setting unit 103, for example, replaces "tell me", on which "recommended" depends, with a predicate of a similar word sense that takes the same "recommendation" as a case, as shown in the third row from the top of example Ex23. A setting may also be made to replace it with a predicate of a dissimilar word sense that does not take the same "recommendation" as a case.
- The selection criteria for the replacing predicate described above can be judged not only by what kinds of words the predicate takes in the case in question, but also by the similarity or dissimilarity of the words in other terms such as the de-case and ni-case.
- In step S16, the dictionary inquiry unit 104 reads one unprocessed IND sentence from the stored data of the predicate term structure analysis results in the language analysis unit 102 and accepts it as the IND sentence to be processed.
- In step S17, the dictionary inquiry unit 104 specifies the replacement location according to the specified replacement method based on the setting result, and searches the case frame dictionary 107 for noun phrases corresponding to the term of the word at the replacement location and for predicates corresponding to its word sense.
- In step S18, the dictionary inquiry unit 104 stores the IND sentence to be processed, the setting information on replacement, and the search result in association with each other.
- In step S19, the dictionary inquiry unit 104 determines whether there is an unprocessed IND sentence among the stored data of the predicate term structure analysis results, and if there is, the process returns to step S16. That is, the processes of steps S16 to S19 are repeated until the search for replacement candidates has been completed for all the IND sentences in the stored data of the predicate term structure analysis results. When the search for replacement candidates has been completed for all the IND sentences and it is determined that there is no unprocessed IND sentence, the process proceeds to step S20.
- When the English predicate term structure analysis result corresponding to the predicate term structure analysis result of FIG. 12 is as shown in FIG. 16, a word group replacing the target-case predicate term is searched for as shown in FIG. 17 in the case of the Action-fixed Category replacement method, and a word group replacing the predicate part is searched for as shown in FIG. 18 in the case of the Category-fixed Action replacement method. Such replacement produces, for example, English OOD candidate sentences as shown in FIG. 19.
- FIG. 12 shows an example of the result of predicate term structure analysis; from the left, the sentence ID, the sentence, the predicate, the predicate ending, the predicate terms, and the original domain are shown.
- The predicate terms are shown, from the left, as the place case or de case, the adnominal modification clause or no case, and so on.
- FIG. 13 shows an example of the word groups used when the target case among the predicate terms is replaced by the Action-fixed Category replacement method, for the sentences of the predicate term structure analysis results shown in FIG. 12.
- In FIG. 13, the sentence ID, the sentence, the predicate, the predicate ending, the replacement words for the terms, and the original domain are shown from the left.
- The replacement words for the terms are shown, from the left, as the de case (place case or de case), the no case (adnominal modification clause or no case), and so on; FIG. 13 shows an example of the word groups used when replacing the target case.
- The items in FIG. 13 that correspond to those in FIG. 12 have the same descriptions, so their description will be omitted as appropriate.
- FIG. 14 shows an example of the word groups used when the predicate is replaced by the Category-fixed Action replacement method, for the sentences shown in FIG. 12.
- In FIG. 14, the sentence ID, the sentence, the predicate, the predicate ending, the replacement words for the predicate, and the original domain are shown from the left.
- The items in FIG. 14 that correspond to those in FIG. 12 have the same descriptions, so their description will be omitted as appropriate.
- FIG. 16 shows an example of the result of predicate term structure analysis in English; from the left, the sentence ID, the sentence (Sentence), the predicate (verb: Action), the predicate terms (Argument), and the original domain (Original Domain) are shown.
- The predicate terms are shown, from the left, as the terms related to the predicate (prep_in, ...) and the target case (dobj).
- FIG. 17 shows an example of the word groups used when the target case (dobj) among the predicate terms is replaced by the Action-fixed Category replacement method, for the sentences of the English predicate term structure analysis results shown in FIG. 16.
- In FIG. 17, the sentence ID, the sentence (Sentence), the predicate (verb), the replacement words for the terms (Argument), and the original domain (Original Domain) are shown from the left. The replacement words for the terms include the terms related to the predicate (prep_in, ...); FIG. 17 shows an example of the word groups used when replacing the target case (dobj).
- The items in FIG. 17 that correspond to those in FIG. 16 have the same descriptions, so their description will be omitted as appropriate.
- For the sentence with sentence ID 2, "find Chinese food in Austin", "victim", "bomb", "cache", and "remains" are shown as examples of replacement words for the target case "Chinese food".
- FIG. 18 shows an example of the word groups used when the predicate is replaced by the Category-fixed Action replacement method, for the sentences shown in FIG. 16.
- In FIG. 18, the sentence ID, the sentence, the predicate, the predicate ending, the replacement words for the predicate, and the original domain are shown from the left.
- The items in FIG. 18 that correspond to those in FIG. 16 have the same descriptions, so their description will be omitted as appropriate.
- For the sentence with sentence ID 2, "find Chinese food in Austin", "include", "open", "run", and "operate" are shown as examples of replacement words for the predicate "find".
- In FIG. 19, "find cache in Austin" and "find remains in Austin" are shown as examples of OOD candidate sentences for the sentence with sentence ID 2 in FIG. 16, "find Chinese food in Austin"; that is, "Chinese food" is replaced with "cache" and "remains", respectively.
- Similarly, "Open Chinese food in Austin" and "Operate Chinese food in Austin" are shown as examples of OOD candidate sentences for the sentence with sentence ID 2 in FIG. 16, "find Chinese food in Austin"; that is, "find" is replaced with "Open" and "Operate", respectively.
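- The FIG. 16 to FIG. 19 example can be reproduced in code form as follows, combining the parse result of "find Chinese food in Austin" with the replacement word groups listed above. The word lists are taken from the figures as cited in the text; everything else is a minimal sketch.

```python
# Worked example: generating the OOD candidate sentences of FIG. 19.
sentence = "find Chinese food in Austin"
predicate, target_case = "find", "Chinese food"

# Word groups as in FIG. 17 and FIG. 18 (subset shown).
category_replacements = ["victim", "bomb", "cache", "remains"]
action_replacements = ["include", "open", "run", "operate"]

# Action-fixed Category replacement -> "find cache in Austin", "find remains in Austin", ...
ood_candidates = [sentence.replace(target_case, w) for w in category_replacements]
# Category-fixed Action replacement -> "open Chinese food in Austin", ...
ood_candidates += [sentence.replace(predicate, w) for w in action_replacements]

for candidate in ood_candidates:
    print(candidate)
```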
- FIG. 20 shows a simple image of the case frame dictionary.
- <setting 4> in example Ex31 indicates what kinds of words are involved as the predicate terms having the respective roles.
- The entries in parentheses are represented by a word and a number.
- The number represents the number of times (frequency) that the word and the predicate were associated. For example, ("meeting", 41) indicates that the word "meeting" was related 41 times as a target case to the predicate of <setting 4> in the large amount of corpus data from which the case frame dictionary was built.
- This numerical value may be replaced with another index such as a weight, may be normalized according to the population, or may be used in combination by multiplying indices together.
- For <set 1>, case frames that have different expressions of "to set" but similar meanings are listed; in this case, ("she", 56), ("father", 52), and ("wife", 49) are listed as the agent case, ("timer", 67), ("sleep timer", 42), and ("alarm", 41) are listed as the target case, and ("rice cooker", 52), ("air conditioner", 45), ("radio", 32), and ("mobile", 12) are listed as the instrument case.
- In example Ex32, ("system", 83), ("company", 42), ("school", 33), and ("superior", 18) are listed for <setting 4> as the agent case, ("meeting", 41), ("participant", 27), and ("moving time", 10) are listed as the target case, and ("PC (personal computer)", 95), ("scheduler", 72), and ("smartphone", 33) are listed as the de case.
- For example, when the target case "alarm" is fixed, a case frame of a different predicate that takes the same "alarm" in its target case, <improvement 15>, is selected in addition to <set 1>. Both include the same word "alarm" in the target case, but whereas the frequency of "alarm" in <set 1> is 41, its frequency in <improvement 15> is 2 or less.
- Such predicates are likely to differ slightly in word sense.
- For example, the timer function contains many words of semantic classes that are not very relevant. Thus, a predicate is selected that satisfies the condition that the fixed word appears in the same term and that the value n representing the frequency of the word or the strength of the relationship is smaller than a certain threshold value.
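- The dictionary structure of FIG. 20 and the selection rule above can be sketched as follows. The data layout, the entries, and the threshold name beta are assumptions made for illustration (the threshold symbol in the text is garbled); only the idea of keeping predicates whose case frame weakly contains the fixed word is taken from the description.

```python
# Assumed representation of the case frame dictionary and of the selection rule.
case_frame_dictionary = {
    ("set", 1):      {"target": {"timer": 67, "sleep timer": 42, "alarm": 41},
                      "agent":  {"she": 56, "father": 52, "wife": 49}},
    ("improve", 15): {"target": {"alarm": 2, "response": 31}},   # illustrative entry
}

def select_replacement_predicates(fixed_word, case="target", beta=10):
    selected = []
    for (predicate, frame_id), frames in case_frame_dictionary.items():
        n = frames.get(case, {}).get(fixed_word, 0)
        if 0 < n < beta:          # same term present, but only weakly associated
            selected.append((predicate, frame_id, n))
    return selected

print(select_replacement_predicates("alarm"))   # e.g. [('improve', 15, 2)]
```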
- The case frame dictionary 107 may be an existing one.
- However, a general existing case frame dictionary does not contain many of the words used for the service purpose that constitutes the domain, so a user-defined case frame dictionary compiled by collecting the words necessary for the service may be added.
- In step S20, the replacement execution unit 105 reads an unprocessed IND sentence from among the stored search results as the IND sentence to be processed, together with the associated setting information on replacement and the search results, and accepts them.
- In step S21, the replacement execution unit 105 replaces the word that is the replacement target of the IND determination sentence to be processed, based on the IND determination sentence to be processed, the setting information on replacement, and the search results, to generate a corpus, and adjusts conjugations and word endings.
- In step S22, the replacement execution unit 105 stores the generated corpus as a primary substitution generated sentence.
- In step S23, the replacement execution unit 105 determines whether there is an unprocessed IND sentence among the stored search results, and if there is, the process returns to step S20. That is, the processes of steps S20 to S23 are repeated until corpora have been generated by replacement based on the search results for all the IND sentences. When it is determined in step S23 that there is no unprocessed IND sentence, the process proceeds to step S24.
- In step S24, the duplicate sentence exclusion unit 106 reads an unprocessed corpus from among the corpora stored in the process of step S22 and accepts it as the corpus to be processed.
- In step S25, the duplicate sentence exclusion unit 106 determines whether there is a sentence (duplicate sentence) that duplicates the corpus to be processed among the corpora generated and stored so far by the process of step S22. More specifically, the duplicate sentence exclusion unit 106 searches for the corpus to be processed in the corpus group stored as newly generated corpora, and determines whether it is a duplicate sentence based on whether there is a match. If it is determined in step S25 that the corpus is a duplicate sentence, the process proceeds to step S26.
- In step S26, the duplicate sentence exclusion unit 106 regards the generated corpus as a duplicate sentence, that is, a discard determination sentence, and discards it.
- On the other hand, when it is determined in step S25 that the generated corpus is not a duplicate sentence, the process proceeds to step S27.
- In step S27, the duplicate sentence exclusion unit 106 stores the substitution-generated corpus to be processed in the substitution generated sentence storage unit 109.
- In step S28, the replacement execution unit 105 determines whether there is an unprocessed search result, and if there is, the process returns to step S24.
- When it is determined in step S28 that there is no unprocessed search result, the process proceeds to step S29.
- In step S29, the replacement execution unit 105 stores the corpora currently stored without being excluded as duplicate sentences in the substitution generated sentence storage unit 109 as the final substitution generated sentences.
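- The duplicate sentence exclusion of steps S24 to S29 amounts to the following minimal sketch: a newly generated sentence is discarded when it exactly matches a sentence already stored as a substitution generated sentence.

```python
# Minimal sketch of duplicate sentence exclusion (exact-match comparison).
def exclude_duplicates(generated_sentences):
    stored, discarded, seen = [], [], set()
    for sentence in generated_sentences:
        if sentence in seen:
            discarded.append(sentence)     # treated as a discard determination sentence
        else:
            seen.add(sentence)
            stored.append(sentence)        # kept as a substitution generated sentence
    return stored, discarded
```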
- In step S30, the filtering processing unit 110 executes the filtering process, and classifies the corpora consisting of the newly generated substitution generated sentences stored in the substitution generated sentence storage unit 109 into corpora of OOD determination sentences, COOD determination sentences, IND determination sentences, and CIND determination sentences. The details of the filtering process will be described later with reference to the flowchart of FIG.
- Japanese OOD candidate sentences as shown in FIG. 15 and English OOD candidate sentences as shown in FIG. 19 are generated.
- In step S31, the semantic analyzer 131 accepts, as the corpus to be processed, a corpus consisting of unprocessed substitution generated sentences from among the corpora of substitution generated sentences stored in the substitution generated sentence storage unit 109.
- In step S32, the semantic analyzer 131 determines whether the corpus consisting of the substitution generated sentences to be processed is an IND determination sentence. If it is determined in step S32 that it is an IND determination sentence, the process proceeds to step S33.
- In step S33, the semantic analyzer 131 causes the IND determination sentence storage unit 132 to store the corpus consisting of the substitution generated sentences to be processed.
- If it is determined in step S32 that the corpus is not an IND determination sentence, that is, if the corpus consisting of the substitution generated sentences to be processed is regarded as an OOD determination sentence, the process proceeds to step S34.
- In step S34, the semantic analyzer 131 regards the substitution generated sentence to be processed as an OOD determination sentence and stores it in the OOD determination sentence storage unit 136.
- In step S35, the semantic analyzer 131 determines whether there is an unprocessed substitution generated sentence in the substitution generated sentence storage unit 109; when it is determined that there is one, the process returns to step S31 and the subsequent processing is repeated. That is, until no unprocessed input sentence remains, each substitution generated sentence is judged as to whether it is an IND determination sentence, IND determination sentences are stored in the IND determination sentence storage unit 132, and the remaining sentences, regarded as OOD determination sentences, are stored in the OOD determination sentence storage unit 136.
- If it is determined in step S35 that there is no unprocessed substitution generated sentence, the process proceeds to step S36. That is, by the processing up to this point, the group of substitution generated sentences stored in the substitution generated sentence storage unit 109 has been classified by the semantic analyzer 131, which was trained using the old-version corpus, into IND determination sentences and OOD determination sentences, and these are stored in the IND determination sentence storage unit 132 and the OOD determination sentence storage unit 136, respectively.
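The classification in steps S31 to S35 can be pictured with the following sketch, in which `analyzer.predict` stands in for the semantic analyzer 131 trained on the old-version corpus. The call name and its return convention (the estimated domain, or None when no domain matches) are assumptions for illustration, not the patent's API.

```python
def split_generated_sentences(generated_sentences, analyzer, target_domain):
    """Classify substitution generated sentences into IND and OOD candidates (steps S31-S35)."""
    ind_sentences, ood_sentences = [], []
    for sentence in generated_sentences:
        if analyzer.predict(sentence) == target_domain:
            ind_sentences.append(sentence)   # goes to the IND determination sentence storage unit 132
        else:
            ood_sentences.append(sentence)   # goes to the OOD determination sentence storage unit 136
    return ind_sentences, ood_sentences
```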
- In step S36, the COOD corpus extraction unit 133 executes the COOD corpus extraction process, extracts COOD determination sentence candidates from the domain of the corpora regarded as IND determination sentences, and causes the COOD determination sentence storage unit 134 to store them.
- The remaining IND determination sentences are stored in the confirmed IND determination sentence storage unit 135 as confirmed IND determination sentences.
- Corpora regarded as non-sentences are treated as discard determination sentences and discarded.
- In step S37, the CIND corpus extraction unit 137 executes the CIND corpus extraction process, extracts CIND determination sentences from the corpora regarded as OOD determination sentences, and causes the CIND determination sentence storage unit 138 to store them.
- The remaining OOD determination sentences are stored in the confirmed OOD determination sentence storage unit 139 as confirmed OOD determination sentences.
- Corpora regarded as non-sentences are treated as discard determination sentences and discarded.
- By the above processing, the corpora are classified in advance into COOD determination sentences, confirmed IND determination sentences, CIND determination sentences, and confirmed OOD determination sentences, so the load of the confirmation work can be reduced, and as a result the development cost of the corpus can be reduced.
- Furthermore, the frame estimation unit 13 and the semantic analysis unit 14 can improve their recognition accuracy by learning with a corpus that includes the generated COOD determination sentences and CIND determination sentences.
- Next, the COOD corpus extraction process will be described with reference to the flowchart in FIG. Ideally, the work of extracting COOD determination sentences from the substitution generated sentences would be performed manually, but to improve work efficiency, the COOD candidate sentences can be narrowed down further by the following filtering in the COOD corpus extraction process.
- In step S51, the COOD corpus extraction unit 133 receives as input an unprocessed corpus from among the corpora serving as IND determination sentences stored in the IND determination sentence storage unit 132, and sets it as the corpus to be processed.
- In step S52, the COOD corpus extraction unit 133 controls the non-sentence determination unit 133a to calculate the Perplexity value of the corpus to be processed.
- The Perplexity value represents the average branching factor, where the number of branches (the number of candidates) for the word following a given word is expressed as the reciprocal of the n-gram probability. That is, compared with a sentence generated by combining words at random, a meaningful sentence has high connection probabilities between its words and a low branching factor for the connected words, so its Perplexity value is low. Conversely, in a sentence that does not make sense, the probabilities of the word combinations are low and the branching factor of the connected words is high, so the Perplexity value is high.
- In other words, the Perplexity value is an index for judging the probabilistic validity of a generated sentence.
- The Perplexity value of a generated sentence is calculated, for example, as follows. For details on how to calculate Perplexity values, refer to Chapter 4, "Language Modeling with N-grams", of Daniel Jurafsky (2016), https://web.stanford.edu/~jurafsky/slp3/4.pdf.
- The joint probability P(w) of a word string is modeled based on the idea that word strings are generated probabilistically.
- The parameters of this n-gram model, that is, the n-gram probabilities of words, are learned from a large amount of training text such as Internet sites or news articles.
- Using the n-gram model learned in this way, the non-sentence determination unit 133a calculates the Perplexity value expressed by the following Expression (3) for each generated corpus.
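As a rough illustration of this calculation, the following sketch trains a bigram model with add-one smoothing and computes the per-word perplexity of a sentence. The choice of bigrams and of add-one (Laplace) smoothing is an assumption made for illustration; the patent only refers to n-gram language modeling in general and does not reproduce Expression (3) here.

```python
import math
from collections import Counter

def train_bigram_model(training_sentences):
    """Count unigram histories and bigrams from whitespace-tokenized training sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in training_sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def perplexity(sentence, unigrams, bigrams):
    """Perplexity of a sentence under the bigram model, using add-one smoothing."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    log_prob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(p)
    n = len(tokens) - 1
    return math.exp(-log_prob / n)   # low for natural sentences, high for non-sentences
```

A sentence whose perplexity exceeds the threshold used in step S53 below would then be discarded as a non-sentence.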
- For example, the sentence “Tell me a good reputation near here” is an unnatural sentence.
- Its Perplexity value PPL is 80.4152.
- The sentence “Tell me a good reputation surfing near here” is also an unnatural sentence whose meaning does not come through.
- Its Perplexity value PPL is 70.6759; for example, the n-gram probability p(surfing | good) = 2.13532e-05 between “good” and “surfing” is low.
- The sentence “Tell me a store with a good reputation near here” is a relatively meaningful sentence.
- Its Perplexity value PPL is 57.4806.
- The sentence “Tell me a massage with a good reputation near here” is a sentence that makes sense.
- Its Perplexity value PPL is 57.0273.
- In this way, the more natural the sentence, the higher the n-gram probabilities and the lower the Perplexity value.
- In step S53, the non-sentence determination unit 133a determines whether the corpus to be processed is a non-sentence based on whether its calculated Perplexity value is larger than a predetermined threshold.
- If, in step S53, the Perplexity value PPL is larger than the predetermined threshold, the process proceeds to step S55.
- In step S55, the non-sentence determination unit 133a regards the corpus to be processed as a non-sentence, discards it as a discard determination sentence, and the process proceeds to step S56.
- If, in step S53, the Perplexity value PPL is not larger than the predetermined threshold, that is, if it is equal to or smaller than the threshold, the non-sentence determination unit 133a regards the corpus to be processed as a meaningful sentence, and the process proceeds to step S54.
- In step S54, the non-sentence determination unit 133a stores the corpus to be processed.
- In step S56, the COOD corpus extraction unit 133 determines whether there is an unprocessed corpus among the corpora serving as IND determination sentences stored in the IND determination sentence storage unit 132; if there is, the process returns to step S51. That is, steps S51 to S56 are repeated so that the Perplexity value PPL is calculated for all the IND determination sentences and each is judged, by comparison with the predetermined threshold, to be either a non-sentence or a corpus consisting of meaningful sentences.
- When it is determined in step S56 that there is no unprocessed corpus, that is, when the Perplexity value PPL has been calculated for all the IND determination sentences and each has been judged by comparison with the predetermined threshold to be either a meaningful corpus or a non-sentence, the process proceeds to step S57.
- In step S57, the COOD corpus extraction unit 133 accepts, as the corpus to be processed, an unprocessed corpus from among the IND determination sentences that were stored in steps S52 and S53 by the non-sentence determination unit 133a as corpora consisting of meaningful sentences rather than non-sentences.
- In step S58, the COOD corpus extraction unit 133 controls the non-appearance determination unit 133b to calculate the non-appearance, in the target domain, of the words included in the corpus to be processed.
- Non-appearance is an index indicating to what extent the generated corpus contains words that do not appear in the corpus of the domain that the semantic analyzer has determined to consist of IND determination sentences.
- The sentences (corpora) shown in FIG. 24 all contain words with a low frequency of occurrence in the ALARM-CHANGE domain; that is, they are Close OOD candidates with high non-appearance, and they may also include discard determination sentences. However, since non-sentences have already been excluded in the processing before non-appearance is determined, what is extracted here is substantially the COOD determination sentences. In the following, the words enclosed in quotation marks are words with high non-appearance.
- Non-appearance can be evaluated numerically, for example, by using the number of words in the corpus to be processed that do not appear in the target domain.
- The non-appearance determination unit 133b lets n be the total number of words in the corpus to be processed, which is an IND determination sentence, and lets no be the number of those words that do not appear in the domain of the IND determination sentences (that is, words not included in any corpus belonging to that domain other than the corpus to be processed), and calculates no/n as the parameter representing non-appearance.
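A minimal sketch of the no/n calculation follows. The domain vocabulary is assumed to be the set of all words appearing in the other corpora of the IND domain, which is how the text above describes it; the function and variable names themselves are illustrative only.

```python
def non_appearance_ratio(sentence_tokens, domain_vocabulary):
    """no/n: fraction of words in the sentence that never appear in the rest of the domain."""
    n = len(sentence_tokens)
    if n == 0:
        return 0.0
    no = sum(1 for word in sentence_tokens if word not in domain_vocabulary)
    return no / n
```

In step S59 this ratio is compared against a threshold: a high value marks the sentence as a COOD candidate, a low value as a confirmed IND determination sentence.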
- In step S59, the non-appearance determination unit 133b determines whether the parameter no/n, which represents the non-appearance of the words included in the corpus to be processed in the domain consisting of the IND determination sentences, is larger than a predetermined threshold.
- If, in step S59, the parameter no/n representing non-appearance is larger than the predetermined threshold, that is, if the non-appearance of the words included in the corpus to be processed in the domain consisting of the IND determination sentences is high, the process proceeds to step S60.
- In step S60, the non-appearance determination unit 133b extracts the corpus to be processed as a COOD determination sentence and stores it in the COOD determination sentence storage unit 134.
- If, in step S59, the parameter no/n representing non-appearance is not larger than the predetermined threshold, that is, if it is equal to or smaller than the threshold and the non-appearance of the included words in the domain consisting of the IND determination sentences is low, the process proceeds to step S61.
- In step S61, the non-appearance determination unit 133b regards the corpus to be processed as a confirmed IND determination sentence and causes the confirmed IND determination sentence storage unit 135 to store it.
- In step S62, the COOD corpus extraction unit 133 determines whether there is an unprocessed IND determination sentence. If there is, the process returns to step S57. That is, steps S57 to S62, including the processing of the non-appearance determination unit 133b in steps S58 and S59, are repeated until no unprocessed IND determination sentence remains.
- When it is determined in step S62 that there is no unprocessed IND determination sentence, that is, when all the IND determination sentences are regarded as processed, the process ends.
- By the above processing, among the corpora in the domain constituted by the IND determination sentences, a corpus that is not a non-sentence (its Perplexity value is not high) and whose included words have high non-appearance is regarded as a COOD determination sentence and stored in the COOD determination sentence storage unit 134, while a corpus that is not a non-sentence and whose included words have low non-appearance is regarded as a confirmed IND determination sentence and stored in the confirmed IND determination sentence storage unit 135.
- Non-appearance may also be evaluated using TF values and IDF values. The TF value is an index for analyzing the words that characterize each document (here, each domain) when there are a plurality of documents (here, a plurality of domains), and is expressed by the following Equation (4).
- The IDF value is an index indicating whether each word is used in common across documents, and is expressed by the following Equation (5).
- When TF/IDF values are used, the count nlw is taken to be the number of words in the corpus to be processed whose TF/IDF value is less than a threshold (a value between 0 and 1) in the important-word list of words that appear frequently and ubiquitously in the target domain of the IND determination sentences, or that do not exist in that important-word list.
- In this case, the non-appearance determination unit 133b calculates the parameter nlw/n representing non-appearance in step S59.
- In step S59, the non-appearance determination unit 133b then determines whether the parameter nlw/n, representing the non-appearance of the words included in the corpus to be processed in the predetermined domain, is larger than a predetermined threshold.
- If, in step S59, the parameter nlw/n representing non-appearance is larger than the predetermined threshold, the process proceeds to step S60, and the non-appearance determination unit 133b extracts the corpus to be processed as a COOD determination sentence and stores it in the COOD determination sentence storage unit 134.
- If, in step S59, the parameter nlw/n representing non-appearance is not larger than the predetermined threshold, that is, if it is equal to or smaller than the threshold, the process proceeds to step S61, and the non-appearance determination unit 133b regards the corpus to be processed as a confirmed IND determination sentence and causes the confirmed IND determination sentence storage unit 135 to store it.
- As shown in Example Ex52 in the figure, the TF values of the words in the corpus group of Example Ex51, listed in descending order, are: “changed” 351, “7” 334, “8” 260, “change” 258, “time” 220, “setting” 159, “6” 152, “do” 148, “alarm” 110, “morning” 64, “set” 56, “wish” 55, ..., “awakening” 6, “alarm clock” 5.
- As shown in Example Ex53 in the figure, the TF/IDF values of the words in the corpus group of Example Ex51, listed in descending order, are: “alarm” 0.0050379225545, “set” 0.00328857409316, “wake up” 0.00100030484831, “tomorrow” 0.000795915410064, “wake up” 0.000763298323913, “wake up” 0.0006226999996573, “song” 0.00060708690425, “morning” 0.000521900115019, “setting” 0.00046290476509, “over” 0.000033639933349, “announcement” 0.000297198910881, “awakening” 0.0002925248238318, “okachi” 0.000196205208903, “alarm clock” 0.000185042017918, ...
- A word with a high TF value or TF/IDF value can be considered a word that appears frequently (that is, has low non-appearance) and is highly important. Therefore, among the corpora included in the IND determination sentences, a corpus containing many words whose TF value or TF/IDF value is at or below a certain threshold is likely to be a COOD determination sentence. Accordingly, the COOD determination sentences are obtained as those corpora, among the corpora included in the IND determination sentences, that contain many words not belonging to the group of words with high TF/IDF values.
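Since Equations (4) and (5) are not reproduced here, the sketch below assumes common textbook definitions (term frequency normalized by document length, and the logarithm of the inverse document frequency) and uses them to compute the nlw/n parameter. All names and the exact formulas are assumptions for illustration, not the patent's own equations.

```python
import math
from collections import Counter

def tf_idf_by_domain(domains):
    """domains: dict mapping domain name -> list of tokenized sentences."""
    domain_tokens = {d: [w for sent in sents for w in sent] for d, sents in domains.items()}
    n_domains = len(domain_tokens)
    document_frequency = Counter()
    for tokens in domain_tokens.values():
        document_frequency.update(set(tokens))
    scores = {}
    for domain, tokens in domain_tokens.items():
        total = len(tokens)
        term_frequency = Counter(tokens)
        scores[domain] = {
            word: (count / total) * math.log(n_domains / document_frequency[word])
            for word, count in term_frequency.items()
        }
    return scores

def non_appearance_by_tfidf(sentence_tokens, domain_scores, threshold):
    """nlw/n: fraction of words whose TF/IDF in the target domain is below the threshold
    (words absent from the domain's important-word list count as below it)."""
    n = len(sentence_tokens)
    if n == 0:
        return 0.0
    nlw = sum(1 for word in sentence_tokens if domain_scores.get(word, 0.0) < threshold)
    return nlw / n
```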
- Next, the CIND corpus extraction process will be described. In step S101, the CIND corpus extraction unit 137 receives as input an unprocessed corpus from among the corpora consisting of OOD determination sentences stored in the OOD determination sentence storage unit 136, and sets it as the corpus to be processed.
- In step S102, the CIND corpus extraction unit 137 controls the non-sentence determination unit 137a to calculate the Perplexity value of the corpus to be processed.
- In step S103, the non-sentence determination unit 137a determines whether the corpus to be processed is a non-sentence based on whether its calculated Perplexity value is larger than a predetermined threshold.
- If, in step S103, the Perplexity value PPL is larger than the predetermined threshold, the process proceeds to step S105.
- In step S105, the non-sentence determination unit 137a regards the corpus to be processed as a non-sentence, discards it as a discard determination sentence, and the process proceeds to step S106.
- If, in step S103, the Perplexity value PPL is not larger than the predetermined threshold, that is, if it is equal to or smaller than the threshold, the non-sentence determination unit 137a regards the corpus to be processed as a meaningful sentence, and the process proceeds to step S104.
- In step S104, the non-sentence determination unit 137a stores the corpus to be processed.
- In step S106, the CIND corpus extraction unit 137 determines whether there is an unprocessed corpus among the corpora serving as OOD determination sentences stored in the OOD determination sentence storage unit 136; if there is, the process returns to step S101.
- That is, steps S101 to S106 are repeated so that the Perplexity value PPL is calculated for all the OOD determination sentences and each is judged, by comparison with the predetermined threshold, to be either a non-sentence or a corpus consisting of meaningful sentences.
- When it is determined in step S106 that there is no unprocessed corpus, that is, when the Perplexity value PPL has been calculated for all the OOD determination sentences and each has been judged by comparison with the predetermined threshold to be either a meaningful corpus or a non-sentence, the process proceeds to step S107.
- In step S107, the CIND corpus extraction unit 137 accepts, as the corpus to be processed, an unprocessed corpus from among the OOD determination sentences that were stored in steps S102 and S103 by the non-sentence determination unit 137a as corpora consisting of meaningful sentences rather than non-sentences.
- In step S108, the CIND corpus extraction unit 137 controls the non-appearance determination unit 137b to calculate the parameter no/n representing the non-appearance, in the target domain, of the words included in the corpus to be processed.
- In step S109, the non-appearance determination unit 137b determines whether the parameter no/n, representing the non-appearance of the words included in the corpus to be processed in the predetermined domain, is larger than a predetermined threshold.
- If, in step S109, the parameter no/n representing non-appearance is larger than the predetermined threshold, the process proceeds to step S110.
- In step S110, the non-appearance determination unit 137b regards the corpus to be processed as a confirmed OOD determination sentence and stores it in the confirmed OOD determination sentence storage unit 139.
- If, in step S109, the parameter no/n representing non-appearance is not larger than the predetermined threshold, that is, if it is equal to or smaller than the threshold, the process proceeds to step S111.
- In step S111, the non-appearance determination unit 137b regards the corpus to be processed as a CIND determination sentence and causes the CIND determination sentence storage unit 138 to store it.
- In step S112, the CIND corpus extraction unit 137 determines whether there is an unprocessed OOD determination sentence. If there is, the process returns to step S107. That is, steps S107 to S112 are repeated until no unprocessed OOD determination sentence remains.
- When it is determined in step S112 that there is no unprocessed OOD determination sentence, that is, when all the OOD determination sentences are regarded as processed, the process ends.
- By the above processing, among the corpora serving as OOD determination sentences, a corpus that is not a non-sentence (its Perplexity value is not high) and whose included words have low non-appearance is regarded as a CIND determination sentence and stored in the CIND determination sentence storage unit 138, while a corpus that is not a non-sentence and whose included words have high non-appearance is regarded as a confirmed OOD determination sentence and stored in the confirmed OOD determination sentence storage unit 139.
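Putting the two extraction processes together, the overall decision rules can be sketched as follows. Here `perplexity_of`, `non_appearance_of`, and both thresholds are placeholders for the components and values described above, not names used by the patent.

```python
def classify_candidates(ind_candidates, ood_candidates,
                        perplexity_of, non_appearance_of,
                        ppl_threshold, appearance_threshold):
    """Split IND/OOD candidates into confirmed IND, COOD, CIND and confirmed OOD sentences."""
    confirmed_ind, cood, cind, confirmed_ood = [], [], [], []
    for sentence in ind_candidates:                      # COOD corpus extraction process
        if perplexity_of(sentence) > ppl_threshold:
            continue                                     # non-sentence: discard
        if non_appearance_of(sentence) > appearance_threshold:
            cood.append(sentence)
        else:
            confirmed_ind.append(sentence)
    for sentence in ood_candidates:                      # CIND corpus extraction process
        if perplexity_of(sentence) > ppl_threshold:
            continue                                     # non-sentence: discard
        if non_appearance_of(sentence) > appearance_threshold:
            confirmed_ood.append(sentence)
        else:
            cind.append(sentence)
    return confirmed_ind, cood, cind, confirmed_ood
```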
- Although it is desirable to perform the extraction of CIND determination sentences manually as well, the narrowing-down for that purpose can be automated in this way, so the number of work steps can be reduced.
- As with the COOD corpus extraction process, non-appearance may also be evaluated using TF/IDF values: the count nlw is the number of words whose TF/IDF value is less than a threshold (a value between 0 and 1) in the important-word list of words that appear frequently and ubiquitously in the target domain of the IND determination sentences, or that do not exist in that important-word list.
- In this case, the non-appearance determination unit 137b calculates the parameter nlw/n representing non-appearance.
- In step S109, the non-appearance determination unit 137b then determines whether the parameter nlw/n, representing the non-appearance of the words included in the corpus to be processed in the predetermined domain, is larger than a predetermined threshold.
- If, in step S109, the parameter nlw/n representing non-appearance is larger than the predetermined threshold, the process proceeds to step S110, and the corpus to be processed is regarded as a confirmed OOD determination sentence and stored in the confirmed OOD determination sentence storage unit 139.
- If, in step S109, the parameter nlw/n representing non-appearance is not larger than the predetermined threshold, that is, if it is equal to or smaller than the threshold, the process proceeds to step S111, and the non-appearance determination unit 137b regards the corpus to be processed as a CIND determination sentence and causes the CIND determination sentence storage unit 138 to store it.
- By the above processing, the corpus is generated in a state already classified into IND determination sentences (confirmed IND determination sentences), COOD determination sentences, CIND determination sentences, and OOD determination sentences (confirmed OOD determination sentences).
- The order of the processing described above may be changed.
- For example, the COOD corpus extraction process in step S36 and the CIND corpus extraction process in step S37 may be interchanged.
- Likewise, within the COOD corpus extraction process and the CIND corpus extraction process, the non-sentence determination using Perplexity values and the extraction of COOD determination sentences and CIND determination sentences using the parameter representing non-appearance may be performed in the reverse order.
- FIG. 27 shows a configuration example of a general-purpose personal computer.
- This personal computer incorporates a CPU (Central Processing Unit) 1001.
- An input/output interface 1005 is connected to the CPU 1001 via a bus 1004, and a ROM (Read Only Memory) 1002 and a RAM (Random Access Memory) 1003 are also connected to the bus 1004.
- Connected to the input/output interface 1005 are: an input unit 1006 including input devices such as a keyboard and a mouse with which the user enters operation commands; an output unit 1007 that outputs a processing operation screen and images of processing results to a display device; a storage unit 1008, including a hard disk drive, that stores programs and various data; and a communication unit 1009, including a LAN (Local Area Network) adapter and the like, that executes communication processing via a network typified by the Internet.
- Also connected is a drive 1010 that reads data from and writes data to a removable medium 1011 such as a magnetic disk (including a flexible disk), an optical disc (including a CD-ROM (Compact Disc-Read Only Memory) and a DVD (Digital Versatile Disc)), a magneto-optical disc (including an MD (Mini Disc)), or a semiconductor memory.
- The CPU 1001 executes various processes according to a program stored in the ROM 1002, or a program that is read from a removable medium 1011 such as a magnetic disk, an optical disc, a magneto-optical disc, or a semiconductor memory, installed in the storage unit 1008, and loaded from the storage unit 1008 into the RAM 1003.
- The RAM 1003 also stores data necessary for the CPU 1001 to execute these various processes, as appropriate.
- In the computer configured as described above, the series of processes described above is performed by the CPU 1001 loading a program stored in the storage unit 1008 into the RAM 1003 via the input/output interface 1005 and the bus 1004 and executing it.
- the program executed by the computer (CPU 1001) can be provided by being recorded on, for example, a removable medium 1011 as a package medium or the like. Also, the program can be provided via a wired or wireless transmission medium such as a local area network, the Internet, or digital satellite broadcasting.
- the program can be installed in the storage unit 1008 via the input / output interface 1005 by mounting the removable media 1011 in the drive 1010.
- the program can be received by the communication unit 1009 via a wired or wireless transmission medium and installed in the storage unit 1008.
- the program can be installed in advance in the ROM 1002 or the storage unit 1008.
- The program executed by the computer may be a program in which the processes are performed chronologically in the order described in this specification, or a program in which the processes are performed in parallel or at the necessary timing, such as when a call is made.
- the CPU 1001 in FIG. 27 realizes the functions of the semantic analyzer 131, the COOD corpus extraction unit 133, and the CIND corpus extraction unit 137.
- The storage unit 1008 realizes the IND determination sentence storage unit 132, the OOD determination sentence storage unit 136, the COOD determination sentence storage unit 134, the confirmed IND determination sentence storage unit 135, the CIND determination sentence storage unit 138, and the confirmed OOD determination sentence storage unit 139.
- In this specification, a system means a set of a plurality of components (devices, modules (parts), and the like), regardless of whether all the components are housed in the same casing. Therefore, a plurality of devices housed in separate casings and connected via a network, and a single device in which a plurality of modules are housed in one casing, are both systems.
- the present disclosure can have a cloud computing configuration in which one function is shared and processed by a plurality of devices via a network.
- each step described in the above-described flowchart can be executed by one device or in a shared manner by a plurality of devices.
- the plurality of processes included in one step can be executed by being shared by a plurality of devices in addition to being executed by one device.
- the present disclosure can also have the following configurations.
- a structural analysis unit that analyzes the structure of input sentences;
- a replacement part setting unit configured to set a replacement part in the input sentence based on an analysis result of the structure analysis unit;
- An information processing apparatus including: a corpus generation unit that generates a corpus by replacing words in the replacement portion in the input sentence;
- the information processing apparatus according to ⁇ 1>, wherein the input sentence is an IND (In Domain) determination sentence which is an utterance content to be handled by a predetermined application program.
- the structure analysis unit analyzes a predicate term structure of the input sentence.
- the replacement point setting unit sets a replacement point in the input sentence based on the predicate term structure that is the analysis result of the structure analysis unit.
- the information processing apparatus further includes a dictionary query unit that queries a dictionary to search for a candidate for replacing the word of the replacement part in the input sentence,
- the information processing apparatus according to any one of ⁇ 3>, wherein the corpus generation unit replaces the word of the replacement portion in the input sentence with the word searched by the dictionary inquiry unit.
- the dictionary is a case frame dictionary.
- the replacement point setting unit sets a replacement point in the input sentence and a replacement method of the replacement point based on the predicate term structure which is an analysis result of the structure analysis unit,
- the information processing apparatus according to ⁇ 4>, wherein the corpus generation unit generates a corpus by replacing the word of the replacement part in the input sentence with the replacement method.
- The information processing apparatus according to <6>, in which the replacement method includes a first method in which a predicate of the input sentence is fixed and a noun serving as a predicate term including a target case is replaced, and a second method in which the predicate term including the target case of the input sentence is fixed and the predicate is replaced.
- The information processing apparatus according to any one of <1> to <7>, further including a classification unit that classifies the corpus generated by the corpus generation unit into an IND (In Domain) determination sentence, which is utterance content to be handled by a predetermined application program, or an OOD (Out of Domain) determination sentence, which is unexpected utterance content that should not be handled by the predetermined application program.
- The information processing apparatus according to <8>, further including a COOD determination sentence extraction unit that extracts, from the corpus classified as the IND determination sentences, a corpus that is the OOD determination sentence and exists in the vicinity of the boundary in the feature space represented by the respective features of the OOD determination sentence and the IND determination sentence, as a COOD (Close OOD) determination sentence.
- The information processing apparatus according to <9>, in which the COOD determination sentence extraction unit extracts, from a domain including the corpus classified as the IND determination sentences, a corpus in which the number of words not included in itself and the other corpora is more than a predetermined number, as the COOD determination sentence.
- The information processing apparatus according to <10>, in which the COOD determination sentence extraction unit extracts from the domain, as the COOD determination sentence, a corpus whose non-appearance, represented by the ratio of the number of words not included in itself and the other corpora to the number of words included in the corpus of the domain, is higher than a predetermined value.
- The information processing apparatus according to <10>, in which the COOD determination sentence extraction unit extracts from the domain, as the COOD determination sentence, a corpus containing many words whose non-appearance, represented by TF/IDF consisting of a TF (Term Frequency) value and an IDF (Inverse Document Frequency) value in the corpus of the domain, is high, that is, whose TF/IDF value is lower than a predetermined value.
- the COOD determination sentence extraction unit calculates a Perplexity value for the corpus of the domain, and discards, as non-sentences, corpora whose Perplexity value is higher than a predetermined value.
- The information processing apparatus according to <8>, further including a CIND determination sentence extraction unit that extracts, from the corpus classified as the OOD determination sentences, a corpus that is the IND determination sentence and exists in the vicinity of the boundary in the feature space represented by the respective features of the IND determination sentence and the OOD determination sentence, as a CIND (Close IND) determination sentence. <15>
- The information processing apparatus according to <14>, in which the CIND determination sentence extraction unit extracts, from all the corpora classified as the OOD determination sentences, a corpus in which the number of words included in the IND corpus is more than a predetermined number, as the CIND determination sentence.
- the CIND determination sentence extraction unit extracts, as the CIND determination sentence, a corpus whose non-appearance, represented by the ratio of the number of words not included in the IND corpus to the number of words included in the corpus of the domain, is lower than a predetermined value.
- the CIND determination sentence extraction unit extracts, as the CIND determination sentence, a corpus in which the non-appearance of words, represented by TF/IDF consisting of a TF (Term Frequency) value and an IDF (Inverse Document Frequency) value in the corpus of the domain, is lower than a predetermined value.
- a structural analysis unit that analyzes the structure of the input sentence
- a replacement part setting unit configured to set a replacement part in the input sentence based on an analysis result of the structure analysis unit;
- a program for causing a computer to function as the above units and as a corpus generation unit that generates a corpus by replacing the word of the replacement portion in the input sentence.
- 51 corpus generation apparatus, 101 IND sentence reception unit, 102 language analysis unit, 103 replacement location setting unit, 104 dictionary query unit, 105 replacement execution unit, 106 duplicate sentence exclusion unit, 107 case frame dictionary, 108 generation condition setting data storage unit, 109 substitution generated sentence storage unit, 110 filtering processing unit, 131 semantic analyzer, 132 IND determination sentence storage unit, 133 COOD corpus extraction unit, 133a non-sentence determination unit, 133b non-appearance determination unit, 134 COOD determination sentence storage unit, 135 confirmed IND determination sentence storage unit, 136 OOD determination sentence storage unit, 137 CIND corpus extraction unit, 137a non-sentence determination unit, 137b non-appearance determination unit, 138 CIND determination sentence storage unit, 139 confirmed OOD determination sentence storage unit
Abstract
The present disclosure pertains to an information processing device, an information processing method, and a program with which it is possible to lighten a load pertaining to the creation of a corpus needed for developing a semantic analyzer and reduce development costs. A corpus is generated by analyzing the predicate-argument structure of a manually generated In Domain (IND) corpus, setting a substitution location, searching for a word similar to the word at the substitution location by a case-frame dictionary, and substituting the word at the substitution location with the word resulting from the search. The present disclosure generates a corpus used for learning in a semantic analyzer. The present disclosure can be applied to a corpus generation device.
Description
The present disclosure relates to an information processing apparatus, an information processing method, and a program, and in particular to an information processing apparatus, an information processing method, and a program that can reduce the development cost of a corpus that contributes to improving the accuracy of a semantic analyzer.
A speech dialogue system converts utterance content into text data, semantically analyzes the text data, and recognizes the utterance content.
To recognize the utterance content, a semantic analyzer is used that analyzes and recognizes utterances through machine learning using a corpus (a collection of example sentences).
The semantic analyzer analyzes and recognizes utterance content through machine learning using a corpus of the utterances to be handled by each application program.
Among speech dialogue systems, multi-domain speech dialogue systems are widely used so that a single system can handle multiple topics, tasks, and application programs, such as weather inquiries, schedule confirmation, and music playback.
A multi-domain speech dialogue system must make it easy to add a semantic analysis function for a new domain. For this reason, architectures that build a semantic analysis system by combining the semantic analyzers of individual domains have been widely proposed. Such an architecture consists of the semantic analysis systems of a plurality of domains and a domain selector (frame estimator) that integrates them (Non-Patent Document 1).
A multi-domain speech dialogue system therefore requires a technique for realizing a semantic analyzer that can recognize the utterance content of a variety of application programs by learning with the corpus required for each domain.
It is also necessary to prevent the semantic analysis processing from transitioning to the wrong domain when an unexpected utterance (Out of Domain utterance, hereinafter also referred to as an OOD utterance) is received. Ideally, an OOD corpus would be prepared and the frame estimator retrained, but because developing an OOD corpus takes many work steps, various approaches have been discussed, such as estimation that also makes use of the dialogue history (see Non-Patent Document 2).
However, the corpora of Non-Patent Documents 1 and 2 are created manually, and recognizing the utterance content of a variety of application programs requires even more corpora, so the load of corpus creation is a large part of the development cost of a semantic analyzer.
The present disclosure has been made in view of such circumstances, and in particular reduces the development cost of a semantic analyzer by making it possible to develop the corpus required for learning efficiently.
An information processing apparatus according to one aspect of the present disclosure includes a structure analysis unit that analyzes the structure of an input sentence, a replacement location setting unit that sets a replacement location in the input sentence based on the analysis result of the structure analysis unit, and a corpus generation unit that generates a corpus by replacing the word at the replacement location in the input sentence.
The input sentence may be an IND (In Domain) determination sentence, that is, utterance content to be handled by a predetermined application program.
The structure analysis unit may analyze the predicate-argument structure of the input sentence, and the replacement location setting unit may set the replacement location in the input sentence based on the predicate-argument structure obtained as the analysis result of the structure analysis unit.
The apparatus may further include a dictionary query unit that queries a dictionary to search for candidate words with which to replace the word at the replacement location in the input sentence, and the corpus generation unit may replace the word at the replacement location in the input sentence with a word found by the dictionary query unit.
The dictionary may be a case frame dictionary.
The replacement location setting unit may set, based on the predicate-argument structure obtained as the analysis result of the structure analysis unit, a replacement location in the input sentence and a replacement method for that location, and the corpus generation unit may generate a corpus by replacing the word at the replacement location in the input sentence using that replacement method.
The replacement method may include a first method in which the predicate of the input sentence is fixed and a noun serving as a predicate argument including the target case is replaced, and a second method in which the predicate argument including the target case of the input sentence is fixed and the predicate is replaced.
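As a loose illustration of the two replacement methods, the sketch below replaces either the target-case argument or the predicate by simple string substitution. A real implementation would operate on the predicate-argument structure and a case-frame dictionary rather than on raw strings, and all names here are illustrative assumptions.

```python
def replace_argument(sentence, argument_noun, candidate_nouns):
    """First method: keep the predicate, replace the noun filling the target case."""
    return [sentence.replace(argument_noun, noun) for noun in candidate_nouns]

def replace_predicate(sentence, predicate, candidate_predicates):
    """Second method: keep the target-case argument, replace the predicate."""
    return [sentence.replace(predicate, candidate) for candidate in candidate_predicates]
```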
The apparatus may further include a classification unit that classifies the corpus generated by the corpus generation unit into IND (In Domain) determination sentences, which are utterance content to be handled by a predetermined application program, or OOD (Out of Domain) determination sentences, which are unexpected utterance content that should not be handled by the predetermined application program.
The apparatus may further include a COOD determination sentence extraction unit that extracts, from the corpus classified as the IND determination sentences, a corpus that is an OOD determination sentence and exists near the boundary with the IND determination sentences in the feature space expressed by their respective features, as a COOD (Close OOD) determination sentence.
The COOD determination sentence extraction unit may extract, from a domain containing the corpus classified as the IND determination sentences, a corpus in which the number of words not included in itself and the other corpora exceeds a predetermined number, as the COOD determination sentence.
The COOD determination sentence extraction unit may extract from the domain, as the COOD determination sentence, a corpus whose non-appearance, expressed as the ratio of the number of words not included in itself and the other corpora to the number of words included in the corpus of the domain, is higher than a predetermined value.
The COOD determination sentence extraction unit may extract from the domain, as the COOD determination sentence, a corpus in which the non-appearance of words, expressed by TF/IDF consisting of a TF (Term Frequency) value and an IDF (Inverse Document Frequency) value in the corpus of the domain, is higher than a predetermined value.
The COOD determination sentence extraction unit may calculate a Perplexity value for the corpus of the domain and discard, as non-sentences, corpora whose Perplexity value is higher than a predetermined value.
The apparatus may further include a CIND determination sentence extraction unit that extracts, from the corpus classified as the OOD determination sentences, a corpus that is an IND determination sentence and exists near the boundary with the OOD determination sentences in the feature space expressed by their respective features, as a CIND (Close IND) determination sentence.
The CIND determination sentence extraction unit may extract, from all the corpora classified as the OOD determination sentences in a domain containing the corpus classified as the OOD determination sentences, a corpus in which the number of words included in the IND corpus exceeds a predetermined number, as a CIND determination candidate sentence.
The CIND determination sentence extraction unit may extract, as a CIND determination candidate sentence, a corpus whose non-appearance, expressed as the ratio of the number of words not included in the IND corpus to the number of words included in the corpus of the domain, is lower than a predetermined value.
The CIND determination sentence extraction unit may extract, as the CIND determination sentence, a corpus in which the non-appearance of words, expressed by TF/IDF consisting of a TF (Term Frequency) value and an IDF (Inverse Document Frequency) value in the corpus of the domain, is lower than a predetermined value.
The CIND determination sentence extraction unit may calculate a Perplexity value for the corpus of the domain and discard, as non-sentences, corpora whose Perplexity value is higher than a predetermined value.
An information processing method according to one aspect of the present disclosure includes the steps of analyzing the structure of an input sentence, setting a replacement location in the input sentence based on the analysis result of the structure, and generating a corpus by replacing the word at the replacement location in the input sentence.
A program according to one aspect of the present disclosure causes a computer to function as a structure analysis unit that analyzes the structure of an input sentence, a replacement location setting unit that sets a replacement location in the input sentence based on the analysis result of the structure analysis unit, and a corpus generation unit that generates a corpus by replacing the word at the replacement location in the input sentence.
In one aspect of the present disclosure, the structure of an input sentence is analyzed, a replacement location in the input sentence is set based on the analysis result of the structure, and a corpus is generated by replacing the word at the replacement location in the input sentence.
According to one aspect of the present disclosure, it is possible in particular to reduce the development cost of a semantic analyzer.
Hereinafter, preferred embodiments of the present disclosure will be described in detail with reference to the accompanying drawings. In this specification and the drawings, components having substantially the same functional configuration are given the same reference numerals, and redundant descriptions are omitted.
<About the semantic analysis system>
Before describing a corpus generation device to which the technology of the present disclosure is applied, a semantic analysis system that uses a corpus generated by the corpus generation device will be described.
The semantic analysis system recognizes the user's utterance and causes the corresponding application program to run.
As shown in FIG. 1, the semantic analysis system includes, for example, an input reception unit 11, a speech recognition unit 12, a frame estimation unit 13, semantic analysis units 14-1 to 14-3 for the respective domains, and application programs 18 to 20.
The input reception unit 11 receives the user's utterance as a speech signal and outputs it to the speech recognition unit 12.
The speech recognition unit 12 recognizes the speech signal, converts it into a text string, and outputs the text string to the frame estimation unit 13.
Based on the text string, the frame estimation unit 13 transfers processing to the semantic analysis unit 14-1, 14-2, or 14-3 of the most appropriate application (domain) in the subsequent stage. The frame estimation unit 13 also rejects text strings that do not belong to any domain.
Based on the text string, the semantic analysis units 14-1 to 14-3 analyze attributes and their corresponding values and supply the corresponding analysis results 15 to 17 to the application program that is the target of the action: the weather guidance application program 18, the schedule confirmation application program 19, or the music playback application program 20. When there is no particular need to distinguish them, the semantic analysis units 14-1 to 14-3 are simply referred to as the semantic analysis unit 14.
More specifically, when an utterance input V1 such as "What is the weather in Tokyo tomorrow?" is received as a speech signal by the input reception unit 11, the speech recognition unit 12 recognizes it as the text string "What is the weather in Tokyo tomorrow" and supplies it to the frame estimation unit 13.
When the frame estimation unit 13 determines that the utterance is directed to the weather guidance application program 18, it transfers processing to the semantic analysis unit 14-1 of the weather guidance domain.
The semantic analysis unit 14-1 analyzes the weather-related utterance based on this text string and supplies the application program 18 with an analysis result 15 indicating, among other things, that the information for "where" is "Tokyo" and the information for "when" is "tomorrow".
Based on the analysis result 15, the weather guidance application program 18 displays tomorrow's weather guidance for Tokyo.
Similarly, when an utterance input V2 such as "What are my plans from 15:00 today?" is received as a speech signal by the input reception unit 11, the speech recognition unit 12 recognizes it as the text string "What are my plans from 15:00 today" and supplies it to the frame estimation unit 13.
When the frame estimation unit 13 determines that the utterance is directed to the schedule confirmation application program 19, it transfers processing to the semantic analysis unit 14-2 of the schedule confirmation domain.
The semantic analysis unit 14-2 analyzes the schedule-related utterance based on this text string and supplies the application program 19 with an analysis result 16 indicating, among other things, that the utterance is directed to the schedule confirmation application program 19, that the information for "date" is "today", and that the information for "time" is "15:00".
Based on the analysis result 16, the schedule confirmation application program 19 displays the schedule for 15:00 today.
Further, when an utterance input V3 such as "Play Higashino Naka's new song!" is received as a speech signal by the input reception unit 11, the speech recognition unit 12 recognizes it as the text string "Play Higashino Naka's new song" and supplies it to the frame estimation unit 13.
When the frame estimation unit 13 determines that the utterance is directed to the music playback application program 20, it transfers processing to the semantic analysis unit 14-3 of the music playback domain.
The semantic analysis unit 14-3 analyzes the music playback utterance based on this text string and supplies the application program 20 with an analysis result 17 indicating, among other things, that the utterance is directed to the music playback application program 20, that the information for "artist" is "Higashino Naka", and that the information for "music" is "new song".
Based on the analysis result 17, the music playback application program 20 plays Higashino Naka's new song.
Here, in order to analyze information consisting of text strings, the semantic analysis unit 14 performs machine learning using a corpus, which is a collection of example sentences.
Corpora are broadly divided into corpora (also referred to as IND corpora) consisting of utterance content that the application program should handle (In Domain utterances; hereinafter also referred to as IND), and corpora (also referred to as OOD corpora) consisting of utterance content that the application program cannot handle (Out of Domain utterances; hereinafter also referred to as OOD utterances).
By learning the IND corpus, the semantic analysis unit 14 becomes able to analyze and recognize the utterance content that the application program should handle. Similarly, by learning the OOD corpus together with the IND corpus, the frame estimation unit 13 becomes able to transfer processing to the semantic analysis unit 14 of the correct domain and to reject unexpected utterances. That is, by learning the IND corpus and the OOD corpus, the frame estimation unit 13 analyzes which utterances should be handled and which should be rejected, and becomes able to recognize the utterance content appropriately.
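As a rough illustration of this learning step, the following minimal Python sketch (an assumption; the disclosure does not specify a particular learning method, model, or feature set) trains a smoothed bag-of-words scorer on a few labeled IND and OOD example sentences and uses it to accept or reject a new utterance:
import math
from collections import Counter

# Toy labeled corpus: IND = utterances the weather-guidance domain should handle,
# OOD = utterances it should reject. The example sentences are invented for illustration.
ind_corpus = ["what is the weather in tokyo tomorrow", "will it rain this weekend"]
ood_corpus = ["play my favorite song", "set an alarm at seven"]

def word_counts(sentences):
    counts = Counter()
    for s in sentences:
        counts.update(s.split())
    return counts

ind_counts, ood_counts = word_counts(ind_corpus), word_counts(ood_corpus)

def score(sentence, counts, total):
    # Add-one smoothed log-likelihood-style score of the sentence under one class.
    return sum(math.log((counts[w] + 1) / (total + len(counts) + 1)) for w in sentence.split())

def classify(sentence):
    ind_total, ood_total = sum(ind_counts.values()), sum(ood_counts.values())
    ind_score = score(sentence, ind_counts, ind_total)
    ood_score = score(sentence, ood_counts, ood_total)
    return "IND" if ind_score > ood_score else "OOD"

print(classify("what is the weather in osaka"))  # expected: IND
print(classify("play a new song"))               # expected: OOD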
Incidentally, training both the frame estimation unit 13 and the semantic analysis unit 14 requires many corpora, but corpora are generally created by hand.
For example, even for a specific service such as a weather guidance application program, a wide variety of utterance phrasings must be assumed, and although it depends on the algorithm, more than 1,000 utterance examples are generally required.
Furthermore, as the number of service types increases, a separate corpus must be created for each application program corresponding to the added service types.
However, creating corpora by hand is extremely costly and places a heavy burden on development. In particular, the frame estimation unit 13 must be provided with OOD utterances for each domain, so that, for example, roughly twice as many corpora are needed in terms of the total corpus count.
A method of replacing some of the words or phrases in the sentences that make up a corpus with a software program is also widely used, but specifying the parts to be replaced and coding rule statements for the replacement words on a pattern-by-pattern basis is cumbersome, and in addition there is the cost of judging and sorting the completed sentences again by hand, so the burden remains heavy.
Furthermore, a method of creating a corpus by collecting similar sentences from the vast amount of text on the Internet has also been proposed, but most sentences on the Internet are written language and contain few utterance examples.
<About COOD determination sentences>
Incidentally, the corpora determined to be an IND corpus by the frame estimation unit 13 or the semantic analysis unit 14 are limited by recognition accuracy, and therefore may partly include corpora that should be regarded as OOD corpora. Such an OOD determination sentence was originally an IND determination sentence that was turned into an OOD determination sentence by an erroneous determination, and can therefore be considered a corpus close to the IND determination sentences.
For example, consider expressing the features of the meaning represented by corpora as a distribution in a feature space, as shown in FIG. 2, using features 1 and 2 of the words contained in the corpora. In the example of FIG. 2, the black circles are the distribution of corpora regarded as IND determination sentences, and the crosses are the distribution of corpora regarded as OOD determination sentences.
Among the corpora regarded as OOD determination sentences, indicated by the crosses in FIG. 2, those that lie in the vicinity of the corpora regarded as IND determination sentences can be considered corpora similar to the IND determination sentences. Among the corpora regarded as OOD determination sentences, those lying near the distribution of the corpora regarded as IND determination sentences are called COOD (Close Out of Domain) determination sentences.
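As a rough illustration, the following sketch flags OOD determination sentences that lie close to the IND cluster in such a feature space; the two-dimensional feature vectors, the distance measure, and the COOD_RADIUS threshold are all assumptions made for this example:
import math

# Hypothetical 2-D feature vectors (feature 1, feature 2) as in FIG. 2.
ind_points = [(0.8, 0.7), (0.9, 0.6), (0.7, 0.8)]               # IND determination sentences
ood_points = [(0.1, 0.2), (0.6, 0.6), (0.2, 0.1), (0.7, 0.5)]   # OOD determination sentences

def nearest_ind_distance(point):
    # Distance from an OOD point to the closest IND point.
    return min(math.dist(point, q) for q in ind_points)

COOD_RADIUS = 0.25  # assumed threshold; the text does not quantify "near the IND distribution"

cood = [p for p in ood_points if nearest_ind_distance(p) <= COOD_RADIUS]
far_ood = [p for p in ood_points if nearest_ind_distance(p) > COOD_RADIUS]
print("COOD candidates:", cood)     # OOD points lying close to the IND cluster
print("other OOD points:", far_ood)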
Although a COOD determination sentence is an OOD determination sentence, it is a corpus similar to an IND determination sentence; in other words, it can be considered a corpus that resembles an IND determination sentence but is not one. Furthermore, a COOD determination sentence is a highly confusing expression to distinguish from an IND determination sentence, and can also be considered a corpus that easily causes erroneous determinations.
To improve the recognition accuracy of the frame estimation unit 13 and the semantic analysis unit 14, it is important to train them on a necessary and sufficient amount of corpora consisting of such COOD determination sentences. The corpus generation device of the present disclosure makes it possible to generate, easily and in large quantities, corpora including COOD corpora, which are particularly effective for improving recognition accuracy, thereby reducing the load of corpus development.
<Configuration example of the corpus generation device of the present disclosure>
The corpus generation device of the present disclosure reduces the development cost of the frame estimation unit 13 and the semantic analysis unit 14 by making it possible to efficiently generate the corpora required for training configurations corresponding to the frame estimation unit 13 and the semantic analysis unit 14 in FIG. 1.
FIG. 3 shows a configuration example of an embodiment of the corpus generation device of the present disclosure.
The corpus generation device 51 receives, as input sentences, a corpus consisting of IND sentences generated manually or by some other method, generates a corpus consisting of substitution generated sentences by replacing words through language analysis and the like, and classifies the generated corpus by filtering processing into corpora of COOD determination sentences, OOD determination sentences, IND determination sentences, and CIND determination sentences. Here, a CIND (Close IND) determination sentence is a corpus that, among the corpora classified as OOD determination sentences, is similar to the IND determination sentences. That is, performing machine learning on the corpus of CIND determination sentences can also improve the recognition accuracy of the frame estimation unit 13 and the semantic analysis unit 14.
The corpus generation device 51 includes an IND sentence reception unit 101, a language analysis unit 102, a replacement location setting unit 103, a dictionary inquiry unit 104, a replacement execution unit 105, a duplicate sentence exclusion unit 106, a case frame dictionary 107, a generation condition setting data storage unit 108, a substitution generated sentence storage unit 109, and a filtering processing unit 110.
The IND sentence reception unit 101 receives input of IND sentences generated manually or by other methods and outputs them to the language analysis unit 102.
The language analysis unit 102 analyzes the morphemes, phrases, and predicate-argument structure of each IND sentence output from the IND sentence reception unit 101, and outputs the analysis results to the replacement location setting unit 103.
The replacement location setting unit 103 sets replacement conditions based on the analysis result of the predicate-argument structure of the IND sentence, and outputs them to the dictionary inquiry unit 104.
The dictionary inquiry unit 104 queries the case frame dictionary 107, searches, using the set replacement method, for words for the replacement positions set in the IND determination sentences on the basis of the replacement conditions, and outputs the search results to the replacement execution unit 105.
The replacement execution unit 105 replaces the word at the replacement position set on the basis of the replacement conditions with a word found by the set replacement method, generating a new corpus. At this time, it adjusts the sentence ending and the like of the newly generated corpus on the basis of the generation condition setting data stored in the generation condition setting data storage unit 108. The sentences generated as a result of the above processing are output to the duplicate sentence exclusion unit 106.
The duplicate sentence exclusion unit 106 determines whether the corpus output from the replacement execution unit 105 is a duplicate sentence; if it is, the sentence is regarded as a discard determination sentence and discarded. If the corpus output from the replacement execution unit 105 is not a duplicate sentence, the duplicate sentence exclusion unit 106 stores it as a substitution generated sentence in the substitution generated sentence storage unit 109.
The case frame dictionary 107 is a dictionary in which predicates are classified by word sense (case frame) and the words that attach to each predicate with a specific case are grouped together; the dictionary inquiry unit 104 uses it to search for words matching the word sense of the case frame set as the replacement location.
The generation condition setting data storage unit 108 stores generation condition setting data, which are condition data for generating the corpus adjusted by the replacement execution unit 105; the replacement execution unit 105 adjusts the replaced corpus on the basis of the generation condition setting data in the generation condition setting data storage unit 108.
The filtering processing unit 110 classifies the corpus stored in the substitution generated sentence storage unit 109 into corpora of OOD determination sentences, COOD determination sentences, IND determination sentences, and CIND determination sentences. A configuration example of the filtering processing unit 110 will be described in detail later with reference to FIG. 4.
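The overall flow through these blocks can be pictured with the following sketch; the function bodies are placeholders standing in for the processing of each unit in FIG. 3, and the names and toy data are assumptions rather than part of the disclosure:
# Minimal pipeline sketch following the block arrangement of FIG. 3.
def analyze(ind_sentence):                        # language analysis unit 102 (placeholder)
    return {"sentence": ind_sentence, "predicate": ind_sentence.split()[0]}  # toy: first word as predicate

def set_replacement(analysis):                    # replacement location setting unit 103 (placeholder)
    return {"method": "action_fixed_category_replace", "slot": "object"}

def query_dictionary(analysis, condition):        # dictionary inquiry unit 104 + case frame dictionary 107
    return ["meeting", "participant"]             # toy replacement candidates

def execute_replacement(analysis, candidates):    # replacement execution unit 105 (placeholder)
    return [analysis["sentence"].replace("alarm", w) for w in candidates]  # toy: swap the object-case word

def exclude_duplicates(sentences, seen):          # duplicate sentence exclusion unit 106
    return [s for s in sentences if s not in seen]

seen, generated = set(), []
for ind in ["set an alarm at seven"]:             # IND sentence reception unit 101
    analysis = analyze(ind)
    condition = set_replacement(analysis)
    candidates = query_dictionary(analysis, condition)
    new_sentences = exclude_duplicates(execute_replacement(analysis, candidates), seen)
    seen.update(new_sentences)
    generated.extend(new_sentences)               # substitution generated sentence storage unit 109

print(generated)  # handed to the filtering processing unit 110 for IND/OOD/COOD/CIND classification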
<Filtering processing unit>
Next, a configuration example of the filtering processing unit 110 in FIG. 3 will be described with reference to FIG. 4.
The filtering processing unit 110 includes a semantic analyzer 131, an IND determination sentence storage unit 132, a COOD corpus extraction unit 133, a COOD determination sentence storage unit 134, a confirmed IND determination sentence storage unit 135, an OOD determination sentence storage unit 136, a CIND corpus extraction unit 137, a CIND determination sentence storage unit 138, and a confirmed OOD determination sentence storage unit 139.
The semantic analyzer 131 is a semantic analyzer (corresponding to the semantic analysis unit 14 in FIG. 1) trained on an old version of the corpus (the corpus generated so far). For each corpus of substitution generated sentences stored in the substitution generated sentence storage unit 109, it determines whether the sentence is utterance content that a predetermined application program should handle (hereinafter also referred to as an IND determination sentence) or utterance content that the predetermined application program cannot handle and should reject (hereinafter also referred to as an OOD determination sentence). The semantic analyzer 131 stores the IND determination sentences in the IND determination sentence storage unit 132 and the OOD determination sentences in the OOD determination sentence storage unit 136.
The COOD (Close OOD) corpus extraction unit 133 extracts, from the IND determination sentences stored in the IND determination sentence storage unit 132, the corpora classified as COOD determination sentences and stores them in the COOD determination sentence storage unit 134, and stores the remaining IND determination sentences in the confirmed IND determination sentence storage unit 135 as confirmed IND determination sentences. The COOD corpus extraction unit 133 also discards, as discard determination sentences, corpora among the IND determination sentences that are non-sentences and are classified neither as COOD determination sentences nor as confirmed IND determination sentences. Here, a COOD determination sentence is an OOD determination sentence that lies near the boundary with the IND determination sentences. The COOD determination sentences will be described in detail later with reference to FIG. 5.
The COOD corpus extraction unit 133 includes a non-sentence determination unit 133a and a non-appearance determination unit 133b. It controls the non-sentence determination unit 133a to determine whether each sentence is a non-sentence, regards non-sentences as discard determination sentences, and discards them. The COOD corpus extraction unit 133 also controls the non-appearance determination unit 133b to determine, on the basis of the non-appearance property of the corpora not regarded as non-sentences, whether each IND determination sentence is a COOD determination sentence or a confirmed IND determination sentence; it extracts the COOD determination sentences and stores them in the COOD determination sentence storage unit 134, and stores the confirmed IND determination sentences in the confirmed IND determination sentence storage unit 135.
The non-sentence determination unit 133a calculates a perplexity value, which is an index of how much a corpus of IND determination sentences reads like a meaningful sentence, determines whether each sentence is a non-sentence by comparing the perplexity value with a predetermined threshold, and discards corpora regarded as non-sentences as discard determination sentences.
For each corpus not regarded as a non-sentence, the non-appearance determination unit 133b calculates a parameter indicating non-appearance, that is, whether the corpus contains words that rarely appear in the corpus group of the domain for which it was determined to be an IND determination sentence. By comparing this non-appearance parameter with a predetermined threshold, the non-appearance determination unit 133b extracts, as COOD determination sentences, the corpora containing words that rarely appear in the corpus group of that domain. Furthermore, the non-appearance determination unit 133b stores the extracted COOD determination sentences in the COOD determination sentence storage unit 134, regards the remaining corpora of IND determination sentences as confirmed IND determination sentences, and stores them in the confirmed IND determination sentence storage unit 135.
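A minimal sketch of this two-stage COOD extraction is shown below; the unigram perplexity model, the non-appearance measure (here simply the fraction of words unseen in the domain corpus), and the thresholds are stand-ins, since the disclosure does not fix concrete formulas:
import math
from collections import Counter

# Toy in-domain corpus for the domain the sentences were judged to belong to.
domain_corpus = ["set an alarm at seven", "set a timer for ten minutes", "wake me up at six"]
domain_counts = Counter(w for s in domain_corpus for w in s.split())
domain_total = sum(domain_counts.values())
vocab = len(domain_counts)

def perplexity(sentence):
    # Add-one smoothed unigram perplexity; stands in for the Perplexity value of the text.
    words = sentence.split()
    log_prob = sum(math.log((domain_counts[w] + 1) / (domain_total + vocab)) for w in words)
    return math.exp(-log_prob / len(words))

def non_appearance(sentence):
    # Fraction of words never seen in the domain corpus; stands in for the non-appearance parameter.
    words = sentence.split()
    return sum(1 for w in words if domain_counts[w] == 0) / len(words)

PPL_MAX, NON_APPEAR_MIN = 20.0, 0.2   # assumed thresholds

def classify_ind_sentence(sentence):
    if perplexity(sentence) > PPL_MAX:
        return "discard"              # non-sentence
    if non_appearance(sentence) >= NON_APPEAR_MIN:
        return "COOD"                 # contains words unlikely to appear in this domain
    return "confirmed IND"

for s in ["set an alarm at seven", "set a surgery at seven", "alarm alarm banana banana banana"]:
    print(s, "->", classify_ind_sentence(s))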
The CIND (Close IND) corpus extraction unit 137 extracts, from the OOD determination sentences stored in the OOD determination sentence storage unit 136, the corpora classified as CIND determination sentences, stores them in the CIND determination sentence storage unit 138, and stores the remaining OOD determination sentences in the confirmed OOD determination sentence storage unit 139 as confirmed OOD determination sentences. The CIND corpus extraction unit 137 also discards, as discard determination sentences, corpora among the OOD determination sentences that are classified neither as CIND determination sentences nor as confirmed OOD determination sentences. Here, a CIND determination sentence is an IND determination sentence that lies near the boundary with the OOD determination sentences. Which domain each CIND determination sentence belongs to is ultimately determined manually or by similar means. The CIND determination sentences will be described in detail later with reference to FIG. 5.
The CIND corpus extraction unit 137 includes a non-sentence determination unit 137a and a non-appearance determination unit 137b. It controls the non-sentence determination unit 137a to determine whether each sentence is a non-sentence, regards non-sentences as discard determination sentences, and discards them. The CIND corpus extraction unit 137 also controls the non-appearance determination unit 137b to determine whether each corpus not regarded as a non-sentence is a CIND determination sentence or a confirmed OOD determination sentence; it extracts the CIND determination sentences and stores them in the CIND determination sentence storage unit 138, and stores the confirmed OOD determination sentences in the confirmed OOD determination sentence storage unit 139.
The non-sentence determination unit 137a calculates a perplexity value, which is an index of how much a corpus of OOD determination sentences reads like a meaningful sentence, determines whether each sentence is a non-sentence by comparing the perplexity value with a predetermined threshold, and discards corpora regarded as non-sentences as discard determination sentences.
For each corpus not regarded as a non-sentence, the non-appearance determination unit 137b determines whether it contains words that rarely appear in the corpus group of the domain for which it was determined to be an OOD determination sentence, and extracts corpora containing such words as confirmed OOD determination sentences. Furthermore, the non-appearance determination unit 137b stores the extracted confirmed OOD determination sentences in the confirmed OOD determination sentence storage unit 139, extracts the remaining corpora as CIND determination sentences, and stores them in the CIND determination sentence storage unit 138.
<About COOD determination sentences and CIND determination sentences>
Here, the COOD determination sentences and the CIND determination sentences will be described.
As shown in FIG. 5, the semantic analyzer 131 in FIG. 4 classifies the corpus group of substitution generated sentences generated from the IND sentences stored in the substitution generated sentence storage unit 109 into a corpus group of IND determination sentences and a corpus group of OOD determination sentences.
Furthermore, from the corpus group of IND determination sentences, the COOD corpus extraction unit 133 discards as discard determination sentences the corpora regarded as non-sentences by a determination based on the perplexity value described later, and extracts the corpora regarded as COOD determination sentences by a determination based on the non-appearance of words. The COOD corpus extraction unit 133 then regards the corpus consisting of the remaining IND determination sentences as confirmed IND determination sentences.
That is, since the input to the COOD corpus extraction unit 133 is a corpus of IND determination sentences generated by replacing words in various ways based on IND sentences, relatively many corpora are regarded as confirmed IND determination sentences.
However, the corpora determined to be IND determination sentences by the semantic analyzer 131 are limited by recognition accuracy, and, in addition, since words have been replaced, some corpora that should be regarded as OOD determination sentences are included. Such an OOD determination sentence was originally an IND determination sentence that became an OOD determination sentence through word replacement, so it is a corpus close to the IND determination sentences and can be considered a COOD determination sentence as described with reference to FIG. 2.
Although a COOD determination sentence is an OOD determination sentence, it is a corpus similar to an IND determination sentence; in other words, it can be considered a corpus that resembles an IND determination sentence but is not one. Furthermore, a COOD determination sentence is a highly confusing expression to distinguish from an IND determination sentence, and can also be considered a corpus that easily causes erroneous determinations.
Here, two or more corpora (sentences) being "similar" to each other means, for example, that the sentences have predicates with mutually similar word senses and that the structure of the arguments relating to the predicate (the predicate-argument structure) is similar. Furthermore, if the meanings and roles of the words in the arguments of the predicate are also similar, the sentences can be said to be even more similar.
For example, the word "set" has multiple word senses; as shown in example Ex1 of FIG. 6, two of the senses of "set" are listed as "set 4" and "set 8".
In example Ex1, for "set 4", the arguments of the predicate "set" include ("system", 83), ("company", 42), ("school", 33), and ("boss", 18) as agent cases; ("meeting", 41), ("participant", 27), and ("travel time", 10) as object cases; and ("PC (personal computer)", 95), ("scheduler", 72), and ("smartphone", 33) as instrument cases. In the figure, the numbers written after each word (for the agent case, 83, 42, 33, and 18 associated with the words "system", "company", "school", and "boss") indicate how many times (the frequency with which) the word is found linked to the predicate when a predetermined number of sentences are searched; here they are listed in descending order of frequency. These numbers may instead be another index such as a weight, may be normalized by the population, or may be used in combination, for example by multiplying indices together.
Also in example Ex1, for "set 8", the related arguments similarly include ("wife", 40), ("daughter", 33), ("son", 28), and ("mother", 13) as agent cases; ("alarm", 52), ("alarm clock", 48), and ("timer", 42) as object cases; and ("alarm clock", 94), ("clock", 42), ("mobile phone", 35), and ("smartphone", 19) as instrument cases.
Furthermore, "set 1" (using a different verb of setting) is listed as a predicate with a word sense similar to "set"; in this case, its arguments include ("she", 56), ("father", 52), and ("wife", 49) as agent cases; ("timer", 67), ("sleep timer", 42), and ("alarm", 41) as object cases; and ("rice cooker", 52), ("air conditioner", 45), ("radio", 32), and ("mobile phone", 12) as instrument cases.
That is, when "set 4", "set 8", and "set 1" listed in example Ex1 are used with the word senses into which they are classified, sentences in which the agent case, object case, and instrument case are replaced within the range of each word sense are sentences with similar predicate structures, and "set 8" and "set 1" can both be considered highly likely to have similar word senses in that words of a semantic class related to timers appear in their object case.
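One way to hold the case frame entries of example Ex1 is sketched below; the dictionary layout, the English key names, and the lookup helper are assumptions for illustration, while the word-frequency pairs follow the figures quoted above:
# Case frame dictionary entries mirroring example Ex1 of FIG. 6: each predicate sense
# maps each case to (word, frequency) pairs.
case_frame_dictionary = {
    "set_4": {
        "agent":      [("system", 83), ("company", 42), ("school", 33), ("boss", 18)],
        "object":     [("meeting", 41), ("participant", 27), ("travel time", 10)],
        "instrument": [("PC", 95), ("scheduler", 72), ("smartphone", 33)],
    },
    "set_8": {
        "agent":      [("wife", 40), ("daughter", 33), ("son", 28), ("mother", 13)],
        "object":     [("alarm", 52), ("alarm clock", 48), ("timer", 42)],
        "instrument": [("alarm clock", 94), ("clock", 42), ("mobile phone", 35), ("smartphone", 19)],
    },
}

def candidate_fillers(predicate_sense, case, top_n=3):
    # Return the most frequent words observed in the given case of the given predicate sense.
    entries = case_frame_dictionary.get(predicate_sense, {}).get(case, [])
    return [word for word, _freq in sorted(entries, key=lambda wf: wf[1], reverse=True)[:top_n]]

print(candidate_fillers("set_8", "object"))   # ['alarm', 'alarm clock', 'timer']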
Also, two or more sentences being "different" from each other means, for example, sentences whose predicate-argument structures are similar but which have predicates of different notation, or noun phrases of different semantic classes.
For example, examples of sentences with similar predicate-argument structures but predicates of different notation include, as shown in the upper part of example Ex2 in FIG. 6, "Set the alarm at 6 o'clock.", "Destroy the alarm at 6 o'clock.", and "Release the alarm at 6 o'clock."
Examples of sentences with similar predicate-argument structures but noun phrases of different semantic classes include, as shown in the lower part of example Ex2 in FIG. 6, "Set the timer at 8 o'clock.", "Set the sales meeting at 8 o'clock.", and "Set the shutdown at 8 o'clock." The first, "timer", is a noun phrase related to the clock function, the second, "sales meeting", is a noun phrase related to work events, and "shutdown" is a noun phrase related to computer control; they fall into semantically different classes.
Furthermore, as examples of semantic classes of noun phrases, as shown in the bottom row of example Ex2 in FIG. 6, taking the word "Yamagata" as an example, there are a class meaning a person or surname "Yamagata", a class meaning a place name or prefecture name such as "Yamagata Prefecture", and a class meaning a route name such as "Yamagata Shinkansen".
In contrast, among the corpora regarded as OOD determination sentences, those far in the feature space from the corpora regarded as IND determination sentences can be considered unlikely to be misrecognized as IND determination sentences.
Therefore, by learning many COOD determination sentences, the frame estimation unit 13 and the semantic analysis unit 14 become able to reliably reject corpora that are similar to IND determination sentences but are in fact OOD determination sentences; as a result, recognition accuracy can be improved. For this reason, recognition accuracy can be improved by generating and learning more corpora that become COOD determination sentences.
A CIND determination sentence is a corpus corresponding to a COOD determination sentence; in the feature space of FIG. 2, it is a corpus that, among the corpora regarded as IND determination sentences, lies near the boundary with the distribution of the corpora regarded as OOD determination sentences.
Although a CIND determination sentence is an IND determination sentence, it is a corpus similar to the OOD determination sentences; in other words, it can be considered a corpus that resembles an OOD determination sentence but is not one. Furthermore, a CIND determination sentence is a highly confusing expression to distinguish from an OOD determination sentence, and can also be considered a corpus that easily causes erroneous determinations.
In contrast, among the corpora regarded as IND determination sentences, those far in the feature space from the corpora regarded as OOD determination sentences can be considered unlikely to be erroneously determined to be OOD determination sentences.
Therefore, by learning many CIND determination sentences, the frame estimation unit 13 and the semantic analysis unit 14 become able to reliably recognize corpora that are similar to OOD determination sentences but are in fact IND determination sentences; as a result, recognition accuracy can be improved. For this reason, recognition accuracy can be improved by generating and learning more corpora that become CIND determination sentences.
From the above, the recognition accuracy of the frame estimation unit 13 and the semantic analysis unit 14 can be improved by learning many COOD determination sentences and many CIND determination sentences.
<Corpus generation processing>
Next, the corpus generation processing by the corpus generation device 51 of FIG. 3 will be described with reference to the flowchart of FIG. 7.
In step S11, the IND sentence reception unit 101 selects an unprocessed IND sentence, from among the IND sentences created manually or otherwise, as the IND sentence to be processed, receives its input, and outputs it to the language analysis unit 102.
In step S12, the language analysis unit 102 analyzes the morphemes, phrases, and predicate-argument structure of the IND sentence to be processed.
In step S13, the language analysis unit 102 stores the predicate-argument structure analysis result. More specifically, the language analysis unit 102 stores the analysis result as long as no error occurs in the predicate-argument structure analysis processing. If an error occurs, the language analysis unit 102 discards, for example, the IND sentence being processed.
In step S14, the language analysis unit 102 determines whether any unprocessed IND sentences remain; if so, the processing returns to step S11. That is, the processing of steps S11 to S14 is repeated until there are no more unprocessed IND sentences. When all IND sentences have been processed and it is determined in step S14 that no unprocessed IND sentences remain, the processing proceeds to step S15.
Here, an example of the stored predicate-argument structure analysis result data will be described with reference to FIG. 8.
Analyzing the predicate-argument structure means that, when the input sentence is "Tell me a good sushi restaurant in Ginza", for example, and it is analyzed by deep case analysis, it is analyzed into a structure like that shown in example Ex11 of FIG. 8. That is, "in Ginza" is analyzed as a location case, "good" as an adnominal modifier clause, "sushi restaurant" as an object case, and "tell me" as a predicate clause.
When the same input sentence is analyzed by surface case analysis, it is analyzed into a structure like that shown in example Ex12 of FIG. 8. That is, "in Ginza" is analyzed as a de case, "good" as an adjective, "sushi restaurant" as a wo case, and "tell me" as a verb.
For predicate-argument structure analysis, deep case analysis, surface case analysis, and the like may be set so that they can be switched, and the user may be allowed to select one of them.
In this way, the analysis result determines the position of the verb phrase and the part of the noun phrase that becomes the object case (the object in English, for example).
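A stored analysis record along the lines of examples Ex11 and Ex12 might look as follows; the field names and the helper function are assumptions introduced only to show how the object case (or wo case) becomes the replacement target:
# Toy records for the deep-case and surface-case analyses of the same input sentence.
deep_case_analysis = {
    "sentence": "Tell me a good sushi restaurant in Ginza",
    "predicate": "tell",
    "arguments": {"location_case": "Ginza", "adnominal_modifier": "good", "object_case": "sushi restaurant"},
}

surface_case_analysis = {
    "sentence": "Tell me a good sushi restaurant in Ginza",
    "verb": "tell",
    "arguments": {"de_case": "Ginza", "adjective": "good", "wo_case": "sushi restaurant"},
}

def replacement_target(record, scheme="deep"):
    # The object case (deep) or wo case (surface) is the slot targeted by Category replacement.
    key = "object_case" if scheme == "deep" else "wo_case"
    return record["arguments"][key]

print(replacement_target(deep_case_analysis))                       # sushi restaurant
print(replacement_target(surface_case_analysis, scheme="surface"))  # sushi restaurant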
Note that when the IND sentence to be processed is, for example, "Any good restaurants nearby?", the IND sentence has no predicate and the predicate-argument structure analysis may fail. In such a case, the omitted predicate may be complemented on the basis of a predetermined rule so that the predicate-argument structure analysis result can be interpolated.
In step S15, the replacement location setting unit 103 takes the predicate-argument structure analysis results stored in the processing of step S13 as input, sets the replacement conditions specified in advance, and supplies the setting results to the dictionary inquiry unit 104.
<Replacement conditions>
Here, the replacement conditions will be described. The replacement conditions consist of a replacement method and replacement locations.
First, there are broadly two replacement methods, described below, and one of them is set, for example.
More specifically, the first replacement method is the Action-fixed (predicate-fixed) Category-replacement (object-case replacement) method, and the second is the Category-fixed (object-case-fixed) Action-replacement (predicate replacement) method.
The Action-fixed (predicate-fixed) Category-replacement (object-case replacement) method is a method in which, when the input sentence is, for example, "Set an alarm at 7 o'clock", the predicate (Action) "set" is fixed and the object case (Category) "alarm" is replaced, as shown in 1) in the upper part of example Ex21 of FIG. 9; in 1) of example Ex21, "alarm" is replaced with "physical property".
The Category-fixed (object-case-fixed) Action-replacement (predicate replacement) method is a method in which, when the input sentence is, for example, "Set an alarm at 7 o'clock", the object case (Category) "alarm" is fixed and the predicate (Action) "set" is replaced, as shown in 2) in the lower part of example Ex21 of FIG. 9; in 2) of example Ex21, "set" is replaced with "release".
Note that settings for sentences with two predicates, for specifying cases other than the object case (the time case, the instrument case, and so on), for phrase boundaries, and the like may be arbitrarily specified by the user, and the behavior may be switched according to the specified content.
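The two replacement methods can be pictured with the following sketch; the parsed record, the candidate word lists, and the simplified English sentence templates are illustrative assumptions:
# Sketch of the two replacement methods of example Ex21 (FIG. 9).
parsed = {"time": "at 7 o'clock", "object": "alarm", "action": "set"}

def action_fixed_category_replace(p, category_candidates):
    # Keep the predicate (Action) fixed and swap the object case (Category).
    return [f"{p['action']} the {c} {p['time']}" for c in category_candidates]

def category_fixed_action_replace(p, action_candidates):
    # Keep the object case (Category) fixed and swap the predicate (Action).
    return [f"{a} the {p['object']} {p['time']}" for a in action_candidates]

# Candidate words taken from the examples in the text.
print(action_fixed_category_replace(parsed, ["physical property", "meeting"]))
print(category_fixed_action_replace(parsed, ["release", "destroy"]))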
<Replacement locations when the replacement method is the Action-fixed Category-replacement method>
When the replacement method is the Action-fixed Category-replacement method, the replacement locations are set, for example, as shown in example Ex22 of FIG. 10.
When the IND sentence to be processed is, for example, "Tell me a recommended spot near the station this weekend", and the wo case is specified as the replacement location, the replacement location setting unit 103 divides the sentence into the segmentation units "this weekend", "the station's", "near", "at", "recommended", "of", "spot", and "tell me", as shown in the top row of example Ex22.
The replacement location setting unit 103 may also be able to change, using rules or a word dictionary, the setting that groups the segmentation units of specific words or phrases into a single phrase as needed. For example, as shown in the second row from the top of example Ex22, the segmentation is adjusted to "this weekend", "near the station", "at", "recommended", "of", "spot", and "tell me".
Furthermore, from this adjusted segmentation-unit structure, the replacement location setting unit 103 replaces, among the replacement locations, "recommended" with words from the word group that attach to the predicate in the wo case, as shown, for example, in the third row from the top of example Ex22.
Converting a word in only one place in this way is effective for creating COOD sentences, but it is also possible to add setting variations, such as adding the de-case "near the station" to the replacement targets, or replacing only the de-case "near the station" without replacing the wo-case "spot". A sketch of this replacement-location step follows below.
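The segmentation, rule-based phrase merging, and single-slot replacement described above can be sketched as follows; the merge rule, the candidate words, and the data layout are assumptions for illustration:
# Sketch of the replacement-location step of example Ex22 (FIG. 10).
segments = ["this weekend", "the station's", "near", "at", "recommended", "of", "spot", "tell me"]

MERGE_RULES = {("the station's", "near"): "near the station"}   # rule/word-dictionary driven merging

def merge_phrases(tokens):
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in MERGE_RULES:
            merged.append(MERGE_RULES[pair])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

adjusted = merge_phrases(segments)
replace_slot = adjusted.index("recommended")     # the single location chosen for replacement

def substitute(tokens, slot, word):
    out = list(tokens)
    out[slot] = word
    return " ".join(out)

for candidate in ["cheap", "forbidden"]:         # toy candidate replacement words
    print(substitute(adjusted, replace_slot, candidate))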
<Replacement locations when the replacement method is the Category-fixed Action-replacement method>
When the replacement method is the Category-fixed (object-case-fixed) Action-replacement (predicate replacement) method, the replacement locations are set, for example, as shown in example Ex23 of FIG. 11.
When the IND sentence to be processed is, for example, "Tell me a recommended spot near the station this weekend", and the predicate is specified as the replacement location, the replacement location setting unit 103 divides the sentence into the segmentation units "this weekend", "the station's", "near", "at", "recommended", "of", "spot", and "tell me", as shown in the top row of example Ex23.
Furthermore, from this adjusted segmentation-unit structure, the replacement location setting unit 103 replaces, among the replacement locations, "tell me", which the wo-case "recommended" depends on, with a predicate of similar word sense that also takes the same "recommended" in its wo case, as shown, for example, in the third row from the top of example Ex23. A setting may also be made to replace it with a predicate of dissimilar word sense that does not take the same "recommended" in its wo case.
The selection criterion for the replacing predicate can be judged not only by what words it takes in its wo case, but also by the similarity or dissimilarity of the words in other arguments, such as the de case or the ni case.
In step S16, the dictionary inquiry unit 104 reads one unprocessed IND sentence from the stored predicate-argument structure analysis result data in the language analysis unit 102 and accepts it as the IND sentence to be processed.
In step S17, the dictionary inquiry unit 104 specifies the replacement location according to the specified replacement method on the basis of the setting results, and searches the case frame dictionary 107 for noun phrases corresponding to the argument of the word at the replacement location, or for predicates corresponding to its word sense.
In step S18, the dictionary inquiry unit 104 stores the IND sentence to be processed, the setting information on the replacement, and the search results in association with each other.
In step S19, the dictionary inquiry unit 104 determines whether any unprocessed IND sentences remain among the stored predicate-argument structure analysis result data; if there are, the processing returns to step S16. That is, the processing of steps S16 to S19 is repeated until the search for replacement candidates has been completed for all IND sentences in the stored predicate-argument structure analysis result data. When the search for replacement candidates has been completed for all IND sentences and it is determined in step S19 that there are no unprocessed IND sentences, the processing proceeds to step S20.
Note that when the Japanese predicate-argument structure analysis result is as shown in FIG. 12 and the replacement method is the Action-fixed Category-replacement method, word groups that replace the predicate argument (wo case) are searched for as shown in FIG. 13. When the predicate-argument structure analysis result is as shown in FIG. 12 and the replacement method is the Category-fixed Action-replacement method, word groups that replace the predicate part are searched for as shown in FIG. 14. Such replacement generates, for example, Japanese OOD candidate sentences as shown in FIG. 15.
Also, when the English predicate-argument structure analysis result corresponding to the predicate-argument structure analysis result of FIG. 12 is as shown in FIG. 16 and the replacement method is the Action-fixed Category-replacement method, word groups that replace the predicate argument are searched for as shown in FIG. 17. When the predicate-argument structure analysis result is as shown in FIG. 16 and the replacement method is the Category-fixed Action-replacement method, word groups that replace the predicate part are searched for as shown in FIG. 18. Such replacement generates, for example, English OOD candidate sentences as shown in FIG. 19.
Hereinafter, an example of a Japanese predicate-argument structure analysis result, the search results when the replacement method is the Action-fixed Category-replacement method, the search results when it is the Category-fixed Action-replacement method, and examples of OOD candidate sentences will be described with reference to FIGS. 12 to 15. Then, an example of an English predicate-argument structure analysis result, the search results when the replacement method is the Action-fixed Category-replacement method, the search results when it is the Category-fixed Action-replacement method, and examples of OOD candidate sentences will be described with reference to FIGS. 16 to 19.
<Example of a Japanese predicate-argument structure analysis result>
FIG. 12 shows an example of a predicate-argument structure analysis result; from the left, it shows the sentence ID, the sentence, the predicate, the predicate ending, the predicate arguments, and the original domain. The predicate arguments show, from the left, the location case or de case, the adnominal modifier clause or no case, ..., and the object case or wo case.
More specifically, for the sentence with sentence ID 1, "Set an alarm at 7 o'clock", the predicate is "set", the predicate ending is "shite", the object case or wo case is "alarm", and the original domain is shown to be ALARM-SETUP.
For the sentence with sentence ID 1001, "Tell me a good sushi restaurant in Ginza", the predicate is "tell", the predicate ending is "te", the location case or de case is "Ginza", the object case or wo case is "sushi restaurant", and the original domain is shown to be RESTAURANT-SEARCH.
Furthermore, for the sentence with sentence ID 1002, "Tell me an Italian restaurant", the predicate is "tell", the predicate ending is "te", the adnominal modifier clause or no case is "Italian", the object case or wo case is "restaurant", and the original domain is shown to be RESTAURANT-SEARCH.
Also, for the sentence with sentence ID 1003, "Please find a restaurant where I can eat an Italian course", the predicate is "find", the predicate ending is "please", the adnominal modifier clauses or no cases are "Italian" and "course", the object case or wo case is "restaurant", and the original domain is shown to be RESTAURANT-SEARCH.
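The rows of FIG. 12 can be held as simple records, for example as below; the field names are assumptions, while the values follow the figure as quoted above:
from collections import defaultdict

# Rows of FIG. 12: sentence ID, sentence, predicate, predicate ending, arguments, original domain.
analysis_rows = [
    {"id": 1, "sentence": "Set an alarm at 7 o'clock", "predicate": "set", "ending": "shite",
     "arguments": {"object_or_wo": "alarm"}, "domain": "ALARM-SETUP"},
    {"id": 1001, "sentence": "Tell me a good sushi restaurant in Ginza", "predicate": "tell", "ending": "te",
     "arguments": {"location_or_de": "Ginza", "object_or_wo": "sushi restaurant"}, "domain": "RESTAURANT-SEARCH"},
    {"id": 1002, "sentence": "Tell me an Italian restaurant", "predicate": "tell", "ending": "te",
     "arguments": {"adnominal_or_no": "Italian", "object_or_wo": "restaurant"}, "domain": "RESTAURANT-SEARCH"},
]

# Group sentence IDs by original domain, e.g. to check how many examples each domain has.
by_domain = defaultdict(list)
for row in analysis_rows:
    by_domain[row["domain"]].append(row["id"])
print(dict(by_domain))   # {'ALARM-SETUP': [1], 'RESTAURANT-SEARCH': [1001, 1002]}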
<Example of word groups for replacing the object case when the Japanese substitution method is the Action-fixed Category substitution method>
FIG. 13 shows an example of the word groups used to replace the object (ヲ) case, among the predicate arguments to be replaced by the Action-fixed Category substitution method, for the sentences of the predicate-argument structure analysis results shown in FIG. 12.
In FIG. 13, the sentence ID, the sentence, the predicate, the predicate ending, the replacement words for the arguments, and the original domain are shown from the left. The replacement words for the arguments show, from the left, the デ case (locative or デ case), the ノ case (adnominal modifier clause or ノ case), ..., and the ヲ case (object or ヲ case). FIG. 13 shows an example of the word groups used when replacing the ヲ case (object case). Items in FIG. 13 that correspond to those in FIG. 12 carry the same descriptions, so their explanation is omitted as appropriate.

That is, for sentence ID 1, examples of replacement words for the object-case word "アラーム" (alarm) are "会議" (meeting), "参加者" (participant), and "移動時間" (travel time).

For sentence ID 1001, examples of replacement words for the object-case word "寿司屋" (sushi restaurant) are "ニュース" (news), "仕組み" (mechanism), "人生" (life), "流れ" (flow), and "芸" (trick).

For sentence ID 1002, examples of replacement words for the object-case word "レストラン" (restaurant) are "ニュース", "仕組み", "人生", "流れ", and "芸".

For sentence ID 1003, examples of replacement words for the object-case word "レストラン" are "外科" (surgery), "オフィス" (office), "親父" (dad), "自動車" (car), "講座" (course/lecture), and "一戸建て" (detached house).
<Example of word groups for replacing the predicate when the Japanese substitution method is the Category-fixed Action substitution method>
FIG. 14 shows an example of the word groups used when the Category-fixed Action substitution method replaces the predicate for the sentences shown in FIG. 12.
In FIG. 14, the sentence ID, the sentence, the predicate, the predicate ending, the replacement words for the predicate, and the original domain are shown from the left. Items in FIG. 14 that correspond to those in FIG. 12 carry the same descriptions, so their explanation is omitted as appropriate.

That is, for sentence ID 1, examples of replacement words for the predicate "設定" (set) are "改善する" (improve), "装備する" (equip), and "装着する" (attach).

For sentence IDs 1001 and 1002, examples of replacement words for the predicate "教え" (tell) are "抜け出す" (slip out of), "食べ歩く" (eat one's way around), "開業する" (open a business), "買い取る" (buy up), "手伝う" (help), "切り盛りする" (manage), and "格付ける" (rate).

For sentence ID 1003, examples of replacement words for the predicate "さが" (search) are "営む" (run), "開く" (open), "手伝う" (help), "利用する" (use), "特集する" (feature), "下見する" (scout out), and "探し当てる" (locate).
<Examples of Japanese OOD candidate sentences generated by substitution>
(Example of OOD candidate sentences of the Action-fixed Category substitution method)
First, with reference to the left part of FIG. 15, examples of OOD candidate sentences of the Action-fixed Category substitution method, among the Japanese OOD candidate sentences generated by substitution through the processing described above, will be described.
That is, as shown in the left part of FIG. 15, examples of OOD candidate sentences for the sentence "7時にアラームを設定して" (set an alarm at 7 o'clock) with sentence ID 1 in FIG. 12 are "7時に会議を設定して" (set a meeting at 7 o'clock) and "7時に参加者を設定して" (set a participant at 7 o'clock). That is, "アラーム" has been replaced with "会議" and "参加者", respectively.

Examples of OOD candidate sentences for the sentence "銀座でおいしい寿司屋を教えて" with sentence ID 1001 in FIG. 12 are "銀座でおいしいニュースを教えて" (tell me delicious news in Ginza) and "銀座でおいしい仕組みを教えて" (tell me a delicious mechanism in Ginza). That is, "寿司屋" has been replaced with "ニュース" and "仕組み", respectively.

Examples of OOD candidate sentences for the sentence "イタリアンのレストランを教えて" with sentence ID 1002 in FIG. 12 are "イタリアンのニュースを教えて" (tell me Italian news) and "イタリアンの仕組みを教えて" (tell me about the Italian mechanism). That is, "レストラン" has been replaced with "ニュース" and "仕組み", respectively.
Examples of OOD candidate sentences for the sentence "イタリアンのコースの食べれるレストランをさがしてください" with sentence ID 1003 in FIG. 12 are "イタリアンのコースを食べれる外科をさがして" (find a surgery where I can eat an Italian course) and "イタリアンのコースを食べれるオフィスをさがして" (find an office where I can eat an Italian course). That is, "レストラン" has been replaced with "外科" and "オフィス", respectively.
(Example of OOD candidate sentences of the Category-fixed Action substitution method)
Next, with reference to the right part of FIG. 15, examples of OOD candidate sentences of the Category-fixed Action substitution method, among the Japanese OOD candidate sentences generated by substitution through the processing described above, will be described.
That is, as shown in the right part of FIG. 15, examples of OOD candidate sentences for the sentence "7時にアラームを設定して" with sentence ID 1 in FIG. 12 are "7時にアラームを改善して" (improve the alarm at 7 o'clock) and "7時にアラームを装備して" (equip the alarm at 7 o'clock). That is, "設定" has been replaced with "改善" and "装備", respectively.

Examples of OOD candidate sentences for the sentence "銀座でおいしい寿司屋を教えて" with sentence ID 1001 in FIG. 12 are "銀座でおいしい寿司屋を抜け出して" (slip out of a good sushi restaurant in Ginza) and "銀座でおいしい寿司屋を食べ歩いて" (eat your way around good sushi restaurants in Ginza). That is, "教えて" has been replaced with "抜け出して" and "食べ歩いて", respectively.

Examples of OOD candidate sentences for the sentence "イタリアンのレストランを教えて" with sentence ID 1002 in FIG. 12 are "イタリアンのレストランを抜け出して" and "イタリアンのレストランを食べ歩いて". That is, "教えて" has been replaced with "抜け出して" and "食べ歩いて", respectively.

Examples of OOD candidate sentences for the sentence "イタリアンのコースの食べれるレストランをさがしてください" with sentence ID 1003 in FIG. 12 are "イタリアンのコースの食べれるレストランを営んで" (run a restaurant where an Italian course can be eaten) and "イタリアンのコースの食べれるレストランを開いて" (open a restaurant where an Italian course can be eaten). That is, "さがして" has been replaced with "営んで" and "開いて", respectively.
<Example of English predicate-argument structure analysis results>
FIG. 16 shows an example of English predicate-argument structure analysis results; from the left, the sentence ID, the sentence (Sentence), the predicate (verb: Action), the predicate arguments (Argument), and the original domain (Original Domain) are shown. The predicate arguments show, from the left, the argument attached to the predicate by "in" (prep_in), ..., and the direct object (dobj).
The English predicate-argument structure analysis here uses the analysis results of the Stanford Parser as an example (for details, see Marie-Catherine de Marneffe and Christopher D. Manning, "Stanford typed dependencies manual", 2008, revised for the Stanford Parser v.3.3 in December 2013). dobj indicates that the semantic role (case) of the argument attached to the predicate is the direct object.
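As a rough illustration of how such typed-dependency output maps onto the columns of FIG. 16, the following minimal Python sketch reads hand-written (relation, governor, dependent) triples in the style of the Stanford collapsed dependencies cited above. The triples and the helper function are assumptions for illustration, not actual parser output, and only the head word of the object (rather than the full phrase shown in the figure) is kept.

```python
# Minimal sketch: map typed-dependency triples onto the verb / dobj / prep_in
# columns of FIG. 16. The triples below are hand-written assumptions.
def extract_predicate_arguments(dependencies):
    """dependencies: list of (relation, governor, dependent) triples."""
    record = {"verb": None, "dobj": None, "prep_in": None}
    for rel, gov, dep in dependencies:
        if rel == "root":        # the main predicate (verb: Action)
            record["verb"] = dep
        elif rel == "dobj":      # direct object of the predicate
            record["dobj"] = dep
        elif rel == "prep_in":   # argument attached to the predicate by "in"
            record["prep_in"] = dep
    return record

# "find Chinese food in Austin" (sentence ID 2), hand-written dependencies
deps = [("root", "ROOT", "find"),
        ("dobj", "find", "food"),
        ("amod", "food", "Chinese"),
        ("prep_in", "find", "Austin")]
print(extract_predicate_arguments(deps))
# -> {'verb': 'find', 'dobj': 'food', 'prep_in': 'Austin'}
```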
More specifically, for the sentence with sentence ID 1, "find a Chinese buffet nearby", the predicate is "find", the direct object (dobj) is "Chinese buffet", and the original domain is AREA_INFO-SEARCH_EVENT.

For the sentence with sentence ID 2, "find Chinese food in Austin", the predicate is "find", the argument attached to the predicate by "in" (prep_in) is "Austin", the direct object (dobj) is "Chinese food", and the original domain is AREA_INFO-SEARCH_EVENT.

For the sentence with sentence ID 1537, "turn on some tunes please", the predicate is "turn on", the direct object (dobj) is "tunes", and the original domain is MUSIC_PLAY.

For the sentence with sentence ID 1538, "I'd like to hear some Beatles", the predicate is "hear", the direct object (dobj) is "Beatles", and the original domain is MUSIC_PLAY.
<Example of word groups for replacing the direct object when the English substitution method is the Action-fixed Category substitution method>
FIG. 17 shows an example of the word groups used to replace the direct object (dobj), among the predicate arguments to be replaced by the Action-fixed Category substitution method, for the sentences of the English predicate-argument structure analysis results shown in FIG. 16.
In FIG. 17, the sentence ID, the sentence (Sentence), the predicate (verb), the replacement words for the arguments (Argument), and the original domain (Original Domain) are shown from the left. The replacement words for the arguments show, from the left, the argument attached to the predicate by "in" (prep_in), ..., and the direct object (dobj). FIG. 17 shows an example of the word groups used when replacing the direct object (dobj). Items in FIG. 17 that correspond to those in FIG. 16 carry the same descriptions, so their explanation is omitted as appropriate.

That is, for the sentence "find a Chinese buffet nearby" with sentence ID 1, examples of replacement words for the direct object "Chinese buffet" are "victim", "bomb", "cache", and "remains".

For the sentence "find Chinese food in Austin" with sentence ID 2, examples of replacement words for the direct object "Chinese food" are "victim", "bomb", "cache", and "remains".

For the sentence "turn on some tunes please" with sentence ID 1537, examples of replacement words for the direct object "tunes" are "light", "power", and "you".

For the sentence "I'd like to hear some Beatles" with sentence ID 1538, examples of replacement words for the direct object "Beatles" are "team-mate", "boss", and "neighbor".
<Example of word groups for replacing the predicate when the English substitution method is the Category-fixed Action substitution method>
FIG. 18 shows an example of the word groups used when the Category-fixed Action substitution method replaces the predicate for the sentences shown in FIG. 16.
In FIG. 18, the sentence ID, the sentence, the predicate, the predicate ending, the replacement words for the predicate, and the original domain are shown from the left. Items in FIG. 18 that correspond to those in FIG. 16 carry the same descriptions, so their explanation is omitted as appropriate.

That is, for the sentence "find a Chinese buffet nearby" with sentence ID 1, examples of replacement words for the predicate (verb: Action) "find" are "include", "open", "run", and "operate".

For the sentence "find Chinese food in Austin" with sentence ID 2, examples of replacement words for the predicate "find" are "include", "open", "run", and "operate".

For the sentence "turn on some tunes please" with sentence ID 1537, examples of replacement words for the predicate "turn on" are "download", "record", and "compose".

For the sentence "I'd like to hear some Beatles" with sentence ID 1538, examples of replacement words for the predicate "hear" are "work with", "copy", and "remove".
<Examples of English OOD candidate sentences generated by substitution>
(Example of OOD candidate sentences of the Action-fixed Category substitution method)
First, with reference to the left part of FIG. 19, examples of OOD candidate sentences of the Action-fixed Category substitution method, among the English OOD candidate sentences generated by substitution through the processing described above, will be described.
That is, as shown in the left part of FIG. 19, examples of OOD candidate sentences for the sentence "find a Chinese buffet nearby" with sentence ID 1 in FIG. 16 are "find a bomb nearby" and "find a victim nearby". That is, "Chinese buffet" has been replaced with "bomb" and "victim", respectively.

Examples of OOD candidate sentences for the sentence "find Chinese food in Austin" with sentence ID 2 in FIG. 16 are "find cache in Austin" and "find remains in Austin". That is, "Chinese food" has been replaced with "cache" and "remains", respectively.

Examples of OOD candidate sentences for the sentence "turn on some tunes please" with sentence ID 1537 in FIG. 16 are "turn on light please" and "turn on power please". That is, "some tunes" has been replaced with "light" and "power", respectively.

Examples of OOD candidate sentences for the sentence "I'd like to hear some Beatles" with sentence ID 1538 in FIG. 16 are "I'd like to hear team-mate" and "I'd like to hear neighbor". That is, "Beatles" has been replaced with "team-mate" and "neighbor", respectively.
(Example of OOD candidate sentences of the Category-fixed Action substitution method)
Next, with reference to the right part of FIG. 19, examples of OOD candidate sentences of the Category-fixed Action substitution method, among the English OOD candidate sentences generated by substitution through the processing described above, will be described.
That is, as shown in the right part of FIG. 19, examples of OOD candidate sentences for the sentence "find a Chinese buffet nearby" with sentence ID 1 in FIG. 16 are "Open Chinese buffet nearby" and "Operate Chinese buffet nearby". That is, "find" has been replaced with "Open" and "Operate", respectively.
Examples of OOD candidate sentences for the sentence "find Chinese food in Austin" with sentence ID 2 in FIG. 16 are "Open Chinese food in Austin" and "Operate Chinese food in Austin". That is, "find" has been replaced with "Open" and "Operate", respectively.

Examples of OOD candidate sentences for the sentence "turn on some tunes please" with sentence ID 1537 in FIG. 16 are "Record some tunes please" and "Compose some tunes please". That is, "turn on" has been replaced with "Record" and "Compose", respectively.

Examples of OOD candidate sentences for the sentence "I'd like to hear some Beatles" with sentence ID 1538 in FIG. 16 are "I'd like to copy some Beatles" and "I'd like to remove some Beatles". That is, "hear" has been replaced with "copy" and "remove", respectively.
<Method of retrieving words likely to yield COOD sentences from the case frame dictionary>
Next, a method of retrieving words that are likely to yield COOD sentences from the case frame dictionary 107 will be described.
FIG. 20 shows a simplified image of the case frame dictionary.
For example, in the case of deep case analysis, for the predicate "設定する" (to set), two case frames are listed for "設定する" as shown in example Ex31 of FIG. 20: <設定する4> and <設定する8>. The trailing number identifies different senses of the same verb "設定する".
In example Ex31 of FIG. 20, <設定する4> shows which words occur as arguments with each role. Each item in parentheses is a word and a number. The number represents how many times (the frequency with which) that word co-occurred with the predicate. For example, ("会議" (meeting), 41) means that, in the large amount of corpus data from which the case frame dictionary was built, the word "会議" occurred 41 times as the object case of the sense <設定する4>. This value may be another index such as a weight, may be normalized by the population before use, or may be used in combination, for example by multiplying indices together.
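To make the structure just described concrete, the following is a minimal sketch of part of example Ex31 as an in-memory Python dictionary. The nesting (sense → role → word → frequency) and the English role names are assumptions for illustration; only a few entries from the figure are reproduced.

```python
# Hypothetical in-memory view of part of FIG. 20 (Ex31, deep cases):
# each predicate sense maps each semantic role to {word: frequency}, where the
# frequency is how often that word filled the role for that sense in the
# source corpus.
case_frame_dict = {
    "設定する4": {
        "agent":      {"システム": 83, "会社": 42, "学校": 33, "上司": 18},
        "object":     {"会議": 41, "参加者": 27, "移動時間": 10},
        "instrument": {"PC": 95, "スケジューラ": 72, "スマホ": 33},
    },
    "設定する8": {
        "agent":      {"妻": 40, "娘": 33, "息子": 28, "母": 13},
        "object":     {"アラーム": 52, "目覚まし": 48, "タイマ": 42},
        "instrument": {"目覚まし": 94, "時計": 48, "携帯": 35, "スマホ": 19},
    },
}

# e.g. how often 会議 filled the object slot of <設定する4>:
print(case_frame_dict["設定する4"]["object"]["会議"])  # 41
```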
In example Ex31, for <設定する4>, the agent case lists ("システム" (system), 83), ("会社" (company), 42), ("学校" (school), 33), and ("上司" (boss), 18); the object case lists ("会議" (meeting), 41), ("参加者" (participant), 27), and ("移動時間" (travel time), 10); and the instrument case lists ("PC (personal computer)", 95), ("スケジューラ" (scheduler), 72), and ("スマホ" (smartphone), 33).

Also in example Ex31, for <設定する8>, the agent case lists ("妻" (wife), 40), ("娘" (daughter), 33), ("息子" (son), 28), and ("母" (mother), 13); the object case lists ("アラーム" (alarm), 52), ("目覚まし" (alarm clock), 48), and ("タイマ" (timer), 42); and the instrument case lists ("目覚まし", 94), ("時計" (clock), 48), ("携帯" (mobile phone), 35), and ("スマホ", 19).

Furthermore, in example Ex31, <セットする1> ("to set", sense 1) is listed as a case frame whose surface form differs from "設定する" but whose sense is similar. In this case, the agent case lists ("彼女" (she), 56), ("父" (father), 52), and ("妻", 49); the object case lists ("タイマ", 67), ("スリープタイマ" (sleep timer), 42), and ("アラーム", 41); and the instrument case lists ("炊飯器" (rice cooker), 52), ("エアコン" (air conditioner), 45), ("ラジオ" (radio), 32), and ("携帯", 12).

Also in example Ex31, <改善する15> ("to improve", sense 15) is listed as a case frame whose surface form differs from "設定する" and whose sense is not similar. In this case, the agent case lists ("手法" (method), 102), ("品質" (quality), 73), and ("工程" (process), 67); the object case lists ("動作" (operation), 81), ("性能" (performance), 75), and ("アラーム", 2); and the instrument case lists ("交換" (replacement), 58), ("工夫" (ingenuity), 49), and ("方法" (method), 41).
Using the original IND sentence "7時にアラームを設定して" (set an alarm at 7 o'clock) as an example, the processing of sentence generation by substitution using deep case analysis will now be described.
In the Action (predicate)-fixed Category (object case) substitution mode, with the predicate "設定する" fixed, <設定する4>, which does not have "アラーム" in its object case, is selected. <設定する4> does not contain the word "アラーム" in its object case; instead it contains words of different semantic classes such as ("会議", 41), ("参加者", 27), and ("移動時間", 10). By substituting with these words and generating a new corpus, COOD sentence candidates can be created in which the sense of "設定する" is subtly different.
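A minimal sketch of this Action-fixed Category substitution is shown below; the tiny frame table (object slots only), the function name, and the plain string replacement are assumptions for illustration.

```python
# Action-fixed Category substitution: keep the predicate 設定する fixed and take
# replacement words from senses whose object slot does NOT contain アラーム.
frames = {
    "設定する4": {"会議": 41, "参加者": 27, "移動時間": 10},
    "設定する8": {"アラーム": 52, "目覚まし": 48, "タイマ": 42},
}

def replacement_words(frames, fixed_word):
    words = []
    for sense, object_slot in frames.items():
        if fixed_word not in object_slot:   # <設定する4> has no アラーム
            words.extend(object_slot)       # -> 会議, 参加者, 移動時間
    return words

original = "7時にアラームを設定して"
for w in replacement_words(frames, "アラーム"):
    print(original.replace("アラーム", w))   # e.g. 7時に会議を設定して
```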
In the Category (object case)-fixed Action (predicate) substitution mode, with the object-case word "アラーム" fixed, case frames with different surface forms that have the same "アラーム" in their object case, <セットする1> and <改善する15>, are selected. Both contain the same word "アラーム" in the object case, but whereas in <セットする1> the frequency of "アラーム" is 41, in <改善する15> it is as low as 2. Such a predicate is likely to differ subtly in sense; its frame contains word groups of semantic classes that are not very related to the timer function. In this way, a predicate is selected that satisfies the condition that it has the fixed word in the same argument slot and that the value n, which represents the frequency of that word or the strength of the relation, is smaller than a certain threshold α.
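The predicate selection of this Category-fixed Action substitution can be sketched as follows; the frame table, the threshold value α = 10, and the exclusion of the original predicate's own senses are assumptions for illustration.

```python
# Category-fixed Action substitution: keep アラーム fixed and select predicate
# senses that contain it in the object slot only weakly (frequency n < α).
frames = {
    "設定する8":   {"アラーム": 52, "目覚まし": 48, "タイマ": 42},
    "セットする1": {"タイマ": 67, "スリープタイマ": 42, "アラーム": 41},
    "改善する15":  {"動作": 81, "性能": 75, "アラーム": 2},
}

def candidate_predicates(frames, fixed_word, original_predicate, alpha=10):
    return [sense for sense, object_slot in frames.items()
            if not sense.startswith(original_predicate)   # skip 設定する itself
            and fixed_word in object_slot
            and object_slot[fixed_word] < alpha]          # weakly related sense

print(candidate_predicates(frames, "アラーム", "設定する"))  # -> ['改善する15']
```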
By substituting the original predicate with "改善する" and generating a new corpus, the COOD sentence candidate "7時にアラームを改善して" (improve the alarm at 7 o'clock), whose sense is subtly different, can be created.
In the case of surface case analysis, for the predicate "設定する", two forms, "設定する4" and "設定する8", are given as examples, as shown in example Ex32 of FIG. 20.

In example Ex32, for "設定する4", the ガ (ga) case lists ("システム", 83), ("会社", 42), ("学校", 33), and ("上司", 18); the ヲ (wo) case lists ("会議", 41), ("参加者", 27), and ("移動時間", 10); and the デ (de) case lists ("PC (personal computer)", 95), ("スケジューラ", 72), and ("スマホ", 33).

Also in example Ex32, for "設定する8", the ガ case lists ("妻", 40), ("娘", 33), ("息子", 28), and ("母", 13); the ヲ case lists ("アラーム", 52), ("目覚まし", 48), and ("タイマ", 42); and the デ case lists ("目覚まし", 94), ("時計", 42), ("携帯", 35), and ("スマホ", 19).

Furthermore, in example Ex32, "セットする1" is listed as having a sense similar to "設定する". In this case, the ガ case lists ("彼女", 56), ("父", 52), and ("妻", 49); the ヲ case lists ("タイマ", 67), ("スリープタイマ", 42), and ("アラーム", 41); and the デ case lists ("炊飯器", 52), ("エアコン", 45), ("ラジオ", 32), and ("携帯", 12).

Also in example Ex32, <改善する15> is listed as a case frame whose surface form differs from "設定する" and whose sense is not similar. In this case, the ガ case lists ("手法", 102), ("品質", 73), and ("工程", 67); the ヲ case lists ("動作", 81), ("性能", 75), and ("アラーム", 2); and the デ case lists ("交換", 58), ("工夫", 49), and ("方法", 41).
Likewise using the original IND sentence "7時にアラームを設定して" as an example, the processing of sentence generation by substitution using surface case analysis will now be described.

In the Action (predicate)-fixed Category (ヲ case) substitution mode, with the predicate "設定する" fixed, <設定する4>, which does not have "アラーム" in its ヲ case, is selected. <設定する4> does not contain the word "アラーム" in its ヲ case; instead it contains words of different semantic classes such as ("会議", 41), ("参加者", 27), and ("移動時間", 10). By substituting with these words and generating a new corpus, COOD sentence candidates in which the sense of "設定する" is subtly different, such as "7時に参加者を設定して" (set a participant at 7 o'clock), can be created.
In the Category (ヲ case)-fixed Action (predicate) substitution mode, with the object-case word "アラーム" fixed, case frames with different surface forms that have the same "アラーム" in their ヲ case, <セットする1> and <改善する15>, are selected. Both contain the same word "アラーム" in the object case, but whereas in <セットする1> the frequency of "アラーム" is 41, in <改善する15> it is as low as 2. Such a predicate is likely to differ subtly in sense; its frame contains many word groups of semantic classes that are not very related to the timer function. In this way, a predicate is selected that satisfies the condition that it has the fixed word in the same argument slot and that the value n, which represents the frequency of that word or the strength of the relation, is smaller than a certain threshold α.

By substituting the original predicate with "改善する" and generating a new corpus, the COOD sentence candidate "7時にアラームを改善して", whose sense is subtly different, can be created.
Note that an existing dictionary may be used as the case frame dictionary 107. Also, since a general-purpose existing case frame dictionary may contain few of the words used for the service purpose that constitutes the domain, it may be made possible to add a user-defined case frame dictionary compiled by collecting the words needed for the service.
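One possible way to realize such an addition is to overlay a user-defined dictionary on the existing one, as in the sketch below; the dictionaries, the merge rule (user entries extend or override the base counts), and all entries are assumptions for illustration.

```python
# Merge a user-defined case frame dictionary (service-specific words) into an
# existing one; user entries extend or override the base entries.
def merge_case_frames(base, user):
    merged = {sense: {role: dict(words) for role, words in roles.items()}
              for sense, roles in base.items()}
    for sense, roles in user.items():
        for role, words in roles.items():
            merged.setdefault(sense, {}).setdefault(role, {}).update(words)
    return merged

base = {"設定する8": {"object": {"アラーム": 52, "タイマ": 42}}}
user = {"設定する8": {"object": {"スリープモード": 7}},     # service-specific word
        "再生する2": {"object": {"プレイリスト": 30}}}       # service-specific sense
print(merge_case_frames(base, user)["設定する8"]["object"])
# -> {'アラーム': 52, 'タイマ': 42, 'スリープモード': 7}
```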
For an existing case frame dictionary 107, see, for example, Daisuke Kawahara and Sadao Kurohashi, "A Fully-Lexicalized Probabilistic Model for Japanese Syntactic and Case Structure Analysis", In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL 2006), pp. 176-183, 2006, or "Case Frame Compilation from the Web using High-Performance Computing", In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), 2006. This dictionary is annotated with frequency information for the words in the predicates' argument slots, which can be used as the strength of the relation between a predicate and its argument words described above.
Here, the description returns to the flowchart of FIG. 7.
In step S20, the substitution execution unit 105 reads out an unprocessed IND sentence from the stored search results as the IND sentence to be processed, and also reads out and accepts the substitution setting information and the search results stored in association with it.

In step S21, the substitution execution unit 105 generates a corpus by replacing the word to be replaced in the IND determination sentence to be processed, based on that IND determination sentence, the substitution setting information, and an unprocessed search result among the search results, and adjusts the conjugation and the sentence ending.

In step S22, the substitution execution unit 105 stores the generated corpus as a primary substitution-generated sentence.

In step S23, the substitution execution unit 105 determines whether any unprocessed IND sentence remains among the stored search results; if one does, the process returns to step S20. That is, steps S20 to S23 are repeated until a corpus has been generated by substitution, based on the search results, for every IND sentence. When it is determined in step S23 that no unprocessed IND sentence remains, the process proceeds to step S24.
In step S24, the duplicate sentence elimination unit 106 reads an unprocessed corpus from among the corpora saved in step S22 and accepts it as the corpus to be processed.

In step S25, the duplicate sentence elimination unit 106 determines whether the corpus to be processed duplicates any sentence generated so far and saved by the processing of step S22 (a duplicate sentence). More specifically, the duplicate sentence elimination unit 106 searches the group of corpora stored as newly generated corpora for the corpus set as the processing target and judges whether it is a duplicate according to whether a match exists. If it is determined in step S25 to be a duplicate, the process proceeds to step S26.

In step S26, the duplicate sentence elimination unit 106 regards the generated corpus as a duplicate, that is, as a discard determination sentence, and discards it.

If, on the other hand, it is determined in step S25 that the generated corpus is not a duplicate, the process proceeds to step S27.

In step S27, the duplicate sentence elimination unit 106 stores the substitution-generated corpus to be processed in the substitution-generated sentence storage unit 109.
In step S28, the substitution execution unit 105 determines whether there are unprocessed search results; if there are, the process returns to step S24.

When it is determined in step S28 that there are no unprocessed search results, the process proceeds to step S29.

In step S29, the substitution execution unit 105 stores the corpora that remain saved at this point without having been eliminated as duplicates in the substitution-generated sentence storage unit 109 as the final substitution-generated sentences.
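The flow of steps S20 through S29 (generation followed by duplicate elimination) can be pictured with the sketch below; representing the search results as a mapping from each IND sentence to its target word and replacement words, and using plain string replacement while ignoring the conjugation and ending adjustment of step S21, are assumptions for illustration.

```python
# Generate substitution sentences for every IND sentence (S20-S23), then drop
# duplicates (S24-S27) and keep the rest as the final generated sentences (S29).
def generate_substitutions(search_results):
    """search_results: {ind_sentence: (target_word, [replacement_words])}."""
    primary = []                                   # primary generated sentences (S22)
    for ind_sentence, (target, words) in search_results.items():
        for w in words:
            primary.append(ind_sentence.replace(target, w))
    final, seen = [], set()
    for sentence in primary:
        if sentence in seen:                       # duplicate -> discard (S25/S26)
            continue
        seen.add(sentence)
        final.append(sentence)                     # keep (S27)
    return final                                   # final substitution-generated sentences

results = {"7時にアラームを設定して": ("アラーム", ["会議", "参加者", "会議"])}
print(generate_substitutions(results))
# -> ['7時に会議を設定して', '7時に参加者を設定して']  (the duplicate is dropped)
```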
In step S30, the filtering processing unit 110 executes the filtering process to classify the corpus of newly generated substitution-generated sentences stored in the substitution-generated sentence storage unit 109 into corpora of OOD determination sentences, COOD determination sentences, IND determination sentences, and CIND determination sentences. The filtering process is described in detail later with reference to the flowchart of FIG. 21.

Through the above processing, new corpora can be generated as substitution-generated sentences by word substitution based on IND sentences.

The above processing also produces Japanese OOD candidate sentences such as those shown in FIG. 15 and English OOD candidate sentences such as those shown in FIG. 19.
<Filtering process>
Next, the filtering process performed by the filtering processing unit 110 will be described with reference to the flowchart of FIG. 21.
In step S31, the semantic analyzer 131 accepts, as the corpus to be processed, a corpus of unprocessed substitution-generated sentences from among the corpora of substitution-generated sentences stored in the substitution-generated sentence storage unit 109.

In step S32, the semantic analyzer 131 determines whether the corpus of substitution-generated sentences to be processed is an IND determination sentence. If it is determined in step S32 to be an IND determination sentence, the process proceeds to step S33.

In step S33, the semantic analyzer 131 stores the corpus of substitution-generated sentences to be processed in the IND determination sentence storage unit 132.

If, on the other hand, it is determined in step S32 not to be an IND determination sentence, that is, the corpus of substitution-generated sentences to be processed is regarded as an OOD determination sentence, the process proceeds to step S34.

In step S34, the semantic analyzer 131 regards the substitution-generated sentence to be processed as an OOD determination sentence and stores it in the OOD determination sentence storage unit 136.

In step S35, the semantic analyzer 131 determines whether any unprocessed substitution-generated sentence remains in the substitution-generated sentence storage unit 109; if so, the process returns to step S31 and the subsequent processing is repeated. That is, until no unprocessed input sentence remains, every substitution-generated sentence is judged as to whether it is an IND determination sentence, the IND determination sentences are stored in the IND determination sentence storage unit 132, and the remaining OOD determination sentences are stored in the OOD determination sentence storage unit 136.
When it is determined in step S35 that no unprocessed substitution-generated sentence remains, the process proceeds to step S36. That is, by the processing so far, the group of substitution-generated sentences stored in the substitution-generated sentence storage unit 109 has been classified by the semantic analyzer 131, which was generated by learning with the old version of the corpus, into IND determination sentences and OOD determination sentences, which are stored in the IND determination sentence storage unit 132 and the OOD determination sentence storage unit 136, respectively.
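The classification loop of steps S31 through S35 can be sketched as below; the semantic analyzer is stubbed out as any callable that returns True for sentences it can interpret in-domain, and the toy keyword test standing in for it is an assumption for illustration.

```python
# Split substitution-generated sentences into IND and OOD determination
# sentences using the (old-version) semantic analyzer `is_in_domain`.
def split_by_analyzer(generated_sentences, is_in_domain):
    ind_store, ood_store = [], []            # storage units 132 and 136
    for sentence in generated_sentences:     # S31
        if is_in_domain(sentence):           # S32
            ind_store.append(sentence)       # S33
        else:
            ood_store.append(sentence)       # S34
    return ind_store, ood_store

# toy stand-in for the analyzer learned from the old corpus
known_words = {"アラーム", "タイマ", "会議"}
ind, ood = split_by_analyzer(
    ["7時に会議を設定して", "7時に人生を設定して"],
    lambda s: any(w in s for w in known_words))
print(ind)  # -> ['7時に会議を設定して']
print(ood)  # -> ['7時に人生を設定して']
```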
In step S36, the COOD corpus extraction unit 133 executes the COOD corpus extraction process to extract COOD determination sentence candidates from the corpora regarded as IND determination sentences, stores them in the COOD determination sentence storage unit 134, and stores the remaining IND determination sentences in the confirmed IND determination sentence storage unit 135 as confirmed IND determination sentences. At this time, among the corpora regarded as IND determination sentences, corpora regarded as non-sentences are treated as discard determination sentences and discarded.

The COOD corpus extraction process is described in detail later with reference to the flowchart of FIG. 22.

In step S37, the CIND corpus extraction unit 137 executes the CIND corpus extraction process to extract CIND determination sentences from the corpora regarded as OOD determination sentences, stores them in the CIND determination sentence storage unit 138, and stores the remaining OOD determination sentences in the confirmed OOD determination sentence storage unit 139 as confirmed OOD determination sentences. At this time, among the corpora regarded as OOD determination sentences, corpora regarded as non-sentences are treated as discard determination sentences and discarded.

The CIND corpus extraction process is described in detail later with reference to the flowchart of FIG. 26.
Through the above processing, corpora regarded as COOD determination sentences, confirmed IND determination sentences, CIND determination sentences, and confirmed OOD determination sentences can be generated efficiently and in large quantities.

After this processing, manual confirmation work is still required for the corpora regarded as COOD determination sentences, confirmed IND determination sentences, CIND determination sentences, and confirmed OOD determination sentences; however, because they have already been classified into one of these four categories, the load of the confirmation work can be reduced, and as a result the development cost of the corpus can be reduced. In addition, the frame estimation unit 13 and the semantic analysis unit 14 can improve their recognition accuracy by learning with the corpora of the generated COOD determination sentences and CIND determination sentences.
<COOD corpus extraction process>
Next, the COOD corpus extraction process will be described with reference to the flowchart of FIG. 22. It is desirable to extract the COOD determination sentences from the substitution-generated sentences manually, but to improve work efficiency the COOD candidate sentences can be narrowed down further by the following filtering-based COOD corpus extraction process.
In step S51, the COOD corpus extraction unit 133 accepts input of one of the unprocessed corpora serving as IND determination sentences stored in the IND determination sentence storage unit 132 and sets it as the corpus to be processed.

In step S52, the COOD corpus extraction unit 133 controls the non-sentence determination unit 133a to calculate the perplexity value of the corpus to be processed.

Here, the perplexity value represents the average branching factor obtained when the number of branches (the number of candidates) for the word following a given word is expressed as the reciprocal of the n-gram probability. That is, compared with a sentence generated by combining words at random, a meaningful sentence has higher joint probabilities between its words and a smaller branching factor between adjacent words, so its perplexity value is low. Conversely, for a meaningless sentence the joint probabilities between words are low and the branching factor between adjacent words is high, so its perplexity value is high.
Indeed, some of the sentences generated by word substitution do not make sense. For example, "この近くにある評判のいいレストラン教えて" (tell me a well-reviewed restaurant near here) makes sense, but the sentence obtained by replacing "レストラン" (restaurant) with "責任" (responsibility), "この近くにある評判のいい責任教えて", is hard to interpret naturally and feels odd. Put differently, the word "責任" is unlikely to appear following words such as "評判" (reputation) and "いい" (good).

The same holds for English. The sentence "break phone number again", created by replacing "repeat" in "repeat phone number again" with "break", is likewise not natural in meaning. The probability that "phone number" follows "break" is extremely low. If the inter-word connection probabilities (n-grams) have been trained in advance on a large amount of text (training data), the probabilistic validity of a generated sentence can be judged.

That is, the perplexity value can be regarded as an index for judging the probabilistic validity of a generated sentence.
A concrete way of computing the perplexity value of a generated sentence is, for example, as follows. For details of how to compute the perplexity value, see Daniel Jurafsky, "Language Modeling with N-grams", Chapter 4, https://web.stanford.edu/~jurafsky/slp3/4.pdf, 2016.

A probabilistic language model models the joint probability P(w) of a word sequence (sentence), based on the idea that word sequences are generated probabilistically. There are various ways to model the joint probability P(w); modeling with the following n-grams is shown here as an example.
Here, expression (1) is the n-gram probability model for the bi-gram case (n = 2), and expression (2) is the n-gram probability model for the tri-gram case (n = 3).
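Expressions (1) and (2) themselves are not reproduced in this text; assuming they follow the standard n-gram factorization described here and in the cited chapter, they would take the following form.

```latex
% Assumed forms of expressions (1) and (2): standard n-gram factorizations of
% the joint probability of a word sequence w_1 ... w_N.
% Bi-gram (n = 2):
P(w_1, \ldots, w_N) \approx \prod_{i=1}^{N} P(w_i \mid w_{i-1})
% Tri-gram (n = 3):
P(w_1, \ldots, w_N) \approx \prod_{i=1}^{N} P(w_i \mid w_{i-2}, w_{i-1})
```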
In learning the language model, the above n-gram parameters, that is, the n-gram probabilities of words, are learned using a large amount of training text (training data) such as text from Internet sites and news articles.

Such learning suffers from a sparseness problem: because many word sequences never appear in the corpus, most n-gram counts are zero. Smoothing (language modeling smoothing) and back-off processing, which interpolate these counts with non-zero values, are therefore applied.

For smoothing and back-off processing, see Zhai & Lafferty, "A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval", 2001.
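As one concrete illustration of interpolating zero counts with non-zero values, the add-one (Laplace) smoothed bi-gram estimate below can be kept in mind. This is only a minimal example; the smoothing and back-off schemes actually intended are those surveyed in the Zhai & Lafferty reference and may differ.

```latex
% Add-one (Laplace) smoothing: an illustrative scheme, not necessarily the one used here
P_{\mathrm{add\text{-}1}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\, w_i) + 1}{C(w_{i-1}) + V}
\qquad (V = \text{vocabulary size})
```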
Using the n-gram model learned in this way, the non-sentence determination unit 133a calculates, for each generated corpus, the Perplexity value given by equation (3) below.
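Equation (3) is likewise given only as a figure. The Perplexity value discussed here is assumed to be the standard per-word perplexity of a word string W = w1 w2 ... wN under the learned n-gram model:

```latex
% Assumed form of equation (3): per-word perplexity under the n-gram model
\mathrm{PPL}(W) = P(w_1 w_2 \cdots w_N)^{-\tfrac{1}{N}}
               = \sqrt[N]{\frac{1}{\prod_{i=1}^{N} P(w_i \mid w_{i-n+1} \cdots w_{i-1})}}
```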
For example, as shown by example sentence 1) in example Ex41 of FIG. 23, the sentence "Tell me a responsibility with a good reputation near here" is unnatural and does not make sense. In this case, the n-gram probability of "responsibility" following "good", p(responsibility | good) = 1.90365e-05, is low, so the Perplexity value PLL is 80.4152.

Likewise, as shown by example sentence 2) in example Ex41 of FIG. 23, the sentence "Tell me a surfing with a good reputation near here" also does not make natural sense. In this case, however, the n-gram probability p(surfing | good) = 2.13532e-05, though low, yields a somewhat lower Perplexity value PLL of 70.6759.

Furthermore, as shown by example sentence 3) in example Ex41 of FIG. 23, the sentence "Tell me a store with a good reputation near here" makes relatively good sense. In this case, the n-gram probability p(store | good) = 0.000105223 is relatively high, so the Perplexity value PLL is 57.4806.

Also, as shown by example sentence 4) in example Ex41 of FIG. 23, the sentence "Tell me a massage with a good reputation near here" makes sense. In this case, the n-gram probability p(massage | good) = 0.000378552 is relatively high, so the Perplexity value PLL is 57.0273.
In this way, for a corpus consisting of sentences that make sense, the n-gram probabilities are higher and the Perplexity value is smaller.
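As a rough sketch of this non-sentence filter, the Python fragment below trains an add-one smoothed bi-gram model on a toy corpus, computes the per-word perplexity of substitution-generated candidate sentences, and flags for discarding those whose perplexity exceeds a threshold α. The training data, threshold value, and function names are illustrative assumptions, not the actual implementation.

```python
import math
from collections import Counter

def train_bigram_counts(sentences):
    """Count unigrams and bigrams over tokenized sentences (lists of words)."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams, vocab_size):
    """Add-one (Laplace) smoothed bigram probability; a stand-in for the
    smoothing/back-off schemes mentioned in the description."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def perplexity(words, unigrams, bigrams, vocab_size):
    """Per-word perplexity of one sentence under the bigram model."""
    padded = ["<s>"] + words + ["</s>"]
    log_prob = sum(math.log(bigram_prob(p, w, unigrams, bigrams, vocab_size))
                   for p, w in zip(padded, padded[1:]))
    return math.exp(-log_prob / (len(padded) - 1))

# Hypothetical training data and substitution-generated candidates.
training = [["teach", "me", "a", "reputable", "restaurant", "near", "here"],
            ["teach", "me", "a", "reputable", "store", "near", "here"],
            ["repeat", "phone", "number", "again"]]
unigrams, bigrams = train_bigram_counts(training)
V = len(unigrams)

ALPHA = 6.5  # threshold alpha; an illustrative value for this toy data only
for cand in [["teach", "me", "a", "reputable", "store", "near", "here"],
             ["teach", "me", "a", "reputable", "responsibility", "near", "here"]]:
    pll = perplexity(cand, unigrams, bigrams, V)
    label = "discard as non-sentence" if pll > ALPHA else "keep"
    print(" ".join(cand), "->", round(pll, 2), label)
```

The candidate with the out-of-vocabulary substitution tends to receive the higher perplexity, which is the behavior the threshold test in step S53 relies on.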
In step S53, the non-sentence determination unit 133a determines whether the processing target corpus is a non-sentence based on whether its calculated Perplexity value is larger than a predetermined threshold value α.

If, in step S53, the Perplexity value PLL is larger than the predetermined threshold value α, the processing proceeds to step S55.

In step S55, the non-sentence determination unit 133a regards the processing target corpus as a non-sentence, discards it as a discard determination sentence, and the processing proceeds to step S56.

If, on the other hand, the Perplexity value PLL is not larger than the predetermined threshold value α in step S53, that is, if the Perplexity value PLL is smaller than the threshold value α, the non-sentence determination unit 133a regards the processing target corpus as a sentence that makes sense, and the processing proceeds to step S54.

In step S54, the non-sentence determination unit 133a stores the processing target corpus.
In step S56, the COOD corpus extraction unit 133 determines whether any unprocessed corpus remains among the corpora of IND determination sentences stored in the IND determination sentence storage unit 132. If an unprocessed corpus remains, the processing returns to step S51. That is, steps S51 to S56 are repeated until the Perplexity value PLL has been calculated for every IND determination sentence and each has been judged, by comparison with the predetermined threshold value α, to be either a non-sentence or a corpus of sentences that make sense. When it is determined in step S56 that no unprocessed corpus remains, the processing proceeds to step S57.
In step S57, the COOD corpus extraction unit 133 receives, as the processing target corpus, one of the as yet unprocessed corpora of IND determination sentences that were judged by the non-sentence determination unit 133a in steps S52 and S53 to be corpora of sentences that make sense rather than non-sentences and were stored.

In step S58, the COOD corpus extraction unit 133 controls the non-appearance determination unit 133b to calculate the non-appearance, in the target domain, of the words included in the processing target corpus.

Non-appearance is an index indicating to what extent a generated corpus contains words that do not appear in the corpus group of the domain into which it was classified as an IND determination sentence by the semantic analyzer.
For example, the sentences (corpora) shown in FIG. 24 all contain words that appear with low frequency in the ALARM-CHANGE domain; that is, their non-appearance is high, and they are Close OOD sentences that include discard determination sentences. Here, however, non-sentences have already been excluded by the processing performed before the non-appearance is obtained, so what is extracted is substantially the COOD determination sentences. In the following, the words in quotation marks are the words with high non-appearance.

That is, in FIG. 24, "registry" in "change the alarm's 'registry'", "script" in "reset the 'script' to 8 o'clock", "meal" in "could you fix the alarm's 'meal'", "wording" in "please change the 7 a.m. 'wording' to 8 a.m.", "log file" in "please change the 'log file' that goes off at 6:30 a.m. to around 7", "system" in "change the 6 o'clock 'system' to 7", "idea" in "please change the alarm's 'idea' to 7 o'clock", "construction period" in "change the 5 p.m. 'construction period' to 5:30", "design" in "please change the 6:30 a.m. 'design' to 8:30 a.m.", "price" in "change the 7 a.m. 'price' to 8", and "menu" in "change the 7 o'clock 'menu' to 7:30" are all words with high non-appearance, so these sentences are regarded as COOD determination sentences.
This non-appearance can be obtained numerically, for example, from the number of words included in the processing target corpus that do not appear in the target domain.

For example, letting n be the total number of words included in the processing target corpus, which is an IND determination sentence, and letting no be the number of those words that do not appear in the domain of the IND determination sentences (that is, words that are not included in any corpus belonging to that domain other than the processing target corpus itself), the non-appearance determination unit 133b calculates no/n as a parameter representing non-appearance.
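A minimal sketch of this no/n parameter, assuming each corpus has already been tokenized into a list of words, might look like the following. The sample data, threshold, and names are hypothetical.

```python
def non_appearance(target_words, domain_corpora, target_index):
    """Parameter no/n: n is the number of words in the corpus under
    examination, no is the number of its words that appear in none of the
    other corpora of the same (IND-determined) domain."""
    other_words = set()
    for i, corpus in enumerate(domain_corpora):
        if i != target_index:
            other_words.update(corpus)
    n = len(target_words)
    no = sum(1 for w in target_words if w not in other_words)
    return no / n if n else 0.0

# Hypothetical tokenized corpora of an ALARM-CHANGE-like domain.
domain = [["change", "the", "alarm", "to", "8"],
          ["reset", "the", "alarm"],
          ["change", "the", "alarm", "registry"]]   # contains an unusual word
idx = 2
score = non_appearance(domain[idx], domain, idx)
BETA = 0.2   # threshold beta (illustrative value)
print("no/n =", score, "-> COOD candidate" if score > BETA else "-> confirmed IND")
```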
In step S59, the non-appearance determination unit 133b determines whether the parameter no/n, which represents the non-appearance of the words included in the processing target corpus in the domain of the IND determination sentences, is larger than a predetermined threshold value β.

If the parameter no/n is larger than the threshold value β in step S59, that is, if the non-appearance of the words included in the processing target corpus in the domain of the IND determination sentences is high, the processing proceeds to step S60.

In step S60, the non-appearance determination unit 133b extracts the processing target corpus as a COOD determination sentence and stores it in the COOD determination sentence storage unit 134.

If, on the other hand, the parameter no/n is not larger than the threshold value β in step S59, that is, if no/n is smaller than β and the non-appearance of the words included in the processing target corpus in the domain of the IND determination sentences is low, the processing proceeds to step S61.

In step S61, the non-appearance determination unit 133b regards the processing target corpus as a confirmed IND determination sentence and stores it in the confirmed IND determination sentence storage unit 135.

In step S62, the COOD corpus extraction unit 133 determines whether any unprocessed IND determination sentence remains; if so, the processing returns to step S57. That is, steps S57 to S62, including the processing of the non-appearance determination unit 133b in steps S58 and S59, are repeated until no unprocessed IND determination sentence remains.

When it is determined in step S62 that no unprocessed IND determination sentence remains, that is, that all the IND determination sentences have been processed, the processing ends.
Through the above processing, of the corpora constituting the domain of IND determination sentences, those that are not discarded as non-sentences on the basis of their Perplexity values and whose contained words have high non-appearance are regarded as COOD determination sentences and stored in the COOD determination sentence storage unit 134, while those that are not non-sentences and whose contained words have low non-appearance are regarded as confirmed IND determination sentences and stored in the confirmed IND determination sentence storage unit 135.
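Putting the two filters together, the overall flow of steps S51 to S62 can be sketched as follows. Here perplexity_fn and non_appearance_fn stand for routines like those illustrated above; the function signatures and argument names are assumptions rather than the actual interfaces of units 133a and 133b.

```python
def classify_ind_sentences(ind_corpora, perplexity_fn, non_appearance_fn,
                           alpha, beta):
    """Rough sketch of the COOD corpus extraction flow (steps S51-S62):
    discard non-sentences by perplexity, then split the remainder into COOD
    determination sentences and confirmed IND determination sentences by the
    non-appearance parameter."""
    kept, discarded = [], []
    for corpus in ind_corpora:                 # S51-S56: non-sentence filter
        if perplexity_fn(corpus) > alpha:
            discarded.append(corpus)           # S55: discard determination sentence
        else:
            kept.append(corpus)                # S54: keep
    cood, confirmed_ind = [], []
    for i, corpus in enumerate(kept):          # S57-S62: non-appearance filter
        if non_appearance_fn(corpus, kept, i) > beta:
            cood.append(corpus)                # S60: COOD determination sentence
        else:
            confirmed_ind.append(corpus)       # S61: confirmed IND determination sentence
    return cood, confirmed_ind, discarded
```

For example, the helpers sketched earlier could be passed in as perplexity_fn=lambda c: perplexity(c, unigrams, bigrams, V) and non_appearance_fn=non_appearance.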
In the above, an example in which non-appearance is expressed by the parameter no/n has been described, but any other value that can express non-appearance may be used; for example, TF (Term Frequency)/IDF (Inverse Document Frequency) values may be used.

Here, the TF value is an index for analyzing the words that characterize each document (in this case, each domain) when there are a plurality of documents (in this case, a plurality of domains), and is expressed by equation (4) below.
The IDF value is an index indicating whether each word is used in common across documents, and is expressed by equation (5) below.
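Equations (4) and (5) also appear only as figures. The standard TF, IDF, and TF/IDF definitions they are assumed to correspond to are:

```latex
% Assumed standard forms of equations (4) and (5), with D the set of documents (domains)
\mathrm{tf}(t, d) = \frac{n_{t,d}}{\sum_{k} n_{k,d}}, \qquad
\mathrm{idf}(t)  = \log \frac{|D|}{\lvert \{\, d \in D : t \in d \,\} \rvert}, \qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d)\cdot \mathrm{idf}(t)
```

Note that the TF values listed in example Ex52 of FIG. 25 look like raw counts, so the exact normalization in equation (4) may differ from the textbook form shown here.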
Further, when TF/IDF values are used, let nlw be the number of words whose frequency of appearance is smaller than a threshold β (0 ≤ β ≤ 1) in the list of important words that appear frequently and are concentrated in the target domain of the IND determination sentences, or the number of words that are not present in the important word list. In step S59, the non-appearance determination unit 133b then calculates the parameter nlw/n indicating non-appearance.

In step S59, the non-appearance determination unit 133b determines whether the parameter nlw/n, which represents the non-appearance of the words included in the processing target corpus in the predetermined domain, is larger than a predetermined threshold value γ.

If the parameter nlw/n is larger than the threshold value γ in step S59, the processing proceeds to step S60, and the non-appearance determination unit 133b extracts the processing target corpus as a COOD determination sentence and stores it in the COOD determination sentence storage unit 134.

If, on the other hand, the parameter nlw/n is not larger than the threshold value γ in step S59, that is, if nlw/n is smaller than γ, the processing proceeds to step S61, and the non-appearance determination unit 133b regards the processing target corpus as a confirmed IND determination sentence and stores it in the confirmed IND determination sentence storage unit 135.
<TF values and TF/IDF values>

For example, the TF values and TF/IDF values obtained from the example corpus of IND determination sentences of the ALARM-CHANGE domain shown in example Ex51 of FIG. 25 are shown in examples Ex52 and Ex53 of FIG. 25, respectively.
That is, the corpus group of example Ex51 consists, from the top, of "change the alarm to 8 o'clock", "will you edit the alarm", "reset the alarm", "change the alarm from 7 o'clock to 8 o'clock", "can you change the 7 o'clock alarm to 8 o'clock", "change the alarm to 8 o'clock", "change the wake-up alarm to 8 o'clock and change the wake-up time", "can you change the wake-up time", "change the wake-up alarm", "set the 10 o'clock alarm to 7 o'clock", "change the 10 o'clock alarm to 11 o'clock", ..., "I want to change the alarm set for 6 o'clock to 5:30", "change the setting so that the alarm that rings at 7 every day rings at 6:30", "change the setting so that the alarm that rings at 7 every time rings at 6:30", and "edit the Saturday and Sunday morning alarms".
The TF values of the words in the corpus group of example Ex51 are, in descending order of TF value as shown in example Ex52 in the figure: 変更 (change) 351, 7 334, 8 260, 変え (change) 258, 時間 (time) 220, 設定 (set) 159, 6 152, し (do) 148, 目覚まし (wake-up alarm) 110, 午前 (a.m.) 64, セット (set) 56, 願い (request) 55, ..., 目覚し (wake-up) 6, and 目覚まし時計 (alarm clock) 5.

The TF/IDF values of the words in the corpus group of example Ex51 are, in descending order of TF/IDF value as shown in example Ex53 in the figure: アラーム (alarm) 0.00504379225545, セット (set) 0.00328857409316, 起こ (wake) 0.00110030484831, 明日 (tomorrow) 0.000795915410064, 目覚まし (wake-up alarm) 0.000763298323913, 起き (get up) 0.000622699996573, 鳴ら (ring) 0.00060708690425, 朝 (morning) 0.000521901615019, 設定 (set) 0.000466290476509, かけ (set) 0.000336399933349, 鳴 (ring) 0.000297198910881, めざまし (wake-up) 0.000223592438318, おこ (wake) 0.000196205208903, 目覚まし時計 (alarm clock) 0.000185042017918, 起床 (rising) 0.000175552029018, 願い (request) 0.000124433590006, 時間 (time) 0.000107951491631, 午前 (a.m.) 0.000102767940411, and so on.
That is, a word with a high TF value or TF/IDF value can be considered a word that appears frequently (has low non-appearance) and is highly important. Accordingly, among the corpora included in the IND determination sentences, a corpus containing many words whose TF value or TF/IDF value is at or below a certain threshold is likely to be a COOD determination sentence. The COOD determination sentences are therefore obtained as the corpora, among those included in the IND determination sentences, that contain many words not belonging to the group of words with high TF/IDF values.
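A sketch of this TF/IDF-based variant (the nlw/n parameter) might look like the following. The top-k cutoff, raw-count TF, and function names are assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_important_words(domain_docs, all_domain_docs, top_k=20):
    """Build an important-word list for one domain by TF/IDF.
    TF is taken as a raw count over the domain's own corpora, IDF over all
    domains; these follow the common definitions assumed for equations (4)
    and (5), which may differ from the exact forms in the figures."""
    tf = Counter(w for doc in domain_docs for w in doc)
    n_domains = len(all_domain_docs)
    def idf(word):
        df = sum(1 for docs in all_domain_docs if any(word in d for d in docs))
        return math.log(n_domains / df) if df else 0.0
    scored = {w: c * idf(w) for w, c in tf.items()}
    return set(sorted(scored, key=scored.get, reverse=True)[:top_k])

def nlw_ratio(sentence_words, important_words):
    """nlw/n: fraction of the sentence's words that are not on the domain's
    important-word list (used as a non-appearance measure)."""
    n = len(sentence_words)
    nlw = sum(1 for w in sentence_words if w not in important_words)
    return nlw / n if n else 0.0
```

A sentence whose nlw/n exceeds the threshold γ would then be treated as a COOD determination sentence, exactly as with the no/n parameter.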
<CIND corpus extraction processing>

Next, the CIND corpus extraction processing will be described with reference to the flowchart of FIG. 26.
In step S101, the CIND corpus extraction unit 137 receives, as the processing target corpus, one of the as yet unprocessed corpora of OOD determination sentences stored in the OOD determination sentence storage unit 136.

In step S102, the CIND corpus extraction unit 137 controls the non-sentence determination unit 137a to calculate the Perplexity value of the processing target corpus.

In step S103, the non-sentence determination unit 137a determines whether the processing target corpus is a non-sentence based on whether its calculated Perplexity value is larger than the predetermined threshold value α.

If, in step S103, the Perplexity value PLL is larger than the predetermined threshold value α, the processing proceeds to step S105.

In step S105, the non-sentence determination unit 137a regards the processing target corpus as a non-sentence, discards it as a discard determination sentence, and the processing proceeds to step S106.

If, on the other hand, the Perplexity value PLL is not larger than the threshold value α in step S103, that is, if the Perplexity value PLL is smaller than α, the non-sentence determination unit 137a regards the processing target corpus as a sentence that makes sense, and the processing proceeds to step S104.

In step S104, the non-sentence determination unit 137a stores the processing target corpus.
In step S106, the CIND corpus extraction unit 137 determines whether any unprocessed corpus remains among the corpora of OOD determination sentences stored in the OOD determination sentence storage unit 136. If an unprocessed corpus remains, the processing returns to step S101. That is, steps S101 to S106 are repeated until the Perplexity value PLL has been calculated for every OOD determination sentence and each has been judged, by comparison with the predetermined threshold value α, to be either a non-sentence or a corpus of sentences that make sense. When it is determined in step S106 that no unprocessed corpus remains, the processing proceeds to step S107.

In step S107, the CIND corpus extraction unit 137 receives, as the processing target corpus, one of the as yet unprocessed corpora of OOD determination sentences that were judged by the non-sentence determination unit 137a in steps S102 and S103 to be corpora of sentences that make sense rather than non-sentences and were stored.
In step S108, the CIND corpus extraction unit 137 controls the non-appearance determination unit 137b to calculate the parameter no/n representing the non-appearance, in the target domain, of the words included in the processing target corpus.

In step S109, the non-appearance determination unit 137b determines whether the parameter no/n, which represents the non-appearance of the words included in the processing target corpus in the predetermined domain, is larger than the predetermined threshold value β.

If the parameter no/n is larger than the threshold value β in step S109, the processing proceeds to step S110.

In step S110, the non-appearance determination unit 137b regards the processing target corpus as a confirmed OOD determination sentence and stores it in the confirmed OOD determination sentence storage unit 139.

If, on the other hand, the parameter no/n is not larger than the threshold value β in step S109, that is, if no/n is smaller than β, the processing proceeds to step S111.

In step S111, the non-appearance determination unit 137b regards the processing target corpus as a CIND determination sentence and stores it in the CIND determination sentence storage unit 138.

In step S112, the CIND corpus extraction unit 137 determines whether any unprocessed OOD determination sentence remains; if so, the processing returns to step S107. That is, steps S107 to S112 are repeated until no unprocessed OOD determination sentence remains.

When it is determined in step S112 that no unprocessed OOD determination sentence remains, that is, that all the OOD determination sentences have been processed, the processing ends.
Through the above processing, of the corpora of OOD determination sentences, those that are not discarded as non-sentences on the basis of their Perplexity values and whose contained words have low non-appearance are regarded as CIND determination sentences and stored in the CIND determination sentence storage unit 138, while those that are not non-sentences and whose contained words have high non-appearance are regarded as confirmed OOD determination sentences and stored in the confirmed OOD determination sentence storage unit 139. Note that, since a CIND determination sentence alone does not indicate which domain it belongs to, that judgment must ultimately be made manually; however, because the narrowing-down for that purpose can be automated, the manual workload can be reduced.
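The CIND flow mirrors the COOD flow with the non-appearance decision inverted. A compact sketch, reusing the same assumed helper functions as before, might be:

```python
def classify_ood_sentences(ood_corpora, perplexity_fn, non_appearance_fn,
                           alpha, beta):
    """Rough sketch of the CIND corpus extraction flow (steps S101-S112):
    the perplexity filter is the same as for COOD extraction, but high
    non-appearance now means the sentence really is out of domain
    (confirmed OOD), while low non-appearance means it lies near the
    domain boundary (CIND). Function arguments are assumptions."""
    kept = [c for c in ood_corpora if perplexity_fn(c) <= alpha]   # S101-S106
    confirmed_ood, cind = [], []
    for i, corpus in enumerate(kept):                              # S107-S112
        if non_appearance_fn(corpus, kept, i) > beta:
            confirmed_ood.append(corpus)                           # S110
        else:
            cind.append(corpus)                                    # S111
    return cind, confirmed_ood
```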
In the above, an example in which the parameter representing non-appearance is expressed by no/n has been described, but any other value that can express non-appearance may be used; for example, TF (Term Frequency)/IDF (Inverse Document Frequency) values may be used.

Further, when TF/IDF values are used, let nlw be the number of words whose frequency of appearance is smaller than the threshold β (0 ≤ β ≤ 1) in the list of important words that appear frequently and are concentrated in the target domain of the IND determination sentences, or the number of words that are not present in the important word list. In step S108, the non-appearance determination unit 137b then calculates the parameter nlw/n indicating non-appearance.

In step S108, the non-appearance determination unit 137b determines whether the parameter nlw/n, which represents the non-appearance of the words included in the processing target corpus in the predetermined domain, is larger than the predetermined threshold value γ.

If the parameter nlw/n is larger than the threshold value γ in step S108, the processing proceeds to step S110, and the CIND corpus extraction unit 137 regards the processing target corpus as a confirmed OOD determination sentence and stores it in the confirmed OOD determination sentence storage unit 139.

If, on the other hand, the parameter nlw/n is not larger than the threshold value γ in step S108, that is, if nlw/n is smaller than γ, the processing proceeds to step S111, and the non-appearance determination unit 137b regards the processing target corpus as a CIND determination sentence and stores it in the CIND determination sentence storage unit 138.
Through the above processing, many corpora can be generated from IND sentences produced by a predetermined means without requiring manual work. This makes it possible to reduce the burden of corpus development and thus to reduce the development cost.

Furthermore, the corpora are generated already classified into IND determination sentences (confirmed IND determination sentences), COOD determination sentences, CIND determination sentences, and OOD determination sentences (confirmed OOD determination sentences).

As a result, by training the frame estimation unit 13 and the semantic analysis unit 14 with a corpus containing more COOD determination sentences and CIND determination sentences, learning can be performed with corpora distributed near the boundary between IND determination sentences and OOD determination sentences. Even confusing expressions that lie near that boundary can then be recognized appropriately, and the accuracy of the semantic analysis unit can be improved.

In practice, the generated corpora are expected to require manual confirmation work, but since they are classified into IND determination sentences (confirmed IND determination sentences), COOD determination sentences, CIND determination sentences, and OOD determination sentences (confirmed OOD determination sentences), and non-sentences and duplicate sentences have already been discarded, the burden of the confirmation work can be reduced, and as a result the corpus development cost can be reduced.

Note that the order of the processes described above may be changed; for example, the COOD corpus extraction processing in step S36 and the CIND corpus extraction processing in step S37 may be interchanged. Likewise, in the COOD corpus extraction processing and the CIND corpus extraction processing, the non-sentence determination processing using Perplexity values and the extraction of COOD determination sentences and CIND determination sentences using the parameter representing non-appearance may be performed in either order.
<Example of execution by software>

The series of processes described above can be executed by hardware, but can also be executed by software. When the series of processes is executed by software, a program constituting the software is installed from a recording medium into a computer incorporated in dedicated hardware, or into, for example, a general-purpose personal computer capable of executing various functions by installing various programs.
FIG. 27 shows a configuration example of a general-purpose personal computer. This personal computer incorporates a CPU (Central Processing Unit) 1001. An input/output interface 1005 is connected to the CPU 1001 via a bus 1004. A ROM (Read Only Memory) 1002 and a RAM (Random Access Memory) 1003 are also connected to the bus 1004.

Connected to the input/output interface 1005 are an input unit 1006 consisting of input devices such as a keyboard and a mouse with which the user inputs operation commands, an output unit 1007 that outputs processing operation screens and images of processing results to a display device, a storage unit 1008 consisting of a hard disk drive or the like that stores programs and various data, and a communication unit 1009 consisting of a LAN (Local Area Network) adapter or the like that executes communication processing via a network typified by the Internet. A drive 1010 that reads and writes data to and from removable media 1011 such as magnetic disks (including flexible disks), optical disks (including CD-ROMs (Compact Disc-Read Only Memory) and DVDs (Digital Versatile Discs)), magneto-optical disks (including MDs (Mini Discs)), and semiconductor memories is also connected.

The CPU 1001 executes various processes in accordance with a program stored in the ROM 1002, or a program read from removable media 1011 such as a magnetic disk, optical disk, magneto-optical disk, or semiconductor memory, installed in the storage unit 1008, and loaded from the storage unit 1008 into the RAM 1003. The RAM 1003 also stores, as appropriate, data necessary for the CPU 1001 to execute the various processes.
In the computer configured as described above, the series of processes described above is performed, for example, by the CPU 1001 loading the program stored in the storage unit 1008 into the RAM 1003 via the input/output interface 1005 and the bus 1004 and executing it.

The program executed by the computer (CPU 1001) can be provided, for example, recorded on removable media 1011 as packaged media or the like. The program can also be provided via a wired or wireless transmission medium such as a local area network, the Internet, or digital satellite broadcasting.

In the computer, the program can be installed in the storage unit 1008 via the input/output interface 1005 by mounting the removable media 1011 in the drive 1010. The program can also be received by the communication unit 1009 via a wired or wireless transmission medium and installed in the storage unit 1008. In addition, the program can be installed in advance in the ROM 1002 or the storage unit 1008.

Note that the program executed by the computer may be a program whose processing is performed in time series in the order described in this specification, or a program whose processing is performed in parallel or at necessary timing, such as when a call is made.
The CPU 1001 in FIG. 27 realizes the functions of the semantic analyzer 131, the COOD corpus extraction unit 133, and the CIND corpus extraction unit 137. The storage unit 1008 realizes the IND determination sentence storage unit 132, the OOD determination sentence storage unit 136, the COOD determination sentence storage unit 134, the confirmed IND determination sentence storage unit 135, the CIND determination sentence storage unit 138, and the confirmed OOD determination sentence storage unit 139.

In this specification, a system means a set of a plurality of components (devices, modules (parts), and the like), and it does not matter whether all the components are in the same housing. Therefore, a plurality of devices housed in separate housings and connected via a network, and a single device in which a plurality of modules are housed in one housing, are both systems.
The embodiments of the present disclosure are not limited to the embodiments described above, and various modifications are possible without departing from the gist of the present disclosure.

For example, the present disclosure can adopt a cloud computing configuration in which one function is shared and processed jointly by a plurality of devices via a network.

Each step described in the flowcharts above can be executed by one device or shared among a plurality of devices.

Furthermore, when a plurality of processes are included in one step, the plurality of processes included in that one step can be executed by one device or shared among a plurality of devices.
The present disclosure can also have the following configurations.

<1> An information processing device including: a structural analysis unit that analyzes the structure of an input sentence; a replacement location setting unit that sets a replacement location in the input sentence on the basis of an analysis result of the structural analysis unit; and a corpus generation unit that generates a corpus by replacing the word at the replacement location in the input sentence.

<2> The information processing device according to <1>, in which the input sentence is an IND (In Domain) determination sentence, that is, utterance content that a predetermined application program should handle.

<3> The information processing device according to <1> or <2>, in which the structural analysis unit analyzes the predicate-argument structure of the input sentence, and the replacement location setting unit sets the replacement location in the input sentence on the basis of the predicate-argument structure obtained as the analysis result of the structural analysis unit.

<4> The information processing device according to <3>, further including a dictionary query unit that queries a dictionary to search for candidates for replacing the word at the replacement location in the input sentence, in which the corpus generation unit replaces the word at the replacement location in the input sentence with a word retrieved by the dictionary query unit.

<5> The information processing device according to <4>, in which the dictionary is a case frame dictionary.

<6> The information processing device according to <4>, in which the replacement location setting unit sets the replacement location in the input sentence and a replacement method for the replacement location on the basis of the predicate-argument structure obtained as the analysis result of the structural analysis unit, and the corpus generation unit generates the corpus by replacing the word at the replacement location in the input sentence by the replacement method.

<7> The information processing device according to <6>, in which the replacement method includes a first method that fixes the predicate of the input sentence and replaces a noun serving as a predicate argument including the object case, and a second method that fixes the predicate argument including the object case of the input sentence and replaces the predicate.

<8> The information processing device according to any one of <1> to <7>, further including a classification unit that classifies the corpus generated by the corpus generation unit into an IND (In Domain) determination sentence, which is utterance content that a predetermined application program should handle, or an OOD (Out of Domain) determination sentence, which is unexpected utterance content that the predetermined application program should not handle.

<9> The information processing device according to <8>, further including a COOD determination sentence extraction unit that extracts, from the corpora classified as IND determination sentences, corpora that are OOD determination sentences and that lie near the boundary with the IND determination sentences in a feature space expressed by their respective features, as COOD (Close OOD) determination sentences.

<10> The information processing device according to <9>, in which the COOD determination sentence extraction unit extracts from a domain including the corpora classified as IND determination sentences, as the COOD determination sentences, corpora containing more than a predetermined number of words that are not included in any corpus of the domain other than the corpus itself.

<11> The information processing device according to <10>, in which the COOD determination sentence extraction unit extracts from the domain, as the COOD determination sentences, corpora whose non-appearance, expressed as the ratio of the number of words not included in any corpus of the domain other than the corpus itself to the number of words included in the corpus, is higher than a predetermined value.

<12> The information processing device according to <10>, in which the COOD determination sentence extraction unit extracts from the domain, as the COOD determination sentences, corpora of the domain that contain many words whose TF/IDF value, formed from a TF (Term Frequency) value and an IDF (Inverse Document Frequency) value and used as a measure of non-appearance, is lower than a predetermined value.

<13> The information processing device according to <10>, in which the COOD determination sentence extraction unit calculates a Perplexity value for each corpus of the domain and discards, as non-sentences, corpora whose Perplexity value is higher than a predetermined value.

<14> The information processing device according to <8>, further including a CIND determination sentence extraction unit that extracts, from the corpora classified as OOD determination sentences, corpora that are IND determination sentences and that lie near the boundary with the OOD determination sentences in a feature space expressed by their respective features, as CIND (Close IND) determination sentences.

<15> The information processing device according to <14>, in which the CIND determination sentence extraction unit extracts, from all the corpora classified as OOD determination sentences, as the CIND determination sentences, corpora in which, within the domain including the corpora classified as OOD determination sentences, the number of their words also included in corpora classified as OOD determination sentences other than the corpus itself is larger than a predetermined number.

<16> The information processing device according to <15>, in which the CIND determination sentence extraction unit extracts, as the CIND determination sentences, corpora whose non-appearance, expressed as the ratio of the number of words not included in any corpus other than the corpus itself to the number of words included in the corpus of the domain, is lower than a predetermined value.

<17> The information processing device according to <15>, in which the CIND determination sentence extraction unit extracts, as the CIND determination sentences, corpora of the domain in which the non-appearance of words, represented by a TF/IDF value formed from a TF (Term Frequency) value and an IDF (Inverse Document Frequency) value, is lower than a predetermined value.

<18> The information processing device according to <15>, in which the CIND determination sentence extraction unit calculates a Perplexity value for each corpus of the domain and discards, as non-sentences, corpora whose Perplexity value is higher than a predetermined value.

<19> An information processing method including the steps of: analyzing the structure of an input sentence; setting a replacement location in the input sentence on the basis of a result of the structural analysis; and generating a corpus by replacing the word at the replacement location in the input sentence.

<20> A program that causes a computer to function as: a structural analysis unit that analyzes the structure of an input sentence; a replacement location setting unit that sets a replacement location in the input sentence on the basis of an analysis result of the structural analysis unit; and a corpus generation unit that generates a corpus by replacing the word at the replacement location in the input sentence.
<1> 入力文の構造を解析する構造解析部と、
前記構造解析部の解析結果に基づいて、前記入力文における置換箇所を設定する置換箇所設定部と、
前記入力文における前記置換箇所の単語を置換してコーパスを生成するコーパス生成部と
を含む情報処理装置。
<2> 前記入力文は、所定のアプリケーションプログラムで扱うべき発話内容であるIND(In Domain)判定文である
<1>に記載の情報処理装置。
<3> 前記構造解析部は、前記入力文の述語項構造を解析する
前記置換箇所設定部は、前記構造解析部の解析結果である前記述語項構造に基づいて、前記入力文における置換箇所を設定する
<1>または<2>に記載の情報処理装置。
<4> 前記入力文における前記置換箇所の単語を置換する候補を、辞書を照会して検索する辞書照会部をさらに含み、
前記コーパス生成部は、前記辞書照会部により検索された単語で、前記入力文における前記置換箇所の単語を置換する
<3>のいずれかに記載の情報処理装置。
<5> 前記辞書は、格フレーム辞書である
<4>に記載の情報処理装置。
<6> 前記置換箇所設定部は、前記構造解析部の解析結果である前記述語項構造に基づいて、前記入力文における置換箇所と、前記置換箇所の置換方式を設定し、
前記コーパス生成部は、前記入力文における前記置換箇所の単語を、前記置換方式で置換してコーパスを生成する
<4>に記載の情報処理装置。
<7> 前記置換方式は、前記入力文の述部を固定し、かつ、対象格を含む述語項となる名詞を置換する第1の方式と、前記入力文の対象格を含む述語項を固定し、かつ、述部を置換する第2の方式とを含む
<6>に記載の情報処理装置。
<8> 前記コーパス生成部により生成されたコーパスを、所定のアプリケーションプログラムで扱うべき発話内容であるIND(In Domain)判定文、または、所定のアプリケーションプログラムで扱うべきではない想定外の発話内容であるOOD(Out of Domain)判定文に分類する分類部をさらに含む
<1>乃至<7>のいずれかに記載の情報処理装置。
<9> 前記OOD判定文であって、かつ、前記IND判定文との、それぞれの素性で表現される特徴空間における境界近傍に存在するコーパスをCOOD(Close OOD)判定文として、前記IND判定文として分類されたコーパスより抽出するCOOD判定文抽出部をさらに含む
<8>に記載の情報処理装置。
<10> 前記COOD判定文抽出部は、前記IND判定文として分類されたコーパスを含むドメインにおいて、自ら及び他のコーパスに含まれない単語数が所定数より多いコーパスを、前記ドメインより前記COOD判定文として抽出する
<9>に記載の情報処理装置。
<11> 前記COOD判定文抽出部は、前記ドメインのコーパスに含まれる単語数に対する、前記自ら及び他のコーパスに含まれない単語数の割合で表される非出現性が所定値より高いコーパスを、前記ドメインより前記COOD判定文として抽出する
<10>に記載の情報処理装置。
<12> 前記COOD判定文抽出部は、前記ドメインのコーパスのうち、TF(Term Frequency)値とIDF(Inverse Document Frequency)値とからなる単語のTF/IDFで表される非出現性が所定値より低い単語を多く含むコーパスを、前記ドメインより前記COOD判定文として抽出する
<10>に記載の情報処理装置。
<13> 前記COOD判定文抽出部は、前記ドメインのコーパスにおけるPerplexity値を算出し、前記Perplexity値が所定値よりも高いものを非文として廃棄する
<10>に記載の情報処理装置。
<14> 前記IND判定文であって、かつ、前記OOD判定文との、それぞれの素性で表現される特徴空間における境界近傍に存在するコーパスをCIND(Close IND)判定文として、前記OOD判定文として分類されたコーパスより抽出するCIND判定文抽出部をさらに含む
<8>に記載の情報処理装置。
<15> 前記CIND判定文抽出部は、前記OOD判定文として分類されたコーパスを含むドメインにおいて、自ら以外の他のOOD判定文に分類されたコーパスに含まれる単語数が所定数より多いコーパスを、前記OOD判定文として分類された全コーパスより前記CIND判定文として抽出する
<14>に記載の情報処理装置。
<16> 前記CIND判定文抽出部は、前記ドメインのコーパスに含まれる単語数に対する、前記自ら以外の他のコーパスに含まれない単語数の割合で表される非出現性が所定数より低いコーパスを、前記CIND判定文として抽出する
<15>に記載の情報処理装置。
<17> 前記CIND判定文抽出部は、前記ドメインのコーパスのうち、TF(Term Frequency)値とIDF(Inverse Document Frequency)値とからなるTF/IDFで表される単語の非出現性が所定数より低いコーパスを、前記CIND判定文として抽出する
<15>に記載の情報処理装置。
<18> 前記CIND判定文抽出部は、前記ドメインのコーパスにおけるPerplexity値を算出し、前記Perplexity値が所定値よりも高いものを非文として廃棄する
<15>に記載の情報処理装置。
<19> 入力文の構造を解析し、
前記構造の解析結果に基づいて、前記入力文における置換箇所を設定し、
前記入力文における前記置換箇所の単語を置換してコーパスを生成する
ステップを含む情報処理方法。
<20> 入力文の構造を解析する構造解析部と、
前記構造解析部の解析結果に基づいて、前記入力文における置換箇所を設定する置換箇所設定部と、
前記入力文における前記置換箇所の単語を置換してコーパスを生成するコーパス生成部と
してコンピュータを機能させるプログラム。 The present disclosure can also have the following configurations.
<1> A structural analysis unit that analyzes the structure of input sentences;
A replacement part setting unit configured to set a replacement part in the input sentence based on an analysis result of the structure analysis unit;
An information processing apparatus including: a corpus generation unit that generates a corpus by replacing words in the replacement portion in the input sentence;
<2> The information processing apparatus according to <1>, wherein the input sentence is an IND (In Domain) determination sentence which is an utterance content to be handled by a predetermined application program.
<3> The structure analysis unit analyzes a predicate term structure of the input sentence. The replacement point setting unit is a replacement point in the input sentence based on the predicate term structure that is an analysis result of the structure analysis unit. The information processing apparatus according to <1> or <2>.
<4> The information processing apparatus further includes a dictionary query unit that queries a dictionary to search for a candidate for replacing the word of the replacement part in the input sentence,
The information processing apparatus according to any one of <3>, wherein the corpus generation unit replaces the word of the replacement portion in the input sentence with the word searched by the dictionary inquiry unit.
<5> The information processing apparatus according to <4>, wherein the dictionary is a case frame dictionary.
<6> The replacement point setting unit sets a replacement point in the input sentence and a replacement method of the replacement point based on the predicate term structure which is an analysis result of the structure analysis unit,
The information processing apparatus according to <4>, wherein the corpus generation unit generates a corpus by replacing the word of the replacement part in the input sentence with the replacement method.
<7> The substitution method fixes a predicate of the input sentence, and fixes a first term for replacing a noun which is a predicate term including a target case, and a predicate term including an object case of the input sentence The information processing apparatus according to <6>, further comprising: a second method of replacing a predicate.
<8> The information processing apparatus according to any one of <1> to <7>, further including a classification unit that classifies the corpus generated by the corpus generation unit as either an IND (In Domain) determination sentence, which is utterance content to be handled by a predetermined application program, or an OOD (Out of Domain) determination sentence, which is unexpected utterance content that should not be handled by the predetermined application program.
<9> The information processing apparatus according to <8>, further including a COOD determination sentence extraction unit that extracts, as a COOD (Close OOD) determination sentence, from the corpora classified as the IND determination sentence, a corpus that is the OOD determination sentence and exists near the boundary with the IND determination sentence in the feature space expressed by their respective features.
<10> The information processing apparatus according to <9>, wherein, in a domain containing the corpora classified as the IND determination sentence, the COOD determination sentence extraction unit extracts from the domain, as the COOD determination sentence, a corpus whose number of words not contained in its own or other corpora is larger than a predetermined number.
<11> The information processing apparatus according to <10>, wherein the COOD determination sentence extraction unit extracts from the domain, as the COOD determination sentence, a corpus whose non-appearance, expressed as the ratio of the number of words not contained in its own or other corpora to the number of words contained in the corpus of the domain, is higher than a predetermined value.
<12> The information processing apparatus according to <10>, wherein the COOD determination sentence extraction unit extracts from the domain, as the COOD determination sentence, a corpus of the domain that contains many words whose non-appearance, expressed as the TF/IDF of the word composed of a TF (Term Frequency) value and an IDF (Inverse Document Frequency) value, is lower than a predetermined value.
<13> The information processing apparatus according to <10>, wherein the COOD determination sentence extraction unit calculates a Perplexity value for the corpus of the domain and discards, as a non-sentence, a corpus whose Perplexity value is higher than a predetermined value.
<14> The information processing apparatus according to <8>, further including a CIND determination sentence extraction unit that extracts, as a CIND (Close IND) determination sentence, from the corpora classified as the OOD determination sentence, a corpus that is the IND determination sentence and exists near the boundary with the OOD determination sentence in the feature space expressed by their respective features.
<15> The information processing apparatus according to <14>, wherein, in a domain containing the corpora classified as the OOD determination sentence, the CIND determination sentence extraction unit extracts, as the CIND determination sentence, from all the corpora classified as the OOD determination sentence, a corpus whose number of words contained in corpora classified as OOD determination sentences other than itself is larger than a predetermined number.
<16> The information processing apparatus according to <15>, wherein the CIND determination sentence extraction unit extracts, as the CIND determination sentence, a corpus whose non-appearance, expressed as the ratio of the number of words not contained in corpora other than itself to the number of words contained in the corpus of the domain, is lower than a predetermined number.
<17> The information processing apparatus according to <15>, wherein the CIND determination sentence extraction unit extracts, as the CIND determination sentence, from the corpora of the domain, a corpus whose non-appearance of words, expressed as the TF/IDF composed of a TF (Term Frequency) value and an IDF (Inverse Document Frequency) value, is lower than a predetermined number.
<18> The information processing apparatus according to <15>, wherein the CIND determination sentence extraction unit calculates a Perplexity value for the corpus of the domain and discards, as a non-sentence, a corpus whose Perplexity value is higher than a predetermined value.
<19> An information processing method including the steps of:
analyzing the structure of an input sentence;
setting a replacement point in the input sentence based on the result of the analysis of the structure; and
generating a corpus by replacing the word at the replacement point in the input sentence.
<20> A program that causes a computer to function as:
a structure analysis unit that analyzes the structure of an input sentence;
a replacement point setting unit that sets a replacement point in the input sentence based on an analysis result of the structure analysis unit; and
a corpus generation unit that generates a corpus by replacing the word at the replacement point in the input sentence.
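To make configurations <1> to <7> concrete, the following is a minimal sketch, in Python, of corpus generation by substitution. It assumes the predicate-argument analysis of the input sentence has already been performed and uses a tiny hand-written stand-in for the case frame dictionary; the parsed structure, the dictionary contents, and the English example sentences are hypothetical and only illustrate the two substitution methods of configuration <7> (fix the predicate and replace the target-case noun, or fix the noun and replace the predicate).

```python
# Minimal sketch of configurations <1>-<7>: generate corpus variants from one
# IND sentence by substituting either the target-case noun (first method, the
# predicate stays fixed) or the predicate (second method, the noun stays
# fixed). The parsed structure and the tiny "case frame dictionary" below are
# hypothetical stand-ins for a parser's output and a real case frame resource.

# Pretend output of the structure-analysis step for the input "play music":
parsed = {"predicate": "play", "object": "music"}

# Hypothetical case frame dictionary: nouns observed in each predicate's
# object slot.
case_frames = {
    "play":  {"object": ["music", "a song", "the radio", "a video"]},
    "stop":  {"object": ["music", "the radio"]},
    "pause": {"object": ["music", "a video"]},
}

def substitute(parsed, case_frames):
    """Return candidate sentences produced by the two substitution methods."""
    pred, obj = parsed["predicate"], parsed["object"]
    candidates = set()

    # First method: fix the predicate, replace the object-case noun.
    for noun in case_frames.get(pred, {}).get("object", []):
        candidates.add(f"{pred} {noun}")

    # Second method: fix the object-case noun, replace the predicate.
    for other_pred, frames in case_frames.items():
        if obj in frames.get("object", []):
            candidates.add(f"{other_pred} {obj}")

    candidates.discard(f"{pred} {obj}")  # drop the original sentence itself
    return sorted(candidates)

print(substitute(parsed, case_frames))
# e.g. ['pause music', 'play a song', 'play a video', 'play the radio', 'stop music']
```

A full implementation would obtain the predicate-argument structure from a parser, draw candidate words from a large case frame dictionary, and pass the generated sentences on to the duplicate-exclusion and filtering stages described in the reference signs list below.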
51 corpus generation apparatus, 101 IND sentence reception unit, 102 language analysis unit, 103 replacement point setting unit, 104 dictionary inquiry unit, 105 replacement execution unit, 106 double sentence exclusion unit, 107 case frame dictionary, 108 generation condition setting data storage unit, 109 replacement-generated sentence storage unit, 110 filtering processing unit, 131 semantic analyzer, 132 IND determination sentence storage unit, 133 COOD corpus extraction unit, 133a non-sentence determination unit, 133b non-appearance determination unit, 134 COOD determination sentence storage unit, 135 confirmed IND determination sentence storage unit, 136 OOD determination sentence storage unit, 137 CIND corpus extraction unit, 137a non-sentence determination unit, 137b non-appearance determination unit, 138 CIND determination sentence storage unit, 139 confirmed OOD determination sentence storage unit
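The reference signs above include non-appearance determination units (133b, 137b) inside the COOD and CIND corpus extraction units. The snippet below is a minimal sketch of the non-appearance ratio described in configurations <10> and <11>: the fraction of a candidate sentence's words that never occur in the IND corpora of its own or other domains, with a high ratio marking the candidate as a COOD (Close OOD) candidate. The toy domain corpora, the threshold, and the example sentences are hypothetical.

```python
# Minimal sketch of the non-appearance test used by the COOD extraction step
# (configurations <10>-<11>): a generated sentence whose words largely do not
# occur in the IND corpora of its own or other domains is far from the IND
# region and is kept as a COOD (Close OOD) candidate.

def vocabulary(corpus):
    """Set of all words appearing in a list of sentences."""
    return {w for sentence in corpus for w in sentence.lower().split()}

def non_appearance_ratio(sentence, known_vocab):
    """Fraction of the sentence's words that never occur in the IND corpora."""
    words = sentence.lower().split()
    unseen = [w for w in words if w not in known_vocab]
    return len(unseen) / len(words) if words else 0.0

# Hypothetical IND corpora for two domains handled by the same agent.
ind_corpora = {
    "music":   ["play music", "stop the music", "play a song"],
    "weather": ["tell me the weather", "will it rain today"],
}
known_vocab = set().union(*(vocabulary(c) for c in ind_corpora.values()))

THRESHOLD = 0.5  # hypothetical predetermined value

for candidate in ["play the news", "book a table at the restaurant"]:
    r = non_appearance_ratio(candidate, known_vocab)
    label = "COOD candidate" if r > THRESHOLD else "close to IND"
    print(f"{candidate!r}: non-appearance={r:.2f} -> {label}")
```

The TF/IDF variant of configurations <12> and <17> could score each word instead of taking a simple membership test, but the thresholding idea is the same.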
Claims (20)
- An information processing apparatus including: a structure analysis unit that analyzes the structure of an input sentence; a replacement point setting unit that sets a replacement point in the input sentence based on an analysis result of the structure analysis unit; and a corpus generation unit that generates a corpus by replacing the word at the replacement point in the input sentence.
- The information processing apparatus according to claim 1, wherein the input sentence is an IND (In Domain) determination sentence, which is utterance content to be handled by a predetermined application program.
- The information processing apparatus according to claim 1, wherein the structure analysis unit analyzes a predicate-argument structure of the input sentence, and the replacement point setting unit sets a replacement point in the input sentence based on the predicate-argument structure that is the analysis result of the structure analysis unit.
- The information processing apparatus according to claim 3, further including a dictionary inquiry unit that searches a dictionary for candidate words to replace the word at the replacement point in the input sentence, wherein the corpus generation unit replaces the word at the replacement point in the input sentence with a word retrieved by the dictionary inquiry unit.
- The information processing apparatus according to claim 4, wherein the dictionary is a case frame dictionary.
- The information processing apparatus according to claim 4, wherein the replacement point setting unit sets a replacement point in the input sentence and a substitution method for the replacement point based on the predicate-argument structure that is the analysis result of the structure analysis unit, and the corpus generation unit generates a corpus by replacing the word at the replacement point in the input sentence according to the substitution method.
- The information processing apparatus according to claim 6, wherein the substitution method includes a first method that fixes the predicate of the input sentence and replaces a noun serving as a predicate argument including the target case, and a second method that fixes the predicate argument including the target case of the input sentence and replaces the predicate.
- The information processing apparatus according to claim 1, further including a classification unit that classifies the corpus generated by the corpus generation unit as either an IND (In Domain) determination sentence, which is utterance content to be handled by a predetermined application program, or an OOD (Out of Domain) determination sentence, which is unexpected utterance content that should not be handled by the predetermined application program.
- The information processing apparatus according to claim 8, further including a COOD determination sentence extraction unit that extracts, as a COOD (Close OOD) determination sentence, from the corpora classified as the IND determination sentence, a corpus that is the OOD determination sentence and exists near the boundary with the IND determination sentence in the feature space expressed by their respective features.
- The information processing apparatus according to claim 9, wherein, in a domain containing the corpora classified as the IND determination sentence, the COOD determination sentence extraction unit extracts from the domain, as the COOD determination sentence, a corpus whose number of words not contained in its own or other corpora is larger than a predetermined number.
- The information processing apparatus according to claim 10, wherein the COOD determination sentence extraction unit extracts from the domain, as the COOD determination sentence, a corpus whose non-appearance, expressed as the ratio of the number of words not contained in its own or other corpora to the number of words contained in the corpus of the domain, is higher than a predetermined value.
- The information processing apparatus according to claim 10, wherein the COOD determination sentence extraction unit extracts from the domain, as the COOD determination sentence, a corpus of the domain that contains many words whose non-appearance, expressed as the TF/IDF of the word composed of a TF (Term Frequency) value and an IDF (Inverse Document Frequency) value, is lower than a predetermined value.
- The information processing apparatus according to claim 10, wherein the COOD determination sentence extraction unit calculates a Perplexity value for the corpus of the domain and discards, as a non-sentence, a corpus whose Perplexity value is higher than a predetermined value.
- The information processing apparatus according to claim 8, further including a CIND determination sentence extraction unit that extracts, as a CIND (Close IND) determination sentence, from the corpora classified as the OOD determination sentence, a corpus that is the IND determination sentence and exists near the boundary with the OOD determination sentence in the feature space expressed by their respective features.
- The information processing apparatus according to claim 14, wherein, in a domain containing the corpora classified as the OOD determination sentence, the CIND determination sentence extraction unit extracts, as the CIND determination sentence, from all the corpora classified as the OOD determination sentence, a corpus whose number of words contained in corpora classified as OOD determination sentences other than itself is larger than a predetermined number.
- The information processing apparatus according to claim 15, wherein the CIND determination sentence extraction unit extracts, as the CIND determination sentence, a corpus whose non-appearance, expressed as the ratio of the number of words not contained in corpora other than itself to the number of words contained in the corpus of the domain, is lower than a predetermined number.
- The information processing apparatus according to claim 15, wherein the CIND determination sentence extraction unit extracts, as the CIND determination sentence, from the corpora of the domain, a corpus whose non-appearance of words, expressed as the TF/IDF composed of a TF (Term Frequency) value and an IDF (Inverse Document Frequency) value, is lower than a predetermined number.
- The information processing apparatus according to claim 15, wherein the CIND determination sentence extraction unit calculates a Perplexity value for the corpus of the domain and discards, as a non-sentence, a corpus whose Perplexity value is higher than a predetermined value.
- An information processing method including the steps of: analyzing the structure of an input sentence; setting a replacement point in the input sentence based on the result of the analysis of the structure; and generating a corpus by replacing the word at the replacement point in the input sentence.
- A program that causes a computer to function as: a structure analysis unit that analyzes the structure of an input sentence; a replacement point setting unit that sets a replacement point in the input sentence based on an analysis result of the structure analysis unit; and a corpus generation unit that generates a corpus by replacing the word at the replacement point in the input sentence.
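Claims 13 and 18 discard generated candidates whose Perplexity value under a language model exceeds a predetermined value, treating them as non-sentences. The sketch below illustrates one way such a check could work, using an add-one-smoothed bigram model; the claims do not prescribe a particular language model, and the training sentences, candidates, and threshold here are hypothetical.

```python
# Minimal sketch of the non-sentence test in claims 13 and 18: score each
# generated candidate with a language model trained on accepted sentences and
# discard candidates whose perplexity exceeds a predetermined value. The
# add-one-smoothed bigram model below is illustrative only.

import math
from collections import Counter

def train_bigram(corpus):
    """Count unigrams and bigrams over sentences padded with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    vocab_size = len({w for s in corpus for w in s.lower().split()} | {"<s>", "</s>"})
    return unigrams, bigrams, vocab_size

def perplexity(sentence, unigrams, bigrams, vocab_size):
    """Perplexity of a sentence under the add-one-smoothed bigram model."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    log_prob = 0.0
    for prev, curr in zip(tokens[:-1], tokens[1:]):
        # Add-one (Laplace) smoothing gives unseen bigrams a small probability.
        p = (bigrams[(prev, curr)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(tokens) - 1))

training = ["play music", "play a song", "stop the music", "play the radio"]
model = train_bigram(training)

THRESHOLD = 8.0  # hypothetical predetermined value
for candidate in ["play the music", "music the play stop a"]:
    pp = perplexity(candidate, *model)
    verdict = "keep" if pp <= THRESHOLD else "discard as non-sentence"
    print(f"{candidate!r}: perplexity={pp:.1f} -> {verdict}")
```

In practice the language model would be trained on a much larger body of accepted IND sentences, and the threshold would be tuned so that only clearly ungrammatical substitution results are discarded.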
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2019532489A JPWO2019021804A1 (en) | 2017-07-24 | 2018-07-10 | Information processing apparatus, information processing method, and program |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2017-142620 | 2017-07-24 | | |
JP2017142620 | 2017-07-24 | | |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019021804A1 (en) | 2019-01-31 |
Family
ID=65040081
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2018/025959 WO2019021804A1 (en) | 2017-07-24 | 2018-07-10 | Information processing device, information processing method, and program |
Country Status (2)
Country | Link |
---|---|
JP (1) | JPWO2019021804A1 (en) |
WO (1) | WO2019021804A1 (en) |
- 2018-07-10: WO application PCT/JP2018/025959 (published as WO2019021804A1), active, Application Filing
- 2018-07-10: JP application JP2019532489A (published as JPWO2019021804A1), not active, Abandoned
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2012507809A (en) * | 2008-11-05 | 2012-03-29 | Google Inc. | Custom language model |
JP2015075952A (en) * | 2013-10-09 | 2015-04-20 | Nippon Telegraph and Telephone Corporation | Speech generation device, method, and program |
Non-Patent Citations (2)
Title |
---|
TRAINING OF ROBUST LANGUAGE MODELS BY AUTOMATIC SENTENCE GENERATION BASED ON WORD REPLACING WORDS WITH RESPECT TO THEIR CONTEXTS, 15 June 2010 (2010-06-15), pages 1 - 6 * |
YAMAGIWA, AYAKO ET AL: "Study on multivalued classification by ECOC-SVM aiming to support vector", PROCEEDINGS OF 2016 SPRING CONFERENCE OF JAPAN INDUSTRIAL MANAGEMENT ASSOCIATION, 28 May 2016 (2016-05-28), pages 78 - 79 * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2020201445A (en) * | 2019-06-13 | 2020-12-17 | Hitachi, Ltd. | Computer system, model generation method and model management program |
JP7261096B2 (en) | 2019-06-13 | 2023-04-19 | Hitachi, Ltd. | Computer system, model generation method and model management program |
JP2021047783A (en) * | 2019-09-20 | 2021-03-25 | Hitachi, Ltd. | Information processing method and information processing apparatus |
JP7316165B2 (en) | 2019-09-20 | 2023-07-27 | Hitachi, Ltd. | Information processing method and information processing device |
WO2021149206A1 (en) * | 2020-01-22 | 2021-07-29 | Nippon Telegraph and Telephone Corporation | Generation device, generation method, and generation program |
JPWO2021149206A1 (en) * | 2020-01-22 | | |
JP7327523B2 (en) | 2020-01-22 | 2023-08-16 | Nippon Telegraph and Telephone Corporation | Generation device, generation method and generation program |
Also Published As
Publication number | Publication date |
---|---|
JPWO2019021804A1 (en) | 2020-05-28 |
Similar Documents
Publication | Title |
---|---|
CN111324728B (en) | Text event abstract generation method and device, electronic equipment and storage medium | |
US6243670B1 (en) | Method, apparatus, and computer readable medium for performing semantic analysis and generating a semantic structure having linked frames | |
JP3429184B2 (en) | Text structure analyzer, abstracter, and program recording medium | |
CN108875059B (en) | Method and device for generating document tag, electronic equipment and storage medium | |
CN109492109B (en) | Information hotspot mining method and device | |
US20050080613A1 (en) | System and method for processing text utilizing a suite of disambiguation techniques | |
US20030046078A1 (en) | Supervised automatic text generation based on word classes for language modeling | |
RU2679988C1 (en) | Extracting information objects with the help of a classifier combination | |
CN108304375A (en) | A kind of information identifying method and its equipment, storage medium, terminal | |
US20190392035A1 (en) | Information object extraction using combination of classifiers analyzing local and non-local features | |
CN103678684A (en) | Chinese word segmentation method based on navigation information retrieval | |
CN1495641B (en) | Method and device for converting speech character into text character | |
Tomašic et al. | Implementation of a slogan generator | |
CN107943940A (en) | Data processing method, medium, system and electronic equipment | |
WO2019021804A1 (en) | Information processing device, information processing method, and program | |
CN114817465A (en) | Entity error correction method and intelligent device for multi-language semantic understanding | |
KR20220068937A (en) | Standard Industrial Classification Based on Machine Learning Approach | |
JP5426292B2 (en) | Opinion classification device and program | |
CN114970516A (en) | Data enhancement method and device, storage medium and electronic equipment | |
CN115062135A (en) | Patent screening method and electronic equipment | |
Banerjee et al. | Generating abstractive summaries from meeting transcripts | |
KR102661438B1 (en) | Web crawler system that collect Internet articles and provides a summary service of issue article affecting the global value chain | |
CN112151021A (en) | Language model training method, speech recognition device and electronic equipment | |
CN111949781B (en) | Intelligent interaction method and device based on natural sentence syntactic analysis | |
JP2004220226A (en) | Document classification method and device for retrieved document |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 18837253; Country of ref document: EP; Kind code of ref document: A1 |
| ENP | Entry into the national phase | Ref document number: 2019532489; Country of ref document: JP; Kind code of ref document: A |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 18837253; Country of ref document: EP; Kind code of ref document: A1 |