WO2019021804A1 - Information processing device, information processing method, and program - Google Patents
Information processing device, information processing method, and program
- Publication number
- WO2019021804A1 (PCT/JP2018/025959)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- sentence
- corpus
- determination
- unit
- ind
- Prior art date
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/10—Speech classification or search using distance or distortion measures between unknown speech and reference templates
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
Definitions
- The present disclosure relates to an information processing apparatus, an information processing method, and a program, and in particular to an information processing apparatus, an information processing method, and a program that can reduce the development cost of a corpus that contributes to improving the accuracy of a semantic analyzer.
- A speech dialogue system converts utterance content into text data, analyzes the text data semantically, and recognizes the utterance content.
- In order to recognize the utterance content, a semantic analyzer is used that analyzes and recognizes the utterance content through machine learning using a corpus (a collection of example sentences).
- The semantic analyzer analyzes and recognizes the utterance content by machine learning using a corpus of the utterance content to be handled for each application program.
- Multi-domain speech dialogue systems are widely used so that a single system can handle multiple topics, tasks, and application programs, such as weather inquiry, schedule confirmation, and music playback.
- This architecture is a system composed of semantic analysis systems for a plurality of domains and a domain selector (frame estimator) that integrates them (Non-Patent Document 1).
- In addition, it is necessary to prevent the semantic analysis processing from transitioning to an incorrect domain when an unexpected utterance (Out of Domain utterance: hereinafter also referred to as OOD utterance) is received. For that purpose, it is ideal to prepare an OOD corpus and retrain the frame estimator, but since development of an OOD corpus requires many steps, various methods have been discussed, such as estimating using the dialogue history (see Non-Patent Document 2).
- However, since the corpora of Non-Patent Documents 1 and 2 are created manually, and more corpora are needed to recognize the utterance content of various application programs, the burden of corpus creation weighs heavily on the development cost of semantic analyzers.
- The present disclosure has been made in view of such circumstances, and in particular reduces the development cost of the semantic analyzer by enabling efficient development of the corpora required for learning.
- An information processing apparatus according to one aspect of the present disclosure includes: a structure analysis unit that analyzes a structure of an input sentence; a replacement location setting unit that sets a replacement location in the input sentence based on the analysis result of the structure analysis unit; and a corpus generation unit that generates a corpus by replacing the word at the replacement location in the input sentence.
- The input sentence may be an IND (In Domain) determination sentence, which is utterance content to be handled by a predetermined application program.
- The structure analysis unit may analyze a predicate term structure of the input sentence, and the replacement location setting unit may set a replacement location in the input sentence based on the predicate term structure that is the analysis result of the structure analysis unit.
- A dictionary inquiry unit may further be included that queries a dictionary to search for candidate words for replacing the word at the replacement location in the input sentence, and the corpus generation unit may replace the word at the replacement location in the input sentence with a word searched by the dictionary inquiry unit.
- The dictionary may be a case frame dictionary.
- The replacement location setting unit may set a replacement location in the input sentence and a replacement method for the replacement location based on the predicate term structure that is the analysis result of the structure analysis unit, and the corpus generation unit may generate a corpus by replacing the word at the replacement location in the input sentence according to the replacement method.
- The replacement method may include a first method in which the predicate of the input sentence is fixed and a noun that is a predicate term, including the target case, is replaced, and a second method in which a predicate term of the input sentence, including the target case, is fixed and the predicate is replaced.
- A classification unit may further be included that classifies the corpus generated by the corpus generation unit into either an IND (In Domain) determination sentence, which is utterance content to be handled by a predetermined application program, or an OOD (Out of Domain) determination sentence, which is unexpected utterance content that should not be handled by the predetermined application program.
- A COOD (Close OOD) determination sentence extraction unit may further be included that extracts from the corpus, as COOD determination sentences, corpora that are OOD determination sentences and exist near the boundary with the IND determination sentences in a feature space represented by respective features.
- In a domain containing the corpora classified as IND determination sentences, the COOD determination sentence extraction unit may extract from the domain, as COOD determination sentences, corpora containing more than a predetermined number of words that are not included in the other corpora.
- The COOD determination sentence extraction unit may extract from the domain, as COOD determination sentences, corpora whose non-appearance, represented by the ratio of the number of words not included in the other corpora to the number of words included in the corpora of the domain, is higher than a predetermined value.
- The COOD determination sentence extraction unit may extract from the domain, as COOD determination sentences, corpora in which the non-appearance of words, represented by TF-IDF values composed of TF (Term Frequency) and IDF (Inverse Document Frequency) values in the corpora of the domain, is higher than a predetermined value.
- The COOD determination sentence extraction unit may calculate a Perplexity value for the corpora of the domain and discard, as non-sentences, sentences whose Perplexity value is higher than a predetermined value.
- A CIND (Close IND) determination sentence extraction unit may further be included that extracts from the corpus, as CIND determination sentences, corpora that are IND determination sentences and exist near the boundary with the OOD determination sentences in a feature space represented by respective features.
- In a domain containing the corpora classified as OOD determination sentences, the CIND determination sentence extraction unit may extract, as CIND determination candidate sentences, from all the corpora classified as OOD determination sentences, corpora containing more than a predetermined number of words that are included in the IND corpora.
- The CIND determination sentence extraction unit may extract, as CIND determination candidate sentences, corpora whose non-appearance, represented by the ratio of the number of words not included in the IND corpora to the number of words included in the corpora of the domain, is lower than a predetermined value.
- The CIND determination sentence extraction unit may extract, as CIND determination sentences, corpora in which the non-appearance of words, represented by TF-IDF values composed of TF (Term Frequency) and IDF (Inverse Document Frequency) values in the corpora of the domain, is lower than a predetermined value.
- The CIND determination sentence extraction unit may calculate a Perplexity value for the corpora of the domain and discard, as non-sentences, sentences whose Perplexity value is higher than a predetermined value.
- An information processing method according to one aspect of the present disclosure includes the steps of analyzing a structure of an input sentence, setting a replacement location in the input sentence based on the analysis result of the structure, and generating a corpus by replacing the word at the replacement location in the input sentence.
- A program according to one aspect of the present disclosure causes a computer to function as: a structure analysis unit that analyzes a structure of an input sentence; a replacement location setting unit that sets a replacement location in the input sentence based on the analysis result of the structure analysis unit; and a corpus generation unit that generates a corpus by replacing the word at the replacement location in the input sentence.
- In one aspect of the present disclosure, the structure of an input sentence is analyzed, a replacement location in the input sentence is set based on the analysis result of the structure, and a corpus is generated by replacing the word at the replacement location in the input sentence.
- FIG. 17 is a diagram explaining an example of a word group for replacing the target case, using the predicate term structure analysis result of FIG. 16, when the replacement method in English is the Action-fixed Category replacement method.
- FIG. 18 is a diagram explaining an example of a word group for replacing the predicate, using the predicate term structure analysis result of FIG. 16, when the replacement method in English is the Category-fixed Action replacement method. It is a diagram explaining an example of the replaced OOD candidate sentences in English. It is a diagram explaining the analysis results of deep case analysis and surface case analysis. It is a flowchart explaining the filtering process. It is a flowchart explaining the COOD corpus extraction process.
- The semantic analysis system recognizes the user's speech and causes the corresponding application program to execute an action.
- The semantic analysis system includes, for example, as shown in FIG. 1, an input reception unit 11, a speech recognition unit 12, a frame estimation unit 13, semantic analysis units 14-1 to 14-3 for the respective domains, and application programs 18 to 20.
- The input reception unit 11 receives the user's utterance as an input of a speech signal and outputs the speech signal to the speech recognition unit 12.
- The speech recognition unit 12 recognizes the speech signal, converts it into a text character string, and outputs the text string to the frame estimation unit 13.
- The frame estimation unit 13 transitions the processing to the semantic analysis unit 14-1 to 14-3 of the optimum application (domain) in the subsequent stage based on the text string. The frame estimation unit 13 also rejects text strings that do not belong to any domain.
- The semantic analysis units 14-1 to 14-3 analyze attributes and their corresponding values based on the text string, and supply analysis results 15 to 17 to the weather guidance application program 18, the schedule confirmation application program 19, or the music playback application program 20, respectively, which are the targets of the action.
- In the following, the semantic analysis units 14-1 to 14-3 are simply referred to as the semantic analysis unit 14 unless it is necessary to distinguish them.
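- The following is a minimal sketch, under assumed names, of the multi-domain architecture just described: a frame estimator routes recognized text to the semantic analysis unit of the most likely domain and rejects text that belongs to no domain. It is illustrative only and not the patent's implementation.

```python
# Illustrative sketch of the Fig. 1 architecture (all names are assumptions).
from typing import Callable, Dict, Optional


class FrameEstimator:
    def __init__(self, domain_scorers: Dict[str, Callable[[str], float]],
                 reject_threshold: float = 0.5):
        # domain_scorers maps a domain name (e.g. "weather", "schedule", "music")
        # to a function returning a confidence score for the input text string.
        self.domain_scorers = domain_scorers
        self.reject_threshold = reject_threshold

    def estimate(self, text: str) -> Optional[str]:
        domain, score = max(((d, f(text)) for d, f in self.domain_scorers.items()),
                            key=lambda x: x[1])
        # Text strings that do not belong to any domain (OOD) are rejected.
        return domain if score >= self.reject_threshold else None


def route(text: str, estimator: FrameEstimator,
          analyzers: Dict[str, Callable[[str], dict]]) -> Optional[dict]:
    domain = estimator.estimate(text)
    if domain is None:
        return None                      # rejected as an unexpected (OOD) utterance
    # The domain-specific semantic analyzer extracts attribute/value pairs
    # (e.g. place="Tokyo", date="tomorrow") for the target application program.
    return analyzers[domain](text)
```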
- The frame estimation unit 13 transitions the processing to the semantic analysis unit 14-1 of the weather guidance domain.
- The semantic analysis unit 14-1 analyzes the weather-related utterance based on this text string and supplies, to the application program 18, an analysis result 15 including the information that the place (where) is "Tokyo" and the time (when) is "tomorrow".
- The weather guidance application program 18 displays tomorrow's weather guidance for Tokyo based on the analysis result 15.
- Similarly, the frame estimation unit 13 transitions the processing to the semantic analysis unit 14-2 of the schedule confirmation domain.
- The semantic analysis unit 14-2 analyzes the utterance about schedule confirmation based on the text string and supplies, to the application program 19, an analysis result 16 including, for example, that the utterance input is an utterance for the schedule confirmation application program 19, that the information indicating the date is "today", and that the information indicating the time is "15 o'clock".
- The schedule confirmation application program 19 displays the schedule confirmation for 15 o'clock today based on the analysis result 16.
- Further, the speech recognition unit 12 recognizes the utterance as the text string "Play a new song by Higashino" and supplies it to the frame estimation unit 13.
- The frame estimation unit 13 transitions the processing to the semantic analysis unit 14-3 of the music playback domain.
- The semantic analysis unit 14-3 analyzes the utterance related to music playback based on this text string and supplies, to the application program 20, an analysis result 17 including, for example, that the utterance input is an utterance for the music playback application program 20, that the information indicating the artist is "Higashino", and that the information indicating the music is "new song".
- The music playback application program 20 plays a new song by Higashino based on the analysis result 17.
- The semantic analysis unit 14 performs machine learning using a corpus, which is a collection of example sentences, in order to analyze information composed of text strings.
- Corpora are mainly divided into corpora (also referred to as IND corpora) consisting of utterance content to be handled by the application program (In Domain utterance: hereinafter IND) and corpora (also referred to as OOD corpora) consisting of utterance content that cannot be handled by the application program (Out of Domain utterance: hereinafter OOD).
- By learning the IND corpus, the semantic analysis unit 14 can analyze and recognize the utterance content to be handled by the application program.
- By learning the IND corpus and the OOD corpus, the frame estimation unit 13 can transition the processing to the semantic analysis unit 14 of the correct domain and can reject unexpected utterances. That is, the frame estimation unit 13 learns the utterances to be handled and the utterances to be rejected, and can appropriately recognize the utterance content.
- However, for the frame estimation unit 13, an OOD corpus must be prepared for each domain, so that, for example, roughly twice the total number of corpora needs to be prepared.
- Because the recognition accuracy of the frame estimation unit 13 and the semantic analysis unit 14 is limited, the corpora determined to be IND corpora may partly include corpora that should be regarded as OOD corpora.
- Since such an OOD sentence was determined to be an IND determination sentence due to an erroneous determination, it can be considered a corpus close to the IND determination sentences.
- In the feature space of FIG. 2, the black circles represent the distribution of corpora regarded as IND determination sentences, and the crosses represent the distribution of corpora regarded as OOD determination sentences.
- Among the corpora regarded as OOD determination sentences, a corpus existing in the vicinity of the corpora regarded as IND determination sentences can be considered a corpus similar to the IND determination sentences.
- Such a corpus existing in the vicinity of the distribution of corpora regarded as IND determination sentences is referred to as a COOD (Close Out of Domain) determination sentence.
- Although the COOD determination sentence is an OOD determination sentence, it is a corpus similar to the IND determination sentences; in other words, it can be considered a corpus that is not typical of OOD determination sentences. Furthermore, the COOD determination sentence is a highly misleading expression that is easily mistaken for an IND determination sentence, and is therefore a corpus likely to cause erroneous determination.
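- One way to picture the feature space of FIG. 2 is sketched below: OOD-judged sentences lying close to IND-judged sentences are treated as COOD candidates. This uses TF-IDF vectors and cosine similarity purely for illustration; the patent's actual criteria (word non-appearance and Perplexity) are described later.

```python
# Illustrative only: nearness in a simple feature space, not the patent's criteria.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

ind_sentences = ["set an alarm at 7 o'clock", "wake me up at seven"]
ood_sentences = ["set a bomb at 7 o'clock", "what is the capital of France"]

vectorizer = TfidfVectorizer().fit(ind_sentences + ood_sentences)
ind_vecs = vectorizer.transform(ind_sentences)
ood_vecs = vectorizer.transform(ood_sentences)

# For each OOD-judged sentence, similarity to its nearest IND-judged sentence;
# a high value means it sits near the IND/OOD boundary, i.e. a COOD candidate.
nearest = cosine_similarity(ood_vecs, ind_vecs).max(axis=1)
cood_candidates = [s for s, sim in zip(ood_sentences, nearest) if sim > 0.5]
print(cood_candidates)   # likely ["set a bomb at 7 o'clock"]
```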
- The corpus generation device of the present disclosure makes it possible to easily generate, in large quantities, a corpus including the COOD corpus that is particularly effective for improving recognition accuracy, and to reduce the load associated with corpus development.
- That is, the corpus generation device of the present disclosure efficiently generates the corpora required for learning by the configurations corresponding to the frame estimation unit 13 and the semantic analysis unit 14 in FIG. 1, and thereby reduces the development cost of the frame estimation unit 13 and the semantic analysis unit 14.
- FIG. 3 shows a configuration example of an embodiment of the corpus generation device of the present disclosure.
- The corpus generation device 51 receives, as input sentences, a corpus consisting of IND sentences generated manually or by some other method, generates a corpus consisting of substitution generated sentences by replacing words through language analysis and the like, and classifies the generated corpus into corpora consisting of COOD determination sentences, OOD determination sentences, IND determination sentences, and CIND determination sentences by filtering processing.
- A CIND (Close IND) determination sentence is, among the corpora classified as OOD determination sentences, a corpus similar to the IND determination sentences. By machine learning of the corpora of CIND determination sentences, the recognition accuracy of the frame estimation unit 13 and the semantic analysis unit 14 can be improved.
- The corpus generation device 51 includes an IND sentence reception unit 101, a language analysis unit 102, a replacement location setting unit 103, a dictionary inquiry unit 104, a replacement execution unit 105, a duplicate sentence exclusion unit 106, a case frame dictionary 107, a generation condition setting data storage unit 108, a substitution generated sentence storage unit 109, and a filtering processing unit 110.
- The IND sentence reception unit 101 receives an input of IND sentences generated manually or by some other method, and outputs them to the language analysis unit 102.
- The language analysis unit 102 analyzes the morphemes, phrases, and predicate term structure of each of the IND sentences output from the IND sentence reception unit 101, and outputs the analysis result to the replacement location setting unit 103.
- The replacement location setting unit 103 sets a replacement condition based on the analysis result of the predicate term structure of the IND sentence, and outputs the replacement condition to the dictionary inquiry unit 104.
- The dictionary inquiry unit 104 queries the case frame dictionary 107 to search, according to the set replacement method, for replacement candidates for the word at the replacement location set based on the replacement condition in the IND determination sentence, and outputs the search result to the replacement execution unit 105.
- The replacement execution unit 105 replaces the word at the replacement location set based on the replacement condition with a word found by the set replacement method, and generates a new corpus. At this time, the sentence ending and the like of the newly generated corpus are adjusted based on the generation condition setting data stored in the generation condition setting data storage unit 108. The sentence generated as a result of the above processing is output to the duplicate sentence exclusion unit 106.
- The duplicate sentence exclusion unit 106 determines whether or not the corpus output from the replacement execution unit 105 is a duplicate sentence; if it is a duplicate sentence, it is regarded as a discard determination sentence and discarded. If the corpus output from the replacement execution unit 105 is not a duplicate sentence, the duplicate sentence exclusion unit 106 stores the corpus as a substitution generated sentence in the substitution generated sentence storage unit 109.
- The case frame dictionary 107 is a dictionary in which predicates are classified according to their word senses (case frames) and words are grouped with specific cases; the dictionary inquiry unit 104 searches it for words that match the word sense of the case frame set at the replacement location.
- The generation condition setting data storage unit 108 stores generation condition setting data, which specifies the conditions under which a generated corpus is adjusted, and the replacement execution unit 105 adjusts the replaced corpus based on the generation condition setting data in the generation condition setting data storage unit 108.
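- The generation flow through these units can be summarized as the following sketch. It is schematic wiring only, under the assumption that each stage is available as a callable; it is not the actual implementation of units 101 to 109.

```python
# Schematic wiring of the generation pipeline in FIG. 3 (all names illustrative).
def generate_corpus(ind_sentences, analyze, set_replacement, query_dictionary,
                    execute_replacement, generation_conditions):
    generated, seen = [], set()
    for sentence in ind_sentences:                       # IND sentence reception unit 101
        analysis = analyze(sentence)                     # language analysis unit 102
        if analysis is None:                             # analysis error -> discard
            continue
        condition = set_replacement(analysis)            # replacement location setting unit 103
        candidates = query_dictionary(condition)         # dictionary inquiry unit 104 / case frame dictionary 107
        for word in candidates:
            new_sentence = execute_replacement(          # replacement execution unit 105
                sentence, condition, word, generation_conditions)
            if new_sentence in seen:                     # duplicate sentence exclusion unit 106
                continue
            seen.add(new_sentence)
            generated.append(new_sentence)               # substitution generated sentence storage unit 109
    return generated
```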
- the filtering processing unit 110 classifies the corpus stored in the substitution generated sentence storage unit 109 into a corpus of OOD determination sentence, COOD determination sentence, IND determination sentence, and CIND determination sentence.
- the configuration example of the filtering processing unit 110 will be described later in detail with reference to FIG.
- The filtering processing unit 110 includes a semantic analyzer 131, an IND determination sentence storage unit 132, a COOD corpus extraction unit 133, a COOD determination sentence storage unit 134, a confirmed IND determination sentence storage unit 135, an OOD determination sentence storage unit 136, a CIND corpus extraction unit 137, a CIND determination sentence storage unit 138, and a confirmed OOD determination sentence storage unit 139.
- The semantic analyzer 131 is a semantic analyzer (corresponding to the semantic analysis unit 14 in FIG. 1) trained with the corpus of the old version (the corpus generated so far). For each of the corpora consisting of substitution generated sentences stored in the substitution generated sentence storage unit 109, it determines whether the corpus is utterance content to be handled by a predetermined application program (hereinafter also referred to as an IND determination sentence) or utterance content that cannot be handled (is rejected) by the predetermined application program (hereinafter also referred to as an OOD determination sentence).
- The semantic analyzer 131 stores the IND determination sentences in the IND determination sentence storage unit 132 and the OOD determination sentences in the OOD determination sentence storage unit 136.
- The COOD (Close OOD) corpus extraction unit 133 extracts the corpora classified as COOD determination sentences from among the IND determination sentences stored in the IND determination sentence storage unit 132 and stores them in the COOD determination sentence storage unit 134, and stores the other IND determination sentences in the confirmed IND determination sentence storage unit 135 as confirmed IND determination sentences.
- The COOD corpus extraction unit 133 also discards, as discard determination sentences, corpora that are non-sentences and are classified as neither COOD determination sentences nor confirmed IND determination sentences among the corpora consisting of IND determination sentences.
- The COOD determination sentence is an OOD determination sentence existing near the boundary with the IND determination sentences in the determination. Details of the COOD determination sentence will be described later with reference to FIG. 2.
- The COOD corpus extraction unit 133 includes a non-sentence determination unit 133a and a non-appearance determination unit 133b. It controls the non-sentence determination unit 133a to determine whether each corpus is a non-sentence, and a corpus determined to be a non-sentence is regarded as a discard determination sentence and discarded.
- Further, the COOD corpus extraction unit 133 controls the non-appearance determination unit 133b to determine, based on the non-appearance of words, whether each corpus not regarded as a non-sentence is a COOD determination sentence or a confirmed IND determination sentence; the COOD determination sentences are extracted and stored in the COOD determination sentence storage unit 134, and the confirmed IND determination sentences are stored in the confirmed IND determination sentence storage unit 135.
- The non-sentence determination unit 133a calculates, for the corpora consisting of IND determination sentences, a Perplexity value, which is an index of how sentence-like the meaning is, determines whether each corpus is a non-sentence by comparing the Perplexity value with a predetermined threshold, and discards the corpora regarded as non-sentences as discard determination sentences.
- The non-appearance determination unit 133b calculates, for each corpus determined not to be a non-sentence, a parameter indicating non-appearance, that is, whether the corpus contains words that rarely appear in the corpus group of the domain determined as IND determination sentences. The non-appearance determination unit 133b then compares the parameter indicating non-appearance with a predetermined threshold, and regards and extracts, as COOD determination sentences, the corpora containing words that rarely appear in the corpus group of the domain determined as IND determination sentences.
- The non-appearance determination unit 133b stores the extracted COOD determination sentences in the COOD determination sentence storage unit 134, regards the corpora consisting of the other IND determination sentences as confirmed IND determination sentences, and stores them in the confirmed IND determination sentence storage unit 135.
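- The two checks performed by the COOD corpus extraction unit 133 can be sketched as follows. The unigram language model, thresholds, and the rarity criterion are all assumptions made for illustration; the patent only specifies that a Perplexity value and a non-appearance parameter are compared with predetermined thresholds.

```python
# Minimal sketch of non-sentence and non-appearance determination (assumed details).
import math
from collections import Counter

def perplexity(sentence, unigram_probs, floor=1e-6):
    # Unigram stand-in for the language model used by the non-sentence
    # determination unit 133a; a high value suggests the text is not sentence-like.
    words = sentence.split()
    log_p = sum(math.log(unigram_probs.get(w, floor)) for w in words)
    return math.exp(-log_p / max(len(words), 1))

def extract_cood(ind_judged, unigram_probs, ppl_threshold=1e4, rare_ratio=0.5):
    domain_counts = Counter(w for s in ind_judged for w in s.split())
    cood, confirmed_ind = [], []
    for s in ind_judged:
        if perplexity(s, unigram_probs) > ppl_threshold:
            continue                                   # discarded as a non-sentence
        words = s.split()
        # Non-appearance: ratio of words that rarely occur in the domain's corpus group.
        rare = sum(1 for w in words if domain_counts[w] <= 1)
        (cood if rare / len(words) >= rare_ratio else confirmed_ind).append(s)
    return cood, confirmed_ind
```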
- The CIND (Close IND) corpus extraction unit 137 extracts the corpora classified as CIND determination sentences from among the OOD determination sentences stored in the OOD determination sentence storage unit 136 and stores them in the CIND determination sentence storage unit 138, and stores the other OOD determination sentences in the confirmed OOD determination sentence storage unit 139 as confirmed OOD determination sentences.
- The CIND corpus extraction unit 137 also discards, as discard determination sentences, corpora that are classified as neither CIND determination sentences nor confirmed OOD determination sentences among the corpora consisting of OOD determination sentences.
- The CIND determination sentence is an IND determination sentence existing near the boundary with the OOD determination sentences in the determination. Which domain each CIND determination sentence belongs to is finally determined manually. The CIND determination sentence will be described later in detail with reference to FIG. 2.
- The CIND corpus extraction unit 137 includes a non-sentence determination unit 137a and a non-appearance determination unit 137b. It controls the non-sentence determination unit 137a to determine whether each corpus is a non-sentence, and a corpus determined to be a non-sentence is regarded as a discard determination sentence and discarded.
- Further, the CIND corpus extraction unit 137 controls the non-appearance determination unit 137b to determine whether each corpus not regarded as a non-sentence is a CIND determination sentence or a confirmed OOD determination sentence.
- The CIND determination sentences are extracted and stored in the CIND determination sentence storage unit 138, and the confirmed OOD determination sentences are stored in the confirmed OOD determination sentence storage unit 139.
- The non-sentence determination unit 137a calculates, for the corpora consisting of OOD determination sentences, a Perplexity value, which is an index of how sentence-like the meaning is, determines whether each corpus is a non-sentence by comparing the Perplexity value with a predetermined threshold, and discards the corpora regarded as non-sentences as discard determination sentences.
- The non-appearance determination unit 137b determines whether each of the corpora not regarded as non-sentences contains words that rarely appear in the corpus group of the domain determined as OOD determination sentences, and extracts, as confirmed OOD determination sentences, the corpora containing words that rarely appear in that corpus group. Furthermore, the non-appearance determination unit 137b stores the extracted confirmed OOD determination sentences in the confirmed OOD determination sentence storage unit 139, regards and extracts the other corpora as CIND determination sentences, and stores them in the CIND determination sentence storage unit 138.
- As described above, the semantic analyzer 131 in FIG. 4 classifies the corpus group consisting of substitution generated sentences generated from IND sentences and stored in the substitution generated sentence storage unit 109 into a corpus group of IND determination sentences and a corpus group of OOD determination sentences.
- From the corpus group consisting of IND determination sentences, the COOD corpus extraction unit 133 discards the corpora regarded as non-sentences as discard determination sentences by the determination based on the Perplexity value described later, and extracts the corpora judged to be COOD determination sentences by the determination based on the non-appearance of words.
- The COOD corpus extraction unit 133 regards the corpora consisting of the remaining IND determination sentences as confirmed IND determination sentences.
- The corpora determined to be IND determination sentences by the semantic analyzer 131 are limited by its recognition accuracy and, furthermore, because words have been replaced, they partly include corpora that should be considered OOD determination sentences.
- Such an OOD determination sentence is a corpus similar to the IND determination sentences, because what was an IND determination sentence became an OOD determination sentence through word substitution, and it can be considered the COOD determination sentence described with reference to FIG. 2.
- Although the COOD determination sentence is an OOD determination sentence, it is a corpus similar to the IND determination sentences; in other words, it can be considered a corpus that is not typical of OOD determination sentences. Furthermore, the COOD determination sentence is a highly misleading expression that is easily mistaken for an IND determination sentence, and is therefore a corpus likely to cause erroneous determination.
- Here, two or more corpora being "similar" to each other means, for example, that the two or more sentences have predicates that are similar to each other and that the structures of the predicates and the terms related to the predicates are similar.
- Two or more sentences can be said to be even more similar if the meanings and roles of the words in the terms related to the predicates are also similar.
- this number may be another index such as a weight, may be normalized according to the population, and may be used in combination by multiplying the index or the like.
- Conversely, two or more sentences being "dissimilar" to each other means, for example, sentences that have similar predicate term structures but whose predicates or noun phrases of the semantic classes have different notations.
- On the other hand, a corpus that is far from the corpora regarded as IND determination sentences in the feature space can be considered to have a low possibility of being erroneously recognized as an IND determination sentence.
- Therefore, by learning using corpora that are COOD determination sentences, the frame estimation unit 13 and the semantic analysis unit 14 can reliably reject corpora that are similar to the IND determination sentences but are OOD determination sentences, and as a result the recognition accuracy can be improved. For this reason, generating and learning more corpora that become COOD determination sentences makes it possible to improve the recognition accuracy.
- The CIND determination sentence is a corpus corresponding to the COOD determination sentence; in the feature space of FIG. 2, it is a corpus that, among the corpora regarded as IND determination sentences, exists near the boundary with the distribution of corpora regarded as OOD determination sentences.
- Although the CIND determination sentence is an IND determination sentence, it is a corpus similar to the OOD determination sentences; in other words, it can be considered a corpus that is not typical of IND determination sentences. Furthermore, the CIND determination sentence is a highly misleading expression that is easily mistaken for an OOD determination sentence, and can be considered a corpus likely to cause erroneous determination.
- Conversely, a corpus that is far from the corpora regarded as OOD determination sentences in the feature space can be considered unlikely to cause an erroneous determination as an OOD determination sentence.
- Therefore, by learning using corpora that are CIND determination sentences, the frame estimation unit 13 and the semantic analysis unit 14 can reliably recognize corpora that are similar to the OOD determination sentences but are IND determination sentences, and as a result the recognition accuracy can be improved. For this reason, generating and learning more corpora that become CIND determination sentences makes it possible to improve the recognition accuracy.
- In step S11, the IND sentence reception unit 101 selects an unprocessed IND sentence as the IND sentence to be processed from among the IND sentences created manually or by other means, receives its input, and outputs it to the language analysis unit 102.
- In step S12, the language analysis unit 102 analyzes the morphemes, phrases, and predicate term structure of the IND sentence to be processed.
- In step S13, the language analysis unit 102 stores the predicate term structure analysis result. More specifically, the language analysis unit 102 stores the analysis result as long as no error occurs in the predicate term structure analysis; when an error occurs, the language analysis unit 102 discards, for example, the IND sentence to be processed.
- In step S14, the language analysis unit 102 determines whether there is an unprocessed IND sentence, and if there is, the process returns to step S11. That is, the processes of steps S11 to S14 are repeated until there are no unprocessed IND sentences. When all the IND sentences have been processed and it is determined in step S14 that there is no unprocessed IND sentence, the process proceeds to step S15.
- For the predicate term structure analysis, it may be possible to switch between deep case analysis, surface case analysis, and the like, and one of them may be selected.
- In the analysis result, the position of the verb phrase and of the part that becomes the target case in a noun phrase (the object in the case of English, for example) is determined.
- For example, when the IND sentence to be processed is "Which restaurant is near and delicious?", the IND sentence may not have a predicate, and the predicate term structure analysis may not succeed.
- In such a case, the omitted predicate may be complemented so that the analysis result of the predicate term structure can be interpolated.
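- As a rough stand-in for this predicate term structure analysis, the sketch below uses a dependency parser for English to locate the predicate (root verb), its direct object (the target case, dobj in FIG. 16), and prepositional terms. The patent does not name a specific parser; spaCy and the model name here are assumptions.

```python
# Assumed-parser sketch of predicate term structure analysis for English.
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes the small English model is installed

def predicate_term_structure(sentence):
    doc = nlp(sentence)
    verbs = [t for t in doc if t.pos_ == "VERB"]
    if not verbs:
        return None        # no predicate: analysis fails and the sentence is discarded
    predicate = verbs[0]
    # Direct object corresponds to the target case (dobj); prepositional
    # phrases correspond to the other predicate terms.
    dobj = [t.text for t in predicate.children if t.dep_ == "dobj"]
    preps = [(t.text, [c.text for c in t.children if c.dep_ == "pobj"])
             for t in predicate.children if t.dep_ == "prep"]
    return {"predicate": predicate.lemma_, "target_case": dobj, "preps": preps}

print(predicate_term_structure("find Chinese food in Austin"))
# e.g. {'predicate': 'find', 'target_case': ['food'], 'preps': [...]}
# (the exact output depends on the parser and model used)
```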
- In step S15, the replacement location setting unit 103 reads the analysis results of the predicate term structure analysis stored in the process of step S13, sets the replacement condition designated in advance, and supplies the setting result to the dictionary inquiry unit 104.
- The replacement condition includes a replacement method and a replacement location.
- The first replacement method is the Action-fixed (predicate-fixed) Category replacement (target case replacement) method, and the second is the Category-fixed (target case-fixed) Action replacement (predicate replacement) method.
- The Action-fixed (predicate-fixed) Category replacement (target case replacement) method is, for example, as shown in 1) in the upper part of example Ex21 in FIG. 9, when the input sentence is "set an alarm at 7 o'clock", a method of fixing the predicate (Action) "set" and replacing the target case (Category) "alarm"; in 1) of example Ex21, "alarm" is replaced with another noun.
- The Category-fixed (target case-fixed) Action replacement (predicate replacement) method is, for example, as shown in 2) in the lower part of example Ex21 in FIG. 9, when the input sentence is "set an alarm at 7 o'clock", a method of fixing the target case (Category) "alarm" and replacing the predicate (Action) "set"; in 2) of example Ex21, "set" is replaced with "release".
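- The two replacement methods can be sketched as below for the example sentence "set an alarm at 7 o'clock". The replacement word lists here are invented for illustration; in the device they come from the case frame dictionary 107.

```python
# Sketch of the two replacement methods (replacement candidates are assumptions).
def action_fixed_category_replacement(template, category_words):
    # Method 1: the predicate (Action) is fixed; the target-case noun (Category) is replaced.
    return [template.format(category=w) for w in category_words]

def category_fixed_action_replacement(template, action_words):
    # Method 2: the target case (Category) is fixed; the predicate (Action) is replaced.
    return [template.format(action=w) for w in action_words]

print(action_fixed_category_replacement(
    "set {category} at 7 o'clock", ["a bomb", "a table", "a trap"]))
print(category_fixed_action_replacement(
    "{action} an alarm at 7 o'clock", ["release", "cancel", "silence"]))
# The outputs are candidate sentences that may later be judged OOD, COOD, IND, or CIND.
```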
- The setting of phrase boundaries and the like can be arbitrarily specified by the user, and may be switched according to the contents of the specification.
- For example, as shown in the top row of example Ex22, the replacement location setting unit 103 divides the sentence into the phrase-unit structure "this weekend", "station", "near", "de", "recommended", "no", "spot", and "tell me".
- The replacement location setting unit 103 may also change the setting so that specific words or phrase separation units are grouped into a single phrase as needed, using rules or a word dictionary. For example, as shown in the second row from the top of example Ex22, the structure is adjusted to "this weekend", "near the station", "de", "recommended", "no", "spot", and "tell me".
- From the adjusted phrase-unit structure, the replacement location setting unit 103, for example, replaces the word "recommended" among the replacement locations by word group, as shown in the third row from the top of example Ex22.
- Replacing a word at a single location in this way is effective for producing COOD sentences, but it is also possible to add variations of the setting, such as adding the de-case "near the station" to the replacement targets without replacing "spot", or replacing only the de-case "near the station".
- Alternatively, as shown in the top row of example Ex23, the replacement location setting unit 103 divides the sentence into a word-segmentation-unit structure.
- From the adjusted word-segmentation-unit structure, the replacement location setting unit 103, for example, replaces "tell me", on which "recommended" depends, with a predicate of a similar word sense that takes the same "recommendation" as a case, as shown in the third row from the top of example Ex23. A setting may also be made to replace it with a predicate of a dissimilar word sense that does not take the same "recommendation" as a case.
- The selection criteria for the replacing predicate described above can be judged not only by what kinds of words the predicate takes in the case in question, but also by the similarity or dissimilarity of the words in other terms such as the de-case and ni-case.
- In step S16, the dictionary inquiry unit 104 reads one unprocessed IND sentence from the stored data of the predicate term structure analysis results in the language analysis unit 102 and accepts it as the IND sentence to be processed.
- In step S17, the dictionary inquiry unit 104 specifies the replacement location according to the specified replacement method based on the setting result, and searches the case frame dictionary 107 for noun phrases corresponding to the term of the word at the replacement location and for predicates corresponding to its word sense.
- In step S18, the dictionary inquiry unit 104 stores the IND sentence to be processed, the setting information on replacement, and the search result in association with each other.
- In step S19, the dictionary inquiry unit 104 determines whether there is an unprocessed IND sentence among the stored data of the predicate term structure analysis results, and if there is, the process returns to step S16. That is, the processes of steps S16 to S19 are repeated until the search for replacement candidates has been completed for all the IND sentences in the stored data of the predicate term structure analysis results. When the search for replacement candidates has been completed for all the IND sentences and it is determined that there is no unprocessed IND sentence, the process proceeds to step S20.
- When the English predicate term structure analysis result corresponding to the predicate term structure analysis result of FIG. 12 is as shown in FIG. 16, a word group replacing the target-case predicate term is searched for as shown in FIG. 17 in the case of the Action-fixed Category replacement method, and a word group replacing the predicate part is searched for as shown in FIG. 18 in the case of the Category-fixed Action replacement method. Such replacement produces, for example, English OOD candidate sentences as shown in FIG. 19.
- FIG. 12 shows an example of the result of predicate term structure analysis; from the left, the sentence ID, the sentence, the predicate, the predicate ending, the predicate terms, and the original domain are shown.
- The predicate terms are shown, from the left, as the place case or de case, the adnominal modification clause or no case, and so on.
- FIG. 13 shows an example of the word groups used when the target case among the predicate terms is replaced by the Action-fixed Category replacement method, for the sentences of the predicate term structure analysis results shown in FIG. 12.
- In FIG. 13, the sentence ID, the sentence, the predicate, the predicate ending, the replacement words for the terms, and the original domain are shown from the left.
- The replacement words for the terms are shown, from the left, as the de case (place case or de case), the no case (adnominal modification clause or no case), and so on; FIG. 13 shows an example of the word groups used when replacing the target case.
- The items in FIG. 13 that correspond to those in FIG. 12 have the same descriptions, so their description will be omitted as appropriate.
- FIG. 14 shows an example of the word groups used when the predicate is replaced by the Category-fixed Action replacement method, for the sentences shown in FIG. 12.
- In FIG. 14, the sentence ID, the sentence, the predicate, the predicate ending, the replacement words for the predicate, and the original domain are shown from the left.
- The items in FIG. 14 that correspond to those in FIG. 12 have the same descriptions, so their description will be omitted as appropriate.
- FIG. 16 shows an example of the result of predicate term structure analysis in English; from the left, the sentence ID, the sentence (Sentence), the predicate (verb: Action), the predicate terms (Argument), and the original domain (Original Domain) are shown.
- The predicate terms are shown, from the left, as the terms related to the predicate (prep_in, ...) and the target case (dobj).
- FIG. 17 shows an example of the word groups used when the target case (dobj) among the predicate terms is replaced by the Action-fixed Category replacement method, for the sentences of the English predicate term structure analysis results shown in FIG. 16.
- In FIG. 17, the sentence ID, the sentence (Sentence), the predicate (verb), the replacement words for the terms (Argument), and the original domain (Original Domain) are shown from the left. The replacement words for the terms include the terms related to the predicate (prep_in, ...); FIG. 17 shows an example of the word groups used when replacing the target case (dobj).
- The items in FIG. 17 that correspond to those in FIG. 16 have the same descriptions, so their description will be omitted as appropriate.
- For the sentence with sentence ID 2, "find Chinese food in Austin", "victim", "bomb", "cache", and "remains" are shown as examples of replacement words for the target case "Chinese food".
- FIG. 18 shows an example of the word groups used when the predicate is replaced by the Category-fixed Action replacement method, for the sentences shown in FIG. 16.
- In FIG. 18, the sentence ID, the sentence, the predicate, the predicate ending, the replacement words for the predicate, and the original domain are shown from the left.
- The items in FIG. 18 that correspond to those in FIG. 16 have the same descriptions, so their description will be omitted as appropriate.
- For the sentence with sentence ID 2, "find Chinese food in Austin", "include", "open", "run", and "operate" are shown as examples of replacement words for the predicate "find".
- In FIG. 19, "find cache in Austin" and "find remains in Austin" are shown as examples of OOD candidate sentences for the sentence with sentence ID 2 in FIG. 16, "find Chinese food in Austin"; that is, "Chinese food" is replaced with "cache" and "remains", respectively.
- Similarly, "Open Chinese food in Austin" and "Operate Chinese food in Austin" are shown as examples of OOD candidate sentences for the sentence with sentence ID 2 in FIG. 16, "find Chinese food in Austin"; that is, "find" is replaced with "Open" and "Operate", respectively.
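- The FIG. 16 to FIG. 19 example can be reproduced in code form as follows, combining the parse result of "find Chinese food in Austin" with the replacement word groups listed above. The word lists are taken from the figures as cited in the text; everything else is a minimal sketch.

```python
# Worked example: generating the OOD candidate sentences of FIG. 19.
sentence = "find Chinese food in Austin"
predicate, target_case = "find", "Chinese food"

# Word groups as in FIG. 17 and FIG. 18 (subset shown).
category_replacements = ["victim", "bomb", "cache", "remains"]
action_replacements = ["include", "open", "run", "operate"]

# Action-fixed Category replacement -> "find cache in Austin", "find remains in Austin", ...
ood_candidates = [sentence.replace(target_case, w) for w in category_replacements]
# Category-fixed Action replacement -> "open Chinese food in Austin", ...
ood_candidates += [sentence.replace(predicate, w) for w in action_replacements]

for candidate in ood_candidates:
    print(candidate)
```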
- FIG. 20 shows a simple image of the case frame dictionary.
- <setting 4> in example Ex31 indicates what kinds of words are involved as the predicate terms having the respective roles.
- The entries in parentheses are represented by a word and a number.
- The number represents the number of times (frequency) that the word and the predicate were associated. For example, ("meeting", 41) indicates that the word "meeting" was related 41 times as a target case to the predicate of <setting 4> in the large amount of corpus data from which the case frame dictionary was built.
- This numerical value may be replaced with another index such as a weight, may be normalized according to the population, or may be used in combination by multiplying indices together.
- For <set 1>, case frames that have different expressions of "to set" but similar meanings are listed; in this case, ("she", 56), ("father", 52), and ("wife", 49) are listed as the agent case, ("timer", 67), ("sleep timer", 42), and ("alarm", 41) are listed as the target case, and ("rice cooker", 52), ("air conditioner", 45), ("radio", 32), and ("mobile", 12) are listed as the instrument case.
- In example Ex32, ("system", 83), ("company", 42), ("school", 33), and ("superior", 18) are listed for <setting 4> as the agent case, ("meeting", 41), ("participant", 27), and ("moving time", 10) are listed as the target case, and ("PC (personal computer)", 95), ("scheduler", 72), and ("smartphone", 33) are listed as the de case.
- For example, when the target case "alarm" is fixed, a case frame of a different predicate that takes the same "alarm" in its target case, <improvement 15>, is selected in addition to <set 1>. Both include the same word "alarm" in the target case, but whereas the frequency of "alarm" in <set 1> is 41, its frequency in <improvement 15> is 2 or less.
- Such predicates are likely to differ slightly in word sense.
- For example, the timer function contains many words of semantic classes that are not very relevant. Thus, a predicate is selected that satisfies the condition that the fixed word appears in the same term and that the value n representing the frequency of the word or the strength of the relationship is smaller than a certain threshold value.
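- The dictionary structure of FIG. 20 and the selection rule above can be sketched as follows. The data layout, the entries, and the threshold name beta are assumptions made for illustration (the threshold symbol in the text is garbled); only the idea of keeping predicates whose case frame weakly contains the fixed word is taken from the description.

```python
# Assumed representation of the case frame dictionary and of the selection rule.
case_frame_dictionary = {
    ("set", 1):      {"target": {"timer": 67, "sleep timer": 42, "alarm": 41},
                      "agent":  {"she": 56, "father": 52, "wife": 49}},
    ("improve", 15): {"target": {"alarm": 2, "response": 31}},   # illustrative entry
}

def select_replacement_predicates(fixed_word, case="target", beta=10):
    selected = []
    for (predicate, frame_id), frames in case_frame_dictionary.items():
        n = frames.get(case, {}).get(fixed_word, 0)
        if 0 < n < beta:          # same term present, but only weakly associated
            selected.append((predicate, frame_id, n))
    return selected

print(select_replacement_predicates("alarm"))   # e.g. [('improve', 15, 2)]
```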
- The case frame dictionary 107 may be an existing one.
- However, a general existing case frame dictionary does not contain many of the words used for the service purpose that constitutes the domain, so a user-defined case frame dictionary compiled by collecting the words necessary for the service may be added.
- In step S20, the replacement execution unit 105 reads an unprocessed IND sentence from among the stored search results as the IND sentence to be processed, together with the associated setting information on replacement and the search results, and accepts them.
- In step S21, the replacement execution unit 105 replaces the word that is the replacement target of the IND determination sentence to be processed, based on the IND determination sentence to be processed, the setting information on replacement, and the search results, to generate a corpus, and adjusts conjugations and word endings.
- In step S22, the replacement execution unit 105 stores the generated corpus as a primary substitution generated sentence.
- In step S23, the replacement execution unit 105 determines whether there is an unprocessed IND sentence among the stored search results, and if there is, the process returns to step S20. That is, the processes of steps S20 to S23 are repeated until corpora have been generated by replacement based on the search results for all the IND sentences. When it is determined in step S23 that there is no unprocessed IND sentence, the process proceeds to step S24.
- In step S24, the duplicate sentence exclusion unit 106 reads an unprocessed corpus from among the corpora stored in the process of step S22 and accepts it as the corpus to be processed.
- In step S25, the duplicate sentence exclusion unit 106 determines whether there is a sentence (duplicate sentence) that duplicates the corpus to be processed among the corpora generated and stored so far by the process of step S22. More specifically, the duplicate sentence exclusion unit 106 searches for the corpus to be processed in the corpus group stored as newly generated corpora, and determines whether it is a duplicate sentence based on whether there is a match. If it is determined in step S25 that the corpus is a duplicate sentence, the process proceeds to step S26.
- In step S26, the duplicate sentence exclusion unit 106 regards the generated corpus as a duplicate sentence, that is, a discard determination sentence, and discards it.
- On the other hand, when it is determined in step S25 that the generated corpus is not a duplicate sentence, the process proceeds to step S27.
- In step S27, the duplicate sentence exclusion unit 106 stores the substitution-generated corpus to be processed in the substitution generated sentence storage unit 109.
- In step S28, the replacement execution unit 105 determines whether there is an unprocessed search result, and if there is, the process returns to step S24.
- When it is determined in step S28 that there is no unprocessed search result, the process proceeds to step S29.
- In step S29, the replacement execution unit 105 stores the corpora currently stored without being excluded as duplicate sentences in the substitution generated sentence storage unit 109 as the final substitution generated sentences.
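- The duplicate sentence exclusion of steps S24 to S29 amounts to the following minimal sketch: a newly generated sentence is discarded when it exactly matches a sentence already stored as a substitution generated sentence.

```python
# Minimal sketch of duplicate sentence exclusion (exact-match comparison).
def exclude_duplicates(generated_sentences):
    stored, discarded, seen = [], [], set()
    for sentence in generated_sentences:
        if sentence in seen:
            discarded.append(sentence)     # treated as a discard determination sentence
        else:
            seen.add(sentence)
            stored.append(sentence)        # kept as a substitution generated sentence
    return stored, discarded
```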
- In step S30, the filtering processing unit 110 executes the filtering process, and classifies the corpora consisting of the newly generated substitution generated sentences stored in the substitution generated sentence storage unit 109 into corpora of OOD determination sentences, COOD determination sentences, IND determination sentences, and CIND determination sentences. The details of the filtering process will be described later with reference to the flowchart of FIG.
- Japanese OOD candidate sentences as shown in FIG. 15 and English OOD candidate sentences as shown in FIG. 19 are generated.
- In step S31, the semantic analyzer 131 accepts, as the corpus to be processed, a corpus consisting of unprocessed substitution generated sentences from among the corpora of substitution generated sentences stored in the substitution generated sentence storage unit 109.
- In step S32, the semantic analyzer 131 determines whether the corpus consisting of the substitution generated sentences to be processed is an IND determination sentence. If it is determined in step S32 that it is an IND determination sentence, the process proceeds to step S33.
- In step S33, the semantic analyzer 131 causes the IND determination sentence storage unit 132 to store the corpus consisting of the substitution generated sentences to be processed.
- If it is determined in step S32 that the corpus is not an IND determination sentence, that is, if the corpus consisting of the substitution generated sentences to be processed is regarded as an OOD determination sentence, the process proceeds to step S34.
- In step S34, the semantic analyzer 131 regards the substitution generated sentence to be processed as an OOD determination sentence and stores it in the OOD determination sentence storage unit 136.
- In step S35, the semantic analyzer 131 determines whether there is an unprocessed substitution generated sentence in the substitution generated sentence storage unit 109; when it is determined that there is one, the process returns to step S31 and the subsequent processing is repeated. That is, until no unprocessed input sentence remains, each substitution generated sentence is judged as to whether it is an IND determination sentence, IND determination sentences are stored in the IND determination sentence storage unit 132, and the remaining sentences, regarded as OOD determination sentences, are stored in the OOD determination sentence storage unit 136.
- If it is determined in step S35 that there is no unprocessed substitution generated sentence, the process proceeds to step S36. That is, by the processing up to this point, the group of substitution generated sentences stored in the substitution generated sentence storage unit 109 has been classified by the semantic analyzer 131, which was trained using the old-version corpus, into IND determination sentences and OOD determination sentences, and these are stored in the IND determination sentence storage unit 132 and the OOD determination sentence storage unit 136, respectively.
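The classification in steps S31 to S35 can be pictured with the following sketch, in which `analyzer.predict` stands in for the semantic analyzer 131 trained on the old-version corpus. The call name and its return convention (the estimated domain, or None when no domain matches) are assumptions for illustration, not the patent's API.

```python
def split_generated_sentences(generated_sentences, analyzer, target_domain):
    """Classify substitution generated sentences into IND and OOD candidates (steps S31-S35)."""
    ind_sentences, ood_sentences = [], []
    for sentence in generated_sentences:
        if analyzer.predict(sentence) == target_domain:
            ind_sentences.append(sentence)   # goes to the IND determination sentence storage unit 132
        else:
            ood_sentences.append(sentence)   # goes to the OOD determination sentence storage unit 136
    return ind_sentences, ood_sentences
```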
- In step S36, the COOD corpus extraction unit 133 executes the COOD corpus extraction process, extracts COOD determination sentence candidates from the domain of the corpora regarded as IND determination sentences, and causes the COOD determination sentence storage unit 134 to store them.
- The remaining IND determination sentences are stored in the confirmed IND determination sentence storage unit 135 as confirmed IND determination sentences.
- Corpora regarded as non-sentences are treated as discard determination sentences and discarded.
- In step S37, the CIND corpus extraction unit 137 executes the CIND corpus extraction process, extracts CIND determination sentences from the corpora regarded as OOD determination sentences, and causes the CIND determination sentence storage unit 138 to store them.
- The remaining OOD determination sentences are stored in the confirmed OOD determination sentence storage unit 139 as confirmed OOD determination sentences.
- Corpora regarded as non-sentences are treated as discard determination sentences and discarded.
- By the above processing, the corpora are classified in advance into COOD determination sentences, confirmed IND determination sentences, CIND determination sentences, and confirmed OOD determination sentences, so the load of the confirmation work can be reduced, and as a result the development cost of the corpus can be reduced.
- Furthermore, the frame estimation unit 13 and the semantic analysis unit 14 can improve their recognition accuracy by learning with a corpus that includes the generated COOD determination sentences and CIND determination sentences.
- Next, the COOD corpus extraction process will be described with reference to the flowchart in FIG. Ideally, the work of extracting COOD determination sentences from the substitution generated sentences would be performed manually, but to improve work efficiency, the COOD candidate sentences can be narrowed down further by the following filtering in the COOD corpus extraction process.
- In step S51, the COOD corpus extraction unit 133 receives as input an unprocessed corpus from among the corpora serving as IND determination sentences stored in the IND determination sentence storage unit 132, and sets it as the corpus to be processed.
- In step S52, the COOD corpus extraction unit 133 controls the non-sentence determination unit 133a to calculate the Perplexity value of the corpus to be processed.
- The Perplexity value represents the average branching factor, where the number of branches (the number of candidates) for the word following a given word is expressed as the reciprocal of the n-gram probability. That is, compared with a sentence generated by combining words at random, a meaningful sentence has high connection probabilities between its words and a low branching factor for the connected words, so its Perplexity value is low. Conversely, in a sentence that does not make sense, the probabilities of the word combinations are low and the branching factor of the connected words is high, so the Perplexity value is high.
- In other words, the Perplexity value is an index for judging the probabilistic validity of a generated sentence.
- The Perplexity value of a generated sentence is calculated, for example, as follows. For details on how to calculate Perplexity values, refer to Chapter 4, "Language Modeling with N-grams", of Daniel Jurafsky (2016), https://web.stanford.edu/~jurafsky/slp3/4.pdf.
- The joint probability P(w) of a word string is modeled based on the idea that word strings are generated probabilistically.
- The parameters of this n-gram model, that is, the n-gram probabilities of words, are learned from a large amount of training text such as Internet sites or news articles.
- Using the n-gram model learned in this way, the non-sentence determination unit 133a calculates the Perplexity value expressed by the following Expression (3) for each generated corpus.
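As a rough illustration of this calculation, the following sketch trains a bigram model with add-one smoothing and computes the per-word perplexity of a sentence. The choice of bigrams and of add-one (Laplace) smoothing is an assumption made for illustration; the patent only refers to n-gram language modeling in general and does not reproduce Expression (3) here.

```python
import math
from collections import Counter

def train_bigram_model(training_sentences):
    """Count unigram histories and bigrams from whitespace-tokenized training sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in training_sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def perplexity(sentence, unigrams, bigrams):
    """Perplexity of a sentence under the bigram model, using add-one smoothing."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    log_prob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(p)
    n = len(tokens) - 1
    return math.exp(-log_prob / n)   # low for natural sentences, high for non-sentences
```

A sentence whose perplexity exceeds the threshold used in step S53 below would then be discarded as a non-sentence.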
- For example, the sentence “Tell me a good reputation near here” is an unnatural sentence.
- Its Perplexity value PPL is 80.4152.
- The sentence “Tell me a good reputation surfing near here” is also an unnatural sentence whose meaning does not come through.
- Its Perplexity value PPL is 70.6759; for example, the n-gram probability p(surfing | good) = 2.13532e-05 between “good” and “surfing” is low.
- The sentence “Tell me a store with a good reputation near here” is a relatively meaningful sentence.
- Its Perplexity value PPL is 57.4806.
- The sentence “Tell me a massage with a good reputation near here” is a sentence that makes sense.
- Its Perplexity value PPL is 57.0273.
- In this way, the more natural the sentence, the higher the n-gram probabilities and the lower the Perplexity value.
- In step S53, the non-sentence determination unit 133a determines whether the corpus to be processed is a non-sentence based on whether its calculated Perplexity value is larger than a predetermined threshold.
- If, in step S53, the Perplexity value PPL is larger than the predetermined threshold, the process proceeds to step S55.
- In step S55, the non-sentence determination unit 133a regards the corpus to be processed as a non-sentence, discards it as a discard determination sentence, and the process proceeds to step S56.
- If, in step S53, the Perplexity value PPL is not larger than the predetermined threshold, that is, if it is equal to or smaller than the threshold, the non-sentence determination unit 133a regards the corpus to be processed as a meaningful sentence, and the process proceeds to step S54.
- In step S54, the non-sentence determination unit 133a stores the corpus to be processed.
- In step S56, the COOD corpus extraction unit 133 determines whether there is an unprocessed corpus among the corpora serving as IND determination sentences stored in the IND determination sentence storage unit 132; if there is, the process returns to step S51. That is, steps S51 to S56 are repeated so that the Perplexity value PPL is calculated for all the IND determination sentences and each is judged, by comparison with the predetermined threshold, to be either a non-sentence or a corpus consisting of meaningful sentences.
- When it is determined in step S56 that there is no unprocessed corpus, that is, when the Perplexity value PPL has been calculated for all the IND determination sentences and each has been judged by comparison with the predetermined threshold to be either a meaningful corpus or a non-sentence, the process proceeds to step S57.
- In step S57, the COOD corpus extraction unit 133 accepts, as the corpus to be processed, an unprocessed corpus from among the IND determination sentences that were stored in steps S52 and S53 by the non-sentence determination unit 133a as corpora consisting of meaningful sentences rather than non-sentences.
- In step S58, the COOD corpus extraction unit 133 controls the non-appearance determination unit 133b to calculate the non-appearance, in the target domain, of the words included in the corpus to be processed.
- Non-appearance is an index indicating to what extent the generated corpus contains words that do not appear in the corpus of the domain that the semantic analyzer has determined to consist of IND determination sentences.
- The sentences (corpora) shown in FIG. 24 all contain words with a low frequency of occurrence in the ALARM-CHANGE domain; that is, they are Close OOD candidates with high non-appearance, and they may also include discard determination sentences. However, since non-sentences have already been excluded in the processing before non-appearance is determined, what is extracted here is substantially the COOD determination sentences. In the following, the words enclosed in quotation marks are words with high non-appearance.
- Non-appearance can be evaluated numerically, for example, by using the number of words in the corpus to be processed that do not appear in the target domain.
- The non-appearance determination unit 133b lets n be the total number of words in the corpus to be processed, which is an IND determination sentence, and lets no be the number of those words that do not appear in the domain of the IND determination sentences (that is, words not included in any corpus belonging to that domain other than the corpus to be processed), and calculates no/n as the parameter representing non-appearance.
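A minimal sketch of the no/n calculation follows. The domain vocabulary is assumed to be the set of all words appearing in the other corpora of the IND domain, which is how the text above describes it; the function and variable names themselves are illustrative only.

```python
def non_appearance_ratio(sentence_tokens, domain_vocabulary):
    """no/n: fraction of words in the sentence that never appear in the rest of the domain."""
    n = len(sentence_tokens)
    if n == 0:
        return 0.0
    no = sum(1 for word in sentence_tokens if word not in domain_vocabulary)
    return no / n
```

In step S59 this ratio is compared against a threshold: a high value marks the sentence as a COOD candidate, a low value as a confirmed IND determination sentence.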
- In step S59, the non-appearance determination unit 133b determines whether the parameter no/n, which represents the non-appearance of the words included in the corpus to be processed in the domain consisting of the IND determination sentences, is larger than a predetermined threshold.
- If, in step S59, the parameter no/n representing non-appearance is larger than the predetermined threshold, that is, if the non-appearance of the words included in the corpus to be processed in the domain consisting of the IND determination sentences is high, the process proceeds to step S60.
- In step S60, the non-appearance determination unit 133b extracts the corpus to be processed as a COOD determination sentence and stores it in the COOD determination sentence storage unit 134.
- If, in step S59, the parameter no/n representing non-appearance is not larger than the predetermined threshold, that is, if it is equal to or smaller than the threshold and the non-appearance of the included words in the domain consisting of the IND determination sentences is low, the process proceeds to step S61.
- In step S61, the non-appearance determination unit 133b regards the corpus to be processed as a confirmed IND determination sentence and causes the confirmed IND determination sentence storage unit 135 to store it.
- In step S62, the COOD corpus extraction unit 133 determines whether there is an unprocessed IND determination sentence. If there is, the process returns to step S57. That is, steps S57 to S62, including the processing of the non-appearance determination unit 133b in steps S58 and S59, are repeated until no unprocessed IND determination sentence remains.
- When it is determined in step S62 that there is no unprocessed IND determination sentence, that is, when all the IND determination sentences are regarded as processed, the process ends.
- By the above processing, among the corpora in the domain constituted by the IND determination sentences, a corpus that is not a non-sentence (its Perplexity value is not high) and whose included words have high non-appearance is regarded as a COOD determination sentence and stored in the COOD determination sentence storage unit 134, while a corpus that is not a non-sentence and whose included words have low non-appearance is regarded as a confirmed IND determination sentence and stored in the confirmed IND determination sentence storage unit 135.
- Non-appearance may also be evaluated using TF values and IDF values. The TF value is an index for analyzing the words that characterize each document (here, each domain) when there are a plurality of documents (here, a plurality of domains), and is expressed by the following Equation (4).
- The IDF value is an index indicating whether each word is used in common across documents, and is expressed by the following Equation (5).
- When TF/IDF values are used, the count nlw is taken to be the number of words in the corpus to be processed whose TF/IDF value is less than a threshold (a value between 0 and 1) in the important-word list of words that appear frequently and ubiquitously in the target domain of the IND determination sentences, or that do not exist in that important-word list.
- In this case, the non-appearance determination unit 133b calculates the parameter nlw/n representing non-appearance in step S59.
- In step S59, the non-appearance determination unit 133b then determines whether the parameter nlw/n, representing the non-appearance of the words included in the corpus to be processed in the predetermined domain, is larger than a predetermined threshold.
- If, in step S59, the parameter nlw/n representing non-appearance is larger than the predetermined threshold, the process proceeds to step S60, and the non-appearance determination unit 133b extracts the corpus to be processed as a COOD determination sentence and stores it in the COOD determination sentence storage unit 134.
- If, in step S59, the parameter nlw/n representing non-appearance is not larger than the predetermined threshold, that is, if it is equal to or smaller than the threshold, the process proceeds to step S61, and the non-appearance determination unit 133b regards the corpus to be processed as a confirmed IND determination sentence and causes the confirmed IND determination sentence storage unit 135 to store it.
- As shown in Example Ex52 in the figure, the TF values of the words in the corpus group of Example Ex51, listed in descending order, are: “changed” 351, “7” 334, “8” 260, “change” 258, “time” 220, “setting” 159, “6” 152, “do” 148, “alarm” 110, “morning” 64, “set” 56, “wish” 55, ..., “awakening” 6, “alarm clock” 5.
- As shown in Example Ex53 in the figure, the TF/IDF values of the words in the corpus group of Example Ex51, listed in descending order, are: “alarm” 0.0050379225545, “set” 0.00328857409316, “wake up” 0.00100030484831, “tomorrow” 0.000795915410064, “wake up” 0.000763298323913, “wake up” 0.0006226999996573, “song” 0.00060708690425, “morning” 0.000521900115019, “setting” 0.00046290476509, “over” 0.000033639933349, “announcement” 0.000297198910881, “awakening” 0.0002925248238318, “okachi” 0.000196205208903, “alarm clock” 0.000185042017918, ...
- A word with a high TF value or TF/IDF value can be considered a word that appears frequently (that is, has low non-appearance) and is highly important. Therefore, among the corpora included in the IND determination sentences, a corpus containing many words whose TF value or TF/IDF value is at or below a certain threshold is likely to be a COOD determination sentence. Accordingly, the COOD determination sentences are obtained as those corpora, among the corpora included in the IND determination sentences, that contain many words not belonging to the group of words with high TF/IDF values.
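Since Equations (4) and (5) are not reproduced here, the sketch below assumes common textbook definitions (term frequency normalized by document length, and the logarithm of the inverse document frequency) and uses them to compute the nlw/n parameter. All names and the exact formulas are assumptions for illustration, not the patent's own equations.

```python
import math
from collections import Counter

def tf_idf_by_domain(domains):
    """domains: dict mapping domain name -> list of tokenized sentences."""
    domain_tokens = {d: [w for sent in sents for w in sent] for d, sents in domains.items()}
    n_domains = len(domain_tokens)
    document_frequency = Counter()
    for tokens in domain_tokens.values():
        document_frequency.update(set(tokens))
    scores = {}
    for domain, tokens in domain_tokens.items():
        total = len(tokens)
        term_frequency = Counter(tokens)
        scores[domain] = {
            word: (count / total) * math.log(n_domains / document_frequency[word])
            for word, count in term_frequency.items()
        }
    return scores

def non_appearance_by_tfidf(sentence_tokens, domain_scores, threshold):
    """nlw/n: fraction of words whose TF/IDF in the target domain is below the threshold
    (words absent from the domain's important-word list count as below it)."""
    n = len(sentence_tokens)
    if n == 0:
        return 0.0
    nlw = sum(1 for word in sentence_tokens if domain_scores.get(word, 0.0) < threshold)
    return nlw / n
```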
- Next, the CIND corpus extraction process will be described. In step S101, the CIND corpus extraction unit 137 receives as input an unprocessed corpus from among the corpora consisting of OOD determination sentences stored in the OOD determination sentence storage unit 136, and sets it as the corpus to be processed.
- In step S102, the CIND corpus extraction unit 137 controls the non-sentence determination unit 137a to calculate the Perplexity value of the corpus to be processed.
- In step S103, the non-sentence determination unit 137a determines whether the corpus to be processed is a non-sentence based on whether its calculated Perplexity value is larger than a predetermined threshold.
- If, in step S103, the Perplexity value PPL is larger than the predetermined threshold, the process proceeds to step S105.
- In step S105, the non-sentence determination unit 137a regards the corpus to be processed as a non-sentence, discards it as a discard determination sentence, and the process proceeds to step S106.
- If, in step S103, the Perplexity value PPL is not larger than the predetermined threshold, that is, if it is equal to or smaller than the threshold, the non-sentence determination unit 137a regards the corpus to be processed as a meaningful sentence, and the process proceeds to step S104.
- In step S104, the non-sentence determination unit 137a stores the corpus to be processed.
- In step S106, the CIND corpus extraction unit 137 determines whether there is an unprocessed corpus among the corpora serving as OOD determination sentences stored in the OOD determination sentence storage unit 136; if there is, the process returns to step S101.
- That is, steps S101 to S106 are repeated so that the Perplexity value PPL is calculated for all the OOD determination sentences and each is judged, by comparison with the predetermined threshold, to be either a non-sentence or a corpus consisting of meaningful sentences.
- When it is determined in step S106 that there is no unprocessed corpus, that is, when the Perplexity value PPL has been calculated for all the OOD determination sentences and each has been judged by comparison with the predetermined threshold to be either a meaningful corpus or a non-sentence, the process proceeds to step S107.
- In step S107, the CIND corpus extraction unit 137 accepts, as the corpus to be processed, an unprocessed corpus from among the OOD determination sentences that were stored in steps S102 and S103 by the non-sentence determination unit 137a as corpora consisting of meaningful sentences rather than non-sentences.
- In step S108, the CIND corpus extraction unit 137 controls the non-appearance determination unit 137b to calculate the parameter no/n representing the non-appearance, in the target domain, of the words included in the corpus to be processed.
- In step S109, the non-appearance determination unit 137b determines whether the parameter no/n, representing the non-appearance of the words included in the corpus to be processed in the predetermined domain, is larger than a predetermined threshold.
- If, in step S109, the parameter no/n representing non-appearance is larger than the predetermined threshold, the process proceeds to step S110.
- In step S110, the non-appearance determination unit 137b regards the corpus to be processed as a confirmed OOD determination sentence and stores it in the confirmed OOD determination sentence storage unit 139.
- If, in step S109, the parameter no/n representing non-appearance is not larger than the predetermined threshold, that is, if it is equal to or smaller than the threshold, the process proceeds to step S111.
- In step S111, the non-appearance determination unit 137b regards the corpus to be processed as a CIND determination sentence and causes the CIND determination sentence storage unit 138 to store it.
- In step S112, the CIND corpus extraction unit 137 determines whether there is an unprocessed OOD determination sentence. If there is, the process returns to step S107. That is, steps S107 to S112 are repeated until no unprocessed OOD determination sentence remains.
- When it is determined in step S112 that there is no unprocessed OOD determination sentence, that is, when all the OOD determination sentences are regarded as processed, the process ends.
- By the above processing, among the corpora serving as OOD determination sentences, a corpus that is not a non-sentence (its Perplexity value is not high) and whose included words have low non-appearance is regarded as a CIND determination sentence and stored in the CIND determination sentence storage unit 138, while a corpus that is not a non-sentence and whose included words have high non-appearance is regarded as a confirmed OOD determination sentence and stored in the confirmed OOD determination sentence storage unit 139.
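Putting the two extraction processes together, the overall decision rules can be sketched as follows. Here `perplexity_of`, `non_appearance_of`, and both thresholds are placeholders for the components and values described above, not names used by the patent.

```python
def classify_candidates(ind_candidates, ood_candidates,
                        perplexity_of, non_appearance_of,
                        ppl_threshold, appearance_threshold):
    """Split IND/OOD candidates into confirmed IND, COOD, CIND and confirmed OOD sentences."""
    confirmed_ind, cood, cind, confirmed_ood = [], [], [], []
    for sentence in ind_candidates:                      # COOD corpus extraction process
        if perplexity_of(sentence) > ppl_threshold:
            continue                                     # non-sentence: discard
        if non_appearance_of(sentence) > appearance_threshold:
            cood.append(sentence)
        else:
            confirmed_ind.append(sentence)
    for sentence in ood_candidates:                      # CIND corpus extraction process
        if perplexity_of(sentence) > ppl_threshold:
            continue                                     # non-sentence: discard
        if non_appearance_of(sentence) > appearance_threshold:
            confirmed_ood.append(sentence)
        else:
            cind.append(sentence)
    return confirmed_ind, cood, cind, confirmed_ood
```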
- Although it is desirable to perform the extraction of CIND determination sentences manually as well, the narrowing-down for that purpose can be automated in this way, so the number of work steps can be reduced.
- As with the COOD corpus extraction process, non-appearance may also be evaluated using TF/IDF values: the count nlw is the number of words whose TF/IDF value is less than a threshold (a value between 0 and 1) in the important-word list of words that appear frequently and ubiquitously in the target domain of the IND determination sentences, or that do not exist in that important-word list.
- In this case, the non-appearance determination unit 137b calculates the parameter nlw/n representing non-appearance.
- In step S109, the non-appearance determination unit 137b then determines whether the parameter nlw/n, representing the non-appearance of the words included in the corpus to be processed in the predetermined domain, is larger than a predetermined threshold.
- If, in step S109, the parameter nlw/n representing non-appearance is larger than the predetermined threshold, the process proceeds to step S110, and the corpus to be processed is regarded as a confirmed OOD determination sentence and stored in the confirmed OOD determination sentence storage unit 139.
- If, in step S109, the parameter nlw/n representing non-appearance is not larger than the predetermined threshold, that is, if it is equal to or smaller than the threshold, the process proceeds to step S111, and the non-appearance determination unit 137b regards the corpus to be processed as a CIND determination sentence and causes the CIND determination sentence storage unit 138 to store it.
- By the above processing, the corpus is generated in a state already classified into IND determination sentences (confirmed IND determination sentences), COOD determination sentences, CIND determination sentences, and OOD determination sentences (confirmed OOD determination sentences).
- The order of the processing described above may be changed.
- For example, the COOD corpus extraction process in step S36 and the CIND corpus extraction process in step S37 may be interchanged.
- Likewise, within the COOD corpus extraction process and the CIND corpus extraction process, the non-sentence determination using Perplexity values and the extraction of COOD determination sentences and CIND determination sentences using the parameter representing non-appearance may be performed in the reverse order.
- FIG. 27 shows a configuration example of a general-purpose personal computer.
- This personal computer incorporates a CPU (Central Processing Unit) 1001.
- An input/output interface 1005 is connected to the CPU 1001 via a bus 1004, and a ROM (Read Only Memory) 1002 and a RAM (Random Access Memory) 1003 are also connected to the bus 1004.
- Connected to the input/output interface 1005 are: an input unit 1006 including input devices such as a keyboard and a mouse with which the user enters operation commands; an output unit 1007 that outputs a processing operation screen and images of processing results to a display device; a storage unit 1008, including a hard disk drive, that stores programs and various data; and a communication unit 1009, including a LAN (Local Area Network) adapter and the like, that executes communication processing via a network typified by the Internet.
- Also connected is a drive 1010 that reads data from and writes data to a removable medium 1011 such as a magnetic disk (including a flexible disk), an optical disc (including a CD-ROM (Compact Disc-Read Only Memory) and a DVD (Digital Versatile Disc)), a magneto-optical disc (including an MD (Mini Disc)), or a semiconductor memory.
- The CPU 1001 executes various processes according to a program stored in the ROM 1002, or a program that is read from a removable medium 1011 such as a magnetic disk, an optical disc, a magneto-optical disc, or a semiconductor memory, installed in the storage unit 1008, and loaded from the storage unit 1008 into the RAM 1003.
- The RAM 1003 also stores data necessary for the CPU 1001 to execute these various processes, as appropriate.
- In the computer configured as described above, the series of processes described above is performed by the CPU 1001 loading a program stored in the storage unit 1008 into the RAM 1003 via the input/output interface 1005 and the bus 1004 and executing it.
- the program executed by the computer (CPU 1001) can be provided by being recorded on, for example, a removable medium 1011 as a package medium or the like. Also, the program can be provided via a wired or wireless transmission medium such as a local area network, the Internet, or digital satellite broadcasting.
- the program can be installed in the storage unit 1008 via the input / output interface 1005 by mounting the removable media 1011 in the drive 1010.
- the program can be received by the communication unit 1009 via a wired or wireless transmission medium and installed in the storage unit 1008.
- the program can be installed in advance in the ROM 1002 or the storage unit 1008.
- The program executed by the computer may be a program in which the processes are performed chronologically in the order described in this specification, or a program in which the processes are performed in parallel or at the necessary timing, such as when a call is made.
- the CPU 1001 in FIG. 27 realizes the functions of the semantic analyzer 131, the COOD corpus extraction unit 133, and the CIND corpus extraction unit 137.
- The storage unit 1008 realizes the IND determination sentence storage unit 132, the OOD determination sentence storage unit 136, the COOD determination sentence storage unit 134, the confirmed IND determination sentence storage unit 135, the CIND determination sentence storage unit 138, and the confirmed OOD determination sentence storage unit 139.
- In this specification, a system means a set of a plurality of components (devices, modules (parts), and the like), regardless of whether all the components are housed in the same casing. Therefore, a plurality of devices housed in separate casings and connected via a network, and a single device in which a plurality of modules are housed in one casing, are both systems.
- the present disclosure can have a cloud computing configuration in which one function is shared and processed by a plurality of devices via a network.
- each step described in the above-described flowchart can be executed by one device or in a shared manner by a plurality of devices.
- the plurality of processes included in one step can be executed by being shared by a plurality of devices in addition to being executed by one device.
- the present disclosure can also have the following configurations.
- a structural analysis unit that analyzes the structure of input sentences;
- a replacement part setting unit configured to set a replacement part in the input sentence based on an analysis result of the structure analysis unit;
- An information processing apparatus including: a corpus generation unit that generates a corpus by replacing words in the replacement portion in the input sentence;
- the information processing apparatus according to ⁇ 1>, wherein the input sentence is an IND (In Domain) determination sentence which is an utterance content to be handled by a predetermined application program.
- the structure analysis unit analyzes a predicate term structure of the input sentence.
- the replacement point setting unit sets a replacement point in the input sentence based on the predicate term structure that is the analysis result of the structure analysis unit.
- the information processing apparatus further includes a dictionary query unit that queries a dictionary to search for a candidate for replacing the word of the replacement part in the input sentence,
- the information processing apparatus according to any one of ⁇ 3>, wherein the corpus generation unit replaces the word of the replacement portion in the input sentence with the word searched by the dictionary inquiry unit.
- the dictionary is a case frame dictionary.
- the replacement point setting unit sets a replacement point in the input sentence and a replacement method of the replacement point based on the predicate term structure which is an analysis result of the structure analysis unit,
- the information processing apparatus according to ⁇ 4>, wherein the corpus generation unit generates a corpus by replacing the word of the replacement part in the input sentence with the replacement method.
- The information processing apparatus according to <6>, in which the replacement method includes a first method in which a predicate of the input sentence is fixed and a noun serving as a predicate term including a target case is replaced, and a second method in which the predicate term including the target case of the input sentence is fixed and the predicate is replaced.
- The information processing apparatus according to any one of <1> to <7>, further including a classification unit that classifies the corpus generated by the corpus generation unit into an IND (In Domain) determination sentence, which is utterance content to be handled by a predetermined application program, or an OOD (Out of Domain) determination sentence, which is unexpected utterance content that should not be handled by the predetermined application program.
- The information processing apparatus according to <8>, further including a COOD determination sentence extraction unit that extracts, from the corpus classified as the IND determination sentences, a corpus that is the OOD determination sentence and exists in the vicinity of the boundary in the feature space represented by the respective features of the OOD determination sentence and the IND determination sentence, as a COOD (Close OOD) determination sentence.
- The information processing apparatus according to <9>, in which the COOD determination sentence extraction unit extracts, from a domain including the corpus classified as the IND determination sentences, a corpus in which the number of words not included in itself and the other corpora is more than a predetermined number, as the COOD determination sentence.
- The information processing apparatus according to <10>, in which the COOD determination sentence extraction unit extracts from the domain, as the COOD determination sentence, a corpus whose non-appearance, represented by the ratio of the number of words not included in itself and the other corpora to the number of words included in the corpus of the domain, is higher than a predetermined value.
- The information processing apparatus according to <10>, in which the COOD determination sentence extraction unit extracts from the domain, as the COOD determination sentence, a corpus containing many words whose non-appearance, represented by TF/IDF consisting of a TF (Term Frequency) value and an IDF (Inverse Document Frequency) value in the corpus of the domain, is high, that is, whose TF/IDF value is lower than a predetermined value.
- the COOD determination sentence extraction unit calculates a Perplexity value for the corpus of the domain, and discards, as non-sentences, corpora whose Perplexity value is higher than a predetermined value.
- The information processing apparatus according to <8>, further including a CIND determination sentence extraction unit that extracts, from the corpus classified as the OOD determination sentences, a corpus that is the IND determination sentence and exists in the vicinity of the boundary in the feature space represented by the respective features of the IND determination sentence and the OOD determination sentence, as a CIND (Close IND) determination sentence. <15>
- The information processing apparatus according to <14>, in which the CIND determination sentence extraction unit extracts, from all the corpora classified as the OOD determination sentences, a corpus in which the number of words included in the IND corpus is more than a predetermined number, as the CIND determination sentence.
- the CIND determination sentence extraction unit extracts, as the CIND determination sentence, a corpus whose non-appearance, represented by the ratio of the number of words not included in the IND corpus to the number of words included in the corpus of the domain, is lower than a predetermined value.
- the CIND determination sentence extraction unit extracts, as the CIND determination sentence, a corpus in which the non-appearance of words, represented by TF/IDF consisting of a TF (Term Frequency) value and an IDF (Inverse Document Frequency) value in the corpus of the domain, is lower than a predetermined value.
- a structural analysis unit that analyzes the structure of the input sentence
- a replacement part setting unit configured to set a replacement part in the input sentence based on an analysis result of the structure analysis unit;
- a program for causing a computer to function as the above units and as a corpus generation unit that generates a corpus by replacing the word of the replacement portion in the input sentence.
- 51 corpus generation apparatus, 101 IND sentence reception unit, 102 language analysis unit, 103 replacement location setting unit, 104 dictionary query unit, 105 replacement execution unit, 106 duplicate sentence exclusion unit, 107 case frame dictionary, 108 generation condition setting data storage unit, 109 substitution generated sentence storage unit, 110 filtering processing unit, 131 semantic analyzer, 132 IND determination sentence storage unit, 133 COOD corpus extraction unit, 133a non-sentence determination unit, 133b non-appearance determination unit, 134 COOD determination sentence storage unit, 135 confirmed IND determination sentence storage unit, 136 OOD determination sentence storage unit, 137 CIND corpus extraction unit, 137a non-sentence determination unit, 137b non-appearance determination unit, 138 CIND determination sentence storage unit, 139 confirmed OOD determination sentence storage unit
Abstract
The present disclosure pertains to an information processing device, an information processing method, and a program with which it is possible to lighten a load pertaining to the creation of a corpus needed for developing a semantic analyzer and reduce development costs. A corpus is generated by analyzing the predicate-argument structure of a manually generated In Domain (IND) corpus, setting a substitution location, searching for a word similar to the word at the substitution location by a case-frame dictionary, and substituting the word at the substitution location with the word resulting from the search. The present disclosure generates a corpus used for learning in a semantic analyzer. The present disclosure can be applied to a corpus generation device.
Description
The present disclosure relates to an information processing apparatus, an information processing method, and a program, and in particular to an information processing apparatus, an information processing method, and a program that can reduce the development cost of a corpus that contributes to improving the accuracy of a semantic analyzer.
A speech dialogue system converts utterance content into text data, semantically analyzes the text data, and recognizes the utterance content.
To recognize the utterance content, a semantic analyzer is used that analyzes and recognizes utterances through machine learning using a corpus (a collection of example sentences).
The semantic analyzer analyzes and recognizes utterance content through machine learning using a corpus of the utterances to be handled by each application program.
Among speech dialogue systems, multi-domain speech dialogue systems are widely used so that a single system can handle multiple topics, tasks, and application programs, such as weather inquiries, schedule confirmation, and music playback.
A multi-domain speech dialogue system must make it easy to add a semantic analysis function for a new domain. For this reason, architectures that build a semantic analysis system by combining the semantic analyzers of individual domains have been widely proposed. Such an architecture consists of the semantic analysis systems of a plurality of domains and a domain selector (frame estimator) that integrates them (Non-Patent Document 1).
A multi-domain speech dialogue system therefore requires a technique for realizing a semantic analyzer that can recognize the utterance content of a variety of application programs by learning with the corpus required for each domain.
It is also necessary to prevent the semantic analysis processing from transitioning to the wrong domain when an unexpected utterance (Out of Domain utterance, hereinafter also referred to as an OOD utterance) is received. Ideally, an OOD corpus would be prepared and the frame estimator retrained, but because developing an OOD corpus takes many work steps, various approaches have been discussed, such as estimation that also makes use of the dialogue history (see Non-Patent Document 2).
However, the corpora of Non-Patent Documents 1 and 2 are created manually, and recognizing the utterance content of a variety of application programs requires even more corpora, so the load of corpus creation is a large part of the development cost of a semantic analyzer.
The present disclosure has been made in view of such circumstances, and in particular reduces the development cost of a semantic analyzer by making it possible to develop the corpus required for learning efficiently.
An information processing apparatus according to one aspect of the present disclosure includes a structure analysis unit that analyzes the structure of an input sentence, a replacement location setting unit that sets a replacement location in the input sentence based on the analysis result of the structure analysis unit, and a corpus generation unit that generates a corpus by replacing the word at the replacement location in the input sentence.
The input sentence may be an IND (In Domain) determination sentence, that is, utterance content to be handled by a predetermined application program.
The structure analysis unit may analyze the predicate-argument structure of the input sentence, and the replacement location setting unit may set the replacement location in the input sentence based on the predicate-argument structure obtained as the analysis result of the structure analysis unit.
The apparatus may further include a dictionary query unit that queries a dictionary to search for candidate words with which to replace the word at the replacement location in the input sentence, and the corpus generation unit may replace the word at the replacement location in the input sentence with a word found by the dictionary query unit.
The dictionary may be a case frame dictionary.
The replacement location setting unit may set, based on the predicate-argument structure obtained as the analysis result of the structure analysis unit, a replacement location in the input sentence and a replacement method for that location, and the corpus generation unit may generate a corpus by replacing the word at the replacement location in the input sentence using that replacement method.
The replacement method may include a first method in which the predicate of the input sentence is fixed and a noun serving as a predicate argument including the target case is replaced, and a second method in which the predicate argument including the target case of the input sentence is fixed and the predicate is replaced.
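As a loose illustration of the two replacement methods, the sketch below replaces either the target-case argument or the predicate by simple string substitution. A real implementation would operate on the predicate-argument structure and a case-frame dictionary rather than on raw strings, and all names here are illustrative assumptions.

```python
def replace_argument(sentence, argument_noun, candidate_nouns):
    """First method: keep the predicate, replace the noun filling the target case."""
    return [sentence.replace(argument_noun, noun) for noun in candidate_nouns]

def replace_predicate(sentence, predicate, candidate_predicates):
    """Second method: keep the target-case argument, replace the predicate."""
    return [sentence.replace(predicate, candidate) for candidate in candidate_predicates]
```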
The apparatus may further include a classification unit that classifies the corpus generated by the corpus generation unit into IND (In Domain) determination sentences, which are utterance content to be handled by a predetermined application program, or OOD (Out of Domain) determination sentences, which are unexpected utterance content that should not be handled by the predetermined application program.
The apparatus may further include a COOD determination sentence extraction unit that extracts, from the corpus classified as the IND determination sentences, a corpus that is an OOD determination sentence and exists near the boundary with the IND determination sentences in the feature space expressed by their respective features, as a COOD (Close OOD) determination sentence.
The COOD determination sentence extraction unit may extract, from a domain containing the corpus classified as the IND determination sentences, a corpus in which the number of words not included in itself and the other corpora exceeds a predetermined number, as the COOD determination sentence.
The COOD determination sentence extraction unit may extract from the domain, as the COOD determination sentence, a corpus whose non-appearance, expressed as the ratio of the number of words not included in itself and the other corpora to the number of words included in the corpus of the domain, is higher than a predetermined value.
The COOD determination sentence extraction unit may extract from the domain, as the COOD determination sentence, a corpus in which the non-appearance of words, expressed by TF/IDF consisting of a TF (Term Frequency) value and an IDF (Inverse Document Frequency) value in the corpus of the domain, is higher than a predetermined value.
The COOD determination sentence extraction unit may calculate a Perplexity value for the corpus of the domain and discard, as non-sentences, corpora whose Perplexity value is higher than a predetermined value.
The apparatus may further include a CIND determination sentence extraction unit that extracts, from the corpus classified as the OOD determination sentences, a corpus that is an IND determination sentence and exists near the boundary with the OOD determination sentences in the feature space expressed by their respective features, as a CIND (Close IND) determination sentence.
The CIND determination sentence extraction unit may extract, from all the corpora classified as the OOD determination sentences in a domain containing the corpus classified as the OOD determination sentences, a corpus in which the number of words included in the IND corpus exceeds a predetermined number, as a CIND determination candidate sentence.
The CIND determination sentence extraction unit may extract, as a CIND determination candidate sentence, a corpus whose non-appearance, expressed as the ratio of the number of words not included in the IND corpus to the number of words included in the corpus of the domain, is lower than a predetermined value.
The CIND determination sentence extraction unit may extract, as the CIND determination sentence, a corpus in which the non-appearance of words, expressed by TF/IDF consisting of a TF (Term Frequency) value and an IDF (Inverse Document Frequency) value in the corpus of the domain, is lower than a predetermined value.
The CIND determination sentence extraction unit may calculate a Perplexity value for the corpus of the domain and discard, as non-sentences, corpora whose Perplexity value is higher than a predetermined value.
An information processing method according to one aspect of the present disclosure includes the steps of analyzing the structure of an input sentence, setting a replacement location in the input sentence based on the analysis result of the structure, and generating a corpus by replacing the word at the replacement location in the input sentence.
A program according to one aspect of the present disclosure causes a computer to function as a structure analysis unit that analyzes the structure of an input sentence, a replacement location setting unit that sets a replacement location in the input sentence based on the analysis result of the structure analysis unit, and a corpus generation unit that generates a corpus by replacing the word at the replacement location in the input sentence.
In one aspect of the present disclosure, the structure of an input sentence is analyzed, a replacement location in the input sentence is set based on the analysis result of the structure, and a corpus is generated by replacing the word at the replacement location in the input sentence.
According to one aspect of the present disclosure, it is possible in particular to reduce the development cost of a semantic analyzer.
Hereinafter, preferred embodiments of the present disclosure will be described in detail with reference to the accompanying drawings. In this specification and the drawings, components having substantially the same functional configuration are given the same reference numerals, and redundant descriptions are omitted.
<About the semantic analysis system>
Before describing a corpus generation device to which the technology of the present disclosure is applied, a semantic analysis system that uses a corpus generated by the corpus generation device will be described.
The semantic analysis system recognizes the user's utterance and causes the corresponding application program to run.
As shown in FIG. 1, the semantic analysis system includes, for example, an input reception unit 11, a speech recognition unit 12, a frame estimation unit 13, semantic analysis units 14-1 to 14-3 for the respective domains, and application programs 18 to 20.
The input reception unit 11 receives the user's utterance as a speech signal and outputs it to the speech recognition unit 12.
The speech recognition unit 12 recognizes the speech signal, converts it into a text string, and outputs the text string to the frame estimation unit 13.
Based on the text string, the frame estimation unit 13 transfers processing to the semantic analysis unit 14-1, 14-2, or 14-3 of the most appropriate application (domain) in the subsequent stage. The frame estimation unit 13 also rejects text strings that do not belong to any domain.
Based on the text string, the semantic analysis units 14-1 to 14-3 analyze attributes and their corresponding values and supply the corresponding analysis results 15 to 17 to the application program that is the target of the action: the weather guidance application program 18, the schedule confirmation application program 19, or the music playback application program 20. When there is no particular need to distinguish them, the semantic analysis units 14-1 to 14-3 are simply referred to as the semantic analysis unit 14.
More specifically, when an utterance input V1 such as "What is the weather in Tokyo tomorrow?" is received as a speech signal by the input reception unit 11, the speech recognition unit 12 recognizes it as the text string "What is the weather in Tokyo tomorrow" and supplies it to the frame estimation unit 13.
When the frame estimation unit 13 determines that the utterance is directed to the weather guidance application program 18, it transfers processing to the semantic analysis unit 14-1 of the weather guidance domain.
The semantic analysis unit 14-1 analyzes the weather-related utterance based on this text string and supplies the application program 18 with an analysis result 15 indicating, among other things, that the information for "where" is "Tokyo" and the information for "when" is "tomorrow".
Based on the analysis result 15, the weather guidance application program 18 displays tomorrow's weather guidance for Tokyo.
Similarly, when an utterance input V2 such as "What are my plans from 15:00 today?" is received as a speech signal by the input reception unit 11, the speech recognition unit 12 recognizes it as the text string "What are my plans from 15:00 today" and supplies it to the frame estimation unit 13.
When the frame estimation unit 13 determines that the utterance is directed to the schedule confirmation application program 19, it transfers processing to the semantic analysis unit 14-2 of the schedule confirmation domain.
The semantic analysis unit 14-2 analyzes the schedule-related utterance based on this text string and supplies the application program 19 with an analysis result 16 indicating, among other things, that the utterance is directed to the schedule confirmation application program 19, that the information for "date" is "today", and that the information for "time" is "15:00".
Based on the analysis result 16, the schedule confirmation application program 19 displays the schedule for 15:00 today.
Further, when an utterance input V3 such as "Play Higashino Naka's new song!" is received as a speech signal by the input reception unit 11, the speech recognition unit 12 recognizes it as the text string "Play Higashino Naka's new song" and supplies it to the frame estimation unit 13.
When the frame estimation unit 13 determines that the utterance is directed to the music playback application program 20, it transfers processing to the semantic analysis unit 14-3 of the music playback domain.
The semantic analysis unit 14-3 analyzes the music playback utterance based on this text string and supplies the application program 20 with an analysis result 17 indicating, among other things, that the utterance is directed to the music playback application program 20, that the information for "artist" is "Higashino Naka", and that the information for "music" is "new song".
Based on the analysis result 17, the music playback application program 20 plays Higashino Naka's new song.
Here, in order to analyze information consisting of text strings, the semantic analysis unit 14 performs machine learning using a corpus, which is a collection of example sentences.
Corpora are broadly divided into corpora (also referred to as IND corpora) consisting of utterance content that the application program should handle (In Domain utterances; hereinafter also referred to as IND), and corpora (also referred to as OOD corpora) consisting of utterance content that the application program cannot handle (Out of Domain utterances; hereinafter also referred to as OOD utterances).
By learning the IND corpus, the semantic analysis unit 14 becomes able to analyze and recognize the utterance content that the application program should handle. Similarly, by learning the OOD corpus together with the IND corpus, the frame estimation unit 13 becomes able to transfer processing to the semantic analysis unit 14 of the correct domain and to reject unexpected utterances. That is, by learning the IND corpus and the OOD corpus, the frame estimation unit 13 analyzes which utterances should be handled and which should be rejected, and becomes able to recognize the utterance content appropriately.
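As a rough illustration of this learning step, the following minimal Python sketch (an assumption; the disclosure does not specify a particular learning method, model, or feature set) trains a smoothed bag-of-words scorer on a few labeled IND and OOD example sentences and uses it to accept or reject a new utterance:
import math
from collections import Counter

# Toy labeled corpus: IND = utterances the weather-guidance domain should handle,
# OOD = utterances it should reject. The example sentences are invented for illustration.
ind_corpus = ["what is the weather in tokyo tomorrow", "will it rain this weekend"]
ood_corpus = ["play my favorite song", "set an alarm at seven"]

def word_counts(sentences):
    counts = Counter()
    for s in sentences:
        counts.update(s.split())
    return counts

ind_counts, ood_counts = word_counts(ind_corpus), word_counts(ood_corpus)

def score(sentence, counts, total):
    # Add-one smoothed log-likelihood-style score of the sentence under one class.
    return sum(math.log((counts[w] + 1) / (total + len(counts) + 1)) for w in sentence.split())

def classify(sentence):
    ind_total, ood_total = sum(ind_counts.values()), sum(ood_counts.values())
    ind_score = score(sentence, ind_counts, ind_total)
    ood_score = score(sentence, ood_counts, ood_total)
    return "IND" if ind_score > ood_score else "OOD"

print(classify("what is the weather in osaka"))  # expected: IND
print(classify("play a new song"))               # expected: OOD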
Incidentally, training both the frame estimation unit 13 and the semantic analysis unit 14 requires many corpora, but corpora are generally created by hand.
For example, even for a specific service such as a weather guidance application program, a wide variety of utterance phrasings must be assumed, and although it depends on the algorithm, more than 1,000 utterance examples are generally required.
Furthermore, as the number of service types increases, a separate corpus must be created for each application program corresponding to the added service types.
However, creating corpora by hand is extremely costly and places a heavy burden on development. In particular, the frame estimation unit 13 must be provided with OOD utterances for each domain, so that, for example, roughly twice as many corpora are needed in terms of the total corpus count.
A method of replacing some of the words or phrases in the sentences that make up a corpus with a software program is also widely used, but specifying the parts to be replaced and coding rule statements for the replacement words on a pattern-by-pattern basis is cumbersome, and in addition there is the cost of judging and sorting the completed sentences again by hand, so the burden remains heavy.
Furthermore, a method of creating a corpus by collecting similar sentences from the vast amount of text on the Internet has also been proposed, but most sentences on the Internet are written language and contain few utterance examples.
<About COOD determination sentences>
Incidentally, the corpora determined to be an IND corpus by the frame estimation unit 13 or the semantic analysis unit 14 are limited by recognition accuracy, and therefore may partly include corpora that should be regarded as OOD corpora. Such an OOD determination sentence was originally an IND determination sentence that was turned into an OOD determination sentence by an erroneous determination, and can therefore be considered a corpus close to the IND determination sentences.
For example, consider expressing the features of the meaning represented by corpora as a distribution in a feature space, as shown in FIG. 2, using features 1 and 2 of the words contained in the corpora. In the example of FIG. 2, the black circles are the distribution of corpora regarded as IND determination sentences, and the crosses are the distribution of corpora regarded as OOD determination sentences.
Among the corpora regarded as OOD determination sentences, indicated by the crosses in FIG. 2, those that lie in the vicinity of the corpora regarded as IND determination sentences can be considered corpora similar to the IND determination sentences. Among the corpora regarded as OOD determination sentences, those lying near the distribution of the corpora regarded as IND determination sentences are called COOD (Close Out of Domain) determination sentences.
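As a rough illustration, the following sketch flags OOD determination sentences that lie close to the IND cluster in such a feature space; the two-dimensional feature vectors, the distance measure, and the COOD_RADIUS threshold are all assumptions made for this example:
import math

# Hypothetical 2-D feature vectors (feature 1, feature 2) as in FIG. 2.
ind_points = [(0.8, 0.7), (0.9, 0.6), (0.7, 0.8)]               # IND determination sentences
ood_points = [(0.1, 0.2), (0.6, 0.6), (0.2, 0.1), (0.7, 0.5)]   # OOD determination sentences

def nearest_ind_distance(point):
    # Distance from an OOD point to the closest IND point.
    return min(math.dist(point, q) for q in ind_points)

COOD_RADIUS = 0.25  # assumed threshold; the text does not quantify "near the IND distribution"

cood = [p for p in ood_points if nearest_ind_distance(p) <= COOD_RADIUS]
far_ood = [p for p in ood_points if nearest_ind_distance(p) > COOD_RADIUS]
print("COOD candidates:", cood)     # OOD points lying close to the IND cluster
print("other OOD points:", far_ood)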
Although a COOD determination sentence is an OOD determination sentence, it is a corpus similar to an IND determination sentence; in other words, it can be considered a corpus that resembles an IND determination sentence but is not one. Furthermore, a COOD determination sentence is a highly confusing expression to distinguish from an IND determination sentence, and can also be considered a corpus that easily causes erroneous determinations.
To improve the recognition accuracy of the frame estimation unit 13 and the semantic analysis unit 14, it is important to train them on a necessary and sufficient amount of corpora consisting of such COOD determination sentences. The corpus generation device of the present disclosure makes it possible to generate, easily and in large quantities, corpora including COOD corpora, which are particularly effective for improving recognition accuracy, thereby reducing the load of corpus development.
<Configuration example of the corpus generation device of the present disclosure>
The corpus generation device of the present disclosure reduces the development cost of the frame estimation unit 13 and the semantic analysis unit 14 by making it possible to efficiently generate the corpora required for training configurations corresponding to the frame estimation unit 13 and the semantic analysis unit 14 in FIG. 1.
FIG. 3 shows a configuration example of an embodiment of the corpus generation device of the present disclosure.
The corpus generation device 51 receives, as input sentences, a corpus consisting of IND sentences generated manually or by some other method, generates a corpus consisting of substitution generated sentences by replacing words through language analysis and the like, and classifies the generated corpus by filtering processing into corpora of COOD determination sentences, OOD determination sentences, IND determination sentences, and CIND determination sentences. Here, a CIND (Close IND) determination sentence is a corpus that, among the corpora classified as OOD determination sentences, is similar to the IND determination sentences. That is, performing machine learning on the corpus of CIND determination sentences can also improve the recognition accuracy of the frame estimation unit 13 and the semantic analysis unit 14.
The corpus generation device 51 includes an IND sentence reception unit 101, a language analysis unit 102, a replacement location setting unit 103, a dictionary inquiry unit 104, a replacement execution unit 105, a duplicate sentence exclusion unit 106, a case frame dictionary 107, a generation condition setting data storage unit 108, a substitution generated sentence storage unit 109, and a filtering processing unit 110.
The IND sentence reception unit 101 receives input of IND sentences generated manually or by other methods and outputs them to the language analysis unit 102.
The language analysis unit 102 analyzes the morphemes, phrases, and predicate-argument structure of each IND sentence output from the IND sentence reception unit 101, and outputs the analysis results to the replacement location setting unit 103.
The replacement location setting unit 103 sets replacement conditions based on the analysis result of the predicate-argument structure of the IND sentence, and outputs them to the dictionary inquiry unit 104.
The dictionary inquiry unit 104 queries the case frame dictionary 107, searches, using the set replacement method, for words for the replacement positions set in the IND determination sentences on the basis of the replacement conditions, and outputs the search results to the replacement execution unit 105.
The replacement execution unit 105 replaces the word at the replacement position set on the basis of the replacement conditions with a word found by the set replacement method, generating a new corpus. At this time, it adjusts the sentence ending and the like of the newly generated corpus on the basis of the generation condition setting data stored in the generation condition setting data storage unit 108. The sentences generated as a result of the above processing are output to the duplicate sentence exclusion unit 106.
The duplicate sentence exclusion unit 106 determines whether the corpus output from the replacement execution unit 105 is a duplicate sentence; if it is, the sentence is regarded as a discard determination sentence and discarded. If the corpus output from the replacement execution unit 105 is not a duplicate sentence, the duplicate sentence exclusion unit 106 stores it as a substitution generated sentence in the substitution generated sentence storage unit 109.
The case frame dictionary 107 is a dictionary in which predicates are classified by word sense (case frame) and the words that attach to each predicate with a specific case are grouped together; the dictionary inquiry unit 104 uses it to search for words matching the word sense of the case frame set as the replacement location.
The generation condition setting data storage unit 108 stores generation condition setting data, which are condition data for generating the corpus adjusted by the replacement execution unit 105; the replacement execution unit 105 adjusts the replaced corpus on the basis of the generation condition setting data in the generation condition setting data storage unit 108.
The filtering processing unit 110 classifies the corpus stored in the substitution generated sentence storage unit 109 into corpora of OOD determination sentences, COOD determination sentences, IND determination sentences, and CIND determination sentences. A configuration example of the filtering processing unit 110 will be described in detail later with reference to FIG. 4.
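The overall flow through these blocks can be pictured with the following sketch; the function bodies are placeholders standing in for the processing of each unit in FIG. 3, and the names and toy data are assumptions rather than part of the disclosure:
# Minimal pipeline sketch following the block arrangement of FIG. 3.
def analyze(ind_sentence):                        # language analysis unit 102 (placeholder)
    return {"sentence": ind_sentence, "predicate": ind_sentence.split()[0]}  # toy: first word as predicate

def set_replacement(analysis):                    # replacement location setting unit 103 (placeholder)
    return {"method": "action_fixed_category_replace", "slot": "object"}

def query_dictionary(analysis, condition):        # dictionary inquiry unit 104 + case frame dictionary 107
    return ["meeting", "participant"]             # toy replacement candidates

def execute_replacement(analysis, candidates):    # replacement execution unit 105 (placeholder)
    return [analysis["sentence"].replace("alarm", w) for w in candidates]  # toy: swap the object-case word

def exclude_duplicates(sentences, seen):          # duplicate sentence exclusion unit 106
    return [s for s in sentences if s not in seen]

seen, generated = set(), []
for ind in ["set an alarm at seven"]:             # IND sentence reception unit 101
    analysis = analyze(ind)
    condition = set_replacement(analysis)
    candidates = query_dictionary(analysis, condition)
    new_sentences = exclude_duplicates(execute_replacement(analysis, candidates), seen)
    seen.update(new_sentences)
    generated.extend(new_sentences)               # substitution generated sentence storage unit 109

print(generated)  # handed to the filtering processing unit 110 for IND/OOD/COOD/CIND classification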
<Filtering processing unit>
Next, a configuration example of the filtering processing unit 110 in FIG. 3 will be described with reference to FIG. 4.
The filtering processing unit 110 includes a semantic analyzer 131, an IND determination sentence storage unit 132, a COOD corpus extraction unit 133, a COOD determination sentence storage unit 134, a confirmed IND determination sentence storage unit 135, an OOD determination sentence storage unit 136, a CIND corpus extraction unit 137, a CIND determination sentence storage unit 138, and a confirmed OOD determination sentence storage unit 139.
The semantic analyzer 131 is a semantic analyzer (corresponding to the semantic analysis unit 14 in FIG. 1) trained on an old version of the corpus (the corpus generated so far). For each corpus of substitution generated sentences stored in the substitution generated sentence storage unit 109, it determines whether the sentence is utterance content that a predetermined application program should handle (hereinafter also referred to as an IND determination sentence) or utterance content that the predetermined application program cannot handle and should reject (hereinafter also referred to as an OOD determination sentence). The semantic analyzer 131 stores the IND determination sentences in the IND determination sentence storage unit 132 and the OOD determination sentences in the OOD determination sentence storage unit 136.
The COOD (Close OOD) corpus extraction unit 133 extracts, from the IND determination sentences stored in the IND determination sentence storage unit 132, the corpora classified as COOD determination sentences and stores them in the COOD determination sentence storage unit 134, and stores the remaining IND determination sentences in the confirmed IND determination sentence storage unit 135 as confirmed IND determination sentences. The COOD corpus extraction unit 133 also discards, as discard determination sentences, corpora among the IND determination sentences that are non-sentences and are classified neither as COOD determination sentences nor as confirmed IND determination sentences. Here, a COOD determination sentence is an OOD determination sentence that lies near the boundary with the IND determination sentences. The COOD determination sentences will be described in detail later with reference to FIG. 5.
The COOD corpus extraction unit 133 includes a non-sentence determination unit 133a and a non-appearance determination unit 133b. It controls the non-sentence determination unit 133a to determine whether each sentence is a non-sentence, regards non-sentences as discard determination sentences, and discards them. The COOD corpus extraction unit 133 also controls the non-appearance determination unit 133b to determine, on the basis of the non-appearance property of the corpora not regarded as non-sentences, whether each IND determination sentence is a COOD determination sentence or a confirmed IND determination sentence; it extracts the COOD determination sentences and stores them in the COOD determination sentence storage unit 134, and stores the confirmed IND determination sentences in the confirmed IND determination sentence storage unit 135.
The non-sentence determination unit 133a calculates a perplexity value, which is an index of how much a corpus of IND determination sentences reads like a meaningful sentence, determines whether each sentence is a non-sentence by comparing the perplexity value with a predetermined threshold, and discards corpora regarded as non-sentences as discard determination sentences.
For each corpus not regarded as a non-sentence, the non-appearance determination unit 133b calculates a parameter indicating non-appearance, that is, whether the corpus contains words that rarely appear in the corpus group of the domain for which it was determined to be an IND determination sentence. By comparing this non-appearance parameter with a predetermined threshold, the non-appearance determination unit 133b extracts, as COOD determination sentences, the corpora containing words that rarely appear in the corpus group of that domain. Furthermore, the non-appearance determination unit 133b stores the extracted COOD determination sentences in the COOD determination sentence storage unit 134, regards the remaining corpora of IND determination sentences as confirmed IND determination sentences, and stores them in the confirmed IND determination sentence storage unit 135.
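A minimal sketch of this two-stage COOD extraction is shown below; the unigram perplexity model, the non-appearance measure (here simply the fraction of words unseen in the domain corpus), and the thresholds are stand-ins, since the disclosure does not fix concrete formulas:
import math
from collections import Counter

# Toy in-domain corpus for the domain the sentences were judged to belong to.
domain_corpus = ["set an alarm at seven", "set a timer for ten minutes", "wake me up at six"]
domain_counts = Counter(w for s in domain_corpus for w in s.split())
domain_total = sum(domain_counts.values())
vocab = len(domain_counts)

def perplexity(sentence):
    # Add-one smoothed unigram perplexity; stands in for the Perplexity value of the text.
    words = sentence.split()
    log_prob = sum(math.log((domain_counts[w] + 1) / (domain_total + vocab)) for w in words)
    return math.exp(-log_prob / len(words))

def non_appearance(sentence):
    # Fraction of words never seen in the domain corpus; stands in for the non-appearance parameter.
    words = sentence.split()
    return sum(1 for w in words if domain_counts[w] == 0) / len(words)

PPL_MAX, NON_APPEAR_MIN = 20.0, 0.2   # assumed thresholds

def classify_ind_sentence(sentence):
    if perplexity(sentence) > PPL_MAX:
        return "discard"              # non-sentence
    if non_appearance(sentence) >= NON_APPEAR_MIN:
        return "COOD"                 # contains words unlikely to appear in this domain
    return "confirmed IND"

for s in ["set an alarm at seven", "set a surgery at seven", "alarm alarm banana banana banana"]:
    print(s, "->", classify_ind_sentence(s))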
The CIND (Close IND) corpus extraction unit 137 extracts, from the OOD determination sentences stored in the OOD determination sentence storage unit 136, the corpora classified as CIND determination sentences, stores them in the CIND determination sentence storage unit 138, and stores the remaining OOD determination sentences in the confirmed OOD determination sentence storage unit 139 as confirmed OOD determination sentences. The CIND corpus extraction unit 137 also discards, as discard determination sentences, corpora among the OOD determination sentences that are classified neither as CIND determination sentences nor as confirmed OOD determination sentences. Here, a CIND determination sentence is an IND determination sentence that lies near the boundary with the OOD determination sentences. Which domain each CIND determination sentence belongs to is ultimately determined manually or by similar means. The CIND determination sentences will be described in detail later with reference to FIG. 5.
The CIND corpus extraction unit 137 includes a non-sentence determination unit 137a and a non-appearance determination unit 137b. It controls the non-sentence determination unit 137a to determine whether each sentence is a non-sentence, regards non-sentences as discard determination sentences, and discards them. The CIND corpus extraction unit 137 also controls the non-appearance determination unit 137b to determine whether each corpus not regarded as a non-sentence is a CIND determination sentence or a confirmed OOD determination sentence; it extracts the CIND determination sentences and stores them in the CIND determination sentence storage unit 138, and stores the confirmed OOD determination sentences in the confirmed OOD determination sentence storage unit 139.
The non-sentence determination unit 137a calculates a perplexity value, which is an index of how much a corpus of OOD determination sentences reads like a meaningful sentence, determines whether each sentence is a non-sentence by comparing the perplexity value with a predetermined threshold, and discards corpora regarded as non-sentences as discard determination sentences.
For each corpus not regarded as a non-sentence, the non-appearance determination unit 137b determines whether it contains words that rarely appear in the corpus group of the domain for which it was determined to be an OOD determination sentence, and extracts corpora containing such words as confirmed OOD determination sentences. Furthermore, the non-appearance determination unit 137b stores the extracted confirmed OOD determination sentences in the confirmed OOD determination sentence storage unit 139, extracts the remaining corpora as CIND determination sentences, and stores them in the CIND determination sentence storage unit 138.
<About COOD determination sentences and CIND determination sentences>
Here, the COOD determination sentences and the CIND determination sentences will be described.
As shown in FIG. 5, the semantic analyzer 131 in FIG. 4 classifies the corpus group of substitution generated sentences generated from the IND sentences stored in the substitution generated sentence storage unit 109 into a corpus group of IND determination sentences and a corpus group of OOD determination sentences.
Furthermore, from the corpus group of IND determination sentences, the COOD corpus extraction unit 133 discards as discard determination sentences the corpora regarded as non-sentences by a determination based on the perplexity value described later, and extracts the corpora regarded as COOD determination sentences by a determination based on the non-appearance of words. The COOD corpus extraction unit 133 then regards the corpus consisting of the remaining IND determination sentences as confirmed IND determination sentences.
That is, since the input to the COOD corpus extraction unit 133 is a corpus of IND determination sentences generated by replacing words in various ways based on IND sentences, relatively many corpora are regarded as confirmed IND determination sentences.
However, the corpora determined to be IND determination sentences by the semantic analyzer 131 are limited by recognition accuracy, and, in addition, since words have been replaced, some corpora that should be regarded as OOD determination sentences are included. Such an OOD determination sentence was originally an IND determination sentence that became an OOD determination sentence through word replacement, so it is a corpus close to the IND determination sentences and can be considered a COOD determination sentence as described with reference to FIG. 2.
Although a COOD determination sentence is an OOD determination sentence, it is a corpus similar to an IND determination sentence; in other words, it can be considered a corpus that resembles an IND determination sentence but is not one. Furthermore, a COOD determination sentence is a highly confusing expression to distinguish from an IND determination sentence, and can also be considered a corpus that easily causes erroneous determinations.
Here, two or more corpora (sentences) being "similar" to each other means, for example, that the sentences have predicates with mutually similar word senses and that the structure of the arguments relating to the predicate (the predicate-argument structure) is similar. Furthermore, if the meanings and roles of the words in the arguments of the predicate are also similar, the sentences can be said to be even more similar.
For example, the word "set" has multiple word senses; as shown in example Ex1 of FIG. 6, two of the senses of "set" are listed as "set 4" and "set 8".
In example Ex1, for "set 4", the arguments of the predicate "set" include ("system", 83), ("company", 42), ("school", 33), and ("boss", 18) as agent cases; ("meeting", 41), ("participant", 27), and ("travel time", 10) as object cases; and ("PC (personal computer)", 95), ("scheduler", 72), and ("smartphone", 33) as instrument cases. In the figure, the numbers written after each word (for the agent case, 83, 42, 33, and 18 associated with the words "system", "company", "school", and "boss") indicate how many times (the frequency with which) the word is found linked to the predicate when a predetermined number of sentences are searched; here they are listed in descending order of frequency. These numbers may instead be another index such as a weight, may be normalized by the population, or may be used in combination, for example by multiplying indices together.
Also in example Ex1, for "set 8", the related arguments similarly include ("wife", 40), ("daughter", 33), ("son", 28), and ("mother", 13) as agent cases; ("alarm", 52), ("alarm clock", 48), and ("timer", 42) as object cases; and ("alarm clock", 94), ("clock", 42), ("mobile phone", 35), and ("smartphone", 19) as instrument cases.
Furthermore, "set 1" (using a different verb of setting) is listed as a predicate with a word sense similar to "set"; in this case, its arguments include ("she", 56), ("father", 52), and ("wife", 49) as agent cases; ("timer", 67), ("sleep timer", 42), and ("alarm", 41) as object cases; and ("rice cooker", 52), ("air conditioner", 45), ("radio", 32), and ("mobile phone", 12) as instrument cases.
That is, when "set 4", "set 8", and "set 1" listed in example Ex1 are used with the word senses into which they are classified, sentences in which the agent case, object case, and instrument case are replaced within the range of each word sense are sentences with similar predicate structures, and "set 8" and "set 1" can both be considered highly likely to have similar word senses in that words of a semantic class related to timers appear in their object case.
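One way to hold the case frame entries of example Ex1 is sketched below; the dictionary layout, the English key names, and the lookup helper are assumptions for illustration, while the word-frequency pairs follow the figures quoted above:
# Case frame dictionary entries mirroring example Ex1 of FIG. 6: each predicate sense
# maps each case to (word, frequency) pairs.
case_frame_dictionary = {
    "set_4": {
        "agent":      [("system", 83), ("company", 42), ("school", 33), ("boss", 18)],
        "object":     [("meeting", 41), ("participant", 27), ("travel time", 10)],
        "instrument": [("PC", 95), ("scheduler", 72), ("smartphone", 33)],
    },
    "set_8": {
        "agent":      [("wife", 40), ("daughter", 33), ("son", 28), ("mother", 13)],
        "object":     [("alarm", 52), ("alarm clock", 48), ("timer", 42)],
        "instrument": [("alarm clock", 94), ("clock", 42), ("mobile phone", 35), ("smartphone", 19)],
    },
}

def candidate_fillers(predicate_sense, case, top_n=3):
    # Return the most frequent words observed in the given case of the given predicate sense.
    entries = case_frame_dictionary.get(predicate_sense, {}).get(case, [])
    return [word for word, _freq in sorted(entries, key=lambda wf: wf[1], reverse=True)[:top_n]]

print(candidate_fillers("set_8", "object"))   # ['alarm', 'alarm clock', 'timer']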
Also, two or more sentences being "different" from each other means, for example, sentences whose predicate-argument structures are similar but which have predicates of different notation, or noun phrases of different semantic classes.
For example, examples of sentences with similar predicate-argument structures but predicates of different notation include, as shown in the upper part of example Ex2 in FIG. 6, "Set the alarm at 6 o'clock.", "Destroy the alarm at 6 o'clock.", and "Release the alarm at 6 o'clock."
Examples of sentences with similar predicate-argument structures but noun phrases of different semantic classes include, as shown in the lower part of example Ex2 in FIG. 6, "Set the timer at 8 o'clock.", "Set the sales meeting at 8 o'clock.", and "Set the shutdown at 8 o'clock." The first, "timer", is a noun phrase related to the clock function, the second, "sales meeting", is a noun phrase related to work events, and "shutdown" is a noun phrase related to computer control; they fall into semantically different classes.
Furthermore, as examples of semantic classes of noun phrases, as shown in the bottom row of example Ex2 in FIG. 6, taking the word "Yamagata" as an example, there are a class meaning a person or surname "Yamagata", a class meaning a place name or prefecture name such as "Yamagata Prefecture", and a class meaning a route name such as "Yamagata Shinkansen".
In contrast, among the corpora regarded as OOD determination sentences, those far in the feature space from the corpora regarded as IND determination sentences can be considered unlikely to be misrecognized as IND determination sentences.
Therefore, by learning many COOD determination sentences, the frame estimation unit 13 and the semantic analysis unit 14 become able to reliably reject corpora that are similar to IND determination sentences but are in fact OOD determination sentences; as a result, recognition accuracy can be improved. For this reason, recognition accuracy can be improved by generating and learning more corpora that become COOD determination sentences.
A CIND determination sentence is a corpus corresponding to a COOD determination sentence; in the feature space of FIG. 2, it is a corpus that, among the corpora regarded as IND determination sentences, lies near the boundary with the distribution of the corpora regarded as OOD determination sentences.
Although a CIND determination sentence is an IND determination sentence, it is a corpus similar to the OOD determination sentences; in other words, it can be considered a corpus that resembles an OOD determination sentence but is not one. Furthermore, a CIND determination sentence is a highly confusing expression to distinguish from an OOD determination sentence, and can also be considered a corpus that easily causes erroneous determinations.
In contrast, among the corpora regarded as IND determination sentences, those far in the feature space from the corpora regarded as OOD determination sentences can be considered unlikely to be erroneously determined to be OOD determination sentences.
Therefore, by learning many CIND determination sentences, the frame estimation unit 13 and the semantic analysis unit 14 become able to reliably recognize corpora that are similar to OOD determination sentences but are in fact IND determination sentences; as a result, recognition accuracy can be improved. For this reason, recognition accuracy can be improved by generating and learning more corpora that become CIND determination sentences.
From the above, the recognition accuracy of the frame estimation unit 13 and the semantic analysis unit 14 can be improved by learning many COOD determination sentences and many CIND determination sentences.
<Corpus generation processing>
Next, the corpus generation processing by the corpus generation device 51 of FIG. 3 will be described with reference to the flowchart of FIG. 7.
In step S11, the IND sentence reception unit 101 selects an unprocessed IND sentence, from among the IND sentences created manually or otherwise, as the IND sentence to be processed, receives its input, and outputs it to the language analysis unit 102.
In step S12, the language analysis unit 102 analyzes the morphemes, phrases, and predicate-argument structure of the IND sentence to be processed.
In step S13, the language analysis unit 102 stores the predicate-argument structure analysis result. More specifically, the language analysis unit 102 stores the analysis result as long as no error occurs in the predicate-argument structure analysis processing. If an error occurs, the language analysis unit 102 discards, for example, the IND sentence being processed.
In step S14, the language analysis unit 102 determines whether any unprocessed IND sentences remain; if so, the processing returns to step S11. That is, the processing of steps S11 to S14 is repeated until there are no more unprocessed IND sentences. When all IND sentences have been processed and it is determined in step S14 that no unprocessed IND sentences remain, the processing proceeds to step S15.
Here, an example of the stored predicate-argument structure analysis result data will be described with reference to FIG. 8.
Analyzing the predicate-argument structure means that, when the input sentence is "Tell me a good sushi restaurant in Ginza", for example, and it is analyzed by deep case analysis, it is analyzed into a structure like that shown in example Ex11 of FIG. 8. That is, "in Ginza" is analyzed as a location case, "good" as an adnominal modifier clause, "sushi restaurant" as an object case, and "tell me" as a predicate clause.
When the same input sentence is analyzed by surface case analysis, it is analyzed into a structure like that shown in example Ex12 of FIG. 8. That is, "in Ginza" is analyzed as a de case, "good" as an adjective, "sushi restaurant" as a wo case, and "tell me" as a verb.
For predicate-argument structure analysis, deep case analysis, surface case analysis, and the like may be set so that they can be switched, and the user may be allowed to select one of them.
In this way, the analysis result determines the position of the verb phrase and the part of the noun phrase that becomes the object case (the object in English, for example).
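A stored analysis record along the lines of examples Ex11 and Ex12 might look as follows; the field names and the helper function are assumptions introduced only to show how the object case (or wo case) becomes the replacement target:
# Toy records for the deep-case and surface-case analyses of the same input sentence.
deep_case_analysis = {
    "sentence": "Tell me a good sushi restaurant in Ginza",
    "predicate": "tell",
    "arguments": {"location_case": "Ginza", "adnominal_modifier": "good", "object_case": "sushi restaurant"},
}

surface_case_analysis = {
    "sentence": "Tell me a good sushi restaurant in Ginza",
    "verb": "tell",
    "arguments": {"de_case": "Ginza", "adjective": "good", "wo_case": "sushi restaurant"},
}

def replacement_target(record, scheme="deep"):
    # The object case (deep) or wo case (surface) is the slot targeted by Category replacement.
    key = "object_case" if scheme == "deep" else "wo_case"
    return record["arguments"][key]

print(replacement_target(deep_case_analysis))                       # sushi restaurant
print(replacement_target(surface_case_analysis, scheme="surface"))  # sushi restaurant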
Note that when the IND sentence to be processed is, for example, "Any good restaurants nearby?", the IND sentence has no predicate and the predicate-argument structure analysis may fail. In such a case, the omitted predicate may be complemented on the basis of a predetermined rule so that the predicate-argument structure analysis result can be interpolated.
In step S15, the replacement location setting unit 103 takes the predicate-argument structure analysis results stored in the processing of step S13 as input, sets the replacement conditions specified in advance, and supplies the setting results to the dictionary inquiry unit 104.
<Replacement conditions>
Here, the replacement conditions will be described. The replacement conditions consist of a replacement method and replacement locations.
First, there are broadly two replacement methods, described below, and one of them is set, for example.
More specifically, the first replacement method is the Action-fixed (predicate-fixed) Category-replacement (object-case replacement) method, and the second is the Category-fixed (object-case-fixed) Action-replacement (predicate replacement) method.
The Action-fixed (predicate-fixed) Category-replacement (object-case replacement) method is a method in which, when the input sentence is, for example, "Set an alarm at 7 o'clock", the predicate (Action) "set" is fixed and the object case (Category) "alarm" is replaced, as shown in 1) in the upper part of example Ex21 of FIG. 9; in 1) of example Ex21, "alarm" is replaced with "physical property".
The Category-fixed (object-case-fixed) Action-replacement (predicate replacement) method is a method in which, when the input sentence is, for example, "Set an alarm at 7 o'clock", the object case (Category) "alarm" is fixed and the predicate (Action) "set" is replaced, as shown in 2) in the lower part of example Ex21 of FIG. 9; in 2) of example Ex21, "set" is replaced with "release".
Note that settings for sentences with two predicates, for specifying cases other than the object case (the time case, the instrument case, and so on), for phrase boundaries, and the like may be arbitrarily specified by the user, and the behavior may be switched according to the specified content.
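The two replacement methods can be pictured with the following sketch; the parsed record, the candidate word lists, and the simplified English sentence templates are illustrative assumptions:
# Sketch of the two replacement methods of example Ex21 (FIG. 9).
parsed = {"time": "at 7 o'clock", "object": "alarm", "action": "set"}

def action_fixed_category_replace(p, category_candidates):
    # Keep the predicate (Action) fixed and swap the object case (Category).
    return [f"{p['action']} the {c} {p['time']}" for c in category_candidates]

def category_fixed_action_replace(p, action_candidates):
    # Keep the object case (Category) fixed and swap the predicate (Action).
    return [f"{a} the {p['object']} {p['time']}" for a in action_candidates]

# Candidate words taken from the examples in the text.
print(action_fixed_category_replace(parsed, ["physical property", "meeting"]))
print(category_fixed_action_replace(parsed, ["release", "destroy"]))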
<Replacement locations when the replacement method is the Action-fixed Category-replacement method>
When the replacement method is the Action-fixed Category-replacement method, the replacement locations are set, for example, as shown in example Ex22 of FIG. 10.
When the IND sentence to be processed is, for example, "Tell me a recommended spot near the station this weekend", and the wo case is specified as the replacement location, the replacement location setting unit 103 divides the sentence into the segmentation units "this weekend", "the station's", "near", "at", "recommended", "of", "spot", and "tell me", as shown in the top row of example Ex22.
The replacement location setting unit 103 may also be able to change, using rules or a word dictionary, the setting that groups the segmentation units of specific words or phrases into a single phrase as needed. For example, as shown in the second row from the top of example Ex22, the segmentation is adjusted to "this weekend", "near the station", "at", "recommended", "of", "spot", and "tell me".
Furthermore, from this adjusted segmentation-unit structure, the replacement location setting unit 103 replaces, among the replacement locations, "recommended" with words from the word group that attach to the predicate in the wo case, as shown, for example, in the third row from the top of example Ex22.
Converting a word in only one place in this way is effective for creating COOD sentences, but it is also possible to add setting variations, such as adding the de-case "near the station" to the replacement targets, or replacing only the de-case "near the station" without replacing the wo-case "spot". A sketch of this replacement-location step follows below.
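The segmentation, rule-based phrase merging, and single-slot replacement described above can be sketched as follows; the merge rule, the candidate words, and the data layout are assumptions for illustration:
# Sketch of the replacement-location step of example Ex22 (FIG. 10).
segments = ["this weekend", "the station's", "near", "at", "recommended", "of", "spot", "tell me"]

MERGE_RULES = {("the station's", "near"): "near the station"}   # rule/word-dictionary driven merging

def merge_phrases(tokens):
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in MERGE_RULES:
            merged.append(MERGE_RULES[pair])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

adjusted = merge_phrases(segments)
replace_slot = adjusted.index("recommended")     # the single location chosen for replacement

def substitute(tokens, slot, word):
    out = list(tokens)
    out[slot] = word
    return " ".join(out)

for candidate in ["cheap", "forbidden"]:         # toy candidate replacement words
    print(substitute(adjusted, replace_slot, candidate))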
<Replacement locations when the replacement method is the Category-fixed Action-replacement method>
When the replacement method is the Category-fixed (object-case-fixed) Action-replacement (predicate replacement) method, the replacement locations are set, for example, as shown in example Ex23 of FIG. 11.
When the IND sentence to be processed is, for example, "Tell me a recommended spot near the station this weekend", and the predicate is specified as the replacement location, the replacement location setting unit 103 divides the sentence into the segmentation units "this weekend", "the station's", "near", "at", "recommended", "of", "spot", and "tell me", as shown in the top row of example Ex23.
Furthermore, from this adjusted segmentation-unit structure, the replacement location setting unit 103 replaces, among the replacement locations, "tell me", which the wo-case "recommended" depends on, with a predicate of similar word sense that also takes the same "recommended" in its wo case, as shown, for example, in the third row from the top of example Ex23. A setting may also be made to replace it with a predicate of dissimilar word sense that does not take the same "recommended" in its wo case.
The selection criterion for the replacing predicate can be judged not only by what words it takes in its wo case, but also by the similarity or dissimilarity of the words in other arguments, such as the de case or the ni case.
In step S16, the dictionary inquiry unit 104 reads one unprocessed IND sentence from the stored predicate-argument structure analysis result data in the language analysis unit 102 and accepts it as the IND sentence to be processed.
In step S17, the dictionary inquiry unit 104 specifies the replacement location according to the specified replacement method on the basis of the setting results, and searches the case frame dictionary 107 for noun phrases corresponding to the argument of the word at the replacement location, or for predicates corresponding to its word sense.
In step S18, the dictionary inquiry unit 104 stores the IND sentence to be processed, the setting information on the replacement, and the search results in association with each other.
In step S19, the dictionary inquiry unit 104 determines whether any unprocessed IND sentences remain among the stored predicate-argument structure analysis result data; if there are, the processing returns to step S16. That is, the processing of steps S16 to S19 is repeated until the search for replacement candidates has been completed for all IND sentences in the stored predicate-argument structure analysis result data. When the search for replacement candidates has been completed for all IND sentences and it is determined in step S19 that there are no unprocessed IND sentences, the processing proceeds to step S20.
Note that when the Japanese predicate-argument structure analysis result is as shown in FIG. 12 and the replacement method is the Action-fixed Category-replacement method, word groups that replace the predicate argument (wo case) are searched for as shown in FIG. 13. When the predicate-argument structure analysis result is as shown in FIG. 12 and the replacement method is the Category-fixed Action-replacement method, word groups that replace the predicate part are searched for as shown in FIG. 14. Such replacement generates, for example, Japanese OOD candidate sentences as shown in FIG. 15.
Also, when the English predicate-argument structure analysis result corresponding to the predicate-argument structure analysis result of FIG. 12 is as shown in FIG. 16 and the replacement method is the Action-fixed Category-replacement method, word groups that replace the predicate argument are searched for as shown in FIG. 17. When the predicate-argument structure analysis result is as shown in FIG. 16 and the replacement method is the Category-fixed Action-replacement method, word groups that replace the predicate part are searched for as shown in FIG. 18. Such replacement generates, for example, English OOD candidate sentences as shown in FIG. 19.
Hereinafter, an example of a Japanese predicate-argument structure analysis result, the search results when the replacement method is the Action-fixed Category-replacement method, the search results when it is the Category-fixed Action-replacement method, and examples of OOD candidate sentences will be described with reference to FIGS. 12 to 15. Then, an example of an English predicate-argument structure analysis result, the search results when the replacement method is the Action-fixed Category-replacement method, the search results when it is the Category-fixed Action-replacement method, and examples of OOD candidate sentences will be described with reference to FIGS. 16 to 19.
<Example of a Japanese predicate-argument structure analysis result>
FIG. 12 shows an example of a predicate-argument structure analysis result; from the left, it shows the sentence ID, the sentence, the predicate, the predicate ending, the predicate arguments, and the original domain. The predicate arguments show, from the left, the location case or de case, the adnominal modifier clause or no case, ..., and the object case or wo case.
More specifically, for the sentence with sentence ID 1, "Set an alarm at 7 o'clock", the predicate is "set", the predicate ending is "shite", the object case or wo case is "alarm", and the original domain is shown to be ALARM-SETUP.
For the sentence with sentence ID 1001, "Tell me a good sushi restaurant in Ginza", the predicate is "tell", the predicate ending is "te", the location case or de case is "Ginza", the object case or wo case is "sushi restaurant", and the original domain is shown to be RESTAURANT-SEARCH.
Furthermore, for the sentence with sentence ID 1002, "Tell me an Italian restaurant", the predicate is "tell", the predicate ending is "te", the adnominal modifier clause or no case is "Italian", the object case or wo case is "restaurant", and the original domain is shown to be RESTAURANT-SEARCH.
Also, for the sentence with sentence ID 1003, "Please find a restaurant where I can eat an Italian course", the predicate is "find", the predicate ending is "please", the adnominal modifier clauses or no cases are "Italian" and "course", the object case or wo case is "restaurant", and the original domain is shown to be RESTAURANT-SEARCH.
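The rows of FIG. 12 can be held as simple records, for example as below; the field names are assumptions, while the values follow the figure as quoted above:
from collections import defaultdict

# Rows of FIG. 12: sentence ID, sentence, predicate, predicate ending, arguments, original domain.
analysis_rows = [
    {"id": 1, "sentence": "Set an alarm at 7 o'clock", "predicate": "set", "ending": "shite",
     "arguments": {"object_or_wo": "alarm"}, "domain": "ALARM-SETUP"},
    {"id": 1001, "sentence": "Tell me a good sushi restaurant in Ginza", "predicate": "tell", "ending": "te",
     "arguments": {"location_or_de": "Ginza", "object_or_wo": "sushi restaurant"}, "domain": "RESTAURANT-SEARCH"},
    {"id": 1002, "sentence": "Tell me an Italian restaurant", "predicate": "tell", "ending": "te",
     "arguments": {"adnominal_or_no": "Italian", "object_or_wo": "restaurant"}, "domain": "RESTAURANT-SEARCH"},
]

# Group sentence IDs by original domain, e.g. to check how many examples each domain has.
by_domain = defaultdict(list)
for row in analysis_rows:
    by_domain[row["domain"]].append(row["id"])
print(dict(by_domain))   # {'ALARM-SETUP': [1], 'RESTAURANT-SEARCH': [1001, 1002]}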
<Example of word groups for replacing the object case when the Japanese substitution method is the Action-fixed Category substitution method>
FIG. 13 shows an example of the word groups used to replace the object (ヲ) case, among the predicate arguments to be replaced by the Action-fixed Category substitution method, for the sentences of the predicate-argument structure analysis results shown in FIG. 12.
In FIG. 13, the sentence ID, the sentence, the predicate, the predicate ending, the replacement words for the arguments, and the original domain are shown from the left. The replacement words for the arguments show, from the left, the デ case (locative or デ case), the ノ case (adnominal modifier clause or ノ case), ..., and the ヲ case (object or ヲ case). FIG. 13 shows an example of the word groups used when replacing the ヲ case (object case). Items in FIG. 13 that correspond to those in FIG. 12 carry the same descriptions, so their explanation is omitted as appropriate.

That is, for sentence ID 1, examples of replacement words for the object-case word "アラーム" (alarm) are "会議" (meeting), "参加者" (participant), and "移動時間" (travel time).

For sentence ID 1001, examples of replacement words for the object-case word "寿司屋" (sushi restaurant) are "ニュース" (news), "仕組み" (mechanism), "人生" (life), "流れ" (flow), and "芸" (trick).

For sentence ID 1002, examples of replacement words for the object-case word "レストラン" (restaurant) are "ニュース", "仕組み", "人生", "流れ", and "芸".

For sentence ID 1003, examples of replacement words for the object-case word "レストラン" are "外科" (surgery), "オフィス" (office), "親父" (dad), "自動車" (car), "講座" (course/lecture), and "一戸建て" (detached house).
<Example of word groups for replacing the predicate when the Japanese substitution method is the Category-fixed Action substitution method>
FIG. 14 shows an example of the word groups used when the Category-fixed Action substitution method replaces the predicate for the sentences shown in FIG. 12.
In FIG. 14, the sentence ID, the sentence, the predicate, the predicate ending, the replacement words for the predicate, and the original domain are shown from the left. Items in FIG. 14 that correspond to those in FIG. 12 carry the same descriptions, so their explanation is omitted as appropriate.

That is, for sentence ID 1, examples of replacement words for the predicate "設定" (set) are "改善する" (improve), "装備する" (equip), and "装着する" (attach).

For sentence IDs 1001 and 1002, examples of replacement words for the predicate "教え" (tell) are "抜け出す" (slip out of), "食べ歩く" (eat one's way around), "開業する" (open a business), "買い取る" (buy up), "手伝う" (help), "切り盛りする" (manage), and "格付ける" (rate).

For sentence ID 1003, examples of replacement words for the predicate "さが" (search) are "営む" (run), "開く" (open), "手伝う" (help), "利用する" (use), "特集する" (feature), "下見する" (scout out), and "探し当てる" (locate).
<Examples of Japanese OOD candidate sentences generated by substitution>
(Example of OOD candidate sentences of the Action-fixed Category substitution method)
First, with reference to the left part of FIG. 15, examples of OOD candidate sentences of the Action-fixed Category substitution method, among the Japanese OOD candidate sentences generated by substitution through the processing described above, will be described.
That is, as shown in the left part of FIG. 15, examples of OOD candidate sentences for the sentence "7時にアラームを設定して" (set an alarm at 7 o'clock) with sentence ID 1 in FIG. 12 are "7時に会議を設定して" (set a meeting at 7 o'clock) and "7時に参加者を設定して" (set a participant at 7 o'clock). That is, "アラーム" has been replaced with "会議" and "参加者", respectively.

Examples of OOD candidate sentences for the sentence "銀座でおいしい寿司屋を教えて" with sentence ID 1001 in FIG. 12 are "銀座でおいしいニュースを教えて" (tell me delicious news in Ginza) and "銀座でおいしい仕組みを教えて" (tell me a delicious mechanism in Ginza). That is, "寿司屋" has been replaced with "ニュース" and "仕組み", respectively.

Examples of OOD candidate sentences for the sentence "イタリアンのレストランを教えて" with sentence ID 1002 in FIG. 12 are "イタリアンのニュースを教えて" (tell me Italian news) and "イタリアンの仕組みを教えて" (tell me about the Italian mechanism). That is, "レストラン" has been replaced with "ニュース" and "仕組み", respectively.
Examples of OOD candidate sentences for the sentence "イタリアンのコースの食べれるレストランをさがしてください" with sentence ID 1003 in FIG. 12 are "イタリアンのコースを食べれる外科をさがして" (find a surgery where I can eat an Italian course) and "イタリアンのコースを食べれるオフィスをさがして" (find an office where I can eat an Italian course). That is, "レストラン" has been replaced with "外科" and "オフィス", respectively.
(Example of OOD candidate sentences of the Category-fixed Action substitution method)
Next, with reference to the right part of FIG. 15, examples of OOD candidate sentences of the Category-fixed Action substitution method, among the Japanese OOD candidate sentences generated by substitution through the processing described above, will be described.
That is, as shown in the right part of FIG. 15, examples of OOD candidate sentences for the sentence "7時にアラームを設定して" with sentence ID 1 in FIG. 12 are "7時にアラームを改善して" (improve the alarm at 7 o'clock) and "7時にアラームを装備して" (equip the alarm at 7 o'clock). That is, "設定" has been replaced with "改善" and "装備", respectively.

Examples of OOD candidate sentences for the sentence "銀座でおいしい寿司屋を教えて" with sentence ID 1001 in FIG. 12 are "銀座でおいしい寿司屋を抜け出して" (slip out of a good sushi restaurant in Ginza) and "銀座でおいしい寿司屋を食べ歩いて" (eat your way around good sushi restaurants in Ginza). That is, "教えて" has been replaced with "抜け出して" and "食べ歩いて", respectively.

Examples of OOD candidate sentences for the sentence "イタリアンのレストランを教えて" with sentence ID 1002 in FIG. 12 are "イタリアンのレストランを抜け出して" and "イタリアンのレストランを食べ歩いて". That is, "教えて" has been replaced with "抜け出して" and "食べ歩いて", respectively.

Examples of OOD candidate sentences for the sentence "イタリアンのコースの食べれるレストランをさがしてください" with sentence ID 1003 in FIG. 12 are "イタリアンのコースの食べれるレストランを営んで" (run a restaurant where an Italian course can be eaten) and "イタリアンのコースの食べれるレストランを開いて" (open a restaurant where an Italian course can be eaten). That is, "さがして" has been replaced with "営んで" and "開いて", respectively.
<Example of English predicate-argument structure analysis results>
FIG. 16 shows an example of English predicate-argument structure analysis results; from the left, the sentence ID, the sentence (Sentence), the predicate (verb: Action), the predicate arguments (Argument), and the original domain (Original Domain) are shown. The predicate arguments show, from the left, the argument attached to the predicate by "in" (prep_in), ..., and the direct object (dobj).
The English predicate-argument structure analysis here uses the analysis results of the Stanford Parser as an example (for details, see Marie-Catherine de Marneffe and Christopher D. Manning, "Stanford typed dependencies manual", 2008, revised for the Stanford Parser v.3.3 in December 2013). dobj indicates that the semantic role (case) of the argument attached to the predicate is the direct object.
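As a rough illustration of how such typed-dependency output maps onto the columns of FIG. 16, the following minimal Python sketch reads hand-written (relation, governor, dependent) triples in the style of the Stanford collapsed dependencies cited above. The triples and the helper function are assumptions for illustration, not actual parser output, and only the head word of the object (rather than the full phrase shown in the figure) is kept.

```python
# Minimal sketch: map typed-dependency triples onto the verb / dobj / prep_in
# columns of FIG. 16. The triples below are hand-written assumptions.
def extract_predicate_arguments(dependencies):
    """dependencies: list of (relation, governor, dependent) triples."""
    record = {"verb": None, "dobj": None, "prep_in": None}
    for rel, gov, dep in dependencies:
        if rel == "root":        # the main predicate (verb: Action)
            record["verb"] = dep
        elif rel == "dobj":      # direct object of the predicate
            record["dobj"] = dep
        elif rel == "prep_in":   # argument attached to the predicate by "in"
            record["prep_in"] = dep
    return record

# "find Chinese food in Austin" (sentence ID 2), hand-written dependencies
deps = [("root", "ROOT", "find"),
        ("dobj", "find", "food"),
        ("amod", "food", "Chinese"),
        ("prep_in", "find", "Austin")]
print(extract_predicate_arguments(deps))
# -> {'verb': 'find', 'dobj': 'food', 'prep_in': 'Austin'}
```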
More specifically, for the sentence with sentence ID 1, "find a Chinese buffet nearby", the predicate is "find", the direct object (dobj) is "Chinese buffet", and the original domain is AREA_INFO-SEARCH_EVENT.

For the sentence with sentence ID 2, "find Chinese food in Austin", the predicate is "find", the argument attached to the predicate by "in" (prep_in) is "Austin", the direct object (dobj) is "Chinese food", and the original domain is AREA_INFO-SEARCH_EVENT.

For the sentence with sentence ID 1537, "turn on some tunes please", the predicate is "turn on", the direct object (dobj) is "tunes", and the original domain is MUSIC_PLAY.

For the sentence with sentence ID 1538, "I'd like to hear some Beatles", the predicate is "hear", the direct object (dobj) is "Beatles", and the original domain is MUSIC_PLAY.
<Example of word groups for replacing the direct object when the English substitution method is the Action-fixed Category substitution method>
FIG. 17 shows an example of the word groups used to replace the direct object (dobj), among the predicate arguments to be replaced by the Action-fixed Category substitution method, for the sentences of the English predicate-argument structure analysis results shown in FIG. 16.
In FIG. 17, the sentence ID, the sentence (Sentence), the predicate (verb), the replacement words for the arguments (Argument), and the original domain (Original Domain) are shown from the left. The replacement words for the arguments show, from the left, the argument attached to the predicate by "in" (prep_in), ..., and the direct object (dobj). FIG. 17 shows an example of the word groups used when replacing the direct object (dobj). Items in FIG. 17 that correspond to those in FIG. 16 carry the same descriptions, so their explanation is omitted as appropriate.

That is, for the sentence "find a Chinese buffet nearby" with sentence ID 1, examples of replacement words for the direct object "Chinese buffet" are "victim", "bomb", "cache", and "remains".

For the sentence "find Chinese food in Austin" with sentence ID 2, examples of replacement words for the direct object "Chinese food" are "victim", "bomb", "cache", and "remains".

For the sentence "turn on some tunes please" with sentence ID 1537, examples of replacement words for the direct object "tunes" are "light", "power", and "you".

For the sentence "I'd like to hear some Beatles" with sentence ID 1538, examples of replacement words for the direct object "Beatles" are "team-mate", "boss", and "neighbor".
<Example of word groups for replacing the predicate when the English substitution method is the Category-fixed Action substitution method>
FIG. 18 shows an example of the word groups used when the Category-fixed Action substitution method replaces the predicate for the sentences shown in FIG. 16.
In FIG. 18, the sentence ID, the sentence, the predicate, the predicate ending, the replacement words for the predicate, and the original domain are shown from the left. Items in FIG. 18 that correspond to those in FIG. 16 carry the same descriptions, so their explanation is omitted as appropriate.

That is, for the sentence "find a Chinese buffet nearby" with sentence ID 1, examples of replacement words for the predicate (verb: Action) "find" are "include", "open", "run", and "operate".

For the sentence "find Chinese food in Austin" with sentence ID 2, examples of replacement words for the predicate "find" are "include", "open", "run", and "operate".

For the sentence "turn on some tunes please" with sentence ID 1537, examples of replacement words for the predicate "turn on" are "download", "record", and "compose".

For the sentence "I'd like to hear some Beatles" with sentence ID 1538, examples of replacement words for the predicate "hear" are "work with", "copy", and "remove".
<Examples of English OOD candidate sentences generated by substitution>
(Example of OOD candidate sentences of the Action-fixed Category substitution method)
First, with reference to the left part of FIG. 19, examples of OOD candidate sentences of the Action-fixed Category substitution method, among the English OOD candidate sentences generated by substitution through the processing described above, will be described.
That is, as shown in the left part of FIG. 19, examples of OOD candidate sentences for the sentence "find a Chinese buffet nearby" with sentence ID 1 in FIG. 16 are "find a bomb nearby" and "find a victim nearby". That is, "Chinese buffet" has been replaced with "bomb" and "victim", respectively.

Examples of OOD candidate sentences for the sentence "find Chinese food in Austin" with sentence ID 2 in FIG. 16 are "find cache in Austin" and "find remains in Austin". That is, "Chinese food" has been replaced with "cache" and "remains", respectively.

Examples of OOD candidate sentences for the sentence "turn on some tunes please" with sentence ID 1537 in FIG. 16 are "turn on light please" and "turn on power please". That is, "some tunes" has been replaced with "light" and "power", respectively.

Examples of OOD candidate sentences for the sentence "I'd like to hear some Beatles" with sentence ID 1538 in FIG. 16 are "I'd like to hear team-mate" and "I'd like to hear neighbor". That is, "Beatles" has been replaced with "team-mate" and "neighbor", respectively.
(Example of OOD candidate sentences of the Category-fixed Action substitution method)
Next, with reference to the right part of FIG. 19, examples of OOD candidate sentences of the Category-fixed Action substitution method, among the English OOD candidate sentences generated by substitution through the processing described above, will be described.
That is, as shown in the right part of FIG. 19, examples of OOD candidate sentences for the sentence "find a Chinese buffet nearby" with sentence ID 1 in FIG. 16 are "Open Chinese buffet nearby" and "Operate Chinese buffet nearby". That is, "find" has been replaced with "Open" and "Operate", respectively.
Examples of OOD candidate sentences for the sentence "find Chinese food in Austin" with sentence ID 2 in FIG. 16 are "Open Chinese food in Austin" and "Operate Chinese food in Austin". That is, "find" has been replaced with "Open" and "Operate", respectively.

Examples of OOD candidate sentences for the sentence "turn on some tunes please" with sentence ID 1537 in FIG. 16 are "Record some tunes please" and "Compose some tunes please". That is, "turn on" has been replaced with "Record" and "Compose", respectively.

Examples of OOD candidate sentences for the sentence "I'd like to hear some Beatles" with sentence ID 1538 in FIG. 16 are "I'd like to copy some Beatles" and "I'd like to remove some Beatles". That is, "hear" has been replaced with "copy" and "remove", respectively.
<Method of retrieving words likely to yield COOD sentences from the case frame dictionary>
Next, a method of retrieving words that are likely to yield COOD sentences from the case frame dictionary 107 will be described.
FIG. 20 shows a simplified image of the case frame dictionary.
For example, in the case of deep case analysis, for the predicate "設定する" (to set), two case frames are listed for "設定する" as shown in example Ex31 of FIG. 20: <設定する4> and <設定する8>. The trailing number identifies different senses of the same verb "設定する".
In example Ex31 of FIG. 20, <設定する4> shows which words occur as arguments with each role. Each item in parentheses is a word and a number. The number represents how many times (the frequency with which) that word co-occurred with the predicate. For example, ("会議" (meeting), 41) means that, in the large amount of corpus data from which the case frame dictionary was built, the word "会議" occurred 41 times as the object case of the sense <設定する4>. This value may be another index such as a weight, may be normalized by the population before use, or may be used in combination, for example by multiplying indices together.
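To make the structure just described concrete, the following is a minimal sketch of part of example Ex31 as an in-memory Python dictionary. The nesting (sense → role → word → frequency) and the English role names are assumptions for illustration; only a few entries from the figure are reproduced.

```python
# Hypothetical in-memory view of part of FIG. 20 (Ex31, deep cases):
# each predicate sense maps each semantic role to {word: frequency}, where the
# frequency is how often that word filled the role for that sense in the
# source corpus.
case_frame_dict = {
    "設定する4": {
        "agent":      {"システム": 83, "会社": 42, "学校": 33, "上司": 18},
        "object":     {"会議": 41, "参加者": 27, "移動時間": 10},
        "instrument": {"PC": 95, "スケジューラ": 72, "スマホ": 33},
    },
    "設定する8": {
        "agent":      {"妻": 40, "娘": 33, "息子": 28, "母": 13},
        "object":     {"アラーム": 52, "目覚まし": 48, "タイマ": 42},
        "instrument": {"目覚まし": 94, "時計": 48, "携帯": 35, "スマホ": 19},
    },
}

# e.g. how often 会議 filled the object slot of <設定する4>:
print(case_frame_dict["設定する4"]["object"]["会議"])  # 41
```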
In example Ex31, for <設定する4>, the agent case lists ("システム" (system), 83), ("会社" (company), 42), ("学校" (school), 33), and ("上司" (boss), 18); the object case lists ("会議" (meeting), 41), ("参加者" (participant), 27), and ("移動時間" (travel time), 10); and the instrument case lists ("PC (personal computer)", 95), ("スケジューラ" (scheduler), 72), and ("スマホ" (smartphone), 33).

Also in example Ex31, for <設定する8>, the agent case lists ("妻" (wife), 40), ("娘" (daughter), 33), ("息子" (son), 28), and ("母" (mother), 13); the object case lists ("アラーム" (alarm), 52), ("目覚まし" (alarm clock), 48), and ("タイマ" (timer), 42); and the instrument case lists ("目覚まし", 94), ("時計" (clock), 48), ("携帯" (mobile phone), 35), and ("スマホ", 19).

Furthermore, in example Ex31, <セットする1> ("to set", sense 1) is listed as a case frame whose surface form differs from "設定する" but whose sense is similar. In this case, the agent case lists ("彼女" (she), 56), ("父" (father), 52), and ("妻", 49); the object case lists ("タイマ", 67), ("スリープタイマ" (sleep timer), 42), and ("アラーム", 41); and the instrument case lists ("炊飯器" (rice cooker), 52), ("エアコン" (air conditioner), 45), ("ラジオ" (radio), 32), and ("携帯", 12).

Also in example Ex31, <改善する15> ("to improve", sense 15) is listed as a case frame whose surface form differs from "設定する" and whose sense is not similar. In this case, the agent case lists ("手法" (method), 102), ("品質" (quality), 73), and ("工程" (process), 67); the object case lists ("動作" (operation), 81), ("性能" (performance), 75), and ("アラーム", 2); and the instrument case lists ("交換" (replacement), 58), ("工夫" (ingenuity), 49), and ("方法" (method), 41).
Using the original IND sentence "7時にアラームを設定して" (set an alarm at 7 o'clock) as an example, the processing of sentence generation by substitution using deep case analysis will now be described.
In the Action (predicate)-fixed Category (object case) substitution mode, with the predicate "設定する" fixed, <設定する4>, which does not have "アラーム" in its object case, is selected. <設定する4> does not contain the word "アラーム" in its object case; instead it contains words of different semantic classes such as ("会議", 41), ("参加者", 27), and ("移動時間", 10). By substituting with these words and generating a new corpus, COOD sentence candidates can be created in which the sense of "設定する" is subtly different.
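A minimal sketch of this Action-fixed Category substitution is shown below; the tiny frame table (object slots only), the function name, and the plain string replacement are assumptions for illustration.

```python
# Action-fixed Category substitution: keep the predicate 設定する fixed and take
# replacement words from senses whose object slot does NOT contain アラーム.
frames = {
    "設定する4": {"会議": 41, "参加者": 27, "移動時間": 10},
    "設定する8": {"アラーム": 52, "目覚まし": 48, "タイマ": 42},
}

def replacement_words(frames, fixed_word):
    words = []
    for sense, object_slot in frames.items():
        if fixed_word not in object_slot:   # <設定する4> has no アラーム
            words.extend(object_slot)       # -> 会議, 参加者, 移動時間
    return words

original = "7時にアラームを設定して"
for w in replacement_words(frames, "アラーム"):
    print(original.replace("アラーム", w))   # e.g. 7時に会議を設定して
```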
In the Category (object case)-fixed Action (predicate) substitution mode, with the object-case word "アラーム" fixed, case frames with different surface forms that have the same "アラーム" in their object case, <セットする1> and <改善する15>, are selected. Both contain the same word "アラーム" in the object case, but whereas in <セットする1> the frequency of "アラーム" is 41, in <改善する15> it is as low as 2. Such a predicate is likely to differ subtly in sense; its frame contains word groups of semantic classes that are not very related to the timer function. In this way, a predicate is selected that satisfies the condition that it has the fixed word in the same argument slot and that the value n, which represents the frequency of that word or the strength of the relation, is smaller than a certain threshold α.
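The predicate selection of this Category-fixed Action substitution can be sketched as follows; the frame table, the threshold value α = 10, and the exclusion of the original predicate's own senses are assumptions for illustration.

```python
# Category-fixed Action substitution: keep アラーム fixed and select predicate
# senses that contain it in the object slot only weakly (frequency n < α).
frames = {
    "設定する8":   {"アラーム": 52, "目覚まし": 48, "タイマ": 42},
    "セットする1": {"タイマ": 67, "スリープタイマ": 42, "アラーム": 41},
    "改善する15":  {"動作": 81, "性能": 75, "アラーム": 2},
}

def candidate_predicates(frames, fixed_word, original_predicate, alpha=10):
    return [sense for sense, object_slot in frames.items()
            if not sense.startswith(original_predicate)   # skip 設定する itself
            and fixed_word in object_slot
            and object_slot[fixed_word] < alpha]          # weakly related sense

print(candidate_predicates(frames, "アラーム", "設定する"))  # -> ['改善する15']
```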
By substituting the original predicate with "改善する" and generating a new corpus, the COOD sentence candidate "7時にアラームを改善して" (improve the alarm at 7 o'clock), whose sense is subtly different, can be created.
In the case of surface case analysis, for the predicate "設定する", two forms, "設定する4" and "設定する8", are given as examples, as shown in example Ex32 of FIG. 20.

In example Ex32, for "設定する4", the ガ (ga) case lists ("システム", 83), ("会社", 42), ("学校", 33), and ("上司", 18); the ヲ (wo) case lists ("会議", 41), ("参加者", 27), and ("移動時間", 10); and the デ (de) case lists ("PC (personal computer)", 95), ("スケジューラ", 72), and ("スマホ", 33).

Also in example Ex32, for "設定する8", the ガ case lists ("妻", 40), ("娘", 33), ("息子", 28), and ("母", 13); the ヲ case lists ("アラーム", 52), ("目覚まし", 48), and ("タイマ", 42); and the デ case lists ("目覚まし", 94), ("時計", 42), ("携帯", 35), and ("スマホ", 19).

Furthermore, in example Ex32, "セットする1" is listed as having a sense similar to "設定する". In this case, the ガ case lists ("彼女", 56), ("父", 52), and ("妻", 49); the ヲ case lists ("タイマ", 67), ("スリープタイマ", 42), and ("アラーム", 41); and the デ case lists ("炊飯器", 52), ("エアコン", 45), ("ラジオ", 32), and ("携帯", 12).

Also in example Ex32, <改善する15> is listed as a case frame whose surface form differs from "設定する" and whose sense is not similar. In this case, the ガ case lists ("手法", 102), ("品質", 73), and ("工程", 67); the ヲ case lists ("動作", 81), ("性能", 75), and ("アラーム", 2); and the デ case lists ("交換", 58), ("工夫", 49), and ("方法", 41).
Likewise using the original IND sentence "7時にアラームを設定して" as an example, the processing of sentence generation by substitution using surface case analysis will now be described.

In the Action (predicate)-fixed Category (ヲ case) substitution mode, with the predicate "設定する" fixed, <設定する4>, which does not have "アラーム" in its ヲ case, is selected. <設定する4> does not contain the word "アラーム" in its ヲ case; instead it contains words of different semantic classes such as ("会議", 41), ("参加者", 27), and ("移動時間", 10). By substituting with these words and generating a new corpus, COOD sentence candidates in which the sense of "設定する" is subtly different, such as "7時に参加者を設定して" (set a participant at 7 o'clock), can be created.
In the Category (ヲ case)-fixed Action (predicate) substitution mode, with the object-case word "アラーム" fixed, case frames with different surface forms that have the same "アラーム" in their ヲ case, <セットする1> and <改善する15>, are selected. Both contain the same word "アラーム" in the object case, but whereas in <セットする1> the frequency of "アラーム" is 41, in <改善する15> it is as low as 2. Such a predicate is likely to differ subtly in sense; its frame contains many word groups of semantic classes that are not very related to the timer function. In this way, a predicate is selected that satisfies the condition that it has the fixed word in the same argument slot and that the value n, which represents the frequency of that word or the strength of the relation, is smaller than a certain threshold α.

By substituting the original predicate with "改善する" and generating a new corpus, the COOD sentence candidate "7時にアラームを改善して", whose sense is subtly different, can be created.
Note that an existing dictionary may be used as the case frame dictionary 107. Also, since a general-purpose existing case frame dictionary may contain few of the words used for the service purpose that constitutes the domain, it may be made possible to add a user-defined case frame dictionary compiled by collecting the words needed for the service.
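One possible way to realize such an addition is to overlay a user-defined dictionary on the existing one, as in the sketch below; the dictionaries, the merge rule (user entries extend or override the base counts), and all entries are assumptions for illustration.

```python
# Merge a user-defined case frame dictionary (service-specific words) into an
# existing one; user entries extend or override the base entries.
def merge_case_frames(base, user):
    merged = {sense: {role: dict(words) for role, words in roles.items()}
              for sense, roles in base.items()}
    for sense, roles in user.items():
        for role, words in roles.items():
            merged.setdefault(sense, {}).setdefault(role, {}).update(words)
    return merged

base = {"設定する8": {"object": {"アラーム": 52, "タイマ": 42}}}
user = {"設定する8": {"object": {"スリープモード": 7}},     # service-specific word
        "再生する2": {"object": {"プレイリスト": 30}}}       # service-specific sense
print(merge_case_frames(base, user)["設定する8"]["object"])
# -> {'アラーム': 52, 'タイマ': 42, 'スリープモード': 7}
```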
For an existing case frame dictionary 107, see, for example, Daisuke Kawahara and Sadao Kurohashi, "A Fully-Lexicalized Probabilistic Model for Japanese Syntactic and Case Structure Analysis", In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL 2006), pp. 176-183, 2006, or "Case Frame Compilation from the Web using High-Performance Computing", In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), 2006. This dictionary is annotated with frequency information for the words in the predicates' argument slots, which can be used as the strength of the relation between a predicate and its argument words described above.
Here, the description returns to the flowchart of FIG. 7.
In step S20, the substitution execution unit 105 reads out an unprocessed IND sentence from the stored search results as the IND sentence to be processed, and also reads out and accepts the substitution setting information and the search results stored in association with it.

In step S21, the substitution execution unit 105 generates a corpus by replacing the word to be replaced in the IND determination sentence to be processed, based on that IND determination sentence, the substitution setting information, and an unprocessed search result among the search results, and adjusts the conjugation and the sentence ending.

In step S22, the substitution execution unit 105 stores the generated corpus as a primary substitution-generated sentence.

In step S23, the substitution execution unit 105 determines whether any unprocessed IND sentence remains among the stored search results; if one does, the process returns to step S20. That is, steps S20 to S23 are repeated until a corpus has been generated by substitution, based on the search results, for every IND sentence. When it is determined in step S23 that no unprocessed IND sentence remains, the process proceeds to step S24.
In step S24, the duplicate sentence elimination unit 106 reads an unprocessed corpus from among the corpora saved in step S22 and accepts it as the corpus to be processed.

In step S25, the duplicate sentence elimination unit 106 determines whether the corpus to be processed duplicates any sentence generated so far and saved by the processing of step S22 (a duplicate sentence). More specifically, the duplicate sentence elimination unit 106 searches the group of corpora stored as newly generated corpora for the corpus set as the processing target and judges whether it is a duplicate according to whether a match exists. If it is determined in step S25 to be a duplicate, the process proceeds to step S26.

In step S26, the duplicate sentence elimination unit 106 regards the generated corpus as a duplicate, that is, as a discard determination sentence, and discards it.

If, on the other hand, it is determined in step S25 that the generated corpus is not a duplicate, the process proceeds to step S27.

In step S27, the duplicate sentence elimination unit 106 stores the substitution-generated corpus to be processed in the substitution-generated sentence storage unit 109.
In step S28, the substitution execution unit 105 determines whether there are unprocessed search results; if there are, the process returns to step S24.

When it is determined in step S28 that there are no unprocessed search results, the process proceeds to step S29.

In step S29, the substitution execution unit 105 stores the corpora that remain saved at this point without having been eliminated as duplicates in the substitution-generated sentence storage unit 109 as the final substitution-generated sentences.
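The flow of steps S20 through S29 (generation followed by duplicate elimination) can be pictured with the sketch below; representing the search results as a mapping from each IND sentence to its target word and replacement words, and using plain string replacement while ignoring the conjugation and ending adjustment of step S21, are assumptions for illustration.

```python
# Generate substitution sentences for every IND sentence (S20-S23), then drop
# duplicates (S24-S27) and keep the rest as the final generated sentences (S29).
def generate_substitutions(search_results):
    """search_results: {ind_sentence: (target_word, [replacement_words])}."""
    primary = []                                   # primary generated sentences (S22)
    for ind_sentence, (target, words) in search_results.items():
        for w in words:
            primary.append(ind_sentence.replace(target, w))
    final, seen = [], set()
    for sentence in primary:
        if sentence in seen:                       # duplicate -> discard (S25/S26)
            continue
        seen.add(sentence)
        final.append(sentence)                     # keep (S27)
    return final                                   # final substitution-generated sentences

results = {"7時にアラームを設定して": ("アラーム", ["会議", "参加者", "会議"])}
print(generate_substitutions(results))
# -> ['7時に会議を設定して', '7時に参加者を設定して']  (the duplicate is dropped)
```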
In step S30, the filtering processing unit 110 executes the filtering process to classify the corpus of newly generated substitution-generated sentences stored in the substitution-generated sentence storage unit 109 into corpora of OOD determination sentences, COOD determination sentences, IND determination sentences, and CIND determination sentences. The filtering process is described in detail later with reference to the flowchart of FIG. 21.

Through the above processing, new corpora can be generated as substitution-generated sentences by word substitution based on IND sentences.

The above processing also produces Japanese OOD candidate sentences such as those shown in FIG. 15 and English OOD candidate sentences such as those shown in FIG. 19.
<Filtering process>
Next, the filtering process performed by the filtering processing unit 110 will be described with reference to the flowchart of FIG. 21.
In step S31, the semantic analyzer 131 accepts, as the corpus to be processed, a corpus of unprocessed substitution-generated sentences from among the corpora of substitution-generated sentences stored in the substitution-generated sentence storage unit 109.

In step S32, the semantic analyzer 131 determines whether the corpus of substitution-generated sentences to be processed is an IND determination sentence. If it is determined in step S32 to be an IND determination sentence, the process proceeds to step S33.

In step S33, the semantic analyzer 131 stores the corpus of substitution-generated sentences to be processed in the IND determination sentence storage unit 132.

If, on the other hand, it is determined in step S32 not to be an IND determination sentence, that is, the corpus of substitution-generated sentences to be processed is regarded as an OOD determination sentence, the process proceeds to step S34.

In step S34, the semantic analyzer 131 regards the substitution-generated sentence to be processed as an OOD determination sentence and stores it in the OOD determination sentence storage unit 136.

In step S35, the semantic analyzer 131 determines whether any unprocessed substitution-generated sentence remains in the substitution-generated sentence storage unit 109; if so, the process returns to step S31 and the subsequent processing is repeated. That is, until no unprocessed input sentence remains, every substitution-generated sentence is judged as to whether it is an IND determination sentence, the IND determination sentences are stored in the IND determination sentence storage unit 132, and the remaining OOD determination sentences are stored in the OOD determination sentence storage unit 136.
When it is determined in step S35 that no unprocessed substitution-generated sentence remains, the process proceeds to step S36. That is, by the processing so far, the group of substitution-generated sentences stored in the substitution-generated sentence storage unit 109 has been classified by the semantic analyzer 131, which was generated by learning with the old version of the corpus, into IND determination sentences and OOD determination sentences, which are stored in the IND determination sentence storage unit 132 and the OOD determination sentence storage unit 136, respectively.
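The classification loop of steps S31 through S35 can be sketched as below; the semantic analyzer is stubbed out as any callable that returns True for sentences it can interpret in-domain, and the toy keyword test standing in for it is an assumption for illustration.

```python
# Split substitution-generated sentences into IND and OOD determination
# sentences using the (old-version) semantic analyzer `is_in_domain`.
def split_by_analyzer(generated_sentences, is_in_domain):
    ind_store, ood_store = [], []            # storage units 132 and 136
    for sentence in generated_sentences:     # S31
        if is_in_domain(sentence):           # S32
            ind_store.append(sentence)       # S33
        else:
            ood_store.append(sentence)       # S34
    return ind_store, ood_store

# toy stand-in for the analyzer learned from the old corpus
known_words = {"アラーム", "タイマ", "会議"}
ind, ood = split_by_analyzer(
    ["7時に会議を設定して", "7時に人生を設定して"],
    lambda s: any(w in s for w in known_words))
print(ind)  # -> ['7時に会議を設定して']
print(ood)  # -> ['7時に人生を設定して']
```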
In step S36, the COOD corpus extraction unit 133 executes the COOD corpus extraction process to extract COOD determination sentence candidates from the corpora regarded as IND determination sentences, stores them in the COOD determination sentence storage unit 134, and stores the remaining IND determination sentences in the confirmed IND determination sentence storage unit 135 as confirmed IND determination sentences. At this time, among the corpora regarded as IND determination sentences, corpora regarded as non-sentences are treated as discard determination sentences and discarded.

The COOD corpus extraction process is described in detail later with reference to the flowchart of FIG. 22.

In step S37, the CIND corpus extraction unit 137 executes the CIND corpus extraction process to extract CIND determination sentences from the corpora regarded as OOD determination sentences, stores them in the CIND determination sentence storage unit 138, and stores the remaining OOD determination sentences in the confirmed OOD determination sentence storage unit 139 as confirmed OOD determination sentences. At this time, among the corpora regarded as OOD determination sentences, corpora regarded as non-sentences are treated as discard determination sentences and discarded.

The CIND corpus extraction process is described in detail later with reference to the flowchart of FIG. 26.
Through the above processing, corpora regarded as COOD determination sentences, confirmed IND determination sentences, CIND determination sentences, and confirmed OOD determination sentences can be generated efficiently and in large quantities.

After this processing, manual confirmation work is still required for the corpora regarded as COOD determination sentences, confirmed IND determination sentences, CIND determination sentences, and confirmed OOD determination sentences; however, because they have already been classified into one of these four categories, the load of the confirmation work can be reduced, and as a result the development cost of the corpus can be reduced. In addition, the frame estimation unit 13 and the semantic analysis unit 14 can improve their recognition accuracy by learning with the corpora of the generated COOD determination sentences and CIND determination sentences.
<COOD corpus extraction process>
Next, the COOD corpus extraction process will be described with reference to the flowchart of FIG. 22. It is desirable to extract the COOD determination sentences from the substitution-generated sentences manually, but to improve work efficiency the COOD candidate sentences can be narrowed down further by the following filtering-based COOD corpus extraction process.
In step S51, the COOD corpus extraction unit 133 accepts input of one of the unprocessed corpora serving as IND determination sentences stored in the IND determination sentence storage unit 132 and sets it as the corpus to be processed.

In step S52, the COOD corpus extraction unit 133 controls the non-sentence determination unit 133a to calculate the perplexity value of the corpus to be processed.

Here, the perplexity value represents the average branching factor obtained when the number of branches (the number of candidates) for the word following a given word is expressed as the reciprocal of the n-gram probability. That is, compared with a sentence generated by combining words at random, a meaningful sentence has higher joint probabilities between its words and a smaller branching factor between adjacent words, so its perplexity value is low. Conversely, for a meaningless sentence the joint probabilities between words are low and the branching factor between adjacent words is high, so its perplexity value is high.
Indeed, some of the sentences generated by word substitution do not make sense. For example, "この近くにある評判のいいレストラン教えて" (tell me a well-reviewed restaurant near here) makes sense, but the sentence obtained by replacing "レストラン" (restaurant) with "責任" (responsibility), "この近くにある評判のいい責任教えて", is hard to interpret naturally and feels odd. Put differently, the word "責任" is unlikely to appear following words such as "評判" (reputation) and "いい" (good).

The same holds for English. The sentence "break phone number again", created by replacing "repeat" in "repeat phone number again" with "break", is likewise not natural in meaning. The probability that "phone number" follows "break" is extremely low. If the inter-word connection probabilities (n-grams) have been trained in advance on a large amount of text (training data), the probabilistic validity of a generated sentence can be judged.

That is, the perplexity value can be regarded as an index for judging the probabilistic validity of a generated sentence.
A concrete way of computing the perplexity value of a generated sentence is, for example, as follows. For details of how to compute the perplexity value, see Daniel Jurafsky, "Language Modeling with N-grams", Chapter 4, https://web.stanford.edu/~jurafsky/slp3/4.pdf, 2016.

A probabilistic language model models the joint probability P(w) of a word sequence (sentence), based on the idea that word sequences are generated probabilistically. There are various ways to model the joint probability P(w); modeling with the following n-grams is shown here as an example.
Here, expression (1) is the n-gram probability model for the bi-gram case (n = 2), and expression (2) is the n-gram probability model for the tri-gram case (n = 3).
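Expressions (1) and (2) themselves are not reproduced in this text; assuming they follow the standard n-gram factorization described here and in the cited chapter, they would take the following form.

```latex
% Assumed forms of expressions (1) and (2): standard n-gram factorizations of
% the joint probability of a word sequence w_1 ... w_N.
% Bi-gram (n = 2):
P(w_1, \ldots, w_N) \approx \prod_{i=1}^{N} P(w_i \mid w_{i-1})
% Tri-gram (n = 3):
P(w_1, \ldots, w_N) \approx \prod_{i=1}^{N} P(w_i \mid w_{i-2}, w_{i-1})
```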
In learning the language model, the above n-gram parameters, that is, the n-gram probabilities of words, are learned using a large amount of training text (training data) such as text from Internet sites and news articles.

Such learning suffers from a sparseness problem: because many word sequences never appear in the corpus, most n-gram counts are zero. Smoothing (language modeling smoothing) and back-off processing, which interpolate these counts with non-zero values, are therefore applied.

For smoothing and back-off processing, see Zhai & Lafferty, "A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval", 2001.
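As one concrete illustration of interpolating zero counts with non-zero values, the add-one (Laplace) smoothed bi-gram estimate below can be kept in mind. This is only a minimal example; the smoothing and back-off schemes actually intended are those surveyed in the Zhai & Lafferty reference and may differ.

```latex
% Add-one (Laplace) smoothing: an illustrative scheme, not necessarily the one used here
P_{\mathrm{add\text{-}1}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\, w_i) + 1}{C(w_{i-1}) + V}
\qquad (V = \text{vocabulary size})
```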
Using the n-gram model learned in this way, the non-sentence determination unit 133a calculates, for each generated corpus, the Perplexity value given by equation (3) below.
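Equation (3) is likewise given only as a figure. The Perplexity value discussed here is assumed to be the standard per-word perplexity of a word string W = w1 w2 ... wN under the learned n-gram model:

```latex
% Assumed form of equation (3): per-word perplexity under the n-gram model
\mathrm{PPL}(W) = P(w_1 w_2 \cdots w_N)^{-\tfrac{1}{N}}
               = \sqrt[N]{\frac{1}{\prod_{i=1}^{N} P(w_i \mid w_{i-n+1} \cdots w_{i-1})}}
```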
For example, as shown by example sentence 1) in example Ex41 of FIG. 23, the sentence "Tell me a responsibility with a good reputation near here" is unnatural and does not make sense. In this case, the n-gram probability of "responsibility" following "good", p(responsibility | good) = 1.90365e-05, is low, so the Perplexity value PLL is 80.4152.

Likewise, as shown by example sentence 2) in example Ex41 of FIG. 23, the sentence "Tell me a surfing with a good reputation near here" also does not make natural sense. In this case, however, the n-gram probability p(surfing | good) = 2.13532e-05, though low, yields a somewhat lower Perplexity value PLL of 70.6759.

Furthermore, as shown by example sentence 3) in example Ex41 of FIG. 23, the sentence "Tell me a store with a good reputation near here" makes relatively good sense. In this case, the n-gram probability p(store | good) = 0.000105223 is relatively high, so the Perplexity value PLL is 57.4806.

Also, as shown by example sentence 4) in example Ex41 of FIG. 23, the sentence "Tell me a massage with a good reputation near here" makes sense. In this case, the n-gram probability p(massage | good) = 0.000378552 is relatively high, so the Perplexity value PLL is 57.0273.
In this way, for a corpus consisting of sentences that make sense, the n-gram probabilities are higher and the Perplexity value is smaller.
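As a rough sketch of this non-sentence filter, the Python fragment below trains an add-one smoothed bi-gram model on a toy corpus, computes the per-word perplexity of substitution-generated candidate sentences, and flags for discarding those whose perplexity exceeds a threshold α. The training data, threshold value, and function names are illustrative assumptions, not the actual implementation.

```python
import math
from collections import Counter

def train_bigram_counts(sentences):
    """Count unigrams and bigrams over tokenized sentences (lists of words)."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams, vocab_size):
    """Add-one (Laplace) smoothed bigram probability; a stand-in for the
    smoothing/back-off schemes mentioned in the description."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def perplexity(words, unigrams, bigrams, vocab_size):
    """Per-word perplexity of one sentence under the bigram model."""
    padded = ["<s>"] + words + ["</s>"]
    log_prob = sum(math.log(bigram_prob(p, w, unigrams, bigrams, vocab_size))
                   for p, w in zip(padded, padded[1:]))
    return math.exp(-log_prob / (len(padded) - 1))

# Hypothetical training data and substitution-generated candidates.
training = [["teach", "me", "a", "reputable", "restaurant", "near", "here"],
            ["teach", "me", "a", "reputable", "store", "near", "here"],
            ["repeat", "phone", "number", "again"]]
unigrams, bigrams = train_bigram_counts(training)
V = len(unigrams)

ALPHA = 6.5  # threshold alpha; an illustrative value for this toy data only
for cand in [["teach", "me", "a", "reputable", "store", "near", "here"],
             ["teach", "me", "a", "reputable", "responsibility", "near", "here"]]:
    pll = perplexity(cand, unigrams, bigrams, V)
    label = "discard as non-sentence" if pll > ALPHA else "keep"
    print(" ".join(cand), "->", round(pll, 2), label)
```

The candidate with the out-of-vocabulary substitution tends to receive the higher perplexity, which is the behavior the threshold test in step S53 relies on.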
In step S53, the non-sentence determination unit 133a determines whether the processing target corpus is a non-sentence based on whether its calculated Perplexity value is larger than a predetermined threshold value α.

If, in step S53, the Perplexity value PLL is larger than the predetermined threshold value α, the processing proceeds to step S55.

In step S55, the non-sentence determination unit 133a regards the processing target corpus as a non-sentence, discards it as a discard determination sentence, and the processing proceeds to step S56.

If, on the other hand, the Perplexity value PLL is not larger than the predetermined threshold value α in step S53, that is, if the Perplexity value PLL is smaller than the threshold value α, the non-sentence determination unit 133a regards the processing target corpus as a sentence that makes sense, and the processing proceeds to step S54.

In step S54, the non-sentence determination unit 133a stores the processing target corpus.
In step S56, the COOD corpus extraction unit 133 determines whether any unprocessed corpus remains among the corpora of IND determination sentences stored in the IND determination sentence storage unit 132. If an unprocessed corpus remains, the processing returns to step S51. That is, steps S51 to S56 are repeated until the Perplexity value PLL has been calculated for every IND determination sentence and each has been judged, by comparison with the predetermined threshold value α, to be either a non-sentence or a corpus of sentences that make sense. When it is determined in step S56 that no unprocessed corpus remains, the processing proceeds to step S57.
In step S57, the COOD corpus extraction unit 133 receives, as the processing target corpus, one of the as yet unprocessed corpora of IND determination sentences that were judged by the non-sentence determination unit 133a in steps S52 and S53 to be corpora of sentences that make sense rather than non-sentences and were stored.

In step S58, the COOD corpus extraction unit 133 controls the non-appearance determination unit 133b to calculate the non-appearance, in the target domain, of the words included in the processing target corpus.

Non-appearance is an index indicating to what extent a generated corpus contains words that do not appear in the corpus group of the domain into which it was classified as an IND determination sentence by the semantic analyzer.
For example, the sentences (corpora) shown in FIG. 24 all contain words that appear with low frequency in the ALARM-CHANGE domain; that is, their non-appearance is high, and they are Close OOD sentences that include discard determination sentences. Here, however, non-sentences have already been excluded by the processing performed before the non-appearance is obtained, so what is extracted is substantially the COOD determination sentences. In the following, the words in quotation marks are the words with high non-appearance.

That is, in FIG. 24, "registry" in "change the alarm's 'registry'", "script" in "reset the 'script' to 8 o'clock", "meal" in "could you fix the alarm's 'meal'", "wording" in "please change the 7 a.m. 'wording' to 8 a.m.", "log file" in "please change the 'log file' that goes off at 6:30 a.m. to around 7", "system" in "change the 6 o'clock 'system' to 7", "idea" in "please change the alarm's 'idea' to 7 o'clock", "construction period" in "change the 5 p.m. 'construction period' to 5:30", "design" in "please change the 6:30 a.m. 'design' to 8:30 a.m.", "price" in "change the 7 a.m. 'price' to 8", and "menu" in "change the 7 o'clock 'menu' to 7:30" are all words with high non-appearance, so these sentences are regarded as COOD determination sentences.
This non-appearance can be obtained numerically, for example, from the number of words included in the processing target corpus that do not appear in the target domain.

For example, letting n be the total number of words included in the processing target corpus, which is an IND determination sentence, and letting no be the number of those words that do not appear in the domain of the IND determination sentences (that is, words that are not included in any corpus belonging to that domain other than the processing target corpus itself), the non-appearance determination unit 133b calculates no/n as a parameter representing non-appearance.
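A minimal sketch of this no/n parameter, assuming each corpus has already been tokenized into a list of words, might look like the following. The sample data, threshold, and names are hypothetical.

```python
def non_appearance(target_words, domain_corpora, target_index):
    """Parameter no/n: n is the number of words in the corpus under
    examination, no is the number of its words that appear in none of the
    other corpora of the same (IND-determined) domain."""
    other_words = set()
    for i, corpus in enumerate(domain_corpora):
        if i != target_index:
            other_words.update(corpus)
    n = len(target_words)
    no = sum(1 for w in target_words if w not in other_words)
    return no / n if n else 0.0

# Hypothetical tokenized corpora of an ALARM-CHANGE-like domain.
domain = [["change", "the", "alarm", "to", "8"],
          ["reset", "the", "alarm"],
          ["change", "the", "alarm", "registry"]]   # contains an unusual word
idx = 2
score = non_appearance(domain[idx], domain, idx)
BETA = 0.2   # threshold beta (illustrative value)
print("no/n =", score, "-> COOD candidate" if score > BETA else "-> confirmed IND")
```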
In step S59, the non-appearance determination unit 133b determines whether the parameter no/n, which represents the non-appearance of the words included in the processing target corpus in the domain of the IND determination sentences, is larger than a predetermined threshold value β.

If the parameter no/n is larger than the threshold value β in step S59, that is, if the non-appearance of the words included in the processing target corpus in the domain of the IND determination sentences is high, the processing proceeds to step S60.

In step S60, the non-appearance determination unit 133b extracts the processing target corpus as a COOD determination sentence and stores it in the COOD determination sentence storage unit 134.

If, on the other hand, the parameter no/n is not larger than the threshold value β in step S59, that is, if no/n is smaller than β and the non-appearance of the words included in the processing target corpus in the domain of the IND determination sentences is low, the processing proceeds to step S61.

In step S61, the non-appearance determination unit 133b regards the processing target corpus as a confirmed IND determination sentence and stores it in the confirmed IND determination sentence storage unit 135.

In step S62, the COOD corpus extraction unit 133 determines whether any unprocessed IND determination sentence remains; if so, the processing returns to step S57. That is, steps S57 to S62, including the processing of the non-appearance determination unit 133b in steps S58 and S59, are repeated until no unprocessed IND determination sentence remains.

When it is determined in step S62 that no unprocessed IND determination sentence remains, that is, that all the IND determination sentences have been processed, the processing ends.
Through the above processing, of the corpora constituting the domain of IND determination sentences, those that are not discarded as non-sentences on the basis of their Perplexity values and whose contained words have high non-appearance are regarded as COOD determination sentences and stored in the COOD determination sentence storage unit 134, while those that are not non-sentences and whose contained words have low non-appearance are regarded as confirmed IND determination sentences and stored in the confirmed IND determination sentence storage unit 135.
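Putting the two filters together, the overall flow of steps S51 to S62 can be sketched as follows. Here perplexity_fn and non_appearance_fn stand for routines like those illustrated above; the function signatures and argument names are assumptions rather than the actual interfaces of units 133a and 133b.

```python
def classify_ind_sentences(ind_corpora, perplexity_fn, non_appearance_fn,
                           alpha, beta):
    """Rough sketch of the COOD corpus extraction flow (steps S51-S62):
    discard non-sentences by perplexity, then split the remainder into COOD
    determination sentences and confirmed IND determination sentences by the
    non-appearance parameter."""
    kept, discarded = [], []
    for corpus in ind_corpora:                 # S51-S56: non-sentence filter
        if perplexity_fn(corpus) > alpha:
            discarded.append(corpus)           # S55: discard determination sentence
        else:
            kept.append(corpus)                # S54: keep
    cood, confirmed_ind = [], []
    for i, corpus in enumerate(kept):          # S57-S62: non-appearance filter
        if non_appearance_fn(corpus, kept, i) > beta:
            cood.append(corpus)                # S60: COOD determination sentence
        else:
            confirmed_ind.append(corpus)       # S61: confirmed IND determination sentence
    return cood, confirmed_ind, discarded
```

For example, the helpers sketched earlier could be passed in as perplexity_fn=lambda c: perplexity(c, unigrams, bigrams, V) and non_appearance_fn=non_appearance.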
In the above, an example in which non-appearance is expressed by the parameter no/n has been described, but any other value that can express non-appearance may be used; for example, TF (Term Frequency)/IDF (Inverse Document Frequency) values may be used.

Here, the TF value is an index for analyzing the words that characterize each document (in this case, each domain) when there are a plurality of documents (in this case, a plurality of domains), and is expressed by equation (4) below.
The IDF value is an index indicating whether each word is used in common across documents, and is expressed by equation (5) below.
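Equations (4) and (5) also appear only as figures. The standard TF, IDF, and TF/IDF definitions they are assumed to correspond to are:

```latex
% Assumed standard forms of equations (4) and (5), with D the set of documents (domains)
\mathrm{tf}(t, d) = \frac{n_{t,d}}{\sum_{k} n_{k,d}}, \qquad
\mathrm{idf}(t)  = \log \frac{|D|}{\lvert \{\, d \in D : t \in d \,\} \rvert}, \qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d)\cdot \mathrm{idf}(t)
```

Note that the TF values listed in example Ex52 of FIG. 25 look like raw counts, so the exact normalization in equation (4) may differ from the textbook form shown here.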
Further, when TF/IDF values are used, let nlw be the number of words whose frequency of appearance is smaller than a threshold β (0 ≤ β ≤ 1) in the list of important words that appear frequently and are concentrated in the target domain of the IND determination sentences, or the number of words that are not present in the important word list. In step S59, the non-appearance determination unit 133b then calculates the parameter nlw/n indicating non-appearance.

In step S59, the non-appearance determination unit 133b determines whether the parameter nlw/n, which represents the non-appearance of the words included in the processing target corpus in the predetermined domain, is larger than a predetermined threshold value γ.

If the parameter nlw/n is larger than the threshold value γ in step S59, the processing proceeds to step S60, and the non-appearance determination unit 133b extracts the processing target corpus as a COOD determination sentence and stores it in the COOD determination sentence storage unit 134.

If, on the other hand, the parameter nlw/n is not larger than the threshold value γ in step S59, that is, if nlw/n is smaller than γ, the processing proceeds to step S61, and the non-appearance determination unit 133b regards the processing target corpus as a confirmed IND determination sentence and stores it in the confirmed IND determination sentence storage unit 135.
<TF values and TF/IDF values>

For example, the TF values and TF/IDF values obtained from the example corpus of IND determination sentences of the ALARM-CHANGE domain shown in example Ex51 of FIG. 25 are shown in examples Ex52 and Ex53 of FIG. 25, respectively.
That is, the corpus group of example Ex51 consists, from the top, of "change the alarm to 8 o'clock", "will you edit the alarm", "reset the alarm", "change the alarm from 7 o'clock to 8 o'clock", "can you change the 7 o'clock alarm to 8 o'clock", "change the alarm to 8 o'clock", "change the wake-up alarm to 8 o'clock and change the wake-up time", "can you change the wake-up time", "change the wake-up alarm", "set the 10 o'clock alarm to 7 o'clock", "change the 10 o'clock alarm to 11 o'clock", ..., "I want to change the alarm set for 6 o'clock to 5:30", "change the setting so that the alarm that rings at 7 every day rings at 6:30", "change the setting so that the alarm that rings at 7 every time rings at 6:30", and "edit the Saturday and Sunday morning alarms".
The TF values of the words in the corpus group of example Ex51 are, in descending order of TF value as shown in example Ex52 in the figure: 変更 (change) 351, 7 334, 8 260, 変え (change) 258, 時間 (time) 220, 設定 (set) 159, 6 152, し (do) 148, 目覚まし (wake-up alarm) 110, 午前 (a.m.) 64, セット (set) 56, 願い (request) 55, ..., 目覚し (wake-up) 6, and 目覚まし時計 (alarm clock) 5.

The TF/IDF values of the words in the corpus group of example Ex51 are, in descending order of TF/IDF value as shown in example Ex53 in the figure: アラーム (alarm) 0.00504379225545, セット (set) 0.00328857409316, 起こ (wake) 0.00110030484831, 明日 (tomorrow) 0.000795915410064, 目覚まし (wake-up alarm) 0.000763298323913, 起き (get up) 0.000622699996573, 鳴ら (ring) 0.00060708690425, 朝 (morning) 0.000521901615019, 設定 (set) 0.000466290476509, かけ (set) 0.000336399933349, 鳴 (ring) 0.000297198910881, めざまし (wake-up) 0.000223592438318, おこ (wake) 0.000196205208903, 目覚まし時計 (alarm clock) 0.000185042017918, 起床 (rising) 0.000175552029018, 願い (request) 0.000124433590006, 時間 (time) 0.000107951491631, 午前 (a.m.) 0.000102767940411, and so on.
That is, a word with a high TF value or TF/IDF value can be considered a word that appears frequently (has low non-appearance) and is highly important. Accordingly, among the corpora included in the IND determination sentences, a corpus containing many words whose TF value or TF/IDF value is at or below a certain threshold is likely to be a COOD determination sentence. The COOD determination sentences are therefore obtained as the corpora, among those included in the IND determination sentences, that contain many words not belonging to the group of words with high TF/IDF values.
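A sketch of this TF/IDF-based variant (the nlw/n parameter) might look like the following. The top-k cutoff, raw-count TF, and function names are assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_important_words(domain_docs, all_domain_docs, top_k=20):
    """Build an important-word list for one domain by TF/IDF.
    TF is taken as a raw count over the domain's own corpora, IDF over all
    domains; these follow the common definitions assumed for equations (4)
    and (5), which may differ from the exact forms in the figures."""
    tf = Counter(w for doc in domain_docs for w in doc)
    n_domains = len(all_domain_docs)
    def idf(word):
        df = sum(1 for docs in all_domain_docs if any(word in d for d in docs))
        return math.log(n_domains / df) if df else 0.0
    scored = {w: c * idf(w) for w, c in tf.items()}
    return set(sorted(scored, key=scored.get, reverse=True)[:top_k])

def nlw_ratio(sentence_words, important_words):
    """nlw/n: fraction of the sentence's words that are not on the domain's
    important-word list (used as a non-appearance measure)."""
    n = len(sentence_words)
    nlw = sum(1 for w in sentence_words if w not in important_words)
    return nlw / n if n else 0.0
```

A sentence whose nlw/n exceeds the threshold γ would then be treated as a COOD determination sentence, exactly as with the no/n parameter.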
<CIND corpus extraction processing>

Next, the CIND corpus extraction processing will be described with reference to the flowchart of FIG. 26.
In step S101, the CIND corpus extraction unit 137 receives, as the processing target corpus, one of the as yet unprocessed corpora of OOD determination sentences stored in the OOD determination sentence storage unit 136.

In step S102, the CIND corpus extraction unit 137 controls the non-sentence determination unit 137a to calculate the Perplexity value of the processing target corpus.

In step S103, the non-sentence determination unit 137a determines whether the processing target corpus is a non-sentence based on whether its calculated Perplexity value is larger than the predetermined threshold value α.

If, in step S103, the Perplexity value PLL is larger than the predetermined threshold value α, the processing proceeds to step S105.

In step S105, the non-sentence determination unit 137a regards the processing target corpus as a non-sentence, discards it as a discard determination sentence, and the processing proceeds to step S106.

If, on the other hand, the Perplexity value PLL is not larger than the threshold value α in step S103, that is, if the Perplexity value PLL is smaller than α, the non-sentence determination unit 137a regards the processing target corpus as a sentence that makes sense, and the processing proceeds to step S104.

In step S104, the non-sentence determination unit 137a stores the processing target corpus.
In step S106, the CIND corpus extraction unit 137 determines whether any unprocessed corpus remains among the corpora of OOD determination sentences stored in the OOD determination sentence storage unit 136. If an unprocessed corpus remains, the processing returns to step S101. That is, steps S101 to S106 are repeated until the Perplexity value PLL has been calculated for every OOD determination sentence and each has been judged, by comparison with the predetermined threshold value α, to be either a non-sentence or a corpus of sentences that make sense. When it is determined in step S106 that no unprocessed corpus remains, the processing proceeds to step S107.

In step S107, the CIND corpus extraction unit 137 receives, as the processing target corpus, one of the as yet unprocessed corpora of OOD determination sentences that were judged by the non-sentence determination unit 137a in steps S102 and S103 to be corpora of sentences that make sense rather than non-sentences and were stored.
In step S108, the CIND corpus extraction unit 137 controls the non-appearance determination unit 137b to calculate the parameter no/n representing the non-appearance, in the target domain, of the words included in the processing target corpus.

In step S109, the non-appearance determination unit 137b determines whether the parameter no/n, which represents the non-appearance of the words included in the processing target corpus in the predetermined domain, is larger than the predetermined threshold value β.

If the parameter no/n is larger than the threshold value β in step S109, the processing proceeds to step S110.

In step S110, the non-appearance determination unit 137b regards the processing target corpus as a confirmed OOD determination sentence and stores it in the confirmed OOD determination sentence storage unit 139.

If, on the other hand, the parameter no/n is not larger than the threshold value β in step S109, that is, if no/n is smaller than β, the processing proceeds to step S111.

In step S111, the non-appearance determination unit 137b regards the processing target corpus as a CIND determination sentence and stores it in the CIND determination sentence storage unit 138.

In step S112, the CIND corpus extraction unit 137 determines whether any unprocessed OOD determination sentence remains; if so, the processing returns to step S107. That is, steps S107 to S112 are repeated until no unprocessed OOD determination sentence remains.

When it is determined in step S112 that no unprocessed OOD determination sentence remains, that is, that all the OOD determination sentences have been processed, the processing ends.
Through the above processing, of the corpora of OOD determination sentences, those that are not discarded as non-sentences on the basis of their Perplexity values and whose contained words have low non-appearance are regarded as CIND determination sentences and stored in the CIND determination sentence storage unit 138, while those that are not non-sentences and whose contained words have high non-appearance are regarded as confirmed OOD determination sentences and stored in the confirmed OOD determination sentence storage unit 139. Note that, since a CIND determination sentence alone does not indicate which domain it belongs to, that judgment must ultimately be made manually; however, because the narrowing-down for that purpose can be automated, the manual workload can be reduced.
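The CIND flow mirrors the COOD flow with the non-appearance decision inverted. A compact sketch, reusing the same assumed helper functions as before, might be:

```python
def classify_ood_sentences(ood_corpora, perplexity_fn, non_appearance_fn,
                           alpha, beta):
    """Rough sketch of the CIND corpus extraction flow (steps S101-S112):
    the perplexity filter is the same as for COOD extraction, but high
    non-appearance now means the sentence really is out of domain
    (confirmed OOD), while low non-appearance means it lies near the
    domain boundary (CIND). Function arguments are assumptions."""
    kept = [c for c in ood_corpora if perplexity_fn(c) <= alpha]   # S101-S106
    confirmed_ood, cind = [], []
    for i, corpus in enumerate(kept):                              # S107-S112
        if non_appearance_fn(corpus, kept, i) > beta:
            confirmed_ood.append(corpus)                           # S110
        else:
            cind.append(corpus)                                    # S111
    return cind, confirmed_ood
```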
In the above, an example in which the parameter representing non-appearance is expressed by no/n has been described, but any other value that can express non-appearance may be used; for example, TF (Term Frequency)/IDF (Inverse Document Frequency) values may be used.

Further, when TF/IDF values are used, let nlw be the number of words whose frequency of appearance is smaller than the threshold β (0 ≤ β ≤ 1) in the list of important words that appear frequently and are concentrated in the target domain of the IND determination sentences, or the number of words that are not present in the important word list. In step S108, the non-appearance determination unit 137b then calculates the parameter nlw/n indicating non-appearance.

In step S108, the non-appearance determination unit 137b determines whether the parameter nlw/n, which represents the non-appearance of the words included in the processing target corpus in the predetermined domain, is larger than the predetermined threshold value γ.

If the parameter nlw/n is larger than the threshold value γ in step S108, the processing proceeds to step S110, and the CIND corpus extraction unit 137 regards the processing target corpus as a confirmed OOD determination sentence and stores it in the confirmed OOD determination sentence storage unit 139.

If, on the other hand, the parameter nlw/n is not larger than the threshold value γ in step S108, that is, if nlw/n is smaller than γ, the processing proceeds to step S111, and the non-appearance determination unit 137b regards the processing target corpus as a CIND determination sentence and stores it in the CIND determination sentence storage unit 138.
Through the above processing, many corpora can be generated from IND sentences produced by a predetermined means without requiring manual work. This makes it possible to reduce the burden of corpus development and thus to reduce the development cost.

Furthermore, the corpora are generated already classified into IND determination sentences (confirmed IND determination sentences), COOD determination sentences, CIND determination sentences, and OOD determination sentences (confirmed OOD determination sentences).

As a result, by training the frame estimation unit 13 and the semantic analysis unit 14 with a corpus containing more COOD determination sentences and CIND determination sentences, learning can be performed with corpora distributed near the boundary between IND determination sentences and OOD determination sentences. Even confusing expressions that lie near that boundary can then be recognized appropriately, and the accuracy of the semantic analysis unit can be improved.

In practice, the generated corpora are expected to require manual confirmation work, but since they are classified into IND determination sentences (confirmed IND determination sentences), COOD determination sentences, CIND determination sentences, and OOD determination sentences (confirmed OOD determination sentences), and non-sentences and duplicate sentences have already been discarded, the burden of the confirmation work can be reduced, and as a result the corpus development cost can be reduced.

Note that the order of the processes described above may be changed; for example, the COOD corpus extraction processing in step S36 and the CIND corpus extraction processing in step S37 may be interchanged. Likewise, in the COOD corpus extraction processing and the CIND corpus extraction processing, the non-sentence determination processing using Perplexity values and the extraction of COOD determination sentences and CIND determination sentences using the parameter representing non-appearance may be performed in either order.
<Example of execution by software>

The series of processes described above can be executed by hardware, but can also be executed by software. When the series of processes is executed by software, a program constituting the software is installed from a recording medium into a computer incorporated in dedicated hardware, or into, for example, a general-purpose personal computer capable of executing various functions by installing various programs.
FIG. 27 shows a configuration example of a general-purpose personal computer. This personal computer incorporates a CPU (Central Processing Unit) 1001. An input/output interface 1005 is connected to the CPU 1001 via a bus 1004. A ROM (Read Only Memory) 1002 and a RAM (Random Access Memory) 1003 are also connected to the bus 1004.

Connected to the input/output interface 1005 are an input unit 1006 consisting of input devices such as a keyboard and a mouse with which the user inputs operation commands, an output unit 1007 that outputs processing operation screens and images of processing results to a display device, a storage unit 1008 consisting of a hard disk drive or the like that stores programs and various data, and a communication unit 1009 consisting of a LAN (Local Area Network) adapter or the like that executes communication processing via a network typified by the Internet. A drive 1010 that reads and writes data to and from removable media 1011 such as magnetic disks (including flexible disks), optical disks (including CD-ROMs (Compact Disc-Read Only Memory) and DVDs (Digital Versatile Discs)), magneto-optical disks (including MDs (Mini Discs)), and semiconductor memories is also connected.

The CPU 1001 executes various processes in accordance with a program stored in the ROM 1002, or a program read from removable media 1011 such as a magnetic disk, optical disk, magneto-optical disk, or semiconductor memory, installed in the storage unit 1008, and loaded from the storage unit 1008 into the RAM 1003. The RAM 1003 also stores, as appropriate, data necessary for the CPU 1001 to execute the various processes.
In the computer configured as described above, the series of processes described above is performed, for example, by the CPU 1001 loading the program stored in the storage unit 1008 into the RAM 1003 via the input/output interface 1005 and the bus 1004 and executing it.

The program executed by the computer (CPU 1001) can be provided, for example, recorded on removable media 1011 as packaged media or the like. The program can also be provided via a wired or wireless transmission medium such as a local area network, the Internet, or digital satellite broadcasting.

In the computer, the program can be installed in the storage unit 1008 via the input/output interface 1005 by mounting the removable media 1011 in the drive 1010. The program can also be received by the communication unit 1009 via a wired or wireless transmission medium and installed in the storage unit 1008. In addition, the program can be installed in advance in the ROM 1002 or the storage unit 1008.

Note that the program executed by the computer may be a program whose processing is performed in time series in the order described in this specification, or a program whose processing is performed in parallel or at necessary timing, such as when a call is made.
The CPU 1001 in FIG. 27 realizes the functions of the semantic analyzer 131, the COOD corpus extraction unit 133, and the CIND corpus extraction unit 137. The storage unit 1008 realizes the IND determination sentence storage unit 132, the OOD determination sentence storage unit 136, the COOD determination sentence storage unit 134, the confirmed IND determination sentence storage unit 135, the CIND determination sentence storage unit 138, and the confirmed OOD determination sentence storage unit 139.

In this specification, a system means a set of a plurality of components (devices, modules (parts), and the like), and it does not matter whether all the components are in the same housing. Therefore, a plurality of devices housed in separate housings and connected via a network, and a single device in which a plurality of modules are housed in one housing, are both systems.
The embodiments of the present disclosure are not limited to the embodiments described above, and various modifications are possible without departing from the gist of the present disclosure.

For example, the present disclosure can adopt a cloud computing configuration in which one function is shared and processed jointly by a plurality of devices via a network.

Each step described in the flowcharts above can be executed by one device or shared among a plurality of devices.

Furthermore, when a plurality of processes are included in one step, the plurality of processes included in that one step can be executed by one device or shared among a plurality of devices.
The present disclosure can also have the following configurations.

<1> An information processing device including: a structural analysis unit that analyzes the structure of an input sentence; a replacement location setting unit that sets a replacement location in the input sentence on the basis of an analysis result of the structural analysis unit; and a corpus generation unit that generates a corpus by replacing the word at the replacement location in the input sentence.

<2> The information processing device according to <1>, in which the input sentence is an IND (In Domain) determination sentence, that is, utterance content that a predetermined application program should handle.

<3> The information processing device according to <1> or <2>, in which the structural analysis unit analyzes the predicate-argument structure of the input sentence, and the replacement location setting unit sets the replacement location in the input sentence on the basis of the predicate-argument structure obtained as the analysis result of the structural analysis unit.

<4> The information processing device according to <3>, further including a dictionary query unit that queries a dictionary to search for candidates for replacing the word at the replacement location in the input sentence, in which the corpus generation unit replaces the word at the replacement location in the input sentence with a word retrieved by the dictionary query unit.

<5> The information processing device according to <4>, in which the dictionary is a case frame dictionary.

<6> The information processing device according to <4>, in which the replacement location setting unit sets the replacement location in the input sentence and a replacement method for the replacement location on the basis of the predicate-argument structure obtained as the analysis result of the structural analysis unit, and the corpus generation unit generates the corpus by replacing the word at the replacement location in the input sentence by the replacement method.

<7> The information processing device according to <6>, in which the replacement method includes a first method that fixes the predicate of the input sentence and replaces a noun serving as a predicate argument including the object case, and a second method that fixes the predicate argument including the object case of the input sentence and replaces the predicate.

<8> The information processing device according to any one of <1> to <7>, further including a classification unit that classifies the corpus generated by the corpus generation unit into an IND (In Domain) determination sentence, which is utterance content that a predetermined application program should handle, or an OOD (Out of Domain) determination sentence, which is unexpected utterance content that the predetermined application program should not handle.

<9> The information processing device according to <8>, further including a COOD determination sentence extraction unit that extracts, from the corpora classified as IND determination sentences, corpora that are OOD determination sentences and that lie near the boundary with the IND determination sentences in a feature space expressed by their respective features, as COOD (Close OOD) determination sentences.

<10> The information processing device according to <9>, in which the COOD determination sentence extraction unit extracts from a domain including the corpora classified as IND determination sentences, as the COOD determination sentences, corpora containing more than a predetermined number of words that are not included in any corpus of the domain other than the corpus itself.

<11> The information processing device according to <10>, in which the COOD determination sentence extraction unit extracts from the domain, as the COOD determination sentences, corpora whose non-appearance, expressed as the ratio of the number of words not included in any corpus of the domain other than the corpus itself to the number of words included in the corpus, is higher than a predetermined value.

<12> The information processing device according to <10>, in which the COOD determination sentence extraction unit extracts from the domain, as the COOD determination sentences, corpora of the domain that contain many words whose TF/IDF value, formed from a TF (Term Frequency) value and an IDF (Inverse Document Frequency) value and used as a measure of non-appearance, is lower than a predetermined value.

<13> The information processing device according to <10>, in which the COOD determination sentence extraction unit calculates a Perplexity value for each corpus of the domain and discards, as non-sentences, corpora whose Perplexity value is higher than a predetermined value.

<14> The information processing device according to <8>, further including a CIND determination sentence extraction unit that extracts, from the corpora classified as OOD determination sentences, corpora that are IND determination sentences and that lie near the boundary with the OOD determination sentences in a feature space expressed by their respective features, as CIND (Close IND) determination sentences.

<15> The information processing device according to <14>, in which the CIND determination sentence extraction unit extracts, from all the corpora classified as OOD determination sentences, as the CIND determination sentences, corpora in which, within the domain including the corpora classified as OOD determination sentences, the number of their words also included in corpora classified as OOD determination sentences other than the corpus itself is larger than a predetermined number.

<16> The information processing device according to <15>, in which the CIND determination sentence extraction unit extracts, as the CIND determination sentences, corpora whose non-appearance, expressed as the ratio of the number of words not included in any corpus other than the corpus itself to the number of words included in the corpus of the domain, is lower than a predetermined value.

<17> The information processing device according to <15>, in which the CIND determination sentence extraction unit extracts, as the CIND determination sentences, corpora of the domain in which the non-appearance of words, represented by a TF/IDF value formed from a TF (Term Frequency) value and an IDF (Inverse Document Frequency) value, is lower than a predetermined value.

<18> The information processing device according to <15>, in which the CIND determination sentence extraction unit calculates a Perplexity value for each corpus of the domain and discards, as non-sentences, corpora whose Perplexity value is higher than a predetermined value.

<19> An information processing method including the steps of: analyzing the structure of an input sentence; setting a replacement location in the input sentence on the basis of a result of the structural analysis; and generating a corpus by replacing the word at the replacement location in the input sentence.

<20> A program that causes a computer to function as: a structural analysis unit that analyzes the structure of an input sentence; a replacement location setting unit that sets a replacement location in the input sentence on the basis of an analysis result of the structural analysis unit; and a corpus generation unit that generates a corpus by replacing the word at the replacement location in the input sentence.
<1> 入力文の構造を解析する構造解析部と、
前記構造解析部の解析結果に基づいて、前記入力文における置換箇所を設定する置換箇所設定部と、
前記入力文における前記置換箇所の単語を置換してコーパスを生成するコーパス生成部と
を含む情報処理装置。
<2> 前記入力文は、所定のアプリケーションプログラムで扱うべき発話内容であるIND(In Domain)判定文である
<1>に記載の情報処理装置。
<3> 前記構造解析部は、前記入力文の述語項構造を解析する
前記置換箇所設定部は、前記構造解析部の解析結果である前記述語項構造に基づいて、前記入力文における置換箇所を設定する
<1>または<2>に記載の情報処理装置。
<4> 前記入力文における前記置換箇所の単語を置換する候補を、辞書を照会して検索する辞書照会部をさらに含み、
前記コーパス生成部は、前記辞書照会部により検索された単語で、前記入力文における前記置換箇所の単語を置換する
<3>のいずれかに記載の情報処理装置。
<5> 前記辞書は、格フレーム辞書である
<4>に記載の情報処理装置。
<6> 前記置換箇所設定部は、前記構造解析部の解析結果である前記述語項構造に基づいて、前記入力文における置換箇所と、前記置換箇所の置換方式を設定し、
前記コーパス生成部は、前記入力文における前記置換箇所の単語を、前記置換方式で置換してコーパスを生成する
<4>に記載の情報処理装置。
<7> 前記置換方式は、前記入力文の述部を固定し、かつ、対象格を含む述語項となる名詞を置換する第1の方式と、前記入力文の対象格を含む述語項を固定し、かつ、述部を置換する第2の方式とを含む
<6>に記載の情報処理装置。
<8> 前記コーパス生成部により生成されたコーパスを、所定のアプリケーションプログラムで扱うべき発話内容であるIND(In Domain)判定文、または、所定のアプリケーションプログラムで扱うべきではない想定外の発話内容であるOOD(Out of Domain)判定文に分類する分類部をさらに含む
<1>乃至<7>のいずれかに記載の情報処理装置。
<9> 前記OOD判定文であって、かつ、前記IND判定文との、それぞれの素性で表現される特徴空間における境界近傍に存在するコーパスをCOOD(Close OOD)判定文として、前記IND判定文として分類されたコーパスより抽出するCOOD判定文抽出部をさらに含む
<8>に記載の情報処理装置。
<10> 前記COOD判定文抽出部は、前記IND判定文として分類されたコーパスを含むドメインにおいて、自ら及び他のコーパスに含まれない単語数が所定数より多いコーパスを、前記ドメインより前記COOD判定文として抽出する
<9>に記載の情報処理装置。
<11> 前記COOD判定文抽出部は、前記ドメインのコーパスに含まれる単語数に対する、前記自ら及び他のコーパスに含まれない単語数の割合で表される非出現性が所定値より高いコーパスを、前記ドメインより前記COOD判定文として抽出する
<10>に記載の情報処理装置。
<12> 前記COOD判定文抽出部は、前記ドメインのコーパスのうち、TF(Term Frequency)値とIDF(Inverse Document Frequency)値とからなる単語のTF/IDFで表される非出現性が所定値より低い単語を多く含むコーパスを、前記ドメインより前記COOD判定文として抽出する
<10>に記載の情報処理装置。
<13> 前記COOD判定文抽出部は、前記ドメインのコーパスにおけるPerplexity値を算出し、前記Perplexity値が所定値よりも高いものを非文として廃棄する
<10>に記載の情報処理装置。
<14> 前記IND判定文であって、かつ、前記OOD判定文との、それぞれの素性で表現される特徴空間における境界近傍に存在するコーパスをCIND(Close IND)判定文として、前記OOD判定文として分類されたコーパスより抽出するCIND判定文抽出部をさらに含む
<8>に記載の情報処理装置。
<15> 前記CIND判定文抽出部は、前記OOD判定文として分類されたコーパスを含むドメインにおいて、自ら以外の他のOOD判定文に分類されたコーパスに含まれる単語数が所定数より多いコーパスを、前記OOD判定文として分類された全コーパスより前記CIND判定文として抽出する
<14>に記載の情報処理装置。
<16> 前記CIND判定文抽出部は、前記ドメインのコーパスに含まれる単語数に対する、前記自ら以外の他のコーパスに含まれない単語数の割合で表される非出現性が所定数より低いコーパスを、前記CIND判定文として抽出する
<15>に記載の情報処理装置。
<17> 前記CIND判定文抽出部は、前記ドメインのコーパスのうち、TF(Term Frequency)値とIDF(Inverse Document Frequency)値とからなるTF/IDFで表される単語の非出現性が所定数より低いコーパスを、前記CIND判定文として抽出する
<15>に記載の情報処理装置。
<18> 前記CIND判定文抽出部は、前記ドメインのコーパスにおけるPerplexity値を算出し、前記Perplexity値が所定値よりも高いものを非文として廃棄する
<15>に記載の情報処理装置。
<19> 入力文の構造を解析し、
前記構造の解析結果に基づいて、前記入力文における置換箇所を設定し、
前記入力文における前記置換箇所の単語を置換してコーパスを生成する
ステップを含む情報処理方法。
<20> 入力文の構造を解析する構造解析部と、
前記構造解析部の解析結果に基づいて、前記入力文における置換箇所を設定する置換箇所設定部と、
前記入力文における前記置換箇所の単語を置換してコーパスを生成するコーパス生成部と
してコンピュータを機能させるプログラム。 The present disclosure can also have the following configurations.
<1> A structural analysis unit that analyzes the structure of input sentences;
A replacement part setting unit configured to set a replacement part in the input sentence based on an analysis result of the structure analysis unit;
An information processing apparatus including: a corpus generation unit that generates a corpus by replacing words in the replacement portion in the input sentence;
<2> The information processing apparatus according to <1>, wherein the input sentence is an IND (In Domain) determination sentence which is an utterance content to be handled by a predetermined application program.
<3> The structure analysis unit analyzes a predicate term structure of the input sentence. The replacement point setting unit is a replacement point in the input sentence based on the predicate term structure that is an analysis result of the structure analysis unit. The information processing apparatus according to <1> or <2>.
<4> The information processing apparatus further includes a dictionary query unit that queries a dictionary to search for a candidate for replacing the word of the replacement part in the input sentence,
The information processing apparatus according to any one of <3>, wherein the corpus generation unit replaces the word of the replacement portion in the input sentence with the word searched by the dictionary inquiry unit.
<5> The information processing apparatus according to <4>, wherein the dictionary is a case frame dictionary.
<6> The replacement point setting unit sets a replacement point in the input sentence and a replacement method of the replacement point based on the predicate term structure which is an analysis result of the structure analysis unit,
The information processing apparatus according to <4>, wherein the corpus generation unit generates a corpus by replacing the word of the replacement part in the input sentence with the replacement method.
<7> The substitution method fixes a predicate of the input sentence, and fixes a first term for replacing a noun which is a predicate term including a target case, and a predicate term including an object case of the input sentence The information processing apparatus according to <6>, further comprising: a second method of replacing a predicate.
<8> The information processing apparatus according to any one of <1> to <7>, further including a classification unit that classifies the corpus generated by the corpus generation unit as either an IND (In Domain) determination sentence, which is utterance content to be handled by a predetermined application program, or an OOD (Out of Domain) determination sentence, which is unexpected utterance content that should not be handled by the predetermined application program.
<9> The information processing apparatus according to <8>, further including a COOD determination sentence extraction unit that extracts, as a COOD (Close OOD) determination sentence, from the corpora classified as the IND determination sentence, a corpus that is the OOD determination sentence and exists near the boundary with the IND determination sentence in the feature space expressed by their respective features.
<10> The information processing apparatus according to <9>, wherein, in a domain containing the corpora classified as the IND determination sentence, the COOD determination sentence extraction unit extracts from the domain, as the COOD determination sentence, a corpus whose number of words not contained in its own or other corpora is larger than a predetermined number.
<11> The information processing apparatus according to <10>, wherein the COOD determination sentence extraction unit extracts from the domain, as the COOD determination sentence, a corpus whose non-appearance, expressed as the ratio of the number of words not contained in its own or other corpora to the number of words contained in the corpus of the domain, is higher than a predetermined value.
<12> The information processing apparatus according to <10>, wherein the COOD determination sentence extraction unit extracts from the domain, as the COOD determination sentence, a corpus of the domain that contains many words whose non-appearance, expressed as the TF/IDF of the word composed of a TF (Term Frequency) value and an IDF (Inverse Document Frequency) value, is lower than a predetermined value.
<13> The information processing apparatus according to <10>, wherein the COOD determination sentence extraction unit calculates a Perplexity value for the corpus of the domain and discards, as a non-sentence, a corpus whose Perplexity value is higher than a predetermined value.
<14> The information processing apparatus according to <8>, further including a CIND determination sentence extraction unit that extracts, as a CIND (Close IND) determination sentence, from the corpora classified as the OOD determination sentence, a corpus that is the IND determination sentence and exists near the boundary with the OOD determination sentence in the feature space expressed by their respective features.
<15> The information processing apparatus according to <14>, wherein, in a domain containing the corpora classified as the OOD determination sentence, the CIND determination sentence extraction unit extracts, as the CIND determination sentence, from all the corpora classified as the OOD determination sentence, a corpus whose number of words contained in corpora classified as OOD determination sentences other than itself is larger than a predetermined number.
<16> The information processing apparatus according to <15>, wherein the CIND determination sentence extraction unit extracts, as the CIND determination sentence, a corpus whose non-appearance, expressed as the ratio of the number of words not contained in corpora other than itself to the number of words contained in the corpus of the domain, is lower than a predetermined number.
<17> The information processing apparatus according to <15>, wherein the CIND determination sentence extraction unit extracts, as the CIND determination sentence, from the corpora of the domain, a corpus whose non-appearance of words, expressed as the TF/IDF composed of a TF (Term Frequency) value and an IDF (Inverse Document Frequency) value, is lower than a predetermined number.
<18> The information processing apparatus according to <15>, wherein the CIND determination sentence extraction unit calculates a Perplexity value for the corpus of the domain and discards, as a non-sentence, a corpus whose Perplexity value is higher than a predetermined value.
<19> An information processing method including the steps of:
analyzing the structure of an input sentence;
setting a replacement point in the input sentence based on the result of the analysis of the structure; and
generating a corpus by replacing the word at the replacement point in the input sentence.
<20> A program that causes a computer to function as:
a structure analysis unit that analyzes the structure of an input sentence;
a replacement point setting unit that sets a replacement point in the input sentence based on an analysis result of the structure analysis unit; and
a corpus generation unit that generates a corpus by replacing the word at the replacement point in the input sentence.
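To make configurations <1> to <7> concrete, the following is a minimal sketch, in Python, of corpus generation by substitution. It assumes the predicate-argument analysis of the input sentence has already been performed and uses a tiny hand-written stand-in for the case frame dictionary; the parsed structure, the dictionary contents, and the English example sentences are hypothetical and only illustrate the two substitution methods of configuration <7> (fix the predicate and replace the target-case noun, or fix the noun and replace the predicate).

```python
# Minimal sketch of configurations <1>-<7>: generate corpus variants from one
# IND sentence by substituting either the target-case noun (first method, the
# predicate stays fixed) or the predicate (second method, the noun stays
# fixed). The parsed structure and the tiny "case frame dictionary" below are
# hypothetical stand-ins for a parser's output and a real case frame resource.

# Pretend output of the structure-analysis step for the input "play music":
parsed = {"predicate": "play", "object": "music"}

# Hypothetical case frame dictionary: nouns observed in each predicate's
# object slot.
case_frames = {
    "play":  {"object": ["music", "a song", "the radio", "a video"]},
    "stop":  {"object": ["music", "the radio"]},
    "pause": {"object": ["music", "a video"]},
}

def substitute(parsed, case_frames):
    """Return candidate sentences produced by the two substitution methods."""
    pred, obj = parsed["predicate"], parsed["object"]
    candidates = set()

    # First method: fix the predicate, replace the object-case noun.
    for noun in case_frames.get(pred, {}).get("object", []):
        candidates.add(f"{pred} {noun}")

    # Second method: fix the object-case noun, replace the predicate.
    for other_pred, frames in case_frames.items():
        if obj in frames.get("object", []):
            candidates.add(f"{other_pred} {obj}")

    candidates.discard(f"{pred} {obj}")  # drop the original sentence itself
    return sorted(candidates)

print(substitute(parsed, case_frames))
# e.g. ['pause music', 'play a song', 'play a video', 'play the radio', 'stop music']
```

A full implementation would obtain the predicate-argument structure from a parser, draw candidate words from a large case frame dictionary, and pass the generated sentences on to the duplicate-exclusion and filtering stages described in the reference signs list below.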
51 corpus generation apparatus, 101 IND sentence reception unit, 102 language analysis unit, 103 replacement point setting unit, 104 dictionary inquiry unit, 105 replacement execution unit, 106 double sentence exclusion unit, 107 case frame dictionary, 108 generation condition setting data storage unit, 109 replacement-generated sentence storage unit, 110 filtering processing unit, 131 semantic analyzer, 132 IND determination sentence storage unit, 133 COOD corpus extraction unit, 133a non-sentence determination unit, 133b non-appearance determination unit, 134 COOD determination sentence storage unit, 135 confirmed IND determination sentence storage unit, 136 OOD determination sentence storage unit, 137 CIND corpus extraction unit, 137a non-sentence determination unit, 137b non-appearance determination unit, 138 CIND determination sentence storage unit, 139 confirmed OOD determination sentence storage unit
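The reference signs above include non-appearance determination units (133b, 137b) inside the COOD and CIND corpus extraction units. The snippet below is a minimal sketch of the non-appearance ratio described in configurations <10> and <11>: the fraction of a candidate sentence's words that never occur in the IND corpora of its own or other domains, with a high ratio marking the candidate as a COOD (Close OOD) candidate. The toy domain corpora, the threshold, and the example sentences are hypothetical.

```python
# Minimal sketch of the non-appearance test used by the COOD extraction step
# (configurations <10>-<11>): a generated sentence whose words largely do not
# occur in the IND corpora of its own or other domains is far from the IND
# region and is kept as a COOD (Close OOD) candidate.

def vocabulary(corpus):
    """Set of all words appearing in a list of sentences."""
    return {w for sentence in corpus for w in sentence.lower().split()}

def non_appearance_ratio(sentence, known_vocab):
    """Fraction of the sentence's words that never occur in the IND corpora."""
    words = sentence.lower().split()
    unseen = [w for w in words if w not in known_vocab]
    return len(unseen) / len(words) if words else 0.0

# Hypothetical IND corpora for two domains handled by the same agent.
ind_corpora = {
    "music":   ["play music", "stop the music", "play a song"],
    "weather": ["tell me the weather", "will it rain today"],
}
known_vocab = set().union(*(vocabulary(c) for c in ind_corpora.values()))

THRESHOLD = 0.5  # hypothetical predetermined value

for candidate in ["play the news", "book a table at the restaurant"]:
    r = non_appearance_ratio(candidate, known_vocab)
    label = "COOD candidate" if r > THRESHOLD else "close to IND"
    print(f"{candidate!r}: non-appearance={r:.2f} -> {label}")
```

The TF/IDF variant of configurations <12> and <17> could score each word instead of taking a simple membership test, but the thresholding idea is the same.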
Claims (20)
- An information processing apparatus including: a structure analysis unit that analyzes the structure of an input sentence; a replacement point setting unit that sets a replacement point in the input sentence based on an analysis result of the structure analysis unit; and a corpus generation unit that generates a corpus by replacing the word at the replacement point in the input sentence.
- The information processing apparatus according to claim 1, wherein the input sentence is an IND (In Domain) determination sentence, which is utterance content to be handled by a predetermined application program.
- The information processing apparatus according to claim 1, wherein the structure analysis unit analyzes a predicate-argument structure of the input sentence, and the replacement point setting unit sets a replacement point in the input sentence based on the predicate-argument structure that is the analysis result of the structure analysis unit.
- The information processing apparatus according to claim 3, further including a dictionary inquiry unit that searches a dictionary for candidate words to replace the word at the replacement point in the input sentence, wherein the corpus generation unit replaces the word at the replacement point in the input sentence with a word retrieved by the dictionary inquiry unit.
- The information processing apparatus according to claim 4, wherein the dictionary is a case frame dictionary.
- The information processing apparatus according to claim 4, wherein the replacement point setting unit sets a replacement point in the input sentence and a substitution method for the replacement point based on the predicate-argument structure that is the analysis result of the structure analysis unit, and the corpus generation unit generates a corpus by replacing the word at the replacement point in the input sentence according to the substitution method.
- The information processing apparatus according to claim 6, wherein the substitution method includes a first method that fixes the predicate of the input sentence and replaces a noun serving as a predicate argument including the target case, and a second method that fixes the predicate argument including the target case of the input sentence and replaces the predicate.
- The information processing apparatus according to claim 1, further including a classification unit that classifies the corpus generated by the corpus generation unit as either an IND (In Domain) determination sentence, which is utterance content to be handled by a predetermined application program, or an OOD (Out of Domain) determination sentence, which is unexpected utterance content that should not be handled by the predetermined application program.
- The information processing apparatus according to claim 8, further including a COOD determination sentence extraction unit that extracts, as a COOD (Close OOD) determination sentence, from the corpora classified as the IND determination sentence, a corpus that is the OOD determination sentence and exists near the boundary with the IND determination sentence in the feature space expressed by their respective features.
- The information processing apparatus according to claim 9, wherein, in a domain containing the corpora classified as the IND determination sentence, the COOD determination sentence extraction unit extracts from the domain, as the COOD determination sentence, a corpus whose number of words not contained in its own or other corpora is larger than a predetermined number.
- The information processing apparatus according to claim 10, wherein the COOD determination sentence extraction unit extracts from the domain, as the COOD determination sentence, a corpus whose non-appearance, expressed as the ratio of the number of words not contained in its own or other corpora to the number of words contained in the corpus of the domain, is higher than a predetermined value.
- The information processing apparatus according to claim 10, wherein the COOD determination sentence extraction unit extracts from the domain, as the COOD determination sentence, a corpus of the domain that contains many words whose non-appearance, expressed as the TF/IDF of the word composed of a TF (Term Frequency) value and an IDF (Inverse Document Frequency) value, is lower than a predetermined value.
- The information processing apparatus according to claim 10, wherein the COOD determination sentence extraction unit calculates a Perplexity value for the corpus of the domain and discards, as a non-sentence, a corpus whose Perplexity value is higher than a predetermined value.
- The information processing apparatus according to claim 8, further including a CIND determination sentence extraction unit that extracts, as a CIND (Close IND) determination sentence, from the corpora classified as the OOD determination sentence, a corpus that is the IND determination sentence and exists near the boundary with the OOD determination sentence in the feature space expressed by their respective features.
- The information processing apparatus according to claim 14, wherein, in a domain containing the corpora classified as the OOD determination sentence, the CIND determination sentence extraction unit extracts, as the CIND determination sentence, from all the corpora classified as the OOD determination sentence, a corpus whose number of words contained in corpora classified as OOD determination sentences other than itself is larger than a predetermined number.
- The information processing apparatus according to claim 15, wherein the CIND determination sentence extraction unit extracts, as the CIND determination sentence, a corpus whose non-appearance, expressed as the ratio of the number of words not contained in corpora other than itself to the number of words contained in the corpus of the domain, is lower than a predetermined number.
- The information processing apparatus according to claim 15, wherein the CIND determination sentence extraction unit extracts, as the CIND determination sentence, from the corpora of the domain, a corpus whose non-appearance of words, expressed as the TF/IDF composed of a TF (Term Frequency) value and an IDF (Inverse Document Frequency) value, is lower than a predetermined number.
- The information processing apparatus according to claim 15, wherein the CIND determination sentence extraction unit calculates a Perplexity value for the corpus of the domain and discards, as a non-sentence, a corpus whose Perplexity value is higher than a predetermined value.
- An information processing method including the steps of: analyzing the structure of an input sentence; setting a replacement point in the input sentence based on the result of the analysis of the structure; and generating a corpus by replacing the word at the replacement point in the input sentence.
- A program that causes a computer to function as: a structure analysis unit that analyzes the structure of an input sentence; a replacement point setting unit that sets a replacement point in the input sentence based on an analysis result of the structure analysis unit; and a corpus generation unit that generates a corpus by replacing the word at the replacement point in the input sentence.
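Claims 13 and 18 discard generated candidates whose Perplexity value under a language model exceeds a predetermined value, treating them as non-sentences. The sketch below illustrates one way such a check could work, using an add-one-smoothed bigram model; the claims do not prescribe a particular language model, and the training sentences, candidates, and threshold here are hypothetical.

```python
# Minimal sketch of the non-sentence test in claims 13 and 18: score each
# generated candidate with a language model trained on accepted sentences and
# discard candidates whose perplexity exceeds a predetermined value. The
# add-one-smoothed bigram model below is illustrative only.

import math
from collections import Counter

def train_bigram(corpus):
    """Count unigrams and bigrams over sentences padded with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    vocab_size = len({w for s in corpus for w in s.lower().split()} | {"<s>", "</s>"})
    return unigrams, bigrams, vocab_size

def perplexity(sentence, unigrams, bigrams, vocab_size):
    """Perplexity of a sentence under the add-one-smoothed bigram model."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    log_prob = 0.0
    for prev, curr in zip(tokens[:-1], tokens[1:]):
        # Add-one (Laplace) smoothing gives unseen bigrams a small probability.
        p = (bigrams[(prev, curr)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(tokens) - 1))

training = ["play music", "play a song", "stop the music", "play the radio"]
model = train_bigram(training)

THRESHOLD = 8.0  # hypothetical predetermined value
for candidate in ["play the music", "music the play stop a"]:
    pp = perplexity(candidate, *model)
    verdict = "keep" if pp <= THRESHOLD else "discard as non-sentence"
    print(f"{candidate!r}: perplexity={pp:.1f} -> {verdict}")
```

In practice the language model would be trained on a much larger body of accepted IND sentences, and the threshold would be tuned so that only clearly ungrammatical substitution results are discarded.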
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2019532489A JPWO2019021804A1 (en) | 2017-07-24 | 2018-07-10 | Information processing apparatus, information processing method, and program |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2017-142620 | 2017-07-24 | | |
JP2017142620 | 2017-07-24 | | |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019021804A1 (en) | 2019-01-31 |
Family
ID=65040081
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2018/025959 WO2019021804A1 (en) | 2017-07-24 | 2018-07-10 | Information processing device, information processing method, and program |
Country Status (2)
Country | Link |
---|---|
JP (1) | JPWO2019021804A1 (en) |
WO (1) | WO2019021804A1 (en) |
- 2018-07-10: WO application PCT/JP2018/025959 (published as WO2019021804A1), active, Application Filing
- 2018-07-10: JP application JP2019532489A (published as JPWO2019021804A1), not active, Abandoned
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2012507809A (en) * | 2008-11-05 | 2012-03-29 | Google Inc. | Custom language model |
JP2015075952A (en) * | 2013-10-09 | 2015-04-20 | Nippon Telegraph and Telephone Corporation | Speech generation device, method, and program |
Non-Patent Citations (2)
Title |
---|
TRAINING OF ROBUST LANGUAGE MODELS BY AUTOMATIC SENTENCE GENERATION BASED ON WORD REPLACING WORDS WITH RESPECT TO THEIR CONTEXTS, 15 June 2010 (2010-06-15), pages 1 - 6 * |
YAMAGIWA, AYAKO ET AL: "Study on multivalued classification by ECOC-SVM aiming to support vector", PROCEEDINGS OF 2016 SPRING CONFERENCE OF JAPAN INDUSTRIAL MANAGEMENT ASSOCIATION, 28 May 2016 (2016-05-28), pages 78 - 79 * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2020201445A (en) * | 2019-06-13 | 2020-12-17 | Hitachi, Ltd. | Computer system, model generation method and model management program |
JP7261096B2 (en) | 2019-06-13 | 2023-04-19 | Hitachi, Ltd. | Computer system, model generation method and model management program |
JP2021047783A (en) * | 2019-09-20 | 2021-03-25 | Hitachi, Ltd. | Information processing method and information processing apparatus |
JP7316165B2 (en) | 2019-09-20 | 2023-07-27 | Hitachi, Ltd. | Information processing method and information processing device |
WO2021149206A1 (en) * | 2020-01-22 | 2021-07-29 | Nippon Telegraph and Telephone Corporation | Generation device, generation method, and generation program |
JPWO2021149206A1 (en) * | 2020-01-22 | | |
JP7327523B2 (en) | 2020-01-22 | 2023-08-16 | Nippon Telegraph and Telephone Corporation | Generation device, generation method and generation program |
Also Published As
Publication number | Publication date |
---|---|
JPWO2019021804A1 (en) | 2020-05-28 |
Similar Documents
Publication | Title |
---|---|
CN111324728B (en) | Text event abstract generation method and device, electronic equipment and storage medium | |
US6243670B1 (en) | Method, apparatus, and computer readable medium for performing semantic analysis and generating a semantic structure having linked frames | |
JP3429184B2 (en) | Text structure analyzer, abstracter, and program recording medium | |
CN108875059B (en) | Method and device for generating document tag, electronic equipment and storage medium | |
CN109492109B (en) | Information hotspot mining method and device | |
US20050080613A1 (en) | System and method for processing text utilizing a suite of disambiguation techniques | |
US20030046078A1 (en) | Supervised automatic text generation based on word classes for language modeling | |
RU2679988C1 (en) | Extracting information objects with the help of a classifier combination | |
CN108304375A (en) | A kind of information identifying method and its equipment, storage medium, terminal | |
US20190392035A1 (en) | Information object extraction using combination of classifiers analyzing local and non-local features | |
CN103678684A (en) | Chinese word segmentation method based on navigation information retrieval | |
CN1495641B (en) | Method and device for converting speech character into text character | |
Tomašic et al. | Implementation of a slogan generator | |
CN107943940A (en) | Data processing method, medium, system and electronic equipment | |
WO2019021804A1 (en) | Information processing device, information processing method, and program | |
CN114817465A (en) | Entity error correction method and intelligent device for multi-language semantic understanding | |
KR20220068937A (en) | Standard Industrial Classification Based on Machine Learning Approach | |
JP5426292B2 (en) | Opinion classification device and program | |
CN114970516A (en) | Data enhancement method and device, storage medium and electronic equipment | |
CN115062135A (en) | Patent screening method and electronic equipment | |
Banerjee et al. | Generating abstractive summaries from meeting transcripts | |
KR102661438B1 (en) | Web crawler system that collect Internet articles and provides a summary service of issue article affecting the global value chain | |
CN112151021A (en) | Language model training method, speech recognition device and electronic equipment | |
CN111949781B (en) | Intelligent interaction method and device based on natural sentence syntactic analysis | |
JP2004220226A (en) | Document classification method and device for retrieved document |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 18837253; Country of ref document: EP; Kind code of ref document: A1 |
| ENP | Entry into the national phase | Ref document number: 2019532489; Country of ref document: JP; Kind code of ref document: A |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 18837253; Country of ref document: EP; Kind code of ref document: A1 |