WO2024164976A1

WO2024164976A1 - Sample construction method and apparatus, and electronic device and readable storage medium

Info

Publication number: WO2024164976A1
Application number: PCT/CN2024/075789
Authority: WO
Inventors: 王承之
Original assignee: 维沃移动通信有限公司
Priority date: 2023-02-08
Filing date: 2024-02-04
Publication date: 2024-08-15
Also published as: CN116089569A

Abstract

The present application belongs to the technical field of artificial intelligence. Disclosed are a sample construction method and apparatus, and an electronic device and a readable storage medium. The method comprises: acquiring a parallel corpus training sample, wherein the parallel corpus training sample comprises original text and carries a specification type label corresponding to each keyword in the original text; replacing a first keyword in the original text with at least one first specification non-conformance word corresponding to the first keyword, so as to generate at least one piece of extended text; replacing a first specification type label corresponding to the first keyword with a second specification type label corresponding to the first specification non-conformance word, so as to obtain a parallel corpus training sample in which the labels are replaced; and constructing a target training sample on the basis of the parallel corpus training sample in which the labels are replaced, and the at least one piece of extended text.

Description

Sample construction method, device, electronic device and readable storage medium

Cross-references

This application claims priority to a Chinese patent application filed with the China Patent Office on February 8, 2023, with application number 202310085121.X and application name “Sample construction method, device, electronic device and readable storage medium”. The entire contents of the application are incorporated by reference into this application.

Technical Field

The present application belongs to the field of artificial intelligence technology, and specifically relates to a sample construction method, device, electronic device and readable storage medium.

Background Art

With the development of computer performance and Internet technology, existing translation methods usually use large-scale bilingual parallel corpora to train translation models and generate translations based on the distribution of real corpora in the text to be translated.

However, since parallel corpus training samples are often composed of high-quality standard texts, the translation model trained with the parallel corpus training samples can only translate standard texts. When translating texts containing non-standard words, the overall translation accuracy is low.

Therefore, how to construct richer parallel corpus training samples is an urgent problem to be solved in this application.

Summary of the invention

The purpose of the embodiments of the present application is to provide a sample construction method, device, electronic device and readable storage medium, which can solve the problem of how to construct richer parallel corpus training samples.

In a first aspect, an embodiment of the present application provides a sample construction method, which includes: obtaining a parallel corpus training sample, the parallel corpus training sample contains an original text and carries a standard type label corresponding to each keyword in the original text; replacing a first keyword in the original text with at least one first non-standard word corresponding to the first keyword to generate at least one extended text; replacing a first standard type label corresponding to the first keyword with a second standard type label corresponding to the first non-standard word to obtain a parallel corpus training sample after label replacement; and constructing a target training sample based on the parallel corpus training sample after label replacement and at least one extended text.

In a second aspect, an embodiment of the present application provides a sample construction device, which includes an acquisition module, a processing module and a construction module. The acquisition module is used to acquire a parallel corpus training sample, where the parallel corpus training sample contains an original text and carries a standard type label corresponding to each keyword in the original text. The processing module is used to replace the first keyword in the original text of the parallel corpus training sample acquired by the acquisition module with at least one first non-standard word corresponding to the first keyword, to generate at least one extended text. The processing module is further used to replace the first standard type label corresponding to the first keyword in the parallel corpus training sample acquired by the acquisition module with the second standard type label corresponding to the first non-standard word, to obtain the parallel corpus training sample after label replacement. The construction module is used to construct the target training sample based on the parallel corpus training sample after label replacement produced by the processing module and the at least one extended text.

In a third aspect, an embodiment of the present application provides an electronic device, which includes a processor and a memory, wherein the memory stores programs or instructions that can be run on the processor, and when the program or instructions are executed by the processor, the steps of the method described in the first aspect are implemented.

In a fourth aspect, an embodiment of the present application provides a readable storage medium, on which a program or instruction is stored, and when the program or instruction is executed by a processor, the steps of the method described in the first aspect are implemented.

In a fifth aspect, an embodiment of the present application provides a chip, comprising a processor and a communication interface, wherein the communication interface is coupled to the processor, and the processor is used to run a program or instruction to implement the method described in the first aspect.

In a sixth aspect, an embodiment of the present application provides a computer program product, which is stored in a storage medium and is executed by at least one processor to implement the method described in the first aspect.

In an embodiment of the present application, a parallel corpus training sample is obtained, where the parallel corpus training sample includes an original text and carries a standard type label corresponding to each keyword in the original text; the first keyword in the original text is replaced with at least one first non-standard word corresponding to the first keyword to generate at least one extended text; the first standard type label corresponding to the first keyword is replaced with a second standard type label corresponding to the first non-standard word to obtain a parallel corpus training sample after label replacement; and a target training sample is constructed based on the parallel corpus training sample after label replacement and the at least one extended text. With this scheme, the sample construction device replaces keywords in the original text of the parallel corpus training sample to generate at least one extended text, which expands the vocabulary covered by the parallel corpus training sample. At the same time, the standard type label corresponding to the keyword is replaced with the standard type label corresponding to the non-standard word, and the resulting parallel corpus training sample after label replacement enriches the content contained in the parallel corpus training sample. Finally, the sample construction device constructs a target training sample based on the parallel corpus training sample after label replacement and the at least one extended text. The target training sample can therefore include non-standard words and their corresponding standard type labels, which gives the parallel corpus training samples richer and more flexible training content.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic diagram of an example of non-standard words provided in an embodiment of the present application;

FIG. 2 is a flow chart of a sample construction method provided in an embodiment of the present application;

FIG. 3 is a first example schematic diagram of a sample construction method provided in an embodiment of the present application;

FIG. 4 is a second example schematic diagram of a sample construction method provided in an embodiment of the present application;

FIG. 5 is a third example schematic diagram of a sample construction method provided in an embodiment of the present application;

FIG. 6 is a flow chart of translation performed by a translation model provided in an embodiment of the present application;

FIG. 7 is a schematic diagram of the structure of a sample construction device provided in an embodiment of the present application;

FIG. 8 is a first schematic diagram of the hardware structure of an electronic device provided in an embodiment of the present application;

FIG. 9 is a second schematic diagram of the hardware structure of an electronic device provided in an embodiment of the present application.

DETAILED DESCRIPTION

The technical solutions in the embodiments of the present application are described clearly below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are some, rather than all, of the embodiments of the present application. All other embodiments obtained by persons of ordinary skill in the art based on the embodiments in the present application fall within the scope of protection of this application.

The terms "first", "second", etc. in the specification and claims of the present application are used to distinguish similar objects, and are not used to describe a specific order or sequence. It should be understood that the terms used in this way can be interchangeable under appropriate circumstances, so that the embodiments of the present application can be implemented in an order other than those illustrated or described here, and the objects distinguished by "first", "second", etc. are generally of one type, and the number of objects is not limited. For example, the first object can be one or more. In addition, "and/or" in the specification and claims represents at least one of the connected objects, and the character "/" generally indicates that the objects associated before and after are in an "or" relationship.

Some terms/nouns involved in the embodiments of the present application are explained below.

1. Cognate characters/words: Languages or scripts from closely related branches often share many characters/words with the same linguistic origin. These characters/words have similar pronunciations, spellings or meanings, and their character forms are easily confused. Examples include Chinese and Japanese, both written in Chinese characters (for example, "荣耀" and "栄尊"), English and German, both belonging to the West Germanic branch (for example, "popular" and its German cognate), and simplified and traditional Chinese. Due to input errors and other reasons, words in the text to be translated may be replaced by cognates, which lowers the quality of the translated text.

2. Kana: A phonetic writing system of Japanese. There are two writing systems: Hiragana and Katakana. The two can be converted into each other. Each Kana represents a syllable. Kanji in Japanese can be transcribed into Kana according to their pronunciation, similar to the pinyin of Chinese. At the same time, Kana is also a written language of Japanese, used to represent inherent vocabulary and grammatical auxiliary words in Japanese.

3. Kanji: Kanji used in Japanese, together with kana, constitute the written language of Japanese, and are often used to represent the names of objects or actions, etc. There are about 2000-3000 commonly used Kanji in modern Japanese. Their shapes are the same as those of Chinese characters, and there are certain intersections and differences with simplified and traditional characters.

4. Original text: The original text to be translated. The specific language of the original text is not restricted.

5. Translation: The result of translating the original text through the translation model. There is no restriction on the specific language of the translation.

6. Language model: A model used to calculate the probability of a sentence (i.e., the probability that a sequence of words can form a normal sentence). Its core is to calculate the probability of the current word appearing by the first n words in the sentence. Perplexity is usually used as an evaluation index.

7. Perplexity: An indicator for evaluating the quality of a sentence. The higher the perplexity, the more difficult it is to understand the sentence, that is, the less likely it is to be a fluent and semantically correct sentence.

8. Morphology: The study of words in sentences, including their structure, morphology and parts of speech, such as nouns, adjectives, adverbs, singular and plural in English, etc.

9. Syntactic structure: the relationship between sentence components and the rules or processes by which they form sentences, such as the common "subject-predicate-object" structure.

10. Sequence labeling: Given a sentence, label each word in the sentence, or predict the category label of the word.

11. Word segmentation: a type of sequence labeling task. For languages such as Chinese and Japanese where there are no spaces between words when writing, the word segmentation model can segment sentences at the word level and predict the category labels such as the lexical and syntactic structures of the words. The word segmentation model trained in this solution also involves predicting the extended forms of words that do not conform to the standard (for example: pronunciation spelling, cognates, easily confused words, etc.).
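Terms 6 and 7 above can be illustrated with a short sketch. The application does not state a formula for perplexity; the code below uses the common definition (the exponential of the average negative log-probability per token), and the probability values are hypothetical and chosen only for illustration.

```python
import math

def perplexity(token_probs):
    """Perplexity from per-token conditional probabilities.

    Assumes the common definition exp(-(1/N) * sum(log p_i));
    this formula is not quoted from the application itself.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-likelihood per token, then exponentiate.
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Hypothetical probabilities a language model might assign to each word.
fluent_sentence = perplexity([0.5, 0.5, 0.5, 0.5])
odd_sentence = perplexity([0.5, 0.01, 0.5, 0.5])  # one very unlikely word
```

A single improbable word raises the perplexity of the whole sentence, which matches the statement that higher perplexity indicates a less fluent sentence.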

The sample construction method, device, electronic device and readable storage medium provided in the embodiments of the present application are described in detail below with reference to the accompanying drawings through specific embodiments and their application scenarios.

Existing machine translation methods usually use large-scale bilingual parallel corpus training samples to train translation models and generate translations based on the distribution of real corpus.

However, since the original text in the parallel corpus training samples is usually high-quality standard text, standardization problems such as words being transcribed into pronunciation spelling or replaced by cognates rarely occur in it. The translation model is therefore rarely exposed to such non-standard words and cannot translate them accurately. In some specific scenarios, though, the text input into the translation model may contain non-standard words whose expressions do not conform to conventional grammar. For example, in the language education scenario, as shown in Figure 1, words in the text may be transcribed into the pronunciation spelling form of the language (such as Chinese pinyin or Japanese kana) for teaching or examinations. Incorrect input by users when typing may also introduce pronunciation spelling, typos, cognates and the like into the text to be translated. In tasks such as image translation and speech translation, the recognition results of front-end modules such as image text recognition and speech recognition may contain glyph similarity errors, transcoding errors and similar problems, which may also cause the downstream translation model to receive non-standard text. Because text sequences containing these irregular or erroneous words are uncommon, that is, their expressions do not conform to conventional grammatical, lexical or syntactic structures, the translation model usually finds it difficult to translate such words correctly.

Take Japanese as an example. On the one hand, Japanese has two writing systems: kana and kanji. Japanese kanji are highly similar to Chinese characters, and have certain overlaps and differences with simplified Chinese characters (hereinafter referred to as Simplified Chinese) and traditional Chinese characters (hereinafter referred to as Traditional Chinese), as shown in Table 1. When Chinese users input Japanese, they may replace kanji with cognate characters that do not exist in Japanese, or with typos that have similar glyphs, for convenience or because the glyphs are easily confused. This may cause the model to make translation errors.

Table 1

On the other hand, Japanese kana can carry meaning on its own in written expressions, and can also be used to spell the pronunciation of kanji. In online texts such as those on social platforms, many users, to save effort, do not write the standard kanji but directly replace them with their kana pronunciation, as shown in Figure 1. However, kana with the same pronunciation can correspond to many different meanings, which produces many non-standard expressions of Japanese kanji. In addition, since there are no spaces between words in written Japanese, and the character set used for kana transcription completely overlaps with that of normal text, it is difficult for existing methods to correctly identify and segment the non-standard kana words in a sentence when a large number of kanji in the text are transcribed into kana. Japanese also has a large number of homophones, and the same kana pronunciation may correspond to multiple different kanji, as shown in Table 2.

Table 2

Since most of the training corpora of existing text translation methods are standardized corpora, when inputting text with non-standard expressions, the translation model often outputs the transliteration of these words, or even random translation, resulting in inaccurate translation.

In the sample construction method provided in the embodiment of the present application, the sample construction device replaces keywords in the original text of the parallel corpus training sample to generate at least one extended text, which expands the vocabulary covered by the parallel corpus training sample. At the same time, the standard type label corresponding to the keyword is replaced with the standard type label corresponding to the non-standard word, and the resulting parallel corpus training sample after label replacement enriches the content contained in the parallel corpus training sample. Finally, the sample construction device constructs a target training sample based on the parallel corpus training sample after label replacement and the at least one extended text. The target training sample can therefore contain non-standard words and their corresponding standard type labels, which gives the parallel corpus training sample richer and more flexible training content.

The sample construction method provided in the embodiment of the present application may be executed by a sample construction device. For example, the sample construction device may be an electronic device, or a component in the electronic device, such as an integrated circuit or a chip. The sample construction method provided in the embodiment of the present application will be described below by taking the sample construction device as an example.

An embodiment of the present application provides a sample construction method. Figure 2 shows a flow chart of the sample construction method provided by the embodiment of the present application, and the method can be executed by a sample construction device. As shown in Figure 2, the sample construction method provided by the embodiment of the present application can include the following steps 201 to 204.

Step 201: Obtain parallel corpus training samples.

The parallel corpus training sample may include the original text and carry the standard type label corresponding to each keyword in the original text.

In the embodiment of the present application, the parallel corpus training sample may be a bilingual or multilingual corpus consisting of an original text and its parallel corresponding target text.

Optionally, the original text may be a text that does not contain non-standard words.

Optionally, the above keyword can be any word in the original text.

Optionally, the above-mentioned standard type label may indicate the standard type of the keyword.

It can be understood that, on the one hand, due to the presence of a large number of homophones in the same language, the extended form of these words may be the same as other standard words in the standard vocabulary, for example, "さくら" can be the kana transcription of the surname "佐倉 (佐仓)", and can also represent the noun "樱花", so it is difficult to identify all non-standard words by the method of rules. On the other hand, due to the different rules of text sequences between different languages, for example, there is no space between the words in Japanese, when a large number of Chinese characters are transcribed as kana in the text to be translated, the method of rules is also difficult to accurately identify the boundary between words and words, so it is difficult to accurately translate all words in the text to be translated by the method of rules. Therefore, the sample construction device in the sample construction method provided in the embodiment of the present application can use the text data (i.e., original text) marked with information such as lexical and syntactic structures, and on this basis, increase the standard type label corresponding to the keyword.

Step 202: Replace the first keyword in the original text with at least one first non-standard word corresponding to the first keyword to generate at least one extended text.

Optionally, the sample construction device may replace any keyword in the plain original text with at least one first non-standard word corresponding thereto, to obtain a plurality of extended texts with the same semantics but different standardization levels.

Optionally, other annotation information such as part of speech, syntactic structure, etc. of the extended text may be kept consistent with the annotation information of the original text.

Optionally, the non-standard words mentioned above may be words whose expressions do not conform to conventional grammatical, lexical or syntactic structures.

Optionally, the above-mentioned non-standard words may include at least one of the following situations: including pronunciation spelling, including typos, including homologous word replacement, and including glyph errors.

Optionally, "replacing the first keyword in the original text with at least one first non-standard word corresponding to the first keyword" can be understood as: replacing the compliant keyword with a non-standard word that is of the same origin, has the same/similar pronunciation, or has a similar glyphic expression and does not conform to conventional grammar, morphology, or syntax structure.

For example, if the original text contains the keyword "realm", the sample construction device can replace it with "church", which has the same pronunciation, or with the non-standard word "きょうかい (bianjie)".
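The replacement in step 202 can be sketched as follows. This is a minimal illustration rather than the applicant's implementation: the function name `generate_extended_texts` and the pre-segmented token list are assumptions, and the example words are taken from the FIG. 3 example in this description.

```python
def generate_extended_texts(tokens, keyword, variants):
    """Return one extended token list per non-standard variant of `keyword`.

    `tokens` is the segmented original text; `variants` lists the
    non-standard words associated with `keyword` (hypothetical input,
    for example drawn from an extended word list).
    """
    extended = []
    for variant in variants:
        # Copy the original text, swapping the keyword for this variant.
        extended.append([variant if tok == keyword else tok for tok in tokens])
    return extended

# Keywords only, not a full sentence; used here purely for illustration.
original = ["とても", "頼もしく", "優しい"]
texts = generate_extended_texts(original, "頼もしく", ["たのもしく", "賴もしく"])
```

Each variant yields its own extended text, so one original text can produce several texts with the same semantics but different degrees of standardization.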

Optionally, the parallel corpus training sample may be a parallel corpus training sample in a parallel corpus training sample set. The above step 202 may include the following step 202a.

Step 202a: based on the word frequency of each keyword in the original text in the parallel corpus training sample set, determine at least one first keyword from the original text, and replace each first keyword in the at least one first keyword in the original text with the first non-standard word corresponding to each keyword to generate a first extended text.

The first extended text is any extended text among the at least one extended text mentioned above.

Optionally, the sample construction device may replace the keywords in the original text based on the word frequency of each keyword in the original text in the parallel corpus training sample set.

It can be understood that a word with a high frequency is more likely to be replaced.

Specifically, the first keyword in the original text may be replaced with its corresponding first non-standard word according to its word frequency setting in the parallel corpus training sample set.

For example, as shown in FIG. 3, the keywords "とても (very)", "頼もしく (trustworthy)" and "優しい (gentle)" are replaced according to their word frequencies in the parallel corpus training sample set. To obtain extended text 1, "頼もしく" is replaced with the pronunciation spelling form "たのもしく" (that is, its standard type label is pronunciation spelling-Hiragana), and "優しい" is replaced with the pronunciation spelling form "やさしい" (that is, its standard type label is pronunciation spelling-Hiragana). To obtain extended text 2, "とても" is replaced with the pronunciation spelling form "トテモ (feichangde)" (that is, its standard type label is pronunciation spelling-Katakana), "頼もしく" is replaced with the cognate form "賴もしく (靠賴地)" (that is, its standard type label is cognates-Traditional Chinese), and "優しい" is replaced with the cognate form "優しい (温柔地)" (that is, its standard type label is cognates-Simplified Chinese).

In this way, since the sample construction device can replace the keywords based on the frequency of the keywords in the parallel corpus training sample set, the keywords with high frequency can be replaced more times with at least one non-standard word corresponding to it, so that the generated extended text can contain as many possible non-standard forms corresponding to the original text as possible, and the subsequent training of the translation model can be more comprehensive.
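One possible reading of the frequency-based selection in step 202a is sketched below. The description states only that a high-frequency word is more likely to be replaced; the linear scaling, the `max_prob` cap, the function names and the frequency counts are all assumptions made for illustration.

```python
import random

def replacement_probabilities(word_freq, max_prob=0.5):
    """Map each keyword to a replacement probability.

    Assumption: probability scales linearly with word frequency,
    and the most frequent keyword receives `max_prob`.
    """
    top = max(word_freq.values())
    if top <= 0:
        return {word: 0.0 for word in word_freq}
    return {word: max_prob * freq / top for word, freq in word_freq.items()}

def select_keywords(tokens, probs, rng):
    """Pick the tokens to replace; frequent keywords are picked more often."""
    return [tok for tok in tokens if rng.random() < probs.get(tok, 0.0)]

# Hypothetical counts in the parallel corpus training sample set.
word_freq = {"とても": 100, "頼もしく": 10, "優しい": 40}
probs = replacement_probabilities(word_freq)
first_keywords = select_keywords(["とても", "頼もしく", "優しい"], probs, random.Random(0))
```

Running the selection repeatedly over the sample set replaces frequent keywords more often, so their non-standard forms are covered by more extended texts.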

Step 203: Replace the first standard type label corresponding to the first keyword with the second standard type label corresponding to the first non-standard word, and obtain a parallel corpus training sample after the labels are replaced.

Optionally, a standard type label may indicate the standard type of the word.

Exemplarily, when a word is a standard word (i.e., the first keyword), its corresponding standard type label (i.e., the first standard type label) can indicate that it is a standard word; when a word is not a standard word (i.e., the first non-standard word), its corresponding standard type label (i.e., the second standard type label) can indicate its non-standard form.

For example, as shown in Table 3, the second standard type label may take many forms, such as pronunciation spelling-Hiragana, pronunciation spelling-Katakana, cognates-Simplified Chinese, cognates-Traditional Chinese, easily confused words-Simplified Chinese, easily confused words-Traditional Chinese, easily confused words-reorganization, etc.

Table 3

Step 204: construct a target training sample based on the parallel corpus training sample after replacing the label and at least one extended text.

Optionally, the sample construction device may associate the non-standard words in the extended text with the standard type labels corresponding to them in the parallel corpus training sample after the labels are replaced, to obtain a target training sample.
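Steps 202 to 204 can be combined into one short sketch. The dictionary layout, the function name `build_target_sample`, the label string "standard" and the English target text are all hypothetical; the application does not prescribe a data format.

```python
def build_target_sample(sample, index, variant, variant_label):
    """Build a target training sample from a parallel corpus training sample.

    `sample` holds the segmented original text ("tokens"), one standard
    type label per token ("labels") and the parallel target text ("target").
    """
    tokens = list(sample["tokens"])
    labels = list(sample["labels"])
    tokens[index] = variant        # step 202: keyword -> non-standard word
    labels[index] = variant_label  # step 203: first label -> second label
    # Step 204: associate the extended text with the replaced labels.
    return {"tokens": tokens, "labels": labels, "target": sample["target"]}

sample = {
    "tokens": ["とても", "頼もしく", "優しい"],
    "labels": ["standard", "standard", "standard"],
    "target": "very trustworthy and gentle",  # hypothetical parallel text
}
target_sample = build_target_sample(
    sample, 1, "たのもしく", "pronunciation spelling-Hiragana"
)
```

The parallel target text is left unchanged, so the translation model sees the non-standard form and its label paired with the same translation as the standard form.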

In the sample construction method provided in the embodiment of the present application, the sample construction device replaces keywords in the original text of the parallel corpus training sample to generate at least one extended text, which expands the vocabulary covered by the parallel corpus training sample. At the same time, the standard type label corresponding to the keyword is replaced with the standard type label corresponding to the non-standard word, and the resulting parallel corpus training sample after label replacement enriches the content contained in the parallel corpus training sample. Finally, the sample construction device constructs a target training sample based on the parallel corpus training sample after label replacement and the at least one extended text. The target training sample can therefore contain non-standard words and their corresponding standard type labels, which gives the parallel corpus training sample richer and more flexible training content.

Optionally, the number of extended texts is N, and N is a positive integer. After the above step 202, the sample construction method provided in the embodiment of the present application may further include the following step 205.

Step 205: When a second extended text among the N extended texts contains an unregistered word that is not included in the parallel corpus training sample set, initialize feature information of the unregistered word.

The initialization process includes at least one of the following: taking a weighted average of the feature information of the unregistered word according to the frequency of each non-standard word corresponding to the first keyword corresponding to the unregistered word in each of the N extended texts in the parallel corpus training sample set; taking a weighted average of the feature information of the unregistered word using the feature information of the cognate word corresponding to the unregistered word; setting the feature information of the unregistered word to 0; and randomly initializing the feature information of the unregistered word.

Optionally, the first extended text and the second extended text may be the same or different.

Optionally, the sample construction device can convert the obtained extended text into a word vector sequence corresponding to the model training based on the feature information of the word.

Exemplarily, the sample construction device can obtain a word vector sequence through algorithms such as the word to vector (Word2Vec) algorithm and the regression algorithm based on global word frequency statistics (GloVe), or can obtain a word vector sequence through training and iteration in a translation model such as Transformer.

In actual implementation, the sample construction device can obtain the word vector sequence corresponding to the extended text in any possible way, and this application does not make any specific limitation.

In an embodiment of the present application, for unregistered words, that is, words that have not appeared in the parallel corpus training sample set, any combination of the following methods can be used to initialize the feature information of the unregistered words to obtain the corresponding word vector: ① According to the first keyword corresponding to the unregistered word and the word frequency of each non-standard word corresponding to the first keyword corresponding to the unregistered word in each of the N extended texts in the parallel corpus training sample set, the feature information of the unregistered word is weighted averaged; ② Using the feature information of the cognate word corresponding to the unregistered word, the feature information of the unregistered word is weighted averaged; ③ The feature information of the unregistered word is set to 0; ④ The feature information of the unregistered word is randomly initialized.
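The four initialization options above can be sketched as follows. This is an interpretation, not the applicant's implementation: options ① and ② are both treated as a weighted average over the vectors of related words (frequency-weighted in ①, cognate vectors in ②), and the function names, strategy strings and the random range are assumptions.

```python
import random

def weighted_average(vectors, weights):
    """Weighted average of equal-length vectors."""
    total = sum(weights)
    dim = len(vectors[0])
    return [
        sum(w * vec[i] for vec, w in zip(vectors, weights)) / total
        for i in range(dim)
    ]

def init_unregistered(dim, strategy, vectors=None, weights=None, rng=None):
    """Initialize the feature vector of an unregistered word."""
    if strategy == "weighted_average":
        # Options 1 and 2: average the vectors of related words, weighted
        # by word frequency (option 1) or taken from cognates (option 2).
        return weighted_average(vectors, weights)
    if strategy == "zero":
        return [0.0] * dim  # option 3
    if strategy == "random":
        rng = rng or random.Random()
        return [rng.uniform(-0.1, 0.1) for _ in range(dim)]  # option 4
    raise ValueError(f"unknown strategy: {strategy}")

# Hypothetical 2-dimensional vectors of two related words seen 3 times and once.
vec = init_unregistered(
    2, "weighted_average", vectors=[[1.0, 0.0], [0.0, 1.0]], weights=[3, 1]
)
```

Starting an unregistered word near the vectors of its related standard forms gives training a better starting point than zeros or random values, which fits the later statement that the model can learn the correlation between non-standard words and their standard counterparts.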

In the actual process of training the model, the sample construction device can also randomly initialize the feature information of the standard type labels corresponding to the unregistered words, or, for a combined standard type label, obtain the feature information of the label by taking a weighted average of the feature information of the corresponding words.

Thus, on the one hand, at the data level, since the sample construction device can initialize the feature information of the unregistered words, the training of the translation model can be enhanced; on the other hand, at the model level, since the sample construction device can initialize the feature information of the unregistered words, the translation model can learn the phonetic correlation between the non-conforming words and their corresponding conforming words during training, thus improving the translation robustness of the translation model. Therefore, the sample construction method provided in the embodiment of the present application can improve the translation quality and translation accuracy of the translation model.

Optionally, after the above step 204, the sample construction method provided in the embodiment of the present application may further include the following steps 206 and 207.

Step 206: restore at least one non-standard word in the first translation text to a standard word to generate M second translation texts.

Wherein, each non-standard word is restored to at least one standard word.

Optionally, the first translated text may be a sentence or a paragraph.

Optionally, the first translated text may be text input by a user, or may be text acquired from another device.

Optionally, the sample construction device can identify non-standard words in the first translation text through the following three methods: Method 1: an extended vocabulary construction method based on cognates, pronunciations, and character sets; Method 2: a word segmentation model method based on extended vocabulary enhancement; Method 3: an irregular translation detection method based on language model probability.

Method 1 to method 3 are described in detail below in conjunction with specific embodiments.

Method 1: Extended vocabulary construction method based on cognates, pronunciations, and character sets.

In the embodiment of the present application, the extended form of the non-standard word in the first translation text may include all words matched by the non-standard word in the extended word list.

It can be understood that when the first translation text includes a non-standard word, the characters used by the non-standard word may fall outside the normal character set of the current language (for example, Chinese pinyin characters, which are outside the Chinese character set, appear in Chinese text), or the non-standard word may use a spelling that does not exist in the current language (for example, the English word "October" is written with the spelling of the German "Oktober", from the same language family). Therefore, the sample construction device in the sample construction method provided in the embodiment of the present application can construct an extended word list by mining the similarities between words in different languages; the extended word representation is shown in Table 3.

In the embodiment of the present application, taking Japanese as an example, the extended vocabulary may include: common pronunciation spellings of words and their variants; cognate or synonymous characters/words in other languages with close branches in the language system; easily confused words obtained by recombining a word and its cognates; easily confused words obtained by replacing a word with a word with a similar shape, etc.

Optionally, if the dictionary meanings of a word and its cognate or synonymous characters/words are highly similar, cognates can be constructed by mining dictionary information of various languages.

Optionally, easily confused words may be words that do not exist in their original language or cognate language.

Optionally, the extended vocabulary may include multiple word sets, each of which may include one or more non-standard words and a standard word set corresponding to the non-standard word.

Optionally, the sample construction device may identify non-standard words in the first translation text by character set detection, extended vocabulary matching, etc., and use the word set matched in the extended vocabulary as the first word set.
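As a rough illustration of character set detection combined with extended vocabulary matching, the following sketch flags tokens that either match an entry of the extended vocabulary or use characters outside an allowed character set. The vocabulary contents, the allowed character set, and all function names are assumptions made for the example.

```python
import string

# Hypothetical extended vocabulary: non-standard form -> set of standard words.
EXTENDED_VOCAB = {
    "Oktober": {"October"},
    "xiaoyuan": {"campus", "courtyard"},
}

def outside_charset(word, allowed):
    """True if the word uses any character outside the allowed character set."""
    return any(ch not in allowed for ch in word)

def find_nonstandard(tokens, allowed, vocab=EXTENDED_VOCAB):
    """Return {token: standard word set} for tokens flagged as non-standard."""
    found = {}
    for tok in tokens:
        if tok in vocab:
            # Extended vocabulary matching: the matched set is the first word set.
            found[tok] = vocab[tok]
        elif outside_charset(tok, allowed):
            # Character set detection: flagged, but no known standard form.
            found[tok] = set()
    return found

allowed = set(string.ascii_letters)
result = find_nonstandard(["See", "you", "in", "Oktober", "café"], allowed)
print(result)
```

In this sketch "Oktober" is matched in the vocabulary, while "café" is flagged only because "é" lies outside the assumed character set.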

Method 2: Word segmentation model method based on extended vocabulary enhancement.

Optionally, as shown in FIG3 , the words in the first translation text can be replaced with any extended form in the extended vocabulary according to the word frequency setting in the parallel corpus training sample set, and the corresponding standard type label can be replaced, and the word segmentation model can be trained with the extended form of the corpus and its corresponding standard type label.

Optionally, before the above step 206, the sample construction method provided in the embodiment of the present application may further include the following step A.

Step A: Input the first translation text into the word segmentation model, segment the first translation text to obtain M word segments, where M is an integer greater than 1, and perform non-standard word recognition on each of the M word segments to obtain a recognition result corresponding to each word segment. The recognition result corresponding to a word segment indicates whether that word segment is a non-standard word.

Exemplarily, the word segmentation model may be a word segmentation model that has undergone enhanced training.

Exemplarily, the word segmentation model that has undergone enhanced training can predict a standard type label for each obtained word segment. If the predicted standard type label indicates that the word segment does not comply with the standard, the word segment is identified as a non-standard word.

In this way, since the sample construction device can enable the word segmentation model that has undergone enhanced training to acquire the ability to recognize words, learn the similarities between non-standard words and standard words in terms of lexical structure, syntactic structure, contextual information, etc., and predict standard type labels for the output word segmentations, the word segmentation model can accurately segment the first translation text and identify non-standard words in the first translation text.

Method 3: Irregular translation detection method based on language model probability.

It is understandable that, since non-standard words appear in the parallel corpus training sample set with low probability, and the meanings, contexts and other information of homophones differ considerably, text containing non-standard words is less fluent than normal text. Therefore, the language model can be used to calculate the perplexity of the first translation text to determine whether the text contains non-standard expressions.

Optionally, the sample construction device may input the first translation text into an n-gram language model, and calculate the conditional probability of the current word w_i given the preceding n words of the first translation text by using the following formula (1).

Wherein, w_i is the current word, and N is the number of words in the first translation text.

It can be seen from formula (1) that the lower the conditional probability P(w_i | w_{i-n} ... w_{i-1}) of the current word is, the lower the fluency of the first translation text is, and the higher the perplexity of the first translation text is.
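For reference, a commonly used n-gram perplexity definition that is consistent with the symbols above (w_i, N, and the conditional probability of the current word) can be written as follows. This is the standard textbook form and is given here as an assumption; it is not necessarily the exact expression of formula (1).

```latex
\mathrm{PPL}(W) = P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}}
\approx \left( \prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-n} \cdots w_{i-1})} \right)^{\frac{1}{N}}
```

Under this definition, a smaller conditional probability for any word enlarges the product and therefore raises the perplexity, which matches the relationship described above.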

Optionally, before the above step 206, the sample construction method provided in the embodiment of the present application may further include the following steps B1 to B4.

Step B1: Segment the first translation text into words to obtain M word segments.

Wherein, M is an integer greater than 1.

Exemplarily, the sample construction device may input the first translated text into an enhanced word segmentation model for word segmentation.

Step B2: for each of the M word segments, when the conditional probability corresponding to a word segment is less than a first preset threshold, obtain P first standard-compliant words corresponding to the word segment.

Wherein, P is a positive integer.

It can be understood that if the conditional probability corresponding to the word segment is less than the first preset threshold, the word segment may be a non-standard word.

Optionally, the P first standard-compliant words may be P standard-compliant words in the standard-compliant word set matched by the word segment in the extended vocabulary.

Step B3: Replace the word segment in the first translation text with each of the P first standard-compliant words, to obtain P replaced first translation texts.

Step B4: If the first perplexity corresponding to any replaced first translation text is smaller than the second perplexity corresponding to the first translation text, and the difference between the first perplexity and the second perplexity is greater than a second preset threshold, the sample construction device determines that the word segment is a non-standard word.

It can be understood that if the first perplexity corresponding to any replaced first translation text is less than the second perplexity corresponding to the first translation text, and the difference between the first perplexity and the second perplexity is greater than the second preset threshold, it can be said that the replaced first translation text is smoother and more reasonable. In other words, there are non-standard words in the first translation text before replacement.

In this way, the sample construction device can replace a possible non-conforming word in the first translation text with the corresponding first conforming words and calculate the perplexity of the first translation text before and after the replacement; when the perplexity of the replaced first translation text decreases by more than the second preset threshold, the word is determined to be a non-conforming word. Therefore, the recognition of non-conforming words can be made more accurate, and the first translation text after the replacement can be made more fluent and reasonable, so that the subsequent translation is more accurate.
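Steps B2 to B4 can be sketched as follows. For brevity, the sketch replaces the n-gram language model with a hypothetical unigram probability table, so the probability of a word does not depend on its context; the table, the floor probability, the thresholds, and the function names are all assumptions.

```python
import math

# Hypothetical unigram probabilities standing in for a trained language model.
UNIGRAM = {"I": 0.2, "love": 0.1, "the": 0.2, "campus": 0.05, "courtyard": 0.01}
UNKNOWN_PROB = 1e-4  # assumed floor probability for unseen tokens

def prob(token):
    return UNIGRAM.get(token, UNKNOWN_PROB)

def perplexity(tokens):
    """exp of the average negative log-probability of the tokens."""
    log_sum = sum(math.log(prob(t)) for t in tokens)
    return math.exp(-log_sum / len(tokens))

def detect_nonstandard(tokens, vocab, prob_threshold, ppl_threshold):
    """Return the tokens judged non-standard by a B2-B4 style check."""
    base = perplexity(tokens)                    # perplexity before replacement
    flagged = []
    for i, tok in enumerate(tokens):
        if prob(tok) >= prob_threshold:          # B2: only low-probability tokens
            continue
        for cand in vocab.get(tok, ()):          # B3: try each standard word
            replaced = tokens[:i] + [cand] + tokens[i + 1:]
            # B4: perplexity must drop by more than the (non-negative) threshold
            if base - perplexity(replaced) > ppl_threshold:
                flagged.append(tok)
                break
    return flagged

tokens = ["I", "love", "the", "xiaoyuan"]
vocab = {"xiaoyuan": ["campus", "courtyard"]}
flagged = detect_nonstandard(tokens, vocab, prob_threshold=1e-3, ppl_threshold=1.0)
print(flagged)  # ['xiaoyuan']
```

With these toy numbers the perplexity falls from roughly 40 to roughly 8 when "xiaoyuan" is replaced with "campus", so the token is flagged.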

Optionally, the above step 206 may be specifically implemented through the following steps 206a and 206b.

Step 206a: Obtain a first word set corresponding to the at least one non-standard word.

The first word set may include a plurality of word subsets. A word subset may include one or more of the at least one non-standard word, and each non-standard word corresponds to a standard word set.

It can be understood that if at least one non-compliant word includes multiple non-compliant words, the compliant word sets corresponding to each of the multiple non-compliant words may be the same or different.

For example, at least one of the above-mentioned non-standard words includes the non-standard word "已經" and the non-standard word "已經", the set of standard words corresponding to the non-standard word "已經" can be a set including the standard word "已", and the set of standard words corresponding to the non-standard word "已經" can also be a set including the standard word "已".

Step 206b: For each word subset in the multiple word subsets, restore and map the non-standard words of the word subset in the first translation text to the set of standard words corresponding to each non-standard word in the word subset, to generate at least one second translation text.

In the embodiment of the present application, "restoring and mapping a word subset to a set of compliant words corresponding to each non-compliant word in the word subset" can be understood as: restoring each non-compliant word in the above-mentioned word subset in turn to each compliant word in the corresponding set of compliant words, and traversing all restored combinations of compliant words.
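The traversal of all restored combinations amounts to a Cartesian product over the standard word sets. A minimal sketch for a single word subset is given below; the tokens and word sets are hypothetical toy data, and the function name is an assumption.

```python
from itertools import product

def restore_all(tokens, standard_sets):
    """Generate every text obtained by restoring the listed non-standard tokens.

    standard_sets maps a non-standard token to its list of standard words.
    """
    positions = [i for i, t in enumerate(tokens) if t in standard_sets]
    choices = [standard_sets[tokens[i]] for i in positions]
    results = []
    for combo in product(*choices):      # every combination of standard words
        restored = list(tokens)
        for pos, word in zip(positions, combo):
            restored[pos] = word
        results.append(" ".join(restored))
    return results

texts = restore_all(["I", "luv", "u"], {"luv": ["love"], "u": ["you", "ewe"]})
print(texts)  # ['I love you', 'I love ewe']
```

The number of generated texts for one word subset is the product of the sizes of the standard word sets involved, which is why screening the generated texts before translation can be useful.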

For example, the first translation text is: "When I think about saying goodbye to xiaoyuan tomorrow, my heart is filled with nostalgia for Shen Shen." It contains the non-standard word "xiaoyuan" and the non-standard word "Shen Shen". The set of standard words corresponding to the non-standard word "xiaoyuan" includes: campus, courtyard; the set of standard words corresponding to the non-standard word "Shen Shen" includes: deeply, scrutinize. Then, the sample construction device can restore and map each non-standard word to its corresponding set of standard words, obtaining 6 second translation texts, which are: When I think of saying goodbye to the campus tomorrow, my heart is filled with nostalgia for Shenshen; When I think of saying goodbye to the campus tomorrow, my heart is filled with deep nostalgia; When I think of saying goodbye to the campus tomorrow, my heart is filled with nostalgia for Shenshen; When I think of saying goodbye to the courtyard tomorrow, my heart is filled with nostalgia for Shenshen; When I think of saying goodbye to the courtyard tomorrow, my heart is filled with deep nostalgia; When I think of saying goodbye to the campus tomorrow, my heart is filled with nostalgia for Shenshen.

In this way, since the sample construction device can restore the non-standard words in the first translation text to all possible standard words to generate at least one second translation text, the non-standard words in the first translation text can be corrected as much as possible, making the subsequent translation more accurate and fluent.

Step 207: input the first feature information corresponding to the first translated text and the second feature information corresponding to X second translated texts among the M second translated texts into the first translation model for text translation to obtain a target translated text.

Among them, the first feature information includes text feature information of the first translated text and feature information of the standard type labels corresponding to the non-standard words in the first translated text, and the second feature information includes text feature information of the second translated text and feature information of the standard type labels corresponding to the non-standard words in the second translated text.

In an embodiment of the present application, the first translation model is obtained by training based on a target training sample set, the target training sample set includes multiple target training samples, one target training sample corresponds to a parallel corpus training sample in the parallel corpus training sample set, M and X are positive integers, and X is less than or equal to M.

Optionally, the above step 207 may be specifically implemented through the following steps 207a and 207b.

Step 207a: input X second translation texts among the M second translation texts and the first translation text into the first translation model for text translation, and output L candidate translations.

The L candidate translations include candidate translations corresponding to X second translation texts and candidate translations corresponding to the first translation text, one candidate translation corresponds to at least one second translation text, L is a positive integer, and L is less than or equal to X.

It can be understood that, since the enhanced translation model can produce the same translation for non-standard words with different extended forms, the number of candidate translations output by the translation model may be less than the number of second translation texts input.

Exemplarily, as shown in FIG. 4, when the original text (i.e., the first translation text) “両親は学校に勤める(父工作学校)” is input into the enhanced translation model, the target translation “父工作学校” can be obtained.

When extended text 1 "両亲は学校につとめる(Parents work in school)" is input into the enhanced translation model, the target translation "父們工作学校" can also be obtained. In extended text 1, "両親(两亲)" in the original text is replaced with the recombined easily confused form "両亲(两亲)" (that is, its standard type label is easily confused words-reorganization), "学校" is replaced with the cognate traditional Chinese form "學校(學校)" (that is, its standard type label is cognates-traditional Chinese characters), and "勤める(工作)" is replaced with the pronunciation spelling form "つとめる(gongzuo)" (that is, its standard type label is pronunciation spelling-hiragana).

When extended text 2 "兩親はがっこうにツトメル(My parents work at the school)" is input into the enhanced translation model, the target translation "父家人工作学校" can also be obtained. In extended text 2, "両親(两亲)" in the original text is replaced with the traditional Chinese form of the easily confused word "两亲(两亲)" (that is, its standard type label is easily confused words-traditional), "学校" is replaced with the pronunciation spelling form "がっこう(xuexiao)" (that is, its standard type label is pronunciation spelling-hiragana), and "勤める(工作)" is replaced with the pronunciation spelling form "ツトメル(gongzuo)" (that is, its standard type label is pronunciation spelling-katakana).

Step 207b: Determine the candidate translation that meets the first condition among the L candidate translations as the target translation.

Optionally, the candidate translations satisfying the first condition may include at least one of the following:

Case 1: The candidate translation whose fluency meets the first predetermined condition;

Case 2: The candidate translation whose translation quality meets the second predetermined condition;

Case 3: candidate translations whose relevance satisfies the third predetermined condition.

The above relevance includes at least one of the following: prior probability, similarity, and perplexity.

Exemplarily, the first predetermined condition may be that the perplexity of the candidate translation is less than or equal to a third preset threshold. It can be understood that the lower the perplexity of the candidate translation, the higher the fluency and the more reasonable the candidate translation.

Exemplarily, for Case 1, the sample construction device may calculate the perplexity of each of the L candidate translations through the language model, and determine the candidate translation whose perplexity is less than or equal to the third preset threshold as the target translation.

Exemplarily, the second predetermined condition may be that the translation quality of the candidate translation is greater than or equal to a fourth preset threshold. It is understood that the sample construction device may determine the candidate translation whose translation quality is greater than or equal to the fourth preset threshold as the target translation.

Exemplarily, the third predetermined condition may be that the relevance of the candidate translation is greater than or equal to a fifth preset threshold. It is understood that the sample construction device may determine the candidate translation whose relevance is greater than or equal to the fifth preset threshold as the target translation.

It should be noted that if there is a candidate translation that satisfies a plurality of predetermined conditions in the first condition, the sample construction device may determine the candidate translation that satisfies the most predetermined conditions as the target translation.

In this way, since the sample construction device can determine the candidate translation with the best evaluation result as the target translation based on the fluency, translation quality and relevance of the candidate translations, the output target translation can be optimized.
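The selection rule described above, in which the candidate satisfying the most predetermined conditions becomes the target translation, can be sketched as follows. The candidate scores, the threshold values, and the dictionary field names are hypothetical.

```python
def pick_target(candidates, max_ppl, min_quality, min_relevance):
    """Return the candidate that satisfies the most predetermined conditions."""
    def satisfied(c):
        return sum([
            c["perplexity"] <= max_ppl,       # Case 1: fluency
            c["quality"] >= min_quality,      # Case 2: translation quality
            c["relevance"] >= min_relevance,  # Case 3: relevance
        ])
    eligible = [c for c in candidates if satisfied(c) > 0]
    # max() keeps the first candidate when several satisfy equally many conditions.
    return max(eligible, key=satisfied) if eligible else None

candidates = [
    {"text": "A", "perplexity": 12.0, "quality": 0.90, "relevance": 0.4},
    {"text": "B", "perplexity": 30.0, "quality": 0.95, "relevance": 0.8},
    {"text": "C", "perplexity": 10.0, "quality": 0.85, "relevance": 0.7},
]
best = pick_target(candidates, max_ppl=15.0, min_quality=0.8, min_relevance=0.6)
print(best["text"])  # C
```

Candidates A and B each satisfy two conditions, while C satisfies all three, so C is selected in this toy example.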

Optionally, the sample construction device may evaluate the translation quality of the candidate translations by using representation and feature learning methods.

Illustratively, after the above step 207a, the sample construction method provided in the embodiment of the present application may further include the following steps 207c and 207d.

Step 207c: for each candidate translation among the L candidate translations, extract first text feature information of the candidate translation, and extract second text feature information of the second translated text corresponding to the candidate translation and of the first translated text.

Exemplarily, the first text feature information may include features such as lexical and syntactic structures of the candidate translations.

Exemplarily, the second text feature information may include features such as lexical and syntactic structures of the second translated text and the first translated text.

For example, the sample construction device can extract the first text feature information of the candidate translation through a trained word segmentation model of the target language, and extract the second text feature information of the second translation text corresponding to the candidate translation and of the first translation text through a word segmentation model of the original language.

Step 207d: Calculate a translation quality parameter corresponding to a candidate translation based on the first text feature information and the second text feature information.

Exemplarily, the sample construction device may calculate the quality of the translation result using a regression algorithm.

Exemplarily, the translation quality parameter corresponding to a candidate translation may be a result value of a regression algorithm.

It can be understood that the regression algorithm can output the probability of the quality of the candidate translation: the closer the result of the regression algorithm is to 1, the better the quality of the candidate translation is, and the closer the result of the regression algorithm is to 0, the worse the quality of the candidate translation is.
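One regression form whose output lies between 0 and 1 is logistic regression. The following sketch assumes that form; the application does not specify which regression algorithm is used, and the feature values and weights below are hypothetical.

```python
import math

def quality_score(features, weights, bias=0.0):
    """Logistic-regression style score in (0, 1); higher means better quality."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical similarity features between a candidate translation and its source.
good = quality_score([0.9, 0.8], weights=[2.0, 2.0], bias=-1.0)
poor = quality_score([0.1, 0.2], weights=[2.0, 2.0], bias=-1.0)
print(good > poor)  # True
```

In practice the weights would be learned from translations with known quality rather than set by hand.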

In this way, since the sample construction device can calculate the translation quality parameter corresponding to a candidate translation based on the first text feature information of the candidate translation and the second text feature information of the second translation text corresponding to the candidate translation and of the first translation text, candidate translations with better translation quality can be screened out.

Optionally, the relevance of the candidate translations can be obtained by weighting the following six evaluation indicators:

① The expanded words in the second translation text corresponding to the candidate translation are given a prior probability according to their expansion type, their similarity to the corresponding non-standard words in the first translation text, their word frequency relative to the corresponding non-standard words in the first translation text, etc., and the candidate translation with the higher probability is selected. (For example: if only the expansion type is considered, the prior probabilities of pronunciation spelling, cognates, and easily confused words are [0.7, 0.2, 0.1], and the prior probabilities of hiragana and katakana within pronunciation spelling are [0.8, 0.2], then the prior probability of pronunciation spelling-hiragana is 0.7 × 0.8 = 0.56.)
② The second translation text corresponding to the candidate translation is input into the word segmentation model, the similarity of its word segmentation and of annotation information such as lexical and syntactic structure to those of the first translation text is calculated, and the candidate translation corresponding to the second translation text with the higher similarity is selected.
③ Calculate the perplexity of the second translation text and of the first translation text through the language model, and select the candidate translation corresponding to the second translation text whose perplexity is lower than that of the first translation text and whose perplexity difference exceeds the second preset threshold.
④ Input the candidate translation into the word segmentation model, calculate the similarity of its word segmentation and of annotation information such as lexical and syntactic structure to those of the candidate translation corresponding to the first translation text, and select the candidate translation with the higher similarity.
⑤ Calculate the string similarity between all candidate translations, and select the candidate translation with the higher similarity.
⑥ Calculate the similarity between the translations corresponding to the extended words in all candidate translations.
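The prior probability example in indicator ① and the weighting of the indicators can be sketched as follows. The prior tables reuse the example values above; the dictionary keys, the function names, and the sample weights are assumptions.

```python
import math

# Prior probabilities from the example above, keyed by hypothetical names.
TYPE_PRIOR = {"pronunciation": 0.7, "cognate": 0.2, "confusable": 0.1}
SCRIPT_PRIOR = {"hiragana": 0.8, "katakana": 0.2}

def label_prior(ext_type, script):
    """Prior of a combined label, e.g. pronunciation spelling-hiragana."""
    return TYPE_PRIOR[ext_type] * SCRIPT_PRIOR[script]

def relevance(indicators, weights):
    """Weighted sum of the evaluation indicators (one weight per indicator)."""
    return sum(w * s for w, s in zip(weights, indicators))

p = label_prior("pronunciation", "hiragana")
print(round(p, 2))  # 0.56
```

The rounding is only for display, since the floating-point product of 0.7 and 0.8 is not exactly 0.56.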

It should be noted that the weight of evaluation index ④ may be determined by the other evaluation indexes of the candidate translation corresponding to the first translation text. If the fluency and translation quality of the candidate translation corresponding to the first translation text are poor, the weight corresponding to index ④ is reduced accordingly.

Furthermore, since different second translation texts can obtain the same candidate translation through the enhanced translation model, the relevance of the candidate translation can be obtained by weighting the evaluation indicators of the candidate translations corresponding to the different second translation texts.

Optionally, before the above step 207, the sample construction method provided in the embodiment of the present application can also screen at least one second translation text through evaluation indicators ①~③ for calculating the relevance of candidate translations, and screen out X second translation texts from M second translation texts, so as to improve the efficiency of actual translation and reduce the power consumption of the sample construction device.
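The screening of X second translation texts out of M can be sketched as a simple top-X selection. The combined score is assumed to have been computed beforehand from indicators ① to ③; the texts and score values below are placeholders.

```python
def screen_texts(scored_texts, x):
    """Keep the X second translation texts with the highest combined score."""
    ranked = sorted(scored_texts, key=lambda item: item[1], reverse=True)
    return [text for text, _ in ranked[:x]]

kept = screen_texts([("text a", 0.2), ("text b", 0.9), ("text c", 0.5)], x=2)
print(kept)  # ['text b', 'text c']
```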

In the sample construction method provided by the embodiment of the present application, on the one hand, since the non-standard words in the first translation text can be restored to at least one standard word to generate at least one second translation text, the first translation text containing non-standard words can be restored to a standard text, avoiding translation errors caused by the presence of non-standard words; on the other hand, since some or all of the second translation texts can be input into the translation model together with the original first translation text when the first translation text is translated, a translation with higher accuracy can be output as the translation result. In this way, the sample construction method provided by the embodiment of the present application can improve the translation accuracy of the translation model.

Optionally, before the above step 206, the sample construction method provided in the embodiment of the present application may further include the following step 208.

Step 208: Input the first translation text into the first word segmentation model, segment the first translation text to obtain K word segments, and perform non-standard word recognition on each of the K word segments to obtain a recognition result corresponding to each word segment.

The recognition result corresponding to a segmented word is used to indicate whether the segmented word is a non-standard word. When a segmented word is a non-standard word, the recognition result corresponding to the segmented word includes the standard type corresponding to the segmented word.

In the embodiment of the present application, the first word segmentation model is obtained by training based on the target training sample set, and K is an integer greater than 1.

Exemplarily, the sample construction device can also use the word segmentation model trained in the above method 2 to predict labels for non-standard words in the enhanced text, and introduce corresponding standard type label vectors in the training of the translation model, so that the model learns the semantic relationship between the extended words and the first keyword corresponding to the unregistered words, thereby enhancing the prediction and translation capabilities of the extended words.

It can be understood that the standard type label vector has the same dimension as the word vector. The extended word vector in the enhanced sentence is added to the vector corresponding to the standard type label predicted by the word segmentation model to obtain the final representation vector of the extended word.

Specifically, taking Japanese as an example, as shown in Table 3, the extended words include pronunciation spelling, cognates, easily confused words, etc., and each type has subdivisions such as hiragana, katakana, simplified Chinese, traditional Chinese, and reorganization, so there are multiple combinations of standard type labels. During the training process, each standard type label vector can be randomly initialized, or the weighted average of the word vectors corresponding to each component of the label (such as pronunciation spelling, cognates, hiragana, katakana, etc.) can be used as the initial vector, and the standard type label vector is iteratively optimized through model training.
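The addition of the two vectors can be sketched as an element-wise sum. The 3-dimensional values are hypothetical, and the function name is an assumption.

```python
def final_representation(word_vec, label_vec):
    """Element-wise sum of an extended-word vector and its label vector."""
    if len(word_vec) != len(label_vec):
        raise ValueError("word and label vectors must have the same dimension")
    return [w + l for w, l in zip(word_vec, label_vec)]

rep = final_representation([0.5, 1.0, -0.25], [0.25, 0.0, 0.25])
print(rep)  # [0.75, 1.0, 0.0]
```

Requiring equal dimensions is what allows the label information to be injected without changing the input size of the translation model.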

It should be noted that the label prediction of the word segmentation model for the expanded word may be different from the label of the actual expanded form of the word. However, the enhanced sentences with incorrect label predictions of these standard types are not corrected, but retained at a certain ratio, thereby enhancing the robustness of the translation model and enabling the model to learn to output correct translations when incorrect expanded word labels are input.

Optionally, the sample construction device may use the parallel corpus training samples and the target training samples to perform enhanced training on a basic translation model.

It can be understood that the output translations corresponding to each original text and all the corresponding extended texts are the same, so that the translation model enhances the translation robustness to non-standard expressions.

Exemplarily, the enhanced translation model may generate a word vector table containing extended words and a standard type label vector table.

In the embodiment of the present application, after the sample construction device inputs N second translation texts among the at least one second translation text, together with the first translation text, into the translation model, the input text can be segmented by the enhanced word segmentation model, the non-standard words can be identified, and the extended forms of the non-standard words can be predicted. Then, by querying the vector tables, the standard word vectors, the extended word vectors and the standard type label vectors corresponding to the input text are input into the model, as shown in FIG. 4, to obtain the generated translation.

Thus, on the one hand, at the data level, the sample construction device can construct adversarial training data based on the extended vocabulary to enhance the training of the translation model; on the other hand, at the model level, the sample construction device incorporates the standard type label vector in the input encoding layer, so that the model can learn the extended form of non-standard words and the semantic relevance of non-standard words and their corresponding standard words in the text during training. Thus, the translation model can improve the translation robustness and translation quality of the first translation text containing non-standard words, so that the translation model can output the correct translation regardless of whether the first translation text contains non-standard words.

Optionally, after the above step 201, the sample construction method provided in the embodiment of the present application may further include the following steps 301 and 302. The above step 202 may be specifically implemented by the following step a.

Step 301: Display each standard-compliant word corresponding to the first non-standard-compliant word.

The first non-standard word may be one or more non-standard words in the at least one non-standard word.

Exemplarily, the sample construction device may display each standard-compliant word corresponding to the first non-standard-compliant word in order of relevance from high to low.

Exemplarily, the sample construction device may calculate the relevance of each standard-compliant word corresponding to the first non-standard-compliant word by using the following formula (2).
S(W) = αS₁(W) + βS₂(W) + γS₃(W) (Formula 2)

Here, S₁(W) is the prior probability corresponding to the non-standard word in the extended vocabulary; S₂(W) is the lexical similarity between the non-standard word and its restored standard word; S₃(W) is the restoration of the non-standard word; and α, β and γ are adjustable weight coefficients.

For example, for S₂(W), if only part of speech is considered, S₂(W) can be 1 when the morphology of the non-standard word is the same as that of its restored standard word, and 0 when the morphology of the non-standard word is different from that of its restored standard word.
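Formula (2) and the high-to-low ordering used for display can be sketched as follows. The class `Candidate`, the function names and the default weight values are hypothetical; the application only states that α, β and γ are adjustable.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    word: str           # candidate standard word
    prior: float        # S1(W): prior probability from the extended vocabulary
    similarity: float   # S2(W): lexical similarity (e.g. 1 or 0)
    restoration: float  # S3(W): restoration term


def relevance(c: Candidate, alpha: float = 0.5, beta: float = 0.25,
              gamma: float = 0.25) -> float:
    """Weighted sum of the three terms, as in Formula (2)."""
    return alpha * c.prior + beta * c.similarity + gamma * c.restoration


def rank_candidates(candidates: List[Candidate], **weights: float) -> List[Candidate]:
    """Return candidates ordered by relevance from high to low."""
    return sorted(candidates, key=lambda c: relevance(c, **weights), reverse=True)
```

The weights are passed as keyword arguments so that they can be tuned per language or per standard type without changing the ranking code.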

Step 302: Receive a first input of a target standard-compliant word among the displayed standard-compliant words.

Exemplarily, the target standard-compliant words are one or more standard-compliant words among the displayed standard-compliant words.

In one example, the target standard-compliant word may be a standard-compliant word corresponding to the same non-standard-compliant word.

In one example, the target standard-compliant word may include standard-compliant words corresponding to multiple different non-standard-compliant words.

In one example, when the target standard-compliant word includes multiple standard-compliant words corresponding to different non-standard-compliant words, the sample construction device restores each of those non-standard-compliant words to the standard-compliant word selected by the user.

Exemplarily, the target standard-compliant word may be a standard-compliant word selected by the user to replace a non-standard-compliant word.

Exemplarily, the first input is used to select, from the displayed standard words, the standard word to which the non-standard word is to be restored.

Exemplarily, the first input may be a touch input, a specific voice input, or a specific gesture input by the user on the target standard-compliant word, which is not limited in this embodiment of the present application.

For example, the first input may be a click input by the user on the target standard-compliant word.

Step a: In response to the first input, restore the first non-standard word in the first translation text to a target standard word to generate at least one second translation text.

Exemplarily, if there are non-standard words in the first translation text that have not been manually restored by the user, the electronic device can restore them according to the above-mentioned relevant steps to generate at least one second translation text.

For example, as shown in (a) of FIG5, the sample construction device may display the standard words “両親(两亲)” and “良心” corresponding to the first non-standard word “りょうしん(liangqin)”. Then, the sample construction device receives the user's click input (i.e., the first input) on the target standard word “両親(两亲)”, and as shown in (b) of FIG5, restores the non-standard word “りょうしん(liangqin)” to the target standard word “両親(两亲)”, and generates the second translation text “りょうしんは学校に勤める(两亲/父工作学校)”.

In this way, since the sample construction device can display the standard words corresponding to the non-standard words, and the user selects the target standard word to be restored through the first input, the number of generated second translation texts can be reduced, thereby reducing the power consumption required for translation.

Optionally, the sample construction method provided in the embodiment of the present application can construct corresponding extended vocabulary according to the linguistic features of different languages to be applied to different translation languages and language directions.

The embodiment of the present application provides a sample construction method. FIG6 shows a flowchart of translation performed by a translation model provided by the embodiment of the present application, wherein the translation model is obtained by training with the target training sample. As shown in FIG6, the sample construction method provided by the embodiment of the present application may include the following steps 601 to 607.

Step 601: Obtain the text to be translated.

Step 602: Automatically identify whether there are any words that do not conform to the standard in the text to be translated.

Step 603: When there are non-standard words in the text to be translated, sort the restoration results of the non-standard words by credibility and present them to the user.

Step 604: In response to a first input by which the user selects a restoration result for a non-standard word, restore that non-standard word to generate at least one second translation text.

Step 605: Restore the non-standard words for which the user has not selected a restoration result, to generate at least one second translation text.

Step 606: Input at least one second translation text into the first translation model for text translation to obtain at least one candidate translation.

Step 607: Determine a target translation from at least one candidate translation, and output the target translation.
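Steps 601 to 607 can be sketched as a small pipeline. This is an illustration under several assumptions: the text is tokenized on whitespace, non-standard words are identified by lookup in a hypothetical `candidates` table whose lists are already sorted by credibility, the user's first input is simulated by a `user_choice` mapping, and the translation model and the selection rule are passed in as the callables `translate` and `score`.

```python
from itertools import product
from typing import Callable, Dict, List


def build_second_texts(tokens: List[str],
                       candidates: Dict[str, List[str]],
                       user_choice: Dict[str, str]) -> List[str]:
    """Restore non-standard words and return every resulting second text."""
    options: List[List[str]] = []
    for tok in tokens:
        if tok in user_choice:        # step 604: restoration chosen by the user
            options.append([user_choice[tok]])
        elif tok in candidates:       # step 605: not chosen, keep all candidates
            options.append(candidates[tok])
        else:                         # standard word, left unchanged
            options.append([tok])
    return [" ".join(combo) for combo in product(*options)]


def translate_text(text: str,
                   candidates: Dict[str, List[str]],
                   user_choice: Dict[str, str],
                   translate: Callable[[str], str],
                   score: Callable[[str], float]) -> str:
    """Steps 601 to 607 with the model and the scoring rule injected."""
    tokens = text.split()                                            # steps 601-602
    second_texts = build_second_texts(tokens, candidates, user_choice)
    candidate_translations = [translate(t) for t in second_texts]    # step 606
    return max(candidate_translations, key=score)                    # step 607
```

Each word the user resolves removes a factor from the Cartesian product, so fewer second translation texts reach the translation model.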

The sample construction method provided in the embodiment of the present application can be executed by a sample construction device. In the embodiment of the present application, the sample construction device provided in the embodiment of the present application is described by taking the sample construction method executed by the sample construction device as an example.

Fig. 7 shows a possible structural diagram of a sample construction device involved in an embodiment of the present application. As shown in Fig. 7 , the sample construction device 70 may include: an acquisition module 71 , a processing module 72 and a construction module 73 .

Among them, the above-mentioned acquisition module 71 is used to obtain a parallel corpus training sample, which contains the original text and carries the standard type label corresponding to each keyword in the original text; the above-mentioned processing module 72 is used to replace the first keyword in the original text of the parallel corpus training sample acquired by the acquisition module 71 with at least one first non-standard word corresponding to the first keyword, so as to generate at least one extended text; the above-mentioned processing module 72 is also used to replace the first standard type label corresponding to the first keyword in the parallel corpus training sample acquired by the acquisition module 71 with the second standard type label corresponding to the first non-standard word, so as to obtain the parallel corpus training sample after the label is replaced; the above-mentioned construction module 73 is used to construct a target training sample based on the parallel corpus training sample after the label is replaced and processed by the processing module 72 and at least one extended text.

In a possible implementation, the parallel corpus training sample is a parallel corpus training sample in a parallel corpus training sample set; the processing module 72 is specifically used to:

Based on the word frequency of each keyword in the original text in the parallel corpus training sample set, at least one first keyword is determined from the original text, and each first keyword in the at least one first keyword in the original text is replaced with a first non-standard word corresponding to each first keyword to generate a first extended text;

The first extended text is any extended text among the at least one extended text.
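The frequency-based keyword selection and the paired replacement of words and labels can be sketched as follows. The sketch assumes that the keywords with the highest corpus frequency are the ones selected (the application only says the selection is based on word frequency), and the names `select_keywords`, `build_extended_samples`, `extended_vocab` and `top_k` are hypothetical.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Tuple

# (source tokens, standard type label per token, target translation)
Sample = Tuple[List[str], List[str], str]
# keyword -> list of (non-standard word, standard type label)
ExtendedVocab = Dict[str, List[Tuple[str, str]]]


def select_keywords(tokens: List[str], corpus_freq: Counter,
                    extended_vocab: ExtendedVocab, top_k: int = 1) -> List[str]:
    """Pick the top_k replaceable keywords with the highest corpus frequency."""
    replaceable = [t for t in dict.fromkeys(tokens) if t in extended_vocab]
    return sorted(replaceable, key=lambda t: corpus_freq[t], reverse=True)[:top_k]


def build_extended_samples(sample: Sample, corpus_freq: Counter,
                           extended_vocab: ExtendedVocab,
                           top_k: int = 1) -> List[Sample]:
    """Replace the selected keywords and their labels; keep the target text."""
    tokens, labels, target = sample
    keywords = select_keywords(tokens, corpus_freq, extended_vocab, top_k)
    if not keywords:
        return []
    extended = []
    for combo in product(*(extended_vocab[k] for k in keywords)):
        mapping = dict(zip(keywords, combo))  # keyword -> (variant, label)
        new_tokens = [mapping[t][0] if t in mapping else t for t in tokens]
        new_labels = [mapping[t][1] if t in mapping else lab
                      for t, lab in zip(tokens, labels)]
        extended.append((new_tokens, new_labels, target))
    return extended
```

Each extended sample keeps the original target translation, so the model sees the non-standard source text paired with the correct translation and the replaced label.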

In a possible implementation, the parallel corpus training sample is a parallel corpus training sample in a parallel corpus training sample set, and the number of extended texts is N, where N is a positive integer;

The processing module 72 is further configured to, after replacing the first keyword in the original text with at least one first non-standard word corresponding to the first keyword to generate at least one extended text, initialize word feature information of the unregistered word when a second extended text among the N extended texts includes an unregistered word that is not included in the parallel corpus training sample set;

The initialization process includes one of the following:

According to the first keyword corresponding to the unregistered word and the word frequency of each non-standard word corresponding to the first keyword corresponding to the unregistered word in each of the N extended texts in the parallel corpus training sample set, the word feature information of the unregistered word is weighted averaged;

Using the feature information of the cognates corresponding to the unregistered words, weighted averaging the feature information of the unregistered words;

Set the feature information of unregistered words to 0;

The feature information of unregistered words is randomly initialized.
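The four initialization options listed above can be sketched in one helper. The function `init_oov`, its strategy names, the uniform weights used for cognates and the random range are assumptions made for illustration; the application does not fix these details.

```python
import random
from typing import List, Optional

Vector = List[float]


def weighted_average(vectors: List[Vector], weights: List[float]) -> Vector:
    """Average the vectors component-wise, weighting each by its weight."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[i] for v, w in zip(vectors, weights)) / total
            for i in range(dim)]


def init_oov(strategy: str, dim: int,
             related_vectors: Optional[List[Vector]] = None,
             related_freqs: Optional[List[float]] = None,
             rng: Optional[random.Random] = None) -> Vector:
    """Initialize the feature vector of an unregistered word."""
    if strategy == "frequency":
        # Weight the keyword and its non-standard variants by word frequency.
        if not related_vectors or not related_freqs:
            raise ValueError("frequency strategy needs vectors and frequencies")
        return weighted_average(related_vectors, related_freqs)
    if strategy == "cognate":
        # Uniform weights over cognate vectors (an assumption of this sketch).
        if not related_vectors:
            raise ValueError("cognate strategy needs cognate vectors")
        return weighted_average(related_vectors, [1.0] * len(related_vectors))
    if strategy == "zero":
        return [0.0] * dim
    if strategy == "random":
        rng = rng or random.Random()
        return [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    raise ValueError(f"unknown strategy: {strategy}")
```

The frequency and cognate strategies start the unregistered word close to related words that already have learned features, whereas the zero and random strategies require no related words.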

In a possible implementation, the parallel corpus sample is a parallel corpus training sample in a parallel corpus sample set; the above device further includes: a translation module;

The processing module 72 is further configured to restore at least one non-standard word in the first translation text to a standard word after the construction module 73 constructs the target training sample based on the parallel corpus training sample after replacing the label and at least one extended text, so as to generate M second translation texts, wherein one non-standard word is restored to at least one standard word;

The above-mentioned translation module is used to input the first feature information corresponding to the first translation text and the second feature information corresponding to the X second translation texts among the M second translation texts obtained by the processing module 72 into the first translation model for text translation to obtain a target translation, wherein the first feature information includes text feature information of the first translation text and feature information of the standard type label corresponding to the non-standard words in the first translation text, and the second feature information includes text feature information of the second translation text and feature information of the standard type label corresponding to the non-standard words in the second translation text;

The first translation model is trained based on a target training sample set, the target training sample set includes multiple target training samples, one target training sample corresponds to one parallel corpus training sample in the parallel corpus training sample set, M and X are positive integers, and X is less than or equal to M.

In a possible implementation, the above device further includes: a word segmentation module;

The above-mentioned word segmentation module is used for, before the processing module 72 restores at least one non-standard word in the first translation text to a standard word to generate M second translation texts, inputting the first translation text into the first word segmentation model, segmenting the first translation text to obtain K word segments, and performing non-standard word recognition on each of the K word segments to obtain a recognition result corresponding to each word segment, wherein the recognition result corresponding to a word segment is used to indicate whether the word segment is a non-standard word, and when the word segment is a non-standard word, the recognition result corresponding to the word segment includes a standard type corresponding to the word segment;

The first word segmentation model is trained based on the target training sample set, and K is an integer greater than 1.
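The shape of the recognition result for each word segment can be sketched as a small data structure. A dictionary lookup stands in for the trained first word segmentation model, and the names `RecognitionResult`, `recognize` and `lexicon` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RecognitionResult:
    segment: str
    is_non_standard: bool
    standard_type: Optional[str] = None  # set only for non-standard words


def recognize(segments: List[str], lexicon: Dict[str, str]) -> List[RecognitionResult]:
    """Mark each segment as standard or non-standard.

    `lexicon` maps a non-standard word to its standard type and is only a
    placeholder for the prediction of a trained model.
    """
    return [RecognitionResult(s, s in lexicon, lexicon.get(s)) for s in segments]
```

Keeping the standard type inside the result lets the later restoration and encoding steps look up the matching label feature without running recognition again.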

In a possible implementation, non-standard words include at least one of the following: words containing pinyin reading and writing, words containing typos, words containing homologous word replacements, and words containing glyph errors.

The embodiment of the present application provides a sample construction device. The sample construction device can replace the keywords in the original text of the parallel corpus training sample to generate at least one extended text, thereby expanding the vocabulary covered by the parallel corpus training sample; at the same time, the standard type label corresponding to the keyword is replaced with the standard type label corresponding to the non-standard word, and the parallel corpus training sample after the label is replaced is obtained, thereby enriching the content contained in the parallel corpus training sample. Finally, the sample construction device can construct the target training sample based on the parallel corpus training sample after the label is replaced and the at least one extended text. Therefore, the target training sample can contain non-standard words and their corresponding standard type labels, thereby enriching the content of the parallel corpus training samples and giving them more, and more flexible, training content.

The sample construction device in the embodiment of the present application can be an electronic device or a component in the electronic device, such as an integrated circuit or a chip. The electronic device can be a terminal or a device other than a terminal. Exemplarily, the electronic device can be a mobile phone, a tablet computer, a laptop computer, a car-mounted electronic device, a mobile Internet device (Mobile Internet Device, MID), an augmented reality (augmented reality, AR)/virtual reality (virtual reality, VR) device, a robot, a wearable device, an ultra-mobile personal computer (ultra-mobile personal computer, UMPC), a netbook or a personal digital assistant (personal digital assistant, PDA), etc. It can also be a server, a network attached storage (Network Attached Storage, NAS), a personal computer (personal computer, PC), a television (television, TV), a teller machine or a self-service machine, etc., and the embodiment of the present application is not specifically limited.

The sample construction device in the embodiment of the present application may be a device having an operating system. The operating system may be an Android operating system, an iOS operating system, or other possible operating systems, which are not specifically limited in the embodiment of the present application.

The sample construction device provided in the embodiment of the present application can implement the various processes implemented in the method embodiments of Figures 2 to 6 and achieve the same technical effects. To avoid repetition, they will not be described here.

Optionally, as shown in Figure 8, an embodiment of the present application also provides an electronic device 800, including a processor 801 and a memory 802, and the memory 802 stores a program or instruction that can be executed on the processor 801. When the program or instruction is executed by the processor 801, the various steps of the above-mentioned sample construction method embodiment are implemented, and the same technical effect can be achieved. To avoid repetition, it will not be repeated here.

It should be noted that the electronic devices in the embodiments of the present application include the mobile electronic devices and non-mobile electronic devices mentioned above.

FIG. 9 is a schematic diagram of the hardware structure of an electronic device implementing an embodiment of the present application.

The electronic device 900 includes but is not limited to: a radio frequency unit 901, a network module 902, an audio output unit 903, an input unit 904, a sensor 905, a display unit 906, a user input unit 907, an interface unit 908, a memory 909, and a processor 910 and other components.

Those skilled in the art will appreciate that the electronic device 900 may also include a power source (such as a battery) for supplying power to each component, and the power source may be logically connected to the processor 910 through a power management system, so that the power management system can manage charging, discharging, and power consumption management. The electronic device structure shown in FIG9 does not constitute a limitation on the electronic device, and the electronic device may include more or fewer components than shown, or combine certain components, or arrange components differently, which will not be described in detail here.

The processor 910 is used to: obtain a parallel corpus training sample, wherein the parallel corpus training sample includes an original text and carries a standard type label corresponding to each keyword in the original text; replace the first keyword in the original text of the obtained parallel corpus training sample with at least one first non-standard word corresponding to the first keyword to generate at least one extended text; replace the first standard type label corresponding to the first keyword in the obtained parallel corpus training sample with a second standard type label corresponding to the first non-standard word to obtain a parallel corpus training sample after label replacement; and construct a target training sample based on the parallel corpus training sample after label replacement and the at least one extended text.

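Optionally, the parallel corpus training sample is a parallel corpus training sample in a parallel corpus training sample set; the processor 910 is specifically configured to:

Based on the word frequency of each keyword in the original text in the parallel corpus training sample set, at least one first keyword is determined from the original text, and each first keyword in the at least one first keyword in the original text is replaced with a first non-standard word corresponding to each first keyword to generate a first extended text;

The first extended text is any extended text among the at least one extended text.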

Optionally, the parallel corpus training sample is a parallel corpus training sample in a parallel corpus training sample set, and the number of extended texts is N, where N is a positive integer;

The processor 910 is further configured to, after replacing a first keyword in the original text with at least one first non-standard word corresponding to the first keyword to generate at least one extended text, initialize word feature information of the unregistered word when a second extended text among the N extended texts includes an unregistered word that is not included in the parallel corpus training sample set;

The initialization process includes one of the following:

According to the first keyword corresponding to the unregistered word and the word frequency of each non-standard word corresponding to the first keyword corresponding to the unregistered word in each of the N extended texts in the parallel corpus training sample set, the word feature information of the unregistered word is weighted averaged;

Set the feature information of unregistered words to 0;

The feature information of unregistered words is randomly initialized.

Optionally, the parallel corpus sample is a parallel corpus training sample in a parallel corpus sample set;

The processor 910 is further configured to restore at least one non-standard word in the first translation text to a standard word after constructing a target training sample based on the parallel corpus training sample after replacing the label and at least one extended text, so as to generate M second translation texts, wherein one non-standard word is restored to at least one standard word;

The processor 910 is further configured to input the first feature information corresponding to the first translated text and the second feature information corresponding to the X second translated texts among the M second translated texts obtained by the processor 910 into the first translation model for text translation to obtain a target translated text, wherein the first feature information includes the text feature information of the first translated text and the feature information of the standard type label corresponding to the non-standard word in the first translated text, and the second feature information includes the text feature information of the second translated text and the feature information of the standard type label corresponding to the non-standard word in the second translation text;

Optionally, the processor 910 is further configured to, before restoring at least one non-standard word in the first translation text to a standard word to generate the M second translation texts, input the first translation text into the first word segmentation model, segment the first translation text to obtain K word segments, and perform non-standard word recognition on each of the K word segments to obtain a recognition result corresponding to each word segment, wherein the recognition result corresponding to a word segment is used to indicate whether the word segment is a non-standard word, and when the word segment is a non-standard word, the recognition result corresponding to the word segment includes a standard type corresponding to the word segment;

Optionally, non-standard words include at least one of the following: containing phonetic reading and writing, containing typos, containing homologous word replacements, and containing glyph errors.

An embodiment of the present application provides an electronic device, which can replace a keyword in an original text in a parallel corpus training sample with at least one non-standard word corresponding to the keyword, generate at least one extended text, and replace the standard type label corresponding to the keyword with the standard type label corresponding to the non-standard word, thereby obtaining a parallel corpus training sample after replacing the label. Then, the electronic device can construct a target training sample based on the parallel corpus training sample after replacing the label and at least one extended text. Therefore, the target training sample can contain non-standard words and their corresponding standard type labels, so that the translation model trained with the target training sample can translate non-standard words, thereby improving the translation accuracy of the translation model.

It should be understood that in the embodiment of the present application, the input unit 904 may include a graphics processing unit (GPU) 9041 and a microphone 9042, and the graphics processor 9041 processes the image data of the static picture or video obtained by the image capture device (such as a camera) in the video capture mode or the image capture mode. The display unit 906 may include a display panel 9061, and the display panel 9061 may be configured in the form of a liquid crystal display, an organic light emitting diode, etc. The user input unit 907 includes a touch panel 9071 and at least one of other input devices 9072. The touch panel 9071 is also called a touch screen. The touch panel 9071 may include two parts: a touch detection device and a touch controller. Other input devices 9072 may include, but are not limited to, a physical keyboard, function keys (such as a volume control key, a switch key, etc.), a trackball, a mouse, and a joystick, which will not be repeated here.

The memory 909 can be used to store software programs and various data. The memory 909 can mainly include a first storage area for storing programs or instructions and a second storage area for storing data, wherein the first storage area can store an operating system, and an application program or instructions required for at least one function (such as a sound playback function, an image playback function, etc.). In addition, the memory 909 can include a volatile memory or a non-volatile memory, or the memory 909 can include both volatile and non-volatile memory. Among them, the non-volatile memory can be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM) or a flash memory. The volatile memory can be a random access memory (RAM), a static random access memory (SRAM), a dynamic random access memory (DRAM), a synchronous dynamic random access memory (SDRAM), a double data rate synchronous dynamic random access memory (DDRSDRAM), an enhanced synchronous dynamic random access memory (ESDRAM), a synchlink dynamic random access memory (SLDRAM) or a direct Rambus random access memory (DRRAM). The memory 909 in the embodiment of the present application includes but is not limited to these and any other suitable types of memories.

The processor 910 may include one or more processing units; optionally, the processor 910 integrates an application processor and a modem processor, wherein the application processor mainly processes operations related to an operating system, a user interface, and application programs, and the modem processor mainly processes wireless communication signals, such as a baseband processor. It is understandable that the modem processor may not be integrated into the processor 910.

An embodiment of the present application also provides a readable storage medium, on which a program or instruction is stored. When the program or instruction is executed by a processor, each process of the above-mentioned sample construction method embodiment is implemented, and the same technical effect can be achieved. To avoid repetition, it will not be repeated here.

The processor is the processor in the electronic device described in the above embodiment. The readable storage medium includes a computer readable storage medium, such as a computer read-only memory ROM, a random access memory RAM, a magnetic disk or an optical disk.

An embodiment of the present application further provides a chip, which includes a processor and a communication interface, wherein the communication interface is coupled to the processor, and the processor is used to run programs or instructions to implement the various processes of the above-mentioned sample construction method embodiment, and can achieve the same technical effect. To avoid repetition, it will not be repeated here.

It should be understood that the chip mentioned in the embodiments of the present application can also be called a system-level chip, a system chip, a chip system or a system-on-chip chip, etc.

An embodiment of the present application provides a computer program product, which is stored in a storage medium. The program product is executed by at least one processor to implement the various processes of the sample construction method embodiment described above, and can achieve the same technical effect. To avoid repetition, it will not be described here.

It should be noted that, in this document, the terms "include", "comprise" or any other variations thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device including a series of elements includes not only those elements, but also other elements not explicitly listed, or also includes elements inherent to such process, method, article or device. In the absence of further restrictions, an element defined by the phrase "comprising a..." does not exclude the presence of other identical elements in the process, method, article or device that includes the element. In addition, it should be noted that the scope of the methods and devices in the embodiments of the present application is not limited to performing functions in the order shown or discussed, and may also include performing functions in a substantially simultaneous manner or in a reverse order according to the functions involved. For example, the described method may be performed in an order different from that described, and various steps may be added, omitted, or combined. In addition, the features described with reference to certain examples may be combined in other examples.

Through the description of the above implementation methods, those skilled in the art can clearly understand that the above-mentioned embodiment methods can be implemented by means of software plus a necessary general hardware platform, and of course by hardware, but in many cases the former is a better implementation method. Based on such an understanding, the technical solution of the present application, or the part that contributes to the prior art, can be embodied in the form of a computer software product, which is stored in a storage medium (such as ROM/RAM, a disk, or an optical disk), and includes a number of instructions for a terminal (which can be a mobile phone, a computer, a server, or a network device, etc.) to execute the methods described in each embodiment of the present application.

The embodiments of the present application are described above in conjunction with the accompanying drawings, but the present application is not limited to the above-mentioned specific implementation methods. The above-mentioned specific implementation methods are merely illustrative and not restrictive. Under the guidance of the present application, those of ordinary skill in the art can also devise many other forms without departing from the purpose of the present application and the scope of protection of the claims, all of which fall within the protection of the present application.

Claims

A sample construction method, the method comprising:

Obtaining a parallel corpus training sample, wherein the parallel corpus training sample includes an original text and carries a standard type label corresponding to each keyword in the original text;

Replacing a first keyword in the original text with at least one first non-standard word corresponding to the first keyword to generate at least one extended text;

Replacing the first standard type label corresponding to the first keyword with the second standard type label corresponding to the first non-standard word, to obtain the parallel corpus training sample after the label is replaced;

Constructing a target training sample based on the parallel corpus training sample after replacing the label and the at least one extended text.

The method according to claim 1, wherein the parallel corpus training sample is a parallel corpus training sample in a parallel corpus training sample set;

The step of replacing the first keyword in the original text with at least one first non-standard word corresponding to the first keyword to generate at least one extended text includes:

Based on the word frequency of each keyword in the original text in the parallel corpus training sample set, at least one first keyword is determined from the original text, and each first keyword in the at least one first keyword in the original text is replaced with a first non-standard word corresponding to each first keyword to generate a first extended text;

The first extended text is any extended text among the at least one extended text.

The method according to claim 1, wherein the parallel corpus training sample is a parallel corpus training sample in a parallel corpus training sample set, and the number of the extended texts is N, where N is a positive integer;

After replacing the first keyword in the original text with at least one first non-standard word corresponding to the first keyword to generate at least one extended text, the method further includes:

When a second extended text among the N extended texts contains an unregistered word that is not included in the parallel corpus training sample set, initializing feature information of the unregistered word;

The initialization process includes at least one of the following:

According to the first keyword corresponding to the unregistered word and the word frequency of each non-standard word corresponding to the first keyword corresponding to the unregistered word in each of the N extended texts in the parallel corpus training sample set, weighted averaging is performed on the feature information of the unregistered word;

Using the feature information of the cognate words corresponding to the unregistered words, weighted averaging the feature information of the unregistered words;

Setting the feature information of the unregistered word to 0;

Randomly initializing the feature information of the unregistered word.
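The initialization options listed above can be sketched as follows, purely as an illustration. Feature information is modeled as a plain list of floats, the two weighted-averaging options share one helper (with word frequencies, or any other weights, passed in by the caller), and the function names, the `strategy` strings and the random range of ±0.1 are assumptions rather than anything specified in the claim.

```python
import random

def weighted_average(vectors, weights):
    """Weighted mean of equal-length vectors."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[i] for v, w in zip(vectors, weights)) / total
            for i in range(dim)]

def init_oov_vector(strategy, dim, related_vectors=None, weights=None, seed=0):
    """Initialize the feature vector of an unregistered (out-of-vocabulary) word."""
    if strategy == "weighted":
        # related_vectors: features of the corresponding keyword and its
        # non-standard variants, or of cognate words; weights: e.g. word frequencies.
        return weighted_average(related_vectors, weights)
    if strategy == "zero":
        return [0.0] * dim
    if strategy == "random":
        rng = random.Random(seed)  # seeded only to keep the example reproducible
        return [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    raise ValueError(f"unknown strategy: {strategy}")

vec = init_oov_vector("weighted", 2,
                      related_vectors=[[1.0, 0.0], [0.0, 1.0]],
                      weights=[3, 1])
```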
The method according to claim 1, wherein the parallel corpus training sample is a parallel corpus training sample in a parallel corpus training sample set;

After constructing a target training sample based on the parallel corpus training sample after replacing the label and the at least one extended text, the method further includes:

Restoring at least one non-standard word in the first translation text to a standard word to generate M second translation texts, wherein one non-standard word is restored to at least one standard word;

Inputting first feature information corresponding to the first translated text and second feature information corresponding to X second translated texts among the M second translated texts into a first translation model for text translation to obtain a target translated text, wherein the first feature information includes text feature information of the first translated text and feature information of standard type labels corresponding to non-standard words in the first translated text, and the second feature information includes text feature information of the second translated text and feature information of standard type labels corresponding to non-standard words in the second translated text;

Wherein the first translation model is obtained by training based on a target training sample set, the target training sample set includes multiple target training samples, one target training sample corresponds to a parallel corpus training sample in the parallel corpus training sample set, M and X are positive integers, and X is less than or equal to M.
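To make the relationship between M and X concrete, the following sketch enumerates restored texts when a non-standard word has more than one candidate standard word, then keeps X of them. It restores every non-standard word at once and keeps the first X candidates, both of which are simplifying assumptions; the claim leaves the selection rule open, and the `restorations` table and function names are hypothetical.

```python
from itertools import product

def restore_candidates(tokens, restorations):
    """Return every text obtained by restoring non-standard tokens.

    restorations maps a non-standard word to its candidate standard words;
    tokens that are not in the table are kept unchanged.
    """
    options = [restorations.get(t, [t]) for t in tokens]
    return [" ".join(choice) for choice in product(*options)]

def select_x(candidates, x):
    """Keep X of the M candidates (assumed rule: the first X)."""
    return candidates[:x]

tokens = ["i", "luv", "u"]
restorations = {"luv": ["love"], "u": ["you", "university"]}  # hypothetical table

second_texts = restore_candidates(tokens, restorations)  # M texts
model_inputs = select_x(second_texts, 1)                 # X <= M
```

The feature extraction and the translation model itself are outside the scope of this sketch.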
The method according to claim 4, wherein before restoring at least one non-standard word in the first translation text to a standard word to generate M second translation texts, the method further comprises:

Inputting the first translation text into a first word segmentation model, performing word segmentation on the first translation text to obtain K word segments, and performing non-standard word recognition on each of the K word segments to obtain a recognition result corresponding to each word segment, wherein the recognition result corresponding to a word segment is used to indicate whether the word segment is a non-standard word, and when the word segment is a non-standard word, the recognition result corresponding to the word segment includes the standard type corresponding to the word segment;

The first word segmentation model is trained based on a target training sample set, and K is an integer greater than 1.
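The shape of the recognition result described above can be illustrated with a small stand-in. The claim uses a trained word segmentation model; the sketch below replaces it with whitespace splitting and a lookup table, which is an assumption made only to show the output structure. The `Recognition` class, the `lexicon` table and the type name `"typo"` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Recognition:
    segment: str
    is_non_standard: bool
    standard_type: Optional[str] = None  # set only for non-standard segments

def segment_and_recognize(text, lexicon):
    """Split text into segments and flag the non-standard ones.

    lexicon maps a known non-standard word to its type; this lookup is a
    placeholder for the trained model recited in the claim.
    """
    results = []
    for seg in text.split():
        kind = lexicon.get(seg)
        results.append(Recognition(seg, kind is not None, kind))
    return results

results = segment_and_recognize("i luv you", {"luv": "typo"})
```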
The method according to any one of claims 1 to 4, wherein the non-standard words include at least one of the following situations: including phonetic spelling, including typos, including homologous word replacement, and including glyph errors.
A sample construction device, the device comprising: an acquisition module, a processing module and a construction module;

The acquisition module is used to acquire a parallel corpus training sample, wherein the parallel corpus training sample includes an original text and carries a standard type label corresponding to each keyword in the original text;

The processing module is used to replace the first keyword in the original text in the parallel corpus training sample acquired by the acquisition module with at least one first non-standard word corresponding to the first keyword, so as to generate at least one extended text;

The processing module is further configured to replace the first standard type label corresponding to the first keyword in the parallel corpus training sample acquired by the acquisition module with the second standard type label corresponding to the first non-standard word, so as to obtain the parallel corpus training sample after the label is replaced;

The construction module is used to construct a target training sample based on the parallel corpus training sample after the label is replaced and processed by the processing module and the at least one extended text.
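One possible reading of the processing and construction steps is sketched below: for each non-standard variant of a keyword, the keyword is replaced in the text and its type label is replaced by the variant's label, and the results are collected as target training samples. Producing one target sample per variant, the dictionary layout of a sample, and the label names are assumptions of this example, not details given in the claim.

```python
def construct_target_samples(sample, keyword, variants):
    """Build one target sample per (non_standard_word, type_label) variant.

    sample: {"tokens": [...], "labels": {word: type_label}, "translation": str}
    The input sample is not modified.
    """
    targets = []
    for word, label in variants:
        # Processing step 1: replace the keyword to get an extended text.
        tokens = [word if t == keyword else t for t in sample["tokens"]]
        # Processing step 2: replace the keyword's label with the variant's label.
        labels = {k: v for k, v in sample["labels"].items() if k != keyword}
        labels[word] = label
        # Construction step: pair the extended text with the relabeled sample.
        targets.append({"tokens": tokens, "labels": labels,
                        "translation": sample["translation"]})
    return targets

sample = {"tokens": ["see", "you", "tomorrow"],
          "labels": {"tomorrow": "standard"},
          "translation": "<target-language text>"}  # placeholder
variants = [("tomorow", "typo"), ("2morrow", "phonetic")]  # hypothetical

targets = construct_target_samples(sample, "tomorrow", variants)
```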
The device according to claim 7, wherein the parallel corpus training sample is a parallel corpus training sample in a parallel corpus training sample set;

The processing module is specifically used for:

Based on the word frequency of each keyword in the original text in the parallel corpus training sample set, at least one first keyword is determined from the original text, and each first keyword in the at least one first keyword in the original text is replaced with a first non-standard word corresponding to the first keyword to generate a first extended text;

The first extended text is any extended text among the at least one extended text.
The device according to claim 7, wherein the parallel corpus training sample is a parallel corpus training sample in a parallel corpus training sample set, and the number of the extended texts is N, where N is a positive integer;

The processing module is further configured to, after replacing a first keyword in the original text with at least one first non-standard word corresponding to the first keyword to generate at least one extended text, initialize word feature information of the unregistered word when a second extended text among the N extended texts includes an unregistered word that is not included in the parallel corpus training sample set;

The initialization process includes one of the following:

According to the first keyword corresponding to the unregistered word and the word frequency of each non-standard word corresponding to the first keyword corresponding to the unregistered word in each of the N extended texts in the parallel corpus training sample set, weighted averaging the word feature information of the unregistered word;

Using the feature information of the cognate words corresponding to the unregistered words, weighted averaging the feature information of the unregistered words;

Setting the feature information of the unregistered word to 0;

Randomly initializing the feature information of the unregistered word.
The device according to claim 7, wherein the parallel corpus training sample is a parallel corpus training sample in a parallel corpus training sample set;

The device also includes: a translation module;

The processing module is further configured to restore at least one non-standard word in the first translation text to a standard word after the construction module constructs the target training sample based on the parallel corpus training sample after replacing the label and the at least one extended text, so as to generate M second translation texts, wherein one non-standard word is restored to at least one standard word;

The translation module is used to input the first feature information corresponding to the first translation text and the second feature information corresponding to the X second translation texts among the M second translation texts obtained by the processing module into the first translation model for text translation to obtain a target translated text, wherein the first feature information includes text feature information of the first translated text and feature information of standard type labels corresponding to non-standard words in the first translated text, and the second feature information includes text feature information of the second translated text and feature information of standard type labels corresponding to non-standard words in the second translated text;

Wherein the first translation model is obtained by training based on a target training sample set, the target training sample set includes multiple target training samples, one target training sample corresponds to a parallel corpus training sample in the parallel corpus training sample set, M and X are positive integers, and X is less than or equal to M.
The device according to claim 10, further comprising: a word segmentation module;

The word segmentation module is used for, before the processing module restores at least one non-standard word in the first translation text to a standard word to generate M second translation texts, inputting the first translation text into a first word segmentation model, performing word segmentation on the first translation text to obtain K word segments, and performing non-standard word recognition on each of the K word segments to obtain a recognition result corresponding to each word segment, wherein the recognition result corresponding to a word segment is used to indicate whether the word segment is a non-standard word, and when the word segment is a non-standard word, the recognition result corresponding to the word segment includes the standard type corresponding to the word segment;

The first word segmentation model is trained based on a target training sample set, and K is an integer greater than 1.
The device according to any one of claims 7 to 10, wherein the non-standard words include at least one of the following situations: including pinyin reading and writing, including typos, including homologous word replacement, and including glyph errors.
An electronic device comprises a processor and a memory, wherein the memory stores a program or instruction that can be run on the processor, and when the program or instruction is executed by the processor, the steps of the sample construction method according to any one of claims 1 to 6 are implemented.
A readable storage medium stores a program or instruction, and when the program or instruction is executed by a processor, the steps of the sample construction method according to any one of claims 1 to 6 are implemented.
A chip, comprising a processor and a communication interface, wherein the communication interface is coupled to the processor, and the processor is used to run a program or instruction to implement the steps of the sample construction method according to any one of claims 1 to 6.
A computer program product, wherein the program product is stored in a storage medium and is executed by at least one processor to implement the steps of the sample construction method according to any one of claims 1 to 6.
An electronic device, characterized in that the electronic device is configured to execute the steps of the sample construction method according to any one of claims 1 to 6.