WO2024164976A1 - Sample construction method and apparatus, and electronic device and readable storage medium - Google Patents

Info

Publication number
WO2024164976A1
WO2024164976A1 PCT/CN2024/075789 CN2024075789W WO2024164976A1 WO 2024164976 A1 WO2024164976 A1 WO 2024164976A1 CN 2024075789 W CN2024075789 W CN 2024075789W WO 2024164976 A1 WO2024164976 A1 WO 2024164976A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
text
standard
training sample
translation
Prior art date
Application number
PCT/CN2024/075789
Other languages
French (fr)
Chinese (zh)
Inventor
王承之
Original Assignee
维沃移动通信有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 维沃移动通信有限公司
Publication of WO2024164976A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3338 Query expansion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3346 Query execution using probabilistic model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Definitions

  • the present application belongs to the field of artificial intelligence technology, and specifically relates to a sample construction method, device, electronic device and readable storage medium.
  • the purpose of the embodiments of the present application is to provide a sample construction method, device, electronic device and readable storage medium, which can solve the problem of how to construct richer parallel corpus training samples.
  • an embodiment of the present application provides a sample construction method, which includes: obtaining a parallel corpus training sample, the parallel corpus training sample contains an original text and carries a standard type label corresponding to each keyword in the original text; replacing a first keyword in the original text with at least one first non-standard word corresponding to the first keyword to generate at least one extended text; replacing a first standard type label corresponding to the first keyword with a second standard type label corresponding to the first non-standard word to obtain a parallel corpus training sample after label replacement; and constructing a target training sample based on the parallel corpus training sample after label replacement and at least one extended text.
  • an embodiment of the present application provides a sample construction device, which includes an acquisition module, a processing module and a construction module. The acquisition module is used to acquire a parallel corpus training sample, where the parallel corpus training sample contains an original text and carries a standard type label corresponding to each keyword in the original text. The processing module is used to replace the first keyword in the original text of the parallel corpus training sample acquired by the acquisition module with at least one first non-standard word corresponding to the first keyword to generate at least one extended text. The processing module is further used to replace the first standard type label corresponding to the first keyword in the parallel corpus training sample acquired by the acquisition module with the second standard type label corresponding to the first non-standard word, to obtain a label-replaced parallel corpus training sample. The construction module is used to construct the target training sample based on the label-replaced parallel corpus training sample produced by the processing module and the at least one extended text.
  • an embodiment of the present application provides an electronic device, which includes a processor and a memory, wherein the memory stores programs or instructions that can be run on the processor, and when the program or instructions are executed by the processor, the steps of the method described in the first aspect are implemented.
  • an embodiment of the present application provides a readable storage medium, on which a program or instruction is stored, and when the program or instruction is executed by a processor, the steps of the method described in the first aspect are implemented.
  • an embodiment of the present application provides a chip, comprising a processor and a communication interface, wherein the communication interface is coupled to the processor, and the processor is used to run a program or instruction to implement the method described in the first aspect.
  • an embodiment of the present application provides a computer program product, which is stored in a storage medium and is executed by at least one processor to implement the method described in the first aspect.
  • a parallel corpus training sample is obtained, the parallel corpus training sample includes an original text and carries a standard type label corresponding to each keyword in the original text; the first keyword in the original text is replaced with at least one first non-standard word corresponding to the first keyword to generate at least one extended text; the first standard type label corresponding to the first keyword is replaced with a second standard type label corresponding to the first non-standard word to obtain a parallel corpus training sample after the label is replaced; based on the parallel corpus training sample after the label is replaced and at least one extended text, a target training sample is constructed.
  • because the sample construction device can replace keywords in the original text of the parallel corpus training sample, at least one extended text is generated, which expands the vocabulary covered by the parallel corpus training sample; at the same time, the standard type label corresponding to the keyword is replaced with the standard type label corresponding to the non-standard word, and the resulting label-replaced parallel corpus training sample enriches the content of the parallel corpus training sample.
  • the sample construction device can then construct a target training sample based on the label-replaced parallel corpus training sample and the at least one extended text. The target training sample can therefore include non-standard words and their corresponding standard type labels, which enriches the parallel corpus training samples and gives them more, and more flexible, training content.
  • FIG. 1 is a schematic diagram of an example of non-standard words provided in an embodiment of the present application.
  • FIG. 2 is a flow chart of a sample construction method provided in an embodiment of the present application.
  • FIG. 3 is a first example schematic diagram of a sample construction method provided in an embodiment of the present application.
  • FIG. 4 is a second example schematic diagram of a sample construction method provided in an embodiment of the present application.
  • FIG. 5 is a third example schematic diagram of a sample construction method provided in an embodiment of the present application.
  • FIG. 6 is a flow chart of translation by a translation model provided in an embodiment of the present application.
  • FIG. 7 is a schematic diagram of the structure of a sample construction device provided in an embodiment of the present application.
  • FIG. 8 is a first schematic diagram of the hardware structure of an electronic device provided in an embodiment of the present application.
  • FIG. 9 is a second schematic diagram of the hardware structure of an electronic device provided in an embodiment of the present application.
  • the terms "first", "second", etc. in the specification and claims of the present application are used to distinguish similar objects, and are not used to describe a specific order or sequence. It should be understood that terms used in this way are interchangeable under appropriate circumstances, so that the embodiments of the present application can be implemented in an order other than those illustrated or described here. The objects distinguished by "first", "second", etc. are generally of one type, and the number of objects is not limited.
  • for example, there can be one or more first objects.
  • "and/or" in the specification and claims represents at least one of the connected objects, and the character "/" generally indicates that the objects before and after it are in an "or" relationship.
  • Cognate characters/words: Languages or scripts from closely related branches often share many characters/words with the same linguistic origin. These characters/words have similar pronunciations, spellings or meanings, and their character forms are easily confused. Examples include Chinese and Japanese, both written with Chinese characters; English and German, both belonging to the West Germanic branch; and simplified and traditional Chinese. Due to input errors and other reasons, words in the text to be translated may be replaced by cognates, which lowers the quality of the translated text.
  • Kana: A phonetic writing system of Japanese. There are two kana scripts, hiragana and katakana, which can be converted into each other. Each kana represents a syllable. Kanji in Japanese can be transcribed into kana according to their pronunciation, similar to pinyin for Chinese. Kana is also part of written Japanese in its own right, used to write native vocabulary and grammatical auxiliary words.
  • Kanji: Chinese characters used in Japanese. Together with kana they constitute written Japanese, and they are often used to represent the names of objects, actions, etc. There are about 2000-3000 commonly used kanji in modern Japanese. Their shapes are highly similar to those of Chinese characters, with certain overlaps with and differences from simplified and traditional characters.
  • Original text: The text to be translated. The language of the original text is not restricted.
  • Translation: The result of translating the original text with the translation model. The language of the translation is not restricted.
  • Language model: A model used to calculate the probability of a sentence (i.e., the probability that a sequence of words forms a normal sentence). Its core is to calculate the probability of the current word given the preceding n words in the sentence. Perplexity is usually used as the evaluation metric.
  • Perplexity: An indicator for evaluating the quality of a sentence. The higher the perplexity, the harder the sentence is to understand, that is, the less likely it is to be a fluent and semantically correct sentence.
  • Morphology: The study of words in sentences, including their structure, form and part of speech, such as nouns, adjectives, adverbs, and singular and plural forms in English.
  • Syntactic structure: The relationship between sentence components and the rules or processes by which they form sentences, such as the common "subject-predicate-object" structure.
  • Sequence labeling: Given a sentence, label each word in the sentence, that is, predict the category label of each word.
  • Word segmentation: A type of sequence labeling task. For languages such as Chinese and Japanese, which are written without spaces between words, the word segmentation model can segment sentences at the word level and predict category labels such as the lexical and syntactic structures of the words. The word segmentation model trained in this solution also predicts the extended forms of non-standard words (for example, pronunciation spellings, cognates, easily confused words, etc.).
  • the text input into the translation model may contain non-standard words, that is, words whose expressions do not conform to conventional grammar.
  • for example, words in the text may be transcribed into the pronunciation spelling of the language (such as Chinese pinyin or Japanese kana) for teaching or examinations; incorrect input while typing may also introduce pronunciation spellings, typos, cognates, etc.
  • the recognition results of front-end modules such as image text recognition and speech recognition may contain glyph similarity errors, transcoding errors and similar problems, which can also cause the downstream translation model to receive non-standard text. Text sequences containing these irregular or erroneous words are usually uncommon sequences, that is, their expressions do not conform to conventional grammatical, lexical or syntactic structures, so the translation model usually finds it difficult to translate such words correctly.
  • Japanese characters have two systems: kana and kanji.
  • Japanese kanji are highly similar to Chinese characters, and have certain overlaps and differences with simplified Chinese characters (hereinafter referred to as Simplified Chinese) and traditional Chinese characters (hereinafter referred to as Traditional Chinese), as shown in Table 1.
  • Japanese kana can carry meaning of its own and be used in written expressions, and can also be used to spell the pronunciation of kanji.
  • to save effort, many users do not write the standard kanji but replace them directly with their kana pronunciation, as shown in Figure 1.
  • because many different words share the same kana pronunciation, one kana string can have a large number of possible meanings, which produces many non-standard expressions of Japanese kanji words.
  • in the sample construction method provided in the embodiment of the present application, the sample construction device can replace keywords in the original text of the parallel corpus training sample to generate at least one extended text, which expands the vocabulary covered by the parallel corpus training sample; at the same time, the standard type label corresponding to the keyword is replaced with the standard type label corresponding to the non-standard word, and the resulting label-replaced parallel corpus training sample enriches the content of the parallel corpus training sample.
  • the sample construction device can then construct a target training sample based on the label-replaced parallel corpus training sample and the at least one extended text. The target training sample can therefore contain non-standard words and their corresponding standard type labels, which enriches the parallel corpus training sample and gives it more, and more flexible, training content.
  • the sample construction method provided in the embodiment of the present application may be executed by a sample construction device.
  • the sample construction device may be an electronic device, or a component in the electronic device, such as an integrated circuit or a chip.
  • the sample construction method provided in the embodiment of the present application is described below, taking execution by the sample construction device as an example.
  • an embodiment of the present application provides a sample construction method.
  • Figure 2 shows a flow chart of the sample construction method provided by the embodiment of the present application; the method can be executed by a sample construction device.
  • the sample construction method provided by the embodiment of the present application can include the following steps 201 to 204.
  • Step 201: Obtain a parallel corpus training sample.
  • the parallel corpus training sample may include the original text and carry the standard type label corresponding to each keyword in the original text.
  • the parallel corpus training sample may be a bilingual or multilingual corpus consisting of an original text and its parallel corresponding target text.
  • the original text may be a text that contains no non-standard words.
  • the keyword can be any word in the original text.
  • the standard type label may indicate the standard type of the keyword.
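  • as a minimal sketch of the sample described in step 201, the structure below holds a segmented original text, one standard type label per keyword, and the parallel target text. The class names, the label string `"standard"` and the Japanese example sentence are illustrative assumptions and are not taken from the application.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical label value for a keyword that conforms to the standard.
STANDARD = "standard"

@dataclass
class Keyword:
    text: str
    label: str = STANDARD  # standard type label carried by the keyword

@dataclass
class ParallelSample:
    keywords: List[Keyword]  # original text, already segmented into keywords
    target_text: str         # parallel target-language text

    @property
    def original_text(self) -> str:
        # Japanese is written without spaces, so keywords are simply concatenated.
        return "".join(k.text for k in self.keywords)

# Illustrative sample (assumed data, not from the application).
sample = ParallelSample(
    keywords=[Keyword("先生"), Keyword("は"), Keyword("優しい")],
    target_text="The teacher is gentle",
)
```

Keeping the label next to each keyword means that a later replacement of the word and of its label can be applied to the same object.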
  • the extended form of a word may coincide with another standard word in the standard vocabulary; for example, one kana string can be the kana transcription of a surname and also a noun in its own right, so it is difficult to identify all non-standard words with rule-based methods.
  • rule-based methods also have difficulty identifying word boundaries accurately, so it is difficult to translate all words in the text to be translated accurately with rule-based methods.
  • the sample construction device in the sample construction method provided in the embodiment of the present application can use text data (i.e., original text) annotated with information such as lexical and syntactic structures and, on this basis, add the standard type label corresponding to each keyword.
  • Step 202: Replace the first keyword in the original text with at least one first non-standard word corresponding to the first keyword to generate at least one extended text.
  • the sample construction device may replace any keyword in the original text with at least one first non-standard word corresponding to it, to obtain a plurality of extended texts with the same semantics but different degrees of standardization.
  • annotation information of the extended text, such as part of speech and syntactic structure, may be kept consistent with the annotation information of the original text.
  • the non-standard words mentioned above may be words whose expressions do not conform to conventional grammatical, lexical or syntactic structures.
  • a non-standard word may involve at least one of the following: pronunciation spelling, typos, cognate replacement, and glyph errors.
  • "replacing the first keyword in the original text with at least one first non-standard word corresponding to the first keyword" can be understood as: replacing the standard keyword with a non-standard word that shares its origin, has the same or a similar pronunciation, or has a similar glyph, and that does not conform to conventional grammatical, lexical or syntactic structures.
  • for example, the sample construction device can replace a keyword with the word meaning "church", which has the same pronunciation, or with a non-standard word romanized as "bianjie".
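  • the replacement in step 202 can be sketched as follows. The variant table, the function name `extend` and the example sentence are assumptions made for illustration; the application does not prescribe a data structure.

```python
from typing import Dict, List

# Hypothetical variant table: standard word -> non-standard forms
# (here hiragana and katakana pronunciation spellings).
VARIANTS: Dict[str, List[str]] = {
    "先生": ["せんせい", "センセイ"],
}

def extend(words: List[str], index: int, variants: Dict[str, List[str]]) -> List[str]:
    """Return one extended text per non-standard variant of words[index]."""
    texts = []
    for variant in variants.get(words[index], []):
        # Replace only the chosen keyword; all other words stay unchanged.
        replaced = words[:index] + [variant] + words[index + 1:]
        texts.append("".join(replaced))
    return texts

extended = extend(["先生", "は", "優しい"], 0, VARIANTS)
```

Each returned text keeps the meaning of the original sentence and differs only in how the chosen keyword is written.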
  • the parallel corpus training sample may be a parallel corpus training sample in a parallel corpus training sample set.
  • the above step 202 may include the following step 202a.
  • Step 202a: Based on the word frequency, in the parallel corpus training sample set, of each keyword in the original text, determine at least one first keyword from the original text, and replace each of the at least one first keyword in the original text with the first non-standard word corresponding to that keyword to generate a first extended text.
  • the first extended text is any extended text among the at least one extended text mentioned above.
  • the sample construction device may replace the keywords in the original text based on the word frequency of each keyword in the original text in the parallel corpus training sample set.
  • the first keyword in the original text may be replaced with its corresponding first non-standard word according to a setting based on its word frequency in the parallel corpus training sample set.
  • for example, the keywords meaning "very", "trustworthy" and "gentle" are replaced, according to their word frequencies in the parallel corpus training sample set, with forms containing pronunciation spelling (i.e., the standard type label becomes pronunciation spelling-Hiragana).
  • because the sample construction device replaces keywords based on their frequency in the parallel corpus training sample set, high-frequency keywords can be replaced more often with their corresponding non-standard words, so that the generated extended texts cover as many of the possible non-standard forms of the original text as possible and the subsequent training of the translation model is more comprehensive.
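  • one possible way to turn word frequency into a replacement count is sketched below. The application only states that high-frequency keywords can be replaced more times; the proportional rule, the function name `replacement_budget` and the toy corpus are assumptions.

```python
from collections import Counter
from typing import Dict, List

def replacement_budget(corpus: List[List[str]], max_variants: int) -> Dict[str, int]:
    """Map each word to the number of variants to generate, scaled by word frequency."""
    freq = Counter(word for sentence in corpus for word in sentence)
    top = max(freq.values())
    # Assumed rule: the most frequent word gets max_variants replacements,
    # other words get a proportional share, and every word gets at least one.
    return {w: max(1, round(max_variants * c / top)) for w, c in freq.items()}

corpus = [
    ["先生", "は", "優しい"],
    ["先生", "が", "来る"],
    ["先生", "は", "若い"],
]
budget = replacement_budget(corpus, max_variants=3)
```

With this toy corpus the most frequent word receives the full budget and words that occur once receive a single replacement.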
  • Step 203: Replace the first standard type label corresponding to the first keyword with the second standard type label corresponding to the first non-standard word, to obtain a label-replaced parallel corpus training sample.
  • a standard type label may indicate the standard type of a word.
  • when a word is a standard word (i.e., the first keyword), its standard type label (i.e., the first standard type label) can indicate that it is a standard word; when a word is not a standard word (i.e., the first non-standard word), its standard type label (i.e., the second standard type label) can indicate its non-standard form.
  • the second standard type label may take many forms, such as pronunciation spelling-Hiragana, pronunciation spelling-Katakana, cognates-Simplified Chinese, cognates-Traditional Chinese, easily confused words-Simplified Chinese, easily confused words-Traditional Chinese, easily confused words-recombination, etc.
  • Step 204: Construct a target training sample based on the label-replaced parallel corpus training sample and the at least one extended text.
  • the sample construction device may associate the non-standard words in the extended text with the standard type labels corresponding to them in the parallel corpus training sample after the labels are replaced, to obtain a target training sample.
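  • steps 203 and 204 together can be sketched as below: each variant replaces both the keyword and its label, and the result is paired with the unchanged target text. The dictionary layout, the function name `build_target_samples` and the example data are illustrative assumptions; only the label names follow the forms listed above.

```python
from typing import Dict, List, Tuple

Tagged = List[Tuple[str, str]]  # (word, standard type label) pairs

def build_target_samples(
    original: Tagged,
    target_text: str,
    variants: Dict[str, List[Tuple[str, str]]],
) -> List[dict]:
    """Create one target training sample per (keyword, non-standard variant) pair."""
    samples = []
    for index, (word, _) in enumerate(original):
        for variant, label in variants.get(word, []):
            tagged = list(original)
            tagged[index] = (variant, label)  # replace the word and its label together
            samples.append({
                "source": "".join(w for w, _ in tagged),
                "labels": [l for _, l in tagged],
                "target": target_text,  # the parallel text is unchanged
            })
    return samples

original = [("先生", "standard"), ("は", "standard"), ("優しい", "standard")]
variants = {
    "先生": [
        ("せんせい", "pronunciation spelling-Hiragana"),
        ("センセイ", "pronunciation spelling-Katakana"),
    ],
}
samples = build_target_samples(original, "The teacher is gentle", variants)
```

Because the word and its label are replaced in the same step, the non-standard word in each extended text stays associated with the label that describes its non-standard form.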
  • in the sample construction method provided in the embodiment of the present application, the sample construction device can replace keywords in the original text of the parallel corpus training sample to generate at least one extended text, which expands the vocabulary covered by the parallel corpus training sample; at the same time, the standard type label corresponding to the keyword is replaced with the standard type label corresponding to the non-standard word, and the resulting label-replaced parallel corpus training sample enriches the content of the parallel corpus training sample.
  • the sample construction device can then construct a target training sample based on the label-replaced parallel corpus training sample and the at least one extended text. The target training sample can therefore contain non-standard words and their corresponding standard type labels, which enriches the parallel corpus training sample and gives it more, and more flexible, training content.
  • the sample construction method provided in the embodiment of the present application may further include the following step 205.
  • Step 205: When a second extended text among the N extended texts contains an unregistered word that is not included in the parallel corpus training sample set, initialize the feature information of the unregistered word.
  • the initialization process includes at least one of the following: taking a weighted average of the feature information of the unregistered word according to the frequency of each non-standard word corresponding to the first keyword corresponding to the unregistered word in each of the N extended texts in the parallel corpus training sample set; taking a weighted average of the feature information of the unregistered word using the feature information of the cognate word corresponding to the unregistered word; setting the feature information of the unregistered word to 0; and randomly initializing the feature information of the unregistered word.
  • the first extended text and the second extended text may be the same or different.
  • the sample construction device can convert the obtained extended text into a word vector sequence corresponding to the model training based on the feature information of the word.
  • the sample construction device can obtain a word vector sequence through algorithms such as the word-to-vector (Word2Vec) algorithm and the GloVe algorithm (a regression algorithm based on global word frequency statistics), or can obtain a word vector sequence through training iterations in a translation model such as Transformer.
  • sample construction device can obtain the word vector sequence corresponding to the extended text in any possible way, and this application does not make any specific limitation.
  • any combination of the following methods can be used to initialize the feature information of the unregistered words to obtain the corresponding word vector: (1) according to the first keyword corresponding to the unregistered word and the word frequency of each non-standard word corresponding to the first keyword corresponding to the unregistered word in each of the N extended texts in the parallel corpus training sample set, the feature information of the unregistered word is obtained by weighted averaging; (2) using the feature information of the cognate word corresponding to the unregistered word, the feature information of the unregistered word is obtained by weighted averaging; (3) the feature information of the unregistered word is set to 0; (4) the feature information of the unregistered word is randomly initialized.
  • the sample construction device can also randomly initialize the feature information of the standard type labels corresponding to the unregistered words, or combine the standard type labels, and obtain the standard type labels and their feature information corresponding to the unregistered words by weighted averaging the feature information of the corresponding words.
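  • the frequency-weighted initialization can be sketched as follows, with the all-zero vector as a fallback. The function name `init_oov_vector`, the two-dimensional vectors and the counts are assumptions for illustration; the application does not fix how the related words or weights are chosen beyond using word frequency.

```python
from typing import Dict, List

def init_oov_vector(
    related: Dict[str, List[float]],
    freq: Dict[str, int],
    dim: int,
) -> List[float]:
    """Frequency-weighted average of the related words' vectors.

    `related` holds the vectors of the words associated with the unregistered
    word (for example its standard keyword and that keyword's other variants).
    Falls back to the all-zero vector when no related word has a count.
    """
    total = sum(freq.get(word, 0) for word in related)
    if total == 0:
        return [0.0] * dim
    vector = [0.0] * dim
    for word, values in related.items():
        weight = freq.get(word, 0) / total
        for i in range(dim):
            vector[i] += weight * values[i]
    return vector

# Illustrative vectors and counts (assumed data).
vectors = {"先生": [1.0, 0.0], "せんせい": [0.0, 1.0]}
counts = {"先生": 3, "せんせい": 1}
oov = init_oov_vector(vectors, counts, dim=2)
```

Starting an unregistered word near the vectors of its related forms, rather than at a random point, is what lets training begin from a state in which the non-standard form is already close to its standard counterpart.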
  • at the model level, because the sample construction device can initialize the feature information of the unregistered words, the training of the translation model can be enhanced: the translation model can learn the phonetic correlation between non-standard words and their corresponding standard words during training, which improves the robustness of its translation.
  • the sample construction method provided in the embodiment of the present application can improve the translation quality and translation accuracy of the translation model.
  • the sample construction method provided in the embodiment of the present application may further include the following steps 206 and 207.
  • Step 206: Restore at least one non-standard word in the first translation text to a standard word to generate M second translation texts.
  • a non-standard word is restored to at least one standard word.
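  • a minimal sketch of step 206 is shown below: every non-standard word is expanded into its candidate standard words, and each combination yields one second translation text. The candidate table, the function name `restore` and the example sentence are assumptions; how the candidates are obtained is covered by the identification methods described in the text.

```python
from itertools import product
from typing import Dict, List

def restore(words: List[str], candidates: Dict[str, List[str]]) -> List[str]:
    """Return every text obtained by restoring non-standard words to standard words."""
    # Words without candidates are kept as they are.
    options = [candidates.get(word, [word]) for word in words]
    return ["".join(choice) for choice in product(*options)]

# Illustrative candidates: one kana spelling shared by two standard words.
candidates = {"きょうかい": ["教会", "境界"]}
second_texts = restore(["きょうかい", "に", "行く"], candidates)
```

Here M equals the product of the candidate counts, so in practice the candidate lists would need to be kept short or ranked.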
  • the first translation text may be a sentence or a paragraph.
  • the first translation text may be text input by a user, or may be text acquired from another device.
  • the sample construction device can identify non-standard words in the first translation text through the following three methods: Method 1: an extended vocabulary construction method based on cognates, pronunciations, and character sets; Method 2: a word segmentation model method based on extended vocabulary enhancement; Method 3: an irregular translation detection method based on language model probability.
  • Method 1: Extended vocabulary construction method based on cognates, pronunciations, and character sets.
  • the extended form of a non-standard word in the first translation text may include all words that the non-standard word matches in the extended vocabulary.
  • the sample construction device in the sample construction method can construct the extended vocabulary by mining the similarities between words in different languages; the extended word representation is shown in Table 3.
  • the extended vocabulary may include: common pronunciation spellings of words and their variants; cognate or synonymous characters/words in other languages with close branches in the language system; easily confused words obtained by recombining a word and its cognates; easily confused words obtained by replacing a word with a word with a similar shape, etc.
  • cognates can be constructed by mining dictionary information of various languages.
  • easily confused words may be words that do not exist in their original language or cognate language.
  • the extended vocabulary may include multiple word sets, each of which may include one or more non-standard words and a standard word set corresponding to the non-standard word.
  • the sample construction device may identify non-standard words in the first translation text by character set detection, extended vocabulary matching, etc., and use the word set matched in the extended vocabulary as the first word set.
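  • character set detection combined with extended vocabulary matching can be sketched as follows. The vocabulary contents and the function names are illustrative assumptions; the Unicode range U+3040 to U+30FF covers the hiragana and katakana blocks.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical extended vocabulary: non-standard form -> standard word set.
EXTENDED_VOCAB: Dict[str, Set[str]] = {
    "きょうかい": {"教会", "境界"},
}

def is_kana_only(word: str) -> bool:
    """Character set detection: True if every character is hiragana or katakana."""
    return bool(word) and all("\u3040" <= ch <= "\u30ff" for ch in word)

def match_non_standard(
    words: List[str], vocab: Dict[str, Set[str]]
) -> List[Tuple[int, str, Set[str]]]:
    """Return (index, word, standard word set) for kana-only words found in the vocabulary."""
    return [
        (i, w, vocab[w])
        for i, w in enumerate(words)
        if is_kana_only(w) and w in vocab
    ]

matches = match_non_standard(["きょうかい", "に", "行く"], EXTENDED_VOCAB)
```

The particle in the example is written in kana but has no vocabulary entry, so it is not flagged; a match only proposes candidates, since a kana string can also be a standard word in its own right.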
  • Method 2: Word segmentation model method based on extended vocabulary enhancement.
  • the words in the first translation text can be replaced with any extended form in the extended vocabulary according to the word frequency setting in the parallel corpus training sample set, and the corresponding standard type label can be replaced, and the word segmentation model can be trained with the extended form of the corpus and its corresponding standard type label.
  • the sample construction method provided in the embodiment of the present application may further include the following step A.
  • Step A After the first translation text is input into the word segmentation model, the first translation text is segmented to obtain M word segments, and non-standard word recognition is performed on each of the M word segments to obtain a recognition result corresponding to each word segment.
  • M is an integer greater than 1.
  • the recognition result corresponding to a word segment is used to indicate whether that word segment is a non-standard word.
  • the word segmentation model may be a word segmentation model that has undergone enhanced training.
  • the word segmentation model that has undergone enhanced training can predict a standard type label for each word segment it outputs. If the predicted standard type label indicates that the word segment does not conform to the standard, the word segment is identified as a non-standard word.
  • since enhanced training enables the word segmentation model to recognize words, to learn the similarities between non-standard words and standard words in lexical structure, syntactic structure, contextual information, etc., and to predict standard type labels for the word segments it outputs, the word segmentation model can accurately segment the first translation text and identify the non-standard words in it.
  • Method 3 Irregular translation detection method based on language model probability.
  • the language model can be used to calculate the perplexity of the first translation text to determine whether the text contains non-standard expressions.
  • the sample construction device may input the first translation text into an n-gram language model, and calculate the conditional probability of the current word wi given the n words preceding it in the first translation text by using the following formula 1.
  • wi is the current word
  • N is the number of words in the first translation text.
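  • As a hedged reference, a standard n-gram conditional probability and the corresponding perplexity of a text W of N words can be written as follows. The context window (taken here as the n preceding words, following the wording above; a conventional n-gram model conditions on n-1 words) and the perplexity expression are assumptions and may differ from formula 1.

```latex
P(w_i \mid w_{i-n}, \dots, w_{i-1}), \qquad
\mathrm{PPL}(W) = \left( \prod_{i=1}^{N} P(w_i \mid w_{i-n}, \dots, w_{i-1}) \right)^{-1/N}
```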
  • the sample construction method provided in the embodiment of the present application may further include the following steps B1 to B4.
  • Step B1 Segment the first translation text into words to obtain M word segments.
  • M is an integer greater than 1.
  • the sample construction device may input the first translated text into an enhanced word segmentation model for word segmentation.
  • Step B2 For each of the M word segments, when the conditional probability corresponding to a word segment is less than a first preset threshold, obtain P first standard-compliant words corresponding to that word segment.
  • P is a positive integer.
  • if the conditional probability corresponding to the word segment is less than the first preset threshold, the word segment may be a non-standard word.
  • the P first standard-compliant words may be P standard-compliant words in the standard-compliant word set matched by the word segment in the extended vocabulary.
  • Step B3 Replace the word segment in the first translation text with each of the P first standard-compliant words, to obtain P replaced first translation texts.
  • Step B4 If the first perplexity corresponding to any replaced first translation text is smaller than the second perplexity corresponding to the first translation text, and the difference between the first perplexity and the second perplexity is greater than a second preset threshold, the sample construction device determines that the word segment is a non-standard word.
  • since the sample construction device can replace a possibly non-standard word in the first translation text with each corresponding first standard-compliant word and calculate the perplexity of the first translation text before and after the replacement, the word is determined to be a non-standard word when the perplexity after replacement drops by more than the second preset threshold. Therefore, non-standard words are recognized more accurately, and the replaced first translation text is more fluent and reasonable, so that the subsequent translation is more accurate.
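  • Steps B2 to B4 can be sketched as follows. The toy corpus, the bigram model with add-one smoothing, and the names `cond_prob`, `perplexity` and `detect_non_standard` are assumptions made for illustration; the text only requires some language model that yields conditional probabilities and perplexities.

```python
import math
from collections import Counter

# Tiny illustrative corpus; a real system would use a trained language model.
CORPUS = [
    "i say goodbye to campus tomorrow".split(),
    "i miss campus deeply".split(),
    "we walk to campus".split(),
]
BIGRAMS = Counter(p for s in CORPUS for p in zip(["<s>"] + s, s))
CONTEXTS = Counter(prev for prev, _ in BIGRAMS.elements())
VOCAB_SIZE = len({w for s in CORPUS for w in s}) + 1  # +1 for unseen words

def cond_prob(prev, word):
    """Add-one smoothed bigram probability P(word | prev)."""
    return (BIGRAMS[(prev, word)] + 1) / (CONTEXTS[prev] + VOCAB_SIZE)

def perplexity(tokens):
    """Perplexity of a token list under the bigram model."""
    logp = sum(math.log(cond_prob(p, w)) for p, w in zip(["<s>"] + tokens, tokens))
    return math.exp(-logp / len(tokens))

def detect_non_standard(tokens, candidates, p_min, ppl_drop_min):
    """Flag tokens whose replacement lowers perplexity by more than ppl_drop_min."""
    base = perplexity(tokens)
    flagged = []
    for i, tok in enumerate(tokens):
        prev = tokens[i - 1] if i else "<s>"
        # Step B2: only low-probability segments that have standard candidates.
        if cond_prob(prev, tok) >= p_min or tok not in candidates:
            continue
        for std in candidates[tok]:
            replaced = tokens[:i] + [std] + tokens[i + 1:]   # Step B3
            if base - perplexity(replaced) > ppl_drop_min:    # Step B4
                flagged.append(tok)
                break
    return flagged

tokens = "i say goodbye to xiaoyuan tomorrow".split()
flagged = detect_non_standard(tokens, {"xiaoyuan": ["campus", "courtyard"]}, 0.1, 1.0)
```

  • The two thresholds (`p_min`, `ppl_drop_min`) stand in for the first and second preset thresholds; their values here are arbitrary.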
  • step 206 may be specifically implemented through the following steps 206a and 206b.
  • Step 206a Obtain a first word set corresponding to the at least one non-standard word.
  • the first word set may include: a plurality of word subsets.
  • a word subset may include one or more non-standard words in at least one non-standard word, and each non-standard word corresponds to a standard word set.
  • the compliant word sets corresponding to each of the multiple non-compliant words may be the same or different.
  • At least one of the above-mentioned non-standard words includes the non-standard word " ⁇ ” and the non-standard word " ⁇ "
  • the set of standard words corresponding to the non-standard word “ ⁇ ” can be a set including the standard word " ⁇ ”
  • the set of standard words corresponding to the non-standard word “ ⁇ ” can also be a set including the standard word " ⁇ ”.
  • Step 206b For each word subset in the multiple word subsets, restore and map each non-standard word of the word subset in the first translation text to the set of standard words corresponding to that non-standard word, to generate at least one second translation text.
  • restoring and mapping a word subset to a set of compliant words corresponding to each non-compliant word in the word subset can be understood as: restoring each non-compliant word in the above-mentioned word subset in turn to each compliant word in the corresponding set of compliant words, and traversing all restored combinations of compliant words.
  • the first translation text is: When I think about saying goodbye to xiaoyuan tomorrow, my heart is filled with nostalgia for Shen Shen. It contains the non-standard word "xiaoyuan” and the non-standard word "Shen Shen".
  • the set of standard words corresponding to the non-standard word "xiaoyuan" includes: campus, courtyard; the set of standard words corresponding to the non-standard word "Shen Shen" includes: deeply, scrutinize. Then, the sample construction device can restore each non-standard word to each standard word in its corresponding standard word set and traverse all combinations, which yields 2 × 2 = 4 second translation texts in this example.
  • since the sample construction device can restore the non-standard words in the first translation text to all possible standard words to generate at least one second translation text, the non-standard words in the first translation text can be corrected as far as possible, making the subsequent translation more accurate and fluent.
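  • The traversal of all restored combinations can be sketched with a Cartesian product. The function name `restore_all` and the shortened token list are illustrative assumptions; tokens without an entry in the mapping are kept unchanged.

```python
from itertools import product

def restore_all(tokens, standard_sets):
    """Return every text obtained by restoring each non-standard token."""
    # Each position offers either its standard candidates or the token itself.
    slots = [standard_sets.get(tok, [tok]) for tok in tokens]
    return [" ".join(combo) for combo in product(*slots)]

tokens = ["goodbye", "to", "xiaoyuan", "nostalgia", "Shen Shen"]
standard_sets = {
    "xiaoyuan": ["campus", "courtyard"],
    "Shen Shen": ["deeply", "scrutinize"],
}
second_texts = restore_all(tokens, standard_sets)
```

  • With two candidates for each of the two non-standard words, the sketch produces four second translation texts. The count grows multiplicatively with the number of non-standard words, which is one reason a later screening step is useful.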
  • Step 207 input the first feature information corresponding to the first translated text and the second feature information corresponding to X second translated texts among the M second translated texts into the first translation model for text translation to obtain a target translated text.
  • the first feature information includes text feature information of the first translated text and feature information of the standard type labels corresponding to the non-standard words in the first translated text
  • the second feature information includes text feature information of the second translated text and feature information of the standard type labels corresponding to the non-standard words in the second translated text.
  • the first translation model is obtained by training based on a target training sample set
  • the target training sample set includes multiple target training samples
  • one target training sample corresponds to a parallel corpus training sample in the parallel corpus training sample set
  • M and X are positive integers, and X is less than or equal to M.
  • step 207 may be specifically implemented through the following steps 207a and 207b.
  • Step 207a input X second translation texts among the M second translation texts and the first translation text into the first translation model for text translation, and output L candidate translations.
  • the L candidate translations include candidate translations corresponding to X second translation texts and candidate translations corresponding to the first translation text, one candidate translation corresponds to at least one second translation text, L is a positive integer, and L is less than or equal to X.
  • since the enhanced translation model can produce the same translation for non-standard words with different extended forms, the number of candidate translations output by the translation model may be less than the number of second translation texts input.
  • the target translation “ ⁇ ” can be obtained.
  • Step 207b Determine the candidate translation that meets the first condition among the L candidate translations as the target translation.
  • the candidate translations satisfying the first condition may include at least one of the following:
  • Case 1 The candidate translation whose fluency meets the first predetermined condition
  • Case 2 The candidate translation whose translation quality meets the second predetermined condition
  • Case 3 candidate translations whose relevance satisfies the third predetermined condition.
  • the above relevance is evaluated based on at least one of the following: prior probability, similarity, and perplexity.
  • the first predetermined condition may be that the perplexity of the candidate translation is less than or equal to a third preset threshold. It can be understood that the lower the perplexity of the candidate translation, the higher the fluency and the more reasonable the candidate translation.
  • the sample construction device may calculate the perplexities of L candidate translations respectively through the language model, and determine the candidate translation whose perplexity is less than or equal to the third preset threshold as the target translation.
  • the second predetermined condition may be that the translation quality of the candidate translation is greater than or equal to a fourth preset threshold. It is understood that the sample construction device may determine the candidate translation whose translation quality is greater than or equal to the fourth preset threshold as the target translation.
  • the third predetermined condition may be that the relevance of the candidate translation is greater than or equal to a fifth preset threshold. It is understood that the sample construction device may determine the candidate translation whose relevance is greater than or equal to the fifth preset threshold as the target translation.
  • the sample construction device may determine the candidate translation that satisfies the most predetermined conditions as the target translation.
  • since the sample construction device can determine the candidate translation with the best evaluation result as the target translation based on the fluency, translation quality and relevance of the candidate translations, the output target translation can be optimized.
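  • Selecting the candidate translation that satisfies the most predetermined conditions can be sketched as follows. The `Candidate` fields and the threshold parameter names are hypothetical; the three comparisons mirror the first, second and third predetermined conditions (perplexity at or below a threshold, quality and relevance at or above their thresholds).

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    perplexity: float   # lower means more fluent
    quality: float      # translation quality parameter
    relevance: float    # relevance score

def conditions_met(c, max_ppl, min_quality, min_relevance):
    """Count how many of the three predetermined conditions a candidate meets."""
    return sum([
        c.perplexity <= max_ppl,
        c.quality >= min_quality,
        c.relevance >= min_relevance,
    ])

def pick_target(cands, max_ppl, min_quality, min_relevance):
    """Return the candidate meeting the most conditions (first one on ties)."""
    return max(cands, key=lambda c: conditions_met(c, max_ppl, min_quality, min_relevance))

cands = [Candidate("a", 50.0, 0.9, 0.2), Candidate("b", 20.0, 0.8, 0.7)]
target = pick_target(cands, max_ppl=30.0, min_quality=0.7, min_relevance=0.5)
```

  • Tie-breaking is not specified in the text; taking the first candidate on a tie is a choice made only for this sketch.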
  • the sample construction device may evaluate the translation quality of the candidate translations by using representation and feature learning methods.
  • the sample construction method provided in the embodiment of the present application may further include the following steps 207c and 207d.
  • Step 207c For each candidate translation among the L candidate translations, extract first text feature information of the candidate translation, and second text feature information of the second translated text corresponding to the candidate translation and of the first translated text.
  • the first text feature information may include features such as lexical and syntactic structures of the candidate translations.
  • the second text feature information may include features such as lexical and syntactic structures of the second translated text and the first translated text.
  • the sample construction device can extract the first text feature information of the candidate translation through a trained word segmentation model of the target language, and extract the second text feature information of the second translation text corresponding to the candidate translation and of the first translation text through the word segmentation model of the original language.
  • Step 207d Calculate a translation quality parameter corresponding to a candidate translation based on the first text feature information and the second text feature information.
  • the sample construction device may calculate the quality of the translation result using a regression algorithm.
  • the translation quality parameter corresponding to a candidate translation may be a result value of a regression algorithm.
  • the regression algorithm can output the probability of the quality of the candidate translation: the closer the result of the regression algorithm is to 1, the better the quality of the candidate translation is, and the closer the result of the regression algorithm is to 0, the worse the quality of the candidate translation is.
  • since the sample construction device can calculate the translation quality parameter corresponding to a candidate translation based on the first text feature information of the candidate translation and the second text feature information of the corresponding second translation text and of the first translation text, candidate translations with better translation quality can be screened out.
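  • One way to obtain a score between 0 and 1 from the extracted features is a logistic regression, sketched below. The text only says "a regression algorithm", so the logistic form, the feature vector and the weights are all assumptions introduced for illustration.

```python
import math

def quality_score(features, weights, bias=0.0):
    """Logistic regression: map a feature vector to a score in (0, 1)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical features comparing a candidate translation with its source text,
# e.g. agreement of syntactic structure and agreement of word counts.
good = quality_score([0.9, 0.8], weights=[2.0, 1.5], bias=-1.0)
poor = quality_score([0.1, 0.2], weights=[2.0, 1.5], bias=-1.0)
```

  • In practice the weights would be learned from translations with known quality; the closer the score is to 1, the better the candidate is taken to be.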
  • the relevance of the candidate translations can be obtained by weighting the following six evaluation indicators:
  • 1 The expanded words in the second translation text corresponding to the candidate translation are given a prior probability according to their expansion type, their similarity to the corresponding non-standard words in the first translation text, their word frequency relative to those non-standard words, etc., and the candidate translation with the higher probability is selected.
  • for example, the prior probabilities of pronunciation spelling, cognates, and easily confused words are [0.7, 0.2, 0.1]
  • and, within pronunciation spelling, the prior probabilities of Hiragana and Katakana are [0.8, 0.2].
  • 2 The second translation text corresponding to the candidate translation is input into the word segmentation model, the similarity of its word segmentation and annotation information (such as morphology and syntactic structure) to that of the first translation text is calculated, and the candidate translation corresponding to the second translation text with the higher similarity is selected.
  • 3 Calculate the perplexity of the second translation text and of the first translation text through the language model, and select the candidate translation corresponding to a second translation text whose perplexity is lower than that of the first translation text by more than the second preset threshold.
  • 4 Input the candidate translation into the word segmentation model, calculate the similarity of its word segmentation and annotation information (such as lexical and syntactic structure) to that of the candidate translation corresponding to the first translation text, and select the candidate translation with the higher similarity.
  • 5 Calculate the string similarity between all candidate translations, and select the candidate translation with the higher similarity.
  • 6 Calculate the similarity between the translations corresponding to the extended words in all candidate translations.
  • the weight of evaluation indicator 4 may be determined by the other evaluation indicators of the candidate translation corresponding to the first translation text. If the fluency and translation quality of that candidate translation are poor, the weight of indicator 4 is reduced accordingly.
  • the relevance of the candidate translation can be obtained by weighting the evaluation indicators of the candidate translations corresponding to the different second translation texts.
  • the sample construction method provided in the embodiment of the present application can also screen the at least one second translation text through evaluation indicators 1 to 3 used for calculating the relevance of candidate translations, so as to select X second translation texts from the M second translation texts, which improves the efficiency of actual translation and reduces the power consumption of the sample construction device.
  • on the one hand, since the present application can restore the non-standard words in the first translation text to at least one standard word and generate at least one second translation text, the first translation text containing non-standard words can be restored to a standard text, avoiding translation errors caused by the presence of non-standard words; on the other hand, since the present application can input some or all of the standard second translation texts together with the original first translation text into the translation model for translation, a translation with higher accuracy can be output as the translation result. In this way, the sample construction method provided by the embodiment of the present application can improve the translation accuracy of the translation model.
  • the sample construction method provided in the embodiment of the present application may further include the following step 208.
  • Step 208 After the first translation text is input into the first word segmentation model, the first translation text is segmented to obtain K word segments, and non-standard word recognition is performed on each of the K word segments to obtain a recognition result corresponding to each word segment.
  • the recognition result corresponding to a word segment is used to indicate whether the word segment is a non-standard word.
  • when the word segment is a non-standard word, the recognition result corresponding to the word segment includes the standard type corresponding to the word segment.
  • the first word segmentation model is obtained by training based on the target training sample set, and K is an integer greater than 1.
  • the sample construction device can also use the word segmentation model trained in the above method 2 to predict labels for non-standard words in the enhanced text, and introduce corresponding standard type label vectors in the training of the translation model, so that the model learns the semantic relationship between the extended words and the first keyword corresponding to the unregistered words, thereby enhancing the prediction and translation capabilities of the extended words.
  • the canonical type label vector has the same dimension as the word vector.
  • the extended word vector in the enhanced sentence is added to the vector corresponding to the canonical type label predicted by the word segmentation model to obtain the final representation vector of the extended word.
  • each standard type label vector can be randomly initialized, or the weighted average of the corresponding word vectors of each component of the label (such as pronunciation spelling, cognates, hiragana, katakana, etc.) can be used as the initial vector, and the standard type label vector is iteratively optimized through model training.
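  • The vector operations described above can be sketched in plain Python. The dimension `DIM`, the value range of the random initialization and all function names are assumptions; the only properties taken from the text are that the label vector has the same dimension as the word vector, that the two are added, and that a label vector may start either random or as a weighted average of component word vectors.

```python
import random

DIM = 4  # hypothetical embedding dimension shared by word and label vectors

def random_init(rng):
    """Random initialization of a standard type label vector."""
    return [rng.uniform(-0.1, 0.1) for _ in range(DIM)]

def weighted_average(vectors, weights):
    """Initialize a label vector as the weighted average of component vectors."""
    total = sum(weights)
    return [sum(w * v[d] for v, w in zip(vectors, weights)) / total
            for d in range(DIM)]

def final_representation(word_vec, label_vec):
    """Element-wise sum of the extended word vector and its label vector."""
    return [a + b for a, b in zip(word_vec, label_vec)]

label_vec = weighted_average([[1.0] * DIM, [3.0] * DIM], weights=[1.0, 1.0])
rep = final_representation([1.0, 2.0, 3.0, 4.0], label_vec)
```

  • During training, the label vectors would be updated together with the other model parameters; that optimization loop is outside the scope of this sketch.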
  • the label prediction of the word segmentation model for the expanded word may be different from the label of the actual expanded form of the word.
  • the enhanced sentences with incorrect label predictions of these standard types are not corrected, but retained at a certain ratio, thereby enhancing the robustness of the translation model and enabling the model to learn to output correct translations when incorrect expanded word labels are input.
  • the sample construction device may use the parallel corpus training samples and the target training samples to construct a basic translation model, to which standard type enhanced training may then be applied.
  • the enhanced translation model may generate a word vector table containing extended words and a canonical type label vector table.
  • when the sample construction device inputs N second translation texts among the at least one second translation text, together with the first translation text, into the translation model, the input text can be segmented by the enhanced word segmentation model, the non-standard words can be identified, and the extended forms of the non-standard words can be predicted. Then, by querying the vector tables, the standard word vectors, the extended word vectors and the standard type label vectors corresponding to the input text are input into the model, as shown in FIG4, to obtain the generated translation.
  • on the one hand, at the data level, the sample construction device can construct adversarial training data based on the extended vocabulary to enhance the training of the translation model; on the other hand, at the model level, the sample construction device incorporates the standard type label vector in the input encoding layer, so that during training the model can learn the extended forms of non-standard words and the semantic relevance between non-standard words and their corresponding standard words in the text.
  • as a result, the translation robustness and translation quality for a first translation text containing non-standard words are improved, so that the translation model can output the correct translation regardless of whether the first translation text contains non-standard words.
  • the sample construction method provided in the embodiment of the present application may further include the following steps 301 and 302.
  • the above step 202 may be specifically implemented by the following step a.
  • Step 301 Display each standard-compliant word corresponding to the first non-standard-compliant word.
  • the first non-standard word may be one or more non-standard words in the at least one non-standard word. That is, the first non-standard word may be one or more non-standard words.
  • the sample construction device may display each standard-compliant word corresponding to the first non-standard-compliant word in order of relevance from high to low.
  • the sample construction device may calculate the relevance of each standard-compliant word corresponding to the first non-standard-compliant word by using the following formula (2).
  • $S(W) = \alpha S_1(W) + \beta S_2(W) + \gamma S_3(W)$ (Formula 2)
  • $S_1(W)$ is the prior probability corresponding to the non-standard word in the extended vocabulary
  • $S_2(W)$ is the lexical similarity between the non-standard word and its restored standard word
  • $S_3(W)$ is the restoration of the non-standard word
  • $\alpha$, $\beta$, $\gamma$ are adjustable weight coefficients.
  • for example, for $S_2(W)$, if only part of speech is considered: if the morphology of the non-standard word is the same as that of the standard word it is restored to, $S_2(W)$ can be 1; otherwise, $S_2(W)$ can be 0.
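  • Ranking the standard-compliant words by the weighted sum of Formula 2, for display from high to low relevance, can be sketched as follows. The parameter names `alpha`, `beta` and `gamma` are assumed names for the three adjustable weight coefficients, their default values are arbitrary, and the third score is treated as an opaque number because its definition is incomplete in this text.

```python
def relevance(s1, s2, s3, alpha=0.5, beta=0.3, gamma=0.2):
    """Weighted sum of the three partial scores of Formula 2."""
    return alpha * s1 + beta * s2 + gamma * s3

def rank_candidates(scored):
    """scored: {standard word: (S1, S2, S3)} -> words sorted by S(W), descending."""
    return sorted(scored, key=lambda w: relevance(*scored[w]), reverse=True)

# Illustrative scores only; S2 is 1 when the part of speech matches, else 0.
scored = {
    "courtyard": (0.2, 1.0, 0.3),
    "campus": (0.7, 1.0, 0.6),
}
display_order = rank_candidates(scored)
```

  • The resulting order is the order in which the candidates would be shown to the user in step 301.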
  • Step 302 Receive a first input of a target standard-compliant word among the displayed standard-compliant words.
  • the target standard-compliant words are one or more standard-compliant words among the displayed standard-compliant words.
  • the target standard-compliant word may be a standard-compliant word corresponding to the same non-standard-compliant word.
  • the target standard-compliant word may include standard-compliant words corresponding to multiple different non-standard-compliant words.
  • the sample construction device will restore each standard-compliant word selected by the user.
  • the target standard-compliant word may be a standard-compliant word selected by the user to replace a non-standard-compliant word.
  • the first input is used to select a standard word that needs to be restored from the displayed standard words.
  • the first input may be a user's touch input, a specific voice input, or a specific gesture input on the target standard-compliant word, which is not limited in this embodiment of the present application.
  • the first input may be a click input by the user on the target standard-compliant word.
  • Step a In response to the first input, restore the first non-standard word in the first translation text to a target standard word to generate at least one second translation text.
  • the electronic device can restore them according to the above-mentioned relevant steps to generate at least one second translation text.
  • the sample construction device may display the standard words “ ⁇ ( ⁇ )” and “ ⁇ ” corresponding to the first non-standard word “ ⁇ (liangqin)”. Then, the sample construction device receives the user's click input (i.e., the first input) on the target standard word “ ⁇ ( ⁇ )” and, as shown in (b) of FIG5, restores the non-standard word “ ⁇ (liangqin)” to the target standard word “ ⁇ ( ⁇ )” and generates the second translation text “ ⁇ ( ⁇ / ⁇ )”.
  • since the sample construction device can display the standard words corresponding to the non-standard words and the user selects the target standard words to be restored through the first input, the number of second translation texts generated can be reduced, thereby reducing the power consumption required for translation.
  • the sample construction method provided in the embodiment of the present application can construct corresponding extended vocabularies according to the linguistic features of different languages, so that it can be applied to different translation languages and language directions.
  • an embodiment of the present application provides a sample construction method. FIG6 shows a flowchart of translation performed by the translation model provided by the embodiment of the present application, wherein the translation model is obtained by training on the target training samples.
  • the sample construction method provided by the present application embodiment may include the following steps 601 to 607.
  • Step 601 Obtain the text to be translated.
  • Step 602 Automatically identify whether there are any words that do not conform to the standard in the text to be translated.
  • Step 603 When there are non-standard words in the text to be translated, the restoration results of the non-standard words are sorted by credibility and presented to the user.
  • Step 604 In response to a first input by which the user selects a restoration result of a non-standard word, the non-standard word is restored to generate at least one second translation text.
  • Step 605 Restore the non-standard words not selected by the user to generate at least one second translation text.
  • Step 606 Input at least one second translation text into the first translation model for text translation to obtain at least one candidate translation.
  • Step 607 Determine a target translation from at least one candidate translation, and output the target translation.
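  • Steps 601 to 607 can be strung together as in the sketch below. The function `run_pipeline` and the stand-ins `choose`, `translate` and `score` are hypothetical placeholders for the user interface, the first translation model and the candidate evaluation; the credibility sorting of step 603 is not modelled.

```python
from itertools import product

def run_pipeline(tokens, vocab, choose, translate, score):
    """Toy end-to-end flow: detect, restore, translate, pick the best candidate."""
    found = {t: vocab[t] for t in tokens if t in vocab}          # step 602
    chosen = choose(found) if found else {}                      # steps 603-604
    slots = [[chosen[t]] if t in chosen else vocab.get(t, [t])   # steps 604-605
             for t in tokens]
    second_texts = [" ".join(c) for c in product(*slots)]
    candidates = [translate(s) for s in second_texts]            # step 606
    return max(candidates, key=score)                            # step 607

# Toy stand-ins; none of these reflect a real model or UI.
vocab = {"xiaoyuan": ["campus", "courtyard"], "Shen Shen": ["deeply", "scrutinize"]}
choose = lambda found: {"xiaoyuan": "campus"}        # user selects "campus"
translate = lambda s: s.upper()                      # placeholder "translation"
score = lambda c: 1.0 if "DEEPLY" in c else 0.0      # placeholder evaluation
target = run_pipeline(["miss", "xiaoyuan", "Shen Shen"], vocab, choose, translate, score)
```

  • Because the user fixes the restoration of one word, only two second translation texts are generated here instead of four.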
  • the sample construction method provided in the embodiment of the present application can be executed by a sample construction device.
  • the sample construction device provided in the embodiment of the present application is described by taking the sample construction method executed by the sample construction device as an example.
  • Fig. 7 shows a possible structural diagram of a sample construction device involved in an embodiment of the present application.
  • the sample construction device 70 may include: an acquisition module 71 , a processing module 72 and a construction module 73 .
  • the above-mentioned acquisition module 71 is used to obtain a parallel corpus training sample, which contains the original text and carries the standard type label corresponding to each keyword in the original text;
  • the above-mentioned processing module 72 is used to replace the first keyword in the original text of the parallel corpus training sample acquired by the acquisition module 71 with at least one first non-standard word corresponding to the first keyword, so as to generate at least one extended text;
  • the above-mentioned processing module 72 is also used to replace the first standard type label corresponding to the first keyword in the parallel corpus training sample acquired by the acquisition module 71 with the second standard type label corresponding to the first non-standard word, so as to obtain the parallel corpus training sample after the label is replaced;
  • the above-mentioned construction module 73 is used to construct a target training sample based on the parallel corpus training sample after the label is replaced and processed by the processing module 72 and at least one extended text.
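  • The cooperation of the three modules can be sketched as a single function that replaces a keyword and its label to produce new training samples. The class `ParallelSample`, its fields, the function `construct_target_samples` and the label strings are hypothetical; in particular, carrying the reference translation in a `target` field is an assumption about how a parallel corpus sample is stored.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ParallelSample:
    source: tuple   # tokens of the original text
    labels: tuple   # one standard type label per token
    target: str     # reference translation (assumed field)

def construct_target_samples(sample, keyword, variants):
    """variants: list of (non-standard word, standard type label) for keyword."""
    out = []
    for word, label in variants:
        # Replace the keyword with the non-standard word to get an extended text.
        src = tuple(word if t == keyword else t for t in sample.source)
        # Replace the keyword's label with the label of the non-standard word.
        lab = tuple(label if t == keyword else l
                    for t, l in zip(sample.source, sample.labels))
        out.append(replace(sample, source=src, labels=lab))
    return out

base = ParallelSample(("miss", "campus"), ("standard", "standard"), "reference")
samples = construct_target_samples(base, "campus", [("xiaoyuan", "pinyin")])
```

  • The reference translation is left untouched, so each extended text is paired with the same translation as the original sample.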
  • the parallel corpus training sample is a parallel corpus training sample in a parallel corpus training sample set; the processing module 72 is specifically used to:
  • At least one first keyword is determined from the original text, and each first keyword in the at least one first keyword in the original text is replaced with a first non-standard word corresponding to each first keyword to generate a first extended text;
  • the first extended text is any extended text among the at least one extended text.
  • the parallel corpus training sample is a parallel corpus training sample in a parallel corpus training sample set, and the number of extended texts is N, where N is a positive integer;
  • the processing module 72 is further configured to, after replacing the first keyword in the original text with at least one first non-standard word corresponding to the first keyword to generate at least one extended text, initialize word feature information of the unregistered word when a second extended text among the N extended texts includes an unregistered word that is not included in the parallel corpus training sample set;
  • the initialization process includes one of the following:
  • a weighted average of word feature information is taken according to the frequency of the unregistered word;
  • the feature information of unregistered words is randomly initialized.
  • the parallel corpus sample is a parallel corpus training sample in a parallel corpus sample set; the above device further includes: a translation module;
  • the processing module 72 is further configured to, after the construction module 73 constructs the target training sample based on the label-replaced parallel corpus training sample and the at least one extended text, restore at least one non-standard word in the first translation text to standard words so as to generate M second translation texts, wherein one non-standard word is restored to at least one standard word;
  • the above-mentioned translation module is used to input the first feature information corresponding to the first translation text and the second feature information corresponding to the X second translation texts among the M second translation texts obtained by the processing module 72 into the first translation model for text translation to obtain a target translation, wherein the first feature information includes text feature information of the first translation text and feature information of the standard type label corresponding to the non-standard words in the first translation text, and the second feature information includes text feature information of the second translation text and feature information of the standard type label corresponding to the non-standard words in the second translation text;
  • the first translation model is trained based on a target training sample set, the target training sample set includes multiple target training samples, one target training sample corresponds to one parallel corpus training sample in the parallel corpus training sample set, M and X are positive integers, and X is less than or equal to M.
  • the above device further includes: a word segmentation module;
  • the above-mentioned word segmentation module is used for, before the processing module 72 restores at least one non-standard word in the first translation text to a standard word to generate M second translation texts, inputting the first translation text into the first word segmentation model, segmenting the first translation text to obtain K word segments, and performing non-standard word recognition on each of the K word segments to obtain a recognition result corresponding to each word segment, wherein the recognition result corresponding to a word segment is used to indicate whether the word segment is a non-standard word, and when the word segment is a non-standard word, the recognition result corresponding to the word segment includes a standard type corresponding to the word segment;
  • the first word segmentation model is trained based on the target training sample set, and K is an integer greater than 1.
  • non-standard words include at least one of the following: words written by pronunciation (for example, in pinyin), words containing typos, words replaced by cognates, and words containing glyph errors.
  • the embodiment of the present application provides a sample construction device. The sample construction device can replace the keywords in the original text of the parallel corpus training sample and generate at least one extended text, which expands the vocabulary covered by the parallel corpus training sample. At the same time, it replaces the standard type label corresponding to each keyword with the standard type label corresponding to the non-standard word and obtains the parallel corpus training sample after label replacement, which enriches the content contained in the parallel corpus training sample. Finally, the sample construction device can construct the target training samples based on the parallel corpus training sample after label replacement and the at least one extended text. Therefore, the target training samples can contain non-standard words and their corresponding standard type labels, which enriches the content of the parallel corpus training samples and gives them more, and more flexible, training content.
  • the sample construction device in the embodiment of the present application can be an electronic device or a component in the electronic device, such as an integrated circuit or a chip.
  • the electronic device can be a terminal or other devices other than a terminal.
  • the electronic device can be a mobile phone, a tablet computer, a laptop computer, a handheld computer, a car-mounted electronic device, a mobile Internet device (MID), an augmented reality (AR)/virtual reality (VR) device, a robot, a wearable device, an ultra-mobile personal computer (UMPC), a netbook or a personal digital assistant (PDA), etc.
  • It can also be a server, a network attached storage (Network Attached Storage, NAS), a personal computer (personal computer, PC), a television (television, TV), a teller machine or a self-service machine, etc., and the embodiment of the present application is not specifically limited.
  • the sample construction device in the embodiment of the present application may be a device having an operating system.
  • the operating system may be an Android operating system, an iOS operating system, or other possible operating systems, which are not specifically limited in the embodiment of the present application.
  • the sample construction device provided in the embodiment of the present application can implement the various processes implemented in the method embodiments of Figures 2 to 6 and achieve the same technical effects. To avoid repetition, they will not be described here.
  • an embodiment of the present application also provides an electronic device 800, including a processor 801 and a memory 802, and the memory 802 stores a program or instruction that can be executed on the processor 801.
  • when the program or instruction is executed by the processor 801, the steps of the above-mentioned sample construction method embodiment are implemented and the same technical effect can be achieved. To avoid repetition, the details are not described again here.
  • the electronic devices in the embodiments of the present application include the mobile electronic devices and non-mobile electronic devices mentioned above.
  • FIG. 9 is a schematic diagram of the hardware structure of an electronic device implementing an embodiment of the present application.
  • the electronic device 900 includes but is not limited to: a radio frequency unit 901, a network module 902, an audio output unit 903, an input unit 904, a sensor 905, a display unit 906, a user input unit 907, an interface unit 908, a memory 909, and a processor 910 and other components.
  • the electronic device 900 may also include a power source (such as a battery) for supplying power to each component, and the power source may be logically connected to the processor 910 through a power management system, so that functions such as charging, discharging, and power consumption management are handled through the power management system.
  • the electronic device structure shown in FIG. 9 does not constitute a limitation on the electronic device, and the electronic device may include more or fewer components than shown, or combine certain components, or arrange components differently, which will not be described in detail here.
  • the processor 910 is used to: obtain a parallel corpus training sample, wherein the parallel corpus training sample includes original text and carries a standard type label corresponding to each keyword in the original text; replace the first keyword in the original text with at least one first non-standard word corresponding to the first keyword to generate at least one extended text; replace the first standard type label corresponding to the first keyword with a second standard type label corresponding to the first non-standard word to obtain a parallel corpus training sample after label replacement; and construct a target training sample based on the parallel corpus training sample after label replacement and the at least one extended text.
  • the parallel corpus training sample is a parallel corpus training sample in a parallel corpus training sample set; the processor 910 is specifically configured to:
  • At least one first keyword is determined from the original text, and each first keyword in the at least one first keyword in the original text is replaced with a first non-standard word corresponding to each first keyword to generate a first extended text;
  • the first extended text is any extended text among the at least one extended text.
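The keyword replacement and label replacement described above can be sketched roughly as follows. This is a minimal illustration, not the applicant's implementation: the `ParallelSample` class, the `VARIANTS` table, the label names (`"standard"`, `"typo"`, `"phonetic"`) and the pairing of each extended text with the unchanged reference translation are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical label name and variant table; the source does not define them.
STANDARD = "standard"
VARIANTS = {
    "popular": [("populer", "typo"), ("pop-yoo-lar", "phonetic")],
}


@dataclass
class ParallelSample:
    source: str                                  # original text
    target: str                                  # reference translation
    labels: dict = field(default_factory=dict)   # keyword -> standard type label


def build_target_samples(sample):
    """Return one target training sample per non-standard variant of each keyword."""
    out = []
    for keyword in sample.labels:
        for variant, variant_label in VARIANTS.get(keyword, []):
            # Extended text: the keyword is replaced by its non-standard variant.
            extended = sample.source.replace(keyword, variant)
            # Label replacement: drop the keyword's label, add the variant's label.
            new_labels = {k: v for k, v in sample.labels.items() if k != keyword}
            new_labels[variant] = variant_label
            out.append(ParallelSample(extended, sample.target, new_labels))
    return out
```

For a sample whose original text is "a popular song" with the keyword "popular" labelled as standard, this sketch yields two extended samples, one per variant, each carrying the variant's label in place of the keyword's label.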
  • the parallel corpus training sample is a parallel corpus training sample in a parallel corpus training sample set, and the number of extended texts is N, where N is a positive integer;
  • the processor 910 is further configured to, after replacing a first keyword in the original text with at least one first non-standard word corresponding to the first keyword to generate at least one extended text, initialize word feature information of the unregistered word when a second extended text among the N extended texts includes an unregistered word that is not included in the parallel corpus training sample set;
  • the initialization process includes one of the following:
  • performing weighted averaging on the word feature information of the unregistered word;
  • randomly initializing the word feature information of the unregistered word.
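The two initialization options can be illustrated with a short sketch. The text does not say which vectors enter the weighted average, so the example assumes they are the feature vectors of related in-vocabulary words (for instance, the keyword that the unregistered word replaced). The function name, its parameters and the random range are hypothetical.

```python
import random


def init_oov_vector(embeddings, related_words=None, weights=None, dim=4, seed=0):
    """Initialise the feature vector of an unregistered (out-of-vocabulary) word.

    If related in-vocabulary words are given, return the weighted average of
    their vectors (assumed interpretation); otherwise return small random values.
    """
    known = [w for w in (related_words or []) if w in embeddings]
    if known:
        w = [(weights or {}).get(word, 1.0) for word in known]
        total = sum(w)
        size = len(embeddings[known[0]])
        return [
            sum(wi * embeddings[word][i] for wi, word in zip(w, known)) / total
            for i in range(size)
        ]
    rng = random.Random(seed)
    return [rng.uniform(-0.1, 0.1) for _ in range(dim)]
```

The weighted average keeps the new vector close to words the model already knows, while random initialization makes no assumption about the word and leaves everything to training.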
  • the parallel corpus sample is a parallel corpus training sample in a parallel corpus sample set
  • the processor 910 is further configured to restore at least one non-standard word in the first translation text to a standard word after constructing a target training sample based on the parallel corpus training sample after replacing the label and at least one extended text, so as to generate M second translation texts, wherein one non-standard word is restored to at least one standard word;
  • the processor 910 is further configured to input the first feature information corresponding to the first translation text and the second feature information corresponding to X second translation texts among the M second translation texts into the first translation model for text translation to obtain a target translation, wherein the first feature information includes the text feature information of the first translation text and the feature information of the standard type label corresponding to the non-standard word in the first translation text, and the second feature information includes the text feature information of the second translation text and the feature information of the standard type label corresponding to the non-standard word in the second translation text;
  • the first translation model is trained based on a target training sample set, the target training sample set includes multiple target training samples, one target training sample corresponds to one parallel corpus training sample in the parallel corpus training sample set, M and X are positive integers, and X is less than or equal to M.
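The restoration step, in which each non-standard word may map to one or more standard words and M candidate texts result, can be sketched as below. The `RESTORE` table and both function names are hypothetical, and choosing the X texts by a score is an assumption: this passage does not state the selection criterion.

```python
from itertools import product

# Hypothetical restoration table: non-standard word -> candidate standard words.
RESTORE = {"populer": ["popular", "poplar"], "songg": ["song"]}


def restore_candidates(tokens, table=RESTORE):
    """Generate every text obtained by restoring each non-standard token."""
    options = [table.get(tok, [tok]) for tok in tokens]
    return [" ".join(choice) for choice in product(*options)]


def select_top(candidates, score, x):
    """Keep the X best candidates; a lower score is treated as better."""
    return sorted(candidates, key=score)[:x]
```

With the table above, the tokens `["a", "populer", "songg"]` give M = 2 candidate texts, and `select_top` keeps X of them according to whatever scoring function is supplied.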
  • the processor 910 is further configured to, before restoring at least one non-standard word in the first translation text to a standard word to generate the M second translation texts, input the first translation text into the first word segmentation model, segment the first translation text to obtain K word segments, and perform non-standard word recognition on each of the K word segments to obtain a recognition result corresponding to each word segment, wherein the recognition result corresponding to a word segment is used to indicate whether the word segment is a non-standard word, and when the word segment is a non-standard word, the recognition result corresponding to the word segment includes a standard type corresponding to the word segment;
  • the first word segmentation model is trained based on the target training sample set, and K is an integer greater than 1.
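One way to picture the per-segment recognition result is the small data structure below. It is only a sketch: the `Recognition` class and the lookup table are hypothetical, and the table stands in for the trained word segmentation model, which the text does not specify in detail.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical lexicon standing in for the trained word segmentation model.
NON_STANDARD = {"populer": "typo", "pop-yoo-lar": "phonetic"}


@dataclass
class Recognition:
    segment: str
    is_non_standard: bool
    standard_type: Optional[str] = None  # set only for non-standard segments


def recognise(segments):
    """Return one recognition result per word segment."""
    results = []
    for seg in segments:
        kind = NON_STANDARD.get(seg)
        results.append(Recognition(seg, kind is not None, kind))
    return results
```

Each result says whether the segment is non-standard and, only in that case, carries the corresponding standard type.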
  • non-standard words include at least one of the following: words written by pronunciation, words containing typos, words replaced by cognates, and words containing glyph errors.
  • An embodiment of the present application provides an electronic device, which can replace a keyword in an original text in a parallel corpus training sample with at least one non-standard word corresponding to the keyword, generate at least one extended text, and replace the standard type label corresponding to the keyword with the standard type label corresponding to the non-standard word, thereby obtaining a parallel corpus training sample after replacing the label. Then, the electronic device can construct a target training sample based on the parallel corpus training sample after replacing the label and at least one extended text. Therefore, the target training sample can contain non-standard words and their corresponding standard type labels, so that the translation model trained with the target training sample can translate non-standard words, thereby improving the translation accuracy of the translation model.
  • the input unit 904 may include a graphics processing unit (GPU) 9041 and a microphone 9042, and the graphics processor 9041 processes the image data of the static picture or video obtained by the image capture device (such as a camera) in the video capture mode or the image capture mode.
  • the display unit 906 may include a display panel 9061, and the display panel 9061 may be configured in the form of a liquid crystal display, an organic light emitting diode, etc.
  • the user input unit 907 includes at least one of a touch panel 9071 and other input devices 9072.
  • the touch panel 9071 is also called a touch screen.
  • the touch panel 9071 may include two parts: a touch detection device and a touch controller.
  • Other input devices 9072 may include, but are not limited to, a physical keyboard, function keys (such as a volume control key, a switch key, etc.), a trackball, a mouse, and a joystick, which will not be repeated here.
  • the memory 909 can be used to store software programs and various data.
  • the memory 909 can mainly include a first storage area for storing programs or instructions and a second storage area for storing data, wherein the first storage area can store an operating system, an application program or instruction required for at least one function (such as a sound playback function, an image playback function, etc.).
  • the memory 909 can include a volatile memory or a non-volatile memory, or the memory 909 can include both a volatile memory and a non-volatile memory.
  • the non-volatile memory can be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM) or a flash memory.
  • the volatile memory can be a random access memory (RAM), a static random access memory (SRAM), a dynamic random access memory (DRAM), a synchronous dynamic random access memory (SDRAM), a double data rate synchronous dynamic random access memory (DDR SDRAM), an enhanced synchronous dynamic random access memory (ESDRAM), a synchlink dynamic random access memory (SLDRAM) or a direct Rambus random access memory (DRRAM).
  • the memory 909 in the embodiment of the present application includes but is not limited to these and any other suitable types of memories.
  • the processor 910 may include one or more processing units; optionally, the processor 910 integrates an application processor and a modem processor, wherein the application processor mainly processes operations related to an operating system, a user interface, and application programs, and the modem processor mainly processes wireless communication signals, such as a baseband processor. It is understandable that the modem processor may not be integrated into the processor 910.
  • An embodiment of the present application also provides a readable storage medium, on which a program or instruction is stored.
  • when the program or instruction is executed by a processor, each process of the above-mentioned sample construction method embodiment is implemented and the same technical effect can be achieved. To avoid repetition, the details are not described again here.
  • the processor is the processor in the electronic device described in the above embodiment.
  • the readable storage medium includes a computer readable storage medium, such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
  • An embodiment of the present application further provides a chip, which includes a processor and a communication interface, wherein the communication interface is coupled to the processor, and the processor is used to run programs or instructions to implement the various processes of the above-mentioned sample construction method embodiment, and can achieve the same technical effect. To avoid repetition, it will not be repeated here.
  • the chip mentioned in the embodiments of the present application can also be called a system-level chip, a system chip, a chip system or a system-on-chip chip, etc.
  • An embodiment of the present application provides a computer program product, which is stored in a storage medium.
  • the program product is executed by at least one processor to implement the various processes of the sample construction method embodiment described above, and can achieve the same technical effect. To avoid repetition, it will not be described here.
  • the technical solution of the present application can be embodied in the form of a computer software product, which is stored in a storage medium (such as ROM/RAM, a disk, or an optical disk), and includes a number of instructions for a terminal (which can be a mobile phone, a computer, a server, or a network device, etc.) to execute the methods described in each embodiment of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The present application belongs to the technical field of artificial intelligence. Disclosed are a sample construction method and apparatus, and an electronic device and a readable storage medium. The method comprises: acquiring a parallel corpus training sample, wherein the parallel corpus training sample comprises original text and carries a specification type label corresponding to each keyword in the original text; replacing a first keyword in the original text with at least one first specification non-conformance word corresponding to the first keyword, so as to generate at least one piece of extended text; replacing a first specification type label corresponding to the first keyword with a second specification type label corresponding to the first specification non-conformance word, so as to obtain a parallel corpus training sample in which the labels are replaced; and constructing a target training sample on the basis of the parallel corpus training sample in which the labels are replaced, and the at least one piece of extended text.

Description

Sample construction method, device, electronic device and readable storage medium
Cross-Reference
This application claims priority to Chinese patent application No. 202310085121.X, filed with the China Patent Office on February 8, 2023 and entitled "Sample construction method, device, electronic device and readable storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application belongs to the field of artificial intelligence technology, and specifically relates to a sample construction method, device, electronic device and readable storage medium.
Background Art
With the development of computer performance and Internet technology, existing translation methods usually train a translation model on a large-scale bilingual parallel corpus and generate translations based on the distribution of real corpus data in the text to be translated.
However, because parallel corpus training samples usually consist of high-quality standard text, a translation model trained on them can only translate standard text; when it translates text containing non-standard words, the overall translation accuracy is low.
Therefore, how to construct richer parallel corpus training samples is an urgent problem to be solved by the present application.
Summary of the Invention
The purpose of the embodiments of the present application is to provide a sample construction method, device, electronic device and readable storage medium that can solve the problem of how to construct richer parallel corpus training samples.
In a first aspect, an embodiment of the present application provides a sample construction method, which includes: obtaining a parallel corpus training sample, where the parallel corpus training sample contains original text and carries a standard type label corresponding to each keyword in the original text; replacing a first keyword in the original text with at least one first non-standard word corresponding to the first keyword to generate at least one extended text; replacing a first standard type label corresponding to the first keyword with a second standard type label corresponding to the first non-standard word to obtain a parallel corpus training sample after label replacement; and constructing a target training sample based on the parallel corpus training sample after label replacement and the at least one extended text.
In a second aspect, an embodiment of the present application provides a sample construction device, which includes an acquisition module, a processing module and a construction module. The acquisition module is used to acquire a parallel corpus training sample, where the parallel corpus training sample contains original text and carries a standard type label corresponding to each keyword in the original text. The processing module is used to replace a first keyword in the original text of the parallel corpus training sample acquired by the acquisition module with at least one first non-standard word corresponding to the first keyword, so as to generate at least one extended text. The processing module is further used to replace the first standard type label corresponding to the first keyword in the parallel corpus training sample acquired by the acquisition module with a second standard type label corresponding to the first non-standard word, so as to obtain a parallel corpus training sample after label replacement. The construction module is used to construct a target training sample based on the parallel corpus training sample after label replacement, as processed by the processing module, and the at least one extended text.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a processor and a memory, where the memory stores a program or instruction that can be run on the processor, and when the program or instruction is executed by the processor, the steps of the method described in the first aspect are implemented.
In a fourth aspect, an embodiment of the present application provides a readable storage medium, on which a program or instruction is stored, and when the program or instruction is executed by a processor, the steps of the method described in the first aspect are implemented.
In a fifth aspect, an embodiment of the present application provides a chip, which includes a processor and a communication interface, where the communication interface is coupled to the processor, and the processor is used to run a program or instruction to implement the method described in the first aspect.
In a sixth aspect, an embodiment of the present application provides a computer program product, which is stored in a storage medium and is executed by at least one processor to implement the method described in the first aspect.
In the embodiments of the present application, a parallel corpus training sample is obtained, where the parallel corpus training sample contains original text and carries a standard type label corresponding to each keyword in the original text; a first keyword in the original text is replaced with at least one first non-standard word corresponding to the first keyword to generate at least one extended text; the first standard type label corresponding to the first keyword is replaced with a second standard type label corresponding to the first non-standard word to obtain a parallel corpus training sample after label replacement; and a target training sample is constructed based on the parallel corpus training sample after label replacement and the at least one extended text. With this solution, the sample construction device can replace the keywords in the original text of the parallel corpus training sample and generate at least one extended text, which expands the vocabulary covered by the parallel corpus training sample. At the same time, it replaces the standard type label corresponding to each keyword with the standard type label corresponding to the non-standard word and obtains the parallel corpus training sample after label replacement, which enriches the content contained in the parallel corpus training sample. Finally, the sample construction device can construct the target training sample based on the parallel corpus training sample after label replacement and the at least one extended text. Therefore, the target training samples can include non-standard words and their corresponding standard type labels, which enriches the content of the parallel corpus training samples and gives them more, and more flexible, training content.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of examples of non-standard words provided in an embodiment of the present application;
FIG. 2 is a flow chart of a sample construction method provided in an embodiment of the present application;
FIG. 3 is a first schematic diagram of an example of a sample construction method provided in an embodiment of the present application;
FIG. 4 is a second schematic diagram of an example of a sample construction method provided in an embodiment of the present application;
FIG. 5 is a third schematic diagram of an example of a sample construction method provided in an embodiment of the present application;
FIG. 6 is a flow chart of translation performed by a translation model provided in an embodiment of the present application;
FIG. 7 is a schematic diagram of the structure of a sample construction device provided in an embodiment of the present application;
FIG. 8 is a first schematic diagram of the hardware structure of an electronic device provided in an embodiment of the present application;
FIG. 9 is a second schematic diagram of the hardware structure of an electronic device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are some, rather than all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application fall within the scope of protection of the present application.
The terms "first", "second", etc. in the specification and claims of the present application are used to distinguish similar objects, and are not used to describe a specific order or sequence. It should be understood that terms used in this way are interchangeable where appropriate, so that the embodiments of the present application can be implemented in an order other than those illustrated or described here. The objects distinguished by "first", "second", etc. are generally of one type, and the number of objects is not limited; for example, the first object can be one or more. In addition, "and/or" in the specification and claims means at least one of the connected objects, and the character "/" generally indicates that the objects before and after it are in an "or" relationship.
Some terms involved in the embodiments of the present application are explained below.
1. Cognate characters/words: Between languages or scripts on closely related branches of a language family, there are often many characters/words with the same linguistic origin. Their pronunciation, spelling or meaning is similar, and they are easily confused in glyph composition. Examples include Chinese and Japanese, which are both written with Chinese characters (for example, "荣誉" and "栄誉"); English and German, which both belong to the West Germanic branch (for example, "popular" and its German counterpart); and Simplified and Traditional Chinese. Because of input errors and similar causes, a word in the text to be translated may be replaced by a cognate, which lowers the quality of the translation.
2. Kana: A phonetic script of Japanese. It has two forms, hiragana and katakana, which can be converted into each other, and each kana represents one syllable. Every kanji in Japanese can be transcribed into kana according to its pronunciation, similar to pinyin in Chinese. Kana is also a written script of Japanese in its own right, used to represent native Japanese vocabulary, grammatical particles, and so on.
3. Kanji: The Chinese characters used in Japanese. Together with kana they make up the written script of Japanese, and they are often used to represent the names of objects, actions, and so on. Modern Japanese has about 2,000 to 3,000 commonly used kanji. Their glyphs share an origin with Chinese characters, and they overlap with and differ from both simplified and traditional characters to some extent.
4. Original text: The original text to be translated. The language of the original text is not restricted.
5. Translation: The result of translating the original text with the translation model. The language of the translation is not restricted.
6. Language model: A model used to calculate the probability of a sentence (that is, the probability that a sequence of words forms a normal sentence). Its core is to calculate the probability of the current word from the preceding n words in the sentence. Perplexity is usually used as the evaluation metric.
7. Perplexity: A metric for evaluating the quality of a sentence. The higher the perplexity, the harder the sentence is to understand, that is, the less likely it is to be a fluent and semantically correct sentence.
8、词法:对句子中词的研究,包括词的结构、形态及词性,如名词、形容词、副词,英语中的单数、复数等。8. Morphology: The study of words in sentences, including their structure, morphology and parts of speech, such as nouns, adjectives, adverbs, singular and plural in English, etc.
9、句法结构:句子成分的相关关系,以及它们组成句子的规则或过程,如常见的“主谓宾”结构。9. Syntactic structure: the relationship between sentence components and the rules or processes by which they form sentences, such as the common "subject-predicate-object" structure.
10、序列标注:给定一个句子,对句子中的每一个词语进行标注,或者说对词语的类别标签做出预测。10. Sequence labeling: Given a sentence, label each word in the sentence, or predict the category label of the word.
11、分词:序列标注任务的一种。对于中文、日语等书写时词语间不存在空格的语言,分词模型能够对句子进行词级别切分,并对词语的词法、句法结构等类别标签进行预测。本方案中训练的分词模型还涉及对不符合规范词语的扩展形式(例如:发音拼写、同源词、易混淆词等)进行预测。11. Word segmentation: a type of sequence labeling task. For languages such as Chinese and Japanese where there are no spaces between words when writing, the word segmentation model can segment sentences at the word level and predict the category labels such as the lexical and syntactic structures of the words. The word segmentation model trained in this solution also involves predicting the extended forms of words that do not conform to the standard (for example: pronunciation spelling, cognates, easily confused words, etc.).
The sample construction method, apparatus, electronic device, and readable storage medium provided in the embodiments of this application are described in detail below, with reference to the accompanying drawings, through specific embodiments and their application scenarios.
Existing machine translation methods usually train the translation model on large-scale bilingual parallel-corpus training samples and generate translations based on the distribution of the real corpus.
However, the original text in parallel-corpus training samples is usually high-quality standard text, in which normalization problems such as words transcribed into pronunciation spellings or replaced by cognates rarely occur. The translation model is therefore seldom exposed to such non-standard words and cannot translate them accurately. In certain scenarios, though, the text fed to the translation model may contain non-standard words whose expression does not conform to conventional grammar. For example, in language education, as shown in Figure 1, words in a text may be transcribed into the pronunciation spelling of that language (such as Chinese pinyin or Japanese kana) for teaching or examinations. Typing mistakes by users may also introduce pronunciation spellings, wrong characters, cognate substitutions, and similar errors into the text to be translated. In tasks such as image translation and speech translation, the output of upstream modules such as image text recognition and speech recognition may contain glyph-similarity errors, sound-similarity errors, and transcoding errors, so the downstream translation model may again receive non-standard text. A text sequence containing such non-standard or erroneous words is usually an uncommon sequence, that is, its expression does not conform to conventional grammar, morphology, or syntactic structure, so the translation model generally has difficulty translating these words correctly.
Take Japanese as an example. On the one hand, Japanese is written with two systems, kana and kanji. Japanese kanji closely resemble Chinese characters, and they partly overlap with and partly differ from both Simplified Chinese characters (hereinafter "Simplified Chinese") and Traditional Chinese characters (hereinafter "Traditional Chinese"), as shown in Table 1. When entering Japanese, Chinese-speaking users may, for convenience or because of glyph confusion, replace a kanji word with a cognate that does not exist in Japanese or with a wrong character of similar shape, which can cause the model to mistranslate.
Table 1
On the other hand, Japanese kana can carry meaning on its own in written expression and can also be used to spell out the reading of kanji. In online text such as social media posts, many users, to save effort, do not write the standard kanji and substitute its kana reading directly, as shown in Figure 1. However, a single kana reading is often ambiguous, which gives rise to many non-standard expressions of Japanese kanji words. Moreover, because Japanese is written without spaces between words and the character set of kana transcription coincides entirely with that of normal text, existing methods have difficulty correctly identifying and segmenting the non-standard kana words in a sentence in which many kanji have been transcribed into kana. In addition, Japanese has a large number of homophones: the same kana reading may correspond to several different kanji words, as shown in Table 2.
Table 2
Because most existing text translation methods are trained on standard corpora, when the input contains non-standard expressions the translation model tends to output a transliteration of those words, or even an arbitrary translation, so an accurate translation cannot be obtained.
In the sample construction method provided in the embodiments of this application, the sample construction device replaces keywords in the original text of a parallel-corpus training sample to generate at least one extended text, which widens the vocabulary covered by the parallel-corpus training sample. It also replaces the standard type label of each replaced keyword with the standard type label of the corresponding non-standard word, obtaining a label-replaced parallel-corpus training sample, which enriches the content of the sample. Finally, the sample construction device constructs a target training sample from the label-replaced parallel-corpus training sample and the at least one extended text. The target training sample therefore contains non-standard words together with their standard type labels, giving the parallel-corpus training sample richer and more flexible training content.
The sample construction method provided in the embodiments of this application may be executed by a sample construction device. For example, the sample construction device may be an electronic device or a component of the electronic device, such as an integrated circuit or a chip. The method is described below using the sample construction device as an example.
An embodiment of this application provides a sample construction method. Figure 2 shows a flowchart of the method, which may be executed by a sample construction device. As shown in Figure 2, the method may include steps 201 to 204 below.
Step 201: Obtain a parallel-corpus training sample.
The parallel-corpus training sample may contain an original text and carry the standard type label corresponding to each keyword in the original text.
In the embodiments of this application, a parallel-corpus training sample may be a bilingual or multilingual corpus consisting of an original text and its aligned target-language text.
Optionally, the original text may be a text that contains no non-standard words.
Optionally, a keyword may be any word in the original text.
Optionally, the standard type label may indicate the standard type of the keyword.
It can be understood that, on the one hand, a single language contains many homophones, so the extended form of a word may coincide with another standard word in the standard vocabulary. For example, "さくら" can be the kana transcription of the surname "佐倉" and can also denote the noun meaning "cherry blossom". Rule-based methods therefore cannot identify every non-standard word. On the other hand, the rules governing text sequences differ between languages. For example, Japanese has no spaces between words, so when many kanji in the text to be translated have been transcribed into kana, rule-based methods cannot reliably locate word boundaries and therefore cannot translate every word in the text accurately. For these reasons, the sample construction device in the method provided in the embodiments of this application may use text data annotated with morphological, syntactic, and similar information (that is, the original text) and add to it the standard type label corresponding to each keyword.
Step 202: Replace a first keyword in the original text with at least one first non-standard word corresponding to the first keyword, to generate at least one extended text.
Optionally, the sample construction device may replace any keyword in the original text with at least one corresponding first non-standard word, obtaining several extended texts that have the same meaning but differ in how standard they are.
Optionally, the other annotations of the extended text, such as part of speech and syntactic structure, may be kept identical to those of the original text.
Optionally, a non-standard word may be a word whose expression does not conform to conventional grammar, morphology, or syntactic structure.
Optionally, a non-standard word may exhibit at least one of the following: a pronunciation spelling, a wrong character, a cognate-character substitution, or a glyph error.
Optionally, "replacing a first keyword in the original text with at least one first non-standard word corresponding to the first keyword" can be understood as replacing a standard keyword with a non-standard word that is cognate with it, identical or similar to it in pronunciation, or similar to it in glyph shape, and whose expression does not conform to conventional grammar, morphology, or syntactic structure.
For example, if the original text contains the keyword "境界" (boundary), the sample construction device may replace it with the homophone "教会" (church) or with the non-standard pronunciation spelling "きょうかい".
Optionally, the parallel-corpus training sample may be one sample in a parallel-corpus training sample set. Step 202 may include step 202a below.
Step 202a: Based on the word frequency of each keyword of the original text in the parallel-corpus training sample set, determine at least one first keyword in the original text, and replace each of the at least one first keyword in the original text with its corresponding first non-standard word, to generate a first extended text.
The first extended text is any one of the at least one extended text.
Optionally, the sample construction device may replace keywords in the original text based on the word frequency of each keyword in the parallel-corpus training sample set.
It can be understood that the higher the frequency of a word, the more likely it is to be replaced.
Specifically, a first keyword in the original text may be replaced with its corresponding first non-standard word at a rate set according to its word frequency in the parallel-corpus training sample set.
For example, as shown in Figure 3, the keywords "とても" (very), "頼もしく" (dependably), and "優しい" (gently) are replaced according to their word frequencies in the parallel-corpus training sample set. To obtain extended text 1, "頼もしく" is replaced with the pronunciation-spelling form "たのもしく" and "優しい" with the pronunciation-spelling form "やさしい" (standard type label for both: pronunciation spelling-hiragana). To obtain extended text 2, "とても" is replaced with the pronunciation-spelling form "トテモ" (label: pronunciation spelling-katakana), "頼もしく" with the cognate form "賴もしく" (label: cognate-Traditional Chinese), and "優しい" with the cognate form "优しい" (label: cognate-Simplified Chinese).
Because the sample construction device replaces keywords according to their word frequency in the parallel-corpus training sample set, high-frequency keywords are replaced more often with their corresponding non-standard words. The generated extended texts can therefore cover as many of the possible non-standard forms of the original text as possible, which makes the subsequent training of the translation model more comprehensive.
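The frequency-driven replacement of step 202a can be sketched as follows. This is a minimal illustration, not the method as claimed: the table `EXTENDED_FORMS`, the function `build_extended_text`, the linear mapping from word frequency to replacement probability, the `max_rate` parameter, and the toy corpus (including the tokens "彼" and "は") are all assumptions made for the example. Only the three keywords and their non-standard forms come from the Figure 3 example.

```python
import random
from collections import Counter

# Hypothetical lookup: standard word -> [(non-standard form, standard type label)].
EXTENDED_FORMS = {
    "頼もしく": [("たのもしく", "pronunciation spelling-hiragana"),
                 ("賴もしく", "cognate-Traditional Chinese")],
    "優しい": [("やさしい", "pronunciation spelling-hiragana"),
               ("优しい", "cognate-Simplified Chinese")],
    "とても": [("トテモ", "pronunciation spelling-katakana")],
}

def build_extended_text(tokens, word_freq, rng, max_rate=0.9):
    """Return (new_tokens, labels) for one extended text.

    Assumption: the replacement probability of a keyword is proportional
    to its frequency in the sample set, capped at max_rate.
    """
    top = max(word_freq.values())
    new_tokens, labels = [], []
    for tok in tokens:
        forms = EXTENDED_FORMS.get(tok)
        prob = max_rate * word_freq.get(tok, 0) / top
        if forms and rng.random() < prob:
            form, label = rng.choice(forms)
            new_tokens.append(form)
            labels.append(label)
        else:
            new_tokens.append(tok)
            labels.append("standard")
    return new_tokens, labels

# Toy, pre-segmented sample set used only to obtain word frequencies.
corpus = [["彼", "は", "とても", "頼もしく", "優しい"], ["とても", "優しい"]]
word_freq = Counter(tok for sent in corpus for tok in sent)

rng = random.Random(0)
extended = [build_extended_text(corpus[0], word_freq, rng) for _ in range(3)]
for tokens, labels in extended:
    print("".join(tokens), labels)
```

Calling the function several times on the same original text yields several extended texts, and keywords that are more frequent in the sample set are swapped more often, which is the behavior the paragraph above describes.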
Step 203: Replace the first standard type label corresponding to the first keyword with the second standard type label corresponding to the first non-standard word, to obtain a label-replaced parallel-corpus training sample.
Optionally, a standard type label may indicate the standard type of a word.
For example, when a word is a standard word (that is, the first keyword), its standard type label (the first standard type label) may indicate that it is standard; when a word is a non-standard word (that is, the first non-standard word), its standard type label (the second standard type label) may indicate in what way it is non-standard.
For example, as shown in Table 3, the second standard type label may take forms such as pronunciation spelling-hiragana, pronunciation spelling-katakana, cognate-Simplified Chinese, cognate-Traditional Chinese, easily confused word-Simplified Chinese, easily confused word-Traditional Chinese, and easily confused word-recombination.
Table 3
Step 204: Construct a target training sample based on the label-replaced parallel-corpus training sample and the at least one extended text.
Optionally, the sample construction device may associate each non-standard word in the extended text with its corresponding standard type label in the label-replaced parallel-corpus training sample, to obtain the target training sample.
In the sample construction method provided in the embodiments of this application, the sample construction device replaces keywords in the original text of a parallel-corpus training sample to generate at least one extended text, which widens the vocabulary covered by the parallel-corpus training sample. It also replaces the standard type label of each replaced keyword with the standard type label of the corresponding non-standard word, obtaining a label-replaced parallel-corpus training sample, which enriches the content of the sample. Finally, the sample construction device constructs a target training sample from the label-replaced parallel-corpus training sample and the at least one extended text. The target training sample therefore contains non-standard words together with their standard type labels, giving the parallel-corpus training sample richer and more flexible training content.
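One way to picture steps 202 to 204 is as an operation on a token sequence with one label per token. The sketch below is an assumption about the data layout, not something the source specifies: the `Sample` class, its field names, the helper functions, and the placeholder target string are all hypothetical.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Sample:
    source_tokens: tuple  # segmented original or extended text
    norm_labels: tuple    # one standard type label per token
    target_text: str      # aligned translation, untouched by the substitution

def substitute(sample, index, nonstandard_form, nonstandard_label):
    """Steps 202 and 203: swap one keyword and its label at the same position."""
    tokens = list(sample.source_tokens)
    labels = list(sample.norm_labels)
    tokens[index] = nonstandard_form
    labels[index] = nonstandard_label
    return replace(sample, source_tokens=tuple(tokens), norm_labels=tuple(labels))

def build_target_samples(sample, substitutions):
    """Step 204: one target sample per (index, form, label) substitution."""
    return [substitute(sample, i, form, label) for i, form, label in substitutions]

original = Sample(
    source_tokens=("とても", "頼もしく", "優しい"),
    norm_labels=("standard", "standard", "standard"),
    target_text="<aligned target sentence>",  # placeholder, not real data
)
targets = build_target_samples(original, [
    (1, "たのもしく", "pronunciation spelling-hiragana"),
    (2, "优しい", "cognate-Simplified Chinese"),
])
for t in targets:
    print(list(zip(t.source_tokens, t.norm_labels)))
```

Keeping the token and its label at the same index is what ties each non-standard word to its standard type label; the aligned target text is reused unchanged because the substitution does not alter the meaning.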
Optionally, the number of extended texts is N, where N is a positive integer. After step 202, the sample construction method provided in the embodiments of this application may further include step 205 below.
Step 205: If a second extended text among the N extended texts contains an out-of-vocabulary (OOV) word, that is, a word not included in the parallel-corpus training sample set, initialize the feature information of the OOV word.
The initialization includes at least one of the following: computing the feature information of the OOV word as a weighted average, weighted by the word frequencies in the parallel-corpus training sample set of the first keyword corresponding to the OOV word and of each non-standard word that corresponds to that first keyword in each of the N extended texts; computing the feature information of the OOV word as a weighted average of the feature information of its cognates; setting the feature information of the OOV word to 0; randomly initializing the feature information of the OOV word.
Optionally, the first extended text and the second extended text may be the same or different.
Optionally, the sample construction device may convert each extended text, based on the feature information of its words, into the word vector sequence used for model training.
For example, the sample construction device may obtain the word vector sequence with algorithms such as Word2Vec (word to vector) or GloVe (a regression algorithm based on global word-frequency statistics), or may learn it iteratively during the training of a translation model such as a Transformer.
In practice, the sample construction device may obtain the word vector sequence of an extended text in any feasible way; this application imposes no specific limitation.
In the embodiments of this application, for an OOV word, that is, a word that has not appeared in the parallel-corpus training sample set, any combination of the following methods may be used to initialize its feature information and obtain the corresponding word vector: ① compute the feature information of the OOV word as a weighted average, weighted by the word frequencies in the parallel-corpus training sample set of the first keyword corresponding to the OOV word and of each non-standard word that corresponds to that first keyword in each of the N extended texts; ② compute the feature information of the OOV word as a weighted average of the feature information of its cognates; ③ set the feature information of the OOV word to 0; ④ randomly initialize the feature information of the OOV word.
During actual model training, the sample construction device may also randomly initialize the feature information of the standard type label corresponding to an OOV word, or combine standard type labels and take a weighted average of the feature information of the corresponding words, to obtain the standard type label of the OOV word and its feature information.
In this way, at the data level, initializing the feature information of OOV words strengthens the training of the translation model. At the model level, this initialization lets the translation model learn, during training, the phonetic correlation between a non-standard word and its corresponding standard word, which improves the robustness of the translation model. The sample construction method provided in the embodiments of this application can thus improve the translation quality and accuracy of the translation model.
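The initialization options of step 205 can be illustrated with plain Python lists standing in for word vectors. The function name `init_oov_vector`, its parameters, and the random range are assumptions made for this sketch; the source does not prescribe a vector dimension, a random distribution, or an order of precedence among the options.

```python
import random

def init_oov_vector(related, dim, fallback="zero", rng=None):
    """Initialize the vector of an out-of-vocabulary word.

    related: list of (vector, word_frequency) pairs for the forms tied to
             the OOV word (its standard keyword, sibling non-standard
             forms, or cognates) that do have vectors.
    fallback: "zero" or "random", used when no related vector is available.
    """
    total = sum(freq for _, freq in related)
    if total > 0:
        # Options 1 and 2: frequency-weighted average of related vectors.
        return [sum(vec[k] * freq for vec, freq in related) / total
                for k in range(dim)]
    if fallback == "random":
        # Option 4: small random values (the range is an arbitrary choice).
        rng = rng or random.Random()
        return [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    # Option 3: all zeros.
    return [0.0] * dim

# Toy 2-dimensional vectors: a keyword seen 3 times and a cognate seen once.
vec = init_oov_vector([([1.0, 0.0], 3), ([0.0, 1.0], 1)], dim=2)
print(vec)
```

A weighted average places the new vector near the forms it is related to, so the model starts training with the OOV form already close to its standard counterpart rather than at an arbitrary point.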
Optionally, after step 204, the sample construction method provided in the embodiments of this application may further include steps 206 and 207 below.
Step 206: Restore at least one non-standard word in a first translation text to a standard word, to generate M second translation texts.
Each non-standard word is restored to at least one standard word.
Optionally, the first translation text may be a sentence or a paragraph.
Optionally, the first translation text may be text entered by a user or text obtained from another device.
Optionally, the sample construction device may identify non-standard words in the first translation text by the following three methods. Method 1: an extended vocabulary construction method based on cognates, pronunciations, and character sets. Method 2: a word segmentation model method enhanced with the extended vocabulary. Method 3: an irregular-text detection method based on language model probability.
Methods 1 to 3 are described in detail below with reference to specific embodiments.
Method 1: An extended vocabulary construction method based on cognates, pronunciations, and character sets.
In the embodiments of this application, the extended forms of a non-standard word in the first translation text may include all the words that the non-standard word matches in the extended vocabulary.
It can be understood that when the first translation text contains a non-standard word, the characters of that word may fall outside the normal character set of the current language, for example pinyin characters appearing in Chinese text, or the word may use a spelling that does not exist in the current language, for example English "October" written with the spelling "Oktober" from German, a language of the same family. The sample construction device in the method provided in the embodiments of this application may therefore build an extended vocabulary by mining the similarities between words of different languages. An example of the extended vocabulary is shown in Table 3.
In the embodiments of this application, taking Japanese as an example, the extended vocabulary may include: the common pronunciation spellings of a word and their variants; characters/words that are cognate or synonymous with the word in other languages on closely related branches; easily confused words obtained by recombining a word with its cognates; and easily confused words obtained by replacing a word with a word of similar glyph shape.
Optionally, because the dictionary definitions of a word and of its cognate or synonymous characters/words are highly similar, cognate entries may be built by mining the dictionaries of the various languages.
Optionally, an easily confused word may be a word that exists neither in its original language nor in a cognate language.
Optionally, the extended vocabulary may include multiple word sets, each of which may include one or more non-standard words and the set of standard words corresponding to those non-standard words.
Optionally, the sample construction device may identify non-standard words in the first translation text by character set detection, extended vocabulary matching, and similar methods, and take the word set matched in the extended vocabulary as a first word set.
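Method 1 combines two checks: whether a token uses characters outside the expected character set, and whether it matches an entry in the extended vocabulary. The sketch below assumes the text is already segmented and uses tiny hand-written tables; `STANDARD_VOCAB`, `EXTENDED_VOCAB`, `ALLOWED_CHARS`, and `find_nonstandard` are hypothetical names, and deriving the allowed character set from the standard vocabulary is a simplification chosen for the example.

```python
# Hypothetical standard vocabulary of the current language.
STANDARD_VOCAB = {"優しい", "頼もしく", "とても", "境界", "教会"}

# Hypothetical extended vocabulary: non-standard form -> standard candidates.
EXTENDED_VOCAB = {
    "优しい": {"優しい"},           # cognate character from Simplified Chinese
    "賴もしく": {"頼もしく"},       # cognate character from Traditional Chinese
    "きょうかい": {"境界", "教会"},  # pronunciation spelling shared by homophones
}

# Simplification: the "normal character set" is every character that
# occurs in the standard vocabulary.
ALLOWED_CHARS = {ch for word in STANDARD_VOCAB for ch in word}

def find_nonstandard(tokens):
    """Map each non-standard token to its sorted list of standard candidates.

    A token is flagged if it matches the extended vocabulary or contains a
    character outside ALLOWED_CHARS; in the latter case the candidate list
    may be empty.
    """
    hits = {}
    for tok in tokens:
        if tok in STANDARD_VOCAB:
            continue
        candidates = EXTENDED_VOCAB.get(tok)
        out_of_charset = any(ch not in ALLOWED_CHARS for ch in tok)
        if candidates or out_of_charset:
            hits[tok] = sorted(candidates or [])
    return hits

print(find_nonstandard(["とても", "优しい", "きょうかい"]))
```

The entry for "きょうかい" returns two candidates because the reading is shared by homophones, which is why a matched word set, rather than a single word, is carried forward to the later steps.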
Method 2: A word segmentation model method enhanced with the extended vocabulary.
Optionally, as shown in Figure 3, a word in the first translation text may be replaced, at a rate set according to its word frequency in the parallel-corpus training sample set, with any of its extended forms in the extended vocabulary, and its standard type label replaced accordingly; the word segmentation model is then trained on the extended forms of the corpus and their corresponding standard type labels.
Optionally, before step 206, the sample construction method provided in the embodiments of this application may further include step A below.
Step A: Input the first translation text into the word segmentation model, segment the first translation text into M segments, where M is an integer greater than 1, and perform non-standard word recognition on each of the M segments to obtain a recognition result for each segment. The recognition result of a segment indicates whether that segment is a non-standard word.
For example, the word segmentation model may be a word segmentation model that has undergone enhanced training.
For example, the enhanced word segmentation model may predict a standard type label for each segment it produces; if the predicted label indicates that the segment is non-standard, the segment is identified as a non-standard word.
Because enhanced training gives the word segmentation model the ability to recognize such words, to learn the similarities between non-standard and standard words in morphology, syntactic structure, context, and so on, and to predict a standard type label for each output segment, the model can segment the first translation text accurately and identify the non-standard words in it.
Method 3: An irregular-text detection method based on language model probability.
It can be understood that non-standard words occur with low probability in the parallel-corpus training sample set, and homophones differ considerably in meaning and context, so text containing them is less fluent than normal text. A language model can therefore be used to compute the perplexity of the first translation text and determine whether the text contains non-standard expressions.
可选地,样本构建装置可以将第一翻译文本输入n元语言模型中,通过下述的公式1计算当前词wi与第一翻译文本的前n个词相关的概率。Optionally, the sample construction device may input the first translation text into an n-gram language model and use the following formula 1 to calculate the probability of the current word w_i conditioned on the n words that precede it in the first translation text.
其中,wi为当前词,N为第一翻译文本的词数。Wherein, w_i is the current word, and N is the number of words in the first translation text.
由公式(1)可知,当前词语的条件概率P(wi|wi-n…wi-1)越低,其所在的第一翻译文本的通顺程度就越低,该第一翻译文本的困惑度也越高。It can be seen from formula (1) that the lower the conditional probability P(w_i | w_{i-n} … w_{i-1}) of the current word, the less fluent the first translation text containing it, and the higher the perplexity of that first translation text.
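The image of formula 1 is not reproduced in this text. As a hedged reconstruction only, assuming formula 1 is the usual n-gram perplexity (an assumption, not confirmed by the source), a form consistent with the definitions of w_i, N and the conditional probability above would be:

```latex
\mathrm{PPL}(W) = \left( \prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-n} \dots w_{i-1})} \right)^{\frac{1}{N}}
```

Under this form, a lower conditional probability for any word raises the perplexity of the whole text, which matches the relationship stated for formula (1).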
可选地,在上述步骤206之前,本申请实施例提供的样本构建方法还可以包括下述的步骤B1至步骤B4。Optionally, before the above step 206, the sample construction method provided in the embodiment of the present application may further include the following steps B1 to B4.
步骤B1、对第一翻译文本进行分词,得到M个分词。Step B1: Segment the first translation text into words to obtain M word segments.
其中,M为大于1的整数。Wherein, M is an integer greater than 1.
示例性地,样本构建装置可以将第一翻译文本输入增强的分词模型进行分词。Exemplarily, the sample construction device may input the first translated text into an enhanced word segmentation model for word segmentation.
步骤B2、针对M个分词中的每个分词,在一个分词对应的条件概率小于第一预设阈值的情况下,获取一个分词对应的P个第一符合规范词。Step B2: For each of the M word segments, when the conditional probability corresponding to a word segment is less than a first preset threshold, obtain P first standard-compliant words corresponding to that word segment.
其中,P为正整数。Wherein, P is a positive integer.
可以理解,若该一个分词对应的条件概率小于第一预设阈值,则表示该分词可能为不符合规范词。It can be understood that if the conditional probability corresponding to the word segment is less than the first preset threshold, the word segment may be a non-standard word.
可选地,P个第一符合规范词可以为该一个分词在扩展词表中进行匹配到的符合规范词集合中的X个符合规范词。Optionally, the P first standard-compliant words may be X standard-compliant words in a standard-compliant word set matched by the one word segment in the extended vocabulary.
步骤B3、将第一翻译文本中的一个分词分别替换为P个第一符合规范词中每个第一符合规范词,得到P个替换后的第一翻译文本。Step B3: Replace the word segment in the first translation text with each of the P first standard-compliant words in turn, to obtain P replaced first translation texts.
步骤B4、若任一替换后的第一翻译文本对应的第一困惑度小于第一翻译文本对应的第二困惑度,且,第一困惑度与第二困惑度间的差值大于第二预设阈值,则样本构建装置确定一个分词为不符合规范词。Step B4: If the first perplexity corresponding to any replaced first translation text is smaller than the second perplexity corresponding to the first translation text, and the difference between the first perplexity and the second perplexity is greater than a second preset threshold, the sample construction device determines that the word segment is a non-standard word.
可以理解,若任一替换后的第一翻译文本对应的第一困惑度小于第一翻译文本对应的第二困惑度,且,第一困惑度与第二困惑度间的差值大于第二预设阈值,则可以表示替换后的第一翻译文本更流畅,更合理。也就是说,替换前的第一翻译文本中存在不符合规范词。It can be understood that if the first perplexity corresponding to any replaced first translation text is less than the second perplexity corresponding to the first translation text, and the difference between the first perplexity and the second perplexity is greater than the second preset threshold, the replaced first translation text is more fluent and more reasonable. In other words, the first translation text before replacement contains a non-standard word.
如此,由于样本构建装置可以将第一翻译文本中可能的不符合规范词替换为其对应的第一符合规范词,并分别计算替换前后的第一翻译文本的困惑度,在替换后的第一翻译文本的困惑度下降差值大于第二预设阈值的情况下,将该词确定为不符合规范词。因此,可以使得对不符合规范词的识别更加准确,并且使得进行替换后的第一翻译文本更流畅,更合理,从而使得后续的翻译更加准确,正确率更高。In this way, since the sample construction device can replace the possible non-conforming words in the first translation text with the corresponding first conforming words, and respectively calculate the perplexity of the first translation text before and after the replacement, when the difference in the perplexity decrease of the first translation text after the replacement is greater than the second preset threshold, the word is determined as a non-conforming word. Therefore, the recognition of non-conforming words can be made more accurate, and the first translation text after the replacement can be made more fluent and reasonable, so that the subsequent translation is more accurate and the accuracy rate is higher.
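Steps B2 to B4 can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes a bigram model (one context word), and the names `perplexity`, `find_nonstandard`, `TOY_PROBS` and `toy_cond_prob`, the probability table and both threshold values are all hypothetical.

```python
import math

def perplexity(tokens, cond_prob):
    """Perplexity of a token list under a bigram model (assumed context size 1)."""
    log_sum = 0.0
    for i, tok in enumerate(tokens):
        prev = tokens[i - 1] if i > 0 else "<s>"
        log_sum += math.log(cond_prob(prev, tok))
    return math.exp(-log_sum / len(tokens))

def find_nonstandard(tokens, cond_prob, candidates, prob_threshold, ppl_threshold):
    """Flag segments whose replacement lowers perplexity by more than ppl_threshold."""
    base_ppl = perplexity(tokens, cond_prob)  # the "second perplexity"
    flagged = []
    for i, tok in enumerate(tokens):
        prev = tokens[i - 1] if i > 0 else "<s>"
        if cond_prob(prev, tok) >= prob_threshold:
            continue  # B2: only low-probability segments are examined
        for cand in candidates.get(tok, []):
            replaced = tokens[:i] + [cand] + tokens[i + 1:]  # B3: substitute
            new_ppl = perplexity(replaced, cond_prob)        # the "first perplexity"
            if new_ppl < base_ppl and base_ppl - new_ppl > ppl_threshold:  # B4
                flagged.append(tok)
                break
    return flagged

# Hypothetical toy bigram table; unseen pairs get a small floor probability.
TOY_PROBS = {("<s>", "我"): 0.5, ("我", "已经"): 0.4, ("已经", "到了"): 0.5}

def toy_cond_prob(prev, tok):
    return TOY_PROBS.get((prev, tok), 0.001)

tokens = ["我", "己经", "到了"]
flagged = find_nonstandard(tokens, toy_cond_prob, {"己经": ["已经"]},
                           prob_threshold=0.01, ppl_threshold=1.0)
```

In this toy run only "己经" has both a low conditional probability and a candidate replacement that makes the sentence markedly more fluent, so it is the only segment flagged.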
可选地,上述步骤206具体可以通过下述的步骤206a和206b实现。Optionally, the above step 206 may be specifically implemented through the following steps 206a and 206b.
步骤206a、获取至少一个不符合规范词对应的第一词集合。Step 206a: Obtain a first word set corresponding to the at least one non-standard word.
其中,第一词集合可以包括:多个词子集。一个词子集中可以包括至少一个不符合规范词中的一个或多个不符合规范词,每个不符合规范词对应一个符合规范词集合。The first word set may include a plurality of word subsets. A word subset may include one or more of the at least one non-standard word, and each non-standard word corresponds to a standard word set.
可以理解,若至少一个不符合规范词中包含多个不符合规范词,则该多个不符合规范词中的每个不符合规范词所对应的符合规范词集合可以相同,也可以不同。It can be understood that if at least one non-compliant word includes multiple non-compliant words, the compliant word sets corresponding to each of the multiple non-compliant words may be the same or different.
例如,上述至少一个不符合规范词中包含不符合规范词“己经”和不符合规范词“巳经”,不符合规范词“己经”对应的符合规范词集合可以为包含符合规范词“已经”的集合,不符合规范词“巳经”对应的符合规范词集合也可以为包含符合规范词“已经”的集合。For example, the at least one non-standard word includes the non-standard words "己经" and "巳经". The standard word set corresponding to the non-standard word "己经" may be a set containing the standard word "已经" (already), and the standard word set corresponding to the non-standard word "巳经" may likewise be a set containing the standard word "已经".
步骤206b、针对多个词子集中的每个词子集,在第一翻译文本中将一个词子集与一个词子集中的每个不符合规范词对应的符合规范词集合进行还原映射,以生成至少一个第二翻译文本。Step 206b: For each word subset in the multiple word subsets, restore and map a word subset with a set of standard words corresponding to each non-standard word in the word subset in the first translation text to generate at least one second translation text.
本申请实施例中,“将一个词子集与一个词子集中的每个不符合规范词对应的符合规范词集合进行还原映射”可以理解为:将上述一个词子集中的每个不符合规范词依次还原为其所对应的符合规范词集合中的每个符合规范词,并遍历所有的符合规范词还原组合。In the embodiment of the present application, "restoring and mapping a word subset to a set of compliant words corresponding to each non-compliant word in the word subset" can be understood as: restoring each non-compliant word in the above-mentioned word subset in turn to each compliant word in the corresponding set of compliant words, and traversing all restored combinations of compliant words.
例如,第一翻译文本为:一想到明天就要告别xiaoyuan,我的心中就涌起了申申的眷恋之情。其中,包含不符合规范词“xiaoyuan”和不符合规范词“申申”。不符合规范词“xiaoyuan”对应的符合规范词集合包括:校园,小院;不符合规范词“申申”对应的符合规范词集合包括:深深,审审。那么,样本构建装置可以将每个不符合规范词对应的符合规范词集合进行还原映射,得到6个第二翻译文本,分别为:一想到明天就要告别校园,我的心中就涌起了申申的眷恋之情;一想到明天就要告别校园,我的心中就涌起了深深的眷恋之情;一想到明天就要告别校园,我的心中就涌起了审审的眷恋之情;一想到明天就要告别小院,我的心中就涌起了申申的眷恋之情;一想到明天就要告别小院,我的心中就涌起了深深的眷恋之情;一想到明天就要告别小院,我的心中就涌起了审审的眷恋之情。For example, the first translation text is "一想到明天就要告别xiaoyuan,我的心中就涌起了申申的眷恋之情" (roughly, "At the thought of saying goodbye to xiaoyuan tomorrow, a 申申 feeling of attachment wells up in my heart"). It contains the non-standard words "xiaoyuan" and "申申". The standard word set corresponding to "xiaoyuan" includes 校园 (campus) and 小院 (courtyard); the standard word set corresponding to "申申" includes 深深 (deeply) and 审审 (scrutinize). The sample construction device can then perform restoration mapping with the standard word set corresponding to each non-standard word, obtaining 6 second translation texts. Written as (replacement for "xiaoyuan", form in the "申申" position), these are: (校园, 申申), (校园, 深深), (校园, 审审), (小院, 申申), (小院, 深深), and (小院, 审审).
如此,由于样本构建装置可以将第一翻译文本中的不符合规范词还原为所有有可能的符合规范词,以生成至少一个第二翻译文本,因此可以尽可能的修正第一翻译文本中的不符合规范词,使得后续得到的译文更加准确、通顺。In this way, since the sample construction device can restore the non-standard words in the first translation text to all possible standard words to generate at least one second translation text, the non-standard words in the first translation text can be corrected as much as possible, making the subsequent translation more accurate and fluent.
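The restoration mapping of step 206b can be sketched as a Cartesian product over the standard word sets of the words in each word subset. The function name `restore`, the shortened token list, and the choice of which word subsets to enumerate are assumptions made for illustration; with the two subsets below, the count matches the six texts of the example above.

```python
from itertools import product

def restore(tokens, subset, candidate_sets):
    """Replace every word in `subset` by each of its candidates, over all combinations."""
    positions = [i for i, tok in enumerate(tokens) if tok in subset]
    options = [candidate_sets[tokens[i]] for i in positions]
    results = []
    for combo in product(*options):
        restored = list(tokens)
        for pos, word in zip(positions, combo):
            restored[pos] = word
        results.append("".join(restored))
    return results

# Shortened, pre-segmented version of the example sentence (hypothetical segmentation).
tokens = ["告别", "xiaoyuan", ",", "申申", "的", "眷恋"]
candidate_sets = {"xiaoyuan": ["校园", "小院"], "申申": ["深深", "审审"]}
word_subsets = [{"xiaoyuan"}, {"xiaoyuan", "申申"}]

second_texts = [text for subset in word_subsets
                for text in restore(tokens, subset, candidate_sets)]
```

The subset containing only "xiaoyuan" yields the two texts that keep "申申", and the subset containing both words yields the four fully restored combinations.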
步骤207、将第一翻译文本对应的第一特征信息和M个第二翻译文本中的X个第二翻译文本对应的第二特征信息输入第一翻译模型进行文本翻译,以得到目标译文。Step 207: input the first feature information corresponding to the first translated text and the second feature information corresponding to X second translated texts among the M second translated texts into the first translation model for text translation to obtain a target translated text.
其中,第一特征信息包括第一翻译文本的文本特征信息和第一翻译文本中的不符合规范词所对应的规范类型标签的特征信息,第二特征信息包括第二翻译文本的文本特征信息和第二翻译文本中的不符合规范词所对应的规范类型标签的特征信息。Among them, the first feature information includes text feature information of the first translated text and feature information of the standard type labels corresponding to the non-standard words in the first translated text, and the second feature information includes text feature information of the second translated text and feature information of the standard type labels corresponding to the non-standard words in the second translated text.
本申请实施例中,第一翻译模型是基于目标训练样本集训练得到的,目标训练样本集包括多个目标训练样本,一个目标训练样本对应平行语料训练样本集中的一个平行语料训练样本,M、X为正整数,且X小于或等于M。In an embodiment of the present application, the first translation model is obtained by training based on a target training sample set, the target training sample set includes multiple target training samples, one target training sample corresponds to a parallel corpus training sample in the parallel corpus training sample set, M and X are positive integers, and X is less than or equal to M.
可选地,上述步骤207具体可以通过下述的步骤207a和步骤207b实现。Optionally, the above step 207 may be specifically implemented through the following steps 207a and 207b.
步骤207a、将M个第二翻译文本中的X个第二翻译文本和第一翻译文本输入第一翻译模型进行文本翻译,输出L个候选译文。Step 207a: input X second translation texts among the M second translation texts and the first translation text into the first translation model for text translation, and output L candidate translations.
其中,L个候选译文包括X个第二翻译文本对应的候选译文以及第一翻译文本对应的候选译文,一个候选译文对应至少一个第二翻译文本,L为正整数,且L小于等于X。The L candidate translations include candidate translations corresponding to X second translation texts and candidate translations corresponding to the first translation text, one candidate translation corresponds to at least one second translation text, L is a positive integer, and L is less than or equal to X.
可以理解,由于增强后的翻译模型可以对不同扩展形式的不符合规范词语做出相同的翻译,因此翻译模型输出的候选译文数量小于输入的第二翻译文本的数量。It can be understood that, since the enhanced translation model can make the same translation for non-standard words with different extended forms, the number of candidate translations output by the translation model is less than the number of second translation texts input.
示例性地,如图4所示,在增强翻译模型中输入原始文本(即第一翻译文本)“両親は学校に勤める(父母在学校工作)”时,可以翻译得到“父母在学校工作”目标译文。Exemplarily, as shown in FIG. 4, when the original text (i.e., the first translation text) "両親は学校に勤める" (Parents work at the school) is input into the enhanced translation model, the target translation "父母在学校工作" (Parents work at the school) can be obtained.
当在增强翻译模型中输入扩展文本1“両亲は學校につとめる(父母在学校工作)”,即将原始文本中的“両親(两亲)”替换为与易混淆词重组的形式(即其规范类型标签为易混淆词-重组)“両亲(两亲)”,将“学校”替换为含有同源词和繁体的形式(即其规范类型标签为同源词-繁体)“學校(学校)”,将“勤める(工作)”替换为含有拼音读写的形式(即其规范类型标签为拼音读写-平假名)“つとめる(gongzuo)”,也可以翻译得到“父母在学校工作”目标译文。When extended text 1, "両亲は學校につとめる" (Parents work at the school), is input into the enhanced translation model, that is, when "両親" (parents) in the original text is replaced with the recombined confusable form "両亲" (standard type label: easily confused word-recombination), "学校" (school) is replaced with the traditional-character cognate form "學校" (standard type label: cognate-traditional), and "勤める" (to work) is replaced with the phonetic form "つとめる" (standard type label: pinyin reading and writing-hiragana), the target translation "父母在学校工作" can also be obtained.
当在增强翻译模型中输入扩展文本2“兩親はがっこうにツトメル(父母在学校工作)”,即将原始文本中的“両親(两亲)”替换为与易混淆词繁体的形式(即其规范类型标签为易混淆词-繁体)“兩親(两亲)”,将“学校”替换为含有拼音读写的形式(即其规范类型标签为拼音读写-平假名)“がっこう(xuexiao)”,将“勤める(工作)”替换为含有拼音读写的形式(即其规范类型标签为拼音读写-片假名)“ツトメル(gongzuo)”,也可以翻译得到“父母在学校工作”目标译文。When extended text 2, "兩親はがっこうにツトメル" (Parents work at the school), is input into the enhanced translation model, that is, when "両親" (parents) in the original text is replaced with the traditional-character confusable form "兩親" (standard type label: easily confused word-traditional), "学校" (school) is replaced with the phonetic form "がっこう" (standard type label: pinyin reading and writing-hiragana), and "勤める" (to work) is replaced with the phonetic form "ツトメル" (standard type label: pinyin reading and writing-katakana), the target translation "父母在学校工作" can also be obtained.
步骤207b、将L个候选译文中,满足第一条件的候选译文确定为目标译文。Step 207b: Determine the candidate translation that meets the first condition among the L candidate translations as the target translation.
可选地,满足第一条件的候选译文可以包括以下至少之一:Optionally, the candidate translations satisfying the first condition may include at least one of the following:
情况1:流畅度满足第一预定条件的候选译文;Case 1: The candidate translation whose fluency meets the first predetermined condition;
情况2:翻译质量满足第二预定条件的候选译文;Case 2: The candidate translation whose translation quality meets the second predetermined condition;
情况3:相关度满足第三预定条件的候选译文。Case 3: candidate translations whose relevance satisfies the third predetermined condition.
其中,上述相关度包括以下至少一项:先验概率,相似度,困惑度。Wherein, the above relevance includes at least one of the following: prior probability, similarity, and perplexity.
示例性地,第一预定条件可以为候选译文的困惑度小于或等于第三预设阈值。可以理解,候选译文的困惑度越低,表示该候选译文的流畅度越高,越合理。Exemplarily, the first predetermined condition may be that the perplexity of the candidate translation is less than or equal to a third preset threshold. It can be understood that the lower the perplexity of the candidate translation, the higher the fluency and the more reasonable the candidate translation.
示例性地,针对情况1,样本构建装置可以通过语言模型分别计算L个候选译文的困惑度,将困惑度小于或等于第三预设阈值的候选译文确定为目标译文。Exemplarily, for situation 1, the sample construction device may calculate the perplexities of L candidate translations respectively through the language model, and determine the candidate translation whose perplexity is less than or equal to the third preset threshold as the target translation.
示例性地,第二预定条件可以为候选译文的翻译质量大于或等于第四预设阈值。可以理解,样本构建装置可以将翻译质量大于或等于第四预设阈值的候选译文确定为目标译文。Exemplarily, the second predetermined condition may be that the translation quality of the candidate translation is greater than or equal to a fourth preset threshold. It is understood that the sample construction device may determine the candidate translation whose translation quality is greater than or equal to the fourth preset threshold as the target translation.
示例性地,第三预定条件可以为候选译文的相关度大于或等于第五预设阈值。可以理解,样本构建装置可以将相关度大于或等于第五预设阈值的候选译文确定为目标译文。Exemplarily, the third predetermined condition may be that the relevance of the candidate translation is greater than or equal to a fifth preset threshold. It is understood that the sample construction device may determine the candidate translation whose relevance is greater than or equal to the fifth preset threshold as the target translation.
应注意的是,若存在候选译文满足第一条件中的多个预定条件,则样本构建装置可以将满足最多预定条件的候选译文确定为目标译文。It should be noted that if there is a candidate translation that satisfies a plurality of predetermined conditions in the first condition, the sample construction device may determine the candidate translation that satisfies the most predetermined conditions as the target translation.
如此,由于样本构建装置可以基于候选译文的流畅度、翻译质量和相关度,将评价结果最优的候选译文确定为目标译文,因此可以使得输出的目标译文最佳。In this way, since the sample construction device can determine the candidate translation with the best evaluation result as the target translation based on the fluency, translation quality and relevance of the candidate translations, the output target translation can be optimized.
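The selection rule of step 207b, including the tie-break of preferring the candidate that satisfies the most predetermined conditions, can be sketched as below. The function `pick_target`, the dictionary fields and the threshold values are hypothetical, and the candidate list is assumed to be non-empty.

```python
def pick_target(candidates, ppl_max, quality_min, relevance_min):
    """Return the candidate meeting the most predetermined conditions, or None."""
    def conditions_met(c):
        return sum([
            c["perplexity"] <= ppl_max,       # case 1: fluency (third preset threshold)
            c["quality"] >= quality_min,      # case 2: quality (fourth preset threshold)
            c["relevance"] >= relevance_min,  # case 3: relevance (fifth preset threshold)
        ])
    best = max(candidates, key=conditions_met)
    return best if conditions_met(best) > 0 else None

candidates = [
    {"text": "A", "perplexity": 12.0, "quality": 0.9, "relevance": 0.4},
    {"text": "B", "perplexity": 15.0, "quality": 0.8, "relevance": 0.7},
    {"text": "C", "perplexity": 90.0, "quality": 0.3, "relevance": 0.2},
]
target = pick_target(candidates, ppl_max=20.0, quality_min=0.75, relevance_min=0.6)
```

Here candidate B meets all three conditions, A meets two and C meets none, so B is selected.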
可选地,样本构建装置可以通过表示和特征学习法来评估候选译文的翻译质量。Optionally, the sample construction device may evaluate the translation quality of the candidate translations by using representation and feature learning methods.
示例性地,在上述步骤207a之后,本申请实施例提供的样本构建方法还可以包括下述的步骤207c和步骤207d。Illustratively, after the above step 207a, the sample construction method provided in the embodiment of the present application may further include the following steps 207c and 207d.
步骤207c、针对L个候选译文中的每个候选译文,提取一个候选译文的第一文本特征信息,以及一个候选译文对应的第一翻译文本以及第一翻译文本的第二文本特征信息。Step 207c: for each candidate translation among the L candidate translations, extract first text feature information of a candidate translation, a first translated text corresponding to the candidate translation, and second text feature information of the first translated text.
示例性地,第一文本特征信息可以包括候选译文的词法、句法结构等特征。Exemplarily, the first text feature information may include features such as lexical and syntactic structures of the candidate translations.
示例性地,第二文本特征信息可以包括第二翻译文本和第一翻译文本的词法、句法结构等特征。Exemplarily, the second text feature information may include features such as lexical and syntactic structures of the second translated text and the first translated text.
示例性地,样本构建装置可以通过训练目标语种的分词模型,提取候选译文的第一文本特征信息,并通过原语种分词模型分别提取候选译文对应的第二翻译文本以及第一翻译文本的第二文本特征信息。Exemplarily, the sample construction device may train a word segmentation model for the target language to extract the first text feature information of the candidate translation, and use a source-language word segmentation model to extract the second text feature information of the second translation text corresponding to the candidate translation and of the first translation text, respectively.
步骤207d、基于第一文本特征信息和第二文本特征信息,计算出一个候选译文对应的翻译质量参数。Step 207d: Calculate a translation quality parameter corresponding to a candidate translation based on the first text feature information and the second text feature information.
示例性地,样本构建装置可以利用回归算法计算翻译结果的质量。Exemplarily, the sample construction device may calculate the quality of the translation result using a regression algorithm.
示例性地,一个候选译文对应的翻译质量参数可以为回归算法的结果数值。Exemplarily, the translation quality parameter corresponding to a candidate translation may be a result value of a regression algorithm.
可以理解,回归算法可以输出候选译文质量好坏的概率:回归算法的结果越接近1则表示该候选译文的质量越好,回归算法的结果越接近0则表示该候选译文的质量越差。It can be understood that the regression algorithm can output the probability of the quality of the candidate translation: the closer the result of the regression algorithm is to 1, the better the quality of the candidate translation is, and the closer the result of the regression algorithm is to 0, the worse the quality of the candidate translation is.
如此,由于样本构建装置可以基于一个候选译文的第一文本特征信息,以及一个候选译文对应的第一翻译文本以及第一翻译文本的第二文本特征信息,计算出一个候选译文对应的翻译质量参数,因此可以筛选出翻译质量较好的候选译文。In this way, since the sample construction device can calculate the translation quality parameter corresponding to a candidate translation based on the first text feature information of a candidate translation, the first translation text corresponding to the candidate translation, and the second text feature information of the first translation text, candidate translations with better translation quality can be screened out.
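As one possible reading of step 207d, a logistic-style regression maps the two groups of text features to a value between 0 and 1. The source only says "a regression algorithm", so the logistic form, the function `quality_score`, the feature values and the weights below are all assumptions.

```python
import math

def quality_score(candidate_features, source_features, weights, bias=0.0):
    """Map the first and second text features to a quality value in (0, 1)."""
    features = list(candidate_features) + list(source_features)
    if len(features) != len(weights):
        raise ValueError("one weight is required per feature")
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical feature values: two for the candidate translation, one for the source side.
good = quality_score([0.9, 0.8], [0.7], weights=[2.0, 1.5, 1.0], bias=-2.0)
poor = quality_score([0.1, 0.2], [0.7], weights=[2.0, 1.5, 1.0], bias=-2.0)
```

A result closer to 1 would indicate a better candidate translation and a result closer to 0 a worse one, in line with the description above.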
可选地,上述候选译文的相关度可以由下述的6个评价指标加权得到:Optionally, the relevance of the candidate translations can be obtained by weighting the following six evaluation indicators:
①将候选译文对应的第二翻译文本中的扩展词根据其扩展类型、与第一翻译文本中对应的不符合规范词的相似度、与第一翻译文本中对应的不符合规范词的词频等,给出先验概率,筛选出概率较高的候选译文。(例:若仅考虑扩展类型,设读音拼写、同源词、易混淆词的先验概率为[0.7,0.2,0.1],读音拼写中平假名、片假名的先验概率为[0.8,0.2],则读音拼写-平假名的先验概率为0.7*0.8=0.56)。① Assign a prior probability to each extended word in the second translation text corresponding to the candidate translation, based on its extension type, its similarity to the corresponding non-standard word in the first translation text, its word frequency relative to that non-standard word, and so on, and select the candidate translations with higher probability. (Example: if only the extension type is considered, and the prior probabilities of pronunciation spelling, cognates, and easily confused words are [0.7, 0.2, 0.1], and the prior probabilities of hiragana and katakana within pronunciation spelling are [0.8, 0.2], then the prior probability of pronunciation spelling-hiragana is 0.7*0.8=0.56.)
②将候选译文对应的第二翻译文本输入分词模型,计算词语切分和词法、句法结构等标注信息与第一翻译文本的相似度,并筛选相似度较高的第二翻译文本对应的候选译文。② Input the second translation text corresponding to the candidate translation into the word segmentation model, calculate the similarity between its annotation information (word segmentation, lexical and syntactic structure, etc.) and that of the first translation text, and select the candidate translations corresponding to the second translation texts with higher similarity.
③将第二翻译文本与第一翻译文本通过语言模型计算困惑度,筛选出困惑度较第一翻译文本困惑度降低,且困惑度差值超过第二预设阈值的第一翻译文本对应的候选译文。③ Calculate the perplexity of the second translation text and of the first translation text through the language model, and select the candidate translation corresponding to the first translation text whose perplexity is lower than that of the first translation text and whose perplexity difference exceeds the second preset threshold.
④将候选译文输入分词模型,计算词语切分和词法、句法结构等标注信息与第一翻译文本对应的候选译文的相似度,并筛选相似度较高的候选译文。④ Input the candidate translation into the word segmentation model, calculate the similarity between its annotation information (word segmentation, lexical and syntactic structure, etc.) and that of the candidate translation corresponding to the first translation text, and select the candidate translations with higher similarity.
⑤计算所有候选译文之间的字符串相似度,筛选出相似度较高的候选译文。⑤ Calculate the string similarity between all candidate translations, and select the candidate translations with higher similarity.
⑥计算所有候选译文中的扩展词语对应的译文之间的相似度。⑥ Calculate the similarity between the translations corresponding to the extended words in all candidate translations.
需要说明的是,评价指标④可以由第一翻译文本对应的候选译文的其他评价指标决定,若待翻译文对应候选译文的流畅度与翻译质量较差,指标④对应的权值也会相应的降低。It should be noted that evaluation index ④ may be determined by other evaluation indexes of the candidate translation corresponding to the first translation text. If the fluency and translation quality of the candidate translation corresponding to the text to be translated are poor, the weight corresponding to index ④ will be reduced accordingly.
进一步地,由于通过增强的翻译模型,不同的第二翻译文本可以得到相同的候选译文,则该候选译文的相关度可以由该不同的第二翻译文本对应的候选译文的评价指标加权得到。Furthermore, since different second translation texts can obtain the same candidate translation through the enhanced translation model, the relevance of the candidate translation can be obtained by weighting the evaluation indicators of the candidate translations corresponding to the different second translation texts.
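The prior-probability example in indicator ① and the weighted combination of indicators can be sketched as follows. The table layout, the function names `label_prior` and `weighted_relevance`, and the normalisation by the sum of weights are assumptions; only the probability values come from the example above.

```python
# Prior probabilities from the worked example in indicator 1.
TYPE_PRIOR = {"pronunciation spelling": 0.7, "cognate": 0.2, "easily confused": 0.1}
SUBTYPE_PRIOR = {"pronunciation spelling": {"hiragana": 0.8, "katakana": 0.2}}

def label_prior(ext_type, subtype):
    """Prior of a combined label = type prior * subtype prior within that type."""
    return TYPE_PRIOR[ext_type] * SUBTYPE_PRIOR[ext_type][subtype]

def weighted_relevance(indicator_scores, weights):
    """Weighted average of evaluation indicator scores (normalisation is assumed)."""
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, indicator_scores)) / total

hiragana_prior = label_prior("pronunciation spelling", "hiragana")  # about 0.56
relevance = weighted_relevance([1.0, 0.0], [3.0, 1.0])
```

Lowering the weight of indicator ④ when the baseline candidate is of poor quality, as described above, would simply mean passing a smaller value in the corresponding position of `weights`.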
可选地,在上述步骤207之前,本申请实施例提供的样本构建方法还可以通过用于计算候选译文的相关度的评价指标①~③,来对至少一个第二翻译文本进行筛选,筛选出M个第二翻译文本中的X个第二翻译文本,以提高实际翻译时的效率,降低样本构建装置的功耗。 Optionally, before the above step 207, the sample construction method provided in the embodiment of the present application can also screen at least one second translation text through evaluation indicators ①~③ for calculating the relevance of candidate translations, and screen out X second translation texts from M second translation texts, so as to improve the efficiency of actual translation and reduce the power consumption of the sample construction device.
本申请实施例提供的样本构建方法中,一方面,由于本申请可以将第一翻译文本中的不符合规范词还原为至少一个符合规范词,生成至少一个第二翻译文本,因此可以使得将包含不符合规范词的第一翻译文本还原为规范的第一翻译文本,避免由于不符合规范词的存在而导致的翻译错误;另一方面,由于本申请在将第一翻译文本输入翻译模型进行翻译时,可以规范的部分或全部第二翻译文本和原始的第一翻译文本同时输入翻译模型,从而能够输出准确度更高的译文作为翻译结果。如此,本申请实施例提供的样本构建方法可以提高翻译模型翻译的准确性。In the sample construction method provided by the embodiment of the present application, on the one hand, since the non-standard words in the first translation text can be restored to at least one standard word to generate at least one second translation text, a first translation text containing non-standard words can be restored to a standard form, avoiding translation errors caused by the presence of non-standard words. On the other hand, when the first translation text is input into the translation model for translation, some or all of the normalized second translation texts can be input into the translation model together with the original first translation text, so that a more accurate translation can be output as the translation result. In this way, the sample construction method provided by the embodiment of the present application can improve the translation accuracy of the translation model.
可选地,在上述步骤206之前,本申请实施例提供的样本构建方法还可以包括下述的步骤208。Optionally, before the above step 206, the sample construction method provided in the embodiment of the present application may further include the following step 208.
步骤208、将第一翻译文本输入第一分词模型后,对第一翻译文本进行分词,得到K个分词,并对K个分词中的每个分词进行不符合规范词识别,得到每个分词对应的识别结果。Step 208: After the first translation text is input into the first word segmentation model, the first translation text is segmented to obtain K word segments, and non-standard word recognition is performed on each of the K word segments to obtain a recognition result corresponding to each word segment.
其中,一个分词对应的识别结果用于表征一个分词是否属于不符合规范词,在一个分词属于不符合规范词的情况下,一个分词对应的识别结果包括一个分词所对应的规范类型。The recognition result corresponding to a segmented word is used to indicate whether the segmented word is a non-standard word. When a segmented word is a non-standard word, the recognition result corresponding to the segmented word includes the standard type corresponding to the segmented word.
本申请实施例中,第一分词模型是基于目标训练样本集训练得到的,K为大于1的整数。In the embodiment of the present application, the first word segmentation model is obtained by training based on the target training sample set, and K is an integer greater than 1.
示例性地,样本构建装置还可以使用上述方法2中训练得到的分词模型,对增强文本中的不符合规范词进行标签预测,并在翻译模型的训练中引入相应的规范类型标签向量,使模型学习扩展词语与未登录词对应的第一关键词间的语义关系,增强对扩展词语的预测和翻译能力。Exemplarily, the sample construction device can also use the word segmentation model trained in the above method 2 to predict labels for non-standard words in the enhanced text, and introduce corresponding standard type label vectors in the training of the translation model, so that the model learns the semantic relationship between the extended words and the first keyword corresponding to the unregistered words, thereby enhancing the prediction and translation capabilities of the extended words.
可以理解,规范类型标签向量与词向量具有相同的维度,将增强句子中的扩展词语向量与分词模型预测的规范类型标签对应的向量相加,得到该扩展词语最终的表示向量。It can be understood that the canonical type label vector has the same dimension as the word vector. The extended word vector in the enhanced sentence is added to the vector corresponding to the canonical type label predicted by the word segmentation model to obtain the final representation vector of the extended word.
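The vector addition described above can be sketched in a few lines. The function name `represent` and the two-dimensional example vectors are hypothetical; real embeddings would have many more dimensions.

```python
def represent(word_vec, label_vec):
    """Final representation = extended word vector + standard type label vector."""
    if len(word_vec) != len(label_vec):
        raise ValueError("label vector must have the same dimension as the word vector")
    return [w + l for w, l in zip(word_vec, label_vec)]

# Hypothetical 2-dimensional vectors for an extended word and its predicted label.
final_vec = represent([0.5, -1.0], [0.25, 1.0])
```

Because the two vectors share a dimension, the element-wise sum keeps the input size of the translation model unchanged.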
具体的,以日语为例,如表3所示,扩展词有读音拼写、同源词、易混淆词等类型,每个类型下还有平假名、片假名、简中、繁中、重组等细分类型,故词语规范类型标签类型形成有多种组合。在训练过程中,可以对各个规范类型标签向量进行随机初始化,也可以使用标签各组成词条(如:读音拼写、同源词、平假名、片假名等)对应词向量的加权平均作为初始向量,并通过模型训练对规范类型标签向量进行迭代优化。Specifically, taking Japanese as an example, as shown in Table 3, the extended words include pronunciation spelling, cognates, easily confused words, etc., and each type has subdivisions such as hiragana, katakana, simplified Chinese, traditional Chinese, and reorganization, so there are multiple combinations of standard word type label types. During the training process, each standard type label vector can be randomly initialized, or the weighted average of the corresponding word vectors of each component of the label (such as pronunciation spelling, cognates, hiragana, katakana, etc.) can be used as the initial vector, and the standard type label vector is iteratively optimized through model training.
需要说明的是,分词模型对扩展词语的标签预测,可能与该词语替换的真实扩展形式的标签不同,但对于这些规范类型标签预测错误增强句子并未进行纠正,而是按一定比例予以保留,从而增强翻译模型的鲁棒性,使模型能够学习在输入错误的扩展词语标签时,仍然可以输出正确译文的能力。It should be noted that the label prediction of the word segmentation model for the expanded word may be different from the label of the actual expanded form of the word. However, the enhanced sentences with incorrect label predictions of these standard types are not corrected, but retained at a certain ratio, thereby enhancing the robustness of the translation model and enabling the model to learn to output correct translations when incorrect expanded word labels are input.
可选地,样本构建装置可以使用平行语料训练样本与目标训练样本,对基础的翻译模型进行增强训练。Optionally, the sample construction device may use the parallel corpus training samples and the target training samples to perform enhanced training on the basic translation model.
可以理解,每个原始文本和与其对应的所有扩展文本对应的输出译文均相同,从而使得翻译模型增强对不规范表达的翻译鲁棒性。It can be understood that the output translations corresponding to each original text and all the corresponding extended texts are the same, so that the translation model enhances the translation robustness to non-standard expressions.
示例性地,增强的翻译模型可以生成包含扩展词的词向量表和规范类型标签向量表。Exemplarily, the enhanced translation model may generate a word vector table containing extended words and a canonical type label vector table.
本申请实施例中,在样本构建装置将至少一个第二翻译文本中的N个第二翻译文本和第一翻译文本输入翻译模型之后,可以先通过增强的分词模型,对输入的文本进行切分,识别不符合规范词,并对不符合规范词语的扩展形式进行预测。然后,通过查询向量表,将输入的文本对应的符合规范词向量、扩展词向量和规范类型标签向量输入模型,如图4所示,得到生成的译文。In the embodiment of the present application, after the sample construction device inputs the N second translation texts and the first translation text in at least one second translation text into the translation model, the input text can be segmented by the enhanced word segmentation model, the non-standard words can be identified, and the extended forms of the non-standard words can be predicted. Then, by querying the vector table, the standard word vector, the extended word vector and the standard type label vector corresponding to the input text are input into the model, as shown in FIG4, to obtain the generated translation.
如此,一方面,在数据层面上,样本构建装置可以基于扩展词表构造对抗训练数据对翻译模型进行增强训练;另一方面,在模型层面上,样本构建装置在输入编码层融入规范类型标签向量,让模型在训练时学习不符合规范词的扩展形式和不符合规范词与其对应的符合规范词在文本中的语义相关性。如此,可以提升翻译模型对包含不符合规范词的第一翻译文本的翻译鲁棒性和翻译质量,从而使得无论第一翻译文本中是否包含不符合规范词时,翻译模型都可以输出正确的译文。Thus, on the one hand, at the data level, the sample construction device can construct adversarial training data based on the extended vocabulary to enhance the training of the translation model; on the other hand, at the model level, the sample construction device incorporates the standard type label vector in the input encoding layer, so that the model can learn the extended form of non-standard words and the semantic relevance of non-standard words and their corresponding standard words in the text during training. Thus, the translation model can improve the translation robustness and translation quality of the first translation text containing non-standard words, so that the translation model can output the correct translation regardless of whether the first translation text contains non-standard words.
Optionally, after step 201, the sample construction method provided in this embodiment may further include steps 301 and 302 below, and step 202 may be implemented by step a below.
Step 301: Display each standard word corresponding to the first non-standard word.
The first non-standard word may be one or more of the at least one non-standard word.
For example, the sample construction device may display the standard words corresponding to the first non-standard word in descending order of relevance.
For example, the sample construction device may calculate the relevance of each standard word corresponding to the first non-standard word using formula (2) below.
$$S(W) = \alpha S_1(W) + \beta S_2(W) + \gamma S_3(W) \qquad \text{(Formula 2)}$$
Here, $S_1(W)$ is the prior probability corresponding to the non-standard word in the extended vocabulary; $S_2(W)$ is the lexical similarity between the non-standard word and the standard word it is restored to; $S_3(W)$ is the restoration term for the non-standard word; and $\alpha$, $\beta$, and $\gamma$ are adjustable weight coefficients.
For example, if only part of speech is considered for $S_2(W)$, then $S_2(W)$ may be 1 when the non-standard word and the standard word it is restored to have the same part of speech, and 0 when they differ.
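Formula (2) and the descending-order display can be sketched as follows. The default weights and the candidate tuple layout are assumptions; the source does not spell out how $S_3(W)$ is computed, so it is passed in as a given value.

```python
def pos_similarity(pos_nonstandard, pos_standard):
    """S2 when only part of speech is considered: 1 if equal, else 0."""
    return 1.0 if pos_nonstandard == pos_standard else 0.0


def relevance(s1, s2, s3, alpha=0.5, beta=0.3, gamma=0.2):
    """Formula (2): S(W) = alpha*S1(W) + beta*S2(W) + gamma*S3(W).

    The default weights are illustrative assumptions, not values from the source.
    """
    return alpha * s1 + beta * s2 + gamma * s3


def rank_candidates(candidates, alpha=0.5, beta=0.3, gamma=0.2):
    """Sort (word, s1, s2, s3) tuples by relevance, highest first."""
    return sorted(
        candidates,
        key=lambda c: relevance(c[1], c[2], c[3], alpha, beta, gamma),
        reverse=True,
    )
```

For instance, `rank_candidates([("良心", 0.4, 1.0, 0.5), ("両親", 0.6, 1.0, 0.5)])` places "両親" first because its prior is higher while the other terms are equal.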
Step 302: Receive a first input on a target standard word among the displayed standard words.
For example, the target standard word is one or more of the displayed standard words.
In one example, the target standard words may be standard words corresponding to the same non-standard word.
In one example, the target standard words may include standard words corresponding to several different non-standard words.
In one example, when the target standard words include standard words corresponding to multiple non-standard words, the sample construction device performs restoration for every standard word selected by the user.
For example, the target standard word may be the standard word that the user selects to replace a non-standard word.
For example, the first input is used to select, from the displayed standard words, the standard word to restore to.
For example, the first input may be a touch input, a specific voice input, or a specific gesture input by the user on the target standard word, which is not limited in this embodiment of the present application.
For example, the first input may be a click input by the user on the target standard word.
Step a: In response to the first input, restore the first non-standard word in the first translation text to the target standard word to generate at least one second translation text.
For example, if the first translation text contains non-standard words that the user has not restored manually, the electronic device may restore them according to the related steps described above to generate at least one second translation text.
For example, as shown in (a) of FIG. 5, the sample construction device may display the standard words "両親" (parents) and "良心" (conscience) corresponding to the first non-standard word "りょうしん(liangqin)". The sample construction device then receives the user's click input (that is, the first input) on the target standard word "両親" and, as shown in (b) of FIG. 5, restores the non-standard word "りょうしん(liangqin)" to the target standard word "両親", generating the second translation text "りょうしんは学校に勤める" (the parents work at a school).
In this way, because the sample construction device can display the standard words corresponding to a non-standard word and the user selects, through the first input, the target standard word to restore to, the number of second translation texts generated is correspondingly reduced, which lowers the power consumption required for translation.
Optionally, the sample construction method provided in this embodiment of the present application can build an extended vocabulary according to the linguistic features of each language, so that it can be applied to different translation languages and translation directions.
An embodiment of the present application provides a sample construction method. FIG. 6 shows a flowchart of translation performed by a translation model provided in an embodiment of the present application, where the translation model has been trained on the target training samples. As shown in FIG. 6, the sample construction method provided in this embodiment may include steps 601 to 607 below.
Step 601: Obtain the text to be translated.
Step 602: Automatically identify whether the text to be translated contains non-standard words.
Step 603: If the text to be translated contains non-standard words, sort the restoration results of the non-standard words by credibility and present them to the user.
Step 604: In response to a first input by which the user selects a restoration result for a non-standard word, restore that non-standard word to generate at least one second translation text.
Step 605: Restore the non-standard words for which the user did not select a restoration result, to generate at least one second translation text.
Step 606: Input the at least one second translation text into the first translation model for text translation to obtain at least one candidate translation.
Step 607: Determine a target translation from the at least one candidate translation, and output the target translation.
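Steps 603 to 607 can be sketched as below, assuming a pre-tokenized input. The function names, the `translate` and `score` callables, and the rule of picking the highest-scoring candidate are hypothetical; the source does not specify how the target translation is chosen among the candidates.

```python
from itertools import product


def build_second_texts(tokens, candidates, user_choice):
    """Generate second translation texts by restoring non-standard words.

    tokens      -- segmented text to be translated
    candidates  -- non-standard word -> standard words, sorted by credibility
    user_choice -- non-standard word -> standard word picked by the user
    """
    options = []
    for tok in tokens:
        if tok in user_choice:        # step 604: the user selected a restoration
            options.append([user_choice[tok]])
        elif tok in candidates:       # step 605: restore without a user choice
            options.append(candidates[tok])
        else:                         # standard word, kept unchanged
            options.append([tok])
    return ["".join(combo) for combo in product(*options)]


def pick_target(second_texts, translate, score):
    """Steps 606-607: translate every second text and keep the best candidate."""
    candidate_translations = [translate(text) for text in second_texts]
    return max(candidate_translations, key=score)
```

With `tokens = ["りょうしん", "は", "学校", "に", "勤める"]` and `candidates = {"りょうしん": ["両親", "良心"]}`, an empty `user_choice` yields two second translation texts, while a user choice of "両親" yields one, which is why user selection reduces the number of texts the model has to translate.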
The sample construction method provided in the embodiments of the present application may be executed by a sample construction device. The sample construction device provided in the embodiments is described below, taking the case where the sample construction device executes the sample construction method as an example.
FIG. 7 shows a possible schematic structural diagram of the sample construction device involved in the embodiments of the present application. As shown in FIG. 7, the sample construction device 70 may include an acquisition module 71, a processing module 72, and a construction module 73.
The acquisition module 71 is configured to acquire a parallel corpus training sample, where the parallel corpus training sample contains an original text and carries a standard type label corresponding to each keyword in the original text. The processing module 72 is configured to replace a first keyword in the original text of the parallel corpus training sample acquired by the acquisition module 71 with at least one first non-standard word corresponding to the first keyword, to generate at least one extended text. The processing module 72 is further configured to replace the first standard type label corresponding to the first keyword in the parallel corpus training sample acquired by the acquisition module 71 with a second standard type label corresponding to the first non-standard word, to obtain a label-replaced parallel corpus training sample. The construction module 73 is configured to construct a target training sample based on the label-replaced parallel corpus training sample produced by the processing module 72 and the at least one extended text.
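The cooperation of the three modules can be sketched as follows. The dictionary layout of a training sample, the pairing of each extended text with the unchanged target-side text, and the label names are assumptions for illustration; the source does not fix a concrete data format.

```python
def build_target_samples(tokens, labels, target_text, keyword, variants):
    """Build target training samples from one parallel corpus training sample.

    tokens      -- segmented original text
    labels      -- standard type label for each token
    target_text -- target-side text of the parallel pair (kept unchanged)
    keyword     -- the first keyword to replace
    variants    -- list of (non_standard_word, standard_type_label) pairs
    """
    samples = []
    for word, label in variants:
        # Extended text: the keyword is swapped for a non-standard word.
        new_tokens = [word if t == keyword else t for t in tokens]
        # Label replacement: the keyword's label becomes the variant's label.
        new_labels = [label if t == keyword else l for t, l in zip(tokens, labels)]
        samples.append({"source": new_tokens, "labels": new_labels, "target": target_text})
    return samples
```

Each variant of the keyword produces one extended text together with its replaced labels, so the resulting samples expose the model to non-standard words paired with the translation of the original sentence.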
In a possible implementation, the parallel corpus training sample is one parallel corpus training sample in a parallel corpus training sample set, and the processing module 72 is specifically configured to:
determine at least one first keyword from the original text based on the word frequency, in the parallel corpus training sample set, of each keyword in the original text, and replace each of the at least one first keyword in the original text with its corresponding first non-standard word to generate a first extended text;
where the first extended text is any one of the at least one extended text.
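Frequency-based keyword selection can be sketched as below. The source says only that the first keywords are determined from word frequency in the sample set; choosing the `top_k` most frequent keywords that have an entry in the extended vocabulary, and always taking the first listed non-standard form, are assumptions of this sketch.

```python
from collections import Counter


def select_keywords(tokens, corpus, extended_vocab, top_k=1):
    """Pick keywords to replace, ranked by their frequency in the sample set."""
    freq = Counter(tok for sentence in corpus for tok in sentence)
    # Keep first-occurrence order so that ties are resolved deterministically.
    eligible = [t for t in dict.fromkeys(tokens) if t in extended_vocab]
    eligible.sort(key=lambda t: freq[t], reverse=True)
    return eligible[:top_k]


def extend_text(tokens, keywords, extended_vocab):
    """Replace each selected keyword with its first non-standard form."""
    return [extended_vocab[t][0] if t in keywords else t for t in tokens]
```

A different rule, such as favoring rare keywords, would fit the same description; only the ranking key in `select_keywords` would change.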
In a possible implementation, the parallel corpus training sample is one parallel corpus training sample in a parallel corpus training sample set, the number of extended texts is N, and N is a positive integer;
the processing module 72 is further configured to, after replacing the first keyword in the original text with at least one first non-standard word corresponding to the first keyword to generate at least one extended text, initialize the word feature information of an unregistered word when a second extended text among the N extended texts contains an unregistered word that is not included in the parallel corpus training sample set;
where the initialization process includes one of the following:
weighting and averaging the word feature information of the unregistered word according to the word frequencies, in the parallel corpus training sample set, of the first keyword corresponding to the unregistered word and of each non-standard word corresponding to that first keyword in each of the N extended texts;
weighting and averaging the feature information of the unregistered word using the feature information of the cognate words corresponding to the unregistered word;
setting the feature information of the unregistered word to 0;
randomly initializing the feature information of the unregistered word.
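The four initialization options listed above can be sketched as follows. The strategy names, the use of related words' vectors with frequency (or other) weights, and the random range are assumptions chosen for illustration.

```python
import random


def weighted_average(vectors, weights):
    """Average the vectors component-wise, weighted by the given weights."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[i] for v, w in zip(vectors, weights)) / total for i in range(dim)]


def init_unregistered(strategy, dim, related_vectors=None, weights=None, seed=0):
    """Initialize the feature vector of an unregistered word.

    "frequency" -- average the vectors of the keyword and its known non-standard
                   forms, weighted by their word frequencies in the sample set
    "cognate"   -- average the vectors of the word's cognates
    "zeros"     -- set every component to 0
    "random"    -- draw small random values (the range is an assumption)
    """
    if strategy in ("frequency", "cognate"):
        return weighted_average(related_vectors, weights)
    if strategy == "zeros":
        return [0.0] * dim
    if strategy == "random":
        rng = random.Random(seed)
        return [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    raise ValueError(f"unknown strategy: {strategy}")
```

The two averaging options give the new word a starting point close to words it is related to, whereas zero or random initialization leaves all learning to subsequent training.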
In a possible implementation, the parallel corpus sample is one parallel corpus training sample in a parallel corpus sample set, and the device further includes a translation module;
the processing module 72 is further configured to, after the construction module 73 constructs the target training sample based on the label-replaced parallel corpus training sample and the at least one extended text, restore at least one non-standard word in the first translation text to standard words to generate M second translation texts, where one non-standard word is restored to at least one standard word;
the translation module is configured to input first feature information corresponding to the first translation text, and second feature information corresponding to X of the M second translation texts obtained by the processing module 72, into the first translation model for text translation to obtain a target translation, where the first feature information includes text feature information of the first translation text and feature information of the standard type labels corresponding to the non-standard words in the first translation text, and the second feature information includes text feature information of the second translation text and feature information of the standard type labels corresponding to the non-standard words in the second translation text;
where the first translation model is trained on a target training sample set, the target training sample set includes multiple target training samples, one target training sample corresponds to one parallel corpus training sample in the parallel corpus training sample set, M and X are positive integers, and X is less than or equal to M.
In a possible implementation, the device further includes a word segmentation module;
the word segmentation module is configured to, before the processing module 72 restores at least one non-standard word in the first translation text to a standard word to generate the M second translation texts, input the first translation text into a first word segmentation model, segment the first translation text to obtain K segments, and perform non-standard word recognition on each of the K segments to obtain a recognition result for each segment, where the recognition result for a segment indicates whether that segment is a non-standard word and, if it is, includes the standard type corresponding to that segment;
where the first word segmentation model is trained on the target training sample set, and K is an integer greater than 1.
In a possible implementation, a non-standard word involves at least one of the following: pinyin (phonetic) spelling, typos, substitution with cognate characters, and glyph errors.
An embodiment of the present application provides a sample construction device. The sample construction device can replace a keyword in the original text of a parallel corpus training sample to generate at least one extended text, which widens the vocabulary covered by the parallel corpus training sample. It also replaces the standard type label corresponding to that keyword with the standard type label corresponding to the non-standard word, obtaining a label-replaced parallel corpus training sample, which enriches the content of the parallel corpus training sample. Finally, the sample construction device can construct a target training sample based on the label-replaced parallel corpus training sample and the at least one extended text. The target training sample therefore contains non-standard words and their corresponding standard type labels, which enriches the parallel corpus training sample and gives it more, and more flexible, training content.
The sample construction device in the embodiments of the present application may be an electronic device, or a component in an electronic device, such as an integrated circuit or a chip. The electronic device may be a terminal or a device other than a terminal. For example, the electronic device may be a mobile phone, a tablet computer, a laptop computer, a palmtop computer, an in-vehicle electronic device, a mobile Internet device (MID), an augmented reality (AR)/virtual reality (VR) device, a robot, a wearable device, an ultra-mobile personal computer (UMPC), a netbook, or a personal digital assistant (PDA); it may also be a server, a network attached storage (NAS) device, a personal computer (PC), a television (TV), a teller machine, or a self-service machine. The embodiments of the present application impose no specific limitation.
The sample construction device in the embodiments of the present application may be a device with an operating system. The operating system may be the Android operating system, the iOS operating system, or another possible operating system; the embodiments of the present application impose no specific limitation.
The sample construction device provided in the embodiments of the present application can implement each process implemented by the method embodiments of FIG. 2 to FIG. 6 and achieve the same technical effects. To avoid repetition, details are not described here again.
Optionally, as shown in FIG. 8, an embodiment of the present application further provides an electronic device 800, including a processor 801 and a memory 802. The memory 802 stores a program or instructions executable on the processor 801. When executed by the processor 801, the program or instructions implement the steps of the sample construction method embodiments above and achieve the same technical effects. To avoid repetition, details are not described here again.
It should be noted that the electronic devices in the embodiments of the present application include the mobile electronic devices and non-mobile electronic devices described above.
FIG. 9 is a schematic diagram of the hardware structure of an electronic device implementing an embodiment of the present application.
The electronic device 900 includes, but is not limited to, components such as a radio frequency unit 901, a network module 902, an audio output unit 903, an input unit 904, a sensor 905, a display unit 906, a user input unit 907, an interface unit 908, a memory 909, and a processor 910.
Those skilled in the art will understand that the electronic device 900 may further include a power supply (such as a battery) that powers the components. The power supply may be logically connected to the processor 910 through a power management system, so that charging, discharging, and power consumption are managed through the power management system. The structure shown in FIG. 9 does not limit the electronic device; the electronic device may include more or fewer components than shown, combine certain components, or arrange the components differently. Details are not described here again.
The processor 910 is configured to: acquire a parallel corpus training sample, where the parallel corpus training sample contains an original text and carries a standard type label corresponding to each keyword in the original text; replace a first keyword in the original text of the acquired parallel corpus training sample with at least one first non-standard word corresponding to the first keyword, to generate at least one extended text; replace the first standard type label corresponding to the first keyword in the acquired parallel corpus training sample with a second standard type label corresponding to the first non-standard word, to obtain a label-replaced parallel corpus training sample; and construct a target training sample based on the label-replaced parallel corpus training sample and the at least one extended text.
Optionally, the parallel corpus training sample is one parallel corpus training sample in a parallel corpus training sample set, and the processor 910 is specifically configured to:
determine at least one first keyword from the original text based on the word frequency, in the parallel corpus training sample set, of each keyword in the original text, and replace each of the at least one first keyword in the original text with its corresponding first non-standard word to generate a first extended text;
where the first extended text is any one of the at least one extended text.
Optionally, the parallel corpus training sample is one parallel corpus training sample in a parallel corpus training sample set, the number of extended texts is N, and N is a positive integer;
the processor 910 is further configured to, after replacing the first keyword in the original text with at least one first non-standard word corresponding to the first keyword to generate at least one extended text, initialize the word feature information of an unregistered word when a second extended text among the N extended texts contains an unregistered word that is not included in the parallel corpus training sample set;
where the initialization process includes one of the following:
weighting and averaging the word feature information of the unregistered word according to the word frequencies, in the parallel corpus training sample set, of the first keyword corresponding to the unregistered word and of each non-standard word corresponding to that first keyword in each of the N extended texts;
weighting and averaging the feature information of the unregistered word using the feature information of the cognate words corresponding to the unregistered word;
setting the feature information of the unregistered word to 0;
randomly initializing the feature information of the unregistered word.
Optionally, the parallel corpus sample is one parallel corpus training sample in a parallel corpus sample set;
the processor 910 is further configured to, after constructing the target training sample based on the label-replaced parallel corpus training sample and the at least one extended text, restore at least one non-standard word in the first translation text to standard words to generate M second translation texts, where one non-standard word is restored to at least one standard word;
the processor 910 is further configured to input first feature information corresponding to the first translation text, and second feature information corresponding to X of the M second translation texts obtained by the processor 910, into the first translation model for text translation to obtain a target translation, where the first feature information includes text feature information of the first translation text and feature information of the standard type labels corresponding to the non-standard words in the first translation text, and the second feature information includes text feature information of the second translation text and feature information of the standard type labels corresponding to the non-standard words in the second translation text;
where the first translation model is trained on a target training sample set, the target training sample set includes multiple target training samples, one target training sample corresponds to one parallel corpus training sample in the parallel corpus training sample set, M and X are positive integers, and X is less than or equal to M.
Optionally, the processor 910 is configured to, before restoring at least one non-standard word in the first translation text to a standard word to generate the M second translation texts, input the first translation text into a first word segmentation model, segment the first translation text to obtain K segments, and perform non-standard word recognition on each of the K segments to obtain a recognition result for each segment, where the recognition result for a segment indicates whether that segment is a non-standard word and, if it is, includes the standard type corresponding to that segment;
where the first word segmentation model is trained on the target training sample set, and K is an integer greater than 1.
Optionally, a non-standard word involves at least one of the following: pinyin (phonetic) spelling, typos, substitution with cognate characters, and glyph errors.
An embodiment of the present application provides an electronic device. The electronic device can replace a keyword in the original text of a parallel corpus training sample with at least one non-standard word corresponding to the keyword to generate at least one extended text, and replace the standard type label corresponding to the keyword with the standard type label corresponding to the non-standard word to obtain a label-replaced parallel corpus training sample. The electronic device can then construct a target training sample based on the label-replaced parallel corpus training sample and the at least one extended text. The target training sample therefore contains non-standard words and their corresponding standard type labels, so that a translation model trained on the target training samples can translate non-standard words, which improves the accuracy of the translation model.
It should be understood that, in the embodiments of the present application, the input unit 904 may include a graphics processing unit (GPU) 9041 and a microphone 9042. The graphics processing unit 9041 processes image data of still pictures or video obtained by an image capture device (such as a camera) in video capture mode or image capture mode. The display unit 906 may include a display panel 9061, which may be configured as a liquid crystal display, an organic light-emitting diode display, or the like. The user input unit 907 includes at least one of a touch panel 9071 and other input devices 9072. The touch panel 9071, also called a touch screen, may include two parts: a touch detection device and a touch controller. The other input devices 9072 may include, but are not limited to, a physical keyboard, function keys (such as volume control keys and a power key), a trackball, a mouse, and a joystick. Details are not described here again.
The memory 909 may be used to store software programs and various data. The memory 909 may mainly include a first storage area for storing programs or instructions and a second storage area for storing data, where the first storage area may store an operating system and the application programs or instructions required for at least one function (such as a sound playback function or an image playback function). In addition, the memory 909 may include volatile memory or non-volatile memory, or both. The non-volatile memory may be read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), or flash memory. The volatile memory may be random access memory (RAM), static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), or direct Rambus RAM (DRRAM). The memory 909 in the embodiments of the present application includes, but is not limited to, these and any other suitable types of memory.
处理器910可包括一个或多个处理单元;可选的,处理器910集成应用处理器和调制解调处理器,其中,应用处理器主要处理涉及操作系统、用户界面和应用程序等的操作,调制解调处理器主要处理无线通信信号,如基带处理器。可以理解的是,上述调制解调处理器也可以不集成到处理器910中。The processor 910 may include one or more processing units; optionally, the processor 910 integrates an application processor and a modem processor, wherein the application processor mainly processes operations related to an operating system, a user interface, and application programs, and the modem processor mainly processes wireless communication signals, such as a baseband processor. It is understandable that the modem processor may not be integrated into the processor 910.
本申请实施例还提供一种可读存储介质,所述可读存储介质上存储有程序或指令,该程序或指令被处理器执行时实现上述样本构建方法实施例的各个过程,且能达到相同的技术效果,为避免重复,这里不再赘述。An embodiment of the present application also provides a readable storage medium, on which a program or instruction is stored. When the program or instruction is executed by a processor, each process of the above-mentioned sample construction method embodiment is implemented, and the same technical effect can be achieved. To avoid repetition, it will not be repeated here.
其中,所述处理器为上述实施例中所述的电子设备中的处理器。所述可读存储介质,包括计算机可读存储介质,如计算机只读存储器ROM、随机存取存储器RAM、磁碟或者光盘等。The processor is the processor in the electronic device described in the above embodiment. The readable storage medium includes a computer readable storage medium, such as a computer read-only memory ROM, a random access memory RAM, a magnetic disk or an optical disk.
本申请实施例另提供了一种芯片,所述芯片包括处理器和通信接口,所述通信接口和所述处理器耦合,所述处理器用于运行程序或指令,实现上述样本构建方法实施例的各个过程,且能达到相同的技术效果,为避免重复,这里不再赘述。An embodiment of the present application further provides a chip, which includes a processor and a communication interface, wherein the communication interface is coupled to the processor, and the processor is used to run programs or instructions to implement the various processes of the above-mentioned sample construction method embodiment, and can achieve the same technical effect. To avoid repetition, it will not be repeated here.
应理解,本申请实施例提到的芯片还可以称为系统级芯片、系统芯片、芯片系统或片上系统芯片等。It should be understood that the chip mentioned in the embodiments of the present application can also be called a system-level chip, a system chip, a chip system or a system-on-chip chip, etc.
本申请实施例提供一种计算机程序产品,该程序产品被存储在存储介质中,该程序产品被至少一个处理器执行以实现如上述样本构建方法实施例的各个过程,且能达到相同的技术效果,为避免重复,这里不再赘述。An embodiment of the present application provides a computer program product, which is stored in a storage medium. The program product is executed by at least one processor to implement the various processes of the sample construction method embodiment described above, and can achieve the same technical effect. To avoid repetition, it will not be described here.
It should be noted that, in this document, the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes that element. In addition, it should be noted that the scope of the methods and devices in the embodiments of the present application is not limited to performing functions in the order shown or discussed; depending on the functions involved, the functions may also be performed substantially simultaneously or in the reverse order. For example, the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. Furthermore, features described with reference to certain examples may be combined in other examples.
通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到上述实施例方法可借助软件加必需的通用硬件平台的方式来实现,当然也可以通过硬件,但很多情况下前者是更佳的实施方式。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以计算机软件产品的形式体现出来,该计算机软件产品存储在一个存储介质(如ROM/RAM、磁碟、光盘)中,包括若干指令用以使得一台终端(可以是手机,计算机,服务器,或者网络设备等)执行本申请各个实施例所述的方法。Through the description of the above implementation methods, those skilled in the art can clearly understand that the above-mentioned embodiment methods can be implemented by means of software plus a necessary general hardware platform, and of course by hardware, but in many cases the former is a better implementation method. Based on such an understanding, the technical solution of the present application, or the part that contributes to the prior art, can be embodied in the form of a computer software product, which is stored in a storage medium (such as ROM/RAM, a disk, or an optical disk), and includes a number of instructions for a terminal (which can be a mobile phone, a computer, a server, or a network device, etc.) to execute the methods described in each embodiment of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the specific implementations described above, which are merely illustrative and not restrictive. Guided by the present application, those of ordinary skill in the art may devise many other forms without departing from the purpose of the present application and the scope protected by the claims, and all such forms fall within the protection of the present application.

Claims (17)

  1. 一种样本构建方法,所述方法包括:A sample construction method, the method comprising:
    获取平行语料训练样本,所述平行语料训练样本包含原始文本并携带所述原始文本中的每个关键词所对应的规范类型标签;Obtaining a parallel corpus training sample, wherein the parallel corpus training sample includes an original text and carries a standard type label corresponding to each keyword in the original text;
    将所述原始文本中的第一关键词替换为所述第一关键词对应的至少一个第一不符合规范词,以生成至少一个扩展文本;Replacing a first keyword in the original text with at least one first non-standard word corresponding to the first keyword to generate at least one extended text;
    将所述第一关键词所对应的第一规范类型标签替换为所述第一不符合规范词所对应的第二规范类型标签,获得替换标签后的所述平行语料训练样本;Replacing the first standard type label corresponding to the first keyword with the second standard type label corresponding to the first non-standard word, to obtain the parallel corpus training sample after the label is replaced;
    基于替换标签后的所述平行语料训练样本与所述至少一个扩展文本,构建目标训练样本。A target training sample is constructed based on the parallel corpus training sample after replacing the label and the at least one extended text.
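The steps of claim 1 can be illustrated with a minimal, non-limiting Python sketch. The names `ParallelSample`, `build_target_samples`, and the label strings ("standard", "phonetic", "typo") are hypothetical and do not appear in the application. The sketch also assumes that a target training sample pairs each extended text with the unchanged reference translation and the relabeled keyword map, which is one possible reading of the claim.

```python
from dataclasses import dataclass


@dataclass
class ParallelSample:
    source: str   # original text (source side of the parallel pair)
    target: str   # reference translation (target side)
    labels: dict  # keyword -> standard-type label


def build_target_samples(sample, keyword, variants):
    """Return one target sample per (non_standard_word, label) pair in `variants`."""
    results = []
    for word, label in variants:
        # Replace the first keyword with a non-standard word to form an extended text.
        # Note: str.replace substitutes every occurrence of the keyword.
        extended = sample.source.replace(keyword, word)
        # Swap the keyword's label for the label of the non-standard word.
        labels = {k: v for k, v in sample.labels.items() if k != keyword}
        labels[word] = label
        results.append(ParallelSample(extended, sample.target, labels))
    return results


sample = ParallelSample("see you tomorrow", "明天见", {"tomorrow": "standard"})
variants = [("tmrw", "phonetic"), ("tommorow", "typo")]
for s in build_target_samples(sample, "tomorrow", variants):
    print(s.source, s.labels)
```

The original sample is left unmodified, so the same parallel pair can be expanded repeatedly with different keywords.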
  2. 根据权利要求1所述的方法,其中,所述平行语料训练样本为平行语料训练样本集中的一个平行语料训练样本;The method according to claim 1, wherein the parallel corpus training sample is a parallel corpus training sample in a parallel corpus training sample set;
    所述将所述原始文本中的第一关键词替换为所述第一关键词对应的至少一个第一不符合规范词,以生成至少一个扩展文本,包括:The step of replacing the first keyword in the original text with at least one first non-standard word corresponding to the first keyword to generate at least one extended text includes:
    基于所述原始文本中的每个关键词在所述平行语料训练样本集中的词频,从所述原始文本中确定至少一个第一关键词,将所述原始文本中的所述至少一个第一关键词中的每个第一关键词替换为各自对应的第一不符合规范词,以生成第一扩展文本;Based on the word frequency of each keyword in the original text in the parallel corpus training sample set, at least one first keyword is determined from the original text, and each first keyword in the at least one first keyword in the original text is replaced with a first non-standard word corresponding to each first keyword to generate a first extended text;
    其中,所述第一扩展文本为所述至少一个扩展文本中的任一扩展文本。The first extended text is any extended text among the at least one extended text.
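The frequency-based selection in claim 2 might be sketched as follows. The claim does not state how word frequency determines the first keywords, so picking the `k` most frequent keywords is purely an assumption made for illustration; `select_first_keywords` and its parameters are hypothetical names.

```python
from collections import Counter


def select_first_keywords(keywords, corpus_counts, k=1):
    """Pick the k keywords with the highest frequency in the sample set (assumed rule)."""
    # Sort by descending frequency; ties are broken alphabetically for determinism.
    ranked = sorted(set(keywords), key=lambda w: (-corpus_counts.get(w, 0), w))
    return ranked[:k]


# Toy "sample set": word counts over all source texts.
corpus_counts = Counter("see you tomorrow see you soon see".split())
print(select_first_keywords(["you", "tomorrow", "see"], corpus_counts, k=2))
```

A different rule (for example, sampling in proportion to frequency) would fit the claim wording equally well.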
  3. 根据权利要求1所述的方法,其中,所述平行语料训练样本为平行语料训练样本集中的一个平行语料训练样本,所述扩展文本的数量为N,N为正整数;The method according to claim 1, wherein the parallel corpus training sample is a parallel corpus training sample in a parallel corpus training sample set, and the number of the extended texts is N, where N is a positive integer;
    所述将所述原始文本中的第一关键词替换为所述第一关键词对应的至少一个第一不符合规范词,以生成至少一个扩展文本之后,所述方法还包括:After replacing the first keyword in the original text with at least one first non-standard word corresponding to the first keyword to generate at least one extended text, the method further includes:
    在N个扩展文本中的第二扩展文本中包含未收录在所述平行语料训练样本集中的未登录词的情况下,对所述未登录词的特征信息进行初始化;When a second extended text among the N extended texts contains an unregistered word that is not included in the parallel corpus training sample set, initializing feature information of the unregistered word;
    其中,所述初始化的过程包括以下至少之一:The initialization process includes at least one of the following:
    按照所述未登录词对应的第一关键词,和所述N个扩展文本中的每个扩展文本中所述未登录词对应的第一关键词所对应的每个不符合规范词在所述平行语料训练样本集中的词频,对所述未登录词的特征信息进行加权平均;According to the first keyword corresponding to the unregistered word and the word frequency of each non-standard word corresponding to the first keyword corresponding to the unregistered word in each of the N extended texts in the parallel corpus training sample set, weighted averaging is performed on the feature information of the unregistered word;
    使用所述未登录词对应的同源词的特征信息,对所述未登录词的特征信息进行加权平均;Using the feature information of the cognate words corresponding to the unregistered words, weighted averaging the feature information of the unregistered words;
    将所述未登录词的特征信息置为0;Setting the feature information of the unregistered word to 0;
    将所述未登录词的特征信息随机初始化。 The feature information of the unregistered word is randomly initialized.
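The four initialization options listed in claim 3 can be sketched in plain Python, treating "feature information" as an embedding vector, which is an assumption. The function names, the strategy strings, and the random range of ±0.1 are hypothetical. For the first two options, the sketch assumes the unregistered word's vector is set to a weighted average of the vectors of related words (the keyword and its known non-standard variants weighted by corpus frequency, or the cognate words).

```python
import random


def weighted_average(vectors, weights):
    """Average equal-length vectors, weighting each by the matching entry in `weights`."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    dim = len(vectors[0])
    return [sum(w * v[i] for v, w in zip(vectors, weights)) / total for i in range(dim)]


def init_oov(dim, strategy, vectors=None, weights=None, seed=0):
    """Initialize the feature vector of an unregistered (out-of-vocabulary) word."""
    if strategy == "zero":
        return [0.0] * dim
    if strategy == "random":
        rng = random.Random(seed)
        return [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    if strategy == "weighted":
        # `vectors` are embeddings of related words; `weights` are, for example,
        # their frequencies in the parallel corpus training sample set.
        return weighted_average(vectors, weights)
    raise ValueError(f"unknown strategy: {strategy}")


# Related words seen 3 times and 1 time in the sample set.
print(init_oov(2, "weighted", vectors=[[1.0, 0.0], [0.0, 1.0]], weights=[3, 1]))
```

Frequency weighting pulls the new vector toward the variants the model has seen most often, while the zero and random options need no related words at all.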
  4. 根据权利要求1所述的方法,其中,所述平行语料训练样本为平行语料训练样本集中的一个平行语料训练样本;The method according to claim 1, wherein the parallel corpus training sample is a parallel corpus training sample in a parallel corpus training sample set;
    所述基于替换标签后的所述平行语料训练样本与所述至少一个扩展文本,构建目标训练样本之后,所述方法还包括:After constructing a target training sample based on the parallel corpus training sample after replacing the label and the at least one extended text, the method further includes:
    Restoring at least one non-standard word in a first translation text to a standard word to generate M second translation texts, wherein one non-standard word is restored to at least one standard word;
    将所述第一翻译文本对应的第一特征信息和所述M个第二翻译文本中的X个第二翻译文本对应的第二特征信息输入第一翻译模型进行文本翻译,以得到目标译文,所述第一特征信息包括所述第一翻译文本的文本特征信息和所述第一翻译文本中的不符合规范词所对应的规范类型标签的特征信息,所述第二特征信息包括所述第二翻译文本的文本特征信息和所述第二翻译文本中的不符合规范词所对应的规范类型标签的特征信息;Inputting first feature information corresponding to the first translated text and second feature information corresponding to X second translated texts among the M second translated texts into a first translation model for text translation to obtain a target translated text, wherein the first feature information includes text feature information of the first translated text and feature information of standard type labels corresponding to non-standard words in the first translated text, and the second feature information includes text feature information of the second translated text and feature information of standard type labels corresponding to non-standard words in the second translated text;
    其中,所述第一翻译模型是基于目标训练样本集训练得到的,所述目标训练样本集包括多个所述目标训练样本,一个所述目标训练样本对应所述平行语料训练样本集中的一个平行语料训练样本,M、X为正整数,且X小于或等于M。Among them, the first translation model is obtained by training based on a target training sample set, the target training sample set includes multiple target training samples, one target training sample corresponds to a parallel corpus training sample in the parallel corpus training sample set, M and X are positive integers, and X is less than or equal to M.
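The restoration step of claim 4 can be illustrated with a short sketch that enumerates candidate restored texts. It assumes the first translation text is already segmented into tokens and that a lookup table maps each recognized non-standard word to one or more standard words; `restore_candidates`, `restorations`, and `max_candidates` are hypothetical names. The sketch restores every recognized non-standard word in each candidate, which is only one way to satisfy "at least one", and it takes the first X candidates as the X second translation texts.

```python
from itertools import product


def restore_candidates(tokens, restorations, max_candidates=None):
    """Generate restored token sequences (the "second translation texts")."""
    # Each token contributes its standard-word options, or itself if it is standard.
    options = [restorations.get(t, [t]) for t in tokens]
    candidates = []
    for combo in product(*options):
        if list(combo) != list(tokens):  # skip the unrestored input itself
            candidates.append(list(combo))
    if max_candidates is None:
        return candidates
    return candidates[:max_candidates]  # keep X of the M candidates


tokens = ["see", "u", "tmrw"]
restorations = {"u": ["you"], "tmrw": ["tomorrow", "tomorrow's"]}
for candidate in restore_candidates(tokens, restorations):
    print(" ".join(candidate))
```

Here M equals the product of the number of standard-word options per non-standard word, so a cap such as `max_candidates` keeps the input to the translation model bounded.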
  5. 根据权利要求4所述的方法,其中,所述将第一翻译文本中的至少一个不符合规范词还原为规范词,以生成M个第二翻译文本之前,所述方法还包括:The method according to claim 4, wherein before restoring at least one non-standard word in the first translation text to a standard word to generate M second translation texts, the method further comprises:
    将所述第一翻译文本输入第一分词模型后,对所述第一翻译文本进行分词,得到K个分词,并对所述K个分词中的每个分词进行不符合规范词识别,得到所述每个分词对应的识别结果,一个分词对应的识别结果用于表征所述一个分词是否属于不符合规范词,在所述一个分词属于不符合规范词的情况下,所述一个分词对应的识别结果包括所述一个分词所对应的规范类型;After inputting the first translation text into the first word segmentation model, the first translation text is segmented to obtain K word segments, and each of the K word segments is identified as a word that does not conform to the standard to obtain a recognition result corresponding to each word segment, wherein the recognition result corresponding to a word segment is used to indicate whether the word segment is a word that does not conform to the standard. If the word segment is a word that does not conform to the standard, the recognition result corresponding to the word segment includes the standard type corresponding to the word segment;
    其中,所述第一分词模型是基于目标训练样本集训练得到的,K为大于1的整数。The first word segmentation model is trained based on a target training sample set, and K is an integer greater than 1.
  6. 根据权利要求1至4任一项所述的方法,其中,所述不符合规范词包括以下至少一种情况:包含读音拼写、包含错别字、包含同源字替换、包含字形错误。The method according to any one of claims 1 to 4, wherein the non-standard words include at least one of the following situations: including phonetic spelling, including typos, including homologous word replacement, and including glyph errors.
  7. 一种样本构建装置,所述装置包括:获取模块,处理模块和构建模块;A sample construction device, the device comprising: an acquisition module, a processing module and a construction module;
    所述获取模块,用于获取平行语料训练样本,所述平行语料训练样本包含原始文本并携带所述原始文本中的每个关键词所对应的规范类型标签;The acquisition module is used to acquire a parallel corpus training sample, wherein the parallel corpus training sample includes an original text and carries a standard type label corresponding to each keyword in the original text;
    所述处理模块,用于将所述获取模块获取的所述平行语料训练样本中的所述原始文本中的第一关键词替换为所述第一关键词对应的至少一个第一不符合规范词,以生成至少一个扩展文本;The processing module is used to replace the first keyword in the original text in the parallel corpus training sample acquired by the acquisition module with at least one first non-standard word corresponding to the first keyword, so as to generate at least one extended text;
    The processing module is further configured to replace the first standard type label corresponding to the first keyword in the parallel corpus training sample acquired by the acquisition module with the second standard type label corresponding to the first non-standard word, to obtain the parallel corpus training sample after the label is replaced;
    所述构建模块,用于基于所述处理模块处理后的替换标签后的所述平行语料训练样本与所述至少一个扩展文本,构建目标训练样本。The construction module is used to construct a target training sample based on the parallel corpus training sample after the label is replaced and processed by the processing module and the at least one extended text.
  8. 根据权利要求7所述的装置,其中,所述平行语料训练样本为平行语料训练样本集中的一个平行语料训练样本;The device according to claim 7, wherein the parallel corpus training sample is a parallel corpus training sample in a parallel corpus training sample set;
    所述处理模块,具体用于:The processing module is specifically used for:
    基于所述原始文本中的每个关键词在所述平行语料训练样本集中的词频,从所述原始文本中确定至少一个第一关键词,将所述原始文本中的所述至少一个第一关键词中的每个第一关键词替换为各自对应的第一不符合规范词,以生成第一扩展文本;Based on the word frequency of each keyword in the original text in the parallel corpus training sample set, at least one first keyword is determined from the original text, and each first keyword in the at least one first keyword in the original text is replaced with a first non-standard word corresponding to the first keyword to generate a first extended text;
    其中,所述第一扩展文本为所述至少一个扩展文本中的任一扩展文本。The first extended text is any extended text among the at least one extended text.
  9. 根据权利要求7所述的装置,其中,所述平行语料训练样本为平行语料训练样本集中的一个平行语料训练样本,所述扩展文本的数量为N,N为正整数;The device according to claim 7, wherein the parallel corpus training sample is a parallel corpus training sample in a parallel corpus training sample set, and the number of the extended texts is N, where N is a positive integer;
    所述处理模块,还用于在将所述原始文本中的第一关键词替换为所述第一关键词对应的至少一个第一不符合规范词,以生成至少一个扩展文本之后,在N个所述扩展文本中的第二扩展文本中包含未收录在所述平行语料训练样本集中的未登录词的情况下,对所述未登陆词的词特征信息进行初始化;The processing module is further configured to, after replacing a first keyword in the original text with at least one first non-standard word corresponding to the first keyword to generate at least one extended text, initialize word feature information of the unregistered word when a second extended text among the N extended texts includes an unregistered word that is not included in the parallel corpus training sample set;
    其中,所述初始化的过程包括以下之一:The initialization process includes one of the following:
    按照所述未登录词对应的第一关键词,和所述N个扩展文本中的每个扩展文本中所述未登录词对应的第一关键词所对应的每个不符合规范词在所述平行语料训练样本集中的词频,对所述未登录词的词特征信息进行加权平均;According to the first keyword corresponding to the unregistered word and the word frequency of each non-standard word corresponding to the first keyword corresponding to the unregistered word in each of the N extended texts in the parallel corpus training sample set, weighted averaging the word feature information of the unregistered word;
    使用所述未登录词对应的同源词的特征信息,对所述未登录词的特征信息进行加权平均;Using the feature information of the cognate words corresponding to the unregistered words, weighted averaging the feature information of the unregistered words;
    将所述未登录词的特征信息置为0;Setting the feature information of the unregistered word to 0;
    将所述未登录词的特征信息随机初始化。The feature information of the unregistered word is randomly initialized.
  10. 根据权利要求7所述的装置,其中,所述平行语料样本为平行语料样本集中的一个平行语料训练样本;The device according to claim 7, wherein the parallel corpus sample is a parallel corpus training sample in a parallel corpus sample set;
    所述装置还包括:翻译模块;The device also includes: a translation module;
    所述处理模块,还用于在所述构建模块基于替换标签后的所述平行语料训练样本与所述至少一个扩展文本,构建目标训练样本之后,将第一翻译文本中的至少一个不符合规范词还原为规范词,以生成M个第二翻译文本,一个不符合规范词还原为至少一个规范词;The processing module is further configured to restore at least one non-standard word in the first translation text to a standard word after the construction module constructs the target training sample based on the parallel corpus training sample after replacing the label and the at least one extended text, so as to generate M second translation texts, wherein one non-standard word is restored to at least one standard word;
    The translation module is configured to input first feature information corresponding to the first translation text and second feature information corresponding to X second translation texts among the M second translation texts obtained by the processing module into a first translation model for text translation, to obtain a target translation, wherein the first feature information includes text feature information of the first translation text and feature information of the standard type labels corresponding to the non-standard words in the first translation text, and the second feature information includes text feature information of the second translation text and feature information of the standard type labels corresponding to the non-standard words in the second translation text;
    其中,所述第一翻译模型是基于目标训练样本集训练得到的,所述目标训练样本集包括多个所述目标训练样本,一个所述目标训练样本对应所述平行语料训练样本集中的一个平行语料训练样本,M、X为正整数,且X小于或等于M。Among them, the first translation model is obtained by training based on a target training sample set, the target training sample set includes multiple target training samples, one target training sample corresponds to a parallel corpus training sample in the parallel corpus training sample set, M and X are positive integers, and X is less than or equal to M.
  11. 根据权利要求10所述的装置,所述装置还包括:分词模块;The device according to claim 10, further comprising: a word segmentation module;
    所述分词模块,用于在所述处理模块将第一翻译文本中的至少一个不符合规范词还原为规范词,以生成M个第二翻译文本之前,将所述第一翻译文本输入第一分词模型后,对所述第一翻译文本进行分词,得到K个分词,并对所述K个分词中的每个分词进行不符合规范词识别,得到所述每个分词对应的识别结果,一个分词对应的识别结果用于表征所述一个分词是否属于不符合规范词,在所述一个分词属于不符合规范词的情况下,所述一个分词对应的识别结果包括所述一个分词所对应的规范类型;The word segmentation module is used for, before the processing module restores at least one non-standard word in the first translation text to a standard word to generate M second translation texts, inputting the first translation text into a first word segmentation model, performing word segmentation on the first translation text to obtain K word segments, and performing non-standard word recognition on each of the K word segments to obtain a recognition result corresponding to each word segment, wherein the recognition result corresponding to a word segment is used to indicate whether the word segment is a non-standard word, and when the word segment is a non-standard word, the recognition result corresponding to the word segment includes the standard type corresponding to the word segment;
    其中,所述第一分词模型是基于目标训练样本集训练得到的,K为大于1的整数。The first word segmentation model is trained based on a target training sample set, and K is an integer greater than 1.
  12. The device according to any one of claims 7 to 10, wherein the non-standard words include at least one of the following situations: including phonetic (pinyin) spelling, including typos, including homologous word replacement, and including glyph errors.
  13. 一种电子设备,包括处理器和存储器,所述存储器存储可在所述处理器上运行的程序或指令,所述程序或指令被所述处理器执行时实现如权利要求1至6任一项所述的样本构建方法的步骤。An electronic device comprises a processor and a memory, wherein the memory stores a program or instruction that can be run on the processor, and when the program or instruction is executed by the processor, the steps of the sample construction method according to any one of claims 1 to 6 are implemented.
  14. 一种可读存储介质,所述可读存储介质上存储程序或指令,所述程序或指令被处理器执行时实现如权利要求1至6任一项所述的样本构建方法的步骤。A readable storage medium stores a program or instruction, and when the program or instruction is executed by a processor, the steps of the sample construction method according to any one of claims 1 to 6 are implemented.
  15. 一种芯片,所述芯片包括处理器和通信接口,所述通信接口和所述处理器耦合,所述处理器用于运行程序或指令,实现如权利要求1至6任一项所述的样本构建方法的步骤。A chip, comprising a processor and a communication interface, wherein the communication interface is coupled to the processor, and the processor is used to run a program or instruction to implement the steps of the sample construction method according to any one of claims 1 to 6.
  16. 一种计算机程序产品,所述程序产品被存储在存储介质中,所述程序产品被至少一个处理器执行以实现如权利要求1至6任一项所述的样本构建方法的步骤。A computer program product, wherein the program product is stored in a storage medium and is executed by at least one processor to implement the steps of the sample construction method according to any one of claims 1 to 6.
  17. An electronic device, characterized in that the electronic device is configured to perform the steps of the sample construction method according to any one of claims 1 to 6.
PCT/CN2024/075789 2023-02-08 2024-02-04 Sample construction method and apparatus, and electronic device and readable storage medium WO2024164976A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202310085121.XA CN116089569A (en) 2023-02-08 2023-02-08 Sample construction method, device, electronic equipment and readable storage medium
CN202310085121.X 2023-02-08

Publications (1)

Publication Number Publication Date
WO2024164976A1 true WO2024164976A1 (en) 2024-08-15

Family

ID=86213758

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2024/075789 WO2024164976A1 (en) 2023-02-08 2024-02-04 Sample construction method and apparatus, and electronic device and readable storage medium

Country Status (2)

Country Link
CN (1) CN116089569A (en)
WO (1) WO2024164976A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116089569A (en) * 2023-02-08 2023-05-09 维沃移动通信有限公司 Sample construction method, device, electronic equipment and readable storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294396A (en) * 2015-05-20 2017-01-04 北京大学 Keyword expansion method and keyword expansion system
CN107315734A (en) * 2017-05-04 2017-11-03 中国科学院信息工程研究所 A kind of method and system for becoming pronouns, general term for nouns, numerals and measure words standardization based on time window and semanteme
CN107656990A (en) * 2017-09-14 2018-02-02 中山大学 A kind of file classification method based on two aspect characteristic informations of word and word
CN110210035A (en) * 2019-06-04 2019-09-06 苏州大学 The training method of sequence labelling method, apparatus and sequence labelling model
CN113434650A (en) * 2021-06-29 2021-09-24 平安科技(深圳)有限公司 Question and answer pair expansion method and device, electronic equipment and readable storage medium
CN113468856A (en) * 2020-03-31 2021-10-01 阿里巴巴集团控股有限公司 Variant text generation method, variant text translation model training method, variant text classification device and variant text translation model training device
CN114201975A (en) * 2021-10-26 2022-03-18 科大讯飞股份有限公司 Translation model training method, translation method and device
CN116089569A (en) * 2023-02-08 2023-05-09 维沃移动通信有限公司 Sample construction method, device, electronic equipment and readable storage medium

Also Published As

Publication number Publication date
CN116089569A (en) 2023-05-09

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24752831

Country of ref document: EP

Kind code of ref document: A1