CN114021589A - Sample generation method and device, computer equipment and storage medium - Google Patents
- Publication number
- CN114021589A CN114021589A CN202111308216.0A CN202111308216A CN114021589A CN 114021589 A CN114021589 A CN 114021589A CN 202111308216 A CN202111308216 A CN 202111308216A CN 114021589 A CN114021589 A CN 114021589A
- Authority
- CN
- China
- Prior art keywords
- sample
- language
- corpus
- vocabulary
- words
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/247—Thesauruses; Synonyms
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Data Mining & Analysis (AREA)
- Biomedical Technology (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Animal Behavior & Ethology (AREA)
- Databases & Information Systems (AREA)
- Machine Translation (AREA)
Abstract
The invention provides a sample generation method, comprising the following steps: receiving a first sample formed from a corpus of a first language and a corpus of a second language, wherein the corpora included in the first sample express approximately the same semantics and each corpus comprises at least one word; acquiring candidate words having a near-synonym relationship with words included in the first sample; and replacing the corresponding near-synonymous words in the first sample with the candidate words to form a second sample, wherein the second sample comprises a corpus of the first language and a corpus of the second language. By replacing words in existing parallel corpora with candidate words that are near-synonyms of those words, the method forms new samples and thereby expands the parallel corpus training samples to a certain extent.
Description
Technical Field
The present specification relates to the technical field of computer data processing, and in particular, to a method and an apparatus for generating a sample, a computer device, and a storage medium.
Background
With the continuous development of natural language processing technology, machine translation models based on neural networks are increasingly applied in various fields. To improve the performance of a machine translation model, a large number of parallel corpus samples is often required to train the model. However, in some specific fields, such as the medical field, the number of high-quality parallel corpus samples is still limited. In the prior art, manual translation of corpora is often needed to obtain more training samples, which is not only time-consuming but also costly.
Disclosure of Invention
In view of the above, embodiments of the present disclosure are directed to a method, an apparatus, a computer device and a storage medium for generating samples, so as to extend parallel corpus training samples based on existing corpora to some extent.
An embodiment of the present specification provides a method for generating a sample, including: receiving a first sample formed from a corpus of a first language and a corpus of a second language, wherein the corpora included in the first sample express approximately the same semantics and each corpus comprises at least one word; acquiring candidate words having a near-synonym relationship with words included in the first sample; and replacing the corresponding near-synonymous words in the first sample with the candidate words to form a second sample, wherein the second sample comprises a corpus of the first language and a corpus of the second language.
An embodiment of the present specification provides a sample generation device, including: a sample receiving module, configured to receive a first sample formed from a corpus of a first language and a corpus of a second language, wherein the corpora included in the first sample express approximately the same semantics and each corpus comprises at least one word; a candidate word obtaining module, configured to obtain candidate words having a near-synonym relationship with words included in the first sample; and a candidate word replacing module, configured to replace the corresponding near-synonymous words in the first sample with the candidate words to form a second sample, wherein the second sample comprises a corpus of the first language and a corpus of the second language.
An embodiment of the present specification provides a computer device comprising a memory storing a computer program and a processor that implements the method of any one of the above embodiments when executing the computer program.
An embodiment of the present specification provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any one of the above embodiments.
In the embodiments of the specification, samples are formed by replacing words in existing parallel corpora with candidate words having a near-synonym relationship with those words, so that the parallel corpus training samples are expanded to a certain extent.
Drawings
Fig. 1 is a schematic diagram illustrating an example of a scenario provided in an embodiment.
Fig. 2 is a schematic diagram illustrating an example of a scenario provided in an embodiment.
Fig. 3 is a schematic diagram illustrating an example of a scenario provided in an embodiment.
Fig. 4 is a schematic flow chart illustrating a sample generation method according to an embodiment.
Fig. 5 is a schematic flow chart illustrating a sample generation method according to an embodiment.
Fig. 6 is a block diagram showing a configuration of a sample generation device according to an embodiment.
Detailed Description
To make the technical solution of the present invention better understood, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only a part, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present specification without inventive effort fall within the protection scope of the present specification.
Please refer to fig. 1, fig. 2 and fig. 3. In the scenario example of the sample generation system provided in this specification, the user may be a worker in the medical field who needs to expand a parallel corpus sample library based on existing Chinese and English corpora in the medical field, so as to increase the number of training samples of a Chinese-English medical translation model and improve the performance of the translation model.
The user first inputs the existing Chinese and English corpora of the medical field into the system, and the system generates parallel corpora from the existing corpora. The system first takes the Chinese or English medical texts out of the corpus, splits the texts into sentences, and preprocesses the sentences. The system may convert uppercase words in English sentences into lowercase form and convert traditional characters in Chinese sentences into simplified characters. Meanwhile, the system may convert full-width characters in the sentences into half-width characters and perform word segmentation on the processed sentences. Then, the system calculates a perplexity index for each sentence and deletes a sentence from the corpus when its perplexity index is greater than a set threshold. After the above processing is completed, the system searches the corpus for Chinese and English sentence pairs with approximately the same semantics to form parallel corpora. Chinese or English sentences for which no semantically corresponding counterpart is found are marked as monolingual corpora.
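The full-width to half-width conversion and lowercasing steps above can be sketched in Python. The traditional-to-simplified conversion and Chinese word segmentation would normally rely on external tools (for instance OpenCC and jieba, which are assumptions here, not named in the patent) and are omitted:

```python
def to_halfwidth(text: str) -> str:
    """Convert full-width characters to their half-width equivalents."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:               # full-width space
            code = 0x20
        elif 0xFF01 <= code <= 0xFF5E:   # full-width '!' .. '~'
            code -= 0xFEE0
        out.append(chr(code))
    return "".join(out)

def normalize_sentence(sentence: str) -> str:
    """Normalize full-width characters, then lowercase English letters."""
    return to_halfwidth(sentence).lower()
```

For example, `normalize_sentence("Ｈｅｌｌｏ！")` yields `"hello!"`.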
For the parallel corpora, the system also matches the Chinese and English words within each pair. Specifically, for each Chinese word in a parallel corpus, the system searches the English corpus for an English word whose semantics are nearly the same. When a Chinese word matches no English word, a semantic component is missing or redundant between the Chinese and English corpora, and the corresponding parallel corpus is deleted. Alternatively, the system may match each English word of the English corpus to a semantically equivalent Chinese word in the Chinese corpus; when an English word in the parallel corpus matches no corresponding Chinese word, the corresponding parallel corpus is likewise deleted.
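A minimal sketch of this word-level filtering step, assuming a toy bilingual lexicon (the real system would use a medical dictionary or a learned word aligner, neither of which the patent specifies):

```python
# Toy bilingual lexicon; the entries are illustrative assumptions.
LEXICON = {"发热": {"fever"}, "咳嗽": {"cough"}}

def is_fully_aligned(zh_tokens, en_tokens):
    """Keep a sentence pair only when every Chinese token has a matching
    English token and no English token is left unmatched."""
    en_set = set(en_tokens)
    matched = set()
    for zh in zh_tokens:
        hits = LEXICON.get(zh, set()) & en_set
        if not hits:
            return False        # a Chinese word with no English counterpart
        matched |= hits
    return matched == en_set    # redundant English words also fail
```

A pair failing this check would be dropped from the parallel corpus.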
After the preprocessing is finished, the system first constructs, from the existing parallel corpora, a Chinese-English parallel corpus sample set for training a machine translation model that translates Chinese into English, and an English-Chinese parallel corpus sample set for training a machine translation model that translates English into Chinese. A Chinese-English translation model and an English-Chinese translation model are then trained on the Chinese-English and English-Chinese parallel corpus sample sets, respectively. The translation models are constructed based on the Transformer model. The system then processes the monolingual corpora: each Chinese monolingual corpus is passed through the Chinese-English translation model to generate a corresponding English corpus, forming an English-Chinese parallel corpus sample that can be used to train the English-Chinese translation model and is added to the English-Chinese parallel corpus sample set. Each English monolingual corpus is passed through the English-Chinese translation model to generate a corresponding Chinese corpus, forming a Chinese-English parallel corpus sample that can be used to train the Chinese-English translation model and is added to the Chinese-English parallel corpus sample set.
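The back-translation loop can be sketched as follows; the dictionary-based `zh_to_en` stub stands in for the Transformer-based Chinese-English model, and its lookup table is purely illustrative:

```python
def zh_to_en(sentence: str) -> str:
    """Stub for the Chinese-English translation model; the lookup table
    is an illustrative assumption, not a trained model."""
    table = {"今天无确诊病例": "no confirmed cases today"}
    return table.get(sentence, sentence)

def back_translate(zh_monolingual, en_zh_samples):
    """Turn Chinese monolingual sentences into English->Chinese pairs:
    the synthetic English side is the model input, the original (clean)
    Chinese side is the training target."""
    for zh in zh_monolingual:
        en = zh_to_en(zh)
        en_zh_samples.append((en, zh))
    return en_zh_samples
```

The symmetric direction (English monolingual corpus through the English-Chinese model) follows the same pattern.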
The system then requests a knowledge graph of the medical domain. From the Chinese-English and English-Chinese parallel corpus sample sets, the system selects part of the parallel corpus samples based on a preset probability and creates copies of them. The system then finds proper nouns in the copies based on a thesaurus of medical-domain proper nouns. Next, candidate words having a near-synonym relationship with these proper nouns are searched for in the knowledge graph, and the corresponding proper nouns in the copies are replaced with the candidate words, yielding new parallel corpus samples that are added to the corresponding parallel corpus sample sets.
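A hedged sketch of this synonym-replacement augmentation, with a hard-coded `SYNONYMS` table standing in for the medical knowledge graph (the table entries and probability are illustrative assumptions):

```python
import random

# Hypothetical near-synonym table standing in for the knowledge graph.
SYNONYMS = {
    "amoxicillin": ["amoxycillin"],
    "阿莫西林": ["羟氨苄青霉素"],
}

def augment_pair(src_tokens, tgt_tokens, p=0.5, rng=random):
    """With probability p, copy the sample and swap each listed proper
    noun for a randomly chosen near-synonym on both sides of the pair."""
    if rng.random() >= p:
        return None              # this sample is not selected for copying
    def swap(tokens):
        return [rng.choice(SYNONYMS[t]) if t in SYNONYMS else t
                for t in tokens]
    return swap(src_tokens), swap(tgt_tokens)
```

Replacing the proper noun on both sides keeps the two languages semantically in step, which is what makes the new pair a valid training sample.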
In addition, for the Chinese-English and English-Chinese parallel corpus sample sets, the system may call the BioBERT biomedical pre-trained language model for processing. Using the masking method of the language model, part of the words in the corpus used as model input in a parallel corpus are replaced with mask tokens based on a preset probability; a copy of the masked parallel corpus is created and added to the corresponding parallel corpus sample set.
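The masking step alone can be sketched as below; the mask token and default probability are assumptions in the spirit of BERT-style masking, not values given by the patent:

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", rng=None):
    """Replace each input-side token with a mask token at a preset
    probability, producing a perturbed copy of the input corpus."""
    rng = rng or random.Random(42)   # fixed seed for reproducibility
    return [mask_token if rng.random() < mask_prob else t for t in tokens]
```

The masked sentence is paired with the unchanged target-side corpus to form the new sample.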
After the above processing, the system returns the expanded Chinese-English and English-Chinese parallel corpus sample sets to the user. The user can then call the training system of the machine translation model to train the corresponding machine translation model on the extended parallel corpus sample sets.
The embodiment of the specification provides a sample generation system. The sample generation system may include a client and a server. The client may be an electronic device with network access capability; specifically, for example, a desktop computer, a tablet computer, a notebook computer, a smartphone, a digital assistant, a smart wearable device, a shopping guide terminal, a television, a smart speaker, a microphone, and the like. Smart wearable devices include, but are not limited to, smart bracelets, smart watches, smart glasses, smart helmets, smart necklaces, and the like. Alternatively, the client may be software running on an electronic device. The server may be an electronic device with a certain computing capability, having a network communication module, a processor, a memory, and the like. Of course, the server may also refer to software running on an electronic device. The server may be a distributed server, i.e. a system with a plurality of processors, memories, network communication modules, and the like that cooperate with one another. Alternatively, the server may be a server cluster formed by several servers. Or, with the development of science and technology, the server may also be a new technical means capable of realizing the corresponding functions of the embodiments of this specification, for example, a new form of "server" implemented based on quantum computing.
Referring to fig. 4, an embodiment of the present disclosure provides a method for generating a sample, which includes the following steps.
Step S110: receiving a first sample formed from a corpus of a first language and a corpus of a second language; wherein the corpora included in the first sample express approximately the same semantics, and each corpus comprises at least one word.
Before generating the sample, the existing corpus needs to be received, and the sample is generated based on the received corpus.
The first language and the second language may be natural languages, for example, languages that evolved naturally with culture, such as Chinese, English, and Japanese. Of course, the first language and the second language may also be constructed languages, such as Esperanto and other international auxiliary languages. In some embodiments, the first and second languages may also include computer languages, such as C, C++, and Python. Alternatively, the first language and the second language may be different versions of the same language, such as simplified Chinese and traditional Chinese. Of course, the first language and the second language may be different languages.
The corpus may be a sentence expressed in the first language or the second language, the sentence including at least one word. The corpus may be a sentence that has not undergone data processing, or one that has. Specifically, for example, the corpus may be a sentence composed of at least one word after word segmentation, or a sentence from which characters and words carrying no semantic information have been removed by data cleaning. Of course, the corpus may also be a sentence subjected to normalization processing. The sentence may be stored as a character string or as a list.
The first sample comprises a corpus of a first language and a corpus of a second language, wherein the semantics expressed by the corpus of the first language and by the corpus of the second language are the same. For example, the corpus of the first language may be a Chinese sentence meaning "no confirmed cases today"; correspondingly, the corpus of the second language may be the English sentence with the same semantics, "There are no confirmed cases today". The first sample may be obtained by manual compilation, or obtained by a server; for example, the server may obtain the first sample from the internet through crawler technology. Alternatively, the first sample may be generated by the server based on a certain number of corpora and operation methods.
In the process of receiving the first sample, the server running the sample generation method may send a request to a database and receive the corresponding first sample data returned by the database. Of course, the database may also directly send the data of the first sample to the server, which receives it after detecting the data information. In some embodiments, receiving the first sample may also mean that the server actively accesses first sample data stored on a local storage medium, and the memory of the server receives the data stored on the storage medium.
In the first sample formed from the corpus of the first language and the corpus of the second language, the forming method aims at establishing a correspondence between the two corpora. Specifically, for example, the corpus of the first language and the corpus of the second language may be concatenated with a designated separator identifier, or stored together in a two-dimensional list. Of course, in some embodiments, the corpus of the first language and the storage address of its corresponding second-language corpus may also be recorded in a dictionary.
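The three binding options above can be illustrated in Python; the example sentence pair and the tab separator are illustrative assumptions:

```python
# An assumed example sentence pair.
zh = "今天无确诊病例"
en = "There are no confirmed cases today"

SEP = "\t"                 # an arbitrary choice of separator identifier

joined = zh + SEP + en     # 1) concatenation with a designated separator
pairs = [[zh, en]]         # 2) a row of a two-dimensional list
index = {zh: en}           # 3) a dictionary keyed by the first-language corpus
```

All three schemes let the second-language corpus be recovered from the first-language one.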
Step S120: acquiring candidate words having a near-synonym relationship with the words included in the first sample.
Candidate words having a near-synonym relationship with the words contained in the first sample are obtained, and a second sample is formed based on the candidate words and the first sample.
The words included in the first sample may be one word or a plurality of words in the first sample. They may include only words in the corpus of the first language, only words in the corpus of the second language, or words in both. A word may be of any part of speech, for example a verb, a noun, or an adjective, and may be a domain-specific term or a general-purpose word. Specifically, for example, it may be a proper term of the medical field, such as "cephalosporin" or "amoxicillin", or a general word, such as "weather" or "high-speed rail".
A candidate word is a word having a near-synonym relationship with a word included in the first sample. The near-synonym relationship indicates that the semantics expressed by the related words tend to be the same. When a word in the first sample has multiple senses, the candidate word should match the sense in which the word is used in the corpus of the first sample. There may be one or more candidate words having a near-synonym relationship with a word in the first sample. The language of the candidate word may be consistent with the language of the word; of course, in some embodiments it may not be, for example, a candidate word for "ETC" may also be "ECT".
In addition, the candidate word may be of any part of speech and from any domain.
In the step of acquiring candidate words for the words included in the first sample, the near-synonyms of a word may be obtained by directly requesting them from an external server. Alternatively, a dictionary database recording the words and their corresponding candidate words may be requested from an external server; after the local server receives the dictionary database, it searches the database to obtain the candidate words corresponding to each word.
Step S130: replacing the corresponding near-synonymous words in the first sample with the candidate words to form a second sample; wherein the second sample comprises a corpus of the first language and a corpus of the second language.
The words having a near-synonym relationship in the first sample are replaced with the candidate words to generate a second sample different from the first sample.
The second sample is an extended sample formed based on the first sample and used to extend the number of samples. The second sample also includes a corpus of the first language and a corpus of the second language. Replacing the near-synonymous words in the first sample with candidate words to form the second sample ensures that the semantics expressed by the corpora of the first and second languages in the second sample remain approximately the same, while the corpora of the second sample differ to some extent from those of the first sample, forming a distinct sample.
Referring to fig. 5, in some embodiments, the method for generating a sample further includes the following steps.
Step S210: receiving a first corpus; wherein the first corpus belongs to a first language.
Step S220: acquiring a second corpus that expresses in a second language the same semantics as the first corpus; the first corpus and the second corpus form the first sample.
In the corpora, there may be a portion consisting of first corpora only. A first corpus comprises only a corpus expressed in the first language, with no corresponding second corpus expressing the same semantics in the second language. Therefore, a first corpus cannot be directly processed by the sample generation method described in the embodiments of this specification to generate a second sample. However, the first corpus still contains useful information, and discarding it would cause a certain waste. Therefore, the first corpus can be effectively utilized to generate a first sample based on steps S210 and S220, so that a second sample can then be formed by the sample generation method described in the embodiments of this specification.
The second corpus is a corpus expressed in the second language whose semantics tend to be the same as those of the first corpus. As for the manner of obtaining the second corpus, the first corpus may be translated into the second corpus by an existing machine translation model. In some embodiments, the first corpus may also be returned to the user, with the corresponding second corpus labeled manually and input to the server. Of course, in some embodiments, a translation model may first be trained on existing first samples, and the first corpus then translated into the second corpus by that model. The translation model translates corpora of the first language into corpora of the second language, and may be built based on the Transformer model, the LSTM model, or, of course, the GRU model.
The first corpus and the second corpus form the first sample using the method, described in the embodiments of this specification, for forming a first sample from a corpus of the first language and a corpus of the second language.
In some embodiments, the first sample is used to train a training model that translates the second language into the first language.
When the first sample is formed from the first corpus and a second corpus obtained by translating the first corpus with a translation model, the semantic quality of the first sample is relatively deficient. This is mainly because the performance of the model that translates the first language into the second language may not be high enough to translate the first corpus into a corpus of the same semantics and high semantic quality. Training a translation model on a first sample formed from such a first and second corpus therefore affects the performance of the trained model. For this reason, a first sample formed from the first corpus and a second corpus translated from it is only used to train a model that translates the second language into the first language. This guarantees the accuracy of the training sample's label: the training target of the model is the first corpus, whose semantic quality is relatively high, so the performance of the translation model is preserved. Specifically, for example, when the first corpus is the Chinese sentence meaning "no diagnosed cases today", a corresponding translation result may be obtained through an existing translation model, such as "No diagnose cases today", which contains some grammatical errors; for example, "diagnose" should syntactically be expressed in its passive form. This translation result is used as the second corpus and combined with the first corpus to form the first sample, which may be "No diagnose cases today - no diagnosed cases today", where "-" separates the second corpus from the first corpus. The first sample may then be used to train a translation model that translates English into Chinese.
The input of the translation model is the English corpus "No diagnose cases today", and the learning target is "no diagnosed cases today", a corpus of higher semantic quality without major grammatical errors. This ensures, to a certain extent, that the performance of the translation model is not degraded by the semantic quality of the second corpus in the first sample. Meanwhile, the translation model trained on such first samples has better fault tolerance in practical applications, and can recognize partially erroneous information input by a user and translate it into an expression with approximately the correct semantics.
In some embodiments, the sample generation method further comprises: obtaining a primary sample, wherein the primary sample comprises corpora formed in the first language and the second language; calculating a perplexity index for the corpora of the primary sample, wherein the perplexity index is used to represent the semantic quality of a corpus; and taking a primary sample whose perplexity index is smaller than a set threshold as the first sample.
The higher the semantic quality of the corpora forming the first sample, the better the performance of the model trained on the first sample. Therefore, before the first sample is generated, corpora of poor quality can be removed based on the perplexity index, so that the corpora forming the first sample are semantically clear and of high quality.
The primary sample includes a corpus expressed in the first language and a corpus expressed in the second language, whose semantics tend to be the same. The primary sample can be obtained by manual compilation or by a server; of course, it may also be generated based on an existing translation model.
The perplexity index is used to evaluate the semantic quality of a corpus: the larger its value, the worse the semantic quality of the corresponding corpus. The index can be obtained by manual labeling or generated by a language model. For example, the probability of a sentence occurring can be calculated with a bi-gram model and used to represent the semantic quality of the sentence. Specifically, for a sentence meaning "today the weather is really good", word segmentation may yield "today", "weather", "really good". The probability of the sentence can then be evaluated with the bi-gram model, i.e. P(today weather really good) = P(today) × P(weather | today) × P(really good | weather). Assuming the probabilities recorded in the bi-gram model are P(today) = 0.1, P(weather | today) = 0.05, and P(really good | weather) = 0.06, the probability of the sentence is 0.0003, and this value may serve as the indicator. In some embodiments, a value obtained by further processing the probability may be used as the indicator; for example, the logarithms of the probabilities may be summed and the sum divided by the number of terms. This avoids the problem that the product of many probabilities becomes vanishingly small, and reduces the influence of sentence length on the result.
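The bi-gram computation above can be written as a small Python sketch; the probability table holds exactly the assumed values from the example, with `<s>` as an assumed sentence-start symbol:

```python
import math

# Bi-gram probabilities assumed in the example above.
BIGRAM = {
    ("<s>", "today"): 0.1,
    ("today", "weather"): 0.05,
    ("weather", "really good"): 0.06,
}

def sentence_probability(tokens):
    """Chain-rule product of bi-gram probabilities."""
    prob, prev = 1.0, "<s>"
    for tok in tokens:
        prob *= BIGRAM[(prev, tok)]
        prev = tok
    return prob

def avg_log_prob(tokens):
    """Length-normalized log probability: summing logs avoids the
    shrinking product of many small probabilities, and dividing by the
    number of terms removes the sentence-length bias."""
    total, prev = 0.0, "<s>"
    for tok in tokens:
        total += math.log(BIGRAM[(prev, tok)])
        prev = tok
    return total / len(tokens)
```

`sentence_probability(["today", "weather", "really good"])` reproduces the 0.0003 value worked out above.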
The set threshold is used to determine whether the primary sample is used to generate a first sample. The set threshold may be set by a designer of the system based on experience, or may be learned by the server based on labeled samples.
The set threshold may comprise only one value. In this case, a first confusion index may be calculated for the corpus expressed in the first language and a second confusion index for the corpus expressed in the second language, and the two combined into the confusion index of the primary sample, for example by taking their average. A primary sample whose confusion index is smaller than the set threshold may then be taken as the first sample.
In some embodiments, the set threshold may include two values, a first confusion threshold and a second confusion threshold. The first confusion threshold corresponds to the first confusion index of the corpus expressed in the first language, and the second confusion threshold corresponds to the second confusion index of the corpus expressed in the second language. When both the first confusion index and the second confusion index of a primary sample are smaller than their corresponding thresholds, the primary sample is taken as the first sample.
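The two-threshold filtering described above can be sketched as follows; the function and parameter names are illustrative, not drawn from the embodiment.

```python
def filter_primary_samples(primary_samples, first_threshold, second_threshold):
    """Keep only primary samples whose first-language and second-language
    confusion indexes are both below their respective thresholds."""
    first_samples = []
    for first_index, second_index, pair in primary_samples:
        if first_index < first_threshold and second_index < second_threshold:
            first_samples.append(pair)
    return first_samples

primary_samples = [
    (0.2, 0.3, ("corpus A in L1", "corpus A in L2")),  # both below threshold: kept
    (0.9, 0.1, ("corpus B in L1", "corpus B in L2")),  # first index too high: dropped
]
print(filter_primary_samples(primary_samples, 0.5, 0.5))
# [('corpus A in L1', 'corpus A in L2')]
```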
In some embodiments, the method of generating a sample further comprises: obtaining a primary sample, wherein the primary sample comprises corpora formed in the first language and the second language; matching the words in the first-language corpus of the primary sample with semantically similar words in the second-language corpus; and taking the primary sample as the first sample if all words of the first-language corpus match words of the second-language corpus.
In some primary samples, semantic components of the corpus are missing or redundant, and a first sample formed from such a primary sample would have deficient semantic quality.
Therefore, the words in the first-language corpus of the primary sample can be matched against semantically equivalent words in the second-language corpus, so as to ensure the accuracy of the first-language corpus information. Specifically, for example, the primary sample may include a Chinese-English sentence pair whose Chinese side means "no confirmed cases today", where Chinese is the first language and English is the second language. After word segmentation and removal of some function words without actual semantics, the primary sample may become [["today", "no", "confirmed", "case"], ["no", "confirmed", "cases"]]. The server then takes the first word "today" from the Chinese corpus and searches for a match in the English corpus. Finding no word in the English corpus whose semantics tend to be the same as "today", the server does not form a first sample from this primary sample, so as to ensure that the semantics of any first sample formed from a primary sample are complete to some extent.
A word in the first-language corpus may be a single word after word segmentation or a phrase composed of several consecutive words; correspondingly, a word in the second-language corpus may be a single word or a phrase of several consecutive words. One word in the first-language corpus may match several words in the second-language corpus, several words in the first-language corpus may match one word in the second-language corpus, and, of course, several words may also match several words. Specifically, for example, when the first language is Chinese and the second language is English, the Chinese word "great wall" may match the three English words "the", "great" and "wall" after English word segmentation. Similarly, multiple words in the first-language corpus may correspond to only one word of the second language; for example, when the first language is English and the second language is Chinese, the two English words "tool" and "kit" may match the single Chinese word "toolkit".
For matching, the server may directly call an existing library, for example the C++ word-alignment tool fast_align. Of course, a matching method may also be designed in-house, for example by computing the similarity between words of the first-language corpus and words of the second-language corpus through word vectors, so as to determine whether their semantics tend to be the same.
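A self-designed matcher of the second kind could compare word vectors by cosine similarity, for instance. The vectors and the threshold below are toy assumptions; a real system would load pretrained cross-lingual embeddings.

```python
import math

# Toy cross-lingual word vectors; a real system would use pretrained embeddings.
VECTORS = {
    "today": [0.90, 0.10, 0.00],
    "aujourd'hui": [0.85, 0.15, 0.05],
    "case": [0.00, 0.20, 0.90],
}

def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def semantics_tend_to_be_same(w1, w2, threshold=0.9):
    """Treat two words as semantic matches when their cosine similarity
    exceeds a threshold (the threshold value is illustrative)."""
    return cosine(VECTORS[w1], VECTORS[w2]) >= threshold

print(semantics_tend_to_be_same("today", "aujourd'hui"))  # True
print(semantics_tend_to_be_same("today", "case"))         # False
```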
In some embodiments, the method of generating a sample further comprises: matching the words in the second-language corpus of the primary sample with semantically similar words in the first-language corpus; and taking the primary sample as the first sample when the words of the first-language corpus all match words of the second language and/or the words of the second-language corpus all match words of the first language.
After the semantics of the first-language corpus in the primary sample are ensured to be complete, the semantics of the second-language corpus also need to be ensured to be complete. Thus, the words in the second-language corpus can be matched in the first-language corpus; when every word in the second-language corpus matches a semantically equivalent word in the first-language corpus, the semantics of the second-language corpus are complete. When the words of the two corpora can all be successfully matched and no unmatched word remains in either corpus, the primary sample is taken as the first sample. Specifically, for example, the primary sample is "play great wall - visit the great wall". After word segmentation and removal of some function words without actual semantics, it becomes [["play", "great wall"], ["visit", "the", "great", "wall"]]. Here "play" matches "visit", and "great wall" matches "the", "great" and "wall". At this time, all words in the primary sample are successfully matched and no word is left unmatched, so the primary sample can be formed into the first sample, ensuring to a certain extent the semantic integrity of the first sample formed from the primary sample.
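The bidirectional coverage check for the "play great wall - visit the great wall" example can be sketched as below. The alignment-pair format mirrors what alignment tools such as fast_align emit; treating it as a set of index pairs is an assumption for this sketch.

```python
def fully_matched(src_tokens, tgt_tokens, alignment):
    """Return True when every source token is aligned to at least one
    target token and every target token to at least one source token."""
    src_covered = {i for i, _ in alignment}
    tgt_covered = {j for _, j in alignment}
    return (src_covered == set(range(len(src_tokens)))
            and tgt_covered == set(range(len(tgt_tokens))))

src = ["play", "great wall"]
tgt = ["visit", "the", "great", "wall"]
# "play" <-> "visit"; "great wall" <-> "the" / "great" / "wall"
alignment = {(0, 0), (1, 1), (1, 2), (1, 3)}
print(fully_matched(src, tgt, alignment))  # True: primary sample becomes a first sample

# With "today" on the source side but nothing aligned to it, the check fails.
shifted = {(i + 1, j) for i, j in alignment}
print(fully_matched(["today"] + src, tgt, shifted))  # False
```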
In some embodiments, the method of generating a sample further comprises replacing a portion of the vocabulary in the first sample with a specified identifier to form the second sample.
By replacing part of the vocabulary in the first sample with the designated identifier, not only can a second sample be formed, but the resulting second sample also enhances the robustness of the model when used for training.
The designated identifier may be predetermined and used to replace part of the vocabulary in the first sample. For example, the designated identifier may be "[MASK]", "$", "__", or the like.
The part of the vocabulary replaced in the first sample may be one word or several words. The words may be those at specific positions in the corpus; for example, words whose position number is an integer multiple of 10 may be replaced with the designated identifier. Of course, words may also be selected at random from the corpus with a preset probability.
Preferably, the words replaced by the identifier should all come from the first-language corpus or all from the second-language corpus, and the corpus whose words have been replaced with the designated identifier should be used as the input of the training model. Specifically, for example, when the first sample is "no confirmed cases today", the word-segmentation result may be [["today", "no", "confirmed", "case"], ["no", "confirmed", "cases", "today"]], and part of the vocabulary in the first sample may be replaced with the designated identifier "[MASK]" to form the second sample, e.g. replacing "today" to obtain [["[MASK]", "no", "confirmed", "case"], ["no", "confirmed", "cases", "today"]]. At this time, ["[MASK]", "no", "confirmed", "case"] may be used as the input of the training model and ["no", "confirmed", "cases", "today"] as the label of the training model. The corpus whose words have been replaced by the designated identifier can be regarded as containing a certain amount of interference information; during training, the model learns the replaced words through context, which enhances its robustness.
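Both replacement strategies, position-based and probability-based, can be sketched as follows. The function names and the 15% default probability are illustrative assumptions.

```python
import random

def mask_positions(tokens, positions, mask="[MASK]"):
    """Replace the tokens at the given positions with the designated identifier."""
    return [mask if i in positions else tok for i, tok in enumerate(tokens)]

def mask_randomly(tokens, prob=0.15, mask="[MASK]", rng=None):
    """Replace each token with the identifier with probability `prob`."""
    rng = rng or random.Random()
    return [mask if rng.random() < prob else tok for tok in tokens]

src = ["today", "no", "confirmed", "case"]
tgt = ["no", "confirmed", "cases", "today"]
model_input = mask_positions(src, {0})  # replace "today"
model_label = tgt                       # target side kept intact as the label
print(model_input)  # ['[MASK]', 'no', 'confirmed', 'case']
```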
In some embodiments, part of the vocabulary in the first sample may also be replaced with the identifier by calling the masking method of a masked language model. For example, words at masked positions in the first sample can be replaced at random, with a certain probability, by the masking method of the BioBERT biomedical pre-trained language model.
In some embodiments, the step of obtaining candidate words having a near-synonym relationship with the vocabulary included in the first sample comprises: determining a proper noun in the first sample; acquiring a knowledge graph containing the proper noun; and determining, based on the knowledge graph, candidate words having a near-synonym relationship with the proper noun.
When forming the second sample from candidate words in a near-synonym relationship, the degree of semantic similarity between the candidate word and the word in the first sample matters, and it has a certain influence on the performance of the trained model. In some specialized fields in particular, it is difficult to obtain good near-synonym candidates. Therefore, the candidate words may instead be obtained from a knowledge graph.
The proper noun may belong to a general field, such as a person's name, place name, country name, landmark name or object name. Of course, it may also be a term specific to a particular field, for example a medical term such as amoxicillin or hemoglobin.
Proper nouns may be determined from a preset proper-noun lexicon: when a word present in the lexicon is matched, it is marked as a proper noun. Of course, a named-entity-recognition model may also be trained on samples, and the proper nouns are then easily obtained by feeding information of the first sample, such as the corpus and part-of-speech tags, into that model.
The knowledge graph is a semantic network in which the relationships between different entities are recorded. The knowledge graph can be a knowledge graph of a general field and can also be a knowledge graph of a specific field. The knowledge graph can be stored locally in the server, or can be stored in an external server, and the local server requests the external server to obtain the knowledge graph when in use. Of course, the knowledge graph may also be constructed by the server in real time from the corpus.
Candidate words having a near-synonym relationship with the proper noun are determined in the knowledge graph; the closest such candidate can be found directly in the graph. Alternatively, several words in a near-synonym relationship with the proper noun may be obtained from the knowledge graph, and the word semantically closest to the proper noun then selected from them as the final candidate word. For example, the similarity between each word and the proper noun may be computed through word vectors, and the word with the highest similarity selected as the candidate. Of course, several words may also all be used as candidates, generating as many second samples as there are candidate words. In some embodiments, the target field of the corpus may be determined from the context of the proper noun, and a knowledge graph of the corresponding field obtained for matching the candidate words.
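Selecting the final candidate from several knowledge-graph near-synonyms might look like this. The toy graph and the character-overlap similarity are stand-ins for a real knowledge graph and a word-vector similarity.

```python
# Toy knowledge graph: proper noun -> words linked by a near-synonym relation.
KG_NEAR_SYNONYMS = {
    "amoxicillin": ["penicillin antibiotic", "amoxycillin"],
}

def char_overlap(a, b):
    """Stand-in similarity: Jaccard overlap of character sets."""
    return len(set(a) & set(b)) / len(set(a) | set(b))

def best_candidate(term, similarity=char_overlap):
    """Pick the near-synonym with the highest similarity to the proper noun."""
    candidates = KG_NEAR_SYNONYMS.get(term, [])
    return max(candidates, key=lambda w: similarity(term, w)) if candidates else None

print(best_candidate("amoxicillin"))  # amoxycillin
```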
Referring to fig. 6, in some embodiments, a sample generation apparatus 1000 may be provided, including: a sample receiving module 1100, a candidate word obtaining module 1200 and a candidate word replacing module 1300.
A sample receiving module 1100, configured to receive a first sample formed by corpora of a first language and corpora of a second language; wherein the corpora included in the first sample tend to express the same semantics; the corpus comprises at least one vocabulary.
A candidate word obtaining module 1200, configured to obtain a candidate word having a similar meaning word relationship with the vocabulary included in the first sample.
A candidate word replacing module 1300, configured to replace a vocabulary of a corresponding near-meaning word relationship in the first sample with the candidate word to form a second sample; wherein the second sample comprises a corpus of a first language and a corpus of a second language.
The specific functions and effects achieved by the sample generation device can be explained by referring to other embodiments in this specification, and are not described herein again. The various modules in the sample generation apparatus may be implemented in whole or in part by software, hardware, and combinations thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In some embodiments, a computer device may be provided, comprising a memory having a computer program stored therein and a processor that implements the method steps of the embodiments when executing the computer program.
In some embodiments, a computer-readable storage medium may be provided, on which a computer program is stored; when executed by a processor, the program implements the method steps in the embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments may be implemented by a computer program instructing the relevant hardware; the program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the method embodiments. Any reference to memory, storage, a database or another medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).
It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The embodiments of this specification are described in a progressive manner, each embodiment focusing on its differences from the others. After reading this specification, one skilled in the art will appreciate that many of the disclosed embodiments and features can be combined in different ways; for the sake of brevity, not all possible combinations of features are described. However, as long as a combination of these technical features contains no contradiction, it should be considered within the scope described by this specification.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above description is only an embodiment of the present disclosure, and is not intended to limit the scope of the claims of the present disclosure. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present application shall be included in the scope of the claims of the present application.
Claims (11)
1. A method of generating a sample, comprising:
receiving a first sample formed by a corpus of a first language and a corpus of a second language; wherein the corpora included in the first sample tend to express the same semantics; the corpus comprises at least one vocabulary;
acquiring candidate words having a similar word relationship with the vocabulary included in the first sample;
replacing the vocabulary of the corresponding similar word relation in the first sample by the candidate word to form a second sample; wherein the second sample comprises a corpus of a first language and a corpus of a second language.
2. The method of claim 1, further comprising:
receiving a first corpus; wherein the first corpus belongs to a first language;
and acquiring a second corpus which adopts a second language to express the same semantic meaning as the first corpus, wherein the first corpus and the second corpus form the first sample.
3. The method of claim 2, wherein the first sample is used to train a training model that translates the second language into the first language.
4. The method of claim 1, further comprising:
obtaining a primary sample; wherein the primary sample comprises a corpus formed in the first language and the second language;
calculating a confusion index of the corpus of the primary sample; wherein the confusion index is used for representing the semantic quality of the corpus;
and taking the primary sample with the confusion index smaller than a set threshold value as the first sample.
5. The method of claim 1, further comprising:
obtaining a primary sample; wherein the primary sample comprises a corpus formed in the first language and the second language;
matching vocabularies in the linguistic data of the first language of the primary sample with vocabularies with similar semantemes in the linguistic data of the second language;
and taking the primary sample as the first sample when all the words in the language material of the first language are matched with the words in the language material of the second language.
6. The method of claim 5, further comprising:
matching vocabularies in the linguistic data of the second language of the primary sample with vocabularies with similar semantemes in the linguistic data of the first language;
and taking the primary sample as the first sample under the condition that all the words in the linguistic data of the first language are matched with the words in the linguistic data of the second language and/or all the words in the linguistic data of the second language are matched with the words in the linguistic data of the first language.
7. The method of claim 1, further comprising:
and replacing part of words in the first sample with a specified identifier to form the second sample.
8. The method according to claim 1, wherein the step of obtaining the candidate words having a similar word relationship to the vocabulary included in the first sample comprises:
determining a proper noun for the first sample;
acquiring a knowledge graph with the proper nouns;
and determining candidate words having a similar word relationship with the proper nouns based on the knowledge graph.
9. An apparatus for generating a sample, comprising:
a sample receiving module, configured to receive a first sample formed by a corpus of a first language and a corpus of a second language; wherein the corpora included in the first sample tend to express the same semantics; the corpus comprises at least one vocabulary;
a candidate word obtaining module, configured to obtain candidate words having a similar meaning word relationship with the vocabulary included in the first sample;
the candidate word replacing module is used for replacing the vocabulary of the corresponding similar meaning word relation in the first sample by using the candidate word to form a second sample; wherein the second sample comprises a corpus of a first language and a corpus of a second language.
10. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the method of any one of claims 1 to 8 when executing the computer program.
11. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111308216.0A CN114021589A (en) | 2021-11-05 | 2021-11-05 | Sample generation method and device, computer equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111308216.0A CN114021589A (en) | 2021-11-05 | 2021-11-05 | Sample generation method and device, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114021589A true CN114021589A (en) | 2022-02-08 |
Family
ID=80061773
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111308216.0A Pending CN114021589A (en) | 2021-11-05 | 2021-11-05 | Sample generation method and device, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114021589A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114757211A (en) * | 2022-03-16 | 2022-07-15 | 广州华多网络科技有限公司 | Text translation method and device, equipment, medium and product thereof |
- 2021-11-05 CN CN202111308216.0A patent/CN114021589A/en active Pending
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114757211A (en) * | 2022-03-16 | 2022-07-15 | 广州华多网络科技有限公司 | Text translation method and device, equipment, medium and product thereof |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR101130444B1 (en) | System for identifying paraphrases using machine translation techniques | |
Fonseca et al. | Evaluating word embeddings and a revised corpus for part-of-speech tagging in Portuguese | |
US20140163951A1 (en) | Hybrid adaptation of named entity recognition | |
CN110543644A (en) | Machine translation method and device containing term translation and electronic equipment | |
US20100088085A1 (en) | Statistical machine translation apparatus and method | |
US11170169B2 (en) | System and method for language-independent contextual embedding | |
JP2011118689A (en) | Retrieval method and system | |
CN112860896A (en) | Corpus generalization method and man-machine conversation emotion analysis method for industrial field | |
CN113157959A (en) | Cross-modal retrieval method, device and system based on multi-modal theme supplement | |
CN117251524A (en) | Short text classification method based on multi-strategy fusion | |
CN108491399B (en) | Chinese-English machine translation method based on context iterative analysis | |
CN112185361B (en) | Voice recognition model training method and device, electronic equipment and storage medium | |
CN118296120A (en) | Large-scale language model retrieval enhancement generation method for multi-mode multi-scale multi-channel recall | |
CN114021589A (en) | Sample generation method and device, computer equipment and storage medium | |
CN112949293A (en) | Similar text generation method, similar text generation device and intelligent equipment | |
CN118246412A (en) | Text color training data screening method and device, related equipment and computer program product | |
CN117574924A (en) | Translation model training method, translation device, electronic equipment and medium | |
CN106776590A (en) | A kind of method and system for obtaining entry translation | |
CN112085985B (en) | Student answer automatic scoring method for English examination translation questions | |
Ramesh et al. | Interpretable natural language segmentation based on link grammar | |
CN115796194A (en) | English translation system based on machine learning | |
Shekhar et al. | Computational linguistic retrieval framework using negative bootstrapping for retrieving transliteration variants | |
CN114896973A (en) | Text processing method and device and electronic equipment | |
CN114528861A (en) | Foreign language translation training method and device based on corpus | |
CN114239555A (en) | Training method of keyword extraction model and related device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |