Detailed Description
The present disclosure is described in further detail below with reference to the accompanying drawings and embodiments. It is to be understood that the specific embodiments described herein merely illustrate the relevant invention and do not limit it. It should also be noted that, for convenience of description, only the portions related to the relevant invention are shown in the drawings.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 shows an exemplary system architecture 100 to which embodiments of the disclosed method for word segmentation or apparatus for word segmentation may be applied.
As shown in Fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links or fiber optic cables.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages and the like. The terminal devices 101, 102, 103 may have various client applications installed thereon, such as a web browser application, a shopping application, a search application, an instant messaging tool, an email client, social platform software, and the like.
The terminal devices 101, 102, 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices, including but not limited to smart phones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they may be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. No particular limitation is imposed here.
The server 105 may be a server that provides various services, such as an information processing server that processes text transmitted by the terminal devices 101, 102, 103. The information processing server may analyze and otherwise process received data such as the text to be segmented, and obtain a processing result (e.g., a Chinese character vocabulary sequence).
It should be noted that the method for word segmentation provided by the embodiment of the present disclosure may be executed by the terminal devices 101, 102, and 103, or may be executed by the server 105, and accordingly, the apparatus for word segmentation may be disposed in the terminal devices 101, 102, and 103, or may be disposed in the server 105.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. No particular limitation is imposed here.
It should be understood that the numbers of terminal devices, networks, and servers in Fig. 1 are merely illustrative. There may be any number of terminal devices, networks, and servers, as required by the implementation. When the data used in generating the Chinese character vocabulary sequence corresponding to the text to be segmented does not need to be acquired remotely, the system architecture may omit the network and include only the terminal device or the server.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for word segmentation in accordance with the present disclosure is shown. The method for word segmentation comprises the following steps:
Step 201: obtain a text to be segmented.
In this embodiment, an execution subject of the method for word segmentation (for example, the server 105 shown in Fig. 1) may obtain the text to be segmented from a remote or local source through a wired or wireless connection. The text to be segmented is the text on which word segmentation is to be performed.
In practice, word segmentation here refers to Chinese word segmentation. The text to be segmented may therefore be a text in the form of Chinese characters. Through word segmentation, the text to be segmented can be divided into one or more Chinese character words.
In this embodiment, the text to be segmented may be a text input by a user using a user terminal. Specifically, the text to be segmented may be a text directly input by the user, or may be a text converted from speech input by the user.
Step 202: convert the Chinese characters in the text to be segmented into pinyin to obtain a pinyin text to be segmented.
In this embodiment, based on the text to be segmented obtained in step 201, the execution subject may convert the Chinese characters in the text to be segmented into pinyin to obtain a pinyin text to be segmented.
Specifically, the execution subject may pre-store a dictionary indicating the correspondence between Chinese characters and pinyin, and may convert the Chinese characters in the text to be segmented into pinyin based on the dictionary to obtain the pinyin text to be segmented.
As an example, if the text to be segmented obtained in step 201 is a Chinese text meaning "what are the specialty dishes of Fujian", the execution subject may convert the Chinese characters in it into pinyin to obtain the pinyin text to be segmented "fu jian de te se cai you na xie".
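The dictionary-based conversion of step 202 can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the name `CHAR_TO_PINYIN`, the function `to_pinyin_tokens`, and the Chinese characters themselves (reconstructed here from the pinyin of the example) are assumptions, and characters with more than one reading are ignored.

```python
# Hypothetical character-to-pinyin dictionary covering only this example.
# The Chinese characters are an assumed reconstruction from the example pinyin.
CHAR_TO_PINYIN = {
    "福": "fu", "建": "jian", "的": "de", "特": "te", "色": "se",
    "菜": "cai", "有": "you", "哪": "na", "些": "xie",
}


def to_pinyin_tokens(text):
    """Convert each Chinese character to pinyin; keep unknown characters as-is."""
    return [CHAR_TO_PINYIN.get(ch, ch) for ch in text]


pinyin_text = " ".join(to_pinyin_tokens("福建的特色菜有哪些"))
print(pinyin_text)  # fu jian de te se cai you na xie
```

Keeping one pinyin token per character preserves the position correspondence that later steps rely on when segmentation positions are mapped back to the Chinese text.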
Step 203: in response to the pinyin text to be segmented including a target pinyin, select, from a preset mark set, a preset mark that represents the target pinyin and the fuzzy sound corresponding to the target pinyin.
In this embodiment, for the pinyin text to be segmented obtained in step 202, the execution subject may, in response to the pinyin text to be segmented including a target pinyin, select from a preset mark set a preset mark that represents the target pinyin and the fuzzy sound corresponding to the target pinyin. Here, a target pinyin is a pinyin that has a corresponding fuzzy sound. The fuzzy sound corresponding to a pinyin may be another pinyin that is not readily distinguishable from it.
In practice, Mandarin may differ from dialects in how a character is pronounced. For example, for the character meaning "good fortune", the Mandarin pinyin is "fu", while the pinyin in some dialects is "hu". Here, the pinyin "fu" and the pinyin "hu" are fuzzy sounds of each other ("hu" is the fuzzy sound corresponding to "fu", and "fu" is the fuzzy sound corresponding to "hu").
In addition, in internet slang, some words are deliberately miswritten to convey a certain tone (e.g., a teasing one). For example, the character meaning "eat" may be written as a similar-sounding character meaning "this". Here, the pinyin "chi" corresponding to "eat" and the pinyin "ci" corresponding to "this" are fuzzy sounds of each other ("ci" is the fuzzy sound corresponding to "chi", and "chi" is the fuzzy sound corresponding to "ci").
It should be noted that, in practice, fuzzy sounds may also arise in other application scenarios (for example, recognizing and correcting wrongly written characters in a text), which are not described in detail here.
Specifically, the execution subject may determine in advance the features of pinyins that are fuzzy sounds of each other (for example, a pinyin including the syllable "f" and a pinyin including the syllable "h" are fuzzy sounds of each other), and then examine the pinyin text to be segmented based on those features to determine whether it includes a pinyin matching the features. If so, the execution subject determines that the pinyin text to be segmented includes a target pinyin (i.e., the pinyin matching the features).
As an example, if the pinyin text to be segmented is "fu jian de te se cai you na xie", then, since a pinyin including the syllable "f" and a pinyin including the syllable "h" are fuzzy sounds of each other, the execution subject may, in response to the pinyin text to be segmented "fu jian de te se cai you na xie" including the target pinyin "fu", select from the preset mark set the preset mark that represents the target pinyin and its corresponding fuzzy sound.
In this embodiment, the preset mark set may be a predetermined set of marks for representing pinyins that are fuzzy sounds of each other. Specifically, each preset mark in the preset mark set may be used to mark a pair of pinyins that are fuzzy sounds of each other. The preset marks may take various forms. In particular, so as to match the form of pinyin, the preset marks may be letters or combinations of letters.
As an example, the preset mark set may include "hfu" and "chi", where "hfu" may be used to represent "fu" and "hu", which are fuzzy sounds of each other, and "chi" may be used to represent "chi" and "ci", which are fuzzy sounds of each other. Then, for the target pinyin "fu" in the pinyin text to be segmented "fu jian de te se cai you na xie", the execution subject may select, from the preset mark set "hfu; chi", the preset mark "hfu", which represents the target pinyin "fu" and its fuzzy sound "hu".
Step 204: replace the target pinyin in the pinyin text to be segmented with the selected preset mark to obtain a new pinyin text to be segmented.
In this embodiment, based on the preset mark obtained in step 203, the execution subject may replace the target pinyin in the pinyin text to be segmented with the preset mark to obtain a new pinyin text to be segmented.
Continuing with the above example, for the pinyin text to be segmented "fu jian de te se cai you na xie" and the preset mark "hfu" selected for the target pinyin "fu", the execution subject may replace "fu" with "hfu" to obtain the new pinyin text to be segmented "hfu jian de te se cai you na xie".
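The selection in step 203 can be sketched as a lookup from each pinyin to the preset mark that covers it. The marks and pinyin pairs below are taken from the example; the dictionary layout and the names `PRESET_MARKS`, `PINYIN_TO_MARK`, and `select_marks` are assumptions made for illustration only.

```python
# Preset mark set: each mark stands for a pair of mutually fuzzy pinyins.
# The marks and pairs come from the example; the data layout is an assumption.
PRESET_MARKS = {
    "hfu": ("fu", "hu"),
    "chi": ("chi", "ci"),
}

# Reverse index from a pinyin to the preset mark that represents it.
PINYIN_TO_MARK = {p: mark for mark, pair in PRESET_MARKS.items() for p in pair}


def select_marks(pinyin_tokens):
    """Return {position: preset mark} for every target pinyin in the tokens."""
    return {
        i: PINYIN_TO_MARK[p]
        for i, p in enumerate(pinyin_tokens)
        if p in PINYIN_TO_MARK
    }


tokens = "fu jian de te se cai you na xie".split()
print(select_marks(tokens))  # {0: 'hfu'}
```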
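A minimal sketch of the replacement in step 204 follows, assuming the pinyin text is a space-separated string and that a hypothetical mapping `PINYIN_TO_MARK` from target pinyins to preset marks is available. It also shows that a Mandarin spelling and its fuzzy-sound variant yield the same new pinyin text.

```python
# Hypothetical mapping from each target pinyin to its preset mark.
PINYIN_TO_MARK = {"fu": "hfu", "hu": "hfu", "chi": "chi", "ci": "chi"}


def replace_targets(pinyin_text):
    """Replace every target pinyin in a space-separated pinyin text with its mark."""
    return " ".join(PINYIN_TO_MARK.get(p, p) for p in pinyin_text.split())


print(replace_targets("fu jian de te se cai you na xie"))
# hfu jian de te se cai you na xie
print(replace_targets("hu jian de te se cai you na xie"))
# hfu jian de te se cai you na xie
```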
Step 205: input the new pinyin text to be segmented into a pre-trained word segmentation model to obtain a pinyin vocabulary sequence.
In this embodiment, based on the new pinyin text to be segmented obtained in step 204, the execution subject may input the new pinyin text to be segmented into a pre-trained word segmentation model to obtain a pinyin vocabulary sequence.
Here, the word segmentation model may be used to represent the correspondence between pinyin texts that include preset marks and pinyin vocabulary sequences. Specifically, the word segmentation model may be trained in various ways based on existing models for language processing (e.g., CRF (Conditional Random Field), HMM (Hidden Markov Model), and BiLSTM + CRF).
In some optional implementations of this embodiment, the word segmentation model may be obtained through training by the execution subject or another electronic device via the following steps:
First step: obtain a second sample text set and, for each second sample text in the second sample text set, a predetermined second sample Chinese character vocabulary sequence.
Here, the second sample text set may include a plurality of second sample texts for training the word segmentation model. A second sample text may be a text in the form of Chinese characters. In practice, a second sample text in the second sample text set may be segmented by manual labeling to obtain its second sample Chinese character vocabulary sequence, or may be segmented using an existing model for segmenting Chinese character texts.
Second step: for each second sample text in the second sample text set, perform the following: convert the Chinese characters in the second sample text into pinyin to obtain a second sample pinyin text; in response to the second sample pinyin text including a target pinyin, select, from the preset mark set, a preset mark that represents the target pinyin in the second sample pinyin text and the fuzzy sound corresponding to the target pinyin; replace the target pinyin in the second sample pinyin text with the selected preset mark to obtain a new second sample pinyin text; and segment the new second sample pinyin text based on the second sample Chinese character vocabulary sequence corresponding to the second sample text to obtain a second sample pinyin vocabulary sequence.
Here, the new second sample pinyin text may be obtained in the same way as the new pinyin text to be segmented described above, which is not repeated here.
Based on the second sample Chinese character vocabulary sequence, the execution subject or another electronic device may segment the new second sample pinyin text to obtain the second sample pinyin vocabulary sequence. Specifically, the execution subject or other electronic device may cut the new second sample pinyin text at the positions corresponding to the segmentation positions of the second sample Chinese character vocabulary sequence, so as to obtain the second sample pinyin vocabulary sequence.
As an example, suppose the second sample Chinese character vocabulary sequence is "Fujian; of; specialty dishes; has; which", whose segmentation positions are after the second character, after the third character, after the sixth character, and after the seventh character. Then the new second sample pinyin text "hfu jian de te se cai you na xie" may be cut after the second, third, sixth, and seventh pinyin to obtain the second sample pinyin vocabulary sequence "hfu jian; de; te se cai; you; na xie".
Third step: using the obtained new second sample pinyin text as input, and the second sample pinyin vocabulary sequence corresponding to the input new second sample pinyin text as expected output, train to obtain the word segmentation model.
Here, the execution subject may use a machine learning method to train an initial model (e.g., a CRF model), taking the obtained new second sample pinyin text as input and the second sample pinyin vocabulary sequence corresponding to the input new second sample pinyin text as expected output, and may use the trained initial model as the word segmentation model.
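Transferring the segmentation positions from the Chinese character vocabulary sequence to the pinyin text can be sketched as below. The sketch assumes exactly one pinyin token per Chinese character; the function name `project_segmentation` and the Chinese characters (reconstructed from the example pinyin) are illustrative assumptions.

```python
def project_segmentation(hanzi_words, pinyin_tokens):
    """Cut the pinyin tokens at the same positions as the Chinese character words.

    Assumes one pinyin token per Chinese character.
    """
    if sum(len(word) for word in hanzi_words) != len(pinyin_tokens):
        raise ValueError("expected one pinyin token per Chinese character")
    pinyin_words, start = [], 0
    for word in hanzi_words:
        end = start + len(word)
        pinyin_words.append(" ".join(pinyin_tokens[start:end]))
        start = end
    return pinyin_words


# Assumed reconstruction of the sample Chinese character vocabulary sequence.
hanzi_words = ["福建", "的", "特色菜", "有", "哪些"]
tokens = "hfu jian de te se cai you na xie".split()
print(project_segmentation(hanzi_words, tokens))
# ['hfu jian', 'de', 'te se cai', 'you', 'na xie']
```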
Specifically, the new second sample pinyin text obtained based on the second step may be input into the initial model to obtain the actual output. Then, a gradient descent method and a back propagation method can be adopted, parameters of the initial model are adjusted based on actual output and a second sample pinyin vocabulary sequence (namely expected output) corresponding to the input new second sample pinyin text, the initial model obtained after each parameter adjustment is used as the initial model for next training, and the training is finished under the condition that a preset training finishing condition is met, so that the trained initial model (namely a word segmentation model) is obtained.
It should be noted that the preset training end condition may include, but is not limited to, at least one of the following: the training time exceeds a preset duration; the number of training iterations exceeds a target number; the loss value of the actual output relative to the expected output (i.e., the second sample pinyin vocabulary sequence), calculated using a preset loss function (e.g., a cross-entropy loss function), is smaller than a preset loss threshold. Here, the target number may be a preset number, or may be the number of training iterations needed to use all of the obtained new second sample pinyin texts for training.
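The three end conditions can be combined into a single check, sketched below. The toy loop is only a stand-in for model training (plain gradient descent on a one-variable quadratic loss), and all names and threshold values are assumptions chosen for illustration.

```python
import time


def should_stop(elapsed_seconds, steps_done, loss, *, max_seconds, max_steps, loss_threshold):
    """Return True if at least one of the three preset end conditions is met."""
    return (
        elapsed_seconds > max_seconds
        or steps_done > max_steps
        or loss < loss_threshold
    )


# Toy stand-in for model training: gradient descent on loss(w) = (w - 3)^2.
w, learning_rate = 0.0, 0.1
steps_done = 0
loss = (w - 3.0) ** 2
start = time.monotonic()
while not should_stop(time.monotonic() - start, steps_done, loss,
                      max_seconds=5.0, max_steps=1000, loss_threshold=1e-3):
    gradient = 2.0 * (w - 3.0)      # derivative of the loss with respect to w
    w -= learning_rate * gradient   # parameter adjustment
    steps_done += 1
    loss = (w - 3.0) ** 2

print(steps_done, loss)
```

In this toy run the loss threshold is the condition that ends training; with a harder problem or a tighter threshold, the time or iteration limits would end it instead.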
It should be noted that, in the word segmentation model of the present disclosure, if the sample pinyin text including the preset mark is used for training and the sample pinyin text not including the preset mark is also used for training during the training process, the trained word segmentation model may also be used to represent the correspondence between the pinyin text not including the preset mark and the pinyin vocabulary sequence (i.e., the pinyin text not including the preset mark is segmented into the pinyin vocabulary sequence).
Step 206: segment the text to be segmented based on the pinyin vocabulary sequence to obtain a Chinese character vocabulary sequence.
In this embodiment, based on the pinyin vocabulary sequence obtained in step 205, the execution subject may segment the text to be segmented to obtain a Chinese character vocabulary sequence.
Specifically, the execution subject may cut the text to be segmented at the positions corresponding to the segmentation positions of the pinyin vocabulary sequence, so as to obtain the Chinese character vocabulary sequence.
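Step 206 mirrors the position transfer in the other direction: the word boundaries found in the pinyin vocabulary sequence are applied to the original Chinese text. The sketch below assumes each pinyin word is a space-separated string with one token per Chinese character; `segment_hanzi` and the example characters are illustrative assumptions.

```python
def segment_hanzi(text, pinyin_words):
    """Cut the Chinese text at the word boundaries of the pinyin vocabulary sequence.

    Each pinyin word is a space-separated string with one token per character.
    """
    hanzi_words, start = [], 0
    for pinyin_word in pinyin_words:
        end = start + len(pinyin_word.split())
        hanzi_words.append(text[start:end])
        start = end
    if start != len(text):
        raise ValueError("pinyin vocabulary sequence does not cover the text")
    return hanzi_words


# Assumed reconstruction of the example text from its pinyin.
pinyin_words = ["hfu jian", "de", "te se cai", "you", "na xie"]
print(segment_hanzi("福建的特色菜有哪些", pinyin_words))
# ['福建', '的', '特色菜', '有', '哪些']
```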
With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the method for word segmentation according to the present embodiment.
In the application scenario of Fig. 3, the server 301 may first obtain the text to be segmented 302 (e.g., a Chinese text that reads, word for word, "weather grey-often clear").
Then, the server 301 may convert the Chinese characters in the text to be segmented 302 into pinyin to obtain the pinyin text to be segmented 303 (e.g., "tian qi hui chang qing lang").
Next, in response to the pinyin text to be segmented 303 including a target pinyin (e.g., "hui"), the server 301 may select a preset mark 3041 (e.g., "hfui") from the preset mark set 304, where the preset mark 3041 represents the target pinyin and the fuzzy sound (e.g., "fei") corresponding to the target pinyin, and the target pinyin is a pinyin that has a corresponding fuzzy sound.
Then, the server 301 may replace the target pinyin in the pinyin text to be segmented 303 with the selected preset mark 3041 to obtain the new pinyin text to be segmented 305 (e.g., "tian qi hfui chang qing lang").
Then, the server 301 may input the new pinyin text to be segmented 305 into a pre-trained word segmentation model 306 to obtain a pinyin vocabulary sequence 307 (e.g., "tian qi; hfui chang; qing lang"), where the word segmentation model 306 is used to represent the correspondence between pinyin texts that include preset marks and pinyin vocabulary sequences.
Finally, the server 301 may segment the text to be segmented 302 based on the pinyin vocabulary sequence 307 to obtain a Chinese character vocabulary sequence 308 (e.g., "weather; grey-often; clear").
The method provided by the embodiment of the disclosure can perform word segmentation on the text from the dimension of the pinyin, and can convert the pinyin corresponding to the fuzzy sound in the pinyin text into the preset mark which can be identified by the word segmentation model, so that the influence of the fuzzy sound on word segmentation can be reduced, and the accuracy of word segmentation is improved.
With further reference to FIG. 4, a flow 400 of yet another embodiment of a method for word segmentation is shown. The flow 400 of the method for word segmentation comprises the following steps:
Step 401: obtain a text to be segmented.
In this embodiment, an execution subject of the method for word segmentation (for example, the server 105 shown in Fig. 1) may obtain the text to be segmented from a remote or local source through a wired or wireless connection. The text to be segmented is the text on which word segmentation is to be performed.
Step 402: convert the Chinese characters in the text to be segmented into pinyin to obtain a pinyin text to be segmented.
In this embodiment, based on the text to be segmented obtained in step 401, the execution subject may convert the Chinese characters in the text to be segmented into pinyin to obtain a pinyin text to be segmented.
Step 403: in response to the pinyin text to be segmented including a target pinyin, select, from a preset mark set, a preset mark that represents the target pinyin and the fuzzy sound corresponding to the target pinyin.
In this embodiment, for the pinyin text to be segmented obtained in step 402, the execution subject may, in response to the pinyin text to be segmented including a target pinyin, select from a preset mark set a preset mark that represents the target pinyin and the fuzzy sound corresponding to the target pinyin. Here, a target pinyin is a pinyin that has a corresponding fuzzy sound. The fuzzy sound corresponding to a pinyin may be another pinyin that is not readily distinguishable from it.
In this embodiment, the preset mark set may be a predetermined set of marks for representing pinyins that are fuzzy sounds of each other.
Step 404: replace the target pinyin in the pinyin text to be segmented with the selected preset mark to obtain a new pinyin text to be segmented.
In this embodiment, based on the preset mark obtained in step 403, the execution subject may replace the target pinyin in the pinyin text to be segmented with the preset mark to obtain a new pinyin text to be segmented.
Step 405: input the new pinyin text to be segmented into a pre-trained word segmentation model to obtain a pinyin vocabulary sequence.
In this embodiment, based on the new pinyin text to be segmented obtained in step 404, the execution subject may input the new pinyin text to be segmented into a pre-trained word segmentation model to obtain a pinyin vocabulary sequence.
Here, the word segmentation model may be used to represent the correspondence between pinyin texts that include preset marks and pinyin vocabulary sequences.
Step 406: segment the text to be segmented based on the pinyin vocabulary sequence to obtain a Chinese character vocabulary sequence.
In this embodiment, based on the pinyin vocabulary sequence obtained in step 405, the execution subject may segment the text to be segmented to obtain a Chinese character vocabulary sequence.
Steps 401, 402, 403, 404, 405, and 406 may be performed in a manner similar to that of steps 201, 202, 203, 204, 205, and 206 in the foregoing embodiment, respectively, and the above description for steps 201, 202, 203, 204, 205, and 206 also applies to steps 401, 402, 403, 404, 405, and 406, which is not described herein again.
Step 407: obtain a pre-generated word list.
In this embodiment, the execution subject may obtain a pre-generated word list. The word list indicates the correspondence between sample pinyin words and sample Chinese character words. Specifically, a sample pinyin word in the word list is obtained by converting the corresponding sample Chinese character word into pinyin. It is understood that one sample pinyin word in the word list may correspond to multiple sample Chinese character words; for example, the sample pinyin word "zhi wu" may correspond to the sample Chinese character words meaning "plant", "job", "fabric", "pollution control", and so on.
In this embodiment, the word list may include target sample pinyin words, i.e., sample pinyin words that contain a preset mark. A target sample pinyin word represents the pinyin words that respectively correspond to the mutually fuzzy pinyins represented by the preset mark. The sample Chinese character words corresponding to a target sample pinyin word include the Chinese character words of each of the represented pinyin words.
As an example, the word list may include the target sample pinyin word "zhix wu", which contains the preset mark "zhix" and may be used to represent the pinyin word "zhi wu" and the pinyin word "zi wu", where "zhi wu" and "zi wu" are the pinyin words respectively corresponding to the mutually fuzzy pinyins "zhi" and "zi" represented by the preset mark "zhix". Accordingly, the sample Chinese character words corresponding to the target sample pinyin word "zhix wu" may include the words meaning "plant", "job", "fabric", "pollution control", "meridian", "self-thought", and so on.
In some optional implementations of this embodiment, the word list may be generated by the execution subject or another electronic device through the following steps:
First step: obtain a first sample text set and, for each first sample text in the first sample text set, a predetermined first sample Chinese character vocabulary sequence.
Here, the first sample text set may include a plurality of first sample texts for generating the word list. A first sample text may be a text in the form of Chinese characters. In practice, a first sample text in the first sample text set may be segmented by manual labeling to obtain its first sample Chinese character vocabulary sequence, or may be segmented using an existing model for segmenting Chinese character texts.
It should be noted that the first sample text set and the second sample text set may be the same or different.
Second step: for each first sample text in the first sample text set, perform the following: convert the Chinese characters in the first sample text into pinyin to obtain a first sample pinyin text; in response to the first sample pinyin text including a target pinyin, select, from the preset mark set, a preset mark that represents the target pinyin in the first sample pinyin text and the fuzzy sound corresponding to the target pinyin; replace the target pinyin in the first sample pinyin text with the selected preset mark to obtain a new first sample pinyin text; and segment the new first sample pinyin text based on the first sample Chinese character vocabulary sequence corresponding to the first sample text to obtain a first sample pinyin vocabulary sequence.
Here, the new first sample pinyin text may be obtained in the same way as the new pinyin text to be segmented described above, which is not repeated here.
Based on the first sample Chinese character vocabulary sequence, the execution subject or another electronic device may segment the new first sample pinyin text to obtain the first sample pinyin vocabulary sequence. Specifically, the execution subject or other electronic device may cut the new first sample pinyin text at the positions corresponding to the segmentation positions of the first sample Chinese character vocabulary sequence, so as to obtain the first sample pinyin vocabulary sequence. The first sample pinyin words in the resulting first sample pinyin vocabulary sequence correspond one to one with the first sample Chinese character words in the first sample Chinese character vocabulary sequence.
By way of example, suppose the first sample Chinese character vocabulary sequence is "Fujian; of; specialty dishes; has; which", whose segmentation positions are after the second character, after the third character, after the sixth character, and after the seventh character. Then the new first sample pinyin text "hfu jian de te se cai you na xie" may be cut after the second, third, sixth, and seventh pinyin to obtain the first sample pinyin vocabulary sequence "hfu jian; de; te se cai; you; na xie". Here, "Fujian" corresponds to "hfu jian"; "of" corresponds to "de"; "specialty dishes" corresponds to "te se cai"; "has" corresponds to "you"; and "which" corresponds to "na xie".
Third step: generate the word list from the obtained corresponding first sample Chinese character words and first sample pinyin words.
Specifically, the execution subject or other electronic device may combine the corresponding first sample Chinese character words and first sample pinyin words obtained from each first sample text in the first sample text set into the word list.
Step 408: perform a matching step for each Chinese character word in the Chinese character vocabulary sequence that corresponds to a preset mark.
In this embodiment, for a Chinese character word corresponding to a preset mark in the Chinese character vocabulary sequence obtained in step 406, the execution subject may perform the following matching step: based on the Chinese character word and its corresponding preset mark, determine from the word list the target sample pinyin word corresponding to the Chinese character word as the pinyin word for matching; and match the sample Chinese character words corresponding to the pinyin word for matching in the word list against the Chinese character word to obtain a matching result. Here, a Chinese character word corresponding to a preset mark is a Chinese character word whose corresponding pinyin word includes a target pinyin. The matching result indicates whether the sample Chinese character words corresponding to the pinyin word for matching include the Chinese character word, and may include but is not limited to at least one of the following: characters, numbers, symbols, images. As an example, the matching result may be the number "0" or "1", where "0" indicates that the sample Chinese character words corresponding to the pinyin word for matching do not include the Chinese character word, and "1" indicates that they do.
Specifically, as an example, for the Chinese character word "grey-often" corresponding to the preset mark "hfui" in the Chinese character vocabulary sequence "weather; grey-often; clear", the execution subject may first determine, from the word list, the target sample pinyin word "hfui chang" corresponding to that Chinese character word as the pinyin word for matching, based on the Chinese character word and its corresponding preset mark "hfui". Then, the execution subject may match each sample Chinese character word corresponding to the pinyin word for matching "hfui chang" in the word list against the Chinese character word and obtain a matching result. For example, if the sample Chinese character words corresponding to "hfui chang" in the word list are the words glossed as "extraordinary", "fertile", and "grey-often", they include the Chinese character word "grey-often" that corresponds to the preset mark in the Chinese character vocabulary sequence, so the matching result "1" may be generated, indicating that the sample Chinese character words corresponding to the pinyin word for matching include the Chinese character word.
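One possible shape for the word list is a mapping from each sample pinyin word to the set of sample Chinese character words that share it. The sketch below builds such a mapping from aligned sample sequences; the function name `build_word_list`, the data layout, and the Chinese characters (reconstructed from the glosses and pinyin in the examples) are assumptions.

```python
from collections import defaultdict


def build_word_list(samples):
    """Build {pinyin word: set of Chinese character words} from aligned samples.

    Each sample is a pair (hanzi_words, pinyin_words) aligned one to one.
    """
    word_list = defaultdict(set)
    for hanzi_words, pinyin_words in samples:
        if len(hanzi_words) != len(pinyin_words):
            raise ValueError("sequences must correspond one to one")
        for hanzi_word, pinyin_word in zip(hanzi_words, pinyin_words):
            word_list[pinyin_word].add(hanzi_word)
    return dict(word_list)


# Hypothetical samples; the pinyin words already carry preset marks.
samples = [
    (["福建", "的", "特色菜", "有", "哪些"],
     ["hfu jian", "de", "te se cai", "you", "na xie"]),
    (["植物"], ["zhix wu"]),   # "plant", Mandarin pinyin "zhi wu"
    (["子午"], ["zhix wu"]),   # "meridian", Mandarin pinyin "zi wu"
]
word_list = build_word_list(samples)
print(sorted(word_list["zhix wu"]))  # ['子午', '植物']
```

Because "zhi wu" and "zi wu" are both rewritten to "zhix wu" before the word list is built, their Chinese character words end up under the same entry.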
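The matching step can be sketched as follows, assuming the Chinese character vocabulary sequence and the pinyin vocabulary sequence from step 405 are aligned one to one, so that the pinyin word for matching is simply the aligned pinyin word. The names, the word list contents, and the Chinese characters (reconstructed from the Fig. 3 pinyin) are illustrative assumptions.

```python
def matching_step(hanzi_seq, pinyin_seq, word_list, preset_marks):
    """Return {Chinese character word: 0 or 1} for words whose pinyin has a preset mark.

    1 means the word list entry for the pinyin word contains the word; 0 means not.
    """
    results = {}
    for hanzi_word, pinyin_word in zip(hanzi_seq, pinyin_seq):
        if any(token in preset_marks for token in pinyin_word.split()):
            candidates = word_list.get(pinyin_word, set())
            results[hanzi_word] = 1 if hanzi_word in candidates else 0
    return results


# Hypothetical word list entry and reconstructed example sequences.
word_list = {"hfui chang": {"非常", "灰常"}}
hanzi_seq = ["天气", "灰常", "晴朗"]
pinyin_seq = ["tian qi", "hfui chang", "qing lang"]
print(matching_step(hanzi_seq, pinyin_seq, word_list, {"hfui"}))  # {'灰常': 1}
```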
In some optional implementations of this embodiment, the matching step may further include: in response to the obtained matching result indicating that the sample Chinese character words corresponding to the pinyin word for matching do not include the Chinese character word, selecting, from those sample Chinese character words, a sample Chinese character word similar to the Chinese character word; and replacing the Chinese character word in the Chinese character vocabulary sequence with the selected sample Chinese character word. A sample Chinese character word similar to the Chinese character word may be one whose similarity to the Chinese character word is greater than or equal to a preset similarity threshold, or may be the one most similar to the Chinese character word.
Through the implementation mode, the wrong words can be corrected while the words with the fuzzy sound characteristics in the text to be participled are corrected.
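As a non-limiting illustration, the selection and replacement described above may be sketched as follows. The disclosure does not fix a similarity measure in this passage; the sketch assumes the ratio computed by Python's standard `difflib.SequenceMatcher`, and the function names and the Latin-letter sample strings are hypothetical stand-ins for Chinese character vocabularies.

```python
from difflib import SequenceMatcher
from typing import List, Optional


def similarity(a: str, b: str) -> float:
    """Assumed similarity measure: difflib ratio in the range [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def select_similar(word: str, candidates: List[str],
                   threshold: Optional[float] = None) -> Optional[str]:
    """Pick the most similar candidate; if a threshold is given, the pick
    must also reach it, otherwise None is returned."""
    if not candidates:
        return None
    best = max(candidates, key=lambda c: similarity(word, c))
    if threshold is not None and similarity(word, best) < threshold:
        return None
    return best


def correct_sequence(words: List[str], index: int, candidates: List[str],
                     threshold: Optional[float] = None) -> List[str]:
    """Replace words[index] only when it is absent from the candidates."""
    corrected = list(words)
    if words[index] in candidates:
        return corrected  # matching result indicates a known vocabulary
    replacement = select_similar(words[index], candidates, threshold)
    if replacement is not None:
        corrected[index] = replacement
    return corrected


sequence = ["weather", "extraordinery", "clear"]
samples = ["extraordinary", "fertile"]
print(correct_sequence(sequence, 1, samples))
```

Passing a threshold corresponds to the first variant described above (similarity at or above a preset value), while omitting it corresponds to simply taking the most similar sample vocabulary.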
In practice, while the Chinese character vocabulary corresponding to the preset mark in the Chinese character vocabulary sequence obtained in step 406 is corrected based on the word list, other Chinese character vocabularies in the Chinese character vocabulary sequence may also be corrected, so as to make the text correction more comprehensive.
As can be seen from fig. 4, compared with the embodiment corresponding to fig. 2, the flow 400 of the method for word segmentation in the present embodiment highlights the step of matching the vocabulary having fuzzy-sound features in the text to be segmented against the pre-generated word list. Therefore, the scheme described in this embodiment can accurately locate, based on the preset mark, the vocabulary having fuzzy-sound features in the text to be segmented, and can determine, based on the word list, whether that vocabulary was entered by the user in error, thereby improving the accuracy and adaptability of text error correction (for example, the vocabulary having fuzzy-sound features in the text to be segmented "weather grayish normal clear" may not have been entered in error).
With further reference to fig. 5, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of an apparatus for word segmentation, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable in various electronic devices.
As shown in fig. 5, the apparatus 500 for word segmentation of the present embodiment includes: a first obtaining unit 501, a conversion unit 502, a selecting unit 503, a replacing unit 504, an input unit 505, and a word segmentation unit 506. The first obtaining unit 501 is configured to obtain a text to be segmented; the conversion unit 502 is configured to convert the Chinese characters in the text to be segmented into pinyin to obtain a pinyin text to be segmented; the selecting unit 503 is configured to, in response to the pinyin text to be segmented including a target pinyin, select from a preset mark set a preset mark for representing the target pinyin and the fuzzy sound corresponding to the target pinyin, wherein the target pinyin is a pinyin that has a corresponding fuzzy sound; the replacing unit 504 is configured to replace the target pinyin in the pinyin text to be segmented with the selected preset mark to obtain a new pinyin text to be segmented; the input unit 505 is configured to input the new pinyin text to be segmented into a pre-trained word segmentation model to obtain a pinyin vocabulary sequence, wherein the word segmentation model is used to represent the correspondence between a pinyin text including preset marks and a pinyin vocabulary sequence; and the word segmentation unit 506 is configured to segment the text to be segmented based on the pinyin vocabulary sequence to obtain a Chinese character vocabulary sequence.
In this embodiment, the first obtaining unit 501 of the apparatus 500 for word segmentation may obtain the text to be segmented from a remote or local location through a wired or wireless connection. The text to be segmented is the text on which word segmentation is to be performed.
In this embodiment, based on the text to be segmented obtained by the first obtaining unit 501, the conversion unit 502 may convert the Chinese characters in the text to be segmented into pinyin, so as to obtain a pinyin text to be segmented.
In this embodiment, for the pinyin text to be segmented obtained by the conversion unit 502, the selecting unit 503 may, in response to the pinyin text to be segmented including a target pinyin, select from the preset mark set a preset mark for representing the target pinyin and the fuzzy sound corresponding to the target pinyin. Here, the target pinyin is a pinyin that has a corresponding fuzzy sound. The fuzzy sound corresponding to a pinyin may be another pinyin that is not readily distinguishable from that pinyin.
In this embodiment, the preset mark set may be a predetermined set of marks, each of which represents pinyins that are fuzzy sounds of each other.
In this embodiment, based on the preset mark selected by the selecting unit 503, the replacing unit 504 may replace the target pinyin in the pinyin text to be segmented with the preset mark to obtain a new pinyin text to be segmented.
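As a non-limiting illustration, the selection and replacement of target pinyin may be sketched as follows. The preset mark "hfui" appears in the examples of this disclosure; the assumption that it represents the pinyins "hui" and "fei", the representation of a pinyin text as a list of syllables, and all function and variable names are made for this sketch only.

```python
from typing import Dict, FrozenSet, List

# Hypothetical preset mark set: each mark stands for a group of pinyins that
# are fuzzy sounds of each other. The membership of "hfui" is an assumption.
PRESET_MARKS: Dict[str, FrozenSet[str]] = {
    "hfui": frozenset({"hui", "fei"}),
}


def build_lookup(marks: Dict[str, FrozenSet[str]]) -> Dict[str, str]:
    """Invert the mark set into a pinyin -> preset mark table."""
    return {pinyin: mark for mark, group in marks.items() for pinyin in group}


def replace_target_pinyin(syllables: List[str],
                          marks: Dict[str, FrozenSet[str]]) -> List[str]:
    """Replace every target pinyin with its preset mark; keep the rest."""
    lookup = build_lookup(marks)
    return [lookup.get(s, s) for s in syllables]


# Assumed pinyin text to be segmented, one syllable per Chinese character.
pinyin_text = ["tian", "qi", "hui", "chang", "qing", "lang"]
print(replace_target_pinyin(pinyin_text, PRESET_MARKS))
```

Because both members of a fuzzy-sound group map to the same mark, the downstream model would see the same input regardless of which of the two pinyins the user actually typed.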
In this embodiment, based on the new pinyin text to be segmented obtained by the replacing unit 504, the input unit 505 may input the new pinyin text to be segmented into a pre-trained word segmentation model to obtain a pinyin vocabulary sequence.
Here, the word segmentation model may be used to represent the correspondence between the pinyin text including the preset marks and the pinyin vocabulary sequences.
In this embodiment, based on the pinyin vocabulary sequence obtained by the input unit 505, the word segmentation unit 506 may perform word segmentation on the text to be segmented to obtain a Chinese character vocabulary sequence.
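As a non-limiting illustration, the cooperation of the units 501 to 506 may be sketched as a single Python class. Everything named here is hypothetical: the character-to-pinyin table, the mark table, the Chinese example string (an assumed rendering of the example text), and in particular `toy_model`, which merely groups syllables in pairs and stands in for the pre-trained word segmentation model. The sketch also assumes one pinyin syllable per Chinese character.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

PinyinWord = List[str]  # one pinyin vocabulary: syllables and/or preset marks


@dataclass
class WordSegmentationApparatus:
    char_to_pinyin: Dict[str, str]                   # used by conversion unit 502
    pinyin_to_mark: Dict[str, str]                   # used by units 503 and 504
    model: Callable[[List[str]], List[PinyinWord]]   # used by input unit 505

    def segment(self, text: str) -> List[str]:
        # Conversion unit 502: Chinese characters -> pinyin text.
        syllables = [self.char_to_pinyin[ch] for ch in text]
        # Selecting unit 503 and replacing unit 504: target pinyin -> mark.
        marked = [self.pinyin_to_mark.get(s, s) for s in syllables]
        # Input unit 505: new pinyin text -> pinyin vocabulary sequence.
        pinyin_words = self.model(marked)
        if sum(len(w) for w in pinyin_words) != len(text):
            raise ValueError("pinyin vocabulary sequence does not align with text")
        # Word segmentation unit 506: cut the text at the same boundaries.
        words, pos = [], 0
        for w in pinyin_words:
            words.append(text[pos:pos + len(w)])
            pos += len(w)
        return words


def toy_model(marked: List[str]) -> List[PinyinWord]:
    """Stand-in for the trained model: groups syllables two at a time."""
    return [marked[i:i + 2] for i in range(0, len(marked), 2)]


apparatus = WordSegmentationApparatus(
    char_to_pinyin={"天": "tian", "气": "qi", "灰": "hui",
                    "常": "chang", "晴": "qing", "朗": "lang"},
    pinyin_to_mark={"hui": "hfui", "fei": "hfui"},
    model=toy_model,
)
print(apparatus.segment("天气灰常晴朗"))
```

The point of the sketch is the data flow: the model only ever sees the marked pinyin text, and the Chinese character vocabulary sequence is recovered by reusing the boundaries of the pinyin vocabulary sequence.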
In some optional implementations of this embodiment, the apparatus 500 may further include: a second obtaining unit (not shown in the figure) configured to obtain a pre-generated word list, where the word list is used to indicate the correspondence between sample pinyin vocabularies and sample Chinese character vocabularies, the word list includes target sample pinyin vocabularies containing preset marks, a target sample pinyin vocabulary is used to represent the pinyin vocabularies that respectively contain the fuzzy-sound pinyins represented by the contained preset mark, and the sample Chinese character vocabularies corresponding to a target sample pinyin vocabulary include the Chinese character vocabularies respectively indicated by the represented pinyin vocabularies; and a matching unit (not shown in the figure) configured to perform the following matching steps for a Chinese character vocabulary corresponding to a preset mark in the Chinese character vocabulary sequence: determining, based on the Chinese character vocabulary and the preset mark corresponding to the Chinese character vocabulary, the target sample pinyin vocabulary corresponding to the Chinese character vocabulary from the word list as the pinyin vocabulary for matching; and matching the sample Chinese character vocabularies corresponding to the pinyin vocabulary for matching in the word list against the Chinese character vocabulary to obtain a matching result, wherein the matching result is used to indicate whether the sample Chinese character vocabularies corresponding to the pinyin vocabulary for matching include the Chinese character vocabulary.
In some optional implementations of this embodiment, the matching unit may be further configured to: in response to the obtained matching result indicating that the sample Chinese character vocabularies corresponding to the pinyin vocabulary for matching do not include the Chinese character vocabulary, select, from the sample Chinese character vocabularies corresponding to the pinyin vocabulary for matching, a sample Chinese character vocabulary similar to the Chinese character vocabulary; and replace the Chinese character vocabulary in the Chinese character vocabulary sequence with the selected sample Chinese character vocabulary.
In some optional implementations of this embodiment, the word list may be generated by the following steps: acquiring a first sample text set and a first sample Chinese character vocabulary sequence predetermined for each first sample text in the first sample text set; for a first sample text in the first sample text set, performing the following steps: converting the Chinese characters in the first sample text into pinyin to obtain a first sample pinyin text; in response to the first sample pinyin text including a target pinyin, selecting from the preset mark set a preset mark for representing the target pinyin in the first sample pinyin text and the fuzzy sound corresponding to the target pinyin; replacing the target pinyin in the first sample pinyin text with the corresponding preset mark to obtain a new first sample pinyin text; and segmenting the new first sample pinyin text based on the first sample Chinese character vocabulary sequence corresponding to the first sample text to obtain a first sample pinyin vocabulary sequence, wherein each first sample pinyin vocabulary in the obtained first sample pinyin vocabulary sequence corresponds to a first sample Chinese character vocabulary in the first sample Chinese character vocabulary sequence corresponding to the first sample text; and generating the word list from the obtained corresponding first sample Chinese character vocabularies and first sample pinyin vocabularies.
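As a non-limiting illustration, the word list generation steps above may be sketched as follows. The toy character-to-pinyin table, the mark table, the sample texts, and the function names are all hypothetical; the Chinese sample words are assumed renderings chosen only so that two different words collapse onto the same marked pinyin vocabulary. One pinyin syllable per Chinese character is assumed.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Illustrative tables only; a real system would use a full pinyin converter.
CHAR_TO_PINYIN = {"非": "fei", "灰": "hui", "常": "chang", "天": "tian", "气": "qi"}
PINYIN_TO_MARK = {"fei": "hfui", "hui": "hfui"}


def to_marked_pinyin(text: str) -> List[str]:
    """Convert characters to pinyin, then replace target pinyin with marks."""
    syllables = [CHAR_TO_PINYIN[ch] for ch in text]
    return [PINYIN_TO_MARK.get(s, s) for s in syllables]


def build_word_list(samples: List[Tuple[str, List[str]]]) -> Dict[str, Set[str]]:
    """Build pinyin vocabulary -> set of Chinese character vocabularies.

    Each sample is (first sample text, predetermined vocabulary sequence).
    """
    word_list: Dict[str, Set[str]] = defaultdict(set)
    for text, words in samples:
        marked = to_marked_pinyin(text)
        pos = 0
        for word in words:
            # Segment the marked pinyin text at the same boundaries as the
            # predetermined Chinese character vocabulary sequence.
            pinyin_word = " ".join(marked[pos:pos + len(word)])
            word_list[pinyin_word].add(word)
            pos += len(word)
    return dict(word_list)


samples = [("天气非常", ["天气", "非常"]), ("灰常", ["灰常"])]
word_list = build_word_list(samples)
print(sorted(word_list["hfui chang"]))
```

In this sketch the entry for "hfui chang" collects every sample vocabulary whose pinyin differs only by a fuzzy sound, which is the property the matching step relies on.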
In some optional implementations of this embodiment, the word segmentation model may be obtained by training through the following steps: acquiring a second sample text set and a second sample Chinese character vocabulary sequence predetermined for each second sample text in the second sample text set; for a second sample text in the second sample text set, performing the following steps: converting the Chinese characters in the second sample text into pinyin to obtain a second sample pinyin text; in response to the second sample pinyin text including a target pinyin, selecting from the preset mark set a preset mark for representing the target pinyin in the second sample pinyin text and the fuzzy sound corresponding to the target pinyin; replacing the target pinyin in the second sample pinyin text with the corresponding preset mark to obtain a new second sample pinyin text; and segmenting the new second sample pinyin text based on the second sample Chinese character vocabulary sequence corresponding to the second sample text to obtain a second sample pinyin vocabulary sequence; and training to obtain the word segmentation model by taking the obtained new second sample pinyin text as input and taking the second sample pinyin vocabulary sequence corresponding to the input new second sample pinyin text as expected output.
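As a non-limiting illustration, the construction of one training example from the steps above may be sketched as follows. The sketch starts from an already marked pinyin text and assumes that the lengths of the predetermined Chinese character vocabularies are given as syllable counts; the function name and sample values are hypothetical, and no particular model or training library is assumed.

```python
from typing import List, Tuple


def make_training_pair(marked_syllables: List[str],
                       word_lengths: List[int]) -> Tuple[str, List[str]]:
    """Return (model input, expected output) for one second sample text.

    marked_syllables: new second sample pinyin text, one item per character.
    word_lengths: number of characters in each predetermined vocabulary.
    """
    if sum(word_lengths) != len(marked_syllables):
        raise ValueError("vocabulary lengths do not cover the pinyin text")
    expected, pos = [], 0
    for n in word_lengths:
        expected.append(" ".join(marked_syllables[pos:pos + n]))
        pos += n
    return " ".join(marked_syllables), expected


model_input, expected_output = make_training_pair(
    ["tian", "qi", "hfui", "chang"], [2, 2]
)
print(model_input, expected_output)
```

How such pairs are consumed depends on the chosen model; the sketch only shows that the input is the marked pinyin text and the expected output is its segmentation into pinyin vocabularies.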
It will be understood that the units described in the apparatus 500 correspond to the respective steps in the method described with reference to fig. 2. Thus, the operations, features and resulting advantages described above with respect to the method also apply to the apparatus 500 and the units included therein, and are not described herein again.
The device 500 provided by the above embodiment of the present disclosure may perform word segmentation on a text from the dimension of pinyin, and may convert pinyin corresponding to a fuzzy sound in the pinyin text into a preset mark that can be identified by a word segmentation model, thereby reducing the influence of the fuzzy sound on word segmentation and improving the accuracy of word segmentation.
Referring now to fig. 6, a schematic diagram of an electronic device (e.g., the terminal device 101, 102, 103 or the server 105 of fig. 1) 600 suitable for implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 6 is only an example, and should not impose any limitation on the functions or the scope of use of the embodiments of the present disclosure.
As shown in fig. 6, the electronic device 600 may include a processing means (e.g., central processing unit, graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. The RAM 603 also stores various programs and data necessary for the operation of the electronic device 600. The processing means 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage means 608 including, for example, tape, hard disk, etc.; and a communication means 609. The communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 6 illustrates an electronic device 600 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer means may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or may be installed from the storage means 608, or may be installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device, or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire a text to be segmented; convert the Chinese characters in the text to be segmented into pinyin to obtain a pinyin text to be segmented; in response to the pinyin text to be segmented including a target pinyin, select from a preset mark set a preset mark for representing the target pinyin and the fuzzy sound corresponding to the target pinyin, wherein the target pinyin is a pinyin that has a corresponding fuzzy sound; replace the target pinyin in the pinyin text to be segmented with the selected preset mark to obtain a new pinyin text to be segmented; input the new pinyin text to be segmented into a pre-trained word segmentation model to obtain a pinyin vocabulary sequence, wherein the word segmentation model is used to represent the correspondence between a pinyin text including preset marks and a pinyin vocabulary sequence; and segment the text to be segmented based on the pinyin vocabulary sequence to obtain a Chinese character vocabulary sequence.
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. The name of a unit does not, in some cases, constitute a limitation on the unit itself; for example, the first obtaining unit may also be described as a "unit that obtains a text to be segmented".
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of features described above, but also encompasses other embodiments formed by any combination of the features described above or their equivalents without departing from the spirit of the disclosure, for example, technical solutions formed by replacing the above features with (but not limited to) features having similar functions disclosed in this disclosure.