CN110413998B - Self-adaptive Chinese word segmentation method oriented to power industry, system and medium thereof - Google Patents
- Publication number
- CN110413998B (application CN201910638948.2A)
- Authority
- CN
- China
- Prior art keywords
- word segmentation
- candidate
- word
- text
- segmentation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y04—INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
- Y04S—SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
- Y04S10/00—Systems supporting electrical power generation, transmission or distribution
- Y04S10/50—Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications
Landscapes
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a self-adaptive Chinese word segmentation method for the power industry, and to a corresponding system and medium. The method comprises the following steps: S1, acquiring candidate text terms, the candidate text terms being short sentences or paragraphs to be segmented; S2, performing segmentation processing on the candidate text terms to obtain a plurality of candidate text sentences; S3, segmenting each candidate text sentence to obtain one or more segmented words; S4, replacing the segmented words in the candidate text terms one by one with words of the same meaning and performing semantic discrimination, returning to S3 if ambiguity occurs, and retaining the segmented word as a candidate segmented word if no ambiguity occurs; S5, acquiring one or more power-field professional terms semantically similar to the candidate segmented word, calculating the similarity between the candidate segmented word and the one or more power-field professional terms, and determining a final segmented word according to the similarity; S6, sorting the final segmented words by their frequency of occurrence in the candidate text terms, and outputting the sorted segmented words.
Description
Technical Field
The invention relates to the technical field of data processing of power equipment, in particular to a self-adaptive Chinese word segmentation method and system for the power industry and a computer readable storage medium.
Background
In recent years, as networks have become more widespread, the volume of text on the internet has grown steadily and information resources have continued to increase. In order to retrieve and mine valuable information from this large body of resources, internet companies are vigorously developing natural language processing technology. Chinese word segmentation is the basis and premise of natural language processing; it plays an important role in information processing tasks such as information retrieval, machine translation and information filtering, and is both a key technology and a difficulty of information processing. To date, the national grid company has established a large number of data management systems, and the volume of business data is huge.
The following technical problem therefore exists: because each business department and each business system defines data information by different rules, data from the same source is in practice named inconsistently across business systems. The same data thus appears to have several sources, which makes it difficult to unify data among the business systems.
Disclosure of Invention
The invention aims to provide a self-adaptive Chinese word segmentation method and system for the power industry, and a computer readable storage medium, so as to solve the above technical problem.
In order to achieve the object of the present invention, according to a first aspect of the present invention, an embodiment of the present invention provides an adaptive Chinese word segmentation method for the power industry, including the steps of:
Step S1: acquiring candidate text terms, the candidate text terms being short sentences or paragraphs to be segmented;
Step S2: performing segmentation processing on the candidate text terms to obtain a plurality of candidate text sentences;
Step S3: segmenting each candidate text sentence to obtain one or more segmented words;
Step S4: replacing the segmented words in the candidate text terms one by one with words of the same meaning and performing semantic discrimination; if the text terms before and after replacement are ambiguous, returning to step S3, and if they are not ambiguous, retaining the segmented word as a candidate segmented word;
Step S5: acquiring one or more power-field professional terms semantically similar to the candidate segmented word, calculating the similarity between the candidate segmented word and the one or more power-field professional terms, and determining a final segmented word according to the similarity;
Step S6: sorting the final segmented words by their frequency of occurrence in the candidate text terms, and outputting the sorted segmented words.
Preferably, the step S2 includes:
separating the candidate text terms at punctuation marks and spaces to obtain a plurality of text parts, and removing the punctuation marks and spaces from the text parts to obtain a plurality of text sentences to be filtered;
judging whether each character in each text sentence to be filtered belongs to a professional term of the power industry; if so, extracting all identical characters in the text sentence and segmenting them into words, and if not, extracting all identical characters in the text sentence and discarding them; here, segmenting into words means segmenting a character together with the characters that follow it, thereby obtaining the candidate text sentences.
Preferably, the step S3 includes:
extracting from the candidate text sentences the words that match entries in a dictionary database to obtain the segmented words, wherein the entries in the dictionary database are those of a word segmentation dictionary specific to the power field.
Preferably, the step S4 includes:
when a candidate text sentence corresponds to a plurality of candidate segmented words, calculating the similarity value between each candidate segmented word in the candidate text sentence and the one or more power-field professional terms, and accumulating these values to obtain the similarity value of that candidate segmented word;
and selecting the candidate segmented word with the highest similarity value as the final segmented word of the candidate text sentence.
Preferably, the step S6 includes:
outputting the sorted final segmented words separated by spaces, highlighting the top ten in the sorted order, and hiding the remaining final segmented words.
According to a second aspect of the present invention, an embodiment of the present invention provides an adaptive Chinese word segmentation system for the power industry, including:
a text acquisition unit, configured to acquire candidate text terms, the candidate text terms being short sentences or paragraphs to be segmented;
a text segmentation unit, configured to perform segmentation processing on the candidate text terms to obtain a plurality of candidate text sentences;
a word segmentation unit, configured to segment each candidate text sentence to obtain one or more segmented words;
a first word segmentation screening unit, configured to replace the segmented words in the candidate text terms one by one with words of the same meaning and perform semantic discrimination, returning to step S3 if the text terms before and after replacement are ambiguous, and retaining the segmented word as a candidate segmented word if they are not ambiguous;
a second word screening unit, configured to acquire one or more power-field professional terms semantically similar to the candidate segmented word, calculate the similarity between the candidate segmented word and the one or more power-field professional terms, and determine a final segmented word according to the similarity;
and an output unit, configured to sort the final segmented words by their frequency of occurrence in the candidate text terms and output them.
Preferably, the text segmentation unit includes:
a first segmentation unit, configured to separate the candidate text terms at punctuation marks and spaces to obtain a plurality of text parts, and to remove the punctuation marks and spaces from the text parts to obtain a plurality of text sentences to be filtered;
a second segmentation unit, configured to judge whether each character in each text sentence to be filtered belongs to a professional term of the power industry; if so, to extract all identical characters in the text sentence and segment them into words, and if not, to extract all identical characters in the text sentence and discard them; here, segmenting into words means segmenting a character together with the characters that follow it, thereby obtaining the candidate text sentences.
Preferably, the word segmentation unit is specifically configured to extract from the candidate text sentence the words that match entries in the dictionary database to obtain the segmented words, wherein the entries in the dictionary database are those of a word segmentation dictionary specific to the power field;
the output unit includes:
a similarity calculation unit, configured to, when a candidate text sentence corresponds to a plurality of candidate segmented words, calculate the similarity value between each candidate segmented word in the candidate text sentence and the one or more power-field professional terms, and accumulate these values to obtain the similarity value of that candidate segmented word;
and a final word segmentation determining unit, configured to select the candidate segmented word with the highest similarity value as the final segmented word of the candidate text sentence.
Preferably, the output unit includes:
a display unit, configured to output the sorted final segmented words separated by spaces, highlight the top ten in the sorted order, and hide the remaining final segmented words.
According to a third aspect of the present invention, an embodiment of the present invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the adaptive Chinese word segmentation method for the power industry.
In the embodiments of the invention, a word segmentation dictionary base specific to the power field is established in view of the characteristics of power data. Candidate text sentences are split and checked for ambiguity against the words in this dictionary base to obtain candidate segmented words, and the final segmented words are then determined from the similarity between the candidate segmented words and similar words in the dictionary base. This improves the accuracy of word segmentation, and the data matching analysis it enables across business systems can improve both working efficiency and the efficiency of data use.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings. Of course, it is not necessary for any one product or method of practicing the invention to achieve all of the advantages set forth above at the same time.
Drawings
In order to illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the invention; a person skilled in the art may obtain other drawings from them without inventive effort.
Fig. 1 is a flowchart of a self-adaptive Chinese word segmentation method for the power industry according to a first embodiment of the present invention.
Fig. 2 is a schematic diagram of a self-adaptive Chinese word segmentation system for the power industry in a second embodiment of the present invention.
Detailed Description
Various exemplary embodiments, features and aspects of the disclosure will be described in detail below with reference to the drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
In addition, numerous specific details are set forth in the following examples in order to provide a better illustration of the invention. It will be understood by those skilled in the art that the present invention may be practiced without some of these specific details. In some instances, well known means have not been described in detail in order to not obscure the present invention.
As shown in Fig. 1, the first embodiment of the invention provides a self-adaptive Chinese word segmentation method for the power industry, which comprises the following steps:
Step S1: acquiring candidate text terms, the candidate text terms being short sentences or paragraphs to be segmented;
Step S2: performing segmentation processing on the candidate text terms to obtain a plurality of candidate text sentences;
Step S3: segmenting each candidate text sentence to obtain one or more segmented words;
Step S4: replacing the segmented words in the candidate text terms one by one with words of the same meaning and performing semantic discrimination; if the text terms before and after replacement are ambiguous, returning to step S3, and if they are not ambiguous, retaining the segmented word as a candidate segmented word;
Step S5: acquiring one or more power-field professional terms semantically similar to the candidate segmented word, calculating the similarity between the candidate segmented word and the one or more power-field professional terms, and determining a final segmented word according to the similarity;
Step S6: sorting the final segmented words by their frequency of occurrence in the candidate text terms, and outputting the sorted segmented words.
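Step S4 in the list above can be illustrated with a short Python sketch. The embodiment does not say how semantic discrimination is performed, so the sketch takes it as a caller-supplied predicate. The names `screen_by_synonym_replacement`, `synonyms` and `is_ambiguous` are hypothetical and do not come from the patent.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def screen_by_synonym_replacement(
    text: str,
    words: Iterable[str],
    synonyms: Dict[str, List[str]],
    is_ambiguous: Callable[[str, str], bool],
) -> Tuple[List[str], List[str]]:
    """Split `words` into (candidates, rejected) in the spirit of step S4.

    `synonyms` maps a segmented word to words of the same meaning, and
    `is_ambiguous(before, after)` is an assumed stand-in for the semantic
    discrimination, which the source leaves unspecified.
    """
    candidates: List[str] = []
    rejected: List[str] = []
    for word in words:
        # Replace every occurrence of the word with each same-meaning word
        # and compare the text before and after the replacement.
        ambiguous = any(
            is_ambiguous(text, text.replace(word, synonym))
            for synonym in synonyms.get(word, [])
        )
        (rejected if ambiguous else candidates).append(word)
    return candidates, rejected
```

Words in the rejected list would be sent back to step S3 for re-segmentation; the source does not state how that loop terminates, so the sketch leaves it to the caller.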
The step S2 specifically includes:
separating the candidate text terms at punctuation marks and spaces to obtain a plurality of text parts, and removing the punctuation marks and spaces from the text parts to obtain a plurality of text sentences to be filtered;
judging whether each character in each text sentence to be filtered belongs to a professional term of the power industry; if so, extracting all identical characters in the text sentence and segmenting them into words, and if not, extracting all identical characters in the text sentence and discarding them; here, segmenting into words means segmenting a character together with the characters that follow it, thereby obtaining the candidate text sentences.
Specifically, for a text sentence to be filtered, the first character is extracted and judged as to whether it belongs to a professional term of the power industry; if so, all identical characters in the text sentence are extracted and segmented into words, and if not, all identical characters in the text sentence are extracted and discarded. The subsequent characters are then judged in the same way until the last character of the text sentence to be filtered has been extracted, which completes the filtering of the candidate text sentence. Whether a character belongs to a power-industry professional term is judged by comparing the characters taken from the text sentence against the power-industry professional vocabulary, using the constructed word segmentation dictionary of power-industry professional vocabulary and everyday vocabulary.
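One possible reading of step S2 is sketched below in Python. The delimiter set, the function names and the rule "keep a character if it occurs in at least one dictionary term" are assumptions made for illustration; the embodiment does not define them precisely.

```python
import re
from typing import Dict, List, Set

# Assumed delimiter set: whitespace plus common ASCII and Chinese punctuation.
_DELIMITERS = re.compile(r"[\s,.;:!?，。；：！？、]+")


def split_text_parts(text: str) -> List[str]:
    """Split on punctuation and whitespace, dropping empty parts."""
    return [part for part in _DELIMITERS.split(text) if part]


def filter_characters(sentence: str, domain_terms: Set[str]) -> str:
    """Keep only characters that occur in at least one domain term.

    Each distinct character is judged once and the verdict is reused for
    all identical characters, mirroring the "all the same characters"
    wording of the embodiment.
    """
    domain_chars = {ch for term in domain_terms for ch in term}
    verdict: Dict[str, bool] = {}
    kept: List[str] = []
    for ch in sentence:
        if ch not in verdict:
            verdict[ch] = ch in domain_chars
        if verdict[ch]:
            kept.append(ch)
    return "".join(kept)
```

Caching the verdict per character is what avoids comparing identical characters against the dictionary more than once.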
Wherein, the step S3 includes:
extracting from the candidate text sentences the words that match entries in a dictionary database to obtain the segmented words, wherein the entries in the dictionary database are those of a word segmentation dictionary specific to the power field.
Specifically, a candidate text sentence may contain zero or more segmented words that match entries in the dictionary database and are semantically similar to one another.
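The dictionary lookup of step S3 can be sketched as follows. The source does not name a matching algorithm; forward maximum matching is used here as a common dictionary-based technique, and the function name `extract_dictionary_words` is hypothetical.

```python
from typing import List, Set


def extract_dictionary_words(sentence: str, dictionary: Set[str]) -> List[str]:
    """Return dictionary words found in `sentence` by forward maximum matching."""
    max_len = max((len(word) for word in dictionary), default=0)
    found: List[str] = []
    i = 0
    while i < len(sentence):
        # Try the longest window first and shrink until a dictionary hit.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if piece in dictionary:
                found.append(piece)
                i += size
                break
        else:
            i += 1  # no dictionary word starts here; skip this character
    return found
```

With a dictionary containing both "变压器" and "主变压器", the sentence "主变压器跳闸" yields the longer entry, which is why the longest window is tried first. A sentence with no dictionary entries yields an empty list, matching the "zero or more" case described above.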
Wherein, the step S4 includes:
when a candidate text sentence corresponds to a plurality of candidate segmented words, calculating the similarity value between each candidate segmented word in the candidate text sentence and the one or more power-field professional terms, and accumulating these values to obtain the similarity value of that candidate segmented word;
and selecting the candidate segmented word with the highest similarity value as the final segmented word of the candidate text sentence.
Specifically, one candidate text sentence may correspond to a plurality of candidate segmented words. In this step the candidate segmented words are screened by similarity value, so that each candidate text sentence finally outputs only one segmented word, which reduces the segmentation error rate.
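The accumulate-and-select rule described above can be sketched in Python. The source does not define the similarity measure, so character-set Jaccard similarity is used purely as a placeholder; all function names are hypothetical.

```python
from typing import Iterable, List


def jaccard(a: str, b: str) -> float:
    """Character-set Jaccard similarity (an assumed stand-in measure)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def accumulated_similarity(candidate: str, domain_terms: Iterable[str]) -> float:
    """Sum the candidate's similarity to every power-field professional term."""
    return sum(jaccard(candidate, term) for term in domain_terms)


def select_final_word(candidates: List[str], domain_terms: List[str]) -> str:
    """Pick the candidate with the highest accumulated similarity.

    Raises ValueError when `candidates` is empty.
    """
    return max(candidates, key=lambda c: accumulated_similarity(c, domain_terms))
```

Any other similarity function with the same signature, for example one based on word embeddings, could replace `jaccard` without changing the selection logic.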
Wherein, the step S6 includes:
outputting the sorted final segmented words separated by spaces, highlighting the top ten in the sorted order, and hiding the remaining final segmented words.
Specifically, in this embodiment the calculated segmented words are ranked by frequency of occurrence and output separated by spaces. The top ten in the ranking are highlighted and the subsequent results are hidden; when they need to be viewed, clicking the corresponding button displays the remaining results. All word segmentation results are also output to a display device as a bar graph and shown to the user.
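The ranking and output rule of step S6 can be sketched as below. The function names, the tie-breaking rule and the two-string return value (shown part, hidden part) are assumptions for illustration; the bar-graph display is omitted.

```python
from typing import List, Tuple


def rank_by_frequency(final_words: List[str], text: str) -> List[str]:
    """Rank distinct final words by how often they occur in `text`."""
    counts = {word: text.count(word) for word in set(final_words)}
    # Most frequent first; ties are broken by the word itself so the
    # order is deterministic (the tie rule is an assumption).
    return sorted(counts, key=lambda word: (-counts[word], word))


def format_output(ranked: List[str], top_n: int = 10) -> Tuple[str, str]:
    """Return (highlighted, hidden) strings, each joined with spaces."""
    return " ".join(ranked[:top_n]), " ".join(ranked[top_n:])
```

The second string corresponds to the results that stay hidden until the user asks for them.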
In this embodiment, word segmentation data are selected from the word segmentation dictionary specific to the power field, and the extracted candidate text terms are separated into a plurality of text sentences. This preprocessing reduces the segmentation interference caused by the punctuation marks and spaces contained in the text terms and improves the efficiency of preprocessing. The characters extracted from each text sentence are then compared against the dictionary one by one to judge whether they belong to power-field professional terms, until the last character of the text sentence has been extracted. Because all identical characters are handled together, identical characters are not compared again, which reduces the workload of character comparison and makes the judgment more efficient. The candidate text sentences obtained in this way are segmented into words, and ambiguity discrimination is applied to the resulting segmented words until no ambiguity remains, which reduces the ambiguity that arises after the text terms are segmented. Similarity is then calculated for the candidate segmented words to determine the final segmented words, and the results are sorted and displayed so that the user can view them clearly and intuitively.
As shown in Fig. 2, a second embodiment of the present invention provides an adaptive Chinese word segmentation system for the power industry, including:
a text acquisition unit 1, configured to acquire candidate text terms, the candidate text terms being short sentences or paragraphs to be segmented;
a text segmentation unit 2, configured to perform segmentation processing on the candidate text terms to obtain a plurality of candidate text sentences;
a word segmentation unit 3, configured to segment each candidate text sentence to obtain one or more segmented words;
a first word segmentation screening unit 4, configured to replace the segmented words in the candidate text terms one by one with words of the same meaning and perform semantic discrimination, returning to step S3 if the text terms before and after replacement are ambiguous, and retaining the segmented word as a candidate segmented word if they are not ambiguous;
a second word screening unit 5, configured to acquire one or more power-field professional terms semantically similar to the candidate segmented word, calculate the similarity between the candidate segmented word and the one or more power-field professional terms, and determine a final segmented word according to the similarity;
and an output unit 6, configured to sort the final segmented words by their frequency of occurrence in the candidate text terms and output them.
Wherein the text segmentation unit 2 comprises:
a first segmentation unit, configured to separate the candidate text terms at punctuation marks and spaces to obtain a plurality of text parts, and to remove the punctuation marks and spaces from the text parts to obtain a plurality of text sentences to be filtered;
a second segmentation unit, configured to judge whether each character in each text sentence to be filtered belongs to a professional term of the power industry; if so, to extract all identical characters in the text sentence and segment them into words, and if not, to extract all identical characters in the text sentence and discard them; here, segmenting into words means segmenting a character together with the characters that follow it, thereby obtaining the candidate text sentences.
The word segmentation unit 3 is specifically configured to extract from the candidate text sentence the words that match entries in the dictionary database to obtain the segmented words, wherein the entries in the dictionary database are those of a word segmentation dictionary specific to the power field;
the output unit 6 includes:
a similarity calculation unit, configured to, when a candidate text sentence corresponds to a plurality of candidate segmented words, calculate the similarity value between each candidate segmented word in the candidate text sentence and the one or more power-field professional terms, and accumulate these values to obtain the similarity value of that candidate segmented word;
and a final word segmentation determining unit, configured to select the candidate segmented word with the highest similarity value as the final segmented word of the candidate text sentence.
Wherein the output unit 6 includes:
a display unit, configured to output the sorted final segmented words separated by spaces, highlight the top ten in the sorted order, and hide the remaining final segmented words.
It should be noted that the system of the second embodiment corresponds to the method of the first embodiment, and is used for implementing the method of the first embodiment, so that other undescribed contents of the system of the second embodiment can be obtained by referring to the method of the first embodiment, and are not repeated herein.
It should also be appreciated that the method of embodiment one and the system of embodiment two may be implemented in numerous ways, including as a process, an apparatus, or a system. The methods described herein may be implemented in part by program instructions for instructing a processor to perform such methods, as well as such instructions recorded on a non-transitory computer-readable storage medium such as a hard disk drive, floppy disk, optical disk (such as a Compact Disc (CD) or Digital Versatile Disc (DVD)), flash memory, and the like. In some embodiments, the program instructions may be stored remotely and transmitted over a network via optical or electronic communication links.
An embodiment of the present invention provides a computer readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the adaptive Chinese word segmentation method for the power industry of the first embodiment.
The foregoing description of embodiments of the invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvements in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Claims (7)
1. A self-adaptive Chinese word segmentation method for the power industry, characterized by comprising the following steps:
Step S1: acquiring candidate text terms, the candidate text terms being short sentences or paragraphs to be segmented;
Step S2: performing segmentation processing on the candidate text terms to obtain a plurality of candidate text sentences; separating the candidate text terms at punctuation marks and spaces to obtain a plurality of text parts, and removing the punctuation marks and spaces from the text parts to obtain a plurality of text sentences to be filtered; judging whether each character in each text sentence to be filtered belongs to a professional term of the power industry; if so, extracting all identical characters in the text sentence and segmenting them into words, wherein segmenting into words means segmenting a character together with the characters that follow it to obtain a candidate text sentence; if not, extracting all identical characters in the text sentence and discarding them;
Step S3: segmenting each candidate text sentence to obtain one or more segmented words; extracting from the candidate text sentence the words that match entries in a dictionary database to obtain the segmented words, wherein the entries in the dictionary database are those of a word segmentation dictionary specific to the power field;
Step S4: replacing the segmented words in the candidate text terms one by one with words of the same meaning and performing semantic discrimination; if the text terms before and after replacement are ambiguous, returning to step S3, and if they are not ambiguous, retaining the segmented word as a candidate segmented word;
Step S5: acquiring one or more power-field professional terms semantically similar to the candidate segmented word, calculating the similarity between the candidate segmented word and the one or more power-field professional terms, and determining a final segmented word according to the similarity;
Step S6: sorting the final segmented words by their frequency of occurrence in the candidate text terms, and outputting the sorted segmented words.
2. The adaptive Chinese word segmentation method for the power industry according to claim 1, wherein step S4 comprises:
when a candidate text sentence corresponds to a plurality of candidate segmented words, calculating the similarity value between each candidate segmented word in the candidate text sentence and the one or more power-field professional terms, and accumulating these values to obtain the similarity value of that candidate segmented word;
and selecting the candidate segmented word with the highest similarity value as the final segmented word of the candidate text sentence.
3. The adaptive Chinese word segmentation method for the power industry according to claim 2, wherein step S6 comprises:
outputting the sorted final segmented words separated by spaces, highlighting the top ten in the sorted order, and hiding the remaining final segmented words.
4. An adaptive Chinese word segmentation system for the power industry, comprising:
a text acquisition unit, configured to acquire candidate text terms, wherein the candidate text terms are short sentences or paragraphs to be segmented;
a text segmentation unit, configured to segment the candidate text terms to obtain a plurality of candidate text sentences;
a word segmentation unit, configured to segment each candidate text sentence to obtain one or more word segments, the word segments being obtained by extracting from the candidate text sentence the words that match vocabulary in a dictionary database, wherein the vocabulary in the dictionary database is the vocabulary of a dedicated power-domain word segmentation dictionary;
a first word segment screening unit, configured to replace the word segments in the candidate text terms, one by one, with words having the same meaning, and to perform semantic discrimination; if the text terms before and after replacement are ambiguous, returning to step S3; if they are not ambiguous, retaining the word segment as a candidate word segment;
a second word segment screening unit, configured to obtain one or more power-domain professional vocabulary items semantically similar to the candidate word segment, to calculate the similarity between the candidate word segment and those vocabulary items, and to determine a final word segment according to the similarity; and
an output unit, configured to sort the final word segments by their frequency of occurrence in the candidate text terms and to output them;
wherein the text segmentation unit comprises:
a first segmentation unit, configured to split the candidate text terms at punctuation marks and spaces to obtain a plurality of text parts, and to remove the punctuation marks and spaces from the text parts to obtain a plurality of text sentences to be filtered; and
a second segmentation unit, configured to judge whether a character in each text sentence to be filtered belongs to a professional word of the power industry; if so, to extract all occurrences of that character in the text sentence and segment them into words, where segmenting into words means grouping the character together with the characters that follow it, thereby obtaining the candidate text sentences; and if not, to extract all occurrences of that character in the text sentence and discard them;
and wherein the word segmentation unit is specifically configured to extract, from the candidate text sentence, the words that match vocabulary in the dictionary database to obtain the word segments, the vocabulary in the dictionary database being the vocabulary of the dedicated power-domain word segmentation dictionary.
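The first segmentation unit and the dictionary-based word segmentation unit of claim 4 can be sketched roughly as follows. Several points are assumptions rather than anything the claims state: the punctuation set in `_SPLIT` is a small illustrative sample, the dictionary lookup uses forward maximum matching (a common technique swapped in here because the claims only say that words matching the dictionary are extracted), and the names `split_sentences` and `dictionary_segments` are hypothetical. The second segmentation unit's character filtering is not modelled.

```python
import re

# Illustrative delimiter set (an assumption): whitespace plus common
# Chinese and ASCII punctuation marks.
_SPLIT = re.compile(r"[\s，。；：、！？,.;:!?]+")


def split_sentences(text):
    """Split text at punctuation and spaces, dropping the delimiters."""
    return [part for part in _SPLIT.split(text) if part]


def dictionary_segments(sentence, dictionary):
    """Extract dictionary words from a sentence by forward maximum matching.

    At each position the longest dictionary word starting there is taken;
    characters that start no dictionary word are skipped.
    """
    max_len = max((len(word) for word in dictionary), default=0)
    found = []
    i = 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if piece in dictionary:
                found.append(piece)
                i += size
                break
        else:
            # No dictionary word starts at this character; move on.
            i += 1
    return found


if __name__ == "__main__":
    power_dict = {"变压器", "断路器", "跳闸", "变压"}
    for sentence in split_sentences("变压器发生故障， 断路器跳闸。"):
        print(sentence, dictionary_segments(sentence, power_dict))
```

Preferring the longest match means "变压器" is extracted rather than its prefix "变压" when both are in the dictionary.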
5. The power industry-oriented adaptive Chinese word segmentation system of claim 4, wherein the output unit comprises:
a similarity calculation unit, configured to calculate, when a candidate text sentence corresponds to a plurality of candidate word segments, the similarity value of each candidate word segment in the candidate text sentence with each of the one or more power-domain professional vocabulary items, and to accumulate these values to obtain the similarity value of that candidate word segment;
and a final word segment determining unit, configured to select the candidate word segment with the highest similarity value as the final word segment of the candidate text sentence.
6. The power industry-oriented adaptive Chinese word segmentation system of claim 5, wherein the output unit comprises:
a display unit, configured to output the sorted final word segments separated by spaces, to select the top ten in the sorted order for emphasized display, and to hide the remaining final word segments.
7. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the power industry-oriented adaptive Chinese word segmentation method of any one of claims 1-3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910638948.2A CN110413998B (en) | 2019-07-16 | 2019-07-16 | Self-adaptive Chinese word segmentation method oriented to power industry, system and medium thereof |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110413998A (en) | 2019-11-05 |
CN110413998B (en) | 2023-04-21 |
Family
ID=68361553
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910638948.2A Active CN110413998B (en) | 2019-07-16 | 2019-07-16 | Self-adaptive Chinese word segmentation method oriented to power industry, system and medium thereof |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110413998B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111079428B (en) * | 2019-12-27 | 2023-09-19 | 北京羽扇智信息科技有限公司 | Word segmentation and industry dictionary construction method and device and readable storage medium |
CN112257425A (en) * | 2020-09-29 | 2021-01-22 | 国网天津市电力公司 | Power data analysis method and system based on data classification model |
CN112926320B (en) * | 2021-03-24 | 2022-12-27 | 山东亿云信息技术有限公司 | Text key content intelligent extraction method and system based on subject term optimization |
CN114881017B (en) * | 2022-04-25 | 2024-10-18 | 南京烽火星空通信发展有限公司 | Self-adaptive dynamic word segmentation method |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104077275A (en) * | 2014-06-27 | 2014-10-01 | 北京奇虎科技有限公司 | Method and device for performing word segmentation based on context |
CN106844326A (en) * | 2015-12-04 | 2017-06-13 | 北京国双科技有限公司 | A kind of method and device for obtaining word |
CN107608968A (en) * | 2017-09-22 | 2018-01-19 | 深圳市易图资讯股份有限公司 | Chinese word cutting method, the device of text-oriented big data |
CN107918604A (en) * | 2017-11-13 | 2018-04-17 | 彩讯科技股份有限公司 | A kind of Chinese segmenting method and device |
CN109828981A (en) * | 2017-11-22 | 2019-05-31 | 阿里巴巴集团控股有限公司 | A kind of data processing method and calculate equipment |
Also Published As
Publication number | Publication date |
---|---|
CN110413998A (en) | 2019-11-05 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN109189942B (en) | Construction method and device of patent data knowledge graph | |
CN108920467B (en) | Method and device for learning word meaning of polysemous word and search result display method | |
CN110413998B (en) | Self-adaptive Chinese word segmentation method oriented to power industry, system and medium thereof | |
CN107463548B (en) | Phrase mining method and device | |
CN104881458B (en) | A kind of mask method and device of Web page subject | |
CN111783518A (en) | Training sample generation method and device, electronic equipment and readable storage medium | |
US11507746B2 (en) | Method and apparatus for generating context information | |
CN112784009B (en) | Method and device for mining subject term, electronic equipment and storage medium | |
CN112364628B (en) | New word recognition method and device, electronic equipment and storage medium | |
CN110928981A (en) | Method, system and storage medium for establishing and perfecting iteration of text label system | |
CN110704638A (en) | Clustering algorithm-based electric power text dictionary construction method | |
CN110008473B (en) | Medical text named entity identification and labeling method based on iteration method | |
CN112395881A (en) | Material label construction method and device, readable storage medium and electronic equipment | |
CN112527977B (en) | Concept extraction method, concept extraction device, electronic equipment and storage medium | |
CN111325019A (en) | Word bank updating method and device and electronic equipment | |
CN107577713A (en) | Text handling method based on electric power dictionary | |
CN110413997A (en) | New word discovery method, system and readable storage medium for power industry | |
CN107291952B (en) | Method and device for extracting meaningful strings | |
CN112115362B (en) | Programming information recommendation method and device based on similar code recognition | |
CN106933797B (en) | Target information generation method and device | |
CN110472243B (en) | Chinese spelling checking method | |
CN111310457B (en) | Word mismatching recognition method and device, electronic equipment and storage medium | |
CN112632985A (en) | Corpus processing method and device, storage medium and processor | |
CN109344254B (en) | Address information classification method and device | |
CN114416923A (en) | News entity linking method and system based on rich text characteristics |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |