
CN107992570A - Character string mining method and device, electronic equipment and computer-readable storage medium - Google Patents


Info

Publication number
CN107992570A
CN107992570A
Authority
CN
China
Prior art keywords
character string
training
data
string
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711230875.0A
Other languages
Chinese (zh)
Inventor
李泽中
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xiaodu Information Technology Co Ltd
Original Assignee
Beijing Xiaodu Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xiaodu Information Technology Co Ltd filed Critical Beijing Xiaodu Information Technology Co Ltd
Priority to CN201711230875.0A priority Critical patent/CN107992570A/en
Publication of CN107992570A publication Critical patent/CN107992570A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing
    • G06F16/90344Query processing by using string matching techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the present disclosure discloses a character string mining method and device, electronic equipment and a computer-readable storage medium. The character string mining method includes: obtaining a training character string data set, wherein the training character string data set includes training character string data and character string feature data; training the training character string data set to obtain a target character string judgment model; and performing target character string judgment on a test character string according to the target character string judgment model. The disclosure can improve the effectiveness of character string segmentation and the effectiveness of retrieval, thereby effectively improving the service quality of merchants or service providers and enhancing user experience.

Description

Character string mining method and device, electronic equipment and computer readable storage medium
Technical Field
The disclosure relates to the technical field of information processing, in particular to a character string mining method and device, electronic equipment and a computer-readable storage medium.
Background
With the development of internet technology, more and more merchants or service providers provide services for users through internet platforms, seek to improve service quality, enhance user experience, and strive for more user orders, so as to improve the utilization rate of existing resources and create more value for the merchants or service providers. However, when the user uses the search service provided by the merchant or the service provider at present, the hit rate of the search result cannot meet the requirement of the user, thereby weakening the user experience.
Disclosure of Invention
The embodiment of the disclosure provides a character string mining method and device, electronic equipment and a computer-readable storage medium.
In a first aspect, an embodiment of the present disclosure provides a character string mining method.
Specifically, the character string mining method includes:
acquiring a training character string data set, wherein the training character string data set comprises training character string data and character string characteristic data;
training the training character string data set to obtain a target character string judgment model;
and judging the target character string of the test character string according to the target character string judgment model.
With reference to the first aspect, in a first implementation manner of the first aspect, the acquiring training string data in the training string data set includes:
acquiring historical character string data;
taking the data confirmed as the target character string in the historical character string data as a training positive sample;
taking the data confirmed as the non-target character string in the historical character string data as a training negative sample;
training string data is generated based on the training positive and negative examples.
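The four steps above can be sketched as follows; the function name and the labeling convention (1 for a target-string positive sample, 0 for a non-target negative sample) are illustrative assumptions, not part of the patent.

```python
def build_training_data(history_strings, confirmed_targets):
    """Split verified historical character strings into training samples.

    history_strings: candidate strings drawn from historical data.
    confirmed_targets: set of strings verified as target character strings
    (e.g. confirmed as worth adding to a segmentation/retrieval dictionary).
    Returns (string, label) pairs: label 1 for positives, 0 for negatives.
    """
    positives = [(w, 1) for w in history_strings if w in confirmed_targets]
    negatives = [(w, 0) for w in history_strings if w not in confirmed_targets]
    return positives + negatives
```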
With reference to the first aspect, in a first implementation manner of the first aspect, the character string feature data includes one or more of: the word frequency score of the character string w within a preset historical time period, the mutual information score of the character string w, the information entropy score of the character string w, and whether the character string w is a preset name.
With reference to the first aspect, in a first implementation manner of the first aspect, the training a training string data set to obtain a target string judgment model includes:
training based on the training character string data set to obtain a characteristic weight value corresponding to character string characteristic data;
and generating a target character string judgment model based on the weight value of the character string characteristic data.
With reference to the first aspect, in a first implementation manner of the first aspect, the training based on the training character string data set to obtain a feature weight value corresponding to character string feature data includes:
training based on the training character string data set to obtain a characteristic weight determination model;
determining a feature weight value corresponding to the character string feature data based on the feature weight determination model.
With reference to the first aspect, in a first implementation manner of the first aspect, the generating a target character string judgment model based on the weight value of the character string feature data includes:
generating a probability calculation model with the character string w as a target character string according to the weight value of the character string characteristic data;
and confirming the character string with the probability meeting the preset condition as a target character string.
With reference to the first aspect, in a first implementation manner of the first aspect, the probability calculation model is represented as:

p = 1 / (1 + exp(-Σi λi·fi))

wherein fi represents the i-th feature in the character string feature data, λi denotes the weight value corresponding to the i-th feature fi, and p represents the probability value that the character string is the target character string.
With reference to the first aspect, in a first implementation manner of the first aspect, the determining a character string with a probability meeting a preset condition as a target character string includes:
and confirming the character strings with the probability greater than a preset probability threshold value as target character strings.
With reference to the first aspect, in a first implementation manner of the first aspect, the test character string is a character string input within a preset historical time period.
With reference to the first aspect and the first implementation manner of the first aspect, in a second implementation manner of the first aspect, the method further includes: and executing preset operation on the target character string.
In a second aspect, an embodiment of the present disclosure provides a character string mining apparatus.
Specifically, the character string mining device includes:
an acquisition module configured to acquire a training string data set, wherein the training string data set includes training string data and string feature data;
the training module is configured to train the training character string data set to obtain a target character string judgment model;
and the judging module is configured to judge the target character string of the test character string according to the target character string judging model.
With reference to the second aspect, in a first implementation manner of the second aspect, the obtaining module includes:
an acquisition sub-module configured to acquire history string data;
a first confirming submodule configured to take data confirmed as a target character string in the historical character string data as a training positive sample;
a second confirming sub-module configured to take data confirmed as a non-target character string in the historical character string data as a training negative sample;
a first generation submodule configured to generate training string data based on the training positive samples and the training negative samples.
With reference to the second aspect, in a first implementation manner of the second aspect, the character string feature data includes one or more of: the word frequency score of the character string w within a preset historical time period, the mutual information score of the character string w, the information entropy score of the character string w, and whether the character string w is a preset name.
With reference to the second aspect, in a first implementation manner of the second aspect, the training module includes:
a training submodule configured to train based on the training character string data set to obtain a feature weight value corresponding to character string feature data;
a second generation sub-module configured to generate a target character string judgment model based on the weight value of the character string feature data.
With reference to the second aspect, in a first implementation manner of the second aspect, the training submodule includes:
a training unit configured to perform training based on the training string data set, resulting in a feature weight determination model;
a determination unit configured to determine a feature weight value corresponding to the character string feature data based on the feature weight determination model.
With reference to the second aspect, in a first implementation manner of the second aspect, the second generation submodule includes:
a generating unit configured to generate a probability calculation model in which the character string w is a target character string according to the weight value of the character string feature data;
and the confirming unit is configured to confirm the character string with the probability meeting the preset condition as the target character string.
With reference to the second aspect, in a first implementation manner of the second aspect, the probability calculation model is represented as:

p = 1 / (1 + exp(-Σi λi·fi))

wherein fi represents the i-th feature in the character string feature data, λi denotes the weight value corresponding to the i-th feature fi, and p represents the probability value that the character string is the target character string.
With reference to the second aspect, in a first implementation manner of the second aspect, the confirming unit is configured to confirm a character string with a probability greater than a preset probability threshold as the target character string.
With reference to the second aspect, in a first implementation manner of the second aspect, the test character string is a character string input within a preset historical time period.
With reference to the second aspect and the first implementation manner of the second aspect, in a second implementation manner of the second aspect, the apparatus further includes: and the execution module is configured to execute preset operation on the target character string.
In a third aspect, an embodiment of the present disclosure provides an electronic device, which includes a memory and a processor, where the memory is used to store one or more computer instructions that support a character string mining apparatus to execute the character string mining method in the first aspect, and the processor is configured to execute the computer instructions stored in the memory. The character string mining device may further include a communication interface for the character string mining device to communicate with other devices or a communication network.
In a fourth aspect, the disclosed embodiments provide a computer-readable storage medium for storing computer instructions for a string mining apparatus, which includes computer instructions for executing the string mining method in the first aspect described above as related to the string mining apparatus.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
according to the technical scheme, whether a certain character string is a target character string or not is analyzed and determined by considering various character string characteristics and distributing the corresponding weight value to each characteristic, for example, whether the certain character string is a new character string which can be added into a segmentation dictionary or a retrieval dictionary or not is determined, the content of the segmentation dictionary or the retrieval dictionary is enriched, the effectiveness of character string segmentation is improved, the retrieval effectiveness is improved, the service quality of a merchant or a service provider is effectively improved, and the user experience is enhanced.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
Other features, objects, and advantages of the present disclosure will become more apparent from the following detailed description of non-limiting embodiments when taken in conjunction with the accompanying drawings. In the drawings:
FIG. 1 illustrates a flow diagram of a method of string mining according to an embodiment of the present disclosure;
FIG. 2 shows a flow chart of step S101 according to the embodiment shown in FIG. 1;
FIG. 3 shows a flowchart of step S102 according to the embodiment shown in FIG. 1;
FIG. 4 shows a flowchart of step S301 according to the embodiment shown in FIG. 3;
FIG. 5 shows a flowchart of step S302 according to the embodiment shown in FIG. 3;
fig. 6 is a block diagram showing a structure of a character string mining apparatus according to an embodiment of the present disclosure;
FIG. 7 is a block diagram of the acquisition module 601 according to the embodiment shown in FIG. 6;
FIG. 8 illustrates a block diagram of the structure of the training module 602 according to the embodiment shown in FIG. 6;
FIG. 9 shows a block diagram of a training submodule 801 according to the embodiment shown in FIG. 8;
FIG. 10 is a block diagram illustrating the structure of a second generation submodule 802 according to the embodiment shown in FIG. 8;
FIG. 11 shows a block diagram of an electronic device according to an embodiment of the present disclosure;
FIG. 12 is a block diagram of a computer system suitable for use in implementing a string mining method according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, exemplary embodiments of the present disclosure will be described in detail with reference to the accompanying drawings so that those skilled in the art can easily implement them. Also, for the sake of clarity, parts not relevant to the description of the exemplary embodiments are omitted in the drawings.
In the present disclosure, it is to be understood that terms such as "including" or "having," etc., are intended to indicate the presence of the disclosed features, numbers, steps, behaviors, components, parts, or combinations thereof, and are not intended to preclude the possibility that one or more other features, numbers, steps, behaviors, components, parts, or combinations thereof may be present or added.
It should be further noted that the embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
According to the technical scheme provided by the embodiment of the disclosure, a plurality of character string features are considered and a corresponding weight value is assigned to each feature in order to analyze and determine whether a certain character string is a target character string, for example, whether it is a new character string that can be added to a segmentation dictionary or a retrieval dictionary. This enriches the contents of the segmentation dictionary or the retrieval dictionary, improves the effectiveness of character string segmentation, improves the effectiveness of retrieval, and finally effectively improves the service quality of merchants or service providers and enhances the user experience.
The character string of the technical scheme of the present disclosure can be used for the purposes of retrieval, search, pairing, etc., and for the convenience of description, the technical scheme of the present disclosure is described in detail below by taking the retrieval as an example.
Fig. 1 shows a flowchart of a character string mining method according to an embodiment of the present disclosure. As shown in fig. 1, the character string mining method includes the following steps S101 to S103:
in step S101, a training string data set is obtained, where the training string data set includes training string data and string feature data;
in step S102, training the training character string data set to obtain a target character string judgment model;
in step S103, a target character string determination is performed on the test character string according to the target character string determination model.
Considering that, when a user currently uses a retrieval service provided by a merchant or a service provider, the merchant or the service provider generally takes as the retrieval object either the character string input by the user directly, or the characters or words obtained by segmenting that character string according to a general dictionary. In many cases, the retrieval object does not exist in the segmentation dictionary or the retrieval dictionary, so the retrieval results obtained based on it naturally contain a great deal of noise, and the content desired by the user cannot be retrieved accurately. The hit rate of the retrieval results therefore cannot meet the user's requirements, which reduces the service quality of the merchant or service provider and weakens the user experience.
In this embodiment, a character string mining method is provided, which analyzes and determines whether a character string is a target character string by considering a plurality of character string features and training on character string data; the target character string thus obtained can then be added to a segmentation dictionary or a retrieval dictionary, thereby enriching the contents of those dictionaries and improving the retrieval effectiveness of character strings. Specifically, a training character string data set is first obtained, wherein the training character string data set includes training character string data and character string feature data; the training character string data set is then trained to obtain a target character string judgment model; finally, target character string judgment is performed on a test character string according to the target character string judgment model, and the determined target character string is added to the segmentation dictionary or the retrieval dictionary. This improves the effectiveness of character string segmentation, and further improves the hit rate of retrieval results, the service quality of the merchant or service provider, and the user experience.
In an optional implementation manner of this embodiment, as shown in fig. 2, the step S101, that is, the step of acquiring training string data in the training string data set, includes steps S201 to S204:
in step S201, history string data is acquired;
in step S202, data confirmed as a target character string in the history character string data is used as a training positive sample;
in step S203, data confirmed as a non-target character string in the history character string data is used as a training negative sample;
in step S204, training string data is generated based on the training positive samples and the training negative samples.
In this embodiment, in order to obtain more accurate training data, historical string data in a preset historical time period may be first randomly obtained, where the length of the historical string data may be set according to the needs of practical applications, for example, for a retrieval service based on an internet platform, the length of the historical string data may be set to 2-5 characters; then, verifying the historical character string data, taking the data of the target character string which is confirmed to be subsequently added into the segmentation dictionary or the retrieval dictionary in the historical character string data as a training positive sample, and taking the data of the non-target character string which is confirmed to be not added into the segmentation dictionary or the retrieval dictionary in the historical character string data as a training negative sample; and finally, combining the training positive sample and the training negative sample to form training character string data.
In an optional implementation manner of this embodiment, the character string feature data includes one or more of: the word frequency score of the character string w within a preset historical time period, the mutual information score of the character string w, the information entropy score of the character string w, and whether the character string w is a preset name.
The character string feature data is obtained through multiple tests and verifications, and the character string segmentation effectiveness and the retrieval effectiveness can be improved.
Wherein, the word frequency score f1 of the character string w within a preset historical time period can be expressed as:

f1 = log(count(w))

wherein count(w) is the number of times that the character string w was subjected to preset operations such as retrieval, query and pairing within the preset historical time period.

The higher the word frequency score f1 of the character string w within the preset historical time period, the more often the character string is used.
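As a minimal sketch of the word frequency feature, assuming operation counts have already been aggregated into a dictionary (the count floor of 1 for unseen strings is an assumption for numerical safety, not part of the patent):

```python
import math

def word_frequency_score(operation_counts, w):
    """f1 = log(count(w)), where count(w) is the number of times the
    character string w was retrieved/queried/paired within the preset
    historical time period. Unseen strings get count 1, i.e. score 0."""
    return math.log(max(operation_counts.get(w, 1), 1))
```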
Wherein, the mutual information score f2 of the character string w can be expressed as:

f2 = Σi=1..N-1 log( p(ci, ci+1) / (p(ci)·p(ci+1)) )

where N represents the length of the character string w, c1c2…cN represent the characters in the character string w, p(ci, ci+1) denotes the probability that the two characters ci and ci+1 co-occur, and p(ci) denotes the probability that the character ci occurs.

The higher the mutual information score f2 of the character string w, the greater the compactness of the characters inside the character string.
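A sketch of the mutual information feature under stated assumptions: the patent defines the terms p(ci, ci+1) and p(ci) but its formula appears as an image in the source, so the aggregation below (pointwise mutual information summed over adjacent character pairs) is a reconstruction, and the probability tables are assumed to be precomputed.

```python
import math

def mutual_information_score(w, pair_prob, char_prob):
    """f2: pointwise mutual information summed over adjacent character
    pairs of w, i.e. sum over i of log(p(c_i, c_{i+1}) / (p(c_i) * p(c_{i+1}))).

    pair_prob: maps (c_i, c_{i+1}) -> co-occurrence probability.
    char_prob: maps c -> occurrence probability.
    """
    score = 0.0
    for a, b in zip(w, w[1:]):
        score += math.log(pair_prob[(a, b)] / (char_prob[a] * char_prob[b]))
    return score
```

A higher score means adjacent characters co-occur far more often than independence would predict, i.e. the string is internally compact.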
Wherein, the information entropy score f3 of the character string w can be expressed as:

f3 = Hleft(w) + Hright(w)

wherein Hleft(w) = -Σa∈A p(aw|w)·log p(aw|w) denotes the left information entropy, Hright(w) = -Σb∈B p(wb|w)·log p(wb|w) denotes the right information entropy, A denotes the set of left neighboring words of the character string w, B denotes the set of right neighboring words of the character string w, and p(aw|w) and p(wb|w) denote the conditional probabilities of occurrence of the left neighboring word and the right neighboring word, respectively.

The higher the information entropy score f3 of the character string w, the greater the external flexibility of the character string.
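The entropy feature can be sketched as follows, assuming the conditional neighbor distributions p(aw|w) and p(wb|w) have already been estimated and passed in as dictionaries:

```python
import math

def information_entropy_score(left_neighbor_probs, right_neighbor_probs):
    """f3 = H_left(w) + H_right(w): Shannon entropies of the conditional
    distributions over w's left and right neighboring words.

    Each input maps a neighboring word to its conditional probability
    p(aw|w) or p(wb|w)."""
    def entropy(dist):
        return -sum(p * math.log(p) for p in dist.values() if p > 0)
    return entropy(left_neighbor_probs) + entropy(right_neighbor_probs)
```

A string that appears in many different contexts (high entropy on both sides) is externally flexible and thus a better dictionary-word candidate.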
The feature of whether the character string w is a preset name belongs to an artificial knowledge feature: if a character string w appears as a preset name, it is likely to be a word that needs to be added to a dictionary. The preset name may be, for example, a merchant name, a service provider name, a product name or a service name. This feature may be further divided into two features, namely whether the character string w is a merchant name or a service provider name, and whether the character string w is a product name or a service name. Specifically, the feature of whether the character string w is a merchant name or a service provider name can be expressed as:

f4 = 1 if the character string w is a merchant name or a service provider name, otherwise f4 = 0;

and the feature of whether the character string w is a product name or a service name can be expressed as:

f5 = 1 if the character string w is a product name or a service name, otherwise f5 = 0.
in an optional implementation manner of this embodiment, as shown in fig. 3, the step S102 of training the training character string data set to obtain the target character string judgment model includes steps S301 to S302:
in step S301, training based on the training character string data set to obtain a feature weight value corresponding to character string feature data;
in step S302, a target character string judgment model is generated based on the weight value of the character string feature data.
In this embodiment, a set of feature weight values corresponding to the character string feature data is obtained by training based on the training character string data set using a model training method and an optimization algorithm, and a target character string judgment model is generated based on the weight values of the character string feature data.
The method for model training and the optimization algorithm may be selected by those skilled in the art according to the requirements of practical applications, and the present disclosure does not limit the method specifically.
In an optional implementation manner of this embodiment, as shown in fig. 4, the step S301 of obtaining a feature weight value corresponding to character string feature data based on the training character string data set includes steps S401 to S402:
in step S401, training is performed based on the training character string data set to obtain a feature weight determination model;
in step S402, a feature weight value corresponding to the character string feature data is determined based on the feature weight determination model.
As mentioned above, the present disclosure contemplates a variety of character string feature data that can be used to characterize the necessity of performing a preset operation on a character string, such as adding it to a segmentation dictionary or a retrieval dictionary. However, the contributions of the different character string features to this necessity judgment are not equal; that is, when the plurality of feature data are used to represent the necessity of performing the preset operation on a character string, different features should not be given the same weight but should be treated differently.
Therefore, in this embodiment, the optimal weight distribution over the plurality of feature data is determined by using a training model combined with an optimization algorithm. For example, a logistic regression model, which is simple, efficient and widely used in machine learning, may be trained on the training character string data set to obtain a feature weight determination model, and the feature weight determination model may further be used to obtain the feature weight values corresponding to the plurality of feature data; usually, an optimal set of feature weight values can be obtained through iterative optimization.
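A minimal sketch of fitting the feature weights: the patent names logistic regression but not a specific optimizer, so the plain gradient ascent below (and the absence of a bias term, matching the model form p = 1 / (1 + exp(-Σ λi·fi))) is an illustrative choice, not the patent's prescribed implementation.

```python
import math

def train_feature_weights(samples, learning_rate=0.5, epochs=500):
    """Fit logistic-regression weights lambda_i by gradient ascent on the
    log-likelihood of labeled string samples.

    samples: list of (feature_vector, label) pairs, label in {0, 1}.
    Returns one weight per feature.
    """
    n_features = len(samples[0][0])
    weights = [0.0] * n_features
    for _ in range(epochs):
        for features, label in samples:
            z = sum(l * f for l, f in zip(weights, features))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted target probability
            for i, f in enumerate(features):
                weights[i] += learning_rate * (label - p) * f
    return weights
```

On linearly separable toy data the fitted weight simply grows in the direction that separates positives from negatives; in practice a regularized solver would be preferred.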
In an optional implementation manner of this embodiment, as shown in fig. 5, the step S302 of generating a target character string judgment model based on the weight value of the character string feature data includes steps S501 to S502:
in step S501, a probability calculation model with the character string w as a target character string is generated according to the weight value of the character string feature data;
in step S502, a character string having a probability that meets a preset condition is determined as a target character string.
In this embodiment, the target character string judgment model may include a probability calculation model that computes the probability that the character string w is the target character string, and a part that judges whether the character string w is the target character string according to the obtained probability value. Specifically, the probability calculation model is first generated according to the weight values of the character string feature data obtained above, and may be represented as:

p = 1 / (1 + exp(-Σi λi·fi))

wherein fi represents the i-th feature in the character string feature data, λi denotes the weight value corresponding to the i-th feature fi, and p represents the probability value that the character string w is the target character string. Whether the character string w is a target character string is then judged according to its probability value: a character string whose probability value is greater than a preset probability threshold can be regarded as important, carrying a large amount of information and being effective in preset operations such as retrieval, query and pairing, and is therefore confirmed as a target character string and may be added to the segmentation dictionary or the retrieval dictionary, thereby improving the effectiveness of character string segmentation and the hit rate of retrieval results.
The probability threshold value can be set according to the needs of practical application, and the specific value of the probability threshold value is not specifically limited by the disclosure.
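The probability model and threshold judgment above can be sketched as follows; the default threshold of 0.5 is an arbitrary illustration, since the patent leaves the threshold to the needs of the application.

```python
import math

def target_probability(weights, features):
    # p = 1 / (1 + exp(-sum_i lambda_i * f_i))
    z = sum(l * f for l, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def is_target_string(weights, features, threshold=0.5):
    """Confirm the string as a target string when its probability value
    exceeds the preset probability threshold."""
    return target_probability(weights, features) > threshold
```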
In an optional implementation manner of this embodiment, the test character string is a character string input within a preset historical time period. The test character string is similar to the training character string, and its length may be set according to the needs of the actual application, for example, to 2-5 characters.
In an optional implementation manner of this embodiment, the method further includes a step of performing a preset operation on the target character string, where the preset operation includes: adding into dictionary files such as a segmentation dictionary and a retrieval dictionary, performing retrieval, performing search, performing query, and performing matching.
When the preset operation is adding to a dictionary file such as a segmentation dictionary or a retrieval dictionary, in an optional implementation manner of this embodiment, the method further includes a step of judging whether the test character string already exists in the dictionary file. This step mainly determines whether subsequent target character string judgment is necessary; that is, target character string judgment is performed only on test character strings that do not exist in the dictionary file.
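The pre-filtering step can be sketched as follows (the function name is an illustrative assumption):

```python
def strings_needing_judgment(test_strings, dictionary_entries):
    """Pre-filter before target-string judgment: test strings already
    present in the dictionary file need no further judgment."""
    return [w for w in test_strings if w not in dictionary_entries]
```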
The following are embodiments of the disclosed apparatus that may be used to perform embodiments of the disclosed methods.
Fig. 6 is a block diagram illustrating a structure of a character string mining apparatus according to an embodiment of the present disclosure, which may be implemented as part or all of an electronic device by software, hardware, or a combination of both. As shown in fig. 6, the character string mining device includes:
an obtaining module 601 configured to obtain a training string data set, where the training string data set includes training string data and string feature data;
a training module 602 configured to train the training character string data set to obtain a target character string judgment model;
a judging module 603 configured to perform target string judgment on the test string according to the target string judgment model.
At present, when a user uses a retrieval service provided by a merchant or a service provider, the merchant or service provider generally takes as the retrieval object either the character string input by the user directly, or the words obtained by segmenting it according to a general dictionary. In many cases the retrieval object does not exist in the segmentation dictionary or the retrieval dictionary, so the retrieval result obtained from it naturally contains a lot of noise: the content the user wants cannot be retrieved accurately, the hit rate of the retrieval result cannot meet the user's requirement, the quality of the merchant's or service provider's service is reduced, and the user experience is weakened.
In this embodiment, a character string mining apparatus is provided, which determines whether a certain character string is a target character string by considering various character string features and analyzing training character string data; the target character string so obtained may subsequently be added to a segmentation dictionary or a retrieval dictionary, enriching their contents and improving the retrieval effectiveness of character strings. Specifically, a training character string data set is first obtained by an obtaining module 601, wherein the training character string data set includes training character string data and character string feature data; the training character string data set is then trained by a training module 602 to obtain a target character string judgment model; finally, target character string judgment is performed on a test character string according to the target character string judgment model by a judging module 603. The determined target character string may subsequently be added to the segmentation dictionary or the retrieval dictionary, so that the effectiveness of character string segmentation can be improved, the hit rate of retrieval results further improved, the service quality of the merchant or service provider raised, and the user experience enhanced.
In an optional implementation manner of this embodiment, as shown in fig. 7, the obtaining module 601 includes:
an acquisition submodule 701 configured to acquire history character string data;
a first confirming sub-module 702 configured to take data confirmed as a target character string in the historical character string data as a training positive sample;
a second confirming sub-module 703 configured to take data confirmed as a non-target character string in the historical character string data as a training negative sample;
a first generating sub-module 704 configured to generate training string data based on the training positive samples and the training negative samples.
In this embodiment, in order to obtain more accurate training data, the obtaining sub-module 701 may first randomly obtain historical string data within a preset historical time period, where the length of the historical string data may be set according to the needs of practical applications, for example, for a retrieval service based on an internet platform, the length of the historical string data may be set to 2-5 characters; then, the historical character string data is verified, the first confirmation sub-module 702 takes the data of the target character string which is confirmed to be subsequently added into the segmentation dictionary or the retrieval dictionary in the historical character string data as a training positive sample, and on the contrary, the second confirmation sub-module 703 takes the data of the non-target character string which is confirmed to be not added into the segmentation dictionary or the retrieval dictionary in the historical character string data as a training negative sample; finally, the first generation submodule 704 combines the training positive samples and the training negative samples to form training string data.
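A minimal sketch of how the positive and negative training samples described above might be assembled, assuming the confirmed target strings are available as a set (the function name and the 1/0 labels are assumptions):

```python
def build_training_set(historical_strings, confirmed_targets):
    # Label each historical string: 1 if confirmed as a target string
    # (training positive sample), 0 otherwise (training negative sample)
    return [(s, 1 if s in confirmed_targets else 0) for s in historical_strings]
```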
In an optional implementation manner of this embodiment, the character string feature data includes one or more of: the word frequency score of the character string w in a preset historical time period, the mutual information score of the character string w, the information entropy score of the character string w, and whether the character string w is a preset name.
The character string feature data is obtained through multiple tests and verifications, and the character string segmentation effectiveness and the retrieval effectiveness can be improved.
The word frequency score f1 of the character string w in a preset historical time period can be expressed as:
f1=log(count(w))
wherein count(w) is the number of times that a preset operation such as retrieval, query, or pairing is performed on the character string w within the preset historical time period.
The higher the word frequency score f1 of the character string w in the preset historical time period, the more often the character string is used.
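The word frequency score f1 = log(count(w)) is straightforward to compute; a minimal sketch (function name assumed):

```python
import math

def word_frequency_score(count_w):
    # f1 = log(count(w)); count_w is the number of times w underwent
    # a preset operation (retrieval, query, pairing) in the period
    return math.log(count_w)
```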
The mutual information score f2 of the character string w can be expressed as:
where N represents the length of the character string w, c1c2…cN are the characters of the string w, p(ci, ci+1) denotes the probability of the two characters ci and ci+1 co-occurring, and p(ci) denotes the probability of the character ci occurring.
The higher the mutual information score f2 of the character string w, the tighter the cohesion of the characters inside the character string.
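The patent defines the variables of the mutual information score, but the formula image is not reproduced here; the sketch below assumes the common form, the average pointwise mutual information over adjacent character pairs:

```python
import math

def mutual_information_score(w, p_char, p_pair):
    # Assumed form: f2 = (1/(N-1)) * sum_i log(p(c_i, c_i+1) / (p(c_i) * p(c_i+1)))
    # p_char maps a character to its probability; p_pair maps an adjacent
    # character pair to its co-occurrence probability
    n = len(w)
    if n < 2:
        return 0.0
    total = sum(
        math.log(p_pair[(w[i], w[i + 1])] / (p_char[w[i]] * p_char[w[i + 1]]))
        for i in range(n - 1)
    )
    return total / (n - 1)
```

Under this assumption, statistically independent characters score 0, while characters that co-occur more often than chance score positively, reflecting tighter internal cohesion.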
The information entropy score f3 of the character string w can be expressed as:
f3=Hleft(w)+Hright(w)
where Hleft(w) represents the left information entropy, Hright(w) denotes the right information entropy, A denotes the set of left-neighbor words of the character string w, B denotes the set of right-neighbor words of the character string w, and p(aw|w) and p(wb|w) denote the conditional probabilities of occurrence of a left-neighbor word and a right-neighbor word, respectively.
The higher the information entropy score f3 of the character string w, the greater the external flexibility of the character string.
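The left and right information entropies can be estimated from observed neighbor words; a minimal sketch (function names assumed):

```python
import math
from collections import Counter

def boundary_entropy(neighbors):
    # Shannon entropy of the neighbor-word distribution of the string,
    # estimated from a list of observed neighbor words
    counts = Counter(neighbors)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def information_entropy_score(left_neighbors, right_neighbors):
    # f3 = Hleft(w) + Hright(w)
    return boundary_entropy(left_neighbors) + boundary_entropy(right_neighbors)
```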
The feature of whether the character string w is a preset name is an artificial knowledge feature: if a character string w appears as a preset name, it is likely to be a word that needs to be added to a dictionary. The preset name may be, for example, a merchant name, a service provider name, a product name, or a service name. This feature may be further divided into two features: whether the character string w is a merchant name or a service provider name, and whether the character string w is a product name or a service name. Specifically, the feature of whether the character string w is a merchant name or a service provider name can be expressed as:
the characteristic of whether the character string w is a product name or a service name can be expressed as:
in an optional implementation manner of this embodiment, as shown in fig. 8, the training module 602 includes:
a training submodule 801 configured to train to obtain a feature weight value corresponding to character string feature data based on the training character string data set;
a second generation submodule 802 configured to generate a target character string judgment model based on the weight value of the character string feature data.
In this embodiment, the training submodule 801 trains on the training character string data set, using a model training method and an optimization algorithm, to obtain a set of feature weight values corresponding to the character string feature data; the second generation submodule 802 then generates a target character string judgment model based on these weight values.
The method for model training and the optimization algorithm may be selected by those skilled in the art according to the requirements of practical applications, and the present disclosure does not limit the method specifically.
In an optional implementation manner of this embodiment, as shown in fig. 9, the training sub-module 801 includes:
a training unit 901 configured to perform training based on the training character string data set to obtain a feature weight determination model;
a determining unit 902 configured to determine a feature weight value corresponding to the character string feature data based on the feature weight determination model.
As mentioned above, the present disclosure contemplates a wide variety of character string feature data that may be used to characterize the necessity of performing a preset operation on a character string, such as adding it to a segmentation dictionary or a retrieval dictionary. However, the different feature data contribute differently to this necessity judgment; that is, when a plurality of feature data are used to represent the necessity of performing the preset operation on a character string, the weights of different features should not be treated as equal but should be differentiated.
Therefore, in this embodiment, the training unit 901 determines an optimal weight distribution over the plurality of feature data by combining a training model with an optimization algorithm. For example, a logistic regression model, which is simple, efficient, and widely used in machine learning, may be trained on the training character string data set to obtain a feature weight determination model; the determining unit 902 may then obtain from this model the feature weight values corresponding to the plurality of feature data, and an optimization algorithm may further be applied to obtain an optimal set of feature weight values.
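Since the disclosure leaves the training method and optimization algorithm open, the sketch below assumes a plain logistic regression fitted by stochastic gradient descent to obtain the feature weight values (all names and hyperparameters are assumptions):

```python
import math

def train_feature_weights(samples, lr=0.1, epochs=500):
    # samples: list of (feature_vector, label) pairs, with label 1 for
    # confirmed target strings and 0 for non-target strings
    dim = len(samples[0][0])
    weights = [0.0] * dim
    for _ in range(epochs):
        for x, y in samples:
            z = sum(w * xi for w, xi in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))          # predicted probability
            for i in range(dim):
                weights[i] += lr * (y - p) * x[i]   # gradient ascent on log-likelihood
    return weights
```

The learned weights can then be plugged into the probability calculation model p = 1/(1 + exp(-Σi λi fi)).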
In an optional implementation manner of this embodiment, as shown in fig. 10, the second generation sub-module 802 includes:
a generating unit 1001 configured to generate a probability calculation model in which a character string w is a target character string, according to a weight value of the character string feature data;
a confirming unit 1002 configured to confirm a character string having a probability that meets a preset condition as a target character string.
In this embodiment, the target character string determination model may include a probability calculation model, which scores the character string w as a target character string, and a part that determines whether w is the target character string according to the obtained probability value. Specifically, the generation unit 1001 first generates the probability calculation model according to the weight values of the character string feature data obtained above; the probability calculation model may be expressed as:
wherein fi represents the ith feature in the character string feature data, λi denotes the weight value corresponding to the ith feature fi, and p represents the probability value that the character string w is the target character string. Then, the determining unit 1002 judges whether a certain character string w is a target character string according to the probability value corresponding to it. For example, a character string whose probability value is greater than a preset probability threshold can be regarded as an important character string that carries a large amount of information and is effective for preset operations such as retrieval, query and pairing, and may therefore be determined as a target character string. Such character strings are added into the segmentation dictionary or the retrieval dictionary, so that the effectiveness of character string segmentation can be improved and the hit rate of retrieval results further improved.
The probability threshold value can be set according to the needs of practical application, and the specific value of the probability threshold value is not specifically limited by the disclosure.
In an optional implementation manner of this embodiment, the test character string is a character string input within a preset historical time period, the test character string is similar to the training character string, and the length of the test character string may be set according to the needs of the actual application, for example, may be set to 2-5.
In an optional implementation manner of this embodiment, the apparatus further includes an execution module configured to execute a preset operation on the target character string, where the preset operation includes: adding the target character string into a dictionary file such as a segmentation dictionary or a retrieval dictionary, and performing retrieval, search, query, or matching.
When the preset operation is adding to a dictionary file such as a segmentation dictionary or a retrieval dictionary, in an optional implementation manner of this embodiment, the apparatus further includes a second determining module configured to determine whether the test character string already exists in the dictionary file. The second determining module mainly determines whether the subsequent target-character-string judgment is necessary; that is, the judging module performs the target-character-string judgment only on test character strings that do not exist in the dictionary file.
The present disclosure also discloses an electronic device, fig. 11 shows a block diagram of an electronic device according to an embodiment of the present disclosure, and as shown in fig. 11, the electronic device 1100 includes a memory 1101 and a processor 1102; wherein,
the memory 1101 is configured to store one or more computer instructions, wherein the one or more computer instructions are executed by the processor 1102 to implement:
acquiring a training character string data set, wherein the training character string data set comprises training character string data and character string characteristic data;
training the training character string data set to obtain a target character string judgment model;
and judging the target character string of the test character string according to the target character string judgment model.
The one or more computer instructions are further executable by the processor 1102 to implement:
the acquiring of the training string data set includes:
acquiring historical character string data;
taking the data confirmed as the target character string in the historical character string data as a training positive sample;
taking the data confirmed as the non-target character string in the historical character string data as a training negative sample;
training string data is generated based on the training positive and negative examples.
The character string feature data includes one or more of: the word frequency score of the character string w in a preset historical time period, the mutual information score of the character string w, the information entropy score of the character string w, and whether the character string w is a preset name.
The training of the training character string data set to obtain the target character string judgment model comprises the following steps:
training based on the training character string data set to obtain a characteristic weight value corresponding to character string characteristic data;
and generating a target character string judgment model based on the weight value of the character string characteristic data.
The training based on the training character string data set to obtain the characteristic weight value corresponding to the character string characteristic data comprises the following steps:
training based on the training character string data set to obtain a characteristic weight determination model;
determining a feature weight value corresponding to the character string feature data based on the feature weight determination model.
The generating of the target character string judgment model based on the weight value of the character string feature data includes:
generating a probability calculation model with the character string w as a target character string according to the weight value of the character string characteristic data;
and confirming the character string with the probability meeting the preset condition as a target character string.
The probability calculation model is expressed as:
wherein fi represents the ith feature in the character string feature data, λi denotes the weight value corresponding to the ith feature fi, and p represents the probability value that the character string is the target character string.
Confirming the character string with the probability meeting the preset condition as a target character string, comprising:
and confirming the character strings with the probability greater than a preset probability threshold value as target character strings.
The test character string is a character string input in a preset historical time period.
Further comprising:
and executing preset operation on the target character string.
FIG. 12 is a schematic block diagram of a computer system suitable for use in implementing a string mining method according to an embodiment of the present disclosure.
As shown in fig. 12, the computer system 1200 includes a Central Processing Unit (CPU) 1201, which can execute various processes in the embodiments shown in fig. 1 to 5 described above according to a program stored in a Read Only Memory (ROM) 1202 or a program loaded from a storage section 1208 into a Random Access Memory (RAM) 1203. In the RAM 1203, various programs and data necessary for the operation of the system 1200 are also stored. The CPU 1201, ROM 1202, and RAM 1203 are connected to each other by a bus 1204. An input/output (I/O) interface 1205 is also connected to bus 1204.
The following components are connected to the I/O interface 1205: an input section 1206 including a keyboard, a mouse, and the like; an output portion 1207 including a display device such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 1208 including a hard disk and the like; and a communication section 1209 including a network interface card such as a LAN card, a modem, or the like. The communication section 1209 performs communication processing via a network such as the internet. A driver 1210 is also connected to the I/O interface 1205 as needed. A removable medium 1211, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is mounted on the drive 1210 as necessary, so that a computer program read out therefrom is mounted into the storage section 1208 as necessary.
In particular, the methods described above with reference to fig. 1-5 may be implemented as computer software programs, according to embodiments of the present disclosure. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a medium readable thereby, the computer program comprising program code for performing the string mining method of fig. 1-5. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 1209, and/or installed from the removable medium 1211.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units or modules described in the embodiments of the present disclosure may be implemented by software or hardware. The units or modules described may also be provided in a processor, and the names of the units or modules do not in some cases constitute a limitation of the units or modules themselves.
As another aspect, the present disclosure also provides a computer-readable storage medium, which may be the computer-readable storage medium included in the apparatus in the above-described embodiment; or it may be a separate computer readable storage medium not incorporated into the device. The computer readable storage medium stores one or more programs for use by one or more processors in performing the methods described in the present disclosure.
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the present disclosure is not limited to the specific combination of the above-mentioned features, and also encompasses other technical solutions formed by any combination of the above-mentioned features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) features having similar functions disclosed in this disclosure.
The disclosure discloses A1, a character string mining method, the method comprises: acquiring a training character string data set, wherein the training character string data set comprises training character string data and character string characteristic data; training the training character string data set to obtain a target character string judgment model; and judging the target character string of the test character string according to the target character string judgment model. A2, the acquiring training string data in the training string data set according to the method of A1, comprising: acquiring historical character string data; taking the data confirmed as the target character string in the historical character string data as a training positive sample; taking the data confirmed as the non-target character string in the historical character string data as a training negative sample; training string data is generated based on the training positive and negative examples. A3, according to the method of A1, the character string feature data includes: the word frequency score of the character string w in a preset historical time period, the mutual information score of the character string w, the information entropy score of the character string w and whether the character string w is one or more of preset names. A4, according to the method of A1, training the training character string data set to obtain a target character string judgment model, including: training based on the training character string data set to obtain a characteristic weight value corresponding to character string characteristic data; and generating a target character string judgment model based on the weight value of the character string characteristic data. 
A5, according to the method in A4, the training based on the training character string data set to obtain the feature weight value corresponding to the character string feature data includes: training based on the training character string data set to obtain a characteristic weight determination model; determining a feature weight value corresponding to the character string feature data based on the feature weight determination model. A6, according to the method of A4, the generating a target character string judgment model based on the weight values of the character string feature data includes: generating a probability calculation model with the character string w as a target character string according to the weight value of the character string characteristic data; and confirming the character string with the probability meeting the preset condition as a target character string. A7, according to the method of A6, the probability computation model is represented as:
wherein fi represents the ith feature in the character string feature data, λi denotes the weight value corresponding to the ith feature fi, and p represents the probability value that the character string is the target character string. A8, according to the method in A6, confirming the character string with the probability meeting the preset condition as the target character string, including: and confirming the character strings with the probability greater than a preset probability threshold value as target character strings. A9, according to the method in A1, the test character string is the character string input in the preset historical time period. A10, the method of A1, further comprising: and executing preset operation on the target character string.
The present disclosure discloses B11, a character string mining device, the device includes: an acquisition module configured to acquire a training string data set, wherein the training string data set includes training string data and string feature data; the training module is configured to train the training character string data set to obtain a target character string judgment model; and the judging module is configured to judge the target character string of the test character string according to the target character string judging model. B12, the apparatus of B11, the obtaining module comprising: an acquisition sub-module configured to acquire history string data; a first confirming submodule configured to take data confirmed as a target character string in the historical character string data as a training positive sample; a second confirming sub-module configured to take data confirmed as a non-target character string in the historical character string data as a training negative sample; a first generation submodule configured to generate training string data based on the training positive samples and the training negative samples. B13, the device according to B11, the character string feature data includes: the word frequency score of the character string w in a preset historical time period, the mutual information score of the character string w, the information entropy score of the character string w and whether the character string w is one or more of preset names. B14, the apparatus of B11, the training module comprising: a training submodule configured to train based on the training character string data set to obtain a feature weight value corresponding to character string feature data; a second generation sub-module configured to generate a target character string judgment model based on the weight value of the character string feature data. 
B15, the apparatus according to B14, the training submodule comprising: a training unit configured to perform training based on the training string data set, resulting in a feature weight determination model; a determination unit configured to determine a feature weight value corresponding to the character string feature data based on the feature weight determination model. B16, the apparatus according to B14, the second generation submodule includes: a generating unit configured to generate a probability calculation model in which the character string w is a target character string according to the weight value of the character string feature data; and the confirming unit is configured to confirm the character string with the probability meeting the preset condition as the target character string. B17, according to the device of B16, the probability calculation model is represented as:
wherein fi represents the ith feature in the character string feature data, λi denotes the weight value corresponding to the ith feature fi, and p represents the probability value that the character string is the target character string. B18, the device according to B16, the confirming unit is configured to confirm the character string with the probability larger than the preset probability threshold value as the target character string. B19, according to the device of B11, the test character string is the character string input in a preset historical time period. B20, the apparatus according to B11, further comprising: and the execution module is configured to execute preset operation on the target character string.
The present disclosure discloses C21, an electronic device comprising a memory and a processor; wherein the memory is to store one or more computer instructions, wherein the one or more computer instructions are to be executed by the processor to implement the method of any one of A1-A10.
The present disclosure also discloses D22, a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the method as recited in any of a1-a 10.

Claims (10)

1. A method for mining character strings, the method comprising:
acquiring a training character string data set, wherein the training character string data set comprises training character string data and character string characteristic data;
training the training character string data set to obtain a target character string judgment model;
and judging the target character string of the test character string according to the target character string judgment model.
2. The method of claim 1, wherein obtaining training string data in the set of training string data comprises:
acquiring historical character string data;
taking the data confirmed as the target character string in the historical character string data as a training positive sample;
taking the data confirmed as the non-target character string in the historical character string data as a training negative sample;
training string data is generated based on the training positive and negative examples.
3. The method of claim 1, wherein the character string feature data comprises one or more of: the word frequency score of the character string w in a preset historical time period, the mutual information score of the character string w, the information entropy score of the character string w, and whether the character string w is a preset name.
4. The method of claim 1, wherein training the training string dataset to obtain the target string judgment model comprises:
training based on the training character string data set to obtain a feature weight value corresponding to the character string feature data;
and generating a target character string judgment model based on the weight value of the character string characteristic data.
5. The method of claim 4, wherein training based on the training string data set to obtain the feature weight value corresponding to the string feature data comprises:
training based on the training character string data set to obtain a feature weight determination model;
determining a feature weight value corresponding to the character string feature data based on the feature weight determination model.
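One way to realize the weight training of claims 4 and 5 is logistic-regression fitting by gradient ascent on the log-likelihood; the patent does not specify the training algorithm, and the learning rate and epoch count below are illustrative assumptions:

```python
import math

def train_weights(samples, n_features, lr=0.1, epochs=200):
    """Fit weight values lambda_i for the string feature data.
    samples: list of (feature_vector, label) with label in {0, 1}.
    lr and epochs are assumed hyperparameters, not from the patent."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for f, y in samples:
            # Current model probability for this sample
            p = 1.0 / (1.0 + math.exp(-sum(wi * fi for wi, fi in zip(w, f))))
            # Gradient of the log-likelihood: (y - p) * f_i
            for i in range(n_features):
                w[i] += lr * (y - p) * f[i]
    return w

# Separable toy data: positive feature -> target string
weights = train_weights([([1.0], 1), ([-1.0], 0)], 1)
```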
6. The method according to claim 4, wherein generating a target character string judgment model based on the weight values of the character string feature data comprises:
generating a probability calculation model with the character string w as a target character string according to the weight value of the character string characteristic data;
and confirming the character string with the probability meeting the preset condition as a target character string.
7. The method of claim 6, wherein the probabilistic computational model is represented as:
p = 1 / (1 + exp(-Σ_i λ_i f_i))
wherein f_i denotes the i-th feature in the character string feature data, λ_i denotes the weight value corresponding to the i-th feature f_i, and p denotes the probability value that the character string is the target character string.
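The probability computation model of claim 7, combined with the thresholding of claim 6, can be sketched as below. The 0.5 default threshold is an assumption; the patent only says the threshold is preset:

```python
import math

def target_probability(features, weights):
    """p = 1 / (1 + exp(-sum_i lambda_i * f_i)), as in claim 7."""
    z = sum(l * f for l, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def is_target(features, weights, threshold=0.5):
    """Claim 6 / B18: confirm the string as a target character string
    when its probability exceeds the preset threshold."""
    return target_probability(features, weights) > threshold
```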
8. A character string mining apparatus, characterized in that the apparatus comprises:
an acquisition module configured to acquire a training character string data set, wherein the training character string data set comprises training character string data and character string feature data;
a training module configured to train the training character string data set to obtain a target character string judgment model;
and a judging module configured to perform target character string judgment on a test character string according to the target character string judgment model.
9. An electronic device comprising a memory and a processor; wherein,
the memory is to store one or more computer instructions, wherein the one or more computer instructions are to be executed by the processor to implement the method of any one of claims 1-7.
10. A computer-readable storage medium having stored thereon computer instructions, which when executed by a processor, implement the method of any one of claims 1-7.
CN201711230875.0A 2017-11-29 2017-11-29 Character string method for digging, device, electronic equipment and computer-readable recording medium Pending CN107992570A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711230875.0A CN107992570A (en) 2017-11-29 2017-11-29 Character string method for digging, device, electronic equipment and computer-readable recording medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711230875.0A CN107992570A (en) 2017-11-29 2017-11-29 Character string method for digging, device, electronic equipment and computer-readable recording medium

Publications (1)

Publication Number Publication Date
CN107992570A true CN107992570A (en) 2018-05-04

Family

ID=62034309

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711230875.0A Pending CN107992570A (en) 2017-11-29 2017-11-29 Character string method for digging, device, electronic equipment and computer-readable recording medium

Country Status (1)

Country Link
CN (1) CN107992570A (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102043845A (en) * 2010-12-08 2011-05-04 百度在线网络技术(北京)有限公司 Method and equipment for extracting core keywords based on query sequence cluster
US8326861B1 (en) * 2010-06-23 2012-12-04 Google Inc. Personalized term importance evaluation in queries
CN104866496A (en) * 2014-02-22 2015-08-26 腾讯科技(深圳)有限公司 Method and device for determining morpheme significance analysis model
CN104978356A (en) * 2014-04-10 2015-10-14 阿里巴巴集团控股有限公司 Synonym identification method and device
CN106649666A (en) * 2016-11-30 2017-05-10 浪潮电子信息产业股份有限公司 Left-right recursion-based new word discovery method


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108933781A (en) * 2018-06-19 2018-12-04 上海点融信息科技有限责任公司 Method, apparatus and computer readable storage medium for processing character string
CN108933781B (en) * 2018-06-19 2021-07-02 上海点融信息科技有限责任公司 Method, apparatus and computer-readable storage medium for processing character string
CN112629821A (en) * 2020-11-17 2021-04-09 中国移动通信集团江苏有限公司 Optical cable position determining method and device, electronic equipment and storage medium
CN112629821B (en) * 2020-11-17 2023-10-27 中国移动通信集团江苏有限公司 Method and device for determining optical cable position, electronic equipment and storage medium
CN113361238A (en) * 2021-05-21 2021-09-07 北京语言大学 Method and device for automatically proposing question by recombining question types with language blocks
CN113361238B (en) * 2021-05-21 2022-02-11 北京语言大学 Method and device for automatically proposing question by recombining question types with language blocks

Similar Documents

Publication Publication Date Title
CN105678587B (en) Recommendation feature determination method, information recommendation method and device
CN109858528B (en) Recommendation system training method and device, computer equipment and storage medium
US20180285969A1 (en) Predictive model training and selection for consumer evaluation
CN109685537B (en) User behavior analysis method, device, medium and electronic equipment
CN111611390B (en) Data processing method and device
CN107992570A (en) Character string method for digging, device, electronic equipment and computer-readable recording medium
CN113407854A (en) Application recommendation method, device and equipment and computer readable storage medium
CN109685574A (en) Data determination method and device, electronic equipment and computer readable storage medium
CN103577547B (en) Webpage type identification method and device
CN110209782B (en) Question-answering model and answer sentence generation method and device, medium and electronic equipment
CN110675250A (en) Credit line management method and device based on user marketing score and electronic equipment
CN111538909A (en) Information recommendation method and device
US11688393B2 (en) Machine learning to propose actions in response to natural language questions
CN109344347B (en) Display control method, display control device, electronic equipment and computer-readable storage medium
CN107844584A (en) Usage mining method, apparatus, electronic equipment and computer-readable recording medium
US8257091B2 (en) Matching learning objects with a user profile using top-level concept complexity
CN113627513A (en) Training data generation method and system, electronic device and storage medium
CN107885879A (en) Semantic analysis, device, electronic equipment and computer-readable recording medium
CN110674491B (en) Method and device for real-time evidence obtaining of android application and electronic equipment
CN110717537B (en) Method and device for training user classification model and executing user classification prediction
US9122705B1 (en) Scoring hash functions
CN113849618B (en) Strategy determination method and device based on knowledge graph, electronic equipment and medium
CN113762647A (en) Data prediction method, device and equipment
CN113391988A (en) Method and device for losing user retention, electronic equipment and storage medium
CN112925982A (en) User redirection method and device, storage medium and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180504