CN111813905B

CN111813905B - Corpus generation method, corpus generation device, computer equipment and storage medium

Info

Publication number: CN111813905B
Application number: CN202010555008.XA
Authority: CN
Inventors: 黎旭东; 林桂
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2020-06-17
Filing date: 2020-06-17
Publication date: 2024-05-10
Anticipated expiration: 2040-06-17
Also published as: WO2021120588A1; CN111813905A

Abstract

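The invention relates to the field of artificial intelligence and discloses a corpus generation method, a corpus generation device, computer equipment and a storage medium. The corpus generation method comprises the following steps: acquiring consultation text and response text related to vaccines from a medical consultation library as initial text; performing data cleaning on the initial text to obtain original corpus data; clustering the original corpus data with a K-means clustering model to obtain coarse-granularity clustered corpora of at least two clusters; and, for each cluster of coarse-granularity clustered corpus, performing secondary clustering with a density clustering algorithm and taking the resulting density clustered corpus as the target corpus. The invention further relates to blockchain technology: the obtained target corpus is stored in a blockchain network. The method improves the accuracy of the target corpus for vaccine questions and answers.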

Description

Corpus generation method, corpus generation device, computer equipment and storage medium

Technical Field

The present invention relates to the field of artificial intelligence, and in particular, to a corpus generating method, apparatus, computer device, and storage medium.

Background

As living standards improve, more people pay attention to their own health, and vaccine-related questions have become a frequent health topic. To relieve the pressure on hospital consultation windows, some hospitals have adopted intelligent robot service systems that give consultants effective feedback through intelligent question-answering robots. Before being put into use, such a robot must be trained on a large amount of corpus from the relevant field to improve its question-answering accuracy. Understandably, a vaccine question-answering robot therefore requires a large amount of vaccine-related corpus for training.

Currently, vaccine-related corpus is obtained mainly by crawling relevant sites with web crawlers and then selecting corpus through regular-expression matching and keyword extraction. When corpus selected in this way is used to train a question-answering robot, the accuracy of the robot falls far short of requirements: its responses are often inaccurate, which also harms the user experience. How to obtain highly accurate training corpus has therefore become a problem that urgently needs to be solved.

Disclosure of Invention

The embodiment of the invention provides a corpus generation method, a corpus generation device, computer equipment and a storage medium, which are used for improving the accuracy of generating training corpus of a vaccine question-answering robot.

In order to solve the above technical problems, an embodiment of the present application provides a corpus generating method, including:

acquiring consultation text and response text related to the vaccine from a medical consultation library as initial text;

performing data cleaning on the initial text to obtain original corpus data;

clustering the original corpus data by adopting a K-means clustering model to obtain coarse-granularity clustered corpora of at least two clusters;

for each cluster of coarse-granularity clustered corpus, performing secondary clustering on the coarse-granularity clustered corpus through a density clustering algorithm, and taking the obtained density clustered corpus as a target corpus.

Optionally, the acquiring the consultation text and the response text related to the vaccine from the medical consultation library includes:

determining the page weight of each preset path in the medical consultation library in a link analysis mode;

determining a target page according to the page weight of each preset path;

calculating a page ranking value of each target page based on a preset page ranking strategy, and sequencing the target pages according to the order of the page ranking values from large to small to obtain a target page queue;

and grabbing contents in the target page based on the target page queue to obtain consultation text and response text related to the vaccine.

Optionally, for each cluster of coarse-granularity clustered corpus, performing secondary clustering processing on the coarse-granularity clustered corpus by using a density clustering algorithm, and taking the obtained density clustered corpus as the target corpus comprises:

acquiring a preset scanning radius eps and a preset minimum inclusion point number minPts;

for each piece of corpus data in the coarse-granularity clustered corpus, counting the number of other corpus data contained within the preset scanning radius eps of that corpus data, and taking the number as the number of neighborhood points corresponding to the corpus data;

taking corpus data whose number of neighborhood points is greater than or equal to the preset minimum inclusion point number minPts as core points;

taking corpus data whose number of neighborhood points is smaller than the preset minimum inclusion point number minPts and which lies within the preset scanning radius eps of a core point as boundary points;

and connecting boundary points whose distance does not exceed the preset scanning radius eps to form a density cluster, and adding the core points within the range of the density cluster into the density cluster to obtain the target corpus.

Optionally, after clustering the raw corpus data by using a K-means clustering model to obtain coarse-granularity clustered corpora of at least two clusters, and performing secondary clustering on the coarse-granularity clustered corpora by using a density clustering algorithm for each cluster of coarse-granularity clustered corpora, and before taking the obtained density clustered corpora as a target corpus, the corpus generating method further includes:

setting a different category label for the coarse-granularity clustered corpus of each cluster, and storing the coarse-granularity clustered corpus of each cluster, the category labels, and the correspondence between the coarse-granularity clustered corpus of each cluster and its category label in an Elasticsearch search engine.

Optionally, after performing secondary clustering processing on the coarse-granularity clustered corpus by a density clustering algorithm for each cluster of coarse-granularity clustered corpus, and taking the obtained density clustered corpus as the target corpus, the corpus generation method further includes:

acquiring a preset threshold, and aggregating the target corpus with the Elasticsearch search engine according to the preset threshold to obtain a clustering result;

and selecting non-relevant corpus according to the clustering result, and removing the non-relevant corpus to obtain updated target corpus.

Optionally, after performing secondary clustering processing on the coarse-granularity clustered corpus by a density clustering algorithm for each cluster of coarse-granularity clustered corpus, taking the obtained density clustered corpus as a target corpus, the method further includes: storing the target corpus in a blockchain network node.

In order to solve the above technical problem, an embodiment of the present application further provides a corpus generating device, including:

the data acquisition module is used for acquiring consultation texts and response texts related to the vaccine from the medical consultation library as initial texts;

The data cleaning module is used for cleaning the data of the initial text to obtain original corpus data;

the coarse-granularity clustering module is used for clustering the original corpus data by adopting a K-means clustering model to obtain coarse-granularity clustered corpora of at least two clusters;

The corpus determining module is used for carrying out secondary clustering processing on the coarse-granularity clustered corpuses through a density clustering algorithm aiming at each cluster of coarse-granularity clustered corpuses, and taking the obtained density clustered corpuses as target corpuses.

Optionally, the data acquisition module includes:

the link analysis unit is used for determining the page weight of each preset path in the medical consultation library in a link analysis mode;

The target page determining unit is used for determining a target page according to the page weight of each preset path;

the page ordering unit is used for calculating the page ranking value of each target page based on a preset page ranking strategy, and ordering the target pages according to the order of the page ranking values from large to small to obtain a target page queue;

And the content acquisition unit is used for capturing the content in the target page based on the target page queue to obtain the consultation text and the response text related to the vaccine.

Optionally, the corpus determining module includes:

a preset parameter obtaining unit, configured to obtain a preset scanning radius eps and a preset minimum inclusion point minPts;

the neighborhood point number determining unit is used for counting, for each piece of corpus data in the coarse-granularity clustered corpus, the number of other corpus data contained within the preset scanning radius eps of that corpus data, and taking the number as the number of neighborhood points corresponding to the corpus data;

the core point determining unit is used for taking corpus data whose number of neighborhood points is greater than or equal to the preset minimum inclusion point number minPts as core points;

the boundary point determining unit is used for taking corpus data whose number of neighborhood points is smaller than the preset minimum inclusion point number minPts and which lies within the preset scanning radius eps of a core point as boundary points;

the target corpus acquisition unit is used for connecting boundary points whose distance does not exceed the preset scanning radius eps to form a density cluster, and adding the core points within the range of the density cluster into the density cluster to obtain the target corpus.

Optionally, the corpus generating device further includes:

the first storage module is used for setting a different category label for the coarse-granularity clustered corpus of each cluster, and storing the coarse-granularity clustered corpus of each cluster, the category labels, and the correspondence between the coarse-granularity clustered corpus of each cluster and its category label in the Elasticsearch search engine.

Optionally, the corpus generating device further includes:

the aggregation module is used for acquiring a preset threshold, and aggregating the target corpus with the Elasticsearch search engine according to the preset threshold to obtain a clustering result;

And the updating module is used for selecting the non-relevant corpus according to the clustering result, and removing the non-relevant corpus to obtain the updated target corpus.

Optionally, the corpus generating device further includes:

And the second storage module is used for storing the target corpus in the blockchain network node.

In order to solve the above technical problem, an embodiment of the present application further provides a computer device, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where the processor implements the steps of the corpus generating method when executing the computer program.

In order to solve the above technical problem, an embodiment of the present application further provides a computer readable storage medium, where a computer program is stored, where the computer program when executed by a processor implements the steps of the corpus generation method.

According to the corpus generation method, device, computer equipment and storage medium provided by the embodiments of the invention, consultation text and response text related to vaccines are obtained from the medical consultation library as initial text, and the initial text is cleaned to obtain original corpus data. A K-means clustering model then clusters the original corpus data into coarse-granularity clustered corpora of at least two clusters, and each cluster of coarse-granularity clustered corpus undergoes secondary clustering through a density clustering algorithm. This multi-level clustering yields a more accurate classification. The obtained density clustered corpus is used as the target corpus, so the target corpus is classified more accurately and its accuracy for vaccine questions and answers is improved.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;

FIG. 2 is a flow chart of one embodiment of a corpus generation method of the present application;

FIG. 3 is a schematic diagram of the structure of one embodiment of a corpus generating device according to the present application;

FIG. 4 is a schematic structural diagram of one embodiment of a computer device in accordance with the present application.

Detailed Description

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the applications herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "comprising" and "having" and any variations thereof in the description of the application and the claims and the description of the drawings above are intended to cover a non-exclusive inclusion. The terms first, second and the like in the description and in the claims or in the above-described figures, are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.

The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Referring to fig. 1, as shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.

The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like.

The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and so on.

The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.

It should be noted that, the corpus generating method provided by the embodiment of the present application is executed by a server, and accordingly, the corpus generating device is disposed in the server.

It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. Any number of terminal devices, networks and servers may be provided according to implementation requirements, and the terminal devices 101, 102, 103 in the embodiment of the present application may specifically correspond to application systems in actual production.

Referring to fig. 2, fig. 2 shows a corpus generating method according to an embodiment of the present invention, and the method is applied to the server in fig. 1 for illustration, and is described in detail as follows:

S201: acquiring consultation text and response text related to the vaccine from the medical consultation library as initial text.

Specifically, the medical consultation library is queried with preset keywords to obtain consultation text and response text related to vaccines, which serve as the initial text.

The preset keywords may specifically be words or phrases covering the vaccination time, the vaccination procedure, precautions, the populations for which a vaccine is suitable, and the like.

The medical consultation library refers to a resource library for storing information (text information and voice information) of vaccine-related questions consulted by a network or telephone.

Preferably, to facilitate querying, the voice information can be converted into text information by a third-party speech-to-text tool, and the text information is then stored in the medical consultation library.

It should be noted that the medical consultation library in this embodiment corresponds to a plurality of site pages, which allow the recorded medical consultation information to be queried and read.

Preferably, the crawler method is adopted in the embodiment, the consultation text and the response text related to the vaccine are quickly and accurately crawled from the site pages of the medical consultation library, the acquisition speed of the initial text is improved, and the generation efficiency of the training corpus is improved.

S202: performing data cleaning on the initial text to obtain original corpus data.

Specifically, the obtained initial text contains punctuation marks, text formatting, invalid expressions, pictures and the like, so it must be cleaned before any further data processing.

Data cleaning includes, but is not limited to: removing punctuation and pictures, segmenting the text, extracting key sentences, and the like.
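The cleaning operations listed above can be illustrated with a minimal sketch. The function name `clean_text`, the regular expressions and the sample string are assumptions introduced here for illustration only; the embodiment does not prescribe a concrete cleaning routine, and key-sentence extraction is not shown.

```python
import re


def clean_text(raw: str) -> list[str]:
    """Illustrative cleaning: drop markup/pictures, split sentences, strip punctuation."""
    # Remove HTML-like tags, which also removes embedded picture tags.
    text = re.sub(r"<[^>]+>", " ", raw)
    # Split the text into sentences on common Western and Chinese end marks.
    sentences = re.split(r"[.!?。！？]+", text)
    cleaned = []
    for sentence in sentences:
        # Replace any remaining punctuation with spaces, then normalise whitespace.
        sentence = re.sub(r"[^\w\s]", " ", sentence)
        sentence = re.sub(r"\s+", " ", sentence).strip().lower()
        if sentence:  # discard empty fragments
            cleaned.append(sentence)
    return cleaned


# Hypothetical crawled snippet containing markup and a picture.
sample = "<p>When should the <b>HPV</b> vaccine be given?<img src='a.png'> Ask a doctor!</p>"
print(clean_text(sample))
```

Each returned string is one cleaned sentence that could then be passed on to the vectorization step.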

Further, vectorization is carried out on the text after data cleaning, and the obtained word vector is used as original corpus data.

Specifically, the text after data cleaning is mapped into a vector, and the vectors are linked together to form a word vector space, and each vector is equivalent to a point in the space.

For example, an automobile sales company has two keywords, BMW and Benz, in its product names. According to a preset corpus, all possible categories of the two keywords are obtained: "automobile", "luxury", "animal", "action" and "food". A vector representation over these categories is therefore introduced for both keywords:

<automobile, luxury, animal, action, food>

The probability that each keyword belongs to each category is calculated by a statistical learning method. The probabilities learned by the computer are:

BMW = <0.5, 0.2, 0.2, 0.0, 0.1>

Benz = <0.7, 0.2, 0.0, 0.1, 0.0>

It will be appreciated that the value of each dimension of the base word vector represents a feature that has some semantic and grammatical interpretability, and that each dimension of the base word vector may be referred to as a keyword feature.
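To make the BMW/Benz example concrete, the following sketch treats the two vectors as points in the five-dimensional category space and measures how close they are. The choice of Euclidean distance and cosine similarity is an assumption made here for illustration; the embodiment only states that each vector is a point in the vector space.

```python
import math

CATEGORIES = ("automobile", "luxury", "animal", "action", "food")
BMW = (0.5, 0.2, 0.2, 0.0, 0.1)
BENZ = (0.7, 0.2, 0.0, 0.1, 0.0)


def euclidean(a, b):
    """Straight-line distance between two points of the vector space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def dominant_category(vector):
    """Category whose dimension carries the largest value."""
    return CATEGORIES[max(range(len(vector)), key=lambda i: vector[i])]


print(round(euclidean(BMW, BENZ), 4))
print(round(cosine_similarity(BMW, BENZ), 4))
print(dominant_category(BMW), dominant_category(BENZ))
```

A small distance (or a cosine similarity close to 1) indicates that the two keywords are semantically close, which is the property the later clustering steps rely on.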

It should be noted that, in this embodiment, the unit represented by a word vector may be a segmented word, a phrase, or a question-answer sentence pair, which is not described in detail here.

S203: clustering the original corpus data by adopting a K-means clustering model to obtain coarse-granularity clustered corpora of at least two clusters.

Specifically, a K-means clustering model is adopted to perform clustering processing on original corpus data, and the original corpus data corresponding to each clustering center is used as a cluster of coarse-granularity clustered corpus to obtain at least two clusters of coarse-granularity clustered corpus.

The coarse-granularity clustered corpus refers to clustered corpus of low precision: the items in a cluster share some common semantics, but their final meanings are not necessarily the same. For example, consider two pieces of original corpus data, "I got hungry shortly after eating" and "My stomach hurts a little after eating". After clustering by the K-means clustering model, the two pieces are placed in the same cluster even though they describe different problems, so they belong to a coarse-granularity clustered corpus. To ensure classification accuracy, the coarse-granularity clustered corpus must subsequently be classified more finely.

The K-means algorithm is a distance-based clustering algorithm that uses distance as the measure of similarity: the closer two objects are, the more similar they are considered to be. The algorithm regards a cluster as a group of objects that lie close together, so its final objective is to obtain compact and mutually independent clusters.
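The distance-based behaviour described above can be sketched with a small pure-Python version of the usual K-means iteration (assign each point to the nearest center, then move each center to the mean of its members). The initialisation from evenly spaced input points, the fixed iteration count and the sample vectors are simplifying assumptions for illustration, not details taken from the embodiment.

```python
import math


def kmeans(points, k, iterations=20):
    """Return one cluster index per point using the standard assign/update loop."""
    # Assumed initialisation: take k evenly spaced input points as starting centers.
    step = len(points) // k
    centers = [list(points[i * step]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the nearest center (Euclidean distance).
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Update step: each center moves to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old center if a cluster happens to be empty
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


# Hypothetical two-dimensional corpus vectors forming two obvious groups.
vectors = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.1), (5.0, 5.1), (5.1, 4.9), (4.9, 5.0)]
print(kmeans(vectors, k=2))
```

Each distinct label corresponds to one cluster of coarse-granularity clustered corpus; in practice the number of clusters k is a hyperparameter to be tuned.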

S204: for each cluster of coarse-granularity clustered corpus, performing secondary clustering on the coarse-granularity clustered corpus through a density clustering algorithm, and taking the obtained density clustered corpus as the target corpus.

Specifically, vaccine question answering is highly specialized, so it requires training corpus that is classified more finely and more accurately. Because of the functional limitations of the K-means algorithm, it cannot perfectly cluster every type of vaccine question. The K-means clustering algorithm is therefore used only for coarse-granularity clustering of the original corpus. By adjusting the hyperparameters of the algorithm during clustering, suitable text clusters can be obtained in which the texts share a certain similarity and each cluster roughly represents one type of vaccine question. For example, the different ways of asking about the inoculation time of a certain vaccine can be concentrated in the same cluster, and questions about inoculation time that point in different directions can then be separated out of that cluster. To further improve the fineness of the classification and the accuracy of the corpus with respect to vaccine questions, this embodiment applies a density clustering algorithm to perform secondary clustering on the coarse-granularity clustered corpus, and the obtained density clustered corpus is used as the target corpus.

Preferably, the density clustering algorithm adopted in this embodiment is DBSCAN, and specifically, the process of performing secondary clustering by using DBSCAN may refer to the description of the subsequent embodiments, and in order to avoid repetition, no description is repeated here.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a representative density-based clustering algorithm. Unlike partitioning and hierarchical clustering methods, it defines a cluster as a maximal set of density-connected points, so it can divide regions of sufficiently high density into clusters and find clusters of arbitrary shape in a noisy spatial database.

In this embodiment, consultation text and response text related to vaccines are obtained from the medical consultation library as initial text, and the initial text is cleaned to obtain original corpus data. A K-means clustering model then clusters the original corpus data into coarse-granularity clustered corpora of at least two clusters, and each cluster of coarse-granularity clustered corpus undergoes secondary clustering through a density clustering algorithm. This multi-level clustering yields a more accurate classification. The obtained density clustered corpus is used as the target corpus, so the target corpus is classified more accurately and its accuracy for vaccine questions and answers is improved.

In an embodiment, after the target corpus is obtained, the target corpus is stored in blockchain network nodes. Blockchain storage allows the data to be shared among different platforms and protects it against tampering.

Blockchain is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked to one another by cryptographic methods, where each data block contains the information of a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, and an application service layer.
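The tamper-evidence property mentioned above can be illustrated with a toy hash-linked chain. This is only a sketch of cryptographically linked blocks: it has no network, no consensus mechanism and no peers, and the names `make_block` and `is_valid` as well as the placeholder corpus strings are hypothetical, not part of the embodiment.

```python
import hashlib
import json


def make_block(corpus_batch, previous_hash):
    """Build a block whose hash covers its data and the previous block's hash."""
    payload = json.dumps(
        {"data": corpus_batch, "prev": previous_hash},
        sort_keys=True,
        ensure_ascii=False,
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return {"data": corpus_batch, "prev": previous_hash, "hash": digest}


def is_valid(chain):
    """Check every stored hash and every link to the preceding block."""
    for index, block in enumerate(chain):
        if block["hash"] != make_block(block["data"], block["prev"])["hash"]:
            return False  # block content no longer matches its hash
        if index > 0 and block["prev"] != chain[index - 1]["hash"]:
            return False  # link to the previous block is broken
    return True


# Placeholder target-corpus batches (illustrative strings only).
genesis = make_block(["vaccination time question / answer"], "0" * 64)
second = make_block(["vaccination procedure question / answer"], genesis["hash"])
chain = [genesis, second]

valid_before = is_valid(chain)
chain[0]["data"][0] = "tampered corpus entry"  # simulate tampering with stored corpus
valid_after = is_valid(chain)
print(valid_before, valid_after)
```

Because each hash depends on the block content and on the previous hash, any change to stored corpus data is detected when the chain is re-verified.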

In some optional implementations of this embodiment, in step S201, the acquiring, as the initial text, the vaccine-related consultation text and the response text from the medical consultation library includes:

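determining the page weight of each preset path in the medical consultation library in a link analysis mode;

determining a target page according to the page weight of each preset path;

calculating a page ranking value of each target page based on a preset page ranking strategy, and sorting the target pages by page ranking value from large to small to obtain a target page queue;

and grabbing contents in the target page based on the target page queue to obtain the consultation text and the response text related to the vaccine.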

Specifically, a plurality of preset paths are stored in the medical consultation library in advance, each preset path holds one or more pages, and the corresponding information is obtained by crawling the page content. Before crawling, link analysis is performed on the site to be crawled to determine the weight of each site page, so that the target pages to be crawled can later be selected by weight. A reference weight is preset in the server; when the calculated weight of a page is greater than the preset reference weight, the page is considered worth crawling and is determined to be a target page. Further, the page ranking value of each target page is calculated according to the preset page ranking strategy, and the target pages are sorted by page ranking value from large to small to obtain the target page queue. The content of the target pages is then crawled in the order of the target page queue to obtain the vaccine-related consultation text and response text contained in the target pages.

Link analysis refers to analyzing the basic features of the page corresponding to each preset path in the medical consultation library. In this embodiment, the basic features selected for analysis include, but are not limited to: vaccine relevance, network topology, and page content.

Network topology analysis covers data such as the external links, hierarchy and level of a web page.

Page content analysis covers content feature data such as the appearance and text of a web page.

In this embodiment, three analysis results are obtained through vaccine-related text analysis, network topology analysis and page content analysis, and the three results are evaluated together to obtain the page weight of the page. The comprehensive evaluation may be implemented with a preset weighting formula or set according to actual needs, and is not limited here.
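One possible form of the preset weighting formula is a weighted sum of the three analysis scores, compared against the preset reference weight to select target pages. The coefficients, the reference weight of 0.6, the score values and the paths below are all hypothetical values chosen for illustration; the embodiment leaves the formula open.

```python
# Assumed coefficients for the three analysis results (they sum to 1.0).
FEATURE_WEIGHTS = {"vaccine_relevance": 0.5, "topology": 0.3, "content": 0.2}
# Assumed preset reference weight: pages above it are considered worth crawling.
REFERENCE_WEIGHT = 0.6


def page_weight(scores):
    """Combine the three analysis scores (each in [0, 1]) into one page weight."""
    return sum(FEATURE_WEIGHTS[name] * scores[name] for name in FEATURE_WEIGHTS)


def select_target_pages(pages):
    """Keep the paths whose page weight exceeds the reference weight."""
    return [
        path for path, scores in pages.items()
        if page_weight(scores) > REFERENCE_WEIGHT
    ]


# Hypothetical preset paths with hypothetical analysis scores.
pages = {
    "/vaccine/faq": {"vaccine_relevance": 0.9, "topology": 0.7, "content": 0.8},
    "/about": {"vaccine_relevance": 0.1, "topology": 0.9, "content": 0.3},
}
print(select_target_pages(pages))
```

Giving vaccine relevance the largest coefficient reflects the goal of collecting vaccine-related text, but any weighting that suits the actual site could be substituted.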

The preset page ranking strategy includes, but is not limited to: the PageRank strategy, the Hilltop algorithm, the TrustRank algorithm (ranking based on link relationships), ExpertRank, and the like.

The PageRank strategy, also called the web page ranking strategy, the Google left-side ranking strategy, or the Page ranking strategy, is computed from the hyperlinks between web pages and serves as one of the factors in web page ranking. The PageRank value reflects the relevance and importance of a web page and is frequently used to evaluate web page optimization in search engine optimization. In this embodiment, the target pages are sorted by PageRank value from large to small so that more important pages come first, and the content of higher-ranked pages is obtained first during subsequent crawling.

In this embodiment, a target page queue is constructed and crawling proceeds in the order of that queue, so that important information is crawled first, which helps improve both the quality of the crawled content and the crawling efficiency.
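As an illustration of ranking target pages and building the crawl queue, the sketch below computes PageRank values by simple power iteration and sorts the pages from large to small. The link graph, the damping factor of 0.85 and the iteration count are assumptions for illustration; every linked page is assumed to appear as a key of the graph.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives the same base share from the random-jump term.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            # A page without outgoing links spreads its rank over all pages.
            targets = outlinks or pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link structure among three target pages.
links = {
    "/faq": ["/schedule"],
    "/schedule": ["/faq"],
    "/news": ["/faq"],
}
ranks = pagerank(links)
# Target page queue: pages ordered by ranking value from large to small.
queue = sorted(ranks, key=ranks.get, reverse=True)
print(queue)
```

A crawler would then fetch pages in the order of `queue`, so the pages judged most important are processed first.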

In some optional implementations of this embodiment, in step S204, for each cluster of coarse-granularity clustered corpora, performing secondary clustering processing on the coarse-granularity clustered corpora by using a density clustering algorithm, where the obtained density clustered corpora is used as a target corpus, including:

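acquiring a preset scanning radius eps and a preset minimum inclusion point number minPts;

for each piece of corpus data in the coarse-granularity clustered corpus, counting the number of other corpus data contained within the preset scanning radius eps of that corpus data, and taking the number as the number of neighborhood points corresponding to the corpus data;

taking corpus data whose number of neighborhood points is greater than or equal to the preset minimum inclusion point number minPts as core points;

taking corpus data whose number of neighborhood points is smaller than the preset minimum inclusion point number minPts and which lies within the preset scanning radius eps of a core point as boundary points;

and connecting boundary points whose distance does not exceed the preset scanning radius eps to form a density cluster, and adding the core points within the range of the density cluster into the density cluster to obtain the target corpus.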

Specifically, for each piece of corpus data in the coarse-granularity clustered corpus, the number of other corpus data contained within the preset scanning radius eps of that corpus data is counted and taken as its number of neighborhood points. Corpus data whose number of neighborhood points is greater than or equal to the preset minimum inclusion point number minPts is taken as a core point. Corpus data whose number of neighborhood points is smaller than minPts but which lies within the preset scanning radius eps of a core point is taken as a boundary point. Boundary points whose distance does not exceed the preset scanning radius eps are connected to form a density cluster in the shape of a closed polygon, and the core points within the range of the density cluster are added to it to obtain the target corpus.

The preset scan radius eps and the preset minimum inclusion point number minPts may be set according to the actual requirement, and are not limited herein, for example, the preset scan radius eps is set to 10, and the preset minimum inclusion point number minPts is set to 5.

It should be understood that connecting boundary points whose mutual distance does not exceed the preset scanning radius eps may yield one density cluster or a plurality of density clusters. Each density cluster is a collection of the branch questions of one category of vaccine questions; the specific categories of vaccine questions and the number of branch questions depend on the content of the crawled initial text.

In this embodiment, corpus data which does not belong to any one of the core points and the boundary points is used as noise points in coarse-granularity clustering corpus, and the noise points are cleaned, so that the accuracy of the corpus is improved.
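As a non-limiting illustration, the division of corpus data into core points, boundary points and noise points can be sketched as follows. The sketch assumes that each piece of corpus data has already been converted into a numeric vector and that Euclidean distance is used; neither assumption is stated in this embodiment. The function name `classify_points` and the sample coordinates are hypothetical.

```python
from math import dist


def classify_points(points, eps, min_pts):
    """Label every point as 'core', 'boundary' or 'noise'.

    points:  list of numeric vectors (assumed vectorised corpus data)
    eps:     preset scanning radius
    min_pts: preset minimum inclusion point number
    """
    # Neighbourhood of point i: indices of the OTHER points within eps of it.
    neighbors = [
        [j for j, q in enumerate(points) if j != i and dist(p, q) <= eps]
        for i, p in enumerate(points)
    ]
    # Core point: number of neighbourhood points >= min_pts.
    core = {i for i, nbrs in enumerate(neighbors) if len(nbrs) >= min_pts}

    labels = []
    for i, nbrs in enumerate(neighbors):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nbrs):
            # Fewer than min_pts neighbours, but within eps of a core point.
            labels.append("boundary")
        else:
            # Neither core nor boundary: treated as noise and cleaned.
            labels.append("noise")
    return labels


# Hypothetical 2-D vectors, for illustration only.
sample = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (10, 10)]
print(classify_points(sample, eps=1.5, min_pts=3))
# ['core', 'core', 'core', 'core', 'boundary', 'noise']
```

With real corpus data the values of eps and minPts would be chosen for the vector space actually used, as noted above for the example values 10 and 5.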

In this embodiment, secondary clustering of the coarse-granularity clustered corpus refines the classification within each category of vaccine questions, which improves the accuracy of the training corpus. At the same time, noise points are filtered out, so that corpus data only weakly related to vaccine question answering does not interfere with subsequent vaccine question-answering training, and the accuracy of corpus generation is improved.

In some optional implementations of the present embodiment, after step S203, and before step S204, the corpus generating method further includes:

Different class labels are set for the coarse-granularity clustered corpus of each cluster, and the coarse-granularity clustered corpus of each cluster, the class labels, and the correspondence between the coarse-granularity clustered corpus of each cluster and the class labels are stored in an Elasticsearch search engine.

Specifically, a unique class label is set for the coarse-granularity clustered corpus of each cluster, and the coarse-granularity clustered corpus of the cluster, the class label, and the correspondence between them are stored in the Elasticsearch search engine. The characteristics of the Elasticsearch search engine allow these data and correspondences to be stored and sorted quickly, which facilitates fast extraction and aggregation of the data and correspondences stored in the Elasticsearch search engine.

The workflow of Elasticsearch is mainly as follows: a user submits data to the Elasticsearch database; the corresponding sentences are segmented into words by a word segmentation controller; the weights and the word segmentation results are stored together with the data; when the user searches the data, the results are ranked and scored according to the weights and returned to the user in descending order of score.
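As a non-limiting illustration, the submit, segment, store and rank workflow described above can be mimicked with a small in-memory inverted index. This is a toy stand-in, not the Elasticsearch API: the class name `ToyIndex`, whitespace word segmentation, and the use of term frequency as the weight are all assumptions made for the sketch, and the analyzers and relevance scoring of Elasticsearch itself are considerably more elaborate.

```python
from collections import Counter, defaultdict


class ToyIndex:
    """In-memory stand-in for a search index (illustrative only)."""

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: weight}
        self.docs = {}                     # doc_id -> original text

    def add(self, doc_id, text):
        """Segment the text and store term weights together with the data."""
        self.docs[doc_id] = text
        # Assumption: whitespace segmentation; Chinese text would need a
        # dedicated word segmenter instead.
        for term, count in Counter(text.lower().split()).items():
            self.postings[term][doc_id] = count  # weight = term frequency

    def search(self, query):
        """Score documents by summed term weights, highest score first."""
        scores = Counter()
        for term in query.lower().split():
            for doc_id, weight in self.postings.get(term, {}).items():
                scores[doc_id] += weight
        # Sort by descending score, then by doc_id for a stable order.
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical questions, for illustration only.
index = ToyIndex()
index.add("q1", "hpv vaccine side effects")
index.add("q2", "flu vaccine schedule")
index.add("q3", "hpv vaccine hpv age limit")
print(index.search("hpv vaccine"))  # [('q3', 3), ('q1', 2), ('q2', 1)]
```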

In this embodiment, a unique class label is set for the coarse-granularity clustered corpus of each cluster, and the correspondence is established and stored in the Elasticsearch search engine, which facilitates data fusion and the screening out of irrelevant corpus data by means of the Elasticsearch search engine.

In some optional implementations of the present embodiment, after step S204, the corpus generating method further includes:

And selecting the non-relevant corpus according to the clustering result, and removing the non-relevant corpus to obtain updated target corpus.

Specifically, the Elasticsearch search engine can be used to obtain texts with similar expressions. With the search threshold of the Elasticsearch search engine set to a certain value, questions similar to a representative question can be obtained from the target corpus through the aggregation function of the Elasticsearch search engine. The target corpus is thereby screened again and the non-strongly related corpus is removed, which improves corpus quality.

The certain threshold, that is, the preset threshold in this embodiment, may be set according to the actual application scenario, for example to 0.6, and is not specifically limited herein.

The non-relevant corpus refers to clusters or corpus data whose relevance is lower than the preset threshold after the target corpus is aggregated by the Elasticsearch search engine.
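As a non-limiting illustration, removing corpus data whose relevance falls below the preset threshold can be sketched as follows. The relevance scores are assumed to have been produced beforehand by the search engine aggregation; the function name `remove_non_relevant` and the sample scores are hypothetical.

```python
def remove_non_relevant(scored_corpus, threshold=0.6):
    """Split corpus data into kept and removed parts by relevance.

    scored_corpus: iterable of (text, relevance) pairs; the relevance values
                   are assumed inputs, not computed here.
    threshold:     preset threshold (0.6 is the example value given above).
    """
    kept, removed = [], []
    for text, relevance in scored_corpus:
        # Relevance lower than the threshold marks non-relevant corpus.
        (kept if relevance >= threshold else removed).append(text)
    return kept, removed


# Hypothetical texts and scores, for illustration only.
scored = [
    ("What is the age limit for the HPV vaccine?", 0.82),
    ("Where can I park at the hospital?", 0.31),
    ("How many doses does the flu vaccine need?", 0.60),
]
kept, removed = remove_non_relevant(scored)
print(len(kept), len(removed))  # 2 1
```

The removed list is returned rather than discarded so that it remains available for a later check of isolated questions.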

Optionally, in this embodiment, the distances between the non-strongly related corpus and all cluster centers of the target corpus are calculated through a sentence similarity algorithm. If the distance is smaller than the preset distance, the non-strongly related corpus is determined to be a weakly similar text, namely a problem isolated point (an orphan question). The problem isolated point is treated as a new category of question on its own and is added to the target corpus as new corpus data, which improves the support of the target corpus for less common vaccine questions.
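As a non-limiting illustration, the selection of problem isolated points can be sketched as follows. The comparison follows the rule as stated above (distance smaller than the preset distance). Using the distance to the nearest cluster center, representing each cluster center by a text, and using a word-set Jaccard distance are all assumptions made for the sketch; the function names and sample texts are hypothetical.

```python
def jaccard_distance(a, b):
    """Toy distance between two non-empty texts: 1 - word-set overlap."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return 1.0 - len(words_a & words_b) / len(words_a | words_b)


def find_isolated_points(removed, centers, preset_distance,
                         distance=jaccard_distance):
    """Return removed texts whose nearest-centre distance is below the preset.

    removed: non-strongly related corpus texts
    centers: one representative text per cluster (assumed representation)
    """
    isolated = []
    for text in removed:
        # Assumption: the smallest distance over all cluster centres is used.
        nearest = min(distance(text, center) for center in centers)
        if nearest < preset_distance:
            isolated.append(text)
    return isolated


# Hypothetical data, for illustration only.
centers = ["hpv vaccine age limit", "flu vaccine schedule"]
removed = ["rabies vaccine after dog bite", "best noodle restaurant nearby"]
target_corpus = list(centers)
# Each isolated point is appended to the target corpus as a new category.
target_corpus.extend(find_isolated_points(removed, centers, 0.9))
print(target_corpus[-1])  # rabies vaccine after dog bite
```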

The sentence similarity algorithms include, but are not limited to: the brute-force algorithm, the RK (Rabin-Karp) algorithm, the KMP (Knuth-Morris-Pratt) algorithm, and a string correction similarity algorithm based on pictophonetic codes and edit distance. The algorithm may be selected according to actual requirements and is not limited herein.
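As a non-limiting illustration of an edit-distance-based similarity, the following sketch computes the Levenshtein distance between two strings and normalises it into a similarity between 0 and 1. It does not include the pictophonetic coding step mentioned above, and the normalisation by the longer string length is an assumption made for the sketch.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty prefix of b
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # delete char_a
                            curr[j - 1] + 1,      # insert char_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Similarity in [0, 1]; 1.0 means the two strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - edit_distance(a, b) / longest


print(edit_distance("kitten", "sitting"))  # 3
print(similarity("vaccine", "vaccine"))    # 1.0
```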

In this embodiment, the non-relevant corpus is removed by means of the Elasticsearch search engine and the target corpus is updated, which keeps the target corpus concise and accurate and avoids the low accuracy of subsequent vaccine question-answering training that low-relevance corpus data would cause. At the same time, some orphan questions are each treated as a separate category of question and added to the target corpus, which improves the support of the target corpus for less common vaccine questions.

It should be understood that the sequence numbers of the steps in the foregoing embodiment do not imply an execution order. The execution order of each process should be determined by its function and internal logic, and the sequence numbers should not limit the implementation process of the embodiment of the present invention in any way.

Fig. 3 shows a schematic block diagram of a corpus generating apparatus in one-to-one correspondence with the above-described embodiment of the corpus generating method. As shown in fig. 3, the corpus generating device includes a data acquisition module 31, a data cleaning module 32, a coarse granularity clustering module 33, and a corpus determining module 34. The functional modules are described in detail as follows:

The data acquisition module 31 is configured to acquire a consultation text and a response text related to the vaccine from the medical consultation library as an initial text;

the data cleaning module 32 is configured to perform data cleaning on the initial text to obtain original corpus data;

the coarse-granularity clustering module 33 is configured to perform clustering processing on the original corpus data by using a K-means clustering model to obtain coarse-granularity clustered corpora of at least two clusters;

The corpus determining module 34 is configured to perform secondary clustering processing on the coarse-granularity clustered corpora by a density clustering algorithm for each cluster of coarse-granularity clustered corpora, and use the obtained density clustered corpora as a target corpus.
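As a non-limiting illustration, the cooperation of the four functional modules can be sketched as a simple pipeline. The class name `CorpusGenerator`, its field names, and the stand-in functions in the usage example are hypothetical; the real modules would perform crawling, data cleaning, K-means clustering and density clustering respectively.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CorpusGenerator:
    """Chains stand-ins for modules 31 to 34 (illustrative structure only)."""
    acquire: Callable[[], List[str]]                        # module 31
    clean: Callable[[List[str]], List[str]]                 # module 32
    coarse_cluster: Callable[[List[str]], List[List[str]]]  # module 33
    refine: Callable[[List[str]], List[str]]                # module 34

    def run(self) -> List[List[str]]:
        initial_text = self.acquire()
        original_corpus = self.clean(initial_text)
        clusters = self.coarse_cluster(original_corpus)
        # Secondary clustering is applied to each coarse cluster in turn.
        return [self.refine(cluster) for cluster in clusters]


# Trivial stand-in functions, for illustration only.
generator = CorpusGenerator(
    acquire=lambda: [" HPV age limit? ", "", "flu shot schedule?"],
    clean=lambda texts: [t.strip() for t in texts if t.strip()],
    coarse_cluster=lambda texts: [texts[:1], texts[1:]],
    refine=lambda cluster: cluster,
)
print(generator.run())  # [['HPV age limit?'], ['flu shot schedule?']]
```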

Optionally, the data acquisition module 31 includes:

Optionally, corpus determining module 34 includes:

the core point determining unit is used for taking corpus data with the number of neighborhood points being greater than or equal to a preset minimum inclusion point number minPts as core points;

Optionally, the corpus generating device further includes:

For specific limitations of the corpus generating device, reference may be made to the limitations of the corpus generating method above, and details are not repeated here. Each module of the above corpus generating device may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor of the computer device in hardware form, or may be stored in software form in a memory of the computer device, so that the processor can call and execute the operations corresponding to the above modules.

In order to solve the technical problems, the embodiment of the application also provides computer equipment. Referring specifically to fig. 4, fig. 4 is a basic structural block diagram of a computer device according to the present embodiment.

The computer device 4 comprises a memory 41, a processor 42, and a network interface 43 that are communicatively connected to each other via a system bus. It is noted that only a computer device 4 having the components memory 41, processor 42 and network interface 43 is shown in the figure, but it should be understood that not all of the illustrated components are required to be implemented, and that more or fewer components may be implemented instead. It will be appreciated by those skilled in the art that the computer device herein is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.

The computer equipment can be a desktop computer, a notebook computer, a palm computer, a cloud server and other computing equipment. The computer equipment can perform man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch pad or voice control equipment and the like.

The memory 41 includes at least one type of readable storage medium, including flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a random access memory (RAM), a static random access memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or an internal memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the computer device 4. Of course, the memory 41 may also comprise both an internal storage unit of the computer device 4 and an external storage device. In this embodiment, the memory 41 is typically used for storing an operating system and various application software installed on the computer device 4, such as program codes for controlling electronic files, etc. Further, the memory 41 may be used to temporarily store various types of data that have been output or are to be output.

The processor 42 may be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to execute a program code stored in the memory 41 or process data, such as a program code for executing control of an electronic file.

The network interface 43 may comprise a wireless network interface or a wired network interface, which network interface 43 is typically used for establishing a communication connection between the computer device 4 and other electronic devices.

The present application also provides another embodiment, namely, a computer-readable storage medium storing an interface display program executable by at least one processor to cause the at least one processor to perform the steps of the corpus generation method as described above.

From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method according to the embodiments of the present application.

It is apparent that the above-described embodiments are only some, not all, of the embodiments of the present application. The drawings show preferred embodiments of the application but do not limit the scope of the patent claims. This application may be embodied in many different forms; the embodiments are provided so that the disclosure of the present application will be understood thoroughly and completely. Although the application has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the technical solutions described in the foregoing embodiments, or equivalents may be substituted for some of their technical features. All equivalent structures made using the content of the specification and the drawings of the application, whether applied directly or indirectly in other related technical fields, likewise fall within the scope of protection of the application.

Claims

1. The corpus generation method is applied to training corpus generation of a vaccine question-answering robot and is characterized by comprising the following steps:

performing data cleaning on the initial text to obtain original corpus data;

Aiming at each cluster of coarse-granularity clustered corpus, performing secondary clustering treatment on the coarse-granularity clustered corpus through a density clustering algorithm, and taking the obtained density clustered corpus as a target corpus;

wherein, for each cluster of coarse-granularity clustered corpus, performing secondary clustering processing on the coarse-granularity clustered corpus through a density clustering algorithm and taking the obtained density-clustered corpus as a target corpus comprises:

taking, as boundary points, corpus data whose number of neighborhood points is smaller than a preset minimum inclusion point number minPts and which lies within a preset scanning radius eps of any core point;

Connecting boundary points with the distance not exceeding a preset scanning radius eps to form a density cluster, and adding core points within the range of the density cluster into the density cluster to obtain a target corpus;

After performing secondary clustering processing on the coarse-granularity clustered corpus by a density clustering algorithm aiming at each cluster of coarse-granularity clustered corpus and taking the obtained density clustered corpus as a target corpus, the corpus generation method further comprises the following steps:

selecting non-relevant corpus according to the clustering result, and removing the non-relevant corpus to obtain updated target corpus;

Calculating the distance between the cluster center of the target corpus and the uncorrelated corpus through a similarity algorithm, taking the uncorrelated corpus with the distance smaller than a preset distance as a problem isolated point, and updating the target corpus according to the problem isolated point.

2. The corpus generation method of claim 1, wherein the acquiring, as the initial text, the vaccine-related consultation text and response text from the medical consultation library includes:

determining a target page according to the page weight of each preset path;

3. The corpus generation method according to claim 1 or 2, characterized in that, after clustering the original corpus data by using a K-means clustering model to obtain coarse-granularity clustered corpora of at least two clusters, and before performing, for each cluster of coarse-granularity clustered corpus, secondary clustering processing on the coarse-granularity clustered corpus by a density clustering algorithm and taking the obtained density-clustered corpus as a target corpus, the corpus generation method further comprises:

4. The corpus generation method of claim 1, wherein after performing secondary clustering processing on the coarse-granularity clustered corpus by a density clustering algorithm for each cluster of coarse-granularity clustered corpus, taking the obtained density clustered corpus as a target corpus, the corpus generation method further comprises: storing the target corpus in a blockchain network node.

5. A corpus generation device applied to training corpus generation of a vaccine question-answering robot, wherein the corpus generation device is operative to implement the corpus generation method according to any one of claims 1 to 4, the corpus generation device comprising:

The corpus determining module is used for carrying out secondary clustering processing on the coarse-granularity clustered corpuses through a density clustering algorithm aiming at each cluster of coarse-granularity clustered corpuses, and taking the obtained density clustered corpuses as target corpuses;

wherein, the corpus determining module includes:

6. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the corpus generation method according to any of claims 1 to 4 when executing the computer program.

7. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the corpus generation method according to any of claims 1 to 4.