CN111813905B - Corpus generation method, corpus generation device, computer equipment and storage medium
- Publication number
- CN111813905B (application number CN202010555008.XA)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/253—Grammatical analysis; Style critique
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0281—Customer communication at a business location, e.g. providing product or service information, consulting
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H40/00—ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices
- G16H40/60—ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices for the operation of medical equipment or devices
- G16H40/67—ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices for the operation of medical equipment or devices for remote operation
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/20—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
Abstract
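The invention relates to the field of artificial intelligence and discloses a corpus generation method, a corpus generation device, computer equipment and a storage medium. The corpus generation method comprises the following steps: acquiring consultation text and response text related to vaccines from a medical consultation library as initial text; performing data cleaning on the initial text to obtain original corpus data; clustering the original corpus data with a K-means clustering model to obtain coarse-granularity clustered corpora of at least two clusters; and, for each cluster of coarse-granularity clustered corpus, performing secondary clustering with a density clustering algorithm and taking the resulting density clustered corpus as the target corpus. The invention further relates to blockchain technology: the obtained target corpus is stored in a blockchain network. The method improves the accuracy of the target corpus for vaccine questions and answers.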
Description
Technical Field
The present invention relates to the field of artificial intelligence, and in particular, to a corpus generating method, apparatus, computer device, and storage medium.
Background
As living standards improve, more people pay attention to their health, and vaccine-related questions have become a frequent topic of concern. To relieve the pressure on hospital consultation windows, some hospitals have begun to adopt intelligent robot service systems, in which an intelligent question-answering robot gives consultants effective feedback. Before use, such a robot must be trained on a large number of corpora from the relevant field to improve the accuracy of its answers. A vaccine question-answering robot therefore needs a large number of vaccine-related corpora for training.
Currently, vaccine-related corpora are obtained mainly by crawling relevant sites with web crawlers and selecting corpora by regular-expression matching and keyword extraction. A question-answering robot trained on corpora selected in this way falls far short of the required accuracy: its responses are often inaccurate, which in turn harms the user experience. How to obtain training corpora of high accuracy has therefore become an urgent problem.
Disclosure of Invention
The embodiment of the invention provides a corpus generation method, a corpus generation device, computer equipment and a storage medium, which are used for improving the accuracy of generating training corpus of a vaccine question-answering robot.
In order to solve the above technical problems, an embodiment of the present application provides a corpus generating method, including:
acquiring consultation text and response text related to the vaccine from a medical consultation library as initial text;
performing data cleaning on the initial text to obtain original corpus data;
clustering the original corpus data with a K-means clustering model to obtain coarse-granularity clustered corpora of at least two clusters;
for each cluster of coarse-granularity clustered corpus, performing secondary clustering processing on the coarse-granularity clustered corpus through a density clustering algorithm, and taking the obtained density clustered corpus as the target corpus.
Optionally, the acquiring the consultation text and the response text related to the vaccine from the medical consultation library includes:
determining the page weight of each preset path in the medical consultation library in a link analysis mode;
determining a target page according to the page weight of each preset path;
calculating a page ranking value for each target page based on a preset page ranking strategy, and sorting the target pages in descending order of page ranking value to obtain a target page queue;
and grabbing contents in the target page based on the target page queue to obtain consultation text and response text related to the vaccine.
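The page ordering step can be illustrated with a minimal sketch. It assumes that the "preset page ranking strategy" is a PageRank-style power iteration over the link graph; the text does not name the strategy, and the function names, the damping factor and the example link graph below are hypothetical.

```python
def page_rank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


def build_target_queue(links):
    """Return pages sorted by page ranking value, largest first."""
    rank = page_rank(links)
    return sorted(rank, key=lambda p: rank[p], reverse=True)


# Hypothetical link graph among pages of a consultation site.
example_links = {
    "/vaccine/hpv": ["/vaccine/index"],
    "/vaccine/flu": ["/vaccine/index"],
    "/vaccine/index": ["/vaccine/hpv"],
}
queue = build_target_queue(example_links)
```

A crawler would then fetch pages in the order given by `queue`, so that the pages with the highest ranking values are grabbed first.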
Optionally, for each cluster of coarse-granularity clustered corpus, performing secondary clustering processing on the coarse-granularity clustered corpus through a density clustering algorithm and taking the obtained density clustered corpus as the target corpus includes:
acquiring a preset scanning radius eps and a preset minimum number of contained points minPts;
for each piece of corpus data in the coarse-granularity clustered corpus, counting the number of other pieces of corpus data contained within the preset scanning radius eps, and taking that number as the number of neighborhood points of the corpus data;
taking corpus data whose number of neighborhood points is greater than or equal to the preset minimum number of contained points minPts as core points;
taking corpus data whose number of neighborhood points is smaller than the preset minimum number of contained points minPts and which lies within the preset scanning radius eps of a core point as boundary points;
connecting boundary points whose distance does not exceed the preset scanning radius eps to form a density cluster, and adding the core points within the range of the density cluster to the density cluster to obtain the target corpus.
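The point classification described above can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: corpus data is represented as numeric tuples, distance is Euclidean, and the cluster-forming step follows the conventional DBSCAN formulation, in which core points within eps of each other are linked and boundary points are attached to the cluster of a neighboring core point. The function names `neighbors` and `dbscan` are hypothetical.

```python
import math


def neighbors(points, i, eps):
    """Indices of the other points within distance eps of points[i]."""
    return [j for j, q in enumerate(points)
            if j != i and math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return (labels, kinds); label -1 marks points left out as noise."""
    hoods = [neighbors(points, i, eps) for i in range(len(points))]
    # Core point: at least min_pts other points inside radius eps.
    is_core = [len(h) >= min_pts for h in hoods]
    kinds = []
    for i, h in enumerate(hoods):
        if is_core[i]:
            kinds.append("core")
        elif any(is_core[j] for j in h):
            kinds.append("boundary")  # not core, but within eps of a core point
        else:
            kinds.append("noise")

    labels = [-1] * len(points)
    cluster = 0
    for i in range(len(points)):
        if not is_core[i] or labels[i] != -1:
            continue
        # Grow a new cluster outward from an unassigned core point.
        labels[i] = cluster
        stack = [i]
        while stack:
            current = stack.pop()
            for j in hoods[current]:
                if labels[j] == -1:
                    labels[j] = cluster
                    if is_core[j]:
                        stack.append(j)  # only core points keep expanding
        cluster += 1
    return labels, kinds


# Hypothetical 2-D points standing in for vectorized corpus data.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2.2, 1), (10, 10)]
labels, kinds = dbscan(points, eps=1.5, min_pts=3)
```

In this example the four points of the unit square are core points, (2.2, 1) is a boundary point attached to their cluster, and (10, 10) is left unassigned.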
Optionally, after clustering the original corpus data with the K-means clustering model to obtain coarse-granularity clustered corpora of at least two clusters, and before performing, for each cluster of coarse-granularity clustered corpus, secondary clustering processing through the density clustering algorithm and taking the obtained density clustered corpus as the target corpus, the corpus generation method further includes:
setting a different category label for the coarse-granularity clustered corpus of each cluster, and storing the coarse-granularity clustered corpus of each cluster, the category labels, and the correspondence between them in an Elasticsearch search engine.
Optionally, after performing, for each cluster of coarse-granularity clustered corpus, secondary clustering processing through the density clustering algorithm and taking the obtained density clustered corpus as the target corpus, the corpus generation method further includes:
acquiring a preset threshold, and aggregating the target corpus with the Elasticsearch search engine according to the preset threshold to obtain a clustering result;
selecting non-relevant corpus according to the clustering result, and removing the non-relevant corpus to obtain an updated target corpus.
Optionally, after performing, for each cluster of coarse-granularity clustered corpus, secondary clustering processing through the density clustering algorithm and taking the obtained density clustered corpus as the target corpus, the method further includes: storing the target corpus in a blockchain network node.
In order to solve the above technical problem, an embodiment of the present application further provides a corpus generating device, including:
a data acquisition module, used for acquiring consultation text and response text related to the vaccine from the medical consultation library as initial text;
a data cleaning module, used for performing data cleaning on the initial text to obtain original corpus data;
a coarse-granularity clustering module, used for clustering the original corpus data with a K-means clustering model to obtain coarse-granularity clustered corpora of at least two clusters;
a corpus determining module, used for performing, for each cluster of coarse-granularity clustered corpus, secondary clustering processing on the coarse-granularity clustered corpus through a density clustering algorithm, and taking the obtained density clustered corpus as the target corpus.
Optionally, the data acquisition module includes:
a link analysis unit, used for determining the page weight of each preset path in the medical consultation library by link analysis;
a target page determining unit, used for determining target pages according to the page weight of each preset path;
a page ordering unit, used for calculating the page ranking value of each target page based on a preset page ranking strategy, and sorting the target pages in descending order of page ranking value to obtain a target page queue;
a content acquisition unit, used for grabbing the content of the target pages based on the target page queue to obtain the consultation text and response text related to the vaccine.
Optionally, the corpus determining module includes:
a preset parameter obtaining unit, used for acquiring a preset scanning radius eps and a preset minimum number of contained points minPts;
a neighborhood point number determining unit, used for counting, for each piece of corpus data in the coarse-granularity clustered corpus, the number of other pieces of corpus data contained within the preset scanning radius eps, and taking that number as the number of neighborhood points of the corpus data;
a core point determining unit, used for taking corpus data whose number of neighborhood points is greater than or equal to the preset minimum number of contained points minPts as core points;
a boundary point determining unit, used for taking corpus data whose number of neighborhood points is smaller than the preset minimum number of contained points minPts and which lies within the preset scanning radius eps of a core point as boundary points;
a target corpus acquisition unit, used for connecting boundary points whose distance does not exceed the preset scanning radius eps to form a density cluster, and adding the core points within the range of the density cluster to the density cluster to obtain the target corpus.
Optionally, the corpus generating device further includes:
a first storage module, used for setting a different category label for the coarse-granularity clustered corpus of each cluster, and storing the coarse-granularity clustered corpus of each cluster, the category labels, and the correspondence between them in the Elasticsearch search engine.
Optionally, the corpus generating device further includes:
an aggregation module, used for acquiring a preset threshold and aggregating the target corpus with the Elasticsearch search engine according to the preset threshold to obtain a clustering result;
an updating module, used for selecting non-relevant corpus according to the clustering result and removing the non-relevant corpus to obtain the updated target corpus.
Optionally, the corpus generating device further includes:
a second storage module, used for storing the target corpus in a blockchain network node.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where the processor implements the steps of the corpus generating method when executing the computer program.
In order to solve the above technical problem, an embodiment of the present application further provides a computer readable storage medium, where a computer program is stored, where the computer program when executed by a processor implements the steps of the corpus generation method.
According to the corpus generation method, device, computer equipment and storage medium described above, consultation text and response text related to the vaccine are obtained from the medical consultation library as initial text, and the initial text is cleaned to obtain original corpus data. The original corpus data is then clustered with the K-means clustering model to obtain at least two clusters of coarse-granularity clustered corpora, and each cluster of coarse-granularity clustered corpus undergoes secondary clustering through the density clustering algorithm. This multi-level clustering achieves a more accurate classification. The obtained density clustered corpus is taken as the target corpus, so the target corpus is classified more accurately and its accuracy for vaccine questions and answers is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow chart of one embodiment of a corpus generation method of the present application;
FIG. 3 is a schematic diagram of the structure of one embodiment of a corpus generating device according to the present application;
FIG. 4 is a schematic structural diagram of one embodiment of a computer device in accordance with the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the applications herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "comprising" and "having" and any variations thereof in the description of the application and the claims and the description of the drawings above are intended to cover a non-exclusive inclusion. The terms first, second and the like in the description and in the claims or in the above-described figures, are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1, as shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablet computers, electronic book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop computers, desktop computers, and so on.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that, the corpus generating method provided by the embodiment of the present application is executed by a server, and accordingly, the corpus generating device is disposed in the server.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. Any number of terminal devices, networks and servers may be provided according to implementation requirements, and the terminal devices 101, 102, 103 in the embodiment of the present application may specifically correspond to application systems in actual production.
Referring to fig. 2, fig. 2 shows a corpus generating method according to an embodiment of the present invention. The method is described below as applied to the server in fig. 1:
S201: acquiring consultation text and response text related to the vaccine from the medical consultation library as initial text.
Specifically, the medical consultation library is queried using preset keywords to obtain consultation text and response text related to vaccines, which serve as the initial text.
The preset keywords may be words or phrases covering, for example, the vaccination time, procedure, precautions, and suitable populations for a vaccine.
The medical consultation library is a resource library that stores information (text and voice) about vaccine-related questions raised through online or telephone consultation.
Preferably, to facilitate querying, voice information can be converted to text by a third-party speech-to-text tool and then stored in the medical consultation library.
It should be noted that the medical consultation library in this embodiment corresponds to a plurality of site pages, which provide querying and reading of the consultation records.
Preferably, this embodiment uses a crawler to crawl vaccine-related consultation text and response text from the site pages of the medical consultation library quickly and accurately, which speeds up acquisition of the initial text and improves the efficiency of generating the training corpus.
S202: performing data cleaning on the initial text to obtain original corpus data.
Specifically, the obtained initial text contains punctuation marks, text formatting, invalid expressions, pictures, and the like, so data cleaning must be performed in advance, before any further data processing.
Data cleaning includes, but is not limited to: removing punctuation and pictures, segmenting the text, and extracting key sentences.
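A minimal sketch of the removal part of data cleaning is given below. It assumes that pictures appear in the crawled text as HTML `<img>` tags and that punctuation means any character that is neither a word character nor whitespace; both are assumptions made for illustration, and the name `clean_text` is hypothetical. Text segmentation and key sentence extraction are not shown.

```python
import re

# Assumed form of picture residue in crawled pages: HTML <img ...> tags.
IMAGE_TAG = re.compile(r"<img[^>]*>", re.IGNORECASE)
# Anything that is neither a word character nor whitespace counts as punctuation.
PUNCTUATION = re.compile(r"[^\w\s]")
WHITESPACE = re.compile(r"\s+")


def clean_text(raw):
    """Strip picture tags and punctuation, then collapse whitespace."""
    text = IMAGE_TAG.sub(" ", raw)
    text = PUNCTUATION.sub(" ", text)
    return WHITESPACE.sub(" ", text).strip()


cleaned = clean_text('Can I get the flu vaccine <img src="a.png"> while pregnant?!')
```

Because `\w` matches Unicode word characters in Python 3, the same patterns keep Chinese characters and drop full-width punctuation.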
Further, the cleaned text is vectorized, and the resulting word vectors are used as the original corpus data.
Specifically, the cleaned text is mapped to vectors, which together form a word vector space; each vector corresponds to a point in that space.
For example, an automobile sales company has two keywords in its product names, BMW and Benz. From a preset corpus, all possible categories of the two keywords are obtained: "automobile", "luxury", "animal", "action", and "food". A vector representation over these categories is therefore introduced for both keywords:
<automobile, luxury, animal, action, food>
The probability that each keyword belongs to each category is then calculated by a statistical learning method. The probabilities learned by the computer are:
BMW = <0.5, 0.2, 0.2, 0.0, 0.1>
Benz = <0.7, 0.2, 0.0, 0.1, 0.0>
It will be appreciated that the value in each dimension of a word vector represents a feature with some semantic and grammatical interpretability, so each dimension of the word vector may be referred to as a keyword feature.
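Treating the two keyword vectors as points in the word vector space, their closeness can be measured directly. The sketch below uses the BMW and Benz values from the example; the choice of Euclidean distance and cosine similarity as measures, and the function names, are assumptions made for illustration.

```python
import math

CATEGORIES = ("automobile", "luxury", "animal", "action", "food")
BMW = (0.5, 0.2, 0.2, 0.0, 0.1)
BENZ = (0.7, 0.2, 0.0, 0.1, 0.0)


def euclidean_distance(a, b):
    """Straight-line distance between two points of the vector space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


distance = euclidean_distance(BMW, BENZ)
similarity = cosine_similarity(BMW, BENZ)
```

Both keywords put most of their weight on the "automobile" dimension, so the two points lie close together in the space.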
It should be noted that, in this embodiment, a word vector may represent a segmented word, a phrase, or a pair of question-and-answer sentences; this is not elaborated further here.
S203: and clustering the original corpus data by adopting a K-means clustering model to obtain coarse-granularity clustered corpora of at least two clusters.
Specifically, a K-means clustering model is adopted to perform clustering processing on original corpus data, and the original corpus data corresponding to each clustering center is used as a cluster of coarse-granularity clustered corpus to obtain at least two clusters of coarse-granularity clustered corpus.
The coarse-granularity clustering corpus refers to clustering corpus with low precision, wherein some common semantics are contained, but the final semantics are not necessarily the same. For example, two pieces of original corpus data are ' after me eat meal for a while, the bellybutton is hungry and ' after me eat meal bellybutton is a little painful ', after the two pieces of corpus data are clustered by a K-means clustering model, the two pieces of corpus data are clustered into a cluster, and therefore the two pieces of corpus data belong to coarse-granularity clustering corpus, and in order to ensure the accuracy of classification, the coarse-granularity clustering corpus is required to be further finely classified in the follow-up process.
The K-means algorithm is a distance-based clustering algorithm, and the distance is used as an evaluation index of similarity, namely the closer the distance between two objects is, the greater the similarity is. The algorithm considers clusters to be made up of objects that are close together, thus targeting a compact and independent cluster as the final target.
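As a rough illustration of the distance-based clustering described above, the following is a minimal pure-Python K-means sketch. It assumes the corpus data has already been converted to numeric vectors (tuples); the function name `kmeans`, the random initialization, and the fixed iteration count are illustrative choices, not details given in this document.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Cluster points (tuples of floats) into k groups by Euclidean distance."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # pick k distinct points as initial centers
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment, centers


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    labels, centers = kmeans(data, k=2)
    print(labels, centers)
```

Each resulting group corresponds to one cluster of coarse-granularity clustered corpus; a production system would more likely rely on an established library implementation.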
S204: For each cluster of coarse-granularity clustered corpus, perform secondary clustering on the coarse-granularity clustered corpus using a density clustering algorithm, and take the resulting density clustered corpus as the target corpus.
Specifically, vaccine question answering is highly specialized, so the training corpus needs finer classification and higher accuracy, and because of the functional limitations of the K-means algorithm, each type of vaccine problem cannot be clustered perfectly. The K-means clustering algorithm is therefore used only for coarse-granularity clustering of the original corpus. By adjusting the algorithm's hyperparameters during clustering, suitable text clusters can be obtained in which the texts have a certain similarity and each cluster roughly represents one type of vaccine problem. For example, different ways of asking about the vaccination time of a certain vaccine may be concentrated in the same cluster, while the different question directions concerning vaccination time still have to be separated out of that cluster. To further improve the fineness of classification and the accuracy of the corpus for vaccine problems, this embodiment applies a density clustering algorithm to perform secondary clustering on the coarse-granularity clustered corpus, and the resulting density clustered corpus is taken as the target corpus.
Preferably, the density clustering algorithm used in this embodiment is DBSCAN. For the specific process of secondary clustering with DBSCAN, refer to the description of the subsequent embodiments; to avoid repetition, it is not described here.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a representative density-based clustering algorithm. Unlike partitioning and hierarchical clustering methods, it defines a cluster as the maximal set of density-connected points, can partition regions of sufficiently high density into clusters, and can find clusters of arbitrary shape in a noisy spatial database.
In this embodiment, vaccine-related consultation text and response text are obtained from the medical consultation library as initial text, and the initial text is cleaned to obtain original corpus data. A K-means clustering model then clusters the original corpus data into coarse-granularity clustered corpora of at least two clusters, and each cluster of coarse-granularity clustered corpus undergoes secondary clustering by a density clustering algorithm. This multi-level clustering yields a more accurate classification. The resulting density clustered corpus is taken as the target corpus, so that the target corpus is classified more accurately and is more accurate for vaccine question answering.
In an embodiment, after the target corpus is obtained, each target corpus is stored in a blockchain network node. Blockchain storage allows data to be shared among different platforms and prevents the data from being tampered with.
A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association with one another by cryptographic methods, where each data block contains the information of a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, and an application service layer.
In some optional implementations of this embodiment, in step S201, acquiring the vaccine-related consultation text and response text from the medical consultation library as the initial text includes:
determining the page weight of each preset path in the medical consultation library by link analysis;
determining target pages according to the page weight of each preset path;
calculating a page ranking value for each target page based on a preset page ranking strategy, and sorting the target pages in descending order of page ranking value to obtain a target page queue;
and crawling the content of the target pages based on the target page queue to obtain the vaccine-related consultation text and response text.
Specifically, a plurality of preset paths are stored in the medical consultation library in advance, each preset path stores one or more pages, and the corresponding information is obtained by crawling page content. Before the pages are crawled, link analysis is performed on the sites to be crawled to determine the weight of each site page, so that the target pages to be crawled can later be determined from these weights. A reference weight is preset on the server; when a calculated page weight is greater than the preset reference weight, the page is considered worth crawling and is determined to be a target page. The page ranking value of each target page is then calculated according to a preset page ranking strategy, and the target pages are sorted in descending order of page ranking value to obtain a target page queue. The content of the target pages is crawled in the order of the target page queue to obtain the basic data contained in the target pages and the user information corresponding to that basic data.
Link analysis refers to analyzing the basic features of the page corresponding to each preset path in the medical consultation library. In this embodiment, the basic features selected for analysis include, but are not limited to, vaccine relevance, network topology, and page content.
Network topology analysis covers data such as the external links, hierarchy, and level of a web page.
Page content analysis covers content feature data such as the appearance and text of a web page.
In this embodiment, three analysis results are obtained through vaccine-relevance text analysis, network topology analysis, and page content analysis, and the three results are evaluated together to obtain the page weight. The comprehensive evaluation may be implemented with a preset weighting formula or set according to actual needs, which is not limited here.
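One possible form of the preset weighting formula, together with the comparison against the preset reference weight, is sketched below. The coefficient values, the score range of 0 to 1, and the names `page_weight` and `select_target_pages` are hypothetical; the document leaves the concrete formula open.

```python
def page_weight(relevance, topology, content, coeffs=(0.5, 0.3, 0.2)):
    """Combine the three analysis results into one page weight.

    The coefficients are illustrative assumptions, not values from the text.
    """
    w_rel, w_topo, w_content = coeffs
    return w_rel * relevance + w_topo * topology + w_content * content


def select_target_pages(pages, reference_weight):
    """Keep pages whose weight is greater than the preset reference weight.

    pages maps a page path to (relevance, topology, content) scores.
    """
    return [
        path
        for path, scores in pages.items()
        if page_weight(*scores) > reference_weight
    ]


if __name__ == "__main__":
    pages = {
        "/vaccine/faq": (0.9, 0.8, 0.7),
        "/about": (0.1, 0.2, 0.3),
    }
    print(select_target_pages(pages, reference_weight=0.5))
```

Only pages that pass this filter would go on to be ranked and crawled.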
The preset page ranking strategy includes, but is not limited to, the PageRank strategy, the Hilltop algorithm, the TrustRank algorithm (ranking based on link relationships), and ExpertRank.
The PageRank strategy, also called the web page ranking strategy, the Google left-side ranking strategy, or the Page ranking strategy, is a technique that computes rankings from the hyperlinks among web pages and serves as one of the factors in web page ranking. The PageRank value reflects the relevance and importance of a web page and is frequently used to evaluate page optimization in search engine optimization. Pages are sorted by PageRank value in descending order so that more important pages come first, and in the subsequent content crawling the information of higher-ranked pages is obtained first.
In this embodiment, a page weight queue is constructed and crawling proceeds in the order of that queue, so that important information is crawled first, which helps improve both the quality of the crawled content and the crawling efficiency.
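The ranking and queue construction can be sketched with a textbook power-iteration PageRank. This is a generic formulation rather than the specific ranking strategy of this embodiment; the damping factor of 0.85, the iteration count, and the function names are assumptions, and every linked page is assumed to appear as a key of `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            # A page without outgoing links spreads its rank over all pages.
            targets = outgoing if outgoing else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


def target_page_queue(links):
    """Sort pages in descending order of PageRank value."""
    rank = pagerank(links)
    return sorted(rank, key=rank.get, reverse=True)


if __name__ == "__main__":
    links = {"a": ["c"], "b": ["c"], "c": ["a"]}
    print(target_page_queue(links))
```

Crawling the pages in the returned order means that the pages with the highest ranking values are fetched first.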
In some optional implementations of this embodiment, in step S204, performing secondary clustering on the coarse-granularity clustered corpus by a density clustering algorithm for each cluster of coarse-granularity clustered corpus, and taking the resulting density clustered corpus as the target corpus, includes:
acquiring a preset scanning radius eps and a preset minimum inclusion point number minPts;
for each piece of corpus data in the coarse-granularity clustered corpus, counting the number of other pieces of corpus data contained within the preset scanning radius eps of that corpus data, and taking this number as the number of neighborhood points of the corpus data;
taking corpus data whose number of neighborhood points is greater than or equal to the preset minimum inclusion point number minPts as core points;
taking corpus data whose number of neighborhood points is smaller than the preset minimum inclusion point number minPts and which lies within the preset scanning radius eps of a core point as boundary points;
and connecting boundary points whose distance from one another does not exceed the preset scanning radius eps to form a density cluster, and adding the core points within the range of the density cluster to the density cluster to obtain the target corpus.
Specifically, for each piece of corpus data in the coarse-granularity clustered corpus, the number of other pieces of corpus data contained within its preset scanning radius eps is counted and taken as its number of neighborhood points. Corpus data whose number of neighborhood points is greater than or equal to the preset minimum inclusion point number minPts is taken as a core point, and corpus data whose number of neighborhood points is smaller than minPts and which lies within the preset scanning radius eps of a core point is taken as a boundary point. Boundary points whose distance from one another does not exceed the preset scanning radius eps are connected to form a density cluster shaped like a closed polygon, and the core points within the range of the density cluster are added to it to obtain the target corpus.
The preset scanning radius eps and the preset minimum inclusion point number minPts may be set according to actual requirements and are not limited here. For example, eps may be set to 10 and minPts to 5.
It should be understood that connecting boundary points whose distance from one another does not exceed the preset scanning radius eps may yield one density cluster or several. Each density cluster is a collection of the branch problems of one category of vaccine problem; the specific categories and the number of branch problems depend on the content of the crawled initial text.
In this embodiment, corpus data in the coarse-granularity clustered corpus that is neither a core point nor a boundary point is treated as a noise point, and the noise points are removed, which improves the accuracy of the corpus.
In this embodiment, secondary clustering of the coarse-granularity clustered corpus refines the classification of each type of vaccine problem and improves the accuracy of the training corpus. At the same time, some noise points are filtered out, which prevents corpora only weakly related to vaccine question answering from interfering with subsequent vaccine question-answer training and improves the accuracy of corpus generation.
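The core point, boundary point, and noise point roles described above can be illustrated with a small pure-Python sketch. It follows the commonly used DBSCAN formulation, in which clusters grow outward from core points and boundary points are attached to them, so its cluster-forming step is worded differently from the description above. As in the description, the neighborhood count excludes the point itself. The function name `dbscan` and the use of Euclidean distance on numeric vectors are assumptions.

```python
import math


def dbscan(points, eps, min_pts):
    """Return (labels, is_core); label -1 marks a noise point."""
    n = len(points)
    # Indices of OTHER points within the scanning radius eps of each point.
    neighbors = [
        [j for j in range(n)
         if j != i and math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbors]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if not is_core[i] or labels[i] is not None:
            continue
        # Grow a new density cluster outward from an unlabelled core point.
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            for q in neighbors[p]:
                if labels[q] is None:
                    labels[q] = cluster
                    if is_core[q]:  # boundary points join but do not expand
                        stack.append(q)
        cluster += 1
    # Points never reached from a core point are noise.
    return [-1 if lab is None else lab for lab in labels], is_core


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (2.5, 1), (10, 10)]
    labels, is_core = dbscan(data, eps=1.5, min_pts=3)
    print(labels, is_core)
```

In this small data set the first four points are core points, `(2.5, 1)` is a boundary point attached to their cluster, and `(10, 10)` is noise that would be removed from the corpus.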
In some optional implementations of the present embodiment, after step S203, and before step S204, the corpus generating method further includes:
setting a different category label for the coarse-granularity clustered corpus of each cluster, and storing the coarse-granularity clustered corpus of each cluster, the category labels, and the correspondence between them in an Elasticsearch search engine.
Specifically, a unique category label is set for the coarse-granularity clustered corpus of each cluster, and the coarse-granularity clustered corpus of each cluster, the category labels, and the correspondence between them are stored in the Elasticsearch search engine. The characteristics of the Elasticsearch search engine allow these data and correspondences to be stored and sorted quickly, which facilitates their later fast retrieval and aggregation.
Elasticsearch works mainly as follows: a user submits data to the Elasticsearch database; a word segmentation controller segments the corresponding sentences; the weights and the segmentation results are stored together with the data; and when a user searches the data, the results are ranked and scored according to the weights and returned to the user in descending order of score.
In this embodiment, setting a unique category label for the coarse-granularity clustered corpus of each cluster and storing the correspondence in the Elasticsearch search engine makes it easier to later fuse data and screen out irrelevant corpus data through the Elasticsearch search engine.
In some optional implementations of the present embodiment, after step S204, the corpus generating method further includes:
acquiring a preset threshold, and aggregating the target corpus with the Elasticsearch search engine according to the preset threshold to obtain a clustering result;
and selecting the non-relevant corpus according to the clustering result, and removing the non-relevant corpus to obtain an updated target corpus.
Specifically, the Elasticsearch search engine can retrieve texts with similar wording. Given a certain search threshold, its aggregation function can obtain, from the target corpus, the questions similar to a representative question. The target corpus is thereby screened again and the non-relevant corpus removed, which improves corpus quality.
The certain threshold, that is, the preset threshold in this embodiment, may be set according to the actual application scenario, for example to 0.6, and is not specifically limited here.
The non-relevant corpus refers to clusters or corpora whose relevance is lower than the preset threshold after the target corpus is aggregated by the Elasticsearch search engine.
Optionally, in this embodiment, the distances between the non-relevant corpus and all cluster centers of the target corpus are calculated by a sentence similarity algorithm. If the distance is smaller than a preset distance, the non-relevant corpus is determined to be weakly similar text, that is, a problem isolated point (orphan problem). The orphan problem is treated as a new type of problem in its own right and added to the target corpus as a new corpus, which improves the target corpus's support for relatively uncommon vaccine problems.
The sentence similarity algorithms include, but are not limited to, the brute-force algorithm, the RK (Rabin-Karp) algorithm, the KMP (Knuth-Morris-Pratt) algorithm, and a string-correction similarity algorithm based on pictophonetic codes and edit distance. The algorithm may be selected according to actual needs and is not limited here.
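Of the options listed above, edit distance is the simplest to illustrate. The sketch below computes the Levenshtein distance between two strings and converts it into a similarity score. Normalizing by the length of the longer string is an assumption made for illustration, and the pictophonetic-code component mentioned in the text is not modelled.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(
                prev[j] + 1,         # delete ch_a
                curr[j - 1] + 1,     # insert ch_b
                prev[j - 1] + cost,  # substitute or keep
            ))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Map edit distance to [0, 1]; 1.0 means identical (assumed scaling)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
    print(similarity("vaccine time", "vaccine times"))
```

A score such as this could be compared against the preset distance or the preset threshold when deciding whether a piece of corpus data counts as an orphan problem.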
In this embodiment, the non-relevant corpus is removed through the Elasticsearch search engine and the target corpus is updated, which keeps the target corpus concise and accurate and avoids the low accuracy in subsequent vaccine question-answer training that low-relevance corpora would cause. At the same time, some orphan problems are each treated as a separate type of problem and added to the target corpus, which improves the target corpus's support for uncommon vaccine problems.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution. The execution order of each process should be determined by its function and internal logic, and the sequence numbers should not limit the implementation of the embodiments of the present invention in any way.
Fig. 3 shows a schematic block diagram of a corpus generating apparatus in one-to-one correspondence with the above-described embodiment of the corpus generating method. As shown in fig. 3, the corpus generating device includes a data acquisition module 31, a data cleaning module 32, a coarse granularity clustering module 33, and a corpus determining module 34. The functional modules are described in detail as follows:
The data acquisition module 31 is configured to acquire a consultation text and a response text related to the vaccine from the medical consultation library as an initial text;
the data cleaning module 32 is configured to perform data cleaning on the initial text to obtain original corpus data;
the coarse-granularity clustering module 33 is configured to perform clustering processing on the original corpus data by using a K-means clustering model to obtain coarse-granularity clustered corpora of at least two clusters;
The corpus determining module 34 is configured to perform secondary clustering processing on the coarse-granularity clustered corpora by a density clustering algorithm for each cluster of coarse-granularity clustered corpora, and use the obtained density clustered corpora as a target corpus.
Optionally, the data acquisition module 31 includes:
the link analysis unit is used for determining the page weight of each preset path in the medical consultation library in a link analysis mode;
The target page determining unit is used for determining a target page according to the page weight of each preset path;
The page ordering unit is used for calculating the page ranking value of each target page based on a preset page ranking strategy, and ordering the target pages according to the order of the page ranking values from large to small to obtain a target page queue;
And the content acquisition unit is used for capturing the content in the target page based on the target page queue to obtain the consultation text and the response text related to the vaccine.
Optionally, corpus determining module 34 includes:
a preset parameter obtaining unit, configured to obtain a preset scanning radius eps and a preset minimum inclusion point minPts;
The neighborhood point number determining unit is used for counting, for each piece of corpus data in the coarse-granularity clustered corpus, the number of other pieces of corpus data contained within the preset scanning radius eps of that corpus data, and taking this number as the number of neighborhood points of the corpus data;
The core point determining unit is used for taking corpus data whose number of neighborhood points is greater than or equal to the preset minimum inclusion point number minPts as core points;
The boundary point determining unit is used for taking corpus data whose number of neighborhood points is smaller than the preset minimum inclusion point number minPts and which lies within the preset scanning radius eps of a core point as boundary points;
The target corpus acquisition unit is used for connecting boundary points with the distance not exceeding a preset scanning radius eps to form a density cluster, and adding core points within the range of the density cluster into the density cluster to obtain the target corpus.
Optionally, the corpus generating device further includes:
the first storage module is used for setting a different category label for the coarse-granularity clustered corpus of each cluster, and storing the coarse-granularity clustered corpus of each cluster, the category labels, and the correspondence between them in the Elasticsearch search engine.
Optionally, the corpus generating device further includes:
The aggregation module is used for acquiring a preset threshold, and aggregating the target corpus with the Elasticsearch search engine according to the preset threshold to obtain a clustering result;
And the updating module is used for selecting the non-relevant corpus according to the clustering result, and removing the non-relevant corpus to obtain the updated target corpus.
Optionally, the corpus generating device further includes:
and the second storage module is used for storing the target corpus in the blockchain network node.
For the specific definition of the corpus generating device, reference may be made to the definition of the corpus generation method above, which is not repeated here. Each module of the corpus generating device may be implemented wholly or partly in software, hardware, or a combination thereof. The modules may be embedded in or independent of a processor of the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can call them and perform the operations corresponding to each module.
In order to solve the technical problems, the embodiment of the application also provides computer equipment. Referring specifically to fig. 4, fig. 4 is a basic structural block diagram of a computer device according to the present embodiment.
The computer device 4 comprises a memory 41, a processor 42, and a network interface 43 that are communicatively connected to one another via a system bus. It should be noted that only a computer device 4 having the components memory 41, processor 42, and network interface 43 is shown in the figure, but it should be understood that not all of the illustrated components need to be implemented, and more or fewer components may be implemented instead. Those skilled in the art will appreciate that the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
The computer equipment can be a desktop computer, a notebook computer, a palm computer, a cloud server and other computing equipment. The computer equipment can perform man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch pad or voice control equipment and the like.
The memory 41 includes at least one type of readable storage medium, including flash memory, a hard disk, a multimedia card, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or internal memory of the computer device 4. In other embodiments, the memory 41 may be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the computer device 4. Of course, the memory 41 may also include both an internal storage unit and an external storage device of the computer device 4. In this embodiment, the memory 41 is typically used to store the operating system and the various application software installed on the computer device 4, such as program code for controlling electronic files. The memory 41 may also be used to temporarily store various types of data that have been output or are to be output.
In some embodiments, the processor 42 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to run the program code stored in the memory 41 or to process data, for example to run the program code for controlling electronic files.
The network interface 43 may comprise a wireless network interface or a wired network interface, which network interface 43 is typically used for establishing a communication connection between the computer device 4 and other electronic devices.
The present application also provides another embodiment, namely, a computer-readable storage medium storing an interface display program executable by at least one processor to cause the at least one processor to perform the steps of the corpus generation method as described above.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method according to the embodiments of the present application.
It is apparent that the above-described embodiments are only some, not all, embodiments of the present application. The drawings show preferred embodiments of the application but do not limit the scope of the patent. The application may be embodied in many different forms; these embodiments are provided so that the disclosure will be understood thoroughly and completely. Although the application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in those embodiments or substitute equivalents for some of their technical features. Any equivalent structure made using the content of the specification and drawings of the application, and applied directly or indirectly in other related technical fields, likewise falls within the scope of protection of the application.
Claims (7)
1. A corpus generation method, applied to training corpus generation for a vaccine question-answering robot, characterized by comprising the following steps:
acquiring consultation text and response text related to the vaccine from a medical consultation library as initial text;
performing data cleaning on the initial text to obtain original corpus data;
Clustering the original corpus data by adopting a K-means clustering model to obtain coarse-granularity clustered corpora of at least two clusters;
for each cluster of coarse-granularity clustered corpus, performing secondary clustering on the coarse-granularity clustered corpus by a density clustering algorithm, and taking the resulting density clustered corpus as a target corpus;
wherein performing, for each cluster of coarse-granularity clustered corpus, secondary clustering on the coarse-granularity clustered corpus by a density clustering algorithm and taking the resulting density clustered corpus as the target corpus comprises:
acquiring a preset scanning radius eps and a preset minimum inclusion point number minPts;
for each piece of corpus data in the coarse-granularity clustered corpus, counting the number of other pieces of corpus data contained within the preset scanning radius eps of that corpus data, and taking this number as the number of neighborhood points of the corpus data;
taking corpus data whose number of neighborhood points is greater than or equal to the preset minimum inclusion point number minPts as core points;
taking corpus data whose number of neighborhood points is smaller than the preset minimum inclusion point number minPts and which lies within the preset scanning radius eps of any core point as boundary points;
connecting boundary points whose distance from one another does not exceed the preset scanning radius eps to form a density cluster, and adding the core points within the range of the density cluster to the density cluster to obtain the target corpus;
after performing, for each cluster of coarse-granularity clustered corpus, secondary clustering on the coarse-granularity clustered corpus by a density clustering algorithm and taking the resulting density clustered corpus as the target corpus, the corpus generation method further comprises:
acquiring a preset threshold, and aggregating the target corpus with an Elasticsearch search engine according to the preset threshold to obtain a clustering result;
selecting non-relevant corpus according to the clustering result, and removing the non-relevant corpus to obtain an updated target corpus;
calculating the distance between the cluster center of the target corpus and the non-relevant corpus by a similarity algorithm, taking non-relevant corpus whose distance is smaller than a preset distance as a problem isolated point, and updating the target corpus according to the problem isolated point.
2. The corpus generation method of claim 1, wherein the acquiring, as the initial text, the vaccine-related consultation text and response text from the medical consultation library includes:
determining the page weight of each preset path in the medical consultation library in a link analysis mode;
determining a target page according to the page weight of each preset path;
calculating a page ranking value for each target page based on a preset page ranking strategy, and sorting the target pages in descending order of page ranking value to obtain a target page queue;
and crawling the content of the target pages based on the target page queue to obtain the vaccine-related consultation text and response text.
3. The corpus generation method according to claim 1 or 2, characterized in that, after clustering the original corpus data using a K-means clustering model to obtain coarse-granularity clustered corpora of at least two clusters, and before performing, for each cluster of coarse-granularity clustered corpus, secondary clustering on the coarse-granularity clustered corpus by a density clustering algorithm and taking the resulting density clustered corpus as the target corpus, the corpus generation method further comprises:
setting a different category label for the coarse-granularity clustered corpus of each cluster, and storing the coarse-granularity clustered corpus of each cluster, the category labels, and the correspondence between them in an Elasticsearch search engine.
4. The corpus generation method of claim 1, wherein after performing secondary clustering processing on the coarse-granularity clustered corpus by a density clustering algorithm for each cluster of coarse-granularity clustered corpus, taking the obtained density clustered corpus as a target corpus, the corpus generation method further comprises: storing the target corpus in a blockchain network node.
5. A corpus generation device applied to training corpus generation of a vaccine question-answering robot, wherein the corpus generation device is configured to implement the corpus generation method according to any one of claims 1 to 4, the corpus generation device comprising:
the data acquisition module is used for acquiring consultation texts and response texts related to the vaccine from the medical consultation library as initial texts;
the data cleaning module is used for cleaning the data of the initial text to obtain original corpus data;
the coarse-granularity clustering module is used for clustering the original corpus data by adopting a K-means clustering model to obtain coarse-granularity clustered corpora of at least two clusters;
the corpus determining module is used for performing, for each cluster of coarse-granularity clustered corpora, secondary clustering processing on the coarse-granularity clustered corpora through a density clustering algorithm, and taking the obtained density clustered corpora as target corpora;
wherein, the corpus determining module includes:
a preset parameter obtaining unit, configured to obtain a preset scanning radius eps and a preset minimum number of contained points minPts;
the neighborhood point number determining unit is used for counting, for each corpus data item in the coarse-granularity clustered corpus, the number of other corpus data items contained within the preset scanning radius eps, and taking that number as the number of neighborhood points corresponding to the corpus data item;
the core point determining unit is used for taking corpus data whose number of neighborhood points is greater than or equal to the preset minimum number of contained points minPts as core points;
the boundary point determining unit is used for taking corpus data whose number of neighborhood points is smaller than the preset minimum number of contained points minPts and which is located within the preset scanning radius eps of a core point as boundary points;
the target corpus acquisition unit is used for connecting boundary points whose distance does not exceed the preset scanning radius eps to form a density cluster, and adding core points within the range of the density cluster into the density cluster to obtain the target corpus.
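The units of the corpus determining module correspond closely to the steps of a DBSCAN-style density clustering. The sketch below is a minimal illustration under several assumptions: it follows the standard DBSCAN procedure (core points within eps of one another are connected into a cluster, and boundary points join the cluster of a nearby core point), it uses Euclidean distance, and it runs on toy 2-D points standing in for embedded corpus vectors. All names other than `eps` and `minPts` (written `min_pts` here) are hypothetical.

```python
import math

NOISE = -1  # label for points that belong to no density cluster

def region_query(points, i, eps):
    """Indices of all OTHER points within distance eps of points[i]."""
    return [
        j for j, q in enumerate(points)
        if j != i and math.dist(points[i], q) <= eps
    ]

def density_cluster(points, eps, min_pts):
    """Return (labels, core flags, boundary flags) for the given points."""
    # Neighborhood point number for every item.
    neighbors = [region_query(points, i, eps) for i in range(len(points))]
    # Core points have at least min_pts neighbors.
    core = [len(nb) >= min_pts for nb in neighbors]
    labels = [NOISE] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if not core[i] or labels[i] != NOISE:
            continue
        # Grow a new cluster from an unassigned core point.
        labels[i] = cluster_id
        stack = [i]
        while stack:
            p = stack.pop()
            for q in neighbors[p]:
                if labels[q] == NOISE:
                    labels[q] = cluster_id
                    if core[q]:  # only core points keep expanding
                        stack.append(q)
        cluster_id += 1
    # Boundary points: not core, but absorbed into some cluster.
    boundary = [not core[i] and labels[i] != NOISE for i in range(len(points))]
    return labels, core, boundary

points = [(0, 0), (0, 1), (1, 0), (1, 1), (2.2, 1), (8, 8)]
labels, core, boundary = density_cluster(points, eps=1.5, min_pts=3)
```

In this toy run the four corner points are core points, `(2.2, 1)` is a boundary point attached through its core neighbor `(1, 1)`, and `(8, 8)` stays unassigned, which is how non-relevant entries would be separated from the target corpus.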
6. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the corpus generation method according to any of claims 1 to 4 when executing the computer program.
7. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the corpus generation method according to any of claims 1 to 4.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010555008.XA CN111813905B (en) | 2020-06-17 | 2020-06-17 | Corpus generation method, corpus generation device, computer equipment and storage medium |
PCT/CN2020/099524 WO2021120588A1 (en) | 2020-06-17 | 2020-06-30 | Method and apparatus for language generation, computer device, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010555008.XA CN111813905B (en) | 2020-06-17 | 2020-06-17 | Corpus generation method, corpus generation device, computer equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111813905A (en) | 2020-10-23 |
CN111813905B (en) | 2024-05-10 |
Family
ID=72844729
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010555008.XA Active CN111813905B (en) | 2020-06-17 | 2020-06-17 | Corpus generation method, corpus generation device, computer equipment and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN111813905B (en) |
WO (1) | WO2021120588A1 (en) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114519101B (en) * | 2020-11-18 | 2023-06-06 | 易保网络技术(上海)有限公司 | Data clustering method and system, data storage method and system and storage medium |
CN112530409B (en) * | 2020-12-01 | 2024-01-23 | 平安科技(深圳)有限公司 | Speech sample screening method and device based on geometry and computer equipment |
CN113658710B (en) * | 2021-08-11 | 2024-07-26 | 东软集团股份有限公司 | Data matching method and related equipment thereof |
CN113921137A (en) * | 2021-10-15 | 2022-01-11 | 深圳百岁欢智能科技有限公司 | Health detection and management method based on medical big data |
CN114141387B (en) * | 2021-11-25 | 2024-08-16 | 泰康保险集团股份有限公司 | Interactive information recommendation method, device and equipment in Internet medical session |
CN114461594A (en) * | 2021-12-31 | 2022-05-10 | 国网河北省电力有限公司营销服务中心 | Data compression method, edge device and computer storage medium |
CN116863127A (en) * | 2022-03-28 | 2023-10-10 | 华为技术有限公司 | Method for acquiring region of interest and related equipment |
CN114860667B (en) * | 2022-05-17 | 2024-08-02 | 深圳须弥云图空间科技有限公司 | File classification method, device, electronic equipment and computer readable storage medium |
CN115101058A (en) * | 2022-06-17 | 2022-09-23 | 科大讯飞股份有限公司 | Voice data processing method and device, storage medium and equipment |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101388025A (en) * | 2008-10-09 | 2009-03-18 | 浙江大学 | Semantic web object ordering method based on Pagerank |
CN102073730A (en) * | 2011-01-14 | 2011-05-25 | 哈尔滨工程大学 | Method for constructing topic web crawler system |
CN102662954A (en) * | 2012-03-02 | 2012-09-12 | 杭州电子科技大学 | Method for implementing topical crawler system based on learning URL string information |
CN107480708A (en) * | 2017-07-31 | 2017-12-15 | 微梦创科网络科技(中国)有限公司 | The clustering method and system of a kind of complex model |
CN109117861A (en) * | 2018-06-29 | 2019-01-01 | 浙江大学宁波理工学院 | A kind of multi-level cluster analysis method of point set for taking spatial position into account |
CN109189934A (en) * | 2018-11-13 | 2019-01-11 | 平安科技(深圳)有限公司 | Public sentiment recommended method, device, computer equipment and storage medium |
CN110347835A (en) * | 2019-07-11 | 2019-10-18 | 招商局金融科技有限公司 | Text Clustering Method, electronic device and storage medium |
WO2020073534A1 (en) * | 2018-10-12 | 2020-04-16 | 平安科技(深圳)有限公司 | Pushing method and apparatus based on re-clustering, and computer device and storage medium |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103714140A (en) * | 2013-12-23 | 2014-04-09 | 北京锐安科技有限公司 | Searching method and device based on topic-focused web crawler |
CN105117501B (en) * | 2015-10-09 | 2017-07-11 | 广州神马移动信息科技有限公司 | Web crawlers dispatching method and apply its network crawler system |
CN107656948B (en) * | 2016-11-14 | 2019-05-07 | 平安科技(深圳)有限公司 | The problems in automatically request-answering system clustering processing method and device |
US10956677B2 (en) * | 2018-02-05 | 2021-03-23 | International Business Machines Corporation | Statistical preparation of data using semantic clustering |
CN109558403B (en) * | 2018-09-28 | 2024-02-02 | 中国平安人寿保险股份有限公司 | Data aggregation method and device, computer device and computer readable storage medium |
CN109766437A (en) * | 2018-12-07 | 2019-05-17 | 中科恒运股份有限公司 | A kind of Text Clustering Method, text cluster device and terminal device |
CN110032631B (en) * | 2019-03-26 | 2021-07-02 | 腾讯科技(深圳)有限公司 | Information feedback method, device and storage medium |
Application timeline (2020)
- 2020-06-17: CN application CN202010555008.XA, published as CN111813905B, status: Active
- 2020-06-30: WO application PCT/CN2020/099524, published as WO2021120588A1, status: Application Filing
Also Published As
Publication number | Publication date |
---|---|
WO2021120588A1 (en) | 2021-06-24 |
CN111813905A (en) | 2020-10-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111813905B (en) | Corpus generation method, corpus generation device, computer equipment and storage medium | |
US11334635B2 (en) | Domain specific natural language understanding of customer intent in self-help | |
CN112632385B (en) | Course recommendation method, course recommendation device, computer equipment and medium | |
US10366107B2 (en) | Categorizing questions in a question answering system | |
US9996604B2 (en) | Generating usage report in a question answering system based on question categorization | |
US9318027B2 (en) | Caching natural language questions and results in a question and answer system | |
US20170161619A1 (en) | Concept-Based Navigation | |
CN111708873A (en) | Intelligent question answering method and device, computer equipment and storage medium | |
CN113822067A (en) | Key information extraction method and device, computer equipment and storage medium | |
CN112131393A (en) | Construction method of medical knowledge map question-answering system based on BERT and similarity algorithm | |
CN111797214A (en) | FAQ database-based problem screening method and device, computer equipment and medium | |
CN110612522B (en) | Establishment of solid model | |
CN112287069B (en) | Information retrieval method and device based on voice semantics and computer equipment | |
RU2664481C1 (en) | Method and system of selecting potentially erroneously ranked documents with use of machine training algorithm | |
CN112989208B (en) | Information recommendation method and device, electronic equipment and storage medium | |
CN111538931A (en) | Big data-based public opinion monitoring method and device, computer equipment and medium | |
CN110688405A (en) | Expert recommendation method, device, terminal and medium based on artificial intelligence | |
CN116796730A (en) | Text error correction method, device, equipment and storage medium based on artificial intelligence | |
CN113569118B (en) | Self-media pushing method, device, computer equipment and storage medium | |
EP4127957A1 (en) | Methods and systems for searching and retrieving information | |
CN110717008A (en) | Semantic recognition-based search result ordering method and related device | |
CN117972032A (en) | Question and answer method, device, equipment and medium based on large language model | |
CN111914201B (en) | Processing method and device of network page | |
KR102454261B1 (en) | Collaborative partner recommendation system and method based on user information | |
CN113505889B (en) | Processing method and device of mapping knowledge base, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 40031287 |
GR01 | Patent grant | ||