CN114139624A - Method for mining time series data similarity information based on integrated model
- Publication number: CN114139624A (application CN202111438131.4A)
- Authority: CN (China)
- Prior art date: 2021-11-29
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Classifications
- G06F18/22—Matching criteria, e.g. proximity measures
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification based on the proximity to a decision surface, e.g. support vector machines
- G06F18/295—Markov models or related models, e.g. semi-Markov models; Markov random fields; networks embedding Markov models
- G06N20/10—Machine learning using kernel methods, e.g. support vector machines [SVM]
- G06N3/04—Neural networks; architecture, e.g. interconnection topology
- G06N3/08—Neural networks; learning methods
Abstract
A method for mining time series data similarity information based on an integrated model comprises a hidden Markov model and a conditional variational autoencoder model based on the Wasserstein distance. The method establishes an input layer that performs preliminary processing on the input time series; a hidden Markov classification layer and a conditional variational autoencoder layer then separately learn from and classify the input data, so the two base models can be trained in parallel. After learning, the two classification models are further optimized and fused through the Stacking algorithm. The Wasserstein distance is used in place of the KL divergence to measure the distance between two time series, giving the classifier wider applicability. The method mines similarity information from both the hidden states and the distribution of a time series and fuses all the mined information, so that model learning is more effective, computation is more efficient, and the method is more widely applicable.
Description
Technical Field
The invention belongs to the technical field of data mining and machine learning, and particularly relates to a method for mining time series data similarity information based on an integrated model.
Background
In time series data mining, similarity information is among the most critical information and is one of the starting points of data mining. However, many current time series mining algorithms discard the similarity information carried by the data distribution and compute similarity only from the raw values. Similarity mining that relies on the raw values alone loses information: features implicitly contained in the time series are dropped, the learning effect suffers, and the learned distribution can differ substantially from the true distribution. Algorithms that exploit time series distribution information are currently lacking; although distribution similarity is an important research problem in statistics, mining distribution similarity as part of time series data mining has not been widely discussed.
Disclosure of Invention
In order to overcome the shortcomings of the prior art, the present invention provides a method for mining time series data similarity information based on an integrated model. The method integrates a hidden Markov classifier, which mines the hidden-state information of time series data, with a conditional variational autoencoder classifier based on the Wasserstein distance, which mines the distribution similarity information of time series data, and classifies time series data by learning from the mined information with the integrated model. The invention not only classifies time series data effectively but also fuses the discrete and continuous information of the time series, making learning more effective; the base learners can be trained in parallel, so computation is more efficient.
To achieve this purpose, the invention adopts the following technical scheme:
A method for mining time series distribution similarity information that integrates a hidden Markov model and a Wasserstein distance based conditional variational autoencoder, comprising the following steps:
Step 1: process the original time series data to obtain processed time series data. The original time series data refers to directly acquired, unclassified time series data and can be divided into one or more categories. The step specifically comprises:
Step 1.1: classify the original time series data: measure the original time series with the Jaccard distance and cluster time series whose distances are close, obtaining the classified time series data;
Step 1.2: convert a time series A from one class of the classified time series data obtained in step 1.1 into a signature vector sig(A) using a MinHash function;
Step 1.3: after sig(A) of step 1.2 is obtained, divide sig(A) into different segments, each segment carrying a segment label.
Step 1.4: repeat steps 1.1 to 1.3 for all classified time series data of step 1.1, thus obtaining the segment labels of all classified time series data; determine the similarity of the classified time series data from matching segment labels, delete data whose segment labels differ, and complete the data preprocessing, obtaining the processed time series data.
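The preprocessing of steps 1.1 to 1.4 can be illustrated with the following Python sketch. It is a minimal illustration, not the patent's implementation: the shingle width, the number of hash functions, and the six-segment split (the value used in the website example below) are assumed parameters.

```python
import random
from collections import defaultdict

random.seed(0)
NUM_HASHES = 60        # MinHash signature length (assumed; divisible by the band count)
NUM_BANDS = 6          # number of segments; six is the value used in the website example
PRIME = 2_147_483_647  # large prime for the universal hash family h(x) = (a*x + b) mod PRIME
HASH_PARAMS = [(random.randrange(1, PRIME), random.randrange(PRIME))
               for _ in range(NUM_HASHES)]

def shingles(series, width=4):
    """Discretize a numeric time series into a set of overlapping integer shingles."""
    q = [int(round(v)) for v in series]
    return {hash(tuple(q[i:i + width])) & 0x7FFFFFFF
            for i in range(len(q) - width + 1)}

def minhash_signature(series):
    """sig(A): the minimum hash value of the shingle set under each hash function."""
    s = shingles(series) or {0}  # guard against very short series
    return tuple(min((a * x + b) % PRIME for x in s) for a, b in HASH_PARAMS)

def segment_labels(sig, num_bands=NUM_BANDS):
    """Step 1.3: split the signature into segments, one label per segment."""
    r = len(sig) // num_bands
    return [hash(sig[i * r:(i + 1) * r]) for i in range(num_bands)]

def lsh_filter(dataset):
    """Step 1.4: keep series sharing at least one segment label with another series."""
    buckets = defaultdict(list)
    for idx, series in enumerate(dataset):
        for band, label in enumerate(segment_labels(minhash_signature(series))):
            buckets[(band, label)].append(idx)
    keep = {i for members in buckets.values() if len(members) > 1 for i in members}
    return [dataset[i] for i in sorted(keep)]
```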
Step 2: establish a basic classification layer and input the processed time series data into a plurality of weak classifiers in the basic classification layer for preliminary classification. The basic classification layer comprises two models, namely a hidden Markov weak classifier obtained from a hidden Markov model and a conditional variational autoencoder weak classifier obtained from a conditional variational autoencoder based on the Wasserstein distance. The basic classification layer outputs a new data set of the same size as the input data. The step specifically comprises:
Step 2.1: input the processed time series data obtained in step 1 into the hidden Markov classification model of the basic classification layer, solve for the parameters using the forward-backward algorithm and the Baum-Welch algorithm, and decode with the Viterbi algorithm to obtain the hidden Markov weak classifier. Solving for the parameters specifically comprises:
Step 2.1.1: for a processed time series O = (o_1, o_2, o_3, ..., o_T), use the forward-backward algorithm to calculate the occurrence probability P(O | λ) of the processed time series under the hidden Markov classifier λ = (A, B, Π), where o_1, o_2, o_3, ..., o_T denote the values of the processed time series from time 1 to time T, A denotes the hidden-state transition probability matrix, B denotes the observation generation probability matrix (the observations being the values of the processed time series), and Π denotes the initial probability distribution of the hidden states;
Step 2.1.2: for D processed time series {O^(1), O^(2), ..., O^(D)}, calculate the parameters A, B and Π of the hidden Markov classifier using the Baum-Welch algorithm, where O^(i), i = 1, 2, ..., D, denotes the i-th processed time series;
Step 2.1.3: for the hidden Markov classifier λ = (A, B, Π), use the Viterbi algorithm to compute the most likely hidden state sequence I* = (i*_1, i*_2, ..., i*_T) of the processed time series O = (o_1, o_2, o_3, ..., o_T), where i*_t denotes the hidden state of the value o_t of the processed time series at time t.
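The hidden Markov weak classifier of steps 2.1.1 to 2.1.3 might look as follows. This is a minimal sketch assuming the hmmlearn library (the patent names no implementation): Baum-Welch fits one Gaussian HMM per class, the forward algorithm scores P(O | λ), and Viterbi decoding recovers the hidden state sequence; a series is assigned to the class whose model scores highest.

```python
import numpy as np
from hmmlearn import hmm  # assumed implementation of HMM training and decoding

def train_class_hmms(series_by_class, n_states=4, n_iter=50):
    """Fit one Gaussian HMM per class with Baum-Welch (steps 2.1.1 and 2.1.2)."""
    models = {}
    for label, series_list in series_by_class.items():
        X = np.concatenate([np.asarray(s).reshape(-1, 1) for s in series_list])
        lengths = [len(s) for s in series_list]
        m = hmm.GaussianHMM(n_components=n_states, n_iter=n_iter)
        m.fit(X, lengths)                 # estimates A, B and the initial distribution
        models[label] = m
    return models

def hmm_classify(models, series):
    """Score log P(O | lambda) under every class model and take the argmax."""
    s = np.asarray(series).reshape(-1, 1)
    scores = {label: m.score(s) for label, m in models.items()}
    best = max(scores, key=scores.get)
    states = models[best].predict(s)      # Viterbi decoding (step 2.1.3)
    return best, states
```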
Step 2.2: input the processed time series data obtained in step 1 into the conditional variational autoencoder of the basic classification layer and calculate the Wasserstein distance of the time series data with the Sinkhorn approximation algorithm to obtain the conditional variational autoencoder weak classifier. Constructing the conditional variational autoencoder weak classifier comprises the following steps:
Step 2.2.1: sample one class of the processed time series data to obtain a time series sample O, and pass it through a neural network encoder that outputs the statistics μ and σ² of a normal distribution, where μ denotes the mean and σ² the variance of the normal distribution;
Step 2.2.2: sample the standard normal distribution N(0, 1) to obtain a sample e, and combine it with the μ and σ² output by the neural network encoder of step 2.2.1 according to formula 1 to obtain the latent variable z:
z = μ + σ · e    (formula 1)
Step 2.2.3: pass z through the neural network decoder to obtain output data Ô with the same dimension as the processed time series data;
Step 2.2.4: use the Wasserstein distance between the processed time series sample O and the neural network decoder output Ô as part of the optimization target error ε, and optimize this target through multiple iterations to obtain the trained neural network decoder, wherein calculating the Wasserstein distance comprises the following steps:
Step 2.2.4.1: introduce an entropy regularization term to apply dimension-reducing smoothing to the processed time series sample O and the neural network decoder output Ô, the entropy regularization function being
H(p) = -Σ_k p(x_k) log p(x_k)
where p(x) denotes the distribution function of the processed time series data and p(x_k) denotes the probability of the value x_k of the processed time series at time k;
Step 2.2.4.2: calculate the Wasserstein distance with the Sinkhorn approximation algorithm to reduce the amount of computation. Combining the entropy regularization function of step 2.2.4.1, the Wasserstein distance is given by formula 2:
W(O, Ô) = min_P Σ_n P_n c(ô_n, o_n) - γ H(P)    (formula 2)
where W(O, Ô) denotes the Wasserstein distance between the processed time series O and the neural network decoder output Ô, c(ô_n, o_n) denotes the cost, at time n, of transferring ô_n to o_n, P denotes the transport plan, and γ > 0 weights the entropy regularization term;
Step 2.2.4.3: integrate the Wasserstein distance into the optimization target error ε to obtain the expression of the optimization target, formula 3:
ε = R(O, Ô) + W(O, Ô)    (formula 3)
where R(O, Ô) denotes the reconstruction error of reconstructing O from Ô, calculated as in formula 4:
R(O, Ô) = ||O - Ô||²    (formula 4)
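The Wasserstein computation of steps 2.2.4.1 to 2.2.4.3 can be sketched numerically as follows. This is an illustrative Python sketch, not the patent's implementation: the squared-difference ground cost, the uniform marginal weights, and the regularization strength gamma are all assumptions.

```python
import numpy as np

def sinkhorn_wasserstein(o, o_hat, gamma=0.1, n_iters=200):
    """Entropy-regularized Wasserstein distance between two series (cf. formula 2)."""
    o, o_hat = np.asarray(o, float), np.asarray(o_hat, float)
    C = (o[:, None] - o_hat[None, :]) ** 2      # ground cost c(o_n, o_hat_m), assumed squared
    K = np.exp(-C / gamma)                      # Gibbs kernel of the regularized problem
    a = np.full(len(o), 1.0 / len(o))           # uniform marginal over time points (assumed)
    b = np.full(len(o_hat), 1.0 / len(o_hat))
    u = np.ones_like(a)
    for _ in range(n_iters):                    # Sinkhorn fixed-point iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]             # approximate optimal transport plan
    return float(np.sum(P * C))

def optimization_error(o, o_hat, gamma=0.1):
    """Optimization target of formula 3: reconstruction error (formula 4) plus Wasserstein term."""
    rec = float(np.mean((np.asarray(o) - np.asarray(o_hat)) ** 2))
    return rec + sinkhorn_wasserstein(o, o_hat, gamma)
```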
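The encoder-decoder pair of steps 2.2.1 to 2.2.4 might then be realized as in the following PyTorch sketch. The layer sizes, the optimizer, and the reuse of the numpy Sinkhorn routine above are assumptions rather than the patent's design, and the conditioning on class labels is omitted for brevity.

```python
import torch
import torch.nn as nn

class WassersteinCVAE(nn.Module):
    """Variational autoencoder trained with a Wasserstein-based error (steps 2.2.1-2.2.4)."""
    def __init__(self, seq_len=1024, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(seq_len, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)      # mean of N(mu, sigma^2)
        self.to_logvar = nn.Linear(256, latent_dim)  # log variance
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, seq_len))

    def forward(self, o):
        h = self.encoder(o)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        e = torch.randn_like(mu)                 # sample e from N(0, 1)
        z = mu + torch.exp(0.5 * logvar) * e     # formula 1: z = mu + sigma * e
        return self.decoder(z), mu, logvar

def train(model, batches, epochs=10, lr=1e-3, gamma=0.1):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for o in batches:                        # o: (batch, seq_len) float tensor
            o_hat, mu, logvar = model(o)
            rec = torch.mean((o - o_hat) ** 2)   # reconstruction error, formula 4
            # Wasserstein term from the numpy sketch above on a detached sample;
            # gradients flow only through the reconstruction term here, so a
            # differentiable Sinkhorn layer would be substituted in practice.
            w = sinkhorn_wasserstein(o[0].detach().numpy(),
                                     o_hat[0].detach().numpy(), gamma)
            loss = rec + w                       # optimization target, formula 3
            opt.zero_grad(); loss.backward(); opt.step()
```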
Step 2.2.5: input the processed time series data into the neural network encoder and the decoder trained in step 2.2.4, and output generated time series data whose distribution is approximately consistent with that of the processed time series data.
Step 3: establish an integrated fusion layer, use the output of the basic classification layer as the input of the integrated fusion layer, and perform ensemble learning training over the plurality of weak classifiers in the basic classification layer to obtain a secondary learner, so as to obtain the final integrated model;
Step 3.1: input the processed time series data into the two weak classifiers of the basic classification layer and collect their output data as the integrated training data set; the number of data categories of the integrated training data set is consistent with that of the processed time series data.
Step 3.2: construct a secondary learner and use the integrated training data set collected in step 3.1 as its training data, so that the secondary learner learns the output of the basic classification layer. Constructing the secondary learner comprises the following steps:
Step 3.2.1: use a support vector machine classifier as the secondary learner, constructing one support vector machine between every pair of sample classes for classification; thus, if the integrated training data set has k data categories, k(k-1)/2 support vector machine classifiers need to be constructed;
Step 3.2.2: input the integrated training data set to train the support vector machine classifiers; after training, for a sample of unknown class, count the class predicted by each support vector machine classifier and output the class with the most votes as the class of that sample;
Step 3.3: take the output of the secondary learner as the final output of the integrated fusion layer.
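A compact sketch of the fusion layer, assuming scikit-learn (which the patent does not name): OneVsOneClassifier builds the k(k-1)/2 binary SVMs of step 3.2.1 and resolves them by voting, and the meta-features are the stacked outputs of the two base classifiers from step 3.1. The build_meta_features helper and its cvae_scores argument are illustrative assumptions about how the base outputs are collected.

```python
import numpy as np
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

def build_meta_features(hmm_models, cvae_scores, series_list):
    """Stack the outputs of the two weak classifiers into one meta-feature matrix."""
    rows = []
    for s in series_list:
        # Per-class log-likelihoods from the HMM sketch above.
        hmm_row = [m.score(s.reshape(-1, 1)) for m in hmm_models.values()]
        rows.append(np.concatenate([hmm_row, cvae_scores(s)]))
    return np.vstack(rows)

def train_secondary_learner(X_meta, y):
    """One-vs-one SVM secondary learner: k(k-1)/2 binary classifiers with voting."""
    clf = OneVsOneClassifier(SVC(kernel="rbf"))
    clf.fit(X_meta, y)
    return clf
```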
Step 4: mine the similarity information of the time series data using the obtained integrated model.
Compared with the prior art, the invention has the beneficial effects that:
1) The method extracts hidden-variable-based similarity features of time series data, fuses the discrete and continuous information of the time series for data mining, and fills a gap left by existing time series data mining methods.
2) The invention introduces a conditional variational autoencoder based on the Wasserstein distance, replacing the KL divergence used for measurement in the original model with the Wasserstein distance and approximating it with the Sinkhorn algorithm, so that the latent variable fits a wider range of data distributions while computational resources are saved.
3) The invention introduces ensemble learning, fusing the hidden Markov model that mines the hidden-state information of a time series and the Wasserstein distance based conditional variational autoencoder model that mines the distribution similarity information of a time series through the Stacking fusion optimization algorithm; this reduces redundancy, lets the base learners compensate for each other's deficiencies, and improves classification accuracy and computational efficiency.
4) The method can be used for time series anomaly detection and traffic flow prediction. For anomaly detection, time series recorded under normal conditions are input into the integrated model as training data, and the data to be examined are then input into the model for classification to determine whether they are abnormal. For traffic flow prediction, labeled traffic flow data are input into the integrated model for learning; after learning, the data to be examined are input to judge whether the traffic flow is congested. This overcomes the untimely prediction of emergencies by other methods and improves the robustness of the model.
Drawings
Fig. 1 is an overall structural view of the present invention.
Detailed Description
The embodiments of the present invention will be described in detail below with reference to the drawings and examples.
As shown in FIG. 1, the method for mining time series data similarity information based on an integrated model constructs an input layer, a basic classification layer, and an integrated fusion layer. The original data are input into the input layer and preprocessed to obtain sample data. The basic classification layer comprises two weak classifiers: a hidden Markov weak classifier and a Wasserstein distance based conditional variational autoencoder weak classifier. The processed time series data are input into the two weak classifiers in parallel for classification learning. n hidden Markov weak classifiers λ_1, λ_2, ..., λ_n are trained, one for each class of data; the processed time series data are input into all classifiers to obtain n probabilities p_1, p_2, ..., p_n, and the class label of the maximum value among all results is taken as the final classification result of the hidden Markov weak classifier. While the hidden Markov weak classifier is trained, the processed time series data are also input into the Wasserstein distance based conditional variational autoencoder weak classifier: a sample x is drawn from the input and passed through the neural network encoder to obtain the sufficient statistics of the normal distribution N(μ, σ²), namely the mean μ and variance σ²; N(μ, σ²) is then sampled to obtain z, and z passes through the neural network decoder to output x̂, so that a new sample under the same distribution is generated by the trained neural network decoder. Finally, the outputs of the two weak classifiers serve as the input of the integrated fusion layer. During data mining, the integrated algorithm combines information from both directions, so the model mines more information and has a stronger learning ability; meanwhile, the base learners can be trained in parallel, giving higher computational efficiency.
Referring to fig. 1 and taking the mining of time series data as an example, the method is carried out according to steps 1 to 4 exactly as set forth in the disclosure above.
The following is a specific application case of the invention in website identification. The traffic time series of five websites are used as experimental data; the integrated model mines the similarity of the time series to perform website identification and determines the category of website traffic time series of unknown category.
In this example, the original time series data have a length of 2525 and a width of 1024: the length indicates the number of acquisitions included in the data of this example, the width indicates the website traffic values recorded for each acquisition, and the data comprise the traffic time series of 5 websites. First, the original time series data are preprocessed; in this example a locality-sensitive hashing algorithm is chosen to delete part of the data: on the basis of the MinHash-processed data, each signature vector is divided into six segments, the data with the highest similarity are retained, data with low similarity are deleted, and noise interference is reduced. Then 200 time series per website class are selected as original time series data to obtain the processed time series data. Next, the processed time series data are input by category into step 2 to construct the basic classification layer: they are fed into the hidden Markov weak classifier of step 2.1 for hidden-information mining (the input size in this example is 1024 × 1) and, in parallel, into the Wasserstein distance based conditional variational autoencoder weak classifier of step 2.2 to learn the distribution of the input time series (the input size is again 1024 × 1). Note that, as described in step 2.2.1, sampling must keep the samples uniformly distributed, so the same number of processed time series is collected for each class. The sampled time series data are then learned according to steps 2.2.2 to 2.2.5. After learning, the output of the basic classification layer serves as the input of the integrated fusion layer, which is trained according to step 3; once trained, the integrated fusion layer is used for website identification: the traffic time series of an unknown website is input, and the resulting output classification is the category of the unknown website.
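Reusing the sketches above, the website identification case might be wired end to end as follows; the synthetic data, the class labels, and the cvae_scores interface are purely illustrative assumptions:

```python
import numpy as np

# Hypothetical stand-in data: 5 website classes, 200 series each, 1024 points per series.
rng = np.random.default_rng(0)
series_by_class = {c: [rng.normal(c, 1.0, 1024) for _ in range(200)] for c in range(5)}

# Step 1: MinHash/LSH preprocessing (first sketch); fall back if a class empties.
clean_by_class = {c: (lsh_filter(v) or v) for c, v in series_by_class.items()}

# Step 2: base layer: per-class HMMs plus a stand-in for the CVAE weak classifier.
hmms = train_class_hmms(clean_by_class, n_states=3, n_iter=5)  # small n_iter for speed
def cvae_scores(series):
    """Assumed interface: per-class scores from trained WassersteinCVAE models."""
    return np.zeros(len(clean_by_class))  # placeholder values

# Step 3: stack the base outputs and train the one-vs-one SVM secondary learner.
X, y = [], []
for c, v in clean_by_class.items():
    X.extend(v)
    y.extend([c] * len(v))
meta = build_meta_features(hmms, cvae_scores, X)
clf = train_secondary_learner(meta, np.array(y))

# Step 4: classify the traffic series of an unknown website.
unknown = rng.normal(2, 1.0, 1024)
print(clf.predict(build_meta_features(hmms, cvae_scores, [unknown])))
```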
The invention may also be used in other industrial scenarios, such as traffic flow prediction. Note that traffic flow prediction is a regression problem rather than a classification problem, so the output of the integrated model needs slight modification: the distribution of the traffic flow data is learned by means of the Wasserstein distance based conditional variational autoencoder, and the predicted flow at a given time is output. The specific steps are as follows. First, the original time series data are the traffic flow time series of a given location; the length indicates the number of times contained in the series, and the width indicates the traffic flow at each time. Second, the original time series data are preprocessed to obtain the processed time series data. Next, the processed time series data are input into the basic classification layer for classification-information mining to obtain output data. Then the times output by the basic classification layer serve as the input of the secondary learner in the integrated fusion layer, which learns the traffic flow output by the basic classification layer; the output of the secondary learner is the traffic flow, and it serves as the output of the integrated fusion layer. Finally, the time to be predicted is input into the integrated fusion layer, and the obtained output is the predicted traffic flow. In this way, prediction is performed by mining the similarity information of traffic flow time series data.
Meanwhile, the invention can also be used for time series data of other dimensions, such as identifying a speech source. Speech time series data comprise three dimensions (utterance time, utterance duration, and utterance interval), one dimension more than the website traffic and traffic flow time series in the cases above. The speech time series data can be input directly into the basic classification layer for classification learning; the output of the basic classification layer is then used as the input of the integrated fusion layer for further learning, and finally speech time series data of unknown class are input into the integrated fusion layer, whose output class is the class of the unknown speech time series. In this way, the similarity of speech time series data from the same source can be used to identify the speech source.
Claims (8)
1. A method for mining time series data similarity information based on an integrated model, characterized by comprising the following steps:
step 1: processing original time series data and dividing it into one or more categories, wherein the original time series data refers to directly acquired, unclassified time series data;
step 2: establishing a basic classification layer, and inputting the processed time series data into a plurality of weak classifiers in the basic classification layer for preliminary classification; the basic classification layer comprises two models, namely a hidden Markov weak classifier obtained from a hidden Markov model and a conditional variational autoencoder weak classifier obtained from a conditional variational autoencoder based on the Wasserstein distance; the basic classification layer outputs a new data set of the same size as the input data;
step 3: establishing an integrated fusion layer, using the output of the basic classification layer as the input of the integrated fusion layer, and obtaining a secondary learner through ensemble learning training that fuses the plurality of weak classifiers in the basic classification layer, so as to obtain the final integrated model;
step 4: mining similarity information of time series data by using the obtained integrated model.
2. The method for mining time series data similarity information based on an integrated model according to claim 1, wherein step 1 specifically comprises the following steps:
step 1.1: classifying the original time series data: measuring the original time series with the Jaccard distance and clustering time series whose distances are close to obtain the classified time series data;
step 1.2: converting a time series A from one class of the classified time series data into a signature vector sig(A) using a MinHash function;
step 1.3: dividing sig(A) into different segments, each segment having a segment label;
step 1.4: repeating steps 1.2 to 1.3 for all the classified time series data to obtain the segment labels of all classified time series data, determining the similarity of the classified time series data from matching segment labels, deleting data whose segment labels differ, and completing the data processing.
3. The method for mining time series data similarity information based on an integrated model according to claim 1, wherein establishing the basic classification layer in step 2 specifically comprises the following steps:
step 2.1: inputting the processed time series data into the hidden Markov classification model of the basic classification layer, solving for the parameters using the forward-backward algorithm and the Baum-Welch algorithm, and decoding with the Viterbi algorithm to obtain the hidden Markov weak classifier;
step 2.2: inputting the processed time series data into the conditional variational autoencoder of the basic classification layer, and calculating the Wasserstein distance of the time series data with the Sinkhorn approximation algorithm to obtain the conditional variational autoencoder weak classifier.
4. The method for mining time series data similarity information based on an integrated model according to claim 3, wherein constructing the hidden Markov weak classifier in step 2.1 specifically comprises the following steps:
step 2.1.1: for a processed time series O = (o_1, o_2, o_3, ..., o_T), calculating the occurrence probability P(O | λ) of the processed time series under the hidden Markov classifier λ = (A, B, Π) using the forward-backward algorithm, where o_1, o_2, o_3, ..., o_T denote the values of the processed time series from time 1 to time T, A denotes the hidden-state transition probability matrix, B denotes the observation generation probability matrix, the observations being the values of the processed time series, and Π denotes the initial probability distribution of the hidden states;
step 2.1.2: for D processed time series {O^(1), O^(2), ..., O^(D)}, calculating the parameters A, B and Π of the hidden Markov classifier λ using the Baum-Welch algorithm, where O^(i), i = 1, 2, ..., D, denotes the i-th processed time series;
step 2.1.3: for the hidden Markov classifier λ = (A, B, Π), computing the most likely hidden state sequence I* = (i*_1, i*_2, ..., i*_T) of the processed time series O with the Viterbi algorithm, where i*_t denotes the hidden state of the value o_t at time t.
5. The method for mining time series data similarity information based on an integrated model according to claim 4, wherein constructing the conditional variational autoencoder weak classifier in step 2.2 specifically comprises the following steps:
step 2.2.1: sampling one class of the processed time series data to obtain a time series sample O, and passing it through a neural network encoder that outputs the statistics μ and σ² of a normal distribution, where μ denotes the mean and σ² the variance of the normal distribution;
step 2.2.2: sampling the standard normal distribution N(0, 1) to obtain a sample e, and combining it with the μ and σ² output by the neural network encoder of step 2.2.1 according to formula 1, z = μ + σ · e, to obtain the latent variable z;
step 2.2.3: passing z through the neural network decoder to obtain output data Ô with the same dimension as the processed time series data;
step 2.2.4: using the Wasserstein distance between the processed time series sample O and the neural network decoder output Ô as part of the optimization target error ε, and optimizing the target through multiple iterations to obtain the trained neural network decoder;
step 2.2.5: inputting the processed time series data into the neural network encoder and the decoder trained in step 2.2.4, and outputting generated time series data approximately consistent with the distribution of the processed time series data.
6. The method for mining time series data similarity information based on an integrated model according to claim 5, wherein calculating the Wasserstein distance in step 2.2.4 specifically comprises the following steps:
step 2.2.4.1: introducing an entropy regularization term to apply dimension-reducing smoothing to the processed time series sample O and the neural network decoder output Ô, the entropy regularization function being H(p) = -Σ_k p(x_k) log p(x_k), where p(x) denotes the distribution function of the processed time series data and p(x_k) denotes the probability of the value x_k of the processed time series at time k;
step 2.2.4.2: calculating the Wasserstein distance with the Sinkhorn approximation algorithm to reduce the amount of computation; combining the entropy regularization function of step 2.2.4.1, the Wasserstein distance is given by formula 2, W(O, Ô) = min_P Σ_n P_n c(ô_n, o_n) - γ H(P), where W(O, Ô) denotes the Wasserstein distance between the processed time series O and the neural network decoder output Ô, c(ô_n, o_n) denotes the cost, at time n, of transferring ô_n to o_n, P denotes the transport plan, and γ weights the entropy regularization term;
step 2.2.4.3: integrating the Wasserstein distance into the optimization target error ε to obtain the expression of the optimization target, formula 3, ε = R(O, Ô) + W(O, Ô), where R(O, Ô) denotes the reconstruction error of reconstructing O from Ô, calculated as in formula 4, R(O, Ô) = ||O - Ô||².
7. The method for mining time series data similarity information based on an integrated model according to claim 1, wherein constructing the integrated fusion layer in step 3 comprises the following steps:
step 3.1: inputting the processed time series data into the two weak classifiers in the basic classification layer, and collecting the output data as an integrated training data set, the number of data categories of the integrated training data set being consistent with that of the processed time series data;
step 3.2: constructing a secondary learner, and using the collected integrated training data set as training data of the secondary learner, so that the secondary learner learns the output of the basic classification layer;
step 3.3: taking the output of the secondary learner as the final output of the integrated fusion layer.
8. The method for mining time series data similarity information based on an integrated model according to claim 7, wherein constructing the secondary learner in step 3.2 comprises the following steps:
step 3.2.1: using a support vector machine classifier as the secondary learner, and constructing a support vector machine between every pair of sample classes for classification; if the number of data categories of the integrated training data set is k, k(k-1)/2 support vector machine classifiers need to be constructed;
step 3.2.2: inputting the integrated training data set to train the support vector machine classifiers; after training, for a sample of unknown class, counting the class predicted by each support vector machine classifier, taking the class with the most votes as the class of the sample of unknown class, and outputting the class.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111438131.4A CN114139624A (en) | 2021-11-29 | 2021-11-29 | Method for mining time series data similarity information based on integrated model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114139624A true CN114139624A (en) | 2022-03-04 |
Family
ID=80389582
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111438131.4A Pending CN114139624A (en) | 2021-11-29 | 2021-11-29 | Method for mining time series data similarity information based on integrated model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114139624A (en) |
- 2021-11-29: CN application CN202111438131.4A; publication CN114139624A (en), status: active, pending
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115599984A (en) * | 2022-09-09 | 2023-01-13 | 北京理工大学(Cn) | Retrieval method |
CN115599984B (en) * | 2022-09-09 | 2023-06-09 | 北京理工大学 | Retrieval method |
CN116304358A (en) * | 2023-05-17 | 2023-06-23 | 济南安迅科技有限公司 | User data acquisition method |
CN116304358B (en) * | 2023-05-17 | 2023-08-08 | 济南安迅科技有限公司 | User data acquisition method |
CN117993500A (en) * | 2024-04-07 | 2024-05-07 | 江西为易科技有限公司 | Medical teaching data management method and system based on artificial intelligence |
Legal Events
Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication | 
 | SE01 | Entry into force of request for substantive examination | 