
CN118333147B - Related subspace searching method in mass data outlier detection - Google Patents


Info

Publication number
CN118333147B
CN118333147B (application CN202410774444.4A)
Authority
CN
China
Prior art keywords
hash
data
subspaces
subspace
attribute
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410774444.4A
Other languages
Chinese (zh)
Other versions
CN118333147A (en)
Inventor
万晓珑
徐千惠
韩希先
王金宝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology Weihai
Original Assignee
Harbin Institute of Technology Weihai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology Weihai filed Critical Harbin Institute of Technology Weihai
Priority to CN202410774444.4A priority Critical patent/CN118333147B/en
Publication of CN118333147A publication Critical patent/CN118333147A/en
Application granted granted Critical
Publication of CN118333147B publication Critical patent/CN118333147B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/01Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/10Pre-processing; Data cleansing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/2433Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of data processing, and in particular relates to a relevant subspace search method for outlier detection in massive data. The method mainly comprises the following steps: step 1, preprocess the original data to construct an ordered list set and a hash fragment set; using each hash fragment obtained in preprocessing in turn, perform adaptive relevant-attribute judgment and retain a duplicate-free relevant-attribute result set; step 2, generate all candidate subspaces from the result of step 1, and judge relevant subspaces using the precomputed ordered list set and an FLA structure based on the most-frequent replacement strategy; step 3, perform redundancy deletion on the result of step 2 and return all relevant subspaces. The invention uses a locality-sensitive hash index to partition the dataset in advance into hash fragments that fit in memory, thereby avoiding the limitation that massive datasets cannot be processed due to insufficient memory; and each hash fragment is verified independently to prune irrelevant attributes, greatly reducing the number of candidate subspaces.

Description

Related subspace searching method in mass data outlier detection
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a related subspace searching method in mass data outlier detection.
Background
Outlier detection is an important task in data analysis and data mining that aims to find the set of data objects in a dataset whose observations deviate significantly from the majority (Hawkins D M. Identification of Outliers[M]. Berlin: Springer, 1980.). Outlier detection not only supports data cleaning but also helps extract high-value information, and it is now widely applied in many real-world scenarios.
However, conventional outlier detection methods on the one hand cannot cope with the "curse of dimensionality", and on the other hand cannot identify data objects that appear abnormal only in certain attribute combinations (Aggarwal C C. Outlier Analysis[M]. Cham: Springer, 2017.). This motivated research on subspace-based outlier detection algorithms, which aim to find outliers per subspace by selecting one or more attribute subsets (i.e., subspaces) from the full-dimensional data space (Ilyas I F, Chu X. Data Cleaning[M]. New York: ACM, 2019.). Existing subspace outlier detection algorithms can be divided into the following classes (Trittenbach H, Böhm K. Dimension-based subspace search for outlier detection[J]. International Journal of Data Science and Analytics, 2019, 7: 87-101.): random-sampling-based, data-object-based, and correlation-based methods.
Existing subspace outlier detection algorithms generally suffer from high computation and memory costs and cannot keep pace with ever-growing data scales. Therefore, how to efficiently search massive data for the subspaces in which outliers are easy to find is of important research significance.
Disclosure of Invention
Aiming at the defects of the existing subspace outlier detection algorithms, such as huge index structures and huge numbers of candidate subspaces, the invention provides a relevant subspace search method for outlier detection in massive data based on locality-sensitive hash partitioning (RSLSH: Relevant Subspace Search with Locality-Sensitive Hashing).
The invention provides a related subspace searching method in mass data outlier detection, which mainly comprises the following steps:
Step 1, preprocess the original data to construct an ordered list set and a hash fragment set; using each hash fragment obtained in preprocessing in turn, perform adaptive relevant-attribute judgment and retain a duplicate-free relevant-attribute result set;
Step 2, generate all candidate subspaces from the result of step 1, and judge relevant subspaces using the precomputed ordered list set and an FLA structure based on the most-frequent replacement strategy;
Step 3, perform redundancy deletion on the result of step 2 using a pruning strategy and return all relevant subspaces.
Preferably, in step 1, the method for preprocessing the original data to construct the hash fragment set includes:
Primary hashing: according to the specified parameters, determine a family of p-stable combined hash functions for hash mapping, then compute each data object's hash bucket number; one hashing of a data object produces a combination vector whose length equals the number of hash functions, and the corresponding hash bucket number is obtained by an operation combining the elements of that vector. All data objects with the same hash bucket number are stored in the same file, constituting a primary hash fragment. DB denotes a D-dimensional dataset containing N points.
Secondary hashing: the hash bucket numbers obtained in the first stage are re-hashed and projected to integer values in a one-dimensional space, so that adjacent bucket numbers are hashed to adjacent positions after reordering the projection results.
Merging operation: first determine the average number of data objects stored per merged bucket; on the premise that the merged total does not exceed the memory capacity, traverse adjacent hash bucket numbers in turn and store the data objects of the corresponding hash fragments in the same file, constituting a secondary hash fragment.
Preferably, for secondary hash fragments that are still larger than the memory capacity after the merging operation, the recursive partitioning strategy continues by enlarging the combined hash vector until all fragments satisfy the memory constraint, yielding the final hash fragment set.
Preferably, in step 1, determining whether the current attribute dimension is a relevant attribute specifically comprises: judging the relation between the single-attribute-dimension variance and the expected variance of the data space to decide whether the attribute satisfies the uniform-distribution condition; if the single-attribute-dimension variance is smaller than the expected variance, i.e., the lower bound of the variance threshold, the attribute dimension is considered a relevant attribute.
Preferably, in step 2, candidate subspace generation is specifically: verify the relevant-attribute result of each hash fragment independently, and generate all relevant attribute combinations in turn by depth-first-search backtracking; all relevant attribute combinations form the candidate subspace set, which is then used to judge whether each candidate is a relevant subspace.
Preferably, in step 2, the FLA structure is a fixed-size array structure maintained in memory, whose size is determined by the allocated memory capacity and the size of the dataset; each element stores one dimension's ordered list and its historical use frequency. Before each subspace-correlation computation, the structure is adaptively and dynamically updated as the computed subspace changes; whenever a correlation computation needs to read data of a specified dimension, the storage location of the data is determined first, and the data is then read on demand to complete the corresponding computation.
Preferably, in step 2, the most-frequent replacement policy is specifically: during FLA structure updates, attribute dimensions with higher access frequency are replaced preferentially.
Preferably, in step 3, the pruning strategy is specifically: given two subspaces where one is contained in the other, the redundant subspace is deleted when its correlation does not exceed that of the other; the comparison involves the two subspaces' dimensionalities and correlations.
According to the technical scheme, the beneficial effects of the invention are as follows:
(1) The method is characterized by providing a new related subspace search algorithm based on local sensitive hash, and the related subspace in outlier detection can be efficiently obtained on mass data;
(2) The pre-calculated ordered list set is used as a bottom data structure, and the method can be applied to all scenes for solving the subspace correlation only by constructing once, so that the cost of sequencing in each calculation is avoided;
(3) The data set is divided into hash fragments which can be accommodated in the memory in advance by utilizing the local sensitive hash index, so that the limitation of processing the massive data set due to insufficient memory is avoided;
(4) Independent verification is carried out on all hash fragments to carry out independent attribute shearing, so that the number of candidate subspaces is greatly reduced;
(5) And calculating subspace correlation by using a sequence table, and designing a FLA auxiliary structure based on the most frequent replacement strategy in combination with a search mode, so that I/O overhead in calculating subspace correlation is reduced.
Drawings
FIG. 1 is a flowchart of a related subspace search method in outlier detection of mass data in an embodiment of the present invention;
FIG. 2 is a schematic diagram of the construction of the ordered list set AL;
FIG. 3 is a schematic diagram of a hash fragment set LS construction flow;
FIG. 4 is a schematic diagram of a process flow for generating candidate subspaces;
FIG. 5 is a graph of experimental results of parameter settings on the synthetic dataset; wherein (a) number of hash functions; (b) mapping width; (c) maximum reserved attribute number;
FIG. 6 is a graph of experimental results of quality assessment of algorithmic search results on a synthetic dataset;
FIG. 7 is a graph of experimental results of the impact of data size N on algorithm performance on the synthetic dataset; wherein (a) execution time; (b) number of computed subspaces; (c) I/O overhead;
FIG. 8 is a graph of experimental results of the impact of data dimension D on algorithm performance on the synthetic dataset; wherein (a) execution time; (b) number of computed subspaces; (c) I/O overhead;
FIG. 9 is a graph of experimental results of the impact of allocated memory capacity C on algorithm performance on the synthetic dataset; wherein (a) execution time; (b) number of computed subspaces; (c) I/O overhead; (d) FLA replacement times.
Detailed Description
The objects, technical solutions and advantages of the present invention will be further described with reference to the accompanying drawings so that those skilled in the art can more clearly understand the present invention. It should be noted that, in the case of no conflict, embodiments of the present invention and features of the embodiments may be combined with each other, and the described embodiments are only some embodiments of the present invention, not all embodiments. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
The concepts involved in the embodiments of the present invention are explained as follows:
related subspace search: given data set The task of the related subspace search is to find a set of related subspaces in subspace outlier detectionWhereinRepresenting a full-dimensional data space.
Subspace correlation: for the followingQuantizing the subspace correlation based on the probability density function, lettingThe number of iterations is indicated and,Representation ofWith respect toIs used for the edge probability of (1),Representation ofWith respect toConditional probability of (2), wherein. ThenSubspace correlation of (2)The definition is as follows:,(HiCS:KELLER K, MULLER E, and BOHM K. HiCS: High contrast subspaces for density-based outlier ranking[C]).
Related subspace: for the following If subspaceIs related to (a)Then call itIs thatIs used for the related subspace of the frame.
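The HiCS-style contrast just defined can be sketched in code. The following is a minimal illustration, not the patent's implementation: it estimates the deviation between one dimension's marginal distribution and its distribution conditioned on a random slice of the other subspace dimensions, averaged over M iterations, using a hand-rolled two-sample KS statistic. Function names and the slice-selection details are assumptions.

```python
import numpy as np

def ks_stat(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between ECDFs."""
    a, b = np.sort(a), np.sort(b)
    vals = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, vals, side="right") / len(a)
    cdf_b = np.searchsorted(b, vals, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

def subspace_contrast(data, subspace, m=20, alpha=0.1, seed=0):
    """HiCS-style subspace contrast (illustrative sketch).

    In each of m iterations one dimension of `subspace` is the test
    target; every remaining dimension is restricted to a random
    contiguous slice of its sorted values, chosen so the expected
    conditional sample keeps roughly alpha * n objects. The contrast
    is the average KS deviation between marginal and conditional
    samples of the target dimension.
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    # per-dimension slice width so the joint slice keeps ~alpha of the data
    frac = alpha ** (1.0 / (len(subspace) - 1))
    size = max(2, int(n * frac))
    total = 0.0
    for _ in range(m):
        dims = [int(d) for d in rng.permutation(list(subspace))]
        target, conds = dims[0], dims[1:]
        mask = np.ones(n, dtype=bool)
        for c in conds:
            lo = int(rng.integers(0, n - size + 1))
            idx = np.argsort(data[:, c])[lo:lo + size]
            sel = np.zeros(n, dtype=bool)
            sel[idx] = True
            mask &= sel
        if mask.sum() > 1:
            total += ks_stat(data[:, target], data[mask, target])
    return total / m
```

On strongly correlated dimensions the conditional sample concentrates and the contrast approaches 1; on independent uniform data it stays near 0.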
The invention provides a relevant subspace search method for outlier detection in massive data based on locality-sensitive hashing, namely the RSLSH algorithm. As shown in FIG. 1, the RSLSH algorithm consists of step 1: sequentially read the hash fragments and compute the relevant-attribute result set; step 2: generate candidate subspaces by depth-first-search backtracking and judge relevant subspaces; and step 3: redundant subspace deletion. In addition, as can be seen from FIG. 1, running RSLSH requires that the ordered list set and the hash fragment set be constructed first in a preprocessing stage.
1. Preprocessing: read the original data and construct the ordered list set AL and the hash fragment set LS.
The ordered list set AL is a data structure obtained by sorting and renumbering the dataset separately by each attribute's values. It can be used to compute the correlation of any candidate subspace and, once constructed, applies to every attribute-combination scenario. For a D-dimensional dataset DB containing N data objects, the ordered list set can be regarded as D ordered lists, one per dimension, each consisting of N index pairs: the first component is the data object's position index in DB, and the second is the ascending-order number of the corresponding projection value, i.e., the rank of that object's value in the given dimension. FIG. 2 illustrates building the ordered list set AL for a dataset DB containing 16 data objects with 2-dimensional attributes: first, the actual ordered list for each attribute dimension is obtained from DB, each consisting of index pairs (position index, projection value) arranged in ascending order of the projection value; the ordered list is then obtained by renumbering the projection-value column in that sequential order.
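As a concrete illustration of the AL structure, the sketch below (function and variable names are our own, not the patent's) builds one ordered list per dimension as (position index, ascending rank) pairs:

```python
import numpy as np

def build_ordered_lists(db):
    """Build one ordered list per attribute dimension.

    For each dimension, data-object indices are sorted by their
    projection value on that dimension; the ascending rank replaces
    the raw value, so later subspace-correlation computations need
    no re-sorting.
    """
    n, dims = db.shape
    al = []
    for d in range(dims):
        order = np.argsort(db[:, d], kind="stable")  # indices sorted by value
        # index pair: (position index in DB, ascending rank number)
        al.append([(int(idx), rank) for rank, idx in enumerate(order)])
    return al

# small 3-object, 2-dimensional example
db = np.array([[3.0, 1.0],
               [1.0, 2.0],
               [2.0, 0.5]])
al = build_ordered_lists(db)
# al[0] lists object indices in ascending order of dimension 0
```

Built once in preprocessing, `al` serves every attribute combination later, which is exactly why the patent precomputes it.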
The hash fragment set LS is a structure that groups data objects that are close to each other in the dataset. Each hash fragment can be regarded as a cluster of the original data space whose size satisfies the memory limit, so the fragments can be read into memory one by one for judging the relevant attributes of subspaces; all hash fragments together form the hash fragment set. Hash fragment preprocessing mainly comprises three stages: primary hashing, secondary hashing, and merging.
In the primary hash stage, a family of p-stable combined hash functions is first determined from the specified parameters and used for hash mapping; each data object's hash bucket number is then computed. One hashing of a data object produces a combination vector whose length equals the number of hash functions, and the corresponding bucket number is obtained by an operation combining the elements of that vector. All data objects with the same hash bucket number are stored in the same file, constituting a primary hash fragment.
In the secondary hash stage, the bucket numbers obtained in the primary hash stage are re-hashed and projected to integer values in a one-dimensional space, and adjacent bucket numbers are hashed to adjacent positions by reordering the projection results.
In the merging stage, the average number of data objects stored per merged bucket is determined from the memory capacity. On the premise that the merged total does not exceed this bound, adjacent hash bucket numbers are traversed in turn and the data objects of the corresponding hash fragments are stored in the same file, constituting a secondary hash fragment.
FIG. 3 illustrates the hash-partitioning flow using the dataset DB of FIG. 2 as an example. The data objects of DB are hashed once to obtain four hash bucket numbers, corresponding to four primary hash fragment files, where the data objects within each fragment are close to each other. In the secondary hash stage, each bucket number records the number of data objects in the bucket and its secondary hash result; for example, one hash bucket contains 7 data objects and its secondary hash value is 2. The four bucket numbers are then sorted in ascending order of their secondary hash values.
In the merging stage, the average bound is first computed; adjacent buckets are then judged in turn, and buckets whose combined size does not exceed the bound have their primary hash files merged, yielding three secondary hash fragment files.
For secondary hash fragments that are still larger than the memory capacity after the merging operation, the invention enlarges the length of the combined hash vector and continues the recursive partitioning strategy until all fragments satisfy the memory constraint, obtaining the preprocessed hash fragment set.
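The three-stage partitioning can be sketched as follows. This is a simplified, assumed reading of the scheme: p-stable (Gaussian) projections form the primary bucket key, sorting the keys plays the role of the secondary hash, and adjacent buckets are merged greedily under a capacity bound; the recursive re-splitting of oversized fragments is omitted.

```python
import numpy as np
from collections import defaultdict

def lsh_partition(db, k=3, w=5.0, capacity=100, seed=0):
    """Partition a dataset into memory-sized fragments with p-stable LSH.

    Primary hash: k p-stable projections h_i(o) = floor((a_i . o + b_i) / w)
    give each object a length-k vector; the vector (as a tuple) is the
    bucket key. Secondary hash / merge: bucket keys are ordered so nearby
    buckets are adjacent, then adjacent buckets are merged greedily while
    the merged size stays within `capacity`.
    """
    rng = np.random.default_rng(seed)
    n, d = db.shape
    a = rng.standard_normal((k, d))   # Gaussian -> 2-stable projections
    b = rng.uniform(0, w, size=k)
    keys = np.floor((db @ a.T + b) / w).astype(int)
    buckets = defaultdict(list)
    for i, key in enumerate(map(tuple, keys)):
        buckets[key].append(i)        # primary hash fragments
    fragments, current = [], []
    for key in sorted(buckets):       # secondary ordering of bucket keys
        if current and len(current) + len(buckets[key]) > capacity:
            fragments.append(current)
            current = []
        current += buckets[key]       # merge adjacent buckets
    if current:
        fragments.append(current)
    return fragments
```

A single primary bucket larger than `capacity` would still overflow here; in the patent's scheme that case triggers the recursive re-partition with a longer combined hash vector.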
2. Read the hash fragments sequentially and compute the relevant-attribute result set.
Read all data points of each hash fragment in turn and judge whether the current attribute dimension is a relevant attribute; if so, add the dimension to that hash fragment's relevant-attribute result, until all attributes of all hash fragments have been judged. By the nature of relevant attributes, a data space is unevenly distributed on a relevant attribute, and the attribute variance is small. Whether a single attribute satisfies the uniform-distribution condition is therefore determined by comparing the single-attribute-dimension variance with the expected variance of the data space: if the single-attribute-dimension variance is smaller than the expected variance, i.e., the lower bound of the variance threshold, the attribute dimension is considered a relevant attribute.
The invention reads the hash fragments obtained in preprocessing into memory one by one for computation. For the currently read hash fragment, the mean vector of its data objects is computed, and the variance of the Euclidean distances between each data object and the mean vector serves as the overall variance of the fragment. For each attribute dimension, the per-dimension variance is computed from each data object's projection on that dimension and the mean of those projections. Given a predefined variance threshold coefficient and a maximum reserved attribute number, an adaptive variance threshold is derived; for any attribute dimension, if its per-dimension variance is below the adaptive threshold the attribute is relevant, and otherwise it is irrelevant.
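The adaptive judgment can be sketched as below. The exact threshold formula involves symbols that do not survive in this text, so the form used here, a coefficient `lam` times the mean per-dimension variance with at most `m_max` lowest-variance dimensions kept, is an assumption:

```python
import numpy as np

def relevant_attributes(frag, lam=0.8, m_max=6):
    """Adaptive relevant-attribute judgment for one hash fragment.

    A dimension whose variance is markedly below what a roughly uniform
    distribution over the fragment would produce is considered relevant.
    The threshold form (lam * mean per-dimension variance) and the cap of
    m_max kept dimensions are assumptions standing in for the garbled
    formula in the source.
    """
    var = frag.var(axis=0)            # per-dimension variance
    threshold = lam * var.mean()      # assumed adaptive threshold
    candidates = np.where(var < threshold)[0]
    # keep at most m_max dimensions, lowest variance first
    return [int(j) for j in sorted(candidates, key=lambda j: var[j])[:m_max]]
```

Because each fragment is judged independently, a dimension only enters the duplicate-free result set if at least one fragment finds it relevant, which is what prunes the candidate subspaces.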
3. Generate candidate subspaces by depth-first-search backtracking and judge relevant subspaces.
Verify the relevant-attribute result of each hash fragment independently in turn; while generating all candidate subspaces, judge the relevant subspaces using the precomputed ordered list set and the FLA structure based on the most-frequent replacement strategy.
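The generation step can be sketched with a standard depth-first backtracking over one fragment's relevant attributes (names here are illustrative, and the per-subspace correlation check is elided):

```python
def candidate_subspaces(relevant_attrs):
    """Generate all candidate subspaces (attribute combinations of size >= 2)
    from one fragment's relevant-attribute set via depth-first backtracking.

    One-dimensional subspaces carry no correlation, so generation starts
    from two-dimensional combinations, as in the patent's description.
    """
    attrs = sorted(relevant_attrs)
    results, stack = [], []

    def backtrack(start):
        if len(stack) >= 2:
            results.append(tuple(stack))   # a candidate subspace
        for i in range(start, len(attrs)):
            stack.append(attrs[i])         # extend the current subspace
            backtrack(i + 1)
            stack.pop()                    # backtrack
    backtrack(0)
    return results
```

In the full algorithm each emitted tuple would immediately be scored for correlation (reading dimension data through the FLA structure) rather than collected in a list.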
When generating candidate subspaces, the invention independently verifies each hash fragment's relevant-attribute set and generates all relevant attribute combinations in turn. Since one-dimensional subspaces have no correlation, generation starts from two-dimensional subspaces; candidate subspaces are produced by a depth-first backtracking strategy implemented with a maintained stack, and each subspace's correlation is computed to check whether it satisfies the condition. During execution, computing a subspace's correlation only requires reading the data of the specified dimensions, so the FLA structure is updated before each correlation computation. The FLA has an actual storage length and a set of attribute dimensions it currently contains. It is first judged whether the FLA has reached its storage upper limit; if not, the attribute dimensions of the current subspace are traversed in turn, and the ordered list of each dimension not yet present is stored into the FLA. Otherwise, the current FLA has reached its upper limit, and whether to perform data replacement is decided from the relation between the current subspace and the cached dimensions: (1) if the cached dimensions already cover the subspace, no replacement is needed and the current content is optimal; (2) if there is no containment relation, data replacement is needed. Consider the set of dimensions cached in the FLA but absent from the subspace, and the set of dimensions in the subspace but absent from the FLA. If the latter fits within the former, the missing dimensions' data, taken in natural order (attribute dimensions arranged from small to large), replace cached dimensions so that all of the subspace's dimensions become present. Otherwise, following the most-frequently-used (MFU) replacement strategy, the cached dimensions with the highest historical use frequency are replaced so that as many of the subspace's dimensions as possible become present.
When conditional samples are computed during subspace correlation, before each piece of dimension data is fetched it is judged whether that attribute dimension exists in the FLA structure: if so, the data is read directly from the FLA and the dimension's access frequency is incremented; otherwise, the required data is read from external memory. Finally, the subspace's correlation is checked with the KS test, whose data-fetching procedure is identical to that of computing the conditional samples.
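The FLA bookkeeping can be sketched as a small cache class. This is an assumed simplification: `load_fn` stands in for reading one dimension's ordered list from external memory, and eviction follows the MFU rule described above (among cached dimensions not needed by the current subspace, the most frequently used is replaced first).

```python
class FLACache:
    """Fixed-size in-memory cache of per-dimension ordered lists (FLA sketch)."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn      # reads one dimension's ordered list
        self.data = {}              # dim -> ordered list
        self.freq = {}              # dim -> historical access count
        self.io_reads = 0           # external-memory reads, for accounting

    def prepare(self, subspace):
        """Update the cache before computing one subspace's correlation."""
        for dim in subspace:
            if dim in self.data:
                continue
            if len(self.data) >= self.capacity:
                # MFU eviction among dims the current subspace does not use
                evictable = [d for d in self.data if d not in subspace]
                if not evictable:
                    continue        # cache fully claimed by this subspace
                victim = max(evictable, key=lambda d: self.freq.get(d, 0))
                del self.data[victim]
            self.data[dim] = self.load_fn(dim)
            self.io_reads += 1

    def get(self, dim):
        """Read one dimension: from cache if present, else external memory."""
        if dim in self.data:
            self.freq[dim] = self.freq.get(dim, 0) + 1
            return self.data[dim]
        self.io_reads += 1
        return self.load_fn(dim)
```

The `io_reads` counter mirrors the I/O-overhead accounting used in the experiments: hits avoid an external read, and every miss or load costs one.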
FIG. 4 shows the flow of FLA structure updating based on the MFU policy. Assume the FLA structure can hold at most three dimensions' data. Step (5) of FIG. 4 corresponds to switching from the subtree rooted at one node to the subtree rooted at the next, at which point one cached dimension is temporarily unused. Before (5) is executed, the FLA holds the dimensions used for the previous subspace; to compute the next subspace, the unused dimensions form the replacement candidate list, and according to the MFU policy the ordered list of the candidate with the highest access frequency is replaced. The same applies in step (8).
4. Redundant subspace deletion.
The results of step 3 are pruned for redundancy using the pruning strategy, and the retained subspaces form the final result set. The pruning strategy is specifically: given two subspaces where one is contained in the other, the redundant subspace is deleted when its correlation does not exceed that of the other; the comparison involves the two subspaces' dimensionalities and correlations.
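A sketch of the pruning step follows, under the assumed reading (the exact inequality is garbled in the source) that a superset whose correlation does not exceed that of one of its subsets is redundant:

```python
def prune_redundant(subspaces):
    """Redundancy pruning sketch.

    Assumed reading of the garbled rule: if S1 is a proper subset of S2
    and S1's correlation is at least S2's, the larger subspace S2 adds
    nothing and is deleted. `subspaces` maps attribute frozensets to
    correlation scores.
    """
    redundant = set()
    for s1, c1 in subspaces.items():
        for s2, c2 in subspaces.items():
            if s1 < s2 and c1 >= c2:   # proper subset with no correlation gain
                redundant.add(s2)
    return {s: c for s, c in subspaces.items() if s not in redundant}
```

The quadratic scan is fine for the final, already-filtered result set; a production version could bucket subspaces by dimensionality first.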
5. Algorithm performance evaluation
In order to verify the performance advantages of the algorithm provided by the invention, a plurality of groups of experiments are performed on the synthetic data set and the real data set.
Experimental equipment: DELL OptiPlex 7000 Tower (Intel(R) Core(TM) i7-12700 CPU @ 2.10GHz (12 CPUs) + 32GB memory + 64-bit Win11 operating system + JDK 1.8 + IntelliJ IDEA 2022.3.3)
Synthetic dataset: the complete data space is randomly partitioned into subspaces of 2-5 dimensions; high-density clusters are generated in each subspace, and a certain proportion of deviating values is added to each subspace as outliers. The number of deviating values per subspace is computed from the ratio of the number of outliers to the number of subspaces in the complete data space, and the outlier proportion of the synthetic datasets is uniformly set to 1%.
Table 1 Real dataset information
Dataset name    Tuples    Attributes    Outliers
Arrhythmia      420       129           18
Mammography     11183     6             260
Adult           38323     6             1168
KDDCup99        48113     40            200
ForestCover     286048    10            2747
Hepmass         5775111   28            524987
True dataset: the experimental use of real dataset information is shown in Table 1, derived from ODDS database, kaggle database, UCI database, and datasets provided by existing studies (https:// www.dbs.ifi.lmu.de/research/outlier-evaluation/DAMI /).
Algorithm quality assessment: the experimental time limit of all algorithms is set to 24 hours, the algorithm only obtains 100 subspaces with highest correlation degree from the results of all subspace searching methods, and LOF algorithm is adopted as an outlier detection model (setting) The accuracy of outlier detection is obtained by calculating the area under the ROC curve (receiver operating characteristic curve, receiver operating characteristic curve, abbreviated ROC curve) to evaluate the quality of subspace obtained by subspace search algorithm.
The experiment first performed algorithm parameter settings and then performance assessment was performed mainly from the following aspects: search result quality, data size N, data dimension D, memory capacity C, real dataset effect. Specific setting information of the above parameters is shown in table 2.
Table 2 Experimental parameter settings
Parameter                                  Values     Default value
Allocated memory capacity (C)              4G-24G     24G
Data size N (synthetic)                               5000
Data dimension D                                      20
Number of hash functions                              9
Mapping width                                         5
Maximum reserved attribute number                     6
Number of iterations                       -          20
Conditional sample selection proportion    -          0.1
Variance coefficient                       -          0.8
Four algorithms were tested in total: the HiCS algorithm and the GMD algorithm are existing relevant subspace search algorithms and serve as the comparison group; the BA algorithm is a baseline for relevant subspace search on massive data, and RSLSH is the proposed algorithm. The references for the existing algorithms are: HiCS: Keller F, Müller E, Böhm K. HiCS: High contrast subspaces for density-based outlier ranking[C]//Proceedings of the 28th International Conference on Data Engineering, Arlington, VA, USA, Apr 01-05, 2012. Washington: IEEE, 2012: 1037-1048; GMD: Trittenbach H, Böhm K. Dimension-based subspace search for outlier detection[J]. International Journal of Data Science and Analytics, 2019, 7: 87-101.
Multiple groups of experiments were carried out on the synthetic and real datasets: parameter setting experiments on the synthetic dataset (FIG. 5), quality evaluation of algorithm search results on the synthetic dataset (FIG. 6), the impact of data size N (FIG. 7), the impact of data dimension D (FIG. 8), the impact of allocated memory capacity C (FIG. 9), and the real dataset results (Table 3).
The results of the parameter set-up experiments on the synthetic dataset are shown in figure 5. Given a givenThe experiments were each quality evaluated on multiple synthetic datasets of 5 different dimensions with the average as the experimental result. The change in average AUC and run time of the RSLSH algorithm with increasing value is shown in fig. 5 (a). When (when)The detection result of the algorithm has better robustness when in use, but followsAn increase in value also creates a larger search space resulting in algorithm inefficiency, thus setting upAs a default value. Given a givenAlong withValue variation RSLSH the variation of the average AUC and run time of the algorithm is shown in FIG. 5 (b), according to the experimental results, the present invention setsAs a default value. Given a givenAlong withThe change in average AUC and run time of the value change algorithm is shown in fig. 5 (c). Experimental results show that the detection quality of RSLSH algorithm followsGradually increasing and then decreasing and finally stabilizing, in the followingPeak is reached. Thus, an arrangement is made ofAs a default value.
The quality of the algorithms' search results on the synthetic data sets is evaluated, with the experimental results shown in fig. 6, which plots the average AUC values of all compared algorithms. The subspace quality of the RSLSH algorithm is superior to the other three methods; in particular, it still maintains high-quality detection results as the data dimension increases, showing that the RSLSH algorithm guarantees the quality of the searched subspaces while handling the mass-data problem.
The effect of data size N on algorithm performance on the synthetic data sets is shown in fig. 7. As fig. 7(a) shows, beyond a certain data size only the RSLSH algorithm remains usable: the HiCS algorithm and the BA algorithm exceed the time limit at successively larger values of N, while the GMD algorithm exceeds the memory limit. The RSLSH algorithm exhibits better performance than the other algorithms. For the GMD algorithm, the number of subspaces whose deviation level is computed is slightly lower than for the RSLSH algorithm, as shown in fig. 7(b), but because it must construct an index list and sort the computed deviation levels for each single attribute column, it is less efficient than the RSLSH algorithm, and the larger N becomes, the more obvious the difference between the two algorithms. The BA algorithm and the HiCS algorithm show similar execution-time trends, and the RSLSH algorithm holds a one-order-of-magnitude performance advantage over both, because their lack of an effective pruning strategy leads to a large number of computed subspaces. In addition, the I/O overhead of the RSLSH algorithm of the invention comprises two parts: the cost of reading LS in step 1 and the cost of computing subspace correlations in step 2. As the data size grows, the former part scales with the data-set size, while for the latter part, once the underlying data structure becomes larger than the allocated memory, updating the FLA with the MFU policy causes a sharp increase in I/O overhead, as illustrated in fig. 7(c). By adding only a very small time overhead and a little extra I/O overhead, the RSLSH algorithm significantly reduces the computational overhead and improves the overall performance of the algorithm.
The effect of data dimension D on algorithm performance on the synthetic data sets is shown in fig. 8. With the data size fixed, as the data dimension increases, the primary factor affecting the execution efficiency of the RSLSH algorithm is the number of candidate subspaces computed. This is because the relevant-attribute result set computed in step 1 of the algorithm grows as the dimension increases from 10 to 30, so the number of candidate subspaces to be judged in step 2 increases linearly, as shown in fig. 8(b). The I/O overhead of the RSLSH algorithm is also affected by the data dimension, as shown in fig. 8(c): if the dimensions required by the current computation exceed the FLA size, updating the FLA structure incurs additional overhead, so the I/O overhead of the subspace computation increases significantly. Consequently, the execution time of the RSLSH algorithm increases linearly, as shown in fig. 8(a).
The effect of the allocated memory capacity C on algorithm performance on the synthetic data sets is studied, with the experimental results shown in fig. 9.
To demonstrate the effect of the FLA structure, a variant of the RSLSH algorithm that omits the FLA-structure optimization in step 2 but is otherwise identical (the no-FLA variant below) is also tested. A change in the allocated memory capacity mainly affects the length of the FLA structure and hence the number of data replacements when the FLA is updated, as shown in fig. 9(d): with a small capacity, the FLA structure must perform many data-replacement operations, increasing the I/O overhead during execution. As the allocated memory grows, the number of FLA data replacements drops and the I/O overhead of step 2 levels off, while the I/O overhead of step 1 and the number of subspaces to be computed remain unchanged, which explains the I/O-overhead trend in fig. 9(c). Thus, as the allocated memory capacity increases from 4GB to 24GB, the utilization of the FLA structure rises and the I/O overhead falls, so the time overhead of the RSLSH algorithm decreases, as shown in fig. 9(a). The no-FLA variant, which does not use the FLA structure to optimize the subspace-correlation computation, incurs an I/O overhead two orders of magnitude higher than the RSLSH algorithm when the memory capacity is 8GB-20GB; the smaller the allocated capacity, the smaller the FLA, and the difference also shrinks, as shown in fig. 9(b). When the capacity exceeds 20GB, the algorithm's underlying data structure fits within the allocated memory, so the running times of the two algorithms are close; when the memory capacity is below 20GB, the execution time of the RSLSH algorithm is better than that of the no-FLA variant, which demonstrates the effectiveness of the FLA structure in reducing the algorithm's I/O overhead.
The results of the experiments on the real data sets are shown in table 3. The experimental results show that the RSLSH algorithm produces high-quality results on six real data sets and, while its execution time is clearly better than that of the other three algorithms, still guarantees the accuracy of the subspaces obtained for outlier detection.
In summary, the RSLSH algorithm of the invention has a significant performance advantage over the other three algorithms. When the data scale grows, the other algorithms cannot process the mass data within a reasonable time limit; even on smaller data sets, the RSLSH algorithm always holds a one-order-of-magnitude advantage in execution time over the HiCS and BA algorithms, and this advantage becomes increasingly obvious as N grows.
TABLE 3 results of real dataset experiments

Claims (7)

1. A related-subspace searching method in mass data outlier detection, characterized by mainly comprising the following steps:
Step 1, preprocessing the original data to construct an ordered list set and a hash slice set; sequentially using all hash slices obtained by preprocessing to perform adaptive relevant-attribute judgment, and retaining a duplicate-free relevant-attribute result set;
Step 2, generating all candidate subspaces according to the result of step 1, and judging the relevant subspaces by using the precomputed ordered list set and an FLA structure based on the most-frequent replacement strategy; the FLA structure is a fixed-size array structure maintained in memory, whose size is determined by the allocated memory capacity and the size of the data set, and in which each element stores the ordered list of one attribute dimension together with its historical usage frequency; adaptive dynamic data replacement is carried out as the subspaces under computation change, and each time a subspace correlation is computed and data of a specified dimension must be read, the storage location of the data is first judged and the data is then read as required to complete the corresponding computation;
Step 3, performing redundancy deletion on the result of step 2 by using a pruning strategy, and returning all relevant subspaces.
2. The method for searching related subspaces in mass data outlier detection according to claim 1, wherein in step 1 the method for preprocessing the original data to construct the hash slice set comprises:
primary hashing: a p-stable combined hash function for hash mapping is determined according to the specified parameters and is then used to calculate the hash bucket number to which each data object belongs; one hashing of a data object produces a combination vector of fixed length, and the corresponding hash bucket number is obtained by an AND operation combining the elements within the vector; all data objects with the same hash bucket number are stored in the same file, which constitutes a primary hash slice, the number of primary hash slices being determined by the hash parameters; DB is the D-dimensional data set containing N data objects;
secondary hashing: the hash bucket numbers obtained in the primary hashing stage are hashed again and projected to integer values in a one-dimensional space, and by reordering the projection results, adjacent bucket numbers are hashed to adjacent positions;
merging operation: the average number of data objects stored per merged bucket is first determined; on the premise that the merged total does not exceed this bound, adjacent hash bucket numbers are traversed in turn and the data objects in the corresponding hash slices are stored in the same file, which constitutes a secondary hash slice, the number of secondary hash slices being determined accordingly.
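As a hedged illustration of the primary-hash stage in claim 2, the sketch below builds a p-stable (Gaussian, hence 2-stable) combined hash and uses the full combination vector as the bucket identifier. The function names and parameter values (k = 3 component hashes, bucket width w = 2.0, random seed) are assumptions for illustration, not the patent's actual settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_combined_hash(dim, k, w):
    """k p-stable hash functions h_i(v) = floor((a_i . v + b_i) / w);
    Gaussian projections form a 2-stable family, so nearby objects
    tend to agree on each component hash."""
    A = rng.normal(size=(k, dim))        # one Gaussian vector per component hash
    b = rng.uniform(0.0, w, size=k)      # random offsets in [0, w)
    def g(v):
        return tuple(np.floor((A @ np.asarray(v, dtype=float) + b) / w).astype(int))
    return g

def bucket_number(combo_vector):
    # AND construction: two objects share a bucket only when ALL k
    # component hashes agree, so the whole tuple identifies the bucket.
    return combo_vector

g = make_combined_hash(dim=4, k=3, w=2.0)
```

Objects mapped to the same tuple would then be written to the same file, forming one primary hash slice.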
3. The method for searching related subspaces in mass data outlier detection according to claim 2, wherein a secondary hash slice that is still larger than the memory capacity after the merging operation is completed is further divided by enlarging the combined hash vector, and this recursive division strategy is continued until all slices meet the memory constraint, yielding the final hash slice set, whose cardinality is the number of final hash slices.
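The merging step of claims 2-3 can be sketched as a greedy pass over the buckets in secondary-hash order, packing adjacent buckets into slices whose totals stay within a per-slice budget (standing in for the average-capacity bound); an oversized slice would afterwards be re-split recursively by enlarging the combined hash vector. This is a minimal sketch under those assumptions, not the patent's exact procedure.

```python
def merge_adjacent_buckets(bucket_sizes, budget):
    """bucket_sizes: [(bucket_id, object_count), ...] sorted by the
    secondary-hash order. Returns one list of bucket_ids per merged slice."""
    slices, current, total = [], [], 0
    for bid, cnt in bucket_sizes:
        if current and total + cnt > budget:
            slices.append(current)      # close the slice before it overflows
            current, total = [], 0
        current.append(bid)
        total += cnt
    if current:
        slices.append(current)
    return slices
```

Because buckets are visited in the reordered (adjacent) sequence, each merged slice groups neighbouring buckets, which is what lets the later correlation checks read one file per slice.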
4. The method for searching related subspaces in mass data outlier detection according to claim 1, wherein in step 1, determining whether the current attribute dimension is a relevant attribute is specifically: determining whether the single attribute satisfies the uniform-distribution condition by comparing the variance of the single attribute dimension with the expected variance in the data space; if the variance of the single attribute dimension is smaller than the expected variance, i.e., the lower bound of the variance threshold, the attribute dimension is considered a relevant attribute.
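One concrete reading of the claim-4 test: a uniform distribution on [lo, hi] has variance (hi - lo)^2 / 12, so an attribute whose sample variance falls below that bound is more concentrated than uniform and is flagged as relevant. The `strictness` factor is an illustrative assumption, not a parameter named in the patent.

```python
import statistics

def is_relevant_attribute(values, strictness=1.0):
    """Flag an attribute as relevant when its variance is below the
    expected variance of a uniform distribution over the same range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return True                      # constant column: maximally concentrated
    expected = (hi - lo) ** 2 / 12       # variance of Uniform(lo, hi)
    observed = statistics.pvariance(values)
    return observed < strictness * expected

clustered = [0.0, 5.0, 5.1, 4.9, 5.0, 5.2, 10.0]   # mass concentrated near 5
spread = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]           # roughly uniform over the range
```

A regularly spread column exceeds the uniform bound (sampling at the endpoints inflates the variance), while a clustered column falls below it.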
5. The method for searching related subspaces in mass data outlier detection according to claim 1, wherein in step 2, the candidate-subspace generation is specifically: the relevant-attribute results of each hash slice are verified independently, and all relevant-attribute combinations are generated in turn by backtracking of a depth-first search; all relevant-attribute combinations form the candidate subspace set, which is used to judge whether each candidate is a relevant subspace.
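The depth-first backtracking of claim 5 can be sketched as follows: every combination of at least two relevant attributes is emitted as a candidate subspace. The minimum size of two is an assumption (a single attribute is already handled by the claim-4 test), as are the function names.

```python
def candidate_subspaces(relevant_attrs):
    """Enumerate all attribute combinations of size >= 2 from one slice's
    relevant-attribute result set via DFS backtracking."""
    attrs = sorted(relevant_attrs)
    out = []

    def dfs(start, chosen):
        if len(chosen) >= 2:
            out.append(tuple(chosen))   # every partial path of size >= 2 is a candidate
        for i in range(start, len(attrs)):
            chosen.append(attrs[i])     # choose
            dfs(i + 1, chosen)          # explore deeper combinations
            chosen.pop()                # backtrack

    dfs(0, [])
    return out
```

For d relevant attributes this enumerates 2^d - d - 1 candidates, which is why the pruning of claim 7 and the relevance pre-filter of claim 4 matter for keeping the search tractable.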
6. The method for searching related subspaces in mass data outlier detection according to claim 1, wherein in step 2, the most-frequent replacement strategy is specifically: during the FLA structure update, attribute dimensions with a high access frequency are replaced preferentially.
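A minimal sketch of the FLA structure of claims 1 and 6: a fixed-size in-memory array that caches per-dimension ordered lists with their historical use counts, and evicts the most frequently used entry on overflow, per the claim-6 wording. The class and method names and the loader callback are illustrative assumptions.

```python
class FLA:
    """Fixed-length array: caches ordered lists per attribute dimension,
    evicting the entry with the highest historical use count (MFU)."""

    def __init__(self, capacity, loader):
        self.capacity = capacity        # slot count, set by the memory budget
        self.loader = loader            # reads one dimension's ordered list from disk
        self.entries = {}               # dim -> (ordered_list, use_count)

    def get(self, dim):
        if dim in self.entries:         # hit: bump the historical use frequency
            lst, cnt = self.entries[dim]
            self.entries[dim] = (lst, cnt + 1)
            return lst
        if len(self.entries) >= self.capacity:
            # MFU policy: replace the dimension with the highest use count
            victim = max(self.entries, key=lambda d: self.entries[d][1])
            del self.entries[victim]
        lst = self.loader(dim)          # extra I/O is paid only on a miss
        self.entries[dim] = (lst, 1)
        return lst
```

Each subspace-correlation computation calls `get` for its dimensions; only misses touch disk, which is the I/O saving attributed to the FLA in the experiments above.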
7. The method for searching related subspaces in mass data outlier detection according to claim 1, wherein in step 3, the pruning strategy is specifically: let there be two subspaces such that one is contained in the other; if the correlation of the contained subspace is not lower than that of the containing subspace, the redundant containing subspace is deleted; here the respective symbols denote the attribute-dimension sets of the two subspaces and their correlations.
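One plausible reading of the claim-7 pruning rule, sketched under the assumption that a subspace is redundant when a proper subset of it already achieves a correlation at least as high; the data representation (frozenset of dimensions mapped to a correlation score) is illustrative.

```python
def prune_redundant(subspaces):
    """subspaces: dict mapping frozenset of attribute dimensions -> correlation.
    Drops every subspace dominated by one of its proper subsets."""
    redundant = set()
    for s1, c1 in subspaces.items():
        for s2, c2 in subspaces.items():
            if s1 < s2 and c1 >= c2:    # s1 strictly contained, no worse score
                redundant.add(s2)       # the larger subspace adds nothing
    return {s: c for s, c in subspaces.items() if s not in redundant}
```

The quadratic scan is fine for the pruned candidate sets returned by step 2; a production version might index subspaces by size to avoid comparing unrelated pairs.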
CN202410774444.4A 2024-06-17 2024-06-17 Related subspace searching method in mass data outlier detection Active CN118333147B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410774444.4A CN118333147B (en) 2024-06-17 2024-06-17 Related subspace searching method in mass data outlier detection

Publications (2)

Publication Number Publication Date
CN118333147A CN118333147A (en) 2024-07-12
CN118333147B 2024-08-13

Family

ID=91764615

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410774444.4A Active CN118333147B (en) 2024-06-17 2024-06-17 Related subspace searching method in mass data outlier detection

Country Status (1)

Country Link
CN (1) CN118333147B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107944502A (en) * 2017-12-06 2018-04-20 南京航空航天大学 A kind of Outlier Detection Algorithm based on random Harsh
CN108776675A (en) * 2018-05-24 2018-11-09 西安电子科技大学 LOF outlier detection methods based on k-d tree

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104035949B (en) * 2013-12-10 2017-05-10 南京信息工程大学 Similarity data retrieval method based on locality sensitive hashing (LASH) improved algorithm
US10778707B1 (en) * 2016-05-12 2020-09-15 Amazon Technologies, Inc. Outlier detection for streaming data using locality sensitive hashing
CN106599188A (en) * 2016-12-14 2017-04-26 大连交通大学 Smart store location method employing sub-space Skyline query under mobile internet and cloud computing environment
KR101935161B1 (en) * 2017-07-14 2019-01-03 배재대학교 산학협력단 Prediction system and method based on combination of sns and public opinion poll
CN107562865A (en) * 2017-08-30 2018-01-09 哈尔滨工业大学深圳研究生院 Multivariate time series association rule mining method based on Eclat
CN109783497A (en) * 2019-01-16 2019-05-21 福建师范大学 Large-scale data local sensitivity Hash Search method based on high entropy hyperplane cluster
CN110321353B (en) * 2019-07-08 2022-12-13 中国地质大学(武汉) Multi-dimensional spatial data indexing method based on semi-decomposition strategy
CN113360546A (en) * 2021-06-28 2021-09-07 福建师范大学 Approximate neighbor element retrieval method and system based on hypercube balanced division



Similar Documents

Publication Publication Date Title
CN106570873B (en) A kind of medical image cutting method
EP2945071B1 (en) Index generating device and method, and search device and search method
EP2887262B1 (en) Point Cloud Simplification
CN105117488B (en) A kind of distributed storage RDF data balanced division method based on hybrid hierarchy cluster
Song et al. Solutions for processing k nearest neighbor joins for massive data on mapreduce
CN108549696B (en) Time series data similarity query method based on memory calculation
Moon et al. Study of scalable declustering algorithms for parallel grid files
WO2017118335A1 (en) Mapping method and device
CN114281989B (en) Data deduplication method and device based on text similarity, storage medium and server
CN114491644A (en) Differential privacy data publishing method meeting personalized privacy budget allocation
CN114064979A (en) Method for accelerating acquisition of storage data of RAID (redundant array of independent disks), computer and storage medium
Ji et al. Local graph edge partitioning
CN108764307A (en) The density peaks clustering method of natural arest neighbors optimization
CN110097581B (en) Method for constructing K-D tree based on point cloud registration ICP algorithm
CN116091771A (en) Method, device and equipment for partitioning point cloud of cavity of complex casing
CN118333147B (en) Related subspace searching method in mass data outlier detection
Chen et al. DBSCAN-PSM: an improvement method of DBSCAN algorithm on Spark
Huang et al. A grid and density based fast spatial clustering algorithm
CN113779322B (en) Method, apparatus, device and computer readable storage medium for graph retrieval
Svynchuk et al. Modification of Query Processing Methods in Distributed Databases Using Fractal Trees.
Di Angelo et al. An efficient algorithm for the nearest neighbourhood search for point clouds
CN113435501B (en) Clustering-based metric space data partitioning and performance measuring method and related components
CN116303436A (en) Data storage method and system based on extensible multidimensional learning index structure
CN113254720A (en) Hash sorting construction method in storage based on novel memory
JP2009146215A (en) Data analyzer and data analyzing program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant