CN115617881B - Multi-sequence periodic frequent pattern mining method in uncertain transaction database - Google Patents
Multi-sequence periodic frequent pattern mining method in uncertain transaction database Download PDFInfo
- Publication number
- CN115617881B CN115617881B CN202211635955.5A CN202211635955A CN115617881B CN 115617881 B CN115617881 B CN 115617881B CN 202211635955 A CN202211635955 A CN 202211635955A CN 115617881 B CN115617881 B CN 115617881B
- Authority
- CN
- China
- Prior art keywords
- sequence
- item set
- database
- upfps
- item
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2474—Sequence data queries, e.g. querying versioned data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
- G06Q40/04—Trading; Exchange, e.g. stocks, commodities, derivatives or currency exchange
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Business, Economics & Management (AREA)
- Databases & Information Systems (AREA)
- Finance (AREA)
- Accounting & Taxation (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Probability & Statistics with Applications (AREA)
- Mathematical Physics (AREA)
- Fuzzy Systems (AREA)
- Development Economics (AREA)
- Economics (AREA)
- Marketing (AREA)
- Strategic Management (AREA)
- Technology Law (AREA)
- General Business, Economics & Management (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a method for mining a multi-sequence periodic frequent pattern in an uncertain transaction database, which specifically comprises the following steps: s1, inputting a database and four user-defined thresholds; s2, scanning the database to construct a UPFPS-list of the item set x 1, and judging whether the item set x is a periodic frequent item set or not; s3, pruning the search space according to the upper bound value, and adding the UPFPS-list meeting the conditions into the set; s4, intersecting and combining the trimmed 1 item sets into 2 item sets, constructing UPFPS-list of the 2 item sets, and judging whether the 2 item sets are UPFPS; and S5, recursively circulating the n-1 item set until the n item set cannot be expanded, and outputting all UPFPSs in the uncertain database. The technical scheme of the invention solves the problem that the prior art can not carry out periodic frequent pattern mining on a plurality of sequences in an uncertain database.
Description
Technical Field
The invention relates to the technical field of data mining, in particular to a multi-sequence periodic frequent pattern mining method in an uncertain transaction database.
Background
Data mining has become an important technological approach to the exploitation of excessive and complex data, and scientific research in this field of technology is also developing at an increasing rate. The association rule is also called frequent item set mining and is a hot direction in the field of data mining. The existing algorithms for frequent patterns are all used for mining patterns which are useful for us in an accurate database, and in the real-world situation, the collected data is more or less lost or inaccurate due to various reasons, so that the collected data contains uncertainty. In real life, more data with uncertainty is mined, and a deterministic database has certain limitations in practical application. Mining frequently periodic patterns in the multi-sequence uncertainty database, wherein the periodic and frequent patterns are both satisfied, and the common periodic and frequent patterns are mined in the multi-sequence simultaneously, and the occurrence probability of the patterns also satisfies a threshold value set by a user. Meanwhile, how to measure the ratio of the patterns mined in a plurality of sequences in an uncertain database to the total database sequence number is a complicated problem.
In recent years, frequent Pattern Mining (FPM) was a popular data mining problem, and was originally intended to analyze transactions in large chain supermarkets to find products frequently purchased by customers, so as to make better sales decisions. Frequent pattern mining is an early and fundamental research direction in data mining, and with the continuous development of FPM, FPM has been applied in many fields today. Different approaches have been developed to dig out meaningful patterns. At present, the frequent pattern mining based on the determined data is continuously explored, the mining period frequent pattern in the sequence also becomes an increasingly popular research direction, and the periodic and frequent occurrence of the research pattern along with the time has more practical application significance in real life. However, there are two limitations in the process of frequent pattern mining in the traditional mining cycle, the first limitation is that the item sets contained in things are all definite, while uncertain data in real life are ubiquitous, such as: sensor networks, military, economic, logistics, financial, biomedical, etc. The second limitation is that current algorithms that study periodic frequent patterns are usually directed to a single sequence, and it is more valuable to find a group of co-occurring periodic and frequent patterns in real life, such as: the method has more practical significance in website browsing click analysis, biomedical gene and market shopping basket research.
As the sequence pattern mining is more and more popular, the time sequence between transactions is ignored in the database in the traditional FPM algorithm, the mined patterns may be very frequent or the association between the patterns is very high, but the time sequence between the projects or the transactions is not reflected. One of the most popular research tasks at present is Sequence Pattern Mining (SPM), in real life, sequences with respect to time are more emphasized in many fields, and the sequence pattern mining task is to dig out frequent sub-sequences from a group of sequences, which is more widely applied in reality. The data mining algorithm can mine several types of modes, and a frequent item set is a basic step of many important data mining tasks such as association rules, correlation analysis, sequence modes, weighting modes and the like and is also content of basic research in data mining. The conventional SPM algorithms have three limitations, firstly, the item sets contained in the data are all definite, but the uncertain data in real life are ubiquitous, such as: uncertain databases play an important role in many fields such as sensor networks, military, economy, logistics, finance, biomedicine and the like. But the uncertainty of the data has great influence on data mining, so people pay more attention to mining in an uncertainty database. Second, SPM algorithms are used to find patterns that occur periodically in a sequence in a cyclic manner, but are an important research context in data mining. Thirdly, the SPM algorithm basically studies a single sequence, ignores the relevance among a plurality of sequences, and simultaneously mines and analyzes potential relations among a plurality of sequences, so that the SPM algorithm has great research value and significance.
Therefore, there is a need for a method of periodic frequent pattern mining for multiple sequences in an uncertain-transaction database that overcomes the above problems.
Disclosure of Invention
The invention mainly aims to provide a periodic frequent pattern mining method under multiple sequences in an uncertain transaction database, so as to solve the problem that the periodic frequent pattern mining of the multiple sequences in the uncertain database cannot be carried out in the prior art.
In order to achieve the aim, the invention provides a method for mining a multi-sequence periodic frequent pattern in an uncertain transaction database, which comprises the following steps:
s1, inputting an uncertain transaction database of a large number of customers within a period of time, and customizing four thresholds by a merchant, wherein the four thresholds are respectively a minimum support frequency minSup, a maximum periodicity maxPr, a periodicity standard deviation maxStd and a minimum expected support number minExpRa;
s2, scanning the database to construct a UPFPS-list of the item set x 1, namely constructing a data list UPFPS-list which is formed by the purchase probability of each item x and the transaction in which the certain item x appears in the purchase sequence of which users according to the time sequence, and judging whether the item set x is a periodic frequent item set UPFPS in the uncertain database, and the method specifically comprises the following steps:
s2.1, calculating the transaction number sup (x, S) of the commodity x appearing in the sequence S, calculating the maximum periodicity maxPeer (x, S) of the item set x, and the period standard deviation stanDev (x, S), then traversing each single item set x by the algorithm in a loop, and for the commodity x appearing in the purchase sequence S, if the purchase probability of the commodity x is greater than the minimum purchase frequency, namely sup (x, S) > = minSup, the time interval between the two times of purchase of the commodity x does not exceed the maximum period threshold, namely maxPeer (x, S) <= maxPr, and the purchase period of the commodity x is stable in a certain range, namely stanDev (x, S) < = maxStd, then 1 item set x is called to be frequent in the sequence S periodically, and the algorithm stores the sequences of the item set x satisfying the condition into a set prSeq (x);
s2.2, if the expected periodic sequence is more than or equal to minExpRa than expRa (x), outputting 1 item set x as a periodic frequent pattern item set UPFPS;
wherein the expected periodic sequence ratio of sequences for x in the database is defined as expRa (x) = expsu (x)/| D |, where the value of expsu (x) is calculated from the sequences in the set PrSeq (x), and | D | is the total number of sequences in the database of sequences;
s3, pruning the search space according to an upper bound value upExpRa, adding UPFPS-list of 1 item set x meeting the condition upExpRa (x) > = minExpRa to a set bound UPFPS, and not expanding any more if the condition is not met;
s4, intersecting and merging the trimmed 1 item set into 2 item sets by using a set bound UPFPS, namely combining data information of two commodities to construct a UPFPS-list of the 2 item set, storing the UPFPS-list of the item set which accords with an upper bound value of upExpRa (x) > = minExpRa into the bound UPFPS so as to carry out a new iteration, and judging whether the 2 item set is a periodic frequent item set UPFPS in an uncertain database;
and S5, recursively circulating the n-1 item sets until the n item sets cannot be expanded, and outputting all periodic frequent item sets UPFPS in the uncertain database.
Further, a term set of one commodity composition is 1 term set X, a term set of a plurality of commodities compositions is X, two user-specified thresholds maxPr, minSup are set, if sup (X, S) > = minSup, maxPer (X, S) < = maxPr, the term set X is a cycle-frequent candidate pattern in the sequence S, a sequence set in the sequence database with the term set X as the cycle-frequent candidate pattern is represented and defined as candSeq (X) = { S ^ X, S) > = minSup ^ maxPer (X, S) < = maxPr ^ pro (X, T, S) | = T ^ 0^ S ^ D }, and an upper bound of the value of the desired support sequence ratio prexa (X) of X in the database is defined as expa (X) =/exp (X) | D (X), where X is calculated from the sequence set X.
Further, in the indeterminate sequence database, the sequence S of the kth transaction of some goods X i The probability value of the existence of item set X is defined and expressed as pro (X, T) k , S i ) ( 1<= k<= m, 1<=i<=|D|)。
Further, the expected support number expsoup (X) is the sum of the maximum probabilities that the item set X exists in the specified sequences in the uncertain database, and is calculated by the formula:wherein P (X, S) i ) The value of (A) indicates that the set of items X is present in the sequence S i The maximum probability of transaction in (1) is。
The invention has the following beneficial effects:
1. in order to measure the support number of the patterns in the uncertain database, a new support metric is defined, namely the maximum value of the transaction probability is found in each sequence, and a new period measure, namely an expected sequence period ratio (expRa), is defined in the uncertain database on the basis of the maximum value of the transaction probability, so that the mode with periodicity is found in a plurality of sequences;
2. in order to effectively reduce the search space, an upper bound upExpRa of the expRa measure is provided, and two new pruning characteristics are provided;
3. a novel UPFPS-list structure is provided to avoid repeated scanning of the database by the algorithm and improve the operation efficiency of the algorithm.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts. In the drawings:
FIG. 1 illustrates a flow chart of a method for periodic frequent pattern mining of multiple sequences in an uncertain transaction database according to the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
The definition of periodic frequent patterns in an uncertain database is first introduced, and then the periodic patterns and the expected support number are combined and expanded to a plurality of sequences in an uncertain database. Finally, a strategy pruning of the search space and two new pruning properties are proposed.
Definition 1: i = { (I) 1 :P i1 ), (i 2 :P i2 ),...,(i m :P im ) Is a set of different items (symbols or binary values) that appear in the transaction database, represented for each item appearing in pairs as (i: p i ) Where i is an item and P i (wherein P is i ∈(0,1]) Is the probability value that item i exists. An Itemset containing K Items is called K-Itemset. One sequence S is an ordered set of lists S =<T 1 ,T 2 ,....T m >And T is j Is contained in I (1)<=j<=m),T j Referred to as one transaction, each transaction contains a number of items (i: P) i ) And j is transaction T j Is called transaction T j The TID of (1).
Definition 2: the item set composed of one commodity is 1 item set X, the item set composed of a plurality of commodities is X, inSequence S of the kth transaction of the commodity X in the uncertain sequence database i The probability value of the existence of item set X is defined and expressed as pro (X, T) k , S i ) ( 1<= k<= m, 1<=i<=|D|)。
Definition 3: consider a sequence S of rows i A set of items X, sequence S i The ordered transaction list in which item set X is contained is defined as TR (X, S) =<T g(1) , T g(2) ,..., T g(k) >Is contained in S i。 Let T g(z) And Tg (z+1) Is a transaction in which item set X appears in sequence Si in two consecutive occurrences. The periodic calculation formula for two consecutive transactions containing item set X is per (T) g(z), T g(z+1) ) = g (z + 1) -g (z). Sequence S i The period of the middle set X is pr (X, S) i = per1, per 2., perk +1}, where Perk = g (k) -g (k-1), g (k) being the TID of the transaction in which the set of items X appears, and g (0) =0 and g (k + 1) = | S are specified i L, where l S i And | is the length of the sequence.
Definition 4: the standard deviation of the period of one set of terms X in the sequence S is denoted as stanDev (X, S).
Definition 5: the maximum periodicity of one set of entries X in the sequence S is defined as maxPer (X, S) = argmax (pr (X, S)).
Definition 6: in a sequence S, one set of items X may appear in multiple transactions, and the number of transactions in the sequence S that contain the occurrence of X is defined as sup (X, S) = | TR (X, S) |.
Definition 7: there are three user-specified thresholds maxPr, minSup, and maxStd. If (X, S) > = minSup, maxPeer (X, S) < = maxPr, pro (i, T, S) | 0, and stanDev (X, S) < = maxStd, then term set X is said to be periodic in sequence, and the set of sequences for which term set X satisfies the periodic frequent pattern is defined as PrSeq (X) = { S | sup (X, S) > = minSup ^ maxPeer (X, S) < = maxPr ^ stanDev (X, S) < ^ maxStd ^ pro (X, T, S) | =0^ T ^ S ^ D }.
Definition 8: the expected support number is the sum of the maximum probabilities that the item set X exists in the specified sequence in the uncertain database, and the calculation formula is as follows:. Wherein P (x, S) i ) The value of (A) indicates that the set of items X is present in the sequence S i The maximum probability of transaction in (1) is。
Definition 10: in a sequence database, if a set of terms X satisfies that a set of periodically frequent sequences is PrSeq (X), then the desired support sequence ratio for term set X is expRa (X) = expcup (X)/| D |, where the value of expcup (X) is calculated from the sequences in set PrSeq (X), | D | is the total number of sequences in the sequence database.
Definition 11: in an uncertain database, if the expRa (X) value of an item set X is not less than minExpRa, then this item set X is a periodic frequent pattern in the uncertain database.
Definition 12: two user-specified thresholds maxPr and minSup are set. If sup (X, S) > = minSup, maxPeer (X, S) < = maxPr, item set X is a cycle-frequent candidate pattern in the sequence. A set of sequences in the sequence database with the set X of entries as the periodic frequent candidate patterns is represented and defined as candSeq (X) = { S | sup (X, S) > = minSup ^ maxPeer (X, S) < = maxPr ^ pro (X, T, S) | =0^ T ^ S ^ D }, and the upper bound of the value of the desired support sequence ratio expRa (X) for X in the database is defined as expRa (X) = expPasp (X)/| D |, where the value of expPasp (X) is calculated from the sequences in the set candSeq (X).
Theorem 1: in the sequence database, for item set X, upExpRa (X) > = expRa (X).
Theorem 2: in the sequence database, for any two sets of items, if X \8838; XY, then upExpRa (XY) < = upExpRa (X).
Theorem 3: in a sequence database, if upExpRa (X) < = minExpRa for any item set X, then neither item set X nor its superset XY is UPFPS.
The specific algorithm process named MUPFPS of the first algorithm in the present invention is described below with reference to fig. 1:
a method for mining multiple sequences of periodic frequent patterns in an uncertain-transaction database as shown in fig. 1, comprising the following steps:
s1, inputting an uncertain transaction database of a large number of customers within a period of time, customizing four thresholds by a merchant, wherein the four thresholds are a minimum support frequency minSup, a maximum periodicity maxPr, a periodicity standard deviation maxStd and a minimum expected support number minExpRa;
s2, scanning the database to construct a UPFPS-list of the 1 item set x, namely constructing a data list UPFPS-list which is formed by the purchase probability of each item x and the transaction in which a certain item x appears in the purchase sequence of several users according to the time sequence, and judging whether the 1 item set x is a periodic frequent item set UPFPS in the uncertain database, and the method specifically comprises the following steps:
s2.1, calculating the transaction number sup (x, S) of the commodity x appearing in the sequence S, calculating the maximum periodicity maxPeer (x, S) of the item set x, and the period standard deviation stanDev (x, S), then traversing each single item set x by the algorithm in a loop, and for the commodity x appearing in the purchase sequence S, if the purchase probability of the commodity x is greater than the minimum purchase frequency, namely sup (x, S) > = minSup, the time interval between the two times of purchase of the commodity x does not exceed the maximum period threshold, namely maxPeer (x, S) <= maxPr, and the purchase period of the commodity x is stable in a certain range, namely stanDev (x, S) < = maxStd, then 1 item set x is called to be frequent in the sequence S periodically, and the algorithm stores the sequences of the item set x satisfying the condition into a set prSeq (x);
s2.2, if the expected periodic sequence is more than or equal to minExpRa than expRa (x), outputting 1 item set x as a periodic frequent pattern item set UPFPS;
wherein the expected periodic sequence ratio of sequences for x in the database is defined as expRa (x) = expsu (x)/| D |, where the value of expsu (x) is calculated from the sequences in the set PrSeq (x), and | D | is the total number of sequences in the database of sequences;
s3, pruning the search space according to an upper bound value upExpRa, adding UPFPS-list of 1 item set x meeting the condition upExpRa (x) > = minExpRa into a set bound UPFPS, and not expanding any more if the condition is not met;
specifically, the value of upExpRa (x) is computed to prune the search space and UPFPS-list of the term set. If upExpRa (x) > = minExpRa, then the UPFPS-list of 1 item set x is added to the set bound UPFPS and the set is stored in ascending order according to the value of upExpRa.
S4, intersecting and merging the trimmed 1 item set into 2 item sets by using a set bound UPFPS, namely combining data information of two commodities to construct a UPFPS-list of the 2 item set, storing the UPFPS-list of the item set which accords with an upper bound value of upExpRa (x) > = minExpRa into the bound UPFPS so as to carry out a new iteration, and judging whether the 2 item set is a periodic frequent item set UPFPS in an uncertain database;
and S5, recursively circulating the n-1 item set until the n item set cannot be expanded, and outputting all periodic frequent item sets UPFPS in the uncertain database.
The search process of the algorithm includes the following parameters: a multi-sequence uncertain database, UPFPS-list of the P extended item set, and a series of user-defined thresholds minSup, maxPr, maxStd, and minExpRa. The search process is implemented by merging the set of items Px in UPPFPS-list with all the remaining sets of extension items Py of P. The extension of item set P is the set of items obtained by appending item set z to P, denoted Pz. When this procedure is invoked for the first time, P is an empty set and P's extension is a single set of items. The search process executes a loop that combines each pair of extensions Px and Py of P to generate Pxy. The merging of the item set Px and the item set Py in the Px loop is to generate a new extended item set Pxy of length | Px | +1 in ascending order of upExpRa values. The algorithm constructs UPPFPS-list of item sets Px and Py into UPFPS-list of Pxy through an intersection program on the premise of not repeatedly scanning the database. Then, UPFPS-list of Pxy is scanned to obtain a sequence satisfying candSeq (Pxy) condition, and upExpRa (Pxy) is further calculated. If upExpRa (Pxy) > = minExpRa, then Pxy and its superset are likely to be UPFPS, and Pxy's UPFPS-list is added to a set named ExtensionsOfPx, and the upExpRa (Pxy) of each item set in the set is not less than minExpRa, otherwise neither Pxy nor its superset are likely to be UPFPS. The UPFPS-list of all the extension item sets of Px of upExpRa > = minExpRa is then stored in the set, and if the value of expRa is finally calculated to be not less than the threshold value minExpRa, the output Pxy is UPFPS and the extension item Pxy is added to the extension set considered by the depth-first search. And finally, recursively calling the UPFPS-list which is judged to be the UPFPS by the algorithm, performing list cross construction, and continuously searching the expansion item set of the Px until all the expansion item sets of the Px are searched.
Because the metric upExpRa has strict de-monotonicity, both the non-periodic frequent item set pattern set and its extended item set in the uncertain database are deleted from the search space, because the algorithm uses UPFPS-list to compute the expRa values of the item sets, and the upExpRa values of the output item sets are not less than minExpRa, so the algorithm is complete and correct. The algorithm outputs all the UPFPS and their expRa as well as the serial and transaction numbers.
PREFERRED EMBODIMENTS
Table 1: sequence database sample
SID |
1.(a:0.66,e:0.68),(a:0.76,c:0.91,e:0.48),(a:0.6,c:0.75,e:0.58),(a:0.55,b:0.37,c:0.78),(a:0.48,b:0.7,e:0.66) |
2.(a:0.76, c:0.48),(b:0.87),(b:0.65, c:0.46),(b:0.73, c:0.34),(a:0.52, e:0.33) |
3.(a:0.64,c:0.62),(a:0.33, b:0.51,e:0.48),(a:0.6, c:0.75, d:0.56),(a:0.49, c:0.34, e:0.88),(a:48, b:0.7) |
4.(a:0.46,c:0.78,c:0.77,e:0.87),(a:0.66,b:0.97,e:0.38),(a:0.36,b:0.67,c:0.75),(a:0.45,c:0.37,c:0.78),(a:0.81,e:0.69) |
Table 2: UPFPS-list of item set { a }
i-set {a} |
Sid-list {1,2,3,4} |
Tran-list [{1,2,3,4,5},{1,5},{1,2,3,4,5},{1,2,3,4,5}] |
Pro-list[{0.66,0.76,0.6,0.55,0.48},{0.76,0.52},{0.64,0.33,0.6,0.49,0.48},{0.46,0.66,0.36,0.45,0.81}] |
Table 3: UPFPS-list of item set { c }
i-set {c} |
Sid-list {1,2,3,4} |
Tran-list [{2, 3, 4},{1, 3, 4},{1, 3, 4},{1, 3, 4}] |
Pro-list[{0.91,0.75,0.78},{0.48,0.46,0.34},{0.62,0.75,0.34},{0.77,0.75,0.78}] |
Table 4: UPFPS-list of item set { a, c }
i-set {a,c} |
Sid-list {1,2,3,4} |
Tran-list [{2,3,4},{1},{1,3,4},{1,3,4}] |
Pro-list [{0.69,0.45,0.42},{0.36},{0.39,0.45,0.16},{0.35,0.27,0.35}] |
Table 1 is a database of tentative transactions for a large number of customers over a period of time, wherein the first row sequence represents a first probability of a first user purchasing item a of 0.66 and a probability of purchasing item e of 0.68. In Table 2, sid-list {1,2,3,4} represents the product a purchased by the first, second, third and fourth customers, and {1,2,3,4,5} in Tran-list [ {1,2,3,4,5}, {1,2,3,4,5} ] represents the product a purchased by the first customer for the first time, the second time, the third time, the fourth time and the fifth time, and so on.
Pro-list [ {0.66,0.76,0.6,0.55,0.48}, {0.76,0.52}, {0.64,0.33,0.6,0.49,0.48}, {0.46,0.66,0.36,0.45,0.81} ] represents that the probability of purchasing commodity a for the first time by the first client is 0.66, the probability of purchasing commodity a for the second time is 0.76, and so on.
Pro-list [ {0.69,0.45,0.42}, {0.36}, {0.39,0.45,0.16}, {0.35,0.27,0.35} ] in Table 4 represents that the probability of the first customer purchasing both commodity a and commodity c at the same time is 0.76X 0.91 and approximately equal to 0.69, and so on.
Based on the above data and specific properties, the merchant gives a set of thresholds minSup = 2, maxStd = 1.2, maxPr = 3 and minExpRa = 0.35, respectively. The UPFPS-list of { a, c } for table three is obtained by intersecting the UPFPS-lists of the item sets { a } and { c } shown for table two and table three, respectively. The Sid-list of the set of terms { a } is {1,2,3,4}, the Sid-list of the set of terms { c } is {1,2,3,4}, and the Sid-list of the set of terms { a, c } is {1,2,3,4} after intersection. For sequences S in a database 1 Tran-l of the middle item set { a }ist is {1,2,3,4,5}, and the set of terms { c } is in the sequence S 1 The Tran-list in (1) is {2,3,4}, and the Tran-list {2,3,4} of the term set { a, c } obtained by the intersection process is added to the Tran-list of { a, c }. Likewise, the Tran-list of other sequences of { a, c } is constructed in the same manner, the Pro-list of the set of terms { a }: { ([0.66], [0.76],[0.6],[0.55],[0.48]), ([0.76], [0.52]), ([0.64], [0.33], [0.6],[0.49], [0.48]),([0.46],[0.66],[0.36],[0.45],[0.81]) Pro-list of item set { c }: { ([0.91],[0.75],[0.78]),([0.48],[046],[0.34]),([0.62],[0.75],[0.34]),([0.77],[0.75],[0.78]) }. Pro-list of the set of terms { a, c } obtained by intersection: { ([0.69],[0.45],[0.42]),([0.36]),([0.39], [0.45],[0.16]), ([0.35],[0.27],[0.35]) }. The UPFPS-list of the set of terms { a, c } resulting from the intersection is shown in Table 4, where pr ({ a, c }, S) can be calculated from the UPFPS-list of the set of terms { a, c } 1 ) = {2,1,1,1},sup({a,c},S 1 ) =3 >= minsupe and maxPer ({ a, c }, S) 1 ) = 2<= maxPr, so the set of terms { a, c } is in the sequence S 1 Is a candidate periodic frequent pattern, and then the sequence S 1 Added to the set candSeq (x), then pr ({ a, c }, S) is calculated 1 ) = {1,4},sup({a,c},S 2 ) =1 <= minsupe and maxPer ({ a, c }, S) 2 ) = 4>= maxPr, so the set of terms { a, c } is in the sequence S 2 Is not a candidate periodic frequent pattern, and then pr ({ a, c }, S) is calculated 1 ) = {1,2,1,1},sup(a,c,S 3 ) =3 <= minsupe and maxPer ({ a, c }, S) 3 ) = 2>= maxPr, so the set of terms { a, c } is in the sequence S 3 Is a candidate periodic frequent pattern, and the sequence S 3 Added to the set candSeq (x), the similarly known item sets { a, c } are in the sequence S 4 Is also a candidate periodic frequent pattern, and also sequences S 4 Added to the set candSeq (x). The sequence set candSeq ({ a, c }) obtained by the above calculation contains sequence { S } 1 ,S 3 ,S 4 Using information of UPFPS-list of { a, c }, calculating exprup ({ a, c }) = 0.69 + 0.45 + 0.35 = 1.49, and then obtaining upExpRa ({ a, c }) = exprup ({ a, c })/| D | = 1.49/4 = 0.37> = minExpRa,Thus the set of items { a, c } and supersets thereof may be UPFPSs. The algorithm finds sequences { S } that satisfy the condition of sequence set PrSeq ({ a, c }) 1 ,S 3 ,S 4 Exptup ({ a, c }) = 0.69 + 0.45 + 0.35 = 1.49, and finally expRa ({ a, c }) = 1.49/4 = 0.37>= minExpRa, judge that the item set { a, c } is one UPFPS and output it.
Example two
The second algorithm, which performs breadth-first search mining all the UPFPS, is called MUPFPS _ BFS, and takes as input the ambiguous multi-sequence database and a series of user-defined thresholds minSup, maxPr, maxStd, and minExpRa. The algorithm runs to the end and outputs all UPFPS.
The MUPFPS _ BFS algorithm firstly scans a database to create a UPFPS-list of a single item set, and then calculates through parameters in Sid-list, tran-list and Pro-list in the UPFPS-list to obtain whether each item set meets the following conditions: sup ({ x }, S) ≧ minSup, maxPeer ({ x }, S) ≦ maxPr and stanDev ({ x }, S) < maxStd, that is, satisfying these conditions means that term set x is periodic in sequence S. Then, whether the item set x has frequent periodicity in each sequence is checked, and for all sequence sets prSeq (x) meeting the frequent periodicity in the item set x, the algorithm calculates the expected support number of the item set by dividing the total number of sequences of the database by the sequences in the sets, so as to obtain expRa (x). If this value is not less than the user-defined minExpRa, then item set x is called a UPFPS and output.
In addition, upExpRa (x) is further computed and pruned for UPFPS-list of the search space and term set. This stores all the 1 item sets of upExpRa (x) having a value not less than minExpRa in the set boundUPFPS, and the clipped item sets are stored in the set boundUPFPS in order of the value of upExpRa (x). Finally, the algorithm searches and mines UPFPS through a cycle to the item set with breadth first, and a recursive calling item set expansion process in the cycle process generates a larger mode until a new item set cannot be generated. The item set extension process first calls the set boundUPFPS because the boundUPFPS contains all item sets with upExpRa (x) values no less than minExpRa. The item set expansion process combines any two 1 item sets to obtain a 2 item set, obtains a 2 item set of which the value of upExpRa (x) is not less than minExpRa, and then stores the 2 item set in a set bound PFPS. Meanwhile, calculating expRa (x) identifies and outputs UPFPS of 2 item sets. Generally, in the process of generating an item set each time a recursion is called, if a boundUPFPS set contains at least two item sets, a loop continuously calls the generation process, and two k item sets (k > = 1) are combined to generate a k +1 item set.
The item set extension process takes as input a set of UPFPS-list of k item sets (k > = 1), defined as boundedupps, and a series of thresholds minSup, maxPr, maxStd, minExpRa, and finally outputs the UPFPS-list of item sets of length k +1, denoted as boundedupfps _ k +1. The algorithm first initializes a set boundUPFPS _ k +1 with length k +1, and stores all sets of k +1 items of upExpRa (x) not less than a custom threshold mineExpRa into the set boundUPFPS _ k +1. Putting all item sets in UPFPS-list in the bound UPFPS set into a variable item set, and then circularly combining the k item sets in sequence to generate a k +1 item set. If the item set Px and the item set Py with the length of k have prefixes with the common length of k-1 items, the item set Px and the item set Py are combined into an item set Pxy with the length of k +1, and the algorithm calls an intersection process to combine the UPFPS-list of the item set Px and the UPFPS-list of the Py to generate the UPFPS-list of a new item set Pxy. The algorithm will then determine if the set of items Pxy and its superset are likely UPFPS. The algorithm uses the information in UPFPS-list of the item set Pxy to calculate sup (Pxy, S), maxPeer (Pxy, S), stanDev (Pxy, S) and Pro (Pxy, T, S) in the sequence of the item set Pxy. The algorithm calculates upExpRa (Pxy) from these values, and if upExpRa (Pxy) is not less than minExpRa, adds UPFPS-list of item set Pxy to the boundUPFPS _ k +1 set to produce a larger item set. Then, expRa (x) is calculated from UPFPS-list of Pxy, and if this value is not less than minExpRa, then output Pxy is a periodic frequent pattern in the uncertain database.
Finally, after combining all k item sets that can be combined, the algorithm adds bound UPFPS _ k +1 to bound UPFPS in preparation for the next algorithm call to generate the item set for k +2 item set generation.
It is to be understood that the above description is not intended to limit the present invention, and the present invention is not limited to the above examples, and those skilled in the art may make modifications, alterations, additions or substitutions within the spirit and scope of the present invention.
Claims (3)
1. A method for mining a multi-sequence periodic frequent pattern in an uncertain transaction database is characterized by comprising the following steps:
s1, inputting an uncertain transaction database of a large number of customers within a period of time, and customizing four thresholds by a merchant, wherein the four thresholds are respectively a minimum support frequency minSup, a maximum periodicity maxPr, a periodicity standard deviation maxStd and a minimum expected support number minExpRa;
s2, scanning the database to construct a UPFPS-list of the item set x 1, namely constructing a data list UPFPS-list which is formed by the purchase probability of each item x and the transaction in which the certain item x appears in the purchase sequence of which users according to the time sequence, and judging whether the item set x is a periodic frequent item set UPFPS in the uncertain database, and the method specifically comprises the following steps:
s2.1, calculating the transaction number sup (x, S) of the commodity x appearing in the sequence S, calculating the maximum periodicity maxPeer (x, S) of the item set x, and the period standard deviation stanDev (x, S), then traversing each single item set x by the algorithm in a loop, and for the commodity x appearing in the purchase sequence S, if the purchase probability of the commodity x is greater than the minimum purchase frequency, namely sup (x, S) > = minSup, the time interval between the two times of purchase of the commodity x does not exceed the maximum period threshold, namely maxPeer (x, S) <= maxPr, and the purchase period of the commodity x is stable in a certain range, namely stanDev (x, S) < = maxStd, then 1 item set x is called to be frequent in the sequence S periodically, and the algorithm stores the sequences of the item set x satisfying the condition into a set prSeq (x);
s2.2, if the expected periodic sequence is more than or equal to minExpRa than expRa (x), outputting 1 item set x as a periodic frequent pattern item set UPFPS;
wherein the expected periodic sequence ratio of sequences for x in the database is defined as expRa (x) = expsu (x)/| D |, where the value of expsu (x) is calculated from the sequences in the set PrSeq (x), and | D | is the total number of sequences in the database of sequences;
s3, pruning the search space according to an upper bound value upExpRa, adding UPFPS-list of 1 item set x meeting the condition upExpRa (x) > = minExpRa into a set bound UPFPS, and not expanding any more if the condition is not met;
s4, intersecting and merging the trimmed 1 item set into 2 item sets by using a set bound UPFPS, namely combining data information of two commodities to construct a UPFPS-list of the 2 item set, storing the UPFPS-list of the item set which accords with an upper bound value of upExpRa (x) > = minExpRa into the bound UPFPS so as to carry out a new iteration, and judging whether the 2 item set is a periodic frequent item set UPFPS in an uncertain database;
and S5, recursively circulating the n-1 item set until the n item set cannot be expanded, and outputting all periodic frequent item sets UPFPS in the uncertain database.
2. The method of claim 1, wherein the item set of one commodity is 1 item set X, the item set of multiple commodities is X, two user-specified thresholds maxPr and minsupe are provided, if (X, S) > = minsupe and maxPeer (X, S) < = maxPr, then item set X is a frequent pattern in sequence S, and the sequence set of the item set X as the frequent pattern in the sequence database is represented and defined as candSeq (X) = { S | plus (X, S) = minSufA ^ maxPeer (X, S) < = maxPr ^ pro (X, T, S)! 0^ T ^ S ^ D }, where T represents a transaction and D represents the database, the upper bound of the values of the desired support sequence ratio expRa (X) for X in the database is defined as upExpRa (X) = expSup (X)/| D |, where the value of expSup (X) is calculated from the sequences in the set candSeq (X), where | D | represents the total number of sequences in the database of sequences;
sequence S of k-th transaction of goods X in indeterminate sequence database i The probability value of the existence of item set X is defined and expressed as pro (X, T) k ,S i )(1<=k<=m,1<=i<= | D |), where T is k Representing a transaction.
3. A multiple sequence contingency transaction database according to claim 2The periodic frequent pattern mining method is characterized in that an expected support number expSup (X) is the sum of the maximum probabilities that a term set X exists in a specified sequence in an uncertain database, and the calculation formula is as follows:wherein P (X, S) i ) The value of (A) indicates that the set of items X is present in the sequence S i The maximum probability of transaction in (1) isP (X) represents a set of entries X in the sequence S i A probability value existing in a certain transaction.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211635955.5A CN115617881B (en) | 2022-12-20 | 2022-12-20 | Multi-sequence periodic frequent pattern mining method in uncertain transaction database |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211635955.5A CN115617881B (en) | 2022-12-20 | 2022-12-20 | Multi-sequence periodic frequent pattern mining method in uncertain transaction database |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115617881A CN115617881A (en) | 2023-01-17 |
CN115617881B true CN115617881B (en) | 2023-03-21 |
Family
ID=84880134
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211635955.5A Active CN115617881B (en) | 2022-12-20 | 2022-12-20 | Multi-sequence periodic frequent pattern mining method in uncertain transaction database |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115617881B (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113268528A (en) * | 2021-06-01 | 2021-08-17 | 西北工业大学 | Multi-probability threshold frequent item set mining method and device for sensing data |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6718317B1 (en) * | 2000-06-02 | 2004-04-06 | International Business Machines Corporation | Methods for identifying partial periodic patterns and corresponding event subsequences in an event sequence |
CN104408127A (en) * | 2014-11-27 | 2015-03-11 | 无锡市思库瑞科技信息有限公司 | Maximal pattern mining method for uncertain data based on depth-first |
CN106033424B (en) * | 2015-03-11 | 2020-04-21 | 哈尔滨工业大学深圳研究生院 | Data mining method and device |
CN105224685A (en) * | 2015-10-28 | 2016-01-06 | 同济大学 | A kind of system of digging user cyclic pattern and method thereof |
CN107870913B (en) * | 2016-09-23 | 2021-12-14 | 腾讯科技(深圳)有限公司 | Efficient time high expectation weight item set mining method and device and processing equipment |
CN107870956B (en) * | 2016-09-28 | 2021-04-27 | 腾讯科技(深圳)有限公司 | High-utility item set mining method and device and data processing equipment |
US9933781B1 (en) * | 2016-11-23 | 2018-04-03 | Denso International America, Inc. | Data-driven planning for automated driving |
CN108346085A (en) * | 2018-01-30 | 2018-07-31 | 南京邮电大学 | Electric business platform personalized recommendation method based on weighted frequent items mining algorithm |
CN111930797A (en) * | 2020-07-09 | 2020-11-13 | 西北工业大学 | Uncertain periodic frequent item set mining method and device |
-
2022
- 2022-12-20 CN CN202211635955.5A patent/CN115617881B/en active Active
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113268528A (en) * | 2021-06-01 | 2021-08-17 | 西北工业大学 | Multi-probability threshold frequent item set mining method and device for sensing data |
Also Published As
Publication number | Publication date |
---|---|
CN115617881A (en) | 2023-01-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Lee et al. | An uncertainty-based approach: frequent itemset mining from uncertain data with different item importance | |
Lin et al. | RWFIM: Recent weighted-frequent itemsets mining | |
CN108346085A (en) | Electric business platform personalized recommendation method based on weighted frequent items mining algorithm | |
Song et al. | Multi-objective association rule mining with binary bat algorithm | |
Ingle et al. | Association rule mining using improved Apriori algorithm | |
Nouioua et al. | Tkc: Mining top-k cross-level high utility itemsets | |
Huang et al. | Targeted mining of top-k high utility itemsets | |
Lee et al. | An efficient approach for mining maximized erasable utility patterns | |
CN115617881B (en) | Multi-sequence periodic frequent pattern mining method in uncertain transaction database | |
Lessanibahri et al. | A novel pruning algorithm for mining long and maximum length frequent itemsets | |
CN107609110B (en) | Mining method and device for maximum multiple frequent patterns based on classification tree | |
CN115563192B (en) | Method for mining high-utility periodic frequent pattern applied to purchase pattern | |
Ralla et al. | An incremental technique for mining coverage patterns in large databases | |
Elbassioni | On finding minimal infrequent elements in multi-dimensional data defined over partially ordered sets | |
Lin et al. | Joint utility and frequency for pattern classification | |
CN107908711A (en) | Dense Databases quick association rule digging method based on vertical data distribution | |
Vu et al. | An efficient approach for mining association rules from sparse and dense databases | |
Klangwisan et al. | Mining weighted-frequent-regular itemsets from transactional database | |
Lin et al. | Fast algorithms for mining multiple fuzzy frequent itemsets | |
CN114219574B (en) | Commodity combination mining method based on weighted frequent sequences | |
Kavitha et al. | High Utility Itemset Mining With Influential Cross Selling Items From Transactional Database | |
Li et al. | Mining productive itemsets in dynamic databases | |
Verma et al. | A rational approach to improve access time of apriori algorithm by applying inner join in a arm to redefining fis in textual data | |
Huang et al. | NSPIS: Mining negative sequential patterns with individual support | |
Yuan et al. | Association Rule Mining Technique with Optimized FP-Tree Algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |