CN110399406B

CN110399406B - Method, device and computer storage medium for mining global high utility sequence pattern

Info

Publication number: CN110399406B
Application number: CN201910692048.6A
Authority: CN
Inventors: 林浚玮; 李圆法; 陈伟; 王巨宏
Original assignee: Tencent Technology Shenzhen Co Ltd; Harbin Institute of Technology Shenzhen
Current assignee: Tencent Technology Shenzhen Co Ltd; Harbin Institute of Technology Shenzhen
Priority date: 2019-07-26
Filing date: 2019-07-26
Publication date: 2024-06-04
Anticipated expiration: 2039-07-26
Also published as: CN110399406A

Abstract

The present disclosure provides a method, apparatus, and computer-readable storage medium for mining global high utility sequence patterns. The method comprises the following steps: determining a first type of item in the sequence database, wherein the first type of item is an item with a global sequence weight utility value higher than a first threshold value; determining utility value linked lists of all sequences in a sequence database; mining at least one candidate global high utility sequence pattern from a sequence database and determining a first set according to the determined first class item, wherein the first set comprises the at least one candidate global high utility sequence pattern, the identification of sequences comprising the respective candidate global high utility sequence patterns, and utility values of the respective candidate global high utility sequence patterns in the respective sequences; and mining a global high utility sequence pattern from the at least one candidate global high utility sequence pattern according to the utility value linked list and the first set of each sequence.

Description

Method, device and computer storage medium for mining global high-utility sequence patterns

技术领域Technical Field

本公开涉及数据处理领域，具体地，涉及一种挖掘全局高效用序列模式的方法、装置及计算机可读存储介质。The present disclosure relates to the field of data processing, and in particular, to a method, device and computer-readable storage medium for mining global high-utility sequence patterns.

背景技术Background technique

序列模式挖掘是数据挖掘领域的重要技术。序列模式挖掘是针对序列数据库的。序列数据库可以包括多条序列(也可以称为事务(transaction))，其中每个序列可以包括至少一个项集(itemset)，每个项集包括至少一个项(item)，并且项集之间存在排序顺序。以超市的购物数据为例，某用户在第一天购买了商品a和商品b，第二天购买了商品a和商品c，第三天购买了商品b。用户在这段时间的购物数据可以抽象为一条序列：<[a b]，[a c]，[b]>，其中a、b和c是项，[]内的项构成一个项集，多个项集按顺序排列构成了序列。高效用序列模式挖掘算法所挖掘的是效用值高于预设阈值的商品组合，即序列模式(pattern)。序列模式是不同项集的有序排列。Sequential pattern mining is an important technology in the field of data mining. Sequential pattern mining is aimed at sequence databases. Sequential databases can include multiple sequences (also called transactions), where each sequence can include at least one itemset, each itemset includes at least one item, and there is a sorting order between item sets. Taking the shopping data of a supermarket as an example, a user bought goods a and goods b on the first day, goods a and goods c on the second day, and goods b on the third day. The user's shopping data during this period can be abstracted as a sequence: <[a b], [a c], [b]>, where a, b and c are items, the items in [] constitute an itemset, and multiple item sets are arranged in order to form a sequence. The high-utility sequential pattern mining algorithm mines commodity combinations with utility values higher than the preset threshold, that is, sequential patterns. Sequential patterns are ordered arrangements of different item sets.

在挖掘高效用模式的过程中，通过计算整个数据库的总效用值来查找高效用模式的过程需要较多的计算，高效用序列模式的挖掘更是如此。因此，高效用序列模式挖掘比传统的高效用模式挖掘和频繁序列模式挖掘更加复杂。目前的分布式且并行的模式挖掘集中在高效用模式挖掘和频繁序列模式挖掘，例如，可以在Hadoop平台上进行高效用模式挖掘和频繁序列模式挖掘。因此，还不存在分布式且并行的高效用序列模式挖掘方法。In the process of mining high-utility patterns, the process of finding high-utility patterns by calculating the total utility value of the entire database requires more calculations, especially the mining of high-utility sequence patterns. Therefore, high-utility sequence pattern mining is more complicated than traditional high-utility pattern mining and frequent sequence pattern mining. Current distributed and parallel pattern mining focuses on high-utility pattern mining and frequent sequence pattern mining. For example, high-utility pattern mining and frequent sequence pattern mining can be performed on the Hadoop platform. Therefore, there is no distributed and parallel high-utility sequence pattern mining method.

发明内容Summary of the invention

为此，本公开提供了一种挖掘全局高效用序列模式的方法、装置及计算机可读存储介质。To this end, the present disclosure provides a method, an apparatus, and a computer-readable storage medium for mining global high-utility sequence patterns.

根据本公开的一个方面，提供了一种用于挖掘全局高效用序列模式的方法，包括：确定序列数据库中的第一类项，其中第一类项是全局序列权重效用值高于第一阈值的项；确定所述序列数据库中各个序列的效用值链表；根据所确定的第一类项，从所述序列数据库挖掘至少一个候选的全局高效用序列模式并确定第一集合，其中所述第一集合包括所述至少一个候选的全局高效用序列模式、包括各个候选的全局高效用序列模式的序列的标识以及各个候选的全局高效用序列模式在相应序列中的效用值；以及根据各个序列的效用值链表和所述第一集合，从所述至少一个候选的全局高效用序列模式中挖掘全局高效用序列模式。According to one aspect of the present disclosure, a method for mining global high-utility sequence patterns is provided, comprising: determining a first category of items in a sequence database, wherein the first category of items are items whose global sequence weighted utility values are higher than a first threshold; determining a utility value chain list of each sequence in the sequence database; mining at least one candidate global high-utility sequence pattern from the sequence database and determining a first set based on the determined first category of items, wherein the first set includes the at least one candidate global high-utility sequence pattern, an identifier of a sequence including each candidate global high-utility sequence pattern, and a utility value of each candidate global high-utility sequence pattern in the corresponding sequence; and mining a global high-utility sequence pattern from the at least one candidate global high-utility sequence pattern based on the utility value chain list of each sequence and the first set.

根据本公开的一个示例，其中所述确定序列数据库中的第一类项包括：确定序列数据库中各个项的全局序列权重效用值；以及将全局序列权重效用值高于第一阈值的项确定为第一类项。According to an example of the present disclosure, determining the first category of items in the sequence database includes: determining the global sequence weight utility value of each item in the sequence database; and determining the items whose global sequence weight utility value is higher than a first threshold as the first category of items.

根据本公开的一个示例，其中确定序列数据库中每个项的全局序列权重效用值包括：确定该项在序列数据库的各个分区的局部序列权重效用值；以及根据所确定的局部序列权重效用值确定该项的全局序列权重效用值。According to an example of the present disclosure, determining the global sequence weight utility value of each item in a sequence database includes: determining the local sequence weight utility values of the item in each partition of the sequence database; and determining the global sequence weight utility value of the item based on the determined local sequence weight utility values.

根据本公开的一个示例，其中该项在所述序列数据库的每个分区的局部序列权重效用值是根据该分区中包括该项的序列的效用值确定的。According to an example of the present disclosure, the local sequence weight utility value of the item in each partition of the sequence database is determined according to the utility values of the sequences including the item in the partition.

根据本公开的一个示例，其中确定序列数据库中每个序列的效用值链表包括：根据该序列中各个项的效用值和各个项在该序列中的位置，确定该序列的效用值链表。According to an example of the present disclosure, determining the utility value linked list of each sequence in the sequence database includes: determining the utility value linked list of the sequence according to the utility value of each item in the sequence and the position of each item in the sequence.

根据本公开的一个示例，其中根据所确定的第一类项，从所述序列数据库挖掘至少一个候选的全局高效用序列模式包括：根据所确定的第一类项，从序列数据库的各个分区挖掘局部高效用序列模式；以及根据所挖掘的局部高效用序列模式确定至少一个候选的全局高效用序列模式。According to an example of the present disclosure, mining at least one candidate global high-utility sequence pattern from the sequence database based on the determined first category items includes: mining local high-utility sequence patterns from each partition of the sequence database based on the determined first category items; and determining at least one candidate global high-utility sequence pattern based on the mined local high-utility sequence patterns.

根据本公开的一个示例，其中根据所确定的第一类项，从所述序列数据库的每个分区挖掘局部高效用序列模式包括：对于该分区包括的各个序列中属于第一类项的一个项，计算该项在各个序列中的效用值和剩余效用值，其中该项在一个序列中的剩余效用值是该序列中、该项之后的所有项的效用值之和；构建该项在各个序列中的效用列表；根据该项在各个序列中的效用列表确定该项的效用值链；根据该分区中各个项的效用值链，从该分区挖掘局部高效用序列模式。According to an example of the present disclosure, based on the determined first category items, mining local high utility sequence patterns from each partition of the sequence database includes: for an item belonging to the first category in each sequence included in the partition, calculating the utility value and residual utility value of the item in each sequence, wherein the residual utility value of the item in a sequence is the sum of the utility values of all items after the item in the sequence; constructing a utility list of the item in each sequence; determining the utility value chain of the item based on the utility list of the item in each sequence; and mining local high utility sequence patterns from the partition based on the utility value chain of each item in the partition.

根据本公开的一个示例，其中根据各个序列的效用值链表和所述第一集合，从所述至少一个候选的全局高效用序列模式挖掘全局高效用序列模式包括：根据各个序列的效用值链表和所述第一集合，确定各个候选的全局高效用序列模式的局部效用值；根据各个候选的全局高效用序列模式的局部效用值，确定各个候选的全局高效用序列模式的全局效用值；以及将全局效用值大于第一阈值的序列模式确定为全局高效用序列模式。According to an example of the present disclosure, mining a global high utility sequence pattern from the at least one candidate global high utility sequence pattern according to the utility value chain list of each sequence and the first set includes: determining the local utility value of each candidate global high utility sequence pattern according to the utility value chain list of each sequence and the first set; determining the global utility value of each candidate global high utility sequence pattern according to the local utility value of each candidate global high utility sequence pattern; and determining a sequence pattern whose global utility value is greater than a first threshold as a global high utility sequence pattern.

根据本公开的一个示例，上述方法还包括：根据负载均衡算法将所述序列数据库中的序列划分为多个分区。According to an example of the present disclosure, the above method further includes: dividing the sequences in the sequence database into multiple partitions according to a load balancing algorithm.

根据本公开的另一方面，提供了一种用于挖掘全局高效用序列模式的装置，包括：第一确定单元，被配置为确定序列数据库中的第一类项，其中第一类项是全局序列权重效用值高于第一阈值的项；第二确定单元，被配置为确定所述序列数据库中各个序列的效用值链表；第一挖掘单元，被配置为根据所确定的第一类项，从所述序列数据库挖掘至少一个候选的全局高效用序列模式并确定第一集合，其中所述第一集合包括所述至少一个候选的全局高效用序列模式、包括各个候选的全局高效用序列模式的序列的标识以及各个候选的全局高效用序列模式在相应序列中的效用值；以及第二挖掘单元，被配置为根据各个序列的效用值链表和所述第一集合，从所述至少一个候选的全局高效用序列模式中挖掘全局高效用序列模式。According to another aspect of the present disclosure, there is provided an apparatus for mining global high-utility sequence patterns, comprising: a first determination unit, configured to determine a first category of items in a sequence database, wherein the first category of items are items whose global sequence weight utility values are higher than a first threshold; a second determination unit, configured to determine a utility value chain list of each sequence in the sequence database; a first mining unit, configured to mine at least one candidate global high-utility sequence pattern from the sequence database and determine a first set based on the determined first category of items, wherein the first set includes the at least one candidate global high-utility sequence pattern, an identifier of a sequence including each candidate global high-utility sequence pattern, and a utility value of each candidate global high-utility sequence pattern in the corresponding sequence; and a second mining unit, configured to mine a global high-utility sequence pattern from the at least one candidate global high-utility sequence pattern based on the utility value chain list of each sequence and the first set.

根据本公开的一个示例，其中所述第一确定单元被配置为确定所述序列数据库中各个项的全局序列权重效用值；以及将全局序列权重效用值高于第一阈值的项确定为第一类项。According to an example of the present disclosure, the first determination unit is configured to determine the global sequence weighted utility value of each item in the sequence database; and determine the items whose global sequence weighted utility value is higher than a first threshold as first category items.

根据本公开的一个示例，其中所述第二确定单元被配置为确定每个项在序列数据库的各个分区的局部序列权重效用值；以及根据所确定的局部序列权重效用值确定该项的全局序列权重效用值。According to an example of the present disclosure, the second determination unit is configured to determine the local sequence weighted utility value of each item in each partition of the sequence database; and determine the global sequence weighted utility value of the item according to the determined local sequence weighted utility value.

根据本公开的一个示例，其中该项在序列数据库的每个分区的局部序列权重效用值是根据该分区中包括该项的序列的效用值确定的。According to an example of the present disclosure, the local sequence weight utility value of the item in each partition of the sequence database is determined according to the utility values of the sequences including the item in the partition.

根据本公开的一个示例，其中所述第二确定单元被配置为根据每个序列中各个项的效用值和各个项在该序列中的位置，确定该序列的效用值链表。According to an example of the present disclosure, the second determining unit is configured to determine a utility value linked list of the sequence according to the utility value of each item in each sequence and the position of each item in the sequence.

根据本公开的一个示例，其中所述第一挖掘单元被配置为根据所确定的第一类项，从序列数据库的各个分区挖掘局部高效用序列模式；以及根据所挖掘的局部高效用序列模式确定至少一个候选的全局高效用序列模式。According to an example of the present disclosure, the first mining unit is configured to mine local high-utility sequence patterns from various partitions of the sequence database based on the determined first category items; and determine at least one candidate global high-utility sequence pattern based on the mined local high-utility sequence patterns.

根据本公开的一个示例，其中所述第一挖掘单元被配置为对于序列数据库的每个分区包括的各个序列中属于第一类项的一个项，计算该项在各个序列中的效用值和剩余效用值，其中该项在一个序列中的剩余效用值是该序列中、该项之后的所有项的效用值之和；构建该项在各个序列中的效用列表；根据该项在各个序列中的效用列表确定该项的效用值链；根据该分区中各个项的效用值链，从该分区挖掘局部高效用序列模式。According to an example of the present disclosure, the first mining unit is configured to calculate the utility value and residual utility value of an item belonging to the first category in each sequence included in each partition of the sequence database, wherein the residual utility value of the item in a sequence is the sum of the utility values of all items after the item in the sequence; construct a utility list of the item in each sequence; determine the utility value chain of the item based on the utility list of the item in each sequence; and mine a local high-utility sequence pattern from the partition based on the utility value chain of each item in the partition.

根据本公开的一个示例，其中所述第二挖掘单元被配置为根据各个序列的效用值链表和所述第一集合，确定各个候选的全局高效用序列模式的局部效用值；根据各个候选的全局高效用序列模式的局部效用值，确定各个候选的全局高效用序列模式的全局效用值；以及将全局效用值大于第一阈值的序列模式确定为全局高效用序列模式。According to an example of the present disclosure, the second mining unit is configured to determine the local utility value of each candidate global high-utility sequence pattern based on the utility value chain list of each sequence and the first set; determine the global utility value of each candidate global high-utility sequence pattern based on the local utility value of each candidate global high-utility sequence pattern; and determine the sequence pattern whose global utility value is greater than the first threshold as the global high-utility sequence pattern.

根据本公开的一个示例，上述装置还包括负载分配单元，被配置为根据负载均衡算法将序列数据库中的序列划分为多个分区。According to an example of the present disclosure, the above-mentioned device also includes a load distribution unit, which is configured to divide the sequences in the sequence database into multiple partitions according to a load balancing algorithm.

根据本公开的另一方面，提供了一种用于挖掘全局高效用序列模式的装置，包括：处理器；以及存储器，其中，所述存储器中存储有计算机可执行程序，当由所述处理器执行所述计算机可执行程序时，执行上述方法。According to another aspect of the present disclosure, there is provided an apparatus for mining global high-utility sequence patterns, comprising: a processor; and a memory, wherein a computer executable program is stored in the memory, and when the computer executable program is executed by the processor, the above method is performed.

根据本公开的另一方面，提供了一种计算机可读存储介质，其上存储有指令，所述指令在被处理器执行时，使得所述处理器执行上述方法。According to another aspect of the present disclosure, a computer-readable storage medium is provided, on which instructions are stored. When the instructions are executed by a processor, the processor executes the above method.

通过本公开提供的挖掘全局高效用序列模式的方法、装置及计算机可读存储介质，确定了序列数据库中各个序列的效用值链表和第一集合，并根据这两种数据结构来挖掘全局高效用序列模式，节省了大量时间，加快了在序列数据库中计算全局效用值的计算过程，加快了挖掘速度，降低了时间复杂度。Through the method, device and computer-readable storage medium for mining global high-utility sequence patterns provided by the present invention, the utility value linked list and the first set of each sequence in the sequence database are determined, and the global high-utility sequence pattern is mined based on these two data structures, which saves a lot of time, speeds up the calculation process of calculating the global utility value in the sequence database, speeds up the mining speed, and reduces the time complexity.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

通过结合附图对本公开实施例进行更详细的描述，本公开的上述以及其它目的、特征和优势将变得更加明显。附图用来提供对本公开实施例的进一步理解，并且构成说明书的一部分，与本公开实施例一起用于解释本公开，并不构成对本公开的限制。在附图中，相同的参考标号通常代表相同部件或步骤。The above and other purposes, features and advantages of the present disclosure will become more apparent by describing the embodiments of the present disclosure in more detail in conjunction with the accompanying drawings. The accompanying drawings are used to provide a further understanding of the embodiments of the present disclosure and constitute a part of the specification. Together with the embodiments of the present disclosure, they are used to explain the present disclosure and do not constitute a limitation of the present disclosure. In the accompanying drawings, the same reference numerals generally represent the same components or steps.

图1是根据本公开实施例的从序列数据库挖掘全局高效用序列模式的系统架构的示意图。FIG1 is a schematic diagram of a system architecture for mining global high-utility sequence patterns from a sequence database according to an embodiment of the present disclosure.

图2是根据本公开实施例的用于挖掘全局高效用序列模式的方法的流程图。FIG. 2 is a flowchart of a method for mining global high-utility sequence patterns according to an embodiment of the present disclosure.

图3示出了项a在序列s1中的效用列表的示意图。FIG. 3 shows a schematic diagram of the utility list of item a in sequence s1.

图4示出了项a的效用值链的示意图。FIG. 4 shows a schematic diagram of the utility value chain of item a.

图5是根据本公开实施例的从至少一个候选的全局高效用序列模式挖掘全局高效用序列模式的方法的流程图。FIG. 5 is a flowchart of a method for mining a global high-utility sequence pattern from at least one candidate global high-utility sequence pattern according to an embodiment of the present disclosure.

图6是根据本公开实施例的用于挖掘全局高效用序列模式的装置的结构示意图。FIG6 is a schematic diagram of the structure of an apparatus for mining global high-utility sequence patterns according to an embodiment of the present disclosure.

图7示出了根据本公开实施例的计算机设备的架构。FIG. 7 shows the architecture of a computer device according to an embodiment of the present disclosure.

具体实施方式Detailed ways

为了使得本公开的目的、技术方案和优点更为明显，下面将参照附图详细描述根据本公开的示例实施例。在附图中，相同的参考标号自始至终表示相同的元件。应当理解：这里描述的实施例仅仅是说明性的，而不应被解释为限制本公开的范围。In order to make the purpose, technical solutions and advantages of the present disclosure more obvious, the exemplary embodiments according to the present disclosure will be described in detail with reference to the accompanying drawings. In the accompanying drawings, the same reference numerals represent the same elements from beginning to end. It should be understood that the embodiments described here are merely illustrative and should not be interpreted as limiting the scope of the present disclosure.

在本公开中，当序列模式的效用值较高时，例如，当序列模式的效用值高于预设阈值时，可以将该序列模式称为“高效用序列模式”。也就是说，“高效用序列模式”可以是效用值高于预设阈值的序列模式。这里的“预设阈值”可以是固定不变的，或者可以随着挖掘算法的应用场景的改变而改变。In the present disclosure, when the utility value of a sequence pattern is high, for example, when the utility value of a sequence pattern is higher than a preset threshold, the sequence pattern may be referred to as a "high-utility sequence pattern". That is, a "high-utility sequence pattern" may be a sequence pattern whose utility value is higher than a preset threshold. The "preset threshold" here may be fixed or may change as the application scenario of the mining algorithm changes.

本公开提出了一种分布式且并行的高效用序列模式挖掘的技术方案。在本公开中，通过基于Hadoop平台的分布式计算框架来实现分布式且并行的高效用序列模式挖掘。在挖掘过程中，使用序列数据库中各个序列的效用值链表和第一集合，来保存挖掘过程中必要的信息，从而加快挖掘速度，降低时间复杂度。这里所提到的“分布式计算框架”可以是映射和归纳(MapReduce)框架，其中Map是把一个键值(key-value)对映射成一个新的键值对，Reduce是把键值对中键相同的值整合，同时映射成新的键值对。此外，执行Map操作的模块可以称为Mapper，执行Reduce操作的模块可以称为Reducer。The present disclosure proposes a technical solution for distributed and parallel high-utility sequence pattern mining. In the present disclosure, distributed and parallel high-utility sequence pattern mining is achieved through a distributed computing framework based on the Hadoop platform. In the mining process, the utility value linked list and the first set of each sequence in the sequence database are used to save the necessary information in the mining process, thereby speeding up the mining speed and reducing the time complexity. The "distributed computing framework" mentioned here can be a mapping and induction (MapReduce) framework, in which Map is to map a key-value pair into a new key-value pair, and Reduce is to integrate the values with the same key in the key-value pair and map them into new key-value pairs at the same time. In addition, the module that performs the Map operation can be called Mapper, and the module that performs the Reduce operation can be called Reducer.

首先，参照图1来描述根据本公开实施例的从序列数据库挖掘全局高效用序列模式(Global-High Utility Sequence Pattern，G-HUSP)的系统架构。图1是根据本公开实施例的从序列数据库挖掘全局高效用序列模式的系统架构的示意图。如图1所示，系统架构100可以包括三部分，分别为识别部分110、局部挖掘部分120和整合部分130。识别部分110可以包括多个Mapper和多个Reducer，例如n个Mapper和n个Reducer，其中n为正整数。识别部分110可以用于确定序列数据库中的第一类项，该第一类项是全局序列权重效用值高于第一阈值的项。第一类项是有可能构成高效用序列模式的项，因此也可以称为有希望的项(promising item)。局部挖掘部分120可以包括多个Mapper和多个Reducer，例如n个Mapper和n个Reducer，其中n为正整数。局部挖掘部分120可以用于根据第一类项，从序列数据库挖掘出局部高效用序列模式(Local-High Utility Sequence Pattern，L-HUSP)。该局部高效用序列模式中的一部分序列模式可能为全局高效用序列模式，另一部分序列模式可能不是全局高效用序列模式，则该另一部分序列模块可以作为候选的全局高效用序列模式。此外，局部挖掘部分120还可以用于确定第一集合(可以表示为sidset)，该第一集合可以包括至少一个候选的全局高效用序列模式、包括各个候选的全局高效用序列模式的序列的标识以及各个候选的全局高效用序列模式在相应序列中的效用值。此外，还可以确定序列数据库中各个序列的效用值链表。整合部分130可以包括多个Mapper和多个Reducer，例如n个Mapper和n个Reducer，其中n为正整数。整合部分130可以用于根据各个序列的效用值链表和所述第一集合，从所述至少一个候选的全局高效用序列模式中挖掘全局高效用序列模式。通过图1所示的系统架构，可以使用序列数据库中各个序列的效用值链表和第一集合，来保存挖掘过程中必要的信息，从而加快高效用序列模式的挖掘速度，降低时间复杂度。First, the system architecture of mining a global high-utility sequence pattern (Global-High Utility Sequence Pattern, G-HUSP) from a sequence database according to an embodiment of the present disclosure is described with reference to FIG. 1. FIG. 1 is a schematic diagram of a system architecture of mining a global high-utility sequence pattern from a sequence database according to an embodiment of the present disclosure. As shown in FIG. 1, the system architecture 100 may include three parts, namely, an identification part 110, a local mining part 120, and an integration part 130. The identification part 110 may include a plurality of mappers and a plurality of reducers, for example, n mappers and n reducers, where n is a positive integer. The identification part 110 may be used to determine the first category of items in the sequence database, which are items whose global sequence weight utility values are higher than a first threshold. The first category of items are items that may constitute a high-utility sequence pattern, and therefore may also be referred to as promising items. The local mining part 120 may include a plurality of mappers and a plurality of reducers, for example, n mappers and n reducers, where n is a positive integer. The local mining part 120 can be used to mine a local high-utility sequence pattern (Local-High Utility Sequence Pattern, L-HUSP) from a sequence database according to the first category item. A part of the sequence pattern in the local high-utility sequence pattern may be a global high-utility sequence pattern, and another part of the sequence pattern may not be a global high-utility sequence pattern, then the other part of the sequence module can be used as a candidate global high-utility sequence pattern. In addition, the local mining part 120 can also be used to determine a first set (which can be expressed as a sidset), which can include at least one candidate global high-utility sequence pattern, an identifier of a sequence including each candidate global high-utility sequence pattern, and the utility value of each candidate global high-utility sequence pattern in the corresponding sequence. In addition, the utility value chain list of each sequence in the sequence database can also be determined. The integration part 130 can include multiple Mappers and multiple Reducers, such as n Mappers and n Reducers, where n is a positive integer. The integration part 130 can be used to mine a global high-utility sequence pattern from the at least one candidate global high-utility sequence pattern according to the utility value chain list of each sequence and the first set. Through the system architecture shown in Figure 1, the utility value linked list and the first set of each sequence in the sequence database can be used to save the necessary information in the mining process, thereby speeding up the mining of high-utility sequence patterns and reducing time complexity.

需要认识到，尽管在图1中示出了三阶段的MapReduce，但这只是示意性的。根据本公开的实施例，还可以是更少或更多个阶段的MapReduce。此外，每个阶段的MapReduce包含的Mapper和Reducer的个数可以相同，也可以不同。此外，不同阶段的MapReduce包含的Mapper和/或Reducer的个数可以相同，也可以不同。It should be noted that although three stages of MapReduce are shown in FIG1 , this is only for illustration. According to the embodiments of the present disclosure, there may be fewer or more stages of MapReduce. In addition, the number of Mappers and Reducers included in each stage of MapReduce may be the same or different. In addition, the number of Mappers and/or Reducers included in different stages of MapReduce may be the same or different.

此外，应该理解，在本公开中，“局部”是针对数据库的一个分区而言的，而“全局”是针对数据库整体而言的。例如，本公开中的“局部高效用序列模式”可以是从数据库的一个分区挖掘出的高效用序列模式，即对该分区而言是高效用的序列模式；而本公开中的“全局高效用序列模式”可以是从多个局部高效用序列模式中挖掘出的高效用序列模式，即对数据库整体而言是高效用的序列模式。又例如，本公开中的“局部序列权重效用值”可以是根据数据库的一个分区中的数据确定的效用值；而本公开中的“全局高效用序列模式”可以是根据数据库中的所有数据确定的效用值。In addition, it should be understood that in the present disclosure, "local" refers to a partition of a database, while "global" refers to the database as a whole. For example, the "local high-utility sequence pattern" in the present disclosure may be a high-utility sequence pattern mined from a partition of a database, that is, a sequence pattern that is highly useful for the partition; and the "global high-utility sequence pattern" in the present disclosure may be a high-utility sequence pattern mined from multiple local high-utility sequence patterns, that is, a sequence pattern that is highly useful for the database as a whole. For another example, the "local sequence weight utility value" in the present disclosure may be a utility value determined based on the data in a partition of a database; and the "global high-utility sequence pattern" in the present disclosure may be a utility value determined based on all the data in the database.

下面将结合图2来具体描述根据图1所示的系统框架的挖掘全局高效用序列模式的方法的流程图。图2是根据本公开实施例的用于挖掘全局高效用序列模式的方法200的流程图。如图2所示，在步骤S201中，确定序列数据库中的第一类项，其中第一类项是全局序列权重效用值(Global Sequence Weight Utility，GSWU)高于第一阈值的项。The following will specifically describe a flow chart of a method for mining global high-utility sequence patterns according to the system framework shown in FIG1 in conjunction with FIG2. FIG2 is a flow chart of a method 200 for mining global high-utility sequence patterns according to an embodiment of the present disclosure. As shown in FIG2, in step S201, a first category of items in a sequence database is determined, wherein the first category of items is an item whose global sequence weight utility value (Global Sequence Weight Utility, GSWU) is higher than a first threshold.

在本公开中，序列数据库可以包括多个序列以及与各个序列对应的标识信息。在本公开中，序列可以是定量序列(quantitative sequence)。与每个序列对应的标识信息可以称为序列标识(sequence id，sid)。可以用s_l来表示第l条序列的序列标识，其中l为正整数。每个序列可以包括一个或多个项集，每个项集可以包括一个或多个项。每个项具有内部效用值和外部效用值。在交易类型的数据库中，内部效用值可以是项的交易数量。在其他场景的数据库中，内部效用值的形式可相应的调整。记录数据库中各个项的外部效用值的表格可以称为外部效用值表。在交易类型的数据库中，外部效用值表可以是利润表，即外部效用值表可以记录数据库中各项的单位利润值。在其他场景的数据库中，外部效用值表的形式可相应的调整。In the present disclosure, a sequence database may include multiple sequences and identification information corresponding to each sequence. In the present disclosure, a sequence may be a quantitative sequence. The identification information corresponding to each sequence may be referred to as a sequence identifier (sequence id, sid). The sequence identifier of the lth sequence may be represented by s _l , where l is a positive integer. Each sequence may include one or more item sets, and each item set may include one or more items. Each item has an internal utility value and an external utility value. In a transaction-type database, the internal utility value may be the number of transactions of the item. In databases of other scenarios, the form of the internal utility value may be adjusted accordingly. A table recording the external utility values of each item in the database may be referred to as an external utility value table. In a transaction-type database, the external utility value table may be a profit table, that is, the external utility value table may record the unit profit value of each item in the database. In databases of other scenarios, the form of the external utility value table may be adjusted accordingly.

下面的表1示出了序列数据库的一个示例。如表1所示，序列数据库是交易类型的数据库，其包括了5个序列，分别为s₁～s₅。每条序列由同一顾客在不同时间的购买清单组成，每个购买清单即为项集，购买的商品即为项。例如，序列s1表示顾客先购买2个商品a和3个商品c，再购买3个商品a、1个商品b和2个商品c，然后购买4个商品a、5个商品b和4个商品d，最后购买了3个商品e。Table 1 below shows an example of a sequence database. As shown in Table 1, the sequence database is a transaction type database, which includes 5 sequences, namely s ₁ to s ₅ . Each sequence consists of the purchase list of the same customer at different times. Each purchase list is an item set, and the purchased goods are items. For example, sequence s1 means that the customer first purchases 2 goods a and 3 goods c, then purchases 3 goods a, 1 goods b and 2 goods c, then purchases 4 goods a, 5 goods b and 4 goods d, and finally purchases 3 goods e.

sidsid 序列sequence s₁ s ₁ <[(a∶2)(c∶3)]，[(a∶3)(b∶1)(c∶2)]，[(a∶4)(b∶5)(d∶4)]，[(e∶3)]><[(a∶2)(c∶3)]，[(a∶3)(b∶1)(c∶2)]，[(a∶4)(b∶5)(d∶4)]，[(e∶3)]> s₂ s ₂ <[(a∶1)(e∶3)]，[(a∶5)(b∶3)(d∶2)]，[(b∶2)(c∶1)(d∶4)(e∶3)]><[(a∶1)(e∶3)], [(a∶5)(b∶3)(d∶2)], [(b∶2)(c∶1)(d∶4)(e∶3)]> s₃ s ₃ <[(e∶2)]，[(c∶2)(d∶3)]，[(a∶3)(e∶3)]，[(b∶4)(d∶5)]><[(e∶2)]，[(c∶2)(d∶3)]，[(a∶3)(e∶3)]，[(b∶4)(d∶5)]> s₄ s ₄ <[(b∶2)(c∶3)]，[(a∶5)(e∶1)]，[(b∶4)(d∶3)(e∶5)]><[(b:2)(c:3)],[(a:5)(e:1)],[(b:4)(d:3)(e:5)]> s₅ s ₅ <[(a∶4)(c∶3)]，[(a∶2)(b∶5)(c∶2)(d∶4)(e∶3)]><[(a∶4)(c∶3)], [(a∶2)(b∶5)(c∶2)(d∶4)(e∶3)]>

表1 序列数据库的示例Table 1 Examples of sequence databases

下面的表2示出了外部效用值表的一个示例。如表2所示，商品a的利润为5，商品b的利润为3，商品c的利润为4，商品d的利润为2，商品e的利润为1，以及商品f的利润为6。An example of an external utility value table is shown in Table 2 below. As shown in Table 2, the profit of commodity a is 5, the profit of commodity b is 3, the profit of commodity c is 4, the profit of commodity d is 2, the profit of commodity e is 1, and the profit of commodity f is 6.

项item aa bb cc dd ee ff 利润profit 55 33 44 22 11 66

表2 外部效用值表的示例Table 2 Example of external utility value table

在步骤S201中，可以确定序列数据库中各个项的全局序列权重效用值，并将全局序列权重效用值高于第一阈值的项确定为第一类项。步骤S201可以由上面所描述的识别部分110(即第一阶段MapReduce)来进行。In step S201, the global sequence weighted utility value of each item in the sequence database can be determined, and items with global sequence weighted utility values higher than a first threshold are determined as first-category items. Step S201 can be performed by the identification part 110 (ie, the first-stage MapReduce) described above.

下面将描述确定序列数据库中每个项的全局序列权重效用值的过程。根据本公开的一个示例，对于序列数据库中的每个项，可以首先确定该项在序列数据库的各个分区的局部序列权重效用值(Local Sequence Weight Utility，LSWU)，然后根据所确定的局部序列权重效用值确定该项的全局序列权重效用值。The process of determining the global sequence weight utility value of each item in the sequence database will be described below. According to an example of the present disclosure, for each item in the sequence database, the local sequence weight utility value (LSWU) of the item in each partition of the sequence database can be first determined, and then the global sequence weight utility value of the item is determined based on the determined local sequence weight utility value.

例如，首先可以将序列数据库划分为多个分区，并将该多个分区分配给第一阶段MapReduce中的多个Mapper。例如，可以将序列数据库划分为n个分区，并且将第1个分区分配给第一阶段MapReduce中的Mapper 1，......，将第k个分区分配给第一阶段MapReduce中的Mapper k，......，将第n个分区分配给第一阶段MapReduce中的Mapper n，其中1≤k≤n且为正整数。For example, the sequence database may be first divided into a plurality of partitions, and the plurality of partitions may be assigned to a plurality of mappers in the first stage MapReduce. For example, the sequence database may be divided into n partitions, and the 1st partition may be assigned to Mapper 1 in the first stage MapReduce, ..., the kth partition may be assigned to Mapper k in the first stage MapReduce, ..., the nth partition may be assigned to Mapper n in the first stage MapReduce, where 1≤k≤n is a positive integer.

然后，对于第k个分区中的每个序列，Mapper k可以确定该序列的效用值。例如，Mapper k可以按照传统的计算序列的效用值的方法来确定该序列的效用值。例如，序列的效用值可以为组成该序列的各个项集在该序列中的效用值的加和。在本公开中，序列s_l的效用值可以表示为u(s_l)。Then, for each sequence in the kth partition, Mapper k can determine the utility value of the sequence. For example, Mapper k can determine the utility value of the sequence according to a conventional method for calculating the utility value of a sequence. For example, the utility value of a sequence can be the sum of the utility values of each item set constituting the sequence in the sequence. In the present disclosure, the utility value of a sequence s _l can be expressed as u(s _l ).

然后，对于该序列中的每个项，Mapper k可以生成键值对，并且该键值对可以由该项和该序列的效用值构成。例如，对于序列s_l中的项i，Mapper k可以生成键值对(i，u(s_l))。由此可见，序列的序列标识和该序列的内容可以作为一个键值对输入Mapper k，然后，Mapper k输出一个或多个新的键值对。Then, for each item in the sequence, Mapper k can generate a key-value pair, and the key-value pair can be composed of the item and the utility value of the sequence. For example, for item i in sequence s _l , Mapper k can generate a key-value pair (i, u(s _l )). It can be seen that the sequence identifier of the sequence and the content of the sequence can be input into Mapper k as a key-value pair, and then Mapper k outputs one or more new key-value pairs.

此外，由于每个分区中的不同序列可能包含同一个项，因此，对于这些不同序列中的同一个项，Mapper可以生成多个键值对。在这种情形下，可以给每个Mapper配置一个组合模块(例如可以称为combiner)，以确定同一个项在每个分区的局部序列权重效用值。具体地，该项在序列数据库的每个分区的局部序列权重效用值可以是根据该分区中包括该项的序列的效用值确定的。例如，该项在序列数据库的每个分区的局部序列权重效用值可以是该分区中包括该项的序列的效用值之和。通过这种方式，可以减少下面将要描述的Reducer的工作负载，从而降低了对通信成本和运输时间的要求。例如，可以通过下面的公式(1)来确定项i在序列数据库的第k个分区的局部序列权重效用值：In addition, since different sequences in each partition may contain the same item, the Mapper can generate multiple key-value pairs for the same item in these different sequences. In this case, each Mapper can be configured with a combination module (for example, it can be called a combiner) to determine the local sequence weight utility value of the same item in each partition. Specifically, the local sequence weight utility value of the item in each partition of the sequence database can be determined based on the utility value of the sequence including the item in the partition. For example, the local sequence weight utility value of the item in each partition of the sequence database can be the sum of the utility values of the sequence including the item in the partition. In this way, the workload of the Reducer to be described below can be reduced, thereby reducing the requirements for communication costs and transportation time. For example, the local sequence weight utility value of item i in the kth partition of the sequence database can be determined by the following formula (1):

其中，i表示项，D_k表示序列数据库的第k个分区，s表示包括该项的序列，u(s)表示序列的效用值。Among them, i represents an item, _Dk represents the kth partition of the sequence database, s represents the sequence including the item, and u(s) represents the utility value of the sequence.

下面以一个具体示例来描述确定一个项在序列数据库的一个分区的局部序列权重效用值的过程。例如，在序列数据库的第k个分区包括序列s₁和序列s₂的示例中，Mapper k可以确定序列s₁和和序列s₂的效用值分别为u(s₁)和u(s₂)。然后，对于序列s₁中的每个项，即项a、项b、项c、项d和项e，Mapper k可以生成键值对(a，u(s₁))、(b，u(s₁))、(c，u(s₁))、(d，u(s₁))、(e，u(s₁))。对于序列s₂中的每个项，即项a、项b、项c、项d和项e，Mapper k可以生成键值对(a，u(s₂))、(b，u(s₂))、(c，u(s₂))、(d，u(s₂))、(e，u(s₂))。因此，对于项a，存在两个键值对，即(a，u(s₁))和(a，u(s₂))。项a的这两个键值对也可以表示为(a，l_u)，其中l_u是包括u(s₁)和u(s₂)的集合。然后，组合模块对集合l_u中的元素求和，即u(s₁)+u(s₂)，以获得项a在第k个分区的局部序列权重效用值LSWU_a-k＝u(s₁)+u(s₂)。类似地，可以获得项b、项c、项d和项e在第k个分区的局部序列权重效用值。The following is a specific example to describe the process of determining the local sequence weighted utility value of an item in a partition of a sequence database. For example, in an example where the kth partition of the sequence database includes sequence s ₁ and sequence s ₂ , Mapper k can determine that the utility values of sequence s ₁ and sequence s ₂ are u(s ₁ ) and u(s ₂ ), respectively. Then, for each item in sequence s ₁ , namely item a, item b, item c, item d, and item e, Mapper k can generate key-value pairs (a, u(s ₁ )), (b, u(s ₁ )), (c, u(s ₁ )), (d, u(s ₁ )), (e, u(s ₁ )). For each item in sequence s ₂ , namely item a, item b, item c, item d and item e, Mapper k can generate key-value pairs (a, u(s ₂ )), (b, u(s ₂ )), (c, u(s ₂ )), (d, u(s ₂ )), (e, u(s ₂ )). Therefore, for item a, there are two key-value pairs, namely (a, u(s ₁ )) and (a, u(s ₂ )). These two key-value pairs of item a can also be expressed as (a, _lu ), where _lu is a set including u(s ₁ ) and u(s ₂ ). Then, the combination module sums the elements in the set _lu , namely u(s ₁ )+u(s ₂ ), to obtain the local sequence weight utility value LSWU _ak =u(s ₁ )+u(s ₂ ) of item a in the kth partition. Similarly, the local sequence weighted utility values of item b, item c, item d, and item e in the kth partition can be obtained.

由此可见，Mapper k输出的键值对可以作为与该Mapper k对应的组合模块的输入，然后该组合模块生成新的键值对。该新的键值对可以由项和该项在第k个分区的局部序列权重效用值构成。例如，对于项i，与Mapper k对应的组合模块可以生成键值对(i，LSWU_i-k)。在项i为项a的示例中，与Mapper k对应的组合模块可以输出键值对(a，LSWU_a-k)。It can be seen that the key-value pair output by Mapper k can be used as the input of the combination module corresponding to Mapper k, and then the combination module generates a new key-value pair. The new key-value pair can be composed of an item and the local sequence weighted utility value of the item in the kth partition. For example, for item i, the combination module corresponding to Mapper k can generate a key-value pair (i, LSWU _ik ). In the example where item i is item a, the combination module corresponding to Mapper k can output a key-value pair (a, LSWU _ak ).

通过上面的方式，可以确定每个项在序列数据库的各个分区的局部序列权重效用值。在确定了每个项在序列数据库的各个分区的局部序列权重效用值之后，可以根据所确定的局部序列权重效用值确定该项的全局序列权重效用值。例如，可以将每个项在序列数据库的各个分区的局部序列权重效用值之和，作为该项的全局序列权重效用值。By the above method, the local sequence weighted utility value of each item in each partition of the sequence database can be determined. After the local sequence weighted utility value of each item in each partition of the sequence database is determined, the global sequence weighted utility value of the item can be determined according to the determined local sequence weighted utility value. For example, the sum of the local sequence weighted utility values of each item in each partition of the sequence database can be used as the global sequence weighted utility value of the item.

具体地，可以将多个组合模块的输出中、键值相同的键值对输入至第一阶段MapReduce中的一个Reducer中。也就是说，将多个组合模块的输出中、与同一个项对应的多个键值对，例如，与项i对应的多个键值对(i，LSWU_i-k)，输入至一个Reducer中。该Reducer可以将这多个键值对中的局部序列权重效用值的加和，作为项i的全局序列权重效用值GSWU_i。例如，可以通过下面的公式(2)来确定项i的全局序列权重效用值：Specifically, the key-value pairs with the same key value in the output of multiple combination modules can be input into a Reducer in the first stage MapReduce. That is to say, multiple key-value pairs corresponding to the same item in the output of multiple combination modules, for example, multiple key-value pairs (i, LSWU _ik ) corresponding to item i, are input into a Reducer. The Reducer can sum the local sequence weighted utility values in these multiple key-value pairs as the global sequence weighted utility value GSWU _i of item i. For example, the global sequence weighted utility value of item i can be determined by the following formula (2):

其中，GSWU(i，D)表示项i在序列数据库D中的全局序列权重效用值，D_k表示序列数据库的第k个分区，LSWU(i，D_k)表示项i在序列数据库的第k个分区的局部序列权重效用值。Wherein, GSWU(i, D) represents the global sequence weighted utility value of item i in the sequence database D, D _k represents the kth partition of the sequence database, and LSWU(i, D _k ) represents the local sequence weighted utility value of item i in the kth partition of the sequence database.

至此，已经描述了确定序列数据库中每个项的全局序列权重效用值的过程。在确定了序列数据库中每个项的全局序列权重效用值之后，第一阶段MapReduce中的各个Reducer可以将全局序列权重效用值高于或等于第一阈值的项确定为第一类项，并丢弃全局序列权重效用值小于第一阈值的项。每个Reducer可以输出一个或多个新的键值对，其中每个新的键值对可以由一个第一类项和该第一类项的全局序列权重效用值构成。例如，当项i是第一类项时，某个Reducer可以输出键值对(i，GSWU_i)。So far, the process of determining the global sequence weighted utility value of each item in the sequence database has been described. After determining the global sequence weighted utility value of each item in the sequence database, each Reducer in the first stage MapReduce can determine the items whose global sequence weighted utility value is higher than or equal to the first threshold as first-class items, and discard the items whose global sequence weighted utility value is less than the first threshold. Each Reducer can output one or more new key-value pairs, where each new key-value pair can be composed of a first-class item and the global sequence weighted utility value of the first-class item. For example, when item i is a first-class item, a Reducer can output a key-value pair (i, GSWU _i ).

这里所描述的“第一阈值”可以是根据数据库的总效用值和阈值因子而确定的。例如，“第一阈值”可以是数据库的总效用值和阈值因子的乘积。可以根据传统的计算数据库的总效用值的方法来确定数据库的总效用值。例如，数据库的总效用值可以是数据库中各个事务的效用值的加和。数据库的总效用值可以表示为u(D)。阈值因子可以是预先设置的，其可以表示为δ。因此，第一阈值可以表示为δ×u(D)。The "first threshold" described here may be determined based on the total utility value of the database and the threshold factor. For example, the "first threshold" may be the product of the total utility value of the database and the threshold factor. The total utility value of the database may be determined according to a conventional method for calculating the total utility value of a database. For example, the total utility value of a database may be the sum of the utility values of each transaction in the database. The total utility value of the database may be represented as u(D). The threshold factor may be pre-set, which may be represented as δ. Therefore, the first threshold may be represented as δ×u(D).

通过步骤S201，可以识别出有希望构成高效用序列模式的项。未被识别出的项可以被丢弃，并且不再需要考虑。通过步骤S201，用于搜索高效用序列模式的空间比原始的搜索空间减小了很多，从而提高了搜索速度，加快了挖掘速度。Through step S201, items that are expected to form high-utility sequence patterns can be identified. Unidentified items can be discarded and no longer need to be considered. Through step S201, the space used to search for high-utility sequence patterns is much smaller than the original search space, thereby improving the search speed and accelerating the mining speed.

返回图2，在步骤S202中，确定序列数据库中各个序列的效用值链表。步骤S202可以在步骤S201之前或之后执行，也可以和步骤S201同步执行。Returning to Fig. 2, in step S202, the utility value chain list of each sequence in the sequence database is determined. Step S202 may be performed before or after step S201, or may be performed simultaneously with step S201.

根据本公开的一个示例，对于序列数据库中的一个序列，可以根据该序列中各个项的效用值和各个项在该序列中的位置，确定该序列的效用值链表。项的效用值可以是项的内部效用值和外部效用值的乘积。各个项在序列中的位置可以包括各个项的初始位置和相邻位置，其中项的初始位置可以是项在序列中第一次出现的位置，相邻位置可以是项在序列中下一次出现的位置。此外，一个序列的效用值链表可以包括两行，其中第一行可以是关于各个项的效用值和相邻位置的信息(可以简称为Utility Position Information，UPinformation)，第二行可以是关于序列中的非重复项的初始位置的信息(可以简称为Header Table)。第二行可以包括非重复项和各个非重复项的初始位置。According to an example of the present disclosure, for a sequence in a sequence database, a utility value linked list of the sequence can be determined based on the utility values of each item in the sequence and the position of each item in the sequence. The utility value of an item can be the product of the internal utility value and the external utility value of the item. The position of each item in the sequence can include the initial position and adjacent position of each item, wherein the initial position of the item can be the position where the item first appears in the sequence, and the adjacent position can be the position where the item appears next in the sequence. In addition, the utility value linked list of a sequence can include two rows, wherein the first row can be information about the utility value and adjacent position of each item (can be referred to as Utility Position Information, UPinformation), and the second row can be information about the initial position of non-repeating items in the sequence (can be referred to as Header Table). The second row can include non-repeating items and the initial position of each non-repeating item.

下面的表3示出了表1中的序列s₁的效用值链表。如表3所示，序列s₁的效用值链表包括两行，第一行示出了序列s₁中各个项a、b、c、d、e的效用值和相邻位置，第二行示出了序列s₁中各个项a、b、c、d、e的初始位置。具体地，第一行中的元素(a，10,3)中的“a”表示序列s₁中的第1个项，“10”表示项a在序列s₁中的效用值为10，“3”表示项a在序列s₁中下一次出现的位置。第一行中的元素(c，8，-)中的“c”表示序列s₁中的第5个项，“8”表示项c在序列s₁中的效用值为8，“-”表示项c在序列s₁中没有下一个位置。第二行中的元素(a，1)中的“a”表示序列s₁中的项，“1”表示项a在序列s₁中的初始位置。Table 3 below shows the utility value chain table of sequence s ₁ in Table 1. As shown in Table 3, the utility value chain table of sequence s ₁ includes two rows. The first row shows the utility values and adjacent positions of each item a, b, c, d, and e in sequence s ₁ , and the second row shows the initial positions of each item a, b, c, d, and e in sequence s _1. Specifically, "a" in the element (a, 10, 3) in the first row represents the first item in sequence s ₁ , "10" represents the utility value of item a in sequence s ₁ is 10, and "3" represents the next position of item a in sequence s _1. "c" in the element (c, 8, -) in the first row represents the fifth item in sequence s ₁ , "8" represents the utility value of item c in sequence s ₁ is 8, and "-" represents that item c has no next position in sequence s _1. "a" in the element (a, 1) in the second row represents the item in sequence s ₁ , and "1" represents the initial position of item a in sequence s ₁ .

表3 序列s₁的效用值链表的示例Table 3 Example of utility value list of sequence s ₁

可以理解，序列的效用值链表是通过对原始数据库中的序列进行转换和扩展而形成的，其记录了关于原始数据库的信息和需要被计算的公共信息。通过序列的效用值链表，可以提高序列模式的计算速度。这是由于，目标序列模式在单个事务中可能有多个匹配项，因此，计算事务中序列模式的效用值需要查找所有匹配项，然后取最大效用值。效用值链表记录了事务中项的下一个位置，因此，不需要多次扫描事务，而只要连续搜索项的下一个位置就可以计算事务中序列模式的最大效用值。It can be understood that the utility value linked list of the sequence is formed by converting and expanding the sequence in the original database, which records information about the original database and the common information that needs to be calculated. The utility value linked list of the sequence can improve the calculation speed of the sequence pattern. This is because the target sequence pattern may have multiple matching items in a single transaction. Therefore, calculating the utility value of the sequence pattern in the transaction requires finding all matching items and then taking the maximum utility value. The utility value linked list records the next position of the item in the transaction. Therefore, there is no need to scan the transaction multiple times. Instead, the maximum utility value of the sequence pattern in the transaction can be calculated by continuously searching for the next position of the item.

返回图2，在步骤S203中，根据所确定的第一类项，从序列数据库挖掘至少一个候选的全局高效用序列模式并确定第一集合，其中所述第一集合包括所述至少一个候选的全局高效用序列模式、包括各个候选的全局高效用序列模式的序列的标识以及各个候选的全局高效用序列模式在相应序列中的效用值。步骤S203可以由上面所描述的局部挖掘部分120(即第二阶段MapReduce)来进行。Returning to FIG. 2 , in step S203 , based on the determined first category item, at least one candidate global high-utility sequence pattern is mined from the sequence database and a first set is determined, wherein the first set includes the at least one candidate global high-utility sequence pattern, the identifier of the sequence including each candidate global high-utility sequence pattern, and the utility value of each candidate global high-utility sequence pattern in the corresponding sequence. Step S203 can be performed by the local mining part 120 (i.e., the second stage MapReduce) described above.

根据本公开的一个示例，执行步骤S203之前，可以将序列数据库中的序列分配到多个任务(task)中。任务的数量可以表示为m，其中m为正整数。例如，m可以为第二阶段MapReduce中的Mapper的数量的倍数。在下面的示例中，以m等于第二阶段MapReduce中的Mapper的数量为例来描述本公开。According to an example of the present disclosure, before executing step S203, the sequences in the sequence database may be assigned to multiple tasks. The number of tasks may be represented as m, where m is a positive integer. For example, m may be a multiple of the number of mappers in the second stage MapReduce. In the following example, the present disclosure is described by taking m as equal to the number of mappers in the second stage MapReduce as an example.

在该示例中，可以根据负载均衡算法将序列数据库中的序列划分为多个分区。例如，可以根据负载均衡算法将序列数据库中的序列分配到多个任务中。具体地，对于序列数据库中的一个序列，可以确定该序列包括的第一类项的数量(Num)。然后，从多个任务中选择具有最小工作负载的任务p，并将该序列分配到该任务p，同时根据该序列包括的第一类项的数量来更新该任务p的工作负载。例如，第p个任务的工作负载可以表示为WL_p，当一个序列被分配给该任务后，该任务的工作负载由WL_p更新为(WL_p+Num)。In this example, the sequences in the sequence database can be divided into multiple partitions according to the load balancing algorithm. For example, the sequences in the sequence database can be assigned to multiple tasks according to the load balancing algorithm. Specifically, for a sequence in the sequence database, the number (Num) of first-class items included in the sequence can be determined. Then, a task p with the smallest workload is selected from multiple tasks, and the sequence is assigned to the task p, and the workload of the task p is updated according to the number of first-class items included in the sequence. For example, the workload of the p-th task can be expressed as WL _p . When a sequence is assigned to the task, the workload of the task is updated from WL _p to (WL _p +Num).

此外，在该示例中，算法可以将每个任务的工作负载初始化为0。因此，在算法的第一次迭代中，由于每个任务的工作负载均为0，因此，对于序列数据库中的一个序列，可以从多个任务中随机选择一个任务，并将该序列分配到该任务。例如，可以从多个任务中选择第1个任务，并将该序列分配到第1个任务。In addition, in this example, the algorithm can initialize the workload of each task to 0. Therefore, in the first iteration of the algorithm, since the workload of each task is 0, for a sequence in the sequence database, a task can be randomly selected from multiple tasks and the sequence can be assigned to the task. For example, the first task can be selected from multiple tasks and the sequence can be assigned to the first task.

此外，这里所描述的“任务”也可以称为任务文件(task file)。下文中，可以替换地使用任务和任务文件。In addition, the "task" described here may also be referred to as a task file. Hereinafter, the terms task and task file may be used interchangeably.

通过上述负载均衡算法，可以避免划分数据库时导致节点之间的工作负载不均衡而影响挖掘算法，使得各个节点之间的工作负载均衡，从而有效地提高了挖掘计算的速度。Through the above load balancing algorithm, it is possible to avoid the impact of the mining algorithm on the workload imbalance between nodes caused by partitioning the database, so that the workload between each node is balanced, thereby effectively improving the speed of mining calculations.

步骤S203可以包括三个子步骤S2031～S2033。在步骤S2031中，可以根据所确定的第一类项，从序列数据库的各个分区挖掘局部高效用序列模式。然后，在步骤S2032中，可以根据所挖掘的局部高效用序列模式确定至少一个候选的全局高效用序列模式。然后，在步骤S2033中，可以确定第一集合。步骤S2033也可以和步骤S2033同时执行。Step S203 may include three sub-steps S2031 to S2033. In step S2031, local high-utility sequence patterns may be mined from each partition of the sequence database according to the determined first category items. Then, in step S2032, at least one candidate global high-utility sequence pattern may be determined according to the mined local high-utility sequence patterns. Then, in step S2033, the first set may be determined. Step S2033 may also be performed simultaneously with step S2033.

在本公开中，可以根据所确定的第一类项，从各个任务挖掘局部高效用序列模式。这些局部高效用序列模式中的一部分序列模式可能为全局高效用序列模式，另一部分序列模式可能不是全局高效用序列模式。可以将该另一部分序列模式作为候选的全局高效用序列模式。In the present disclosure, local high-utility sequence patterns can be mined from various tasks based on the determined first-category items. Some of these local high-utility sequence patterns may be global high-utility sequence patterns, and another part of the sequence patterns may not be global high-utility sequence patterns. The other part of the sequence patterns can be used as candidate global high-utility sequence patterns.

下面将描述在步骤S2031中从序列数据库的每个分区挖掘局部高效用序列模式的过程。具体地，对于每个分区包括的各个序列中属于第一类项的一个项，计算该项在各个序列中的效用值和剩余效用值，构建该项在各个序列中的效用列表(utility list)，根据该项在各个序列中的效用列表确定该项的效用值链；根据该分区中各个项的效用值链(utility chain)，从该分区挖掘局部高效用序列模式。The following describes the process of mining local high-utility sequence patterns from each partition of the sequence database in step S2031. Specifically, for an item belonging to the first category in each sequence included in each partition, the utility value and the residual utility value of the item in each sequence are calculated, a utility list of the item in each sequence is constructed, and the utility value chain of the item is determined according to the utility list of the item in each sequence; and according to the utility value chain of each item in the partition, a local high-utility sequence pattern is mined from the partition.

在本公开中，项在一个序列中的剩余效用值可以是该序列中、该项之后的所有项的效用值之和。此外，项在一个序列中的效用列表可以包括序列的标识信息(可表示为sid)、项所在的各个项集的标识信息(可表示为tid)、该各个项集中的该项在序列中的效用值(可表示为acu)和剩余效用值(可表示为ru)、以及从一个项集指向下一个项集的指示信息(例如，指针)(可表示为next)。此外，项的效用值链可以包括项在各个序列的效用列表。In the present disclosure, the residual utility value of an item in a sequence may be the sum of the utility values of all items in the sequence and after the item. In addition, the utility list of an item in a sequence may include the identification information of the sequence (which may be represented as sid), the identification information of each item set in which the item is located (which may be represented as tid), the utility value of the item in the sequence in each item set (which may be represented as acu) and the residual utility value (which may be represented as ru), and the indication information (e.g., pointer) from one item set to the next item set (which may be represented as next). In addition, the utility value chain of an item may include the utility list of the item in each sequence.

下面给出项在一个序列中的效用列表的一个示例。假设一个分区包括表1所示的序列s₁～s₅，项a属于第一类项，则对于序列s₁，可以确定序列的标识信息为1。此外，项a出现在序列s₁的第1个项集，因此，确定第1个项集中的项a在序列s₁中的效用值和剩余效用值，分别为10和84。由于项a还出现在序列s₁的第2个项集，因此，确定第2个项集中的项a在序列s₁中的效用值和剩余效用值，分别为15和57。由于项a还出现在序列s₁的第3个项集，因此，确定第3个项集中的项a在序列s₁中的效用值和剩余效用值，分别为20和26。因此，可以构建项a在序列s₁中的效用列表。图3示出了项a在序列s₁中的效用列表的示意图。如图3所示，第一组数据(1，1，10，84)中的第1个“1”表示序列s₁，第2个“1”表示序列s₁的第1个项集，“10”表示第1个项集中的项a在序列s₁中的效用值，“84”表示第1个项集中的项a在序列s₁中的剩余效用值。第二组数据(1，2，15,57)中的“1”表示序列s₁，“2”表示序列s₁的第2个项集，“15”表示第2个项集中的项a在序列s₁中的效用值，“57”表示第2个项集中的项a在序列s₁中的剩余效用值。第三组数据(1，3,20,26)中的“1”表示序列s₁，“3”表示序列s₁的第3个项集，“20”表示第3个项集中的项a在序列s₁中的效用值，“26”表示第3个项集中的项a在序列s₁中的剩余效用值。图3中的黑色箭头表示从一个项集指向下一个项集的指针。An example of a utility list of items in a sequence is given below. Assuming that a partition includes the sequences s ₁ to s ₅ shown in Table 1, and item a belongs to the first category of items, then for sequence s ₁ , the identification information of the sequence can be determined to be 1. In addition, item a appears in the first item set of sequence s ₁ , so the utility value and residual utility value of item a in the first item set in sequence s ₁ are determined to be 10 and 84, respectively. Since item a also appears in the second item set of sequence s ₁ , the utility value and residual utility value of item a in the second item set in sequence s ₁ are determined to be 15 and 57, respectively. Since item a also appears in the third item set of sequence s ₁ , the utility value and residual utility value of item a in the third item set in sequence s ₁ are determined to be 20 and 26, respectively. Therefore, the utility list of item a in sequence s ₁ can be constructed. FIG3 shows a schematic diagram of the utility list of item a in sequence s ₁ . As shown in Figure 3, the first "1" in the first set of data (1, 1, 10, 84) represents sequence s ₁ , the second "1" represents the first item set of sequence s ₁ , "10" represents the utility value of item a in the first item set in sequence s ₁ , and "84" represents the residual utility value of item a in the first item set in sequence s _1. In the second set of data (1, 2, 15, 57), "1" represents sequence s ₁ , "2" represents the second item set of sequence s ₁ , "15" represents the utility value of item a in the second item set in sequence s ₁ , and "57" represents the residual utility value of item a in the second item set in sequence s ₁ . In the third set of data (1, 3, 20, 26), "1" represents sequence s ₁ , "3" represents the third item set in sequence s ₁ , "20" represents the utility value of item a in the third item set in sequence s ₁ , and "26" represents the residual utility value of item a in the third item set in sequence s _1. The black arrows in Figure 3 represent pointers from one item set to the next item set.

下面给出项的效用值链的一个示例。在上面的示例中，类似地，可以确定项a在序列s₂～s₅中的效用列表。然后，可以根据项a在序列s₁～s₅中的效用列表确定项a的效用值链。图4示出了项a的效用值链的示意图。如图4所示，项a的效用值链包括项a在序列s₁中的效用列表、项a在序列s₂中的效用列表、项a在序列s₃中的效用列表、项a在序列s₄中的效用列表、以及项a在序列s₅中的效用列表。An example of a utility value chain of an item is given below. In the above example, similarly, the utility list of item a in sequence s ₂ to s ₅ can be determined. Then, the utility value chain of item a can be determined based on the utility list of item a in sequence s ₁ to s _5. FIG4 shows a schematic diagram of the utility value chain of item a. As shown in FIG4 , the utility value chain of item a includes the utility list of item a in sequence s ₁ , the utility list of item a in sequence s ₂ , the utility list of item a in sequence s ₃ , the utility list of item a in sequence s ₄ , and the utility list of item a in sequence s ₅ .

类似地，可以确定每个分区包括的各个序列中属于第一类项的各个项的效用值链。然后，可以根据该分区中各个项的效用值链，从该分区挖掘局部高效用序列模式。例如，可以将该分区中的各个项和各个项的效用值链作为传统的高效用序列模式算法(例如，HUS-Span算法)的输入，并通过该算法输出与该分区对应的一个或多个局部高效用序列模式。此外，通过该算法还可以输出每个局部高效用序列模式在相应序列中的效用值以及序列的标识信息。可以将该算法的输出表示为键值对(pattern，{sid，utility})，其中pattern表示局部高效用序列模式，sid表示包含局部高效用序列模式的序列的标识，utility表示局部高效用序列模式在相应序列中的效用值。Similarly, the utility value chains of the items belonging to the first category in the sequences included in each partition can be determined. Then, based on the utility value chains of the items in the partition, local high-utility sequence patterns can be mined from the partition. For example, the items in the partition and the utility value chains of the items can be used as inputs of a traditional high-utility sequence pattern algorithm (e.g., HUS-Span algorithm), and one or more local high-utility sequence patterns corresponding to the partition can be output through the algorithm. In addition, the algorithm can also output the utility value of each local high-utility sequence pattern in the corresponding sequence and the identification information of the sequence. The output of the algorithm can be represented as a key-value pair (pattern, {sid, utility}), where pattern represents a local high-utility sequence pattern, sid represents the identification of the sequence containing the local high-utility sequence pattern, and utility represents the utility value of the local high-utility sequence pattern in the corresponding sequence.

上述关于步骤S2031的操作可以由第二阶段MapReduce中的Mapper来进行。例如，序列数据库的多个分区可以分别由第二阶段MapReduce中的多个Mapper处理，从而各个Mapper可以从与其对应的分区挖掘局部高效用序列模式。在这种情形下，上面所描述的算法输出可以是Mapper的输出。也就是说，对于序列数据库的一个分区，与该分区对应的Mapper的输出为一个或多个键值对(pattern，{sid，utility})，其中该一个或多个pattern是从该分区挖掘的一个或多个局部高效用序列模式。The above operation of step S2031 can be performed by the Mapper in the second stage MapReduce. For example, multiple partitions of the sequence database can be processed by multiple Mappers in the second stage MapReduce respectively, so that each Mapper can mine local high-utility sequence patterns from the partition corresponding to it. In this case, the algorithm output described above can be the output of the Mapper. That is, for a partition of the sequence database, the output of the Mapper corresponding to the partition is one or more key-value pairs (pattern, {sid, utility}), where the one or more patterns are one or more local high-utility sequence patterns mined from the partition.

在步骤S2031之后，在步骤S2032中，可以根据所挖掘的局部高效用序列模式确定至少一个候选的全局高效用序列模式。例如，可以将多个Mapper的输出中、键值相同的键值对输入至第二阶段MapReduce中的一个Reducer中。也就是说，将多个Mapper的输出中、与同一个pattern对应的多个键值对，例如，与pattern x对应的多个键值对(pattern x，{sid，utility})，输入至一个Reducer中。该Reducer可以确定与pattern x对应的多个效用值的加和，并根据该加和和第一阈值，来确定pattern x是否为全局高效用序列模式。如果该加和大于或等于第一阈值，则确定pattern x是全局高效用序列模式。如果该加和小于第一阈值，则确定pattern x不是全局高效用序列模式，而是候选的全局高效用序列模式。After step S2031, in step S2032, at least one candidate global high-utility sequence pattern can be determined based on the mined local high-utility sequence pattern. For example, key-value pairs with the same key value in the output of multiple Mappers can be input into a Reducer in the second stage MapReduce. That is, multiple key-value pairs corresponding to the same pattern in the output of multiple Mappers, for example, multiple key-value pairs corresponding to pattern x (pattern x, {sid, utility}), are input into a Reducer. The Reducer can determine the sum of multiple utility values corresponding to pattern x, and determine whether pattern x is a global high-utility sequence pattern based on the sum and the first threshold. If the sum is greater than or equal to the first threshold, it is determined that pattern x is a global high-utility sequence pattern. If the sum is less than the first threshold, it is determined that pattern x is not a global high-utility sequence pattern, but a candidate global high-utility sequence pattern.

此外，每个Reducer可以将输出一个或多个新的键值对，每个新的键值对可以由一个候选的全局高效用模式、与该候选的高效用序列模式对应的序列的标识、以及该候选的高效用序列模式在序列中的效用值构成。例如，该新的键值对可以表示为(sid，(pattern，utility))，即更改了Mapper输出的键值对的形式。In addition, each Reducer may output one or more new key-value pairs, each of which may consist of a candidate global high-utility pattern, an identifier of a sequence corresponding to the candidate high-utility sequence pattern, and the utility value of the candidate high-utility sequence pattern in the sequence. For example, the new key-value pair may be expressed as (sid, (pattern, utility)), which means that the form of the key-value pair output by the Mapper is changed.

可以根据第二阶段MapReduce中的多个Reducer的输出来确定步骤S2032中的“至少一个候选的全局高效用序列模式”。例如，可以根据第二阶段MapReduce中的多个Reducer输出的键值对中的序列模式来确定步骤S2032中的“至少一个候选的全局高效用序列模式”。例如，多个Reducer的输出可以为(s₁，(pattern 1，utility 1))、(s₂，(pattern 1，utility 1))、(s₃，(pattern 2，utility 2))、(s₃，(pattern 1，utility 1))、(s₄，(pattern2，utility 2))，则步骤S2032中的“至少一个候选的全局高效用序列模式”可以为pattern1和pattern 2。The “at least one candidate global high-utility sequence pattern” in step S2032 may be determined according to the outputs of multiple Reducers in the second stage of MapReduce. For example, the “at least one candidate global high-utility sequence pattern” in step S2032 may be determined according to the sequence patterns in the key-value pairs output by multiple Reducers in the second stage of MapReduce. For example, the outputs of multiple Reducers may be (s ₁ , (pattern 1, utility 1)), (s ₂ , (pattern 1, utility 1)), (s ₃ , (pattern 2, utility 2)), (s ₃ , (pattern 1, utility 1)), (s ₄ , (pattern2, utility 2)), then the “at least one candidate global high-utility sequence pattern” in step S2032 may be pattern 1 and pattern 2.

此外，在步骤S2032之后，在步骤S2033中，可以确定第一集合。例如，可以根据第二阶段MapReduce中的多个Reducer的输出，确定第一集合。第一集合可以包括所述至少一个候选的全局高效用序列模式、包括各个候选的全局高效用序列模式的序列的标识以及各个候选的全局高效用序列模式在相应序列中的效用值。例如，第一集合可以包括多个子集合，每个子集合包括序列的标识、该序列包括的候选的全局高效用序列模式、以及该序列所包括的候选的全局高效用序列模式在该序列中的效用值。例如，第二阶段MapReduce中的多个Reducer的输出可以为(s₁，(pattern 1，utility 1))、(s₂，(pattern 1，utility 1))、(s₃，(pattern 2，utility 2))、(s₃，(pattern 1，utility 1))、(s₄，(pattern 2，utility 2))，则第一集合可以包括四个子集合，其中第1个子集合为(s₁，(pattern 1，utility 1))，第2个子集合为(s₂，(pattern 1，utility 1))，第3个子集合为(S₃，(pattern 2，utility 2)，(pattern 1，utility 1))，第4个子集合为(s₄，(pattern 2，utility 2)。In addition, after step S2032, in step S2033, a first set can be determined. For example, the first set can be determined based on the outputs of multiple Reducers in the second stage MapReduce. The first set may include the at least one candidate global high-utility sequence pattern, the identifier of the sequence including each candidate global high-utility sequence pattern, and the utility value of each candidate global high-utility sequence pattern in the corresponding sequence. For example, the first set may include multiple subsets, each subset including the identifier of the sequence, the candidate global high-utility sequence pattern included in the sequence, and the utility value of the candidate global high-utility sequence pattern included in the sequence in the sequence. For example, the outputs of multiple Reducers in the second stage of MapReduce can be ( _s1 , (pattern 1, utility 1)), ( _s2 , (pattern 1, utility 1)), ( _s3 , (pattern 2, utility 2)), ( _s3 , (pattern 1, utility 1)), ( _s4 , (pattern 2, utility 2)), then the first set can include four subsets, the first subset is ( _s1 , (pattern 1, utility 1)), the second subset is ( _s2 , (pattern 1, utility 1)), the third subset is ( _S3 , (pattern 2, utility 2), (pattern 1, utility 1)), and the fourth subset is ( _s4 , (pattern 2, utility 2).

可以理解，通过第一集合这种数据结构，可以加快候选的全局高效用序列模式的效用值的计算。具体地，如果，序列包括一个候选的全局高效用序列模式，则该候选的全局高效用序列模式的效用值可以直接从第一集合获得，而不需要再次计算它的效用值，因为重复计算会花费很多时间。It can be understood that the calculation of the utility value of the candidate global high-utility sequence pattern can be accelerated by the data structure of the first set. Specifically, if the sequence includes a candidate global high-utility sequence pattern, the utility value of the candidate global high-utility sequence pattern can be directly obtained from the first set without calculating its utility value again, because repeated calculation will take a lot of time.

在上面的示例中，没有为第二阶段MapReduce中的Mapper配置相应的组合模块。然而，本公开不限于此。例如，也可以为第二阶段MapReduce中的Mapper配置相应的组合模块。In the above example, no corresponding combination module is configured for the Mapper in the second stage MapReduce. However, the present disclosure is not limited to this. For example, a corresponding combination module may also be configured for the Mapper in the second stage MapReduce.

返回图2，在步骤S204中，根据各个序列的效用值链表和所述第一集合，从所述至少一个候选的全局高效用序列模式挖掘全局高效用序列模式。步骤S204可以由上面所描述的整合部分130(即第三阶段MapReduce)来进行。Returning to Fig. 2, in step S204, according to the utility value chain list of each sequence and the first set, a global high-utility sequence pattern is mined from the at least one candidate global high-utility sequence pattern. Step S204 can be performed by the integration part 130 (ie, the third stage MapReduce) described above.

下面将结合图5来具体描述步骤S204。图5是根据本公开实施例的从至少一个候选的全局高效用序列模式挖掘全局高效用序列模式的方法500的流程图。如图5所示，在步骤S501中，可以根据各个序列的效用值链表和所述第一集合，确定各个候选的全局高效用序列模式的局部效用值。Step S204 will be described in detail below in conjunction with FIG5. FIG5 is a flowchart of a method 500 for mining a global high utility sequence pattern from at least one candidate global high utility sequence pattern according to an embodiment of the present disclosure. As shown in FIG5, in step S501, the local utility value of each candidate global high utility sequence pattern can be determined based on the utility value linked list of each sequence and the first set.

具体地，可以将至少一个候选的全局高效用序列模式和第一集合作为第三阶段MapReduce中的多个Mapper的输入。例如，可以将至少一个候选的全局高效用序列模式划分为多组，然后将多组分别输入多个Mapper。此外，可以将第一集合输入每个Mapper。Specifically, at least one candidate global high-utility sequence pattern and the first set may be used as inputs of multiple mappers in the third stage MapReduce. For example, at least one candidate global high-utility sequence pattern may be divided into multiple groups, and then the multiple groups are respectively input into multiple mappers. In addition, the first set may be input into each mapper.

然后，每个Mapper可以确定与其对应的多个候选的全局高效用序列模式中的每个候选的全局高效用序列模式的效用值。例如，对于与一个Mapper对应的多个候选的全局高效用序列模式中的一个候选的全局高效用序列模式，可以通过Mapper判断第一集合是否包括该候选的全局高效用序列模式。当第一集合包括该候选的全局高效用序列模式时，可以根据第一集合确定该候选的全局高效用序列模式的效用值。此外，当第一集合不包括该候选的全局高效用序列模式时，可以根据序列的效用值链表来确定该候选的全局高效用序列模式的效用值。Then, each Mapper can determine the utility value of each candidate global high utility sequence pattern among the multiple candidate global high utility sequence patterns corresponding to it. For example, for a candidate global high utility sequence pattern among the multiple candidate global high utility sequence patterns corresponding to a Mapper, the Mapper can determine whether the first set includes the candidate global high utility sequence pattern. When the first set includes the candidate global high utility sequence pattern, the utility value of the candidate global high utility sequence pattern can be determined according to the first set. In addition, when the first set does not include the candidate global high utility sequence pattern, the utility value of the candidate global high utility sequence pattern can be determined according to the utility value chain list of the sequence.

这是由于，当已经计算出候选的全局高效用序列模式的效用值时，可以通过查询包括该候选的全局高效用序列模式的序列的sidset而直接获得该候选的全局高效用序列模式的效用值。然而，当没有计算出候选的全局高效用序列模式的效用值时，需要检查其是否在特定序列出现。如果出现这种情形，需要按照该特定序列计算候选的全局高效用序列模式的效用值。需要注意的是，该操作的计算是耗时的，因为需要扫描该特定序列，并且候选的全局高效用序列模式在该特定序列中可能存在多个匹配。因此，需要多次扫描该特定序列，以找到最大匹配作为候选的全局高效用序列模式在该特定序列中的效用值。因此，要完成挖掘任务，必须对整个序列数据库进行多次扫描。本公开提出的序列的效用值链表，是个紧凑的数据结构，适用于处理大数据问题。This is because, when the utility value of the candidate global high-utility sequence pattern has been calculated, the utility value of the candidate global high-utility sequence pattern can be directly obtained by querying the sidset of the sequence including the candidate global high-utility sequence pattern. However, when the utility value of the candidate global high-utility sequence pattern is not calculated, it is necessary to check whether it appears in a specific sequence. If this situation occurs, it is necessary to calculate the utility value of the candidate global high-utility sequence pattern according to the specific sequence. It should be noted that the calculation of this operation is time-consuming because the specific sequence needs to be scanned, and the candidate global high-utility sequence pattern may have multiple matches in the specific sequence. Therefore, it is necessary to scan the specific sequence multiple times to find the maximum match as the utility value of the candidate global high-utility sequence pattern in the specific sequence. Therefore, to complete the mining task, the entire sequence database must be scanned multiple times. The utility value linked list of the sequence proposed in the present disclosure is a compact data structure suitable for processing big data problems.

下面将结合具体示例来描述根据序列的效用值链表来确定候选的全局高效用序列模式的效用值的示例。例如，可以根据上述表2所示出的序列s₁的效用值链表来确定候选的全局高效用序列模式<[a，c]，b>的效用值。具体地，由于项a和项c处于同一项集，因此，可以根据项c的出现的位置找到所有a、c出现的位置，即第一个位置(1，2)且效用值为22，以及第二个位置(3，5)且效用值为23。对于项a、c满足的第一个位置(1，2)，可以找到项b满足的所有位置，即4和7，则可以计算项a、c、b合起来的效用值为22+3＝25、以及22+15＝37。对于项a、c满足的第二个位置(3，5)，可以找到项b满足的所有位置，即7，则可以计算项a、c、b合起来的效用值为23+15＝38。因此，序列模式<[a，c]，b>的效用值为max{25，37，38}＝38。The following will describe an example of determining the utility value of a candidate global high-utility sequence pattern according to a utility value chain table of a sequence in conjunction with a specific example. For example, the utility value of a candidate global high-utility sequence pattern <[a, c], b> can be determined according to the utility value chain table of the sequence _s 1 shown in Table 2 above. Specifically, since item a and item c are in the same item set, all positions where a and c appear can be found according to the position where item c appears, i.e., the first position (1, 2) with a utility value of 22, and the second position (3, 5) with a utility value of 23. For the first position (1, 2) satisfied by items a and c, all positions satisfied by item b can be found, i.e., 4 and 7, and then the utility values of items a, c, and b combined can be calculated as 22+3=25 and 22+15=37. For the second position (3, 5) satisfied by items a and c, all positions satisfied by item b can be found, i.e., 7, and then the utility value of items a, c, and b combined can be calculated as 23+15=38. Therefore, the utility value of the sequence pattern <[a, c], b> is max{25, 37, 38}=38.

在本公开中，第三阶段MapReduce中的每个Mapper可以输出一个或多个新的键值对，其中每个新的键值对可以由一个候选的全局高效用序列模式和其效用值构成。例如，该新的键值对可以表示为(pattern，utility)。In the present disclosure, each Mapper in the third stage MapReduce can output one or more new key-value pairs, wherein each new key-value pair can be composed of a candidate global high-utility sequence pattern and its utility value. For example, the new key-value pair can be expressed as (pattern, utility).

此外，同一个Mapper可能输出与同一个候选的全局高效用序列模式对应的多个键值对(pattern，utility)。例如，对于候选的全局高效用序列模式pattern y，同一个Mapper可能输出两个键值对，分别为(pattern y，utility 1)和(pattern y，utility 2)。这两个键值对也可以表示为(pattern y，G_u)，其中G_u是包括utility 1和utility 2的集合。In addition, the same Mapper may output multiple key-value pairs (pattern, utility) corresponding to the same candidate global high-utility sequence pattern. For example, for the candidate global high-utility sequence pattern pattern y, the same Mapper may output two key-value pairs, namely (pattern y, utility 1) and (pattern y, utility 2). These two key-value pairs can also be expressed as (pattern y, G _u ), where G _u is a set including utility 1 and utility 2.

此外，在这种情形下，可以给每个Mapper配置一个组合模块(例如可以称为combiner)，以确定同一个候选的全局高效用序列模式的局部效用值。具体地，同一个候选的全局高效用序列模式的局部效用值可以是根据与该候选的全局高效用序列模式对应的多个键值对中的效用值确定的。例如，同一个候选的全局高效用序列模式的局部效用值可以是与该候选的全局高效用序列模式对应的多个键值对中的效用值的加和。例如，对于候选的全局高效用序列模式pattern y，同一个Mapper可能输出两个键值对，分别为(patterny，utility 1)和(pattern y，utility 2)，则对于候选的全局高效用序列模式pattern y的局部效用值local～utility为(utility 1+utility 2)。In addition, in this case, each Mapper can be configured with a combination module (for example, it can be called a combiner) to determine the local utility value of the same candidate global high-utility sequence pattern. Specifically, the local utility value of the same candidate global high-utility sequence pattern can be determined based on the utility values in multiple key-value pairs corresponding to the candidate global high-utility sequence pattern. For example, the local utility value of the same candidate global high-utility sequence pattern can be the sum of the utility values in multiple key-value pairs corresponding to the candidate global high-utility sequence pattern. For example, for the candidate global high-utility sequence pattern pattern y, the same Mapper may output two key-value pairs, namely (patterny, utility 1) and (pattern y, utility 2), then the local utility value local~utility for the candidate global high-utility sequence pattern pattern y is (utility 1+utility 2).

在本公开中，组合模块也可以输出一个或多个新的键值对，其中每个新的键值对可以由一个候选的全局高效用序列模式和其局部效用值构成。例如，该新的键值对可以表示为(pattern，local-utility)。在候选的全局高效用序列模式为pattern y的示例中，该组合模式可以输出键值对(pattern y，utility 1+utility 2)。In the present disclosure, the combination module may also output one or more new key-value pairs, wherein each new key-value pair may be composed of a candidate global high-utility sequence pattern and its local utility value. For example, the new key-value pair may be expressed as (pattern, local-utility). In the example where the candidate global high-utility sequence pattern is pattern y, the combination pattern may output a key-value pair (pattern y, utility 1+utility 2).

返回图5，在步骤S502中，可以根据各个候选的全局高效用序列模式的局部效用值，确定各个候选的全局高效用序列模式的全局效用值。例如，对于每个候选的全局高效用序列模式，可以根据该候选的全局高效用序列模式的多个局部效用值，来确定该候选的全局高效用序列模式的全局效用值。例如，可以将该候选的全局高效用序列模式的多个局部效用值的加和，作为该候选的全局高效用序列模式的全局效用值。Returning to FIG. 5, in step S502, the global utility value of each candidate global high utility sequence pattern can be determined according to the local utility value of each candidate global high utility sequence pattern. For example, for each candidate global high utility sequence pattern, the global utility value of the candidate global high utility sequence pattern can be determined according to multiple local utility values of the candidate global high utility sequence pattern. For example, the sum of multiple local utility values of the candidate global high utility sequence pattern can be used as the global utility value of the candidate global high utility sequence pattern.

具体地，可以将多个组合模块的输出中、键值相同的键值对输入至第三阶段MapReduce中的一个Reducer中。也就是说，将多个组合模块的输出中、与同一个候选的全局高效用序列模式对应的多个键值对，例如，与候选的全局高效用序列模式pattern y对应的多个键值对，输入至一个Reducer中。该Reducer可以将这多个键值对中的局部效用值的加和，作为候选的全局高效用序列模式的全局效用值(global-utility)。Specifically, the key-value pairs with the same key value in the output of multiple combination modules can be input into a Reducer in the third stage MapReduce. That is, multiple key-value pairs corresponding to the same candidate global high-utility sequence pattern in the output of multiple combination modules, for example, multiple key-value pairs corresponding to the candidate global high-utility sequence pattern pattern y, are input into a Reducer. The Reducer can sum the local utility values in these multiple key-value pairs as the global-utility value of the candidate global high-utility sequence pattern.

然后，在步骤S503中，可以将全局效用值大于第一阈值的序列模式确定为全局高效用序列模式。例如，第三阶段MapReduce中的各个Reducer可以将全局效用值高于或等于第一阈值的序列模式确定为全局高效用序列模式。每个Reducer可以输出一个或多个新的键值对，其中每个新的键值对可以由一个全局高效用序列模式和该全局高效用序列模式的全局效用值构成。例如，当pattern y是全局高效用序列模式时，某个Reducer可以输出键值对(pattern y，global-utility)。因此，第三阶段MapReduce中的各个Reducer输出的键值对中的序列模式均为全局高效用序列模式。Then, in step S503, a sequence pattern whose global utility value is greater than the first threshold value can be determined as a global high-utility sequence pattern. For example, each Reducer in the third stage MapReduce can determine a sequence pattern whose global utility value is higher than or equal to the first threshold value as a global high-utility sequence pattern. Each Reducer can output one or more new key-value pairs, wherein each new key-value pair can be composed of a global high-utility sequence pattern and the global utility value of the global high-utility sequence pattern. For example, when pattern y is a global high-utility sequence pattern, a certain Reducer can output a key-value pair (pattern y, global-utility). Therefore, the sequence patterns in the key-value pairs output by each Reducer in the third stage MapReduce are all global high-utility sequence patterns.

通过本实施例提供的挖掘全局高效用序列模式的方法，确定了序列数据库中各个序列的效用值链表和第一集合，并根据这两种数据结构来挖掘全局高效用序列模式，节省了大量时间，加快了在序列数据库中计算全局效用值的计算过程，加快了挖掘速度，降低了时间复杂度。Through the method for mining global high-utility sequence patterns provided by this embodiment, the utility value linked list and the first set of each sequence in the sequence database are determined, and the global high-utility sequence patterns are mined based on these two data structures, which saves a lot of time, speeds up the calculation process of calculating the global utility value in the sequence database, speeds up the mining speed, and reduces the time complexity.

以下，参照图6来描述根据本公开实施例的与图2所示的方法对应的装置。图6示出了根据本公开实施例的用于挖掘全局高效用序列模式的装置600的结构示意图。由于装置600的功能与在上文中参照图2描述的方法的细节相同，因此在这里为了简单起见，省略对相同内容的详细描述。如图6所示，装置600包括：第一确定单元610，被配置为确定序列数据库中的第一类项，其中第一类项是全局序列权重效用值高于第一阈值的项；第二确定单元620，被配置为确定序列数据库中各个序列的效用值链表；第一挖掘单元630，被配置为根据所确定的第一类项，从序列数据库挖掘至少一个候选的全局高效用序列模式并确定第一集合，其中所述第一集合包括所述至少一个候选的全局高效用序列模式、包括各个候选的全局高效用序列模式的序列的标识以及各个候选的全局高效用序列模式在相应序列中的效用值；以及第二挖掘单元640，被配置为根据各个序列的效用值链表和所述第一集合，从所述至少一个候选的全局高效用序列模式中挖掘全局高效用序列模式。除了这四个单元以外，装置600还可以包括其他部件，然而，由于这些部件与本公开实施例的内容无关，因此在这里省略其图示和描述。Hereinafter, a device corresponding to the method shown in FIG. 2 according to an embodiment of the present disclosure is described with reference to FIG. 6. FIG. 6 shows a schematic diagram of the structure of a device 600 for mining global high-utility sequence patterns according to an embodiment of the present disclosure. Since the functions of the device 600 are the same as the details of the method described above with reference to FIG. 2, a detailed description of the same contents is omitted here for simplicity. As shown in FIG6 , the device 600 includes: a first determination unit 610, configured to determine a first category of items in a sequence database, wherein the first category of items are items whose global sequence weight utility values are higher than a first threshold; a second determination unit 620, configured to determine a utility value chain list of each sequence in the sequence database; a first mining unit 630, configured to mine at least one candidate global high-utility sequence pattern from the sequence database and determine a first set according to the determined first category of items, wherein the first set includes the at least one candidate global high-utility sequence pattern, the identifier of the sequence including each candidate global high-utility sequence pattern, and the utility value of each candidate global high-utility sequence pattern in the corresponding sequence; and a second mining unit 640, configured to mine a global high-utility sequence pattern from the at least one candidate global high-utility sequence pattern according to the utility value chain list of each sequence and the first set. In addition to these four units, the device 600 may also include other components, however, since these components are irrelevant to the content of the embodiment of the present disclosure, their illustration and description are omitted here.

第一确定单元610可以确定序列数据库中各个项的全局序列权重效用值，并将全局序列权重效用值高于第一阈值的项确定为第一类项。第一确定单元610可以是上面所描述的识别部分110(即第一阶段MapReduce)。The first determination unit 610 can determine the global sequence weight utility value of each item in the sequence database, and determine the items with global sequence weight utility values higher than the first threshold as first-category items. The first determination unit 610 can be the identification part 110 described above (ie, the first-stage MapReduce).

下面将描述第一确定单元610确定序列数据库中每个项的全局序列权重效用值的过程。根据本公开的一个示例，对于序列数据库中的每个项，第一确定单元610可以首先确定该项在序列数据库的各个分区的局部序列权重效用值(Local Sequence WeightUtility，LSWU)，然后根据所确定的局部序列权重效用值确定该项的全局序列权重效用值。The process of determining the global sequence weight utility value of each item in the sequence database by the first determination unit 610 will be described below. According to an example of the present disclosure, for each item in the sequence database, the first determination unit 610 may first determine the local sequence weight utility value (LSWU) of the item in each partition of the sequence database, and then determine the global sequence weight utility value of the item according to the determined local sequence weight utility value.

例如，首先，第一确定单元610可以将序列数据库划分为多个分区，并将该多个分区分配给第一阶段MapReduce中的多个Mapper。例如，可以将序列数据库划分为n个分区，并且将第1个分区分配给第一阶段MapReduce中的Mapper 1，......，将第k个分区分配给第一阶段MapReduce中的Mapper k，......，将第n个分区分配给第一阶段MapReduce中的Mappern，其中1≤k≤n且为正整数。For example, first, the first determination unit 610 may divide the sequence database into a plurality of partitions, and assign the plurality of partitions to a plurality of mappers in the first stage MapReduce. For example, the sequence database may be divided into n partitions, and the 1st partition is assigned to Mapper 1 in the first stage MapReduce, ..., the kth partition is assigned to Mapper k in the first stage MapReduce, ..., the nth partition is assigned to Mappern in the first stage MapReduce, where 1≤k≤n is a positive integer.

此外，由于每个分区中的不同序列可能包含同一个项，因此，对于这些不同序列中的同一个项，Mapper可以生成多个键值对。在这种情形下，可以给每个Mapper配置一个组合模块(例如可以称为combiner)，以确定同一个项在每个分区的局部序列权重效用值。具体地，该项在序列数据库的每个分区的局部序列权重效用值可以是根据该分区中包括该项的序列的效用值确定的。例如，该项在序列数据库的每个分区的局部序列权重效用值可以是该分区中包括该项的序列的效用值之和。In addition, since different sequences in each partition may contain the same item, the Mapper can generate multiple key-value pairs for the same item in these different sequences. In this case, each Mapper can be configured with a combination module (for example, it can be called a combiner) to determine the local sequence weight utility value of the same item in each partition. Specifically, the local sequence weight utility value of the item in each partition of the sequence database can be determined based on the utility value of the sequence including the item in the partition. For example, the local sequence weight utility value of the item in each partition of the sequence database can be the sum of the utility values of the sequence including the item in the partition.

通过上面的方式，第一确定单元610可以确定每个项在序列数据库的各个分区的局部序列权重效用值。在确定了每个项在序列数据库的各个分区的局部序列权重效用值之后，第一确定单元610可以根据所确定的局部序列权重效用值确定该项的全局序列权重效用值。例如，可以将每个项在序列数据库的各个分区的局部序列权重效用值之和，作为该项的全局序列权重效用值。In the above manner, the first determining unit 610 can determine the local sequence weight utility value of each item in each partition of the sequence database. After determining the local sequence weight utility value of each item in each partition of the sequence database, the first determining unit 610 can determine the global sequence weight utility value of the item according to the determined local sequence weight utility value. For example, the sum of the local sequence weight utility values of each item in each partition of the sequence database can be used as the global sequence weight utility value of the item.

具体地，可以将多个组合模块的输出中、键值相同的键值对输入至第一阶段MapReduce中的一个Reducer中。也就是说，将多个组合模块的输出中、与同一个项对应的多个键值对，例如，与项i对应的多个键值对(i，LSWU_i-k)，输入至一个Reducer中。该Reducer可以将这多个键值对中的局部序列权重效用值的加和，作为项i的全局序列权重效用值GSWU_i。Specifically, the key-value pairs with the same key value in the outputs of the multiple combination modules can be input into a Reducer in the first stage MapReduce. That is, multiple key-value pairs corresponding to the same item in the outputs of the multiple combination modules, for example, multiple key-value pairs (i, LSWU _ik ) corresponding to item i, are input into a Reducer. The Reducer can sum the local sequence weighted utility values in the multiple key-value pairs as the global sequence weighted utility value GSWU _i of item i.

至此，已经描述了确定序列数据库中每个项的全局序列权重效用值的过程。在确定了序列数据库中每个项的全局序列权重效用值之后，第一阶段MapReduce中的各个Reducer可以将全局序列权重效用值高于或等于第一阈值的项确定为第一类项。每个Reducer可以输出一个或多个新的键值对，其中每个新的键值对可以由一个第一类项和该第一类项的全局序列权重效用值构成。例如，当项i是第一类项时，某个Reducer可以输出键值对(i，GSWU_i)。So far, the process of determining the global sequence weighted utility value of each item in the sequence database has been described. After determining the global sequence weighted utility value of each item in the sequence database, each Reducer in the first stage MapReduce can determine the items whose global sequence weighted utility value is greater than or equal to the first threshold as first-class items. Each Reducer can output one or more new key-value pairs, where each new key-value pair can be composed of a first-class item and the global sequence weighted utility value of the first-class item. For example, when item i is a first-class item, a Reducer can output a key-value pair (i, GSWU _i ).

根据本公开的一个示例，对于序列数据库中的一个序列，第二确定单元620可以根据该序列中各个项的效用值和各个项在该序列中的位置，确定该序列的效用值链表。项的效用值可以是项的内部效用值和外部效用值的乘积。各个项在序列中的位置可以包括各个项的初始位置和相邻位置，其中项的初始位置可以是项在序列中第一次出现的位置，相邻位置可以是项在序列中下一次出现的位置。此外，一个序列的效用值链表可以包括两行，其中第一行可以是关于各个项的效用值和相邻位置的信息(可以简称为Utility PositionInformation，UP information)，第二行可以是关于序列中的非重复项的初始位置的信息(可以简称为Header Table)。第二行可以包括非重复项和各个非重复项的初始位置。According to an example of the present disclosure, for a sequence in a sequence database, the second determination unit 620 may determine the utility value chain table of the sequence according to the utility value of each item in the sequence and the position of each item in the sequence. The utility value of an item may be the product of the internal utility value and the external utility value of the item. The position of each item in the sequence may include the initial position and adjacent position of each item, wherein the initial position of the item may be the position where the item first appears in the sequence, and the adjacent position may be the position where the item appears next in the sequence. In addition, the utility value chain table of a sequence may include two rows, wherein the first row may be information about the utility value and adjacent position of each item (which may be referred to as Utility Position Information, UP information), and the second row may be information about the initial position of non-repeating items in the sequence (which may be referred to as Header Table). The second row may include non-repeating items and the initial positions of each non-repeating item.

在本公开中，第一挖掘单元630可以是上面所描述的局部挖掘部分120(即第二阶段MapReduce)。In the present disclosure, the first mining unit 630 may be the local mining part 120 (ie, the second stage MapReduce) described above.

根据本公开的一个示例，装置600还可以包括负载分配单元(图中未示出)，被配置为将序列数据库中的序列分配到多个任务(task)中。任务的数量可以表示为m，其中m为正整数。例如，m可以为第二阶段MapReduce中的Mapper的数量的倍数。在下面的示例中，以m等于第二阶段MapReduce中的Mapper的数量为例来描述本公开。According to an example of the present disclosure, the apparatus 600 may further include a load distribution unit (not shown in the figure), which is configured to distribute the sequences in the sequence database to a plurality of tasks. The number of tasks may be represented as m, where m is a positive integer. For example, m may be a multiple of the number of Mappers in the second stage MapReduce. In the following example, the present disclosure is described by taking m equal to the number of Mappers in the second stage MapReduce as an example.

在该示例中，负载分配单元可以根据负载均衡算法将序列数据库中的序列划分为多个分区。例如，可以根据负载均衡算法将序列数据库中的序列分配到多个任务中。具体地，对于序列数据库中的一个序列，可以确定该序列包括的第一类项的数量(Num)。然后，从多个任务中选择具有最小工作负载的任务p，并将该序列分配到该任务p，同时根据该序列包括的第一类项的数量来更新该任务p的工作负载。例如，第p个任务的工作负载可以表示为WL_p，当一个序列被分配给该任务后，该任务的工作负载由WL_p更新为(WL_p+Num)。In this example, the load distribution unit can divide the sequences in the sequence database into multiple partitions according to the load balancing algorithm. For example, the sequences in the sequence database can be distributed to multiple tasks according to the load balancing algorithm. Specifically, for a sequence in the sequence database, the number (Num) of first-category items included in the sequence can be determined. Then, a task p with the smallest workload is selected from multiple tasks, and the sequence is distributed to the task p, and the workload of the task p is updated according to the number of first-category items included in the sequence. For example, the workload of the p-th task can be expressed as WL _p , and when a sequence is distributed to the task, the workload of the task is updated from WL _p to (WL _p +Num).

在本公开中，第一挖掘单元630可以根据所确定的第一类项，从序列数据库的各个分区挖掘局部高效用序列模式。然后，第一挖掘单元630可以根据所挖掘的局部高效用序列模式确定至少一个候选的全局高效用序列模式。然后，第一挖掘单元630可以确定第一集合。In the present disclosure, the first mining unit 630 may mine local high-utility sequence patterns from each partition of the sequence database according to the determined first category items. Then, the first mining unit 630 may determine at least one candidate global high-utility sequence pattern according to the mined local high-utility sequence pattern. Then, the first mining unit 630 may determine the first set.

下面将描述第一挖掘单元630从序列数据库的每个分区挖掘局部高效用序列模式的过程。具体地，对于每个分区包括的各个序列中属于第一类项的一个项，计算该项在各个序列中的效用值和剩余效用值，构建该项在各个序列中的效用列表(utility list)，根据该项在各个序列中的效用列表确定该项的效用值链；根据该分区中各个项的效用值链(utility chain)，从该分区挖掘局部高效用序列模式。The following describes the process of mining local high-utility sequence patterns from each partition of the sequence database by the first mining unit 630. Specifically, for an item belonging to the first category in each sequence included in each partition, the utility value and the residual utility value of the item in each sequence are calculated, a utility list of the item in each sequence is constructed, and the utility value chain of the item is determined according to the utility list of the item in each sequence; and according to the utility value chain of each item in the partition, a local high-utility sequence pattern is mined from the partition.

类似地，可以确定每个分区包括的各个序列中属于第一类项的各个项的效用值链。然后，可以根据该分区中各个项的效用值链，从该分区挖掘局部高效用序列模式。例如，可以将该分区中的各个项和各个项的效用值链作为传统的高效用序列模式算法的输入，并通过该算法输出与该分区对应的一个或多个局部高效用序列模式。此外，通过该算法还可以输出每个局部高效用序列模式在相应序列中的效用值以及序列的标识信息。可以将该算法的输出表示为键值对(pattern，{sid，utility})，其中pattern表示局部高效用序列模式，sid表示包含局部高效用序列模式的序列的标识，utility表示局部高效用序列模式在相应序列中的效用值。Similarly, the utility value chains of the items belonging to the first category in the sequences included in each partition can be determined. Then, based on the utility value chains of the items in the partition, local high-utility sequence patterns can be mined from the partition. For example, the items in the partition and the utility value chains of the items can be used as inputs of a traditional high-utility sequence pattern algorithm, and one or more local high-utility sequence patterns corresponding to the partition can be output through the algorithm. In addition, the utility value of each local high-utility sequence pattern in the corresponding sequence and the identification information of the sequence can also be output through the algorithm. The output of the algorithm can be represented as a key-value pair (pattern, {sid, utility}), where pattern represents a local high-utility sequence pattern, sid represents the identification of the sequence containing the local high-utility sequence pattern, and utility represents the utility value of the local high-utility sequence pattern in the corresponding sequence.

上述操作可以由第二阶段MapReduce中的Mapper来进行。例如，序列数据库的多个分区可以分别由第二阶段MapReduce中的多个Mapper处理，从而各个Mapper可以从与其对应的分区挖掘局部高效用序列模式。在这种情形下，上面所描述的算法输出可以是Mapper的输出。也就是说，对于序列数据库的一个分区，与该分区对应的Mapper的输出为一个或多个键值对(pattern，{sid，utility})，其中该一个或多个pattern是从该分区挖掘的一个或多个局部高效用序列模式。The above operations can be performed by the Mapper in the second stage MapReduce. For example, multiple partitions of the sequence database can be processed by multiple Mappers in the second stage MapReduce respectively, so that each Mapper can mine local high-utility sequence patterns from the partitions corresponding to it. In this case, the algorithm output described above can be the output of the Mapper. That is, for a partition of the sequence database, the output of the Mapper corresponding to the partition is one or more key-value pairs (pattern, {sid, utility}), where the one or more patterns are one or more local high-utility sequence patterns mined from the partition.

然后，第一挖掘单元630可以根据所挖掘的局部高效用序列模式确定至少一个候选的全局高效用序列模式。例如，可以将多个Mapper的输出中、键值相同的键值对输入至第二阶段MapReduce中的一个Reducer中。也就是说，将多个Mapper的输出中、与同一个pattern对应的多个键值对，例如，与pattern x对应的多个键值对(pattern x，{sid，utility})，输入至一个Reducer中。该Reducer可以确定与pattern x对应的多个效用值的加和，并根据该加和和第一阈值，来确定pattern x是否为全局高效用序列模式。如果该加和大于或等于第一阈值，则确定pattern x是全局高效用序列模式。如果该加和小于第一阈值，则确定pattern x不是全局高效用序列模式，而是候选的全局高效用序列模式。Then, the first mining unit 630 can determine at least one candidate global high-utility sequence pattern based on the mined local high-utility sequence pattern. For example, the key-value pairs with the same key value in the output of multiple Mappers can be input into a Reducer in the second stage MapReduce. That is, multiple key-value pairs corresponding to the same pattern in the output of multiple Mappers, for example, multiple key-value pairs corresponding to pattern x (pattern x, {sid, utility}), are input into a Reducer. The Reducer can determine the sum of multiple utility values corresponding to pattern x, and determine whether pattern x is a global high-utility sequence pattern based on the sum and the first threshold. If the sum is greater than or equal to the first threshold, it is determined that pattern x is a global high-utility sequence pattern. If the sum is less than the first threshold, it is determined that pattern x is not a global high-utility sequence pattern, but a candidate global high-utility sequence pattern.

可以根据第二阶段MapReduce中的多个Reducer的输出来确定“至少一个候选的全局高效用序列模式”。例如，可以根据第二阶段MapReduce中的多个Reducer输出的键值对中的序列模式来确定步骤S2032中的“至少一个候选的全局高效用序列模式”。例如，多个Reducer的输出可以为(s₁，(pattern 1，utility 1))、(s₂，(pattern 1，utility 1))、(s₃，(pattern 2，utility 2))、(S₃，(pattern 1，utility 1))、(s₄，(pattern 2，utility 2))，则步骤S2032中的“至少一个候选的全局高效用序列模式”可以为pattern 1和pattern 2。The "at least one candidate global high-utility sequence pattern" may be determined based on the outputs of multiple Reducers in the second stage MapReduce. For example, the "at least one candidate global high-utility sequence pattern" in step S2032 may be determined based on the sequence patterns in the key-value pairs output by multiple Reducers in the second stage MapReduce. For example, the outputs of multiple Reducers may be (s ₁ , (pattern 1, utility 1)), (s ₂ , (pattern 1, utility 1)), (s ₃ , (pattern 2, utility 2)), (S ₃ , (pattern 1, utility 1)), (s ₄ , (pattern 2, utility 2)), then the "at least one candidate global high-utility sequence pattern" in step S2032 may be pattern 1 and pattern 2.

此外，第一挖掘单元630可以确定第一集合。例如，可以根据第二阶段MapReduce中的多个Reducer的输出，确定第一集合。第一集合可以包括所述至少一个候选的全局高效用序列模式、包括各个候选的全局高效用序列模式的序列的标识以及各个候选的全局高效用序列模式在相应序列中的效用值。例如，第一集合可以包括多个子集合，每个子集合包括序列的标识、该序列包括的候选的全局高效用序列模式、以及该序列所包括的候选的全局高效用序列模式在该序列中的效用值。例如，第二阶段MapReduce中的多个Reducer的输出可以为(s₁，(pattern 1，utility 1))、(s₂，(pattern 1，utility 1))、(s₃，(pattern 2，utility 2))、(s₃，(pattern 1，utility 1))、(s₄，(pattern 2，utility2))，则第一集合可以包括四个子集合，其中第1个子集合为(s1，(pattern 1，utility 1))，第2个子集合为(s₂，(pattern 1，utility 1))，第3个子集合为(s₃，(pattern 2，utility 2)，(pattern 1，utility 1))，第4个子集合为(s₄，(pattern 2，utility 2)。In addition, the first mining unit 630 can determine a first set. For example, the first set can be determined based on the outputs of multiple Reducers in the second stage MapReduce. The first set can include at least one candidate global high-utility sequence pattern, an identifier of a sequence including each candidate global high-utility sequence pattern, and a utility value of each candidate global high-utility sequence pattern in the corresponding sequence. For example, the first set may include multiple subsets, each subset including an identifier of a sequence, a candidate global high-utility sequence pattern included in the sequence, and a utility value of the candidate global high-utility sequence pattern included in the sequence in the sequence. For example, the outputs of multiple Reducers in the second stage of MapReduce can be ( _s1 , (pattern 1, utility 1)), ( _s2 , (pattern 1, utility 1)), ( _s3 , (pattern 2, utility 2)), ( _s3 , (pattern 1, utility 1)), ( _s4 , (pattern 2, utility2)), then the first set can include four subsets, the first subset is (s1, (pattern 1, utility 1)), the second subset is ( _s2 , (pattern 1, utility 1)), the third subset is ( _s3 , (pattern 2, utility 2), (pattern 1, utility 1)), and the fourth subset is ( _s4 , (pattern 2, utility 2).

此外，在本公开中，第二挖掘单元640可以是上面所描述的整合部分130(即第三阶段MapReduce)。Furthermore, in the present disclosure, the second mining unit 640 may be the integration part 130 (ie, the third stage MapReduce) described above.

第二挖掘单元640可以根据各个序列的效用值链表和所述第一集合，确定各个候选的全局高效用序列模式的局部效用值。The second mining unit 640 may determine the local utility value of each candidate global high-utility sequence pattern according to the utility value chain list of each sequence and the first set.

此外，在这种情形下，可以给每个Mapper配置一个组合模块(例如可以称为combiner)，以确定同一个候选的全局高效用序列模式的局部效用值。具体地，同一个候选的全局高效用序列模式的局部效用值可以是根据与该候选的全局高效用序列模式对应的多个键值对中的效用值确定的。例如，同一个候选的全局高效用序列模式的局部效用值可以是与该候选的全局高效用序列模式对应的多个键值对中的效用值的加和。例如，对于候选的全局高效用序列模式pattern y，同一个Mapper可能输出两个键值对，分别为(patterny，utility 1)和(pattern y，utility 2)，则对于候选的全局高效用序列模式pattern y的局部效用值local-utility为(utility 1+utility 2)。In addition, in this case, each Mapper can be configured with a combination module (for example, it can be called a combiner) to determine the local utility value of the same candidate global high-utility sequence pattern. Specifically, the local utility value of the same candidate global high-utility sequence pattern can be determined based on the utility values in multiple key-value pairs corresponding to the candidate global high-utility sequence pattern. For example, the local utility value of the same candidate global high-utility sequence pattern can be the sum of the utility values in multiple key-value pairs corresponding to the candidate global high-utility sequence pattern. For example, for the candidate global high-utility sequence pattern pattern y, the same Mapper may output two key-value pairs, namely (patterny, utility 1) and (pattern y, utility 2), then the local utility value local-utility for the candidate global high-utility sequence pattern pattern y is (utility 1+utility 2).

然后，第二挖掘单元640可以根据各个候选的全局高效用序列模式的局部效用值，确定各个候选的全局高效用序列模式的全局效用值。例如，对于每个候选的全局高效用序列模式，可以根据该候选的全局高效用序列模式的多个局部效用值，来确定该候选的全局高效用序列模式的全局效用值。例如，可以将该候选的全局高效用序列模式的多个局部效用值的加和，作为该候选的全局高效用序列模式的全局效用值。Then, the second mining unit 640 can determine the global utility value of each candidate global high utility sequence pattern according to the local utility value of each candidate global high utility sequence pattern. For example, for each candidate global high utility sequence pattern, the global utility value of the candidate global high utility sequence pattern can be determined according to multiple local utility values of the candidate global high utility sequence pattern. For example, the sum of multiple local utility values of the candidate global high utility sequence pattern can be used as the global utility value of the candidate global high utility sequence pattern.

然后，第二挖掘单元640可以将全局效用值大于第一阈值的序列模式确定为全局高效用序列模式。例如，第三阶段MapReduce中的各个Reducer可以将全局效用值高于或等于第一阈值的序列模式确定为全局高效用序列模式。每个Reducer可以输出一个或多个新的键值对，其中每个新的键值对可以由一个全局高效用序列模式和该全局高效用序列模式的全局效用值构成。例如，当pattern y是全局高效用序列模式时，某个Reducer可以输出键值对(pattern y，global-utility)。因此，第三阶段MapReduce中的各个Reducer输出的键值对中的序列模式均为全局高效用序列模式。Then, the second mining unit 640 can determine the sequence pattern whose global utility value is greater than the first threshold as a global high-utility sequence pattern. For example, each Reducer in the third stage MapReduce can determine the sequence pattern whose global utility value is higher than or equal to the first threshold as a global high-utility sequence pattern. Each Reducer can output one or more new key-value pairs, wherein each new key-value pair can be composed of a global high-utility sequence pattern and the global utility value of the global high-utility sequence pattern. For example, when pattern y is a global high-utility sequence pattern, a certain Reducer can output a key-value pair (pattern y, global-utility). Therefore, the sequence patterns in the key-value pairs output by each Reducer in the third stage MapReduce are all global high-utility sequence patterns.

通过本实施例提供的挖掘全局高效用序列模式的装置，确定了序列数据库中各个序列的效用值链表和第一集合，并根据这两种数据结构来挖掘全局高效用序列模式，节省了大量时间，加快了在序列数据库中计算全局效用值的计算过程，加快了挖掘速度，降低了时间复杂度。Through the device for mining global high-utility sequence patterns provided by this embodiment, the utility value linked list and the first set of each sequence in the sequence database are determined, and the global high-utility sequence pattern is mined based on these two data structures, which saves a lot of time, speeds up the calculation process of calculating the global utility value in the sequence database, speeds up the mining speed, and reduces the time complexity.

此外，根据本公开实施例的装置也可以借助于图7所示的计算设备的架构来实现。图7示出了该计算设备的架构。如图7所示，计算设备700可以包括总线710、一个或多个CPU720、只读存储器(ROM)730、随机存取存储器(RAM)740、连接到网络的通信端口750、输入/输出组件760、硬盘770等。计算设备700中的存储设备，例如ROM 730或硬盘770可以存储计算机处理和/或通信使用的各种数据或文件以及CPU所执行的程序指令。计算设备700还可以包括用户界面780。当然，图7所示的架构只是示例性的，在实现不同的设备时，根据实际需要，可以省略图7示出的计算设备中的一个或多个组件。In addition, the apparatus according to the embodiment of the present disclosure can also be implemented with the aid of the architecture of the computing device shown in FIG7. FIG7 shows the architecture of the computing device. As shown in FIG7, the computing device 700 may include a bus 710, one or more CPUs 720, a read-only memory (ROM) 730, a random access memory (RAM) 740, a communication port 750 connected to a network, an input/output component 760, a hard disk 770, and the like. The storage device in the computing device 700, such as the ROM 730 or the hard disk 770, can store various data or files used for computer processing and/or communication and program instructions executed by the CPU. The computing device 700 may also include a user interface 780. Of course, the architecture shown in FIG7 is only exemplary. When implementing different devices, one or more components in the computing device shown in FIG7 may be omitted according to actual needs.

本公开的实施例也可以被实现为计算机可读存储介质。根据本公开实施例的计算机可读存储介质上存储有计算机可读指令。当所述计算机可读指令由处理器运行时，可以执行参照以上附图描述的根据本公开实施例的方法。所述计算机可读存储介质包括但不限于例如易失性存储器和/或非易失性存储器。所述易失性存储器例如可以包括随机存取存储器(RAM)和/或高速缓冲存储器(cache)等。所述非易失性存储器例如可以包括只读存储器(ROM)、硬盘、闪存等。The embodiments of the present disclosure may also be implemented as a computer-readable storage medium. Computer-readable instructions are stored on a computer-readable storage medium according to an embodiment of the present disclosure. When the computer-readable instructions are executed by a processor, the method according to the embodiment of the present disclosure described with reference to the above figures may be executed. The computer-readable storage medium includes, but is not limited to, for example, a volatile memory and/or a non-volatile memory. The volatile memory may, for example, include a random access memory (RAM) and/or a cache memory (cache), etc. The non-volatile memory may, for example, include a read-only memory (ROM), a hard disk, a flash memory, etc.

本领域技术人员能够理解，本公开所披露的内容可以出现多种变型和改进。例如，以上所描述的各种设备或组件可以通过硬件实现，也可以通过软件、固件、或者三者中的一些或全部的组合实现。Those skilled in the art will appreciate that the contents disclosed in this disclosure may be subject to various variations and improvements. For example, the various devices or components described above may be implemented by hardware, or by software, firmware, or a combination of some or all of the three.

此外，如本公开和权利要求书中所示，除非上下文明确提示例外情形，“一”、“一个”、“一种”和/或“该”等词并非特指单数，也可包括复数。本公开中使用的“第一”、“第二”以及类似的词语并不表示任何顺序、数量或者重要性，而只是用来区分不同的组成部分。同样，“包括”或者“包含”等类似的词语意指出现该词前面的元件或者物件涵盖出现在该词后面列举的元件或者物件及其等同，而不排除其他元件或者物件。“连接”或者“相连”等类似的词语并非限定于物理的或者机械的连接，而是可以包括电性的连接，不管是直接的还是间接的。In addition, as shown in the present disclosure and claims, unless the context clearly indicates an exception, the words "a", "an", "a kind" and/or "the" do not specifically refer to the singular, but may also include the plural. The words "first", "second" and similar words used in the present disclosure do not indicate any order, quantity or importance, but are only used to distinguish different components. Similarly, words such as "include" or "comprise" mean that the elements or objects appearing before the word include the elements or objects listed after the word and their equivalents, without excluding other elements or objects. Words such as "connect" or "connected" and similar words are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect.

此外，本公开中使用了流程图用来说明根据本公开的实施例的系统所执行的操作。应当理解的是，前面或下面操作不一定按照顺序来精确地执行。相反，可以按照倒序或同时处理各种步骤。同时，也可以将其他操作添加到这些过程中，或从这些过程移除某一步或数步操作。In addition, flow charts are used in the present disclosure to illustrate the operations performed by the system according to the embodiments of the present disclosure. It should be understood that the preceding or following operations are not necessarily performed precisely in order. On the contrary, various steps may be processed in reverse order or simultaneously. At the same time, other operations may also be added to these processes, or one or more operations may be removed from these processes.

除非另有定义，这里使用的所有术语(包括技术和科学术语)具有与本发明所属领域的普通技术人员共同理解的相同含义。还应当理解，诸如在通常字典里定义的那些术语应当被解释为具有与它们在相关技术的上下文中的含义相一致的含义，而不应用理想化或极度形式化的意义来解释，除非这里明确地这样定义。Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the invention belongs. It should also be understood that terms such as those defined in common dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the relevant technology and should not be interpreted in an idealized or extremely formal sense, unless explicitly defined as such herein.

以上对本公开进行了详细说明，但对于本领域技术人员而言，显然，本公开并非限定于本说明书中说明的实施方式。本公开在不脱离由权利要求书的记载所确定的本公开的宗旨和范围的前提下，可以作为修改和变更方式来实施。因此，本说明书的记载是以示例说明为目的，对本公开而言并非具有任何限制性的意义。The present disclosure is described in detail above, but it is obvious to those skilled in the art that the present disclosure is not limited to the embodiments described in this specification. The present disclosure can be implemented as a modification and variation without departing from the purpose and scope of the present disclosure as determined by the claims. Therefore, the description in this specification is for the purpose of illustration and does not have any restrictive meaning for the present disclosure.

Claims

1. A method for mining global high utility sequence patterns, comprising:

Determining a first type of item in the sequence database, wherein the first type of item is an item with a global sequence weight utility value higher than a first threshold value;

determining a utility value linked list of each sequence in the sequence database;

Mining at least one candidate global high utility sequence pattern from the sequence database and determining a first set according to the determined first class of items, wherein the first set comprises a plurality of subsets, each subset comprising the at least one candidate global high utility sequence pattern, an identification of a sequence comprising a respective candidate global high utility sequence pattern, and utility values of the respective candidate global high utility sequence pattern in the respective sequence; and

Mining a global high utility sequence pattern from the at least one candidate global high utility sequence pattern based on a linked list of utility values for each sequence and the first set,

Wherein mining global high utility sequence patterns from the at least one candidate global high utility sequence pattern according to the utility value linked list of each sequence and the first set comprises:

When the first set includes the at least one candidate global high utility sequence pattern, a utility value of the at least one candidate global high utility sequence pattern is determined from the first set.

2. The method of claim 1, wherein said determining a first type of entry in a sequence database comprises:

determining a global sequence weight utility value of each item in the sequence database; and

Items with global sequence weight utility values above a first threshold are determined to be of a first type.

3. The method of claim 2, wherein determining a global sequence weight utility value for each item in the sequence database comprises:

determining local sequence weight utility values of the item in each partition of the sequence database; and

And determining a global sequence weight utility value of the item according to the determined local sequence weight utility value.

4. A method as claimed in claim 3, wherein the local sequence weight utility value of the term at each partition of the sequence database is determined from utility values of sequences comprising the term in that partition.

5. The method of any one of claims 1 to 4, wherein said determining a linked list of utility values for each sequence in the sequence database comprises:

And determining a utility value linked list of the sequence according to the utility value of each item in the sequence and the position of each item in the sequence.

6. The method of any of claims 1 to 4, wherein the mining at least one candidate global high utility sequence pattern from the sequence database according to the determined first class of items comprises:

mining a local high utility sequence pattern from each partition of the sequence database according to the determined first class item; and

At least one candidate global utility sequence pattern is determined from the mined local utility sequence patterns.

7. The method of claim 6, wherein mining a local high utility sequence pattern from each partition of the sequence database according to the determined first type of item comprises:

For an item belonging to the first class of items in the respective sequences comprised by the partition,

Calculating utility values and remaining utility values of the item in each sequence, wherein the remaining utility values of the item in a sequence are the sum of utility values of all items in the sequence that follow the item;

Constructing a utility list of the item in each sequence;

determining a utility value chain of the item according to utility lists of the item in each sequence;

and mining a local high-utility sequence mode from the partition according to the utility value chains of the items in the partition.

8. The method of any of claims 1 to 4, wherein mining global utility sequence patterns from the at least one candidate global utility sequence pattern according to a utility value linked list of individual sequences and the first set comprises:

determining local utility values of each candidate global high utility sequence mode according to utility value linked lists of each sequence and the first set;

Determining the global utility value of each candidate global high utility sequence mode according to the local utility value of each candidate global high utility sequence mode; and

A sequence pattern having a global utility value greater than a first threshold is determined as a global high utility sequence pattern.

9. The method of claim 6, further comprising:

and dividing the sequence in the sequence database into a plurality of partitions according to a load balancing algorithm.

10. An apparatus for mining global high utility sequence patterns, comprising:

A first determining unit configured to determine a first type of item in the sequence database, wherein the first type of item is an item for which the global sequence weight utility value is higher than a first threshold value;

the second determining unit is configured to determine utility value linked lists of all sequences in the sequence database;

A first mining unit configured to mine at least one candidate global utility sequence pattern from the sequence database according to the determined first class item and determine a first set, wherein the first set comprises a plurality of subsets, each subset comprising the at least one candidate global utility sequence pattern, an identification of a sequence comprising a respective candidate global utility sequence pattern, and utility values of the respective candidate global utility sequence pattern in the respective sequence; and

A second mining unit configured to mine global high utility sequence patterns from the at least one candidate global high utility sequence pattern according to a utility value linked list of each sequence and the first set,

Wherein the second mining unit is further configured to determine utility values of the at least one candidate global high utility sequence pattern from the first set when the first set includes the at least one candidate global high utility sequence pattern.

11. The apparatus of claim 10, wherein the first determining unit is configured to determine global sequence weight utility values for respective items in the sequence database; and determining an item with a global sequence weight utility value higher than a first threshold as a first type item.

12. The apparatus according to claim 10 or 11, wherein the second determining unit is configured to determine a utility value linked list for each sequence based on the utility values of the respective items in the sequence and the positions of the respective items in the sequence.

13. The apparatus of claim 10 or 11, wherein the second mining unit is configured to determine local utility values for each candidate globally highly-utility sequence pattern from a linked list of utility values for each sequence and the first set; determining the global utility value of each candidate global high utility sequence mode according to the local utility value of each candidate global high utility sequence mode; and determining a sequence pattern having a global utility value greater than a first threshold as a global high utility sequence pattern.

14. An apparatus for mining global high utility sequence patterns, comprising:

a processor; and

A memory, wherein the memory has stored therein a computer executable program which, when executed by the processor, performs the method of any of claims 1-9.

15. A computer readable storage medium having stored thereon instructions which, when executed by a processor, cause the processor to perform the method of any of claims 1-9.