JP5478229B2

JP5478229B2 - Data analysis system and method

Info

Publication number: JP5478229B2
Application number: JP2009280525A
Authority: JP
Inventors: 隆彦新谷
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2009-12-10
Filing date: 2009-12-10
Publication date: 2014-04-23
Anticipated expiration: 2029-12-10
Also published as: JP2011123652A

Description

The present invention relates to a data analysis system and method, and more particularly to a data mining technique for revealing regularities in the order of appearance of data contained in a database.

With the spread of data collection environments such as mobile terminals, IC cards, and IC tags, it has become possible to capture moment-to-moment human behavior and the states of objects as data and to accumulate that data in large quantities. There is a need to analyze this accumulated data, extract characteristic or typical behavior patterns and state patterns, and apply them to fields such as marketing and healthcare. Data mining, which analyzes large volumes of accumulated data and extracts useful regularities and patterns buried in it, is known as a means of meeting this need. In particular, time-series pattern mining is a technique for analyzing patterns in the order in which data appears on the time axis.

For example, consider time-series pattern mining of credit card usage data. Each time a customer uses a credit card at a store, the date and time of use, the store, and the amount are recorded as usage data. From a large volume of such data, time-series patterns, that is, ordered patterns that appear in common across multiple customers, can be extracted. If a pattern such as "customers who purchase at store A often go on to make a high-value purchase at store B" is extracted, it shows that stores A and B are linked by joint purchasing behavior, which can inform store placement and sales strategy. Likewise, typical user access patterns can be extracted from website access logs; abnormal access can then be detected by determining that an access falls outside the extracted patterns, or that new data yields an access pattern that was not extracted from past data. Similarly, from the operation history of construction machinery together with its failure and maintenance history, operating-state patterns that tend to precede failures, as well as normal operating patterns, can be extracted, which helps in designing operation plans that prevent failures and in detecting abnormal operating states.

Research on extracting time-series patterns from large volumes of data has been conducted in the field of data mining; examples are the methods described in Patent Document 1 and Non-Patent Document 1. These methods take a database whose records consist of a combination of items (data items, events) and either a time stamp or an identifier indicating the order of appearance, and extract the time-series patterns whose support (the appearance frequency expressed as a fraction of all the data) is at least a minimum value set in advance by the user. A time-series pattern is a pattern that includes the order of appearance of combinations of items (item sets); a time-series pattern consisting of n item sets (n ≥ 1) is written <(IS1)...(ISn)>, where each of (IS1), ..., (ISn) is an item set consisting of one or more items. The support of a time-series pattern is the fraction of all time-series data that contains the pattern. A time-series pattern whose support is at least the minimum support is called a frequent time-series pattern. Frequent time-series patterns are extracted by generating candidate time-series patterns, reading data from the database and counting how often each candidate appears in the time-series data, and selecting the patterns whose frequency is at least the minimum support.
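The support definition above can be illustrated with a minimal Python sketch. The function names `contains` and `support` and the small database `db` are hypothetical, introduced only for illustration; containment is assumed to mean that each item set of the pattern is a subset of a distinct element of the time-series data, in order.

```python
from typing import FrozenSet, List, Sequence

ItemSet = FrozenSet[str]


def contains(sequence: Sequence[ItemSet], pattern: Sequence[ItemSet]) -> bool:
    """Return True if the item sets of `pattern` occur in `sequence` in order,
    each as a subset of a distinct element (greedy left-to-right match)."""
    pos = 0
    for element in sequence:
        if pos < len(pattern) and pattern[pos] <= element:
            pos += 1
    return pos == len(pattern)


def support(database: List[List[ItemSet]], pattern: Sequence[ItemSet]) -> float:
    """Fraction of the time-series data in `database` that contains `pattern`."""
    hits = sum(contains(seq, pattern) for seq in database)
    return hits / len(database)


# Hypothetical data: four pieces of time-series data.
db = [
    [frozenset({"A"}), frozenset({"C"}), frozenset({"B", "high"})],
    [frozenset({"B"}), frozenset({"A"})],
    [frozenset({"A"}), frozenset({"B"})],
    [frozenset({"C"})],
]
pattern = [frozenset({"A"}), frozenset({"B"})]
print(support(db, pattern))  # 0.5
```

With a minimum support of 0.5, the pattern <(A)(B)> would count as a frequent time-series pattern in this toy database.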

Another problem in time-series pattern mining is extracting time-series patterns that repeat within a single piece of time-series data (repeated time-series patterns). Consider extracting repeated time-series patterns from credit card usage data. From the usage data of one user accumulated over a long period, usage patterns that are repeated multiple times can be extracted as time-series patterns. If a pattern such as "after purchasing at store C and then at store D, the user often purchases at store E" is extracted, it shows that the user regularly visits stores C, D, and E in a fixed order. This can inform advertising and sales strategy, and can support segmentation that classifies users by the tendencies of their regular usage patterns.

Research on extracting repeated time-series patterns has been conducted in the fields of data mining and bioinformatics; examples are the methods described in Patent Documents 2 and 3. The method of Patent Document 2 extracts, from one person's time-series data, the time-series patterns that are repeated at least a predetermined number of times, and thereby extracts patterns that the person performs regularly. The method of Patent Document 3 does not count the exact number of actual repetitions; it extracts patterns that are judged statistically to be repeated. The two also differ in purpose. In bioinformatics applications such as Patent Document 3, repeated patterns are regarded as meaningless portions and the aim of extracting them is to remove them, whereas in Patent Document 2 repeated patterns are regarded as meaningful and the aim is to find them together with their exact number of occurrences.

Patent Document 1: JP-A-8-263346
Patent Document 2: JP-A-2001-229202
Patent Document 3: US Patent Application Publication No. 2003/0068617

Non-Patent Document 1: R. Agrawal, R. Srikant, "Mining Sequential Patterns: Generalizations and Performance Improvements", in Proceedings of the International Conference on Extending Database Technology, 1996

In practice, an analysis does not always cover all of the data. In human behavior analysis, for example, behavior that occurred only by chance may be removed as noise, so that only behavior repeated at least a certain number of times is analyzed. There is also a need to extract behavior that many users perform regularly as significant behavior patterns. In purchase analysis, knowing the purchase patterns of customers who buy regularly and repeatedly, and promoting purchases that follow those patterns, can help create more such repeat customers.

As described above, Patent Document 1 and Non-Patent Document 1 can extract time-series patterns that appear in common across multiple customers in the time-series data of many people, and Patent Documents 2 and 3 can extract time-series patterns that are repeated multiple times in one person's time-series data. However, none of them considers extracting time-series patterns that take both criteria into account.

A simple approach would be to first extract the repeated patterns from each customer's time-series data and then select, from among them, the patterns that were extracted for at least a predetermined number of customers; this yields the patterns that satisfy both conditions. However, because the purchase patterns that repeat differ from customer to customer, this simple combination searches the huge volume of time-series data for many patterns that are ultimately unnecessary. The resulting amount of wasted processing makes the approach impractical.

An object of the present invention is to provide a data analysis system and method that extract frequent repeated time-series patterns, that is, time-series patterns that are repeated at least a predetermined number of times within a piece of time-series data and that satisfy this repetition condition in at least a predetermined number of pieces of time-series data.

Another object of the present invention is to provide a data analysis system and method that reduce the amount of search processing by using checkpoints and by calculating upper limits on the number of repetitions and the number of appearances.

To achieve the above object, the present invention provides a data analysis system and method that use a computer including a processing unit and a storage unit. The data consists of a plurality of stored sets, each comprising an event, the ID to which the event belongs, and information indicating the order relationship between events. Events having the same ID, arranged according to that order relationship, form a piece of time-series data, and a permutation with repetition in which one or more events are arranged in the forward direction is a time-series pattern. To extract frequent repeated time-series patterns, that is, time-series patterns repeated at least a predetermined number of times in each of at least a predetermined number of pieces of time-series data, the processing unit executes the steps of: counting, for a time-series pattern whose number of repetitions is unknown, the number of repetitions in each piece of time-series data; counting the number of pieces of time-series data in which that number of repetitions is at least the predetermined number of repetitions; and extracting the time-series patterns for which the counted number of pieces of time-series data is at least the predetermined number.
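The three steps above can be sketched in Python as follows. This is an illustrative sketch, not the patented implementation: events are simplified to single strings rather than item sets, repetitions are assumed to be counted as non-overlapping, in-order occurrences, the candidate patterns are simply supplied by the caller, and the names `count_repetitions`, `frequent_repeated`, and the sample data are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Pattern = Tuple[str, ...]


def count_repetitions(events: Sequence[str], pattern: Pattern) -> int:
    """Count non-overlapping, in-order occurrences of a non-empty `pattern`."""
    count, pos = 0, 0
    for event in events:
        if event == pattern[pos]:
            pos += 1
            if pos == len(pattern):  # one full occurrence completed
                count, pos = count + 1, 0
    return count


def frequent_repeated(
    data: Dict[str, List[str]],
    candidates: List[Pattern],
    min_repetitions: int,
    min_frequency: int,
) -> Dict[Pattern, int]:
    """Return each candidate whose appearance frequency is at least
    `min_frequency`, mapped to that frequency."""
    result = {}
    for pattern in candidates:
        # Step 1: number of repetitions in each piece of time-series data.
        reps = {sid: count_repetitions(ev, pattern) for sid, ev in data.items()}
        # Step 2: number of pieces of data meeting the repetition threshold.
        frequency = sum(1 for r in reps.values() if r >= min_repetitions)
        # Step 3: keep the pattern if enough pieces of data qualify.
        if frequency >= min_frequency:
            result[pattern] = frequency
    return result


# Hypothetical data keyed by time-series data ID.
data = {
    "s1": ["A", "B", "C", "A", "B"],
    "s2": ["A", "C", "B"],
    "s3": ["A", "B", "A", "B", "A", "B"],
}
found = frequent_repeated(data, [("A", "B"), ("C", "A")], 2, 2)
print(found)  # {('A', 'B'): 2}
```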

Also to achieve the above object, the present invention provides a data analysis system and method that use a computer including a processing unit and a storage unit. The data consists of a plurality of stored sets, each comprising an event, the ID to which the event belongs, and information indicating the order relationship between events. Events having the same ID, arranged according to that order relationship, form a piece of time-series data, and a permutation with repetition in which one or more events are arranged in the forward direction is a time-series pattern. To extract frequent repeated time-series patterns, that is, time-series patterns repeated at least a predetermined number of times in each of at least a predetermined number of pieces of time-series data, the processing unit executes: a first step of setting checkpoints at predetermined intervals in each piece of time-series data; a second step of counting, for a time-series pattern whose number of repetitions in each piece of time-series data is unknown, the number of times the pattern is repeated in the range from one checkpoint to the next; a third step of calculating an upper limit on the number of repetitions of the pattern in the time-series data from the sum of the number of repetitions already counted up to the checkpoint and the number of repetitions of each event that appears after the checkpoint; a fourth step of counting the number of pieces of time-series data for which the calculated upper limit is at least the predetermined number of repetitions; a fifth step of extracting the time-series patterns for which that number of pieces of time-series data is at least the predetermined number; and a sixth step of repeating the second through fifth steps for the extracted time-series patterns up to the last checkpoint.
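The idea of bounding the repetition count at each checkpoint can be sketched in Python for a single piece of time-series data. The sketch makes several assumptions that are not taken from the text: events are single strings, repetitions are non-overlapping in-order occurrences, a checkpoint falls after every `interval` events, and the upper limit is taken as the repetitions counted so far plus the remaining occurrences of the pattern's final event. That bound is one simple valid choice; the formula used in the invention may combine the remaining counts of the events differently. The name `count_with_checkpoints` is hypothetical.

```python
from typing import List, Tuple


def count_with_checkpoints(
    events: List[str],
    pattern: Tuple[str, ...],
    interval: int,
    min_repetitions: int,
) -> Tuple[int, bool]:
    """Count repetitions of a non-empty `pattern` segment by segment.

    Returns (repetitions counted, stopped), where `stopped` is True when the
    upper limit computed at a checkpoint fell below `min_repetitions`.
    """
    count, pos = 0, 0
    for start in range(0, len(events), interval):
        end = min(start + interval, len(events))
        # Count within the range from this checkpoint to the next one.
        for event in events[start:end]:
            if event == pattern[pos]:
                pos += 1
                if pos == len(pattern):
                    count, pos = count + 1, 0
        # Assumed bound: every further occurrence must end with the
        # pattern's final event somewhere after this checkpoint.
        remaining_last = events[end:].count(pattern[-1])
        if count + remaining_last < min_repetitions:
            return count, True  # threshold unreachable, skip the rest
    return count, False


print(count_with_checkpoints(["A", "B", "C", "C", "C", "C", "A", "C"], ("A", "B"), 4, 2))
# (1, True)
print(count_with_checkpoints(["A", "B", "A", "B"], ("A", "B"), 2, 2))
# (2, False)
```

In the first call the second half of the data is never scanned for the pattern, because no "B" remains after the first checkpoint. A fuller implementation would keep per-event counters rather than rescanning the tail at every checkpoint.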

According to the present invention, it is possible to extract time-series patterns that are repeated at least a predetermined number of times within individual pieces of time-series data and that satisfy this repetition condition in at least a predetermined number of pieces of time-series data.

In addition, the amount of analysis processing can be reduced in two ways: by calculating, for each data processing unit, an upper limit on the number of repetitions while counting the repetitions in each piece of time-series data, and skipping further counting when that upper limit falls below the predetermined number of repetitions; and by counting the appearance frequency for each data processing unit.

A diagram showing a system configuration example of the first embodiment.
A diagram showing an example of the user interface according to the first embodiment.
A flow diagram showing the relationship between user operations and system operations according to the first embodiment.
A flowchart outlining the frequent repeated time-series pattern extraction process according to the first embodiment.
A flowchart showing the frequent repeated item extraction process according to the first embodiment.
A flowchart showing the time-series data read process according to the first embodiment.
A flowchart showing the repeated pattern counting process according to the first embodiment.
A flowchart showing the process of counting the repetitions of a specific time-series pattern in specific time-series data according to the first embodiment.
A flowchart showing the appearance count counting process according to the first embodiment.
A diagram showing a system configuration example of the third embodiment.
A diagram showing an example of the user interface according to the third embodiment.
A diagram showing the relationship between user operations and system operations according to the third embodiment.
A flowchart showing a modification of the process of counting the repetitions of a specific time-series pattern in specific time-series data according to the third embodiment.
A conceptual diagram showing the relationship between the time-series data read process and the time-series data read from the storage device according to the first embodiment.

Embodiments of the present invention are described below with reference to the drawings.

First, the structure of the data used in the various embodiments is described. The database is a set of records. Each record consists of a combination (item set) of events (items), the identifier to which that combination belongs (the time-series data ID), and either a time stamp or an identifier indicating the order relationship. Time-series data is the representation obtained by taking the one or more records that share a time-series data ID and arranging their item sets in a list, ordered by time stamp or order identifier. Items are discrete values. If an item is a continuous value, it can be mapped to a discrete value by partitioning the value range and assigning a specific discrete value to each partition. Discrete values can also be classified into groups, with each group mapped to a specific discrete value; in this way they can be mapped to discrete values that are not among the original items.
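As an illustration of this data structure, the following Python sketch groups records by time-series data ID, orders them by time stamp, and maps a continuous payment amount to a discrete item. The record layout, the function names `to_time_series` and `discretize_amount`, the threshold of 10000, and the sample records are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

# (time-series data ID, time stamp as an ISO date string, item set)
Record = Tuple[str, str, FrozenSet[str]]


def discretize_amount(amount: float, threshold: float = 10000.0) -> str:
    """Map a continuous amount to a discrete item by range partitioning.
    The two-way split and the threshold are hypothetical."""
    return "high" if amount >= threshold else "normal"


def to_time_series(records: List[Record]) -> Dict[str, List[FrozenSet[str]]]:
    """Group records by ID and list their item sets in time-stamp order."""
    grouped = defaultdict(list)
    for series_id, timestamp, items in records:
        grouped[series_id].append((timestamp, items))
    return {
        sid: [items for _, items in sorted(rows, key=lambda row: row[0])]
        for sid, rows in grouped.items()
    }


# Hypothetical records, deliberately out of order.
records = [
    ("card01", "2009-01-05", frozenset({"store C"})),
    ("card01", "2009-01-02", frozenset({"store A"})),
    ("card02", "2009-01-03", frozenset({"store B", discretize_amount(25000)})),
]
series = to_time_series(records)
print(series["card01"])
```

ISO-formatted date strings sort correctly as plain strings, which is why no date parsing is needed here.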

For example, Tables 1 and 2 show credit card usage data. Table 1 is the tabular representation: one record represents a single use by a customer, the time-series data ID is the "card ID", the time stamp is the "date of use", and the event is the "usage content". Table 2 is the time-series representation: one piece of time-series data represents a customer's usage history over a long period, and the list of events is the usage content arranged in order of use.

A time-series pattern is a permutation with repetition of item sets; a time-series pattern consisting of n item sets (IS1), ..., (ISn), where n ≥ 1, is written <(IS1)...(ISn)>. The number of times a time-series pattern appears in one piece of time-series data is called its number of repetitions. Each time-series pattern has two evaluation values: its appearance frequency, which is the number of pieces of time-series data in which the pattern appears at least the predetermined number of times, and statistics of its number of repetitions in those pieces of time-series data. The number of pieces of time-series data equals the number of distinct time-series data IDs.

In the example data of Tables 1 and 2, there are three pieces of time-series data. The time-series pattern <(store A)(store B)> has 2 repetitions in the data for card ID card01, 1 in card02, and 3 in card03. With the predetermined number of repetitions set to 2, the appearance frequency is therefore 2, corresponding to card01 and card03, and the statistics of the number of repetitions are a mean of 2.5, a maximum of 3, and a minimum of 2.
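The evaluation values in this example can be reproduced with a few lines of Python. The repetition counts per card ID are the ones stated above; the variable names are chosen for illustration.

```python
from statistics import mean

# Repetitions of <(store A)(store B)> per card ID, as given in the example.
repetitions = {"card01": 2, "card02": 1, "card03": 3}
min_repetitions = 2

# Keep only the time-series data that meets the repetition threshold.
qualifying = {cid: n for cid, n in repetitions.items() if n >= min_repetitions}
values = list(qualifying.values())

appearance_frequency = len(qualifying)
stats = (mean(values), max(values), min(values))
print(appearance_frequency, stats)  # 2 (2.5, 3, 2)
```

Note that the statistics are taken only over the qualifying data (card01 and card03), which is what gives the mean of 2.5.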

FIG. 1 shows a configuration example of the data analysis system of the first embodiment. The system has a processor 101, which is the processing unit, and a memory 102 and a storage device 103, which constitute the storage unit. The processor 101 and the memory 102 constitute a computer 100, and the data to be analyzed is stored in the storage device 103. The time-series pattern extraction program of this embodiment is stored in the memory 102, and the processing shown in FIG. 4 is carried out when the processor 101 executes it.

As shown in FIG. 1, the memory 102 stores, in addition to the execution program 106, setting value information 107 through checkpoint information 111. The setting value information 107 holds the settings for the data to be analyzed, the minimum number of repetitions, the minimum appearance frequency, and the processing unit of the data to be analyzed, in a form such as data or a file. The time-series data information 108 holds, in table or list form, the time-series data ID and the time-series data for each piece of time-series data read from the storage device 103, for example: card01, <(store A)(store C)(store A)(store B, high-value payment)(store C)>.

The item information 109 holds, in table or list form, one tuple for each item that appears in the data to be analyzed, consisting of the item, the time-series data ID, the number of repetitions in that time-series data, and a repetition count used when counting the time-series patterns being searched (called the count value), for example (store A, card01, 3, 0). The search time-series pattern information 110 holds, in table or list form, one tuple for each time-series pattern being searched, consisting of the pattern, the time-series data ID, the number of repetitions counted so far, and the position within the pattern reached so far, for example (<(store A)(store B, high-value payment)>, card01, 1, 0).
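One way to picture these two tables is as lists of small records. The Python sketch below models them with dataclasses; the class and field names are hypothetical, and only the tuple contents follow the examples in the text.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass
class ItemInfo:
    """One entry of the item information (109)."""
    item: str
    series_id: str          # time-series data ID
    repetitions: int        # repetitions of the item in that time-series data
    count_value: int = 0    # working counter used while counting patterns


@dataclass
class SearchPatternInfo:
    """One entry of the search time-series pattern information (110)."""
    pattern: Tuple[FrozenSet[str], ...]
    series_id: str
    counted_repetitions: int = 0  # repetitions counted so far
    pattern_position: int = 0     # position within the pattern reached so far


# The two example tuples from the text.
item_entry = ItemInfo("store A", "card01", 3, 0)
pattern_entry = SearchPatternInfo(
    (frozenset({"store A"}), frozenset({"store B", "high-value payment"})),
    "card01",
    1,
    0,
)
print(item_entry)
print(pattern_entry.counted_repetitions, pattern_entry.pattern_position)
```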

The memory 102 also holds, as checkpoint information 111, the starting position in the time-series data from which repetitions are to be counted. Checkpoints are described later. In addition, an input device 104 such as a keyboard and mouse and an output device 105 such as a display and printer are connected to the computer 100.

FIG. 2 shows an example of the user interface of this embodiment. The user interface 200 consists of an analysis data specifying section 201 for specifying the data to be analyzed, a checkpoint specifying section 202 for specifying the processing unit of the data to be analyzed, a minimum repetition count input section 203 for specifying the minimum number of repetitions of the time-series patterns to be extracted, a minimum appearance frequency input section 204 for specifying the minimum appearance frequency, an execute button 205 for starting the processing, and a result display section 206 for displaying the extracted time-series patterns and their evaluation values.

The user specifies the data to be analyzed in the analysis data specifying section 201, and enters the processing unit in the checkpoint specifying section 202, the minimum number of repetitions of the time-series patterns to be extracted in the minimum repetition count input section 203, and their minimum appearance frequency in the minimum appearance frequency input section 204. The user then starts the time-series pattern extraction process with the execute button 205.

For each extracted time-series pattern, the result display section 206 displays the list of item sets that make up the pattern together with its evaluation values, namely the statistics of the number of repetitions and the appearance frequency. Although the result display section 206 displays the time-series patterns in table form, they may instead be displayed as a transition diagram whose nodes are the item sets that make up the pattern.

The analysis data specifying section 201, the checkpoint specifying section 202, the minimum repetition count input section 203, and the minimum appearance frequency input section 204 correspond to the input device 104, and the result display section 206 corresponds to the output device 105. Needless to say, the input device 104 and the output device 105 can be integrated by using, for example, a display that functions as a touch panel.

FIG. 3 shows an example of the flow of user operations and system operations in the time-series pattern extraction process of this embodiment.

First, using the input device 104, the user specifies the data to be analyzed and enters the processing unit of that data and the minimum number of repetitions and minimum appearance frequency of the time-series patterns to be extracted (301). The user then issues an execution instruction (302), which starts the time-series pattern extraction process.

On receiving the execution instruction, the data analysis system acquires the analysis data, the data processing unit, the minimum number of repetitions, and the minimum appearance frequency, stores them in the memory 102, and executes the execution program stored in the memory 102 on the processor 101 (303). The execution program extracts the frequent repeated time-series patterns by reading time-series data from the storage device 103, counting repetitions, and counting appearance frequencies (304). The extraction procedure is described in detail later. Finally, the extracted time-series patterns are output to the output device 105 (305). The user reviews the time-series patterns output to the output device 105 (306), which completes the time-series pattern extraction process.

FIG. 4 is a flowchart explaining the overall procedure of the time-series pattern extraction process in this embodiment.

First, corresponding to the input operation (301), the user enters the data to be analyzed, its processing unit, the minimum number of repetitions, and the minimum appearance frequency into the input device 104 (401). The processor 101 stores these inputs in the memory 102. In the memory 102, the setting value information 107 holds the data to be analyzed as a database name or file name and holds the processing unit, minimum number of repetitions, and minimum appearance frequency as numerical values; the checkpoint information 111 is set to 0, indicating the start of the data (401).

Next, corresponding to the execution program processing (304), the processor 101 executes the execution program 106 stored in the memory 102. It extracts frequent repeated time-series patterns through the frequent repeated item extraction process 402, the setting 403 of candidate time-series patterns to be searched, the process 404 of reading time-series data from the storage device 103, the repeated pattern counting process 405, and the appearance frequency counting process 406. The extracted time-series patterns are then output to the output device 105 (409).

The frequent repeated item extraction process 402 in FIG. 4 reads time-series data from the storage device 103, counts the number of repetitions of each item in each time-series data, and counts the appearance frequency of each item. In this way it extracts all items that satisfy both the minimum number of repetitions and the minimum appearance frequency.

FIG. 5 is a flowchart explaining in detail the procedure of the frequent repeated item extraction process 402 in FIG. 4. First, the item information 109 is initialized to empty (501). For each item to be searched, the item information 109 holds, as a table or a list, a tuple consisting of the item, a time-series data ID, the number of repetitions of the item in the time-series data with that ID, and a count value. Next, one time-series data is read from the storage device 103, and its time-series data ID and the time-series data itself are held in the time-series data information 108 (502). Next, the number of repetitions of each item appearing in that time-series data is counted (503). For every item whose number of repetitions is at least the minimum number of repetitions, a tuple consisting of the item, the time-series data ID, the number of repetitions of the item in that time-series data, and the initial count value 0 is registered in the item information (504). The same processing is repeated for all time-series data (505).

When all time-series data have been processed, the number of time-series data IDs registered in the item information 109 (the appearance frequency) is counted for each registered item (506). For every item whose appearance frequency is below the minimum appearance frequency, all of its tuples (item, time-series data ID, number of repetitions, count value) are deleted from the item information (507).
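Steps 501 to 507 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes each time-series data is a list of item sets held in memory, and the function name `extract_frequent_repeated_items` and its parameters are hypothetical.

```python
from collections import Counter

def extract_frequent_repeated_items(series_by_id, min_repeat, min_freq):
    """series_by_id maps a time-series data ID to a list of item sets (assumed layout)."""
    item_info = []  # tuples: (item, series_id, repetition_count, count_value)
    for series_id, series in series_by_id.items():
        # Steps 502-503: count at how many data positions each item occurs.
        repeats = Counter(item for item_set in series for item in item_set)
        # Step 504: register items that reach the minimum number of repetitions,
        # with the count value initialized to 0.
        for item, n in repeats.items():
            if n >= min_repeat:
                item_info.append((item, series_id, n, 0))
    # Step 506: appearance frequency = number of series IDs registered per item.
    freq = Counter(item for item, _, _, _ in item_info)
    # Step 507: drop every tuple of an item below the minimum appearance frequency.
    return [row for row in item_info if freq[row[0]] >= min_freq]

series = {
    "s1": [{"A"}, {"B"}, {"A"}, {"A", "B"}],
    "s2": [{"A"}, {"A"}, {"C"}],
}
result = extract_frequent_repeated_items(series, min_repeat=2, min_freq=2)
```

With these illustrative inputs, only item A survives: B repeats often enough in s1 but appears in too few time-series data, and C never reaches the minimum number of repetitions.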

When the frequent repeated item extraction process 402 described with FIG. 5 ends, the item information 109 in the memory 102 holds, for every item that satisfies both the minimum number of repetitions and the minimum appearance frequency, the tuples consisting of the item, the time-series data ID, the number of repetitions in that time-series data, and the initial count value. For a time-series data ID whose repetition counting has finished, the time-series data itself need not be kept in the time-series data information 108. Only the time-series data ID may be kept and the time-series data deleted.

Returning to FIG. 4, the candidate search time-series patterns are set next (403). A candidate search time-series pattern consists of two or more items, and its number of repetitions in each time-series data and its appearance frequency are still unknown. The search time-series pattern information 110 holds tuples consisting of a time-series pattern, a time-series data ID, the initial value of the counted number of repetitions, and the initial value of the counted pattern position. The search time-series pattern information 110 is used to count the number of repetitions of each search time-series pattern in each time-series data. Both the counted number of repetitions and the counted pattern position are initialized to 0.

The time-series data reading process 404 in FIG. 4 reads time-series data from the storage device 103 and holds the read data in the time-series data information 108 of the memory 102.

FIG. 6 is a flowchart explaining in detail the procedure of the time-series data reading process 404 in FIG. 4. First, if time-series data is held for the time-series data IDs in the time-series data information 108 of the memory 102, only the time-series data IDs are kept and the time-series data is deleted (601). Next, the current data position is read from the checkpoint information 111 (603). For each time-series data ID stored in the time-series data information 108, the time-series data from the current data position up to the position one analysis processing unit later, as given by the setting value information 107, is read from the storage device 103 (604). The items registered in the item information 109 are selected from it and held in the time-series data information 108 as the time-series data of that ID (605). For each item appearing in that time-series data, the number of times the item appears is added to its count value in the item information 109 (606). The same processing is repeated for all time-series data (607). When all time-series data have been processed, the time-series data information 108 holds, for each time-series data ID, one processing unit of time-series data as specified in the setting value information 107.

Returning again to FIG. 4, the repeated pattern counting process 405 is performed next. Using the time-series data information 108, the item information 109, and the search time-series pattern information 110 in the memory 102, this process counts the number of repetitions, in each time-series data, of the time-series patterns held in the search time-series pattern information 110.

FIG. 7 is a flowchart explaining in detail the procedure of the repeated pattern counting process 405 of FIG. 4 for one time-series data. For each time-series pattern held in the search time-series pattern information 110 (701), the repetition counting process is performed (702) to count the number of repetitions of the search time-series pattern in the time-series data. Next, an upper limit on the number of repetitions of the search time-series pattern in the time-series data is calculated (703). The upper limit is calculated by the following equation (Equation 1) from the counted number of repetitions in the time-series data, the counted position in the time-series pattern, and the number of repetitions and count value, in the time-series data, of each item constituting the search time-series pattern.

When the counted pattern position is the initial value of the time-series pattern:
$$(\text{upper limit of repetitions}) = (\text{counted repetitions}) + \min_{a}\{(\text{repetitions of item } a) - (\text{count value of item } a)\}$$
When the counted pattern position is not the initial value of the time-series pattern:
$$(\text{upper limit of repetitions}) = (\text{counted repetitions}) + \min_{a}\{(\text{repetitions of item } a) - (\text{count value of item } a)\} + 1$$
Here, $a$ ranges over the items constituting the search time-series pattern.
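The upper limit calculation of step 703 can be written as a small function. This is a sketch under stated assumptions: the function name `repetition_upper_bound`, the dictionary layout mapping each item to a (repetitions, count value) pair, and the use of 0 as the initial pattern position are illustrative choices, and the numbers are made-up inputs.

```python
def repetition_upper_bound(counted_repeats, counted_pattern_pos, pattern_items, item_info):
    """Upper limit on the repetitions of one search pattern in one time-series data.

    item_info maps item -> (repetitions in the whole series, count value so far);
    this layout is an assumption made for the sketch.
    """
    # Repetitions each item can still contribute in the unread part of the series.
    remaining = min(item_info[a][0] - item_info[a][1] for a in pattern_items)
    # A pattern occurrence that is already partly matched may complete in the unread part.
    spanning = 0 if counted_pattern_pos == 0 else 1
    return counted_repeats + remaining + spanning

# Illustrative values: three items with (repetitions, count value) pairs.
info = {"A": (3, 2), "B": (4, 1), "high": (3, 1)}
bound_initial = repetition_upper_bound(1, 0, ["A", "B", "high"], info)
bound_partial = repetition_upper_bound(1, 1, ["A", "B", "high"], info)
```

Item A limits the result here because only one of its three occurrences is still unread, so the two calls give 2 and 3 respectively.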

The exact number of repetitions of a time-series pattern is the sum of the number of repetitions in the part of the time-series data for which counting has been completed and the number of repetitions in the part that has not yet been processed. When an occurrence of the search time-series pattern spans the processed and unprocessed parts, 1 must be added to this sum. The exact number of repetitions in the unprocessed part remains unknown until counting is completed.

However, the number of repetitions of a time-series pattern can never exceed the number of repetitions of any item constituting that pattern. Consequently, the number of repetitions of the search time-series pattern in the unprocessed time-series data cannot exceed the minimum, over the items constituting the pattern, of their numbers of repetitions in the unprocessed time-series data. Equation 1 uses this property to calculate the upper limit on the number of repetitions in the time-series data from the counted number of repetitions, the counted position, and the number of repetitions and count value of each item constituting the search time-series pattern.

If the value calculated by Equation 1 does not reach the minimum number of repetitions, the search time-series pattern cannot satisfy the minimum number of repetitions in that time-series data even if counting were continued to the end of the data, including the unprocessed part. The information corresponding to that time-series data is therefore deleted from the search time-series pattern information (705), and subsequent counting of the pattern in that time-series data is omitted.

FIG. 8 is a flowchart explaining in detail the procedure of the repetition counting process 702 of FIG. 7 for one search time-series pattern in one time-series data.

First, the counted pattern position of the search time-series pattern for the time-series data ID being processed is acquired. If the acquired pattern position is not the initial value, the current pattern position is set to the position one after it. If it is the initial value, the current pattern position is set to the first pattern position (801). The data position at which counting starts is set to the head of the time-series data (802). Then, starting from the item set at the current data position (803), the process searches for a data position whose item set contains the item set at the current pattern position of the search time-series pattern (804).

If the item set at the current data position of the time-series data does not contain the item set at the current pattern position of the search time-series pattern, the process checks whether the current data position is the end of the time-series data (809). If it is not the end, the current data position is advanced by one, and the processing from the containment check (804) onward is repeated. If it is the end, the process terminates.

If the item set at the current data position of the time-series data contains the item set at the current pattern position of the search time-series pattern, the process checks whether the current pattern position is the end of the search time-series pattern (805). If it is the end, the counted number of repetitions in the search time-series pattern information 110 is incremented by 1 (806), and the counted pattern position is set to the first pattern position (807). If it is not the end, the counted pattern position in the search time-series pattern information 110 is set to the current pattern position (808). The process then checks whether the current data position is the end of the time-series data (809). If it is not the end, the current data position is advanced by one (810), and the process returns to the containment check (804). When processing has reached the end, the counted pattern position for that time-series data ID in the search time-series pattern information 110 is set to the current pattern position (811), and the process terminates.
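The loop of FIG. 8 can be sketched as a resumable scan. The sketch assumes that the pattern and the data are lists of Python sets and that the saved state is the pair (counted repetitions, number of leading pattern item sets already matched), with 0 as the initial value. The function name `count_repetitions` and this state encoding are assumptions for illustration, and the step numbers in the comments are only an approximate mapping.

```python
def count_repetitions(pattern, chunk, counted=0, matched=0):
    """Scan one chunk of a time-series data and return the updated (counted, matched).

    pattern and chunk are lists of item sets. matched is how many leading item
    sets of the pattern were already found (0 means the initial position).
    """
    for item_set in chunk:                 # steps 803, 809, 810: walk the data positions
        if pattern[matched] <= item_set:   # step 804: data item set contains pattern item set
            matched += 1
            if matched == len(pattern):    # step 805: end of the pattern reached
                counted += 1               # step 806: one more repetition
                matched = 0                # step 807: restart from the first pattern position
    return counted, matched                # step 811: state saved for the next chunk

data = [{"A"}, {"C"}, {"A"}, {"B", "high"}, {"C"}]
state_ab = count_repetitions([{"A"}, {"B", "high"}], data)
state_ca = count_repetitions([{"C"}, {"A"}], data)
# Resuming with the saved state lets an occurrence span two chunks.
state_ca_next = count_repetitions([{"C"}, {"A"}], [{"A"}], *state_ca)
```

Each data position is compared against a single pattern position, so one item set in the data never advances the pattern by more than one step.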

Returning again to FIG. 4, the appearance frequency counting process 406 is performed next. The appearance frequency counting process 406 counts the number of appearances of each search time-series pattern.

FIG. 9 is a flowchart explaining in detail the procedure of the appearance frequency counting process 406. For each search time-series pattern registered in the search time-series pattern information 110 of the memory 102 (901), the number of distinct time-series data IDs registered for that pattern is counted (902), and any search time-series pattern for which this number is below the minimum appearance frequency in the setting value information 107 is deleted from the search time-series pattern information 110 (904). When all registered search time-series patterns have been processed in this way, only the search time-series patterns that may still satisfy both the minimum number of repetitions and the minimum appearance frequency remain registered in the search time-series pattern information 110.
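The pruning in steps 901 to 904 amounts to counting the time-series data IDs that remain for each pattern. A minimal sketch follows, assuming the search time-series pattern information is a dictionary from pattern to a per-ID state dictionary; the name `prune_by_frequency` and this layout are hypothetical.

```python
def prune_by_frequency(search_info, min_freq):
    """Keep only patterns still registered for at least min_freq time-series data IDs.

    search_info maps pattern -> {series_id: (counted repetitions, pattern position)}.
    """
    return {
        pattern: per_series
        for pattern, per_series in search_info.items()
        if len(per_series) >= min_freq  # step 902: number of distinct series IDs
    }

search_info = {
    "<(A)(B)>": {"s1": (3, 0), "s2": (2, 1)},
    "<(C)(A)>": {"s1": (1, 0)},
}
kept = prune_by_frequency(search_info, min_freq=2)
```

Because entries for a time-series data ID are removed as soon as the upper limit falls below the minimum number of repetitions, the number of remaining IDs is itself an upper limit on the final appearance frequency.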

Returning again to FIG. 4, the time-series data IDs that are not included in the search time-series pattern information 110 are then deleted from the time-series data information (407).

When the above processing has been completed to the end of all time-series data, the search time-series patterns registered in the search time-series pattern information 110 are output as the extracted time-series patterns, together with statistics on the number of repetitions and on the appearance frequency (409). Because the search time-series pattern information 110 stores the exact number of repetitions of each time-series pattern in each time-series data, statistics on the number of repetitions can be calculated. Because the appearance frequency of each time-series pattern and the total number of time-series data are known, statistics on the appearance frequency can also be calculated.

As an example, the result display unit 206 of FIG. 2 on the output device 105 of FIG. 1 shows the average, maximum, and minimum as statistics on the number of repetitions, and the frequency and its ratio to the total number of time-series data as statistics on the appearance frequency.

FIG. 14 is a conceptual diagram of the checkpoint-delimited processing units of the time-series data read from the storage device 103 for the time-series data reading process 404, the repeated pattern counting process 405, and the appearance frequency counting process 406 in the flowchart of FIG. 4 of this embodiment.

Each time-series data is drawn as a single straight line. In the first time-series data reading process, the time-series data from the head of data 1 to the first checkpoint is read first, and the repeated pattern counting process is performed on it. When the repeated pattern counting process for data 1 ends, the time-series data from the head of data 2, the next time-series data, to the first checkpoint is read and counted in the same way. When the reading and the repeated pattern counting from the head to the first checkpoint have been completed for all time-series data, the appearance frequency counting process is performed.

In the first time-series data reading process, the time-series data from the head to the first checkpoint is read for each time-series data. If the result of the appearance frequency counting process is at least the minimum appearance frequency, the reading and the repeated pattern counting are performed on the span from the first checkpoint to the second checkpoint, for those time-series data whose upper limit on the number of repetitions, obtained in the repeated pattern counting up to the first checkpoint, was at least the minimum number of repetitions. When this has been completed for all such time-series data, the appearance frequency counting process is performed.

As long as the result of the appearance frequency counting process is at least the minimum appearance frequency, the same processing is repeated. When the last checkpoint has been processed, the number of repetitions in each time-series data that reaches the minimum number of repetitions, and the appearance frequency, are obtained. If the result of the appearance frequency counting process up to an intermediate checkpoint falls below the minimum appearance frequency, the search time-series pattern cannot satisfy the minimum appearance frequency, so processing of the time-series data after that checkpoint is omitted. Likewise, if for a given time-series data the upper limit on the number of repetitions obtained up to an intermediate checkpoint falls below the minimum number of repetitions, the search time-series pattern cannot satisfy the minimum number of repetitions in that time-series data, so processing of that time-series data after the checkpoint is omitted. As a result, processing of search time-series patterns that are not frequent repeated time-series patterns is avoided and the search processing load is reduced.
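The checkpoint-by-checkpoint flow with both pruning conditions can be sketched for a single search pattern as follows. This is an illustrative reading of the description, not the patented implementation: it assumes all time-series data fit in memory as lists of item sets, it derives the per-item repetition totals in a first pass (the role played by process 402), and the name `mine_pattern` and its parameters are hypothetical.

```python
from collections import Counter

def mine_pattern(pattern, series_by_id, unit, min_repeat, min_freq):
    """Return {series_id: repetitions} for one pattern, or None if it is pruned."""
    items = set().union(*pattern)
    # Total repetitions of every item per series (known before pattern counting starts).
    totals = {sid: Counter(i for s in series for i in s)
              for sid, series in series_by_id.items()}
    seen = {sid: Counter() for sid in series_by_id}   # count values so far
    state = {sid: (0, 0) for sid in series_by_id}     # (counted, matched pattern positions)
    longest = max(len(s) for s in series_by_id.values())
    for start in range(0, longest, unit):             # one pass per checkpoint interval
        for sid in list(state):
            counted, matched = state[sid]
            for item_set in series_by_id[sid][start:start + unit]:
                seen[sid].update(items & item_set)
                if pattern[matched] <= item_set:
                    matched += 1
                    if matched == len(pattern):
                        counted, matched = counted + 1, 0
            remaining = min(totals[sid][a] - seen[sid][a] for a in items)
            bound = counted + remaining + (1 if matched else 0)
            if bound < min_repeat:
                del state[sid]                         # this series can no longer qualify
            else:
                state[sid] = (counted, matched)
        if len(state) < min_freq:
            return None                                # the pattern can no longer qualify
    result = {sid: c for sid, (c, _) in state.items() if c >= min_repeat}
    return result if len(result) >= min_freq else None

series = {
    "s1": [{"A"}, {"B"}, {"A"}, {"B"}, {"A"}, {"B"}],
    "s2": [{"A"}, {"B"}, {"C"}, {"C"}, {"A"}, {"B"}],
    "s3": [{"C"}, {"C"}, {"C"}, {"C"}, {"C"}, {"C"}],
}
found = mine_pattern([{"A"}, {"B"}], series, unit=2, min_repeat=2, min_freq=2)
pruned = mine_pattern([{"A"}, {"B"}], series, unit=2, min_repeat=2, min_freq=3)
```

In this example s3 is dropped after the first interval because it contains neither item, and with a minimum appearance frequency of 3 the whole pattern is abandoned at that same checkpoint without reading the rest of the data.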

As described above, this embodiment makes it possible to obtain time-series patterns that satisfy both the minimum number of repetitions in each time-series data and the minimum appearance frequency over all time-series data. In addition, this embodiment calculates an upper limit on the number of repetitions partway through the analysis and counts the appearance frequency using that upper limit. Counting can therefore be cut off partway for search time-series patterns that cannot satisfy the minimum number of repetitions or the minimum appearance frequency, which limits the range of time-series data subject to counting and reduces the analysis processing load.

Embodiment 1 described above uses the number of repetitions and the appearance frequency. The analysis can be performed in the same way using the repetition rate, which is the quotient of the number of repetitions and the length or period of each time-series data, and the appearance rate (support), which is the quotient of the appearance frequency and the total number of time-series data.

In this embodiment the processing unit of the time-series data is entered in the checkpoint specifying unit 202. Alternatively, a predetermined value can be set in advance in the checkpoint information of the setting value information 107 in the memory 102, so that the user does not need to enter the processing unit.

Next, as a second embodiment, the processing of the execution program 106 of the data analysis system of FIG. 1 and the various information stored in the memory 102 are described using credit card usage data as an example. Table 3 shows an example of the credit card usage data analyzed in this embodiment. As Table 3 shows, data consisting of 20 time-series data is assumed to be stored in the storage device 103. It is also assumed that a data processing unit of 5, a minimum number of repetitions of 3, and a minimum appearance frequency of 5 have been entered at the input device 104 and are stored in the setting value information 107 of the memory 102.

The processor 101 executes the execution program 106 stored in the memory 102 and first performs the frequent repeated item extraction process 402. For example, when this process reads the time-series data of card01 from the storage device 103 and counts the number of repetitions of each item, the item information 109 in the memory 102 stores (Store A, card01, 3, 0), (Store B, card01, 4, 0), (Store C, card01, 3, 0), (Store D, card01, 1, 0), (Store E, card01, 1, 0), and (high-value payment, card01, 3, 0). Because the minimum number of repetitions is 3, the items "Store D" and "Store E" are deleted from the item information, and the item information 109 in the memory 102 retains (Store A, card01, 3, 0), (Store B, card01, 4, 0), (Store C, card01, 3, 0), and (high-value payment, card01, 3, 0).

Suppose that the search time-series pattern setting process 403 sets <(Store A)(Store B, high-value payment)>, <(Store C)(Store A)>, and <(Store C)(Store B)> as candidate search time-series patterns whose numbers of repetitions are unknown.

Next, in the time-series data reading process 404 of the execution program 106, the processor 101 reads the time-series data stored in the storage device 103 and stores it, one processing unit at a time, in the time-series data information 108 of the memory 102.

Everything other than the frequent repeated items is removed from the read time-series data before it is stored in the time-series data information 108. At the same time, the occurrences of each item are counted and the item information 109 is updated. For example, for card01, the first time-series data reading process reads <(Store A)(Store C)(Store A)(Store B, high-value payment)(Store E)(Store C)> as the time-series data. The items that are not frequent repeated items are removed, and card01, <(Store A)(Store C)(Store A)(Store B, high-value payment)(Store C)> is held in the time-series data information 108. The count values in the item information 109 are updated to (Store A, card01, 3, 2), (Store B, card01, 4, 1), (Store C, card01, 3, 2), and (high-value payment, card01, 3, 1).

Next, in the repeated pattern counting process 405, the number of repetitions of each search time-series pattern in each time-series data is counted. The time-series data of card01 is used as an example. First, when the search time-series pattern <(Store A)(Store B, high-value payment)> is counted, 0 is read from the search time-series pattern information 110 of the memory 102 as the counted number of repetitions for card01, and 0 is read as the counted pattern position.

Then, for the time-series data of card01, the process searches the time-series data information 108 of the memory 102, in order from the item set at the first data position, for a data position at which the item set (Store A) at the first pattern position of the search pattern appears. It is found at data position 1. Because the item set at the first pattern position is not the end of the search pattern, the counted pattern position in the search pattern information is updated to 1.

Next, the process searches from data position 2 onward for a data position at which the item set (Store B, high-value payment) at the second pattern position of the search pattern appears. It is found at data position 4. Because the item set at the second pattern position is the end of the search pattern, the counted number of repetitions in the search pattern information is incremented by 1 and the pattern position is updated to 0. The process then searches again, from data position 5 onward, for a data position at which the first item set (Store A) of the search pattern appears. It is not found by data position 5, which is the end of the time-series data, so the counted pattern position for card01 of the time-series pattern <(Store A)(Store B, high-value payment)> in the search time-series pattern information 110 is updated to 1, and the repetition counting process ends.

Similarly, when the search pattern <(store C)(store A)> is counted, a counted number of repetitions of 0 and a counted pattern position of 0 are read out from the search time-series pattern information 110 in the memory 102 as the card01 information for that search pattern. Then, for the time-series data of card01, the item sets in the time-series data information 108 in the memory 102 are scanned in order from the first data position to find the data position at which the item set at the first pattern position (store C) appears; it is detected at data position 2. Because the item set at the first pattern position (store C) is not the end of the pattern, the current pattern position is updated to 2 (808 in FIG. 8).

Next, the item set at the second pattern position of the search pattern (store A) is searched for from data position 3 of the time-series data onward and is detected at data position 3. Because this item set is the end of the search pattern, the counted number of repetitions in the search pattern information is incremented by 1 (806) and the current pattern position is reset to the first position (807).

The item set at the first pattern position of the search pattern (store C) is then searched for again from data position 3 of the time-series data onward and is detected at data position 5. Because it is not the end of the pattern, the current pattern position is updated to 2. An attempt is then made to search for the item set at the second pattern position (store A) from data position 5 onward, but data position 5 is the end of the time-series data (809), so the counted pattern position in the search pattern information 110 is set to 2 (811) and the repetition counting process 702 ends.
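The resumable counting walked through above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: the function name `count_repetitions`, the use of Python sets for item sets, and the convention that `matched` is the number of pattern positions already matched (0 meaning none) are all choices made for this sketch. The card01 data are the five item sets implied by the walkthrough and the item counts.

```python
from typing import List, Set, Tuple


def count_repetitions(pattern: List[Set[str]], data: List[Set[str]],
                      count: int = 0, matched: int = 0) -> Tuple[int, int]:
    """Count non-overlapping repetitions of `pattern` in `data`.

    `count` and `matched` carry the state saved after the previous chunk,
    so the same function can resume on the next chunk of the time series.
    Returns the updated (count, matched) pair.
    """
    for itemset in data:
        # The pattern item set must be contained in the data item set.
        if pattern[matched] <= itemset:
            matched += 1
            if matched == len(pattern):  # end of pattern: one repetition found
                count += 1
                matched = 0
    return count, matched


# First five data positions of card01 (hypothetical encoding of the example).
card01_chunk1 = [{"store A"}, {"store C"}, {"store A"},
                 {"store B", "high-value payment"}, {"store C"}]

pattern_ab = [{"store A"}, {"store B", "high-value payment"}]
pattern_ca = [{"store C"}, {"store A"}]

print(count_repetitions(pattern_ab, card01_chunk1))  # (1, 0)
print(count_repetitions(pattern_ca, card01_chunk1))  # (1, 1)
```

Because the state is returned rather than stored internally, a caller can persist `(count, matched)` per pattern and per time-series ID between reads, which is the role the search pattern information plays in the text.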

Next, in the flow of FIG. 7, the upper limit of the number of repetitions is calculated (703). For example, for the search pattern <(store A)(store B, high-value payment)> in card01, the search time-series pattern information 110 holds (<(store A)(store B, high-value payment)>, card01, 1, 0), and the item information 109 holds (store A, card01, 3, 2), (store B, card01, 4, 1), and (high-value payment, card01, 3, 1). From these, Equation 1 gives an upper limit of 2 (= 1 + 1).

Because the calculated upper limit is less than the minimum number of repetitions (3 in this embodiment), the search pattern cannot reach the minimum number of repetitions. By deleting the information for this search pattern from the search pattern information 110, the processor 101 terminates the repetition counting for this search pattern in card01 and skips the repetition counting that would otherwise follow the second and subsequent time-series data reads.

For the search pattern <(store C)(store A)>, the search pattern information holds (<(store C)(store A)>, card01, 1, 1) and the item information holds (store A, card01, 3, 2) and (store C, card01, 3, 2), so Equation 1 gives an upper limit of 3 (= 1 + 1 + 1). Because this upper limit is not less than the minimum number of repetitions, the search pattern may still reach the minimum, and the repetition counting after the second and subsequent time-series data reads cannot be skipped.
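Equation 1 itself is defined earlier in the description and is not reproduced in this passage. The sketch below is therefore only one possible reading that happens to reproduce both worked results (2 and 3); it should not be taken as the equation from the specification. It assumes that each item-information entry is (total occurrences, occurrences already read), that the bound is the counted repetitions plus the smallest remaining occurrence count among the pattern's items, and that one more is added when a partial match is in progress. The function name `repetition_upper_bound` and the dictionary layout are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def repetition_upper_bound(count: int, matched: int,
                           pattern_items: Iterable[str],
                           item_info: Dict[str, Tuple[int, int]]) -> int:
    """Assumed form of the bound: counted repetitions, plus the smallest
    number of not-yet-read occurrences among the pattern's items, plus one
    if a partial match is pending (matched > 0)."""
    remaining = min(item_info[item][0] - item_info[item][1]
                    for item in pattern_items)
    return count + remaining + (1 if matched > 0 else 0)


# Item information for card01 as (total occurrences, occurrences already read).
card01_items = {
    "store A": (3, 2),
    "store B": (4, 1),
    "high-value payment": (3, 1),
    "store C": (3, 2),
}

MIN_REPETITIONS = 3  # minimum number of repetitions in the embodiment

bound_ab = repetition_upper_bound(
    1, 0, ["store A", "store B", "high-value payment"], card01_items)
bound_ca = repetition_upper_bound(1, 1, ["store C", "store A"], card01_items)

print(bound_ab, bound_ab >= MIN_REPETITIONS)  # 2 False -> can be pruned
print(bound_ca, bound_ca >= MIN_REPETITIONS)  # 3 True  -> keep counting
```

Whatever the exact form of the bound, the pruning decision is the same comparison against the minimum number of repetitions shown in the last two lines.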

Next, the appearance frequency counting process 406 of the repeated time-series pattern extraction process in FIG. 4 is performed. From the time-series patterns and time-series data IDs stored in the search time-series pattern information 110 in the memory 102, the processor 101 counts the number of distinct time-series data IDs for each time-series pattern and deletes from the search pattern information the entries for any pattern whose count is below the minimum appearance frequency. For example, suppose that for the search pattern <(store C)(store B)> the search time-series pattern information 110 stores (<(store C)(store B)>, card01, 1, 1), (<(store C)(store B)>, card05, 2, 1), and (<(store C)(store B)>, card08, 2, 0). The appearance frequency of this search pattern is 3, which does not satisfy the minimum appearance frequency (5 in this embodiment). In this case, the processor 101 deletes the entries containing the search pattern from the search pattern information 110. For a time-series pattern deleted from the search pattern information 110, the repetition counting after the second and subsequent time-series data reads can be skipped.
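The appearance frequency count amounts to counting distinct time-series IDs per pattern and dropping patterns that fall short. A minimal sketch follows, assuming each entry of the search pattern information is a tuple (pattern, time-series ID, counted repetitions, pattern position) with the pattern encoded as a string; the function name `prune_by_support` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Row = Tuple[str, str, int, int]  # (pattern, time-series ID, count, position)


def prune_by_support(rows: List[Row], min_support: int) -> List[Row]:
    """Keep only rows whose pattern appears in at least `min_support`
    distinct time series."""
    ids_per_pattern: Dict[str, Set[str]] = defaultdict(set)
    for pattern, ts_id, _count, _position in rows:
        ids_per_pattern[pattern].add(ts_id)
    return [row for row in rows
            if len(ids_per_pattern[row[0]]) >= min_support]


search_info = [
    ("<(store C)(store B)>", "card01", 1, 1),
    ("<(store C)(store B)>", "card05", 2, 1),
    ("<(store C)(store B)>", "card08", 2, 0),
]

# With the embodiment's minimum appearance frequency of 5, the pattern
# (support 3) is removed entirely.
print(prune_by_support(search_info, min_support=5))  # []
```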

After the appearance frequency counting process 406, the time-series data information 108 is updated (407). The time-series data IDs that still require repetition counting are those held in the search time-series pattern information 110. A time-series data ID that is not held there needs no further repetition counting, so the second and subsequent reads of its time-series data are unnecessary. Any such time-series data ID is deleted from the time-series data information 108. For example, if card02, card04, and card07 are not held as time-series data IDs in the search time-series pattern information 110, they are deleted from the time-series data information 108.
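The update step can be pictured as filtering the in-memory time-series table down to the IDs that some surviving search pattern still references. The sketch below assumes the same tuple layout for search pattern entries as above and a dictionary keyed by time-series ID; `prune_time_series` and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, str, int, int]  # (pattern, time-series ID, count, position)


def prune_time_series(time_series: Dict[str, list],
                      search_info: List[Row]) -> Dict[str, list]:
    """Drop every time series whose ID no longer appears in any
    search-pattern entry, so later reads can skip it."""
    live_ids = {ts_id for _pattern, ts_id, _count, _position in search_info}
    return {ts_id: data for ts_id, data in time_series.items()
            if ts_id in live_ids}


# Illustrative data only: card02 has no surviving search pattern.
series = {"card01": ["..."], "card02": ["..."], "card03": ["..."]}
remaining_info = [("<(store C)(store A)>", "card01", 1, 1),
                  ("<(store C)(store A)>", "card03", 2, 0)]

print(sorted(prune_time_series(series, remaining_info)))  # ['card01', 'card03']
```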

The above processing is repeated up to the end of each time series. For example, in the second time-series data read of this embodiment, the 6th through 10th positions of the time-series data are read, and for card01, <(store B)(store B, high-value payment)(store A)(store C)(store B, high-value payment)> is held in the memory 102 as the time-series data information 108. When the search pattern <(store C)(store A)> is counted, a counted number of repetitions of 1 and a counted pattern position of 1 are read out from the search time-series pattern information 110 in the memory 102 as the card01 information for that pattern. Then, for the time-series data of card01, processing starts by searching, in order from the first item set of the time-series data information 108 in the memory 102, for the position at which the second item set of the search pattern (store A) appears.

Through the processing of this embodiment described in detail above, time-series patterns that satisfy both the minimum-repetition condition and the minimum-appearance-frequency condition can be extracted while avoiding unnecessary repetition counting.

In this embodiment, for example, the repetition counting for the search pattern <(store A)(store B, high-value payment)> in card01 can be skipped after the second and subsequent time-series data reads. Likewise, the second and subsequent time-series data reads can be skipped for the time-series data of card02, card04, and card07.

Next, as a third embodiment, a case is described in which the data analysis system analyzes data whose time series contain delimiters. In some cases, a time-series pattern that straddles a given delimiter within one time series should not be treated as contained in that time series. For example, when human behavior patterns are analyzed on a per-day basis, behavior patterns that span two dates must not be counted.

FIG. 10 shows an example system configuration of the third embodiment. In this system, the memory 102 of the system configuration in FIG. 1 additionally holds, as a delimiter condition 1001, the condition that delimits the time-series data in the analysis target data, in a form such as a conditional expression.

FIG. 11 shows an example of the user interface of this embodiment. The user interface 1100 adds to the user interface 200 of FIG. 2 a delimiter condition setting unit 1101 for setting the delimiter condition for the time-series data in the analysis target data. The user specifies the analysis target data in the analysis data designation unit 201 and enters the processing unit in the checkpoint designation unit 202, the minimum number of repetitions of the time-series patterns to be extracted in the minimum repetition count input unit 203, the minimum appearance frequency of the time-series patterns to be extracted in the minimum appearance frequency input unit 204, and the delimiter condition for the time-series data in the delimiter condition setting unit 1101. Pressing the execution button 205 starts the time-series pattern extraction process. For each extracted time-series pattern, the result display unit 206 shows the list of item sets that make up the pattern together with its evaluation values, namely statistics of the number of repetitions and the appearance frequency.

FIG. 12 shows the flow of user operations and system operations in the time-series pattern extraction process of this embodiment. First, using the input device 104, the user specifies the data to be analyzed, enters the processing unit for the analysis target data, enters the minimum number of repetitions and the minimum appearance frequency of the time-series patterns to be extracted, and enters the delimiter condition for the time-series data via the setting unit 1101 (1201). The subsequent processing is the same as in FIG. 3.

When a delimiter condition is set for the time-series data to be analyzed, the overall procedure of time-series pattern extraction, the frequent repeated item extraction process, the candidate search time-series pattern setting process, the time-series data read process, and the appearance frequency counting process are the same as described above. What differs from the procedure described above is the repetition counting process for one time series against one search time-series pattern, within the repeated pattern counting process.

FIG. 13 is a flowchart detailing the repetition counting process 702 for one time series against one search time-series pattern when a delimiter condition is set for the time-series data to be analyzed. The processing is the same as in FIG. 8 from the process 801, which sets the pattern position for repetition counting, through the process 809, which checks whether the end of the time-series data has been reached. It is checked whether the current data position is the end of the time-series data (809); if it is not, it is checked whether the boundary immediately after the current data position satisfies the delimiter condition 1001 of the setting value information 107 in the memory 102 (1301).

If the delimiter condition is satisfied, the current pattern position is reset to the first pattern position (1302), the current data position is advanced by one (810), and processing repeats from the check of whether the item set at the current data position of the time-series data contains the item set at the current pattern position of the time-series pattern (804). If the delimiter condition is not satisfied, the current data position is advanced by one (810) and processing likewise repeats from step 804. If the current data position is the end, processing is as described above.
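The delimiter check of FIG. 13 can be added to a simple counting loop as one extra test after each data position. The following is a sketch under assumptions rather than the flowchart itself: records are (key, item set) pairs, `is_break` is a caller-supplied predicate standing in for the delimiter condition 1001, and the per-day data are invented for illustration. It handles a single in-memory run of records and does not address a delimiter that coincides with a chunk boundary.

```python
from typing import Callable, Hashable, List, Set, Tuple

Record = Tuple[Hashable, Set[str]]  # (delimiter key such as a date, item set)


def count_with_breaks(pattern: List[Set[str]], records: List[Record],
                      is_break: Callable[[Hashable, Hashable], bool],
                      count: int = 0, matched: int = 0) -> Tuple[int, int]:
    """Count repetitions of `pattern`, discarding any partial match when a
    delimiter lies between the current record and the next one."""
    for i, (key, itemset) in enumerate(records):
        if pattern[matched] <= itemset:
            matched += 1
            if matched == len(pattern):
                count += 1
                matched = 0
        # Not at the end: if a delimiter follows, restart from the first
        # pattern position before moving to the next data position.
        if i + 1 < len(records) and is_break(key, records[i + 1][0]):
            matched = 0
    return count, matched


visits = [("day1", {"home"}), ("day1", {"office"}), ("day1", {"home"}),
          ("day2", {"office"}), ("day2", {"home"}), ("day2", {"office"})]
commute = [{"home"}, {"office"}]

# Delimiter whenever the date changes: the home/office pair that spans
# day1 and day2 is not counted.
print(count_with_breaks(commute, visits, lambda a, b: a != b))   # (2, 0)
# No delimiter at all: the straddling pair is counted as well.
print(count_with_breaks(commute, visits, lambda a, b: False))    # (3, 0)
```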

As described above, according to this embodiment, a delimiter condition is set for the time-series data, and in the repetition counting for a search time-series pattern, the pattern position being counted is reset to the first position wherever a delimiter occurs in the time-series data. This avoids counting repetitions that straddle a delimiter, so time-series patterns can be extracted even when delimiters are set in the time-series data.

As a fourth embodiment, a data analysis system that analyzes website access log data is described. Specifically, taking website access log data as an example, the repetition counting process 702 for one time series against one search time-series pattern, performed by the execution program 106 of the data analysis system described above, is explained. For web access log data, one record represents one access by a user; the time-series data ID is the user ID, the timestamp is the access date and time, and the event is the URL of the accessed page. The access log also carries a session number identifying a unit of consecutive accesses, and records with the same session number are accesses within the same session.

For example, user01 in the data of Table 4 has three sessions, with session numbers 100, 101, and 102. In session 100, page A was accessed first, then page B, and finally page C. Here, the access log data to be analyzed is the data shown in Table 4 and is stored in the storage device 103. The delimiter condition "session number of the i-th record ≠ session number of the (i+1)-th record" is assumed to have been set via the input device 104 and stored in the setting information in the memory 102.

As the repetition counting process 702 for one time series against one search time-series pattern, suppose that the search pattern <(page A)(page B)> is counted in the time-series data of user01. The user01 time-series data in the time-series data information 108 in the memory 102 holds the list of (session number, URL) pairs (100, page A)(100, page B)(100, page A)(101, page A)(101, page B)(101, page D)(101, page E)(101, page D)(102, page D), and a counted number of repetitions of 0 and a counted pattern position of 0 are read out from the search pattern information 110 in the memory 102 as the user01 information for the search pattern. In this embodiment, the data position containing the item set at the first pattern position of the search pattern (page A) is searched for in order from the first data position of the time-series data information 108 in the memory 102 (corresponding to 803 in FIG. 13).

First, the item set at the first pattern position of the search pattern (page A) is detected in the item set at the first data position of the time-series data. Because the current pattern position is not the end of the search pattern (No at 805), the current pattern position is set to the next pattern position, 2 (808).

Next, it is checked whether the current data position is the end (809). Because it is not, it is checked whether a delimiter immediately follows the current data position (1301). As noted above, the delimiter condition 1001 in the memory 102 specifies that the session numbers differ, so the session number at the current data position is compared with that at the next data position. Both are 100, so the current data position is set to 2 (810), and it is checked again whether the item set at the current pattern position is contained at the current data position (804). In this example, with the current pattern position at 2 and the current data position at 2, the item set at the current pattern position (page B) is detected at the current data position. The counted number of repetitions in the search pattern information 110 in the memory 102 is updated to 1 (806), the current pattern position is reset to the first position (807), and the current data position is set to 3.

Next, because the item set at the current pattern position (page A) is not contained in the item set at the current data position (page C), the current data position is to be advanced to the next data position. Here, the session number at the current data position is 100 and that at the next data position is 101, so a delimiter is detected (1301), and before the current data position is advanced, the current pattern position is set to the first pattern position (1302). This avoids counting repetitions of a time-series pattern that straddles session numbers 100 and 101. The repetition counting then continues in the same way to the end of the time-series data.

In this embodiment, the time-series data information holds a list of (session number, item set) pairs and the delimiter condition is set as a conditional expression. The same analysis can also be performed by making the delimiter condition a predetermined symbol and using data in which that symbol is inserted at each delimiter in the time-series data. For example, if in the data of Table 4 a change in session number is defined as a delimiter and the delimiter symbol is ".", the analysis can be performed in the same way as in this embodiment using the time-series data of Table 5.
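The symbol-based variant can be illustrated by first inserting "." wherever the session number changes and then counting over the resulting flat sequence. The sketch below is an assumption-laden illustration: Table 5 is not reproduced here, so the flat list format, the function names `to_delimited` and `count_delimited`, and the treatment of each access as a single page name are all hypothetical. The user01 records follow the description of Table 4 in the text (session 100 visits pages A, B, C).

```python
from typing import List, Tuple

BREAK = "."  # delimiter symbol used in the text


def to_delimited(records: List[Tuple[int, str]]) -> List[str]:
    """Flatten (session number, page) records into a page sequence with
    BREAK inserted wherever the session number changes."""
    out: List[str] = []
    previous = None
    for session, page in records:
        if previous is not None and session != previous:
            out.append(BREAK)
        out.append(page)
        previous = session
    return out


def count_delimited(pattern: List[str], sequence: List[str]) -> int:
    """Count repetitions of `pattern`, restarting the match at each BREAK."""
    count = matched = 0
    for symbol in sequence:
        if symbol == BREAK:
            matched = 0
        elif symbol == pattern[matched]:
            matched += 1
            if matched == len(pattern):
                count += 1
                matched = 0
    return count


user01 = [(100, "A"), (100, "B"), (100, "C"),
          (101, "A"), (101, "B"), (101, "D"), (101, "E"), (101, "D"),
          (102, "D")]

flat = to_delimited(user01)
print(flat)  # ['A', 'B', 'C', '.', 'A', 'B', 'D', 'E', 'D', '.', 'D']
print(count_delimited(["A", "B"], flat))  # 2
```

Encoding the delimiter in the data keeps the counting loop free of any session-specific logic, at the cost of a preprocessing pass over the records.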

The present invention described in detail above relates to a data analysis system and method for databases and data warehouses, and is particularly useful as a data mining technique that analyzes database records to reveal regularities in the order in which data appear.

100…Computer
101…Processor
102…Memory
103…Storage device
104…Input device
105…Output device
106…Execution program
107…Setting value information
108…Time-series data information
109…Item information
110…Search pattern information
111…Checkpoint information
1001…Delimiter condition
1101…Delimiter condition setting unit
200…User interface
201…Analysis data designation unit
202…Checkpoint designation unit
203…Minimum repetition count input unit
204…Minimum appearance frequency input unit
205…Execution button
206…Result display unit

Claims

A data analysis system in which a computer analyzes data consisting of a plurality of sets of an event, an ID to which the event belongs, and information indicating an order relation between events, wherein
The computer includes a storage unit that stores the data to be analyzed and an execution program that performs analysis processing, and a processing unit that executes the execution program, and
The processing unit executes:
A first step of storing, in the storage unit as time-series data, the data in which events having the same ID are arranged according to the order relation;
A second step of counting, for a time-series pattern consisting of a permutation with repetition of the events, the number of repetitions in each of the time-series data;
A third step of counting the number of the time-series data in which the number of repetitions is equal to or greater than a predetermined number of times; and
A fourth step of extracting a time-series pattern for which the counted number of time-series data is equal to or greater than a predetermined number.

The data analysis system according to claim 1, wherein
The processing unit executes:
Providing checkpoints at predetermined intervals in each of the time-series data, and, in the first step, storing in the storage unit the time-series data in the range from one checkpoint to the next checkpoint;
In the second step, for a time-series pattern whose number of repetitions is unknown, counting the number of repetitions in each of the time-series data from the time-series data in the range stored in the storage unit;
Calculating an upper limit of the number of repetitions of the time-series pattern in the time-series data by adding the number of repetitions counted in the range in which counting has been performed and the number of repetitions, in the time-series data not yet counted, of the events constituting the time-series pattern;
In the third step, counting the number of time-series data for which the upper limit of the number of repetitions of the time-series pattern is equal to or greater than the predetermined number of times; and
Repeating the above steps for time-series patterns for which the counted number of time-series data is equal to or greater than the predetermined number.

The data analysis system according to claim 1,
The time series data is data having a delimiter,
The processor is
Executing the second step, the third step, and the fourth step for each of the divisions;
A data analysis system characterized by this.

The data analysis system according to claim 1,
The time series data is access log data including a session number indicating an access unit to a website.
The processor is
Executing the second step, the third step, and the fourth step for each of the access log data having the same session number;
A data analysis system characterized by this.

The data analysis system according to claim 1,
The computer further includes an output unit,
The processing unit outputs the extracted time series pattern together with the number counted in the second step and the number counted in the third step to the output unit.
A data analysis system characterized by this.

The data analysis system according to claim 1,
The computer further includes an input unit capable of inputting the predetermined number of times and the predetermined number.
A data analysis system characterized by this.

A data analysis system according to claim 3, wherein
The computer further includes an input unit capable of inputting a condition for setting the break in the time series data.
A data analysis system characterized by this.

A data analysis method for analyzing data in which a plurality of sets of information indicating an event, an ID to which the event belongs, and an order relationship between the events is stored by a computer including a processing unit and a storage unit,
Data in which the events having the same ID are arranged according to the order relationship is time-series data, and a permutation with repetition in which one or more of the events are arranged in order is a time-series pattern,
The processor is
In order to extract a frequent repeated time series pattern that is the time series pattern repeated a predetermined number of times or more in each time series data in a predetermined number or more of the time series data,
For the time-series pattern whose number of repetitions is unknown, counting the number of repetitions in each of the time-series data;
Counting the number of the time-series data in which the number of repetitions is a predetermined number or more as an appearance frequency;
Extracting the time series pattern in which the appearance frequency is a predetermined number or more;
The data analysis method characterized by performing.

A data analysis method according to claim 8, comprising:
The time series data is data in which a break exists,
The processing unit executes each of the steps for the time-series data for each of the divisions,
A data analysis method characterized by the above.

A data analysis method according to claim 8, comprising:
The time series data is access log data having a session number indicating an access unit to a website.
The processing unit executes each of the steps for each access log data having the same session number.
A data analysis method characterized by the above.

The data analysis method according to claim 8, comprising:
The computer further includes a display unit,
The processing unit displays the number of repetitions and the appearance frequency corresponding to the extracted time-series pattern on the display unit.
A data analysis method characterized by the above.

A data analysis method for performing analysis processing of data in which a plurality of sets of information indicating an event, an ID to which the event belongs, and an order relationship between the events is stored by a computer including a processing unit and a storage unit,
The processor is
Data in which events having the same ID are arranged in accordance with the order relationship is time series data, and a duplicate permutation in which one or more events are arranged in the forward direction is a time series pattern. In order to extract a frequently repeated time series pattern that is the time series pattern repeated a predetermined number of times or more in each series data,
A first step of setting checkpoints at predetermined intervals in each of the time series data;
A second step of counting, for each time-series pattern whose number of repetitions in each piece of time-series data is unknown, the number of times the time-series pattern is repeated in the range from one checkpoint to the next checkpoint in each piece of time-series data;
A third step of calculating an upper limit value of the number of repetitions of the time-series pattern in the time-series data from the sum of the number of repetitions already counted up to the checkpoint and the number of repetitions of each event appearing after the checkpoint;
A fourth step of counting the number of time-series data for which the calculated upper limit value is equal to or greater than a predetermined number of repetitions as the appearance frequency;
A fifth step of extracting a time-series pattern in which the counted appearance frequency is a predetermined number or more;
A sixth step of repeating the second to fifth steps until the last checkpoint for the extracted time-series pattern;
The data analysis method characterized by performing.
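
The checkpoint-based pruning in the steps above might be sketched as follows. This is one possible reading, not the claimed implementation. It assumes greedy non-overlapping subsequence matching for a "repetition", and it bounds the repetitions still possible after a checkpoint by how many times the remaining events could supply every event of the pattern, plus one for a partially matched occurrence in progress. All function and parameter names are hypothetical, patterns are assumed non-empty, and `interval` is assumed positive. For brevity the prefix is rescanned at every checkpoint, whereas an incremental implementation would carry the counts forward.

```python
from collections import Counter


def scan_prefix(sequence, pattern, end):
    """Greedy count of completed repetitions in sequence[:end].

    Also returns how many pattern events of an unfinished repetition
    have been matched so far.
    """
    count = pos = 0
    for event in sequence[:end]:
        if event == pattern[pos]:
            pos += 1
            if pos == len(pattern):
                count, pos = count + 1, 0
    return count, pos


def repetition_upper_bound(sequence, pattern, checkpoint):
    """Upper limit on the total repetitions, judged at `checkpoint`."""
    end = min(checkpoint, len(sequence))
    counted, pos = scan_prefix(sequence, pattern, end)
    remaining = Counter(sequence[end:])  # event counts after the checkpoint
    need = Counter(pattern)              # event counts per repetition
    extra = min(remaining[e] // n for e, n in need.items())
    # A repetition in progress may still complete if any events remain.
    pending = 1 if pos > 0 and end < len(sequence) else 0
    return counted + pending + extra


def mine_with_checkpoints(sequences, candidates, interval,
                          min_repetitions, min_frequency):
    """Prune candidates checkpoint by checkpoint using the upper limit."""
    longest = max((len(s) for s in sequences.values()), default=0)
    surviving = list(candidates)
    checkpoint = interval
    while True:
        kept = []
        for pattern in surviving:
            freq = sum(
                1
                for s in sequences.values()
                if repetition_upper_bound(s, pattern, checkpoint) >= min_repetitions
            )
            if freq >= min_frequency:
                kept.append(pattern)
        surviving = kept
        if checkpoint >= longest:  # last checkpoint: the bound is now exact
            break
        checkpoint += interval
    return surviving


# Hypothetical data: one event list per ID, already in order.
sequences = {
    "u1": list("ABABAB"),
    "u2": list("ABCAB"),
    "u3": list("AAB"),
}
candidates = [("A", "B"), ("B", "A"), ("A", "C")]
result = mine_with_checkpoints(sequences, candidates, interval=2,
                               min_repetitions=2, min_frequency=2)
```

Because the upper limit never underestimates the final count under these assumptions, a pattern discarded at an early checkpoint could not have met the thresholds at the last one, so pruning does not change the extracted result.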

The data analysis method according to claim 12, wherein:
The computer further includes an output unit,
The processing unit,
Upon completing processing up to the last checkpoint, outputs the number of repetitions and the appearance frequency corresponding to the extracted time-series pattern to the output unit,
A data analysis method characterized by the above.

The data analysis method according to claim 12, wherein:
The time-series data is data containing breaks that divide it into segments,
The processing unit executes each of the steps on the time-series data of each segment,
A data analysis method characterized by the above.

The data analysis method according to claim 12, wherein:
The time-series data is access log data including a session number indicating a unit of access to a website,
The processing unit executes each of the steps on each set of access log data having the same session number,
A data analysis method characterized by the above.