JP6903595B2

JP6903595B2 - Data analysis support system and data analysis support method

Info

Publication number: JP6903595B2
Application number: JP2018008112A
Authority: JP
Inventors: 岳志半田; 川崎　健治; 健治川崎; 高志津野
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2018-01-22
Filing date: 2018-01-22
Publication date: 2021-07-14
Anticipated expiration: 2038-01-22
Also published as: JP2019128646A; KR20200019741A; WO2019142391A1; KR102312685B1

Description

The present invention relates to a technique for supporting data analysis using an information processing device.

With the progress of IoT (Internet of Things) and big data technologies, there is a growing need to utilize data across multiple business systems and sensors. Correlation rule mining (basket analysis, correlation analysis) is a data analysis algorithm for the large volumes of data held by various business systems. It is a technique focused on finding correlations (correlation rules) among events that frequently occur together in the target data, and it is also applicable to non-numerical data. A correlation rule is expressed in a form such as "event A and event B tend to occur together" or "product C tends to be purchased together with product D", and such rules are used not only in data analysis but also in data search systems and information recommendation systems.

Indices of correlation strength include support (the rate of occurrence among all samples), confidence (the frequency of the combination of the premise part and the conclusion part), and lift (the degree to which events occur in combination relative to occurring alone). Useful rules for analysis are extracted by applying thresholds to these indices. However, if rules with low support (few cases) are retained, for example, a large number of rules remain and it becomes difficult to find useful rules among them. Conversely, if only rules with high support (many cases) are retained, the remaining rules tend to be self-evident, and it becomes difficult to find rules that are useful for business improvement or cause analysis.

As a technique for narrowing down the huge number of rules extracted by such correlation rule mining, a recommendation rule generation device has been proposed (see Patent Document 1). The device targets content consumption data (product purchases and the like) and comprises means for generating a plurality of correlation rules indicating relationships between contents, means for calculating the rarity of each correlation rule using the content consumption data, and means for narrowing down the correlation rules using the rarity to generate recommendation rules. For each correlation rule, the rarity is calculated using the following values acquired from the content consumption data: the total number of contents in the condition part and the consequent part of the correlation rule, the number of users matching the correlation rule, the number of contents consumed by each user matching the correlation rule, the total number of contents consumed, and the number of users who consumed each content in the condition part and the consequent part of the correlation rule.

Japanese Unexamined Patent Publication No. 2014-222398

Rakesh Agrawal and Ramakrishnan Srikant, "Fast algorithms for mining association rules", Proceedings of the 20th International Conference on Very Large Data Bases, pp. 487-499, 1994

In the initial stage of analysis work that cross-references data from multiple business systems, the purpose of the analysis is often clear while the data to be analyzed is not. That is, it is often unclear which data should be used to obtain analysis results for verifying and evaluating the analysis purpose.

In such cases, the analyst often starts by cross-referencing data that seems analyzable, based on data structure specifications such as ER (Entity-Relationship) diagrams of the source business systems of the data to be analyzed. In other words, the analyst first combines data tables that are close to each other in terms of the data structure of the source business system. This is because, to obtain useful analysis results within limited analysis man-hours, the analyst first selects combinations of data that are likely to yield results.

Because this approach is common in the data analysis of multiple business systems, analysis is often not carried out sufficiently on data tables that span business systems, or on data tables that are far apart in terms of the data structure of the source business system even within a single business system. In addition, combining data tables that are close to each other in terms of the data structure of the source business system often yields combinations of analysis target data that are not unusual (commonplace). The analysis results are then often self-evident and may not lead to results that are useful for business improvement or for investigating the causes of events.

From the above, correlations in data that form an unexpected combination of analysis target data, such as "data tables that span business systems" or "data tables that are far apart in terms of the data structure of the source business system", can be particularly useful analysis results.

However, the rarity obtained by the rarity calculation that characterizes the prior art indicates the probability that the correlation rule occurs. It does not take into account unexpected combinations of data such as the aforementioned "data tables that span business systems" or "data tables that are far apart in terms of the data structure of the source business system". Consequently, the prior art cannot narrow down, from the huge number of extracted correlation rules, those rules whose premise part and conclusion part contain an unexpected combination of attributes, and therefore cannot present rules that are useful to the analyst.

One aspect of the present invention is a data analysis support system. The system comprises: a storage device that stores an analysis target data table including a plurality of data tables; a correlation rule extraction unit that analyzes the analysis target data table and extracts a plurality of correlation rules indicating correlations among attributes included in the data tables; a data relational model generation unit that generates a data relational model indicating the relationships among the plurality of data tables; and an unexpectedness calculation unit that, for each correlation rule, generates combinations of attributes of the premise part and the conclusion part of the correlation rule, obtains, for each combination, the distance between the attributes in the data relational model, and calculates an unexpectedness based on the distance.

Another aspect of the present invention is a data analysis support method executed by an information processing device including an input device, an output device, a storage device, and a processing device. The method comprises: a first step of preparing, in the storage device, an analysis target data table including a plurality of data tables; a second step of generating a data relational model indicating the relationships among the plurality of data tables; a third step of analyzing the analysis target data table and extracting a plurality of correlation rules indicating correlations among attributes included in the data tables; and a fourth step of, for each correlation rule, generating combinations of attributes of the premise part and the conclusion part of the correlation rule, obtaining, for each combination, the distance between the attributes in the data relational model, and calculating an unexpectedness based on the distance.

Unexpected rules can be narrowed down from a huge number of correlation rules, so that information useful for business improvement and cause analysis can be grasped quickly.

A block diagram showing a configuration example of the data analysis support system.

A table showing a format example of the analysis target data stored in the analysis target data storage unit.

A conceptual diagram showing format examples of the entity table and the relation table in the data relational model storage unit, together with the principle of relation generation.

A table showing a data format example of the correlation rule storage table in the correlation rule storage unit.

A plan view showing an example of a screen on which the analyst imports analysis target data, calculates correlation rules, and narrows down correlation rules.

A block diagram showing a hardware configuration example of the data analysis support system.

A flowchart showing the sequence of steps in which the data analysis support system generates a data relational model, extracts correlation rules, and calculates unexpectedness.

A flowchart showing details of the procedure by which the data relational model generation unit generates a data relational model from the analysis target data tables.

A flowchart showing details of the procedure by which the data joining unit joins the analysis target data tables into a single data table.

A flowchart showing details of the procedure by which the unexpectedness calculation unit calculates the unexpectedness of each correlation rule based on the data relational model.

Embodiments of the present invention are described in detail below with reference to the drawings. However, the present invention is not to be construed as limited to the description of the embodiments below. Those skilled in the art will readily understand that the specific configuration can be changed without departing from the idea or gist of the present invention.

In the configurations of the invention described below, the same reference numerals are used across different drawings for identical parts or parts having similar functions, and duplicate descriptions may be omitted.

When there are multiple elements having the same or similar functions, they may be described with the same reference numeral and different subscripts. However, when there is no need to distinguish among the elements, the subscripts may be omitted.

Notations such as "first", "second", and "third" in this specification and the like are used to identify components and do not necessarily limit their number, order, or content. Numbers for identifying components are used per context, and a number used in one context does not necessarily denote the same component in another context. Furthermore, a component identified by one number is not precluded from also serving the function of a component identified by another number.

To facilitate understanding of the invention, the position, size, shape, range, and the like of each component shown in the drawings may not represent the actual position, size, shape, range, and the like. Therefore, the present invention is not necessarily limited to the position, size, shape, range, and the like disclosed in the drawings.

Components expressed in the singular in this specification include the plural unless the context clearly indicates otherwise.

One example of the embodiments described below is a data analysis support system comprising: a correlation rule extraction unit that analyzes analysis target data tables and extracts a plurality of correlation rules; a data relational model generation unit that generates a data relational model, composed of an entity table and a relation table, indicating the relationships among the analysis target data tables; and an unexpectedness calculation unit that, using the data relational model and the correlation rules extracted by the correlation rule extraction unit, calculates an unexpectedness for each correlation rule, for each combination of attributes of the premise part and the conclusion part of the rule, based on the distance between entities and the strength of relations in the data relational model.

FIG. 1 is a diagram showing a configuration example of the data analysis support system 100 in this embodiment. As shown in FIG. 1, the data analysis support system 100 is communicably connected to a user terminal 111. The data analysis support system 100 can be configured as, for example, a server, and the user terminal 111 as, for example, a personal computer. The two can be connected via, for example, a network.

The data analysis support system 100 according to this embodiment comprises, as functional components, an analysis target data storage unit 101, a data relational model storage unit 102, a correlation rule storage unit 103, a data acquisition unit 104, a data relational model generation unit 105, a data joining unit 106, a correlation rule extraction unit 107, an unexpectedness calculation unit 108, a rule recommendation unit 109, and a user interface unit 110.

The data acquisition unit 104 receives a data import request made by the analyst 112 on the user terminal 111, and acquires the analysis target data tables stored in the analysis target data storage unit 101.

FIG. 2 shows examples of the analysis target data tables stored in the analysis target data storage unit 101. In the example of FIG. 2, a train data table 1011 and a station data table 1012 are shown as analysis target data tables. The tables have column names 10111 and 10121, respectively, and store predetermined numerical or text data in each column. The data tables have, for example, the data format shown in FIG. 2, and data with a general tabular structure is targeted.

This embodiment presupposes that the analysis target data is tabular data or data with equivalent functionality, and it is applicable regardless of industry or field. In this embodiment, data from various business systems in the railway field is used as an example. As example data from such systems, two tables, the train data table 1011 and the station data table 1012, are defined in the analysis target data storage unit. Each table stores, for example, identification information indicating a target that acts as a subject or an object, and information on various physical quantities or statuses of that target.

The data relational model generation unit 105 generates a data relational model indicating the relationships among the data tables to be analyzed, and stores the generated model in the data relational model storage unit 102. The data relational model stored in the data relational model storage unit 102 consists of two tables: an entity table, which defines the table name of each data table in the model and the list of columns of each table, and a relation table, which defines the relations between the data tables in the model.

FIG. 3 shows a conceptual diagram of the data relational model stored in the data relational model storage unit 102. As described above, the data relational model includes an entity table 10210 and a relation table 10220.

The entity table 10210 lists the column names of each data table stored in the analysis target data storage unit 101. The entity table 10210 has, for example, the data format shown in FIG. 3 and includes a table name 10211 and the corresponding column names 10212. The relation table 10220 includes a first table 10221, a first-table column 10222, a second table 10223, and a second-table column 10224.

In the example shown in FIG. 3, two tables, the train data table (1011 in FIG. 2) and the station data table (1012 in FIG. 2), are defined in the entity table 10210. Six columns are defined for the train data table: operation date, line section, train number, destination, starting station, and terminal station. Seven columns are defined for the station data table 1012: train number, station name, through service onto another company's line, arrival time, departure time, delay time, and number of people staying.

The relation table 10220 defines the relations between the train data table 1011 and the station data table 1012. A relation is defined between the train number column of the train data table 1011 and the train number column of the station data table 1012. Similarly, relations are defined between the starting station and terminal station columns of the train data table 1011 and the station name column of the station data table 1012.
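
As an illustration of the structure described for FIG. 3, the entity table and the relation table can be sketched as plain Python data. This is a minimal sketch, not the patented implementation: the identifiers (`ENTITY_TABLE`, `Relation`, `dangling_relations`, and the English column names) are hypothetical names chosen for the example, and the consistency check is an added convenience that the source does not describe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """One row of the relation table: a link between two table columns."""
    first_table: str
    first_column: str
    second_table: str
    second_column: str


# Entity table: table name -> list of column names (FIG. 3 example).
# The English identifiers are illustrative renderings of the column names.
ENTITY_TABLE = {
    "train_data": [
        "operation_date", "line_section", "train_number",
        "destination", "starting_station", "terminal_station",
    ],
    "station_data": [
        "train_number", "station_name", "through_service",
        "arrival_time", "departure_time", "delay_time", "people_staying",
    ],
}

# Relation table: the three relations described for FIG. 3.
RELATION_TABLE = [
    Relation("train_data", "train_number", "station_data", "train_number"),
    Relation("train_data", "starting_station", "station_data", "station_name"),
    Relation("train_data", "terminal_station", "station_data", "station_name"),
]


def dangling_relations(entities, relations):
    """Return relations that refer to a table or column missing from entities."""
    return [
        r for r in relations
        if r.first_column not in entities.get(r.first_table, [])
        or r.second_column not in entities.get(r.second_table, [])
    ]
```

Calling `dangling_relations(ENTITY_TABLE, RELATION_TABLE)` returns an empty list for this example, since every relation endpoint is a defined column.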

The data joining unit 106 generates a single data table by joining the analysis target data tables stored in the analysis target data storage unit horizontally, using their columns as keys.
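
The horizontal join can be sketched as follows. This is a minimal sketch under stated assumptions: the source does not say whether the join is inner or outer, so an inner join is assumed here, and non-key column names are assumed not to collide between the two tables. The function name `join_tables` and the sample rows are hypothetical.

```python
def join_tables(left_rows, right_rows, key):
    """Inner-join two lists of dict rows on a shared key column.

    Assumption: non-key column names are distinct across the two tables;
    on a collision the right-hand value would overwrite the left-hand one.
    """
    # Index the right-hand rows by key value for constant-time lookup.
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)

    joined = []
    for left in left_rows:
        for right in index.get(left[key], []):
            merged = dict(left)
            merged.update({c: v for c, v in right.items() if c != key})
            joined.append(merged)
    return joined


# Illustrative rows only; the values are not taken from the source.
trains = [
    {"train_number": "T100", "destination": "A"},
    {"train_number": "T101", "destination": "B"},
]
stations = [
    {"train_number": "T100", "station_name": "S1", "delay_time": 3},
    {"train_number": "T100", "station_name": "S2", "delay_time": 0},
]
joined = join_tables(trains, stations, "train_number")
```

Each train row is repeated once per matching station row, so the joined table carries attributes from both source tables in a single row, which is the form the rule mining step operates on.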

The correlation rule extraction unit 107 performs correlation rule mining on the data table generated by the data joining unit 106 and stores the resulting correlation rules in the correlation rule storage unit 103. The extraction of correlation rules can be implemented using a known algorithm such as the Apriori algorithm (see Non-Patent Document 1).

Correlation rule mining is an analysis algorithm focused on finding events that frequently occur together in the analysis target data. Combinations of events that frequently occur together in the data, reflecting the simultaneity or relationship observed among the occurrences of multiple events, are extracted as rules, and such a rule is called a correlation rule. For example, when a relationship is observed in which an event Y occurs given an event X, it is written as "X⇒Y". The left side of the arrow (⇒) is called the premise part (event X) and the right side the conclusion part (event Y), and the rule indicates the probability that Y occurs when X has occurred.

As is well known, correlation rule mining uses three indices of correlation strength: support, confidence, and lift. Support is the proportion of all data that contains a given event. Confidence is the proportion of cases in which the conclusion-part event occurs given that the premise-part event has occurred, and represents the strength of the association between the events in the premise part and the conclusion part. Lift is the confidence divided by the proportion of all data in which the conclusion-part event occurs. It expresses, as a ratio, how much more often the conclusion-part event occurs given the premise-part event than it occurs on its own.

For example, suppose that event Y occurs in 60% of the cases in which event X occurs, that events X and Y occur together in 20% of all data, and that the proportion in which Y occurs given X is 2.5 times the proportion in which Y occurs on its own in all data. The correlation rule "X⇒Y" then has support = 20%, confidence = 60%, and lift = 2.5. The premise part and the conclusion part may each contain multiple events. The "events" contained in the premise and conclusion parts are also sometimes called "items" or "attributes". In the following description, they are called "attributes" rather than "events".
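
The three indices can be computed directly from row counts. The following is a minimal sketch using the standard definitions; the function name `rule_metrics` and the synthetic rows are hypothetical, and the row counts were chosen so that the result reproduces the example of support = 20%, confidence = 60%, and lift = 2.5.

```python
def rule_metrics(rows, premise, conclusion):
    """Return (support, confidence, lift) for the rule premise => conclusion.

    rows is a list of sets of attributes; premise and conclusion are sets.
    Assumes the premise and the conclusion each occur in at least one row.
    """
    n = len(rows)
    n_premise = sum(1 for r in rows if premise <= r)
    n_conclusion = sum(1 for r in rows if conclusion <= r)
    n_both = sum(1 for r in rows if (premise | conclusion) <= r)

    support = n_both / n                   # share of rows containing X and Y
    confidence = n_both / n_premise        # share of X rows that also contain Y
    lift = confidence / (n_conclusion / n) # confidence relative to P(Y)
    return support, confidence, lift


# 75 synthetic rows: 15 with X and Y, 10 with X only, 3 with Y only, 47 with neither.
rows = [{"X", "Y"}] * 15 + [{"X"}] * 10 + [{"Y"}] * 3 + [set()] * 47
support, confidence, lift = rule_metrics(rows, {"X"}, {"Y"})
```

Here X appears in 25 rows and Y in 18 rows, so confidence is 15/25 and the baseline rate of Y is 18/75 = 24%, which gives a lift of 0.6 / 0.24.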

FIG. 4 shows a data format example of the correlation rule storage table 1030 held by the correlation rule storage unit 103. The correlation rule storage table 1030 includes, as data items, a premise part 1031, a conclusion part 1032, a support 1033, a confidence 1034, a lift 1035, and an unexpectedness 1036. In the example of FIG. 4, the correlation rule "train number (T100) ⇒ vehicle ID (M1-01)" has support = 8.30%, confidence = 60%, and lift = 2.3. In this example, the premise and conclusion parts hold not only the data values (T100 and M1-01 in the above rule) but also information on which column of which table each data value belongs to (in the above rule, the train number of the train data table 1011 and the vehicle ID of the vehicle data table), although the data tables are omitted from the notation. The unexpectedness is described later in the description of the unexpectedness calculation unit 108.

For each correlation rule extracted by the correlation rule extraction unit 107, the unexpectedness calculation unit 108 calculates an unexpectedness by checking the events (attributes) contained in the premise and conclusion parts of the rule against the data relational model generated by the data relational model generation unit 105, and stores it in the correlation rule storage unit 103. The calculated unexpectedness is stored in the unexpectedness column (FIG. 4) of the correlation rule storage table 1030 in the correlation rule storage unit 103.
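
One way to picture a distance-based unexpectedness is sketched below. This is only an illustration under loudly stated assumptions: this passage says that the score is derived from the distance between attributes in the data relational model, but it does not give the formula. The sketch therefore assumes that distance is the number of relation hops between the tables that hold the attributes, and that the score is the mean pairwise distance, capped and normalized by a caller-supplied `max_distance`. The names `table_distance`, `unexpectedness`, and the table `track_data` are hypothetical.

```python
from collections import deque
from itertools import product


def table_distance(relations, start, goal):
    """Number of relation hops between two tables; None if unreachable."""
    if start == goal:
        return 0
    graph = {}
    for a, b in relations:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    seen, queue = {start}, deque([(start, 0)])
    while queue:  # breadth-first search yields the shortest hop count
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def unexpectedness(relations, premise_tables, conclusion_tables, max_distance):
    """Hypothetical score in [0, 1]; not the formula used in the source.

    Averages the capped, normalized table distance over every
    (premise attribute, conclusion attribute) pair. Unreachable pairs
    count as max_distance. Requires max_distance > 0.
    """
    scores = []
    for p, c in product(premise_tables, conclusion_tables):
        d = table_distance(relations, p, c)
        capped = max_distance if d is None else min(d, max_distance)
        scores.append(capped / max_distance)
    return sum(scores) / len(scores)


# Illustrative relation graph: train_data - station_data - track_data.
relations = [("train_data", "station_data"), ("station_data", "track_data")]
far = unexpectedness(relations, ["train_data"], ["track_data"], max_distance=2)
near = unexpectedness(relations, ["train_data"], ["train_data"], max_distance=2)
```

With this scoring, a rule linking attributes from tables two hops apart scores higher than a rule whose attributes sit in the same table, which matches the intent of favoring combinations of distant tables.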

The rule recommendation unit 109 receives a correlation rule narrowing request from the analyst together with four analyst-defined thresholds, one each for support, confidence, lift, and unexpectedness. It narrows down the correlation rules by applying the thresholds to all correlation rules stored in the correlation rule storage unit 103, and returns the result to the user terminal 111. The threshold processing keeps rules whose value is higher than the threshold set for each index and removes rules whose value is equal to or lower than the threshold. Only rules whose values exceed the thresholds for all four indices (support, confidence, lift, and unexpectedness) are kept.
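
The threshold processing can be sketched as a simple filter. This is a minimal sketch: the `Rule` class, `filter_rules`, and the second sample rule are hypothetical, while the first rule and the threshold values follow the FIG. 5 example given in this description. A rule is kept only when every index is strictly greater than its threshold.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    premise: str
    conclusion: str
    support: float         # percent
    confidence: float      # percent
    lift: float
    unexpectedness: float  # percent


METRICS = ("support", "confidence", "lift", "unexpectedness")


def filter_rules(rules, thresholds):
    """Keep rules whose four indices all strictly exceed their thresholds."""
    return [
        r for r in rules
        if all(getattr(r, m) > thresholds[m] for m in METRICS)
    ]


thresholds = {"support": 3.0, "confidence": 20.0, "lift": 1.5, "unexpectedness": 80.0}
rules = [
    # Values from the FIG. 5 example.
    Rule("train number (T102)", "gradient (0.5-1.0%)", 7.5, 50.0, 2.6, 100.0),
    # Hypothetical rule whose unexpectedness equals the threshold, so it is removed.
    Rule("train number (T100)", "vehicle ID (M1-01)", 8.3, 60.0, 2.3, 80.0),
]
kept = filter_rules(rules, thresholds)
```

The second rule is dropped even though three of its indices pass, because a value equal to the threshold does not count as exceeding it.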

The user interface unit 110 generates a screen 1101 on which the analyst imports analysis target data, calculates correlation rules, and narrows down correlation rules.

FIG. 5 shows a plan view of an example of the screen generated by the user interface unit 110. As shown in FIG. 5, this example screen consists of a header section 1102, a threshold setting section 1103, a correlation rule list display section 1104, and a data relational model display section 1105. The header section 1102 contains a data import button with which the analyst imports analysis target data, a correlation rule calculation button for extracting correlation rules from the analysis target data and calculating unexpectedness, and a correlation rule narrowing button for narrowing down the extracted correlation rules using the thresholds set in the threshold setting section 1103.

When the analyst 112 presses the data import button, a data acquisition request is sent from the user terminal 111 to the data acquisition unit 104. Once the data has been imported from the analysis target data storage unit 101, the data relational model generation unit 105 generates a data relational model, and the result is displayed in the data relational model display section 1105 as, for example, an ER diagram. The analyst may adjust or change the generated model according to the purpose of the analysis, the analyst's knowledge, and so on, using the entity add/edit button, the relation add button, and the delete button. The analyst 112 may also select arbitrary data tables to import rather than importing all data tables in the analysis target data storage unit 101. In that case, the names of the data tables selected by the analyst are sent to the data acquisition unit 104 together with the data acquisition request.

When the analyst 112 presses the correlation rule calculation button, the correlation rule extraction unit 107 extracts correlation rules from the data table generated by the data joining unit 106, and the unexpectedness calculation unit 108 calculates the unexpectedness of each extracted correlation rule based on the data relational model. When the unexpectedness has been calculated for all rules, all correlation rules are listed in the correlation rule list display unit 1104.

When the analyst 112 presses the correlation rule narrowing button, the thresholds set in the threshold setting unit 1103 for support, certainty, lift, and unexpectedness are transmitted to the rule recommendation unit 109 together with a rule recommendation request. The result of the narrowing performed by the rule recommendation unit 109 is displayed on the screen 1101.

In the example of FIG. 5, the thresholds are set to support = 3.0%, certainty = 20.0%, lift = 1.5, and unexpectedness = 80.0%. As a result, among the extracted correlation rules, those whose value exceeds the threshold for every index are displayed in the correlation rule list display unit 1104 as the narrowed-down rules. In the example of FIG. 5, the correlation rule "train number (T102) ⇒ gradient (0.5-1.0%)" has support = 7.5%, certainty = 50%, lift = 2.6, and unexpectedness = 100%. Every index is above the threshold set in the threshold setting unit 1103, so the rule remains. The details of these functional components are described later with reference to flowcharts.
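The narrowing described above can be sketched as follows. This is a minimal illustration only: the `Rule` class, the `narrow_down` function, and the second rule in the list are hypothetical and do not appear in the specification; the thresholds and the index values of the first rule are those of the FIG. 5 example.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    name: str
    support: float         # percent
    certainty: float       # percent
    lift: float
    unexpectedness: float  # percent


def narrow_down(rules, thresholds):
    """Keep only the rules whose every index is strictly above its threshold."""
    return [
        r for r in rules
        if all(getattr(r, key) > limit for key, limit in thresholds.items())
    ]


# Thresholds of the FIG. 5 example.
thresholds = {"support": 3.0, "certainty": 20.0, "lift": 1.5, "unexpectedness": 80.0}

rules = [
    Rule("train number (T102) => gradient (0.5-1.0%)", 7.5, 50.0, 2.6, 100.0),
    Rule("hypothetical low-lift rule", 10.0, 60.0, 1.1, 90.0),  # invented for contrast
]

kept = narrow_down(rules, thresholds)
print([r.name for r in kept])
```

Only the first rule survives, because the invented second rule has a lift below 1.5.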

FIG. 6 is a diagram showing an example hardware configuration of the data analysis support system 100. The data analysis support system 100 includes a CPU (central processing unit) 201, an HDD (magnetic disk device) 202, a memory 203, an input unit 204, a display unit 205, and a communication unit 206. The CPU 201 performs data input/output, reading, and storage as well as various other processing. The HDD 202 is a device that stores data, and the memory 203 is a device that temporarily stores programs and data. The two are collectively referred to as the storage device. The input unit 204 is an input device that accepts operation input from the user. The display unit 205 is a device that displays data to the user and is one kind of output device. The communication unit 206 is a device that communicates with the user terminal 111 to send and receive data. Each of these devices can be realized as a component of a general-purpose computer.

The analysis target data storage unit 101, the data relational model storage unit 102, and the correlation rule storage unit 103 of FIG. 1 are realized, for example, by the HDD 202. The data acquisition unit 104, the data relational model generation unit 105, the data joining unit 106, the correlation rule extraction unit 107, the unexpectedness calculation unit 108, and the rule recommendation unit 109 of the first embodiment are realized, for example, by the CPU 201 executing a program stored in the memory 203 and controlling hardware such as the CPU 201, the HDD 202, the memory 203, the input unit 204, the display unit 205, and the communication unit 206.

The data analysis support system 100 described above may be configured as a single computer. Alternatively, any of the CPU 201, the HDD 202, the memory 203, the input unit 204, and the display unit 205 may be provided by another computer connected over a network via the communication unit 206. In this embodiment, functions equivalent to those implemented in software can also be realized in hardware such as an FPGA (Field Programmable Gate Array) or an ASIC (Application Specific Integrated Circuit).

FIG. 7 is an overall flowchart showing the series of steps by which the data analysis support system 100 generates a data relational model, extracts correlation rules, and calculates unexpectedness.

The data acquisition unit 104 receives the request to import analysis target data that the analyst 112 entered on the screen of FIG. 5 displayed on the user terminal 111, and acquires the analysis target data tables from the analysis target data storage unit 101. The data relational model generation unit 105 then generates a data relational model for the acquired data tables (S301).

The data joining unit 106 generates a single data table by horizontally inner-joining the analysis target data tables, using a time-series data item as the key (S302).

The correlation rule extraction unit 107 receives the correlation rule extraction request that the analyst 112 entered on the screen of FIG. 5 displayed on the user terminal 111, and extracts correlation rules (S303).
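The specification does not state which extraction algorithm is used in S303. As a minimal sketch, the three conventional indices of a single candidate rule can be computed from the joined table as below, assuming the usual definitions (support = share of records matching both parts, certainty = share of premise-matching records that also match the conclusion, lift = certainty divided by the share of records matching the conclusion). The function name `rule_indices` and the toy records are hypothetical.

```python
def rule_indices(rows, premise, conclusion):
    """Return (support, certainty, lift) of the rule premise => conclusion.

    rows: list of dicts, one per record of the joined table.
    premise, conclusion: dicts of column -> value that must all match.
    """
    def matches(row, cond):
        return all(row.get(col) == val for col, val in cond.items())

    n = len(rows)
    n_premise = sum(matches(r, premise) for r in rows)
    n_conclusion = sum(matches(r, conclusion) for r in rows)
    n_both = sum(matches(r, premise) and matches(r, conclusion) for r in rows)

    support = n_both / n if n else 0.0
    certainty = n_both / n_premise if n_premise else 0.0
    lift = certainty / (n_conclusion / n) if n_conclusion else 0.0
    return support, certainty, lift


# Toy records, invented for illustration only.
rows = [
    {"train number": "T102", "gradient": "0.5-1.0%"},
    {"train number": "T102", "gradient": "0.0-0.5%"},
    {"train number": "T200", "gradient": "0.5-1.0%"},
    {"train number": "T200", "gradient": "0.0-0.5%"},
    {"train number": "T102", "gradient": "0.5-1.0%"},
]

support, certainty, lift = rule_indices(
    rows, {"train number": "T102"}, {"gradient": "0.5-1.0%"}
)
print(support, certainty, lift)
```

A full extraction step would enumerate many candidate rules (for example with a frequent-itemset algorithm) and evaluate each in this way.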

For each correlation rule extracted by the correlation rule extraction unit 107, the unexpectedness calculation unit 108 calculates the unexpectedness by checking the rule against the data relational model generated by the data relational model generation unit 105 (S304).

The rule recommendation unit 109 receives the correlation rule narrowing request and the thresholds for support, certainty, lift, and unexpectedness that the analyst entered on the screen of FIG. 5 displayed on the user terminal 111. It narrows down the rules by applying the thresholds to the support, certainty, lift, and unexpectedness calculated for each correlation rule, and returns the result to the user terminal 111 (S305).

Note that the data relational model generation S301 may be performed after the correlation rule extraction S303. Alternatively, the model may be created and saved in advance, before the processing of FIG. 7.

The processing performed by the data relational model generation unit 105 is described in detail later with reference to the flowchart of FIG. 8, that of the data joining unit 106 with reference to the flowchart of FIG. 9, and that of the unexpectedness calculation unit 108 with reference to the flowchart of FIG. 10.

FIG. 8 is a flowchart showing the details of step S301, in which the data relational model generation unit 105 generates a data relational model from the analysis target data tables.

For every analysis target data table acquired by the data acquisition unit, the data relational model generation unit 105 obtains the list of column names of the table and stores it in the entity table 10210 (see FIG. 3) of the data relational model (S3011).

A loop is executed once for each way of choosing two tables from all the acquired data tables (S3012).

Next, for the two tables chosen in S3012, a loop is executed as many times as the product of the numbers of columns in the two tables (S3013). This is equivalent to fixing one column of the first table and processing it against every column of the second table, and repeating this for each column of the first table.

The name of a column defined in one of the two tables chosen in S3012 is compared with the name of a column defined in the other table (S3014).

It is determined whether the names of the compared columns match partially or exactly (S3015).

If the names match at least partially, it is determined that a relation exists between the two columns, and the relation is stored in the relation table 10220 of the data relational model storage unit 102 (S3016).

The data relational model generation process is now described for the case where the analysis target data consists of the train data table 1011 and the station data table 1012 shown in FIG. 2. The table names of the train data table 1011 and the station data table 1012 and the column names of each table are obtained, and the result is stored in the entity table 10210 shown in FIG. 3.

Next, the ways of choosing two tables from all the acquired data tables are enumerated. In this example there are two target data tables, so there is only one way of choosing two of them, and the loop is executed only once.

Next, since the train data table 1011 has 6 columns and the station data table 1012 has 7, the loop is executed 6 × 7 = 42 times. First, it is determined whether the name of the execution date column of the train data table 1011 partially matches the name of each column of the station data table 1012 (7 iterations in total). In the same way, the remaining 5 columns of the train data table 1011 are each checked for a partial match against all columns of the station data table 1012.

In this example, the train number column of the train data table 1011 partially matches the train number column of the station data table 1012, so it is determined that relation 30001 exists between these columns, and the result is stored in the relation table 10220 shown in FIG. 3. Likewise, the starting station and terminal station columns of the train data table 1011 each partially match the station name column of the station data table 1012, so it is determined that relation 30002 exists between these columns, and the result is stored in the relation table 10220.
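Steps S3012 to S3016 can be sketched as follows. This is an illustration under stated assumptions: the specification does not define "partial match" precisely, so the sketch assumes that two English column names match when they share at least one word; the function names and the reduced column lists (4 and 3 columns instead of the 6 and 7 of FIG. 2) are hypothetical.

```python
from itertools import combinations, product


def partially_match(name_a, name_b):
    """Assumed matching rule: the two column names share at least one word."""
    return bool(set(name_a.split()) & set(name_b.split()))


def find_relations(tables):
    """tables: dict of table name -> list of column names.

    Returns (table_a, column_a, table_b, column_b) tuples,
    one per column pair judged to have a relation.
    """
    relations = []
    for (ta, cols_a), (tb, cols_b) in combinations(tables.items(), 2):  # S3012
        for ca, cb in product(cols_a, cols_b):                          # S3013
            if partially_match(ca, cb):                                 # S3014, S3015
                relations.append((ta, ca, tb, cb))                      # S3016
    return relations


# Reduced, hypothetical column lists loosely modelled on the example.
tables = {
    "train": ["execution date", "train number", "starting station", "terminal station"],
    "station": ["train number", "station name", "arrival time"],
}

relations = find_relations(tables)
for rel in relations:
    print(rel)
```

With these inputs the sketch finds three column pairs: train number with train number, and starting station and terminal station each with station name.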

Using the data stored in the entity table 10210 and the relation table 10220 by the data relational model generation process, the data relational model can be represented as a schematic class diagram, as shown in the data relational model display unit 1105 of FIG. 5.

The train data table 1011 and the station data table 1012 shown in FIG. 2 are represented as a train class and a station class, respectively, and the relation between the two data tables is represented by a line connecting the train class and the station class. The example data relational model of FIG. 5 also shows classes such as vehicle and track, which are omitted from FIG. 2. To improve readability, some relations are omitted from this example, such as the relation between the vehicle entity and the ground equipment entity (the speed log and indoor temperature log columns of the vehicle class partially match the operation log and alarm log columns of the ground equipment class, so a relation is determined to exist).

The relations in the data relational model generation process need not be limited to relations between the structures of the analysis target data tables. They may also define hierarchical relations between structures that are specific to an industry, or relations of proximity or order along a position or route. In the railway field, for example, a train is made up of vehicles, and each vehicle is made up of various vehicle parts, giving a train, vehicle, vehicle part hierarchy. By defining such a hierarchy in advance, events occurring in the same structure can be defined. As for proximity or order along a position or route, defining the order of stations and information on lines running in parallel in advance makes it possible to define relations in which events propagate between adjacent stations, or propagate to transfer routes or to trains of a given structure.

FIG. 9 is a flowchart showing the details of step S302, in which the data joining unit 106 joins the analysis target data tables into a single data table.

The data joining unit 106 loops over all the analysis target data tables acquired by the data acquisition unit 104 (S3021).

For each column of the table, the data values defined in the column are obtained and the data type is determined (S3022).

It is determined whether the data type found in S3022 is a timestamp type, a date type, or a time type (S3023).

If the column is of timestamp, date, or time type, it is determined to be a column representing a time series (S3024).

After the data types have been determined for all analysis target data tables and the time-series columns have been identified, the data tables are horizontally inner-joined using the columns identified as time series as keys, matching columns of the same data type. The analysis target data thereby becomes a single data table (S3025).
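Steps S3022 to S3025 can be sketched with the standard library alone. The sketch assumes each table is a list of row dictionaries whose time-related values are already Python `datetime`, `date`, or `time` objects; the function names, column names, and sample rows are hypothetical.

```python
from datetime import date, datetime, time

TIME_TYPES = (datetime, date, time)


def time_series_columns(rows):
    """Columns whose values are all timestamp, date, or time objects (S3022 to S3024)."""
    columns = list(rows[0].keys()) if rows else []
    return [c for c in columns if all(isinstance(r[c], TIME_TYPES) for r in rows)]


def inner_join(left, right, left_key, right_key):
    """Horizontally inner-join two tables on the given key columns (S3025)."""
    index = {}
    for r in right:
        index.setdefault(r[right_key], []).append(r)
    joined = []
    for l in left:
        for r in index.get(l[left_key], []):
            merged = dict(l)
            # Copy the right-hand columns except the duplicate key column.
            merged.update({k: v for k, v in r.items() if k != right_key})
            joined.append(merged)
    return joined


# Hypothetical sample tables.
train_log = [
    {"time": datetime(2018, 1, 22, 10, 0), "train number": "T102"},
    {"time": datetime(2018, 1, 22, 10, 1), "train number": "T200"},
]
track_log = [
    {"measured at": datetime(2018, 1, 22, 10, 0), "gradient": "0.5-1.0%"},
    {"measured at": datetime(2018, 1, 22, 10, 5), "gradient": "0.0-0.5%"},
]

left_key = time_series_columns(train_log)[0]
right_key = time_series_columns(track_log)[0]
joined = inner_join(train_log, track_log, left_key, right_key)
print(joined)
```

Only the 10:00 record appears in the result, because an inner join keeps rows whose key exists in both tables.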

In the example above, S3022 determines the data type by analyzing the data values column by column. If the column on which the data tables are to be joined is already known, however, a user-defined data table that specifies which column represents the series may be prepared in advance, and the determination may be made by referring to this user-defined table. For example, a track inspection and measurement log table may have a kilometrage column indicating the position on the track to which each inspection result applies. Such data is a position series rather than a time series, and it may be desirable to join the data tables on kilometrage. In this case, kilometrage is defined in the user-defined table in advance. The user-defined table is then consulted to determine whether each analysis target data table has a column containing kilometrage, and the data tables are horizontally joined using the columns identified as kilometrage as keys.

The data values of the column used as the join key may differ between data tables in their minimum unit or in the timing of data acquisition. For example, the time column of one table may hold data acquired every 30 seconds while another table holds data acquired every minute, so that columns representing the same time have different minimum units. Even between tables that both use a 30-second unit, the acquisition timing may differ, so that the base times are, for example, "10:00:05" and "10:00:12". In such cases, at the analyst's request, the analysis target data tables may be preprocessed so that the values of the time columns are aligned to a common minimum unit or to a coarser unit.
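One possible form of this preprocessing is to round every timestamp down to a common unit before joining. The sketch below is an assumption about how such alignment might be done, not a method stated in the specification; `floor_to_unit` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def floor_to_unit(ts, unit):
    """Round a timestamp down to a multiple of `unit`, counted from midnight."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    steps = (ts - midnight) // unit  # timedelta // timedelta gives an int
    return midnight + steps * unit


one_minute = timedelta(minutes=1)
a = floor_to_unit(datetime(2018, 1, 22, 10, 0, 5), one_minute)
b = floor_to_unit(datetime(2018, 1, 22, 10, 0, 12), one_minute)
print(a, b, a == b)
```

After alignment to a one-minute unit, the base times "10:00:05" and "10:00:12" from the text fall on the same key value and can be joined.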

FIG. 10 is a flowchart showing the details of step S304, in which the unexpectedness calculation unit 108 calculates the unexpectedness of each correlation rule based on the data relational model.

After the correlation rule extraction unit 107 has finished its processing, the unexpectedness calculation unit 108 executes a loop once for each extracted correlation rule (S3041).

For the correlation rule being processed in the loop, the list of attributes contained in its premise part and conclusion part is obtained (S3042). As already described, an attribute is an event contained in the premise part or the conclusion part.

A loop is executed once for each way of choosing two attributes from the obtained attribute list (S3043).

The distance between the two chosen attributes in the data relational model is calculated (S3044). The distance between two attributes in the data relational model is the distance between the classes to which the attributes belong. The distance between classes can be taken, for example, as the number of relations connecting the classes in the data relational model shown in FIG. 5. For example, the distance between the train class and the track class is 2. The distance between the attribute "execution date" of the train class and the attribute "kilometrage" of the track class is therefore 2.

What is generally called an entity or table in a data model is called a class or object in an object model. In this specification, the terms entity, table, and class may be read interchangeably.

After the loop of S3043 has finished, the unexpectedness is calculated by dividing (the sum of the distances for those ways of choosing two attributes in which the distance between the two attributes is 2 or more) by (the sum of the distances in the data relational model over all ways of choosing two attributes from all attributes contained in the premise part and the conclusion part). The result is stored in the unexpectedness column of the rule in the correlation rule storage table 1030 (S3045).
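The calculation of S3045 can be written compactly as follows. The symbols are notation introduced here for illustration and are not used in the specification: A is the set of attributes in the premise part and conclusion part of a rule, d(a, b) is the distance between attributes a and b in the data relational model, and U is the unexpectedness.

```latex
U \;=\;
\frac{\displaystyle\sum_{\substack{\{a,b\}\subseteq A \\ d(a,b)\,\ge\,2}} d(a,b)}
     {\displaystyle\sum_{\{a,b\}\subseteq A} d(a,b)}
```

Since the numerator is a partial sum of the denominator, U lies between 0 and 1 whenever the denominator is nonzero.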

An example is now given of calculating the unexpectedness, based on the data relational model shown in the data relational model display unit 1105 of FIG. 5, for correlation rule 1 "train number (T102) ⇒ gradient (0.5-1.0%)" and correlation rule 2 "train number (T200) ⇒ alarm log (A200), indoor temperature log (26.0-26.5°C)".

For correlation rule 1 "train number (T102) ⇒ gradient (0.5-1.0%)", the two attributes "train number (T102)" and "gradient (0.5-1.0%)" are obtained as the attribute list. These correspond to the train number column of the train data table 1011 and the gradient column of the track data table, respectively. There is only one way of choosing two attributes from the two attributes in the premise part and conclusion part, so the loop is executed only once. As for the distance between the two attributes in the data relational model, "train number (T102)" belongs to the train class and "gradient (0.5-1.0%)" belongs to the track class, and the two classes are at distance 2, with the vehicle class between them. The sum of the distances over all ways of choosing two attributes is therefore 2, and the sum of the distances for those choices whose distance is 2 or more is also 2, so the unexpectedness is 2/2 = 1 (100%).

For correlation rule 2 "train number (T200) ⇒ alarm log (A200), indoor temperature log (26.0-26.5°C)", the three attributes "train number (T200)", "alarm log (A200)", and "indoor temperature log (26.0-26.5°C)" are obtained as the attribute list. These correspond to the train number column of the train data table 1011, the alarm log column of the ground equipment data table, and the indoor temperature log column of the vehicle data table, respectively. There are three ways of choosing two attributes from the three attributes in the premise part and conclusion part: "train number (T200) and alarm log (A200)", "train number (T200) and indoor temperature log (26.0-26.5°C)", and "alarm log (A200) and indoor temperature log (26.0-26.5°C)". The loop is therefore executed three times. The distances in the data relational model for each combination are as follows: the distance between "train number (T200)" and "alarm log (A200)" equals the distance between the train class and the ground equipment class, which is 3; the distance between "train number (T200)" and "indoor temperature log (26.0-26.5°C)" equals the distance between the train class and the vehicle class, which is 1; and the distance between "alarm log (A200)" and "indoor temperature log (26.0-26.5°C)" equals the distance between the vehicle class and the ground equipment class, which is 2. The sum of the distances over all ways of choosing two attributes is therefore 3 + 1 + 2 = 6, and the sum of the distances for those choices whose distance is 2 or more is 3 + 2 = 5, so the unexpectedness is 5/6 ≈ 0.83 (83%).
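The two worked examples can be reproduced with the following sketch. The class graph is an assumption: FIG. 5 is not reproduced in this text, so the edge list below was chosen only so that it yields the distances quoted above (train to track 2 via vehicle, train to ground equipment 3, train to vehicle 1, vehicle to ground equipment 2), and relations omitted from FIG. 5 are omitted here as well. The function names are hypothetical, and returning 0.0 when all distances are zero is also an assumption, since the specification does not cover that case.

```python
from collections import deque
from itertools import combinations

# Assumed class graph, chosen to reproduce the distances quoted in the text.
RELATIONS = [
    ("train", "station"),
    ("train", "vehicle"),
    ("vehicle", "track"),
    ("track", "ground equipment"),
]


def class_distance(start, goal, relations=RELATIONS):
    """Number of relations on the shortest path between two classes (BFS)."""
    neighbours = {}
    for a, b in relations:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError(f"no path between {start!r} and {goal!r}")


def unexpectedness(attribute_classes):
    """attribute_classes: class of each attribute in the premise and conclusion."""
    distances = [class_distance(a, b) for a, b in combinations(attribute_classes, 2)]
    total = sum(distances)
    if total == 0:
        return 0.0  # assumption: all attributes in one class, nothing unexpected
    return sum(d for d in distances if d >= 2) / total


rule1 = unexpectedness(["train", "track"])                        # 2 / 2
rule2 = unexpectedness(["train", "ground equipment", "vehicle"])  # (3 + 2) / 6
print(rule1, round(rule2, 2))
```

Breadth-first search is used because every relation counts as one step in the first embodiment, so the first time a class is reached is also the shortest route to it.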

With the unexpectedness calculated in this way, the more attribute pairs in the premise part and conclusion part of a correlation rule are at distance 2 or more in the model, the larger the unexpectedness becomes. In other words, the further a rule's combination of attributes departs from the ordinary relations between objects and events, the more unexpected the rule is judged to be. Introducing unexpectedness as an evaluation index thus makes it possible to quantitatively identify, among an enormous number of correlation rules, those whose combination of data is unexpected, and to narrow down the rules effectively.

In the example above, the numerator is (the sum of the distances for those ways of choosing two attributes in which the distance between the two attributes is 2 or more). The numerator may instead be (the sum of the distances for those ways of choosing two attributes in which the distance between the two attributes is m or more), where the parameter m can be set freely, for example to 3 or more. The larger m is, the more strongly the resulting unexpectedness emphasizes highly unexpected rules.
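The effect of the parameter m can be illustrated on the distances of correlation rule 2 (3, 1, and 2). The helper below is a hypothetical sketch that takes the pairwise distances directly; returning 0.0 for an all-zero input is an assumption.

```python
def unexpectedness_from_distances(distances, m=2):
    """Sum of distances that are m or more, divided by the sum of all distances."""
    total = sum(distances)
    if total == 0:
        return 0.0  # assumption: no distance at all means nothing unexpected
    return sum(d for d in distances if d >= m) / total


rule2_distances = [3, 1, 2]  # pairwise distances of correlation rule 2
with_m2 = unexpectedness_from_distances(rule2_distances, m=2)  # (3 + 2) / 6
with_m3 = unexpectedness_from_distances(rule2_distances, m=3)  # 3 / 6
print(round(with_m2, 2), with_m3)
```

Raising m from 2 to 3 lowers the score of this rule from about 0.83 to 0.5, since only the single pair at distance 3 still counts toward the numerator.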

In the first embodiment, the unexpectedness is calculated using only whether or not a relation exists between data tables. Alternatively, the unexpectedness calculation unit 108 may take the weight of each relation into account. The second embodiment shows an example in which relation weights are considered in the unexpectedness calculation.

The weight of a relation can be defined as the number of column pairs between two tables that are determined to have a relation. The weight expresses numerically how strongly the two tables are related in terms of data structure.

In the example of FIG. 3, the relation table 10220 defines a total of 3 records (pairs) of relations between the train data table 1011 and the station data table 1012. The weight between the train data table 1011 and the station data table 1012 is therefore 3. The larger the weight of a relation, the more likely it is that the data tables at its two ends will be selected together as analysis target data. The combination of data from the tables at the two ends of a heavily weighted relation is therefore considered unsurprising and self-evident.

Accordingly, when relation weights are taken into account, the unexpectedness calculation S304 of FIG. 7 corrects the distance in the data relational model before the calculation, for example by multiplying the distance between two tables by the reciprocal of the relation weight. This allows the unexpectedness to reflect the strength of the relationship in the data structure.

Depending on the combination of analysis target data, when the distance between two columns is calculated in the data relational model, there may be several paths connecting the two classes to which the columns belong, or there may be loops. In such cases, the distance between the two columns may be calculated, for example, by taking the length of the shortest path, or by imposing the constraint that a path already traversed is not traversed again.

To summarize, in the first embodiment the distance between attributes is obtained by counting the relations that lie between the data tables containing the attributes of the premise part and the conclusion part of a correlation rule. In the second embodiment, the weight of a relation is calculated as the number of column pairs whose names match partially or exactly between the two tables connected by the relation, and the reciprocal of the weight is used as the correction value of the relation between the two tables. The distance between attributes is then obtained by adding up the correction values of the relations that lie between the data tables containing the attributes of the premise part and the conclusion part of the correlation rule. The parameter m is basically a natural number in the first embodiment, but because of the weighting it need not be a natural number in the second embodiment.
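A minimal sketch of the corrected distance of the second embodiment follows, assuming that each relation costs the reciprocal of its weight and that the shortest path is used when several paths exist. The function name is hypothetical, and apart from the train to station weight of 3 taken from the FIG. 3 example, the weights are invented for illustration.

```python
import heapq


def corrected_distance(start, goal, weights):
    """Shortest path where each relation costs 1 / weight (Dijkstra's algorithm).

    weights: dict of (class_a, class_b) -> number of related column pairs.
    """
    neighbours = {}
    for (a, b), weight in weights.items():
        neighbours.setdefault(a, []).append((b, 1.0 / weight))
        neighbours.setdefault(b, []).append((a, 1.0 / weight))
    best = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == goal:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, cost in neighbours.get(node, []):
            cand = dist + cost
            if cand < best.get(nxt, float("inf")):
                best[nxt] = cand
                heapq.heappush(heap, (cand, nxt))
    raise ValueError(f"no path between {start!r} and {goal!r}")


weights = {
    ("train", "station"): 3,  # weight from the FIG. 3 example
    ("train", "vehicle"): 2,  # hypothetical
    ("vehicle", "track"): 1,  # hypothetical
}

d_station = corrected_distance("train", "station", weights)  # 1/3
d_track = corrected_distance("train", "track", weights)      # 1/2 + 1
print(round(d_station, 3), d_track)
```

Because the corrected distances are fractional, the threshold m applied to them need not be a natural number, in line with the remark above.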

According to the embodiments described above, the analyst can narrow down a huge number of correlation rules while judging which data combinations are obvious and which are unexpected, and can thus quickly grasp information useful for business improvement and cause analysis.

Data analysis support system 100, analysis target data storage unit 101, data relational model storage unit 102, correlation rule storage unit 103, data acquisition unit 104, data relational model generation unit 105, data joining unit 106, correlation rule extraction unit 107, unexpectedness calculation unit 108, rule recommendation unit 109, user interface unit 110

Claims

A data analysis support system comprising:
a storage device that stores analysis target data including a plurality of data tables;
a correlation rule extraction unit that analyzes the analysis target data and extracts a plurality of correlation rules indicating correlations among the attributes included in the data tables;
a data relational model generation unit that generates a data relational model showing the relationships among the plurality of data tables; and
an unexpectedness calculation unit that, for each correlation rule, generates combinations of the attributes of the premise part and the conclusion part of the correlation rule, obtains the distance between the attributes in the data relational model for each combination, and calculates an unexpectedness based on the distance.

The data analysis support system according to claim 1, wherein
the unexpectedness calculation unit calculates, for each of the correlation rules, the unexpectedness of the correlation rule
by dividing "the sum of the distances in the data relational model over all ways of selecting two attributes from all the attributes included in the premise part and the conclusion part of the correlation rule" by "the sum of the distances in the case where the distance between the two attributes in each such selection is m".

The data analysis support system according to claim 2, wherein
m is 2.

The data analysis support system according to claim 1, wherein
the unexpectedness calculation unit weights the associations between the data tables and corrects the distance by the weighting.

The data analysis support system according to claim 1, wherein
the data relational model is composed of an entity table indicating the attribute names included in each of the data tables and a relation table indicating whether the attribute names included in the respective data tables are related.

The data analysis support system according to claim 1, further comprising
a user interface unit that generates, for an analyst, a screen presenting the unexpectedness calculated for each correlation rule.

The data analysis support system according to claim 6, further comprising
a rule recommendation unit that receives a predetermined threshold for the unexpectedness and narrows the correlation rules down to those whose unexpectedness is higher than the received threshold.

A data analysis support method executed by an information processing apparatus including an input device, an output device, a storage device, and a processing device, the method comprising:
a first step of preparing, in the storage device, analysis target data including a plurality of data tables;
a second step of generating a data relational model showing the relationships among the plurality of data tables;
a third step of analyzing the analysis target data and extracting a plurality of correlation rules indicating correlations among the attributes included in the data tables; and
a fourth step of, for each correlation rule, generating combinations of the attributes of the premise part and the conclusion part of the correlation rule, obtaining the distance between the attributes in the data relational model for each combination, and calculating an unexpectedness based on the distance.

The data analysis support method according to claim 8, wherein
the fourth step calculates, for each of the correlation rules, the unexpectedness of the correlation rule
by dividing "the sum of the distances in the data relational model over all ways of selecting two attributes from all the attributes included in the premise part and the conclusion part of the correlation rule" by "the sum of the distances in the case where the distance between the two attributes in each such selection is m".

The data analysis support method according to claim 9, wherein
m is 2.

The data analysis support method according to claim 8, wherein
the fourth step weights the associations between the data tables and corrects the distance by the weighting.

The data analysis support method according to claim 8, wherein
the data relational model is composed of an entity table indicating the attribute names included in each of the data tables and a relation table indicating whether the attribute names included in the respective data tables are related.

The data analysis support method according to claim 8, further comprising:
a fifth step in which the output device displays a screen for inputting a threshold for the unexpectedness; and
a sixth step of receiving the threshold from the input device and narrowing the correlation rules down to those whose unexpectedness is higher than the threshold.

The data analysis support method according to claim 8, wherein
each of the plurality of data tables includes column names indicating attribute names,
the second step generates the data relational model showing the relationships among the plurality of data tables by associating, through a relation, two tables among the plurality of data tables whose column names match partially or exactly, and
the fourth step obtains the distance between the attributes by counting the number of the relations existing between the data tables that contain the attributes of the premise part and the conclusion part of the correlation rule.

The data analysis support method according to claim 14, wherein the fourth step
calculates the weight of each relation as the number of column-name pairs that match partially or exactly between the two tables associated by the relation,
uses the reciprocal of the weight as the correction value of the relation between the two tables, and
obtains the distance between the attributes by adding up the correction values of the relations existing between the data tables that contain the attributes of the premise part and the conclusion part of the correlation rule.