JP7227772B2

JP7227772B2 - DATA ASSET ANALYSIS SUPPORT SYSTEM AND DATA ANALYSIS METHOD

Info

Publication number: JP7227772B2
Application number: JP2019008115A
Authority: JP
Inventors: 慶子工藤; 純司野口
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2019-01-22
Filing date: 2019-01-22
Publication date: 2023-02-22
Anticipated expiration: 2039-01-22
Also published as: JP2020119110A

Description

The present invention relates to a technique for performing data asset analysis.

There are data analysis systems tailored to business problems, such as failure location diagnosis support systems and failure sign detection systems, that use various kinds of data such as sensor data from machines and equipment and data recorded by people.

Patent Document 1 discloses a technique for providing a data analysis system in which the immediacy and safety of data analysis can be flexibly adjusted according to the use and purpose of the analysis. The system of Patent Document 1 has a preprocessing program that preprocesses raw data from each store and an analysis program that analyzes the preprocessed data. Based on a destination list that associates each store with a preprocessing program, a data collection device preprocesses the data for each store, attaches a store ID to the preprocessed data, and transmits it to a center server. The center server reads out the analysis program corresponding to the store ID and analyzes the preprocessed data.

JP 2010-277534 A

In Patent Document 1, the information (ID) of each store that generates raw data is associated with a preprocessing program and an analysis program, and the preprocessing program and analysis program that match each store's raw data are selected. This achieves immediacy of data analysis and improves its efficiency.

However, in recent data asset analysis environments, the data owner and the analyst are completely separate, and the analyst may not be given any information about the type of data that the data owner provides. In other words, information that identifies the data type, such as the store information in Patent Document 1, may not be provided by the data owner.

In such an environment, a person must understand the characteristics of the data used for analysis and select a suitable analysis method. With the recent growth in data, the number of data types and the volume of data have become enormous, so understanding the data characteristics in advance is cumbersome, and preparing for the main analysis requires considerable man-hours and lead time.

Accordingly, an object of the present invention is to provide a data asset analysis support system and method that greatly shorten the period leading up to the data analysis that addresses a business problem, by reducing the man-hours of preliminary analysis when it is not known what kind of data the analysis target data is, that is, for data without data type information or the like.

One example of the data asset analysis support system of the present invention that solves the above problem is a data asset analysis support system having a processing unit, a storage device, and an output device. The storage device stores input data provided by a data owner and an analysis template that stores feature amounts of analysis target data in association with analysis flows. The processing unit generates a feature amount of the input data stored in the storage device, selects an analysis flow for the input data based on the generated feature amount and the analysis template, executes the selected analysis flow on the input data, and transmits the analysis result of the analysis flow to the output device. The output device is configured to output the analysis result of the analysis flow.

According to the present invention, the man-hours needed to understand the characteristics of data can be reduced, and the period until a business problem is solved can be shortened. In addition, an appropriate analysis flow can be selected according to the data characteristics.

Furthermore, according to the present invention, results can be obtained easily, which encourages data owners to accumulate data.

Problems, configurations, and effects other than those described above will become clear from the following description of the mode for carrying out the invention.

FIG. 1 is a hardware configuration diagram of the data asset analysis support system.
FIG. 2 is a processing flow diagram of the data asset analysis support system.
FIG. 3 is a diagram showing an overall overview of the data asset analysis support system.
FIG. 4 is a diagram showing an overview of the template execution processing of the data asset analysis support system.
FIG. 5 is a flowchart of feature amount generation in the data asset analysis support system.
FIG. 6 is a flowchart of feature amount generation in the data asset analysis support system.
FIG. 7 is a diagram showing an example of input data in which the date-and-time item in the first column has been sorted.
FIG. 8 is a flowchart of template selection and execution in the data asset analysis support system.
FIG. 9A is a diagram showing the analysis template table.
FIG. 9B is a diagram showing the analysis data table.
FIG. 10 is a diagram showing an output image of data analysis.
FIG. 11 is a diagram showing an output image of data analysis.
FIG. 12 is a diagram showing an output image of data analysis.

Embodiments of the present invention will be described below with reference to the drawings. The following description and drawings are examples for explaining the present invention, and parts are omitted or simplified as appropriate for clarity. The present invention can also be implemented in various other forms. Unless otherwise specified, each component may be singular or plural.

In the following description, various types of information may be described using expressions such as "table", but the information may be expressed in other data structures. To indicate that it does not depend on a data structure, an "XX table" or the like may be called "XX information". Expressions such as "identification information", "identifier", "name", "ID", and "number" are used when describing identification information, and these are interchangeable.

When there are multiple components having the same or similar functions, they are in principle described with the same reference numeral. Even when the functions are the same, however, the means for realizing them may differ.

The following description may also describe processing that is performed by executing a program. A program is executed by a processor serving as a central processing unit (for example, a CPU) and performs the defined processing while using storage resources (for example, memory) and/or interface devices (for example, communication ports) as appropriate, so the processor may be regarded as the subject of the processing.

A program may be installed on a device such as a computer from a program source. The program source may be, for example, a program distribution server or a computer-readable storage medium. When the program source is a program distribution server, the program distribution server may include a processor and a storage resource that stores the program to be distributed, and the processor of the program distribution server may distribute that program to other computers. In the following description, two or more programs may be implemented as one program, and one program may be implemented as two or more programs.

An overview of a representative embodiment of the present invention will be described. FIG. 1 is a hardware configuration diagram of the data asset analysis support system. As shown in FIG. 1, the data asset analysis support system 104 includes an interface 105 for communicating with external devices, an input/output device 130, a processing unit (Central Processing Unit) 106, a memory 108, and a storage device 107 composed of an HDD, an SSD, or the like.

The interface 105 is connected via a network 103 to an edge computer 150 that collects data from various sensors and to a data lake 151 that stores sensor data and the like, and receives the data to be analyzed from the edge computer 150 and the data lake 151. The output side of the input/output device 130 includes output devices such as a display device and a printer, and the input side includes input devices such as a mouse, a keyboard, and a touch panel. The memory 108 is composed of volatile memory such as DRAM (Dynamic Random Access Memory).

The CPU 106, which constitutes the processing unit, reads the various programs 110 to 115 stored in the storage device 107 into the memory 108 and executes them, thereby realizing the functions corresponding to each program. The memory 108 stores the programs executed by the CPU 106 and temporarily stores the processing results of the CPU 106.

Here, the various programs include a feature amount generation program 110, a template management program 111, a template selection program 112, a template execution program 113, an analysis result report output program 114, and a visualization execution program 115. The CPU 106 executes these programs to realize the various kinds of processing.

The storage device 107 stores: input data 120, which is the analysis target input from the edge computer 150 or the data lake 151; analysis data 121 generated from the input data by the feature amount generation program 110; and a plurality of templates 122, which are managed by the template management program 111, selected by the template selection program 112, and contain analysis flows for analyzing the input data. The template selection program 112 selects, from the plurality of templates, one template with which to analyze the input data.

FIG. 2 is a processing flow diagram of the data asset analysis support system 104. When the input data 120 to be analyzed is input from the edge computer 150 or the data lake 151, the feature amount generation program 110 generates feature amounts of the input data 120, such as the average value, distribution, and deviation value (A).

Based on the analysis data generated by the feature amount generation program 110, one template is selected from the plurality of templates 122 managed by the template management program 111 (B). Each template contains an analysis flow for analyzing the input data. The template execution program 113 executes the selected template, that is, executes its analysis flow (C), the analysis result report output program 114 transmits the analysis result to the output device, and the output device displays the analysis result (E).

Here, a template corresponds to an analysis flow selected on the basis of the feature amounts, and template execution corresponds to executing the analysis flow selected on the basis of the feature amounts.

The input data is also input to the existing visualization execution program 115, which visualizes the input data 120, for example its average value, distribution, and deviation value (D), and transmits the result to the output device. The output device displays the visualized data.

FIG. 2 shows predictive diagnosis and classification diagnosis as the analysis templates 122, but an analysis flow for failure diagnosis may also be included.

FIG. 3 is a diagram showing an overall overview of the data asset analysis support system. The visualization system 115, the analysis template management system 111, and the other systems described in FIG. 3 represent functions realized by the CPU 106 executing the various programs. Each system is composed of a control unit corresponding to the CPU 106 and a storage unit corresponding to the storage device 107 and the memory 108.

The input data 120 provided from the edge computer 150 or the data lake 151 is input to both the visualization system 115 and the feature amount generation system 110. The analysis template management system is managed by the template management program 111 and stores an analysis template table 122 (FIG. 9A) that manages data feature amounts in association with analysis flows. The information in the analysis template table may be registered in advance by an analyst.

The analysis results statistically processed by the visualization system 115 are transmitted to the output device 130 as an analysis result report 114. Examples of the images output by the output device 130 are shown in FIGS. 10 to 12.

Meanwhile, the feature amount generation system 110 calculates feature amounts of the input data, such as the average value, distribution, and deviation value, and stores the feature amounts calculated for each input data file in the analysis data table (FIG. 9B) in association with that file. Based on the feature amounts calculated by the feature amount generation system 110 and on the analysis template table 122, which manages data feature amounts in association with analysis flows, the template selection/execution systems 112 and 113 select an analysis flow for the input data and execute it.

The template selection/execution systems 112 and 113 are realized by the CPU 106 executing the template selection program and the template execution program of FIG. 1.

Like the output of the visualization system 115, the analysis results of the template selection/execution systems 112 and 113 are output by the output device as an analysis result report.

FIG. 4 is a diagram showing an overview of the template execution processing of the data asset analysis support system. Each system shown in FIG. 4 represents a function realized by the CPU 106 executing the various programs.

When the input data 120 is input to the feature amount generation system 110 (A), the feature amount generation processing 110a generates feature amounts of the input data, such as the average value, distribution, and deviation value, and the analysis data 121 is output (B).

From the feature amounts generated by the feature amount generation system 110, the template selection/execution systems 112 and 113 select a template based on the analysis template table (C). That is, they select the analysis flow corresponding to the feature amounts.

The template selection/execution systems 112 and 113 receive the analysis data 121 output from the feature amount generation system 110, select a template from the feature amounts in template selection 112a (that is, select the analysis flow corresponding to the feature amounts), and execute the selected template (the selected analysis flow) in execution processing 113a (C, D).

The output of the template selection/execution systems 112 and 113 is transmitted to the output device by the analysis result report output program 114.

FIG. 5 is a flowchart of feature amount generation in the data asset analysis support system 104. This processing is realized by the processing unit, consisting of the CPU 106, executing the feature amount generation program 110.

First, the input data is assumed to be constrained as follows: the first column holds a date-and-time item, which is data relating to date and time, and the second column holds a failure presence/absence flag, a failure location, or other information that serves as an objective variable. The date-and-time data in the first column is sorted in chronological order (S501). In this embodiment, in order to efficiently determine the data type of data whose type is unknown based on the correlation between date and time and the data values, a first feature amount (feature amount 1) is calculated from the relationship between the date and time and the data.

FIG. 7 shows an example of the input data after the date-and-time item in the first column has been sorted in step S501.

As shown in FIG. 7, the first column 701 is the item relating to date and time. The second column holds information on failure occurrence, and the third to fifth columns (703 to 705) hold the values of the sensors attached to each device. Each device is given a name for identifying it, such as EquipmentA, and each sensor attached to a device is given a sensor identifier, such as Sensor1. The information identifying a device may be an identifier or the like instead of a name, and the sensor identifier may be a sensor name. Besides data on equipment failures, the input data may be business data, for example for predicting contract cancellations. In that case, the second column holds a cancellation flag, and the third and subsequent columns hold sales-related data such as subscriber information (gender, annual income, and so on).

Furthermore, the sixth and subsequent columns (707 to 709) hold the values of individual sensors, each labeled with a sensor name that identifies the sensor. FIG. 7 shows an example of input data for failure sign diagnosis, but the input data may instead be for failure location estimation. In that case, an abnormality code or the like may be included as an item.
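The column layout and the sorting of step S501 can be illustrated with a short sketch. This is only a minimal illustration under stated assumptions: the CSV format, the column names (`DateTime`, `Failure`, `EquipmentA_Sensor1`), the timestamp format, and the function name `load_sorted` are hypothetical and do not come from the patent.

```python
import csv
import io
from datetime import datetime


def load_sorted(csv_text, time_format="%Y-%m-%d %H:%M:%S"):
    """Read input data and sort the rows by the date-and-time item in column 1 (S501).

    Assumes column 1 is the date-and-time item and column 2 is the objective
    variable (for example a failure flag), as the embodiment requires.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = [row for row in reader if row]
    rows.sort(key=lambda row: datetime.strptime(row[0], time_format))
    return header, rows


# Hypothetical sample: date-and-time, failure flag, one sensor value.
sample = """DateTime,Failure,EquipmentA_Sensor1
2019-01-22 00:02:00,0,10.5
2019-01-22 00:00:00,0,10.1
2019-01-22 00:01:00,1,10.3
"""
header, rows = load_sorted(sample)
```

Parsing the first column into `datetime` values before sorting avoids relying on the textual form of the timestamps being sortable.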

Returning to FIG. 5, the differences between the values in consecutive rows of the sorted date-and-time item in the first column are calculated (S502). Next, the mode of the differences is calculated (S503). It is then determined whether the differences that do not match the mode account for less than 20% of the total (S504). If they account for less than 20%, the process proceeds to step S506, the input data is regarded as data having a regular character, and the first feature amount is set to regular. If they account for 20% or more in step S504, the input data is regarded as data having an irregular character, and the first feature amount is set to irregular.

In other words, in this embodiment, in order to efficiently identify the data type of data whose type is unknown based on the correlation between date and time and the data values, the input data is sorted chronologically by the date-and-time item in its first column, and the differences between consecutive entries of the sorted first column, that is, the differences between the date-and-time values, are calculated.

For example, it is calculated whether the interval at which records appear in the input data is one minute, one day, and so on. If the mode of the differences is one minute, the data can be presumed to appear regularly every minute. Step S504 therefore determines whether the data that does not match the mode is less than 20%. In other words, when more than 80% of the data appears at one-minute intervals, the first feature amount is determined to be regular. This is because such regular data can be presumed to be, for example, sensor data from wind power generation in its normal state. Conversely, if the first-column data appears without relation to the mode, it can be presumed to be, for example, sensor data recorded when an abnormality of a machine or the like was detected.
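Steps S502 to S504 can be sketched as follows. This is a minimal sketch rather than the patented implementation: the function name `first_feature`, the string labels, and the handling of inputs with fewer than two rows are assumptions made for illustration. The 20% threshold is the example value given in the text.

```python
from collections import Counter
from datetime import datetime, timedelta


def first_feature(timestamps, threshold=0.20):
    """Classify date-and-time values as 'regular' or 'irregular'."""
    ordered = sorted(timestamps)                                # S501
    diffs = [b - a for a, b in zip(ordered, ordered[1:])]       # S502
    if not diffs:
        # Assumption: with fewer than two rows no interval can be established.
        return "irregular"
    _, mode_count = Counter(diffs).most_common(1)[0]            # S503
    off_mode_ratio = (len(diffs) - mode_count) / len(diffs)     # S504
    return "regular" if off_mode_ratio < threshold else "irregular"


start = datetime(2019, 1, 22)
every_minute = [start + timedelta(minutes=i) for i in range(11)]
label = first_feature(every_minute)
```

`Counter.most_common(1)` returns the most frequent interval together with its count, so the share of intervals that deviate from the mode follows directly from the count.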

The 20% threshold shown in step S504 of FIG. 5 is merely an example, and the analyst can change it as appropriate according to the input data.

FIG. 6 is a flowchart of feature amount generation in the data asset analysis support system and shows the processing that continues from FIG. 5.

Step S601 determines whether the first feature amount is regular. If the first feature amount is not regular, the process ends.

If the first feature amount is regular, the process proceeds to step S602, which extracts the data items that have 100 or fewer kinds of values. Such data items are, for example, values set by a user, such as a program ID identifying a program or an operation mode. A program ID item, with values such as "1", "2", and "5", is extracted if it has 100 or fewer distinct ID values.

Besides data items set by a user, a data item may be a temperature, with values such as 10 degrees, 11 degrees, and 15 degrees; such an item may be extracted if it has 100 or fewer distinct temperature values. Likewise, an acceleration sensor item, with values such as 1 m/s² and 2 m/s², may be extracted if it has 100 or fewer distinct values.

Taking the acceleration sensor as an example, having 100 or fewer kinds of values suggests that the machine is repeating a fixed operation, that is, operating in some kind of set mode. If the acceleration sensor has more than 100 kinds of values, a wide variety of values is being measured, so the machine can be presumed not to be in a mode that repeats a fixed operation. For this reason, step S602 in this embodiment extracts the items whose number of value kinds is 100 or less. The threshold of 100 is merely an example and can be set as appropriate by the analyst.
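The extraction in step S602 amounts to counting the distinct values of each data item. The following is a minimal sketch under stated assumptions: the function name `low_cardinality_items`, the dictionary representation of the columns, and the sample item names are hypothetical.

```python
def low_cardinality_items(columns, max_kinds=100):
    """Return the names of data items with at most max_kinds distinct values (S602).

    columns maps an item name to the list of values recorded for that item.
    """
    return [name for name, values in columns.items()
            if len(set(values)) <= max_kinds]


# Hypothetical items: a program ID with 3 kinds of values and a
# vibration reading with 150 kinds of values.
columns = {
    "ProgramID": [1, 2, 5, 1, 2],
    "Vibration": list(range(150)),
}
extracted = low_cardinality_items(columns)
```

The `max_kinds` parameter corresponds to the example threshold of 100, which the text says the analyst may adjust.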

In step S603, the mode of the duration is calculated for each value. For example, if the extracted item is the output of an acceleration sensor, and the value 1 m/s² persists for 1 second on 10 occasions and for 3 seconds on 2 occasions, so that a duration of 1 second is the most frequent, then 1 second is the mode.

In step S604, it is determined whether the durations that do not match the mode account for less than 20% of the total. Taking the above acceleration sensor as an example, if durations other than 1 second account for 20% or more, it is determined that the second feature amount (feature amount 2) is none (S605); if they account for less than 20%, it is determined that the second feature amount is present (S606). That is, when the second feature amount is present, the target data is presumed to be data obtained by sensing a machine or the like that repeats a fixed operation.
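Steps S603 and S604 can be sketched with run-length counting. Several points here are assumptions, because the text does not fix them: durations are measured as the number of consecutive samples (reasonable only because the data is already known to be regular), the off-mode share is pooled over all values of the item, and the function name `second_feature` and the labels `"present"` and `"none"` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import groupby


def second_feature(values, threshold=0.20):
    """Decide whether one data item shows repeated fixed-duration behaviour."""
    durations = defaultdict(list)
    for value, run in groupby(values):
        # Length of each unbroken run of the same value.
        durations[value].append(sum(1 for _ in run))
    total_runs = sum(len(runs) for runs in durations.values())
    if total_runs == 0:
        return "none"
    off_mode = 0
    for runs in durations.values():
        mode_count = Counter(runs).most_common(1)[0][1]   # S603
        off_mode += len(runs) - mode_count
    return "present" if off_mode / total_runs < threshold else "none"  # S604


# A value pattern that repeats with a fixed duration of two samples.
cyclic = [1, 1, 2, 2] * 5
label = second_feature(cyclic)
```

`itertools.groupby` groups only adjacent equal values, which is exactly what is needed to measure how long each value persists before it changes.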

According to the feature amount generation flows of FIGS. 5 and 6, when the input data has the first feature amount, that is, when the first feature amount is regular, the machine being sensed is likely to be operating normally, so the analysis flow for predictive diagnosis is judged to be suitable. When the input data does not have the first feature amount, the machine being sensed is likely not to be operating normally, so the analysis flow for classification diagnosis is judged to be suitable.

The above description used sensor data such as temperature and acceleration as examples to aid understanding of the invention. When the data item is a value set by a user, such as a program ID identifying a program or an operation mode, the same approach applies by using the program ID or similar value in place of the sensor data value.

For the analysis flow for predictive diagnosis, the analysis described in JP 2016-33778 A, for example, can be applied. For the analysis flow for classification diagnosis, the classification diagnosis analysis described in JP 2018-25928 A, for example, can be applied.

When the first feature amount is regular, the second feature amount is further used to decide between two analysis flows: predictive diagnosis (with mode) when, for example, the sensed machine operates by repeating a fixed operation, and predictive diagnosis (no mode) when the sensed target does not operate by repeating a fixed operation.

The processing results of FIGS. 5 and 6 are stored in the analysis data 121 by the feature amount generation program 110. That is, the feature amounts generated by the feature amount generation program 110 are stored in the analysis data table shown in FIG. 9B in association with each set of input data.

FIG. 8 is a flowchart of template selection and execution in the data asset analysis support system. The template selection program 112 reads the analysis data table shown in FIG. 9B (S802). Step S803 starts a loop that repeats once per record. Step S804 selects from the analysis template table the template whose first feature amount (feature amount 1) and second feature amount (feature amount 2) match.

分析テンプレートテーブルは、図９Ａで示したように、各特徴量と分析フローとを対応して記憶し、テンプレート122に格納されたテーブルである。例えば、特徴量１が定期で、特徴量２が無しのデータに対して、予兆診断（モードなし）の分析フローが対応して記憶されている。 The analysis template table, as shown in FIG. 9A, is a table stored in the template 122, storing each feature quantity and analysis flow in association with each other. For example, for data in which feature quantity 1 is regular and feature quantity 2 is absent, an analysis flow for predictive diagnosis (no mode) is stored correspondingly.

次に、選択されたテンプレート、即ち、分析フローを投入データに対して実行する(S805)。可視化実行プログラム115による投入データの可視化結果とテンプレート実行結果をレポートに出力する（S806）。全てのレコードに対する処理が終了すると(S807)、処理を終了する。 Next, the selected template, that is, the analysis flow, is executed on the input data (S805). The visualization result of the input data produced by the visualization execution program 115 and the template execution result are output to a report (S806). When all records have been processed (S807), the process ends.
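The selection and execution loop of FIG. 8 can be sketched in Python as a table lookup keyed by the two feature amounts. This is only an illustration under stated assumptions: the dictionary layout, the function `run_templates`, the stand-in flow functions, and the in-memory `fake_files` are hypothetical, and only the two template entries that the description mentions are included.

```python
# Analysis template table (cf. FIG. 9A): (feature 1, feature 2) -> analysis flow name.
ANALYSIS_TEMPLATES = {
    ("regular", "none"): "predictive diagnosis (no mode)",
    ("regular", "present"): "predictive diagnosis (with mode)",
}


def run_templates(analysis_rows, templates, flows, load):
    """Select and execute one analysis flow per record, collecting a report."""
    report = []
    for row in analysis_rows:  # S803: loop over the records
        key = (row["feature1"], row["feature2"])
        flow_name = templates.get(key)  # S804: match on both feature amounts
        if flow_name is None:
            # No matching template; this handling is an assumption of the sketch.
            report.append({"file": row["file"], "flow": None, "result": None})
            continue
        result = flows[flow_name](load(row["file"]))  # S805: execute the flow
        report.append({"file": row["file"], "flow": flow_name, "result": result})  # S806
    return report  # S807: all records processed


# Hypothetical stand-ins for real analysis flows and file loading.
flows = {
    "predictive diagnosis (no mode)": lambda data: f"no-mode flow on {len(data)} records",
    "predictive diagnosis (with mode)": lambda data: f"with-mode flow on {len(data)} records",
}
fake_files = {"111.csv": [1.0, 1.1, 0.9], "other.csv": [5.0, 7.5]}

rows = [  # analysis data table (cf. FIG. 9B)
    {"file": "111.csv", "feature1": "regular", "feature2": "none"},
    {"file": "other.csv", "feature1": "irregular", "feature2": "none"},
]
report = run_templates(rows, ANALYSIS_TEMPLATES, flows, fake_files.__getitem__)
for entry in report:
    print(entry)
```

Keeping the template table as plain data means a new analysis flow can be registered by adding an entry rather than changing the loop.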

図９Ａは、分析テンプレートテーブルを示す図である。分析テンプレートテーブルは、データの特徴量と分析フローを対応付けて管理する。例えば、特徴量１が定期で、特徴量２が無しのに対して、予兆診断（モードなし）の分析フローが対応して記憶されている。 FIG. 9A is a diagram showing an analysis template table. The analysis template table associates and manages data feature amounts and analysis flows. For example, when the feature quantity 1 is regular and the feature quantity 2 is absent, an analysis flow for predictive diagnosis (no mode) is stored correspondingly.

図９Ｂは、分析データテーブルを示す図である。分析データテーブルは、ファイル形式で投入される投入データ毎に算出された特徴量を、投入ファイル名と対応付けて記憶する。例えば、投入ファイル名[111.csv]に対し、特徴量１が[定期]、特徴量２が[なし]が対応して記憶されている。 FIG. 9B is a diagram showing an analysis data table. The analysis data table stores feature amounts calculated for each input data input in a file format in association with input file names. For example, with respect to the input file name [111.csv], the feature quantity 1 is stored as [regular] and the feature quantity 2 is stored as [none].

図１０は、分析結果レポートや可視化実行プログラムの出力イメージの一例を示した図である。分析結果レポートは出力装置によって表示される。 FIG. 10 is a diagram showing an example of an analysis result report and an output image of the visualization execution program. An analysis result report is displayed by an output device.

図１０(a)は、分析結果レポートをヒストグラムとして出力したイメージである。x軸に、例えば、図７に示した投入データのセンサの値を示し、y軸にx軸の値に該当するデータの個数（出現頻度）を示したものである。このようなヒストグラムは、データ分布傾向の把握に適している。 FIG. 10(a) is an image of an analysis result report output as a histogram. For example, the x-axis shows the sensor values of the input data shown in FIG. 7, and the y-axis shows the number of data (appearance frequency) corresponding to the x-axis value. Such a histogram is suitable for understanding data distribution trends.
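As a rough illustration of the histogram in FIG. 10(a), the following Python sketch counts how many sensor values fall into each bin along the x-axis. The function name `histogram`, the bin width, and the sample values are assumptions for illustration and are not taken from the patent.

```python
import math
from collections import Counter


def histogram(values, bin_width):
    """Map the lower edge of each bin to the number of values falling in it."""
    counts = Counter(math.floor(v / bin_width) for v in values)
    return {index * bin_width: counts[index] for index in sorted(counts)}


sensor_values = [20.1, 20.4, 21.7, 21.9, 22.0, 21.2]  # hypothetical readings
bins = histogram(sensor_values, 1.0)
for lower_edge, frequency in bins.items():
    print(f"{lower_edge:5.1f} | {'#' * frequency}")
```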

図１０（ｂ）は、分析結果レポートをラインチャートとして出力したイメージである。x軸は時間を表し、y軸は、例えば、投入データの項目の値を示す。ラインチャートは、時間に応じたデータの傾向把握に適している。 FIG. 10(b) is an image of the analysis result report output as a line chart. The x-axis represents time and the y-axis shows, for example, the values of the input data items. Line charts are suitable for grasping trends in data over time.

図１０（ｃ）は、分析結果レポートを散布図(相関グラフ)として出力したイメージである。x軸は比較対象の項目の値を示し、y軸は当該項目の値を示す。散布図では、当該項目について、比較対象の項目数分が出力され、項目間の相関関係の把握に適している。 FIG. 10(c) is an image of the analysis result report output as a scatter diagram (correlation graph). The x-axis indicates the value of the item being compared against, and the y-axis indicates the value of the item in question. For the item in question, as many scatter diagrams are output as there are items to compare against, which makes this form suitable for understanding correlations between items.

図１１は、データ分析の出力イメージを示す図であり、分析フローの実行結果を示す。 FIG. 11 is a diagram showing an output image of data analysis, showing the execution result of the analysis flow.

分析フローを行った投入ファイルの対象ファイル情報を表示する対象ファイル表示欄1101と、分析テンプレートの実行結果を表す分析テンプレート表示欄1102とが表示されている。対象ファイル表示欄1101には、分析フローの実行日時、対象ファイル名、対象ファイルのデータ量、対象ファイルに含まれるデータの項目数、学習期間、目的変数項目、推定データサイクル等が含まれる。 A target file display field 1101 for displaying target file information of an input file that has undergone an analysis flow, and an analysis template display field 1102 for displaying the execution result of the analysis template are displayed. The target file display column 1101 includes the execution date and time of the analysis flow, the target file name, the data volume of the target file, the number of data items included in the target file, the learning period, the objective variable item, the estimated data cycle, and the like.

分析テンプレート表示欄1102には、テンプレート名1103と、どのテンプレートが実行されたかを示す実行欄1104、テンプレートを実行した結果を示す結果欄1105が含まれる。図１１の例では、テンプレート名[分類診断.knr]が実行され、その結果が、[Accuracy 0.7]であったことが示されている。この結果には、テンプレートに含まれる分析フローの結果の確からしさが示され、[Accuracy 0.7]は分析モデルの過去データによるテストでの正解率＝70%を示している。 The analysis template display column 1102 includes a template name 1103, an execution column 1104 indicating which template has been executed, and a result column 1105 indicating the result of executing the template. The example of FIG. 11 shows that the template name [classification diagnosis.knr] was executed and the result was [Accuracy 0.7]. This result shows the certainty of the result of the analysis flow included in the template, and [Accuracy 0.7] shows the accuracy rate = 70% in the test based on the past data of the analysis model.
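The accuracy figure shown in the result column can be illustrated with a short Python sketch: the correct answer rate is the number of test cases where the model's output matches the recorded outcome, divided by the total number of cases. The function name and the sample labels are hypothetical; the patent does not state how the test data is prepared.

```python
def accuracy(predicted, actual):
    """Fraction of test cases where the predicted label equals the actual label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one test case is required")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical past data: 10 cases, of which 7 are classified correctly.
actual = ["ok", "ok", "fault", "ok", "fault", "ok", "ok", "fault", "ok", "ok"]
predicted = ["ok", "ok", "fault", "ok", "ok", "ok", "fault", "fault", "fault", "ok"]
score = accuracy(predicted, actual)
print(f"Accuracy {score}")
```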

図１２は、データ分析の出力イメージを示す図である。図１２では、出力イメージとして異常度グラフを示している。図１２では、2016年7月あたりに異常度が増加し、故障の予兆が表れていることを示している。 FIG. 12 is a diagram showing an output image of data analysis. FIG. 12 shows an anomaly degree graph as an output image. FIG. 12 shows that the degree of anomaly increased around July 2016, and signs of failure appeared.

以上の処理により、データ所有者と、データ分析者が異なり、データ分析者がデータ所有者から提供されるデータに関する情報が十分開示されない場合であっても、投入データである分析対象データの事前分析を、日時とデータの値の関係に基づき、データが定期的なものであるかを示す第１の特徴量と、データ項目の値の種類とその継続時間からデータが一定の繰り返し、例えば、センシング対象が一定のモードで動作しているかを示す第２の特徴量に基づいて、分析フローを特定するので、データの事前分析の工数を削減でき、業務課題解決を図るデータ分析までの期間を大幅に短縮することができる。 Through the above processing, even when the data owner and the data analyst are different parties and information about the data provided by the data owner is not sufficiently disclosed to the data analyst, the analysis flow is specified from a preliminary analysis of the analysis target data (the input data) using two feature amounts: a first feature amount, based on the relationship between the date and time and the data values, that indicates whether the data is regular; and a second feature amount, based on the kinds of values of each data item and their durations, that indicates whether the data repeats in a fixed manner, for example whether the sensing target operates in fixed modes. This reduces the man-hours required for preliminary data analysis and can significantly shorten the time needed to reach the data analysis that addresses the business problem.

また、データ特性に応じた適切な分析フローを選定することができる。 Also, an appropriate analysis flow can be selected according to data characteristics.

また、本実施例によれば、簡易に結果が得られるため、データ蓄積への意欲を醸成することができる。 Moreover, according to the present embodiment, results can be obtained easily, so it is possible to motivate users to accumulate data.

１０４：データアセット分析支援システム、１０５：インタフェース、１０６：ＣＰＵ、１０７：記憶装置、１０８：メモリ、１１０：特徴量生成プログラム、１１１：テンプレート管理プログラム、１１２：テンプレート選択プログラム、１１３：テンプレート実行プログラム、１１４：分析結果レポート出力プログラム、１１５：可視化実行プログラム、１２０：投入データ、１２１：分析データ、１２２：テンプレート、１３０：入出力装置。 104: Data Asset Analysis Support System, 105: Interface, 106: CPU, 107: Storage Device, 108: Memory, 110: Feature Generation Program, 111: Template Management Program, 112: Template Selection Program, 113: Template Execution Program, 114: analysis result report output program, 115: visualization execution program, 120: input data, 121: analysis data, 122: template, 130: input/output device.

Claims

1. A data asset analysis support system having a processing unit, a storage device, and an output device, wherein
the storage device stores input data provided by a data owner and an analysis template that stores feature amounts of analysis target data and analysis flows in association with each other,
the processing unit:
sorts the date and time items of the input data stored in the storage device in chronological order,
calculates the differences between the date and time items of the input data,
calculates the mode of the calculated differences,
determines that the first feature amount of the input data is regular if the value not corresponding to the mode of the calculated differences is less than a threshold value,
determines that the first feature amount of the input data is irregular if the value not corresponding to the mode of the calculated differences is not less than the threshold value, thereby generating a feature amount,
selects an analysis flow for the input data based on the generated feature amount and the analysis template,
executes the selected analysis flow on the input data, and
sends the analysis result of the analysis flow to the output device, and
the output device outputs the analysis result of the analysis flow.

2. The data asset analysis support system according to claim 1, wherein the processing unit:
when the first feature amount of the input data is regular, extracts, for each data item, items whose number of value types is equal to or less than a predetermined value,
calculates the mode of the duration for each value of each data item,
determines that a second feature amount is present if the value not corresponding to the mode of the duration is less than a predetermined threshold value, and
determines that the second feature amount is absent if the value not corresponding to the mode of the duration is not less than the predetermined threshold value.

3. The data asset analysis support system according to claim 2, wherein the analysis template stored in the storage device stores an analysis flow corresponding to the first feature amount and the second feature amount.

4. The data asset analysis support system according to claim 3, wherein the storage device stores an analysis data table that stores, in association with one another, the first feature amount of the input data, the second feature amount of the input data, and the file name of the input data.

5. A data analysis method for a data asset analysis support system having a processing unit, a storage device, and an output device, wherein
the storage device stores input data provided by a data owner and an analysis template that stores feature amounts of analysis target data and analysis flows in association with each other,
the processing unit:
sorts the date and time items of the input data stored in the storage device in chronological order,
calculates the differences between the date and time items of the input data,
calculates the mode of the calculated differences,
determines that the first feature amount of the input data is regular if the value not corresponding to the mode of the calculated differences is less than a threshold value,
determines that the first feature amount of the input data is irregular if the value not corresponding to the mode of the calculated differences is not less than the threshold value, thereby generating a feature amount,
selects an analysis flow for the input data based on the generated feature amount and the analysis template,
executes the selected analysis flow on the input data, and
sends the analysis result of the analysis flow to the output device, and
the output device outputs the analysis result obtained by the analysis flow.

6. The data analysis method according to claim 5, wherein the processing unit:
when the first feature amount of the input data is regular, extracts, for each data item, items whose number of value types is equal to or less than a predetermined value,
calculates the mode of the duration for each value of each data item,
determines that a second feature amount is present if the value not corresponding to the mode of the duration is less than a predetermined threshold value, and
determines that the second feature amount is absent if the value not corresponding to the mode of the duration is not less than the predetermined threshold value.