CN110197084B

CN110197084B - Medical data joint learning system and method based on trusted computing and privacy protection

Info

Publication number: CN110197084B
Application number: CN201910506663.3A
Authority: CN
Inventors: 王爽; 郑灏; 王晓峰; 汤海旭; 窦佐超; 王文浩
Original assignee: Shanghai Lianyi Biotechnology Co ltd
Current assignee: Shanghai Nowei Information Technology Co.,Ltd.
Priority date: 2019-06-12
Filing date: 2019-06-12
Publication date: 2021-07-30
Anticipated expiration: 2039-06-12
Also published as: CN110197084A

Abstract

The invention relates to a medical data joint learning system and method based on trusted computing and privacy protection. The joint learning central control layer receives and records non-sensitive meta-information uploaded by a data contributor through the data contributor management layer of the data node where the contributor is located, while the original data are registered, stored and computed on locally in isolation. The joint learning central control layer processes a joint learning request initiated by a data miner through the data miner interaction layer, aggregates in a secure computation area the non-sensitive intermediate results that each data node obtains by local isolated computation on its original data, and returns the final joint learning result to the data miner interaction layer. The invention provides a complete service system for medical big data built on secure sharing, trusted computing, deep mining, authority authentication and multi-platform joint learning, and addresses the fragmented, single-purpose and incomplete state of medical data privacy protection and data mining at the present stage.

Description

Medical data joint learning system and method based on trusted computing and privacy protection

Technical Field

The invention relates to the secure sharing, trusted mining and privacy protection of medical big data, and in particular to a medical big data joint learning system and method based on trusted computing and privacy protection.

Background

Existing services for searching, sharing and mining medical big data are still immature: deep trusted mining of the data and authority authentication are lacking, and system standards and protective measures have not yet been established. Strict laws combined with missing protection systems and standards leave many owners of medical data, such as hospitals and medical research institutions, reluctant or afraid to share their data resources. This seriously hinders the progress of medical disciplines in the era of internet big data, for example in the comprehensive diagnosis and analysis of diseases and in big-data statistical analysis of genes associated with hereditary diseases.

Chinese patent "Differential privacy protection method for medical data publication" (application number 201510690500.7) addresses the privacy risk of publishing medical data directly. It protects the relevant data privacy while preserving data availability by means such as differential privacy and noise addition. The method still assumes that data miners access the data directly (albeit after privacy-protection processing), and it does not address authority verification of data miners, trusted authentication of the computation and analysis platform, multi-platform joint learning, and the like.

The patent application "A cloud medical data monitoring system and method with efficient privacy protection" (application number 201610859330.5) describes a system for the encrypted uploading, querying and reading of medical data on a cloud server. That invention cannot mine or analyze medical data further while the data remain encrypted, and it does not address authority verification of the data inquirer, trusted authentication of the data platform, multi-platform joint learning, and the like.

Chinese patent "A privacy protection data mining system and method based on medical big data" (application number 201811118948.1) discloses a three-level system for storing, querying and managing medical data based on non-interactive zero-knowledge proofs. It ensures that local samples are not leaked to the server side and prevents false sample matching and the like. The system does not address further trusted mining of medical big data, authority authentication of data miners, trusted authentication of data platforms, multi-platform joint learning, and the like.

Disclosure of Invention

The invention relates to a medical data joint learning system and method based on trusted computing and privacy protection, provides a whole set of service system based on medical big data security sharing, trusted computing, deep mining, authority authentication and multi-platform joint learning, and solves the problems of scattered, single and incomplete medical data privacy protection and data mining at the present stage.

In order to achieve the above object, one technical solution of the present invention is to provide a medical data joint learning method based on trusted computing and privacy protection:

the joint learning central control layer receives and stores the non-sensitive meta-information uploaded by the data contributor through the data contributor management layer of the data node where the data contributor is located; the meta-information is based on the original data of the data contributors and does not contain sensitive information of the original data;

the joint learning central control layer receives and processes a joint learning request initiated by a data miner through the data miner interaction layer; in a secure computation area of the joint learning central control layer, the intermediate results obtained by local isolated computation on the original data at each data node are aggregated and analyzed, and the joint learning result is returned to the data miner interaction layer.

Optionally, the joint learning central control layer is provided with a central node server and a security computation server, and interacts with data node servers respectively arranged on the data contributor management layers of the data nodes;

the medical data joint learning method comprises the following processes:

step one, all original data of the data contributors are registered and stored within the local firewall: the data contributor accesses the data node server through the first interactive system, registers the data set, and specifies the access authority and validity period of the data set; all original data are stored in a local private database located within the firewall; the data node server sends the meta-information to the central node server for recording;

step two, the data miner accesses the central node server through the second interactive system and, after completing user registration and verification, searches for the data sets available under the data miner's own authority and creates a joint learning instance;

step three, the data miner sends a joint learning request to the central node server;

step four, based on the data sets selected by the data miner, the central node server sends local computation requests to all data nodes involved in the current joint learning request;

step five, each data node that receives the local computation request performs local isolated computation on the original data within its firewall through its data node server, and exchanges intermediate results with the secure computation server; the intermediate results contain no original data;

step six, the secure computation server collects and updates the intermediate results obtained by local isolated computation at all the data nodes, generates and outputs the joint learning result, and returns it to the central node server;

step seven, the central node server generates a joint learning report so that the data miner can obtain and use the joint learning result.

Optionally, the data contributor performs setting of access right in the data registration process through a data contributor management layer of the data node where the data contributor is located;

the access rights specify one or more of a time, a place, a data miner, and a joint learning task that allow use of the data.

Optionally, the data miner selects data whose access authority is public, and/or data that a data contributor has designated for that data miner, to perform joint learning;

the data miner sets each of its own joint learning instances as private or public, and other data miners are allowed to query and study the public joint learning instances.

Optionally, the meta information and the intermediate result of each data node are uploaded to the joint learning central control layer in an encrypted state.

Optionally, before uploading the meta-information to the central node server, the data node initiates remote enclave authentication with the central node server based on Intel Software Guard Extensions (SGX);

and the secure computation server uses Intel Software Guard Extensions (SGX) to collect and analyze the intermediate results uploaded by each data node.

Optionally, the meta information includes a network protocol address and a port of the data node server, a file name, description, and a supported research method of the original data; the intermediate result does not relate to sensitive information of the original data; the intermediate result comprises an intermediate training model and statistical parameters.

The medical data joint learning system based on the trusted computing and the privacy protection can be applied to any one of the medical data joint learning methods based on the trusted computing and the privacy protection.

The medical data joint learning system comprises:

the data node servers are arranged on the data contributor management layer of each data node;

the central node server and the safety calculation server are arranged at a joint learning central control layer and interact with each data node server;

the data node server registers a local data set and assigns access authority, uploads meta-information to a central node server for recording, receives a local calculation request of the central node server, performs local isolation calculation on locally stored original data, and sends an intermediate result to a security calculation server for summarizing;

the central node server receives a joint learning request initiated by a data miner, informs a safety calculation server of a joint learning example created by the data miner, sends a local calculation request to a data node related to the current joint learning request, waits for and receives joint learning results collected and summarized by the safety calculation server from the corresponding data node, generates a joint learning report and returns the joint learning report to the data miner.

Optionally, the data node server implements its management framework with Spring + Vue and implements local isolated computation in C++;

the central node server implements its control architecture with Spring Boot + Vue and is deployed with Docker on a hardware platform on which Docker Compose is installed;

the secure computation server is implemented in C++/Rust in combination with Intel Software Guard Extensions (SGX).

Optionally, the data node server, the local private database and the first web-based interaction system provided at the data contributor management layer are located within the local firewall of the data node to which they belong;

based on a first webpage end interactive system, a data contributor accesses a data node server through a browser;

and the data miner interaction layer is provided with a second webpage end interaction system, and the data miner accesses the central node server through a browser.

Compared with the prior art, the medical data joint learning system and method based on the trusted computing and privacy protection have the advantages that:

The scheme of the invention is implemented, on the basis of joint learning, by the central node server and the secure computation server (trusted computing area) of the joint learning central control layer together with a plurality of data node servers of the data contributor management layer. All storage and computation involving the original medical data are carried out in local isolation on the data nodes, which avoids privacy disclosure at the root. The invention supports strict and flexible authorization of data sets, including but not limited to authorization based on task, user, time and location. The central node stores only the non-sensitive meta-information of the data sets, and deep mining of the medical data is achieved with a series of joint learning algorithms. In addition, the joint learning core program of the central node uses Intel Software Guard Extensions (SGX), which protects the computed data and results even in an untrusted environment.

Drawings

FIG. 1 is a block diagram of the overall system of the present invention;

FIG. 2 is an exemplary diagram of a data format of a joint learning request submitted by a data miner via a browser;

FIG. 3 is an exemplary diagram of a central node server notifying a security computation server of parameters of a joint learning instance;

FIG. 4 is an exemplary diagram of a data format of a local computation request sent by a central node server to a data node;

FIG. 5 is an exemplary diagram of a data format of the joint learning result collected by the security computation server and sent to the central node server;

FIG. 6 is an exemplary graph of a joint learning report generated by the central node server and returned to the data miners;

FIG. 7 is an exemplary diagram of data set base information stored by a data node server;

FIG. 8 is an exemplary graph of raw data stored by a data node server;

FIG. 9 is an exemplary diagram of data set meta-information for a data node server registered with a central node server.

Detailed Description

The principles, features and system flow of the present invention are described below in conjunction with the drawings, which are set forth by way of example only and not intended to limit the scope of the invention.

As shown in fig. 1, the medical data joint learning scheme based on trusted computing and privacy protection includes three major parts:

First, the data contributor management layer

The data contributor management layer (the local management layer) enables data contributors (for example, owners of medical big data such as hospitals and medical research institutions) to register, store and compute on all original medical data locally. Specifically, all original data of a data contributor are registered and stored entirely on the local side (within the firewall), and all computation involving the original data is likewise restricted to local isolated execution. This design prevents leakage of private data to the outside at the root.

The local management layer uploads only the meta-information of the original data, such as the network protocol address (IP address) and port of the local server and the file name, description and supported research methods of the original data, to the central node server of the joint learning central control layer. Likewise, during local isolated computation only intermediate results (such as intermediate training models and statistical parameters) are transmitted to the secure computation area of the joint learning central control layer for secure aggregation.
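
As a minimal sketch of what such a meta-information record might look like, the following Python fragment models the fields named above. The class name `DatasetMeta`, the field names and the example values are illustrative assumptions; the actual format is the one shown in fig. 9.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class DatasetMeta:
    """Non-sensitive description of one registered data set (hypothetical fields)."""
    node_ip: str
    node_port: int
    file_name: str
    description: str
    supported_methods: list = field(default_factory=list)


def to_upload_payload(meta: DatasetMeta) -> str:
    """Serialize only the meta-information; raw records are never part of it."""
    return json.dumps(asdict(meta), sort_keys=True)


meta = DatasetMeta(
    node_ip="192.0.2.10",            # documentation-only address (RFC 5737)
    node_port=8443,                  # assumed port
    file_name="cohort_a.csv",        # assumed file name
    description="Example survival cohort",
    supported_methods=["anova", "chi_square", "cox_regression"],
)
payload = to_upload_payload(meta)
```

The point of the structure is that it can be inspected field by field: nothing in it is derived from individual patient records.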

The intermediate data contain no private information about the data. For example, in an analysis of variance (ANOVA) test, each local server first returns only the mean and the sample count of its local data set, and the central node server computes the overall mean and total sample count from these values and returns them to the local servers. Each local server then computes the sum of squared differences between its local values and the overall mean and returns it to the central node server. From these values the central node server computes the F statistic, and the p-value of the test is then obtained from the F distribution.
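
The two-round exchange described above can be sketched in Python as follows. This is an illustration under the assumption that each data node forms one ANOVA group; the function names are hypothetical, and the final p-value lookup in the F distribution is left out because it needs a statistics library.

```python
def local_round1(values):
    """On a data node: return only the local mean and sample count."""
    return sum(values) / len(values), len(values)


def central_grand_mean(summaries):
    """At the center: overall mean and total count from (mean, count) pairs."""
    n_total = sum(n for _, n in summaries)
    return sum(m * n for m, n in summaries) / n_total, n_total


def local_round2(values, grand_mean):
    """On a data node: sum of squared differences from the overall mean."""
    return sum((v - grand_mean) ** 2 for v in values)


def central_f_statistic(summaries, local_squares, grand_mean, n_total):
    """At the center: F = (SS_between / (k - 1)) / (SS_within / (N - k))."""
    k = len(summaries)
    ss_between = sum(n * (m - grand_mean) ** 2 for m, n in summaries)
    ss_total = sum(local_squares)
    ss_within = ss_total - ss_between
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Three nodes with made-up values; only summaries ever leave a node.
node_data = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
summaries = [local_round1(v) for v in node_data]
grand_mean, n_total = central_grand_mean(summaries)
local_squares = [local_round2(v, grand_mean) for v in node_data]
f_value = central_f_statistic(summaries, local_squares, grand_mean, n_total)
```

For the made-up values above, the between-group sum of squares is 26 and the within-group sum is 6, giving F = 13 with (2, 6) degrees of freedom.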

It is emphasized here that the intermediate results of the calculations are all transmitted, stored and calculated in an encrypted state. Even if the central node server is hijacked, the state and data of the calculation cannot be disclosed.

In the data registration process, the invention provides a strict and flexible access authority control mechanism, for example authorization based on the joint learning task, on the validity period of the data set, on designated data miners, or on geographic location or research institution. Specifically, a data contributor can specify who may use the data set it provides, at what times and at what locations, to carry out joint learning studies with the specified methods.
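
One possible reading of this mechanism is sketched below in Python. The class `AccessPolicy`, its fields and the rule for combining the checks are assumptions made for illustration; the patent does not prescribe a data model.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class AccessPolicy:
    """Hypothetical per-data-set policy set by the data contributor."""
    public: bool = False
    allowed_miners: set = field(default_factory=set)
    allowed_institutions: set = field(default_factory=set)
    allowed_methods: set = field(default_factory=set)
    valid_from: datetime = datetime.min
    valid_until: datetime = datetime.max

    def permits(self, miner, institution, method, when):
        """Check time window, then task (method), then who is asking."""
        if not (self.valid_from <= when <= self.valid_until):
            return False
        if method not in self.allowed_methods:
            return False
        if self.public:
            return True
        return miner in self.allowed_miners or institution in self.allowed_institutions


policy = AccessPolicy(
    allowed_miners={"alice"},
    allowed_methods={"anova"},
    valid_from=datetime(2019, 1, 1),
    valid_until=datetime(2019, 12, 31),
)
allowed = policy.permits("alice", "hospital_x", "anova", datetime(2019, 6, 12))
```

A data node server could evaluate such a policy both when the data miner searches for data sets and again when a local computation request arrives, so that an expired authorization is refused even for an instance created earlier.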

Before uploading the meta-information to the central node server, the local server initiates remote enclave authentication, based on the Intel SGX trusted computing unit, with the central node server to verify whether the trusted computing unit of the central node server has been registered as trusted with the Intel verification server. This ensures the privacy and security of the meta-information and the intermediate computation results during transmission, storage and computation.
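
The control flow of "authenticate first, upload second" can be sketched as below. This Python fragment does not use the SGX SDK and does not verify a real attestation quote; it only simulates the decision by comparing a reported enclave measurement against an allowlist. All names and the measurement value are hypothetical.

```python
import hashlib
import hmac

# Hypothetical allowlist of enclave measurements the data node is willing to trust.
TRUSTED_MEASUREMENTS = {
    hashlib.sha256(b"joint-learning-enclave-v1").hexdigest(),
}


def enclave_is_trusted(reported_measurement: str) -> bool:
    """Constant-time comparison of the reported measurement with the allowlist."""
    return any(
        hmac.compare_digest(reported_measurement, trusted)
        for trusted in TRUSTED_MEASUREMENTS
    )


def upload_meta_if_attested(reported_measurement, payload, send):
    """Send the meta-information only after the remote enclave has been accepted."""
    if not enclave_is_trusted(reported_measurement):
        return False
    send(payload)
    return True


outbox = []
good = hashlib.sha256(b"joint-learning-enclave-v1").hexdigest()
bad = hashlib.sha256(b"tampered-enclave").hexdigest()
sent_good = upload_meta_if_attested(good, "meta-information", outbox.append)
sent_bad = upload_meta_if_attested(bad, "meta-information", outbox.append)
```

In a real deployment the measurement would arrive inside a signed attestation report whose signature and freshness must also be checked; the sketch covers only the gating logic.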

Second, the joint learning central control layer

The central node server is responsible for data registration by data contributors, storage of meta-information (no original data is involved), and processing of joint learning requests from data miners. The secure computation server uses Intel Software Guard Extensions (SGX) to collect and analyze, in the cloud, the intermediate results of the local computations; the result is finally returned to the data miner interaction layer, and a joint learning result report is generated at the browser end.

The encrypted intermediate results uploaded by each medical data node are loaded into the core program of the central node server, where they are securely aggregated to obtain the final learning result. The core program uses the SGX service provided by Intel, so all operations are carried out under encryption inside a trusted computing area, which greatly improves the security of program execution and protects the confidentiality, integrity and availability of code and data. Specifically, the core program trusts only the CPU on which it runs and Intel, which effectively prevents attacks on the core program even after the underlying operating system (OS) has been compromised. Administratively, the provider of the cloud service is likewise not trusted.

Third, the data miner interaction layer

The data miner interaction layer provides a web-based interaction system. The data miner accesses the joint learning interaction system through a browser to complete user registration and, after verification, can select data whose access authority is public, or data that a data contributor has designated for that data miner, to carry out joint learning with different algorithms, such as the chi-square test, proportional hazards regression, analysis of variance and the Kolmogorov-Smirnov test. The data miner can also set each of its own joint learning instances as public or private. Public joint learning instances can be queried and studied by other data miners.
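
To illustrate how one of the listed algorithms fits the joint learning pattern, the following Python sketch runs a Pearson chi-square test in which each data node shares only a contingency table of counts. The function names and the counts are made up for illustration; the patent does not specify this procedure.

```python
def add_tables(tables):
    """At the center: element-wise sum of the count tables sent by the nodes."""
    rows, cols = len(tables[0]), len(tables[0][0])
    return [[sum(t[r][c] for t in tables) for c in range(cols)] for r in range(rows)]


def chi_square(table):
    """Pearson chi-square statistic of a contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Each node reports only counts (exposure x outcome), never patient rows.
node_tables = [
    [[12, 18], [10, 5]],
    [[8, 12], [20, 15]],
]
pooled = add_tables(node_tables)
statistic = chi_square(pooled)
```

Here the pooled table is [[20, 30], [30, 20]], every expected count is 25, and the statistic is 4.0 with one degree of freedom.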

The invention uses a joint learning (federated learning) model to realize the secure sharing and deep mining of medical data. As shown in fig. 1, the joint learning model performs local computation on the servers of the medical data contributors themselves and uploads only encrypted intermediate results (statistical information, intermediate training models, etc.) to the central node server for secure aggregation; all training data (original data) remain on their original devices.

That is, the data contributors retain ownership of the data, the original data remain local, and the objects of search or analysis can all be encrypted data. The data miner can perform encrypted retrieval to keep the search target private; the data contributors can choose to lease out data and adjust prices according to market demand; if the retrieval results match, the data miner can choose to lease the corresponding data for joint learning analysis, and the encrypted analysis parameters and joint learning results can be extracted and viewed only by that data miner. Data contributors may deregister registered data at any time. Once registration is revoked, the encryption key is destroyed and the data miner can no longer use the data.

Illustratively, each data contributor of the invention is provided with a local data management interaction system, which comprises a data node server located within the local firewall, together with a local private database and a web-based interaction system that interact with the data node server. The data node server in turn interacts with the central node server and the secure computation server of the joint learning central control layer.

The data node server implements the management layer with Spring + Vue (prioritizing architecture) and implements local isolated computation in C++ (prioritizing speed). At the data contributor management layer, data contributors upload local data sets (fig. 7, fig. 8), specify access authority (e.g., restrictions based on time, place, personnel and task), and register the data meta-information (fig. 9) with the central node server. To carry out local isolated computation, the data node server receives the local isolated computation request (fig. 4) from the central node server, performs the local isolated computation of the corresponding joint learning method, and sends the intermediate result to the secure computation server for aggregation.

Taking the proportional hazards regression model as an example, the local isolated computation yields the first-derivative matrix DF and the second-derivative (Hessian) matrix DDF, which are sent to the secure computation server; the secure computation server returns the not-yet-converged coefficient matrix, and the two parties repeat this exchange until the convergence condition is met.
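
The iteration described above can be sketched in Python as a Newton-Raphson loop for a proportional hazards (Cox) model. This is a simplified illustration, not the patented implementation: it assumes a single covariate (so DF and DDF are scalars), no tied event times, and risk sets formed separately inside each node, which corresponds to a model stratified by data node. The function names and data are hypothetical.

```python
import math


def local_derivatives(times, events, x, beta):
    """On a data node: first (DF) and second (DDF) derivative of the local
    partial log-likelihood at the current coefficient beta."""
    df, ddf = 0.0, 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue  # censored subjects contribute only through risk sets
        risk = [j for j, t_j in enumerate(times) if t_j >= t_i]
        weights = [math.exp(beta * x[j]) for j in risk]
        s0 = sum(weights)
        s1 = sum(w * x[j] for w, j in zip(weights, risk))
        s2 = sum(w * x[j] ** 2 for w, j in zip(weights, risk))
        df += x[i] - s1 / s0
        ddf -= s2 / s0 - (s1 / s0) ** 2
    return df, ddf


def secure_server_update(beta, contributions):
    """On the secure computation server: sum DF and DDF, take one Newton step."""
    df = sum(c[0] for c in contributions)
    ddf = sum(c[1] for c in contributions)
    return beta - df / ddf


def fit(nodes, tol=1e-8, max_iter=50):
    """Repeat the node/server exchange until the coefficient stops changing."""
    beta = 0.0
    for _ in range(max_iter):
        contributions = [local_derivatives(*node, beta) for node in nodes]
        new_beta = secure_server_update(beta, contributions)
        if abs(new_beta - beta) < tol:
            return new_beta
        beta = new_beta
    raise RuntimeError("Newton-Raphson did not converge")


# Each node holds (times, event indicators, covariate); values are made up.
nodes = [
    ([1, 2, 3, 4], [1, 1, 1, 1], [1, 0, 1, 0]),
    ([1, 2, 3], [1, 1, 0], [1, 0, 1]),
]
beta_hat = fit(nodes)
```

Only the pair (DF, DDF) leaves each node in every round, and only the updated coefficient comes back, which mirrors the exchange in the paragraph above.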

Note that, before they communicate, the data node server and the secure computation server perform remote verification based on Intel enclave authentication technology.

The example central node server implements the joint learning central control layer with a Spring Boot + Vue architecture and, using Docker, can be deployed rapidly on any hardware platform on which Docker Compose is installed. The central node server is responsible for receiving the joint learning request from the data miner (fig. 2), notifying the secure computation server of the joint learning instance (fig. 3), sending local isolated computation requests to the cluster of data nodes involved in the joint learning (fig. 4), waiting for and receiving the joint learning result that the secure computation server collects from the data node cluster and aggregates (fig. 5), and generating a joint learning result report and returning it to the data miner (fig. 6).

The example secure computation server (trusted computing area) is implemented in C++/Rust in combination with Intel Software Guard Extensions (SGX). It accepts the joint learning request from the central node server (fig. 3), aggregates the local isolated computation results from the data node cluster (taking the proportional hazards regression model as an example, the intermediate results include the not-yet-converged coefficient matrix, the first-derivative matrix DF and the second-derivative Hessian matrix DDF), computes the final result and sends it to the central node server (fig. 5).

The following is an example of a specific service flow of the present invention:

Step one, the data contributor registers a data set (fig. 7 and fig. 8) through the local data node server and specifies the access authority, validity period and the like of the data set. All original data are stored in a local private database located within the firewall. At the same time, the data node server initiates enclave authentication with the central node server and, after confirming that the computing environment is secure and encrypted, sends the encrypted meta-information (fig. 9) to the central node server for recording.

Step two, the data miner completes user registration through the interactive system and, after verification, searches for the data sets available under the data miner's own authority and creates a joint learning instance.

Step three, the data miner initiates a joint learning request to the central node server (fig. 2).

Step four, based on the data sets selected by the data miner, the central node server sends local computation requests to all data nodes involved in this joint learning (fig. 4).

Step five, each data node performs local isolated computation and exchanges intermediate results (which involve no original data) with the secure computation server.

For example, in joint learning with the proportional hazards regression model: (1) the local isolated computation derives the first-derivative matrix DF and the second-derivative Hessian matrix DDF from the original data and sends them to the secure computation server; (2) the secure computation server computes the not-yet-converged coefficient matrix and returns it to the data node server; (3) operations (1) and (2) are repeated until the convergence condition on the coefficients is met. The data transmitted in this process comprise only the derivative matrices of the original data and the not-yet-converged parameter matrix, and contain no original data.

Moreover, all intermediate results (derivative matrices, not-yet-converged parameter matrix and the like) are transmitted in an encrypted state and are decrypted and processed only inside the trusted computing area. Even if the cloud server on which the secure computation server is deployed is hijacked by an attacker, the intermediate results cannot be leaked.

Step six, the secure computation server collects and updates the local isolated computation results of all the data nodes, generates and outputs the final joint learning result (fig. 5), and returns it to the central node server.

Step seven, the central node server generates a joint learning report (fig. 6), and the data miner queries or prints the joint learning result.
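
The seven steps above can be traced end to end with a small in-process simulation. The Python sketch below is illustrative only: class and method names are hypothetical, the joint learning method is a plain mean, encryption and attestation are omitted, and for brevity the central node relays the intermediate results instead of the data nodes sending them to the secure computation server directly.

```python
class DataNode:
    """Holds raw data privately and exposes only meta-information and summaries."""

    def __init__(self, name, raw_values):
        self.name = name
        self._raw = list(raw_values)  # never leaves the node

    def meta(self):
        return {"node": self.name, "methods": ["mean"]}

    def local_compute(self, method):
        if method != "mean":
            raise ValueError("unsupported method: " + method)
        return {"sum": sum(self._raw), "count": len(self._raw)}


class SecureComputationServer:
    """Aggregates intermediate results (step six)."""

    def aggregate(self, intermediates):
        total = sum(i["sum"] for i in intermediates)
        count = sum(i["count"] for i in intermediates)
        return {"mean": total / count, "count": count}


class CentralNode:
    """Records meta-information, dispatches requests, builds the report."""

    def __init__(self, secure_server):
        self.secure_server = secure_server
        self.registry = {}

    def register(self, node):  # step one
        self.registry[node.name] = {"meta": node.meta(), "node": node}

    def joint_learning(self, miner, node_names, method):  # steps three to seven
        intermediates = [
            self.registry[name]["node"].local_compute(method) for name in node_names
        ]
        result = self.secure_server.aggregate(intermediates)
        return {"miner": miner, "method": method,
                "nodes": list(node_names), "result": result}


central = CentralNode(SecureComputationServer())
central.register(DataNode("hospital_a", [1, 2, 3]))
central.register(DataNode("hospital_b", [4, 5]))
report = central.joint_learning("alice", ["hospital_a", "hospital_b"], "mean")
```

The report carries the pooled mean and count, while the central node and the data miner only ever see sums and counts from each node.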

Figs. 2 to 9 take joint learning with the proportional hazards regression model as an example:

Fig. 2 is an example of the data format of a joint learning request submitted by a data miner via a browser. The request provides, for example: data attribute information for the joint learning method, namely the list of selected attribute parameters (including attribute name, whether the attribute is categorical, attribute value, etc.); data node information, namely the unique identifier of the data node (together with the unique identifier and textual name of each of its data sets, etc.) and the textual description of the data node; and joint learning instance information, namely name, whether the instance is public, start time, expected end time, remarks, the unique identifier of the user who owns the joint learning instance, and the like.
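
A request with the three groups of information listed above might be encoded as in the following Python sketch. Every key name and value here is a hypothetical stand-in, since the figure itself is not reproduced in this text.

```python
import json

# All key names and values are hypothetical; they only mirror the three
# groups of information described for fig. 2.
request = {
    "attributes": [
        {"name": "age", "categorical": False, "value": None},
        {"name": "treatment", "categorical": True, "value": ["A", "B"]},
    ],
    "data_nodes": [
        {
            "node_id": "node-001",
            "dataset_id": "ds-17",
            "dataset_name": "Cohort A",
            "description": "Example hospital node",
        },
    ],
    "instance": {
        "name": "Example survival study",
        "public": False,
        "start_time": "2019-06-12T09:00:00",
        "expected_end_time": "2019-06-12T10:00:00",
        "remarks": "",
        "owner_user_id": "miner-42",
    },
}


def missing_sections(req):
    """Return the top-level sections a request lacks, in a fixed order."""
    return [key for key in ("attributes", "data_nodes", "instance") if key not in req]


encoded = json.dumps(request)
```

A central node server could reject a request as soon as `missing_sections` returns a non-empty list, before any data node is contacted.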

Fig. 3 is an example of the joint learning instance parameters that the central node server sends to the secure computation server, including the unique identifier of the joint learning instance, the joint learning task attributes corresponding to each method, and a data node list (including the unique identifier, network address and port of each data node, the current state of joint learning, and the like).

Fig. 4 is an example of a data format of a local computation request sent by a central node server to a data node, where the data format includes a file name of a local data set, a local isolation computation attribute list (including attribute values, attribute names, and information on whether the attributes can be classified), and a local unique identifier of the data set.

Fig. 5 is an example of a data format of the joint learning result collected by the security computation server and sent to the central node server, and includes a joint learning data set attribute list, a correlation coefficient, a Z test value, a P probability value, and the like.
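
How a Z test value and a P probability value could be derived from a fitted coefficient is sketched below in Python. This assumes a Wald test with a two-sided normal p-value, which the text does not state explicitly; the function name and the numbers are illustrative.

```python
import math


def wald_test(coefficient, standard_error):
    """Return (z, two-sided p) for H0: coefficient = 0 under a normal approximation."""
    z = coefficient / standard_error
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Made-up numbers: a coefficient of 0.98 with standard error 0.5.
z_value, p_value = wald_test(0.98, 0.5)
```

For a single-coefficient model fitted by Newton-Raphson, the standard error is commonly taken as the square root of the negative inverse of the second derivative at convergence, so the secure computation server already holds everything needed for this step.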

Fig. 6 is an example of a joint learning report generated by the central node server and returned to the data miner, including: a joint learning summary (joint learning name, creator, detailed description, public or private setting, creation time, completion time, etc.); joint learning parameters (attribute names, data nodes participating in the joint learning, and the like); and joint learning results (attribute names, correlation parameters, P probability values, Z test values and the like).

Fig. 7 is an example of the basic information of a data set stored by a data node server, which includes the data set itself (including its unique identifier in the local database, data set name, data set description, and the like), the methods the data set supports (such as the specific supported method, public authority, data set file name, authorized users, authorized institutions, authorization start and end times, and the like), and a data set summary (including attribute list, data volume, number of categories per attribute, category values, and the like).

Fig. 8 is an example of the original data stored by the data node server, which includes a list of attributes, whether each attribute is categorical, attribute values, and the like.

Fig. 9 is an example of the data set meta-information that a data node server registers with the central node server. It includes a data set meta-information list containing the metadata of each data set (such as whether attributes are categorical, attribute list, data set file name, unique identifier in the local database, supported joint learning methods, data set name, data set description, category to which each attribute belongs, effective start date, etc.), together with the data node name, the data node description, the data node pass token, the data node network address and port, the data node user name, and the like.

It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A medical data joint learning method based on trusted computing and privacy protection is characterized in that,

the joint learning central control layer is provided with a central node server and a safety calculation server, and interacts with data node servers respectively arranged on data contributor management layers of all data nodes;

the central node server receives and stores the non-sensitive meta-information uploaded by the data contributor through the data node server of the data node where the data contributor is located; the data node server is used for registering a local data set and uploading meta information to the central node server for recording; the meta-information corresponds to the original data of the data contributors without containing sensitive information of the original data;

the central node server receives and processes the joint learning request initiated by the data miner through the data miner interaction layer, informs the safety calculation server of the joint learning instance created by the data miner, and sends a local calculation request to the data node related to the current joint learning request; the data node server which receives the local computing request carries out local isolation computing on the locally stored original data and sends the obtained non-sensitive intermediate result to the safety computing server;

the safety calculation server collects and analyzes intermediate results sent to the safety calculation server by the data node server to obtain a joint learning result; and the central node server generates a joint learning report according to the joint learning result and returns the joint learning report to the data miner interaction layer.

2. The medical data joint learning method according to claim 1,

the medical data joint learning method comprises the following processes:

3. The medical data joint learning method according to claim 1 or 2,

the data contributor also sets access authority in the process of registering the data set through a data contributor management layer of the data node where the data contributor is located;

4. The medical data joint learning method according to claim 1 or 2,

the data miners select the data of the public data authority and/or the data of the data contributors appointed to the data miners to carry out joint learning;

5. The medical data joint learning method according to claim 1 or 2,

and uploading the meta information and the intermediate result of each data node to the joint learning central control layer in an encrypted state.

6. The medical data joint learning method according to claim 2,

before uploading the meta-information to the central node server, the data node initiates remote enclave authentication with the central node server based on Intel Software Guard Extensions (SGX);

7. The medical data joint learning method according to claim 1 or 2,

the meta information comprises a network protocol address and a port of the data node server, and a file name, description and supported research method of the original data; the intermediate result comprises an intermediate training model and statistical parameters.

8. A medical data joint learning system based on trusted computing and privacy protection, which is applied to the medical data joint learning method based on trusted computing and privacy protection as claimed in any one of claims 1 to 7, and the medical data joint learning system comprises:

9. The medical data joint learning system of claim 8,

the data node server implements its management architecture with Spring + Vue and implements local isolated computation in C++;

10. The medical data joint learning system of claim 9,

the data node server, the local private database and the first webpage end interaction system which are configured by the data contributor management layer are positioned in a local firewall of the data node;