US20140130178A1 - Automated Determination of Quasi-Identifiers Using Program Analysis
- Publication number
- US20140130178A1 (U.S. application Ser. No. 14/151,474)
- Authority
- United States (US)
- Prior art keywords
- program
- dataset
- fields
- quasi
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
- G06F21/6254—Protecting personal data, e.g. for financial or medical purposes by anonymising data, e.g. decorrelating personal data from the owner's identification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/566—Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6209—Protecting access to data via a platform, e.g. using keys or access control rules to a single file or object, e.g. in a secure envelope, encrypted and accessed using a key, or with access control rules appended to the object itself
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Computer Hardware Design (AREA)
- Bioethics (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Medical Informatics (AREA)
- Databases & Information Systems (AREA)
- Virology (AREA)
- Stored Programmes (AREA)
- Storage Device Security (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Debugging And Monitoring (AREA)
Abstract
A system and method for automated determination of quasi-identifiers for sensitive data fields in a dataset are provided. In one aspect, the system and method identifies quasi-identifier fields in the dataset based upon a static analysis of program statements in a computer program having access to sensitive data fields in the dataset. In another aspect, the system and method identifies quasi-identifier fields based upon a dynamic analysis of program statements in a computer program having access to sensitive data fields in the dataset. Once such quasi-identifiers have been identified, the data stored in such fields may be anonymized using techniques such as k-anonymity. As a result, the data in the anonymized quasi-identifier fields cannot be used to infer a value stored in a sensitive data field in the dataset.
Description
- This application claims the benefit of the filing date of U.S. Provisional Patent Application No. 61/174,690, filed May 1, 2009, the disclosure of which is hereby incorporated herein by reference.
- The present invention generally relates to a system and method for managing data, and more particularly to a system and method for identifying sensitive data so it can be anonymized in a manner that increases privacy.
- Databases or datasets containing personal information, such as databases containing healthcare records or mobile subscribers' location records, are increasingly being used for secondary purposes, such as medical research, public policy analysis, and marketing studies. Such use makes it increasingly possible for third parties to identify individuals associated with the data and to learn personal, sensitive information about those individuals.
- Undesirable invasion of an individual's privacy may occur even after the data has been anonymized, for example, by removing or masking explicit sensitive fields such as those that contain an individual's name, social security number, or other such explicit information that directly identifies a person.
- One way this may occur is, for example, by analyzing less explicit, so-called "quasi-identifier" fields in a dataset. In this regard, a set of quasi-identifier fields may be any subset of fields of a given dataset which can either be matched with other, external datasets to infer the identities of the individuals involved, or used to determine a value of another sensitive field in the dataset based upon the values contained in such fields.
- For example, quasi-identifier fields may be data containing an individual's ZIP code, gender, or date of birth, which, while not explicit, may be matched with corresponding fields in external, publicly available datasets such as census data, birth-death records, and voter registration lists to explicitly identify an individual. Similarly, it may also be possible to infer the values of otherwise hidden fields containing sensitive information, such as disease diagnoses, if the values in such hidden, sensitive fields are dependent upon the values of other quasi-identifier fields in the dataset, such as fields containing clinical symptoms and/or prescribed medications, from which the information in an otherwise hidden field may be independently determined.
- Typical systems and methods that seek to protect information contained in a dataset have several shortcomings. For example, many conventional methods depend upon a central tenet that all fields that qualify as either explicit or quasi-identifier fields can be easily identified in a dataset, which is not always the case. In addition, typical conventional techniques primarily focus on preventing the identities of individuals from being revealed and do not adequately address the situation where values in other sensitive fields, such as an HIV diagnosis, may need to be hidden. Furthermore, conventional techniques that rely upon statistical analysis or machine learning approaches to determine quasi-identifiers in a dataset, while useful, are also prone to producing many false positives (fields falsely identified as being quasi-identifiers when they are not) as well as many false negatives (fields falsely identified as not being quasi-identifiers when they are).
- Therefore, improved methods and systems are desired for identifying and anonymizing quasi-identifier fields in a dataset whose values may be used to infer the values in other sensitive fields.
- In one aspect, a method for identifying quasi-identifier data fields in a dataset is provided. The method includes identifying a program having access to the dataset, the program including one or more program statements for reading or writing a value in one or more fields in the dataset; determining a first output program statement in the program, where the first program output statement is a program statement for writing a first value into a sensitive data field in the dataset; determining, with a processor, a first set of program statements in the program, where the first set of program statements includes one or more program statements that contribute to the computation of the first value written into the sensitive data field; and, analyzing, with the processor, the first set of program statements, and determining, based on the analysis of the first set of program statements, one or more quasi-identifier data fields associated with the sensitive data field in the dataset.
- In another aspect, a system for identifying data fields in a dataset is provided, where the system includes a memory storing instructions and data, and a processor for executing the instructions and processing the data. The data includes a set of programs and a dataset having one or more data fields, and the instructions include identifying a program in the set of programs, the program having one or more program statements for reading or writing a value in one or more fields in the dataset; determining a first output program statement in the program, where the first program output statement is a program statement for writing a first value into a sensitive data field in the dataset; determining a first set of program statements in the program, where the first set of program statements includes one or more program statements that contribute to the computation of the first value written into the sensitive data field; and, analyzing the first set of program statements, and determining, based on the analysis of the first set of program statements, one or more data fields associated with the sensitive data field in the dataset.
- FIG. 1 illustrates a system in accordance with an aspect of the invention.
- FIG. 2 illustrates a sample dataset in accordance with one aspect of the invention.
- FIG. 3 illustrates an example of a pseudo-code program in accordance with one aspect of the invention.
- FIG. 4 illustrates an example of the operation of the system in FIG. 1 in accordance with an aspect of the system and method.
- FIG. 5 illustrates a flow diagram in accordance with various aspects of the invention.
- FIG. 6 illustrates a block diagram of a computing system in accordance with various aspects of the invention.
- A system and method for automated determination of quasi-identifier fields for one or more sensitive fields in a given dataset are provided. Instead of identifying quasi-identifiers based on the conventional approach of analyzing the contents of a given dataset, the system and method disclosed herein identifies quasi-identifier fields based upon an analysis of the computer programs that are used to create and manipulate the dataset. Once such quasi-identifier fields have been found, they may be anonymized using existing anonymization techniques such as k-anonymity or L-diversity. As a result, the anonymized quasi-identifiers cannot be used to identify an individual associated with the data in the dataset, or to infer a value of one or more other sensitive fields contained in the dataset.
- FIG. 1 illustrates a system 10 in accordance with various aspects of the invention disclosed herein. System 10 includes a database 12, one or more programs 14, an analysis module 16, and a list of quasi-identifier fields 18.
- Database 12 may include one or more datasets 20, which may contain a set of data, including sensitive data that may need to be anonymized before the dataset is provided to an external third party for further use, e.g., research or analysis. Dataset 20 may be organized by conventional columns and rows, where each column may be a field such as name, age, address, medical symptom, medical diagnosis, etc. Likewise, each row may be a record that includes related data in one or more fields that is associated with an individual. While the system and method disclosed herein are advantageous when the dataset is organized by fields and records, it will be appreciated that the invention is not limited to any particular organization of data, and is equally applicable to any set of data that includes sensitive data that needs to be protected.
- Program 14 may be one or more computer programs containing program statements that include instructions, executable by a processor, for reading, writing, storing, modifying, or otherwise manipulating the data contained in dataset 20. Program statements in program 14 may include program instructions written in any programming language, such as Java, C, C++, C#, Javascript, SQL, Visual Basic, Perl, PHP, pseudo-code, assembly, machine language or any combination of languages. Further, it will be understood that the system and invention disclosed herein are not limited to any particular type of program or programming language.
- Analysis module 16 in system 10 may be implemented in hardware, software, or a combination of both. In one aspect, analysis module 16 may itself be a software program, executable by a processor, where the analysis module 16 has access to both program 14 and database 12 and contains instructions for analyzing program 14 to identify quasi-identifier fields for one or more sensitive fields in dataset 20. Alternatively, the functionality of analysis module 16 may also be implemented in hardware, such as on a custom application specific integrated circuit ("ASIC").
- Upon identification of the quasi-identifiers 18 based upon an analysis of program 14, the analysis module may also contain instructions for anonymizing the data in dataset 20 associated with such quasi-identifier fields using one or more anonymization techniques such as, for example, k-anonymity or other conventional anonymization techniques. Thus, quasi-identifiers 18 may be considered a list of quasi-identifier fields determined by analysis module 16 to contain information based upon which values in other sensitive fields in dataset 20 may be ascertained.
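By way of a non-limiting illustration, the short Python sketch below shows one simple form such an anonymization instruction could take: quasi-identifier combinations shared by fewer than k records are suppressed, so that every released combination matches at least k individuals. The field names and the suppression-only policy are illustrative assumptions, not part of the patent disclosure.

```python
from collections import Counter

def suppress_to_k_anonymity(records, quasi_ids, k=2, mask="*"):
    """Mask the quasi-identifier values of any record whose combination of
    quasi-identifier values is shared by fewer than k records, so that every
    released combination matches at least k individuals."""
    combos = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    masked = []
    for r in records:
        if combos[tuple(r[q] for q in quasi_ids)] < k:
            r = {**r, **{q: mask for q in quasi_ids}}  # suppress a rare combination
        masked.append(r)
    return masked

# Hypothetical records; ZIP code, gender, and birth year act as quasi-identifiers.
records = [
    {"zip": "12180", "gender": "F", "born": 1971, "diagnosis": "flu"},
    {"zip": "12180", "gender": "F", "born": 1971, "diagnosis": "cold"},
    {"zip": "12181", "gender": "M", "born": 1984, "diagnosis": "asthma"},
]
print(suppress_to_k_anonymity(records, ["zip", "gender", "born"], k=2))
```

In practice the sensitive diagnosis column itself would also be masked or generalized; the point here is only that the quasi-identifier fields, once identified, receive the same treatment as the explicitly sensitive ones.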
- Operation of the system and method in accordance with one aspect is described below. FIG. 2 shows a database 212 containing a dataset 220 that includes data organized in a generic row and column format. In one embodiment, the data contained in dataset 220 may be a collection of medical records of patients treated in a hospital. In accordance with this embodiment, each row 222 in dataset 220 may contain a record of related medical data associated with a particular patient, including, for example, one or more medical symptoms (factors) and medical diagnoses, where the medical diagnoses are determined based on the medical symptoms associated with the patient.
- Likewise, each medical diagnosis may also be represented by a different field in the dataset. For example, fields d1, d2, d3, d4, d5 and d6 may also contain Boolean, true or false data (not shown), where each field respectively represents whether the patient has been diagnosed with a medical diagnosis such as has_heart_disease, has_heart_burn, has_heart_murmur, has_COPD, needs_surgery, etc. based on the one or more factors associated with the patient.
- While particular types of fields containing Boolean or true or false data have been described for simplicity and ease of understanding, it will be understood that the system and method disclosed herein is not limited to any type of field or data, and is equally applicable to alpha-numeric data, image data, sound data, audio/visual data, documents, or any other compressed or uncompressed data that may be accessed and processed by a computer.
- The values contained in one or more diagnoses fields d1-d6 in
dataset 220 may be considered sensitive fields that need to be masked or anonymized for protecting the privacy of a patient before providing other data indataset 220 to an external third party for marketing, research, or other purposes. For example,dataset 220 may contain many other fields (not shown) such as the patient's name, address, date of birth, social security information, insurance information, etc., which may need to be provided to an insurance company to determine how subscribers of its plans are using the provisions provided by the company. In such cases, instead of hiding or protecting information related to an individual's identity, it may be desirable to instead protect specific medical conditions or diagnoses associated with the patient. In addition to masking or anonymizing the values in explicitly sensitive fields (such as a patient's diagnoses), it is also desirable to be able to identify and anonymize the values in other, quasi-identifier fields, which may otherwise be used by a knowledgeable third party to infer the values contained in the sensitive fields. Thus, in one aspect, the analysis module may identify a list of quasi-identifier fields associated with the sensitive fields d1-d6 by analyzing one or more programs that are used to create or modify the data contained indataset 220. - In computer programs, program slicing can be used to automatically identify parts of a program that affect the value of a given variable at a given point in that program. A computer program may compute many values during its execution. Program slicing can be used to determine the specific program statements that contribute towards the computation of specific values in that program. In one aspect, the analysis module may analyze one or more program slices in a program to automatically determine quasi-identifier fields for one or more sensitive fields in a dataset.
- When a dataset is manipulated by a program, its fields may be considered as input or output fields (or both) from the perspective of that program. For each output field that is believed to contain sensitive data (i.e., a sensitive field), the analysis module may determine the corresponding program slice (a set of program statements), which yield all statements that directly or indirectly contribute towards computation of its value. The analysis module may then identify or extract quasi-identifiers associated with that output field from the program slice. If the output field is deemed to be a sensitive field for privacy reasons, then not only should the data in that field be masked, but one or more of the identified quasi-identifier fields may also be masked or anonymized. Otherwise, there is a risk that the value of the data in the sensitive field may be inferred based on the values of the quasi-identifier fields.
- There are two aspects to computing and analyzing a program slice, referred to here as static analysis and dynamic analysis, by which the
analysis module 16 may automatically determine a set of quasi-identifiers for a sensitive data field in a given dataset. The operation of the analysis module in accordance with each aspect is explained below with reference to an exemplary pseudo-code program having access todataset 220. -
- FIG. 3 shows exemplary pseudo-code logic of a program 314 having access to dataset 220 that may be analyzed by the analysis module 16 using either static or dynamic program analysis. As seen therein, program 314 may contain instructions that may access dataset 220 and read, write, or modify the contents of the dataset. In one embodiment, program 314 may be an executable program slice of a larger medical program used by medical personnel to diagnose a patient with one or more medical diagnoses (d1, d2, d3, d4, d5, d6) based on one or more medical factors (f1, f2, f3, f4, f5, f6, f7 and f8) contained in a patient's record in the dataset 220.
- While line numbers 1-18 have been depicted to indicate certain program statements for ease of understanding, they are not necessary to the operation of the program. As indicated by reference numeral 316, program 314 may include one or more program statements for reading the values of one or more medical symptoms f1-f8 contained in a patient's record in the dataset. In addition, program 314 may also include program output statements (indicated by reference numeral 318) for writing one or more medical diagnoses d1-d6 into a patient's record in the dataset. Program 314 may execute according to the program statements in lines 1-18 to determine whether the value of one or more diagnoses d1-d6 is true or false based upon the particular factors f1-f8 exhibited by the patient.
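The pseudo-code of FIG. 3 is not reproduced in this text. However, the dependencies recited in the discussion of FIG. 4 below (lines 3-6 and 12-13 of program 314) admit a small, partial reconstruction. The following Python sketch is an illustrative stand-in for that fragment only; the rule structure is an assumption consistent with the description, not the patent's actual figure.

```python
def diagnose(record):
    """Illustrative stand-in for part of program 314: read Boolean factors
    from a patient's record and write diagnosis fields back into it."""
    if record["f2"] and record["f3"] and record["f4"]:  # cf. line 3
        record["d2"] = True                             # cf. line 4
    if record["f3"] and record["f5"]:                   # cf. line 5
        record["d3"] = True                             # cf. line 6
    if record["d2"]:                                    # cf. line 12
        record["d3"] = True                             # cf. line 13
    return record

record = {"f1": False, "f2": True, "f3": True, "f4": True, "f5": False,
          "d2": False, "d3": False}
print(diagnose(record))  # d2 becomes true, and therefore d3 does as well
```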
- FIG. 4 illustrates an example of the analysis module 16 using static analysis to determine one or more quasi-identifier fields associated with the sensitive diagnosis field d3.
- The analysis module may begin by analyzing the logic contained in the program statements in program 314, and identifying a program output statement (indicated by block 410) that writes a value into the sensitive data field d3 in dataset 220 (indicated by arrow 412).
- Thus, the analysis module may first identify the program statements in
program 314 that may have last assigned a value (directly contributed) to the value of d3 which was written into thedataset 220. As indicated by thearrows line 6 andline 13 assign values to d3, and that either statement may thus have assigned a value that was ultimately written into the sensitive data field d3 in thedataset 220. - Having identified that the write statement in
block 410 may be dependent on the program statements inlines program 314 to determine any further program statements upon which the program statements inlines - As the program statement on
line 6 is executed only if the condition online 5 is true (arrow 418), the analysis module may analyze the condition online 5 and determine that that the program statement online 6 depends upon the values of factors f3 and f5. Upon determining that the value in sensitive field d3 may depend on factors f3 and f5 (circled), the analysis module may recursively look for other statements that assign a value to these factors which may lead to yet further dependencies. As seen inFIG. 4 , however, factors f3 and f5 are not assigned any values and do not have any dependencies on any other fields in thedataset 220. Thus, the analysis module may stop recursively looking for further dependencies for factors f3 and f5 and identify both as possible quasi-identifier fields for the sensitive data field d3. - Applying a similar analysis to the program statement on
line 13, the analysis module may determine that the program statement online 13 depends on the condition on line 12 (arrow 420). The analysis module may thus analyze the program statement online 12 and also identify diagnosis d2 (circled) as a possible quasi-identifier field upon which the value assigned to sensitive field d3 may depend (a diagnosis field may be a quasi-identifier for another diagnosis field). - Upon determining that diagnosis field d3 may be dependent on diagnosis field d2 in
dataset 220, the analysis module may continue to recursively analyze the program to determine any further dependencies of d2 (which may indirectly contribute to the value written into sensitive data field d3) by analyzing the program for one or more statements that may have last assigned a value to d2, which, as shown byarrow 422, is the program statement online 4. - The assignment on program statement in
line 4 is dependent on the conditional statement in line 3 (arrow 424). Thus, the analysis module may examine the statement inline 3 and determine that the value of d2 may be dependent on the value of factors f2, f3, and f4 (circled). Continuing to examine the program statements associated with factors f2, f3, and f4 recursively as described above, the analysis module may determine that these factors do not have any other dependencies, and thus conclude that all backward dependencies (all potential quasi-identifiers) for sensitive data field d3 have now been found. As factors f3 and f5 were already identified as quasi-identifiers previously, the analysis module may thus simply identify factors f2 and f4 as additional quasi-identifiers for the data field d3. - As indicated previously, the analysis module may now collect all quasi-identifier fields identified above into a
quasi-identifier list 430 for sensitive data field d3, which, in this case, includes factors f2, f3, f4, and f5 and diagnosis d2, and anonymize or mask the quasi-identifier fields in the dataset using one or more conventional anonymization techniques.
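To make the backward-slicing walk-through above concrete, the following is a minimal Python sketch. The statement table is a hypothetical encoding of the fragment of program 314 described above (only each line's defined variable, used variables, and controlling condition are recorded); the names Stmt and backward_slice are illustrative and not taken from the actual analysis module.

```python
from dataclasses import dataclass

@dataclass
class Stmt:
    defines: str | None           # variable assigned by this line, if any
    uses: tuple[str, ...]         # variables read by this line
    guard: int | None = None      # line number of the controlling condition

# Hypothetical statement table (program line -> Stmt); only the d2/d3
# portion of the example discussed above is modeled.
STMTS = {
    3:  Stmt(None, ("f2", "f3", "f4")),   # line 3: condition over f2, f3, f4
    4:  Stmt("d2", (), guard=3),          # line 4: assigns d2 under line 3
    5:  Stmt(None, ("f3", "f5")),         # line 5: condition over f3, f5
    6:  Stmt("d3", (), guard=5),          # line 6: assigns d3 under line 5
    12: Stmt(None, ("d2",)),              # line 12: condition over d2
    13: Stmt("d3", (), guard=12),         # line 13: assigns d3 under line 12
}

def backward_slice(target: str) -> set[str]:
    """Collect every field that may contribute to the target's value."""
    deps: set[str] = set()
    worklist = [ln for ln, s in STMTS.items() if s.defines == target]
    seen: set[int] = set()
    while worklist:
        ln = worklist.pop()
        if ln in seen:
            continue
        seen.add(ln)
        stmt = STMTS[ln]
        for var in stmt.uses:
            deps.add(var)
            # recurse into the statements that may have assigned this variable
            worklist += [l for l, s in STMTS.items() if s.defines == var]
        if stmt.guard is not None:
            worklist.append(stmt.guard)   # follow the control dependence
    return deps

print(sorted(backward_slice("d3")))       # ['d2', 'f2', 'f3', 'f4', 'f5']
```

Running the sketch prints ['d2', 'f2', 'f3', 'f4', 'f5'], matching quasi-identifier list 430 above.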
- The recursive program analysis method disclosed above for identifying the quasi-identifier fields for sensitive field d3 in program 314 may be similarly applied to other sensitive fields in the dataset 220. - For example, applying the same static program analysis technique to diagnosis field d2 in
program 314 reveals that it reciprocally depends on diagnosis d3 and the values of the same four factors, f2, f3, f4, and f5. Thus, if fields d2 and d3 represent heart disease and heartburn, respectively, and f2, f3, f4, and f5 represent factors related to blood pressure, chest pain, exercise level, and type of diet, respectively, then, according to the above program, these two diagnoses not only depend on the same set of symptoms and risk factors but also depend on each other. Accordingly, if either d2 or d3 is considered a sensitive field for privacy reasons, then it is desirable to anonymize the other as well. Otherwise, there is a risk that the hidden diagnosis in one of these fields may be inferred based on the value of the diagnosis in the other field. - To provide yet another example, applying the same static program analysis technique to diagnosis field d4 in
program 314 reveals that the program statement on line 8 is the last statement that assigns a value to variable d4. In addition, the program statement on line 8 is based upon the condition in the program statement on line 7, which, as seen, is based upon factors f5 and f8. As neither of these factors has any further dependencies (neither is modified by program 314), the analysis module may stop its backward trace for further dependencies and determine that factors f5 and f8 are the only possible quasi-identifier fields for diagnosis field d4. - As illustrated by the foregoing examples, statically analyzing a program slice (a set of program statements) is often helpful in identifying quasi-identifiers for one or more sensitive data fields in a dataset. However, static analysis of a program may sometimes also yield false positives (i.e., identify one or more fields as quasi-identifiers when they are not).
- For example, this can be seen in
program 314 with respect to diagnosis field d5. In this case, if static program analysis is applied in the manner described above to diagnosis field d5, the result indicates that sensitive field d5 may depend on all eight factors, f1 through f8. However, if, instead of just statically identifying all potential dependencies as quasi-identifiers in the manner described above, the feasibility of paths during the operation (e.g., actual execution) of program 314 is also considered, it can be determined that, in fact, there are only five factors that may be quasi-identifier fields for sensitive field d5, as described below. - As seen in
program 314, the only place where d5 is assigned a value is in the program statement on line 17. However, line 17 can execute only if the condition on line 16 evaluates to true, i.e., only if d4 has the value true. But d4 can be true only if line 8 executes. And if line 8 executes, line 11 cannot execute, as lines 8 and 11 lie on alternative branches of the conditional statement on line 7. Therefore, as line 8 must execute if d5 is assigned a value, and line 11 cannot execute (i.e., is infeasible) if line 8 executes, the factors f3, f4, and f7 evaluated in line 11 are not valid quasi-identifiers for field d5. Thus, field d5 has only five valid quasi-identifiers, namely f1, f2, f5, f6, and f8. - Based on the foregoing, it can be seen that diagnosis field d5 has only five valid quasi-identifiers, and not eight as indicated by static analysis. Such false positives may also arise if the program in question makes heavy use of indirect references via memory addresses. False positives like these, however, may be avoided if the analysis module also uses dynamic analysis, which is described next.
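Before turning to dynamic analysis, the infeasible-path reasoning above can be made concrete with a toy fragment that has the same mutually exclusive branch structure; this is a hypothetical simplification, not the actual source of program 314.

```python
# Toy program (not program 314 itself): d4 is set only on the true
# branch of one conditional, and d5 is assigned only when d4 is true,
# so the else branch can never influence d5 on any real execution.
def toy(f3: bool, f4: bool, f5: bool, f7: bool, f8: bool) -> bool:
    d4 = d6 = d5 = False
    if f5 and f8:                 # cf. line 7
        d4 = True                 # cf. line 8: the only assignment to d4
    else:
        d6 = f3 and f4 and f7     # cf. line 11: alternative branch of line 7
    if d4 and not d6:             # cf. line 16
        d5 = True                 # cf. line 17
    return d5

# A path-insensitive static slice of d5 reaches the else branch through
# d6 and therefore reports f3, f4, and f7; but whenever d5 is assigned
# True, the else branch did not run, so those three factors never
# actually flow into d5 -- they are false positives.
```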
- The static program analysis technique described in the previous sections included traversing all program paths (whether feasible or infeasible) when looking for statements that may have assigned or used a value in a given field. Instead, the analysis module may dynamically analyze a program to determine the program paths that are actually taken (and/or not taken) during execution of the program under different or all possible input conditions, and thereby identify paths that are feasible and/or infeasible. While, as demonstrated above, a program such as
program 314 may contain infeasible paths, such paths will not be executed, and hence, no quasi-identifiers based on an analysis of an infeasible path would be considered by the analysis module during dynamic analysis. - In one embodiment, the analysis module may dynamically analyze the program statements in
program 314 by tracing or observing all paths taken by program 314 during its execution by a processor in determining one or more diagnoses for a record of a particular patient. As program 314 executes, the analysis module may trace or record the actual path(s) traversed based on the inputs provided to the program. As a result, the analysis module, when determining a program statement where a given field was assigned a value, may recursively analyze paths (program statements) that were actually taken during the execution of the program when identifying quasi-identifiers with respect to a given sensitive data field, and ignore paths (program statements) that were not taken by the program. - Furthermore, this holds true even if a program makes heavy use of indirect assignments or references via memory addresses, because when the execution path used to compute dynamic slices (a set of program statements that are executed in a path taken by the program) is recorded, the actual memory addresses of variables that are assigned values and are used in all statements along that path can also be recorded by the analysis module, such that the analysis module may decide whether a particular program statement assigning or using a value via a memory address refers to a potential quasi-identifier field or not.
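As a concrete illustration, a minimal sketch of such tracing in Python follows; diagnose is a hypothetical stand-in for a fragment of program 314, and sys.settrace is used to record the lines actually executed for a given input.

```python
import sys

executed: list[int] = []

def tracer(frame, event, arg):
    if event == "line":
        executed.append(frame.f_lineno)   # record each line as it runs
    return tracer

def diagnose(f5: bool, f8: bool) -> bool:  # hypothetical program fragment
    d4 = False
    if f5 and f8:
        d4 = True
    return d4

sys.settrace(tracer)
diagnose(True, True)
sys.settrace(None)
print(executed)   # only the lines actually taken for this input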
- Thus, in one aspect, the analysis module may trace program paths that are executed by
program 314 based on one or more possible combinations of inputs. As one of ordinary skill in the art would appreciate, programs will often operate upon a finite combination of inputs. For example, the analysis module may trace the execution of program 314 to dynamically identify quasi-identifier fields for the sensitive data field d5, based on true or false combinations of the finite factors f1-f8 and data fields d1-d4 and d6. While such a "brute force" approach may be computationally burdensome, it will eliminate any chance of generating false positives. In another aspect, the analysis module may consider additional information that may be used to determine only valid inputs. For example, the analysis module may be programmed with information regarding specific symptoms and diagnoses, such that it can generate and analyze the program based on valid combinations of certain symptoms and/or diagnoses, while ignoring other invalid ones. - As most programs are normally tested to ensure that they function (execute) as desired with respect to a program's features, in another aspect, the same test data sets (input conditions) that are used to validate the program during testing may also be used to dynamically analyze the corresponding program slices and identify quasi-identifier fields for one or more sensitive fields in a database.
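A minimal sketch of this exhaustive enumeration follows; run_and_trace is an assumed helper (for example, the tracer sketched above) that executes the program on one input vector and returns the set of lines taken.

```python
# With eight boolean factors there are only 2**8 = 256 input
# combinations, so every feasible path can be exercised exhaustively.
from itertools import product

def feasible_lines(run_and_trace, n_factors: int = 8) -> set[int]:
    covered: set[int] = set()
    for factors in product([False, True], repeat=n_factors):
        covered |= run_and_trace(*factors)   # lines executed for this input
    return covered   # union of lines on all feasible paths
```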
- In this regard, the dataset that contains the sensitive fields and the quasi-identifier fields in question may, before masking or anonymization of such fields, itself serve as a test set of data.
- Thus, dynamic analysis of a given program in the manner described above may dramatically reduce or even eliminate false positives in many cases.
- There is a tradeoff involved between using static and dynamic analysis. While computing and analyzing a static slice may be much more efficient (e.g., faster), it may lead to false positives. Dynamic analysis, by contrast, may be much more expensive, both in terms of computation time and space required, and it may miss some genuine dependencies (i.e., may allow false negatives if feasible paths of the program under certain input conditions are not evaluated), but it substantially reduces or eliminates false positives.
- Thus, in yet another embodiment, the analysis module may also adaptively determine whether static analysis or dynamic analysis is more appropriate for a given program. A determination that dynamic analysis is more appropriate for a particular program may be made adaptively, for example, if the program contains many indirect variable references, such that static analysis of such a program is likely to follow many infeasible paths and result in many false positives. Thus, in one aspect the analysis module may compare the number of indirect references in all or a portion of the program to a threshold and determine a likelihood of generating an unacceptable number of false positives. If the number of indirect references exceeds the threshold or the likelihood of generating false positives is unacceptably high, then the analysis module may analyze the program using dynamic analysis. In other cases, the analysis module may adaptively determine that the number of indirect references or the likelihood of generating false positives is low (based on a comparison with a threshold), and analyze the program using static analysis based upon the determination that it is likely to result in few or an acceptable number of false positives.
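A minimal sketch of such an adaptive check follows, assuming the analyzed program is Python source and using attribute and subscript accesses as a rough proxy for indirect references; the threshold value is an assumption, not taken from the description.

```python
import ast

INDIRECTION_THRESHOLD = 25   # assumed tuning parameter

def choose_analysis(source: str) -> str:
    """Pick dynamic analysis when indirect references are numerous."""
    tree = ast.parse(source)
    indirect = sum(isinstance(n, (ast.Attribute, ast.Subscript))
                   for n in ast.walk(tree))
    return "dynamic" if indirect > INDIRECTION_THRESHOLD else "static"
```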
-
FIG. 5 is a flow chart of a process 500 in accordance with various aspects of the system and method disclosed herein. The process begins in block 515. In block 520, the system and method identifies a set of programs containing one or more programs having access to a given dataset, where each program in the set of programs includes one or more program statements for reading, writing, modifying, or otherwise manipulating the data in the dataset. - In block 525, the system and method determines whether all programs in the set of programs identified in block 520 have been analyzed to determine one or more quasi-identifier fields for one or more sensitive data fields contained in the dataset.
- If the result of the check in block 525 is false, that is, not all of the programs in the set of programs have been analyzed, then in block 530 the system and method selects a program from the set of programs for analysis.
- In block 535, the system and method identifies a set of output statements in the program selected in block 530, where the set of output statements includes one or more output statements that write or update a value in one or more sensitive data fields in the dataset.
- In block 540, the system and method determines if all output statements in the set of output statements identified in block 535 have been analyzed.
- If the result of the check in block 540 is false, that is, not all output statements in the set have been analyzed, then in block 545 the system and method selects an output statement that remains to be analyzed from the set of output statements, where the output statement writes or updates a value of a given sensitive data field in the dataset.
- In block 550, the system and method recursively identifies, using, for example, static and/or dynamic analysis, a set of one or more program statements (e.g., a program slice) that indirectly or directly contribute to the value that is written by the output statement into the given sensitive data field in the dataset.
- In block 555, the system and method identifies one or more data fields in the dataset, which are indirectly or directly referenced in the set of program statements identified in block 550, as quasi-identifier fields for the given sensitive data field. The system and method then returns to block 540 to check if all output statements in the selected program have been analyzed.
- If the result of the check in block 540 is true, that is, all statements in the set of output statements for the selected program have been analyzed, the system and method returns to block 525 to check if all programs in the set of one or more programs have been analyzed.
- If the result of the check in block 525 is true, i.e., each program in the set of one or more programs has been analyzed, then the system and method proceeds to block 560.
- In block 560, the system and method uses conventional anonymization techniques such as K-anonymity or L-diversity to partially or completely mask the values of one or more fields in the dataset that have been determined to be quasi-identifier fields for one or more sensitive data fields. The system and method then ends in block 565.
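The overall flow of process 500 can be sketched as follows. The helper parameters output_statements and backward_slice stand in for blocks 535-555 and are assumptions; the masking step is shown as simple suppression, whereas a full K-anonymity implementation would generalize values (e.g., replace an age with an age range) rather than only suppress them.

```python
from collections import Counter

def k_anonymize(records: list[dict], quasi_ids: list[str], k: int = 3) -> list[dict]:
    """Block 560, simplified: suppress quasi-identifier values in any
    group of records sharing fewer than k identical QI combinations."""
    key = lambda r: tuple(r[q] for q in quasi_ids)
    counts = Counter(key(r) for r in records)
    return [{**r, **{q: "*" for q in quasi_ids}} if counts[key(r)] < k else r
            for r in records]

def process_500(dataset, programs, output_statements, backward_slice, k=3):
    """Blocks 520-560: gather quasi-identifiers per program and per
    output statement, then mask them in the dataset."""
    qi_fields: set[str] = set()
    for program in programs:                            # blocks 525-530
        for stmt in output_statements(program):         # blocks 535-545
            qi_fields |= backward_slice(program, stmt)  # blocks 550-555
    return k_anonymize(dataset, sorted(qi_fields), k)   # block 560
```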
-
FIG. 6 is a block diagram illustrating a computer system upon which various aspects of the system and method as disclosed herein can be implemented. FIG. 6 shows a computing device 600 having one or more input devices 612, such as a keyboard, mouse, and/or various other types of input devices such as pen-inputs, joysticks, buttons, touch screens, etc. Computing device 600 also contains a display 614, which could include, for instance, a CRT, LCD, plasma screen monitor, TV, projector, etc. In one embodiment, the computing device 600 may be a personal computer, server or mainframe, mobile phone, PDA, laptop, etc. In addition, computing device 600 also contains a processor 610, memory 620, and other components typically present in a computer. -
Memory 620 stores information accessible by processor 610, including instructions 624 that may be executed by the processor 610 and data 622 that may be retrieved, manipulated, or stored by the processor. The memory may be of any type capable of storing information accessible by the processor, such as a hard drive, ROM, RAM, CD-ROM, DVD, Blu-Ray disk, flash memory, or other write-capable or read-only memory. The processor 610 may comprise any number of well-known processors, such as processors from Intel Corporation. Alternatively, the processor may be a dedicated controller for executing operations, such as an ASIC. -
Data 622 may include dataset 20, program 14, and quasi-identifiers 18 as described above with respect to FIGS. 1-3. Data 622 may be retrieved, stored, modified, or processed by processor 610 in accordance with the instructions 624. For instance, although the invention is not limited by any particular data structure, the data may be stored in computer registers, in a relational database as a table having a plurality of different fields and records, in XML documents, or in flat files. The data may also be distributed across one or more relational databases. - Additionally, the data may also be formatted in any computer-readable format such as, but not limited to, binary values, ASCII, etc. Moreover, the data may include any information sufficient to identify the relevant information, such as descriptive text, proprietary codes, pointers, references to data stored in other memories (including other network locations), or information which is used by a function to calculate the relevant data.
-
Instructions 624 may implement the functionality described with respect to the analysis module and in accordance with the process disclosed above. The instructions 624 may comprise any set of instructions to be executed directly (such as machine code) or indirectly (such as scripts) by the processor. In that regard, the terms "instructions," "steps" and "programs" may be used interchangeably herein. The instructions may be stored in any computer language or format, such as in object code or modules of source code. In one embodiment, instructions 624 may include analysis module 16, and the processor may execute instructions contained in analysis module 16 in accordance with the functionality described above. - Although the
processor 610 and memory 620 are functionally illustrated in FIG. 6 as being within the same block, it will be understood that the processor and memory may actually comprise multiple processors and memories that may or may not be stored within the same physical housing or location. Some or all of the instructions and data, such as the dataset 20 or the program 14, for example, may be stored on a removable recording medium such as a CD-ROM, DVD or Blu-Ray disk. Alternatively, such information may be stored within a read-only computer chip. Some or all of the instructions and data may be stored in a location physically remote from, yet still accessible by, the processor. Similarly, the processor may actually comprise a collection of processors, which may or may not operate in parallel. Data may be distributed and stored across multiple memories 620 such as hard drives, data centers, server farms or the like. - In one aspect,
computing device 600 may communicate with one or more other computing devices (not shown). Each of such other computing devices may be configured with a processor, memory, and instructions, as well as one or more user input devices and displays. Each computing device may be a general-purpose computer, intended for use by a person, having all the components normally found in a personal computer such as a central processing unit ("CPU"), display, CD-ROM, DVD or Blu-Ray drive, hard drive, mouse, keyboard, touch-sensitive screen, speakers, microphone, modem and/or router (telephone, cable or otherwise), and all of the components used for connecting these elements to one another. In one aspect, for example, the one or more other computing devices may include a third party computer (not shown) to which the computing device 600 transmits a dataset for further use or analysis, where the dataset that the computing device 600 transmits to the third party computer may be a dataset that has been anonymized in accordance with various aspects of the system and method disclosed herein. - In addition,
computing device 600 may be capable of direct and indirect communication with such other computing devices over a network (not shown). It should be appreciated that a typical networking system can include a large number of connected devices, with different devices being at different nodes of the network. The network, including any intervening nodes, may comprise various configurations and protocols, including the Internet, intranets, virtual private networks, wide area networks, local networks, private networks using communication protocols proprietary to one or more companies, Ethernet, WiFi, Bluetooth, and HTTP. Communication across the network, including any intervening nodes, may be facilitated by any device capable of transmitting data to and from other computers, such as modems (e.g., dial-up or cable), network interfaces, and wireless interfaces. - Although the invention herein has been described with reference to particular embodiments, it is to be understood that these embodiments are merely illustrative of the principles and applications of the present invention. It is therefore to be understood that numerous modifications may be made to the illustrative embodiments and that other arrangements may be devised without departing from the spirit and scope of the present invention as defined by the appended claims.
Claims (14)
1. A method for automatically identifying one or more quasi-identifier data fields in a dataset, the method comprising:
identifying a program having access to the dataset, the program including one or more program statements for reading or writing a value in one or more fields in the dataset;
determining a first output program statement in the program, where the first output program statement is a program statement for writing a first value into a sensitive data field in the dataset;
determining, with a processor, a first set of program statements in the program, where the first set of program statements includes one or more program statements that contribute to the computation of the first value written into the sensitive data field; and,
analyzing, with the processor, the first set of program statements, and determining, based on the analysis of the first set of program statements, one or more quasi-identifier data fields associated with the sensitive data field in the dataset.
2. The method of claim 1 , further comprising:
anonymizing, in the dataset, data stored in at least one of the quasi-identifier data fields.
3. The method of claim 1 , further comprising anonymizing, in the dataset, data stored in at least one of the quasi-identifier data fields using K-anonymity.
4. The method of claim 1 , wherein determining the first set of program statements includes determining the first set of program statements using static program analysis.
5. The method of claim 1 , wherein determining the first set of program statements includes determining the first set of program statements using dynamic program analysis.
6.-7. (canceled)
8. The method of claim 1 , wherein analyzing, with the processor, the first set of program statements includes recursively analyzing, with the processor, the first set of program statements.
9. A system for automatically identifying one or more data fields in a dataset, the system comprising:
a memory storing instructions and data, the data comprising a set of programs and a dataset having one or more data fields;
a processor to execute the instructions and to process the data, wherein the instructions comprise:
identifying a program in the set of programs, the program having one or more program statements for reading or writing a value in one or more fields in the dataset;
determining a first output program statement in the program, where the first output program statement is a program statement for writing a first value into a sensitive data field in the dataset;
determining a first set of program statements in the program, where the first set of program statements includes one or more program statements that contribute to the computation of the first value written into the sensitive data field; and,
analyzing the first set of program statements, and determining, based on the analysis of the first set of program statements, one or more data fields associated with the sensitive data field in the dataset.
10. The system of claim 9 , wherein the instructions further comprise:
anonymizing, in the dataset, data stored in at least one of the data fields associated with the sensitive data field.
11. The system of claim 9 , wherein the instructions further comprise:
anonymizing, in the dataset, data stored in at least one of the data fields associated with the sensitive data field using K-anonymity.
12. The system of claim 9 , wherein determining the first set of program statements includes determining the first set of program statements using static program analysis.
13. The system of claim 9 , wherein determining the first set of program statements includes determining the first set of program statements using dynamic program analysis.
14.-15. (canceled)
16. The system of claim 9 , wherein analyzing the first set of program statements includes recursively analyzing the first set of program statements.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/151,474 US20140130178A1 (en) | 2009-05-01 | 2014-01-09 | Automated Determination of Quasi-Identifiers Using Program Analysis |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17469009P | 2009-05-01 | 2009-05-01 | |
US12/771,130 US8661423B2 (en) | 2009-05-01 | 2010-04-30 | Automated determination of quasi-identifiers using program analysis |
US14/151,474 US20140130178A1 (en) | 2009-05-01 | 2014-01-09 | Automated Determination of Quasi-Identifiers Using Program Analysis |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/771,130 Continuation US8661423B2 (en) | 2009-05-01 | 2010-04-30 | Automated determination of quasi-identifiers using program analysis |
Publications (1)
Publication Number | Publication Date |
---|---|
US20140130178A1 true US20140130178A1 (en) | 2014-05-08 |
Also Published As
Publication number | Publication date |
---|---|
WO2010127216A2 (en) | 2010-11-04 |
WO2010127216A3 (en) | 2011-01-27 |
US8661423B2 (en) | 2014-02-25 |
WO2010127216A8 (en) | 2011-05-05 |
US20110119661A1 (en) | 2011-05-19 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| AS | Assignment | Owner name: TELCORDIA TECHNOLOGIES, INC., NEW JERSEY; Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:AGRAWAL, HIRALAL;COCHINWALA, MUNIR;HORGAN, JOSEPH R.;SIGNING DATES FROM 20100428 TO 20100628;REEL/FRAME:035115/0798
| STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION