Add remaining elements of protected health information #61
Comments
Thanks @higgi13425. Is the idea that people managing data under HIPAA will replace real data with fake data?
Exactly. To deidentify a clinical dataset.
zipcode replaced with deid_zipcode
name replaced with deid_name
street with deid_street
dob with deid_dob
etc.
Ideally, the date of birth (dob) would be the index date, and could be assigned a random date in the year 1900.
Then all other dates in the dataset could be adjusted relative to deid_dob, to preserve the sequence of events and the relative time between them while keeping the data deidentified.
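The date-shifting idea above can be sketched in base R. This is only an illustration, not an existing function in the package: the data frame, its column names, and the one-row-per-person layout are all assumptions made for the example.

```r
# Hypothetical example data; the column names are illustrative only.
set.seed(42)
visits <- data.frame(
  id        = c(1, 2),
  dob       = as.Date(c("1961-03-14", "1985-11-02")),
  admission = as.Date(c("2015-06-01", "2016-01-20")),
  discharge = as.Date(c("2015-06-05", "2016-01-22"))
)

# Draw one random day in 1900 per person to act as the de-identified dob.
days_1900 <- seq(as.Date("1900-01-01"), as.Date("1900-12-31"), by = "day")
visits$deid_dob <- sample(days_1900, nrow(visits), replace = TRUE)

# Shift every other date by the same per-person offset, so the order of
# events and the intervals between them are unchanged.
offset <- visits$deid_dob - visits$dob
visits$deid_admission <- visits$admission + offset
visits$deid_discharge <- visits$discharge + offset
```

With several rows per person, the offset would need to be computed once per person and reused across that person's rows.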
This would be really helpful for folks like me with HIPAA issues with PHI-containing datasets.
Even cooler - a function to 1) add a deid_x version of each PHI variable in the dataset, then 2) split the dataset into two - one with PHI plus a unique key (stored securely) - and the 2nd with the unique key plus the deid_x versions of the PHI data (plus all the other data).
Then you could share the 2nd dataframe (on GitHub, etc.), but if you really needed to, you could merge the two on the key to re-identify.
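A rough base R sketch of that split-and-merge workflow follows. Everything here is hypothetical: the data frame, the `phi_cols` vector, the `key` column, and the placeholder deid_ values, which a real version would draw from a fake-data generator.

```r
# Hypothetical data: `name` and `zipcode` stand in for PHI columns.
patients <- data.frame(
  name      = c("Ann Lee", "Bo Chan"),
  zipcode   = c("48109", "27710"),
  lab_value = c(5.1, 7.3),
  stringsAsFactors = FALSE
)
phi_cols <- c("name", "zipcode")

# 1) Add a deid_x version of each PHI variable (placeholder values here).
patients$deid_name    <- paste0("Person_", seq_len(nrow(patients)))
patients$deid_zipcode <- c("00001", "00002")

# Random key that is not derived from any PHI value.
set.seed(1)
patients$key <- sample(100000:999999, nrow(patients))

# 2) Split: PHI plus key (store securely) vs. key plus everything else.
phi_table <- patients[, c("key", phi_cols)]
shareable <- patients[, setdiff(names(patients), phi_cols)]

# Re-identify only when needed, by merging on the key.
reidentified <- merge(phi_table, shareable, by = "key")
```

Only `shareable` would be published. `phi_table` is the crosswalk and has to be protected like the original data.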
thanks for considering it.
Peter
added in a bunch of locales that we had data for but were not using yet; added in many methods on PersonProvider for parts of names; tweaked internals of personprovider to work with names that have probabilities (so far only in en_gb) #62 #61; fixing addressprovider, adding en_GB and en_US, not done yet
thanks @higgi13425 done already
not done, questions
For the below, I assume there's no standard format to this? Is it just a string of letters and numbers? If so, we don't need specialized functions for each one.
not done, can do
Your function idea is interesting. I'll open a new issue for that so this issue can focus on the data types.
birthdate - the idea was to randomly select a day/month, and place the date of birth in a year that clearly is *not* the real date of birth - so that there is no confusion later between true dob and deid_dob. 1900 is a reasonable year, in that there are no people born in 1900 still alive.
county name - for my purposes, US county only. I could imagine that if this becomes popular, the equivalent in other countries would be worthwhile.
I agree, most of the numbers can already be done.
fax number ~ phone number
This sounds promising!
Peter
library(charlatan)
z <- DateTimeProvider$new()
# random date-time within the year 1900
z$date_time_between("1900-01-01", "1900-12-31")
Many of these are included already, but the full list is here:
https://medschool.duke.edu/research/clinical-and-translational-research/duke-office-clinical-research/irb-and-institutional-14
Name
Address (all geographic subdivisions smaller than state, including street address, city, county, and zip code)
All elements (except years) of dates related to an individual (including birthdate, admission date, discharge date, date of death, and exact age if over 89)
Telephone numbers
Fax number
Email address
Social Security Number
Medical record number
Health plan beneficiary number
Account number
Certificate or licence number
Any vehicle or other device serial number
Web URL
Internet Protocol (IP) Address
Finger or voice print
Photographic image - Photographic images are not limited to images of the face.
Any other characteristic that could uniquely identify the individual
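For the "exact age if over 89" element in the list above, one common approach is to collapse those ages into a single top category rather than replace them with fake values. A minimal base R sketch, where the `ages` vector and the "90+" label are assumptions chosen for illustration:

```r
# Hypothetical ages; anything over 89 is collapsed into one category.
ages <- c(34, 89, 90, 97)
deid_age <- ifelse(ages > 89, "90+", as.character(ages))
```

The result is a character vector, since the top category is no longer a number.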