This repository contains the corpus preprocessing and training instructions for the models and corpora described in the paper Enhancing Clinical Models with Pseudo Data for De-identification.
- Install the MIMIC-III database as described in the mimic package install section.
- Uncompress the pseudo sources:
tar jxf pseudo-source.tar.bz2
- Install Python dependencies:
pip install -r src/python/requirements-all.txt
- Load the SQLite DB from downloaded lists:
./cpbert load
- Create the admission files:
./cpbert admids
- Create the masked and pseudo corpora files:
./cpbert process <admission ID file>
- Create admission IDs to process:
./cpbert adms --shuffle -s 10 -o adm-ids
- Process the first set of 10:
./cpbert process adm-ids/0000 -d pseudos
- Create the corpus file:
find pseudos -name \*-pseudo.txt -exec cat {} >> pseudo-corpus.txt \;
- Check the corpus size as newline, word, and byte counts:
wc pseudo-corpus.txt
- Follow the instructions to reproduce the de-identification results.
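For readers who prefer to assemble and check the corpus from Python rather than with `find` and `wc`, the corpus creation steps above can be sketched roughly as follows. This is a minimal sketch and not part of the `cpbert` tool: the function names `build_corpus` and `corpus_counts` are hypothetical, and the only things taken from the steps above are the `pseudos` directory, the `*-pseudo.txt` file pattern, and the `pseudo-corpus.txt` output name.

```python
from pathlib import Path
from typing import Tuple


def build_corpus(src_dir: str, out_file: str, pattern: str = "*-pseudo.txt") -> int:
    """Concatenate every file under src_dir that matches pattern into out_file.

    Hypothetical helper (not part of cpbert). Returns the number of files
    that were concatenated.
    """
    out_path = Path(out_file).resolve()
    # Sort for a reproducible order; skip the output file itself in case it
    # lives under src_dir and happens to match the pattern.
    sources = sorted(
        p for p in Path(src_dir).rglob(pattern)
        if p.is_file() and p.resolve() != out_path
    )
    # Assumption: the output is overwritten here, whereas the shell command
    # above appends with ">>".
    with open(out_path, "wb") as out:
        for src in sources:
            out.write(src.read_bytes())
    return len(sources)


def corpus_counts(path: str) -> Tuple[int, int, int]:
    """Return (newlines, words, bytes), the three counts wc prints by default.

    Words are split on ASCII whitespace, which may differ slightly from wc
    under some locales.
    """
    data = Path(path).read_bytes()
    return data.count(b"\n"), len(data.split()), len(data)
```

Calling `build_corpus("pseudos", "pseudo-corpus.txt")` followed by `corpus_counts("pseudo-corpus.txt")` would mirror the two shell steps. Overwriting rather than appending is a deliberate choice in this sketch, so that rerunning it does not duplicate documents in the corpus.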
The pretrained de-identification models and the pseudo corpus are available upon request. All require documentation of certification by PhysioNet, as explained in the paper.
Copyright (c) 2025 Paul Landes