APPFL/cpbert

ClinicalPseudoBERT Pretrained LLM

This repository contains the corpus preprocessing and training instructions for the models and corpora described in the paper Enhancing Clinical Models with Pseudo Data for De-identification.

Reproducing Results

  1. Install the MIMIC-III database as described in the mimic package install section.
  2. Uncompress the pseudo sources: tar jxf pseudo-source.tar.bz2
  3. Install Python dependencies: pip install -r src/python/requirements-all.txt
  4. Load the SQLite DB from downloaded lists: ./cpbert load
  5. Create the admission files: ./cpbert admids
  6. Create the masked and pseudo corpora files: ./cpbert process <admission ID file>
  7. Create admission IDs to process: ./cpbert adms --shuffle -s 10 -o adm-ids
  8. Process the first set of 10: ./cpbert process adm-ids/0000 -d pseudos
  9. Create the corpus file: find pseudos -name \*-pseudo.txt -exec cat {} >> pseudo-corpus.txt \;
  10. Check the corpus size as newline, word, and byte counts: wc pseudo-corpus.txt
  11. Follow the instructions to reproduce the de-identification results.
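
Steps 9 and 10 can also be sketched in Python, which may help when the corpus is rebuilt more than once: the find command in step 9 appends with >>, so running it a second time duplicates the corpus. The sketch below is only an illustration and is not part of the cpbert tool. The function name build_corpus and its parameters are hypothetical, and it assumes the pseudos directory and *-pseudo.txt naming from steps 8 and 9.

```python
import shutil
from pathlib import Path


def build_corpus(source_dir, corpus_path, pattern="*-pseudo.txt"):
    """Concatenate files under source_dir matching pattern into corpus_path.

    Hypothetical helper (not part of cpbert). Returns a tuple of
    (newlines, words, bytes), the three counts that wc prints by default.
    """
    source_dir = Path(source_dir)
    corpus_path = Path(corpus_path)

    # Sort the matches so the corpus is identical from run to run;
    # find gives no ordering guarantee.
    files = sorted(p for p in source_dir.rglob(pattern) if p.is_file())

    # Mode "wb" truncates the output, so a rerun replaces the corpus
    # instead of appending to it as >> does.
    with corpus_path.open("wb") as out:
        for path in files:
            with path.open("rb") as src:
                shutil.copyfileobj(src, out)

    # Count over the finished corpus, one line at a time to bound memory use.
    newlines = words = size = 0
    with corpus_path.open("rb") as corpus:
        for line in corpus:
            newlines += line.endswith(b"\n")
            words += len(line.split())  # splits on ASCII whitespace
            size += len(line)
    return newlines, words, size
```

Calling build_corpus("pseudos", "pseudo-corpus.txt") would write the corpus and return the three counts. Word counts can differ from wc in locales where wc treats additional characters as whitespace.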

Models and Corpora

The pretrained de-identification models and the pseudo corpus are available upon request. Access to any of them requires documentation of PhysioNet certification, as explained in the paper.

License

MIT License

Copyright (c) 2025 Paul Landes
