Document Anonymizer

This program takes a blob of text or the path to a .txt file and tries to remove all personally identifiable information (PII) while retaining the information that is important to the text via keyword extraction. E-mail addresses, phone numbers and links are redacted using heuristics. Other PII, such as names, nationalities and organizations, is redacted using spaCy's Named Entity Recognition (NER) system. A keyword extraction algorithm is then run on the input text to extract the words that carry the 'gist' of the text. Any keywords that spaCy has redacted are then removed from the keyword list.
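
The redaction steps described above could look roughly like the sketch below. It is not the code in docAnon.py: the regular expressions, the placeholder tags, the choice of entity labels (PERSON, NORP, ORG) and the function names are all assumptions made for illustration, and the keyword list is assumed to come from whichever keyword extractor is in use.

```python
import re

# Illustrative patterns only (assumed); the project's real heuristics may differ.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

# spaCy entity labels for people, nationalities/groups and organizations.
PII_LABELS = {"PERSON", "NORP", "ORG"}


def redact_heuristics(text):
    """Replace e-mail addresses, links and phone numbers with placeholder tags."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = URL_RE.sub("[LINK]", text)
    return PHONE_RE.sub("[PHONE]", text)


def redact_entities(text, nlp):
    """Redact PII entities found by a loaded spaCy pipeline `nlp`.

    Returns the redacted text and the set of lowercased entity strings removed.
    """
    doc = nlp(text)
    removed = set()
    # Walk entities right to left so earlier character offsets stay valid.
    for ent in reversed(doc.ents):
        if ent.label_ in PII_LABELS:
            removed.add(ent.text.lower())
            text = text[:ent.start_char] + f"[{ent.label_}]" + text[ent.end_char:]
    return text, removed


def filter_keywords(keywords, removed):
    """Drop keywords that overlap (by crude substring match) with a redacted entity."""
    return [
        k for k in keywords
        if not any(r in k.lower() or k.lower() in r for r in removed)
    ]
```

With the model from the Dependencies section installed, `nlp` can be obtained with `spacy.load("en_core_web_sm")` and passed to `redact_entities`. The substring check in `filter_keywords` is deliberately simple and may drop more keywords than strictly necessary.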

User Inputs: A string of text of any length, or the path to a .txt file
Outputs: The anonymized text
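
Since the input can be either raw text or a path to a .txt file, one way to tell the two apart is sketched below. The function name `load_text` and the rule it applies (a single-line input ending in .txt that names an existing file is read from disk; anything else is treated as text) are assumptions for illustration, not necessarily how docAnon.py decides.

```python
from pathlib import Path


def load_text(user_input):
    """Return the file contents if `user_input` names an existing .txt file,
    otherwise return the input unchanged as raw text."""
    candidate = user_input.strip()
    # Only single-line inputs ending in .txt are considered possible paths.
    if "\n" in candidate or not candidate.lower().endswith(".txt"):
        return user_input
    try:
        is_file = Path(candidate).is_file()
    except (OSError, ValueError):
        # Not a usable path (e.g. too long); treat it as text.
        is_file = False
    if is_file:
        return Path(candidate).read_text(encoding="utf-8")
    return user_input
```

A string such as `"notes.txt"` that does not match any file on disk is returned as-is, so it would be anonymized as ordinary text.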

Dependencies

Python 3.6+

Windows

Run pip install -r requirements.txt
Run python -m spacy download en_core_web_sm

Mac

Run pip3 install -r requirements.txt
Run python3 -m spacy download en_core_web_sm

Usage

Windows

Run python docAnon.py in the project directory

Mac

Run python3 docAnon.py in the project directory

Future Improvements

Get the PDF functionality working
Clean up some of these ugly empty print() statements and use delimiters instead
Better NER algorithm that doesn't have to make two passes of the tokenized text
My own keyword extraction algorithm, or tweaks to the existing one
Support for non-PDF file formats

Sources

spaCy Linguistic Features: https://spacy.io/usage/linguistic-features
Keyword extraction algorithm taken from Maarten Grootendorst, "Keyword Extraction with BERT": https://towardsdatascience.com/keyword-extraction-with-bert-724efca412ea
PDF Coordinates: https://stackoverflow.com/questions/22898145/how-to-extract-text-and-text-coordinates-from-a-pdf-file
