Surya is a document OCR toolkit that provides:
- Accurate OCR in 90+ languages
- Line-level text detection in any language
- Table and chart detection (coming soon)
It works on a range of documents (see usage and benchmarks for more details).
Surya is named for the Hindu sun god, who has universal vision.
Discord is where we discuss future development.
Name | Text Detection | OCR |
---|---|---|
Japanese | Image | Image |
Chinese | Image | Image |
Hindi | Image | Image |
Arabic | Image | Image |
Chinese + Hindi | Image | Image |
Presentation | Image | Image |
Scientific Paper | Image | Image |
Scanned Document | Image | Image |
New York Times | Image | Image |
Scanned Form | Image | Image |
Textbook | Image | Image |
You'll need Python 3.9+ and PyTorch. If you're not using a Mac or a GPU machine, you may need to install the CPU version of torch first. See the PyTorch installation instructions for more details.
Install with:
pip install surya-ocr
Model weights will automatically download the first time you run surya. Note that surya does not yet work with `transformers` 4.37+, so you will need to keep version `4.36.2`, which is installed with surya.
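If you want to confirm that your environment still has a compatible `transformers` release, a quick check along these lines may help. This is a minimal sketch, not part of surya: the helper names `parse_major_minor` and `is_compatible` are hypothetical, and the only fact taken from the note above is that 4.37 and later are not yet supported.

```python
import re
from importlib import metadata

# Assumption: 4.37 is the first unsupported release, per the note above.
FIRST_INCOMPATIBLE = (4, 37)


def parse_major_minor(version: str) -> tuple:
    """Return (major, minor) from a version string such as '4.36.2'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_compatible(version: str) -> bool:
    """True if the given transformers version is older than 4.37."""
    return parse_major_minor(version) < FIRST_INCOMPATIBLE


if __name__ == "__main__":
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        print("transformers is not installed")
    else:
        status = "OK" if is_compatible(installed) else "too new for surya"
        print(f"transformers {installed}: {status}")
```

If the check reports a version that is too new, reinstalling surya (or pinning `transformers` to 4.36.2) should restore a working setup.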
- Inspect the settings in `surya/settings.py`. You can override any settings with environment variables.
- Your torch device will be automatically detected, but you can override this. For example, `TORCH_DEVICE=cuda`. For text detection, the `mps` device has a bug (on the Apple side) that may prevent it from working properly.
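The override behaviour described above can be pictured roughly as follows. This is an illustrative sketch only, not surya's actual settings code: `resolve_device` is a hypothetical helper, and the fallback order (CUDA, then MPS, then CPU) is an assumption. The only detail taken from the text is that the `TORCH_DEVICE` environment variable takes precedence over auto-detection.

```python
import os
from typing import Mapping


def resolve_device(
    env: Mapping[str, str],
    cuda_available: bool,
    mps_available: bool,
) -> str:
    """Pick a torch device name, letting TORCH_DEVICE override detection."""
    override = env.get("TORCH_DEVICE")
    if override:
        # An explicit setting always wins over auto-detection.
        return override
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed; skipping device detection")
    else:
        device = resolve_device(
            os.environ,
            torch.cuda.is_available(),
            torch.backends.mps.is_available(),
        )
        print(f"selected device: {device}")
```

On a Mac where `mps` misbehaves for text detection, setting `TORCH_DEVICE=cpu` in the environment before running surya is one way to sidestep the issue.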
I've included a Streamlit app that lets you interactively try Surya on images or PDF files. Run it with:
pip install streamlit
surya_gui
You can OCR text in an image, PDF, or folder of images/PDFs with the following command. This will write out a JSON file with the detected text and bounding boxes, and optionally save images of the reconstructed page.
surya_ocr DATA_PATH --images --langs hi,en
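If you'd rather launch the command from a script, something like the sketch below could work. It only assembles the same command line shown above; `build_surya_ocr_command` and `run_surya_ocr` are hypothetical helpers, not part of surya, and the only flags used are `--images` and `--langs` from the example.

```python
import subprocess
from typing import List, Sequence


def build_surya_ocr_command(
    data_path: str,
    langs: Sequence[str],
    save_images: bool = False,
) -> List[str]:
    """Assemble the surya_ocr command line as an argument list."""
    cmd = ["surya_ocr", data_path]
    if save_images:
        # --images asks surya to also save images of the reconstructed page.
        cmd.append("--images")
    # Language codes are passed as one comma-separated value, e.g. "hi,en".
    cmd += ["--langs", ",".join(langs)]
    return cmd


def run_surya_ocr(
    data_path: str,
    langs: Sequence[str],
    save_images: bool = False,
) -> None:
    """Run surya_ocr and raise CalledProcessError if it exits with an error."""
    subprocess.run(
        build_surya_ocr_command(data_path, langs, save_images),
        check=True,
    )
```

For example, `run_surya_ocr("DATA_PATH", ["hi", "en"], save_images=True)` would invoke the same command as the one above, assuming `surya_ocr` is on your `PATH`.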