This repository provides source codes for two applications:
- i) Text detection and alignment such as psalms, orations, etc.
- ii) Books of Hours segmentation.
-
python3.7 see requirements.txt file
-
pycodestyle 2.6.0 (pep8) (--max-line-length=100) (ex: pycodestyle file.py --max-line-length=100)
-
After installing segeval, open the file windowdiff.py (.local/lib/python3.7/site-packages/segeval/window/) and replace the following instruction:
assert len(window) is window_size + 1 by assert len(window) == window_size + 1
- git clone https://github.com/hazemAmir/gitHorae.git
- cd Horae
- virtualenv -p python ENV
- source ENV/bin/activate
- pip3 install -r requirements.txt
- python3 run_load_json.py
- python3 run_text_alignment.py -t 40 -v True -s False
- python3 run_seg_preprocessing.py
- python3 run_segmentation.py -v True -r 100 -s False -t True -c svm -dt hier -l level1
-
Input: 2 input directories (TEKLIA JSON format)
- Manual annotations (*.json) located in "data/horae-json-export/manual_annotations/volumes_name.json" - Transcriptions (*.json) located in "data/horae-json-export/transcriptions/volumes_name*.json"
Output: 1 output directory:
- Alignment directory (contains raw files used to detect and extract liturgical texts such as psalms, etc.) --> location: ../data/alignment/raw/ conctains multicolumn csv files --> raw files columns format: transcription '\t' image_id '\t' element_id '\t' element_polygon - 1 log file that contains information about the aligned volumes and the number of aligned sections --> Location: ../data/alignment/alignment.log
-
Input: 2 input directories: choose train and test files from "../data/segmentation/raw/" and copy them into the following directories:
- Train directory: ../data/segmentation/train/raw/ - Test directory: ../data/segmentation/test/raw/
Output: 7 output directories (contain several files format dedicated to train and test line classification and segmentation)
--> location: ../data/segmentation/.... "../data/segmentation/train/csv/hier/" "../data/segmentation/train/csv/flat/" "../data/segmentation/test/csv/hier/" "../data/segmentation/test/csv/flat/" "../data/segmentation/test/choiformat/hier/" "../data/segmentation/test/choiformat/flat/" "../data/segmentation/test/txt/"
-
Choisir le mode validation sur les 8 volumes don't on a une annotation des 8 psaumes de la pénitence.
Les 8 volumes se trouvent dans le repertoire /data/alignement/raw_test/
Il faut donc executer la commande suivante:
- python3 run_text_alignment.py -t 40 -v True -s False
le paramètre "send -s" peut être mis à "True" pour la visualisation des annotations sur Arkindex
-
Application de la recherche des textes liturgiques sur tous les volumes présent dans /data/alignement/raw/ Dans ce cas,le mode validation doit être mis à "False"
- python3 run_text_alignment.py -t 40 -v False -s False
Ainsi, tous les textes liturgiques listés dans le fichier ../data/alignment/reference/textes_liturgiques_utf8.txt seront alignés
- Send --> --send or -s 'Send annotation to Arkindex' (True/False)
- Validation--> --valid or -v (True/False only if test annotations are available)
- THreshold --> --th or -t 'Threshold to select a segment as a candidate [40 - 95] %'
- python3 run_segmentation.py -v True -r 100 -s False -t True -c svm -dt hier -l level1 performs training with svm, bert or bert2 and produces segmentaton results for all the volums contained in the ../data/segmentation/test/raw/ directory
- Segmentation Level --> --level or -l (level1/level12/level123)
- Segmentation type --> --data_type or -dt 'hierarchical or flat segmentation (hier/flat)
- Classifier --> --classifier or -cl (svm, bert ,bert2)
- Train --> --train or -t (True/False)
- Send --> --send or -s 'Send annotation to Arkindex' (True/False)
- Validation--> --valid or -v (True/False)
- Relaxaion factor --> --relaxation or -r 'Relaxation factor berween 50 and 100'