Mismatch between copyright/license file listing and corpus files #3
Comments
I'm fairly confident the ones marked as cc-by in the copyright file are missing from the corpus (because there was no folder with that name), but that's only eight. |
I've extracted some data from the copr.htm files and added that to the translations and licenses spreadsheet. |
The metadata file contains some bibles (e.g., a Hindi bible (hin2017)) under a http://creativecommons.org/licenses/by-sa/4.0/ license that are not part of the published corpus. Do you know why this is the case? |
Thank you for this note. There are a few discrepancies between the metadata file and the texts in the corpus. This is
largely due to human (my) error. Eventually we hope to have these created with code, and that should eliminate most of
these mismatches.
At the moment though there are several manual steps to create the corpus and it's all too easy for mismatches to creep
in.
On 15/06/2022 08:24, Jonathan Janetzki wrote:
…
The metadata file
<https://github.com/BibleNLP/ebible-corpus/blob/main/metadata/Copyright%20and%20license%20information.xlsx> contains
some bibles (e.g., a Hindi bible (hin2017)) under a http://creativecommons.org/licenses/by-sa/4.0/ license that are
not part of the published corpus. Do you know why this is the case?
|
Hi @davidbaines, thanks a lot for your answer and your effort in putting this awesome corpus together. How did you get each bible into a single file? If you have some code to create it, could you share it, please? This would help me to automatically create semantic domain dictionaries from bibles. |
We use extract_corpora.py or bulk_extract_corpora.py from SIL's NLP repo. They have a CLI that reads USFM files from eBible and extracts them into the one-verse-per-line format. No GPU is needed for the extraction. |
@janetzki here are further instructions that were once given to me.
Create a subfolder under ‘Paratext/projects’ for each project (i.e., “$SIL_NLP_DATA_PATH > Paratext > projects > projectA”) and unzip the project files into that subfolder. Then give that subfolder name as the command line option when you run extract_corpora: You can extract multiple projects in one extract run. OR
|
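The folder setup described above could be scripted roughly as follows. This is only a sketch under assumptions: the helper name `unpack_projects` is hypothetical, each project is assumed to arrive as one zip file, and the zip's base name is used as the project folder name. It only prepares the `Paratext/projects` layout; running extract_corpora is left out because its exact command-line options are not shown in this thread.

```python
import zipfile
from pathlib import Path


def unpack_projects(zip_paths, data_root):
    """Unzip each project archive into <data_root>/Paratext/projects/<name>.

    `data_root` stands in for the value of $SIL_NLP_DATA_PATH. Using the zip
    file's base name as the project name is an assumption of this sketch.
    Returns the list of project folder names that were created.
    """
    projects_dir = Path(data_root) / "Paratext" / "projects"
    names = []
    for zip_path in map(Path, zip_paths):
        target = projects_dir / zip_path.stem  # e.g. .../projects/projectA
        target.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(target)
        names.append(target.name)
    return names
```

The returned folder names are the ones you would then give to extract_corpora, for example after calling `unpack_projects(["projectA.zip"], os.environ["SIL_NLP_DATA_PATH"])`.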
Thank you so much for the instructions, @davidbaines @DCNemesis. They have worked so that |
The corpus folder contains 109 files, while the copyright/license file listing contains 132 entries.
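A quick way to see which entries differ is to compare the two sets of identifiers. The sketch below assumes that each corpus file name (minus its extension) matches an identifier in the listing; the function name `find_mismatches` and the sample identifiers are hypothetical and only illustrate the idea.

```python
from pathlib import Path


def find_mismatches(corpus_files, listed_ids):
    """Return (listed but not in corpus, in corpus but not listed).

    Assumes a corpus file's name without its extension equals the
    identifier used in the copyright/license listing.
    """
    corpus_ids = {Path(name).stem for name in corpus_files}
    listed = set(listed_ids)
    return sorted(listed - corpus_ids), sorted(corpus_ids - listed)


# Hypothetical sample data, not taken from the actual corpus.
corpus_files = ["eng-web.txt", "deu-luther.txt"]
listed_ids = ["eng-web", "deu-luther", "hin-hin2017"]

missing, unlisted = find_mismatches(corpus_files, listed_ids)
print("Listed but missing from corpus:", missing)
print("In corpus but not listed:", unlisted)
```

In practice `corpus_files` could come from listing the corpus folder and `listed_ids` from the relevant column of the spreadsheet; both sources are left out here because their exact layout is not described in this issue.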