Mismatch between copyright/license file listing and corpus files #3

mmartin9684-sil · 2022-03-14T12:35:47Z

The corpus folder contains 109 files, while the copyright/license file listing contains 132 entries.

DCNemesis · 2022-03-14T12:44:45Z

I'm fairly confident the ones marked as cc-by in the copyright file are missing from the corpus (because there was no folder with that name), but that's only eight.
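
One way to pin down exactly which entries differ is a set comparison between the IDs in the license listing and the file names in the corpus folder. This is only a minimal sketch: it assumes the listing IDs are already available as a list of strings and that each corpus file is named `<id>.txt`. That naming convention and the helper name `find_mismatches` are assumptions for illustration, not something taken from the repo.

```python
from pathlib import Path


def find_mismatches(listed_ids, corpus_dir):
    """Return (IDs listed but missing from the corpus, IDs in the corpus but unlisted)."""
    # Assumption: every corpus file is named "<id>.txt".
    corpus_ids = {p.stem for p in Path(corpus_dir).glob("*.txt")}
    listed = set(listed_ids)
    missing_from_corpus = sorted(listed - corpus_ids)
    missing_from_listing = sorted(corpus_ids - listed)
    return missing_from_corpus, missing_from_listing
```

The first list would show whether the cc-by entries account for the whole gap or only part of it.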

davidbaines · 2022-04-26T16:59:04Z

I've extracted some data from the copr.htm files and added that to the translations and licenses spreadsheet.
It shows that there could be in the region of 1000 resources available to share, and we currently have 684 texts in the repo.
So we've better coverage of the license information and hope that there are more files we can identify to add soon.

janetzki · 2022-06-15T07:23:59Z

The metadata file contains some bibles (e.g., a Hindi bible (hin2017)) under a http://creativecommons.org/licenses/by-sa/4.0/ license that are not part of the published corpus. Do you know why this is the case?
And do you maybe have the code to merge the plain text chapter files from ebible.org?

davidbaines · 2022-06-15T13:00:01Z

Thank you for this note. There are a few discrepancies between the metadata file and the texts in the corpus. This is largely due to human (my) error. Eventually we hope to have these created with code, and that should eliminate most of these mismatches. At the moment though there are several manual steps to create the corpus and it's all too easy for mismatches to creep in.

janetzki · 2022-06-17T11:28:33Z

Hi @davidbaines, thanks a lot for your answer and your effort in putting this awesome corpus together. How did you get each bible into a single file? If you have some code to create it, could you share it, please? This would help me to automatically create semantic domain dictionaries from bibles.

davidbaines · 2022-06-17T13:33:33Z

We use extract_corpora.py or bulk_extract_corpora.py from SIL's NLP repo. They have a CLI to read USFM files from eBible and extract them into the one-verse-per-line format.

There's no need for a GPU when doing the extraction.
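
For downstream use, the one-verse-per-line format makes it straightforward to pair two extracted texts by line number. The sketch below rests on assumptions that are not confirmed in this thread: that both files list verses in the same order, have the same number of lines, and leave a line blank where a verse is missing. The function name `parallel_verses` is hypothetical.

```python
def parallel_verses(path_a, path_b):
    """Pair two one-verse-per-line files by line number.

    Assumption: line N refers to the same verse in both files, and a
    missing verse is represented by an empty line.
    """
    with open(path_a, encoding="utf-8") as fa, open(path_b, encoding="utf-8") as fb:
        lines_a = [line.rstrip("\n") for line in fa]
        lines_b = [line.rstrip("\n") for line in fb]
    if len(lines_a) != len(lines_b):
        raise ValueError("files have different line counts and cannot be paired")
    # Keep only the lines where both texts actually contain the verse.
    return [
        (number, a, b)
        for number, (a, b) in enumerate(zip(lines_a, lines_b), start=1)
        if a.strip() and b.strip()
    ]
```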

DCNemesis · 2022-06-17T15:47:31Z

@janetzki here are further instructions that were once given to me.
The extract_corpora script (under silnlp.common) is the one you want for extracting a single, or a few texts. It expects a certain folder structure and an environment variable named SIL_NLP_DATA_PATH which points to the root of that structure.
The folder structure looks like this:

$SIL_NLP_DATA_PATH
  MT
    scripture
  Paratext
    projects
      project_1
      project_2

Create a subfolder under ‘Paratext/projects’ for each project (i.e., “$SIL_NLP_DATA_PATH > Paratext > projects > projectA”) and unzip the project files into that subfolder. Then give that subfolder name as the command line option when you run extract_corpora:
python -m silnlp.common.extract_corpora <projectA> [<projectB> …]

You can extract multiple projects in one extract run.
You will find the extracted files in “$SIL_NLP_DATA_PATH > MT > scripture”
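
The setup described above can also be scripted. The following is a rough sketch rather than anything from the silnlp repo: the helper name `prepare_layout` is made up, and it is an assumption that creating `MT/scripture` in advance is harmless (the script may well create it itself). It builds the folder structure, sets `SIL_NLP_DATA_PATH` for the current process, and returns the command to run.

```python
import os
from pathlib import Path


def prepare_layout(root, projects):
    """Create the expected folders and return the extract_corpora command."""
    root = Path(root)
    # Output folder for the extracted texts (assumed safe to pre-create).
    (root / "MT" / "scripture").mkdir(parents=True, exist_ok=True)
    # One subfolder per project; unzip the project files into these.
    for name in projects:
        (root / "Paratext" / "projects" / name).mkdir(parents=True, exist_ok=True)
    # Visible to this process and to any subprocess started from it.
    os.environ["SIL_NLP_DATA_PATH"] = str(root)
    return ["python", "-m", "silnlp.common.extract_corpora", *projects]
```

After unzipping the project files into each subfolder, the returned list can be passed to `subprocess.run(cmd, check=True)` from an environment where silnlp is installed.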

OR
You can use the "silnlp.common.bulk_extract_corpora" script to extract all projects from a given folder. This script does not require you to set up a specific directory structure. We use poetry to manage our dependencies, so you will need to install it first and run "poetry install" to create a virtual environment and install the dependencies. Once that is done, you can run the script like this:

poetry run python -m silnlp.common.bulk_extract_corpora --input <projects_folder> --output <extracted_corpora_folder> --error-log <error_log_text_file>
The error log will contain all projects that failed to extract.
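
If you would rather drive the bulk extraction from Python, the command above can be assembled as an argument list. This is a small sketch using only the three options shown in this comment; the function names are hypothetical, and it assumes the command is launched from a checkout of the repo where "poetry install" has already been run.

```python
import subprocess


def bulk_extract_command(projects_folder, output_folder, error_log):
    """Build the bulk_extract_corpora command as an argument list."""
    return [
        "poetry", "run", "python", "-m", "silnlp.common.bulk_extract_corpora",
        "--input", str(projects_folder),
        "--output", str(output_folder),
        "--error-log", str(error_log),
    ]


def run_bulk_extract(projects_folder, output_folder, error_log):
    """Run the extraction and return the process exit code."""
    cmd = bulk_extract_command(projects_folder, output_folder, error_log)
    # check=False: failed projects are reported in the error log instead.
    return subprocess.run(cmd, check=False).returncode
```

Passing the arguments as a list avoids shell quoting problems when folder names contain spaces.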

janetzki · 2022-06-20T13:01:29Z

Thank you so much for the instructions, @davidbaines @DCNemesis. They worked: extract_corpora.py successfully put some bibles into a text file.

mmartin9684-sil assigned DCNemesis, cdleong and davidbaines Mar 14, 2022
