Mismatch between copyright/license file listing and corpus files · Issue #3 · BibleNLP/ebible · GitHub
Mismatch between copyright/license file listing and corpus files #3


Open
mmartin9684-sil opened this issue Mar 14, 2022 · 8 comments

@mmartin9684-sil
Contributor

The corpus folder contains 109 files, while the copyright/license file listing contains 132 entries.

@DCNemesis
Contributor

I'm fairly confident the ones marked as cc-by in the copyright file are missing from the corpus (because there was no folder with that name), but that's only eight.
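
A rough way to see exactly which entries differ is to compare the two sets of IDs. The sketch below is only an illustration: it assumes each corpus file is named `<id>.txt` and that the IDs from the copyright/license listing have already been read into a list. Both assumptions, and the function name, are hypothetical and not taken from the repo.

```python
from pathlib import Path


def compare_listing_to_corpus(listing_ids, corpus_dir, suffix=".txt"):
    """Return (listed_but_missing, present_but_unlisted) as sorted lists.

    Assumes (hypothetically) that a corpus file is named <id><suffix>.
    """
    corpus_ids = {p.stem for p in Path(corpus_dir).glob(f"*{suffix}")}
    listed = set(listing_ids)
    # IDs in the listing with no corpus file, and corpus files with no listing entry.
    return sorted(listed - corpus_ids), sorted(corpus_ids - listed)
```

The first list would show listing entries without a corpus file (for example the cc-by ones), and the second would show corpus files that the listing does not mention.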

@davidbaines
Contributor

I've extracted some data from the copr.htm files and added it to the translations and licenses spreadsheet.
It shows that there could be around 1000 resources available to share, and we currently have 684 texts in the repo.
So we now have better coverage of the license information, and we hope there are more files we can identify and add soon.

@janetzki
janetzki commented Jun 15, 2022

The metadata file contains some bibles (e.g., a Hindi bible (hin2017)) under a http://creativecommons.org/licenses/by-sa/4.0/ license that are not part of the published corpus. Do you know why this is the case?
And do you maybe have the code to merge the plain text chapter files from ebible.org?

@davidbaines
Contributor
davidbaines commented Jun 15, 2022 via email

@janetzki

Hi @davidbaines, thanks a lot for your answer and your effort in putting this awesome corpus together. How did you get each bible into a single file? If you have some code to create it, could you share it, please? This would help me to automatically create semantic domain dictionaries from bibles.

@davidbaines
Contributor

We use extract_corpora.py or bulk_extract_corpora.py from SIL's NLP repo. They have a CLI that reads USFM files from eBible and extracts them into the one-verse-per-line format.

There's no need for a GPU when doing the extraction.
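
For anyone consuming the extracted files, here is a minimal sketch of how a one-verse-per-line file might be paired with verse references. It assumes a separate list of references in which line i names the verse found on line i of the text, and it treats an empty line as a verse that is absent from that translation. Both conventions and the function name are assumptions made for illustration, not something confirmed in this thread.

```python
def pair_verses(vref_lines, text_lines):
    """Map each verse reference to the text on the same line number.

    Assumption: both inputs are parallel lists of lines, and an empty
    text line means the verse is missing from this translation.
    """
    if len(vref_lines) != len(text_lines):
        raise ValueError("reference list and text must have the same number of lines")
    return {
        ref.strip(): text.strip()
        for ref, text in zip(vref_lines, text_lines)
        if text.strip()  # skip verses that are not present
    }
```

Because every translation would share the same line numbering under this assumption, two extracted files could be aligned verse by verse simply by reading them in parallel.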

@DCNemesis
Contributor
DCNemesis commented Jun 17, 2022

@janetzki here are further instructions that were once given to me.
The extract_corpora script (under silnlp.common) is the one you want for extracting a single text or a few texts. It expects a certain folder structure and an environment variable named SIL_NLP_DATA_PATH, which points to the root of that structure.
The folder structure looks like this:

$SIL_NLP_DATA_PATH
  MT
    scripture
  Paratext
    projects
      project_1
      project_2

Create a subfolder under ‘Paratext/projects’ for each project (i.e., “$SIL_NLP_DATA_PATH > Paratext > projects > projectA”) and unzip the project files into that subfolder. Then give that subfolder name as the command line option when you run extract_corpora:
python -m silnlp.common.extract_corpora <projectA> [<projectB> …]

You can extract multiple projects in one extract run.
You will find the extracted files in “$SIL_NLP_DATA_PATH > MT > scripture”
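
The folder setup described above can be scripted. The following is a minimal sketch, not part of silnlp: the function name is made up, and it only creates the directories shown in the tree and sets SIL_NLP_DATA_PATH for the current Python process and any child processes it starts. You would still unzip each project's files into its subfolder yourself.

```python
import os
from pathlib import Path


def make_silnlp_layout(root, projects):
    """Create the folder tree described above and point SIL_NLP_DATA_PATH at it."""
    root = Path(root)
    # Extracted files are expected to land here.
    (root / "MT" / "scripture").mkdir(parents=True, exist_ok=True)
    # One subfolder per project under Paratext/projects.
    for name in projects:
        (root / "Paratext" / "projects" / name).mkdir(parents=True, exist_ok=True)
    # Only affects this process and processes launched from it.
    os.environ["SIL_NLP_DATA_PATH"] = str(root)
    return root
```

To make the variable available in later shell sessions, it would need to be exported in the shell as well.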

OR
You can use the "silnlp.common.bulk_extract_corpora" script to extract all projects from a given folder. This script does not require you to set up a specific directory structure. We use poetry to manage our dependencies, so you will need to install it first and run "poetry install" to create a virtual environment and install the dependencies. Once that is done, you can run the script like this:

poetry run python -m silnlp.common.bulk_extract_corpora --input <projects_folder> --output <extracted_corpora_folder> --error-log <error_log_text_file>
The error log will contain all projects that failed to extract.
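
If you want to launch the bulk extraction from a Python script, a thin wrapper could look like the sketch below. It only assembles the command line quoted above, using the same three options, and hands it to `subprocess`. The helper names are hypothetical, and it assumes the script is started from a checkout of the silnlp repo where "poetry install" has already been run.

```python
import subprocess


def bulk_extract_command(projects_folder, output_folder, error_log):
    """Build the argument list for the bulk extraction command quoted above."""
    return [
        "poetry", "run", "python", "-m", "silnlp.common.bulk_extract_corpora",
        "--input", str(projects_folder),
        "--output", str(output_folder),
        "--error-log", str(error_log),
    ]


def run_bulk_extract(projects_folder, output_folder, error_log):
    """Run the extraction; raises CalledProcessError on a non-zero exit code."""
    cmd = bulk_extract_command(projects_folder, output_folder, error_log)
    return subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when folder names contain spaces.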

@janetzki

Thank you so much for the instructions, @davidbaines @DCNemesis. They worked: extract_corpora.py successfully put some bibles into a text file.
