8000 Add support for parallel data curation by shuoyangd · Pull Request #193 · NVIDIA/NeMo-Curator · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Add support for parallel data curation #193

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails 10000 .

Already on GitHub? Sign in to your account

Merged
merged 60 commits into from
Nov 27, 2024
Merged

Conversation

shuoyangd
Copy link
Contributor

Description

This PR adds support for parallel data curation. Namely:

  • A new dataset class ParallelDataset that supports loading and writing parallel data in simple bitext format.
  • A new ScoreFilter subclass ParallelScoreFilter that allows application of existing monolingual filters on parallel data while maintaining the alignment of sentence/document pairs.
  • A new ScoreFilter subclass JointScoreFilter that allows implementation of filters that takes both fields of the parallel sentence/document pairs.
  • New heuristic filters: HistogramFilter and LengthRatioFilter.
  • Adding model-based filters with quality estimation models: QualityEstimationFilter.
  • Support for two families of quality estimation models: comet and cometoid.
  • A tutorial for parallel data curation.
  • Tests accompanying new features.

Joint work at MTMA 2024 with @nverma1.

Usage

See tutorials/bitext_cleaning/main.py.

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

shuoyangd and others added 30 commits August 8, 2024 14:28
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
…set test, fix a few data and import bugs

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
…rget

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
< 10000 div class="commit-build-statuses">
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
…ataset is sometimes turned into its parent class by mistake, add write to simple bitext functionality, update bitext tutorial

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
…xtensions are removed twice before writing

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
…ntScoreFilter can take more than one fields for source and target

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
…tern, test currently failing

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
…led, fix tutorial

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
@shuoyangd shuoyangd requested a review from ryantwolf October 1, 2024 22:50
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Copy link
Collaborator
@ryantwolf ryantwolf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, just a couple of nits. I'll set up a dedicated call to walk through some of the model stuff though, I think that might need a bit more work.

@@ -47,17 +47,7 @@
"import_downloader",
"import_extractor",
"import_iterator",
"download_common_crawl",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this was modified by mistake, so probably revert it unless there's something I'm missing.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this still should be addressed.

def _single_partition_write_to_simple_bitext(out_df, output_file_path):
src_output_file_path = output_file_path + f".{out_df['src_lang'].iloc[0]}"
tgt_output_file_path = output_file_path + f".{out_df['tgt_lang'].iloc[0]}"
with open(src_output_file_path, "w+") as src_out, open(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checking, has this been tested with multiple partitions? I just want to ensure no race condition happens when you have multiple workers writing to the same file.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Method has been changed. Resolving this.

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
…ke logic clearer

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you delete this file now that we use the pyproject.toml?

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Copy link
Collaborator
@ryantwolf ryantwolf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So excited to finally have this in!

@ryantwolf ryantwolf merged commit 3d14b0d into NVIDIA:main Nov 27, 2024
3 checks passed
@shuoyangd shuoyangd mentioned this pull request Dec 2, 2024
3 tasks
ruchaa-apte pushed a commit to ruchaa-apte/NeMo-Curator that referenced this pull request Dec 13, 2024
* add data interface to read simple bitext

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* adding ParallelScoreFilter

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* add test for ParallelScoreFilter, small style change for ParallelDataset test, fix a few data and import bugs

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* allow ParallelScoreFilter to take different filters for source and target

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* add JointScoreFilter and LengthRatioFilter

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* [WIP] add heuristic filter w/o test

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* merge with main

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* add test for histogram filter, fix a few bugs

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* length ratio, joint score filter testing

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* fix typing in joint test

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* add a fake comet qe filter as an initial step

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* [WIP] adding bitext cleaning tutorial

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* [WIP] fixing example

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* fix slow histogram filter, fix faulty bitext loading

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* tutorial running

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* [WIP] documentation of bitext tutorial

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* add tested version of comet-qe filter

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* fix ParallelDataset bug where single file name is not accepted, and dataset is sometimes turned into its parent class by mistake, add write to simple bitext functionality, update bitext tutorial

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* add docstring to explain simple bitext format, fix a bug where file extensions are removed twice before writing

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* remove print line for debug

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* add comet filter to tutorial

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* refactor COMET QE filter to decouple model from filter, make sure JointScoreFilter can take more than one fields for source and target

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* use refactored qe filter

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* wrap_qe_input should be a static method

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* use conditional import for comet, formatting changes

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* [WIP] add cometoid

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* [WIP] attempt to resolve device conflict but is failing

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* [WIP] playing with cometoid arguments

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* [WIP] -d 0 doesn't look necessary

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* tested arguments for Cometoid

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* use proper safe import, make sure test doesn't crash sans comet/pymarian

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* falling back to comet for tutorial since that's easier to set up, uppdate README

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* give credit to original fairseq implementation of histogram filtering, run black formatter

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* fix pre-commit complaint

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* fix small bug

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* fix another occurrence of the same bug

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* introduce shard limit to a single PyMarian API call to avoid memory leakage

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* repartition after reading simple bitext data

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* -d 0 is actually needed for pymarian

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* remove duplicate LengthRatioFilter definition

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* refactor repeated code segment in file writing, change classifier to accomodate custom field names, pause doc repartition since it causes problems

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* [WIP] addressed comments in NVIDIA#193 apart from resolving .iloc pattern, test currently failing

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* refactor to resolve .loc pattern, test passing

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* add missing file

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* revert changes in setup.py

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* fix a small bug in parallel dataset, explain why repartition is disabled, fix tutorial

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* add api guide, small change on bitext/parallel score filter docstring

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* fix read_simple_bitext test issues

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* reinstate dependencies lost during merging

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* re-enable multiple partitions for simple bitext, add parallel write

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* take care of the case where filename is not supplied in dataframe, make logic clearer

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* address other minor comments in the PR, fix segment order scrambling

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* fix test errors, add bitext dependencies

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* add back more missing imports

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* add bitext to [all] in .toml, add platformdirs as dependency

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* merge upstream, remove old bitext requirement list

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

* delete requirement file again

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>

---------

Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Co-authored-by: nverma1 <neha.verma2017@gmail.com>
Signed-off-by: Rucha Apte <ruchaa@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0