-
Notifications
You must be signed in to change notification settings - Fork 128
Add support for parallel data curation #193
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails 10000 .
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
…set test, fix a few data and import bugs Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
…rget Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
…ataset is sometimes turned into its parent class by mistake, add write to simple bitext functionality, update bitext tutorial Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
…xtensions are removed twice before writing Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
…ntScoreFilter can take more than one fields for source and target Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
…tern, test currently failing Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
…led, fix tutorial Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, just a couple of nits. I'll set up a dedicated call to walk through some of the model stuff though, I think that might need a bit more work.
nemo_curator/download/__init__.py
Outdated
@@ -47,17 +47,7 @@ | |||
"import_downloader", | |||
"import_extractor", | |||
"import_iterator", | |||
"download_common_crawl", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this was modified by mistake, so probably revert it unless there's something I'm missing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this still should be addressed.
def _single_partition_write_to_simple_bitext(out_df, output_file_path): | ||
src_output_file_path = output_file_path + f".{out_df['src_lang'].iloc[0]}" | ||
tgt_output_file_path = output_file_path + f".{out_df['tgt_lang'].iloc[0]}" | ||
with open(src_output_file_path, "w+") as src_out, open( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just checking, has this been tested with multiple partitions? I just want to ensure no race condition happens when you have multiple workers writing to the same file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Method has been changed. Resolving this.
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
…ke logic clearer Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
requirements/requirements_bitext.txt
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you delete this file now that we use the pyproject.toml?
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So excited to finally have this in!
* add data interface to read simple bitext Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * adding ParallelScoreFilter Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * add test for ParallelScoreFilter, small style change for ParallelDataset test, fix a few data and import bugs Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * allow ParallelScoreFilter to take different filters for source and target Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * add JointScoreFilter and LengthRatioFilter Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * [WIP] add heuristic filter w/o test Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * merge with main Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * add test for histogram filter, fix a few bugs Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * length ratio, joint score filter testing Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * fix typing in joint test Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * add a fake comet qe filter as an initial step Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * [WIP] adding bitext cleaning tutorial Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * [WIP] fixing example Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * fix slow histogram filter, fix faulty bitext loading Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * tutorial running Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * [WIP] documentation of bitext tutorial Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * add tested version of comet-qe filter Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * fix ParallelDataset bug where single file name is not accepted, and dataset is sometimes turned into its parent class by mistake, add write to simple bitext functionality, update bitext tutorial Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * add docstring to explain simple bitext format, fix a bug where file extensions are removed twice before writing Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * remove print line for debug Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * add comet filter to tutorial Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * refactor COMET QE filter to decouple model from filter, make sure JointScoreFilter can take more than one fields for source and target Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * use refactored qe filter Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * wrap_qe_input should be a static method Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * use conditional import for comet, formatting changes Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * [WIP] add cometoid Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * [WIP] attempt to resolve device conflict but is failing Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * [WIP] playing with cometoid arguments Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * [WIP] -d 0 doesn't look necessary Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * tested arguments for Cometoid Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * use proper safe import, make sure test doesn't crash sans comet/pymarian Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * falling back to comet for tutorial since that's easier to set up, uppdate README Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * give credit to original fairseq implementation of histogram filtering, run black formatter Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * fix pre-commit complaint Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * fix small bug Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * fix another occurrence of the same bug Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * introduce shard limit to a single PyMarian API call to avoid memory leakage Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * repartition after reading simple bitext data Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * -d 0 is actually needed for pymarian Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * remove duplicate LengthRatioFilter definition Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * refactor repeated code segment in file writing, change classifier to accomodate custom field names, pause doc repartition since it causes problems Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * [WIP] addressed comments in NVIDIA#193 apart from resolving .iloc pattern, test currently failing Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * refactor to resolve .loc pattern, test passing Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * add missing file Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * revert changes in setup.py Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * fix a small bug in parallel dataset, explain why repartition is disabled, fix tutorial Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * add api guide, small change on bitext/parallel score filter docstring Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * fix read_simple_bitext test issues Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * reinstate dependencies lost during merging Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * re-enable multiple partitions for simple bitext, add parallel write Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * take care of the case where filename is not supplied in dataframe, make logic clearer Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * address other minor comments in the PR, fix segment order scrambling Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * fix test errors, add bitext dependencies Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * add back more missing imports Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * add bitext to [all] in .toml, add platformdirs as dependency Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * merge upstream, remove old bitext requirement list Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> * delete requirement file again Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> --------- Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> Co-authored-by: nverma1 <neha.verma2017@gmail.com> Signed-off-by: Rucha Apte <ruchaa@nvidia.com>
Description
This PR adds support for parallel data curation. Namely:
ParallelDataset
that supports loading and writing parallel data in simple bitext format.ScoreFilter
subclassParallelScoreFilter
that allows application of existing monolingual filters on parallel data while maintaining the alignment of sentence/document pairs.ScoreFilter
subclassJointScoreFilter
that allows implementation of filters that takes both fields of the parallel sentence/document pairs.HistogramFilter
andLengthRatioFilter
.QualityEstimationFilter
.comet
andcometoid
.Joint work at MTMA 2024 with @nverma1.
Usage
See
tutorials/bitext_cleaning/main.py
.Checklist