Add support for parallel data curation by shuoyangd · Pull Request #193 · NVIDIA/NeMo-Curator · GitHub

Add support for parallel data curation #193


Merged: 60 commits, Nov 27, 2024
Changes from all commits
Commits (60)
c7a6423
add data interface to read simple bitext
shuoyangd Jul 30, 2024
4b3dc97
adding ParallelScoreFilter
nverma1 Jul 30, 2024
114716e
add test for ParallelScoreFilter, small style change for ParallelData…
shuoyangd Jul 31, 2024
cbab143
allow ParallelScoreFilter to take different filters for source and ta…
shuoyangd Jul 31, 2024
82f5486
add JointScoreFilter and LengthRatioFilter
nverma1 Jul 31, 2024
f9a0535
[WIP] add heuristic filter w/o test
shuoyangd Jul 31, 2024
8f25988
merge with main
shuoyangd Jul 31, 2024
612249c
add test for histogram filter, fix a few bugs
shuoyangd Jul 31, 2024
2fe4973
length ratio, joint score filter testing
nverma1 Jul 31, 2024
b61d7f1
fix typing in joint test
nverma1 Jul 31, 2024
f63a1f9
add a fake comet qe filter as an initial step
shuoyangd Aug 1, 2024
76bced7
[WIP] adding bitext cleaning tutorial
nverma1 Aug 1, 2024
1a2bb1e
[WIP] fixing example
nverma1 Aug 2, 2024
74698d5
fix slow histogram filter, fix faulty bitext loading
shuoyangd Aug 2, 2024
bf2e6ac
tutorial running
nverma1 Aug 2, 2024
62d1242
[WIP] documentation of bitext tutorial
nverma1 Aug 2, 2024
c413ea2
add tested version of comet-qe filter
shuoyangd Aug 2, 2024
5a90038
fix ParallelDataset bug where single file name is not accepted, and d…
shuoyangd Aug 5, 2024
f8046dd
add docstring to explain simple bitext format, fix a bug where file e…
shuoyangd Aug 5, 2024
6c7aea4
remove print line for debug
shuoyangd Aug 5, 2024
a457995
add comet filter to tutorial
shuoyangd Aug 5, 2024
c5a6f1c
refactor COMET QE filter to decouple model from filter, make sure Joi…
shuoyangd Aug 5, 2024
61713e4
use refactored qe filter
shuoyangd Aug 5, 2024
a4d2bb3
wrap_qe_input should be a static method
shuoyangd Aug 5, 2024
0674400
use conditional import for comet, formatting changes
shuoyangd Aug 6, 2024
6936f9a
[WIP] add cometoid
shuoyangd Aug 6, 2024
da96d29
[WIP] attempt to resolve device conflict but is failing
shuoyangd Aug 7, 2024
14b7d70
[WIP] playing with cometoid arguments
shuoyangd Aug 7, 2024
b02b56d
[WIP] -d 0 doesn't look necessary
shuoyangd Aug 7, 2024
6c1e719
tested arguments for Cometoid
shuoyangd Aug 8, 2024
70a7fe8
use proper safe import, make sure test doesn't crash sans comet/pymarian
shuoyangd Aug 8, 2024
c66d7f9
falling back to comet for tutorial since that's easier to set up, upp…
shuoyangd Aug 8, 2024
861bd4d
give credit to original fairseq implementation of histogram filtering…
shuoyangd Aug 8, 2024
52ba08e
fix pre-commit complaint
shuoyangd Aug 8, 2024
62c254b
fix small bug
shuoyangd Aug 11, 2024
91ea9fa
fix another occurrence of the same bug
shuoyangd Aug 13, 2024
12783ec
introduce shard limit to a single PyMarian API call to avoid memory l…
shuoyangd Aug 13, 2024
a65588a
repartition after reading simple bitext data
shuoyangd Aug 16, 2024
3f1d09b
-d 0 is actually needed for pymarian
shuoyangd Aug 16, 2024
102429a
remove duplicate LengthRatioFilter definition
shuoyangd Sep 5, 2024
8a367dd
refactor repeated code segment in file writing, change classifier to …
shuoyangd Sep 20, 2024
396d7ba
[WIP] addressed comments in #193 apart from resolving .iloc pattern, …
shuoyangd Sep 20, 2024
eb4f4df
refactor to resolve .loc pattern, test passing
shuoyangd Oct 1, 2024
3addf44
add missing file
shuoyangd Oct 1, 2024
a14a78a
revert changes in setup.py
shuoyangd Oct 1, 2024
6b8dfa0
fix a small bug in parallel dataset, explain why repartition is disab…
shuoyangd Oct 1, 2024
bb4f148
add api guide, small change on bitext/parallel score filter docstring
shuoyangd Oct 1, 2024
d309744
fix read_simple_bitext test issues
shuoyangd Oct 1, 2024
21676bd
Merge branch 'main' into main
shuoyangd Oct 1, 2024
7797925
reinstate dependencies lost during merging
shuoyangd Oct 2, 2024
be4f162
re-enable multiple partitions for simple bitext, add parallel write
shuoyangd Nov 14, 2024
3cd7683
take care of the case where filename is not supplied in dataframe, ma…
shuoyangd Nov 15, 2024
66edd4f
address other minor comments in the PR, fix segment order scrambling
shuoyangd Nov 27, 2024
5da2eec
merge upstream changes
shuoyangd Nov 27, 2024
6d9cf0b
fix test errors, add bitext dependencies
shuoyangd Nov 27, 2024
842ff43
add back more missing imports
shuoyangd Nov 27, 2024
680654d
add bitext to [all] in .toml, add platformdirs as dependency
shuoyangd Nov 27, 2024
5683a94
merge upstream, remove old bitext requirement list
shuoyangd Nov 27, 2024
d33cbb5
merge upstream again
shuoyangd Nov 27, 2024
42cab43
delete requirement file again
shuoyangd Nov 27, 2024
4 changes: 3 additions & 1 deletion docs/user-guide/api/datasets.rst
Original file line number Diff line number Diff line change
Expand Up @@ -9,10 +9,12 @@ DocumentDataset
.. autoclass:: nemo_curator.datasets.DocumentDataset
:members:

.. autoclass:: nemo_curator.datasets.ParallelDataset
:members:

-------------------------------
ImageTextPairDataset
-------------------------------

.. autoclass:: nemo_curator.datasets.ImageTextPairDataset
:members:
:members:
20 changes: 20 additions & 0 deletions docs/user-guide/api/filters.rst
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,10 @@ Base Class
:members:
:member-order: bysource

.. autoclass:: nemo_curator.filters.BitextFilter
:members:
:member-order: bysource

.. autofunction:: nemo_curator.filters.import_filter

------------------------------
Expand Down Expand Up @@ -40,6 +44,14 @@ FastText Filters
:members:
:member-order: bysource

------------------------------
Quality Estimation Filters
------------------------------

.. autoclass:: nemo_curator.filters.QualityEstimationFilter
:members:
:member-order: bysource

------------------------------
Heuristic Filters
------------------------------
Expand Down Expand Up @@ -132,6 +144,14 @@ Heuristic Filters
:members:
:member-order: bysource

.. autoclass:: nemo_curator.filters.HistogramFilter
:members:
:member-order: bysource

.. autoclass:: nemo_curator.filters.LengthRatioFilter
:members:
:member-order: bysource

------------------------------
Code Filters
------------------------------
Expand Down
3 changes: 2 additions & 1 deletion nemo_curator/datasets/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -15,9 +15,10 @@
from nemo_curator.utils.import_utils import image_only_import_from

from .doc_dataset import DocumentDataset
from .parallel_dataset import ParallelDataset

ImageTextPairDataset = image_only_import_from(
"nemo_curator.datasets.image_text_pair_dataset", "ImageTextPairDataset"
)

__all__ = ["DocumentDataset", "ImageTextPairDataset"]
__all__ = ["DocumentDataset", "ImageTextPairDataset", "ParallelDataset"]
167 changes: 167 additions & 0 deletions nemo_curator/datasets/parallel_dataset.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,167 @@
import csv
from typing import List, Optional, Tuple, Union

import dask.dataframe as dd
import pandas as pd

from nemo_curator.datasets.doc_dataset import DocumentDataset
from nemo_curator.utils.distributed_utils import write_to_disk
from nemo_curator.utils.file_utils import remove_path_extension
from nemo_curator.utils.import_utils import gpu_only_import

cudf = gpu_only_import("cudf")


class ParallelDataset(DocumentDataset):
"""
An extension of the standard `DocumentDataset` with a special method that loads simple bitext.

For data with more complicated metadata, please convert your data into jsonl/parquet/pickle format
and use interfaces defined in `DocumentDataset`.
"""

def persist(self):
return ParallelDataset(self.df.persist())

@classmethod
def read_simple_bitext(
cls,
src_input_files: Union[str, List[str]],
tgt_input_files: Union[str, List[str]],
src_lang: str,
tgt_lang: str,
backend: str = "pandas",
add_filename: bool = False,
npartitions: int = 16,
):
"""See `read_single_simple_bitext_file_pair` docstring for what "simple_bitext" means and usage of other parameters.

Args:
src_input_files (Union[str, List[str]]): one or several input files, in source language
tgt_input_files (Union[str, List[str]]): one or several input files, in target language

Raises:
TypeError: If the types of `src_input_files` and `tgt_input_files` do not agree.

Returns:
ParallelDataset: A `ParallelDataset` object with `self.df` holding the ingested simple bitext.
"""

if isinstance(src_input_files, str) and isinstance(tgt_input_files, str):
src_input_files = [src_input_files]
tgt_input_files = [tgt_input_files]
elif not isinstance(src_input_files, list) or not isinstance(
tgt_input_files, list
):
raise TypeError("Both file inputs must be strings or lists.")

# use default doc id for now
# but in the future it might be useful to allow customizing doc id by passing a prefix
df_files = []
# We do not use `dd.from_map` because an individual file could be pretty large,
# so it is not appropriate to partition based on individual files.
# Instead, we concatenate all the individual files and then repartition.
for src_input_file, tgt_input_file in zip(src_input_files, tgt_input_files):
df_file = ParallelDataset.read_single_simple_bitext_file_pair(
(src_input_file, tgt_input_file),
src_lang=src_lang,
tgt_lang=tgt_lang,
backend=backend,
add_filename=add_filename,
)
df_files.append(df_file)

if backend == "cudf":
df = cudf
else:
df = pd

data = dd.from_pandas(df.concat(df_files), npartitions=npartitions)
return cls(data)

def to_bitext(
self,
output_file_dir,
write_to_filename=False,
):
"""See `nemo_curator.utils.distributed_utils.write_to_disk` docstring for parameter usage."""
write_to_disk(
df=self.df,
output_file_dir=output_file_dir,
write_to_filename=write_to_filename,
output_type="bitext",
)

@staticmethod
def read_single_simple_bitext_file_pair(
input_file_pair: Tuple[str],
src_lang: str,
tgt_lang: str,
doc_id: Optional[str] = None,
backend: str = "cudf",
add_filename: bool = False,
) -> Union[dd.DataFrame, "dask_cudf.DataFrame"]:
"""This function reads a pair of "simple bitext" files into a pandas DataFrame.
A simple bitext is a common data format in machine translation.
It consists of two plain text files with the same number of lines, each line pair being translations of each other. For example:

data.de:

```
Wir besitzen keine Reisetaschen aus Leder.
Die Firma produziert Computer für den deutschen Markt.
...
```

data.en:

```
We don't own duffel bags made of leather.
The company produces computers for the German market.
...
```

For simplicity, we also assume that the names of the two text files share the same prefix, differing only in the language code used as the file extension.

Args:
input_file_pair (Tuple[str]): A pair of file paths pointing to the input files
src_lang (str): Source language, in ISO-639-1 (two character) format (e.g. 'en')
tgt_lang (str): Target language, in ISO-639-1 (two character) format (e.g. 'de')
doc_id (str, optional): A string document id to assign to every segment in the file. Defaults to None.
backend (str, optional): Backend of the data frame. Defaults to "cudf".
add_filename (bool, optional): Add filename as an extra field to every segment in the file. Defaults to False.

Returns:
Union[dd.DataFrame, dask_cudf.DataFrame]
"""
src_input_file, tgt_input_file = input_file_pair
assert remove_path_extension(src_input_file) == remove_path_extension(
tgt_input_file
), f"Assuming source and target filenames would have common prefix before language code, but got {src_input_file} and {tgt_input_file}."

if not doc_id:
doc_id = "▁".join([src_input_file, tgt_input_file])

if backend == "cudf":
df = cudf
else:
df = pd

df_src = df.read_csv(
src_input_file, names=["src"], sep="\t", quoting=csv.QUOTE_NONE
)
df_tgt = df.read_csv(
tgt_input_file, names=["tgt"], sep="\t", quoting=csv.QUOTE_NONE
)
assert len(df_src) == len(
df_tgt
), f"We assume the source and target file would have the same number of lines, but got {len(df_src)} and {len(df_tgt)}."
df_combined = df.concat([df_src, df_tgt], axis=1)
df_combined["doc_id"] = doc_id
df_combined["src_lang"] = src_lang
df_combined["tgt_lang"] = tgt_lang

if add_filename:
df_combined["filename"] = remove_path_extension(src_input_file)

return df_combined
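The reading logic above can be sketched in plain pandas. This is a minimal, self-contained illustration of what `read_single_simple_bitext_file_pair` does, not a call into the library: in-memory buffers stand in for the `data.de`/`data.en` file pair, and the `doc_id` value mimics the default built by joining the two filenames with "▁".

```python
import csv
import io

import pandas as pd

# Two "simple bitext" sides with the same number of lines; line i of one
# is the translation of line i of the other.
src_text = (
    "Wir besitzen keine Reisetaschen aus Leder.\n"
    "Die Firma produziert Computer für den deutschen Markt.\n"
)
tgt_text = (
    "We don't own duffel bags made of leather.\n"
    "The company produces computers for the German market.\n"
)

# Read each side as a single untyped column; csv.QUOTE_NONE keeps stray
# quotation marks inside sentences from confusing the CSV parser.
df_src = pd.read_csv(io.StringIO(src_text), names=["src"], sep="\t", quoting=csv.QUOTE_NONE)
df_tgt = pd.read_csv(io.StringIO(tgt_text), names=["tgt"], sep="\t", quoting=csv.QUOTE_NONE)
assert len(df_src) == len(df_tgt), "bitext sides must align line by line"

# Column-wise concat pairs line i with line i; metadata columns follow.
df = pd.concat([df_src, df_tgt], axis=1)
df["doc_id"] = "data.de▁data.en"
df["src_lang"] = "de"
df["tgt_lang"] = "en"
print(df.columns.tolist())  # ['src', 'tgt', 'doc_id', 'src_lang', 'tgt_lang']
```

In the actual implementation the resulting per-file-pair frames are concatenated and handed to Dask via `dd.from_pandas` with the requested number of partitions.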
13 changes: 12 additions & 1 deletion nemo_curator/filters/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,12 @@
# See the License for the specific language governing permissions and
# limitations under the License.

from .classifier_filter import FastTextLangId, FastTextQualityFilter
from .bitext_filter import BitextFilter
from .classifier_filter import (
FastTextLangId,
FastTextQualityFilter,
QualityEstimationFilter,
)
from .code import (
AlphaFilter,
GeneralCommentToCodeFilter,
Expand All @@ -29,6 +34,8 @@
BulletsFilter,
CommonEnglishWordsFilter,
EllipsisFilter,
HistogramFilter,
LengthRatioFilter,
LongWordFilter,
MeanWordLengthFilter,
NonAlphaNumericFilter,
Expand All @@ -51,6 +58,7 @@
from .synthetic import AnswerabilityFilter, EasinessFilter

__all__ = [
"BitextFilter",
"DocumentFilter",
"import_filter",
"FastTextLangId",
Expand Down Expand Up @@ -85,6 +93,9 @@
"AlphaFilter",
"HTMLBoilerplateFilter",
"PerExtensionFilter",
"LengthRatioFilter",
"HistogramFilter",
"QualityEstimationFilter",
"AnswerabilityFilter",
"EasinessFilter",
]