diff --git a/.github/conda/meta.yaml b/.github/conda/meta.yaml index 153ceab890a..55e158e3f5c 100644 --- a/.github/conda/meta.yaml +++ b/.github/conda/meta.yaml @@ -24,7 +24,7 @@ requirements: - dataclasses - multiprocess - fsspec - - huggingface_hub >=0.23.0,<1.0.0 + - huggingface_hub >=0.24.0,<1.0.0 - packaging - aiohttp run: @@ -40,7 +40,7 @@ requirements: - dataclasses - multiprocess - fsspec - - huggingface_hub >=0.23.0,<1.0.0 + - huggingface_hub >=0.24.0,<1.0.0 - packaging - aiohttp diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index f5049514706..7cf288de146 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -62,7 +62,7 @@ jobs: run: uv pip install --system --upgrade pyarrow huggingface-hub "dill<0.3.9" - name: Install dependencies (minimum versions) if: ${{ matrix.deps_versions != 'deps-latest' }} - run: uv pip install --system pyarrow==15.0.0 huggingface-hub==0.23.5 transformers dill==0.3.1.1 + run: uv pip install --system pyarrow==15.0.0 huggingface-hub==0.24.7 transformers dill==0.3.1.1 - name: Test with pytest run: | python -m pytest -rfExX -m ${{ matrix.test }} -n 2 --dist loadfile -sv ./tests/ diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 9ea765b1f88..d20d8a08b54 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -47,6 +47,12 @@ If you want to add a dataset see specific instructions in the section [*How to a 4. Set up a development environment by running the following command in a virtual environment: + Simple setup with code formatting only (recommended) + ```bash + pip install -e ".[quality]" + ``` + + Advanced setup with all the optional dependencies ```bash pip install -e ".[dev]" ``` @@ -91,14 +97,16 @@ Note that if any files were formatted by `pre-commit` hooks during committing, y Go the webpage of your fork on GitHub. Click on "Pull request" to send your to the project maintainers for review. -## How to add a dataset +## Datasets on Hugging Face + +### How to add a dataset on Hugging Face -You can share your dataset on https://huggingface.co/datasets directly using your account, see the documentation: +You can share your dataset on https://huggingface.co/datasets directly using your account (no need to open a PR on GitHub), see the documentation: * [Create a dataset and upload files on the website](https://huggingface.co/docs/datasets/upload_dataset) * [Advanced guide using the CLI](https://huggingface.co/docs/datasets/share) -## How to contribute to the dataset cards +### How to contribute to the dataset cards Improving the documentation of datasets is an ever-increasing effort, and we invite users to contribute by sharing their insights with the community in the `README.md` dataset cards provided for each dataset. diff --git a/README.md b/README.md index 90e3096df4e..269de8ef2b3 100644 --- a/README.md +++ b/README.md @@ -20,7 +20,7 @@ 🤗 Datasets is a lightweight library providing **two** main features: -- **one-line dataloaders for many public datasets**: one-liners to download and pre-process any of the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) major public datasets (image datasets, audio datasets, text datasets in 467 languages and dialects, etc.) provided on the [HuggingFace Datasets Hub](https://huggingface.co/datasets). 
With a simple command like `squad_dataset = load_dataset("squad")`, get any of these datasets ready to use in a dataloader for training/evaluating a ML model (Numpy/Pandas/PyTorch/TensorFlow/JAX), +- **one-line dataloaders for many public datasets**: one-liners to download and pre-process any of the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) major public datasets (image datasets, audio datasets, text datasets in 467 languages and dialects, etc.) provided on the [HuggingFace Datasets Hub](https://huggingface.co/datasets). With a simple command like `squad_dataset = load_dataset("rajpurkar/squad")`, get any of these datasets ready to use in a dataloader for training/evaluating an ML model (Numpy/Pandas/PyTorch/TensorFlow/JAX), - **efficient data pre-processing**: simple, fast and reproducible data pre-processing for the public datasets as well as your own local datasets in CSV, JSON, text, PNG, JPEG, WAV, MP3, Parquet, etc. With simple commands like `processed_dataset = dataset.map(process_example)`, efficiently prepare the dataset for inspection and ML model evaluation and training. [🎓 **Documentation**](https://huggingface.co/docs/datasets/) [🔎 **Find a dataset in the Hub**](https://huggingface.co/datasets) [🌟 **Share a dataset on the Hub**](https://huggingface.co/docs/datasets/share) @@ -36,11 +36,11 @@ - Thrive on large datasets: 🤗 Datasets naturally frees the user from RAM memory limitation, all datasets are memory-mapped using an efficient zero-serialization cost backend (Apache Arrow). - Smart caching: never wait for your data to process several times. - Lightweight and fast with a transparent and pythonic API (multi-processing/caching/memory-mapping). -- Built-in interoperability with NumPy, pandas, PyTorch, TensorFlow 2 and JAX. -- Native support for audio and image data. +- Built-in interoperability with NumPy, PyTorch, TensorFlow 2, JAX, Pandas, Polars and more. +- Native support for audio, image and video data. - Enable streaming mode to save disk space and start iterating over the dataset immediately. -🤗 Datasets originated from a fork of the awesome [TensorFlow Datasets](https://github.com/tensorflow/datasets) and the HuggingFace team want to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between 🤗 Datasets and `tfds` can be found in the section [Main differences between 🤗 Datasets and `tfds`](#main-differences-between--datasets-and-tfds). +🤗 Datasets originated from a fork of the awesome [TensorFlow Datasets](https://github.com/tensorflow/datasets) and the HuggingFace team wants to deeply thank the TensorFlow Datasets team for building this amazing library. # Installation @@ -64,11 +64,12 @@ Follow the installation pages of TensorFlow and PyTorch to see how to install th For more details on installation, check the installation page in the documentation: https://huggingface.co/docs/datasets/installation -## Installation to use with PyTorch/TensorFlow/pandas +## Installation to use with Machine Learning & Data frameworks -If you plan to use 🤗 Datasets with PyTorch (1.0+), TensorFlow (2.2+) or pandas, you should also install PyTorch, TensorFlow or pandas. +If you plan to use 🤗 Datasets with PyTorch (2.0+), TensorFlow (2.6+) or JAX (3.14+), you should also install PyTorch, TensorFlow or JAX. +🤗 Datasets is also well integrated with data frameworks like PyArrow, Pandas, Polars and Spark, which should be installed separately.
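For example, once a dataset is loaded you can hand its contents straight to these libraries. Here is a minimal sketch (assuming pandas, Polars and PyTorch are installed, and a recent 🤗 Datasets release that provides `Dataset.to_polars`):

```python
from datasets import load_dataset

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")

df = ds.to_pandas()                 # pandas DataFrame copy of the dataset
pl_df = ds.to_polars()              # Polars DataFrame (recent 🤗 Datasets versions)
torch_ds = ds.with_format("torch")  # numeric columns come back as PyTorch tensors on access
```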
-For more details on using the library with NumPy, pandas, PyTorch or TensorFlow, check the quick start page in the documentation: https://huggingface.co/docs/datasets/quickstart +For more details on using the library with these frameworks, check the quick start page in the documentation: https://huggingface.co/docs/datasets/quickstart # Usage @@ -86,7 +87,7 @@ from huggingface_hub import list_datasets print([dataset.id for dataset in list_datasets()]) # Load a dataset and print the first example in the training set -squad_dataset = load_dataset('squad') +squad_dataset = load_dataset('rajpurkar/squad') print(squad_dataset['train'][0]) # Process the dataset - add a column with the length of the context texts @@ -103,7 +104,7 @@ If your dataset is bigger than your disk or if you don't want to wait to downloa ```python # If you want to use the dataset immediately and efficiently stream the data as you iterate over the dataset -image_dataset = load_dataset('cifar100', streaming=True) +image_dataset = load_dataset('timm/imagenet-1k-wds', streaming=True) for example in image_dataset["train"]: break ``` @@ -117,7 +118,6 @@ For more details on using the library, check the quick start page in the documen - Processing image data: https://huggingface.co/docs/datasets/image_process - Processing text data: https://huggingface.co/docs/datasets/nlp_process - Streaming a dataset: https://huggingface.co/docs/datasets/stream -- Writing your own dataset loading script: https://huggingface.co/docs/datasets/dataset_script - etc. # Add a new dataset to the Hub @@ -128,17 +128,9 @@ You can find: - [how to upload a dataset to the Hub using your web browser or Python](https://huggingface.co/docs/datasets/upload_dataset) and also - [how to upload it using Git](https://huggingface.co/docs/datasets/share). -# Main differences between 🤗 Datasets and `tfds` - -If you are familiar with the great TensorFlow Datasets, here are the main differences between 🤗 Datasets and `tfds`: - -- the scripts in 🤗 Datasets are not provided within the library but are queried, downloaded/cached and dynamically loaded upon request -- the backend serialization of 🤗 Datasets is based on [Apache Arrow](https://arrow.apache.org/) instead of TF Records and leverage python dataclasses for info and features with some diverging features (we mostly don't do encoding and store the raw data as much as possible in the backend serialization cache). -- the user-facing dataset object of 🤗 Datasets is not a `tf.data.Dataset` but a built-in framework-agnostic dataset class with methods inspired by what we like in `tf.data` (like a `map()` method). It basically wraps a memory-mapped Arrow table cache. - # Disclaimers -🤗 Datasets may run Python code defined by the dataset authors to parse certain data formats or structures. For security reasons, we ask users to: +You can use 🤗 Datasets to load datasets based on Python code defined by the dataset authors to parse certain data formats or structures. For security reasons, this feature is disabled by default and requires passing `trust_remote_code=True`. In this case we also ask users that want to load such datasets to: - check the dataset scripts they're going to run beforehand and - pin the `revision` of the repositories they use. 
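As a rough sketch of what this looks like in practice (the repository name and revision below are hypothetical), both precautions map directly to `load_dataset` arguments:

```python
from datasets import load_dataset

# Hypothetical script-based dataset: review its loading script on the Hub first
ds = load_dataset(
    "username/script-based-dataset",
    revision="a1b2c3d",       # pin the exact commit or tag you audited
    trust_remote_code=True,   # explicitly opt in to running the repo's Python code
)
```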
diff --git a/docs/README.md b/docs/README.md index 8ec1b07e1c2..abcec636429 100644 --- a/docs/README.md +++ b/docs/README.md @@ -237,7 +237,7 @@ The syntax for Example docstrings can look as follows: ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", split="validation") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") >>> def add_prefix(example): ... example["text"] = "Review: " + example["text"] ... return example diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml index 2e3728ef83a..a4bb1111a46 100644 --- a/docs/source/_toctree.yml +++ b/docs/source/_toctree.yml @@ -30,12 +30,20 @@ title: Process - local: stream title: Stream - - local: use_with_tensorflow - title: Use with TensorFlow - local: use_with_pytorch title: Use with PyTorch + - local: use_with_tensorflow + title: Use with TensorFlow + - local: use_with_numpy + title: Use with NumPy - local: use_with_jax title: Use with JAX + - local: use_with_pandas + title: Use with Pandas + - local: use_with_polars + title: Use with Polars + - local: use_with_pyarrow + title: Use with PyArrow - local: use_with_spark title: Use with Spark - local: cache diff --git a/docs/source/about_arrow.md b/docs/source/about_arrow.md index 88b67e7d6f3..8587260629a 100644 --- a/docs/source/about_arrow.md +++ b/docs/source/about_arrow.md @@ -23,7 +23,7 @@ For example, loading the full English Wikipedia dataset only takes a few MB of R # Process.memory_info is expressed in bytes, so convert to megabytes >>> mem_before = psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024) ->>> wiki = load_dataset("wikipedia", "20220301.en", split="train") +>>> wiki = load_dataset("wikimedia/wikipedia", "20220301.en", split="train") >>> mem_after = psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024) >>> print(f"RAM memory used: {(mem_after - mem_before)} MB") diff --git a/docs/source/about_dataset_features.mdx b/docs/source/about_dataset_features.mdx index f9b93fa9cb8..30a221e6d6c 100644 --- a/docs/source/about_dataset_features.mdx +++ b/docs/source/about_dataset_features.mdx @@ -8,7 +8,7 @@ Let's have a look at the features of the MRPC dataset from the GLUE benchmark: ```py >>> from datasets import load_dataset ->>> dataset = load_dataset('glue', 'mrpc', split='train') +>>> dataset = load_dataset('nyu-mll/glue', 'mrpc', split='train') >>> dataset.features {'idx': Value(dtype='int32', id=None), 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None), @@ -36,7 +36,7 @@ If your data type contains a list of objects, then you want to use the [`Sequenc ```py >>> from datasets import load_dataset ->>> dataset = load_dataset('squad', split='train') +>>> dataset = load_dataset('rajpurkar/squad', split='train') >>> dataset.features {'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None), 'context': Value(dtype='string', id=None), @@ -115,7 +115,7 @@ When you load an image dataset and call the image column, the [`Image`] feature ```py >>> from datasets import load_dataset, Image ->>> dataset = load_dataset("beans", split="train") +>>> dataset = load_dataset("AI-Lab-Makerere/beans", split="train") >>> dataset[0]["image"] ``` @@ -129,7 +129,7 @@ Index into an image dataset using the row index first and then the `image` colum With `decode=False`, the [`Image`] type simply gives you the path or the bytes of the image file, without decoding it into an `PIL.Image`, ```py ->>> dataset = 
load_dataset("beans", split="train").cast_column("image", Image(decode=False)) +>>> dataset = load_dataset("AI-Lab-Makerere/beans", split="train").cast_column("image", Image(decode=False)) >>> dataset[0]["image"] {'bytes': None, 'path': '/Users/username/.cache/huggingface/datasets/downloads/extracted/772e7c1fba622cff102b85dd74bcce46e8168634df4eaade7bedd3b8d91d3cd7/train/healthy/healthy_train.265.jpg'} diff --git a/docs/source/about_dataset_load.mdx b/docs/source/about_dataset_load.mdx index 2498ae22a6b..439140803d9 100644 --- a/docs/source/about_dataset_load.mdx +++ b/docs/source/about_dataset_load.mdx @@ -117,4 +117,4 @@ The dataset repositories on the Hub are scanned for malware, see more informatio Moreover the datasets without a namespace (originally contributed on our GitHub repository) have all been reviewed by our maintainers. The code of these datasets is considered **safe**. -It concerns datasets that are not under a namespace, e.g. "squad" or "glue", unlike the other datasets that are named "username/dataset_name" or "org/dataset_name". +It concerns datasets that are not under a namespace, e.g. "rajpurkar/squad" or "nyu-mll/glue", unlike the other datasets that are named "username/dataset_name" or "org/dataset_name". diff --git a/docs/source/about_mapstyle_vs_iterable.mdx b/docs/source/about_mapstyle_vs_iterable.mdx index f794eea5714..f4c893671a6 100644 --- a/docs/source/about_mapstyle_vs_iterable.mdx +++ b/docs/source/about_mapstyle_vs_iterable.mdx @@ -14,7 +14,7 @@ For example you can download ImageNet-1k like this and access any row: ```python from datasets import load_dataset -imagenet = load_dataset("imagenet-1k", split="train") # downloads the full dataset +imagenet = load_dataset("timm/imagenet-1k-wds", split="train") # downloads the full dataset print(imagenet[0]) ``` @@ -28,14 +28,14 @@ For example, you can stream the ImageNet-1k dataset without downloading it on di ```python from datasets import load_dataset -imagenet = load_dataset("imagenet-1k", split="train", streaming=True) # will start loading the data when iterated over +imagenet = load_dataset("timm/imagenet-1k-wds", split="train", streaming=True) # will start loading the data when iterated over for example in imagenet: print(example) break ``` Streaming can read online data without writing any file to disk. -For example, you can stream datasets made out of multiple shards, each of which is hundreds of gigabytes like [C4](https://huggingface.co/datasets/c4), [OSCAR](https://huggingface.co/datasets/oscar) or [LAION-2B](https://huggingface.co/datasets/laion/laion2B-en). +For example, you can stream datasets made out of multiple shards, each of which is hundreds of gigabytes like [C4](https://huggingface.co/datasets/c4) or [LAION-2B](https://huggingface.co/datasets/laion/laion2B-en). Learn more about how to stream a dataset in the [Dataset Streaming Guide](./stream). This is not the only difference though, because the "lazy" behavior of an `IterableDataset` is also present when it comes to dataset creation and processing. 
diff --git a/docs/source/access.mdx b/docs/source/access.mdx index ecdfbbf446b..2f4c891ce1e 100644 --- a/docs/source/access.mdx +++ b/docs/source/access.mdx @@ -13,7 +13,7 @@ This tutorial uses the [rotten_tomatoes](https://huggingface.co/datasets/rotten_ ```py >>> from datasets import load_dataset ->>> dataset = load_dataset("rotten_tomatoes", split="train") +>>> dataset = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train") ``` ### Indexing @@ -99,7 +99,7 @@ An [`IterableDataset`] is loaded when you set the `streaming` parameter to `True ```py >>> from datasets import load_dataset ->>> iterable_dataset = load_dataset("food101", split="train", streaming=True) +>>> iterable_dataset = load_dataset("ethz/food101", split="train", streaming=True) >>> for example in iterable_dataset: ... print(example) ... break @@ -111,7 +111,7 @@ You can also create an [`IterableDataset`] from an *existing* [`Dataset`], but i ```py >>> from datasets import load_dataset ->>> dataset = load_dataset("rotten_tomatoes", split="train") +>>> dataset = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train") >>> iterable_dataset = dataset.to_iterable_dataset() ``` diff --git a/docs/source/cache.mdx b/docs/source/cache.mdx index a68b880b8de..bf344a09bb7 100644 --- a/docs/source/cache.mdx +++ b/docs/source/cache.mdx @@ -36,7 +36,7 @@ After you download a dataset, control how it is loaded by [`load_dataset`] with ```py >>> from datasets import load_dataset ->>> dataset = load_dataset('squad', download_mode='force_redownload') +>>> dataset = load_dataset('rajpurkar/squad', download_mode='force_redownload') ``` Refer to [`DownloadMode`] for a full list of download modes. diff --git a/docs/source/faiss_es.mdx b/docs/source/faiss_es.mdx index ddd274759ea..9ee6565df94 100644 --- a/docs/source/faiss_es.mdx +++ b/docs/source/faiss_es.mdx @@ -76,7 +76,7 @@ Start Elasticsearch on your machine, or see the [Elasticsearch installation guid ```py >>> from datasets import load_dataset ->>> squad = load_dataset('squad', split='validation') +>>> squad = load_dataset('rajpurkar/squad', split='validation') ``` 2. Build the index with [`Dataset.add_elasticsearch_index`]: @@ -98,7 +98,7 @@ Start Elasticsearch on your machine, or see the [Elasticsearch installation guid ```py >>> from datasets import load_dataset ->>> squad = load_dataset('squad', split='validation') +>>> squad = load_dataset('rajpurkar/squad', split='validation') >>> squad.add_elasticsearch_index("context", host="localhost", port="9200", es_index_name="hf_squad_val_context") >>> squad.get_index("context").es_index_name hf_squad_val_context @@ -108,7 +108,7 @@ hf_squad_val_context ```py >>> from datasets import load_dataset ->>> squad = load_dataset('squad', split='validation') +>>> squad = load_dataset('rajpurkar/squad', split='validation') >>> squad.load_elasticsearch_index("context", host="localhost", port="9200", es_index_name="hf_squad_val_context") >>> query = "machine" >>> scores, retrieved_examples = squad.get_nearest_examples("context", query, k=10) diff --git a/docs/source/filesystems.mdx b/docs/source/filesystems.mdx index af262492261..ac3ade4c4f0 100644 --- a/docs/source/filesystems.mdx +++ b/docs/source/filesystems.mdx @@ -1,7 +1,18 @@ # Cloud storage -🤗 Datasets supports access to cloud storage providers through a `fsspec` FileSystem implementations. -You can save and load datasets from any cloud storage in a Pythonic way. 
+## Hugging Face Datasets + +The Hugging Face Dataset Hub is home to a growing collection of datasets that span a variety of domains and tasks. + +It's more than a cloud storage: the Dataset Hub is a platform that provides data versioning thanks to git, as well as a Dataset Viewer to explore the data, making it a great place to store AI-ready datasets. + +This guide shows how to import data from other cloud storage using the filesystems implementations from `fsspec`. + +## Import data from a cloud storage + +Most cloud storage providers have a `fsspec` FileSystem implementation, which is useful to import data from any cloud provider with the same code. +This is especially useful to publish datasets on Hugging Face. + Take a look at the following table for some example of supported cloud storage providers: | Storage provider | Filesystem implementation | @@ -9,214 +20,46 @@ Take a look at the following table for some example of supported cloud storage p | Amazon S3 | [s3fs](https://s3fs.readthedocs.io/en/latest/) | | Google Cloud Storage | [gcsfs](https://gcsfs.readthedocs.io/en/latest/) | | Azure Blob/DataLake | [adlfs](https://github.com/fsspec/adlfs) | -| Dropbox | [dropboxdrivefs](https://github.com/MarineChap/dropboxdrivefs)| -| Google Drive | [gdrivefs](https://github.com/intake/gdrivefs) | | Oracle Cloud Storage | [ocifs](https://ocifs.readthedocs.io/en/latest/) | -This guide will show you how to save and load datasets with any cloud storage. -Here are examples for S3, Google Cloud Storage, Azure Blob Storage, and Oracle Cloud Object Storage. - -## Set up your cloud storage FileSystem - -### Amazon S3 - -1. Install the S3 FileSystem implementation: - -``` ->>> pip install s3fs -``` - -2. Define your credentials - -To use an anonymous connection, use `anon=True`. -Otherwise, include your `aws_access_key_id` and `aws_secret_access_key` whenever you are interacting with a private S3 bucket. - -```py ->>> storage_options = {"anon": True} # for anonymous connection -# or use your credentials ->>> storage_options = {"key": aws_access_key_id, "secret": aws_secret_access_key} # for private buckets -# or use a botocore session ->>> import aiobotocore.session ->>> s3_session = aiobotocore.session.AioSession(profile="my_profile_name") ->>> storage_options = {"session": s3_session} -``` - -3. Create your FileSystem instance - -```py ->>> import s3fs ->>> fs = s3fs.S3FileSystem(**storage_options) -``` - -### Google Cloud Storage - -1. Install the Google Cloud Storage implementation: - -``` ->>> conda install -c conda-forge gcsfs -# or install with pip ->>> pip install gcsfs -``` - -2. Define your credentials - -```py ->>> storage_options={"token": "anon"} # for anonymous connection -# or use your credentials of your default gcloud credentials or from the google metadata service ->>> storage_options={"project": "my-google-project"} -# or use your credentials from elsewhere, see the documentation at https://gcsfs.readthedocs.io/ ->>> storage_options={"project": "my-google-project", "token": TOKEN} -``` - -3. Create your FileSystem instance - -```py ->>> import gcsfs ->>> fs = gcsfs.GCSFileSystem(**storage_options) -``` - -### Azure Blob Storage - -1. Install the Azure Blob Storage implementation: - -``` ->>> conda install -c conda-forge adlfs -# or install with pip ->>> pip install adlfs -``` - -2. 
Define your credentials - -```py ->>> storage_options = {"anon": True} # for anonymous connection -# or use your credentials ->>> storage_options = {"account_name": ACCOUNT_NAME, "account_key": ACCOUNT_KEY} # gen 2 filesystem -# or use your credentials with the gen 1 filesystem ->>> storage_options={"tenant_id": TENANT_ID, "client_id": CLIENT_ID, "client_secret": CLIENT_SECRET} -``` - -3. Create your FileSystem instance +This guide will show you how to import data files from any cloud storage and save a dataset on Hugging Face. -```py ->>> import adlfs ->>> fs = adlfs.AzureBlobFileSystem(**storage_options) -``` - -### Oracle Cloud Object Storage - -1. Install the OCI FileSystem implementation: - -``` ->>> pip install ocifs -``` - -2. Define your credentials - -```py ->>> storage_options = {"config": "~/.oci/config", "region": "us-ashburn-1"} -``` - -3. Create your FileSystem instance +Let's say we want to publish a dataset on Hugging Face from Parquet files from a cloud storage. -```py ->>> import ocifs ->>> fs = ocifs.OCIFileSystem(**storage_options) -``` - -## Load and Save your datasets using your cloud storage FileSystem - -### Download and prepare a dataset into a cloud storage - -You can download and prepare a dataset into your cloud storage by specifying a remote `output_dir` in `download_and_prepare`. -Don't forget to use the previously defined `storage_options` containing your credentials to write into a private cloud storage. - -The `download_and_prepare` method works in two steps: -1. it first downloads the raw data files (if any) in your local cache. You can set your cache directory by passing `cache_dir` to [`load_dataset_builder`] -2. then it generates the dataset in Arrow or Parquet format in your cloud storage by iterating over the raw data files. +First, instantiate your cloud storage filesystem and list the files you'd like to import: -Load a dataset builder from the Hugging Face Hub (see [how to load from the Hugging Face Hub](./loading#hugging-face-hub)): - -```py ->>> output_dir = "s3://my-bucket/imdb" ->>> builder = load_dataset_builder("imdb") ->>> builder.download_and_prepare(output_dir, storage_options=storage_options, file_format="parquet") +```python +>>> import fsspec +>>> fs = fsspec.filesystem("...") # s3 / gcs / abfs / adl / oci / ... +>>> data_dir = "path/to/my/data/" +>>> pattern = "*.parquet" +>>> data_files = fs.glob(data_dir + pattern) +["path/to/my/data/0001.parquet", "path/to/my/data/0001.parquet", ...] ``` -Use your own data files (see [how to load local and remote files](./loading#local-and-remote-files)): - -```py ->>> data_files = {"train": ["path/to/train.csv"]} ->>> output_dir = "s3://my-bucket/imdb" ->>> builder = load_dataset_builder("csv", data_files=data_files) ->>> builder.download_and_prepare(output_dir, storage_options=storage_options, file_format="parquet") -``` - -It is highly recommended to save the files as compressed Parquet files to optimize I/O by specifying `file_format="parquet"`. -Otherwise the dataset is saved as an uncompressed Arrow file. - -You can also specify the size of the shards using `max_shard_size` (default is 500MB): - -```py ->>> builder.download_and_prepare(output_dir, storage_options=storage_options, file_format="parquet", max_shard_size="1GB") -``` - -#### Dask - -Dask is a parallel computing library and it has a pandas-like API for working with larger than memory Parquet datasets in parallel. -Dask can use multiple threads or processes on a single machine, or a cluster of machines to process data in parallel. 
-Dask supports local data but also data from a cloud storage. - -Therefore you can load a dataset saved as sharded Parquet files in Dask with - -```py -import dask.dataframe as dd - -df = dd.read_parquet(output_dir, storage_options=storage_options) - -# or if your dataset is split into train/valid/test -df_train = dd.read_parquet(output_dir + f"/{builder.name}-train-*.parquet", storage_options=storage_options) -df_valid = dd.read_parquet(output_dir + f"/{builder.name}-validation-*.parquet", storage_options=storage_options) -df_test = dd.read_parquet(output_dir + f"/{builder.name}-test-*.parquet", storage_options=storage_options) -``` - -You can find more about dask dataframes in their [documentation](https://docs.dask.org/en/stable/dataframe.html). - -## Saving serialized datasets - -After you have processed your dataset, you can save it to your cloud storage with [`Dataset.save_to_disk`]: - -```py -# saves encoded_dataset to amazon s3 ->>> encoded_dataset.save_to_disk("s3://my-private-datasets/imdb/train", storage_options=storage_options) -# saves encoded_dataset to google cloud storage ->>> encoded_dataset.save_to_disk("gcs://my-private-datasets/imdb/train", storage_options=storage_options) -# saves encoded_dataset to microsoft azure blob/datalake ->>> encoded_dataset.save_to_disk("adl://my-private-datasets/imdb/train", storage_options=storage_options) -``` - - - -Remember to define your credentials in your [FileSystem instance](#set-up-your-cloud-storage-filesystem) `fs` whenever you are interacting with a private cloud storage. - - - -## Listing serialized datasets - -List files from a cloud storage with your FileSystem instance `fs`, using `fs.ls`: +Then you can create a dataset on Hugging Face and import the data files, using for example: -```py ->>> fs.ls("my-private-datasets/imdb/train", detail=False) -["dataset_info.json.json","dataset.arrow","state.json"] +```python +>>> from huggingface_hub import create_repo, upload_file +>>> from tqdm.auto import tqdm +>>> destination_dataset = "username/my-dataset" +>>> create_repo(destination_dataset, repo_type="dataset") +>>> for data_file in tqdm(fs.glob(data_dir + pattern)): +... with fs.open(data_file) as fileobj: +... path_in_repo = data_file[len(data_dir):] +... upload_file( +... path_or_fileobj=fileobj, +... path_in_repo=path_in_repo, +... repo_id=destination_dataset, +... repo_type="dataset", +... ) ``` -### Load serialized datasets +Check out the [huggingface_hub](https://huggingface.co/docs/huggingface_hub) documentation on files uploads [here](https://huggingface.co/docs/huggingface_hub/en/guides/upload) if you're looking for more upload options. 
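If you are importing many files, one possible variant (a sketch reusing the `fs`, `data_dir` and `pattern` objects from the snippet above) is to group all the uploads into a single commit with `HfApi.create_commit`:

```python
from huggingface_hub import HfApi, CommitOperationAdd

api = HfApi()
operations = [
    CommitOperationAdd(
        path_in_repo=data_file[len(data_dir):],
        path_or_fileobj=fs.open(data_file),  # file-like object read from the cloud storage
    )
    for data_file in fs.glob(data_dir + pattern)
]
api.create_commit(
    repo_id="username/my-dataset",
    repo_type="dataset",
    operations=operations,
    commit_message="Import Parquet files from cloud storage",
)
```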
-When you are ready to use your dataset again, reload it with [`Dataset.load_from_disk`]: +Finally you can now load the dataset using 🤗 Datasets: -```py ->>> from datasets import load_from_disk -# load encoded_dataset from cloud storage ->>> dataset = load_from_disk("s3://a-public-datasets/imdb/train", storage_options=storage_options) ->>> print(len(dataset)) -25000 +```python +>>> from datasets import load_dataset +>>> ds = load_dataset("username/my-dataset") ``` diff --git a/docs/source/image_classification.mdx b/docs/source/image_classification.mdx index e1b3e059266..6861bfec866 100644 --- a/docs/source/image_classification.mdx +++ b/docs/source/image_classification.mdx @@ -15,7 +15,7 @@ Load the dataset and take a look at an example: ```py >>> from datasets import load_dataset ->>> dataset = load_dataset("beans") +>>> dataset = load_dataset("AI-Lab-Makerere/beans") >>> dataset["train"][10] {'image': , 'image_file_path': '/root/.cache/huggingface/datasets/downloads/extracted/b0a21163f78769a2cf11f58dfc767fb458fc7cea5c05dccc0144a2c0f0bc1292/train/angular_leaf_spot/angular_leaf_spot_train.204.jpg', diff --git a/docs/source/image_dataset.mdx b/docs/source/image_dataset.mdx index 43559cbd2d7..256e00e4449 100644 --- a/docs/source/image_dataset.mdx +++ b/docs/source/image_dataset.mdx @@ -303,7 +303,7 @@ Now if users want to load the `breakfast` configuration, they can use the config ```py >>> from datasets import load_dataset ->>> ds = load_dataset("food101", "breakfast", split="train") +>>> ds = load_dataset("ethz/food101", "breakfast", split="train") ``` ### Add dataset metadata @@ -312,7 +312,7 @@ Adding information about your dataset is useful for users to learn more about it ```py >>> from datasets import load_dataset_builder ->>> ds_builder = load_dataset_builder("food101") +>>> ds_builder = load_dataset_builder("ethz/food101") >>> ds_builder.info ``` diff --git a/docs/source/image_load.mdx b/docs/source/image_load.mdx index 131b5f77412..a3bdc34a585 100644 --- a/docs/source/image_load.mdx +++ b/docs/source/image_load.mdx @@ -13,7 +13,7 @@ When you load an image dataset and call the image column, the images are decoded ```py >>> from datasets import load_dataset, Image ->>> dataset = load_dataset("beans", split="train") +>>> dataset = load_dataset("AI-Lab-Makerere/beans", split="train") >>> dataset[0]["image"] ``` @@ -39,7 +39,7 @@ You can load a dataset from the image path. Use the [`~Dataset.cast_column`] fun If you only want to load the underlying path to the image dataset without decoding the image object, set `decode=False` in the [`Image`] feature: ```py ->>> dataset = load_dataset("beans", split="train").cast_column("image", Image(decode=False)) +>>> dataset = load_dataset("AI-Lab-Makerere/beans", split="train").cast_column("image", Image(decode=False)) >>> dataset[0]["image"] {'bytes': None, 'path': '/root/.cache/huggingface/datasets/downloads/extracted/b0a21163f78769a2cf11f58dfc767fb458fc7cea5c05dccc0144a2c0f0bc1292/train/bean_rust/bean_rust_train.29.jpg'} diff --git a/docs/source/installation.md b/docs/source/installation.md index 06f7b1c32e3..a6027b2ee5d 100644 --- a/docs/source/installation.md +++ b/docs/source/installation.md @@ -1,6 +1,6 @@ # Installation -Before you start, you'll need to setup your environment and install the appropriate packages. 🤗 Datasets is tested on **Python 3.7+**. +Before you start, you'll need to setup your environment and install the appropriate packages. 🤗 Datasets is tested on **Python 3.9+**. 
@@ -48,7 +48,7 @@ pip install datasets Run the following command to check if 🤗 Datasets has been properly installed: ```bash -python -c "from datasets import load_dataset; print(load_dataset('squad', split='train')[0])" +python -c "from datasets import load_dataset; print(load_dataset('rajpurkar/squad', split='train')[0])" ``` This command downloads version 1 of the [Stanford Question Answering Dataset (SQuAD)](https://rajpurkar.github.io/SQuAD-explorer/), loads the training split, and prints the first training example. You should see: @@ -98,7 +98,7 @@ pip install -e . Again, you can check if 🤗 Datasets was properly installed with the following command: ```bash -python -c "from datasets import load_dataset; print(load_dataset('squad', split='train')[0])" +python -c "from datasets import load_dataset; print(load_dataset('rajpurkar/squad', split='train')[0])" ``` ## conda diff --git a/docs/source/load_hub.mdx b/docs/source/load_hub.mdx index d2c71754bc6..d4baafba008 100644 --- a/docs/source/load_hub.mdx +++ b/docs/source/load_hub.mdx @@ -12,7 +12,7 @@ Use the [`load_dataset_builder`] function to load a dataset builder and inspect ```py >>> from datasets import load_dataset_builder ->>> ds_builder = load_dataset_builder("rotten_tomatoes") +>>> ds_builder = load_dataset_builder("cornell-movie-review-data/rotten_tomatoes") # Inspect dataset description >>> ds_builder.info.description @@ -29,7 +29,7 @@ If you're happy with the dataset, then load it with [`load_dataset`]: ```py >>> from datasets import load_dataset ->>> dataset = load_dataset("rotten_tomatoes", split="train") +>>> dataset = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train") ``` ## Splits @@ -39,7 +39,7 @@ A split is a specific subset of a dataset like `train` and `test`. List a datase ```py >>> from datasets import get_dataset_split_names ->>> get_dataset_split_names("rotten_tomatoes") +>>> get_dataset_split_names("cornell-movie-review-data/rotten_tomatoes") ['train', 'validation', 'test'] ``` @@ -48,7 +48,7 @@ Then you can load a specific split with the `split` parameter. 
Loading a dataset ```py >>> from datasets import load_dataset ->>> dataset = load_dataset("rotten_tomatoes", split="train") +>>> dataset = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train") >>> dataset Dataset({ features: ['text', 'label'], @@ -61,7 +61,7 @@ If you don't specify a `split`, 🤗 Datasets returns a [`DatasetDict`] object i ```py >>> from datasets import load_dataset ->>> dataset = load_dataset("rotten_tomatoes") +>>> dataset = load_dataset("cornell-movie-review-data/rotten_tomatoes") DatasetDict({ train: Dataset({ features: ['text', 'label'], diff --git a/docs/source/loading.mdx b/docs/source/loading.mdx index e30037e3cc7..6766e885e3b 100644 --- a/docs/source/loading.mdx +++ b/docs/source/loading.mdx @@ -143,8 +143,8 @@ To load a Parquet file: To load remote Parquet files via HTTP, pass the URLs instead: ```py ->>> base_url = "https://storage.googleapis.com/huggingface-nlp/cache/datasets/wikipedia/20200501.en/1.0.0/" ->>> data_files = {"train": base_url + "wikipedia-train.parquet"} +>>> base_url = "https://huggingface.co/datasets/wikimedia/wikipedia/resolve/main/20231101.ab/" +>>> data_files = {"train": base_url + "train-00000-of-00001.parquet"} >>> wiki = load_dataset("parquet", data_files=data_files, split="train") ``` @@ -162,8 +162,8 @@ To load an Arrow file: To load remote Arrow files via HTTP, pass the URLs instead: ```py ->>> base_url = "https://storage.googleapis.com/huggingface-nlp/cache/datasets/wikipedia/20200501.en/1.0.0/" ->>> data_files = {"train": base_url + "wikipedia-train.arrow"} +>>> base_url = "https://huggingface.co/datasets/croissantllm/croissant_dataset/resolve/main/english_660B_11/" +>>> data_files = {"train": base_url + "train/data-00000-of-00080.arrow"} >>> wiki = load_dataset("arrow", data_files=data_files, split="train") ``` @@ -232,7 +232,7 @@ In this case, each process is given a subset of shards to prepare: ```python from datasets import load_dataset -imagenet = load_dataset("imagenet-1k", num_proc=8) +imagenet = load_dataset("timm/imagenet-1k-wds", num_proc=8) ml_librispeech_spanish = load_dataset("facebook/multilingual_librispeech", "spanish", num_proc=8) ``` @@ -321,16 +321,16 @@ You can also choose only to load specific slices of a split. 
There are two optio Concatenate a `train` and `test` split by: ```py ->>> train_test_ds = datasets.load_dataset("bookcorpus", split="train+test") +>>> train_test_ds = datasets.load_dataset("ajibawa-2023/General-Stories-Collection", split="train+test") ===STRINGAPI-READINSTRUCTION-SPLIT=== >>> ri = datasets.ReadInstruction("train") + datasets.ReadInstruction("test") ->>> train_test_ds = datasets.load_dataset("bookcorpus", split=ri) +>>> train_test_ds = datasets.load_dataset("ajibawa-2023/General-Stories-Collection", split=ri) ``` Select specific rows of the `train` split: ```py ->>> train_10_20_ds = datasets.load_dataset("bookcorpus", split="train[10:20]") +>>> train_10_20_ds = datasets.load_dataset("ajibawa-2023/General-Stories-Collection", split="train[10:20]") ===STRINGAPI-READINSTRUCTION-SPLIT=== >>> train_10_20_ds = datasets.load_dataset("bookcorpu", split=datasets.ReadInstruction("train", from_=10, to=20, unit="abs")) ``` @@ -338,28 +338,28 @@ Select specific rows of the `train` split: Or select a percentage of a split with: ```py ->>> train_10pct_ds = datasets.load_dataset("bookcorpus", split="train[:10%]") +>>> train_10pct_ds = datasets.load_dataset("ajibawa-2023/General-Stories-Collection", split="train[:10%]") ===STRINGAPI-READINSTRUCTION-SPLIT=== ->>> train_10_20_ds = datasets.load_dataset("bookcorpus", split=datasets.ReadInstruction("train", to=10, unit="%")) +>>> train_10_20_ds = datasets.load_dataset("ajibawa-2023/General-Stories-Collection", split=datasets.ReadInstruction("train", to=10, unit="%")) ``` Select a combination of percentages from each split: ```py ->>> train_10_80pct_ds = datasets.load_dataset("bookcorpus", split="train[:10%]+train[-80%:]") +>>> train_10_80pct_ds = datasets.load_dataset("ajibawa-2023/General-Stories-Collection", split="train[:10%]+train[-80%:]") ===STRINGAPI-READINSTRUCTION-SPLIT=== >>> ri = (datasets.ReadInstruction("train", to=10, unit="%") + datasets.ReadInstruction("train", from_=-80, unit="%")) ->>> train_10_80pct_ds = datasets.load_dataset("bookcorpus", split=ri) +>>> train_10_80pct_ds = datasets.load_dataset("ajibawa-2023/General-Stories-Collection", split=ri) ``` Finally, you can even create cross-validated splits. The example below creates 10-fold cross-validated splits. 
Each validation dataset is a 10% chunk, and the training dataset makes up the remaining complementary 90% chunk: ```py ->>> val_ds = datasets.load_dataset("bookcorpus", split=[f"train[{k}%:{k+10}%]" for k in range(0, 100, 10)]) ->>> train_ds = datasets.load_dataset("bookcorpus", split=[f"train[:{k}%]+train[{k+10}%:]" for k in range(0, 100, 10)]) +>>> val_ds = datasets.load_dataset("ajibawa-2023/General-Stories-Collection", split=[f"train[{k}%:{k+10}%]" for k in range(0, 100, 10)]) +>>> train_ds = datasets.load_dataset("ajibawa-2023/General-Stories-Collection", split=[f"train[:{k}%]+train[{k+10}%:]" for k in range(0, 100, 10)]) ===STRINGAPI-READINSTRUCTION-SPLIT=== ->>> val_ds = datasets.load_dataset("bookcorpus", [datasets.ReadInstruction("train", from_=k, to=k+10, unit="%") for k in range(0, 100, 10)]) ->>> train_ds = datasets.load_dataset("bookcorpus", [(datasets.ReadInstruction("train", to=k, unit="%") + datasets.ReadInstruction("train", from_=k+10, unit="%")) for k in range(0, 100, 10)]) +>>> val_ds = datasets.load_dataset("ajibawa-2023/General-Stories-Collection", [datasets.ReadInstruction("train", from_=k, to=k+10, unit="%") for k in range(0, 100, 10)]) +>>> train_ds = datasets.load_dataset("ajibawa-2023/General-Stories-Collection", [(datasets.ReadInstruction("train", to=k, unit="%") + datasets.ReadInstruction("train", from_=k+10, unit="%")) for k in range(0, 100, 10)]) ``` ### Percent slicing and rounding @@ -368,21 +368,21 @@ The default behavior is to round the boundaries to the nearest integer for datas ```py # 19 records, from 500 (included) to 519 (excluded). ->>> train_50_52_ds = datasets.load_dataset("bookcorpus", split="train[50%:52%]") +>>> train_50_52_ds = datasets.load_dataset("ajibawa-2023/General-Stories-Collection", split="train[50%:52%]") # 20 records, from 519 (included) to 539 (excluded). ->>> train_52_54_ds = datasets.load_dataset("bookcorpus", split="train[52%:54%]") +>>> train_52_54_ds = datasets.load_dataset("ajibawa-2023/General-Stories-Collection", split="train[52%:54%]") ``` If you want equal sized splits, use `pct1_dropremainder` rounding instead. This treats the specified percentage boundaries as multiples of 1%. ```py # 18 records, from 450 (included) to 468 (excluded). ->>> train_50_52pct1_ds = datasets.load_dataset("bookcorpus", split=datasets.ReadInstruction("train", from_=50, to=52, unit="%", rounding="pct1_dropremainder")) +>>> train_50_52pct1_ds = datasets.load_dataset("ajibawa-2023/General-Stories-Collection", split=datasets.ReadInstruction("train", from_=50, to=52, unit="%", rounding="pct1_dropremainder")) # 18 records, from 468 (included) to 486 (excluded). 
->>> train_52_54pct1_ds = datasets.load_dataset("bookcorpus", split=datasets.ReadInstruction("train",from_=52, to=54, unit="%", rounding="pct1_dropremainder")) +>>> train_52_54pct1_ds = datasets.load_dataset("ajibawa-2023/General-Stories-Collection", split=datasets.ReadInstruction("train",from_=52, to=54, unit="%", rounding="pct1_dropremainder")) # Or equivalently: ->>> train_50_52pct1_ds = datasets.load_dataset("bookcorpus", split="train[50%:52%](pct1_dropremainder)") ->>> train_52_54pct1_ds = datasets.load_dataset("bookcorpus", split="train[52%:54%](pct1_dropremainder)") +>>> train_50_52pct1_ds = datasets.load_dataset("ajibawa-2023/General-Stories-Collection", split="train[50%:52%](pct1_dropremainder)") +>>> train_52_54pct1_ds = datasets.load_dataset("ajibawa-2023/General-Stories-Collection", split="train[52%:54%](pct1_dropremainder)") ``` @@ -397,22 +397,6 @@ If you want equal sized splits, use `pct1_dropremainder` rounding instead. This Sometimes, you may get unexpected results when you load a dataset. Two of the most common issues you may encounter are manually downloading a dataset and specifying features of a dataset. -### Manual download - -Certain datasets require you to manually download the dataset files due to licensing incompatibility or if the files are hidden behind a login page. This causes [`load_dataset`] to throw an `AssertionError`. But 🤗 Datasets provides detailed instructions for downloading the missing files. After you've downloaded the files, use the `data_dir` argument to specify the path to the files you just downloaded. - -For example, if you try to download a configuration from the [MATINF](https://huggingface.co/datasets/matinf) dataset: - -```py ->>> dataset = load_dataset("matinf", "summarization") -Downloading and preparing dataset matinf/summarization (download: Unknown size, generated: 246.89 MiB, post-processed: Unknown size, total: 246.89 MiB) to /root/.cache/huggingface/datasets/matinf/summarization/1.0.0/82eee5e71c3ceaf20d909bca36ff237452b4e4ab195d3be7ee1c78b53e6f540e... -AssertionError: The dataset matinf with config summarization requires manual data. -Please follow the manual download instructions: To use MATINF you have to download it manually. Please fill this google form (https://forms.gle/nkH4LVE4iNQeDzsc9). You will receive a download link and a password once you complete the form. Please extract all files in one folder and load the dataset with: *datasets.load_dataset('matinf', data_dir='path/to/folder/folder_name')*. -Manual data can be loaded with `datasets.load_dataset(matinf, data_dir='') -``` - -If you've already downloaded a dataset from the *Hub with a loading script* to your computer, then you need to pass an absolute path to the `data_dir` or `data_files` parameter to load that dataset. Otherwise, if you pass a relative path, [`load_dataset`] will load the directory from the repository on the Hub instead of the local directory. - ### Specify features When you create a dataset from local files, the [`Features`] are automatically inferred by [Apache Arrow](https://arrow.apache.org/docs/). However, the dataset's features may not always align with your expectations, or you may want to define the features yourself. The following example shows how you can add custom labels with the [`ClassLabel`] feature. 
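As a rough sketch of that pattern (the CSV file and label names here are hypothetical), you define the [`Features`] yourself and pass them to [`load_dataset`]:

```python
from datasets import ClassLabel, Features, Value, load_dataset

# Hypothetical CSV file with "text" and "label" columns
features = Features(
    {
        "text": Value("string"),
        "label": ClassLabel(names=["negative", "positive"]),
    }
)
dataset = load_dataset("csv", data_files="my_file.csv", features=features)
```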
diff --git a/docs/source/nlp_process.mdx b/docs/source/nlp_process.mdx index bfcc0bd16ba..f2ef233c679 100644 --- a/docs/source/nlp_process.mdx +++ b/docs/source/nlp_process.mdx @@ -56,7 +56,7 @@ Pass the dictionary of the label mappings to the [`~Dataset.align_labels_with_ma ```py >>> from datasets import load_dataset ->>> mnli = load_dataset("glue", "mnli", split="train") +>>> mnli = load_dataset("nyu-mll/glue", "mnli", split="train") >>> mnli_aligned = mnli.align_labels_with_mapping(label2id, "label") ``` diff --git a/docs/source/package_reference/main_classes.mdx b/docs/source/package_reference/main_classes.mdx index 185bde10d72..62dc9127d4b 100644 --- a/docs/source/package_reference/main_classes.mdx +++ b/docs/source/package_reference/main_classes.mdx @@ -52,6 +52,7 @@ The base class [`Dataset`] implements a Dataset backed by an Apache Arrow table. - take - train_test_split - shard + - repeat - to_tf_dataset - push_to_hub - save_to_disk @@ -172,6 +173,7 @@ The base class [`IterableDataset`] implements an iterable Dataset backed by pyth - skip - take - shard + - repeat - load_state_dict - state_dict - info diff --git a/docs/source/process.mdx b/docs/source/process.mdx index 198b7509456..2a2ea3c0cba 100644 --- a/docs/source/process.mdx +++ b/docs/source/process.mdx @@ -17,7 +17,7 @@ The examples in this guide use the MRPC dataset, but feel free to load any datas ```py >>> from datasets import load_dataset ->>> dataset = load_dataset("glue", "mrpc", split="train") +>>> dataset = load_dataset("nyu-mll/glue", "mrpc", split="train") ``` @@ -127,11 +127,11 @@ The splits are shuffled by default, but you can set `shuffle=False` to prevent s 🤗 Datasets supports sharding to divide a very large dataset into a predefined number of chunks. Specify the `num_shards` parameter in [`~Dataset.shard`] to determine the number of shards to split the dataset into. You'll also need to provide the shard you want to return with the `index` parameter. -For example, the [imdb](https://huggingface.co/datasets/imdb) dataset has 25000 examples: +For example, the [stanfordnlp/imdb](https://huggingface.co/datasets/stanfordnlp/imdb) dataset has 25000 examples: ```py >>> from datasets import load_dataset ->>> dataset = load_dataset("imdb", split="train") +>>> dataset = load_dataset("stanfordnlp/imdb", split="train") >>> print(dataset) Dataset({ features: ['text', 'label'], @@ -263,7 +263,7 @@ Sometimes a column can be a nested structure of several types. Take a look at th ```py >>> from datasets import load_dataset ->>> dataset = load_dataset("squad", split="train") +>>> dataset = load_dataset("rajpurkar/squad", split="train") >>> dataset.features {'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None), 'context': Value(dtype='string', id=None), @@ -502,6 +502,52 @@ Use [`~Dataset.map`] to apply the function over the whole dataset: For each original sentence, RoBERTA augmented a random word with three alternatives. The original word `distorting` is supplemented by `withholding`, `suppressing`, and `destroying`. +### Asynchronous processing + +Asynchronous functions are useful to call API endpoints in parallel, for example to download content like images or call a model endpoint. 
+ +You can define an asynchronous function using the `async` and `await` keywords. Here is an example function that calls a chat model from Hugging Face: + +```python +>>> import aiohttp +>>> import asyncio +>>> from huggingface_hub import get_token +>>> sem = asyncio.Semaphore(20) # max number of simultaneous queries +>>> async def query_model(model, prompt): +... api_url = f"https://api-inference.huggingface.co/models/{model}/v1/chat/completions" +... headers = {"Authorization": f"Bearer {get_token()}", "Content-Type": "application/json"} +... json = {"messages": [{"role": "user", "content": prompt}], "max_tokens": 20, "seed": 42} +... async with sem, aiohttp.ClientSession() as session, session.post(api_url, headers=headers, json=json) as response: +... output = await response.json() +... return {"Output": output["choices"][0]["message"]["content"]} +``` + +Asynchronous functions run in parallel, which speeds up the process considerably. The same code takes much longer when run sequentially, because it sits idle while waiting for each model response. It is generally recommended to use `async` / `await` when your function has to wait for a response from an API, for example, or when it downloads data that can take some time. + +Note the presence of a `Semaphore`: it sets the maximum number of queries that can run in parallel. It is recommended to use a `Semaphore` when calling APIs to avoid rate limit errors. + +Let's use it to call the [microsoft/Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) model and ask it to return the main topic of each math problem in the [Maxwell-Jia/AIME_2024](https://huggingface.co/datasets/Maxwell-Jia/AIME_2024) dataset: + +```python +>>> from datasets import load_dataset +>>> ds = load_dataset("Maxwell-Jia/AIME_2024", split="train") +>>> model = "microsoft/Phi-3-mini-4k-instruct" +>>> prompt = 'What is this text mainly about? Here is the text:\n\n```\n{Problem}\n```\n\nReply using one or two words max, e.g. "The main topic is Linear Algebra".' +>>> async def get_topic(example): +... return await query_model(model, prompt.format(Problem=example['Problem'])) +>>> ds = ds.map(get_topic) +>>> ds[0] +{'ID': '2024-II-4', + 'Problem': 'Let $x,y$ and $z$ be positive real numbers that...', + 'Solution': 'Denote $\\log_2(x) = a$, $\\log_2(y) = b$, and...', + 'Answer': 33, + 'Output': 'The main topic is Logarithms.'} +``` + +Here, [`Dataset.map`] runs many `get_topic` calls asynchronously so it doesn't have to wait for every single model response, which would take a lot of time to do sequentially. + +By default, [`Dataset.map`] runs up to one thousand map functions in parallel, so don't forget to set the maximum number of API calls that can run in parallel with a `Semaphore`, otherwise the model could return rate limit errors or become overloaded. For advanced use cases, you can change the maximum number of queries in parallel in `datasets.config`. + +### Process multiple splits + +Many datasets have splits that can be processed simultaneously with [`DatasetDict.map`]. 
For example, tokenize the `sentence1` field in the train and test split by: @@ -510,7 +556,7 @@ Many datasets have splits that can be processed simultaneously with [`DatasetDic >>> from datasets import load_dataset # load all the splits ->>> dataset = load_dataset('glue', 'mrpc') +>>> dataset = load_dataset('nyu-mll/glue', 'mrpc') >>> encoded_dataset = dataset.map(lambda examples: tokenizer(examples["sentence1"]), batched=True) >>> encoded_dataset["train"][0] {'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', @@ -554,7 +600,7 @@ Here's an example of how to use the `batch()` method: ```python >>> from datasets import load_dataset ->>> dataset = load_dataset("rotten_tomatoes", split="train") +>>> dataset = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train") >>> batched_dataset = dataset.batch(batch_size=4) >>> batched_dataset[0] {'text': ['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', @@ -579,20 +625,21 @@ Separate datasets can be concatenated if they share the same column types. Conca ```py >>> from datasets import concatenate_datasets, load_dataset ->>> bookcorpus = load_dataset("bookcorpus", split="train") ->>> wiki = load_dataset("wikipedia", "20220301.en", split="train") +>>> stories = load_dataset("ajibawa-2023/General-Stories-Collection", split="train") +>>> stories = stories.remove_columns([col for col in stories.column_names if col != "text"]) # only keep the 'text' column +>>> wiki = load_dataset("wikimedia/wikipedia", "20220301.en", split="train") >>> wiki = wiki.remove_columns([col for col in wiki.column_names if col != "text"]) # only keep the 'text' column ->>> assert bookcorpus.features.type == wiki.features.type ->>> bert_dataset = concatenate_datasets([bookcorpus, wiki]) +>>> assert stories.features.type == wiki.features.type +>>> bert_dataset = concatenate_datasets([stories, wiki]) ``` You can also concatenate two datasets horizontally by setting `axis=1` as long as the datasets have the same number of rows: ```py >>> from datasets import Dataset ->>> bookcorpus_ids = Dataset.from_dict({"ids": list(range(len(bookcorpus)))}) ->>> bookcorpus_with_ids = concatenate_datasets([bookcorpus, bookcorpus_ids], axis=1) +>>> stories_ids = Dataset.from_dict({"ids": list(range(len(stories)))}) +>>> stories_with_ids = concatenate_datasets([stories, stories_ids], axis=1) ``` ### Interleave @@ -630,40 +677,94 @@ Note that if no sampling probabilities are specified, the new dataset will have ## Format -The [`~Dataset.set_format`] function changes the format of a column to be compatible with some common data formats. Specify the output you'd like in the `type` parameter and the columns you want to format. Formatting is applied on-the-fly. +The [`~Dataset.with_format`] function changes the format of a column to be compatible with some common data formats. Specify the output you'd like in the `type` parameter. You can also choose which the columns you want to format using `columns=`. Formatting is applied on-the-fly. 
For example, create PyTorch tensors by setting `type="torch"`: ```py ->>> import torch ->>> dataset.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "label"]) +>>> dataset = dataset.with_format(type="torch") +``` + +The [`~Dataset.set_format`] function also changes the format of a column, except it runs in-place: + +```py +>>> dataset.set_format(type="torch") ``` -The [`~Dataset.with_format`] function also changes the format of a column, except it returns a new [`Dataset`] object: +If you need to reset the dataset to its original format, set the format to `None` (or use [`~Dataset.reset_format`]): ```py ->>> dataset = dataset.with_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "label"]) +>>> dataset.format +{'type': 'torch', 'format_kwargs': {}, 'columns': [...], 'output_all_columns': False} +>>> dataset = dataset.with_format(None) +>>> dataset.format +{'type': None, 'format_kwargs': {}, 'columns': [...], 'output_all_columns': False} ``` +### Tensor formats + +Several tensor and array formats are supported. It is generally recommended to use these formats instead of converting outputs of a dataset to tensors or arrays manually, to avoid unnecessary data copies and accelerate data loading. + +Here is the list of supported tensor and array formats: + +- NumPy: format name is "numpy", for more information see [Using Datasets with NumPy](use_with_numpy) +- PyTorch: format name is "torch", for more information see [Using Datasets with PyTorch](use_with_pytorch) +- TensorFlow: format name is "tensorflow", for more information see [Using Datasets with TensorFlow](use_with_tensorflow) +- JAX: format name is "jax", for more information see [Using Datasets with JAX](use_with_jax) + -🤗 Datasets also provides support for other common data formats such as NumPy, Pandas, and JAX. Check out the [Using Datasets with TensorFlow](https://huggingface.co/docs/datasets/master/en/use_with_tensorflow#using-totfdataset) guide for more details on how to efficiently create a TensorFlow dataset. +Check out the [Using Datasets with TensorFlow](use_with_tensorflow#using-totfdataset) guide for more details on how to efficiently create a TensorFlow dataset. -If you need to reset the dataset to its original format, use the [`~Dataset.reset_format`] function: +When a dataset is formatted in a tensor or array format, all the data are formatted as tensors or arrays (except for unsupported types such as strings, in the case of PyTorch): -```py ->>> dataset.format -{'type': 'torch', 'format_kwargs': {}, 'columns': ['label'], 'output_all_columns': False} ->>> dataset.reset_format() ->>> dataset.format -{'type': 'python', 'format_kwargs': {}, 'columns': ['idx', 'label', 'sentence1', 'sentence2'], 'output_all_columns': False} +```python +>>> ds = Dataset.from_dict({"text": ["foo", "bar"], "tokens": [[0, 1, 2], [3, 4, 5]]}) +>>> ds = ds.with_format("torch") +>>> ds[0] +{'text': 'foo', 'tokens': tensor([0, 1, 2])} +>>> ds[:2] +{'text': ['foo', 'bar'], + 'tokens': tensor([[0, 1, 2], + [3, 4, 5]])} +``` + +### Tabular formats + +You can use a dataframe or table format to optimize data loading and data processing, since these formats generally offer zero-copy operations and transforms written in low-level languages. 
+ +Here is the list of supported dataframe and table formats: + +- Pandas: format name is "pandas", for more information see [Using Datasets with Pandas](use_with_pandas) +- Polars: format name is "polars", for more information see [Using Datasets with Polars](use_with_polars) +- PyArrow: format name is "arrow", for more information see [Using Datasets with PyArrow](use_with_pyarrow) + +When a dataset is formatted in a dataframe or table format, every dataset row or batch of rows is formatted as a dataframe or table, and dataset columns are formatted as a series or array: + +```python +>>> ds = Dataset.from_dict({"text": ["foo", "bar"], "label": [0, 1]}) +>>> ds = ds.with_format("pandas") +>>> ds[:2] + text label +0 foo 0 +1 bar 1 +``` + +Those formats make it possible to iterate on the data faster by avoiding data copies, and also enable faster data processing in [`~Dataset.map`] or [`~Dataset.filter`]: + +```python +>>> ds = ds.map(lambda df: df.assign(upper_text=df.text.str.upper()), batched=True) +>>> ds[:2] + text label upper_text +0 foo 0 FOO +1 bar 1 BAR ``` -### Format transform +### Custom format transform -The [`~Dataset.set_transform`] function applies a custom formatting transform on-the-fly. This function replaces any previously specified format. For example, you can use this function to tokenize and pad tokens on-the-fly. Tokenization is only applied when examples are accessed: +The [`~Dataset.with_transform`] function applies a custom formatting transform on-the-fly. This function replaces any previously specified format. For example, you can use this function to tokenize and pad tokens on-the-fly. Tokenization is only applied when examples are accessed: ```py >>> from transformers import AutoTokenizer @@ -671,12 +772,14 @@ The [`~Dataset.set_transform`] function applies a custom formatting transform on >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") >>> def encode(batch): ... return tokenizer(batch["sentence1"], batch["sentence2"], padding="longest", truncation=True, max_length=512, return_tensors="pt") ->>> dataset.set_transform(encode) +>>> dataset = dataset.with_transform(encode) >>> dataset.format {'type': 'custom', 'format_kwargs': {'transform': }, 'columns': ['idx', 'label', 'sentence1', 'sentence2'], 'output_all_columns': False} ``` -You can also use the [`~Dataset.set_transform`] function to decode formats not supported by [`Features`]. For example, the [`Audio`] feature uses [`soundfile`](https://python-soundfile.readthedocs.io/en/0.11.0/) - a fast and simple library to install - but it does not provide support for less common audio formats. Here is where you can use [`~Dataset.set_transform`] to apply a custom decoding transform on the fly. You're free to use any library you like to decode the audio files. +There is also [`~Dataset.set_transform`] which does the same but runs in-place. + +You can also use the [`~Dataset.with_transform`] function to decode formats not supported by [`Features`]. For example, the [`Audio`] feature uses [`soundfile`](https://python-soundfile.readthedocs.io/en/0.11.0/) - a fast and simple library to install - but it does not provide support for less common audio formats. Here is where you can use [`~Dataset.with_transform`] to apply a custom decoding transform on the fly. You're free to use any library you like to decode the audio files.
The example below uses the [`pydub`](http://pydub.com/) package to open an audio format not supported by `soundfile`: @@ -720,12 +823,6 @@ Use the [`load_from_disk`] function to reload the dataset: >>> reloaded_dataset = load_from_disk("path/of/my/dataset/directory") ``` - - -Want to save your dataset to a cloud storage provider? Read our [Cloud Storage](./filesystems) guide to learn how to save your dataset to AWS or Google Cloud Storage. - - - ## Export 🤗 Datasets supports exporting as well so you can work with your dataset in other applications. The following table shows currently supported file formats you can export to: diff --git a/docs/source/quickstart.mdx b/docs/source/quickstart.mdx index 8b35674fe51..cf71dee75a0 100644 --- a/docs/source/quickstart.mdx +++ b/docs/source/quickstart.mdx @@ -175,7 +175,7 @@ Image datasets are loaded just like text datasets. However, instead of a tokeniz ```py >>> from datasets import load_dataset, Image ->>> dataset = load_dataset("beans", split="train") +>>> dataset = load_dataset("AI-Lab-Makerere/beans", split="train") ``` Most image models work with RBG images. If your dataset contains images in a different mode, you can use the [`~Dataset.cast_column`] function to set the mode to RGB: @@ -281,7 +281,7 @@ Text needs to be tokenized into individual tokens by a [tokenizer](https://huggi ```py >>> from datasets import load_dataset ->>> dataset = load_dataset("glue", "mrpc", split="train") +>>> dataset = load_dataset("nyu-mll/glue", "mrpc", split="train") ``` **2**. Next, load a pretrained [BERT](https://huggingface.co/bert-base-uncased) model and its corresponding tokenizer from the [🤗 Transformers](https://huggingface.co/transformers/) library. It is totally normal to see a warning after you load the model about some weights not being initialized. This is expected because you are loading this model checkpoint for training with another task. diff --git a/docs/source/stream.mdx b/docs/source/stream.mdx index f17899aa438..d16c27ee9e3 100644 --- a/docs/source/stream.mdx +++ b/docs/source/stream.mdx @@ -13,13 +13,13 @@ This is especially helpful when: -For example, the English split of the [oscar-corpus/OSCAR-2201](https://huggingface.co/datasets/oscar-corpus/OSCAR-2201) dataset is 1.2 terabytes, but you can use it instantly with streaming. Stream a dataset by setting `streaming=True` in [`load_dataset`] as shown below: +For example, the English split of the [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) dataset is 45 terabytes, but you can use it instantly with streaming. Stream a dataset by setting `streaming=True` in [`load_dataset`] as shown below: ```py >>> from datasets import load_dataset ->>> dataset = load_dataset('oscar-corpus/OSCAR-2201', 'en', split='train', streaming=True) +>>> dataset = load_dataset('HuggingFaceFW/fineweb', split='train', streaming=True) >>> print(next(iter(dataset))) -{'id': 0, 'text': 'Founded in 2015, Golden Bees is a leading programmatic recruitment platform dedicated to employers, HR agencies and job boards. The company has developed unique HR-custom technologies and predictive algorithms to identify and attract the best candidates for a job opportunity.', ... +{'text': "How AP reported in all formats from tornado-stricken regionsMarch 8, 2012\nWhen the first serious bout of tornadoes of 2012 blew through middle America in the middle of the night, they touched down in places hours from any AP bureau... 
``` Dataset streaming also lets you work with a dataset made of local files without doing any conversion. @@ -59,11 +59,11 @@ If you have an existing [`Dataset`] object, you can convert it to an [`IterableD >>> from datasets import load_dataset # faster 🐇 ->>> dataset = load_dataset("food101") +>>> dataset = load_dataset("ethz/food101") >>> iterable_dataset = dataset.to_iterable_dataset() # slower 🐢 ->>> iterable_dataset = load_dataset("food101", streaming=True) +>>> iterable_dataset = load_dataset("ethz/food101", streaming=True) ``` The [`~Dataset.to_iterable_dataset`] function supports sharding when the [`IterableDataset`] is instantiated. This is useful when working with big datasets, and you'd like to shuffle the dataset or to enable fast parallel loading with a PyTorch DataLoader. @@ -72,7 +72,7 @@ The [`~Dataset.to_iterable_dataset`] function supports sharding when the [`Itera >>> import torch >>> from datasets import load_dataset ->>> dataset = load_dataset("food101") +>>> dataset = load_dataset("ethz/food101") >>> iterable_dataset = dataset.to_iterable_dataset(num_shards=64) # shard the dataset >>> iterable_dataset = iterable_dataset.shuffle(buffer_size=10_000) # shuffles the shards order and use a shuffle buffer when you start iterating dataloader = torch.utils.data.DataLoader(iterable_dataset, num_workers=4) # assigns 64 / 4 = 16 shards from the shuffled list of shards to each worker when you start iterating @@ -86,7 +86,7 @@ The `buffer_size` argument controls the size of the buffer to randomly sample ex ```py >>> from datasets import load_dataset ->>> dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True) +>>> dataset = load_dataset('HuggingFaceFW/fineweb', split='train', streaming=True) >>> shuffled_dataset = dataset.shuffle(seed=42, buffer_size=10_000) ``` @@ -116,10 +116,11 @@ You can split your dataset one of two ways: - [`IterableDataset.take`] returns the first `n` examples in a dataset: ```py ->>> dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True) +>>> dataset = load_dataset('HuggingFaceFW/fineweb', split='train', streaming=True) >>> dataset_head = dataset.take(2) >>> list(dataset_head) -[{'id': 0, 'text': 'Mtendere Village was...'}, {'id': 1, 'text': 'Lily James cannot fight the music...'}] +[{'text': "How AP reported in all formats from tor...}, + {'text': 'Did you know you have two little yellow...}] ``` - [`IterableDataset.skip`] omits the first `n` examples in a dataset and returns the remaining examples: @@ -171,20 +172,22 @@ If your dataset has `dataset.num_shards==1`, you should chunk it using [`Iterabl ```py >>> from datasets import interleave_datasets ->>> en_dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True, trust_remote_code=True) ->>> fr_dataset = load_dataset('oscar', "unshuffled_deduplicated_fr", split='train', streaming=True, trust_remote_code=True) +>>> es_dataset = load_dataset('allenai/c4', 'es', split='train', streaming=True) +>>> fr_dataset = load_dataset('allenai/c4', 'fr', split='train', streaming=True) ->>> multilingual_dataset = interleave_datasets([en_dataset, fr_dataset]) +>>> multilingual_dataset = interleave_datasets([es_dataset, fr_dataset]) >>> list(multilingual_dataset.take(2)) -[{'text': 'Mtendere Village was inspired by the vision...'}, {'text': "Média de débat d'idées, de culture et de littérature..."}] +[{'text': 'Comprar Zapatillas para niña en chancla con goma por...'}, + {'text': 'Le sacre de philippe ier, 23 mai 
1059 - Compte Rendu...'}] ``` Define sampling probabilities from each of the original datasets for more control over how each of them are sampled and combined. Set the `probabilities` argument with your desired sampling probabilities: ```py ->>> multilingual_dataset_with_oversampling = interleave_datasets([en_dataset, fr_dataset], probabilities=[0.8, 0.2], seed=42) +>>> multilingual_dataset_with_oversampling = interleave_datasets([es_dataset, fr_dataset], probabilities=[0.8, 0.2], seed=42) >>> list(multilingual_dataset_with_oversampling.take(2)) -[{'text': 'Mtendere Village was inspired by the vision...'}, {'text': 'Lily James cannot fight the music...'}] +[{'text': 'Comprar Zapatillas para niña en chancla con goma por...'}, + {'text': 'Chevrolet Cavalier Usados en Bogota - Carros en Vent...'}] ``` Around 80% of the final dataset is made of the `en_dataset`, and 20% of the `fr_dataset`. @@ -205,7 +208,7 @@ Provide [`IterableDataset.rename_column`] with the name of the original column, ```py >>> from datasets import load_dataset ->>> dataset = load_dataset('mc4', 'en', streaming=True, split='train', trust_remote_code=True) +>>> dataset = load_dataset('allenai/c4', 'en', streaming=True, split='train') >>> dataset = dataset.rename_column("text", "content") ``` @@ -215,7 +218,7 @@ When you need to remove one or more columns, give [`IterableDataset.remove_colum ```py >>> from datasets import load_dataset ->>> dataset = load_dataset('mc4', 'en', streaming=True, split='train', trust_remote_code=True) +>>> dataset = load_dataset('allenai/c4', 'en', streaming=True, split='train') >>> dataset = dataset.remove_columns('timestamp') ``` @@ -225,7 +228,7 @@ When you need to remove one or more columns, give [`IterableDataset.remove_colum ```py >>> from datasets import load_dataset ->>> dataset = load_dataset('glue', 'mrpc', split='train', streaming=True) +>>> dataset = load_dataset('nyu-mll/glue', 'mrpc', split='train', streaming=True) >>> dataset.features {'sentence1': Value(dtype='string', id=None), 'sentence2': Value(dtype='string', id=None), @@ -280,24 +283,30 @@ Next, apply this function to the dataset with [`IterableDataset.map`]: ```py >>> from datasets import load_dataset ->>> dataset = load_dataset('oscar', 'unshuffled_deduplicated_en', streaming=True, split='train', trust_remote_code=True) +>>> dataset = load_dataset('allenai/c4', 'en', streaming=True, split='train') >>> updated_dataset = dataset.map(add_prefix) >>> list(updated_dataset.take(3)) -[{'id': 0, 'text': 'My text: Mtendere Village was inspired by...'}, - {'id': 1, 'text': 'My text: Lily James cannot fight the music...'}, - {'id': 2, 'text': 'My text: "I\'d love to help kickstart...'}] +[{'text': 'My text: Beginners BBQ Class Taking Place in Missoula!\nDo you want to get better at making...', + 'timestamp': '2019-04-25 12:57:54', + 'url': 'https://klyq.com/beginners-bbq-class-taking-place-in-missoula/'}, + {'text': 'My text: Discussion in \'Mac OS X Lion (10.7)\' started by axboi87, Jan 20, 2012.\nI\'ve go...', + 'timestamp': '2019-04-21 10:07:13', + 'url': 'https://forums.macrumors.com/threads/restore-from-larger-disk-to-smaller-disk.1311329/'}, + {'text': 'My text: Foil plaid lycra and spandex shortall with metallic slinky insets. Attached metall...', + 'timestamp': '2019-04-25 10:40:23', + 'url': 'https://awishcometrue.com/Catalogs/Clearance/Tweens/V1960-Find-A-Way'}] ``` -Let's take a look at another example, except this time, you will remove a column with [`IterableDataset.map`]. 
When you remove a column, it is only removed after the example has been provided to the mapped function. This allows the mapped function to use the content of the columns before they are removed. +Let's take a look at another example, except this time, you will remove columns with [`IterableDataset.map`]. When you remove a column, it is only removed after the example has been provided to the mapped function. This allows the mapped function to use the content of the columns before they are removed. Specify the column to remove with the `remove_columns` argument in [`IterableDataset.map`]: ```py ->>> updated_dataset = dataset.map(add_prefix, remove_columns=["id"]) +>>> updated_dataset = dataset.map(add_prefix, remove_columns=["timestamp", "url"]) >>> list(updated_dataset.take(3)) -[{'text': 'My text: Mtendere Village was inspired by...'}, - {'text': 'My text: Lily James cannot fight the music...'}, - {'text': 'My text: "I\'d love to help kickstart...'}] +[{'text': 'My text: Beginners BBQ Class Taking Place in Missoula!\nDo you want to get better at making...'}, + {'text': 'My text: Discussion in \'Mac OS X Lion (10.7)\' started by axboi87, Jan 20, 2012.\nI\'ve go...'}, + {'text': 'My text: Foil plaid lycra and spandex shortall with metallic slinky insets. Attached metall...'}] ``` ### Batch processing @@ -309,14 +318,14 @@ Specify the column to remove with the `remove_columns` argument in [`IterableDat ```py >>> from datasets import load_dataset >>> from transformers import AutoTokenizer ->>> dataset = load_dataset("mc4", "en", streaming=True, split="train", trust_remote_code=True) +>>> dataset = load_dataset("allenai/c4", "en", streaming=True, split="train") >>> tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased') >>> def encode(examples): ... 
return tokenizer(examples['text'], truncation=True, padding='max_length') >>> dataset = dataset.map(encode, batched=True, remove_columns=["text", "timestamp", "url"]) >>> next(iter(dataset)) -{'input_ids': [101, 8466, 1018, 1010, 4029, 2475, 2062, 18558, 3100, 2061, ...,1106, 3739, 102], -'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..., 1, 1]} +{'input_ids': [101, 4088, 16912, 22861, 4160, 2465, 2635, 2173, 1999, 3335, ..., 0, 0, 0], +'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..., 0, 0]} ``` @@ -331,10 +340,10 @@ You can filter rows in the dataset based on a predicate function using [`Dataset ```py >>> from datasets import load_dataset ->>> dataset = load_dataset('oscar', 'unshuffled_deduplicated_en', streaming=True, split='train', trust_remote_code=True) ->>> start_with_ar = dataset.filter(lambda example: example['text'].startswith('Ar')) +>>> dataset = load_dataset('HuggingFaceFW/fineweb', streaming=True, split='train') +>>> start_with_ar = dataset.filter(lambda example: example['text'].startswith('San Francisco')) >>> next(iter(start_with_ar)) -{'id': 4, 'text': 'Are you looking for Number the Stars (Essential Modern Classics)?...'} +{'text': 'San Francisco 49ers cornerback Shawntae Spencer will miss the rest of the sea...} ``` [`Dataset.filter`] can also filter by indices if you set `with_indices=True`: @@ -342,9 +351,9 @@ You can filter rows in the dataset based on a predicate function using [`Dataset ```py >>> even_dataset = dataset.filter(lambda example, idx: idx % 2 == 0, with_indices=True) >>> list(even_dataset.take(3)) -[{'id': 0, 'text': 'Mtendere Village was inspired by the vision of Chief Napoleon Dzombe, ...'}, - {'id': 2, 'text': '"I\'d love to help kickstart continued development! And 0 EUR/month...'}, - {'id': 4, 'text': 'Are you looking for Number the Stars (Essential Modern Classics)? Normally, ...'}] +[{'text': 'How AP reported in all formats from tornado-stricken regionsMarch 8, 2012 Whe...}, + {'text': 'Car Wash For Clara! Now is your chance to help! 2 year old Clara Woodward has...}, + {'text': 'Log In Please enter your ECode to log in. Forgotten your eCode? If you create...}] ``` ## Batch diff --git a/docs/source/use_dataset.mdx b/docs/source/use_dataset.mdx index 0ab53e1b70b..073b5261e0d 100644 --- a/docs/source/use_dataset.mdx +++ b/docs/source/use_dataset.mdx @@ -35,7 +35,7 @@ Check out the [Tokenizers](https://huggingface.co/course/chapter2/4?fw=pt) secti >>> from datasets import load_dataset >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") ->>> dataset = load_dataset("rotten_tomatoes", split="train") +>>> dataset = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train") ``` **2**. Call your tokenizer on the first row of `text` in the dataset: @@ -159,7 +159,7 @@ The most common preprocessing you'll do with image datasets is *data augmentatio >>> from datasets import load_dataset, Image >>> feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k") ->>> dataset = load_dataset("beans", split="train") +>>> dataset = load_dataset("AI-Lab-Makerere/beans", split="train") ``` **2**. Index into the first row of the dataset. When you call the `image` column of the dataset, the underlying PIL object is automatically decoded into an image. 
diff --git a/docs/source/use_with_jax.mdx b/docs/source/use_with_jax.mdx index dc73a8df778..89d1628df06 100644 --- a/docs/source/use_with_jax.mdx +++ b/docs/source/use_with_jax.mdx @@ -108,7 +108,7 @@ To avoid this, you must explicitly use the [`Array`] feature type and specify th >>> data = [[[1, 2],[3, 4]],[[5, 6],[7, 8]]] >>> features = Features({"data": Array2D(shape=(2, 2), dtype='int32')}) >>> ds = Dataset.from_dict({"data": data}, features=features) ->>> ds = ds.with_format("torch") +>>> ds = ds.with_format("jax") >>> ds[0] {'data': Array([[1, 2], [3, 4]], dtype=int32)} diff --git a/docs/source/use_with_numpy.mdx b/docs/source/use_with_numpy.mdx new file mode 100644 index 00000000000..a8084655915 --- /dev/null +++ b/docs/source/use_with_numpy.mdx @@ -0,0 +1,191 @@ +# Use with NumPy + +This document is a quick introduction to using `datasets` with NumPy, with a particular focus on how to get +`numpy.ndarray` objects out of our datasets, and how to use them to train models based on NumPy such as `scikit-learn` models. + + +## Dataset format + +By default, datasets return regular Python objects: integers, floats, strings, lists, etc.. + +To get NumPy arrays instead, you can set the format of the dataset to `numpy`: + +```py +>>> from datasets import Dataset +>>> data = [[1, 2], [3, 4]] +>>> ds = Dataset.from_dict({"data": data}) +>>> ds = ds.with_format("numpy") +>>> ds[0] +{'data': array([1, 2])} +>>> ds[:2] +{'data': array([ + [1, 2], + [3, 4]])} +``` + + + +A [`Dataset`] object is a wrapper of an Arrow table, which allows fast reads from arrays in the dataset to NumPy arrays. + + + +Note that the exact same procedure applies to `DatasetDict` objects, so that +when setting the format of a `DatasetDict` to `numpy`, all the `Dataset`s there +will be formatted as `numpy`: + +```py +>>> from datasets import DatasetDict +>>> data = {"train": {"data": [[1, 2], [3, 4]]}, "test": {"data": [[5, 6], [7, 8]]}} +>>> dds = DatasetDict.from_dict(data) +>>> dds = dds.with_format("numpy") +>>> dds["train"][:2] +{'data': array([ + [1, 2], + [3, 4]])} +``` + + +### N-dimensional arrays + +If your dataset consists of N-dimensional arrays, you will see that by default they are considered as the same array if the shape is fixed: + +```py +>>> from datasets import Dataset +>>> data = [[[1, 2],[3, 4]], [[5, 6],[7, 8]]] # fixed shape +>>> ds = Dataset.from_dict({"data": data}) +>>> ds = ds.with_format("numpy") +>>> ds[0] +{'data': array([[1, 2], + [3, 4]])} +``` + +```py +>>> from datasets import Dataset +>>> data = [[[1, 2],[3]], [[4, 5, 6],[7, 8]]] # varying shape +>>> ds = Dataset.from_dict({"data": data}) +>>> ds = ds.with_format("numpy") +>>> ds[0] +{'data': array([array([1, 2]), array([3])], dtype=object)} +``` + +However this logic often requires slow shape comparisons and data copies. 
+To avoid this, you must explicitly use the [`Array`] feature type and specify the shape of your tensors: + +```py +>>> from datasets import Dataset, Features, Array2D +>>> data = [[[1, 2],[3, 4]],[[5, 6],[7, 8]]] +>>> features = Features({"data": Array2D(shape=(2, 2), dtype='int32')}) +>>> ds = Dataset.from_dict({"data": data}, features=features) +>>> ds = ds.with_format("numpy") +>>> ds[0] +{'data': array([[1, 2], + [3, 4]])} +>>> ds[:2] +{'data': array([[[1, 2], + [3, 4]], + + [[5, 6], + [7, 8]]])} +``` + +### Other feature types + +[`ClassLabel`] data is properly converted to arrays: + +```py +>>> from datasets import Dataset, Features, ClassLabel +>>> labels = [0, 0, 1] +>>> features = Features({"label": ClassLabel(names=["negative", "positive"])}) +>>> ds = Dataset.from_dict({"label": labels}, features=features) +>>> ds = ds.with_format("numpy") +>>> ds[:3] +{'label': array([0, 0, 1])} +``` + +String and binary objects are unchanged, since NumPy only supports numbers. + +The [`Image`] and [`Audio`] feature types are also supported. + + + +To use the [`Image`] feature type, you'll need to install the `vision` extra as +`pip install datasets[vision]`. + + + +```py +>>> from datasets import Dataset, Features, Image +>>> images = ["path/to/image.png"] * 10 +>>> features = Features({"image": Image()}) +>>> ds = Dataset.from_dict({"image": images}, features=features) +>>> ds = ds.with_format("numpy") +>>> ds[0]["image"].shape +(512, 512, 3) +>>> ds[0] +{'image': array([[[ 255, 255, 255], + [ 255, 255, 255], + ..., + [ 255, 255, 255], + [ 255, 255, 255]]], dtype=uint8)} +>>> ds[:2]["image"].shape +(2, 512, 512, 3) +>>> ds[:2] +{'image': array([[[[ 255, 255, 255], + [ 255, 255, 255], + ..., + [ 255, 255, 255], + [ 255, 255, 255]]]], dtype=uint8)} +``` + + + +To use the [`Audio`] feature type, you'll need to install the `audio` extra as +`pip install datasets[audio]`. + + + +```py +>>> from datasets import Dataset, Features, Audio +>>> audio = ["path/to/audio.wav"] * 10 +>>> features = Features({"audio": Audio()}) +>>> ds = Dataset.from_dict({"audio": audio}, features=features) +>>> ds = ds.with_format("numpy") +>>> ds[0]["audio"]["array"] +array([-0.059021 , -0.03894043, -0.00735474, ..., 0.0133667 , + 0.01809692, 0.00268555], dtype=float32) +>>> ds[0]["audio"]["sampling_rate"] +array(44100, weak_type=True) +``` + +## Data loading + +NumPy doesn't have any built-in data loading capabilities, so you'll either need to materialize the NumPy arrays like `X, y` to use in `scikit-learn` or use a library such as [PyTorch](https://pytorch.org/) to load your data using a `DataLoader`. + +### Using `with_format('numpy')` + +The easiest way to get NumPy arrays out of a dataset is to use the `with_format('numpy')` method. Lets assume +that we want to train a neural network on the [MNIST dataset](http://yann.lecun.com/exdb/mnist/) available +at the HuggingFace Hub at https://huggingface.co/datasets/mnist. + +```py +>>> from datasets import load_dataset +>>> ds = load_dataset("mnist") +>>> ds = ds.with_format("numpy") +>>> ds["train"][0] +{'image': array([[ 0, 0, 0, ...], + [ 0, 0, 0, ...], + ..., + [ 0, 0, 0, ...], + [ 0, 0, 0, ...]], dtype=uint8), + 'label': array(5)} +``` + +Once the format is set we can feed the dataset to the model based on NumPy in batches using the `Dataset.iter()` +method: + +```py +>>> for epoch in range(epochs): +... for batch in ds["train"].iter(batch_size=32): +... x, y = batch["image"], batch["label"] +... ... 
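+
+>>> # A minimal sketch (not part of the original example), assuming scikit-learn is installed:
+>>> # `Dataset.iter()` yields NumPy batches here, so an sklearn model can be fit incrementally.
+>>> import numpy as np
+>>> from sklearn.linear_model import SGDClassifier
+>>> clf = SGDClassifier()
+>>> for batch in ds["train"].iter(batch_size=32):
+...     x, y = batch["image"], batch["label"]
+...     clf.partial_fit(x.reshape(len(x), -1), y, classes=np.arange(10))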
+``` diff --git a/docs/source/use_with_pandas.mdx b/docs/source/use_with_pandas.mdx new file mode 100644 index 00000000000..9c2cfb7e878 --- /dev/null +++ b/docs/source/use_with_pandas.mdx @@ -0,0 +1,83 @@ +# Use with Pandas + +This document is a quick introduction to using `datasets` with Pandas, with a particular focus on how to process +datasets using Pandas functions, and how to convert a dataset to Pandas or from Pandas. + +This is particularly useful as it allows fast operations, since `datasets` uses PyArrow under the hood and PyArrow is well integrated with Pandas. + +## Dataset format + +By default, datasets return regular Python objects: integers, floats, strings, lists, etc. + +To get Pandas DataFrames or Series instead, you can set the format of the dataset to `pandas` using [`Dataset.with_format`]: + +```py +>>> from datasets import Dataset +>>> data = {"col_0": ["a", "b", "c", "d"], "col_1": [0., 0., 1., 1.]} +>>> ds = Dataset.from_dict(data) +>>> ds = ds.with_format("pandas") +>>> ds[0] # pd.DataFrame + col_0 col_1 +0 a 0.0 +>>> ds[:2] # pd.DataFrame + col_0 col_1 +0 a 0.0 +1 b 0.0 +>>> ds["col_0"] # pd.Series +0 a +1 b +2 c +3 d +Name: col_0, dtype: object +``` + +This also works for `IterableDataset` objects obtained e.g. using `load_dataset(..., streaming=True)`: + +```py +>>> ds = ds.with_format("pandas") +>>> for df in ds.iter(batch_size=2): +... print(df) +... break + col_0 col_1 +0 a 0.0 +1 b 0.0 +``` + +## Process data + +Pandas functions are generally faster than regular hand-written python functions, and therefore they are a good option to optimize data processing. You can use Pandas functions to process a dataset in [`Dataset.map`] or [`Dataset.filter`]: + +```python +>>> from datasets import Dataset +>>> data = {"col_0": ["a", "b", "c", "d"], "col_1": [0., 0., 1., 1.]} +>>> ds = Dataset.from_dict(data) +>>> ds = ds.with_format("pandas") +>>> ds = ds.map(lambda df: df.assign(col_2=df.col_1 + 1), batched=True) +>>> ds[:2] + col_0 col_1 col_2 +0 a 0.0 1.0 +1 b 0.0 1.0 +>>> ds = ds.filter(lambda df: df.col_0 == "b", batched=True) +>>> ds[0] + col_0 col_1 col_2 +0 b 0.0 1.0 +``` + +We use `batched=True` because it is faster to process batches of data in Pandas rather than row by row. It's also possible to use `batch_size=` in `map()` to set the size of each `df`. + +This also works for [`IterableDataset.map`] and [`IterableDataset.filter`]. + +## Import or Export from Pandas + +To import data from Pandas, you can use [`Dataset.from_pandas`]: + +```python +ds = Dataset.from_pandas(df) +``` + +And you can use [`Dataset.to_pandas`] to export a Dataset to a Pandas DataFrame: + + +```python +df = ds.to_pandas() +``` diff --git a/docs/source/use_with_polars.mdx b/docs/source/use_with_polars.mdx new file mode 100644 index 00000000000..8718d9addaa --- /dev/null +++ b/docs/source/use_with_polars.mdx @@ -0,0 +1,139 @@ +# Use with Polars + +This document is a quick introduction to using `datasets` with Polars, with a particular focus on how to process +datasets using Polars functions, and how to convert a dataset to Polars or from Polars. + +This is particularly useful as it allows fast zero-copy operations, since both `datasets` and Polars use Arrow under the hood. + +## Dataset format + +By default, datasets return regular Python objects: integers, floats, strings, lists, etc.
+ +To get Polars DataFrames or Series instead, you can set the format of the dataset to `polars` using [`Dataset.with_format`]: + +```py +>>> from datasets import Dataset +>>> data = {"col_0": ["a", "b", "c", "d"], "col_1": [0., 0., 1., 1.]} +>>> ds = Dataset.from_dict(data) +>>> ds = ds.with_format("polars") +>>> ds[0] # pl.DataFrame +shape: (1, 2) +┌───────┬───────┐ +│ col_0 ┆ col_1 │ +│ --- ┆ --- │ +│ str ┆ f64 │ +╞═══════╪═══════╡ +│ a ┆ 0.0 │ +└───────┴───────┘ +>>> ds[:2] # pl.DataFrame +shape: (2, 2) +┌───────┬───────┐ +│ col_0 ┆ col_1 │ +│ --- ┆ --- │ +│ str ┆ f64 │ +╞═══════╪═══════╡ +│ a ┆ 0.0 │ +│ b ┆ 0.0 │ +└───────┴───────┘ +>>> ds["col_0"] # pl.Series +shape: (4,) +Series: 'col_0' [str] +[ + "a" + "b" + "c" + "d" +] +``` + +This also works for `IterableDataset` objects obtained e.g. using `load_dataset(..., streaming=True)`: + +```py +>>> ds = ds.with_format("polars") +>>> for df in ds.iter(batch_size=2): +... print(df) +... break +shape: (2, 2) +┌───────┬───────┐ +│ col_0 ┆ col_1 │ +│ --- ┆ --- │ +│ str ┆ f64 │ +╞═══════╪═══════╡ +│ a ┆ 0.0 │ +│ b ┆ 0.0 │ +└───────┴───────┘ +``` + +## Process data + +Polars functions are generally faster than regular hand-written python functions, and therefore they are a good option to optimize data processing. You can use Polars functions to process a dataset in [`Dataset.map`] or [`Dataset.filter`]: + +```python +>>> import polars as pl +>>> from datasets import Dataset +>>> data = {"col_0": ["a", "b", "c", "d"], "col_1": [0., 0., 1., 1.]} +>>> ds = Dataset.from_dict(data) +>>> ds = ds.with_format("polars") +>>> ds = ds.map(lambda df: df.with_columns(pl.col("col_1").add(1).alias("col_2")), batched=True) +>>> ds[:2] +shape: (2, 3) +┌───────┬───────┬───────┐ +│ col_0 ┆ col_1 ┆ col_2 │ +│ --- ┆ --- ┆ --- │ +│ str ┆ f64 ┆ f64 │ +╞═══════╪═══════╪═══════╡ +│ a ┆ 0.0 ┆ 1.0 │ +│ b ┆ 0.0 ┆ 1.0 │ +└───────┴───────┴───────┘ +>>> ds = ds.filter(lambda df: df["col_0"] == "b", batched=True) +>>> ds[0] +shape: (1, 3) +┌───────┬───────┬───────┐ +│ col_0 ┆ col_1 ┆ col_2 │ +│ --- ┆ --- ┆ --- │ +│ str ┆ f64 ┆ f64 │ +╞═══════╪═══════╪═══════╡ +│ b ┆ 0.0 ┆ 1.0 │ +└───────┴───────┴───────┘ +``` + +We use `batched=True` because it is faster to process batches of data in Polars rather than row by row. It's also possible to use `batch_size=` in `map()` to set the size of each `df`. + +This also works for [`IterableDataset.map`] and [`IterableDataset.filter`]. + +### Example: data extraction + +Many functions are available in Polars for any data type: strings, floats, integers, etc. You can find the full list [here](https://docs.pola.rs/api/python/stable/reference/expressions/functions.html). Those functions are written in Rust and run on batches of data which enables fast data processing.
+ +Here is an example that shows a 2.5x speed boost using Polars instead of a regular python function to extract solutions from an LLM reasoning dataset: + +```python +import re + +import polars as pl +from datasets import load_dataset + +ds = load_dataset("ServiceNow-AI/R1-Distill-SFT", "v0", split="train") + +# Using a regular python function +pattern = re.compile("boxed\\{(.*)\\}") +result_ds = ds.map(lambda x: {"value_solution": m.group(1) if (m:=pattern.search(x["solution"])) else None}) +# Time: 10s + +# Using a Polars function +expr = pl.col("solution").str.extract("boxed\\{(.*)\\}").alias("value_solution") +result_ds = ds.with_format("polars").map(lambda df: df.with_columns(expr), batched=True) +# Time: 2s +``` + +## Import or Export from Polars + +To import data from Polars, you can use [`Dataset.from_polars`]: + +```python +ds = Dataset.from_polars(df) +``` + +And you can use [`Dataset.to_polars`] to export a Dataset to a Polars DataFrame: + + +```python +df = ds.to_polars() +``` diff --git a/docs/source/use_with_pyarrow.mdx b/docs/source/use_with_pyarrow.mdx new file mode 100644 index 00000000000..3e6239cad0d --- /dev/null +++ b/docs/source/use_with_pyarrow.mdx @@ -0,0 +1,108 @@ +# Use with PyArrow + +This document is a quick introduction to using `datasets` with PyArrow, with a particular focus on how to process +datasets using Arrow compute functions, and how to convert a dataset to PyArrow or from PyArrow. + +This is particularly useful as it allows fast zero-copy operations, since `datasets` uses PyArrow under the hood. + +## Dataset format + +By default, datasets return regular Python objects: integers, floats, strings, lists, etc. + +To get PyArrow Tables or Arrays instead, you can set the format of the dataset to `arrow` using [`Dataset.with_format`]: + +```py +>>> from datasets import Dataset +>>> data = {"col_0": ["a", "b", "c", "d"], "col_1": [0., 0., 1., 1.]} +>>> ds = Dataset.from_dict(data) +>>> ds = ds.with_format("arrow") +>>> ds[0] # pa.Table +pyarrow.Table +col_0: string +col_1: double +---- +col_0: [["a"]] +col_1: [[0]] +>>> ds[:2] # pa.Table +pyarrow.Table +col_0: string +col_1: double +---- +col_0: [["a","b"]] +col_1: [[0,0]] +>>> ds["col_0"] # pa.array + +[ + [ + "a", + "b", + "c", + "d" + ] +] +``` + +This also works for `IterableDataset` objects obtained e.g. using `load_dataset(..., streaming=True)`: + +```py +>>> ds = ds.with_format("arrow") +>>> for table in ds.iter(batch_size=2): +... print(table) +... break +pyarrow.Table +col_0: string +col_1: double +---- +col_0: [["a","b"]] +col_1: [[0,0]] +``` + +## Process data + +PyArrow functions are generally faster than regular hand-written python functions, and therefore they are a good option to optimize data processing.
You can use Arrow compute functions to process a dataset in [`Dataset.map`] or [`Dataset.filter`]: + +```python +>>> import pyarrow.compute as pc +>>> from datasets import Dataset +>>> data = {"col_0": ["a", "b", "c", "d"], "col_1": [0., 0., 1., 1.]} +>>> ds = Dataset.from_dict(data) +>>> ds = ds.with_format("arrow") +>>> ds = ds.map(lambda t: t.append_column("col_2", pc.add(t["col_1"], 1)), batched=True) +>>> ds[:2] +pyarrow.Table +col_0: string +col_1: double +col_2: double +---- +col_0: [["a","b"]] +col_1: [[0,0]] +col_2: [[1,1]] +>>> ds = ds.filter(lambda t: pc.equal(t["col_0"], "b"), batched=True) +>>> ds[0] +pyarrow.Table +col_0: string +col_1: double +col_2: double +---- +col_0: [["b"]] +col_1: [[0]] +col_2: [[1]] +``` + +We use `batched=True` because it is faster to process batches of data in PyArrow rather than row by row. It's also possible to use `batch_size=` in `map()` to set the size of each `table`. + +This also works for [`IterableDataset.map`] and [`IterableDataset.filter`]. + +## Import or Export from PyArrow + +A [`Dataset`] is a wrapper of a PyArrow Table, you can instantiate a Dataset directly from the Table: + +```python +ds = Dataset(table) +``` + +You can access the PyArrow Table of a dataset using [`Dataset.data`], which returns a [`MemoryMappedTable`] or a [`InMemoryTable`] or a [`ConcatenationTable`], depending on the origin of the Arrow data and the operations that were applied. + +Those objects wrap the underlying PyArrow table accessible at `Dataset.data.table`. This table contains all the data of the dataset, but there might also be an indices mapping at `Dataset._indices` which maps the dataset rows indices to the PyArrow Table rows indices. This can happen if the dataset has been shuffled with [`Dataset.shuffle`] or if only a subset of the rows are used (e.g. after a [`Dataset.select`]). + +In the general case, you can export a dataset to a PyArrow Table using `table = ds.with_format("arrow")[:]`. diff --git a/docs/source/video_load.mdx b/docs/source/video_load.mdx index 782869d2eed..51f9fada9fa 100644 --- a/docs/source/video_load.mdx +++ b/docs/source/video_load.mdx @@ -10,7 +10,7 @@ Video datasets have [`Video`] type columns, which contain `decord` objects. -To work with video datasets, you need to have the `vision` dependency installed. Check out the [installation](./installation#vision) guide to learn how to install it. +To work with video datasets, you need to have the `decord` package installed. Check out the [installation](https://github.com/dmlc/decord?tab=readme-ov-file#installation) guide to learn how to install it, or the [related discussions](https://github.com/dmlc/decord/issues/213). diff --git a/setup.py b/setup.py index 9fc12d63597..de81c2632ff 100644 --- a/setup.py +++ b/setup.py @@ -62,9 +62,7 @@ ``` Check that you can install it in a virtualenv/notebook by running: ``` - pip install huggingface-hub fsspec aiohttp - pip install -U tqdm pyarrow - pip install -i https://testpypi.python.org/pypi datasets + !pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ datasets ``` 6. 
Upload the final version to the actual PyPI: @@ -129,11 +127,11 @@ "multiprocess<0.70.17", # to align with dill<0.3.9 (see above) # to save datasets locally or on any filesystem # minimum 2023.1.0 to support protocol=kwargs in fsspec's `open`, `get_fs_token_paths`, etc.: see https://github.com/fsspec/filesystem_spec/pull/1143 - "fsspec[http]>=2023.1.0,<=2024.9.0", + "fsspec[http]>=2023.1.0,<=2024.12.0", # for data streaming via http "aiohttp", # To get datasets from the Datasets Hub on huggingface.co - "huggingface-hub>=0.23.0", + "huggingface-hub>=0.24.0", # Utilities from PyPA to e.g., compare versions "packaging", # To parse YAML metadata from dataset cards @@ -235,7 +233,7 @@ setup( name="datasets", - version="3.2.0", # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots) + version="3.3.0", # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots) description="HuggingFace community-driven open-source library of datasets", long_description=open("README.md", encoding="utf-8").read(), long_description_content_type="text/markdown", diff --git a/src/datasets/__init__.py b/src/datasets/__init__.py index 42fabebdf89..a992f315d4a 100644 --- a/src/datasets/__init__.py +++ b/src/datasets/__init__.py @@ -12,7 +12,7 @@ # See the License for the specific language governing permissions and # limitations under the License. -__version__ = "3.2.0" +__version__ = "3.3.0" from .arrow_dataset import Dataset from .arrow_reader import ReadInstruction diff --git a/src/datasets/arrow_dataset.py b/src/datasets/arrow_dataset.py index b69a4203e0a..ec81f8e4d73 100644 --- a/src/datasets/arrow_dataset.py +++ b/src/datasets/arrow_dataset.py @@ -15,9 +15,11 @@ # Lint as: python3 """Simple Dataset wrapping an Arrow Table.""" +import asyncio import contextlib import copy import fnmatch +import inspect import itertools import json import math @@ -808,7 +810,7 @@ def from_pandas( and therefore doesn't have an associated cache directory. This may change in the feature, but in the meantime if you want to reduce memory usage you should write it back on disk - and reload using using e.g. save_to_disk / load_from_disk. + and reload using e.g. save_to_disk / load_from_disk. Args: df (`pandas.DataFrame`): @@ -908,7 +910,7 @@ def from_dict( and therefore doesn't have an associated cache directory. This may change in the feature, but in the meantime if you want to reduce memory usage you should write it back on disk - and reload using using e.g. save_to_disk / load_from_disk. + and reload using e.g. save_to_disk / load_from_disk. Args: mapping (`Mapping`): @@ -973,7 +975,7 @@ def from_list( and therefore doesn't have an associated cache directory. This may change in the feature, but in the meantime if you want to reduce memory usage you should write it back on disk - and reload using using e.g. save_to_disk / load_from_disk. + and reload using e.g. save_to_disk / load_from_disk. Args: mapping (`List[dict]`): A list of mappings of strings to row values. 
@@ -1749,7 +1751,7 @@ def data(self) -> Table: ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", split="validation") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") >>> ds.data MemoryMappedTable text: string @@ -1769,7 +1771,7 @@ def cache_files(self) -> List[dict]: ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", split="validation") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") >>> ds.cache_files [{'filename': '/root/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/40d411e45a6ce3484deed7cc15b82a53dad9a72aafd9f86f8f227134bec5ca46/rotten_tomatoes_movie_review-validation.arrow'}] ``` @@ -1787,7 +1789,7 @@ def num_columns(self) -> int: ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", split="validation") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") >>> ds.num_columns 2 ``` @@ -1802,7 +1804,7 @@ def num_rows(self) -> int: ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", split="validation") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") >>> ds.num_rows 1066 ``` @@ -1819,7 +1821,7 @@ def column_names(self) -> List[str]: ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", split="validation") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") >>> ds.column_names ['text', 'label'] ``` @@ -1834,7 +1836,7 @@ def shape(self) -> Tuple[int, int]: ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", split="validation") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") >>> ds.shape (1066, 2) ``` @@ -1859,7 +1861,7 @@ def unique(self, column: str) -> List: ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", split="validation") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") >>> ds.unique('label') [1, 0] ``` @@ -1967,7 +1969,7 @@ def flatten(self, new_fingerprint: Optional[str] = None, max_depth=16) -> "Datas ```py >>> from datasets import load_dataset - >>> ds = load_dataset("squad", split="train") + >>> ds = load_dataset("rajpurkar/squad", split="train") >>> ds.features {'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None), 'context': Value(dtype='string', id=None), @@ -1990,7 +1992,7 @@ def flatten(self, new_fingerprint: Optional[str] = None, max_depth=16) -> "Datas dataset.info.features = self._info.features.flatten(max_depth=max_depth) dataset.info.features = Features({col: dataset.info.features[col] for col in dataset.data.column_names}) dataset._data = update_metadata_with_features(dataset._data, dataset.features) - logger.info(f'Flattened dataset from depth {depth} to depth {1 if depth + 1 < max_depth else "unknown"}.') + logger.info(f"Flattened dataset from depth {depth} to depth {1 if depth + 1 < max_depth else 'unknown'}.") dataset._fingerprint = new_fingerprint return dataset @@ -2039,7 +2041,7 @@ def cast( ```py >>> from datasets import load_dataset, ClassLabel, Value - >>> ds = load_dataset("rotten_tomatoes", split="validation") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") >>> ds.features {'label': 
ClassLabel(names=['neg', 'pos'], id=None), 'text': Value(dtype='string', id=None)} @@ -2097,7 +2099,7 @@ def cast_column(self, column: str, feature: FeatureType, new_fingerprint: Option ```py >>> from datasets import load_dataset, ClassLabel - >>> ds = load_dataset("rotten_tomatoes", split="validation") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") >>> ds.features {'label': ClassLabel(names=['neg', 'pos'], id=None), 'text': Value(dtype='string', id=None)} @@ -2142,7 +2144,7 @@ def remove_columns(self, column_names: Union[str, List[str]], new_fingerprint: O ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", split="validation") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") >>> ds = ds.remove_columns('label') Dataset({ features: ['text'], @@ -2198,7 +2200,7 @@ def rename_column( ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", split="validation") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") >>> ds = ds.rename_column('label', 'label_new') Dataset({ features: ['text', 'label_new'], @@ -2260,7 +2262,7 @@ def rename_columns(self, column_mapping: Dict[str, str], new_fingerprint: Option ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", split="validation") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") >>> ds = ds.rename_columns({'text': 'text_new', 'label': 'label_new'}) Dataset({ features: ['text_new', 'label_new'], @@ -2329,7 +2331,7 @@ def select_columns(self, column_names: Union[str, List[str]], new_fingerprint: O ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", split="validation") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") >>> ds.select_columns(['text']) Dataset({ features: ['text'], @@ -2362,7 +2364,7 @@ def __len__(self): ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", split="validation") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") >>> ds.__len__ >> from datasets import load_dataset >>> from transformers import AutoTokenizer - >>> ds = load_dataset("rotten_tomatoes", split="validation") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") >>> ds = ds.map(lambda x: tokenizer(x['text'], truncation=True, padding=True), batched=True) >>> ds.set_format(type='numpy', columns=['text', 'label']) @@ -2566,7 +2568,7 @@ def reset_format(self): ```py >>> from datasets import load_dataset >>> from transformers import AutoTokenizer - >>> ds = load_dataset("rotten_tomatoes", split="validation") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") >>> ds = ds.map(lambda x: tokenizer(x['text'], truncation=True, padding=True), batched=True) >>> ds.set_format(type='numpy', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label']) @@ -2611,7 +2613,7 @@ def set_transform( ```py >>> from datasets import load_dataset >>> from transformers import AutoTokenizer - >>> ds = load_dataset("rotten_tomatoes", split="validation") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") >>> tokenizer = 
AutoTokenizer.from_pretrained('bert-base-uncased') >>> def encode(batch): ... return tokenizer(batch['text'], padding=True, truncation=True, return_tensors='pt') @@ -2644,7 +2646,7 @@ def with_format( Args: type (`str`, *optional*): - Either output type selected in `[None, 'numpy', 'torch', 'tensorflow', 'pandas', 'arrow', 'jax']`. + Either output type selected in `[None, 'numpy', 'torch', 'tensorflow', 'jax', 'arrow', 'pandas', 'polars']`. `None` means `__getitem__` returns python objects (default). columns (`List[str]`, *optional*): Columns to format in the output. @@ -2659,7 +2661,7 @@ def with_format( ```py >>> from datasets import load_dataset >>> from transformers import AutoTokenizer - >>> ds = load_dataset("rotten_tomatoes", split="validation") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") >>> ds = ds.map(lambda x: tokenizer(x['text'], truncation=True, padding=True), batched=True) >>> ds.format @@ -2728,7 +2730,7 @@ def with_transform( ```py >>> from datasets import load_dataset >>> from transformers import AutoTokenizer - >>> ds = load_dataset("rotten_tomatoes", split="validation") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") >>> def encode(example): ... return tokenizer(example["text"], padding=True, truncation=True, return_tensors='pt') @@ -2798,7 +2800,7 @@ def cleanup_cache_files(self) -> int: ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", split="validation") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") >>> ds.cleanup_cache_files() 10 ``` @@ -2869,6 +2871,9 @@ def map( Note that the last batch may have less than `n` examples. A batch is a dictionary, e.g. a batch of `n` examples is `{"text": ["Hello there !"] * n}`. + If the function is asynchronous, then `map` will run your function in parallel, with up to one thousand simulatenous calls. + It is recommended to use a `asyncio.Semaphore` in your function if you want to set a maximum number of operations that can run at the same time. + Args: function (`Callable`): Function with one of the following signatures: @@ -2878,6 +2883,7 @@ def map( - `function(batch: Dict[str, List], *extra_args) -> Dict[str, List]` if `batched=True` and `with_indices=True` and/or `with_rank=True` (one extra arg for each) For advanced usage, the function can also return a `pyarrow.Table`. + If the function is asynchronous, then `map` will run your function in parallel. Moreover if your function returns nothing (`None`), then `map` will run your function and return the dataset unchanged. If no function is provided, default to identity function: `lambda x: x`. with_indices (`bool`, defaults to `False`): @@ -2936,7 +2942,7 @@ def map( ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", split="validation") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") >>> def add_prefix(example): ... example["text"] = "Review: " + example["text"] ... 
return example @@ -3176,9 +3182,9 @@ def format_new_fingerprint(new_fingerprint: str, rank: int) -> str: del kwargs["shard"] else: logger.info(f"Loading cached processed dataset at {format_cache_file_name(cache_file_name, '*')}") - assert ( - None not in transformed_shards - ), f"Failed to retrieve results from map: result list {transformed_shards} still contains None - at least one worker failed to return its results" + assert None not in transformed_shards, ( + f"Failed to retrieve results from map: result list {transformed_shards} still contains None - at least one worker failed to return its results" + ) logger.info(f"Concatenating {num_proc} shards") result = _concatenate_map_style_datasets(transformed_shards) # update fingerprint if the dataset changed @@ -3276,10 +3282,9 @@ def _map_single( **format_kwargs, ) - class NumExamplesMismatchError(Exception): - pass + check_same_num_examples = batched and len(shard.list_indexes()) > 0 - def validate_function_output(processed_inputs, indices): + def validate_function_output(processed_inputs): """Validate output of the map function.""" allowed_processed_inputs_types = (Mapping, pa.Table, pd.DataFrame) if config.POLARS_AVAILABLE and "polars" in sys.modules: @@ -3290,7 +3295,7 @@ def validate_function_output(processed_inputs, indices): raise TypeError( f"Provided `function` which is applied to all elements of table returns a variable of type {type(processed_inputs)}. Make sure provided `function` returns a variable of type `dict` (or a pyarrow table) to update the dataset or `None` if you are only interested in side effects." ) - elif isinstance(indices, list) and isinstance(processed_inputs, Mapping): + if batched and isinstance(processed_inputs, Mapping): allowed_batch_return_types = (list, np.ndarray, pd.Series) if config.POLARS_AVAILABLE and "polars" in sys.modules: import polars as pl @@ -3316,9 +3321,8 @@ def validate_function_output(processed_inputs, indices): f"Provided `function` which is applied to all elements of table returns a `dict` of types {[type(x) for x in processed_inputs.values()]}. When using `batched=True`, make sure provided `function` returns a `dict` of types like `{allowed_batch_return_types}`." 
) - def apply_function_on_filtered_inputs(pa_inputs, indices, check_same_num_examples=False, offset=0): + def prepare_inputs(pa_inputs, indices, offset=0): """Utility to apply the function on a selection of columns.""" - nonlocal update_data inputs = format_table( pa_inputs, 0 if not batched else range(pa_inputs.num_rows), @@ -3335,7 +3339,12 @@ def apply_function_on_filtered_inputs(pa_inputs, indices, check_same_num_example additional_args += (effective_indices,) if with_rank: additional_args += (rank,) - processed_inputs = function(*fn_args, *additional_args, **fn_kwargs) + return inputs, fn_args, additional_args, fn_kwargs + + def prepare_outputs(pa_inputs, inputs, processed_inputs): + nonlocal update_data + if not (update_data := (processed_inputs is not None)): + return None if isinstance(processed_inputs, LazyDict): processed_inputs = { k: v for k, v in processed_inputs.data.items() if k not in processed_inputs.keys_to_format @@ -3343,17 +3352,7 @@ def apply_function_on_filtered_inputs(pa_inputs, indices, check_same_num_example returned_lazy_dict = True else: returned_lazy_dict = False - if update_data is None: - # Check if the function returns updated examples - updatable_types = (Mapping, pa.Table, pd.DataFrame) - if config.POLARS_AVAILABLE and "polars" in sys.modules: - import polars as pl - - updatable_types += (pl.DataFrame,) - update_data = isinstance(processed_inputs, updatable_types) - validate_function_output(processed_inputs, indices) - if not update_data: - return None # Nothing to update, let's move on + validate_function_output(processed_inputs) if shard._format_type or input_columns: # TODO(QL, MS): ideally the behavior should be the same even if the dataset is formatted (may require major release) inputs_to_merge = dict(zip(pa_inputs.column_names, pa_inputs.itercolumns())) @@ -3374,7 +3373,9 @@ def apply_function_on_filtered_inputs(pa_inputs, indices, check_same_num_example input_num_examples = len(pa_inputs) processed_inputs_num_examples = len(processed_inputs[next(iter(processed_inputs.keys()))]) if input_num_examples != processed_inputs_num_examples: - raise NumExamplesMismatchError() + raise DatasetTransformationNotAllowedError( + "Using `.map` in batched mode on a dataset with attached indexes is allowed only if it doesn't create or remove existing examples. You can first run `.drop_index() to remove your index and then re-add it." + ) from None if isinstance(inputs, Mapping) and isinstance(processed_inputs, Mapping): # The .map() transform *updates* the dataset: # the output dictionary contains both the the input data and the output data. @@ -3383,6 +3384,18 @@ def apply_function_on_filtered_inputs(pa_inputs, indices, check_same_num_example else: return processed_inputs + def apply_function(pa_inputs, indices, offset=0): + """Utility to apply the function on a selection of columns.""" + inputs, fn_args, additional_args, fn_kwargs = prepare_inputs(pa_inputs, indices, offset=offset) + processed_inputs = function(*fn_args, *additional_args, **fn_kwargs) + return prepare_outputs(pa_inputs, inputs, processed_inputs) + + async def async_apply_function(pa_inputs, indices, offset=0): + """Utility to apply the function on a selection of columns. 
Same code but async""" + inputs, fn_args, additional_args, fn_kwargs = prepare_inputs(pa_inputs, indices, offset=offset) + processed_inputs = await function(*fn_args, *additional_args, **fn_kwargs) + return prepare_outputs(pa_inputs, inputs, processed_inputs) + def init_buffer_and_writer(): # Prepare output buffer and batched writer in memory or on file if we update the table writer_features = features @@ -3418,6 +3431,35 @@ def init_buffer_and_writer(): ) return buf_writer, writer, tmp_file + def iter_outputs(shard_iterable): + if inspect.iscoroutinefunction(function): + indices: Union[List[int], List[List[int]]] = [] + tasks: List[asyncio.Task] = [] + try: + loop = asyncio.get_running_loop() + except RuntimeError: + loop = asyncio.new_event_loop() + for i, example in shard_iterable: + indices.append(i) + tasks.append(loop.create_task(async_apply_function(example, i, offset=offset))) + # keep the total active tasks under a certain number + if len(tasks) >= config.MAX_NUM_RUNNING_ASYNC_MAP_FUNCTIONS_IN_PARALLEL: + done, pending = loop.run_until_complete( + asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED) + ) + while tasks and len(pending) >= config.MAX_NUM_RUNNING_ASYNC_MAP_FUNCTIONS_IN_PARALLEL: + done, pending = loop.run_until_complete( + asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED) + ) + # yield finished tasks + while tasks and tasks[0].done(): + yield indices.pop(0), tasks.pop(0).result() + while tasks: + yield indices.pop(0), loop.run_until_complete(tasks.pop(0)) + else: + for i, example in shard_iterable: + yield i, apply_function(example, i, offset=offset) + num_examples_progress_update = 0 # If `update_data` is True after processing the first example/batch, initalize these resources with `init_buffer_and_writer` buf_writer, writer, tmp_file = None, None, None @@ -3437,13 +3479,12 @@ def init_buffer_and_writer(): else: num_rows = len(shard) if not drop_last_batch else len(shard) // batch_size * batch_size shard_iterable = zip( - range(0, num_rows, batch_size), + (list(range(i, min(i + batch_size, num_rows))) for i in range(0, num_rows, batch_size)), arrow_formatted_shard.iter(batch_size, drop_last_batch=drop_last_batch), ) if not batched: _time = time.time() - for i, example in shard_iterable: - example = apply_function_on_filtered_inputs(example, i, offset=offset) + for i, example in iter_outputs(shard_iterable): if update_data: if i == 0: buf_writer, writer, tmp_file = init_buffer_and_writer() @@ -3467,24 +3508,10 @@ def init_buffer_and_writer(): num_examples_progress_update = 0 else: _time = time.time() - for i, batch in shard_iterable: - num_examples_in_batch = len(batch) - indices = list( - range(*(slice(i, i + batch_size).indices(shard.num_rows))) - ) # Something simpler? - try: - batch = apply_function_on_filtered_inputs( - batch, - indices, - check_same_num_examples=len(shard.list_indexes()) > 0, - offset=offset, - ) - except NumExamplesMismatchError: - raise DatasetTransformationNotAllowedError( - "Using `.map` in batched mode on a dataset with attached indexes is allowed only if it doesn't create or remove existing examples. You can first run `.drop_index() to remove your index and then re-add it." 
- ) from None + for i, batch in iter_outputs(shard_iterable): + num_examples_in_batch = len(i) if update_data: - if i == 0: + if i and i[0] == 0: buf_writer, writer, tmp_file = init_buffer_and_writer() stack.enter_context(writer) if isinstance(batch, pa.Table): @@ -3564,7 +3591,7 @@ def batch( ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", split="train") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train") >>> batched_ds = ds.batch(batch_size=4) >>> batched_ds[0] {'text': ['compassionately explores the seemingly irreconcilable situation...', ...], # 4 items @@ -3610,6 +3637,9 @@ def filter( """Apply a filter function to all the elements in the table in batches and update the table so that the dataset only includes examples according to the filter function. + If the function is asynchronous, then `filter` will run your function in parallel, with up to one thousand simulatenous calls (configurable). + It is recommended to use a `asyncio.Semaphore` in your function if you want to set a maximum number of operations that can run at the same time. + Args: function (`Callable`): Callable with one of the following signatures: @@ -3618,6 +3648,7 @@ def filter( - `function(batch: Dict[str, List]) -> List[bool]` if `batched=True` and `with_indices=False` and `with_rank=False` - `function(batch: Dict[str, List], *extra_args) -> List[bool]` if `batched=True` and `with_indices=True` and/or `with_rank=True` (one extra arg for each) + If the function is asynchronous, then `filter` will run your function in parallel. If no function is provided, defaults to an always `True` function: `lambda x: True`. with_indices (`bool`, defaults to `False`): Provide example indices to `function`. Note that in this case the @@ -3666,7 +3697,7 @@ def filter( ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", split="validation") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") >>> ds.filter(lambda x: x["label"] == 1) Dataset({ features: ['text', 'label'], @@ -3687,7 +3718,9 @@ def filter( indices = self.map( function=partial( - get_indices_from_mask_function, + async_get_indices_from_mask_function + if inspect.iscoroutinefunction(function) + else get_indices_from_mask_function, function, batched, with_indices, @@ -3699,7 +3732,7 @@ def filter( with_rank=True, features=Features({"indices": Value("uint64")}), batched=True, - batch_size=batch_size, + batch_size=batch_size if batched else 1, remove_columns=self.column_names, keep_in_memory=keep_in_memory, load_from_cache_file=load_from_cache_file, @@ -3831,7 +3864,7 @@ def select( ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", split="validation") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") >>> ds.select(range(4)) Dataset({ features: ['text', 'label'], @@ -3906,7 +3939,7 @@ def _select_contiguous( ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", split="validation") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") >>> ds._select_contiguous(0, 4) Dataset({ features: ['text', 'label'], @@ -3969,7 +4002,7 @@ def _select_with_indices_mapping( ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", split="validation") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") >>> ds._select_with_indices_mapping(range(4)) 
Dataset({ features: ['text', 'label'], @@ -4060,7 +4093,7 @@ def skip(self, n: int) -> "Dataset": ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", split="train") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train") >>> list(ds.take(3)) [{'label': 1, 'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}, @@ -4078,6 +4111,38 @@ def skip(self, n: int) -> "Dataset": """ return self.select(range(n, len(self))) + def repeat(self, num_times: int) -> "Dataset": + """ + Create a new [`Dataset`] that repeats the underlying dataset `num_times` times. + + Like itertools.repeat, repeating once just returns the full dataset. + + Args: + num_times (`int`): + Number of times to repeat the dataset. + + Example: + ```py + >>> from datasets import load_dataset + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train") + >>> ds = ds.take(2).repeat(2) + >>> list(ds) + [{'label': 1, + 'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}, + {'label': 1, + 'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .'}, + {'label': 1, 'text': 'effective but too-tepid biopic'}, + {'label': 1, + 'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}, + {'label': 1, + 'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .'}, + {'label': 1, 'text': 'effective but too-tepid biopic'}] + ``` + """ + if num_times is None: + raise ValueError("Map style datasets do not support indefinite repetition.") + return _concatenate_map_style_datasets([self] * num_times) if num_times > 0 else self.select([]) + def take(self, n: int) -> "Dataset": """ Create a new [`Dataset`] with only the first `n` elements. 
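The async support added above (`async_apply_function`, `iter_outputs`, and `async_get_indices_from_mask_function` further down) means `Dataset.map` and `Dataset.filter` now accept coroutine functions, as the updated `filter` docstring describes. A minimal sketch of the intended usage, assuming the `rotten_tomatoes` dataset used throughout these docstrings and an arbitrary concurrency limit of 4; the `asyncio.sleep` call stands in for a real async operation:

```py
import asyncio

from datasets import load_dataset

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")

# The docstrings recommend an asyncio.Semaphore to bound concurrent work inside the function;
# the hard cap is config.MAX_NUM_RUNNING_ASYNC_MAP_FUNCTIONS_IN_PARALLEL (1000 by default).
semaphore = asyncio.Semaphore(4)

async def add_prefix(example):
    async with semaphore:
        await asyncio.sleep(0)  # stand-in for an HTTP request, model call, etc.
        return {"text": "Review: " + example["text"]}

async def keep_positive(example):
    async with semaphore:
        return example["label"] == 1

ds = ds.map(add_prefix)        # coroutine functions are detected and scheduled in parallel
ds = ds.filter(keep_positive)
```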
@@ -4090,7 +4155,7 @@ def take(self, n: int) -> "Dataset": ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", split="train") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train") >>> small_ds = ds.take(2) >>> list(small_ds) [{'label': 1, @@ -4146,7 +4211,7 @@ def sort( ```py >>> from datasets import load_dataset - >>> ds = load_dataset('rotten_tomatoes', split='validation') + >>> ds = load_dataset('cornell-movie-review-data/rotten_tomatoes', split='validation') >>> ds['label'][:10] [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] >>> sorted_ds = ds.sort('label') @@ -4304,7 +4369,7 @@ def shuffle( ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", split="validation") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") >>> ds['label'][:10] [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] @@ -4438,7 +4503,7 @@ def train_test_split( ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", split="validation") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") >>> ds = ds.train_test_split(test_size=0.2, shuffle=True) DatasetDict({ train: Dataset({ @@ -4689,7 +4754,7 @@ def shard( ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", split="validation") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") >>> ds Dataset({ features: ['text', 'label'], @@ -5205,9 +5270,12 @@ def to_iterable_dataset(self, num_shards: Optional[int] = 1) -> "IterableDataset from .iterable_dataset import ArrowExamplesIterable, IterableDataset if self._format_type is not None: - raise NotImplementedError( - "Converting a formatted dataset to a formatted iterable dataset is not implemented yet. Please run `my_dataset = my_dataset.with_format(None)` before calling to_iterable_dataset" - ) + if self._format_kwargs or ( + self._format_columns is not None and set(self._format_columns) != set(self.column_names) + ): + raise NotImplementedError( + "Converting a formatted dataset with kwargs or selected columns to a formatted iterable dataset is not implemented yet. Please run `my_dataset = my_dataset.with_format(None)` before calling to_iterable_dataset" + ) if num_shards > len(self): raise ValueError( f"Unable to shard a dataset of size {len(self)} into {num_shards} shards (the number of shards exceeds the number of samples)." @@ -5228,7 +5296,10 @@ def to_iterable_dataset(self, num_shards: Optional[int] = 1) -> "IterableDataset Dataset._generate_tables_from_shards, kwargs={"shards": shards, "batch_size": config.DEFAULT_MAX_BATCH_SIZE}, ) - return IterableDataset(ex_iterable, info=DatasetInfo(features=self.features)) + ds = IterableDataset(ex_iterable, info=DatasetInfo(features=self.features)) + if self._format_type: + ds = ds.with_format(self._format_type) + return ds def _push_parquet_shards_to_hub( self, @@ -5651,7 +5722,7 @@ def push_to_hub( create_pr=create_pr, ) logger.info( - f"Commit #{i+1} completed" + f"Commit #{i + 1} completed" + (f" (still {num_commits - i - 1} to go)" if num_commits - i - 1 else "") + "." 
) @@ -5681,7 +5752,7 @@ def add_column( ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", split="validation") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") >>> more_text = ds["text"] >>> ds.add_column(name="text_2", column=more_text) Dataset({ @@ -5932,7 +6003,7 @@ def add_item(self, item: dict, new_fingerprint: str): ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", split="validation") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") >>> new_review = {'label': 0, 'text': 'this movie is the absolute worst thing I have ever seen'} >>> ds = ds.add_item(new_review) >>> ds[-1] @@ -5983,7 +6054,7 @@ def align_labels_with_mapping(self, label2id: Dict, label_column: str) -> "Datas ```python >>> # dataset with mapping {'entailment': 0, 'neutral': 1, 'contradiction': 2} - >>> ds = load_dataset("glue", "mnli", split="train") + >>> ds = load_dataset("nyu-mll/glue", "mnli", split="train") >>> # mapping to align with >>> label2id = {'CONTRADICTION': 0, 'NEUTRAL': 1, 'ENTAILMENT': 2} >>> ds_aligned = ds.align_labels_with_mapping(label2id, "label") @@ -6308,8 +6379,10 @@ def get_indices_from_mask_function( if with_rank: additional_args += (rank,) mask = function(*inputs, *additional_args, **fn_kwargs) + if isinstance(mask, (pa.Array, pa.ChunkedArray)): + mask = mask.to_pylist() else: - # we get batched data (to do less look-ups) but `function` only accepts one example + # we get batched data (to return less data than input) but `function` only accepts one example # therefore we need to call `function` on each example of the batch to get the mask *inputs, indices, rank = args mask = [] @@ -6343,3 +6416,62 @@ def get_indices_from_mask_function( indices_array = indices_mapping.column(0).take(indices_array) indices_array = indices_array.to_pylist() return {"indices": indices_array} + + +async def async_get_indices_from_mask_function( + function: Callable, + batched: bool, + with_indices: bool, + with_rank: bool, + input_columns: Optional[Union[str, List[str]]], + indices_mapping: Optional[Table] = None, + *args, + **fn_kwargs, +): + """same function but async""" + if batched: + # we extract indices and rank from args + *inputs, indices, rank = args + additional_args = () + if with_indices: + additional_args += (indices,) + if with_rank: + additional_args += (rank,) + mask = await function(*inputs, *additional_args, **fn_kwargs) + if isinstance(mask, (pa.Array, pa.ChunkedArray)): + mask = mask.to_pylist() + else: + # we get batched data (to return less data than input) but `function` only accepts one example + # therefore we need to call `function` on each example of the batch to get the mask + *inputs, indices, rank = args + mask = [] + if input_columns is None: + # inputs only contains a batch of examples + batch: dict = inputs[0] + num_examples = len(batch[next(iter(batch.keys()))]) + for i in range(num_examples): + example = {key: batch[key][i] for key in batch} + additional_args = () + if with_indices: + additional_args += (indices[i],) + if with_rank: + additional_args += (rank,) + mask.append(await function(example, *additional_args, **fn_kwargs)) + else: + # inputs is a list of columns + columns: List[List] = inputs + num_examples = len(columns[0]) + for i in range(num_examples): + input = [column[i] for column in columns] + additional_args = () + if with_indices: + additional_args += (indices[i],) + if with_rank: + additional_args += (rank,) + 
mask.append(await function(*input, *additional_args, **fn_kwargs)) + indices_array = [i for i, to_keep in zip(indices, mask) if to_keep] + if indices_mapping is not None: + indices_array = pa.array(indices_array, type=pa.uint64()) + indices_array = indices_mapping.column(0).take(indices_array) + indices_array = indices_array.to_pylist() + return {"indices": indices_array} diff --git a/src/datasets/builder.py b/src/datasets/builder.py index 1200749538e..258402617a4 100644 --- a/src/datasets/builder.py +++ b/src/datasets/builder.py @@ -255,7 +255,7 @@ class DatasetBuilder: Datasets Hub. If `True`, will get token from `"~/.huggingface"`. repo_id (`str`, *optional*): ID of the dataset repository. - Used to distinguish builders with the same name but not coming from the same namespace, for example "squad" + Used to distinguish builders with the same name but not coming from the same namespace, for example "rajpurkar/squad" and "lhoestq/squad" repo IDs. In the latter, the builder name would be "lhoestq___squad". data_files (`str` or `Sequence` or `Mapping`, *optional*): Path(s) to source data file(s). @@ -524,7 +524,7 @@ def get_exported_dataset_info(self) -> DatasetInfo: ```py >>> from datasets import load_dataset_builder - >>> ds_builder = load_dataset_builder('rotten_tomatoes') + >>> ds_builder = load_dataset_builder('cornell-movie-review-data/rotten_tomatoes') >>> ds_builder.get_exported_dataset_info() DatasetInfo(description='', citation='', homepage='', license='', features={'speaker_id': Value(dtype='string', id=None), 'path': Value(dtype='string', id=None), 'audio': Audio(sampling_rate=16000, mono=True, decode=True, id=None), 'sentence': Value(dtype='string', id=None)}, post_processed=None, supervised_keys=None, builder_name=None, dataset_name=None, config_name='default', version=None, splits={'train': SplitInfo(name='train', num_bytes=1722002133, num_examples=11660, shard_lengths=None, dataset_name=None), 'test': SplitInfo(name='test', num_bytes=86120227, num_examples=760, shard_lengths=None, dataset_name=None)}, download_checksums=None, download_size=1475540500, post_processing_size=None, dataset_size=1808122360, size_in_bytes=None) ``` @@ -782,7 +782,7 @@ def download_and_prepare( ```py >>> from datasets import load_dataset_builder - >>> builder = load_dataset_builder("rotten_tomatoes") + >>> builder = load_dataset_builder("cornell-movie-review-data/rotten_tomatoes") >>> builder.download_and_prepare() ``` @@ -790,7 +790,7 @@ def download_and_prepare( ```py >>> from datasets import load_dataset_builder - >>> builder = load_dataset_builder("rotten_tomatoes") + >>> builder = load_dataset_builder("cornell-movie-review-data/rotten_tomatoes") >>> builder.download_and_prepare("./output_dir", file_format="parquet") ``` @@ -799,7 +799,7 @@ def download_and_prepare( ```py >>> from datasets import load_dataset_builder >>> storage_options = {"key": aws_access_key_id, "secret": aws_secret_access_key} - >>> builder = load_dataset_builder("rotten_tomatoes") + >>> builder = load_dataset_builder("cornell-movie-review-data/rotten_tomatoes") >>> builder.download_and_prepare("s3://my-bucket/my_rotten_tomatoes", storage_options=storage_options, file_format="parquet") ``` """ @@ -1093,7 +1093,7 @@ def as_dataset( ```py >>> from datasets import load_dataset_builder - >>> builder = load_dataset_builder('rotten_tomatoes') + >>> builder = load_dataset_builder('cornell-movie-review-data/rotten_tomatoes') >>> builder.download_and_prepare() >>> ds = builder.as_dataset(split='train') >>> ds @@ -1114,7 +1114,7 @@ 
def as_dataset( "datasets.load_dataset() before trying to access the Dataset object." ) - logger.debug(f'Constructing Dataset for split {split or ", ".join(self.info.splits)}, from {self._output_dir}') + logger.debug(f"Constructing Dataset for split {split or ', '.join(self.info.splits)}, from {self._output_dir}") # By default, return all splits if split is None: @@ -1528,9 +1528,9 @@ def _prepare_split( # the content is the number of examples progress update pbar.update(content) - assert ( - None not in examples_per_job - ), f"Failed to retrieve results from prepare_split: result list {examples_per_job} still contains None - at least one worker failed to return its results" + assert None not in examples_per_job, ( + f"Failed to retrieve results from prepare_split: result list {examples_per_job} still contains None - at least one worker failed to return its results" + ) total_shards = sum(shards_per_job) total_num_examples = sum(examples_per_job) @@ -1783,9 +1783,9 @@ def _prepare_split( # the content is the number of examples progress update pbar.update(content) - assert ( - None not in examples_per_job - ), f"Failed to retrieve results from prepare_split: result list {examples_per_job} still contains None - at least one worker failed to return its results" + assert None not in examples_per_job, ( + f"Failed to retrieve results from prepare_split: result list {examples_per_job} still contains None - at least one worker failed to return its results" + ) total_shards = sum(shards_per_job) total_num_examples = sum(examples_per_job) diff --git a/src/datasets/combine.py b/src/datasets/combine.py index d2aad87f0cc..fcbc092b1bb 100644 --- a/src/datasets/combine.py +++ b/src/datasets/combine.py @@ -104,15 +104,15 @@ def interleave_datasets( [10, 0, 11, 1, 2, 20, 12, 13, ..., 0, 1, 2, 0, 24] For datasets in streaming mode (iterable): - >>> from datasets import load_dataset, interleave_datasets - >>> d1 = load_dataset("oscar", "unshuffled_deduplicated_en", split="train", streaming=True) - >>> d2 = load_dataset("oscar", "unshuffled_deduplicated_fr", split="train", streaming=True) + >>> from datasets import interleave_datasets + >>> d1 = load_dataset('allenai/c4', 'es', split='train', streaming=True) + >>> d2 = load_dataset('allenai/c4', 'fr', split='train', streaming=True) >>> dataset = interleave_datasets([d1, d2]) >>> iterator = iter(dataset) >>> next(iterator) - {'text': 'Mtendere Village was inspired by the vision...} + {'text': 'Comprar Zapatillas para niña en chancla con goma por...'} >>> next(iterator) - {'text': "Média de débat d'idées, de culture...} + {'text': 'Le sacre de philippe ier, 23 mai 1059 - Compte Rendu...' 
``` """ from .arrow_dataset import Dataset diff --git a/src/datasets/config.py b/src/datasets/config.py index 43801efcaef..47ec8868f17 100644 --- a/src/datasets/config.py +++ b/src/datasets/config.py @@ -255,6 +255,9 @@ GLOBBED_DATA_FILES_MAX_NUMBER_FOR_MODULE_INFERENCE = 10 ARCHIVED_DATA_FILES_MAX_NUMBER_FOR_MODULE_INFERENCE = 200 +# Async map functions +MAX_NUM_RUNNING_ASYNC_MAP_FUNCTIONS_IN_PARALLEL = 1000 + # Progress bars PBAR_REFRESH_TIME_INTERVAL = 0.05 # 20 progress updates per sec diff --git a/src/datasets/data_files.py b/src/datasets/data_files.py index 793c6ed8115..26987e299a2 100644 --- a/src/datasets/data_files.py +++ b/src/datasets/data_files.py @@ -386,7 +386,7 @@ def resolve_pattern( matched_paths = [ filepath if filepath.startswith(protocol_prefix) else protocol_prefix + filepath for filepath, info in fs.glob(pattern, detail=True, **glob_kwargs).items() - if info["type"] == "file" + if (info["type"] == "file" or (info.get("islink") and os.path.isfile(os.path.realpath(filepath)))) and (xbasename(filepath) not in files_to_ignore) and not _is_inside_unrequested_special_dir(filepath, fs_pattern) and not _is_unrequested_hidden_file_or_is_inside_unrequested_hidden_dir(filepath, fs_pattern) diff --git a/src/datasets/dataset_dict.py b/src/datasets/dataset_dict.py index b06f7ffb97c..251cbaa5055 100644 --- a/src/datasets/dataset_dict.py +++ b/src/datasets/dataset_dict.py @@ -89,7 +89,7 @@ def data(self) -> Dict[str, Table]: ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") >>> ds.data ``` """ @@ -104,7 +104,7 @@ def cache_files(self) -> Dict[str, Dict]: ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") >>> ds.cache_files {'test': [{'filename': '/root/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/40d411e45a6ce3484deed7cc15b82a53dad9a72aafd9f86f8f227134bec5ca46/rotten_tomatoes_movie_review-test.arrow'}], 'train': [{'filename': '/root/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/40d411e45a6ce3484deed7cc15b82a53dad9a72aafd9f86f8f227134bec5ca46/rotten_tomatoes_movie_review-train.arrow'}], @@ -122,7 +122,7 @@ def num_columns(self) -> Dict[str, int]: ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") >>> ds.num_columns {'test': 2, 'train': 2, 'validation': 2} ``` @@ -138,7 +138,7 @@ def num_rows(self) -> Dict[str, int]: ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") >>> ds.num_rows {'test': 1066, 'train': 8530, 'validation': 1066} ``` @@ -154,7 +154,7 @@ def column_names(self) -> Dict[str, List[str]]: ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") >>> ds.column_names {'test': ['text', 'label'], 'train': ['text', 'label'], @@ -172,7 +172,7 @@ def shape(self) -> Dict[str, Tuple[int]]: ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") >>> ds.shape {'test': (1066, 2), 'train': (8530, 2), 'validation': (1066, 2)} ``` @@ -189,7 +189,7 @@ def flatten(self, max_depth=16) -> "DatasetDict": ```py >>> from datasets import 
load_dataset - >>> ds = load_dataset("squad") + >>> ds = load_dataset("rajpurkar/squad") >>> ds["train"].features {'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None), 'context': Value(dtype='string', id=None), @@ -228,7 +228,7 @@ def unique(self, column: str) -> Dict[str, List]: ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") >>> ds.unique("label") {'test': [1, 0], 'train': [1, 0], 'validation': [1, 0]} ``` @@ -247,7 +247,7 @@ def cleanup_cache_files(self) -> Dict[str, int]: ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") >>> ds.cleanup_cache_files() {'test': 0, 'train': 0, 'validation': 0} ``` @@ -276,7 +276,7 @@ def cast(self, features: Features) -> "DatasetDict": ```py >>> from datasets import load_dataset, ClassLabel, Value - >>> ds = load_dataset("rotten_tomatoes") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") >>> ds["train"].features {'label': ClassLabel(names=['neg', 'pos'], id=None), 'text': Value(dtype='string', id=None)} @@ -308,7 +308,7 @@ def cast_column(self, column: str, feature) -> "DatasetDict": ```py >>> from datasets import load_dataset, ClassLabel - >>> ds = load_dataset("rotten_tomatoes") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") >>> ds["train"].features {'label': ClassLabel(names=['neg', 'pos'], id=None), 'text': Value(dtype='string', id=None)} @@ -342,7 +342,7 @@ def remove_columns(self, column_names: Union[str, List[str]]) -> "DatasetDict": ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") >>> ds = ds.remove_columns("label") DatasetDict({ train: Dataset({ @@ -382,7 +382,7 @@ def rename_column(self, original_column_name: str, new_column_name: str) -> "Dat ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") >>> ds = ds.rename_column("label", "label_new") DatasetDict({ train: Dataset({ @@ -425,7 +425,7 @@ def rename_columns(self, column_mapping: Dict[str, str]) -> "DatasetDict": ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") >>> ds.rename_columns({'text': 'text_new', 'label': 'label_new'}) DatasetDict({ train: Dataset({ @@ -461,7 +461,7 @@ def select_columns(self, column_names: Union[str, List[str]]) -> "DatasetDict": ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") >>> ds.select_columns("text") DatasetDict({ train: Dataset({ @@ -527,7 +527,7 @@ def formatted_as( Args: type (`str`, *optional*): - Output type selected in `[None, 'numpy', 'torch', 'tensorflow', 'pandas', 'arrow', 'jax']`. + Either output type selected in `[None, 'numpy', 'torch', 'tensorflow', 'jax', 'arrow', 'pandas', 'polars']`. `None` means `__getitem__` returns python objects (default). columns (`List[str]`, *optional*): Columns to format in the output. @@ -563,7 +563,7 @@ def set_format( Args: type (`str`, *optional*): - Output type selected in `[None, 'numpy', 'torch', 'tensorflow', 'pandas', 'arrow', 'jax']`. 
+ Either output type selected in `[None, 'numpy', 'torch', 'tensorflow', 'jax', 'arrow', 'pandas', 'polars']`. `None` means `__getitem__` returns python objects (default). columns (`List[str]`, *optional*): Columns to format in the output. @@ -608,7 +608,7 @@ def reset_format(self): ```py >>> from datasets import load_dataset >>> from transformers import AutoTokenizer - >>> ds = load_dataset("rotten_tomatoes") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") >>> ds = ds.map(lambda x: tokenizer(x["text"], truncation=True, padding=True), batched=True) >>> ds.set_format(type="numpy", columns=['input_ids', 'token_type_ids', 'attention_mask', 'label']) @@ -670,7 +670,7 @@ def with_format( Args: type (`str`, *optional*): - Output type selected in `[None, 'numpy', 'torch', 'tensorflow', 'pandas', 'arrow', 'jax']`. + Either output type selected in `[None, 'numpy', 'torch', 'tensorflow', 'jax', 'arrow', 'pandas', 'polars']`. `None` means `__getitem__` returns python objects (default). columns (`List[str]`, *optional*): Columns to format in the output. @@ -685,7 +685,7 @@ def with_format( ```py >>> from datasets import load_dataset >>> from transformers import AutoTokenizer - >>> ds = load_dataset("rotten_tomatoes") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") >>> ds = ds.map(lambda x: tokenizer(x['text'], truncation=True, padding=True), batched=True) >>> ds["train"].format @@ -755,7 +755,7 @@ def with_transform( ```py >>> from datasets import load_dataset >>> from transformers import AutoTokenizer - >>> ds = load_dataset("rotten_tomatoes") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") >>> def encode(example): ... return tokenizer(example['text'], truncation=True, padding=True, return_tensors="pt") @@ -799,10 +799,24 @@ def map( num_proc: Optional[int] = None, desc: Optional[str] = None, ) -> "DatasetDict": - """Apply a function to all the elements in the table (individually or in batches) - and update the table (if function does updated examples). + """ + Apply a function to all the examples in the table (individually or in batches) and update the table. + If your function returns a column that already exists, then it overwrites it. The transformation is applied to all the datasets of the dataset dictionary. + You can specify whether the function should be batched or not with the `batched` parameter: + + - If batched is `False`, then the function takes 1 example in and should return 1 example. + An example is a dictionary, e.g. `{"text": "Hello there !"}`. + - If batched is `True` and `batch_size` is 1, then the function takes a batch of 1 example as input and can return a batch with 1 or more examples. + A batch is a dictionary, e.g. a batch of 1 example is `{"text": ["Hello there !"]}`. + - If batched is `True` and `batch_size` is `n > 1`, then the function takes a batch of `n` examples as input and can return a batch with `n` examples, or with an arbitrary number of examples. + Note that the last batch may have less than `n` examples. + A batch is a dictionary, e.g. a batch of `n` examples is `{"text": ["Hello there !"] * n}`. + + If the function is asynchronous, then `map` will run your function in parallel, with up to one thousand simulatenous calls. 
+ It is recommended to use a `asyncio.Semaphore` in your function if you want to set a maximum number of operations that can run at the same time. + Args: function (`callable`): with one of the following signature: - `function(example: Dict[str, Any]) -> Dict[str, Any]` if `batched=False` and `with_indices=False` @@ -811,8 +825,9 @@ def map( - `function(batch: Dict[str, List], indices: List[int]) -> Dict[str, List]` if `batched=True` and `with_indices=True` For advanced usage, the function can also return a `pyarrow.Table`. + If the function is asynchronous, then `map` will run your function in parallel. Moreover if your function returns nothing (`None`), then `map` will run your function and return the dataset unchanged. - + If no function is provided, default to identity function: `lambda x: x`. with_indices (`bool`, defaults to `False`): Provide example indices to `function`. Note that in this case the signature of `function` should be `def function(example, idx): ...`. with_rank (`bool`, defaults to `False`): @@ -863,7 +878,7 @@ def map( ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") >>> def add_prefix(example): ... example["text"] = "Review: " + example["text"] ... return example @@ -975,7 +990,7 @@ def filter( ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") >>> ds.filter(lambda x: x["label"] == 1) DatasetDict({ train: Dataset({ @@ -1107,7 +1122,7 @@ def sort( ```py >>> from datasets import load_dataset - >>> ds = load_dataset('rotten_tomatoes') + >>> ds = load_dataset('cornell-movie-review-data/rotten_tomatoes') >>> ds['train']['label'][:10] [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] >>> sorted_ds = ds.sort('label') @@ -1183,7 +1198,7 @@ def shuffle( ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes") >>> ds["train"]["label"][:10] [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] @@ -1802,7 +1817,7 @@ def push_to_hub( create_pr=create_pr, ) logger.info( - f"Commit #{i+1} completed" + f"Commit #{i + 1} completed" + (f" (still {num_commits - i - 1} to go)" if num_commits - i - 1 else "") + "." ) @@ -1821,12 +1836,11 @@ def with_format( ) -> "IterableDatasetDict": """ Return a dataset with the specified format. - The 'pandas' format is currently not implemented. Args: type (`str`, *optional*): - Either output type selected in `[None, 'numpy', 'torch', 'tensorflow', 'arrow', 'jax']`. + Either output type selected in `[None, 'numpy', 'torch', 'tensorflow', 'jax', 'arrow', 'pandas', 'polars']`. `None` means it returns python objects (default). Example: @@ -1834,7 +1848,7 @@ def with_format( ```py >>> from datasets import load_dataset >>> from transformers import AutoTokenizer - >>> ds = load_dataset("rotten_tomatoes", split="validation", streaming=True) + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation", streaming=True) >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") >>> ds = ds.map(lambda x: tokenizer(x['text'], truncation=True, padding=True), batched=True) >>> ds = ds.with_format("torch") @@ -1889,6 +1903,9 @@ def map( Note that the last batch may have less than `n` examples. A batch is a dictionary, e.g. a batch of `n` examples is `{"text": ["Hello there !"] * n}`. 
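The batched-`map` rules spelled out just above (one example in and one out when `batched=False`; `n` examples in and an arbitrary number out when `batched=True`) are easiest to see with a function that splits each text into fixed-size chunks. A minimal sketch, not part of this PR; the 100-character chunk size and the toy data are illustrative:

```py
from datasets import Dataset

ds = Dataset.from_dict({"text": ["a" * 250, "b" * 120]})

def split_into_chunks(batch, chunk_size=100):
    # takes a batch of n examples and returns a batch with a different number of examples
    chunks = []
    for text in batch["text"]:
        chunks.extend(text[i : i + chunk_size] for i in range(0, len(text), chunk_size))
    return {"text": chunks}

chunked = ds.map(split_into_chunks, batched=True, remove_columns=ds.column_names)
print(len(ds), "->", len(chunked))  # 2 -> 5
```

The same rules apply to `DatasetDict.map` and to the iterable variants documented in this file.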
+ If the function is asynchronous, then `map` will run your function in parallel, with up to one thousand simulatenous calls. + It is recommended to use a `asyncio.Semaphore` in your function if you want to set a maximum number of operations that can run at the same time. + Args: function (`Callable`, *optional*, defaults to `None`): Function applied on-the-fly on the examples when you iterate on the dataset. @@ -1900,6 +1917,7 @@ def map( - `function(batch: Dict[str, List], indices: List[int]) -> Dict[str, List]` if `batched=True` and `with_indices=True` For advanced usage, the function can also return a `pyarrow.Table`. + If the function is asynchronous, then `map` will run your function in parallel. Moreover if your function returns nothing (`None`), then `map` will run your function and return the dataset unchanged. If no function is provided, default to identity function: `lambda x: x`. with_indices (`bool`, defaults to `False`): @@ -1925,7 +1943,7 @@ def map( ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", streaming=True) + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", streaming=True) >>> def add_prefix(example): ... example["text"] = "Review: " + example["text"] ... return example @@ -1990,7 +2008,7 @@ def filter( ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", streaming=True) + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", streaming=True) >>> ds = ds.filter(lambda x: x["label"] == 0) >>> list(ds["train"].take(3)) [{'label': 0, 'text': 'Review: simplistic , silly and tedious .'}, @@ -2048,7 +2066,7 @@ def shuffle( ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", streaming=True) + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", streaming=True) >>> list(ds["train"].take(3)) [{'label': 1, 'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}, @@ -2091,7 +2109,7 @@ def rename_column(self, original_column_name: str, new_column_name: str) -> "Ite ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", streaming=True) + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", streaming=True) >>> ds = ds.rename_column("text", "movie_review") >>> next(iter(ds["train"])) {'label': 1, @@ -2122,7 +2140,7 @@ def rename_columns(self, column_mapping: Dict[str, str]) -> "IterableDatasetDict ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", streaming=True) + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", streaming=True) >>> ds = ds.rename_columns({"text": "movie_review", "label": "rating"}) >>> next(iter(ds["train"])) {'movie_review': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', @@ -2151,7 +2169,7 @@ def remove_columns(self, column_names: Union[str, List[str]]) -> "IterableDatase ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", streaming=True) + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", streaming=True) >>> ds = ds.remove_columns("label") >>> next(iter(ds["train"])) {'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold 
schwarzenegger , jean-claud van damme or steven segal .'} @@ -2177,7 +2195,7 @@ def select_columns(self, column_names: Union[str, List[str]]) -> "IterableDatase ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", streaming=True) + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", streaming=True) >>> ds = ds.select("text") >>> next(iter(ds["train"])) {'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'} @@ -2202,7 +2220,7 @@ def cast_column(self, column: str, feature: FeatureType) -> "IterableDatasetDict ```py >>> from datasets import load_dataset, ClassLabel - >>> ds = load_dataset("rotten_tomatoes", streaming=True) + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", streaming=True) >>> ds["train"].features {'label': ClassLabel(names=['neg', 'pos'], id=None), 'text': Value(dtype='string', id=None)} @@ -2238,7 +2256,7 @@ def cast( ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", streaming=True) + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", streaming=True) >>> ds["train"].features {'label': ClassLabel(names=['neg', 'pos'], id=None), 'text': Value(dtype='string', id=None)} diff --git a/src/datasets/features/features.py b/src/datasets/features/features.py index ec7dc2a548c..c66feb5ba98 100644 --- a/src/datasets/features/features.py +++ b/src/datasets/features/features.py @@ -398,12 +398,15 @@ def _cast_to_python_objects(obj: Any, only_1d_for_numpy: bool, optimize_list_cas output[k] = casted_v return output if has_changed else obj, has_changed elif hasattr(obj, "__array__"): - return ( - _cast_to_python_objects( - obj.__array__(), only_1d_for_numpy=only_1d_for_numpy, optimize_list_casting=optimize_list_casting - )[0], - True, - ) + if np.isscalar(obj): + return obj, False + else: + return ( + _cast_to_python_objects( + obj.__array__(), only_1d_for_numpy=only_1d_for_numpy, optimize_list_casting=optimize_list_casting + )[0], + True, + ) elif isinstance(obj, (list, tuple)): if len(obj) > 0: for first_elmt in obj: @@ -1023,7 +1026,7 @@ def str2int(self, values: Union[str, Iterable]) -> Union[int, Iterable]: ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", split="train") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train") >>> ds.features["label"].str2int('neg') 0 ``` @@ -1070,7 +1073,7 @@ def int2str(self, values: Union[int, Iterable]) -> Union[str, Iterable]: ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", split="train") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train") >>> ds.features["label"].int2str(0) 'neg' ``` @@ -1341,7 +1344,7 @@ def encode_nested_example(schema, obj, level=0): break # be careful when comparing tensors here if ( - not isinstance(first_elmt, list) + not (isinstance(first_elmt, list) or np.isscalar(first_elmt)) or encode_nested_example(schema.feature, first_elmt, level=level + 1) != first_elmt ): return [encode_nested_example(schema.feature, o, level=level + 1) for o in obj] @@ -2107,7 +2110,7 @@ def copy(self) -> "Features": ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", split="train") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train") >>> copy_of_features = ds.features.copy() >>> copy_of_features 
{'label': ClassLabel(names=['neg', 'pos'], id=None), @@ -2168,8 +2171,8 @@ def recursive_reorder(source, target, stack=""): if sorted(source) != sorted(target): message = ( f"Keys mismatch: between {source} (source) and {target} (target).\n" - f"{source.keys()-target.keys()} are missing from target " - f"and {target.keys()-source.keys()} are missing from source" + stack_position + f"{source.keys() - target.keys()} are missing from target " + f"and {target.keys() - source.keys()} are missing from source" + stack_position ) raise ValueError(message) return {key: recursive_reorder(source[key], target[key], stack + f".{key}") for key in target} @@ -2204,7 +2207,7 @@ def flatten(self, max_depth=16) -> "Features": ```py >>> from datasets import load_dataset - >>> ds = load_dataset("squad", split="train") + >>> ds = load_dataset("rajpurkar/squad", split="train") >>> ds.features.flatten() {'answers.answer_start': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), 'answers.text': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), diff --git a/src/datasets/features/image.py b/src/datasets/features/image.py index c63d4d43964..0393689fc46 100644 --- a/src/datasets/features/image.py +++ b/src/datasets/features/image.py @@ -69,7 +69,7 @@ class Image: ```py >>> from datasets import load_dataset, Image - >>> ds = load_dataset("beans", split="train") + >>> ds = load_dataset("AI-Lab-Makerere/beans", split="train") >>> ds.features["image"] Image(decode=True, id=None) >>> ds[0]["image"] diff --git a/src/datasets/features/translation.py b/src/datasets/features/translation.py index 584bf3186c3..f1eae5d82e2 100644 --- a/src/datasets/features/translation.py +++ b/src/datasets/features/translation.py @@ -102,7 +102,7 @@ def encode_example(self, translation_dict): return translation_dict elif self.languages and set(translation_dict) - lang_set: raise ValueError( - f'Some languages in example ({", ".join(sorted(set(translation_dict) - lang_set))}) are not in valid set ({", ".join(lang_set)}).' + f"Some languages in example ({', '.join(sorted(set(translation_dict) - lang_set))}) are not in valid set ({', '.join(lang_set)})." 
) # Convert dictionary into tuples, splitting out cases where there are diff --git a/src/datasets/fingerprint.py b/src/datasets/fingerprint.py index da8a8d578f6..e0002639fe7 100644 --- a/src/datasets/fingerprint.py +++ b/src/datasets/fingerprint.py @@ -221,7 +221,7 @@ def generate_fingerprint(dataset: "Dataset") -> str: def generate_random_fingerprint(nbits: int = 64) -> str: - return f"{fingerprint_rng.getrandbits(nbits):0{nbits//4}x}" + return f"{fingerprint_rng.getrandbits(nbits):0{nbits // 4}x}" def update_fingerprint(fingerprint, transform, transform_args): diff --git a/src/datasets/formatting/__init__.py b/src/datasets/formatting/__init__.py index 8aa21d37bd2..9771618c7b9 100644 --- a/src/datasets/formatting/__init__.py +++ b/src/datasets/formatting/__init__.py @@ -22,6 +22,7 @@ Formatter, PandasFormatter, PythonFormatter, + TableFormatter, TensorFormatter, format_table, query_table, diff --git a/src/datasets/formatting/formatting.py b/src/datasets/formatting/formatting.py index 2dae3a52fd3..c0b31cc16c5 100644 --- a/src/datasets/formatting/formatting.py +++ b/src/datasets/formatting/formatting.py @@ -215,11 +215,14 @@ def extract_batch(self, pa_table: pa.Table) -> pd.DataFrame: class PythonFeaturesDecoder: - def __init__(self, features: Optional[Features]): + def __init__( + self, features: Optional[Features], token_per_repo_id: Optional[Dict[str, Union[str, bool, None]]] = None + ): self.features = features + self.token_per_repo_id = token_per_repo_id def decode_row(self, row: dict) -> dict: - return self.features.decode_example(row) if self.features else row + return self.features.decode_example(row, token_per_repo_id=self.token_per_repo_id) if self.features else row def decode_column(self, column: list, column_name: str) -> list: return self.features.decode_column(column, column_name) if self.features else column @@ -393,9 +396,14 @@ class Formatter(Generic[RowFormat, ColumnFormat, BatchFormat]): numpy_arrow_extractor = NumpyArrowExtractor pandas_arrow_extractor = PandasArrowExtractor - def __init__(self, features: Optional[Features] = None): + def __init__( + self, + features: Optional[Features] = None, + token_per_repo_id: Optional[Dict[str, Union[str, bool, None]]] = None, + ): self.features = features - self.python_features_decoder = PythonFeaturesDecoder(self.features) + self.token_per_repo_id = token_per_repo_id + self.python_features_decoder = PythonFeaturesDecoder(self.features, self.token_per_repo_id) self.pandas_features_decoder = PandasFeaturesDecoder(self.features) def __call__(self, pa_table: pa.Table, query_type: str) -> Union[RowFormat, ColumnFormat, BatchFormat]: @@ -421,7 +429,15 @@ def recursive_tensorize(self, data_struct: dict): raise NotImplementedError -class ArrowFormatter(Formatter[pa.Table, pa.Array, pa.Table]): +class TableFormatter(Formatter[RowFormat, ColumnFormat, BatchFormat]): + table_type: str + column_type: str + + +class ArrowFormatter(TableFormatter[pa.Table, pa.Array, pa.Table]): + table_type = "arrow table" + column_type = "arrow array" + def format_row(self, pa_table: pa.Table) -> pa.Table: return self.simple_arrow_extractor().extract_row(pa_table) @@ -433,8 +449,8 @@ def format_batch(self, pa_table: pa.Table) -> pa.Table: class PythonFormatter(Formatter[Mapping, list, Mapping]): - def __init__(self, features=None, lazy=False): - super().__init__(features) + def __init__(self, features=None, lazy=False, token_per_repo_id=None): + super().__init__(features, token_per_repo_id) self.lazy = lazy def format_row(self, pa_table: pa.Table) -> Mapping: 
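The `TableFormatter` base class introduced above gives the table-returning formatters (`ArrowFormatter`, `PandasFormatter`, and `PolarsFormatter` further down) a common parent with readable `table_type`/`column_type` labels, while `token_per_repo_id` is threaded through the formatters so feature decoding can authenticate against gated repositories. A rough sketch of the resulting hierarchy, assuming `ArrowFormatter` is importable from `datasets.formatting` like the other names in this diff; the toy table is illustrative:

```py
import pyarrow as pa

from datasets.formatting import ArrowFormatter, PandasFormatter, TableFormatter

table = pa.table({"text": ["hello", "world"], "label": [0, 1]})

assert issubclass(ArrowFormatter, TableFormatter)
assert issubclass(PandasFormatter, TableFormatter)
print(ArrowFormatter.table_type, "/", ArrowFormatter.column_type)    # arrow table / arrow array
print(PandasFormatter.table_type, "/", PandasFormatter.column_type)  # pandas dataframe / pandas series

print(ArrowFormatter().format_row(table))     # a single-row pyarrow.Table
print(PandasFormatter().format_batch(table))  # a pandas.DataFrame with both rows
```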
@@ -457,7 +473,10 @@ def format_batch(self, pa_table: pa.Table) -> Mapping: return batch -class PandasFormatter(Formatter[pd.DataFrame, pd.Series, pd.DataFrame]): +class PandasFormatter(TableFormatter[pd.DataFrame, pd.Series, pd.DataFrame]): + table_type = "pandas dataframe" + column_type = "pandas series" + def format_row(self, pa_table: pa.Table) -> pd.DataFrame: row = self.pandas_arrow_extractor().extract_row(pa_table) row = self.pandas_features_decoder.decode_row(row) @@ -484,8 +503,8 @@ class CustomFormatter(Formatter[dict, ColumnFormat, dict]): to return. """ - def __init__(self, transform: Callable[[dict], dict], features=None, **kwargs): - super().__init__(features=features) + def __init__(self, transform: Callable[[dict], dict], features=None, token_per_repo_id=None, **kwargs): + super().__init__(features=features, token_per_repo_id=token_per_repo_id) self.transform = transform def format_row(self, pa_table: pa.Table) -> dict: diff --git a/src/datasets/formatting/jax_formatter.py b/src/datasets/formatting/jax_formatter.py index e247b7b5822..7dbfdd1322b 100644 --- a/src/datasets/formatting/jax_formatter.py +++ b/src/datasets/formatting/jax_formatter.py @@ -36,8 +36,8 @@ class JaxFormatter(TensorFormatter[Mapping, "jax.Array", Mapping]): - def __init__(self, features=None, device=None, **jnp_array_kwargs): - super().__init__(features=features) + def __init__(self, features=None, device=None, token_per_repo_id=None, **jnp_array_kwargs): + super().__init__(features=features, token_per_repo_id=token_per_repo_id) import jax from jaxlib.xla_client import Device diff --git a/src/datasets/formatting/np_formatter.py b/src/datasets/formatting/np_formatter.py index 032758bce21..81c8e0605ae 100644 --- a/src/datasets/formatting/np_formatter.py +++ b/src/datasets/formatting/np_formatter.py @@ -24,8 +24,8 @@ class NumpyFormatter(TensorFormatter[Mapping, np.ndarray, Mapping]): - def __init__(self, features=None, **np_array_kwargs): - super().__init__(features=features) + def __init__(self, features=None, token_per_repo_id=None, **np_array_kwargs): + super().__init__(features=features, token_per_repo_id=token_per_repo_id) self.np_array_kwargs = np_array_kwargs def _consolidate(self, column): diff --git a/src/datasets/formatting/polars_formatter.py b/src/datasets/formatting/polars_formatter.py index 543bde52dd0..7ea2f783aec 100644 --- a/src/datasets/formatting/polars_formatter.py +++ b/src/datasets/formatting/polars_formatter.py @@ -13,7 +13,6 @@ # limitations under the License. 
import sys -from collections.abc import Mapping from functools import partial from typing import TYPE_CHECKING, Optional @@ -23,7 +22,7 @@ from ..features import Features from ..features.features import decode_nested_example from ..utils.py_utils import no_op_if_value_is_null -from .formatting import BaseArrowExtractor, TensorFormatter +from .formatting import BaseArrowExtractor, TableFormatter if TYPE_CHECKING: @@ -98,7 +97,10 @@ def decode_batch(self, batch: "pl.DataFrame") -> "pl.DataFrame": return self.decode_row(batch) -class PolarsFormatter(TensorFormatter[Mapping, "pl.DataFrame", Mapping]): +class PolarsFormatter(TableFormatter["pl.DataFrame", "pl.Series", "pl.DataFrame"]): + table_type = "polars dataframe" + column_type = "polars series" + def __init__(self, features=None, **np_array_kwargs): super().__init__(features=features) self.np_array_kwargs = np_array_kwargs diff --git a/src/datasets/formatting/tf_formatter.py b/src/datasets/formatting/tf_formatter.py index 9f0c06ec82a..d8f12e5d9a7 100644 --- a/src/datasets/formatting/tf_formatter.py +++ b/src/datasets/formatting/tf_formatter.py @@ -30,8 +30,8 @@ class TFFormatter(TensorFormatter[Mapping, "tf.Tensor", Mapping]): - def __init__(self, features=None, **tf_tensor_kwargs): - super().__init__(features=features) + def __init__(self, features=None, token_per_repo_id=None, **tf_tensor_kwargs): + super().__init__(features=features, token_per_repo_id=token_per_repo_id) self.tf_tensor_kwargs = tf_tensor_kwargs import tensorflow as tf # noqa: F401 - import tf at initialization diff --git a/src/datasets/formatting/torch_formatter.py b/src/datasets/formatting/torch_formatter.py index 051badb0ac4..ad67984dd0b 100644 --- a/src/datasets/formatting/torch_formatter.py +++ b/src/datasets/formatting/torch_formatter.py @@ -30,8 +30,8 @@ class TorchFormatter(TensorFormatter[Mapping, "torch.Tensor", Mapping]): - def __init__(self, features=None, **torch_tensor_kwargs): - super().__init__(features=features) + def __init__(self, features=None, token_per_repo_id=None, **torch_tensor_kwargs): + super().__init__(features=features, token_per_repo_id=token_per_repo_id) self.torch_tensor_kwargs = torch_tensor_kwargs import torch # noqa import torch at initialization diff --git a/src/datasets/info.py b/src/datasets/info.py index d9e4cad598f..57137001fad 100644 --- a/src/datasets/info.py +++ b/src/datasets/info.py @@ -200,7 +200,7 @@ def write_to_directory(self, dataset_info_dir, pretty_print=False, storage_optio ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", split="validation") + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation") >>> ds.info.write_to_directory("/path/to/directory/") ``` """ diff --git a/src/datasets/inspect.py b/src/datasets/inspect.py index a383ef5b4b6..f039257752e 100644 --- a/src/datasets/inspect.py +++ b/src/datasets/inspect.py @@ -55,7 +55,7 @@ def get_dataset_infos( - a local path to processing script or the directory containing the script (if the script has the same name as the directory), e.g. `'./dataset/squad'` or `'./dataset/squad/squad.py'` - a dataset identifier on the Hugging Face Hub (list all available datasets and ids with [`huggingface_hub.list_datasets`]), - e.g. `'squad'`, `'glue'` or``'openai/webtext'` + e.g. `'rajpurkar/squad'`, `'nyu-mll/glue'` or``'openai/webtext'` revision (`Union[str, datasets.Version]`, *optional*): If specified, the dataset module will be loaded from the datasets repository at this version. 
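With `PolarsFormatter` now a `TableFormatter` and `'polars'` added to the output types listed in the `set_format`/`with_format` docstrings earlier in this diff, a dataset formatted as `'polars'` should hand back Polars objects directly. A short sketch, assuming `polars` is installed; the toy data is illustrative:

```py
import polars as pl

from datasets import Dataset

ds = Dataset.from_dict({"text": ["good movie", "bad movie"], "label": [1, 0]})
ds = ds.with_format("polars")

batch = ds[:2]                             # a Polars DataFrame
column = ds["label"]                       # a Polars Series
print(batch.filter(pl.col("label") == 1))  # regular Polars operations work on the returned frame
print(column.sum())
```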
By default: @@ -78,7 +78,7 @@ def get_dataset_infos( ```py >>> from datasets import get_dataset_infos - >>> get_dataset_infos('rotten_tomatoes') + >>> get_dataset_infos('cornell-movie-review-data/rotten_tomatoes') {'default': DatasetInfo(description="Movie Review Dataset.\nThis is a dataset of containing 5,331 positive and 5,331 negative processed\nsentences from Rotten Tomatoes movie reviews...), ...} ``` """ @@ -122,7 +122,7 @@ def get_dataset_config_names( - a local path to processing script or the directory containing the script (if the script has the same name as the directory), e.g. `'./dataset/squad'` or `'./dataset/squad/squad.py'` - a dataset identifier on the Hugging Face Hub (list all available datasets and ids with [`huggingface_hub.list_datasets`]), - e.g. `'squad'`, `'glue'` or `'openai/webtext'` + e.g. `'rajpurkar/squad'`, `'nyu-mll/glue'` or `'openai/webtext'` revision (`Union[str, datasets.Version]`, *optional*): If specified, the dataset module will be loaded from the datasets repository at this version. By default: @@ -146,7 +146,7 @@ def get_dataset_config_names( ```py >>> from datasets import get_dataset_config_names - >>> get_dataset_config_names("glue") + >>> get_dataset_config_names("nyu-mll/glue") ['cola', 'sst2', 'mrpc', @@ -194,7 +194,7 @@ def get_dataset_default_config_name( - a local path to processing script or the directory containing the script (if the script has the same name as the directory), e.g. `'./dataset/squad'` or `'./dataset/squad/squad.py'` - a dataset identifier on the Hugging Face Hub (list all available datasets and ids with [`huggingface_hub.list_datasets`]), - e.g. `'squad'`, `'glue'` or `'openai/webtext'` + e.g. `'rajpurkar/squad'`, `'nyu-mll/glue'` or `'openai/webtext'` revision (`Union[str, datasets.Version]`, *optional*): If specified, the dataset module will be loaded from the datasets repository at this version. By default: @@ -261,7 +261,7 @@ def get_dataset_config_info( - a local path to processing script or the directory containing the script (if the script has the same name as the directory), e.g. ``'./dataset/squad'`` or ``'./dataset/squad/squad.py'`` - a dataset identifier on the Hugging Face Hub (list all available datasets and ids with [`huggingface_hub.list_datasets`]), - e.g. ``'squad'``, ``'glue'`` or ``'openai/webtext'`` + e.g. ``'rajpurkar/squad'``, ``'nyu-mll/glue'`` or ``'openai/webtext'`` config_name (:obj:`str`, optional): Defining the name of the dataset configuration. data_files (:obj:`str` or :obj:`Sequence` or :obj:`Mapping`, optional): Path(s) to source data file(s). download_config (:class:`~download.DownloadConfig`, optional): Specific download configuration parameters. @@ -322,7 +322,7 @@ def get_dataset_split_names( - a local path to processing script or the directory containing the script (if the script has the same name as the directory), e.g. `'./dataset/squad'` or `'./dataset/squad/squad.py'` - a dataset identifier on the Hugging Face Hub (list all available datasets and ids with [`huggingface_hub.list_datasets`]), - e.g. `'squad'`, `'glue'` or `'openai/webtext'` + e.g. `'rajpurkar/squad'`, `'nyu-mll/glue'` or `'openai/webtext'` config_name (`str`, *optional*): Defining the name of the dataset configuration. 
data_files (`str` or `Sequence` or `Mapping`, *optional*): @@ -345,7 +345,7 @@ def get_dataset_split_names( ```py >>> from datasets import get_dataset_split_names - >>> get_dataset_split_names('rotten_tomatoes') + >>> get_dataset_split_names('cornell-movie-review-data/rotten_tomatoes') ['train', 'validation', 'test'] ``` """ diff --git a/src/datasets/iterable_dataset.py b/src/datasets/iterable_dataset.py index 382b20915bf..38fc2c1a922 100644 --- a/src/datasets/iterable_dataset.py +++ b/src/datasets/iterable_dataset.py @@ -1,4 +1,6 @@ +import asyncio import copy +import inspect import itertools import sys from collections import Counter @@ -10,13 +12,27 @@ import fsspec.asyn import numpy as np +import pandas as pd import pyarrow as pa from . import config from .arrow_dataset import Dataset, DatasetInfoMixin from .features import Features -from .features.features import FeatureType, _align_features, _check_if_features_can_be_aligned, cast_to_python_objects -from .formatting import PythonFormatter, TensorFormatter, get_format_type_from_alias, get_formatter +from .features.features import ( + FeatureType, + Value, + _align_features, + _check_if_features_can_be_aligned, + cast_to_python_objects, +) +from .formatting import ( + ArrowFormatter, + PythonFormatter, + TableFormatter, + TensorFormatter, + get_format_type_from_alias, + get_formatter, +) from .info import DatasetInfo from .splits import NamedSplit, Split from .table import cast_table_to_features, read_schema_from_file, table_cast @@ -79,7 +95,7 @@ def _examples_to_batch(examples: List[Dict[str, Any]]) -> Dict[str, list]: def _batch_to_examples(batch: Dict[str, list]) -> Iterator[Dict[str, Any]]: """Convert a batch (dict of examples) to examples list""" - n_examples = len(batch[next(iter(batch))]) + n_examples = 0 if len(batch) == 0 else len(batch[next(iter(batch))]) for i in range(n_examples): yield {col: array[i] for col, array in batch.items()} @@ -134,6 +150,10 @@ def iter_arrow(self) -> Optional[Callable[[], Iterator[Tuple[Key, pa.Table]]]]: def is_typed(self) -> bool: return False + @property + def features(self) -> Optional[Features]: + return None + def shuffle_data_sources(self, generator: np.random.Generator) -> "_BaseExamplesIterable": """ Either shuffle the shards/sources of the dataset, or propagate the shuffling to the underlying iterable. 
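The `_batch_to_examples` change above guards against empty batches (`n_examples = 0 if len(batch) == 0 else ...`), which previously tripped over `next(iter(batch))`. A tiny sketch of the round-trip behaviour using these private helpers (internal API, so the import path is not guaranteed to stay stable):

```py
from datasets.iterable_dataset import _batch_to_examples, _examples_to_batch

examples = [{"text": "a", "label": 0}, {"text": "b", "label": 1}]
batch = _examples_to_batch(examples)          # {"text": ["a", "b"], "label": [0, 1]}

assert list(_batch_to_examples(batch)) == examples
assert list(_batch_to_examples({})) == []     # an empty batch now yields nothing instead of raising
```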
@@ -408,6 +428,10 @@ def iter_arrow(self): def is_typed(self): return self.ex_iterable.is_typed + @property + def features(self): + return self.ex_iterable.features + def _init_state_dict(self) -> dict: self._state_dict = { "ex_iterable": self.ex_iterable._init_state_dict(), @@ -539,6 +563,10 @@ def iter_arrow(self): def is_typed(self): return self.ex_iterable.is_typed + @property + def features(self): + return self.ex_iterable.features + def _init_state_dict(self) -> dict: self._state_dict = self.ex_iterable._init_state_dict() return self._state_dict @@ -577,6 +605,10 @@ def __init__(self, ex_iterable: _BaseExamplesIterable, step: int, offset: int): def is_typed(self): return self.ex_iterable.is_typed + @property + def features(self): + return self.ex_iterable.features + def _init_state_dict(self) -> dict: self._state_dict = self.ex_iterable._init_state_dict() return self._state_dict @@ -626,6 +658,10 @@ def __init__( def is_typed(self): return self.ex_iterables[0].is_typed + @property + def features(self): + return self.ex_iterables[0].features + def _get_indices_iterator(self): # this is an infinite iterator to keep track of which iterator we want to pick examples from ex_iterable_idx = self._state_dict["ex_iterable_idx"] if self._state_dict else 0 @@ -709,12 +745,12 @@ class VerticallyConcatenatedMultiSourcesExamplesIterable(_BaseExamplesIterable): """ VerticallyConcatenatedMultiSourcesExamplesIterable simply chains the input iterables. It doesn't require the examples iterables to always yield the same columns. - Instead, this is handled by the `IterableDataset` class or `TypedExamplesIterable`. + Instead, this is handled by the `IterableDataset` class or `FormattedExamplesIterable`. For information, `IterableDataset` merges the features of all the datasets to concatenate into one. We use `IterableDataset._resolve_features` to obtain the features of all the datasets to concatenate. - Then for each example, `IterableDataset` and `TypedExamplesIterable` automatically fill missing columns with None. + Then for each example, `IterableDataset` and `FormattedExamplesIterable` automatically fill missing columns with None. This is done with `_apply_feature_types_on_example`. """ @@ -726,6 +762,10 @@ def __init__(self, ex_iterables: List[_BaseExamplesIterable]): def is_typed(self): return self.ex_iterables[0].is_typed + @property + def features(self): + return self.ex_iterables[0].features + @property def iter_arrow(self): if all(ex_iterable.iter_arrow is not None for ex_iterable in self.ex_iterables): @@ -792,12 +832,12 @@ class HorizontallyConcatenatedMultiSourcesExamplesIterable(_BaseExamplesIterable This check is done once when yielding the first example. However it doesn't fill missing columns with None. - Instead, this is handled by the `IterableDataset` class or `TypedExamplesIterable`. + Instead, this is handled by the `IterableDataset` class or `FormattedExamplesIterable`. For information, `IterableDataset` merges the features of all the datasets to concatenate into one. We use `IterableDataset._resolve_features` to obtain the features of all the datasets to concatenate. - Then for each example, `IterableDataset` and `TypedExamplesIterable` automatically fill missing columns with None. + Then for each example, `IterableDataset` and `FormattedExamplesIterable` automatically fill missing columns with None. This is done with `_apply_feature_types_on_example`. 
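The concatenation docstrings above describe how, for vertical concatenation, the merged features are resolved and missing columns are filled with `None` by `IterableDataset` and the renamed `FormattedExamplesIterable`. A short user-level sketch of that behavior, assuming the usual `concatenate_datasets` entry point and made-up column names:

```python
from datasets import Dataset, concatenate_datasets

ds_a = Dataset.from_dict({"a": [0, 1]}).to_iterable_dataset()
ds_b = Dataset.from_dict({"b": ["x", "y"]}).to_iterable_dataset()

# vertical concatenation chains the two sources; their features are merged,
# and each example gets None for the columns its own source does not provide
combined = concatenate_datasets([ds_a, ds_b])
print(list(combined))
# expected to look roughly like:
# [{'a': 0, 'b': None}, {'a': 1, 'b': None}, {'a': None, 'b': 'x'}, {'a': None, 'b': 'y'}]
```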
""" @@ -810,6 +850,10 @@ def __init__(self, ex_iterables: List[_BaseExamplesIterable]): def is_typed(self): return self.ex_iterables[0].is_typed + @property + def features(self): + return self.ex_iterables[0].features + def _init_state_dict(self) -> dict: self._state_dict = {"ex_iterables": [ex_iterable._init_state_dict() for ex_iterable in self.ex_iterables]} return self._state_dict @@ -873,6 +917,10 @@ def __init__( def is_typed(self): return self.ex_iterables[0].is_typed + @property + def features(self): + return self.ex_iterables[0].features + def _get_indices_iterator(self): rng = deepcopy(self.generator) num_sources = len(self.ex_iterables) @@ -934,6 +982,19 @@ def shard_data_sources( ) +def _table_output_to_arrow(output) -> pa.Table: + if isinstance(output, pa.Table): + return output + if isinstance(output, (pd.DataFrame, pd.Series)): + return pa.Table.from_pandas(output) + if config.POLARS_AVAILABLE and "polars" in sys.modules: + import polars as pl + + if isinstance(output, (pl.DataFrame, pl.Series)): + return output.to_arrow() + return output + + class MappedExamplesIterable(_BaseExamplesIterable): def __init__( self, @@ -947,6 +1008,7 @@ def __init__( remove_columns: Optional[List[str]] = None, fn_kwargs: Optional[dict] = None, formatting: Optional["FormattingConfig"] = None, + features: Optional[Features] = None, ): super().__init__() self.ex_iterable = ex_iterable @@ -958,29 +1020,34 @@ def __init__( self.with_indices = with_indices self.input_columns = input_columns self.fn_kwargs = fn_kwargs or {} - self.formatting = formatting + self.formatting = formatting # required for iter_arrow + self._features = features # sanity checks - if formatting and formatting.format_type == "arrow": + if formatting and formatting.is_table: # batch_size should match for iter_arrow if not isinstance(ex_iterable, RebatchedArrowExamplesIterable): raise ValueError( - "The Arrow-formatted MappedExamplesIterable has underlying iterable" + f"The {formatting.format_type.capitalize()}-formatted {type(self).__name__} has underlying iterable" f"that is a {type(ex_iterable).__name__} instead of a RebatchedArrowExamplesIterable." ) elif ex_iterable.batch_size != (batch_size if batched else 1): raise ValueError( - f"The Arrow-formatted MappedExamplesIterable has batch_size={batch_size if batched else 1} which is" + f"The {formatting.format_type.capitalize()}-formatted {type(self).__name__} has batch_size={batch_size if batched else 1} which is" f"different from {ex_iterable.batch_size=} from its underlying iterable." 
) @property def iter_arrow(self): - if self.formatting and self.formatting.format_type == "arrow": + if self.formatting and self.formatting.is_table: return self._iter_arrow @property def is_typed(self): - return False + return self.features is not None # user has extracted features + + @property + def features(self): + return self._features def _init_state_dict(self) -> dict: self._state_dict = { @@ -992,7 +1059,7 @@ def _init_state_dict(self) -> dict: return self._state_dict def __iter__(self): - if self.formatting and self.formatting.format_type == "arrow": + if self.formatting and self.formatting.is_table: formatter = PythonFormatter() for key, pa_table in self._iter_arrow(max_chunksize=1): yield key, formatter.format_row(pa_table) @@ -1016,11 +1083,7 @@ def _iter(self): else: format_dict = None - if self.batched: - if self._state_dict: - self._state_dict["previous_state"] = self.ex_iterable.state_dict() - self._state_dict["num_examples_since_previous_state"] = 0 - self._state_dict["previous_state_example_idx"] = current_idx + def iter_batched_inputs(): for key, example in iterator: # If `batched`, first build the batch, if `batch_size` is None or <=0, then the batch is the whole dataset iterator_batch = ( @@ -1030,6 +1093,8 @@ def _iter(self): ) key_examples_list = [(key, example)] + list(iterator_batch) keys, examples = zip(*key_examples_list) + # the new key is the concatenation of the examples keys from the batch + key = "_".join(str(key) for key in keys) if ( self.drop_last_batch and self.batch_size is not None @@ -1039,30 +1104,109 @@ def _iter(self): return batch = _examples_to_batch(examples) batch = format_dict(batch) if format_dict else batch - # then apply the transform - inputs = batch - function_args = [inputs] if self.input_columns is None else [inputs[col] for col in self.input_columns] - if self.with_indices: - function_args.append([current_idx + i for i in range(len(key_examples_list))]) - transformed_batch = dict(batch) # this will be updated with the function output - transformed_batch.update(self.function(*function_args, **self.fn_kwargs)) - # then remove the unwanted columns - if self.remove_columns: - for c in self.remove_columns: - del transformed_batch[c] - if transformed_batch: - first_col = next(iter(transformed_batch)) - bad_cols = [ - col - for col in transformed_batch - if len(transformed_batch[col]) != len(transformed_batch[first_col]) - ] - if bad_cols: - raise ValueError( - f"Column lengths mismatch: columns {bad_cols} have length {[len(transformed_batch[col]) for col in bad_cols]} while {first_col} has length {len(transformed_batch[first_col])}." + indices = [current_idx + i for i in range(len(key_examples_list))] + yield indices, (key, batch) + + def iter_inputs(): + for key, example in iterator: + # If not batched, we can apply the transform and yield the example directly + # first copy the example, since we might drop some keys + example = dict(example) + example = format_dict(example) if format_dict else example + yield current_idx, (key, example) + + def validate_function_output(processed_inputs): + if self.batched and processed_inputs: + first_col = next(iter(processed_inputs)) + bad_cols = [ + col for col in processed_inputs if len(processed_inputs[col]) != len(processed_inputs[first_col]) + ] + if bad_cols: + raise ValueError( + f"Column lengths mismatch: columns {bad_cols} have length {[len(processed_inputs[col]) for col in bad_cols]} " + f"while {first_col} has length {len(processed_inputs[first_col])}." 
+ ) + + def prepare_inputs(key_example, indices): + key, example = key_example + fn_args = [example] if self.input_columns is None else [example[col] for col in self.input_columns] + additional_args = () + if self.with_indices: + fn_args += (indices,) + inputs = dict(example) + return inputs, fn_args, additional_args, self.fn_kwargs + + def prepare_outputs(key_example, inputs, processed_inputs): + validate_function_output(processed_inputs) + # this logic mimics the one in Dataset.map + if self.remove_columns: + for c in self.remove_columns: + if c in inputs: + del inputs[c] + if processed_inputs is key_example[1] and c in processed_inputs: + del processed_inputs[c] + transformed_inputs = {**inputs, **processed_inputs} + if self.features: + for c in self.features.keys(): + if c not in transformed_inputs: + transformed_inputs[c] = ( + [None] * len(transformed_inputs[next(iter(processed_inputs))]) if self.batched else None ) - # the new key is the concatenation of the examples keys from the batch - new_key = "_".join(str(key) for key in keys) + transformed_inputs = ( + self.features.decode_batch(transformed_inputs) + if self.batched + else self.features.decode_example(transformed_inputs) + ) + return transformed_inputs + + def apply_function(key_example, indices): + """Utility to apply the function on a selection of columns.""" + inputs, fn_args, additional_args, fn_kwargs = prepare_inputs(key_example, indices) + processed_inputs = self.function(*fn_args, *additional_args, **fn_kwargs) + return prepare_outputs(key_example, inputs, processed_inputs) + + async def async_apply_function(key_example, indices): + """Utility to apply the function on a selection of columns. Same code but async""" + inputs, fn_args, additional_args, fn_kwargs = prepare_inputs(key_example, indices) + processed_inputs = await self.function(*fn_args, *additional_args, **fn_kwargs) + return prepare_outputs(key_example, inputs, processed_inputs) + + def iter_outputs(): + inputs_iterator = iter_batched_inputs() if self.batched else iter_inputs() + if inspect.iscoroutinefunction(self.function): + indices: Union[List[int], List[List[int]]] = [] + tasks: List[asyncio.Task] = [] + try: + loop = asyncio.get_running_loop() + except RuntimeError: + loop = asyncio.new_event_loop() + for i, key_example in inputs_iterator: + indices.append(i) + tasks.append(loop.create_task(async_apply_function(key_example, i))) + # keep the total active tasks under a certain number + if len(tasks) >= config.MAX_NUM_RUNNING_ASYNC_MAP_FUNCTIONS_IN_PARALLEL: + done, pending = loop.run_until_complete( + asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED) + ) + while tasks and len(pending) >= config.MAX_NUM_RUNNING_ASYNC_MAP_FUNCTIONS_IN_PARALLEL: + done, pending = loop.run_until_complete( + asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED) + ) + # yield finished tasks + while tasks and tasks[0].done(): + yield indices.pop(0), tasks.pop(0).result() + while tasks: + yield indices.pop(0), loop.run_until_complete(tasks.pop(0)) + else: + for i, key_example in inputs_iterator: + yield i, apply_function(key_example, i) + + if self.batched: + if self._state_dict: + self._state_dict["previous_state"] = self.ex_iterable.state_dict() + self._state_dict["num_examples_since_previous_state"] = 0 + self._state_dict["previous_state_example_idx"] = current_idx + for key, transformed_batch in iter_outputs(): # yield one example at a time from the transformed batch for example in _batch_to_examples(transformed_batch): current_idx += 1 @@ -1071,34 +1215,20 @@ 
def _iter(self): if num_examples_to_skip > 0: num_examples_to_skip -= 1 continue - yield new_key, example + yield key, example if self._state_dict: self._state_dict["previous_state"] = self.ex_iterable.state_dict() self._state_dict["num_examples_since_previous_state"] = 0 self._state_dict["previous_state_example_idx"] = current_idx else: - for key, example in iterator: - # If not batched, we can apply the transform and yield the example directly - # first copy the example, since we might drop some keys - example = dict(example) - example = format_dict(example) if format_dict else example - # then apply the transform - inputs = example - function_args = [inputs] if self.input_columns is None else [inputs[col] for col in self.input_columns] - if self.with_indices: - function_args.append(current_idx) - transformed_example = dict(example) # this will be updated with the function output - transformed_example.update(self.function(*function_args, **self.fn_kwargs)) - # then we remove the unwanted columns - if self.remove_columns: - for c in self.remove_columns: - del transformed_example[c] + for key, transformed_example in iter_outputs(): current_idx += 1 if self._state_dict: self._state_dict["previous_state_example_idx"] += 1 yield key, transformed_example def _iter_arrow(self, max_chunksize: Optional[int] = None) -> Iterator[Tuple[Key, pa.Table]]: + formatter: TableFormatter = get_formatter(self.formatting.format_type) if self.formatting else ArrowFormatter() if self.ex_iterable.iter_arrow: iterator = self.ex_iterable.iter_arrow() else: @@ -1125,17 +1255,23 @@ def _iter_arrow(self, max_chunksize: Optional[int] = None) -> Iterator[Tuple[Key ): return # first build the batch - function_args = [pa_table] if self.input_columns is None else [pa_table[col] for col in self.input_columns] + function_args = ( + [formatter.format_batch(pa_table)] + if self.input_columns is None + else [pa_table[col] for col in self.input_columns] + ) if self.with_indices: if self.batched: function_args.append([current_idx + i for i in range(len(pa_table))]) else: function_args.append(current_idx) # then apply the transform - output_table = self.function(*function_args, **self.fn_kwargs) + output = self.function(*function_args, **self.fn_kwargs) + output_table = _table_output_to_arrow(output) if not isinstance(output_table, pa.Table): raise TypeError( - f"Provided `function` which is applied to pyarrow tables returns a variable of type {type(output_table)}. Make sure provided `function` returns a a pyarrow table to update the dataset." + f"Provided `function` which is applied to {formatter.table_type} returns a variable of type " + f"{type(output)}. Make sure provided `function` returns a {formatter.table_type} to update the dataset." 
) # we don't need to merge results for consistency with Dataset.map which merges iif both input and output are dicts # then remove the unwanted columns @@ -1176,6 +1312,7 @@ def shuffle_data_sources(self, generator: np.random.Generator) -> "MappedExample remove_columns=self.remove_columns, fn_kwargs=self.fn_kwargs, formatting=self.formatting, + features=self.features, ) def shard_data_sources(self, num_shards: int, index: int, contiguous=True) -> "MappedExamplesIterable": @@ -1191,6 +1328,7 @@ def shard_data_sources(self, num_shards: int, index: int, contiguous=True) -> "M remove_columns=self.remove_columns, fn_kwargs=self.fn_kwargs, formatting=self.formatting, + features=self.features, ) @property @@ -1198,7 +1336,34 @@ def num_shards(self) -> int: return self.ex_iterable.num_shards -class FilteredExamplesIterable(_BaseExamplesIterable): +def _add_mask( + input: Union[dict, pa.Table], + mask: Union[bool, list, pa.Array, pa.ChunkedArray, pa.BooleanScalar], + mask_column_name: str, +): + if isinstance(input, pa.Table): + if not isinstance(mask, (list, pa.Array, pa.ChunkedArray)): + mask = pa.array([mask], type=pa.bool_()) + return input.append_column(mask_column_name, mask) + else: + return {mask_column_name: mask} + + +def add_mask(mask_function: Callable, input: Union[dict, pa.Table], *args, mask_column_name: str, **kwargs): + mask = mask_function(input, *args, **kwargs) + return _add_mask(input, mask, mask_column_name) + + +async def async_add_mask( + mask_function: Callable, input: Union[dict, pa.Table], *args, mask_column_name: str, **kwargs +): + mask = await mask_function(input, *args, **kwargs) + return _add_mask(input, mask, mask_column_name) + + +class FilteredExamplesIterable(MappedExamplesIterable): + mask_column_name = "===MASK===" + def __init__( self, ex_iterable: _BaseExamplesIterable, @@ -1210,203 +1375,62 @@ def __init__( fn_kwargs: Optional[dict] = None, formatting: Optional["FormattingConfig"] = None, ): - super().__init__() - self.ex_iterable = ex_iterable - self.function = function - self.batched = batched - self.batch_size = batch_size - self.with_indices = with_indices - self.input_columns = input_columns - self.fn_kwargs = fn_kwargs or {} - self.formatting = formatting - # sanity checks - if formatting and formatting.format_type == "arrow": - # batch_size should match for iter_arrow - if not isinstance(ex_iterable, RebatchedArrowExamplesIterable): - raise ValueError( - "The Arrow-formatted FilteredExamplesIterable has underlying iterable" - f"that is a {type(ex_iterable).__name__} instead of a RebatchedArrowExamplesIterable." - ) - elif ex_iterable.batch_size != (batch_size if batched else 1): - raise ValueError( - f"The Arrow-formatted FilteredExamplesIterable has batch_size={batch_size if batched else 1} which is" - f"different from {ex_iterable.batch_size=} from its underlying iterable." 
- ) - - @property - def iter_arrow(self): - if self.formatting and self.formatting.format_type == "arrow": - return self._iter_arrow - - @property - def is_typed(self): - return self.ex_iterable.is_typed - - def _init_state_dict(self) -> dict: - self._state_dict = { - "ex_iterable": self.ex_iterable._init_state_dict(), - "previous_state": None, - "num_examples_since_previous_state": 0, - "previous_state_example_idx": 0, - } - return self._state_dict - - def __iter__(self): - if self.formatting and self.formatting.format_type == "arrow": - formatter = PythonFormatter() - for key, pa_table in self._iter_arrow(max_chunksize=1): - yield key, formatter.format_row(pa_table) + self.mask_function = function + if ex_iterable.is_typed: + features = Features({**ex_iterable.features, self.mask_column_name: Value("bool")}) else: - yield from self._iter() + features = None + super().__init__( + ex_iterable=ex_iterable, + function=partial( + async_add_mask if inspect.iscoroutinefunction(function) else add_mask, + function, + mask_column_name=self.mask_column_name, + ), + with_indices=with_indices, + input_columns=input_columns, + batched=batched, + batch_size=batch_size, + fn_kwargs=fn_kwargs, + formatting=formatting, + features=features, + ) def _iter(self): - current_idx = self._state_dict["previous_state_example_idx"] if self._state_dict else 0 - if self._state_dict and self._state_dict["previous_state"]: - self.ex_iterable.load_state_dict(self._state_dict["previous_state"]) - num_examples_to_skip = self._state_dict["num_examples_since_previous_state"] - else: - num_examples_to_skip = 0 - iterator = iter(self.ex_iterable) - - if self.formatting: - formatter = get_formatter(self.formatting.format_type) - format_dict = ( - formatter.recursive_tensorize if isinstance(formatter, TensorFormatter) else cast_to_python_objects - ) - else: - format_dict = None - - if self.batched: - if self._state_dict: - self._state_dict["previous_state"] = self.ex_iterable.state_dict() - self._state_dict["num_examples_since_previous_state"] = 0 - self._state_dict["previous_state_example_idx"] = current_idx - for key, example in iterator: - # If `batched`, first build the batch, if `batch_size` is None or <=0, then the batch is the whole dataset - iterator_batch = ( - iterator - if self.batch_size is None or self.batch_size <= 0 - else islice(iterator, self.batch_size - 1) - ) - key_examples_list = [(key, example)] + list(iterator_batch) - keys, examples = zip(*key_examples_list) - batch = _examples_to_batch(examples) - batch = format_dict(batch) if format_dict else batch - # then compute the mask for the batch - inputs = batch - function_args = [inputs] if self.input_columns is None else [inputs[col] for col in self.input_columns] - if self.with_indices: - function_args.append([current_idx + i for i in range(len(key_examples_list))]) - mask = self.function(*function_args, **self.fn_kwargs) - # yield one example at a time from the batch - for key_example, to_keep in zip(key_examples_list, mask): - current_idx += 1 - if self._state_dict: - self._state_dict["num_examples_since_previous_state"] += 1 - if num_examples_to_skip > 0: - num_examples_to_skip -= 1 - continue - if to_keep: - yield key_example - if self._state_dict: - self._state_dict["previous_state"] = self.ex_iterable.state_dict() - self._state_dict["num_examples_since_previous_state"] = 0 - self._state_dict["previous_state_example_idx"] = current_idx - else: - for key, example in iterator: - # If not batched, we can apply the filtering function direcly - example = 
dict(example) - inputs = format_dict(example) if format_dict else example - function_args = [inputs] if self.input_columns is None else [inputs[col] for col in self.input_columns] - if self.with_indices: - function_args.append(current_idx) - to_keep = self.function(*function_args, **self.fn_kwargs) - current_idx += 1 - if self._state_dict: - self._state_dict["previous_state_example_idx"] += 1 - if to_keep: - yield key, example + for key, example in super()._iter(): + example = dict(example) + if example.pop(self.mask_column_name): + yield key, example def _iter_arrow(self, max_chunksize: Optional[int] = None): - if self.ex_iterable.iter_arrow: - iterator = self.ex_iterable.iter_arrow() - else: - iterator = _convert_to_arrow(self.ex_iterable, batch_size=self.batch_size if self.batched else 1) - - if self._state_dict and self._state_dict["previous_state"]: - self.ex_iterable.load_state_dict(self._state_dict["previous_state"]) - num_examples_to_skip = self._state_dict["num_examples_since_previous_state"] - else: - num_examples_to_skip = 0 - if self._state_dict and max_chunksize is not None: - self._state_dict["previous_state"] = self.ex_iterable.state_dict() - self._state_dict["num_examples_since_previous_state"] = 0 - current_idx = self._state_dict["previous_state_example_idx"] if self._state_dict else 0 - for key, pa_table in iterator: - if ( - self.batched - and self.batch_size is not None - and len(pa_table) < self.batch_size - and self.drop_last_batch - ): - return - # first build the batch - function_args = [pa_table] if self.input_columns is None else [pa_table[col] for col in self.input_columns] - if self.with_indices: - if self.batched: - function_args.append([current_idx + i for i in range(len(pa_table))]) - else: - function_args.append(current_idx) - # then apply the transform - mask = self.function(*function_args, **self.fn_kwargs) - # return output - if self.batched: - output_table = pa_table.filter(mask) - elif mask.as_py() if isinstance(mask, pa.BooleanScalar) else mask: - output_table = pa_table - else: - output_table = pa_table.slice(0, 0) - - if max_chunksize is None: - current_idx += len(pa_table) - if self._state_dict: - self._state_dict["previous_state_example_idx"] += len(pa_table) - if len(output_table) > 0: - yield key, output_table - else: - for i, pa_subtable in enumerate(output_table.to_reader(max_chunksize=max_chunksize)): - current_idx += 1 - if self._state_dict: - self._state_dict["num_examples_since_previous_state"] += 1 - if num_examples_to_skip > 0: - num_examples_to_skip -= 1 - continue - yield f"{key}_{i}", pa_subtable - if self._state_dict: - self._state_dict["previous_state"] = self.ex_iterable.state_dict() - self._state_dict["num_examples_since_previous_state"] = 0 - self._state_dict["previous_state_example_idx"] += len(pa_table) + for key, pa_table in super()._iter_arrow(max_chunksize=max_chunksize): + mask = pa_table[self.mask_column_name] + yield key, pa_table.drop(self.mask_column_name).filter(mask) def shuffle_data_sources(self, seed: Optional[int]) -> "FilteredExamplesIterable": """Shuffle the wrapped examples iterable.""" return FilteredExamplesIterable( self.ex_iterable.shuffle_data_sources(seed), - function=self.function, + function=self.mask_function, with_indices=self.with_indices, input_columns=self.input_columns, batched=self.batched, batch_size=self.batch_size, + fn_kwargs=self.fn_kwargs, + formatting=self.formatting, ) def shard_data_sources(self, num_shards: int, index: int, contiguous=True) -> "FilteredExamplesIterable": """Keep only the 
requested shard.""" return FilteredExamplesIterable( self.ex_iterable.shard_data_sources(num_shards, index, contiguous=contiguous), - function=self.function, + function=self.mask_function, with_indices=self.with_indices, input_columns=self.input_columns, batched=self.batched, batch_size=self.batch_size, + fn_kwargs=self.fn_kwargs, + formatting=self.formatting, ) @property @@ -1426,6 +1450,10 @@ def __init__(self, ex_iterable: _BaseExamplesIterable, buffer_size: int, generat def is_typed(self): return self.ex_iterable.is_typed + @property + def features(self): + return self.ex_iterable.features + def _init_state_dict(self) -> dict: self._state_dict = self.ex_iterable._init_state_dict() self._original_state_dict = self.state_dict() @@ -1500,6 +1528,10 @@ def __init__( def is_typed(self): return self.ex_iterable.is_typed + @property + def features(self): + return self.ex_iterable.features + def _init_state_dict(self) -> dict: self._state_dict = {"skipped": False, "ex_iterable": self.ex_iterable._init_state_dict()} return self._state_dict @@ -1548,6 +1580,54 @@ def num_shards(self) -> int: return self.ex_iterable.num_shards +class RepeatExamplesIterable(_BaseExamplesIterable): + """ + Iterable that repeats the underlying iterable a given number of times. + """ + + def __init__( + self, + ex_iterable: _BaseExamplesIterable, + num_times: Optional[int], + ): + super().__init__() + self.ex_iterable = ex_iterable + self.num_times = num_times + + def _init_state_dict(self) -> dict: + self._state_dict = { + "repeat_index": 0, + "ex_iterable": self.ex_iterable._init_state_dict(), + } + return self._state_dict + + def __iter__(self): + repeat_index = self._state_dict["repeat_index"] if self._state_dict else 0 + while True: + if self.num_times is not None and repeat_index >= max(self.num_times, 0): + break + yield from self.ex_iterable + repeat_index += 1 + if self._state_dict: + self._state_dict["repeat_index"] = repeat_index + self._state_dict["ex_iterable"] = self.ex_iterable._init_state_dict() + + def shuffle_data_sources(self, generator: np.random.Generator) -> "RepeatExamplesIterable": + """Shuffle the underlying iterable, then repeat.""" + return RepeatExamplesIterable(self.ex_iterable.shuffle_data_sources(generator), num_times=self.num_times) + + def shard_data_sources(self, worker_id: int, num_workers: int) -> "RepeatExamplesIterable": + """Shard, then repeat shards.""" + return RepeatExamplesIterable( + self.ex_iterable.shard_data_sources(worker_id, num_workers), + num_times=self.num_times, + ) + + @property + def n_shards(self) -> int: + return self.ex_iterable.n_shards + + class TakeExamplesIterable(_BaseExamplesIterable): def __init__( self, @@ -1567,6 +1647,10 @@ def __init__( def is_typed(self): return self.ex_iterable.is_typed + @property + def features(self): + return self.ex_iterable.features + def _init_state_dict(self) -> dict: self._state_dict = {"num_taken": 0, "ex_iterable": self.ex_iterable._init_state_dict()} return self._state_dict @@ -1652,44 +1736,85 @@ def _apply_feature_types_on_batch( return decoded_batch -class TypedExamplesIterable(_BaseExamplesIterable): +@dataclass +class FormattingConfig: + format_type: Optional[str] + + @property + def is_table(self) -> bool: + return isinstance(get_formatter(self.format_type), TableFormatter) + + @property + def is_tensor(self) -> bool: + return isinstance(get_formatter(self.format_type), TensorFormatter) + + +class FormattedExamplesIterable(_BaseExamplesIterable): def __init__( self, ex_iterable: _BaseExamplesIterable, - features: 
Features, + formatting: Optional[FormattingConfig], + features: Optional[Features], token_per_repo_id: Dict[str, Union[str, bool, None]], ): super().__init__() self.ex_iterable = ex_iterable - self.features = features + self._features = features + self.formatting = formatting self.token_per_repo_id = token_per_repo_id @property def iter_arrow(self): - if self.ex_iterable.iter_arrow is not None: + if self.ex_iterable.iter_arrow and (not self.formatting or self.formatting.is_table): return self._iter_arrow @property def is_typed(self): - return True + return self.ex_iterable.is_typed or self._features is not None + + @property + def features(self): + return self._features def _init_state_dict(self) -> dict: self._state_dict = self.ex_iterable._init_state_dict() return self._state_dict def __iter__(self): - # Then for each example, `TypedExamplesIterable` automatically fills missing columns with None. - # This is done with `_apply_feature_types_on_example`. - for key, example in self.ex_iterable: - yield ( - key, - _apply_feature_types_on_example(example, self.features, token_per_repo_id=self.token_per_repo_id), + if not self.formatting or self.formatting.is_table: + formatter = PythonFormatter() + else: + formatter = get_formatter( + self.formatting.format_type, + features=self._features if not self.ex_iterable.is_typed else None, + token_per_repo_id=self.token_per_repo_id, ) + if self.ex_iterable.iter_arrow: + # feature casting (inc column addition) handled within self._iter_arrow() + for key, pa_table in self._iter_arrow(): + batch = formatter.format_batch(pa_table) + for example in _batch_to_examples(batch): + yield key, example + else: + format_dict = ( + formatter.recursive_tensorize + if isinstance(formatter, TensorFormatter) + else cast_to_python_objects # cast in case features is None + ) + for key, example in self.ex_iterable: + # don't apply feature types if already applied by ex_iterable (e.g. 
in case of chained with_format) + if self.features and not self.ex_iterable.is_typed: + example = _apply_feature_types_on_example( + example, self.features, token_per_repo_id=self.token_per_repo_id + ) + yield key, format_dict(example) def _iter_arrow(self) -> Iterator[Tuple[Key, pa.Table]]: - schema = self.features.arrow_schema - for key, pa_table in self.ex_iterable.iter_arrow(): + if not self.features: + yield from self.ex_iterable._iter_arrow() + for key, pa_table in self.ex_iterable._iter_arrow(): columns = set(pa_table.column_names) + schema = self.features.arrow_schema # add missing columns for column_name in self.features: if column_name not in columns: @@ -1699,20 +1824,22 @@ def _iter_arrow(self) -> Iterator[Tuple[Key, pa.Table]]: pa_table = cast_table_to_features(pa_table, self.features) yield key, pa_table - def shuffle_data_sources(self, generator: np.random.Generator) -> "TypedExamplesIterable": + def shuffle_data_sources(self, generator: np.random.Generator) -> "FormattedExamplesIterable": """Shuffle the wrapped examples iterable.""" - return TypedExamplesIterable( + return FormattedExamplesIterable( self.ex_iterable.shuffle_data_sources(generator), features=self.features, token_per_repo_id=self.token_per_repo_id, + formatting=self.formatting, ) - def shard_data_sources(self, num_shards: int, index: int, contiguous=True) -> "TypedExamplesIterable": + def shard_data_sources(self, num_shards: int, index: int, contiguous=True) -> "FormattedExamplesIterable": """Keep only the requested shard.""" - return TypedExamplesIterable( + return FormattedExamplesIterable( self.ex_iterable.shard_data_sources(num_shards, index, contiguous=contiguous), features=self.features, token_per_repo_id=self.token_per_repo_id, + formatting=self.formatting, ) @property @@ -1720,17 +1847,6 @@ def num_shards(self) -> int: return self.ex_iterable.num_shards -@dataclass -class FormattingConfig: - format_type: Optional[str] - - def __post_init__(self): - if self.format_type == "pandas": - raise NotImplementedError( - "The 'pandas' formatting is not implemented for iterable datasets. You can use 'numpy' or 'arrow' instead." 
-            )
-
-
 @dataclass
 class ShufflingConfig:
     generator: np.random.Generator
@@ -1985,7 +2101,7 @@ def _iter_pytorch(self):
         else:
             format_dict = None

-        if self._formatting and (ex_iterable.iter_arrow or self._formatting == "arrow"):
+        if self._formatting and (ex_iterable.iter_arrow or self._formatting.is_table):
             if ex_iterable.iter_arrow:
                 iterator = ex_iterable.iter_arrow()
             else:
@@ -2025,7 +2141,7 @@ def _prepare_ex_iterable_for_iteration(
         self, batch_size: int = 1, drop_last_batch: bool = False
     ) -> _BaseExamplesIterable:
         ex_iterable = self._ex_iterable
-        if self._formatting and (ex_iterable.iter_arrow or self._formatting.format_type == "arrow"):
+        if self._formatting and (ex_iterable.iter_arrow or self._formatting.is_table):
             ex_iterable = RebatchedArrowExamplesIterable(
                 ex_iterable, batch_size=batch_size, drop_last_batch=drop_last_batch
             )
@@ -2081,7 +2197,7 @@ def __iter__(self):
         else:
             format_dict = None

-        if self._formatting and (ex_iterable.iter_arrow or self._formatting.format_type == "arrow"):
+        if self._formatting and (ex_iterable.iter_arrow or self._formatting.is_table):
             if ex_iterable.iter_arrow:
                 iterator = ex_iterable.iter_arrow()
             else:
@@ -2117,7 +2233,7 @@ def iter(self, batch_size: int, drop_last_batch: bool = False):
             format_dict = None

         ex_iterable = self._prepare_ex_iterable_for_iteration(batch_size=batch_size, drop_last_batch=drop_last_batch)
-        if self._formatting and (ex_iterable.iter_arrow or self._formatting == "arrow"):
+        if self._formatting and (ex_iterable.iter_arrow or self._formatting.is_table):
             if ex_iterable.iter_arrow:
                 iterator = ex_iterable.iter_arrow()
             else:
@@ -2259,12 +2375,11 @@ def with_format(
     ) -> "IterableDataset":
         """
         Return a dataset with the specified format.
-        The 'pandas' format is currently not implemented.

         Args:
             type (`str`, *optional*):
-                Either output type selected in `[None, 'numpy', 'torch', 'tensorflow', 'arrow', 'jax']`.
+                Either output type selected in `[None, 'numpy', 'torch', 'tensorflow', 'jax', 'arrow', 'pandas', 'polars']`.
                 `None` means it returns python objects (default).

         Example:

         ```py
         >>> from datasets import load_dataset
         >>> from transformers import AutoTokenizer
-        >>> ds = load_dataset("rotten_tomatoes", split="validation", streaming=True)
+        >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation", streaming=True)
         >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
         >>> ds = ds.map(lambda x: tokenizer(x['text'], truncation=True, padding=True), batched=True)
         >>> ds = ds.with_format("torch")
@@ -2339,6 +2454,9 @@ def map(
             Note that the last batch may have less than `n` examples.
             A batch is a dictionary, e.g. a batch of `n` examples is `{"text": ["Hello there !"] * n}`.

+        If the function is asynchronous, then `map` will run your function in parallel, with up to one thousand simultaneous calls.
+        It is recommended to use an `asyncio.Semaphore` in your function if you want to set a maximum number of operations that can run at the same time.
+
         Args:
             function (`Callable`, *optional*, defaults to `None`):
                 Function applied on-the-fly on the examples when you iterate on the dataset.
@@ -2350,6 +2468,7 @@
                 - `function(batch: Dict[str, List], indices: List[int]) -> Dict[str, List]` if `batched=True` and `with_indices=True`

                 For advanced usage, the function can also return a `pyarrow.Table`.
+                If the function is asynchronous, then `map` will run your function in parallel.
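A rough usage sketch of the asynchronous `map` path described just above, assuming an iterable dataset built via `Dataset.from_dict(...).to_iterable_dataset()` and using the `asyncio.Semaphore` pattern the docstring recommends; the column names and the `sleep` are placeholders for a real async workload.

```python
import asyncio

from datasets import Dataset

ds = Dataset.from_dict({"x": list(range(100))}).to_iterable_dataset()

# cap the number of in-flight calls, as the docstring suggests
semaphore = asyncio.Semaphore(16)


async def add_y(example):
    async with semaphore:
        await asyncio.sleep(0.01)  # stand-in for a real async call (e.g. an HTTP request)
        return {"y": example["x"] + 1}


ds = ds.map(add_y)  # coroutine calls are awaited concurrently while you iterate
print(next(iter(ds)))  # expected: {'x': 0, 'y': 1}
```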
                Moreover if your function returns nothing (`None`), then `map` will run your function and return the dataset unchanged.
                If no function is provided, default to identity function: `lambda x: x`.
            with_indices (`bool`, defaults to `False`):
@@ -2378,7 +2497,7 @@
         ```py
         >>> from datasets import load_dataset
-        >>> ds = load_dataset("rotten_tomatoes", split="train", streaming=True)
+        >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True)
         >>> def add_prefix(example):
         ...     example["text"] = "Review: " + example["text"]
         ...     return example
@@ -2399,18 +2518,40 @@ def map(
             function = identity_func
         if fn_kwargs is None:
             fn_kwargs = {}
-        ex_iterable = (
-            TypedExamplesIterable(self._ex_iterable, self._info.features, token_per_repo_id=self._token_per_repo_id)
-            if self._info.features is not None
-            else self._ex_iterable
+
+        ex_iterable = self._ex_iterable
+        # no need to apply features if ex_iterable is typed and if there was no cast_column()
+        input_features = (
+            None
+            if (ex_iterable.is_typed and (self._info.features is None or self._info.features == ex_iterable.features))
+            else self._info.features
         )
-        ex_iterable = (
-            RebatchedArrowExamplesIterable(
+
+        if self._formatting and self._formatting.is_table:
+            # apply formatting before iter_arrow to keep map examples iterable happy
+            ex_iterable = FormattedExamplesIterable(
+                ex_iterable,
+                formatting=copy.deepcopy(self._formatting),
+                features=input_features,
+                token_per_repo_id=self._token_per_repo_id,
+            )
+            ex_iterable = RebatchedArrowExamplesIterable(
                 ex_iterable, batch_size=batch_size if batched else 1, drop_last_batch=drop_last_batch
             )
-            if self._formatting and self._formatting.format_type == "arrow"
-            else ex_iterable
-        )
+        else:
+            if self._formatting and self._ex_iterable.iter_arrow:
+                ex_iterable = RebatchedArrowExamplesIterable(
+                    self._ex_iterable, batch_size=batch_size if batched else 1, drop_last_batch=drop_last_batch
+                )
+            if self._formatting or input_features:
+                # apply formatting after iter_arrow to avoid re-encoding the examples
+                ex_iterable = FormattedExamplesIterable(
+                    ex_iterable,
+                    formatting=copy.deepcopy(self._formatting),
+                    features=input_features,
+                    token_per_repo_id=self._token_per_repo_id,
+                )
+
         ex_iterable = MappedExamplesIterable(
             ex_iterable,
             function=function,
@@ -2422,6 +2563,7 @@
             remove_columns=remove_columns,
             fn_kwargs=fn_kwargs,
             formatting=self._formatting,
+            features=features,
         )
         info = self.info.copy()
         info.features = features
@@ -2447,6 +2589,9 @@ def filter(
         """Apply a filter function to all the elements so that the dataset only includes examples according to the filter function.
         The filtering is done on-the-fly when iterating over the dataset.

+        If the function is asynchronous, then `filter` will run your function in parallel, with up to one thousand simultaneous calls (configurable).
+        It is recommended to use an `asyncio.Semaphore` in your function if you want to set a maximum number of operations that can run at the same time.
+
         Args:
             function (`Callable`):
                 Callable with one of the following signatures:

                 - `function(example: Dict[str, Any]) -> bool` if `with_indices=False, batched=False`
@@ -2456,6 +2601,7 @@
                 - `function(example: Dict[str, List]) -> List[bool]` if `with_indices=False, batched=True`
                 - `function(example: Dict[str, List], indices: List[int]) -> List[bool]` if `with_indices=True, batched=True`

+                If the function is asynchronous, then `filter` will run your function in parallel.
                 If no function is provided, defaults to an always True function: `lambda x: True`.
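Similarly, a minimal sketch of an asynchronous predicate passed to `filter`, mirroring the pattern exercised by the new async tests later in this diff; the dataset contents are made up for illustration.

```python
import asyncio

from datasets import Dataset

ds = Dataset.from_dict({"x": list(range(100))}).to_iterable_dataset()


async def keep_answer(example):
    await asyncio.sleep(0.01)  # stand-in for an async lookup
    return example["x"] == 42


filtered = ds.filter(keep_answer)  # predicates are awaited concurrently during iteration
print(list(filtered))  # expected: [{'x': 42}]
```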
with_indices (`bool`, defaults to `False`): Provide example indices to `function`. Note that in this case the signature of `function` should be `def function(example, idx): ...`. @@ -2473,7 +2619,7 @@ def filter( ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", split="train", streaming=True) + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True) >>> ds = ds.filter(lambda x: x["label"] == 0) >>> list(ds.take(3)) [{'label': 0, 'movie_review': 'simplistic , silly and tedious .'}, @@ -2486,11 +2632,19 @@ def filter( if isinstance(input_columns, str): input_columns = [input_columns] - # We need the examples to be decoded for certain feature types like Image or Audio, so we use TypedExamplesIterable here + # We need the examples to be decoded for certain feature types like Image or Audio, + # format and type before filtering + ex_iterable = self._ex_iterable + if self._info.features or self._formatting: + ex_iterable = FormattedExamplesIterable( + ex_iterable, + formatting=self._formatting, + features=None if ex_iterable.is_typed else self._info.features, + token_per_repo_id=self._token_per_repo_id, + ) + ex_iterable = FilteredExamplesIterable( - TypedExamplesIterable(self._ex_iterable, self._info.features, token_per_repo_id=self._token_per_repo_id) - if self._info.features is not None - else self._ex_iterable, + ex_iterable, function=function, with_indices=with_indices, input_columns=input_columns, @@ -2542,7 +2696,7 @@ def shuffle( ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", split="train", streaming=True) + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True) >>> list(ds.take(3)) [{'label': 1, 'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}, @@ -2591,7 +2745,7 @@ def skip(self, n: int) -> "IterableDataset": ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", split="train", streaming=True) + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True) >>> list(ds.take(3)) [{'label': 1, 'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}, @@ -2623,6 +2777,49 @@ def skip(self, n: int) -> "IterableDataset": token_per_repo_id=self._token_per_repo_id, ) + def repeat(self, num_times: Optional[int]) -> "IterableDataset": + """ + Create a new [`IterableDataset`] that repeats the underlying dataset `num_times` times. + + N.B. The effect of calling shuffle after repeat depends significantly on buffer size. + With buffer_size 1, duplicate data is never seen in the same iteration, even after shuffling: + ds.repeat(n).shuffle(seed=42, buffer_size=1) is equivalent to ds.shuffle(seed=42, buffer_size=1).repeat(n), + and only shuffles shard orders within each iteration. + With buffer size >= (num samples in the dataset * num_times), we get full shuffling of the repeated data, i.e. we can observe duplicates in + the same iteration. + + Args: + num_times (`int`) or (`None`): + Number of times to repeat the dataset. If `None`, the dataset will be repeated indefinitely. 
+ + Example: + ```py + >>> from datasets import load_dataset + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train") + >>> ds = ds.take(2).repeat(2) + >>> list(ds) + [{'label': 1, + 'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}, + {'label': 1, + 'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .'}, + {'label': 1, 'text': 'effective but too-tepid biopic'}, + {'label': 1, + 'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}, + {'label': 1, + 'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .'}, + {'label': 1, 'text': 'effective but too-tepid biopic'}] + ``` + """ + return IterableDataset( + ex_iterable=RepeatExamplesIterable(self._ex_iterable, num_times=num_times), + info=self._info, + split=self._split, + formatting=self._formatting, + shuffling=copy.deepcopy(self._shuffling), + distributed=copy.deepcopy(self._distributed), + token_per_repo_id=self._token_per_repo_id, + ) + def take(self, n: int) -> "IterableDataset": """ Create a new [`IterableDataset`] with only the first `n` elements. @@ -2635,7 +2832,7 @@ def take(self, n: int) -> "IterableDataset": ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", split="train", streaming=True) + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True) >>> small_ds = ds.take(2) >>> list(small_ds) [{'label': 1, @@ -2725,7 +2922,7 @@ def column_names(self) -> Optional[List[str]]: ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", split="validation", streaming=True) + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation", streaming=True) >>> ds.column_names ['text', 'label'] ``` @@ -2762,7 +2959,7 @@ def rename_column(self, original_column_name: str, new_column_name: str) -> "Ite ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", split="train", streaming=True) + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True) >>> next(iter(ds)) {'label': 1, 'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'} @@ -2816,7 +3013,7 @@ def remove_columns(self, column_names: Union[str, List[str]]) -> "IterableDatase ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", split="train", streaming=True) + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True) >>> next(iter(ds)) {'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', 'label': 1} >>> ds = ds.remove_columns("label") @@ -2851,7 +3048,7 @@ def select_columns(self, 
column_names: Union[str, List[str]]) -> "IterableDatase ```py >>> from datasets import load_dataset - >>> ds = load_dataset("rotten_tomatoes", split="train", streaming=True) + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True) >>> next(iter(ds)) {'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', 'label': 1} >>> ds = ds.select_columns("text") @@ -2952,7 +3149,7 @@ def cast( ```py >>> from datasets import load_dataset, ClassLabel, Value - >>> ds = load_dataset("rotten_tomatoes", split="train", streaming=True) + >>> ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True) >>> ds.features {'label': ClassLabel(names=['neg', 'pos'], id=None), 'text': Value(dtype='string', id=None)} @@ -2992,7 +3189,7 @@ def _step(self, step: int, offset: int) -> "IterableDataset": def _resolve_features(self): if self.features is not None: return self - elif isinstance(self._ex_iterable, TypedExamplesIterable): + elif self._ex_iterable.is_typed: features = self._ex_iterable.features else: features = _infer_features_from_batch(self.with_format(None)._head()) diff --git a/src/datasets/packaged_modules/arrow/arrow.py b/src/datasets/packaged_modules/arrow/arrow.py index 2613fa058bc..bcf31c473d2 100644 --- a/src/datasets/packaged_modules/arrow/arrow.py +++ b/src/datasets/packaged_modules/arrow/arrow.py @@ -45,7 +45,7 @@ def _split_generators(self, dl_manager): with open(file, "rb") as f: try: reader = pa.ipc.open_stream(f) - except pa.lib.ArrowInvalid: + except (OSError, pa.lib.ArrowInvalid): reader = pa.ipc.open_file(f) self.info.features = datasets.Features.from_arrow_schema(reader.schema) break @@ -65,7 +65,7 @@ def _generate_tables(self, files): try: try: batches = pa.ipc.open_stream(f) - except pa.lib.ArrowInvalid: + except (OSError, pa.lib.ArrowInvalid): reader = pa.ipc.open_file(f) batches = (reader.get_batch(i) for i in range(reader.num_record_batches)) for batch_idx, record_batch in enumerate(batches): diff --git a/src/datasets/packaged_modules/imagefolder/imagefolder.py b/src/datasets/packaged_modules/imagefolder/imagefolder.py index f9a1a88b85c..6a16c760c01 100644 --- a/src/datasets/packaged_modules/imagefolder/imagefolder.py +++ b/src/datasets/packaged_modules/imagefolder/imagefolder.py @@ -57,8 +57,8 @@ class ImageFolder(folder_based_builder.FolderBasedBuilder): ".gbr", ".gif", ".grib", - ".h5", - ".hdf", + # ".h5", # may contain zero or several images + # ".hdf", # may contain zero or several images ".png", ".apng", ".jp2", diff --git a/src/datasets/packaged_modules/webdataset/webdataset.py b/src/datasets/packaged_modules/webdataset/webdataset.py index 0768437b36a..ed0bf6428b1 100644 --- a/src/datasets/packaged_modules/webdataset/webdataset.py +++ b/src/datasets/packaged_modules/webdataset/webdataset.py @@ -34,6 +34,9 @@ def _get_pipeline_from_tar(cls, tar_path, tar_iterator): if example_key is None: continue if current_example and current_example["__key__"] != example_key: + # reposition some keys in last position + current_example["__key__"] = current_example.pop("__key__") + current_example["__url__"] = current_example.pop("__url__") yield current_example current_example = {} current_example["__key__"] = example_key diff --git a/src/datasets/search.py b/src/datasets/search.py index 4f76f9b671f..d61089bde5b 100644 --- a/src/datasets/search.py +++ b/src/datasets/search.py @@ -175,7 
+175,7 @@ def passage_generator(): successes += ok if successes != len(documents): logger.warning( - f"Some documents failed to be added to ElasticSearch. Failures: {len(documents)-successes}/{len(documents)}" + f"Some documents failed to be added to ElasticSearch. Failures: {len(documents) - successes}/{len(documents)}" ) logger.info(f"Indexed {successes:d} documents") diff --git a/src/datasets/splits.py b/src/datasets/splits.py index 5cca4e8b966..07fc18e734b 100644 --- a/src/datasets/splits.py +++ b/src/datasets/splits.py @@ -478,7 +478,7 @@ class SplitReadInstruction: """ def __init__(self, split_info=None): - self._splits = NonMutableDict(error_msg="Overlap between splits. Split {key} has been added with " "itself.") + self._splits = NonMutableDict(error_msg="Overlap between splits. Split {key} has been added with itself.") if split_info: self.add(SlicedSplitInfo(split_info=split_info, slice_value=None)) diff --git a/src/datasets/utils/logging.py b/src/datasets/utils/logging.py index dffd5ce46e0..1417f82ac0d 100644 --- a/src/datasets/utils/logging.py +++ b/src/datasets/utils/logging.py @@ -57,8 +57,7 @@ def _get_default_logging_level(): return log_levels[env_level_str] else: logging.getLogger().warning( - f"Unknown option DATASETS_VERBOSITY={env_level_str}, " - f"has to be one of: { ', '.join(log_levels.keys()) }" + f"Unknown option DATASETS_VERBOSITY={env_level_str}, has to be one of: {', '.join(log_levels.keys())}" ) return _default_log_level diff --git a/src/datasets/utils/stratify.py b/src/datasets/utils/stratify.py index d0967aa1abb..6dab0469e81 100644 --- a/src/datasets/utils/stratify.py +++ b/src/datasets/utils/stratify.py @@ -81,11 +81,11 @@ def stratified_shuffle_split_generate_indices(y, n_train, n_test, rng, n_splits= raise ValueError("Minimum class count error") if n_train < n_classes: raise ValueError( - "The train_size = %d should be greater or " "equal to the number of classes = %d" % (n_train, n_classes) + "The train_size = %d should be greater or equal to the number of classes = %d" % (n_train, n_classes) ) if n_test < n_classes: raise ValueError( - "The test_size = %d should be greater or " "equal to the number of classes = %d" % (n_test, n_classes) + "The test_size = %d should be greater or equal to the number of classes = %d" % (n_test, n_classes) ) class_indices = np.split(np.argsort(y_indices, kind="mergesort"), np.cumsum(class_counts)[:-1]) for _ in range(n_splits): diff --git a/tests/conftest.py b/tests/conftest.py index f4e79eb5bf3..e9bb542c954 100644 --- a/tests/conftest.py +++ b/tests/conftest.py @@ -52,7 +52,7 @@ def set_sqlalchemy_silence_uber_warning(monkeypatch): # To be removed once SQLAlchemy 2.0 supported try: monkeypatch.setattr("sqlalchemy.util.deprecations.SILENCE_UBER_WARNING", True) - except AttributeError: + except (ModuleNotFoundError, AttributeError): pass diff --git a/tests/packaged_modules/test_audiofolder.py b/tests/packaged_modules/test_audiofolder.py index 30dfd7a8d85..cf93704147b 100644 --- a/tests/packaged_modules/test_audiofolder.py +++ b/tests/packaged_modules/test_audiofolder.py @@ -3,7 +3,6 @@ import numpy as np import pytest -import soundfile as sf from datasets import Audio, ClassLabel, Features, Value from datasets.builder import InvalidConfigName @@ -195,6 +194,7 @@ def data_files_with_two_splits_and_metadata(request, tmp_path, audio_file): @pytest.fixture def data_files_with_zip_archives(tmp_path, audio_file): import librosa + import soundfile as sf data_dir = tmp_path / "audiofolder_data_dir_with_zip_archives" 
data_dir.mkdir(parents=True, exist_ok=True) diff --git a/tests/test_arrow_dataset.py b/tests/test_arrow_dataset.py index 6cf8898ce67..2e54aadf7b6 100644 --- a/tests/test_arrow_dataset.py +++ b/tests/test_arrow_dataset.py @@ -1,3 +1,4 @@ +import asyncio import contextlib import copy import itertools @@ -7,6 +8,7 @@ import re import sys import tempfile +import time from functools import partial from pathlib import Path from unittest import TestCase @@ -869,6 +871,32 @@ def test_concatenate_pickle(self, in_memory): self.assertEqual(dset_concat.info.description, "Dataset2\n\nDataset1") del dset1, dset2, dset3 + def test_repeat(self, in_memory): + with tempfile.TemporaryDirectory() as tmp_dir: + with self._create_dummy_dataset(in_memory, tmp_dir, multiple_columns=True) as dset: + repeated_dset = dset.repeat(3) + column_values_dict = {col: dset[col] for col in dset.column_names} + for col, single_values in column_values_dict.items(): + self.assertListEqual(repeated_dset[col], single_values * 3) + del repeated_dset + + with tempfile.TemporaryDirectory() as tmp_dir: + with self._create_dummy_dataset(in_memory, tmp_dir, multiple_columns=True) as dset: + with pytest.raises(ValueError): + dset.repeat(None) + + with tempfile.TemporaryDirectory() as tmp_dir: + with self._create_dummy_dataset(in_memory, tmp_dir, multiple_columns=True) as dset: + repeated_dset = dset.repeat(0) + self.assertEqual(len(repeated_dset), 0) + del repeated_dset + + with tempfile.TemporaryDirectory() as tmp_dir: + with self._create_dummy_dataset(in_memory, tmp_dir, multiple_columns=True) as dset: + repeated_dset = dset.repeat(-1) + self.assertEqual(len(repeated_dset), 0) + del repeated_dset + def test_flatten(self, in_memory): with tempfile.TemporaryDirectory() as tmp_dir: with Dataset.from_dict( @@ -4165,7 +4193,7 @@ def test_dummy_dataset_serialize_fs(dataset, mockfs): [ "relative/path", "/absolute/path", - "s3://bucket/relative/path", + "hf://bucket/relative/path", "hdfs://relative/path", "hdfs:///absolute/path", ], @@ -4179,7 +4207,7 @@ def test_build_local_temp_path(uri_or_path): assert ( "hdfs://" not in path_relative_to_tmp_dir - and "s3://" not in path_relative_to_tmp_dir + and "hf://" not in path_relative_to_tmp_dir and not local_temp_path.startswith(extracted_path_without_anchor) and local_temp_path.endswith(extracted_path_without_anchor) ), f"Local temp path: {local_temp_path}" @@ -4280,8 +4308,9 @@ def test_dataset_to_iterable_dataset(dataset: Dataset): assert iterable_dataset.num_shards == 3 with pytest.raises(ValueError): dataset.to_iterable_dataset(num_shards=len(dataset) + 1) + assert dataset.with_format("torch").to_iterable_dataset()._formatting.format_type == "torch" with pytest.raises(NotImplementedError): - dataset.with_format("torch").to_iterable_dataset() + dataset.with_format("torch", columns=[dataset.column_names[0]]).to_iterable_dataset() @require_pil @@ -4421,6 +4450,50 @@ def f(x): assert outputs == {"a": [{"nested": [[i]]} for i in [-1, -1, 2, 3]]} +def test_map_async(): + dset = Dataset.from_dict({"x": range(100)}) + + async def f(example): + await asyncio.sleep(0.1) + return {"y": 1} + + _start = time.time() + out = dset.map(f) + assert time.time() - _start < 2.0 + assert out[0]["y"] == 1 + + async def f(batch): + await asyncio.sleep(0.1) + return {"y": [1] * len(batch["x"])} + + _start = time.time() + out = dset.map(f, batched=True) + assert time.time() - _start < 2.0 + assert out[0]["y"] == 1 + + +def test_filter_async(): + dset = Dataset.from_dict({"x": range(100)}) + + async def f(example): + 
await asyncio.sleep(0.1) + return example["x"] == 42 + + _start = time.time() + out = dset.filter(f) + assert time.time() - _start < 2.0 + assert len(out) == 1 + + async def f(batch): + await asyncio.sleep(0.1) + return [x == 42 for x in batch["x"]] + + _start = time.time() + out = dset.filter(f, batched=True) + assert time.time() - _start < 2.0 + assert len(out) == 1 + + def test_dataset_getitem_raises(): ds = Dataset.from_dict({"a": [0, 1, 2, 3]}) with pytest.raises(TypeError): diff --git a/tests/test_inspect.py b/tests/test_inspect.py index 63b93b3b6a6..5ad65230d0d 100644 --- a/tests/test_inspect.py +++ b/tests/test_inspect.py @@ -16,7 +16,7 @@ @pytest.mark.parametrize( "path, config_name, expected_splits", [ - ("squad", "plain_text", ["train", "validation"]), + ("rajpurkar/squad", "plain_text", ["train", "validation"]), ("dalle-mini/wit", "default", ["train"]), ("paws", "labeled_final", ["train", "test", "validation"]), ], @@ -54,7 +54,7 @@ def test_get_dataset_config_info_raises(path, config_name, expected_exception): "path, expected", [ ("acronym_identification", ["default"]), - ("squad", ["plain_text"]), + ("rajpurkar/squad", ["plain_text"]), ("hf-internal-testing/dataset_with_script", ["default"]), ("dalle-mini/wit", ["default"]), ("hf-internal-testing/librispeech_asr_dummy", ["clean"]), @@ -72,7 +72,7 @@ def test_get_dataset_config_names(path, expected): "path, expected", [ ("acronym_identification", "default"), - ("squad", "plain_text"), + ("rajpurkar/squad", "plain_text"), ("hf-internal-testing/dataset_with_script", "default"), ("dalle-mini/wit", "default"), ("hf-internal-testing/librispeech_asr_dummy", "clean"), @@ -92,7 +92,7 @@ def test_get_dataset_default_config_name(path, expected): @pytest.mark.parametrize( "path, expected_configs, expected_splits_in_first_config", [ - ("squad", ["plain_text"], ["train", "validation"]), + ("rajpurkar/squad", ["plain_text"], ["train", "validation"]), ("dalle-mini/wit", ["default"], ["train"]), ("paws", ["labeled_final", "labeled_swap", "unlabeled_final"], ["train", "test", "validation"]), ], @@ -110,7 +110,7 @@ def test_get_dataset_info(path, expected_configs, expected_splits_in_first_confi @pytest.mark.parametrize( "path, expected_config, expected_splits", [ - ("squad", "plain_text", ["train", "validation"]), + ("rajpurkar/squad", "plain_text", ["train", "validation"]), ("dalle-mini/wit", "default", ["train"]), ("paws", "labeled_final", ["train", "test", "validation"]), ], diff --git a/tests/test_iterable_dataset.py b/tests/test_iterable_dataset.py index 44d3d24fa23..714a0013a71 100644 --- a/tests/test_iterable_dataset.py +++ b/tests/test_iterable_dataset.py @@ -1,6 +1,9 @@ +import asyncio import pickle +import time from copy import deepcopy from itertools import chain, cycle, islice +from unittest.mock import patch import numpy as np import pandas as pd @@ -17,7 +20,7 @@ Image, Value, ) -from datasets.formatting import get_format_type_from_alias +from datasets.formatting import Formatter, get_format_type_from_alias from datasets.info import DatasetInfo from datasets.iterable_dataset import ( ArrowExamplesIterable, @@ -25,12 +28,14 @@ CyclingMultiSourcesExamplesIterable, ExamplesIterable, FilteredExamplesIterable, + FormattedExamplesIterable, FormattingConfig, HorizontallyConcatenatedMultiSourcesExamplesIterable, IterableDataset, MappedExamplesIterable, RandomlyCyclingMultiSourcesExamplesIterable, RebatchedArrowExamplesIterable, + RepeatExamplesIterable, SelectColumnsIterable, ShuffledDataSourcesArrowExamplesIterable, 
ShuffledDataSourcesExamplesIterable, @@ -38,7 +43,6 @@ SkipExamplesIterable, StepExamplesIterable, TakeExamplesIterable, - TypedExamplesIterable, VerticallyConcatenatedMultiSourcesExamplesIterable, _BaseExamplesIterable, _batch_to_examples, @@ -50,8 +54,10 @@ assert_arrow_memory_doesnt_increase, is_rng_equal, require_dill_gt_0_3_2, + require_jax, require_not_windows, require_numpy1_on_windows, + require_polars, require_pyspark, require_tf, require_torch, @@ -585,6 +591,59 @@ def test_mapped_examples_iterable_remove_columns(n, func, batched, batch_size, r assert_load_state_dict_resumes_iteration(ex_iterable) +# issue #7345 and PR #7353 +@pytest.mark.parametrize("batched", [False, True]) +@pytest.mark.parametrize("batch_size", [None, 2]) +@pytest.mark.parametrize("input_columns", [None, ["i"]]) +@pytest.mark.parametrize("remove_columns", [None, ["i"]]) +@pytest.mark.parametrize("new_output", [False, True]) +def test_iterable_dataset_vs_dataset_map(batched, batch_size, input_columns, remove_columns, new_output): + if input_columns is not None and not new_output: + return + + ds1 = Dataset.from_list([{"i": i} for i in range(4)]) + + if batched: + + def f1(i): + return {"i": [j + 1 for j in i]} + else: + + def f1(i): + return {"i": i + 1} + + if input_columns is None: + + def f2(x): + return f1(x["i"]) + else: + f2 = f1 + + if new_output: + f = f2 + else: + + def f(x): + x["i"] = f2(x)["i"] + return x + + r = [ + list( + ds2.map( + f, + batch_size=batch_size, + batched=batched, + remove_columns=remove_columns, + input_columns=input_columns, + ) + ) + for ds2 in [ds1, ds1.to_iterable_dataset()] + ] + r[1] = [x for x in r[1] if len(x) > 0] + assert len(r[0]) == len(r[1]) + assert all(x == y for x, y in zip(*r)) + + @pytest.mark.parametrize( "n, func, batched, batch_size, fn_kwargs", [ @@ -1087,15 +1146,59 @@ def test_filtered_examples_iterable_input_columns(n, func, batched, batch_size, assert_load_state_dict_resumes_iteration(ex_iterable) +def test_map_async(): + dset = Dataset.from_dict({"x": range(100)}).to_iterable_dataset() + + async def f(example): + await asyncio.sleep(0.1) + return {"y": 1} + + _start = time.time() + out = dset.map(f) + assert time.time() - _start < 2.0 + assert next(iter(out))["y"] == 1 + + async def f(batch): + await asyncio.sleep(0.1) + return {"y": [1] * len(batch["x"])} + + _start = time.time() + out = dset.map(f, batched=True) + assert time.time() - _start < 2.0 + assert next(iter(out))["y"] == 1 + + +def test_filter_async(): + dset = Dataset.from_dict({"x": range(100)}).to_iterable_dataset() + + async def f(example): + await asyncio.sleep(0.1) + return example["x"] == 42 + + _start = time.time() + out = dset.filter(f) + assert time.time() - _start < 2.0 + assert len(list(out)) == 1 + + async def f(batch): + await asyncio.sleep(0.1) + return [x == 42 for x in batch["x"]] + + _start = time.time() + out = dset.filter(f, batched=True) + assert time.time() - _start < 2.0 + assert len(list(out)) == 1 + + def test_skip_examples_iterable(): total, count = 10, 2 base_ex_iterable = ExamplesIterable(generate_examples_fn, {"n": total}) skip_ex_iterable = SkipExamplesIterable(base_ex_iterable, n=count) expected = list(generate_examples_fn(n=total))[count:] assert list(skip_ex_iterable) == expected - assert ( - skip_ex_iterable.shuffle_data_sources(np.random.default_rng(42)) is skip_ex_iterable - ), "skip examples makes the shards order fixed" + assert skip_ex_iterable.shuffle_data_sources(np.random.default_rng(42)) is skip_ex_iterable, ( + "skip examples makes the shards order 
fixed" + ) assert_load_state_dict_resumes_iteration(skip_ex_iterable) @@ -1105,12 +1208,34 @@ def test_take_examples_iterable(): take_ex_iterable = TakeExamplesIterable(base_ex_iterable, n=count) expected = list(generate_examples_fn(n=total))[:count] assert list(take_ex_iterable) == expected - assert ( - take_ex_iterable.shuffle_data_sources(np.random.default_rng(42)) is take_ex_iterable - ), "skip examples makes the shards order fixed" + assert take_ex_iterable.shuffle_data_sources(np.random.default_rng(42)) is take_ex_iterable, ( + "skip examples makes the shards order fixed" + ) assert_load_state_dict_resumes_iteration(take_ex_iterable) +@pytest.mark.parametrize( + "n, num_times", + [ + (3, None), + (3, 3), + (3, 0), + ], +) +def test_repeat_examples_iterable(n, num_times): + base_ex_iterable = ExamplesIterable(generate_examples_fn, {"n": n}) + ex_iterable = RepeatExamplesIterable(base_ex_iterable, num_times=num_times) + all_examples = [x for _, x in generate_examples_fn(n=n)] + if num_times is not None: + expected = all_examples * max(num_times, 0) + assert [x for _, x in ex_iterable] == expected + else: + max_iters = 135 + iterator = iter(ex_iterable) + for i in range(max_iters): + assert next(iterator)[1] == all_examples[i % len(all_examples)], f"iteration {i} failed," + + def test_vertically_concatenated_examples_iterable(): ex_iterable1 = ExamplesIterable(generate_examples_fn, {"label": 10}) ex_iterable2 = ExamplesIterable(generate_examples_fn, {"label": 5}) @@ -1155,9 +1280,9 @@ def test_horizontally_concatenated_examples_iterable(): concatenated_ex_iterable = HorizontallyConcatenatedMultiSourcesExamplesIterable([ex_iterable1, ex_iterable2]) expected = [{**x, **y} for (_, x), (_, y) in zip(ex_iterable1, ex_iterable2)] assert [x for _, x in concatenated_ex_iterable] == expected - assert ( - concatenated_ex_iterable.shuffle_data_sources(np.random.default_rng(42)) is concatenated_ex_iterable - ), "horizontally concatenated examples makes the shards order fixed" + assert concatenated_ex_iterable.shuffle_data_sources(np.random.default_rng(42)) is concatenated_ex_iterable, ( + "horizontally concatenated examples makes the shards order fixed" + ) assert_load_state_dict_resumes_iteration(concatenated_ex_iterable) @@ -1181,8 +1306,8 @@ def test_horizontally_concatenated_examples_iterable(): BufferShuffledExamplesIterable(ExamplesIterable(generate_examples_fn, {}), 10, np.random.default_rng(42)), SkipExamplesIterable(ExamplesIterable(generate_examples_fn, {}), 10), TakeExamplesIterable(ExamplesIterable(generate_examples_fn, {}), 10), - TypedExamplesIterable( - ExamplesIterable(generate_examples_fn, {}), Features({"id": Value("int32")}), token_per_repo_id={} + FormattedExamplesIterable( + ExamplesIterable(generate_examples_fn, {}), None, Features({"id": Value("int32")}), token_per_repo_id={} ), ], ) @@ -1226,8 +1351,8 @@ def test_no_iter_arrow(ex_iterable: _BaseExamplesIterable): # BufferShuffledExamplesIterable(ArrowExamplesIterable(generate_tables_fn, {}), 10, np.random.default_rng(42)), # not implemented # SkipExamplesIterable(ArrowExamplesIterable(generate_tables_fn, {}), 10), # not implemented # TakeExamplesIterable(ArrowExamplesIterable(generate_tables_fn, {}), 10), # not implemented - TypedExamplesIterable( - ArrowExamplesIterable(generate_tables_fn, {}), Features({"id": Value("int32")}), token_per_repo_id={} + FormattedExamplesIterable( + ArrowExamplesIterable(generate_tables_fn, {}), None, Features({"id": Value("int32")}), token_per_repo_id={} ), ], ) @@ -1627,7 +1752,12 @@ def 
test_iterable_dataset_features_cast_to_python(): assert list(dataset) == [{"timestamp": pd.Timestamp(2020, 1, 1).to_pydatetime(), "array": [1] * 5, "id": 0}] -@pytest.mark.parametrize("format_type", [None, "torch", "python", "tf", "tensorflow", "np", "numpy", "jax"]) +@require_torch +@require_tf +@require_jax +@pytest.mark.parametrize( + "format_type", [None, "torch", "python", "tf", "tensorflow", "np", "numpy", "jax", "arrow", "pd", "pandas"] +) def test_iterable_dataset_with_format(dataset: IterableDataset, format_type): formatted_dataset = dataset.with_format(format_type) assert formatted_dataset._formatting.format_type == get_format_type_from_alias(format_type) @@ -1681,6 +1811,14 @@ def test_iterable_dataset_take(dataset: IterableDataset, n): assert list(take_dataset) == list(dataset)[:n] +@pytest.mark.parametrize("n", [0, 2]) +def test_iterable_dataset_repeat(dataset: IterableDataset, n): + repeat_dataset = dataset.repeat(n) + assert isinstance(repeat_dataset._ex_iterable, RepeatExamplesIterable) + assert repeat_dataset._ex_iterable.num_times == n + assert list(repeat_dataset) == list(dataset) * n + + def test_iterable_dataset_shard(): num_examples = 20 num_shards = 5 @@ -2139,6 +2277,70 @@ def add_one_numpy(example): assert isinstance(next(dataset.iter(batch_size=3))["id"], list) +def test_format_from_arrow(): + python_arrow_extractor = Formatter.python_arrow_extractor + numpy_arrow_extractor = Formatter.numpy_arrow_extractor + + with ( + patch.object(Formatter, "python_arrow_extractor") as mock_python_arrow_extractor, + patch.object(Formatter, "numpy_arrow_extractor") as mock_numpy_arrow_extractor, + ): + mock_python_arrow_extractor.side_effect = python_arrow_extractor + mock_numpy_arrow_extractor.side_effect = numpy_arrow_extractor + + def g(): + yield 0, pa.table({"a": range(10)}) + + ds = IterableDataset(ArrowExamplesIterable(g, {})) + ds = ds.with_format("np") + ds = ds.map(lambda x: x, batched=True) + next(iter(ds)) + + # we do arrow -> numpy -> python + mock_numpy_arrow_extractor.assert_called() + # we don't do any arrow -> python + mock_python_arrow_extractor.assert_not_called() + + +def test_format_arrow(dataset: IterableDataset): + ds = dataset.with_format("arrow") + assert isinstance(next(iter(ds)), pa.Table) + assert isinstance(next(iter(ds.iter(batch_size=4))), pa.Table) + assert len(next(iter(ds))) == 1 + assert len(next(iter(ds.iter(batch_size=4)))) == 4 + ds = ds.map(lambda t: t.append_column("new_col", pa.array([0] * len(t)))) + ds = ds.map(lambda t: t.append_column("new_col_batched", pa.array([1] * len(t))), batched=True) + ds = ds.with_format(None) + assert next(iter(ds)) == {**next(iter(dataset)), "new_col": 0, "new_col_batched": 1} + + +def test_format_pandas(dataset: IterableDataset): + ds = dataset.with_format("pandas") + assert isinstance(next(iter(ds)), pd.DataFrame) + assert isinstance(next(iter(ds.iter(batch_size=4))), pd.DataFrame) + assert len(next(iter(ds))) == 1 + assert len(next(iter(ds.iter(batch_size=4)))) == 4 + ds = ds.map(lambda df: df.assign(new_col=[0] * len(df))) + ds = ds.map(lambda df: df.assign(new_col_batched=[1] * len(df)), batched=True) + ds = ds.with_format(None) + assert next(iter(ds)) == {**next(iter(dataset)), "new_col": 0, "new_col_batched": 1} + + +@require_polars +def test_format_polars(dataset: IterableDataset): + import polars as pl + + ds = dataset.with_format("polars") + assert isinstance(next(iter(ds)), pl.DataFrame) + assert isinstance(next(iter(ds.iter(batch_size=4))), pl.DataFrame) + assert len(next(iter(ds))) == 1 + 
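
The `repeat()` tests in this part of the diff (for both `Dataset` and `IterableDataset`) pin down the new behaviour: finite counts tile the data, non-positive counts give an empty dataset, and `None` is only meaningful for iterable datasets, where it repeats indefinitely. A small sketch based only on those assertions; the toy data is illustrative:

```python
from datasets import Dataset

ds = Dataset.from_dict({"x": [0, 1, 2]})

# Finite repetition tiles every column, so the length triples.
assert ds.repeat(3)["x"] == [0, 1, 2] * 3

# Zero (or negative) counts yield an empty dataset; repeat(None) raises
# ValueError for map-style datasets, per test_repeat above.
assert len(ds.repeat(0)) == 0

# For iterable datasets, repeat(n) chains the stream n times; per the
# RepeatExamplesIterable test, num_times=None would cycle indefinitely.
assert [ex["x"] for ex in ds.to_iterable_dataset().repeat(2)] == [0, 1, 2] * 2
```
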
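
The new `test_map_async` / `test_filter_async` cases (for `Dataset` above and for `IterableDataset` in this file) show that `map` and `filter` now accept coroutine functions, and that 100 examples each awaiting 0.1 s finish in well under 2 s, i.e. the awaits are overlapped rather than run one by one. A hedged usage sketch; the sleep stands in for real async I/O such as an HTTP request:

```python
import asyncio

from datasets import Dataset

ds = Dataset.from_dict({"x": list(range(100))})


async def add_parity(example):
    # Placeholder for an I/O-bound await (request, file read, ...).
    await asyncio.sleep(0.1)
    return {"parity": example["x"] % 2}


async def keep_even(example):
    await asyncio.sleep(0.1)
    return example["x"] % 2 == 0


mapped = ds.map(add_parity)   # batched=True with async functions works too
evens = ds.filter(keep_even)
print(mapped[0], len(evens))
```
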
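
Finally, `test_format_arrow`, `test_format_pandas` and `test_format_polars` exercise the new `"arrow"`, `"pandas"` and `"polars"` output formats for `IterableDataset`, where single examples and `iter(batch_size=...)` batches come out as tables or DataFrames and `map` receives them directly. A sketch of the polars variant, mirrored from those tests; it assumes `polars` is installed (the test itself is gated by `require_polars`):

```python
import polars as pl

from datasets import Dataset

ids = Dataset.from_dict({"id": [0, 1, 2, 3]}).to_iterable_dataset()

# In the "polars" format, examples and batches are polars DataFrames.
pl_ds = ids.with_format("polars")
assert isinstance(next(iter(pl_ds)), pl.DataFrame)
assert isinstance(next(iter(pl_ds.iter(batch_size=2))), pl.DataFrame)

# map() then operates on DataFrames with the polars API.
pl_ds = pl_ds.map(lambda df: df.with_columns(pl.Series([0] * len(df)).alias("new_col")))

# Resetting the format yields plain python dicts again.
print(next(iter(pl_ds.with_format(None))))  # e.g. {'id': 0, 'new_col': 0}
```
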
assert len(next(iter(ds.iter(batch_size=4)))) == 4 + ds = ds.map(lambda df: df.with_columns(pl.Series([0] * len(df)).alias("new_col"))) + ds = ds.map(lambda df: df.with_columns(pl.Series([1] * len(df)).alias("new_col_batched")), batched=True) + ds = ds.with_format(None) + assert next(iter(ds)) == {**next(iter(dataset)), "new_col": 0, "new_col_batched": 1} + + @pytest.mark.parametrize("num_shards1, num_shards2, num_workers", [(2, 1, 1), (2, 2, 2), (1, 3, 1), (4, 3, 3)]) def test_interleave_dataset_with_sharding(num_shards1, num_shards2, num_workers): from torch.utils.data import DataLoader @@ -2217,7 +2419,7 @@ def test_iterable_dataset_batch(): assert len(batch["id"]) == 3 assert len(batch["text"]) == 3 assert batch["id"] == [3 * i, 3 * i + 1, 3 * i + 2] - assert batch["text"] == [f"Text {3*i}", f"Text {3*i+1}", f"Text {3*i+2}"] + assert batch["text"] == [f"Text {3 * i}", f"Text {3 * i + 1}", f"Text {3 * i + 2}"] # Check last partial batch assert len(batches[3]["id"]) == 1 @@ -2234,7 +2436,7 @@ def test_iterable_dataset_batch(): assert len(batch["id"]) == 3 assert len(batch["text"]) == 3 assert batch["id"] == [3 * i, 3 * i + 1, 3 * i + 2] - assert batch["text"] == [f"Text {3*i}", f"Text {3*i+1}", f"Text {3*i+2}"] + assert batch["text"] == [f"Text {3 * i}", f"Text {3 * i + 1}", f"Text {3 * i + 2}"] # Test with batch_size=4 (doesn't evenly divide dataset size) batched_ds = ds.batch(batch_size=4, drop_last_batch=False) @@ -2245,7 +2447,7 @@ def test_iterable_dataset_batch(): assert len(batch["id"]) == 4 assert len(batch["text"]) == 4 assert batch["id"] == [4 * i, 4 * i + 1, 4 * i + 2, 4 * i + 3] - assert batch["text"] == [f"Text {4*i}", f"Text {4*i+1}", f"Text {4*i+2}", f"Text {4*i+3}"] + assert batch["text"] == [f"Text {4 * i}", f"Text {4 * i + 1}", f"Text {4 * i + 2}", f"Text {4 * i + 3}"] # Check last partial batch assert len(batches[2]["id"]) == 2 diff --git a/tests/test_load.py b/tests/test_load.py index 5d551e8afbe..46a76749624 100644 --- a/tests/test_load.py +++ b/tests/test_load.py @@ -1172,11 +1172,11 @@ def test_load_dataset_builder_for_community_dataset_with_script_no_parquet_expor @pytest.mark.integration def test_load_dataset_builder_use_parquet_export_if_dont_trust_remote_code_keeps_features(): - dataset_name = "food101" - builder = datasets.load_dataset_builder(dataset_name, trust_remote_code=False) + repo_id = "ethz/food101" + builder = datasets.load_dataset_builder(repo_id, trust_remote_code=False) assert isinstance(builder, DatasetBuilder) assert builder.name == "parquet" - assert builder.dataset_name == dataset_name + assert builder.dataset_name == repo_id.split("/")[-1] assert builder.config.name == "default" assert list(builder.info.features) == ["image", "label"] assert builder.info.features["image"] == Image() diff --git a/tests/test_metadata_util.py b/tests/test_metadata_util.py index 962de3945e7..b6b45e1812f 100644 --- a/tests/test_metadata_util.py +++ b/tests/test_metadata_util.py @@ -279,7 +279,7 @@ def test_metadata_configs_incorrect_yaml(): def test_split_order_in_metadata_configs_from_exported_parquet_files_and_dataset_infos(): exported_parquet_files = [ { - "dataset": "beans", + "dataset": "AI-Lab-Makerere/beans", "config": "default", "split": "test", "url": "https://huggingface.co/datasets/beans/resolve/refs%2Fconvert%2Fparquet/default/test/0000.parquet", @@ -287,7 +287,7 @@ def test_split_order_in_metadata_configs_from_exported_parquet_files_and_dataset "size": 17707203, }, { - "dataset": "beans", + "dataset": "AI-Lab-Makerere/beans", "config": "default", 
"split": "train", "url": "https://huggingface.co/datasets/beans/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet", @@ -295,7 +295,7 @@ def test_split_order_in_metadata_configs_from_exported_parquet_files_and_dataset "size": 143780164, }, { - "dataset": "beans", + "dataset": "AI-Lab-Makerere/beans", "config": "default", "split": "validation", "url": "https://huggingface.co/datasets/beans/resolve/refs%2Fconvert%2Fparquet/default/validation/0000.parquet", @@ -305,7 +305,7 @@ def test_split_order_in_metadata_configs_from_exported_parquet_files_and_dataset ] dataset_infos = { "default": DatasetInfo( - dataset_name="beans", + dataset_name="AI-Lab-Makerere/beans", config_name="default", version="0.0.0", splits={ @@ -314,21 +314,21 @@ def test_split_order_in_metadata_configs_from_exported_parquet_files_and_dataset "num_bytes": 143996486, "num_examples": 1034, "shard_lengths": None, - "dataset_name": "beans", + "dataset_name": "AI-Lab-Makerere/beans", }, "validation": { "name": "validation", "num_bytes": 18525985, "num_examples": 133, "shard_lengths": None, - "dataset_name": "beans", + "dataset_name": "AI-Lab-Makerere/beans", }, "test": { "name": "test", "num_bytes": 17730506, "num_examples": 128, "shard_lengths": None, - "dataset_name": "beans", + "dataset_name": "AI-Lab-Makerere/beans", }, }, download_checksums={ diff --git a/tests/test_offline_util.py b/tests/test_offline_util.py index 22a372205ad..ed8ff49b815 100644 --- a/tests/test_offline_util.py +++ b/tests/test_offline_util.py @@ -1,9 +1,7 @@ from tempfile import NamedTemporaryFile -import huggingface_hub import pytest import requests -from packaging import version from datasets.utils.file_utils import fsspec_get, fsspec_head @@ -18,10 +16,8 @@ def test_offline_with_timeout(): requests.request("GET", "https://huggingface.co") with pytest.raises(requests.exceptions.Timeout): requests.request("GET", "https://huggingface.co", timeout=1.0) - # old versions of `huggingface_hub` don't have timeouts by default and don't allow to set timeouts in HfFileSystem - if version.parse(huggingface_hub.__version__) >= version.parse("0.23.0"): - with pytest.raises(requests.exceptions.Timeout), NamedTemporaryFile() as temp_file: - fsspec_get("hf://dummy", temp_file=temp_file) + with pytest.raises(requests.exceptions.Timeout), NamedTemporaryFile() as temp_file: + fsspec_get("hf://dummy", temp_file=temp_file) @pytest.mark.integration