Loading unoptimized parquet dataset throws an error

Hi, I'm trying to use StreamingDataset directly with a Parquet dataset, and it raises an error related to caching. Providing a custom cache dir doesn't help. Any pointers on how I can run this?

import litdata as ld

uri = "s3://my-bucket/my-data"

ld.index_parquet_dataset(uri, "index")

ds = ld.StreamingDataset(uri)

    317     if self._item_loader.__class__.__name__ != self._config["item_loader"]:
    318         item_loader = self._config["item_loader"]
--> 319         raise ValueError(f"Please, use Cache(..., item_loader={item_loader}(...))")
    320 else:
    321     if (
    322         len(self._config["data_format"]) == 1
    323         and self._config["data_format"][0].startswith("no_header_tensor")
    324         and not isinstance(self._item_loader, TokensLoader)
    325     ):

ValueError: Please, use Cache(..., item_loader=ParquetLoader(...))
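For readers following along, here is a minimal, self-contained sketch of the check that appears to raise the error, reconstructed only from the traceback above. The two classes are stand-ins that mimic class names and are not litdata's real loaders. The `config` dict and the idea that a loader other than `ParquetLoader` is used by default are assumptions for illustration.

```python
# Stand-in classes: they only reproduce the class *names* compared in the
# traceback. They are NOT litdata's real loaders.
class PyTreeLoader:  # assumed to be the loader used when none is passed
    pass


class ParquetLoader:
    pass


def check_item_loader(item_loader, config):
    """Mirror the name comparison shown at lines 317-319 of the traceback."""
    expected = config["item_loader"]
    if item_loader.__class__.__name__ != expected:
        raise ValueError(f"Please, use Cache(..., item_loader={expected}(...))")


# Assumption: the index created for the Parquet data records this loader name.
config = {"item_loader": "ParquetLoader"}


def try_check(loader):
    """Return 'ok' if the check passes, otherwise the error message."""
    try:
        check_item_loader(loader, config)
        return "ok"
    except ValueError as exc:
        return str(exc)


default_result = try_check(PyTreeLoader())    # mismatch, same message as above
explicit_result = try_check(ParquetLoader())  # names match, check passes
```

If this reading is right, the error is about a loader-name mismatch rather than the cache directory, which would explain why changing the cache dir has no effect. Following the hint in the error message, passing the loader explicitly, for example `ld.StreamingDataset(uri, item_loader=ParquetLoader())` with `ParquetLoader` imported from `litdata.streaming.item_loader`, may be worth trying. This is unverified against this litdata version.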
