8000 cudf/python/dask_cudf at main · rapidsai/cudf · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Latest commit

 

History

History
 
 

dask_cudf

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 
 
 

 Dask cuDF - A GPU Backend for Dask DataFrame

Dask cuDF (a.k.a. dask-cudf or dask_cudf) is an extension library for Dask DataFrame that provides a Pandas-like API for parallel and larger-than-memory DataFrame computing on GPUs. When installed, Dask cuDF is automatically registered as the "cudf" dataframe backend for Dask DataFrame.

Important

Dask cuDF does not provide support for multi-GPU or multi-node execution on its own. You must also deploy a distributed cluster (ideally with Dask-CUDA) to leverage multiple GPUs efficiently.

Using Dask cuDF

Please visit the official documentation page for detailed information about using Dask cuDF.

Installation

See the RAPIDS install page for the most up-to-date information and commands for installing Dask cuDF and other RAPIDS packages.

Resources

Quick-start example

A very common Dask cuDF use case is single-node multi-GPU data processing. These workflows typically use the following pattern:

import dask
import dask.dataframe as dd
from dask_cuda import LocalCUDACluster
from distributed import Client

if __name__ == "__main__":

  # Define a GPU-aware cluster to leverage multiple GPUs
  client = Client(
    LocalCUDACluster(
      CUDA_VISIBLE_DEVICES="0,1",  # Use two workers (on devices 0 and 1)
      rmm_pool_size=0.9,  # Use 90% of GPU memory as a pool for faster allocations
      enable_cudf_spill=True,  # Improve device memory stability
      local_directory="/fast/scratch/",  # Use fast local storage for spilling
    )
  )

  # Set the default dataframe backend to "cudf"
  dask.config.set({"dataframe.backend": "cudf"})

  # Create your DataFrame collection from on-disk
  # or in-memory data
  df = dd.read_parquet("/my/parquet/dataset/")

  # Use cudf-like syntax to transform and/or query your data
  query = df.groupby('item')['price'].mean()

  # Compute, persist, or write out the result
  query.head()

If you do not have multiple GPUs available, using LocalCUDACluster is optional. However, it is still a good idea to enable cuDF spilling.

If you wish to scale across multiple nodes, you will need to use a different mechanism to deploy your Dask-CUDA workers. Please see the RAPIDS deployment documentation for more instructions.

0