Releases: rapidsai/cudf
v22.04.00
🚨 Breaking Changes
- Drop unsupported method argument from nunique and distinct_count. (#10411) @bdice
- Refactor stream compaction APIs (#10370) @PointKernel
- Add scan_aggregation and reduce_aggregation derived types. (#10357) @nvdbaranec
- Avoid `decimal` type narrowing for decimal binops (#10299) @galipremsagar
- Rewrites `sample` API (#10262) @isVoid
- Remove probe-time null equality parameters in `cudf::hash_join` (#10260) @PointKernel
- Enable proper `Index` round-tripping in `orc` reader and writer (#10170) @galipremsagar
- Add JNI for `strings::split_re` and `strings::split_record_re` (#10139) @ttnghia
- Change cudf::strings::find_multiple to return a lists column (#10134) @davidwendt
- Remove the option to completely disable decimal128 columns in the ORC reader (#10127) @vuule
- Remove deprecated code (#10124) @vyasr
- Update gpu_utils.py to reflect current CUDA support. (#10113) @bdice
- Optimize compaction operations (#10030) @PointKernel
- Remove deprecated method Series.set_index. (#9945) @bdice
- Add cudf::strings::findall_record API (#9911) @davidwendt
- Upgrade `arrow` & `pyarrow` to `6.0.1` (#9686) @galipremsagar
🐛 Bug Fixes
- Fix an issue with tdigest merge aggregations. (#10506) @nvdbaranec
- Batch of fixes for index overflows in grid stride loops. (#10448) @nvdbaranec
- Update dask_cudf imports to be compatible with latest dask (#10442) @rlratzel
- Fix for integer overflow in contiguous-split (#10437) @jbrennan333
- Fix has_null predicate for drop_list_duplicates on nested structs (#10436) @sperlingxx
- Fix empty reduce with List output and non-List input (#10435) @sperlingxx
- Fix `list` and `struct` meta generation issue in `dask-cudf` (#10434) @galipremsagar
- Fix error in `cudf.to_numeric` when a `bool` input is passed (#10431) @galipremsagar
- Support cupy array in `quantile` input (#10429) @galipremsagar
- Fix benchmarks to work with new aggregation types (#10428) @davidwendt
- Fix cudf::shift to handle offset greater than column size (#10414) @davidwendt
- Fix lifespan of the temporary directory that holds cuFile configuration file (#10403) @vuule
- Fix error thrown in compiled-binaryop benchmark (#10398) @davidwendt
- Limiting async allocator using alignment of 512 (#10395) @rongou
- Include <optional> in multibyte split. (#10385) @bdice
- Fix issue with column and scalar re-assignment (#10377) @galipremsagar
- Fix floating point data generation in benchmarks (#10372) @vuule
- Avoid overflow in fused_concatenate_kernel output_index (#10344) @abellina
- Remove is_relationally_comparable for table device views (#10342) @davidwendt
- Fix debug compile error in device_span to column_view conversion (#10331) @davidwendt
- Add Pascal support to JCUDF transcode (row_conversion) (#10329) @mythrocks
- Fix `std::bad_alloc` exception due to JIT reserving a huge buffer (#10317) @ttnghia
- Fixes up the overflowed fixed-point round on nullable column (#10316) @sperlingxx
- Fix DataFrame slicing issues for empty cases (#10310) @brandon-b-miller
- Fix documentation issues (#10307) @ajschmidt8
- Allow Java bindings to use default decimal precisions when writing columns (#10276) @sperlingxx
- Fix incorrect slicing of GDS read/write calls (#10274) @vuule
- Fix out-of-memory error in compiled-binaryop benchmark (#10269) @davidwendt
- Add tests of reflected ufuncs and fix behavior of logical reflected ufuncs (#10261) @vyasr
- Remove probe-time null equality parameters in `cudf::hash_join` (#10260) @PointKernel
- Fix out-of-memory error in UrlDecode benchmark (#10258) @davidwendt
- Fix groupby reductions that perform operations on source type instead of target type (#10250) @ttnghia
- Fix small leak in explode (#10245) @revans2
- Yet another small JNI memory leak (#10238) @revans2
- Fix regex octal parsing to limit to 3 characters (#10233) @davidwendt
- Fix string to decimal128 conversion handling large exponents (#10231) @davidwendt
- Fix JNI leak on copy to device (#10229) @revans2
- Fix the data generator element size for decimal types (#10225) @vuule
- Fix `decimal` metadata in parquet writer (#10224) @galipremsagar
- Fix strings handling of hex in regex pattern (#10220) @davidwendt
- Fix docs builds (#10216) @ajschmidt8
- Fix a leftover _has_nulls change from Nullate (#10211) @devavret
- Fix bitmask of the output for JNI of `lists::drop_list_duplicates` (#10210) @ttnghia
- Fix compile error in `binaryop/compiled/util.cpp` (#10209) @ttnghia
- Skip ORC and Parquet readers' benchmark cases that are not currently supported (#10194) @vuule
- Fix JNI leak of a cudf::column_view native class. (#10171) @revans2
- Enable proper `Index` round-tripping in `orc` reader and writer (#10170) @galipremsagar
- Convert Column Name to String Before Using Struct Column Factory (#10156) @isVoid
- Preserve the correct `ListDtype` while creating an identical empty column (#10151) @galipremsagar
- benchmark fixture - static object pointer fix (#10145) @karthikeyann
- Fix UDF Caching (#10133) @brandon-b-miller
- Raise duplicate column error in `DataFrame.rename` (#10120) @galipremsagar
- Fix flaky memory usage test by guaranteeing array size. (#10114) @vyasr
- Encode values from python callback for C++ (#10103) @jdye64
- Add check for regex instructions causing an infinite-loop (#10095) @davidwendt
- Remove metadata singleton from nvtext normalizer (#10090) @davidwendt
- Column equality testing fixes (#10011) @brandon-b-miller
- Pin libcudf runtime dependency for cudf / libcudf-kafka nightlies (#9847) @charlesbluca
📖 Documentation
- Fix documentation for DataFrame.corr and Series.corr. (#10493) @bdice
- Add `cut` to API docs (#10479) @shwina
- Remove documentation for methods removed in #10124. (#10366) @bdice
- Fix documentation issues (#10306) @ajschmidt8
- Fix `fixed_point` binary operation documentation (#10198) @codereport
- Remove cleaned up methods from docs (#10189) @galipremsagar
- Update developer guide to recommend no default stream parameter. (#10136) @bdice
- Update benchmarking guide to use NVBench. (#10093) @bdice
🚀 New Features
- Add StringIO support to read_text (#10465) @cwharris
- Add support for tdigest and merge_tdigest aggregations through cudf::reduce (#10433) @nvdbaranec
- JNI support for Collect Ops in Reduction (#10427) @sperlingxx
- Enable read_text with dask_cudf using byte_range (#10407) @ChrisJar
- Add `cudf::stable_sort_by_key` (#10387) @PointKernel
- Implement `maps_column_view` abstraction over `LIST<STRUCT<K,V>>` (#10380) @mythrocks
- Support Java bindings for Avro reader (#10373) @HaoYang670
- Refactor stream compaction APIs (#10370) @PointKernel
- Support collect aggregations in reduction (#10353) @sperlingxx
- Refactor array_ufunc for Index and unify across all classes (#10346) @vyasr
- Add JNI for extract_list_element with index column (#10341) @firestarman
- Support `min` and `max` operations for structs in rolling window (#10332) @ttnghia
- Add device create_sequence_table for benchmarks (#10300) @karthikeyann
- Enable numpy ufuncs for DataFrame (#10287) @vyasr
- move input generation for json benchmark to device (#10281) @karthikeyann
- move input generation for type dispatcher benchmark to device (#10280) @karthikeyann
- move input generation for copy benchmark to device (#10279) @karthikeyann
- generate url decode benchmark input in device (#10278) @karthikeyann
- device input generation in join bench (#10277) @karthikeyann
- Add nvtext::byte_pair_encoding API (#10270) @davidwendt
- Prevent internal usage of expensive APIs (#10263) @vyasr
- Column to JCUDF row for tables with strings (#10235) @hyperbolic2346
- Support `percent_rank()` aggregation (#10227) @mythrocks
- Refactor Series.array_ufunc (#10217) @vyasr
- Reduce pytest runtime (#10203) @brandon-b-miller
- Add regex flags parameter to python cudf strings split (#10185) @davidwendt
- Support for `MOD`, `PMOD` and `PYMOD` for `decimal32/64/128` (#10179) @codereport
- Adding string row size iterator for row to column and column to row conversion (#10157) @hyperbolic2346
- Add file size counter to cuIO benchmarks (#10154) @vuule
- byte_range support for multibyte_split/read_text (#10150) @cwharris
- Add JNI for `strings::split_re` and `strings::split_record_re` (#10139) @ttnghia
- Add `maxSplit` parameter to Java binding for `strings:split` (#10137) @ttnghia
- Add libcudf strings split API that accepts regex pattern (#10128) @davidwendt
- generate benchmark input in device (#10109) @karthikeyann
- Avoid `nan_as_null` op if `nan_count` is 0 (#10082) @galipremsagar
- Add Dataframe and Index nunique (#10077) @martinfalisse
- Support nanosecond timestamps in parquet (#10063) @PointKernel
- Java bindings for mixed semi and anti joins (#10040) @jlowe
- Implement mixed equality/conditional semi/anti joins (#10037) @vyasr
- Optimize compaction operations (#10030) @PointKernel
- Support `args=` in `Series.apply` (#9982) @brandon-b-miller
- Add cudf::strings::findall_record API (#9911) @davidwendt
- Add covariance for sort groupby (python) (#9889) @mayankanand007
- Implement DataFrame diff() (#9817) @skirui-source
- Implement DataFrame pct_change (#9805) @skirui-source
- Support segmented reductions and null mask reductions (#9621) @isVoid
- Add 'spearman' correlation method for `dataframe.corr` and `series.corr` (#7141) @dominicshanshan
🛠️ Improvements
- Add `scipy` skip for a test (#10502) @galipremsagar
- Temporarily disable new `ops-bot` functionality (#10496) @ajschmidt8
- Include <cstddef> to fix compilation of parquet reader on GCC 11. (#10483) @bdice
- Pin `dask` and `distributed` (#10481) @galipremsagar
- MD5 refactoring. (#10445) @bdice
- Remove or split up Frame methods that use the index (#10439) @vyasr
- Centralization of tdigest aggregation code. (#10422) @nvdbaranec
- Simplify column binary operations (#10421) @vyasr
- Add `.github/ops-bot.yaml` config file (#10420) @ajschmidt8
- Use list of columns for methods in `Groupby.pyx` (#10419) @isVoid
- Remov...
v22.02.00
🚨 Breaking Changes
- ORC writer API changes for granular statistics (#10058) @mythrocks
- `decimal128` Support for `to/from_arrow` (#9986) @codereport
- Remove deprecated method `one_hot_encoding` (#9977) @isVoid
- Remove str.subword_tokenize (#9968) @VibhuJawa
- Remove deprecated `method` parameter from `merge` and `join`. (#9944) @bdice
- Remove deprecated method DataFrame.hash_columns. (#9943) @bdice
- Remove deprecated method Series.hash_encode. (#9942) @bdice
- Refactoring ceil/round/floor code for datetime64 types (#9926) @mayankanand007
- Introduce `nan_as_null` parameter for `cudf.Index` (#9893) @galipremsagar
- Add regex_flags parameter to strings replace_re functions (#9878) @davidwendt
- Break tie for `top` categorical columns in `Series.describe` (#9867) @isVoid
- Add partitioning support in parquet writer (#9810) @devavret
- Move `drop_duplicates`, `drop_na`, `_gather`, `take` to IndexFrame and create their `_base_index` counterparts (#9807) @isVoid
- Raise temporary error for `decimal128` types in parquet reader (#9804) @galipremsagar
- Change default `dtype` of all nulls column from `float` to `object` (#9803) @galipremsagar
- Remove unused masked udf cython/c++ code (#9792) @brandon-b-miller
- Pick smallest decimal type with required precision in ORC reader (#9775) @vuule
- Add decimal128 support to Parquet reader and writer (#9765) @vuule
- Refactor TableTest assertion methods to a separate utility class (#9762) @jlowe
- Use cuFile direct device reads/writes by default in cuIO (#9722) @vuule
- Match pandas scalar result types in reductions (#9717) @brandon-b-miller
- Add parameters to control row group size in Parquet writer (#9677) @vuule
- Refactor bit counting APIs, introduce valid/null count functions, and split host/device side code for segmented counts. (#9588) @bdice
- Add support for `decimal128` in cudf python (#9533) @galipremsagar
- Implement `lists::index_of()` to find positions in list rows (#9510) @mythrocks
- Rewriting row/column conversions for Spark <-> cudf data conversions (#8444) @hyperbolic2346
🐛 Bug Fixes
- Add check for negative stripe index in ORC reader (#10074) @vuule
- Update Java tests to expect DECIMAL128 from Arrow (#10073) @jlowe
- Avoid index materialization when `DataFrame` is created with un-named `Series` objects (#10071) @galipremsagar
- fix gcc 11 compilation errors (#10067) @rongou
- Fix `columns` ordering issue in parquet reader (#10066) @galipremsagar
- Fix dataframe setitem with `ndarray` types (#10056) @galipremsagar
- Remove implicit copy due to conversion from cudf::size_type and size_t (#10045) @robertmaynard
- Include <optional> in headers that use std::optional (#10044) @robertmaynard
- Fix repr and concat of `StructColumn` (#10042) @galipremsagar
- Include row group level stats when writing ORC files (#10041) @vuule
- build.sh respects the `--build_metrics` and `--incl_cache_stats` flags (#10035) @robertmaynard
- Fix memory leaks in JNI native code. (#10029) @mythrocks
- Update JNI to use new arena mr constructor (#10027) @rongou
- Fix null check when comparing structs in `arg_min` operation of reduction/groupby (#10026) @ttnghia
- Wrap CI script shell variables in quotes to fix local testing. (#10018) @bdice
- cudftestutil no longer propagates compiler flags to external users (#10017) @robertmaynard
- Remove `CUDA_DEVICE_CALLABLE` macro usage (#10015) @hyperbolic2346
- Add missing list filling header in meta.yaml (#10007) @devavret
- Fix `conda` recipes for `custreamz` & `cudf_kafka` (#10003) @ajschmidt8
- Fix matching regex word-boundary (\b) in strings replace (#9997) @davidwendt
- Fix null check when comparing structs in `min` and `max` reduction/groupby operations (#9994) @ttnghia
- Fix octal pattern matching in regex string (#9993) @davidwendt
- `decimal128` Support for `to/from_arrow` (#9986) @codereport
- Fix groupby shift/diff/fill after selecting from a `GroupBy` (#9984) @shwina
- Fix the overflow problem of decimal rescale (#9966) @sperlingxx
- Use default value for decimal precision in parquet writer when not specified (#9963) @devavret
- Fix cudf java build error. (#9958) @firestarman
- Use gpuci_mamba_retry to install local artifacts. (#9951) @bdice
- Fix regression HostColumnVectorCore requiring native libs (#9948) @jlowe
- Rename aggregate_metadata in writer to fix name collision (#9938) @devavret
- Fixed issue with percentile_approx where output tdigests could have uninitialized data at the end. (#9931) @nvdbaranec
- Resolve racecheck errors in ORC kernels (#9916) @vuule
- Fix the java build after parquet partitioning support (#9908) @revans2
- Fix compilation of benchmark for parquet writer. (#9905) @bdice
- Fix a memcheck error in ORC writer (#9896) @vuule
- Introduce `nan_as_null` parameter for `cudf.Index` (#9893) @galipremsagar
- Fix fallback to sort aggregation for grouping only hash aggregate (#9891) @abellina
- Add zlib to cudfjni link when using static libcudf library dependency (#9890) @jlowe
- TimedeltaIndex constructor raises an AttributeError. (#9884) @skirui-source
- Fix cudf.Scalar string datetime construction (#9875) @brandon-b-miller
- Load libcufile.so with RTLD_NODELETE flag (#9872) @vuule
- Break tie for `top` categorical columns in `Series.describe` (#9867) @isVoid
- Fix null handling for structs `min` and `arg_min` in groupby, groupby scan, reduction, and inclusive_scan (#9864) @ttnghia
- Add one-level list encoding support in parquet reader (#9848) @PointKernel
- Fix an out-of-bounds read in validity copying in contiguous_split. (#9842) @nvdbaranec
- Fix join of MultiIndex to Index with one column and overlapping name. (#9830) @vyasr
- Fix caching in `Series.applymap` (#9821) @brandon-b-miller
- Enforce boolean `ascending` for dask-cudf `sort_values` (#9814) @charlesbluca
- Fix ORC writer crash with empty input columns (#9808) @vuule
- Change default `dtype` of all nulls column from `float` to `object` (#9803) @galipremsagar
- Load native dependencies when Java ColumnView is loaded (#9800) @jlowe
- Fix dtype-argument bug in dask_cudf read_csv (#9796) @rjzamora
- Fix overflow for min calculation in strings::from_timestamps (#9793) @revans2
- Fix memory error due to lambda return type deduction limitation (#9778) @karthikeyann
- Revert regex $/EOL end-of-string new-line special case handling (#9774) @davidwendt
- Fix missing streams (#9767) @karthikeyann
- Fix make_empty_scalar_like on list_type (#9759) @sperlingxx
- Update cmake and conda to 22.02 (#9746) @devavret
- Fix out-of-bounds memory write in decimal128-to-string conversion (#9740) @davidwendt
- Match pandas scalar result types in reductions (#9717) @brandon-b-miller
- Fix regex non-multiline EOL/$ matching strings ending with a new-line (#9715) @davidwendt
- Fixed build by adding more checks for int8, int16 (#9707) @razajafri
- Fix `null` handling when `boolean` dtype is passed (#9691) @galipremsagar
- Fix stream usage in `segmented_gather()` (#9679) @mythrocks
📖 Documentation
- Update `decimal` dtypes related docs entries (#10072) @galipremsagar
- Fix regex doc describing hexadecimal escape characters (#10009) @davidwendt
- Fix cudf compilation instructions. (#9956) @esoha-nvidia
- Fix see also links for IO APIs (#9895) @galipremsagar
- Fix build instructions for libcudf doxygen (#9837) @davidwendt
- Fix some doxygen warnings and add missing documentation (#9770) @karthikeyann
- update cuda version in local build (#9736) @karthikeyann
- Fix doxygen for enum types in libcudf (#9724) @davidwendt
- Spell check fixes (#9682) @karthikeyann
- Fix links in C++ Developer Guide. (#9675) @bdice
🚀 New Features
- Remove libcudacxx patch needed for nvcc 11.4 (#10057) @robertmaynard
- Allow CuPy 10 (#10048) @jakirkham
- Add in support for NULL_LOGICAL_AND and NULL_LOGICAL_OR binops (#10016) @revans2
- Add `groupby.transform` (only support for aggregations) (#10005) @shwina
- Add partitioning support to Parquet chunked writer (#10000) @devavret
- Add jni for sequences (#9972) @wbo4958
- Java bindings for mixed left, inner, and full joins (#9941) @jlowe
- Java bindings for JSON reader support (#9940) @wbo4958
- Enable transpose for string columns in cudf python (#9937) @galipremsagar
- Support structs for `cudf::contains` with column/scalar input (#9929) @ttnghia
- Implement mixed equality/conditional joins (#9917) @vyasr
- Add cudf::strings::extract_all API (#9909) @davidwendt
- Implement JNI for `cudf::scatter` APIs (#9903) @ttnghia
- JNI: Function to copy and set validity from bool column. (#9901) @mythrocks
- Add dictionary support to cudf::copy_if_else (#9887) @davidwendt
- add run_benchmarks target for running benchmarks with json output (#9879) @karthikeyann
- Add regex_flags parameter to strings replace_re functions (#9878) @davidwendt
- Add_suffix and add_prefix for DataFrames and Series (#9846) @mayankanand007
- Add JNI for `cudf::drop_duplicates` (#9841) @ttnghia
- Implement per-list sequence (#9839) @ttnghia
- adding `series.transpose` (#9835) @mayankanand007
- Adding support for `Series.autocorr` (#9833) @mayankanand007
- Support round operation on datetime64 datatypes (#9820) @mayankanand007
- Add partitioning support in parquet writer (#9810) @devavret
- Raise temporary error for `decimal128` types in parquet reader (#9804) @galipremsagar
- Add decimal128 support to Parquet reader and writer (#9765) @vuule
- Optimize `groupby::scan` (#9754) @PointKernel
- Add sample JNI API (#9728) @res-life
- Support `min` and `max` in inclusive scan for structs (#9725) @ttnghia
- Add `first` and `last` method to `IndexedFrame` (#9710) @isVoid
- Support `min` and `max` reduction for structs (#9697) @ttnghia
- Add parameters to control row group size in Parquet writer (#9677) @vuule
- Run compute-sanitizer in nightly build (#9641) @karthikeyann
- Implement Series.datetime.floor (#9571) @skirui-source
- ceil/floor for `DatetimeIndex` (#9554) @mayankanand007
- Add support for `decimal128` in cudf python (#9533) @galipremsagar
- ...
v21.12.02
v21.12.01
v21.12.00
🚨 Breaking Changes
- Update `bitmask_and` and `bitmask_or` to return a pair of resulting mask and count of unset bits (#9616) @PointKernel
- Remove sizeof and standardize on memory_usage (#9544) @vyasr
- Add support for single-line regex anchors ^/$ in contains_re (#9482) @davidwendt
- Refactor sorting APIs (#9464) @vyasr
- Update Java nvcomp JNI bindings to nvcomp 2.x API (#9384) @jbrennan333
- Support Python UDFs written in terms of rows (#9343) @brandon-b-miller
- JNI: Support nested types in ORC writer (#9334) @firestarman
- Optionally nullify out-of-bounds indices in segmented_gather(). (#9318) @mythrocks
- Refactor cuIO timestamp processing with `cuda::std::chrono` (#9278) @PointKernel
- Various internal MultiIndex improvements (#9243) @vyasr
🐛 Bug Fixes
- Fix read_parquet bug for bytes input (#9669) @rjzamora
- Use `_gather` internal for `sort_*` (#9668) @isVoid
- Fix behavior of equals for non-DataFrame Frames and add tests. (#9653) @vyasr
- Dont recompute output size if it is already available (#9649) @abellina
- Fix read_parquet bug for extended dtypes from remote storage (#9638) @rjzamora
- add const when getting data from a JNI data wrapper (#9637) @wjxiz1992
- Fix debrotli issue on CUDA 11.5 (#9632) @vuule
- Use std::size_t when computing join output size (#9626) @jlowe
- Fix `usecols` parameter handling in `dask_cudf.read_csv` (#9618) @galipremsagar
- Add support for string `'nan', 'inf' & '-inf'` values while type-casting to `float` (#9613) @galipremsagar
- Avoid passing NativeFileDatasource to pyarrow in read_parquet (#9608) @rjzamora
- Fix test failure with cuda 11.5 in row_bit_count tests. (#9581) @nvdbaranec
- Correct _LIBCUDACXX_CUDACC_VER value computation (#9579) @robertmaynard
- Increase max RLE stream size estimate to avoid potential overflows (#9568) @vuule
- Fix edge case in tdigest scalar generation for groups containing all nulls. (#9551) @nvdbaranec
- Fix pytests failing in `cuda-11.5` environment (#9547) @galipremsagar
- compile libnvcomp with PTDS if requested (#9540) @jbrennan333
- Fix `segmented_gather()` for null LIST rows (#9537) @mythrocks
- Deprecate DataFrame.label_encoding, use private _label_encoding method internally. (#9535) @bdice
- Fix several test and benchmark issues related to bitmask allocations. (#9521) @nvdbaranec
- Fix for inserting duplicates in groupby result cache (#9508) @karthikeyann
- Fix mismatched types error in clip() when using non int64 numeric types (#9498) @davidwendt
- Match conda pinnings for style checks (revert part of #9412, #9433). (#9490) @bdice
- Make sure all dask-cudf supported aggs are handled in `_tree_node_agg` (#9487) @charlesbluca
- Resolve `hash_columns` `FutureWarning` in `dask_cudf` (#9481) @pentschev
- Add fixed point to AllTypes in libcudf unit tests (#9472) @karthikeyann
- Fix regex handling of embedded null characters (#9470) @davidwendt
- Fix memcheck error in copy-if-else (#9467) @davidwendt
- Fix bug in dask_cudf.read_parquet for index=False (#9453) @rjzamora
- Preserve the decimal scale when creating a default scalar (#9449) @revans2
- Push down parent nulls when flattening nested columns. (#9443) @mythrocks
- Fix memcheck error in gtest SegmentedGatherTest/GatherSliced (#9442) @davidwendt
- Revert "Fix quantile division / partition handling for dask-cudf sort… (#9438) @charlesbluca
- Allow int-like objects for the `decimals` argument in `round` (#9428) @shwina
- Fix stream compaction's `drop_duplicates` API to use stable sort (#9417) @ttnghia
- Skip Comparing Uniform Window Results in Var/std Tests (#9416) @isVoid
- Fix `StructColumn.to_pandas` type handling issues (#9388) @galipremsagar
- Correct issues in the build dir cudf-config.cmake (#9386) @robertmaynard
- Fix Java table partition test to account for non-deterministic ordering (#9385) @jlowe
- Fix timestamp truncation/overflow bugs in orc/parquet (#9382) @PointKernel
- Fix the crash in stats code (#9368) @devavret
- Make Series.hash_encode results reproducible. (#9366) @bdice
- Fix libcudf compile warnings on debug 11.4 build (#9360) @davidwendt
- Fail gracefully when compiling python UDFs that attempt to access columns with unsupported dtypes (#9359) @brandon-b-miller
- Set pass_filenames: false in mypy pre-commit configuration. (#9349) @bdice
- Fix cudf_assert in cudf::io::orc::gpu::gpuDecodeOrcColumnData (#9348) @davidwendt
- Fix memcheck error in groupby-tdigest get_scalar_minmax (#9339) @davidwendt
- Optimizations for `cudf.concat` when `axis=1` (#9333) @galipremsagar
- Use f-string in join helper warning message. (#9325) @bdice
- Avoid casting to list or struct dtypes in dask_cudf.read_parquet (#9314) @rjzamora
- Fix null count in statistics for parquet (#9303) @devavret
- Potential overflow of `decimal32` when casting to `int64_t` (#9287) @codereport
- Fix quantile division / partition handling for dask-cudf sort on null dataframes (#9259) @charlesbluca
- Updating cudf version also updates rapids cmake branch (#9249) @robertmaynard
- Implement `one_hot_encoding` in libcudf and bind to python (#9229) @isVoid
- BUG FIX: CSV Writer ignores the header parameter when no metadata is provided (#8740) @skirui-source
📖 Documentation
- Update Documentation to use `TYPED_TEST_SUITE` (#9654) @codereport
- Add dedicated page for `StringHandling` in python docs (#9624) @galipremsagar
- Update docstring of `DataFrame.merge` (#9572) @galipremsagar
- Use raw strings to avoid SyntaxErrors in parsed docstrings. (#9526) @bdice
- Add example to docstrings in `rolling.apply` (#9522) @isVoid
- Update help message to escape quotes in ./build.sh --cmake-args. (#9494) @bdice
- Improve Python docstring formatting. (#9493) @bdice
- Update table of I/O supported types (#9476) @vuule
- Document invalid regex patterns as undefined behavior (#9473) @davidwendt
- Miscellaneous documentation fixes to `cudf` (#9471) @galipremsagar
- Fix many documentation errors in libcudf. (#9355) @karthikeyann
- Fixing SubwordTokenizer docs issue (#9354) @mayankanand007
- Improved deprecation warnings. (#9347) @bdice
- doc reorder mr, stream to stream, mr (#9308) @karthikeyann
- Deprecate method parameters to DataFrame.join, DataFrame.merge. (#9291) @bdice
- Added deprecation warning for `.label_encoding()` (#9289) @mayankanand007
🚀 New Features
- Enable Series.divide and DataFrame.divide (#9630) @vyasr
- Update `bitmask_and` and `bitmask_or` to return a pair of resulting mask and count of unset bits (#9616) @PointKernel
- Add handling of mixed numeric types in `to_dlpack` (#9585) @galipremsagar
- Support re.Pattern object for pat arg in str.replace (#9573) @davidwendt
- Add JNI for `lists::drop_list_duplicates` with keys-values input column (#9553) @ttnghia
- Support structs column in `min`, `max`, `argmin` and `argmax` groupby aggregate() and scan() (#9545) @ttnghia
- Move libcudacxx to use `rapids_cpm` and use newer versions (#9539) @robertmaynard
- Add scan min/max support for chrono types to libcudf reduction-scan (not groupby scan) (#9518) @davidwendt
- Support `args=` in `apply` (#9514) @brandon-b-miller
- Add groupby scan min/max support for strings values (#9502) @davidwendt
- Add list output option to character_ngrams() function (#9499) @davidwendt
- More granular column selection in ORC reader (#9496) @vuule
- add min_periods, ddof to groupby covariance, & correlation aggregation (#9492) @karthikeyann
- Implement Series.datetime.floor (#9488) @skirui-source
- Enable linting of CMake files using pre-commit (#9484) @vyasr
- Add support for single-line regex anchors ^/$ in contains_re (#9482) @davidwendt
- Augment `order_by` to Accept a List of `null_precedence` (#9455) @isVoid
- Add format API for list column of strings (#9454) @davidwendt
- Enable Datetime/Timedelta dtypes in Masked UDFs (#9451) @brandon-b-miller
- Add cudf python groupby.diff (#9446) @karthikeyann
- Implement `lists::stable_sort_lists` for stable sorting of elements within each row of lists column (#9425) @ttnghia
- add ctest memcheck using cuda-sanitizer (#9414) @karthikeyann
- Support Unary Operations in Masked UDF (#9409) @isVoid
- Move Several Series Function to Frame (#9394) @isVoid
- MD5 Python hash API (#9390) @bdice
- Add cudf strings is_title API (#9380) @davidwendt
- Enable casting to int64, uint64, and double in AST code. (#9379) @vyasr
- Add support for writing ORC with map columns (#9369) @vuule
- extract_list_elements() with column_view indices (#9367) @mythrocks
- Reimplement `lists::drop_list_duplicates` for keys-values lists columns (#9345) @ttnghia
- Support Python UDFs written in terms of rows (#9343) @brandon-b-miller
- JNI: Support nested types in ORC writer (#9334) @firestarman
- Optionally nullify out-of-bounds indices in segmented_gather(). (#9318) @mythrocks
- Add shallow hash function and shallow equality comparison for column_view (#9312) @karthikeyann
- Add CudaMemoryBuffer for cudaMalloc memory using RMM cuda_memory_resource (#9311) @rongou
- Add parameters to control row index stride and stripe size in ORC writer (#9310) @vuule
- Add `na_position` param to dask-cudf `sort_values` (#9264) @charlesbluca
- Add `ascending` parameter for dask-cudf `sort_values` (#9250) @charlesbluca
- New array conversion methods (#9236) @vyasr
- Series `apply` method backed by masked UDFs (#9217) @brandon-b-miller
- Grouping by frequency and resampling (#9178) @shwina
- Pure-python masked UDFs (#9174) @brandon-b-miller
- Add Covariance, Pearson correlation for sort groupby (libcudf) (#9154) @karthikeyann
- Add `calendrical_month_sequence` in c++ and `date_range` in python (#8886) @shwina
🛠️ Improvements
- Followup to PR 9088 comments (#9659) @cwharris
- Update cuCollections to version that supports installed libcudacxx (#9633) @robertmaynard
- Add `11.5` dev.yml to `cudf` (#9617) @galipremsagar
- Add `xfail` for parquet reader `11.5` issue (#9612) @galipremsagar
- remove deprecated Rmm.initialize method (#9607) @rongou
- Use HostColumnVectorCore for ch...
v21.10.01
v21.10.00
🚨 Breaking Changes
- Remove Cython APIs for table view generation (#9199) @vyasr
- Upgrade `pandas` version in `cudf` (#9147) @galipremsagar
- Make AST operators nullable (#9096) @vyasr
- Remove the option to pass data types as strings to `read_csv` and `read_json` (#9079) @vuule
- Update JNI java CSV APIs to not use deprecated API (#9066) @revans2
- Support additional format specifiers in from_timestamps (#9047) @davidwendt
- Expose expression base class publicly and simplify public AST API (#9045) @vyasr
- Add support for struct type in ORC writer (#9025) @vuule
- Remove aliases of various api.types APIs from utils.dtypes. (#9011) @vyasr
- Java bindings for conditional join output sizes (#9002) @jlowe
- Move compute_column API out of ast namespace (#8957) @vyasr
- `cudf.dtype` function (#8949) @shwina
- Refactor Frame reductions (#8944) @vyasr
- Add nested column selection to parquet reader (#8933) @devavret
- JNI Aggregation Type Changes (#8919) @revans2
- Add groupby_aggregation and groupby_scan_aggregation classes and force their usage. (#8906) @nvdbaranec
- Expand CSV and JSON reader APIs to accept `dtypes` as a vector or map of `data_type` objects (#8856) @vuule
- Change cudf docs theme to pydata theme (#8746) @galipremsagar
- Enable compiled binary ops in libcudf, python and java (#8741) @karthikeyann
- Make groupby transform-like op order match original data order (#8720) @isVoid
🐛 Bug Fixes
- `fixed_point` `cudf::groupby` for `mean` aggregation (#9296) @codereport
- Fix `interleave_columns` when the input string lists column having empty child column (#9292) @ttnghia
- Update nvcomp to include fixes for installation of headers (#9276) @devavret
- Fix Java column leak in testParquetWriteMap (#9271) @jlowe
- Fix call to thrust::reduce_by_key in argmin/argmax libcudf groupby (#9263) @davidwendt
- Fixing empty input to getMapValue crashing (#9262) @hyperbolic2346
- Fix duplicate names issue in `MultiIndex.deserialize` (#9258) @galipremsagar
- `Dataframe.sort_index` optimizations (#9238) @galipremsagar
- Temporarily disabling problematic test in parquet writer (#9230) @devavret
- Explicitly disable groupby on unsupported key types. (#9227) @mythrocks
- Fix `gather` for sliced input structs column (#9218) @ttnghia
- Fix JNI code for left semi and anti joins (#9207) @jlowe
- Only install thrust when using a non 'system' version (#9206) @robertmaynard
- Remove zlib from libcudf public CMake dependencies (#9204) @robertmaynard
- Fix out-of-bounds memory read in orc gpuEncodeOrcColumnData (#9196) @davidwendt
- Fix `gather()` for `STRUCT` inputs with no nulls in members. (#9194) @mythrocks
- get_cucollections properly uses rapids_cpm_find (#9189) @robertmaynard
- rapids-export correctly reference build code block and doc strings (#9186) @robertmaynard
- Fix logic while parsing the sum statistic for numerical orc columns (#9183) @ayushdg
- Add handling for nulls in `dask_cudf.sorting.quantile_divisions` (#9171) @charlesbluca
- Approximate overflow detection in ORC statistics (#9163) @vuule
- Use decimal precision metadata when reading from parquet files (#9162) @shwina
- Fix variable name in Java build script (#9161) @jlowe
- Import rapids-cmake modules using the correct cmake variable. (#9149) @robertmaynard
- Fix conditional joins with empty left table (#9146) @vyasr
- Fix joining on indexes with duplicate level names (#9137) @shwina
- Fixes missing child column name in dtype while reading ORC file. (#9134) @rgsl888prabhu
- Apply type metadata after column is slice-copied (#9131) @isVoid
- Fix a bug: inner_join_size return zero if build table is empty (#9128) @PointKernel
- Fix multi hive-partition parquet reading in dask-cudf (#9122) @rjzamora
- Support null literals in expressions (#9117) @vyasr
- Fix cudf::hash_join output size for struct joins (#9107) @jlowe
- Import fix (#9104) @shwina
- Fix cudf::strings::is_fixed_point checking of overflow for decimal32 (#9093) @davidwendt
- Fix branch_stack calculation in `row_bit_count()` (#9076) @mythrocks
- Fetch rapids-cmake to work around cuCollection cmake issue (#9075) @jlowe
- Fix compilation errors in groupby benchmarks. (#9072) @nvdbaranec
- Preserve float16 upscaling (#9069) @galipremsagar
- Fix memcheck read error in libcudf contiguous_split (#9067) @davidwendt
- Add support for reading ORC file with no row group index (#9060) @rgsl888prabhu
- Various multiindex related fixes (#9036) @shwina
- Avoid rebuilding cython in build.sh (#9034) @brandon-b-miller
- Add support for percentile dispatch in `dask_cudf` (#9031) @galipremsagar
- cudf resolve nvcc 11.0 compiler crashes during codegen (#9028) @robertmaynard
- Fetch correct grouping keys `agg` of dask groupby (#9022) @galipremsagar
- Allow `where()` to work with a Series and `other=cudf.NA` (#9019) @sarahyurick
- Use correct index when returning Series from `GroupBy.apply()` (#9016) @charlesbluca
- Fix `Dataframe` indexer setitem when array is passed (#9006) @galipremsagar
- Fix ORC reading of files with struct columns that have null values (#9005) @vuule
- Ensure JNI native libraries load when CompiledExpression loads (#8997) @jlowe
- Fix memory read error in get_dremel_data in page_enc.cu (#8995) @davidwendt
- Fix memory write error in get_list_child_to_list_row_mapping utility (#8994) @davidwendt
- Fix debug compile error for csv_test.cpp (#8981) @davidwendt
- Fix memory read/write error in concatenate_lists_ignore_null (#8978) @davidwendt
- Fix concatenation of `cudf.RangeIndex` (#8970) @galipremsagar
- Java conditional joins should not require matching column counts (#8955) @jlowe
- Fix concatenate empty structs (#8947) @sperlingxx
- Fix cuda-memcheck errors for some libcudf functions (#8941) @davidwendt
- Apply series name to result of `SeriesGroupby.apply()` (#8939) @charlesbluca
- `cdef packed_columns` as `cppclass` instead of `struct` (#8936) @charlesbluca
- Inserting a `cudf.NA` into a DataFrame (#8923) @sarahyurick
- Support casting with Pandas dtype aliases (#8920) @sarahyurick
- Allow `sort_values` to accept same `kind` values as Pandas (#8912) @sarahyurick
- Enable casting to pandas nullable dtypes (#8889) @brandon-b-miller
- Fix libcudf memory errors (#8884) @karthikeyann
- Throw KeyError when accessing field from struct with nonexistent key (#8880) @NV-jpt
- replace auto with auto& ref for cast<&> (#8866) @karthikeyann
- Add missing include<optional> in binops (#8864) @karthikeyann
- Fix `select_dtypes` to work when non-class dtypes present in dataframe (#8849) @sarahyurick
- Re-enable JSON tests (#8843) @vuule
- Support header with embedded delimiter in csv writer (#8798) @davidwendt
📖 Documentation
- Add IO docs page in `cudf` documentation (#9145) @galipremsagar
- use correct namespace in cuio code examples (#9037) @cwharris
- Restructuring `Contributing doc` (#9026) @iskode
- Update stable version in readme (#9008) @galipremsagar
- Add spans and more include guidelines to libcudf developer guide (#8931) @harrism
- Update Java build instructions to mention Arrow S3 and Docker (#8867) @jlowe
- List GDS-enabled formats in the docs (#8805) @vuule
- Change cudf docs theme to pydata theme (#8746) @galipremsagar
🚀 New Features
- Revert "Add shallow hash function and shallow equality comparison for column_view (#9185)" (#9283) @karthikeyann
- Align `DataFrame.apply` signature with pandas (#9275) @brandon-b-miller
- Add struct type support for `drop_list_duplicates` (#9202) @ttnghia
- support CUDA async memory resource in JNI (#9201) @rongou
- Add shallow hash function and shallow equality comparison for column_view (#9185) @karthikeyann
- Superimpose null masks for STRUCT columns. (#9144) @mythrocks
- Implemented bindings for `ceil` timestamp operation (#9141) @shaneding
- Adding MAP type support for ORC Reader (#9132) @rgsl888prabhu
- Implement `interleave_columns` for lists with arbitrary nested type (#9130) @ttnghia
- Add python bindings to fixed-size window and groupby `rolling.var`, `rolling.std` (#9097) @isVoid
- Make AST operators nullable (#9096) @vyasr
- Java bindings for approx_percentile (#9094) @andygrove
- Add `dseries.struct.explode` (#9086) @isVoid
- Add support for BaseIndexer in Rolling APIs (#9085) @galipremsagar
- Remove the option to pass data types as strings to `read_csv` and `read_json` (#9079) @vuule
- Add handling for nested dicts in dask-cudf groupby (#9054) @charlesbluca
- Added Series.dt.is_quarter_start and Series.dt.is_quarter_end (#9046) @TravisHester
- Support nested types for nth_element reduction (#9043) @sperlingxx
- Update sort groupby to use non-atomic operation (#9035) @karthikeyann
- Add support for struct type in ORC writer (#9025) @vuule
- Implement `interleave_columns` for structs columns (#9012) @ttnghia
- Add groupby first and last aggregations (#9004) @shwina
- Add `DecimalBaseColumn` and move `as_decimal_column` (#9001) @isVoid
- Python/Cython bindings for multibyte_split (#8998) @jdye64
- Support scalar `months` in `add_calendrical_months`, extends API to INT32 support (#8991) @isVoid
- Added Series.dt.is_month_end (#8989) @TravisHester
- Support for using tdigests to compute approximate percentiles. (#8983) @nvdbaranec
- Support "unflatten" of columns flattened via `flatten_nested_columns()`: (#8956) @mythrocks
- Implement timestamp ceil (#8942) @shaneding
- Add nested column selection to parquet reader (#8933) @devavret
- Expose conditional join size calculation (#8928) @vyasr
- Support Nulls in Timeseries Generator (#8925) @isVoid
- Avoid index equality check in `_CPackedColumns.from_py_table()` (#8917) @charlesbluca
- Add dot product binary op (#8909) @charlesbluca
- Expose `days_in_month` function in libcudf and add python bindings (#8892) @isVoid
- Series string repeat (#8882) @sarahyurick
- Python binding for quarters (#8862) @shaneding
- Expand CSV and JSON reader APIs to accept `dtypes` as a vector or map of `data_type` objects (#8856) @vuule
- Add Java bindings for AST ...