Releases · Jianqoq/Hpt
v0.1.2
New Methods
- `from_raw`, allows the user to pass a raw pointer to create a new Tensor (see the sketch after this list)
- `forget`, checks the reference count and forgets the memory; you can use it to construct another library's Tensor
- `forget_copy`, clones the data and returns the cloned memory; this method doesn't need to check the reference count
- cpu `matmul_post`, allows the user to do a post calculation after matrix multiplication
- cuda `conv2d`, convolution, uses `cudnn` as the backend
- cuda `dw_conv2d`, depth-wise convolution, uses `cudnn` as the backend
- cuda `conv2d_group`, group convolution, uses `cudnn` as the backend
- cuda `batchnorm_conv2d`, convolution with batch normalization, uses `cudnn` as the backend
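A minimal sketch of the raw-pointer round trip that `from_raw` and `forget` enable. Only the method names come from the notes above; their exact signatures, safety contracts, and return types are assumptions, so treat this as illustrative rather than the library's actual API.

```rust
use hpt::Tensor;

fn main() {
    // Memory owned by "another library" (modeled here with a plain Vec).
    let mut data: Vec<f32> = vec![1.0; 16];
    let ptr = data.as_mut_ptr();
    std::mem::forget(data); // hand ownership over to the tensor side

    // SAFETY (assumed contract): `ptr` points to 16 valid, aligned f32
    // values matching the 4x4 shape. Signature is hypothetical.
    let t = unsafe { Tensor::<f32>::from_raw(ptr, &[4, 4]) }.expect("from_raw");

    // ... use `t` like any other tensor ...

    // `forget` checks the reference count and releases the buffer back to
    // the caller without freeing it (return type is an assumption).
    let ptr = unsafe { t.forget() }.expect("forget");

    // Rebuild the Vec so the memory is eventually freed.
    drop(unsafe { Vec::from_raw_parts(ptr as *mut f32, 16, 16) });
}
```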
Bug fixes
- batch matmul for CPU `matmul`
- wrong `max_nr` and `max_mr` for the bf16/f16 mixed-precision matmul kernel
- wrong conversion from CPU to CUDA Tensor when the CPU Tensor is not contiguous
- wrong usage of cublas in `matmul` for CUDA
Internal Changes
- added layout validation for `scatter` in CPU
- use fp16 instructions to convert f32 to f16 for Neon, speeding up all f16-related calculations on Neon
- allow f16 to convert to i16/u16 by using `fp16`
- refactored SIMD files to make them more maintainable and extensible
- re-exported cudarc
v0.1.1
v0.1.0
- fixed some docs issues
- implemented Matmul for CPU, supporting all primitive data types (usage sketch after this list)
- exposed `FFT` methods
- use `fp16` instructions for `f16` on Neon
- fixed wrong `fma` calculation for f32, f64 on Neon
- added Matmul, FFT benchmarks
- update `LRU_cache_size` after resizing the LRU cache
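A minimal usage sketch for the new CPU Matmul. Only `matmul` itself is named in the notes; the `randn` constructor and error handling shown here are assumptions.

```rust
use hpt::Tensor;

fn main() {
    // f32 shown here; per the notes, all primitive data types are supported.
    let a = Tensor::<f32>::randn(&[2, 3]).expect("alloc a");
    let b = Tensor::<f32>::randn(&[3, 4]).expect("alloc b");
    let c = a.matmul(&b).expect("matmul"); // result shape: [2, 4]
    println!("{}", c);
}
```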
v0.0.21
- refactored files
- fixed wrong calculation for reduction in the 1-dimensional case
- fixed save/load issue for CUDA
- simplified save/load API
- added tests for save/load for CPU and CUDA
- changed some method APIs, such as `selu`
- fixed many docs issues on the GitHub page
- made Rust docs consistent for tensor operators
v0.0.18
v0.0.17
- redesigned slice: changed `match_selection` to `select`, which now supports syntax like `select![1:2:3, .., 2:]`, similar to NumPy (see the sketch after this list)
- added support for a custom allocator; users can now use their own memory allocator
- `concat`, `vstack`, `hstack`, `dstack` are now moved to the `Concat` trait
- updated `concat`, `vstack`, `hstack`, `dstack` docs; fixed the `resize_cuda_lru_cache` doc
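A hedged sketch of the new slicing syntax. The notes only confirm the `select![1:2:3, .., 2:]` form; the `zeros` constructor and the way the macro attaches to a tensor are assumptions for illustration.

```rust
use hpt::{select, Tensor};

fn main() {
    let t = Tensor::<f32>::zeros(&[8, 8, 8]).expect("alloc");
    // dim 0: start 1, stop 2, step 3; dim 1: everything; dim 2: from index 2.
    // NumPy equivalent: t[1:2:3, :, 2:].
    let s = t.select(select![1:2:3, .., 2:]).expect("select");
    println!("{:?}", s.shape());
}
```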
v0.0.16
- added a CUDA kernel launch configuration checking function
- added single/list CUDA tensor saving/loading support
- added incremental compilation support for hpt-cudakernels, speeding up development
- added parallel nvcc compilation
- reimplemented reduce kernels; optimized and implemented reduce for CUDA for all reduction operators the CPU supports
- added `resize_lru_cache`, allowing users to control the LRU cache size
- renamed `set_lr_display_elements` to `set_display_elements`
- renamed `set_cuda_seed`, which now accepts a backend generic type
- added docs for `get_num_threads`, `set_num_threads`, `resize_lru_cache`, `set_display_elements`, `set_display_precision`, `set_seed` (usage sketch after this list)
- fixed a wrong CUDA tensor to CPU tensor conversion when the tensor is sliced
- simplified the display method implementation for CUDA, which now directly calls `to_cpu`
- added a reduce benchmark for CUDA on the GitHub page
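A short sketch of the runtime-configuration helpers documented in this release. The function names come from the notes above, but the import path and the argument meanings are assumptions.

```rust
use hpt::{resize_lru_cache, set_display_elements, set_num_threads};

fn main() {
    set_num_threads(8);      // cap the CPU thread pool (assumed unit: threads)
    resize_lru_cache(128);   // control the LRU cache size (assumed unit: entries)
    set_display_elements(6); // elements shown per dimension when printing
}
```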
v0.0.15
- added Save/Load derive macro support for the CUDA backend
- added uncontiguous support for CUDA reduce
- refactored hpt-allocator; simplified the implementation and improved maintainability
- updated tensor display method documentation
- added unary and reduce tests for CUDA
- fixed a CUDA scalar sinh, tanh, cosh method code-gen issue
- added a CudaType trait to allow cross-platform type name mapping between Rust primitive types and C primitive types
- refactored hpt file organization for CUDA
- added backend support status to the hpt docs
- added a resnet example in hpt-examples
- added lstm, resnet benchmarks for the hpt CPU backend in the docs
- changed the out-method signatures: all methods named `*_` now require a mutable out (see the sketch after this list)
- fixed docs for binary methods
- changed some crates' method visibility so users won't see them
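A sketch of the `*_` out-variant convention from this release: a method whose name ends in `_` writes into a caller-provided mutable output instead of allocating a new tensor. `sin`/`sin_` and the exact signatures are illustrative assumptions; only the `*_` naming rule comes from the notes.

```rust
use hpt::Tensor;

fn main() {
    let a = Tensor::<f32>::zeros(&[4, 4]).expect("alloc a");
    let mut out = Tensor::<f32>::zeros(&[4, 4]).expect("alloc out");

    let _fresh = a.sin().expect("sin"); // allocating variant
    a.sin_(&mut out).expect("sin_");    // out variant: borrows `out` mutably
}
```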