Releases · Jianqoq/Hpt
v0.1.2
New Methods
- `from_raw`, allows the user to pass a raw pointer to create a new Tensor (see the sketch after this list)
- `forget`, checks the reference count and forgets the memory; you can use it to construct another library's Tensor
- `forget_copy`, clones the data and returns the cloned memory; this method doesn't need to check the reference count
- cpu `matmul_post`, allows the user to do a post calculation after matrix multiplication
- cuda `conv2d`, convolution, uses `cudnn` as the backend
- cuda `dw_conv2d`, depth-wise convolution, uses `cudnn` as the backend
- cuda `conv2d_group`, group convolution, uses `cudnn` as the backend
- cuda `batchnorm_conv2d`, convolution with batch normalization, uses `cudnn` as the backend
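A minimal sketch of the raw-pointer round trip that `from_raw` and `forget` enable. Only the method names come from the notes above; their exact signatures, safety contracts, and return types are assumptions, so treat this as illustrative rather than the library's actual API.

```rust
use hpt::Tensor;

fn main() {
    // Memory owned by "another library" (modeled here with a plain Vec).
    let mut data: Vec<f32> = vec![1.0; 16];
    let ptr = data.as_mut_ptr();
    std::mem::forget(data); // hand ownership over to the tensor side

    // SAFETY (assumed contract): `ptr` points to 16 valid, aligned f32
    // values matching the 4x4 shape. Signature is hypothetical.
    let t = unsafe { Tensor::<f32>::from_raw(ptr, &[4, 4]) }.expect("from_raw");

    // ... use `t` like any other tensor ...

    // `forget` checks the reference count and releases the buffer back to
    // the caller without freeing it (return type is an assumption).
    let ptr = unsafe { t.forget() }.expect("forget");

    // Rebuild the Vec so the memory is eventually freed.
    drop(unsafe { Vec::from_raw_parts(ptr as *mut f32, 16, 16) });
}
```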
Bug fixes
- batch matmul for CPU `matmul`
- wrong `max_nr` and `max_mr` for the bf16/f16 mixed-precision matmul kernel
- wrong conversion from CPU to CUDA Tensor when the CPU Tensor is not contiguous
- wrong usage of cublas in `matmul` for CUDA
Internal Changes
- added layout validation for `scatter` in CPU
- use fp16 instructions to convert f32 to f16 for Neon, speeding up all f16-related calculations on Neon
- allow f16 to convert to i16/u16 by using `fp16`
- refactored SIMD files to make them more maintainable and extensible
- re-exported cudarc
v0.1.1
v0.1.0
- fixed some docs issues
- implemented Matmul for CPU, supporting all primitive data types (usage sketch after this list)
- exposed `FFT` methods
- use `fp16` instructions for `f16` on Neon
- fixed wrong `fma` calculation for f32, f64 on Neon
- added Matmul, FFT benchmarks
- update `LRU_cache_size` after resizing the LRU cache
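A minimal usage sketch for the new CPU Matmul. Only `matmul` itself is named in the notes; the `randn` constructor and error handling shown here are assumptions.

```rust
use hpt::Tensor;

fn main() {
    // f32 shown here; per the notes, all primitive data types are supported.
    let a = Tensor::<f32>::randn(&[2, 3]).expect("alloc a");
    let b = Tensor::<f32>::randn(&[3, 4]).expect("alloc b");
    let c = a.matmul(&b).expect("matmul"); // result shape: [2, 4]
    println!("{}", c);
}
```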
v0.0.21
- refactored files
- fixed wrong calculation for reduction in the 1-dimensional case
- fixed save/load issue for CUDA
- simplified save/load API
- added tests for save/load for CPU and CUDA
- changed some method APIs, such as `selu`
- fixed many docs issues on the GitHub page
- made Rust docs consistent for tensor operators
v0.0.18
v0.0.17
- redesigned slice: changed `match_selection` to `select`, which now supports syntax like `select![1:2:3, .., 2:]`, similar to NumPy (see the sketch after this list)
- added support for a custom allocator; users can now use their own memory allocator
- `concat`, `vstack`, `hstack`, `dstack` are now moved to the `Concat` trait
- updated `concat`, `vstack`, `hstack`, `dstack` docs; fixed the `resize_cuda_lru_cache` doc
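A hedged sketch of the new slicing syntax. The notes only confirm the `select![1:2:3, .., 2:]` form; the `zeros` constructor and the way the macro attaches to a tensor are assumptions for illustration.

```rust
use hpt::{select, Tensor};

fn main() {
    let t = Tensor::<f32>::zeros(&[8, 8, 8]).expect("alloc");
    // dim 0: start 1, stop 2, step 3; dim 1: everything; dim 2: from index 2.
    // NumPy equivalent: t[1:2:3, :, 2:].
    let s = t.select(select![1:2:3, .., 2:]).expect("select");
    println!("{:?}", s.shape());
}
```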
v0.0.16
- added a CUDA kernel launch configuration checking function
- added single/list CUDA tensor saving/loading support
- added incremental compilation support for hpt-cudakernels, speeding up development
- added parallel nvcc compilation
- reimplemented reduce kernels; optimized and implemented reduce for CUDA for all reduction operators the CPU supports
- added `resize_lru_cache`, allowing users to control the LRU cache size
- renamed `set_lr_display_elements` to `set_display_elements`
- renamed `set_cuda_seed`, which now accepts a backend generic type
- added docs for `get_num_threads`, `set_num_threads`, `resize_lru_cache`, `set_display_elements`, `set_display_precision`, `set_seed` (usage sketch after this list)
- fixed a wrong CUDA tensor to CPU tensor conversion when the tensor is sliced
- simplified the display method implementation for CUDA, which now directly calls `to_cpu`
- added a reduce benchmark for CUDA on the GitHub page
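A short sketch of the runtime-configuration helpers documented in this release. The function names come from the notes above, but the import path and the argument meanings are assumptions.

```rust
use hpt::{resize_lru_cache, set_display_elements, set_num_threads};

fn main() {
    set_num_threads(8);      // cap the CPU thread pool (assumed unit: threads)
    resize_lru_cache(128);   // control the LRU cache size (assumed unit: entries)
    set_display_elements(6); // elements shown per dimension when printing
}
```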
v0.0.15
- added Save/Load derive macro support for the CUDA backend
- added uncontiguous support for CUDA reduce
- refactored hpt-allocator; simplified the implementation and improved maintainability
- updated tensor display method documentation
- added unary and reduce tests for CUDA
- fixed a CUDA scalar sinh, tanh, cosh method code-gen issue
- added a CudaType trait to allow cross-platform type name mapping between Rust primitive types and C primitive types
- refactored hpt file organization for CUDA
- added backend support status to the hpt docs
- added a resnet example in hpt-examples
- added lstm, resnet benchmarks for the hpt CPU backend in the docs
- changed the out-method signatures: all methods named `*_` now require a mutable out (see the sketch after this list)
- fixed docs for binary methods
- changed some crates' method visibility so users won't see them
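A sketch of the `*_` out-variant convention from this release: a method whose name ends in `_` writes into a caller-provided mutable output instead of allocating a new tensor. `sin`/`sin_` and the exact signatures are illustrative assumptions; only the `*_` naming rule comes from the notes.

```rust
use hpt::Tensor;

fn main() {
    let a = Tensor::<f32>::zeros(&[4, 4]).expect("alloc a");
    let mut out = Tensor::<f32>::zeros(&[4, 4]).expect("alloc out");

    let _fresh = a.sin().expect("sin"); // allocating variant
    a.sin_(&mut out).expect("sin_");    // out variant: borrows `out` mutably
}
```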