Releases · Jianqoq/Hpt · GitHub

Releases: Jianqoq/Hpt

0.1.2

25 Mar 21:26

New Methods

  • from_raw, creates a new Tensor from a user-provided raw pointer
  • forget, checks the reference count and releases ownership of the memory without freeing it; use it to construct another library's Tensor
  • forget_copy, clones the data and returns the cloned memory; this method doesn't need a reference-count check
  • cpu matmul_post, applies a user-supplied post-operation after matrix multiplication
  • cuda conv2d, convolution, uses cuDNN as the backend
  • cuda dw_conv2d, depth-wise convolution, uses cuDNN as the backend
  • cuda conv2d_group, grouped convolution, uses cuDNN as the backend
  • cuda batchnorm_conv2d, convolution with batch normalization, uses cuDNN as the backend
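The `forget` / `from_raw` pair describes an ownership hand-off between hpt and another library: one side gives up its buffer without freeing it, the other reconstructs a tensor from the raw pointer. A minimal sketch of that pattern using only std `Vec` (hpt's actual signatures may differ):

```rust
use std::mem::ManuallyDrop;

// Give up ownership of a buffer without freeing it, returning its raw
// parts -- the pattern behind a `forget`-style method.
fn forget_buffer(v: Vec<f32>) -> (*mut f32, usize, usize) {
    let mut v = ManuallyDrop::new(v);
    (v.as_mut_ptr(), v.len(), v.capacity())
}

// Reclaim ownership from raw parts -- the pattern behind `from_raw`.
//
// Safety: `ptr`, `len`, and `cap` must come from a forgotten Vec<f32>.
unsafe fn from_raw_buffer(ptr: *mut f32, len: usize, cap: usize) -> Vec<f32> {
    Vec::from_raw_parts(ptr, len, cap)
}

fn main() {
    let (ptr, len, cap) = forget_buffer(vec![1.0_f32, 2.0, 3.0]);
    // `ptr` could now back another library's tensor without a copy.
    let recovered = unsafe { from_raw_buffer(ptr, len, cap) };
    assert_eq!(recovered, vec![1.0, 2.0, 3.0]);
}
```

The reference-count check the notes mention matters here: ownership can only be handed off safely when no other view still aliases the buffer.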

Bug fixes

  • batched matmul for the CPU backend
  • wrong max_nr and max_mr for bf16/f16 mixed_precision matmul kernel
  • wrong conversion from CPU to CUDA Tensor when CPU Tensor is not contiguous
  • wrong usage of cublas in matmul for CUDA

Internal Change

  • added layout validation for scatter in CPU
  • use fp16 instructions to convert f32 to f16 on Neon, speeding up all f16-related calculations on Neon
  • enable f16 to convert to i16/u16 by using fp16 instructions
  • refactored SIMD files to make them more maintainable and extensible
  • re-export cudarc
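The Neon fp16 work replaces software f32→f16 conversion with a single hardware instruction. For reference, this is roughly what the software path has to do for normal values (a truncating sketch, not hpt's implementation; the hardware and the `half` crate additionally round to nearest and handle subnormals):

```rust
// Convert an f32 to IEEE 754 binary16 bits, normal range only,
// truncating the mantissa. Illustrates the work a single Neon fp16
// convert instruction replaces.
fn f32_to_f16_bits(x: f32) -> u16 {
    let bits = x.to_bits();
    let sign = ((bits >> 16) & 0x8000) as u16;
    let exp = ((bits >> 23) & 0xff) as i32;
    let mant = bits & 0x007f_ffff;
    let e = exp - 127 + 15; // re-bias exponent: f32 bias 127 -> f16 bias 15
    if e <= 0 || e >= 31 {
        return sign; // subnormals and overflow not handled in this sketch
    }
    sign | ((e as u16) << 10) | ((mant >> 13) as u16) // keep top 10 mantissa bits
}

fn main() {
    assert_eq!(f32_to_f16_bits(1.0), 0x3C00);
    assert_eq!(f32_to_f16_bits(1.5), 0x3E00);
    assert_eq!(f32_to_f16_bits(-2.0), 0xC000);
}
```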

v0.1.1

20 Mar 07:41
  • Fixed wrong index calculation for matmul when input is transposed
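The transposed-input bug comes down to stride arithmetic: transposing swaps strides rather than moving data, and a matmul kernel must index through the view's strides. An illustration of the index math involved (not hpt's kernel):

```rust
// Linear offset of element (i, j) in a strided 2-D view.
fn index(strides: (usize, usize), i: usize, j: usize) -> usize {
    i * strides.0 + j * strides.1
}

fn main() {
    let a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]; // 2x3 matrix, row-major
    let normal = (3, 1);     // strides of the 2x3 view
    let transposed = (1, 3); // strides of its 3x2 transpose (no copy made)
    assert_eq!(a[index(normal, 1, 2)], 6.0);     // A[1][2]
    assert_eq!(a[index(transposed, 2, 1)], 6.0); // A^T[2][1] == A[1][2]
}
```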

v0.1.0

20 Mar 00:29
  • fixed some docs issues
  • implemented Matmul for CPU, supporting all primitive data types
  • exposed FFT methods
  • used fp16 instructions for f16 on Neon
  • fixed wrong FMA calculation for f32/f64 on Neon
  • added Matmul and FFT benchmarks
  • updated LRU_cache_size after resizing the LRU cache
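A CPU Matmul that works for all primitive data types is naturally expressed as a generic kernel over the numeric operations it needs. A naive, unoptimized sketch of that idea (hpt's real kernel is blocked and vectorized):

```rust
use std::ops::{AddAssign, Mul};

// Row-major (m x k) * (k x n) -> (m x n), generic over any primitive
// numeric type that supports `+=` and `*`.
fn matmul<T>(a: &[T], b: &[T], m: usize, k: usize, n: usize) -> Vec<T>
where
    T: Copy + Default + AddAssign + Mul<Output = T>,
{
    let mut c = vec![T::default(); m * n];
    for i in 0..m {
        for p in 0..k {
            let av = a[i * k + p]; // hoist A[i][p] out of the inner loop
            for j in 0..n {
                c[i * n + j] += av * b[p * n + j];
            }
        }
    }
    c
}

fn main() {
    // 2x2 times the 2x2 identity returns the input unchanged.
    let a = vec![1.0_f64, 2.0, 3.0, 4.0];
    let id = vec![1.0, 0.0, 0.0, 1.0];
    assert_eq!(matmul(&a, &id, 2, 2, 2), a);
    // The same kernel works for integers.
    assert_eq!(matmul(&[1i32, 2], &[3, 4], 1, 2, 1), vec![11]);
}
```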

v0.0.21

10 Mar 04:43
  • refactored files
  • fixed wrong reduction calculation in the 1-dimensional case
  • fixed save/load issue for CUDA
  • simplified the save/load API
  • added save/load tests for CPU and CUDA
  • changed some method APIs, such as selu
  • fixed many docs issues on the GitHub page
  • made Rust docs consistent for tensor operators

v0.0.18

05 Mar 18:46
  • re-export half::f16 and half::bf16
  • added docs for conv
  • simplified some trait bounds
  • added custom allocator support for some methods left out of the last release
  • changed Debug behavior: Debug now shows tensor metadata instead of printing the data

v0.0.17

03 Mar 20:37
  • redesigned slicing: renamed match_selection to select, which now supports syntax like select![1:2:3, .., 2:], similar to NumPy
  • added custom allocator support; users can now use their own memory allocator
  • concat, vstack, hstack, dstack are now moved to the Concat trait
  • updated concat, vstack, hstack, dstack docs; fixed the resize_cuda_lru_cache doc
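The select! syntax follows NumPy's start:end:step semantics per axis. The index set a start:end:step selection produces can be sketched as follows (illustrative semantics only, not the macro's implementation):

```rust
// Indices selected by a NumPy-style `start:end:step` on one axis of
// length `len`: every `step`-th index in [start, min(end, len)).
fn slice_indices(len: usize, start: usize, end: usize, step: usize) -> Vec<usize> {
    (start..end.min(len)).step_by(step).collect()
}

fn main() {
    // `2:` on an axis of length 10 -> everything from index 2 on.
    assert_eq!(slice_indices(10, 2, 10, 1), vec![2, 3, 4, 5, 6, 7, 8, 9]);
    // `1:7:3` -> indices 1 and 4.
    assert_eq!(slice_indices(10, 1, 7, 3), vec![1, 4]);
}
```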

v0.0.16

02 Mar 08:52
  • added cuda kernel launch configuration checking function
  • added single/list cuda tensor saving/loading support
  • added incremental compilation support for hpt-cudakernels, speeding up development
  • added parallel nvcc compilation
  • reimplemented and optimized reduce kernels, implementing reduce for CUDA for all reduction operators the CPU supports
  • added resize_lru_cache, allowing users to control the LRU cache size
  • renamed set_lr_display_elements to set_display_elements
  • renamed set_cuda_seed; it now accepts a backend generic type
  • added docs for get_num_threads, set_num_threads, resize_lru_cache, set_display_elements, set_display_precision, set_seed
  • fixed wrong CUDA-to-CPU tensor conversion when the tensor is sliced
  • simplified the display method implementation for CUDA, which now calls to_cpu directly
  • added reduce benchmark for CUDA to the GitHub page
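A kernel launch configuration check of the kind mentioned above typically validates the requested block dimensions against device limits before launching. A sketch of that validation with hard-coded limits (real code queries the device; this is not hpt's checker):

```rust
// Validate a CUDA block dimension against two typical device limits:
// total threads per block, and the z-dimension cap (commonly 64).
// The limits here are illustrative constants, not queried from a device.
fn check_launch(block: (u32, u32, u32), max_threads_per_block: u32) -> bool {
    let (x, y, z) = block;
    let total = x as u64 * y as u64 * z as u64;
    total > 0 && total <= max_threads_per_block as u64 && z <= 64
}

fn main() {
    assert!(check_launch((256, 1, 1), 1024));
    assert!(!check_launch((32, 32, 2), 1024)); // 2048 threads > 1024 limit
    assert!(!check_launch((1, 1, 128), 1024)); // z exceeds the z-dim cap
}
```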

v0.0.15

19 Feb 18:30
  • added Save/Load derive macro support for Cuda Backend
  • added non-contiguous support for CUDA reduce
  • refactored hpt-allocator, simplifying the implementation and improving maintainability
  • updated tensor display method documentation
  • added unary and reduce tests for CUDA
  • fixed CUDA scalar sinh, tanh, cosh code generation issue
  • added the CudaType trait to allow cross-platform type-name mapping between Rust primitive types and C primitive types
  • refactored hpt file organization for CUDA
  • added backend support status to the hpt docs
  • added a resnet example in hpt-examples
  • added lstm and resnet benchmarks for the hpt CPU backend in the docs
  • changed the out methods' signatures; all methods named *_ now require a mutable out
  • fixed docs for binary methods
  • changed some crates' method visibility so users won't see internal methods
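A CudaType-style trait maps a Rust primitive to the C type name spliced into generated CUDA source. A sketch of that mapping pattern (the trait shape and associated-constant name are illustrative, not hpt's actual definitions):

```rust
// Map a Rust primitive type to the C type name used when generating
// CUDA kernel source. Illustrative trait, not hpt's actual `CudaType`.
trait CudaTypeName {
    const C_NAME: &'static str;
}

impl CudaTypeName for f32 { const C_NAME: &'static str = "float"; }
impl CudaTypeName for f64 { const C_NAME: &'static str = "double"; }
impl CudaTypeName for i32 { const C_NAME: &'static str = "int"; }

// Generic code can now emit the right C type for any mapped Rust type.
fn kernel_param<T: CudaTypeName>(name: &str) -> String {
    format!("{} {}", T::C_NAME, name)
}

fn main() {
    assert_eq!(kernel_param::<f32>("x"), "float x");
    assert_eq!(kernel_param::<i32>("n"), "int n");
}
```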