This repository was archived by the owner on Apr 11, 2025. It is now read-only.
v0.3.0
Overview
- CUDA kernel improvements: support models whose hidden_size is divisible only by 32 or 64, instead of requiring divisibility by 256.
- Peft integration: support training and inference using LoRA, AdaLoRA, AdaptionPrompt, etc.
- New models: BaiChuan, InternLM.
- Other updates: see 'Full Change Log' below for details.
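To illustrate the relaxed kernel requirement above, here is a minimal sketch (not the library's actual API; the function name and divisor list are illustrative assumptions) of picking the most specialized supported kernel dimension for a given hidden_size:

```python
# Illustrative sketch only: which divisor group a model's hidden_size
# falls into after this release. Divisors are checked from the most
# specialized (256) down to the newly supported 64 and 32.
SUPPORTED_DIVISORS = (256, 64, 32)  # assumed ordering, for illustration

def kernel_divisor(hidden_size: int):
    """Return the largest supported divisor of hidden_size, or None."""
    for d in SUPPORTED_DIVISORS:
        if hidden_size % d == 0:
            return d
    return None

print(kernel_divisor(4096))  # divisible by 256: supported before this release
print(kernel_divisor(4544))  # only divisible by 64: newly supported
print(kernel_divisor(2208))  # only divisible by 32: newly supported
```

Models whose hidden_size is a multiple of 32 or 64 but not 256 previously could not use the CUDA kernels at all; this release extends coverage to them.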
Full Change Log
What's Changed
- Pytorch qlinear by @qwopqwop200 in #116
- Specify UTF-8 encoding for README.md in setup.py by @EliEron in #132
- Support cuda 64dim by @qwopqwop200 in #126
- Support 32dim by @qwopqwop200 in #125
- Peft integration by @PanQiWei in #102
- Support setting inject_fused_attention and inject_fused_mlp to False by @TheBloke in #134
- Add transpose operator when replace Conv1d with qlinear_cuda_old by @geekinglcq in #140
- Add support for BaiChuan model by @LaaZa in #164
- Fix error message by @AngainorDev in #141
- Add support for InternLM by @cczhong11 in #189
- Fix stale documentation by @MarisaKirisame in #158
New Contributors
- @EliEron made their first contribution in #132
- @geekinglcq made their first contribution in #140
- @AngainorDev made their first contribution in #141
- @cczhong11 made their first contribution in #189
- @MarisaKirisame made their first contribution in #158
Full Changelog: v0.2.1...v0.3.0