# web-rwkv-py

Python binding for `web-rwkv`.
- Basic V5 inference support
- Support V4, V5 and V6
- Batched inference
- Install Python and Rust.
- Install maturin:

  ```bash
  $ pip install maturin
  ```

- Build and install:

  ```bash
  $ maturin develop
  ```
- Try using `web-rwkv` in Python:

  ```python
  import web_rwkv_py as wrp

  model = wrp.v5.Model(
      "/path/to/model.st",   # model path
      quant=0,               # int8 quantization layers
      quant_nf4=0,           # nf4 quantization layers
      turbo=True,            # faster when reading long prompts
      token_chunk_size=256,  # maximum tokens in an inference chunk (can be 32, 64, 256, 1024, etc.)
  )
  logits, state = wrp.v5.run_one(model, [114, 514], state=None)
  ```
- Move state to host memory:

  ```python
  # the returned state lives in VRAM
  logits, state = wrp.v5.run_one(model, [114, 514], state=None)
  state_cpu = state.back()
  ```
- Load state from host memory:

  ```python
  state = wrp.v5.ModelState(model, 1)
  state.load(state_cpu)
  # pass the device state we just loaded, not the host copy
  logits, state = wrp.v5.run_one(model, [114, 514], state=state)
  ```
- Return predictions for all tokens (not only the last one's):

  ```python
  logits, state = wrp.v5.run_one_full(model, [114, 514], state=None)
  len(logits)  # 2, one logits vector per input token
  ```
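Since `run_one` returns the logits for the last token together with the updated state, the calls above can be chained into a simple greedy generation loop. A minimal sketch; `stub_run_one` is a hypothetical stand-in with the same call shape as `wrp.v5.run_one`, used here only because the real call needs a model file:

```python
def greedy_generate(run_one, model, prompt, n_tokens):
    # Feed the whole prompt once, then decode token by token,
    # threading the returned state through each call.
    logits, state = run_one(model, prompt, state=None)
    out = []
    for _ in range(n_tokens):
        token = max(range(len(logits)), key=logits.__getitem__)  # argmax
        out.append(token)
        logits, state = run_one(model, [token], state=state)
    return out

# Hypothetical stub mimicking wrp.v5.run_one's call shape,
# so the loop can be exercised without a model; token 3 always wins.
def stub_run_one(model, tokens, state=None):
    return [0.0, 0.1, 0.2, 0.9], state

print(greedy_generate(stub_run_one, None, [114, 514], 4))  # [3, 3, 3, 3]
```

With the real binding you would pass `model` and `wrp.v5.run_one` instead of the stub, and sample from `logits` rather than taking the argmax if you want non-deterministic output.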