Description
I have been using Flair in our production environment for some time now, and I haven't faced any issues so far. However, not every organization uses a GPU for inference, and running on CPU alone is not ideal once latency becomes important.
I was wondering whether there is a way to convert flair.pt to flair.onnx and apply integer quantization in the process; a small trade-off of accuracy for performance is not a bad idea. I have gone through the docs but couldn't find any reference to optimizations, distillation, etc.
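In case it is useful, here is the general pattern I had in mind: export with torch.onnx.export, then apply post-training dynamic int8 quantization with onnxruntime's quantize_dynamic. Note this is only a sketch on a stand-in tensor-only model, not Flair's actual SequenceTagger (whose forward pass takes Sentence objects, so it would need a wrapper I haven't worked out).

```python
# Sketch of the two steps: ONNX export + dynamic int8 quantization.
# ToyTagger below is a stand-in, NOT Flair's SequenceTagger.
import torch
import torch.nn as nn
from onnxruntime.quantization import quantize_dynamic, QuantType

class ToyTagger(nn.Module):
    """Stand-in embedding + BiLSTM + linear tagger, used only to show the flow."""
    def __init__(self, vocab=1000, dim=64, tags=10):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * dim, tags)

    def forward(self, token_ids):
        x = self.embed(token_ids)
        x, _ = self.lstm(x)
        return self.out(x)

model = ToyTagger().eval()
dummy = torch.zeros((1, 32), dtype=torch.long)

# Step 1: trace the tensor-only forward pass and save it as an ONNX graph.
torch.onnx.export(
    model,
    (dummy,),
    "tagger.onnx",
    input_names=["token_ids"],
    output_names=["logits"],
    dynamic_axes={"token_ids": {0: "batch", 1: "seq"}, "logits": {0: "batch", 1: "seq"}},
    opset_version=13,
)

# Step 2: post-training dynamic quantization -- weights stored as int8,
# activations quantized on the fly at inference time. No calibration data needed.
quantize_dynamic("tagger.onnx", "tagger.int8.onnx", weight_type=QuantType.QInt8)
```

I started with dynamic quantization because it doesn't require a calibration dataset, but I'm not sure how to get Flair's model into a traceable, tensor-only form in the first place.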
If someone has managed to do this, I would really appreciate it if you could share the details.