Don't perform memory check if client sets use_mmap true. #8895
base: main
Conversation
Any chance that will be merged in one regular human lifetime?
I work around this by creating a large zram swap, which is useful anyway. I normally load the models into a RAM drive on a live Linux system, so the models are already in memory and mmap avoids duplicating them. I also need to modify line 213 so that I don't end up with no_mmap. In this scenario line 213 behaves perversely: if the model is small enough to fit into memory a second time, it loads and uses mmap, so nothing is actually duplicated and the extra memory is never used. If the model is too big to fit a second time, mmap is not used and, with the zram swap, the load runs out of memory.
If the client overrides use_mmap, don't prevent the model from loading due to apparent over-commit. On Linux, an mmap'd file doesn't use swap backing store unless modified, so there's no need for the check. Windows has dynamic swap and so falls into the same bucket as darwin. Inference on deepseek-r1:671b-1.5b runs at ~0.15 t/s where the model requires swap on SSD, ~0.3 t/s with mmap instead of swap on the same SSD, and ~1.4 t/s when the model is mapped on an NVMe drive.
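For readers following the reasoning above, here is a minimal sketch of the intended behavior, not the actual diff. The names `Options`, `skipMemoryCheck`, and `checkFit` are hypothetical, and modeling `use_mmap` as a pointer (so that "unset" differs from an explicit `false`) is an assumption.

```go
package main

import "fmt"

// Options carries only the field relevant here. The pointer type is an
// assumption: nil means the client did not set use_mmap at all.
type Options struct {
	UseMMap *bool
}

// skipMemoryCheck reports whether the over-commit check can be bypassed,
// which is only the case when the client explicitly set use_mmap to true.
func skipMemoryCheck(opts Options) bool {
	return opts.UseMMap != nil && *opts.UseMMap
}

// checkFit is a hypothetical stand-in for the loader's memory check.
// It rejects a model larger than available memory unless mmap was requested.
func checkFit(opts Options, modelBytes, availableBytes uint64) error {
	if skipMemoryCheck(opts) {
		// An unmodified mmap'd file is backed by the file itself, not by swap.
		return nil
	}
	if modelBytes > availableBytes {
		return fmt.Errorf("model requires %d bytes, only %d available", modelBytes, availableBytes)
	}
	return nil
}

func main() {
	on := true
	// Explicit use_mmap=true: the oversized model is allowed to load.
	fmt.Println(checkFit(Options{UseMMap: &on}, 400<<30, 64<<30))
	// use_mmap unset: the usual check still applies.
	fmt.Println(checkFit(Options{}, 400<<30, 64<<30))
}
```

Only an explicit `true` bypasses the check in this sketch; an unset or `false` value keeps the existing behavior, which matches the PR title.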
Also add OLLAMA_USE_MMAP for global configuration.