Upgrade Torch to v2.6.0 everywhere by ashahba · Pull Request #4450 · kserve/kserve

Upgrade Torch to v2.6.0 everywhere #4450


Merged
20 commits merged into kserve:master from ashahba/torch-2.6.0-upgrade on May 28, 2025

Conversation

ashahba
Contributor
@ashahba ashahba commented May 6, 2025

What this PR does / why we need it:

Which issue(s) this PR fixes
Fixes #4449

Type of changes
Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)

Feature/Issue validation/testing:
I have updated the Poetry lock files and I'm not anticipating any test failures, but I'll update here if I see anything strange.
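(As a quick sanity check, not part of this PR, one could run something like the following sketch in each runtime's virtualenv to confirm the lock regeneration actually resolved the intended Torch release:)

import torch

# Hypothetical check: confirm the environment picked up Torch 2.6.0
# (CPU builds typically report a version like "2.6.0+cpu").
print(torch.__version__)
assert torch.__version__.startswith("2.6.0"), torch.__version__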

Signed-off-by: Abolfazl Shahbazi <12436063+ashahba@users.noreply.github.com>
@ashahba
Contributor Author
ashahba commented May 6, 2025

I did try make poetry-lock and everything passed locally!

...
...
poetry-version-plugin: New package version is updated from given file version is: 0.15.0
poetry-version-plugin: New version updated in toml file
Creating virtualenv kserve in /localdisk/ashahba/source_code/kserve/python/kserve/.venv
Resolving dependencies... (19.3s)
moving into folder ./paddleserver
poetry-version-plugin: New package version is updated from given file version is: 0.15.0
poetry-version-plugin: New version updated in toml file
Creating virtualenv paddleserver in /localdisk/ashahba/source_code/kserve/python/paddleserver/.venv
Resolving dependencies... (7.6s)
moving into folder ./pmmlserver
poetry-version-plugin: New package version is updated from given file version is: 0.15.0
poetry-version-plugin: New version updated in toml file
Creating virtualenv pmmlserver in /localdisk/ashahba/source_code/kserve/python/pmmlserver/.venv
Resolving dependencies... (1.9s)
moving into folder ./custom_transformer
poetry-version-plugin: New package version is updated from given file version is: 0.15.0
poetry-version-plugin: New version updated in toml file
Creating virtualenv custom-transformer in /localdisk/ashahba/source_code/kserve/python/custom_transformer/.venv
Resolving dependencies... (0.8s)
moving into folder ./artexplainer
poetry-version-plugin: New package version is updated from given file version is: 0.15.0
poetry-version-plugin: New version updated in toml file
Creating virtualenv artserver in /localdisk/ashahba/source_code/kserve/python/artexplainer/.venv
Resolving dependencies... (2.5s)
moving into folder ./sklearnserver
poetry-version-plugin: New package version is updated from given file version is: 0.15.0
poetry-version-plugin: New version updated in toml file
Creating virtualenv sklearnserver in /localdisk/ashahba/source_code/kserve/python/sklearnserver/.venv
Resolving dependencies... (2.1s)
moving into folder ./plugin/poetry-version-plugin
-e Skipping folder ./plugin/poetry-version-plugin
moving into folder ./custom_model
poetry-version-plugin: New package version is updated from given file version is: 0.15.0
poetry-version-plugin: New version updated in toml file
Creating virtualenv custom-model in /localdisk/ashahba/source_code/kserve/python/custom_model/.venv
Resolving dependencies... (2.0s)
moving into folder ./custom_tokenizer
poetry-version-plugin: New package version is updated from given file version is: 0.15.0
poetry-version-plugin: New version updated in toml file
Creating virtualenv custom-tokenizer in /localdisk/ashahba/source_code/kserve/python/custom_tokenizer/.venv
Resolving dependencies... (0.6s)
moving into folder ./kserve
poetry-version-plugin: New package version is updated from given file version is: 0.15.0
poetry-version-plugin: New version updated in toml file
Resolving dependencies... (11.1s)
moving into folder ./test_resources/graph/error_404_isvc
poetry-version-plugin: New package version is updated from given file version is: 0.15.0
poetry-version-plugin: New version updated in toml file
Creating virtualenv error-404-isvc in /localdisk/ashahba/source_code/kserve/python/test_resources/graph/error_404_isvc/.venv
Resolving dependencies... (0.5s)
moving into folder ./test_resources/graph/success_200_isvc
poetry-version-plugin: New package version is updated from given file version is: 0.15.0
poetry-version-plugin: New version updated in toml file
Creating virtualenv success-200-isvc in /localdisk/ashahba/source_code/kserve/python/test_resources/graph/success_200_isvc/.venv
Resolving dependencies... (0.6s)
moving into folder ./huggingfaceserver
poetry-version-plugin: New package version is updated from given file version is: 0.15.0
poetry-version-plugin: New version updated in toml file
Creating virtualenv huggingfaceserver in /localdisk/ashahba/source_code/kserve/python/huggingfaceserver/.venv
Resolving dependencies... (10.6s)
moving into folder ./xgbserver
poetry-version-plugin: New package version is updated from given file version is: 0.15.0
poetry-version-plugin: New version updated in toml file
Creating virtualenv xgbserver in /localdisk/ashahba/source_code/kserve/python/xgbserver/.venv
Resolving dependencies... (2.3s)
moving into folder ./aiffairness
poetry-version-plugin: New package version is updated from given file version is: 0.15.0
poetry-version-plugin: New version updated in toml file
Creating virtualenv aifserver in /localdisk/ashahba/source_code/kserve/python/aiffairness/.venv
Resolving dependencies... (1.6s)
moving into folder ./lgbserver
poetry-version-plugin: New package version is updated from given file version is: 0.15.0
poetry-version-plugin: New version updated in toml file
Creating virtualenv lgbserver in /localdisk/ashahba/source_code/kserve/python/lgbserver/.venv
Resolving dependencies... (2.0s)

and that introduced no changes:

$ git status
On branch ashahba/torch-2.6.0-upgrade
nothing to commit, working tree clean

But I'm not sure why this test is failing.
Maybe it's Python version related.

@Jooho
Contributor
Jooho commented May 7, 2025

From virtualenv 2.31.0, the --wheel option is removed and it causes this issue, so I think we should pin virtualenv to 2.30.0 for now to pass CI.

@ashahba
Contributor Author
ashahba commented May 8, 2025

From virtualenv 2.31.0, the --wheel option is removed and it causes this issue, so I think we should pin virtualenv to 2.30.0 for now to pass CI.

That sounds great @Jooho
Let me give that a try.
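(For reference, a minimal guard one could drop into a CI step to confirm the pin took effect; the version string below simply mirrors the suggestion above and is otherwise an assumption:)

from importlib.metadata import version

# Hypothetical guard: fail fast if the runner resolved a newer virtualenv
# release than the pin suggested above.
installed = version("virtualenv")
print("virtualenv", installed)
assert installed == "2.30.0", f"expected pinned virtualenv 2.30.0, got {installed}"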

@ashahba
Contributor Author
ashahba commented May 11, 2025

Now the E2E testing is throwing:

ERROR ImagePull]: failed to pull image registry.k8s.io/kube-controller-manager:v1.30.7: output: E0511 22:29:00.085778    5495 log.go:32] "PullImage from image service failed" err="rpc error: code = Unknown desc = failed to register layer: write /usr/local/bin/kube-controller-manager: no space left on device" image="registry.k8s.io/kube-controller-manager:v1.30.7"
time="2025-05-11T22:29:00Z" level=fatal msg="pulling image: failed to register layer: write /usr/local/bin/kube-controller-manager: no space left on device
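(As an aside, a quick way to see how much space is actually left on the runner before retrying the pull; just a sketch, and the root path is an assumption:)

import shutil

# Rough disk-space check for the runner's root filesystem.
total, used, free = shutil.disk_usage("/")
print(f"disk: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")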

@yuzisun
Member
yuzisun commented May 12, 2025

Now the E2E testing is throwing:

ERROR ImagePull]: failed to pull image registry.k8s.io/kube-controller-manager:v1.30.7: output: E0511 22:29:00.085778    5495 log.go:32] "PullImage from image service failed" err="rpc error: code = Unknown desc = failed to register layer: write /usr/local/bin/kube-controller-manager: no space left on device" image="registry.k8s.io/kube-controller-manager:v1.30.7"
time="2025-05-11T22:29:00Z" level=fatal msg="pulling image: failed to register layer: write /usr/local/bin/kube-controller-manager: no space left on device

@sivanantha321 Can you help take a look? This might be related to the arm build we added back.

ashahba added 3 commits May 13, 2025 08:53
Signed-off-by: Abolfazl Shahbazi <12436063+ashahba@users.noreply.github.com>
Signed-off-by: Abolfazl Shahbazi <12436063+ashahba@users.noreply.github.com>
@ashahba
Contributor Author
ashahba commented May 13, 2025

E2E Tests / test-raw fails intermittently, but E2E Tests / test-predictor has two failing tests, and I'm not sure why I'm seeing them with this PR!

=========================== short test summary info ============================
FAILED predictor/test_sklearn.py::test_sklearn_runtime_kserve - RuntimeError: Timeout to start the InferenceService isvc-sklearn-runtime.                                The InferenceService is as following: {'apiVersion': 'serving.kserve.io/v1beta1', 'kind': 'InferenceService', 'metadata': {'creationTimestamp': '2025-05-13T23:11:09Z', 'finalizers': ['inferenceservice.finalizers'], 'generation': 1, 'managedFields': [{'apiVersion': 'serving.kserve.io/v1beta1', 'fieldsType': 'FieldsV1', 'fieldsV1': {'f:spec': {'.': {}, 'f:predictor': {'.': {}, 'f:minReplicas': {}, 'f:model': {'.': {}, 'f:args': {}, 'f:modelFormat': {'.': {}, 'f:name': {}}, 'f:name': {}, 'f:resources': {'.': {}, 'f:limits': {'.': {}, 'f:cpu': {}, 'f:memory': {}}, 'f:requests': {'.': {}, 'f:cpu': {}, 'f:memory': {}}}, 'f:storageUri': {}}}}}, 'manager': 'OpenAPI-Generator', 'operation': 'Update', 'time': '2025-05-13T23:11:09Z'}, {'apiVersion': 'serving.kserve.io/v1beta1', 'fieldsType': 'FieldsV1', 'fieldsV1': {'f:metadata': {'f:finalizers': {'.': {
FAILED custom/test_ray.py::test_custom_model_http_ray - RuntimeError: Timeout to start the InferenceService custom-model-http-ray.                                The InferenceService is as following: {'apiVersion': 'serving.kserve.io/v1beta1', 'kind': 'InferenceService', 'metadata': {'creationTimestamp': '2025-05-13T23:13:21Z', 'finalizers': ['inferenceservice.finalizers'], 'generation': 1, 'managedFields': [{'apiVersion': 'serving.kserve.io/v1beta1', 'fieldsType': 'FieldsV1', 'fieldsV1': {'f:spec': {'.': {}, 'f:predictor': {'.': {}, 'f:containers': {}}}}, 'manager': 'OpenAPI-Generator', 'operation': 'Update', 'time': '2025-05-13T23:13:21Z'}, {'apiVersion': 'serving.kserve.io/v1beta1', 'fieldsType': 'FieldsV1', 'fieldsV1': {'f:metadata': {'f:finalizers': {'.': {}, 'v:"inferenceservice.finalizers"': {}}}}, 'manager': 'manager', 'operation': 'Update', 'time': '2025-05-13T23:13:21Z'}, {'apiVersion': 'serving.kserve.io/v1beta1', 'fieldsType': 'FieldsV1', 'fieldsV1': {'f:status': {'.': {}, 'f:components': {'.': {}
================== 2 failed, 45 passed in 1165.18s (0:19:25) ===================

@sivanantha321
Member

(raylet) [2025-05-26 06:10:51,820 E 360 360] (raylet) node_manager.cc:3287: 20 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: a296f0e6912bb7ab93039b716f623e2f3afa8bf37419e2a1a1717e0f, IP: 10.244.0.52) over the last time period. To see more information about the Workers killed on this node, use ray logs raylet.out -ip 10.244.0.52

@ashahba looks like we need to increase the memory limit for the Ray test.

@ashahba
Contributor Author
ashahba commented May 27, 2025

(raylet) [2025-05-26 06:10:51,820 E 360 360] (raylet) node_manager.cc:3287: 20 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: a296f0e6912bb7ab93039b716f623e2f3afa8bf37419e2a1a1717e0f, IP: 10.244.0.52) over the last time period. To see more information about the Workers killed on this node, use ray logs raylet.out -ip 10.244.0.52

@ashahba looks like we need to increase the memory limit for the Ray test.

Thanks @sivanantha321
Testing it now before pushing to this branch.

Signed-off-by: Abolfazl Shahbazi <12436063+ashahba@users.noreply.github.com>
@ashahba
Contributor Author
ashahba commented May 27, 2025

@sivanantha321 I updated memory from 2G to 4G for Ray and the tests passed here: https://github.com/ashahba/kserve/actions/runs/15280898619/job/42983412413

We can even experiment with 3G but I'm not sure if that's worth the effort for this PR.
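(For context, the bump described above amounts to raising the container memory request/limit for the Ray-backed custom predictor; a minimal sketch using the Kubernetes Python client, where the container name and image are placeholders rather than the actual e2e fixture:)

from kubernetes.client import V1Container, V1ResourceRequirements

# Sketch of the resource bump discussed above; name and image are placeholders.
predictor_container = V1Container(
    name="kserve-container",
    image="custom-model-ray:latest",
    resources=V1ResourceRequirements(
        requests={"cpu": "1", "memory": "4Gi"},  # was 2Gi before the bump
        limits={"cpu": "1", "memory": "4Gi"},
    ),
)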

@ashahba
Contributor Author
ashahba commented May 27, 2025

I think the one failed test is possibly a bit flaky and just needs to be re-triggered.

@sivanantha321
Member

@sivanantha321 I updated memory from 2G to 4G for Ray and the tests passed here: https://github.com/ashahba/kserve/actions/runs/15280898619/job/42983412413

We can even experiment with 3G but I'm not sure if that's worth the effort for this PR.

4GB is fine.

@sivanantha321
Member

/lgtm
/approve

@github-actions github-actions bot added the lgtm label May 28, 2025
@sivanantha321 sivanantha321 merged commit 8601bfa into kserve:master May 28, 2025
65 checks passed
@ashahba ashahba deleted the ashahba/torch-2.6.0-upgrade branch May 28, 2025 13:58
israel-hdez pushed a commit to israel-hdez/kserve that referenced this pull request May 29, 2025
Signed-off-by: Abolfazl Shahbazi <12436063+ashahba@users.noreply.github.com>
Co-authored-by: Dan Sun <dsun20@bloomberg.net>
Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>

Successfully merging this pull request may close these issues.

A few Python component still not updated to Torch v2.6.0 CPU