This repository contains Ansible playbooks that set up a Kubernetes cluster inside a Slurm reservation to evaluate NVIDIA Ingest (NV-Ingest) with GPU support on the master and worker nodes. It ensures that the GPU worker node is configured correctly for GPU workloads, including running NV-Ingest and NeMo Retriever Extraction.
The playbook is divided into three main sections:
- Install and Configure Containerd with NVIDIA Support: Installs Containerd, configures it for Kubernetes, and adds the NVIDIA runtime if a GPU is detected.
- Initialize Kubernetes Control Plane: Initializes the Kubernetes cluster on the master node.
- Join Worker Nodes to Cluster: Joins the worker nodes to the cluster and labels the GPU node.
Additionally, this setup includes the deployment of NV-Ingest using Helm for GPU-accelerated workloads.
## Contents

- Prerequisites
- Cluster Setup
- GPU Worker Configuration
- Deploy NVIDIA Device Plugin
- Test GPU Workloads
- Deploy NV-Ingest
- Cleanup
- Troubleshooting
## Prerequisites

Before running the playbook, ensure the following:

- Master Node: Properly initialized (`kubectl get nodes` should show the master node as `Ready`).
- Worker Nodes: Correctly joined to the cluster.
- GPU Worker: The GPU worker node (`icgpu10`) must have:
  - An NVIDIA GPU with drivers installed.
  - The NVIDIA Container Toolkit configured.
  - Containerd configured for Kubernetes.
- NVIDIA NGC API Key: Required for pulling NVIDIA Docker images and deploying NV-Ingest. Generate it from the NVIDIA NGC website.
- Ansible: Installed on the machine running the playbook.
- SSH Access: Ensure SSH access to all nodes with the provided SSH key (`/etc/kubernetes/key`).
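A quick way to sanity-check these prerequisites before running the playbook (a minimal sketch; the node names `c267` and `icgpu10` and the key path come from this setup, while the inventory file name `hosts.ini` is an assumption):

```bash
# On the control machine: confirm Ansible can reach every node with the provided key.
# hosts.ini is a placeholder for your inventory file.
ansible all -i hosts.ini --private-key /etc/kubernetes/key -m ping

# On the master node (c267): confirm the control plane reports Ready.
kubectl get nodes

# On the GPU worker (icgpu10): confirm the driver and the Container Toolkit are present.
nvidia-smi
nvidia-ctk --version
```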
## Cluster Setup

### Install and Configure Containerd with NVIDIA Support

- This step installs and configures Containerd as the container runtime on all nodes.
- If a GPU is detected on `icgpu10`, it configures Containerd to use the NVIDIA runtime for GPU workloads.
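The playbook performs this configuration automatically; the manual equivalent on the GPU worker looks roughly like this (a sketch, assuming the NVIDIA Container Toolkit is already installed):

```bash
# Register the NVIDIA runtime in /etc/containerd/config.toml and restart containerd.
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd

# Confirm the nvidia runtime now appears in the effective containerd configuration.
sudo containerd config dump | grep -A 3 'runtimes.nvidia'
```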
### Initialize Kubernetes Control Plane

- Initializes the Kubernetes cluster on the master node (`c267`) using `kubeadm init`.
- Sets up `kubeconfig` for the user and deploys the Flannel CNI plugin for networking.
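For reference, the manual equivalent on the master node is roughly the following sketch (the pod CIDR `10.244.0.0/16` is the value Flannel's stock manifest expects, and the manifest URL is the upstream default rather than something pinned by this repository):

```bash
# Initialize the control plane; Flannel's default manifest assumes this pod CIDR.
sudo kubeadm init --pod-network-cidr=10.244.0.0/16

# Make kubectl usable for the current user.
mkdir -p $HOME/.kube
sudo cp /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

# Deploy the Flannel CNI plugin.
kubectl apply -f https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml
```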
### Join Worker Nodes to Cluster

- Joins the worker nodes (`c276` and `icgpu10`) to the cluster using `kubeadm join`.
- Labels the GPU node (`icgpu10`) with `gpu=true` to enable GPU scheduling.
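The playbook automates the join; done by hand it would look roughly like this (a sketch; the token and CA certificate hash are generated on the master):

```bash
# On the master node: print a fresh join command (token + CA cert hash).
kubeadm token create --print-join-command

# On each worker node (c276 and icgpu10): run the printed command, e.g.
# sudo kubeadm join <master-ip>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>

# Back on the master: label the GPU node so GPU workloads can be targeted.
kubectl label node icgpu10 gpu=true
```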
## GPU Worker Configuration

The GPU worker node (`icgpu10`) is automatically configured to:

- Use the NVIDIA runtime for GPU workloads.
- Be labeled with `gpu=true` for GPU scheduling.
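To confirm the worker ended up in this state after the playbook has run (a short sketch of checks):

```bash
# Confirm the gpu=true label landed on icgpu10.
kubectl get nodes -L gpu

# Confirm the node advertises GPU capacity once the device plugin (next section) is running.
kubectl describe node icgpu10 | grep -i 'nvidia.com/gpu'
```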
## Deploy NVIDIA Device Plugin

The NVIDIA Device Plugin is deployed on the master node to enable GPU resource management in Kubernetes. This allows Kubernetes to schedule GPU workloads on the GPU worker node.
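The playbook deploys the plugin for you; a manual equivalent using the upstream Helm chart would be roughly as follows (a sketch; the chart repository URL and chart name come from the NVIDIA k8s-device-plugin project, and the release name `nvdp` is arbitrary):

```bash
# Add the upstream device-plugin chart repository and install it into kube-system.
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade --install nvdp nvdp/nvidia-device-plugin --namespace kube-system

# Verify the DaemonSet pod is running on the GPU node.
kubectl get pods -n kube-system | grep nvidia-device-plugin
```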
## Test GPU Workloads

To verify that the GPU worker node is properly configured, you can deploy a test pod that runs `nvidia-smi`.
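A minimal `gpu-test.yaml` could look like the following sketch. The pod name and the `gpu-test` namespace match the commands below; the CUDA image tag is an assumption, so adjust it (and the manifest as a whole) to match the file shipped in this repository:

```bash
# Create the namespace the test pod targets, then write the manifest.
kubectl create namespace gpu-test
cat > gpu-test.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
  namespace: gpu-test
spec:
  restartPolicy: Never
  nodeSelector:
    gpu: "true"                 # schedule onto the labeled GPU worker (icgpu10)
  # runtimeClassName: nvidia    # uncomment if the nvidia runtime is not containerd's default
  containers:
    - name: gpu-test
      image: nvidia/cuda:12.2.0-base-ubuntu22.04   # assumed CUDA base image tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1     # request one GPU from the device plugin
EOF
```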
- Apply the GPU test pod: `kubectl apply -f gpu-test.yaml`
- Check the status of the pod: `kubectl get pods -n gpu-test`
- Verify GPU usage by checking the logs for the `nvidia-smi` output: `kubectl logs gpu-test -n gpu-test`
- Clean up the test pod: `kubectl delete pod gpu-test -n gpu-test`
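## Deploy NV-Ingest

The final stage deploys NV-Ingest with Helm. A hedged sketch of the manual equivalent follows; the `nv-ingest` namespace matches the commands in the Cleanup section, but the chart reference, release values, and secret name are assumptions — consult the NV-Ingest / NeMo Retriever Extraction documentation for the exact chart and values:

```bash
# Namespace used by the deployment (matches the cleanup commands below).
kubectl create namespace nv-ingest

# Image pull secret for nvcr.io; NGC uses the literal username $oauthtoken
# with your NGC API key as the password.
kubectl create secret docker-registry ngc-secret \
  --namespace nv-ingest \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="$NGC_API_KEY"

# Install the chart. <nv-ingest-chart> is a placeholder for the chart reference
# published by NVIDIA; pass the pull secret and API key as the chart's values require.
helm upgrade --install nv-ingest <nv-ingest-chart> --namespace nv-ingest

# Watch the pods come up.
kubectl get pods -n nv-ingest
```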
## Cleanup

To clean up the Kubernetes cluster and remove all resources:

- Delete the NV-Ingest deployment:

  ```bash
  helm uninstall nv-ingest -n nv-ingest
  kubectl delete namespace nv-ingest
  ```

- Delete the GPU test pod (if it is still running): `kubectl delete pod gpu-test -n gpu-test`
- Reset the Kubernetes cluster (on the master node): `kubeadm reset`
- Remove Containerd and Kubernetes components from all nodes.
- Remove the NVIDIA Container Toolkit and drivers from the GPU worker node (a sketch of these host-level steps follows this list).
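The last two items are host-level steps. A rough sketch for Ubuntu-based nodes (the package names assume the upstream Kubernetes, Docker, and NVIDIA apt repositories — adapt them to your distribution):

```bash
# On every node: remove Kubernetes components and containerd.
sudo apt-get purge -y kubeadm kubectl kubelet kubernetes-cni containerd.io
sudo rm -rf /etc/cni/net.d /etc/kubernetes ~/.kube

# On the GPU worker (icgpu10): remove the NVIDIA Container Toolkit.
# Driver removal depends on how the drivers were installed (package vs. runfile).
sudo apt-get purge -y nvidia-container-toolkit
```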
## Troubleshooting

- GPU Not Detected
  - Ensure the NVIDIA drivers and Container Toolkit are installed on the GPU worker node.
  - Verify that the NVIDIA runtime is configured in `/etc/containerd/config.toml` (see the checks below).
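Quick checks to run on `icgpu10` when the GPU is not detected (a sketch):

```bash
# Driver check: should list the GPU(s).
nvidia-smi

# Toolkit check: the nvidia-ctk CLI ships with the NVIDIA Container Toolkit.
nvidia-ctk --version

# Runtime check: the nvidia runtime should appear in the containerd configuration.
grep -A 3 'runtimes.nvidia' /etc/containerd/config.toml
```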
- Pods Not Scheduling on GPU Node
  - Check that the GPU node is labeled correctly: `kubectl get nodes --show-labels`
  - Ensure the NVIDIA Device Plugin is running: `kubectl get pods -n kube-system | grep nvidia` (or `kubectl get pods -A | grep nvidia`)
- NV-Ingest Deployment Fails
  - Verify that the NVIDIA NGC API key is correct and has the necessary permissions.
  - Check the NV-Ingest pods for errors: `kubectl get pods -n nv-ingest`, then `kubectl logs -n nv-ingest <pod-name>`
- Network Issues
  - Ensure the Flannel CNI plugin is deployed and running: `kubectl get pods -n kube-system | grep flannel` (or `kubectl get pods -A | grep flannel`)
- General Kubernetes Issues
  - Check the status of all nodes: `kubectl get nodes`
  - Check the status of all pods: `kubectl get pods --all-namespaces`
  - Check the kubelet logs on any node (master, workers, or the GPU worker): `journalctl -u kubelet`
  - Check the API server logs on the master node (with kubeadm it runs as a static pod, not a systemd unit): `kubectl logs -n kube-system kube-apiserver-<master-node-name>`
  - Check the logs of the Flannel CNI plugin: `kubectl logs -n kube-system <flannel-pod-name>`
  - Check the logs of the NVIDIA Device Plugin: `kubectl logs -n kube-system <nvidia-device-plugin-pod-name>`
  - Check the logs of the NV-Ingest pods: `kubectl logs -n nv-ingest <nv-ingest-pod-name>`
  - Check the logs of the GPU test pod: `kubectl logs -n gpu-test <gpu-test-pod-name>`