This repository contains Ansible playbooks that set up a Kubernetes cluster inside a Slurm reservation to evaluate NVIDIA Ingest (NV-Ingest) with GPU support on the master and worker nodes. It ensures that the GPU worker node is configured correctly for GPU workloads, including running NV-Ingest and NeMo Retriever Extraction.
The playbook is divided into three main sections:
- Install and Configure Containerd with NVIDIA Support: Installs Containerd, configures it for Kubernetes, and adds the NVIDIA runtime if a GPU is detected.
- Initialize Kubernetes Control Plane: Initializes the Kubernetes cluster on the master node.
- Join Worker Nodes to Cluster: Joins the worker nodes to the cluster and labels the GPU node.
Additionally, this setup includes the deployment of NV-Ingest using Helm for GPU-accelerated workloads.
## Contents

- Prerequisites
- Cluster Setup
- GPU Worker Configuration
- Deploy NVIDIA Device Plugin
- Test GPU Workloads
- Deploy NV-Ingest
- Cleanup
- Troubleshooting
## Prerequisites

Before running the playbook, ensure the following:

- Master Node: Properly initialized (`kubectl get nodes` should show the master node as `Ready`).
- Worker Nodes: Correctly joined to the cluster.
- GPU Worker: The GPU worker node (`icgpu10`) must have:
  - An NVIDIA GPU with drivers installed.
  - The NVIDIA Container Toolkit configured.
  - Containerd configured for Kubernetes.
- NVIDIA NGC API Key: Required for pulling NVIDIA Docker images and deploying NV-Ingest. Generate it from the NVIDIA NGC website.
- Ansible: Installed on the machine running the playbook.
- SSH Access: Ensure SSH access to all nodes with the provided SSH key (`/etc/kubernetes/key`).
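A quick way to sanity-check these prerequisites before running the playbook (a minimal sketch; the node names `c267` and `icgpu10` and the key path come from this setup, while the inventory file name `hosts.ini` is an assumption):

```bash
# On the control machine: confirm Ansible can reach every node with the provided key.
# hosts.ini is a placeholder for your inventory file.
ansible all -i hosts.ini --private-key /etc/kubernetes/key -m ping

# On the master node (c267): confirm the control plane reports Ready.
kubectl get nodes

# On the GPU worker (icgpu10): confirm the driver and the Container Toolkit are present.
nvidia-smi
nvidia-ctk --version
```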
## Cluster Setup

### Install and Configure Containerd with NVIDIA Support

- This step installs and configures Containerd as the container runtime on all nodes.
- If a GPU is detected on `icgpu10`, it configures Containerd to use the NVIDIA runtime for GPU workloads.
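The playbook performs this configuration automatically; the manual equivalent on the GPU worker looks roughly like this (a sketch, assuming the NVIDIA Container Toolkit is already installed):

```bash
# Register the NVIDIA runtime in /etc/containerd/config.toml and restart containerd.
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd

# Confirm the nvidia runtime now appears in the effective containerd configuration.
sudo containerd config dump | grep -A 3 'runtimes.nvidia'
```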
### Initialize Kubernetes Control Plane

- Initializes the Kubernetes cluster on the master node (`c267`) using `kubeadm init`.
- Sets up `kubeconfig` for the user and deploys the Flannel CNI plugin for networking.
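For reference, the manual equivalent on the master node is roughly the following sketch (the pod CIDR `10.244.0.0/16` is the value Flannel's stock manifest expects, and the manifest URL is the upstream default rather than something pinned by this repository):

```bash
# Initialize the control plane; Flannel's default manifest assumes this pod CIDR.
sudo kubeadm init --pod-network-cidr=10.244.0.0/16

# Make kubectl usable for the current user.
mkdir -p $HOME/.kube
sudo cp /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

# Deploy the Flannel CNI plugin.
kubectl apply -f https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml
```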
### Join Worker Nodes to Cluster

- Joins the worker nodes (`c276` and `icgpu10`) to the cluster using `kubeadm join`.
- Labels the GPU node (`icgpu10`) with `gpu=true` to enable GPU scheduling.
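The playbook automates the join; done by hand it would look roughly like this (a sketch; the token and CA certificate hash are generated on the master):

```bash
# On the master node: print a fresh join command (token + CA cert hash).
kubeadm token create --print-join-command

# On each worker node (c276 and icgpu10): run the printed command, e.g.
# sudo kubeadm join <master-ip>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>

# Back on the master: label the GPU node so GPU workloads can be targeted.
kubectl label node icgpu10 gpu=true
```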
## GPU Worker Configuration

The GPU worker node (`icgpu10`) is automatically configured to:

- Use the NVIDIA runtime for GPU workloads.
- Be labeled with `gpu=true` for GPU scheduling.
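To confirm the worker ended up in this state after the playbook has run (a short sketch of checks):

```bash
# Confirm the gpu=true label landed on icgpu10.
kubectl get nodes -L gpu

# Confirm the node advertises GPU capacity once the device plugin (next section) is running.
kubectl describe node icgpu10 | grep -i 'nvidia.com/gpu'
```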
## Deploy NVIDIA Device Plugin

The NVIDIA Device Plugin is deployed on the master node to enable GPU resource management in Kubernetes. This allows Kubernetes to schedule GPU workloads on the GPU worker node.
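The playbook deploys the plugin for you; a manual equivalent using the upstream Helm chart would be roughly as follows (a sketch; the chart repository URL and chart name come from the NVIDIA k8s-device-plugin project, and the release name `nvdp` is arbitrary):

```bash
# Add the upstream device-plugin chart repository and install it into kube-system.
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade --install nvdp nvdp/nvidia-device-plugin --namespace kube-system

# Verify the DaemonSet pod is running on the GPU node.
kubectl get pods -n kube-system | grep nvidia-device-plugin
```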
## Test GPU Workloads

To verify that the GPU worker node is properly configured, you can deploy a test pod that runs `nvidia-smi`.
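A minimal `gpu-test.yaml` could look like the following sketch. The pod name and the `gpu-test` namespace match the commands below; the CUDA image tag is an assumption, so adjust it (and the manifest as a whole) to match the file shipped in this repository:

```bash
# Create the namespace the test pod targets, then write the manifest.
kubectl create namespace gpu-test
cat > gpu-test.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
  namespace: gpu-test
spec:
  restartPolicy: Never
  nodeSelector:
    gpu: "true"                 # schedule onto the labeled GPU worker (icgpu10)
  # runtimeClassName: nvidia    # uncomment if the nvidia runtime is not containerd's default
  containers:
    - name: gpu-test
      image: nvidia/cuda:12.2.0-base-ubuntu22.04   # assumed CUDA base image tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1     # request one GPU from the device plugin
EOF
```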
- Apply the GPU test pod: `kubectl apply -f gpu-test.yaml`
- Check the status of the pod: `kubectl get pods -n gpu-test`
- Verify GPU usage by checking the logs for the `nvidia-smi` output: `kubectl logs gpu-test -n gpu-test`
- Clean up the test pod: `kubectl delete pod gpu-test -n gpu-test`
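## Deploy NV-Ingest

The final stage deploys NV-Ingest with Helm. A hedged sketch of the manual equivalent follows; the `nv-ingest` namespace matches the commands in the Cleanup section, but the chart reference, release values, and secret name are assumptions — consult the NV-Ingest / NeMo Retriever Extraction documentation for the exact chart and values:

```bash
# Namespace used by the deployment (matches the cleanup commands below).
kubectl create namespace nv-ingest

# Image pull secret for nvcr.io; NGC uses the literal username $oauthtoken
# with your NGC API key as the password.
kubectl create secret docker-registry ngc-secret \
  --namespace nv-ingest \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="$NGC_API_KEY"

# Install the chart. <nv-ingest-chart> is a placeholder for the chart reference
# published by NVIDIA; pass the pull secret and API key as the chart's values require.
helm upgrade --install nv-ingest <nv-ingest-chart> --namespace nv-ingest

# Watch the pods come up.
kubectl get pods -n nv-ingest
```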
## Cleanup

To clean up the Kubernetes cluster and remove all resources:

- Delete the NV-Ingest deployment:

  ```bash
  helm uninstall nv-ingest -n nv-ingest
  kubectl delete namespace nv-ingest
  ```

- Delete the GPU test pod (if it is still running): `kubectl delete pod gpu-test -n gpu-test`
- Reset the Kubernetes cluster (on the master node): `kubeadm reset`
- Remove Containerd and Kubernetes components from all nodes.
- Remove the NVIDIA Container Toolkit and drivers from the GPU worker node (a sketch of these host-level steps follows this list).
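The last two items are host-level steps. A rough sketch for Ubuntu-based nodes (the package names assume the upstream Kubernetes, Docker, and NVIDIA apt repositories — adapt them to your distribution):

```bash
# On every node: remove Kubernetes components and containerd.
sudo apt-get purge -y kubeadm kubectl kubelet kubernetes-cni containerd.io
sudo rm -rf /etc/cni/net.d /etc/kubernetes ~/.kube

# On the GPU worker (icgpu10): remove the NVIDIA Container Toolkit.
# Driver removal depends on how the drivers were installed (package vs. runfile).
sudo apt-get purge -y nvidia-container-toolkit
```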
## Troubleshooting

- GPU Not Detected
  - Ensure the NVIDIA drivers and Container Toolkit are installed on the GPU worker node.
  - Verify that the NVIDIA runtime is configured in `/etc/containerd/config.toml` (see the checks below).
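Quick checks to run on `icgpu10` when the GPU is not detected (a sketch):

```bash
# Driver check: should list the GPU(s).
nvidia-smi

# Toolkit check: the nvidia-ctk CLI ships with the NVIDIA Container Toolkit.
nvidia-ctk --version

# Runtime check: the nvidia runtime should appear in the containerd configuration.
grep -A 3 'runtimes.nvidia' /etc/containerd/config.toml
```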
- Pods Not Scheduling on GPU Node
  - Check that the GPU node is labeled correctly: `kubectl get nodes --show-labels`
  - Ensure the NVIDIA Device Plugin is running: `kubectl get pods -n kube-system | grep nvidia` (or `kubectl get pods -A | grep nvidia`)
- NV-Ingest Deployment Fails
  - Verify that the NVIDIA NGC API key is correct and has the necessary permissions.
  - Check the NV-Ingest pods for errors: `kubectl get pods -n nv-ingest`, then `kubectl logs -n nv-ingest <pod-name>`
- Network Issues
  - Ensure the Flannel CNI plugin is deployed and running: `kubectl get pods -n kube-system | grep flannel` (or `kubectl get pods -A | grep flannel`)
- General Kubernetes Issues
  - Check the status of all nodes: `kubectl get nodes`
  - Check the status of all pods: `kubectl get pods --all-namespaces`
  - Check the kubelet logs on any node (master, workers, or the GPU worker): `journalctl -u kubelet`
  - Check the API server logs on the master node (with kubeadm it runs as a static pod, not a systemd unit): `kubectl logs -n kube-system kube-apiserver-<master-node-name>`
  - Check the logs of the Flannel CNI plugin: `kubectl logs -n kube-system <flannel-pod-name>`
  - Check the logs of the NVIDIA Device Plugin: `kubectl logs -n kube-system <nvidia-device-plugin-pod-name>`
  - Check the logs of the NV-Ingest pods: `kubectl logs -n nv-ingest <nv-ingest-pod-name>`
  - Check the logs of the GPU test pod: `kubectl logs -n gpu-test <gpu-test-pod-name>`