
Issue #279: Updated .md files for Omnia Core and Appliance #280


Merged · 2 commits · Mar 10, 2021
5 changes: 5 additions & 0 deletions docs/INSTALL_OMNIA.md
@@ -107,6 +107,11 @@ Commands to install JupyterHub and Kubeflow:
* `ansible-playbook platforms/jupyterhub.yml -i inventory -e "ansible_python_interpreter=/usr/bin/python2"`
* `ansible-playbook platforms/kubeflow.yml -i inventory -e "ansible_python_interpreter=/usr/bin/python2" `

__Note:__ When Internet connectivity is unstable or slow, pulling the images needed to create the Kubeflow containers may take longer. If the time limit is exceeded, the **Apply Kubeflow configurations** task may fail. To resolve this issue, redeploy the Kubernetes cluster and reinstall Kubeflow by completing the following steps (a command sketch follows this list):
* Format the OS on the manager and compute nodes.
* In the `omnia_config.yml` file, change the `k8s_cni` variable value from calico to flannel.
* Run the Kubernetes and Kubeflow playbooks.
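A minimal command sketch of this recovery flow, assuming `omnia_config.yml` contains a line of the form `k8s_cni: calico` and that the __slurm__ skip tag limits `omnia.yml` to the Kubernetes roles; the inventory path and extra variables should match your original deployment.

```
# Switch the CNI plugin in omnia_config.yml (adjust the pattern if the value is quoted).
sed -i 's/^k8s_cni: *calico/k8s_cni: flannel/' omnia_config.yml

# Redeploy Kubernetes, skipping the Slurm roles.
ansible-playbook omnia.yml -i inventory --skip-tags slurm

# Reinstall Kubeflow.
ansible-playbook platforms/kubeflow.yml -i inventory -e "ansible_python_interpreter=/usr/bin/python2"
```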

## Add a new compute node to the cluster

Update the INVENTORY file present in the `omnia` directory with the new node's IP address under the compute group. Ensure that the other nodes that are already part of the cluster are also listed in the compute group along with the new node. Then, run `omnia.yml` to add the new node to the cluster and update the configurations of the manager node. A hypothetical inventory sketch is shown below.
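A hypothetical sketch of the resulting inventory, assuming the standard Ansible INI layout with the manager and compute groups mentioned above; the IP addresses are placeholders and 172.17.0.10 stands in for the new node.

```
# Write the INVENTORY file with the new compute node added (placeholder IPs).
cat > inventory <<'EOF'
[manager]
172.17.0.2

[compute]
172.17.0.5
172.17.0.6
172.17.0.10
EOF

# Re-run Omnia so the new node joins the cluster and the manager node configuration is refreshed.
ansible-playbook omnia.yml -i inventory
```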
7 changes: 6 additions & 1 deletion docs/INSTALL_OMNIA_APPLIANCE.md
@@ -44,7 +44,7 @@ Omnia considers the following usernames as default:
* `admin` for AWX
* `slurm` for MariaDB

8. Run `ansible-playbook appliance.yml -e "ansible_python_interpreter=/usr/bin/python2"` to install Omnia appliance.
9. Run `ansible-playbook appliance.yml -e "ansible_python_interpreter=/usr/bin/python2"` to install Omnia appliance.


Omnia creates a log file which is available at: `/var/log/omnia.log`.
@@ -116,6 +116,11 @@ __Note:__ To install __JupyterHub__ and __Kubeflow__ playbooks:
* From the __PLAYBOOK__ dropdown menu, select __platforms/jupyterhub.yml__ and launch the template to install JupyterHub.
* From the __PLAYBOOK__ dropdown menu, select __platforms/kubeflow.yml__ and launch the template to install Kubeflow.

__Note:__ When Internet connectivity is unstable or slow, pulling the images needed to create the Kubeflow containers may take longer. If the time limit is exceeded, the **Apply Kubeflow configurations** task may fail. To resolve this issue, redeploy the Kubernetes cluster and reinstall Kubeflow by completing the following steps (a verification sketch follows this list):
* Complete the PXE booting of the manager and compute nodes.
* In the `omnia_config.yml` file, change the `k8s_cni` variable value from calico to flannel.
* Run the Kubernetes and Kubeflow playbooks.
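Once the playbooks have been re-run, a quick way to confirm that the Kubeflow images were pulled successfully is to check the pods in the `kubeflow` namespace; this is a generic `kubectl` check rather than an Omnia-specific command.

```
# All Kubeflow pods should eventually reach the Running state.
kubectl get pods --namespace kubeflow

# If a pod is stuck in ImagePullBackOff, inspect its events for the failing image pull.
kubectl describe pod <pod-name> --namespace kubeflow
```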

The DeployOmnia template may not run successfully if:
- The Manager group contains more than one host.
- The Compute group does not contain a host. Ensure that the Compute group is assigned with at least one host node.
19 changes: 13 additions & 6 deletions docs/README.md
@@ -78,7 +78,7 @@ Issue: Hosts do not display on the AWX UI.
Resolution:
* Verify if `provisioned_hosts.yml` is present in the `omnia/appliance/roles/inventory/files` folder.
* Verify if hosts are not listed in the `provisioned_hosts.yml` file. If hosts are not listed, then servers are not PXE booted yet.
* If hosts are listed in the `provisioned_hosts.yml` file, then an IP address has been assigned to them by DHCP. However, hosts are not displyed on the AWX UI as the PXE boot is still in process or is not initiated.
* If hosts are listed in the `provisioned_hosts.yml` file, then an IP address has been assigned to them by DHCP. However, hosts are not displayed on the AWX UI as the PXE boot is still in process or is not initiated.
* Check for reachable and unreachable hosts using the `provisioned_report.yml` tool present in the `omnia/appliance/tools` folder. To run provisioned_report.yml, go to the omnia/appliance directory and run `ansible-playbook -i roles/inventory/files/provisioned_hosts.yml tools/provisioned_report.yml`.

# Frequently asked questions
@@ -87,7 +87,7 @@ Resolution:
1. When AWX is not accessible even after five minutes of wait time.
2. When __isMigrating__ or __isInstalling__ is seen in the failure message.

Resolution:
Wait for the AWX UI to be accessible at http://\<management-station-IP>:8081, and then run the `appliance.yml` file again, where __management-station-IP__ is the IP address of the management node.
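A generic shell check (not an Omnia command) to wait until the AWX UI responds before re-running the playbook; the IP address is a placeholder.

```
# Poll until AWX answers with HTTP 200 on port 8081 (placeholder IP shown).
until [ "$(curl -s -o /dev/null -w '%{http_code}' http://192.168.1.10:8081)" = "200" ]; do
  sleep 30
done

# Then re-run the appliance playbook.
ansible-playbook appliance.yml -e "ansible_python_interpreter=/usr/bin/python2"
```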

* What are the next steps after the nodes in a Kubernetes cluster reboot?
@@ -111,7 +111,7 @@ Resolution:
Cause:
* When the mounted .iso file is corrupt.

Resolution:
1. Go to __var__->__log__->__cobbler__->__cobbler.log__ to view the error.
2. If the error message is **repo verification failed**, the .iso file is not mounted properly.
3. Verify that the downloaded .iso file is valid and not corrupted (see the sketch below).
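A short sketch for inspecting the Cobbler log and sanity-checking the downloaded .iso; the file name is a placeholder, and the checksum should be compared against the value published for the image.

```
# Look for the repo verification error in the Cobbler log.
grep -i "repo verification failed" /var/log/cobbler/cobbler.log

# Compute the checksum of the downloaded .iso (placeholder path) and compare it with the published value.
sha256sum /root/CentOS-7-x86_64-Minimal-2009.iso
```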
@@ -122,7 +122,7 @@ Resolution:
* When RAID is configured on the server.
* When more than two servers in the same network have Cobbler services running.

Resolution:
1. Create a Non-RAID or virtual disk in the server.
2. Check whether any system other than the management node has cobblerd running, as in the sketch below. If yes, stop the Cobbler container using the following commands: `docker rm -f cobbler` and `docker image rm -f cobbler`.
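A minimal sketch for finding stray Cobbler instances, assuming Cobbler runs as a Docker container named `cobbler`, as the commands above imply.

```
# Run on each suspect system: check whether a Cobbler container is running here.
docker ps --filter name=cobbler

# If it is running on a system other than the management node, remove the container and its image.
docker rm -f cobbler
docker image rm -f cobbler
```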

@@ -156,13 +156,20 @@ Resolution:
* `systemctl restart slurmdbd` on the manager node
* `systemctl restart slurmd` on the compute node

* What to do if Kubernetes Pods are unable to communicate with the servers when the DNS servers are not responding?
Cause: A DNS issue with the host network.
Resolution:
1. In your Kubernetes cluster, run `kubeadm reset -f` on the nodes.
2. In the management node, edit the `omnia_config.yml` file to change the Kubernetes Pod Network CIDR. The suggested range is 192.168.0.0/16; ensure that the range you provide is not already in use on your host network.
3. Execute `omnia.yml` and skip Slurm using the skip tag __slurm__ (see the sketch below).
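A minimal command sketch of this recovery, assuming the pod network CIDR is set through a variable in `omnia_config.yml` (the variable name shown is an assumption) and that `--skip-tags slurm` corresponds to the skip tag mentioned above; adjust the inventory path to your deployment.

```
# On every node in the cluster, tear down the existing Kubernetes state.
kubeadm reset -f

# On the management node, set an unused pod network CIDR in omnia_config.yml,
# for example (the variable name here is an assumption):
# k8s_pod_network_cidr: "192.168.0.0/16"

# Re-run Omnia, skipping the Slurm roles.
ansible-playbook omnia.yml -i inventory --skip-tags slurm
```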

* What to do if the time taken to pull the images for the Kubeflow containers exceeds the limit and the Apply Kubeflow configurations task fails?
Cause: Unstable or slow Internet connectivity.
Resolution:
1. Complete the PXE booting of, or format the OS on, the manager and compute nodes.
2. In the `omnia_config.yml` file, change the `k8s_cni` variable value from calico to flannel.
3. Run the Kubernetes and Kubeflow playbooks.

# Limitations
1. Removal of Slurm and Kubernetes component roles is not supported. However, skip tags can be provided at the start of installation to select the component roles.
2. After the installation of the Omnia appliance, changing the manager node is not supported. If you need to change the manager node, you must redeploy the entire cluster.