Designating Specific Role Nodes¶
When installing a production cluster you may want to:
- Set one or more Run:ai system nodes. These are nodes dedicated to Run:ai software.
- Machine learning frequently requires jobs that require CPU but not GPU. You may want to direct these jobs to dedicated nodes that do not have GPUs, so as not to overload these machines.
- Limit Run:ai monitoring and scheduling to specific nodes in the cluster.
To perform these tasks you will need the Run:ai Administrator CLI. See Installing the Run:ai Administrator Command-line Interface.
Dedicated Run:ai System Nodes¶
Find out the names of the nodes designated for the Run:ai system by running kubectl get nodes
. For each such node run:
If you re-run kubectl get nodes
you will see the node role of these nodes changed to runai-system
To remove the runai-system node role run:
Warning
Do not select the Kubernetes master as a runai-system node. This may cause Kubernetes to stop working (specifically if Kubernetes API Server is configured on 443 instead of the default 6443).
Dedicated GPU and CPU Nodes¶
Important
To enable this feature, you must set the cluster configuration flag global.nodeAffinity.restrictScheduling
to true
. For more information see customize cluster.
Separate nodes into those that:
- Run GPU workloads
- Run CPU workloads
- Do not run Run:ai at all. these jobs will not be monitored using the Run:ai Administration User interface.
Review nodes names using kubectl get nodes
. For each such node run:
or
Nodes not marked as GPU worker or CPU worker will not run Run:ai at all.
To set all workers not running runai-system as GPU only or CPU only workers run:
To remove the CPU or GPU worker node role run:
or