Lesson 3.1: Scheduling Pods (Pod Affinity, Anti-Affinity, Taints and Tolerations)
Taints and Tolerations
Taints and tolerations work together to control pod scheduling on nodes. Taints are applied to nodes to repel pods unless they have a matching toleration. Tolerations are added to pods to allow (but not guarantee) scheduling on tainted nodes.
Key Concepts
- Taints:
  - Applied to nodes to restrict which pods can run on them.
  - Syntax: kubectl taint node <node-name> key=value:effect
  - Effects:
    - NoSchedule: Pods without matching tolerations will not be scheduled on the node.
    - PreferNoSchedule: Kubernetes will try to avoid scheduling pods without matching tolerations but doesn’t enforce it strictly.
    - NoExecute: Evicts existing pods without matching tolerations and blocks new ones.
- Tolerations:
  - Added to pods to allow them to tolerate a node’s taint.
  - Defined in the pod’s spec.tolerations field (see the sketch below).
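For reference, a toleration matching a taint applied with kubectl taint node <node-name> gpu=true:NoSchedule would look like this inside a Pod spec (a minimal sketch; the scenario below shows the full manifest):

```yaml
spec:
  tolerations:
  - key: "gpu"
    operator: "Equal"     # match the taint's key, value and effect exactly
    value: "true"
    effect: "NoSchedule"
```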
Example Scenario:
- Tainting two worker nodes (cka-cluster2-worker and cka-cluster2-worker2) with gpu=true:NoSchedule:
  - Both nodes now repel pods that don’t tolerate the gpu=true:NoSchedule taint.

```
[root@master ~]# kubectl get nodes
NAME                         STATUS   ROLES           AGE   VERSION
cka-cluster2-control-plane   Ready    control-plane   44h   v1.29.14
cka-cluster2-worker          Ready    <none>          44h   v1.29.14
cka-cluster2-worker2         Ready    <none>          44h   v1.29.14
[root@master ~]# kubectl taint node cka-cluster2-worker gpu=true:NoSchedule
node/cka-cluster2-worker tainted
[root@master ~]# kubectl taint node cka-cluster2-worker2 gpu=true:NoSchedule
node/cka-cluster2-worker2 tainted
[root@master ~]# kubectl describe node cka-cluster2-worker | grep -i taint
Taints:             gpu=true:NoSchedule
[root@master ~]# kubectl describe node cka-cluster2-worker2 | grep -i taint
Taints:             gpu=true:NoSchedule
```
- Creating an nginx pod without a toleration:
  - Outcome:
    - The pod remained Pending because no node was available: every node carried an untolerated NoSchedule taint.
    - The scheduler’s event log confirmed the failure:

```
[root@master ~]# kubectl run nginx --image=nginx
pod/nginx created
[root@master ~]# kubectl get pods
NAME    READY   STATUS    RESTARTS   AGE
nginx   0/1     Pending   0          5s
[root@master ~]# kubectl describe pod nginx | tail -5
Events:
  Type     Reason            Age    From               Message
  ----     ------            ----   ----               -------
  Warning  FailedScheduling  117s   default-scheduler  0/3 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 2 node(s) had untolerated taint {gpu: true}. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
```
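The event also mentions the control-plane node’s built-in taint, which is why the message reads 0/3 rather than 0/2. You can inspect it the same way (the output shown here is typical for kubeadm/kind clusters and may vary by distribution):

```
kubectl describe node cka-cluster2-control-plane | grep -i taint
# Taints:             node-role.kubernetes.io/control-plane:NoSchedule
```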
- Created a redis pod with a toleration for gpu=true:NoSchedule:

```
[root@master tainttoleration]# kubectl run redis --image=redis --dry-run=client -o yaml > redis.yml
[root@master tainttoleration]# vim redis.yml
[root@master tainttoleration]# cat redis.yml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: redis
  name: redis
spec:
  containers:
  - image: redis
    name: redis
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Always
  tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
status: {}
[root@master tainttoleration]# kubectl get pods -o wide
NAME    READY   STATUS    RESTARTS   AGE   IP            NODE                  NOMINATED NODE   READINESS GATES
nginx   0/1     Pending   0          5m    <none>        <none>                <none>           <none>
redis   1/1     Running   0          44s   10.244.1.14   cka-cluster2-worker   <none>           <none>
# Now try untainting one of the nodes to see if nginx is scheduled to the untainted node
[root@master tainttoleration]# kubectl describe node cka-cluster2-worker2 | grep -i taint
Taints:             gpu=true:NoSchedule
```

- The redis pod was scheduled on cka-cluster2-worker (still tainted) because it tolerated the taint.
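A toleration does not have to pin an exact value: with operator: Exists the value field is omitted and any taint with that key and effect is tolerated. A minimal variant of the toleration above, not used in this scenario:

```yaml
tolerations:
- key: "gpu"
  operator: "Exists"    # matches gpu=<any value>:NoSchedule
  effect: "NoSchedule"
```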
Untainting a Node
- Removed the taint from cka-cluster2-worker2:
  - cka-cluster2-worker2 became untainted and available for scheduling.
  - The nginx pod (still without a toleration) was scheduled on cka-cluster2-worker2 because it no longer had the taint.

```
[root@master tainttoleration]# kubectl taint node cka-cluster2-worker2 gpu=true:NoSchedule-
node/cka-cluster2-worker2 untainted
[root@master tainttoleration]# kubectl describe node cka-cluster2-worker2 | grep -i taint
Taints:             <none>
# Now the pod is scheduled and in the Running state
[root@master tainttoleration]# kubectl get pods -o wide
NAME    READY   STATUS    RESTARTS   AGE     IP            NODE                   NOMINATED NODE   READINESS GATES
nginx   1/1     Running   0          7m59s   10.244.2.19   cka-cluster2-worker2   <none>           <none>
redis   1/1     Running   0          3m43s   10.244.1.14   cka-cluster2-worker    <none>           <none>
```
When to Use Taints and Tolerations
- Dedicated Nodes: Reserve nodes for specific workloads (e.g., GPU nodes for AI workloads).
- Node Isolation: Prevent non-critical pods from running on critical nodes (e.g., control plane).
- Graceful Evictions: Use NoExecute to evict pods during maintenance.
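For example, a maintenance taint with the NoExecute effect evicts running pods that do not tolerate it, and a tolerating pod can limit how long it stays after the taint appears via tolerationSeconds. A minimal sketch (the maintenance key is illustrative):

```yaml
# Applied on the node: kubectl taint node <node-name> maintenance=true:NoExecute
# Pod-side toleration with an optional grace period:
tolerations:
- key: "maintenance"
  operator: "Equal"
  value: "true"
  effect: "NoExecute"
  tolerationSeconds: 300   # the pod is evicted 5 minutes after the taint is applied
```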
NodeSelector
NodeSelector is a mechanism to constrain Pods to run on specific nodes by matching node labels. It is a simple way to enforce scheduling based on node characteristics (e.g., hardware, environment, or custom labels).
Key Concepts
- Node Labels:
  - Key-value pairs attached to nodes to describe their attributes (e.g., gpu=true, env=prod).
  - Labels are set using kubectl label node <node-name> <key>=<value>.
- NodeSelector:
  - A field in the Pod’s spec that specifies which node labels the Pod requires.
  - The Pod will only be scheduled on nodes that have all the specified labels (see the sketch below).
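For instance, a Pod listing two labels under nodeSelector (both keys here are illustrative) is only eligible for nodes that carry both of them:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
  - name: web
    image: nginx
  nodeSelector:
    env: "prod"   # node must have env=prod
    gpu: "true"   # ...and gpu=true; a node missing either label is excluded
```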
Example Scenario
- Created a Pod (nginx.yml) with a nodeSelector requiring the label gpu=false:
- Initial State of Nodes:
  - Node Labels:
    - Neither cka-cluster2-worker nor cka-cluster2-worker2 had the gpu=false label initially.
  - Node Taints:
    - cka-cluster2-worker2 had a taint (gpu=true:NoSchedule).
    - cka-cluster2-worker had no taints.
- The Pod remained Pending because no nodes matched the gpu=false label.

```
[root@master nodeselector]# cat nginx.yml
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: nginx
  name: nginx
spec:
  containers:
  - image: nginx
    name: nginx
  nodeSelector:
    gpu: "false"
[root@master nodeselector]# kubectl apply -f nginx.yml
[root@master nodeselector]# kubectl describe nodes cka-cluster2-worker | grep -i taint
Taints:             <none>
[root@master nodeselector]# kubectl describe nodes cka-cluster2-worker2 | grep -i taint
Taints:             gpu=true:NoSchedule
[root@master nodeselector]# kubectl get pods -o wide
NAME    READY   STATUS    RESTARTS   AGE   IP       NODE     NOMINATED NODE   READINESS GATES
nginx   0/1     Pending   0          6s    <none>   <none>   <none>           <none>
```
- Labeling the nodes: added the gpu=false label to both worker nodes.
- Node Status After Labeling:
  - Both nodes now had the label gpu=false.
  - cka-cluster2-worker2 still had the taint gpu=true:NoSchedule.

```
[root@master nodeselector]# kubectl label node cka-cluster2-worker gpu=false
node/cka-cluster2-worker labeled
[root@master nodeselector]# kubectl label node cka-cluster2-worker2 gpu=false
node/cka-cluster2-worker2 labeled
[root@master nodeselector]# kubectl get pods -o wide
NAME    READY   STATUS    RESTARTS   AGE     IP            NODE                  NOMINATED NODE   READINESS GATES
nginx   1/1     Running   0          4m32s   10.244.1.15   cka-cluster2-worker   <none>           <none>
```
- Why cka-cluster2-worker?
  - NodeSelector: Both nodes matched the gpu=false label.
  - Taints:
    - cka-cluster2-worker2 had a NoSchedule taint, which repelled the Pod (no toleration was added).
    - cka-cluster2-worker had no taints, so the Pod was scheduled there.
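A quick way to reason about this yourself is to list the nodes that carry the label the Pod asks for, then check each candidate for untolerated taints (commands assume the same cluster as above):

```
kubectl get nodes -l gpu=false
kubectl describe node cka-cluster2-worker2 | grep -i taint
```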
Affinity
Affinity in Kubernetes provides advanced control over Pod scheduling by defining rules that influence which nodes a Pod can be placed on. Unlike nodeSelector, affinity rules are more expressive and allow for complex scheduling logic. There are two primary types of node affinity:
requiredDuringSchedulingIgnoredDuringExecution
This is a hard requirement that must be met for the Pod to be scheduled. If no node matches the rule, the Pod remains in a Pending state.
- In this affinity.yml manifest, the Pod redis requires a node with the label disktype=ssd:
- Initial State:
  - No nodes had the disktype=ssd label.
  - The Pod stayed Pending with an event message:

```
[root@master affinity]# cat affinity.yml
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: redis
  name: redis
spec:
  containers:
  - image: redis
    name: redis
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
[root@master affinity]# kubectl get pods
No resources found in default namespace.
[root@master affinity]# kubectl apply -f affinity.yml
pod/redis created
[root@master affinity]# kubectl get pods
NAME    READY   STATUS    RESTARTS   AGE
redis   0/1     Pending   0          3s
[root@master affinity]# kubectl describe pod redis
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  9s    default-scheduler  0/3 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 2 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
```
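matchExpressions is what makes affinity more expressive than nodeSelector: besides In, the operator can be NotIn, Exists, DoesNotExist, Gt, or Lt. For example, requiring only that a disktype label be present, whatever its value (an illustrative variant, not part of this scenario):

```yaml
requiredDuringSchedulingIgnoredDuringExecution:
  nodeSelectorTerms:
  - matchExpressions:
    - key: disktype
      operator: Exists   # any value of disktype qualifies the node
```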
- After Labeling:
  - Labeled cka-cluster2-worker with disktype=ssd.
  - The scheduler detected the matching node and scheduled the Pod:

```
[root@master affinity]# kubectl label node cka-cluster2-worker disktype=ssd
node/cka-cluster2-worker labeled
[root@master affinity]# kubectl get pods
NAME    READY   STATUS    RESTARTS   AGE
redis   1/1     Running   0          2m8s
[root@master affinity]# kubectl describe pod redis | grep Node:
Node:         cka-cluster2-worker/172.18.0.4
```
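One detail worth knowing about how the required rule is evaluated: entries under nodeSelectorTerms are ORed (a node matching any one term qualifies), while the matchExpressions inside a single term are ANDed. An illustrative sketch with hypothetical labels:

```yaml
requiredDuringSchedulingIgnoredDuringExecution:
  nodeSelectorTerms:
  - matchExpressions:                 # term 1: disktype=ssd AND env=prod
    - {key: disktype, operator: In, values: ["ssd"]}
    - {key: env, operator: In, values: ["prod"]}
  - matchExpressions:                 # OR term 2: disktype=nvme
    - {key: disktype, operator: In, values: ["nvme"]}
```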
preferredDuringSchedulingIgnoredDuringExecution
This is a soft preference, not a strict requirement. The scheduler tries to fulfill the rule but will schedule the Pod on another node if the preferred condition isn’t met.
- In this affinity2.yml manifest, the Pod redis-new prefers a node with the label disktype=hdd:
- Outcome:
  - No nodes had the disktype=hdd label.
  - The scheduler ignored the preference and scheduled the Pod on an available node (cka-cluster2-worker2):
- Key Behavior:
  - The scheduler attempts to place the Pod on a node matching the preference but doesn’t enforce it.
  - The Pod runs even if no nodes match the preference.

```
[root@master affinity]# cat affinity2.yml
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: redis
  name: redis-new
spec:
  containers:
  - image: redis
    name: redis
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values:
            - hdd
[root@master affinity]# kubectl apply -f affinity2.yml
pod/redis-new created
[root@master affinity]# kubectl get pods -o wide
NAME        READY   STATUS    RESTARTS   AGE     IP            NODE                   NOMINATED NODE   READINESS GATES
redis       1/1     Running   0          8m33s   10.244.1.16   cka-cluster2-worker    <none>           <none>
redis-new   1/1     Running   0          5s      10.244.2.20   cka-cluster2-worker2   <none>           <none>
```
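Each preference carries a weight from 1 to 100. When several preferences are listed, the scheduler sums the weights of the preferences each node satisfies and favors the node with the highest total. A minimal sketch with two illustrative preferences:

```yaml
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 80                  # strong preference for SSD nodes
  preference:
    matchExpressions:
    - key: disktype
      operator: In
      values: ["ssd"]
- weight: 20                  # mild preference for the prod environment
  preference:
    matchExpressions:
    - key: env
      operator: In
      values: ["prod"]
```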
IgnoredDuringExecution
- Both affinity types include IgnoredDuringExecution, meaning changes to node labels after the Pod is scheduled do not affect the Pod.
- For example:
  - If you remove the disktype=ssd label from cka-cluster2-worker, the redis Pod continues running.
  - Similarly, adding disktype=hdd to a node later does not trigger a rescheduling of redis-new.
- This ensures stability: once a Pod is scheduled, it isn’t disrupted by label changes.
Verifying this: after removing the disktype label from cka-cluster2-worker, both Pods keep running; the change only affects scheduling decisions made from this point on.

```
[root@master affinity]# kubectl label node cka-cluster2-worker disktype-
node/cka-cluster2-worker unlabeled
[root@master affinity]# kubectl get pods
NAME        READY   STATUS    RESTARTS   AGE
redis       1/1     Running   0          10m
redis-new   1/1     Running   0          2m28s
```
Combining Taints/Tolerations and Affinity in Kubernetes
Taints/tolerations and affinity are complementary mechanisms in Kubernetes for controlling pod placement. While taints/tolerations repel pods from nodes unless explicitly allowed, affinity attracts pods to nodes based on labels. Using them together enables precise scheduling logic.
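As a sketch of how the two combine (reusing the gpu example from above: assume a node labeled gpu=true and tainted gpu=true:NoSchedule), a Pod that must land on a GPU node and is allowed onto it despite the taint carries both a node affinity rule and a matching toleration:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  containers:
  - name: app
    image: nginx
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu
            operator: In
            values: ["true"]   # attract the Pod to GPU-labeled nodes only
  tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"       # allow it onto nodes tainted gpu=true:NoSchedule
```

Without the toleration the taint would repel the Pod; without the affinity rule the Pod could also land on non-GPU nodes.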