Lesson 3.1: Scheduling Pods (Pod Affinity, Anti-Affinity, Taints and Tolerations)


Taints and Tolerations

Taints and tolerations work together to control pod scheduling on nodes. Taints are applied to nodes to repel pods unless they have a matching toleration. Tolerations are added to pods to allow (but not guarantee) scheduling on tainted nodes.

Key Concepts

  • Taints:

    • Applied to nodes to restrict which pods can run on them.
    • Syntax: kubectl taint node <node-name> key=value:effect.
    • Effects:
      • NoSchedule: Pods without matching tolerations will not be scheduled on the node.
      • PreferNoSchedule: Kubernetes will try to avoid scheduling pods without tolerations but doesn’t enforce it strictly.
      • NoExecute: Evicts existing pods without matching tolerations and blocks new ones.
  • Tolerations:

    • Added to pods to allow them to tolerate a node’s taint.
    • Defined in the pod’s spec.tolerations field.
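    • A minimal sketch: a toleration can match a taint exactly with operator: Equal, or by key alone with operator: Exists:

tolerations:
- key: "gpu"              # tolerate the taint key "gpu"...
  operator: "Equal"       # ...when its value is exactly "true"
  value: "true"
  effect: "NoSchedule"
# Alternative form: tolerate any taint with the key "gpu", regardless of its value
# - key: "gpu"
#   operator: "Exists"
#   effect: "NoSchedule"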

Example Scenario:

  • Tainting two worker nodes (cka-cluster2-worker and cka-cluster2-worker2) with gpu=true:NoSchedule:
    • Both nodes now repel pods that don’t tolerate the gpu=true:NoSchedule taint.
[root@master ~]# kubectl get nodes
NAME                         STATUS   ROLES           AGE   VERSION
cka-cluster2-control-plane   Ready    control-plane   44h   v1.29.14
cka-cluster2-worker          Ready    <none>          44h   v1.29.14
cka-cluster2-worker2         Ready    <none>          44h   v1.29.14
[root@master ~]# kubectl taint node cka-cluster2-worker gpu=true:NoSchedule
node/cka-cluster2-worker tainted
[root@master ~]# kubectl taint node cka-cluster2-worker2 gpu=true:NoSchedule
node/cka-cluster2-worker2 tainted
[root@master ~]# kubectl describe node cka-cluster2-worker | grep -i taint
Taints:             gpu=true:NoSchedule
[root@master ~]# kubectl describe node cka-cluster2-worker2 | grep -i taint
Taints:             gpu=true:NoSchedule
  • Creating an nginx pod without a toleration:
    • Outcome:
      • The pod remained Pending because no nodes were available (all nodes had a NoSchedule taint).
      • The scheduler’s event log confirmed the failure:
[root@master ~]# kubectl run nginx --image=nginx
pod/nginx created
[root@master ~]# kubectl get pods
NAME    READY   STATUS    RESTARTS   AGE
nginx   0/1     Pending   0          5s
[root@master ~]# kubectl describe pod nginx | tail -5
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  117s  default-scheduler  0/3 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 2 node(s) had untolerated taint {gpu: true}. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
  • Created a redis pod with a toleration for gpu=true:NoSchedule:
    • The redis pod was scheduled on cka-cluster2-worker (still tainted) because it tolerated the taint.
[root@master tainttoleration]# kubectl run redis --image=redis --dry-run=client -o yaml > redis.yml
[root@master tainttoleration]# vim redis.yml
[root@master tainttoleration]# cat redis.yml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: redis
  name: redis
spec:
  containers:
  - image: redis
    name: redis
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Always
  tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
status: {}
[root@master tainttoleration]# kubectl get pods -o wide
NAME    READY   STATUS    RESTARTS   AGE   IP            NODE                  NOMINATED NODE   READINESS GATES
nginx   0/1     Pending   0          5m    <none>        <none>                <none>           <none>
redis   1/1     Running   0          44s   10.244.1.14   cka-cluster2-worker   <none>           <none>
# Now try untainting one of the nodes to see if nginx is scheduled to the untainted node
[root@master tainttoleration]# kubectl describe node cka-cluster2-worker2 | grep -i taint
Taints:             gpu=true:NoSchedule

Untainting a Node

  • Removed the taint from cka-cluster2-worker2:
    • cka-cluster2-worker2 became untainted and available for scheduling.
    • The nginx pod (still without a toleration) was scheduled on cka-cluster2-worker2 because it no longer had the taint.
[root@master tainttoleration]# kubectl taint node cka-cluster2-worker2 gpu=true:NoSchedule-
node/cka-cluster2-worker2 untainted
[root@master tainttoleration]# kubectl describe node cka-cluster2-worker2 | grep -i taint
Taints:             <none>
# Now the nginx pod is scheduled and in the Running state
[root@master tainttoleration]# kubectl get pods -o wide
NAME    READY   STATUS    RESTARTS   AGE     IP            NODE                   NOMINATED NODE   READINESS GATES
nginx   1/1     Running   0          7m59s   10.244.2.19   cka-cluster2-worker2   <none>           <none>
redis   1/1     Running   0          3m43s   10.244.1.14   cka-cluster2-worker    <none>           <none>

When to Use Taints and Tolerations

  • Dedicated Nodes: Reserve nodes for specific workloads (e.g., GPU nodes for AI workloads).
  • Node Isolation: Prevent non-critical pods from running on critical nodes (e.g., control plane).
  • Graceful Evictions: Use NoExecute to evict pods during maintenance.
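
For the eviction case, a Pod can tolerate a NoExecute taint for only a limited time via tolerationSeconds. A minimal sketch (the taint key "maintenance" and the Pod name are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: maintenance-tolerant        # illustrative name
spec:
  containers:
  - image: nginx
    name: app
  tolerations:
  - key: "maintenance"              # illustrative taint key
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 300          # Pod may stay on the tainted node for up to 5 minutes before eviction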

NodeSelector

NodeSelector is a mechanism to constrain Pods to run on specific nodes by matching node labels. It is a simple way to enforce scheduling based on node characteristics (e.g., hardware, environment, or custom labels).

Key Concepts

  • Node Labels:
    • Key-value pairs attached to nodes to describe their attributes (e.g., gpu=true, env=prod).
    • Labels are set using kubectl label node <node-name> <key>=<value>.
  • NodeSelector:
    • A field in the Pod’s spec that specifies which node labels the Pod requires.
    • The Pod will only be scheduled on nodes that have all the specified labels.
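
For example, labels can be added, inspected, and removed like this (the env=prod label is illustrative):

kubectl label node cka-cluster2-worker env=prod
kubectl get nodes --show-labels | grep cka-cluster2-worker
kubectl label node cka-cluster2-worker env-        # the trailing "-" removes the label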

Example Scenario

  • Created a Pod (nginx.yml) with a nodeSelector requiring the label gpu=false:
  • Initial State of Nodes
    • Node Labels:
      • Neither cka-cluster2-worker nor cka-cluster2-worker2 had the gpu=false label initially.
    • Node Taints:
      • cka-cluster2-worker2 had a taint (gpu=true:NoSchedule).
      • cka-cluster2-worker had no taints.
  • The Pod remained Pending because no nodes matched the gpu=false label.
[root@master nodeselector]# cat nginx.yml
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: nginx
  name: nginx
spec:
  containers:
  - image: nginx
    name: nginx
  nodeSelector:
    gpu: "false"
[root@master nodeselector]# kubectl apply -f nginx.yml
[root@master nodeselector]# kubectl describe nodes cka-cluster2-worker | grep -i taint
Taints:             <none>
[root@master nodeselector]# kubectl describe nodes cka-cluster2-worker2 | grep -i taint
Taints:             gpu=true:NoSchedule
[root@master nodeselector]# kubectl get pods -o wide
NAME    READY   STATUS    RESTARTS   AGE   IP       NODE     NOMINATED NODE   READINESS GATES
nginx   0/1     Pending   0          6s    <none>   <none>   <none>           <none>
  • Labeling the nodes: Added the gpu=false label to both worker nodes.
  • Node Status After Labeling:
    • Both nodes now had the label gpu=false.
    • cka-cluster2-worker2 still had the taint gpu=true:NoSchedule.
[root@master nodeselector]# kubectl label node cka-cluster2-worker gpu=false
node/cka-cluster2-worker labeled
[root@master nodeselector]# kubectl label node cka-cluster2-worker2 gpu=false
node/cka-cluster2-worker2 labeled
[root@master nodeselector]# kubectl get pods -o wide
NAME    READY   STATUS    RESTARTS   AGE     IP            NODE                  NOMINATED NODE   READINESS GATES
nginx   1/1     Running   0          4m32s   10.244.1.15   cka-cluster2-worker   <none>           <none>
  • Why cka-cluster2-worker?
    • NodeSelector: Both nodes matched the gpu=false label.
    • Taints:
      • cka-cluster2-worker2 had a NoSchedule taint, which repelled the Pod (no toleration was added).
      • cka-cluster2-worker had no taints, so the Pod was scheduled there.

Affinity

Affinity in Kubernetes provides advanced control over Pod scheduling by defining rules that influence which nodes a Pod can be placed on. Unlike nodeSelector, affinity rules are more expressive and allow for complex scheduling logic. There are two primary types of node affinity:

requiredDuringSchedulingIgnoredDuringExecution

This is a hard requirement that must be met for the Pod to be scheduled. If no node matches the rule, the Pod remains in a Pending state.

  • In this affinity.yml manifest, the Pod redis requires a node with the label disktype=ssd:
  • Initial State:
    • No nodes had the disktype=ssd label.
    • The Pod stayed Pending with an event message:
[root@master affinity]# cat affinity.yml
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: redis
  name: redis
spec:
  containers:
  - image: redis
    name: redis
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
[root@master affinity]# kubectl get pods
No resources found in default namespace.
[root@master affinity]# kubectl apply -f affinity.yml
pod/redis created
[root@master affinity]# kubectl get pods
NAME    READY   STATUS    RESTARTS   AGE
redis   0/1     Pending   0          3s
[root@master affinity]# kubectl describe pod redis
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  9s    default-scheduler  0/3 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 2 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
  • After Labeling:
    • Labeled cka-cluster2-worker with disktype=ssd
    • The scheduler detected the matching node and scheduled the Pod:
[root@master affinity]# kubectl label node cka-cluster2-worker disktype=ssd
node/cka-cluster2-worker labeled
[root@master affinity]# kubectl get pods
NAME    READY   STATUS    RESTARTS   AGE
redis   1/1     Running   0          2m8s
[root@master affinity]# kubectl describe pod redis | grep Node:
Node:         cka-cluster2-worker/172.18.0.4
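
Beyond the single In expression used above, matchExpressions also support operators such as NotIn, Exists, DoesNotExist, Gt, and Lt, which is what makes node affinity more expressive than nodeSelector. A hedged sketch combining two expressions (the zone label value is illustrative):

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: disktype
          operator: In              # node label value must be in the list
          values: ["ssd"]
        - key: topology.kubernetes.io/zone
          operator: NotIn           # exclude nodes in this zone
          values: ["zone-a"]        # illustrative zone name

Within one matchExpressions block all expressions must match (AND); listing multiple nodeSelectorTerms entries gives OR semantics.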

preferredDuringSchedulingIgnoredDuringExecution

This is a soft preference, not a strict requirement. The scheduler tries to fulfill the rule but will schedule the Pod on another node if the preferred condition isn’t met.

  • In this affinity2.yml manifest, the Pod redis-new prefers a node with the label disktype=hdd:
  • Outcome:
    • No nodes had the disktype=hdd label.
    • The scheduler ignored the preference and scheduled the Pod on an available node (cka-cluster2-worker2):
  • Key Behavior:
    • The scheduler attempts to place the Pod on a node matching the preference but doesn’t enforce it.
    • The Pod runs even if no nodes match the preference.
[root@master affinity]# cat affinity2.yml
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: redis
  name: redis-new
spec:
  containers:
  - image: redis
    name: redis
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values:
            - hdd
[root@master affinity]# kubectl apply -f affinity2.yml
pod/redis-new created
[root@master affinity]# kubectl get pods -o wide
NAME        READY   STATUS    RESTARTS   AGE     IP            NODE                   NOMINATED NODE   READINESS GATES
redis       1/1     Running   0          8m33s   10.244.1.16   cka-cluster2-worker    <none>           <none>
redis-new   1/1     Running   0          5s      10.244.2.20   cka-cluster2-worker2   <none>           <none>
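
When several preferences are listed, the scheduler adds up the weight (1–100) of every preference a node satisfies and favors the node with the highest total. A sketch with two weighted preferences (label values are illustrative):

affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 80                    # strongly prefer SSD nodes
      preference:
        matchExpressions:
        - key: disktype
          operator: In
          values: ["ssd"]
    - weight: 20                    # mildly prefer a zone (illustrative value)
      preference:
        matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["zone-b"]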

IgnoredDuringExecution

  • Both affinity types include IgnoredDuringExecution, meaning changes to node labels after the Pod is scheduled do not affect the Pod.
  • For example:
    • If you remove the disktype=ssd label from cka-cluster2-worker, the redis Pod continues running.
    • Similarly, adding disktype=hdd to a node later does not trigger a rescheduling of redis-new.
  • This ensures stability: once a Pod is scheduled, it isn’t disrupted by label changes.

Checking that the Pods keep running even after the label is removed; the change only affects how future Pods are scheduled.

[root@master affinity]# kubectl label node cka-cluster2-worker disktype-
node/cka-cluster2-worker unlabeled
[root@master affinity]# kubectl get pods
NAME        READY   STATUS    RESTARTS   AGE
redis       1/1     Running   0          10m
redis-new   1/1     Running   0          2m28s

Combining Taints/Tolerations and Affinity in Kubernetes

Taints/tolerations and affinity are complementary mechanisms in Kubernetes for controlling pod placement. While taints/tolerations repel pods from nodes unless explicitly allowed, affinity attracts pods to nodes based on labels. Using them together enables precise scheduling logic.
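
A sketch of the combined pattern for a dedicated GPU pool, reusing the gpu=true:NoSchedule taint from earlier and assuming the GPU nodes also carry the label gpu=true (the Pod name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload                # illustrative name
spec:
  containers:
  - image: nginx                    # placeholder image for the sketch
    name: app
  tolerations:                      # lets the Pod onto the tainted GPU nodes
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  affinity:                         # keeps the Pod off every non-GPU node
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu
            operator: In
            values: ["true"]

The toleration alone only permits scheduling on the tainted nodes; the affinity rule is what keeps the Pod off all other nodes, so both are needed to truly dedicate the pool.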
