Lesson 3.1: Scheduling Pods (Pod Affinity, Anti-Affinity, Taints and Tolerations)


Taints and Tolerations

Taints and tolerations work together to control pod scheduling on nodes. Taints are applied to nodes to repel pods unless they have a matching toleration. Tolerations are added to pods to allow (but not guarantee) scheduling on tainted nodes.

Key Concepts

  • Taints:

    • Applied to nodes to restrict which pods can run on them.
    • Syntax: kubectl taint node <node-name> key=value:effect.
    • Effects:
      • NoSchedule: Pods without matching tolerations will not be scheduled on the node.
      • PreferNoSchedule: Kubernetes will try to avoid scheduling pods without tolerations but doesn’t enforce it strictly.
      • NoExecute: Evicts existing pods without matching tolerations and blocks new ones.
  • Tolerations:

    • Added to pods to allow them to tolerate a node’s taint.
    • Defined in the pod’s spec.tolerations field.
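A toleration does not have to match the taint's value exactly: with operator: Exists, the pod tolerates any taint that has the given key, whatever its value. A minimal sketch (the pod name and image below are only for illustration):

apiVersion: v1
kind: Pod
metadata:
  name: sleeper                # hypothetical name, for illustration only
spec:
  containers:
  - image: busybox
    name: sleeper
    command: ["sleep", "3600"]
  tolerations:
  - key: "gpu"                 # tolerates any gpu=<anything>:NoSchedule taint
    operator: "Exists"         # no value field is needed with Exists
    effect: "NoSchedule"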

Example Scenario:

[root@master ~]# kubectl get nodes 
NAME                         STATUS   ROLES           AGE   VERSION
cka-cluster2-control-plane   Ready    control-plane   44h   v1.29.14
cka-cluster2-worker          Ready    <none>          44h   v1.29.14
cka-cluster2-worker2         Ready    <none>          44h   v1.29.14
 
[root@master ~]# kubectl taint node cka-cluster2-worker gpu=true:NoSchedule
node/cka-cluster2-worker tainted
[root@master ~]# kubectl taint node cka-cluster2-worker2 gpu=true:NoSchedule
node/cka-cluster2-worker2 tainted
 
[root@master ~]# kubectl describe node cka-cluster2-worker | grep -i taint 
Taints:             gpu=true:NoSchedule
[root@master ~]# kubectl describe node cka-cluster2-worker2 | grep -i taint 
Taints:             gpu=true:NoSchedule
  • Tainting two worker nodes (cka-cluster2-worker and cka-cluster2-worker2) with gpu=true:NoSchedule:
    • Both nodes now repel pods that don’t tolerate the gpu=true:NoSchedule taint.
[root@master ~]# kubectl run nginx --image=nginx 
pod/nginx created
 
[root@master ~]# kubectl get pods 
NAME    READY   STATUS    RESTARTS   AGE
nginx   0/1     Pending   0          5s
 
[root@master ~]# kubectl describe pod nginx | tail -5
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  117s  default-scheduler  0/3 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 2 node(s) had untolerated taint {gpu: true}. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
  • Creating an nginx pod without a toleration:
    • Outcome:
      • The pod remained Pending: both worker nodes carried the untolerated gpu=true:NoSchedule taint, and the control-plane node has its own control-plane taint.
      • The scheduler’s FailedScheduling event above confirms this.
[root@master tainttoleration]# kubectl run redis --image=redis --dry-run=client -o yaml > redis.yml 
[root@master tainttoleration]# vim redis.yml 
[root@master tainttoleration]# cat redis.yml 
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: redis
  name: redis
spec:
  containers:
  - image: redis
    name: redis
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Always
  tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
status: {}
 
[root@master tainttoleration]# kubectl get pods -o wide
NAME    READY   STATUS    RESTARTS   AGE   IP            NODE                  NOMINATED NODE   READINESS GATES
nginx   0/1     Pending   0          5m    <none>        <none>                <none>           <none>
redis   1/1     Running   0          44s   10.244.1.14   cka-cluster2-worker   <none>           <none>
 
# Now try untainting one of the nodes to see if nginx is scheduled to the untainted node 
[root@master tainttoleration]# kubectl describe node cka-cluster2-worker2 | grep -i taint
Taints:             gpu=true:NoSchedule
  • Created a redis pod with a toleration for gpu=true:NoSchedule:
    • The redis pod was scheduled on cka-cluster2-worker (still tainted) because it tolerated the taint.

Untainting a Node

  • Removed the taint from cka-cluster2-worker2:
    • cka-cluster2-worker2 became untainted and available for scheduling.
    • The nginx pod (still without a toleration) was scheduled on cka-cluster2-worker2 because it no longer had the taint.
[root@master tainttoleration]# kubectl taint node cka-cluster2-worker2 gpu=true:NoSchedule-
node/cka-cluster2-worker2 untainted
[root@master tainttoleration]# kubectl describe node cka-cluster2-worker2 | grep -i taint
Taints:             <none>
 
# Now the nginx pod is scheduled and in the Running state 
[root@master tainttoleration]# kubectl get pods -o wide
NAME    READY   STATUS    RESTARTS   AGE     IP            NODE                   NOMINATED NODE   READINESS GATES
nginx   1/1     Running   0          7m59s   10.244.2.19   cka-cluster2-worker2   <none>           <none>
redis   1/1     Running   0          3m43s   10.244.1.14   cka-cluster2-worker    <none>           <none>

When to Use Taints and Tolerations

  • Dedicated Nodes: Reserve nodes for specific workloads (e.g., GPU nodes for AI workloads).
  • Node Isolation: Prevent non-critical pods from running on critical nodes (e.g., control plane).
  • Graceful Evictions: Use NoExecute to evict pods during maintenance.
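For the maintenance case, tolerationSeconds can be combined with NoExecute so that a tolerating pod keeps running for a bounded grace period before being evicted. A hedged sketch (the maintenance=true key is an example, not something used elsewhere in this lesson):

# Taint a node for maintenance; pods without a matching toleration are evicted immediately
kubectl taint node cka-cluster2-worker maintenance=true:NoExecute

# Toleration snippet for a pod that may stay on the node for up to 5 minutes
tolerations:
- key: "maintenance"
  operator: "Equal"
  value: "true"
  effect: "NoExecute"
  tolerationSeconds: 300       # evicted 300s after the taint is added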

NodeSelector

NodeSelector is a mechanism to constrain Pods to run on specific nodes by matching node labels. It is a simple way to enforce scheduling based on node characteristics (e.g., hardware, environment, or custom labels).

Key Concepts

  • Node Labels:
    • Key-value pairs attached to nodes to describe their attributes (e.g., gpu=true, env=prod).
    • Labels are set using kubectl label node <node-name> <key>=<value>.
  • NodeSelector:
    • A field in the Pod’s spec that specifies which node labels the Pod requires.
    • The Pod will only be scheduled on nodes that have all the specified labels.
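Before writing a nodeSelector it is worth checking which labels the nodes already carry. A few useful commands (output omitted), reusing the gpu label from this lesson:

# Show every label on every node
kubectl get nodes --show-labels

# Show the value of the gpu label as an extra column
kubectl get nodes -L gpu

# List only the nodes carrying a specific label
kubectl get nodes -l gpu=false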

Example Scenario

  • Created a Pod (nginx.yml) with a nodeSelector requiring the label gpu=false:
  • Initial State of Nodes
    • Node Labels:
      • Neither cka-cluster2-worker nor cka-cluster2-worker2 had the gpu=false label initially.
    • Node Taints:
      • cka-cluster2-worker2 had a taint (gpu=true:NoSchedule).
      • cka-cluster2-worker had no taints.
  • The Pod remained Pending because no nodes matched the gpu=false label.
[root@master nodeselector]# cat nginx.yml 
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: nginx
  name: nginx
spec:
  containers:
  - image: nginx
    name: nginx
  nodeSelector:
    gpu: "false"
 
[root@master nodeselector]# kubectl apply -f nginx.yml 
 
[root@master nodeselector]# kubectl describe nodes cka-cluster2-worker | grep -i taint
Taints:             <none>
[root@master nodeselector]# kubectl describe nodes cka-cluster2-worker2 | grep -i taint
Taints:             gpu=true:NoSchedule
 
[root@master nodeselector]# kubectl get pods -o wide 
NAME    READY   STATUS    RESTARTS   AGE   IP       NODE     NOMINATED NODE   READINESS GATES
nginx   0/1     Pending   0          6s    <none>   <none>   <none>           <none>
  • Labeling the nodes: added the gpu=false label to both worker nodes
  • Node Status After Labeling:
    • Both nodes now had the label gpu=false.
    • cka-cluster2-worker2 still had the taint gpu=true:NoSchedule.
[root@master nodeselector]# kubectl label node cka-cluster2-worker gpu=false 
node/cka-cluster2-worker labeled
[root@master nodeselector]# kubectl label node cka-cluster2-worker2 gpu=false 
node/cka-cluster2-worker2 labeled
 
[root@master nodeselector]# kubectl get pods -o wide 
NAME    READY   STATUS    RESTARTS   AGE     IP            NODE                  NOMINATED NODE   READINESS GATES
nginx   1/1     Running   0          4m32s   10.244.1.15   cka-cluster2-worker   <none>           <none>
  • Why cka-cluster2-worker?
    • NodeSelector: Both nodes matched the gpu=false label.
    • Taints:
      • cka-cluster2-worker2 had a NoSchedule taint, which repelled the Pod (no toleration was added).
      • cka-cluster2-worker had no taints, so the Pod was scheduled there.

Affinity

Affinity in Kubernetes provides advanced control over Pod scheduling by defining rules that influence which nodes a Pod can be placed on. Unlike nodeSelector, affinity rules are more expressive and allow for complex scheduling logic. There are two primary types of node affinity:

requiredDuringSchedulingIgnoredDuringExecution

This is a hard requirement that must be met for the Pod to be scheduled. If no node matches the rule, the Pod remains in a Pending state.

  • In this affinity.yml manifest, the Pod redis requires a node with the label disktype=ssd:
  • Initial State:
    • No nodes had the disktype=ssd label.
    • The Pod stayed Pending with an event message:
[root@master affinity]# cat affinity.yml 
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: redis
  name: redis
spec:
  containers:
  - image: redis
    name: redis
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype 
            operator: In
            values:
            - ssd
[root@master affinity]# kubectl get pods 
No resources found in default namespace.
[root@master affinity]# kubectl apply -f affinity.yml 
pod/redis created
[root@master affinity]# kubectl get pods 
NAME    READY   STATUS    RESTARTS   AGE
redis   0/1     Pending   0          3s
 
[root@master affinity]# kubectl describe pod redis 
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  9s    default-scheduler  0/3 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 2 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
  • After Labeling:
    • Labeled cka-cluster2-worker with disktype=ssd
    • The scheduler detected the matching node and scheduled the Pod:
[root@master affinity]# kubectl label node cka-cluster2-worker disktype=ssd 
node/cka-cluster2-worker labeled
 
[root@master affinity]# kubectl get pods 
NAME    READY   STATUS    RESTARTS   AGE
redis   1/1     Running   0          2m8s
 
[root@master affinity]# kubectl describe pod redis | grep Node:
Node:             cka-cluster2-worker/172.18.0.4

preferredDuringSchedulingIgnoredDuringExecution

This is a soft preference, not a strict requirement. The scheduler tries to fulfill the rule but will schedule the Pod on another node if the preferred condition isn’t met.

  • In this affinity2.yml manifest, the Pod redis-new prefers a node with the label disktype=hdd:
  • Outcome:
    • No nodes had the disktype=hdd label.
    • The scheduler ignored the preference and scheduled the Pod on an available node (cka-cluster2-worker2):
  • Key Behavior:
    • The scheduler attempts to place the Pod on a node matching the preference but doesn’t enforce it.
    • The Pod runs even if no nodes match the preference.
[root@master affinity]# cat affinity2.yml 
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: redis
  name: redis-new
spec:
  containers:
  - image: redis
    name: redis
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values:
            - hdd
[root@master affinity]# kubectl apply -f affinity2.yml 
pod/redis-new created
[root@master affinity]# kubectl get pods -o wide 
NAME        READY   STATUS    RESTARTS   AGE     IP            NODE                   NOMINATED NODE   READINESS GATES
redis       1/1     Running   0          8m33s   10.244.1.16   cka-cluster2-worker    <none>           <none>
redis-new   1/1     Running   0          5s      10.244.2.20   cka-cluster2-worker2   <none>           <none>
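The matchExpressions syntax is what makes affinity more expressive than nodeSelector: besides In, the operator field also accepts NotIn, Exists, DoesNotExist, Gt and Lt. As a small sketch reusing the disktype label from this lesson, a required rule that insists on a disktype label but rejects hdd could look like this:

  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: Exists     # the node must have a disktype label...
          - key: disktype
            operator: NotIn      # ...but not with the value hdd
            values:
            - hdd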

IgnoredDuringExecution

  • Both affinity types include IgnoredDuringExecution, meaning changes to node labels after the Pod is scheduled do not affect the Pod.
  • For example:
    • If you remove the disktype=ssd label from cka-cluster2-worker, the redis Pod continues running.
    • Similarly, adding disktype=hdd to a node later does not trigger a rescheduling of redis-new.
  • This ensures stability: once a Pod is scheduled, it isn’t disrupted by label changes.

Checking that the pods keep running after the label is removed; removing the label only affects the scheduling of new Pods from this point on.

[root@master affinity]# kubectl label node cka-cluster2-worker disktype-
node/cka-cluster2-worker unlabeled
[root@master affinity]# kubectl get pods 
NAME        READY   STATUS    RESTARTS   AGE
redis       1/1     Running   0          10m
redis-new   1/1     Running   0          2m28s

Combining Taints/Tolerations and Affinity in Kubernetes

Taints/tolerations and affinity are complementary mechanisms in Kubernetes for controlling pod placement. While taints/tolerations repel pods from nodes unless explicitly allowed, affinity attracts pods to nodes based on labels. Using them together enables precise scheduling logic.
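As a hedged sketch built from this lesson's gpu=true:NoSchedule taint (the gpu=true node label is an assumption here; the lesson labeled the nodes gpu=false), a pod meant for dedicated GPU nodes would both tolerate the taint, so it is allowed onto those nodes, and require the label through node affinity, so it lands nowhere else:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload             # hypothetical name, for illustration only
spec:
  containers:
  - image: nginx
    name: gpu-workload
  tolerations:
  - key: "gpu"                   # allows scheduling onto the tainted GPU nodes
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu             # assumed label gpu=true on the GPU nodes
            operator: In
            values:
            - "true"

The toleration alone would only permit the pod on the GPU nodes; the affinity rule is what keeps it off every other node.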
