Lesson 3.1: Scheduling Pods (Pod Affinity, Anti-Affinity, Taints and Tolerations)


Taints and Tolerations

Taints and tolerations work together to control pod scheduling on nodes. Taints are applied to nodes to repel pods unless they have a matching toleration. Tolerations are added to pods to allow (but not guarantee) scheduling on tainted nodes.

Key Concepts

  • Taints:

    • Applied to nodes to restrict which pods can run on them.
    • Syntax: kubectl taint node <node-name> key=value:effect.
    • Effects:
      • NoSchedule: Pods without matching tolerations will not be scheduled on the node.
      • PreferNoSchedule: Kubernetes will try to avoid scheduling pods without tolerations but doesn’t enforce it strictly.
      • NoExecute: Evicts existing pods without matching tolerations and blocks new ones.
  • Tolerations:

    • Added to pods to allow them to tolerate a node’s taint.
    • Defined in the pod’s spec.tolerations field.
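A toleration does not have to match the taint's value exactly: with operator: Exists, the pod tolerates any taint that has the given key, whatever its value. A minimal sketch (the pod name and image below are only for illustration):

apiVersion: v1
kind: Pod
metadata:
  name: sleeper                # hypothetical name, for illustration only
spec:
  containers:
  - image: busybox
    name: sleeper
    command: ["sleep", "3600"]
  tolerations:
  - key: "gpu"                 # tolerates any gpu=<anything>:NoSchedule taint
    operator: "Exists"         # no value field is needed with Exists
    effect: "NoSchedule"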

Example Scenario:

[root@master ~]# kubectl get nodes 
NAME                         STATUS   ROLES           AGE   VERSION
cka-cluster2-control-plane   Ready    control-plane   44h   v1.29.14
cka-cluster2-worker          Ready    <none>          44h   v1.29.14
cka-cluster2-worker2         Ready    <none>          44h   v1.29.14
 
[root@master ~]# kubectl taint node cka-cluster2-worker gpu=true:NoSchedule
node/cka-cluster2-worker tainted
[root@master ~]# kubectl taint node cka-cluster2-worker2 gpu=true:NoSchedule
node/cka-cluster2-worker2 tainted
 
[root@master ~]# kubectl describe node cka-cluster2-worker | grep -i taint 
Taints:             gpu=true:NoSchedule
[root@master ~]# kubectl describe node cka-cluster2-worker2 | grep -i taint 
Taints:             gpu=true:NoSchedule
  • Tainting two worker nodes (cka-cluster2-worker and cka-cluster2-worker2) with gpu=true:NoSchedule:
    • Both nodes now repel pods that don’t tolerate the gpu=true:NoSchedule taint.
[root@master ~]# kubectl run nginx --image=nginx 
pod/nginx created
 
[root@master ~]# kubectl get pods 
NAME    READY   STATUS    RESTARTS   AGE
nginx   0/1     Pending   0          5s
 
[root@master ~]# kubectl describe pod nginx | tail -5
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  117s  default-scheduler  0/3 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 2 node(s) had untolerated taint {gpu: true}. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
  • Creating an nginx pod without a toleration:
    • Outcome:
      • The pod remained Pending: both worker nodes carried the untolerated gpu=true:NoSchedule taint, and the control-plane node has its own control-plane taint.
      • The scheduler’s FailedScheduling event above confirms this.
[root@master tainttoleration]# kubectl run redis --image=redis --dry-run=client -o yaml > redis.yml 
[root@master tainttoleration]# vim redis.yml 
[root@master tainttoleration]# cat redis.yml 
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: redis
  name: redis
spec:
  containers:
  - image: redis
    name: redis
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Always
  tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
status: {}
 
[root@master tainttoleration]# kubectl get pods -o wide
NAME    READY   STATUS    RESTARTS   AGE   IP            NODE                  NOMINATED NODE   READINESS GATES
nginx   0/1     Pending   0          5m    <none>        <none>                <none>           <none>
redis   1/1     Running   0          44s   10.244.1.14   cka-cluster2-worker   <none>           <none>
 
# Now try untainting one of the nodes to see if nginx is scheduled to the untainted node 
[root@master tainttoleration]# kubectl describe node cka-cluster2-worker2 | grep -i taint
Taints:             gpu=true:NoSchedule
  • Created a redis pod with a toleration for gpu=true:NoSchedule:
    • The redis pod was scheduled on cka-cluster2-worker (still tainted) because it tolerated the taint.

Untainting a Node

  • Removed the taint from cka-cluster2-worker2:
    • cka-cluster2-worker2 became untainted and available for scheduling.
    • The nginx pod (still without a toleration) was scheduled on cka-cluster2-worker2 because it no longer had the taint.
[root@master tainttoleration]# kubectl taint node cka-cluster2-worker2 gpu=true:NoSchedule-
node/cka-cluster2-worker2 untainted
[root@master tainttoleration]# kubectl describe node cka-cluster2-worker2 | grep -i taint
Taints:             <none>
 
# Now the nginx pod is scheduled and in the Running state 
[root@master tainttoleration]# kubectl get pods -o wide
NAME    READY   STATUS    RESTARTS   AGE     IP            NODE                   NOMINATED NODE   READINESS GATES
nginx   1/1     Running   0          7m59s   10.244.2.19   cka-cluster2-worker2   <none>           <none>
redis   1/1     Running   0          3m43s   10.244.1.14   cka-cluster2-worker    <none>           <none>

When to Use Taints and Tolerations

  • Dedicated Nodes: Reserve nodes for specific workloads (e.g., GPU nodes for AI workloads).
  • Node Isolation: Prevent non-critical pods from running on critical nodes (e.g., control plane).
  • Graceful Evictions: Use NoExecute to evict pods during maintenance.
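For the maintenance case, tolerationSeconds can be combined with NoExecute so that a tolerating pod keeps running for a bounded grace period before being evicted. A hedged sketch (the maintenance=true key is an example, not something used elsewhere in this lesson):

# Taint a node for maintenance; pods without a matching toleration are evicted immediately
kubectl taint node cka-cluster2-worker maintenance=true:NoExecute

# Toleration snippet for a pod that may stay on the node for up to 5 minutes
tolerations:
- key: "maintenance"
  operator: "Equal"
  value: "true"
  effect: "NoExecute"
  tolerationSeconds: 300       # evicted 300s after the taint is added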

NodeSelector

NodeSelector is a mechanism to constrain Pods to run on specific nodes by matching node labels. It is a simple way to enforce scheduling based on node characteristics (e.g., hardware, environment, or custom labels).

Key Concepts

  • Node Labels:
    • Key-value pairs attached to nodes to describe their attributes (e.g., gpu=true, env=prod).
    • Labels are set using kubectl label node <node-name> <key>=<value>.
  • NodeSelector:
    • A field in the Pod’s spec that specifies which node labels the Pod requires.
    • The Pod will only be scheduled on nodes that have all the specified labels.
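Before writing a nodeSelector it is worth checking which labels the nodes already carry. A few useful commands (output omitted), reusing the gpu label from this lesson:

# Show every label on every node
kubectl get nodes --show-labels

# Show the value of the gpu label as an extra column
kubectl get nodes -L gpu

# List only the nodes carrying a specific label
kubectl get nodes -l gpu=false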

Example Scenario

  • Created a Pod (nginx.yml) with a nodeSelector requiring the label gpu=false:
  • Initial State of Nodes
    • Node Labels:
      • Neither cka-cluster2-worker nor cka-cluster2-worker2 had the gpu=false label initially.
    • Node Taints:
      • cka-cluster2-worker2 had a taint (gpu=true:NoSchedule).
      • cka-cluster2-worker had no taints.
  • The Pod remained Pending because no nodes matched the gpu=false label.
[root@master nodeselector]# cat nginx.yml 
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: nginx
  name: nginx
spec:
  containers:
  - image: nginx
    name: nginx
  nodeSelector:
    gpu: "false"
 
[root@master nodeselector]# kubectl apply -f nginx.yml 
 
[root@master nodeselector]# kubectl describe nodes cka-cluster2-worker | grep -i taint
Taints:             <none>
[root@master nodeselector]# kubectl describe nodes cka-cluster2-worker2 | grep -i taint
Taints:             gpu=true:NoSchedule
 
[root@master nodeselector]# kubectl get pods -o wide 
NAME    READY   STATUS    RESTARTS   AGE   IP       NODE     NOMINATED NODE   READINESS GATES
nginx   0/1     Pending   0          6s    <none>   <none>   <none>           <none>
  • Labeling the nodes: added the gpu=false label to both worker nodes
  • Node Status After Labeling:
    • Both nodes now had the label gpu=false.
    • cka-cluster2-worker2 still had the taint gpu=true:NoSchedule.
[root@master nodeselector]# kubectl label node cka-cluster2-worker gpu=false 
node/cka-cluster2-worker labeled
[root@master nodeselector]# kubectl label node cka-cluster2-worker2 gpu=false 
node/cka-cluster2-worker2 labeled
 
[root@master nodeselector]# kubectl get pods -o wide 
NAME    READY   STATUS    RESTARTS   AGE     IP            NODE                  NOMINATED NODE   READINESS GATES
nginx   1/1     Running   0          4m32s   10.244.1.15   cka-cluster2-worker   <none>           <none>
  • Why cka-cluster2-worker?
    • NodeSelector: Both nodes matched the gpu=false label.
    • Taints:
      • cka-cluster2-worker2 had a NoSchedule taint, which repelled the Pod (no toleration was added).
      • cka-cluster2-worker had no taints, so the Pod was scheduled there.

Affinity

Affinity in Kubernetes provides advanced control over Pod scheduling by defining rules that influence which nodes a Pod can be placed on. Unlike nodeSelector, affinity rules are more expressive and allow for complex scheduling logic. There are two primary types of node affinity:

requiredDuringSchedulingIgnoredDuringExecution

This is a hard requirement that must be met for the Pod to be scheduled. If no node matches the rule, the Pod remains in a Pending state.

  • In this affinity.yml manifest, the Pod redis requires a node with the label disktype=ssd:
  • Initial State:
    • No nodes had the disktype=ssd label.
    • The Pod stayed Pending with an event message:
[root@master affinity]# cat affinity.yml 
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: redis
  name: redis
spec:
  containers:
  - image: redis
    name: redis
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype 
            operator: In
            values:
            - ssd
[root@master affinity]# kubectl get pods 
No resources found in default namespace.
[root@master affinity]# kubectl apply -f affinity.yml 
pod/redis created
[root@master affinity]# kubectl get pods 
NAME    READY   STATUS    RESTARTS   AGE
redis   0/1     Pending   0          3s
 
[root@master affinity]# kubectl describe pod redis 
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  9s    default-scheduler  0/3 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 2 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
  • After Labeling:
    • Labeled cka-cluster2-worker with disktype=ssd
    • The scheduler detected the matching node and scheduled the Pod:
[root@master affinity]# kubectl label node cka-cluster2-worker disktype=ssd 
node/cka-cluster2-worker labeled
 
[root@master affinity]# kubectl get pods 
NAME    READY   STATUS    RESTARTS   AGE
redis   1/1     Running   0          2m8s
 
[root@master affinity]# kubectl describe pod redis | grep Node:
Node:             cka-cluster2-worker/172.18.0.4

preferredDuringSchedulingIgnoredDuringExecution

This is a soft preference, not a strict requirement. The scheduler tries to fulfill the rule but will schedule the Pod on another node if the preferred condition isn’t met.

  • In this affinity2.yml manifest, the Pod redis-new prefers a node with the label disktype=hdd:
  • Outcome:
    • No nodes had the disktype=hdd label.
    • The scheduler ignored the preference and scheduled the Pod on an available node (cka-cluster2-worker2):
  • Key Behavior:
    • The scheduler attempts to place the Pod on a node matching the preference but doesn’t enforce it.
    • The Pod runs even if no nodes match the preference.
[root@master affinity]# cat affinity2.yml 
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: redis
  name: redis-new
spec:
  containers:
  - image: redis
    name: redis
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values:
            - hdd
[root@master affinity]# kubectl apply -f affinity2.yml 
pod/redis-new created
[root@master affinity]# kubectl get pods -o wide 
NAME        READY   STATUS    RESTARTS   AGE     IP            NODE                   NOMINATED NODE   READINESS GATES
redis       1/1     Running   0          8m33s   10.244.1.16   cka-cluster2-worker    <none>           <none>
redis-new   1/1     Running   0          5s      10.244.2.20   cka-cluster2-worker2   <none>           <none>
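The matchExpressions syntax is what makes affinity more expressive than nodeSelector: besides In, the operator field also accepts NotIn, Exists, DoesNotExist, Gt and Lt. As a small sketch reusing the disktype label from this lesson, a required rule that insists on a disktype label but rejects hdd could look like this:

  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: Exists     # the node must have a disktype label...
          - key: disktype
            operator: NotIn      # ...but not with the value hdd
            values:
            - hdd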

IgnoredDuringExecution

  • Both affinity types include IgnoredDuringExecution, meaning changes to node labels after the Pod is scheduled do not affect the Pod.
  • For example:
    • If you remove the disktype=ssd label from cka-cluster2-worker, the redis Pod continues running.
    • Similarly, adding disktype=hdd to a node later does not trigger a rescheduling of redis-new.
  • This ensures stability: once a Pod is scheduled, it isn’t disrupted by label changes.

Checking that the pods keep running after the label is removed; removing the label only affects the scheduling of new Pods from this point on.

[root@master affinity]# kubectl label node cka-cluster2-worker disktype-
node/cka-cluster2-worker unlabeled
[root@master affinity]# kubectl get pods 
NAME        READY   STATUS    RESTARTS   AGE
redis       1/1     Running   0          10m
redis-new   1/1     Running   0          2m28s

Combining Taints/Tolerations and Affinity in Kubernetes

Taints/tolerations and affinity are complementary mechanisms in Kubernetes for controlling pod placement. While taints/tolerations repel pods from nodes unless explicitly allowed, affinity attracts pods to nodes based on labels. Using them together enables precise scheduling logic.
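As a hedged sketch built from this lesson's gpu=true:NoSchedule taint (the gpu=true node label is an assumption here; the lesson labeled the nodes gpu=false), a pod meant for dedicated GPU nodes would both tolerate the taint, so it is allowed onto those nodes, and require the label through node affinity, so it lands nowhere else:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload             # hypothetical name, for illustration only
spec:
  containers:
  - image: nginx
    name: gpu-workload
  tolerations:
  - key: "gpu"                   # allows scheduling onto the tainted GPU nodes
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu             # assumed label gpu=true on the GPU nodes
            operator: In
            values:
            - "true"

The toleration alone would only permit the pod on the GPU nodes; the affinity rule is what keeps it off every other node.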
