Secure Kubernetes Cluster Upgrades: A Step-by-Step Guide
In this post, I’ll document a recent Kubernetes cluster upgrade process I implemented, focusing on security, automation, and best practices. I’ll walk through the entire process from environment assessment to verification, highlighting challenges and solutions along the way.
Environment Overview
Our setup consisted of a small K3s Kubernetes cluster running on Ubuntu 24.04 with:
- 1 master node (control plane)
- 2 worker nodes
- All nodes running an older kernel version (6.8.0-56-generic)
- Multiple security updates pending
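For reference, the kernel version each node reports can be read straight from the node objects (kubectl get nodes -o wide shows the same information in its KERNEL-VERSION column):
# Show the kernel version reported by each node's kubelet
kubectl get nodes -o custom-columns=NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion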
Upgrade Objectives
- Update all system packages across all nodes
- Apply kernel updates securely
- Minimize downtime by implementing a rolling upgrade
- Establish secure automation for future upgrades
Step 1: Setting Up Secure Access
The first step was to establish secure, password-less authentication using SSH keys instead of plaintext passwords.
# Generate a secure ED25519 SSH key
ssh-keygen -t ed25519 -f ~/.ssh/k8s_worker_access -C "k8s_worker_automation_$(date +%Y%m%d)" -N ""
# Set up the automation directory structure
mkdir -p ~/k8s-automation/inventory ~/k8s-automation/scripts
# Create an inventory of worker nodes
cat << EOF > ~/k8s-automation/inventory/workers.txt
10.10.10.xx
10.10.10.yy
EOF
Step 2: Setting Up the Upgrade Script
I created an upgrade script that would:
- Safely drain each node before updates
- Apply system updates
- Handle required reboots
- Verify the node is ready before proceeding to the next one
#!/bin/bash
SSH_KEY="/root/.ssh/k8s_worker_access"

declare -A NODE_MAP
NODE_MAP["10.10.10.xx"]="uk8s-worker"
NODE_MAP["10.10.10.yy"]="uk8s-worker2"

while IFS= read -r worker_ip; do
    node_name="${NODE_MAP[$worker_ip]}"
    echo "Processing worker: $worker_ip (Kubernetes node: $node_name)"

    # Drain the node
    kubectl drain "$node_name" --ignore-daemonsets --delete-emptydir-data --force

    # Perform upgrade
    ssh -i "$SSH_KEY" -o StrictHostKeyChecking=no "root@${worker_ip}" '
        export DEBIAN_FRONTEND=noninteractive
        apt-get update
        apt-get upgrade -y
        apt-get autoremove -y
        apt-get clean
    '

    # Reboot if necessary
    ssh -i "$SSH_KEY" "root@${worker_ip}" '
        if [ -f /var/run/reboot-required ]; then
            echo "Rebooting node..."
            nohup bash -c "sleep 2; reboot" &
            exit
        fi
    '

    # Wait for node to come back
    echo "Waiting for node to be ready..."
    sleep 30
    kubectl wait --for=condition=ready "node/${node_name}" --timeout=300s

    # Uncordon the node
    kubectl uncordon "$node_name"
    echo "Node $worker_ip ($node_name) has been upgraded and is ready"
done < "/root/k8s-automation/inventory/workers.txt"
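One caveat with the wait logic above: right after the reboot is triggered, the node can still report Ready for a short window, so kubectl wait may return before the reboot has actually taken effect. A slightly more defensive variant (a sketch, not what I ran) first gives the node a bounded amount of time to drop out of Ready:
# Sketch: allow up to ~2 minutes for the node to leave Ready, then wait for it to return
for i in $(seq 1 24); do
    status=$(kubectl get node "$node_name" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
    [ "$status" != "True" ] && break
    sleep 5
done
kubectl wait --for=condition=ready "node/${node_name}" --timeout=300s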
Step 3: Distributing SSH Keys Securely
With the automation script prepared, I distributed the public key to the nodes listed in the inventory:
# Install sshpass for initial key distribution
apt-get install -y sshpass
# Distribute keys to each worker node
for worker in $(cat ~/k8s-automation/inventory/workers.txt); do
    echo "Distributing SSH key to $worker"
    sshpass -p "[REDACTED]" ssh-copy-id -i ~/.ssh/k8s_worker_access.pub \
        -o StrictHostKeyChecking=no "root@${worker}"
done
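A quick sanity check confirms that key-based login works without falling back to a password prompt (BatchMode makes ssh fail instead of prompting):
# Verify non-interactive, key-based access to each worker
for worker in $(cat ~/k8s-automation/inventory/workers.txt); do
    ssh -i ~/.ssh/k8s_worker_access -o BatchMode=yes "root@${worker}" 'hostname && uptime'
done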
Step 4: Upgrading Worker Nodes
With everything prepared, I proceeded with the worker node upgrades, processing one node at a time to keep the cluster stable:
# Execute the upgrade script
~/k8s-automation/scripts/upgrade-workers.sh
During the upgrade process:
- Each worker node was cordoned and drained
- All system packages were updated (approximately 36 packages including security updates)
- New kernel (6.8.0-57-generic) was installed
- Nodes were rebooted where required
- The script waited for the node to rejoin the cluster
- The node was uncordoned before moving to the next node
This approach ensured that:
- The cluster maintained availability throughout the upgrade
- Workloads were properly rescheduled
- No manual intervention was required once the process started
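If you want to follow along while the script runs, a couple of read-only commands in a second terminal are enough (not part of the automation itself):
# Watch nodes move through SchedulingDisabled / NotReady / Ready during the rolling upgrade
kubectl get nodes -w
# Watch workloads being rescheduled off the drained node
kubectl get pods -A -o wide -w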
Step 5: Upgrading the Master Node
The master node required special attention as it hosts the control plane components:
- I cordoned the master node to prevent new workloads but did not drain it
- Checked for pending updates
- Applied the updates
- Monitored the reboot process
# Cordon the master node
kubectl cordon uk8s-master
# Check for updates
ssh -i ~/.ssh/k8s_worker_access root@10.10.10.zz 'apt list --upgradable'
# In this case, the master node had fewer pending updates (only 4 packages)
# compared to the worker nodes (36 packages) as it had been patched more recently
# Apply updates
ssh -i ~/.ssh/k8s_worker_access root@10.10.10.zz '
    export DEBIAN_FRONTEND=noninteractive
    apt-get update && \
    apt-get upgrade -y && \
    apt-get autoremove -y && \
    apt-get clean && \
    if [ -f /var/run/reboot-required ]; then
        echo "Reboot required - proceeding with reboot"
        nohup bash -c "sleep 2; reboot" &
        exit
    fi
'
Challenges Faced & Solutions
Challenge 1: Kubernetes API Unavailability During Master Reboot
When the master node rebooted, the Kubernetes API became temporarily unavailable. This was expected, but it required proper handling.
Solution: I added logic to wait for the API to become available again and implemented a retry mechanism.
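The wait-and-retry logic was essentially a bounded poll of the API server; a minimal sketch of the idea (the timing values here are illustrative, not the exact ones I used):
# Wait up to ~5 minutes for the Kubernetes API to respond again after the master reboot
for i in $(seq 1 60); do
    if kubectl get --raw /readyz >/dev/null 2>&1; then
        echo "Kubernetes API is available again"
        break
    fi
    echo "API not ready yet, retrying in 5s..."
    sleep 5
done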
Challenge 2: Kernel Module Loading After Reboot
After rebooting the master node with the new kernel, there were issues with the iptables-related kernel modules not being loaded correctly.
Solution: I explicitly loaded the required kernel modules and restarted the K3s service:
ssh root@10.10.10.zz '
    modprobe ip_tables &&
    modprobe iptable_nat &&
    modprobe iptable_filter &&
    modprobe xt_mark &&
    systemctl restart k3s
'
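To confirm the modules were actually loaded and K3s came back cleanly, a quick check along these lines works:
# Verify the modules are present and the K3s service is active
ssh root@10.10.10.zz 'lsmod | grep -E "ip_tables|iptable_nat|iptable_filter|xt_mark"; systemctl is-active k3s'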
Challenge 3: Kubectl Configuration
The kubectl configuration needed to be updated to connect to the master node after reboot.
Solution: I retrieved the updated configuration from the master node and adjusted it:
ssh root@10.10.10.zz 'cat /etc/rancher/k3s/k3s.yaml' > ~/.kube/config.new
sed -i "s/127.0.0.1/10.10.10.zz/" ~/.kube/config.new
export KUBECONFIG=~/.kube/config.new
# Make this configuration persistent
cp ~/.kube/config.new ~/.kube/config
Verification
After completing all upgrades, I verified the status of the cluster:
$ kubectl get nodes -o wide
NAME           STATUS   ROLES                  AGE   VERSION        INTERNAL-IP   OS-IMAGE             KERNEL-VERSION
uk8s-master    Ready    control-plane,master   44d   v1.31.6+k3s1   10.10.10.zz   Ubuntu 24.04.2 LTS   6.8.0-57-generic
uk8s-worker    Ready    worker                 44d   v1.31.6+k3s1   10.10.10.xx   Ubuntu 24.04.2 LTS   6.8.0-57-generic
uk8s-worker2   Ready    worker                 44d   v1.31.6+k3s1   10.10.10.yy   Ubuntu 24.04.2 LTS   6.8.0-57-generic
All nodes were successfully upgraded with:
- Latest system packages
- New kernel version (6.8.0-57-generic)
- All nodes in Ready state
- Cluster fully operational
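Beyond node status, a quick pass over the workloads confirms nothing got stuck during the drain and uncordon cycles (one simple way to check):
# List any pods that are not Running or Completed after the upgrade (the column header will still print)
kubectl get pods -A | grep -vE 'Running|Completed'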
Best Practices & Lessons Learned
- Security First: Never use plaintext passwords; always use SSH keys for automation
- Backup Before Upgrading: Always ensure you have proper backups before starting
- Rolling Updates: Upgrade one node at a time to maintain cluster availability
- Proper Draining: Always drain nodes properly before maintenance
- Master Node Care: Take extra precautions with the control plane node
- Recovery Plan: Have a plan for API unavailability during master node upgrades
- Kernel Module Awareness: Be prepared to load specific kernel modules after a kernel upgrade. Consider adding them to /etc/modules so they load automatically on boot (see the example after this list)
- Validation: Always verify the cluster is healthy after each step
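For the kernel-module point above, one way to make the modules persistent is to append them to /etc/modules (read by systemd-modules-load on Ubuntu), for example:
# Load the iptables-related modules automatically on every boot
cat << EOF >> /etc/modules
ip_tables
iptable_nat
iptable_filter
xt_mark
EOF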
Conclusion
This upgrade process demonstrates a secure, automated approach to keeping Kubernetes clusters up-to-date with minimal downtime. By implementing proper SSH key authentication, creating reusable automation scripts, and following a methodical approach, future upgrades can be handled efficiently and securely.
The resulting cluster is now running the latest kernel and security patches, improving both performance and security. This process can be adapted for clusters of any size, making it a valuable addition to your operations playbook.