1. Node
1.1 Node NotReady
A Kubernetes cluster node being in the NotReady state can result from various issues. Here are some realistic and common reasons:
1. Node Resource Issues
- Insufficient Memory or CPU: If the node is running out of memory or CPU resources, the kubelet may mark the node as NotReady.
- Disk Pressure: The node's disk usage may be too high, causing the kubelet to mark it as NotReady.
- Example: kubectl describe node <node-name> shows DiskPressure under conditions.
- Network Pressure: High network latency or dropped packets may cause readiness issues.
2. kubelet Issues
- kubelet Down: The kubelet service on the node is not running or has crashed.
- Certificate Issues: The kubelet's certificate might have expired, causing it to fail authentication with the kube-apiserver.
- Configuration Errors: Misconfigured kubelet flags (e.g., wrong --cluster-dns, --api-servers) can lead to connectivity issues.
3. Network Issues
- Node Network Unreachable: The node cannot communicate with the control plane or other nodes.
- CNI Plugin Failure: Issues with the Container Network Interface (CNI) plugin (e.g., Calico, Flannel, Weave) may disrupt pod-to-pod or node-to-node communication.
- Firewall Rules: A firewall or security group blocking Kubernetes-related traffic (e.g., ports 6443, 10250) can cause the node to go NotReady.
4. Control Plane Issues
- kube-apiserver Unreachable: The node cannot reach the API server due to network partitioning or DNS resolution issues.
- etcd Problems: If the control plane's etcd database is down or unhealthy, the API server might not respond to node heartbeats.
5. Container Runtime and Component Issues
- Container Runtime Failure: The container runtime (e.g., Docker, containerd, CRI-O) is not running or is misconfigured.
- kube-proxy Failure: The kube-proxy component on the node is not functioning correctly, disrupting node communication.
There are many other possible causes as well, such as hardware failures, missing configuration files, or a kubelet version mismatch.
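A quick way to check these conditions on a specific node (a minimal sketch; <node-name> is a placeholder):
# print each node condition and its status (MemoryPressure, DiskPressure, PIDPressure, Ready, ...)
kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'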
1.1.1 How to Debug a NotReady Node
- Check node conditions and kubelet status first:
kubectl describe node <node-name>
systemctl status kubelet
# if it is down/inactive, restart it
systemctl restart kubelet
- Check logs:
- kubelet logs:
journalctl -u kubelet
- Container runtime logs (e.g., Docker):
journalctl -u docker
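If the node runs containerd instead of Docker (an assumption; adjust the unit name to whatever runtime your nodes actually use), the equivalent check is:
# recent container runtime logs for containerd-based nodes
journalctl -u containerd --since "10 min ago"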
- Verify network connectivity:
- Check that the node can reach the control plane API server (see the example below).
- Check CNI plugin logs; CNI-related errors usually also surface in the kubelet log:
journalctl -u kubelet
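A minimal connectivity check, assuming the API server listens on the default port 6443 (<control-plane-ip> is a placeholder):
# basic reachability
ping <control-plane-ip>
# API server health endpoint; -k because the serving certificate is usually not in the local trust store
curl -k https://<control-plane-ip>:6443/healthz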
- Inspect resource usage:
top
df -h
# or use kubectl top (requires the metrics API, e.g. metrics-server); this prints the node with the highest CPU usage
kubectl top node --sort-by='cpu' | awk 'NR==2 {print $1}'
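To see how much of the node's capacity is already requested by pods (the exact output layout may differ slightly between versions):
# show the "Allocated resources" summary from the node description
kubectl describe node <node-name> | grep -A 8 "Allocated resources"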
1.2 Cordon and Drain Nodes
Note: throughout this article we use systemd commands and an Ubuntu system.
kubectl cordon NODENAME
kubectl drain NODENAME
kubectl uncordon NODENAME
1.2.1 kubectl cordon NODENAME
- Purpose: Marks a node as unschedulable. This prevents new pods from being scheduled on that node.
- Effect on existing pods: Existing pods continue to run on the node.
- Use case: temporarily prevent new workloads from being placed on a node, perhaps for investigation or minor maintenance, without disrupting existing applications.
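For example (worker-1 is a hypothetical node name), a cordoned node keeps running its pods but reports SchedulingDisabled:
kubectl cordon worker-1
# the STATUS column now shows Ready,SchedulingDisabled
kubectl get node worker-1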
1.2.2 kubectl drain NODENAME
- Purpose: Evicts all pods from a node and marks it as unschedulable.
- Effect on existing pods: Gracefully terminates the pods running on the node.
- Use case: perform more significant maintenance on a node, such as kernel updates or hardware replacement.
NOTICE: if a pod specifies nodeName: <node_name> in its spec, it can still end up on the node, because:
- Cordoning a node tells the scheduler: "don't place any new pods on this node."
- nodeName in the pod spec is a direct instruction to Kubernetes: "run this pod specifically on this node." It bypasses the scheduler entirely, so cordoning does not prevent it (see the sketch below).
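A minimal sketch of the nodeName behavior described above (pinned-pod and worker-1 are hypothetical names; the pod lands on the node even while it is cordoned):
# a pod with spec.nodeName set skips the scheduler, so cordon does not block it
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: pinned-pod
spec:
  nodeName: worker-1
  containers:
  - name: app
    image: nginx
EOF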
2. Upgrade
2.1 Update
Following the official documentation:
0. Check available versions
sudo apt update
sudo apt-cache madison kubeadm
1. Update kubeadm
# change the version as needed
sudo apt-mark unhold kubeadm && \
sudo apt-get update && sudo apt-get install -y kubeadm='1.32.x-*' && \
sudo apt-mark hold kubeadm
kubeadm version
1.1 On the control plane node
# pass the target version explicitly, matching the kubeadm version installed above
sudo kubeadm upgrade apply v1.32.x
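If you want to see the available target versions and what the upgrade will change before applying, kubeadm can print a plan first:
# preview upgrade targets and component versions
sudo kubeadm upgrade plan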
1.2 On the other nodes
sudo kubeadm upgrade node
2. Drain the node
kubectl drain <node-to-drain> --ignore-daemonsets
3. Update kubelet and kubectl
If you are on one of the other nodes, first SSH to the control plane node, run the drain from there, then SSH back to the node being upgraded.
# change the version as needed
sudo apt-mark unhold kubelet kubectl && \
sudo apt-get update && sudo apt-get install -y kubelet='1.32.x-*' kubectl='1.32.x-*' && \
sudo apt-mark hold kubelet kubectl
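You can confirm the newly installed versions before restarting (optional sanity check):
kubelet --version
kubectl version --client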
Then restart the kubelet:
sudo systemctl daemon-reload
sudo systemctl restart kubelet
4. Uncordon the node
kubectl uncordon <node-to-uncordon>
Similarly, if you are on one of the other nodes, SSH back to the control plane node and run the uncordon from there.
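Finally, verify that every node is back to Ready and reports the upgraded version:
# the VERSION column should show the new kubelet version on each upgraded node
kubectl get nodes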