Kubectl Describe: A Handy Tool for K8s Troubleshooting
XCJYC
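Besides `kubectl logs` and `kubectl get events`, `kubectl describe` is an essential command for troubleshooting K8s. It is somewhat like `docker inspect`: its output is more detailed than `kubectl get` and easier to read than raw YAML. Everyone who uses it swears by it.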
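1. Basic usage
1.1 View Pod details

    # Describe one Pod in the current namespace
    kubectl describe pod opsnot-postgresql
    # Describe one Pod in a specific namespace
    kubectl describe pod my-pod -n opsnot-postgresql
    # Describe every Pod in the current namespace
    kubectl describe pods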
1.2 View other resource types

    # The same syntax works for any resource type
    kubectl describe deployment opsnot-mariadb
    kubectl describe service opsnot-service
    kubectl describe node node-mariadb
    kubectl describe configmap mariadb-config
    kubectl describe secret mariadb-secret
2. Pod troubleshooting tips
2.1 Check Pod events (the key section)

    # The Events section at the end of the output is the first place to look
    kubectl describe pod ops-elasticsearch -n prod
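Real-world case: a Pod is stuck in Pending, and the Events section of describe shows:

    Warning  FailedScheduling  node didn't have enough memory

Clearly no node has enough memory left for the Pod. Adjusting the Pod's memory requests (or adding node capacity) fixes it.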
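Grepping a fixed number of lines after `Events:` can cut events off or pull in unrelated lines. As a minimal sketch, the helper below (a hypothetical function named `pod_warnings`, not a kubectl feature) reads describe output on stdin and keeps only the `Warning` rows of the Events section. It assumes the usual layout in which each event row starts with its type.

```shell
# pod_warnings: hypothetical helper, not part of kubectl.
# Reads `kubectl describe pod` output on stdin and prints only the
# Warning rows that appear after the "Events:" header.
pod_warnings() {
  awk '
    /^Events:/ { in_events = 1; next }   # start of the Events section
    in_events && $1 == "Warning"         # no action given, so awk prints the row
  '
}

# Usage (needs a cluster):
#   kubectl describe pod ops-elasticsearch -n prod | pod_warnings
```

Because the filter keys on the section header rather than a line count, it keeps working when a Pod has many events.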
2.2 Check container state
Fields worth checking:
- Container state (Running/Waiting/Terminated)
- Restart count
- Exit code
- Reason for the last restart

    kubectl describe pod my-elasticsearch | grep -A 10 "State:"
    kubectl describe pod my-elasticsearch | grep "Restart Count"
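Real-world case: a container keeps restarting, and describe shows:

    Last State:  Terminated
      Reason:    Error
      Exit Code: 137

Exit code 137 means the container was killed with SIGKILL (128 + 9), and the most common cause is the OOM killer. Raising the memory limit usually fixes it.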
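The arithmetic behind that exit code is worth spelling out: by convention, an exit code above 128 means the process was terminated by signal number (code - 128). A small sketch, using a hypothetical helper name `signal_from_exit_code`:

```shell
# signal_from_exit_code: hypothetical helper for illustration.
# Exit codes above 128 conventionally mean "terminated by signal (code - 128)".
signal_from_exit_code() {
  code=$1
  if [ "$code" -gt 128 ]; then
    echo $((code - 128))
  else
    echo 0   # the process exited on its own; no signal involved
  fi
}

signal_from_exit_code 137   # prints 9  (SIGKILL)
signal_from_exit_code 143   # prints 15 (SIGTERM)
```

When the Reason field reads `OOMKilled`, the OOM killer was responsible. With `Reason: Error` and code 137, the SIGKILL may have come from somewhere else, so it is worth confirming memory usage before raising limits.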
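2.3 Check resource requests and limits

    # Configured limits and requests
    kubectl describe pod opsnot-redis | grep -A 5 "Limits:"
    kubectl describe pod opsnot-redis | grep -A 5 "Requests:"
    # QoS class
    kubectl describe pod ops-not-kafka | grep "QoS Class"

Real-world case: when cluster resources are tight, these commands show what each Pod has reserved. Compare that with actual usage from `kubectl top pod` to find Pods whose requests are far higher than they need, then right-size them to free up capacity.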
3. Node troubleshooting
3.1 Check Node health

    # Full node details
    kubectl describe node worker-node-1
    # Only the Conditions table
    kubectl describe node worker-node-1 | grep -A 10 "Conditions:"
Real-world case: Pods cannot be scheduled onto a node, and describe node shows:

    DiskPressure: True
    Message: kubelet has disk pressure

The disk is clearly full, so clean it up. Runaway logs are the usual culprit.
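To spot this kind of problem without reading the whole table, the Conditions section can be filtered for anything other than `Ready` that reports `True`. The sketch below uses a hypothetical helper, `node_pressure`, and assumes the standard layout where the first two columns are Type and Status and the next top-level section starts at column 0.

```shell
# node_pressure: hypothetical helper, not part of kubectl.
# Reads `kubectl describe node` output on stdin and prints every condition
# other than Ready whose Status column is "True".
node_pressure() {
  awk '
    /^Conditions:/ { in_cond = 1; next }
    in_cond && /^[^ ]/ { in_cond = 0 }      # next top-level section ends the table
    in_cond && $2 == "True" && $1 != "Ready" { print $1 }
  '
}

# Usage (needs a cluster):
#   kubectl describe node worker-node-1 | node_pressure
```

Empty output means no pressure condition is active on that node.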
3.2 Check Node resource allocation

    # Requests and limits already allocated on the node
    kubectl describe node worker-node-1 | grep -A 20 "Allocated resources:"
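If only the headline numbers matter, the requested percentages can be pulled out of that table. This is a rough sketch with a hypothetical helper, `allocated_pct`; it assumes each row looks like `cpu  850m (42%)  1 (50%)`, so the third column is the request percentage in parentheses.

```shell
# allocated_pct: hypothetical helper, not part of kubectl.
# Reads `kubectl describe node` output on stdin and prints the requested
# percentage of cpu and memory from the "Allocated resources:" table.
allocated_pct() {
  awk '
    /^Allocated resources:/ { in_sec = 1; next }
    in_sec && /^[^ ]/ { in_sec = 0 }          # next top-level section
    in_sec && ($1 == "cpu" || $1 == "memory") {
      pct = $3                  # third column looks like "(42%)"
      gsub(/[()%]/, "", pct)    # strip the parentheses and percent sign
      print $1, pct
    }
  '
}

# Usage (needs a cluster):
#   kubectl describe node worker-node-1 | allocated_pct
```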
3.3 List the Pods on a Node

    kubectl describe node worker-node-1 | grep -A 50 "Non-terminated Pods:"

Sample output (abbreviated):

    Non-terminated Pods:  (10 in total)
      Namespace    Name                CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
      ---------    ----                ------------  ----------  ---------------  -------------  ---
      default      nginx-opsnot-abc12  100m (2%)     200m (5%)   128Mi (1%)       256Mi (3%)     5d
      default      redis-xyz34         50m (1%)      100m (2%)   64Mi (0%)        128Mi (1%)     3d
      kube-system  kube-proxy-5678     100m (2%)     0 (0%)      64Mi (0%)        0 (0%)         15d
      kube-system  coredns-1234        100m (2%)     200m (5%)   70Mi (0%)        170Mi (2%)     15d
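The per-Pod table can also be totalled. As an illustration only, the hypothetical helper `cpu_requests_millicores` below sums the CPU Requests column, assuming values appear either as millicores (`100m`) or as whole cores (`1`).

```shell
# cpu_requests_millicores: hypothetical helper, not part of kubectl.
# Sums the "CPU Requests" column of the "Non-terminated Pods:" table
# from `kubectl describe node` output and prints the total in millicores.
cpu_requests_millicores() {
  awk '
    /^Non-terminated Pods:/ { in_pods = 1; next }
    in_pods && /^[^ ]/ { in_pods = 0 }                     # next top-level section
    in_pods && $3 ~ /^[0-9]+m$/ { sub(/m$/, "", $3); total += $3; next }
    in_pods && $3 ~ /^[0-9]+$/  { total += $3 * 1000 }     # whole cores
    END { print total + 0 }
  '
}

# Usage (needs a cluster):
#   kubectl describe node worker-node-1 | cpu_requests_millicores
```

For the four sample rows shown above (100m, 50m, 100m, 100m) the total is 350 millicores.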
4. Service troubleshooting
4.1 Check Service configuration

    kubectl describe service ops-not-service
    # Selector and Endpoints are the two fields that matter most
    kubectl describe svc my-service | grep Selector
    kubectl describe svc my-service | grep Endpoints
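Real-world case: a Service is unreachable, and describe shows that Endpoints is empty (`Endpoints: <none>`).
This usually means the Selector is wrong and does not match the Pods' labels. Fix the selector or the labels and the endpoints will appear.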
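A Service selector matches a Pod only when every `key=value` pair in the selector is present among the Pod's labels. The sketch below makes that rule concrete with a hypothetical helper, `selector_matches`. It takes the comma-separated selector as printed by describe and a comma-separated label list as printed by `kubectl get pods --show-labels`, and assumes bash or a POSIX sh.

```shell
# selector_matches: hypothetical helper for illustration.
# $1 = selector, e.g. "app=web,tier=fe"
# $2 = Pod labels, e.g. "app=web,pod-template-hash=abc,tier=fe"
# Succeeds only if every selector pair is present in the labels.
# The body runs in a subshell so the IFS change does not leak out.
selector_matches() (
  labels=",$2,"
  IFS=,
  for pair in $1; do
    case "$labels" in
      *",$pair,"*) ;;          # this pair is present, keep checking
      *) exit 1 ;;             # one missing pair means no match
    esac
  done
  exit 0
)

selector_matches "app=web,tier=fe" "app=web,pod-template-hash=abc,tier=fe" && echo "match"
selector_matches "app=web" "app=api,tier=fe" || echo "no match"
```

The second call fails because the Pod carries `app=api`, which is exactly the kind of mismatch that leaves Endpoints empty.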
4.2 Check Service type and ports

    # ClusterIP, NodePort or LoadBalancer
    kubectl describe svc ops-not-service | grep Type
    # Port, TargetPort and NodePort
    kubectl describe svc ops-not-service | grep Port
5. Deployment/StatefulSet troubleshooting
5.1 Check replica status

    kubectl describe deployment opsnot-rabbitmq
    # Replica counts and rollout conditions
    kubectl describe deploy opsnot-rabbitmq | grep Replicas
    kubectl describe deploy opsnot-rabbitmq | grep -A 5 "Conditions:"
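Real-world case: after a release only some of the Pods were updated, and describe shows:

    Replicas: 3 desired | 2 updated | 3 total | 2 available
    Conditions:
      Progressing: False
      Reason: ProgressDeadlineExceeded

This almost always means the new image is broken and the new Pods cannot start. Roll back right away.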
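The two signals in that output (fewer updated than desired replicas, plus `ProgressDeadlineExceeded`) can be checked mechanically. A minimal sketch, using a hypothetical helper `rollout_summary` and assuming the `Replicas:` line has the `N desired | N updated | ...` shape shown above:

```shell
# rollout_summary: hypothetical helper, not part of kubectl.
# Reads `kubectl describe deployment` output on stdin and reports the
# desired/updated replica counts and whether the rollout deadline was hit.
rollout_summary() {
  awk '
    /^Replicas:/ { desired = $2; updated = $5 }   # "Replicas: 3 desired | 2 updated | ..."
    /ProgressDeadlineExceeded/ { stuck = 1 }
    END {
      printf "desired=%s updated=%s stuck=%s\n", desired, updated, (stuck ? "yes" : "no")
    }
  '
}

# Usage (needs a cluster):
#   kubectl describe deploy opsnot-rabbitmq | rollout_summary
```

When the summary reports `stuck=yes`, `kubectl rollout undo deployment/opsnot-rabbitmq` reverts to the previous revision.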
5.2 Check the rolling update strategy

    # Shows StrategyType and the RollingUpdateStrategy settings
    kubectl describe deploy opsnot-rabbitmq | grep -A 3 "StrategyType:"
6. PVC/PV storage troubleshooting
6.1 Check PVC status

    # Check Status, StorageClass, Capacity and Events
    kubectl describe pvc opsnot-pvc
Real-world case: a PVC is stuck in Pending, and describe shows:

    Events:
      Warning  ProvisioningFailed  no volume plugin matched

This is almost always a misconfigured StorageClass. Fix it and the PVC will bind.
6.2 View PV details

    # Check Status, Claim, Reclaim Policy and Capacity
    kubectl describe pv opsnot-pv-name
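7. ConfigMap/Secret troubleshooting
7.1 Inspect a ConfigMap

    kubectl describe configmap my-config
    # Show only the Data section
    kubectl describe cm my-config | grep -A 20 "Data"

kubectl describe may shorten long output, mainly to:
- keep the terminal from being flooded
- keep the command responsive
- avoid shipping large amounts of data over the network
When you need everything, read the YAML instead (`kubectl get configmap my-config -o yaml`). That one is complete.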
7.2 Inspect a Secret

    # Shows key names and value sizes only, never the values themselves
    kubectl describe secret my-secret
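Since describe lists each Secret key with its size rather than its value, reading a value means fetching the raw field and decoding it from base64. A minimal sketch, assuming `base64 -d` is available (GNU coreutils) and using a hypothetical key named `password`:

```shell
# decode_secret_value: hypothetical helper; decodes one base64 string.
decode_secret_value() {
  printf '%s' "$1" | base64 -d
}

decode_secret_value czNjcjN0   # prints s3cr3t

# Against a real cluster (the key name "password" is an assumption):
#   kubectl get secret my-secret -o jsonpath='{.data.password}' | base64 -d
```

Decoded values land in your terminal history and scrollback, so treat this as a debugging step rather than a routine one.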
8. Ingress troubleshooting
8.1 Check Ingress rules

    kubectl describe ingress my-ingress
    # Only the routing rules
    kubectl describe ing my-ingress | grep -A 20 "Rules:"
8.2 Check the Ingress address

    # Address assigned to the Ingress
    kubectl describe ingress opsnot-ingress | grep Address
    # TLS configuration
    kubectl describe ing opsnot-ingress | grep -A 5 "TLS:"
9. Practical tricks
9.1 Quickly locate problem Pods

    #!/bin/bash
    # Find Pods that are not Running in a namespace and print their events
    NS=${1:-default}
    echo "=== Looking for abnormal Pods in namespace $NS ==="
    if ! kubectl get ns "$NS" &> /dev/null; then
        echo "Error: namespace $NS does not exist!"
        exit 1
    fi
    PODS=$(kubectl get pods -n "$NS" --field-selector=status.phase!=Running -o name 2>/dev/null)
    if [ -z "$PODS" ]; then
        echo "No abnormal Pods found"
        exit 0
    fi
    for pod in $PODS; do
        echo "--- $pod ---"
        kubectl describe "$pod" -n "$NS" | grep -A 15 "Events:"
        echo "================================"
    done
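One caveat with filtering on `status.phase!=Running`: a Pod in CrashLoopBackOff typically still reports the Running phase, while a finished Job Pod (phase Succeeded) shows up as "abnormal". A sketch of an alternative filter, assuming the default column order of `kubectl get pods --no-headers` (NAME, READY, STATUS, ...) and a hypothetical helper name `unhealthy_pods`:

```shell
# unhealthy_pods: hypothetical helper, not part of kubectl.
# Reads `kubectl get pods --no-headers` output on stdin and prints the
# names of Pods that are neither Completed nor fully ready.
unhealthy_pods() {
  awk '
    {
      split($2, ready, "/")                 # READY column, e.g. "1/2"
      if ($3 == "Completed") next           # finished Jobs are fine
      if ($3 != "Running" || ready[1] != ready[2]) print $1
    }
  '
}

# Usage (needs a cluster):
#   kubectl get pods -n prod --no-headers | unhealthy_pods
```

The names it prints can be fed straight into `kubectl describe pod` in place of the `$PODS` list.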
9.2 Check resources in bulk

    # Conditions of every node
    for node in $(kubectl get nodes -o name); do
        echo "=== $node ==="
        kubectl describe "$node" | grep -A 5 "Conditions:"
    done
    # Endpoints of every Service in the prod namespace
    kubectl get svc -n prod -o name | while read -r svc; do
        echo "$svc:"
        kubectl describe "$svc" -n prod | grep Endpoints
    done
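9.3 View recent events

    # Last lines of a Pod's Events section
    kubectl describe pod ops-not-pod | grep -A 50 "Events:" | tail -20
    # All events in the namespace, sorted by creation time
    kubectl get events --sort-by=.metadata.creationTimestamp
    # Describe a Pod, then list only the events that involve it
    kubectl describe pod ops-not-pod && kubectl get events --field-selector involvedObject.name=ops-not-pod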
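9.4 Export full details for later analysis

    # Save one Pod's details to a file
    kubectl describe pod ops-not-pod > pod-describe.txt
    # Save the common resources of a namespace ("all" skips some types, such as ConfigMaps and Secrets)
    kubectl describe all -n production > cluster-info.txt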
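10. Common troubleshooting checklist
10.1 Pod will not start

    # Check the status first, then the details
    kubectl get pod ops-not-mysql
    kubectl describe pod ops-not-mysql

Common statuses and what they mean:
- ImagePullBackOff: the image could not be pulled
- CrashLoopBackOff: the container exits right after starting
- Pending: insufficient resources or a scheduling failure
- Error: misconfiguration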
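The status list above works well as a lookup table. As an illustration, the hypothetical helper `explain_pod_status` below maps each status to a suggested next check. The suggestions are my own shorthand for the meanings listed above, not kubectl output.

```shell
# explain_pod_status: hypothetical helper for illustration.
# Maps a Pod status string to a suggested next troubleshooting step.
explain_pod_status() {
  case "$1" in
    ImagePullBackOff) echo "image pull failed: check the image name, tag and registry credentials" ;;
    CrashLoopBackOff) echo "container exits right after start: check logs and the last exit code" ;;
    Pending)          echo "not scheduled: check Events for resource or scheduling failures" ;;
    Error)            echo "configuration error: check the container spec and mounted config" ;;
    *)                echo "unrecognized status: read the Events section" ;;
  esac
}

explain_pod_status Pending
```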
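10.2 Service unreachable

    # 1. Does the Service have any Endpoints?
    kubectl describe svc opsnot-nginx-service | grep Endpoints
    # 2. Compare the Selector with the Pods' labels
    kubectl describe svc opsnot-nginx-service | grep Selector
    kubectl get pods --show-labels
    # 3. Confirm which Pods a label query actually matches
    kubectl get pods -l app=opsnot-blog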
10.3 Network problems

    # Pod IP
    kubectl describe pod my-pod | grep "IP:"
    # Service IP
    kubectl describe svc my-service | grep "IP:"
    # DNS-related lines
    kubectl describe pod my-pod | grep -A 5 "DNS"
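10.4 Insufficient resources

    # Resource allocation on every node
    kubectl describe nodes | grep -A 10 "Allocated resources:"
    # Limits configured on one Pod
    kubectl describe pod opsnot-pod | grep -A 5 "Limits:"
    # Scheduling failures caused by insufficient resources
    kubectl describe pod opsnot-pod | grep "Insufficient"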
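11. Advanced usage
11.1 Combine with other commands

    # describe, then the last 50 log lines
    kubectl describe pod my-pod && kubectl logs my-pod --tail=50
    # describe, then open a shell in the container
    kubectl describe pod my-pod
    kubectl exec -it my-pod -- sh
    # describe, then check live resource usage
    kubectl describe pod my-pod
    kubectl top pod my-pod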
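11.2 Monitor in real time with watch

    # Refresh a Pod's events every 2 seconds
    watch -n 2 'kubectl describe pod opsnot-pod | grep -A 10 "Events:"'
    # Quote the whole pipeline so that watch reruns grep as well
    watch 'kubectl describe node worker-1 | grep -A 20 "Allocated resources:"'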
11.3 Print the key information in a readable format

    #!/bin/bash
    # Print a short summary of one Pod. Arguments: POD [NAMESPACE]
    POD=$1
    NS=${2:-default}
    echo "=== Basic info ==="
    kubectl describe pod "$POD" -n "$NS" | grep "Status:\|IP:\|Node:"
    echo -e "\n=== Container state ==="
    kubectl describe pod "$POD" -n "$NS" | grep "State:" -A 3
    echo -e "\n=== Restarts ==="
    kubectl describe pod "$POD" -n "$NS" | grep "Restart Count"
    echo -e "\n=== Recent events ==="
    kubectl describe pod "$POD" -n "$NS" | grep "Events:" -A 15 | tail -10
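12. Performance tips

    # Filter for just the fields you need
    kubectl describe pod ops-not-pod | grep -E "Status:|Events:|State:|Restart"
    # Limit the scope to one namespace
    kubectl describe pods -n opsnot-namespace
    # Narrow down with get first, then describe the specific Pod
    kubectl get pods -o wide
    kubectl describe pod ops-not-pod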
This article is from opsnot.com.