Kubectl Describe: A Powerful Tool for K8s Troubleshooting

XCJYC | 2025-11-06 | 2025-11-18

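Besides `kubectl logs` and events, `kubectl describe` is an essential command for troubleshooting K8s. It is a bit like `docker inspect`: its output is more detailed than `kubectl get` and easier to read than YAML, and everyone who has used it speaks well of it.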

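1. Basic usage

1.1 Viewing Pod details

# Show full Pod information (the form opsnot uses most)
kubectl describe pod opsnot-postgresql

# Specify a namespace
kubectl describe pod my-pod -n opsnot-postgresql

# Describe all Pods (opsnot.com reminder: use with care, the output is very long!)
kubectl describe pods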

1.2 Viewing other resources

# Deployment (used often at opsnot.com)
kubectl describe deployment opsnot-mariadb

# Service
kubectl describe service opsnot-service

# Node
kubectl describe node node-mariadb

# ConfigMap
kubectl describe configmap mariadb-config

# Secret (sensitive values are hidden)
kubectl describe secret mariadb-secret

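2. Pod troubleshooting tips

2.1 Viewing Pod events (the key part)

# Describe the Pod and focus on the Events section at the very bottom
kubectl describe pod ops-elasticsearch -n prod

# opsnot.com experience: Events are the key to troubleshooting!
# The cause of ImagePullBackOff or CrashLoopBackOff can usually be found there

Hands-on case: a Pod stays Pending, and describe shows this in Events:

Warning FailedScheduling node didn't have enough memory

The nodes clearly lack enough memory for this Pod; lowering its memory requests (or adding node capacity) resolves it.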
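
Because Events sit at the bottom and vary in length, a fixed `grep -A N` can cut them short. Here is a minimal sketch that prints the whole section, assuming Events is the last section of the output (as it is for Pods); `describe_events` is a made-up helper name, not a kubectl feature:

```bash
# Print everything from the "Events:" header to the end of the input.
# Reads `kubectl describe` output on stdin.
describe_events() {
    awk '/^Events:/ { found = 1 } found { print }'
}

# Usage (needs cluster access):
#   kubectl describe pod ops-elasticsearch -n prod | describe_events
```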

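2.2 Viewing container state

# Pod describe output includes:
#  - container state (Running/Waiting/Terminated)
#  - restart count
#  - exit code
#  - reason for the last restart

kubectl describe pod my-elasticsearch | grep -A 10 "State:"
kubectl describe pod my-elasticsearch | grep "Restart Count"

Hands-on case: a container keeps restarting, and describe shows:

Last State: Terminated
  Reason: Error
  Exit Code: 137

Exit code 137 is 128 + 9, meaning the container was killed by SIGKILL. The usual cause is the OOM killer (Reason often reads OOMKilled in that case), so raising the memory limit typically fixes it.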
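
The arithmetic behind exit codes can be sketched as a small helper. This is an illustration only: `explain_exit_code` is a hypothetical name, and the signal numbers are the standard Linux ones.

```bash
# Explain a container exit code. Codes above 128 mean "killed by signal (code - 128)".
explain_exit_code() {
    local code=$1 sig name
    if [ "$code" -le 128 ]; then
        echo "exit $code: process exited on its own"
        return
    fi
    sig=$((code - 128))
    case $sig in
        9)  name=SIGKILL ;;   # OOM killer, or a forced kill after the grace period
        15) name=SIGTERM ;;   # normal termination request
        11) name=SIGSEGV ;;
        6)  name=SIGABRT ;;
        *)  name="signal $sig" ;;
    esac
    echo "exit $code: killed by $name"
}

explain_exit_code 137
explain_exit_code 143
```

For 137 this prints "exit 137: killed by SIGKILL". Whether the kill came from the OOM killer still has to be confirmed from Reason or the node's kernel log.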

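2.3 Viewing resource configuration

# describe shows requests and limits
kubectl describe pod opsnot-redis | grep -A 5 "Limits:"
kubectl describe pod opsnot-redis | grep -A 5 "Requests:"

# Show the QoS class derived from requests and limits (Guaranteed/Burstable/BestEffort)
kubectl describe pod ops-not-kafka | grep "QoS Class"

Hands-on case: when cluster resources are tight, these commands show each Pod's requests. Comparing them with actual usage from kubectl top pod identifies Pods whose requests are far higher than they need, so the requests can be lowered to free up capacity.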
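
To judge whether a request is oversized, the requested value (from describe) has to be compared with measured usage (from `kubectl top pod`). Below is a rough sketch of that comparison for CPU; the helper names and the sample numbers are made up for illustration.

```bash
# Convert a Kubernetes CPU quantity ("250m", "2", "0.5") to millicores.
to_millicores() {
    case $1 in
        *m) echo "${1%m}" ;;
        *)  awk -v v="$1" 'BEGIN { printf "%.0f\n", v * 1000 }' ;;
    esac
}

# Percentage of the requested CPU that is actually used.
# Arguments: <requested> <used>
cpu_usage_pct() {
    local req used
    req=$(to_millicores "$1")
    used=$(to_millicores "$2")
    if [ "$req" -eq 0 ]; then
        echo "n/a"            # no request set, nothing to compare against
        return
    fi
    echo $(( used * 100 / req ))
}

# Example: the Pod requests 500m, but kubectl top pod reports 25m.
cpu_usage_pct 500m 25m
```

A Pod that stays at a few percent of its request over a representative period is a candidate for a lower request.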

3. Node troubleshooting

3.1 Checking Node health

# Show Node details
kubectl describe node worker-node-1

# Focus on the Conditions section (recommended by opsnot.com)
kubectl describe node worker-node-1 | grep -A 10 "Conditions:"

# Common conditions:
# Ready: True/False
# MemoryPressure: True/False
# DiskPressure: True/False
# PIDPressure: True/False

Hands-on case: Pods cannot be scheduled onto a node, and describe node shows:

DiskPressure: True
Message: kubelet has disk pressure

The node's disk is nearly full, so clean it up. The culprit is usually runaway logs.
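
Scanning the Conditions table by eye gets tedious across many nodes. Here is a sketch that prints only the conditions that look unhealthy, assuming the usual table layout (a Type column followed by a Status column); `unhealthy_conditions` is a hypothetical helper:

```bash
# Read `kubectl describe node` output on stdin and print conditions that look unhealthy:
# any pressure-type condition with Status True, or Ready with a Status other than True.
unhealthy_conditions() {
    awk '
        /^Conditions:/     { in_cond = 1; next }
        in_cond && /^[^ ]/ { in_cond = 0 }     # the next top-level section ends the table
        in_cond && NF >= 2 && $1 != "Type" && $1 !~ /^-+$/ {
            if (($1 == "Ready" && $2 != "True") || ($1 != "Ready" && $2 == "True"))
                print $1 "=" $2
        }
    '
}

# Usage (needs cluster access):
#   kubectl describe node worker-node-1 | unhealthy_conditions
```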

3.2 Viewing Node resource allocation

# Show how much of the Node's resources is allocated (recommended by 加班哥)
kubectl describe node worker-node-1 | grep -A 20 "Allocated resources:"

# The output shows, for example:
# CPU Requests: 1200m (60% of 2 cores)
# Memory Requests: 4Gi (50% of 8Gi)
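
In recent kubectl versions this section is printed as a table with Resource, Requests, and Limits columns. A sketch that pulls out just the request percentages, assuming that table layout; `allocated_request_pct` is a hypothetical helper name:

```bash
# Read `kubectl describe node` output on stdin and print "<resource> <request %>"
# for cpu and memory from the "Allocated resources:" table.
allocated_request_pct() {
    awk '
        /^Allocated resources:/ { in_alloc = 1; next }
        in_alloc && /^[^ ]/     { in_alloc = 0 }   # next top-level section (e.g. Events:)
        in_alloc && ($1 == "cpu" || $1 == "memory") {
            pct = $3                     # third field looks like "(60%)"
            gsub(/[()%]/, "", pct)
            print $1, pct
        }
    '
}

# Usage (needs cluster access):
#   kubectl describe node worker-node-1 | allocated_request_pct
```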

3.3 Viewing the Pods on a Node

# describe node lists every Pod on the node with its resource requests and limits (strongly recommended by 加班哥: very handy!)
kubectl describe node worker-node-1 | grep -A 50 "Non-terminated Pods:"

# Typical output:
Non-terminated Pods:          (10 in total)
  Namespace                   Name                CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                ------------  ----------  ---------------  -------------  ---
  default                     nginx-opsnot-abc12  100m (2%)     200m (5%)   128Mi (1%)       256Mi (3%)     5d
  default                     redis-xyz34         50m (1%)      100m (2%)   64Mi (0%)        128Mi (1%)     3d
  kube-system                 kube-proxy-5678     100m (2%)     0 (0%)      64Mi (0%)        0 (0%)         15d
  kube-system                 coredns-1234        100m (2%)     200m (5%)   70Mi (0%)        170Mi (2%)     15d
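
One practical use of this table is spotting Pods that run without a memory limit. A minimal sketch, assuming the column layout shown above; `pods_without_memory_limit` is a hypothetical helper:

```bash
# Read `kubectl describe node` output on stdin and list Pods whose Memory Limits
# column is 0, printed as <namespace>/<name>.
pods_without_memory_limit() {
    awk '
        /^Non-terminated Pods:/ { in_pods = 1; next }
        in_pods && /^[^ ]/      { in_pods = 0 }   # next top-level section ends the table
        # Data rows have 11 whitespace-separated fields; field 9 is the Memory Limits value.
        in_pods && NF >= 11 && $1 != "Namespace" && $9 == "0" { print $1 "/" $2 }
    '
}

# Usage (needs cluster access):
#   kubectl describe node worker-node-1 | pods_without_memory_limit
```

With the sample output above, this would report kube-system/kube-proxy-5678.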

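4. Service troubleshooting

4.1 Viewing Service configuration

# Show Service details
kubectl describe service ops-not-service

# Focus on:
# - Selector: which Pods it matches
# - Endpoints: the Pod IPs actually attached
# - Port settings
kubectl describe svc my-service | grep Selector
kubectl describe svc my-service | grep Endpoints

Hands-on case: a Service is unreachable, and describe shows that Endpoints is empty:

Endpoints: <none>

Usually the Selector is wrong and does not match the Pods' labels; fix the Selector or the labels. Pods that match but are not Ready are also left out of Endpoints.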
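
The matching rule is simple: every key=value pair in the Selector must appear in the Pod's labels. A sketch of that check on the comma-separated strings that describe and `kubectl get pods --show-labels` print; `selector_matches` and the sample labels are made up for illustration:

```bash
# Return 0 if every key=value pair in the Service selector appears in the Pod's labels.
# Both arguments are comma-separated lists such as "app=web,tier=frontend".
selector_matches() {
    local selector="$1" labels=",$2," pair
    local IFS=,
    for pair in $selector; do
        case $labels in
            *",$pair,"*) ;;          # this pair is present on the Pod
            *) return 1 ;;           # missing key or different value: no match
        esac
    done
    return 0
}

selector_matches "app=nginx" "app=nginx,pod-template-hash=abc12" && echo "match"
selector_matches "app=ngnix" "app=nginx,pod-template-hash=abc12" || echo "no match, Endpoints stays <none>"
```

The second call shows the classic typo case: one swapped letter in the Selector is enough to leave the Service without any endpoints.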

4.2 Checking Service type and ports

# Show how the Service is exposed
kubectl describe svc ops-not-service | grep Type
# Type: ClusterIP / NodePort / LoadBalancer

# Show the port mapping
kubectl describe svc ops-not-service | grep Port

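5. Deployment/StatefulSet troubleshooting

5.1 Checking replica status

# Describe the Deployment
kubectl describe deployment opsnot-rabbitmq

# Focus on:
# - Replicas: desired count vs. actually running count
# - Conditions: rollout status
# - Events: rolling update history
kubectl describe deploy opsnot-rabbitmq | grep Replicas
kubectl describe deploy opsnot-rabbitmq | grep -A 5 "Conditions:"

Hands-on case: after a release only some Pods were updated, and describe shows:

Replicas: 3 desired | 2 updated | 3 total | 2 available
Conditions:
  Progressing: False
  Reason: ProgressDeadlineExceeded

This almost always means the new image is broken and its Pods cannot start; roll back right away (for example with kubectl rollout undo deployment opsnot-rabbitmq).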
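
The Replicas line can also be checked from a script, for example in a post-deploy step. A sketch that parses it and reports how many replicas still lack the new version; `rollout_gap` is a hypothetical helper:

```bash
# Read `kubectl describe deployment` output on stdin, parse the "Replicas:" line,
# and report how many replicas have not been updated yet.
rollout_gap() {
    awk -F'[:|]' '
        /^Replicas:/ {
            for (i = 2; i <= NF; i++) {
                split($i, kv, " ")       # kv[1] = count, kv[2] = label (desired, updated, ...)
                n[kv[2]] = kv[1]
            }
            printf "desired=%d updated=%d available=%d pending_update=%d\n",
                   n["desired"], n["updated"], n["available"], n["desired"] - n["updated"]
        }
    '
}

# Usage (needs cluster access):
#   kubectl describe deploy opsnot-rabbitmq | rollout_gap
```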

5.2 Viewing the rolling update strategy

# Show the update strategy
kubectl describe deploy opsnot-rabbitmq | grep -A 3 "StrategyType:"

# Example output:
# StrategyType: RollingUpdate
# MinReadySeconds: 0
# RollingUpdateStrategy: 25% max unavailable, 25% max surge

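6. PVC/PV storage troubleshooting

6.1 Checking PVC status

# Describe the PVC
kubectl describe pvc opsnot-pvc

# Focus on:
# - Status: Bound/Pending
# - Volume: name of the bound PV
# - Capacity: actual capacity
# - Events: why binding failed

Hands-on scenario: a PVC stays Pending, and describe shows:

Events:
  Warning ProvisioningFailed no volume plugin matched

This is most likely a misconfigured StorageClass; correct it and the PVC can bind.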

6.2 Viewing PV details

# Describe the PV
kubectl describe pv opsnot-pv-name

# Focus on:
# - Status: Available/Bound/Released
# - Claim: which PVC uses it
# - Reclaim Policy: Delete/Retain
# - Access Modes: ReadWriteOnce/ReadWriteMany

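7. ConfigMap/Secret troubleshooting

7.1 Viewing a ConfigMap

# Show ConfigMap details
kubectl describe configmap my-config

# Lists all key-value pairs (opsnot.com reminder: large data may appear truncated)
kubectl describe cm my-config | grep -A 20 "Data"

# Why might the output be truncated?
kubectl describe is a human-readable summary rather than a full dump, so part of what you see can be cut off (the grep -A 20 above also stops after 20 lines). Limiting output is mainly meant to:

- keep the terminal from being flooded with output
- keep the command responsive
- avoid transferring large amounts of data over the network

# What if the ConfigMap output is truncated?
Look at the YAML instead; it is always complete.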
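
Getting the complete data takes one `kubectl get` call. Two thin wrappers as a sketch; the function names are made up, while `-o yaml` and `-o jsonpath` are standard kubectl output options:

```bash
# Print a ConfigMap in full as YAML. Arguments: <name> [namespace]
cm_yaml() {
    kubectl get configmap "$1" -n "${2:-default}" -o yaml
}

# Print the value of a single key. Arguments: <name> <key> [namespace]
# Dots inside a key (such as "app.conf") are escaped so jsonpath treats them literally.
cm_value() {
    local key
    key=$(printf '%s' "$2" | sed 's/\./\\./g')
    kubectl get configmap "$1" -n "${3:-default}" -o "jsonpath={.data.$key}"
}

# Usage (needs cluster access):
#   cm_yaml my-config prod
#   cm_value my-config app.conf prod
```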

7.2 Viewing a Secret

# Describe the Secret (values are hidden)
kubectl describe secret my-secret

# Example output:
# Data
# ====
# password: 16 bytes
# username: 8 bytes

# opsnot tip: the real values are never shown, only their sizes
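
The size that describe reports is the length of the decoded value, while `kubectl get secret -o yaml` returns the value base64-encoded. A small sketch of that relationship, assuming GNU coreutils `base64` (the `-d` flag); `decoded_size` is a hypothetical helper:

```bash
# Print the decoded size in bytes of a base64-encoded Secret value.
decoded_size() {
    printf '%s' "$1" | base64 -d | wc -c | tr -d ' '
}

# "cGFzc3dvcmQ=" is the base64 form of the 8-character string "password".
decoded_size "cGFzc3dvcmQ="

# Reading a real value (needs cluster access and permission to read Secrets):
#   kubectl get secret my-secret -o jsonpath='{.data.password}' | base64 -d
```

The call above prints 8, which is what describe would show as "8 bytes" for that key.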

8. Ingress troubleshooting

8.1 Viewing Ingress rules

# Describe the Ingress
kubectl describe ingress my-ingress

# Focus on:
# - Rules: routing rules
# - Backend: backend Service
# - Events: configuration update history
kubectl describe ing my-ingress | grep -A 20 "Rules:"

8.2 Viewing the Ingress address

# Show the IP assigned to the Ingress
kubectl describe ingress opsnot-ingress | grep Address

# Show the TLS configuration
kubectl describe ing opsnot-ingress | grep -A 5 "TLS:"

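9. Practical techniques

9.1 Quickly locating problem Pods

#!/bin/bash
# opsnot.com troubleshooting script
NS=${1:-default}

echo "=== Looking for abnormal Pods in namespace $NS ==="

# Check that the namespace exists
if ! kubectl get ns "$NS" &> /dev/null; then
    echo "Error: namespace $NS does not exist!"
    exit 1
fi

# List Pods that are neither Running nor Succeeded (completed Jobs are not failures).
# Note: Pods in CrashLoopBackOff usually report phase Running and are not caught here.
PODS=$(kubectl get pods -n "$NS" --field-selector=status.phase!=Running,status.phase!=Succeeded -o name 2>/dev/null)

if [ -z "$PODS" ]; then
    echo "加班哥 found no abnormal Pods"
    exit 0
fi

for pod in $PODS; do
    echo "--- $pod ---"
    kubectl describe "$pod" -n "$NS" | grep -A 15 "Events:"
    echo "================================"
done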
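
A phase-based filter cannot see Pods whose phase is Running while their containers are not ready; CrashLoopBackOff is the typical case. A complementary sketch that works on the READY column of `kubectl get pods` instead; `not_ready_pods` is a hypothetical helper:

```bash
# Read `kubectl get pods` output (including the header line) on stdin and print
# the name and status of Pods whose READY column is not n/n.
# Pods that finished normally (STATUS Completed) are skipped.
not_ready_pods() {
    awk 'NR > 1 && $3 != "Completed" {
        split($2, r, "/")                # READY looks like "1/2"
        if (r[1] != r[2]) print $1, $3
    }'
}

# Usage (needs cluster access):
#   kubectl get pods -n prod | not_ready_pods
```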

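9.2 Checking resources in bulk

# Check the status of every Node
for node in $(kubectl get nodes -o name); do
    echo "=== $node ==="
    # -A 10 covers the table header plus all conditions, including Ready at the end
    kubectl describe "$node" | grep -A 10 "Conditions:"
done

# Check the Endpoints of every Service in a namespace
kubectl get svc -n prod -o name | while read -r svc; do
    echo "$svc:"
    kubectl describe "$svc" -n prod | grep Endpoints
done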

9.3 Viewing recent events

# A Pod's most recent events (sorted by time)
kubectl describe pod ops-not-pod | grep -A 50 "Events:" | tail -20

# Events for all resources
kubectl get events --sort-by=.metadata.creationTimestamp

# opsnot experience: read describe and events together
kubectl describe pod ops-not-pod && kubectl get events --field-selector involvedObject.name=ops-not-pod

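9.4 Exporting full information for troubleshooting

# Export full Pod information
kubectl describe pod ops-not-pod > pod-describe.txt

# Export the resources covered by "all" in one namespace
# (Pods, Services, Deployments and similar workloads; not ConfigMaps, Secrets, or Ingresses)
kubectl describe all -n production > cluster-info.txt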

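10. Common problem checklist

10.1 Pod will not start

# 1. Check the Pod status first
kubectl get pod ops-not-mysql

# 2. Use describe and read the Events
kubectl describe pod ops-not-mysql

# Common causes:
#  - ImagePullBackOff: image pull failed
#  - CrashLoopBackOff: the container keeps exiting shortly after it starts
#  - Pending: insufficient resources or scheduling failure
#  - Error: configuration error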
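
The checklist above can be turned into a tiny lookup for use in scripts or runbooks. This is only a sketch: `status_hint` is a hypothetical helper, and the hints are general starting points rather than a complete diagnosis.

```bash
# Map a Pod status (as shown by `kubectl get pod`) to the first thing to check.
status_hint() {
    case $1 in
        ImagePullBackOff|ErrImagePull) echo "check image name, tag, and registry credentials" ;;
        CrashLoopBackOff)              echo "check Last State and Exit Code, then kubectl logs --previous" ;;
        Pending)                       echo "check Events for FailedScheduling (resources, taints, PVC)" ;;
        *)                             echo "read the Events section of kubectl describe" ;;
    esac
}

status_hint CrashLoopBackOff
```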

10.2 Service unreachable

# 1. Check the Service's Endpoints
kubectl describe svc opsnot-nginx-service | grep Endpoints

# 2. If Endpoints is empty, check the Selector
kubectl describe svc opsnot-nginx-service | grep Selector
kubectl get pods --show-labels

# 3. Check whether the Pods are Ready
kubectl get pods -l app=opsnot-blog

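10.3 Network problems

# 1. Check the Pod IP
kubectl describe pod my-pod | grep "IP:"

# 2. Check the Service ClusterIP
kubectl describe svc my-service | grep "IP:"

# 3. Check DNS (describe pod output normally has no DNS section, so read the DNS policy from the spec)
kubectl get pod my-pod -o jsonpath='{.spec.dnsPolicy}'

# 加班哥's troubleshooting flow:
# Pod -> Service -> Ingress, layer by layer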

10.4 Insufficient resources

# 1. Check Node resources
kubectl describe nodes | grep -A 10 "Allocated resources:"

# 2. Check the Pod's resource settings
kubectl describe pod opsnot-pod | grep -A 5 "Limits:"

# 3. Look in Events for warnings about insufficient resources
kubectl describe pod opsnot-pod | grep "Insufficient"
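
When the same warning repeats many times, a summary of which resource is short is easier to read than raw grep output. A sketch, assuming scheduler messages of the form "N Insufficient <resource>"; `insufficient_resources` is a hypothetical helper:

```bash
# Read describe output on stdin and print "<resource> <count>", where count is
# the number of times that resource was reported as insufficient.
insufficient_resources() {
    grep -o 'Insufficient [a-z0-9./-]*' \
        | sed 's/\.$//' \
        | sort | uniq -c \
        | awk '{ print $3, $1 }'
}

# Usage (needs cluster access):
#   kubectl describe pod opsnot-pod | insufficient_resources
```

The `sed` step drops the sentence-ending period that would otherwise stick to the resource name.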

11. Advanced usage

11.1 Combining with other commands

# describe + logs
kubectl describe pod my-pod && kubectl logs my-pod --tail=50

# describe + exec
kubectl describe pod my-pod
kubectl exec -it my-pod -- sh

# describe + top to see actual resource usage
kubectl describe pod my-pod
kubectl top pod my-pod

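11.2 Live monitoring with watch

# Watch a Pod's events in real time (加班哥 uses this often)
watch -n 2 'kubectl describe pod opsnot-pod | grep -A 10 "Events:"'

# Watch Node allocation in real time (quote the whole pipeline so grep runs inside watch)
watch 'kubectl describe node worker-1 | grep -A 20 "Allocated resources:"'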

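11.3 Formatting key information

#!/bin/bash
# opsnot.com in-house script: quick view of a Pod's key information
# Arguments: <pod> [namespace]
POD=${1:?"usage: $0 <pod> [namespace]"}
NS=${2:-default}

# Run describe once and reuse the output for every section
DESC=$(kubectl describe pod "$POD" -n "$NS") || exit 1

echo "=== Basic info ==="
printf '%s\n' "$DESC" | grep -E "Status:|IP:|Node:"

echo -e "\n=== Container state ==="
printf '%s\n' "$DESC" | grep -A 3 "State:"

echo -e "\n=== Restarts ==="
printf '%s\n' "$DESC" | grep "Restart Count"

echo -e "\n=== Recent events ==="
printf '%s\n' "$DESC" | grep -A 15 "Events:" | tail -10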

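12. Efficiency tips

# describe prints a lot; pipe it through a filter to keep the key lines
kubectl describe pod ops-not-pod | grep -E "Status:|Events:|State:|Restart"

# Limit the scope to one namespace instead of searching everything
kubectl describe pods -n opsnot-namespace

# Use get -o wide for a quick overview, then describe for details
kubectl get pods -o wide
kubectl describe pod ops-not-pod

# opsnot.com advice: locate the problem quickly with get, then dig in with describe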

This article is from opsnot.com.