K8s Probe Mechanism: Liveness / Readiness / Startup Configuration Guide + Postmortem of a Million-Yuan Outage

📅 Created 2026-05-08 🔄 Updated 2026-05-13

K8s Probe Mechanism

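Responsibilities of the Three Probes

Probe	Purpose	Consequence of failure	Typical scenarios
Liveness probe	Decides whether the container is still "alive"	kubelet restarts the container	Deadlocks, memory leaks, hung processes
Readiness probe	Decides whether the container is "ready" to receive traffic	Pod is removed from the Service Endpoints	Slow startup, rolling updates, temporary overload
Startup probe	Protects slow-starting applications from being killed prematurely	kubelet restarts the container	Spring Boot, big-data components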

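Key relationship: until the startup probe succeeds, the liveness and readiness probes do not run. Once it succeeds, the liveness probe (self-healing) and the readiness probe (traffic admission) take over and run in parallel for the rest of the container's life.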

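Three Probing Methods

Method	When to use	Notes
HTTP GET	Applications with an HTTP port (Spring Boot, Go web services)	Checks the HTTP status code (200-399 counts as healthy)
TCP Socket	Services without an HTTP health endpoint (e.g. Redis); also usable for Nginx when a port check is enough	kubelet opens a TCP connection to the port; a successful connection counts as healthy
Exec	Custom check logic	Runs a command inside the container; exit code 0 counts as healthy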

Production Configuration Templates

Scenario 1: Spring Boot application with HTTP GET

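containers:
  - name: springboot-app
    image: payment:latest
    ports:
      - containerPort: 8080

    startupProbe:
      httpGet:
        path: /actuator/health
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 5
      failureThreshold: 12   # 12 × 5s = 60s of probing after the 10s initial delay
      timeoutSeconds: 3

    livenessProbe:
      httpGet:
        path: /actuator/health/liveness   # dedicated endpoint, no external dependencies
        port: 8080
      # Counted from container start, not from startup success. The startupProbe
      # already holds this probe back, so this value can usually be lowered.
      initialDelaySeconds: 70
      periodSeconds: 15
      failureThreshold: 3
      timeoutSeconds: 3

    readinessProbe:
      httpGet:
        path: /actuator/health/readiness   # dedicated endpoint
        port: 8080
      initialDelaySeconds: 70   # also counted from container start; delays traffic even if startup finishes early
      periodSeconds: 5
      failureThreshold: 3
      successThreshold: 1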
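
The `/actuator/health/liveness` and `/actuator/health/readiness` paths in this template only exist when Actuator's probe health groups are enabled. A minimal sketch of the application-side setting, assuming Spring Boot 2.3 or later with `spring-boot-starter-actuator` on the classpath (the `application.yaml` file is an assumption, not part of the original template):

```yaml
# application.yaml (assumed file, shown for illustration only)
management:
  endpoint:
    health:
      probes:
        enabled: true   # exposes /actuator/health/liveness and /actuator/health/readiness
  endpoints:
    web:
      exposure:
        include: health   # make sure the health endpoint is served over HTTP
```

Spring Boot normally turns these groups on by itself when it detects that it runs in Kubernetes. Setting the property explicitly is harmless and makes the probes testable on a local machine.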

Scenario 2: Nginx with TCP probing

containers:
  - name: nginx-app
    image: nginx:latest
    ports:
      - containerPort: 80
    startupProbe:
      tcpSocket:
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 5
      failureThreshold: 6   # 6 × 5s = 30s of startup time
    livenessProbe:
      tcpSocket:
        port: 80
      initialDelaySeconds: 35   # counted from container start; can be lowered because the startupProbe gates this probe
      periodSeconds: 15
      failureThreshold: 3
    readinessProbe:
      tcpSocket:
        port: 80
      initialDelaySeconds: 35   # same note as above
      periodSeconds: 5
      failureThreshold: 3

Scenario 3: Custom Exec probing

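containers:
  - name: custom-app
    image: custom:latest
    startupProbe:
      exec:
        command:   # succeeds once the application has written the marker file
          - cat
          - /tmp/startup-complete
      initialDelaySeconds: 10
      periodSeconds: 10
      failureThreshold: 10   # 10 × 10s = 100s of startup time
    livenessProbe:
      exec:
        command:   # process-level check: is the process still present?
          - pgrep
          - custom-app
      initialDelaySeconds: 110   # counted from container start; can be lowered because the startupProbe gates this probe
      periodSeconds: 20
      failureThreshold: 3
    readinessProbe:
      exec:
        command:   # business-level check: has the application logged that it is serving?
          - grep
          - "Business started"
          - /var/log/custom-app.log
      initialDelaySeconds: 110   # same note as above
      periodSeconds: 10
      failureThreshold: 3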
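
The startup probe in this scenario only passes once something creates `/tmp/startup-complete`. One way to do that is to write the marker from the container's entrypoint. This is a sketch only, assuming the image ships a shell; `/app/init.sh` and `/app/custom-app` are hypothetical paths that do not come from the original template:

```yaml
containers:
  - name: custom-app
    image: custom:latest
    command:
      - sh
      - -c
      # /app/init.sh and /app/custom-app are hypothetical paths for illustration.
      # The marker file is written only if initialization exits successfully,
      # then exec replaces the shell with the application process.
      - /app/init.sh && touch /tmp/startup-complete && exec /app/custom-app
```

Exec probes run inside the container, so `cat`, `pgrep` and `grep` must also be present in the image. Minimal or distroless images often lack them.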

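6 Common Pitfalls

#	Mistake	Consequence	Correct approach
1	Liveness probe depends on external services	A downstream blip gets Pods restarted and causes a cascading failure	Liveness probe checks only the process itself and local state
2	initialDelaySeconds too short	The application is restarted while still starting, in an endless loop	Use a startup probe sized so that `failureThreshold × periodSeconds ≥ worst-case startup time`
3	failureThreshold too small	A network blip is enough to trigger a restart, so the service is unstable	Use 3-5 for core services
4	periodSeconds too short	Frequent probing consumes CPU and memory	Liveness 15-30s, readiness 5-10s
5	Liveness and readiness share the same parameters	Under overload the liveness probe fails too and the Pod is restarted outright	Liveness → process level, readiness → business level, with separate responsibilities
6	Probe state is not monitored	Probe failures go unnoticed	Monitor `kube_pod_container_status_restarts_total` (exposed by kube-state-metrics) in Prometheus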
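
For pitfall 6, a Prometheus alerting rule along the following lines could surface restart loops early. It is a sketch that assumes kube-state-metrics is being scraped. The group name, alert name, threshold and time windows are illustrative choices, not values from the original:

```yaml
groups:
  - name: probe-health            # illustrative group name
    rules:
      - alert: ContainerRestartingFrequently   # illustrative alert name
        # More than 3 restarts of the same container within 15 minutes.
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} in {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
```

The restart counter does not say why a container restarted, so an alert like this is a starting point for checking `kubectl describe pod` events for probe failures.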

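💀 Case Postmortem: A Probe Misconfiguration That Cost Over a Million Yuan

In 2025, at 2 a.m., every payment service instance at a financial company went into CrashLoopBackOff. Business was down for 37 minutes, with direct losses of more than one million yuan.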

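Incident Chain

The Liveness Probe path /healthz returns 200 early in application startup
 ↓
K8s wrongly judges the container "alive" while core modules have not finished loading
 ↓
External requests fail → probe times out → Pod is restarted
 ↓
The new Pod is likewise judged alive before it finishes starting → restarted again
 ↓
All instances fall into a restart loop → CrashLoopBackOff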

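Root Cause

The /healthz endpoint returned 200 as soon as the JVM was up, which says nothing about business readiness. Note that /healthz is not an Actuator default path: Spring Boot Actuator serves health at /actuator/health.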

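Lessons

The Liveness Probe's /healthz must not only check that the port is reachable; it must reflect the state of the application's own core modules (local state only, not downstream dependencies)
Without a Startup Probe, initialDelaySeconds must exceed the application's longest startup time (including dependency initialization and cache warm-up)
Adopt a Startup Probe early: let it cover slow-starting applications so the liveness probe cannot kill them prematurely
Liveness-triggered restarts must be idempotent: a restart must not cause side effects such as duplicate charges or duplicate messages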

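Best Practices

Probe	Core principle	Recommended parameters
Startup probe	Allow enough startup time	`failureThreshold=10-20`, `periodSeconds=5`
Liveness probe	Lightweight, no dependencies, detects deadlocks quickly	`periodSeconds=15-30`, `failureThreshold=3`
Readiness probe	Probe frequently, remove from rotation quickly	`periodSeconds=5-10`, `failureThreshold=3`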
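
As a worked example of sizing the startup budget, assume an application whose worst observed startup takes about 90 seconds (an illustrative figure, not one from the incident). The kubelet tolerates roughly `initialDelaySeconds + failureThreshold × periodSeconds` before it restarts the container, so a sketch that covers this case could look like:

```yaml
startupProbe:
  httpGet:
    path: /actuator/health/liveness   # assumed endpoint, borrowed from the Spring Boot template
    port: 8080
  periodSeconds: 5
  failureThreshold: 20   # 20 × 5s = 100s budget, about 10s above the assumed 90s worst case
  timeoutSeconds: 3
```

A fast start is not penalized by a generous budget: the startup probe stops at its first success and hands over to the liveness and readiness probes.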

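Key reminders:

Full coverage: every production Pod must configure liveness + readiness probes; add a startup probe for slow-starting applications
Separate endpoints: liveness, readiness and startup probes use different check endpoints
Combine with monitoring: pair probes with Prometheus/Grafana alerting
Staged rollout: validate the probe configuration of a new application in a test environment first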

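Use together with pod-troubleshooting (CrashLoopBackOff troubleshooting), k8s-rolling-update-pitfalls (judging real business readiness with the Readiness probe during rolling updates) and k8s-troubleshooting-principles (troubleshooting principles).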