Full-Stack Performance Troubleshooting Methodology: Nginx → Application → Database → Server

📅 Created 2026-05-08 🔄 Updated 2026-05-11 📝 706 characters

troubleshooting production monitoring networking database mysql nginx

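Full-Stack Performance Troubleshooting Methodology

Core Principles

Troubleshoot layer by layer, from the outside in: check the layers that are easy to confirm first, then dig into the more complex ones:

Determine whether the problem is global or local
Determine whether it is a network problem or a server-side problem
Determine which service is the bottleneck

Every phase needs a clear goal: "What am I trying to confirm? What is the output? How do I decide based on that output?"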

Phase 1: Determine the Scope of the Problem

# Run curl from several locations and compare response times
for i in {1..5}; do
  curl -o /dev/null -s -w "Time: %{time_total}s, HTTP: %{http_code}\n" \
    -H "Host: www.example.com" http://123.45.67.89/api/homepage
done

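# Show the distribution of status codes in the Nginx access_log
awk '{print $9}' /var/log/nginx/access.log | sort | uniq -c | sort -rn

# Count 5xx errors within a given hour to see whether they are increasing
# (the pattern assumes $time_iso8601 in log_format; with the default
# $time_local, match "15/Mar/2024:14:" instead)
grep "2024-03-15T14:" /var/log/nginx/access.log | awk '$9 >= 500' | wc -l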
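A raw 5xx count is hard to judge without the total request volume. The following is a minimal sketch that turns an access log into an error rate. It assumes the combined log format (status code in whitespace-separated field 9); the function name `error_rate_5xx` is hypothetical and not part of any tool.

```shell
#!/usr/bin/env bash
# error_rate_5xx: read access log lines on stdin and print the total number
# of requests, the number of 5xx responses, and the 5xx percentage.
# Assumption: combined log format, so the status code is field 9.
error_rate_5xx() {
  awk '
    # Only count lines whose 9th field looks like a three-digit status code
    $9 ~ /^[0-9][0-9][0-9]$/ { total++; if ($9 >= 500) errors++ }
    END {
      if (total == 0) { print "total=0 5xx=0 rate=0.00%"; exit }
      printf "total=%d 5xx=%d rate=%.2f%%\n", total, errors, errors * 100 / total
    }
  '
}
```

It can be fed the same hourly slice as above, for example `grep "2024-03-15T14:" /var/log/nginx/access.log | error_rate_5xx`, and the same slice from an earlier hour gives a baseline for comparison.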

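Phase 2: Nginx Layer

# Check resource usage of the Nginx processes
ps -eo pid,ppid,comm,%cpu,%mem,rss | grep nginx

# Count current TCP connections on port 8080 (substitute the port you are checking)
ss -tn | grep :8080 | wc -l

# Look for 502/504/timeout entries in the error_log
tail -n 200 /var/log/nginx/error.log | grep -E "502|504|upstream|timeout|connect"

# List requests whose upstream response time exceeds 5s
# (requires $upstream_response_time in the access_log format; $2 assumes it is the second field)
awk '$2 > 5' /var/log/nginx/access.log | head -20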

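Key decision: if hitting the upstream port directly is fast but going through Nginx is slow, the problem is in the Nginx layer. If the upstream is slow even when accessed directly, the problem is in the upstream or database layer.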
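This decision rule can be written down as a small helper so the comparison is made the same way every time. The sketch below is illustrative only: the function name `classify_layer` and the 1-second default threshold are assumptions, not values from any standard.

```shell
#!/usr/bin/env bash
# classify_layer DIRECT_SECONDS NGINX_SECONDS [THRESHOLD_SECONDS]
# Prints "ok" when the request through Nginx is within the threshold,
# "nginx" when only the path through Nginx is slow, and "upstream" when
# the upstream is slow even when accessed directly.
# Assumption: the default threshold of 1 second is an arbitrary example value.
classify_layer() {
  awk -v direct="$1" -v via="$2" -v limit="${3:-1}" 'BEGIN {
    if (via <= limit)         print "ok"
    else if (direct <= limit) print "nginx"
    else                      print "upstream"
  }'
}

# Example of collecting the two timings (not executed here; the URL that
# goes through Nginx is a placeholder):
#   direct=$(curl -o /dev/null -s -w '%{time_total}' http://127.0.0.1:8080/api/data)
#   via=$(curl -o /dev/null -s -w '%{time_total}' -H 'Host: www.example.com' http://127.0.0.1/api/data)
#   classify_layer "$direct" "$via"
```

A single pair of timings can be noisy, so it is worth repeating the measurement a few times before acting on the verdict.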

For common configuration problems in the Nginx layer, see nginx-config-pitfalls (location matching, proxy_pass path handling, upstream keepalive, and so on).

Phase 3: Upstream Application Layer

# Confirm that the upstream ports are listening
ss -tlnp | grep -E "8080|3000|5000|9000"

# Test the upstream response directly
time curl -s http://127.0.0.1:8080/api/data > /dev/null

# Check the application logs
tail -n 100 /var/log/app/application.log | grep -E "ERROR|WARN|Exception"
journalctl -u app-backend --since "10 minutes ago" | grep -E "ERROR|Exception|slow"

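Phase 4: Database Layer

# Show the SQL statements currently running
mysql -u root -p -e "SHOW FULL PROCESSLIST;"

# Check the slow query log settings
SHOW VARIABLES LIKE 'slow_query%';
SHOW VARIABLES LIKE 'long_query_time';

# Analyze the execution plan with EXPLAIN
EXPLAIN SELECT ...;

# Show lock waits (MySQL 5.7; INNODB_LOCK_WAITS was removed in 8.0, where sys.innodb_lock_waits can be queried instead)
SELECT r.trx_id AS waiting_trx_id, r.trx_mysql_thread_id AS waiting_thread, r.trx_query AS waiting_query,
       b.trx_id AS blocking_trx_id, b.trx_mysql_thread_id AS blocking_thread, b.trx_query AS blocking_query
FROM information_schema.INNODB_LOCK_WAITS w
JOIN information_schema.INNODB_TRX b ON w.blocking_trx_id = b.trx_id
JOIN information_schema.INNODB_TRX r ON w.requesting_trx_id = r.trx_id;

# InnoDB status
SHOW ENGINE INNODB STATUS\G

See mysql-performance-config for details.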

Phase 5: Server Resources

# CPU
ps aux --sort=-%cpu | head -10
pidstat -p $(pgrep -f "mysqld" | head -1) 1 5

# Memory
free -h
dmesg | grep -iE "oom|mysql|killed" | tail -20

# For OOMKilled troubleshooting, see the Burstable QoS pitfalls section of [[k8s-resource-limits-configuration]]

# Disk IO
iostat -x 1 3
iotop -o 2>/dev/null

# Network
ss -s
# Count connections on port 80 by TCP state
netstat -an | awk '/:80[ \t]/ {print $NF}' | sort | uniq -c | sort -rn
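The memory check with `free -h` gives human-readable sizes, which are awkward to compare against a percentage threshold. As a rough sketch, swap usage can be computed from the `SwapTotal` and `SwapFree` fields of `/proc/meminfo` (Linux only). The function name `swap_used_pct` is hypothetical.

```shell
#!/usr/bin/env bash
# swap_used_pct: read /proc/meminfo-style text on stdin and print the
# percentage of swap currently in use, with one decimal place.
# Prints 0.0 when no swap is configured.
swap_used_pct() {
  awk '
    $1 == "SwapTotal:" { total = $2 }
    $1 == "SwapFree:"  { avail = $2 }
    END {
      if (total == 0) { print "0.0"; exit }
      printf "%.1f\n", (total - avail) * 100 / total
    }
  '
}
```

On a live host it would be called as `swap_used_pct < /proc/meminfo`.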

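Phase 6: Typical Root Causes and Quick Diagnosis

Scenario	Symptoms	Fix
Nginx upstream timeout	Direct upstream access is fast, access through Nginx is slow, 504 responses	Increase `proxy_read_timeout`
Database connection pool exhausted	PROCESSLIST shows many "Waiting for connection" entries	Kill slow queries and enlarge the connection pool
Slow queries driving CPU high	EXPLAIN shows type=ALL, key=NULL, rows in the millions	Add an index
Low InnoDB buffer pool hit rate	Hit rate < 95%	Increase `innodb_buffer_pool_size`
Insufficient memory causing swapping	free -h shows swap in use	Add physical memory or shrink the buffer pool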
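The buffer pool hit rate in the table is commonly derived from two InnoDB status counters: `Innodb_buffer_pool_reads` (logical reads that had to go to disk) and `Innodb_buffer_pool_read_requests` (all logical read requests). A minimal sketch of the arithmetic, with the hypothetical helper name `bp_hit_rate`:

```shell
#!/usr/bin/env bash
# bp_hit_rate READS READ_REQUESTS
# Prints the buffer pool hit rate as a percentage with two decimals:
#   (1 - reads / read_requests) * 100
# Prints "n/a" when there have been no read requests yet.
bp_hit_rate() {
  awk -v reads="$1" -v requests="$2" 'BEGIN {
    if (requests <= 0) { print "n/a"; exit }
    printf "%.2f\n", (1 - reads / requests) * 100
  }'
}

# The two counters can be read with, for example (not executed here):
#   mysql -N -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads'"
#   mysql -N -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_requests'"
```

Both counters are cumulative since server start, so the result is a long-run average. To judge the current hit rate, sample the counters twice and apply the same formula to the differences.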

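Incident Timeline Postmortem Template

Record what was done at each key point in time so the incident can be reviewed afterwards:

14:02 - Complaint received: homepage loads slowly
14:05 - Monitoring shows the 502/504 error rate rising from 0.1% to 5%
14:08 - top shows CPU at 100%, with mysqld using 90%
14:10 - SHOW PROCESSLIST reveals many slow queries
14:12 - slow_query_log shows a scheduled job running a large-table JOIN without an index
14:20 - Contacted the developers to confirm the SQL
14:30 - Added an index; the query went from 45s to 0.3s
14:35 - Monitoring back to normal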

Preventive Monitoring Recommendations

Metric	Alert threshold	Collection method
Nginx 502/504 error rate	> 1%	access_log statistics
upstream response time P99	> 3s	access_log / Prometheus
MySQL slow queries per minute	> 10/min	slow_query_log
MySQL connection usage	> 80%	SHOW GLOBAL STATUS
InnoDB buffer pool hit rate	< 95%	SHOW ENGINE INNODB STATUS
Swap usage	> 10%	free -h
Disk IO util	> 80%	iostat
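For the P99 row, one way to get a number straight from the access log is the nearest-rank percentile: sort the response times and take the value at position ceil(0.99 × N). The sketch below assumes one response time in seconds per input line; the function name `p99` is hypothetical.

```shell
#!/usr/bin/env bash
# p99: read one response time (in seconds) per line on stdin and print the
# 99th percentile using the nearest-rank method.
# Lines that are not a plain number (for example "-", or several
# comma-separated upstream times) are skipped.
p99() {
  grep -E '^[0-9]+(\.[0-9]+)?$' | LC_ALL=C sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit
      # ceil(NR * 99 / 100), computed without floating-point rounding issues
      rank = int((NR * 99 + 99) / 100)
      print v[rank]
    }
  '
}
```

If `$upstream_response_time` is the second field of the log format, as assumed in the Phase 2 command, the call would be `awk '{print $2}' /var/log/nginx/access.log | p99`. A Prometheus-based setup would compute the percentile from histogram buckets instead, which this sketch does not cover.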

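Summary

When a website slows down, the core of troubleshooting is locating the problem layer by layer:

① Global vs. local → ② Nginx vs. upstream → ③ Application vs. database → ④ Database breakdown → ⑤ Server resources

Key habits:

Every command should serve to verify a specific hypothesis
Every fix needs a before/after comparison (5.2s → 0.3s)
A temporary fix only stops the bleeding; systematic optimization is the real cure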

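Related Pages

Page	Relevance
ops-automation-scripts	Five-piece ops automation script set (health checks, log alerting, service watchdog, etc.; for use alongside full-stack troubleshooting)
server-performance-four-dimensions	Five-dimension troubleshooting framework and alert thresholds
nginx-502-504-connection-reset-guide	In-depth Nginx 502/504 troubleshooting
linux-disk-space-troubleshooting	Disk space troubleshooting
nginx-realtime-push-guide	End-to-end troubleshooting for SSE/WebSocket real-time push
nginx-load-balancing-strategy-guide	Practical guide to choosing an Nginx load-balancing strategy: in-depth comparison of weighted round-robin and IP Hash, best practices for hybrid strategies
nginx-log-analysis-monitoring-guide	Guide to building an Nginx log analysis and monitoring system: custom log formats, performance analysis techniques, GoAccess visualization,
api-latency-troubleshooting-interview	Big-tech interview walkthrough: a complete approach to troubleshooting an API that jumps from 200ms to 3 seconds, covering call-chain analysis, database slow queries, connection
ops-sre-jargon-guide	Quick reference for ops/SRE jargon: north-south traffic, APM, SLA, circuit breaking and degradation, canary release, Sidecar, and 15+
port-connectivity-troubleshooting-guide	Complete guide to troubleshooting an unreachable production service port: from process/port listening/firewall/cloud security group/SELinux/network routing to D
network-packet-loss-troubleshooting	End-to-end guide to troubleshooting network packet loss: layer by layer from ping to tcpdump, including distinguishing physical from logical packet loss, 8-step troubleshooting
cpu-100-full-chain-diagnosis	End-to-end analysis of CPU 100% incidents from alert to root cause: a seven-step method, a real cascading-failure case (MySQL backup lock
mysql-slow-query-5-rounds-case-study	Case study of five progressive rounds of MySQL slow query optimization: 10.2s→9.7ms, including execution plan analysis and covering index construction
nginx-troubleshooting-methodology-8-steps	An Nginx troubleshooting routine that can be applied as-is in production, covering 502/504/499/500/con
database-troubleshooting-checklist-mysql-redis	Covers the most common production failures for MySQL (12 kinds) and Redis (11 kinds), each organized as symptom→investigation→root cause→
network-troubleshooting-order	Seven-step network troubleshooting method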