Nginx 502/504/Connection Reset In-Depth Troubleshooting Guide

📅 Created 2026-05-11 🔄 Updated 2026-05-11 📝 622 characters

nginx troubleshooting networking production debugging performance

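502, 504 and Connection Reset point to entirely different kinds of failure, and each needs a different troubleshooting path. The core method: logs first → verify segment by segment → fix the specific cause.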

How the three errors fundamentally differ

Error	error.log keyword	Direct cause	Where to look
502	`connect() failed (111: Connection refused)`	Nginx cannot connect to the upstream	Is the backend alive / firewall / proxy_pass
504	`upstream timed out (110: Connection timed out)`	The upstream did not respond within proxy_read_timeout	Backend response time / timeout settings
Connection Reset	`recv() failed (104: Connection reset by peer)`	The upstream actively sent a RST packet	Backend fds / thread pool / backlog
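To apply the table above to a whole error.log at once, the sketch below counts lines per errno keyword. It assumes a POSIX shell and awk; the function name classify_upstream_errors and the bucket labels are hypothetical, and only the three keywords from the table are recognized.

```shell
# Hypothetical helper: count upstream errors in an Nginx error.log by errno,
# using the three keywords from the table above.
# Usage: classify_upstream_errors < /var/log/nginx/error.log
classify_upstream_errors() {
  awk '
    /\(111: Connection refused\)/       { c["refused_111"]++ }
    /\(110: Connection timed out\)/     { c["timed_out_110"]++ }
    /\(104: Connection reset by peer\)/ { c["reset_104"]++ }
    END { for (k in c) print c[k], k }
  ' | sort -rn    # most frequent error type first
}
```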

First check access.log to confirm the actual status codes:

awk '{print $9}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -10
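The count above gives totals; to see when the 502/504 responses started, a per-minute breakdown is often more telling. This is a minimal sketch: errors_per_minute is a hypothetical name, and it assumes each log line starts with the combined-style prefix (as in the main_ext format shown later), so the status is field 9.

```shell
# Hypothetical helper: count 502 and 504 responses per minute.
# Assumes lines start like: $remote_addr - $remote_user [$time_local] "$request" $status ...
# Usage: errors_per_minute < /var/log/nginx/access.log
errors_per_minute() {
  awk '$9 == 502 || $9 == 504 {
         minute = substr($4, 2, 17)   # "[11/May/2026:10:00:01" -> "11/May/2026:10:00"
         n[minute " " $9]++
       }
       END { for (k in n) print k, n[k] }' | sort
}
```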

Four-segment path method

Segment 1: client → Nginx network    Segment 2: Nginx itself
Segment 3: Nginx → upstream network  Segment 4: upstream (backend service)

Local comparison method (run curl on the Nginx host itself)

curl -sS -o /dev/null -w 'http_code=%{http_code} time_total=%{time_total}s\n' http://127.0.0.1/health

curl on the Nginx host	Access from outside	Conclusion
✅ OK	❌ Failing	Segment 1 problem (client → Nginx network)
❌ Failing	❌ Failing	Problem in segments 2 to 4

Direct-to-upstream method (curl the backend, bypassing Nginx)

curl -sS -o /dev/null -w "code=%{http_code} total=%{time_total}s\n" http://10.0.1.10:8080/health

Via Nginx on the host	Direct to upstream	Conclusion
❌ Failing	❌ Failing	Segment 4 problem (backend service)
❌ Failing	✅ OK	Problem in segments 2 to 3 (Nginx config / Nginx → upstream network)
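The two comparison tables can be folded into one decision function. A sketch under stated assumptions: probe and locate_segment are hypothetical helpers, a 2xx/3xx answer within 5 seconds counts as OK, and the external result has to be supplied by hand because it must come from a client outside the Nginx host.

```shell
# Hypothetical helper: prints "ok" for a 2xx/3xx answer within 5s, else "fail".
# curl prints http_code 000 when no response arrives at all.
probe() {
  code=$(curl -sS -o /dev/null --max-time 5 -w '%{http_code}' "$1" 2>/dev/null)
  case $code in
    2??|3??) echo ok ;;
    *)       echo fail ;;
  esac
}

# Hypothetical helper encoding the two tables above.
# locate_segment LOCAL EXTERNAL UPSTREAM   (each argument is "ok" or "fail")
#   LOCAL    = request through Nginx, sent from the Nginx host
#   EXTERNAL = the same request from a client outside
#   UPSTREAM = request straight to the upstream, bypassing Nginx
locate_segment() {
  case "$1:$2:$3" in
    ok:fail:*)      echo "segment 1: client -> Nginx network" ;;
    fail:fail:fail) echo "segment 4: upstream (backend service)" ;;
    fail:fail:ok)   echo "segment 2-3: Nginx config or Nginx -> upstream network" ;;
    ok:ok:*)        echo "not reproduced: both paths look healthy" ;;
    *)              echo "inconsistent results: re-run the probes" ;;
  esac
}
```

Usage, with the external result filled in by hand: `locate_segment "$(probe http://127.0.0.1/health)" fail "$(probe http://10.0.1.10:8080/health)"`.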

502: six troubleshooting steps

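Step	Command	What it tells you
① Backend process	`ps aux | grep backend`	Whether the process is alive
② Listening port	`ss -lntp | grep 8080`	Whether the port is bound
③ Firewall	`iptables -L -n`	Whether the upstream port is blocked
④ proxy_pass	`grep "proxy_pass" conf.d/*.conf`	Whether the address is correct
⑤ DNS resolution	`nslookup backend.example.com`	Whether the hostname resolves
⑥ Health check	`curl -I http://127.0.0.1:8080/health`	Whether the backend responds normally (run on the backend host)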

Also check whether the kernel OOM killer terminated the backend: dmesg -T | grep -i "oom\|killed"
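Step ② uses `grep 8080`, which also matches ports such as 18080. A stricter check compares the port field exactly. This sketch assumes the usual `ss -lnt` column layout (State, Recv-Q, Send-Q, Local Address:Port, ...); port_listening is a hypothetical name.

```shell
# Hypothetical helper: does anything listen on TCP port $1?
# Reads `ss -lnt` output on stdin and matches the port exactly.
# Usage: ss -lnt | port_listening 8080
port_listening() {
  awk -v port="$1" '
    $1 == "LISTEN" {
      n = split($4, a, ":")     # 0.0.0.0:8080, [::]:8080, *:8080 -> last part is the port
      if (a[n] == port) found = 1
    }
    END { print (found ? "listening" : "not listening") }'
}
```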

504: three troubleshooting steps

① Measure the upstream's real response time

curl -sS -o /dev/null -w "\
time_namelookup=%{time_namelookup}s\n\
time_connect=%{time_connect}s\n\
time_starttransfer=%{time_starttransfer}s\n\
time_total=%{time_total}s\n" http://10.0.1.10:8080/api/slow

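time_starttransfer is the key number: the time from the start of the request until the first response byte arrives. proxy_read_timeout (default 60s) limits how long Nginx waits between two successive reads from the upstream, including the wait for the first byte; if time_starttransfer exceeds it, that is the root cause of the 504.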
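The comparison in the previous paragraph can be scripted. A minimal sketch, assuming the timeout is written with an ms, s, m or h suffix or as a bare number of seconds (Nginx accepts more units than this); ttfb_vs_timeout is a hypothetical name. Nginx starts the read timer only after the request has been sent, so treat the verdict as approximate.

```shell
# Hypothetical helper: compare curl's time_starttransfer (seconds) with an
# Nginx time value such as 60s, 2m or 1500ms. A bare number means seconds.
# Usage: ttfb_vs_timeout 65.2 60s
ttfb_vs_timeout() {
  awk -v ttfb="$1" -v t="$2" 'BEGIN {
    if      (t ~ /ms$/) secs = substr(t, 1, length(t) - 2) / 1000
    else if (t ~ /s$/)  secs = substr(t, 1, length(t) - 1) + 0
    else if (t ~ /m$/)  secs = substr(t, 1, length(t) - 1) * 60
    else if (t ~ /h$/)  secs = substr(t, 1, length(t) - 1) * 3600
    else                secs = t + 0
    verdict = (ttfb + 0 >= secs) ? "exceeds it: likely 504" : "within it"
    printf "ttfb=%.3fs timeout=%gs %s\n", ttfb, secs, verdict
  }'
}
```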

② Set timeouts per endpoint

location /api/quick/  { proxy_pass http://backend; proxy_read_timeout 5s; }
location /api/export/ { proxy_pass http://backend; proxy_read_timeout 120s; }
location /api/poll/   { proxy_pass http://backend; proxy_read_timeout 3600s; proxy_buffering off; }
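To pick per-location values from data rather than guesswork, the slowest upstream_response_time per path prefix is a reasonable starting point. This sketch assumes the main_ext log format from the log configuration section, a two-level prefix such as /api/export/, and a hypothetical function name slowest_by_prefix. When Nginx retried several servers it logs comma-separated times ("0.500, 1.250"); they are summed here.

```shell
# Hypothetical helper: slowest upstream_response_time per two-level path prefix.
# Usage: slowest_by_prefix < /var/log/nginx/access.log
slowest_by_prefix() {
  awk '{
    path = $7; sub(/[?].*/, "", path)                 # drop the query string
    n = split(path, p, "/")
    if (n >= 4) prefix = "/" p[2] "/" p[3] "/"; else prefix = path
    s = $0
    if (!sub(/.*upstream_response_time=/, "", s)) next
    sub(/ request_time=.*/, "", s)
    k = split(s, parts, /, /)
    t = 0
    for (i = 1; i <= k; i++) t += parts[i]            # a "-" entry adds 0
    if (!(prefix in worst) || t > worst[prefix]) worst[prefix] = t
  }
  END { for (x in worst) printf "%.3f %s\n", worst[x], x }' | sort -rn
}
```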

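③ Find out why the backend is slow

Fix the cause rather than the symptom: first speed up the backend itself (slow SQL, timeouts in external dependencies, thread-pool queuing, Full GC pauses), and raise proxy_read_timeout only as a short-term measure to keep the service available.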

Connection Reset: four troubleshooting steps

① Confirm where the RST occurs

grep "Connection reset by peer" /var/log/nginx/error.log
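Beyond the raw grep, grouping resets by the phase Nginx reports (the `while ...` clause) and by upstream address shows whether one backend or one phase dominates. A sketch assuming the standard error.log layout, where the line carries `upstream: "..."`; resets_by_upstream is a hypothetical name.

```shell
# Hypothetical helper: count "Connection reset by peer" lines per phase and upstream.
# Usage: resets_by_upstream < /var/log/nginx/error.log
resets_by_upstream() {
  awk '/Connection reset by peer/ {
    phase = "unknown phase"; up = "unknown upstream"
    if (match($0, /while [^,]*/)) phase = substr($0, RSTART, RLENGTH)
    if (match($0, /upstream: "[^"]*"/)) {
      up = substr($0, RSTART + 11, RLENGTH - 12)   # text between the quotes
      sub(/^[a-z]+:\/\//, "", up)                  # drop "http://"
      sub(/\/.*/, "", up)                          # drop the URI, keep host:port
    }
    n[phase " | " up]++
  }
  END { for (k in n) print n[k], k }' | sort -rn
}
```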

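② Check the upstream's file descriptors and thread pool

cat /proc/$(pidof java)/limits | grep "open files"   # fd limit of the backend (a Java process here)
lsof -p $(pidof java) | wc -l                        # files currently open
ls /proc/$(pidof java)/task | wc -l                  # current thread count
netstat -s | grep -i "listen queue"                  # "N times the listen queue of a socket overflowed"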
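For a single percentage, the sketch below compares the number of open descriptors with the soft limit. It counts entries in /proc/PID/fd, which is what the open-files limit applies to; `lsof -p` also lists memory-mapped files and the working directory, so its count runs higher. It assumes Linux /proc and enough privileges to read the target process; fd_usage is a hypothetical name.

```shell
# Hypothetical helper: open fds of one process versus its soft RLIMIT_NOFILE.
# Usage: fd_usage "$(pidof -s java)"
fd_usage() {
  pid=$1
  soft=$(awk '/^Max open files/ { print $4 }' "/proc/$pid/limits")   # 4th field = soft limit
  used=$(ls "/proc/$pid/fd" | wc -l)
  awk -v u="$used" -v s="$soft" 'BEGIN { printf "%d/%d (%d%%)\n", u, s, u * 100 / s }'
}
```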

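③ Check Nginx connection counts

If the Active connections value reported by stub_status approaches worker_connections × worker_processes, Nginx is running short of connections. worker_connections also counts connections to upstreams, so a reverse proxy can serve only about half that many clients.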
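A rough version of that check, as a sketch: conn_pressure is a hypothetical name, /nginx_status is an assumed location where stub_status is enabled, and the estimate assumes up to one upstream connection per active client connection. It is therefore an upper bound, since idle keep-alive clients counted under Waiting hold no upstream connection.

```shell
# Hypothetical helper: estimated share of worker_connections in use.
# $1 = worker_connections, $2 = worker_processes; stub_status output on stdin.
# Usage: curl -s http://127.0.0.1/nginx_status | conn_pressure 1024 4
conn_pressure() {
  awk -v wc="$1" -v wp="$2" '
    /^Active connections:/ { active = $3 }
    END {
      cap = wc * wp
      # x2: assume each client connection also holds one upstream connection
      printf "active=%d capacity=%d est_usage=%d%%\n", active, cap, active * 2 * 100 / cap
    }'
}
```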

④ Check the TCP backlog

netstat -s | grep -i "listen"

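Raise the backlog: listen 8080 backlog=65535; + sysctl -w net.core.somaxconn=65535. The kernel silently caps any listen() backlog at net.core.somaxconn, so both values must be raised. The backlog= parameter of the Nginx listen directive only affects ports Nginx itself listens on; a backend sets its backlog in its own server configuration.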
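`netstat -s` only shows cumulative overflow counters. For a listening socket, `ss -lnt` shows the current accept-queue length in Recv-Q and the effective backlog in Send-Q (already capped by net.core.somaxconn), so a queue at its limit is visible directly. A sketch, with backlog_status as a hypothetical name and the usual ss column layout assumed.

```shell
# Hypothetical helper: accept-queue length vs backlog for one listening port.
# Usage: ss -lnt | backlog_status 8080
backlog_status() {
  awk -v port="$1" '$1 == "LISTEN" {
      n = split($4, a, ":")
      if (a[n] != port) next
      flag = ($2 + 0 >= $3 + 0) ? "FULL" : "ok"   # Recv-Q has reached Send-Q
      printf "%s queue=%d backlog=%d %s\n", $4, $2, $3, flag
    }'
}
```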

Log configuration (a prerequisite for troubleshooting)

Everything above assumes that access.log records the upstream fields:

log_format main_ext '$remote_addr - $remote_user [$time_local] '
    '"$request" $status $body_bytes_sent '
    'upstream_addr=$upstream_addr '
    'upstream_status=$upstream_status '
    'upstream_response_time=$upstream_response_time '
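    'request_time=$request_time';

# log_format only defines the format; an http or server block must use it:
access_log /var/log/nginx/access.log main_ext;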

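Without these fields you cannot tell which upstream handled a failing request or how long it took, so troubleshooting becomes much slower. Adding them should be the first thing you do when putting Nginx in front of a backend as a reverse proxy.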
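Once the upstream fields are logged, a short script can show which upstream returned the 502/504 responses. A sketch assuming the main_ext format above; bad_upstreams is a hypothetical name, and when Nginx tried several servers only the last one in upstream_addr (the server whose result was returned) is counted.

```shell
# Hypothetical helper: count final upstream_addr per 502/504 status.
# Usage: bad_upstreams < /var/log/nginx/access.log
bad_upstreams() {
  awk '$9 == 502 || $9 == 504 {
    s = $0
    if (!sub(/.*upstream_addr=/, "", s)) next
    sub(/ upstream_status=.*/, "", s)
    k = split(s, a, /, /)          # "10.0.1.11:8080, 10.0.1.10:8080" -> keep the last
    n[$9 " " a[k]]++
  }
  END { for (x in n) print n[x], x }' | sort -rn
}
```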

Failover and graceful degradation

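upstream backend {
    # 3 failures within 30s mark a server unavailable for 30s
    server 10.0.1.10:8080 max_fails=3 fail_timeout=30s;
    server 10.0.1.11:8080 max_fails=3 fail_timeout=30s;
    server 10.0.1.12:8080 backup;    # used only when the servers above are unavailable
}

server {
    location /api/ {
        proxy_pass http://backend;
        proxy_next_upstream error timeout http_502 http_503 http_504;
        proxy_next_upstream_tries 2;     # keep this small to avoid retry storms
        proxy_intercept_errors on;       # let error_page also catch 502/504 sent by the upstream itself
        error_page 502 504 = @fallback;
    }
    location @fallback {
        default_type application/json;
        return 200 '{"status":"degraded"}';
    }
}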
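The "keep this small" comment is a matter of arithmetic: each attempt can run until its own timeout, and every failing request can reach up to proxy_next_upstream_tries backends. A back-of-the-envelope sketch, assuming the tries value counts the first attempt and ignoring proxy_next_upstream_timeout (default 0, no limit) and connect/send timeouts; retry_budget is a hypothetical name. proxy_next_upstream_tries itself defaults to 0, which also means no limit.

```shell
# Hypothetical helper: rough worst case for proxy_next_upstream retries.
# $1 = proxy_next_upstream_tries, $2 = per-attempt timeout in whole seconds
# (e.g. proxy_read_timeout), $3 = failing requests per second.
retry_budget() {
  tries=$1 per_try=$2 failing_rps=$3
  echo "worst client wait ~$((tries * per_try))s, backend attempts ~$((tries * failing_rps))/s"
}
```

For example, `retry_budget 2 60 100` describes 100 failing requests per second against a 60s read timeout: clients may wait about 120s, and the backends receive about 200 attempts per second.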

Related pages

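Page	Connection
nginx-config-pitfalls	Pitfall 2: proxy_pass paths / Pitfall 4: upstream keepalive / 502/504 flowchart
nginx-pre-launch-checklist	Pre-launch checklist items for timeout parameters
network-troubleshooting-order	Basic network connectivity troubleshooting (firewall/routing/tcpdump)
fullstack-performance-troubleshooting	Entry point for full-stack performance troubleshooting
ops-automation-scripts	A set of 5 ops automation scripts: a log-anomaly alerting script monitors the Nginx error log
nginx-realtime-push-guide	SSE/WebSocket real-time push: reference for proxy timeouts and dropped connections
dns-troubleshooting-practical-guide	Practical DNS troubleshooting guide: dig/nslookup in detail, a four-stage troubleshooting process, SERVFAIL/
nginx-production-performance-optimization	Production-grade Nginx performance optimization: OS kernel / worker processes / HTTP I/O / upstream
nginx-log-analysis-troubleshooting-guide	Nginx log analysis, 4xx/5xx/timeout troubleshooting, real-world case reviews, log analysis toolchain, alerting/monitoring and performance tuning
nginx-troubleshooting-methodology-8-steps	An Nginx troubleshooting routine you can follow as-is in production, covering 502/504/499/500/con
linux-kernel-tuning-production	net.core.somaxconn / fs.file-max kernel parameters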