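Source: 马哥Linux运维 | Published: 2025-08-03
Keepalived+Nginx High Availability: 3 Hidden Pitfalls and How to Guard Against Them
Nginx+Keepalived is the classic high-availability setup for small and mid-sized deployments, but three pitfalls in production are easy to miss: split brain, health checks fooled by "zombie" processes, and the timing trap in configuration sync. Based on post-mortems of real incidents, this article gives detection scripts and a deployment procedure for each.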
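Pitfall 1: Split Brain (a Dual-Master Disaster Caused by Network Partition)
Problem
Most operators plan only for heartbeat failure and overlook the subtler dual-master condition caused by a network partition. For example, when a core switch port fails intermittently, the heartbeat between the master and backup nodes keeps dropping and recovering:
- The master assumes the backup is dead and keeps holding the VIP
- The backup assumes the master is dead and claims the VIP as well
- The same VIP now exists on two hosts at once → client requests land on either node at random → inconsistent sessions and out-of-sync data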
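One way to observe this dual-master state from a third machine on the same subnet is to send ARP requests for the VIP and count how many different MAC addresses answer. The sketch below is only an illustration and not part of the original setup: the function name `count_distinct_macs` is hypothetical, and it assumes the reply format printed by iputils `arping` (`Unicast reply from <ip> [<mac>]`).

```shell
#!/bin/bash
# Count the distinct MAC addresses found in arping-style reply lines on stdin.
# More than one MAC answering for the same VIP suggests two nodes hold it.
count_distinct_macs() {
    grep -oiE '([0-9a-f]{2}:){5}[0-9a-f]{2}' \
        | tr 'A-F' 'a-f' \
        | sort -u \
        | wc -l \
        | tr -d ' '
}

# Assumed usage on a real network (interface and VIP taken from this article):
#   arping -c 3 -I eth0 192.168.1.100 | count_distinct_macs
```

A result of 1 is the normal case; a result of 2 or more is worth an alert, since only one node should ever answer for the VIP.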
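Solution
Split-brain-resistant Keepalived configuration
Set both machines to BACKUP with nopreempt (preemption disabled) and let priority decide which one becomes master at startup: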
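vrrp_instance VI_1 {
    state BACKUP                  # both nodes are set to BACKUP
    interface eth0
    virtual_router_id 51
    priority 100                  # 100 on the preferred master, 90 on the backup
    advert_int 1
    nopreempt                     # disable preemption to reduce unnecessary switchovers
    authentication {
        auth_type PASS
        auth_pass your_complex_password_here
    }
    track_script {
        chk_nginx
        chk_network
    }
    notify_master "/etc/keepalived/scripts/check_split_brain.sh"
    virtual_ipaddress {
        192.168.1.100
    }
}
vrrp_script chk_nginx {
    script "/etc/keepalived/scripts/check_nginx.sh"
    interval 2
    # weight is left at its default of 0 so that a failed check puts the
    # instance into FAULT state and releases the VIP. A small negative weight
    # such as -2 would only lower priority 100 to 98, still above the backup's
    # 90, and with nopreempt the backup would not take over.
    fall 3
    rise 2
}
vrrp_script chk_network {
    script "/etc/keepalived/scripts/check_network.sh"
    interval 5
    # default weight 0 here as well: a failed check means FAULT state
    fall 2
    rise 1
}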
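The configuration above references check_network.sh without showing it. A minimal sketch of what such a script might do, assuming the goal is to confirm that this node can still reach its default gateway. The gateway address 192.168.1.1, the `PROBE` variable, and the function name `any_reachable` are all assumptions made for illustration.

```shell
#!/bin/bash
# check_network.sh (hypothetical sketch; not shown in the original article)
GATEWAY="${GATEWAY:-192.168.1.1}"    # assumed default gateway
PROBE="${PROBE:-ping -c 1 -W 1}"     # probe command, overridable for testing

# Return 0 if at least one of the given targets answers the probe.
any_reachable() {
    local target
    for target in "$@"; do
        # $PROBE is deliberately unquoted so "ping -c 1 -W 1" splits into words
        $PROBE "$target" >/dev/null 2>&1 && return 0
    done
    return 1
}

# As a Keepalived track script the last line would be:
#   any_reachable "$GATEWAY" || exit 1
```

Passing more than one target (for example the gateway plus a second upstream address) makes the check less sensitive to a single device being briefly unreachable.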
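Split-brain detection script
#!/bin/bash
# check_split_brain.sh
REMOTE_IP="192.168.1.11"   # peer node IP
VIP="192.168.1.100"
if ping -c 1 -W 1 "$REMOTE_IP" >/dev/null 2>&1; then
    # Peer is reachable; check whether it has also bound the VIP
    if ssh -o BatchMode=yes -o ConnectTimeout=2 -o StrictHostKeyChecking=no "$REMOTE_IP" \
        "ip -o addr show | grep -qF ' $VIP/'" >/dev/null 2>&1; then
        # Split brain detected: release the VIP immediately and raise an alert
        logger "CRITICAL: Split brain detected! Releasing VIP..."
        # Keepalived adds the VIP as /32 when no mask is given in virtual_ipaddress
        ip addr del "$VIP/32" dev eth0
        curl -X POST "your_alert_webhook" \
            -d "Split brain detected on $(hostname)"
        exit 1
    fi
fi
Key point: notify_master fires when the instance transitions to MASTER. The script then checks whether the peer also holds the VIP and, if it does, releases the VIP locally and sends an alert.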
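Pitfall 2: The Zombie-Process Trap in Health Checks
Problem
Most health check scripts only verify that the process exists, not that the service can actually serve requests:
# ❌ Wrong: checks only for the process
ps -ef | grep nginx | grep -v grep
if [ $? -ne 0 ]; then exit 1; fi
When nginx worker processes hang or turn into zombies (for example after a memory leak), the master process is still alive but no request gets served. A process-only check keeps reporting healthy → Keepalived never triggers failover → every user request fails.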
Production-grade health check script
#!/bin/bash
# Six-layer health check
NGINX_PID=$(ps -ef | grep "nginx: master" | grep -v grep | awk '{print $2}')
VIP="192.168.1.100"
CHECK_URL="http://127.0.0.1/health"
# 1. Process check
if [ -z "$NGINX_PID" ]; then
    logger "Nginx master process not found"; exit 1
fi
# 2. Listening-port check
if ! netstat -tlnp 2>/dev/null | grep ":80 " | grep -q nginx; then
    logger "Nginx port 80 not listening"; exit 1
fi
# 3. Configuration syntax check
if ! nginx -t >/dev/null 2>&1; then
    logger "Nginx configuration syntax error"; exit 1
fi
# 4. Real HTTP request check (the critical one)
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
    --connect-timeout 2 --max-time 5 "$CHECK_URL")
if [ "$HTTP_CODE" != "200" ]; then
    logger "Nginx health check failed, HTTP code: $HTTP_CODE"
    # Try one automatic restart before giving up
    systemctl restart nginx
    sleep 2
    HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
        --connect-timeout 2 --max-time 5 "$CHECK_URL")
    if [ "$HTTP_CODE" != "200" ]; then
        logger "Nginx restart failed, triggering failover"
        exit 1
    fi
fi
# 5. System load check (1-minute load average)
LOAD=$(uptime | awk -F'load average:' '{print $2}' | awk '{print $1}' | sed 's/,//')
if (( $(echo "$LOAD > 10" | bc -l) )); then
    logger "System load too high: $LOAD"; exit 1
fi
# 6. Memory usage check (used / total, in percent)
MEM_USAGE=$(free | grep Mem | awk '{printf("%.2f", $3/$2 * 100.0)}')
if (( $(echo "$MEM_USAGE > 90" | bc -l) )); then
    logger "Memory usage too high: $MEM_USAGE%"; exit 1
fi
logger "Nginx health check passed"
exit 0
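The load and memory comparisons above depend on `bc`, which is not installed on every minimal system. As a hedged alternative, the same floating-point comparison can be done with `awk` alone. The helper name `load_exceeds` is hypothetical and not part of the original script.

```shell
#!/bin/bash
# Return 0 (true) if the value in $1 is strictly greater than the threshold in $2.
# Both arguments are compared as numbers, so "9.5" is correctly below "10".
load_exceeds() {
    awk -v load="$1" -v max="$2" 'BEGIN { exit !(load > max) }'
}

# Assumed usage on Linux, reading the 1-minute load average directly:
#   read -r load1 _ < /proc/loadavg
#   load_exceeds "$load1" 10 && { logger "System load too high: $load1"; exit 1; }
```

The same helper works for the memory percentage, for example `load_exceeds "$MEM_USAGE" 90`.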
Nginx health check endpoints
location /health {
    access_log off;
    default_type text/plain;
    return 200 "healthy\n";
}
# Requires OpenResty or ngx_http_lua_module with lua-cjson; the $connections_*
# variables come from ngx_http_stub_status_module
location /health/detailed {
    access_log off;
    content_by_lua_block {
        local json = require "cjson"
        local health_data = {
            status = "healthy",
            timestamp = ngx.time(),
            connections = {
                active = ngx.var.connections_active,
                reading = ngx.var.connections_reading,
                writing = ngx.var.connections_writing,
                waiting = ngx.var.connections_waiting
            }
        }
        ngx.say(json.encode(health_data))
    }
}
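Pitfall 3: The Timing Trap in Configuration Sync
Problem
When updating the Nginx configuration, restarting the two servers in the wrong order can take the service down completely.
Typical incident: a configuration update adds a new upstream, and the procedure is "master first, then backup":
- Update the master's configuration → restart nginx
- Update the backup's configuration → restart nginx
While nginx restarts on the master, the Keepalived health check sees the failure and immediately moves the VIP to the backup. The backup is still running the old configuration, in which the new upstream does not exist, so every user request fails.
Updating the backup first is not risk-free either: the master keeps serving with the old configuration, and updating it later, after the maintenance window, can still interrupt live connections.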
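Before touching a node it helps to know whether that node currently holds the VIP, since changing the active node is the riskier step. A small sketch, assuming the VIP shows up in the output of `ip -o -4 addr show`; the function name `holds_vip` is hypothetical.

```shell
#!/bin/bash
# Read `ip -o -4 addr show` output on stdin and succeed only if the exact
# address $1 is assigned. Field 4 looks like "192.168.1.100/32".
holds_vip() {
    awk -v vip="$1" '
        { split($4, parts, "/"); if (parts[1] == vip) found = 1 }
        END { exit !found }
    '
}

# Assumed usage:
#   if ip -o -4 addr show | holds_vip 192.168.1.100; then
#       echo "This node is active; update the standby node first"
#   fi
```

Comparing the whole address avoids false matches such as 192.168.1.10 against 192.168.1.100, which a plain substring `grep` would report as a hit.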
Safe configuration update script
#!/bin/bash
# safe_nginx_update.sh
VIP="192.168.1.100"
OTHER_NODE="192.168.1.11"
NGINX_CONF_DIR="/etc/nginx"
# Kept outside $NGINX_CONF_DIR so that cp -r does not copy the backup directory into itself
BACKUP_DIR="/var/backups/nginx/$(date +%Y%m%d_%H%M%S)"
safe_restart_nginx() {
    nginx -t || { echo "Config test failed"; return 1; }
    # Graceful reload (existing connections are not dropped)
    nginx -s reload
    sleep 2
    # Verify that the service is answering again
    curl -s -o /dev/null -w "%{http_code}" \
        --connect-timeout 2 --max-time 5 http://127.0.0.1/health \
        | grep -q "200" || return 1
}
main() {
    echo "=== Nginx Safe Update ==="
    # 1. Back up the current configuration
    echo "Step 1: Backing up current configuration..."
    mkdir -p "$BACKUP_DIR"
    cp -r "$NGINX_CONF_DIR"/* "$BACKUP_DIR"/
    echo "Backup saved to $BACKUP_DIR"
    # 2. Deploy the new configuration to the current node
    echo "Step 2: Deploying new configuration..."
    # Deploy the new configuration here with rsync/scp/puppet or similar
    # 3. Verify that the peer node is healthy (-f makes curl fail on HTTP errors)
    echo "Step 3: Verifying peer node health..."
    if ! ssh "root@$OTHER_NODE" "curl -sf http://127.0.0.1/health" >/dev/null; then
        echo "Peer node health check failed!"; exit 1
    fi
    # 4. Reload nginx on the current node
    echo "Step 4: Restarting nginx on current node..."
    safe_restart_nginx || { echo "Failed to restart nginx!"; exit 1; }
    # 5. Final verification through the VIP
    echo "Step 5: Final verification..."
    if curl -sf "http://$VIP/health" >/dev/null; then
        echo "✅ All services healthy!"
    else
        echo "❌ Service verification failed!"; exit 1
    fi
}
main "$@"
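The script above saves a backup but does not restore it when the reload or the verification fails. A minimal rollback helper might look like the following. The function name `rollback_config` is hypothetical, and the helper only copies the saved files back, so files that exist only in the new configuration are left in place.

```shell
#!/bin/bash
# Restore a previously saved configuration directory.
#   $1 = backup directory, $2 = live configuration directory
rollback_config() {
    local backup_dir="$1" conf_dir="$2"
    if [ ! -d "$backup_dir" ]; then
        echo "Backup not found: $backup_dir" >&2
        return 1
    fi
    # "dir/." copies the directory contents, including dotfiles
    cp -a "$backup_dir"/. "$conf_dir"/
}

# Assumed usage inside the update script:
#   safe_restart_nginx || {
#       rollback_config "$BACKUP_DIR" "$NGINX_CONF_DIR" && nginx -s reload
#       exit 1
#   }
```

Since `nginx -t` runs before the reload, a syntactically broken configuration never goes live; the rollback mainly covers the case where the new configuration loads but the health endpoint stops answering.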
CI/CD integration example
# GitLab CI
deploy_nginx_config:
  stage: deploy
  script:
    - ansible-playbook -i inventory/production nginx_update.yml
  only:
    - master
  when: manual  # manual trigger to avoid accidental runs
# Ansible playbook (serial: 1 runs one host at a time)
- name: Update nginx configuration safely
  hosts: nginx_servers
  serial: 1
  tasks:
    - name: Backup current configuration
      copy:
        src: /etc/nginx/nginx.conf
        dest: "/etc/nginx/nginx.conf.backup.{{ ansible_date_time.epoch }}"
        remote_src: yes
    - name: Update configuration
      template:
        src: nginx.conf.j2
        dest: /etc/nginx/nginx.conf
        backup: yes
      notify: restart nginx safely
  handlers:
    - name: restart nginx safely
      script: /usr/local/bin/safe_restart_nginx.sh
Core principle: serial: 1 together with when: manual rolls the change out one host at a time. Combined with the safe restart script, this ensures that at most one nginx is restarting at any moment and the VIP stays available.
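Summary and best practices
| Area | Safeguards |
|---|---|
| Split-brain protection | Both nodes BACKUP + nopreempt + notify_master detection + multiple track_script checks |
| Health checks | Six-layer check (process → port → config → HTTP → load → memory) + automatic restart and retry |
| Configuration sync | Host-by-host deployment + graceful reload + verification before and after + automatic rollback |
| CI/CD integration | serial: 1 host-by-host execution + manual trigger + Ansible handler |
| Operations practice | Monitoring and alerting that cover failover events + monthly failure drills + written documentation |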
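Related pages
| Page | Relevance |
|---|---|
| keepalived-nginx-ha | Keepalived+Nginx high-availability architecture and basic configuration |
| nginx-config-pitfalls | Post-mortems of typical Nginx configuration mistakes |
| nginx-pre-launch-checklist | Nginx pre-launch checklist |
| nginx-log-analysis-monitoring-guide | Nginx log analysis and monitoring |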