
Keepalived+Nginx High Availability in Practice: 3 Hidden Pitfalls and Production-Grade Safeguards

📅 Created 2026-05-26 🔄 Updated 2026-05-26 📝 958 characters

nginx keepalived networking production ha troubleshooting monitoring architecture

Source: 马哥Linux运维 | Published: 2025-08-03

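Keepalived+Nginx High Availability: 3 Hidden Pitfalls and How to Guard Against Them

Nginx+Keepalived is the classic high-availability setup for small and medium deployments, but in production it has three well-hidden pitfalls: split brain, health checks fooled by zombie processes, and the timing trap in configuration sync. Drawing on post-mortems of real incidents, this article provides detection scripts and a deployment procedure.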

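Pitfall 1: Split Brain, the Dual-Master Disaster Caused by a Network Partition

Problem

Most operators plan for a failed heartbeat but overlook the subtler dual-master problem caused by a network partition. For example, an intermittently failing port on the core switch makes the heartbeat between the master and backup nodes come and go:

The master believes the backup is dead and keeps holding the VIP
The backup believes the master is dead and also claims the VIP
The same VIP now appears twice on the network → client requests land on either node at random → inconsistent sessions and unsynchronized data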
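One way to observe the duplicate-VIP symptom from a third host on the same L2 segment is to count how many distinct MAC addresses answer ARP for the VIP. The sketch below only parses text; it assumes iputils-style `arping` reply lines (`Unicast reply from <ip> [<mac>] ...`), and the function name `count_vip_owners` and the sample MAC addresses are hypothetical.

```shell
#!/bin/bash
# count_vip_owners: read arping output on stdin and print how many distinct
# MAC addresses answered. Assumes iputils-style reply lines such as:
#   Unicast reply from 192.168.1.100 [00:11:22:33:44:55]  0.712ms
count_vip_owners() {
    grep -oE '\[([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\]' \
        | tr 'a-f' 'A-F' | sort -u | wc -l | tr -d ' '
}

# Canned sample (made-up MACs) so the logic can be tried without a network.
sample='ARPING 192.168.1.100 from 192.168.1.50 eth0
Unicast reply from 192.168.1.100 [00:11:22:33:44:55]  0.712ms
Unicast reply from 192.168.1.100 [00:AA:BB:CC:DD:EE]  0.801ms
Unicast reply from 192.168.1.100 [00:11:22:33:44:55]  0.695ms'

owners=$(printf '%s\n' "$sample" | count_vip_owners)
if [ "$owners" -gt 1 ]; then
    echo "possible split brain: $owners MAC addresses answer for the VIP"
fi
```

On a live network the input would come from something like `arping -c 3 -I eth0 192.168.1.100 | count_vip_owners`. This only helps when each node answers with its own interface MAC; if the instances share a virtual MAC, both nodes reply with the same address and the count stays at 1.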

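Solution

Split-brain-resistant Keepalived configuration

Set both machines to BACKUP with nopreempt (preemption disabled; nopreempt only takes effect when the initial state is BACKUP), and let priority decide which one becomes master: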

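# Define the track scripts before the vrrp_instance that references them.
vrrp_script chk_nginx {
    script "/etc/keepalived/scripts/check_nginx.sh"
    interval 2
    # No weight here: with the default weight of 0, a failed script puts the
    # instance into FAULT and the VIP is released. A weight of -2 would only
    # lower priority 100 to 98, which still outranks the backup's 90.
    fall 3
    rise 2
}

vrrp_script chk_network {
    script "/etc/keepalived/scripts/check_network.sh"
    interval 5
    fall 2
    rise 1
}

vrrp_instance VI_1 {
    state BACKUP          # both nodes start as BACKUP
    interface eth0
    virtual_router_id 51
    priority 100          # 100 on the preferred node, 90 on the other
    advert_int 1
    nopreempt             # a recovered node does not take the VIP back, so fewer switchovers

    authentication {
        auth_type PASS
        # keepalived uses only the first 8 characters of auth_pass
        auth_pass your_complex_password_here
    }

    track_script {
        chk_nginx
        chk_network
    }

    notify_master "/etc/keepalived/scripts/check_split_brain.sh"

    virtual_ipaddress {
        192.168.1.100
    }
}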
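When a track script carries a negative weight, keepalived adds that weight to the instance priority while the script is failing, so whether a failure can change the election depends on the gap between the two priorities. The arithmetic below is a minimal sketch using the article's priorities (100 and 90) and a weight such as -2; the function name `effective_priority` is hypothetical.

```shell
#!/bin/bash
# effective_priority BASE WEIGHT FAILED
# Priority after FAILED tracked scripts, each carrying a negative WEIGHT,
# are in the failed state.
effective_priority() {
    local base=$1 weight=$2 failed=$3
    echo $(( base + weight * failed ))
}

master=$(effective_priority 100 -2 2)   # chk_nginx and chk_network both failing
backup=$(effective_priority 90 -2 0)    # healthy peer
echo "master=$master backup=$backup"
if [ "$master" -gt "$backup" ]; then
    echo "no failover: the degraded master still outranks the backup"
fi
```

With these numbers the degraded master sits at 96 against 90, so the weight would need a magnitude larger than the 10-point gap to matter. With nopreempt set, a backup generally does not take over from a master that is still advertising even when it has the higher priority, which is why leaving weight at its default of 0 (failure moves the instance to FAULT) is the more predictable choice here.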

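Split-brain detection script

#!/bin/bash
# check_split_brain.sh: run by notify_master when this node becomes MASTER.
REMOTE_IP="192.168.1.11"    # peer node IP
VIP="192.168.1.100"
VIP_PREFIX=32               # must match virtual_ipaddress; keepalived uses /32 when no mask is given
IFACE="eth0"

if ping -c 1 -W 1 "$REMOTE_IP" >/dev/null 2>&1; then
    # Peer is reachable: check whether it also has the VIP bound.
    # BatchMode keeps ssh from hanging on a password prompt.
    if ssh -o BatchMode=yes -o ConnectTimeout=2 -o StrictHostKeyChecking=no "$REMOTE_IP" \
        "ip -o -4 addr show | grep -qF 'inet $VIP/'"; then
        # Split brain detected: release the VIP immediately and raise an alert.
        logger "CRITICAL: Split brain detected! Releasing VIP..."
        ip addr del "$VIP/$VIP_PREFIX" dev "$IFACE"
        curl -s -X POST "your_alert_webhook" \
            -d "Split brain detected on $(hostname)"
        exit 1
    fi
fi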

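Key point: notify_master fires when the instance becomes MASTER → the script checks whether the peer also holds the VIP → if both do, it releases the VIP immediately and raises an alert.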
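The chk_network track script in the configuration above calls check_network.sh, which the article does not show. A minimal sketch of what such a script might look like follows, assuming it should succeed when at least one reference address answers a ping. The function name `any_reachable`, the overridable `PING_CMD` variable, and the choice of targets are all hypothetical.

```shell
#!/bin/bash
# check_network.sh (hypothetical): succeed if at least one target answers a ping.
# PING_CMD can be overridden so the logic can be exercised without a network.
PING_CMD=${PING_CMD:-ping -c 1 -W 1}

any_reachable() {
    local target
    for target in "$@"; do
        # Word splitting of $PING_CMD is intentional: it holds a command plus flags.
        if $PING_CMD "$target" >/dev/null 2>&1; then
            return 0
        fi
    done
    return 1
}

# Targets come from the command line; with no arguments the script is a no-op.
[ $# -eq 0 ] || any_reachable "$@"
```

It could be referenced as `script "/etc/keepalived/scripts/check_network.sh <gateway-ip>"`. Probing something upstream such as the default gateway, rather than only the peer, lets the isolated side of a partition notice that it is the one cut off.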

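Pitfall 2: The Zombie-Process Trap in Health Checks

Problem

Most health-check scripts only test whether the process exists, not whether the service can actually serve requests:

# ❌ Bad example: checks the process only
ps -ef | grep nginx | grep -v grep
if [ $? -ne 0 ]; then exit 1; fi

When nginx worker processes hang or stop responding (the original incident attributes this to a memory leak leaving "zombie" workers), the master process is still alive but no request is served. A plain process check keeps reporting healthy → Keepalived never triggers a failover → every user request fails.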

Production-grade health-check script

#!/bin/bash
# check_nginx.sh: six-layer health check
NGINX_PID=$(ps -ef | grep "nginx: master" | grep -v grep | awk '{print $2}')
VIP="192.168.1.100"
CHECK_URL="http://127.0.0.1/health"

# 1. Process check
if [ -z "$NGINX_PID" ]; then
    logger "Nginx master process not found"; exit 1
fi

# 2. Listening-port check (ss replaces netstat, which many distributions no longer install)
if ! ss -tlnp | grep ":80 " | grep -q nginx; then
    logger "Nginx port 80 not listening"; exit 1
fi

# 3. Configuration syntax check
if ! nginx -t >/dev/null 2>&1; then
    logger "Nginx configuration syntax error"; exit 1
fi

# 4. Real HTTP request check (the critical one)
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
    --connect-timeout 2 --max-time 5 "$CHECK_URL")
if [ "$HTTP_CODE" != "200" ]; then
    logger "Nginx health check failed, HTTP code: $HTTP_CODE"
    # Try one automatic repair before giving up.
    systemctl restart nginx
    sleep 2
    HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
        --connect-timeout 2 --max-time 5 "$CHECK_URL")
    if [ "$HTTP_CODE" != "200" ]; then
        logger "Nginx restart failed, triggering failover"
        exit 1
    fi
fi

# 5. System load check (1-minute load average; requires bc)
LOAD=$(uptime | awk -F'load average:' '{print $2}' | awk '{print $1}' | sed 's/,//')
if (( $(echo "$LOAD > 10" | bc -l) )); then
    logger "System load too high: $LOAD"; exit 1
fi

# 6. Memory usage check (used / total, in percent)
MEM_USAGE=$(free | grep Mem | awk '{printf("%.2f", $3/$2 * 100.0)}')
if (( $(echo "$MEM_USAGE > 90" | bc -l) )); then
    logger "Memory usage too high: $MEM_USAGE%"; exit 1
fi

logger "Nginx health check passed"
exit 0
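Because the HTTP step retries after a restart, a single run of this script can take much longer than the 2-second interval configured for chk_nginx in the first section. The arithmetic below is a rough sketch using only the timeouts and the interval/fall values that appear in the article; the variable names are illustrative, and the time taken by `systemctl restart nginx` itself is not counted.

```shell
#!/bin/bash
# Worst-case wall time of the HTTP step, from the curl and sleep values in the script.
MAX_TIME=5      # curl --max-time
SLEEP=2         # pause after systemctl restart nginx
INTERVAL=2      # vrrp_script chk_nginx: interval
FALL=3          # vrrp_script chk_nginx: fall

worst_case=$(( MAX_TIME + SLEEP + MAX_TIME ))   # first probe + sleep + retry probe
detect_window=$(( INTERVAL * FALL ))            # consecutive failed runs before the check counts as down

echo "worst-case HTTP step: ${worst_case}s (restart time not included)"
echo "nominal detection window: ${detect_window}s"
if [ "$worst_case" -gt "$INTERVAL" ]; then
    echo "warning: one run can outlast the ${INTERVAL}s interval"
fi
```

How keepalived treats a script run that outlasts its interval depends on the version and its script timeout settings, so it is worth checking the keepalived.conf documentation for the release in use, or tightening the curl timeouts so a run fits inside the interval.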

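Nginx health-check endpoints

location /health {
    access_log off;
    # default_type sets the header once; add_header would emit a second Content-Type.
    default_type text/plain;
    return 200 "healthy\n";
}

# Requires lua-nginx-module (for example OpenResty) with lua-cjson; the
# $connections_* variables come from ngx_http_stub_status_module.
location /health/detailed {
    access_log off;
    default_type application/json;
    content_by_lua_block {
        local json = require "cjson"
        local health_data = {
            status = "healthy",
            timestamp = ngx.time(),
            connections = {
                active = ngx.var.connections_active,
                reading = ngx.var.connections_reading,
                writing = ngx.var.connections_writing,
                waiting = ngx.var.connections_waiting
            }
        }
        ngx.say(json.encode(health_data))
    }
}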

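Pitfall 3: The Timing Trap in Configuration Sync

Problem

When the Nginx configuration is updated, restarting the two servers in the wrong order can take the service down completely.

Typical incident: a new upstream is added, and the procedure is "master first, then backup":

Update the master's configuration → restart nginx
Update the backup's configuration → restart nginx

While nginx restarts on the master, the Keepalived health check sees the failure → the VIP moves to the backup immediately. The backup is still running the old configuration, in which the new upstream does not exist → every user request fails.

Updating the backup first has its own problem: the master keeps running the old configuration, and by the time the master is updated after the maintenance window, restarting it may drop connections that are in flight.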
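Since the risk in both orderings depends on which node is carrying traffic at that moment, it helps to check VIP ownership before touching a node. The sketch below parses `ip -o -4 addr show` style output from stdin; the function name `holds_vip`, the local address 192.168.1.10, and the sample lines are assumptions for illustration.

```shell
#!/bin/bash
# holds_vip VIP: read `ip -o -4 addr show` output on stdin; succeed if VIP is bound.
holds_vip() {
    local vip=$1
    # Match "inet <VIP>/" literally, so looking for 192.168.1.10 does not
    # match a line that carries 192.168.1.100.
    grep -qF "inet ${vip}/"
}

# Canned sample: a node with its own address plus the VIP.
sample='2: eth0    inet 192.168.1.10/24 brd 192.168.1.255 scope global eth0\       valid_lft forever preferred_lft forever
2: eth0    inet 192.168.1.100/32 scope global eth0\       valid_lft forever preferred_lft forever'

if printf '%s\n' "$sample" | holds_vip 192.168.1.100; then
    echo "this node holds the VIP: a failed reload here moves traffic to the peer"
else
    echo "this node is standby: it is not serving VIP traffic right now"
fi
```

On a real node the input would be `ip -o -4 addr show | holds_vip 192.168.1.100`.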

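Safe configuration update script

#!/bin/bash
# safe_nginx_update.sh: run on one node at a time.
VIP="192.168.1.100"
OTHER_NODE="192.168.1.11"
NGINX_CONF_DIR="/etc/nginx"
# Keep backups outside NGINX_CONF_DIR so cp -r does not try to copy the
# backup directory into itself (this path is an example).
BACKUP_DIR="/var/backups/nginx/$(date +%Y%m%d_%H%M%S)"

safe_restart_nginx() {
    nginx -t || { echo "Config test failed"; return 1; }
    # Graceful reload: existing connections are not dropped.
    nginx -s reload
    sleep 2
    # Confirm the service answers after the reload.
    curl -s -o /dev/null -w "%{http_code}" \
        --connect-timeout 2 --max-time 5 http://127.0.0.1/health \
        | grep -q "200" || return 1
}

rollback() {
    # Restore the backed-up files and reload; files added by the new
    # configuration are not removed.
    echo "Rolling back to $BACKUP_DIR..."
    cp -a "$BACKUP_DIR"/. "$NGINX_CONF_DIR"/
    nginx -t && nginx -s reload
}

main() {
    echo "=== Nginx Safe Update ==="

    # 1. Back up the current configuration
    echo "Step 1: Backing up current configuration..."
    mkdir -p "$BACKUP_DIR"
    cp -r "$NGINX_CONF_DIR"/* "$BACKUP_DIR"/
    echo "Backup saved to $BACKUP_DIR"

    # 2. Deploy the new configuration to this node
    echo "Step 2: Deploying new configuration..."
    # Deploy the new configuration here with rsync/scp/puppet or similar

    # 3. Verify the peer is healthy (-f makes curl fail on HTTP errors)
    echo "Step 3: Verifying peer node health..."
    if ! ssh root@"$OTHER_NODE" "curl -sf --max-time 5 http://127.0.0.1/health" >/dev/null; then
        echo "Peer node health check failed!"; rollback; exit 1
    fi

    # 4. Reload nginx on this node
    echo "Step 4: Reloading nginx on current node..."
    safe_restart_nginx || { echo "Failed to reload nginx!"; rollback; exit 1; }

    # 5. Final verification through the VIP
    echo "Step 5: Final verification..."
    if curl -sf --max-time 5 "http://$VIP/health" >/dev/null; then
        echo "✅ All services healthy!"
    else
        echo "❌ Service verification failed!"; exit 1
    fi
}
main "$@"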

CI/CD integration example

# GitLab CI (.gitlab-ci.yml)
deploy_nginx_config:
  stage: deploy
  script:
    - ansible-playbook -i inventory/production nginx_update.yml
  only:
    - master
  when: manual      # manual trigger, to avoid accidental runs

# Ansible playbook (nginx_update.yml); serial: 1 processes one host at a time
- name: Update nginx configuration safely
  hosts: nginx_servers
  serial: 1
  tasks:
    - name: Backup current configuration
      copy:
        src: /etc/nginx/nginx.conf
        dest: "/etc/nginx/nginx.conf.backup.{{ ansible_date_time.epoch }}"
        remote_src: yes
    - name: Update configuration
      template:
        src: nginx.conf.j2
        dest: /etc/nginx/nginx.conf
        backup: yes
      notify: restart nginx safely
  handlers:
    - name: restart nginx safely
      script: /usr/local/bin/safe_restart_nginx.sh

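Core principle: serial: 1 plus when: manual rolls the change out one host at a time; combined with the safe restart script, at most one nginx is being restarted at any moment, so the VIP stays available.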

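Summary and Best Practices

Area	Safeguard
Split-brain protection	Both nodes BACKUP + nopreempt + notify_master detection + multiple track_script checks
Health checks	Six-layer check (process → port → config → HTTP → load → memory) + automatic repair and retry
Configuration sync	One node at a time + graceful reload + verification before and after + automatic rollback
CI/CD integration	serial: 1 host-by-host execution + manual trigger + Ansible handler
Operations practice	Monitoring and alerting that cover failover events + monthly failure drills + written documentation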

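Related Pages

Page	Relevance
keepalived-nginx-ha	Keepalived+Nginx high-availability architecture and basic configuration
nginx-config-pitfalls	Post-mortem of typical Nginx configuration mistakes
nginx-pre-launch-checklist	Nginx pre-launch checklist
nginx-log-analysis-monitoring-guide	Nginx log analysis and monitoring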