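Network Quality Monitoring and Automated Failover: MTR, Latency Probing, and Self-Healing Proxies

Last updated: June 2026

If you manage several proxy nodes, these problems will sound familiar: a node has already gone down, yet clients keep retrying against it; latency spikes during the evening peak, but nobody notices until users complain; the proxy service has been down for half an hour, and you hear about it from someone else.

Network quality monitoring and automated failover close the loop in proxy operations: they move your setup from reacting to failures to detecting and handling them proactively. This article covers a complete monitoring toolchain (MTR, iperf3, Prometheus + Grafana), proxy health-check scripts, and several ways to implement automatic failover.

🧭 Why monitor the network?

Common types of proxy failure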

PLAINTEXT

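┌──────────────────────────────────────────────────────────┐
│                   Proxy failure types                    │
│                                                          │
│  1. Node completely unreachable                          │
│     - VPS IP blocked or blacklisted                      │
│     - Firewall misconfiguration                          │
│     - Service process crashed                            │
│                                                          │
│  2. Degraded node performance                            │
│     - Bandwidth throttled                                │
│     - High latency (ping > 300 ms)                       │
│     - Heavy packet loss (> 5%)                           │
│     - CPU/memory overload                                │
│                                                          │
│  3. Protocol-level failures                              │
│     - Expired TLS certificate                            │
│     - Broken Reality/VLESS configuration                 │
│     - Port blocked                                       │
│                                                          │
│  4. Regional failures                                    │
│     - All nodes in one region fail at once               │
│     - Peering disruption                                 │
│     - Submarine cable fault                              │
└──────────────────────────────────────────────────────────┘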

Passive waiting vs. active monitoring

PLAINTEXT

┌──────────────────────────────────────────────────────────┐
│  Passive mode (most users)                               │
│                                                          │
│  Node fails → user notices → manual switch → recovery    │
│       │                                                  │
│       └──► Average downtime: 15-60 minutes               │
│                                                          │
│  Active monitoring (recommended)                         │
│                                                          │
│  Probe → detect fault → auto switch → users unaffected   │
│       │                                                  │
│       └──► Average downtime: < 30 seconds                │
└──────────────────────────────────────────────────────────┘

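📡 The basic monitoring toolchain

1. MTR: probing path quality

MTR combines traceroute and ping: it keeps probing every hop on the path to the target host and reports the loss and latency seen at each one.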

BASH

# Install MTR
# macOS
brew install mtr

# Linux (Debian/Ubuntu)
apt install mtr-tiny

# Windows: download WinMTR from https://winmtr.net/download/

# Basic usage: probe the route and latency to the target server
mtr -rwc 50 YOUR_VPS_IP

# Options:
# -r: report mode (print a summary once all probes are sent, then exit)
# -w: wide report (do not truncate long hostnames)
# -c 50: send 50 probes to each hop (ICMP echo by default)
# -n: skip reverse DNS lookups and show numeric IPs only

# Sample output:
#                         My traceroute  [v0.92]
# VPS-IP (VPS-IP)
# Host                                           Loss%   Snt   Last   Avg  Best  Wrst StDev
# 1. _gateway                                     0.0%    50    1.2    1.8    0.8    3.2    0.5
# 2. 10.0.0.1                                     0.0%    50    2.1    2.4    1.9    4.1    0.4
# 3. 72.14.215.85                                 0.0%    50   12.3   15.7   11.2   28.4    3.8
# 4. VPS-PUBLIC-IP                                0.0%    50   45.2   48.9   43.1   62.3    4.2

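How to read the key MTR metrics (RTT below means the path's usual round-trip time):

Metric	Meaning	Normal	Warning	Critical
Loss%	Packet loss	0%	1-5%	> 5%
Last	Latency of the most recent probe	< RTT	RTT × 1.5	RTT × 2+
Avg	Average latency	< 100 ms (domestic)	100-300 ms	> 300 ms
Best	Lowest latency	Close to the physical floor	Normal	n/a
Wrst	Highest latency	< Avg × 2	Avg × 2-3	> Avg × 3
StDev	Jitter (standard deviation of latency)	< 5 ms	5-20 ms	> 20 ms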
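To apply these thresholds without reading the table by eye, the report can be filtered with a few lines of awk. The sketch below is an illustration, not part of MTR: `mtr_flag_hops` is a hypothetical helper name, and it assumes the column order of the sample report (Loss%, Snt, Last, Avg, Best, Wrst, StDev as the last seven columns). It prints every hop whose loss or StDev exceeds the "critical" values, which default to 5% and 20 ms.

```shell
#!/bin/sh
# mtr_flag_hops [MAX_LOSS_PERCENT] [MAX_STDEV_MS]
# Reads an mtr report on stdin and prints the hops that exceed either limit.
# Hypothetical helper; assumes the last seven columns are
# Loss% Snt Last Avg Best Wrst StDev, as in the sample report.
mtr_flag_hops() {
  awk -v max_loss="${1:-5}" -v max_stdev="${2:-20}" '
    # Data rows have a percentage in the 7th column from the right;
    # the header row ("Loss%") does not match this pattern.
    NF >= 8 && $(NF-6) ~ /^[0-9.]+%$/ {
      loss = $(NF-6); sub(/%$/, "", loss)
      stdev = $NF
      host = $(NF-7)
      if (loss + 0 > max_loss + 0 || stdev + 0 > max_stdev + 0)
        printf "%s loss=%s%% stdev=%sms\n", host, loss, stdev
    }'
}
```

Typical use would be `mtr -rwc 50 YOUR_VPS_IP | mtr_flag_hops`. A flagged intermediate hop is only a lead: what matters most is whether the loss is still present at the final hop.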

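2. iperf3: measuring bandwidth

iperf3 measures the achievable throughput between two hosts over TCP or UDP; in UDP mode it also reports jitter and packet loss.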

BASH

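# Install
# macOS
brew install iperf3

# Linux
apt install iperf3

# Start the iperf3 server on the VPS (port 5201)
iperf3 -s -p 5201 -f m

# Measure bandwidth from the client (local machine to the VPS)
iperf3 -c VPS_IP -p 5201 -t 10 -f m

# Options:
# -s: server mode
# -c: client mode, followed by the server IP
# -t 10: run the test for 10 seconds
# -f m: report rates in Mbits/sec (uppercase -f M reports MBytes/sec instead)
# -R: reverse mode (the server sends and the client receives, i.e. a download test)
# -P 4: use 4 parallel streams

# Sample output:
# [ ID] Interval           Transfer     Bitrate
# [  5]   0.00-10.00  sec  112 MBytes  94.0 Mbits/sec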

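How to read the key iperf3 metrics:

Metric	Meaning	Reference for a VPS
Bitrate	Measured throughput	Close to the plan's rated bandwidth (a 1 Gbps port yields roughly 940 Mbps in practice)
Jitter	UDP jitter	< 1 ms is excellent, > 5 ms is poor
Lost/Total	UDP packet loss	< 1% is excellent, > 5% is poor
Retr	TCP retransmissions	0 is excellent, > 10 is poor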
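The UDP rows of this table translate directly into a small classifier. The sketch below is only an illustration of those thresholds: `classify_udp` is a hypothetical helper, and it expects the jitter (ms) and loss (%) values to be copied from the receiver summary of an `iperf3 -u` run rather than parsing iperf3 output itself. Anything between "excellent" and "poor" is reported as "fair", a label the table does not define.

```shell
#!/bin/sh
# classify_udp JITTER_MS LOSS_PERCENT
# Hypothetical helper applying the thresholds from the table:
#   poor      - jitter > 5 ms or loss > 5%
#   excellent - jitter < 1 ms and loss < 1%
#   fair      - everything in between (label chosen for this sketch)
classify_udp() {
  awk -v j="$1" -v l="$2" 'BEGIN {
    if (j > 5 || l > 5)      print "poor"
    else if (j < 1 && l < 1) print "excellent"
    else                     print "fair"
  }'
}
```

For example, a run that reports 0.4 ms jitter and 0% loss would be checked with `classify_udp 0.4 0`.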

3. A latency probe script

A simple but practical script that probes latency to several nodes:

BASH

#!/bin/bash
# latency_check.sh - probe latency to several nodes

NODES=(
  "VPS1|1.2.3.4"
  "VPS2|5.6.7.8"
  "VPS3|9.10.11.12"
)

echo "========== Node latency probe $(date +'%Y-%m-%d %H:%M:%S') =========="
echo ""

for node in "${NODES[@]}"; do
  IFS='|' read -r name ip <<< "$node"

  # Send 5 probes; -q prints only the two summary lines
  result=$(ping -c 5 -q "$ip" 2>/dev/null)
  # Loss is on the "packet loss" line ("0%" on Linux, "0.0%" on macOS)
  loss=$(echo "$result" | grep -oE '[0-9.]+% packet loss' | cut -d'%' -f1)
  # Average is the 5th "/"-separated field of the min/avg/max line; keep the integer part
  avg=$(echo "$result" | awk -F'/' '/min\/avg\/max/ {print int($5)}')

  # Classify the node
  if [ -z "$loss" ] || [ -z "$avg" ]; then
    status="❌ unreachable"
  elif [ "$loss" != "0" ] && [ "$loss" != "0.0" ]; then
    status="❌ packet loss ${loss}%"
  elif [ "$avg" -lt 200 ]; then
    status="✅ excellent"
  elif [ "$avg" -lt 400 ]; then
    status="⚠️ fair"
  else
    status="🔴 high latency"
  fi

  latency="${avg:+${avg}ms}"
  printf "%-15s | %-8s | %s\n" "$name" "${latency:-n/a}" "$status"
done

📊 Monitoring with Prometheus + Grafana

Overall architecture

PLAINTEXT

┌──────────────────────────────────────────────────────────────┐
│                   Monitoring architecture                    │
│                                                              │
│   ┌──────────────┐                                           │
│   │ Proxy node 1 │                                           │
│   │  (exporter)  │                                           │
│   └──────┬───────┘                                           │
│          │ metrics (9100)                                    │
│          ▼                                                   │
│   ┌──────────────────────────────────────────────────┐       │
│   │                Prometheus Server                 │       │
│   │  - Scrapes node metrics periodically             │       │
│   │  - Stores time-series data                       │       │
│   │  - Evaluates alerting rules                      │       │
│   └────────────────────┬─────────────────────────────┘       │
│                        │ metrics                             │
│                        ▼                                     │
│   ┌──────────────────────────────────────────────────┐       │
│   │                     Grafana                      │       │
│   │  - Dashboards                                    │       │
│   │  - Live status                                   │       │
│   │  - Historical trends                             │       │
│   └──────────────────────────────────────────────────┘       │
│                                                              │
│   ┌──────────────────────────────────────────────────┐       │
│   │                   Alertmanager                   │       │
│   │  - Email / WeChat / DingTalk notifications       │       │
│   │  - Alert inhibition and grouping                 │       │
│   └──────────────────────────────────────────────────┘       │
└──────────────────────────────────────────────────────────────┘


Deploying the node exporter (node_exporter)

BASH

# Install node_exporter on every VPS (pin the version so the URL and file name match)
VERSION=1.6.1
wget "https://github.com/prometheus/node_exporter/releases/download/v${VERSION}/node_exporter-${VERSION}.linux-amd64.tar.gz"
tar -xzvf "node_exporter-${VERSION}.linux-amd64.tar.gz"
cd "node_exporter-${VERSION}.linux-amd64"
sudo cp node_exporter /usr/local/bin/

# Create the systemd service (via sudo tee: with "sudo cat > file" the redirect would not run as root)
sudo tee /etc/systemd/system/node_exporter.service > /dev/null << 'EOF'
[Unit]
Description=Prometheus Node Exporter
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/node_exporter
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter

# Verify
curl http://localhost:9100/metrics

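Custom proxy-metrics exporter

Besides system metrics, you also need metrics for the proxy service itself. The script below only prints metrics in the Prometheus text format; something still has to serve that output over HTTP (for example the node_exporter textfile collector) before Prometheus can scrape it.

BASH

#!/bin/bash
# proxy_exporter.sh - health metrics for proxy nodes

# Configuration
CHECK_TIMEOUT=5
NODES=(
  "node1|1.2.3.4|443"
  "node2|5.6.7.8|443"
)

# Emit metrics in the Prometheus text format
echo '# HELP proxy_node_up Whether the node accepts TCP connections (1 = up, 0 = down)'
echo '# TYPE proxy_node_up gauge'

for node in "${NODES[@]}"; do
  IFS='|' read -r name ip port <<< "$node"

  # Check that the port accepts a TCP connection
  if timeout "$CHECK_TIMEOUT" bash -c "echo >/dev/tcp/$ip/$port" 2>/dev/null; then
    status=1
  else
    status=0
  fi

  echo "proxy_node_up{name=\"$name\",ip=\"$ip\",port=\"$port\"} $status"
done

# Measure latency
echo '# HELP proxy_node_latency_ms TCP connect time to the node in milliseconds'
echo '# TYPE proxy_node_latency_ms gauge'

for node in "${NODES[@]}"; do
  IFS='|' read -r name ip port <<< "$node"

  # Time the TCP handshake with curl (time_connect); a proxy port rarely answers
  # like a normal HTTPS site, so the total request time would be meaningless.
  # -k: the certificate will not match a bare IP address.
  latency=$(curl -o /dev/null -s -k -w '%{time_connect}' \
    --connect-timeout 3 --max-time "$CHECK_TIMEOUT" \
    "https://$ip:$port" 2>/dev/null)

  # time_connect stays at 0 when the handshake never completed. Emit no sample
  # in that case: a fake 0 would look like a perfect latency.
  if [ -n "$latency" ] && [ "$(echo "$latency > 0" | bc)" -eq 1 ]; then
    latency_ms=$(echo "$latency * 1000" | bc)
    echo "proxy_node_latency_ms{name=\"$name\",ip=\"$ip\"} $latency_ms"
  fi
done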
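One way to get this output in front of Prometheus is the node_exporter textfile collector, which reads `*.prom` files from the directory given by `--collector.textfile.directory`. The sketch below is a minimal illustration under assumptions: `write_prom` is a hypothetical helper, and the directory and script paths in the usage note are placeholders. It writes to a hidden temporary file first and renames it, so the collector never reads a half-written file, and it leaves the previous file in place if the command fails.

```shell
#!/bin/sh
# write_prom DIR NAME COMMAND [ARGS...]
# Runs COMMAND and publishes its stdout as DIR/NAME.prom atomically.
# Hypothetical helper; the function body runs in a subshell so its
# variables do not leak into the caller.
write_prom() (
  dir=$1; name=$2; shift 2
  # The temp file has no .prom suffix, so the collector ignores it
  tmp=$(mktemp "$dir/.$name.XXXXXX") || return 1
  if "$@" > "$tmp"; then
    # rename(2) within one directory is atomic: readers see old or new, never partial
    chmod 644 "$tmp" && mv "$tmp" "$dir/$name.prom"
  else
    rm -f "$tmp"
    return 1
  fi
)
```

A cron entry or systemd timer could then run something like `write_prom /var/lib/node_exporter/textfile proxy /opt/scripts/proxy_exporter.sh` every minute. With this approach the proxy metrics appear on the node_exporter port (9100) alongside the system metrics, so no separate listener is needed.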

Prometheus configuration

YAML

# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node exporter
  - job_name: 'node'
    static_configs:
      - targets:
          - 'node1:9100'
          - 'node2:9100'
          - 'node3:9100'

  # Proxy metrics exporter (assumes the proxy metrics are served over HTTP on port 9200)
  - job_name: 'proxy'
    static_configs:
      - targets:
          - 'node1:9200'
          - 'node2:9200'

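Grafana dashboard setup

Recommended dashboard panels:

PLAINTEXT

1. Node status overview
   - Type: Stat
   - Shows: online/offline state of every node
   - Colors: green = online, red = offline

2. Latency trend
   - Type: Time series (Graph)
   - Shows: latency of each node over time
   - Thresholds: yellow above 200 ms, red above 400 ms

3. Packet loss
   - Type: Time series
   - Shows: ICMP/TCP packet loss percentage
   - Thresholds: yellow above 1%, red above 5%

4. Bandwidth utilization
   - Type: Gauge
   - Shows: current network throughput
   - Thresholds: set according to the VPS bandwidth

5. CPU/memory usage
   - Type: Time series
   - Shows: historical trend
   - Thresholds: yellow above 80%, red above 95%

6. Alert history
   - Type: Table
   - Shows: recently fired alerts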

🔄 Automatic proxy failover

Option 1: automatic switching in Clash Meta

Clash Meta has three built-in policy-group types, url-test, fallback, and load-balance, that pick a node automatically:

YAML

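# Automatic switching policies in a Clash configuration

proxy-groups:
  # Option A: pick the fastest node (latency is tested periodically)
  - name: "⚡ Auto select"
    type: url-test
    proxies:
      - Node 1
      - Node 2
      - Node 3
    url: http://www.gstatic.com/generate_204
    interval: 300    # test every 5 minutes
    tolerance: 50    # switch only if another node is faster by more than 50 ms
    lazy: true       # skip the tests while the group is not in use

  # Option B: failover (try nodes in order; return to an earlier node once it recovers)
  - name: "🔄 Failover"
    type: fallback
    proxies:
      - Primary node
      - Backup node 1
      - Backup node 2
    url: http://www.gstatic.com/generate_204
    interval: 300

  # Option C: load balancing (round-robin or consistent hashing)
  - name: "⚖️ Load balance"
    type: load-balance
    proxies:
      - Node 1
      - Node 2
      - Node 3
    url: http://www.gstatic.com/generate_204
    interval: 300
    strategy: consistent-hashing  # consistent hashing: the same domain always uses the same node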

Option 2: automatic switching with a shell script

For clients other than Clash, a script can do the switching:

BASH

#!/bin/bash
# auto_switch.sh - automatic proxy node failover
# Run it from a systemd timer or cron

# ============== Configuration ==============
PRIMARY_NODE="1.2.3.4"          # primary node IP
SECONDARY_NODE="5.6.7.8"        # backup node IP
PRIMARY_LABEL="primary"         # label used in logs
SECONDARY_LABEL="secondary"
CHECK_URL="http://www.gstatic.com/generate_204"
MAX_LATENCY=500                 # maximum acceptable latency (ms)
MAX_LOSS=5                      # maximum acceptable packet loss (%)
CHECK_COUNT=5                   # probes sent per check
LOG_FILE="/var/log/proxy_switch.log"

# ============== Helpers ==============
log() {
    echo "[$(date +'%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}

# Prints "<loss percent> <average latency in ms>" for the given IP
check_node() {
    local ip=$1
    local result loss avg avg_ms

    # Measure packet loss and latency
    result=$(ping -c "$CHECK_COUNT" -q "$ip" 2>&1)
    # Integer part of the loss percentage; treat a missing summary as 100% loss
    loss=$(echo "$result" | grep -oE '[0-9]+(\.[0-9]+)?% packet loss' | grep -oE '^[0-9]+')
    loss=${loss:-100}
    # The average is the 5th "/"-separated field of the min/avg/max line
    avg=$(echo "$result" | awk -F'/' '/min\/avg\/max/ {print $5}')

    # Round the latency to whole milliseconds
    if [ -n "$avg" ]; then
        avg_ms=$(printf "%.0f" "$avg" 2>/dev/null || echo "999")
    else
        avg_ms=999
    fi

    echo "$loss $avg_ms"
}

# ============== Main logic ==============
log "========== Check started =========="

# Check the primary node
read -r primary_loss primary_latency <<< "$(check_node "$PRIMARY_NODE")"
log "Primary node $PRIMARY_NODE: loss=${primary_loss}%, latency=${primary_latency}ms"

# Decide whether a switch is needed
should_switch=false

if [ "$primary_loss" -gt "$MAX_LOSS" ]; then
    log "⚠️ Primary node packet loss is above the threshold ($primary_loss% > $MAX_LOSS%)"
    should_switch=true
fi

if [ "$primary_latency" -gt "$MAX_LATENCY" ]; then
    log "⚠️ Primary node latency is above the threshold (${primary_latency}ms > ${MAX_LATENCY}ms)"
    should_switch=true
fi

if [ "$primary_loss" -eq 100 ]; then
    log "🔴 Primary node is completely unreachable"
    should_switch=true
fi

# If a switch is needed, check the backup node first
if [ "$should_switch" = true ]; then
    log "🔄 Trying to switch to the backup node..."

    read -r secondary_loss secondary_latency <<< "$(check_node "$SECONDARY_NODE")"
    log "Backup node $SECONDARY_NODE: loss=${secondary_loss}%, latency=${secondary_latency}ms"

    # Use the same limits as for the primary: values up to the threshold are acceptable
    if [ "$secondary_loss" -le "$MAX_LOSS" ] && [ "$secondary_latency" -le "$MAX_LATENCY" ]; then
        log "✅ Backup node is usable, switching"

        # Call your switch script here (adapt to your setup),
        # for example: update the sing-box configuration and restart the service
        # /opt/scripts/switch_to_backup.sh

        # Send a notification (optional)
        # curl -s -X POST "https://notify.example.com/send" \
        #   -d "text=Primary node failed; switched to the backup node automatically"
    else
        log "❌ Backup node is unusable too, switch failed!"
        # Send an urgent alert
        # curl -s -X POST "https://notify.example.com/alert" \
        #   -d "text=All nodes are unavailable!"
    fi
else
    log "✅ Primary node is healthy, no switch needed"
fi

log "========== Check finished =========="
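The script above acts on a single measurement, so one bad ping run is enough to trigger a switch. A common refinement is to require several consecutive failures first. The sketch below is one possible way to do that, not part of the original script: `record_result` is a hypothetical helper that keeps a failure counter in a state file whose path the caller chooses.

```shell
#!/bin/sh
# record_result STATE_FILE ok|fail [THRESHOLD]
# Tracks consecutive failures in STATE_FILE and prints "switch" once
# THRESHOLD (default 3) failures in a row have been recorded, else "hold".
# Hypothetical helper; the body runs in a subshell to keep variables local.
record_result() (
  state_file=$1; result=$2; threshold=${3:-3}
  count=0
  [ -f "$state_file" ] && count=$(cat "$state_file")
  if [ "$result" = "ok" ]; then
    count=0                  # any success resets the streak
  else
    count=$((count + 1))
  fi
  echo "$count" > "$state_file"
  if [ "$count" -ge "$threshold" ]; then echo switch; else echo hold; fi
)
```

In auto_switch.sh this could replace the direct assignment: record `fail` or `ok` after each check and set `should_switch=true` only when the function prints `switch`. With a one-minute timer and a threshold of 3, a switch then happens after roughly three minutes of sustained trouble instead of on the first glitch.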


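Option 3: VIP failover with Keepalived

For more advanced HA setups, Keepalived can move a virtual IP between nodes automatically:

BASH

# Install Keepalived
apt install -y keepalived

# /etc/keepalived/keepalived.conf
global_defs {
    router_id proxy_vip
    vrrp_version 3
}

# Define the health check before the instance that tracks it
vrrp_script check_proxy {
    script "/opt/scripts/check_proxy.sh"
    interval 5                   # run the check every 5 seconds
    fall 2                       # 2 consecutive failures mark the node as faulty
    rise 2                       # 2 consecutive successes mark it healthy again
}

vrrp_instance v_proxy {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 100                 # use a lower value (for example 90) on the standby node
    advert_int 1
    nopreempt                    # non-preemptive mode

    virtual_ipaddress {
        192.168.1.100/24       # virtual IP (clients connect to this address)
    }

    track_script {
        check_proxy
    }

    notify_master /opt/scripts/become_master.sh
    notify_backup /opt/scripts/become_backup.sh
    notify_fault /opt/scripts/fault.sh
}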
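The configuration refers to /opt/scripts/check_proxy.sh without showing it. Keepalived only looks at the script's exit status: 0 means healthy, anything else counts as a failure. The sketch below is one possible core for such a script, under assumptions: `has_listener` is a hypothetical helper, and port 443 in the usage note is a placeholder for the port your proxy actually listens on. It parses the output of `ss -ltn` and succeeds only if a socket is listening on the given port.

```shell
#!/bin/sh
# has_listener PORT
# Reads `ss -ltn` output on stdin; exits 0 if some socket listens on PORT.
# Hypothetical helper for a Keepalived check script.
has_listener() {
  awk -v port="$1" '
    $1 == "LISTEN" {
      # Column 4 is Local Address:Port; the port is whatever follows the
      # last colon, which also handles IPv6 forms such as [::]:443
      n = split($4, a, ":")
      if (a[n] == port) found = 1
    }
    END { exit found ? 0 : 1 }'
}
```

check_proxy.sh could then end with a line such as `ss -ltn | has_listener 443`, so that its exit status becomes the health result. This only proves that something is listening; a stricter check would also push a test request through the proxy.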

Option 4: DNS round-robin + health checks

DNS can provide simple load balancing and failover. The script below implements the failover half: it points the record at the first healthy node.

BASH

#!/bin/bash
# dns_health_check.sh - fail over by updating a DNS record
# Requires a DNS provider whose records can be updated through an API

DNS_API="https://api.cloudflare.com/client/v4"
AUTH_EMAIL="your@email.com"
AUTH_KEY="your_api_key"
ZONE_ID="your_zone_id"
RECORD_NAME="proxy.example.com"

# Call the Cloudflare API with the authentication headers
cf_api() {
    curl -s "$@" \
        -H "X-Auth-Email: $AUTH_EMAIL" \
        -H "X-Auth-Key: $AUTH_KEY" \
        -H "Content-Type: application/json"
}

# Check a node and point the DNS record at it if it is reachable.
# Returns 0 when the node is usable and 1 otherwise, so the caller can stop
# at the first healthy node.
check_and_update() {
    local node_ip=$1
    local node_name=$2
    local record current_ip record_id

    # Test whether the node is reachable
    if ! ping -c 3 -W 5 "$node_ip" > /dev/null 2>&1; then
        echo "❌ $node_name ($node_ip) is unreachable"
        return 1
    fi
    echo "✅ $node_name ($node_ip) is online"

    # Fetch the current A record once and read both its content and its ID
    record=$(cf_api "$DNS_API/zones/$ZONE_ID/dns_records?type=A&name=$RECORD_NAME")
    current_ip=$(echo "$record" | jq -r '.result[0].content')
    record_id=$(echo "$record" | jq -r '.result[0].id')

    if [ -z "$record_id" ] || [ "$record_id" = "null" ]; then
        echo "❌ No A record found for $RECORD_NAME"
        return 1
    fi

    # Nothing to update if DNS already points at this node
    if [ "$current_ip" = "$node_ip" ]; then
        return 0
    fi

    # Update the DNS record (short TTL so clients pick up the change quickly)
    cf_api -X PUT "$DNS_API/zones/$ZONE_ID/dns_records/$record_id" \
        -d "{\"type\":\"A\",\"name\":\"$RECORD_NAME\",\"content\":\"$node_ip\",\"ttl\":60}" \
        > /dev/null
    echo "🔄 DNS updated: $RECORD_NAME -> $node_ip"
}

# Walk the nodes in order and stop at the first one that is usable
NODES=(
    "1.2.3.4|Node 1"
    "5.6.7.8|Node 2"
    "9.10.11.12|Node 3"
)

for entry in "${NODES[@]}"; do
    IFS='|' read -r ip name <<< "$entry"
    check_and_update "$ip" "$name" && break
done

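📈 Alerting rules (Prometheus)

Node-down alert

YAML

# /etc/prometheus/rules/proxy_alerts.yml
groups:
  - name: proxy_alerts
    rules:
      # Node offline
      - alert: ProxyNodeDown
        expr: proxy_node_up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Proxy node {{ $labels.name }} ({{ $labels.ip }}) is offline"
          description: "The node has been offline for more than 2 minutes, current state: {{ $value }}"

      # Latency too high
      - alert: ProxyNodeHighLatency
        expr: proxy_node_latency_ms > 300
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Proxy node {{ $labels.name }} has high latency"
          description: "Latency has stayed above 300 ms, current: {{ $value }}ms"

      # Packet loss
      # Assumption: proxy_node_packets_lost_total and proxy_node_packets_sent_total
      # are hypothetical counters. proxy_exporter.sh does not expose them; this
      # rule only works with an exporter that does. Dividing the two rates gives
      # a loss ratio, which is what the 0.05 (5%) threshold expects.
      - alert: ProxyNodePacketLoss
        expr: rate(proxy_node_packets_lost_total[5m]) / rate(proxy_node_packets_sent_total[5m]) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Proxy node {{ $labels.name }} is losing packets"
          description: "Packet loss: {{ $value | humanizePercentage }}"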

Alertmanager configuration (notification channels)

YAML

# /etc/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default-receiver'

receivers:
  - name: 'default-receiver'
    # Email notifications
    email_configs:
      - to: your@email.com
        from: your@email.com
        send_resolved: true
        smarthost: smtp.gmail.com:587
        auth_username: your@email.com
        auth_password: your_app_password

    # Webhook notifications. A YAML key may appear only once per mapping,
    # so both endpoints go into a single webhook_configs list.
    webhook_configs:
      # DingTalk (requires a webhook)
      - url: http://your-dingtalk-webhook/api/send
      # WeCom (WeChat Work)
      - url: http://your-wecom-webhook/api/send

# Inhibition rules: while a critical alert fires for a node, mute the
# warning alerts that carry the same "name" label
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['name']

🛠️ Routine inspection scripts

A combined inspection script

BASH

#!/bin/bash
# daily_health_check.sh - daily inspection of a proxy node
# Suggested cron entry to run it every morning: 0 9 * * * /opt/scripts/daily_health_check.sh

LOG_DIR="/var/log/health_check"
REPORT_FILE="$LOG_DIR/report_$(date +%Y%m%d).log"
ALERT_EMAIL="your@email.com"

mkdir -p "$LOG_DIR"

log() {
    echo "[$(date +'%Y-%m-%d %H:%M:%S')] $1" | tee -a "$REPORT_FILE"
}

log "========== Daily inspection report $(date +'%Y-%m-%d') =========="

# 1. Service status
log "[1. Services]"
for service in xray sing-box hysteria-server; do
    if systemctl is-active --quiet "$service" 2>/dev/null; then
        log "  ✅ $service: running"
    else
        log "  ❌ $service: not running"
    fi
done

# 2. Listening ports
log "[2. Listening ports]"
for port in 443 80 8080; do
    if ss -tulnp | grep -q ":$port "; then
        log "  ✅ Port $port: listening"
    else
        log "  ⚠️  Port $port: not listening"
    fi
done

# 3. Certificate expiry
# Thresholds are sized for 90-day certificates such as Let's Encrypt;
# with the original 30/90-day limits such a certificate would warn every day.
log "[3. TLS certificate]"
cert_file="/etc/ssl/server.crt"
if [ -f "$cert_file" ]; then
    expiry_date=$(openssl x509 -in "$cert_file" -noout -enddate | cut -d= -f2)
    expiry_epoch=$(date -d "$expiry_date" +%s)
    now_epoch=$(date +%s)
    days_left=$(( (expiry_epoch - now_epoch) / 86400 ))

    if [ "$days_left" -lt 14 ]; then
        log "  🔴 Certificate expires in $days_left days! Renew it now!"
    elif [ "$days_left" -lt 30 ]; then
        log "  ⚠️  Certificate expires in $days_left days, renew it soon"
    else
        log "  ✅ Certificate valid until: $expiry_date ($days_left days)"
    fi
fi

# 4. Resource usage
log "[4. Resource usage]"
cpu_usage=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | sed 's/%us,//')
mem_usage=$(free | grep Mem | awk '{printf "%.1f", $3/$2 * 100}')
disk_usage=$(df -h / | tail -1 | awk '{print $5}' | sed 's/%//')

log "  CPU: ${cpu_usage}%, memory: ${mem_usage}%, disk: ${disk_usage}%"

if (( $(echo "$cpu_usage > 80" | bc -l) )); then
    log "  ⚠️  CPU usage is too high"
fi

if (( $(echo "$mem_usage > 90" | bc -l) )); then
    log "  ⚠️  Memory usage is too high"
fi

if [ "$disk_usage" -gt 85 ]; then
    log "  ⚠️  Disk usage is above 85%"
fi

# 5. Bandwidth usage (if vnstat is installed)
log "[5. Bandwidth usage]"
if command -v vnstat &> /dev/null; then
    today=$(vnstat | grep "today" | awk '{print $5, $6}')
    log "  Traffic today: $today"
fi

# 6. Recent errors in the journal
log "[6. Recent errors]"
journalctl -p err -n 10 --no-pager >> "$REPORT_FILE" 2>&1

# 7. Summary
log ""
log "========== Inspection finished =========="

# Mail the report if anything abnormal was found
if grep -q "❌\|🔴\|⚠️" "$REPORT_FILE"; then
    log "⚠️  Problems detected, sending alert..."
    mail -s "[Proxy monitoring] Daily inspection problems $(date +%Y%m%d)" "$ALERT_EMAIL" < "$REPORT_FILE"
fi


systemd timer setup (an alternative to cron)

INI

# /etc/systemd/system/daily_health_check.service
# No [Install] section: the timer below starts this unit, so it must not
# be enabled on its own (that would also run it at every boot).
[Unit]
Description=Daily proxy node inspection
After=network.target

[Service]
Type=oneshot
ExecStart=/opt/scripts/daily_health_check.sh
User=root

INI

# /etc/systemd/system/daily_health_check.timer
[Unit]
Description=Timer for the daily proxy node inspection

[Timer]
OnCalendar=*-*-* 09:00:00
Persistent=true

[Install]
WantedBy=timers.target

BASH

systemctl daemon-reload
systemctl enable --now daily_health_check.timer
systemctl list-timers --all | grep daily_health_check

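🔍 Troubleshooting common problems

Problem 1: MTR shows heavy loss at one hop

PLAINTEXT

Possible causes:
1. The router at that hop rate-limits or deprioritizes ICMP replies (normal behavior)
2. Congestion at that hop
3. A firewall deliberately drops ICMP packets

How to tell:
- Loss at one intermediate hop that does not continue to later hops = ICMP rate limiting on that router, usually harmless
- Loss that starts at a hop and persists through the final hop = a real problem at or after that hop
- Loss already present at the first hop = local network problem
- Light loss on every hop = unstable local network

What to do:
- Switch the local network (Wi-Fi → wired / 4G)
- Use TCP mode in MTR (sidesteps ICMP restrictions)
  mtr -T -r -c 50 TARGET_IP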
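When reading such a report, the loss measured at the final hop is the figure that reflects what your traffic actually experiences; loss that shows up only at an intermediate router and disappears on later hops is usually ICMP rate limiting. The sketch below pulls out just that final-hop figure. It is an illustration under assumptions: `mtr_final_loss` is a hypothetical helper, and it expects Loss% to be the seventh column from the right, as in MTR's report output.

```shell
#!/bin/sh
# mtr_final_loss
# Reads an mtr report on stdin and prints the Loss% value (without the
# percent sign) of the last hop. Hypothetical helper; prints an empty
# line if the input contains no hop rows.
mtr_final_loss() {
  awk '
    # Every data row overwrites "loss", so the last hop wins
    NF >= 8 && $(NF-6) ~ /^[0-9.]+%$/ { loss = $(NF-6) }
    END { sub(/%$/, "", loss); print loss }'
}
```

For example, `mtr -T -r -c 50 TARGET_IP | mtr_final_loss` prints a single number that a script can compare against a threshold.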

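Problem 2: iperf3 reports far less bandwidth than the rated value

PLAINTEXT

Possible causes:
1. The VPS plan itself limits bandwidth (check the provider's control panel)
2. The client side does not have enough bandwidth
3. A link somewhere on the path is rate-limited
4. A high TCP retransmission rate

Steps:
# 1. Confirm the bandwidth shown in the VPS control panel
# 2. Test with several parallel streams (measures aggregate bandwidth)
iperf3 -c VPS_IP -P 4 -t 10

# 3. Test UDP bandwidth and jitter
iperf3 -c VPS_IP -u -b 1G -t 10

# 4. Check the TCP congestion control algorithm on the VPS
sysctl net.ipv4.tcp_congestion_control
# Recommended: bbr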

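Problem 3: Prometheus fails to scrape a target

PLAINTEXT

Steps:

1. Check that the exporter is running
   curl http://localhost:9100/metrics

2. Check the scrape targets as Prometheus sees them
   curl http://localhost:9090/api/v1/targets

3. Check the firewall (firewalld shown here; on Ubuntu with ufw, use: ufw status)
   firewall-cmd --list-ports | grep 9100

4. Check the Prometheus logs
   journalctl -u prometheus -n 50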

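Problem 4: the Grafana dashboard is blank

PLAINTEXT

Possible causes:
1. The Prometheus data source is not configured correctly
2. The query is wrong
3. The selected time range contains no data

Fixes:
1. Check the data source: Configuration → Data Sources → test the connection
2. Check the query: run the PromQL on its own in Explore
3. Pick the correct time range in the top-right corner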

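📋 Checklist for the complete monitoring setup

PLAINTEXT

Deployment:
□ Install node_exporter on every VPS
□ Configure Prometheus to scrape periodically
□ Deploy Grafana and configure the data source
□ Import or create dashboards
□ Configure Alertmanager and the notification channels
□ Test that alerts are actually delivered

Day-to-day operations:
□ Automated daily inspection (systemd timer)
□ Review the Grafana trends every week
□ Check certificate expiry every month
□ Review the alert history every month and tune thresholds
□ Update the exporters and Prometheus every quarter

Capacity planning:
□ Watch bandwidth usage trends and upgrade ahead of time
□ Watch CPU/memory usage and scale when needed
□ Record how often each node fails and retire unstable nodes
□ Test the backup nodes regularly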

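Conclusion

Monitoring and failover are what take proxy operations from "it works" to "it works well". Even the best protocols and nodes deliver a poor experience if nothing watches them and nothing switches over when they fail.

Key points:

✅ MTR: continuously probes path quality and locates the failing hop
✅ iperf3: measures throughput, plus jitter and loss in UDP mode
✅ Prometheus + Grafana: a complete monitoring stack with live dashboards and alerting
✅ Clash Meta policy groups: built-in automatic failover (url-test/fallback)
✅ Shell scripts: lightweight scheduled health checks and automatic switching
✅ Keepalived VIP: virtual IP failover for advanced HA setups
✅ Dynamic DNS updates: API-driven failover
✅ Routine inspection: automated reports that catch problems early

Monitoring is about preventing fires, not fighting them. With a monitoring setup in place you can deal with network problems calmly and keep the proxy service running 24/7.

May your proxies stay online and never fail! 📊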
