MySQL InnoDB Cluster 常见故障场景分析


                                                                                                                                                <p style="color:#000000; margin-left:8px; margin-right:8px; text-align:left"><span style="background-color:#ffffff; color:#888888"><span>作者:赵黎明,</span>

爱可生 MySQL DBA 团队成员,熟悉 Oracle、MySQL 等数据库,擅长数据库性能问题诊断、事务与锁问题的分析等,负责处理客户 MySQL 及我司自研 DMP 平台日常运维中的问题,对开源数据库相关技术非常感兴趣。

爱可生开源社区出品,原创内容未经授权不得随意使用,转载请联系小编并注明来源。

本文约 4800 字,预计阅读需要 15 分钟。

 

一、背景

随着企业业务对数据库高可用性要求的不断提高, MySQL InnoDB Cluster [1] (以下简称为 MIC)作为 MySQL 官方提供的高可用解决方案,在生产环境中得到了广泛应用。该方案基于  MySQL Group Replication [2]  和  MySQL Shell [3]  管理工具,能够提供自动故障检测与恢复、数据一致性保证等关键特性。

然而,在实际运维过程中,由于硬件故障、网络问题、配置错误等多种因素,集群仍可能面临各种异常情况。为了确保数据库服务的持续稳定运行,MySQL 运维人员需要具备快速诊断和恢复集群故障的能力。

本文是基于实际生产环境中的经验总结,对 MIC 的常见故障场景进行了测试,并提供了相应的恢复方案和操作指导,旨在帮助 MySQL 运维人员提升故障处理效率,保障业务连续性。

测试环境信息

  • OS:Red Hat Enterprise Linux release 8.10 (Ootpa)

  • MySQL:mysql-community-server-8.4.5-1.el8.x86_64

  • mysqlshell:mysql-shell-8.4.5-1.el8.x86_64

  • mysqlrouter:mysql-router-community-8.4.5-1.el8.x86_64

二、场景示例

场景 1:集群非主节点实例重启

故障模拟:将 node2 从节点重启

<em><span>-- node2</span></em>
<span>systemctl restart mysql</span>

<em><span>-- node1</span></em>
<span>mysqlsh </span>
<em><span>--uri clusteruser@10.186.65.13:3306</span></em>
<span>\js</span>
<span>var cluster=dba.getCluster()</span>
<span>cluster.status()</span>

恢复过程

<em><span>-- 登陆 mysqlshell 查看集群初始状态</span></em>
<span>[root@node1 ~]</span>
<em><span># mysqlsh --uri clusteruser@10.186.65.13:3306</span></em>
<span>MySQL Shell 8.4.5</span>

<span>Copyright (c) 2016, 2025, Oracle and/or its affiliates.</span>
<span>Oracle is a registered trademark of Oracle Corporation and/or its affiliates.</span>
<span>Other names may be trademarks of their respective owners.</span>

<span>Type '\</span>
<span style="color:#c678dd"><span>help</span></span>
<span style="color:#98c379"><span>' or '</span></span>
<span>\?</span>
<span style="color:#98c379"><span>' for help; '</span></span>
<span>\quit</span>
<span style="color:#98c379"><span>' to exit.</span>
<span>Creating a Classic session to '</span></span>
<span>clusteruser@</span>
<span style="color:#d19a66"><span>10.186</span></span>
<span style="color:#d19a66"><span>.65</span></span>
<span style="color:#d19a66"><span>.13</span></span>
<span>:</span>
<span style="color:#d19a66"><span>3306</span></span>
<span style="color:#98c379"><span>'</span>
<span>Fetching global names for auto-completion... Press ^C to stop.</span>
<span>Your MySQL connection id is 402320</span>
<span>Server version: 8.4.5 MySQL Community Server - GPL</span>
<span>No default schema selected; type \use <schema> to set one.</span>
<span> MySQL  10.186.65.13:3306 ssl  SQL > \js</span>
<span>Switching to JavaScript mode...</span>
<span> MySQL  10.186.65.13:3306 ssl  JS > var cluster=dba.getCluster()</span>
<span> MySQL  10.186.65.13:3306 ssl  JS > cluster.status()</span>
<span>{</span>
<span>    "clusterName": "mycluster", </span>
<span>    "defaultReplicaSet": {</span>
<span>        "name": "default", </span>
<span>        "primary": "node1:3306", </span>
<span>        "ssl": "DISABLED", </span>
<span>        "status": "OK", </span>
<span>        "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.", </span>
<span>        "topology": {</span>
<span>            "node1:3306": {</span>
<span>                "address": "node1:3306", </span>
<span>                "memberRole": "PRIMARY", </span>
<span>                "mode": "R/W", </span>
<span>                "readReplicas": {}, </span>
<span>                "replicationLag": "applier_queue_applied", </span>
<span>                "role": "HA", </span>
<span>                "status": "ONLINE", </span>
<span>                "version": "8.4.5"</span>
<span>            }, </span>
<span>            "node2:3306": {</span>
<span>                "address": "node2:3306", </span>
<span>                "memberRole": "SECONDARY", </span>
<span>                "mode": "R/O", </span>
<span>                "readReplicas": {}, </span>
<span>                "replicationLag": "applier_queue_applied", </span>
<span>                "role": "HA", </span>
<span>                "status": "ONLINE", </span>
<span>                "version": "8.4.5"</span>
<span>            }, </span>
<span>            "node3:3306": {</span>
<span>                "address": "node3:3306", </span>
<span>                "memberRole": "SECONDARY", </span>
<span>                "mode": "R/O", </span>
<span>                "readReplicas": {}, </span>
<span>                "replicationLag": "applier_queue_applied", </span>
<span>                "role": "HA", </span>
<span>                "status": "ONLINE", </span>
<span>                "version": "8.4.5"</span>
<span>            }</span>
<span>        }, </span>
<span>        "topologyMode": "Single-Primary"</span>
<span>    }, </span>
<span>    "groupInformationSourceMember": "node1:3306"</span>
<span>}</span>

<span># 集群内所有成员状态都为 "ONLINE",node1 为 "PRIMARY" 节点,node2/3 为 "SECONDARY" 节点</span>

<span>-- node2 上实例重启期间观察集群状态</span>
<span> MySQL  10.186.65.13:3306 ssl  JS > cluster.status()</span>
<span>{</span>
<span>    "clusterName": "mycluster", </span>
<span>    "defaultReplicaSet": {</span>
<span>        "name": "default", </span>
<span>        "primary": "node1:3306", </span>
<span>        "ssl": "DISABLED", </span>
<span>        "status": "OK_NO_TOLERANCE_PARTIAL", </span>
<span>        "statusText": "Cluster is NOT tolerant to any failures. 1 member is not active.", </span>
<span>        "topology": {</span>
<span>            "node1:3306": {</span>
<span>                "address": "node1:3306", </span>
<span>                "memberRole": "PRIMARY", </span>
<span>                "mode": "R/W", </span>
<span>                "readReplicas": {}, </span>
<span>                "replicationLag": "applier_queue_applied", </span>
<span>                "role": "HA", </span>
<span>                "status": "ONLINE", </span>
<span>                "version": "8.4.5"</span>
<span>            }, </span>
<span>            "node2:3306": {</span>
<span>                "address": "node2:3306", </span>
<span>                "memberRole": "SECONDARY", </span>
<span>                "mode": "n/a", </span>
<span>                "readReplicas": {}, </span>
<span>                "role": "HA", </span>
<span>                "shellConnectError": "MySQL Error 2013: Could not open connection to '</span></span>
<span>node2:</span>
<span style="color:#d19a66"><span>3306</span></span>
<span style="color:#98c379"><span>': Lost connection to MySQL server at '</span></span>
<span>reading </span>
<span style="color:#c678dd"><span>initial</span></span>
<span> communication packet</span>
<span style="color:#98c379"><span>', system error: 104", </span>
<span>                "status": "(MISSING)"</span>
<span>            }, </span>
<span>            "node3:3306": {</span>
<span>                "address": "node3:3306", </span>
<span>                "memberRole": "SECONDARY", </span>
<span>                "mode": "R/O", </span>
<span>                "readReplicas": {}, </span>
<span>                "replicationLag": "applier_queue_applied", </span>
<span>                "role": "HA", </span>
<span>                "status": "ONLINE", </span>
<span>                "version": "8.4.5"</span>
<span>            }</span>
<span>        }, </span>
<span>        "topologyMode": "Single-Primary"</span>
<span>    }, </span>
<span>    "groupInformationSourceMember": "node1:3306"</span>
<span>}</span>

<span># 通过 mysqlshell 连接 node2 节点报错 Error 2013,该节点状态变为 "(MISSING)"</span>

<span>-- node2 上实例完成重启后再次查看集群状态</span>
<span> MySQL  10.186.65.13:3306 ssl  JS > cluster.status()</span>
<span>{</span>
<span>    "clusterName": "mycluster", </span>
<span>    "defaultReplicaSet": {</span>
<span>        "name": "default", </span>
<span>        "primary": "node1:3306", </span>
<span>        "ssl": "DISABLED", </span>
<span>        "status": "OK", </span>
<span>        "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.", </span>
<span>        "topology": {</span>
<span>            "node1:3306": {</span>
<span>                "address": "node1:3306", </span>
<span>                "memberRole": "PRIMARY", </span>
<span>                "mode": "R/W", </span>
<span>                "readReplicas": {}, </span>
<span>                "replicationLag": "applier_queue_applied", </span>
<span>                "role": "HA", </span>
<span>                "status": "ONLINE", </span>
<span>                "version": "8.4.5"</span>
<span>            }, </span>
<span>            "node2:3306": {</span>
<span>                "address": "node2:3306", </span>
<span>                "memberRole": "SECONDARY", </span>
<span>                "mode": "R/O", </span>
<span>                "readReplicas": {}, </span>
<span>                "replicationLag": "applier_queue_applied", </span>
<span>                "role": "HA", </span>
<span>                "status": "ONLINE", </span>
<span>                "version": "8.4.5"</span>
<span>            }, </span>
<span>            "node3:3306": {</span>
<span>                "address": "node3:3306", </span>
<span>                "memberRole": "SECONDARY", </span>
<span>                "mode": "R/O", </span>
<span>                "readReplicas": {}, </span>
<span>                "replicationLag": "applier_queue_applied", </span>
<span>                "role": "HA", </span>
<span>                "status": "ONLINE", </span>
<span>                "version": "8.4.5"</span>
<span>            }</span>
<span>        }, </span>
<span>        "topologyMode": "Single-Primary"</span>
<span>    }, </span>
<span>    "groupInformationSourceMember": "node1:3306"</span>
<span>}</span>

<span># node2 上的实例自动重启完成后,集群可自动恢复正常,无需额外处理</span>
</span>

小结

故障现象

  • 重启期间 node2 状态显示为  <span>(MISSING)</span>

  • 集群状态变为  <span>OK_NO_TOLERANCE_PARTIAL</span>

  • 出现连接错误  <span>Lost connection to MySQL server</span>

恢复要点

  • 从节点重启后,集群能够自动检测并重新加入

  • 无需人工干预,自动恢复为  <span>ONLINE</span>  状态

  • 集群在单节点故障期间仍可正常提供服务

影响范围

  • 重启期间集群容错能力降为 0

  • 对应用读写无影响

  • 短暂影响读负载均衡

场景 2:集群非主节点实例宕机

故障模拟:将 node3 从节点停止后再启动

<em><span>-- node3:</span></em>
<span>systemctl </span>
<span style="color:#c678dd"><span>stop</span></span>
<span> mysql</span>

<em><span>-- node1:</span></em>
<span>mysqlsh </span>
<em><span>--uri clusteruser@10.186.65.13:3306</span></em>
<span style="color:#c678dd"><span>var</span></span>
<span> cluster=dba.getCluster()</span>
<span>cluster.status()</span>

<em><span>-- node3:</span></em>
<span>systemctl </span>
<span style="color:#c678dd"><span>start</span></span>
<span> mysql</span>

恢复过程

<em><span>-- 集群初始状态</span></em>
<span>同上,略...</span>

<em><span>-- node3 上实例停止后查看集群状态</span></em>
<span> MySQL  10.186.65.13:3306 ssl  JS > cluster.status()</span>
<span>{</span>
<span>    "clusterName": "mycluster", </span>
<span>    "defaultReplicaSet": {</span>
<span>        "name": "default", </span>
<span>        "primary": "node1:3306", </span>
<span>        "ssl": "DISABLED", </span>
<span>        "status": "OK_NO_TOLERANCE_PARTIAL", </span>
<span>        "statusText": "Cluster is NOT tolerant to any failures. 1 member is not active.", </span>
<span>        "topology": {</span>
<span>            "node1:3306": {</span>
<span>                "address": "node1:3306", </span>
<span>                "memberRole": "PRIMARY", </span>
<span>                "mode": "R/W", </span>
<span>                "readReplicas": {}, </span>
<span>                "replicationLag": "applier_queue_applied", </span>
<span>                "role": "HA", </span>
<span>                "status": "ONLINE", </span>
<span>                "version": "8.4.5"</span>
<span>            }, </span>
<span>            "node2:3306": {</span>
<span>                "address": "node2:3306", </span>
<span>                "memberRole": "SECONDARY", </span>
<span>                "mode": "R/O", </span>
<span>                "readReplicas": {}, </span>
<span>                "replicationLag": "applier_queue_applied", </span>
<span>                "role": "HA", </span>
<span>                "status": "ONLINE", </span>
<span>                "version": "8.4.5"</span>
<span>            }, </span>
<span>            "node3:3306": {</span>
<span>                "address": "node3:3306", </span>
<span>                "memberRole": "SECONDARY", </span>
<span>                "mode": "n/a", </span>
<span>                "readReplicas": {}, </span>
<span>                "role": "HA", </span>
<span>                "shellConnectError": "MySQL Error 2013: Could not open connection to 'node3:3306': Lost connection to MySQL server at 'reading initial communication packet', system error: 104", </span>
<span>                "status": "(MISSING)"</span>
<span>            }</span>
<span>        }, </span>
<span>        "topologyMode": "Single-Primary"</span>
<span>    }, </span>
<span>    "groupInformationSourceMember": "node1:3306"</span>
<span>}</span>

<em><span># mysqlshell 连接 node3 上的实例报错,节点状态为 "(MISSING)"</span></em>

<em><span>-- node3 实例重新启动后再次查看集群状态</span></em>
<span> MySQL  10.186.65.13:3306 ssl  JS > cluster.status()</span>
<span>{</span>
<span>    "clusterName": "mycluster", </span>
<span>    "defaultReplicaSet": {</span>
<span>        "name": "default", </span>
<span>        "primary": "node1:3306", </span>
<span>        "ssl": "DISABLED", </span>
<span>        "status": "OK", </span>
<span>        "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.", </span>
<span>        "topology": {</span>
<span>            "node1:3306": {</span>
<span>                "address": "node1:3306", </span>
<span>                "memberRole": "PRIMARY", </span>
<span>                "mode": "R/W", </span>
<span>                "readReplicas": {}, </span>
<span>                "replicationLag": "applier_queue_applied", </span>
<span>                "role": "HA", </span>
<span>                "status": "ONLINE", </span>
<span>                "version": "8.4.5"</span>
<span>            }, </span>
<span>            "node2:3306": {</span>
<span>                "address": "node2:3306", </span>
<span>                "memberRole": "SECONDARY", </span>
<span>                "mode": "R/O", </span>
<span>                "readReplicas": {}, </span>
<span>                "replicationLag": "applier_queue_applied", </span>
<span>                "role": "HA", </span>
<span>                "status": "ONLINE", </span>
<span>                "version": "8.4.5"</span>
<span>            }, </span>
<span>            "node3:3306": {</span>
<span>                "address": "node3:3306", </span>
<span>                "memberRole": "SECONDARY", </span>
<span>                "mode": "R/O", </span>
<span>                "readReplicas": {}, </span>
<span>                "replicationLag": "applier_queue_applied", </span>
<span>                "role": "HA", </span>
<span>                "status": "ONLINE", </span>
<span>                "version": "8.4.5"</span>
<span>            }</span>
<span>        }, </span>
<span>        "topologyMode": "Single-Primary"</span>
<span>    }, </span>
<span>    "groupInformationSourceMember": "node1:3306"</span>
<span>}</span>

<em><span># 将 node3 上宕机的实例启动后,集群可自动恢复正常,无需额外处理</span></em>

小结

故障现象

  • node3 状态显示为  <span>(MISSING)</span>

  • 集群状态为  <span>OK_NO_TOLERANCE_PARTIAL</span>

  • 出现连接错误  <span>Lost connection to MySQL server</span>

恢复要点

  • 停止的从节点重新启动后自动恢复

  • 集群自动重新同步数据

  • 整个过程无需人工干预

影响范围

  • 故障期间集群失去容错能力

  • 读负载能力下降

  • 主节点写入不受影响

场景 3:集群主节点与从节点之间网络割裂

故障模拟:断开 node1 主节点与其他从节点之间的网络

<em><span>-- node1:</span></em>
<span>./stop_net.sh</span>
<span>cat stop_net.sh</span>
<em><span>#!/bin/bash</span></em>
<span>iptables -A INPUT -s 10.186.65.18 -j </span>
<span style="color:#c678dd"><span>DROP</span></span>
<span>iptables -A </span>
<span style="color:#c678dd"><span>OUTPUT</span></span>
<span> -d </span>
<span style="color:#d19a66"><span>10.186</span></span>
<span style="color:#d19a66"><span>.65</span></span>
<span style="color:#d19a66"><span>.18</span></span>
<span> -j </span>
<span style="color:#c678dd"><span>DROP</span></span>
<span>iptables -A </span>
<span style="color:#c678dd"><span>INPUT</span></span>
<span> -s </span>
<span style="color:#d19a66"><span>10.186</span></span>
<span style="color:#d19a66"><span>.65</span></span>
<span style="color:#d19a66"><span>.24</span></span>
<span> -j </span>
<span style="color:#c678dd"><span>DROP</span></span>
<span>iptables -A </span>
<span style="color:#c678dd"><span>OUTPUT</span></span>
<span> -d </span>
<span style="color:#d19a66"><span>10.186</span></span>
<span style="color:#d19a66"><span>.65</span></span>
<span style="color:#d19a66"><span>.24</span></span>
<span> -j </span>
<span style="color:#c678dd"><span>DROP</span></span>
<span style="color:#c678dd"><span>sleep </span></span>
<span style="color:#d19a66"><span>30</span></span>
<span>iptables -F</span>

<em><span>-- node1:</span></em>
<span>mysqlsh </span>
<em><span>--uri clusteruser@10.186.65.13:3306</span></em>
<span>\js</span>
<span style="color:#c678dd"><span>var</span></span>
<span> cluster=dba.getCluster()</span>
<span>cluster.status()</span>

恢复过程

<em><span>-- 集群初始状态</span></em>
<span>同上,略...</span>

<em><span>-- 执行断网脚本期间集群状态</span></em>
<span> MySQL  10.186.65.13:3306 ssl  JS > cluster.status()</span>
<span>{</span>
<span>    "clusterName": "mycluster", </span>
<span>    "defaultReplicaSet": {</span>
<span>        "name": "default", </span>
<span>        "primary": "node1:3306", </span>
<span>        "ssl": "DISABLED", </span>
<span>        "status": "NO_QUORUM", </span>
<span>        "statusText": "Cluster has no quorum as visible from 'node1:3306' and cannot process write transactions. 2 members are not active.", </span>
<span>        "topology": {</span>
<span>            "node1:3306": {</span>
<span>                "address": "node1:3306", </span>
<span>                "memberRole": "PRIMARY", </span>
<span>                "mode": "R/O", </span>
<span>                "readReplicas": {}, </span>
<span>                "replicationLag": "applier_queue_applied", </span>
<span>                "role": "HA", </span>
<span>                "status": "ONLINE", </span>
<span>                "version": "8.4.5"</span>
<span>            }, </span>
<span>            "node2:3306": {</span>
<span>                "address": "node2:3306", </span>
<span>                "memberRole": "SECONDARY", </span>
<span>                "memberState": "(MISSING)", </span>
<span>                "mode": "n/a", </span>
<span>                "readReplicas": {}, </span>
<span>                "role": "HA", </span>
<span>                "shellConnectError": "MySQL Error 2003: Could not open connection to 'node2:3306': Can't connect to MySQL server on 'node2:3306' (110)", </span>
<span>                "status": "UNREACHABLE", </span>
<span>                "version": "8.4.5"</span>
<span>            }, </span>
<span>            "node3:3306": {</span>
<span>                "address": "node3:3306", </span>
<span>                "memberRole": "SECONDARY", </span>
<span>                "memberState": "(MISSING)", </span>
<span>                "mode": "n/a", </span>
<span>                "readReplicas": {}, </span>
<span>                "role": "HA", </span>
<span>                "shellConnectError": "MySQL Error 2003: Could not open connection to 'node3:3306': Can't connect to MySQL server on 'node3:3306' (110)", </span>
<span>                "status": "UNREACHABLE", </span>
<span>                "version": "8.4.5"</span>
<span>            }</span>
<span>        }, </span>
<span>        "topologyMode": "Single-Primary"</span>
<span>    }, </span>
<span>    "groupInformationSourceMember": "node1:3306"</span>
<span>}</span>

<em><span># 此时,node2/3的memberState均为"(MISSING)",msyqlshell无法连接,节点status均为"UNREACHABLE"</span></em>

<span> MySQL  10.186.65.13:3306 ssl  JS > cluster.status()</span>
<span>{</span>
<span>    "clusterName": "mycluster", </span>
<span>    "defaultReplicaSet": {</span>
<span>        "name": "default", </span>
<span>        "primary": "node1:3306", </span>
<span>        "ssl": "DISABLED", </span>
<span>        "status": "NO_QUORUM", </span>
<span>        "statusText": "Cluster has no quorum as visible from 'node1:3306' and cannot process write transactions. 2 members are not active.", </span>
<span>        "topology": {</span>
<span>            "node1:3306": {</span>
<span>                "address": "node1:3306", </span>
<span>                "memberRole": "PRIMARY", </span>
<span>                "mode": "R/O", </span>
<span>                "readReplicas": {}, </span>
<span>                "replicationLag": "applier_queue_applied", </span>
<span>                "role": "HA", </span>
<span>                "status": "ONLINE", </span>
<span>                "version": "8.4.5"</span>
<span>            }, </span>
<span>            "node2:3306": {</span>
<span>                "address": "node2:3306", </span>
<span>                "memberRole": "SECONDARY", </span>
<span>                "memberState": "(MISSING)", </span>
<span>                "mode": "n/a", </span>
<span>                "readReplicas": {}, </span>
<span>                "role": "HA", </span>
<span>                "shellConnectError": "MySQL Error 2003: Could not open connection to 'node2:3306': Can't connect to MySQL server on 'node2:3306' (110)", </span>
<span>                "status": "UNREACHABLE", </span>
<span>                "version": "8.4.5"</span>
<span>            }, </span>
<span>            "node3:3306": {</span>
<span>                "address": "node3:3306", </span>
<span>                "instanceErrors": [</span>
<span>                    "ERROR: split-brain! Instance is not part of the majority group, but has state ONLINE"</span>
<span>                ], </span>
<span>                "memberRole": "SECONDARY", </span>
<span>                "memberState": "ONLINE", </span>
<span>                "mode": "n/a", </span>
<span>                "readReplicas": {}, </span>
<span>                "role": "HA", </span>
<span>                "status": "UNREACHABLE", </span>
<span>                "version": "8.4.5"</span>
<span>            }</span>
<span>        }, </span>
<span>        "topologyMode": "Single-Primary"</span>
<span>    }, </span>
<span>    "groupInformationSourceMember": "node1:3306"</span>
<span>}</span>

<em><span># 此时,集群状态略有不同,node2的memberState、status保持不变,而node3的memberState变为"ONLINE",出现了split-brain(脑裂)的提示</span></em>

<em><span>-- 同时在node2节点上查看集群状态</span></em>
<span> MySQL  10.186.65.18:3306 ssl  JS > cluster.status()</span>
<span>{</span>
<span>    "clusterName": "mycluster", </span>
<span>    "defaultReplicaSet": {</span>
<span>        "name": "default", </span>
<span>        "primary": "node1:3306", </span>
<span>        "ssl": "DISABLED", </span>
<span>        "status": "OK", </span>
<span>        "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.", </span>
<span>        "topology": {</span>
<span>            "node1:3306": {</span>
<span>                "address": "node1:3306", </span>
<span>                "memberRole": "PRIMARY", </span>
<span>                "mode": "R/W", </span>
<span>                "readReplicas": {}, </span>
<span>                "replicationLag": "applier_queue_applied", </span>
<span>                "role": "HA", </span>
<span>                "status": "ONLINE", </span>
<span>                "version": "8.4.5"</span>
<span>            }, </span>
<span>            "node2:3306": {</span>
<span>                "address": "node2:3306", </span>
<span>                "instanceErrors": [</span>
<span>                    "WARNING: Instance is NOT a PRIMARY but super_read_only option is OFF."</span>
<span>                ], </span>
<span>                "memberRole": "SECONDARY", </span>
<span>                "mode": "R/W", </span>
<span>                "readReplicas": {}, </span>
<span>                "replicationLag": "applier_queue_applied", </span>
<span>                "role": "HA", </span>
<span>                "status": "ONLINE", </span>
<span>                "version": "8.4.5"</span>
<span>            }, </span>
<span>            "node3:3306": {</span>
<span>                "address": "node3:3306", </span>
<span>                "memberRole": "SECONDARY", </span>
<span>                "mode": "R/O", </span>
<span>                "readReplicas": {}, </span>
<span>                "replicationLag": "applier_queue_applied", </span>
<span>                "role": "HA", </span>
<span>                "status": "ONLINE", </span>
<span>                "version": "8.4.5"</span>
<span>            }</span>
<span>        }, </span>
<span>        "topologyMode": "Single-Primary"</span>
<span>    }, </span>
<span>    "groupInformationSourceMember": "node1:3306"</span>
<span>}</span>

<em><span># 此时,node2/3之间的网络是正常的,仅仅是node1与它们之间的网络发生割裂,整个集群依然是正常的,未发生切主</span></em>

<em><span>-- 断网恢复后,再次在node1上查看集群状态</span></em>
<span> MySQL  10.186.65.13:3306 ssl  JS > cluster.status()</span>
<span>{</span>
<span>    "clusterName": "mycluster", </span>
<span>    "defaultReplicaSet": {</span>
<span>        "name": "default", </span>
<span>        "primary": "node1:3306", </span>
<span>        "ssl": "DISABLED", </span>
<span>        "status": "OK", </span>
<span>        "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.", </span>
<span>        "topology": {</span>
<span>            "node1:3306": {</span>
<span>                "address": "node1:3306", </span>
<span>                "memberRole": "PRIMARY", </span>
<span>                "mode": "R/W", </span>
<span>                "readReplicas": {}, </span>
<span>                "replicationLag": "applier_queue_applied", </span>
<span>                "role": "HA", </span>
<span>                "status": "ONLINE", </span>
<span>                "version": "8.4.5"</span>
<span>            }, </span>
<span>            "node2:3306": {</span>
<span>                "address": "node2:3306", </span>
<span>                "memberRole": "SECONDARY", </span>
<span>                "mode": "R/O", </span>
<span>                "readReplicas": {}, </span>
<span>                "replicationLag": "applier_queue_applied", </span>
<span>                "role": "HA", </span>
<span>                "status": "ONLINE", </span>
<span>                "version": "8.4.5"</span>
<span>            }, </span>
<span>            "node3:3306": {</span>
<span>                "address": "node3:3306", </span>
<span>                "memberRole": "SECONDARY", </span>
<span>                "mode": "R/O", </span>
<span>                "readReplicas": {}, </span>
<span>                "replicationLag": "applier_queue_applied", </span>
<span>                "role": "HA", </span>
<span>                "status": "ONLINE", </span>
<span>                "version": "8.4.5"</span>
<span>            }</span>
<span>        }, </span>
<span>        "topologyMode": "Single-Primary"</span>
<span>    }, </span>
<span>    "groupInformationSourceMember": "node1:3306"</span>
<span>}</span>

<em><span># 断网状态恢复后,node1上查看集群状态也恢复正常</span></em>

小结

故障现象

  • 主节点显示  <span>NO_QUORUM</span>  状态

  • 从节点显示  <span>UNREACHABLE</span>

  • 可能出现脑裂警告  <span>split-brain! Instance is not part of the majority group</span>

恢复要点

  • 网络恢复后集群自动重新建立连接

  • 自动进行数据一致性检查

  • 主从角色自动协商确定

影响范围

  • 网络分区期间写入操作可能受阻

  • 需要确保多数节点达成一致

  • 可能产生数据冲突需要解决

场景 4:集群主节点重启

故障模拟:将 node1 主节点重启

<em><span>-- node1:</span></em>
<span>systemctl restart mysql</span>

<em><span>-- node1:</span></em>
<span>mysqlsh </span>
<em><span>--uri clusteruser@10.186.65.13:3306</span></em>
<span>\js</span>
<span>var cluster=dba.getCluster()</span>
<span>cluster.status()</span>

恢复过程

<em><span>-- node1 主节点重启时查看集群状态</span></em>
<span> MySQL  10.186.65.13:3306 ssl  JS > cluster.status()</span>
<span>{</span>
<span>    "clusterName": "mycluster", </span>
<span>    "defaultReplicaSet": {</span>
<span>        "name": "default", </span>
<span>        "primary": "node2:3306", </span>
<span>        "ssl": "DISABLED", </span>
<span>        "status": "OK_NO_TOLERANCE_PARTIAL", </span>
<span>        "statusText": "Cluster is NOT tolerant to any failures. 1 member is not active.", </span>
<span>        "topology": {</span>
<span>            "node1:3306": {</span>
<span>                "address": "node1:3306", </span>
<span>                "memberRole": "SECONDARY", </span>
<span>                "mode": "n/a", </span>
<span>                "readReplicas": {}, </span>
<span>                "role": "HA", </span>
<span>                "shellConnectError": "MySQL Error 2013: Could not open connection to 'node1:3306': Lost connection to MySQL server at 'reading initial communication packet', system error: 104", </span>
<span>                "status": "(MISSING)"</span>
<span>            }, </span>
<span>            "node2:3306": {</span>
<span>                "address": "node2:3306", </span>
<span>                "memberRole": "PRIMARY", </span>
<span>                "mode": "R/W", </span>
<span>                "readReplicas": {}, </span>
<span>                "replicationLag": "applier_queue_applied", </span>
<span>                "role": "HA", </span>
<span>                "status": "ONLINE", </span>
<span>                "version": "8.4.5"</span>
<span>            }, </span>
<span>            "node3:3306": {</span>
<span>                "address": "node3:3306", </span>
<span>                "memberRole": "SECONDARY", </span>
<span>                "mode": "R/O", </span>
<span>                "readReplicas": {}, </span>
<span>                "replicationLag": "applier_queue_applied", </span>
<span>                "role": "HA", </span>
<span>                "status": "ONLINE", </span>
<span>                "version": "8.4.5"</span>
<span>            }</span>
<span>        }, </span>
<span>        "topologyMode": "Single-Primary"</span>
<span>    }, </span>
<span>    "groupInformationSourceMember": "node2:3306"</span>
<span>}</span>

<em><span># 主节点重启后,mysqlshell连接该节点失败,发生了切主,其memberRole由"PRIMARY"变为"SECONDARY"</span></em>

<em><span>-- node1主节点完成重启后再次查看集群状态</span></em>
<span> MySQL  10.186.65.13:3306 ssl  JS > cluster.status()</span>
<span>{</span>
<span>    "clusterName": "mycluster", </span>
<span>    "defaultReplicaSet": {</span>
<span>        "name": "default", </span>
<span>        "primary": "node2:3306", </span>
<span>        "ssl": "DISABLED", </span>
<span>        "status": "OK_NO_TOLERANCE_PARTIAL", </span>
<span>        "statusText": "Cluster is NOT tolerant to any failures. 1 member is not active.", </span>
<span>        "topology": {</span>
<span>            "node1:3306": {</span>
<span>                "address": "node1:3306", </span>
<span>                "instanceErrors": [</span>
<span>                    "NOTE: group_replication is stopped."</span>
<span>                ], </span>
<span>                "memberRole": "SECONDARY", </span>
<span>                "memberState": "OFFLINE", </span>
<span>                "mode": "n/a", </span>
<span>                "readReplicas": {}, </span>
<span>                "role": "HA", </span>
<span>                "status": "(MISSING)", </span>
<span>                "version": "8.4.5"</span>
<span>            }, </span>
<span>            "node2:3306": {</span>
<span>                "address": "node2:3306", </span>
<span>                "memberRole": "PRIMARY", </span>
<span>                "mode": "R/W", </span>
<span>                "readReplicas": {}, </span>
<span>                "replicationLag": "applier_queue_applied", </span>
<span>                "role": "HA", </span>
<span>                "status": "ONLINE", </span>
<span>                "version": "8.4.5"</span>
<span>            }, </span>
<span>            "node3:3306": {</span>
<span>                "address": "node3:3306", </span>
<span>                "memberRole": "SECONDARY", </span>
<span>                "mode": "R/O", </span>
<span>                "readReplicas": {}, </span>
<span>                "replicationLag": "applier_queue_applied", </span>
<span>                "role": "HA", </span>
<span>                "status": "ONLINE", </span>
<span>                "version": "8.4.5"</span>
<span>            }</span>
<span>        }, </span>
<span>        "topologyMode": "Single-Primary"</span>
<span>    }, </span>
<span>    "groupInformationSourceMember": "node2:3306"</span>
<span>}</span>

<em><span># node1主节点实例重启完成后,mysqlshell 恢复连接,memberState 变为 "OFFLINE",节点 status 变为 "(MISSING)",如未自动加入集群,需要手动执行 rejoinInstance 将其重新加入集群</span></em>

小结

故障现象

  • 主节点重启期间自动故障转移

  • node2 被选举为新的主节点

  • 原主节点重启后状态为  <span>(MISSING)</span>

恢复要点

  • 集群自动完成主节点切换

  • 原主节点需要手动或自动重新加入

  • 确保应用能够感知主节点变更

影响范围

  • 主节点切换期间短暂写入中断

  • 应用连接可能需要重连

  • 原主节点重新加入期间集群容错能力下降

场景 5:集群所有节点同时重启后丢失 Quorum

故障模拟:同时将集群内所有节点重启

<em><span>-- node1 & node2 & node3</span></em>
<span>./restart_node.sh</span>
<span>cat restart_node.sh</span>
<em><span>#!/bin/bash</span></em>
<span>systemctl </span>
<span style="color:#c678dd"><span>stop</span></span>
<span> mysql</span>
<span style="color:#c678dd"><span>sleep </span></span>
<span style="color:#d19a66"><span>30</span></span>
<span>systemctl </span>
<span style="color:#c678dd"><span>start</span></span>
<span> mysql</span>

<em><span>-- node1</span></em>
<span>cluster.status()</span>
<span style="color:#c678dd"><span>var</span></span>
<span> cluster=dba.getCluster()</span>
<span>\</span>
<span style="color:#c678dd"><span>sql</span></span>
<span style="color:#c678dd"><span>show </span></span>
<span style="color:#e6c07b"><span>binary </span></span>
<span style="color:#c678dd"><span>log </span></span>
<span style="color:#c678dd"><span>status</span></span>
<span>;</span>
<span style="color:#c678dd"><span>set </span></span>
<span style="color:#c678dd"><span>global</span></span>
<span> group_replication_bootstrap_group=</span>
<span style="color:#c678dd"><span>on</span></span>
<span>;</span>
<span style="color:#c678dd"><span>start</span></span>
<span> group_replication;</span>
<span style="color:#c678dd"><span>set </span></span>
<span style="color:#c678dd"><span>global</span></span>
<span> group_replication_bootstrap_group=</span>
<span style="color:#c678dd"><span>off</span></span>
<span>;</span>
<span>\js</span>
<span>var cluster=dba.getCluster()</span>
<span>cluster.status()</span>
<span>cluster.rejoinInstance("clusteruser@node2:3306")</span>
<span>cluster.rejoinInstance("clusteruser@node3:3306")</span>
<span>cluster.status()</span>

<em><span># PS:如仍然无法成功执行,可能需要先执行rescan,再执行addInstance将故障节点重新加回集群</span></em>
<span>cluster.rescan()</span>
<span>cluster.addInstance(clusteruser@node2:3306)</span>

恢复过程

<em><span>-- 同时将3个节点上的实例重启,重启完成后查看所有节点上mysqld进程</span></em>
<span>[root@node1 ~]</span>
<em><span># ./restart_node.sh</span></em>
<span>[root@node1 ~]</span>
<em><span># ps -ef|grep mysql</span></em>
<span>mysqlro+   65096       1  3 Nov19 ?        01:36:34 /usr/bin/mysqlrouter -c /mysql_router/mysqlrouter.conf</span>
<span>root       77313   77260  0 13:48 pts/0    00:00:02 mysqlsh </span>
<em><span>--uri clusteruser@10.186.65.13:3306</span></em>
<span>mysql      77963       1  4 15:45 ?        00:00:01 /usr/sbin/mysqld</span>
<span>root       78021   77883  0 15:45 pts/2    00:00:00 grep </span>
<em><span>--color=auto mysql</span></em>
<span>[root@node1 ~]</span>
<em><span>#</span></em>

<span>[root@node2 ~]</span>
<em><span># ./restart_node.sh</span></em>
<span>[root@node2 ~]</span>
<em><span># ps -ef|grep mysql</span></em>
<span>root       27868   27780  0 15:29 pts/0    00:00:01 mysqlsh </span>
<em><span>--uri clusteruser@10.186.65.18:3306</span></em>
<span>mysql      28001       1 19 15:45 ?        00:00:03 /usr/sbin/mysqld</span>
<span>root       28078   27924  0 15:45 pts/2    00:00:00 grep </span>
<em><span>--color=auto mysql</span></em>
<span>[root@node2 ~]</span>
<em><span>#</span></em>

<span>[root@node3 ~]</span>
<em><span># ./restart_node.sh</span></em>
<span>[root@node3 ~]</span>
<em><span># ps -ef|grep mysql</span></em>
<span>root      127839  127035  0 15:29 pts/0    00:00:00 mysqlsh </span>
<em><span>--uri clusteruser@10.186.65.24:3306</span></em>
<span>mysql     128073       1 10 15:44 ?        00:00:02 /usr/sbin/mysqld</span>
<span>root      128153  127992  0 15:44 pts/2    00:00:00 grep </span>
<em><span>--color=auto mysql</span></em>
<span>[root@node3 ~]</span>
<em><span>#</span></em>

<em><span>-- 重启集群所有节点时,无法查看集群状态</span></em>
<span> MySQL  10.186.65.13:3306 ssl  JS > cluster.status()</span>
<span>Cluster.status: The cluster object is disconnected. Please </span>
<span style="color:#c678dd"><span>use</span></span>
<span> dba.getCluster() </span>
<span style="color:#c678dd"><span>to</span></span>
<span> obtain a </span>
<span style="color:#c678dd"><span>fresh</span></span>
<span> cluster handle. (RuntimeError)</span>
<span> MySQL  </span>
<span style="color:#d19a66"><span>10.186</span></span>
<span style="color:#d19a66"><span>.65</span></span>
<span style="color:#d19a66"><span>.13</span></span>
<span>:</span>
<span style="color:#d19a66"><span>3306</span></span>
<span> ssl  JS > </span>
<span style="color:#c678dd"><span>var</span></span>
<span> cluster=dba.getCluster()</span>
<span>Dba.getCluster: This </span>
<span style="color:#c678dd"><span>function</span></span>
<span style="color:#c678dd"><span>is</span></span>
<span style="color:#c678dd"><span>not</span></span>
<span> available </span>
<span style="color:#c678dd"><span>through</span></span>
<span> a </span>
<span style="color:#c678dd"><span>session </span></span>
<span style="color:#c678dd"><span>to</span></span>
<span> a </span>
<span style="color:#c678dd"><span>standalone </span></span>
<span style="color:#c678dd"><span>instance</span></span>
<span> (metadata </span>
<span style="color:#c678dd"><span>exists</span></span>
<span>, </span>
<span style="color:#c678dd"><span>instance</span></span>
<span> belongs </span>
<span style="color:#c678dd"><span>to</span></span>
<span> that metadata, but GR </span>
<span style="color:#c678dd"><span>is</span></span>
<span style="color:#c678dd"><span>not</span></span>
<span> active) (MYSQLSH </span>
<span style="color:#d19a66"><span>51314</span></span>
<span>)</span>

<em><span>-- 确认 GTID 最全的节点(假设其为 node1 旧主节点)</span></em>
<span> MySQL  </span>
<span style="color:#d19a66"><span>10.186</span></span>
<span style="color:#d19a66"><span>.65</span></span>
<span style="color:#d19a66"><span>.13</span></span>
<span>:</span>
<span style="color:#d19a66"><span>3306</span></span>
<span> ssl  </span>
<span style="color:#c678dd"><span>SQL</span></span>
<span> > </span>
<span style="color:#c678dd"><span>show </span></span>
<span style="color:#e6c07b"><span>binary </span></span>
<span style="color:#c678dd"><span>log </span></span>
<span style="color:#c678dd"><span>status</span></span>
<span>;</span>
<span>ERROR: 2013 (HY000): Lost connection to MySQL server during query</span>
<span>The global session got disconnected..</span>
<span>Attempting to reconnect to 'mysql://clusteruser@10.186.65.13:3306'..</span>
<span>The global session was successfully reconnected.</span>
<span> MySQL  10.186.65.13:3306 ssl  SQL > </span>
<span style="color:#c678dd"><span>show </span></span>
<span style="color:#e6c07b"><span>binary </span></span>
<span style="color:#c678dd"><span>log </span></span>
<span style="color:#c678dd"><span>status</span></span>
<span>;</span>
<span>+</span>
<em><span>------------------+----------+--------------+------------------+---------------------------------------------+</span></em>
<span>| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                           |</span>
<span>+</span>
<em><span>------------------+----------+--------------+------------------+---------------------------------------------+</span></em>
<span>| mysql_bin.000003 |      194 |              |                  | aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee:1-2540 |</span>
<span>+</span>
<em><span>------------------+----------+--------------+------------------+---------------------------------------------+</span></em>
<span>1 row in </span>
<span style="color:#c678dd"><span>set</span></span>
<span> (</span>
<span style="color:#d19a66"><span>0.0001</span></span>
<span> sec)</span>

<em><span>-- 将主节点作为组复制引导节点并启动组复制</span></em>
<span> MySQL  </span>
<span style="color:#d19a66"><span>10.186</span></span>
<span style="color:#d19a66"><span>.65</span></span>
<span style="color:#d19a66"><span>.13</span></span>
<span>:</span>
<span style="color:#d19a66"><span>3306</span></span>
<span> ssl  </span>
<span style="color:#c678dd"><span>SQL</span></span>
<span> > </span>
<span style="color:#c678dd"><span>set </span></span>
<span style="color:#c678dd"><span>global</span></span>
<span> group_replication_bootstrap_group=</span>
<span style="color:#c678dd"><span>on</span></span>
<span>;</span>
<span>Query OK, 0 rows affected (0.0006 sec)</span>
<span> MySQL  10.186.65.13:3306 ssl  SQL > </span>
<span style="color:#c678dd"><span>start</span></span>
<span> group_replication;</span>
<span>Query OK, 0 rows affected (1.1069 sec)</span>
<span> MySQL  10.186.65.13:3306 ssl  SQL > </span>
<span style="color:#c678dd"><span>set </span></span>
<span style="color:#c678dd"><span>global</span></span>
<span> group_replication_bootstrap_group=</span>
<span style="color:#c678dd"><span>off</span></span>
<span>;</span>
<span>Query OK, 0 rows affected (0.0004 sec)</span>

<em><span>-- 再次查看集群状态</span></em>
<span> MySQL  10.186.65.13:3306 ssl  SQL > \js</span>
<span>Switching to JavaScript mode...</span>
<span> MySQL  10.186.65.13:3306 ssl  JS > var cluster=dba.getCluster()</span>
<span> MySQL  10.186.65.13:3306 ssl  JS > cluster.status()</span>
<span>{</span>
<span>    "clusterName": "mycluster", </span>
<span>    "defaultReplicaSet": {</span>
<span>        "name": "default", </span>
<span>        "primary": "node1:3306", </span>
<span>        "ssl": "DISABLED", </span>
<span>        "status": "OK_NO_TOLERANCE_PARTIAL", </span>
<span>        "statusText": "Cluster is NOT tolerant to any failures. 2 members are not active.", </span>
<span>        "topology": {</span>
<span>            "node1:3306": {</span>
<span>                "address": "node1:3306", </span>
<span>                "memberRole": "PRIMARY", </span>
<span>                "mode": "R/W", </span>
<span>                "readReplicas": {}, </span>
<span>                "replicationLag": "applier_queue_applied", </span>
<span>                "role": "HA", </span>
<span>                "status": "ONLINE", </span>
<span>                "version": "8.4.5"</span>
<span>            }, </span>
<span>            "node2:3306": {</span>
<span>                "address": "node2:3306", </span>
<span>                "instanceErrors": [</span>
<span>                    "NOTE: group_replication is stopped."</span>
<span>                ], </span>
<span>                "memberRole": "SECONDARY", </span>
<span>                "memberState": "OFFLINE", </span>
<span>                "mode": "n/a", </span>
<span>                "readReplicas": {}, </span>
<span>                "role": "HA", </span>
<span>                "status": "(MISSING)", </span>
<span>                "version": "8.4.5"</span>
<span>            }, </span>
<span>            "node3:3306": {</span>
<span>                "address": "node3:3306", </span>
<span>                "instanceErrors": [</span>
<span>                    "NOTE: group_replication is stopped."</span>
<span>                ], </span>
<span>                "memberRole": "SECONDARY", </span>
<span>                "memberState": "OFFLINE", </span>
<span>                "mode": "n/a", </span>
<span>                "readReplicas": {}, </span>
<span>                "role": "HA", </span>
<span>                "status": "(MISSING)", </span>
<span>                "version": "8.4.5"</span>
<span>            }</span>
<span>        }, </span>
<span>        "topologyMode": "Single-Primary"</span>
<span>    }, </span>
<span>    "groupInformationSourceMember": "node1:3306"</span>
<span>}</span>
<span> MySQL  10.186.65.13:3306 ssl  JS > </span>

<em><span># 此时,其他两个从节点的memberState均为"OFFLINE",status均为"(MISSING)",需要执行rejoinInstance,将它们重新加入集群</span></em>

<em><span>-- 依次将其他节点加入集群</span></em>
<span> MySQL  10.186.65.13:3306 ssl  JS > cluster.rejoinInstance("clusteruser@node2:3306")</span>
<span>The safest and most convenient way to provision a new instance is through automatic clone provisioning, which will completely overwrite the state of 'node2:3306' </span>
<span style="color:#c678dd"><span>with</span></span>
<span> a </span>
<span style="color:#c678dd"><span>physical </span></span>
<span style="color:#c678dd"><span>snapshot </span></span>
<span style="color:#c678dd"><span>from</span></span>
<span> an existing cluster member. </span>
<span style="color:#c678dd"><span>To </span></span>
<span style="color:#c678dd"><span>use</span></span>
<span> this method </span>
<span style="color:#c678dd"><span>by </span></span>
<span style="color:#c678dd"><span>default</span></span>
<span>, </span>
<span style="color:#c678dd"><span>set</span></span>
<span> the </span>
<span style="color:#98c379"><span>'recoveryMethod' </span></span>
<span style="color:#c678dd"><span>option</span></span>
<span style="color:#c678dd"><span>to </span></span>
<span style="color:#98c379"><span>'clone'</span></span>
<span>.</span>

<span>The incremental state </span>
<span style="color:#c678dd"><span>recovery</span></span>
<span> may be safely used </span>
<span style="color:#c678dd"><span>if</span></span>
<span> you </span>
<span style="color:#c678dd"><span>are</span></span>
<span> sure </span>
<span style="color:#c678dd"><span>all</span></span>
<span> updates ever executed </span>
<span style="color:#c678dd"><span>in</span></span>
<span> the cluster were done </span>
<span style="color:#c678dd"><span>with</span></span>
<span> GTIDs enabled, there </span>
<span style="color:#c678dd"><span>are </span></span>
<span style="color:#c678dd"><span>no</span></span>
<span> purged transactions </span>
<span style="color:#c678dd"><span>and</span></span>
<span> the </span>
<span style="color:#c678dd"><span>new </span></span>
<span style="color:#c678dd"><span>instance</span></span>
<span> contains the same GTID </span>
<span style="color:#c678dd"><span>set</span></span>
<span style="color:#c678dd"><span>as</span></span>
<span> the cluster </span>
<span style="color:#c678dd"><span>or</span></span>
<span> a subset </span>
<span style="color:#c678dd"><span>of</span></span>
<span> it. </span>
<span style="color:#c678dd"><span>To</span></span>
<span style="color:#c678dd"><span>use</span></span>
<span> this method </span>
<span style="color:#c678dd"><span>by </span></span>
<span style="color:#c678dd"><span>default</span></span>
<span>, </span>
<span style="color:#c678dd"><span>set</span></span>
<span> the </span>
<span style="color:#98c379"><span>'recoveryMethod' </span></span>
<span style="color:#c678dd"><span>option</span></span>
<span style="color:#c678dd"><span>to </span></span>
<span style="color:#98c379"><span>'incremental'</span></span>
<span>.</span>

<span>Incremental state </span>
<span style="color:#c678dd"><span>recovery</span></span>
<span> was selected because it seems </span>
<span style="color:#c678dd"><span>to</span></span>
<span> be safely usable.</span>

<span>Validating </span>
<span style="color:#c678dd"><span>instance</span></span>
<span> configuration </span>
<span style="color:#c678dd"><span>at</span></span>
<span> node2:</span>
<span style="color:#d19a66"><span>3306.</span></span>
<span>..</span>

<span>This </span>
<span style="color:#c678dd"><span>instance</span></span>
<span> reports its own address </span>
<span style="color:#c678dd"><span>as</span></span>
<span> node2:</span>
<span style="color:#d19a66"><span>3306</span></span>

<span style="color:#c678dd"><span>Instance</span></span>
<span> configuration </span>
<span style="color:#c678dd"><span>is</span></span>
<span> suitable.</span>
<span>Rejoining </span>
<span style="color:#c678dd"><span>instance</span></span>
<span style="color:#98c379"><span>'node2:3306'</span></span>
<span style="color:#c678dd"><span>to</span></span>
<span> cluster </span>
<span style="color:#98c379"><span>'mycluster'</span></span>
<span>...</span>

<span style="color:#c678dd"><span>Monitoring </span></span>
<span style="color:#c678dd"><span>recovery</span></span>
<span> process </span>
<span style="color:#c678dd"><span>of</span></span>
<span> the </span>
<span style="color:#c678dd"><span>new</span></span>
<span> cluster member. Press ^C </span>
<span style="color:#c678dd"><span>to </span></span>
<span style="color:#c678dd"><span>stop </span></span>
<span style="color:#c678dd"><span>monitoring </span></span>
<span style="color:#c678dd"><span>and</span></span>
<span> let it continue </span>
<span style="color:#c678dd"><span>in</span></span>
<span> background.</span>
<span>State </span>
<span style="color:#c678dd"><span>recovery</span></span>
<span> already finished </span>
<span style="color:#c678dd"><span>for</span></span>
<span style="color:#98c379"><span>'node2:3306'</span></span>

<span>The </span>
<span style="color:#c678dd"><span>instance</span></span>
<span style="color:#98c379"><span>'node2:3306'</span></span>
<span> was successfully rejoined </span>
<span style="color:#c678dd"><span>to</span></span>
<span> the cluster.</span>

<span> MySQL  </span>
<span style="color:#d19a66"><span>10.186</span></span>
<span style="color:#d19a66"><span>.65</span></span>
<span style="color:#d19a66"><span>.13</span></span>
<span>:</span>
<span style="color:#d19a66"><span>3306</span></span>
<span> ssl  JS > cluster.rejoinInstance(</span>
<span style="color:#98c379"><span>"clusteruser@node3:3306"</span></span>
<span>)</span>
<span>The safest </span>
<span style="color:#c678dd"><span>and</span></span>
<span> most convenient way </span>
<span style="color:#c678dd"><span>to</span></span>
<span> provision a </span>
<span style="color:#c678dd"><span>new </span></span>
<span style="color:#c678dd"><span>instance </span></span>
<span style="color:#c678dd"><span>is </span></span>
<span style="color:#c678dd"><span>through </span></span>
<span style="color:#c678dd"><span>auto matic </span></span>
<span style="color:#c678dd"><span>clone</span></span>
<span> provisioning, which will completely overwrite the state </span>
<span style="color:#c678dd"><span>of</span></span>
<span style="color:#98c379"><span>'node3:3306'</span></span>
<span style="color:#c678dd"><span>with</span></span>
<span> a </span>
<span style="color:#c678dd"><span>physical </span></span>
<span style="color:#c678dd"><span>snapshot </span></span>
<span style="color:#c678dd"><span>from</span></span>
<span> an existing cluster member. </span>
<span style="color:#c678dd"><span>To </span></span>
<span style="color:#c678dd"><span>use</span></span>
<span> this method </span>
<span style="color:#c678dd"><span>by </span></span>
<span style="color:#c678dd"><span>default</span></span>
<span>, </span>
<span style="color:#c678dd"><span>set</span></span>
<span> the </span>
<span style="color:#98c379"><span>'recoveryMethod' </span></span>
<span style="color:#c678dd"><span>option</span></span>
<span style="color:#c678dd"><span>to </span></span>
<span style="color:#98c379"><span>'clone'</span></span>
<span>.</span>

<span>The incremental state </span>
<span style="color:#c678dd"><span>recovery</span></span>
<span> may be safely used </span>
<span style="color:#c678dd"><span>if</span></span>
<span> you </span>
<span style="color:#c678dd"><span>are</span></span>
<span> sure </span>
<span style="color:#c678dd"><span>all</span></span>
<span> updates ever executed </span>
<span style="color:#c678dd"><span>in</span></span>
<span> the cluster were done </span>
<span style="color:#c678dd"><span>with</span></span>
<span> GTIDs enabled, there </span>
<span style="color:#c678dd"><span>are</span></span>
<span style="color:#c678dd"><span>no</span></span>
<span> purged transactions </span>
<span style="color:#c678dd"><span>and</span></span>
<span> the </span>
<span style="color:#c678dd"><span>new </span></span>
<span style="color:#c678dd"><span>instance</span></span>
<span> contains the same GTID </span>
<span style="color:#c678dd"><span>set</span></span>
<span style="color:#c678dd"><span>as</span></span>
<span> the cluster </span>
<span style="color:#c678dd"><span>or</span></span>
<span> a subset </span>
<span style="color:#c678dd"><span>of</span></span>
<span> it. </span>
<span style="color:#c678dd"><span>To </span></span>
<span style="color:#c678dd"><span>use</span></span>
<span> this method </span>
<span style="color:#c678dd"><span>by </span></span>
<span style="color:#c678dd"><span>default</span></span>
<span>, </span>
<span style="color:#c678dd"><span>set</span></span>
<span> the </span>
<span style="color:#98c379"><span>'recoveryMethod' </span></span>
<span style="color:#c678dd"><span>option </span></span>
<span style="color:#c678dd"><span>to </span></span>
<span style="color:#98c379"><span>'incremental'</span></span>
<span>.</span>

<span>Incremental state </span>
<span style="color:#c678dd"><span>recovery</span></span>
<span> was selected because it seems </span>
<span style="color:#c678dd"><span>to</span></span>
<span> be safely usable.</span>

<span>Validating </span>
<span style="color:#c678dd"><span>instance</span></span>
<span> configuration </span>
<span style="color:#c678dd"><span>at</span></span>
<span> node3:</span>
<span style="color:#d19a66"><span>3306.</span></span>
<span>..</span>

<span>This </span>
<span style="color:#c678dd"><span>instance</span></span>
<span> reports its own address </span>
<span style="color:#c678dd"><span>as</span></span>
<span> node3:</span>
<span style="color:#d19a66"><span>3306</span></span>

<span style="color:#c678dd"><span>Instance</span></span>
<span> configuration </span>
<span style="color:#c678dd"><span>is</span></span>
<span> suitable.</span>
<span>Rejoining </span>
<span style="color:#c678dd"><span>instance</span></span>
<span style="color:#98c379"><span>'node3:3306'</span></span>
<span style="color:#c678dd"><span>to</span></span>
<span> cluster </span>
<span style="color:#98c379"><span>'mycluster'</span></span>
<span>...</span>

<span style="color:#c678dd"><span>Monitoring </span></span>
<span style="color:#c678dd"><span>recovery</span></span>
<span> process </span>
<span style="color:#c678dd"><span>of</span></span>
<span> the </span>
<span style="color:#c678dd"><span>new</span></span>
<span> cluster member. Press ^C </span>
<span style="color:#c678dd"><span>to </span></span>
<span style="color:#c678dd"><span>stop </span></span>
<span style="color:#c678dd"><span>monitoring</span></span>
<span style="color:#c678dd"><span>and</span></span>
<span> let it continue </span>
<span style="color:#c678dd"><span>in</span></span>
<span> background.</span>
<span>State </span>
<span style="color:#c678dd"><span>recovery</span></span>
<span> already finished </span>
<span style="color:#c678dd"><span>for </span></span>
<span style="color:#98c379"><span>'node3:3306'</span></span>

<span>The </span>
<span style="color:#c678dd"><span>instance</span></span>
<span style="color:#98c379"><span>'node3:3306'</span></span>
<span> was successfully rejoined </span>
<span style="color:#c678dd"><span>to</span></span>
<span> the cluster.</span>

<em><span>-- 再次查看集群状态</span></em>
<span> MySQL  </span>
<span style="color:#d19a66"><span>10.186</span></span>
<span style="color:#d19a66"><span>.65</span></span>
<span style="color:#d19a66"><span>.13</span></span>
<span>:</span>
<span style="color:#d19a66"><span>3306</span></span>
<span> ssl  JS > cluster.status()</span>
<span>{</span>
<span>    </span>
<span style="color:#98c379"><span>"clusterName"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"mycluster"</span></span>
<span>, </span>
<span>    </span>
<span style="color:#98c379"><span>"defaultReplicaSet"</span></span>
<span>: {</span>
<span>        </span>
<span style="color:#98c379"><span>"name"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"default"</span></span>
<span>, </span>
<span>        </span>
<span style="color:#98c379"><span>"primary"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"node1:3306"</span></span>
<span>, </span>
<span>        </span>
<span style="color:#98c379"><span>"ssl"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"DISABLED"</span></span>
<span>, </span>
<span>        </span>
<span style="color:#98c379"><span>"status"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"OK"</span></span>
<span>, </span>
<span>        </span>
<span style="color:#98c379"><span>"statusText"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"Cluster is ONLINE and can tolerate up to ONE failure."</span></span>
<span>, </span>
<span>        </span>
<span style="color:#98c379"><span>"topology"</span></span>
<span>: {</span>
<span>            </span>
<span style="color:#98c379"><span>"node1:3306"</span></span>
<span>: {</span>
<span>                </span>
<span style="color:#98c379"><span>"address"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"node1:3306"</span></span>
<span>, </span>
<span>                </span>
<span style="color:#98c379"><span>"memberRole"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"PRIMARY"</span></span>
<span>, </span>
<span>                </span>
<span style="color:#98c379"><span>"mode"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"R/W"</span></span>
<span>, </span>
<span>                </span>
<span style="color:#98c379"><span>"readReplicas"</span></span>
<span>: {}, </span>
<span>                </span>
<span style="color:#98c379"><span>"replicationLag"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"applier_queue_applied"</span></span>
<span>, </span>
<span>                </span>
<span style="color:#98c379"><span>"role"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"HA"</span></span>
<span>, </span>
<span>                </span>
<span style="color:#98c379"><span>"status"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"ONLINE"</span></span>
<span>, </span>
<span>                </span>
<span style="color:#98c379"><span>"version"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"8.4.5"</span></span>
<span>            }, </span>
<span>            </span>
<span style="color:#98c379"><span>"node2:3306"</span></span>
<span>: {</span>
<span>                </span>
<span style="color:#98c379"><span>"address"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"node2:3306"</span></span>
<span>, </span>
<span>                </span>
<span style="color:#98c379"><span>"memberRole"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"SECONDARY"</span></span>
<span>, </span>
<span>                </span>
<span style="color:#98c379"><span>"mode"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"R/O"</span></span>
<span>, </span>
<span>                </span>
<span style="color:#98c379"><span>"readReplicas"</span></span>
<span>: {}, </span>
<span>                </span>
<span style="color:#98c379"><span>"replicationLag"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"applier_queue_applied"</span></span>
<span>, </span>
<span>                </span>
<span style="color:#98c379"><span>"role"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"HA"</span></span>
<span>, </span>
<span>                </span>
<span style="color:#98c379"><span>"status"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"ONLINE"</span></span>
<span>, </span>
<span>                </span>
<span style="color:#98c379"><span>"version"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"8.4.5"</span></span>
<span>            }, </span>
<span>            </span>
<span style="color:#98c379"><span>"node3:3306"</span></span>
<span>: {</span>
<span>                </span>
<span style="color:#98c379"><span>"address"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"node3:3306"</span></span>
<span>, </span>
<span>                </span>
<span style="color:#98c379"><span>"memberRole"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"SECONDARY"</span></span>
<span>, </span>
<span>                </span>
<span style="color:#98c379"><span>"mode"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"R/O"</span></span>
<span>, </span>
<span>                </span>
<span style="color:#98c379"><span>"readReplicas"</span></span>
<span>: {}, </span>
<span>                </span>
<span style="color:#98c379"><span>"replicationLag"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"applier_queue_applied"</span></span>
<span>, </span>
<span>                </span>
<span style="color:#98c379"><span>"role"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"HA"</span></span>
<span>, </span>
<span>                </span>
<span style="color:#98c379"><span>"status"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"ONLINE"</span></span>
<span>, </span>
<span>                </span>
<span style="color:#98c379"><span>"version"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"8.4.5"</span></span>
<span>            }</span>
<span>        }, </span>
<span>        </span>
<span style="color:#98c379"><span>"topologyMode"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"Single-Primary"</span></span>
<span>    }, </span>
<span>    </span>
<span style="color:#98c379"><span>"groupInformationSourceMember"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"node1:3306"</span></span>
<span>}</span>
<span> MySQL  </span>
<span style="color:#d19a66"><span>10.186</span></span>
<span style="color:#d19a66"><span>.65</span></span>
<span style="color:#d19a66"><span>.13</span></span>
<span>:</span>
<span style="color:#d19a66"><span>3306</span></span>
<span> ssl  JS ></span>

<em><span># 实例重启后,组复制默认不会开启,可能会缺少仲裁节点,导致无法查看集群状态,需要先配置一个主引导节点后启动组复制,并将其余节点通过rejoinInstance的方式加入集群</span></em>
<em><span># 如果要对集群所在服务器进行计划内的Linux内核升级等运维操作时,建议采用滚动方式依次对各个节点上的实例进行重启,应尽量避免所有节点同时发生重启</span></em>

小结

故障现象

  • 所有节点重启后集群可能无法自动恢复

  • 出现  <span>The cluster object is disconnected</span>  错误

  • 组复制服务未自动启动

恢复要点

  • 需要手动选择 GTID 最全的节点作为引导节点

  • 使用  <span>group_replication_bootstrap_group</span>  参数引导集群

  • 通过  <span>rejoinInstance</span>  重新加入其他节点

影响范围

  • 整个集群服务中断

  • 需要人工干预恢复

  • 恢复时间相对较长

总结

本文对 MIC 在实际使用过程中出现的常见故障场景进行了一些测试,旨在使 MySQL 运维人员可以更好地理解其故障特性和恢复机制,提高应对各种异常情况的能力,确保数据库服务的高可用性和数据安全性。

其中,故障场景主要分为两类:

  1. 自动化恢复能力

    • 单节点故障(重启、宕机)集群具备自动恢复能力

    • 网络分区恢复后集群能够自动重新建立一致性

    • 主节点故障时自动进行故障转移

  2. 需要人工干预的场景

    • 所有节点同时重启导致仲裁丢失

    • 集群引导节点选择不当

    • 数据不一致需要手动修复

参考资料

[1]  MySQL InnoDB Cluster:  https://dev.mysql.com/doc/mysql-shell/8.4/en/monitoring-innodb-cluster.html

[2]  MySQL Group Replication:  https://dev.mysql.com/doc/en/group-replication.html

[3]  MySQL Shell:  https://dev.mysql.com/doc/mysql-shell/8.4/en/rejoin-cluster.html

                                                                                </div>



Source link

未经允许不得转载:紫竹林-程序员中文网 » MySQL InnoDB Cluster 常见故障场景分析

评论 抢沙发

  • 昵称 (必填)
  • 邮箱 (必填)
  • 网址
关于我们 免责申明 意见反馈 隐私政策
程序员中文网:公益在线网站,帮助学习者快速成长!
关注微信 技术交流
推荐文章
每天精选资源文章推送
推荐文章
随时随地碎片化学习
推荐文章
发现有趣的