<p style="color:#000000; margin-left:8px; margin-right:8px; text-align:left"><span style="background-color:#ffffff; color:#888888"><span>作者:赵黎明,</span>
爱可生 MySQL DBA 团队成员,熟悉 Oracle、MySQL 等数据库,擅长数据库性能问题诊断、事务与锁问题的分析等,负责处理客户 MySQL 及我司自研 DMP 平台日常运维中的问题,对开源数据库相关技术非常感兴趣。
爱可生开源社区出品,原创内容未经授权不得随意使用,转载请联系小编并注明来源。
本文约 4800 字,预计阅读需要 15 分钟。
一、背景
随着企业业务对数据库高可用性要求的不断提高, MySQL InnoDB Cluster [1] (以下简称为 MIC)作为 MySQL 官方提供的高可用解决方案,在生产环境中得到了广泛应用。该方案基于 MySQL Group Replication [2] 和 MySQL Shell [3] 管理工具,能够提供自动故障检测与恢复、数据一致性保证等关键特性。
然而,在实际运维过程中,由于硬件故障、网络问题、配置错误等多种因素,集群仍可能面临各种异常情况。为了确保数据库服务的持续稳定运行,MySQL 运维人员需要具备快速诊断和恢复集群故障的能力。
本文是基于实际生产环境中的经验总结,对 MIC 的常见故障场景进行了测试,并提供了相应的恢复方案和操作指导,旨在帮助 MySQL 运维人员提升故障处理效率,保障业务连续性。
测试环境信息
-
OS:Red Hat Enterprise Linux release 8.10 (Ootpa)
-
MySQL:mysql-community-server-8.4.5-1.el8.x86_64
-
mysqlshell:mysql-shell-8.4.5-1.el8.x86_64
-
mysqlrouter:mysql-router-community-8.4.5-1.el8.x86_64
二、场景示例
场景 1:集群非主节点实例重启
故障模拟:将 node2 从节点重启
<em><span>-- node2</span></em>
<span>systemctl restart mysql</span>
<em><span>-- node1</span></em>
<span>mysqlsh </span>
<em><span>--uri clusteruser@10.186.65.13:3306</span></em>
<span>\js</span>
<span>var cluster=dba.getCluster()</span>
<span>cluster.status()</span>
恢复过程
<em><span>-- 登陆 mysqlshell 查看集群初始状态</span></em>
<span>[root@node1 ~]</span>
<em><span># mysqlsh --uri clusteruser@10.186.65.13:3306</span></em>
<span>MySQL Shell 8.4.5</span>
<span>Copyright (c) 2016, 2025, Oracle and/or its affiliates.</span>
<span>Oracle is a registered trademark of Oracle Corporation and/or its affiliates.</span>
<span>Other names may be trademarks of their respective owners.</span>
<span>Type '\</span>
<span style="color:#c678dd"><span>help</span></span>
<span style="color:#98c379"><span>' or '</span></span>
<span>\?</span>
<span style="color:#98c379"><span>' for help; '</span></span>
<span>\quit</span>
<span style="color:#98c379"><span>' to exit.</span>
<span>Creating a Classic session to '</span></span>
<span>clusteruser@</span>
<span style="color:#d19a66"><span>10.186</span></span>
<span style="color:#d19a66"><span>.65</span></span>
<span style="color:#d19a66"><span>.13</span></span>
<span>:</span>
<span style="color:#d19a66"><span>3306</span></span>
<span style="color:#98c379"><span>'</span>
<span>Fetching global names for auto-completion... Press ^C to stop.</span>
<span>Your MySQL connection id is 402320</span>
<span>Server version: 8.4.5 MySQL Community Server - GPL</span>
<span>No default schema selected; type \use <schema> to set one.</span>
<span> MySQL 10.186.65.13:3306 ssl SQL > \js</span>
<span>Switching to JavaScript mode...</span>
<span> MySQL 10.186.65.13:3306 ssl JS > var cluster=dba.getCluster()</span>
<span> MySQL 10.186.65.13:3306 ssl JS > cluster.status()</span>
<span>{</span>
<span> "clusterName": "mycluster", </span>
<span> "defaultReplicaSet": {</span>
<span> "name": "default", </span>
<span> "primary": "node1:3306", </span>
<span> "ssl": "DISABLED", </span>
<span> "status": "OK", </span>
<span> "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.", </span>
<span> "topology": {</span>
<span> "node1:3306": {</span>
<span> "address": "node1:3306", </span>
<span> "memberRole": "PRIMARY", </span>
<span> "mode": "R/W", </span>
<span> "readReplicas": {}, </span>
<span> "replicationLag": "applier_queue_applied", </span>
<span> "role": "HA", </span>
<span> "status": "ONLINE", </span>
<span> "version": "8.4.5"</span>
<span> }, </span>
<span> "node2:3306": {</span>
<span> "address": "node2:3306", </span>
<span> "memberRole": "SECONDARY", </span>
<span> "mode": "R/O", </span>
<span> "readReplicas": {}, </span>
<span> "replicationLag": "applier_queue_applied", </span>
<span> "role": "HA", </span>
<span> "status": "ONLINE", </span>
<span> "version": "8.4.5"</span>
<span> }, </span>
<span> "node3:3306": {</span>
<span> "address": "node3:3306", </span>
<span> "memberRole": "SECONDARY", </span>
<span> "mode": "R/O", </span>
<span> "readReplicas": {}, </span>
<span> "replicationLag": "applier_queue_applied", </span>
<span> "role": "HA", </span>
<span> "status": "ONLINE", </span>
<span> "version": "8.4.5"</span>
<span> }</span>
<span> }, </span>
<span> "topologyMode": "Single-Primary"</span>
<span> }, </span>
<span> "groupInformationSourceMember": "node1:3306"</span>
<span>}</span>
<span># 集群内所有成员状态都为 "ONLINE",node1 为 "PRIMARY" 节点,node2/3 为 "SECONDARY" 节点</span>
<span>-- node2 上实例重启期间观察集群状态</span>
<span> MySQL 10.186.65.13:3306 ssl JS > cluster.status()</span>
<span>{</span>
<span> "clusterName": "mycluster", </span>
<span> "defaultReplicaSet": {</span>
<span> "name": "default", </span>
<span> "primary": "node1:3306", </span>
<span> "ssl": "DISABLED", </span>
<span> "status": "OK_NO_TOLERANCE_PARTIAL", </span>
<span> "statusText": "Cluster is NOT tolerant to any failures. 1 member is not active.", </span>
<span> "topology": {</span>
<span> "node1:3306": {</span>
<span> "address": "node1:3306", </span>
<span> "memberRole": "PRIMARY", </span>
<span> "mode": "R/W", </span>
<span> "readReplicas": {}, </span>
<span> "replicationLag": "applier_queue_applied", </span>
<span> "role": "HA", </span>
<span> "status": "ONLINE", </span>
<span> "version": "8.4.5"</span>
<span> }, </span>
<span> "node2:3306": {</span>
<span> "address": "node2:3306", </span>
<span> "memberRole": "SECONDARY", </span>
<span> "mode": "n/a", </span>
<span> "readReplicas": {}, </span>
<span> "role": "HA", </span>
<span> "shellConnectError": "MySQL Error 2013: Could not open connection to '</span></span>
<span>node2:</span>
<span style="color:#d19a66"><span>3306</span></span>
<span style="color:#98c379"><span>': Lost connection to MySQL server at '</span></span>
<span>reading </span>
<span style="color:#c678dd"><span>initial</span></span>
<span> communication packet</span>
<span style="color:#98c379"><span>', system error: 104", </span>
<span> "status": "(MISSING)"</span>
<span> }, </span>
<span> "node3:3306": {</span>
<span> "address": "node3:3306", </span>
<span> "memberRole": "SECONDARY", </span>
<span> "mode": "R/O", </span>
<span> "readReplicas": {}, </span>
<span> "replicationLag": "applier_queue_applied", </span>
<span> "role": "HA", </span>
<span> "status": "ONLINE", </span>
<span> "version": "8.4.5"</span>
<span> }</span>
<span> }, </span>
<span> "topologyMode": "Single-Primary"</span>
<span> }, </span>
<span> "groupInformationSourceMember": "node1:3306"</span>
<span>}</span>
<span># 通过 mysqlshell 连接 node2 节点报错 Error 2013,该节点状态变为 "(MISSING)"</span>
<span>-- node2 上实例完成重启后再次查看集群状态</span>
<span> MySQL 10.186.65.13:3306 ssl JS > cluster.status()</span>
<span>{</span>
<span> "clusterName": "mycluster", </span>
<span> "defaultReplicaSet": {</span>
<span> "name": "default", </span>
<span> "primary": "node1:3306", </span>
<span> "ssl": "DISABLED", </span>
<span> "status": "OK", </span>
<span> "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.", </span>
<span> "topology": {</span>
<span> "node1:3306": {</span>
<span> "address": "node1:3306", </span>
<span> "memberRole": "PRIMARY", </span>
<span> "mode": "R/W", </span>
<span> "readReplicas": {}, </span>
<span> "replicationLag": "applier_queue_applied", </span>
<span> "role": "HA", </span>
<span> "status": "ONLINE", </span>
<span> "version": "8.4.5"</span>
<span> }, </span>
<span> "node2:3306": {</span>
<span> "address": "node2:3306", </span>
<span> "memberRole": "SECONDARY", </span>
<span> "mode": "R/O", </span>
<span> "readReplicas": {}, </span>
<span> "replicationLag": "applier_queue_applied", </span>
<span> "role": "HA", </span>
<span> "status": "ONLINE", </span>
<span> "version": "8.4.5"</span>
<span> }, </span>
<span> "node3:3306": {</span>
<span> "address": "node3:3306", </span>
<span> "memberRole": "SECONDARY", </span>
<span> "mode": "R/O", </span>
<span> "readReplicas": {}, </span>
<span> "replicationLag": "applier_queue_applied", </span>
<span> "role": "HA", </span>
<span> "status": "ONLINE", </span>
<span> "version": "8.4.5"</span>
<span> }</span>
<span> }, </span>
<span> "topologyMode": "Single-Primary"</span>
<span> }, </span>
<span> "groupInformationSourceMember": "node1:3306"</span>
<span>}</span>
<span># node2 上的实例自动重启完成后,集群可自动恢复正常,无需额外处理</span>
</span>
小结
故障现象
-
重启期间 node2 状态显示为
<span>(MISSING)</span> -
集群状态变为
<span>OK_NO_TOLERANCE_PARTIAL</span> -
出现连接错误
<span>Lost connection to MySQL server</span>
恢复要点
-
从节点重启后,集群能够自动检测并重新加入
-
无需人工干预,自动恢复为
<span>ONLINE</span>状态 -
集群在单节点故障期间仍可正常提供服务
影响范围
-
重启期间集群容错能力降为 0
-
对应用读写无影响
-
短暂影响读负载均衡
场景 2:集群非主节点实例宕机
故障模拟:将 node3 从节点停止后再启动
<em><span>-- node3:</span></em>
<span>systemctl </span>
<span style="color:#c678dd"><span>stop</span></span>
<span> mysql</span>
<em><span>-- node1:</span></em>
<span>mysqlsh </span>
<em><span>--uri clusteruser@10.186.65.13:3306</span></em>
<span style="color:#c678dd"><span>var</span></span>
<span> cluster=dba.getCluster()</span>
<span>cluster.status()</span>
<em><span>-- node3:</span></em>
<span>systemctl </span>
<span style="color:#c678dd"><span>start</span></span>
<span> mysql</span>
恢复过程
<em><span>-- 集群初始状态</span></em>
<span>同上,略...</span>
<em><span>-- node3 上实例停止后查看集群状态</span></em>
<span> MySQL 10.186.65.13:3306 ssl JS > cluster.status()</span>
<span>{</span>
<span> "clusterName": "mycluster", </span>
<span> "defaultReplicaSet": {</span>
<span> "name": "default", </span>
<span> "primary": "node1:3306", </span>
<span> "ssl": "DISABLED", </span>
<span> "status": "OK_NO_TOLERANCE_PARTIAL", </span>
<span> "statusText": "Cluster is NOT tolerant to any failures. 1 member is not active.", </span>
<span> "topology": {</span>
<span> "node1:3306": {</span>
<span> "address": "node1:3306", </span>
<span> "memberRole": "PRIMARY", </span>
<span> "mode": "R/W", </span>
<span> "readReplicas": {}, </span>
<span> "replicationLag": "applier_queue_applied", </span>
<span> "role": "HA", </span>
<span> "status": "ONLINE", </span>
<span> "version": "8.4.5"</span>
<span> }, </span>
<span> "node2:3306": {</span>
<span> "address": "node2:3306", </span>
<span> "memberRole": "SECONDARY", </span>
<span> "mode": "R/O", </span>
<span> "readReplicas": {}, </span>
<span> "replicationLag": "applier_queue_applied", </span>
<span> "role": "HA", </span>
<span> "status": "ONLINE", </span>
<span> "version": "8.4.5"</span>
<span> }, </span>
<span> "node3:3306": {</span>
<span> "address": "node3:3306", </span>
<span> "memberRole": "SECONDARY", </span>
<span> "mode": "n/a", </span>
<span> "readReplicas": {}, </span>
<span> "role": "HA", </span>
<span> "shellConnectError": "MySQL Error 2013: Could not open connection to 'node3:3306': Lost connection to MySQL server at 'reading initial communication packet', system error: 104", </span>
<span> "status": "(MISSING)"</span>
<span> }</span>
<span> }, </span>
<span> "topologyMode": "Single-Primary"</span>
<span> }, </span>
<span> "groupInformationSourceMember": "node1:3306"</span>
<span>}</span>
<em><span># mysqlshell 连接 node3 上的实例报错,节点状态为 "(MISSING)"</span></em>
<em><span>-- node3 实例重新启动后再次查看集群状态</span></em>
<span> MySQL 10.186.65.13:3306 ssl JS > cluster.status()</span>
<span>{</span>
<span> "clusterName": "mycluster", </span>
<span> "defaultReplicaSet": {</span>
<span> "name": "default", </span>
<span> "primary": "node1:3306", </span>
<span> "ssl": "DISABLED", </span>
<span> "status": "OK", </span>
<span> "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.", </span>
<span> "topology": {</span>
<span> "node1:3306": {</span>
<span> "address": "node1:3306", </span>
<span> "memberRole": "PRIMARY", </span>
<span> "mode": "R/W", </span>
<span> "readReplicas": {}, </span>
<span> "replicationLag": "applier_queue_applied", </span>
<span> "role": "HA", </span>
<span> "status": "ONLINE", </span>
<span> "version": "8.4.5"</span>
<span> }, </span>
<span> "node2:3306": {</span>
<span> "address": "node2:3306", </span>
<span> "memberRole": "SECONDARY", </span>
<span> "mode": "R/O", </span>
<span> "readReplicas": {}, </span>
<span> "replicationLag": "applier_queue_applied", </span>
<span> "role": "HA", </span>
<span> "status": "ONLINE", </span>
<span> "version": "8.4.5"</span>
<span> }, </span>
<span> "node3:3306": {</span>
<span> "address": "node3:3306", </span>
<span> "memberRole": "SECONDARY", </span>
<span> "mode": "R/O", </span>
<span> "readReplicas": {}, </span>
<span> "replicationLag": "applier_queue_applied", </span>
<span> "role": "HA", </span>
<span> "status": "ONLINE", </span>
<span> "version": "8.4.5"</span>
<span> }</span>
<span> }, </span>
<span> "topologyMode": "Single-Primary"</span>
<span> }, </span>
<span> "groupInformationSourceMember": "node1:3306"</span>
<span>}</span>
<em><span># 将 node3 上宕机的实例启动后,集群可自动恢复正常,无需额外处理</span></em>
小结
故障现象
-
node3 状态显示为
<span>(MISSING)</span> -
集群状态为
<span>OK_NO_TOLERANCE_PARTIAL</span> -
出现连接错误
<span>Lost connection to MySQL server</span>
恢复要点
-
停止的从节点重新启动后自动恢复
-
集群自动重新同步数据
-
整个过程无需人工干预
影响范围
-
故障期间集群失去容错能力
-
读负载能力下降
-
主节点写入不受影响
场景 3:集群主节点与从节点之间网络割裂
故障模拟:断开 node1 主节点与其他从节点之间的网络
<em><span>-- node1:</span></em>
<span>./stop_net.sh</span>
<span>cat stop_net.sh</span>
<em><span>#!/bin/bash</span></em>
<span>iptables -A INPUT -s 10.186.65.18 -j </span>
<span style="color:#c678dd"><span>DROP</span></span>
<span>iptables -A </span>
<span style="color:#c678dd"><span>OUTPUT</span></span>
<span> -d </span>
<span style="color:#d19a66"><span>10.186</span></span>
<span style="color:#d19a66"><span>.65</span></span>
<span style="color:#d19a66"><span>.18</span></span>
<span> -j </span>
<span style="color:#c678dd"><span>DROP</span></span>
<span>iptables -A </span>
<span style="color:#c678dd"><span>INPUT</span></span>
<span> -s </span>
<span style="color:#d19a66"><span>10.186</span></span>
<span style="color:#d19a66"><span>.65</span></span>
<span style="color:#d19a66"><span>.24</span></span>
<span> -j </span>
<span style="color:#c678dd"><span>DROP</span></span>
<span>iptables -A </span>
<span style="color:#c678dd"><span>OUTPUT</span></span>
<span> -d </span>
<span style="color:#d19a66"><span>10.186</span></span>
<span style="color:#d19a66"><span>.65</span></span>
<span style="color:#d19a66"><span>.24</span></span>
<span> -j </span>
<span style="color:#c678dd"><span>DROP</span></span>
<span style="color:#c678dd"><span>sleep </span></span>
<span style="color:#d19a66"><span>30</span></span>
<span>iptables -F</span>
<em><span>-- node1:</span></em>
<span>mysqlsh </span>
<em><span>--uri clusteruser@10.186.65.13:3306</span></em>
<span>\js</span>
<span style="color:#c678dd"><span>var</span></span>
<span> cluster=dba.getCluster()</span>
<span>cluster.status()</span>
恢复过程
<em><span>-- 集群初始状态</span></em>
<span>同上,略...</span>
<em><span>-- 执行断网脚本期间集群状态</span></em>
<span> MySQL 10.186.65.13:3306 ssl JS > cluster.status()</span>
<span>{</span>
<span> "clusterName": "mycluster", </span>
<span> "defaultReplicaSet": {</span>
<span> "name": "default", </span>
<span> "primary": "node1:3306", </span>
<span> "ssl": "DISABLED", </span>
<span> "status": "NO_QUORUM", </span>
<span> "statusText": "Cluster has no quorum as visible from 'node1:3306' and cannot process write transactions. 2 members are not active.", </span>
<span> "topology": {</span>
<span> "node1:3306": {</span>
<span> "address": "node1:3306", </span>
<span> "memberRole": "PRIMARY", </span>
<span> "mode": "R/O", </span>
<span> "readReplicas": {}, </span>
<span> "replicationLag": "applier_queue_applied", </span>
<span> "role": "HA", </span>
<span> "status": "ONLINE", </span>
<span> "version": "8.4.5"</span>
<span> }, </span>
<span> "node2:3306": {</span>
<span> "address": "node2:3306", </span>
<span> "memberRole": "SECONDARY", </span>
<span> "memberState": "(MISSING)", </span>
<span> "mode": "n/a", </span>
<span> "readReplicas": {}, </span>
<span> "role": "HA", </span>
<span> "shellConnectError": "MySQL Error 2003: Could not open connection to 'node2:3306': Can't connect to MySQL server on 'node2:3306' (110)", </span>
<span> "status": "UNREACHABLE", </span>
<span> "version": "8.4.5"</span>
<span> }, </span>
<span> "node3:3306": {</span>
<span> "address": "node3:3306", </span>
<span> "memberRole": "SECONDARY", </span>
<span> "memberState": "(MISSING)", </span>
<span> "mode": "n/a", </span>
<span> "readReplicas": {}, </span>
<span> "role": "HA", </span>
<span> "shellConnectError": "MySQL Error 2003: Could not open connection to 'node3:3306': Can't connect to MySQL server on 'node3:3306' (110)", </span>
<span> "status": "UNREACHABLE", </span>
<span> "version": "8.4.5"</span>
<span> }</span>
<span> }, </span>
<span> "topologyMode": "Single-Primary"</span>
<span> }, </span>
<span> "groupInformationSourceMember": "node1:3306"</span>
<span>}</span>
<em><span># 此时,node2/3的memberState均为"(MISSING)",msyqlshell无法连接,节点status均为"UNREACHABLE"</span></em>
<span> MySQL 10.186.65.13:3306 ssl JS > cluster.status()</span>
<span>{</span>
<span> "clusterName": "mycluster", </span>
<span> "defaultReplicaSet": {</span>
<span> "name": "default", </span>
<span> "primary": "node1:3306", </span>
<span> "ssl": "DISABLED", </span>
<span> "status": "NO_QUORUM", </span>
<span> "statusText": "Cluster has no quorum as visible from 'node1:3306' and cannot process write transactions. 2 members are not active.", </span>
<span> "topology": {</span>
<span> "node1:3306": {</span>
<span> "address": "node1:3306", </span>
<span> "memberRole": "PRIMARY", </span>
<span> "mode": "R/O", </span>
<span> "readReplicas": {}, </span>
<span> "replicationLag": "applier_queue_applied", </span>
<span> "role": "HA", </span>
<span> "status": "ONLINE", </span>
<span> "version": "8.4.5"</span>
<span> }, </span>
<span> "node2:3306": {</span>
<span> "address": "node2:3306", </span>
<span> "memberRole": "SECONDARY", </span>
<span> "memberState": "(MISSING)", </span>
<span> "mode": "n/a", </span>
<span> "readReplicas": {}, </span>
<span> "role": "HA", </span>
<span> "shellConnectError": "MySQL Error 2003: Could not open connection to 'node2:3306': Can't connect to MySQL server on 'node2:3306' (110)", </span>
<span> "status": "UNREACHABLE", </span>
<span> "version": "8.4.5"</span>
<span> }, </span>
<span> "node3:3306": {</span>
<span> "address": "node3:3306", </span>
<span> "instanceErrors": [</span>
<span> "ERROR: split-brain! Instance is not part of the majority group, but has state ONLINE"</span>
<span> ], </span>
<span> "memberRole": "SECONDARY", </span>
<span> "memberState": "ONLINE", </span>
<span> "mode": "n/a", </span>
<span> "readReplicas": {}, </span>
<span> "role": "HA", </span>
<span> "status": "UNREACHABLE", </span>
<span> "version": "8.4.5"</span>
<span> }</span>
<span> }, </span>
<span> "topologyMode": "Single-Primary"</span>
<span> }, </span>
<span> "groupInformationSourceMember": "node1:3306"</span>
<span>}</span>
<em><span># 此时,集群状态略有不同,node2的memberState、status保持不变,而node3的memberState变为"ONLINE",出现了split-brain(脑裂)的提示</span></em>
<em><span>-- 同时在node2节点上查看集群状态</span></em>
<span> MySQL 10.186.65.18:3306 ssl JS > cluster.status()</span>
<span>{</span>
<span> "clusterName": "mycluster", </span>
<span> "defaultReplicaSet": {</span>
<span> "name": "default", </span>
<span> "primary": "node1:3306", </span>
<span> "ssl": "DISABLED", </span>
<span> "status": "OK", </span>
<span> "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.", </span>
<span> "topology": {</span>
<span> "node1:3306": {</span>
<span> "address": "node1:3306", </span>
<span> "memberRole": "PRIMARY", </span>
<span> "mode": "R/W", </span>
<span> "readReplicas": {}, </span>
<span> "replicationLag": "applier_queue_applied", </span>
<span> "role": "HA", </span>
<span> "status": "ONLINE", </span>
<span> "version": "8.4.5"</span>
<span> }, </span>
<span> "node2:3306": {</span>
<span> "address": "node2:3306", </span>
<span> "instanceErrors": [</span>
<span> "WARNING: Instance is NOT a PRIMARY but super_read_only option is OFF."</span>
<span> ], </span>
<span> "memberRole": "SECONDARY", </span>
<span> "mode": "R/W", </span>
<span> "readReplicas": {}, </span>
<span> "replicationLag": "applier_queue_applied", </span>
<span> "role": "HA", </span>
<span> "status": "ONLINE", </span>
<span> "version": "8.4.5"</span>
<span> }, </span>
<span> "node3:3306": {</span>
<span> "address": "node3:3306", </span>
<span> "memberRole": "SECONDARY", </span>
<span> "mode": "R/O", </span>
<span> "readReplicas": {}, </span>
<span> "replicationLag": "applier_queue_applied", </span>
<span> "role": "HA", </span>
<span> "status": "ONLINE", </span>
<span> "version": "8.4.5"</span>
<span> }</span>
<span> }, </span>
<span> "topologyMode": "Single-Primary"</span>
<span> }, </span>
<span> "groupInformationSourceMember": "node1:3306"</span>
<span>}</span>
<em><span># 此时,node2/3之间的网络是正常的,仅仅是node1与它们之间的网络发生割裂,整个集群依然是正常的,未发生切主</span></em>
<em><span>-- 断网恢复后,再次在node1上查看集群状态</span></em>
<span> MySQL 10.186.65.13:3306 ssl JS > cluster.status()</span>
<span>{</span>
<span> "clusterName": "mycluster", </span>
<span> "defaultReplicaSet": {</span>
<span> "name": "default", </span>
<span> "primary": "node1:3306", </span>
<span> "ssl": "DISABLED", </span>
<span> "status": "OK", </span>
<span> "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.", </span>
<span> "topology": {</span>
<span> "node1:3306": {</span>
<span> "address": "node1:3306", </span>
<span> "memberRole": "PRIMARY", </span>
<span> "mode": "R/W", </span>
<span> "readReplicas": {}, </span>
<span> "replicationLag": "applier_queue_applied", </span>
<span> "role": "HA", </span>
<span> "status": "ONLINE", </span>
<span> "version": "8.4.5"</span>
<span> }, </span>
<span> "node2:3306": {</span>
<span> "address": "node2:3306", </span>
<span> "memberRole": "SECONDARY", </span>
<span> "mode": "R/O", </span>
<span> "readReplicas": {}, </span>
<span> "replicationLag": "applier_queue_applied", </span>
<span> "role": "HA", </span>
<span> "status": "ONLINE", </span>
<span> "version": "8.4.5"</span>
<span> }, </span>
<span> "node3:3306": {</span>
<span> "address": "node3:3306", </span>
<span> "memberRole": "SECONDARY", </span>
<span> "mode": "R/O", </span>
<span> "readReplicas": {}, </span>
<span> "replicationLag": "applier_queue_applied", </span>
<span> "role": "HA", </span>
<span> "status": "ONLINE", </span>
<span> "version": "8.4.5"</span>
<span> }</span>
<span> }, </span>
<span> "topologyMode": "Single-Primary"</span>
<span> }, </span>
<span> "groupInformationSourceMember": "node1:3306"</span>
<span>}</span>
<em><span># 断网状态恢复后,node1上查看集群状态也恢复正常</span></em>
小结
故障现象
-
主节点显示
<span>NO_QUORUM</span>状态 -
从节点显示
<span>UNREACHABLE</span> -
可能出现脑裂警告
<span>split-brain! Instance is not part of the majority group</span>
恢复要点
-
网络恢复后集群自动重新建立连接
-
自动进行数据一致性检查
-
主从角色自动协商确定
影响范围
-
网络分区期间写入操作可能受阻
-
需要确保多数节点达成一致
-
可能产生数据冲突需要解决
场景 4:集群主节点重启
故障模拟:将 node1 主节点重启
<em><span>-- node1:</span></em>
<span>systemctl restart mysql</span>
<em><span>-- node1:</span></em>
<span>mysqlsh </span>
<em><span>--uri clusteruser@10.186.65.13:3306</span></em>
<span>\js</span>
<span>var cluster=dba.getCluster()</span>
<span>cluster.status()</span>
恢复过程
<em><span>-- node1 主节点重启时查看集群状态</span></em>
<span> MySQL 10.186.65.13:3306 ssl JS > cluster.status()</span>
<span>{</span>
<span> "clusterName": "mycluster", </span>
<span> "defaultReplicaSet": {</span>
<span> "name": "default", </span>
<span> "primary": "node2:3306", </span>
<span> "ssl": "DISABLED", </span>
<span> "status": "OK_NO_TOLERANCE_PARTIAL", </span>
<span> "statusText": "Cluster is NOT tolerant to any failures. 1 member is not active.", </span>
<span> "topology": {</span>
<span> "node1:3306": {</span>
<span> "address": "node1:3306", </span>
<span> "memberRole": "SECONDARY", </span>
<span> "mode": "n/a", </span>
<span> "readReplicas": {}, </span>
<span> "role": "HA", </span>
<span> "shellConnectError": "MySQL Error 2013: Could not open connection to 'node1:3306': Lost connection to MySQL server at 'reading initial communication packet', system error: 104", </span>
<span> "status": "(MISSING)"</span>
<span> }, </span>
<span> "node2:3306": {</span>
<span> "address": "node2:3306", </span>
<span> "memberRole": "PRIMARY", </span>
<span> "mode": "R/W", </span>
<span> "readReplicas": {}, </span>
<span> "replicationLag": "applier_queue_applied", </span>
<span> "role": "HA", </span>
<span> "status": "ONLINE", </span>
<span> "version": "8.4.5"</span>
<span> }, </span>
<span> "node3:3306": {</span>
<span> "address": "node3:3306", </span>
<span> "memberRole": "SECONDARY", </span>
<span> "mode": "R/O", </span>
<span> "readReplicas": {}, </span>
<span> "replicationLag": "applier_queue_applied", </span>
<span> "role": "HA", </span>
<span> "status": "ONLINE", </span>
<span> "version": "8.4.5"</span>
<span> }</span>
<span> }, </span>
<span> "topologyMode": "Single-Primary"</span>
<span> }, </span>
<span> "groupInformationSourceMember": "node2:3306"</span>
<span>}</span>
<em><span># 主节点重启后,mysqlshell连接该节点失败,发生了切主,其memberRole由"PRIMARY"变为"SECONDARY"</span></em>
<em><span>-- node1主节点完成重启后再次查看集群状态</span></em>
<span> MySQL 10.186.65.13:3306 ssl JS > cluster.status()</span>
<span>{</span>
<span> "clusterName": "mycluster", </span>
<span> "defaultReplicaSet": {</span>
<span> "name": "default", </span>
<span> "primary": "node2:3306", </span>
<span> "ssl": "DISABLED", </span>
<span> "status": "OK_NO_TOLERANCE_PARTIAL", </span>
<span> "statusText": "Cluster is NOT tolerant to any failures. 1 member is not active.", </span>
<span> "topology": {</span>
<span> "node1:3306": {</span>
<span> "address": "node1:3306", </span>
<span> "instanceErrors": [</span>
<span> "NOTE: group_replication is stopped."</span>
<span> ], </span>
<span> "memberRole": "SECONDARY", </span>
<span> "memberState": "OFFLINE", </span>
<span> "mode": "n/a", </span>
<span> "readReplicas": {}, </span>
<span> "role": "HA", </span>
<span> "status": "(MISSING)", </span>
<span> "version": "8.4.5"</span>
<span> }, </span>
<span> "node2:3306": {</span>
<span> "address": "node2:3306", </span>
<span> "memberRole": "PRIMARY", </span>
<span> "mode": "R/W", </span>
<span> "readReplicas": {}, </span>
<span> "replicationLag": "applier_queue_applied", </span>
<span> "role": "HA", </span>
<span> "status": "ONLINE", </span>
<span> "version": "8.4.5"</span>
<span> }, </span>
<span> "node3:3306": {</span>
<span> "address": "node3:3306", </span>
<span> "memberRole": "SECONDARY", </span>
<span> "mode": "R/O", </span>
<span> "readReplicas": {}, </span>
<span> "replicationLag": "applier_queue_applied", </span>
<span> "role": "HA", </span>
<span> "status": "ONLINE", </span>
<span> "version": "8.4.5"</span>
<span> }</span>
<span> }, </span>
<span> "topologyMode": "Single-Primary"</span>
<span> }, </span>
<span> "groupInformationSourceMember": "node2:3306"</span>
<span>}</span>
<em><span># node1主节点实例重启完成后,mysqlshell 恢复连接,memberState 变为 "OFFLINE",节点 status 变为 "(MISSING)",如未自动加入集群,需要手动执行 rejoinInstance 将其重新加入集群</span></em>
小结
故障现象
-
主节点重启期间自动故障转移
-
node2 被选举为新的主节点
-
原主节点重启后状态为
<span>(MISSING)</span>
恢复要点
-
集群自动完成主节点切换
-
原主节点需要手动或自动重新加入
-
确保应用能够感知主节点变更
影响范围 :
-
主节点切换期间短暂写入中断
-
应用连接可能需要重连
-
原主节点重新加入期间集群容错能力下降
场景 5:集群所有节点同时重启后丢失 Quorum
故障模拟:同时将集群内所有节点重启
<em><span>-- node1 & node2 & node3</span></em>
<span>./restart_node.sh</span>
<span>cat restart_node.sh</span>
<em><span>#!/bin/bash</span></em>
<span>systemctl </span>
<span style="color:#c678dd"><span>stop</span></span>
<span> mysql</span>
<span style="color:#c678dd"><span>sleep </span></span>
<span style="color:#d19a66"><span>30</span></span>
<span>systemctl </span>
<span style="color:#c678dd"><span>start</span></span>
<span> mysql</span>
<em><span>-- node1</span></em>
<span>cluster.status()</span>
<span style="color:#c678dd"><span>var</span></span>
<span> cluster=dba.getCluster()</span>
<span>\</span>
<span style="color:#c678dd"><span>sql</span></span>
<span style="color:#c678dd"><span>show </span></span>
<span style="color:#e6c07b"><span>binary </span></span>
<span style="color:#c678dd"><span>log </span></span>
<span style="color:#c678dd"><span>status</span></span>
<span>;</span>
<span style="color:#c678dd"><span>set </span></span>
<span style="color:#c678dd"><span>global</span></span>
<span> group_replication_bootstrap_group=</span>
<span style="color:#c678dd"><span>on</span></span>
<span>;</span>
<span style="color:#c678dd"><span>start</span></span>
<span> group_replication;</span>
<span style="color:#c678dd"><span>set </span></span>
<span style="color:#c678dd"><span>global</span></span>
<span> group_replication_bootstrap_group=</span>
<span style="color:#c678dd"><span>off</span></span>
<span>;</span>
<span>\js</span>
<span>var cluster=dba.getCluster()</span>
<span>cluster.status()</span>
<span>cluster.rejoinInstance("clusteruser@node2:3306")</span>
<span>cluster.rejoinInstance("clusteruser@node3:3306")</span>
<span>cluster.status()</span>
<em><span># PS:如仍然无法成功执行,可能需要先执行rescan,再执行addInstance将故障节点重新加回集群</span></em>
<span>cluster.rescan()</span>
<span>cluster.addInstance(clusteruser@node2:3306)</span>
恢复过程
<em><span>-- 同时将3个节点上的实例重启,重启完成后查看所有节点上mysqld进程</span></em>
<span>[root@node1 ~]</span>
<em><span># ./restart_node.sh</span></em>
<span>[root@node1 ~]</span>
<em><span># ps -ef|grep mysql</span></em>
<span>mysqlro+ 65096 1 3 Nov19 ? 01:36:34 /usr/bin/mysqlrouter -c /mysql_router/mysqlrouter.conf</span>
<span>root 77313 77260 0 13:48 pts/0 00:00:02 mysqlsh </span>
<em><span>--uri clusteruser@10.186.65.13:3306</span></em>
<span>mysql 77963 1 4 15:45 ? 00:00:01 /usr/sbin/mysqld</span>
<span>root 78021 77883 0 15:45 pts/2 00:00:00 grep </span>
<em><span>--color=auto mysql</span></em>
<span>[root@node1 ~]</span>
<em><span>#</span></em>
<span>[root@node2 ~]</span>
<em><span># ./restart_node.sh</span></em>
<span>[root@node2 ~]</span>
<em><span># ps -ef|grep mysql</span></em>
<span>root 27868 27780 0 15:29 pts/0 00:00:01 mysqlsh </span>
<em><span>--uri clusteruser@10.186.65.18:3306</span></em>
<span>mysql 28001 1 19 15:45 ? 00:00:03 /usr/sbin/mysqld</span>
<span>root 28078 27924 0 15:45 pts/2 00:00:00 grep </span>
<em><span>--color=auto mysql</span></em>
<span>[root@node2 ~]</span>
<em><span>#</span></em>
<span>[root@node3 ~]</span>
<em><span># ./restart_node.sh</span></em>
<span>[root@node3 ~]</span>
<em><span># ps -ef|grep mysql</span></em>
<span>root 127839 127035 0 15:29 pts/0 00:00:00 mysqlsh </span>
<em><span>--uri clusteruser@10.186.65.24:3306</span></em>
<span>mysql 128073 1 10 15:44 ? 00:00:02 /usr/sbin/mysqld</span>
<span>root 128153 127992 0 15:44 pts/2 00:00:00 grep </span>
<em><span>--color=auto mysql</span></em>
<span>[root@node3 ~]</span>
<em><span>#</span></em>
<em><span>-- 重启集群所有节点时,无法查看集群状态</span></em>
<span> MySQL 10.186.65.13:3306 ssl JS > cluster.status()</span>
<span>Cluster.status: The cluster object is disconnected. Please </span>
<span style="color:#c678dd"><span>use</span></span>
<span> dba.getCluster() </span>
<span style="color:#c678dd"><span>to</span></span>
<span> obtain a </span>
<span style="color:#c678dd"><span>fresh</span></span>
<span> cluster handle. (RuntimeError)</span>
<span> MySQL </span>
<span style="color:#d19a66"><span>10.186</span></span>
<span style="color:#d19a66"><span>.65</span></span>
<span style="color:#d19a66"><span>.13</span></span>
<span>:</span>
<span style="color:#d19a66"><span>3306</span></span>
<span> ssl JS > </span>
<span style="color:#c678dd"><span>var</span></span>
<span> cluster=dba.getCluster()</span>
<span>Dba.getCluster: This </span>
<span style="color:#c678dd"><span>function</span></span>
<span style="color:#c678dd"><span>is</span></span>
<span style="color:#c678dd"><span>not</span></span>
<span> available </span>
<span style="color:#c678dd"><span>through</span></span>
<span> a </span>
<span style="color:#c678dd"><span>session </span></span>
<span style="color:#c678dd"><span>to</span></span>
<span> a </span>
<span style="color:#c678dd"><span>standalone </span></span>
<span style="color:#c678dd"><span>instance</span></span>
<span> (metadata </span>
<span style="color:#c678dd"><span>exists</span></span>
<span>, </span>
<span style="color:#c678dd"><span>instance</span></span>
<span> belongs </span>
<span style="color:#c678dd"><span>to</span></span>
<span> that metadata, but GR </span>
<span style="color:#c678dd"><span>is</span></span>
<span style="color:#c678dd"><span>not</span></span>
<span> active) (MYSQLSH </span>
<span style="color:#d19a66"><span>51314</span></span>
<span>)</span>
<em><span>-- 确认 GTID 最全的节点(假设其为 node1 旧主节点)</span></em>
<span> MySQL </span>
<span style="color:#d19a66"><span>10.186</span></span>
<span style="color:#d19a66"><span>.65</span></span>
<span style="color:#d19a66"><span>.13</span></span>
<span>:</span>
<span style="color:#d19a66"><span>3306</span></span>
<span> ssl </span>
<span style="color:#c678dd"><span>SQL</span></span>
<span> > </span>
<span style="color:#c678dd"><span>show </span></span>
<span style="color:#e6c07b"><span>binary </span></span>
<span style="color:#c678dd"><span>log </span></span>
<span style="color:#c678dd"><span>status</span></span>
<span>;</span>
<span>ERROR: 2013 (HY000): Lost connection to MySQL server during query</span>
<span>The global session got disconnected..</span>
<span>Attempting to reconnect to 'mysql://clusteruser@10.186.65.13:3306'..</span>
<span>The global session was successfully reconnected.</span>
<span> MySQL 10.186.65.13:3306 ssl SQL > </span>
<span style="color:#c678dd"><span>show </span></span>
<span style="color:#e6c07b"><span>binary </span></span>
<span style="color:#c678dd"><span>log </span></span>
<span style="color:#c678dd"><span>status</span></span>
<span>;</span>
<span>+</span>
<em><span>------------------+----------+--------------+------------------+---------------------------------------------+</span></em>
<span>| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |</span>
<span>+</span>
<em><span>------------------+----------+--------------+------------------+---------------------------------------------+</span></em>
<span>| mysql_bin.000003 | 194 | | | aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee:1-2540 |</span>
<span>+</span>
<em><span>------------------+----------+--------------+------------------+---------------------------------------------+</span></em>
<span>1 row in </span>
<span style="color:#c678dd"><span>set</span></span>
<span> (</span>
<span style="color:#d19a66"><span>0.0001</span></span>
<span> sec)</span>
<em><span>-- 将主节点作为组复制引导节点并启动组复制</span></em>
<span> MySQL </span>
<span style="color:#d19a66"><span>10.186</span></span>
<span style="color:#d19a66"><span>.65</span></span>
<span style="color:#d19a66"><span>.13</span></span>
<span>:</span>
<span style="color:#d19a66"><span>3306</span></span>
<span> ssl </span>
<span style="color:#c678dd"><span>SQL</span></span>
<span> > </span>
<span style="color:#c678dd"><span>set </span></span>
<span style="color:#c678dd"><span>global</span></span>
<span> group_replication_bootstrap_group=</span>
<span style="color:#c678dd"><span>on</span></span>
<span>;</span>
<span>Query OK, 0 rows affected (0.0006 sec)</span>
<span> MySQL 10.186.65.13:3306 ssl SQL > </span>
<span style="color:#c678dd"><span>start</span></span>
<span> group_replication;</span>
<span>Query OK, 0 rows affected (1.1069 sec)</span>
<span> MySQL 10.186.65.13:3306 ssl SQL > </span>
<span style="color:#c678dd"><span>set </span></span>
<span style="color:#c678dd"><span>global</span></span>
<span> group_replication_bootstrap_group=</span>
<span style="color:#c678dd"><span>off</span></span>
<span>;</span>
<span>Query OK, 0 rows affected (0.0004 sec)</span>
<em><span>-- 再次查看集群状态</span></em>
<span> MySQL 10.186.65.13:3306 ssl SQL > \js</span>
<span>Switching to JavaScript mode...</span>
<span> MySQL 10.186.65.13:3306 ssl JS > var cluster=dba.getCluster()</span>
<span> MySQL 10.186.65.13:3306 ssl JS > cluster.status()</span>
<span>{</span>
<span> "clusterName": "mycluster", </span>
<span> "defaultReplicaSet": {</span>
<span> "name": "default", </span>
<span> "primary": "node1:3306", </span>
<span> "ssl": "DISABLED", </span>
<span> "status": "OK_NO_TOLERANCE_PARTIAL", </span>
<span> "statusText": "Cluster is NOT tolerant to any failures. 2 members are not active.", </span>
<span> "topology": {</span>
<span> "node1:3306": {</span>
<span> "address": "node1:3306", </span>
<span> "memberRole": "PRIMARY", </span>
<span> "mode": "R/W", </span>
<span> "readReplicas": {}, </span>
<span> "replicationLag": "applier_queue_applied", </span>
<span> "role": "HA", </span>
<span> "status": "ONLINE", </span>
<span> "version": "8.4.5"</span>
<span> }, </span>
<span> "node2:3306": {</span>
<span> "address": "node2:3306", </span>
<span> "instanceErrors": [</span>
<span> "NOTE: group_replication is stopped."</span>
<span> ], </span>
<span> "memberRole": "SECONDARY", </span>
<span> "memberState": "OFFLINE", </span>
<span> "mode": "n/a", </span>
<span> "readReplicas": {}, </span>
<span> "role": "HA", </span>
<span> "status": "(MISSING)", </span>
<span> "version": "8.4.5"</span>
<span> }, </span>
<span> "node3:3306": {</span>
<span> "address": "node3:3306", </span>
<span> "instanceErrors": [</span>
<span> "NOTE: group_replication is stopped."</span>
<span> ], </span>
<span> "memberRole": "SECONDARY", </span>
<span> "memberState": "OFFLINE", </span>
<span> "mode": "n/a", </span>
<span> "readReplicas": {}, </span>
<span> "role": "HA", </span>
<span> "status": "(MISSING)", </span>
<span> "version": "8.4.5"</span>
<span> }</span>
<span> }, </span>
<span> "topologyMode": "Single-Primary"</span>
<span> }, </span>
<span> "groupInformationSourceMember": "node1:3306"</span>
<span>}</span>
<span> MySQL 10.186.65.13:3306 ssl JS > </span>
<em><span># 此时,其他两个从节点的memberState均为"OFFLINE",status均为"(MISSING)",需要执行rejoinInstance,将它们重新加入集群</span></em>
<em><span>-- 依次将其他节点加入集群</span></em>
<span> MySQL 10.186.65.13:3306 ssl JS > cluster.rejoinInstance("clusteruser@node2:3306")</span>
<span>The safest and most convenient way to provision a new instance is through automatic clone provisioning, which will completely overwrite the state of 'node2:3306' </span>
<span style="color:#c678dd"><span>with</span></span>
<span> a </span>
<span style="color:#c678dd"><span>physical </span></span>
<span style="color:#c678dd"><span>snapshot </span></span>
<span style="color:#c678dd"><span>from</span></span>
<span> an existing cluster member. </span>
<span style="color:#c678dd"><span>To </span></span>
<span style="color:#c678dd"><span>use</span></span>
<span> this method </span>
<span style="color:#c678dd"><span>by </span></span>
<span style="color:#c678dd"><span>default</span></span>
<span>, </span>
<span style="color:#c678dd"><span>set</span></span>
<span> the </span>
<span style="color:#98c379"><span>'recoveryMethod' </span></span>
<span style="color:#c678dd"><span>option</span></span>
<span style="color:#c678dd"><span>to </span></span>
<span style="color:#98c379"><span>'clone'</span></span>
<span>.</span>
<span>The incremental state </span>
<span style="color:#c678dd"><span>recovery</span></span>
<span> may be safely used </span>
<span style="color:#c678dd"><span>if</span></span>
<span> you </span>
<span style="color:#c678dd"><span>are</span></span>
<span> sure </span>
<span style="color:#c678dd"><span>all</span></span>
<span> updates ever executed </span>
<span style="color:#c678dd"><span>in</span></span>
<span> the cluster were done </span>
<span style="color:#c678dd"><span>with</span></span>
<span> GTIDs enabled, there </span>
<span style="color:#c678dd"><span>are </span></span>
<span style="color:#c678dd"><span>no</span></span>
<span> purged transactions </span>
<span style="color:#c678dd"><span>and</span></span>
<span> the </span>
<span style="color:#c678dd"><span>new </span></span>
<span style="color:#c678dd"><span>instance</span></span>
<span> contains the same GTID </span>
<span style="color:#c678dd"><span>set</span></span>
<span style="color:#c678dd"><span>as</span></span>
<span> the cluster </span>
<span style="color:#c678dd"><span>or</span></span>
<span> a subset </span>
<span style="color:#c678dd"><span>of</span></span>
<span> it. </span>
<span style="color:#c678dd"><span>To</span></span>
<span style="color:#c678dd"><span>use</span></span>
<span> this method </span>
<span style="color:#c678dd"><span>by </span></span>
<span style="color:#c678dd"><span>default</span></span>
<span>, </span>
<span style="color:#c678dd"><span>set</span></span>
<span> the </span>
<span style="color:#98c379"><span>'recoveryMethod' </span></span>
<span style="color:#c678dd"><span>option</span></span>
<span style="color:#c678dd"><span>to </span></span>
<span style="color:#98c379"><span>'incremental'</span></span>
<span>.</span>
<span>Incremental state </span>
<span style="color:#c678dd"><span>recovery</span></span>
<span> was selected because it seems </span>
<span style="color:#c678dd"><span>to</span></span>
<span> be safely usable.</span>
<span>Validating </span>
<span style="color:#c678dd"><span>instance</span></span>
<span> configuration </span>
<span style="color:#c678dd"><span>at</span></span>
<span> node2:</span>
<span style="color:#d19a66"><span>3306.</span></span>
<span>..</span>
<span>This </span>
<span style="color:#c678dd"><span>instance</span></span>
<span> reports its own address </span>
<span style="color:#c678dd"><span>as</span></span>
<span> node2:</span>
<span style="color:#d19a66"><span>3306</span></span>
<span style="color:#c678dd"><span>Instance</span></span>
<span> configuration </span>
<span style="color:#c678dd"><span>is</span></span>
<span> suitable.</span>
<span>Rejoining </span>
<span style="color:#c678dd"><span>instance</span></span>
<span style="color:#98c379"><span>'node2:3306'</span></span>
<span style="color:#c678dd"><span>to</span></span>
<span> cluster </span>
<span style="color:#98c379"><span>'mycluster'</span></span>
<span>...</span>
<span style="color:#c678dd"><span>Monitoring </span></span>
<span style="color:#c678dd"><span>recovery</span></span>
<span> process </span>
<span style="color:#c678dd"><span>of</span></span>
<span> the </span>
<span style="color:#c678dd"><span>new</span></span>
<span> cluster member. Press ^C </span>
<span style="color:#c678dd"><span>to </span></span>
<span style="color:#c678dd"><span>stop </span></span>
<span style="color:#c678dd"><span>monitoring </span></span>
<span style="color:#c678dd"><span>and</span></span>
<span> let it continue </span>
<span style="color:#c678dd"><span>in</span></span>
<span> background.</span>
<span>State </span>
<span style="color:#c678dd"><span>recovery</span></span>
<span> already finished </span>
<span style="color:#c678dd"><span>for</span></span>
<span style="color:#98c379"><span>'node2:3306'</span></span>
<span>The </span>
<span style="color:#c678dd"><span>instance</span></span>
<span style="color:#98c379"><span>'node2:3306'</span></span>
<span> was successfully rejoined </span>
<span style="color:#c678dd"><span>to</span></span>
<span> the cluster.</span>
<span> MySQL </span>
<span style="color:#d19a66"><span>10.186</span></span>
<span style="color:#d19a66"><span>.65</span></span>
<span style="color:#d19a66"><span>.13</span></span>
<span>:</span>
<span style="color:#d19a66"><span>3306</span></span>
<span> ssl JS > cluster.rejoinInstance(</span>
<span style="color:#98c379"><span>"clusteruser@node3:3306"</span></span>
<span>)</span>
<span>The safest </span>
<span style="color:#c678dd"><span>and</span></span>
<span> most convenient way </span>
<span style="color:#c678dd"><span>to</span></span>
<span> provision a </span>
<span style="color:#c678dd"><span>new </span></span>
<span style="color:#c678dd"><span>instance </span></span>
<span style="color:#c678dd"><span>is </span></span>
<span style="color:#c678dd"><span>through </span></span>
<span style="color:#c678dd"><span>auto matic </span></span>
<span style="color:#c678dd"><span>clone</span></span>
<span> provisioning, which will completely overwrite the state </span>
<span style="color:#c678dd"><span>of</span></span>
<span style="color:#98c379"><span>'node3:3306'</span></span>
<span style="color:#c678dd"><span>with</span></span>
<span> a </span>
<span style="color:#c678dd"><span>physical </span></span>
<span style="color:#c678dd"><span>snapshot </span></span>
<span style="color:#c678dd"><span>from</span></span>
<span> an existing cluster member. </span>
<span style="color:#c678dd"><span>To </span></span>
<span style="color:#c678dd"><span>use</span></span>
<span> this method </span>
<span style="color:#c678dd"><span>by </span></span>
<span style="color:#c678dd"><span>default</span></span>
<span>, </span>
<span style="color:#c678dd"><span>set</span></span>
<span> the </span>
<span style="color:#98c379"><span>'recoveryMethod' </span></span>
<span style="color:#c678dd"><span>option</span></span>
<span style="color:#c678dd"><span>to </span></span>
<span style="color:#98c379"><span>'clone'</span></span>
<span>.</span>
<span>The incremental state </span>
<span style="color:#c678dd"><span>recovery</span></span>
<span> may be safely used </span>
<span style="color:#c678dd"><span>if</span></span>
<span> you </span>
<span style="color:#c678dd"><span>are</span></span>
<span> sure </span>
<span style="color:#c678dd"><span>all</span></span>
<span> updates ever executed </span>
<span style="color:#c678dd"><span>in</span></span>
<span> the cluster were done </span>
<span style="color:#c678dd"><span>with</span></span>
<span> GTIDs enabled, there </span>
<span style="color:#c678dd"><span>are</span></span>
<span style="color:#c678dd"><span>no</span></span>
<span> purged transactions </span>
<span style="color:#c678dd"><span>and</span></span>
<span> the </span>
<span style="color:#c678dd"><span>new </span></span>
<span style="color:#c678dd"><span>instance</span></span>
<span> contains the same GTID </span>
<span style="color:#c678dd"><span>set</span></span>
<span style="color:#c678dd"><span>as</span></span>
<span> the cluster </span>
<span style="color:#c678dd"><span>or</span></span>
<span> a subset </span>
<span style="color:#c678dd"><span>of</span></span>
<span> it. </span>
<span style="color:#c678dd"><span>To </span></span>
<span style="color:#c678dd"><span>use</span></span>
<span> this method </span>
<span style="color:#c678dd"><span>by </span></span>
<span style="color:#c678dd"><span>default</span></span>
<span>, </span>
<span style="color:#c678dd"><span>set</span></span>
<span> the </span>
<span style="color:#98c379"><span>'recoveryMethod' </span></span>
<span style="color:#c678dd"><span>option </span></span>
<span style="color:#c678dd"><span>to </span></span>
<span style="color:#98c379"><span>'incremental'</span></span>
<span>.</span>
<span>Incremental state </span>
<span style="color:#c678dd"><span>recovery</span></span>
<span> was selected because it seems </span>
<span style="color:#c678dd"><span>to</span></span>
<span> be safely usable.</span>
<span>Validating </span>
<span style="color:#c678dd"><span>instance</span></span>
<span> configuration </span>
<span style="color:#c678dd"><span>at</span></span>
<span> node3:</span>
<span style="color:#d19a66"><span>3306.</span></span>
<span>..</span>
<span>This </span>
<span style="color:#c678dd"><span>instance</span></span>
<span> reports its own address </span>
<span style="color:#c678dd"><span>as</span></span>
<span> node3:</span>
<span style="color:#d19a66"><span>3306</span></span>
<span style="color:#c678dd"><span>Instance</span></span>
<span> configuration </span>
<span style="color:#c678dd"><span>is</span></span>
<span> suitable.</span>
<span>Rejoining </span>
<span style="color:#c678dd"><span>instance</span></span>
<span style="color:#98c379"><span>'node3:3306'</span></span>
<span style="color:#c678dd"><span>to</span></span>
<span> cluster </span>
<span style="color:#98c379"><span>'mycluster'</span></span>
<span>...</span>
<span style="color:#c678dd"><span>Monitoring </span></span>
<span style="color:#c678dd"><span>recovery</span></span>
<span> process </span>
<span style="color:#c678dd"><span>of</span></span>
<span> the </span>
<span style="color:#c678dd"><span>new</span></span>
<span> cluster member. Press ^C </span>
<span style="color:#c678dd"><span>to </span></span>
<span style="color:#c678dd"><span>stop </span></span>
<span style="color:#c678dd"><span>monitoring</span></span>
<span style="color:#c678dd"><span>and</span></span>
<span> let it continue </span>
<span style="color:#c678dd"><span>in</span></span>
<span> background.</span>
<span>State </span>
<span style="color:#c678dd"><span>recovery</span></span>
<span> already finished </span>
<span style="color:#c678dd"><span>for </span></span>
<span style="color:#98c379"><span>'node3:3306'</span></span>
<span>The </span>
<span style="color:#c678dd"><span>instance</span></span>
<span style="color:#98c379"><span>'node3:3306'</span></span>
<span> was successfully rejoined </span>
<span style="color:#c678dd"><span>to</span></span>
<span> the cluster.</span>
<em><span>-- 再次查看集群状态</span></em>
<span> MySQL </span>
<span style="color:#d19a66"><span>10.186</span></span>
<span style="color:#d19a66"><span>.65</span></span>
<span style="color:#d19a66"><span>.13</span></span>
<span>:</span>
<span style="color:#d19a66"><span>3306</span></span>
<span> ssl JS > cluster.status()</span>
<span>{</span>
<span> </span>
<span style="color:#98c379"><span>"clusterName"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"mycluster"</span></span>
<span>, </span>
<span> </span>
<span style="color:#98c379"><span>"defaultReplicaSet"</span></span>
<span>: {</span>
<span> </span>
<span style="color:#98c379"><span>"name"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"default"</span></span>
<span>, </span>
<span> </span>
<span style="color:#98c379"><span>"primary"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"node1:3306"</span></span>
<span>, </span>
<span> </span>
<span style="color:#98c379"><span>"ssl"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"DISABLED"</span></span>
<span>, </span>
<span> </span>
<span style="color:#98c379"><span>"status"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"OK"</span></span>
<span>, </span>
<span> </span>
<span style="color:#98c379"><span>"statusText"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"Cluster is ONLINE and can tolerate up to ONE failure."</span></span>
<span>, </span>
<span> </span>
<span style="color:#98c379"><span>"topology"</span></span>
<span>: {</span>
<span> </span>
<span style="color:#98c379"><span>"node1:3306"</span></span>
<span>: {</span>
<span> </span>
<span style="color:#98c379"><span>"address"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"node1:3306"</span></span>
<span>, </span>
<span> </span>
<span style="color:#98c379"><span>"memberRole"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"PRIMARY"</span></span>
<span>, </span>
<span> </span>
<span style="color:#98c379"><span>"mode"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"R/W"</span></span>
<span>, </span>
<span> </span>
<span style="color:#98c379"><span>"readReplicas"</span></span>
<span>: {}, </span>
<span> </span>
<span style="color:#98c379"><span>"replicationLag"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"applier_queue_applied"</span></span>
<span>, </span>
<span> </span>
<span style="color:#98c379"><span>"role"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"HA"</span></span>
<span>, </span>
<span> </span>
<span style="color:#98c379"><span>"status"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"ONLINE"</span></span>
<span>, </span>
<span> </span>
<span style="color:#98c379"><span>"version"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"8.4.5"</span></span>
<span> }, </span>
<span> </span>
<span style="color:#98c379"><span>"node2:3306"</span></span>
<span>: {</span>
<span> </span>
<span style="color:#98c379"><span>"address"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"node2:3306"</span></span>
<span>, </span>
<span> </span>
<span style="color:#98c379"><span>"memberRole"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"SECONDARY"</span></span>
<span>, </span>
<span> </span>
<span style="color:#98c379"><span>"mode"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"R/O"</span></span>
<span>, </span>
<span> </span>
<span style="color:#98c379"><span>"readReplicas"</span></span>
<span>: {}, </span>
<span> </span>
<span style="color:#98c379"><span>"replicationLag"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"applier_queue_applied"</span></span>
<span>, </span>
<span> </span>
<span style="color:#98c379"><span>"role"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"HA"</span></span>
<span>, </span>
<span> </span>
<span style="color:#98c379"><span>"status"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"ONLINE"</span></span>
<span>, </span>
<span> </span>
<span style="color:#98c379"><span>"version"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"8.4.5"</span></span>
<span> }, </span>
<span> </span>
<span style="color:#98c379"><span>"node3:3306"</span></span>
<span>: {</span>
<span> </span>
<span style="color:#98c379"><span>"address"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"node3:3306"</span></span>
<span>, </span>
<span> </span>
<span style="color:#98c379"><span>"memberRole"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"SECONDARY"</span></span>
<span>, </span>
<span> </span>
<span style="color:#98c379"><span>"mode"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"R/O"</span></span>
<span>, </span>
<span> </span>
<span style="color:#98c379"><span>"readReplicas"</span></span>
<span>: {}, </span>
<span> </span>
<span style="color:#98c379"><span>"replicationLag"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"applier_queue_applied"</span></span>
<span>, </span>
<span> </span>
<span style="color:#98c379"><span>"role"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"HA"</span></span>
<span>, </span>
<span> </span>
<span style="color:#98c379"><span>"status"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"ONLINE"</span></span>
<span>, </span>
<span> </span>
<span style="color:#98c379"><span>"version"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"8.4.5"</span></span>
<span> }</span>
<span> }, </span>
<span> </span>
<span style="color:#98c379"><span>"topologyMode"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"Single-Primary"</span></span>
<span> }, </span>
<span> </span>
<span style="color:#98c379"><span>"groupInformationSourceMember"</span></span>
<span>: </span>
<span style="color:#98c379"><span>"node1:3306"</span></span>
<span>}</span>
<span> MySQL </span>
<span style="color:#d19a66"><span>10.186</span></span>
<span style="color:#d19a66"><span>.65</span></span>
<span style="color:#d19a66"><span>.13</span></span>
<span>:</span>
<span style="color:#d19a66"><span>3306</span></span>
<span> ssl JS ></span>
<em><span># 实例重启后,组复制默认不会开启,可能会缺少仲裁节点,导致无法查看集群状态,需要先配置一个主引导节点后启动组复制,并将其余节点通过rejoinInstance的方式加入集群</span></em>
<em><span># 如果要对集群所在服务器进行计划内的Linux内核升级等运维操作时,建议采用滚动方式依次对各个节点上的实例进行重启,应尽量避免所有节点同时发生重启</span></em>
小结
故障现象
-
所有节点重启后集群可能无法自动恢复
-
出现
<span>The cluster object is disconnected</span>错误 -
组复制服务未自动启动
恢复要点
-
需要手动选择 GTID 最全的节点作为引导节点
-
使用
<span>group_replication_bootstrap_group</span>参数引导集群 -
通过
<span>rejoinInstance</span>重新加入其他节点
影响范围
-
整个集群服务中断
-
需要人工干预恢复
-
恢复时间相对较长
总结
本文对 MIC 在实际使用过程中出现的常见故障场景进行了一些测试,旨在使 MySQL 运维人员可以更好地理解其故障特性和恢复机制,提高应对各种异常情况的能力,确保数据库服务的高可用性和数据安全性。
其中,故障场景主要分为两类:
-
自动化恢复能力 :
-
单节点故障(重启、宕机)集群具备自动恢复能力
-
网络分区恢复后集群能够自动重新建立一致性
-
主节点故障时自动进行故障转移
-
-
需要人工干预的场景 :
-
所有节点同时重启导致仲裁丢失
-
集群引导节点选择不当
-
数据不一致需要手动修复
-
参考资料
[1] MySQL InnoDB Cluster: https://dev.mysql.com/doc/mysql-shell/8.4/en/monitoring-innodb-cluster.html
[2] MySQL Group Replication: https://dev.mysql.com/doc/en/group-replication.html
[3] MySQL Shell: https://dev.mysql.com/doc/mysql-shell/8.4/en/rejoin-cluster.html
</div>