GitHub 是如何做好 MySQL 的高可用性的？
- 这个方案能否在夸数据中心实现，以及如何实现的？ 在不同的网络状况下会怎么样，延迟高，或延迟低的情况会怎么样？
- 这个解决方案能否承受整个数据中心（DC）的故障 或者网络隔离的情况？
VIP需要协商：虚拟IP由数据库本身所持有。 服务器必须发送ARP请求，才能够占有或释放VIP。 在新的数据库分配新的VIP之前，旧的服务器必须先释放其占有的VIP。这个过程会产生一些异常问题:
- 故障转移的顺序，首先是请求故障机器释放VIP，然后联系新的主库机器分配VIP。但是，如果故障机器本身不能访问，或者说拒绝释放VIP呢？ 考虑到机器故障的场景，故障机器不会立即响应或根本就不会响应释放VIP的请求，整个过程有下面两个问题：
- 即使VIP重新分配, 客户端已有的连接不会自动断开旧的故障机器，从而使得整个系统产生脑裂的情况。
In parts of our setup VIPs are bound by physical location. They are owned by a switch or a router. Thus, we can only reassign the VIPs onto co-located servers. In particular, in some cases we cannot assign the VIP to a server promoted in a different data center, and must make a DNS change.
- DNS changes take longer to propagate. Clients cache DNS names for a preconfigured time. A cross-DC failover implies more outage time: it will take more time to make all clients aware of the identity of the new master.
These limitations alone were enough to push us in search of a new solution, but for even more consideration were:
- Masters were self-injecting themselves with heartbeats via the
pt-heartbeatservice, for the purpose of lag measurement and throttling control. The service had to be kicked off on the newly promoted master. If possible, the service would be shut down on the old master.
- Likewise, Pseudo-GTID injection was self-managed by the masters. It would need to kick off on the new master, and preferably stop on the old master.
- The new master was set as writable. The old master was to be set as
read_only, if possible.
These extra steps were a contributing factor to the total outage time and introduced their own failures and friction.
The solution worked, and GitHub has had successful MySQL failovers that went well under the radar, but we wanted our HA to improve on the following:
- Be data center agnostic.
- Be tolerant of data center failure.
- Remove unreliable cooperative workflows.
- Reduce total outage time.
- As much as possible, have lossless failovers.
GitHub’s HA solution: orchestrator, Consul, GLB
Our new strategy, along with collateral improvements, solves or mitigates much of the concerns above. In today’s HA setup, we have:
- orchestrator to run detection and failovers. We use a cross-DC orchestrator/raft setup as depicted below.
- Hashicorp’s Consul for service discovery.
- GLB/HAProxy as a proxy layer between clients and writer nodes. Our GLB director is open sourced.
anycastfor network routing.
The new setup removes VIP and DNS changes altogether. And while we introduce more components, we are able to decouple the components and simplify the task, as well as be able to utilize solid and stable solutions. A breakdown follows.
A normal flow
On a normal day the apps connect to the write nodes through GLB/HAProxy.
The apps are never aware of the master’s identity. As before, they use a name. For example, the master for
cluster1 would be
mysql-writer-1.github.net. In our current setup, however, this name gets resolved to an anycast IP.
anycast, the name resolves to the same IP everywhere, but traffic is routed differently based on a client’s location. In particular, in each of our data centers we have GLB, our highly available load balancer, deployed on multiple boxes. Traffic to
mysql-writer-1.github.net always routes to the local data center’s GLB cluster. Thus, all clients are served by local proxies.
We run GLB on top of HAProxy. Our HAProxy has writer pools: one pool per MySQL cluster, where each pool has exactly one backend server: the cluster’s master. All GLB/HAProxy boxes in all DCs have the exact same pools, and they all indicate the exact same backend servers in these pools. Thus, if an app wishes to write to
mysql-writer-1.github.net, it matters not which GLB server it connects to. It will always get routed to the actual
cluster1 master node.
As far as the apps are concerned, discovery ends at GLB, and there is never a need for re-discovery. It’s all on GLB to route the traffic to the correct destination.
How does GLB know which servers to list as backends, and how do we propagate changes to GLB?
Discovery via Consul
Consul is well known as a service discovery solution, and also offers DNS services. In our solution, however, we utilize it as a highly available key-value (KV) store.
Within Consul’s KV store we write the identities of cluster masters. For each cluster, there’s a set of KV entries indicating the cluster’s master
fqdn, port, ipv4, ipv6.
Each GLB/HAProxy node runs consul-template: a service that listens on changes to Consul data (in our case: changes to clusters masters data).
consul-template produces a valid config file and is able to reload HAProxy upon changes to the config.
Thus, a change in Consul to a master’s identity is observed by each GLB/HAProxy box, which then reconfigures itself, sets the new master as the single entity in a cluster’s backend pool, and reloads to reflect those changes.
At GitHub we have a Consul setup in each data center, and each setup is highly available. However, these setups are independent of each other. They do not replicate between each other and do not share any data.
How does Consul get told of changes, and how is the information distributed cross-DC?
We run an
orchestrator nodes communicate to each other via raft consensus. We have one or two
orchestrator nodes per data center.
orchestrator is charged with failure detection, with MySQL failover, and with communicating the change of master to Consul. Failover is operated by the single
orchestrator/raft leader node, but the change, the news that a cluster now has a new master, is propagated to all
orchestrator nodes through the
orchestrator nodes receive the news of a master change, they each communicate to their local Consul setups: they each invoke a KV write. DCs with more than one
orchestrator representative will have multiple (identical) writes to Consul.
Putting the flow together
In a master crash scenario:
orchestratornodes detect failures.
orchestrator/raftleader kicks off a recovery. A new master gets promoted.
orchestrator/raftadvertises the master change to all
orchestrator/raftmember receives a leader change notification. They each update the local Consul’s KV store with the identity of the new master.
- Each GLB/HAProxy has
consul-templaterunning, which observes the change in Consul’s KV store, and reconfigures and reloads HAProxy.
- Client traffic gets redirected to the new master.
There is a clear ownership of responsibilities for each component, and the entire design is both decoupled as well as simplified.
orchestrator doesn’t know about the load balancers. Consul doesn’t need to know where the information came from. Proxies only care about Consul. Clients only care about the proxy.
- There are no DNS changes to propagate.
- There is no TTL.
- The flow does not need the dead master’s cooperation. It is largely ignored.
To further secure the flow, we also have the following:
- HAProxy is configured with a very short
hard-stop-after. When it reloads with a new backend server in a writer-pool, it automatically terminates any existing connections to the old master.
hard-stop-afterwe don’t even require cooperation from the clients, and this mitigates a split-brain scenario. It’s noteworthy that this isn’t hermetic, and some time passes before we kill old connections. But there’s then a point in time after which we’re comfortable and expect no nasty surprises.
- We do not strictly require Consul to be available at all times. In fact, we only need it to be available at failover time. If Consul happens to be down, GLB continues to operate with the last known values and makes no drastic moves.
- GLB is set to validate the identity of the newly promoted master. Similarly to our context-aware MySQL pools, a check is made on the backend server, to confirm it is indeed a writer node. If we happen to delete the master’s identity in Consul, no problem; the empty entry is ignored. If we mistakenly write the name of a non-master server in Consul, no problem; GLB will refuse to update it and keep running with last known state.
We further tackle concerns and pursue HA objectives in the following sections.
orchestrator/raft failure detection
orchestrator uses a holistic approach to detecting failure, and as such it is very reliable. We do not observe false positives: we do not have premature failovers, and thus do not suffer unnecessary outage time.
orchestrator/raft further tackles the case for a complete DC network isolation (aka DC fencing). A DC network isolation can cause confusion: servers within that DC can talk to each other. Is it they that are network isolated from other DCs, or is it other DCs that are being network isolated?
orchestrator/raft setup, the
raft leader node is the one to run the failovers. A leader is a node that gets the support of the majority of the group (quorum). Our
orchestrator node deployment is such that no single data center makes a majority, and any
n-1 DCs do.
In the event of a complete DC network isolation, the
orchestrator nodes in that DC get disconnected from their peers in other DCs. As a result, the
orchestrator nodes in the isolated DC cannot be the leaders of the
raft cluster. If any such node did happen to be the leader, it steps down. A new leader will be assigned from any of the other DCs. That leader will have the support of all the other DCs, which are capable of communicating between themselves.
orchestrator node that calls the shots will be one that is outside the network isolated data center. Should there be a master in an isolated DC,
orchestrator will initiate the failover to replace it with a server in one of the available DCs. We mitigate DC isolation by delegating the decision making to the quorum in the non-isolated DCs.
Total outage time can further be reduced by advertising the master change sooner. How can that be achieved?
orchestrator begins a failover, it observes the fleet of servers available to be promoted. Understanding replication rules and abiding by hints and limitations, it is able to make an educated decision on the best course of action.
It may recognize that a server available for promotion is also an ideal candidate, such that:
- There is nothing to prevent the promotion of the server (and potentially the user has hinted that such server is preferred for promotion), and
- The server is expected to be able to take all of its siblings as replicas.
In such a case
orchestrator proceeds to first set the server as writable, and immediately advertises the promotion of the server (writes to Consul KV in our case), even while asynchronously beginning to fix the replication tree, an operation that will typically take a few more seconds.
It is likely that by the time our GLB servers have been fully reloaded, the replication tree is already intact, but it is not strictly required. The server is good to receive writes!
In MySQL’s semi-synchronous replication a master does not acknowledge a transaction commit until the change is known to have shipped to one or more replicas. It provides a way to achieve lossless failovers: any change applied on the master is either applied or waiting to be applied on one of the replicas.
Consistency comes with a cost: a risk to availability. Should no replica acknowledge receipt of changes, the master will block and writes will stall. Fortunately, there is a timeout configuration, after which the master can revert back to asynchronous replication mode, making writes available again.
We have set our timeout at a reasonably low value:
500ms. It is more than enough to ship changes from the master to local DC replicas, and typically also to remote DCs. With this timeout we observe perfect semi-sync behavior (no fallback to asynchronous replication), as well as feel comfortable with a very short blocking period in case of acknowledgement failure.
We enable semi-sync on local DC replicas, and in the event of a master’s death, we expect (though do not strictly enforce) a lossless failover. Lossless failover on a complete DC failure is costly and we do not expect it.
While experimenting with semi-sync timeout, we also observed a behavior that plays to our advantage: we are able to influence the identity of the ideal candidate in the event of a master failure. By enabling semi-sync on designated servers, and by marking them as candidates, we are able to reduce total outage time by affecting the outcome of a failure. In our experiments we observe that we typically end up with the ideal candidates, and hence run quick advertisements.
Instead of managing the startup/shutdown of the
pt-heartbeat service on promoted/demoted masters, we opted to run it everywhere at all times. This required some patching so as to make
pt-heartbeat comfortable with servers either changing their
read_only state back and forth or completely crashing.
In our current setup
pt-heartbeat services run on masters and on replicas. On masters, they generate the heartbeat events. On replicas, they identify that the servers are
read-only and routinely recheck their status. As soon as a server is promoted as master,
pt-heartbeat on that server identifies the server as writable and begins injecting heartbeat events.
orchestrator ownership delegation
We further delegated to orchestrator:
- Pseudo-GTID injection,
- Setting the promoted master as writable, clearing its replication state, and
- Setting the old master as
read_only, if possible.
On all things new-master, this reduces friction. A master that is just being promoted is clearly expected to be alive and accessible, or else we would not promote it. It makes sense, then, to let
orchestrator apply changes directly to the promoted master.
Limitations and drawbacks
The proxy layer makes the apps unaware of the master’s identity, but it also masks the apps’ identities from the master. All the master sees are connections coming from the proxy layer, and we lose information about the actual source of the connection.
As distributed systems go, we are still left with unhandled scenarios.
Notably, on a data center isolation scenario, and assuming a master is in the isolated DC, apps in that DC are still able to write to the master. This may result in state inconsistency once network is brought back up. We are working to mitigate this split-brain by implementing a reliable STONITH from within the very isolated DC. As before, some time will pass before bringing down the master, and there could be a short period of split-brain. The operational cost of avoiding split-brains altogether is very high.
More scenarios exist: the outage of Consul at the time of the failover; partial DC isolation; others. We understand that with distributed systems of this nature it is impossible to close all of the loopholes, so we focus on the most important cases.
Our orchestrator/GLB/Consul setup provides us with:
- Reliable failure detection,
- Data center agnostic failovers,
- Typically lossless failovers,
- Data center network isolation support,
- Split-brain mitigation (more in the works),
- No cooperation dependency,
10 and 13 secondsof total outage time in most cases.
- We see up to
20 secondsof total outage time in less frequent cases, and up to
25 secondsin extreme cases.
- We see up to
The orchestration/proxy/service-discovery paradigm uses well known and trusted components in a decoupled architecture, which makes it easier to deploy, operate and observe, and where each component can independently scale up or down. We continue to seek improvements as we continuously test our setup.
我们的翻译工作遵照 CC 协议，如果我们的工作有侵犯到您的权益，请及时联系我们。