How does GitHub achieve high availability for MySQL?


GitHub uses MySQL as the data store for everything that is not git, and its availability is critical to GitHub's operation: the GitHub site itself, the GitHub API, the authentication service and more all require database access. GitHub runs multiple MySQL clusters to support its different services and tasks. The clusters follow a classic master-replicas setup: a single node in each cluster (the master) accepts writes, while the remaining nodes (the replicas) replay the master's changes and serve read traffic.

The master's availability is critical. Once the master goes down, a cluster cannot accept writes: any data that needs to be persisted cannot be saved. As a result, none of the changes on GitHub, such as commits, issues, user creation, code reviews or new repositories, can be completed.

To keep the business running, we obviously need an available, writable database node in each cluster. Just as importantly, we must be able to quickly discover which node that is.

In other words, in a failure scenario such as a master crash, we must make sure a new master comes into service quickly, and that the rest of the cluster can quickly recognize it. The total outage time is made up of the time to detect the failure, the time to fail over to a new master, and the time for the other nodes in the cluster to recognize that new master.

This post describes GitHub's MySQL high availability and master service discovery solution, which allows us to reliably run cross-data-center operations, tolerate data center isolation, and keep outage times short when failures occur.


Achieving high availability

The solution described in this post improves on the high availability setups GitHub has used before. As the business grows and changes, the MySQL HA strategy must adapt accordingly, and we expect MySQL, like GitHub's other services, to have an HA solution that can keep up with that change.

When designing a high availability and service discovery system, the following questions can help narrow down a suitable solution:

  • How much outage time can be tolerated?
  • How accurate is outage detection? Can false positives (which trigger premature failovers) be tolerated?
  • How reliable is the failover? Under what conditions can it fail?
  • Does the solution work across data centers, and how? How does it behave on high-latency and on low-latency networks?
  • Can the solution survive the failure of an entire data center (DC) or network isolation?
  • What mechanism, if any, prevents split-brain in the HA cluster (where two nodes of what should be a single system lose contact with each other, split into independent nodes, and both compete to write to the shared data)?
  • Can data loss be tolerated, and to what extent?

To illustrate the questions above, let's first look at our previous high availability solution and why we replaced it.


Moving away from VIP- and DNS-based discovery

The previous solution used:

  • orchestrator for failure detection and failover, and
  • VIP and DNS for master discovery.

Clients found the master by resolving a name, such as mysql-writer-1.github.net, to the master's Virtual IP address (VIP).

So on a normal day, clients would simply resolve that name and connect to the master at the resulting IP.
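
Master discovery on the client side was therefore little more than a name lookup. A minimal sketch of what such a lookup amounts to, using the writer name mentioned above:

```python
import socket

# In the old setup, this name resolved to the VIP currently held by the master.
WRITER_NAME = "mysql-writer-1.github.net"

def resolve_writer(name: str = WRITER_NAME) -> str:
    """Return the IPv4 address the writer name currently resolves to."""
    return socket.gethostbyname(name)

if __name__ == "__main__":
    # A client would point its MySQL driver at this address.
    print(resolve_writer())
```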


Consider a topology spanning three data centers:

(Figure: a sample replication topology spread across three data centers)

If the master fails, one of the replicas must be promoted to be the new master.

orchestrator detects the failure, promotes a new master, and then reassigns the database name and the Virtual IP (VIP). Clients themselves know nothing about the change of master: the only information they hold is the master's name, so that name must now resolve to the new master server. Consider the following:

VIPs have to be negotiated: a virtual IP is held by the database server itself. A server must issue an ARP request to acquire or release a VIP, and before the new master can be assigned the VIP, the old server has to release the one it holds. This process has some problematic implications (see the sketch after the list below):

  • The orderly failover first asks the failed machine to release the VIP, and then asks the new master to acquire it. But what if the failed machine cannot be reached, or refuses to release the VIP? Given that it has just failed, it may well not respond in time, or not respond at all, and the process then runs into two problems:
    • Split-brain: with two hosts holding the same VIP, different clients connect to different hosts depending on the shortest network path.
    • The entire VIP reassignment depends on two independent servers coordinating with each other, and that setup is unreliable.
  • Even if the failed machine does release the VIP cleanly, the workflow wastes precious time, because the switchover first has to reach out to the failed machine.
  • And even once the VIP is reassigned, clients' existing connections to the old, failed machine are not automatically closed, so the system can still end up split-brained.
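
To make those failure modes concrete, here is a rough sketch of the cooperative, two-step VIP handoff described above. The hosts, interface, and the ssh-based `run_remote` helper are hypothetical; the point is that the old, failed master must participate for the flow to succeed cleanly.

```python
import subprocess

VIP = "10.0.0.100"   # illustrative virtual IP
IFACE = "eth0"       # illustrative network interface

def run_remote(host: str, command: str, timeout: float = 5.0) -> bool:
    """Run a command on a remote host over ssh; False on error or timeout."""
    try:
        subprocess.run(["ssh", host, command], check=True, timeout=timeout)
        return True
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return False

def reassign_vip(old_master: str, new_master: str) -> None:
    # Step 1: ask the (possibly dead) old master to give up the VIP.
    released = run_remote(old_master, f"ip addr del {VIP}/32 dev {IFACE}")
    if not released:
        # The old master is unreachable or uncooperative: we either keep
        # waiting (more outage time) or proceed anyway and risk two hosts
        # owning the same VIP (split-brain).
        pass
    # Step 2: have the newly promoted master claim the VIP and announce it
    # to the network with gratuitous ARP.
    run_remote(new_master, f"ip addr add {VIP}/32 dev {IFACE}")
    run_remote(new_master, f"arping -U -I {IFACE} -c 3 {VIP}")
```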

In parts of our setup VIPs are bound by physical location. They are owned by a switch or a router. Thus, we can only reassign the VIPs onto co-located servers. In particular, in some cases we cannot assign the VIP to a server promoted in a different data center, and must make a DNS change.

  • DNS changes take longer to propagate. Clients cache DNS names for a preconfigured time. A cross-DC failover implies more outage time: it will take more time to make all clients aware of the identity of the new master.

These limitations alone were enough to push us in search of a new solution, but there was even more to consider:

  • Masters were injecting their own heartbeats via the pt-heartbeat service, for the purpose of lag measurement and throttling control. The service had to be kicked off on the newly promoted master. If possible, the service would be shut down on the old master.
  • Likewise, Pseudo-GTID injection was self-managed by the masters. It would need to kick off on the new master, and preferably stop on the old master.
  • The new master was set as writable. The old master was to be set as read_only, if possible.

These extra steps were a contributing factor to the total outage time and introduced their own failures and friction.

The solution worked, and GitHub has had successful MySQL failovers that went well under the radar, but we wanted our HA to improve on the following:

  • Be data center agnostic.
  • Be tolerant of data center failure.
  • Remove unreliable cooperative workflows.
  • Reduce total outage time.
  • As much as possible, have lossless failovers.

GitHub’s HA solution: orchestrator, Consul, GLB

Our new strategy, along with collateral improvements, solves or mitigates much of the concerns above. In today’s HA setup, we have:

(Figure: MySQL HA solution at GitHub)

The new setup removes VIP and DNS changes altogether. And while we introduce more components, we are able to decouple the components and simplify the task, as well as be able to utilize solid and stable solutions. A breakdown follows.

A normal flow

On a normal day the apps connect to the write nodes through GLB/HAProxy.

The apps are never aware of the master’s identity. As before, they use a name. For example, the master for cluster1 would be mysql-writer-1.github.net. In our current setup, however, this name gets resolved to an anycast IP.

With anycast, the name resolves to the same IP everywhere, but traffic is routed differently based on a client’s location. In particular, in each of our data centers we have GLB, our highly available load balancer, deployed on multiple boxes. Traffic to mysql-writer-1.github.net always routes to the local data center’s GLB cluster. Thus, all clients are served by local proxies.

We run GLB on top of HAProxy. Our HAProxy has writer pools: one pool per MySQL cluster, where each pool has exactly one backend server: the cluster’s master. All GLB/HAProxy boxes in all DCs have the exact same pools, and they all indicate the exact same backend servers in these pools. Thus, if an app wishes to write to mysql-writer-1.github.net, it matters not which GLB server it connects to. It will always get routed to the actual cluster1 master node.

As far as the apps are concerned, discovery ends at GLB, and there is never a need for re-discovery. It’s all on GLB to route the traffic to the correct destination.

How does GLB know which servers to list as backends, and how do we propagate changes to GLB?

Discovery via Consul

Consul is well known as a service discovery solution, and also offers DNS services. In our solution, however, we utilize it as a highly available key-value (KV) store.

Within Consul’s KV store we write the identities of cluster masters. For each cluster, there’s a set of KV entries indicating the cluster’s master fqdn, port, ipv4, ipv6.

Each GLB/HAProxy node runs consul-template: a service that listens for changes to Consul data (in our case, changes to the cluster masters' data). consul-template produces a valid config file and is able to reload HAProxy upon changes to the config.

Thus, a change in Consul to a master’s identity is observed by each GLB/HAProxy box, which then reconfigures itself, sets the new master as the single entity in a cluster’s backend pool, and reloads to reflect those changes.
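
The heavy lifting here is done by consul-template and HAProxy themselves, but the mechanics can be approximated with a watch-render-reload loop like the one below. The blocking-query and KV endpoints are Consul's real HTTP API; the template, file path, key layout, and reload command are illustrative assumptions:

```python
import subprocess
import urllib.request

CONSUL_ADDR = "http://127.0.0.1:8500"
PREFIX = "mysql/master/cluster1"               # hypothetical key layout
HAPROXY_CFG = "/etc/haproxy/mysql-writer.cfg"  # illustrative path

# One writer pool per cluster, with exactly one backend: the current master.
POOL_TEMPLATE = """\
listen mysql_cluster1_writer
    bind *:3306
    mode tcp
    server master {ipv4}:{port} check
"""

def wait_for_change(index: int) -> int:
    """Blocking query: returns once the prefix changes (or after 5 minutes)."""
    url = f"{CONSUL_ADDR}/v1/kv/{PREFIX}?recurse&index={index}&wait=5m"
    with urllib.request.urlopen(url) as resp:
        resp.read()
        return int(resp.headers["X-Consul-Index"])

def fetch(key: str) -> str:
    """Read a single raw value from the KV store."""
    with urllib.request.urlopen(f"{CONSUL_ADDR}/v1/kv/{PREFIX}/{key}?raw") as resp:
        return resp.read().decode("utf-8")

def watch_and_reload() -> None:
    index = 0
    while True:
        index = wait_for_change(index)
        config = POOL_TEMPLATE.format(ipv4=fetch("ipv4"), port=fetch("port"))
        with open(HAPROXY_CFG, "w") as f:
            f.write(config)
        # Reload HAProxy so the new single backend takes effect.
        subprocess.run(["systemctl", "reload", "haproxy"], check=True)
```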

At GitHub we have a Consul setup in each data center, and each setup is highly available. However, these setups are independent of each other. They do not replicate between each other and do not share any data.

How does Consul get told of changes, and how is the information distributed cross-DC?

orchestrator/raft

We run an orchestrator/raft setup: orchestrator nodes communicate to each other via raft consensus. We have one or two orchestrator nodes per data center.

orchestrator is charged with failure detection, with MySQL failover, and with communicating the change of master to Consul. Failover is operated by the single orchestrator/raft leader node, but the change, the news that a cluster now has a new master, is propagated to all orchestrator nodes through the raft mechanism.

As orchestrator nodes receive the news of a master change, they each communicate to their local Consul setups: they each invoke a KV write. DCs with more than one orchestrator representative will have multiple (identical) writes to Consul.

Putting the flow together

In a master crash scenario:

  • The orchestrator nodes detect failures.
  • The orchestrator/raft leader kicks off a recovery. A new master gets promoted.
  • orchestrator/raft advertises the master change to all raft cluster nodes.
  • Each orchestrator/raft member receives notification of the master change. They each update the local Consul's KV store with the identity of the new master.
  • Each GLB/HAProxy has consul-template running, which observes the change in Consul’s KV store, and reconfigures and reloads HAProxy.
  • Client traffic gets redirected to the new master.

There is a clear ownership of responsibilities for each component, and the entire design is both decoupled as well as simplified. orchestrator doesn’t know about the load balancers. Consul doesn’t need to know where the information came from. Proxies only care about Consul. Clients only care about the proxy.

Furthermore:

  • There are no DNS changes to propagate.
  • There is no TTL.
  • The flow does not need the dead master’s cooperation. It is largely ignored.

Additional details

To further secure the flow, we also have the following:

  • HAProxy is configured with a very short hard-stop-after. When it reloads with a new backend server in a writer-pool, it automatically terminates any existing connections to the old master.
    • With hard-stop-after we don’t even require cooperation from the clients, and this mitigates a split-brain scenario. It’s noteworthy that this isn’t hermetic, and some time passes before we kill old connections. But there’s then a point in time after which we’re comfortable and expect no nasty surprises.
  • We do not strictly require Consul to be available at all times. In fact, we only need it to be available at failover time. If Consul happens to be down, GLB continues to operate with the last known values and makes no drastic moves.
  • GLB is set to validate the identity of the newly promoted master. Similarly to our context-aware MySQL pools, a check is made on the backend server, to confirm it is indeed a writer node. If we happen to delete the master’s identity in Consul, no problem; the empty entry is ignored. If we mistakenly write the name of a non-master server in Consul, no problem; GLB will refuse to update it and keep running with last known state. A sketch of such a writer check follows below.
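
A hedged sketch of what such a writer check can amount to, using the PyMySQL client; the exact check GitHub runs is part of their context-aware MySQL pools and is not shown here:

```python
import pymysql

def is_writer(host: str, port: int = 3306,
              user: str = "check", password: str = "") -> bool:
    """Return True only if the backend reports itself as writable."""
    try:
        conn = pymysql.connect(host=host, port=port, user=user,
                               password=password, connect_timeout=2)
    except pymysql.MySQLError:
        return False
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT @@global.read_only")
            (read_only,) = cur.fetchone()
            return int(read_only) == 0
    finally:
        conn.close()

# A health check built on this would refuse to route writes to any backend
# that is unreachable or read-only, keeping the last known good state instead.
```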

We further tackle concerns and pursue HA objectives in the following sections.

orchestrator/raft failure detection

orchestrator uses a holistic approach to detecting failure, and as such it is very reliable. We do not observe false positives: we do not have premature failovers, and thus do not suffer unnecessary outage time.

orchestrator/raft further tackles the case for a complete DC network isolation (aka DC fencing). A DC network isolation can cause confusion: servers within that DC can talk to each other. Is it they that are network isolated from other DCs, or is it other DCs that are being network isolated?

In an orchestrator/raft setup, the raft leader node is the one to run the failovers. A leader is a node that gets the support of the majority of the group (quorum). Our orchestrator node deployment is such that no single data center makes a majority, and any n-1 DCs do.
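
That deployment constraint, where no single DC reaches quorum on its own while any n-1 DCs together do, is easy to sanity check. A toy verification with an illustrative node layout (the per-DC counts are assumptions):

```python
from itertools import combinations

# Illustrative orchestrator/raft node counts per data center.
nodes_per_dc = {"dc1": 2, "dc2": 2, "dc3": 1}

total = sum(nodes_per_dc.values())
quorum = total // 2 + 1          # raft majority

# No single data center may form a majority on its own...
assert all(count < quorum for count in nodes_per_dc.values())

# ...while any n-1 data centers together must be able to.
for survivors in combinations(nodes_per_dc, len(nodes_per_dc) - 1):
    assert sum(nodes_per_dc[dc] for dc in survivors) >= quorum

print(f"{total} nodes, quorum of {quorum}: constraint satisfied")
```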

In the event of a complete DC network isolation, the orchestrator nodes in that DC get disconnected from their peers in other DCs. As a result, the orchestrator nodes in the isolated DC cannot be the leaders of the raft cluster. If any such node did happen to be the leader, it steps down. A new leader will be assigned from any of the other DCs. That leader will have the support of all the other DCs, which are capable of communicating between themselves.

Thus, the orchestrator node that calls the shots will be one that is outside the network isolated data center. Should there be a master in an isolated DC, orchestrator will initiate the failover to replace it with a server in one of the available DCs. We mitigate DC isolation by delegating the decision making to the quorum in the non-isolated DCs.

Quicker advertisement

Total outage time can further be reduced by advertising the master change sooner. How can that be achieved?

When orchestrator begins a failover, it observes the fleet of servers available to be promoted. Understanding replication rules and abiding by hints and limitations, it is able to make an educated decision on the best course of action.

It may recognize that a server available for promotion is also an ideal candidate, such that:

  • There is nothing to prevent the promotion of the server (and potentially the user has hinted that such server is preferred for promotion), and
  • The server is expected to be able to take all of its siblings as replicas.

In such a case orchestrator proceeds to first set the server as writable, and immediately advertises the promotion of the server (writes to Consul KV in our case), even while asynchronously beginning to fix the replication tree, an operation that will typically take a few more seconds.

It is likely that by the time our GLB servers have been fully reloaded, the replication tree is already intact, but it is not strictly required. The server is good to receive writes!
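
In other words, the promotion and its advertisement are decoupled from the replication repair. A sketch of that ordering, with hypothetical stand-ins for orchestrator's internals:

```python
import threading
import time

# Hypothetical stand-ins for orchestrator's internal operations.
def set_writable(server: str) -> None:
    print(f"{server}: SET GLOBAL read_only = 0")

def advertise_master_identity(server: str) -> None:
    print(f"advertising {server} as the new master (the Consul KV write)")

def repair_replication_tree(master: str, siblings: list[str]) -> None:
    time.sleep(3)  # stands in for the few extra seconds of repair work
    print(f"{siblings} now replicate from {master}")

def promote(candidate: str, siblings: list[str]) -> None:
    # 1. The ideal candidate can take writes right away.
    set_writable(candidate)
    # 2. Advertise immediately; GLB/HAProxy start reconfiguring as soon
    #    as the change lands in Consul.
    advertise_master_identity(candidate)
    # 3. Reattaching the remaining replicas continues asynchronously;
    #    writes do not wait for it.
    threading.Thread(target=repair_replication_tree,
                     args=(candidate, siblings)).start()

promote("replica-a", ["replica-b", "replica-c"])
```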

Semi-synchronous replication

In MySQL’s semi-synchronous replication a master does not acknowledge a transaction commit until the change is known to have shipped to one or more replicas. It provides a way to achieve lossless failovers: any change applied on the master is either applied or waiting to be applied on one of the replicas.

Consistency comes with a cost: a risk to availability. Should no replica acknowledge receipt of changes, the master will block and writes will stall. Fortunately, there is a timeout configuration, after which the master can revert back to asynchronous replication mode, making writes available again.

We have set our timeout at a reasonably low value: 500ms. It is more than enough to ship changes from the master to local DC replicas, and typically also to remote DCs. With this timeout we observe perfect semi-sync behavior (no fallback to asynchronous replication), as well as feel comfortable with a very short blocking period in case of acknowledgement failure.
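
For reference, on a classic semi-sync master (MySQL 5.7-era plugin variable names) the relevant knobs are the ones below; the 500ms value matches the timeout discussed here, while the way of applying it is an illustrative assumption:

```python
import pymysql

# Classic semi-sync plugin variables; requires the rpl_semi_sync_master
# (semisync_master.so) plugin to be installed. Newer MySQL releases rename
# these to rpl_semi_sync_source_*.
SEMI_SYNC_SETTINGS = {
    "rpl_semi_sync_master_enabled": 1,
    "rpl_semi_sync_master_timeout": 500,  # ms before falling back to async
}

def enable_semi_sync(host: str, user: str, password: str) -> None:
    conn = pymysql.connect(host=host, user=user, password=password)
    try:
        with conn.cursor() as cur:
            for variable, value in SEMI_SYNC_SETTINGS.items():
                cur.execute(f"SET GLOBAL {variable} = %s", (value,))
    finally:
        conn.close()
```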

We enable semi-sync on local DC replicas, and in the event of a master’s death, we expect (though do not strictly enforce) a lossless failover. Lossless failover on a complete DC failure is costly and we do not expect it.

While experimenting with semi-sync timeout, we also observed a behavior that plays to our advantage: we are able to influence the identity of the ideal candidate in the event of a master failure. By enabling semi-sync on designated servers, and by marking them as candidates, we are able to reduce total outage time by affecting the outcome of a failure. In our experiments we observe that we typically end up with the ideal candidates, and hence run quick advertisements.

Heartbeat injection

Instead of managing the startup/shutdown of the pt-heartbeat service on promoted/demoted masters, we opted to run it everywhere at all times. This required some patching so as to make pt-heartbeat comfortable with servers either changing their read_only state back and forth or completely crashing.

In our current setup pt-heartbeat services run on masters and on replicas. On masters, they generate the heartbeat events. On replicas, they identify that the servers are read-only and routinely recheck their status. As soon as a server is promoted as master, pt-heartbeat on that server identifies the server as writable and begins injecting heartbeat events.
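
The behavior described above, running everywhere but injecting only while writable and rechecking otherwise, boils down to a loop like the following. This is a hedged approximation of what pt-heartbeat does after the patching, not its actual implementation; the heartbeat table is illustrative:

```python
import time
import pymysql

HEARTBEAT_SQL = """
REPLACE INTO meta.heartbeat (server_id, ts)
VALUES (@@server_id, UTC_TIMESTAMP(6))
"""  # illustrative table; pt-heartbeat maintains its own schema

def heartbeat_loop(host: str, user: str, password: str,
                   interval: float = 0.1) -> None:
    conn = pymysql.connect(host=host, user=user, password=password,
                           autocommit=True)
    while True:
        with conn.cursor() as cur:
            cur.execute("SELECT @@global.read_only")
            (read_only,) = cur.fetchone()
            if int(read_only) == 0:
                # We are (now) the writable master: inject a heartbeat event,
                # which replicates out and lets replicas measure lag.
                cur.execute(HEARTBEAT_SQL)
            # On a read-only replica, inject nothing but keep rechecking so a
            # freshly promoted master starts injecting immediately.
        time.sleep(interval)
```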

orchestrator ownership delegation

We further delegated to orchestrator:

  • Pseudo-GTID injection,
  • Setting the promoted master as writable, clearing its replication state, and
  • Setting the old master as read_only, if possible.

On all things new-master, this reduces friction. A master that is just being promoted is clearly expected to be alive and accessible, or else we would not promote it. It makes sense, then, to let orchestrator apply changes directly to the promoted master.

Limitations and drawbacks

The proxy layer makes the apps unaware of the master’s identity, but it also masks the apps’ identities from the master. All the master sees are connections coming from the proxy layer, and we lose information about the actual source of the connection.

As distributed systems go, we are still left with unhandled scenarios.

Notably, on a data center isolation scenario, and assuming a master is in the isolated DC, apps in that DC are still able to write to the master. This may result in state inconsistency once network is brought back up. We are working to mitigate this split-brain by implementing a reliable STONITH from within the very isolated DC. As before, some time will pass before bringing down the master, and there could be a short period of split-brain. The operational cost of avoiding split-brains altogether is very high.

More scenarios exist: the outage of Consul at the time of the failover; partial DC isolation; others. We understand that with distributed systems of this nature it is impossible to close all of the loopholes, so we focus on the most important cases.

The results

Our orchestrator/GLB/Consul setup provides us with:

  • Reliable failure detection,
  • Data center agnostic failovers,
  • Typically lossless failovers,
  • Data center network isolation support,
  • Split-brain mitigation (more in the works),
  • No cooperation dependency,
  • Between 10 and 13 seconds of total outage time in most cases.
    • We see up to 20 seconds of total outage time in less frequent cases, and up to 25 seconds in extreme cases.

Conclusion

The orchestration/proxy/service-discovery paradigm uses well known and trusted components in a decoupled architecture, which makes it easier to deploy, operate and observe, and where each component can independently scale up or down. We continue to seek improvements as we continuously test our setup.


Original article: https://github.blog/2018-06-20-mysql-hig...

Translation: https://learnku.com/mysql/t/36820
