24 - Pod Topology Spread Constraints


FEATURE STATE: Kubernetes v1.16 alpha

You can use topology spread constraints to control how Pods are spread across your cluster among failure domains such as regions, zones, nodes, and other user-defined topology domains. This can help to achieve high availability as well as efficient resource utilization.

Prerequisites

Enable Feature Gate

Ensure the EvenPodsSpread feature gate is enabled (it is disabled by default in 1.16). See Feature Gates for an explanation of enabling feature gates. The EvenPodsSpread feature gate must be enabled for the API Server and scheduler.
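
On clusters whose component flags you manage directly, this typically means passing the --feature-gates flag to both the kube-apiserver and the kube-scheduler (a sketch; how these flags are set depends on your deployment tooling):

# other flags omitted; both components need the gate enabled
kube-apiserver --feature-gates=EvenPodsSpread=true ...
kube-scheduler --feature-gates=EvenPodsSpread=true ...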

Node Labels

Topology spread constraints rely on node labels to identify the topology domain(s) that each Node is in. For example, a Node might have labels: node=node1,zone=us-east-1a,region=us-east-1

Suppose you have a 4-node cluster with the following labels:

NAME    STATUS   ROLES    AGE     VERSION   LABELS
node1   Ready    <none>   4m26s   v1.16.0   node=node1,zone=zoneA
node2   Ready    <none>   3m58s   v1.16.0   node=node2,zone=zoneA
node3   Ready    <none>   3m17s   v1.16.0   node=node3,zone=zoneB
node4   Ready    <none>   2m43s   v1.16.0   node=node4,zone=zoneB

Then the cluster is logically viewed as below:

+---------------+---------------+
|     zoneA     |     zoneB     |
+-------+-------+-------+-------+
| node1 | node2 | node3 | node4 |
+-------+-------+-------+-------+
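
If these labels are not present yet, they could be applied manually with kubectl, for example:

kubectl label nodes node1 node=node1 zone=zoneA
kubectl label nodes node2 node=node2 zone=zoneA
kubectl label nodes node3 node=node3 zone=zoneB
kubectl label nodes node4 node=node4 zone=zoneB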

Instead of manually applying labels, you can also reuse the well-known labels that are created and populated automatically on most clusters.
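
For example, kubernetes.io/hostname is a well-known label that the kubelet populates on every Node, so a constraint can spread Pods per node without any custom labelling. A minimal fragment of a Pod spec (the foo: bar selector is just the placeholder used throughout this page):

  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname   # well-known label, set automatically on each Node
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        foo: bar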

Spread Constraints for Pods

API

The field pod.spec.topologySpreadConstraints is introduced in 1.16 as below:

apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  topologySpreadConstraints:
  - maxSkew: <integer>
    topologyKey: <string>
    whenUnsatisfiable: <string>
    labelSelector: <object>

You can define one or more topologySpreadConstraints entries to instruct the kube-scheduler how to place each incoming Pod in relation to the existing Pods across your cluster. The fields are:

  • maxSkew describes the degree to which Pods may be unevenly distributed. It’s the maximum permitted difference between the number of matching Pods in any two topology domains of a given topology type. It must be greater than zero.
  • topologyKey is the key of node labels. If two Nodes are labelled with this key and have identical values for that label, the scheduler treats both Nodes as being in the same topology. The scheduler tries to place a balanced number of Pods into each topology domain.
  • whenUnsatisfiable indicates how to deal with a Pod if it doesn’t satisfy the spread constraint:
    • DoNotSchedule (default) tells the scheduler not to schedule it.
    • ScheduleAnyway tells the scheduler to still schedule it while prioritizing nodes that minimize the skew.
  • labelSelector is used to find matching Pods. Pods that match this label selector are counted to determine the number of Pods in their corresponding topology domain. See Label Selectors for more details.

You can read more about this field by running kubectl explain Pod.spec.topologySpreadConstraints.

Example: One TopologySpreadConstraint

Suppose you have a 4-node cluster where 3 Pods labeled foo:bar are located in node1, node2 and node3 respectively (P represents Pod):

+---------------+---------------+
|     zoneA     |     zoneB     |
+-------+-------+-------+-------+
| node1 | node2 | node3 | node4 |
+-------+-------+-------+-------+
|   P   |   P   |   P   |       |
+-------+-------+-------+-------+

If we want an incoming Pod to be evenly spread with existing Pods across zones, the spec can be given as:

pods/topology-spread-constraints/one-constraint.yaml

kind: Pod
apiVersion: v1
metadata:
  name: mypod
  labels:
    foo: bar
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        foo: bar
  containers:
  - name: pause
    image: k8s.gcr.io/pause:3.1

topologyKey: zone implies the even distribution will only be applied to nodes that have the label key “zone” present (whatever its value). whenUnsatisfiable: DoNotSchedule tells the scheduler to let the incoming Pod stay pending if it can’t satisfy the constraint.

If the scheduler placed this incoming Pod into “zoneA”, the Pods distribution would become [3, 1], hence the actual skew is 2 (3 - 1) - which violates maxSkew: 1. In this example, the incoming Pod can only be placed onto “zoneB”:

+---------------+---------------+      +---------------+---------------+
|     zoneA     |     zoneB     |      |     zoneA     |     zoneB     |
+-------+-------+-------+-------+      +-------+-------+-------+-------+
| node1 | node2 | node3 | node4 |  OR  | node1 | node2 | node3 | node4 |
+-------+-------+-------+-------+      +-------+-------+-------+-------+
|   P   |   P   |   P   |   P   |      |   P   |   P   |  P P  |       |
+-------+-------+-------+-------+      +-------+-------+-------+-------+
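
To see this in practice, you could create the Pod from the manifest above and check where it lands (assuming the manifest is saved locally as one-constraint.yaml):

kubectl apply -f one-constraint.yaml
kubectl get pod mypod -o wide    # the NODE column should show node3 or node4, i.e. a zoneB node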

You can tweak the Pod spec to meet various kinds of requirements:

  • Change maxSkew to a bigger value like “2” so that the incoming Pod can be placed onto “zoneA” as well.
  • Change topologyKey to “node” so as to distribute the Pods evenly across nodes instead of zones. In the above example, if maxSkew remains “1”, the incoming Pod can only be placed onto “node4”.
  • Change whenUnsatisfiable: DoNotSchedule to whenUnsatisfiable: ScheduleAnyway to ensure the incoming Pod is always schedulable (suppose other scheduling APIs are satisfied). However, it’s preferred to be placed onto the topology domain which has fewer matching Pods. (Be aware that this preferability is jointly normalized with other internal scheduling priorities like resource usage ratio, etc.) See the sketch after this list.
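
As a reference for the last tweak, the constraint from one-constraint.yaml with the softer setting would look like this fragment (a sketch; only whenUnsatisfiable changes):

  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: ScheduleAnyway   # soft: the scheduler minimizes the skew but still schedules
    labelSelector:
      matchLabels:
        foo: bar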

Example: Multiple TopologySpreadConstraints

This builds upon the previous example. Suppose you have a 4-node cluster where 3 Pods labeled foo:bar are located in node1, node2 and node3 respectively (P represents Pod):

+---------------+---------------+
|     zoneA     |     zoneB     |
+-------+-------+-------+-------+
| node1 | node2 | node3 | node4 |
+-------+-------+-------+-------+
|   P   |   P   |   P   |       |
+-------+-------+-------+-------+

You can use 2 TopologySpreadConstraints to control how Pods spread across both zone and node:

pods/topology-spread-constraints/two-constraints.yaml

kind: Pod
apiVersion: v1
metadata:
  name: mypod
  labels:
    foo: bar
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        foo: bar
  - maxSkew: 1
    topologyKey: node
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        foo: bar
  containers:
  - name: pause
    image: k8s.gcr.io/pause:3.1

In this case, to match the first constraint, the incoming Pod can only be placed onto “zoneB”; while in terms of the second constraint, the incoming Pod can only be placed onto “node4”. The results of the 2 constraints are ANDed, so the only viable option is to place it on “node4”.

Multiple constraints can lead to conflicts. Suppose you have a 3-node cluster across 2 zones:

+---------------+-------+
|     zoneA     | zoneB |
+-------+-------+-------+
| node1 | node2 | node3 |
+-------+-------+-------+
|  P P  |   P   |  P P  |
+-------+-------+-------+

If you apply “two-constraints.yaml” to this cluster, you will notice “mypod” stays in Pending state. This is because: to satisfy the first constraint, “mypod” can only be put onto “zoneB”; while in terms of the second constraint, “mypod” can only be put onto “node2”. The joint result of “zoneB” and “node2” returns nothing.

To overcome this situation, you can either increase the maxSkew or modify one of the constraints to use whenUnsatisfiable: ScheduleAnyway.
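
For instance, relaxing the node-level constraint of two-constraints.yaml to maxSkew: 2 would let “mypod” land on node3, which also satisfies the zone-level constraint (a sketch; only the second constraint changes):

  - maxSkew: 2                  # was 1; a node-level skew of up to 2 is now tolerated
    topologyKey: node
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        foo: bar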

Conventions

There are some implicit conventions worth noting here:

  • Only Pods in the same namespace as the incoming Pod can be matching candidates.

  • Nodes without topologySpreadConstraints[*].topologyKey present will be bypassed. It implies that:

    1. the Pods located on those nodes do not impact maxSkew calculation - in the above example, suppose “node1” does not have label “zone”, then the 2 Pods will be disregarded, hence the incoming Pod will be scheduled into “zoneA”.
    2. the incoming Pod has no chance to be scheduled onto such nodes - in the above example, suppose a “node5” carrying label {zone-typo: zoneC} joins the cluster; it will be bypassed due to the absence of label key “zone”.
  • Be aware of what will happen if the incoming Pod’s topologySpreadConstraints[*].labelSelector doesn’t match its own labels. In the above example, if we remove the incoming Pod’s labels, it can still be placed onto “zoneB” since the constraints are still satisfied. However, after the placement, the degree of imbalance of the cluster remains unchanged - it’s still zoneA having 2 Pods which hold label {foo:bar}, and zoneB having 1 Pod which holds label {foo:bar}. So if this is not what you expect, we recommend that the workload’s topologySpreadConstraints[*].labelSelector match its own labels.

  • If the incoming Pod has spec.nodeSelector or spec.affinity.nodeAffinity defined, nodes not matching them will be bypassed.

    Suppose you have a 5-node cluster ranging from zoneA to zoneC:

    +---------------+---------------+-------+
    |     zoneA     |     zoneB     | zoneC |
    +-------+-------+-------+-------+-------+
    | node1 | node2 | node3 | node4 | node5 |
    +-------+-------+-------+-------+-------+
    |   P   |   P   |   P   |       |       |
    +-------+-------+-------+-------+-------+

    and you know that “zoneC” must be excluded. In this case, you can compose the yaml as below, so that “mypod” will be placed onto “zoneB” instead of “zoneC”. Similarly, spec.nodeSelector is also respected (see the sketch after the manifest below).

    pods/topology-spread-constraints/one-constraint-with-nodeaffinity.yaml

kind: Pod
apiVersion: v1
metadata:
  name: mypod
  labels:
    foo: bar
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        foo: bar
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: zone
            operator: NotIn
            values:
            - zoneC
  containers:
  - name: pause
    image: k8s.gcr.io/pause:3.1
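
As a rough sketch, the same intent expressed with spec.nodeSelector has to name the allowed zone explicitly, because nodeSelector only supports exact key/value matches (hypothetical; this pins the Pod to zoneB rather than excluding zoneC, and nodes that don’t match it are bypassed by the spread calculation):

  nodeSelector:
    zone: zoneB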

Comparison with PodAffinity/PodAntiAffinity

In Kubernetes, directives related to “Affinity” control how Pods are scheduled - more packed or more scattered.

  • For PodAffinity, you can try to pack any number of Pods into qualifying topology domain(s).
  • For PodAntiAffinity, only one Pod can be scheduled into a single topology domain (see the sketch after this list).
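
For comparison, a hard PodAntiAffinity rule that keeps Pods labelled foo: bar apart looks like the following fragment of a Pod spec (a sketch; it allows at most one such Pod per zone and has no equivalent of maxSkew):

  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            foo: bar
        topologyKey: zone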

The “EvenPodsSpread” feature provides flexible options to distribute Pods evenly across different topology domains - to achieve high availability or cost-saving. This can also help with rolling updates of workloads and smoothly scaling out replicas. See Motivation for more details.

Known Limitations

As of 1.16, at which this feature is Alpha, there are some known limitations:

  • Scaling down a Deployment may result in imbalanced Pods distribution.
  • Pods matched on tainted nodes are respected. See Issue 80921.