
# P4 - Node Priority Algorithms

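## Preface

In the previous article we walked through the node filtering (predicate) algorithms:

[P3 - Node Filtering Algorithms](https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/scheduler/Kubernetes源码学习-Scheduler-P3-Node筛选算法.md)

By design, once the feasible nodes have been filtered out, the scheduler picks the one with the highest priority as the final fit node. This article picks up where the previous one left off and looks at how nodes are ranked by priority.

Tips: This article is fairly long because the priority algorithms are relatively complex. Read it side by side with the source code, more than once if needed, and it will pay off.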

## Main Content

### 1. The Priority Function

#### 1.1 Entry Point

As in the previous article, go back to the `Schedule()` function in `pkg/scheduler/core/generic_scheduler.go`, at `pkg/scheduler/core/generic_scheduler.go:184`:

![](http://pwh8f9az4.bkt.clouddn.com/20190822165920.png)

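The screenshot has several annotations. The metric-related lines collect metrics that are exposed to Prometheus; every core Kubernetes component does this, and I will cover it separately if I ever read through the Prometheus source. Let's step straight into the priority function `PrioritizeNodes()` from `pkg/scheduler/core/generic_scheduler.go:215`.

#### 1.2 Overview of the Priority Function

`pkg/scheduler/core/generic_scheduler.go:645 PrioritizeNodes()`: the function body is long, so it is not reproduced here.

The comment above the function describes how it works:

- 1. Enumerate all priority dimensions (priority functions). Each one returns a score for its dimension in the range [0, 10], and each has an internally defined weight. A dimension's final contribution is (score * weight); higher is better.

- 2. Enumerate all nodes taking part in the computation.

- 3. For each node, run every dimension from step 1, then sum that node's weighted scores across all dimensions.

One struct runs through the entire call stack and deserves special mention: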

```go
// HostPriority represents the priority of scheduling to a particular host, higher priority is better.
type HostPriority struct {
	// Name of the host
	Host string
	// Score associated with the host
	Score int
}
```

**Two important variables**

```go
// pkg/scheduler/core/generic_scheduler.go:678
// Note: results is a two-level slice ([]HostPriorityList, i.e. [][]HostPriority) holding
// the score of every node in every dimension. In pseudocode:
/*
results = [
// dimension 1: score of each node
[{node-a: 1},{node-b: 2},{node-c: 3}...],
// dimension 2: score of each node
[{node-a: 3},{node-b: 1},{node-c: 2}...],
...
]
*/
results := make([]schedulerapi.HostPriorityList, len(priorityConfigs), len(priorityConfigs))

// pkg/scheduler/core/generic_scheduler.go:738
// result is a []HostPriority: the final score of each node after all dimensions are summed.
result := make(schedulerapi.HostPriorityList, 0, len(nodes))
```
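To make the relationship between the per-dimension scores and the final per-node score concrete, here is a minimal sketch of the weighted summation. It is not the scheduler's code: `sumWeighted`, the local `HostPriority` copy, and the separate `weights` slice are names assumed purely for illustration.

```go
package main

import "fmt"

// HostPriority is a local copy of the struct shown earlier: a node name and a score.
type HostPriority struct {
	Host  string
	Score int
}

// sumWeighted collapses per-dimension scores into one total per node.
// perDimension[i][j] is the score of node j in dimension i, and weights[i]
// is the weight of dimension i (both layouts are illustrative assumptions).
func sumWeighted(perDimension [][]HostPriority, weights []int) []HostPriority {
	if len(perDimension) == 0 {
		return nil
	}
	totals := make([]HostPriority, len(perDimension[0]))
	for j := range totals {
		totals[j].Host = perDimension[0][j].Host
		for i := range perDimension {
			totals[j].Score += perDimension[i][j].Score * weights[i]
		}
	}
	return totals
}

func main() {
	perDimension := [][]HostPriority{
		{{"node-a", 1}, {"node-b", 2}, {"node-c", 3}}, // dimension 1
		{{"node-a", 3}, {"node-b", 1}, {"node-c", 2}}, // dimension 2
	}
	// Made-up weights: dimension 2 counts twice as much as dimension 1.
	fmt.Println(sumWeighted(perDimension, []int{1, 2}))
}
```

With these made-up numbers node-a totals 1*1 + 3*2 = 7 and node-b totals 2*1 + 1*2 = 4, so node-a would rank above node-b.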


#### 1.3 The Priority Function, Piece by Piece

##### 1.3.1 Function (DEPRECATED)

`pkg/scheduler/core/generic_scheduler.go:682`

```go


	// DEPRECATED: we can remove this when all priorityConfigs implement the
	// Map-Reduce pattern.
	for i := range priorityConfigs {
		if priorityConfigs[i].Function != nil {
			wg.Add(1)
			go func(index int) {
				defer wg.Done()
				var err error
				results[index], err = priorityConfigs[index].Function(pod, nodeNameToInfo, nodes)
				if err != nil {
					appendError(err)
				}
			}(i)
		} else {
			results[i] = make(schedulerapi.HostPriorityList, len(nodes))
		}
	}
```

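The comment says that this direct style of computation (`priorityConfigs[i].Function`) is the legacy mode and is already DEPRECATED. In the current version only one dimension (inter-pod affinity) still uses it; everything else uses the Map-Reduce style described below. How a Function computes its result is shown later, using the inter-pod affinity dimension as the concrete example.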

##### 1.3.2 Map-Reduce Function

`pkg/scheduler/core/generic_scheduler.go:698`

```go
	workqueue.ParallelizeUntil(context.TODO(), 16, len(nodes), func(index int) {
		nodeInfo := nodeNameToInfo[nodes[index].Name]
		for i := range priorityConfigs {
			if priorityConfigs[i].Function != nil {
				continue
			}

			var err error
			results[i][index], err = priorityConfigs[i].Map(pod, meta, nodeInfo)
			if err != nil {
				appendError(err)
				results[i][index].Host = nodes[index].Name
			}
		}
	})

	for i := range priorityConfigs {
		if priorityConfigs[i].Reduce == nil {
			continue
		}
		wg.Add(1)
		go func(index int) {
			defer wg.Done()
			if err := priorityConfigs[index].Reduce(pod, meta, nodeNameToInfo, results[index]); err != nil {
				appendError(err)
			}
			if klog.V(10) {
				for _, hostPriority := range results[index] {
					klog.Infof("%v -> %v: %v, Score: (%d)", util.GetPodFullName(pod), hostPriority.Host, priorityConfigs[index].Name, hostPriority.Score)
				}
			}
		}(i)
	}
	// Wait for all computations to be finished.
	wg.Wait()
```

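As the code shows, when a dimension does not set `priorityConfigs[i].Function`, it is computed in Map-Reduce mode.

```
Aside: Map-Reduce is an idea from big data processing. Put simply, the Map function runs a highly parallel
computation over every element of a set and produces a result set that maps one-to-one onto the inputs; the
Reduce function then aggregates that result set and returns the final result.
```

`workqueue.ParallelizeUntil()`, the parallelism control helper highlighted in the previous article, shows up again here. It runs the Map function at node granularity (16 workers in the code above). The Reduce functions, which need far less parallelism, run in one goroutine per dimension coordinated by a `sync.WaitGroup`. This matches the Map-Reduce idea.

If you have never worked with Map-Reduce, don't be put off: only the idea is borrowed here, and the data volume is nowhere near large enough to require splitting the work across machines. A dimension that uses the Map-Reduce style is walked through later as a concrete example.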
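The two-phase shape can be sketched in a few lines of plain Go. This is only an illustration under stated assumptions: `parallelize` is a simplified, hypothetical stand-in for `workqueue.ParallelizeUntil` (not its real implementation), and `mapReduce` and `invertAndScale` are made-up names working on plain integers instead of scheduler types.

```go
package main

import (
	"fmt"
	"sync"
)

// parallelize runs work(i) for every i in [0, pieces) on at most `workers`
// goroutines. Hypothetical stand-in for workqueue.ParallelizeUntil.
func parallelize(workers, pieces int, work func(i int)) {
	indexes := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range indexes {
				work(i)
			}
		}()
	}
	for i := 0; i < pieces; i++ {
		indexes <- i
	}
	close(indexes)
	wg.Wait()
}

// mapReduce applies mapFn to every input in parallel (each call writes only
// its own slot, so no lock is needed), then runs reduceFn once over all results.
func mapReduce(inputs []int, mapFn func(int) int, reduceFn func([]int)) []int {
	out := make([]int, len(inputs))
	parallelize(4, len(inputs), func(i int) { out[i] = mapFn(inputs[i]) })
	reduceFn(out)
	return out
}

// invertAndScale is an example reduce step: a lower raw count becomes a
// higher score, scaled into the range 0..10.
func invertAndScale(raw []int) {
	maxCount := 0
	for _, c := range raw {
		if c > maxCount {
			maxCount = c
		}
	}
	for i, c := range raw {
		if maxCount == 0 {
			raw[i] = 10
			continue
		}
		raw[i] = 10 * (maxCount - c) / maxCount
	}
}

func main() {
	rawCounts := []int{0, 2, 4} // made-up per-node raw counts
	identity := func(c int) int { return c }
	fmt.Println(mapReduce(rawCounts, identity, invertAndScale))
}
```

The map phase only needs per-node information, which is why it parallelizes so easily; the reduce phase needs the whole result set (here, the maximum) before it can produce final scores.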

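### 2. Priority Dimensions

#### 2.1 Dimensions Registered by Default

From the above we have a rough picture of the priority algorithm: **sum each node's scores across all dimensions; the higher the total, the higher the priority**. So which dimensions exist by default? As covered in the earlier [scheduler framework article](https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/scheduler/P2-调度器框架.md), all scheduling algorithms live under `pkg/scheduler/algorithm`, while `pkg/scheduler/algorithmprovider` provides factory-style methods for creating the algorithm-related pieces. So we go straight to `pkg/scheduler/algorithmprovider/defaults/register_priorities.go`, where all default priority dimensions are registered. To save space, here are just a few of them: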

```go
	factory.RegisterPriorityFunction2(priorities.EqualPriority, core.EqualPriorityMap, nil, 1)
	// Optional, cluster-autoscaler friendly priority function - give used nodes higher priority.
	factory.RegisterPriorityFunction2(priorities.MostRequestedPriority, priorities.MostRequestedPriorityMap, nil, 1)
	factory.RegisterPriorityFunction2(
		priorities.RequestedToCapacityRatioPriority,
		priorities.RequestedToCapacityRatioResourceAllocationPriorityDefault().PriorityMap,
		nil,
		1)
```

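If you read the comments in that file carefully, you will notice that some dimensions are registered by a factory function but not actually enabled by default. For example, the comment on `ServiceSpreadingPriority` says it has largely been replaced by `SelectorSpreadPriority` and is kept only for backward compatibility. So which dimensions are used by default?

#### 2.2 Dimensions Used by Default

The dimensions used by default are declared here: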

`pkg/scheduler/algorithmprovider/defaults/defaults.go:108`

```go
func defaultPriorities() sets.String {
	return sets.NewString(
		priorities.SelectorSpreadPriority,
		priorities.InterPodAffinityPriority,
		priorities.LeastRequestedPriority,
		priorities.BalancedResourceAllocation,
		priorities.NodePreferAvoidPodsPriority,
		priorities.NodeAffinityPriority,
		priorities.TaintTolerationPriority,
		priorities.ImageLocalityPriority,
	)
}

```

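#### 2.3 Old and New Computation Styles

Every registered dimension has its own descriptive keyword, which is the first argument (a string) of the factory method. It is easy to see that each keyword has a corresponding file under `pkg/scheduler/algorithm/priorities`. A few examples are circled in the figure (please forgive the crude drawing):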

![](http://pwh8f9az4.bkt.clouddn.com/image-20190821171031395.png)

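Clearly the actual computation for each dimension lives in these files; you can verify this by following the jump-to-definition chain in your editor.

The factory methods show that every dimension is registered with a default weight of 1, except `NodePreferAvoidPodsPriority`, whose weight is 10000. This dimension exists to keep certain pods off a node. Let's find the file and read the comment on the method: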

`pkg/scheduler/algorithm/priorities/node_prefer_avoid_pods.go:31`

```go
// CalculateNodePreferAvoidPodsPriorityMap priorities nodes according to the node annotation
// "scheduler.alpha.kubernetes.io/preferAvoidPods".
func CalculateNodePreferAvoidPodsPriorityMap(pod *v1.Pod, meta interface{}, nodeInfo *schedulernodeinfo.NodeInfo) (schedulerapi.HostPriority, error) {
	... // omitted
}
```

So a node can carry the annotation `scheduler.alpha.kubernetes.io/preferAvoidPods` to keep specified pods from being scheduled onto it, and the very large weight lets this dimension override all the others.
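A little arithmetic shows why a weight of 10000 dominates. The sketch below is not scheduler code: `total` is a made-up helper, the scores are invented, and it assumes this dimension yields 0 for a node that asks to avoid the pod and the maximum of 10 otherwise.

```go
package main

import "fmt"

// total returns the weighted sum of one node's per-dimension scores.
func total(scores, weights []int) int {
	sum := 0
	for i, s := range scores {
		sum += s * weights[i]
	}
	return sum
}

func main() {
	// Eight dimensions: seven with weight 1, the last with weight 10000
	// (standing in for NodePreferAvoidPodsPriority).
	weights := []int{1, 1, 1, 1, 1, 1, 1, 10000}

	// Assumed scores: a node that is perfect everywhere else but asks to
	// avoid the pod, and a node that is worst everywhere else but does not.
	avoiding := []int{10, 10, 10, 10, 10, 10, 10, 0}
	ordinary := []int{0, 0, 0, 0, 0, 0, 0, 10}

	fmt.Println("avoiding node:", total(avoiding, weights))
	fmt.Println("ordinary node:", total(ordinary, weights))
}
```

The seven weight-1 dimensions can contribute at most 70 points combined, while a single point in the weight-10000 dimension is already worth 10000, so no combination of the other dimensions can outvote it.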

If you Ctrl+F for the keyword **map**, you will find that `InterPodAffinityPriority` is the only registration without it:

```go
	// pods should be placed in the same topological domain (e.g. same node, same rack, same zone, same power domain, etc.)
	// as some other pods, or, conversely, should not be placed in the same topological domain as some other pods.
	factory.RegisterPriorityConfigFactory(
		priorities.InterPodAffinityPriority,
		factory.PriorityConfigFactory{
			Function: func(args factory.PluginFactoryArgs) priorities.PriorityFunction {
				return priorities.NewInterPodAffinityPriority(args.NodeInfo, args.NodeLister, args.PodLister, args.HardPodAffinitySymmetricWeight)
			},
			Weight: 1,
		},
	)
```


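This confirms the earlier statement that inter-pod affinity is the only dimension still using the legacy Function, even though that style is DEPRECATED. A legacy Function computes the result directly, while Map-Reduce decouples the process into two steps. Looking across all the factory calls, we can also see that in many of them the `reduceFunction` parameter actually receives `nil`: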

![](http://pwh8f9az4.bkt.clouddn.com/image-20190822111624614.png)

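This means those dimensions finish all their work in the map function and need no reduce step. That makes the computation inside the legacy Function worth studying as well, so let's first look at how the `InterPodAffinityPriority` dimension is computed!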

### 3. The Legacy Function

#### 3.1 InterPodAffinityPriority

Before reading the code, here is a standard PodAffinity configuration:

**PodAffinity** example:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-a
  namespace: default
spec:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: like
              operator: In
              values:
              - pod-b
          # Topology level: usually the node level, but zone/region and other levels also exist
          topologyKey: kubernetes.io/hostname

    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100 
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: unlike
              operator: In
              values:
              - pod-c
          topologyKey: kubernetes.io/hostname          
  containers:
  - name: test
    image: gcr.io/google_containers/pause:2.0
```

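The intent declared in the YAML: pod-a prefers to be near pods labelled `like: pod-b` (call them pod-b) and away from pods labelled `unlike: pod-c` (pod-c). So in this dimension a node running pod-b gains points, and a node running pod-c loses points.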
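The add/subtract idea can be sketched with plain label maps. This is a simplification, not the scheduler's implementation: `term` and `countFor` are hypothetical names, each term matches a single label key/value instead of a full label selector, and topology is reduced to "pods on the same node".

```go
package main

import "fmt"

// term is a simplified preferred (anti-)affinity term: the label key/value
// to look for on existing pods, plus a weight.
type term struct {
	key, value string
	weight     int
}

// countFor adds weight for every existing pod matched by an affinity term
// and subtracts weight for every pod matched by an anti-affinity term.
func countFor(podsOnNode []map[string]string, affinity, antiAffinity []term) int {
	count := 0
	for _, labels := range podsOnNode {
		for _, t := range affinity {
			if labels[t.key] == t.value {
				count += t.weight
			}
		}
		for _, t := range antiAffinity {
			if labels[t.key] == t.value {
				count -= t.weight
			}
		}
	}
	return count
}

func main() {
	affinity := []term{{"like", "pod-b", 100}}       // be near pods labelled like=pod-b
	antiAffinity := []term{{"unlike", "pod-c", 100}} // stay away from unlike=pod-c

	nodeWithB := []map[string]string{{"like": "pod-b"}}
	nodeWithC := []map[string]string{{"unlike": "pod-c"}}

	fmt.Println(countFor(nodeWithB, affinity, antiAffinity))
	fmt.Println(countFor(nodeWithC, affinity, antiAffinity))
}
```

A node hosting both kinds of pod would net out to zero in this sketch, which is why the raw counts still need to be normalized across nodes before they become scores.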

Now the code. Read it closely and you will find that the keys from the example (`PreferredDuringSchedulingIgnoredDuringExecution`, `podAffinityTerm`, `labelSelector`, `topologyKey`) all show up in it:

`pkg/scheduler/algorithm/priorities/interpod_affinity.go:119`:

```go
func (ipa *InterPodAffinity) CalculateInterPodAffinityPriority(pod *v1.Pod, nodeNameToInfo map[string]*schedulernodeinfo.NodeInfo, nodes []*v1.Node) (schedulerapi.HostPriorityList, error) {
 
	affinity := pod.Spec.Affinity
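	// Does the pod being scheduled declare pod affinity constraints?
	hasAffinityConstraints := affinity != nil && affinity.PodAffinity != nil
	// Does the pod being scheduled declare pod anti-affinity constraints?
	hasAntiAffinityConstraints := affinity != nil && affinity.PodAntiAffinity != nil

	... // omitted

	// processPod scores nodes based on one pod already running on a node. It works in two
	// directions, and both can add or subtract points:
	// 1. Terms declared by the pod being scheduled, matched against the existing pod
	//    (only the preferred, i.e. soft, terms are used).
	// 2. Terms declared by the existing pod, matched against the pod being scheduled
	//    (soft terms, plus required/hard affinity terms when hardPodAffinityWeight > 0).
	// The actual scoring is done by processTerm(), covered below.
	// This is the pod-level helper, called by the node-level processNode below.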
	processPod := func(existingPod *v1.Pod) error {
		existingPodNode, err := ipa.info.GetNodeInfo(existingPod.Spec.NodeName)
		if err != nil {
			if apierrors.IsNotFound(err) {
				klog.Errorf("Node not found, %v", existingPod.Spec.NodeName)
				return nil
			}
			return err
		}
		existingPodAffinity := existingPod.Spec.Affinity
		// Does the pod already running on the node declare pod affinity constraints?
		existingHasAffinityConstraints := existingPodAffinity != nil && existingPodAffinity.PodAffinity != nil
		// Does the pod already running on the node declare pod anti-affinity constraints?
		existingHasAntiAffinityConstraints := existingPodAffinity != nil && existingPodAffinity.PodAntiAffinity != nil

		if hasAffinityConstraints {
			terms := affinity.PodAffinity.PreferredDuringSchedulingIgnoredDuringExecution
			pm.processTerms(terms, pod, existingPod, existingPodNode, 1)
		}
		if hasAntiAffinityConstraints {
			terms := affinity.PodAntiAffinity.PreferredDuringSchedulingIgnoredDuringExecution
			pm.processTerms(terms, pod, existingPod, existingPodNode, -1)
		}

		if existingHasAffinityConstraints {
			if ipa.hardPodAffinityWeight > 0 {
				terms := existingPodAffinity.PodAffinity.RequiredDuringSchedulingIgnoredDuringExecution
				for _, term := range terms {
					pm.processTerm(&term, existingPod, pod, existingPodNode, float64(ipa.hardPodAffinityWeight))
				}
			}
			terms := existingPodAffinity.PodAffinity.PreferredDuringSchedulingIgnoredDuringExecution
			pm.processTerms(terms, existingPod, pod, existingPodNode, 1)
		}
		if existingHasAntiAffinityConstraints {
			terms := existingPodAffinity.PodAntiAffinity.PreferredDuringSchedulingIgnoredDuringExecution
			pm.processTerms(terms, existingPod, pod, existingPodNode, -1)
		}
		return nil
	}
  
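	// processNode is the node-level helper: it calls processPod above and is itself invoked by
	// the parallel helper below. It has two branches:
	// 1. The pod being scheduled declares (anti-)affinity: every existing pod on the node is processed.
	// 2. It declares none: only existing pods that themselves declare affinity need to be processed.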
	processNode := func(i int) {
		nodeInfo := nodeNameToInfo[allNodeNames[i]]
		if nodeInfo.Node() != nil {
			if hasAffinityConstraints || hasAntiAffinityConstraints {
				for _, existingPod := range nodeInfo.Pods() {
					if err := processPod(existingPod); err != nil {
						pm.setError(err)
					}
				}
			} else {
				for _, existingPod := range nodeInfo.PodsWithAffinity() {
					if err := processPod(existingPod); err != nil {
						pm.setError(err)
					}
				}
			}
		}
	}
	// Run processNode in parallel, one work item per node
	workqueue.ParallelizeUntil(context.TODO(), 16, len(allNodeNames), processNode)
	... // omitted

	// Compute each node's score for the inter-pod affinity dimension
	result := make(schedulerapi.HostPriorityList, 0, len(nodes))
	for _, node := range nodes {
		fScore := float64(0)
		if (maxCount - minCount) > 0 {
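			// The denominator is (maxCount - minCount) rather than maxCount alone, which may be 0.
			// The float division plus int() truncation keeps every node's score between 0 and MaxPriority (10 by default).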
			fScore = float64(schedulerapi.MaxPriority) * ((pm.counts[node.Name] - minCount) / (maxCount - minCount))
		}
		result = append(result, schedulerapi.HostPriority{Host: node.Name, Score: int(fScore)})
		if klog.V(10) {
			klog.Infof("%v -> %v: InterPodAffinityPriority, Score: (%d)", pod.Name, node.Name, int(fScore))
		}
	}
	return result, nil
}
```

The comments in the code above explain how `CalculateInterPodAffinityPriority` works fairly clearly. Now let's look at the scoring function `processTerm()`:

`pkg/scheduler/algorithm/priorities/interpod_affinity.go:107` --> `pkg/scheduler/algorithm/priorities/interpod_affinity.go:86`

```go
func (p *podAffinityPriorityMap) processTerm(term *v1.PodAffinityTerm, podDefiningAffinityTerm, podToCheck *v1.Pod, fixedNode *v1.Node, weight float64) {
	namespaces := priorityutil.GetNamespacesFromPodAffinityTerm(podDefiningAffinityTerm, term)
	selector, err := metav1.LabelSelectorAsSelector(term.LabelSelector)
	if err != nil {
		p.setError(err)
		return
	}
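	// If podToCheck matches the term's namespaces and label selector, every node in the same
	// topology domain as fixedNode (per term.TopologyKey) has weight added to its count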
	match := priorityutil.PodMatchesTermsNamespaceAndSelector(podToCheck, namespaces, selector)
	if match {
		func() {
			p.Lock()
			defer p.Unlock()
			for _, node := range p.nodes {
				// TopologyKey is the topology level. The example above uses kubernetes.io/hostname; Kubernetes has
				// several built-in keys such as failure-domain.beta.kubernetes.io/zone and kubernetes.io/hostname. See:
				// https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#inter-pod-affinity-and-anti-affinity
				if priorityutil.NodesHaveSameTopologyKey(node, fixedNode, term.TopologyKey) {
					p.counts[node.Name] += weight
				}
			}
		}()
	}
}
```

**That is all there is to the algorithm for the InterPodAffinityPriority dimension.**
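As a worked example of the final scoring step, the sketch below applies the same min-max formula to made-up counts. `normalize` is a hypothetical helper, and starting `minCount` and `maxCount` at 0 is an assumption, since that part of the function is elided in the excerpt.

```go
package main

import "fmt"

const maxPriority = 10 // stands in for schedulerapi.MaxPriority

// normalize maps raw affinity counts into the range [0, maxPriority] using
// (count - min) / (max - min), then truncates to an int.
func normalize(counts map[string]float64, nodes []string) map[string]int {
	var minCount, maxCount float64 // assumed to start at 0
	for _, n := range nodes {
		if counts[n] > maxCount {
			maxCount = counts[n]
		}
		if counts[n] < minCount {
			minCount = counts[n]
		}
	}
	scores := make(map[string]int, len(nodes))
	for _, n := range nodes {
		fScore := float64(0)
		if maxCount-minCount > 0 {
			fScore = maxPriority * ((counts[n] - minCount) / (maxCount - minCount))
		}
		scores[n] = int(fScore)
	}
	return scores
}

func main() {
	// node-a hosts an attractive pod (+100), node-b a repelling one (-100),
	// node-c neither (0). All numbers are invented.
	counts := map[string]float64{"node-a": 100, "node-b": -100, "node-c": 0}
	fmt.Println(normalize(counts, []string{"node-a", "node-b", "node-c"}))
}
```

Here the range is 100 - (-100) = 200, so node-a gets 10 * 200/200 = 10, node-c gets 10 * 100/200 = 5, and node-b gets 0. When every node has the same count, the guard leaves all scores at 0.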

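### 4. The Map-Reduce Style

In the init() function at `pkg/scheduler/algorithmprovider/defaults/register_priorities.go:26`, look for the factory calls that are registered, used by default, and supply both a map and a reduce function. There are three. We take one as the example; the others are left out of this article and can be read on your own: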

```go
  // pkg/scheduler/algorithmprovider/defaults/register_priorities.go:58
	// spreads pods by minimizing the number of pods (belonging to the same service or replication controller) on the same node.
	factory.RegisterPriorityConfigFactory(
		priorities.SelectorSpreadPriority,
		factory.PriorityConfigFactory{
			MapReduceFunction: func(args factory.PluginFactoryArgs) (priorities.PriorityMapFunction, priorities.PriorityReduceFunction) {
				return priorities.NewSelectorSpreadPriority(args.ServiceLister, args.ControllerLister, args.ReplicaSetLister, args.StatefulSetLister)
			},
			Weight: 1,
		},
	)

	// pkg/scheduler/algorithmprovider/defaults/register_priorities.go:90
  factory.RegisterPriorityFunction2(priorities.NodeAffinityPriority, priorities.CalculateNodeAffinityPriorityMap, priorities.CalculateNodeAffinityPriorityReduce, 1)
  
  // pkg/scheduler/algorithmprovider/defaults/register_priorities.go:93
  factory.RegisterPriorityFunction2(priorities.TaintTolerationPriority, priorities.ComputeTaintTolerationPriorityMap, priorities.ComputeTaintTolerationPriorityReduce, 1)


```

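Let's take the first one, the `SelectorSpreadPriority` dimension. The name translates literally to "selector spread priority". As its comment says, it tries to spread pods that belong to the same **Service** or **replication controller** across different nodes, for high availability.

`NewSelectorSpreadPriority()` builds and returns this dimension's Map and Reduce functions. Here it is: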

`pkg/scheduler/algorithmprovider/defaults/register_priorities.go:62 NewSelectorSpreadPriority()`----> `pkg/scheduler/algorithm/priorities/selector_spreading.go:45`

```go
func NewSelectorSpreadPriority(
	serviceLister algorithm.ServiceLister,
	controllerLister algorithm.ControllerLister,
	replicaSetLister algorithm.ReplicaSetLister,
	statefulSetLister algorithm.StatefulSetLister) (PriorityMapFunction, PriorityReduceFunction) {
	selectorSpread := &SelectorSpread{
		serviceLister:     serviceLister,
		controllerLister:  controllerLister,
		replicaSetLister:  replicaSetLister,
		statefulSetLister: statefulSetLister,
	}
	return selectorSpread.CalculateSpreadPriorityMap, selectorSpread.CalculateSpreadPriorityReduce
}
```

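Note the four parameters `serviceLister/controllerLister/replicaSetLister/statefulSetLister`: the four higher-level abstractions related to pods, `Service/RC/RS/StatefulSet`, are all covered. The returned map function is `CalculateSpreadPriorityMap` and the reduce function is `CalculateSpreadPriorityReduce`. Let's look at each in turn.

#### 4.1 Map Function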

`pkg/scheduler/algorithm/priorities/selector_spreading.go:66`

```go
func (s *SelectorSpread) CalculateSpreadPriorityMap(pod *v1.Pod, meta interface{}, nodeInfo *schedulernodeinfo.NodeInfo) (schedulerapi.HostPriority, error) {
	var selectors []labels.Selector
	node := nodeInfo.Node()
	if node == nil {
		return schedulerapi.HostPriority{}, fmt.Errorf("node not found")
	}

	priorityMeta, ok := meta.(*priorityMetadata)
	if ok {
		selectors = priorityMeta.podSelectors
	} else {
		selectors = getSelectors(pod, s.serviceLister, s.controllerLister, s.replicaSetLister, s.statefulSetLister)
	}

	if len(selectors) == 0 {
		return schedulerapi.HostPriority{
			Host:  node.Name,
			Score: int(0),
		}, nil
	}

	count := countMatchingPods(pod.Namespace, selectors, nodeInfo)

	return schedulerapi.HostPriority{
		Host:  node.Name,
		Score: count,
	}, nil
}
```

Next, the `countMatchingPods` function:

`pkg/scheduler/algorithm/priorities/selector_spreading.go:187`:

```go
func countMatchingPods(namespace string, selectors []labels.Selector, nodeInfo *schedulernodeinfo.NodeInfo) int {
	if nodeInfo.Pods() == nil || len(nodeInfo.Pods()) == 0 || len(selectors) == 0 {
		return 0
	}
	count := 0
	for _, pod := range nodeInfo.Pods() {
		// Ignore pods being deleted for spreading purposes
		// Similar to how it is done for SelectorSpreadPriority
		if namespace == pod.Namespace && pod.DeletionTimestamp == nil {
			matches := true
			for _, selector := range selectors {
				if !selector.Matches(labels.Set(pod.Labels)) {
					matches = false
					break
				}
			}
			if matches {
				count++
			}
		}
	}
	return count
}
```

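To summarize the computation:

`Service/RC/RS/StatefulSet` are four higher-level resources that manage pods (called higher-level resources below), and all of them select pods by label. So the selectors of the higher-level resources that own the pod being scheduled are collected and compared against every pod currently running on the node. An existing pod counts as a match, and the count is incremented by 1, only when it is in the same namespace, is not being deleted, and its labels satisfy **every one of the pending pod's selectors**; otherwise the count is unchanged. (A higher count means more matching pods already run on the node, so the final priority score should be lower; the reduce function confirms this shortly.)

An example:

- Suppose the pod being scheduled is pod-a-1, and node-a and node-b each already run several pods.
- On node-a, one pod, pod-a-2, belongs to the same Service as pod-a-1, so node-a's count is 1.
- On node-b, no pod is matched by pod-a-1's selectors, so node-b's count is 0.
- The higher the count, the lower the final priority score, so node-b scores higher than node-a.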
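The example can be replayed with a stripped-down version of the counting logic. This sketch uses plain maps rather than the real `labels.Selector` type and skips the namespace and deletion checks; `selector`, `matches`, and `countMatching` are names invented for illustration.

```go
package main

import "fmt"

// selector is a simplified equality-based label selector.
type selector map[string]string

// matches reports whether every key/value pair in the selector is present
// in the given pod labels.
func (s selector) matches(podLabels map[string]string) bool {
	for k, want := range s {
		if got, ok := podLabels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// countMatching counts the pods whose labels satisfy all selectors.
func countMatching(selectors []selector, pods []map[string]string) int {
	if len(selectors) == 0 {
		return 0
	}
	count := 0
	for _, podLabels := range pods {
		all := true
		for _, s := range selectors {
			if !s.matches(podLabels) {
				all = false
				break
			}
		}
		if all {
			count++
		}
	}
	return count
}

func main() {
	// Assume the Service that owns pod-a-1 selects app=web.
	selectors := []selector{{"app": "web"}}
	nodeA := []map[string]string{{"app": "web"}, {"app": "db"}} // includes pod-a-2
	nodeB := []map[string]string{{"app": "db"}}
	fmt.Println("node-a:", countMatching(selectors, nodeA))
	fmt.Println("node-b:", countMatching(selectors, nodeB))
}
```

node-a ends up with a count of 1 and node-b with 0, matching the bullet points above.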

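**That is the end of the map function. The raw count clearly cannot serve as the node's final score in this dimension, which is why a reduce function follows.**

#### 4.2 Reduce Function

The reduce function works from the per-node match counts produced by the map function: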

`pkg/scheduler/algorithm/priorities/selector_spreading.go:99`

```go
func (s *SelectorSpread) CalculateSpreadPriorityReduce(pod *v1.Pod, meta interface{}, nodeNameToInfo map[string]*schedulernodeinfo.NodeInfo, result schedulerapi.HostPriorityList) error {
	countsByZone := make(map[string]int, 10)
	maxCountByZone := int(0)
	maxCountByNodeName := int(0)

	for i := range result {
		if result[i].Score > maxCountByNodeName {
			maxCountByNodeName = result[i].Score
		}
		zoneID := utilnode.GetZoneKey(nodeNameToInfo[result[i].Host].Node())
		if zoneID == "" {
			continue
		}
		countsByZone[zoneID] += result[i].Score
	}

	for zoneID := range countsByZone {
		if countsByZone[zoneID] > maxCountByZone {
			maxCountByZone = countsByZone[zoneID]
		}
	}

	haveZones := len(countsByZone) != 0

	maxCountByNodeNameFloat64 := float64(maxCountByNodeName)
	maxCountByZoneFloat64 := float64(maxCountByZone)
	MaxPriorityFloat64 := float64(schedulerapi.MaxPriority)

	for i := range result {
		// initializing to the default/max node score of maxPriority
		fScore := MaxPriorityFloat64
		if maxCountByNodeName > 0 {
			// The node with the most matches (count == maxCountByNodeName) gets fScore 0
			// A node with the fewest matches, say count == 0, gets fScore 10
			fScore = MaxPriorityFloat64 * (float64(maxCountByNodeName-result[i].Score) / maxCountByNodeNameFloat64)
		}
		// If there is zone information present, incorporate it
		if haveZones {
			zoneID := utilnode.GetZoneKey(nodeNameToInfo[result[i].Host].Node())
			if zoneID != "" {
				zoneScore := MaxPriorityFloat64
				if maxCountByZone > 0 {
					zoneScore = MaxPriorityFloat64 * (float64(maxCountByZone-countsByZone[zoneID]) / maxCountByZoneFloat64)
				}
				// The zone level is blended in here: zoneWeighting is 2/3, so the node-level score gets a weight of 1/3
				fScore = (fScore * (1.0 - zoneWeighting)) + (zoneWeighting * zoneScore)
			}
		}
		result[i].Score = int(fScore)
		if klog.V(10) {
			klog.Infof(
				"%v -> %v: SelectorSpreadPriority, Score: (%d)", pod.Name, result[i].Host, int(fScore),
			)
		}
	}
	return nil
}
```

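It is easy to see that the way this Reduce function turns counts into final scores closely mirrors the last step of the legacy Function:

```go
// Final score in InterPodAffinityPriority
fScore = float64(schedulerapi.MaxPriority) * ((pm.counts[node.Name] - minCount) / (maxCount - minCount))
```

The only difference is that the Map-Reduce style decouples the work into two steps. That concludes the Reduce function.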
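To see how the node-level and zone-level scores combine, here is a condensed sketch of the same arithmetic on made-up data. `nodeCount` and `spreadScores` are hypothetical names, and `zoneWeighting = 2/3` is taken from the comment in the code above.

```go
package main

import "fmt"

const (
	maxPriority   = 10        // stands in for schedulerapi.MaxPriority
	zoneWeighting = 2.0 / 3.0 // zone share of the blended score
)

// nodeCount is one node's map-phase result; zone may be empty.
type nodeCount struct {
	name, zone string
	count      int
}

// spreadScores turns match counts into scores: fewer matches on a node
// (and in its zone) means a higher score.
func spreadScores(nodes []nodeCount) map[string]int {
	countsByZone := map[string]int{}
	maxByNode, maxByZone := 0, 0
	for _, n := range nodes {
		if n.count > maxByNode {
			maxByNode = n.count
		}
		if n.zone != "" {
			countsByZone[n.zone] += n.count
		}
	}
	for _, c := range countsByZone {
		if c > maxByZone {
			maxByZone = c
		}
	}
	scores := make(map[string]int, len(nodes))
	for _, n := range nodes {
		fScore := float64(maxPriority)
		if maxByNode > 0 {
			fScore = maxPriority * (float64(maxByNode-n.count) / float64(maxByNode))
		}
		if n.zone != "" {
			zoneScore := float64(maxPriority)
			if maxByZone > 0 {
				zoneScore = maxPriority * (float64(maxByZone-countsByZone[n.zone]) / float64(maxByZone))
			}
			// Blend: 1/3 node-level score, 2/3 zone-level score.
			fScore = fScore*(1.0-zoneWeighting) + zoneWeighting*zoneScore
		}
		scores[n.name] = int(fScore)
	}
	return scores
}

func main() {
	nodes := []nodeCount{
		{"node-a", "zone-1", 0},
		{"node-b", "zone-1", 2},
		{"node-c", "zone-2", 1},
	}
	fmt.Println(spreadScores(nodes))
}
```

In this data node-a has no matching pods of its own, so its node-level score is 10, but its zone holds the most matches, so its zone-level score is 0; the blend gives 10 * 1/3 + 0 * 2/3, which truncates to 3. An empty node in a crowded zone is therefore still penalized.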

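## Summary

Priority algorithms are somewhat more complex than the predicate algorithms, and in the current version legacy Functions and Map-Reduce style functions are mixed, which makes the code harder to read. Still, careful and repeated reading makes it understandable. The data volume is nowhere near big-data scale; only the map-and-reduce idea is borrowed, which decouples the steps and improves concurrency to some degree.

What will the next article cover? I need to look into it a bit more. Have fun!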