From 16281a4bce83d5224f6c10262d01dca1de0ea939 Mon Sep 17 00:00:00 2001 From: yinwenqin Date: Thu, 22 Aug 2019 16:43:21 +0800 Subject: [PATCH] p4 --- .DS_Store | Bin 8196 -> 8196 bytes package-lock.json | 3 + scheduler/.DS_Store | Bin 6148 -> 6148 bytes ...学习-Scheduler -P3-Node筛选算法.md} | 18 +- ...学习-Scheduler-P1-调度器入口篇.md} | 38 +- ...码学习-Scheduler-P2-调度器框架.md} | 53 +- ...学习-Scheduler-P4-Node优先级算法.md | 636 ++++++++++++++++++ scheduler/P4-Node优先级算法.md | 59 -- scheduler/image/.DS_Store | Bin 8196 -> 8196 bytes scheduler/image/p4/.DS_Store | Bin 0 -> 6148 bytes .../p4/{schedule.jpg => p4-schedule.jpg} | Bin 11 files changed, 712 insertions(+), 95 deletions(-) create mode 100644 package-lock.json rename scheduler/{P3-Node筛选算法.md => Kubernetes源码学习-Scheduler -P3-Node筛选算法.md} (98%) rename scheduler/{P1-调度器入口篇.md => Kubernetes源码学习-Scheduler-P1-调度器入口篇.md} (87%) rename scheduler/{P2-调度器框架.md => Kubernetes源码学习-Scheduler-P2-调度器框架.md} (85%) create mode 100644 scheduler/Kubernetes源码学习-Scheduler-P4-Node优先级算法.md delete mode 100644 scheduler/P4-Node优先级算法.md create mode 100644 scheduler/image/p4/.DS_Store rename scheduler/image/p4/{schedule.jpg => p4-schedule.jpg} (100%) diff --git a/.DS_Store b/.DS_Store index 457edaba78156cec6a5bc51f5088695f00246bc9..36690112cbf6235b9d19feacb0911beac3a5fd6e 100644 GIT binary patch delta 1219 zcmchVTSyd97{|ZAX`MORdem*)HCL-OG+j$;^On?#dFdiXSeZ$&wi68Lx;m?265GY7 zhi)=KPy{8t*$u>kdg!GnBD#=47wV;l9wI2}r3mV{N2s7)n}_+B|NoqC&iVhop}9{2Mh#0q8%>#& zmFq1iUR+*XGcnVVD$>$9Z!<$iIBM=Njb8hq(-mt82fK_k3~JhhaHP)EGj%r;Ue?;z zVfOTy(Sf+iv-KQJjI;KTY~9oyn%o{z>$CNFjBT0JYkLj(Rx+Qj=W8O!S}b$7p;{Dj zpC&A=Wv3aQ+Pa9BvNjRYHgpC%j84_g&&!xtqiyO}cT-{|U&hkl(^|K8cXi3rTS{K8 z2bj3Wwod+hZP1D~M~v{VW399wrsu^pV&tZU6re_GCxaq1Ku72p9j7yNfv(Uv-J-kn zfF99fdO|Ph6}_f6^o~B#C;CiZ0B|HD1uo3RJmg_M3b6!!l%X8es6#y(u?o#-#b#_l zJ3{C}7(M94AckZ&F#C+s+gAN;gYTv7Yp({ z9+AM*3SY5wrb&>>D_zVzB1s}sZdp0!B2hwB?y_p>PM7r9ggeuv^BP{OO-_~|<_#36 z%f-jd8z@mA3{*mflMwU+D*ukcv#Wkqr-WkdFl@f)B+KNChfU zh1xij2CTz+v|s}^VjF_kF2NWQOhkgQ5XBzEun+rj07r1tffE>pjgvSlA&rhX^sgA+ iJ9}(~>sw;%za;SQ0Al|ofjQYsW^flBdmL0Ssf#tY@Aw6OQ?7B()s^zPn! 
z+aiV-e-kyvpVX+)n7&zK;)Bt~M}KNGCXK1l#`Hm7d^8$Od@`Qd*-Kh}o=mh!caoWJ zX3m_MJ!ig|;amU!V`*(KKqUY$$^w;QDppBCF7k?$=l2v4BIyHYkOmWUNF}l>Mh6{% z2LcZS9tb=Tcp&h=?cf1=vqfTSy!%oblz|5V58RO+5cflbvcRa1ixR!7g9>VQ%m&|ew-2nD^>NiX6J z2aHM#%D@AG2UdDO93Ku2Bw>c)_M75&XD(sd36fq(%$bJcHa32bdHDtF))z8X#ELc+ zos7@e*|@8^X(y(pPKeeq)tyQB^fm+ETbxS?k!mGW-ea#h39Q=YP+=`>_w#IQ6w z8?)1v=6LhOrGPQc_@q)XKi^azX>QxUP#>9ZZEdWNv^4KuSm10+&EA83Q?uvJ&p*2G z*ac}D`u78rEVK2s)eSYBOt&;N??TSdA999rUL5K>(myb%C}~H3ST|MI$ml()t2>iQ zi9;6mC2i|iO3#pWBT3s#&sroYO{!)xO@s9=(~KMEb!XJo&$^;E?${}}+Z#A)kgn*P zRLX?@(K*+!E$6tNatJxUv|)32=dRt2`wl$Nb8X|M5`~xX@{x?;7ztA!&KjCKGp0I8 z-O>zedYn9FxyETjPw`S5;6s)=t6!_wQmKTthPJPz#gu9`nobyJ(uR9Zu2B`*!Icl$ zghpRKXlr6d&69+xLN$z?5dGaLol4owYeRKhQQ43~mdGjg@CL3peCQF{R;k*=TbQEp zVN*@$rtE6v?M&(CBN=I~?Cjv3tQ70{XwsgYC3_s%-4*I)%9revSbIQq+@o{4Maz5E zRfP^QHYOxpnx<>4I$A3?-t$0pv{AMc@xFwh9y*~H#)<6~WZ^k@9$tXg;UZju58xB{ z0+lo&0>8oU@CW<}e`6ssEW%=}#2vU3tFZywupRHm4(!4s*pGww5XNx= zPvRs_;c2vS4pZphJU)R>;xqUxzJzb!oA?&Kjql?}_%U9_Pw_MS2EWB0;<;?z zTNma^KCzg4gYRqLVPzc=R!W;&I9CdY(NfyhPSW+nT`4`-Nm3AJPKskSG#Ap%wFVQ{ z{lcajyV3VFfX78*axi?!sEE!`)bq`>>Tbcz{^ggS~hd z`-q8g9L5nG#W75vMm(Ix89YNgOk)PKxQLGv8=u0b@kQd}%lHbuN{qa?jDLMO=1Ii& zR_9+Pp0Z5ankH$Exwk3?t=ruL`SJrRMg0HZ+Q0wb?q&qv7kD7>z`yVS7WBk>Vl=^8 z7rBVFBb28oizLP^O7v8y@WOFIFB~Ua`iCLaBQ#X#M15S8NIjJP^B)4lp98`BAH4q_ I`|jN0A1U+Zi~s-t diff --git a/package-lock.json b/package-lock.json new file mode 100644 index 0000000..48e341a --- /dev/null +++ b/package-lock.json @@ -0,0 +1,3 @@ +{ + "lockfileVersion": 1 +} diff --git a/scheduler/.DS_Store b/scheduler/.DS_Store index 5853d3f725cbfd2499824e2495c8e364b986f77a..881f9b49ac96e93d02b0966548ba58ac1dddef4f 100644 GIT binary patch delta 48 zcmZoMXffEJ%EWkTvKmtZk7RYVp{a$Xj)IAW;pFv9vW&APpJkF~oW1!ilQ`>Uc8$9!k&c3~g;}kRLbbWMfsTTSu~}^`Cx^JIp{-{^Ze>+< YO>N!u$rqU98E0(%z$DJPn1w?C0IU2IL;wH) diff --git a/scheduler/P3-Node筛选算法.md b/scheduler/Kubernetes源码学习-Scheduler -P3-Node筛选算法.md similarity index 98% rename from scheduler/P3-Node筛选算法.md rename to scheduler/Kubernetes源码学习-Scheduler -P3-Node筛选算法.md index 28038ac..a5258cd 100644 --- a/scheduler/P3-Node筛选算法.md +++ b/scheduler/Kubernetes源码学习-Scheduler -P3-Node筛选算法.md @@ -1,3 +1,15 @@ +--- +title: "Kubernetes源码学习-Scheduler-P3-Node筛选算法" +date: 2019/08/12 20:16:58 +tags: +- Kubernetes +- Golang +- 读源码 + + + +--- + # P3-Node筛选算法 ## 前言 @@ -141,7 +153,7 @@ ParallelizeUntil()的这种实现方式,可以很好地将并发实现和具 `pkg/scheduler/core/generic_scheduler.go:460 --> pkg/scheduler/internal/cache/node_tree.go:161` -![](https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/scheduler/image/p3/zone.jpg) +![](http://pwh8f9az4.bkt.clouddn.com/zone.jpg) 可以看到,这里有一个zone的逻辑层级,这个层级仿佛没有见过,google了一番才了解了这个颇为冷门的功能:这是一个轻量级的支持集群联邦特性的实现,单个cluster可以属于多个zone,但这个功能目前只有GCE和AWS支持,且绝大多数的使用场景也用不到,可以说是颇为冷门。默认情况下,cluster只属于一个zone,可以理解为cluster和zone是同层级,因此后面见到有关zone相关的层级,我们直接越过它。有兴趣的朋友可以了解一下zone的概念: @@ -227,7 +239,7 @@ func podFitsOnNode( 有了以上理解,我们接着看代码,图中已注释: -![](https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/scheduler/image/p3/podFitsOnNode.jpg) +![](http://pwh8f9az4.bkt.clouddn.com/podFitsOnNode.jpg) 图中`pkg/scheduler/core/generic_scheduler.go:608`位置正式开始了逐个计算筛选算法,那么筛选方法、筛选方法顺序在哪里呢?在上一篇[P2-框架篇]([https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/scheduler/P2-%E8%B0%83%E5%BA%A6%E5%99%A8%E6%A1%86%E6%9E%B6.md](https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/scheduler/P2-调度器框架.md))中已经有讲过,默认调度算法都在`pkg/scheduler/algorithm/`路径下,我们接着往下看. 
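另外,上文提到的`workqueue.ParallelizeUntil()`并发模式,可以用下面这段可独立运行的小示例直观感受一下。注意这只是示意性质的玩具代码:节点列表与筛选条件均为假设,并非调度器的真实实现,这里假设引用的是client-go中的workqueue包:

```go
package main

import (
	"context"
	"fmt"
	"strings"
	"sync"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	nodes := []string{"node-a", "node-b", "node-c", "node-gpu-1"}

	// 假设的筛选条件:名称带有"gpu"的节点才通过,仅用于演示
	fits := func(name string) bool { return strings.Contains(name, "gpu") }

	var mu sync.Mutex
	filtered := make([]string, 0, len(nodes))

	// 与generic_scheduler中筛选节点时的用法类似:
	// 以node为粒度切分任务,固定16个worker并发执行,每个任务只处理下标为i的节点
	workqueue.ParallelizeUntil(context.TODO(), 16, len(nodes), func(i int) {
		if fits(nodes[i]) {
			mu.Lock()
			filtered = append(filtered, nodes[i])
			mu.Unlock()
		}
	})

	fmt.Println("filtered:", filtered)
}
```

可以看到,并发调度(worker数量、任务切分)与具体的筛选逻辑(fits函数)是完全解耦的,这也正是前文所说ParallelizeUntil"将并发实现和具体逻辑解耦"的含义。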
@@ -256,7 +268,7 @@ var ( [链接](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/scheduling/predicates-ordering.md) -![](https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/scheduler/image/p3/predicates.jpg) +![](http://pwh8f9az4.bkt.clouddn.com/predicates.jpg) **筛选key** diff --git a/scheduler/P1-调度器入口篇.md b/scheduler/Kubernetes源码学习-Scheduler-P1-调度器入口篇.md similarity index 87% rename from scheduler/P1-调度器入口篇.md rename to scheduler/Kubernetes源码学习-Scheduler-P1-调度器入口篇.md index 3f51a45..54e7b34 100644 --- a/scheduler/P1-调度器入口篇.md +++ b/scheduler/Kubernetes源码学习-Scheduler-P1-调度器入口篇.md @@ -1,3 +1,17 @@ +--- +title: "Kubernetes源码学习-Scheduler-P1-调度器入口篇" +date: 2019/08/05 16:27:53 +tags: +- Kubernetes +- Golang +- 读源码 + + + +--- + +## + # 调度器入口 ## 前言 @@ -56,8 +70,8 @@ matebook-x-pro:local ywq$ go run main.go # 因需要多次测试,这里所有的测试步骤就把build的步骤跳过,直接使用go run main.go进行测试 ``` **我们打开IDE来查看一下testapp的代码结构:** -![image](https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/scheduler/image/p1/cobra1.jpg) -![image](https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/scheduler/image/p1/cobra2.jpg) +![image](http://pwh8f9az4.bkt.clouddn.com/cobra1.jpg) +![image](http://pwh8f9az4.bkt.clouddn.com/cobra2.jpg) ``` # 现在还未创建子命令,那么来创建几个试试: @@ -90,7 +104,7 @@ add called ``` **来看看新增的子命令是怎么运行的呢?** -![image](https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/scheduler/image/p1/cobra3.jpg) +![image](http://pwh8f9az4.bkt.clouddn.com/cobra3.jpg) 截图圈中部分可以看出,子命令是在init()函数里为root级添加了一个子命令,先不去管底层实现,接着往下. **测试cobra的强大简洁的flag处理** @@ -101,7 +115,7 @@ deleteCmd.PersistentFlags().StringVar(&obj,"object", "", "A function to delete a ``` 在`Run:func()`匿名函数中添加一行输出: `fmt.Println("delete obj:",cmd.Flag("object").Value)` -![image](https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/scheduler/image/p1/cobra4.jpg) +![image](http://pwh8f9az4.bkt.clouddn.com/cobra4.jpg) 运行结果: @@ -112,7 +126,7 @@ delete obj: obj1 ``` 如果觉得`--`flag符号太麻烦,cobra同样支持短符号`-`flag缩写: -![image](https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/scheduler/image/p1/cobra5.jpg) +![image](http://pwh8f9az4.bkt.clouddn.com/cobra5.jpg) 运行结果: @@ -132,7 +146,7 @@ add.go delete.go get.go pods.go root.go ``` 可以发现,cmd/目录下多了一个pods.go文件,我们来看看它是怎么关联上delete父级命令的,同时为它添加一行输出: -![image](https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/scheduler/image/p1/cobra6.jpg) +![image](http://pwh8f9az4.bkt.clouddn.com/cobra6.jpg) 执行命令: ``` @@ -146,13 +160,13 @@ delete pods: pod1 ## 入口 通过对上方cobra的基本了解,我们不难知道,`cmd/kube-scheduler/scheduler.go`内的main()方法内部实际调用的是`cobra.Command.Run`内的匿名函数,我们可以进入`NewSchedulerCommand()`内部确认: -![image](https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/scheduler/image/p1/main1.jpg) +![image](http://pwh8f9az4.bkt.clouddn.com/main1.jpg) 可以看到,调用了`Run`内部`runCommand`方法,再来看看Run方法内部需要重点关注的几个点: -![image](https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/scheduler/image/p1/runCommand.jpg) +![image](http://pwh8f9az4.bkt.clouddn.com/runCommand.jpg) 其中,上方是对命令行的参数、选项校验的步骤,跳过,重点关注两个变量:`cc和stopCh`,这两个变量会作为最后调用`Run()`方法的参数,其中`stopCh`作用是作为主程序退出的信号通知其他各协程进行相关的退出操作的,另外一个cc变量非常重要,可以点击`c.Complete()`方法,查看该方法的详情: -![image](https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/scheduler/image/p1/runCommand.jpg) +![image](http://pwh8f9az4.bkt.clouddn.com/runCommand.jpg) `Complete()`方法本质上返回的是一个Config结构体,该结构体内部的元素非常丰富,篇幅有限就不一一点开截图了,大家可以自行深入查看这些元素的作用,这里简单概括一下其中几个: ``` @@ -175,13 +189,13 @@ Recorder record.EventRecorder Broadcaster record.EventBroadcaster ``` 
这里层级非常深,不便展示,Config这一个结构体非常重要,可以认真读一读代码。回到`cmd/kube-scheduler/app/server.go`.`runCommand`这里来,接着往下,进入其最后return调用的`Run()`函数中,函数中的前部分都是启动scheduler相关的组件,如event broadcaster、informers、healthz server、metric server等,重点看图中红框圈出的`sched.Run()`,这才是scheduler主程序的调用运行函数: -![image](https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/scheduler/image/p1/Run.jpg) +![image](http://pwh8f9az4.bkt.clouddn.com/Run.jpg) 进入`sched.Run()`: -![image](https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/scheduler/image/p1/scheRun.jpg) +![image](http://pwh8f9az4.bkt.clouddn.com/scheRun.jpg) `wait.Until`这个调用的逻辑是,直到收到stop信号才终止,在此之前循环运行`sched.scheduleOne`。代码走到这里,终于找到启动入口最内部的主体啦: -![image](https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/scheduler/image/p1/scheduleOne.jpg) +![image](http://pwh8f9az4.bkt.clouddn.com/scheduleOne.jpg) `sched.scheduleOne`这个函数有代码点长,整体的功能可以概括为:获取需调度的pod、寻找匹配node、发起绑定到node请求、绑定检查等一系列操作. diff --git a/scheduler/P2-调度器框架.md b/scheduler/Kubernetes源码学习-Scheduler-P2-调度器框架.md similarity index 85% rename from scheduler/P2-调度器框架.md rename to scheduler/Kubernetes源码学习-Scheduler-P2-调度器框架.md index d4b133d..1449afc 100644 --- a/scheduler/P2-调度器框架.md +++ b/scheduler/Kubernetes源码学习-Scheduler-P2-调度器框架.md @@ -1,4 +1,15 @@ -# 调度器框架 +--- +title: "Kubernetes源码学习-Scheduler-P2-调度器框架" +date: 2019/08/09 16:08:30 +tags: +- Kubernetes +- Golang +- 读源码 + + +--- + +## 调度器框架 ## 前言 @@ -12,15 +23,15 @@ 回顾上一篇篇末,我们找到了调度框架的实际调度工作逻辑的入口位置,`pkg/scheduler/scheduler.go:435`, `scheduleOne()`函数内部,定位在`pkg/scheduler/scheduler.go:457`位置,是通过这个`sched.schedule(pod)`方法来获取与pod匹配的node的,我们直接跳转2次,来到了这里`pkg/scheduler/core/generic_scheduler.go:107` -![](https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/scheduler/image/p2/schedule.jpg) +![](http://pwh8f9az4.bkt.clouddn.com/schedule.jpg) -![](https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/scheduler/image/p2/AlgSchedule.jpg) +![](http://pwh8f9az4.bkt.clouddn.com/AlgSchedule.jpg) -![](https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/scheduler/image/p2/scheduleStruct.jpg) +![](http://pwh8f9az4.bkt.clouddn.com/scheduleStruct.jpg) 通过注释可以知道,ScheduleAlgorithm interface中的Schedule方法就是用来为pod筛选node的,但这是个接口方法,并不是实际调用的,我们稍微往下,在`pkg/scheduler/core/generic_scheduler.go:162`这个位置,就可以找到实际调用的Schedule方法: -![](https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/scheduler/image/p2/genericSchedule.jpg) +![](http://pwh8f9az4.bkt.clouddn.com/genericSchedule.jpg) 这个函数里面有4个重要的步骤: @@ -46,41 +57,41 @@ g.selectHost(priorityList) 先来逆向回溯代码结构,找到哪里创建了scheduler,调度器的默认初始化配置,默认的调度算法来源等等框架相关的东西。`Schedule()`方法属于`genericScheduler`结构体,先查看`genericScheduler`结构体,再选中结构体名称,crtl + b组合键查看它在哪些地方被引用,找出创建结构体的位置: -![](https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/scheduler/image/p2/createGenSche.jpg) +![](http://pwh8f9az4.bkt.clouddn.com/createGenSche.jpg) 通过缩略代码框,排除test相关的测试文件,很容易找出创建结构体的地方位于`pkg/scheduler/core/generic_scheduler.go:1189`,点击图中红框圈中位置,跳转过去,果然找到了`NewGenericScheduler()`方法,这个方法是用来创建一个`genericScheduler`对象的,那么我们再次crtl + b组合键查看`NewGenericScheduler`再什么地方被调用: -![](https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/scheduler/image/p2/newGenericScheduler.jpg) +![](http://pwh8f9az4.bkt.clouddn.com/newGenericScheduler.jpg) 找出了在`pkg/scheduler/factory/factory.go:441`这个位置上找到了调用入口,这里位于`CreateFromKeys()`方法中,继续crtl + b查看它的引用,跳转到`pkg/scheduler/factory/factory.go:336`这个位置: -![](https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/scheduler/image/p2/newGenericScheduler.jpg) +![](http://pwh8f9az4.bkt.clouddn.com/newGenericScheduler.jpg) 
-![](https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/scheduler/image/p2/createFromProvider.jpg) +![](http://pwh8f9az4.bkt.clouddn.com/createFromProvider.jpg) -![](https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/scheduler/image/p2/getAlgorithmProvider.jpg) +![](http://pwh8f9az4.bkt.clouddn.com/getAlgorithmProvider.jpg) 这里找到了`algorithmProviderMap`这个变量,顾名思义,这个变量里面包含的应该就是调度算法的来源,点击进去查看,跳转到了`pkg/scheduler/factory/plugins.go:86`这个位置,组合键查看引用,一眼就可以看出哪个引用为这个map添加了元素: -![](https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/scheduler/image/p2/addMapEle.jpg) +![](http://pwh8f9az4.bkt.clouddn.com/addMapEle.jpg) 跳转过去,来到了`pkg/scheduler/factory/plugins.go:391`这个位置,这个函数的作用是为scheduler的配置指定调度算法,即`FitPredicate、Priority`这两个算法需要用到的metric或者方法,再次请出组合键,查找哪个地方调用了这个方法: -![](https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/scheduler/image/p2/registerAlgorithmProvider.jpg) +![](http://pwh8f9az4.bkt.clouddn.com/registerAlgorithmProvider.jpg) 来到了`pkg/scheduler/algorithmprovider/defaults/defaults.go:99`,继续组合键向上查找引用,这次引用只有一个,没有弹窗直接跳转过去了`pkg/scheduler/algorithmprovider/defaults/defaults.go:36`: -![](https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/scheduler/image/p2/registerAlgorithmProvider1.jpg) +![](http://pwh8f9az4.bkt.clouddn.com/registerAlgorithmProvider1.jpg) -![](https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/scheduler/image/p2/init.jpg) +![](http://pwh8f9az4.bkt.clouddn.com/init.jpg) 我们来看看`defaultPredicates(), defaultPriorities()`这两个函数具体的内容: -![](https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/scheduler/image/p2/default.jpg) +![](http://pwh8f9az4.bkt.clouddn.com/default.jpg) 我们随便点击进去一个`predicates`选项查看其内容: -![](https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/scheduler/image/p2/memPressure.jpg) +![](http://pwh8f9az4.bkt.clouddn.com/memPressure.jpg) `CheckNodeMemoryPressure`这个词相应熟悉kubernetes 应用的朋友一定不会陌生,例如在node内存压力大无法调度的pod时,`kubectl describe pod xxx`就会在状态信息里面看到这个关键词。 @@ -103,7 +114,7 @@ func RegisterAlgorithmProvider(name string, predicateKeys, priorityKeys sets.Str 可以看到,这个方法为DefaultProvider绑定了配置:筛选算法和优先级排序算法的key集合,这些key只是字符串,那么是怎么具体落实到计算的方法过程上去的呢?让我们看看`pkg/scheduler/algorithmprovider/defaults/`目录下的`register_predicates.go,register_priorities.go`这两个文件: -![](https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/scheduler/image/p2/preinit.jpg) +![](http://pwh8f9az4.bkt.clouddn.com/preinit.jpg) 它们同样也在init()函数中初始化时使用`factory.RegisterFitPredicate()`方法做了一些注册操作,这个方法的两个参数,前一个是筛选/计算优先级 的关键key名,后一个是具体计算的功能实现方法,点击`factory.RegisterFitPredicate()`方法,深入一级,查看内部代码, @@ -177,15 +188,15 @@ func (s *Scheme) AddTypeDefaultingFunc(srcType Object, fn func(interface{})) { 我们选中然后ctrl+b,查找AddTypeDefaultingFunc()的引用,弹窗中你可以看到有非常非常多的对象都引用了该方法,这些不同类型的对象相信无一例外都是通过Default()方法来生成默认配置的,我们找到其中的包含scheduler的方法: -![](https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/scheduler/image/p2/addDefaultFunc.jpg) +![](http://pwh8f9az4.bkt.clouddn.com/addDefaultFunc.jpg) 跳转进去,来到了这个位置`pkg/scheduler/apis/config/v1alpha1/zz_generated.defaults.go:31`(原谅我的灵魂笔法): -![](https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/scheduler/image/p2/registerDefaults.jpg) +![](http://pwh8f9az4.bkt.clouddn.com/registerDefaults.jpg) 进入`SetDefaults_KubeSchedulerConfiguration()`,来到`pkg/scheduler/apis/config/v1alpha1/defaults.go:42`: -![](https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/scheduler/image/p2/SetDefaults_KubeSchedulerConfiguration.jpg) +![](http://pwh8f9az4.bkt.clouddn.com/SetDefaults_KubeSchedulerConfiguration.jpg) 
看到了`DefaultProvider`吗?是不是觉得瞬间豁然开朗,原来是在这里调用指定了scheduler配置的`AlgorithmSource.Provider`。 @@ -276,7 +287,7 @@ func podFitsOnNode( 最后,对`pkg/scheduler`路径下的各子目录的功能来一个图文总结吧: -![](https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/scheduler/image/p2/dir.jpg) +![](http://pwh8f9az4.bkt.clouddn.com/dir.jpg) diff --git a/scheduler/Kubernetes源码学习-Scheduler-P4-Node优先级算法.md b/scheduler/Kubernetes源码学习-Scheduler-P4-Node优先级算法.md new file mode 100644 index 0000000..dcc2b1d --- /dev/null +++ b/scheduler/Kubernetes源码学习-Scheduler-P4-Node优先级算法.md @@ -0,0 +1,636 @@ +# P4-Node优先级算法 + +## 前言 + +在上一篇文档中,我们过了一遍node筛选算法: + +[p3-Node筛选算法](https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/scheduler/P3-Node%E7%AD%9B%E9%80%89%E7%AE%97%E6%B3%95.md) + +按调度规则设计,对筛选出的node,选择优先级最高的作为最终的fit node。那么本篇承接上一篇,进入下一步,看一看node优先级排序的过程。 + +Tips: 本篇篇幅较长,因调度优选算法较为复杂,但请耐心结合本篇阅读源码,多看几次,一定会有收获。 + +## 正文 + +### 1. 优先级函数 + +#### 1.1 优先级函数入口 + +同上一篇,回到`pkg/scheduler/core/generic_scheduler.go`中的`Schedule()`函数,`pkg/scheduler/core/generic_scheduler.go:184`: + +![](http://pwh8f9az4.bkt.clouddn.com/p4-schedule.jpg) + +截图中有几处标注,metric相关的几行,是收集metric信息,用以提供给prometheus使用的,kubernetes的几个核心组件都有这个功能,以后如果读prometheus的源码,这个单独拎出来再讲。直接进入优先级函数`PrioritizeNodes()`内部`pkg/scheduler/core/generic_scheduler.go:215` + +#### 1.2 优先级函数概括说明 + +`pkg/scheduler/core/generic_scheduler.go:645 PrioritizeNodes()`,代码块较长,就不贴了. + +在此函数上方的注释可以得知,这个函数的工作逻辑: + +- 1.列出所有的优先级计算维度的方法,每个维度的方法返回该维度的得分,每个维度都有内部定义的weight权重,以及得分score,score取值范围在[0-10之间],该维度的最终得分为 (score * weight),得分越高越好 + +- 2.列出所有参与运算的node + +- 3.循环对每一个node分别进行1中所有维度方法项计算,最后将该node的所有计算维度得分汇总 + +这里有一个重要的结构体始终贯穿整个函数栈,特别指出: + +```go + // HostPriority represents the priority of scheduling to a particular host, higher priority is better. +type HostPriority struct { + // Name of the host + Host string + // Score associated with the host + Score int +} +``` + +**两个重要变量** + +```go +// pkg/scheduler/core/generic_scheduler.go:678 +// 注意,这里的results是个双层array的结构,统计的是各维度各node的分别得分,即[][]HostPriority类型,用伪代码抽象一下: +/* +result = [ +// 维度1,各node的得分 +[{node-a: 1},{node-b: 2},{node-c: 3}...], +// 维度2,各node的得分 +[{node-a: 3},{node-b: 1},{node-c: 2}...], +... +] +*/ + results := make([]schedulerapi.HostPriorityList, len(priorityConfigs), len(priorityConfigs)) + + + + // pkg/scheduler/core/generic_scheduler.go:738 + // 这里的result是[]HostPriority类型,即汇总所有维度之后每个node的最终得分 + result := make(schedulerapi.HostPriorityList, 0, len(nodes)) + + +``` + + + +#### 1.3 优先级函数分段说明 + +##### 1.3.1 Function(DEPRECATED) + +`pkg/scheduler/core/generic_scheduler.go:682` + +```go + + + // DEPRECATED: we can remove this when all priorityConfigs implement the + // Map-Reduce pattern. 
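	// 下面的循环只处理仍在使用传统Function方式的维度:每个这样的维度单独起一个goroutine,
	// 一次性算出所有node在该维度的得分并写入results[index];
	// 其余采用Map-Reduce方式的维度,这里先初始化空的HostPriorityList占位,留待后面的Map/Reduce阶段填充。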
+ for i := range priorityConfigs { + if priorityConfigs[i].Function != nil { + wg.Add(1) + go func(index int) { + defer wg.Done() + var err error + results[index], err = priorityConfigs[index].Function(pod, nodeNameToInfo, nodes) + if err != nil { + appendError(err) + } + }(i) + } else { + results[i] = make(schedulerapi.HostPriorityList, len(nodes)) + } + } +``` + +注释中说明这种直接计算方法(`priorityConfigs[i].Function`)是传统模式,已经DEPRECATED掉了,当前版本实际上只有一个维度(pod亲和性)采取了这种方法,取而代之的是Map-Reduce模式的计算方法,参见后方。Function运算的方式,随后会以pod亲和性这个维度的实例代码来说明。 + +##### 1.3.2 Map-Reduce Function + +`pkg/scheduler/core/generic_scheduler.go:698` + +```go + workqueue.ParallelizeUntil(context.TODO(), 16, len(nodes), func(index int) { + nodeInfo := nodeNameToInfo[nodes[index].Name] + for i := range priorityConfigs { + if priorityConfigs[i].Function != nil { + continue + } + + var err error + results[i][index], err = priorityConfigs[i].Map(pod, meta, nodeInfo) + if err != nil { + appendError(err) + results[i][index].Host = nodes[index].Name + } + } + }) + + for i := range priorityConfigs { + if priorityConfigs[i].Reduce == nil { + continue + } + wg.Add(1) + go func(index int) { + defer wg.Done() + if err := priorityConfigs[index].Reduce(pod, meta, nodeNameToInfo, results[index]); err != nil { + appendError(err) + } + if klog.V(10) { + for _, hostPriority := range results[index] { + klog.Infof("%v -> %v: %v, Score: (%d)", util.GetPodFullName(pod), hostPriority.Host, priorityConfigs[index].Name, hostPriority.Score) + } + } + }(i) + } + // Wait for all computations to be finished. + wg.Wait() +``` + +这里可以看出,若该维度未直接指定`priorityConfigs[i].Function`,则采取Map-Reduce模式. + +``` +引申:Map-Reduce是大数据里的思想,简单来说Map函数是对一组元素集上的每一个元素进行高度并行的运算,得到与元素 +集对应(mapping关系)的结果集,Reduce函数则对结果集进行归纳运算而后返回需要的结果。 +``` + +这里再次出现了上一篇中特别提到的`workqueue.ParallelizeUntil()`并行运算控制方法,同样以node为粒度,运行Map函数;而下方并行度不高的Reduce函数,则使用的sync模块才实现并发控制。符合Map-Reduce的思想。 + +没接触过Map-Reduce,但先不要被吓住,这里只是利用了这个思想,数据量并没有复杂到要拆分给多台机器分布式运算的级别。随后举一个使用Map-Reduce计算方法的维度的实例代码来说明。 + +### 2. 优先级计算维度 + +#### 2.1 默认注册的计算维度 + +通过上面的内容,对优先级算法有了一个模糊的认知:**统计节点的各计算维度得分的总和,分数越高优先级越高**。那么默认的优先级计算维度分别有哪些呢?在前面的[scheduler-框架篇](https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/scheduler/P2-调度器框架.md)中有讲过,调度算法全部位于`pkg/scheduler/algorithm`目录中,而`pkg/scheduler/algorithmprovider`内提供以工厂模式创建调度算法相关元素的方法,所以,我们直接来到`pkg/scheduler/algorithmprovider/defaults/register_priorities.go`文件内,所有默认的优先级计算维度的算法都在这里注册,篇幅有限,随便列举其中几个: + +```go + factory.RegisterPriorityFunction2(priorities.EqualPriority, core.EqualPriorityMap, nil, 1) + // Optional, cluster-autoscaler friendly priority function - give used nodes higher priority. + factory.RegisterPriorityFunction2(priorities.MostRequestedPriority, priorities.MostRequestedPriorityMap, nil, 1) + factory.RegisterPriorityFunction2( + priorities.RequestedToCapacityRatioPriority, + priorities.RequestedToCapacityRatioResourceAllocationPriorityDefault().PriorityMap, + nil, + 1) +``` + +如果仔细看代码里的注释可以发现,个别factory函数虽然已经将计算维度注册,但实际上默认并没有启用它,例如`ServiceSpreadingPriority`这一项中的注释表明,它已经相当大程度被`SelectorSpreadPriority`取代了,保留它是为了兼容此前的版本。那么默认使用的计算维度有哪些呢? 

#### 2.2 默认使用的计算维度

默认使用的计算维度,在这个地方声明:

`pkg/scheduler/algorithmprovider/defaults/defaults.go:108`

```go
func defaultPriorities() sets.String {
	return sets.NewString(
		priorities.SelectorSpreadPriority,
		priorities.InterPodAffinityPriority,
		priorities.LeastRequestedPriority,
		priorities.BalancedResourceAllocation,
		priorities.NodePreferAvoidPodsPriority,
		priorities.NodeAffinityPriority,
		priorities.TaintTolerationPriority,
		priorities.ImageLocalityPriority,
	)
}
```

#### 2.3 新旧两种计算方式

注册的每一个计算维度,都有专属的维度描述关键字,即factory方法的第一个参数(字符串类型)。不难发现,这里的每一个关键字,在`pkg/scheduler/algorithm/priorities`目录内都有与其对应的源码文件:

(截图:`pkg/scheduler/algorithm/priorities/` 目录下与各维度关键字对应的文件,如 `interpod_affinity.go`、`selector_spreading.go`、`node_prefer_avoid_pods.go` 等)

显而易见,维度计算的内容就在这些文件中,可以自行通过编辑器的跳转功能逐级查看验证。

通过这些factory方法可以看出,所有维度默认的注册权重都是1,只有`NodePreferAvoidPodsPriority`这一项例外,它的weight值是10000。这一项用于让node通过annotation主动声明避免某些pod调度到自己之上,我们找到文件查看该方法的注释:

`pkg/scheduler/algorithm/priorities/node_prefer_avoid_pods.go:31`

```go
// CalculateNodePreferAvoidPodsPriorityMap priorities nodes according to the node annotation
// "scheduler.alpha.kubernetes.io/preferAvoidPods".
func CalculateNodePreferAvoidPodsPriorityMap(pod *v1.Pod, meta interface{}, nodeInfo *schedulernodeinfo.NodeInfo) (schedulerapi.HostPriority, error) {
... // 省略
}
```

可知node可以通过添加`scheduler.alpha.kubernetes.io/preferAvoidPods`这个annotation,来避免指定的pod调度到自身之上,因此该维度权重极高,足以覆盖其他各计算维度的得分。

如果ctrl + F过滤一下**map**关键字,你会发现,仅有`InterPodAffinityPriority`这一项是没有map关键字的:

```go
	// pods should be placed in the same topological domain (e.g. same node, same rack, same zone, same power domain, etc.)
	// as some other pods, or, conversely, should not be placed in the same topological domain as some other pods.
	factory.RegisterPriorityConfigFactory(
		priorities.InterPodAffinityPriority,
		factory.PriorityConfigFactory{
			Function: func(args factory.PluginFactoryArgs) priorities.PriorityFunction {
				return priorities.NewInterPodAffinityPriority(args.NodeInfo, args.NodeLister, args.PodLister, args.HardPodAffinitySymmetricWeight)
			},
			Weight: 1,
		},
	)
```

这也印证了前面所说的:当前仅剩pod亲和性这一个维度还在使用传统的Function方式(虽然已经被DEPRECATED掉了)。传统的Function是一步直接计算出结果,Map-Reduce则是将这个过程解耦拆成了两个步骤。同时可以看到,很多factory注册函数的形参`reduceFunction`接收到的实参实际是`nil`:

(截图:`register_priorities.go` 中多处注册调用传入的 reduceFunction 为 `nil`)

这就说明这些维度的计算工作在map函数里面已经完成了,不需要再执行reduce函数。因此,传统Function的计算过程同样值得参考,那么首先就来看看`InterPodAffinityPriority`维度是怎么计算的吧!

### 3. 
传统计算Function + +#### 3.1 InterPodAffinityPriority + +看代码之前,先来看一个标准的PodAffinity配置示例: + +**PodAffinity**示例: + +```yaml +apiVersion: v1 +kind: Pod +metadata: + name: pod-a + namespace: default +spec: + affinity: + podAffinity: + preferredDuringSchedulingIgnoredDuringExecution: + - podAffinityTerm: + weight: 100 + labelSelector: + matchExpressions: + - key: like + operator: In + values: + - pod-b + # 拓扑层级,大多数是node层级,但其实还有zone/region等层级 + topologyKey: kubernetes.io/hostname + + podAntiAffinity: + preferredDuringSchedulingIgnoredDuringExecution: + - weight: 100 + podAffinityTerm: + labelSelector: + matchExpressions: + - key: unlike + operator: In + values: + - pod-c + topologyKey: kubernetes.io/hostname + containers: + - name: test + image: gcr.io/google_containers/pause:2.0 +``` + +yaml中的申明意图是: pod-a亲近pod-b,疏远pod-c,所以在这项计算维度里,如果node上运行着pod-b ,则该node加分,如果该node上运行着pod-c,则node减分。 + +来看代码,仔细读代码,你会发现示例中的几个层级的key: `PreferredDuringSchedulingIgnoredDuringExecution`,`podAffinityTerm`,`labelSelector`,`topologyKey`在代码中都会出现: + +`pkg/scheduler/algorithm/priorities/interpod_affinity.go:119`: + +```go +func (ipa *InterPodAffinity) CalculateInterPodAffinityPriority(pod *v1.Pod, nodeNameToInfo map[string]*schedulernodeinfo.NodeInfo, nodes []*v1.Node) (schedulerapi.HostPriorityList, error) { + + affinity := pod.Spec.Affinity + // 判断待调度pod是否存在亲和性约束 + hasAffinityConstraints := affinity != nil && affinity.PodAffinity != nil + // 判断待调度是否pod存在反亲和性约束 + hasAntiAffinityConstraints := affinity != nil && affinity.PodAntiAffinity != nil + + ... // 省略 + + + // 根据node上正在运行的pod来计算node得分的函数,分为两个层面计算,两个层面都可以加减分: + // 1.待调度pod与现存pod的亲和性(软亲和性,因为待调度pod还未实际运行起来) + // 2.现存pod与待调度pod的亲和性(硬亲和性,因为待调度pod正在运行) + // 加减分操作由processTerm()方法进行计分,这个下面再讲 + // 这里是pod级别,被下方node级别的processNode调用 + processPod := func(existingPod *v1.Pod) error { + existingPodNode, err := ipa.info.GetNodeInfo(existingPod.Spec.NodeName) + if err != nil { + if apierrors.IsNotFound(err) { + klog.Errorf("Node not found, %v", existingPod.Spec.NodeName) + return nil + } + return err + } + existingPodAffinity := existingPod.Spec.Affinity + // 判断node上正在运行的pod是否与待调度的pod存在亲和性约束 + existingHasAffinityConstraints := existingPodAffinity != nil && existingPodAffinity.PodAffinity != nil + // 判断node上正在运行的pod是否与待调度的pod存在反亲和性约束 + existingHasAntiAffinityConstraints := existingPodAffinity != nil && existingPodAffinity.PodAntiAffinity != nil + + if hasAffinityConstraints { + terms := affinity.PodAffinity.PreferredDuringSchedulingIgnoredDuringExecution + pm.processTerms(terms, pod, existingPod, existingPodNode, 1) + } + if hasAntiAffinityConstraints { + terms := affinity.PodAntiAffinity.PreferredDuringSchedulingIgnoredDuringExecution + pm.processTerms(terms, pod, existingPod, existingPodNode, -1) + } + + if existingHasAffinityConstraints { + if ipa.hardPodAffinityWeight > 0 { + terms := existingPodAffinity.PodAffinity.RequiredDuringSchedulingIgnoredDuringExecution + for _, term := range terms { + pm.processTerm(&term, existingPod, pod, existingPodNode, float64(ipa.hardPodAffinityWeight)) + } + } + terms := existingPodAffinity.PodAffinity.PreferredDuringSchedulingIgnoredDuringExecution + pm.processTerms(terms, existingPod, pod, existingPodNode, 1) + } + if existingHasAntiAffinityConstraints { + terms := existingPodAffinity.PodAntiAffinity.PreferredDuringSchedulingIgnoredDuringExecution + pm.processTerms(terms, existingPod, pod, existingPodNode, -1) + } + return nil + } + + // 这里是node级别的,调用上方的processPod,被下方的并发控制函数调用,内部逻辑分支有两支: + // 1.pod指定了亲和性约束,那么node上每个现存的pod都要与待调度pod进行硬、软亲和性计算 + // 
2.pod未指定亲和性约束,那么仅需要对node上现存的已指定亲和性约束的pod,与待调度pod进行硬亲和性计算 + processNode := func(i int) { + nodeInfo := nodeNameToInfo[allNodeNames[i]] + if nodeInfo.Node() != nil { + if hasAffinityConstraints || hasAntiAffinityConstraints { + for _, existingPod := range nodeInfo.Pods() { + if err := processPod(existingPod); err != nil { + pm.setError(err) + } + } + } else { + for _, existingPod := range nodeInfo.PodsWithAffinity() { + if err := processPod(existingPod); err != nil { + pm.setError(err) + } + } + } + } + } + // node级别并发 + workqueue.ParallelizeUntil(context.TODO(), 16, len(allNodeNames), processNode) + ... // 省略 + + // 计算此Pod亲和性维度的各node的得分 + result := make(schedulerapi.HostPriorityList, 0, len(nodes)) + for _, node := range nodes { + fScore := float64(0) + if (maxCount - minCount) > 0 { + // 分母是maxCount - minCount,不直接使用maxCount做分母是因为maxCount可能为0,通过整除运算,控制node的最高得分为MaxPriority(默认10),最低位0 + fScore = float64(schedulerapi.MaxPriority) * ((pm.counts[node.Name] - minCount) / (maxCount - minCount)) + } + result = append(result, schedulerapi.HostPriority{Host: node.Name, Score: int(fScore)}) + if klog.V(10) { + klog.Infof("%v -> %v: InterPodAffinityPriority, Score: (%d)", pod.Name, node.Name, int(fScore)) + } + } + return result, nil +} +``` + +上面代码中的注释已经将`CalculateInterPodAffinityPriority`这个函数的工作模式介绍的比较清晰了,那么再看一看计分函数`processTerm()`: + +`pkg/scheduler/algorithm/priorities/interpod_affinity.go:107` --> `pkg/scheduler/algorithm/priorities/interpod_affinity.go:86` + +```go +func (p *podAffinityPriorityMap) processTerm(term *v1.PodAffinityTerm, podDefiningAffinityTerm, podToCheck *v1.Pod, fixedNode *v1.Node, weight float64) { + namespaces := priorityutil.GetNamespacesFromPodAffinityTerm(podDefiningAffinityTerm, term) + selector, err := metav1.LabelSelectorAsSelector(term.LabelSelector) + if err != nil { + p.setError(err) + return + } + // 待调度pod和被检查pod存在亲和性则匹配,匹配且node与指定的term处于同一拓扑层级,则node加分 + match := priorityutil.PodMatchesTermsNamespaceAndSelector(podToCheck, namespaces, selector) + if match { + func() { + p.Lock() + defer p.Unlock() + for _, node := range p.nodes { + // TopologyKey是拓扑逻辑层级,上面例子中的是kubernetes.io/hostname,kuernetes内建了几个层级 + // 如failure-domain.beta.kubernetes.io/zone,kubernetes.io/hostname等,参考: + // https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#inter-pod-affinity-and-anti-affinity + if priorityutil.NodesHaveSameTopologyKey(node, fixedNode, term.TopologyKey) { + p.counts[node.Name] += weight + } + } + }() + } +} +``` + +**podAffinityPriority这个维度的算法到此就明了了** + +### 4. Map-Reduce计算方法 + +在`pkg/scheduler/algorithmprovider/defaults/register_priorities.go:26`中的init()函数内,找出所有在注册且默认被使用的,同时包含map方法和reduce方法的factory函数,一共有3个,我们挑其中之一为例作启发,其余的就不写在文章里了,可以自行阅读: + +```go + // pkg/scheduler/algorithmprovider/defaults/register_priorities.go:58 + // spreads pods by minimizing the number of pods (belonging to the same service or replication controller) on the same node. 
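	// 说明:下面这一条与后面NodeAffinityPriority、TaintTolerationPriority直接调用RegisterPriorityFunction2不同,
	// 它通过PriorityConfigFactory.MapReduceFunction注册,由NewSelectorSpreadPriority()一次返回Map和Reduce两个函数。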
+ factory.RegisterPriorityConfigFactory( + priorities.SelectorSpreadPriority, + factory.PriorityConfigFactory{ + MapReduceFunction: func(args factory.PluginFactoryArgs) (priorities.PriorityMapFunction, priorities.PriorityReduceFunction) { + return priorities.NewSelectorSpreadPriority(args.ServiceLister, args.ControllerLister, args.ReplicaSetLister, args.StatefulSetLister) + }, + Weight: 1, + }, + ) + + // pkg/scheduler/algorithmprovider/defaults/register_priorities.go:90 + factory.RegisterPriorityFunction2(priorities.NodeAffinityPriority, priorities.CalculateNodeAffinityPriorityMap, priorities.CalculateNodeAffinityPriorityReduce, 1) + + // pkg/scheduler/algorithmprovider/defaults/register_priorities.go:93 + factory.RegisterPriorityFunction2(priorities.TaintTolerationPriority, priorities.ComputeTaintTolerationPriorityMap, priorities.ComputeTaintTolerationPriorityReduce, 1) + + +``` + +那就以第一个`ServiceSpreadingPriority`维度为例吧,名字直译为: 选择器均分优先级,注释中可以得知,这一项是为了保障属于同一个**Service**或**replication controller**的的pod,尽量分散开在不同的node里,保障高可用。 + +`NewSelectorSpreadPriority()`方法用来注册此维度的Map和Reduce函数,来看看其内容: + +`pkg/scheduler/algorithmprovider/defaults/register_priorities.go:62 NewSelectorSpreadPriority()`----> `pkg/scheduler/algorithm/priorities/selector_spreading.go:45` + +```go +func NewSelectorSpreadPriority( + serviceLister algorithm.ServiceLister, + controllerLister algorithm.ControllerLister, + replicaSetLister algorithm.ReplicaSetLister, + statefulSetLister algorithm.StatefulSetLister) (PriorityMapFunction, PriorityReduceFunction) { + selectorSpread := &SelectorSpread{ + serviceLister: serviceLister, + controllerLister: controllerLister, + replicaSetLister: replicaSetLister, + statefulSetLister: statefulSetLister, + } + return selectorSpread.CalculateSpreadPriorityMap, selectorSpread.CalculateSpreadPriorityReduce +} +``` + +注意这4个参数:`serviceLister/replicaSetLister/statefulSetLister/controllerLister`,与pod相关的四个上层抽象概念`Service/RC/RS/StatefulSet`都列出来了,返回的map函数是`CalculateSpreadPriorityMap`,reduce函数是`CalculateSpreadPriorityReduce`,分别看一看他们吧 + +#### 4.1 Map函数 + +`pkg/scheduler/algorithm/priorities/selector_spreading.go:66` + +```go +func (s *SelectorSpread) CalculateSpreadPriorityMap(pod *v1.Pod, meta interface{}, nodeInfo *schedulernodeinfo.NodeInfo) (schedulerapi.HostPriority, error) { + var selectors []labels.Selector + node := nodeInfo.Node() + if node == nil { + return schedulerapi.HostPriority{}, fmt.Errorf("node not found") + } + + priorityMeta, ok := meta.(*priorityMetadata) + if ok { + selectors = priorityMeta.podSelectors + } else { + selectors = getSelectors(pod, s.serviceLister, s.controllerLister, s.replicaSetLister, s.statefulSetLister) + } + + if len(selectors) == 0 { + return schedulerapi.HostPriority{ + Host: node.Name, + Score: int(0), + }, nil + } + + count := countMatchingPods(pod.Namespace, selectors, nodeInfo) + + return schedulerapi.HostPriority{ + Host: node.Name, + Score: count, + }, nil +} +``` + +继续看`countMatchingPods`函数: + +`pkg/scheduler/algorithm/priorities/selector_spreading.go:187`: + +```go +func countMatchingPods(namespace string, selectors []labels.Selector, nodeInfo *schedulernodeinfo.NodeInfo) int { + if nodeInfo.Pods() == nil || len(nodeInfo.Pods()) == 0 || len(selectors) == 0 { + return 0 + } + count := 0 + for _, pod := range nodeInfo.Pods() { + // Ignore pods being deleted for spreading purposes + // Similar to how it is done for SelectorSpreadPriority + if namespace == pod.Namespace && pod.DeletionTimestamp == nil { + matches := true + for _, selector := range selectors { + 
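				// 只有当待调度pod所属Service/RC/RS/StatefulSet的所有selector都命中该现存pod的label时,才计入count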
if !selector.Matches(labels.Set(pod.Labels)) {
					matches = false
					break
				}
			}
			if matches {
				count++
			}
		}
	}
	return count
}
```

这里的计算方式可以概括如下:

已知`Service/RC/RS/StatefulSet`这四种对pod进行管理的抽象高层级资源(后面统称高层级资源),选择器都是通过label来匹配pod的。因此,这里将待调度pod所属高层级资源的selector依次列出,与node上现运行的每一个pod进行比较,当某个现运行pod的标签被这些selector全部命中时,视为匹配成功,计数加1,否则不计数(计数越高代表node上同类的现运行pod越多,最终优先级得分就应该越低,待会儿在reduce函数里我们可以印证)。

举个例子:

- 假设待调度的为pod-a-1,node-a、node-b上现都运行有若干个pod
- node-a上有1个pod-a-2与pod-a-1属于同一个Service,那么node-a在map阶段的计数为1
- node-b上没有pod被pod-a-1的selector命中,则node-b在map阶段的计数为0

**map函数到这里就结束了,但这个计数显然还不能作为节点在此维度的最终得分(计数越大说明同类pod越集中,最终得分反而应该越低),因此,下面还有reduce函数**

#### 4.2 Reduce函数

基于前面map函数得出的各node的匹配次数count计数,来展开reduce函数运算:

`pkg/scheduler/algorithm/priorities/selector_spreading.go:99`

```go
func (s *SelectorSpread) CalculateSpreadPriorityReduce(pod *v1.Pod, meta interface{}, nodeNameToInfo map[string]*schedulernodeinfo.NodeInfo, result schedulerapi.HostPriorityList) error {
	countsByZone := make(map[string]int, 10)
	maxCountByZone := int(0)
	maxCountByNodeName := int(0)

	for i := range result {
		if result[i].Score > maxCountByNodeName {
			maxCountByNodeName = result[i].Score
		}
		zoneID := utilnode.GetZoneKey(nodeNameToInfo[result[i].Host].Node())
		if zoneID == "" {
			continue
		}
		countsByZone[zoneID] += result[i].Score
	}

	for zoneID := range countsByZone {
		if countsByZone[zoneID] > maxCountByZone {
			maxCountByZone = countsByZone[zoneID]
		}
	}

	haveZones := len(countsByZone) != 0

	maxCountByNodeNameFloat64 := float64(maxCountByNodeName)
	maxCountByZoneFloat64 := float64(maxCountByZone)
	MaxPriorityFloat64 := float64(schedulerapi.MaxPriority)

	for i := range result {
		// initializing to the default/max node score of maxPriority
		fScore := MaxPriorityFloat64
		if maxCountByNodeName > 0 {
			// 匹配数量最多的node,count=maxCountByNodeName,fScore得分为0
			// 匹配数量最少的node,假设count=0,则fScore得分为10
			fScore = MaxPriorityFloat64 * (float64(maxCountByNodeName-result[i].Score) / maxCountByNodeNameFloat64)
		}
		// If there is zone information present, incorporate it
		if haveZones {
			zoneID := utilnode.GetZoneKey(nodeNameToInfo[result[i].Host].Node())
			if zoneID != "" {
				zoneScore := MaxPriorityFloat64
				if maxCountByZone > 0 {
					zoneScore = MaxPriorityFloat64 * (float64(maxCountByZone-countsByZone[zoneID]) / maxCountByZoneFloat64)
				}
				// 这里将zone层级参与了运算,zoneWeighting=2/3,则nodeWeight取1/3,混合计算最终得分
				fScore = (fScore * (1.0 - zoneWeighting)) + (zoneWeighting * zoneScore)
			}
		}
		result[i].Score = int(fScore)
		if klog.V(10) {
			klog.Infof(
				"%v -> %v: SelectorSpreadPriority, Score: (%d)", pod.Name, result[i].Host, int(fScore),
			)
		}
	}
	return nil
}
```

不难发现,这里Reduce函数统计得分的方式,与传统Function最后一步统计最终得分的步骤可以说是一致的:

```go
// PodAffinityPriority统计最终得分
fScore = float64(schedulerapi.MaxPriority) * ((pm.counts[node.Name] - minCount) / (maxCount - minCount))
```

只不过这里是使用Map-Reduce的思想将其解耦为了两个步骤。Reduce函数介绍到此结束。

## 总结

优先级算法相对而言比predicate断言算法要复杂一些,并且在当前版本的维度计算中存在传统Function函数与Map-Reduce风格函数混用的现象,一定程度上提高了阅读的难度。但相信仔细反复阅读代码,还是不难理解的:毕竟数据量远未到达大数据的级别,这里只是借用了映射-归纳的思想,在解耦的同时提高一定的并发性能。

下一篇讲什么呢?我再研究研究,have fun!
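
附:把上面Reduce的归一化公式代入一组假设的数字,可以更直观地看到"匹配越多、得分越低"的效果。下面是一段可独立运行的示意代码,其中的节点匹配计数纯属虚构,且只演示不含zone信息时的分支,并非调度器的真实实现:

```go
package main

import "fmt"

func main() {
	const maxPriority = 10.0
	// 假设map阶段得到的各node匹配计数
	counts := map[string]int{"node-a": 3, "node-b": 1, "node-c": 0}

	maxCount := 0
	for _, c := range counts {
		if c > maxCount {
			maxCount = c
		}
	}

	// 仿照CalculateSpreadPriorityReduce中不含zone信息时的分支:
	// fScore = MaxPriority * (maxCount - count) / maxCount
	for name, c := range counts {
		fScore := maxPriority
		if maxCount > 0 {
			fScore = maxPriority * float64(maxCount-c) / float64(maxCount)
		}
		fmt.Printf("%s: count=%d score=%d\n", name, c, int(fScore))
	}
	// 输出(map遍历顺序可能不同):node-a得0分,node-b得6分(6.67取整),node-c得10分
}
```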
\ No newline at end of file diff --git a/scheduler/P4-Node优先级算法.md b/scheduler/P4-Node优先级算法.md deleted file mode 100644 index 7b81e26..0000000 --- a/scheduler/P4-Node优先级算法.md +++ /dev/null @@ -1,59 +0,0 @@ -# P4-Node优先级算法 - -## 前言 - -在上一篇文档中,我们过了一遍node筛选算法: - -[p3-Node筛选算法](https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/scheduler/P3-Node%E7%AD%9B%E9%80%89%E7%AE%97%E6%B3%95.md) - -按调度规则设计,对筛选出的node,选择优先级最高的作为最终的fit node。那么本篇承接上一篇,进入下一步,node优先级算法。 - - - -## 正文 - -同上一篇,回到`pkg/scheduler/core/generic_scheduler.go`中的`Schedule()`函数,`pkg/scheduler/core/generic_scheduler.go:184`: - -![](https://github.com/yinwenqin/kubeSourceCodeNote/blob/master/scheduler/image/p4/schedule.jpg) - -截图中有几处标注,metric相关的几行,是收集metric信息,用以提供给prometheus使用的,kubernetes的几个核心组件都有这个功能,以后如果读prometheus的源码,这个单独拎出来再讲。 - - - -**PodAffinity**示例: - -```yaml -apiVersion: v1 -kind: Pod -metadata: - name: pod-a - namespace: default -spec: - affinity: - podAffinity: - preferredDuringSchedulingIgnoredDuringExecution: - - podAffinityTerm: - weight: 100 - labelSelector: - matchExpressions: - - key: like - operator: In - values: - - pod-a - topologyKey: kubernetes.io/hostname - podAntiAffinity: - preferredDuringSchedulingIgnoredDuringExecution: - - weight: 100 - podAffinityTerm: - labelSelector: - matchExpressions: - - key: unlike - operator: In - values: - - pod-a - topologyKey: kubernetes.io/hostname - containers: - - name: test - image: gcr.io/google_containers/pause:2.0 -``` - diff --git a/scheduler/image/.DS_Store b/scheduler/image/.DS_Store index 7f1f946bcc076d7b539d1bed1f4a57f1a502b517..33b809d9bbfb0b0e1e14eda6fb6defdeb7103ee0 100644 GIT binary patch delta 201 zcmZp1XmOa}&&akhU^hP_+hiVrcZ?OAe+z73WDUtn%uSuVM`#Z7PR2u%qlM)ek8G|M z?qCvRVklrRNh&WcNXp4iVqjo6HCaKRfk(2s+R)U(Qb)nW!f^6R0a?ablaC9?i_Qir k0Ba}-I sV8Fx(p&7iOG?ZdIv$H1@V-^m;4Wg<&0T*E43hX&L&p$$qDprKhvt+--jT7}7np#A3 zem<@ulZcFPQ@L2!n>{z**++&mCkOWA81W14cNZlEfg7;MkzE(HCqgga^y>{tEnwC%0;vJ&^%eQ zLs35+`xjp>T0