Today I submitted a kernel patch, mainly about making the child process start and run ahead of the parent at fork time. The body of the email follows:
CFS has been the main scheduler since 2.6.23: everything is fair, no starvation, no complexity. A new task is no longer simply queued at the head so that it quickly preempts current. According to the 2.6.28 kernel code, if you clear the START_DEBIT bit with sysctl -w kernel.sched_features=orig_value&~START_DEBIT_bit, the child task does not always preempt its parent, and the problem is easier to reproduce if the parent task has a lower nice value. My test file is:
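/*******child_first.c**********/
#define _GNU_SOURCE	/* for CPU_ZERO, CPU_SET and sched_setaffinity */
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
int main(int argc,char *argv[])
{
	cpu_set_t mask;
	int v, i = 90000;

	if (argc < 2)
		return 1;
	/* pin parent and child to CPU 0 so both land on the same runqueue */
	CPU_ZERO(&mask);
	CPU_SET(0, &mask);
	sched_setaffinity(0, sizeof(mask), &mask);
	v = atoi(argv[1]);
	nice(v);	/* the argument is a nice increment, e.g. -20 */
	while (i-- > 0)	/* busy loop before forking */
	{
		v++;
	}
	if (fork() == 0)
	{
		printf("sub\n");
		exit(0);
	}
	printf("main,%d\n", v);
	return 0;
}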
Just compile it to child_first and do the following:
[root@zhaoya ~]#sysctl -w kernel.sched_features=0
[root@zhaoya ~]#./child_first -20
[root@zhaoya ~]#./child_first -xx
...
[root@zhaoya ~]#./child_first 10000...
After all this, believe your eyes.
The reason is that the code deciding whether the child should preempt the parent is very loose. If the parent's nice value is very low and nr_running is very small, cfs_rq->min_vruntime is always equal to the parent's vruntime, so curr->vruntime < se->vruntime does not hold. If the nice value is high, cfs_rq->min_vruntime is always a little less than the parent's vruntime, so cfs_rq->min_vruntime < curr->vruntime.
Signed-off-by: Ya Zhao <marywangran@gmail.com>
---
--- linux-2.6.28.1/kernel/sched_fair.c.orig 2009-04-28 22:26:00.000000000 +0800
+++ linux-2.6.28.1/kernel/sched_fair.c 2009-04-28 22:34:49.000000000 +0800
@@ -1628,12 +1628,13 @@ static void task_new_fair(struct rq *rq,
/* 'curr' will be NULL if the child belongs to a different group */
if (sysctl_sched_child_runs_first && this_cpu == task_cpu(p) &&
- curr && curr->vruntime < se->vruntime) {
+ curr && (curr->vruntime < se->vruntime||cfs_rq->min_vruntime <= curr->vruntime)) {
/*
* Upon rescheduling, sched_class::put_prev_task() will place
* 'current' within the tree based on its new key value.
*/
- swap(curr->vruntime, se->vruntime);
+ if( curr->vruntime < se->vruntime )
+ swap(curr->vruntime, se->vruntime);
resched_task(rq->curr);
}
--
Replies:
I say: But if the child runs last, there may be more copy-on-write. The user can disable child-runs-first if he can confirm that the child will not exec or the like. Now that the kernel provides the policy, why implement it only halfway?
Somebody says: Sure, I just wanted to raise the issue, child-runs-first doesn't really work reliably on SMP, and since even embedded is moving to SMP the value of keeping it around seems to be less each day.
I say: You are right, but I don't think a child waking up on another CPU must run first. The kernel does its best for users. In my opinion everything in the kernel is a middle course: if one thing must be done perfectly, something else loses. So if the child wakes up on the same CPU as its parent and the user has chosen the child-runs-first policy, we guarantee child-runs-first; if not, let God make the war continue.
I think child-runs-first does not matter on SMP. If we had to guarantee it there, the two CPUs would spend a lot of time on synchronization, and God knows when the parent would get to run.
But as long as we do have it, I agree that your patch is wanted.
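I put quite a bit of work into this patch. Read the CFS code and think through how it works, and the problem becomes clear. The precondition is that the START_DEBIT feature is off. If you run the test above with a high weight, that is, a negative nice value, the child does not preempt the parent; with a positive nice value, preemption becomes more likely. Why? Simple: a high-weight process is allotted a lot of time in each scheduling period, so its virtual time advances slowly and does not grow quickly. Looking at update_min_vruntime, the vruntime of such a high-weight process may well be the cfs_rq's min_vruntime. In that case the child does not preempt the parent even when sysctl_sched_child_runs_first is 1, because place_entity sets the new process's vruntime directly to the cfs_rq's min_vruntime, so no preemption happens.

But isn't there a final check_preempt_curr in wake_up_new_task that decides on preemption? Yes, but first, we should not delegate something as specific as child-runs-first to a more general mechanism; second, even if we did, check_preempt_curr still could not preempt the parent. In CFS, check_preempt_curr contains the following logic: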
s64 gran, vdiff = curr->vruntime - se->vruntime;
if (vdiff <= 0)
return -1;
gran = wakeup_gran(curr);
if (vdiff > gran)
return 1;
return 0;
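Here vdiff is clearly 0, so the function returns -1 and preemption is off the table. By this theory, will preemption happen when testing with a positive nice value? Yes, but not always; the probability merely goes up. Moreover, with a positive nice value the preemption is not the work of the sysctl_sched_child_runs_first if statement but of check_preempt_curr: a low-weight process's virtual time advances quickly, since it spends most of its time overdrawn, so curr's vruntime is very likely larger than the cfs_rq's min_vruntime and vdiff is positive. But then the preemption granularity becomes a problem: below a certain difference there is no preemption. Really, child-runs-first should not be left to check_preempt_curr at all; it should be settled inside the sysctl_sched_child_runs_first branch.

Did the tests match this theoretical analysis? No. Testing showed that even with a positive nice value preemption was still rare. Why? I added a counter to each of the branches in update_min_vruntime and printed the values whenever jiffies had advanced by a certain amount. I also began to doubt my assumption about the scheduling period, namely that a process runs only once per period, so I added to set_next_entity a measurement of the interval between two consecutive runs of the same process, with two counters: one for intervals shorter than a period and one for all runs, printed at the same place. I found I was wrong: the share of cases where a process was scheduled more than once within a period was far too high, and very often the red-black tree's left_most node checked in update_min_vruntime was NULL. What was going on? I looked at /proc/sched_debug, and good grief, nr_running was 1 or 2, never more than 3. Then I understood: the 100 processes reported by ps -e|wc -l were all I/O-bound, not CPU-bound. So I wrote a CPU-bound process: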
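#include <stdlib.h>
#include <unistd.h>
int main(int argc,char *argv[])
{
	/* unsigned: wraps around instead of overflowing; volatile: keeps the loop from being optimized away */
	volatile unsigned int a = 1, b = 0;

	if (argc > 1)
		nice( atoi(argv[1]) );
	while( a++||1 )	/* always true: spin forever */
	{
		b += a;
	}
}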
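That is CPU-bound enough. I started 40 instances of it, with nice increments from -19 to 20, and looked at the counters again. Beautiful, exactly as I expected: only now was CFS really on track. A process was still occasionally scheduled twice within one period, but far less often; the ratio quickly went from about 1:1 to 1:15, and left_most was no longer NULL. Now everything made sense. With only a few active processes, left_most is of course easily NULL, so even for a process with a very high nice value, as long as it is active and running, min_vruntime is its own vruntime; it is the only one there, after all.

Time to tidy up the patch. 2.6.25 did not have this problem, because resched_task was always called when a new process was created. In the CFS red-black tree, an entity whose key equals an existing one is inserted to the right of it, and since resched_task has been called, schedule will call put_prev_entity; so even if parent and child have equal vruntime, the child preempts the parent. If the parent's vruntime is larger, resched_task ends the parent's run; if it is smaller, the two values are swapped. The 2.6.25 design is quite good, at least as far as creating a child is concerned. The unconditional resched_task call is not pretty, though: when copy-on-write is known to be unavoidable, resched_task is unnecessary and only costs an extra cache refresh. So, to make child-runs-first a policy, resched_task was moved inside that policy. Unexpectedly, because the code was not rigorous, the policy turned into a lie, and my patch is meant to fix this true lie.

The patch submitted above is not ideal in places. It would be better to drop (curr->vruntime < se->vruntime || cfs_rq->min_vruntime <= curr->vruntime) entirely, which amounts to moving the curr->vruntime < se->vruntime test inside the if block, just as in 2.6.25; I will resubmit shortly. The core of the patch is this: if a variable is said to control whether the child runs first, why not make it always take effect?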