From: "Siddha, Suresh B" This time Ken Chen brought up this issue -- No it has nothing to do with industry db benchmark ;-) Even with the above mentioned Nick's patch in -mm, I see system livelock's if for example I have 7000 processes pinned onto one cpu (this is on the fastest 8-way system I have access to). I am sure there will be other systems where this problem can be encountered even with lesser pin count. We tried to fix this issue but as you know there is no good mechanism in fixing this issue with out letting the regular paths know about this. Our proposed solution is appended and we tried to minimize the affect on fast path. It builds up on Nick's patch and once this situation is detected, it will not do any more move_tasks as long as busiest cpu is always the same cpu and the queued processes on busiest_cpu, their cpu affinity remain same(found out by runqueue's "generation_num") Signed-off-by: Suresh Siddha Signed-off-by: Ken Chen Signed-off-by: Andrew Morton --- 25-akpm/kernel/sched.c | 33 ++++++++++++++++++++++++++++++--- 1 files changed, 30 insertions(+), 3 deletions(-) diff -puN kernel/sched.c~sched-improve-pinned-task-handling-again kernel/sched.c --- 25/kernel/sched.c~sched-improve-pinned-task-handling-again 2005-04-01 20:34:44.000000000 -0800 +++ 25-akpm/kernel/sched.c 2005-04-01 20:34:44.000000000 -0800 @@ -205,9 +205,16 @@ struct runqueue { /* * nr_running and cpu_load should be in the same cacheline because * remote CPUs use both these fields when doing load calculation. + * generation_num also needs to be in the same cacheline as nr_running. */ - unsigned long nr_running; + unsigned int nr_running; #ifdef CONFIG_SMP + /* + * generation_num gets incremented in the following cases + * - a process moves to this runqueue + * - cpu affinity of a process on this runqueue is changed + */ + unsigned int generation_num; unsigned long cpu_load[3]; #endif unsigned long long nr_switches; @@ -237,6 +244,8 @@ struct runqueue { task_t *migration_thread; struct list_head migration_queue; + runqueue_t *busiest_rq; + unsigned int busiest_generation_num; #endif #ifdef CONFIG_SCHEDSTATS @@ -598,6 +607,9 @@ static inline void __activate_task(task_ { enqueue_task(p, rq->active); rq->nr_running++; +#ifdef CONFIG_SMP + rq->generation_num++; +#endif } /* @@ -1670,6 +1682,7 @@ void pull_task(runqueue_t *src_rq, prio_ src_rq->nr_running--; set_task_cpu(p, this_cpu); this_rq->nr_running++; + this_rq->generation_num++; enqueue_task(p, this_array); p->timestamp = (p->timestamp - src_rq->timestamp_last_tick) + this_rq->timestamp_last_tick; @@ -1998,6 +2011,14 @@ static int load_balance(int this_cpu, ru schedstat_add(sd, lb_imbalance[idle], imbalance); + /* if all tasks on busiest_cpu were pinned and can't be moved to + * this_cpu and from our last load_balance, there is no + * changes to busiest_cpu's generation_num, then we are balanced + */ + if (unlikely(this_rq->busiest_rq == busiest && + this_rq->busiest_generation_num == busiest->generation_num)) + goto out_balanced; + nr_moved = 0; if (busiest->nr_running > 1) { /* @@ -2013,8 +2034,12 @@ static int load_balance(int this_cpu, ru spin_unlock(&busiest->lock); /* All tasks on this runqueue were pinned by CPU affinity */ - if (unlikely(all_pinned)) + if (unlikely(all_pinned)) { + this_rq->busiest_rq = busiest; + this_rq->busiest_generation_num = busiest->generation_num; goto out_balanced; + } else + this_rq->busiest_rq = NULL; } spin_unlock(&this_rq->lock); @@ -4148,8 +4173,10 @@ int set_cpus_allowed(task_t *p, cpumask_ p->cpus_allowed = 
 
 	/* Can the task run on the task's current CPU? If so, we're done */
-	if (cpu_isset(task_cpu(p), new_mask))
+	if (cpu_isset(task_cpu(p), new_mask)) {
+		rq->generation_num++;
 		goto out;
+	}
 
 	if (migrate_task(p, any_online_cpu(new_mask), &req)) {
 		/* Need help from migration thread: drop lock and wait. */
_
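
For anyone who wants to play with the idea outside the scheduler, here is a
minimal user-space sketch of the generation_num short-circuit.  The names
(fake_rq, try_balance, move_tasks_all_pinned) are made up for illustration
and are not the kernel interfaces; it only shows the bookkeeping, not the
locking or the real balancing logic.

/* Standalone sketch of the generation-number skip; not kernel code. */
#include <stdio.h>
#include <stdbool.h>

struct fake_rq {
	unsigned int nr_running;
	unsigned int generation_num;	/* bumped on enqueue or affinity change */
	struct fake_rq *busiest_rq;	/* queue seen as all-pinned last time */
	unsigned int busiest_generation_num;
};

/* Pretend every task on 'busiest' is pinned and none can be pulled. */
static bool move_tasks_all_pinned(struct fake_rq *this_rq, struct fake_rq *busiest)
{
	(void)this_rq;
	(void)busiest;
	return true;
}

static void try_balance(struct fake_rq *this_rq, struct fake_rq *busiest)
{
	/*
	 * Same busiest queue, and nothing was enqueued on it and no affinity
	 * changed since the last failed attempt: skip the expensive scan.
	 */
	if (this_rq->busiest_rq == busiest &&
	    this_rq->busiest_generation_num == busiest->generation_num) {
		printf("skipped: busiest unchanged since last all-pinned attempt\n");
		return;
	}

	if (move_tasks_all_pinned(this_rq, busiest)) {
		/* Remember what we saw so the next attempt can be skipped. */
		this_rq->busiest_rq = busiest;
		this_rq->busiest_generation_num = busiest->generation_num;
		printf("all pinned: recorded generation %u\n", busiest->generation_num);
		return;
	}

	this_rq->busiest_rq = NULL;
}

int main(void)
{
	struct fake_rq cpu0 = { .nr_running = 1 };
	struct fake_rq cpu1 = { .nr_running = 7000, .generation_num = 42 };

	try_balance(&cpu0, &cpu1);	/* scans, finds everything pinned */
	try_balance(&cpu0, &cpu1);	/* skipped: generation unchanged */

	cpu1.generation_num++;		/* a task's affinity changed */
	try_balance(&cpu0, &cpu1);	/* scans again */
	return 0;
}

The point is only that a failed all-pinned attempt records the (queue,
generation) pair, and further attempts against that queue are skipped until
something enqueues on it or changes affinity there, which is what keeps the
7000-pinned-tasks case from livelocking the balancer.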