Rafael J. Wysocki: CPU power management (cpufreq, cpuidle) integration with the scheduler

July 4, 2013

Participants: Rafael J. Wysocki, Catalin Marinas, Morten Rasmussen, Preeti U Murthy, Vincent Guittot, Tony Luck, Peter Zijlstra, Paul Turner.

People tagged: Ingo Molnar, Morten Rasmussen, Preeti U Murthy, Catalin Marinas, Peter Zijlstra, Len Brown, and Arjan van de Ven.

Threads merged into this one:

Power-aware scheduling

Rafael J. Wysocki suggested a session on power-aware scheduling (see here for an earlier LWN report).

Catalin Marinas expressed his hope that discussions on LKML would continue and that only a few final details would need to be worked out in Edinburgh. ;-)

Morten Rasmussen also expressed his interest in the topic and said that he would soon be posting a new set of patches on LKML, while pointing out earlier efforts here and here.

Preeti U Murthy also expressed an interest in this topic, raising the question of whether integrating power management into the scheduler was the right path, and calling out the difficulty of evaluating the scheduler on large systems, especially given that some changes help certain workloads and configurations but hurt others. Tony Luck agreed that it might be hard to evaluate the scheduler on large systems, but said that because large systems are very much with us, that is what we need to do, perhaps by creating tools for this purpose.

Peter Zijlstra said that on battery-powered systems, the figure of merit was the amount of work enabled by a given battery charge. Peter added that benchmarks would help, and that he had requested them some time ago. Peter's main concern is ensuring that the people working on energy management work together rather than each going off and doing their own little thing. Tony Luck pointed out that energy efficiency is also important on servers, especially when you have a few tens of thousands of them, but Peter replied that large and small systems seem to want quite different things from the scheduler.

Catalin Marinas, after first noting that he was not Morten, argued that scheduling and power management can be organized as follows:

1. Individual CPU C- and P-state selection.
   a. Load tracking.
   b. Idle-time estimation.
2. Optimal task placement.
   a. Cache topology.
   b. Memory topology (NUMA).
   c. Power topology (e.g., socket boundaries).

Catalin further argued that cpufreq and cpuidle currently address the first category, but that their decisions are affected by the scheduler's decisions, which take into account the second category, or rather part of it: the scheduler currently covers (2a) and (2b), but nothing covers (2c). Catalin stated that Morten's patches are a start towards addressing (2c), but that additional work was needed. Catalin noted that Alex Shi's patches also work towards packing tasks to decrease energy consumption, but said that they need additional work to accommodate big.LITTLE systems. Vincent Guittot has additional concerns with Alex's patch set.

Vincent Guittot posted some work on packing tasks, later versions of which are available here. James Bottomley reminded Vincent of the need to update this patch based on earlier feedback from the maintainers. Vincent said that such an update was in progress, saying further that his main goal had been to indicate his interest in the discussion.

Morten Rasmussen argued that Vincent's and Alex's patches are relevant and need to be integrated into whatever power-aware scheduling scheme is settled upon. However, he did not believe that Vincent's and Alex's patches, even taken together, are sufficient, and said that more work is needed. He listed the following topics:

Upgrading Paul Turner's load-tracking patches to account for changes in processor frequency, possibly with the help of hardware such as the x86 aperf/mperf approach suggested by Arjan van de Ven.
Power topology representation, including clock domains, power domains, coupled C-states, and so on.
Power-savings strategies for different platforms and workloads.
Most important, coming to agreement on the goal of power-aware scheduling. Morten believes that different systems and different workloads will require different goals: for example, some systems might need to minimize power consumption without degrading performance, while others might need to minimize energy per instruction even if that results in a substantial performance degradation.
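
The first topic in Morten's list, making load tracking account for processor frequency, can be illustrated with a small sketch. This is not code from any of the patch sets discussed; the function names, the tuple layout, and the choice to cap the scale factor at 1.0 are assumptions made for illustration. It assumes a decayed per-period average whose contributions halve every 32 periods, and it scales each period's runnable fraction by the ratio of actual cycles to reference cycles, as an aperf/mperf-style pair of counters could provide.

```python
# Illustrative sketch only: names, data layout, and the cap are assumptions,
# not kernel code.

# Per-period decay factor chosen so that a contribution halves after 32 periods.
DECAY = 0.5 ** (1 / 32)


def freq_scale(aperf_delta, mperf_delta):
    """Ratio of actual to reference cycles over an interval, capped at 1.0."""
    if mperf_delta <= 0:
        return 1.0  # no reference cycles observed; assume full speed
    return min(aperf_delta / mperf_delta, 1.0)


def tracked_load(periods):
    """Decayed average of frequency-scaled runnable fractions.

    `periods` is an iterable of (runnable_fraction, aperf_delta, mperf_delta)
    tuples, oldest first.
    """
    load = 0.0
    for runnable, aperf, mperf in periods:
        scaled = runnable * freq_scale(aperf, mperf)
        load = load * DECAY + (1.0 - DECAY) * scaled
    return load


# A task that is always runnable, first at half and then at full frequency.
half_speed = tracked_load([(1.0, 500, 1000)] * 320)
full_speed = tracked_load([(1.0, 1000, 1000)] * 320)
```

With this scaling, a task that is runnable all the time on a CPU running at half the reference frequency converges towards a load of about 0.5 rather than 1.0, which is the kind of frequency-invariant signal the discussion was asking for. Whether to cap the ratio (and so ignore turbo-style frequencies above the reference) is a design choice left open here.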