Frequency scaling (Linus Torvalds)


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)
Original-Message-ID: <Pine.LNX.4.33.0208281140030.4507-100000@penguin.transmeta.com>
Date: Wed, 28 Aug 2002 18:46:22 GMT
Message-ID: <fa.oaa1dfv.jkm430@ifi.uio.no>

On Wed, 28 Aug 2002, Dominik Brodowski wrote:
>
> The following patches add CPU frequency and voltage scaling
> support (Intel SpeedStep, AMD PowerNow, etc.) to kernel 2.5.32

The thing is, this interface appears fundamentally broken with respect to
CPU's that change their frequency on the fly. I happen to know one such
CPU rather well myself.

What is this interface supposed to do about a CPU that can change its
frequency dynamically several hundred times a second? Having the OS
control it simply isn't an option - the overhead of the control is _way_
more than is acceptable at that level.

In short, this interface is too broken to be called generic.

A quote from Peter Anvin:

  "What is worse is that the interface is, in my opinion, fundamentally
   broken for *ALL* CPUs.  It doesn't present a policy interface to the
   kernel, instead it presents a frequency-setting interface and expects
   the policy to be done in userspace.  The kernel is the only part of the
   system which has sufficient information (idle times of all CPUs, for
   example) to do a decent job managing the CPU frequency efficiently.
   On Transmeta CPUs this policy should simply be passed down to CMS, of
   course; on other CPUs the kernel needs to manage it."

In other words: there is no valid way that a _user_ can set the policy
right now: the user can set the frequency, but since any sane policy
depends on how busy the CPU is, the user isn't even the right person to
_do_ that, since the user doesn't _know_.

Also note that policy is not just about how busy the CPU is, but also
about how _hot_ the CPU is. Again, a user-mode application (that maybe
polls the situation every minute or so), simply _cannot_ handle this
situation. You need to have the ability to poll the CPU tens of times a
second to react to heat events, and clearly user mode cannot do that
without impacting performance in a big way.

The interface needs to be improved upon. It is simply _not_ valid to say
"run at this speed" as the primary policy.

		Linus

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)
Original-Message-ID: <Pine.LNX.4.33.0208281249520.4507-100000@penguin.transmeta.com>
Date: Wed, 28 Aug 2002 19:57:12 GMT
Message-ID: <fa.ocq1dfv.h4i439@ifi.uio.no>

On 28 Aug 2002, Alan Cox wrote:
>
> Systems designers are designing on the basis of thermal slowdowns being
> the optimal way to build some systems. It's actually quite reasonable for
> many workloads.

Absolutely. Thermal policy is often an overriding thing, where even
non-transmeta CPU's will simply do the decision "on their own", without
input from the OS. That's simply because some designs will literally not
work above certain temperatures and do not have the heat sink capacity to
get out of a tight spot by purely external cooling.

But that's just one part of it. Even aside from thermal concerns, you want
to drop frequency aggressively when the machine is idle, because dropping
the frequency allows you to drop the voltage and effectively gets you a
cubed power reduction (which not only saves your battery, but also cools
the chip down so that when you _do_ start going full speed again you have
more thermal headroom).
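
As an illustration (not part of the original message): the "cubed" figure follows from the usual first-order model of CMOS dynamic power, P ~ C * V^2 * f, if one assumes the voltage can be lowered roughly in proportion to the frequency. A minimal sketch in C; the function name and the linear voltage assumption are mine.

```c
#include <assert.h>

/*
 * Relative dynamic power when running at `ratio` times the full frequency,
 * assuming P ~ C * V^2 * f and (as a simplification) V proportional to f.
 * Real parts have a minimum operating voltage, so this is an idealized figure.
 */
double relative_power(double ratio)
{
	assert(ratio >= 0.0 && ratio <= 1.0);
	double voltage = ratio;            /* assumed: V scales linearly with f */
	return voltage * voltage * ratio;  /* V^2 * f */
}
```

Under these assumptions relative_power(0.5) is 0.125: half the clock costs about one eighth of the power.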

So in order to avoid the thermal shutdown, you need to be proactive about
the frequency. Which again means that a user-level "once a second" or
"once in a blue moon" approach is fundamentally flawed.

I don't disagree with _also_ being able to set the frequency statically.
However, I do disagree with an interface that seems to be _purely_
designed for this, and nothing else.

		Linus

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)
Original-Message-ID: <Pine.LNX.4.33.0208281246560.4507-100000@penguin.transmeta.com>
Date: Wed, 28 Aug 2002 19:48:04 GMT
Message-ID: <fa.ocq9dfv.h4q436@ifi.uio.no>

On 28 Aug 2002, Alan Cox wrote:
>
> You might want to read the paper on the original cpufreq for ARM. It
> gives real world cases where the user -needs- to be able to control the
> policy. I think you misunderstand what the interface is about. Large
> numbers of systems benefit from usermode policy engines.

That's not the point.

The point is that the _policy_ (not the end result) needs to be pushed
down to the kernel, so that the kernel can do the right thing with it.

That policy can be updated in "real time" from user space, of course. But
the fact is that you cannot just set a frequency and leave it at that, it
doesn't work.

		Linus

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)
Original-Message-ID: <Pine.LNX.4.33.0208281327140.8978-100000@penguin.transmeta.com>
Date: Wed, 28 Aug 2002 20:28:16 GMT
Message-ID: <fa.ocqfenv.l444b8@ifi.uio.no>

On 28 Aug 2002, Alan Cox wrote:
>
> If you look at the papers on the original ARM cpufreq code you'll see a
> case where very long granularity user driven policy is pretty much
> essential. The kernel sometimes does not have enough information.

Alan, that is _not_ the point here.

It's ok to tell the kernel these "long-term" policies. But it has to be
told as a POLICY, not as a random number. Because I can show you a hundred
other cases where the user mode code does _not_have_a_clue_.

That's my argument. The kernel should be given a _policy_, not a "this
frequency". Because a frequency is provably not enough, and can be quite
hurtful.

And I do not want to get people used to passing in frequencies, when I can
absolutely _prove_ that it's the wrong thing for 99% of all uses.

		Linus

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)
Original-Message-ID: <Pine.LNX.4.44.0208281633410.27728-100000@home.transmeta.com>
Date: Wed, 28 Aug 2002 23:44:24 GMT
Message-ID: <fa.l6tbp1v.1j1g3pn@ifi.uio.no>

On 29 Aug 2002, Alan Cox wrote:
>
> One of the policies I need from the kernel is "run at the frequency I
> told you to run". It's a policy, it's not the general case policy. The
> /proc file is that policy.

That's ok, but the current code DOES NOT DO THAT.

The current code has no support at all for the notion of policies, and
gives absolutely _zero_ support for it. It blindly assumes that the CPU
can (and should) run at one frequency, and as long as it does that, I
don't want it in the kernel.

> cpufreq is cpu speed control not power management policy. I agree
> entirely that most people should not be using echo "500" >/proc/... as a
> power management policy.
>
> Likewise /dev/hda is not a file system and people should not be using dd
> to store their files.

You've had that argument before, and it was bogus then - and it is bogus
now.

It is possible to put a filesystem on top of /dev/hda - because the block
layer is designed to allow it. It is not possible to build sane policy
upon the current frequency patches, because it is _not_ designed for
passing down the policy.

Exactly because some chips _need_ to have the policy passed down, the
lowest levels need to be able to pass it down.

It is _then_ ok to say that "if you do a 'echo 500 > /proc/cpu/freq', that
will also imply a policy of a fixed frequency". But if the frequency
setting code does not allow for any policy interface AT ALL, then it is
fundamentally broken.

That's my beef with it. We should not have "generic" interfaces that are
known to be fundamentally broken. As it is, the code - as designed - is
useless for a growing class of devices.

Think of it as a layering issue:
 - user level policy
 - kernel interface (possibly many - for different policies)
 - low-level driver

Ok?

Now, what the current patches do is (a) one kernel interface (the
fixed-frequency one) and (b) low-level drivers.

The kernel interface is fine - it doesn't do what I think many people
might want to do, but it's simple and I agree that other policies can be
implemented with other interfaces. Fine.

But the fact that low-level drivers don't even support the notion of a
policy means that they are useless for any other interface. And I'm saying
that it's a clear design bug, and for no good reason.

		Linus

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)
Original-Message-ID: <Pine.LNX.4.33.0208281400330.16824-100000@penguin.transmeta.com>
Date: Wed, 28 Aug 2002 21:04:44 GMT
Message-ID: <fa.n9lp75v.cnejo0@ifi.uio.no>

On Wed, 28 Aug 2002, Dominik Brodowski wrote:
>
> #3 Then the cpufreq driver is called to actually set the CPU frequency.
>
> #3 is absolutely ready

#3 is _not_ ready, if it doesn't include a "policy" part in addition to
the frequency. That was what I started off talking about: on some CPU's
you absolutely do _not_ want to set a hard frequency, you want to tell the
CPU how to behave (possibly together with a frequency _range_).

Until that is done, no other upper layers can use this low-level
functionality, since all upper layers would be forced to come up with a
hard frequency goal.

THAT is the problem. If you want to build infrastructure for upper layers,
then that infrastructure has to be able to pass down sufficient
information from those upper layers.

Think of this as a driver abstraction layer. Some hardware will do more
for you, some will do less. Some hardware is the equivalent of a dumb
frame buffer (where software has to change frequency and voltage by hand,
and be careful about every single step and the delays in between), while
some other hardware contains internal accelerators where you just tell
them what you want, and the hardware will do it for you asynchronously.

The current abstraction layer _thinks_ that all hardware is stupid, and is
thus not actually usable with smart hardware. See?
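
As an illustration (not part of the original message), one way to picture such an abstraction layer: the low-level hook receives the whole request, range plus policy, and each driver uses as much of it as its hardware can. All names here are hypothetical, and the 300/600 MHz speeds are assumed values.

```c
#include <assert.h>

enum cpu_policy { POLICY_POWERSAVE, POLICY_PERFORMANCE };

/* What upper layers hand down: a range and a policy, never just one number. */
struct freq_request {
	unsigned long min_mhz;
	unsigned long max_mhz;
	enum cpu_policy policy;
};

/* Hypothetical low-level driver interface. */
struct freq_driver {
	const char *name;
	/* Returns the MHz the driver programmed, or 0 if the hardware chooses itself. */
	unsigned long (*apply)(const struct freq_request *req);
};

/* "Dumb" hardware: two fixed speeds, software has to pick one of them. */
static unsigned long dumb_apply(const struct freq_request *req)
{
	const unsigned long low = 300, high = 600;  /* assumed speeds */
	int low_ok = low >= req->min_mhz && low <= req->max_mhz;
	int high_ok = high >= req->min_mhz && high <= req->max_mhz;

	assert(req->min_mhz <= req->max_mhz);
	if (high_ok && (req->policy == POLICY_PERFORMANCE || !low_ok))
		return high;
	return low;  /* also the fallback when neither speed fits the range */
}

/* Stands in for the policy registers of "smart" hardware. */
static struct freq_request smart_hw_state;

/* "Smart" hardware: pass the request through and let the chip decide. */
static unsigned long smart_apply(const struct freq_request *req)
{
	smart_hw_state = *req;
	return 0;
}

const struct freq_driver dumb_driver = { "dumb", dumb_apply };
const struct freq_driver smart_driver = { "smart", smart_apply };
```

With a frequency-only hook, smart_apply() would have nothing useful to forward; that is the information loss being objected to.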

			Linus

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)
Original-Message-ID: <Pine.LNX.4.33.0208281331020.8978-100000@penguin.transmeta.com>
Date: Wed, 28 Aug 2002 20:41:54 GMT
Message-ID: <fa.ocabevv.lk843e@ifi.uio.no>

On Wed, 28 Aug 2002, Dominik Brodowski wrote:
>
> Do these CPUs need kernel support? E.g. do udelay() calls work as
> expected?

Crusoe CPU's do not.

But Intel CPU's _do_ need this, for example (since they change the TSC
frequency).

And Intel CPU's do _not_ want to have user mode telling them what to do
several times a second - yet it's entirely reasonable to have a kernel
timer function that estimates processor load at every timer tick, and
reacts to that a few times a second.

Which is why such a CPU needs to be passed in a _policy_. Which is my
whole argument.

Let's put it another way, because I've seen people at Transmeta scramble
when Microsoft thought it was a good idea to have the OS tell what
frequency the CPU should run at, and trust me, they got it wrong. From my
contacts at Intel, I can promise you that they got it wrong wrt Intel's
chips too, so this is not a Transmeta-only issue.

All I'm saying is that instead of a frequency, you should take more of a
"what is the goal of this" approach, and pass in _that_. Then, in user
land, you might have a situation where you know that "the goal is to run
at 300MHz, and nothing else". That may sometimes be the right goal, but
quite often it isn't.

And THAT IS MY POINT. If you have a more policy-oriented interface,
everybody can work with it. If you have a strict "this frequency"
approach, some people literally _cannot_ live with it, and will end up
throttling behind your back.

The goals may be:
 - "low power" vs "high performance"
	Obvious. "Aggressive power management" vs "Power management with
	performance as the primary goal"

 - "strive for max 20% idle"

	The kernel may slow down the clock if the timer tick shows lots of
	idle time. Tell the rest of the system when you do so.

 - "RT latency - 300MHz minimum"
	The idle loop might drop the frequency, but not past a certain
	point.

 - "run at exactly 500 MHz"

Notice how only the _last_ goal is expressible in the "frequency" space.
Everything else needs at least one additional piece of information, ie the
policy the kernel/CPU should take wrt the power management.
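
As an illustration (not part of the original message), the "strive for max 20% idle" goal combined with a frequency floor could be sketched as below. The structure names, the sampling scheme and the scaling rule are assumptions of mine, not code from the patches; notifying the rest of the system about a change is left out.

```c
#include <assert.h>

struct load_goal {
	unsigned int min_mhz;       /* floor, e.g. for RT latency */
	unsigned int max_mhz;
	unsigned int max_idle_pct;  /* slow down when idle time exceeds this */
};

struct load_sample {
	unsigned int idle_ticks;
	unsigned int total_ticks;
};

/* Called from every timer tick. */
void account_tick(struct load_sample *s, int cpu_was_idle)
{
	s->total_ticks++;
	if (cpu_was_idle)
		s->idle_ticks++;
}

/*
 * Called a few times a second. Returns the frequency to switch to and
 * resets the sampling window.
 */
unsigned int next_mhz(const struct load_goal *g, struct load_sample *s,
		      unsigned int cur_mhz)
{
	unsigned int next = cur_mhz;

	assert(g->min_mhz <= g->max_mhz && g->max_idle_pct < 100);
	if (s->total_ticks > 0) {
		unsigned int idle_pct = 100 * s->idle_ticks / s->total_ticks;
		unsigned int busy_pct = 100 - idle_pct;

		if (idle_pct > g->max_idle_pct)
			/* Same work at the new speed leaves about max_idle_pct idle. */
			next = cur_mhz * busy_pct / (100 - g->max_idle_pct);
		else if (idle_pct == 0)
			next = g->max_mhz;  /* saturated: go straight to the top */
	}
	if (next < g->min_mhz)
		next = g->min_mhz;
	if (next > g->max_mhz)
		next = g->max_mhz;
	s->idle_ticks = 0;
	s->total_ticks = 0;
	return next;
}
```

A goal of { 300, 600, 20 } then expresses both "strive for max 20% idle" and "300MHz minimum" at once, which a bare frequency cannot.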

		Linus

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)
Original-Message-ID: <Pine.LNX.4.33.0208281406190.16824-100000@penguin.transmeta.com>
Date: Wed, 28 Aug 2002 21:07:36 GMT
Message-ID: <fa.n8m576v.dnqjo6@ifi.uio.no>

On Wed, 28 Aug 2002, Dominik Brodowski wrote:
>
> "policy input" --> "frequency input" --> cpufreq core --> cpufreq driver
>   user-space    |                 k e r n e l  -  s p a c e

No.

The "policy input" has to filter down ALL THE WAY. If you turn it into a
frequency-only input at _any_ time, you've lost information that the
lowest levels need.

THAT is the problem with the current #3 - it _assumes_ that the policy
input has already been converted to frequency, and since it assumes that,
it cannot handle the case where the hardware itself wants to know what the
policy was.

			Linus

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)
Original-Message-ID: <Pine.LNX.4.44.0208281649540.27728-100000@home.transmeta.com>
Date: Thu, 29 Aug 2002 00:02:46 GMT
Message-ID: <fa.l7dhpav.1jhq21t@ifi.uio.no>

On 29 Aug 2002, Alan Cox wrote:
>
> So what you are saying is that you want to be sure that something like
> "please run at a low speed to save battery" is translated by smarter
> cpus into "please save battery" and on spudstop the CPU would go "umm
> duh ok 300MHz"

Yup, exactly.

I suspect that this is also what most people actually want to use anyway:
you don't care that your CPU is a speedstep 1GHz/500MHz or a 700/300 (or
whatever the combinations are), you really want to just say "go to power
save mode" vs "go to performance mode".

Sure, for speedstep, you can obviously trivially _emulate_ this in user
mode with the frequency approach, but in the generic case you can't.

I don't know how many policies would be needed (too many just adds
complexity for no gain), but I _suspect_ that something like a

 { min-Hz, max-Hz, policy }

triple with "policy" being just a few different values ("performance",
"powersave") is sufficient. Clearly this triple trivially _becomes_ the
"single MHz" by just making min and max be the same if you really want one
particular MHz (at which time "policy" doesn't matter).

With something like the above, you could do something like

	{ 0, ~0UL, "performance" }	=> generic highest performance setting
	{ 0, ~0UL, "power-save" }	=> generic power-save setting
	{ 300, 500, "performance" }	=> give me a performance setting in the specified range
	{ 1700, 1700, "performance" }	=> run at a fixed 1.7GHz

(maybe the "policy" thing actually makes a difference even for the
fixed-frequency case: it can give hints about whether to allow C1-C3
states when idle etc).
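
As an illustration (not part of the original message), the triple could be written down in C roughly as follows, together with a helper for hardware that only offers a table of discrete speeds. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum freq_policy { POLICY_POWERSAVE, POLICY_PERFORMANCE };

struct freq_triple {
	unsigned long min_mhz;
	unsigned long max_mhz;
	enum freq_policy policy;
};

/* min == max means "run at exactly this speed". */
int is_fixed(const struct freq_triple *t)
{
	return t->min_mhz == t->max_mhz;
}

/*
 * For hardware with a table of discrete, nonzero speeds sorted ascending:
 * pick the highest in-range entry for "performance" and the lowest for
 * "power-save". Returns 0 if no supported speed falls inside the range.
 */
unsigned long pick_mhz(const struct freq_triple *t,
		       const unsigned long *table, size_t n)
{
	unsigned long best = 0;

	assert(t->min_mhz <= t->max_mhz);
	for (size_t i = 0; i < n; i++) {
		if (table[i] < t->min_mhz || table[i] > t->max_mhz)
			continue;
		/* First hit is the lowest; later hits only win for performance. */
		if (best == 0 || t->policy == POLICY_PERFORMANCE)
			best = table[i];
	}
	return best;
}
```

For a 1000/500 part, { 0, ~0UL, POLICY_PERFORMANCE } resolves to 1000 and { 0, ~0UL, POLICY_POWERSAVE } to 500, while a fixed { 1700, 1700, ... } request yields 0 because the part has no such speed.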

		Linus

Newsgroups: fa.linux.kernel
From: torvalds@transmeta.com (Linus Torvalds)
Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)
Original-Message-ID: <aklq8b$220$1@penguin.transmeta.com>
Date: Thu, 29 Aug 2002 18:46:02 GMT
Message-ID: <fa.iagu2iv.b6a5iq@ifi.uio.no>

In article <1030618420.7290.112.camel@irongate.swansea.linux.org.uk>,
Alan Cox  <alan@lxorguk.ukuu.org.uk> wrote:
>>  { min-Hz, max-Hz, policy }
>>
>
>For a few of the processors "event-hz" or similar would be nice. The
>Geode supports hardware assisted bursting to full processor speed when
>doing SMM, I/O and IRQ handling.

Hmm.. I would assume that you'd just use the high frequency for that?
So, for example, assuming you have a 600/300 Geode, when you do

	{ 0, ~0UL, "power-save" }

that would tell the Geode driver to run at 300MHz normally
("power-save"), and at 600MHz when doing critical events.

In contrast, a

	{ 0, ~0UL, "performance" }

mode would mean that it always runs at 600MHz (modulo heat throttling,
of course).

And a

	{ 300, 300, "power-save" }

means that you want the chip to always run at 300MHz, even when handling
critical events.

I don't know the exact details of what kinds of frequencies the Geode
supports, but it sounds to me like you don't really need another
frequency value..
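
As an illustration (not part of the original message), a driver for the assumed 600/300 part could derive both its normal and its burst speed from the same triple, without a separate "event-hz" value. The names are hypothetical and the real Geode frequencies may differ.

```c
#include <assert.h>

enum freq_policy { POLICY_POWERSAVE, POLICY_PERFORMANCE };

struct freq_triple {
	unsigned long min_mhz;
	unsigned long max_mhz;
	enum freq_policy policy;
};

struct geode_speeds {
	unsigned long normal_mhz;  /* speed for ordinary execution */
	unsigned long burst_mhz;   /* speed during SMM, I/O and IRQ handling */
};

/* Map a triple onto the two speeds of an assumed 600/300 part. */
struct geode_speeds geode_resolve(const struct freq_triple *t)
{
	const unsigned long low = 300, high = 600;
	int low_ok = low >= t->min_mhz && low <= t->max_mhz;
	int high_ok = high >= t->min_mhz && high <= t->max_mhz;
	struct geode_speeds s;

	assert(t->min_mhz <= t->max_mhz);
	/* Bursts use the fastest speed the range still allows. */
	s.burst_mhz = high_ok ? high : low;
	/* "performance" stays at burst speed; "power-save" idles low if allowed. */
	if (t->policy == POLICY_PERFORMANCE || !low_ok)
		s.normal_mhz = s.burst_mhz;
	else
		s.normal_mhz = low;
	return s;
}
```

The three cases from the message come out as 300/600 for { 0, ~0UL, "power-save" }, 600/600 for { 0, ~0UL, "performance" } and 300/300 for { 300, 300, "power-save" }.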

		Linus

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: RE: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)
Original-Message-ID: <Pine.LNX.4.33.0208281343170.8978-100000@penguin.transmeta.com>
Date: Wed, 28 Aug 2002 20:45:23 GMT
Message-ID: <fa.ocqlf7v.l425rc@ifi.uio.no>

On Wed, 28 Aug 2002, Grover, Andrew wrote:
>
> Well TMTA CPUs would seem to be easy, because all this is done behind the
> OS's back, right?

Yes. However, I certainly wouldn't mind having the same interfaces as
everybody else to set things like "aggressive" vs "powersave". Transmeta
does all the actual _work_ behind the OS's back, but you can still tell
the CPU what policy to take, and what frequency limits to use.

> Let's talk about CPUs in which the OS has to control processor performance.
> The way I see it, there are a bunch of inputs that are going to determine
> CPU speed & voltage: user preference, workload, and thermals.

Absolutely.

> Any workload analysis has to be in the kernel.

....with user mode input (ie user mode can know a lot of high-level stuff
that the kernel _doesn't_ know). So the kernel does potentially need user
input on policy.

		Linus
