Trying to understand these results - processor thermal/power tuning.
paulca:
Processors are basically a large lump of FETs.  Leaving that oversimplification behind and moving on quickly...

Modern gaming chips are typically power limited from the factory.  They have become so thermally efficient, and the coolers so good, that the bottleneck they face is the cost of the voltage regulators, PCB and power supply needed to feed them.

So there exist ranges of GPUs, for example, where 3 or 4 models all use the exact same chip and the exact same memory.  The only differences are in the support hardware (the VRMs, cooler and PCB backing them up), the factory power limit and, of course, the cost.

Out of the box they operate with a completely dynamic clock and voltage.  They have a self-learned voltage-to-clock-speed curve.  I believe it starts from a stock template which the card can modify and learn.  I don't know whether this process is per card or per batch; it's likely done on the test jig, scanning out the final stable voltage/clock curve for that individual card.

So, stock, mine sits at 200MHz and 700mV at idle, i.e. nearly standby.  Put it under 100% load, however, and it jumps up its V/Hz curve until it almost instantly slams hard into the input power limiter, settling at around 1860MHz and 1020mV, which comes to about 320W.

So, almost immediately, you see there is no room at all to overclock this card.  Trying to force the clock higher draws more power, and the card just clocks itself back down the curve to meet its power limit.
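Just to put rough numbers on that, here's a back-of-envelope sketch (not a measurement): it treats the card as pure CMOS dynamic power, P ≈ C·V²·f, ignores leakage, and simply solves the effective switched capacitance from the 320W point above.

--- Code: ---
# Back-of-envelope only: treat the card as pure CMOS dynamic power,
# P ~ C_eff * V^2 * f, ignore leakage, and solve C_eff from the quoted
# 320 W @ 1860 MHz @ 1.020 V operating point.  C_eff is not a real
# datasheet figure, just whatever makes that point fit.

P_load, f_load, V_load = 320.0, 1860e6, 1.020

C_eff = P_load / (V_load ** 2 * f_load)
print(f"Effective switched capacitance: {C_eff * 1e9:.0f} nF")

# Same crude model at the idle point (200 MHz, 0.700 V).  Real idle power is
# far lower again because most of the chip is clock-gated at idle.
f_idle, V_idle = 200e6, 0.700
print(f"Scaled dynamic power at idle point: {C_eff * V_idle ** 2 * f_idle:.0f} W")
--- End code ---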

Finally, we come to why I'm asking this here and not on a PC Gamer forum.

When I try to employ undervolting techniques, such as lifting the whole voltage/Hz curve by +200MHz (meaning that instead of running 1860MHz at 1.020V, it hits 1860MHz at 950mV and "would" try to hit 2060MHz at 1020mV), things don't go as expected... or do they?
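For anyone unfamiliar with the tooling, the offset amounts to something like this; the curve points below are invented for illustration, not read off my card.

--- Code: ---
# Rough sketch of what a "+200 MHz curve offset" means, assuming the V/F
# curve is just a table of (core_mV, clock_MHz) points.  The points below
# are made-up illustrative values, not real readings.

stock_curve = [(700, 200), (800, 900), (900, 1500), (950, 1660), (1020, 1860)]

offset_curve = [(mv, mhz + 200) for (mv, mhz) in stock_curve]

for (mv, f0), (_, f1) in zip(stock_curve, offset_curve):
    print(f"{mv} mV: {f0} MHz -> {f1} MHz")
# 1860 MHz is now requested at roughly 950 mV, and 1020 mV "would" target 2060 MHz.
--- End code ---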

It would seem that lowering the voltage does not lower the power draw.  I can hear people already shouting "Ohm's law!", but I'm not sure it's that simple.

Does a higher clock speed at a lower voltage actually result in a higher current draw and equal wattage?

Getting well out of my comfort zone here, but FETs, be they a big lump of a power FET or a nanoscopic FET on a GPU die, have gate charge requirements, and resistance and capacitance working against you.

As you increase the clock speed, the criticality of those rising and falling edges increases.  The limiting factor on those edges is how quickly you can apply charge to the gates of the MOSFETs and how quickly you can dump that charge off again.  Normally higher voltages make these transitions faster, as they can drive more current onto the gates to "charge the gate capacitor" faster.  All of which makes more heat: more current, more voltage, more power, more heat.  At least that is how it "used" to work.

So what I can't figure out is this: if I have a higher-quality die (say a high-binned die) that can run 1860MHz at a lower voltage than the stock 1.020V, that should mean the gates are charging and discharging fast enough to keep the processor stable... but it should also mean less voltage = less current = less power = less heat.
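For a single gate, the relationships I'm hand-waving about look roughly like this (toy numbers, assumed values, drive current held constant, nothing taken from a real process):

--- Code: ---
# Toy single-node numbers for the argument above.  Switching energy per full
# charge/discharge cycle is ~C*V^2, and the edge speed is set by how fast the
# driver can move that charge, t_edge ~ C*V / I_drive.  C_node and I_drive are
# assumed round numbers; in reality I_drive also falls as V falls, which is
# what eventually limits the maximum stable clock at low voltage.

C_node = 1e-15    # 1 fF of gate/node capacitance (assumed)
I_drive = 50e-6   # 50 uA of drive current (assumed, held constant here)

for V in (1.020, 0.950):
    E_cycle = C_node * V ** 2            # J per charge/discharge cycle
    t_edge = C_node * V / I_drive        # s to slew the node through V
    print(f"{V:.3f} V: {E_cycle * 1e15:.2f} fJ/cycle, edge {t_edge * 1e12:.1f} ps")
--- End code ---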

Oddly, in testing, while it does result in slightly less heat, and it looks like it's clocking higher on occasion, it still hits its power limiter and still performs the same or slightly worse.

I mean, actually getting more performance out of the card will require a shunt resistor mod, so I can just blatantly lie to its power limiter.  That's understood.

What I can't figure out is why less voltage is not resulting in less heat per Hz.

Has there been a paradigm shift in IC transistor design that somehow breaks that relationship, or is this more likely an anomaly caused by the various power regulation phases, the location of the shunt resistors and the software control of the power limit?

EDIT:  I ended up applying a bias on the curve to give me up to +250MHz at the lowest voltages and +0MHz at the top end (for stability).  As with the perplexing power limiter behaviour, it doesn't perform any better at 100% load.  However, many game titles, when locked to the display's framerate, don't use all 100%, and it's there that I do see a significant reduction in heat and power.  So much so that a 2022 game, running at 1440p@60FPS with maximum details, was drawing 100W under the card's limit.  At under 100% load the card also drops its clock, and therefore its voltage, to match the load, so it seems to be running cooler.
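Which at least fits the f·V² intuition: once the card drops down its curve, power falls off fast.  A rough illustration (the capped operating point below is a guess, not a reading):

--- Code: ---
# Rough illustration of why a frame-rate cap saves so much power: the card
# drops down its V/F curve, and dynamic power scales as f * V^2.  The
# partial-load point below (1150 MHz @ 0.80 V) is a hypothetical example,
# not something I measured.

P_full, f_full, V_full = 320.0, 1860.0, 1.020   # reported 100% load point
f_capped, V_capped = 1150.0, 0.80               # assumed partial-load point

scale = (f_capped / f_full) * (V_capped / V_full) ** 2
print(f"Estimated capped power: {P_full * scale:.0f} W")
--- End code ---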
Psi:
I think there are likely a few layers of abstraction and 'smarts' between the overclocking tools and the bare die hardware and Vregs, which may be confusing the effects you see when changing settings, since the system is trying to keep things somewhat stable.


Sort of like mechanical airplane control vs fly-by-wire on a modern jet


paulca:

--- Quote from: Psi on May 12, 2022, 11:02:12 am ---I think there are likely a few layers of abstraction and 'smarts' between the overclocking tools and the bare die hardware and Vregs, which may be confusing the effects you see when changing settings, since the system is trying to keep things somewhat stable.


Sort of like mechanical airplane control vs fly-by-wire on a modern jet

--- End quote ---

I agree, that is absolutely true.  However, the measurements it reports for voltage and clock frequency should be the actual values.

One thing I should check is the update rate of the boost clock tuner.  The metrics I am viewing have about a 500ms minimum resolution.  I'm sure the card is running that algorithm at at least 10Hz, more probably 100Hz.  So I might be looking at a fairly smooth clock with a few spikes, but that could just be a Nyquist/aliasing artefact: it may actually be shifting over a much wider range at a much higher rate, and my tweaks may be modifying it in ways I can't see.

Because I biased the clocks up at the lower voltages and left them basically stock at the upper end, it may well be spiking up to very high voltages, testing stability, power and temps, dropping way back as it hits the power limiter hard, then ramping back up again, repeating every 100-200ms.  All I would see is an average clock of 1860MHz.

I'm not sure that adds up though.  The only averaging occurring would be in the capacitance on the ADC's voltage tap.  Maybe that is enough.  Otherwise, if you sample digital data at too low a rate you just get random point values, not averaged ones.

Clock speed, however, doesn't have any averaging or capacitance; it's a purely digital value.  So if it were a sampling-rate anomaly, would I not see randomly fluctuating values instead of fairly steady clocks within +/-30MHz?
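That said, the point-sample behaviour is easy to demonstrate; the toggle rate and clock levels below are assumed purely for illustration.

--- Code: ---
# Point-sample behaviour when a monitoring tool polls a fast-moving digital
# value slowly.  The clock is assumed to bounce between 1750 and 1950 MHz
# every 47 ms (made-up numbers); the tool samples every 500 ms.  Each reading
# is whichever level the clock happened to be at in that instant - there is
# no averaging of a digital value.

def clock_mhz(t_seconds):
    return 1950.0 if int(t_seconds / 0.047) % 2 == 0 else 1750.0

samples = [clock_mhz(n * 0.5) for n in range(12)]
print(samples)
--- End code ---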
paulca:
Maybe there are other things at play too.

Maybe I am actually getting higher clocks at lower voltage, but maybe they don't translate directly into performance because they are triggering sporadic memory bus errors; given that they now run error-correction controllers on the VRAM, that could amount to memory re-reads or error-correction delays.  So I don't see an increase in performance but a slight drop, for the same power (at 100% load).

I know that is how it works if you try to overclock the memory too far, which is another factor entirely.  If you push it too far, your performance starts to drop as the "correctable error rate" on the memory channels rises, slowing them down.

Maybe the GPU cores are being held back by the memory speed, and/or hitting similar diminishing returns at higher clocks.

Can anyone answer the basic hypothetical about FET gate response to higher/lower voltages and rise/fall times?  How does that relate to current, and therefore power, given you have billions of them adding up?  For example: if you lower the gate drive voltage, does that actually cause the gate capacitance to draw more current, or just give a slower rise time and less power?
tom66:
CMOS power dissipation is essentially the sum of static losses and dynamic losses.  Dynamic losses scale with V^2 * f.  So, assuming static losses are not as significant as dynamic losses (a safe assumption for a GPU, I would say), you would expect dropping the voltage to have a significant impact.
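As a rough sketch of that scaling, using the voltages quoted earlier in the thread (pure V² dynamic term, leakage ignored):

--- Code: ---
# Pure dynamic-power scaling (leakage ignored): at the same 1860 MHz, dropping
# the core from 1.020 V to 0.950 V should cut dynamic power by the square of
# the voltage ratio.

V_stock, V_uv = 1.020, 0.950
ratio = (V_uv / V_stock) ** 2
print(f"Dynamic power ratio: {ratio:.3f}")        # ~0.87, i.e. roughly 13% less
print(f"Against a 320 W limit: about {320 * ratio:.0f} W")
--- End code ---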

It is possible the metrics you are seeing are somewhat fudged: the card's controller 'knows' that voltage is outside the acceptable range for a given frequency and so it lies to you.  That would explain why you observe similar performance.

Have you actually measured the core voltage?  You may not be able to measure the frequencies (all internal, generated with PLLs) but you could actually see the core GPU supply voltage on a scope.