Author Topic: Trying to understand these results - processor thermal/power tuning.  (Read 687 times)


Offline paulca (Topic starter)

  • Super Contributor
  • ***
  • Posts: 4362
  • Country: gb
Processors are basically a large lump of FETs.  Leaving that oversimplification behind and moving on quickly...

Modern gaming chips are typically power-limited from the factory.  They have become so thermally efficient, and the coolers so good, that the bottleneck is now the cost of the voltage regulators, PCB and power supply needed to feed them.

So there exist ranges of GPUs, for example, where three or four models all use the exact same chip and the exact same memory.  The only differences are the supporting hardware (VRMs, cooler and the PCB to back them up), the factory power limit and, of course, the price.

Out of the box they run a completely dynamic clock and voltage, following a self-learned voltage-to-clock-speed curve.  I believe it starts from a stock template which the card can then modify and learn.  I don't know whether this process is per card or per batch; it's likely done at the test-jig stage, scanning for the final stable voltage/clock curve of that individual card.

So, stock, mine sits at 200 MHz and 700 mV at idle, i.e. nearly standby.  Put it under 100% load, however, and it climbs its V/Hz curve until, almost instantly, it slams hard into the input power limiter, settling at around 1860 MHz and 1020 mV, which works out to about 320 W.

So, almost immediately, you can see there is no room at all to overclock this card.  Overclocking it, by trying to force the clock higher, draws more power, and the card just clocks itself back down the curve to meet its power limit.

Finally we come to why I'm asking this here and not a PC Gamer forum. 

When I try to employ undervolting techniques, such as lifting the whole voltage/Hz curve by +200 MHz, meaning that instead of running 1860 MHz at 1.020 V it hits 1860 MHz at 950 mV and "would" try to hit 2060 MHz at 1020 mV, things don't go as expected.... or do they?
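To make the offset concrete, here's a minimal sketch with made-up curve points (the real card has many more, in 15 MHz steps); shifting every frequency on the curve up by 200 MHz is equivalent to reaching the old frequency at a lower voltage:

```python
# Hypothetical V/F curve points, purely illustrative (not read from a real card).
stock_curve = {700: 200, 850: 1200, 950: 1660, 1020: 1860}  # mV -> MHz

OFFSET_MHZ = 200

# Lifting the whole curve by +200 MHz: every voltage point now maps to a
# higher frequency, so any given frequency is reached at a lower voltage.
offset_curve = {mv: mhz + OFFSET_MHZ for mv, mhz in stock_curve.items()}

print(offset_curve[950])   # 1860 -> the old top frequency, now at 950 mV
print(offset_curve[1020])  # 2060 -> what the card "would" try at 1020 mV
```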

It would seem that lowering the voltage does not lower the power requirement.  I can hear people already shouting "Ohm's law!", but I'm not sure it's that simple.

Does a higher clock speed at a lower voltage actually result in a higher current draw and the same wattage?

Getting well out of my comfort zone here, but FETs, be they a big lump of a power FET or a nanoscopic FET on a GPU die, have gate charge requirements, with resistance and capacitance working against you.

As you increase the clock speed, the rising and falling edges become more critical.  The limiting factor on those edges is how quickly you can put charge onto the gate of each MOSFET and how quickly you can dump that charge off again.  Normally, higher voltages make these transitions faster, because they can drive more current onto the gates to "charge the gate capacitor" faster.  All of which means more heat: more current, more voltage, more power, more heat.  At least, that is how it "used" to work.

So what I can't figure out is this: if I have a better-fabbed die (say a high-binned one) that can run 1860 MHz at a lower voltage than the stock 1.020 V, that should mean the gates are still charging and discharging fast enough to keep the processor stable.... but it should also mean less voltage = less current = less power = less heat.
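A minimal sketch of my mental model, assuming an ideal gate capacitor (real gate-charge curves are non-linear, and this ignores short-circuit and leakage currents): the charge per edge is Q = C*V and the energy pulled from the supply per full charge/discharge cycle is C*V^2, so at a fixed clock a lower voltage should mean less average current, not more:

```python
# Idealised single gate, purely illustrative numbers (not from a real process).
C_GATE = 1e-15      # 1 fF gate capacitance (assumed)
FREQ = 1.86e9       # 1.86 GHz, one charge/discharge cycle per clock (worst case)

def per_gate(v):
    q = C_GATE * v                  # charge moved onto the gate each rising edge
    i_avg = q * FREQ                # average current = charge per cycle * cycles per second
    p = C_GATE * v**2 * FREQ        # C*V^2 drawn from the supply per cycle, all dissipated
    return i_avg, p

for v in (1.020, 0.950):
    i, p = per_gate(v)
    print(f"{v:.3f} V: avg current {i*1e6:.2f} uA, power {p*1e6:.2f} uW per gate")
# Both current and power drop with voltage; multiply by billions of gates (and an
# activity factor well below 1) and the same scaling should show up at the card level.
```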

Oddly, in testing, while it does result in slightly less heat and it looks like it's clocking higher on occasion, it still hits its power limiter and it still performs the same or slightly worse.

I mean, actually getting more performance out of the card will require a shunt-resistor mod, so I can just blatantly lie to its power limiter.  That's understood.

What I can't figure out is why less voltage isn't resulting in less heat per Hz.

Has there been a paradigm shift in IC transistor design that somehow breaks that relationship, or is this more likely an anomaly caused by the various power-regulation phases, the location of the shunt resistors and the software control of the power limit?

EDIT:  I ended up applying a bias to the curve to give me up to +250 MHz at the lowest voltages and +0 MHz at the top end (for stability).  As with the perplexing power-limiter behaviour, it doesn't perform any better at 100% load.  However, many game titles, when locked to the display's refresh rate, don't use all 100%, and it's there I do see a significant reduction in heat and power.  So much so that a 2022 game, running at 1440p@60FPS with maximum details, was drawing 100 W under the card's limit.  Below 100% load the card also drops its clock, and therefore its voltage, to match the load.  So it does seem to be running cooler.
« Last Edit: May 12, 2022, 10:53:11 am by paulca »
"What could possibly go wrong?"
Current Open Projects:  STM32F411RE+ESP32+TFT for home IoT (NoT) projects.  Child's advent xmas countdown toy.  Digital audio routing board.
 

Offline Psi

  • Super Contributor
  • ***
  • Posts: 10385
  • Country: nz
I think there are likely a few layers of abstraction and 'smarts' in between the overclocking tools and the bare die hardware and Vregs, which may be confusing the effects you see when changing settings, since the system is trying to keep things somewhat stable.


Sort of like mechanical airplane control vs fly-by-wire on a modern jet


« Last Edit: May 12, 2022, 11:03:59 am by Psi »
Greek letter 'Psi' (not Pounds per Square Inch)
 

Offline paulca (Topic starter)

  • Super Contributor
  • ***
  • Posts: 4362
  • Country: gb
I think there are likely a few layers of abstraction and 'smarts' in between the overclocking tools and the bare die hardware and Vregs, which may be confusing the effects you see when changing settings, since the system is trying to keep things somewhat stable.


Sort of like mechanical airplane control vs fly-by-wire on a modern jet

I agree, that is absolutely true.  However, the voltage and clock frequency measurements it reports should be verbatim.

One thing I should check is the update rate of the boost-clock tuner.  The metrics I am viewing have about a 500 ms minimum resolution, but I'm sure the boost algorithm runs at at least 10 Hz, more probably 100 Hz.  So I might be looking at a fairly smooth clock reading with a few spikes, but that could just be a Nyquist/aliasing artefact: the clock may actually be shifting over a much wider range at a much higher rate, and my tweaks may be modifying it in ways I can't see.

Because I biased the clocks up at the lower voltages and left them basically stock at the upper end, it may well be spiking up to very high voltages, testing stability, power and temps, dropping way back as it hits the power limiter hard, then ramping back up again, repeating every 100-200 ms.  All I see is an average clock of 1860 MHz.

I'm not sure that adds up, though.  The only averaging occurring would be in the capacitance on the ADC for the voltage tap.  Maybe that is enough.  Otherwise, if you take a low-rate sample of digital data you just get random values, not averaged ones.

The clock speed, however, doesn't have any averaging or capacitance; it's a purely digital value.  So if this were a sample-frequency anomaly, would I not see randomly fluctuating values instead of a fairly steady reading within +/- 30 MHz?
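That intuition can be checked with a toy simulation (entirely hypothetical numbers: a boost algorithm bouncing the clock between roughly 1700 and 1950 MHz at 10 Hz, sampled every 500 ms).  Undersampling a square-ish oscillation gives you scattered snapshots of whichever state you land on, not a smooth average:

```python
import random

# Toy model: boost controller toggles the clock between two states every 100 ms
# (hypothetical values), while the monitoring tool samples it every 500 ms.
def boost_clock_mhz(t):
    """Clock at time t (seconds): square-wave style oscillation plus a little jitter."""
    high_phase = int(t * 10) % 2 == 0          # flips every 100 ms
    base = 1950 if high_phase else 1700
    return base + random.randint(-15, 15)

samples = [boost_clock_mhz(0.5 * n) for n in range(10)]  # 2 Hz sampling
print(samples)
# The samples bounce between the two states (an aliased, much slower apparent
# oscillation); they never show a steady averaged ~1825 MHz value.  So a steady
# reported clock suggests the real clock is actually sitting there.
```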
"What could possibly go wrong?"
Current Open Projects:  STM32F411RE+ESP32+TFT for home IoT (NoT) projects.  Child's advent xmas countdown toy.  Digital audio routing board.
 

Offline paulca (Topic starter)

  • Super Contributor
  • ***
  • Posts: 4362
  • Country: gb
Maybe there are other things at play too.

Maybe I am actually getting higher clocks at lower voltage, but maybe they don't translate directly into performance because they are triggering sporadic memory-bus errors, and given that the VRAM now sits behind error-correction controllers, that could amount to memory re-reads or error-correction delays.  So I don't see an increase in performance, but a slight decrease, for the same power (at 100% load).

I know that is how it works if you try to overclock the memory too far, which is another factor entirely.  If you push it too far, your performance starts to drop as the "correctable error rate" rises on the memory channels, slowing it down.
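A toy model of that effect (all numbers invented, just to show the shape of it): if each correctable error costs a re-read or correction stall, effective bandwidth flattens and then falls even as the raw clock keeps climbing:

```python
# Invented illustration: past some knee the error rate grows quickly, and each
# error adds re-read overhead, so effective throughput rolls over.
def effective_bandwidth(clock_mhz, knee_mhz=10000, penalty=20):
    raw = clock_mhz                                                # raw bandwidth ~ clock
    error_rate = max(0.0, (clock_mhz - knee_mhz) / knee_mhz) ** 2  # toy error model
    return raw / (1 + penalty * error_rate)                       # re-read overhead per error

for clk in (9500, 10000, 10500, 11000, 11500):
    print(clk, round(effective_bandwidth(clk), 1))
# Throughput rises only while the error term stays negligible, then flattens
# and falls even though the raw clock is still going up.
```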

Or maybe the GPU cores are simply being held back by the memory speed, and/or hitting similar negative performance returns at higher speeds.

Can anyone answer the basic, hypothetical question of FET gate response to higher/lower voltages and rise/fall times?  How does that relate to current, and therefore power, given you have billions of them adding up?  For example, if you lower the gate-charge voltage, does that actually cause the gate capacitance to draw more current, or just give a slower rise time and less power?
"What could possibly go wrong?"
Current Open Projects:  STM32F411RE+ESP32+TFT for home IoT (NoT) projects.  Child's advent xmas countdown toy.  Digital audio routing board.
 

Online tom66

  • Super Contributor
  • ***
  • Posts: 7334
  • Country: gb
  • Electronics Hobbyist & FPGA/Embedded Systems EE
CMOS power dissipation is essentially static losses plus dynamic losses.  Dynamic losses scale with V^2 * f.  So, assuming static losses are not as significant as dynamic losses (a safe assumption for a GPU, I would say), you would expect that dropping the voltage would have a significant impact.

It is possible the metrics you are seeing are somewhat fudged:  the card controller 'knows' that voltage is outside of the acceptable range for a given frequency and so it lies to you.  That is why you observe similar performance.

Have you actually measured the core voltage?  You may not be able to measure the frequencies (all internal, generated with PLLs) but you could actually see the core GPU supply voltage on a scope.
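Plugging your first post's numbers into that relation (frequency held at 1860 MHz, so only the V^2 term changes; this treats the reported 320 W as if it were all dynamic core power, which it isn't, so it's only a rough upper bound on the expected saving):

```python
# Rough expectation from P_dyn ~ C * V^2 * f with f held constant at 1860 MHz.
v_stock, v_uv = 1.020, 0.950
scale = (v_uv / v_stock) ** 2
print(f"scale = {scale:.3f}")            # ~0.867, i.e. roughly 13% less dynamic power
print(f"expected ~{320 * scale:.0f} W")  # vs the reported 320 W at stock voltage
```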
 

Offline paulca (Topic starter)

  • Super Contributor
  • ***
  • Posts: 4362
  • Country: gb
CMOS power dissipation is essentially static losses plus dynamic losses.  Dynamic losses scale with V^2 * f.  So, assuming static losses are not as significant as dynamic losses (a safe assumption for a GPU, I would say), you would expect that dropping the voltage would have a significant impact.

It is possible the metrics you are seeing are somewhat fudged:  the card controller 'knows' that voltage is outside of the acceptable range for a given frequency and so it lies to you.  That is why you observe similar performance.

Have you actually measured the core voltage?  You may not be able to measure the frequencies (all internal, generated with PLLs) but you could actually see the core GPU supply voltage on a scope.

I've not measured it, no.  I'm not sure I have the test rig to do so, and I've killed more than one thing with a slipped probe before.  (Those rails carry 300+ amps!)  Also, VCore is distributed across many VRM phases, each running a short duty cycle, so I'd need to be sure I was measuring after they all get smoothed out... and not have the scope's loading cause instabilities.  A cheap scope with a 1 MOhm input probably won't cut it, although I do have a 10 M input scope.

On it lying: I have seen it occasionally overrule me in subtle ways.  For example, I locked it to 900 mV and 1875 MHz and it said, "Sure", then ran at 1860 MHz at 900 mV.  It was reporting honestly but ignoring the request for 1875.  I pushed it up to 1890 (it works in 15 MHz intervals) and it answered, "Really?".  It crashed.

But still, 1860 MHz @ 900 mV is a significant undervolt compared with the stock 1.020 V at the same frequency.  Yet it hits the power limiter just as hard and underperforms.  Baffling.
"What could possibly go wrong?"
Current Open Projects:  STM32F411RE+ESP32+TFT for home IoT (NoT) projects.  Child's advent xmas countdown toy.  Digital audio routing board.
 

Offline paulca (Topic starter)

  • Super Contributor
  • ***
  • Posts: 4362
  • Country: gb
One thing I should try is applying a forced voltage limit.  The software I'm using to tweak doesn't directly support this, but I believe other tools can, and the card will accept a hard voltage limit in addition to the curve.  Normally it works its way up the curve until it hits the temperature, power or voltage limit; if I can force a 900 mV ceiling and prevent it from trying to boost any higher at all, I might get different results.
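Assuming this is an NVIDIA card, the closest thing I can see to doing this programmatically (voltage itself isn't exposed) is pinning the core clock range via NVML, which indirectly pins where on the V/F curve the card can sit.  A sketch using the pynvml bindings (needs a recent driver and admin/root; the clock and power values below are just examples):

```python
# Sketch, assuming an NVIDIA card and the pynvml bindings (nvidia-ml-py).
# NVML doesn't expose voltage control, but locking the clock range effectively
# caps which point of the V/F curve the card is allowed to boost to.
import pynvml

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

# Cap the core clock at 1860 MHz (min left low so it can still idle down).
pynvml.nvmlDeviceSetGpuLockedClocks(gpu, 210, 1860)

# Optionally also drop the board power limit (value in milliwatts).
# pynvml.nvmlDeviceSetPowerManagementLimit(gpu, 240_000)

# ... run the test load here ...

pynvml.nvmlDeviceResetGpuLockedClocks(gpu)   # back to default boost behaviour
pynvml.nvmlShutdown()
```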

Obviously the voltage settings are deeply hidden behind many disclaimers, and mostly they are inert these days: no manual voltage control, no fixed voltage.  But I think you can still set a 'lower' voltage limit.
"What could possibly go wrong?"
Current Open Projects:  STM32F411RE+ESP32+TFT for home IoT (NoT) projects.  Child's advent xmas countdown toy.  Digital audio routing board.
 

Offline paulca (Topic starter)

  • Super Contributor
  • ***
  • Posts: 4362
  • Country: gb
The gross of this is the fun: seeing how far it will go, seeing how much I can get out of it without modding the shunt resistors or the vBIOS (the cheaper of the two).

The net is that I'm not going to run it at its peak; there are usually stability and thermal consequences, not to mention longevity/silicon degradation.  But getting the most out of the power limit basically means hunting for the card's most efficient curve, so that when I back off the gaming load, or drop the power limiter myself to maybe 75%, the card runs much, much cooler and draws less power, but doesn't lose more than 10% performance.

So you can see why I'm confused by the results not actually lowering the power draw at full load.
"What could possibly go wrong?"
Current Open Projects:  STM32F411RE+ESP32+TFT for home IoT (NoT) projects.  Child's advent xmas countdown toy.  Digital audio routing board.
 

