Keeping in mind I don't know very much about GPU design --
GPUs are also heavily SIMD (single instruction, multiple data), in other words vector-based. That's why the memory bus is so wide: to get the throughput necessary to do lots of relatively simple operations simultaneously (in practice, mostly linear algebra: vector and matrix arithmetic).
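To make that concrete, here's the sort of loop that model eats for breakfast: the same arithmetic applied uniformly across whole arrays, with no data-dependent control flow. (A plain C sketch, nothing GPU-specific; the names are mine.)

    /* One operation, many data elements: y = a*x + y across the array.
       Every iteration is independent, so the hardware can run lots of
       them at once. */
    void saxpy(float *y, const float *x, float a, int n)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }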
It used to be that there was one instruction decoder, or a few, and a shitton of ALUs and data buses. You could write very high-performance non-branching shaders, or pitiful-performance conditional-branching shaders that ran slower than on the CPU itself (which is better suited to branching, thanks to prediction and speculative execution).
(Well. Going back really far, you had a fixed-function pipeline controlled by registers, hidden by proprietary drivers, and no general computational resources as such. I'm talking about the era from... the early 2000s on, I think?)
This is still true today, but there are more instruction decoders too. AFAIK, the whole decoder-plus-ALUs unit counts as a core (whether it's NVIDIA's CUDA cores or whatever the vendor's term is). As long as you have lots of cores, and an "embarrassingly parallel" problem to solve, you can have all of them branch independently without much penalty (caches notwithstanding, because cache rules all).
But you'll still get that much better performance by harnessing the full power of each core, as well as all cores together. This often means "wasting" arithmetic steps, in exchange for eliminating branches, or even loops (which can at least be unrolled to some extent).
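For the unrolling case, a sketch (assuming n is a multiple of 4, just to keep it short): the rolled loop tests and branches on every element, the unrolled one on every fourth, at the cost of a little bookkeeping.

    /* Four independent accumulators: a quarter of the loop branches,
       and the adds can proceed in parallel. Assumes n % 4 == 0. */
    float sum4(const float *x, int n)
    {
        float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (int i = 0; i < n; i += 4) {
            s0 += x[i];
            s1 += x[i + 1];
            s2 += x[i + 2];
            s3 += x[i + 3];
        }
        return s0 + s1 + s2 + s3;
    }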
A typical example for DSP, vector and GPU code:
Say you have an if statement:
if (cond)
    var1 = expr1;
else
    var2 = expr2;
Instead of branching, it can be rewritten as:
var1 = var1 * (1 - cond) + cond * expr1;
var2 = var2 * cond + (1 - cond) * expr2;
because in C (this is a C example, by the way) a logical expression like (cond) evaluates to 0 or 1. If that expression can itself be evaluated without a branch (e.g., a compare can be done with a subtraction, a sign-extend so the high word becomes 0 or -1, then an arithmetic negation), then it can be used at almost no penalty in the subsequent arithmetic, which is simply the sum of two terms, each masked by the condition or its complement. It seems wasteful at first, but disturbing a deep pipeline is far more wasteful (well, for sufficiently simple expressions; at some point you save time by branching between complex operations instead).
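Putting it together as a runnable toy (the values are made up, and the shift trick assumes a 32-bit int with arithmetic right shift, which C doesn't strictly guarantee):

    #include <stdio.h>

    int main(void)
    {
        int a = 3, b = 7;
        int var1 = 100, var2 = 200;
        int expr1 = -1, expr2 = -2;

        int cond = (a < b);   /* guaranteed 0 or 1 in C */

        /* Both assignments always execute; the condition masks them. */
        var1 = var1 * (1 - cond) + cond * expr1;
        var2 = var2 * cond + (1 - cond) * expr2;

        /* The compare itself, done arithmetically: subtract, shift the
           sign bit down (giving 0 or -1), then negate. Signed right
           shift is implementation-defined in C, and the subtraction
           can overflow for extreme inputs, but it shows the
           branch-free idea. */
        int cond2 = -((a - b) >> 31);

        printf("%d %d %d\n", var1, var2, cond2);   /* prints: -1 200 1 */
        return 0;
    }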
I have no idea if shader compilers know this; it's a pretty obvious optimization on that kind of platform. Certainly it can only be done if there are no side effects (you can always write bad code that's impossible to optimize; writing good code that the compiler can work with takes care). Anyway, I just like that it's a different way of looking at the problem -- normally your instinct is to reduce instruction count, period, but the priorities change on different platforms.
Tim