But to respond to the intent of the question: the default optimisation option for SDCC is 'balanced', and this was with that. Haven't tried compiling the original code with 'speed' or 'size' optimisation levels, but I doubt it would make much difference. SDCC doesn't seem to have as sophisticated an optimiser as GCC.
Anyway, I think it's more the difference of approach to the problem that leads to the drastic improvement, rather than any compiler cleverness (or lack thereof).
Ah...
Sometimes the compiler is smart enough to do things like this; it depends on the internal structure, and how many conditions they put into the optimizer. (GCC and Clang use an intermediate format to reason about optimization, which is then converted to the final target instruction set with little or no additional optimization.)
An example where this optimization might fail: when the intermediate format has efficient indexed indirect addressing -- access to structs and arrays (and both together) is always(??) implemented literally, using indexed indirect addressing modes when available -- but if the target doesn't have them, it constantly performs pointer arithmetic to deal with the indexing. Of... more historic interest, the Z80 and 8086 have some of these modes, but they're pretty slow at it (10-20 cycles), so it's not always the fastest route, or the most compact (the instructions are much longer, too).
It looks like the STM8 is... 6502-ish? Doesn't have more registers, or many more anyway, but more can be used as pointers? So that'll be a lot of faffing around regardless. Hmm, SP-relative indexed? That's handy.
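To illustrate the indexing point in C terms (a sketch -- the types and function names here are purely illustrative, nothing target-specific): the first form asks for indexed indirect access on every iteration, while the second walks a pointer, which weak optimizers on targets without indexed modes often handle much better.

```c
#include <stdint.h>

struct pixel { uint8_t r, g, b; };

/* Naive form: each access is p[i].r, i.e. base + i*sizeof(struct) + offset,
   recomputed every iteration if the compiler doesn't strength-reduce it. */
static uint16_t sum_red_indexed(const struct pixel *p, uint8_t n)
{
    uint16_t sum = 0;
    for (uint8_t i = 0; i < n; i++)
        sum += p[i].r;
    return sum;
}

/* Pointer-walk form: one pointer increment per iteration instead of an
   index multiply/add -- same result, often much cheaper on small targets. */
static uint16_t sum_red_pointer(const struct pixel *p, uint8_t n)
{
    const struct pixel *end = p + n;
    uint16_t sum = 0;
    for (; p != end; p++)
        sum += p->r;
    return sum;
}
```

Both return the same value, of course; the difference only shows up in the generated code.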
Optimizers also tend to be lazy: an operation they most definitely could improve, they might fail on, simply because they run out of time (or internal budget) to do it.
Last crazy example I made was a bit-shuffling operation; specifically, packing 24-bit color into a 5-6-5 LCD format. I wrote a one-liner (well, one statement, sprawled over half a dozen lines for readability...) and GCC shat out something like 200 words, of what looked to be the expression pretty much verbatim. Much like your first example, just longer. Putting intermediate steps in temp variables cut that by about half.
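Something along these lines, for reference -- this is a sketch of the idea, not my original code, and the function name is just illustrative. The temp-variable version is the one that cut the output roughly in half:

```c
#include <stdint.h>

/* Pack 24-bit RGB into a 5-6-5 LCD word. Splitting the work into named
   temporaries gives the optimizer smaller expressions to chew on, which
   can matter when it gives up on one giant statement. */
static uint16_t rgb888_to_565(uint8_t r, uint8_t g, uint8_t b)
{
    uint16_t r5 = (uint16_t)(r >> 3);   /* keep top 5 bits of red   */
    uint16_t g6 = (uint16_t)(g >> 2);   /* keep top 6 bits of green */
    uint16_t b5 = (uint16_t)(b >> 3);   /* keep top 5 bits of blue  */
    return (uint16_t)((r5 << 11) | (g6 << 5) | b5);
}
```

The one-statement version is the same expression with the temporaries substituted inline; semantically identical, but apparently harder on the optimizer.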
Seems likely GCC would've improved your particular expression, but clearly SDCC gave up. Probably there's a complexity threshold where each one gives up, and my example was past GCC's limit, and yours was past SDCC's.
Do try the other optimization levels -- those settings sound analogous to -O1 or 2, -O3 and -Os. Maybe they'll use more powerful/aggressive tricks, or just "think" about it for longer.
In any case, it's definitely a thing, optimizing code for the target -- ultimately, you're only writing C code for a purpose. When that's specific to a platform, you can write statements that more closely reflect the platform, and should be easier to optimize as a result. (But aren't always. Do keep an open mind to alternatives, and try several when you need to squeeze out those extra bytes/cycles. The compiler certainly isn't as clever as you are, and may not pick up on the approach you were thinking of.)
The emphasis on knowing the compiler, choosing judicious optimization settings, and matching platform size and performance, reflects the reality -- your time is far more expensive than the chips are, so you rarely, if ever, have a justifiable reason to optimize this heavily.
(I've played with this a few times myself; fortunately I need no justification, as I'm not getting paid to code -- and with good reason.)
And of course for cross-platform code, you can, say, write headers for each platform -- #if _CPU_STM8 ... #elif _CPU_AVR ... etc. You can always keep a copy of straightforward, well-commented, but maybe not terribly optimal, code in the #else clause -- that way your code always works on some platform, but also works particularly well on the enumerated platforms.
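A minimal sketch of that layout -- the _CPU_* macros are hypothetical names you'd define in your build system, and reverse_bits() is just an arbitrary example function:

```c
#include <stdint.h>

#if defined(_CPU_STM8)
  /* STM8-tuned version (e.g. exploiting SWAP and the carry flag)
     would go here. */
#elif defined(_CPU_AVR)
  /* AVR-tuned version would go here. */
#else
/* Straightforward, well-commented fallback: reverses the bits of a byte
   one at a time. Not terribly optimal, but works on any platform. */
static uint8_t reverse_bits(uint8_t x)
{
    uint8_t r = 0;
    for (uint8_t i = 0; i < 8; i++) {
        r = (uint8_t)((r << 1) | (x & 1u));  /* shift result left, pull in LSB */
        x >>= 1;
    }
    return r;
}
#endif
```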
The challenge then, of course, being to keep all the versions in sync when that code inevitably needs updating.
Tim