Fast unsigned integer multiply by x100 on 8bit AVR?

Fast unsigned integer multiply by x100 on 8bit AVR?
Posted by beduino on 14 Jan, 2019 00:26
Hello,
Just trying to tell somehow AVR C to reuse x2 in code below for faster
multiply by 100, while x2 can be reused and instead of 13 shifts left,
we can have only 6 as C code suggest,
but I still get in assembled listing ugly looking compilator assembler code
with 3 loops for 2,5,6 left shifts

It is code for ATTiny85 with optimisation for size enabled,
however can not figure out howto get code with x2 reused and x5,
while x6 is only 1 left shift more

Probably I will try rewrite this part in assembler, but it puzles me fora while why such ugly assembler code we get,
while I've suggested in C language howto make this code more eficcient or maybe not?

#1 Reply
Posted by Yansi on 14 Jan, 2019 00:47
Why in hell would you want that, if AVR already has a two cycle 8x8 multiplier

//EDIT: Sorry, I forgot that the "tinyAVR" garbage can't multiply. :-/ It's been a while I toyed with them.

Better to ask question then, why do you need to multiply by a 100? (maybe the task can be optimized in other ways)

#2 Reply
Posted by NorthGuy on 14 Jan, 2019 00:49
If you just use multiplication, good chance the C compiler can figure out everything by itself.

#3 Reply
Posted by beduino on 14 Jan, 2019 01:06
Quote from: NorthGuy on 14 Jan, 2019 00:49
If you just use multiplication, good chance the C compiler can figure out everything by itself.
Nope, __mulsi3 is used when we let C compiler for too much

Code: [Select]
return (x*100 +x0);
Any other ideas?
BTW: cut down to 16bit probably will also be fine instead of 32bit, while microsecond time is needed at time periods for frequencies higher than 100Hz, but the same - bloody C compiler generate the same ugly looking loops for x5,x6 - only x2looks good while it is replaced by 2 inline shifts

#4 Reply
Posted by Yansi on 14 Jan, 2019 01:10
So what are you trying to achieve with this. Stop being secret. Otherwise we can't help much.

More code is needed, than just return (x*100 + x0)

What is the range of X and X0? Why does it need to get multiplied? What does it calculate. There may be better ways to implement that.

#5 Reply
Posted by Rerouter on 14 Jan, 2019 01:31
Not that good at reading avr assembler. But it would be number in. Shift twice. Copy result to another register. Shift 3 times. Copy to third register. Shift once. Add all 3 registers.

Could this method not condense your assembler. If you need more registers. You always have the 3 gpio registers that dont get touched by the c compiler.

#6 Reply
Posted by blacksheeplogic on 14 Jan, 2019 01:31
My guess glancing at the output although I don't know the compiler being used is that the optimizer is reducing register dependencies.

#7 Reply
Posted by IanB on 14 Jan, 2019 01:35
Quote from: beduino on 14 Jan, 2019 00:26
Hello,
Just trying to tell somehow AVR C to reuse x2 in code below for faster
multiply by 100, while x2 can be reused and instead of 13 shifts left,
we can have only 6 as C code suggest,
but I still get in assembled listing ugly looking compilator assembler code
with 3 loops for 2,5,6 left shifts

It is code for ATTiny85 with optimisation for size enabled,
however can not figure out howto get code with x2 reused and x5,
while x6 is only 1 left shift more

Firstly, this seems to be an 8-bit micro, and you are trying to work with 32 bit values. So that is bound to lead to complex code since the 32 bit quantities have to be split into 8 bit chunks.

Secondly, have you tried writing code that looks more like this: ?

Code: [Select]
uint32_t x0 = TCNT0; uint32_t x = avr_time_counter; x <<= 1; x <<= 1; x0 += x; x <<= 1; x <<= 1; x <<= 1; x0 += x; x <<= 1; x0 += x; return x0;

#8 Reply
Posted by sleemanj on 14 Jan, 2019 02:02
You could use a volatile variable to "decouple" the stages, forcing the compiler to save and read the result between each step...

Code: [Select]
volatile uint32_t t_x; x2 = x<<2; // First Step t_x = x2; // Save it x5 = t_x; // Read it x5 = x5 << 3; // Next Step t_x = x5; // Save it x6 = t_x; // Read it x6 = x6 << 1; // Last step
Should work I expect, of course, shuffling that stuff back and forth between ram and registers won't be likely to save you a lot.

#9 Reply
Posted by T3sl4co1l on 14 Jan, 2019 02:42
You told it to optimize for size; I don't see that chaining the shifts saves much space compared to what it's produced. (Changing the last one from six to one shifts would save the last loop, at least.) Reducing register pressure might help, but who knows. That isn't obvious from the short section, at least.

Compilers are smart enough to implement constants in non-obvious ways, like this. They're also smart enough to know if a drop-in library routine is better.

You can try link-time optimization and expensive optimizations (-flto, -fexpensive-optimizations) too, see if that helps any.

If you explicitly want it written as a loop, you should probably start with it that way, and let the compiler unroll if it wants to. Example:
Code: [Select]
for (i = 0x1e; i; i >>= 2) { x0 >>= (i & 0x03); x1 += x0; }Note the magic number 0x1e coding for the number of shifts to perform, as packed 2-bit integers. Just the first thing that came to mind, maybe a bit too perverse for readability/maintainability's sake, but it should compile to almost exactly the function you were expecting.

Tim

#10 Reply
Posted by Yansi on 14 Jan, 2019 02:58
Impressive trick with the for cycle there.

#11 Reply
Posted by cv007 on 14 Jan, 2019 05:41
I could be way off, and don't really know what you are doing, but maybe you can rethink the problem-

assuming TCNT0 will be 0-99 (some timer compare mode), and you are incrementing 'avr_time_counter' in an interrupt,
and it also seems you want to be able to get the 'current time'

this is what it seems you are doing-

tcnt0_compare_match_isr(){ avr_time_counter++; } //inc counter every overflow
uint32_t get_time(){ return avr_time_counter*100 + TCNT0; } //but don't want *100

if so, maybe this would make more sense-

tcnt0_compare_match_isr(){ avr_time_counter += 100; } //inc by 100, not too painful

uint32_t get_time(){
uint32_t t0; uint8_t t1;
do{
t1 = TCNT0;
t0 = avr_time_counter;
}while(t1 < TCNT0); //do again if overflowed
return t0+t1;
}

maybe not any better, but maybe it is (the get_time() could probably be better, just showing that TCNT0 can 'overflow' while dealing with these multi-byte numbers and could end up having a mismatch of tcnt0 and its bigger counter, so simply check if tcnt0 overflowed in the process and do again if so)

I'm sure there are also ways to stay with base 2 numbers (and its advantages) when you think you have to use base 10. That's probably not thought about much anymore. but when dealing with an attiny or a smaller pic, it sometimes takes a little effort and creative thinking to keep the multiply/divide code from showing up.

#12 Reply
Posted by westfw on 14 Jan, 2019 08:22
If people are going to play this game, they need to attach the code produced, preferably with instruction and cycle counts...
This bit has the advantage of being extremely obvious. Both multiplies are optimized to inline code:
Code: [Select]
int main() { // multiply an 8bit number by 100, by first // multiplying by 25 (to yield a 16bit number) // and then by 4 (to yield 24 bits.) uint16_t x = PORTB; __uint24 x1 = x*25; // always fits in 16bits. x1 *= 4; // fits in 24 bits... PORTB = x1; PORTB = x1>>8; // output all the bits... PORTB = x1>>16; }Produces:
Code: [Select]
0: 88 b3 in r24, 0x18 ; 24 __uint24 x1 = x*25; 2: 90 e0 ldi r25, 0x00 ; extend to 16 bits. 4: 9c 01 movw r18, r24 6: 22 0f add r18, r18 8: 33 1f adc r19, r19 a: 22 0f add r18, r18 c: 33 1f adc r19, r19 e: 82 0f add r24, r18 10: 93 1f adc r25, r19 12: 9c 01 movw r18, r24 14: 22 0f add r18, r18 16: 33 1f adc r19, r19 18: 22 0f add r18, r18 1a: 33 1f adc r19, r19 1c: 82 0f add r24, r18 1e: 93 1f adc r25, r19 20: a0 e0 ldi r26, 0x00 ; extend to 24 bits. x1 *= 4; 22: 88 0f add r24, r24 24: 99 1f adc r25, r25 26: aa 1f adc r26, r26 28: 88 0f add r24, r24 2a: 99 1f adc r25, r25 2c: aa 1f adc r26, r26 PORTB = x1; 2e: 88 bb out 0x18, r24 ; 24That seems pretty good.Note that __uint24 is pretty new.
Annoyingly, there is no uint24_t. Also, gcc will will not inline a __uint24 * 10.

#13 Reply
Posted by westfw on 14 Jan, 2019 11:24
Wait a minute. Why are we calculating this with a 32bit integer? 16 is plenty for 255*100

#14 Reply
Posted by cv007 on 14 Jan, 2019 16:13
Quote
Why are we calculating this with a 32bit integer? 16 is plenty for 255*100
He is adding avr_time_counter to tcnt0 to get some kind of system time. So its not an 8bit number*255, its a larger number*100 + 0 to 255 (or to 99 I presume, otherwise it would be a little odd). How large a number is needed, we don't know.

Ultimately, I think his bigger problem is yet unseen- reading TCNT0 while its running and adding to presumably an interrupt incremented 32/24/16bit number. As I have shown, that's easily dealt with, but if not dealt with incorrect times will show up occasionally (and in many cases, very incorrect).

#15 Reply
Posted by Yansi on 14 Jan, 2019 16:15
Well, I think I might now understand what he is doing.

Why don't you just increment the 32bit system time variable by 100 by default, instead of 1? (You would not then need to multiply by a 100).

#16 Reply
Posted by beduino on 14 Jan, 2019 17:27
Thanks everyone for brain storm which during lunch get me idea of trying to tell AVR C compiler to do something simpler like
instead of multiplay by 100, just write optimised multiply by 10 and... input result of one to another to get x100= 10x10
x10= x2 * 5, x2= (x<<1)
Code: [Select]
// Multiply by 100 inline uint32_t avr_x100_uint32(uint32_t x ) { x= avr_x10_uint32(x ); // x10 x= avr_x10_uint32(x ); // x10 return x; }
Hopefuly, when wrote this crude simply x10=x2*5 this way
Code: [Select]
// Multiply by 10 inline uint32_t avr_x10_uint32(uint32_t x ) { uint32_t x2= (x<<1); // x2 x= x2; x+= x2; x+= x2; x+= x2; x+= x2; return x; }
Now, no complains for AVR C optimizer which done pretty smart code from this crude brute force aproach shown above Only one loop with 2 iterations for shift by 2 - x4 , but didn't even tried understand whole code generated, but it looks much better than those bloody a few loops 2,5,6
Code: [Select]
85 .global avr_time_us_get 87 avr_time_us_get: 88 007c 0F93 push r16 89 007e 1F93 push r17 90 /* prologue: function */ 91 /* frame size = 0 */ 92 /* stack size = 2 */ 93 .L__stack_usage = 2 94 0080 42B7 in r20,0x32 95 0082 0091 0000 lds r16,avr_time_counter 96 0086 1091 0000 lds r17,avr_time_counter+1 97 008a 2091 0000 lds r18,avr_time_counter+2 98 008e 3091 0000 lds r19,avr_time_counter+3 99 0092 000F lsl r16 100 0094 111F rol r17 101 0096 221F rol r18 102 0098 331F rol r19 103 009a D901 movw r26,r18 104 009c C801 movw r24,r16 105 009e 52E0 ldi r21,2 106 1: 107 00a0 880F lsl r24 108 00a2 991F rol r25 109 00a4 AA1F rol r26 110 00a6 BB1F rol r27 111 00a8 5A95 dec r21 112 00aa 01F4 brne 1b 113 00ac 800F add r24,r16 114 00ae 911F adc r25,r17 115 00b0 A21F adc r26,r18 116 00b2 B31F adc r27,r19 117 00b4 880F lsl r24 118 00b6 991F rol r25 119 00b8 AA1F rol r26 120 00ba BB1F rol r27 121 00bc 8C01 movw r16,r24 122 00be 9D01 movw r18,r26 123 00c0 000F lsl r16 124 00c2 111F rol r17 125 00c4 221F rol r18 126 00c6 331F rol r19 127 00c8 040F add r16,r20 128 00ca 111D adc r17,__zero_reg__ 129 00cc 211D adc r18,__zero_reg__ 130 00ce 311D adc r19,__zero_reg__ 131 00d0 080F add r16,r24 132 00d2 191F adc r17,r25 133 00d4 2A1F adc r18,r26 134 00d6 3B1F adc r19,r27 135 00d8 080F add r16,r24 136 00da 191F adc r17,r25 137 00dc 2A1F adc r18,r26 138 00de 3B1F adc r19,r27 139 00e0 BC01 movw r22,r24 140 00e2 CD01 movw r24,r26 141 00e4 600F add r22,r16 142 00e6 711F adc r23,r17 143 00e8 821F adc r24,r18 144 00ea 931F adc r25,r19 145 /* epilogue start */ 146 00ec 1F91 pop r17 147 00ee 0F91 pop r16
Quote from: cv007 on 14 Jan, 2019 16:13
Ultimately, I think his bigger problem is yet unseen- reading TCNT0 while its running and adding to presumably an interrupt incremented 32/24/16bit number. As I have shown, that's easily dealt with, but if not dealt with incorrect times will show up occasionally (and in many cases, very incorrect).
Yep, It is another concern - howto synchronize to get TCNT0 and avr_time_counter not corrupted during ISR compare match - exactly how someone noticed, it iscompare match at 99, since at 8Mhz system clock with 8 timer prescaler we have 1MHz, so time tick is 10kHz, so we need to multiply by 100 avr_time_clock to have estimated time in microseconds used rather for time difference than exact timing, while of course here will be some latency in ISR (avr_time_count++ only btw).

Looking for a way to check somehow while readint TCNT0 and avr_time_counter to maybe wait until timer compare match ISR (avr_time_counter++) completes - there should be somewhere in AVR register flag than timer 0 compare match is executed?

If someone already did such tricky thing on ATTiny85 let us know...[/code]

#17 Reply
Posted by cv007 on 14 Jan, 2019 17:41
Quote
avr_time_count++ only btw
As I already showed you, avr_time_count += 100 would work just as well and eliminate any need for *100. You do get one extra instruction in the isr because of that. One.

Quote
Looking for a way to check
I already showed that.

#18 Reply
Posted by Kleinstein on 14 Jan, 2019 17:47
Reading the counter, that is extended in resolution while there is a possible update from an ISR in between is a known problem.
A possible solution is to first stop interrupts, than read the timer register (= low byte) and software counter, than check the interrupt flag. One needs to correct the software counter (+1) fIf there is an ISR pending and the reading from the timer is low (just past causing the interrupt). If there is an interrupt pending but the counter value is high, the interrupt comes after the current reading and thus no correction needed.
After that interrups can be enabled again.

If x100 takes too much time - why do a divide by 100 with the timer at all. x64 or x128 are easier.
The 8 bit timers also usually allow for quite some prescaler values.

#19 Reply
Posted by kosine on 14 Jan, 2019 17:49
Just a thought, but how accurate do you need this to be?

If you're running the chip using the internal clock it may be off by 5%, and will likely vary between chips. If that's tollerable, then maybe just x96 instead of x100.

If it needs to be more accurate, then an external crystal may be required, in which case maybe use one that divides by 2.
____

I just checked the ATtiny85 datasheet. Section 6.1.6 mentions that it has a compatibility mode for ATtiny15 that recalibrates the internal RC oscillator to 6.4MHz. This is divided by 4 to run the system clock at 1.6MHz. Not sure if that will help, but might be an easier frequency to work with.

#20 Reply
Posted by Kilrah on 14 Jan, 2019 17:58
Typical XY problem. OP was asked several times what he actually wants to do, but instead of answering that insists on his x100 that may not even be a good solution to the actual problem in the first place.

So post the full picture if you want good answers, until then everyone's jsut wasting their time.

#21 Reply
Posted by NorthGuy on 14 Jan, 2019 18:09
Quote from: Kilrah on 14 Jan, 2019 17:58
Typical XY problem. OP was asked several times what he actually wants to do, but instead of answering that insists on his x100 that may not even be a good solution to the actual problem in the first place.

So post the full picture if you want good answers, until then everyone's jsut wasting their time.

I don't agree with that. Besides OP, there are hundreds of other people who read the thread, and there are even more people who will read this thread in the future. They're not interested in the OP problem (whatever it is). They're presumably interested in multiplication by 100. If, instead, they find a solution to the OP problem, it will be mostly useless to them - they will waste their time, and more and more people will continue reading this thread and wasting their time.

#22 Reply
Posted by Kilrah on 14 Jan, 2019 18:27
That's a weird way to think about it...

OK going about the x100 thing might help future viewers, but AFAIK the main purpose of this thread isn't to give something to a potential future viewer, it's to solve OP's problem they have now. And a x100 may not be a good solution to solve that problem... and we aren't going to even know whether it is until they properly describe their actual goal.

#23 Reply
Posted by snarkysparky on 14 Jan, 2019 18:45
I say answer the OP question as he posted it. Maybe I want to marry a chicken with a porcupine , don't ask why, if you don't have any approaches then don't reply.

#24 Reply
Posted by beduino on 14 Jan, 2019 19:57
Quote from: cv007 on 14 Jan, 2019 05:41
tcnt0_compare_match_isr(){ avr_time_counter += 100; } //inc by 100, not too painful
Sorry, didn't notice in a hurry earlier that we have +=100 while was so shocked that AVR C generator was not able generate code with less amount of paintfull loops but 2,5,6 for shifting as mentioned before

Trying to figure out how this loop showed in your code might help ensure we have consistent timer counter and timer counter.
Maybe by using this:
[qote]OCF0A: Output Compare Flag 0 A[/qote]
from
[qote]TIFR – Timer/Counter Interrupt Flag Register[/qote]
while according to Attiny85 datasheet:
[qote]
The OCF0A bit is set when a Compare Match occurs between the Timer/Counter0 and the data in OCR0A – Out-
put Compare Register0. OCF0A is cleared by hardware when executing the corresponding interrupt handling
vector.
[/qote]
so maybe this OCF0A flag could be usefull, but unsure whether is it cleared by hardware at the end of ISR when "reti" is called from ISR, or earlier at the begining of ISR handling $:-\$

Anyway, at least 8MHz @ 3.3Vcc or 16MHz @ 5Vcc F_CPU usually will be used, so different system clock is not an option.