Author Topic: Fast unsigned integer multiply by x100 on 8bit AVR? (Read 11135 times)

beduino · « **on:** January 14, 2019, 12:26:09 am »

Hello,
Just trying to tell somehow AVR C to reuse x2 in code below for faster
multiply by 100, while x2 can be reused and instead of 13 shifts left,
we can have only 6 as C code suggest,
but I still get in assembled listing ugly looking compilator assembler code
with 3 loops for 2,5,6 left shifts

It is code for ATTiny85 with optimisation for size enabled,
however can not figure out howto get code with x2 reused and x5,
while x6 is only 1 left shift more

Probably I will try rewrite this part in assembler, but it puzles me fora while why such ugly assembler code we get,
while I've suggested in C language howto make this code more eficcient or maybe not?

Yansi · « **Reply #1 on:** January 14, 2019, 12:47:47 am »

Why in hell would you want that, if AVR already has a two cycle 8x8 multiplier

//EDIT: Sorry, I forgot that the "tinyAVR" garbage can't multiply. :-/ It's been a while I toyed with them.

Better to ask question then, why do you need to multiply by a 100? (maybe the task can be optimized in other ways)

NorthGuy · « **Reply #2 on:** January 14, 2019, 12:49:48 am »

If you just use multiplication, good chance the C compiler can figure out everything by itself.

beduino · « **Reply #3 on:** January 14, 2019, 01:06:58 am »

Quote from: NorthGuy on January 14, 2019, 12:49:48 am

If you just use multiplication, good chance the C compiler can figure out everything by itself.

Nope, __mulsi3 is used when we let C compiler for too much

Code: [Select]

return (x*100 +x0);
Any other ideas?
BTW: cut down to 16bit probably will also be fine instead of 32bit, while microsecond time is needed at time periods for frequencies higher than 100Hz, but the same - bloody C compiler generate the same ugly looking loops for x5,x6 - only x2looks good while it is replaced by 2 inline shifts

Yansi · « **Reply #4 on:** January 14, 2019, 01:10:05 am »

So what are you trying to achieve with this. Stop being secret. Otherwise we can't help much.

More code is needed, than just return (x*100 + x0)

What is the range of X and X0? Why does it need to get multiplied? What does it calculate. There may be better ways to implement that.

Rerouter · « **Reply #5 on:** January 14, 2019, 01:31:21 am »

Not that good at reading avr assembler. But it would be number in. Shift twice. Copy result to another register. Shift 3 times. Copy to third register. Shift once. Add all 3 registers.

Could this method not condense your assembler. If you need more registers. You always have the 3 gpio registers that dont get touched by the c compiler.

blacksheeplogic · « **Reply #6 on:** January 14, 2019, 01:31:44 am »

My guess glancing at the output although I don't know the compiler being used is that the optimizer is reducing register dependencies.

IanB · « **Reply #7 on:** January 14, 2019, 01:35:57 am »

Quote from: beduino on January 14, 2019, 12:26:09 am

Hello,
Just trying to tell somehow AVR C to reuse x2 in code below for faster
multiply by 100, while x2 can be reused and instead of 13 shifts left,
we can have only 6 as C code suggest,
but I still get in assembled listing ugly looking compilator assembler code
with 3 loops for 2,5,6 left shifts

It is code for ATTiny85 with optimisation for size enabled,
however can not figure out howto get code with x2 reused and x5,
while x6 is only 1 left shift more

Firstly, this seems to be an 8-bit micro, and you are trying to work with 32 bit values. So that is bound to lead to complex code since the 32 bit quantities have to be split into 8 bit chunks.

Secondly, have you tried writing code that looks more like this: ?

Code: [Select]

  uint32_t x0 = TCNT0;

  uint32_t x = avr_time_counter;

  x <<= 1;
  x <<= 1;
  x0 += x;

  x <<= 1;
  x <<= 1;
  x <<= 1;
  x0 += x;

  x <<= 1;
  x0 += x;

  return x0;

sleemanj · « **Reply #8 on:** January 14, 2019, 02:02:47 am »

You could use a volatile variable to "decouple" the stages, forcing the compiler to save and read the result between each step...

Code: [Select]

  volatile uint32_t t_x;

  x2 = x<<2; // First Step

  t_x = x2; // Save it
  x5 = t_x; // Read it
  x5 = x5 << 3; // Next Step

  t_x = x5; // Save it
  x6 = t_x; // Read it
  x6 = x6 << 1; // Last step

Should work I expect, of course, shuffling that stuff back and forth between ram and registers won't be likely to save you a lot.

T3sl4co1l · « **Reply #9 on:** January 14, 2019, 02:42:49 am »

You told it to optimize for size; I don't see that chaining the shifts saves much space compared to what it's produced. (Changing the last one from six to one shifts would save the last loop, at least.) Reducing register pressure might help, but who knows. That isn't obvious from the short section, at least.

Compilers are smart enough to implement constants in non-obvious ways, like this. They're also smart enough to know if a drop-in library routine is better.

You can try link-time optimization and expensive optimizations (-flto, -fexpensive-optimizations) too, see if that helps any.

If you explicitly want it written as a loop, you should probably start with it that way, and let the compiler unroll if it wants to. Example:

Code: [Select]

for (i = 0x1e; i; i >>= 2) {
    x0 >>= (i & 0x03);
    x1 += x0;
}

Note the magic number 0x1e coding for the number of shifts to perform, as packed 2-bit integers. Just the first thing that came to mind, maybe a bit too perverse for readability/maintainability's sake, but it should compile to almost exactly the function you were expecting.

Tim

Yansi · « **Reply #10 on:** January 14, 2019, 02:58:30 am »

Impressive trick with the for cycle there.

cv007 · « **Reply #11 on:** January 14, 2019, 05:41:24 am »

I could be way off, and don't really know what you are doing, but maybe you can rethink the problem-

assuming TCNT0 will be 0-99 (some timer compare mode), and you are incrementing 'avr_time_counter' in an interrupt,
and it also seems you want to be able to get the 'current time'

this is what it seems you are doing-

tcnt0_compare_match_isr(){ avr_time_counter++; } //inc counter every overflow
uint32_t get_time(){ return avr_time_counter*100 + TCNT0; } //but don't want *100

if so, maybe this would make more sense-

tcnt0_compare_match_isr(){ avr_time_counter += 100; } //inc by 100, not too painful

uint32_t get_time(){
uint32_t t0; uint8_t t1;
do{
t1 = TCNT0;
t0 = avr_time_counter;
}while(t1 < TCNT0); //do again if overflowed
return t0+t1;
}

maybe not any better, but maybe it is (the get_time() could probably be better, just showing that TCNT0 can 'overflow' while dealing with these multi-byte numbers and could end up having a mismatch of tcnt0 and its bigger counter, so simply check if tcnt0 overflowed in the process and do again if so)

I'm sure there are also ways to stay with base 2 numbers (and its advantages) when you think you have to use base 10. That's probably not thought about much anymore. but when dealing with an attiny or a smaller pic, it sometimes takes a little effort and creative thinking to keep the multiply/divide code from showing up.

westfw · « **Reply #12 on:** January 14, 2019, 08:22:05 am »

If people are going to play this game, they need to attach the code produced, preferably with instruction and cycle counts...
This bit has the advantage of being extremely obvious. Both multiplies are optimized to inline code:

Code: [Select]

int main() {    // multiply an 8bit number by 100, by first    //  multiplying by 25 (to yield a 16bit number)    //   and then by 4 (to yield 24 bits.)
    uint16_t x = PORTB;
    __uint24 x1 = x*25;  // always fits in 16bits.
    x1 *= 4;    // fits in 24 bits...
    PORTB = x1;
    PORTB = x1>>8;   // output all the bits...
    PORTB = x1>>16;
}

Produces:

Code: [Select]

   0:   88 b3           in      r24, 0x18       ; 24
    __uint24 x1 = x*25;
   2:   90 e0           ldi     r25, 0x00       ; extend to 16 bits.
   4:   9c 01           movw    r18, r24
   6:   22 0f           add     r18, r18
   8:   33 1f           adc     r19, r19
   a:   22 0f           add     r18, r18
   c:   33 1f           adc     r19, r19
   e:   82 0f           add     r24, r18
  10:   93 1f           adc     r25, r19
  12:   9c 01           movw    r18, r24
  14:   22 0f           add     r18, r18
  16:   33 1f           adc     r19, r19
  18:   22 0f           add     r18, r18
  1a:   33 1f           adc     r19, r19
  1c:   82 0f           add     r24, r18
  1e:   93 1f           adc     r25, r19
  20:   a0 e0           ldi     r26, 0x00       ; extend to 24 bits.
    x1 *= 4;    
  22:   88 0f           add     r24, r24
  24:   99 1f           adc     r25, r25
  26:   aa 1f           adc     r26, r26
  28:   88 0f           add     r24, r24
  2a:   99 1f           adc     r25, r25
  2c:   aa 1f           adc     r26, r26
    PORTB = x1;
  2e:   88 bb           out     0x18, r24       ; 24

That seems pretty good.Note that __uint24 is pretty new.
Annoyingly, there is no uint24_t. Also, gcc will will not inline a __uint24 * 10.

westfw · « **Reply #13 on:** January 14, 2019, 11:24:56 am »

Wait a minute. Why are we calculating this with a 32bit integer? 16 is plenty for 255*100

cv007 · « **Reply #14 on:** January 14, 2019, 04:13:30 pm »

Quote

Why are we calculating this with a 32bit integer? 16 is plenty for 255*100

He is adding avr_time_counter to tcnt0 to get some kind of system time. So its not an 8bit number*255, its a larger number*100 + 0 to 255 (or to 99 I presume, otherwise it would be a little odd). How large a number is needed, we don't know.

Ultimately, I think his bigger problem is yet unseen- reading TCNT0 while its running and adding to presumably an interrupt incremented 32/24/16bit number. As I have shown, that's easily dealt with, but if not dealt with incorrect times will show up occasionally (and in many cases, very incorrect).

Yansi · « **Reply #15 on:** January 14, 2019, 04:15:54 pm »

Well, I think I might now understand what he is doing.

Why don't you just increment the 32bit system time variable by 100 by default, instead of 1? (You would not then need to multiply by a 100).

beduino · « **Reply #16 on:** January 14, 2019, 05:27:50 pm »

Thanks everyone for brain storm which during lunch get me idea of trying to tell AVR C compiler to do something simpler like
instead of multiplay by 100, just write optimised multiply by 10 and... input result of one to another to get x100= 10x10

x10= x2 * 5, x2= (x<<1)

Code: [Select]

// Multiply by 100
inline uint32_t avr_x100_uint32(uint32_t x ) {

	x= avr_x10_uint32(x ); // x10
	x= avr_x10_uint32(x ); // x10

	return x;
}

Hopefuly, when wrote this crude simply x10=x2*5 this way

Code: [Select]

// Multiply by 10
inline uint32_t avr_x10_uint32(uint32_t x ) {

	uint32_t x2= (x<<1); // x2

	x= x2;
	x+= x2;
	x+= x2;
	x+= x2;
	x+= x2;

	return x;
}

Now, no complains for AVR C optimizer which done pretty smart code from this crude brute force aproach shown above

Only one loop with 2 iterations for shift by 2 - x4 , but didn't even tried understand whole code generated, but it looks much better than those bloody a few loops 2,5,6

Code: [Select]

  85               	.global	avr_time_us_get
  87               	avr_time_us_get:
  88 007c 0F93      		push r16
  89 007e 1F93      		push r17
  90               	/* prologue: function */
  91               	/* frame size = 0 */
  92               	/* stack size = 2 */
  93               	.L__stack_usage = 2
  94 0080 42B7      		in r20,0x32
  95 0082 0091 0000 		lds r16,avr_time_counter
  96 0086 1091 0000 		lds r17,avr_time_counter+1
  97 008a 2091 0000 		lds r18,avr_time_counter+2
  98 008e 3091 0000 		lds r19,avr_time_counter+3
  99 0092 000F      		lsl r16
 100 0094 111F      		rol r17
 101 0096 221F      		rol r18
 102 0098 331F      		rol r19
 103 009a D901      		movw r26,r18
 104 009c C801      		movw r24,r16
 105 009e 52E0      		ldi r21,2
 106               		1:
 107 00a0 880F      		lsl r24
 108 00a2 991F      		rol r25
 109 00a4 AA1F      		rol r26
 110 00a6 BB1F      		rol r27
 111 00a8 5A95      		dec r21
 112 00aa 01F4      		brne 1b
 113 00ac 800F      		add r24,r16
 114 00ae 911F      		adc r25,r17
 115 00b0 A21F      		adc r26,r18
 116 00b2 B31F      		adc r27,r19
 117 00b4 880F      		lsl r24
 118 00b6 991F      		rol r25
 119 00b8 AA1F      		rol r26
 120 00ba BB1F      		rol r27
 121 00bc 8C01      		movw r16,r24
 122 00be 9D01      		movw r18,r26
 123 00c0 000F      		lsl r16
 124 00c2 111F      		rol r17
 125 00c4 221F      		rol r18
 126 00c6 331F      		rol r19
 127 00c8 040F      		add r16,r20
 128 00ca 111D      		adc r17,__zero_reg__
 129 00cc 211D      		adc r18,__zero_reg__
 130 00ce 311D      		adc r19,__zero_reg__
 131 00d0 080F      		add r16,r24
 132 00d2 191F      		adc r17,r25
 133 00d4 2A1F      		adc r18,r26
 134 00d6 3B1F      		adc r19,r27
 135 00d8 080F      		add r16,r24
 136 00da 191F      		adc r17,r25
 137 00dc 2A1F      		adc r18,r26
 138 00de 3B1F      		adc r19,r27
 139 00e0 BC01      		movw r22,r24
 140 00e2 CD01      		movw r24,r26
 141 00e4 600F      		add r22,r16
 142 00e6 711F      		adc r23,r17
 143 00e8 821F      		adc r24,r18
 144 00ea 931F      		adc r25,r19
 145               	/* epilogue start */
 146 00ec 1F91      		pop r17
 147 00ee 0F91      		pop r16

Quote from: cv007 on January 14, 2019, 04:13:30 pm

Ultimately, I think his bigger problem is yet unseen- reading TCNT0 while its running and adding to presumably an interrupt incremented 32/24/16bit number. As I have shown, that's easily dealt with, but if not dealt with incorrect times will show up occasionally (and in many cases, very incorrect).

Yep, It is another concern - howto synchronize to get TCNT0 and avr_time_counter not corrupted during ISR compare match - exactly how someone noticed, it iscompare match at 99, since at 8Mhz system clock with 8 timer prescaler we have 1MHz, so time tick is 10kHz, so we need to multiply by 100 avr_time_clock to have estimated time in microseconds used rather for time difference than exact timing, while of course here will be some latency in ISR (avr_time_count++ only btw).

Looking for a way to check somehow while readint TCNT0 and avr_time_counter to maybe wait until timer compare match ISR (avr_time_counter++) completes - there should be somewhere in AVR register flag than timer 0 compare match is executed?

If someone already did such tricky thing on ATTiny85 let us know...[/code]

cv007 · « **Reply #17 on:** January 14, 2019, 05:41:38 pm »

Quote

avr_time_count++ only btw

As I already showed you, avr_time_count += 100 would work just as well and eliminate any need for *100. You do get one extra instruction in the isr because of that. One.

Quote

Looking for a way to check

I already showed that.

Kleinstein · « **Reply #18 on:** January 14, 2019, 05:47:52 pm »

Reading the counter, that is extended in resolution while there is a possible update from an ISR in between is a known problem.
A possible solution is to first stop interrupts, than read the timer register (= low byte) and software counter, than check the interrupt flag. One needs to correct the software counter (+1) fIf there is an ISR pending and the reading from the timer is low (just past causing the interrupt). If there is an interrupt pending but the counter value is high, the interrupt comes after the current reading and thus no correction needed.
After that interrups can be enabled again.

If x100 takes too much time - why do a divide by 100 with the timer at all. x64 or x128 are easier.
The 8 bit timers also usually allow for quite some prescaler values.

kosine · « **Reply #19 on:** January 14, 2019, 05:49:59 pm »

Just a thought, but how accurate do you need this to be?

If you're running the chip using the internal clock it may be off by 5%, and will likely vary between chips. If that's tollerable, then maybe just x96 instead of x100.

If it needs to be more accurate, then an external crystal may be required, in which case maybe use one that divides by 2.
____

I just checked the ATtiny85 datasheet. Section 6.1.6 mentions that it has a compatibility mode for ATtiny15 that recalibrates the internal RC oscillator to 6.4MHz. This is divided by 4 to run the system clock at 1.6MHz. Not sure if that will help, but might be an easier frequency to work with.

Kilrah · « **Reply #20 on:** January 14, 2019, 05:58:07 pm »

Typical XY problem. OP was asked several times what he actually wants to do, but instead of answering that insists on his x100 that may not even be a good solution to the actual problem in the first place.

So post the full picture if you want good answers, until then everyone's jsut wasting their time.

NorthGuy · « **Reply #21 on:** January 14, 2019, 06:09:56 pm »

Quote from: Kilrah on January 14, 2019, 05:58:07 pm

Typical XY problem. OP was asked several times what he actually wants to do, but instead of answering that insists on his x100 that may not even be a good solution to the actual problem in the first place.

So post the full picture if you want good answers, until then everyone's jsut wasting their time.

I don't agree with that. Besides OP, there are hundreds of other people who read the thread, and there are even more people who will read this thread in the future. They're not interested in the OP problem (whatever it is). They're presumably interested in multiplication by 100. If, instead, they find a solution to the OP problem, it will be mostly useless to them - they will waste their time, and more and more people will continue reading this thread and wasting their time.

Kilrah · « **Reply #22 on:** January 14, 2019, 06:27:07 pm »

That's a weird way to think about it...

OK going about the x100 thing might help future viewers, but AFAIK the main purpose of this thread isn't to give something to a potential future viewer, it's to solve OP's problem they have now. And a x100 may not be a good solution to solve that problem... and we aren't going to even know whether it is until they properly describe their actual goal.

snarkysparky · « **Reply #23 on:** January 14, 2019, 06:45:39 pm »

I say answer the OP question as he posted it. Maybe I want to marry a chicken with a porcupine , don't ask why, if you don't have any approaches then don't reply.

beduino · « **Reply #24 on:** January 14, 2019, 07:57:56 pm »

Quote from: cv007 on January 14, 2019, 05:41:24 am

tcnt0_compare_match_isr(){ avr_time_counter += 100; } //inc by 100, not too painful

Sorry, didn't notice in a hurry earlier that we have +=100 while was so shocked that AVR C generator was not able generate code with less amount of paintfull loops but 2,5,6 for shifting as mentioned before

Trying to figure out how this loop showed in your code might help ensure we have consistent timer counter and timer counter.
Maybe by using this:
[qote]OCF0A: Output Compare Flag 0 A[/qote]
from
[qote]TIFR – Timer/Counter Interrupt Flag Register[/qote]
while according to Attiny85 datasheet:
[qote]
The OCF0A bit is set when a Compare Match occurs between the Timer/Counter0 and the data in OCR0A – Out-
put Compare Register0. OCF0A is cleared by hardware when executing the corresponding interrupt handling
vector.
[/qote]
so maybe this OCF0A flag could be usefull, but unsure whether is it cleared by hardware at the end of ISR when "reti" is called from ISR, or earlier at the begining of ISR handling $:-\$

Anyway, at least 8MHz @ 3.3Vcc or 16MHz @ 5Vcc F_CPU usually will be used, so different system clock is not an option.


EEVblog Main Site	EEVblog on Youtube	EEVblog on Twitter	EEVblog on Facebook	EEVblog on Odysee

Author Topic: Fast unsigned integer multiply by x100 on 8bit AVR? (Read 11135 times)

Share me