Hello,
Just trying to tell somehow AVR C to reuse x2 in code below for faster
multiply by 100, while x2 can be reused and instead of 13 shifts left,
we can have only 6 as C code suggest,
but I still get in assembled listing ugly looking compilator assembler code
with 3 loops for 2,5,6 left shifts
It is code for ATTiny85 with optimisation for size enabled,
however can not figure out howto get code with x2 reused and x5,
while x6 is only 1 left shift more
Probably I will try rewrite this part in assembler, but it puzles me fora while why such ugly assembler code we get,
while I've suggested in C language howto make this code more eficcient or maybe not?
Why in hell would you want that, if AVR already has a two cycle 8x8 multiplier
//EDIT: Sorry, I forgot that the "tinyAVR" garbage can't multiply. :-/ It's been a while I toyed with them.
Better to ask question then, why do you need to multiply by a 100? (maybe the task can be optimized in other ways)
If you just use multiplication, good chance the C compiler can figure out everything by itself.
If you just use multiplication, good chance the C compiler can figure out everything by itself.
Nope, __mulsi3 is used when we let C compiler for too much
return (x*100 +x0);
Any other ideas?
BTW: cut down to 16bit probably will also be fine instead of 32bit, while microsecond time is needed at time periods for frequencies higher than 100Hz, but the same - bloody C compiler generate the same ugly looking loops for x5,x6 - only x2looks good while it is replaced by 2 inline shifts
So what are you trying to achieve with this. Stop being secret. Otherwise we can't help much.
More code is needed, than just return (x*100 + x0)
What is the range of X and X0? Why does it need to get multiplied? What does it calculate. There may be better ways to implement that.
Not that good at reading avr assembler. But it would be number in. Shift twice. Copy result to another register. Shift 3 times. Copy to third register. Shift once. Add all 3 registers.
Could this method not condense your assembler. If you need more registers. You always have the 3 gpio registers that dont get touched by the c compiler.
My guess glancing at the output although I don't know the compiler being used is that the optimizer is reducing register dependencies.
Hello,
Just trying to tell somehow AVR C to reuse x2 in code below for faster
multiply by 100, while x2 can be reused and instead of 13 shifts left,
we can have only 6 as C code suggest,
but I still get in assembled listing ugly looking compilator assembler code
with 3 loops for 2,5,6 left shifts
It is code for ATTiny85 with optimisation for size enabled,
however can not figure out howto get code with x2 reused and x5,
while x6 is only 1 left shift more
Firstly, this seems to be an 8-bit micro, and you are trying to work with 32 bit values. So that is bound to lead to complex code since the 32 bit quantities have to be split into 8 bit chunks.
Secondly, have you tried writing code that looks more like this: ?
uint32_t x0 = TCNT0;
uint32_t x = avr_time_counter;
x <<= 1;
x <<= 1;
x0 += x;
x <<= 1;
x <<= 1;
x <<= 1;
x0 += x;
x <<= 1;
x0 += x;
return x0;
You could use a volatile variable to "decouple" the stages, forcing the compiler to save and read the result between each step...
volatile uint32_t t_x;
x2 = x<<2; // First Step
t_x = x2; // Save it
x5 = t_x; // Read it
x5 = x5 << 3; // Next Step
t_x = x5; // Save it
x6 = t_x; // Read it
x6 = x6 << 1; // Last step
Should work I expect, of course, shuffling that stuff back and forth between ram and registers won't be likely to save you a lot.
You told it to optimize for size; I don't see that chaining the shifts saves much space compared to what it's produced. (Changing the last one from six to one shifts would save the last loop, at least.) Reducing register pressure might help, but who knows. That isn't obvious from the short section, at least.
Compilers are smart enough to implement constants in non-obvious ways, like this. They're also smart enough to know if a drop-in library routine is better.
You can try link-time optimization and expensive optimizations (-flto, -fexpensive-optimizations) too, see if that helps any.
If you explicitly want it written as a loop, you should probably start with it that way, and let the compiler unroll if it wants to. Example:
for (i = 0x1e; i; i >>= 2) {
x0 >>= (i & 0x03);
x1 += x0;
}
Note the magic number 0x1e coding for the number of shifts to perform, as packed 2-bit integers. Just the first thing that came to mind, maybe a bit too perverse for readability/maintainability's sake, but it should compile to almost exactly the function you were expecting.
Tim
Impressive trick with the for cycle there.
I could be way off, and don't really know what you are doing, but maybe you can rethink the problem-
assuming TCNT0 will be 0-99 (some timer compare mode), and you are incrementing 'avr_time_counter' in an interrupt,
and it also seems you want to be able to get the 'current time'
this is what it seems you are doing-
tcnt0_compare_match_isr(){ avr_time_counter++; } //inc counter every overflow
uint32_t get_time(){ return avr_time_counter*100 + TCNT0; } //but don't want *100
if so, maybe this would make more sense-
tcnt0_compare_match_isr(){ avr_time_counter += 100; } //inc by 100, not too painful
uint32_t get_time(){
uint32_t t0; uint8_t t1;
do{
t1 = TCNT0;
t0 = avr_time_counter;
}while(t1 < TCNT0); //do again if overflowed
return t0+t1;
}
maybe not any better, but maybe it is (the get_time() could probably be better, just showing that TCNT0 can 'overflow' while dealing with these multi-byte numbers and could end up having a mismatch of tcnt0 and its bigger counter, so simply check if tcnt0 overflowed in the process and do again if so)
I'm sure there are also ways to stay with base 2 numbers (and its advantages) when you think you have to use base 10. That's probably not thought about much anymore. but when dealing with an attiny or a smaller pic, it sometimes takes a little effort and creative thinking to keep the multiply/divide code from showing up.
If people are going to play this game, they need to attach the code produced, preferably with instruction and cycle counts...
This bit has the advantage of being extremely obvious. Both multiplies are optimized to inline code:int main() { // multiply an 8bit number by 100, by first // multiplying by 25 (to yield a 16bit number) // and then by 4 (to yield 24 bits.)
uint16_t x = PORTB;
__uint24 x1 = x*25; // always fits in 16bits.
x1 *= 4; // fits in 24 bits...
PORTB = x1;
PORTB = x1>>8; // output all the bits...
PORTB = x1>>16;
}
Produces: 0: 88 b3 in r24, 0x18 ; 24
__uint24 x1 = x*25;
2: 90 e0 ldi r25, 0x00 ; extend to 16 bits.
4: 9c 01 movw r18, r24
6: 22 0f add r18, r18
8: 33 1f adc r19, r19
a: 22 0f add r18, r18
c: 33 1f adc r19, r19
e: 82 0f add r24, r18
10: 93 1f adc r25, r19
12: 9c 01 movw r18, r24
14: 22 0f add r18, r18
16: 33 1f adc r19, r19
18: 22 0f add r18, r18
1a: 33 1f adc r19, r19
1c: 82 0f add r24, r18
1e: 93 1f adc r25, r19
20: a0 e0 ldi r26, 0x00 ; extend to 24 bits.
x1 *= 4;
22: 88 0f add r24, r24
24: 99 1f adc r25, r25
26: aa 1f adc r26, r26
28: 88 0f add r24, r24
2a: 99 1f adc r25, r25
2c: aa 1f adc r26, r26
PORTB = x1;
2e: 88 bb out 0x18, r24 ; 24
That seems pretty good.Note that __uint24 is pretty new.
Annoyingly, there is no uint24_t. Also, gcc will will not inline a __uint24 * 10.
Wait a minute. Why are we calculating this with a 32bit integer? 16 is plenty for 255*100
Why are we calculating this with a 32bit integer? 16 is plenty for 255*100
He is adding avr_time_counter to tcnt0 to get some kind of system time. So its not an 8bit number*255, its a larger number*100 + 0 to 255 (or to 99 I presume, otherwise it would be a little odd). How large a number is needed, we don't know.
Ultimately, I think his bigger problem is yet unseen- reading TCNT0 while its running and adding to presumably an interrupt incremented 32/24/16bit number. As I have shown, that's easily dealt with, but if not dealt with incorrect times will show up occasionally (and in many cases, very incorrect).
Well, I think I might now understand what he is doing.
Why don't you just increment the 32bit system time variable by 100 by default, instead of 1? (You would not then need to multiply by a 100).
Thanks everyone for brain storm which during lunch get me idea of trying to tell AVR C compiler to do something simpler like
instead of multiplay by 100, just write optimised multiply by 10 and... input result of one to another to get x100= 10x10
x10= x2 * 5, x2= (x<<1)
// Multiply by 100
inline uint32_t avr_x100_uint32(uint32_t x ) {
x= avr_x10_uint32(x ); // x10
x= avr_x10_uint32(x ); // x10
return x;
}
Hopefuly, when wrote this crude simply x10=x2*5 this way
// Multiply by 10
inline uint32_t avr_x10_uint32(uint32_t x ) {
uint32_t x2= (x<<1); // x2
x= x2;
x+= x2;
x+= x2;
x+= x2;
x+= x2;
return x;
}
Now, no complains for AVR C optimizer which done pretty smart code from this crude brute force aproach shown above
Only one loop with 2 iterations for shift by 2 - x4 , but didn't even tried understand whole code generated, but it looks much better than those bloody a few loops 2,5,6
85 .global avr_time_us_get
87 avr_time_us_get:
88 007c 0F93 push r16
89 007e 1F93 push r17
90 /* prologue: function */
91 /* frame size = 0 */
92 /* stack size = 2 */
93 .L__stack_usage = 2
94 0080 42B7 in r20,0x32
95 0082 0091 0000 lds r16,avr_time_counter
96 0086 1091 0000 lds r17,avr_time_counter+1
97 008a 2091 0000 lds r18,avr_time_counter+2
98 008e 3091 0000 lds r19,avr_time_counter+3
99 0092 000F lsl r16
100 0094 111F rol r17
101 0096 221F rol r18
102 0098 331F rol r19
103 009a D901 movw r26,r18
104 009c C801 movw r24,r16
105 009e 52E0 ldi r21,2
106 1:
107 00a0 880F lsl r24
108 00a2 991F rol r25
109 00a4 AA1F rol r26
110 00a6 BB1F rol r27
111 00a8 5A95 dec r21
112 00aa 01F4 brne 1b
113 00ac 800F add r24,r16
114 00ae 911F adc r25,r17
115 00b0 A21F adc r26,r18
116 00b2 B31F adc r27,r19
117 00b4 880F lsl r24
118 00b6 991F rol r25
119 00b8 AA1F rol r26
120 00ba BB1F rol r27
121 00bc 8C01 movw r16,r24
122 00be 9D01 movw r18,r26
123 00c0 000F lsl r16
124 00c2 111F rol r17
125 00c4 221F rol r18
126 00c6 331F rol r19
127 00c8 040F add r16,r20
128 00ca 111D adc r17,__zero_reg__
129 00cc 211D adc r18,__zero_reg__
130 00ce 311D adc r19,__zero_reg__
131 00d0 080F add r16,r24
132 00d2 191F adc r17,r25
133 00d4 2A1F adc r18,r26
134 00d6 3B1F adc r19,r27
135 00d8 080F add r16,r24
136 00da 191F adc r17,r25
137 00dc 2A1F adc r18,r26
138 00de 3B1F adc r19,r27
139 00e0 BC01 movw r22,r24
140 00e2 CD01 movw r24,r26
141 00e4 600F add r22,r16
142 00e6 711F adc r23,r17
143 00e8 821F adc r24,r18
144 00ea 931F adc r25,r19
145 /* epilogue start */
146 00ec 1F91 pop r17
147 00ee 0F91 pop r16
Ultimately, I think his bigger problem is yet unseen- reading TCNT0 while its running and adding to presumably an interrupt incremented 32/24/16bit number. As I have shown, that's easily dealt with, but if not dealt with incorrect times will show up occasionally (and in many cases, very incorrect).
Yep, It is another concern - howto synchronize to get TCNT0 and avr_time_counter not corrupted during ISR compare match - exactly how someone noticed, it iscompare match at 99, since at 8Mhz system clock with 8 timer prescaler we have 1MHz, so time tick is 10kHz, so we need to multiply by 100 avr_time_clock to have estimated time in microseconds used rather for time difference than exact timing, while of course here will be some latency in ISR (avr_time_count++ only btw).
Looking for a way to check somehow while readint TCNT0 and avr_time_counter to maybe wait until timer compare match ISR (avr_time_counter++) completes - there should be somewhere in AVR register flag than timer 0 compare match is executed?
If someone already did such tricky thing on ATTiny85 let us know...[/code]
avr_time_count++ only btw
As I already showed you, avr_time_count += 100 would work just as well and eliminate any need for *100. You do get one extra instruction in the isr because of that. One.
Looking for a way to check
I already showed that.
Reading the counter, that is extended in resolution while there is a possible update from an ISR in between is a known problem.
A possible solution is to first stop interrupts, than read the timer register (= low byte) and software counter, than check the interrupt flag. One needs to correct the software counter (+1) fIf there is an ISR pending and the reading from the timer is low (just past causing the interrupt). If there is an interrupt pending but the counter value is high, the interrupt comes after the current reading and thus no correction needed.
After that interrups can be enabled again.
If x100 takes too much time - why do a divide by 100 with the timer at all. x64 or x128 are easier.
The 8 bit timers also usually allow for quite some prescaler values.
Just a thought, but how accurate do you need this to be?
If you're running the chip using the internal clock it may be off by 5%, and will likely vary between chips. If that's tollerable, then maybe just x96 instead of x100.
If it needs to be more accurate, then an external crystal may be required, in which case maybe use one that divides by 2.
____
I just checked the ATtiny85 datasheet. Section 6.1.6 mentions that it has a compatibility mode for ATtiny15 that recalibrates the internal RC oscillator to 6.4MHz. This is divided by 4 to run the system clock at 1.6MHz. Not sure if that will help, but might be an easier frequency to work with.
Typical
XY problem. OP was asked several times what he actually wants to do, but instead of answering that insists on his x100 that may not even be a good solution to the actual problem in the first place.
So post the full picture if you want good answers, until then everyone's jsut wasting their time.
Typical XY problem. OP was asked several times what he actually wants to do, but instead of answering that insists on his x100 that may not even be a good solution to the actual problem in the first place.
So post the full picture if you want good answers, until then everyone's jsut wasting their time.
I don't agree with that. Besides OP, there are hundreds of other people who read the thread, and there are even more people who will read this thread in the future. They're not interested in the OP problem (whatever it is). They're presumably interested in multiplication by 100. If, instead, they find a solution to the OP problem, it will be mostly useless to them - they will waste their time, and more and more people will continue reading this thread and wasting their time.
That's a weird way to think about it...
OK going about the x100 thing might help future viewers, but AFAIK the main purpose of this thread isn't to give something to a potential future viewer, it's to solve OP's problem they have now. And a x100 may not be a good solution to solve that problem... and we aren't going to even know whether it is until they properly describe their actual goal.
I say answer the OP question as he posted it. Maybe I want to marry a chicken with a porcupine , don't ask why, if you don't have any approaches then don't reply.
tcnt0_compare_match_isr(){ avr_time_counter += 100; } //inc by 100, not too painful
Sorry, didn't notice in a hurry earlier that we have +=100 while was so shocked that AVR C generator was not able generate code with less amount of paintfull loops but 2,5,6 for shifting as mentioned before
Trying to figure out how this loop showed in your code might help ensure we have consistent timer counter and timer counter.
Maybe by using this:
[qote]OCF0A: Output Compare Flag 0 A[/qote]
from
[qote]TIFR – Timer/Counter Interrupt Flag Register[/qote]
while according to Attiny85 datasheet:
[qote]
The OCF0A bit is set when a Compare Match occurs between the Timer/Counter0 and the data in OCR0A – Out-
put Compare Register0. OCF0A is cleared by hardware when executing the corresponding interrupt handling
vector.
[/qote]
so maybe this OCF0A flag could be usefull, but unsure whether is it cleared by hardware at the end of ISR when "reti" is called from ISR, or earlier at the begining of ISR handling
Anyway, at least 8MHz @ 3.3Vcc or 16MHz @ 5Vcc F_CPU usually will be used, so different system clock is not an option.