ASM programming is FASCINATING!
tggzzz:
--- Quote from: greenpossum on August 01, 2020, 12:21:48 am ---That's just the way EEVblog is, every discussion will eventually turn into an exposition by experts of angles far outside the scope of the original query or observation. Often educational but sometimes overwhelming.
--- End quote ---
It is not just eevblog. I've seen the phenomenon for 35 years; it started on usenet and continues there and elsewhere.
MK14:
--- Quote from: eti on August 01, 2020, 12:12:06 am ---
--- Quote from: MK14 on July 30, 2020, 07:43:39 pm ---
--- Quote from: ebastler on July 30, 2020, 06:57:24 pm ---We seem to have lost the OP a while ago. More specifically, we lost him right after his original post. Maybe the fascination didn't last. ::)
--- End quote ---
I noticed this as well, a long while ago. But decided, the thread is of general interest, and seems to be a fun and educational experience. I'm pleased with the information in it, and have enjoyed participating and reading it.
We were at least a bit off-topic as well. (Opinion dependent!).
So, if they had stuck around, it may have taken a completely different course.
I suspect, in a number of cases, you get the following:
[I'm really revving to go, and want to do assembly language programming, it must/will be great fun]
..
>>>Tries their first assembly program, 8 lines long, takes 38 minutes, to get rid of 5 assembly errors, then runs it and it crashes for no apparent reason.
..
[Gives up and moves on to something else]
--- End quote ---
I NEVER give up! I may "temporarily set aside" an interest, no matter how fleeting my current interest in it is, and allow my life to continue, letting whatever I've learnt up to that point subconsciously absorb into my mind and digest for however many weeks, months or years; but my interest is still there. I may not be au fait with this subject, but I have returned, re-visited my ASM interest, failed a few times to grasp it... but it's truly absorbing into my mind now :)
--- End quote ---
Now the dust has settled a bit.
Yes, assembler can be a rather hard and daunting subject to master.
It can look so temptingly easy, with each assembly instruction only doing a very simple and quick operation.
E.g. INC A .. Increment the Accumulator by 1, so 57 becomes 58 and so forth.
NOP .. Do absolutely nothing this machine cycle(s).
CLC .. Clear Carry flag
But the real difficulty is realising how, using these very simple building blocks, you can create a powerful, working program.
Speaking from experience, I'd recommend doing 3 things initially:
(1) Practice
(2) Practice
(3) Practice
Do all 3 things, don't skimp and only do one or two of them.
There probably are some good books on how to go from beginner to assembly-language programmer, if you have the time, patience and desire to read them.
EDIT: Thread Cleanup - Removed later posts - To hopefully get back on-topic.
RJSV:
More thoughts, on 6502 vs Z-80:
Very old-school, but my guess is PIC programmers might benefit from this thread. Comparing how each CPU implements an indexed list, there is an interesting reversal. With Z-80 ASM code, you use a variable base value in the IX register (16 bits, covering the full 64K address space) plus a fixed 8-bit displacement encoded in the instruction: visualize (IX+14), for example. With a 6502 it is reversed: there is a fixed base address encoded in the instruction, plus a variable offset held in a register.
So in that case, visualize 20,000 plus the contents of the Y register. That's the reverse arrangement: the base can sit anywhere in 64K of RAM, but the indexed block is 256 bytes. It sounds more natural and practical. Hopefully this presentation helps some ASM learning along; I always used that difference as a memory tool, first recalling that the two processors had reversed schemes, and from that I could recall the exact structure and write code appropriately.
However, a project partner disagreed with convention and wrote SELF-MODIFYING code. In horror, I nastily said, "You must never, ever write self-modifying code; it won't work from ROM". He was pulling the 2-byte base value out of the 6502 object code and incrementing it, to access a larger 8K block (C-64 bit-mapped graphics). But it worked, so...
I wished for better people skills, as there was arguing around that. But now, maybe some legacy projects have to deal with the issue of non-writable ROM in year-2020 legacy re-issues. (You can always copy to a RAM buffer for correct function in any hand-held legacy gamer.) I think, ultimately, the other guy was right.
By the way, a 6502 can do a 16-bit alteration in a few lines: change the low byte, then carry any overflow or underflow into the high byte, though it takes several lines of code just to retrieve and store. Probably the guy was thinking Z-80 while writing 6502 code! But it worked, and I felt like a snob for thinking his code inferior or dodgy. Humans are complex, imperfect.
While reading such code, some learning takes place: how to write code for either the 6502 or Z-80, by playing each system's concepts against the other. I think one justification was: we can re-write it for the hypothetical case when it becomes real. I have to respect that.
Oh, and I would use caution with an Altair simulator regarding exact real-time clock states; it may not strictly follow the 'legacy' timing. If I recall, the Z-80 could run on a 4 MHz clock, using at least 4 clocks per instruction. The 6502 often used a 1 MHz clock, with as few as 2 clocks per instruction.
Injecting some personality into the process is a good memorizing tool, that's why the 6502 and Z-80 can play off each other...
T3sl4co1l:
I like to approach optimization from the compiler's side, and go from there. Work in stages:
1. Any high-level optimizations (to the algorithm) have already been done. (Start there first!!)
2. You can write your ASM function to drop in, so the compiler knows how to use it and you can go back to #1 if a new strategy comes up. (You may end up discarding that function in the process, wasting effort -- hence the emphasis on the front end.)
3. While you still need to comply with the compiler's contract for call/return conventions (the ABI), you have complete freedom to (ab)use the hardware within the function. If you can make better use of it, there you go, that's clock cycles saved!
4. If it looks like #1 is basically done, you can go farther and expand into other functions, optimizing neighboring ones, inlining them in a longer or higher-level function, etc. (The compiler does some of this already, but it won't discover many optimizations it couldn't have already made in the base functions.)
Case study: a simple DSP (digital signal processing) operation. The fundamental operation of DSP is the multiply-accumulate (MAC), A = A + B * C, used in a loop to convolve two arrays. (Convolution is a big word for a simple process: multiply the elements of the two arrays pairwise, then sum up the results. If you know linear algebra, it's the dot product of two vectors. This isn't a full definition of convolution, more a functional example.)
For example, if we convolve a sequence of digital samples (the latest sample, and the N previous samples), with a sequence of samples representing the impulse response of a filter (also of length N), the result is the next sample in the series of that signal as if it were passed through that filter. This is a FIR (finite impulse response) filter: an impulse input (one sample nonzero, the rest zero) gets multiplied by each element of the filter in turn, all of which can be nonzero, up to length N where the magnitude implicitly drops to zero, because, well, that's the size of the array.
FIR filters are great because you can control the impulse response, well, exactly; that's all the filter array is! And by extension the frequency response, and it's very easy to get certain minimum-phase or equal-delay properties from them. The downside is, if your desired filter has a very long time constant (e.g., a low frequency, narrow bandpass/stop, or long time delay or reverberation), you need as long of an array. And as much sample history. And you need to compute the convolution every sample.
So, DSP machines need to crank a lot of data, and tend to be very powerful indeed, if relatively specialized for this one task. (Example: one might use a 200MHz core capable of delivering one MAC per cycle, so could process around 4500 words per 44.1kHz audio sample; a FIR filter could have a minimum cutoff on the order of 20Hz.)
If we use a history of input and output samples, we can employ feedback to useful effect. We have to be careful, obviously we don't want that accumulating to gibberish, with exponential growth; it has to be stable. And being a feedback process, we expect it to have an exponential (well, discrete time, so technically, geometric) decay; an infinite impulse response (IIR). Whereas the FIR filter coefficients can take any value, the IIR filter coefficients have to be computed carefully. Fortunately, that analysis has been done, so we can design filters using tools, without having to get too in-depth with the underlying theory. (Which revolves around the Z transform. It happens to map to the Fourier transform -- everything the EE knows about analog signals, already applies to DSP, if in a somewhat odd way.)
Anyway, that explains the basic operation. How do we compute it?
Here's a basic MAC operation, taking as parameters a 32-bit "accumulator", a 16-bit sample, and an 8-bit coefficient. (This wouldn't be nearly enough bits for a proper filter, but works well enough for a simple volume control -- we might convolve arrays of live samples with gain coefficients, to create a mixing board. We can fill the arrays however we like, after all; the convolution doesn't have to be across time, it just is when we are filtering a signal.) The format is integer, signed, presumably in fixed point. (A typical case would be both sample and coefficient in fractional format (0.16 and 0.8), so that the result is 8.24, and the top 8 bits are simply discarded, after testing for overflow of course.)
--- Code: ---int32_t mac32r16p8(int32_t acc, int16_t r, const int8_t* p) {
return acc + (int32_t)r * *p;
}
--- End code ---
In avr-gcc 4.5.4, this compiles to: (comments added inline to help those unfamiliar with the instruction set)
--- Code: ---mac32r16p8:
push r14
push r15
push r16
push r17 ; save no-clobber registers
mov r14,r22
mov r15,r23
mov r16,r24
mov r17,r25 ; r14:r15:r16:r17 = acc
mov r30,r18
mov r31,r19 ; Z = p
ld r22,Z ; r22 = *p
clr r23
sbrc r22,7 ; skip following instruction if bit in register set
com r23 ; bit 7 = sign bit; com = complement -- ah, this is a sign extend operation
mov r24,r23
mov r25,r23 ; sign extend to 32 bits (r22:r23:r24:r25 = (int32_t)*p)
mov r30,r20
mov r31,r21 ; [1]
mov r18,r30
mov r19,r31 ; ?!?!
clr r20
sbrc r19,7 ; oh, sign extending r20:r21 = r...
com r20 ; probably the compiler allocated r18:r19:r20:r21 at [1],
mov r21,r20 ; so it had to move to a temporary register (r30:r31) first. sloppy.
rcall __mulsi3 ; r22-r25 * r18-r21 = r26-r31; 37 cycles
mov r18,r22
mov r19,r23
mov r20,r24
mov r21,r25 ; ?!
add r18,r14
adc r19,r15
adc r20,r16
adc r21,r17 ; acc + product
mov r22,r18
mov r23,r19
mov r24,r20
mov r25,r21 ; return result
pop r17 ; restore no-clobber registers
pop r16
pop r15
pop r14
ret
--- End code ---
- Note the surrounding push and pops: the ABI says, don't clobber r2-r17 and r28-r29. This function uses a lot of registers (8 passed in, 4 passed out), so that might happen. Push and pop cost a couple of cycles each (most instructions are 1 cycle, but most memory accesses add a 1-2 cycle penalty), so they're a priority to get rid of.
- Most of the instructions are moves. Hmm, that's not a good sign. Why can't we get the registers in the right places, to begin with? Well, calling conventions are fixed by the ABI, nothing we can do about that at this stage, but there's still more shuffling going on than that.
- Though, it looks like acc starts in r22-r25, and that's also where our output goes. Hmmmmm...
- Also, the compiler has already made obvious boners: there is a MOVW instruction which copies registers pairwise; instead of mov r14,r22; mov r15,r23 it should've used movw r14,r22 (and the rest).
- It looks like everything is just setup to use the library wide-multiply call __mulsi3, a 32 x 32 = 32 bit multiply. This sounds terribly inefficient. I mean, for what it does, the library call isn't bad, 37 cycles for that much work -- but we were supposed to be doing 8x16 here. This is ridiculous!
But it is standard practice, when the compiler encounters operations that would be too difficult to reason about. GCC will emit MUL instructions for byte multiplies, but anything larger uses library calls.
Naturally, they don't have libraries for every possible combination, only the most common -- signed-signed, signed-unsigned and unsigned-unsigned at 16 and 32 bits each I think.
Library calls also use a custom ABI, are never inlined, and never pruned (even with -flto). So the unused parts (sign extension and correction) waste code space, too.
So let's have a go at this, eh? What can we get it down to? Here's what I have in my project:
--- Code: ---mac32r16p8:
movw r30, r18
ld r18, Z+ ; get multiplier
; r25:r24:r23:r22 += mul_signed(r21:r20, r18)
eor r19, r19 ; use r19 as zero reg
mulsu r18, r20 ; p*lo (result in r0:r1)
sbc r24, r19
sbc r25, r19 ; sign extend
add r22, r0
adc r23, r1
adc r24, r19
adc r25, r19
muls r18, r21 ; p*hi
sbc r25, r19
add r23, r0
adc r24, r1
adc r25, r19
eor r1, r1
ret
--- End code ---
- Yup, the push/pop can be removed!
- Wait, how is this even so short? Yeah the compiler is wasteful, but heck! Well, inlining the multiply and stripping it down to the required 8x16 bit operation saves a hell of a lot of effort. The two sub-terms need to be added together (same way you do 1 x 2 = 3 digit multiplication by hand). The MULS(U) instructions return carry set if the result needs to be corrected for signedness; in essence we sign-extend the result, hence the sbc rx, 0's into the accumulator. It's a lot more addition than the 2-3 add/adc needed for an unsigned operation, but it's still better than adjusting for sign after the fact (i.e. using an unsigned multiply).
- Clearly, the semantics of doing all this is more involved than just one instruction. Extra registers need to be allocated (MUL uses r0:r1 as implicit destination; and I've used r19 to hold zero). Probably there's no equivalent in GCC's internal representation, either (where most of the optimization is done). So they give up and call a library.
- Conveniently, acc is passed in the same place the result is returned, so we can just accumulate directly into those registers. (This depends on parameter order -- swapping parameters alone may yield optimization!)
- Only one instruction is needed to maintain compiler compatibility: r1 is used as the zero register, so needs to be zeroed after use.
I further inlined this into the looping function. Notice the ld r18, Z+ postincrement instruction; r30:r31 isn't read after this, it's discarded instead. (There's no cycle penalty for ld Z vs. ld Z+ so it doesn't matter that I left it in there.) I can abuse this clobbered value of Z to just run this function in a loop, loading and accumulate all the products right there. :) But further, I can inline it, saving a few more cycles (loop overhead being less than call overhead + unrolled loop).
Overall, the optimizations on this project yielded almost a tenfold improvement -- on the meager platform, it went from hardly worth it (two mixer channels, ~50k MACs/s) to reasonably featureful (two biquad filter stages and 8 mixer channels, ~500k MACs/s).
Tim
KL27x:
--- Quote ---It only has RETLW (Return literal in W), so any subroutine that needs to pass back a value in W has to have multiple exit points, one for each possible value!
--- End quote ---
A lot of the cool new instructions are convenient, but in many cases we were achieving the same thing with a more basic instruction set, already.
In this case, you can transfer contents of w to any register you want and retrieve it after calling the subroutine. The "new feature" mostly just saves code space (and 1 memory register) and execution cycles. Any given subroutine is going to change registers and/or ports. That's what they're for; that's all they do; that's what code does at the most basic level. The w register is just one more of many.
You don't do retlw lookup tables because it's the only way to do it. You do it where it's convenient. We still use them in 14 bit PIC, because they're still convenient.
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
Edited to add: if you don't have a plain RETURN instruction, you use up one general data register and two lines of program memory after the call, and it's somewhat uncommon to need this in the first place. If you have only RETURN and no RETLW, your lookup tables will take up twice the program memory. E.g., instead of
--- Code: ---retlw ['a']
retlw ['b']
retlw ['c']
--- End code ---
you would have to do
--- Code: ---movlw ['a']
return
movlw ['b']
return
movlw ['c']
return
--- End code ---
Also, your BRW or "addwf PCL" instruction will need an extra instruction before it, to double the index value.
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
In my own code, 90% of my "returns" could be "retlw [arbitrary value]". And I still use retlw data tables. When deciding what to cram into the 12-bit core, they left out the one of the two instructions that was less valuable, having the easier workaround. RETURN is more expendable. The only place where it's not expendable is returning from an interrupt, and there you have RETFIE only. You don't have RETFIELW (which there would be almost useless).
Eti: everything you can do in C, you can do in assembly. But on more complex micros, writing assembly will make your head spin and is extremely time-consuming.