Author Topic: cortex-m assembler/disassembler into mcu (Read 1777 times)

martinribelotta · « **on:** July 13, 2020, 02:04:17 pm »

Hi all, I'm working on a version of the memorable DEBUG.COM from MS-DOS, but targeting ARMv7-M/ARMv6-M. The interface does not try to be the same as debug.com, just to be in the same spirit.

https://github.com/martinribelotta/cmx-debug

For now it only has a hex dump and a disassembler, but I'm happy with the result for a weekend project, and it is clearly a work in progress.

I tested it on my Blue Pill board, but the code is really generic... For the I/O console I use the semihosting facility from newlib, but it is retargetable to any interface like serial or whatever...

The project is under the GPLv3 licence because it uses binutils code (specifically, modified routines from libopcodes).

A little screenshot:

Thanks for reviewing and I hope someone will find it useful.

westfw · « **Reply #1 on:** July 14, 2020, 11:04:15 am »

Nice. I wish you had posted this a couple of weeks back, when I started writing a disassembler from scratch (a quick look around found several that looked designed for big computers), and I didn’t even suspect that gdb’s would be useful.
I had similar goals in mind, too. “Something like “debug” or “ddt””

How much space is it taking?

martinribelotta · « **Reply #2 on:** July 14, 2020, 04:31:46 pm »

Quote

Nice. I wish you had posted this a couple of weeks back, when I started writing a disassembler from scratch (a quick look around found several that looked designed for big computers), and I didn’t even suspect that gdb’s would be useful.
I had similar goals in mind, too. “Something like “debug” or “ddt””

Maybe you can take ideas and some code from my project... Just remember that this code is tied to GPLv3 because it came (modified) from binutils.

Quote

How much space is it taking?

From my last compilation:

Code: [Select]

Memory region         Used Size  Region Size  %age Used
             RAM:         428 B         2 KB     20.90%
             ROM:       27012 B        64 KB     41.22%

But this is mostly the newlib stdio library (I use newlib-nano because there is no need for floating-point representation).

More precise information is revealed using nm:

Code: [Select]

arm-none-eabi-nm -S -t dec --size-sort build/disassembler.o 
00000000 00000004 b ifthen_next_state
00000000 00000016 T abort
00000000 00000020 t print_addr
00000000 00000068 r arm_conditional
00000000 00000132 t data_barrier_option
00000000 00000144 T do_disassemble
00000000 00000152 t psr_name
00000000 00000154 t arm_decode_bitfield
00000000 00000408 t banked_regname
00000000 00000432 r regnames
00000000 00000672 r thumb_opcodes
00000000 00001228 t print_insn_thumb16
00000000 00002232 r thumb32_opcodes
00000000 00002684 t print_insn_thumb32

This is with -Og optimization... but it does not change significantly with -Os:

Code: [Select]

arm-none-eabi-nm -S -t dec --size-sort build/disassembler.o 
00000000 00000004 b ifthen_next_state
00000000 00000012 T abort
00000000 00000024 t print_addr
00000000 00000060 r CSWTCH.118
00000000 00000068 r arm_conditional
00000000 00000084 r CSWTCH.119
00000000 00000432 r regnames
00000000 00000448 r CSWTCH.117
00000000 00000672 r thumb_opcodes
00000000 00002232 r thumb32_opcodes
00000000 00003348 T do_disassemble

(This only merges some functions into one big do_disassemble, saving some code for function entry/exit.)

In my first approach (one function per instruction encoding type) the memory usage was higher than with this approach (one big printf-style function with a custom interpreter).
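
A rough illustration of the "printf style with custom interpreter" idea: each opcode table entry carries a format string, and one small interpreter expands it against the instruction bits. The directive syntax (`%<lo>-<hi>r` for a register, `%<lo>-<hi>d` for a decimal field) and the name `format_insn` are hypothetical, loosely inspired by the style of the libopcodes table strings and much simpler than them. This is a sketch, not the cmx-debug code, and it assumes well-formed, trusted format strings with fields narrower than 32 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Expand fmt against the bits of insn into out (capacity cap).
 * %<lo>-<hi>r prints bits lo..hi as "rN", %<lo>-<hi>d prints them in decimal,
 * every other character is copied as-is. Returns the length of the text
 * written. Assumes trusted, well-formed format strings (hypothetical syntax). */
size_t format_insn(char *out, size_t cap, const char *fmt, uint32_t insn)
{
    assert(out != NULL && fmt != NULL && cap > 0);
    size_t len = 0;
    out[0] = '\0';
    while (*fmt != '\0' && len + 1 < cap) {
        if (*fmt != '%') {                      /* literal character */
            out[len++] = *fmt++;
            out[len] = '\0';
            continue;
        }
        char *end;
        unsigned lo = (unsigned)strtoul(fmt + 1, &end, 10);
        unsigned hi = (unsigned)strtoul(end + 1, &end, 10); /* end + 1 skips '-' */
        uint32_t field = (insn >> lo) & ((1u << (hi - lo + 1u)) - 1u);
        int n = snprintf(out + len, cap - len,
                         (*end == 'r') ? "r%u" : "%u", (unsigned)field);
        if (n < 0) {                            /* formatting error */
            out[len] = '\0';
            break;
        }
        if ((size_t)n >= cap - len) {           /* truncated, still NUL-terminated */
            len = cap - 1;
            break;
        }
        len += (size_t)n;
        fmt = end + 1;                          /* step past the 'r' or 'd' */
    }
    return len;
}
```

For example, the 16-bit Thumb encoding 0x202A with the format "movs %8-10r, #%0-7d" expands to "movs r0, #42". All the per-format decode logic lives in the table strings, which is where the code size saving comes from.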

amyk · « **Reply #3 on:** July 15, 2020, 12:12:08 am »

The technical term for this type of firmware is "ROM monitor".

brucehoult · « **Reply #4 on:** July 15, 2020, 12:30:12 am »

Nice. If anyone wants to do the work, it might be worth making an ARM version based on the style and code used in https://github.com/riscv/riscv-opcodes, which is Berkeley license. This is used for the disassembler in the standard simulator (spike aka riscv-isa-sim) and many other things.

westfw · « **Reply #5 on:** July 15, 2020, 03:06:08 am »

Quote

remember that this code is tied to GPLv3 because it came (modified) from binutils

Noted. May be worth it anyway.
Interestingly, a quick glance indicates that I have a very similar data structure for the decoding, except that I have code between multiple branches of the tree, rather than one huge table. I can't quite tell whether that'll be a net win; the 16bit thumb instructions were pretty easy, but the 32bit decode is painful :-(

brucehoult · « **Reply #6 on:** July 15, 2020, 03:28:56 am »

Quote from: westfw on July 15, 2020, 03:06:08 am

Quote
remember that this code is tied to GPLv3 because it came (modified) from binutils
Noted. May be worth it anyway.
Interestingly, a quick glance indicates that I have a very similar data structure for the decoding, except that I have code between multiple branches of the tree, rather than one huge table. I can't quite tell whether that'll be a net win; the 16bit thumb instructions were pretty easy, but the 32bit decode is painful :-(

Yeah, A32 and T16 encoding are reasonably pleasant, but T32 is just horrid. Weirdly, A64 isn't any better, despite the opportunity for a fresh start.

The RISC-V code I've written or maintained doesn't use any fancy data structures for the instruction decode. Each instruction or instruction format [1] has a binary/hex "mask" and "match" pair. All the possibilities are tested sequentially if ((opcode & mask) == match) ... and then the variable fields for that instruction/format extracted. That's plenty fast enough for a disassembler (especially an interactive one). If more speed is needed then a cache (e.g. hash table) is kept of the N most recently seen raw instructions and their mapping to the appropriate mask/match pair or fully-decoded instruction struct or whatever. The number of completely distinct instruction opcodes in a program is relatively small, and certainly that's the case for a loop or function.

[1] it's a bit arbitrary where you draw the line e.g. do you decode add/sub/and/or/xor individually, or just do "ALU-OP" and treat the operation as a field?
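
A minimal C sketch of the mask/match search described above. The struct and function names (`insn_desc`, `find_insn`, `check_table`) are made up for illustration; the eight entries are standard RV32I encodings, while a real table would have 100 or more rows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* An instruction matches an entry when (opcode & mask) == match. */
struct insn_desc {
    uint32_t mask;
    uint32_t match;
    const char *name;
};

/* A few RV32I encodings, for illustration only. */
static const struct insn_desc insn_table[] = {
    { 0x0000707fu, 0x00000013u, "addi" },
    { 0x0000707fu, 0x00002003u, "lw"   },
    { 0x0000707fu, 0x00002023u, "sw"   },
    { 0x0000707fu, 0x00000063u, "beq"  },
    { 0x0000707fu, 0x00001063u, "bne"  },
    { 0x0000007fu, 0x0000006fu, "jal"  },
    { 0xfe00707fu, 0x00000033u, "add"  },
    { 0xfe00707fu, 0x40000033u, "sub"  },
};
#define INSN_COUNT (sizeof insn_table / sizeof insn_table[0])

/* Sanity check: a match bit outside the mask means the entry can never hit. */
void check_table(void)
{
    for (size_t i = 0; i < INSN_COUNT; i++)
        assert((insn_table[i].match & ~insn_table[i].mask) == 0);
}

/* Test every entry in order; return the first hit, or NULL if none matches. */
const struct insn_desc *find_insn(uint32_t opcode)
{
    for (size_t i = 0; i < INSN_COUNT; i++) {
        if ((opcode & insn_table[i].mask) == insn_table[i].match)
            return &insn_table[i];
    }
    return NULL;
}

/* Once the format is known, the variable fields are plain shifts and masks. */
static inline unsigned insn_rd(uint32_t opcode)  { return (opcode >> 7) & 0x1fu; }
static inline unsigned insn_rs1(uint32_t opcode) { return (opcode >> 15) & 0x1fu; }
static inline unsigned insn_rs2(uint32_t opcode) { return (opcode >> 20) & 0x1fu; }
```

With this table, `find_insn(0x00C58533)` returns the "add" entry, and the three field helpers give rd = 10, rs1 = 11, rs2 = 12 (that is, add a0, a1, a2).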

Nominal Animal · « **Reply #7 on:** July 15, 2020, 06:04:26 am »

I got a flashback to the Commodore 128 monitor, the built-in system monitor utility I used to learn 6502 assembly. Used it almost as much as the built-in sprite editor (sprdef).

Quote from: brucehoult on July 15, 2020, 03:28:56 am

All the possibilities are tested sequentially if ((opcode & mask) == match) ...

Are the mask bits continuous? Could you sort the mask,match tuples according to minimum or maximum possible opcode, so you could skip large swathes of opcodes? Or is the total number small enough to not bother? Sorry for regressing to a 4-year old, but this is kinda interesting.

brucehoult · « **Reply #8 on:** July 15, 2020, 08:01:09 am »

Quote from: Nominal Animal on July 15, 2020, 06:04:26 am

I got a flashback to the Commodore 128 monitor, the built-in system monitor utility I used to learn 6502 assembly. Used it almost as much as the built-in sprite editor (sprdef).

Quote from: brucehoult on July 15, 2020, 03:28:56 am
All the possibilities are tested sequentially if ((opcode & mask) == match) ...
Are the mask bits continuous? Could you sort the mask,match tuples according to minimum or maximum possible opcode, so you could skip large swathes of opcodes? Or is the total number small enough to not bother? Sorry for regressing to a 4-year old, but this is kinda interesting.

No. It covers the whole 32 bit instruction, masking out all the fields containing register numbers, literals that parameterize an instruction (and the high 16 bits for 16 bit instructions), and keeping the fixed bits that make that instruction .. that instruction.

You could sort by "mask". You could sort by "match". But I can't see any way to sort by a combination of them.

RV64IMAFDC ends up with I guess a bit over 100 instructions in the list. That doesn't take long to do a linear search on.

Many programs move a mask/match pair that hits to the front of the list. That means that most of the time you get a hit in the first 10 or so. Just a few instructions such as add immediate, load register, store register, beq, bne, jal, ret pretty quickly make up well over 90% of all opcodes.

That's about as good as you'd get by sorting and binary searching anyway, even if it was possible.
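
The move-to-front trick can be sketched in C as below. The names `mtf_entry` and `find_insn_mtf` are hypothetical. The sketch assumes the table lives in RAM (it is reordered on every hit) and that the entries are mutually exclusive, so the order of the table never changes which entry matches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct mtf_entry {
    uint32_t mask;
    uint32_t match;
    const char *name;
};

/* Linear mask/match search over a mutable table. On a hit, the entry is
 * moved to index 0 and the entries that were in front of it slide down
 * by one, so frequently seen instructions drift to the front.
 * Returns a pointer to the hit (now at table[0]), or NULL if none matches. */
const struct mtf_entry *find_insn_mtf(struct mtf_entry *table, size_t count,
                                      uint32_t opcode)
{
    assert(table != NULL);
    for (size_t i = 0; i < count; i++) {
        if ((opcode & table[i].mask) != table[i].match)
            continue;
        struct mtf_entry hit = table[i];                 /* save before shifting */
        memmove(&table[1], &table[0], i * sizeof table[0]);
        table[0] = hit;
        return &table[0];
    }
    return NULL;
}
```

After a lookup of an instruction found at index 3, the table order becomes [3, 0, 1, 2]; the next lookup of the same instruction hits on the first comparison.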

Nominal Animal · « **Reply #9 on:** July 15, 2020, 09:28:37 am »

Quote from: brucehoult on July 15, 2020, 08:01:09 am

RV64IMAFDC ends up with I guess a bit over 100 instructions in the list. That doesn't take long to do a linear search on.

No, that's definitely not too large to just scan through in linear fashion.

Even if you stick the entire table in ROM/flash, you can reorder it with most commonly used instructions first. Personally, I might put the mask and match in one array, and the per-instruction data in a separate array, for optimal locality during the search, but that's about it.

Even on x86-64, arrays that fit in a single (or a couple) cache line(s) are faster to scan in linear fashion than any alternatives. (I've mentioned I've done some research on how to repack large arrays so that binary search has better locality of access; I call them "chunked binary search trees".)
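
One way to read the "mask and match in one array, per-instruction data in a separate array" suggestion, as a C sketch. The names (`insn_key`, `insn_info`, `lookup_info`) and the operand strings are hypothetical; the encodings are RV32I, listed roughly most-common-first as an assumption about typical code. Both arrays are `const`, so they can stay in ROM/flash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hot data: the only thing the search loop touches, 8 bytes per entry. */
struct insn_key {
    uint32_t mask;
    uint32_t match;
};

/* Cold data: read only after a hit, indexed in parallel with the keys. */
struct insn_info {
    const char *name;
    const char *operands;
};

static const struct insn_key keys[] = {
    { 0x0000707fu, 0x00000013u },
    { 0x0000707fu, 0x00002003u },
    { 0x0000707fu, 0x00002023u },
    { 0x0000707fu, 0x00000063u },
    { 0x0000707fu, 0x00001063u },
    { 0x0000007fu, 0x0000006fu },
    { 0xfe00707fu, 0x00000033u },
    { 0xfe00707fu, 0x40000033u },
};

static const struct insn_info infos[] = {
    { "addi", "rd, rs1, imm"      },
    { "lw",   "rd, imm(rs1)"      },
    { "sw",   "rs2, imm(rs1)"     },
    { "beq",  "rs1, rs2, offset"  },
    { "bne",  "rs1, rs2, offset"  },
    { "jal",  "rd, offset"        },
    { "add",  "rd, rs1, rs2"      },
    { "sub",  "rd, rs1, rs2"      },
};

static_assert(sizeof keys / sizeof keys[0] == sizeof infos / sizeof infos[0],
              "keys and infos must have the same number of entries");

/* Scan only the compact key array; fetch the info entry after a hit. */
const struct insn_info *lookup_info(uint32_t opcode)
{
    for (size_t i = 0; i < sizeof keys / sizeof keys[0]; i++) {
        if ((opcode & keys[i].mask) == keys[i].match)
            return &infos[i];
    }
    return NULL;
}
```

The eight keys here occupy 64 bytes, so on a machine with 64-byte cache lines the whole search can run out of one or two lines, depending on alignment.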

westfw · « **Reply #10 on:** July 15, 2020, 10:13:11 am »

Quote

ends up with I guess a bit over 100 instructions in the list. That doesn't take long to do a linear search on.

It's a disassembler, operating at human-reading-text timescales. Doing 100-odd mask/compare/loop cycles is not a problem. Call it no more than 5000 instructions - even at 16MHz execution rates that's less than a millisecond per instruction.

I was happy that I got my ARM disassembly code to automatically detect literal pools and NOT try to decode them. It was easier than I expected, too.

brucehoult · « **Reply #11 on:** July 15, 2020, 01:43:31 pm »

Quote from: westfw on July 15, 2020, 10:13:11 am

Quote
ends up with I guess a bit over 100 instructions in the list. That doesn't take long to do a linear search on.
It's a disassembler, operating at human-reading-text timescales. Doing 100-odd mask/compare/loop cycles is not a problem. Call it no more than 5000 instructions - even at 16MHz execution rates that's less than a millisecond per instruction.

As I said, spike is the reference RISC-V simulator, used for example by hardware designers to run in lock-step with their FPGA or verilator runs to check them. It is written to be easy to understand and easy to extend, and to be "obviously correct". It uses this technique to match and decode instructions and runs at over 100 MIPS on a current PC. That's no more than 40 to 50 clock cycles on average to not only match the opcode with a linear search if necessary, but to extract the various fields and simulate it.

martinribelotta · « **Reply #12 on:** July 15, 2020, 05:25:04 pm »

Quote

I was happy that I got my ARM disassembly code to automatically detect literal pools and NOT try to decode them. It was easier than I expected, too.

What is your technique? I track instructions with memory accesses and mark those memory regions as LD/ST accessed... Then, in the disassembler, these are treated as literals.

https://github.com/martinribelotta/cmx-debug/commit/79ba166322a9056d21f8412ee365af887abb8115

This is only a proof-of-concept implementation... I'm sure it does not catch all corner cases, and it does not handle non-word accesses (LD*[B|H]) properly...
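
A minimal C sketch of the general idea (mark the words that PC-relative loads point at, then skip them when disassembling). It is not the code from the commit above: the names `scan_literal_loads` and `is_literal` are hypothetical, it only recognizes the 16-bit `LDR Rt, [PC, #imm8*4]` encoding, and it assumes a single forward pass over a region that starts on a 4-byte boundary, with loads appearing before their literal pools.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One bit per 32-bit word of the region; a set bit means "literal data". */
static void mark_word(uint8_t *bitmap, uint32_t word_index)
{
    bitmap[word_index / 8] |= (uint8_t)(1u << (word_index % 8));
}

/* True if the word containing addr was marked. addr must be in the region. */
bool is_literal(const uint8_t *bitmap, uint32_t base, uint32_t addr)
{
    uint32_t w = (addr - base) / 4;
    return ((bitmap[w / 8] >> (w % 8)) & 1u) != 0;
}

/* Scan n halfwords of Thumb code located at address base.
 * bitmap must be zeroed and hold at least one bit per word of the region. */
void scan_literal_loads(const uint16_t *code, size_t n, uint32_t base,
                        uint8_t *bitmap)
{
    assert(code != NULL && bitmap != NULL && (base & 3u) == 0);
    uint32_t end = base + 2u * (uint32_t)n;
    size_t i = 0;
    while (i < n) {
        uint32_t addr = base + 2u * (uint32_t)i;
        uint16_t hw = code[i];
        if (is_literal(bitmap, base, addr)) {   /* already known to be data */
            i++;
            continue;
        }
        if ((hw & 0xF800u) == 0x4800u) {        /* LDR Rt, [PC, #imm8*4] */
            uint32_t pc = (addr + 4u) & ~3u;    /* Align(PC, 4), PC = addr + 4 */
            uint32_t target = pc + ((uint32_t)(hw & 0xFFu) << 2);
            if (target < end)                   /* ignore loads outside region */
                mark_word(bitmap, (target - base) / 4);
        }
        /* First halfwords 0b11101, 0b11110, 0b11111 start a 32-bit encoding. */
        i += ((hw & 0xF800u) >= 0xE800u) ? 2 : 1;
    }
}
```

The disassembler loop can then call `is_literal()` for each address and print a `.word` instead of decoding. Handling byte and halfword loads, 32-bit `LDR.W` literal forms, and backward references would need more work, as noted above.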

martinribelotta · « **Reply #13 on:** July 15, 2020, 05:37:33 pm »

Quote

Interestingly, a quick glance indicates that I have a very similar data structure for the decoding, except that I have code between multiple branches of the tree, rather than one huge table. I can't quite tell whether that'll be a net win; the 16bit thumb instructions were pretty easy, but the 32bit decode is painful :-(

This was my first approach (multiple decoders in a tree with small tables and masks), but I spent two hours getting a simple branch and ld/st decoder... Then I thought someone must have done it better, and finally I searched several code bases (openocd, binutils, llvm)... The simplest code to integrate was libopcodes from binutils...

Hey, I admit it, I'm lazy ...

Maybe the tables in a tree can be better optimized for code size and speed than the printf-like interpreter. Ideally, the best way to do the disassembly is to emulate the combinational logic of the decode circuit (maybe with some FSM emulating the control unit), but this approach is really hard and needs many man-hours of reverse engineering from the instruction format to a <possible> decode implementation... This technique is useful not only for disassembly but also for emulation... (clearly outside my current scope)

