Author Topic: ARM Cortex series simulator  (Read 5614 times)


Online hans (Topic starter)

  • Super Contributor
  • ***
  • Posts: 1626
  • Country: nl
ARM Cortex series simulator
« on: May 13, 2018, 08:25:31 am »
I'm doing some experiments with continuous integration for embedded firmware, and aside from maintaining functional quality (unit tests), I'd also like to keep track of some other figures of merit that give an indication of performance, like code size, RAM utilization and execution time.

I've written a small script that extracts all the unit tests from my libraries and builds them automatically using GCC. Given a baseline figure for the system (which includes overhead such as the vector table), it can calculate the code size attributable to a particular test (i.e. the usage of a particular library function, or a combination of functions).
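A minimal sketch of that baseline subtraction, for illustration only. It assumes the Berkeley-format output of GNU `size` (columns `text data bss dec hex filename`); the sample numbers, the file names and the helper names `parse_size_output` and `footprint_delta` are all made up and not taken from the actual script.

```python
def parse_size_output(output):
    """Parse Berkeley-format `size` output into {filename: (text, data, bss)}."""
    sizes = {}
    for line in output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        text, data, bss = (int(f) for f in fields[:3])
        sizes[fields[-1]] = (text, data, bss)
    return sizes


def footprint_delta(test, baseline):
    """Flash and RAM cost of a test image relative to the baseline image."""
    flash = (test[0] + test[1]) - (baseline[0] + baseline[1])  # text + data
    ram = (test[1] + test[2]) - (baseline[1] + baseline[2])    # data + bss
    return flash, ram


# Made-up sample output; real numbers would come from running `size` on the ELFs.
SAMPLE = """\
   text    data     bss     dec     hex filename
   1204      16    1032    2252     8cc baseline.elf
   1876      24    1160    3060     bf4 test_ringbuffer.elf
"""

sizes = parse_size_output(SAMPLE)
flash, ram = footprint_delta(sizes["test_ringbuffer.elf"], sizes["baseline.elf"])
print(f"flash +{flash} bytes, RAM +{ram} bytes")
```

Counting `data` towards both flash and RAM reflects the usual Cortex-M layout, where initialized data is stored in flash and copied to RAM at startup; a different linker script may call for a different split.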

Cross-compilation from here on is a trivial step; the question is for which platform. At the moment I'm building for ARM Cortex-M3 processors using an STM32 linker script, since CPUs like the ARM Cortex-M0, M3 and M4 are the ones I'll most likely target in my projects. Unfortunately, I've not been able to find a cycle-accurate simulator for these CPUs that I could drive automatically from the command line.
In essence, I'm looking for a simulator that I can give a HEX/ELF/BIN/assembly program to execute, and that has some performance counter (e.g. a timer or stopwatch) to measure speed. Some mechanism to stop the simulation from within the program would be nice.

I think I'm looking for a piece of software similar to simavr. I've used it in a course before, and I believe it's able to execute code for an AVR and then output .VCD files which can be opened using GTKWave. I haven't looked yet, but I think it would be fairly trivial to extract the timing info from such a log automatically.
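To give an idea of how such an extraction might work, here is a rough sketch that reads a VCD dump and measures how long a 1-bit signal stays high. It assumes the firmware under test raises a pin around the code being timed and that this pin shows up as a scalar signal in the dump; the signal name `busy` and the sample dump are invented for illustration, and only the small subset of the VCD format needed here is handled.

```python
def signal_edges(vcd_text, signal_name):
    """Return [(time, value), ...] for one scalar (1-bit) signal in a VCD dump."""
    ident = None   # short identifier code the VCD assigns to the signal
    now = 0        # current simulation time, in units of the dump's timescale
    edges = []
    for line in vcd_text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] == "$var":
            # $var <type> <width> <identifier> <name> $end
            if tokens[4] == signal_name:
                ident = tokens[3]
        elif tokens[0].startswith("#"):
            now = int(tokens[0][1:])
        elif ident is not None and tokens[0][0] in "01" and tokens[0][1:] == ident:
            edges.append((now, int(tokens[0][0])))
    return edges


def high_time(edges):
    """Total time the signal spent at 1, summed over all pulses."""
    total = 0
    rise = None
    for time, value in edges:
        if value == 1 and rise is None:
            rise = time
        elif value == 0 and rise is not None:
            total += time - rise
            rise = None
    return total


# Invented sample dump: `busy` goes high at t=120 and low again at t=455.
SAMPLE_VCD = """\
$timescale 1us $end
$scope module logic $end
$var wire 1 ! busy $end
$upscope $end
$enddefinitions $end
#0
0!
#120
1!
#455
0!
"""

edges = signal_edges(SAMPLE_VCD, "busy")
print(high_time(edges))
```

The result is expressed in whatever unit the `$timescale` line declares, so converting to cycles requires knowing the simulated clock frequency.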

However, I rarely target AVRs, so I would like to stick to simulation of ARM CPUs. Does anyone know of an (open source) simulator for the ARM Cortex-M0, M3 or M4?
I've looked at QEMU, but it appears to provide emulation that is not cycle-accurate (it aims to be "fast" instead).
I've yet to look into OVPworld; unfortunately their download page requires an account...

 

Offline jeremy

  • Super Contributor
  • ***
  • Posts: 1079
  • Country: au
Re: ARM Cortex series simulator
« Reply #1 on: May 13, 2018, 12:43:47 pm »
AFAIK nothing exists which is a solid simulator/emulator. It seems the device variations are too numerous and too complex to be worth it. You can get simulators for the core itself (see Keil), but none of the peripherals would be working.

This is something I have struggled with too.
 

Online hans (Topic starter)

  • Super Contributor
  • ***
  • Posts: 1626
  • Country: nl
Re: ARM Cortex series simulator
« Reply #2 on: May 13, 2018, 03:51:00 pm »
That's disappointing. I guess I will wait for my OVP activation and see what turns up. I'd prefer not to implement an instruction set simulator for the whole Cortex-M4 core myself.
I don't really need any peripherals. Just the core and perhaps a SysTick timer would be sufficient. Perhaps a simulator is deemed rather useless if it doesn't have a peripheral subsystem to simulate, though.

In the meantime I also looked at AVR, and I quickly stumbled upon reasons why it's not preferable. Not only does it not have an STL implementation by default, I'm also seeing quite a lot of "side effects" from the 8-bit CPU core, where most of my code assumes a 16-bit or wider CPU architecture.
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3137
  • Country: ca
Re: ARM Cortex series simulator
« Reply #3 on: May 13, 2018, 04:11:31 pm »
If a CPU has cache and the memory bus is somehow shared with (unknown) peripherals, cycle-accurate timing simulation is incredibly complex, if possible at all.

Even doing satisfactory timing measurements on the real device is problematic.

If you're interested in the worst case, you can assume no cache, which makes simulation much easier, but for MCUs which are cache dependent, this is nowhere close to the performance level which may be achieved with cache.

I think critical timing on contemporary 32-bit systems is achieved mostly by DMA, peripherals etc., while the CPU is constrained to tasks which are not time critical. Is it ever worth measuring/simulating the timing?
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14309
  • Country: fr
Re: ARM Cortex series simulator
« Reply #4 on: May 13, 2018, 04:32:35 pm »
You may take another approach and automate testing on a real target.
 

Offline ataradov

  • Super Contributor
  • ***
  • Posts: 11228
  • Country: us
    • Personal site
Re: ARM Cortex series simulator
« Reply #5 on: May 13, 2018, 05:08:49 pm »
The only way to do cycle-accurate simulation of anything more complex than a Cortex-M0+ is to generate simulators based on the actual Verilog code, and only ARM can do that.

In addition to that, unless your entire code fits into Flash or in TCM, you will also have to take into account bus contention and the fact that there will be wait cycles.

And given that real cores have good ways to instrument and measure performance, there is no real benefit to cycle-accurate simulation.
Alex
 

Offline WillHuang

  • Contributor
  • Posts: 47
Re: ARM Cortex series simulator
« Reply #6 on: May 13, 2018, 08:52:26 pm »
Just out of curiosity, which framework are you using for unit testing?

BTW, I don't know if you already know this, but you can generate code size information, including RAM usage, during compilation/linking.
 

Offline abraxa

  • Frequent Contributor
  • **
  • Posts: 377
  • Country: de
  • Sigrok associate
Re: ARM Cortex series simulator
« Reply #7 on: May 13, 2018, 09:00:56 pm »
I'd say it's worth checking out http://www.lauterbach.com/frames.html?download_demo.html for your purposes.
 

Online hans (Topic starter)

  • Super Contributor
  • ***
  • Posts: 1626
  • Country: nl
Re: ARM Cortex series simulator
« Reply #8 on: May 13, 2018, 09:02:52 pm »
Perhaps let me explain further what I'm trying to do.

I'm (re)writing/refactoring some of my low-level C++ libraries that face hardware registers or model abstract concepts (for example a circular buffer). As you can imagine, these libraries make heavy use of templates to keep the overhead low when instantiated, and to avoid the overhead of virtual methods where possible.

I'm using googletest for my unit tests. I wrote a small Python script that extracts the test code from all test cases, and copies the code over to a dedicated cpp file with its own main. Since my C++ library currently only contains templated code, all classes live in header files, which makes compilation trivial to do.
I currently use the size and nm utility to view the resulting code size.
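The extraction step described above could look roughly like the sketch below. This is not the actual script: the regular expression, the naive brace matching and the helper names `extract_tests` and `render_main` are assumptions made for illustration, as are the sample test and the header name `ringbuffer.hpp`. How the googletest assertion macros are handled on the target (stubbed, replaced, or kept) is left open.

```python
import re

# Matches the opening of TEST(Suite, Name) { or TEST_F(Fixture, Name) {
TEST_HEADER = re.compile(r"\bTEST(?:_F)?\s*\(\s*(\w+)\s*,\s*(\w+)\s*\)\s*\{")


def extract_tests(source):
    """Yield (suite, name, body) for each TEST/TEST_F block in a source string.

    Naive brace matching: braces inside string literals or comments would
    confuse it, which is acceptable for a sketch.
    """
    for match in TEST_HEADER.finditer(source):
        depth = 1
        pos = match.end()
        while depth and pos < len(source):
            if source[pos] == "{":
                depth += 1
            elif source[pos] == "}":
                depth -= 1
            pos += 1
        # pos now points just past the closing brace of the test body
        yield match.group(1), match.group(2), source[match.end():pos - 1].strip()


def render_main(body, header):
    """Wrap one extracted test body in a standalone translation unit."""
    return f'#include "{header}"\n\nint main() {{\n    {body}\n    return 0;\n}}\n'


# Hypothetical test case, only used to exercise the extractor.
SAMPLE = """
TEST(RingBuffer, PushPop) {
    RingBuffer<int, 4> rb;
    if (rb.empty()) { rb.push(1); }
    EXPECT_EQ(rb.pop(), 1);
}
"""

tests = list(extract_tests(SAMPLE))
suite, name, body = tests[0]
print(render_main(body, "ringbuffer.hpp"))
```

Each generated file can then be cross-compiled on its own, so that `size` and `nm` report the footprint of exactly one test.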

Considering the memory available in modern ARM chips, it's nice to allow some more software luxury. For example, instead of having the user application call I2cStart(), I2cTxByte() and I2cStop() during device communication, it's also possible to assemble a frame of 'tokens' (or bytes) to transmit, which the driver can then iterate over and transmit.

What is nice is that more context is available in the software. For example, I could log all I2C transactions that fail and immediately know what the complete frame was.

Obviously all this high-level sauce adds overhead to the system, which may or may not be a problem (for I2C it's likely not, as synchronous implementations often burn cycles in a spinlock anyway). This is what I want to characterize, or at least have comparable figures for between changes, rather than "guessing" from how nice the assembly looks (which is always going to be bloated depending on the level of abstraction and/or logging enabled during compilation).

To me simulation sounds like a reasonable choice. Remember that I still don't actually care if I'm testing against a real hardware peripheral. Ultimately I2C, SPI or any other protocol is a certain means of putting bytes onto a wire. It's not hard to write a "virtual" driver that just acts as if it's writing to peripheral space and back.

This is why I called it a "figure of merit". I'm aware that I cannot measure exactly how many microseconds or cycles these operations are going to take, due to caches, bus contention, peripheral speed (the peripheral clock domain could be 1/16 of the CPU clock, for example), etc. In fact, if you read up on WCET for real-time systems, you'll find that it's very hard to determine for modern computer architectures (and that measured WCETs are often not the actual "worst case").

You could argue this is not worth it. I know, this is just an experiment :-) This is also one of the reasons why many of my hobby projects never get "finished"...
To some degree, I think you could also argue that a CI system is not worth it; at face value it does not have a "functional value" in a software production environment. There are plenty of companies around that don't use such a system, and they still deliver their software to customers. Yet it's a great tool to make sure quality standards are met, and that no discrepancies appear between builds and/or that they are dealt with as quickly as possible.

I guess I could also use "real hardware" to accomplish this goal, but if there is a software solution available I would rather use that. It's easier to deploy on any random server (I run my CI stuff in Docker containers, which in turn run in a VM on a Proxmox host). In my home lab, of course, I could just dig up any ARM devboard with a debugger and run all tests on it...

I think OVP wrote something about providing CPU models generated from RTL, so that certainly sounds like an interesting utility to try out. But I'm not getting my hopes up for a turnkey solution, considering that "regular" simulators are not easy to come by, it seems.
« Last Edit: May 13, 2018, 09:09:41 pm by hans »
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3137
  • Country: ca
Re: ARM Cortex series simulator
« Reply #9 on: May 13, 2018, 10:59:01 pm »
If you're an I2C master (or SPI master for that matter) you don't really need to meet any timing requirements - the only danger is going too fast for the device you're talking to.

But if you did have timing requirements (as in case with SPI slave, for example), you couldn't do it in unit test because it largely depends on the other parts of the system. Say, an enabled interrupt will slow you down and may wreck the timing and it has nothing to do with your SPI code.

IMHO, the embedded timing is better done as a complex - you find the fastest task(s) in your project and you build everything to make sure you meet the timing for these tasks (including their possible interference with each other), and all I2C, SPI etc. just drag by in the main loop without any worries about timing. Moreover, often you don't even have anything fast, so meeting timing is not even a concern.
 

Offline jeremy

  • Super Contributor
  • ***
  • Posts: 1079
  • Country: au
Re: ARM Cortex series simulator
« Reply #10 on: May 13, 2018, 11:19:24 pm »
You could possibly use the DesignStart Cortex M3 core with something like verilator.
 

Offline westfw

  • Super Contributor
  • ***
  • Posts: 4196
  • Country: us
Re: ARM Cortex series simulator
« Reply #11 on: May 14, 2018, 01:37:42 am »
Keil has a simulator.  I don't recall how well it does at measuring cycle-counts.
All the "real" chips that I've seen have bus structures and "flash accelerators" that I don't think are even documented well enough to predict performance, much less write a cycle-accurate simulator.  "Set the flash memory controller for 5 wait states at 84MHz. Maybe the accelerator will return data faster than that, maybe not."  :-(

For a "figure of merit", isn't a non-cycle-accurate simulator good enough?   You want to see if your code is "tight", not necessarily whether it's maximally avoiding pipeline stalls and hitting caches...
 

Offline jnz

  • Frequent Contributor
  • **
  • Posts: 593
Re: ARM Cortex series simulator
« Reply #12 on: May 14, 2018, 04:55:21 am »
IDK if it’s the same sim Keil has but repackaged, but iirc MBED has a web Cortex simulator.
 

Offline obiwanjacobi

  • Frequent Contributor
  • **
  • Posts: 988
  • Country: nl
  • What's this yippee-yayoh pin you talk about!?
    • Marctronix Blog
Re: ARM Cortex series simulator
« Reply #13 on: May 14, 2018, 05:55:14 am »
I thought of doing something similar and concluded that just running it on some dedicated (diagnostics) hardware would be the simplest and yield the most relevant results... Never built any of it of course.   ::)
 

Online hans (Topic starter)

  • Super Contributor
  • ***
  • Posts: 1626
  • Country: nl
Re: ARM Cortex series simulator
« Reply #14 on: May 14, 2018, 07:55:00 am »
Quote from: NorthGuy on May 13, 2018, 10:59:01 pm
If you're an I2C master (or SPI master for that matter) you don't really need to meet any timing requirements - the only danger is going too fast for the device you're talking to.

But if you did have timing requirements (as in case with SPI slave, for example), you couldn't do it in unit test because it largely depends on the other parts of the system. Say, an enabled interrupt will slow you down and may wreck the timing and it has nothing to do with your SPI code.

IMHO, the embedded timing is better done as a complex - you find the fastest task(s) in your project and you build everything to make sure you meet the timing for these tasks (including their possible interference with each other), and all I2C, SPI etc. just drag by in the main loop without any worries about timing. Moreover, often you don't even have anything fast, so meeting timing is not even a concern.


I already know that, functionally speaking, the code will work. Unit tests give me that confidence. The only difference between the "unit tests" on x86 and on ARM is that I swap in a device driver at the very last point, where individual bytes need to be received and transmitted. In C this is basically 3 lines of code, and it would need to be debugged "by hand" on actual hardware at some point.

That's all fine; the trouble is that those unit tests don't say anything about how fast the code runs. I don't want to have 10 microseconds of processing between bytes due to software overhead. That's what I want to benchmark and keep track of, so that changes in this abstract code don't accidentally ruin driver performance.
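Keeping track of such figures between changes could be as simple as comparing each build against a stored baseline. The sketch below assumes the CI job has already collected the metrics into dictionaries; the metric names, the numbers and the 5% tolerance are placeholders rather than measured values.

```python
def find_regressions(baseline, current, tolerance=0.05):
    """Return {metric: fractional growth} for metrics that grew beyond tolerance."""
    regressions = {}
    for name, old in baseline.items():
        new = current.get(name)
        if new is None or old == 0:
            continue  # metric missing in this build, or no meaningful ratio
        growth = (new - old) / old
        if growth > tolerance:
            regressions[name] = growth
    return regressions


# Placeholder figures for one hypothetical test.
baseline = {"i2c_frame_tx.cycles": 400, "i2c_frame_tx.flash": 680}
current = {"i2c_frame_tx.cycles": 452, "i2c_frame_tx.flash": 684}

regressions = find_regressions(baseline, current)
for name, growth in regressions.items():
    print(f"{name}: +{growth:.1%}")
```

A relative threshold suits a "figure of merit" better than an absolute one: the numbers only need to be comparable from build to build, not accurate in microseconds.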

This test says nothing at all about whether deadlines will be met in a real-time system.

Quote from: westfw on May 14, 2018, 01:37:42 am
Keil has a simulator.  I don't recall how well it does at measuring cycle-counts.
All the "real" chips that I've seen have bus structures and "flash accelerators" that I don't think are even documented well enough to predict performance, much less write a cycle-accurate simulator.  "Set the flash memory controller for 5 wait states at 84MHz. Maybe the accelerator will return data faster than that, maybe not."  :-(

For a "figure of merit", isn't a non-cycle-accurate simulator good enough?   You want to see if your code is "tight", not necessarily whether it's maximally avoiding pipeline stalls and hitting caches...


I guess I phrased "cycle accurate simulator" a bit imprecisely. If I have a simulator that just counts the number of instructions executed, that's also fine. I don't mind neglecting the 1-cycle wait state here and there, as long as the measured value is proportional to the amount of work the program does.



I will take a look at the suggestions later today. Hopefully something is out there that can also be automated and controlled via some command line/scripting on a Linux box.
 

Offline ataradov

  • Super Contributor
  • ***
  • Posts: 11228
  • Country: us
    • Personal site
Re: ARM Cortex series simulator
« Reply #15 on: May 14, 2018, 08:06:27 am »
For Cortex-M0 I have created a simulator as part of the wireless network simulator project - https://github.com/ataradov/netsim/blob/master/netsim/core.c . The code is simple enough that it can be modified to produce any statistics you like. And if you remove all the wireless simulation stuff, the whole working SoC simulator will be just a few files. It is not much and does not do what you want out of the box, but can be a good start if nothing better comes along.
Alex
 

Online hans (Topic starter)

  • Super Contributor
  • ***
  • Posts: 1626
  • Country: nl
Re: ARM Cortex series simulator
« Reply #16 on: May 14, 2018, 12:49:27 pm »
Thanks Alex, that looks really quite usable actually. It seems I only needed to fill in the core object and call core_clk until the program has "finished".

A simple modification to the bkpt instruction and it now acts as a cycle counter. Bkpt #0 starts the counter, bkpt #1 stops execution and prints the passed cycles, and I could probably use bkpt #2 to also abort if an assertion has failed.
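The idea can be modelled in a few lines. The sketch below is a toy model in Python, not the actual change to netsim (which is written in C): the class `CycleProbe`, its method names and the one-cycle-per-opcode driver loop are all made up for illustration. The only hardware fact it relies on is the 16-bit Thumb encoding of BKPT, which is 0xBE00 with the immediate in the low byte.

```python
class CycleProbe:
    """Toy model of a BKPT-driven stopwatch hooked into a simulator's step loop."""

    def __init__(self):
        self.cycles = 0       # incremented once per simulated clock
        self.start = None     # cycle count latched at BKPT #0
        self.elapsed = None   # cycles between BKPT #0 and BKPT #1
        self.running = True   # cleared when the program asks to stop
        self.failed = False   # set by BKPT #2 (assertion failure)

    def tick(self):
        self.cycles += 1

    def on_opcode(self, opcode):
        """Inspect a fetched 16-bit Thumb opcode; return True if it was a BKPT."""
        if opcode & 0xFF00 != 0xBE00:   # Thumb BKPT is 0xBE00 | imm8
            return False
        imm = opcode & 0xFF
        if imm == 0:
            self.start = self.cycles
        elif imm == 1:
            self.elapsed = self.cycles - (self.start or 0)
            self.running = False
        elif imm == 2:
            self.failed = True
            self.running = False
        return True


# Fake program: NOP, BKPT #0, three NOPs, BKPT #1, NOP (0xBF00 is the Thumb NOP).
program = [0xBF00, 0xBE00, 0xBF00, 0xBF00, 0xBF00, 0xBE01, 0xBF00]

probe = CycleProbe()
pc = 0
while probe.running and pc < len(program):
    probe.tick()                  # assumption: every opcode costs one cycle
    probe.on_opcode(program[pc])
    pc += 1

print(probe.elapsed)
```

In this model the reported value covers the three NOPs plus the closing BKPT itself, so every measurement carries the same one-cycle offset, which does not matter when comparing builds against each other.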

Unfortunately it does mean that if I want to simulate Cortex-M3 processors, I would need to expand the simulation myself, but at least it's way closer than using AVRs, which are a completely different architecture.
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4003
  • Country: nz
 

Online splin

  • Frequent Contributor
  • **
  • Posts: 999
  • Country: gb
Re: ARM Cortex series simulator
« Reply #18 on: May 15, 2018, 02:00:15 pm »
Most M3 and M4 devices contain instruction trace functionality in ETM (Embedded Trace Macrocell), ITM (Instrumentation Trace Macrocell) and possibly ETB (Embedded Trace Buffer) units. While nowhere near as convenient as a simulator, they can provide non intrusive cycle accurate timing information in the face of cache misses, prefetchers, DMA, interrupts, internal bus contention etc. Most importantly, the device runs as your firmware actually configures it (clock sources/dividers, flash wait states etc.) rather than how your simulator believes it is configured.

I say 'can be non intrusive' as there are limitations due to the throughput rate of the trace port and the configuration of the trace modules, but it should be possible, for example, to trace all instructions which change the instruction flow without either stalling the CPU or losing trace data. You also have the option to run the device at a lower clock speed to allow more comprehensive trace logging (such as including data tracing) without losing trace data.

I believe some M0/M0+ devices support more limited tracing with a Micro Trace Buffer, but they should all support SWD and SWV, which can also be used for tracing.

Naturally, it seems that tools that support ETM in particular are expensive (Keil, Segger J-Trace etc.). Atollic is free and supports instruction trace, but still requires a suitable JTAG probe and is now only for ST devices. sigrok is also free, now includes ETM decoders and can be used with low-cost logic analyzers, but again speed may be an issue. There is also SWT (serial wire trace), which can be configured to output the PC every 64 clock cycles, which might be enough.

ST have a whitepaper on real-time tracing using SWV, SWD etc. here (registration required, unfortunately):

http://info.atollic.com/swv_event_data_tracing_whitepaper?hsCtaTracking=c92954ee-825c-40d4-994a-b785f7d7a5c6%7Ca9fb94cd-a492-4ce1-a7a8-1ca091fb448c

 

Online rsjsouza

  • Super Contributor
  • ***
  • Posts: 5980
  • Country: us
  • Eternally curious
    • Vbe - vídeo blog eletrônico
Re: ARM Cortex series simulator
« Reply #19 on: May 19, 2018, 12:01:09 pm »
Code Composer Studio version 5.5 has a cycle accurate Cortex M3 simulator. It hasn't been maintained in years, but it may help. The tool is free of charge.
http://processors.wiki.ti.com/index.php/Category:Simulation
http://processors.wiki.ti.com/index.php/Download_CCS#Code_Composer_Studio_Version_5_Downloads
 

