Author Topic: How are micros programmed in real world situations?  (Read 30336 times)


Offline sonnytigerTopic starter

  • Frequent Contributor
  • **
  • Posts: 277
  • Country: ca
  • Life is pretty enjoyable, eh?
How are micros programmed in real world situations?
« on: April 09, 2015, 10:16:42 am »
Do they use C? Or even assembler?
I ask because I really enjoy low level programming and am trying to find real life applications I can focus on.
 

Offline cyr

  • Frequent Contributor
  • **
  • Posts: 252
  • Country: se
Re: How are micros programmed in real world situations?
« Reply #1 on: April 09, 2015, 10:51:54 am »
In my experience it's pretty much all C, with perhaps a couple of lines of inline assembly here and there in very rare cases.
 

Online Psi

  • Super Contributor
  • ***
  • Posts: 9951
  • Country: nz
Re: How are micros programmed in real world situations?
« Reply #2 on: April 09, 2015, 10:53:11 am »
I would say over 98% of MCUs are programmed in some version of C.

Generally assembler is only used for 4 reasons.

Money
1) If you make the code as small and fast as possible, you can get more functionality out of cheaper chips.

Necessity
2) For very small/limited micros, like some of the really cut-down 4- or 8-bit chips, the micro is just too simple to support compiled code.

Speed
3) If you run out of cycles to perform a function at the needed speed, and you can't or don't want to change the hardware to something faster, you can recode the function in inline ASM for more speed.
Sometimes programmers are forced into this by management who won't agree to a new hardware revision but keep wanting more features added.

Timing
4) Time-critical applications. If you need absolute control over when things happen, you can code down at the ASM level, where you know exactly how long each instruction will take.
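
For illustration, a minimal sketch of what points 3 and 4 can look like with avr-gcc inline assembly (the pin and cycle counts here are arbitrary assumptions for the example):

Code: [Select]
#include <avr/io.h>

/* Each NOP takes exactly one CPU cycle on an AVR, so the pulse width below
 * is known to the cycle - something optimized C code cannot guarantee. */
static inline void delay_3_cycles(void)
{
    __asm__ __volatile__("nop \n\t"
                         "nop \n\t"
                         "nop \n\t");
}

int main(void)
{
    DDRB |= (1 << PB0);          /* PB0 as output */
    for (;;) {
        PORTB |= (1 << PB0);     /* drive pin high */
        delay_3_cycles();        /* precisely timed high period */
        PORTB &= ~(1 << PB0);    /* drive pin low */
    }
}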
« Last Edit: April 09, 2015, 11:00:32 am by Psi »
Greek letter 'Psi' (not Pounds per Square Inch)
 

Offline EEVblog

  • Administrator
  • *****
  • Posts: 37740
  • Country: au
    • EEVblog
Re: How are micros programmed in real world situations?
« Reply #3 on: April 09, 2015, 11:00:23 am »
As others said, mostly C, with some inline assembly here and there for speed.
 

Offline sonnytigerTopic starter

  • Frequent Contributor
  • **
  • Posts: 277
  • Country: ca
  • Life is pretty enjoyable, eh?
Re: How are micros programmed in real world situations?
« Reply #4 on: April 09, 2015, 11:10:07 am »
Thanks for the reply guys, that makes sense. What are the most common micros in real life projects? Microchip or Atmel? or maybe both?

I didn't expect a reply from you Dave, I appreciate it.
 

Online Psi

  • Super Contributor
  • ***
  • Posts: 9951
  • Country: nz
Re: How are micros programmed in real world situations?
« Reply #5 on: April 09, 2015, 11:11:38 am »
Thanks for the reply guys, that makes sense. What are the most common micros in real life projects? Microchip or Atmel? or maybe both?

I didn't expect a reply from you Dave, I appreciate it.

Yep, PIC and AVR are used, but there are lots more. STM32 is common, as are LPC and loads of others.

http://www.eetimes.com/document.asp?doc_id=1261398
« Last Edit: April 09, 2015, 11:13:09 am by Psi »
Greek letter 'Psi' (not Pounds per Square Inch)
 

Offline sonnytigerTopic starter

  • Frequent Contributor
  • **
  • Posts: 277
  • Country: ca
  • Life is pretty enjoyable, eh?
Re: How are micros programmed in real world situations?
« Reply #6 on: April 09, 2015, 11:15:44 am »
And what kind of setup would you use to program these micros? Just the official dev kit and/or programmer from the respective company?
 

Offline mikerj

  • Super Contributor
  • ***
  • Posts: 3240
  • Country: gb
Re: How are micros programmed in real world situations?
« Reply #7 on: April 09, 2015, 11:41:29 am »
And what kind of setup would you use to program these micros? Just the official dev kit and/or programmer from the respective company?

You can use a cheap debugger, or an eval board that includes one, for programming during development, but you'd typically use a good-quality production programmer for devices in production.  Unlike the cheap debuggers, these have protected outputs so that board faults, misaligned connectors, etc. don't blow up the programmer.

Some production programmers will also perform verification at multiple supply voltages.
 

Offline sonnytigerTopic starter

  • Frequent Contributor
  • **
  • Posts: 277
  • Country: ca
  • Life is pretty enjoyable, eh?
Re: How are micros programmed in real world situations?
« Reply #8 on: April 09, 2015, 11:49:32 am »
Good to know. What are some super common applications? I know micros are used in pretty much everything, but what are some really common ones?


EDIT: And for the Microchip parts, is the PICkit 3 a good place to start? How about the AVR Dragon for Atmel?
« Last Edit: April 09, 2015, 12:03:04 pm by sonnytiger »
 

Offline ehughes

  • Frequent Contributor
  • **
  • Posts: 409
  • Country: us
Re: How are micros programmed in real world situations?
« Reply #9 on: April 09, 2015, 12:16:05 pm »
http://www.slideshare.net/StephanCadene1/2014-embeddedmarketstudythennowwhatsnext


C is it. Most of the responses that weren't C were not for microcontrollers, but rather for embedded "PCs".
 

Offline sonnytigerTopic starter

  • Frequent Contributor
  • **
  • Posts: 277
  • Country: ca
  • Life is pretty enjoyable, eh?
Re: How are micros programmed in real world situations?
« Reply #10 on: April 09, 2015, 12:33:59 pm »
Embedded PCs? Even for automotive and Industrial Control? Wouldn't Industrial Control be largely PLCs?
 

Online Psi

  • Super Contributor
  • ***
  • Posts: 9951
  • Country: nz
Re: How are micros programmed in real world situations?
« Reply #11 on: April 09, 2015, 12:41:23 pm »
Arduino is a simple place to start.

But if you want a real micro get a STM32F103 discovery board for ~$11 and install the free EmBlocks IDE/toolchain.
(EmBlocks works out of the box for STM32 discovery boards)
Greek letter 'Psi' (not Pounds per Square Inch)
 

Offline dannyf

  • Super Contributor
  • ***
  • Posts: 8221
  • Country: 00
Re: How are micros programmed in real world situations?
« Reply #12 on: April 09, 2015, 01:16:49 pm »
Quote
Do they use C? Or even assembler?

If you look at the installed base, I would say that a substantial portion (maybe even a majority) of MCUs are programmed in assembly - think about those little guys in your coffee machine, calculators, watches, washers/dryers, window motors, MP3 players, toys, etc. If you talk to those Taiwanese guys, they won't talk to you unless you buy 100k+ pcs/yr. Those are tiny MCUs, and it is hard to imagine them being programmed in C.

However, most of the coding in MCU land is done in C: the MCUs in cars, routers, WiFi devices, embedded appliances, etc. These are smaller in quantity per project, but most of them are programmed in C to lower software costs.

So the answer depends on the question, duh!
================================
https://dannyelectronics.wordpress.com/
 

Offline bktemp

  • Super Contributor
  • ***
  • Posts: 1616
  • Country: de
Re: How are micros programmed in real world situations?
« Reply #13 on: April 09, 2015, 01:46:01 pm »
Quote
Do they use C? Or even assembler?

If you look at the installed base, I would say that a substantial portion (maybe even a majority) of MCUs are programmed in assembly - think about those little guys in your coffee machine, calculators, watches, washers/dryers, window motors, MP3 players, toys, etc. If you talk to those Taiwanese guys, they won't talk to you unless you buy 100k+ pcs/yr. Those are tiny MCUs, and it is hard to imagine them being programmed in C.
For really simple high-volume stuff maybe, but not for most things.
I have seen source code for some cheap video games, LCD monitors and some media player ICs. All of it was C. The main function is done in hardware; all the software has to do is adjust the settings depending on the input data and then let the dedicated hardware do its job. In MP3 and video players you have GBs of flash, so nobody cares about code size. Even in white goods, modern devices often offer complex menus in multiple languages and GUIs with touch screens and lots of graphics that take many hundreds of kilobytes or even megabytes.
 

Offline sonnytigerTopic starter

  • Frequent Contributor
  • **
  • Posts: 277
  • Country: ca
  • Life is pretty enjoyable, eh?
Re: How are micros programmed in real world situations?
« Reply #14 on: April 09, 2015, 01:51:56 pm »
Arduino is a simple place to start.

But if you want a real micro get a STM32F103 discovery board for ~$11 and install the free EmBlocks IDE/toolchain.
(EmBlocks works out of the box for STM32 discovery boards)

I have done lots of stuff for Arduino, but I am more interested in learning how to use something that I would see in a professional setting.
So it depends, that makes sense.
That sounds like some interesting stuff to see!
 

Offline Wilksey

  • Super Contributor
  • ***
  • Posts: 1329
Re: How are micros programmed in real world situations?
« Reply #15 on: April 09, 2015, 02:52:46 pm »
Purely depends on your industry.

As for best practices, most of them get ignored; in my experience it's really only the smaller companies that try to abide by them.

I work in a medium-sized company and we have shipped code in Arduino format; the hex file that gets produced just gets written directly to the Atmel chip. We use a range of MCUs and FPGAs/CPLDs - it depends on what the engineer's favourite flavour of chip is, it's that simple.

Mainly C; I use some inline assembler for speed and timing occasionally, if necessary.

PICkit 3, PM3, or ICD 3 is what we use - PK3 and PM3 for field programming, although bootloaders can be of assistance here. For our latest "product" I have created a wireless bootloader, which dumps the program into E2 before verifying it and programming the chip; it also takes a backup of the program memory into the E2 at a different location - belt and braces as much as it can be, as GSM isn't 100% reliable.  I had to modify the linker and create "modules" and almost an API, as the interrupts were becoming a nightmare; I have separate interrupts and a function pointer for each mode. It sounds complicated and unnecessary, I know, but trust me, it was the only way I could get it to work reliably!

Can't really go into what it was, but sensors, automation and control systems, CCTV and signage is the game.
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
Re: How are micros programmed in real world situations?
« Reply #16 on: April 09, 2015, 03:11:35 pm »
EDIT: And for the Microchip parts, is the PICkit 3 a good place to start? How about the AVR Dragon for Atmel?
I wouldn't start with those. Old and antiquated. Better to get something with an ARM CPU under the hood. Having enough processing power makes life much easier when doing timing-critical stuff because you don't have to care so much. Unless you are doing high-volume designs (>10000 pieces), engineering costs dominate component cost.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline sonnytigerTopic starter

  • Frequent Contributor
  • **
  • Posts: 277
  • Country: ca
  • Life is pretty enjoyable, eh?
Re: How are micros programmed in real world situations?
« Reply #17 on: April 09, 2015, 03:13:08 pm »
Great info, thanks! I'll probably have to use the PICkit because the others are a little too expensive...

What dev boards/programmers does Atmel have for the ARM-based MCUs?
 

Offline AndyC_772

  • Super Contributor
  • ***
  • Posts: 4228
  • Country: gb
  • Professional design engineer
    • Cawte Engineering | Reliable Electronics
Re: How are micros programmed in real world situations?
« Reply #18 on: April 09, 2015, 03:27:21 pm »
I have done lots of stuff for Arduino, but I am more interested in learning how to use something that I would see in a professional setting.

My choice of processor is usually guided by the choice of peripherals I need, and the operating voltage of the circuit.

If, say, you're building something that needs to be able to drive out 5V, then it can save space and cost if you can just use a 5V capable microcontroller, rather than having to provide 3.3V and some amplifiers or level shifters.

Or, perhaps the application could be well implemented with a slightly unusual peripheral that's only available in a particular MCU. That may make it a good choice, even if there are compromises elsewhere.

Personally I've yet to come across an MCU requirement that couldn't be met by an 8 or 16 bit PIC, or an STM32. Since I know these families reasonably well, they'd be my first port of call. I've only used other families when they've been prescribed for non-technical reasons.

Offline Stonent

  • Super Contributor
  • ***
  • Posts: 3824
  • Country: us
Re: How are micros programmed in real world situations?
« Reply #19 on: April 09, 2015, 03:32:53 pm »
Many embedded PCs are not much different from regular ones (assuming we're talking about x86).

Several of the HMI panels I've worked with that aren't made by a dedicated company like Allen-Bradley are basically a low-profile PC motherboard with a flash drive running Windows 7 Embedded.

Some of the bigger names like Allen-Bradley use panels that run Windows CE and are designed to run compiled applications built with their software or Visual Studio.

The PLCs themselves tend to run some kind of RTOS that isn't really visible to the user.  In fact I think many of them may run two operating systems simultaneously (perhaps on separate processors). The RTOS does the real-time I/O switching, and the second OS is more conventional and operates the peripherals and the embedded HTTP server. That way any networking events only affect the second OS and the RTOS doesn't get bothered. (All speculation, but that's how I would do it.)
The larger the government, the smaller the citizen.
 

Offline Howardlong

  • Super Contributor
  • ***
  • Posts: 5319
  • Country: gb
Re: How are micros programmed in real world situations?
« Reply #20 on: April 09, 2015, 03:48:03 pm »
For programming devices, when ordering 1000s of devices, I usually use Microchip's own programming service, turnaround is about a week. However, sometimes Microchip lack stock so I purchase the devices from whoever I can find with inventory and program them up myself using a customised ZIF socket and either a RealICE or an ICD 3 and a foot switch to apply power. But believe me, programming 1,000 QFN devices one at a time in a ZIF socket is not my idea of fun.

For NXP's LPC lineup, you can use an LPC-Link2 for debugging and programming; they're only EUR15, and I have no idea how they can make them for that price - there is certainly no profit in it. If you buy two, you will have both a debugger and a target: their top-of-the-range triple-core LPC4370 is the MCU on that board.
 

Offline Stonent

  • Super Contributor
  • ***
  • Posts: 3824
  • Country: us
Re: How are micros programmed in real world situations?
« Reply #21 on: April 09, 2015, 04:02:11 pm »
Oh, back on topic for me: Mike from Mike's Electric Stuff uses the PICkit's ability to hold the firmware in memory - he just presses it down onto the programming header or pads on the board, clicks the button and moves on to the next board.

At 24:13 he starts talking about it (direct link to that time: https://youtu.be/cQWnLOCGpXM?t=1453 ).

« Last Edit: April 09, 2015, 04:14:27 pm by Stonent »
The larger the government, the smaller the citizen.
 

Offline Jeroen3

  • Super Contributor
  • ***
  • Posts: 4078
  • Country: nl
  • Embedded Engineer
    • jeroen3.nl
Re: How are micros programmed in real world situations?
« Reply #22 on: April 09, 2015, 06:33:01 pm »
Typically you create code for micros using a C cross-compiler. But you can get Java (please don't), and there are startups trying to get JavaScript onto MCUs for quick learning.
You can also use C++, which has the benefit of OO programming (http://en.wikipedia.org/wiki/Object-oriented_programming), which can be really helpful for scalability.
Some people will try to tell you that C++ is not suitable for MCU use; they might get all heated about it. Please try it yourself. And if you're fluent in C, you can even do OO-style programming in plain C.
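
For illustration, a minimal sketch of the OO-in-C idea - a "class" as a struct of data plus function pointers (all names here are made up for the example):

Code: [Select]
#include <stdio.h>

/* A tiny "driver object": data plus function pointers, so different
 * back-ends can sit behind the same interface. */
typedef struct uart uart_t;
struct uart {
    void (*write_byte)(uart_t *self, unsigned char b);
    unsigned long baud;
};

static void debug_write_byte(uart_t *self, unsigned char b)
{
    (void)self;
    putchar(b);                 /* stand-in back-end: print to stdout */
}

static void uart_puts(uart_t *self, const char *s)
{
    while (*s)
        self->write_byte(self, (unsigned char)*s++);  /* "virtual" call */
}

int main(void)
{
    uart_t dbg = { debug_write_byte, 115200 };
    uart_puts(&dbg, "hello\n");
    return 0;
}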

Some compilers people use:
  • One of the many variants of GCC, which is free if time is not a currency.
  • Keil
  • Tasking
  • IAR
Eclipse is not a compiler, just an IDE (Integrated Development Environment) I'd prefer not to use.

There are also many architectures, but the most popular ones are:
- ARM Cortex and regular ARM processors.
- Atmel (they have several)
- PIC
- 8051.
There are some manufacturers with more proprietary stuff, such as TI or Freescale, but most of the market will be ARM, 8051 and some Atmel/PIC.

Personally I'd stick with ARM Cortex; they're relatively easy to use across the entire family, from M0 to M4, though there is a short but steep learning curve. There are plenty of other people developing for them, so the internet is full of material.
You can also easily switch brands if you don't like the peripherals in your current chip, without learning a new architecture or compiler, provided you're using Keil or something else generic.
 

Offline Muxr

  • Super Contributor
  • ***
  • Posts: 1369
  • Country: us
Re: How are micros programmed in real world situations?
« Reply #23 on: April 10, 2015, 04:18:07 am »
Arduino is a simple place to start.

But if you want a real micro get a STM32F103 discovery board for ~$11 and install the free EmBlocks IDE/toolchain.
(EmBlocks works out of the box for STM32 discovery boards)
I am confused. Real micro as opposed to Micros Arduino supports? Are AVRs, SAM3X8E and MK20DX256VLH7 not real micros?

EmBlocks only runs on Windows.
 

Online westfw

  • Super Contributor
  • ***
  • Posts: 4199
  • Country: us
Re: How are micros programmed in real world situations?
« Reply #24 on: April 10, 2015, 07:30:08 am »
Quote
And what kind of setup would you use to program these micros? Just the official dev kit and/or programmer from the respective company?
You have to separate "development" programming from "production" programming.  A development engineer would probably use some official programmer/debugger, like the PICkit 3, Atmel ICE (which is the newer version of the Dragon, BTW), a Segger JTAG, or whatever.  A "development board" from a vendor might be used for some initial experimentation, but you'd quickly (?) transition to your own hardware design.

When the software is complete and/or it's time to ship the product, you'd create a distributable object file (.elf, .hex, whatever), and hand that off to the manufacturing department, who might have their own idea how to burn that file into the finished product (gang programmers capable of burning multiple chips at once, bed-of-nails fixtures that you can push a board into and get it programmed, specialized connectors and high-speed programmers...)

At a small company, "development" and "manufacturing" might be the same person, and the difference just a matter of ... focus.
When you're developing, it's all about the turnaround time to fix/build/program/test new code.  By the time you're manufacturing, it's about throughput in products per hour, and you want the programming step to be quick and simple, so you can sit there and churn out a week's worth of orders.  (I remember an early contribution I made at one startup was PostScript code that printed labels for the EPROMs.  It was much better than the hand-scrawled labels... :-) )

Quote
I have done lots of stuff for Arduino, but I am more interested in learning how to use something that I would see in a professional setting.
If you have experience with Arduino, I don't see why using the Arduino board with "more serious" tools isn't a good next step.  Try one (or many) of the alternative Arduino IDEs.  Use the command line and learn to write Makefiles.  Write your own libraries.  Modify the core.  Add support for some "close" chip that isn't well supported.  Port as much as you can to some completely different chip family.  Fix some of the bugs or add the features that the Arduino team(s) are slow or unwilling to implement...
 

Online VEGETA

  • Super Contributor
  • ***
  • Posts: 1952
  • Country: jo
  • I am the cult of personality
    • Thundertronics
Re: How are micros programmed in real world situations?
« Reply #25 on: April 11, 2015, 10:22:06 am »

Extremely low cost: PICs are hard to beat on price. The really cheap ones are a bugger to work with though, and the MPLAB software is a bit naff. If you go this route try to stick to the PIC24 and PIC32 lines if possible, as they are much easier and nicer than the crappy old PIC12, PIC16 and PIC18 ranges.


So working with the PIC24 and PIC32 is easier than just the PIC16? Can you explain this a bit, please? I've worked with some PIC16 stuff but haven't tried the others yet. But I am interested in learning about the 18, 24, and 32. How about the dsPIC33?

__________

I have tried PICs and found them nice with the PICkit 3. I am interested in Cypress PSoC and Renesas. I haven't tried working with an ARM-based MCU, and I wonder if I can work with them as easily as using MPLAB with the MCU's datasheet... or is more effort needed?

I read a lot about Renesas and their great MCUs, what do you think? They have CISC and RISC ones, but since I am gonna be using C all the time, what is the difference here?


thanks!

Offline dom0

  • Super Contributor
  • ***
  • Posts: 1483
  • Country: 00
Re: How are micros programmed in real world situations?
« Reply #26 on: April 11, 2015, 10:43:31 am »
More powerful chips often come with some sort of low level standard library to avoid manually programming the hundreds to thousands of registers.

Thanks for the reply guys, that makes sense. What are the most common micros in real life projects? Microchip or Atmel? or maybe both?

8051 derivatives are extremely common. ARM chips, too. AVR and PIC not so much, but are seen, sometimes.
 

Offline dannyf

  • Super Contributor
  • ***
  • Posts: 8221
  • Country: 00
Re: How are micros programmed in real world situations?
« Reply #27 on: April 11, 2015, 10:50:05 am »
Quote
So working with PIC24 and PIC32 is easier than just PIC16?

I use a lot of PIC24s (no PIC32 at all). Compared to the PIC12/16/18 chips, the 24s have more capable peripherals, bigger memories and are generally faster. But three things stand out for the 24 vs. the 12/16/18:

1) vectored interrupt controller: the co-mingled 1/2 interrupt vectors on the 12/16/18 make modular programming hard;
2) remappable pins: the best of any chips I have seen. With some exceptions, you can pretty much re-route any peripheral function to any pin. Fabulous.
3) consistent peripherals: many of the peripherals in the 24 family are identical or very similar, so once you have written code for one, you have pretty much written it for them all - code is highly portable.

They are my favorite chips.
================================
https://dannyelectronics.wordpress.com/
 

Online hans

  • Super Contributor
  • ***
  • Posts: 1640
  • Country: nl
Re: How are micros programmed in real world situations?
« Reply #28 on: April 11, 2015, 12:58:40 pm »
The PIC16 and PIC18 share the same XC8 compiler, which is the HiTech compiler under the hood. It's a completely proprietary toolchain, which in itself is OK, although it lacks some very basic functionality and support (like C99). Sometimes you get an error like "cannot generate code for this expression", or the compiler will not optimize an all-constant expression built from a few parameters, but will instead generate software LUTs and evaluate the expression at runtime.

Moreover, the way the 8-bit PIC cores work is a pain. The hardware stack can be really limiting, both in terms of call depth and what your code can and can't do. Call the same function from inside an interrupt and in your main program? In the worst case the compiler will have to compile that function twice, because there is no fast way to "push a variable onto the stack". Stack variables are statically allocated, and the compiler does some call analysis to build a call tree and re-use RAM locations where it can. However, it has to assume an interrupt can fire at any moment, so for a function reachable from both the ISR and main it cannot re-use those locations; it sometimes needs to create two separate copies of the function, each using variables at different RAM locations. The HiTech compiler is very specifically tailored to the PIC16/18 platform, and does the best job it can, but it can't do magic and make the limitations disappear at no cost.
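
For illustration, the pattern being described - one helper called from both main-line code and the ISR - might look like this (the interrupt syntax shown is for recent XC8 versions, older ones use "void interrupt isr(void)"; everything else is a made-up placeholder):

Code: [Select]
#include <xc.h>

volatile unsigned int latest;

/* Locals like 'tmp' are statically allocated on the 8-bit PICs. Because an
 * interrupt may fire at any time, the compiler cannot share them between the
 * ISR's call tree and main's, so it may emit two copies of scale(). */
static unsigned int scale(unsigned int raw)
{
    unsigned int tmp = raw >> 2;
    return tmp + 10u;
}

void __interrupt() isr(void)          /* interrupt syntax for recent XC8 */
{
    latest = scale(latest);           /* call #1: interrupt context */
    /* ...clear the relevant interrupt flag here... */
}

void main(void)
{
    for (;;) {
        unsigned int x = scale(42u);  /* call #2: main-line context */
        (void)x;
    }
}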

Having a vectored interrupt controller would make this issue even worse, especially when interrupts can be nested (which is what you want if you're dealing with vectored ISRs and priorities). In that case the compiler would have to re-compile code for each ISR and statically allocate the stack variables for it.
Dealing with pointers is also quite slow, because there really isn't a native way to work with pointers in the instruction set. It is done with extra core registers that allow indirect RAM access. This in turn means that most pointer accesses on PIC16/18 will easily generate 3+ instructions' worth of code, where on practically any other core (MSP430, PIC24, AVR, ARM, MIPS, etc.) it can be done in a single instruction.

So a big proportion of applications will run on a PIC16 or PIC18, just not at very high speed. That can be fine if all you do is read an I2C sensor and switch a relay on a threshold. But as soon as you're dealing with complex protocols (the "connected world") that need packets, data buffers, pointers, etc., code can run rather slowly. In that case I would quickly go to a PIC24 or ARM chip, because I don't have to fear those performance limitations as quickly.
« Last Edit: April 11, 2015, 01:00:33 pm by hans »
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
Re: How are micros programmed in real world situations?
« Reply #29 on: April 11, 2015, 01:15:51 pm »
ARMs are a bit harder to work with than 8 and 16 bit micros
Please stop with this nonsense. Getting an ARM going is just as hard as getting an 8 bit microcontroller going.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline Howardlong

  • Super Contributor
  • ***
  • Posts: 5319
  • Country: gb
Re: How are micros programmed in real world situations?
« Reply #30 on: April 11, 2015, 01:53:17 pm »
Regarding PICs, it largely depends on the job at hand which one you go for. It is absolutely true that the PIC10/12/16/18 series suffer due to their architecture, but sometimes that's all that's needed. I wrote a real time battery powered satellite tracker for a PIC18 about a decade ago... just as smartphones were about to hit the scene and rendered many vertical hardware applications like this largely irrelevant! (http://www.g6lvb.com/articles/LVBTracker2) Trying to shoehorn floating point orbital mechanics iterative numerical methods into a PIC18 was fun both in terms of compiler limitations and speed. The compiler I used did not automatically manage the bank allocation of statics, I had to do the allocation manually.

The PIC24/dsPIC series introduced around 2006 were a serious step up. I remember getting some engineering sample parts back then; it was a pretty amazing device with a proper stack and decent addressing modes, without the need to be continually bank switching (there is still bank switching in the PIC24/dsPIC devices, but it's used far less than in the PIC10/12/16/18 series, where all but the most simple task will need it). Much of the PIC24/dsPIC core is based on the PIC16 at a very fundamental level, but the wider busses make life a lot easier. Also, the PIC24/dsPIC benefit from a two-phase rather than a four-phase clock, which speeds things up.

The PIC32MX was introduced about 2009, and this was a departure from the proprietary core, using MIPS. This introduced some additional complexity regarding cache and clocking. With the PIC32MX1xx/2xx series, these cache related issues can be ignored, as in these devices the flash runs at the full core rate: these make a very good start for PIC32. The PIC32MZ announced in 2013 is still in a state of flux: it was announced and released far too early, and not only has the silicon been full of bugs, Microchip decided to completely change the programming paradigm around with the introduction of a new programming framework, namely Harmony, another "work in progress" making backward compatibility impossible: as a result everything for the PIC32MZ will need to be a re-write. Whether Microchip ever manage to gain much traction in the mid range 32 bit space as a result remains to be seen, but I am sceptical following these two unmitigated disasters, with many previous PIC32 aficionados having jumped ship following these latest debacles.

The good thing about PICs is that they have roughly similar peripherals, so there are few surprises as you go up through the families. Yes, there are differences even between PICs in the same family, but they are usually manageable and generally maintain backwards compatibility.

The benefit of starting with a simple device like a PIC16 is that you are inevitably constrained in what you need to learn. Almost everything's in a single datasheet. When you move to the PIC24/PIC32, they have split everything up and you have no choice but to continually cross-reference a dozen or so documents.
 

Offline andersm

  • Super Contributor
  • ***
  • Posts: 1198
  • Country: fi
Re: How are micros programmed in real world situations?
« Reply #31 on: April 11, 2015, 02:46:20 pm »
I read a lot about Renesas and their great MCUs, what do you think? They have CISC and RISC ones, but since I am gonna be using C all the time, what is the difference here?
Renesas have basically pared down their portfolio to three lines: RL78 is their 8-bit MCU targeted at the low-end, distantly related to the Z80. RX is a 32-bit CISC MCU, covering basically the same markets as the Cortex-M3/4/7. Lastly there's the RZ series of ARM Cortex-A applications processors (there's also V850 for automotive). Development of the other lines has more or less stopped, and should probably be avoided for new applications.

RISC (or rather load-store architectures) produce slightly larger code, otherwise the difference is pretty much irrelevant.

Offline zapta

  • Super Contributor
  • ***
  • Posts: 6190
  • Country: us
Re: How are micros programmed in real world situations?
« Reply #32 on: April 11, 2015, 02:54:59 pm »
ARMs are a bit harder to work with than 8 and 16 bit micros
Please stop with this nonsense. Getting an ARM going is just as hard as getting an 8 bit microcontroller going.

I recently switched to an ARM M0 (NXP LPC11U35) and it's actually easier to use than the ATmega328P I used before.

Uniform memory space (no distinction between flash strings and RAM strings), orthogonal memory-mapped I/O (easy to manipulate with simple indexing), a better free IDE (Eclipse/LPCXpresso: a single-package IDE/toolchain install that supports all three platforms), doesn't require additional hardware or software to program (just drag and drop the binary file onto the USB/ISP virtual disk), compatible with an optional $25 hardware debugger, small footprint (5x5mm), runs at 48MHz, and requires very few extra parts. Life is good.

https://github.com/zapta/arm/blob/master/pro-mini/board/arm-pro-mini-schematic.pdf
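
As an illustration of the memory-mapped I/O point, a minimal sketch of register-level GPIO access on a Cortex-M0 class part (the register layout and base address below are simplified assumptions, not the real LPC11U35 map - a real project would use the vendor's CMSIS header):

Code: [Select]
#include <stdint.h>

/* Hypothetical GPIO port: a handful of 32-bit registers at a fixed address,
 * manipulated directly through a pointer to a volatile struct. */
typedef struct {
    volatile uint32_t DIR;   /* direction: 1 = output       */
    volatile uint32_t SET;   /* write 1 to drive a pin high */
    volatile uint32_t CLR;   /* write 1 to drive a pin low  */
} gpio_port_t;

#define GPIO_PORT0 ((gpio_port_t *)0x50000000u)  /* placeholder base address */

int main(void)
{
    GPIO_PORT0->DIR |= (1u << 7);        /* pin 7 as output */
    for (;;) {
        GPIO_PORT0->SET = (1u << 7);     /* high */
        GPIO_PORT0->CLR = (1u << 7);     /* low  */
    }
}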

 

Online VEGETA

  • Super Contributor
  • ***
  • Posts: 1952
  • Country: jo
  • I am the cult of personality
    • Thundertronics
Re: How are micros programmed in real world situations?
« Reply #33 on: April 11, 2015, 04:01:20 pm »
I read a lot about Renesas and their great MCUs, what do you think? They have CISC and RISC ones, but since I am gonna be using C all the time, what is the difference here?
Renesas have basically pared down their portfolio to three lines: RL78 is their 8-bit MCU targeted at the low-end, distantly related to the Z80. RX is a 32-bit CISC MCU, covering basically the same markets as the Cortex-M3/4/7. Lastly there's the RZ series of ARM Cortex-A applications processors (there's also V850 for automotive). Development of the other lines has more or less stopped, and should probably be avoided for new applications.

RISC (or rather load-store architectures) produce slightly larger code, otherwise the difference is pretty much irrelevant.

So CISC and RISC make no difference when working in C, except for the resulting code size?

V850 for automotive... now, what makes it automotive? What features? There is also SuperH.

I've seen some great kits on their website, such as the starter kit and the display one... but the price seems high. What is the kit to use for the RZ ones, and at what price? I'd like a low-price one if possible. And I figure the IDE is the one based on Eclipse.


Offline andersm

  • Super Contributor
  • ***
  • Posts: 1198
  • Country: fi
Re: How are micros programmed in real world situations?
« Reply #34 on: April 11, 2015, 04:37:51 pm »
So CISC and RISC will not have a difference if working with C except the resulting code size?
Not really, no.

Quote
V850 for automotive... now what makes it for automotive? what features? there is also SuperH.
I misremembered, it's the RH850 series that's marketed exclusively for automotive. SuperH is another dead-end architecture. A good sign is to look at tool support. E2 Studio (their newest, Eclipse-based IDE) supports RL78, RX, RZ, and SH-2/2A (and apparently you can debug RH850 code with it). The rest are supported only in HEW (old, last updated in 2012) and CubeSuite (not marketed outside Asia).

Another example is the GNU toolchains provided by KPIT (I believe under contract to Renesas). Only the RL78, RX, RZ and V850 toolchains are marked as active; the rest are considered legacy.

Quote
what is the kit to use for RZ ones and what price? I'd like a low price one if possible.  and i figure the IDE is the one based on Eclipse.
No idea. This one seems cheap.

Online VEGETA

  • Super Contributor
  • ***
  • Posts: 1952
  • Country: jo
  • I am the cult of personality
    • Thundertronics
Re: How are micros programmed in real world situations?
« Reply #35 on: April 11, 2015, 04:56:11 pm »
Seems that the RZ/A1 is the most popular and good. E2 Studio works well with it too... but I need an official dev kit. Starter kits are pricey! One of them is 1.1k! The HMI one is great o_O

Hmmm, working with the bare chip is another option, but not good for starters. Especially since their programmers are also VERY pricey! Not like a PICkit 3 or something. A very hard choice just to learn the thing.



Offline andersm

  • Super Contributor
  • ***
  • Posts: 1198
  • Country: fi
Re: How are micros programmed in real world situations?
« Reply #36 on: April 11, 2015, 05:59:50 pm »
but I need an official dev kit.
Why?

Quote
Hmmm working with the bare chip is another option but not good for starters. Especially that their programmers are also VERY pricey! not like PICKit3 or something. Very hard choice just to learn the thing.
The RZ/A are not suitable for beginners. Download the 3100+ page hardware manual and consider really carefully if you want to "just learn the thing." It's an applications processor mainly intended for running Linux or some other "real" operating system. As for debuggers, as an ARM CPU it supports the standard ARM debug interface. If you really want to use E2 Studio, I'm fairly sure it supports Segger's J-Link for debugging RZ/As, but if you're going to be writing Linux applications you'll be debugging using gdb over Ethernet anyway.

I found an Arduino-like board that's supported in the mbed environment, but it looks like it's not for sale yet. It seems there are also several boards from Japanese manufacturers if you search around a bit.

Online VEGETA

  • Super Contributor
  • ***
  • Posts: 1952
  • Country: jo
  • I am the cult of personality
    • Thundertronics
Re: How are micros programmed in real world situations?
« Reply #37 on: April 11, 2015, 06:21:16 pm »
Well, no... I am not interested in Linux. I read that it supports an RTOS (like FreeRTOS). So, is it that much different from just programming in C, like on a PIC? I don't know how to use an RTOS yet.

I don't know what project I will do with it yet; I just want to check it out and learn it. The other thing is that it is an ARM-based MCU, which I have never used before. Their HMI stuff is great too. In general, I feel their MCUs are higher quality than most others.

I've used a PIC16 with MPLAB X and a PICkit 3. I want to try ARM-based 32-bit MCUs and found that Renesas is the best. I don't know why you are making it sound extremely hard to learn, as I've read a lot online and on this forum saying that using ARM is not that hard at all (not implying that it is extremely easy either).

Looking at their prices, it looks like they don't care about low-end customers like makers or hobbyists, unlike PIC or Atmel, which is bad for me xD.


Offline andersm

  • Super Contributor
  • ***
  • Posts: 1198
  • Country: fi
Re: How are micros programmed in real world situations?
« Reply #38 on: April 11, 2015, 06:40:50 pm »
Well, no.. I am not interested in linux. I read that it supports RTOS (like FreeRTOS). So, is it that much different than just programming via C, just like PIC? I don't know how to use RTOS yet.
FreeRTOS provides tasks and synchronization services. It does not provide eg. hardware device drivers. The RZ demo included in the FreeRTOS distribution is for IAR's compiler, btw.

Quote
I want to try ARM-based 32-bit MCUs and found that Renesas is the best. I don't know why you are making it extremely hard to learn as I read many stuff online and in this forum saying that using ARM is not that much hard at all (not implying that it is extremely easy too).
Best by what criteria? This is not a microcontroller. It is a completely different beast from the Cortex-Ms usually discussed on this board, and far more complex. If you want an ARM microcontroller board, get an ST Discovery board or something similar.

Online VEGETA

  • Super Contributor
  • ***
  • Posts: 1952
  • Country: jo
  • I am the cult of personality
    • Thundertronics
Re: How are micros programmed in real world situations?
« Reply #39 on: April 11, 2015, 06:57:48 pm »
Can you explain why it is that much more complex than other MCUs? What do I need to learn it, btw? Is the RX easier?

I have some free time to learn it, even if it needs a long learning curve - at least 7 months or so.

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
Re: How are micros programmed in real world situations?
« Reply #40 on: April 11, 2015, 08:01:12 pm »
I wouldn't use the RZ/A series without Linux. There is just no point. Also, 400MHz is quite low. I'm also not sure how well these devices are supported. If this is for a project with the potential to sell fewer than 5000 units, I'd stick with either TI or Freescale. What counts with these kinds of complex processors is good support, either through a community or paid.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline Howardlong

  • Super Contributor
  • ***
  • Posts: 5319
  • Country: gb
Re: How are micros programmed in real world situations?
« Reply #41 on: April 11, 2015, 09:00:09 pm »
I am open to correction, but I don't think Renesas are very receptive to much other than 100ku+ projects. I believe they're still the biggest MCU producer globally.

They have a massive presence, just not in low to medium volume. They're a Japanese company, and yet if you go into the few remaining true electronics stores in Akihabara (still well worth a visit, and distinctly teeming with geeks of all ages nonetheless!) there's not much in the way of Renesas - I can't remember seeing any, in fact; it's all PIC, TI, NXP, Atmel/Arduino. That probably tells you all you need to know.
 

Online ajb

  • Super Contributor
  • ***
  • Posts: 2604
  • Country: us
Re: How are micros programmed in real world situations?
« Reply #42 on: April 12, 2015, 05:46:25 pm »
ARMs are a bit harder to work with than 8 and 16 bit micros
Please stop with this nonsense. Getting an ARM going is just as hard as getting an 8 bit microcontroller going.

If you already sorta know what you're doing, then sure, an ARM isn't any harder than a typical 8-bit micro.  However it IS typically more complicated, and if you're just starting out with embedded devices or programming in general, then the increased complexity of the peripherals, having to deal with clock distribution, power management, and memory timings can present a much steeper barrier to entry.  Couple that with often poorly documented libraries and a relative lack of example code as compared to the popular 8-bit systems, and ARMs on the whole aren't nearly as beginner-friendly.  That will change as more people start using ARMs and releasing example code and beginner guides and such, but for the time being AVRs and PICs are still a good place for beginners to start out.
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
Re: How are micros programmed in real world situations?
« Reply #43 on: April 12, 2015, 08:23:09 pm »
So you think an 8 or 16 bit microcontroller doesn't have clock distribution? Even on an 8 bit PIC you need to set the right options during programming for the oscillator to work.

Thinking an ARM controller is more complicated than an 8 bit controller is utter nonsense. Getting an ARM going takes just as much time & learning as getting a PIC going for the first time.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline Howardlong

  • Super Contributor
  • ***
  • Posts: 5319
  • Country: gb
Re: How are micros programmed in real world situations?
« Reply #44 on: April 12, 2015, 09:00:41 pm »
For me, the biggest problem in getting up and going with ARM was knowing where to start. There are a whole bunch of marketing hurdles to master before you can even order something. The difference is that you have to understand both ARM marketing and the chip maker's marketing in order to make a reasonable comparative decision.

In the end whether it's ARM or not doesn't make a great deal of difference when developing. There's not really anything particularly special about it when compared with, say, MIPS based platforms.

I would say though that diving in to a complex M4 based device will be a baptism of fire. The clocking regimes are often very complex.

PICs have their own frustrations, even on the bottom-end 8-bit devices. Some design decisions, such as GPIO pins defaulting to analogue, I still don't understand, but it becomes a natural reaction to switch them to digital as one of the first things you do. For a beginner this kind of thing is just another barrier. Spending an hour or two just getting the core clock going at the right frequency with a mixture of fuse and functional programming is not unheard of.
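
For illustration, the "switch them to digital first" reflex might look like this on an 8-bit PIC (register names are device-dependent - ANSELx on newer PIC16F1xxx parts, ADCON1 on many PIC18Fs - so treat these as examples only):

Code: [Select]
#include <xc.h>

/* One of the first things in main() on many 8-bit PICs: pins default to
 * analogue inputs, so digital reads return 0 until the analogue function
 * is switched off. */
void io_init(void)
{
    ANSELA = 0x00;        /* newer PIC16F1xxx style: all of port A digital   */
    /* ADCON1 = 0x0F; */  /* many PIC18F parts use ADCON1 instead of ANSELx  */
    TRISA  = 0x00;        /* then set the port A direction (0 = output)      */
}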

PICs do have a simulator integrated into the IDE, so you can do stuff without hardware. Personally speaking, except for the most basic of scenarios, simulators are of pretty limited use, but I guess you can try things out for free.

For any platform, though, you'd hope that doing a blinky should be a reasonably painless affair. Once you've got that toolchain working that is...
 

Online Psi

  • Super Contributor
  • ***
  • Posts: 9951
  • Country: nz
Re: How are micros programmed in real world situations?
« Reply #45 on: April 17, 2015, 07:00:58 am »
So you think an 8 or 16 bit microcontroller doesn't have clock distribution? Even on an 8 bit PIC you need to set the right options during programming for the oscillator to work.

Thinking an ARM controller is more complicated than an 8 bit controller is utter nonsense. Getting an ARM going takes just as much time & learning as getting a PIC going for the first time.

It's more complicated in that there are lots more registers you need to set before things work.
In an empty AVR project you can write one line to set a pin as an output and another to set it high. Compile it and it works:
the RC clock is active and everything is on.

Try doing that on an STM32:
You get nothing because you didn't set up the system clock.
Then you get nothing because you didn't set the peripheral clock divider.
Then you get nothing because you haven't enabled the clock for the port you're talking to.

Need I go on?
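
For illustration, a minimal sketch of the difference (CMSIS-style register names for the STM32F103 and pin PC13 are assumptions here; details vary by board and library version):

Code: [Select]
/* AVR: two lines and the pin is high (the internal RC clock is already running):
 *     DDRB  |= (1 << PB0);
 *     PORTB |= (1 << PB0);
 *
 * STM32F103: the GPIO port has no clock until you enable it in the RCC. */
#include "stm32f10x.h"

int main(void)
{
    RCC->APB2ENR |= RCC_APB2ENR_IOPCEN;   /* 1. clock the GPIOC peripheral */

    GPIOC->CRH &= ~(0xFu << 20);          /* 2. clear PC13 config bits     */
    GPIOC->CRH |=  (0x2u << 20);          /*    output, push-pull, 2 MHz   */

    GPIOC->BSRR = (1u << 13);             /* 3. finally: drive PC13 high   */

    for (;;) {
    }
}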
« Last Edit: April 17, 2015, 07:02:42 am by Psi »
Greek letter 'Psi' (not Pounds per Square Inch)
 

Offline andersm

  • Super Contributor
  • ***
  • Posts: 1198
  • Country: fi
Re: How are micros programmed in real world situations?
« Reply #46 on: April 17, 2015, 08:52:08 am »
With no extra configuration, the STM32 will run off its internal 8MHz RC oscillator. You might need an extra line of code to configure the port directions, but that's it.

Also, "8-bit vs 32-bit" is a false dichotomy. The clock distribution diagram of e.g. the Atmel XMEGAs looks as complicated as that of any 32-bit MCU.

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
Re: How are micros programmed in real world situations?
« Reply #47 on: April 17, 2015, 08:54:33 am »
So you think an 8 or 16 bit microcontroller doesn't have clock distribution? Even on an 8 bit PIC you need to set the right options during programming for the oscillator to work.

Thinking an ARM controller is more complicated than an 8 bit controller is utter nonsense. Getting an ARM going takes just as much time & learning as getting a PIC going for the first time.

It's more complicated in that there are lots more registers you need to set before things work.
In an empty AVR project you can write one line to set a pin as an output and another to set it high. Compile it and it works:
the RC clock is active and everything is on.
The same goes for the ARM Cortex devices. The ones from NXP have an RC clock running by default so even without any clock setup you have a running core. I doubt it will be very different on other ARM Cortex MCUs.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline mikerj

  • Super Contributor
  • ***
  • Posts: 3240
  • Country: gb
Re: How are micros programmed in real world situations?
« Reply #48 on: April 17, 2015, 09:07:10 am »

The same goes for the ARM Cortex devices. The ones from NXP have an RC clock running by default so even without any clock setup you have a running core. I doubt it will be very different on other ARM Cortex MCUs.

Yes, the core runs as soon as you power it up, but few if any of the peripherals will.  On a PIC or AVR you can configure a pin as an output and start using it as soon as the core is running; on a Cortex device you have to enable the clock to the peripheral before you can even start configuring it.  That clock will also have various dividers that may need to be adjusted for your application.

They are undeniably more complex devices to use than the older 8-bit micros, with many more traps to fall into.
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
Re: How are micros programmed in real world situations?
« Reply #49 on: April 17, 2015, 09:46:09 am »
On NXP's ARM devices the common peripherals are already on & clocked at startup. On some devices all peripherals which don't need special clocking (like USB or ethernet) are enabled. In NXP's user manuals the description of each peripheral starts with links to the registers which control clock and power for that particular peripheral. How can that be difficult?
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline Howardlong

  • Super Contributor
  • ***
  • Posts: 5319
  • Country: gb
Re: How are micros programmed in real world situations?
« Reply #50 on: April 17, 2015, 10:14:03 am »
The difficulty comes a tiny bit further along, when you want to start using a peripheral and need to find out what's been initialised and what hasn't, and which clocks have been used for what; for example, you don't want to accidentally start messing with any clocking on flashless NXP devices until you know where the SPIFI clock is coming from. The only way to do that is to look at, and understand, the startup code. On some ARM devices there's a lot of it, including boot ROM, CRT startup and board-specific initialisation. Some devices, like the PIC, have a very, very limited amount of startup - just the generic CRT startup, in fact, which in most cases you don't need to know anything about at all. It does mean, though, that one of the first tasks many of us need to do is to fanny about getting the system clock running at a reasonable speed, but at least it's transparent.

But in general, yes, you should be able to get at least a ready-to-go Blinky project going reasonably quickly on any platform, assuming the toolchains, including debugging tools, are documented in an accurate and reasonably succinct way, preferably in a single, concise "getting started" document. What happens after that can be rather more time consuming, I doubt there are any platforms without nonsense that you have to overcome through failure first.
 

Offline atferrari

  • Frequent Contributor
  • **
  • Posts: 314
  • Country: ar
Re: How are micros programmed in real world situations?
« Reply #51 on: April 17, 2015, 11:40:33 am »
PICs have their own frustrations, even at the bottom end 8 bit devices. Some design decisions such as defaulting to analogue pins on GPIOs I still don't understand, but it becomes a natural reaction to switch them to GPIO as one of the first thing you do. For a beginner this kind of thing is just another barrier. Spending an hour or two just getting the core clock going at the right frequency with a mixture of fuse and functional programming is not unheard of.

In line with that, I always think that if I had to teach someone how to use those micros (the 16F and 18F families), then after a cursory reading of the datasheet I would explain how to initialize the micro (to blink an LED, so to speak) - my major stumbling block. Immediately after that I would go on to interrupts (I was really scared of them). The way I see it, after that it is all peripherals and imagination.

But then, that is me. ;)
Agustín Tomás
In theory, there is no difference between theory and practice. In practice, however, there is.
 

Online VEGETA

  • Super Contributor
  • ***
  • Posts: 1952
  • Country: jo
  • I am the cult of personality
    • Thundertronics
Re: How are micros programmed in real world situations?
« Reply #52 on: April 17, 2015, 02:06:54 pm »
What I understood from the last discussion is that working with ARM is not very different from, or more difficult than, PIC and the like. However, the setup is more involved and there is stuff you need to do or set up in code before getting to your own code.

Now, is there any good resource for learning to work with ARM MCUs? A book, for example?

I am interested in the Renesas RX62N board (the $99 one, I guess), which has a good course here (https://www.youtube.com/playlist?list=PLPIqCiMhcdO7TJiOvupVWuVCEsSWMKuWJ) based on it. And there is a whole book on embedded systems too (download: http://webpages.uncc.edu/~jmconrad/ECGR4101-2012-01/notes/All_ES_Conrad_Final_Soft_Proof_Blk.pdf ).

I have searched Amazon for embedded books but got confused; many books are pricey and I don't know whether they are good or not. Maybe getting that dev board and starting with its book is the best option?

 

Offline Howardlong

  • Super Contributor
  • ***
  • Posts: 5319
  • Country: gb
Re: How are micros programmed in real world situations?
« Reply #53 on: April 17, 2015, 03:19:14 pm »

Now, is there any good resource to learn working with ARM MCUs? a book for example?

In my opinion, the best way is by doing. As I mentioned earlier, the stumbling block with ARM is finding the starting point, because there are so many options, often obfuscated by both ARM's and the vendors' marketing. I would say that if you already have a relationship with a particular vendor who also has ARM devices, then go with them, as you will at least know your way around their website and documentation standards.

I do have the Joseph Yiu books on the various Cortex cores, but I very, very rarely refer to them.

I would pick a vendor's entry-level ARM dev board and debugger as required, based on, say, an M0, that has a reasonable-looking toolchain with a getting-started document, and go with that. Avoid cheaping out on third-party debuggers unless you like a challenge; that might well end in tears, particularly if it's FTDI-based.
 

Offline AndyC_772

  • Super Contributor
  • ***
  • Posts: 4228
  • Country: gb
  • Professional design engineer
    • Cawte Engineering | Reliable Electronics
Re: How are micros programmed in real world situations?
« Reply #54 on: April 17, 2015, 03:34:25 pm »
I agree that a good tool chain is probably the best place to start.

I wasted a lot of time trying to set up Eclipse, and then even longer trying to work around bugs in CoIDE.

With hindsight, I should have simply bought CrossWorks on day one. For a non-commercial licence it's ridiculously cheap, and even the 'full' version is much more cost-effective than most competing products.

Online westfw

  • Super Contributor
  • ***
  • Posts: 4199
  • Country: us
Re: How are micros programmed in real world situations?
« Reply #55 on: April 17, 2015, 07:19:03 pm »
Quote
clock distribution?
Most 8-bit microcontrollers default to providing a clock to most of the peripherals.  To start using a peripheral, you start by writing values to the peripheral's control registers.  "Low power" microcontrollers may have clock control circuitry that disables peripherals to save power.
Most ARM chips default to NOT providing a clock to most of the peripherals; if you write values to the peripheral control registers, you get a fault condition (that you probably haven't set up a handler for, so the chip mysteriously halts).
It's not REALLY that much more complicated than "you have to disable the a2d and analog comparator before you can use those pins as digital IO" (PIC8), but ... it IS different.  Most people aren't even particularly aware that GPIO ports NEED a clock.
And then the problem is compounded by poor documentation.  The STM32F103 has gotten a fair amount of discussion here - if you read the reference manual section on GPIO, the need to enable the clock is not mentioned.   If you SEARCH the entire 1100-page manual for "GPIO", you won't find anything about the clock (because for some reason the clock control bits get named xxx_IOPxxx.  Perhaps this is supposed to make it more obvious that you need to provide that clock if you use the non-GPIO functions that might be available on those port pins as well?  Right.)  (I know.  Just use the GPIO_init() function provided by the vendor!   What?  It doesn't turn on the clock either?!  Grr.)

I suppose it could all boil down to the fact that many 8-bit microcontrollers have now had several decades of documentation refinement and 3rd-party contributions to "overall community understanding."  Including documents and projects aimed at clueless hobbyists.  (I mean, how else can you explain that the PIC16F84, with all its architectural ugliness and low functionality, was THE hobbyist microcontroller of choice for ... quite a long time.)

ARM and other 32bit chips might eventually reach that "maturity."  If the "churn rate" slows down, and it doesn't all get buried beneath an oversimplified and/or over-complicated abstraction layer (like Arduino or Linux.)
 

Online VEGETA

  • Super Contributor
  • ***
  • Posts: 1952
  • Country: jo
  • I am the cult of personality
    • Thundertronics
Re: How are micros programmed in real world situations?
« Reply #56 on: April 17, 2015, 09:04:14 pm »
I agree that a good tool chain is probably the best place to start.

I wasted a lot of time trying to set up Eclipse, and then even longer trying to work around bugs in CoIDE.

With hindsight, I should have simply bought CrossWorks on day one. For a non-commercial licence it's ridiculously cheap, and even the 'full' version is much more cost-effective than most competing products.

This is another hardship. I like Renesas MCUs and I really want to learn how to use them. However, look at their programmers, for example... there are about three of them, and you don't know which one to buy for your needs, as each supports some MCUs and not others... And their IDEs, for example: one based on Eclipse and another one called HEW... It is a total mess!

Why wouldn't they have one programmer (with higher-end versions for production) and one IDE? And there is this JTAG debugger from Segger and other stuff like it!

^
That drove me to decide to buy the Renesas dev board for the RX62N (the $99 one) when I have money and time, as I would learn to deal with the MCU without the hassle of the tools and stuff.

Now, when it comes to the point where one should use an RTOS, is it hard (assuming that I know how to program the MCU normally)? I see that these RTOSes have some .h and .c files that get compiled along with the main.c that the user writes (sometimes more than just main.c, as I see)... does dealing with them involve a different procedure, or is it just coding?

Thanks, and sorry for so many questions xD

Offline Mechanical Menace

  • Super Contributor
  • ***
  • Posts: 1288
  • Country: gb
Re: How are micros programmed in real world situations?
« Reply #57 on: April 20, 2015, 01:13:14 pm »
Now, when it comes to the point where one should use an RTOS, is it hard (assuming that I already know how to program the MCU normally)?

If you actually need an RTOS they can make things easier: they (should) do all the hard work of making sure your time-critical processes get cycles and access to peripherals exactly when needed, while still doing a decent job of scheduling the non-time-sensitive processes lol.

Quote
I see that these RTOSes have some .h and .c files that get compiled along with the main.c that the user writes (sometimes more than just main.c, as I see)... does dealing with them involve a different procedure, or is it just coding?

RTOSes can be a little more demanding on the application programmer than conventional OSes, but at the end of the day they're still presented to you as Yet Another Library.
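
To give a flavour of it, here's a minimal sketch (assuming FreeRTOS; other RTOSes look much the same, and toggle_led() is just a placeholder for your own GPIO code):
Code: [Select]
#include "FreeRTOS.h"
#include "task.h"

extern void toggle_led(void);              /* placeholder for your own GPIO code */

static void vBlinkTask(void *pvParameters)
{
    (void) pvParameters;
    for (;;)
    {
        toggle_led();
        vTaskDelay(pdMS_TO_TICKS(500));    /* sleep and let other tasks run */
    }
}

int main(void)
{
    xTaskCreate(vBlinkTask, "blink", configMINIMAL_STACK_SIZE, NULL,
                tskIDLE_PRIORITY + 1, NULL);
    vTaskStartScheduler();                 /* never returns if the kernel starts */
    for (;;) { }
}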
Second sexiest ugly bloke on the forum.
"Don't believe every quote you read on the internet, because I totally didn't say that."
~Albert Einstein
 

Offline Howardlong

  • Super Contributor
  • ***
  • Posts: 5319
  • Country: gb
Re: How are micros programmed in real world situations?
« Reply #58 on: April 20, 2015, 04:07:42 pm »
Now, when it comes to the point where one should use an RTOS, is it hard (assuming that I already know how to program the MCU normally)?

If you actually need an RTOS they can make things easier: they (should) do all the hard work of making sure your time-critical processes get cycles and access to peripherals exactly when needed, while still doing a decent job of scheduling the non-time-sensitive processes lol.

Quote
I see that these RTOSes have some .h and .c files that get compiled along with the main.c that the user writes (sometimes more than just main.c, as I see)... does dealing with them involve a different procedure, or is it just coding?

RTOSes can be a little more demanding on the application programmer than conventional OSes, but at the end of the day they're still presented to you as Yet Another Library.

In general I agree, although you approach the design with a rather different mindset at the outset when going for an RTOS rather than a superloop style.

Superloops are much more deterministic, although if you can make your RTOS design non-preemptive, the non-deterministic nature can be controlled more tightly than in a completely pre-emptive solution. Non-preemptive solutions also remove context switching overhead.

Generally, if at all possible, I avoid generic heap management in any resource-limited embedded application, as it's another thing to control. You can allocate dynamically, but when I do I generally just use statically allocated pools of fixed-size blobs for each requirement: generic malloc-type allocation leads to fragmentation and unexpected consequences. This applies whether you use an RTOS or not, though.

You can do an awful lot without an RTOS, but sometimes the complexity of a requirement makes an RTOS a more reasonable choice overall.
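
By way of illustration, a pool of fixed-size blobs needs very little code. This is just a minimal sketch of the idea (mine, not from any particular RTOS), with no locking, so guard it appropriately if it's shared with ISRs:
Code: [Select]
#include <stddef.h>
#include <stdint.h>

#define POOL_BLOCKS  16
#define BLOCK_SIZE   64                       /* size the blob for one requirement */

static uint8_t  pool_mem[POOL_BLOCKS][BLOCK_SIZE];
static void    *free_list[POOL_BLOCKS];
static size_t   free_count;

void pool_init(void)
{
    for (size_t i = 0; i < POOL_BLOCKS; i++)  /* all blocks start out free */
        free_list[i] = pool_mem[i];
    free_count = POOL_BLOCKS;
}

void *pool_alloc(void)
{
    return free_count ? free_list[--free_count] : NULL;   /* NULL when exhausted, O(1) */
}

void pool_free(void *p)
{
    free_list[free_count++] = p;              /* caller must only return blocks from this pool */
}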
 

Offline Mechanical Menace

  • Super Contributor
  • ***
  • Posts: 1288
  • Country: gb
Re: How are micros programmed in real world situations?
« Reply #59 on: April 20, 2015, 04:53:57 pm »
Non-preemptive solutions also remove context switching overhead.

Are you sure? Even with cooperative multitasking, when a process hands off back to the OS there's still a context switch to kernel mode and back to user mode; the stack (or its location, depending on architecture), the registers and the instruction pointer still have to be saved, then read back from RAM and restored...
« Last Edit: April 20, 2015, 04:56:11 pm by Mechanical Menace »
Second sexiest ugly bloke on the forum.
"Don't believe every quote you read on the internet, because I totally didn't say that."
~Albert Einstein
 

Offline zirlou21

  • Newbie
  • Posts: 5
  • Country: 00
Re: How are micros programmed in real world situations?
« Reply #60 on: April 20, 2015, 05:10:28 pm »
 :) :) :)


...good day to all.

...getting back to the programming aspect: C or assembly.

...assembly is good when C cannot provide adequate control over the hardware, e.g. bit manipulation.
There are times when you need to shift or rotate a byte or word of data but your MCU's C compiler lacks a construct for it, and some MCUs need an outside library to support it,
so inline assembly is the solution for this.

...for example, implementing a WDT routine needs inline assembly coding... which is just what the Zilog Z8 Encore! requires.

...what do you think, guys?
 

Offline Howardlong

  • Super Contributor
  • ***
  • Posts: 5319
  • Country: gb
Re: How are micros programmed in real world situations?
« Reply #61 on: April 20, 2015, 06:27:59 pm »
Non-preemptive solutions also remove context switching overhead.

Are you sure? Even with cooperative multitasking, when a process hands off back to the OS there's still a context switch to kernel mode and back to user mode; the stack (or its location, depending on architecture), the registers and the instruction pointer still have to be saved, then read back from RAM and restored...

I thought about that a little before writing it, but not hard enough - you are right. There is indeed a context switch, but only when you allow it, so you don't have to maintain things like separate stacks. Apologies for my brain fart!

Even in non-preemptive systems, there's almost certainly some pre-emption from ISRs, but the trick is to minimise the resources used in those ISRs to things like a simple semaphore, mutex or queue operation.
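
As a minimal sketch of what I mean (assuming FreeRTOS here, with a hypothetical UART interrupt), the ISR just signals and gets out, and a task does the real work at its own priority:
Code: [Select]
#include "FreeRTOS.h"
#include "semphr.h"

static SemaphoreHandle_t xDataReady;      /* created at startup with xSemaphoreCreateBinary() */

void UART_IRQHandler(void)                /* hypothetical vector name */
{
    BaseType_t xWoken = pdFALSE;
    /* clear the peripheral's interrupt flag here */
    xSemaphoreGiveFromISR(xDataReady, &xWoken);
    portYIELD_FROM_ISR(xWoken);           /* switch immediately if a higher priority task woke */
}

static void vWorkerTask(void *pvParameters)
{
    (void) pvParameters;
    for (;;)
    {
        xSemaphoreTake(xDataReady, portMAX_DELAY);
        /* do the heavy lifting here, at task priority, not in the ISR */
    }
}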
 

Offline Howardlong

  • Super Contributor
  • ***
  • Posts: 5319
  • Country: gb
Re: How are micros programmed in real world situations?
« Reply #62 on: April 20, 2015, 06:58:48 pm »
:) :) :)


...good day to all.

...getting back to the programming aspect: C or assembly.

...assembly is good when C cannot provide adequate control over the hardware, e.g. bit manipulation.
There are times when you need to shift or rotate a byte or word of data but your MCU's C compiler lacks a construct for it, and some MCUs need an outside library to support it,
so inline assembly is the solution for this.

...for example, implementing a WDT routine needs inline assembly coding... which is just what the Zilog Z8 Encore! requires.

...what do you think, guys?

I think it depends on the environment, but these days it's extremely rare for me to write any assembler.

The most common case for me is a Nop() for some ARM implementations which lack such a thing, allowing me to place breakpoints at specific places in the code, particularly when the optimiser's on. It can be as simple as this:

#define NOP() asm volatile("mov r0, r0")  // a single harmless instruction; 'volatile' stops the optimiser removing it

The last time I did any serious assembler was about four years ago when trying to squeeze the last drop out of a PIC24, performing some very tight DSP operations within a 1ms USB frame time. But there was a time when assembler, or more correctly machine code, was all I wrote.

For readability I'd much rather do it in C; writing assembler requires a lot of comments, often at least one every line or two, as much for my sanity as anyone else's when I come back to it a year later.

You can often use __builtin intrinsic functions to improve speed. The plus side is that they are more readable than assembler and they do what you tell them to, such as multiplying two 16-bit variables and giving a 32-bit answer. The bad news is that if, say, an addressing mode isn't well supported under the hood, you can end up generating more instructions than you expected, and just like assembler it's hardly portable. But then we are in the embedded world: if you get this far and decide you've chosen the wrong MCU, there's a lot more to be concerned about than a few __builtins and a couple of dozen lines of assembler.

For many implementations these days, you can do things like WDT resets with other intrinsic functions. The same applies to shifts and rotates, where there's often a __builtin intrinsic. I'd far rather use those than resort to assembler; assembler is very hard to maintain.

But I guess there are still some implementations that lack a lot of these intrinsics, in which case you have little choice.
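
To give a flavour of the intrinsics route, a minimal sketch using plain GCC __builtins (this assumes a GCC toolchain; other compilers spell these differently):
Code: [Select]
#include <stdint.h>

/* Byte swap: on ARM this usually compiles to a single REV instruction */
uint32_t swap_endian(uint32_t x)
{
    return __builtin_bswap32(x);
}

/* Index of the highest set bit via CLZ; __builtin_clz(0) is undefined, so guard it */
int highest_set_bit(uint32_t x)
{
    return x ? 31 - __builtin_clz(x) : -1;
}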
 

Offline andersm

  • Super Contributor
  • ***
  • Posts: 1198
  • Country: fi
Re: How are micros programmed in real world situations?
« Reply #63 on: April 20, 2015, 07:38:51 pm »
...assembly is good when C cannot provide adequate control over the hardware, e.g. bit manipulation.
There are times when you need to shift or rotate a byte or word of data but your MCU's C compiler lacks a construct for it, and some MCUs need an outside library to support it,
so inline assembly is the solution for this.
A decent compiler will synthesize most operations for you, as long as you express clearly what you want. If you use a less-known compiler or less-known architecture you may have to resort to intrinsics or inline assembly.
Code: [Select]
unsigned int rol(unsigned int a)
{
  return (a << 1)|(a >> (sizeof(a)*8-1));
}

unsigned int ror(unsigned int a)
{
  return (a >> 1)|(a << (sizeof(a)*8-1));
}


00000000 <rol>:
   0: ea4f 70f0 mov.w r0, r0, ror #31
   4: 4770      bx lr
   6: bf00      nop

00000008 <ror>:
   8: ea4f 0070 mov.w r0, r0, ror #1
   c: 4770      bx lr
   e: bf00      nop

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
Re: How are micros programmed in real world situations?
« Reply #64 on: April 20, 2015, 11:43:39 pm »
I agree. Thinking you can code better assembly than a C compiler is so 1990.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline Howardlong

  • Super Contributor
  • ***
  • Posts: 5319
  • Country: gb
Re: How are micros programmed in real world situations?
« Reply #65 on: April 27, 2015, 01:48:11 am »
Hmm, I might change my mind - slightly - on this. I've just spent the last two days optimising some DSP code on an LPC4370 (ARM M4F), and there are rare occasions when it can make sense.

I have two pieces of code I'm trying to squeeze the last cent out of, one is a polyphase decimator (the ARM CMSIS-DSP decimator is _not_ polyphase but the interpolator is, go figure) and the other is a quadrature NCO oscillator and mixer.

I first spent some time trying to make the C code work reasonably fast. Keep in mind that my CPU cycle budget was already blown by over 200% when I started looking at this on Friday; short of overclocking the 200MHz part to 600MHz, something had to give. With carefully crafted idioms and some data restructuring, I got that down to about 50% over budget - an improvement, but not enough. The environment is LPCXpresso 7.7.2, so gcc is the compiler. I have now recoded both bottlenecks in inline assembler, and after several hours have managed to improve on the optimised C code for the polyphase decimator from 3.3Msps (input samples) to 6.25Msps. The quadrature oscillator and mixer improved from 8.5Msps to 12.5Msps. For both, I'm pretty sure I can squeeze out another 5-10%; there are some stalls that I'm as yet unable to explain away.

The approach on the C side was to move from flash to RAM, then carefully code up compiler idioms, and examine what's generated. There was at this point a fair bit of loop unrolling, and some restructuring of data to allow the use of LDM/STM/VLDM/VSTM multiple register loads and stores.

Then I rolled up my sleeves and embarked on the most assembler I've written for probably a couple of decades.

On the assembler side, the human has the benefit of knowing the nature of the parameters and how the functions are called; the compiler just doesn't know that. With this knowledge, the code can be carefully crafted to avoid pipeline stalls: interleaving operations so that adjacent instructions use non-dependent registers, combining previously separate processes into one, avoiding excessive load and store operations, and keeping as much data as possible in registers (especially on processors like ARM). Again, being able to adjust your data structures and unroll loops to make use of the LDM/STM/VLDM/VSTM instructions and minimise loop overhead helped.

In addition, for the NCO, there was some register moving going on for a delay line (it's a pair of IIR filters), and by unrolling the loop and recoding it, the register moving could be avoided. Combining the quadrature mixer and the quadrature NCO into a single process also made some savings.

So in short, you can sometimes improve things fairly significantly with assembler, but it's a bit of a last resort, as maintenance is almost always going to be a head-scratcher. Regardless, even if you never write any assembler, to make any headway on a performance-related problem you may well eventually need to understand how the CPU works, at least to the extent of being able to roughly follow the disassembled version in the debugger.

Don't try this at home kids...
Code: [Select]
__asm__ __volatile__
(
"\n\t"

"vldm %[NCO],{s0-s7} \n\t" // Initialise NCOs into registers

"mov r6,#3 \n\t" // Divide by 3
"udiv r5,%[NUMSAMPLES],r6 \n\t" // Result in R5 // ***Check this udiv, takes 160ns, needs work
"mls r6,r6,r5,%[NUMSAMPLES] \n\t" // Remainder in R6

"cbz r5,loopexit1 \n\t"

// Calculate NCOs, three at a time to avoid shuffling registers about
"\nloop1: \n\t"

"vldm %[IN]!,{s10-s12} \n\t" // Get next three samples

"vmul.f32 s4,s6,s0 \n\t" // Iy0=A1*Iy1 // s2=Iy0, s0=Iy1, s1=Iy2 // NCO
"vmul.f32 s5,s6,s1 \n\t" // Qy0=A1*Qy1
"vmul.f32 s8,s7,s2 \n\t" // s8=A2*Iy2
"vmul.f32 s9,s7,s3 \n\t" // s9=A2*Qy2
"vadd.f32 s4,s4,s8 \n\t" // Iy0=A1*Iy1 + A2*Iy2
"vadd.f32 s5,s5,s9 \n\t" // Qy0=A1*Qy1 + A2*Qy2

"vmul.f32 s13,s4,s10 \n\t" // s13=Iy0*In[0] // Mixer
"vmul.f32 s16,s5,s10 \n\t" // s16=Qy0*In[0]

"vmul.f32 s2,s6,s4 \n\t" // Iy0=A1*Iy1 // s1=Iy0, s2=Iy1, s0=Iy2 // NCO
"vmul.f32 s3,s6,s5 \n\t" // Qy0=A1*Qy1
"vmul.f32 s8,s7,s0 \n\t" // s8=A2*Iy2
"vmul.f32 s9,s7,s1 \n\t" // s9=A2*Qy2
"vadd.f32 s2,s2,s8 \n\t" // Iy0=A1*Iy1 + A2*Iy2
"vadd.f32 s3,s3,s9 \n\t" // Qy0=A1*Qy1 + A2*Qy2

"vmul.f32 s14,s2,s11 \n\t" // s14=Iy0*In[1] // Mixer
"vmul.f32 s17,s3,s11 \n\t" // s17=Qy0*In[1]

"vmul.f32 s0,s6,s2 \n\t" // Iy0=A1*Iy1 // s0=Iy0, s1=Iy1, s2=Iy2 // NCO
"vmul.f32 s1,s6,s3 \n\t" // Qy0=A1*Qy1
"vmul.f32 s8,s7,s4 \n\t" // s8=A2*Iy2
"vmul.f32 s9,s7,s5 \n\t" // s9=A2*Qy2
"vadd.f32 s0,s0,s8 \n\t" // Iy0=A1*Iy1 + A2*Iy2
"vadd.f32 s1,s1,s9 \n\t" // Qy0=A1*Qy1 + A2*Qy2

"vmul.f32 s15,s0,s12 \n\t" // s15=Iy0*In[2] // Mixer
"vmul.f32 s18,s1,s12 \n\t" // s18=Qy0*In[2]

"vstm %[OUTI]!,{s13-s15} \n\t" // Write the complex downconverted sample out
"vstm %[OUTQ]!,{s16-s18} \n\t"

"subs r5,#1 \n\t" // Loop counter
"bne loop1 \n\t"

"\nloopexit1: \n\t"
"cbz r6,loopexit2 \n\t"

"\nloop2: \n\t" // This is the non-unrolled version for stragglers when num samples isn't divisible by 3

"vldm %[IN]!,{s10} \n\t" // Get next sample

"vmov.f32 s4,s2 \n\t" // Iy2=Iy1 // Interleave I and Q instructions to prevent stalling // NCO
"vmov.f32 s5,s3 \n\t" // Qy2=Qy1
"vmov.f32 s2,s0 \n\t" // Iy1=Iy0
"vmov.f32 s3,s1 \n\t" // Qy1=Qy0
"vmul.f32 s0,s6,s2 \n\t" // Iy0=A1*Iy1
"vmul.f32 s1,s6,s3 \n\t" // Qy0=A1*Qy1
"vmul.f32 s8,s7,s4 \n\t" // s8=A2*Iy2
"vmul.f32 s9,s7,s5 \n\t" // s9=A2*Qy2
"vadd.f32 s0,s0,s8 \n\t" // Iy0=A1*Iy1 + A2*Iy2
"vadd.f32 s1,s1,s9 \n\t" // Qy0=A1*Qy1 + A2*Qy2

"vmul.f32 s8,s0,s10 \n\t" // s8=Iy0*In[0] // Mixer
"vmul.f32 s9,s1,s10 \n\t" // s8=Qy0*In[0]

"vstm %[OUTI]!,{s8} \n\t" // Write the complex downconverted sample out
"vstm %[OUTQ]!,{s9} \n\t"

"subs r6,#1 \n\t" // Loop counter
"bne loop2 \n\t"

"\nloopexit2: \n\t"

"vstm %[NCO],{s0-s5] \n\t" // Store the NCO variables back

: [OUTI]"+r" (pstOutI), [OUTQ]"+r" (pstOutQ), [IN]"+r" (pstIn)
: [NCO]"r" (pncos), [NUMSAMPLES]"r" (nNumSamples)
: "r5","r6","s0", "s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s9","s10","s11","s12","s13","s14", "s15", "s16", "s17", "s18"

);

 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
Re: How are micros programmed in real world situations?
« Reply #66 on: April 27, 2015, 02:00:01 am »
Did you look into the SIMD instructions? Those should reduce the instruction fetch/decode/execution overhead.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline Howardlong

  • Super Contributor
  • ***
  • Posts: 5319
  • Country: gb
Re: How are micros programmed in real world situations?
« Reply #67 on: April 27, 2015, 02:09:00 am »
Did you look into the SIMD instructions? Those should reduce the instruction fetch/decode/execution overhead.

Yes, they were of no use cycle-wise in this case; at the end of the day you're limited to 32 bits of data per cycle on the M4, and my samples are 32-bit.

I even coded it up in fixed point, but due to the lack of registers compared to the fpu, there was almost no cycle benefit going that way.

I did learn one thing: the floating point MAC instruction takes longer than a carefully crafted separate multiply and add, because the MAC stalls itself.
 

Online westfw

  • Super Contributor
  • ***
  • Posts: 4199
  • Country: us
Re: How are micros programmed in real world situations?
« Reply #68 on: April 27, 2015, 07:32:01 am »
Where are you getting the detailed info about register stalls and such?  I would have thought that a MAC instruction would be carefully designed NOT to stall itself...
 

Offline andersm

  • Super Contributor
  • ***
  • Posts: 1198
  • Country: fi
Re: How are micros programmed in real world situations?
« Reply #69 on: April 27, 2015, 08:00:07 am »
The Cortex-M4 Technical Reference Manual gives one cycle each for VADD.F32 and VMUL.F32, and three cycles for VMLA.F32 and VFMA.F32.

Offline Howardlong

  • Super Contributor
  • ***
  • Posts: 5319
  • Country: gb
Re: How are micros programmed in real world situations?
« Reply #70 on: April 27, 2015, 09:49:56 am »
Where are you getting the detailed info about register stalls and such?  I would have thought that a MAC instruction would be carefully designed NOT to stall itself...

You'd have thought so, wouldn't you? It is single cycle for fixed point. Not sure what the point of having an FP MAC instruction is, TBH, if it's not faster than separate instructions.

I've learned a lot in the last 48 hours.
 

Offline Howardlong

  • Super Contributor
  • ***
  • Posts: 5319
  • Country: gb
Re: How are micros programmed in real world situations?
« Reply #71 on: April 27, 2015, 10:31:52 am »
So here's a test for you ARM assembler aficionados that's had me scratching my head for a day or so now. Why does the code segment below between *** START *** and *** END *** take 231ns/4.9ns = 47 cycles? I count 38, inclusive of one of the two GPIO twiddles: the three three-element LDM/STMs take 4 cycles each (12 total), the 24 VFP instructions take 24 cycles total, and one STRB for one of the two GPIOs takes 2 cycles, so 38 in total.

Code is in RamLoc72 and data is in RamLoc128; the scope shows a consistent 231ns with no jitter, and there are no ISRs, DMA or M0 cores running - this is it.

Edit: data is word (32 bit) aligned.



Code: [Select]
__RAMFUNC(RAM2) static void DownConvert3(NCOSTRUCT *pncos,SAMPLETYPE *pstIn,SAMPLETYPE *pstOutI,SAMPLETYPE *pstOutQ, int nNumSamples)
{

int x=0x400F4000; // For twiddling diagnostic bits

__asm__ __volatile__
(
"\n\t"
"movs r2,#0 \n\t" // Literals for GPIO performance diagnostics
"movs r3,#1 \n\t"

"vldm %[NCO],{s0-s7} \n\t" // Initialise NCOs into registers

"mov r6,#3 \n\t" // Divide by 3: we try to do three samples unrolled
"udiv r5,%[NUMSAMPLES],r6 \n\t" // Result in R5 // Check this udiv, takes 160ns
"mls r6,r6,r5,%[NUMSAMPLES] \n\t" // Remainder in R6


"cbz r5,loopexit1 \n\t" // If there are less than three samples, do it one at a time

// Calculate NCOs, three at a time to avoid shuffling registers about
"\nloop1: \n\t"

"strb.w r3,[%[X],#100] \n\t" // GPIO on
/****** START ******/

"vldm %[IN]!,{s10-s12} \n\t" // Load up next three samples (oscillators are IIRs with three sample delay)

"vmul.f32 s4,s6,s0 \n\t" // Iy0=A1*Iy1 // s2=Iy0, s0=Iy1, s1=Iy2
"vmul.f32 s5,s6,s1 \n\t" // Qy0=A1*Qy1
"vmul.f32 s8,s7,s2 \n\t" // s8=A2*Iy2
"vmul.f32 s9,s7,s3 \n\t" // s9=A2*Qy2
"vadd.f32 s4,s4,s8 \n\t" // Iy0=A1*Iy1 + A2*Iy2
"vadd.f32 s5,s5,s9 \n\t" // Qy0=A1*Qy1 + A2*Qy2

"vmul.f32 s13,s4,s10 \n\t" // s13=Iy0*In[0]
"vmul.f32 s16,s5,s10 \n\t" // s16=Qy0*In[0]

"vmul.f32 s2,s6,s4 \n\t" // Iy0=A1*Iy1 // s1=Iy0, s2=Iy1, s0=Iy2
"vmul.f32 s3,s6,s5 \n\t" // Qy0=A1*Qy1
"vmul.f32 s8,s7,s0 \n\t" // s8=A2*Iy2
"vmul.f32 s9,s7,s1 \n\t" // s9=A2*Qy2
"vadd.f32 s2,s2,s8 \n\t" // Iy0=A1*Iy1 + A2*Iy2
"vadd.f32 s3,s3,s9 \n\t" // Qy0=A1*Qy1 + A2*Qy2

"vmul.f32 s14,s2,s11 \n\t" // s14=Iy0*In[1]
"vmul.f32 s17,s3,s11 \n\t" // s17=Qy0*In[1]

"vmul.f32 s0,s6,s2 \n\t" // Iy0=A1*Iy1 // s0=Iy0, s1=Iy1, s2=Iy2
"vmul.f32 s1,s6,s3 \n\t" // Qy0=A1*Qy1
"vmul.f32 s8,s7,s4 \n\t" // s8=A2*Iy2
"vmul.f32 s9,s7,s5 \n\t" // s9=A2*Qy2
"vadd.f32 s0,s0,s8 \n\t" // Iy0=A1*Iy1 + A2*Iy2
"vadd.f32 s1,s1,s9 \n\t" // Qy0=A1*Qy1 + A2*Qy2

"vmul.f32 s15,s0,s12 \n\t" // s15=Iy0*In[2]
"vmul.f32 s18,s1,s12 \n\t" // s18=Qy0*In[2]

"vstm %[OUTI]!,{s13-s15} \n\t" // Store results
"vstm %[OUTQ]!,{s16-s18} \n\t"

"strb.w r2,[%[X],#100] \n\t" // GPIO off
/****** END ******/

"subs r5,#1 \n\t" // loop until no more
"bne loop1 \n\t"

// Here we deal with the "stragglers", ie samples beyond those divisible by three
"\nloopexit1: \n\t"
"cbz r6,loopexit2 \n\t" // Nothing left to do, jump over

"\nloop2: \n\t"

"vldm %[IN]!,{s10} \n\t" // Load up next sample

"vmov.f32 s4,s2 \n\t" // Iy2=Iy1 // Interleave I and Q instructions to prevent stalling
"vmov.f32 s5,s3 \n\t" // Qy2=Qy1
"vmov.f32 s2,s0 \n\t" // Iy1=Iy0
"vmov.f32 s3,s1 \n\t" // Qy1=Qy0
"vmul.f32 s0,s6,s2 \n\t" // Iy0=A1*Iy1
"vmul.f32 s1,s6,s3 \n\t" // Qy0=A1*Qy1
"vmul.f32 s8,s7,s4 \n\t" // s8=A2*Iy2
"vmul.f32 s9,s7,s5 \n\t" // s9=A2*Qy2
"vadd.f32 s0,s0,s8 \n\t" // Iy0=A1*Iy1 + A2*Iy2
"vadd.f32 s1,s1,s9 \n\t" // Qy0=A1*Qy1 + A2*Qy2

"vmul.f32 s8,s0,s10 \n\t" // s8=Iy0*In[0]
"vmul.f32 s9,s1,s10 \n\t" // s8=Qy0*In[0]

"vstm %[OUTI]!,{s8} \n\t" // Store results
"vstm %[OUTQ]!,{s9} \n\t"

"subs r6,#1 \n\t" // loop until no more
"bne loop2 \n\t"

"\nloopexit2: \n\t"

"vstm %[NCO],{s0-s5] \n\t" // Save NCO state

: [OUTI]"+r" (pstOutI), [OUTQ]"+r" (pstOutQ), [IN]"+r" (pstIn)
: [NCO]"r" (pncos), [NUMSAMPLES]"r" (nNumSamples), [X]"r" (x)
: "r2","r3","r5","r6","s0", "s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s9","s10","s11","s12","s13","s14", "s15", "s16", "s17", "s18"

);

« Last Edit: April 27, 2015, 10:46:02 am by Howardlong »
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
Re: How are micros programmed in real world situations?
« Reply #72 on: April 27, 2015, 02:04:15 pm »
I'd start by checking the clock frequency and then remove most of the instructions.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline AndyC_772

  • Super Contributor
  • ***
  • Posts: 4228
  • Country: gb
  • Professional design engineer
    • Cawte Engineering | Reliable Electronics
Re: How are micros programmed in real world situations?
« Reply #73 on: April 27, 2015, 02:13:05 pm »
How long does it take if you duplicate all the instructions between setting the GPIO and clearing it? Does executing the code twice take an extra 47 cycles, or 36, or somewhere in between?

Offline Jeroen3

  • Super Contributor
  • ***
  • Posts: 4078
  • Country: nl
  • Embedded Engineer
    • jeroen3.nl
Re: How are micros programmed in real world situations?
« Reply #74 on: April 27, 2015, 03:28:24 pm »
There are several reasons why assembler code that you count as 38 cycles actually takes 47. A few of them are:
- Bus wait states, for when you're accessing slower-clocked domains, such as GPIO or anything on a lower-clocked APB bus.
- Flash wait states: remember flash isn't 32 bits wide but mostly 128 bits, so multiple instructions fit in one flash fetch, and you can get out of sync with your GPIO set/reset. Refer to memory barriers for this.
- Flash prefetching/caching. Characteristics are highly hardware dependent.

Measuring execution time using GPIO is ambiguous. Compare it with the CCNT register to see what happens.
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0211i/Bihcgfcf.html (demo url, not sure what hardware you have, but this register exists in most ARMs)

If your main goal is to write a fast assembly routine, create a test kit in a simulator with an ideal environment: no bus waits, no flash waits, no prefetching, no pipelining. This way you only test your code, without also measuring hardware features that change along with your code.
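
On a Cortex-M3/M4 the counterpart of CCNT is the DWT cycle counter. A minimal sketch (assuming the CMSIS core definitions, normally pulled in via your vendor device header, e.g. "LPC43xx.h"):
Code: [Select]
#include <stdint.h>
/* CoreDebug and DWT come from the CMSIS core headers via the device header */

static inline void cyccnt_enable(void)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;   /* enable the DWT/trace block */
    DWT->CYCCNT = 0;
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;             /* start counting core cycles */
}

static uint32_t measure(void (*code_under_test)(void))
{
    uint32_t t0 = DWT->CYCCNT;
    code_under_test();
    return DWT->CYCCNT - t0;                          /* unsigned subtraction handles wrap */
}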
« Last Edit: April 27, 2015, 03:32:08 pm by Jeroen3 »
 

Offline Howardlong

  • Super Contributor
  • ***
  • Posts: 5319
  • Country: gb
Re: How are micros programmed in real world situations?
« Reply #75 on: April 27, 2015, 05:50:53 pm »
I'd start by checking the clock frequency and then remove most of the instructions.

Clock is 204MHz, measured on both CLK0 and CLK2 pins.
 

Offline Howardlong

  • Super Contributor
  • ***
  • Posts: 5319
  • Country: gb
Re: How are micros programmed in real world situations?
« Reply #76 on: April 27, 2015, 06:29:25 pm »
I'd start by checking the clock frequency and then remove most of the instructions.

Clock is 204MHz, measured on both CLK0 and CLK2 pins.

If I do this:

Code: [Select]
__RAMFUNC(RAM2) static void DownConvert3(NCOSTRUCT *pncos,SAMPLETYPE *pstIn,SAMPLETYPE *pstOutI,SAMPLETYPE *pstOutQ, int nNumSamples)
{
int x=0x400F4000; // For twiddling diagnostic bits

__asm__ __volatile__
(
"\n\t"
"movs r2,#0 \n\t" // Literals for GPIO performance diagnostics
"movs r3,#1 \n\t"
"\nloopy: \n\t"
""
"strb.w r3,[%[X],#100] \n\t" // GPIO on
"strb.w r2,[%[X],#100] \n\t" // GPIO off
"b loopy \n\t"
:
: [X]"r" (x)
: "r2","r3"
);

I get this:



STRB is either one or two cycles, one if following a previous load or store, two otherwise. B is two.

loopy:
 STRB // 2 cycles
 STRB // 1 cycle
 B loopy // 2 cycles

So far so good.

Now if I do this:

Code: [Select]
int x=0x400F4000; // For twiddling diagnostic bits

__asm__ __volatile__
(
"\n\t"
"movs r2,#0 \n\t" // Literals for GPIO performance diagnostics
"movs r3,#1 \n\t"
"\nloopy: \n\t"
""
"strb.w r3,[%[X],#100] \n\t" // GPIO on
"strb.w r2,[%[X],#100] \n\t" // GPIO off
"strb.w r3,[%[X],#100] \n\t" // GPIO on
"strb.w r2,[%[X],#100] \n\t" // GPIO off
"b loopy \n\t"
:
: [X]"r" (x)
: "r2","r3"
);

I get this:



Errr? No comprende.
 

Offline andersm

  • Super Contributor
  • ***
  • Posts: 1198
  • Country: fi
Re: How are micros programmed in real world situations?
« Reply #77 on: April 27, 2015, 06:53:20 pm »
Check the documentation to see whether your chip uses a code prefetch mechanism. The small loop may fit into the prefetch buffer, while the large one requires a refill. In some chips the prefetch can be turned off for increased determinism.

Offline Howardlong

  • Super Contributor
  • ***
  • Posts: 5319
  • Country: gb
Re: How are micros programmed in real world situations?
« Reply #78 on: April 27, 2015, 07:03:46 pm »
How long does it take if you duplicate all the instructions between setting the GPIO and clearing it? Does executing the code twice take an extra 47 cycles, or 36, or somewhere in between?

Just did this, had to tweak the CBZ as it's limited in branch range, but no matter.

I measure 74 cycles on the scope (363ns unrolled by six vs the previous 230ns unrolled by three), but it's doing twice as much work. I could shave a tiny bit more off by combining VLDMs and VSTMs. So that's equivalent to 37 cycles if we were only doing three rather than six.

While I am super happy it worked - indeed, I have been going backwards and forwards trying to explain it to myself - would anyone care to explain why?!

Edit: Combined the VLDMs and VSTMs, used up even more of the FP registers(!) as a result, and now it's at 71 cycles, so a 24% improvement on yesterday. But I still have no clue why unrolling further had such a massive impact; usually it's the law of diminishing returns after about three or four.
« Last Edit: April 27, 2015, 08:24:18 pm by Howardlong »
 

Offline Howardlong

  • Super Contributor
  • ***
  • Posts: 5319
  • Country: gb
Re: How are micros programmed in real world situations?
« Reply #79 on: April 27, 2015, 07:45:17 pm »
There are several reasons why assembler code that you count as 38 cycles actually takes 47. A few of them are:
- Bus wait states, for when you're accessing slower-clocked domains, such as GPIO or anything on a lower-clocked APB bus.
- Flash wait states: remember flash isn't 32 bits wide but mostly 128 bits, so multiple instructions fit in one flash fetch, and you can get out of sync with your GPIO set/reset. Refer to memory barriers for this.
- Flash prefetching/caching. Characteristics are highly hardware dependent.

Measuring execution time using GPIO is ambiguous. Compare it with the CCNT register to see what happens.
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0211i/Bihcgfcf.html (demo url, not sure what hardware you have, but this register exists in most ARMs)

If your main goal is to write a fast assembly routine, create a test kit in a simulator with an ideal environment: no bus waits, no flash waits, no prefetching, no pipelining. This way you only test your code, without also measuring hardware features that change along with your code.


As mentioned previously, the target is an LPC4370 Cortex-M4F, and I've tried to simplify things by (a) running the code from on-chip SRAM and (b) using different on-chip SRAM blocks for code and data. There are no wait states. Only the single M4 core is running, with no DMA or IRQs going on. The timing is jitter-free and consistent.

Good point regarding GPIO clocking. Regretfully the GPIO clock is wired to the core clock in this case, running at 204MHz.

Your tips regarding flash are worth knowing too, although I'm not sure how much they apply to the flashless LPC4370, which uses a proprietary quad SPI flash interface.

Regarding the use of a synthetic environment, I don't have one, and I'm not sure of the benefit if the pipelining is removed: you'd be running your code on a non-representative platform. It seems to me that understanding the pipeline and interleaving instructions is key to getting maximum performance. Please correct me if I've misunderstood, though.

Thank you for your input, it is appreciated.  :-+
 

Offline Howardlong

  • Super Contributor
  • ***
  • Posts: 5319
  • Country: gb
Re: How are micros programmed in real world situations?
« Reply #80 on: April 27, 2015, 07:50:01 pm »
Look in the documentation if your chip uses a code prefetch mechanism. The small loop may fit into the prefetch buffer, while the large one requires a refill. In some chips the prefetch can be turned off for increased determinism.

The code's running in on-chip SRAM without wait states. The only prefetch I can see on the LPC4370 is on the SPIFI flash interface, plus whatever is already part of the M4 design, but please correct me if I've misinterpreted or missed something.

This is a good discussion, I appreciate the input.
 

Offline miguelvp

  • Super Contributor
  • ***
  • Posts: 5550
  • Country: us
Re: How are micros programmed in real world situations?
« Reply #81 on: April 28, 2015, 01:43:40 am »
This video might help. It's for an ARM Cortex-M3 PSoC 5LP, but it gets the point across about what can cause delays.

C++ vs Assembly vs Verilog.


« Last Edit: April 28, 2015, 01:53:17 am by miguelvp »
 

Offline Howardlong

  • Super Contributor
  • ***
  • Posts: 5319
  • Country: gb
Re: How are micros programmed in real world situations?
« Reply #82 on: April 28, 2015, 10:56:33 am »
This video might help, it's for an ARM Cortex M3 PSoC 5LP but it gets the point across what can cause delays.

C++ vs Assembly vs Verilog.



OK, thanks.

Firstly, to be clear, in general I agree that assembler is very much a last resort, but there are very occasional edge cases where you need to get under the hood.

When we get into the meat, at about 08:20, I thought the pipeline bit was glossed over from about 10:15. At about 10:50, with the scope connected, it shows five cycles in total. We know that an unconditional branch will always take at least 2 cycles as the pipeline is flushed, so that leaves 3 cycles for the two STRBs: one takes 1 cycle and the other (apparently) takes 2. The TRM is opened up and he shows the pipeline diagram, but he doesn't explain what is happening and why the stall occurs - "or something like that", as he says, twice. At 14:10 he says that it's stalling on both STRBs, but doesn't give a convincing explanation IMHO. Then he goes straight on to configurable logic elements.

From the M3 Technical Reference Manual, regarding all scalar STRs including STRB:

Neighboring load and store single instructions can pipeline their address and data phases.
This enables these instructions to complete in a single execution cycle.


and

STR Rx,[Ry,#imm] is always one cycle. This is because the address generation is performed
in the initial cycle, and the data store is performed at the same time as the next instruction
is executing. If the store is to the store buffer, and the store buffer is full or not enabled,
the next instruction is delayed until the store can complete. If the store is not to the store
buffer, for example to the Code segment, and that transaction stalls, the impact on timing
is only felt if another load or store operation is executed before completion.


 

Offline miguelvp

  • Super Contributor
  • ***
  • Posts: 5550
  • Country: us
Re: How are micros programmed in real world situations?
« Reply #83 on: April 28, 2015, 02:31:43 pm »
I remembered that video because I was doing something similar and wanted to optimise it. But in the end C++ wasn't the winner, and neither was the hardware approach: since it wasn't under software control, it didn't have much utility to me.

I ended up with assembly, using a nop in between to compensate for the branch cycle, and using another register pointing to the same output port to avoid stalls, or something like that.

I'll try to dig up the code, but I was using an M0 so I don't know if that will impact things, and I don't recall what frequency I achieved at 50% duty cycle.
 

Offline Howardlong

  • Super Contributor
  • ***
  • Posts: 5319
  • Country: gb
Re: How are micros programmed in real world situations?
« Reply #84 on: April 28, 2015, 06:36:28 pm »
I figured this out after some suggestions from elsewhere (I have some seriously nerdy friends on FB), but it leaves a further question.

Due to a branch limit error when unrolling to 6 from the original 3, I changed a conditional branch "cbz" to "cbnz plus b".

Original:
Code: [Select]
cbz r5,loopexit1
loop1:
/****** START ******/
// Several vxx instructions
/****** END ******/

subs r5,#1
bne loop1

New:
Code: [Select]
cbnz r5,loop1
b loopexit1
loop1:
/****** START ******/
// Several vxx instructions
/****** END ******/

subs r5,#1
bne loop1

and miraculously it works: exactly 38 cycles rather than the 47 I was getting. I have no idea why you lose 9 cycles the first way. That explains why unrolling to 6 worked - it had nothing to do with the unrolling and everything to do with the cbz instruction before the loop. The cbz isn't even in the loop, so why it has any effect on the loop speed still evades me.
 

Offline andersm

  • Super Contributor
  • ***
  • Posts: 1198
  • Country: fi
Re: How are micros programmed in real world situations?
« Reply #85 on: April 28, 2015, 07:06:54 pm »
It could affect the alignment of the first instruction in the loop. IIRC at least the Cortex-M3 fetches instructions in aligned 32-bit chunks, and if an instruction straddles a fetch boundary you'll incur at least one extra memory fetch penalty. Compare the addresses of the instructions, and try adding NOPs or an assembler pseudo-op to force alignment in the slower case.

Offline Howardlong

  • Super Contributor
  • ***
  • Posts: 5319
  • Country: gb
Re: How are micros programmed in real world situations?
« Reply #86 on: April 28, 2015, 07:18:47 pm »
andersm I owe you a beer!

Old:
Code: [Select]
100803f4: 0xfb06c615   mls r6, r6, r5, r12
100803f8: 0xb3dd       cbz r5, 0x10080472 <loopexit61>
100803fa: 0xf8883064   strb.w r3, [r8, #100]   ; 0x64

New:
Code: [Select]
100803f4: 0xfb06c615   mls r6, r6, r5, r12
100803f8: 0xb905       cbnz r5, 0x100803fc <DownConvert6+72>
100803fa: 0xe03b       b.n 0x10080474 <loopexit61>
100803fc: 0xf8883064   strb.w r3, [r8, #100]   ; 0x64

Edit: had some time to think about this, and it feels like such a schoolboy error in retrospect; I was already aware that Thumb-2 mixes 16- and 32-bit instruction encodings. Just need to figure out a reasonably maintainable and clear way to get this implemented.
« Last Edit: April 28, 2015, 11:03:38 pm by Howardlong »
 

Offline andersm

  • Super Contributor
  • ***
  • Posts: 1198
  • Country: fi
Re: How are micros programmed in real world situations?
« Reply #87 on: April 29, 2015, 05:34:06 am »
With the GNU assembler you can use the ".balign 4" directive to align the location counter to 32 bits, padding with NOPs.
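
For example, a minimal self-contained sketch (hypothetical, GCC inline asm on a Cortex-M) of how the directive might be dropped in:
Code: [Select]
/* .balign 4 pads with NOPs so the instruction after it starts on a word boundary */
static void spin(unsigned int n)
{
    __asm__ __volatile__
    (
        ".balign 4     \n\t"   /* align the next instruction to a 4-byte boundary */
        "1:            \n\t"   /* loop entry now starts word aligned */
        "subs %[N],#1  \n\t"
        "bne  1b       \n\t"
        : [N] "+r" (n)
        :
        : "cc"
    );
}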

Offline Howardlong

  • Super Contributor
  • ***
  • Posts: 5319
  • Country: gb
Re: How are micros programmed in real world situations?
« Reply #88 on: April 29, 2015, 11:13:09 am »
With the GNU assembler you can use the ".balign 4" directive to align the location counter to 32 bits, padding with NOPs.

OK, this works, but I'm not sure of the scope of .balign; it looks like it just applies to the next "allocation", whether that's a line of code or data, because further down at the branch it slips back to unpadded Thumb instructions. This is not a problem, just as long as I'm aware of it.

This also fixes the "problem" with the GPIO twiddling: those instructions were also not word aligned. If I have a string of adjacent alternating GPIO on and off instructions that are not word aligned, the effect is that every third one causes a stall, so I think we can say that in the worst case, not having 32-bit-wide instructions word aligned can have an impact of 33%.

I've applied this to the polyphase decimator too - that was also not word aligned - and got a 16% improvement as a result, down from 75 to 63 cycles in total for two 8-tap FIRs.

There is nothing more to squeeze out of these now; the processing time is as expected.

Thanks again!
 

Offline Howardlong

  • Super Contributor
  • ***
  • Posts: 5319
  • Country: gb
Re: How are micros programmed in real world situations?
« Reply #89 on: April 29, 2015, 12:18:21 pm »
A quick note on how performance is affected depending on the memory used for code.

I ran 100 iterations of the dual FIR filter and measured the time taken.

33.7us RAMLoc128 (shared with data)
33.0us RAMLoc72
33.7us RAMAHB32
89.0us SPIFI (quad SPI flash, running at the default 40.8MHz)
54.9us SPIFI (quad SPI flash, running at 102MHz [maximum allowed])

Debug environment for this is two LPC Link2's, one as debugger the other as target, and the cat has found a new enclosure for the debugger.



 

