Author Topic: How are math.h library functions like sqrt, asinf, tanf implemented in CoIDE


Offline kodyTopic starter

  • Contributor
  • Posts: 34
  • Country: ca
How are math.h library functions like sqrt, asinf, tanf implemented for the CoIDE compiler?

I need further information about the internal implementation (e.g., Taylor series?) in order to calculate the number of cycles a program that uses these functions takes on FPU-enabled Cortex-M4 based MCUs.
 

Offline dannyf

  • Super Contributor
  • ***
  • Posts: 8221
  • Country: 00
CoIDE is just an IDE. The actual implementation is done by the compiler(s) you hook to CoIDE, just like any other IDE.
================================
https://dannyelectronics.wordpress.com/
 

Online Mechatrommer

  • Super Contributor
  • ***
  • Posts: 11632
  • Country: my
  • reassessing directives...
Quote
CoIDE is just an IDE. The actual implementation is done by the compiler(s) you hook to CoIDE, just like any other IDE.

The implementation is done by the library that comes with the compiler; the compiler is just that, it compiles. To know exactly what's going on, one needs to dig into the library. If one is lucky, it will be there as plain source code; if one is unlucky, only as object files or other precompiled artifacts.
Nature: Evolution and the Illusion of Randomness (Stephen L. Talbott): Its now indisputable that... organisms “expertise” contextualizes its genome, and its nonsense to say that these powers are under the control of the genome being contextualized - Barbara McClintock
 

Offline kodyTopic starter

  • Contributor
  • Posts: 34
  • Country: ca
I have used the arm-gcc compiler. I have tried searching online for a good description of these functions in source code, but to no avail. Could you please point me to any further sources?

Math.h is the library
 

Offline westfw

  • Super Contributor
  • ***
  • Posts: 4199
  • Country: us
I believe that by default, the gcc compilers use a set of generic software floating point code as part of libgcc (https://gcc.gnu.org/onlinedocs/gccint/Soft-float-library-routines.html), and higher-level math functions as part of GNU libc (http://www.gnu.org/software/libc/). All nice and portable, but not particularly fast or small.

Technically, these are separate from the compiler, and in fact are one of the things that a value-added compiler vendor might provide as "improved" versions. For example, avr-gcc has its own complete set of floating point routines. However, CoIDE expects you to download the ARM compiler from launchpad.net, which claims to be using newlib or newlib-nano, so you can find the code there, e.g.: https://github.com/32bitmicro/newlib-nano-1.0/tree/master/newlib/libm/math
It takes a brave (or desperate) vendor or engineer to implement new floating point math code...

Different libs for chips with HW floating point, of course. And CMSIS-DSP has some specially optimized functions (apparently not direct replacements for the libm functions, though). CooCox does seem to have a bunch of peripheral libraries that replace the usual vendor libraries, but I don't see alternate floating point functions.
 

Offline 0xdeadbeef

  • Super Contributor
  • ***
  • Posts: 1576
  • Country: de
Quote
How are math.h library functions like sqrt, asinf, tanf implemented for the CoIDE compiler?
I need further information about the internal implementation (e.g., Taylor series?) in order to calculate the number of cycles a program that uses these functions takes on FPU-enabled Cortex-M4 based MCUs.

At least for sqrt, usually a variant of Newton's method is used to calculate the inverse square root.
See:  http://en.wikipedia.org/wiki/Fast_inverse_square_root

Well, this is at least true for SW implementations, but I would assume that HW implementations use the same approach.
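For illustration, here is a minimal C sketch of that trick (the magic-constant initial guess plus one Newton-Raphson refinement, as on the Wikipedia page; not necessarily what any particular libm ships):
Code: [Select]
#include <stdint.h>
#include <string.h>

/* Sketch of the classic "fast inverse square root": a bit-level initial
   guess refined with one Newton-Raphson iteration. Illustration only. */
float fast_rsqrt(float x)
{
    float halfx = 0.5f * x;
    uint32_t i;
    memcpy(&i, &x, sizeof i);       /* reinterpret the float's bits */
    i = 0x5f3759df - (i >> 1);      /* magic initial guess for 1/sqrt(x) */
    memcpy(&x, &i, sizeof x);
    x = x * (1.5f - halfx * x * x); /* one Newton-Raphson step */
    return x;
}
A second Newton-Raphson step buys more accuracy for a few extra multiplies, and sqrt(x) then falls out as x * fast_rsqrt(x).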
Trying is the first step towards failure - Homer J. Simpson
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
Quote
I have used the arm-gcc compiler. I have tried searching online for a good description of these functions in source code, but to no avail. Could you please point me to any further sources?

Math.h is the library

No. Math.h is a so-called header file that only declares the functions. The actual library is libm. You have to look for the sources of libm. This can be tricky because the implementation depends on whether the processor you are using has a floating point unit or not.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline kodyTopic starter

  • Contributor
  • Posts: 34
  • Country: ca
The processor I used is an STM32F4; it has an FPU, but we have the option of enabling it or not.

So, would looking at the documentation (or source) for these functions in the libm code give some idea of how to calculate the number of cycles a particular function takes?

For example, for the sinf function with the FPU enabled, can we calculate the number of clock cycles consumed?
 

Offline kodyTopic starter

  • Contributor
  • Posts: 34
  • Country: ca
Thanks @westfw! That was primarily what I was concerned with.

Quote
I believe that by default, the gcc compilers use a set of generic software floating point code as part of libgcc (https://gcc.gnu.org/onlinedocs/gccint/Soft-float-library-routines.html), and higher-level math functions as part of GNU libc (http://www.gnu.org/software/libc/). All nice and portable, but not particularly fast or small.

Technically, these are separate from the compiler, and in fact are one of the things that a value-added compiler vendor might provide as "improved" versions. For example, avr-gcc has its own complete set of floating point routines. However, CoIDE expects you to download the ARM compiler from launchpad.net, which claims to be using newlib or newlib-nano, so you can find the code there, e.g.: https://github.com/32bitmicro/newlib-nano-1.0/tree/master/newlib/libm/math
It takes a brave (or desperate) vendor or engineer to implement new floating point math code...

Different libs for chips with HW floating point, of course. And CMSIS-DSP has some specially optimized functions (apparently not direct replacements for the libm functions, though). CooCox does seem to have a bunch of peripheral libraries that replace the usual vendor libraries, but I don't see alternate floating point functions.
 

Offline zapta

  • Super Contributor
  • ***
  • Posts: 6190
  • Country: us
Quote
The processor I used is an STM32F4; it has an FPU, but we have the option of enabling it or not.

So, would looking at the documentation (or source) for these functions in the libm code give some idea of how to calculate the number of cycles a particular function takes?

For example, for the sinf function with the FPU enabled, can we calculate the number of clock cycles consumed?

Can't you just run a benchmark and extract the numbers from it?
 

Offline 0xdeadbeef

  • Super Contributor
  • ***
  • Posts: 1576
  • Country: de
Well, a quick search brings us to the manual:
http://www.st.com/st-web-ui/static/active/jp/resource/technical/document/application_note/DM00047230.pdf

From what I see there, it's a single-precision FPU with add/sub/mul/div and sqrt plus some MAC instructions, but no trigonometric functions.
Quoting the manual, the cycles needed are:
Code: [Select]
• Absolute value (1 cycle)
• Negate of a float or of multiple floats (1 cycle)
• Addition (1 cycle)
• Subtraction (1 cycle)
• Multiply, multiply accumulate/subtract, multiply accumulate/subtract then negate (3 cycles)
• Divide (14 cycles)
• Square root (14 cycles)
Obviously, if they are part of a math library, this might need some additional cycles, e.g. checking that the value passed to square root is positive, etc.
Anyway, the above functions are relatively "cheap" cycle-wise. The trigonometric functions are then probably implemented using Taylor series and are accordingly expensive.
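As a quick hedged illustration: with the FPU enabled via the usual GCC flags, a plain sqrtf() call should boil down to little more than that 14-cycle square root instruction (check the disassembly on your own toolchain to confirm):
Code: [Select]
/* Compile with e.g.:
 *   arm-none-eabi-gcc -O2 -mcpu=cortex-m4 -mfpu=fpv4-sp-d16 \
 *       -mfloat-abi=hard -fno-math-errno -c sqrt_test.c
 * -fno-math-errno lets GCC drop the errno error path and emit a
 * bare vsqrt.f32 instead of calling into libm. */
#include <math.h>

float hw_sqrt(float x)
{
    return sqrtf(x);   /* single-precision sqrt: should map to VSQRT */
}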
Trying is the first step towards failure - Homer J. Simpson
 

Offline dannyf

  • Super Contributor
  • ***
  • Posts: 8221
  • Country: 00
Quote
For example- for sinf function with fpu enabled - can we calculate the number of clock cycles consumed for it?

Yeah: most simulators will provide exact cycle counts for you, and you can use timers too; SysTick comes in handy for things like this.
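For example, a rough sketch of the SysTick approach (register addresses per the ARMv7-M manual; with CMSIS you'd use SysTick->VAL etc. instead):
Code: [Select]
#include <stdint.h>
#include <math.h>

#define SYST_CSR (*(volatile uint32_t *)0xE000E010) /* control/status */
#define SYST_RVR (*(volatile uint32_t *)0xE000E014) /* reload value  */
#define SYST_CVR (*(volatile uint32_t *)0xE000E018) /* current value */

volatile float sink;   /* keeps the call from being optimized away */

uint32_t cycles_for_sinf(float x)
{
    SYST_RVR = 0x00FFFFFF;      /* maximum 24-bit reload */
    SYST_CVR = 0;               /* any write clears the counter */
    SYST_CSR = 5;               /* enable, clocked from the CPU clock */
    uint32_t start = SYST_CVR;
    sink = sinf(x);
    uint32_t end = SYST_CVR;
    return start - end;         /* SysTick counts DOWN */
}
Run it once with an empty measurement as well, and subtract that overhead of reading the counter itself.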
================================
https://dannyelectronics.wordpress.com/
 

Offline kodyTopic starter

  • Contributor
  • Posts: 34
  • Country: ca
Hi dannyf,

When you say most simulators, could you please give some examples, like open-source simulator software?

Thanks

Quote
For example- for sinf function with fpu enabled - can we calculate the number of clock cycles consumed for it?

Yeah: most simulators will provide exact cycle counts for you, and you can use timers too; SysTick comes in handy for things like this.
 

Offline kodyTopic starter

  • Contributor
  • Posts: 34
  • Country: ca
Can we run the benchmarks from the CoIDE compiler?

Quote
The processor I used is an STM32F4; it has an FPU, but we have the option of enabling it or not.

So, would looking at the documentation (or source) for these functions in the libm code give some idea of how to calculate the number of cycles a particular function takes?

For example, for the sinf function with the FPU enabled, can we calculate the number of clock cycles consumed?

Can't you just run a benchmark and extract the numbers from it?
 

Offline kodyTopic starter

  • Contributor
  • Posts: 34
  • Country: ca
Thanks for the link, 0xdeadbeef. I have it too. The problem is that I do not know exactly how the asinf and tanf functions are implemented, so I can only keep guessing with these numbers.

Quote
Well, a quick search brings us to the manual:
http://www.st.com/st-web-ui/static/active/jp/resource/technical/document/application_note/DM00047230.pdf

From what I see there, it's a single-precision FPU with add/sub/mul/div and sqrt plus some MAC instructions, but no trigonometric functions.
Quoting the manual, the cycles needed are:
Code: [Select]
• Absolute value (1 cycle)
• Negate of a float or of multiple floats (1 cycle)
• Addition (1 cycle)
• Subtraction (1 cycle)
• Multiply, multiply accumulate/subtract, multiply accumulate/subtract then negate (3 cycles)
• Divide (14 cycles)
• Square root (14 cycles)
Obviously, if they are part of a math library, this might need some additional cycles, e.g. checking that the value passed to square root is positive, etc.
Anyway, the above functions are relatively "cheap" cycle-wise. The trigonometric functions are then probably implemented using Taylor series and are accordingly expensive.
 

Online T3sl4co1l

  • Super Contributor
  • ***
  • Posts: 21684
  • Country: us
  • Expert, Analog Electronics, PCB Layout, EMC
    • Seven Transistor Labs
If you need to know implementation... inspect the implementation!

Find the library and disassemble the routines, or at worst, write a "simplest example program" incorporating the function of interest, and disassemble that (the machine code output is then necessarily part of the project, and any IDE worth its salt will provide a disassembled view of that output).

If you can't read ARM assembler... better get a book/guide and a pot of coffee... :)
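For instance, a minimal test file plus the usual GNU binutils invocation (assuming the launchpad arm-none-eabi toolchain; adjust flags to your target):
Code: [Select]
/* Build and disassemble, e.g.:
 *   arm-none-eabi-gcc -O2 -mcpu=cortex-m4 -mfpu=fpv4-sp-d16 \
 *       -mfloat-abi=hard --specs=nosys.specs tanf_test.c -lm \
 *       -o tanf_test.elf
 *   arm-none-eabi-objdump -d tanf_test.elf | less
 * Then find tanf in the listing and read its instructions. */
#include <math.h>

volatile float in = 0.5f, out;  /* volatile so -O2 keeps the call */

int main(void)
{
    out = tanf(in);
    return 0;
}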

Tim
Seven Transistor Labs, LLC
Electronic design, from concept to prototype.
Bringing a project to life?  Send me a message!
 

Offline 0xdeadbeef

  • Super Contributor
  • ***
  • Posts: 1576
  • Country: de
Quote
Thanks for the link, 0xdeadbeef. I have it too. The problem is that I do not know exactly how the asinf and tanf functions are implemented, so I can only keep guessing with these numbers.

It's clear that these functions will take a lot of cycles. Indeed, if they're implemented with an accuracy threshold, the number of loops needed for the approximation depends on the actual value.
Not the correct repository, but to get an impression:
http://svnweb.cern.ch/world/wsvn/vdt/trunk/include/tan.h
There is a fast tangent implementation there called fast_tanfm which avoids the iterations and is mostly linear.
This fast implementation has about 15 multiplications and one division, plus a lot of additions and integer operations. I'd assume this will cost at least 100 cycles in sum, probably more.
If the non-fast version with iterations is used, the runtime behavior will be much more unpredictable, and thus it should not be used in a realtime environment.
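To give a feel for the fixed-cost, "mostly linear" style, here is a hypothetical polynomial sketch (plain Taylor coefficients for tan on a small range, evaluated with Horner's scheme; real code like fast_tanfm uses tuned minimax constants and does range reduction first):
Code: [Select]
/* tan(x) ~= x + x^3/3 + 2x^5/15 + 17x^7/315, small |x| only.
   Fixed instruction count: no data-dependent iteration. */
float tan_poly_small(float x)
{
    float x2 = x * x;
    return x * (1.0f + x2 * (1.0f / 3.0f
                  + x2 * (2.0f / 15.0f
                  + x2 * (17.0f / 315.0f))));
}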
Trying is the first step towards failure - Homer J. Simpson
 

Offline dannyf

  • Super Contributor
  • ***
  • Posts: 8221
  • Country: 00
Quote
Can we run the benchmarks from the CoIDE compiler?

Yes, if you have a hardware debugger like st-link or jlink, + the actual hardware.
================================
https://dannyelectronics.wordpress.com/
 

Offline theoldwizard1

  • Regular Contributor
  • *
  • Posts: 172
Quote
How are math.h library functions like sqrt, asinf, tanf implemented for the CoIDE compiler?

Complex math functions are typically implemented using Taylor series. Slow, but accurate.
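For the record, a bare-bones sketch of the Taylor-series idea for sin, iterating until the terms become negligible (real libm code does range reduction and uses a fixed polynomial instead):
Code: [Select]
#include <math.h>

/* sin(x) = x - x^3/3! + x^5/5! - ...; each term is the previous one
   times -x^2/((2n)(2n+1)). Assumes x already reduced to a small range. */
float sin_taylor(float x)
{
    float term = x, sum = x;
    for (int n = 1; fabsf(term) > 1e-7f; n++) {
        term *= -x * x / (float)((2 * n) * (2 * n + 1));
        sum += term;
    }
    return sum;
}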
 

Offline zapta

  • Super Contributor
  • ***
  • Posts: 6190
  • Country: us
Quote
How are math.h library functions like sqrt, asinf, tanf implemented for the CoIDE compiler?

Complex math functions are typically implemented using Taylor series. Slow, but accurate.

OP, you can use an implementation that trades accuracy for speed, for example table-driven piecewise interpolation.
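Something like this hypothetical sine lookup, for instance (the table size and the linear interpolation are the accuracy/speed knobs):
Code: [Select]
#include <math.h>

#define TWO_PI     6.283185307f
#define TABLE_SIZE 256

static float sin_table[TABLE_SIZE + 1];  /* one full period, inclusive */

void sin_table_init(void)
{
    for (int i = 0; i <= TABLE_SIZE; i++)
        sin_table[i] = sinf(TWO_PI * i / TABLE_SIZE);
}

float sin_lut(float x)   /* x in radians, any value */
{
    float t = x / TWO_PI;
    t -= floorf(t);                  /* wrap phase into [0,1) */
    float pos  = t * TABLE_SIZE;
    int   i    = (int)pos;
    float frac = pos - i;
    return sin_table[i] + frac * (sin_table[i + 1] - sin_table[i]);
}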
 

Offline kodyTopic starter

  • Contributor
  • Posts: 34
  • Country: ca
@dannyf - is it not possible if we only have the compiler? I mean, without the actual hardware/microcontroller.



Quote
Can we run the benchmarks from the CoIDE compiler?

Yes, if you have a hardware debugger like st-link or jlink, + the actual hardware.
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19497
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Quote
How are math.h library functions like sqrt, asinf, tanf implemented for the CoIDE compiler?

Complex math functions are typically implemented using Taylor series. Slow, but accurate.

Or, where multiplications are expensive, CORDIC is used for some functions. http://en.wikipedia.org/wiki/CORDIC
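A minimal fixed-point sketch of the idea (Q16 format, 16 iterations, constants rounded from atan(2^-i); valid for inputs within about ±1.7 rad, so use symmetries for the full circle):
Code: [Select]
#include <stdint.h>

/* CORDIC rotation mode: computes sin and cos with shifts and adds only.
   Angle and results are Q16 fixed point; atan_tab[i] = atan(2^-i) in Q16. */
static const int32_t atan_tab[16] = {
    51472, 30386, 16055, 8150, 4091, 2047, 1024, 512,
    256, 128, 64, 32, 16, 8, 4, 2
};

void cordic_sincos(int32_t angle, int32_t *s, int32_t *c)
{
    int32_t x = 39797;   /* 1/gain ~= 0.60725 in Q16 */
    int32_t y = 0;
    for (int i = 0; i < 16; i++) {
        int32_t dx = y >> i, dy = x >> i;
        if (angle >= 0) { x -= dx; y += dy; angle -= atan_tab[i]; }
        else            { x += dx; y -= dy; angle += atan_tab[i]; }
    }
    *c = x;   /* cos(angle) in Q16 */
    *s = y;   /* sin(angle) in Q16 */
}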
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Offline westfw

  • Super Contributor
  • ***
  • Posts: 4199
  • Country: us
Quote
inspect the implementation!
Find the library and disassemble the routines
This is likely to be difficult, especially if you're using the GNU libraries. The sort of code that is good for portability is frequently not very good for readability (although the disassembled object code might be better than the source...)

Keil has a free evaluation version (up to 32k code size), with a simulator, that could probably be used to measure this without having actual hardware. OTOH, Keil may have their own libraries. (On the third hand, Keil is now owned by ARM, so they may be distributing the same libraries.) I guess that, at least in theory, you could load binaries produced by other compilers into the Keil simulator...
 

Offline dannyf

  • Super Contributor
  • ***
  • Posts: 8221
  • Country: 00
Quote
is it not possible if we only have the compiler?

Yes. By looking at the disassembly listing and counting up the instruction cycles.

Quote
I mean, without the actual hardware/microcontroller.

A far better approach is to get the actual hardware.
================================
https://dannyelectronics.wordpress.com/
 

Online Mechatrommer

  • Super Contributor
  • ***
  • Posts: 11632
  • Country: my
  • reassessing directives...
Quote
I mean, without the actual hardware/microcontroller.
A far better approach is to get the actual hardware.
Agreed, but it's not compulsory. Setting the target hardware at compile time may produce what's wanted; if the target supports fptan, you'll see it in the disassembly.
Nature: Evolution and the Illusion of Randomness (Stephen L. Talbott): Its now indisputable that... organisms “expertise” contextualizes its genome, and its nonsense to say that these powers are under the control of the genome being contextualized - Barbara McClintock
 


Offline westfw

  • Super Contributor
  • ***
  • Posts: 4199
  • Country: us
Quote
FreeBSD math library
It looks to me like the newlib-nano libraries and the FreeBSD libraries are essentially the same old Sun code.
 

Offline kodyTopic starter

  • Contributor
  • Posts: 34
  • Country: ca
@dannyf - could you please expand on how looking at the disassembly listing and counting the instruction cycles can be done?

Quote
is it not possible if we only have the compiler?

Yes. By looking at the disassembly listing and counting up the instruction cycles.

Quote
I mean, without the actual hardware/microcontroller.

A far better approach is to get the actual hardware.
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19497
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Quote
@dannyf - could you please expand on how looking at the disassembly listing and counting the instruction cycles can be done?

Especially if the processor has a cache :(
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Offline andersm

  • Super Contributor
  • ***
  • Posts: 1198
  • Country: fi
There are several options:
- Use a high-frequency counter. The simplest, and works on any chip. On many MCUs you can have timers ticking at the CPU frequency or f/2.
- Use tracing. Both the MCU and your development tools have to support this.
- Use performance counters. Many MCUs have these nowadays, and they can count instruction cycles, memory access cycles and all kinds of things.

The ARMv7-M DWT (Data Watchpoint and Trace unit) has both a cycle count timer and performance counters that can be used by software running on the MCU. Although it is optional, I would expect that all Cortex-M4 MCUs have it.
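For reference, a hedged sketch of enabling that DWT cycle counter by poking the registers directly (addresses from the ARMv7-M ARM; CMSIS exposes the same thing as DWT->CYCCNT):
Code: [Select]
#include <stdint.h>

#define DEMCR      (*(volatile uint32_t *)0xE000EDFC)
#define DWT_CTRL   (*(volatile uint32_t *)0xE0001000)
#define DWT_CYCCNT (*(volatile uint32_t *)0xE0001004)

void cyccnt_init(void)
{
    DEMCR      |= (1u << 24);   /* TRCENA: power up the DWT/ITM blocks */
    DWT_CYCCNT  = 0;
    DWT_CTRL   |= 1u;           /* CYCCNTENA: start counting CPU cycles */
}

/* Usage:
 *   uint32_t t0 = DWT_CYCCNT;
 *   out = sinf(in);
 *   uint32_t cycles = DWT_CYCCNT - t0;
 */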

Offline 0xdeadbeef

  • Super Contributor
  • ***
  • Posts: 1576
  • Country: de
Quote
Especially if the processor has a cache :(

Obviously, a real measurement will always differ due to interrupts, pipeline effects, branch prediction, cache, wait states, DMA blocking the bus or RAM, etc.
Anyway, looking at the code makes it possible to better estimate the number of cycles needed. As stated above, even looking at the C code for a "fast" tangent implementation allows one to say it will need > 100 cycles. With the actual source code the prediction will be better, and with the ASM code it can be quite accurate - if you don't consider the complex runtime effects discussed above.
Trying is the first step towards failure - Homer J. Simpson
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19497
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Quote
Especially if the processor has a cache :(

Obviously, a real measurement will always differ due to interrupts, pipeline effects, branch prediction, cache, wait states, DMA blocking the bus or RAM, etc.
Anyway, looking at the code makes it possible to better estimate the number of cycles needed. As stated above, even looking at the C code for a "fast" tangent implementation allows one to say it will need > 100 cycles. With the actual source code the prediction will be better, and with the ASM code it can be quite accurate - if you don't consider the complex runtime effects discussed above.

Even 20 years ago, in measurements on an i486 (with its tiny cache) doing nothing else, there was a measured 10:1 difference between mean and maximum times. Modern processors have much faster clocks, but DRAM memory latency hasn't changed. Processors have much bigger caches and are more dependent on them to reduce the average memory latency. Naturally, caches cannot change the maximum latency.

Hence the maximum:mean ratio has increased significantly, and predicted execution times are even less valid than before.

Remember the truism: "cache is the new RAM, RAM is the new disk".
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Offline 0xdeadbeef

  • Super Contributor
  • ***
  • Posts: 1576
  • Country: de
The situation is a little different on microcontrollers. Some still don't have any cache at all, some have very simple implementations; only the high-end controllers have complex ones.
Generally, the internal SRAM is usually not cached; cache is mainly needed to improve performance when running from flash. Note that fetching instructions from flash is a bottleneck for most faster microcontrollers. They usually use a burst read to fill a whole cache line, but if there are a lot of branches and/or bad branch prediction, a cache miss can be a big performance hit.
Trying is the first step towards failure - Homer J. Simpson
 

Offline mikerj

  • Super Contributor
  • ***
  • Posts: 3240
  • Country: gb
Quote
@dannyf - is it not possible if we only have the compiler? I mean, without the actual hardware/microcontroller.

You could use a simulator if one were available, but I don't think CoIDE includes this functionality. You could use the demo version of Keil etc. if your code fits into the space limitations.
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19497
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Quote
The situation is a little different on microcontrollers. Some still don't have any cache at all, some have very simple implementations; only the high-end controllers have complex ones.
Generally, the internal SRAM is usually not cached; cache is mainly needed to improve performance when running from flash. Note that fetching instructions from flash is a bottleneck for most faster microcontrollers. They usually use a burst read to fill a whole cache line, but if there are a lot of branches and/or bad branch prediction, a cache miss can be a big performance hit.

Yes, as I implied in my first response.

OTOH, many microcontrollers have already surpassed the i486 in terms of cache. The current microcontroller I am using, in a Zynq FPGA, is a dual-core ARM Cortex-A9, each core having 32K+32K I+D cache. (The cheapest ARM, IIRC, costs <$1.) That trend will continue, although there will always be some MCUs that don't have/need cache.

More interestingly, some actively avoid cache due to its "poor" behaviour in hard realtime systems, e.g. the very small and cheap XMOS processors with 2-10 cores. http://www.digikey.co.uk/product-search/en/integrated-circuits-ics/embedded-microcontrollers/2556109?k=xmos

Those XMOS processors are the only ones I know where the compiler/IDE guarantees the execution time. With all other processors, all bets are off.
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Online gmb42

  • Frequent Contributor
  • **
  • Posts: 294
  • Country: gb
In a post here, Red Hat explains how they've improved the performance of some math functions in glibc. Eventually, I suppose, these will filter down to newlib/newlib-nano etc.
 

