Author Topic: How are math.h library functions like sqrt, asinf, tanf implemented in CoIDE


Offline kodyTopic starter

  • Contributor
  • Posts: 34
  • Country: ca
How are math.h library functions like sqrt, asinf, tanf implemented for the CoIDE compiler?

I need further information about the internal implementation (e.g., Taylor series?) in order to calculate the number of cycles a program that uses these functions takes on FPU-enabled Cortex-M4 based MCUs.
 

Offline dannyf

  • Super Contributor
  • ***
  • Posts: 8221
  • Country: 00
CoIDE is just an IDE. The actual implementation is done by the compiler(s) you hook to CoIDE, just like any other IDE.
================================
https://dannyelectronics.wordpress.com/
 

Online Mechatrommer

  • Super Contributor
  • ***
  • Posts: 11632
  • Country: my
  • reassessing directives...
Quote
CoIDE is just an IDE. The actual implementation is done by the compiler(s) you hook to CoIDE, just like any other IDE.

The implementation is done by the library that comes with the compiler; the compiler is just that, it compiles. To know exactly what's going on, one needs to dig into the library. If one is lucky, it will be there as plain source code; if one is unlucky, only as object files or other precompiled artifacts.
Nature: Evolution and the Illusion of Randomness (Stephen L. Talbott): Its now indisputable that... organisms “expertise” contextualizes its genome, and its nonsense to say that these powers are under the control of the genome being contextualized - Barbara McClintock
 

Offline kodyTopic starter

  • Contributor
  • Posts: 34
  • Country: ca
I have used the arm-gcc compiler. I have tried searching online for a good description of these functions in source code, but to no avail. Could you please point me to any further sources?

Math.h is the library
 

Offline westfw

  • Super Contributor
  • ***
  • Posts: 4199
  • Country: us
I believe that by default, the gcc compilers use a set of generic software floating point code as part of libgcc (https://gcc.gnu.org/onlinedocs/gccint/Soft-float-library-routines.html), and higher-level math functions as part of GNU libc (http://www.gnu.org/software/libc/). All nice and portable, but not particularly fast or small.

Technically, these are separate from the compiler, and in fact are one of the things that a value-added compiler vendor might provide as "improved" versions. For example, avr-gcc has its own complete set of floating point routines. However, CoIDE expects you to download the ARM compiler from launchpad.net, which claims to be using newlib or newlib-nano, so you can find the code there, e.g.: https://github.com/32bitmicro/newlib-nano-1.0/tree/master/newlib/libm/math
It takes a brave (or desperate) vendor or engineer to implement new floating point math code...

Different libs for chips with HW floating point, of course. And CMSIS-DSP has some specially optimized functions (apparently not direct replacements for the libm functions, though). CooCox does seem to have a bunch of peripheral libraries that replace the usual vendor libraries, but I don't see alternate floating point functions.
 

Offline 0xdeadbeef

  • Super Contributor
  • ***
  • Posts: 1576
  • Country: de
Quote
How are math.h library functions like sqrt, asinf, tanf implemented for the CoIDE compiler?
I need further information about the internal implementation (e.g., Taylor series?) in order to calculate the number of cycles a program that uses these functions takes on FPU-enabled Cortex-M4 based MCUs.

At least for sqrt, usually a variant of Newton's method is used to calculate the inverse square root.
See:  http://en.wikipedia.org/wiki/Fast_inverse_square_root

Well, this is at least true for SW implementations, but I would assume that HW implementations use the same approach.
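For illustration, here is a minimal C sketch of that trick (the magic-constant initial guess plus one Newton-Raphson refinement, as on the Wikipedia page; not necessarily what any particular libm ships):
Code: [Select]
#include <stdint.h>
#include <string.h>

/* Sketch of the classic "fast inverse square root": a bit-level initial
   guess refined with one Newton-Raphson iteration. Illustration only. */
float fast_rsqrt(float x)
{
    float halfx = 0.5f * x;
    uint32_t i;
    memcpy(&i, &x, sizeof i);       /* reinterpret the float's bits */
    i = 0x5f3759df - (i >> 1);      /* magic initial guess for 1/sqrt(x) */
    memcpy(&x, &i, sizeof x);
    x = x * (1.5f - halfx * x * x); /* one Newton-Raphson step */
    return x;
}
A second Newton-Raphson step buys more accuracy for a few extra multiplies, and sqrt(x) then falls out as x * fast_rsqrt(x).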
Trying is the first step towards failure - Homer J. Simpson
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
Quote
I have used the arm-gcc compiler. I have tried searching online for a good description of these functions in source code, but to no avail. Could you please point me to any further sources?

Math.h is the library

No. Math.h is a so-called header file that only declares the functions. The actual library is libm. You have to look for the sources of libm. This can be tricky because the implementation depends on whether the processor you are using has a floating point unit or not.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline kodyTopic starter

  • Contributor
  • Posts: 34
  • Country: ca
The processor I used is an STM32F4; it has an FPU, but we have the option of enabling it or not.

So, would looking at the documentation (or source) for these functions in the libm code give some idea of how to calculate the number of cycles a particular function takes?

For example, for the sinf function with the FPU enabled, can we calculate the number of clock cycles consumed?
 

Offline kodyTopic starter

  • Contributor
  • Posts: 34
  • Country: ca
Thanks @westfw! That was primarily what I was concerned with.

Quote
I believe that by default, the gcc compilers use a set of generic software floating point code as part of libgcc (https://gcc.gnu.org/onlinedocs/gccint/Soft-float-library-routines.html), and higher-level math functions as part of GNU libc (http://www.gnu.org/software/libc/). All nice and portable, but not particularly fast or small.

Technically, these are separate from the compiler, and in fact are one of the things that a value-added compiler vendor might provide as "improved" versions. For example, avr-gcc has its own complete set of floating point routines. However, CoIDE expects you to download the ARM compiler from launchpad.net, which claims to be using newlib or newlib-nano, so you can find the code there, e.g.: https://github.com/32bitmicro/newlib-nano-1.0/tree/master/newlib/libm/math
It takes a brave (or desperate) vendor or engineer to implement new floating point math code...

Different libs for chips with HW floating point, of course. And CMSIS-DSP has some specially optimized functions (apparently not direct replacements for the libm functions, though). CooCox does seem to have a bunch of peripheral libraries that replace the usual vendor libraries, but I don't see alternate floating point functions.
 

Offline zapta

  • Super Contributor
  • ***
  • Posts: 6190
  • Country: us
Quote
The processor I used is an STM32F4; it has an FPU, but we have the option of enabling it or not.

So, would looking at the documentation (or source) for these functions in the libm code give some idea of how to calculate the number of cycles a particular function takes?

For example, for the sinf function with the FPU enabled, can we calculate the number of clock cycles consumed?

Can't you just run a benchmark and extract the numbers from it?
 

Offline 0xdeadbeef

  • Super Contributor
  • ***
  • Posts: 1576
  • Country: de
Well, a quick search brings us to the manual:
http://www.st.com/st-web-ui/static/active/jp/resource/technical/document/application_note/DM00047230.pdf

From what I see there, it's a single-precision FPU with add/sub/mul/div and sqrt plus some MAC instructions, but no trigonometric functions.
Quoting the manual, the cycles needed are:
Code: [Select]
• Absolute value (1 cycle)
• Negate of a float or of multiple floats (1 cycle)
• Addition (1 cycle)
• Subtraction (1 cycle)
• Multiply, multiply accumulate/subtract, multiply accumulate/subtract then negate (3 cycles)
• Divide (14 cycles)
• Square root (14 cycles)
Obviously, if they are part of a math library, this might need some additional cycles, e.g. checking that the value passed to square root is positive, etc.
Anyway, the above functions are relatively "cheap" cycle-wise. The trigonometric functions are then probably implemented using Taylor series and are accordingly expensive.
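As a quick hedged illustration: with the FPU enabled via the usual GCC flags, a plain sqrtf() call should boil down to little more than that 14-cycle square root instruction (check the disassembly on your own toolchain to confirm):
Code: [Select]
/* Compile with e.g.:
 *   arm-none-eabi-gcc -O2 -mcpu=cortex-m4 -mfpu=fpv4-sp-d16 \
 *       -mfloat-abi=hard -fno-math-errno -c sqrt_test.c
 * -fno-math-errno lets GCC drop the errno error path and emit a
 * bare vsqrt.f32 instead of calling into libm. */
#include <math.h>

float hw_sqrt(float x)
{
    return sqrtf(x);   /* single-precision sqrt: should map to VSQRT */
}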
Trying is the first step towards failure - Homer J. Simpson
 

Offline dannyf

  • Super Contributor
  • ***
  • Posts: 8221
  • Country: 00
Quote
For example- for sinf function with fpu enabled - can we calculate the number of clock cycles consumed for it?

Yeah: most simulators will provide exact cycle counts for you, and you can use timers too; SysTick comes in handy for things like this.
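For example, a rough sketch of the SysTick approach (register addresses per the ARMv7-M manual; with CMSIS you'd use SysTick->VAL etc. instead):
Code: [Select]
#include <stdint.h>
#include <math.h>

#define SYST_CSR (*(volatile uint32_t *)0xE000E010) /* control/status */
#define SYST_RVR (*(volatile uint32_t *)0xE000E014) /* reload value  */
#define SYST_CVR (*(volatile uint32_t *)0xE000E018) /* current value */

volatile float sink;   /* keeps the call from being optimized away */

uint32_t cycles_for_sinf(float x)
{
    SYST_RVR = 0x00FFFFFF;      /* maximum 24-bit reload */
    SYST_CVR = 0;               /* any write clears the counter */
    SYST_CSR = 5;               /* enable, clocked from the CPU clock */
    uint32_t start = SYST_CVR;
    sink = sinf(x);
    uint32_t end = SYST_CVR;
    return start - end;         /* SysTick counts DOWN */
}
Run it once with an empty measurement as well, and subtract that overhead of reading the counter itself.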
================================
https://dannyelectronics.wordpress.com/
 

Offline kodyTopic starter

  • Contributor
  • Posts: 34
  • Country: ca
Hi dannyf,

When you say most simulators, could you please give some examples, like open-source simulator software?

Thanks

Quote
For example- for sinf function with fpu enabled - can we calculate the number of clock cycles consumed for it?

Yeah: most simulators will provide exact cycle counts for you, and you can use timers too; SysTick comes in handy for things like this.
 

Offline kodyTopic starter

  • Contributor
  • Posts: 34
  • Country: ca
Can we run the benchmarks from the CoIDE compiler?

Quote
The processor I used is an STM32F4; it has an FPU, but we have the option of enabling it or not.

So, would looking at the documentation (or source) for these functions in the libm code give some idea of how to calculate the number of cycles a particular function takes?

For example, for the sinf function with the FPU enabled, can we calculate the number of clock cycles consumed?

Can't you just run a benchmark and extract the numbers from it?
 

Offline kodyTopic starter

  • Contributor
  • Posts: 34
  • Country: ca
Thanks for the link, 0xdeadbeef. I have it too. The problem is that I do not know exactly how the asinf and tanf functions are implemented, so I can only keep guessing with these numbers.

Quote
Well, a quick search brings us to the manual:
http://www.st.com/st-web-ui/static/active/jp/resource/technical/document/application_note/DM00047230.pdf

From what I see there, it's a single-precision FPU with add/sub/mul/div and sqrt plus some MAC instructions, but no trigonometric functions.
Quoting the manual, the cycles needed are:
Code: [Select]
• Absolute value (1 cycle)
• Negate of a float or of multiple floats (1 cycle)
• Addition (1 cycle)
• Subtraction (1 cycle)
• Multiply, multiply accumulate/subtract, multiply accumulate/subtract then negate (3 cycles)
• Divide (14 cycles)
• Square root (14 cycles)
Obviously, if they are part of a math library, this might need some additional cycles, e.g. checking that the value passed to square root is positive, etc.
Anyway, the above functions are relatively "cheap" cycle-wise. The trigonometric functions are then probably implemented using Taylor series and are accordingly expensive.
 

Online T3sl4co1l

  • Super Contributor
  • ***
  • Posts: 21684
  • Country: us
  • Expert, Analog Electronics, PCB Layout, EMC
    • Seven Transistor Labs
If you need to know implementation... inspect the implementation!

Find the library and disassemble the routines, or at worst, write a "simplest example program" incorporating the function of interest, and disassemble that (the machine code output is then necessarily part of the project, and any IDE worth its salt will provide a disassembled view of that output).

If you can't read ARM assembler... better get a book/guide and a pot of coffee... :)
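For instance, a minimal test file plus the usual GNU binutils invocation (assuming the launchpad arm-none-eabi toolchain; adjust flags to your target):
Code: [Select]
/* Build and disassemble, e.g.:
 *   arm-none-eabi-gcc -O2 -mcpu=cortex-m4 -mfpu=fpv4-sp-d16 \
 *       -mfloat-abi=hard --specs=nosys.specs tanf_test.c -lm \
 *       -o tanf_test.elf
 *   arm-none-eabi-objdump -d tanf_test.elf | less
 * Then find tanf in the listing and read its instructions. */
#include <math.h>

volatile float in = 0.5f, out;  /* volatile so -O2 keeps the call */

int main(void)
{
    out = tanf(in);
    return 0;
}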

Tim
Seven Transistor Labs, LLC
Electronic design, from concept to prototype.
Bringing a project to life?  Send me a message!
 

Offline 0xdeadbeef

  • Super Contributor
  • ***
  • Posts: 1576
  • Country: de
Quote
Thanks for the link, 0xdeadbeef. I have it too. The problem is that I do not know exactly how the asinf and tanf functions are implemented, so I can only keep guessing with these numbers.

It's clear that these functions will take a lot of cycles. Indeed, if they're implemented with an accuracy threshold, the number of loops needed for the approximation depends on the actual value.
Not the correct repository, but to get an impression:
http://svnweb.cern.ch/world/wsvn/vdt/trunk/include/tan.h
There is a fast tangent implementation there called fast_tanfm which avoids the iterations and is mostly linear.
This fast implementation has about 15 multiplications and one division, plus a lot of additions and integer operations. I'd assume this will cost at least 100 cycles in sum, probably more.
If the non-fast version with iterations is used, the runtime behavior will be much more unpredictable, and thus it should not be used in a realtime environment.
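To give a feel for the fixed-cost, "mostly linear" style, here is a hypothetical polynomial sketch (plain Taylor coefficients for tan on a small range, evaluated with Horner's scheme; real code like fast_tanfm uses tuned minimax constants and does range reduction first):
Code: [Select]
/* tan(x) ~= x + x^3/3 + 2x^5/15 + 17x^7/315, small |x| only.
   Fixed instruction count: no data-dependent iteration. */
float tan_poly_small(float x)
{
    float x2 = x * x;
    return x * (1.0f + x2 * (1.0f / 3.0f
                  + x2 * (2.0f / 15.0f
                  + x2 * (17.0f / 315.0f))));
}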
Trying is the first step towards failure - Homer J. Simpson
 

Offline dannyf

  • Super Contributor
  • ***
  • Posts: 8221
  • Country: 00
Quote
Can we run the benchmarks from the CoIDE compiler?

Yes, if you have a hardware debugger like st-link or jlink, + the actual hardware.
================================
https://dannyelectronics.wordpress.com/
 

Offline theoldwizard1

  • Regular Contributor
  • *
  • Posts: 172
Quote
How are math.h library functions like sqrt, asinf, tanf implemented for the CoIDE compiler?

Complex math functions are typically implemented using Taylor series. Slow, but accurate.
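For the record, a bare-bones sketch of the Taylor-series idea for sin, iterating until the terms become negligible (real libm code does range reduction and uses a fixed polynomial instead):
Code: [Select]
#include <math.h>

/* sin(x) = x - x^3/3! + x^5/5! - ...; each term is the previous one
   times -x^2/((2n)(2n+1)). Assumes x already reduced to a small range. */
float sin_taylor(float x)
{
    float term = x, sum = x;
    for (int n = 1; fabsf(term) > 1e-7f; n++) {
        term *= -x * x / (float)((2 * n) * (2 * n + 1));
        sum += term;
    }
    return sum;
}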
 

Offline zapta

  • Super Contributor
  • ***
  • Posts: 6190
  • Country: us
Quote
How are math.h library functions like sqrt, asinf, tanf implemented for the CoIDE compiler?

Complex math functions are typically implemented using Taylor series. Slow, but accurate.

OP, you can use an implementation that trades accuracy for speed, for example table-driven piecewise interpolation.
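Something like this hypothetical sine lookup, for instance (the table size and the linear interpolation are the accuracy/speed knobs):
Code: [Select]
#include <math.h>

#define TWO_PI     6.283185307f
#define TABLE_SIZE 256

static float sin_table[TABLE_SIZE + 1];  /* one full period, inclusive */

void sin_table_init(void)
{
    for (int i = 0; i <= TABLE_SIZE; i++)
        sin_table[i] = sinf(TWO_PI * i / TABLE_SIZE);
}

float sin_lut(float x)   /* x in radians, any value */
{
    float t = x / TWO_PI;
    t -= floorf(t);                  /* wrap phase into [0,1) */
    float pos  = t * TABLE_SIZE;
    int   i    = (int)pos;
    float frac = pos - i;
    return sin_table[i] + frac * (sin_table[i + 1] - sin_table[i]);
}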
 

Offline kodyTopic starter

  • Contributor
  • Posts: 34
  • Country: ca
@dannyf - is it not possible if we only have the compiler? I mean, without the actual hardware/microcontroller.



Quote
Can we run the benchmarks from the CoIDE compiler?

Yes, if you have a hardware debugger like st-link or jlink, + the actual hardware.
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19497
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Quote
How are math.h library functions like sqrt, asinf, tanf implemented for the CoIDE compiler?

Complex math functions are typically implemented using Taylor series. Slow, but accurate.

Or, where multiplications are expensive, CORDIC is used for some functions. http://en.wikipedia.org/wiki/CORDIC
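A minimal fixed-point sketch of the idea (Q16 format, 16 iterations, constants rounded from atan(2^-i); valid for inputs within about ±1.7 rad, so use symmetries for the full circle):
Code: [Select]
#include <stdint.h>

/* CORDIC rotation mode: computes sin and cos with shifts and adds only.
   Angle and results are Q16 fixed point; atan_tab[i] = atan(2^-i) in Q16. */
static const int32_t atan_tab[16] = {
    51472, 30386, 16055, 8150, 4091, 2047, 1024, 512,
    256, 128, 64, 32, 16, 8, 4, 2
};

void cordic_sincos(int32_t angle, int32_t *s, int32_t *c)
{
    int32_t x = 39797;   /* 1/gain ~= 0.60725 in Q16 */
    int32_t y = 0;
    for (int i = 0; i < 16; i++) {
        int32_t dx = y >> i, dy = x >> i;
        if (angle >= 0) { x -= dx; y += dy; angle -= atan_tab[i]; }
        else            { x += dx; y -= dy; angle += atan_tab[i]; }
    }
    *c = x;   /* cos(angle) in Q16 */
    *s = y;   /* sin(angle) in Q16 */
}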
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Offline westfw

  • Super Contributor
  • ***
  • Posts: 4199
  • Country: us
Quote
inspect the implementation!
Find the library and disassemble the routines
This is likely to be difficult, especially if you're using the GNU libraries. The sort of code that is good for portability is frequently not very good for readability (although the disassembled object code might be better than the source...)

Keil has a free evaluation version (up to 32k code size), with a simulator, that could probably be used to measure this without having actual hardware. OTOH, Keil may have their own libraries. (On the third hand, Keil is now owned by ARM, so they may be distributing the same libraries.) I guess that, at least in theory, you could load binaries produced by other compilers into the Keil simulator...
 

Offline dannyf

  • Super Contributor
  • ***
  • Posts: 8221
  • Country: 00
Quote
is it not possible if we only have the compiler?

Yes. By looking at the disassembly listing and counting up the instruction cycles.

Quote
I mean, without the actual hardware/microcontroller.

A far better approach is to get the actual hardware.
================================
https://dannyelectronics.wordpress.com/
 

Online Mechatrommer

  • Super Contributor
  • ***
  • Posts: 11632
  • Country: my
  • reassessing directives...
Quote
I mean, without the actual hardware/microcontroller.
A far better approach is to get the actual hardware.
Agreed, but it's not compulsory. Setting the target hardware at compile time may produce what's wanted; if the target supports fptan, you'll see it in the disassembly.
Nature: Evolution and the Illusion of Randomness (Stephen L. Talbott): Its now indisputable that... organisms “expertise” contextualizes its genome, and its nonsense to say that these powers are under the control of the genome being contextualized - Barbara McClintock
 


Offline westfw

  • Super Contributor
  • ***
  • Posts: 4199
  • Country: us
Quote
FreeBSD math library
It looks to me like the newlib-nano libraries and the FreeBSD libraries are essentially the same old Sun code.
 

Offline kodyTopic starter

  • Contributor
  • Posts: 34
  • Country: ca
@dannyf - could you please expand on how looking at the disassembly listing and counting the instruction cycles can be done?

Quote
is it not possible if we only have the compiler?

Yes. By looking at the disassembly listing and counting up the instruction cycles.

Quote
I mean, without the actual hardware/microcontroller.

A far better approach is to get the actual hardware.
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19497
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Quote
@dannyf - could you please expand on how looking at the disassembly listing and counting the instruction cycles can be done?

Especially if the processor has a cache :(
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Offline andersm

  • Super Contributor
  • ***
  • Posts: 1198
  • Country: fi
There are several options:
- Use a high-frequency counter. The simplest, and works on any chip. On many MCUs you can have timers ticking at the CPU frequency or f/2.
- Use tracing. Both the MCU and your development tools have to support this.
- Use performance counters. Many MCUs have these nowadays, and they can count instruction cycles, memory access cycles and all kinds of things.

The ARMv7-M DWT (Data Watchpoint and Trace unit) has both a cycle count timer and performance counters that can be used by software running on the MCU. Although it is optional, I would expect that all Cortex-M4 MCUs have it.
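For reference, a hedged sketch of enabling that DWT cycle counter by poking the registers directly (addresses from the ARMv7-M ARM; CMSIS exposes the same thing as DWT->CYCCNT):
Code: [Select]
#include <stdint.h>

#define DEMCR      (*(volatile uint32_t *)0xE000EDFC)
#define DWT_CTRL   (*(volatile uint32_t *)0xE0001000)
#define DWT_CYCCNT (*(volatile uint32_t *)0xE0001004)

void cyccnt_init(void)
{
    DEMCR      |= (1u << 24);   /* TRCENA: power up the DWT/ITM blocks */
    DWT_CYCCNT  = 0;
    DWT_CTRL   |= 1u;           /* CYCCNTENA: start counting CPU cycles */
}

/* Usage:
 *   uint32_t t0 = DWT_CYCCNT;
 *   out = sinf(in);
 *   uint32_t cycles = DWT_CYCCNT - t0;
 */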

Offline 0xdeadbeef

  • Super Contributor
  • ***
  • Posts: 1576
  • Country: de
Quote
Especially if the processor has a cache :(

Obviously, a real measurement will always differ due to interrupts, pipeline effects, branch prediction, cache, wait states, DMA blocking the bus or RAM, etc.
Anyway, looking at the code makes it possible to better estimate the number of cycles needed. As stated above, even looking at the C code for a "fast" tangent implementation allows one to say it will need > 100 cycles. With the actual source code the prediction will be better, and with the ASM code it can be quite accurate - if you don't consider the complex runtime effects discussed above.
Trying is the first step towards failure - Homer J. Simpson
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19497
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Quote
Especially if the processor has a cache :(

Obviously, a real measurement will always differ due to interrupts, pipeline effects, branch prediction, cache, wait states, DMA blocking the bus or RAM, etc.
Anyway, looking at the code makes it possible to better estimate the number of cycles needed. As stated above, even looking at the C code for a "fast" tangent implementation allows one to say it will need > 100 cycles. With the actual source code the prediction will be better, and with the ASM code it can be quite accurate - if you don't consider the complex runtime effects discussed above.

Even 20 years ago, in measurements on an i486 (with its tiny cache) doing nothing else, there was a measured 10:1 difference between mean and maximum times. Modern processors have much faster clocks, but DRAM memory latency hasn't changed. Processors have much bigger caches and are more dependent on them to reduce the average memory latency. Naturally, caches cannot change the maximum latency.

Hence the maximum:mean ratio has increased significantly, and predicted execution times are even less valid than before.

Remember the truism: "cache is the new RAM, RAM is the new disk".
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Offline 0xdeadbeef

  • Super Contributor
  • ***
  • Posts: 1576
  • Country: de
The situation is a little different on microcontrollers. Some still don't have any cache at all, some have very simple implementations; only the high-end controllers have complex ones.
Generally, the internal SRAM is usually not cached; cache is mainly needed to improve performance when running from flash. Note that fetching instructions from flash is a bottleneck for most faster microcontrollers. They usually use a burst read to fill a whole cache line, but if there are a lot of branches and/or bad branch prediction, a cache miss can be a big performance hit.
Trying is the first step towards failure - Homer J. Simpson
 

Offline mikerj

  • Super Contributor
  • ***
  • Posts: 3240
  • Country: gb
Quote
@dannyf - is it not possible if we only have the compiler? I mean, without the actual hardware/microcontroller.

You could use a simulator if one were available, but I don't think CoIDE includes this functionality. You could use the demo version of Keil etc. if your code fits into the space limitations.
 

Online tggzzz

  • Super Contributor
  • ***
  • Posts: 19497
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Quote
The situation is a little different on microcontrollers. Some still don't have any cache at all, some have very simple implementations; only the high-end controllers have complex ones.
Generally, the internal SRAM is usually not cached; cache is mainly needed to improve performance when running from flash. Note that fetching instructions from flash is a bottleneck for most faster microcontrollers. They usually use a burst read to fill a whole cache line, but if there are a lot of branches and/or bad branch prediction, a cache miss can be a big performance hit.

Yes, as I implied in my first response.

OTOH, many microcontrollers have already surpassed the i486 in terms of cache. The current microcontroller I am using, in a Zynq FPGA, is a dual-core ARM Cortex-A9, each core having 32K+32K I+D cache. (The cheapest ARM, IIRC, costs <$1.) That trend will continue, although there will always be some MCUs that don't have/need cache.

More interestingly, some actively avoid cache due to its "poor" behaviour in hard realtime systems, e.g. the very small and cheap XMOS processors with 2-10 cores. http://www.digikey.co.uk/product-search/en/integrated-circuits-ics/embedded-microcontrollers/2556109?k=xmos

Those XMOS processors are the only ones I know where the compiler/IDE guarantees the execution time. With all other processors, all bets are off.
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Online gmb42

  • Frequent Contributor
  • **
  • Posts: 294
  • Country: gb
In a post here, Red Hat explains how they've improved the performance of some math functions in glibc. Eventually, I suppose, these will filter down to newlib/newlib-nano etc.
 

