Author Topic: Why is a PC so fast compared to BeagleBoard?  (Read 6445 times)


Offline Manx (Topic starter)

  • Contributor
  • Posts: 39
  • Country: pl
Why is a PC so fast compared to BeagleBoard?
« on: August 09, 2019, 06:32:05 pm »
I have some code with trigonometric functions and matrix multiplications. Floating point calculations are single precision.

When I run one calculation cycle on a 180 MHz STM32 F4, it takes 145 us.
When I run it on a 1 GHz PocketBeagle (Cortex-A8), it takes 50 us.
But when I run it on my laptop with a 1.8 GHz 3rd gen Core i5, it takes 1.8 us.

Why?? It seems fine that the Beagle is 3 times faster than the STM32. But why is the laptop about 30 times faster than the Beagle when its clock is only about two times faster??
 

Online ataradov

  • Super Contributor
  • ***
  • Posts: 11257
  • Country: us
    • Personal site
Re: Why is a PC so fast compared to BeagleBoard?
« Reply #1 on: August 09, 2019, 06:44:42 pm »
Caches primarily, then effective branch prediction and speculative execution.

Compare the disassembled code. For matrix multiplication, the compiler may have used SSE instructions on x86, which by itself would make things several times faster.
Alex
 

Offline ogden

  • Super Contributor
  • ***
  • Posts: 3731
  • Country: lv
Re: Why is a PC so fast compared to BeagleBoard?
« Reply #2 on: August 09, 2019, 06:50:00 pm »
Flash memory access alone acts like a speed brake for an MCU, not to mention that embedded ARM cores are optimized for low power, not speed. Intel CPUs are heavily performance-optimized: deep pipelines, out-of-order execution, huge instruction and data caches. Why do you think the CPU of your laptop needs 35 W of power? Just to act as a heater, or what?
 

Offline coppice

  • Super Contributor
  • ***
  • Posts: 8646
  • Country: gb
Re: Why is a PC so fast compared to BeagleBoard?
« Reply #3 on: August 09, 2019, 06:51:04 pm »
I suspect that you have something odd in the compile options for the A8, and/or the Beagleboard is not running at 1 GHz. Depending on how you set up a Beagleboard, it may default to a low clock rate and require you to take action to push the clock rate up to 1 GHz. A Core i5 will be a lot faster than an A8, but the speed ratio you have seen is too big to be representative.
 

Online ataradov

  • Super Contributor
  • ***
  • Posts: 11257
  • Country: us
    • Personal site
Re: Why is a PC so fast compared to BeagleBoard?
« Reply #4 on: August 09, 2019, 06:52:27 pm »
It really depends on the generated code. X86 floating point instruction set is much better than A8 FP ISA.
Alex
 
The following users thanked this post: JPortici

Offline ogden

  • Super Contributor
  • ***
  • Posts: 3731
  • Country: lv
Re: Why is a PC so fast compared to BeagleBoard?
« Reply #5 on: August 09, 2019, 07:10:51 pm »
I suspect that either you have something odd in the compile options for the A8, and/or the Beagleboard is not running at 1GHz.

It could be so. The STM32 results compared to the Cortex-A8 do not make sense either.
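One way to make the mismatch concrete is to convert the times from the first post into clock cycles (time in us multiplied by clock in MHz). A minimal sketch: the name `clock_cycles` is made up for illustration, and the figures are just the ones quoted in the first post.

```cpp
#include <cassert>

// Convert a measured time per calculation cycle into CPU clock cycles.
// time_us is the measured time in microseconds, clock_mhz the core clock in MHz.
// 1 us at 1 MHz is exactly one clock cycle, so cycles = us * MHz.
inline double clock_cycles(double time_us, double clock_mhz) {
    assert(time_us >= 0.0 && clock_mhz > 0.0);
    return time_us * clock_mhz;
}

// Figures quoted in the first post of this thread.
const double stm32_cycles  = clock_cycles(145.0, 180.0);   // STM32 F4, 180 MHz
const double beagle_cycles = clock_cycles(50.0, 1000.0);   // PocketBeagle A8, 1 GHz
const double laptop_cycles = clock_cycles(1.8, 1800.0);    // Core i5, 1.8 GHz
```

That works out to roughly 26,100 cycles on the STM32, 50,000 on the A8 and 3,240 on the Core i5. So the A8 appears to spend almost twice as many clock cycles as the Cortex-M4 on the same work, which is the part that looks suspicious.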
 

Offline Manx (Topic starter)

  • Contributor
  • Posts: 39
  • Country: pl
Re: Why is a PC so fast compared to BeagleBoard?
« Reply #6 on: August 09, 2019, 07:37:50 pm »
I simply use
Code: [Select]
g++ -Ofast RobotKinematicsMCU.cpp -o test
It's the same with
Code: [Select]
g++ -Ofast -mcpu=cortex-a8 -mfloat-abi=hard -mfpu=neon  RobotKinematicsMCU.cpp -o test
When I check cpu frequency with
Code: [Select]
cpufreq-info
the governor is set to "performance" and 1 GHz is shown to be running all the time.

Basic question: How can I see disassembled code?
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9890
  • Country: us
Re: Why is a PC so fast compared to BeagleBoard?
« Reply #7 on: August 09, 2019, 07:38:37 pm »
It's even simpler than the explanations above:  Power - how much heat can you dissipate.  To switch in zero time takes infinite energy and, as switching speed increases, so does power.  Power also increases with complexity.  The more logic you have switching, the more power you need to deal with.

I bought a Raspberry Pi 4 - it's a pretty nice little machine.  I don't expect it to compete with my 4+ GHz I7-7700k water cooled tower.  There's a reason the Pi wall wart is 5V 3A (15 Watts) and my tower is around 500W.   That tower had better be faster than the Pi!

The speed power product (a measure of gate switching delay and gate power dissipation) is improving all the time.  I recall the VP of a company I worked for telling me to not bet against technology.  This was in the mainframe era and today my tower is faster and has more memory.

Yes, the architectural features matter but they all point toward increasing power dissipation.
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9890
  • Country: us
Re: Why is a PC so fast compared to BeagleBoard?
« Reply #8 on: August 09, 2019, 07:46:00 pm »
Basic question: How can I see disassembled code?

gcc -S ...

or study up on

objdump -d

 

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14471
  • Country: fr
Re: Why is a PC so fast compared to BeagleBoard?
« Reply #9 on: August 09, 2019, 09:12:29 pm »
Yeah. All good points. When doing such comparisons, you have to determine how much it is affected by memory/cache access times, and how much by the FPU itself.

And that said, I don't know much about the FPU of a Cortex A8. I would expect it not to put out as many operations per cycle on average as the FPU of a Core i5, but I may be wrong.

 

Offline Manx (Topic starter)

  • Contributor
  • Posts: 39
  • Country: pl
Re: Why is a PC so fast compared to BeagleBoard?
« Reply #10 on: August 09, 2019, 11:59:49 pm »
Thank you guys for your explanations. The main takeaway is then, that the results I got are to be expected. So I don't have to bother with moving to Beagle since the speed increase won't justify the effort. Fine with me ;) But I must say, seeing such a big difference between Beagle and a laptop really surprised me. Well, but now I know.
 

Offline james_s

  • Super Contributor
  • ***
  • Posts: 21611
  • Country: us
Re: Why is a PC so fast compared to BeagleBoard?
« Reply #11 on: August 10, 2019, 12:08:16 am »
When you say something is x times faster it sounds like you are looking only at the clock speed. There is a LOT more to the performance of a CPU than the clock speed. The number of cores, the architecture of the CPU and of the system as a whole and the peripherals around it have an enormous effect on performance. Clock speed doesn't really tell you much anymore, it only matters when comparing like to like.
 

Online IanB

  • Super Contributor
  • ***
  • Posts: 11885
  • Country: us
Re: Why is a PC so fast compared to BeagleBoard?
« Reply #12 on: August 10, 2019, 12:11:48 am »
No one has mentioned it yet, but the Intel processors in laptops have highly optimized floating point capabilities compared to simpler micros.
 

Offline ogden

  • Super Contributor
  • ***
  • Posts: 3731
  • Country: lv
Re: Why is a PC so fast compared to BeagleBoard?
« Reply #13 on: August 10, 2019, 12:27:49 am »
No one has mentioned it yet, but the Intel processors in laptops have highly optimized floating point capabilities compared to simpler micros.

In what sense does Intel have highly optimized floating point capabilities compared to simpler micros (ARM Cortex A8)? BTW the mentioned ARM has NEON SIMD tech, which allows up to 4 x 32-bit parallel float operations. There's an open source library: http://projectne10.github.io/Ne10/
 

Offline rsjsouza

  • Super Contributor
  • ***
  • Posts: 5986
  • Country: us
  • Eternally curious
    • Vbe - vídeo blog eletrônico
Re: Why is a PC so fast compared to BeagleBoard?
« Reply #14 on: August 10, 2019, 12:34:41 am »
Yeah. All good points. When doing such comparisons, you have to determine how much it is affected by memory/cache access times, and how much by the FPU itself.

And that said, I don't know much about the FPU of a Cortex A8. I would expect it not to put out as many operations per cycle on average than the FPU of a Core i5, but I may be wrong.
Assuming the code is actually optimized and using hardware floating point instructions (and not emulated code), there's another aspect: the FPU on an A8 shares its registers with its matrix math engine (NEON), thus reducing the performance of matrix operations that also use float. If you run the same code on an A9 or A15 (BeagleBoard X15) you should see a much bigger advantage.

There was a wiki article at TI that talked about that. It is an inherent feature of the A8 cores, not only TI (IIRC).

(edit) I found the article. It is slightly different than I recall - the precision is limited with VFP+NEON
http://processors.wiki.ti.com/index.php/Using_NEON_and_VFPv3_on_Cortex-A8
« Last Edit: August 10, 2019, 02:05:28 am by rsjsouza »
Vbe - vídeo blog eletrônico http://videos.vbeletronico.com

Oh, the "whys" of the datasheets... The information is there not to be an axiomatic truth, but instead each speck of data must be slowly inhaled while carefully performing a deep search inside oneself to find the true metaphysical sense...
 
The following users thanked this post: SiliconWizard

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14471
  • Country: fr
Re: Why is a PC so fast compared to BeagleBoard?
« Reply #15 on: August 10, 2019, 01:49:17 am »
Yet another point, and still related to memory access.

You will get a significant performance penalty if the memory locations of the variables/arrays you are working on are not ideally aligned for the architecture in question. Ideal alignment depends a lot on the CPU architecture and, if not explicitly defined in code (through GCC attributes, for instance), can lead to very suboptimal results that do not really reflect each CPU's capabilities.

All CPUs supporting some form of SIMD have specific alignment requirements to make the most of them.
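As a rough illustration of defining alignment explicitly, here is a minimal sketch that uses the standard C++11 `alignas` specifier instead of a GCC attribute. The `Mat4` type and the helper names are made up for this example and are not taken from the original code; 16 bytes is assumed because it matches one 128-bit register (four floats) on both SSE and NEON.

```cpp
#include <cassert>
#include <cstdint>

// A 4x4 single-precision matrix forced onto a 16-byte boundary, so each
// row of four floats lines up with one 128-bit SIMD register.
struct alignas(16) Mat4 {
    float m[4][4];
};

// Returns true if the pointer sits on a 16-byte boundary.
inline bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// out = a * b, written as plain loops. Whether the compiler turns this
// into SSE or NEON instructions depends on the target and the flags used.
inline void mat4_mul(const Mat4& a, const Mat4& b, Mat4& out) {
    assert(&out != &a && &out != &b);  // output must not alias the inputs
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k) {
                sum += a.m[i][k] * b.m[k][j];
            }
            out.m[i][j] = sum;
        }
    }
}
```

Checking `is_aligned16(&matrix)` on the real data structures is a cheap way to find out whether alignment could be part of the problem before digging into the disassembly.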
 

Offline Berni

  • Super Contributor
  • ***
  • Posts: 4953
  • Country: si
Re: Why is a PC so fast compared to BeagleBoard?
« Reply #16 on: August 10, 2019, 06:26:59 am »
Yep your beagleboard code is likely not that well optimized for the given hardware.

You need to look into the instruction set to see what that CPU likes to do, and make sure the C compiler does it that way. The NEON extensions can be very powerful for number crunching, but the C compiler will tend not to use them because of the careful memory layout needed for them to work. The compiler is likely using the standard single-cycle multiply instruction that you might find in an STM32F4. As for why it's not as fast as the clock speed suggests, that is likely because your data is stored in RAM in such a way that the CPU's data cache has trouble loading it efficiently.

When doing very repetitive math, it's also important what you have around your raw math operations: you can spend a lot of CPU time simply moving data around or shoveling it into the CPU registers to be worked on, or even just executing the for loop that is around it.

As for why Intel is so speedy, it's down to the superior parallel number crunching capabilities and its heavily pipelined execution that might push multiple instructions through the CPU core in a single clock cycle. The branch prediction might be working its magic on those repetitive loops, it has faster RAM, and it has a LOT more cache, so all the data likely fits into its cache. It's also possible the C compiler does a better job of compiling your particular code implementation onto that instruction set.

If you need the speed, there are little Intel Atom single board computers out there. They won't quite have the performance of an i5/i7 at the same clock speed, but they are still going to be pretty speedy while using a more reasonable amount of power.
 

Offline magic

  • Super Contributor
  • ***
  • Posts: 6779
  • Country: pl
Re: Why is a PC so fast compared to BeagleBoard?
« Reply #17 on: August 10, 2019, 07:21:56 am »
perf stat -d ./test

On all 3 platforms.
 

Offline Manx (Topic starter)

  • Contributor
  • Posts: 39
  • Country: pl
Re: Why is a PC so fast compared to BeagleBoard?
« Reply #18 on: August 10, 2019, 12:08:59 pm »
Not much luck with perf. It says it won't work on my laptop because of the kernel version, and on Beagle it shows

Code: [Select]
Performance counter stats for './test':

      48877.972638      task-clock:u (msec)       #    0.997 CPUs utilized         
                 0      context-switches:u        #    0.000 K/sec                 
                 0      cpu-migrations:u          #    0.000 K/sec                 
                77      page-faults:u             #    0.002 K/sec                 
   <not supported>      cycles:u                                                   
   <not supported>      instructions:u                                             
   <not supported>      branches:u                                                 
   <not supported>      branch-misses:u                                             
   <not supported>      L1-dcache-loads:u                                           
   <not supported>      L1-dcache-load-misses:u                                     
   <not supported>      LLC-loads:u                                                 
   <not supported>      LLC-load-misses:u                                           

      49.024961877 seconds time elapsed

The code is indeed not optimized for any specific platform. But a big part of the execution time, as I checked with pin toggling on STM32, is just spent on standard trigonometric functions. And I think that -Ofast does loop unrolling.
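For cross-checking the trig share on the Linux targets, where pin toggling is less convenient, a small timing loop is one option. This is only a sketch: `time_sin` and `TrigTiming` are hypothetical names, the argument step of 0.001 is arbitrary, and the numbers it gives depend heavily on compiler flags (with -Ofast the compiler is free to transform the loop).

```cpp
#include <cassert>
#include <chrono>
#include <cmath>

// Result of one timing run: average time per call and a checksum.
struct TrigTiming {
    double ns_per_call;  // average wall-clock time per sine call, in ns
    float  checksum;     // sum of all results, returned so the loop has a visible effect
};

// Times `iterations` single-precision sine calls with a monotonic clock.
inline TrigTiming time_sin(int iterations) {
    assert(iterations > 0);
    using clock = std::chrono::steady_clock;
    float x = 0.0f;
    float sum = 0.0f;
    const auto start = clock::now();
    for (int i = 0; i < iterations; ++i) {
        sum += std::sin(x);  // float overload of std::sin
        x += 0.001f;
    }
    const auto stop = clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return {elapsed.count() / iterations, sum};
}
```

Calling something like `time_sin(1000000)` on the Beagle and on the laptop and comparing `ns_per_call` should show whether the standard library's trig functions account for most of the gap.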

I quickly checked Intel Atom on google and it's probably a great thing, but I'm looking for something for which I can design my own board. Atom boards certainly don't look like anything close to that ;)
« Last Edit: August 10, 2019, 12:21:42 pm by Manx »
 

Offline splin

  • Frequent Contributor
  • **
  • Posts: 999
  • Country: gb
Re: Why is a PC so fast compared to BeagleBoard?
« Reply #19 on: August 10, 2019, 01:39:20 pm »
Take a look at http://gruntthepeon.free.fr/ssemath/neon_mathfun.html

comparing vector trigonometric performance  on ARM NEON and Intel Atom using their own optimized trig code. Fundamentally not much difference in cycles per operation.

They also tested their SSE2 implementation here:

http://gruntthepeon.free.fr/ssemath/
 

Offline rsjsouza

  • Super Contributor
  • ***
  • Posts: 5986
  • Country: us
  • Eternally curious
    • Vbe - vídeo blog eletrônico
Re: Why is a PC so fast compared to BeagleBoard?
« Reply #20 on: August 10, 2019, 05:29:18 pm »
Take a look at http://gruntthepeon.free.fr/ssemath/neon_mathfun.html

comparing vector trigonometric performance  on ARM NEON and Intel Atom using their own optimized trig code. Fundamentally not much difference in cycles per operation.

They also tested their SSE2 implementation here:

http://gruntthepeon.free.fr/ssemath/
Thanks for sending this, although the Pandaboard uses a dual Cortex A9 (OMAP4430), which does not have the same bottleneck as the A8.

Having the code, it is easy to migrate to a BeagleBone or X15 - I might do that next week just for kicks.
 

Online NiHaoMike

  • Super Contributor
  • ***
  • Posts: 9016
  • Country: us
  • "Don't turn it on - Take it apart!"
    • Facebook Page
Re: Why is a PC so fast compared to BeagleBoard?
« Reply #21 on: August 10, 2019, 07:42:15 pm »
The Beaglebone is also quite old, so it's not going to perform particularly stellar even compared to other embedded Linux platforms like Raspberry Pi. It's still pretty unique in its capabilities, however, with its support for simultaneous USB host and device (not as big of a deal now with the Pi 4 also supporting that) and the I/O accelerator that, among other things, can be used as a logic analyzer.
Cryptocurrency has taught me to love math and at the same time be baffled by it.

Cryptocurrency lesson 0: Altcoins and Bitcoin are not the same thing.
 

Offline magic

  • Super Contributor
  • ***
  • Posts: 6779
  • Country: pl
Re: Why is a PC so fast compared to BeagleBoard?
« Reply #22 on: August 10, 2019, 08:22:32 pm »
No much luck with perf.
That's indeed very little luck :)

The code is indeed not optimized for any specific platform. But a big part of the execution time, as I checked with pin toggling on STM32, is just spent on standard trigonometric functions. And I think that -Ofast does loop unrolling.
AFAIK trig functions are implemented by polynomial approximation so strictly CPU bound. Rather out of luck again, unless your standard library is completely inefficient garbage.
Perhaps some faster and slightly less accurate implementations can be found. Or find the relevant polynomials and roll your own, using SIMD to compute the sine of four numbers in parallel. These days they could frankly provide such stuff in standard libraries.
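To give an idea of what rolling your own could look like, here is a minimal sketch of a polynomial sine. It is not what any particular standard library does: it uses a plain 7th-order Taylor polynomial rather than properly fitted minimax coefficients, so after range reduction the error should be roughly 2e-4 at worst. `fast_sin` and `fast_sin_n` are made-up names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Approximate sin(x) for single-precision x of moderate magnitude.
inline float fast_sin(float x) {
    const float pi = 3.14159265358979f;
    const float two_pi = 2.0f * pi;
    // Range reduction: bring x into [-pi, pi].
    x -= two_pi * std::round(x / two_pi);
    // Fold into [-pi/2, pi/2] using sin(pi - x) = sin(x).
    if (x > 0.5f * pi) {
        x = pi - x;
    } else if (x < -0.5f * pi) {
        x = -pi - x;
    }
    // Odd Taylor polynomial x - x^3/6 + x^5/120 - x^7/5040 in Horner form.
    const float x2 = x * x;
    return x * (1.0f + x2 * (-1.0f / 6.0f
                 + x2 * (1.0f / 120.0f
                 + x2 * (-1.0f / 5040.0f))));
}

// Applies fast_sin to n values. A loop over independent elements like this
// is the shape a SIMD version would process four lanes at a time.
inline void fast_sin_n(const float* in, float* out, std::size_t n) {
    assert(in != nullptr && out != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = fast_sin(in[i]);
    }
}
```

Whether that accuracy is good enough depends on the application, so it is worth comparing the results against the standard library over the actual input range before swapping it in.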
« Last Edit: August 10, 2019, 08:24:07 pm by magic »
 

Offline rsjsouza

  • Super Contributor
  • ***
  • Posts: 5986
  • Country: us
  • Eternally curious
    • Vbe - vídeo blog eletrônico
Re: Why is a PC so fast compared to BeagleBoard?
« Reply #23 on: August 10, 2019, 10:15:43 pm »
The Beaglebone is also quite old, so it's not going to perform particularly stellar even compared to other embedded Linux platforms like Raspberry Pi.
I think the Pi4 is the only one that has a performance edge - the Pi3 versions are quite comparable in speed, no?
 

Online NiHaoMike

  • Super Contributor
  • ***
  • Posts: 9016
  • Country: us
  • "Don't turn it on - Take it apart!"
    • Facebook Page
Re: Why is a PC so fast compared to BeagleBoard?
« Reply #24 on: August 10, 2019, 10:33:15 pm »
I think the Pi4 is the only one that has a performance edge - the Pi3 versions are quite comparable in speed, no?
The Beaglebone is a single core A8 while the Pi 3 is a quad core A53. What the Beaglebone has that the Pi doesn't is a pair of I/O accelerators that are essentially microcontrollers with fast access to the GPIO and RAM. That's what is used for the logic analyzer functionality.

The Odroid N2 is in a similar class as the Pi 4. Then there's the Jetson Nano, basically a lower end version of the Shield TV.
 

